Aggregated text for quick reading. Always check the original source before submitting your application.
This position is listed on behalf of a partner company, which manages all applications and next steps. Our partner is looking for a Site Reliability Engineer Specialist based in Brazil.
This role is a senior technical leadership opportunity focused on defining and elevating reliability practices across a complex, distributed cloud-native platform. You will be responsible for shaping observability, incident response, and SRE standards across large-scale systems running in Kubernetes (GKE) and supported by a modern microservices ecosystem. The environment includes critical components such as messaging, databases, API gateways, and logging pipelines, requiring deep systems thinking and strong operational discipline. This is a highly influential individual contributor position, where you will set the benchmark for SRE excellence, drive SLO adoption, and reduce operational toil at scale. You will also play a key role in major incident management and postmortem culture. The role offers strong cross-team visibility and the opportunity to shape how reliability engineering is practiced across the entire platform.
Accountabilities
- Define and own the technical strategy for observability across the platform, including metrics, logs, and distributed tracing using tools such as OpenTelemetry and Dash0.
- Establish and evolve SLIs, SLOs, and error budgets, ensuring they drive engineering and product decision-making.
- Lead major incident response efforts as incident commander, ensuring structured resolution and blameless postmortems with actionable outcomes.
- Improve on-call practices by reducing alert noise, minimizing toil, and building a sustainable operational model.
- Influence and support architectural decisions across distributed systems including GKE, Kong, RabbitMQ, PostgreSQL, MongoDB Atlas, Redis, and MinIO.
- Mentor SRE and platform engineers, raising the overall maturity of reliability engineering practices across teams.
- Drive adoption of observability and reliability best practices across Java and Node.js services in production.
Requirements
- 8+ years of experience in SRE, infrastructure, or platform engineering, with senior or specialist-level exposure to large-scale production environments.
- Strong hands-on experience with Kubernetes (preferably GKE), including debugging and operating production workloads.
- Deep expertise in observability systems (OpenTelemetry, Prometheus, centralized logging such as Elasticsearch, Logstash, Fluent Bit).
- Experience defining and operationalizing SLIs, SLOs, and error budgets in real-world environments.
- Strong background in incident management, including leading high-severity incidents and postmortem processes.
- Experience operating distributed stateful systems such as PostgreSQL, MongoDB Atlas, Redis, RabbitMQ, or object storage (S3/MinIO).
- Production experience with Java services (JVM tuning, performance troubleshooting) and familiarity with Node.js environments.
- Proven ability to influence engineering teams and mentor senior engineers without formal authority.
- Strong communication skills in English and Portuguese, with experience working in distributed, cross-functional teams.
Nice To Have
- Experience with iPaaS or multi-tenant distributed platforms.
- Knowledge of Kong API Gateway, Apache Camel, or similar integration technologies.
- Experience with GitOps tools such as FluxCD or GitLab CI.
- Exposure to Chaos Engineering or Production Readiness Review frameworks.
- CNCF or cloud certifications (CKA, CKS, GCP Professional certifications).
- Contributions to open-source observability or Kubernetes ecosystems.
Benefits
- Health and dental care coverage.
- Monthly flexible benefits via Caju card (R$ 1,400, covering food, mobility, home office, wellness, and education).
- Life insurance.
- Childcare assistance.
- Equity (RSUs).
- Gympass partnership for wellness and fitness.
- English classes at a subsidized group rate.
- Collaborative and flexible remote-first work environment.
- Strong engineering culture focused on learning, autonomy, and impact.
How Jobgether Works
We use an AI-powered matching process to ensure your application is reviewed quickly, objectively, and fairly against the role's core requirements. Our system identifies the top-fitting candidates, and this shortlist is then shared directly with the hiring company. The final decision and next steps (interviews, assessments) are managed by their internal team.
We appreciate your interest and wish you the best!
Data Privacy Notice: By submitting your application, you acknowledge that Jobgether will process your personal data to evaluate your candidacy and share relevant information with the hiring employer. This processing is based on legitimate interest and pre-contractual measures under applicable data protection laws (including GDPR). You may exercise your rights (access, rectification, erasure, objection) at any time.
We may use artificial intelligence (AI) tools to support parts of the hiring process, such as reviewing applications, analyzing resumes, assessing responses, and identifying potential inconsistencies or verification signals in application materials. These tools assist our recruitment team but do not replace human judgment. Final hiring decisions are made by humans. If you would like more information about how your data is processed, please contact us.