Remote LinkedIn

Senior Site Reliability Engineer (SRE) - (GCP)

Devsu • Brazil • 33 applications • 2 days ago

Estimated salary

R$ 11k - 17k/month

Senior CLT

90%

Curation score

Internal indicator from 0 to 100: salary transparency, stack, useful description, and signs of posting quality. It is not a match against your CV.

Stack

Kubernetes Python GCP AI

Job description

Aggregated text for quick reading. Always check the original source when submitting your application.

We are seeking a Site Reliability Engineer (SRE) with deep expertise in monitoring, observability, and reliability engineering to support systems running across on-premises infrastructure and Google Cloud Platform (GCP).

This role is primarily responsible for designing, operating, and improving monitoring, alerting, and observability platforms, with a strong focus on Grafana and Kubernetes environments.

As a secondary responsibility, this role provides backup coverage for the Application Support team during periods of resource constraints or major incidents, offering L2/L3 technical support when required.

Responsibilities

Monitoring & Observability (Core Focus)

Own and operate the monitoring and observability stack across on-prem and GCP environments
Design, build, and maintain Grafana dashboards for infrastructure, Kubernetes, and applications
Define, tune, and maintain alerts to ensure a high signal-to-noise ratio
Establish observability standards and best practices across teams
Improve visibility into system health, performance, and reliability

Site Reliability Engineering

Apply SRE principles to improve availability, performance, and resilience
Define and track SLIs, SLOs, and error budgets
Participate in on-call rotations and SEV incident response
Lead or contribute to incident investigations and root cause analysis (RCA)
Drive preventative actions to reduce repeat incidents

Kubernetes & Platform Reliability

Support and monitor Kubernetes environments (GKE and on-prem clusters)
Monitor cluster health, capacity, and resource utilization
Troubleshoot platform-level issues impacting application reliability
Collaborate with Platform and Engineering teams on reliability improvements

Secondary Responsibilities (Backup Application Support)

These responsibilities are activated as needed and are not part of day-to-day operations
Provide L2/L3 application support coverage during support team resource shortages, high-severity incidents (SEVs), and peak support periods or escalations
Triage and troubleshoot application issues using existing runbooks and dashboards
Collaborate with Application Support and Engineering teams during incidents
Ensure all actions, findings, and resolutions are documented in ServiceNow (SNOW)

Requirements

Strong experience as a Site Reliability Engineer or Reliability Engineer
Deep hands-on expertise with Grafana (dashboards, alerting, troubleshooting)
Solid experience with monitoring and observability systems
Production experience operating Kubernetes environments
Experience supporting systems in GCP and on-prem environments (mandatory)
Strong Linux systems and troubleshooting skills
Fluent English (written and spoken)
Ability to work in the PST time zone
Ability to participate in an on-call rotation that includes coverage for one weekend day. Time worked during the weekend is compensated with one day off during the week, in accordance with the established work schedule

Technology Stack:

Observability: Grafana, Prometheus, logging platforms
Containers: Kubernetes (GKE and on-prem)
Cloud: Google Cloud Platform (GCP)
Operations: Linux, networking, infrastructure monitoring
Incident Tools: PagerDuty, ServiceNow, Slack (or equivalents)

Nice to have:

Experience supporting application teams during SEV incidents
Knowledge of capacity planning and performance tuning
Scripting skills (Python, Bash, etc.)
Experience with hybrid infrastructure environments

Benefits

At Devsu, we believe in creating an environment where you can thrive both personally and professionally. By joining our team, you'll enjoy:

A stable, long-term contract with opportunities for career growth
Private health insurance
A remote-friendly culture that promotes work-life balance
Continuous training, mentorship, and learning programs to keep you at the forefront of the industry
Free access to AI training resources and state-of-the-art AI tools to elevate your daily work
A flexible Paid Time Off (PTO) policy as well as paid holidays
Challenging, world-class software projects for clients in the US and LatAm
Collaboration with some of the most talented software engineers in Latin America and the US, in a diverse work environment

Join Devsu and discover a workplace that values your growth, supports your well-being, and empowers you to make a global impact.
