Site Reliability Engineering Manager
Estimated Salary
R$ 14.400,00 - R$ 21.600,00
Job Description
About the Team
SRE Middleware & Data (SRE M&D) enables product teams by providing secure, scalable, and automated data, middleware, and AI services on Azure.
Provisioning: Automates secure delivery of databases, queues, caches, and AI services/resources; builds self-service and guardrails.
Reliability: Ensures availability, performance, and resilience of data, middleware, and AI services in production; builds observability, auto-healing, and disaster recovery.
What You Will Do
- Lead and grow the team: Manage, mentor, and develop SRE engineers across Provisioning and Reliability.
Lead cross-team initiatives:
- Align roadmaps with Architects and partner teams; ensure adoption of architecture standards.
- Run design reviews and architecture signoffs; surface and mitigate risks and complexity early.
- Translate standards into guardrails and automation (policy-as-code, self-service) for consistent delivery.
- Apply lightweight RACI and clear escalation paths to resolve trade-offs quickly.
Drive roadmap and execution:
- Provisioning: Self-service engines, Crossplane/Terraform-based automation, policy-as-code, secured pipelines, access management, backups/restore, Azure AI resource provisioning, and quota management.
- Reliability: Observability integrations (New Relic, Azure Monitor), performance tuning, auto-healing, DR/BCP, resilience testing, failover automation, SLOs, and reliability for Azure AI services and messaging platforms (Kafka, Event Hubs, Service Bus).
Establish engineering excellence:
- Infrastructure as Code (Terraform/Crossplane) and CI/CD best practices.
- Change management with safe deploys and rollback strategies.
- Incident and problem management with blameless postmortems.
- Continuous improvement loops measured by DORA/SRE metrics.
Champion security and compliance by design:
- Policy-as-code, least-privilege access, and secrets/identity hygiene.
- Guardrails in self-service flows; auditability and evidence collection.
- Partnership with Security/Compliance for standards and reviews.
Partner with the Architects:
- Co-develop platform architecture and service standards.
- Define SLIs/SLOs and capacity/reliability patterns for core services.
- Align roadmaps and run design reviews for high-impact changes.
Own delivery outcomes:
- Navigate competing priorities across a broad platform scope — balancing reactive operational load (incidents, toil, on-call) against proactive platform investment (self-service, automation, resilience) without a clean separation between the two.
- Make and communicate prioritization decisions under ambiguity, with partial information, across teams that have conflicting urgency.
- Maintain a defensible, visible backlog that reflects real risk and business impact — not just the loudest stakeholder.
Operate a healthy on-call:
- High-quality playbooks and automation-first troubleshooting (AI-assisted).
- Actionable alerts with SLO-based paging and noise reduction.
- Regular resilience testing and post-incident hardening.
Initiatives You Will Lead:
- Self-service provisioning for databases, queues, and caches with golden configurations and policy guardrails.
- AI-assisted troubleshooting for provisioning and production incidents.
- Platform-wide observability integration for data, middleware, and AI services (New Relic, Azure Monitor).
- Automated DR runbooks and resilience/chaos testing in production.
- Performance tuning at service and query layers, including automated tuning workflows.
- Standardization of provisioning via Terraform/Crossplane for databases, messaging, and AI services.
- Governance for Azure AI services (quotas, access, safety guardrails) with clear consumption patterns for product teams.
Success Metrics
- Reduced MTTR and incident count, improved SLO attainment for data/middleware services.
- Improved lead time for change and change failure rate; increased automation coverage and reduced toil.
- Faster time to provision and a higher first-attempt success rate for self-service requests.
- Measurable improvements in cost efficiency, performance, and capacity predictability.
- Team health: engagement, growth, hiring velocity, stress levels, and retention.
- SLO attainment for AI endpoints and messaging services; reduced alert noise via improved observability (New Relic/Azure Monitor).
What You Will Bring
- Proven experience managing SRE/platform/infrastructure teams delivering production-critical services.
- Deep familiarity with Azure and the team stack: PostgreSQL, MongoDB, Cosmos DB; Redis; messaging systems such as CloudAMQP/RabbitMQ, Kafka, Event Hubs, and Service Bus.
- Strong reliability fundamentals: SLOs/SLIs, incident and problem management, capacity, DR/BCP, performance tuning.
- Solid automation background: IaC (Terraform/Crossplane), CI/CD (Azure DevOps), GitOps, policy-as-code, secrets and identity, RBAC.
- Track record of building self-service platforms and reducing toil.
- Excellent cross-functional leadership with product, security, and compliance partners.
- Experience operating Azure AI services in production (Azure OpenAI, Cognitive Services, AI Search).
- Observability experience with New Relic, Azure Monitor, and OpenTelemetry.
What You’ll Need
- Advanced English skills.
- Experience with AI-assisted operations and troubleshooting.
- Observability expertise (Prometheus/Grafana, New Relic, Azure Monitor, OpenTelemetry).
- Database performance engineering and query optimization.
- Experience in regulated environments and security frameworks.
- FinOps capabilities (cost governance, forecasting, rightsizing, quotas, budgets, chargeback/showback).
How We Work
- Collaboration first with a strong partnership between the Engineering Manager and the Architects.
- Automation by default; security and reliability are non-negotiable.
- Blameless postmortems, continuous learning, and measurable outcomes.
- Participation in an equitable on-call rotation with high-quality runbooks and automation.
Tech Environment
- Azure (AKS, managed databases, storage, networking, identity).
- Azure AI services (Azure OpenAI, Cognitive Services, AI Search).
- Azure DevOps for CI/CD.
- CloudAMQP (RabbitMQ).
- Databases: PostgreSQL, MongoDB, Cosmos DB.
- Caches: Redis.
- Queues/Brokers: Kafka, Event Hubs, Service Bus.
- Terraform/Crossplane, GitOps.
- Observability: New Relic, Azure Monitor, logs, metrics, traces, alerting workflows.