Site Reliability Engineering Manager

Confidential Company, via LinkedIn
São Paulo · Senior · CLT · Posted 28 days ago · Full-time · Information Technology · IT System Data Services · 95 applicants

Estimated Salary

R$ 14.400,00 - R$ 21.600,00


Job Description

About the Team


SRE Middleware & Data (SRE M&D) enables product teams by providing secure, scalable, and automated data, middleware, and AI services on Azure.


Provisioning: Automates secure delivery of databases, queues, caches, and AI services/resources; builds self-service and guardrails.

Reliability: Ensures availability, performance, and resilience of data, middleware, and AI services in production; builds observability, auto-healing, and disaster recovery.


What You Will Do


  • Lead and grow the team: Manage, mentor, and develop SRE engineers across Provisioning and Reliability.


Lead cross-team initiatives:


  • Align roadmaps with Architects and partner teams; ensure adoption of architecture standards.


  • Run design reviews and architecture signoffs; surface and mitigate risks and complexity early.


  • Translate standards into guardrails and automation (policy-as-code, self-service) for consistent delivery.


  • Apply lightweight RACI and clear escalation paths to resolve trade-offs quickly.


Drive roadmap and execution:


  • Provisioning: Self-service engines, Crossplane/Terraform-based automation, policy-as-code, secured pipelines, access management, backups/restore, Azure AI resource provisioning, and quota management.


  • Reliability: Observability integrations (New Relic, Azure Monitor), performance tuning, auto-healing, DR/BCP, resilience testing, failover automation, SLOs, and reliability for Azure AI services and messaging platforms (Kafka, Event Hubs, Service Bus).


Establish engineering excellence:


  • Infrastructure as Code (Terraform/Crossplane) and CI/CD best practices.


  • Change management with safe deploys and rollback strategies.


  • Incident and problem management with blameless postmortems.


  • Continuous improvement loops measured by DORA/SRE metrics.


Champion security and compliance by design:


  • Policy-as-code, least-privilege access, and secrets/identity hygiene.


  • Guardrails in self-service flows; auditability and evidence collection.


  • Partnership with Security/Compliance for standards and reviews.


Partner with the Architects:


  • Co-develop platform architecture and service standards.


  • Define SLIs/SLOs and capacity/reliability patterns for core services.


  • Align roadmaps and run design reviews for high-impact changes.


Own delivery outcomes:


  • Navigate competing priorities across a broad platform scope — balancing reactive operational load (incidents, toil, on-call) against proactive platform investment (self-service, automation, resilience) without a clean separation between the two.


  • Make and communicate prioritization decisions under ambiguity, with partial information, across teams that have conflicting urgency.


  • Maintain a defensible, visible backlog that reflects real risk and business impact — not just the loudest stakeholder.


Operate a healthy on-call:


  • High-quality playbooks and automation-first troubleshooting (AI-assisted).


  • Actionable alerts with SLO-based paging and noise reduction.


  • Regular resilience testing and post-incident hardening.


Initiatives You Will Lead:


  • Self-service provisioning for databases, queues, and caches with golden configurations and policy guardrails.


  • AI-assisted troubleshooting for provisioning and production incidents.


  • Platform-wide observability integration for data, middleware, and AI services (New Relic, Azure Monitor).


  • Automated DR runbooks and resilience/chaos testing in production.


  • Performance tuning at service and query layers, including automated tuning workflows.


  • Standardization of provisioning via Terraform/Crossplane for databases, messaging, and AI services.


  • Governance for Azure AI services (quotas, access, safety guardrails) with clear consumption patterns for product teams.


Success Metrics


  • Reduced MTTR and incident count; improved SLO attainment for data/middleware services.


  • Improved lead time for change and change failure rate; increased automation coverage and reduced toil.


  • Faster time to provision and a higher first-attempt success rate for self-service requests.


  • Measurable improvements in cost efficiency, performance, and capacity predictability.


  • Team health: engagement, growth, hiring velocity, stress levels, and retention.


  • SLO attainment for AI endpoints and messaging services; reduced alert noise via improved observability (New Relic/Azure Monitor).


What You Will Bring


  • Proven experience managing SRE/platform/infrastructure teams delivering production-critical services.


  • Deep familiarity with Azure and the team stack: PostgreSQL, MongoDB, Cosmos DB; Redis; messaging systems such as CloudAMQP/RabbitMQ, Kafka, Event Hubs, and Service Bus.


  • Strong reliability fundamentals: SLOs/SLIs, incident and problem management, capacity, DR/BCP, performance tuning.


  • Solid automation background: IaC (Terraform/Crossplane), CI/CD (Azure DevOps), GitOps, policy-as-code, secrets and identity, RBAC.


  • Track record of building self-service platforms and reducing toil.


  • Excellent cross-functional leadership with product, security, and compliance partners.


  • Experience operating Azure AI services in production (Azure OpenAI, Cognitive Services, AI Search).


  • Observability experience with New Relic, Azure Monitor, and OpenTelemetry.


What You’ll Need


  • Advanced English skills.


  • Experience with AI-assisted operations and troubleshooting.


  • Observability expertise (Prometheus/Grafana, New Relic, Azure Monitor, OpenTelemetry).


  • Database performance engineering and query optimization.


  • Experience in regulated environments and security frameworks.


  • FinOps capabilities (cost governance, forecasting, rightsizing, quotas, budgets, chargeback/showback).


How We Work


  • Collaboration first with a strong partnership between the Engineering Manager and the Architects.


  • Automation by default; security and reliability are non-negotiable.


  • Blameless postmortems, continuous learning, and measurable outcomes.


  • Participation in an equitable on-call rotation with high-quality runbooks and automation.


Tech Environment


  • Azure (AKS, managed databases, storage, networking, identity).


  • Azure AI services (Azure OpenAI, Cognitive Services, AI Search).


  • Azure DevOps for CI/CD.


  • CloudAMQP (RabbitMQ).


  • Databases: PostgreSQL, MongoDB, Cosmos DB.


  • Caches: Redis.


  • Queues/Brokers: Kafka, Event Hubs, Service Bus.


  • Terraform/Crossplane, GitOps.


  • Observability: New Relic, Azure Monitor, logs, metrics, traces, alerting workflows.

Information

Level: Senior
Contract: CLT
Location: São Paulo
Remote: No
Currency: BRL
Posted: 28 days ago
Source: LinkedIn
