Site Reliability Engineering Manager

Confidential Company via LinkedIn

São Paulo | Senior | CLT | 28 days ago | Full-time | Information Technology | IT System Data Services | 95 applications

Estimated Salary

R$ 14.400,00 - R$ 21.600,00

Technologies

React, Go, PostgreSQL, MongoDB, Redis, Azure, Git, REST, AI

Job Description

About the Team

SRE Middleware & Data (SRE M&D) enables product teams by providing secure, scalable, and automated data, middleware, and AI services on Azure.

Provisioning: Automates secure delivery of databases, queues, caches, and AI services/resources; builds self-service and guardrails.

Reliability: Ensures availability, performance, and resilience of data, middleware, and AI services in production; builds observability, auto-healing, and disaster recovery.

What You Will Do

Lead and grow the team: Manage, mentor, and develop SRE engineers across Provisioning and Reliability.

Lead cross-team initiatives:

Align roadmaps with Architects and partner teams; ensure adoption of architecture standards.

Run design reviews and architecture sign-offs; surface and mitigate risks and complexity early.

Translate standards into guardrails and automation (policy-as-code, self-service) for consistent delivery.

Apply lightweight RACI and clear escalation paths to resolve trade-offs quickly.

Drive roadmap and execution:

Provisioning: Self-service engines, Crossplane/Terraform-based automation, policy-as-code, secured pipelines, access management, backup/restore, Azure AI resource provisioning, and quota management.

Reliability: Observability integrations (New Relic, Azure Monitor), performance tuning, auto-healing, DR/BCP, resilience testing, failover automation, SLOs, and reliability for Azure AI services and messaging platforms (Kafka, Event Hubs, Service Bus).

Establish engineering excellence:

Infrastructure as Code (Terraform/Crossplane) and CI/CD best practices.

Change management with safe deploys and rollback strategies.

Incident and problem management with blameless postmortems.

Continuous improvement loops measured by DORA/SRE metrics.

Champion security and compliance by design:

Policy-as-code, least-privilege access, and secrets/identity hygiene.

Guardrails in self-service flows; auditability and evidence collection.

Partnership with Security/Compliance for standards and reviews.

Partner with the Architects:

Co-develop platform architecture and service standards.

Define SLIs/SLOs and capacity/reliability patterns for core services.

Align roadmaps and run design reviews for high-impact changes.

Own delivery outcomes:

Navigate competing priorities across a broad platform scope — balancing reactive operational load (incidents, toil, on-call) against proactive platform investment (self-service, automation, resilience) without a clean separation between the two.

Make and communicate prioritization decisions under ambiguity, with partial information, across teams that have conflicting urgency.

Maintain a defensible, visible backlog that reflects real risk and business impact — not just the loudest stakeholder.

Operate a healthy on-call:

High-quality playbooks and automation-first troubleshooting (AI-assisted).

Actionable alerts with SLO-based paging and noise reduction.

Regular resilience testing and post-incident hardening.

Initiatives You Will Lead:

Self-service provisioning for databases, queues, and caches with golden configurations and policy guardrails.

AI-assisted troubleshooting for provisioning and production incidents.

Platform-wide observability integration for data, middleware, and AI services (New Relic, Azure Monitor).

Automated DR runbooks and resilience/chaos testing in production.

Performance tuning at service and query layers, including automated tuning workflows.

Standardization of provisioning via Terraform/Crossplane for databases, messaging, and AI services.

Governance for Azure AI services (quotas, access, safety guardrails) with clear consumption patterns for product teams.

Success Metrics

Reduced MTTR and incident count; improved SLO attainment for data/middleware services.

Improved lead time for change and change failure rate; increased automation coverage and reduced toil.

Faster time to provision and a higher first-attempt success rate for self-service requests.

Measurable improvements in cost efficiency, performance, and capacity predictability.

Team health: engagement, growth, hiring velocity, stress levels, and retention.

SLO attainment for AI endpoints and messaging services; reduced alert noise via improved observability (New Relic/Azure Monitor).

What You Will Bring

Proven experience managing SRE/platform/infrastructure teams delivering production-critical services.

Deep familiarity with Azure and the team stack: PostgreSQL, MongoDB, Cosmos DB; Redis; messaging systems such as CloudAMQP/RabbitMQ, Kafka, Event Hubs, and Service Bus.

Strong reliability fundamentals: SLOs/SLIs, incident and problem management, capacity, DR/BCP, performance tuning.

Solid automation background: IaC (Terraform/Crossplane), CI/CD (Azure DevOps), GitOps, policy-as-code, secrets and identity, RBAC.

Track record of building self-service platforms and reducing toil.

Excellent cross-functional leadership with product, security, and compliance partners.

Experience operating Azure AI services in production (Azure OpenAI, Cognitive Services, AI Search).

Observability experience with New Relic, Azure Monitor, and OpenTelemetry.

What You’ll Need

Advanced English skills.

Experience with AI-assisted operations and troubleshooting.

Observability expertise (Prometheus/Grafana, New Relic, Azure Monitor, OpenTelemetry).

Database performance engineering and query optimization.

Experience in regulated environments and security frameworks.

FinOps capabilities (cost governance, forecasting, rightsizing, quotas, budgets, chargeback/showback).

How We Work

Collaboration first with a strong partnership between the Engineering Manager and the Architects.

Automation by default; security and reliability are non-negotiable.

Blameless postmortems, continuous learning, and measurable outcomes.

Participation in an equitable on-call rotation with high-quality runbooks and automation.

Tech Environment

Azure (AKS, managed databases, storage, networking, identity).

Azure AI services (Azure OpenAI, Cognitive Services, AI Search).

Azure DevOps for CI/CD.

CloudAMQP (RabbitMQ).

Databases: PostgreSQL, MongoDB, Cosmos DB.

Caches: Redis.

Queues/Brokers: Kafka, Event Hubs, Service Bus.

Terraform/Crossplane, GitOps.

Observability: New Relic, Azure Monitor, logs, metrics, traces, alerting workflows.


Information

Level: Senior

Contract: CLT

Location: São Paulo

Remote: No

Currency: BRL

Published: 28 days ago

Source: LinkedIn
