Remote LinkedIn

Site Reliability Engineer - AI Agents

Jobgether • Brazil • 25 applications • Yesterday

Estimated salary

R$ 7k - 10k/month

Mid-level CLT

49%

Curation score

Internal indicator from 0 to 100: salary transparency, stack, useful description, and listing quality signals. It is not a match against your CV.

Stack

Kubernetes Python Docker AWS AI

Job description

Aggregated text for quick reading. Always check the original source when submitting your application.

This position is listed on behalf of a partner company, which manages all applications and next steps. Our partner is looking for a Site Reliability Engineer - AI Agents based in Brazil.

This role sits at the intersection of platform engineering, site reliability, and applied AI, focusing on the systems that power production-grade AI agents at scale. You will help design, operate, and evolve the infrastructure that enables orchestration, execution, and serving of AI-driven workflows across internal tools and external-facing products. The environment is fast-moving and highly technical, requiring strong production discipline applied to emerging AI technologies. You will work closely with data, ML, and engineering teams to ensure reliability, observability, and scalability of agentic systems. Beyond operations, the role emphasizes building developer-facing platforms, APIs, and SDKs that make AI infrastructure accessible and reusable across teams. This is a high-impact opportunity to shape foundational systems for next-generation AI agent platforms in a globally distributed organization.

Accountabilities

You will be responsible for building and operating the infrastructure backbone that supports AI agent systems in production, ensuring reliability, scalability, and usability across engineering teams.

Design, build, and operate cloud-native infrastructure supporting AI agent workflows, including orchestration, execution, and model serving
Ensure high reliability, scalability, and observability of distributed agentic systems across internal and external products
Develop platform capabilities such as APIs, SDKs, and self-service tools to enable efficient consumption of AI infrastructure
Manage compute, deployment, and serving infrastructure for AI and ML workloads in production environments
Build and maintain CI/CD pipelines enabling safe, reliable, and rapid deployment of AI services and agent workflows
Implement Infrastructure as Code using tools such as Terraform to provision and manage AWS-based environments
Design and operate observability systems, including monitoring, alerting, and incident response tailored to AI/ML workloads
Define reliability patterns, failure handling mechanisms, and recovery strategies for LLM and agent-based systems
Collaborate with AI, Data Engineering, and Product teams to transition experimental prototypes into production-grade systems
Manage Kubernetes-based container orchestration environments to ensure efficient scaling and deployment of services
Implement security controls and access management best practices across infrastructure layers
Document system architecture, operational procedures, and best practices to support platform adoption and knowledge sharing

Requirements

The ideal candidate is a strong infrastructure or SRE engineer with platform engineering experience and exposure to ML or AI-driven systems in production.

5+ years of experience in Site Reliability Engineering, Platform Engineering, Infrastructure Engineering, or similar roles
Hands-on experience supporting ML infrastructure, model serving, or MLOps pipelines in production environments
Experience building developer platforms, internal tooling, APIs, or SDKs used at scale by engineering teams
Strong understanding of platform engineering principles, including self-service infrastructure and developer experience design
Proficiency with Infrastructure as Code tools, particularly Terraform
Strong experience with Kubernetes and containerized environments (Docker)
Solid cloud infrastructure experience, preferably AWS
Strong scripting skills (bash/shell) and proficiency in at least one programming language (Python preferred)
Experience designing and operating observability, monitoring, and alerting systems
Experience with incident response, on-call rotations, and production reliability ownership
Strong collaboration skills across data, AI, and engineering organizations
High ownership mindset and ability to operate in fast-paced, high-stakes production environments
Familiarity with AI agent systems, LLM-based applications, or orchestration frameworks is a strong plus

Benefits

Competitive compensation package with performance-based incentives
Fully remote working model across eligible countries, including Brazil
Comprehensive healthcare coverage (medical, dental, and vision where applicable)
Retirement savings programs with employer contributions (where applicable)
Flexible PTO policy and paid company holidays
Mental health and wellness support programs
Learning and development budget for professional and technical growth
Opportunity to work on cutting-edge AI agent infrastructure at global scale
Distributed, high-ownership engineering culture with strong collaboration across teams
Exposure to advanced platform engineering and applied AI systems

How Jobgether Works

We use an AI-powered matching process to ensure your application is reviewed quickly, objectively, and fairly against the role's core requirements. Our system identifies the top-fitting candidates, and this shortlist is then shared directly with the hiring company. The final decision and next steps (interviews, assessments) are managed by their internal team.

We appreciate your interest and wish you the best!

Why Apply Through Jobgether?

Data Privacy Notice: By submitting your application, you acknowledge that Jobgether will process your personal data to evaluate your candidacy and share relevant information with the hiring employer. This processing is based on legitimate interest and pre-contractual measures under applicable data protection laws (including GDPR). You may exercise your rights (access, rectification, erasure, objection) at any time.

We may use artificial intelligence (AI) tools to support parts of the hiring process, such as reviewing applications, analyzing resumes, assessing responses, and identifying potential inconsistencies or verification signals in application materials based on available information. These tools assist our recruitment team but do not replace human judgment. Final hiring decisions are ultimately made by humans. If you would like more information about how your data is processed, please contact us.
