Remote LinkedIn

Lead DevOps Engineer

TELUS Digital Greater Porto Alegre 25 applications Yesterday

Estimated salary

R$ 13k - 20k/month

Senior CLT
64%

Curation score

Internal indicator from 0 to 100: salary transparency, stack, useful description, and signs of listing quality. It is not a match against your CV.

Job description

Aggregated text for quick reading. Always check the original source when submitting your application.

Lead DevOps Engineer, Site Reliability

Who We Are

Welcome to TELUS Digital — where innovation drives impact at a global scale. As an award-winning digital product consultancy and the digital division of TELUS, one of Canada’s largest telecommunications providers, we design and deliver transformative customer experiences through cutting-edge technology, agile thinking, and a people-first culture.

With a global team across North America, South America, Central America, Europe, Africa, and APAC, we offer end-to-end expertise across eight core service areas: Digital Product Consulting, Digital Marketing Services, Data & AI, Strategy Consulting, Business Operations Modernization, Enterprise Applications, Cloud Engineering, and QA & Test Engineering.

From mobile apps and websites to voice UI, chatbots, AI, customer service, and in-store solutions, TELUS Digital enables seamless, trusted, and digitally powered experiences that meet customers wherever they are — all backed by the secure infrastructure and scale of our multi-billion-dollar parent company.

Location and Flexibility

This role can be fully remote for candidates based in Brazil; the Brazil requirement reflects team distribution and occasional in-person opportunities. If you are based in São Paulo or Porto Alegre, you are welcome to work from one of our offices on a flexible schedule.

About The Role

Our CXAI Platform powers a portfolio of Generative AI products deployed into enterprise contact centers and BPO operations, environments where downtime, latency, or silent model degradation translate directly to commercial impact. As a Staff DevOps Engineer, Site Reliability, you'll lead the architecture and maintenance of the infrastructure and reliability practices that keep AI-powered systems performant, observable, and trustworthy under real production load, including redundancy, latency, and cost management.

This is a staff-level individual contributor role with a broad mandate. You'll set technical standards across the platform, partner directly with product and engineering leadership, and have real ownership over how reliability shapes the roadmap.

What You'll Own

  • Platform reliability strategy: help define SLOs/SLIs for AI-powered services, including latency and quality SLOs for LLM inference paths, and build the error-budget discipline that lets product teams ship fast without breaking trust.
  • Cloud architecture on GCP: design scalable, secure infrastructure for distributed AI services, event-driven workloads, and multi-LLM-provider integrations
  • Observability for non-deterministic systems: build metrics, tracing, and alerting that surface not just "is it up" but "is it behaving correctly" for LLM-powered features (drift, regression, hallucination rates, tool-call failures)
  • Resilience engineering: circuit breakers, graceful degradation, multi-provider failover, and chaos/fault-injection practices for AI inference paths
  • Infrastructure-as-code and automation: Terraform-first, automated everything, no toil tolerated
  • Production readiness: define and enforce PRR-style standards across teams launching new AI products and features
  • Technical leadership: mentor engineers, drive architecture reviews, and shape the broader engineering culture around reliability

What You Bring

  • Significant infrastructure engineering experience combining DevOps and SRE disciplines at scale
  • Deep GCP expertise (AWS a strong plus); relevant cloud certifications welcome
  • Production experience with SRE fundamentals: SLO/SLI design, error budgets, toil reduction, blameless incident review
  • Strong background in distributed systems failure modes and resilience patterns
  • Expert-level infrastructure-as-code (Terraform), container orchestration (Kubernetes), and CI/CD
  • Hands-on with modern observability stacks (e.g., OpenTelemetry, Sentry) and AI-specific observability tooling (Arize, LangSmith, Braintrust, or similar)
  • Experience with Apigee for API management and with Cloud Run
  • Comfort working across Python, JavaScript, and Bash for infra tooling
  • Strong spoken and written communication in English with teams and stakeholders

Bonus Points

  • Production experience with LLM-provider integrations (OpenAI, Anthropic, Google, Azure OpenAI) and the reliability quirks of inference at scale, such as rate limits, latency tails, provider failover, and cost controls
  • Experience with event-driven architecture (Pub/Sub, Kafka, EventBridge)
  • Understanding of chaos engineering practices (Litmus, Gremlin, or homegrown equivalents)
  • One or more GCP certifications, such as Cloud Architect, Cloud DevOps Engineer, or equivalent

Why This Role

You will have a clear technical mandate, direct partnership with product and engineering leadership, and real ownership over infrastructure that powers AI workloads in production. Reliability at this scale is not a support function; it is a first-class engineering discipline with direct commercial impact.

If you want to define how cloud infrastructure and site reliability engineering work together for a suite of AI-powered products at a critical growth stage, this is it.

Equal Opportunity Employer

At TELUS Digital, we are proud to be an equal opportunity employer and are committed to creating a diverse and inclusive workplace. All aspects of employment, including the decision to hire and promote, are based on applicants’ qualifications, merits, competence and performance without regard to any characteristic related to diversity.

We will only use the information you provide to process your application and to produce tracking statistics. Since we do not request personal data deemed sensitive, we ask you to abstain from sharing that information with us.

For more information on how we use your information, see our Privacy Policy.
