Site Reliability Engineer – AI & ML Infrastructure, Kubernetes, AWS, Terraform

Deepgram via Remote Rocketship
Remote, Mid-level, CLT, 25 days ago

Estimated Salary

R$ 12,500.00 - R$ 18,333.00

Job Description:

• Architect and maintain our core computing platform using Kubernetes on AWS and on-premise, providing a stable, scalable environment for all applications and services.
• Develop and manage our entire infrastructure using Infrastructure-as-Code (IaC) principles with Terraform, ensuring our environments are reproducible, versioned, and automated.
• Design, build, and optimize our AI/ML job scheduling and orchestration systems, integrating Slurm with our Kubernetes clusters to efficiently manage GPU resources.
• Provision, manage, and maintain our on-premise bare metal server infrastructure for high-performance GPU computing.
• Implement and manage the platform's networking (CNI, service mesh) and storage (CSI, S3) solutions to support high-throughput, low-latency workloads across hybrid environments.
• Develop a comprehensive observability stack (monitoring, logging, tracing) to ensure platform health, and create automation for operational tasks, incident response, and performance tuning.
• Collaborate with AI researchers and ML engineers to understand their infrastructure needs and build the tools and workflows that accelerate their development cycle.
• Automate the life cycle of single-tenant, managed deployments.

Requirements:
• 5+ years of experience in Platform Engineering, DevOps, or Site Reliability Engineering (SRE)
• Proven, hands-on experience building and managing production infrastructure with Terraform
• Expert-level knowledge of Kubernetes architecture and operations in a large-scale environment
• Experience with high-performance compute (HPC) job schedulers, specifically Slurm, for managing GPU-intensive AI workloads
• Experience managing bare metal infrastructure, including server provisioning (e.g., PXE boot, MAAS), configuration, and lifecycle management
• Strong scripting and automation skills (e.g., Python, Go, Bash)

Benefits:
• Medical, dental, vision benefits
• Annual wellness stipend
• Mental health support
• Life, STD, LTD Income Insurance Plans
• Unlimited PTO
• Generous paid parental leave
• Flexible schedule
• 12 Paid US company holidays
• Quarterly personal productivity stipend
• One-time stipend for home office upgrades
• 401(k) plan with company match
• Tax Savings Programs
• Learning / Education stipend
• Participation in talks and conferences
• Employee Resource Groups
• AI enablement workshops / sessions


Information

Level: Mid-level
Contract: CLT
Location: Remote
Remote: Yes
Currency: BRL
Published: 25 days ago
Source: Remote Rocketship
