Site Reliability Engineer – AI & ML Infrastructure, Kubernetes, AWS, Terraform

Deepgram via Remote Rocketship
Remote, Mid-level, CLT, 5 days ago

Estimated Salary

R$ 12.500,00 - R$ 18.333,00

Job Description:

• Architect and maintain our core computing platform using Kubernetes on AWS and on-premise, providing a stable, scalable environment for all applications and services.
• Develop and manage our entire infrastructure using Infrastructure-as-Code (IaC) principles with Terraform, ensuring our environments are reproducible, versioned, and automated.
• Design, build, and optimize our AI/ML job scheduling and orchestration systems, integrating Slurm with our Kubernetes clusters to efficiently manage GPU resources.
• Provision, manage, and maintain our on-premise bare metal server infrastructure for high-performance GPU computing.
• Implement and manage the platform's networking (CNI, service mesh) and storage (CSI, S3) solutions to support high-throughput, low-latency workloads across hybrid environments.
• Develop a comprehensive observability stack (monitoring, logging, tracing) to ensure platform health, and create automation for operational tasks, incident response, and performance tuning.
• Collaborate with AI researchers and ML engineers to understand their infrastructure needs and build the tools and workflows that accelerate their development cycle.
• Automate the lifecycle of single-tenant, managed deployments.

Requirements:
• 5+ years of experience in Platform Engineering, DevOps, or Site Reliability Engineering (SRE)
• Proven, hands-on experience building and managing production infrastructure with Terraform
• Expert-level knowledge of Kubernetes architecture and operations in a large-scale environment
• Experience with high-performance compute (HPC) job schedulers, specifically Slurm, for managing GPU-intensive AI workloads
• Experience managing bare metal infrastructure, including server provisioning (e.g., PXE boot, MAAS), configuration, and lifecycle management
• Strong scripting and automation skills (e.g., Python, Go, Bash)

Benefits:
• Medical, dental, vision benefits
• Annual wellness stipend
• Mental health support
• Life, STD, LTD Income Insurance Plans
• Unlimited PTO
• Generous paid parental leave
• Flexible schedule
• 12 paid US company holidays
• Quarterly personal productivity stipend
• One-time stipend for home office upgrades
• 401(k) plan with company match
• Tax Savings Programs
• Learning / Education stipend
• Participation in talks and conferences
• Employee Resource Groups
• AI enablement workshops / sessions

Information

Level: Mid-level
Contract: CLT
Location: Remote
Remote: Yes
Currency: BRL
Published: 5 days ago
Source: Remote Rocketship

AI Job Analysis

The salary estimate, technology match, and requirements analysis were generated with artificial intelligence.
