Senior MLOps Engineer - SRE | DevOps

Jobgether • Brazil

Remote

AI Summary

This role is responsible for ensuring machine learning and LLM systems run reliably, efficiently, and at scale in production environments. The Senior MLOps Engineer will operate at the intersection of infrastructure, DevOps, and machine learning, owning the lifecycle from model deployment to production-grade inference services. The ideal candidate will have 5+ years of experience in Platform Engineering, SRE, DevOps, or MLOps roles, operating production systems at scale.

Key Responsibilities

Design, build, and operate scalable ML and inference infrastructure supporting real-time and batch workloads across multiple tenants

Own the end-to-end ML deployment lifecycle, including model registry, versioning, rollout strategies, and safe rollback mechanisms

Operate and optimize production-grade AI and LLM workloads, managing inference providers, throttling, quotas, and fallback strategies under load

Technical Skills Required

Terraform, GitOps, Kubernetes, AWS, GPU scheduling, Infrastructure-as-Code

Benefits & Perks

Fully remote work model with flexibility

Opportunity to work on cutting-edge AI and ML infrastructure at scale

High ownership environment with direct impact on platform architecture and evolution

Nice to Have

Experience with GPU/accelerator scheduling and node lifecycle management

Experience operating LLM inference systems at scale

Experience with ML orchestration tools such as Argo Workflows, Kubeflow, Airflow, or SageMaker Pipelines

Job Description

This position is listed on behalf of a partner company, which manages all applications and next steps. Our partner is looking for a Senior MLOps Engineer - SRE | DevOps based in Brazil.

This role sits at the core of a modern AI/ML platform, responsible for ensuring that machine learning and LLM systems run reliably, efficiently, and at scale in production environments.

You will operate at the intersection of infrastructure, DevOps, and machine learning, owning the lifecycle from model deployment to production-grade inference services.

The position involves solving complex engineering challenges such as latency optimization, autoscaling, cost efficiency, and reliability of AI workloads.

You will help define how ML systems are built and operated, introducing strong SRE practices to environments that demand high availability and performance.

A key part of your work will involve designing scalable, automated, and observable ML pipelines and infrastructure on cloud-native platforms.

This is a high-impact role for a senior engineer who thrives in deep technical ownership and wants to shape the future of AI infrastructure.

Accountabilities

Design, build, and operate scalable ML and inference infrastructure supporting real-time and batch workloads across multiple tenants.
Own the end-to-end ML deployment lifecycle, including model registry, versioning, rollout strategies (canary, A/B, shadow), and safe rollback mechanisms.
Operate and optimize production-grade AI and LLM workloads, managing inference providers, throttling, quotas, and fallback strategies under load.
Develop and maintain reproducible ML pipelines for training, evaluation, and deployment with full lineage and automation.
Implement Infrastructure-as-Code practices using Terraform, ensuring scalable multi-account cloud architectures.
Manage GitOps workflows using tools such as ArgoCD to ensure reliable and consistent deployments across environments.
Operate Kubernetes-based infrastructure (AWS EKS), including GPU scheduling, workload isolation, and cost-aware scaling strategies.
Define and enforce SRE best practices, including SLOs, observability, incident response, and performance monitoring for ML systems.
Drive cost optimization initiatives across ML workloads, including resource right-sizing and efficient infrastructure utilization.
Improve automation across the ML lifecycle using modern engineering and agentic coding tools.

Requirements

5+ years of experience in Platform Engineering, SRE, DevOps, or MLOps roles, operating production systems at scale.
Strong hands-on experience deploying and managing ML/AI workloads in production environments.
Deep SRE expertise, including SLO definition, incident response, postmortems, and reliability engineering practices.
Advanced experience with Infrastructure-as-Code using Terraform in complex, multi-account environments.
Strong GitOps experience with declarative infrastructure and deployment workflows.
Deep expertise in Kubernetes, including production operations and failure-mode troubleshooting.
Strong AWS knowledge, including networking, IAM, compute, storage, and distributed architectures.
Experience building CI/CD pipelines using tools such as GitHub Actions, GitLab CI, CircleCI, or similar.
Strong automation mindset with ability to eliminate manual operational work through engineering solutions.
Familiarity with agentic coding tools and ability to use them effectively in infrastructure and pipeline development.
Strong communication skills to articulate technical decisions, trade-offs, and incident analysis clearly.

Nice To Have

Experience with GPU/accelerator scheduling and node lifecycle management (e.g., Karpenter).
Experience operating LLM inference systems at scale, including quota management, caching, and guardrails (e.g., AWS Bedrock or similar).
Experience with ML orchestration tools such as Argo Workflows, Kubeflow, Airflow, or SageMaker Pipelines.
Familiarity with ML observability tools, drift detection, and model monitoring practices.
Background in FinOps and cost attribution for large-scale inference systems.
Experience with multi-tenant infrastructure and isolation strategies.
Exposure to feature stores, model registries, and experiment tracking tools such as MLflow or Feast.
Experience scaling ML platforms in high-growth or startup-to-enterprise environments.

Benefits

Fully remote work model with flexibility.
Opportunity to work on cutting-edge AI and ML infrastructure at scale.
High ownership environment with direct impact on platform architecture and evolution.
Exposure to modern cloud-native technologies, Kubernetes, and distributed systems at production scale.
Collaborative engineering culture focused on automation, reliability, and innovation.
Work aligned with global time zones (EST/PST) for structured collaboration.
Continuous technical challenges involving LLMs, ML systems, and large-scale infrastructure.
Strong emphasis on engineering autonomy and senior-level decision-making.

How Jobgether Works

We use an AI-powered matching process to ensure your application is reviewed quickly, objectively, and fairly against the role's core requirements. Our system identifies the top-fitting candidates, and this shortlist is then shared directly with the hiring company. The final decision and next steps (interviews, assessments) are managed by their internal team.

We appreciate your interest and wish you the best!

Why Apply Through Jobgether?

Data Privacy Notice: By submitting your application, you acknowledge that Jobgether will process your personal data to evaluate your candidacy and share relevant information with the hiring employer. This processing is based on legitimate interest and pre-contractual measures under applicable data protection laws (including GDPR). You may exercise your rights (access, rectification, erasure, objection) at any time.

We may use artificial intelligence (AI) tools to support parts of the hiring process, such as reviewing applications, analyzing resumes, or assessing responses. These tools assist our recruitment team but do not replace human judgment. Final hiring decisions are ultimately made by humans. If you would like more information about how your data is processed, please contact us.

Job Overview

Posted Date: Jun 13, 2026

Employment Type: Full-time

Experience Level: Associate

Location: Brazil

Category: DevOps

Company: Jobgether
