Senior ML Infrastructure Engineer - Large-Scale AI Platform

Jobgether • United States

Remote • Visa Sponsorship

AI Summary

Design, build, and operate GPU clusters for large-scale ML training and inference. Collaborate with ML teams to optimize performance, cost, and developer experience. Ensure security and multi-tenant access control.

Key Highlights

Large-scale ML infrastructure

Cross-cloud, on-prem, hybrid environments

Collaboration with ML teams

Optimize performance, cost, and developer experience

Key Responsibilities

Design, build, and operate GPU and accelerator infrastructure for large-scale training and inference workloads

Develop scheduling, queueing, and resource management systems

Integrate and support ML frameworks

Build and maintain high-performance storage and data pipelines

Design and optimize networking layers

Implement observability, monitoring, and failure analysis tools

Drive automation for provisioning, lifecycle management, and infrastructure configuration

Partner with ML teams to forecast capacity needs and improve developer workflows

Ensure security, isolation, and multi-tenant access control

Optimize cost efficiency across compute, storage, and networking

Technical Skills Required

Python

Systems programming (Go or C++)

Distributed systems

Benefits & Perks

Competitive salary range: $100,000 - $150,000

100% remote (within the United States)

H1B transfer support for eligible candidates

Nice to Have

Experience with RDMA/InfiniBand

FinOps for ML workloads

Open-source ML infrastructure

Job Description

This position is listed on behalf of a partner company, which manages all applications and next steps. Our partner is looking for an ML Infrastructure Engineer based in the United States.

This role focuses on building and operating the core platform that powers large-scale machine learning training and inference workloads.

You will work on GPU cluster infrastructure spanning cloud, on-prem, and hybrid environments.

The position plays a critical role in enabling efficient, reliable, and scalable AI development across multiple teams.

You will design systems for scheduling, distributed training, storage throughput, and high-performance networking.

The environment is highly technical, combining systems engineering, ML frameworks, and platform reliability at scale.

You will collaborate closely with ML researchers and engineers to optimize performance, cost, and developer experience.

This is a hands-on engineering role where impact is measured by infrastructure efficiency and production readiness of AI workloads.

Accountabilities

Design, build, and operate GPU and accelerator infrastructure for large-scale training and inference workloads across cloud, on-prem, and hybrid environments.
Develop scheduling, queueing, and resource management systems to maximize utilization of compute clusters.
Integrate and support ML frameworks such as PyTorch, JAX, DeepSpeed, FSDP, Megatron-LM, and Ray-based training workflows.
Build and maintain high-performance storage and data pipelines ensuring consistent GPU throughput.
Design and optimize networking layers including RDMA, InfiniBand, and NCCL-based communication.

Implement observability, monitoring, and failure analysis tools for distributed ML workloads.
Drive automation for provisioning, lifecycle management, and infrastructure configuration.
Partner with ML teams to forecast capacity needs and improve developer workflows and tooling.
Ensure security, isolation, and multi-tenant access control across AI infrastructure systems.
Optimize cost efficiency across compute, storage, and networking through intelligent resource management.

Requirements

Bachelor’s or Master’s degree in Computer Science or a related field.
6+ years of experience in infrastructure, platform engineering, or high-performance computing environments.
Hands-on experience operating GPU clusters or large-scale ML training systems in production.
Strong proficiency in Python and at least one systems programming language (Go or C++ preferred).
Deep understanding of distributed systems, accelerator architectures, and ML training workflows.

Experience with Kubernetes, Slurm, Ray, or similar orchestration/scheduling systems.
Strong knowledge of Linux internals, networking concepts, and high-performance storage systems.
Familiarity with at least one major cloud provider’s ML infrastructure stack.
Solid software engineering practices including testing, CI/CD, and code review workflows.
Strong communication skills and ability to collaborate across research and engineering teams.
Experience with RDMA/InfiniBand, FinOps for ML workloads, or open-source ML infrastructure is a plus.

Benefits

Competitive salary range: $100,000 - $150,000
100% remote (within the United States)
Full-time W2 employment structure
H1B transfer support for eligible candidates

Opportunity to work on large-scale AI infrastructure systems
Exposure to cutting-edge ML frameworks and distributed training technologies
Strong engineering culture focused on performance, reliability, and scalability
Direct impact on production AI systems and research acceleration
Comprehensive career growth opportunities in advanced ML infrastructure

How Jobgether Works

We use an AI-powered matching process to ensure your application is reviewed quickly, objectively, and fairly against the role's core requirements. Our system identifies the top-fitting candidates, and this shortlist is then shared directly with the hiring company. The final decision and next steps (interviews, assessments) are managed by their internal team.

We appreciate your interest and wish you the best!

Why Apply Through Jobgether?

Data Privacy Notice: By submitting your application, you acknowledge that Jobgether will process your personal data to evaluate your candidacy and share relevant information with the hiring employer. This processing is based on legitimate interest and pre-contractual measures under applicable data protection laws (including GDPR). You may exercise your rights (access, rectification, erasure, objection) at any time.

We may use artificial intelligence (AI) tools to support parts of the hiring process, such as reviewing applications, analyzing resumes, assessing responses, and identifying potential inconsistencies or verification signals in application materials. These tools assist our recruitment team but do not replace human judgment. Final hiring decisions are made by humans. If you would like more information about how your data is processed, please contact us.

Job Overview

Posted Date: Jul 04, 2026

Employment Type: Full-time

Experience Level: Not Applicable

Location: United States

Category: Programming

Company: Jobgether
