Deep Learning Engineer (Distributed)

BairesDev Latin America
Remote
AI Summary

Lead development and optimization of large-scale model training across multi-GPU and multi-node environments. Design and implement distributed training pipelines using PyTorch DDP, FSDP, and DeepSpeed. Collaborate with research and systems teams to manage distributed state and resilient checkpointing at scale.

Key Highlights
4,000+ global team of Top 1% tech talent
100% remote work from anywhere
PyTorch DDP, FSDP, and DeepSpeed expertise required
Experience with CUDA and Nvidia GPU profiling tools like Nsight
Technical Skills Required
PyTorch DDP, FSDP, DeepSpeed, CUDA, Nvidia GPU profiling tools (Nsight), mixed-precision training, Triton, TensorRT
Benefits & Perks
100% remote work
Excellent compensation
Hardware and software setup for home office
Flexible hours
Paid parental leave
Vacations and national holidays
Innovative and multicultural work environment

Job Description


At BairesDev®, we've been leading the way in technology projects for over 15 years. We deliver cutting-edge solutions to giants like Google and the most innovative startups in Silicon Valley.

Our diverse 4,000+ team, composed of the world's Top 1% of tech talent, works remotely on roles that drive significant impact worldwide.

When you apply for this position, you're taking the first step in a process that goes beyond the ordinary. We aim to align your passions and skills with our vacancies, setting you on a path to exceptional career development and success.

As a Deep Learning Engineer (Distributed), you will lead the development and optimization of large-scale model training across multi-GPU and multi-node environments. You will focus on maximizing training throughput and efficiency, ensuring that complex models scale seamlessly across high-performance compute clusters.

What You'll Do

  • Design and implement distributed training pipelines using PyTorch DDP, FSDP, and DeepSpeed.
  • Optimize model performance through mixed-precision training and advanced sharding techniques.
  • Identify and resolve system bottlenecks using Nvidia GPU profiling tools like Nsight.
  • Develop and deploy optimized model kernels using Triton or TensorRT to enhance inference and training speed.
  • Collaborate with research and systems teams to manage distributed state and resilient checkpointing at scale.

What We Are Looking For

  • 4+ years of experience in Machine Learning Engineering, Distributed Systems, or a related technical field.
  • Proven expertise in PyTorch and distributed training frameworks including DDP, FSDP, and DeepSpeed.
  • Strong familiarity with CUDA and Nvidia GPU profiling tools like Nsight.
  • Hands-on experience with ML performance optimization using Triton or TensorRT.
  • Advanced proficiency in English.

How we make your work (and your life) easier:

  • 100% remote work (from anywhere).
  • Excellent compensation in USD or, if you prefer, your local currency.
  • Hardware and software setup for you to work from home.
  • Flexible hours: create your own schedule.
  • Paid parental leave, vacation, and national holidays.
  • Innovative and multicultural work environment: collaborate and learn from the global Top 1% of talent.
  • Supportive environment with mentorship, promotions, skill development, and diverse growth opportunities.

Join a global team where your unique talents can truly thrive and make a significant impact!

Apply now!
