Job Description
At BairesDev®, we've been leading the way in technology projects for over 15 years. We deliver cutting-edge solutions to giants like Google and the most innovative startups in Silicon Valley.
Our diverse 4,000+ team, composed of the world's Top 1% of tech talent, works remotely on roles that drive significant impact worldwide.
When you apply for this position, you're taking the first step in a process that goes beyond the ordinary. We aim to align your passions and skills with our vacancies, setting you on a path to exceptional career development and success.
As a Deep Learning Engineer (Distributed), you will lead the development and optimization of large-scale model training across multi-GPU and multi-node environments. You will focus on maximizing training throughput and efficiency, ensuring that complex models scale seamlessly across high-performance compute clusters.
What You'll Do
- Design and implement distributed training pipelines using PyTorch DDP, FSDP, and DeepSpeed.
- Optimize model performance through mixed-precision training and advanced sharding techniques.
- Identify and resolve system bottlenecks using NVIDIA GPU profiling tools such as Nsight.
- Develop and deploy optimized model kernels using Triton or TensorRT to enhance inference and training speed.
- Collaborate with research and systems teams to manage distributed state and resilient checkpointing at scale.
Technical Skills Required
- 4+ years of experience in Machine Learning Engineering, Distributed Systems, or a related technical field.
- Proven expertise in PyTorch and distributed training frameworks including DDP, FSDP, and DeepSpeed.
- Strong familiarity with CUDA and NVIDIA GPU profiling tools such as Nsight.
- Hands-on experience with ML performance optimization using Triton or TensorRT.
- Advanced proficiency in English.
Benefits & Perks
- 100% remote work (from anywhere).
- Excellent compensation in USD or, if you prefer, your local currency.
- Hardware and software setup for you to work from home.
- Flexible hours: create your own schedule.
- Paid parental leave, vacation, and national holidays.
- Innovative and multicultural work environment: collaborate and learn from the global Top 1% of talent.
- Supportive environment with mentorship, promotions, skill development, and diverse growth opportunities.
Apply now!