MTS: ML Infrastructure, Platform Engineer Opportunity

Essential AI


San Francisco Bay Area

Visa sponsorship & Relocation

About Us

At Essential, we believe that AI will help us do our most ambitious and rewarding work.

Our full-stack AI products mimic human problem solving and learn existing ways of working. Using natural language, our customers in the Financial Enterprise Sector can point our models at mission-critical questions and generate verifiable answers from knowledge buried in their data, documents, and communications, in a matter of minutes instead of hours.


We believe that a small, focused team of motivated individuals can create outsized breakthroughs. We are based in SF and are building a world-class multi-disciplinary team of engineers, researchers, designers, and sales and product experts who are excited to solve hard real-world AI problems.


The Role

The ML Infrastructure Platform Engineer will be responsible for architecting and building the compute infrastructure that powers the training and serving of our models. This requires a full understanding of the complete backend stack, from frameworks to compilers to runtimes to kernels.


Running and training models at scale often requires solving novel systems problems. In this role, you'll be responsible for identifying these problems and developing systems that optimize the throughput and robustness of our distributed infrastructure. Drawing on proven experience building large-scale platforms, you will build and advance the systems that allow our research and engineering organizations to iteratively develop, test, and deploy new features reliably, with high velocity and a fast, frictionless development cycle.


What you’ll be working on

  • Design, build, and maintain scalable machine learning infrastructure to support our model training, inference, and applications
  • Design and implement scalable machine learning and distributed systems that enable training and scaling of LLMs, including work on parallelism methods to make training fast and reliable
  • Help oversee and drive the vision for how we build, test, and deploy models, while taking ownership of and transforming the state-of-the-art development experience for research
  • Develop tools and frameworks to automate and streamline ML experimentation and management
  • Collaborate with other researchers and product engineers to bring magical product experiences through large language models
  • Work at lower levels of the stack to build high-performing, optimal training and serving infrastructure, including researching new techniques and writing custom kernels as needed
  • Optimize performance and efficiency across different accelerators


What we are looking for

  • A strong understanding of the architectures of newer AI accelerators (e.g., TPU, IPU, HPU) and their tradeoffs
  • Knowledge of parallel computing concepts and distributed systems.
  • Prior experience performance-tuning LLM training and/or inference workloads; experience with MLPerf or internal production workloads is valued
  • 6+ years of relevant industry experience leading the design of large-scale, production ML infrastructure systems
  • Experience training and building large language models using frameworks such as Megatron or DeepSpeed, and deployment frameworks such as vLLM, TGI, or TensorRT-LLM
  • Comfortable working under the hood with kernel languages like OpenAI Triton or Pallas and compilers like XLA
  • Experience with INT8/FP8 training and inference, quantization, and/or distillation
  • Knowledge of container technologies like Docker and Kubernetes and cloud platforms like AWS, GCP, etc.
  • Intermediate fluency with networking fundamentals such as VPCs, subnets, routing tables, and firewalls


We encourage you to apply for this position even if you don’t meet all of the above requirements but are eager to spend time pushing on these techniques.


We are based in-person in SF and work fully onsite 5 days a week. We offer relocation assistance to new employees.


The base pay range for the role seniority described in this job description is between $215,000 and $240,000 in San Francisco, CA. Final offer amounts depend on various job-related factors, including where you place on our internal performance ladders, which is based on past work experience, relevant education, performance in our interviews, and our benchmarks against market compensation data. In addition to cash pay, full-time regular positions are eligible for equity, 401(k), health benefits, and other benefits like daily onsite lunches and snacks; some of these benefits may be available for part-time or temporary positions.


Essential AI commits to providing a work environment free of discrimination and harassment, as well as equal employment opportunity regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity or veteran status. We also consider qualified applicants regardless of criminal histories, consistent with legal requirements. You may view all of Essential AI’s recruiting notices here, including our EEO policy, recruitment scam notice, and recruitment agency policy.

Apply now
