Manager, AI/ML Systems Administration
Empire AI is establishing New York as the national leader in responsible artificial intelligence. The initiative is backed by a consortium of top academic and research institutions, including Columbia, Cornell, NYU, CUNY, RPI, SUNY, Rochester Schools, Mount Sinai, the Simons Foundation, and the Flatiron Institute.
By leveraging the state’s rich academic and research institutions, Empire AI is driving innovation in fields such as medicine, education, energy, and climate change, while giving New York’s researchers access to computing resources that are often prohibitively expensive and typically available only to big tech companies. In doing so, it fuels statewide innovation, drives economic growth, and prepares a future-ready AI workforce to tackle society’s most complex challenges.
The initiative is funded by more than $500 million in public and private investment, including a state capital grant and contributions from the consortium’s academic institutions, the Simons Foundation, the Flatiron Institute, and Tom Secunda (co-founder of Bloomberg).
Position Summary
The Manager, AI/ML Systems Administration, will lead the design, integration, and optimization of high-performance computing platforms that support artificial intelligence, machine learning, and large-scale simulation across Empire AI’s statewide consortium.
The role reports to the Director, AI Research Computing, and is responsible for shaping the technical design of GPU-rich HPC systems and federated data environments that span multiple institutions. The Manager ensures platform readiness for cutting-edge AI workloads, enables workload portability, and guides infrastructure decisions to meet research, compliance, and sustainability goals.
Duties and Responsibilities
- Lead technical design and architectural planning for Empire AI’s shared and distributed HPC environments.
- Define requirements for GPU, CPU, and memory-bound workloads and recommend scalable solutions.
- Design integration layers between on-prem and hybrid cloud computing environments.
- Architect systems to support AI training and inference pipelines, including large language models and multimodal AI.
- Work with research faculty to translate scientific goals into technical configurations.
- Tune and benchmark systems for GPU-intensive frameworks (e.g., PyTorch, TensorFlow, JAX); see the benchmarking sketch after this list.
- Develop models for cross-institutional workload orchestration, shared resource access, and software environment consistency.
- Support containerized and virtualized research environments (e.g., Apptainer, Docker, Kubernetes).
- Ensure compatibility across heterogeneous hardware and storage platforms.
- Design architectures that meet requirements for HIPAA, NIST 800-171, and NIH GDS compliance.
- Integrate robust monitoring, alerting, access control, and disaster recovery planning.
- Support secure enclave configurations for regulated data workflows.
- Consult with research teams to assess computational needs and advise on workflow optimization.
- Partner with faculty, developers, and facilitators to deploy novel algorithms and research software stacks.
- Translate user feedback into system-level improvements.
- Maintain clear system documentation, configuration guides, and architecture diagrams.
- Contribute to technical reports, grant proposals, and performance assessments.
- Evaluate emerging hardware/software solutions and make procurement recommendations.
- Participate in special technical initiatives aligned with Empire AI’s mission and growth.
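To give candidates a concrete feel for the tuning and benchmarking duty referenced above, the sketch below is a minimal, illustrative GPU smoke test. It assumes only a node with PyTorch installed; the matrix size and iteration count are arbitrary placeholders rather than Empire AI configuration values, and real benchmarking would extend this with framework-level workloads and multi-GPU scaling runs.

```python
"""Minimal GPU throughput smoke test (illustrative sketch only)."""
import time

import torch


def matmul_tflops(n: int = 8192, iters: int = 20) -> float:
    """Time n x n matrix multiplies and return sustained TFLOP/s."""
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    # FP16 on GPU exercises the tensor cores; fall back to FP32 on CPU.
    dtype = torch.float16 if device.type == "cuda" else torch.float32
    a = torch.randn(n, n, device=device, dtype=dtype)
    b = torch.randn(n, n, device=device, dtype=dtype)

    # Warm-up so one-time CUDA context and kernel-launch costs
    # don't skew the measurement.
    for _ in range(3):
        a @ b
    if device.type == "cuda":
        torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(iters):
        a @ b
    if device.type == "cuda":
        torch.cuda.synchronize()  # wait for queued kernels to finish
    elapsed = time.perf_counter() - start

    # One n x n matmul costs roughly 2 * n^3 floating-point operations.
    return (2 * n**3 * iters) / elapsed / 1e12


if __name__ == "__main__":
    print(f"sustained throughput: {matmul_tflops():.1f} TFLOP/s")
```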
Minimum Qualifications
- Bachelor’s degree in Computer Science, Engineering, or related technical field
- 7+ years of experience designing, operating, or optimizing HPC and/or AI infrastructure
- Expertise with Linux-based clusters, job schedulers (e.g., Slurm), and GPU computing; a small scripting sketch follows this list
- Familiarity with AI/ML frameworks, container environments, and distributed storage systems
- Demonstrated success collaborating with researchers or supporting scientific computing projects
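As a flavor of the scripting that typically accompanies Slurm and GPU-cluster expertise, here is a minimal, illustrative sketch that summarizes per-node GPU resources from Slurm. It assumes `sinfo` is on the PATH and uses the standard `%N` (node name) and `%G` (generic resources) format specifiers; GRES strings vary by site, so the parsing is deliberately loose.

```python
"""Sketch: summarize per-node GPU GRES from Slurm (illustrative only)."""
import subprocess
from collections import Counter


def gpu_inventory() -> Counter:
    """Count nodes by their GRES string, e.g. 'gpu:a100:4'."""
    out = subprocess.run(
        ["sinfo", "-h", "-N", "-o", "%N %G"],  # one line per node: name, gres
        check=True, capture_output=True, text=True,
    ).stdout
    counts: Counter = Counter()
    for line in out.splitlines():
        _node, _, gres = line.partition(" ")
        counts[gres.strip() or "(none)"] += 1
    return counts


if __name__ == "__main__":
    for gres, n in sorted(gpu_inventory().items()):
        print(f"{n:4d} nodes  {gres}")
```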
Preferred Qualifications
- Master’s degree in Computer Science, Engineering, Data Science, or a related technical field, or equivalent professional experience
- Industry certifications relevant to HPC or AI/ML systems, such as NVIDIA Deep Learning Institute (DLI), Red Hat Certified Engineer (RHCE), or equivalent
- Experience supporting or collaborating within academic or industry research environments focused on artificial intelligence, machine learning, or large-scale data science
- Familiarity with workload patterns and infrastructure needs for training, tuning, and deploying large-scale AI/ML models (e.g., LLMs, computer vision, scientific ML)
- Experience with federated resource environments, research cloud integrations, and collaborative research infrastructure governance
- Proficiency in infrastructure automation and system configuration tools (e.g., Ansible, Terraform, Git)
- Contributions to open-source AI, HPC, or research software communities
Compensation
Our compensation reflects the cost of labor across several US geographic markets. The base pay and target total cash for this position range from $100,000 to $200,000. Pay is based on a number of factors, including market location, and may vary depending on job-related knowledge, skills, and experience.
Travel Requirements
This role requires approximately 20% regional travel and availability to work from our data center when not traveling. Candidates should live within a reasonable commuting distance of the office or be willing to relocate; relocation assistance may be provided.