Head of Data Center Operations
Location: Remote (United States) – Preference for candidates based in the greater Bay Area
Our client is a well-funded, high-growth innovator delivering large-scale GPU compute for cutting-edge AI workloads. As demand accelerates, they are scaling multiple next-generation data-center clusters across the country. They are seeking a strategic, hands-on Head of Data Center Operations to safeguard the uptime, performance, and growth of this mission-critical infrastructure. If you thrive in hyper-scalable environments and enjoy shaping world-class operational teams, this role offers an unmatched opportunity to define the gold standard for GPU data-center reliability.
This Role Offers
- Executive-level influence over a rapidly expanding GPU cloud platform.
- Remote-first culture with high ownership, technical depth, and autonomy.
- Direct impact on reliability engineering strategy during multi-megawatt capacity growth.
- Competitive base salary, performance-based equity, and comprehensive benefits.
- Chance to lead real-time operations at the forefront of AI infrastructure innovation.
Key Responsibilities
- Direct the 24×7 operations of geographically distributed, high-density GPU data centers totaling tens of megawatts of compute capacity.
- Establish and continuously improve monitoring, incident response, and change-management processes to ensure industry-leading uptime and performance.
- Drive adoption of reliability-engineering best practices, creating playbooks, automation, and tooling that scale with rapid capacity growth.
- Partner with hardware, facilities, and platform-engineering teams to optimize resource utilization, thermal efficiency, and service quality.
- Manage vendor and colocation relationships, negotiating SLAs for power, cooling, and network connectivity.
- Lead and mentor a global team of site-reliability engineers, NOC staff, and systems operators.
- Oversee compliance programs covering security, disaster recovery, business continuity, and environmental regulations.
- Analyze incidents and performance trends to identify systemic risks and implement preventive solutions.
Skill Set & Qualifications
- 10+ years in data-center or large-scale infrastructure operations, including hyperscale, GPU, or HPC environments.
- Proven track record operating live production workloads at 20 MW or greater total capacity.
- Expert knowledge of observability, telemetry, and alerting systems for distributed infrastructure.
- Familiarity with GPU workloads, thermal dynamics, and high-density rack design.
- Exceptional incident-management and root-cause-analysis skills.
- Demonstrated success building and scaling remote, globally distributed operations teams.
- Startup or high-growth environment experience strongly preferred.
Ready to lead the next leap in AI infrastructure reliability? Apply today to explore how your experience can power the future of large-scale GPU compute.
About Blue Signal
Blue Signal is an award-winning executive search firm that recruits across a range of specialties. Our recruiters have a proven track record of placing top-tier talent across industry verticals, with deep expertise in numerous professional services. Learn more at bit.ly/46Gs4yS