Head of Data Center Operations Opportunity

Blue Signal Search company

Head of Data Center Operations in United States

Remote 7 hours ago

Head of Data Center Operations

Location: Remote (United States) – Preference for candidates based in the greater Bay Area


Our client is a well-funded, high-growth innovator delivering large-scale GPU compute for cutting-edge AI workloads. As demand accelerates, they are scaling multiple next-generation data-center clusters across the country. They are seeking a strategic, hands-on Head of Data Center Operations to safeguard the uptime, performance, and growth of this mission-critical infrastructure. If you thrive in hyperscale environments and enjoy shaping world-class operational teams, this role offers an unmatched opportunity to define the gold standard for GPU data-center reliability.


This Role Offers

  • Executive-level influence over a rapidly expanding GPU cloud platform.
  • Remote-first culture with high ownership, technical depth, and autonomy.
  • Direct impact on reliability engineering strategy during multi-megawatt capacity growth.
  • Competitive base salary, performance-based equity, and comprehensive benefits.
  • Chance to lead real-time operations at the forefront of AI infrastructure innovation.


Key Responsibilities

  • Direct the 24×7 operations of geographically distributed, high-density GPU data centers totaling tens of megawatts of compute capacity.
  • Establish and continuously improve monitoring, incident response, and change-management processes to ensure industry-leading uptime and performance.
  • Drive adoption of reliability-engineering best practices, creating playbooks, automation, and tooling that scale with rapid capacity growth.
  • Partner with hardware, facilities, and platform-engineering teams to optimize resource utilization, thermal efficiency, and service quality.
  • Manage vendor and colocation relationships, negotiating SLAs for power, cooling, and network connectivity.
  • Lead and mentor a global team of site-reliability engineers, NOC staff, and systems operators.
  • Oversee compliance programs covering security, disaster recovery, business continuity, and environmental regulations.
  • Analyze incidents and performance trends to identify systemic risks and implement preventive solutions.


Skill Set & Qualifications

  • 10+ years in data-center or large-scale infrastructure operations, including hyperscale, GPU, or HPC environments.
  • Proven track record operating live production workloads at 20 MW or greater total capacity.
  • Expert knowledge of observability, telemetry, and alerting systems for distributed infrastructure.
  • Familiarity with GPU workloads, thermal dynamics, and high-density rack design.
  • Exceptional incident-management and root-cause-analysis skills.
  • Demonstrated success building and scaling remote, globally distributed operations teams.
  • Startup or high-growth environment experience strongly preferred.


Ready to lead the next leap in AI infrastructure reliability? Apply today to explore how your experience can power the future of large-scale GPU compute.


About Blue Signal:

Blue Signal is an award-winning executive search firm serving a range of specialties. Our recruiters have a proven track record of placing top-tier talent across industry verticals, with deep expertise in numerous professional services. Learn more at bit.ly/46Gs4yS

