Incident and Escalation Manager

Jobgether • United States
Remote
AI Summary

This role is responsible for leading major incident response efforts and managing executive-level escalations in a global, AI-driven infrastructure environment. The ideal candidate will have strong technical fluency and exceptional decision-making skills, with the ability to communicate effectively with stakeholders. The position requires 12+ years of experience in Incident Management, Escalation Management, or Technical Operations.

Key Highlights
Lead major incident response efforts and manage executive-level escalations
Collaborate with engineering, support, product, and sales teams to resolve complex technical issues
Develop and improve global incident and escalation management frameworks
Key Responsibilities
Lead and coordinate major incident response efforts for high-severity service disruptions
Act as Incident Commander, driving structured triage, cross-functional collaboration, and service restoration activities
Manage executive-level escalations, ensuring rapid resolution of critical customer issues and maintaining strong stakeholder alignment
Technical Skills Required
Incident Management
ITIL frameworks
Technical Operations
Benefits & Perks
Remote-first role with global operational exposure
Opportunity to work on cutting-edge AI and high-performance computing infrastructure
Competitive compensation aligned with senior-level responsibilities
Nice to Have
Experience with AI/HPC environments
ITIL certification
Experience with tools such as Jira, Salesforce, Slack, or Confluence

Job Description


This position is listed on behalf of a partner company, which manages all applications and next steps. Our partner is looking for an Incident and Escalation Manager based in the United States.

This is a high-impact operational leadership role at the center of major incident response and customer escalation management within a global, AI-driven infrastructure environment. You will act as the central coordination authority during critical service disruptions, ensuring rapid alignment across engineering, support, product, and executive stakeholders. The role requires strong technical fluency combined with exceptional decision-making under pressure, particularly in mission-critical AI and high-performance computing environments. You will be responsible for restoring service stability while maintaining clear, structured communication with customers and internal leadership teams. Beyond incident resolution, you will also shape and improve global incident and escalation management frameworks. The environment is fast-paced, highly technical, and globally distributed, requiring strong leadership across time zones. This is a strategic role with direct impact on customer trust, operational resilience, and platform reliability.

Accountabilities

  • Lead and coordinate major incident response efforts for high-severity service disruptions impacting AI, HPC, and enterprise-scale environments.
  • Act as Incident Commander, driving structured triage, cross-functional collaboration, real-time decision-making, and service restoration activities.
  • Manage executive-level escalations, ensuring rapid resolution of critical customer issues and maintaining strong stakeholder alignment.
  • Provide clear, timely, and structured communication to executives, customers, and internal teams during major incidents.
  • Partner with engineering, support, product, and sales teams to resolve complex technical and service-related challenges.
  • Lead post-incident and escalation reviews (PIER), including root cause analysis and corrective action tracking.
  • Identify systemic issues and drive continuous improvement across incident, escalation, and problem management processes.
  • Contribute to the development of operational frameworks, governance models, and service reliability standards across global teams.

Requirements

  • 12+ years of experience in Incident Management, Escalation Management, Problem Management, or Technical Operations in enterprise or high-tech environments.
  • Proven experience leading high-severity incidents and executive escalations in AI, HPC, or large-scale infrastructure ecosystems.
  • Strong technical understanding of complex distributed systems and ability to collaborate effectively with engineering teams under pressure.
  • Deep knowledge of ITIL frameworks, including Incident, Problem, Change, and Escalation Management practices.
  • Exceptional communication skills, with the ability to manage both technical and executive-level audiences.
  • Strong analytical mindset with experience interpreting incident data, trends, and operational metrics.
  • Ability to operate in high-pressure, customer-facing situations with strong ownership and decision-making capabilities.
  • Experience working in global, 24/7 operational environments with on-call responsibilities.
  • Proven ability to influence cross-functional teams and senior stakeholders without direct authority.
  • Nice to have: experience with AI/HPC environments, distributed storage systems (e.g., Lustre), ITIL certification, and tools such as Jira, Salesforce, Slack, or Confluence.

Benefits

  • Remote-first role with global operational exposure.
  • Opportunity to work on cutting-edge AI and high-performance computing infrastructure.
  • High-impact position with direct visibility to executive leadership and strategic customers.
  • Competitive compensation aligned with senior-level responsibilities.
  • Strong focus on operational excellence, innovation, and continuous improvement.
  • Collaborative, cross-functional environment with global teams.
  • Exposure to mission-critical systems powering advanced AI workloads worldwide.

How Jobgether Works

We use an AI-powered matching process to ensure your application is reviewed quickly, objectively, and fairly against the role's core requirements. Our system identifies the top-fitting candidates, and this shortlist is then shared directly with the hiring company. The final decision and next steps (interviews, assessments) are managed by their internal team.

We appreciate your interest and wish you the best!

Why Apply Through Jobgether?

Data Privacy Notice: By submitting your application, you acknowledge that Jobgether will process your personal data to evaluate your candidacy and share relevant information with the hiring employer. This processing is based on legitimate interest and pre-contractual measures under applicable data protection laws (including GDPR). You may exercise your rights (access, rectification, erasure, objection) at any time.

We may use artificial intelligence (AI) tools to support parts of the hiring process, such as reviewing applications, analyzing resumes, assessing responses, and identifying potential inconsistencies or verification signals in application materials based on available information. These tools assist our recruitment team but do not replace human judgment. Final hiring decisions are made by humans. If you would like more information about how your data is processed, please contact us.

