Site Reliability Engineer

DigitalXC AI, India
Remote
AI Summary

Design, build, and maintain highly available, scalable, and secure infrastructure for DigitalXC AI's GenAI and automation platform. Monitor system performance, manage incident response, perform root-cause analysis, and implement reliability improvements. Collaborate with engineering teams on CI/CD pipelines, observability, capacity planning, and disaster recovery.

Key Highlights
Design and maintain infrastructure for GenAI-powered hyper-automation platform
Monitor system performance and manage incident response with root-cause analysis
Implement CI/CD pipelines, observability, and capacity planning best practices
Collaborate with software engineering teams on reliability and automation
Key Responsibilities
Design, build, and maintain highly available, scalable, and secure infrastructure
Monitor system performance and manage incident response
Perform root-cause analysis for production issues
Implement reliability and performance improvements
Collaborate with software engineering teams on resilient service design
Automate deployments and improve observability
Implement capacity planning and disaster recovery strategies
Define and refine SLOs/SLIs
Manage CI/CD pipelines
Contribute to tooling that reduces operational toil
Technical Skills Required
Linux system administration
Cloud platforms (AWS, Azure, GCP)
Python or Go programming
Kubernetes and container orchestration
Benefits & Perks
Remote work

Job Description


Company Description
DigitalXC AI is a GenAI-powered hyper-automation and employee experience platform focused on transforming enterprise IT operations and support. The platform enables self-service, self-heal, self-help, and operations automation across major IT domains, backed by an app store of 650+ prebuilt automated services that can drive 50–60% automation within 12–18 months. DigitalXC AI delivers a consumer-grade, omnichannel experience through web and mobile apps, chat and voice bots, and integrations with tools like ServiceNow. Its intelligent virtual assistants and AI agents enhance productivity by supporting user queries, content creation, enterprise search, technical support, and more. The platform integrates with a wide range of enterprise technologies, including cloud, digital workplace, service desk, DevOps, networks, security, and leading business applications.
Role Description
This is a full-time, remote role for a Site Reliability Engineer at DigitalXC AI. The Site Reliability Engineer will design, build, and maintain highly available, scalable, and secure infrastructure that powers the company’s GenAI and automation platform. Day-to-day responsibilities include monitoring system performance, managing incident response, performing root-cause analysis, and implementing reliability and performance improvements. The role involves collaborating with software engineering teams to design resilient services, automate deployments, improve observability, and implement best practices for capacity planning and disaster recovery. The Site Reliability Engineer will also help define and refine SLOs/SLIs, manage CI/CD pipelines, and contribute to tooling that reduces operational toil.
Qualifications
  • Candidates should possess strong Site Reliability Engineering skills, including observability, incident management, capacity planning, and reliability best practices.
  • Candidates should possess deep System Administration and Infrastructure skills, such as managing Linux-based systems, cloud platforms (e.g., AWS, Azure, GCP), networking basics, and infrastructure-as-code tooling.
  • Candidates should possess solid Software Development skills, including proficiency in at least one programming or scripting language (e.g., Python, Go, Java, or Bash) and experience building automation and internal tools.
  • Candidates should possess advanced Troubleshooting skills for diagnosing complex production issues across applications, infrastructure, and third-party integrations.
  • Experience with CI/CD pipelines, containers and orchestration (e.g., Docker, Kubernetes), and monitoring/logging stacks (e.g., Prometheus, Grafana, ELK, or similar) is highly beneficial.
  • Understanding of security best practices for cloud-native environments, including access control, secrets management, and patching, is preferred.
  • Effective communication skills, a collaborative mindset, and the ability to work independently in a remote, distributed team are essential.
  • Bachelor’s degree in Computer
