Senior Site Reliability Engineer (SRE) - AI Infrastructure & On-Prem Deployments

acclaim ai • United States

Remote

AI Summary

We are seeking a Senior SRE to ensure the reliability, observability, and performance of our AI agent platform across cloud and on-prem environments. The role involves incident management, monitoring setup, load testing, and mentoring within a cutting-edge tech stack. Candidates must have 5+ years of SRE/DevOps experience with Kubernetes and Prometheus, plus strong cloud and Linux skills.

Key Highlights

5+ years SRE/DevOps experience with production responsibility

Deep practical knowledge of Docker and Kubernetes

Experience with Prometheus, Alertmanager, Grafana, and SLIs/SLOs

Strong cloud skills (GCP/AWS) and Linux networking

Ownership mindset with mentoring and analytical problem-solving

Fully remote across Europe

Key Responsibilities

Ensure reliability of services: SLIs/SLOs, availability, and bottleneck elimination

Set up monitoring for services, metrics, alerts, and dashboards

Build and maintain Grafana dashboards for team and customers

Run load testing, analyze results, and provide scaling recommendations

Investigate incidents, participate in on-call rotations, write and lead postmortems

Work closely with developers to communicate and defend technical positions

Develop and support Kubernetes-based infrastructure across GCP and AWS

Automate routine work and assist with CI/CD and team tasks

Deliver and support platform for customers including on-prem deployments

Mentor colleagues and raise engineering standards across the team

Technical Skills Required

Docker, Kubernetes, Prometheus, Alertmanager, Grafana, Python, GCP, AWS, Linux, Networking, CI/CD, GitHub Actions, Terraform, Ansible, Incident investigation, Postmortems, Load testing, Capacity planning, SIP, FreeSWITCH, RTP, WebRTC, Triton, vLLM, Kafka, ClickHouse

Benefits & Perks

21 vacation days + public holidays + 5 sick days

Private English lessons via Preply

Fully remote across Europe

Nice to Have

Experience using AI agents for routine tasks

Real-time telephony (SIP, FreeSWITCH, RTP, WebRTC)

GPU/ML serving (Triton, vLLM, RunPod, Nebius, Lambda, run:ai, DCGM)

Streaming data and analytics (Kafka, ClickHouse)

Deep IaC and GitOps experience (Terraform, Ansible, ArgoCD)

Logging with Loki/ELK

gRPC

Isolated and highly secure environments

Preparing systems for significant load growth

Job Description

We currently have several large-scale projects and are expanding our infrastructure team. Our product is an advanced platform for creating and managing AI agents. It can be deployed directly inside a customer’s infrastructure and delivered as an enterprise solution, while also being available as a SaaS version.

Under the hood, there is real-time voice and telephony, GPU and LLM inference, streaming analytics, and all of this runs both in the cloud and on-prem, including in banking environments. There is a lot of infrastructure; it is complex, interesting, and sometimes at the edge of what is possible. That is why we are looking for a strong SRE who, like us, cares about making systems transparent, reliable, and built the right way.

This is a role for a strong, independent engineer: a Senior SRE with real influence and a voice in how things are built and operated.

You will also handle DevOps tasks for the team, but your main focus and area of expertise should be SRE: reliability, observability, incident management, and performance under load.

Requirements

5+ years in SRE/DevOps. You have not just seen production; you have been responsible for the reliability of high-load production systems.
Deep, practical understanding of Docker and Kubernetes. You have operated them in production, not just used them in tutorials.
Mature understanding of metrics and alerts, with real hands-on experience writing, tuning, and maintaining them.
Practical experience with Prometheus, Alertmanager, and Grafana.
Ability and willingness to build dashboards and make them clear, useful, and easy to work with.
Experience with SLIs/SLOs, reliability management, incident investigation, and postmortems.
Experience with load testing and basic capacity planning.
Python: you can write code and confidently read and modify other people’s code for automation, exporters, tooling, and related tasks.
Cloud experience with GCP and/or AWS, strong Linux skills, and solid networking knowledge at an operational level.
DevOps fundamentals: CI/CD and infrastructure as code, including GitHub Actions, Terraform, Ansible, and similar tools.
Willingness to understand and support the product in customer environments, including on-prem deployments.
Ownership mindset: you take responsibility for a task, drive it to completion, and think one step ahead.
Friendly, non-toxic, and pleasant to work with.

Strong communication with developers: you can clearly and constructively explain your position, defend it when needed, and find common ground.
Willingness and ability to mentor, teach, and share knowledge with others.
Analytical mindset: you dig down to the root cause instead of just treating symptoms.
Proactivity: you would rather prevent an outage than heroically fight it later.
Strong attention to detail and reliability.

Nice to Have

Experience using AI agents for routine and recurring tasks.
Real-time telephony: SIP, FreeSWITCH, RTP, WebRTC.
GPU/ML serving: Triton, vLLM, RunPod, Nebius, Lambda, run:ai, DCGM; understanding of the specifics of deploying LLM/ML models.
Streaming data and analytics: Kafka, ClickHouse.
Deep experience with IaC and GitOps, such as Terraform, Ansible, ArgoCD; logging with Loki/ELK; gRPC.
Experience working in isolated and highly secure environments.
Experience preparing systems for significant growth in load.

Responsibilities

You will be responsible for the reliability of our services: SLIs/SLOs, availability, and identifying and eliminating bottlenecks across the system.

You will set up monitoring for services, metrics, alerts, and dashboards. This will rarely come as a clearly defined task; more often, you will decide what is important to measure and bring it to a clear, usable view.
You will build and maintain Grafana dashboards that people actually use, both our team and our customers.
You will run load testing, analyze the results, and provide recommendations on resources and scaling.
You will investigate incidents, participate in on-call rotations, write and lead postmortems, and ensure the same failure does not happen again.
You will work closely with developers: communicate and defend your position, challenge technical decisions, and find win-win solutions.
You will develop and support Kubernetes-based infrastructure across our clouds, including GCP and AWS, automate routine work, and help with CI/CD and general team tasks.
You will take part in delivering and supporting the platform for customers, including on-prem deployments.
You will mentor colleagues and help raise the engineering bar across the team.

What We Offer

The team has built award-winning AI products for tech corporations — devices, voice assistants, products that are actually in the world
Cutting-edge tech stack: Speech Technologies, NLP, Generative AI (LLMs, diffusion models), voice-first agentic architecture with privacy-first and on-premises deployment
High engineering bar and real ownership — the team cares about what actually works in production, not what looks good in a demo, and you'll see the impact of your work directly
Fast career progression: a senior-heavy team and a high volume of real problems mean you grow faster than you would anywhere else
Startup pace with enterprise stability — real clients, real revenue, no bureaucracy
Fully remote across Europe
21 vacation days + public holidays + 5 sick days
Private English lessons via Preply

Job Overview

Posted Date: Jun 09, 2026

Employment Type: Full-time

Experience Level: Not Applicable

Location: United States

Category: DevOps

Company: acclaim ai
