Site Reliability Engineer - AI Agents

Jobgether • Canada

Remote

AI Summary

We are seeking a Site Reliability Engineer to design, operate, and scale the infrastructure layer that powers AI agent systems in production. The ideal candidate has experience in cloud-native systems, production infrastructure, and exposure to ML or AI-driven workloads. This is a high-impact opportunity to shape foundational systems powering next-generation AI agent ecosystems.

Key Responsibilities

Design, build, and operate scalable cloud infrastructure supporting AI agent execution, orchestration, and model serving in production

Ensure reliability, performance, and observability of distributed agentic systems across internal and external products

Collaborate with AI and Data Engineering teams to evolve experimental prototypes into production-grade systems

Technical Skills Required

Cloud infrastructure, Kubernetes, Containerized systems, Docker, Terraform, Python, Bash/Shell, AWS

Benefits & Perks

Competitive compensation package

Fully remote-friendly structure

Comprehensive health coverage

Retirement savings plans

Flexible PTO policy

Mental health and wellness support programs

Learning and development budget

Nice to Have

Experience with AI agent systems, LLM-based applications, or orchestration frameworks

Job Description

This position is listed on behalf of a partner company, which manages all applications and next steps. Our partner is looking for a Site Reliability Engineer - AI Agents based in Canada.

This role sits at the intersection of platform engineering, site reliability, and applied artificial intelligence, focused on building the infrastructure that powers production-grade AI agent systems at scale. You will be responsible for ensuring that agentic workflows, both internal and customer-facing, run reliably, securely, and efficiently in production environments. The position blends deep infrastructure ownership with modern AI system challenges, requiring strong operational discipline and comfort working with rapidly evolving technologies. You will collaborate closely with AI, data engineering, and product teams to transform experimental agent capabilities into hardened, scalable systems. Beyond reliability and operations, the role emphasizes building developer platforms, APIs, and tooling that let engineering teams consume AI infrastructure easily. This is a high-impact opportunity to shape foundational systems powering next-generation AI agent ecosystems.

Accountabilities

You will be responsible for designing, operating, and scaling the infrastructure layer that powers AI agent systems in production, ensuring reliability, observability, and developer usability across the platform.

Design, build, and operate scalable cloud infrastructure supporting AI agent execution, orchestration, and model serving in production
Ensure reliability, performance, and observability of distributed agentic systems across internal and external products
Develop platform services, APIs, SDKs, and self-service tooling to enable efficient consumption of AI infrastructure
Manage compute, orchestration, and deployment infrastructure supporting AI and ML workloads at scale
Build and maintain CI/CD pipelines for reliable, automated deployment of AI services and agent workflows
Implement Infrastructure as Code using tools such as Terraform to provision and manage AWS environments
Design and operate monitoring, logging, alerting, and incident response systems tailored to AI/ML workloads
Define reliability patterns, guardrails, and failure recovery mechanisms for LLM and agent-based systems
Collaborate with AI and Data Engineering teams to evolve experimental prototypes into production-grade systems
Manage Kubernetes-based container orchestration environments for scalable deployment of services
Implement security controls, access management, and infrastructure best practices across systems
Document architecture, runbooks, and operational procedures to support platform adoption and reliability

Requirements

The ideal candidate is a strong SRE or platform engineer with experience in cloud-native systems, production infrastructure, and exposure to ML or AI-driven workloads.

5+ years of experience in Site Reliability Engineering, Platform Engineering, Infrastructure Engineering, or similar roles in production environments
Hands-on experience supporting ML infrastructure, model serving, or MLOps pipelines in production
Experience building developer platforms, internal tools, APIs, or SDKs used at scale by engineering teams
Strong understanding of platform engineering principles, including self-service infrastructure and developer experience design
Proficiency with Infrastructure as Code tools, particularly Terraform
Strong experience with Kubernetes and containerized systems (Docker)
Solid cloud infrastructure experience, preferably AWS
Strong scripting and programming skills (Python preferred, plus bash/shell proficiency)
Experience designing and operating observability, monitoring, and alerting systems
Experience with incident response processes and on-call operational ownership
Strong collaboration skills across AI, data, and engineering teams
High ownership mindset with ability to operate in fast-paced, high-stakes production environments
Familiarity with AI agent systems, LLM-based applications, or orchestration frameworks is a strong plus

Benefits

Competitive compensation package with performance-based incentives
Fully remote-friendly structure with flexibility across eligible regions
Comprehensive health coverage including medical, dental, and vision (where applicable)
Retirement savings plans with employer contributions (where applicable)
Flexible PTO policy and paid company holidays
Mental health and wellness support programs
Learning and development budget for continuous technical growth
Opportunity to work on cutting-edge AI agent infrastructure at global scale
High-ownership engineering culture with strong cross-functional collaboration
Exposure to advanced platform engineering and applied AI systems

How Jobgether Works

We use an AI-powered matching process to ensure your application is reviewed quickly, objectively, and fairly against the role's core requirements. Our system identifies the top-fitting candidates, and this shortlist is then shared directly with the hiring company. The final decision and next steps (interviews, assessments) are managed by their internal team.

We appreciate your interest and wish you the best!

Why Apply Through Jobgether?

Data Privacy Notice: By submitting your application, you acknowledge that Jobgether will process your personal data to evaluate your candidacy and share relevant information with the hiring employer. This processing is based on legitimate interest and pre-contractual measures under applicable data protection laws (including GDPR). You may exercise your rights (access, rectification, erasure, objection) at any time.

We may use artificial intelligence (AI) tools to support parts of the hiring process, such as reviewing applications, analyzing resumes, assessing responses, and identifying potential inconsistencies or verification signals in application materials. These tools assist our recruitment team but do not replace human judgment. Final hiring decisions are made by humans. If you would like more information about how your data is processed, please contact us.

Job Overview

Posted Date: Jun 14, 2026

Employment Type: Full-time

Experience Level: Not Applicable

Location: Canada

Category: DevOps

Company: Jobgether
