Principal Site Reliability Engineer Opportunity

JPS Tech Solutions

Principal Site Reliability Engineer in United States

Visa sponsorship 2 hours ago

Experience: Mid-Senior level

Job type: Contract

Job Category: Engineer

Job Type: Remote

Job Location: Washington, District of Columbia

Compensation: Depends on Experience

W2: W2 contract only. Applications on a C2C basis will not be considered for this role.

JPS-4639 | Posted On: 10/08/2025 | Closes On: 10/17/2025

Job Description

The Principal Site Reliability Engineer will be a critical technical leader responsible for driving the operational excellence, resilience, and security of our core systems for a key Randstad client in the Washington D.C. area. This senior role merges deep expertise in infrastructure automation (IaC), CI/CD architecture, and cloud security with the foundational principles of Site Reliability Engineering (SRE), including defining SLOs, managing error budgets, and leading incident response. You will mentor cross-functional teams, implement cost-efficient cloud practices, and build the foundational tools and platforms that enable our developers to deliver secure, highly available, and scalable services with velocity.

Responsibilities

Reliability Engineering & Operations: Define, implement, and maintain rigorous Service Level Objectives (SLOs) and Service Level Indicators (SLIs), establish effective error budgeting, and lead incident response, root cause analysis, and postmortem processes to ensure continuous service improvement.
Infrastructure Automation: Architect, implement, and manage secure, scalable, and repeatable cloud environments leveraging Infrastructure-as-Code (IaC) tools such as Terraform, Ansible, and CloudFormation.
CI/CD Optimization & Security: Design and optimize secure, high-performance CI/CD pipelines (e.g., GitHub Actions, Jenkins) incorporating advanced deployment techniques like automated rollback, canary, and blue/green strategies, and ensuring artifact validation.
Observability & Telemetry: Develop comprehensive observability solutions, including building robust dashboards, configuring alerts, implementing synthetic checks, and maintaining telemetry pipelines (metrics, logs, traces) to ensure deep visibility into system performance, availability, and cost.
Security & Compliance Enforcement: Integrate security tooling (SAST, DAST, SBOM, secrets scanning) directly into the deployment lifecycle and enforce security policies-as-code within deployment workflows to maintain strict compliance and a secure posture.
Cost & Capacity Management: Implement tooling and financial practices to proactively monitor cloud cost trends, right-size infrastructure resources, and plan capacity to ensure an optimal cost-to-performance ratio and high availability.
Internal Platform Enablement: Design and build reusable internal tools, shared playbooks, and self-service platforms that significantly enhance developer productivity and enforce consistent, high-quality delivery standards across engineering teams.
Mentorship & Technical Leadership: Serve as a senior technical mentor and subject matter expert across platform, security, and engineering teams, establishing and promoting best practices in operational readiness, fault tolerance, and secure delivery.

Qualifications

Experience:

Bachelor’s degree in Computer Science, Engineering, or a related technical discipline.
Minimum of 5 years of progressive experience in DevOps, Site Reliability Engineering (SRE), or Platform Engineering, with proven leadership experience in infrastructure reliability and automation.
3+ years of direct, hands-on experience managing high-availability production environments with modern cloud-native security and observability tooling.

Technical Expertise:

Deep expertise in a major cloud platform (e.g., AWS, Azure, GCP), particularly in core services like Compute, Networking, Identity and Access Management (IAM), and monitoring.
Proficiency with Infrastructure-as-Code tools, specifically Terraform and CloudFormation, and container orchestration technologies like Kubernetes and Docker.
Strong working knowledge of Linux systems and shell scripting.
In-depth familiarity with observability stacks (e.g., Prometheus, Grafana, ELK, Datadog, CloudWatch).
Demonstrated experience designing, implementing, and managing CI/CD systems that incorporate security tollgates, rollback logic, and GitOps patterns.

Skills & Knowledge:

Strong scripting and programming skills in Python, Go, or Bash for automation and tooling development.
In-depth understanding of core SRE practices, including incident response, SLO/SLA management, chaos engineering, and capacity modeling.
Proven track record of creating shared tooling, documentation, and best practices that drive operational excellence and knowledge transfer across an organization.
