Lead Site Reliability Engineer

bma group global • United States

Relocation

AI Summary

We are seeking a highly experienced Lead Site Reliability Engineer to lead the design, implementation, and optimization of highly scalable, secure, and resilient cloud infrastructure platforms. This role is responsible for driving reliability engineering practices, infrastructure automation, observability, incident management, and platform scalability across mission-critical systems. The ideal candidate will have 6-10+ years of experience in Site Reliability Engineering, DevOps, Infrastructure Engineering, Cloud Engineering, or Software Engineering.

Key Highlights

Lead the design, implementation, and optimization of cloud infrastructure platforms

Drive reliability engineering practices and infrastructure automation

Collaborate with software engineering, security, cloud operations, and architecture teams

Key Responsibilities

Design, implement, and maintain highly available, scalable, and fault-tolerant cloud infrastructure

Drive reliability initiatives through automation, reducing operational overhead and manual intervention

Establish and maintain service reliability objectives, performance standards, and operational excellence practices

Technical Skills Required

Linux systems administration, AWS, GCP, Terraform, Ansible, Packer, Python, Go, Bash, Docker, Kubernetes, Prometheus, Grafana, ELK

Benefits & Perks

Comprehensive relocation package

Nice to Have

Experience with hybrid cloud or multi-region deployments

Background in security engineering, compliance, or cloud security programs

Contributions to open-source projects

Knowledge of ITIL/ITSM service management practices

Cloud platform and Kubernetes certifications

Job Description

RELOCATION TO PUERTO RICO IS REQUIRED

This role is based on-site in the vibrant city of San Juan, Puerto Rico, offering the opportunity to work closely with a dynamic team in a highly collaborative environment. The company is committed to supporting top talent and provides a comprehensive relocation package to ease your transition.

We are seeking a highly experienced Lead Site Reliability Engineer to lead the design, implementation, and optimization of highly scalable, secure, and resilient cloud infrastructure platforms. This role is responsible for driving reliability engineering practices, infrastructure automation, observability, incident management, and platform scalability across mission-critical systems. As a senior technical leader, you will collaborate closely with software engineering, security, cloud operations, and architecture teams to improve system reliability, operational efficiency, and software delivery. You will serve as a subject matter expert, mentor engineers, and champion Site Reliability Engineering and DevOps best practices throughout the organization.

Key Responsibilities

Reliability Engineering & Infrastructure

Design, implement, and maintain highly available, scalable, and fault-tolerant cloud infrastructure.
Drive reliability initiatives through automation, reducing operational overhead and manual intervention.
Establish and maintain service reliability objectives, performance standards, and operational excellence practices.
Lead capacity planning and forecasting to support business growth and ensure cost-effective scalability.

Cloud & Platform Engineering

Architect, deploy, and optimize cloud-native solutions across AWS, GCP, and hybrid cloud environments.
Enhance Infrastructure as Code (IaC) practices using Terraform, Ansible, Packer, and related technologies.
Build and maintain secure, compliant infrastructure aligned with industry security standards and best practices.
Develop and maintain Amazon Machine Images (AMIs) and hardened infrastructure configurations following CIS and STIG benchmarks.

Automation & Software Development

Develop internal tools, automation frameworks, and platform services using Python, Go, Bash, or similar languages.
Improve CI/CD pipelines and deployment workflows using Jenkins, GitOps, FluxCD, and modern delivery practices.
Collaborate with software engineering teams to optimize application performance, reliability, and scalability.
Automate repetitive operational tasks to improve productivity and system consistency.

Observability & Incident Management

Design and enhance monitoring, logging, alerting, and observability solutions using Prometheus, Grafana, ELK, and related technologies.
Lead critical incident response efforts, coordinating cross-functional teams to minimize service disruption.
Conduct root cause analysis, post-incident reviews, and implement preventive measures to improve system resilience.
Continuously improve operational readiness, response times, and service reliability.

Distributed Systems & Performance Optimization

Troubleshoot and optimize complex distributed systems and data platforms, including technologies such as Apache Kafka, Cassandra, SQL, and NoSQL databases.
Analyze performance bottlenecks and implement solutions to improve scalability, availability, and efficiency.
Drive architectural improvements that support high-volume, low-latency workloads.

Technical Leadership & Collaboration

Serve as a technical leader and trusted advisor for reliability engineering initiatives.
Mentor and guide engineers on SRE, DevOps, cloud infrastructure, automation, and operational excellence practices.
Collaborate with development, security, architecture, and operations teams to align infrastructure capabilities with business objectives.
Promote a culture of continuous improvement, innovation, accountability, and knowledge sharing.

Required Qualifications

Education

Bachelor's degree in Computer Science, Information Technology, Engineering, or a related field; equivalent practical experience considered.
Advanced degree preferred.

Experience

6-10+ years of experience in Site Reliability Engineering, DevOps, Infrastructure Engineering, Cloud Engineering, or Software Engineering.
Proven experience designing and operating large-scale production environments.
Experience leading complex infrastructure initiatives and driving operational improvements.

Technical Expertise

Strong experience with Linux systems administration, particularly Debian-based distributions.
Expertise in cloud platforms including AWS and GCP; Azure experience is a plus.
Advanced knowledge of Infrastructure as Code tools such as Terraform, Ansible, and Packer.
Strong programming and automation skills using Python, Go, Bash, or similar languages.
Experience with containerization and orchestration technologies including Docker, Kubernetes, Amazon EKS, and Google Kubernetes Engine (GKE).
Deep understanding of CI/CD pipelines, GitOps methodologies, and deployment automation.
Experience implementing monitoring, logging, and observability solutions using Prometheus, Grafana, ELK, and related platforms.
Strong understanding of networking, security, distributed systems, and cloud-native architectures.
Experience working with relational and NoSQL databases.
Experience managing and optimizing distributed platforms such as Kafka and Cassandra.

Preferred Qualifications

Experience with hybrid cloud or multi-region deployments.
Background in security engineering, compliance, or cloud security programs.
Contributions to open-source projects.
Knowledge of ITIL/ITSM service management practices.
Cloud platform and Kubernetes certifications.
Experience with platform engineering and developer enablement initiatives.

Core Competencies

Exceptional problem-solving, troubleshooting, and analytical skills.
Strong communication and stakeholder management abilities.
Demonstrated technical leadership and mentoring experience.
Ability to make sound decisions in high-pressure production environments.
Strong sense of ownership, accountability, and customer focus.
Passion for innovation, automation, and continuous improvement.

Day-to-Day Activities

Design and improve scalable, secure, and resilient cloud infrastructure.
Monitor production environments and proactively resolve performance, reliability, and security issues.
Lead incident response, root cause investigations, and reliability improvement initiatives.
Enhance deployment pipelines and infrastructure automation frameworks.
Collaborate with engineering teams to improve application reliability and operational excellence.
Optimize observability platforms, dashboards, and alerting strategies.
Mentor engineers and provide technical guidance across reliability and cloud initiatives.
Participate in on-call rotations while driving improvements that reduce operational burden and incidents.

Work Environment

This is a hybrid position requiring attendance in the San Juan office two days per week. The role offers the opportunity to work on cutting-edge cloud technologies, large-scale distributed systems, and strategic infrastructure initiatives while collaborating with talented teams across the organization.

Job Overview

Posted Date: Jun 12, 2026

Employment Type: Full-time

Experience Level: Mid-Senior level

Location: United States

Category: DevOps

Company: bma group global
