Lead Site Reliability Engineer

bma group global • United States

Job Description


RELOCATION TO PUERTO RICO IS REQUIRED

This role is based on-site in the vibrant city of San Juan, Puerto Rico, offering the opportunity to work closely with a dynamic team in a highly collaborative environment. The company is committed to supporting top talent and provides a comprehensive relocation package to make your transition seamless.


We are seeking a highly experienced Lead Site Reliability Engineer to lead the design, implementation, and optimization of highly scalable, secure, and resilient cloud infrastructure platforms. This role is responsible for driving reliability engineering practices, infrastructure automation, observability, incident management, and platform scalability across mission-critical systems. As a senior technical leader, you will collaborate closely with software engineering, security, cloud operations, and architecture teams to improve system reliability, operational efficiency, and software delivery. You will serve as a subject matter expert, mentor engineers, and champion Site Reliability Engineering and DevOps best practices throughout the organization.


Key Responsibilities

Reliability Engineering & Infrastructure

  • Design, implement, and maintain highly available, scalable, and fault-tolerant cloud infrastructure.
  • Drive reliability initiatives through automation, reducing operational overhead and manual intervention.
  • Establish and maintain service reliability objectives, performance standards, and operational excellence practices.
  • Lead capacity planning and forecasting to support business growth and ensure cost-effective scalability.


Cloud & Platform Engineering

  • Architect, deploy, and optimize cloud-native solutions across AWS, GCP, and hybrid cloud environments.
  • Enhance Infrastructure as Code (IaC) practices using Terraform, Ansible, Packer, and related technologies.
  • Build and maintain secure, compliant infrastructure aligned with industry security standards and best practices.
  • Develop and maintain Amazon Machine Images (AMIs) and hardened infrastructure configurations following CIS and STIG benchmarks.


Automation & Software Development

  • Develop internal tools, automation frameworks, and platform services using Python, Go, Bash, or similar languages.
  • Improve CI/CD pipelines and deployment workflows using Jenkins, FluxCD, GitOps workflows, and modern delivery practices.
  • Collaborate with software engineering teams to optimize application performance, reliability, and scalability.
  • Automate repetitive operational tasks to improve productivity and system consistency.


Observability & Incident Management

  • Design and enhance monitoring, logging, alerting, and observability solutions using Prometheus, Grafana, ELK, and related technologies.
  • Lead critical incident response efforts, coordinating cross-functional teams to minimize service disruption.
  • Conduct root cause analysis, post-incident reviews, and implement preventive measures to improve system resilience.
  • Continuously improve operational readiness, response times, and service reliability.


Distributed Systems & Performance Optimization

  • Troubleshoot and optimize complex distributed systems and data platforms, including technologies such as Apache Kafka, Cassandra, SQL, and NoSQL databases.
  • Analyze performance bottlenecks and implement solutions to improve scalability, availability, and efficiency.
  • Drive architectural improvements that support high-volume, low-latency workloads.


Technical Leadership & Collaboration

  • Serve as a technical leader and trusted advisor for reliability engineering initiatives.
  • Mentor and guide engineers on SRE, DevOps, cloud infrastructure, automation, and operational excellence practices.
  • Collaborate with development, security, architecture, and operations teams to align infrastructure capabilities with business objectives.
  • Promote a culture of continuous improvement, innovation, accountability, and knowledge sharing.


Required Qualifications

Education

  • Bachelor's degree in Computer Science, Information Technology, Engineering, or a related field; equivalent practical experience considered.
  • Advanced degree preferred.


Experience

  • 6-10+ years of experience in Site Reliability Engineering, DevOps, Infrastructure Engineering, Cloud Engineering, or Software Engineering.
  • Proven experience designing and operating large-scale production environments.
  • Experience leading complex infrastructure initiatives and driving operational improvements.


Technical Expertise

  • Strong experience with Linux systems administration, particularly Debian-based distributions.
  • Expertise in cloud platforms including AWS and GCP; Azure experience is a plus.
  • Advanced knowledge of Infrastructure as Code tools such as Terraform, Ansible, and Packer.
  • Strong programming and automation skills using Python, Go, Bash, or similar languages.
  • Experience with containerization and orchestration technologies including Docker, Kubernetes, Amazon EKS, and Google GKE.
  • Deep understanding of CI/CD pipelines, GitOps methodologies, and deployment automation.
  • Experience implementing monitoring, logging, and observability solutions using Prometheus, Grafana, ELK, and related platforms.
  • Strong understanding of networking, security, distributed systems, and cloud-native architectures.
  • Experience working with relational and NoSQL databases.
  • Experience managing and optimizing distributed platforms such as Kafka and Cassandra.


Preferred Qualifications

  • Experience with hybrid cloud or multi-region deployments.
  • Background in security engineering, compliance, or cloud security programs.
  • Contributions to open-source projects.
  • Knowledge of ITIL/ITSM service management practices.
  • Cloud platform and Kubernetes certifications.
  • Experience with platform engineering and developer enablement initiatives.


Core Competencies

  • Exceptional problem-solving, troubleshooting, and analytical skills.
  • Strong communication and stakeholder management abilities.
  • Demonstrated technical leadership and mentoring experience.
  • Ability to make sound decisions in high-pressure production environments.
  • Strong sense of ownership, accountability, and customer focus.
  • Passion for innovation, automation, and continuous improvement.


Day-to-Day Activities

  • Design and improve scalable, secure, and resilient cloud infrastructure.
  • Monitor production environments and proactively resolve performance, reliability, and security issues.
  • Lead incident response, root cause investigations, and reliability improvement initiatives.
  • Enhance deployment pipelines and infrastructure automation frameworks.
  • Collaborate with engineering teams to improve application reliability and operational excellence.
  • Optimize observability platforms, dashboards, and alerting strategies.
  • Mentor engineers and provide technical guidance across reliability and cloud initiatives.
  • Participate in on-call rotations while driving improvements that reduce operational burden and incidents.


Work Environment

This is a hybrid position requiring attendance in the San Juan office two days per week. The role offers the opportunity to work on cutting-edge cloud technologies, large-scale distributed systems, and strategic infrastructure initiatives while collaborating with talented teams across the organization.

