We are seeking a highly experienced Lead Site Reliability Engineer to lead the design, implementation, and optimization of highly scalable, secure, and resilient cloud infrastructure platforms. This role is responsible for driving reliability engineering practices, infrastructure automation, observability, incident management, and platform scalability across mission-critical systems. The ideal candidate will have 6-10+ years of experience in Site Reliability Engineering, DevOps, Infrastructure Engineering, Cloud Engineering, or Software Engineering.
Job Description
RELOCATION TO PUERTO RICO IS REQUIRED
This role is based on-site in the vibrant city of San Juan, Puerto Rico, offering the opportunity to work closely with a dynamic team in a highly collaborative environment. The company is committed to supporting top talent and provides a comprehensive relocation package to ease your transition.
We are seeking a highly experienced Lead Site Reliability Engineer to lead the design, implementation, and optimization of highly scalable, secure, and resilient cloud infrastructure platforms. This role is responsible for driving reliability engineering practices, infrastructure automation, observability, incident management, and platform scalability across mission-critical systems. As a senior technical leader, you will collaborate closely with software engineering, security, cloud operations, and architecture teams to improve system reliability, operational efficiency, and software delivery. You will serve as a subject matter expert, mentor engineers, and champion Site Reliability Engineering and DevOps best practices throughout the organization.
Key Responsibilities
Reliability Engineering & Infrastructure
- Design, implement, and maintain highly available, scalable, and fault-tolerant cloud infrastructure.
- Drive reliability initiatives through automation, reducing operational overhead and manual intervention.
- Establish and maintain service reliability objectives, performance standards, and operational excellence practices.
- Lead capacity planning and forecasting to support business growth and ensure cost-effective scalability.
Cloud & Platform Engineering
- Architect, deploy, and optimize cloud-native solutions across AWS, GCP, and hybrid cloud environments.
- Enhance Infrastructure as Code (IaC) practices using Terraform, Ansible, Packer, and related technologies.
- Build and maintain secure, compliant infrastructure aligned with industry security standards and best practices.
- Develop and maintain Amazon Machine Images (AMIs) and hardened infrastructure configurations following CIS and STIG benchmarks.
Automation & Software Development
- Develop internal tools, automation frameworks, and platform services using Python, Go, Bash, or similar languages.
- Improve CI/CD pipelines and deployment workflows using Jenkins, GitOps, FluxCD, and modern delivery practices.
- Collaborate with software engineering teams to optimize application performance, reliability, and scalability.
- Automate repetitive operational tasks to improve productivity and system consistency.
Observability & Incident Management
- Design and enhance monitoring, logging, alerting, and observability solutions using Prometheus, Grafana, ELK, and related technologies.
- Lead critical incident response efforts, coordinating cross-functional teams to minimize service disruption.
- Conduct root cause analysis, post-incident reviews, and implement preventive measures to improve system resilience.
- Continuously improve operational readiness, response times, and service reliability.
Distributed Systems & Performance Optimization
- Troubleshoot and optimize complex distributed systems and data platforms, including technologies such as Apache Kafka, Cassandra, SQL, and NoSQL databases.
- Analyze performance bottlenecks and implement solutions to improve scalability, availability, and efficiency.
- Drive architectural improvements that support high-volume, low-latency workloads.
Technical Leadership & Collaboration
- Serve as a technical leader and trusted advisor for reliability engineering initiatives.
- Mentor and guide engineers on SRE, DevOps, cloud infrastructure, automation, and operational excellence practices.
- Collaborate with development, security, architecture, and operations teams to align infrastructure capabilities with business objectives.
- Promote a culture of continuous improvement, innovation, accountability, and knowledge sharing.
Required Qualifications
Education
- Bachelor's degree in Computer Science, Information Technology, Engineering, or a related field; equivalent practical experience considered.
- Advanced degree preferred.
Experience
- 6-10+ years of experience in Site Reliability Engineering, DevOps, Infrastructure Engineering, Cloud Engineering, or Software Engineering.
- Proven experience designing and operating large-scale production environments.
- Experience leading complex infrastructure initiatives and driving operational improvements.
Technical Expertise
- Strong experience with Linux systems administration, particularly Debian-based distributions.
- Expertise in cloud platforms including AWS and GCP; Azure experience is a plus.
- Advanced knowledge of Infrastructure as Code tools such as Terraform, Ansible, and Packer.
- Strong programming and automation skills using Python, Go, Bash, or similar languages.
- Experience with containerization and orchestration technologies including Docker, Kubernetes, Amazon EKS, and Google Kubernetes Engine (GKE).
- Deep understanding of CI/CD pipelines, GitOps methodologies, and deployment automation.
- Experience implementing monitoring, logging, and observability solutions using Prometheus, Grafana, ELK, and related platforms.
- Strong understanding of networking, security, distributed systems, and cloud-native architectures.
- Experience working with relational and NoSQL databases.
- Experience managing and optimizing distributed platforms such as Kafka and Cassandra.
Preferred Qualifications
- Experience with hybrid cloud or multi-region deployments.
- Background in security engineering, compliance, or cloud security programs.
- Contributions to open-source projects.
- Knowledge of ITIL/ITSM service management practices.
- Cloud platform and Kubernetes certifications.
- Experience with platform engineering and developer enablement initiatives.
Core Competencies
- Exceptional problem-solving, troubleshooting, and analytical skills.
- Strong communication and stakeholder management abilities.
- Demonstrated technical leadership and mentoring experience.
- Ability to make sound decisions in high-pressure production environments.
- Strong sense of ownership, accountability, and customer focus.
- Passion for innovation, automation, and continuous improvement.
Day-to-Day Activities
- Design and improve scalable, secure, and resilient cloud infrastructure.
- Monitor production environments and proactively resolve performance, reliability, and security issues.
- Lead incident response, root cause investigations, and reliability improvement initiatives.
- Enhance deployment pipelines and infrastructure automation frameworks.
- Collaborate with engineering teams to improve application reliability and operational excellence.
- Optimize observability platforms, dashboards, and alerting strategies.
- Mentor engineers and provide technical guidance across reliability and cloud initiatives.
- Participate in on-call rotations while driving improvements that reduce operational burden and incidents.
Work Environment
This is a hybrid position requiring attendance in the San Juan office two days per week. The role offers the opportunity to work on cutting-edge cloud technologies, large-scale distributed systems, and strategic infrastructure initiatives while collaborating with talented teams across the organization.