Senior SRE with Observability Expertise

Omnissa Bulgaria
Remote
AI Summary

We're seeking a Senior SRE with deep observability expertise to maintain the reliability, performance, and operational integrity of our platforms. The role includes designing, deploying, and maintaining Loki, Grafana, Prometheus, and observability pipelines. You'll work across planned and unplanned workstreams with engineering, incident management, and service owners.

Key Highlights
Design, deploy, and maintain Loki, Grafana, Prometheus, and observability pipelines
Drive reliability through capacity planning, performance optimization, SLIs/SLOs, and root cause analysis
Participate in the global on-call rotation and manage incidents and outages
Technical Skills Required
Grafana, Loki, Prometheus, Ansible, Linux, Kubernetes, CI/CD, Infrastructure as Code
Benefits & Perks
Competitive salary
Flexible work arrangement
Opportunity to work with a rapidly growing company

Job Description


We Are Omnissa!

Omnissa is the first AI-driven digital work platform, built to support flexible, secure, work-from-anywhere experiences. We integrate industry-leading solutions—including Unified Endpoint Management, Virtual Apps and Desktops, Digital Employee Experience, and Security & Compliance—into a seamless, autonomous workspace that adapts to how people work. Our platform boosts employee engagement while optimizing IT operations, security, and cost.

Guided by our Core Values—Act in Alignment, Build Trust, Foster Inclusiveness, Drive Efficiency, and Maximize Customer Value—we’re growing rapidly and committed to delivering meaningful impact. If you’re passionate about shaping the future of work, we’d love to hear from you.

The Team

Our internal Platform Engineering team architects and operates Omnissa's enterprise-grade infrastructure. Our environment includes:

  • Core platforms: VMware Cloud Foundation, Apache CloudStack, Proxmox, Kubernetes, and S3-compatible object storage
  • Observability: Prometheus, Grafana, Loki, and Ansible
  • AI-driven automation: An internal incident diagnosis platform built on Ollama, n8n, and MCP servers to reduce MTTD and MTTR

The Role

We're seeking an SRE with deep observability expertise (Grafana, Loki, Prometheus, automation, and scripting) to maintain the reliability, performance, and operational integrity of our platforms. You'll work across planned and unplanned workstreams with engineering, incident management, and service owners. The role includes an on-call rotation covering nights and weekends.

Key Responsibilities

  • Design, deploy, and maintain Loki, Grafana, Prometheus, and observability pipelines; expand logging, metrics, and tracing coverage
  • Build and refine automation and AI workflows for incident analysis and auto-remediation
  • Drive reliability through capacity planning, performance optimization, SLIs/SLOs, and root cause analysis
  • Participate in the global on-call rotation; manage incidents and outages and lead post-mortem reviews
  • Use Atlassian tools (Jira, Confluence, Opsgenie) for task, change, and incident management
  • Operate and improve internal clouds (vCF, CloudStack, Proxmox), Kubernetes clusters, and S3-compatible storage

Required Skills

  • Hands-on expertise with Grafana, Loki, Tempo (or similar tracing), and Prometheus
  • At least one scripting/programming language
  • Configuration management tools (Ansible, SaltStack)
  • Strong Linux skills and experience operating large-scale, highly available distributed systems
  • Familiarity with Kubernetes, CI/CD, and Infrastructure as Code
  • Comfortable with on-call participation and incident leadership
  • Experience with Atlassian tools; proficiency in Linux and Windows

Nice to Have

  • Exposure to Ollama, n8n, or similar AI orchestration tooling
  • Experience with S3/open-source object stores (SeaweedFS, Ceph)
  • Knowledge of virtualization stacks (Proxmox, vSphere/vCF, CloudStack)
  • Background in SRE culture, including SLIs/SLOs and error budgeting

Omnissa is committed to building a workforce that reflects the communities we serve across the globe. We believe this brings unique perspectives, experiences, and ideas, which are essential for driving innovation and achieving business success. We hire based on merit and provide equal opportunity for all.
