We are seeking a highly skilled and experienced Monitoring and Observability Engineer to join our team. This role involves designing, implementing, and managing comprehensive monitoring solutions using Prometheus, Grafana, SNMP-Exporter, streaming telemetry, OpenTelemetry, and related technologies. The ideal candidate will have a strong background in time-series databases, network monitoring, and dashboard development, with a focus on ensuring the reliability and performance of our infrastructure and applications.
Responsibilities
Design, implement, and manage Prometheus-based monitoring solutions, including configurations and alert rules.
Develop and maintain interactive and visually appealing Grafana dashboards.
Configure SNMP modules and scrape jobs to collect SNMP metrics efficiently across different network technologies.
Use Git confidently in day-to-day work: clone repositories, develop on working branches, and commit or merge changes into the main branch, or follow an equivalent workflow that demonstrates a strong command of Git.
Identify and onboard new metrics from various systems and applications, developing data pipelines for metrics collection and storage.
Optimize and scale monitoring environments to handle large volumes of metrics and ensure comprehensive monitoring coverage.
Skills
Familiarity with network monitoring tools and practices.
Extensive experience with Prometheus and related technologies (Alertmanager, Pushgateway, etc.).
Strong knowledge of time-series databases and monitoring concepts.
Proficiency in writing Prometheus queries (PromQL).
Strong experience with Grafana and its ecosystem.
Proficiency in creating and managing Grafana dashboards and panels.
Knowledge of data visualization principles and best practices.
Familiarity with monitoring and observability tools and practices.
Strong knowledge of SNMP protocols and network device management.
Experience with SNMP-Exporter and its integration with Prometheus.
Strong skills in creating SNMP modules and scrape configurations for various network technologies.
Strong Git experience.
Strong understanding of metrics and monitoring concepts.
Experience with metrics collection tools (Prometheus, Telegraf, Collectd, etc.).
Experience with streaming telemetry solutions for real-time monitoring.
Experience with OpenTelemetry for tracing and observability.
Familiarity with Linux/Unix systems and scripting languages (Bash, Python).
Experience with containerization and orchestration tools (Docker, Kubernetes).