




Summary: Seeking a Senior SRE/Observability Engineer to ensure reliability and performance of production Kubernetes-based systems supporting AI research in Azure Stack.

Highlights:

1. Focus on observability, operational support, and driving operational excellence.
2. Build, maintain, and improve observability solutions using tools like Grafana.
3. Collaborate with engineering, platform, and research teams to enhance systems.

We are looking for a **Senior SRE / Observability Engineer** to ensure the reliability and performance of production Kubernetes-based systems supporting AI research within an Azure Stack environment. This position focuses on observability, operational support and collaboration with engineering and research teams to drive operational excellence.

**Responsibilities**

* Build, maintain and improve observability solutions, including dashboards and visualizations using Grafana or similar tools
* Define, implement and manage metrics, SLIs, SLOs and alerting strategies for production systems
* Provide business-hours operational support for Kubernetes-based environments, including troubleshooting, log analysis and metric-driven investigations
* Support and troubleshoot SQL-based systems as part of production operations, assisting with issue analysis and performance investigations
* Analyze incidents and system behaviors to identify root causes, contribute to post-incident reviews and recommend improvements to monitoring and reliability practices
* Collaborate with engineering, platform and research teams to improve observability standards, operational processes and system reliability
* Contribute to documentation, knowledge sharing and continuous improvement within the team

**Requirements**

* 3+ years of experience in Site Reliability Engineering, DevOps or Production Support roles supporting production systems
* Knowledge of observability and monitoring stacks such as Grafana, Prometheus, Elastic Stack or Datadog
* Understanding of Linux systems with strong troubleshooting and log analysis skills
* Background in supporting Kubernetes-based environments in production
* Skills in SQL production support, including query troubleshooting and basic performance analysis
* Proficiency in scripting with Python, Bash or similar languages for automation and operational tasks
* Capability to analyze incidents, identify root causes and contribute to continuous improvement initiatives
* Competency in communication and collaboration with distributed and cross-functional teams
* English proficiency at an intermediate to advanced level


