Senior DevOps Engineer

Indeed
Full-time
Onsite
No experience limit
No degree limit
79Q22222+22

Description

Summary: Seeking a Senior DevOps Engineer to support Kubernetes-based systems for an AI research-focused tech company, emphasizing SRE, observability, and operational excellence.

Highlights:

1. Support production Kubernetes-based systems powering AI research
2. Combine SRE, observability, and SQL production support responsibilities
3. Work closely with engineering and research teams to ensure system reliability

We are seeking a **Senior DevOps Engineer** to support production Kubernetes-based systems for a large tech company focused on infrastructure that powers AI research. This role combines site reliability engineering, observability and SQL production support responsibilities, with a strong emphasis on monitoring, metrics, dashboards and operational excellence. The ideal candidate will work closely with existing engineering and research teams to ensure system reliability, troubleshoot production issues and continuously improve visibility into system health and performance within an Azure Stack environment.

**Responsibilities**

* Build, maintain and continuously enhance observability solutions, including dashboards and visualizations using Grafana or similar monitoring tools
* Define, implement and manage metrics, SLIs, SLOs and alerting strategies to ensure reliability and visibility across production systems
* Provide business-hours operational support for Kubernetes-based production environments, covering basic troubleshooting, log analysis and metric-driven investigations
* Support and troubleshoot SQL-based systems as part of production operations, assisting with issue analysis and performance investigations
* Analyze incidents and system behaviors to identify root causes, contribute to post-incident reviews and recommend improvements to monitoring and reliability practices
* Collaborate closely with engineering, platform and research teams to improve observability standards, operational processes and overall system reliability
* Contribute to documentation, knowledge sharing and continuous improvement initiatives within the team

**Requirements**

* A minimum of 3 years of relevant professional experience
* Proven background in Site Reliability Engineering (SRE), DevOps, Production Support or similar roles supporting production systems
* Hands-on experience with observability and monitoring stacks such as Grafana, Prometheus, Elastic Stack, Datadog or equivalent tools
* Solid understanding of Linux systems, combined with strong troubleshooting and log analysis skills
* Practical experience supporting Kubernetes-based environments in production
* Experience providing SQL production support, including query troubleshooting and basic performance analysis
* Proficiency in scripting with Python, Bash or similar languages for automation and operational tasks
* Ability to analyze incidents, uncover root causes and contribute to continuous improvement initiatives
* Strong communication and collaboration skills to work effectively with distributed and cross-functional teams
* Excellent oral and written communication skills in English at a B2+ level or higher

**Nice to have**

* Experience working with APIs and integration patterns to connect services and support system interoperability
* Familiarity with databases, including administration, optimization and production-level support
* Background in Infrastructure as Code development and maintenance for automating the provisioning and configuration of environments
* Hands-on experience with Microsoft Azure for managing cloud resources and deploying production workloads

Source: Indeed
Sofía González
Indeed · HR

© 2025 Servanan International Pte. Ltd.