Senior DevOps Engineer

Indeed

Full-time

Onsite

No experience limit

No degree limit

Description

Summary: Seeking a Senior DevOps Engineer to standardize automation and scheduling performance by administering Kubernetes with Volcano, managing quotas, and automating operations for advanced AI and research work.

Highlights:

1. Strengthen GPU-capable orchestration on Kubernetes and Linux.
2. Automate operations using Python and Bash to support advanced AI and research.
3. Drive continuous improvements to infrastructure, tooling, and automation.

We are strengthening GPU-capable orchestration on Kubernetes and Linux, and need a Senior DevOps Engineer to standardize automation and scheduling performance. You will administer Kubernetes with Volcano, manage quotas and isolation, and automate operations using Python and Bash to support advanced AI and research work.

Send your application to get started.

**Responsibilities**

* Provision, configure, and support GPU-enabled Kubernetes clusters and standalone Linux compute environments to keep scheduling and performance at peak
* Operate Volcano job scheduling, handling queue setup, POD execution, GPU allocation, and namespace quota enforcement
* Own Kubernetes administration end-to-end, including namespaces, RBAC, resource quotas, and workload isolation strategies
* Automate job submission, resource provisioning, and reporting through Python and Shell scripting maintained over time
* Coordinate with orchestration, optimization, and observability teams to enhance scheduling efficiency, capacity utilization, and researcher workflows
* Observe infrastructure health and resource consumption, and share data for optimization and reporting requirements
* Drive continuous improvements to infrastructure, tooling, and automation workflows to boost performance, scalability, and usability
* Support operational processes that ensure researchers have an efficient experience across diverse AI and computational workloads

**Requirements**

* 3+ years of DevOps or infrastructure engineering experience in large, complex environments
* Expert proficiency administering Kubernetes, including namespaces, POD scheduling/distribution, PVC, NFS, and resource quota management
* Hands-on background with Volcano scheduler for GPU jobs, including queue setup and workload prioritization with Kubernetes integration
* Track record of managing GPU cluster environments both in Kubernetes and on standalone Linux compute nodes
* Advanced capability with Python for infrastructure automation and solid UNIX Shell scripting such as Bash
* Strong Linux system administration skills with troubleshooting, performance tuning, and configuration management experience
* Solid understanding of infrastructure automation and orchestration concepts and the tools used to implement them
* Fluent English communication skills (spoken and written) to support direct client collaboration

**Nice to have**

* Helm knowledge for packaging and managing Kubernetes applications
* Experience with monitoring and observability stacks, especially Prometheus, Grafana and Loki
* Familiarity with Infrastructure as Code, including Terraform
* Exposure to multi-cloud Kubernetes environments such as Amazon EKS and Google GKE
* Understanding of Azure Networking, including VPN, ExpressRoute and network security
* Experience using AI-assisted coding tools like GitHub Copilot, ChatGPT and Claude
* Knowledge of hybrid (cloud and on-premises) scheduling and resource optimization approaches