




We are seeking a Middle DevOps Engineer to deliver Kubernetes and Linux automation for GPU-enabled platforms supporting advanced AI and research workloads. You will run Volcano scheduling, manage quotas and isolation, and build Python and UNIX shell scripting tooling to streamline operations in a client-facing team. Apply today to help scale reliable compute environments.

**Responsibilities**

* Configure and support GPU-enabled Kubernetes clusters and standalone Linux compute systems to improve workload scheduling and overall efficiency
* Coordinate Volcano job scheduling by managing queues, Pods, GPU allocations, and namespace quota controls
* Administer Kubernetes foundations including namespaces, RBAC, resource quotas, and workload isolation strategies
* Create and maintain Python and shell scripts to automate job submission, resource allocation, and monitoring activities
* Coordinate with orchestration, optimization, and observability teams to raise scheduling performance, utilization, and researcher productivity
* Measure infrastructure health and resource consumption, and provide data for reporting and optimization decisions
* Deliver enhancements to infrastructure, tools, and automation processes to increase scalability, performance, and user satisfaction
* Provide operational support that ensures researchers have a smooth environment for AI and computational work

**Requirements**

* 2+ years of experience in DevOps or infrastructure engineering roles managing complex, large-scale systems
* In-depth Kubernetes administration skills across namespaces, Pod scheduling and balancing, PVCs, NFS, and resource quota controls
* Experience with Volcano for GPU job scheduling, including queue setup, prioritization, and Kubernetes integration
* Track record operating GPU cluster environments in Kubernetes and standalone Linux for high-performance computing
* Advanced Python scripting capability for infrastructure automation, job handling, and system monitoring
* Proficiency with UNIX shell scripting (including Bash) to automate tasks and enhance operational workflows
* Strong Linux system administration knowledge for troubleshooting, performance optimization, and configuration management
* Thorough grasp of automation and orchestration tools and practices to support scalable, dependable infrastructure
* Excellent English communication skills (spoken and written) for client work and collaboration with cross-functional teams

**Nice to have**

* Helm knowledge for managing Kubernetes application packaging and configuration
* Experience with Prometheus, Grafana, and Loki for monitoring and observability
* Familiarity with Terraform for Infrastructure as Code provisioning and management
* Exposure to Amazon EKS and Google GKE for multi-cloud Kubernetes orchestration
* Azure networking experience with VPN, ExpressRoute, and network security practices
* Experience leveraging GitHub Copilot, ChatGPT, or Claude to improve development efficiency and code quality
* Understanding of hybrid scheduling and resource optimization across cloud and on-premises platforms
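To give candidates a feel for the automation described above, here is a minimal sketch of Python tooling that builds a Volcano Job manifest (the `batch.volcano.sh/v1alpha1` API) with a queue assignment and GPU request, and submits it via `kubectl`. The job name, queue, namespace, and container image below are hypothetical placeholders, and the sketch assumes a cluster with the Volcano scheduler and NVIDIA device plugin installed.

```python
import json
import subprocess


def volcano_job_manifest(name, queue, namespace, image, gpus=1):
    """Build a minimal Volcano Job manifest requesting `gpus` GPUs on one Pod."""
    return {
        "apiVersion": "batch.volcano.sh/v1alpha1",
        "kind": "Job",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "queue": queue,          # Volcano queue the job is scheduled through
            "minAvailable": 1,       # gang-scheduling threshold: Pods needed before start
            "tasks": [
                {
                    "replicas": 1,
                    "name": "worker",
                    "template": {
                        "spec": {
                            "restartPolicy": "Never",
                            "containers": [
                                {
                                    "name": "worker",
                                    "image": image,
                                    # GPU request counts against the namespace quota
                                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                                }
                            ],
                        }
                    },
                }
            ],
        },
    }


def submit(manifest):
    """Pipe the manifest to `kubectl apply`; assumes kubectl is configured."""
    subprocess.run(
        ["kubectl", "apply", "-f", "-"],
        input=json.dumps(manifest).encode(),
        check=True,
    )


if __name__ == "__main__":
    # Illustrative only: print the manifest rather than submitting it.
    job = volcano_job_manifest("train-demo", "research", "ml-team",
                               "pytorch/pytorch:latest", gpus=2)
    print(json.dumps(job, indent=2))
```

In practice, a script like this would sit behind a CLI or CI hook so researchers submit jobs without writing YAML by hand, with the queue and namespace chosen to enforce the team's quota and isolation policy.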


