




Summary: Seeking a Middle DevOps Engineer to automate and optimize Kubernetes platforms for GPU workloads and Linux infrastructure, partnering with engineers and researchers.

Highlights:

1. Automate and optimize Kubernetes platforms for GPU workloads
2. Implement and support Volcano-based scheduling and cluster operations
3. Deliver reliable, scalable compute environments for AI research

We are looking for a Middle DevOps Engineer to automate and optimize Kubernetes platforms for GPU workloads and the Linux infrastructure behind AI research. You will implement and support Volcano-based scheduling, quotas, and cluster operations using Python and UNIX shell scripting while partnering with engineers and researchers. Apply to help deliver reliable, scalable compute environments.

**Responsibilities**

* Deploy, configure, and support GPU-enabled Kubernetes clusters and standalone Linux compute systems to improve scheduling and overall efficiency
* Administer Volcano scheduling by setting up queues, managing Pods, assigning GPU resources, and enforcing namespace quota controls
* Own Kubernetes platform management across namespaces, RBAC, resource quotas, and workload isolation approaches
* Develop and maintain Python and shell automation to simplify job submission, resource allocation, and infrastructure monitoring
* Collaborate with orchestration, optimization, and observability teams to increase scheduling throughput, resource utilization, and researcher productivity
* Monitor infrastructure health and resource consumption, and share metrics to guide optimization and reporting
* Recommend and deliver improvements to infrastructure, tools, and automation processes to enhance scalability, performance, and user experience
* Support operational routines that give researchers a smooth environment for AI and computational workloads

**Requirements**

* 2+ years of professional experience in DevOps or infrastructure engineering, supporting complex large-scale systems
* Deep expertise in Kubernetes administration and orchestration, including namespaces, Pod scheduling and balancing, PVCs, NFS, and resource quota controls
* Hands-on experience with the Volcano scheduler for GPU job management, covering queue configuration, workload prioritization, and Kubernetes integration
* Proven ability to run GPU cluster environments in Kubernetes and standalone Linux configurations for high-performance computing
* Advanced Python scripting skills for automating infrastructure operations, job handling, and system monitoring
* Practical proficiency in UNIX shell scripting (e.g., Bash) to automate system tasks and streamline operational workflows
* Strong Linux system administration background, including troubleshooting, performance optimization, and configuration management
* Thorough understanding of automation and orchestration tooling and concepts for building scalable, dependable infrastructure
* Excellent English communication skills (spoken and written) for direct client engagement and collaboration across teams

**Nice to have**

* Experience with Helm to package and manage Kubernetes applications
* Knowledge of monitoring and observability tools such as Prometheus, Grafana, and Loki to track health and performance
* Familiarity with Infrastructure as Code tools such as Terraform for automated cloud provisioning and management
* Background with multi-cloud Kubernetes platforms, including Amazon EKS and Google GKE
* Skills in Azure networking, including VPN setup, ExpressRoute configuration, and network security
* Experience using AI coding assistants (GitHub Copilot, ChatGPT, Claude) to improve development speed and code quality
* Understanding of hybrid scheduling and resource optimization across cloud and on-premises environments
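To give a flavor of the Python automation this role involves, here is a minimal, hypothetical sketch (not part of the role description): a helper that parses Kubernetes resource quantity strings such as `500m` or `4Gi` and checks a set of resource requests against a namespace quota — the kind of glue code used when scripting job submission and resource allocation. Production tooling would typically build on the official Kubernetes Python client (which ships its own `kubernetes.utils.parse_quantity`); the names `parse_quantity` and `fits_quota` below are illustrative.

```python
# Hypothetical quota-automation helper (illustrative sketch, stdlib only).

# Suffixes from the Kubernetes resource model: binary (Ki, Mi, ...)
# for memory, decimal (m, k, ...) for CPU and generic quantities.
_BINARY = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}
_DECIMAL = {"m": 10**-3, "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12}


def parse_quantity(value: str) -> float:
    """Convert a Kubernetes quantity string to a plain number."""
    # Check two-letter binary suffixes before one-letter decimal ones,
    # so "4Gi" matches "Gi" rather than falling through.
    for suffix, factor in {**_BINARY, **_DECIMAL}.items():
        if value.endswith(suffix):
            return float(value[: -len(suffix)]) * factor
    return float(value)  # bare number, e.g. a GPU count like "2"


def fits_quota(requests: dict, quota: dict) -> bool:
    """True if every requested resource stays within the namespace quota."""
    return all(
        parse_quantity(requests[res]) <= parse_quantity(quota.get(res, "0"))
        for res in requests
    )


if __name__ == "__main__":
    quota = {"cpu": "16", "memory": "64Gi", "nvidia.com/gpu": "4"}
    request = {"cpu": "8", "memory": "32Gi", "nvidia.com/gpu": "2"}
    print(fits_quota(request, quota))  # True: the request fits
```

A script like this could gate job submission before handing work to the Volcano scheduler, rejecting requests that would exceed the queue's namespace quota.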


