




Summary: Seeking a Middle DevOps Engineer to optimize and automate GPU-enabled orchestration with Kubernetes and Volcano for AI and research workloads in a client-facing environment.

Highlights:

1. Focus on reliable Kubernetes and Linux platforms for AI and research workloads
2. Automate and optimize GPU-enabled orchestration with Kubernetes and Volcano
3. Build efficient, scalable compute environments

We are expanding our delivery team with a **Middle DevOps Engineer** focused on reliable Kubernetes and Linux platforms for AI and research workloads. You will help automate and optimize GPU-enabled orchestration with Kubernetes and Volcano, supporting scheduling, quotas, and scripting in Python and Shell in a client-facing environment. Apply to help build efficient, scalable compute environments.

**Responsibilities**

* Deploy and operate GPU-enabled Kubernetes clusters and standalone Linux compute environments to keep scheduling and performance efficient
* Implement and support Volcano job scheduling, including queue setup, Pod execution, GPU allocation, and namespace quota enforcement
* Administer Kubernetes environments end-to-end, covering namespaces, RBAC, resource quotas, and workload isolation approaches
* Build and maintain Python and Shell automation to simplify job submission, resource provisioning, and system reporting
* Collaborate with orchestration, optimization, and observability teams to improve scheduling efficiency, capacity utilization, and researcher workflows
* Monitor platform health and resource usage, sharing data and feedback to meet optimization and reporting needs
* Recommend improvements to infrastructure, tooling, and automation workflows to boost performance, scalability, and usability
* Ensure operations provide a smooth and effective experience for researchers running diverse AI and computational workloads

**Requirements**

* 2+ years of hands-on experience in DevOps or infrastructure engineering roles supporting complex, large-scale environments
* Expert-level knowledge of Kubernetes administration and orchestration, including namespaces, Pod scheduling and distribution, PVC, NFS, and resource quota management
* Practical experience with the Volcano scheduler for GPU job execution, queue configuration, workload prioritization, and Kubernetes integration
* Proven background managing GPU cluster environments in Kubernetes and on standalone Linux compute nodes
* Advanced scripting skills in Python for infrastructure automation, plus proficiency with UNIX Shell scripting (e.g., Bash)
* Strong Linux system administration capability, including troubleshooting, performance tuning, and configuration management
* Solid understanding of infrastructure automation and orchestration concepts and related tooling
* Fluent English communication skills (spoken and written) for direct client interaction

**Nice to have**

* Helm for Kubernetes application package management
* Monitoring and observability tooling, especially Prometheus, Grafana, and Loki
* Infrastructure as Code tools such as Terraform
* Multi-cloud Kubernetes exposure, including Amazon EKS and Google GKE
* Azure Networking knowledge, including VPN, ExpressRoute, and network security
* Familiarity with AI-assisted coding tools (e.g., GitHub Copilot, ChatGPT, Claude)
* Experience with hybrid (cloud + on-premises) scheduling and resource optimization


