




Summary: Seeking a Middle DevOps Engineer to implement GPU Kubernetes orchestration using Volcano and maintain Linux infrastructure for AI and research, automating operations with Python and UNIX shell scripting.

Highlights:

1. Implement GPU Kubernetes orchestration for AI and research initiatives
2. Automate operations with Python and UNIX shell scripting
3. Collaborate on optimizing scheduling and reliability of compute platforms

We are expanding with a Middle DevOps Engineer to implement GPU Kubernetes orchestration using Volcano and maintain Linux infrastructure for AI and research initiatives. You will automate operations with Python and UNIX shell scripting, manage namespaces, RBAC, and quotas, and collaborate with delivery teams to optimize scheduling and reliability. Apply to join us and help improve scalable compute platforms.

**Responsibilities**

* Set up and support GPU-enabled Kubernetes clusters and standalone Linux compute systems to maximize scheduling quality and system efficiency
* Manage Volcano scheduling operations by creating queues, managing Pod behavior, assigning GPU resources, and enforcing namespace quota controls
* Run Kubernetes environments by handling namespaces, RBAC, resource quotas, and workload isolation strategies
* Build and maintain Python and shell scripts that automate job submission, resource allocation, and monitoring routines
* Work with orchestration, optimization, and observability teams to enhance scheduling performance, utilization, and researcher productivity
* Track infrastructure status and resource usage, then share insights that support reporting and optimization
* Propose and implement improvements to infrastructure, tools, and automation to boost scalability, performance, and user experience
* Support operational workflows to ensure researchers have an effective environment for AI and computational projects

**Requirements**

* 2+ years of experience in DevOps or infrastructure engineering focused on complex, large-scale systems
* Deep understanding of Kubernetes administration, including namespaces, Pod scheduling and balancing, PVC, NFS, and resource quota controls
* Proven experience with the Volcano scheduler for GPU jobs, including queue setup, workload prioritization, and Kubernetes integration
* Ability to operate GPU cluster environments in Kubernetes as well as standalone Linux for high-performance computing
* Advanced Python scripting experience for infrastructure automation, job orchestration, and monitoring
* Proficiency in UNIX shell scripting (such as Bash) to automate system tasks and optimize workflows
* Strong Linux administration expertise, covering troubleshooting, performance tuning, and configuration management
* Solid knowledge of automation and orchestration principles and tools to support scalable, reliable infrastructure
* Excellent English communication skills (spoken and written) for client-facing delivery and cross-team collaboration

**Nice to have**

* Helm experience for Kubernetes app packaging and lifecycle management
* Knowledge of Prometheus, Grafana, and Loki for monitoring and observability
* Terraform familiarity for Infrastructure as Code automation and cloud provisioning
* Experience with Amazon EKS and Google GKE across multi-cloud Kubernetes environments
* Azure networking skills including VPN, ExpressRoute, and cloud network security
* Experience with AI coding assistants like GitHub Copilot, ChatGPT, and Claude
* Understanding of hybrid scheduling and resource optimization across on-premises and cloud compute


