Senior DevOps Engineer

Indeed

Full-time

Onsite

No experience limit

No degree limit

Description

Summary: Join a team building scalable, GPU-ready Kubernetes and Linux platforms for AI and research, focusing on administration, orchestration, and automation to improve performance and usability.

Highlights:

1. Build scalable, GPU-ready Kubernetes and Linux platforms for AI teams
2. Administer Kubernetes and implement Volcano for GPU job scheduling
3. Develop Python/Shell automation to enhance system efficiency

We are building scalable, GPU-ready Kubernetes and Linux platforms for client AI and research teams with reliable automation and scheduling. You will run Kubernetes administration, Volcano-based orchestration, and scripting in Python and Shell while partnering with engineers and researchers to improve performance and usability. Apply now.

**Responsibilities**

* Deploy, configure, and maintain GPU-enabled Kubernetes clusters and standalone Linux compute environments to maximize scheduling quality and performance
* Implement and operate Volcano job scheduling, covering queue setup, Pod execution, GPU allocation, and namespace quota enforcement
* Administer Kubernetes end-to-end, including namespaces, RBAC, resource quotas, and workload isolation strategies
* Develop and maintain Python and Shell automation to simplify job submission, resource provisioning, and system reporting
* Collaborate with orchestration, optimization, and observability teams to boost scheduling efficiency, capacity utilization, and researcher workflows
* Monitor infrastructure health and resource utilization, delivering data and insights for optimization and reporting needs
* Identify and recommend enhancements to infrastructure, tooling, and automation workflows to improve performance, scalability, and usability
* Ensure operational processes provide a smooth and efficient experience for researchers running diverse AI and computational workloads

**Requirements**

* At least 3 years of experience in DevOps or infrastructure engineering for complex, large-scale environments
* Expert proficiency in Kubernetes administration and orchestration, including namespaces, Pod scheduling and distribution, persistent volume claims (PVC), network file systems (NFS), and resource quota management
* Hands-on experience with Volcano scheduler for GPU job execution, including queue configuration, workload prioritization, and Kubernetes integration
* Proven experience managing GPU cluster environments in Kubernetes and on standalone Linux compute nodes for high-performance computing workloads
* Advanced Python scripting skills for automating infrastructure tasks, job submissions, and system reporting
* Proficiency in UNIX Shell scripting (e.g., Bash) to automate systems and improve operational efficiency
* Strong Linux system administration skills, including troubleshooting, performance tuning, and configuration management for compute environments
* Solid understanding of infrastructure automation and orchestration concepts and tooling for scalable, reliable operations
* Fluent English communication skills (spoken and written) for direct client interaction and cross-functional collaboration

**Nice to have**

* Experience with Helm package management for deploying and managing Kubernetes applications
* Familiarity with monitoring and observability solutions, particularly Prometheus, Grafana, and Loki, for infrastructure health and performance tracking
* Hands-on experience with Infrastructure as Code tools, such as Terraform, for automated provisioning and management of cloud resources
* Exposure to multi-cloud Kubernetes environments, including Amazon EKS and Google GKE, for broader orchestration experience
* Azure networking skills, including VPN configuration, ExpressRoute setup, and network security management, to support secure and scalable cloud deployments
* Familiarity with AI-assisted coding tools (e.g., GitHub Copilot, ChatGPT, Claude) to enhance development productivity and code quality
* Experience with hybrid (cloud and on-premises) scheduling and resource optimization to support flexible and efficient compute environments

Source: Indeed