Senior DevOps Engineer

Indeed

Full-time

Onsite

No experience limit

No degree limit

79Q22222+22

Description

Summary: Seeking an experienced DevOps/infrastructure engineer to build and maintain scalable, GPU-ready Kubernetes and Linux platforms for AI and research teams, focusing on automation, orchestration, and performance.

Highlights:

1. Build scalable, GPU-ready Kubernetes and Linux platforms for AI teams
2. Administer Kubernetes and Volcano for advanced orchestration
3. Develop automation in Python and Shell to enhance operations

We are building scalable, GPU-ready Kubernetes and Linux platforms for client AI and research teams with reliable automation and scheduling. You will run Kubernetes administration, Volcano-based orchestration, and scripting in Python and Shell while partnering with engineers and researchers to improve performance and usability; apply now.

**Responsibilities**

* Deploy, configure, and maintain GPU-enabled Kubernetes clusters and standalone Linux compute environments to maximize scheduling quality and performance
* Implement and operate Volcano job scheduling, covering queue setup, POD execution, GPU allocation, and namespace quota enforcement
* Administer Kubernetes end-to-end, including namespaces, RBAC, resource quotas, and workload isolation strategies
* Develop and maintain Python and Shell automation to simplify job submission, resource provisioning, and system reporting
* Collaborate with orchestration, optimization, and observability teams to boost scheduling efficiency, capacity utilization, and researcher workflows
* Monitor infrastructure health and resource utilization, delivering data and insights for optimization and reporting needs
* Identify and recommend enhancements to infrastructure, tooling, and automation workflows to improve performance, scalability, and usability
* Ensure operational processes provide a smooth and efficient experience for researchers running diverse AI and computational workloads

**Requirements**

* At least 3 years of experience in DevOps or infrastructure engineering for complex, large-scale environments
* Expert proficiency in Kubernetes administration and orchestration, including namespaces, POD scheduling and distribution, persistent volume claims (PVC), network file systems (NFS), and resource quota management
* Hands-on experience with Volcano scheduler for GPU job execution, including queue configuration, workload prioritization, and Kubernetes integration
* Proven experience managing GPU cluster environments in Kubernetes and on standalone Linux compute nodes for high-performance computing workloads
* Advanced Python scripting skills for automating infrastructure tasks, job submissions, and system reporting
* Proficiency in UNIX Shell scripting (e.g., Bash) to automate systems and improve operational efficiency
* Strong Linux system administration skills, including troubleshooting, performance tuning, and configuration management for compute environments
* Solid understanding of infrastructure automation and orchestration concepts and tooling for scalable, reliable operations
* Fluent English communication skills (spoken and written) for direct client interaction and cross-functional collaboration

**Nice to have**

* Experience with Helm package management for deploying and managing Kubernetes applications
* Familiarity with monitoring and observability solutions, particularly Prometheus, Grafana, and Loki, for infrastructure health and performance tracking
* Hands-on experience with Infrastructure as Code tools, such as Terraform, for automated provisioning and management of cloud resources
* Exposure to multi-cloud Kubernetes environments, including Amazon EKS and Google GKE, for broader orchestration experience
* Azure networking skills, including VPN configuration, ExpressRoute setup, and network security management, to support secure and scalable cloud deployments
* Familiarity with AI-assisted coding tools (e.g., GitHub Copilot, ChatGPT, Claude) to enhance development productivity and code quality
* Experience with hybrid (cloud and on-premises) scheduling and resource optimization to support flexible and efficient compute environments

Source: Indeed