Senior DevOps Engineer

Indeed

Full-time

Onsite

No experience limit

No degree limit

79Q22222+22

Description

Summary: Seeking an experienced DevOps/infrastructure engineer to manage and optimize GPU-enabled Kubernetes and Linux compute infrastructure for AI initiatives, automate workflows, and ensure high performance.

Highlights:

1. Manage GPU-enabled Kubernetes and Linux compute infrastructure for AI
2. Automate workflows with Python and UNIX Shell scripting
3. Support cutting-edge AI and computational workloads

We are supporting client delivery by running GPU-enabled Kubernetes and Linux compute infrastructure optimized for AI initiatives and Volcano-driven scheduling. You will implement automation in Python and UNIX Shell, administer Kubernetes resources such as PVCs, NFS storage, and quotas, and work with researchers to streamline workflows. Apply now.

**Responsibilities**

* Set up and maintain GPU-enabled Kubernetes clusters alongside standalone Linux compute environments with stable scheduling and high performance
* Manage Volcano scheduling workflows, including queue setup, Pod execution, GPU allocation, and enforcement of namespace quotas
* Control Kubernetes administration across namespaces, RBAC, resource quotas, and workload isolation strategies
* Build and support Python and Shell scripts that automate job submission, resource provisioning, and system reporting
* Work with orchestration, optimization, and observability teams to improve scheduling efficiency, utilization, and researcher workflows
* Assess infrastructure health and resource utilization and contribute data for optimization and reporting requirements
* Drive recommendations for improving infrastructure, tooling, and automation workflows to enhance performance, scalability, and usability
* Support operational processes that keep researcher experiences smooth across AI and computational workloads

**Requirements**

* At least 3 years of experience in DevOps or infrastructure engineering roles supporting complex, large-scale environments
* Expert proficiency in Kubernetes administration and orchestration, including management of namespaces, Pod scheduling and distribution, persistent volume claims (PVC), network file systems (NFS), and resource quota management
* Hands-on experience with the Volcano scheduler for GPU job execution, including queue configuration, workload prioritization, and integration with Kubernetes
* Proven experience managing GPU cluster environments, both within Kubernetes and on standalone Linux compute nodes, to support high-performance computing workloads
* Advanced Python scripting skills for automating infrastructure tasks, job submissions, and system reporting
* Proficiency in UNIX Shell scripting (e.g., Bash) for system automation and operational efficiency
* Strong Linux system administration skills, including troubleshooting, performance tuning, and configuration management for compute environments
* Solid understanding of infrastructure automation and orchestration concepts and tooling to enable scalable and reliable operations
* Fluent English communication skills (spoken and written) for direct client interaction and collaboration with cross-functional teams

**Nice to have**

* Experience with Helm package management for deploying and managing Kubernetes applications
* Familiarity with monitoring and observability solutions, particularly Prometheus, Grafana, and Loki, for infrastructure health and performance tracking
* Hands-on experience with Infrastructure as Code tools, such as Terraform, for automated provisioning and management of cloud resources
* Exposure to multi-cloud Kubernetes environments, including Amazon EKS and Google GKE, for broader orchestration experience
* Azure networking skills, including VPN configuration, ExpressRoute setup, and network security management, to support secure and scalable cloud deployments
* Familiarity with AI-assisted coding tools (e.g., GitHub Copilot, ChatGPT, Claude) to enhance development productivity and code quality
* Experience with hybrid (cloud and on-premises) scheduling and resource optimization to support flexible and efficient compute environments

Source: Indeed