Senior DevOps Engineer

Indeed

Full-time

Onsite

No experience limit

No degree limit

79Q22222+22

Description

Summary: Seeking an experienced DevOps/infrastructure engineer to strengthen a client-facing delivery team operating Kubernetes and Linux compute stacks for advanced AI workloads, automating operations with Python/UNIX Shell, and partnering with researchers.

Highlights:

1. Strengthen client-facing team for advanced AI workloads
2. Automate operations with Python and UNIX Shell
3. Partner with researchers to optimize platforms

We are strengthening a client-facing delivery team that operates Kubernetes and Linux compute stacks for advanced AI workloads, including GPU scheduling with Volcano. You will automate day-to-day operations with Python and UNIX Shell, manage namespaces, RBAC, and quotas, and partner with researchers to keep platforms fast and dependable; apply now.

**Responsibilities**

* Deliver and support GPU-enabled Kubernetes clusters plus standalone Linux compute environments with strong scheduling behavior and throughput
* Run Volcano scheduling operations, including queue setup, POD execution, GPU allocation, and enforcement of namespace quotas
* Own Kubernetes administration across namespaces, RBAC, resource quotas, and workload isolation strategies
* Create and evolve Python and Shell scripts that automate job submission, resource provisioning, and system reporting
* Partner with orchestration, optimization, and observability teams to improve scheduling efficiency, utilization, and researcher workflows
* Track infrastructure health and resource utilization and provide input for optimization and reporting requirements
* Propose and drive improvements to infrastructure, tooling, and automation workflows to raise performance, scalability, and usability
* Support operational processes that ensure researchers have an efficient experience across AI and computational workloads

**Requirements**

* At least 3 years of experience in DevOps or infrastructure engineering roles supporting complex, large-scale environments
* Expert proficiency in Kubernetes administration and orchestration, including management of namespaces, POD scheduling and distribution, persistent volume claims (PVC), network file systems (NFS), and resource quota management
* Hands-on experience with Volcano scheduler for GPU job execution, including queue configuration, workload prioritization, and integration with Kubernetes
* Proven experience managing GPU cluster environments, both within Kubernetes and on standalone Linux compute nodes, to support high-performance computing workloads
* Advanced Python scripting skills for automating infrastructure tasks, job submissions, and system reporting
* Proficiency in UNIX Shell scripting (e.g., Bash) for system automation and operational efficiency
* Strong Linux system administration skills, including troubleshooting, performance tuning, and configuration management for compute environments
* Solid understanding of infrastructure automation and orchestration concepts and tooling to enable scalable and reliable operations
* Fluent English communication skills (spoken and written) for direct client interaction and collaboration with cross-functional teams

**Nice to have**

* Experience with Helm package management for deploying and managing Kubernetes applications
* Familiarity with monitoring and observability solutions, particularly Prometheus, Grafana, and Loki, for infrastructure health and performance tracking
* Hands-on experience with Infrastructure as Code tools, such as Terraform, for automated provisioning and management of cloud resources
* Exposure to multi-cloud Kubernetes environments, including Amazon EKS and Google GKE, for broader orchestration experience
* Azure networking skills, including VPN configuration, ExpressRoute setup, and network security management, to support secure and scalable cloud deployments
* Familiarity with AI-assisted coding tools (e.g., GitHub Copilot, ChatGPT, Claude) to enhance development productivity and code quality
* Experience with hybrid (cloud and on-premises) scheduling and resource optimization to support flexible and efficient compute environments

Source: Indeed