Senior DevOps Engineer

Indeed

Full-time

Onsite

No experience limit

No degree limit

79Q22222+22


Description

Summary: Seeking a highly skilled Senior DevOps Engineer to implement, automate, and optimize Kubernetes-based orchestration platforms and Linux infrastructure for advanced AI and research initiatives.

Highlights:

1. Client-facing role focused on hands-on delivery and optimization
2. Leverage deep expertise in Kubernetes for AI and research initiatives
3. Develop automation for efficient, reliable, and scalable compute environments

We are seeking a highly skilled **Senior DevOps Engineer** to join EPAM’s delivery team. In this client-facing, delivery-focused role, you will be responsible for the hands-on implementation, automation, and optimization of Kubernetes-based orchestration platforms (including Volcano for GPU-enabled workloads) and the Linux infrastructure supporting advanced AI and research initiatives.

You will leverage deep expertise in Kubernetes administration, workload scheduling, quota management, and automation using Python and Shell scripting to deliver efficient, reliable, and scalable compute environments. You will work closely with other engineers and researchers to ensure a seamless, high-quality infrastructure experience.

**Responsibilities**

* Deploy, configure, and maintain GPU-enabled Kubernetes clusters and standalone Linux compute environments, ensuring optimal workload scheduling and performance
* Implement and manage Volcano job scheduling, including queue setup, Pod execution, GPU allocation, and namespace quota enforcement
* Administer Kubernetes environments end-to-end, including namespaces, RBAC, resource quotas, and workload isolation strategies
* Develop and maintain automation scripts in Python and Shell to streamline job submission, resource provisioning, and system reporting
* Collaborate with orchestration, optimization, and observability teams to improve scheduling efficiency, capacity utilization, and researcher workflows
* Monitor infrastructure health and resource utilization, providing feedback and data to support optimization and reporting requirements
* Identify and recommend improvements to infrastructure, tooling, and automation workflows to enhance performance, scalability, and usability
* Ensure operational processes deliver a seamless and efficient experience for researchers working on diverse AI and computational workloads

**Requirements**

* At least 3 years of experience in DevOps or infrastructure engineering roles supporting complex, large-scale environments
* Expert proficiency in Kubernetes administration and orchestration, including management of namespaces, Pod scheduling and distribution, persistent volume claims (PVC), network file systems (NFS), and resource quota management
* Hands-on experience with Volcano scheduler for GPU job execution, including queue configuration, workload prioritization, and integration with Kubernetes
* Proven experience managing GPU cluster environments, both within Kubernetes and on standalone Linux compute nodes, to support high-performance computing workloads
* Advanced Python scripting skills for automating infrastructure tasks, job submissions, and system reporting
* Proficiency in UNIX Shell scripting (e.g., Bash) for system automation and operational efficiency
* Strong Linux system administration skills, including troubleshooting, performance tuning, and configuration management for compute environments
* Solid understanding of infrastructure automation and orchestration concepts and tooling to enable scalable and reliable operations
* Fluent English communication skills (spoken and written) for direct client interaction and collaboration with cross-functional teams

**Nice to have**

* Experience with Helm package management for deploying and managing Kubernetes applications
* Familiarity with monitoring and observability solutions, particularly Prometheus, Grafana, and Loki, for infrastructure health and performance tracking
* Hands-on experience with Infrastructure as Code tools, such as Terraform, for automated provisioning and management of cloud resources
* Exposure to multi-cloud Kubernetes environments, including Amazon EKS and Google GKE, for broader orchestration experience
* Azure networking skills, including VPN configuration, ExpressRoute setup, and network security management, to support secure and scalable cloud deployments
* Familiarity with AI-assisted coding tools (e.g., GitHub Copilot, ChatGPT, Claude) to enhance development productivity and code quality
* Experience with hybrid (cloud and on-premises) scheduling and resource optimization to support flexible and efficient compute environments
