DevOps Engineer
Indeed
Full-time
Onsite
No experience limit
No degree limit
79Q22222+22
Description

Summary: Seeking a highly skilled DevOps Engineer for a client-facing role to implement, automate, and optimize Kubernetes-based orchestration platforms and Linux infrastructure for advanced AI.

Highlights:

1. Hands-on implementation and optimization of Kubernetes-based platforms
2. Leverage deep expertise in Kubernetes administration and automation
3. Support advanced AI and research initiatives

We are seeking a highly skilled **DevOps Engineer** to join EPAM’s delivery team. In this client-facing, delivery-focused role, you will be responsible for the hands-on implementation, automation, and optimization of Kubernetes-based orchestration platforms (including Volcano for GPU-enabled workloads) and the Linux infrastructure supporting advanced AI and research initiatives. You will leverage deep expertise in Kubernetes administration, workload scheduling, quota management, and automation using Python and Shell scripting to deliver efficient, reliable, and scalable compute environments. You will work closely with other engineers and researchers to ensure a seamless, high-quality infrastructure experience.

**Responsibilities**

* Set up, configure, and support GPU-enabled Kubernetes clusters and independent Linux compute systems to maximize workload scheduling and system efficiency
* Oversee Volcano job scheduling, handling queue creation, Pod management, GPU resource assignment, and namespace quota controls
* Manage all aspects of Kubernetes environments, including namespaces, RBAC, resource quotas, and strategies for workload isolation
* Write and maintain automation scripts in Python and Shell to simplify job submissions, resource allocation, and system monitoring
* Work alongside orchestration, optimization, and observability teams to boost scheduling performance, resource usage, and researcher productivity
* Track infrastructure status and resource consumption, sharing insights and data to drive optimization and reporting
* Propose and implement enhancements to infrastructure, tools, and automation processes to improve scalability, performance, and user experience
* Support operational workflows that provide researchers with a smooth and effective environment for AI and computational projects

**Requirements**

* Minimum of 2 years in DevOps or infrastructure engineering roles managing complex, large-scale systems
* Deep knowledge of Kubernetes administration and orchestration, covering namespaces, Pod scheduling and balancing, persistent volume claims (PVC), network file systems (NFS), and resource quota controls
* Practical experience with the Volcano scheduler for GPU job management, including queue setup, workload prioritization, and Kubernetes integration
* Demonstrated ability to operate GPU cluster environments in both Kubernetes and standalone Linux setups for high-performance computing
* Advanced skills in Python scripting for automating infrastructure operations, job handling, and system monitoring
* Proficiency in UNIX Shell scripting (such as Bash) for automating system tasks and improving operational workflows
* Strong background in Linux system administration, including troubleshooting, optimizing performance, and managing configurations
* Thorough understanding of automation and orchestration tools and concepts to support scalable and dependable infrastructure
* Excellent English communication skills, both spoken and written, for direct client engagement and teamwork with cross-functional groups

**Nice to have**

* Experience using Helm for packaging and managing Kubernetes applications
* Knowledge of monitoring and observability tools like Prometheus, Grafana, and Loki for tracking infrastructure health and performance
* Familiarity with Infrastructure as Code solutions such as Terraform for automating cloud resource provisioning and management
* Background working with multi-cloud Kubernetes platforms, including Amazon EKS and Google GKE, for expanded orchestration capabilities
* Skills in Azure networking, including VPN setup, ExpressRoute configuration, and network security for robust cloud deployments
* Experience with AI-powered coding assistants (e.g., GitHub Copilot, ChatGPT, Claude) to improve development efficiency and code quality
* Understanding of hybrid scheduling and resource optimization across cloud and on-premises environments for flexible compute solutions

Source: Indeed
Sofía González
Indeed · HR

Company

Indeed
© 2025 Servanan International Pte. Ltd.