Senior DevOps Engineer
Indeed
Full-time
Onsite
No experience limit
No degree limit
Description

Summary: Seeking a DevOps/infrastructure engineer to manage Kubernetes operations, tune Linux compute nodes, and build Python and Bash automation for GPU-heavy AI research environments.

Highlights:

1. Operate GPU-enabled Kubernetes clusters for AI research
2. Automate workflows with Python and Shell scripts
3. Tune Linux compute nodes for performance and scalability

We are delivering automated Kubernetes orchestration and Linux infrastructure that powers GPU-heavy AI research, with Volcano handling complex scheduling. You will manage Kubernetes operations (namespaces, RBAC, quotas), tune Linux compute nodes, and build Python and Bash automation to improve reliability and capacity usage. Apply now.

**Responsibilities**

* Operate GPU-enabled Kubernetes clusters and standalone Linux compute environments to ensure efficient scheduling and consistent performance
* Configure and manage Volcano job scheduling, including queue setup, pod execution, GPU allocation, and namespace quota enforcement
* Maintain Kubernetes platforms end-to-end, covering namespaces, RBAC, resource quotas, and workload isolation strategies
* Automate recurring workflows with Python and Shell scripts for job submission, resource provisioning, and system reporting
* Coordinate with orchestration, optimization, and observability teams to refine scheduling efficiency, utilization, and researcher workflows
* Observe infrastructure health and resource utilization and provide feedback for optimization and reporting requirements
* Recommend upgrades to infrastructure, tooling, and automation workflows to increase performance, scalability, and usability
* Maintain operational processes that enable a seamless experience for researchers across AI and computational workloads

**Requirements**

* At least 3 years of experience in DevOps or infrastructure engineering roles supporting complex, large-scale environments
* Expert proficiency in Kubernetes administration and orchestration, including management of namespaces, pod scheduling and distribution, persistent volume claims (PVC), network file systems (NFS), and resource quota management
* Hands-on experience with Volcano scheduler for GPU job execution, including queue configuration, workload prioritization, and integration with Kubernetes
* Proven experience managing GPU cluster environments, both within Kubernetes and on standalone Linux compute nodes, to support high-performance computing workloads
* Advanced Python scripting skills for automating infrastructure tasks, job submissions, and system reporting
* Proficiency in UNIX Shell scripting (e.g., Bash) for system automation and operational efficiency
* Strong Linux system administration skills, including troubleshooting, performance tuning, and configuration management for compute environments
* Solid understanding of infrastructure automation and orchestration concepts and tooling to enable scalable and reliable operations
* Fluent English communication skills (spoken and written) for direct client interaction and collaboration with cross-functional teams

**Nice to have**

* Experience with Helm package management for deploying and managing Kubernetes applications
* Familiarity with monitoring and observability solutions, particularly Prometheus, Grafana, and Loki, for infrastructure health and performance tracking
* Hands-on experience with Infrastructure as Code tools, such as Terraform, for automated provisioning and management of cloud resources
* Exposure to multi-cloud Kubernetes environments, including Amazon EKS and Google GKE, for broader orchestration experience
* Azure networking skills, including VPN configuration, ExpressRoute setup, and network security management, to support secure and scalable cloud deployments
* Familiarity with AI-assisted coding tools (e.g., GitHub Copilot, ChatGPT, Claude) to enhance development productivity and code quality
* Experience with hybrid (cloud and on-premises) scheduling and resource optimization to support flexible and efficient compute environments

Source: Indeed
Sofía González
Indeed · HR

Company

Indeed