




**Summary**

Seeking a Chief HPC Network Engineer to define global technical strategy and engineering vision for advanced AI and Kubernetes-based GPU infrastructure.

**Highlights**

1. Define global technical strategy for advanced AI and GPU infrastructure.
2. Act as principal technical authority influencing executive client roadmaps.
3. Lead and mentor engineers across network, Kubernetes, and AI research teams.

We are looking for a **Chief HPC Network Engineer** to define the global technical strategy, reference architecture, and engineering vision behind advanced AI, research, and Kubernetes-based GPU infrastructure for a major global technology client. The role focuses on establishing the long-term technical direction, governing architecture decisions across multiple programs, and setting organization-wide engineering standards for high-performance network fabrics supporting massive-scale LLM and distributed AI workloads, including InfiniBand/RDMA, high-speed Ethernet, Kubernetes networking, host-side GPU networking, SmartNIC/DPU technologies, and deep network observability.

As a principal technical authority, you will shape engineering culture, mentor lead and principal engineers, influence executive client roadmaps, and own end-to-end governance of mission-critical network platforms across the portfolio.

The ideal candidate combines authoritative expertise across InfiniBand NDR/HDR and next-generation fabrics, RDMA/RoCE, NVIDIA/Mellanox networking, NCCL/MSCCL communication patterns, Linux host networking, PCIe/GPU/NIC topology, and Kubernetes networking for GPU clusters, with a proven track record of leading multiple engineering teams, defining technical strategy at the program level, and shaping industry-leading HPC/AI network platforms.
**Responsibilities**

* Define and own the multi-year strategic vision and architectural roadmap for high-performance InfiniBand/RDMA and Ethernet fabrics powering massive-scale GPU clusters and distributed AI/LLM workloads across the client portfolio
* Govern the design, evaluation, and standardization of cluster network topologies, including Fat-tree, Clos, rail-optimized, and Dragonfly, and establish enterprise-wide decision frameworks aligned with workload scale, performance, and cost constraints
* Establish and enforce organization-wide engineering standards and best practices for host-side networking, including NIC configuration, drivers, firmware, IRQ affinity, NUMA placement, PCIe topology, and GPU-to-NIC communication paths
* Set the strategic direction for performance engineering across RDMA/RoCE, NCCL/MSCCL, and collective communication for multi-node GPU training workloads, and oversee resolution of the most complex systemic performance issues
* Define the canonical reference architecture for Kubernetes networking on GPU clusters, including CNI plugins, network policies, multi-NIC pods, RDMA/GPU device plugins, and workload orchestration integration, and drive its adoption across programs
* Own the strategy and governance for SmartNIC/DPU technologies such as NVIDIA BlueField, including SR-IOV, offload, isolation, and security use cases, and align adoption with the broader infrastructure roadmap
* Define the enterprise observability strategy for network platforms, governing metrics, dashboards, alerts, congestion detection, latency tracing, SLO frameworks, and capacity/performance analysis methodologies
* Provide technical leadership and mentorship to lead and principal engineers across network, Kubernetes, storage, GPU infrastructure, observability, and AI research teams, building the talent pipeline and driving cross-functional alignment at scale
* Act as the principal technical authority in executive client and stakeholder forums, shaping strategic technical direction, negotiating trade-offs at the program level, and ensuring delivery of reliable, scalable network platforms across multiple engagements
* Contribute to the broader engineering community through thought leadership, internal practice development, and representation of the company at industry events

**Requirements**

* 8+ years of experience in network, infrastructure, HPC, SRE, or similar engineering roles, with 4+ years focused on HPC, AI/ML, or GPU cluster networking, including 2+ years of demonstrated technical leadership at the program or portfolio level
* Proven experience defining the architecture and governing delivery of InfiniBand/RDMA fabrics, high-speed Ethernet, and Linux networking in large-scale, performance-critical distributed compute environments
* Authoritative expertise in host-side networking, including NICs, drivers, firmware, PCIe topology, NUMA awareness, and GPU-to-NIC affinity, with a proven ability to set enterprise-wide standards and uplift engineering organizations
* Deep understanding of distributed AI training communication patterns, including NCCL-based workloads and collective operations such as all-reduce and all-gather, with the ability to drive workload-network co-design strategy at scale
* Authoritative knowledge of Kubernetes and container networking for GPU or distributed workloads, including CNI concepts, network policies, multi-NIC patterns, and RDMA/GPU device integration, with experience defining reference architectures
* Expert-level mastery of RDMA networking concepts, including InfiniBand, RoCE/RoCEv2, GPUDirect-related patterns, congestion behavior, and performance tuning at very large scale
* Mastery of Linux networking and host-side troubleshooting, including IRQ affinity, MTU, offloads, and performance diagnostics, with the ability to define diagnostic methodologies for the broader engineering organization
* Demonstrated ownership of enterprise network observability and performance management strategy, including telemetry, traffic monitoring, congestion detection, latency analysis, SLOs, capacity planning, and alerting/troubleshooting across L1-L4, fabric, and RDMA layers
* Outstanding leadership, mentoring, stakeholder management, and executive communication skills, with proven experience leading multiple engineering teams, influencing C-level client architecture decisions, and driving consensus across researchers, platform stakeholders, and executive sponsors
* English language proficiency at an Advanced level (C1)

**Nice to have**

* Hands-on architectural and strategic experience with Azure networking, Ethernet, and GPGPU/GPU technologies
* Authoritative command of Grafana, Prometheus, and network administration, with experience defining observability standards across an engineering organization
* Proven ability to define strategy, govern, and scale Infrastructure as Code practices across multiple teams and programs
* Proficiency in Python and UNIX shell scripting for automation, tooling, and organization-wide engineering productivity
* Track record of thought leadership through conference talks, publications, patents, or open-source contributions in the HPC/AI networking domain


