HPC Network Engineering Manager - AI Infrastructure

Indeed

Full-time

Onsite

No experience limit

No degree limit

Description

Summary: Seeking an HPC Network Engineering Manager to guide architecture and technical direction for AI research and Kubernetes-based GPU infrastructure.

Highlights:

1. Guide architecture and technical direction for AI research infrastructure
2. Shape reliable, scalable network platforms for massive distributed AI workloads
3. Provide technical leadership and mentorship across engineering teams

We are seeking an **HPC Network Engineering Manager - AI Infrastructure** to guide architecture and technical direction for AI research and Kubernetes-based GPU infrastructure. You will steer standards for InfiniBand/RDMA, Ethernet, Kubernetes networking, SmartNIC/DPU, and observability across large programs while mentoring senior engineers. Join us to shape reliable, scalable network platforms for massive distributed AI workloads. Apply now.

**Responsibilities**

* Define and own a multi-year architectural vision and roadmap for InfiniBand/RDMA and high-speed Ethernet fabrics supporting massive GPU clusters and distributed AI/LLM workloads across the client portfolio
* Govern evaluation and standardization of cluster network topologies such as Fat-tree, Clos, Rail-optimized, and Dragonfly, and set decision frameworks aligned to scale, performance, and cost constraints
* Establish and enforce engineering standards for host-side networking, including NIC configuration, drivers, firmware, IRQ affinity, NUMA placement, PCIe topology, and GPU-to-NIC communication paths
* Drive strategic performance engineering across RDMA/RoCE, NCCL/MSCCL, and collective communication for multi-node GPU training, and oversee resolution of the hardest systemic performance issues
* Define the reference architecture for Kubernetes networking on GPU clusters, including CNI plugins, network policies, multi-NIC pods, RDMA/GPU device plugins, and workload orchestration integration, and lead adoption across programs
* Own strategy and governance for SmartNIC/DPU technologies such as NVIDIA BlueField, including SR-IOV, offload, isolation, and security use cases, and align rollout with the broader infrastructure roadmap
* Define enterprise network observability strategy, governing metrics, dashboards, alerts, congestion detection, latency tracing, SLO frameworks, and capacity/performance analysis methods
* Provide technical leadership and mentorship to lead and principal engineers across networking, Kubernetes, storage, GPU infrastructure, observability, and AI research teams to drive cross-functional alignment
* Represent the principal technical authority in executive stakeholder forums by shaping direction, negotiating program trade-offs, and ensuring delivery of reliable, scalable network platforms across engagements
* Contribute to the engineering community through thought leadership, internal practice building, and representation at industry events

**Requirements**

* 9+ years of experience in network, infrastructure, HPC, SRE, or similar engineering roles, with 5+ years focused on HPC, AI/ML, or GPU cluster networking, including demonstrated technical leadership at the program or portfolio level (3+ years)
* Proven track record defining architecture and governing delivery for InfiniBand/RDMA fabrics, high-speed Ethernet, and Linux networking in large-scale, performance-sensitive distributed compute environments
* Authoritative expertise in host-side networking (NICs, drivers, firmware) plus PCIe topology, NUMA awareness, and GPU-to-NIC affinity, with demonstrated ability to set enterprise standards and uplift engineering practices
* Deep understanding of distributed AI training communication patterns, including NCCL-based workloads and collective operations such as all-reduce and all-gather, with ability to drive workload-network co-design at scale
* Authoritative knowledge of Kubernetes and container networking for GPU or distributed workloads, including CNI concepts, network policies, multi-NIC patterns, and RDMA/GPU device integration, with experience defining reference architectures
* Expert-level mastery of RDMA networking, including InfiniBand, RoCE/RoCEv2, GPUDirect-related patterns, congestion behavior, and performance tuning at very large scale
* Mastery of Linux networking and host-side troubleshooting, including IRQ affinity, MTU, offloads, and performance diagnostics, with ability to define repeatable diagnostic methodologies for broader teams
* Demonstrated ownership of network observability and performance management strategy, including telemetry, traffic monitoring, congestion detection, latency analysis, SLOs, capacity planning, and alerting/troubleshooting across L1-L4, fabric, and RDMA layers
* Outstanding leadership, mentoring, stakeholder management, and executive communication skills, with proven experience leading multiple engineering teams, influencing C-level client architecture decisions, and driving alignment across research and platform stakeholders
* English language proficiency at an Advanced level (C1)

**Nice to have**

* Hands-on architectural and strategic experience with Azure Networking, Ethernet, and GPGPU/GPU technologies
* Authoritative command of Grafana and Prometheus, plus Network Administration experience defining observability standards across an engineering organization
* Proven ability to set strategy, govern, and scale Infrastructure as Code practices across multiple teams and programs
* Proficiency in Python and UNIX shell scripting for automation, tooling, and improving engineering productivity
* Track record of thought leadership through conference talks, publications, patents, or open-source contributions in the HPC/AI networking domain

Source: Indeed

Sofía González

Indeed · HR

Company

Indeed
