




Summary: As a Lead AWS Platform Engineer, you will strengthen our cloud platform by standardizing multi-account AWS and Kubernetes foundations through automation, observability, and scalable patterns.

Highlights:

1. Own the AWS environment and run platform operations for HPC workloads
2. Lead technical ownership and set standards across teams
3. Design and support cross-cloud data transfer solutions

We are building a dependable AWS platform for HPC teams to run large-scale workloads with consistent reliability and control. As a **Lead AWS Platform Engineer** (HPC Enablement), you will standardize multi-account AWS and Kubernetes foundations through automation, observability, and scalable patterns. Apply to help strengthen our cloud platform.

**Responsibilities**

* Own the AWS environment and run platform operations that enable HPC workloads at scale
* Provision and administer AWS accounts through internal self-service tooling and standardized patterns
* Develop and maintain Terraform code to provision AWS resources and HPC-oriented clusters
* Design and run centralized CI/CD pipelines to manage all accounts and clusters from a single repository
* Migrate remaining AWS accounts into the central repository and align them to standardized infrastructure patterns
* Operate and support the in-cluster container registry (Harbor) and associated platform components
* Implement and finalize the observability rollout across the AWS environment, covering metrics, logs, dashboards, and alerting
* Support Kubernetes cluster operations and troubleshoot platform issues that impact HPC workloads
* Own and enhance Cast AI as the primary mechanism for cluster scaling and optimization
* Design and support cross-cloud data transfer and networking solutions such as AWS DataSync and Interconnect between AWS and GCP
* Collaborate with the HPC team to translate requirements into implemented platform solutions
* Coordinate working hours to maintain at least 4 hours of overlap with the Houston time zone and occasional overlap with Australia

**Requirements**

* Hands-on experience with Amazon Web Services in multi-account environments (5+ years)
* Infrastructure-as-code expertise with Terraform (HCL/tofu), including modules and state management
* Kubernetes operations experience, including cluster and workload troubleshooting
* Proven ability to lead technical ownership as a staff-level individual contributor and set standards across teams
* Strong project delivery skills to turn requirements into evaluated options and shipped solutions with minimal guidance
* Advanced Python programming skills for automation, tooling, and integrations
* Strong Bash scripting skills for operational automation
* Solid knowledge of CI/CD and GitOps workflows using tools such as GitLab CI or GitHub Actions
* Strong observability capabilities across metrics, logs, dashboards, and alerting with Prometheus and Grafana
* Experience improving cluster scaling and cost optimization using Cast AI or similar tooling
* Ability to use AI-assisted tools for code generation, debugging, and documentation in day-to-day work
* Upper-Intermediate English proficiency (CEFR B2)

**Nice to have**

* Experience with Google Cloud Platform, particularly cross-cloud integrations with AWS
* Background in high-performance computing (HPC), including schedulers or data-intensive pipelines


