




Summary: Seeking a Chief AI Platform Engineer to lead the evolution of an agentic AI platform, focusing on scalability, reliability, and governance across GCP and Azure.

Highlights:

1. Shape scalability, reliability, and governance of an enterprise AI foundation
2. Lead evolution of agentic AI platform and RAG pipelines on GCP and Azure
3. Enable teams with APIs, SDKs, and comprehensive documentation

We are building a resilient enterprise AI foundation and need a **Chief AI Platform Engineer** to shape scalability, reliability, and governance across the platform. You will lead the evolution of our agentic AI platform, RAG pipelines, and cloud-native operations on GCP and Azure while enabling teams with APIs, SDKs, and documentation. Apply now.

**Responsibilities**

* Build and operate the proprietary agentic AI platform
* Administer LiteLLM as the primary AI gateway, tuning routing, cost control, load balancing, and failover
* Establish and run monitoring and observability capabilities using Prometheus, Grafana, and OpenTelemetry
* Architect and enhance Retrieval-Augmented Generation (RAG) pipelines, covering document ingestion, chunking, embeddings, and vector stores
* Deliver RAG solutions on GCP and Azure using managed AI services and vector databases
* Deploy and manage AI services on Kubernetes (AKS, GKE) using automated infrastructure tooling such as Terraform, Helm, and GitOps
* Create CI/CD pipelines with Jenkins, Opsera, and GitHub Actions
* Uphold system security and ensure compliance requirements are continuously satisfied
* Enable multi-agent orchestration by producing SDKs, APIs, and developer documentation
* Develop MCP servers for tool integrations and enable autonomous workflows across teams

**Requirements**

* 7+ years of proven experience in platform engineering or DevOps, with a strong infrastructure foundation
* 3+ years of hands-on experience building and supporting AI/ML or LLM platforms in production
* Deep expertise in Kubernetes, CI/CD tools, and cloud platforms such as GCP or Azure
* Strong programming proficiency in Python and/or TypeScript
* Solid background automating infrastructure provisioning with Terraform, Helm, and GitOps
* Practical knowledge of observability tooling including Prometheus, Grafana, and OpenTelemetry
* Clear understanding of Retrieval-Augmented Generation (RAG) methodologies and vector databases
* Advanced English proficiency (B2+/C1)

**Nice to have**

* Expertise in LangChain, LlamaIndex, or agent frameworks
* Familiarity with LiteLLM, MCP, and Backstage solutions
* Ability to optimize costs for LLM workloads
* Experience building and maintaining enterprise-scale AI platforms


