




Summary: Lead Cloud Engineer to drive operational excellence, observability, incident response, resilience, and disaster recovery for cloud platforms.

Highlights:

1. Lead operational excellence of the cloud platform
2. Own operational health, incident response, and DR solutions
3. Partner with engineers for architecture improvements and roadmap

We are looking for a **Lead Cloud Engineer** to join our team. You will lead operational excellence of the cloud platform by owning observability, incident response, resilience, and disaster recovery. This role ensures that "run" is as strong as "build," providing confidence that cloud workloads remain healthy, compliant, and performant.

**Responsibilities**

* Own operational health dashboards, alert thresholds, and incident response playbooks for the cloud platform
* Lead on-call rotations, coordinate major incident resolution, and drive post-incident reviews
* Implement and maintain Disaster Recovery (DR) solutions for core applications, including DNS routing strategies and low-RTO repositories
* Manage patching pipelines, golden images, container registries, backups, and automated resilience testing
* Partner with platform engineers to feed operational learnings into architecture improvements and the roadmap
* Use automation and AI-assisted tools to correlate anomalies, reduce noise, and accelerate root-cause discovery
* Educate product teams on DR patterns, operational best practices, and shared responsibilities

**Requirements**

* Bachelor's or Master's degree in Computer Science, Computer Engineering, or equivalent professional experience
* At least 5 years of relevant professional experience
* A minimum of one year of experience in people management or team leadership, leading a team of 5+ FTEs
* Hands-on experience in cloud operations or SRE roles with deep exposure to AWS or similar hyperscale platforms
* Advanced skills in monitoring, alerting, logging, and incident management tooling
* Proven track record of executing disaster recovery strategies, backup regimes, and resilience testing
* Solid knowledge of patching processes, golden AMI and container image management, and change control governance
* Experience automating operational workflows to reduce MTTR and toil using tools such as Python, Lambda, and runbooks
* Familiarity with AI-assisted observability and correlation tooling and how to operationalize it
* Strong communication skills for on-call coordination and stakeholder updates
* Excellent oral and written communication skills in English (B2+ level or higher)


