Manager II, SRE Operations [T500-27315]
Talent500
Job Description
Talent500 is hiring for one of its clients.
Who are we:
Core Insurance Platforms (CIP) is Zurich’s global capability responsible for building, running, and evolving core insurance technology. We set a unified, scalable operating model (covering governance, standards, architecture, service delivery, and reuse) so our business units can deliver at speed and scale.
CIP is the strategic steward of Zurich’s Guidewire ecosystem, aligning platform roadmaps to business strategy while driving stability, modernization, reduced supplier dependency, and long-term cost efficiency. The India delivery center is one of our global delivery and capability hubs. We bring together experts in AI, engineering, analysis, quality, and architecture to deliver product and process solutions, application run services, change and transformation initiatives, and centralized platform services across both on-prem and Guidewire Cloud environments.
Our teams operate from multiple global delivery centers, supporting Zurich’s business units worldwide.
Manager II - SRE (Incident Management & SRE Strategy):
We are seeking an SRE Manager who lives and breathes an AI-first operations philosophy: someone who treats AI, automation, and observability as the default tools of the trade, not as add-ons, who believes manual toil is a bug, that reliability is engineered (not hoped for), and that the best incident is the one that never happens.
This role sits at the intersection of SRE leadership, AI-first operations, and team ownership. You will lead the AIEP Operations unit as both its manager and its most senior technical voice — owning the Incident Management and SRE strategy end-to-end, while building a team that ships fast, automates relentlessly, and scales without handholding. You won't just respond to incidents.
You will engineer a system where AIOps catches issues before users do, where GenAI writes the post-mortems, and where every escalation is one step closer to being industrialized into an AI-assisted L1 workflow.
Key Responsibilities:
Strategy & SRE Framework:
Define and execute the enterprise SRE strategy and reliability framework, establishing standards for SLOs, SLAs, SLIs, and error budgets while platform teams own their specific monitoring implementations, with a clear expectation that AI and automation are used wherever they meaningfully reduce toil and risk.
Define the enterprise observability framework, providing the standards, policies, and centralized platforms (e.g., Splunk, Datadog, Dynatrace) for technology owners to plug their monitoring into, and driving the adoption of AIOps to reduce alert fatigue and false positives.
Incident Management:
Own the internal Incident Management process, participating in critical incident bridges, coordinating across technology teams, and ensuring rapid resolution of P1/P2 events, leveraging GenAI for real-time incident summarization and AIOps for automated alert correlation and decision support.
Drive the Post-Incident Review (PIR) and root cause analysis processes, holding platform teams accountable for preventative actions and systemic reliability improvements, using Generative AI to streamline post-mortem documentation and predictive analytics to surface recurring failure modes.
Champion the integration of AI-driven analytics and automation within incident triage, escalation, and remediation to reduce manual toil and accelerate response times.
Unit Leadership:
Lead the AIEP Operations unit as its manager, setting the AI-first philosophy, coaching team members, and ensuring the unit operates as one team with shared KPIs and shared accountability.
Collaborate with the Expert AIEP Operations Engineer to build automation and AI-usage capabilities for incident resolution, L1 AI co-pilots, automated compliance execution, and enhanced L1/L2 workflows.
Mentor and guide technical teams, platform owners, and L1/L2 operations on SRE best practices, incident handling, and the practical application of AI in daily operations.
Platform & Tool Governance:
Manage and govern centralized internal tools and platforms (e.g., Atlassian suite, ITSM tools), ensuring they are secure, compliant, and optimized for enterprise use, embedding AI assistants and intelligent routing wherever applicable.
Reporting & Continuous Improvement:
Report on reliability metrics, incident trends, and AI-enabled operational improvements to senior leadership, using data to drive strategic technology investments.
Partner with technology owners, application teams, and business stakeholders to gather requirements, align priorities, and deliver custom incident reporting solutions using platforms such as Splunk Enterprise, Splunk Observability Cloud, Splunk ITSI, Datadog, Dynatrace, New Relic, and Grafana.
Drive continuous improvement initiatives, leveraging AI and traditional data analytics to optimize monitoring coverage, reduce false positives, and improve incident response times.
Lead post-incident reviews and systemic problem analysis, driving improvements and preventative actions.
Skills / Experience:
Experience with SRE frameworks (SLOs, SLIs, Error Budgets)
Major Incident Management & ITSM processes
Platform management for internal tools
Enterprise observability frameworks (Splunk, Datadog, Dynatrace)
AI/ML integration in IT Operations (AIOps)
Experience with Generative AI tools/LLMs in an operational context
Cloud platforms (AWS, Azure, GCP) architecture understanding
Competencies:
Strategic thinking with a proven ability to drive an AI-first cultural transformation in operations
Crisis management and calm, decisive leadership during major incidents
Stakeholder engagement, negotiation, and cross-functional influence
Advanced problem-solving and systemic root-cause analysis
Strong governance, process design, and documentation skills
People leadership, coaching, and team building
Expected Outcomes (6/12 months):
Implementation of a standardized SRE framework (SLOs/SLIs) across critical platforms, incorporating AI-driven forecasting for error budgets.
Establishment of a robust, AI-enhanced Incident Management process.
Reduction of MTTR (Mean Time to Resolve) for critical incidents by at least 20% through AIOps and automated remediation.
Successful transition of monitoring execution to platform teams, with Operations providing full framework governance.
Optimized and compliant platform management for internal tools (e.g., Atlassian suite).
AIEP Operations unit operating as one cohesive team with shared AI-first KPIs and a clear L1/L2/Expert career progression.
Candidate Data Privacy Notice:
Applicability of This Notice:
This job posting relates to opportunities with Zurich Digital International Private Ltd. (“Zurich”) and is published with the support of ANSR Inc. (“ANSR”), an authorized recruitment service provider engaged by Zurich for its hiring activities.
Participation in the Recruitment Process:
When you apply for this role, personal data provided as part of your application (such as your name, email address, contact details, address, financial information, background information, medical history, and details of previous employers/employment) is collected for the purposes of recruitment and selection. It may be reviewed by Zurich and transferred or disclosed to Zurich’s affiliates, subsidiaries, and related entities and to ANSR (“Transferees”), solely for recruitment and selection purposes (in accordance with their respective privacy policies). The Transferees maintain at least the same level of protection for your data as maintained by Zurich, and do not further transfer, disclose, share, or publish your data.
For information on how your personal data is processed by ANSR, including your rights and how to contact the relevant data protection office.
You understand and acknowledge:
You have the option not to provide your data (in which case we may not be able to process your application);
You have the option to review and correct or amend your data;
You have the option to withdraw your consent to the processing of your data;
Your data is retained until the purpose for its collection is served; and
Zurich uses reasonable security measures to help protect against the unauthorized access, loss, misuse, and alteration of the personal information under our control. However, no method of transmission over the internet, or method of electronic storage, is completely secure.
By submitting your interest in any of our vacancies, you consent to the collection, storage, transfer, disclosure, and processing of your personal information and/or sensitive personal data or information.