Site Reliability Engineer (SRE)

Veriipro

Deerfield | Full-time | Mid Level | On-site

Job Description

We are looking for an experienced Site Reliability Engineer (SRE) to ensure the reliability, availability, and performance of Azure-based services in a large-scale enterprise environment. This role involves managing cloud infrastructure, enhancing observability, implementing disaster recovery strategies, and driving reliability improvements through SLOs/SLIs and automation.

Key Responsibilities

Define and manage SLOs, SLIs, and error budgets for Azure-hosted services, reporting SLA compliance to stakeholders.

Lead architectural reviews, ensuring reliability targets (availability, RTO/RPO) are met from design to production. Implement chaos engineering practices and conduct disaster recovery drills across Azure regions. Serve as Incident Commander for P1/P2 incidents, owning the incident lifecycle and post-mortem actions.

Design and operate enterprise observability using Azure Monitor, Log Analytics, Application Insights, and Grafana. Develop alerting frameworks and automate self-healing operations with Azure Automation and scripting (Python/PowerShell). Embed reliability gates in CI/CD pipelines and manage AKS cluster reliability (scaling, upgrades, security).

Enforce infrastructure-as-code best practices with Terraform/Bicep for Azure Landing Zones.

Required Qualifications

7+ years in SRE, platform engineering, or cloud infrastructure in large-scale environments. 4+ years of hands-on Azure experience with AKS and cloud engineering. Expertise in Terraform (required), Bicep, and managing Azure Landing Zones.

Proficiency in Python, Go, or PowerShell scripting. Experience with Azure observability tools (Monitor, Log Analytics, Application Insights). Proven track record of owning SLOs/SLIs and improving production reliability.

Posted Yesterday
