Java SRE Engineer (Phoenix)
Atos
Job Description
Must-have skills include application support and knowledge of observability tools, with a distributed systems background and a Java/Python tech stack. Knowledge of or work experience in AI is preferred.

The role involves:
- Supporting applications in production, including incident response, on-call rotations, and post-incident reviews
- Applying observability engineering to our applications: defining SLOs, SLIs, and error budgets, building dashboards, and implementing alerting strategies to detect system degradation before customers are impacted
- Investigating and resolving production issues, including performance tuning and capacity planning
- Building automation to reduce toil and improve developer productivity
- Driving independent initiatives to improve platform reliability, developer experience, and operational maturity

Site Reliability Engineer Responsibilities
- Proactively identify reliability risks and independently drive initiatives to address them before they become incidents
- Participate in and continuously improve our on-call rotation, including incident response, triage, and leading blameless post-incident reviews
- Define and implement monitoring, logging, and distributed tracing strategies; build and maintain dashboards; set meaningful alerts; and drive SLO/SLI/SLA and error budget adoption across services
- Scope technical projects and break them down into user stories and tasks, driving them to completion with minimal oversight
- Make sound technical decisions, leveraging input from teammates and contributing to technical conversations across engineering teams
- Automate the provisioning and management of infrastructure using Infrastructure as Code (IaC) tools such as Terraform

A good fit will have
- At least 5 years of experience working in a professional environment as a Site Reliability Engineer (or as a Software Engineer with some SRE responsibilities)
- Strong hands-on experience with observability: an understanding of the difference between monitoring and observability, and the ability to articulate how metrics, logs, and traces work together
- Experience participating in on-call rotations, and comfort leading incident response under pressure while communicating clearly with stakeholders throughout
- Comfort taking ownership of initiatives or projects independently, from scoping through to delivery, without needing constant direction
- Experience contributing to the design, build, and operation of cloud-native applications
- Experience automating repeatable tasks and processes
- The ability to build effective working relationships, give and receive constructive feedback openly, and earn the trust of colleagues at all levels

Technologies we use
- Python, Java, and Go are our primary server languages
- Our browser applications are based on Angular and React
- Code lives in GitHub and flows to production through a CI/CD pipeline built on GitHub Actions, with some workloads on Jenkins
- Infrastructure runs on AWS (EC2), with workloads on Kubernetes-managed Docker containers
- Datadog is our primary observability platform; experience with Datadog APM, dashboards, monitors, and RUM is a plus
- Infrastructure is managed as code using Terraform