Site Reliability Engineer (SRE) - Observability Specialist at Vodastra Las Vegas, NV

Downtown Boulder Partnership

Las Vegas | Full-time | Mid Level | On-site

Job Description

Site Reliability Engineer (SRE) - Observability Specialist
Location: Las Vegas, NV 89101 (Onsite)
Position Type: Contract

Job Summary

We are seeking a skilled and passionate Site Reliability Engineer (SRE) with a strong focus on Observability to join our onsite team. In this role, you will design, implement, and maintain observability solutions to ensure the reliability, scalability, and performance of our systems. As an Observability Specialist, you will collaborate with development, operations, and business teams to drive improvements in system monitoring, logging, tracing, and alerting.

Key Responsibilities

Observability Architecture & Implementation
- Design and implement observability solutions, including monitoring, logging, and distributed tracing, to provide actionable insights into system behavior and health.
- Evaluate and integrate observability tools and platforms (e.g., Prometheus, Grafana, Elasticsearch, Datadog, New Relic).

Monitoring & Alerting
- Define and maintain key performance indicators (KPIs) and service level objectives (SLOs) to measure system reliability and performance.
- Develop robust alerting systems that minimize noise and provide meaningful, actionable alerts for critical issues.

System Reliability Engineering
- Proactively identify system reliability risks through observability metrics and collaborate with teams to implement mitigation strategies.
- Participate in root cause analysis (RCA) and implement solutions to prevent the recurrence of incidents.

Collaboration & Advocacy
- Work closely with development and DevOps teams to embed observability best practices into the software delivery lifecycle.
- Act as a champion for observability, educating teams on its importance and guiding them in its adoption.

Automation & Optimization
- Automate repetitive observability tasks, such as dashboard creation, log parsing, and alert tuning.
- Optimize monitoring systems to reduce overhead and enhance efficiency.

Documentation & Reporting
- Create and maintain documentation for observability processes, tools, and integrations.
- Develop dashboards and reports to provide visibility into system health and reliability for stakeholders.

Qualifications

Education
- Bachelor’s degree in Computer Science, Engineering, or a related field (or equivalent practical experience).

Experience
- Proven experience in Site Reliability Engineering, DevOps, or a similar role.
- Extensive hands-on experience with observability tools and platforms (e.g., Prometheus, Grafana, Splunk, Elastic Stack, OpenTelemetry).
- Experience with cloud platforms (AWS, Azure, GCP), container technologies, and orchestration systems (Docker, Kubernetes).

Skills
- Proficiency in programming and scripting languages (e.g., Python, Go, Bash).
- Strong understanding of distributed systems, microservices architecture, and networking.
- Expertise in designing monitoring systems with KPIs, SLOs, and SLIs.
- Experience with incident response, postmortem analysis, and reliability reporting.

Preferred Qualifications
- Certifications in cloud platforms or observability tools.
- Familiarity with chaos engineering principles and practices.
- Hands-on experience with Infrastructure-as-Code (e.g., Terraform, Ansible).

Key Competencies
- Analytical mindset with strong problem-solving skills.
- Effective communication and collaboration abilities.
- Proactive and detail-oriented with a passion for reliability and automation.

Posted Today
