Site Reliability Engineer (SRE) / Platform Engineer
Perfict
Job Description
Job Title: Site Reliability Engineer (SRE) / Platform Engineer
Location: Reston, VA (Hybrid: 2 days onsite / 3 days remote)
Type: 3-6 month contract to hire

Operate, tune, and optimize OpenShift/Kubernetes clusters (scheduling, ingress, upgrades, quotas, policies).
Stand up and/or refine observability (Datadog, Prometheus, Grafana): dashboards, alerts, SLOs, runbooks.
Map current hybrid topology and critical delivery pipelines; identify toil and prioritize automation (Terraform/Ansible).
Begin supporting Azure environments (compute, networking, storage, data services) used by analytics teams.
Drive GitOps‑first workflows; harden CI/CD with ArgoCD/Jenkins/GitHub Actions and policy‑as‑code guardrails.
Implement or enhance platform services (Vault, Kafka/AMQ, ingress, service mesh) for dev and data teams.
Lead incident response and postmortems; institutionalize RCA, blameless learning, and continuous improvement.
Advance the hybrid service model: migrations, integrations, reliability/latency tuning, cost and performance optimization.

Day‑to‑Day Responsibilities

Operate and optimize OpenShift/Kubernetes clusters, ingress (e.g., Nginx), and container networking/service mesh.
Manage Azure services (compute, VNet, storage, data services) supporting analytics workloads.
Build and maintain automated infrastructure with Terraform, Ansible, and GitOps workflows.
Implement and evolve observability (Datadog, Prometheus, Grafana): metrics, traces, logs, alerting, SLOs, runbooks.
Design, harden, and support delivery pipelines with ArgoCD/Jenkins/GitHub Actions.
Provide platform tooling and enablement for application developers, data engineers, and operations teams.
Ensure security and access management (HashiCorp Vault, secrets management, least privilege).
Lead incident response, coordinate cross‑functional resolution, and drive corrective actions and platform improvements.
Script or develop tools in Bash, Python, or Go to eliminate toil and improve developer experience.

Tech You’ll Work With

Azure (compute, networking, storage, and data services)
Automation & IaC: Terraform, Ansible, GitOps
Messaging: Kafka, AMQ
Secrets & Access: HashiCorp Vault
CI/CD: ArgoCD, Jenkins, GitHub Actions
Scripting/Coding: Bash, Python, Go

Must‑Have Qualifications

5+ years hands‑on operating and managing Kubernetes and OpenShift clusters.
Strong experience with Microsoft Azure (compute, networking, storage, and data services).
Proven skills in automation and Infrastructure‑as‑Code (Terraform, Ansible, GitOps).
Proficiency with observability tooling (Datadog, Prometheus, Grafana).
Scripting/coding ability in Bash, Python, or Go.

Preferred / Stand‑Out Skills

Experience bridging on‑prem and cloud in a hybrid service model (migration, integration, optimization).
Expertise with Kafka/AMQ, HashiCorp Vault, and ArgoCD/Jenkins/GitHub Actions.
Background leading incident response and postmortems with strong RCA and continuous improvement practices.