Chaos Engineering & Performance Engineering Specialist
Enterprise Solutions Inc.
Job Description
Job Title: Chaos Engineering & Performance Engineering Specialist
Location: Hybrid (3 days onsite per week) - Toronto, Ontario
Time Zone: EST
Duration/Type: Contract

Job Summary

We are seeking a skilled Chaos Engineering & Performance Engineering Specialist to enhance system resilience, scalability, and reliability across distributed environments. The ideal candidate will design and execute chaos experiments, conduct performance testing, and leverage observability tools to ensure high availability and fault tolerance of modern applications.

Key Responsibilities

- Design and execute controlled chaos experiments to validate system resilience and reliability
- Inject failures across application, infrastructure, network, and cloud layers
- Identify single points of failure and systemic risks in distributed systems
- Collaborate with Architecture and DevOps teams to improve fault tolerance and recovery mechanisms
- Validate high availability (HA), disaster recovery (DR), and resiliency patterns
- Conduct performance testing and establish performance baselines
- Analyze system behavior under stress and failure conditions
- Leverage APM and observability tools (e.g., Dynatrace, AppDynamics, Prometheus, Grafana) for monitoring and insights
- Correlate chaos experiment outcomes with performance and monitoring data
- Support production performance tuning and stability analysis as needed
- Integrate chaos and performance testing into CI/CD pipelines

Required Skills & Qualifications

- Hands-on experience in Chaos Engineering and Performance Engineering
- Experience with tools such as Gremlin, LoadRunner, and APM/observability platforms
- Strong understanding of distributed systems, microservices architecture, and cloud environments
- Experience in failure injection across multiple layers (application, infrastructure, network)
- Working knowledge of Java-based applications and backend services
- Familiarity with CI/CD pipelines and DevOps practices
- Strong analytical and problem-solving skills

Preferred Qualifications

- Experience with cloud platforms (AWS, Azure, or GCP)
- Knowledge of containerization and orchestration (Docker, Kubernetes)
- Exposure to SRE (Site Reliability Engineering) practices
- Experience implementing resilience testing frameworks