Senior AI/ML Engineer - Site Reliability Engineering
Dormont Manufacturing Co
Job Description
Join RBC's Site Reliability Engineering team as a founding member building the bank's first-ever Agentic AI platform for software reliability and resiliency. You'll pioneer intelligent automation systems that autonomously prevent incidents, accelerate response times, and transform how we maintain resilience across enterprise systems. This is a rare opportunity to shape the future of AI-driven reliability at scale.
Your innovations will protect millions of daily customer transactions and sign-ins. With a clear technical leadership trajectory, you'll architect cutting-edge solutions at the intersection of AI and infrastructure, setting the standard for autonomous operations in financial services.

What Will You Do?
- Design and implement end-to-end Agentic AI solutions that autonomously detect anomalies, identify root causes, and resolve incidents with minimal human intervention
- Develop intelligent automation frameworks using LangChain and LangGraph to create context-aware agents that learn from incident patterns and continuously improve response strategies
- Build ML-powered monitoring and alerting systems that distinguish signal from noise, dramatically reducing false positives and improving MTTD (Mean Time to Detect) and MTTI (Mean Time to Identify)
- Architect scalable, production-grade solutions on OpenShift and Kubernetes that process real-time system metrics and telemetry data at enterprise scale
- Implement infrastructure-as-code using Ansible and containerization (Docker) to ensure reproducibility, consistency, and rapid deployment across environments
- Partner with incident management and operations teams to translate operational pain points into AI-driven automation opportunities that measurably reduce toil
- Establish and track KPIs focused on reducing MTTR (Mean Time to Resolve), MTTD, and MTTI while improving system reliability
- Lead technical design discussions and contribute to architectural decisions that shape RBC's AI-powered reliability strategy

What Do You Need to Succeed?
Must Have:
- Strong ML engineering background with hands-on experience designing, training, and deploying machine learning models in production environments
- Proven expertise in Agentic AI frameworks and tools (LangChain, LangGraph, AutoGen, CrewAI, or similar) and building autonomous, multi-agent systems
- Deep understanding of Model Context Protocol (MCP) for enabling AI agents to interact with external systems and data sources
- Experience building AI agents with tool-calling capabilities, memory management, and reasoning chains
- Proficiency in Python and experience with ML libraries (scikit-learn, TensorFlow, PyTorch, or similar)
- Working knowledge of containerization (Docker), orchestration (Kubernetes/OpenShift), and infrastructure-as-code principles (Ansible, Terraform)
- Demonstrated ability to translate complex technical concepts into business value and collaborate effectively with cross-functional teams

Nice-to-Have:
- Prior experience in Site Reliability Engineering, DevOps, or infrastructure monitoring roles
- Familiarity with observability tools (Prometheus, Grafana, ELK stack) and incident management platforms (PagerDuty, ServiceNow)
- Experience with LLMs, prompt engineering, and retrieval-augmented generation (RAG) architectures
- Background in financial services or other highly regulated industries with strict reliability requirements

What's in It for You?
- A comprehensive Total Rewards Program including bonuses and flexible benefits, competitive compensation, commissions, and stock where applicable
- Leaders who support your development through coaching and managing opportunities
- Ability to make a difference and lasting impact
- Work in a dynamic, collaborative, progressive, and high-performing team
- A world-class training program in financial services
- Flexible work/life balance options
- Opportunities to do challenging work