Senior GenAI Engineer (AI Evaluation Engineer) [T500-26087]
FM
Job Description
About FM: FM is a 190-year-old, Fortune 500 commercial property insurance company of 6,000+ employees with a unique focus on science and risk engineering. Serving over a quarter of the Fortune 500 and major corporations globally, it delivers data-driven strategies that enhance resilience, ensure business continuity, and empower organizations to thrive. FM India, located in Bengaluru, is a strategic location for driving FM's global operational efficiency, allowing the company to leverage the country’s talented workforce and advance its capabilities to serve its clients better.
GenAI Engineer (AI Evaluation Engineer) Job Summary: The Gen AI Evaluation Engineer leads the design, implementation, and operation of enterprise-grade evaluation, quality, and governance frameworks for Generative AI systems in a highly regulated, responsible AI environment. This role ensures the quality, reliability, safety, and performance of LLMs, vision models, RAG pipelines, and agentic workflows deployed in production. Building on strong GenAI engineering foundations, this position focuses on AI-specific testing, experimentation, automation, and continuous evaluation pipelines, integrating quality gates into CI/CD workflows and aligning GenAI solutions with enterprise architecture, compliance, and risk standards.
The Gen AI Evaluation Engineer partners closely with product, data science, ML engineering, and platform teams to drive trustworthy, scalable, and production-ready AI systems.
Essential Functions & Responsibilities:
GenAI Application Design & Implementation: Design and implement comprehensive AI evaluation and experimentation frameworks for LLMs, vision models, RAG pipelines, and agentic workflows. Build automated evaluation systems to assess model outputs for accuracy, relevance, bias, hallucinations, safety, and regression stability.
Develop quality benchmarks and continuous testing pipelines covering content quality, safety, alignment, and enterprise compliance. Establish AI-specific quality gates and acceptance criteria integrated into Agile sprints and CI/CD pipelines. Design, develop, and evaluate data pipelines and RAG workflows using Promptflow, Azure AI Search, ADF Pipelines, Databricks, Spark, and Vector Databases.
Validate prompt engineering strategies, prompt consistency, and inference pipelines using GenAI-specific testing tools. Perform prompt-based scenario testing, hallucination detection, and regression validation across model versions.
System Support & Operational Excellence: Develop and maintain monitoring capabilities for model drift detection, data quality validation, inference latency, and system reliability.
Monitor production GenAI systems using Azure Application Insights, Dynatrace, and AI evaluation platforms. Apply AI risk-focused testing, including bias, fairness, safety, adversarial testing, and red-teaming methodologies. Ensure optimized utilization of infrastructure and compute across diverse AI workloads.
Automation & Test Engineering: Build and maintain test automation frameworks (primarily in Python) for AI/ML evaluation and end-to-end workflow validation. Write automation scripts to simulate user behavior and backend interactions. Perform exploratory testing, regression testing, and end-to-end system validation for AI-enabled applications.
Track and manage defects using AI evaluation platforms and Agile tools; document test plans, execution results, and evaluation metrics.
Mentorship & Team Leadership: Lead design and evaluation reviews; mentor junior engineers and evaluation specialists. Collaborate with product managers to convert business and regulatory requirements into test cases, benchmarks, and test data.
Drive innovation by adopting emerging GenAI evaluation techniques and aligning AI initiatives with enterprise goals. Foster a culture of technical excellence, responsible AI, and continuous improvement.
Essential Skills:
Design and implementation of AI evaluation frameworks for LLMs, RAG, vision models, and agentic systems
Generative AI engineering: prompt engineering, model integration, inference pipelines
Data pipelines and RAG workflows (Promptflow, Azure AI Search, ADF Pipelines, Vector DBs)
Automated testing and CI/CD integration for AI systems
Monitoring and observability (Azure Application Insights, Dynatrace)
AI risk, safety, bias, fairness, hallucination detection, and adversarial testing
Large-scale experimentation and performance optimization
Strong mentorship, technical leadership, and cross-team collaboration
Must Have Skills:
Generative AI
Azure (AI, Data, and Monitoring services)
AI/ML Platform tools (Databricks, Azure AI/ML, GCP Vertex, AWS Bedrock)
Python (test automation, evaluation frameworks)
Software Engineering & Test Engineering fundamentals
Basic Qualifications: 1-3 years of experience in evaluation or test engineering with a strong focus on AI/ML and Generative AI systems.
Hands-on experience evaluating LLMs, RAG pipelines, and AI-powered applications in production environments. Strong understanding of AI/ML evaluation concepts including accuracy, relevance, regression, latency, and stability. Bachelor’s Degree in Computer Science, Data Science, Artificial Intelligence, or equivalent practical experience.
Preferred Qualifications: Knowledge of AI risks including bias, fairness, safety, hallucinations, and adversarial attacks. Experience in highly regulated or governed AI environments. Strong collaboration skills with data scientists, ML engineers, platform teams, and product managers.
Passion for AI quality, responsible AI practices, and staying current with emerging GenAI evaluation methodologies.