ML Engineer
AI Chopping Block, Inc.
Job Description
US Sales and Partnerships Lead, Digital Diagnostics

Lead the team responsible for the AI/ML stack infrastructure that bridges ML research and production, evolving the stack to meet large-scale ML training and inference workload needs. Develop and execute a long-term vision and roadmap for the MLOps team to support ML development and deployment needs across business units, managing short-term deliveries alongside long-term architectural transformation. Lead and mentor a team of 6-7+ engineers, and allocate resources across support work and strategic initiatives.
Collaborate cross-functionally with leaders in machine learning, data science, product engineering, and infrastructure to identify pain points, address bottlenecks, and facilitate deployment of new solutions. Architect compute and storage pipelines to manage millions of slides and complex artifacts without data fragmentation or latency. Modernize the AI product inference stack to support substantial growth in AI runs globally.
Work with Site Reliability Engineering to establish comprehensive system observability metrics, including compute utilization, network bottlenecks, and cost attribution. Conduct build-versus-buy assessments and lead stack-refresh audits to benchmark proprietary tools against commercial and open-source alternatives.

Machine Learning Intern, Nomagic

As a Machine Learning Intern at Nomagic, you will dive into complex problems of physical manipulation to enhance robot capabilities.
Your responsibilities include expanding the perception abilities of the robotic system to handle a wider variety of products; detecting anomalies, such as when a robot picks more than one item or when an item has come apart; training models to solve multiple problems with various loss functions; and productionizing machine learning models, which involves performance monitoring and A/B testing. You will work on developing groundbreaking technology and collaborate with top professionals in an English-speaking environment, with opportunities to play with robots daily and contribute directly to impactful results.

Build and maintain data pipelines for large video generation models, including data ingestion, parsing, filtering, preprocessing, and dataset curation at scale, using tools such as AWS S3 and DynamoDB.
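To make the dataset-curation duty above concrete, here is a minimal sketch of a metadata-driven filtering step. The `ClipRecord` fields, thresholds, and object keys are hypothetical; in a real pipeline the records would typically be listed from S3 (e.g. via boto3) and their state tracked in a store such as DynamoDB.

```python
from dataclasses import dataclass

@dataclass
class ClipRecord:
    key: str          # hypothetical object key, e.g. "raw/clip_0001.mp4"
    duration_s: float
    width: int
    height: int
    caption: str

def curate(records, min_duration=2.0, min_height=480):
    """Drop clips that are too short, too low-resolution, or uncaptioned."""
    kept = []
    for r in records:
        if r.duration_s < min_duration:
            continue          # too short to be useful for training
        if r.height < min_height:
            continue          # below the resolution floor
        if not r.caption.strip():
            continue          # no usable text pairing
        kept.append(r)
    return kept
```

In practice this kind of pure filter is the easy part; the surrounding work is sharding the surviving records into manifests and keeping the S3 listing, filtering, and downstream training jobs in sync.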
Design and run annotation workflows across platforms such as Amazon Mechanical Turk and Prolific, including task design, quality control, and label validation. Train, evaluate, and improve smaller supporting models used for data filtering, quality assessment, preprocessing, or other parts of the ML pipeline. Partner closely with research and engineering teams to turn experimental workflows into scalable, repeatable systems that support model training and evaluation.
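The quality-control and label-validation work described above usually reduces to two checks: scoring workers against gold-labeled tasks, and accepting only labels with a strict majority of votes. A minimal sketch (the data shapes here are illustrative, not any platform's API):

```python
from collections import Counter, defaultdict

def worker_accuracy(gold, answers):
    """gold: {task_id: label}. answers: iterable of (worker_id, task_id, label).
    Returns each worker's accuracy, measured on gold tasks only."""
    hits, total = defaultdict(int), defaultdict(int)
    for worker, task, label in answers:
        if task in gold:
            total[worker] += 1
            hits[worker] += (label == gold[task])
    return {w: hits[w] / total[w] for w in total}

def validated_labels(answers, min_votes=3):
    """Majority-vote label per task; keep only tasks with enough votes
    and a strict (> 50%) majority."""
    votes = defaultdict(list)
    for _, task, label in answers:
        votes[task].append(label)
    out = {}
    for task, labels in votes.items():
        if len(labels) < min_votes:
            continue
        (label, n), = Counter(labels).most_common(1)
        if n > len(labels) / 2:   # ties are rejected, not broken arbitrarily
            out[task] = label
    return out
```

Workers scoring poorly on gold tasks would typically have their answers down-weighted or excluded before the majority vote is taken.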
Own data quality across the pipeline by identifying bottlenecks, failure modes, and low-quality sources, and continuously improving tooling and processes. Build internal tools and automation that make it easier to prepare datasets, launch annotation jobs, monitor outputs, and support model development end to end. Drive larger pipeline projects from start to finish, such as new dataset creation efforts or upgrades to labeling and preprocessing infrastructure.
Work within a Kubernetes-based training infrastructure, ensuring datasets are properly prepared, formatted, and delivered to training clusters. Profile and optimize research model inference scripts used in preprocessing steps, ensuring that model-driven filtering and transformation stages run within practical time and cost constraints when applied to large-scale raw data.

Lead/Manager, Site Reliability Engineering Team (Amsterdam)

Advance inference efficiency end to end by designing and prototyping algorithms, architectures, and scheduling strategies for low-latency, high-throughput inference.
Implement and maintain changes in high-performance inference engines such as SGLang- or vLLM-style systems and Together's inference stack, including kernel backends, speculative decoding methods like ATLAS, and quantization. Profile and optimize performance across GPU, networking, and memory layers to improve latency, throughput, and cost. Unify inference with RL/post-training by designing and operating RL and post-training pipelines where inference constitutes the majority of the cost, optimizing algorithms and systems jointly.
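Speculative decoding, mentioned above, can be illustrated with a toy greedy variant: a cheap draft model proposes a block of tokens, the target model verifies them in one pass, and the longest agreeing prefix is kept plus one corrected (or bonus) token. This sketch abstracts both models as next-token functions; production systems like vLLM or SGLang instead verify draft *distributions* with rejection sampling to preserve the target model's output distribution.

```python
def speculative_step(target_next, draft_next, prefix, k=4):
    """One round of greedy speculative decoding.
    target_next / draft_next: callables mapping a token prefix to the next token."""
    # 1) Draft model proposes k tokens autoregressively (cheap).
    proposal, ctx = [], list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        proposal.append(tok)
        ctx.append(tok)
    # 2) Target model verifies the proposal (one batched pass in real engines).
    accepted, ctx = [], list(prefix)
    for tok in proposal:
        t = target_next(ctx)
        if t == tok:
            accepted.append(tok)      # draft agreed with the target
            ctx.append(tok)
        else:
            accepted.append(t)        # target's correction ends the round
            break
    else:
        accepted.append(target_next(ctx))  # bonus token: all k accepted
    return accepted
```

The throughput win comes from step 2 being a single target forward pass over the whole proposal, so each round emits between 1 and k+1 tokens for one expensive call.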
Enhance RL and post-training workloads with inference-aware training loops, including asynchronous RL rollouts and speculative decoding techniques, making large-scale rollout collection and evaluation more efficient. Use these pipelines to train, evaluate, and iterate on cutting-edge models based on the inference stack. Co-design algorithms and infrastructure to tightly couple objectives, rollout collection, and evaluation to efficient inference, and identify bottlenecks across training engines, inference engines, data pipelines, and user-facing layers quickly.
Run ablation and scale-up experiments to analyze trade-offs between model quality, latency, throughput, and cost, feeding insights back into model, RL, and system design. Own critical production-scale systems by profiling, debugging, and optimizing inference and post-training services under real production workloads. Lead roadmap initiatives requiring engine modifications such as changes to kernels, memory layouts, scheduling logic, and APIs.
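The latency/throughput trade-off analysis above starts with a measurement harness. A minimal serial sketch is below; `infer_fn` is any callable standing in for an inference request, and real harnesses would add concurrency, request mixes, and cost attribution.

```python
import time
import statistics

def benchmark(infer_fn, requests, warmup=3):
    """Time infer_fn over a list of requests; report latency percentiles
    and overall throughput."""
    for r in requests[:warmup]:
        infer_fn(r)                   # warm caches / JIT before measuring
    latencies = []
    start = time.perf_counter()
    for r in requests:
        t0 = time.perf_counter()
        infer_fn(r)
        latencies.append(time.perf_counter() - t0)
    wall = time.perf_counter() - start
    latencies.sort()
    pct = lambda p: latencies[min(len(latencies) - 1, int(p * len(latencies)))]
    return {
        "p50_s": pct(0.50),
        "p99_s": pct(0.99),
        "mean_s": statistics.mean(latencies),
        "throughput_rps": len(requests) / wall,
    }
```

Tail percentiles (p99) rather than means are what matter for user-facing latency SLOs, which is why the harness reports both.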
Establish metrics, benchmarks, and experimentation frameworks to rigorously validate improvements. Provide technical leadership by setting direction for cross-team efforts at the intersection of inference, RL, and post-training, and mentor engineers and researchers on full-stack ML systems work and performance engineering.

Senior Applied AI Manager

The Senior Applied AI Manager owns the strategy and execution for AI science at Oumi.
This includes setting the applied science agenda, building and leading the team, and being accountable for the science quality of every feature shipped on the platform. The role covers the full model development lifecycle, including data strategy, pre-training and post-training methodology, evaluation science, and production deployment, as well as developing agentic systems that automate and improve each stage. The manager works closely with the CEO and product leadership to translate company strategy into a concrete AI science roadmap and executes it with a team of ML engineers and applied researchers.
Responsibilities include defining and driving the research and engineering roadmap; recruiting and managing a high-performing team; leading experimentation across the training stack; owning the data side of model development; designing evaluation frameworks and automated feedback loops; researching and developing agent-based systems for the training lifecycle; partnering with infrastructure and product teams to ensure reliable feature deployment; and contributing to open source and community collaborations.

Member of Technical Staff - Post Training, Applied (Vision)

Act as the technical owner for enterprise customer vision-language model (VLM) post-training engagements. Translate customer requirements into concrete multimodal post-training specifications and workflows.
Design and execute visual data generation, filtering, and quality assessment processes, including image-text pair curation, annotation pipelines, and synthetic data generation for visual tasks. Run supervised fine-tuning, preference alignment, and reinforcement learning workflows for vision-language models. Design task-specific evaluations for visual understanding, grounding, OCR, document parsing, and other multimodal capabilities.
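Task-specific evaluation for OCR and document parsing, as described above, is commonly scored with character error rate (CER): edit distance between the model's transcription and the reference, normalized by reference length. A self-contained sketch:

```python
def cer(reference, hypothesis):
    """Character error rate: Levenshtein distance (insertions, deletions,
    substitutions) divided by the reference length."""
    m, n = len(reference), len(hypothesis)
    prev = list(range(n + 1))          # DP row for the empty reference prefix
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            cur[j] = min(prev[j] + 1,        # deletion
                         cur[j - 1] + 1,     # insertion
                         prev[j - 1] + cost) # substitution / match
        prev = cur
    return prev[n] / max(m, 1)
```

Note that CER can exceed 1.0 when the hypothesis is much longer than the reference; grounding and document-parsing tasks would layer task-specific metrics (IoU, field-level exact match) on top of this kind of string-level score.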
Interpret evaluation results and feed learnings back into core post-training pipelines.

Member of Technical Staff - Applied ML, RecSys

Act as the technical owner for enterprise customer engagements involving recommendation and ranking workloads; translate customer requirements into concrete specifications for recommendation models; design and execute data pipelines for user interaction data, feature engineering, and training data curation at scale; fine-tune and adapt large-scale sequential recommendation models for customer-specific use cases; design task-specific evaluations for recommendation model performance and interpret results; and build reusable applied tooling and workflows that accelerate future customer engagements.

Member of Technical Staff - Post Training, Applied (Audio)

Act as the technical owner for enterprise customer post-training engagements involving audio and speech workloads, translating customer requirements into concrete post-training specifications for ASR, TTS, and speech-to-speech tasks; design and execute data generation, preprocessing, augmentation, and quality filtering processes for audio corpora; fine-tune and adapt audio/speech models for customer-specific use cases, owning delivery from requirements through deployment; design task-specific evaluations for audio model performance (noise robustness, speaker variation, latency) and interpret results; and build reusable applied tooling and workflows that accelerate future customer engagements.