
Data Engineer

Saarthee

Bengaluru | Full-time | Mid Level | On-site

Job Description

POSITION SUMMARY

We are seeking a Data Engineer who goes beyond pipeline execution to deliver true data solutioning and implementation. In this role, you will architect and build efficient Silver and Gold data layers, optimise compute costs through deep-dive parameter tuning, enforce robust data quality and governance practices, and orchestrate the Semantic Layer that enables enterprise data to be queried meaningfully and consistently. At Saarthee, we value foundational data and engineering principles over tool familiarity.

Whether your background is in Azure, Google Cloud, or AWS, we are looking for someone who understands how distributed computing works under the hood, and can fine-tune it for speed, cost, reliability, and accuracy.

ROLES & RESPONSIBILITIES

1. Architecture & Data Modelling

Design & Strategy: Collaborate with business stakeholders to define and document data structures, determining how data flows from Raw (Bronze) to Cleaned/Enriched (Silver) to the final Metrics (Gold) layer, ensuring the fastest possible time-to-insight and long-term scalability.
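By way of illustration, a minimal PySpark sketch of such a Bronze-to-Silver-to-Gold flow (all table names, columns, and metric definitions here are hypothetical, not a prescription for our stack):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("medallion-sketch").getOrCreate()

# Bronze: raw events landed as-is (hypothetical table name)
bronze = spark.read.table("bronze.raw_events")

# Silver: cleaned and enriched -- drop malformed rows, normalise types, dedupe
silver = (
    bronze
    .filter(F.col("event_id").isNotNull())
    .withColumn("event_ts", F.to_timestamp("event_ts"))
    .dropDuplicates(["event_id"])
)
silver.write.mode("overwrite").saveAsTable("silver.events")

# Gold: business-facing metrics aggregated per day and channel
gold = (
    silver
    .groupBy(F.to_date("event_ts").alias("event_date"), "channel")
    .agg(F.count("*").alias("events"),
         F.countDistinct("user_id").alias("unique_users"))
)
gold.write.mode("overwrite").saveAsTable("gold.daily_channel_metrics")
```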

Data Modelling: Architect extensible data models that decouple storage from compute, enabling flexibility as business requirements evolve.

AI-Readiness: Design, build, and implement the Semantic Layer, comprising metadata, definitions, context, relationships, and feature stores, required for Large Language Models (LLMs) to accurately interpret and query enterprise data.
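For illustration only, one way a single semantic-layer entry could be represented in Python (the schema below is an invented example, not a specific product's format):

```python
# Hypothetical semantic-layer entry: machine-readable metadata that lets an
# LLM (or a human) resolve a business metric to governed tables and joins.
revenue_metric = {
    "name": "net_revenue",
    "definition": "Gross bookings minus refunds and partner fees, in USD.",
    "grain": ["event_date", "channel"],
    "source_table": "gold.daily_channel_metrics",  # hypothetical table
    "expression": "SUM(gross_amount) - SUM(refunds) - SUM(partner_fees)",
    "relationships": [
        {"joins_to": "dim_channel", "on": "channel_id", "type": "many_to_one"},
    ],
    "synonyms": ["revenue", "net sales"],  # maps user phrasing to the metric
    "owner": "analytics-engineering",
}
```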

2. Engineering, Performance Tuning & FinOps

Data Engineering: Implement the aligned data architecture, ETL/ELT pipelines, Silver data aggregations, and Gold metric layers. Enforce RBAC/ABAC, row/column-level security, and PII handling through masking and tokenisation. Implement data dictionaries, metadata, relationship lineage, and data quality guardrails as part of the definition of done.

Compute Optimisation & Scalability: Own the cost-to-performance ratio by fine-tuning compute parameters (memory, cores, executors, partitions) across four dimensions (see the configuration sketch after this list):

Data Volume: Efficiently handling GBs to TBs of data.

Transformation Complexity: Optimising CPU-bound versus memory-bound tasks.

Data Movement: Minimising network I/O and shuffle operations.

SLA Management: Balancing resource costs against batch and real-time processing requirements.
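A configuration sketch along those four dimensions, in PySpark (every value is an illustrative assumption to be validated against the workload's own metrics in the Spark UI, not a recommended setting):

```python
from pyspark.sql import SparkSession

# Illustrative starting point for a large batch job; tune from measurements.
spark = (
    SparkSession.builder.appName("tuning-sketch")
    # Memory & cores: size executors so tasks neither starve nor idle
    .config("spark.executor.memory", "8g")
    .config("spark.executor.cores", "4")
    .config("spark.executor.instances", "20")
    # Data movement: keep shuffle partitions proportional to shuffled data
    .config("spark.sql.shuffle.partitions", "400")
    # Let adaptive query execution coalesce small shuffle partitions at runtime
    .config("spark.sql.adaptive.enabled", "true")
    .getOrCreate()
)
```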

Read Volume Optimisation: Ensure fit-for-purpose data reads to manage costs while producing metrics and aggregate layers at scale.

Scalable Architecture: Design and implement architectures that scale from a single show, channel, or partner to multiple, with minimal manual intervention or code change.

BAU Management: Manage, enhance, optimise, and resolve bugs across existing data pipelines.

Stack Migration: When the data stack changes, port data assets and pipelines to the new environment.

3. Operational Excellence

Data Quality: Implement automated data quality frameworks (e.g., Great Expectations, dbt tests) to detect nulls, schema drift, and anomalies before they reach the Gold layer.

Orchestration: Manage complex dependencies and workflows using orchestration tools (e.g., Apache Airflow, Dagster, ADF), including DAG SLA management, backfills, retries, and alert routing via PagerDuty or Teams.
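As a sketch, the quality-gate and orchestration pattern described above might look like this in Airflow (the DAG name, schedule, and check body are hypothetical; the retries, catchup-driven backfills, and task SLA are the point):

```python
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator

def check_quality():
    # Hypothetical check: fail the task (triggering retries and alerts)
    # if the Silver table has null keys or an unexpected schema.
    raise NotImplementedError("replace with real null/schema-drift checks")

default_args = {
    "retries": 2,                          # transient failures retry automatically
    "retry_delay": timedelta(minutes=10),
    "sla": timedelta(hours=1),             # late tasks surface as SLA misses
}

with DAG(
    dag_id="silver_quality_gate",          # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=True,                          # enables historical backfills
    default_args=default_args,
) as dag:
    PythonOperator(task_id="check_quality", python_callable=check_quality)
```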

DevOps & CI/CD: Apply software engineering best practices to data operations, including version control (Git), automated testing, and CI/CD pipelines for reliable deployment.

REQUIRED SKILLS & QUALIFICATIONS

Core Experience

Minimum 6–8 years in analytics engineering or data engineering, with at least 2 years in a solutioning and architecture capacity. Proven ability to apply deep technical expertise to diagnose business problems and solve them with clarity and urgency.

Note: This is not a research role; it demands strategising, application, and execution.

AI / LLM Readiness

Hands-on experience with Vector Databases, Knowledge Graphs, or building data layers specifically for Generative AI consumption. Experience building RAG (Retrieval-Augmented Generation) architectures and preparing structured/unstructured data for LLM pipelines.
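At its core, the retrieval step of a RAG architecture is embedding plus nearest-neighbour search; a self-contained NumPy sketch, where embed() is a stand-in assumption for a real embedding model or vector database:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Stand-in for a real embedding model (API or local encoder); a
    # hash-seeded random unit vector just keeps the sketch self-contained.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(8)
    return v / np.linalg.norm(v)

docs = ["refund policy for partners", "daily channel metrics definition"]
index = np.stack([embed(d) for d in docs])      # toy vector index

def retrieve(query: str, k: int = 1) -> list[str]:
    scores = index @ embed(query)               # cosine similarity (unit vectors)
    return [docs[i] for i in np.argsort(scores)[::-1][:k]]

# Retrieved chunks would then be passed to the LLM prompt as context.
print(retrieve("how are refunds handled?"))
```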

Exposure to vibe coding analytics UIs or dashboards using tools like Lovable or Claude for POC development.

Data Modelling & Storage

Strong proficiency in dimensional modelling: Star Schema, Snowflake Schema. Familiarity with modern data lakehouse table formats: Delta Lake, Apache Iceberg, Apache Hudi.
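A bare-bones star-schema illustration in Spark SQL, one central fact table joined to a conformed dimension (all table and column names are invented for the example):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("star-schema-sketch").getOrCreate()

spark.sql("""
    CREATE TABLE IF NOT EXISTS dim_channel (
        channel_id   INT,
        channel_name STRING
    )
""")
spark.sql("""
    CREATE TABLE IF NOT EXISTS fact_events (
        event_id   STRING,
        event_date DATE,
        channel_id INT,     -- foreign key into dim_channel
        user_id    STRING,
        amount     DOUBLE
    )
""")

# Typical star-schema query: facts grouped through a dimension attribute
spark.sql("""
    SELECT d.channel_name, SUM(f.amount) AS revenue
    FROM fact_events f
    JOIN dim_channel d ON f.channel_id = d.channel_id
    GROUP BY d.channel_name
""").show()
```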

Distributed Computing & Tech Stack

Strong command of distributed computing principles: Apache Spark, Hive, BigQuery. Deep understanding of DAGs, data shuffling, serialisation, and partition pruning. Proven deep-tuning experience: the ability to diagnose a spill-to-disk error or slow stage and confidently adjust the right configuration parameters.
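For instance, when the Spark UI shows a stage spilling to disk during a wide join, two common levers are shuffle parallelism and execution-memory headroom; a sketch under assumed values (illustrative, not recommendations):

```python
from pyspark.sql import SparkSession

# Symptom: shuffle spill (memory/disk) on a wide join stage in the Spark UI.
# Levers: raise shuffle parallelism so each task holds less data, and give
# execution memory more headroom. All values below are assumptions.
spark = (
    SparkSession.builder.appName("spill-fix-sketch")
    .config("spark.sql.shuffle.partitions", "1600")  # smaller per-task shuffle blocks
    .config("spark.memory.fraction", "0.7")          # more heap for execution/storage
    .getOrCreate()
)

# Partition pruning keeps reads fit-for-purpose: filtering on the partition
# column lets Spark skip whole files instead of scanning them.
events = spark.read.table("silver.events").filter("event_date = DATE'2024-01-01'")
```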

Programming

Proficiency in: Advanced SQL, Python, R, Spark, and Spark SQL. Good to have: Scala.

Mandatory Skills

Vector Databases, Knowledge Graphs, Generative AI Data Modelling, Dimensional Modelling (Star Schema, Snowflake), Apache Spark, Hive, BigQuery, Advanced SQL, Python, R, Spark SQL.
