Job Description

Role Overview

We are looking for a Lead Data Engineer to own and evolve the data infrastructure across the organisation. You will work directly with the Director of Data to set technical direction, mentor data engineers, and build production-grade systems across AWS, Python, and cloud-native data services. This role combines hands-on engineering with strategic leadership: you will drive architectural decisions for ASR evaluation pipelines, blockchain data ingestion, API integrations, and data platform evolution while building and guiding the data engineering function.

What You Will Work On

Technical Leadership & Architecture

- Define the technical roadmap for data infrastructure, including pipeline architecture patterns, tooling standards, and cloud data platform evolution
- Lead architectural reviews and design decisions for new data systems and integrations
- Establish engineering best practices: CI/CD for data pipelines, testing frameworks, code review standards, and monitoring and observability patterns
- Own the technical strategy for scaling data infrastructure to support 10x growth in data volume and downstream consumers

Team Leadership & Mentorship

- Mentor and upskill data engineers; conduct code reviews and pair programming sessions, and provide technical guidance
- Define hiring criteria and lead technical interviews for data engineering roles
- Foster a culture of ownership, quality, and continuous improvement within the data team
- Collaborate cross-functionally with ML engineers, backend engineers, and product teams to align data infrastructure with business objectives

Hands-On Engineering (60-70% of time)

- Benchmark pipeline: own and evolve the multi-provider ASR transcription system; architect audio preprocessing workflows, chunking logic, retry/error handling, and metrics computation (WER, CER, BERTScore, PIER, DER, CS Precision/Recall)
- AWS data lake: architect and manage the KGen data lake; design Athena query optimisation strategies, manage Glue crawlers and cataloguing, lead Apache Hudi table management, implement Lake Formation column-level permissions, and define S3 lifecycle policies
- ETL and ingestion: design and build scalable data ingestion frameworks from Google Forms, Twitch API, on-chain blockchain events (Aptos, BSC, Ethereum, Polygon), and third-party gaming analytics APIs into DynamoDB and PostgreSQL
- Airflow orchestration: architect DAG patterns, establish monitoring and alerting standards, debug complex pipeline failures, and optimise resource utilisation
- Cloud data transfers: design and manage large-scale S3-to-Google Drive transfers (rclone), cross-region data movement strategies, and vendor data sharing infrastructure
- Infrastructure and access management: own AWS IAM strategy, Lake Formation policies, and S3 bucket security; manage data engineer access controls; troubleshoot Superset permissions and connectivity issues
- QC and annotation tooling: extend the FastAPI-backed audio QC portal; architect data validation frameworks and quality-check automation across egocentric video and audio datasets
- Schema design & governance: lead the development of the Universal Data Schema (UDS) for audio, image, and code modalities in the Humyn Labs dataset marketplace; establish data governance and schema evolution practices

You Should Have

Required

- 7+ years in data engineering, with 2+ years in a technical lead or senior individual contributor role with mentorship responsibilities
- Proven leadership experience, either as a formal team lead or as a senior engineer who has mentored junior/mid-level engineers and driven technical direction
- Deep Python expertise: async patterns, subprocess management, API clients, distributed data processing, testing frameworks, and production debugging
- Advanced AWS proficiency (Athena, Glue, S3, DynamoDB, Lake Formation, IAM) with architectural decision-making experience, not just hands-on execution
- Apache Hudi or Delta Lake production experience: schema evolution, partition strategies, upserts, compaction, and time travel queries
- Strong SQL skills: query optimisation, indexing strategies, and execution plan analysis for large-scale analytical workloads
- Airflow expertise: DAG design patterns, custom operators, monitoring, resource management, and troubleshooting complex dependencies
- System design thinking: ability to architect end-to-end data systems, evaluate trade-offs, and document technical decisions
- Communication skills: able to articulate technical concepts to non-technical stakeholders and write clear design documents

Strong Plus

- Experience designing and scaling audio/media data pipelines (format conversion, metadata extraction, chunking, quality checks)
- Blockchain data engineering experience (on-chain events, wallet transactions, DEX swaps, indexing strategies)
- Large-scale file transfer and cloud-to-cloud sync pipelines (rclone, AWS DataSync, multi-cloud strategies)
- Infrastructure-as-code experience (Terraform, CloudFormation)
- Data quality frameworks and observability tools (Great Expectations, Monte Carlo, dbt)
- Experience building internal data platforms or self-service analytics tools

Job Title : Lead Data Engineer

Company : KGEN

Location : Bangalore, Karnataka

Created : 2026-04-15

Job Type : Full Time