
Job Title


Founding AI/ML Engineer (Co-Founder) | Build Transformer-Based Audio Foundation Models from Scratch


Company : BoloSuno


Location : Pune, Maharashtra


Created : 2025-08-14


Job Type : Full Time


Job Description

Role Details:
Title: Founding AI/ML Engineer (Co-Founder)
Focus: Transformer-based audio foundation models built from scratch (no fine-tuning of existing open-source models)
Location: Remote
Type: Full/part-time, founding team member

Apply here:

Us:
We're building next-generation audio foundation models using transformer architectures trained from the ground up. Think of the scale and complexity of the large language models behind advanced conversational AI, but engineered specifically for audio understanding and generation. We're not fine-tuning existing models; we're creating entirely new transformer-based foundation models for audio from first principles.

The Role:
We need a founding AI engineer with hands-on experience building transformer-based foundation models from scratch for audio applications. You'll be responsible for architecting, training, and scaling neural audio models comparable to those that power modern speech AI systems, but built entirely in-house.

What You'll Build:
Audio Transformer Architectures: Design and implement encoder-decoder and decoder-only transformer models specifically for audio processing, including self-attention mechanisms optimized for sequential audio data.
Foundation Model Training: Train large-scale audio foundation models (100M+ parameters) on diverse unlabelled audio datasets using self-supervised learning objectives such as contrastive learning and masked prediction.
Distributed Training Infrastructure: Implement multi-GPU/TPU training pipelines with model parallelism, gradient checkpointing, and mixed precision for training foundation models at scale.
Real-time Inference Systems: Deploy foundation models for low-latency audio processing with optimized serving infrastructure, quantization, and caching.

Must-Have Experience:
Transformer Architecture Expertise: Proven experience implementing transformer models from scratch (not using pre-built PyTorch/TensorFlow transformer classes), with a deep understanding of attention mechanisms, positional encoding, and layer normalization.
Audio Foundation Model Training: Direct experience training large neural networks on audio data (speech, music, or environmental sounds) from scratch, including dataset curation and training objective design.
Large-Scale Model Training: Hands-on experience with distributed training, managing training runs spanning weeks, hyperparameter optimization, and debugging convergence issues in models containing millions of parameters.
Audio Signal Processing: Strong background in digital audio processing, including sampling rates, spectrograms, mel-frequency analysis, and audio feature extraction methods.
Deep Learning Frameworks: Expert-level proficiency in PyTorch or JAX, with experience in custom model architectures, loss functions, and training loops.

Preferred Experience:
Self-Supervised Learning: Experience with contrastive learning, masked language modeling adapted for audio, or other unsupervised training objectives for foundation models.
Audio Applications: Background in automatic speech recognition, text-to-speech synthesis, audio generation, or speech understanding tasks.
Production Systems: Experience deploying large models in production, with attention to latency, throughput, and cost optimization.
Research Background: Publications or demonstrable research experience in transformer architectures, foundation models, or audio machine learning.

What’s on the table:
Early Builder Role – Shape Bharat’s first at-scale AI-led audio ecosystem from the ground up.
Founding-Level Trust – Work directly with the founder and core team; your voice matters in every decision.
Equity Ownership – Significant stake in the company (to be discussed on a call).
Freedom to Create – Architect AI models, pipelines, and infrastructure without bureaucracy.
Impact at Bharat-Scale – Build for 150+ crore voices, across 22+ languages & dialects, preserving cultural memory.
Growth & Visibility – Recognition in the ecosystem, conferences, and open innovation circles.
End-to-End Ownership – From research to deployment, you’ll see your work go live and scale.
❤️ Mission-Driven Work – Your AI will directly empower rural creators, storytellers, and everyday citizens.

Ready to build Bharat’s audio future? Apply here: