About the Company:
Granules is a fully integrated pharmaceutical manufacturer specializing in Active Pharmaceutical Ingredients (APIs), Pharmaceutical Formulation Intermediates (PFIs), and Finished Dosages (FDs), with operations in over 80 countries and a focus on large-scale pharmaceutical manufacturing. Founded in 1984 and headquartered in Hyderabad, India, the company is publicly traded and employs 5,001 to 10,000 people.

About the Role:
We are hiring an AI Engineer – Multimodal to design and build real-time multimodal/omni AI systems that generate audio, video, and language for conversational, human-like interfaces. The role focuses on developing models that tightly couple speech, visual behavior, and language to enable natural, low-latency interactions. You will work at the intersection of conversational AI, neural audio, and audio-visual generation, contributing both foundational research and production-ready systems. This is a hands-on role with strong ownership over technical direction.

Responsibilities:
- Research and develop multimodal/omni generation models for conversational systems, including neural avatars, talking heads, and audio-visual outputs.
- Build and fine-tune expressive neural audio / TTS systems, incorporating prosody, emotion, and non-verbal cues.
- Design and operate real-time, streaming inference pipelines optimized for low latency and natural turn-taking.
- Experiment with and apply diffusion-based models (DDPMs, LDMs) and other generative approaches for audio, image, or video generation.
- Develop models that align conversation flow with verbal and non-verbal behavior across modalities.
- Collaborate with applied ML and engineering teams to transition research into production-grade systems.
- Track, evaluate, and apply emerging research in multimodal and generative modeling.

Qualifications:
- Master’s or PhD (or equivalent hands-on experience) in ML, AI, Computer Vision, Speech, or a related field.
- 4–8 years of hands-on experience in applied AI/ML research or engineering, with a strong focus on multimodal and generative systems.

Required Skills:
- Strong experience modeling and generating human behavior, including facial expressions, affect, and speech, preferably in conversational or interactive settings.
- Deep understanding of sequence modeling across the video, audio, and language domains.
- Strong foundation in deep learning, including Transformers, diffusion models, and practical training techniques.
- Familiarity with large-scale model training, including LLMs and/or vision-language models (VLMs).
- Excellent programming skills in PyTorch, with hands-on experience in GPU-based training and inference.
- Proven experience deploying and operating real-time or streaming AI systems in production.
- Strong intuition for human-like speech and behavior generation, including diagnosing and improving unnatural outputs.

Nice to Have:
- Experience with long-form audio or video generation.
- Exposure to 3D graphics, Gaussian splatting, or large-scale training pipelines.
- Familiarity with production ML or software engineering best practices.
- Research publications in respected venues (e.g., CVPR, NeurIPS, ICASSP, BMVC).

Equal Opportunity Statement:
We are committed to diversity and inclusivity in our hiring practices.