
Job Title : AI Ops Engineer


Company : Diligente Technologies


Location : Bangalore, Karnataka


Created : 2026-04-10


Job Type : Full Time


Job Description

Looking for: AI Ops Engineer
Job Type: Full time
Location: Bangalore, India (Hybrid)

Description:

We are looking for an experienced AIOps / LLMOps Engineer to help design, deploy, and manage the AI infrastructure that powers next-generation AI systems. This role will focus on building and maintaining scalable, secure, and observable AI platforms that support generative AI applications and document intelligence workflows. You will work on self-hosted large language models, asynchronous inference pipelines, and production-grade ML infrastructure on AWS. You will collaborate closely with AI researchers, product teams, and backend engineers to ensure reliable and efficient deployment of AI systems across the platform.

What You Will Achieve and Key Responsibilities

AI Infrastructure & LLM Platform Development
• Design and build document AI platforms powered by generative AI, leveraging asynchronous architectures for scalable inference.
• Implement event-driven and queue-based systems to support elastic scaling and non-blocking AI workflows.
• Architect and maintain self-hosted LLM infrastructure using tools such as vLLM or Ollama on Kubernetes or EC2 with GPU orchestration.

LLM Operations & Model Governance
• Manage production systems for LLM serving, inference pipelines, and AI workflow orchestration.
• Implement LLM gateways and routing systems (e.g., LiteLLM, Portkey) to ensure proper model usage and governance.
• Develop guardrails and monitoring systems to reduce hallucinations, misuse, and unsafe outputs in generative AI systems.

Observability & AI System Monitoring
• Implement end-to-end observability for AI/ML pipelines using distributed tracing and monitoring tools.
• Monitor AI system health using platforms such as OpenTelemetry, AWS X-Ray, Prometheus, and Grafana.
• Track performance metrics including latency, token usage, inference quality, and model drift.

ML Platform & Workflow Management
• Manage machine learning workflows using tools such as MLflow, Kubeflow, or SageMaker-managed MLflow setups.
• Enable experiment tracking, model versioning, and deployment pipelines for production AI systems.
• Work closely with engineering teams to integrate AI workflows into scalable backend systems.

Security & Infrastructure Optimization
• Implement AI platform security controls including Bedrock Guardrails, KMS encryption, IAM least-privilege policies, VPC endpoints, and CloudTrail auditing.
• Optimize AWS infrastructure (including Bedrock, SageMaker, and EKS) for cost efficiency, performance, and reliability.
• Ensure production AI systems maintain high availability and security standards.

Why This Matters

Generative AI systems are only as powerful as the infrastructure that supports them. Building reliable, scalable AI platforms, especially those that serve complex document intelligence workloads, requires deep expertise in distributed systems, observability, and model operations. Our infrastructure powers AI systems that extract and organize construction product data at scale. Your work will enable the next generation of document AI and generative workflows that make critical industry information accessible, searchable, and actionable.

Required Qualifications
• Strong experience with AWS cloud infrastructure including services such as EC2, Lambda, S3, EKS, Bedrock, Step Functions, API Gateway, EventBridge, and SQS/SNS.
• Experience building ML infrastructure using Infrastructure-as-Code tools such as Terraform or CloudFormation.
• Hands-on experience deploying and operating LLM serving infrastructure using platforms such as vLLM or Text Generation Inference.
• Experience managing vector databases and retrieval systems such as Pinecone, pgvector, or Weaviate.
• Strong experience designing event-driven or asynchronous systems using queues (SQS, Kafka) and micro-batching patterns.
• Experience implementing observability and monitoring for distributed AI systems using tools such as ELK, Prometheus, Grafana, and OpenTelemetry.
• Strong programming experience in Python, including frameworks such as FastAPI and asynchronous programming patterns (asyncio).
• Experience with Docker, Kubernetes, and CI/CD pipelines using tools such as GitHub Actions or ArgoCD.

Experience
• 5+ years of experience in MLOps, LLMOps, AIOps, or DevOps supporting machine learning or AI systems.
• Proven track record building production generative AI systems with high availability and scalability.
• Experience deploying self-hosted LLMs on AWS infrastructure and building production-grade document AI platforms.
• Experience operating AI systems with >99.9% uptime and cost-efficient infrastructure management.

Preferred Qualifications
• Bachelor's or Master's degree in Computer Science, Data Science, or a related technical field.
• Experience working with multimodal LLMs or agent-based AI workflows.
• Familiarity with cost-optimized inference hardware and infrastructure such as AWS Trainium.