DevOps/AIOps Engineer (Platform)
Experience: 3–5 Years

About the Company
We aim to bring about a new paradigm in medical image diagnostics: intelligent, holistic, ethical, explainable, and patient‑centric. We're looking for innovative problem‑solvers who empathize with clinicians and patients, understand business problems, and can design and deliver reliable, intelligent products.

Key Responsibilities
· CI/CD for services & models: Own pipelines (GitHub Actions/GitLab CI), environment gates, artifact and version governance (containers, models, SBOMs), safe rollouts, and instant rollbacks.
· Kubernetes platform (EKS preferred): Operate multi-environment clusters; Helm/Kustomize; GitOps (Argo CD/Flux); progressive delivery (canary and blue-green via Argo Rollouts/Flagger).
· Serving & APIs: Deploy and tune FastAPI services and Triton/ONNX/TensorRT inference; traffic shaping, runtime configuration, and autoscaling signals.
· Event-driven orchestration: Build robust consumers and producers on RabbitMQ/ActiveMQ/Kafka with back-pressure, dead-lettering, idempotency, and retry patterns.
· Observability & AIOps: Define SLIs/SLOs and error budgets; metrics, logs, and traces (Prometheus/Grafana/Loki/Tempo/ELK); intelligent alerting and noise reduction; basic model/data drift hooks.
· Security in SDLC: Supply-chain security (image signing/provenance, SBOM scans), SAST/DAST/IaC scanning, policy-as-code (OPA/Gatekeeper), and secrets hygiene in pipelines and workloads.
· Data/Model platform integration: S3/MinIO for artifacts; integrate a model registry (MLflow or similar) into CD; immutable, traceable releases.
· Resilience & performance: Capacity planning (including GPUs), autoscaling (HPA/VPA/KEDA), caching and queue tuning; chaos experiments and game days; write runbooks and own incident response for platform services.
· Developer experience: Golden paths, starter repos, internal Helm charts, docs, and enablement to make shipping boring and fast.
· FinOps mindset: Cost dashboards, right-sizing, bin-packing, GPU utilization policies, and spot vs. on-demand strategy.

Skills and Qualifications (Required)
· 3+ years in DevOps/SRE/MLOps with strong Docker and Kubernetes fundamentals.
· Production CI/CD expertise, including canary and blue-green deployments and artifact and version management.
· IaC (Terraform) and GitOps workflows (Argo CD/Flux).
· Observability: Prometheus/Grafana; logs and traces with Loki/Tempo/ELK.
· Production experience with message queues (RabbitMQ/ActiveMQ/Kafka), including back-pressure and retries.
· Cloud experience (AWS/GCP/Azure), EKS preferred; object storage (S3/MinIO); model registries (MLflow or similar).
· Security in SDLC and compliance guardrails for PHI-like data (least-privilege IAM, secrets management, auditability).
· Incident response experience; writing SLIs/SLOs and runbooks, and operating to error budgets.
· Scripting for platform tasks (Python/Bash).

Preferred
· Triton Inference Server and ONNX/TensorRT optimizations; GPU scheduling on Kubernetes (NVIDIA device plugin, MIG, node pools).
· Argo Rollouts/Flagger, Karpenter, KEDA; caching layers (Redis/NVCache patterns).
· Policy-as-code (OPA/Gatekeeper), image signing (cosign), SBOM tools (syft/grype).
· Networking know-how for application delivery (ingress, service meshes, egress policies).

Education
BE/B.Tech (MS/M.Tech a bonus) or equivalent experience.

Location & Work Setup
On-site, Gurugram
Job Title
AIOps Engineer