Job Description

Title: Senior Software Engineer / SRE (Observability Focus)
Timings: 3:00 PM – 1:00 AM IST (Monday to Friday)
Work Mode: Remote

Role Summary

We are seeking a Senior Software Engineer / SRE (Observability Focus) to drive platform reliability, monitoring, and operational excellence. This role combines software engineering (60–70%) and site reliability engineering (30–40%), with a strong emphasis on Kubernetes-based environments and observability platforms. You will play a key role in owning and operating internal engineering platforms, improving system reliability, scalability, and performance across cloud-native and microservices architectures. The ideal candidate is proactive, takes end-to-end ownership, and drives continuous improvements rather than reactive support.

What You’ll Do

- Design and develop automation tools, services, and integrations to improve platform reliability and operational efficiency
- Implement and manage observability solutions (metrics, logs, tracing, dashboards, alerts) using platforms like Datadog, Prometheus, and Grafana
- Own and operate internal observability and monitoring platforms, ensuring reliability, scalability, and performance
- Work with Kubernetes environments to deploy, monitor, and optimize containerized applications
- Integrate observability into CI/CD pipelines to improve deployment visibility and system health
- Collaborate with engineering teams to enhance APM practices and reliability engineering standards
- Automate monitoring configurations and operational workflows using Python and scripting
- Support cloud-based observability by integrating AWS services with monitoring platforms
- Provide operational and training support for observability platforms (e.g., Datadog) used by engineering teams
- Proactively identify system bottlenecks and lead initiatives to improve availability, scalability, and performance

Key Requirements (Must-Have Skills)

- Strong programming skills in at least one of the following: Python, JavaScript (Node.js), or Java
- Hands-on experience with Kubernetes (deployment, operations, monitoring)
- Strong experience with observability tools, especially Datadog (preferred), Prometheus, and Grafana
- Experience with API integrations and working with distributed systems
- Solid understanding of monitoring, logging, and distributed tracing concepts
- Experience with AWS cloud services and cloud-native architectures
- Experience integrating observability into CI/CD pipelines
- Strong automation skills using scripting and infrastructure tooling
- Demonstrated experience in owning production systems/platforms, ensuring reliability and performance

Strongly Preferred

- Experience operating or owning an internal engineering or observability platform
- Proven track record of improving system reliability, scalability, and performance proactively
- Experience managing Datadog agents, API keys, access controls, and platform configurations
- Ability to lead incident response, troubleshooting, and performance optimization efforts
- Experience working in cross-functional teams and enabling engineering teams with observability best practices

Nice-to-Have

- Experience with Go (Golang)
- Familiarity with tools like New Relic, Dynatrace, Elastic Observability, or Splunk
- Knowledge of security and access management best practices in observability platforms
- Experience working in distributed microservices environments at scale

Job Title

Company : PeopleTree Knowledge Services

Location : Nagpur, Maharashtra

Created : 2026-04-29

Job Type : Full Time