Experience: 5+ Years
Work Mode: Work from office only

Job Description:

1. AWS Cloud Infrastructure
Design, deploy, and manage scalable, secure, and highly available systems on AWS.
Optimize cloud costs, enforce tagging, and implement security best practices (IAM, VPC, GuardDuty, etc.).
Automate infrastructure provisioning using Terraform or AWS CDK.
Ensure backup, disaster recovery, and high availability (HA) strategies are in place.

2. Kubernetes (EKS preferred)
Manage and scale Kubernetes clusters (preferably Amazon EKS).
Implement CI/CD pipelines with GitOps (e.g., ArgoCD or Flux) or traditional tools (e.g., Jenkins, GitLab).
Enforce RBAC policies, namespace isolation, and pod security policies.
Monitor cluster health and optimize pod scheduling, autoscaling, and resource limits/requests.

3. Monitoring and Observability (Datadog)
Build and maintain Datadog dashboards for real-time visibility across systems and services.
Set up alerting policies, SLOs, SLIs, and incident response workflows.
Integrate Datadog with AWS, Kubernetes, and applications for full-stack observability.
Conduct post-incident reviews using Datadog analytics to reduce MTTR.

4. Automation and DevOps
Automate manual processes (e.g., server setup, patching, scaling) using Python, Bash, or Ansible.
Maintain and improve CI/CD pipelines (Jenkins) for faster and more reliable deployments.
Drive Infrastructure-as-Code (IaC) practices using Terraform to manage cloud resources.
Promote GitOps and version-controlled deployments.

5. Linux Systems Administration
Administer Linux servers (Ubuntu, RHEL, Amazon Linux) for stability and performance.
Harden OS security, configure SELinux and firewalls, and ensure timely patching.
Troubleshoot system-level issues: disk, memory, network, and processes.
Optimize system performance using tools such as top, htop, iotop, and netstat.
Job Title
Site Reliability Engineer