Basic Scope of Job
As a Cloud & Server Engineer, you will be responsible for the administration, support, and optimization of Azure cloud, on-prem server, and Kubernetes cluster environments. You will take ownership of incidents, execute infrastructure changes, and contribute to the design, implementation, and maintenance of core infrastructure services, including cloud networking, storage, and backup solutions. You will also drive improvements in system performance, security, and cost optimization across both cloud and on-prem platforms.

Duties & Responsibilities

Cloud Management & Support
• Manage and support on-prem and Azure server infrastructure (VMs, OS, backups, storage, networking) and Kubernetes clusters with Rancher.
• Monitor cost and implement Azure governance practices (e.g., tagging, reserved instances).
• Maintain cloud security posture (e.g., pfSense, firewalls, identity/access).
• Automate operational tasks using scripting tools (PowerShell, Azure CLI, Logic Apps, Ansible).
• Perform patch management on Linux systems and ensure security compliance across environments.
• Monitor systems using tools such as Grafana, CheckMK, and Huawei DigitalView.
• Contribute to monthly/quarterly health reports and environment reviews.

Cloud Infra & Kubernetes Cluster Administration

1. Management & Maintenance
• Provision VMs, and install and configure development, staging, and production environments.
• Keep the virtual environment up to date and healthy through routine maintenance and housekeeping, and coordinate with vendors to resolve infrastructure-related issues.
• Set up virtual machines based on workload demands, including assigning virtual CPUs, memory, and storage.
• Establish virtual networks, VLANs, and subnets so that VMs and applications can communicate securely and efficiently.
• Manage virtual storage resources, ensuring high availability (HA), redundancy, and optimization across storage tiers (SSD, HDD).
• Support and manage M365.
• Manage user roles, privileges, and multi-factor authentication so that only authorized personnel can make changes or access critical resources.
• Create and configure Kubernetes clusters.
• Set up authentication, authorization, and cluster monitoring and logging.
• Monitor cluster health and performance using Prometheus or Grafana.
• Set up centralized logging.
• Support alerting configuration (Prometheus).
• Perform cluster upgrades and patching.
• Upgrade Kubernetes versions and components.
• Apply security patches to Kubernetes and container runtimes.
• Support scaling and resource management.
• Set resource requests and limits for containers.
• Manage node and pod failure handling (rescheduling).
• Test disaster recovery and backups.
• Manage Secrets and sensitive data.
• Implement network policies for communication control.
• Provide support during application deployments.
• Set up load balancers and Services.

2. Performance Tuning
• Regularly monitor CPU, memory, storage, and network utilization to prevent bottlenecks or resource exhaustion.
• Run diagnostic tools to verify system health and preemptively address potential hardware or software issues.

3. Backup and Recovery
• Design and implement regular backup strategies based on best practices.
• Backup Job Setup: Configure backup jobs to define the schedule, retention policy, and target repository for backup data.
• Backup Scheduling: Set up daily, weekly, or on-demand backups depending on business needs.
• Backup Integrity Check: Regularly verify that backups are successful and free from errors by running backup verification jobs.
• SureBackup: Test backups in an isolated environment to ensure that they are recoverable and operational.
• Restore Testing: Periodically restore files or entire virtual machines (VMs) to validate that the restore process works smoothly and quickly.
• Replication Jobs: Configure replication of VMs to another site for disaster recovery (DR) purposes.
• Failover and Failback: Test and perform failover to replicated environments in the event of a disaster, and fail back to the primary site once the issue is resolved.

4. Security and Access Control
• Manage user roles and privileges using least-privilege principles.
• Perform security hardening and ensure compliance.
• Implement SIEM on the system.

5. Replication and High Availability
• Configure the infrastructure's native HA features for automatic failover of VMs, and use replication or DR tools to ensure business continuity in case of a site failure.

6. Monitoring and Alerting
• Use tools like Grafana and CheckMK to monitor health and performance.
• Monitor logs and alerts to detect anomalies or failures in the infrastructure.

7. Automation and Scripting
• Automate routine tasks using Ansible.
• Schedule recurring jobs with cron or orchestration tools like Airflow.

8. Documentation and Standards
• Maintain detailed documentation of cloud environments and procedures.
• Document incidents, solutions, and changes made during troubleshooting for accountability and future reference.

Analytics & Visualization
• Analyze complex datasets to uncover trends, patterns, and actionable insights.
• Translate stakeholder requirements into KPIs, reports, and dashboards.
• Design, build, and maintain dashboards.
• Manage reporting layers, including deployment, version control, and performance tuning.
• Collaborate with Product Owners, Engineers, and Data Scientists to align data strategy with business goals.

Stakeholder Collaboration & Leadership
• Act as a liaison between technical teams and business stakeholders.
• Partner with internal teams to gather and refine reporting requirements.
• Mentor junior analysts and support a culture of data literacy.
• Standardize reporting across departments and ensure alignment with company-wide metrics.

Education & Qualification
• 5+ years of experience in infrastructure or cloud and Kubernetes administration roles.
• Strong experience in Unix OS and Azure IaaS components.
• Skilled in troubleshooting and resolving system and cloud performance issues.
• Experience with patching, backup (Veeam), DR (Azure Backup & ASR), and automation.
• Familiarity with scripting languages such as Python or Bash and the automation tool Ansible.
• Familiarity with monitoring platforms (Grafana, CheckMK).
• Knowledge of ITIL processes and change management.
• Relevant certifications preferred (e.g., RHCSA, CKA, AZ-900, AZ-104, AZ-500, AZ-700, VMware Certified Technical Associate (VCTA), VMware Certified Professional (VCP), Microsoft 365 Certified Fundamentals (MS-900)).

Please share CV to Z.Uddin@
Job Title
Sr. Infra Support Engineer - Remote / Immediate Joiner