Cloud Platform Engineer
HCLSoftware
Job Description
Please share your CV to [email protected] with the following details:
- Total experience:
- Current CTC:
- Expected CTC:
- Notice period:

Location: Noida, Pune, Bangalore, Hyderabad
Required experience: 3 to 8 years
Role: Cloud Platform Engineer, Kubernetes and Container Platforms

Position Overview
We are seeking a highly skilled Platform Engineer with deep expertise in Kubernetes and container orchestration platforms to join the AI & Intelligent Operations LOB. The successful candidate will own the end-to-end lifecycle of container platforms, from architecture and installation design through production operations, with a strong focus on enterprise-grade High Availability (HA), Disaster Recovery (DR), and security. The role spans managed Kubernetes services on AWS (EKS), Azure (AKS), and Google Cloud (GKE) as well as on-premises OpenShift deployments. It requires hands-on proficiency with Helm, Kubernetes-native configurations, cloud-native services, and Infrastructure-as-Code (IaC) toolchains.

Required Qualifications

Education
- Bachelor's or Master's degree in Computer Science or Information Technology.
- Equivalent practical experience considered for exceptionally strong candidates.

Experience
- 3 to 8 years of overall IT / software engineering experience, with a minimum of 2 years of hands-on Kubernetes platform engineering in production environments.
- Demonstrable experience deploying and operating at least two of EKS, AKS, GKE, and OpenShift in enterprise settings.
- Proven track record of designing and implementing HA/DR for containerized workloads at scale, and of Helm chart development.

Required Technical Skills
- Amazon EKS
- Azure AKS
- Kubernetes (upstream/vanilla)
- Red Hat OpenShift (OCP)
- Helm (chart authoring & management)
- Docker / Podman / Buildah / Kaniko
- Observability: Prometheus, Grafana, Loki, OpenTelemetry
- Bash / Python scripting

Preferred Qualifications
- Certified Kubernetes Administrator (CKA): strongly preferred.
- Certified Kubernetes Application Developer (CKAD): advantageous.
- Certified Kubernetes Security Specialist (CKS): highly desirable for senior profiles.
- Red Hat Certified OpenShift Administrator (EX280): preferred for OpenShift-heavy roles.
- Experience with multi-cluster management platforms: Rancher, Red Hat ACM, ArgoCD ApplicationSets, Cluster API (CAPI).
- Familiarity with eBPF-based observability and networking tools (Cilium, Hubble, Pixie).
- Contributions to open-source Kubernetes ecosystem projects or published Helm charts.
- Experience in regulated industries (BFSI, Healthcare, Government) with compliance frameworks: SOC 2, PCI-DSS, HIPAA, ISO 27001.

Key Responsibilities

1. Platform Architecture & Design
- Design scalable, highly available Kubernetes cluster architectures for EKS, AKS, GKE, and OpenShift environments aligned with enterprise workload requirements.
- Architect multi-region and multi-cluster topologies with active-active and active-passive HA patterns, including cross-cluster service discovery and traffic management.
- Define Disaster Recovery strategies: RTO/RPO target setting, cluster backup (Velero / OADP), etcd backup & restore, and regional failover runbooks.
- Produce Low-Level Design (LLD) and High-Level Design (HLD) documents, architecture decision records (ADRs), and capacity planning models.
- Design network topology: VPC/VNet design, CNI selection (Calico, Cilium, Flannel, OVN-Kubernetes), Network Policies, and Service Mesh integration (Istio / Linkerd).
- Define storage architecture: persistent volume strategies using CSI drivers, StorageClass selection, and RWX/RWO provisioning across EBS, EFS, Azure Disk, Azure Files, and GCP Persistent Disk.

2. Kubernetes Platform Installation & Configuration
- Install and configure production-grade Kubernetes clusters using kubeadm, kops, Rancher, or cloud provider managed services (EKS, AKS, GKE).
- Deploy and configure Red Hat OpenShift Container Platform (OCP 4.x) using IPI (Installer Provisioned Infrastructure) and UPI (User Provisioned Infrastructure) methods.
- Configure enterprise authentication integrations: LDAP/AD integration via OIDC (Dex, Keycloak), AWS IAM Roles for Service Accounts (IRSA), Azure AD Workload Identity, and GCP Workload Identity Federation.
- Implement Role-Based Access Control (RBAC) hierarchies (ClusterRoles, Roles, RoleBindings) aligned with the principle of least privilege and organizational IAM structures.
- Configure Kubernetes Admission Controllers, Pod Security Admission (PSA, the replacement for PodSecurityPolicy), and OPA/Gatekeeper or Kyverno policy engines for compliance enforcement.
- Set up cluster add-ons: CoreDNS, metrics-server, cluster-autoscaler, Karpenter, node-problem-detector, external-dns, cert-manager, and ingress controllers (NGINX, Traefik, AWS ALB, Azure Application Gateway Ingress).

3. Application Deployment on Kubernetes
- Own deployment pipelines for HCL Software product suites onto Kubernetes/OpenShift environments, including stateful and stateless applications.
- Author, maintain, and publish Helm charts with parameterized values, environment-specific overrides, and lifecycle hooks for complex application topologies.
- Implement GitOps deployment workflows using ArgoCD or Flux CD: manage ApplicationSets, multi-cluster deployments, and progressive delivery strategies (canary, blue-green).
- Configure Kubernetes workload resources: Deployments, StatefulSets, DaemonSets, Jobs, CronJobs, and HorizontalPodAutoscalers (HPA) / VerticalPodAutoscalers (VPA).
- Define and enforce resource requests/limits, namespace quotas (ResourceQuota, LimitRange), and QoS classes to optimize cluster utilization and stability.
- Implement Pod Disruption Budgets (PDBs), topology spread constraints, affinity/anti-affinity rules, and taints/tolerations for workload placement and resilience.
- Manage ConfigMaps, Secrets (with External Secrets Operator / Vault integration), and environment variable injection patterns following 12-factor application principles.

4. Enterprise HA, DR & Resiliency
- Design and implement control plane HA: multi-master etcd clusters with quorum management, etcd compaction, defragmentation, and backup automation.
- Configure node-level HA: node groups across multiple Availability Zones, managed node groups vs. self-managed nodes, and spot/preemptible instance strategies with fallback.
- Implement load balancer HA patterns: NLB/ALB for EKS, Azure Load Balancer + Application Gateway for AKS, and Cloud Load Balancing for GKE.
- Establish cluster-level DR procedures: namespace-scoped Velero backups to S3/Azure Blob/GCS, application-consistent snapshots, and tested restore runbooks.
- Design and document failover playbooks covering DNS cutover, PV data replication (Rook/Ceph, Portworx, Longhorn), and stateful application quorum management.
- Conduct Game Day exercises and DR drills; measure and report against SLO/SLA commitments.

5. Security Engineering
- Harden Kubernetes clusters per the CIS Kubernetes Benchmark and the NSA/CISA Kubernetes Hardening Guide; track compliance using kube-bench.
- Implement network segmentation with Kubernetes Network Policies and service mesh mTLS; enforce zero-trust network access within clusters.
- Integrate container image scanning (Trivy, Snyk, Aqua) into CI/CD pipelines; enforce registry policies to block vulnerable or unsigned images.
- Configure runtime threat detection using Falco; define and tune rule sets for anomalous syscall detection and container escape attempts.
- Manage PKI: configure cert-manager with internal/external CAs, and automate TLS certificate provisioning and rotation for ingress and internal service communication.
- Implement secrets lifecycle management with HashiCorp Vault (Vault Agent, Vault Secrets Operator), AWS Secrets Manager, or the Azure Key Vault CSI driver.
- Enforce image signing and supply chain security (cosign / Sigstore, Notary) for all production workloads.
- Conduct security reviews for new platform features; participate in penetration testing activities and remediate findings within SLA.

6. Cloud Infrastructure & IaC
- Write and maintain Terraform / OpenTofu modules for provisioning cloud infrastructure: VPCs, subnets, security groups, IAM roles, EKS/AKS/GKE clusters, node groups, managed database services, and DNS.
- Manage Terraform state backends (S3 + DynamoDB, Azure Blob, GCS); implement workspace strategies for multi-environment (dev/staging/prod) provisioning.
- Use Ansible for post-provisioning configuration management: OS hardening, prerequisite installation, cluster bootstrapping, and day-2 operations.
- Implement cloud cost optimization strategies: right-sizing, Spot/Preemptible adoption, cluster autoscaler tuning, and resource tagging governance.
- Maintain parity across AWS, Azure, and GCP deployments; abstract cloud differences using Terraform modules and Helm chart conditional logic.

7. Observability & Platform Operations
- Deploy and maintain observability stacks: Prometheus + Alertmanager, Grafana dashboards, Loki for log aggregation, and Jaeger/Tempo for distributed tracing.
- Define and configure Service Level Indicators (SLIs) and alert thresholds; build runbooks for alert response and escalation.
- Integrate OpenTelemetry collectors and auto-instrumentation for HCL product workloads.
- Manage cluster upgrades (minor and patch) using rolling upgrade strategies with zero-downtime requirements; validate compatibility matrices.
- Participate in on-call rotation; investigate and resolve production incidents; conduct post-incident reviews (PIR/RCA) and track corrective actions.