GKE is arguably the most mature managed Kubernetes offering available — but maturity does not equal simplicity. Every decision you make around cluster mode, node pool topology, workload identity configuration, upgrade channel, and cost allocation compounds over time. This guide distils the practices that matter most for teams operating stateful, latency-sensitive, or compliance-bound workloads on GKE Standard and GKE Autopilot.
This post consolidates the full set of best practices across every production-critical dimension: cluster foundation, scalability, governance, observability, security hardening, disaster recovery, performance, deployment safety, and long-term operational health. It is written for platform engineers, SREs, and architects who are preparing for or already running production workloads on Google Kubernetes Engine.
Table of Contents
- Cluster Foundation: Getting the Basics Right
- Scalability: Horizontal, Vertical, and Cluster-Level
- Governance: RBAC, Network Policies, and Policy Enforcement
- Security Hardening: Defense in Depth
- Secret Management: Never Store Secrets in Code
- Observability: Metrics, Logs, and Traces
- Disaster Recovery: Planning for Failure at Every Layer
- Performance: Tuning Compute, Storage, and JVM
- GitOps and Release Governance
- Deployment Safety: Protecting Production
- Cost Governance
- Compliance and Audit
- Day-2 Operations
- The Pre-Go-Live Checklist
1. Cluster Foundation: Getting the Basics Right
Every best practice downstream depends on a correctly structured cluster. Shortcuts here compound into outages later.
Standard vs Autopilot
Choose GKE Autopilot for stateless workloads where you want Google to manage node provisioning, security hardening, and bin-packing automatically. Choose GKE Standard when you need control over node machine types, kernel parameters, GPU attachment, or have stateful in-memory workloads with strict affinity requirements. Many production environments use Autopilot for stateless services and a dedicated Standard cluster (or node pool) for stateful grid nodes — a pattern that maximises operational simplicity without sacrificing control where it matters.
Regional Clusters and Zone Distribution
Always create Regional clusters — not zonal. A regional cluster spreads control plane replicas across three zones in the region, eliminating the zone as a single point of failure for the API server. Configure node pools to span all three zones and use topology spread constraints so no zone holds a disproportionate share of pods:
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: my-service
Node Pool Segmentation
Mixing memory-intensive stateful workloads with lightweight stateless services on the same node type is a false economy. Use dedicated node pools:
- System pool: n2-standard-4 for kube-system, CoreDNS, and add-ons. Taint these nodes so application pods do not land here.
- Compute pool: n2-highmem-16 for CPU and memory-intensive stateless workloads with horizontal scaling.
- Stateful pool: n2-highmem-32 for StatefulSets and in-memory grid workloads. These have precise off-heap memory requirements that must fit within instance memory with headroom for the OS and JVM overhead.
- Spot pool: n2-standard-8 (Spot VMs) for batch jobs, CI runners, and fault-tolerant workloads.
Taint the stateful pool with workload=stateful:NoSchedule and configure matching tolerations in your Helm values. On GKE Standard, use Node Auto Provisioning (NAP) with resource limits to allow GKE to create additional node pools automatically for burst workloads — but lock down NAP’s machine type and accelerator allowlist to prevent unexpected spend.
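As a rough sketch, the matching scheduling stanza in a pod template (or the equivalent Helm values) might look like the following. The pool name stateful-pool is an assumption for illustration; cloud.google.com/gke-nodepool is the label GKE sets on each node with the name of its pool.

```yaml
# Pod template fragment for workloads that belong on the stateful pool.
# "stateful-pool" is a hypothetical node pool name.
spec:
  nodeSelector:
    cloud.google.com/gke-nodepool: stateful-pool
  tolerations:
    - key: workload
      operator: Equal
      value: stateful
      effect: NoSchedule
```

The toleration only permits scheduling onto the tainted nodes; the nodeSelector is what pins the pods there.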
Private Cluster and VPC-Native Networking
Enable private cluster mode — nodes have no public IP, and the control plane is accessible only via Private Service Connect or authorised networks. Use VPC-native (alias IP) networking, not routes-based networking, for production. VPC-native enables Network Policy enforcement via GKE Dataplane V2 (eBPF-based, no iptables at scale), supports better pod IP management, and is required for GKE Autopilot. Enable master authorised networks even on private clusters as defence-in-depth.
2. Scalability: Horizontal, Vertical, and Cluster-Level
Horizontal Pod Autoscaler (HPA) for Stateless Workloads
HPA works well for stateless services. Use Google Cloud Managed Service for Prometheus metrics or custom metrics via the Custom Metrics Adapter to scale on business-relevant signals — not just CPU:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-service            # placeholder name
  namespace: production
spec:
  scaleTargetRef:             # required: the workload this HPA scales
    apiVersion: apps/v1
    kind: Deployment
    name: my-service
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 65
    - type: External
      external:
        metric:
          name: pubsub.googleapis.com|subscription|num_undelivered_messages
          selector:
            matchLabels:
              resource.labels.subscription_id: my-subscription
        target:
          type: AverageValue
          averageValue: "500"
Target 65% CPU, not 90%. At 90%, the scaling reaction is too slow — by the time HPA triggers, your pods are already saturated.
StatefulSet Scaling for In-Memory Grid Workloads
StatefulSets running in-memory data grids (Apache Ignite, Hazelcast, etc.) should not be scaled by HPA. Scaling out adds a new grid member, which triggers data rebalancing. This must be a deliberate, supervised operation, not an automatic response to a CPU spike. Scale StatefulSets via controlled Helm upgrades with human review, not automation.
For subordinate or calculator nodes that are compute-only and do not hold primary data, pod autoscaling via HPA or KEDA is appropriate, since these pods can be added and removed without data loss; the Cluster Autoscaler then adds or removes the underlying nodes.
Cluster Autoscaler and Node Auto Provisioning
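Enable the GKE Cluster Autoscaler on all user node pools. GKE manages the autoscaler for you, so upstream command-line flags such as --skip-nodes-with-local-storage cannot be set directly. To prevent premature eviction of stateful pods with in-memory data during scale-down, annotate those pods with cluster-autoscaler.kubernetes.io/safe-to-evict: "false" and protect them with a PodDisruptionBudget. Use Node Auto Provisioning (NAP) to allow GKE to dynamically create new node pools when existing pools cannot satisfy pending pod requirements, and constrain it with resource limits and machine type lists to maintain cost control.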
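One way to keep the autoscaler from evicting in-memory grid pods during scale-down is a pod-level annotation. The fragment below is a minimal sketch of where that annotation sits in a StatefulSet; everything else in the manifest is omitted.

```yaml
# StatefulSet fragment; only the relevant fields are shown.
spec:
  template:
    metadata:
      annotations:
        # Ask the Cluster Autoscaler not to remove a node running this pod.
        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
```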
Vertical Pod Autoscaler (VPA) in Recommendation Mode
Do not run VPA in Auto mode on stateful workloads: it restarts pods to apply changes, which causes data loss for in-memory grid members. Run it in Off mode to gather resource recommendations over 2–4 weeks of production load, then apply those recommendations to your Helm values deliberately. On GKE Autopilot, VPA is available on the cluster by default, but you still create VerticalPodAutoscaler objects per workload; Autopilot itself applies default and minimum resource requests rather than continuously right-sizing pods.
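A recommendation-only VPA can be sketched as follows. The object and workload names (grid-node-vpa, grid-node) are placeholders chosen for this example.

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: grid-node-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: grid-node
  updatePolicy:
    updateMode: "Off"   # recommend only; never evict or resize pods
```

The recommendations then appear in the object's status, for example via kubectl describe vpa grid-node-vpa.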
KEDA for Event-Driven Scaling
For workloads that scale to zero or respond to external event sources (Pub/Sub, Kafka, Cloud Tasks), deploy KEDA alongside HPA. KEDA’s Pub/Sub scaler is particularly useful for batch processing pipelines — it scales workers proportionally to the number of unacknowledged messages and scales to zero during quiet periods, eliminating idle compute cost.
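The sketch below shows roughly what a Pub/Sub-driven ScaledObject could look like. It assumes KEDA is installed, that its operator can authenticate through Workload Identity, and that the Deployment (batch-worker) and subscription (my-subscription) names are placeholders; check the trigger metadata against the KEDA version you run.

```yaml
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: keda-gcp-auth
  namespace: production
spec:
  podIdentity:
    provider: gcp            # use Workload Identity rather than a key file
---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: batch-worker
  namespace: production
spec:
  scaleTargetRef:
    name: batch-worker       # Deployment to scale
  minReplicaCount: 0         # scale to zero when the subscription is empty
  maxReplicaCount: 20
  triggers:
    - type: gcp-pubsub
      authenticationRef:
        name: keda-gcp-auth
      metadata:
        subscriptionName: my-subscription
        mode: SubscriptionSize   # scale on the undelivered message count
        value: "500"             # target messages per replica
```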
Topology Spread Constraints
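Topology spread constraints are more expressive than podAntiAffinity for spreading pods across zones:
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: your-service
With whenUnsatisfiable: DoNotSchedule, the scheduler refuses to place a pod if doing so would leave zones more than maxSkew apart in replica count, so replicas stay evenly spread at scheduling time. podAntiAffinity is coarser: the preferred form is best-effort, so the scheduler may still place two replicas in the same zone under pressure, while the required form allows at most one matching pod per zone.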
3. Governance: RBAC, Network Policies, and Policy Enforcement
RBAC with Google Workspace and Cloud Identity
Enable Google Groups for RBAC and map Google groups to Kubernetes RBAC roles through that integration; never create local Kubernetes users or static service account keys. Apply the principle of least privilege: operations teams receive view by default, and deployment pipelines receive edit scoped to specific namespaces only, authenticated with short-lived OIDC tokens from CI/CD systems:
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: deploy-pipeline-edit
  namespace: production
subjects:
  - kind: Group
    name: gke-admins@example.com
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: edit
  apiGroup: rbac.authorization.k8s.io
Audit ClusterRoles and Roles quarterly. Use kubectl auth can-i --list --as=system:serviceaccount:namespace:serviceaccount to enumerate what each service account can actually do.
Policy Controller (OPA/Gatekeeper)
Enable Policy Controller (part of the GKE Enterprise / Anthos Config Management suite) or deploy open-source OPA/Gatekeeper directly. Apply these constraints at minimum:
- Deny privileged containers
- Require resource requests and limits on all containers
- Require liveness and readiness probes
- Restrict allowed image registries to your Artifact Registry instances
- Enforce Pod Security Standards Restricted profile
- Require standard labels (app, team, environment) on all workloads
Manage Policy Controller configuration in Git via Config Sync so guardrails are version-controlled and applied consistently across fleet clusters.
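As an illustration of one of these guardrails, a required-labels constraint might look like the sketch below. It assumes the K8sRequiredLabels constraint template from the Gatekeeper / Policy Controller library is already installed and that its parameters follow that library's schema; the constraint name is a placeholder.

```yaml
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
  name: workloads-must-have-standard-labels
spec:
  enforcementAction: dryrun    # audit first; switch to deny once clean
  match:
    kinds:
      - apiGroups: ["apps"]
        kinds: ["Deployment", "StatefulSet", "DaemonSet"]
  parameters:
    labels:
      - key: app
      - key: team
      - key: environment
```

Starting in dryrun surfaces existing violations in the audit results without blocking deployments.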
Namespace Strategy and Resource Quotas
Organise namespaces by environment and team, not by application. Apply ResourceQuota and LimitRange to every namespace to prevent noisy-neighbour effects:
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
spec:
  limits:
    - type: Container
      default:
        cpu: "500m"
        memory: "512Mi"
      defaultRequest:
        cpu: "100m"
        memory: "128Mi"
Set quotas based on expected peak load plus a safety margin. This prevents a single runaway deployment from consuming all cluster resources and causing a cascade failure across other workloads.
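A matching ResourceQuota could be sketched as follows. The figures are purely illustrative and the quota name is a placeholder; size them from your own peak load.

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: production
spec:
  hard:
    requests.cpu: "40"
    requests.memory: 160Gi
    limits.cpu: "80"
    limits.memory: 160Gi
    pods: "200"
```

Once a quota covers CPU or memory, every new pod in the namespace must declare those requests and limits, which is why the LimitRange defaults matter.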
Network Policies via GKE Dataplane V2
GKE Dataplane V2 (the default on Autopilot clusters and an opt-in at cluster creation on Standard) uses eBPF for policy enforcement, which is more performant and observable than iptables-based approaches, and it enforces NetworkPolicy without a separate add-on. Adopt a default-deny posture in all production namespaces, then explicitly allow only the required traffic paths:
- Ingress controller → web-tier services
- Web-tier services → backend services
- Backend services → databases (via egress to Cloud SQL proxy CIDR)
- Grid members → grid members (within namespace)
- Monitoring agents → all pods (read-only scrape)
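Use the Kubernetes NetworkPolicy API for L3/L4 rules. For L7 policies, check what your Dataplane V2 version supports before relying on Cilium-specific policy resources, and consider service mesh authorization policies instead. Document every allow rule alongside a business justification in your policy-as-code repository.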
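A default-deny baseline plus one explicit allow rule can be sketched like this. The tier labels and port 8080 are assumptions for illustration.

```yaml
# Deny all ingress and egress for every pod in the namespace.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
---
# Allow web-tier pods to reach backend pods on one port.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-web-to-backend
  namespace: production
spec:
  podSelector:
    matchLabels:
      tier: backend
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              tier: web
      ports:
        - protocol: TCP
          port: 8080
```

Because egress is denied as well, the web tier also needs a matching egress rule, and most pods need an explicit allow for DNS lookups.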
4. Security Hardening: Defense in Depth
Pod Security Standards
Apply the Restricted Pod Security Standard to all production namespaces via namespace labels. GKE Autopilot enforces its own built-in security constraints, but those are not the same as the Restricted profile, so set the labels explicitly on both Autopilot and Standard clusters:
kubectl label namespace production \
  pod-security.kubernetes.io/enforce=restricted \
  pod-security.kubernetes.io/enforce-version=latest \
  pod-security.kubernetes.io/warn=restricted
The Restricted profile enforces non-root containers, no privilege escalation, all capabilities dropped, a RuntimeDefault (or Localhost) seccomp profile, and a limited set of volume types. It does not require a read-only root filesystem; treat that as an additional hardening step, with explicit writable volume mounts for paths such as /tmp. Well-built container images should already satisfy these constraints.
Container Security Context
Every container spec must include an explicit security context:
securityContext:
  runAsNonRoot: true
  runAsUser: 10001
  readOnlyRootFilesystem: true
  allowPrivilegeEscalation: false
  seccompProfile:
    type: RuntimeDefault   # required by the Restricted profile
  capabilities:
    drop: ["ALL"]
Mount writable paths as emptyDir volumes. For workloads that require stronger isolation, enable GKE Sandbox (gVisor) on sensitive node pools — it provides a kernel-level isolation boundary between the container and the host OS, reducing the blast radius of a container escape.
Image Supply Chain with Artifact Registry and Binary Authorization
Use Artifact Registry as your container registry with VPC Service Controls to restrict access. Enable Binary Authorization to enforce image signing at admission time — GKE will reject unsigned or untrusted images before they ever run:
- Set up a Binary Authorization policy requiring attestation from your CI/CD pipeline.
- Sign images with Cloud KMS asymmetric keys via Cosign or Kritis Signer in your pipeline.
- Scan every image with Artifact Analysis (Container Analysis API) — fail the pipeline on CRITICAL/HIGH CVEs.
- Generate an SBOM and attach it to the image manifest.
- Configure Binary Authorization in enforcement mode on production clusters.
Runtime Threat Detection
Enable Container Threat Detection (part of Security Command Center Premium) for runtime behavioural analysis. It monitors for suspicious process execution, privilege escalation, cryptomining, and anomalous network connections, and Google manages the detection agent so you do not maintain your own DaemonSets. For deeper custom rules, deploy Falco alongside it for a layered approach. Route all findings to Security Command Center and configure notification channels for CRITICAL findings.
mTLS with Anthos Service Mesh
Deploy Anthos Service Mesh (ASM) — the managed Istio distribution on GKE — in strict mTLS mode so all pod-to-pod communication is encrypted and mutually authenticated. ASM integrates directly with GKE’s control plane and is upgraded automatically. Enable Cloud Armor on the Ingress Gateway or GKE Gateway for WAF protection at the edge — Cloud Armor provides OWASP Top-10 rule sets, rate limiting, and geo-blocking without managing separate WAF infrastructure.
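Mesh-wide strict mTLS is typically expressed with a single PeerAuthentication in the mesh root namespace. The sketch assumes the root namespace is istio-system, which is the usual default but may differ in your installation.

```yaml
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system   # root namespace: applies mesh-wide
spec:
  mtls:
    mode: STRICT            # reject plaintext pod-to-pod traffic
```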
5. Secret Management: Never Store Secrets in Code
This deserves its own section because it is the most commonly violated security principle in enterprise Kubernetes deployments. Core rule: no secret, certificate, or connection string ever lives in a Kubernetes Secret object created by a human, checked into Git, or baked into a container image.
The Problem with Kubernetes Secrets
Kubernetes Secrets are base64-encoded, not encrypted. Anyone with read access to the etcd database or the Kubernetes API has access to all secrets. Even with etcd encryption at rest and RBAC restrictions, Kubernetes Secrets are not the right long-term store for sensitive credentials.
GKE Workload Identity
GKE Workload Identity federates Kubernetes Service Account tokens with Google Cloud IAM, eliminating the need for node-level service account keys or static credentials inside pods. Every pod that needs GCP access gets its own Kubernetes Service Account mapped to a dedicated IAM Service Account with the minimum required roles:
# Annotate the Kubernetes Service Account (namespace must match the IAM member below)
kubectl annotate serviceaccount my-service-sa \
  --namespace production \
  iam.gke.io/gcp-service-account=my-service@my-project.iam.gserviceaccount.com
# Grant the Workload Identity User role to that Kubernetes Service Account
gcloud iam service-accounts add-iam-policy-binding \
  my-service@my-project.iam.gserviceaccount.com \
  --role roles/iam.workloadIdentityUser \
  --member "serviceAccount:my-project.svc.id.goog[production/my-service-sa]"
External Secrets Operator and Google Secret Manager
Deploy the External Secrets Operator (ESO) with the Google Secret Manager provider. ESO syncs secrets from Secret Manager into Kubernetes Secrets automatically, including version rotation. This keeps the source of truth in Secret Manager while allowing existing applications that expect environment variables or volume mounts to continue working without code changes:
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: db-credentials
  namespace: production
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: gcp-secret-manager
    kind: ClusterSecretStore
  target:
    name: db-credentials-k8s
  data:
    - secretKey: db-password
      remoteRef:
        key: db-password     # secret name; the project comes from the ClusterSecretStore
        version: latest
Encrypt the resulting Kubernetes Secrets at rest using application-layer encryption backed by a Cloud KMS key — configure this at cluster creation time. Every secret access to Google Secret Manager is logged in Cloud Audit Logs. Configure log-based metric alerts for unexpected access patterns.
Certificate Lifecycle
Deploy cert-manager with a Let’s Encrypt or internal CA issuer for all TLS certificates. For GCP-managed certificates, use Google-managed SSL certificates on Cloud Load Balancing ingresses — they auto-provision and auto-renew with zero operational overhead. Set renewal thresholds at 30 days before expiry for cert-manager-managed certificates and monitor expiry as a Prometheus metric with alerts at 30 days, 14 days, and 7 days remaining.
6. Observability: Metrics, Logs, and Traces
Observability is not monitoring. Monitoring tells you when something is wrong. Observability tells you why. You need both.
The Three Pillars
Metrics
Enable Google Cloud Managed Service for Prometheus (GMP) on your GKE cluster. GMP is a fully managed, Prometheus-compatible metrics backend that scales to billions of samples per second without infrastructure management. It integrates natively with GKE and stores data in Cloud Monitoring. With managed collection, deploy PodMonitoring (or ClusterPodMonitoring) resources to scrape custom application metrics; these take the place of the PodMonitor and ServiceMonitor CRDs used by the Prometheus Operator. Use Cloud Monitoring Dashboards or Grafana (pointing at the GMP backend) for visualisation. GMP retains metric data for 24 months.
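With GMP managed collection, scrape targets are declared with a PodMonitoring resource. A minimal sketch, assuming the application exposes Prometheus metrics on a container port named metrics (a placeholder):

```yaml
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: my-service
  namespace: production
spec:
  selector:
    matchLabels:
      app: my-service
  endpoints:
    - port: metrics      # named container port that serves /metrics
      interval: 30s
```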
Logs
GKE automatically installs a Fluent Bit DaemonSet that ships container logs to Cloud Logging. Structure your logs as JSON from day one: Cloud Logging parses structured JSON natively, enabling powerful log-based metrics and alerting without additional parsing configuration. Annotate every log entry with the trace it belongs to, using the logging.googleapis.com/trace field, so Cloud Logging can correlate it with Cloud Trace:
{
  "severity": "INFO",
  "message": "Request processed",
  "logging.googleapis.com/trace": "projects/my-project/traces/abc123",
  "namespace": "production",
  "pod": "my-service-xyz",
  "latency_ms": 42,
  "status_code": 200
}
Configure Log Buckets with a custom retention policy and enable Log Analytics on the bucket for SQL-based log queries at scale.
Distributed Traces
Instrument services with the OpenTelemetry SDK and deploy the OTel Collector as a DaemonSet or sidecar. Route traces to Cloud Trace via the OTLP exporter. Cloud Trace provides latency distribution analysis, tail latency percentiles, and automatic correlation with Cloud Logging and Cloud Monitoring. Use the OpenTelemetry Operator's auto-instrumentation (its Instrumentation resource) to inject the OTel agent without code changes for Java, Python, and Node.js services.
SLO-Based Alerting
Define Service Level Objectives before you configure alerts. Alert on burn rate against your SLO — not on arbitrary threshold crossings. Example SLOs for an enterprise platform:
- Application availability: 99.9% over 30 days (allows 43 minutes of downtime)
- API P95 latency: < 500ms
- Processing job success rate: 99.5%
Define all alert policies in code using Terraform google_monitoring_alert_policy resources committed to Git. Route CRITICAL alerts to PagerDuty via Cloud Monitoring notification channels. Every alert must include a runbook_url annotation pointing to the corresponding incident runbook.
Dashboards
Every component should have a dashboard. Organise them in a hierarchy:
- Business layer: jobs completed, reports generated, active users
- Application layer: per-service latency, error rates, JVM metrics
- Infrastructure layer: node CPU/memory, disk I/O, network throughput
- Kubernetes layer: pod restarts, pending pods, PVC utilisation, HPA state
Version-control your dashboards as JSON in Git. Deploy them via Grafana’s provisioning mechanism (ConfigMaps) so they are reproducible across environments.
7. Disaster Recovery: Planning for Failure at Every Layer
Define Your RTO and RPO First
Recovery Time Objective (RTO) — how long can the system be unavailable? Recovery Point Objective (RPO) — how much data can you afford to lose? These are business decisions, not technical ones. Get explicit agreement from stakeholders before you design your DR architecture. Without them, you will either over-engineer (expensive) or under-engineer (catastrophic).
Multi-Region Is Not Multi-Zone
A multi-zone deployment (a regional GKE cluster) protects against a single zone failure. It does not protect against a GCP regional outage, an accidental mass-delete of resources, a ransomware attack, or a cluster upgrade gone wrong. For RPO < 1h and RTO < 30 min, run active-passive GKE clusters in two GCP regions. Use Cloud Load Balancing with global anycast for traffic routing: a single global IP routes to the nearest healthy regional backend automatically. Store all configuration in GitOps repositories so that the passive cluster can be bootstrapped from Git in under 15 minutes.
Database DR
Your Kubernetes workloads are only as available as your databases:
- Use Cloud Spanner (multi-region) or Cloud SQL with the high availability (regional) configuration enabled for synchronous replication across zones and automatic failover, typically within about a minute.
- Enable automated backups with a retention period that meets your RPO.
- Configure a cross-region read replica in your DR region. In a DR scenario, promote the read replica to a standalone instance.
- Test the failover process. An untested DR plan is not a DR plan.
Cluster-Level Backup with Velero
Deploy Velero with the GCP plugin, backed by Cloud Storage with multi-region buckets. Configure:
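# Schedules are created with "velero schedule create"; "velero backup create" takes no --schedule flag
velero schedule create production-daily \
  --schedule="0 2 * * *" \
  --include-namespaces production \
  --snapshot-volumes \
  --volume-snapshot-locations gcp-default \
  --storage-location gcs-primary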
Test restoration quarterly against a staging cluster. A backup that has never been restored is not a backup. Automate restore drills as a scheduled CI/CD job that runs a full cluster restore to an ephemeral environment and runs smoke tests against it.
Pod Disruption Budgets
A PodDisruptionBudget (PDB) is your safeguard against voluntary disruptions (node drains for maintenance, GKE version upgrades, Cluster Autoscaler scale-down) taking down too many replicas simultaneously. It does not protect against involuntary disruptions such as Spot VM preemption:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: grid-node-pdb
  namespace: production
spec:
  minAvailable: 2   # 3-node grid: always keep quorum
  selector:
    matchLabels:
      app: grid-node
Without PDBs, a node drain during a GKE version upgrade can evict all replicas of a service in parallel, causing a complete outage.
Graceful Pod Shutdown
Configure terminationGracePeriodSeconds generously — 120 to 300 seconds for JVM-based and in-memory stateful workloads. Add a preStop lifecycle hook that sleeps briefly to allow load balancer deregistration to propagate before the application begins shutdown:
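spec:
  terminationGracePeriodSeconds: 120   # pod-level field
  containers:
    - name: my-service
      lifecycle:                       # container-level field
        preStop:
          exec:
            command: ["/bin/sh", "-c", "sleep 5"]
Without this, a rolling update, node drain, or pod eviction can drop in-flight requests and, for stateful workloads, cause data loss or corrupt in-flight transactions.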
8. Performance: Tuning Compute, Storage, and JVM
Set Resource Requests and Limits Accurately
Set CPU requests at the realistic 50th-percentile consumption. For stateful pods, set requests equal to limits for both CPU and memory to get the Guaranteed QoS class, which makes them the last candidates for eviction under node memory pressure. For stateless services, consider omitting CPU limits (which can throttle a container even when the node has idle CPU) while keeping memory limits strict. The key rules:
- Always set requests, and always set memory limits. Pods with no requests or limits at all are BestEffort class and are the first to be evicted under node pressure.
- For JVM applications, the sum of -Xmx (heap) plus off-heap memory (direct buffers, metaspace, Ignite off-heap regions) must be less than the pod’s memory limit. If the total exceeds the limit, the pod will be OOMKilled by the kernel: you get no heap dump, no warning.
- Use VPA Recommendation mode output to calibrate requests and limits based on real production traffic.
Storage Class Selection
Choose storage classes based on access pattern:
- premium-rwo (Persistent Disk SSD) — general-purpose block storage for single-pod stateful workloads (ReadWriteOnce). Use for database data directories and WAL logs.
- standard-rwx (Filestore NFS) — fully managed NFS for ReadWriteMany PVCs where multiple pods need to access the same storage. Lower IOPS than block storage; do not use for latency-sensitive off-heap I/O.
- hyperdisk-balanced — sub-millisecond latency block storage for workloads requiring very high IOPS. Use for off-heap persistence and WAL-heavy workloads.
Always set reclaimPolicy: Retain on the StorageClass used by production PVCs, so that deleting a PVC does not delete the underlying disk. Use volumeBindingMode: WaitForFirstConsumer so GKE creates the disk in the same zone as the pod. For GKE Autopilot, explicitly set the storage class in PVC specs to avoid defaulting to slower standard disk tiers.
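Put together, a custom SSD-backed class with those two settings might look like the sketch below. The class name ssd-retain is a placeholder.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ssd-retain
provisioner: pd.csi.storage.gke.io     # Compute Engine Persistent Disk CSI driver
parameters:
  type: pd-ssd
reclaimPolicy: Retain                  # keep the disk when the PVC is deleted
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
```

With Retain, released disks are no longer cleaned up automatically, so include them in your cost and housekeeping reviews.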
JVM Tuning for Containerised Workloads
Use -XX:+UseContainerSupport (default in JDK 11+) so the JVM reads cgroup limits for heap sizing rather than host memory. For off-heap-heavy workloads, account for direct memory in your container memory limit. A workable sizing rule is: limit ≥ heap + off-heap + metaspace + thread stacks + 512 MiB headroom. Enable JVM Native Memory Tracking (-XX:NativeMemoryTracking=summary) and monitor its output to validate your sizing in the first week of production traffic.
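To make the sizing arithmetic concrete, here is a sketch of a container spec with an explicit memory budget. All numbers, the container name, and the image path are illustrative assumptions; off-heap regions configured in the grid's own settings must be added to the budget separately.

```yaml
# Container fragment. Budget in MiB (illustrative numbers):
#   heap 8192 + direct memory 6144 + metaspace 512
#   + thread stacks ~512 (about 500 threads x 1 MiB) + headroom 512
#   = 15872, which fits under the 16384 MiB (16Gi) limit.
containers:
  - name: grid-node
    image: us-docker.pkg.dev/my-project/my-repo/grid-node:1.0.0   # placeholder
    env:
      - name: JAVA_TOOL_OPTIONS
        value: >-
          -Xms8g -Xmx8g
          -XX:MaxDirectMemorySize=6g
          -XX:MaxMetaspaceSize=512m
          -XX:NativeMemoryTracking=summary
    resources:
      requests:
        cpu: "4"
        memory: 16Gi
      limits:
        cpu: "4"
        memory: 16Gi    # requests == limits gives Guaranteed QoS
```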
Node Performance Tuning
On GKE Standard, apply kernel parameter tuning via privileged init containers or DaemonSets for latency-sensitive workloads:
sysctl -w vm.max_map_count=262144    # required for Lucene, Ignite
sysctl -w net.core.somaxconn=65535   # larger connection queue
sysctl -w vm.swappiness=1            # near-zero swap for in-memory grids
ulimit -n 65535                      # file descriptor limit (applies only to this shell and its children)
On GKE Autopilot, kernel tuning is not permitted — choose machine types with higher network bandwidth (C3, N2) and rely on application-level tuning instead. Enable Compact Placement Policy for latency-sensitive StatefulSet pods that benefit from low inter-node latency within the same zone.
9. GitOps and Release Governance
Never Deploy Manually to Production
Manual helm upgrade commands run by engineers are error-prone, unauditable, and inconsistent. The definition of production infrastructure is: the state in Git is the truth. Config Sync (part of Anthos Config Management) provides a native, GKE-integrated GitOps controller that syncs cluster state from Git and integrates with Policy Controller for compliance enforcement. For teams wanting more GitOps features (UI, multi-cluster management, application health views), deploy ArgoCD on top.
Separate Config Repository from Application Repository
Store Helm values files for each environment in a dedicated configuration repository, separate from the application and chart repository. This separation means:
- A developer changing application code cannot accidentally change production configuration.
- Configuration changes have their own review and approval workflow.
- You can see exactly what configuration was applied to production at any point in history (git log).
- Pin all Helm chart versions and image tags; never use latest.
Helm Chart Validation in CI
Every change to a Helm chart should run through a validation pipeline before it is deployable to any environment:
# Lint
helm lint ./charts/my-service --strict
# Validate against Kubernetes API schema
helm template ./charts/my-service | kubeval --strict
# Security scan rendered manifests
helm template ./charts/my-service | kubesec scan -
# Detect deprecated API versions
helm template ./charts/my-service | pluto detect -
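# Validate rendered manifests against Policy Controller constraints (dry-run);
# ./policies is a placeholder for the directory holding your templates and constraints
helm template ./charts/my-service | gator test -f ./policies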
Environment Promotion and Branch Protection
Model promotions as pull requests from dev to staging to production branches in your config repository. Require automated integration test gates and smoke test results before a PR can be merged. Use branch protection with required reviewers for the production branch. Integrate with Google Cloud Deploy for pipeline orchestration — Cloud Deploy provides a managed release pipeline with approval gates, rollback support, and built-in audit logging of every promotion event.
10. Deployment Safety: Protecting Production
Progressive Delivery with Argo Rollouts or Cloud Deploy
Never cut over 100% of traffic to a new version in a single step. Use Argo Rollouts to implement a canary strategy for stateless application components, integrating with the GKE Gateway API or ASM for traffic weight splitting:
strategy:
  canary:
    steps:
      - setWeight: 5
      - pause: {duration: 5m}
      - analysis:
          templates:
            - templateName: error-rate-analysis
      - setWeight: 25
      - pause: {duration: 10m}
      - setWeight: 100
    canaryService: my-service-canary
    stableService: my-service-stable
    trafficRouting:
      istio:
        virtualService:
          name: my-service-vsvc
If any step produces degraded metrics, roll back automatically. This reduces the blast radius of a bad deployment from “all users affected” to “5% of users for 5 minutes.” Alternatively, use Google Cloud Deploy with canary targets — it integrates natively with GKE, provides deployment verification hooks, and records every release in Cloud Audit Logs automatically.
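The error-rate-analysis template referenced in the canary steps is not shown in this guide. A minimal sketch of what it could contain follows; the Prometheus address, the http_requests_total metric, and the 1% threshold are all assumptions to adapt to your own telemetry.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: error-rate-analysis
spec:
  metrics:
    - name: error-rate
      interval: 1m
      count: 5
      failureLimit: 1
      # Pass while fewer than 1% of requests return a 5xx status.
      successCondition: result[0] < 0.01
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc:9090   # hypothetical endpoint
          query: |
            sum(rate(http_requests_total{app="my-service",status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total{app="my-service"}[5m]))
```

If the canary receives no traffic, the query returns no usable value, so decide explicitly whether that should count as a failure or as inconclusive.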
Controlled StatefulSet Upgrades
For StatefulSets (database engines, grid members), use the partition field in the RollingUpdate strategy. Start by updating the highest-ordinal replica first. Verify it is healthy before updating the next. This gives you a manual checkpoint between each replica upgrade — critical for workloads where a bad upgrade might only manifest after the node joins the grid and begins data exchange. Automate validation with a post-upgrade Kubernetes Job that verifies membership count and data accessibility before proceeding.
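The partition mechanism can be sketched as follows for a three-replica grid. Only pods whose ordinal is greater than or equal to the partition value are moved to the new revision.

```yaml
# StatefulSet fragment for a 3-replica grid (ordinals 0, 1, 2).
spec:
  replicas: 3
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 2   # only pods with ordinal >= 2 receive the new spec
```

After the highest-ordinal pod is verified, lower partition to 1 and then to 0 to roll the remaining replicas one checkpoint at a time.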
Post-Deployment Smoke Tests
Run a Kubernetes Job as the final step of every deployment. It should exercise critical user journeys — authentication, core API endpoints, health checks — and emit a PASS/FAIL result that gates promotion. If it fails, Argo Rollouts automatically aborts the canary and reverts traffic to the stable version. Wire the smoke test Job exit code into your Cloud Deploy verification step for full traceability.
Built-In Memory Validation
For JVM workloads with complex memory configurations, add Helm template guards that validate memory configuration consistency while the chart is rendered. The pod memory limit must be greater than heap plus all off-heap regions. Fail the Helm install with a descriptive error message if this constraint is violated. Catching misconfiguration at deploy time is far better than diagnosing OOMKill events at 3 AM.
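One possible shape for such a guard is a template file that renders nothing and only checks values. The values keys used here (memory.limitMiB, memory.heapMiB, memory.offHeapMiB) are hypothetical; map them to whatever your chart actually exposes.

```yaml
{{- /* templates/validate-memory.yaml: renders nothing, only validates values */ -}}
{{- $limit := int64 .Values.memory.limitMiB -}}
{{- $needed := add (int64 .Values.memory.heapMiB) (int64 .Values.memory.offHeapMiB) -}}
{{- if le $limit $needed -}}
{{- fail (printf "memory.limitMiB (%d) must exceed heapMiB + offHeapMiB (%d)" $limit $needed) -}}
{{- end -}}
```

Because fail aborts rendering, the same check also trips during helm template and helm lint in CI, before anything reaches a cluster.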
11. Cost Governance
Spot VMs for Fault-Tolerant Workloads
Run stateless, batch, and CI/CD workloads on Spot VM node pools, with typical savings of 60-91% compared to on-demand. Implement Spot eviction handling: handle SIGTERM within your application and keep shutdown work well inside the 30-second preemption notice. GKE enables kubelet graceful node shutdown on Spot nodes, so pods receive SIGTERM on a best-effort basis without a separate interrupt-handler DaemonSet. On GKE Autopilot, request Spot capacity at the pod level by setting cloud.google.com/gke-spot: "true" in a nodeSelector or node affinity; no pool management is required.
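A pod-level Spot request can be sketched as below. The grace period value is an assumption chosen to stay under the 30-second notice; tune it to how long your application needs to shut down.

```yaml
# Pod template fragment requesting Spot capacity.
spec:
  nodeSelector:
    cloud.google.com/gke-spot: "true"
  terminationGracePeriodSeconds: 25   # stay below the 30-second notice
```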
Committed Use Discounts for Baseline
Purchase Compute Engine Committed Use Discounts (CUDs) — 1- or 3-year commitments on vCPU and memory — for your baseline system and stateful node pools. Pair with Spot VMs for burst. This combination typically reduces compute costs by 40-60% compared to on-demand pricing. Use Active Assist Committed Use Discount Recommendations in the Google Cloud Console to identify optimal commitment sizes based on historical usage.
Rightsize Before You Scale
The most common waste in Kubernetes is over-provisioned resource requests. Pods request 4 CPU but average 0.3 CPU in production. The scheduler sees 4 CPU reserved and will not schedule other workloads on that node capacity. Use VPA recommendations and Kubernetes resource utilisation metrics to bring requests within 20% of actual average usage. This alone typically reduces your node count by 30-40% on mature platforms.
Cost Allocation with Labels
Apply consistent labels to all GCP resources and Kubernetes workloads. Enable GKE Cost Allocation (at no extra cost) to attribute GKE compute spend to namespaces and labels in Cloud Billing reports:
nodeLabels:
  cost-center: "engineering"
  team: "platform"
  environment: "production"
  application: "my-service"
Use Cloud Billing budgets and alerts with label filters for per-team chargebacks. Export billing data to BigQuery and build cost trend dashboards in Looker Studio.
Storage Lifecycle Management
Configure Cloud Storage lifecycle management rules to transition backup data from Standard to Nearline after 30 days, to Coldline after 90 days, and to Archive after 365 days. For Cloud Logging, set log bucket retention to 90 days, and use the locked retention feature for compliance-mandated logs that must not be deleted. In log-heavy environments, these retention and tiering rules can reduce storage costs by 60% or more.
12. Compliance and Audit
Cloud Audit Logs
Admin Activity audit logs are on by default in Cloud Audit Logs; additionally enable Data Access logs for GKE API calls. Route audit logs to a dedicated Cloud Logging bucket with a locked retention policy (minimum 12 months; financial regulations often require 7 years). Create log-based metric alerts for high-risk API calls: exec into pods, secret reads, ClusterRoleBinding creations, and workload identity annotations on service accounts.
Security Command Center and CIS Benchmark
Enable Security Command Center (SCC) Premium at the organisation level. SCC continuously assesses GKE clusters against the CIS GKE Benchmark and surfaces findings with severity ratings and remediation guidance. Run kube-bench in your CI pipeline as an additional gate. Set a target of zero HIGH-severity SCC findings before any workload promotes to production. Enable the GKE Security Posture dashboard for a real-time view of vulnerability and misconfiguration findings across your cluster fleet.
Software Bill of Materials (SBOM)
Generate an SBOM for every container image using Syft or the Artifact Analysis SBOM API. Store SBOMs in Artifact Registry alongside image manifests. Enable Artifact Analysis continuous scanning — it alerts you when a new CVE is published for a package in a container that is currently running in your cluster, not just at build time. Route CRITICAL findings to Security Command Center and your security team’s on-call channel. In the event of a zero-day vulnerability disclosure, an SBOM allows you to immediately query which of your running images are affected.
VPC Service Controls
Wrap GKE, Artifact Registry, Cloud Storage, Secret Manager, and Cloud KMS in a VPC Service Controls perimeter. This prevents data exfiltration by enforcing that API calls to these services can only originate from within the perimeter — even if an attacker obtains valid credentials, they cannot exfiltrate data from outside the perimeter boundary. Test perimeter rules in dry-run mode before enforcing to avoid breaking legitimate access patterns.
13. Day-2 Operations
GKE Version Currency and Upgrade Strategy
GKE offers four release channels: Rapid, Regular, Stable, and Extended. Use Regular for production (approximately 2-4 weeks behind Rapid) and Stable for the most conservative environments. Define a structured upgrade cadence:
- Non-production clusters: Rapid channel with auto-upgrade enabled — always on latest.
- Production clusters: Regular or Stable channel; test in staging first; notify application teams 2 weeks in advance.
- Upgrade node pools with --max-surge-upgrade 1 --max-unavailable-upgrade 0 for stateful pools.
- Validate that PDBs prevent evictions below quorum before triggering an upgrade.
- Run smoke tests after each node pool upgrade completes.
Enable Maintenance Windows to restrict when GKE can perform automatic node upgrades — set them to your low-traffic window (e.g., 02:00-06:00 on weekdays).
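The quorum guard from the upgrade list can be sketched as a PodDisruptionBudget. The example assumes a hypothetical three-replica stateful workload labelled app: grid-node that needs two members for quorum; adjust minAvailable to your own quorum size.

```yaml
# Keeps at least 2 of 3 replicas running during voluntary disruptions such as node drains.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: grid-node-quorum        # placeholder name
spec:
  minAvailable: 2               # assumption: quorum of 2 in a 3-member cluster
  selector:
    matchLabels:
      app: grid-node            # placeholder label; must match the workload's pods
```

Before an upgrade, kubectl get pdb should show at least one allowed disruption for the budget; zero allowed disruptions means a drain will block rather than proceed safely.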
Dependency Currency
Maintain a software inventory of Helm chart versions, container image versions, and cluster add-on versions. Use Renovate Bot or Dependabot to raise automatic PRs for chart version bumps. Review and merge weekly. Stale dependencies are the leading cause of zero-day exposure windows — a freshly discovered CVE in a base image from 18 months ago is a completely avoidable incident. Establish an SLA for patching: critical CVEs within 72 hours, high within 14 days, medium within 30 days.
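As one possible starting point for the weekly review cadence, the Dependabot configuration below raises pull requests for Dockerfile base images and GitHub Actions versions. It assumes the repository is hosted on GitHub with Dockerfiles at the repository root; Helm chart coverage depends on the tool you choose and is not shown here.

```yaml
# .github/dependabot.yml (sketch)
version: 2
updates:
  - package-ecosystem: "docker"          # base images referenced in Dockerfiles
    directory: "/"                       # assumption: Dockerfile lives at the repo root
    schedule:
      interval: "weekly"
    open-pull-requests-limit: 5
  - package-ecosystem: "github-actions"  # pinned action versions in workflows
    directory: "/"
    schedule:
      interval: "weekly"
```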
Runbook Culture
Every alert must link to a runbook. A runbook is a step-by-step guide for diagnosing and resolving the alert condition. Store runbooks in Git (not a wiki that goes stale), version-control them, and review them quarterly. A good runbook includes: symptom description, Cloud Logging queries to diagnose, kubectl commands for investigation, escalation path, and rollback procedure. Automate runbook steps where possible using Cloud Workflows or GitHub Actions triggered by alert webhook notifications — reducing mean time to recovery by removing the human bottleneck for repeatable incidents.
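To make the alert-to-runbook link concrete, here is a sketch of a Prometheus-style alerting rule group carrying a runbook_url annotation. The metric name assumes kube-state-metrics is being scraped, the threshold is illustrative, and the URL is a placeholder for wherever your runbooks live in Git.

```yaml
# Prometheus rule group; wrap it in whichever rules resource your setup uses.
groups:
  - name: workload-health
    rules:
      - alert: PodCrashLooping
        # assumption: kube-state-metrics exposes this counter
        expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Container {{ $labels.container }} in {{ $labels.namespace }}/{{ $labels.pod }} is restarting repeatedly"
          runbook_url: "https://git.example.com/platform/runbooks/blob/main/pod-crash-looping.md"  # placeholder
```

Keeping the URL in the rule itself means the notification always carries the link, and a CI check can fail any rule that lacks the annotation.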
Chaos Engineering
After your DR and resilience controls are in place, validate them with controlled failure injection using Google Cloud Fault Injection or open-source tools like Litmus Chaos. Inject:
- Node failures (drain a random node in a pool)
- Zone outages (block network traffic to/from a zone)
- Pod failures (kill random pods in a deployment)
- Latency injection (add 500ms to database calls via ASM fault injection)
Run chaos experiments in staging first. Graduate to production during low-traffic windows with a defined abort condition. The goal is not to cause outages — it is to find weaknesses before your customers find them.
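The latency experiment in the list above can be sketched with an Istio VirtualService, which is the API that ASM fault injection builds on. The example assumes the calls to delay are HTTP requests to a hypothetical in-mesh service named orders-db-proxy in a staging namespace; HTTP fault injection does not apply to raw TCP database connections.

```yaml
# Adds a fixed 500ms delay to a fraction of HTTP requests to one service.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: orders-db-proxy-delay   # placeholder name
  namespace: staging            # assumption: experiment runs in staging first
spec:
  hosts:
    - orders-db-proxy           # placeholder in-mesh service
  http:
    - fault:
        delay:
          percentage:
            value: 10.0         # start small: delay 10% of requests
          fixedDelay: 500ms
      route:
        - destination:
            host: orders-db-proxy
```

Deleting the VirtualService removes the fault, which gives the experiment a simple abort action.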
14. The Pre-Go-Live Checklist
Use this as a gate before any workload promotes to production on GKE:
Cluster Foundation
- [ ] Regional cluster spanning all 3 zones (not zonal)
- [ ] Private cluster enabled; master authorised networks configured
- [ ] VPC-native (alias IP) networking with GKE Dataplane V2 enabled
- [ ] System node pool isolated from user workloads with taints
- [ ] Cluster Autoscaler or Autopilot autoscaling configured
- [ ] Application-layer secret encryption with Cloud KMS enabled at cluster creation
Security
- [ ] Workload Identity enabled; no static service account keys in pods
- [ ] All secrets sourced from Secret Manager via External Secrets Operator
- [ ] Pod Security Standard Restricted enforced on all production namespaces
- [ ] All containers run as non-root with read-only root filesystem
- [ ] Binary Authorization in enforcement mode; all images signed via Cloud KMS
- [ ] Artifact Analysis continuous scanning enabled; zero CRITICAL/HIGH CVEs
- [ ] GKE Threat Detection enabled
- [ ] ASM mTLS strict mode enforced between all services
- [ ] Cloud Armor WAF enabled on Ingress / GKE Gateway
- [ ] Default-deny Network Policies applied via GKE Dataplane V2
- [ ] VPC Service Controls perimeter covering all GCP services
- [ ] Falco deployed for additional runtime rule coverage
Reliability
- [ ] All Deployments have at least 3 replicas spread across 3 zones
- [ ] PodDisruptionBudgets defined for every production workload
- [ ] HPA configured with CPU and custom/external metrics
- [ ] Liveness, readiness, and startup probes configured and tuned
- [ ] terminationGracePeriodSeconds appropriate for each workload’s shutdown time
- [ ] Velero backup configured and restore tested successfully against staging
- [ ] Multi-region DR failover tested end-to-end
- [ ] Maintenance Windows configured to restrict auto-upgrades
Observability
- [ ] Google Cloud Managed Prometheus scraping all pods and nodes
- [ ] Structured JSON logs shipping to Cloud Logging with Log Analytics enabled
- [ ] Distributed traces reaching Cloud Trace with trace_id correlation
- [ ] Alerts defined for error rate, P99 latency, pod crash loops, OOM kills
- [ ] Every alert has a runbook_url annotation
- [ ] On-call rotation configured via Cloud Monitoring notification channels
- [ ] Certificate expiry monitoring active with alerts at 30/14/7 days
Deployment
- [ ] GitOps controller (Config Sync / ArgoCD) managing all production deployments
- [ ] Helm chart lint, kubeval, kubesec, and pluto passing in CI
- [ ] Post-upgrade smoke tests implemented as Kubernetes Jobs
- [ ] Memory configuration guards active in Helm templates
- [ ] Canary strategy configured for stateless deployments via Argo Rollouts or Cloud Deploy
- [ ] StatefulSet partition upgrade procedure documented and tested
- [ ] Rollback procedure documented and tested
Cost and Compliance
- [ ] Resource quotas and LimitRanges applied to all production namespaces
- [ ] GKE Cost Allocation enabled; labels applied to all resources
- [ ] Cloud Billing budget alerts configured per team
- [ ] Committed Use Discounts purchased for baseline node pools
- [ ] Cloud Storage lifecycle policies configured for backups and logs
- [ ] Cloud Audit Logs (Admin Activity + Data Access) flowing to locked log bucket
- [ ] Security Command Center Premium enabled; zero HIGH findings
- [ ] kube-bench CIS GKE Benchmark: zero HIGH findings
- [ ] SBOM generated and stored for all production images
- [ ] GKE release channel configured; upgrade runbook documented and tested
- [ ] Chaos experiment baseline executed; recovery validated
Closing Thoughts
The gap between a GKE cluster that runs your workloads and a GKE cluster that runs your workloads reliably, securely, and efficiently at scale is significant. Each practice in this guide addresses a specific failure mode — an outage that happened to someone, a security incident that cost a company dearly, a compliance finding that required emergency remediation.
Not every practice needs to be in place before your first production deployment. Prioritize by risk:
- Security controls (Workload Identity, Binary Authorization, non-root containers, Secret Manager) — these prevent irreversible incidents.
- Reliability controls (PDBs, probes, graceful shutdown, Velero backups) — these protect your SLA.
- Observability (Managed Prometheus, Cloud Logging, Cloud Trace) — these give you the ability to diagnose and recover quickly.
- Scalability and performance tuning — these can be iteratively improved after go-live.
- GitOps, cost governance, and chaos engineering — important for long-term operational health but not blocking for initial launch.
GKE’s deep GCP service integration — Workload Identity, Binary Authorization, GKE Dataplane V2, Managed Prometheus, Config Sync — means many of these practices require less custom infrastructure than on other clouds. But they still require deliberate configuration and ongoing discipline. Treat your infrastructure configuration as software: version it, review it, test it, and document it. The cluster you deploy to production today will be maintained by engineers who are not present in this planning discussion. Leave them a system they can understand, operate, and improve.






