AKS Production Best Practices
A comprehensive, opinionated guide to running enterprise Kubernetes workloads on Azure Kubernetes Service — from cluster foundation through Day-2 operations.
Table of Contents
- Cluster Foundation
- Scalability Architecture
- Governance & Policy
- Security Hardening
- Secret Management with Azure Key Vault
- Observability Stack
- Disaster Recovery & High Availability
- Performance & Storage
- GitOps & Release Engineering
- Deployment Safety
- Cost Governance
- Compliance & Audit
- Day-2 Operations
- Pre-Go-Live Checklist
Running Kubernetes in production on Azure is not simply a matter of creating a cluster and deploying workloads. Every decision — node pool topology, secret rotation strategy, upgrade cadence, cost allocation — compounds over time and directly affects reliability, security posture, and total cost of ownership. This guide distils the practices that matter most for teams operating stateful, latency-sensitive, or compliance-bound workloads on AKS.
1. Cluster Foundation
Availability Zone Distribution
Spread every node pool across all three Azure Availability Zones. AKS provisions nodes via Virtual Machine Scale Sets (VMSS); configure zones: ["1","2","3"] on each node pool. Pair this with a topology spread constraint on your workloads so pods never pile up in a single zone:

    topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app: my-service

Node Pool Segmentation
Use dedicated node pools per workload class, not a single mixed pool:
| Pool | VM SKU | Mode | Purpose |
|---|---|---|---|
| system | Standard_D4s_v5 | System | kube-system, CoreDNS |
| compute | Standard_E16s_v5 | User | CPU/memory-intensive workloads |
| stateful | Standard_E32s_v5 | User | StatefulSets, in-memory grids |
| spot | Standard_D8s_v5 | User / Spot | Batch, CI runners, fault-tolerant jobs |
Taint the stateful pool with workload=stateful:NoSchedule and add matching tolerations in your Helm values to prevent accidental co-location.
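As a rough sketch of the matching toleration, the fragment below could sit under the pod template of a StatefulSet. It assumes the node pool is literally named stateful, as in the table, and relies on the kubernetes.azure.com/agentpool label that AKS applies to nodes; adjust both to your own naming.

```yaml
# Pod template fragment (spec.template.spec of the workload).
# Assumption: the node pool is named "stateful".
nodeSelector:
  kubernetes.azure.com/agentpool: stateful   # pin pods to the stateful pool
tolerations:
- key: workload
  operator: Equal
  value: stateful
  effect: NoSchedule                         # matches the taint workload=stateful:NoSchedule
```

The toleration only permits scheduling onto the tainted pool; the nodeSelector is what keeps these pods off other pools.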
Private Cluster & Network Model
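Enable private cluster mode so the API server is not reachable from the public internet. Use Azure CNI Overlay (or Azure CNI with a dedicated subnet) rather than kubenet for production: it avoids double NAT and supports Network Policy enforcement. Authorized IP ranges apply only to a public API server endpoint, so if a cluster must keep one, restrict it to known egress addresses; for private clusters, limit who can reach the private endpoint with network security groups and peering rules as the defence-in-depth measure.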
2. Scalability Architecture
Horizontal Pod Autoscaler (HPA)
HPA is effective for stateless services. Set both minReplicas and maxReplicas, and use custom metrics from Azure Monitor or Prometheus rather than relying solely on CPU:
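
    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: my-service
      namespace: production
    spec:
      scaleTargetRef:            # required: the workload this HPA scales
        apiVersion: apps/v1
        kind: Deployment
        name: my-service
      minReplicas: 3
      maxReplicas: 20
      metrics:
      - type: Resource
        resource:
          name: cpu
          target:
            type: Utilization
            averageUtilization: 65
      - type: Pods               # custom per-pod metric served by a metrics adapter
        pods:
          metric:
            name: http_requests_per_second
          target:
            type: AverageValue
            averageValue: "500"

Cluster Autoscaler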
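Enable the AKS Cluster Autoscaler on every user node pool. Tune it through the cluster autoscaler profile (az aks update --cluster-autoscaler-profile), which applies to the whole cluster rather than to individual pools: scale-down-delay-after-add=10m and scale-down-unneeded-time=5m give aggressive scale-down to control cost. Set skip-nodes-with-local-storage=true so that nodes hosting in-memory grid pods with local storage are not removed prematurely.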
Vertical Pod Autoscaler (VPA)
Run VPA in recommendation-only mode first (updateMode: "Off"). Feed its output into your Helm values to right-size requests/limits before enabling Auto mode. Never run VPA Auto on StatefulSets with in-memory state: VPA evicts and recreates pods to resize them, which loses that state.
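A minimal sketch of a recommendation-only VPA object follows. In the upstream VPA API this behaviour corresponds to updateMode: "Off". The target Deployment name my-service is a placeholder, and the VPA components are assumed to be installed in the cluster already.

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-service-vpa          # placeholder name
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-service            # placeholder workload
  updatePolicy:
    updateMode: "Off"           # compute recommendations only; never evict pods
```

The recommendations appear in the object's status (for example via kubectl describe vpa my-service-vpa) and can be copied into Helm values by hand or by a pipeline step.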
StatefulSet Scaling
For in-memory grid members, scale conservatively. podManagementPolicy: Parallel gives faster scale-out because the controller launches pods without waiting for their predecessors, so raise the replica count in small steps and validate cluster membership after each scale event before proceeding to the next. Add minReadySeconds so that a new member counts as available only after it has stayed ready for that long.
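The two settings might be combined as in the sketch below. The image, port, probe path, and the 30-second value are illustrative assumptions, not recommendations from this guide.

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: grid-node
  namespace: production
spec:
  serviceName: grid-node              # headless Service assumed to exist
  replicas: 3
  podManagementPolicy: Parallel       # pods start without waiting for lower ordinals
  minReadySeconds: 30                 # pod must stay Ready this long to count as available
  selector:
    matchLabels:
      app: grid-node
  template:
    metadata:
      labels:
        app: grid-node
    spec:
      containers:
      - name: grid
        image: myregistry.azurecr.io/grid-node:1.0.0   # hypothetical image
        readinessProbe:
          httpGet:
            path: /ready              # hypothetical endpoint reporting grid membership
            port: 8080
```

Because Parallel does not serialise pod creation, the pacing has to come from how you change replicas, for example one increment per pipeline step followed by a membership check.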
3. Governance & Policy
RBAC with Microsoft Entra ID Integration
Enable AKS-managed Entra ID integration and map Entra ID groups to Kubernetes RBAC roles — never create local Kubernetes users in production. Use the principle of least privilege: operations teams get view by default, deployment pipelines get scoped edit on specific namespaces only.

    apiVersion: rbac.authorization.k8s.io/v1
    kind: RoleBinding
    metadata:
      name: deploy-pipeline-edit
      namespace: production
    subjects:
    - kind: Group
      name: <entra-group-object-id>
      apiGroup: rbac.authorization.k8s.io
    roleRef:
      kind: ClusterRole
      name: edit
      apiGroup: rbac.authorization.k8s.io

Azure Policy & Gatekeeper
Enable the Azure Policy Add-on for AKS. It deploys OPA/Gatekeeper and connects to Azure Policy so you can enforce guardrails centrally. Apply at minimum:
- Deny privileged containers
- Require resource requests and limits
- Require liveness and readiness probes
- Restrict allowed image registries to your ACR
- Enforce pod security baseline (or restricted) profile
Complement with Kyverno for mutation policies (e.g., automatically injecting sidecar configurations, adding standard labels) where Gatekeeper’s purely validating model is insufficient.
Namespace Strategy & Resource Quotas
Organise namespaces by environment and team, not by application. Apply ResourceQuota and LimitRange to every namespace in production to prevent noisy-neighbour effects:

    apiVersion: v1
    kind: LimitRange
    metadata:
      name: default-limits
    spec:
      limits:
      - type: Container
        default:
          cpu: "500m"
          memory: "512Mi"
        defaultRequest:
          cpu: "100m"
          memory: "128Mi"

Network Policies
Deploy Calico or Azure Network Policy Manager and adopt a default-deny posture in all production namespaces. Explicitly allow only the east-west traffic your services require. This limits blast radius if a pod is compromised. Document every allow rule in your policy-as-code repository alongside a justification.
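A default-deny posture plus one explicit allow rule could look roughly like this. The frontend label, port 8080, and the justification annotation key are hypothetical; the annotation is just one way to keep the rationale next to the rule.

```yaml
# Deny all ingress and egress for every pod in the namespace.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}               # empty selector = all pods in the namespace
  policyTypes:
  - Ingress
  - Egress
---
# Explicitly allow frontend pods to reach my-service on TCP 8080.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-my-service
  namespace: production
  annotations:
    policy.example.com/justification: "frontend calls the my-service REST API"  # hypothetical key
spec:
  podSelector:
    matchLabels:
      app: my-service
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend         # hypothetical caller label
    ports:
    - protocol: TCP
      port: 8080
```

Denying egress also blocks DNS lookups, so a real rollout normally needs an additional egress rule towards the cluster DNS pods before workloads can resolve names.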
4. Security Hardening
Pod Security Standards
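Apply the Restricted Pod Security Standard to all production namespaces via namespace labels. The built-in Pod Security Admission controller then enforces non-root containers, no privilege escalation, dropped capabilities, and a seccomp profile, without requiring a separate admission controller. Read-only root filesystems are not part of the standard and still have to be set per container. Label the namespace as follows: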

    kubectl label namespace production \
      pod-security.kubernetes.io/enforce=restricted \
      pod-security.kubernetes.io/enforce-version=latest \
      pod-security.kubernetes.io/warn=restricted

Container Security Context
Every container spec should include:

    securityContext:
      runAsNonRoot: true
      runAsUser: 10001
      readOnlyRootFilesystem: true
      allowPrivilegeEscalation: false
      capabilities:
        drop: ["ALL"]
      seccompProfile:
        type: RuntimeDefault    # the Restricted profile rejects pods without a seccomp profile

Mount writable paths explicitly as emptyDir volumes for ephemeral scratch space rather than writing to the container layer.
Image Supply Chain Security
Use Azure Container Registry (ACR) with geo-replication to your deployment regions. Enable Content Trust / Notation signing and configure AKS to reject unsigned images via Azure Policy. In your CI pipeline:
- Scan images with Microsoft Defender for Containers or Trivy at build time — fail the pipeline on CRITICAL/HIGH CVEs
- Generate an SBOM (e.g., with Syft) and attach it to the image manifest
- Sign the image with your pipeline identity using Notation + Azure Key Vault
- Enforce signature verification at admission time via Ratify
Runtime Threat Detection
Enable Microsoft Defender for Containers on your AKS cluster. It provides runtime detection of suspicious process activity, cryptomining, privilege escalation, and suspicious network connections — alerts integrate directly into Microsoft Defender XDR and Azure Monitor. For deeper behavioural analysis, deploy Falco as a DaemonSet alongside Defender for layered coverage.
mTLS Between Services
Deploy Istio (or Linkerd) in strict mTLS mode so all pod-to-pod communication is encrypted and mutually authenticated. AKS now offers a managed Istio add-on (az aks mesh enable) which simplifies lifecycle management. Couple mTLS with network policies — mTLS ensures encryption and identity; network policies enforce access control at the kernel level.
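For Istio, strict mTLS is usually expressed with a PeerAuthentication resource. The sketch below scopes it to the production namespace. Depending on the Istio revision the API version may be v1 rather than v1beta1, so check the CRDs installed in your cluster.

```yaml
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: production         # namespace-wide policy: no selector is set
spec:
  mtls:
    mode: STRICT                # reject plaintext traffic to sidecar-injected pods
```

Applying the same resource in the mesh root namespace makes it mesh-wide. Starting with PERMISSIVE and switching to STRICT once all clients carry sidecars avoids breaking callers that are not yet in the mesh.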
5. Secret Management with Azure Key Vault
Workload Identity (Replaces Pod-Managed Identity)
Use AKS Workload Identity — the successor to AAD Pod Identity. It federates Kubernetes Service Account tokens with Entra ID using OIDC, eliminating long-lived credentials entirely:

    # Create managed identity
    az identity create --name my-workload-identity --resource-group my-rg
    # Federate with Kubernetes Service Account
    # (my-aks and my-federated-credential are placeholder names)
    az identity federated-credential create \
      --name my-federated-credential \
      --identity-name my-workload-identity \
      --resource-group my-rg \
      --issuer "$(az aks show --name my-aks --resource-group my-rg --query oidcIssuerProfile.issuerUrl -o tsv)" \
      --subject "system:serviceaccount:production:my-service-sa"

CSI Secret Store Driver
Install the Secrets Store CSI Driver with the Azure Key Vault provider and mount secrets as volumes. With enableSecretRotation: true the driver polls Key Vault and refreshes the mounted files without a pod restart; the application still has to re-read the file, and values injected as environment variables only change when the pod restarts:

    apiVersion: secrets-store.csi.x-k8s.io/v1
    kind: SecretProviderClass
    metadata:
      name: akv-secrets
    spec:
      provider: azure
      parameters:
        usePodIdentity: "false"
        clientID: <managed-identity-client-id>
        keyvaultName: my-keyvault
        tenantId: <tenant-id>
        objects: |
          array:
            - |
              objectName: db-password
              objectType: secret
              objectVersion: ""

Sync secrets to native Kubernetes Secrets only when an application strictly requires environment variable injection, and ensure those Kubernetes Secrets are encrypted at rest using a customer-managed key stored in Azure Key Vault.
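To show how the pieces connect, here is a sketch of the my-service-sa service account and a pod that mounts the akv-secrets SecretProviderClass. The annotation and label keys are the ones Workload Identity uses; the image name and mount path are assumptions.

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: my-service-sa
  namespace: production
  annotations:
    azure.workload.identity/client-id: <managed-identity-client-id>
---
apiVersion: v1
kind: Pod
metadata:
  name: my-service
  namespace: production
  labels:
    azure.workload.identity/use: "true"     # opts the pod in to token injection
spec:
  serviceAccountName: my-service-sa
  containers:
  - name: app
    image: myregistry.azurecr.io/my-service:1.0.0   # hypothetical image
    volumeMounts:
    - name: secrets
      mountPath: /mnt/secrets               # assumed path; db-password appears as a file here
      readOnly: true
  volumes:
  - name: secrets
    csi:
      driver: secrets-store.csi.k8s.io
      readOnly: true
      volumeAttributes:
        secretProviderClass: akv-secrets
```

In practice the pod template would live in a Deployment or StatefulSet and carry the security context described under Security Hardening; a bare Pod is used here only to keep the example short.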
Certificate Lifecycle
Deploy cert-manager with an Azure Key Vault or Let’s Encrypt issuer for all TLS certificates. Never manually manage certificate renewals. Set renewal thresholds at 30 days before expiry. Store CA certificates in Key Vault and reference them via CSI — never as opaque Kubernetes Secrets.
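The 30-day renewal threshold maps onto the renewBefore field of a cert-manager Certificate. In this sketch the issuer name letsencrypt-prod and the DNS name are placeholders, and the ClusterIssuer is assumed to exist already.

```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: my-service-tls
  namespace: production
spec:
  secretName: my-service-tls      # Secret that cert-manager creates and keeps renewed
  duration: 2160h                 # requested lifetime: 90 days
  renewBefore: 720h               # start renewal 30 days before expiry
  dnsNames:
  - my-service.example.com        # placeholder hostname
  issuerRef:
    name: letsencrypt-prod        # hypothetical ClusterIssuer
    kind: ClusterIssuer
```

Some issuers ignore the requested duration and hand out a fixed lifetime, so it is worth confirming that renewBefore stays shorter than what the issuer actually grants.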
6. Observability Stack
Metrics: Prometheus + Azure Monitor
Enable Azure Monitor managed Prometheus (the AKS Monitoring add-on). It scrapes cluster metrics without managing Prometheus infrastructure yourself and stores data in an Azure Monitor Workspace. Layer Grafana (Azure Managed Grafana) on top for dashboards. For custom application metrics, expose a /metrics endpoint from every service and add a PodMonitor or ServiceMonitor CRD. Set retention to 90 days minimum for production.
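A ServiceMonitor for a service exposing /metrics might look like the sketch below. The API group shown is the one the Azure Monitor managed Prometheus add-on is documented to use for its copies of the CRDs; a self-managed Prometheus Operator uses monitoring.coreos.com/v1 instead. Treat the group, the port name, and the labels as assumptions to verify against your cluster.

```yaml
apiVersion: azmonitoring.coreos.com/v1   # assumption: managed add-on CRD group
kind: ServiceMonitor
metadata:
  name: my-service
  namespace: production
spec:
  selector:
    matchLabels:
      app: my-service                    # must match labels on the Service object
  endpoints:
  - port: http-metrics                   # hypothetical named port on the Service
    path: /metrics
    interval: 30s
```

A PodMonitor has the same shape with podMetricsEndpoints in place of endpoints, which is useful for workloads that have no Service in front of them.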
Logs: Fluent Bit → Log Analytics
Deploy Fluent Bit as a DaemonSet to collect container logs and ship to Azure Log Analytics. Structure your logs as JSON from day one — unstructured text logs are costly to query at scale. Annotate every log line with namespace, pod, container, and a correlation_id for distributed tracing. Set up Log Analytics Workspace data export to Azure Blob Storage for long-term retention required by compliance frameworks.

    [OUTPUT]
        Name         azure
        Match        *
        Customer_ID  ${LOG_ANALYTICS_WORKSPACE_ID}
        Shared_Key   ${LOG_ANALYTICS_SHARED_KEY}
        Log_Type     ContainerLogs
        time_key     time

Distributed Tracing: OpenTelemetry → Azure Monitor
Instrument services with the OpenTelemetry SDK and deploy the OTel Collector as a sidecar or DaemonSet. Route traces to Azure Monitor Application Insights via the OTLP exporter. Application Insights provides end-to-end transaction search, dependency maps, and failure analysis without vendor lock-in at the instrumentation layer. Use the auto-instrumentation mutating webhook where available to avoid code changes in existing services.
Alerting
Define alerts in code using Azure Monitor Alert rules or Prometheus AlertManager, committed to Git. Alert on symptoms, not causes — high error rate, elevated latency P99, pod crash-loop rate — not on CPU thresholds. Route critical alerts to PagerDuty or your on-call tool via Action Groups. Every alert must have a corresponding runbook link in its annotations.
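As an illustration of symptom-based alerts kept in Git, the following uses the standard Prometheus rule-file format. The http_requests_total metric with a status label and the runbook URLs are assumptions; kube_pod_container_status_restarts_total comes from kube-state-metrics.

```yaml
groups:
- name: my-service-symptoms
  rules:
  - alert: HighErrorRate
    # Ratio of 5xx responses to all responses over the last 5 minutes.
    expr: |
      sum(rate(http_requests_total{job="my-service",status=~"5.."}[5m]))
        / sum(rate(http_requests_total{job="my-service"}[5m])) > 0.05
    for: 10m
    labels:
      severity: critical
    annotations:
      summary: "my-service 5xx ratio above 5% for 10 minutes"
      runbook_url: https://example.com/runbooks/my-service-high-error-rate   # placeholder
  - alert: PodCrashLooping
    # More than 3 container restarts within 15 minutes in the namespace.
    expr: increase(kube_pod_container_status_restarts_total{namespace="production"}[15m]) > 3
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Container restarting repeatedly in production"
      runbook_url: https://example.com/runbooks/pod-crash-loop               # placeholder
```

The thresholds are illustrative; derive real ones from your SLOs rather than copying these numbers.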
7. Disaster Recovery & High Availability
Multi-Region Strategy
For RPO < 1h and RTO < 30 min, run active-passive AKS clusters in two Azure regions. Use Azure Front Door or Azure Traffic Manager for global traffic routing with health probes. Store all configuration in GitOps repos — the passive cluster should be bootstrappable from Git within minutes. For stateful data tiers, use Azure SQL Managed Instance Business Critical tier with geo-replication, or Azure Cosmos DB with multi-region writes where the data model permits.
Cluster State Backup with Velero
Deploy Velero with the Azure plugin. Back up Kubernetes resource manifests and Persistent Volume snapshots to Azure Blob Storage with geo-redundant storage (GRS):

    # --schedule belongs to "velero schedule create";
    # "velero backup create" runs a single one-off backup.
    velero schedule create production-daily \
      --schedule="0 2 * * *" \
      --include-namespaces production \
      --snapshot-volumes \
      --storage-location azure-primary

Test restoration quarterly: a backup that has never been restored is not a backup. Schedule automated restore drills to a staging cluster.
Pod Disruption Budgets
Every production workload must have a PDB. For most services, minAvailable: "50%" is appropriate. For StatefulSet grid members where quorum matters, calculate your minimum quorum size and set PDB accordingly — never allow a cluster upgrade to drain below quorum:

    apiVersion: policy/v1
    kind: PodDisruptionBudget
    metadata:
      name: grid-node-pdb
      namespace: production
    spec:
      minAvailable: 2   # 3-node grid: always keep quorum
      selector:
        matchLabels:
          app: grid-node

Graceful Shutdown
Configure terminationGracePeriodSeconds generously for stateful workloads — 120 to 300 seconds is common for JVM-based services that need to drain in-flight requests and flush state. Implement a preStop hook that sleeps for 5 seconds to allow load balancer deregistration to propagate before the application starts shutting down:

    # Container level
    lifecycle:
      preStop:
        exec:
          command: ["/bin/sh", "-c", "sleep 5"]
    # Pod level (spec.terminationGracePeriodSeconds)
    terminationGracePeriodSeconds: 120

8. Performance & Storage
Resource Requests and Limits
Set CPU requests at the realistic 50th-percentile consumption. For stateful pods, set requests equal to limits for both CPU and memory so they land in the Guaranteed QoS class, which makes them the last candidates for eviction under node memory pressure. For stateless services, setting CPU limits is often counterproductive (quota throttling applies even when the node has idle CPU); consider omitting CPU limits while keeping memory limits strict.
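The two patterns can be sketched side by side as container-level fragments. The numbers are illustrative only. Kubernetes assigns the Guaranteed class when every container in the pod has CPU and memory requests equal to its limits.

```yaml
# Stateful container: requests == limits for CPU and memory (Guaranteed QoS).
resources:
  requests:
    cpu: "4"
    memory: 16Gi
  limits:
    cpu: "4"
    memory: 16Gi
---
# Stateless container: no CPU limit, strict memory limit (Burstable QoS).
resources:
  requests:
    cpu: 250m
    memory: 512Mi
  limits:
    memory: 512Mi
```

Note that a LimitRange with a default CPU limit will inject one into the second form, so the namespace defaults need to agree with whichever pattern you choose.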
Storage Classes
| Access Mode | Storage Class | Backend | Use Case |
|---|---|---|---|
| ReadWriteOnce | managed-csi-premium | Azure Disk P30+ | Database data directories, WAL |
| ReadWriteMany | azurefile-csi-premium | Azure Files Premium | Shared content stores, config mounts |
| ReadWriteOnce (Ultra) | managed-csi-ultra | Azure Ultra Disk | Sub-millisecond latency, WAL-heavy loads |
Always set reclaimPolicy: Retain on the StorageClass backing production volumes so a namespace deletion or accidental PVC removal does not destroy the underlying disk. Use volumeBindingMode: WaitForFirstConsumer to ensure disks are created in the same AZ as the pod.
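Both settings live on the StorageClass. A custom class along these lines is one way to get them; the class name and the Premium_LRS SKU are assumptions, while disk.csi.azure.com is the Azure Disk CSI provisioner.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: managed-csi-premium-retain        # hypothetical name
provisioner: disk.csi.azure.com
parameters:
  skuName: Premium_LRS                    # assumed SKU; choose per workload
reclaimPolicy: Retain                     # keep the disk when the PVC is deleted
volumeBindingMode: WaitForFirstConsumer   # provision in the zone where the pod is scheduled
allowVolumeExpansion: true
```

With Retain, released PersistentVolumes and their disks linger until someone removes them, so pair this with a periodic review to avoid paying for orphaned storage.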
JVM Tuning for Containerised Workloads
JVM-based services require explicit container awareness. -XX:+UseContainerSupport is enabled by default since JDK 10 (and 8u191), so the main task is to size the heap relative to the container memory limit, not host memory. For off-heap/direct-memory-intensive workloads, account for off-heap in your memory limit calculation: a container limit of 8 GiB with a 4 GiB heap still needs headroom for off-heap, metaspace, and thread stacks.
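One way to express the 8 GiB / 4 GiB example is through JAVA_TOOL_OPTIONS, which the JVM reads at startup. The direct-memory and metaspace caps below are illustrative assumptions that leave some of the remaining 4 GiB for thread stacks and native allocations.

```yaml
# Pod spec fragment.
containers:
- name: my-service
  image: myregistry.azurecr.io/my-service:1.0.0   # hypothetical image
  env:
  - name: JAVA_TOOL_OPTIONS
    # Folded scalar: the lines are joined with spaces into one option string.
    value: >-
      -XX:MaxRAMPercentage=50.0
      -XX:MaxDirectMemorySize=2g
      -XX:MaxMetaspaceSize=512m
  resources:
    requests:
      memory: 8Gi
    limits:
      memory: 8Gi       # heap is capped at 50% of this limit, about 4 GiB
```

Using a percentage instead of a fixed -Xmx means the heap follows the limit when the container is resized.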
Node Performance Configuration
For latency-sensitive workloads, use custom node configuration to tune kernel parameters via the AKS LinuxOSConfig API: adjust vm.max_map_count, fs.file-max, and TCP buffer sizes without requiring privileged DaemonSets. Enable Accelerated Networking (SR-IOV) on all node pool VM SKUs that support it.
9. GitOps & Release Engineering
GitOps Controller
Adopt Flux v2 (the AKS GitOps add-on, based on Flux) or ArgoCD as your GitOps controller. The AKS Flux add-on (az k8s-configuration flux create) integrates with Azure Policy for compliance reporting. Key principles:
- The Git repository is the single source of truth: no kubectl apply in production by humans
- Separate the application source repository from the deployment configuration repository
- Pin all Helm chart versions and image tags; never use latest
- Use image update automation (Flux Image Automation) to raise PRs for new image tags rather than deploying automatically
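As a sketch of what pinning looks like with Flux, a HelmRelease can fix both the chart version and the image tag. The API version shown is the one used by recent Flux releases and may differ in the version shipped with the AKS extension; the repository name platform-charts and the version numbers are placeholders.

```yaml
apiVersion: helm.toolkit.fluxcd.io/v2   # older Flux releases use v2beta1 or v2beta2
kind: HelmRelease
metadata:
  name: my-service
  namespace: production
spec:
  interval: 10m
  chart:
    spec:
      chart: my-service
      version: "1.4.2"                  # exact chart version, not a range
      sourceRef:
        kind: HelmRepository
        name: platform-charts           # hypothetical HelmRepository
        namespace: flux-system
  values:
    image:
      repository: myregistry.azurecr.io/my-service
      tag: "1.4.2"                      # explicit tag; never latest
```

Because the manifest lives in the config repository, a version bump is a reviewed pull request rather than a command run against the cluster.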
Helm Best Practices
Validate Helm charts in CI before they reach the cluster:

    # Lint
    helm lint ./charts/my-service --strict
    # Template and validate against Kubernetes API schema
    helm template ./charts/my-service | kubeval --strict
    # Security scan rendered manifests
    helm template ./charts/my-service | kubesec scan -
    # Check for deprecated API versions
    helm template ./charts/my-service | pluto detect -

Environment Promotion
Model promotions as pull requests from dev → staging → production branches in your config repo. Require automated test gates (integration tests, smoke tests) to pass before a PR can be merged. Use branch protection and required reviewers for the production branch. This creates a complete, auditable history of every production change.
10. Deployment Safety
Progressive Delivery with Argo Rollouts
Replace standard Deployment objects with Argo Rollouts for canary and blue-green deployments. Integrate with Azure Application Gateway Ingress Controller (AGIC) or NGINX for traffic weight splitting:

    strategy:
      canary:
        steps:
        - setWeight: 5
        - pause: {duration: 5m}
        - analysis:
            templates:
            - templateName: error-rate-check
        - setWeight: 25
        - pause: {duration: 10m}
        - setWeight: 100
        canaryService: my-service-canary
        stableService: my-service-stable

StatefulSet Partition Upgrades
For StatefulSets (grid nodes, databases), use updateStrategy.rollingUpdate.partition to roll one pod at a time and validate cluster health between each step. Automate validation with a post-upgrade health check job that verifies grid membership count and data accessibility before setting partition to 0.
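For a three-replica grid (ordinals 0 to 2), the staged roll can be sketched as the StatefulSet fragment below. The replica count is an assumption chosen to match the PDB example earlier.

```yaml
# StatefulSet spec fragment.
spec:
  replicas: 3
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 2    # only pods with ordinal >= 2 receive the new revision
```

After the health check passes for pod 2, lower partition to 1, validate again, and finally set it to 0 so the remaining pod is updated. Pods below the partition keep the old revision even if they are deleted and recreated.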
Smoke Tests
Run a Kubernetes Job as the final step of every deployment pipeline. It should exercise the critical user journeys (authentication flow, core API endpoints, health checks) and emit a PASS/FAIL signal that gates promotion. If it fails, the rollout controller should not advance the canary weight.
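A smoke-test Job might be sketched as follows. The service URL and both endpoint paths are hypothetical, and the curl image is just one convenient choice; a real suite would also cover the authentication flow.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: smoke-test
  namespace: production
spec:
  backoffLimit: 0                 # a single failed attempt fails the Job
  activeDeadlineSeconds: 300      # fail the Job if it runs longer than 5 minutes
  ttlSecondsAfterFinished: 3600   # clean up the finished Job after an hour
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: smoke
        image: curlimages/curl:8.8.0
        command: ["/bin/sh", "-c"]
        args:
        - |
          set -e
          # -f makes curl exit non-zero on HTTP errors, which fails the Job.
          curl -fsS http://my-service.production.svc.cluster.local/healthz
          curl -fsS http://my-service.production.svc.cluster.local/api/v1/status
```

The pipeline can then block on kubectl wait --for=condition=complete job/smoke-test --timeout=300s and treat a timeout or failure as FAIL. In a namespace that enforces the Restricted profile the pod also needs the security context shown under Security Hardening.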
11. Cost Governance
Azure Spot VMs for Fault-Tolerant Workloads
Run stateless, batch, and CI/CD workloads on Azure Spot VM node pools — typical savings of 60-90% compared to on-demand. Implement spot eviction handling in your application (handle SIGTERM, checkpoint state) and use spot-interrupt-handler DaemonSet to drain pods gracefully when Azure issues a 30-second eviction notice.
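AKS taints Spot node pools with kubernetes.azure.com/scalesetpriority=spot:NoSchedule and labels the nodes with the same key, so a workload has to opt in. A rough pod-spec fragment follows; the 25-second grace period is an assumption chosen to fit inside the eviction notice.

```yaml
# Pod spec fragment for a Spot-tolerant workload.
tolerations:
- key: kubernetes.azure.com/scalesetpriority
  operator: Equal
  value: spot
  effect: NoSchedule
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.azure.com/scalesetpriority
          operator: In
          values: ["spot"]          # schedule only on Spot nodes
terminationGracePeriodSeconds: 25   # illustrative: finish shutdown before the node is reclaimed
```

Switching the affinity to preferredDuringSchedulingIgnoredDuringExecution lets the workload fall back to regular nodes when Spot capacity is unavailable.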
Reserved Instances for Baseline
Purchase Azure Reserved VM Instances (1- or 3-year) for your baseline system and stateful node pools. Pair this with Spot for burst — you pay reserved prices for the predictable floor and Spot prices only for burst capacity. For Java workloads with predictable CPU requirements, this combination typically reduces compute costs by 40-60%.
Cost Allocation with Tagging
Apply consistent Azure tags to node pools, storage accounts, and all PaaS services, and mirror the same keys as node labels so in-cluster tooling can group by them:

    nodeLabels:
      cost-center: "engineering"
      team: "platform"
      environment: "production"
      application: "my-service"

Use Azure Cost Management views filtered by tag to generate per-team chargebacks. Enable the AKS Cost Analysis add-on to see cost broken down by namespace, deployment, and pod without any third-party tooling.
Storage Lifecycle Policies
Configure Azure Blob Storage lifecycle management rules to tier backup data from Hot to Cool after 30 days, and to Archive after 90 days. For Log Analytics, set retention to 90 days in the workspace and enable 2-year archive tier for compliance. This alone commonly reduces storage costs by 60%+ for log-heavy environments.
12. Compliance & Audit
Kubernetes Audit Logs → Log Analytics
Enable Diagnostic Settings on your AKS cluster to stream API server audit logs to Log Analytics. Retain audit logs for 12 months minimum (many financial regulations require 7 years — use the Log Analytics archive tier). Create KQL queries for high-risk audit events: exec into pods, secret reads, ClusterRoleBinding creations, and node modifications.
CIS Benchmark & Azure Security Benchmark
Run kube-bench in your CI pipeline against the CIS AKS Benchmark. Use Microsoft Defender for Cloud Recommendations view — it maps findings to the Azure Security Benchmark and CIS controls, giving you a prioritised remediation list. Set a target of 0 High-severity findings before going live.
SBOM & Vulnerability Management
Generate a Software Bill of Materials (SBOM) for every container image as part of CI. Store SBOMs in ACR alongside image manifests. Continuously scan deployed images for new CVEs using Defender for Containers continuous assessment — it alerts you when a vulnerability is disclosed for a container that is currently running in your cluster, not just at build time.
13. Day-2 Operations
AKS Upgrade Strategy
AKS releases minor versions on a roughly quarterly cadence and supports N-2 minor versions. Define a structured upgrade cadence:
- Non-production clusters: auto-upgrade with the patch channel, always on the latest patch of their minor version
- Production clusters: manual upgrade, tested in staging first, with at least 2 weeks’ notice to application teams
- Upgrade node pools one at a time using --max-surge 1 for stateful pools
- Validate PDBs prevent evictions below quorum before triggering upgrade
- Run smoke tests after each node pool upgrade completes
Dependency Currency
Treat third-party Helm charts and container base images as dependencies that require regular updates. Use Renovate Bot or Dependabot to raise automatic PRs for chart version bumps. Review and merge weekly. Stale dependencies prolong your exposure to vulnerabilities that have already been disclosed and patched upstream.
Chaos Engineering
Run controlled failure injection quarterly using Azure Chaos Studio. Start with low-impact experiments: pod kill, node drain, network latency injection. Measure whether alerting fires within SLO thresholds and whether the system recovers automatically. Document findings and feed them into architecture improvements. Chaos experiments should run against production (with a maintenance window) once you have confidence from staging results.
Runbooks
Every alert must link to a runbook. Runbooks should be stored in Git, versioned, and reviewed quarterly. A good runbook includes: symptom description, diagnostic commands (KQL queries, kubectl commands), escalation path, and rollback procedure. Automate runbook steps where possible using Azure Automation Runbooks or GitHub Actions triggered by alert webhooks.
14. Pre-Go-Live Checklist
Use this checklist as a gate before promoting any workload to production on AKS:
Cluster Foundation
- Node pools span all 3 Availability Zones
- Private cluster enabled, or public API server restricted to authorised IP ranges
- System node pool isolated from user workloads
- Cluster Autoscaler configured on all user node pools
Security
- Entra ID integration enabled; no local Kubernetes users
- Workload Identity configured; no static credentials in pods
- All secrets sourced from Azure Key Vault via CSI driver
- Pod Security Standard Restricted enforced on production namespaces
- All containers run as non-root with read-only root filesystem
- Image signing enabled; unsigned images rejected at admission
- Defender for Containers enabled; 0 unresolved HIGH/CRITICAL CVEs
- mTLS enforced between all services
- Default-deny Network Policies applied
Reliability
- All deployments have at least 3 replicas across 3 zones
- PodDisruptionBudgets defined for every production workload
- HPA configured with CPU and custom metrics
- Liveness, readiness, and startup probes configured and tuned
- terminationGracePeriodSeconds appropriate for shutdown time
- Velero backup configured; restore tested successfully
- Multi-region DR failover tested
Observability
- Metrics flowing to Azure Monitor Workspace; dashboards in Managed Grafana
- Structured JSON logs shipping to Log Analytics
- Distributed traces reaching Application Insights
- Alerts defined for error rate, latency P99, pod crash loops
- Every alert has a runbook link in annotations
- On-call rotation configured in Action Group
Cost & Governance
- All pods have resource requests and limits set
- ResourceQuota and LimitRange applied to all namespaces
- Azure tags applied to all resources for cost allocation
- AKS Cost Analysis add-on enabled
- Reserved Instances purchased for baseline node pools
- Storage lifecycle policies configured
Compliance & Operations
- API server audit logs streaming to Log Analytics
- kube-bench CIS scan: 0 HIGH findings
- SBOM generated and stored for all production images
- GitOps controller deployed; no manual kubectl in production
- Upgrade runbook documented and tested in staging
- Chaos experiment baseline executed; recovery validated
- cert-manager managing all TLS certificates with auto-renewal
The Bottom Line
AKS takes care of the control plane — but the 14 pillars above are your responsibility. None of them is optional for enterprise production. The good news: implemented incrementally, each one compounds the reliability and security of everything that came before. Start with cluster foundation and security, layer in observability, then build out the rest sprint by sprint.
Questions or corrections? Drop a comment below.


