OpenShift Container Platform · On-Premise Production Guide

OpenShift On-Premise Production Best Practices

A comprehensive, opinionated guide to running enterprise Kubernetes workloads on OpenShift Container Platform on-premise — from cluster foundation through Day-2 operations, with full control over your own infrastructure.

Published May 2026 · 20 min read

Cluster Foundation & Infrastructure
Scalability Architecture
Governance & Policy
Security Hardening
Secret Management with Vault
Observability Stack
Disaster Recovery & High Availability
Performance & Storage
GitOps & Release Engineering
Deployment Safety
Capacity & Cost Governance
Compliance & Audit
Day-2 Operations
Pre-Go-Live Checklist

On-premise OpenShift gives you something cloud-managed Kubernetes cannot: complete sovereignty over your infrastructure, network topology, hardware selection, and data residency. That sovereignty comes with accountability. You own every layer — from bare metal BIOS settings and storage fabric through OCP upgrade cadence, RHACS policy, and certificate rotation. This guide covers the practices that make the difference between an on-premise OpenShift cluster that is reliable and secure in production, and one that becomes a liability.

1. Cluster Foundation & Infrastructure

Control Plane Topology

Run three dedicated control plane nodes (master nodes) across three separate physical failure domains: different racks, different power feeds, different top-of-rack switches. Never co-locate control plane and worker workloads. OpenShift runs etcd on the control plane nodes themselves, so back those nodes with NVMe or equivalent low-latency SSD storage. etcd is the most latency-sensitive component in the cluster, and its disk write latency directly determines API server responsiveness.

# Verify etcd health across all members
oc exec -n openshift-etcd etcd-master-0 -- \
  etcdctl endpoint health \
  --cacert /etc/kubernetes/static-pod-certs/configmaps/etcd-all-bundles/server-ca-bundle.crt \
  --cert /etc/kubernetes/static-pod-certs/secrets/etcd-all-certs/etcd-peer-master-0.crt \
  --key /etc/kubernetes/static-pod-certs/secrets/etcd-all-certs/etcd-peer-master-0.key \
  --cluster

MachineConfigPool-Based Node Segmentation

Use MachineConfigPools (MCPs) to define distinct worker classes. Each pool gets its own machine configuration — kernel arguments, sysctl tuning, container runtime settings — applied automatically via the Machine Config Operator:

MachineConfigPool	Hardware Profile	Purpose
`worker`	General-purpose CPU	Stateless services, web tiers
`compute-intensive`	High CPU/RAM (e.g., 96-core)	CPU/memory-heavy processing
`stateful`	High-RAM, NVMe local disk	StatefulSets, in-memory grids
`infra`	Standard	Monitoring, logging, registry, router

Taint the stateful pool and add matching tolerations in Helm values. Move all OCP infrastructure components (router, registry, monitoring) to dedicated infra nodes using nodeSelector overrides. This prevents infrastructure load from competing with business workloads, and nodes that run only qualifying infrastructure components do not need OCP worker subscriptions.
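
As an illustration, a custom pool such as stateful from the table above could be declared roughly as follows. This is a minimal sketch: the pool name and the node-role label are assumptions taken from the table, and the selector lists worker as well so that the pool inherits the base worker configuration.

```yaml
# Sketch only: pool name and node label are assumptions based on the table above.
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: stateful
spec:
  machineConfigSelector:
    matchExpressions:
      # Pick up the base worker MachineConfigs plus any labelled for this pool
      - key: machineconfiguration.openshift.io/role
        operator: In
        values: [worker, stateful]
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/stateful: ""
```

Nodes join the pool once they carry the node-role.kubernetes.io/stateful label.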

Network Architecture

Choose OVN-Kubernetes as the CNI for new clusters. It replaces the older OpenShift SDN and supports NetworkPolicy, EgressFirewall, EgressIP, and hardware offloading on supported NICs. For on-premise load balancing without a cloud provider, deploy MetalLB (BGP mode preferred for production) to provide real LoadBalancer-type services from your existing routing infrastructure. Use a dedicated pair of HAProxy-based router pods (managed by the OpenShift Ingress Operator) for application ingress with session affinity and TLS termination.
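
A MetalLB BGP setup usually comes down to three resources: a peer, an address pool, and an advertisement. The sketch below assumes the MetalLB Operator is installed in metallb-system; the ASNs, peer address, and address range are placeholders, not values from this guide.

```yaml
# Sketch only: ASNs, peer address, and pool range are placeholder values.
apiVersion: metallb.io/v1beta2
kind: BGPPeer
metadata:
  name: tor-switch-a
  namespace: metallb-system
spec:
  myASN: 64512          # ASN the cluster speaks as
  peerASN: 64513        # ASN of the top-of-rack router
  peerAddress: 192.0.2.1
---
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: production-pool
  namespace: metallb-system
spec:
  addresses:
    - 198.51.100.0/27
---
apiVersion: metallb.io/v1beta1
kind: BGPAdvertisement
metadata:
  name: production-bgp
  namespace: metallb-system
spec:
  ipAddressPools:
    - production-pool
```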

2. Scalability Architecture

Horizontal Pod Autoscaler (HPA)

OpenShift’s HPA reads CPU and memory usage from the built-in metrics pipeline. To scale on business-relevant signals (queue depth, active sessions, or processing backlog) rather than CPU alone, expose them through a custom metrics API adapter; Pods-type metrics like the one below are only resolved when such an adapter is installed:

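apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-service            # illustrative name
  namespace: production
spec:
  scaleTargetRef:             # required: the workload this HPA scales
    apiVersion: apps/v1
    kind: Deployment
    name: my-service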
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 65
    - type: Pods
      pods:
        metric:
          name: active_processing_jobs
        target:
          type: AverageValue
          averageValue: "50"

Cluster Autoscaler with Machine API

On VMware vSphere or Bare Metal deployments, use the OpenShift Machine API with a ClusterAutoscaler and per-MachineSet MachineAutoscaler resources. This allows OCP to provision new VMs (on vSphere) automatically when pods are pending. On bare metal, autoscaling requires pre-staged hardware in a pool managed by the BMO (Bare Metal Operator) — plan capacity ahead and keep a buffer of provisioned-but-unclaimed hosts ready for rapid scale-out.
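
The pairing described above could look like the following sketch. The node ceiling, replica bounds, and MachineSet name are assumptions; substitute the MachineSet names that exist in your openshift-machine-api namespace.

```yaml
# Sketch only: limits and the MachineSet name are placeholder values.
apiVersion: autoscaling.openshift.io/v1
kind: ClusterAutoscaler
metadata:
  name: default               # the cluster-wide autoscaler is a singleton named default
spec:
  resourceLimits:
    maxNodesTotal: 24         # hard ceiling across all MachineSets
  scaleDown:
    enabled: true
    delayAfterAdd: 10m
    unneededTime: 10m
---
apiVersion: autoscaling.openshift.io/v1beta1
kind: MachineAutoscaler
metadata:
  name: worker-a
  namespace: openshift-machine-api
spec:
  minReplicas: 3
  maxReplicas: 8
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: cluster-abc12-worker-a   # hypothetical MachineSet name
```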

Vertical Pod Autoscaler (VPA)

Install the Vertical Pod Autoscaler Operator from OperatorHub. Run it in recommendation-only mode (updateMode: "Off") for two weeks before considering Auto mode. Feed recommendations into your Helm values to right-size requests/limits. For StatefulSets with in-memory state, use recommendation-only mode permanently: Auto mode evicts and recreates pods to resize them, which loses in-memory data in grid members. Use the VPA recommender output as a quarterly review input to your capacity planning process.
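
Recommendation-only operation corresponds to updateMode "Off" on the VerticalPodAutoscaler object. A minimal sketch, with a hypothetical Deployment name:

```yaml
# Sketch only: target name and namespace are illustrative.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-service-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-service
  updatePolicy:
    updateMode: "Off"   # compute recommendations only; never evict or resize pods
```

The recommendations then appear under status.recommendation of the object and can be copied into Helm values.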

KEDA for Event-Driven Scaling

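Deploy KEDA (available via OperatorHub; Red Hat ships it as the Custom Metrics Autoscaler Operator) for workloads that scale to zero or respond to external event sources such as Kafka topics, ActiveMQ queues, or custom Prometheus metrics. KEDA builds on HPA rather than replacing it: each ScaledObject creates and manages an HPA for its target. Do not define a separate HPA for a workload that already has a ScaledObject. Use a plain HPA for CPU/memory-based scaling and KEDA for event-driven triggers on different workloads, or add CPU/memory triggers to the ScaledObject itself.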
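
As an example of an event-driven trigger, the sketch below scales a hypothetical Kafka consumer on consumer-group lag. The Deployment name, broker address, topic, and threshold are all assumptions.

```yaml
# Sketch only: workload, broker, topic, and threshold are placeholder values.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-consumer
  namespace: production
spec:
  scaleTargetRef:
    name: order-consumer        # Deployment to scale
  minReplicaCount: 0            # allow scale-to-zero when the topic is idle
  maxReplicaCount: 20
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka-bootstrap.kafka.svc:9092
        consumerGroup: order-consumer
        topic: orders
        lagThreshold: "50"      # target lag per replica
```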

3. Governance & Policy

RBAC with Red Hat SSO / Keycloak

Integrate OpenShift OAuth with Red Hat SSO (Keycloak) as the identity provider — never use local htpasswd users in production. Map LDAP/AD groups to OpenShift Groups and bind Groups to ClusterRoles or namespace-scoped Roles. Use the principle of least privilege: operations teams get view by default, pipeline service accounts get scoped edit on specific namespaces only:

apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: deploy-pipeline-edit
  namespace: production
subjects:
  - kind: ServiceAccount
    name: pipeline-sa
    namespace: production
roleRef:
  kind: ClusterRole
  name: edit
  apiGroup: rbac.authorization.k8s.io

For privileged operations, enforce time-limited role elevations using OpenShift’s impersonation or an external PAM solution — no standing privileged access.

Security Context Constraints (SCCs) and OPA/Gatekeeper

OpenShift’s Security Context Constraints are the primary pod security enforcement mechanism, predating Kubernetes Pod Security Standards. Use the built-in restricted-v2 SCC for all production workloads. It forces a non-root UID from the namespace’s allocated range, drops all capabilities, disallows privilege escalation, and requires the RuntimeDefault seccomp profile. It does not force a read-only root filesystem, so set readOnlyRootFilesystem yourself in the container spec. Only grant elevated SCCs (e.g., anyuid) via explicit RoleBinding to specific service accounts with a documented justification.
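
Granting an elevated SCC to a single service account can be expressed as a namespace-scoped RoleBinding to the SCC’s ClusterRole. In this sketch the service account name and the justification annotation key are hypothetical.

```yaml
# Sketch only: service account and annotation key are illustrative.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: legacy-app-anyuid
  namespace: production
  annotations:
    example.com/justification: "Vendor image requires fixed UID 0; review by 2026-12"
subjects:
  - kind: ServiceAccount
    name: legacy-app-sa
    namespace: production
roleRef:
  kind: ClusterRole
  name: system:openshift:scc:anyuid   # ClusterRole that permits use of the anyuid SCC
  apiGroup: rbac.authorization.k8s.io
```

Keeping the binding in Git alongside its justification makes every exception reviewable.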

Layer OPA/Gatekeeper (available from OperatorHub) on top of SCCs for organisation-wide policy that SCCs cannot express — requiring specific labels, restricting image registries to your internal Quay, or enforcing resource limits:

apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
  name: require-team-label
spec:
  match:
    kinds:
      - apiGroups: ["apps"]
        kinds: ["Deployment", "StatefulSet"]
  parameters:
    labels: ["team", "environment", "cost-center"]

Projects, Quotas, and LimitRanges

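OpenShift Projects are namespaces with additional metadata and access control. Use a project request template to provision a ResourceQuota and LimitRange automatically in every new project, so that no project can be created through the self-service flow without resource boundaries. Define the template in the openshift-config namespace and reference it from spec.projectRequestTemplate in the cluster-scoped Project configuration resource (project.config.openshift.io/cluster). A LimitRange carried by the template might look like this: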

apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
spec:
  limits:
    - type: Container
      default:
        cpu: "500m"
        memory: "512Mi"
      defaultRequest:
        cpu: "100m"
        memory: "128Mi"
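
The reference from the cluster configuration to the template is a single field. A minimal sketch, assuming the Template in openshift-config is named project-request:

```yaml
# Sketch only: project-request is an assumed Template name in openshift-config.
apiVersion: config.openshift.io/v1
kind: Project
metadata:
  name: cluster
spec:
  projectRequestTemplate:
    name: project-request
```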

Network Policies with OVN-Kubernetes

Adopt a default-deny posture in all production namespaces. OVN-Kubernetes supports standard NetworkPolicy plus the OpenShift-specific EgressFirewall (to restrict egress by CIDR or DNS name; EgressNetworkPolicy is its OpenShift SDN predecessor) and EgressIP (to fix egress source IPs for firewall allow-listing). The combination of ingress NetworkPolicy and EgressFirewall is particularly important on-premise, where east-west traffic between application tiers and corporate databases must be tightly controlled.
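
A default-deny baseline could be sketched as below: one NetworkPolicy that blocks all ingress until explicit allow rules are added, and one EgressFirewall (the OVN-Kubernetes egress API) that limits traffic leaving the cluster. The CIDR and DNS name are placeholders.

```yaml
# Sketch only: the CIDR and DNS name are placeholder destinations.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: production
spec:
  podSelector: {}          # selects every pod in the namespace
  policyTypes:
    - Ingress              # no ingress rules listed, so all ingress is denied
---
apiVersion: k8s.ovn.org/v1
kind: EgressFirewall
metadata:
  name: default            # one EgressFirewall per namespace, and it must be named default
  namespace: production
spec:
  egress:
    - type: Allow
      to:
        cidrSelector: 10.20.30.0/24      # corporate database subnet
    - type: Allow
      to:
        dnsName: db.corp.example.com
    - type: Deny
      to:
        cidrSelector: 0.0.0.0/0          # everything else leaving the cluster
```

Rules are evaluated in order, so the final Deny acts as the catch-all.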

4. Security Hardening

Node-Level Hardening with MachineConfig

Use the Machine Config Operator to apply CIS-compliant kernel hardening across all node pools without logging into individual hosts. Apply sysctl settings and audit rules through MachineConfig objects committed to Git:

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 75-worker-sysctl
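spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
        # sysctls are not kernel boot arguments, so write them to /etc/sysctl.d
        - path: /etc/sysctl.d/75-worker.conf
          mode: 420          # decimal for octal 0644
          overwrite: true
          contents:
            source: "data:,net.ipv4.tcp_keepalive_time%3D300%0Avm.max_map_count%3D262144%0A"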

Enable FIPS 140-2/3 mode at cluster installation time if your compliance framework requires it — it cannot be enabled post-installation without reinstalling the cluster. FIPS mode applies to the host OS cryptographic libraries and the OpenShift control plane.

Container Security Context

Every container spec must specify an explicit security context. The restricted-v2 SCC enforces most of these, but define them explicitly in your Helm charts as well so they are visible in code review:

securityContext:
  runAsNonRoot: true
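  runAsUser: 10001       # Under restricted-v2 this must fall in the namespace's UID range (openshift.io/sa.scc.uid-range); omit the field to let OpenShift assign one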
  readOnlyRootFilesystem: true
  allowPrivilegeEscalation: false
  capabilities:
    drop: ["ALL"]
  seccompProfile:
    type: RuntimeDefault

Image Supply Chain with Quay and RHACS

Deploy Red Hat Quay as your on-premise container registry. Quay provides geo-replicated on-premise storage, image scanning via Clair, robot accounts for service authentication, and team-based access control. Configure image mirroring to pull upstream images through Quay rather than directly from public registries — this enforces a single inspection point and eliminates external internet dependency for production deployments.

Deploy Red Hat Advanced Cluster Security for Kubernetes (RHACS / StackRox) for the full image security lifecycle:

Build-time scanning — fail CI pipelines on CRITICAL/HIGH CVEs via the roxctl image check CLI
Deploy-time admission control — RHACS admission controller blocks deployment of images failing your security policy
Runtime detection — process allow-listing, network baseline anomaly detection, privilege escalation alerts
Continuous image reassessment — alerts when a new CVE affects an already-running container

mTLS with OpenShift Service Mesh

Deploy OpenShift Service Mesh (Red Hat’s Istio distribution) in strict mTLS mode for all pod-to-pod communication. OpenShift Service Mesh is installed via OperatorHub and integrates with OpenShift’s RBAC for control plane access. Enable PeerAuthentication in strict mode per namespace — this ensures all traffic within the mesh is encrypted and mutually authenticated at the sidecar layer, regardless of the application’s own TLS configuration.
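
Strict mTLS for a namespace is a short PeerAuthentication resource. A minimal sketch for a namespace called production that is already a mesh member:

```yaml
# Sketch only: assumes the production namespace is enrolled in the mesh.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: production
spec:
  mtls:
    mode: STRICT     # reject plaintext traffic to sidecar-injected workloads
```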

etcd Encryption

Enable etcd encryption at rest, not just volume-level encryption. On-premise this is critical because the physical etcd backup media (NFS mounts, tape) may leave the secure perimeter. Set the encryption type to aescbc (AES-CBC) or aesgcm (AES-GCM) in the APIServer custom resource. OpenShift then encrypts Secrets, ConfigMaps, Routes, and OAuth tokens in etcd and rotates the encryption keys automatically on a weekly basis:

apiVersion: config.openshift.io/v1
kind: APIServer
metadata:
  name: cluster
spec:
  encryption:
    type: aescbc   # or aesgcm

5. Secret Management with HashiCorp Vault

Core rule: No secret, certificate, or connection string ever lives in a Kubernetes/OpenShift Secret object created by a human, checked into Git, or baked into a container image — even with etcd encryption enabled.

HashiCorp Vault with Kubernetes Auth

Deploy HashiCorp Vault (or use the OpenShift Secrets Management operator for a managed experience). Configure Vault’s Kubernetes auth method to allow OpenShift ServiceAccounts to authenticate to Vault using their JWT tokens — eliminating static credentials entirely:

# Configure Kubernetes auth in Vault
vault auth enable kubernetes

vault write auth/kubernetes/config \
  kubernetes_host="https://api.cluster.example.com:6443" \
  kubernetes_ca_cert=@/var/run/secrets/kubernetes.io/serviceaccount/ca.crt

# Create a role binding a ServiceAccount to a Vault policy
vault write auth/kubernetes/role/my-service \
  bound_service_account_names=my-service-sa \
  bound_service_account_namespaces=production \
  policies=my-service-policy \
  ttl=1h

External Secrets Operator

Deploy the External Secrets Operator (ESO) with the Vault provider. ESO synchronises secrets from Vault into OpenShift Secrets automatically, with configurable refresh intervals. This allows existing applications expecting environment variables or volume mounts to continue working without code changes, while keeping the source of truth in Vault:

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: db-credentials
  namespace: production
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: vault-backend
    kind: ClusterSecretStore
  target:
    name: db-credentials-ocp
  data:
    - secretKey: db-password
      remoteRef:
        key: secret/production/database
        property: password
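
The vault-backend store referenced above has to exist as a ClusterSecretStore. A sketch of what it might contain, reusing the Vault role and service account from the earlier commands; the Vault URL, KV mount path, and KV version are assumptions.

```yaml
# Sketch only: server URL, KV mount, and KV version are assumed values.
apiVersion: external-secrets.io/v1beta1
kind: ClusterSecretStore
metadata:
  name: vault-backend
spec:
  provider:
    vault:
      server: "https://vault.example.com:8200"
      path: "secret"            # KV mount path
      version: "v2"             # KV secrets engine version
      auth:
        kubernetes:
          mountPath: "kubernetes"       # matches vault auth enable kubernetes
          role: "my-service"            # Vault role bound to the service account
          serviceAccountRef:
            name: "my-service-sa"
            namespace: "production"
```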

Certificate Management with cert-manager

Deploy cert-manager (available via OperatorHub) with an internal CA issuer — typically your corporate PKI exposed via ACME protocol or a Vault PKI secrets engine issuer. All TLS certificates for OpenShift routes and internal services must be managed by cert-manager with automatic renewal at 30 days before expiry. Never manually create or renew certificates in production. Configure the OpenShift Ingress Operator to use cert-manager-issued certificates for the wildcard domain.
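
The 30-day renewal window maps to renewBefore on the Certificate resource. A minimal sketch, assuming a ClusterIssuer named corporate-ca and a 90-day certificate lifetime:

```yaml
# Sketch only: issuer name, hostname, and lifetime are assumed values.
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: my-service-tls
  namespace: production
spec:
  secretName: my-service-tls      # Secret that receives the key pair
  duration: 2160h                 # 90 days
  renewBefore: 720h               # renew 30 days before expiry
  dnsNames:
    - my-service.apps.cluster.example.com
  issuerRef:
    name: corporate-ca
    kind: ClusterIssuer
```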

6. Observability Stack

Metrics: OpenShift Monitoring + Thanos

OpenShift ships a built-in monitoring stack (Prometheus Operator, Alertmanager, Thanos Querier) in the openshift-monitoring namespace. Enable user-workload monitoring to allow application teams to deploy PodMonitors and ServiceMonitors in their own namespaces. For long-term retention (beyond Prometheus’s 15-day default), forward metrics with remoteWrite to a separately deployed Thanos instance backed by an S3-compatible store (MinIO on-premise or a dedicated object store), because the built-in stack does not expose object storage uploads. The ConfigMap below enables user-workload monitoring and gives the platform Prometheus persistent storage:

# Enable user workload monitoring
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    enableUserWorkload: true
    prometheusK8s:
      retention: 15d
      volumeClaimTemplate:
        spec:
          storageClassName: ocs-storagecluster-ceph-rbd
          resources:
            requests:
              storage: 500Gi

Logs: OpenShift Logging with Loki

Deploy the OpenShift Logging Operator with Loki as the log store (replacing the older Elasticsearch-based stack for new deployments; Loki is significantly more resource-efficient at scale). Use Vector, the collector managed by the Logging Operator, to ship logs to Loki with label-based indexing. Structure all application logs as JSON and include a correlation_id field for distributed tracing correlation. For compliance-required long-term retention, configure a Loki retention policy and back the log store with S3-compatible object storage such as ODF/Ceph:

# Example structured log format
{
  "timestamp": "2026-05-12T10:30:00Z",
  "level": "INFO",
  "message": "Request processed",
  "correlation_id": "abc-123-xyz",
  "namespace": "production",
  "pod": "my-service-7d9f8b",
  "latency_ms": 42,
  "status_code": 200
}

Distributed Tracing: OpenTelemetry + Tempo / Jaeger

Deploy the OpenTelemetry Operator and the Tempo Operator (both available via OperatorHub). The OTel Operator manages the collector DaemonSet and auto-instrumentation webhooks for Java, Python, and Node.js services. Route traces to Tempo backed by Ceph object storage for scalable on-premise retention. Use the Red Hat build of OpenTelemetry for a fully supported, tested configuration. Correlate traces with logs using the trace_id field injected by the OTel SDK.

Alerting

Define all alert rules as PrometheusRule resources committed to Git and applied via GitOps. Alert on symptoms (high error rate, P99 latency breach, pod crash-loop rate, etcd latency spikes), not raw CPU. Route CRITICAL alerts to PagerDuty or your on-call system via Alertmanager’s webhook/PagerDuty receiver. Include a runbook_url annotation in every alert pointing to the operational runbook. For on-premise environments, also alert on hardware-level events such as IPMI sensor alerts and disk predictive failure (smartd), and correlate them with pod evictions.
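
A symptom-based rule could be sketched as follows, with runbook_url carried as an annotation, which is the usual Prometheus convention. The metric name http_requests_total, its labels, the 5% threshold, and the runbook URL are assumptions about the application, not part of this guide.

```yaml
# Sketch only: metric, labels, threshold, and URL are assumed values.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-service-alerts
  namespace: production
spec:
  groups:
    - name: my-service.symptoms
      rules:
        - alert: MyServiceHighErrorRate
          # Ratio of 5xx responses to all responses over the last 5 minutes
          expr: |
            sum(rate(http_requests_total{namespace="production",job="my-service",code=~"5.."}[5m]))
              /
            sum(rate(http_requests_total{namespace="production",job="my-service"}[5m])) > 0.05
          for: 10m
          labels:
            severity: critical
          annotations:
            summary: More than 5% of my-service requests are failing
            runbook_url: https://git.example.com/runbooks/my-service-high-error-rate.md
```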

7. Disaster Recovery & High Availability

Multi-Site Strategy

For on-premise DR, choose between:

Stretched cluster: A single OCP cluster spanning two data centres with a witness site. Requires <10ms RTT between sites and synchronous storage replication. Provides automatic failover but is operationally complex.
Active-passive cluster pair: Two independent OCP clusters — primary and DR — with GitOps bootstrapping the DR cluster from the same config repository. Lower complexity, higher RTO (30-60 min), simpler to operate.

Use Red Hat Advanced Cluster Management for Kubernetes (ACM) to manage both clusters from a single control plane, push policies across them, and monitor cross-cluster application health. ACM’s ApplicationSet integration with ArgoCD enables identical application deployment across clusters with environment-specific overrides.

etcd Backup

Back up etcd on a schedule — this is the single most important backup for an OpenShift cluster:

# Run on a master node — automate via CronJob
/usr/local/bin/cluster-backup.sh /mnt/backup/etcd

# Verify the backup
ls -la /mnt/backup/etcd/
# Expected: snapshot_*.db and static_kuberesources_*.tar.gz

Store backups in a location separate from the cluster storage — NFS share on a different storage system, tape, or replicated object store. Test restoration quarterly against a scratch cluster. An untested etcd restore is not a backup strategy.

Application Backup with Velero

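Deploy Velero with an S3-compatible backend (MinIO on-premise); on OpenShift the supported packaging is the OADP Operator. Back up Kubernetes/OpenShift resources and Persistent Volume snapshots. Use OpenShift Data Foundation (ODF/Ceph) VolumeSnapshot integration for PVC backups. Snapshots are crash-consistent by default, so add Velero pre- and post-backup hooks to quiesce the application where application consistency is required: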

velero backup create production-daily \
  --include-namespaces production \
  --snapshot-volumes \
  --volume-snapshot-locations odf-ceph \
  --storage-location minio-primary \
  --ttl 720h

Pod Disruption Budgets

Every production workload must have a PDB. On-premise, PDBs are especially important during node maintenance windows — draining a node for hardware replacement must not drop below quorum for stateful grid members:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: grid-node-pdb   # illustrative name
  namespace: production
spec:
  minAvailable: 2   # 3-node grid: never drain below quorum
  selector:
    matchLabels:
      app: grid-node

Graceful Shutdown

Configure terminationGracePeriodSeconds at 120–300 seconds for JVM-based services. Add a preStop hook to ensure the HAProxy router has time to deregister the pod before it begins shutdown — on-premise HAProxy reload intervals are typically 1-5 seconds, so a 10-second sleep in the preStop hook is sufficient:

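# Pod spec fragment: lifecycle is set per container, the grace period on the pod
spec:
  terminationGracePeriodSeconds: 180
  containers:
    - name: my-service
      lifecycle:
        preStop:
          exec:
            command: ["/bin/sh", "-c", "sleep 10"]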

8. Performance & Storage

Resource Requests and Limits

Set CPU requests at the realistic 50th-percentile consumption. For stateful pods, set both CPU and memory limits equal to requests so the pod gets Guaranteed QoS. Guaranteed pods are the last candidates for eviction under node memory pressure and the least likely targets of the kernel OOM killer, although a container that exceeds its own memory limit is still killed. On-premise nodes typically have larger RAM than cloud VMs; resist the temptation to over-commit memory on stateful nodes. Over-commitment on nodes running in-memory grids leads to reclaim stalls, evictions, and OOM kills that show up as catastrophic latency spikes.
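
Guaranteed QoS requires requests and limits to match for both CPU and memory in every container of the pod. A container-level fragment with illustrative sizes:

```yaml
# Sketch only: sizes are illustrative. Requests equal limits for CPU and memory.
resources:
  requests:
    cpu: "4"
    memory: 32Gi
  limits:
    cpu: "4"
    memory: 32Gi
```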

Storage with OpenShift Data Foundation (ODF)

Deploy OpenShift Data Foundation (ODF) — the Red Hat-supported Ceph distribution — for all persistent storage needs. ODF provides block, file, and object storage from the same Ceph cluster:

Access Mode	StorageClass	Backend	Use Case
ReadWriteOnce	`ocs-storagecluster-ceph-rbd`	Ceph RBD	Database data dirs, WAL logs
ReadWriteMany	`ocs-storagecluster-cephfs`	CephFS	Shared content stores, config mounts
ReadWriteOnce (fast)	`ocs-storagecluster-ceph-rbd-immediate`	Ceph RBD + NVMe OSD	Low-latency WAL, etcd-adjacent workloads

Use a StorageClass with reclaimPolicy: Retain for production volumes; the policy is set on the StorageClass and inherited by the PV, not set on the PVC. Use volumeBindingMode: WaitForFirstConsumer to ensure PVCs bind to the same failure domain as the pod. For the very highest IOPS requirements (e.g., WAL on in-memory grid persistence), consider local PVs backed by NVMe drives on dedicated stateful nodes using the Local Storage Operator.

Node Performance Tuning with MachineConfig

Apply production kernel tuning via MachineConfig rather than DaemonSets — MachineConfig is declarative, version-controlled, and applied at node provisioning time rather than at pod scheduling time:

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: stateful
  name: 99-stateful-sysctl
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
        - path: /etc/sysctl.d/99-ocp-stateful.conf
          contents:
            source: "data:,vm.max_map_count%3D262144%0Anet.core.somaxconn%3D65535%0Afs.file-max%3D1000000"

For NUMA-sensitive workloads, use the Node Tuning Operator, which ships with OpenShift. Its PerformanceProfile and Tuned resources manage CPU isolation, hugepages, and IRQ affinity, giving latency-sensitive workloads such as financial services predictable low latency without requiring privileged containers.
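
A PerformanceProfile for the stateful pool might be sketched like this. The CPU ranges and hugepage counts are placeholders that depend entirely on the hardware, and the node label is the assumed one for the stateful pool.

```yaml
# Sketch only: CPU sets and hugepage counts are placeholders for real hardware values.
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: stateful-lowlatency
spec:
  cpu:
    reserved: "0-3"        # housekeeping: kubelet, OS daemons, IRQs
    isolated: "4-47"       # dedicated to latency-sensitive pods
  hugepages:
    defaultHugepagesSize: 1G
    pages:
      - size: 1G
        count: 16
  numa:
    topologyPolicy: single-numa-node
  nodeSelector:
    node-role.kubernetes.io/stateful: ""
```

Applying a profile reboots the affected nodes, so roll it out pool by pool.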

JVM Tuning

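-XX:+UseContainerSupport is enabled by default since JDK 10 (and 8u191), so the JVM sizes its heap from the container’s cgroup limits rather than host memory. On nodes that use cgroup v2, run JDK 11.0.16+, 17 or later, or 8u372+, because earlier builds do not read cgroup v2 limits. On-premise nodes often have 256–512 GiB RAM; a JVM that cannot see its container limit sizes its default maximum heap at a quarter of host memory and is OOM-killed once it grows past the container limit. For off-heap-intensive workloads, set the container memory limit to: heap + off-heap + metaspace + thread stacks + 512 MiB safety margin.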
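
As a worked example of that formula with assumed numbers: 8 GiB heap + 6 GiB off-heap + 512 MiB metaspace + 512 MiB thread stacks (about 500 threads at 1 MiB each) + 512 MiB margin gives 15.5 GiB, rounded up to a 16 GiB limit. The fragment below expresses the same budget; the image name and sizes are hypothetical.

```yaml
# Sketch only: image and sizes are illustrative.
containers:
  - name: grid-node
    image: registry.example.com/apps/grid-node:2.3.1
    env:
      - name: JAVA_TOOL_OPTIONS
        # 50% of the 16 GiB limit = 8 GiB heap; cap direct (off-heap) buffers at 6 GiB
        value: "-XX:MaxRAMPercentage=50.0 -XX:MaxDirectMemorySize=6g"
    resources:
      requests:
        memory: 16Gi
      limits:
        memory: 16Gi
```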

9. GitOps & Release Engineering

OpenShift GitOps (ArgoCD)

OpenShift GitOps (available via OperatorHub) is the Red Hat-supported ArgoCD distribution. It integrates with OpenShift RBAC — ArgoCD projects map to OpenShift Groups, and Argo applications are deployed with the service account permissions of their destination namespace. Core principles:

The Git repository is the single source of truth — no oc apply in production by humans
Separate the application source repository from the deployment configuration repository
Pin all Helm chart versions and image tags — never use latest
Use ArgoCD’s ApplicationSet with a git generator to manage multi-cluster deployments via ACM
Enable ArgoCD resource hooks for database migration jobs that must run before the new application version starts
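
A database migration that must finish before the new version rolls out can be modelled as a PreSync hook. In this sketch the image and arguments are hypothetical; only the two annotations are Argo CD conventions.

```yaml
# Sketch only: image and args are illustrative.
apiVersion: batch/v1
kind: Job
metadata:
  name: my-service-db-migrate
  annotations:
    argocd.argoproj.io/hook: PreSync                          # run before manifests are applied
    argocd.argoproj.io/hook-delete-policy: BeforeHookCreation # replace the previous run's Job
spec:
  backoffLimit: 1
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrate
          image: registry.example.com/apps/my-service-migrations:1.8.0
          args: ["migrate", "up"]
```

If the Job fails, the sync stops and the running application version is left untouched.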

OpenShift Pipelines (Tekton)

OpenShift Pipelines (Tekton) is the preferred CI engine for on-premise OpenShift — it runs natively in the cluster with no external CI server to manage. Define pipeline runs as Kubernetes resources, version them in Git, and trigger them via EventListeners bound to your source control webhook events. Use Tekton Chains for supply chain security — it automatically signs TaskRun results and generates SLSA provenance attestations for every image built by the pipeline.

Helm Best Practices

# Lint strictly
helm lint ./charts/my-service --strict

# Validate rendered manifests against OCP API schema
helm template ./charts/my-service | oc apply --dry-run=server -f -

# Security scan
helm template ./charts/my-service | kubesec scan -

# Check deprecated API versions
helm template ./charts/my-service | pluto detect -

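# Validate against OPA/Gatekeeper constraints
# (./policies/ is an assumed directory holding your ConstraintTemplates and Constraints)
helm template ./charts/my-service | gator test --filename=./policies/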

Environment Promotion

Model promotions as pull requests from dev → staging → production branches in your deployment config repository. Require automated test gates (smoke tests, integration tests via Tekton) to pass before a PR can be merged. For on-premise environments with change management requirements, integrate PR approval with your ITSM tool (ServiceNow, Jira Service Management) so every production promotion automatically creates a Change Record with full git diff attached.

10. Deployment Safety

Progressive Delivery with Argo Rollouts

Deploy Argo Rollouts for canary and blue-green deployments. Integrate with the OpenShift Ingress Controller or OpenShift Service Mesh for traffic weight splitting:

strategy:
  canary:
    steps:
      - setWeight: 5
      - pause: {duration: 5m}
      - analysis:
          templates:
            - templateName: error-rate-check
      - setWeight: 25
      - pause: {duration: 10m}
      - setWeight: 100
    canaryService: my-service-canary
    stableService: my-service-stable
    trafficRouting:
      istio:
        virtualService:
          name: my-service-vsvc
          routes:
            - primary

StatefulSet Partition Upgrades

Use updateStrategy.rollingUpdate.partition to roll one StatefulSet pod at a time. Only pods with an ordinal greater than or equal to the partition are updated, so start with the partition at replicas minus one and lower it one step at a time. Automate validation with a post-upgrade Tekton Task or Kubernetes Job that verifies grid membership count and data accessibility before proceeding to the next partition. On-premise, treat partition upgrades as a formal change with a maintenance window: coordinate with on-call, notify users, and have a rollback runbook ready before starting.
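
For a three-replica StatefulSet, the first step of such an upgrade could look like the fragment below (the replica count is an assumption matching the grid example used elsewhere in this guide).

```yaml
# Fragment of a StatefulSet spec, sketch only.
spec:
  replicas: 3
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 2   # only ordinal 2 is updated; lower to 1, then 0, after each validation
```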

Smoke Tests

Run a Kubernetes Job as the final step of every pipeline run. Exercise critical journeys (authentication via Red Hat SSO, core API endpoints, health checks) and emit PASS/FAIL. To have a failure abort the canary automatically, run the same checks as an Argo Rollouts analysis step; a pipeline Job on its own does not stop a rollout. In a regulated on-premise environment, smoke test results should be captured as evidence in your change management system, timestamped and signed.
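
One way to connect a smoke test to the rollout is an AnalysisTemplate that uses the Job metric provider. This is a sketch under assumptions: the image, command, and target URL are hypothetical, and the template would be referenced from an analysis step in the Rollout.

```yaml
# Sketch only: image, command, and target URL are illustrative.
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: smoke-test
  namespace: production
spec:
  metrics:
    - name: smoke-test
      provider:
        job:
          spec:
            backoffLimit: 0          # a single failed run fails the analysis
            template:
              spec:
                restartPolicy: Never
                containers:
                  - name: smoke
                    image: registry.example.com/tools/smoke-test:1.4.2
                    command: ["/smoke-test", "--target", "http://my-service-canary.production.svc:8080"]
```

A non-zero exit code from the container fails the metric, which in turn aborts the canary.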

11. Capacity & Cost Governance

OpenShift Cost Management (Koku)

Deploy Red Hat Cost Management (based on Koku) or integrate with your existing ITSM/financial systems via the Prometheus metrics that OCP exposes. Cost Management breaks down resource consumption by project (namespace), label, and cluster — enabling per-team chargebacks without cloud billing APIs. Configure it to consume OCP Prometheus data and export cost allocation reports to your finance system monthly.

OCP Subscription Management

OpenShift subscriptions are sold per core pair (2 physical cores or 4 vCPUs), with a per-socket-pair option for bare metal. Track worker node core count against your subscription entitlements using Red Hat Subscription Management (RHSM) integrated with your cluster. Avoid over-provisioning worker nodes for burst capacity you rarely use: on-premise, unused cores still consume subscription. Instead, maintain a pre-approved capacity expansion runbook that allows adding pre-staged nodes within your maintenance window SLA, and size the permanent fleet at your P75 utilisation.

Quota Enforcement for Chargeback

Apply consistent labels to all workloads for cost attribution:

metadata:
  labels:
    cost-center: "12345"
    team: "platform-engineering"
    environment: "production"
    application: "my-service"

Use OpenShift’s Project annotations to store billing codes and team ownership. ResourceQuota enforcement ensures teams cannot silently over-consume — requests above quota are rejected, making cost overruns visible before they happen rather than after month-end billing.

Storage Lifecycle

Tier storage by media: keep hot data on NVMe-backed Ceph pools and place cold data on spinning-disk (HDD) pools, exposed as separate ODF storage classes. Review and reclaim orphaned (Released) persistent volumes on a weekly schedule. For log and backup storage, use Ceph’s erasure-coded pools, which reduce raw capacity needs by 40-50% compared to 3x replication for cold data; a 4+2 profile, for example, stores 1.5 bytes of raw capacity per usable byte instead of 3.

12. Compliance & Audit

OpenShift Audit Logs

Set the audit profile on the APIServer custom resource (spec.audit.profile) to capture all security-relevant events. Ship audit logs to your SIEM (Splunk, IBM QRadar, or Elastic SIEM on-premise) through a dedicated ClusterLogForwarder pipeline, separate from application logs. Retain audit logs for 12 months minimum on fast storage and 7 years in archive. Alert on high-risk audit events: exec into pods, secret reads, ClusterRoleBinding creations, SCC grant changes, and MachineConfig modifications.
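
The audit profile is a field on the same cluster-scoped APIServer resource that carries the encryption setting. A minimal sketch, choosing the WriteRequestBodies profile as an example:

```yaml
# Sketch only: merge into the existing APIServer spec rather than replacing it.
apiVersion: config.openshift.io/v1
kind: APIServer
metadata:
  name: cluster
spec:
  audit:
    profile: WriteRequestBodies   # metadata for all requests plus bodies of write requests
```

Richer profiles increase log volume, so size the SIEM pipeline and retention tiers accordingly.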

OpenShift Compliance Operator

Deploy the OpenShift Compliance Operator (available via OperatorHub) to continuously assess the cluster against CIS OpenShift Benchmark, NIST SP 800-53, or PCI-DSS profiles. The Compliance Operator runs scans on a schedule, generates detailed findings with remediation guidance, and can automatically apply remediations via MachineConfig for supported rules:

apiVersion: compliance.openshift.io/v1alpha1
kind: ScanSettingBinding
metadata:
  name: cis-ocp
  namespace: openshift-compliance
spec:
  profiles:
    - name: ocp4-cis
      kind: Profile
      apiGroup: compliance.openshift.io/v1alpha1
    - name: ocp4-cis-node
      kind: Profile
      apiGroup: compliance.openshift.io/v1alpha1
  settingsRef:
    name: default
    kind: ScanSetting
    apiGroup: compliance.openshift.io/v1alpha1

RHACS for Continuous Compliance

RHACS provides a compliance dashboard mapped to regulatory frameworks (CIS, NIST, PCI-DSS, HIPAA). It scans running workloads, network topology, and RBAC configuration in real time and generates evidence reports suitable for auditor submission. Configure scheduled compliance report exports to your GRC (Governance, Risk, Compliance) platform. For on-premise environments with air-gapped audit requirements, RHACS reports can be exported to PDF and archived alongside change records.

SBOM and Image Provenance

Generate an SBOM for every container image in your Tekton pipeline using Syft or Red Hat’s SBOM tooling. Use Tekton Chains to sign image digests with Cosign and generate SLSA Level 2 provenance attestations. Store SBOMs in Quay alongside image manifests. RHACS continuously monitors for new CVEs affecting running containers — integrate RHACS violation alerts with your vulnerability management system for automated ticket creation on CRITICAL findings.

13. Day-2 Operations

OCP Upgrade Strategy

OpenShift follows a minor.patch release cadence with defined upgrade channels: candidate, fast, stable, and eus (Extended Update Support). For on-premise production:

Use the eus channel for EUS-to-EUS upgrades between even-numbered minor releases (e.g., 4.14 → 4.16). The control plane still passes through 4.15, but worker MachineConfigPools stay paused so worker nodes reboot only once
Use stable for production; releases are promoted to stable only after they have proven themselves in the fast channel
Upgrade non-production clusters first via the fast channel, validate, then promote to stable for production
Use oc adm upgrade --to=<version> to pin an exact target version rather than following the channel automatically
Upgrade in maintenance windows, because OCP upgrades drain and reboot nodes, causing brief pod migrations
Validate PDBs before every upgrade to ensure node drains cannot violate quorum

# Check upgrade path and available versions
oc adm upgrade

# Initiate upgrade to specific version
oc adm upgrade --to=4.16.12

# Monitor upgrade progress
oc get clusteroperators
oc get nodes -w

Node Maintenance

Use the Node Maintenance Operator (available via OperatorHub) for structured node maintenance. It cordons, drains, and marks the node as under maintenance in a single operation, integrating with PDB validation to prevent unsafe drains. This is especially important on-premise, where you regularly need to take nodes down for firmware updates, hardware replacement, or data centre maintenance.
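
With the operator installed, maintenance is requested by creating a NodeMaintenance resource and ended by deleting it. The node name and reason below are placeholders.

```yaml
# Sketch only: node name and reason are illustrative.
apiVersion: nodemaintenance.medik8s.io/v1beta1
kind: NodeMaintenance
metadata:
  name: worker-07-firmware
spec:
  nodeName: worker-07.cluster.example.com
  reason: "BIOS and NIC firmware update"
```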

Dependency Currency

Treat Helm chart versions, operator versions, and base image versions as dependencies requiring regular updates. Use Renovate Bot configured against your on-premise GitLab or Gitea instance to raise automatic PRs for dependency updates. Review weekly. Pay particular attention to OpenShift operator updates — operators installed via OperatorHub can be configured for Manual or Automatic approval; use Manual for production so updates are reviewed before applying.
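
Manual approval is set per operator on its Subscription. A sketch using the OpenShift GitOps operator as the example; the channel and namespace are assumptions and vary by operator and version.

```yaml
# Sketch only: channel and namespace are assumed values.
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: openshift-gitops-operator
  namespace: openshift-operators
spec:
  channel: latest
  name: openshift-gitops-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
  installPlanApproval: Manual     # each update waits for an approved InstallPlan
```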

Chaos Engineering

Run controlled failure injection quarterly using Chaos Mesh or Litmus Chaos on-premise. Start with low-impact experiments: pod kill, node drain simulation, network latency injection via Service Mesh fault injection. Measure whether alerting fires within SLO thresholds and whether automated recovery restores the system without manual intervention. On-premise, also test storage failure scenarios — ODF Ceph node loss, network partition between OSD nodes — which have no equivalent in cloud-managed storage.
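
A pod-kill experiment is a reasonable first step. Assuming Chaos Mesh is the chosen tool, the sketch below renders a PodChaos resource that kills one randomly selected pod matching a label. The function name render_pod_kill, the app label convention, and the example namespace are assumptions, not part of the original setup.

```shell
# Render a Chaos Mesh PodChaos manifest that kills one pod of the given app.
# Assumes target pods carry the label app=<app_label>.
render_pod_kill() {
  target_ns="$1"
  app_label="$2"
  cat <<EOF
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill-${app_label}
  namespace: ${target_ns}
spec:
  action: pod-kill
  mode: one
  selector:
    namespaces:
      - ${target_ns}
    labelSelectors:
      app: ${app_label}
EOF
}

# Usage (assumes Chaos Mesh is installed and a logged-in oc session):
#   render_pod_kill payments-staging checkout | oc apply -f -
```

Record the time between applying the experiment and the replacement pod becoming Ready, then compare it with the alerting and recovery thresholds the experiment is meant to validate.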

Runbooks

Every alert must link to a runbook stored in Git, versioned, and reviewed quarterly. On-premise runbooks should include: symptom description, diagnostic commands with example output, escalation path (including hardware vendor contacts for infrastructure issues), rollback procedure, and ITSM ticket template. Automate repeatable runbook steps using OpenShift Pipelines triggered by alert webhooks, so that known incident patterns are remediated with little or no human intervention and MTTR drops accordingly.
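
The runbook link requirement can be checked mechanically. The sketch below is a rough line-based heuristic, not a YAML parser: it reads alerting rules as YAML on stdin and prints every alert that has no runbook_url line before the next alert begins. It assumes each rule starts with a "- alert:" line, as in oc get ... -o yaml output. The function name alerts_missing_runbook is illustrative.

```shell
# Print the name of every alert rule that lacks a runbook_url annotation.
# Heuristic: a rule starts at "- alert: <Name>" and ends at the next such line.
alerts_missing_runbook() {
  awk '
    /^[[:space:]]*-[[:space:]]*alert:/ {
      if (name != "" && !has) print name
      name = $NF
      has = 0
    }
    /runbook_url:/ { has = 1 }
    END { if (name != "" && !has) print name }
  '
}

# Usage against a live cluster (assumes a logged-in oc session):
#   oc get prometheusrules -A -o yaml | alerts_missing_runbook
```

Run as a CI step against the rule files in Git, the same filter could fail a merge request that adds an alert without a runbook.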

14. Pre-Go-Live Checklist

Cluster Foundation

3 control plane nodes across 3 physical failure domains
etcd on NVMe-backed storage; etcd health verified across all members
Dedicated infra MachineConfigPool for OCP infrastructure components
OVN-Kubernetes CNI deployed; MetalLB in BGP mode for LoadBalancer services
Machine Config Operator managing all node configurations declaratively

Security

Red Hat SSO integrated for authentication; no local htpasswd users
All workloads running under restricted-v2 SCC; elevated SCCs documented and justified
All secrets sourced from Vault via External Secrets Operator
etcd encryption enabled (AES-CBC or AES-GCM)
All containers run as non-root with read-only root filesystem
RHACS deployed; all images scanned; 0 CRITICAL/HIGH unresolved CVEs
RHACS admission controller in enforce mode; unsigned/failing images blocked
OpenShift Service Mesh mTLS strict mode enforced
OPA/Gatekeeper policies enforced: required labels, registry restriction, resource limits
Default-deny NetworkPolicy applied via OVN-Kubernetes
FIPS mode enabled (if required by compliance framework)

Reliability

All Deployments have at least 3 replicas spread across physical failure domains
PodDisruptionBudgets defined for every production workload
HPA configured with CPU and custom metrics
Liveness, readiness, and startup probes configured and tuned
terminationGracePeriodSeconds appropriate for application shutdown time
etcd backup scheduled and tested via restoration to scratch cluster
Velero backup configured with ODF snapshot integration; restore tested
DR failover runbook documented and tested (active-passive or stretched)
Node Maintenance Operator deployed; drain procedure validated against PDBs
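
Some of these items can be spot-checked from the command line. As one hedged example, the filter below lists Deployments whose desired replica count is below a minimum (3 by default). It checks only the replica count, not the spread across failure domains, and the function name under_replicated is an assumption for this sketch.

```shell
# Print namespace/name (replicas) for Deployments below the minimum replica count.
# Expects three whitespace-separated columns on stdin:
# namespace, name, and .spec.replicas. Optional first argument: minimum (default 3).
under_replicated() {
  awk -v min="${1:-3}" '$3 + 0 < min + 0 { print $1 "/" $2 " (" $3 ")" }'
}

# Usage against a live cluster (assumes a logged-in oc session):
#   oc get deployments -A --no-headers \
#     -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,REPLICAS:.spec.replicas \
#     | under_replicated 3
```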

Observability

User workload monitoring enabled; metrics flowing to Prometheus + Thanos
Structured JSON logs shipping to Loki via OpenShift Logging Operator
Distributed traces reaching Tempo via OpenTelemetry Operator
Alerts defined for error rate, P99 latency, pod crash loops, etcd latency
Every alert has a runbook_url annotation
On-call rotation configured in Alertmanager; PagerDuty/ITSM integration tested
Hardware-level alerts (IPMI, disk SMART) correlated with OCP events

Capacity & Governance

All pods have resource requests and limits defined
ResourceQuota and LimitRange applied to all projects via project template
Cost labels applied to all workloads; Cost Management configured
OCP subscription entitlements verified against deployed core count
Storage tiering policy configured; orphaned PV reclaim scheduled

Compliance & Operations

API audit logs shipping to SIEM with 12-month fast retention
Compliance Operator CIS scan: 0 HIGH findings
RHACS compliance dashboard: target score met for all required frameworks
SBOM generated and stored in Quay for all production images
OpenShift GitOps (ArgoCD) deployed; no manual oc apply in production
OCP upgrade channel set to stable or eus; upgrade runbook tested in staging
Chaos experiment baseline executed; automated recovery validated
cert-manager managing all TLS certificates with auto-renewal
Change management integration: production promotions create ITSM change records

The Bottom Line

On-premise OpenShift gives you full control — and full responsibility. The 14 pillars above are non-negotiable for enterprise production. The on-premise model means you own the hardware failure domain planning, the storage fabric, the network BGP configuration, and the upgrade windows that cloud providers handle for you. The upside: complete data sovereignty, predictable cost at scale, and no cloud egress fees. The investment: a disciplined operations practice built on MachineConfig, GitOps, Compliance Operator, RHACS, and Vault — all of which compound into a cluster that is genuinely production-grade rather than one that merely looks like it.

Questions or corrections? Drop a comment below.
