OpenShift On-Premise Production Best Practices
A comprehensive, opinionated guide to running enterprise Kubernetes workloads on OpenShift Container Platform on-premise — from cluster foundation through Day-2 operations, with full control over your own infrastructure.
Table of Contents
- Cluster Foundation & Infrastructure
- Scalability Architecture
- Governance & Policy
- Security Hardening
- Secret Management with Vault
- Observability Stack
- Disaster Recovery & High Availability
- Performance & Storage
- GitOps & Release Engineering
- Deployment Safety
- Capacity & Cost Governance
- Compliance & Audit
- Day-2 Operations
- Pre-Go-Live Checklist
On-premise OpenShift gives you something cloud-managed Kubernetes cannot: complete sovereignty over your infrastructure, network topology, hardware selection, and data residency. That sovereignty comes with accountability. You own every layer — from bare metal BIOS settings and storage fabric through OCP upgrade cadence, RHACS policy, and certificate rotation. This guide covers the practices that make the difference between an on-premise OpenShift cluster that is reliable and secure in production, and one that becomes a liability.
1. Cluster Foundation & Infrastructure
Control Plane Topology
Run three dedicated control plane nodes (master nodes) across three separate physical failure domains: different racks, different power feeds, different top-of-rack switches. Never co-locate control plane and worker workloads. OpenShift runs etcd on the control plane nodes themselves, so put those nodes on NVMe-backed storage. etcd is the most latency-sensitive component in the cluster, and its disk write latency directly determines API server responsiveness.
# Verify etcd health across all members
oc exec -n openshift-etcd etcd-master-0 -- \
  etcdctl endpoint health \
  --cacert /etc/kubernetes/static-pod-certs/configmaps/etcd-all-bundles/server-ca-bundle.crt \
  --cert /etc/kubernetes/static-pod-certs/secrets/etcd-all-certs/etcd-peer-master-0.crt \
  --key /etc/kubernetes/static-pod-certs/secrets/etcd-all-certs/etcd-peer-master-0.key \
  --cluster
MachineConfigPool-Based Node Segmentation
Use MachineConfigPools (MCPs) to define distinct worker classes. Each pool gets its own machine configuration — kernel arguments, sysctl tuning, container runtime settings — applied automatically via the Machine Config Operator:
| MachineConfigPool | Hardware Profile | Purpose |
|---|---|---|
| worker | General-purpose CPU | Stateless services, web tiers |
| compute-intensive | High CPU/RAM (e.g., 96-core) | CPU/memory-heavy processing |
| stateful | High-RAM, NVMe local disk | StatefulSets, in-memory grids |
| infra | Standard | Monitoring, logging, registry, router |
Taint the stateful pool and add matching tolerations in Helm values. Move all OCP infrastructure components (router, registry, monitoring) to dedicated infra nodes using nodeSelector overrides — this prevents infrastructure load from competing with business workloads and removes the need to purchase additional OCP worker subscriptions for infra nodes.
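As a minimal sketch of the infra pool described above: the manifests below assume the infra nodes are already labelled node-role.kubernetes.io/infra and tainted with the same key (both are assumptions about your environment). The first defines the custom pool, the second moves the default router onto it.

```yaml
# Custom pool that inherits worker MachineConfigs plus any infra-specific ones
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: infra
spec:
  machineConfigSelector:
    matchExpressions:
    - key: machineconfiguration.openshift.io/role
      operator: In
      values: [worker, infra]
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/infra: ""
---
# Pin the default router to infra nodes and tolerate the (assumed) infra taint
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  name: default
  namespace: openshift-ingress-operator
spec:
  nodePlacement:
    nodeSelector:
      matchLabels:
        node-role.kubernetes.io/infra: ""
    tolerations:
    - key: node-role.kubernetes.io/infra
      operator: Exists
      effect: NoSchedule
```

The registry and monitoring components need equivalent placement overrides in their own operator configuration.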
Network Architecture
Choose OVN-Kubernetes as the CNI for new clusters. It replaces the older OpenShift SDN and provides native support for NetworkPolicy, Egress Firewall, EgressIP, and OVS hardware offload on supported NICs. For on-premise load balancing without a cloud provider, deploy MetalLB (BGP mode preferred for production) to provide real LoadBalancer-type services from your existing routing infrastructure. Use a dedicated HAProxy pair (managed by the OpenShift Ingress Operator) for application ingress with session affinity and TLS termination.
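A MetalLB BGP setup along these lines might look like the following sketch. The address pool, ASNs, and peer address are placeholders from documentation ranges, not values from this guide; substitute the ones agreed with your network team.

```yaml
# Pool of addresses MetalLB may hand out to LoadBalancer services (placeholder range)
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: prod-pool
  namespace: metallb-system
spec:
  addresses:
  - 203.0.113.0/24
---
# BGP session to one top-of-rack router (placeholder ASNs and address)
apiVersion: metallb.io/v1beta2
kind: BGPPeer
metadata:
  name: tor-a
  namespace: metallb-system
spec:
  myASN: 64512
  peerASN: 64513
  peerAddress: 198.51.100.1
---
# Advertise the pool over the BGP sessions
apiVersion: metallb.io/v1beta1
kind: BGPAdvertisement
metadata:
  name: prod-adv
  namespace: metallb-system
spec:
  ipAddressPools:
  - prod-pool
```

Defining one BGPPeer per top-of-rack switch keeps service addresses reachable when a single switch fails.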
2. Scalability Architecture
Horizontal Pod Autoscaler (HPA)
OpenShift’s HPA integrates with the built-in Prometheus metrics pipeline. Use custom metrics from the custom-metrics-apiserver adapter to scale on business-relevant signals — queue depth, active sessions, or processing backlog — not just CPU:
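apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-service          # illustrative name
  namespace: production
spec:
  scaleTargetRef:           # required: the workload this HPA scales
    apiVersion: apps/v1
    kind: Deployment
    name: my-service
  minReplicas: 3
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 65
  - type: Pods
    pods:
      metric:
        name: active_processing_jobs
      target:
        type: AverageValue
        averageValue: "50"
Cluster Autoscaler with Machine API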
On VMware vSphere or Bare Metal deployments, use the OpenShift Machine API with a ClusterAutoscaler and per-MachineSet MachineAutoscaler resources. This allows OCP to provision new VMs (on vSphere) automatically when pods are pending. On bare metal, autoscaling requires pre-staged hardware in a pool managed by the BMO (Bare Metal Operator) — plan capacity ahead and keep a buffer of provisioned-but-unclaimed hosts ready for rapid scale-out.
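The two resources mentioned above could be sketched as follows. The node ceiling, replica bounds, timings, and the MachineSet name are illustrative assumptions, not recommendations.

```yaml
# Cluster-wide autoscaler settings (one per cluster, must be named "default")
apiVersion: autoscaling.openshift.io/v1
kind: ClusterAutoscaler
metadata:
  name: default
spec:
  resourceLimits:
    maxNodesTotal: 24          # assumed upper bound for the whole cluster
  scaleDown:
    enabled: true
    delayAfterAdd: 10m
    unneededTime: 10m
---
# Per-MachineSet bounds; "prod-worker-rack-a" is a hypothetical MachineSet name
apiVersion: autoscaling.openshift.io/v1beta1
kind: MachineAutoscaler
metadata:
  name: prod-worker-rack-a
  namespace: openshift-machine-api
spec:
  minReplicas: 3
  maxReplicas: 8
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: prod-worker-rack-a
```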
Vertical Pod Autoscaler (VPA)
Install VPA from the OperatorHub. Run it in recommendation-only mode (updateMode: "Off") for two weeks before considering Auto mode. Feed recommendations into your Helm values to right-size requests/limits. For StatefulSets with in-memory state, use recommendation-only mode permanently: Auto mode restarts pods to resize them, which causes data loss in grid members. Use the VPA recommender output as a quarterly review input to your capacity planning process.
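A recommendation-only VPA for a single Deployment can be sketched like this. The workload name and namespace are illustrative.

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-service-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-service        # hypothetical workload
  updatePolicy:
    updateMode: "Off"       # compute recommendations only; never evict pods
```

The recommendations then appear under status.recommendation of the VPA object, where they can be read during the review.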
KEDA for Event-Driven Scaling
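Deploy KEDA (available via OperatorHub as the KEDA Operator) for workloads that scale to zero or respond to external event sources such as Kafka topics, ActiveMQ queues, or custom Prometheus metrics. KEDA builds on HPA rather than replacing it: each ScaledObject generates and owns an HPA for its target. Use a plain HPA for CPU/memory-based scaling and KEDA for event-driven triggers, but do not attach a hand-written HPA and a ScaledObject to the same workload, because the two controllers will fight over the replica count.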
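For the Kafka case, a ScaledObject might look like the sketch below. The broker address, topic, consumer group, and lag threshold are assumptions for illustration.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-consumer
  namespace: production
spec:
  scaleTargetRef:
    name: order-consumer          # hypothetical Deployment in the same namespace
  minReplicaCount: 0              # scale to zero when the topic is idle
  maxReplicaCount: 20
  triggers:
  - type: kafka
    metadata:
      bootstrapServers: kafka.example.internal:9092
      consumerGroup: order-consumer
      topic: orders
      lagThreshold: "100"         # target lag per replica
```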
3. Governance & Policy
RBAC with Red Hat SSO / Keycloak
Integrate OpenShift OAuth with Red Hat SSO (Keycloak) as the identity provider — never use local htpasswd users in production. Map LDAP/AD groups to OpenShift Groups and bind Groups to ClusterRoles or namespace-scoped Roles. Use the principle of least privilege: operations teams get view by default, pipeline service accounts get scoped edit on specific namespaces only:
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: deploy-pipeline-edit
  namespace: production
subjects:
- kind: ServiceAccount
  name: pipeline-sa
  namespace: production
roleRef:
  kind: ClusterRole
  name: edit
  apiGroup: rbac.authorization.k8s.io
For privileged operations, enforce time-limited role elevations using OpenShift’s impersonation or an external PAM solution, with no standing privileged access.
Security Context Constraints (SCCs) and OPA/Gatekeeper
OpenShift’s Security Context Constraints are the primary pod security enforcement mechanism, predating Kubernetes Pod Security Standards. Use the built-in restricted-v2 SCC for all production workloads: it requires a non-root UID from the project's allocated range, drops all capabilities, disallows privilege escalation, and requires the RuntimeDefault seccomp profile. It does not force a read-only root filesystem, so set readOnlyRootFilesystem in the pod spec yourself. Only grant elevated SCCs (e.g., anyuid) via explicit RoleBinding to specific service accounts with a documented justification.
Layer OPA/Gatekeeper (available from OperatorHub) on top of SCCs for organisation-wide policy that SCCs cannot express — requiring specific labels, restricting image registries to your internal Quay, or enforcing resource limits:
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
  name: require-team-label
spec:
  match:
    kinds:
    - apiGroups: ["apps"]
      kinds: ["Deployment", "StatefulSet"]
  parameters:
    labels: ["team", "environment", "cost-center"]
Projects, Quotas, and LimitRanges
OpenShift Projects are namespaces with additional metadata and access control. Use a project request template to automatically provision ResourceQuota and LimitRange on every new project, so that no project can be created without resource boundaries. Define the template in the openshift-config namespace and reference it from the cluster Project configuration resource (project.config.openshift.io/cluster, field spec.projectRequestTemplate):
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
spec:
  limits:
  - type: Container
    default:
      cpu: "500m"
      memory: "512Mi"
    defaultRequest:
      cpu: "100m"
      memory: "128Mi"
Network Policies with OVN-Kubernetes
Adopt a default-deny posture in all production namespaces. OVN-Kubernetes supports standard NetworkPolicy, the OpenShift-specific EgressFirewall (to restrict egress by CIDR or DNS name; EgressNetworkPolicy is its OpenShift SDN predecessor), and EgressIP (to fix egress source IPs for firewall whitelisting). The combination of ingress NetworkPolicy and EgressFirewall is particularly important on-premise, where traffic between application tiers and corporate databases must be tightly controlled.
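A minimal sketch of that posture for one namespace: a default-deny ingress NetworkPolicy plus an OVN-Kubernetes EgressFirewall that allows a single database subnet and denies all other traffic leaving the cluster. The CIDR is a placeholder, and each application still needs its own allow rules on top.

```yaml
# Deny all ingress to every pod in the namespace unless another policy allows it
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: production
spec:
  podSelector: {}
  policyTypes:
  - Ingress
---
# One EgressFirewall per namespace; OVN-Kubernetes requires the name "default"
apiVersion: k8s.ovn.org/v1
kind: EgressFirewall
metadata:
  name: default
  namespace: production
spec:
  egress:
  - type: Allow
    to:
      cidrSelector: 10.20.30.0/24   # placeholder: corporate database subnet
  - type: Deny
    to:
      cidrSelector: 0.0.0.0/0
```

Rules are evaluated in order, so the Allow entry must precede the catch-all Deny.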
4. Security Hardening
Node-Level Hardening with MachineConfig
Use the Machine Config Operator to apply CIS-compliant kernel hardening across all node pools without logging into individual hosts. Apply sysctl settings and audit rules through MachineConfig objects committed to Git (FIPS mode is the exception: it can only be set at install time):
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 75-worker-sysctl
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
      # sysctls are not kernel boot arguments, so ship them as a sysctl.d drop-in
      - path: /etc/sysctl.d/75-worker.conf
        contents:
          source: "data:,net.ipv4.tcp_keepalive_time%3D300%0Avm.max_map_count%3D262144%0A"
  kernelType: default
Enable FIPS 140-2/3 mode at cluster installation time if your compliance framework requires it; it cannot be enabled post-installation without reinstalling the cluster. FIPS mode applies to the host OS cryptographic libraries and the OpenShift control plane.
Container Security Context
Every container spec must specify an explicit security context. The restricted-v2 SCC enforces most of these, but define them explicitly in your Helm charts as well so they are visible in code review:
securityContext:
  runAsNonRoot: true
  # Admitted under restricted-v2 only if inside the project's UID range
  # (openshift.io/sa.scc.uid-range); omit to let OpenShift assign a UID
  runAsUser: 10001
  readOnlyRootFilesystem: true
  allowPrivilegeEscalation: false
  capabilities:
    drop: ["ALL"]
  seccompProfile:
    type: RuntimeDefault
Image Supply Chain with Quay and RHACS
Deploy Red Hat Quay as your on-premise container registry. Quay provides geo-replicated on-premise storage, image scanning via Clair, robot accounts for service authentication, and team-based access control. Configure image mirroring to pull upstream images through Quay rather than directly from public registries — this enforces a single inspection point and eliminates external internet dependency for production deployments.
Deploy Red Hat Advanced Cluster Security for Kubernetes (RHACS / StackRox) for the full image security lifecycle:
- Build-time scanning: fail CI pipelines on CRITICAL/HIGH CVEs via the roxctl image check CLI
- Deploy-time admission control: the RHACS admission controller blocks deployment of images failing your security policy
- Runtime detection: process allow-listing, network baseline anomaly detection, privilege escalation alerts
- Continuous image reassessment: alerts when a new CVE affects an already-running container
mTLS with OpenShift Service Mesh
Deploy OpenShift Service Mesh (Red Hat’s Istio distribution) in strict mTLS mode for all pod-to-pod communication. OpenShift Service Mesh is installed via OperatorHub and integrates with OpenShift’s RBAC for control plane access. Enable PeerAuthentication in strict mode per namespace — this ensures all traffic within the mesh is encrypted and mutually authenticated at the sidecar layer, regardless of the application’s own TLS configuration.
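A namespace-wide strict policy is short. The sketch below assumes the production namespace is already a member of the mesh.

```yaml
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default            # a policy with no selector applies to the whole namespace
  namespace: production
spec:
  mtls:
    mode: STRICT           # reject plaintext traffic to sidecar-injected pods
```

Rolling this out in PERMISSIVE mode first and switching to STRICT once all clients carry sidecars avoids cutting off un-meshed callers.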
etcd Encryption
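Enable etcd encryption at rest for Secrets and ConfigMaps, in addition to any volume-level encryption. On-premise this is critical because the physical etcd backup media (NFS mounts, tape) may leave the secure perimeter. Select AES-CBC (type aescbc) or AES-GCM (type aesgcm) through the APIServer custom resource. Once encryption is enabled, OpenShift rotates the encryption keys automatically on a weekly basis, so verify that rotation is happening in your quarterly review rather than scheduling it by hand: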
apiVersion: config.openshift.io/v1
kind: APIServer
metadata:
  name: cluster
spec:
  encryption:
    type: aescbc   # valid values include aescbc and aesgcm
5. Secret Management with HashiCorp Vault
HashiCorp Vault with Kubernetes Auth
Deploy HashiCorp Vault (or use the OpenShift Secrets Management operator for a managed experience). Configure Vault’s Kubernetes auth method to allow OpenShift ServiceAccounts to authenticate to Vault using their JWT tokens — eliminating static credentials entirely:
# Configure Kubernetes auth in Vault
vault auth enable kubernetes
vault write auth/kubernetes/config \
  kubernetes_host="https://api.cluster.example.com:6443" \
  kubernetes_ca_cert=@/var/run/secrets/kubernetes.io/serviceaccount/ca.crt
# Create a role binding a ServiceAccount to a Vault policy
vault write auth/kubernetes/role/my-service \
  bound_service_account_names=my-service-sa \
  bound_service_account_namespaces=production \
  policies=my-service-policy \
  ttl=1h
External Secrets Operator
Deploy the External Secrets Operator (ESO) with the Vault provider. ESO synchronises secrets from Vault into OpenShift Secrets automatically, with configurable refresh intervals. This allows existing applications expecting environment variables or volume mounts to continue working without code changes, while keeping the source of truth in Vault:
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: db-credentials
  namespace: production
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: vault-backend
    kind: ClusterSecretStore
  target:
    name: db-credentials-ocp
  data:
  - secretKey: db-password
    remoteRef:
      key: secret/production/database
      property: password
Certificate Management with cert-manager
Deploy cert-manager (available via OperatorHub) with an internal CA issuer — typically your corporate PKI exposed via ACME protocol or a Vault PKI secrets engine issuer. All TLS certificates for OpenShift routes and internal services must be managed by cert-manager with automatic renewal at 30 days before expiry. Never manually create or renew certificates in production. Configure the OpenShift Ingress Operator to use cert-manager-issued certificates for the wildcard domain.
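The 30-day renewal window maps onto the renewBefore field of a Certificate. The sketch below assumes a ClusterIssuer named corporate-ca already exists and uses a placeholder wildcard domain; both names are hypothetical.

```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: apps-wildcard
  namespace: openshift-ingress
spec:
  secretName: apps-wildcard-tls        # Secret the ingress controller will reference
  duration: 2160h                      # 90-day certificate lifetime (assumption)
  renewBefore: 720h                    # renew 30 days before expiry
  dnsNames:
  - "*.apps.cluster.example.com"       # placeholder wildcard domain
  issuerRef:
    name: corporate-ca                 # hypothetical ClusterIssuer (ACME or Vault PKI)
    kind: ClusterIssuer
```

The resulting Secret can then be referenced from the IngressController's default certificate setting.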
6. Observability Stack
Metrics: OpenShift Monitoring + Thanos
OpenShift ships a built-in monitoring stack (Prometheus Operator, Alertmanager, Thanos Querier) in the openshift-monitoring namespace. Enable user-workload monitoring to allow application teams to deploy PodMonitors and ServiceMonitors in their own namespaces. For long-term retention (beyond Prometheus’s 15-day default), deploy a Thanos sidecar with an S3-compatible backend (MinIO on-premise or a dedicated object store) for indefinite metric retention:
# Enable user workload monitoring
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    enableUserWorkload: true
    prometheusK8s:
      retention: 15d
      volumeClaimTemplate:
        spec:
          storageClassName: ocs-storagecluster-ceph-rbd
          resources:
            requests:
              storage: 500Gi
Logs: OpenShift Logging with Loki
Deploy the OpenShift Logging Operator with Loki as the log store (replacing the older Elasticsearch-based stack for new deployments; Loki is significantly more resource-efficient at scale). Use Vector, the collector managed by the Logging Operator, to ship logs to Loki with label-based indexing. Structure all application logs as JSON and include a correlation_id field for distributed tracing correlation. For compliance-required long-term retention, configure a Loki retention policy and back the object store with Ceph or an NFS archival tier:
# Example structured log format
{
  "timestamp": "2026-05-12T10:30:00Z",
  "level": "INFO",
  "message": "Request processed",
  "correlation_id": "abc-123-xyz",
  "namespace": "production",
  "pod": "my-service-7d9f8b",
  "latency_ms": 42,
  "status_code": 200
}
Distributed Tracing: OpenTelemetry + Tempo / Jaeger
Deploy the OpenTelemetry Operator and the Tempo Operator (both available via OperatorHub). The OTel Operator manages the collector DaemonSet and auto-instrumentation webhooks for Java, Python, and Node.js services. Route traces to Tempo backed by Ceph object storage for scalable on-premise retention. Use the Red Hat build of OpenTelemetry for a fully supported, tested configuration. Correlate traces with logs using the trace_id field injected by the OTel SDK.
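Auto-instrumentation is configured through an Instrumentation resource in the application namespace. In this sketch the collector endpoint and the 10% sampling ratio are assumptions; the port in particular depends on which OTLP protocol your agents use.

```yaml
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: default
  namespace: production
spec:
  exporter:
    endpoint: http://otel-collector.observability.svc:4317   # hypothetical collector Service
  propagators:
  - tracecontext
  - baggage
  sampler:
    type: parentbased_traceidratio
    argument: "0.1"          # sample roughly 10% of new traces
```

A workload opts in with a pod annotation such as instrumentation.opentelemetry.io/inject-java: "true".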
Alerting
Define all alert rules as PrometheusRule resources committed to Git and applied via GitOps. Alert on symptoms (high error rate, P99 latency breach, pod crash-loop rate, etcd latency spikes), not raw CPU. Route CRITICAL alerts to PagerDuty or your on-call system via Alertmanager’s webhook/PagerDuty receiver. Include a runbook_url annotation in every alert pointing to the operational runbook. For on-premise environments, also alert on hardware-level events such as IPMI sensor alerts and disk predictive failure (smartd), and correlate them with pod evictions.
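A symptom-based rule might be sketched as follows. The metric http_requests_total with a code label, the 5% threshold, and the runbook URL are assumptions about the application; the runbook link is carried as an annotation, which is the common Prometheus convention.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-service-alerts
  namespace: production
spec:
  groups:
  - name: my-service.symptoms
    rules:
    - alert: HighErrorRate
      # Share of 5xx responses over the last 5 minutes (hypothetical metric and labels)
      expr: |
        sum(rate(http_requests_total{namespace="production",code=~"5.."}[5m]))
          /
        sum(rate(http_requests_total{namespace="production"}[5m])) > 0.05
      for: 10m
      labels:
        severity: critical
      annotations:
        summary: More than 5% of requests are failing
        runbook_url: https://git.example.com/runbooks/high-error-rate.md
```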
7. Disaster Recovery & High Availability
Multi-Site Strategy
For on-premise DR, choose between:
- Stretched cluster: A single OCP cluster spanning two data centres with a witness site. Requires <10ms RTT between sites and synchronous storage replication. Provides automatic failover but is operationally complex.
- Active-passive cluster pair: Two independent OCP clusters — primary and DR — with GitOps bootstrapping the DR cluster from the same config repository. Lower complexity, higher RTO (30-60 min), simpler to operate.
Use Red Hat Advanced Cluster Management for Kubernetes (ACM) to manage both clusters from a single control plane, push policies across them, and monitor cross-cluster application health. ACM’s ApplicationSet integration with ArgoCD enables identical application deployment across clusters with environment-specific overrides.
etcd Backup
Back up etcd on a schedule — this is the single most important backup for an OpenShift cluster:
# Run on a master node; automate via CronJob
/usr/local/bin/cluster-backup.sh /mnt/backup/etcd
# Verify the backup
ls -la /mnt/backup/etcd/
# Expected: snapshot_*.db and static_kuberesources_*.tar.gz
Store backups in a location separate from the cluster storage: an NFS share on a different storage system, tape, or a replicated object store. Test restoration quarterly against a scratch cluster. An untested etcd restore is not a backup strategy.
Application Backup with Velero
Deploy Velero with an S3-compatible backend (MinIO on-premise). Back up Kubernetes/OpenShift resources and Persistent Volume snapshots. Use OpenShift Data Foundation (ODF/Ceph) VolumeSnapshot integration for application-consistent PVC backups:
velero backup create production-daily \
  --include-namespaces production \
  --snapshot-volumes \
  --volume-snapshot-locations odf-ceph \
  --storage-location minio-primary \
  --ttl 720h
Pod Disruption Budgets
Every production workload must have a PDB. On-premise, PDBs are especially important during node maintenance windows — draining a node for hardware replacement must not drop below quorum for stateful grid members:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: grid-node-pdb   # illustrative name
  namespace: production
spec:
  minAvailable: 2 # 3-node grid: never drain below quorum
  selector:
    matchLabels:
      app: grid-node
Graceful Shutdown
Configure terminationGracePeriodSeconds at 120–300 seconds for JVM-based services. Add a preStop hook to ensure the HAProxy router has time to deregister the pod before it begins shutdown — on-premise HAProxy reload intervals are typically 1-5 seconds, so a 10-second sleep in the preStop hook is sufficient:
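# Container level: delay shutdown so the router can deregister the pod
lifecycle:
  preStop:
    exec:
      command: ["/bin/sh", "-c", "sleep 10"]
# Pod level (spec.terminationGracePeriodSeconds)
terminationGracePeriodSeconds: 180
8. Performance & Storage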
Resource Requests and Limits
Set CPU requests at the realistic 50th-percentile consumption. For stateful pods, set limits equal to requests for both CPU and memory to obtain Guaranteed QoS: Guaranteed pods are the last candidates for eviction under node memory pressure and get the lowest OOM-kill priority. On-premise nodes typically have more RAM than cloud VMs; resist the temptation to over-commit memory on stateful nodes. Over-commitment on nodes running in-memory grids causes catastrophic latency spikes and OOM kills once the node comes under memory pressure.
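As a container-level sketch, a pod lands in the Guaranteed class only when every container sets requests equal to limits for both resources. The sizes here are illustrative.

```yaml
# Container fragment for a stateful grid member (sizes are assumptions)
resources:
  requests:
    cpu: "4"
    memory: 32Gi
  limits:
    cpu: "4"          # equal to the request
    memory: 32Gi      # equal to the request
```

The assigned class can be checked in the pod's status.qosClass field.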
Storage with OpenShift Data Foundation (ODF)
Deploy OpenShift Data Foundation (ODF) — the Red Hat-supported Ceph distribution — for all persistent storage needs. ODF provides block, file, and object storage from the same Ceph cluster:
| Access Mode | StorageClass | Backend | Use Case |
|---|---|---|---|
| ReadWriteOnce | ocs-storagecluster-ceph-rbd | Ceph RBD | Database data dirs, WAL logs |
| ReadWriteMany | ocs-storagecluster-cephfs | CephFS | Shared content stores, config mounts |
| ReadWriteOnce (fast) | ocs-storagecluster-ceph-rbd-immediate | Ceph RBD + NVMe OSD | Low-latency WAL, etcd-adjacent workloads |
Set reclaimPolicy: Retain on the StorageClasses (or PersistentVolumes) that back production PVCs, so that deleting a claim never deletes the data. Use volumeBindingMode: WaitForFirstConsumer to ensure PVCs bind to the same failure domain as the pod. For the very highest IOPS requirements (e.g., WAL on in-memory grid persistence), consider local PVs backed by NVMe drives on dedicated stateful nodes using the Local Storage Operator.
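With the Local Storage Operator, local NVMe devices are exposed through a LocalVolume resource. The node label, StorageClass name, and device path below are placeholders to replace with the values from your hardware inventory.

```yaml
apiVersion: local.storage.openshift.io/v1
kind: LocalVolume
metadata:
  name: nvme-local
  namespace: openshift-local-storage
spec:
  nodeSelector:
    nodeSelectorTerms:
    - matchExpressions:
      - key: node-role.kubernetes.io/stateful    # assumed label on stateful nodes
        operator: Exists
  storageClassDevices:
  - storageClassName: local-nvme                 # hypothetical StorageClass name
    volumeMode: Filesystem
    fsType: xfs
    devicePaths:
    - /dev/disk/by-id/nvme-EXAMPLE               # placeholder: use stable by-id paths
```

Local PVs tie a pod to one node, so pair them with application-level replication.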
Node Performance Tuning with MachineConfig
Apply production kernel tuning via MachineConfig rather than DaemonSets — MachineConfig is declarative, version-controlled, and applied at node provisioning time rather than at pod scheduling time:
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: stateful
  name: 99-stateful-sysctl
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
      - path: /etc/sysctl.d/99-ocp-stateful.conf
        contents:
          source: "data:,vm.max_map_count%3D262144%0Anet.core.somaxconn%3D65535%0Afs.file-max%3D1000000"
For NUMA-sensitive workloads, deploy the Node Tuning Operator with Tuned profiles. It manages CPU pinning, hugepages, and IRQ affinity, providing sub-millisecond-latency networking for financial workloads without requiring privileged containers.
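One way to express CPU pinning and hugepages through the Node Tuning Operator is a PerformanceProfile, sketched below. The CPU ranges assume a 48-core node, and the hugepage count and node label are likewise assumptions; size them from the actual NUMA layout.

```yaml
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: low-latency
spec:
  cpu:
    reserved: "0-3"        # housekeeping: kubelet, system daemons, IRQs
    isolated: "4-47"       # available for pinned workload containers
  hugepages:
    defaultHugepagesSize: 1G
    pages:
    - size: 1G
      count: 16
  numa:
    topologyPolicy: single-numa-node
  nodeSelector:
    node-role.kubernetes.io/stateful: ""   # assumed label on the target nodes
```

Applying a profile reboots the affected nodes, so roll it out pool by pool inside a maintenance window.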
JVM Tuning
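Rely on -XX:+UseContainerSupport (on by default in JDK 10+ and 8u191+) so the JVM sizes its heap from the container's cgroup memory limit rather than host memory. cgroup v2 is understood only by newer builds (JDK 15+, with backports to 11.0.16 and 8u372), so check the runtime version on cgroup v2 nodes. On-premise nodes often have 256–512 GiB RAM; without container support, a JVM sizes its default heap from host memory (a quarter of it) and is OOM-killed once it grows past the container limit. For off-heap-intensive workloads, set the container memory limit to: heap + off-heap + metaspace + thread stacks + 512 MiB safety margin.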
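A worked instance of that formula, using illustrative sizes (16 GiB heap, 8 GiB direct memory, 512 threads with 1 MiB stacks), might look like this container fragment. All numbers are assumptions chosen to make the arithmetic easy to follow.

```yaml
# Limit = heap 16 GiB + off-heap 8 GiB + metaspace 512 MiB
#       + thread stacks 512 MiB (512 threads x 1 MiB) + 512 MiB margin
#       = 25.5 GiB = 26112 MiB
env:
- name: JAVA_TOOL_OPTIONS
  value: "-Xmx16g -XX:MaxDirectMemorySize=8g -XX:MaxMetaspaceSize=512m -Xss1m"
resources:
  requests:
    memory: 26112Mi
  limits:
    memory: 26112Mi
```

Setting each JVM region explicitly keeps the sum predictable; percentage-based heap flags are an alternative when off-heap use is small.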
9. GitOps & Release Engineering
OpenShift GitOps (ArgoCD)
OpenShift GitOps (available via OperatorHub) is the Red Hat-supported ArgoCD distribution. It integrates with OpenShift RBAC — ArgoCD projects map to OpenShift Groups, and Argo applications are deployed with the service account permissions of their destination namespace. Core principles:
- The Git repository is the single source of truth: no oc apply in production by humans
- Separate the application source repository from the deployment configuration repository
- Pin all Helm chart versions and image tags; never use latest
- Use ArgoCD’s ApplicationSet with a git generator to manage multi-cluster deployments via ACM
- Enable ArgoCD resource hooks for database migration jobs that must run before the new application version starts
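These principles can be sketched in a single Application manifest. The repository URL, path, tag, and ArgoCD project name are hypothetical.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-service
  namespace: openshift-gitops
spec:
  project: production                  # hypothetical ArgoCD project
  source:
    repoURL: https://git.example.com/platform/deploy-config.git
    targetRevision: v1.4.2             # pinned tag, never a floating branch
    path: apps/my-service/overlays/production
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true                      # remove resources deleted from Git
      selfHeal: true                   # revert manual drift in the cluster
```

With selfHeal enabled, a manual oc apply is reverted on the next reconciliation, which enforces the first principle mechanically.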
OpenShift Pipelines (Tekton)
OpenShift Pipelines (Tekton) is the preferred CI engine for on-premise OpenShift — it runs natively in the cluster with no external CI server to manage. Define pipeline runs as Kubernetes resources, version them in Git, and trigger them via EventListeners bound to your source control webhook events. Use Tekton Chains for supply chain security — it automatically signs TaskRun results and generates SLSA provenance attestations for every image built by the pipeline.
Helm Best Practices
# Lint strictly
helm lint ./charts/my-service --strict
# Validate rendered manifests against OCP API schema
helm template ./charts/my-service | oc apply --dry-run=server -f -
# Security scan
helm template ./charts/my-service | kubesec scan -
# Check deprecated API versions
helm template ./charts/my-service | pluto detect -
# Validate against OPA/Gatekeeper constraints
helm template ./charts/my-service | gator test -f -
Environment Promotion
Model promotions as pull requests from dev → staging → production branches in your deployment config repository. Require automated test gates (smoke tests, integration tests via Tekton) to pass before a PR can be merged. For on-premise environments with change management requirements, integrate PR approval with your ITSM tool (ServiceNow, Jira Service Management) so every production promotion automatically creates a Change Record with full git diff attached.
10. Deployment Safety
Progressive Delivery with Argo Rollouts
Deploy Argo Rollouts for canary and blue-green deployments. Integrate with the OpenShift Ingress Controller or OpenShift Service Mesh for traffic weight splitting:
strategy:
  canary:
    steps:
    - setWeight: 5
    - pause: {duration: 5m}
    - analysis:
        templates:
        - templateName: error-rate-check
    - setWeight: 25
    - pause: {duration: 10m}
    - setWeight: 100
    canaryService: my-service-canary
    stableService: my-service-stable
    trafficRouting:
      istio:
        virtualService:
          name: my-service-vsvc
          routes:
          - primary
StatefulSet Partition Upgrades
Use updateStrategy.rollingUpdate.partition to roll one StatefulSet pod at a time: only pods with an ordinal greater than or equal to the partition are updated, so start with the partition at the highest ordinal and lower it by one after each validated step. Automate validation with a post-upgrade Tekton Task or Kubernetes Job that verifies grid membership count and data accessibility before lowering the partition again. Treat partition upgrades as a formal change with a maintenance window on-premise: coordinate with on-call, notify users, and have a rollback runbook ready before starting.
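For a three-replica grid (ordinals 0 to 2), the first step of such an upgrade can be sketched as the StatefulSet fragment below. The replica count is an assumption.

```yaml
# StatefulSet spec fragment: step 1 of a staged upgrade
spec:
  replicas: 3
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 2     # only pod ordinal 2 receives the new revision
```

After validation, lowering the partition to 1 and then 0 rolls the remaining pods one at a time; the pods already updated stay on the new revision throughout.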
Smoke Tests
Run a Kubernetes Job as the final step of every pipeline run. Exercise critical journeys (authentication via Red Hat SSO, core API endpoints, health checks) and emit PASS/FAIL. When the Job is wired into the Rollout as an analysis step, a failure makes Argo Rollouts abort the canary automatically. In a regulated on-premise environment, smoke test results should be captured as evidence in your change management system, timestamped and signed.
11. Capacity & Cost Governance
OpenShift Cost Management (Koku)
Deploy Red Hat Cost Management (based on Koku) or integrate with your existing ITSM/financial systems via the Prometheus metrics that OCP exposes. Cost Management breaks down resource consumption by project (namespace), label, and cluster — enabling per-team chargebacks without cloud billing APIs. Configure it to consume OCP Prometheus data and export cost allocation reports to your finance system monthly.
OCP Subscription Management
OpenShift is licensed per core. Track worker node core count against your subscription entitlements using Red Hat Subscription Management (RHSM) integrated with your cluster. Avoid over-provisioning worker nodes for burst capacity you rarely use — on-premise, unused cores still consume subscription. Instead, maintain a pre-approved capacity expansion runbook that allows adding pre-staged nodes within your maintenance window SLA, and size the permanent fleet at your P75 utilisation.
Quota Enforcement for Chargeback
Apply consistent labels to all workloads for cost attribution:
metadata:
  labels:
    cost-center: "12345"
    team: "platform-engineering"
    environment: "production"
    application: "my-service"
Use OpenShift’s Project annotations to store billing codes and team ownership. ResourceQuota enforcement ensures teams cannot silently over-consume: requests above quota are rejected, making cost overruns visible before they happen rather than after month-end billing.
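A per-project quota of this kind might look like the following sketch. The numbers are placeholders to be derived from each team's approved budget.

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: production
spec:
  hard:
    requests.cpu: "40"
    requests.memory: 160Gi
    limits.cpu: "80"
    limits.memory: 320Gi
    persistentvolumeclaims: "20"
```

Once a compute quota is in place, pods without requests and limits are rejected, which is why the quota is normally paired with a LimitRange that supplies defaults.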
Storage Lifecycle
Implement Ceph pool tiering — move cold data from NVMe-backed pools to spinning disk (HDD) pools automatically using ODF’s storage class tiering. Configure PVC lifecycle policies to reclaim orphaned persistent volumes on a weekly schedule. For log and backup storage, use Ceph’s erasure-coding pools which reduce storage overhead by 40-50% compared to 3x replication for cold data.
12. Compliance & Audit
OpenShift Audit Logs
Configure the OCP API audit policy to capture all security-relevant events. Ship audit logs to your SIEM (Splunk, IBM QRadar, or Elastic SIEM on-premise) through a dedicated ClusterLogForwarder pipeline, kept separate from application logs. Retain audit logs for 12 months minimum on fast storage and 7 years in archive. Alert on high-risk audit events: exec into pods, secret reads, ClusterRoleBinding creations, SCC grant changes, and MachineConfig modifications.
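The audit policy is selected on the cluster APIServer resource. The sketch below picks the WriteRequestBodies profile as one reasonable choice; which profile satisfies your framework is for your auditors to decide.

```yaml
apiVersion: config.openshift.io/v1
kind: APIServer
metadata:
  name: cluster
spec:
  audit:
    profile: WriteRequestBodies   # also logs request bodies for write operations
```

This is the same cluster object that carries the etcd encryption setting, so merge both fields into one manifest in Git rather than maintaining two.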
OpenShift Compliance Operator
Deploy the OpenShift Compliance Operator (available via OperatorHub) to continuously assess the cluster against CIS OpenShift Benchmark, NIST SP 800-53, or PCI-DSS profiles. The Compliance Operator runs scans on a schedule, generates detailed findings with remediation guidance, and can automatically apply remediations via MachineConfig for supported rules:
apiVersion: compliance.openshift.io/v1alpha1
kind: ScanSettingBinding
metadata:
  name: cis-ocp
profiles:
- name: ocp4-cis
  kind: Profile
  apiGroup: compliance.openshift.io/v1alpha1
- name: ocp4-cis-node
  kind: Profile
  apiGroup: compliance.openshift.io/v1alpha1
settingsRef:
  name: default
  kind: ScanSetting
  apiGroup: compliance.openshift.io/v1alpha1
RHACS for Continuous Compliance
RHACS provides a compliance dashboard mapped to regulatory frameworks (CIS, NIST, PCI-DSS, HIPAA). It scans running workloads, network topology, and RBAC configuration in real time and generates evidence reports suitable for auditor submission. Configure scheduled compliance report exports to your GRC (Governance, Risk, Compliance) platform. For on-premise environments with air-gapped audit requirements, RHACS reports can be exported to PDF and archived alongside change records.
SBOM and Image Provenance
Generate an SBOM for every container image in your Tekton pipeline using Syft or Red Hat’s SBOM tooling. Use Tekton Chains to sign image digests with Cosign and generate SLSA Level 2 provenance attestations. Store SBOMs in Quay alongside image manifests. RHACS continuously monitors for new CVEs affecting running containers — integrate RHACS violation alerts with your vulnerability management system for automated ticket creation on CRITICAL findings.
13. Day-2 Operations
OCP Upgrade Strategy
OpenShift follows a minor.patch release cadence with defined upgrade channels: candidate, fast, stable, and eus (Extended Update Support). For on-premise production:
- Use the eus channel for EUS-to-EUS upgrades between even-numbered minors (e.g., 4.14 → 4.16): the control plane still passes through 4.15, but worker MachineConfigPools can be paused so worker nodes are drained and rebooted only once
- Use stable for production, since patches have been in the fast channel for at least 2 weeks
- Upgrade non-production clusters first via the fast channel, validate, then promote to stable for production
- Use oc adm upgrade --to-image for precise version control, not automatic channel following
- Upgrade in maintenance windows: OCP upgrades drain and reboot nodes, causing brief pod migrations
- Validate PDBs before every upgrade to ensure node drains cannot violate quorum
# Check upgrade path and available versions
oc adm upgrade
# Initiate upgrade to specific version
oc adm upgrade --to=4.16.12
# Monitor upgrade progress
oc get clusteroperators
oc get nodes -w
Node Maintenance
Use the Node Maintenance Operator (available via OperatorHub) for structured node maintenance. It cordons, drains, and marks the node as under maintenance in a single operation, integrating with PDB validation to prevent unsafe drains. This is especially important on-premise where you regularly need to take nodes down for firmware updates, hardware replacement, or data centre maintenance.
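Putting a node into maintenance is then a matter of creating one resource. The node name and reason below are illustrative.

```yaml
apiVersion: nodemaintenance.medik8s.io/v1beta1
kind: NodeMaintenance
metadata:
  name: worker-03-firmware
spec:
  nodeName: worker-03.example.internal   # hypothetical node
  reason: "BIOS and firmware update"
```

Deleting the resource after the work is done uncordons the node and returns it to service.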
Dependency Currency
Treat Helm chart versions, operator versions, and base image versions as dependencies requiring regular updates. Use Renovate Bot configured against your on-premise GitLab or Gitea instance to raise automatic PRs for dependency updates. Review weekly. Pay particular attention to OpenShift operator updates — operators installed via OperatorHub can be configured for Manual or Automatic approval; use Manual for production so updates are reviewed before applying.
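Manual approval is set per operator on its Subscription. The sketch below uses the OpenShift GitOps operator as an example; the channel is an assumption to check against the catalog.

```yaml
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: openshift-gitops-operator
  namespace: openshift-operators
spec:
  channel: latest                      # assumed channel name
  name: openshift-gitops-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
  installPlanApproval: Manual          # each update waits for an approved InstallPlan
```

With Manual approval, pending InstallPlans accumulate until someone approves them, so include them in the weekly dependency review.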
Chaos Engineering
Run controlled failure injection quarterly using Chaos Mesh or Litmus Chaos on-premise. Start with low-impact experiments: pod kill, node drain simulation, network latency injection via Service Mesh fault injection. Measure whether alerting fires within SLO thresholds and whether automated recovery restores the system without manual intervention. On-premise, also test storage failure scenarios — ODF Ceph node loss, network partition between OSD nodes — which have no equivalent in cloud-managed storage.
Runbooks
Every alert must link to a runbook stored in Git, versioned, and reviewed quarterly. On-premise runbooks should include: symptom description, diagnostic commands with example output, escalation path (including hardware vendor contacts for infrastructure issues), rollback procedure, and ITSM ticket template. Automate repeatable runbook steps using OpenShift Pipelines triggered by alert webhooks — reducing MTTR for known incident patterns to near-zero human intervention time.
14. Pre-Go-Live Checklist
Cluster Foundation
- 3 control plane nodes across 3 physical failure domains
- etcd on NVMe-backed storage; etcd health verified across all members
- Dedicated infra MachineConfigPool for OCP infrastructure components
- OVN-Kubernetes CNI deployed; MetalLB in BGP mode for LoadBalancer services
- Machine Config Operator managing all node configurations declaratively
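The failure-domain item can be spot-checked from the CLI. The sketch below assumes each control plane node carries the standard `topology.kubernetes.io/zone` label, with one value per rack or power domain. On bare metal that label is usually applied by the administrator, so its presence is an assumption, and both function names are hypothetical.

```shell
# Print "<node> <zone>" for every control plane node (requires cluster access).
list_control_plane_zones() {
  oc get nodes -l node-role.kubernetes.io/master --no-headers \
    -o custom-columns='NAME:.metadata.name,ZONE:.metadata.labels.topology\.kubernetes\.io/zone'
}

# Read that listing on stdin and print the number of distinct zones,
# ignoring nodes where the label is missing (shown as <none>).
count_failure_domains() {
  awk '$2 != "<none>" && !seen[$2]++ { n++ } END { print n + 0 }'
}

# Example: list_control_plane_zones | count_failure_domains
```

A result below 3 means at least two control plane nodes share a labelled failure domain, or that some nodes are not labelled at all.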
Security
- Red Hat SSO integrated for authentication; no local htpasswd users
- All workloads running under restricted-v2 SCC; elevated SCCs documented and justified
- All secrets sourced from Vault via External Secrets Operator
- etcd encryption enabled (AES-CBC or AES-GCM)
- All containers run as non-root with read-only root filesystem
- RHACS deployed; all images scanned; 0 CRITICAL/HIGH unresolved CVEs
- RHACS admission controller in enforce mode; unsigned/failing images blocked
- OpenShift Service Mesh mTLS strict mode enforced
- OPA/Gatekeeper policies enforced: required labels, registry restriction, resource limits
- Default-deny NetworkPolicy applied via OVN-Kubernetes
- FIPS mode enabled (if required by compliance framework)
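The etcd encryption item is one of the easier ones to verify mechanically. This sketch reads the encryption type from the cluster APIServer resource and accepts only the two ciphers named above. The function name is hypothetical.

```shell
# Succeed only when the given encryption type is aescbc or aesgcm.
# An empty value or "identity" means etcd data is not encrypted.
is_etcd_encrypted() {
  case "$1" in
    aescbc|aesgcm) return 0 ;;
    *) return 1 ;;
  esac
}

# Example (requires cluster access):
# if is_etcd_encrypted "$(oc get apiserver cluster -o jsonpath='{.spec.encryption.type}')"; then
#   echo "etcd encryption: enabled"
# else
#   echo "etcd encryption: NOT enabled"
# fi
```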
Reliability
- All Deployments have at least 3 replicas spread across physical failure domains
- PodDisruptionBudgets defined for every production workload
- HPA configured with CPU and custom metrics
- Liveness, readiness, and startup probes configured and tuned
- terminationGracePeriodSeconds appropriate for application shutdown time
- etcd backup scheduled and tested via restoration to scratch cluster
- Velero backup configured with ODF snapshot integration; restore tested
- DR failover runbook documented and tested (active-passive or stretched)
- Node Maintenance Operator deployed; drain procedure validated against PDBs
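The replica-count item can be approximated with a quick listing. The sketch below flags Deployments with fewer than 3 replicas and skips namespaces whose names start with `openshift` or `kube-`, on the assumption that those are platform namespaces. It checks replica counts only, not spread across failure domains, and the function names are hypothetical.

```shell
# List every Deployment as "<namespace> <name> <replicas>" (requires cluster access).
list_deployment_replicas() {
  oc get deployments --all-namespaces --no-headers \
    -o custom-columns='NS:.metadata.namespace,NAME:.metadata.name,REPLICAS:.spec.replicas'
}

# Read that listing on stdin and print application Deployments that run
# fewer than 3 replicas.
under_replicated() {
  awk '$1 !~ /^(openshift|kube-)/ && $3 + 0 < 3 { print $1 "/" $2 " replicas=" $3 }'
}

# Example: list_deployment_replicas | under_replicated
```

For HPA-managed Deployments the replica count reflects the current scale, so compare against the HPA's minReplicas as well.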
Observability
- User workload monitoring enabled; metrics flowing to Prometheus + Thanos
- Structured JSON logs shipping to Loki via OpenShift Logging Operator
- Distributed traces reaching Tempo via OpenTelemetry Operator
- Alerts defined for error rate, P99 latency, pod crash loops, etcd latency
- Every alert has a runbook_url annotation
- On-call rotation configured in Alertmanager; PagerDuty/ITSM integration tested
- Hardware-level alerts (IPMI, disk SMART) correlated with OCP events
Capacity & Governance
- All pods have resource requests and limits defined
- ResourceQuota and LimitRange applied to all projects via project template
- Cost labels applied to all workloads; Cost Management configured
- OCP subscription entitlements verified against deployed core count
- Storage tiering policy configured; orphaned PV reclaim scheduled
Compliance & Operations
- API audit logs shipping to SIEM with 12-month fast retention
- Compliance Operator CIS scan: 0 HIGH findings
- RHACS compliance dashboard: target score met for all required frameworks
- SBOM generated and stored in Quay for all production images
- OpenShift GitOps (ArgoCD) deployed; no manual oc apply in production
- OCP upgrade channel set to stable or eus; upgrade runbook tested in staging
- Chaos experiment baseline executed; automated recovery validated
- cert-manager managing all TLS certificates with auto-renewal
- Change management integration: production promotions create ITSM change records
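The upgrade channel item can be checked in the same style. This sketch assumes channel names follow the usual `<channel>-<minor version>` pattern, such as `stable-4.16` or `eus-4.16`. The function name is hypothetical.

```shell
# Succeed only when the channel name starts with stable- or eus-.
is_production_channel() {
  case "$1" in
    stable-*|eus-*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example (requires cluster access):
# channel="$(oc get clusterversion version -o jsonpath='{.spec.channel}')"
# is_production_channel "$channel" || echo "unexpected upgrade channel: $channel"
```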
The Bottom Line
On-premise OpenShift gives you full control — and full responsibility. The 14 pillars above are non-negotiable for enterprise production. The on-premise model means you own the hardware failure domain planning, the storage fabric, the network BGP configuration, and the upgrade windows that cloud providers handle for you. The upside: complete data sovereignty, predictable cost at scale, and no cloud egress fees. The investment: a disciplined operations practice built on MachineConfig, GitOps, Compliance Operator, RHACS, and Vault — all of which compound into a cluster that is genuinely production-grade rather than one that merely looks like it.


