The CNCF Operator Whitepaper
Every production system has a human who knows its secrets: the person who knows exactly how to perform a failover, which configuration knobs to tune during a traffic spike, and what sequence to follow for a safe upgrade.
That knowledge is invaluable. It's also fragile, unscalable, and unrepeatable.
The Kubernetes Operator pattern solves this by encoding that operational expertise into software. The CNCF TAG App Delivery published a comprehensive whitepaper, Operator Pattern, that defines the pattern, maps its capabilities, and provides best practices for building production-grade operators.
What Is an Operator?
The whitepaper defines an operator as:
"A synthesis of human behaviour, codified into software to facilitate the full lifecycle management of an application."
In Kubernetes terms, an operator extends the platform's API with domain-specific operational knowledge. It does more than deploy an application: it knows how to upgrade it, back it up, recover it from failure, and scale it appropriately.
The pattern consists of three components:
- The application or infrastructure being managed
- A domain-specific language (Custom Resources) for declaring desired state
- A controller that continuously reconciles desired state with reality

The Reconciliation Loop
At the heart of every operator is the control loop, the same pattern that powers Kubernetes itself:
- Observe: read the current state of the system
- Compare: check it against the desired state declared in the Custom Resource
- Act: take corrective action to close the gap
- Repeat: continuously, forever
This is what makes operators powerful. They don't just apply configuration once and walk away. They continuously ensure the system matches the declared intent, handling drift, failures, and environmental changes automatically.
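The loop above can be sketched in a few lines of Python. This is a minimal illustration, not a real controller: `DesiredState`, `FakeCluster`, and `run_until_converged` are hypothetical names, and the in-memory cluster stands in for calls to the Kubernetes API.

```python
from dataclasses import dataclass


@dataclass
class DesiredState:
    """Stands in for the spec of a Custom Resource (hypothetical field)."""
    replicas: int


class FakeCluster:
    """In-memory stand-in for the managed system; not a Kubernetes client."""

    def __init__(self, running: int) -> None:
        self.running = running

    def observe(self) -> int:
        return self.running

    def start_replica(self) -> None:
        self.running += 1

    def stop_replica(self) -> None:
        self.running -= 1


def reconcile(desired: DesiredState, cluster: FakeCluster) -> bool:
    """Run one observe/compare/act pass; return True if nothing had to change."""
    actual = cluster.observe()                # Observe
    if actual == desired.replicas:            # Compare
        return True
    if actual < desired.replicas:             # Act: take one step towards the goal
        cluster.start_replica()
    else:
        cluster.stop_replica()
    return False


def run_until_converged(desired: DesiredState, cluster: FakeCluster,
                        max_passes: int = 100) -> int:
    """Repeat reconcile() until it reports no change; return the pass count.

    A real operator never stops: it keeps reconciling on every change event
    and on a periodic resync.
    """
    for passes in range(1, max_passes + 1):
        if reconcile(desired, cluster):
            return passes
    raise RuntimeError("did not converge")
```

Because each pass re-reads the actual state rather than remembering what it did last time, the same function handles both the initial rollout and later drift.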

The Eight Operator Capabilities
The whitepaper defines eight core capabilities that operators can provide.
Installation
Provisioning all required resources. This means not just creating Kubernetes objects but verifying that everything works correctly after creation. A database operator doesn't just deploy pods; it confirms the cluster has formed and is accepting connections.
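A rough sketch of "create, then verify", assuming the operator is handed two callables: `create` (provisions the resources) and `check` (returns `True` once the application accepts connections). Both names, the timeout values, and the `"Ready"`/`"Failed"` status strings are illustrative assumptions.

```python
import time
from typing import Callable


def wait_until_ready(check: Callable[[], bool],
                     timeout_s: float = 5.0,
                     interval_s: float = 0.01) -> bool:
    """Poll check() until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


def install(create: Callable[[], None],
            check: Callable[[], bool],
            timeout_s: float = 5.0) -> str:
    """Provision resources, then report a status only after verification."""
    create()
    return "Ready" if wait_until_ready(check, timeout_s) else "Failed"
```

The point of the design is that the installation is not reported as successful when the objects exist, only when the health check passes.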
Upgrades
Managing version updates with full awareness of dependencies, migration steps, and rollback procedures. This includes executing custom commands like database migrations, monitoring the upgrade process, and rolling back automatically on failure.
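One way to picture an upgrade with automatic rollback is the sketch below. `FakeApp`, `migrate`, and `is_healthy` are hypothetical stand-ins; in particular, this sketch only restores the version marker, whereas rolling back a real data migration usually needs a restore path of its own.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class FakeApp:
    """In-memory stand-in for the managed application."""
    version: str
    events: List[str] = field(default_factory=list)


def upgrade(app: FakeApp,
            target: str,
            migrate: Callable[[FakeApp], None],
            is_healthy: Callable[[FakeApp], bool]) -> bool:
    """Run the migration step, switch version, and roll back on any failure."""
    previous = app.version
    try:
        migrate(app)                      # custom command, e.g. a schema migration
        app.version = target
        if not is_healthy(app):           # monitor the result of the upgrade
            raise RuntimeError("post-upgrade health check failed")
    except Exception as exc:
        app.version = previous            # roll back to the last known-good version
        app.events.append(f"rolled back to {previous}: {exc}")
        return False
    app.events.append(f"upgraded {previous} -> {target}")
    return True
```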
Backup and Recovery
Creating consistent backups and restoring applications from them. This is where domain knowledge is critical: a database backup is very different from a message queue backup, and the operator knows the difference.
Auto-Remediation
Restoring applications from complex failure states that Kubernetes alone can't handle. Kubernetes knows how to restart a crashed pod; an operator knows how to recover a split-brain database cluster.
Observability
Providing telemetry about both the operator's behaviour and the application's health. Typical metrics include remediation action counts, backup durations, and reconciliation latency.
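To make those metrics concrete, here is a small in-process recorder using only the standard library. `OperatorMetrics` and the metric names are assumptions for illustration; a production operator would more likely expose these through a metrics library and endpoint.

```python
import time
from collections import Counter
from contextlib import contextmanager
from typing import Dict, Iterator, List


class OperatorMetrics:
    """Tiny in-memory metrics recorder (illustrative, not a metrics client)."""

    def __init__(self) -> None:
        self.counters: Counter = Counter()
        self.durations: Dict[str, List[float]] = {}

    def inc(self, name: str, by: int = 1) -> None:
        """Increment a counter such as a remediation action count."""
        self.counters[name] += by

    @contextmanager
    def timed(self, name: str) -> Iterator[None]:
        """Record how long the wrapped block took, even if it raises."""
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed = time.perf_counter() - start
            self.durations.setdefault(name, []).append(elapsed)


metrics = OperatorMetrics()
metrics.inc("remediation_actions_total")
with metrics.timed("reconcile_duration_seconds"):
    pass  # the reconcile pass would run here
```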
Scaling
Manual and automatic scaling with application awareness. Not just "add more pods", but understanding when to add read replicas versus increase memory, respecting minimum and maximum configurations.
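As an illustration of application-aware scaling, the sketch below chooses between adding a read replica and increasing memory while staying inside configured bounds. The thresholds, field names, and action strings are all hypothetical; a real operator would derive them from the application's own characteristics.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ScalingPolicy:
    min_replicas: int
    max_replicas: int
    qps_threshold: float = 1000.0      # assumed read load one replica can absorb
    memory_threshold: float = 0.85     # assumed utilisation that signals memory pressure


def decide(replicas: int,
           read_qps_per_replica: float,
           memory_utilisation: float,
           policy: ScalingPolicy) -> Tuple[str, int]:
    """Return (action, target replica count) for the current observations."""
    if replicas < policy.min_replicas:             # enforce the lower bound first
        return "add_read_replica", policy.min_replicas
    if replicas > policy.max_replicas:             # and the upper bound
        return "remove_read_replica", policy.max_replicas
    if memory_utilisation > policy.memory_threshold:
        return "increase_memory", replicas         # more pods would not help here
    if read_qps_per_replica > policy.qps_threshold and replicas < policy.max_replicas:
        return "add_read_replica", replicas + 1
    return "none", replicas
```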
Auto-Configuration Tuning
Dynamically adjusting application configuration based on environment characteristics. A database operator might tune buffer pools based on available memory, or adjust connection limits based on node count.
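A toy version of that tuning logic might look like this. The 50% buffer fraction and 100 connections per node are made-up defaults for illustration only, not recommendations for any particular database.

```python
from typing import Dict


def tune(memory_bytes: int,
         node_count: int,
         buffer_fraction: float = 0.5,
         connections_per_node: int = 100) -> Dict[str, int]:
    """Derive configuration values from the environment the app runs in."""
    if memory_bytes <= 0 or node_count <= 0:
        raise ValueError("memory_bytes and node_count must be positive")
    return {
        # Size the buffer pool as a fraction of the memory actually available.
        "buffer_pool_bytes": int(memory_bytes * buffer_fraction),
        # Scale the connection limit with the number of nodes.
        "max_connections": node_count * connections_per_node,
    }


config = tune(memory_bytes=8 * 1024**3, node_count=3)
```

Running this inside the reconciliation loop means the configuration follows the environment: if a pod is rescheduled with more memory, the next pass recomputes the values.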
Lifecycle Management
Both clean uninstallation (removing all resources) and graceful disconnection (removing the operator while leaving the application running independently).
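The difference between the two exits can be shown with a simple ownership marker. In this sketch each resource is a plain dict with a hypothetical `owner` key; in Kubernetes the analogous mechanism is owner references on the managed objects.

```python
from typing import Dict, List

Resource = Dict[str, str]


def uninstall(resources: List[Resource], operator: str) -> List[Resource]:
    """Clean uninstallation: remove everything the operator owns."""
    return [r for r in resources if r.get("owner") != operator]


def disconnect(resources: List[Resource], operator: str) -> List[Resource]:
    """Graceful disconnection: keep the resources, drop only the ownership link."""
    released = []
    for r in resources:
        if r.get("owner") == operator:
            r = {k: v for k, v in r.items() if k != "owner"}
        released.append(r)
    return released
```

Both functions return new lists and leave their input untouched, which keeps the sketch easy to reason about.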
Security Considerations
The whitepaper dedicates significant attention to operator security, and for good reason. Operators typically run with elevated privileges.
For developers:
- Document threat models and RBAC scopes
- Specify exact communication ports
- Provide security disclosure processes
- Follow supply chain security practices
For users:
- Isolate operators in dedicated namespaces
- Grant minimum necessary RBAC permissions
- Review installation scripts before executing
- Verify image provenance and maintenance
- Apply SELinux, AppArmor, or seccomp profiles
The paper defines three scope models: cluster-wide (accesses resources across all namespaces), namespace-scoped (restricted, preferred), and external (manages resources outside the cluster).
The principle is clear: least privilege, always.
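A quick way to act on that principle is to scan an operator's RBAC rules for wildcards before installing it. The sketch below assumes the rules have already been loaded into Python dicts using the `apiGroups`, `resources`, and `verbs` fields of a Kubernetes Role; the helper name `overly_broad` and the sample rules are hypothetical.

```python
from typing import Dict, List, Tuple

Rule = Dict[str, List[str]]

# Sample rules: the first is narrowly scoped, the second grants everything.
role_rules: List[Rule] = [
    {"apiGroups": ["apps"], "resources": ["statefulsets"],
     "verbs": ["get", "list", "watch", "update"]},
    {"apiGroups": ["*"], "resources": ["*"], "verbs": ["*"]},
]


def overly_broad(rules: List[Rule]) -> List[Tuple[int, str]]:
    """Return (rule index, field name) for every field that contains a wildcard."""
    flagged = []
    for index, rule in enumerate(rules):
        for field_name in ("apiGroups", "resources", "verbs"):
            if "*" in rule.get(field_name, []):
                flagged.append((index, field_name))
    return flagged


findings = overly_broad(role_rules)
```

A wildcard is not always wrong, but each finding is a prompt to ask whether the operator really needs that access.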
Choosing the Right Framework
The whitepaper surveys five major operator frameworks:
| Framework | Language | Best For |
|-----------|----------|----------|
| Operator SDK | Go, Helm, Ansible | Production operators with full lifecycle management |
| kubebuilder | Go | Robust operators using controller-runtime |
| Kopf | Python | Rapid prototyping, simple operators |
| Metacontroller | Any (webhooks) | Lightweight controllers in any language |
| Juju | Python/Go | Multi-cloud, charm-based application modelling |
Our default recommendation: Operator SDK with Go for anything complex. It belongs to the Operator Framework, a CNCF project, and offers mature tooling and a large community.
The "Operator of Operators" Pattern
One of the most interesting patterns in the whitepaper is the meta-operator, an operator that coordinates multiple subordinate operators to manage complex application stacks.
Two approaches:
- Single package: one user-facing CRD that internally delegates to multiple controllers through internal CRDs
- Dependency model: a higher-level operator that depends on independently useful operators, managed through OLM (Operator Lifecycle Manager)
This is exactly how you manage a full-stack application: a meta-operator for "the application" coordinates separate database, cache, and message queue operators underneath.
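In code, the core of a meta-operator is a function that expands one parent resource into the child resources the subordinate operators watch. The sketch below assumes hypothetical child kinds (`Database`, `Cache`, `MessageQueue`) and a plain-dict representation of resources.

```python
from typing import Dict, List

# Maps a section of the parent spec to the (hypothetical) child resource kind.
CHILD_KINDS: Dict[str, str] = {
    "database": "Database",
    "cache": "Cache",
    "queue": "MessageQueue",
}


def expand(app: dict) -> List[dict]:
    """Turn one application resource into the child resources it implies."""
    children = []
    for component, kind in CHILD_KINDS.items():
        if component in app["spec"]:
            children.append({
                "kind": kind,
                "name": f"{app['name']}-{component}",
                "owner": app["name"],          # lets the parent clean up its children
                "spec": app["spec"][component],
            })
    return children


shop = {"name": "shop", "spec": {"database": {"replicas": 3}, "queue": {"partitions": 6}}}
children = expand(shop)
```

The meta-operator never touches pods or volumes itself; it only writes child resources and leaves the domain-specific work to the operator responsible for each kind.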
Best Practices Worth Highlighting
From the whitepaper's extensive guidance:
- One operator per application type: don't build monolithic operators
- Design for operator absence: applications should keep running if the operator stops
- Use Kubernetes primitives: leverage ReplicaSets, Services, and ConfigMaps rather than reimplementing them
- Test against failure modes: simulate pod crashes, storage failures, and network partitions
- One CRD per controller: this keeps reconciliation logic clean and debuggable
- Backward compatibility: support previous CRD versions during transitions
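The backward-compatibility point is the easiest to show in code. Below is a sketch of converting an older resource version to the current one, assuming a hypothetical API group `example.com` in which `v1alpha1` used a `size` field that `v1` renamed to `replicas`.

```python
import copy


def to_v1(obj: dict) -> dict:
    """Convert a (hypothetical) example.com/v1alpha1 object to example.com/v1."""
    version = obj["apiVersion"]
    if version == "example.com/v1":
        return obj                               # already current, nothing to do
    if version != "example.com/v1alpha1":
        raise ValueError(f"unsupported apiVersion: {version}")
    converted = copy.deepcopy(obj)               # never mutate the caller's object
    converted["apiVersion"] = "example.com/v1"
    spec = converted.setdefault("spec", {})
    if "size" in spec:
        spec["replicas"] = spec.pop("size")      # field renamed between versions
    return converted


old = {"apiVersion": "example.com/v1alpha1", "spec": {"size": 3}}
new = to_v1(old)
```

With a conversion like this at the edge, the reconciliation logic only ever has to understand the newest version.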
Why Operators Matter for Platform Engineering
Operators are the automation layer of an Internal Developer Platform. They encode "how to run this thing properly" into repeatable, auditable, testable code.
In GoldenPath IDP, we apply operator-pattern thinking throughout:
- Certified scripts that encode operational procedures: the same principle as operator reconciliation, applied to infrastructure automation
- Governance policies that continuously validate state, exactly like a controller checking desired versus actual
- Architecture Decision Records that capture the domain knowledge operators encode
The operator pattern isn't just for Kubernetes. It's a philosophy: encode human expertise into software, then let the software run continuously.
Read the full whitepaper: CNCF Operator Whitepaper
Building custom operators or automating operations? Get in touch, we can help you codify your expertise.