The CNCF Operator Whitepaper

Every production system has a human who knows its secrets: the person who knows exactly how to perform a failover, which configuration knobs to tune during a traffic spike, and what sequence to follow for a safe upgrade.

That knowledge is invaluable. It's also fragile, unscalable, and unrepeatable.

The Kubernetes Operator pattern solves this by encoding that operational expertise into software. The CNCF TAG App Delivery published a comprehensive whitepaper, Operator Pattern, that defines the pattern, maps its capabilities, and provides best practices for building production-grade operators.

What Is an Operator?

The whitepaper defines an operator as:

"A synthesis of human behaviour, codified into software to facilitate the full lifecycle management of an application."

In Kubernetes terms, an operator extends the platform's API with domain-specific operational knowledge. It's not just deploying an application; it's knowing how to upgrade it, back it up, recover it from failure, and scale it appropriately.

The pattern consists of three components:

The application or infrastructure being managed
A domain-specific language (Custom Resources) for declaring desired state
A controller that continuously reconciles desired state with reality

Figure: Operator Design Pattern (the three components working together)

The Reconciliation Loop

At the heart of every operator is the control loop, the same pattern that powers Kubernetes itself:

Observe: read the current state of the system
Compare: check it against the desired state declared in the Custom Resource
Act: take corrective action to close the gap
Repeat: continuously, forever

This is what makes operators powerful. They don't just apply configuration once and walk away. They continuously ensure the system matches the declared intent, handling drift, failures, and environmental changes automatically.
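The loop can be sketched in a few lines of plain Python. This is a minimal illustration, not a real controller: `DesiredState` and `FakeCluster` are hypothetical in-memory stand-ins for a Custom Resource spec and the managed system, and no Kubernetes client is involved.

```python
from dataclasses import dataclass


@dataclass
class DesiredState:
    """Hypothetical stand-in for the spec of a Custom Resource."""
    replicas: int


class FakeCluster:
    """In-memory stand-in for the real system; not a Kubernetes client."""

    def __init__(self, running: int) -> None:
        self.running = running


def reconcile(desired: DesiredState, cluster: FakeCluster) -> str:
    """Run one pass of the loop and report what was done."""
    observed = cluster.running              # Observe
    gap = desired.replicas - observed       # Compare
    if gap == 0:
        return "in sync"
    cluster.running += gap                  # Act: close the gap
    return f"scaled by {gap:+d}"


cluster = FakeCluster(running=1)
desired = DesiredState(replicas=3)

first_pass = reconcile(desired, cluster)    # converges to 3 replicas
cluster.running = 2                         # simulate drift, e.g. a crash
second_pass = reconcile(desired, cluster)   # drift is corrected
third_pass = reconcile(desired, cluster)    # nothing left to do
```

The important property is that `reconcile` is safe to call again and again: it always works from the observed state, so a pass that finds no gap does nothing.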

Figure: Operator Big Picture (how controllers, CRDs, and managed applications interact)

The Eight Operator Capabilities

The whitepaper defines eight core capabilities that operators can provide.

Installation

Provisioning all required resources, not just creating Kubernetes objects, but verifying that everything works correctly after creation. A database operator doesn't just deploy pods; it confirms the cluster has formed and is accepting connections.
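The "create, then verify" idea might look like the sketch below. The `create` and `is_ready` callables are hypothetical hooks; a real operator would create Kubernetes objects and probe the application itself, which is not shown here.

```python
import time
from typing import Callable


def install_and_verify(
    create: Callable[[], None],
    is_ready: Callable[[], bool],
    attempts: int = 5,
    delay_seconds: float = 1.0,
) -> int:
    """Create resources, then poll until the application reports ready.

    Returns the number of readiness checks that were needed.
    """
    create()
    for attempt in range(1, attempts + 1):
        if is_ready():
            return attempt
        time.sleep(delay_seconds)
    raise TimeoutError(f"not ready after {attempts} checks")


# Simulated application that becomes ready on the third probe.
probes = iter([False, False, True])
created = []
checks_needed = install_and_verify(
    create=lambda: created.append("statefulset"),
    is_ready=lambda: next(probes),
    delay_seconds=0,
)
```

Installation only counts as done once the readiness check passes; creating the objects is just the first step.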

Upgrades

Managing version updates with full awareness of dependencies, migration steps, and rollback procedures. This includes executing custom commands like database migrations, monitoring the upgrade process, and rolling back automatically on failure.
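One way to picture "roll back automatically on failure" is to pair every upgrade step with an undo action. The sketch below assumes that pairing; the step names and the `state` dictionary are invented for illustration and do not come from the whitepaper.

```python
from typing import Callable, List, Tuple

# Each step pairs an action with the undo that reverses it.
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_upgrade(steps: List[Step]) -> List[str]:
    """Run steps in order; on failure, undo completed steps in reverse."""
    log: List[str] = []
    done: List[Step] = []
    for step in steps:
        name, action, _undo = step
        try:
            action()
        except Exception as exc:
            log.append(f"failed: {name} ({exc})")
            for prev_name, _action, prev_undo in reversed(done):
                prev_undo()
                log.append(f"rolled back: {prev_name}")
            return log
        done.append(step)
        log.append(f"done: {name}")
    return log


state = {"schema": 1, "image": "v1"}


def migrate_schema() -> None:
    state["schema"] = 2


def revert_schema() -> None:
    state["schema"] = 1


def swap_image() -> None:
    # Simulated failure during the second step.
    raise RuntimeError("health check failed")


def restore_image() -> None:
    state["image"] = "v1"


upgrade_log = run_upgrade([
    ("migrate schema", migrate_schema, revert_schema),
    ("swap image", swap_image, restore_image),
])
```

Because the second step fails, the schema migration is reverted and `state` ends up exactly where it started.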

Backup and Recovery

Creating consistent backups and restoring applications from them. This is where domain knowledge is critical: a database backup is very different from a message queue backup, and the operator knows the difference.

Auto-Remediation

Restoring applications from complex failure states that Kubernetes alone can't handle. Kubernetes knows how to restart a crashed pod; an operator knows how to recover a split-brain database cluster.

Observability

Providing telemetry about both the operator's behaviour and the application's health, with metrics such as remediation action counts, backup durations, and reconciliation latency.
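As a rough sketch of that kind of telemetry, the class below keeps counters and timings in memory. `OperatorMetrics` and the metric names are hypothetical; a production operator would more likely export these through a metrics library than hold them in a dictionary.

```python
import time
from collections import Counter, defaultdict
from typing import Callable, TypeVar

T = TypeVar("T")


class OperatorMetrics:
    """In-memory counters and timings; a real operator would export these."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()
        self.durations = defaultdict(list)

    def count(self, name: str) -> None:
        self.counts[name] += 1

    def timed(self, name: str, func: Callable[[], T]) -> T:
        """Run func and record how long it took, in seconds."""
        start = time.perf_counter()
        try:
            return func()
        finally:
            self.durations[name].append(time.perf_counter() - start)


metrics = OperatorMetrics()
metrics.count("remediation_actions")
metrics.count("remediation_actions")
backup_result = metrics.timed("backup", lambda: "backup-001")
```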

Scaling

Manual and automatic scaling with application awareness. Not just "add more pods", but understanding when to add read replicas versus increase memory, respecting minimum and maximum configurations.
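A small sketch of application-aware scaling, under stated assumptions: `ScalingPolicy` is a hypothetical shape for user-declared bounds, and the thresholds in `choose_action` are illustrative numbers, not recommendations.

```python
from dataclasses import dataclass


@dataclass
class ScalingPolicy:
    """Hypothetical bounds a user might declare in a Custom Resource."""
    min_replicas: int
    max_replicas: int


def clamp_replicas(requested: int, policy: ScalingPolicy) -> int:
    """Keep a requested replica count inside the declared bounds."""
    return max(policy.min_replicas, min(requested, policy.max_replicas))


def choose_action(read_ratio: float, memory_used: float) -> str:
    """Toy heuristic; the thresholds are illustrative, not recommendations."""
    if memory_used > 0.9:
        return "increase memory"
    if read_ratio > 0.8:
        return "add read replica"
    return "no change"


policy = ScalingPolicy(min_replicas=2, max_replicas=5)
capped = clamp_replicas(10, policy)    # request above the maximum
raised = clamp_replicas(1, policy)     # request below the minimum
```

The clamp is the generic part; `choose_action` is where an operator's knowledge of its own application would live.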

Auto-Configuration Tuning

Dynamically adjusting application configuration based on environment characteristics. A database operator might tune buffer pools based on available memory, or adjust connection limits based on node count.
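In code, tuning of this kind is often just a function from environment facts to settings. In the sketch below, the setting names, the 50% memory share, and the 100 connections per node are all hypothetical placeholders.

```python
def tune_database(memory_bytes: int, node_count: int) -> dict:
    """Derive settings from the environment.

    The 50% memory share and 100 connections per node are hypothetical
    placeholders, not values from the whitepaper or any database vendor.
    """
    return {
        "buffer_pool_bytes": memory_bytes // 2,
        "max_connections": 100 * node_count,
    }


GIB = 1024 ** 3
settings = tune_database(memory_bytes=8 * GIB, node_count=3)
```

An operator would re-run a function like this inside its reconciliation loop, so the settings follow the environment as it changes.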

Lifecycle Management

Both clean uninstallation (removing all resources) and graceful disconnection (removing the operator while leaving the application running independently).
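The difference between the two modes can be sketched with a toy ownership map. The resource names are invented, and the map of "resource name to owner" is a deliberate simplification of how Kubernetes tracks ownership.

```python
from typing import Dict, Optional

OPERATOR = "operator"
Resources = Dict[str, Optional[str]]    # resource name -> owner, if any


def clean_uninstall(resources: Resources) -> Resources:
    """Remove the operator and everything it owns."""
    return {
        name: owner
        for name, owner in resources.items()
        if name != OPERATOR and owner != OPERATOR
    }


def disconnect(resources: Resources) -> Resources:
    """Remove only the operator and release what it managed."""
    return {
        name: (None if owner == OPERATOR else owner)
        for name, owner in resources.items()
        if name != OPERATOR
    }


cluster_resources: Resources = {
    "operator": None,
    "db-statefulset": "operator",
    "db-service": "operator",
    "unrelated-app": None,
}

after_uninstall = clean_uninstall(cluster_resources)
after_disconnect = disconnect(cluster_resources)
```

After a disconnect the database resources are still present but no longer owned by anything, which is what lets the application keep running independently.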

Security Considerations

The whitepaper dedicates significant attention to operator security, and for good reason. Operators typically run with elevated privileges.

For developers:

Document threat models and RBAC scopes
Specify exact communication ports
Provide security disclosure processes
Follow supply chain security practices

For users:

Isolate operators in dedicated namespaces
Grant minimum necessary RBAC permissions
Review installation scripts before executing
Verify image provenance and maintenance
Apply SELinux, AppArmor, or seccomp profiles

The paper defines three scope models: cluster-wide (accesses resources across all namespaces), namespace-scoped (restricted, preferred), and external (manages resources outside the cluster).

The principle is clear: least privilege, always.
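A quick way to check a role against that principle is to look for wildcards. The sketch below represents RBAC manifests as Python dictionaries using the standard `rbac.authorization.k8s.io/v1` fields; the role names, namespace, and `example.com` API group are hypothetical, and `find_wildcards` is an illustrative helper, not part of any Kubernetes tooling.

```python
def find_wildcards(role: dict) -> list:
    """Return 'rules[index].field' for every rule field containing '*'."""
    findings = []
    for index, rule in enumerate(role.get("rules", [])):
        for field in ("apiGroups", "resources", "verbs"):
            if "*" in rule.get(field, []):
                findings.append(f"rules[{index}].{field}")
    return findings


# Namespace-scoped Role; the name, namespace, and API group are hypothetical.
scoped_role = {
    "apiVersion": "rbac.authorization.k8s.io/v1",
    "kind": "Role",
    "metadata": {"name": "db-operator", "namespace": "db-operator-system"},
    "rules": [
        {
            "apiGroups": ["example.com"],
            "resources": ["databases"],
            "verbs": ["get", "list", "watch", "update"],
        },
    ],
}

# Cluster-wide role that grants everything: the pattern to avoid.
broad_role = {
    "apiVersion": "rbac.authorization.k8s.io/v1",
    "kind": "ClusterRole",
    "metadata": {"name": "db-operator"},
    "rules": [{"apiGroups": ["*"], "resources": ["*"], "verbs": ["*"]}],
}

scoped_findings = find_wildcards(scoped_role)
broad_findings = find_wildcards(broad_role)
```

A wildcard check is only a first pass. A role can avoid `*` and still list far more resources and verbs than the operator needs.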

Choosing the Right Framework

The whitepaper surveys five major operator frameworks:

| Framework | Language | Best For |
|-----------|----------|----------|
| Operator SDK | Go, Helm, Ansible | Production operators with full lifecycle management |
| kubebuilder | Go | Robust operators using controller-runtime |
| Kopf | Python | Rapid prototyping, simple operators |
| Metacontroller | Any (webhooks) | Lightweight controllers in any language |
| Juju | Python/Go | Multi-cloud, charm-based application modelling |

The default recommendation: Operator SDK with Go for anything complex. It has the largest community, the most mature tooling, and direct CNCF backing.

The "Operator of Operators" Pattern

One of the most interesting patterns in the whitepaper is the meta-operator, an operator that coordinates multiple subordinate operators to manage complex application stacks.

Two approaches:

Single package: one user-facing CRD that internally delegates to multiple controllers through internal CRDs
Dependency model: a higher-level operator that depends on independently useful operators, managed through OLM (Operator Lifecycle Manager)

This is exactly how you manage a full-stack application: a meta-operator for "the application" coordinates separate database, cache, and message queue operators underneath.
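The delegation can be sketched as a parent reconciler that fans a single spec out to its children. Everything here is hypothetical: the component names, the spec fields, and the use of direct function calls. In a real meta-operator the parent would typically create or update the child operators' Custom Resources instead.

```python
from typing import Callable, Dict

# A child reconciler takes its slice of the spec and returns a status string.
ChildReconciler = Callable[[dict], str]


def reconcile_database(spec: dict) -> str:
    return f"database at {spec['replicas']} replicas"


def reconcile_cache(spec: dict) -> str:
    return f"cache sized {spec['size']}"


CHILDREN: Dict[str, ChildReconciler] = {
    "database": reconcile_database,
    "cache": reconcile_cache,
}


def reconcile_application(app_spec: dict) -> Dict[str, str]:
    """Fan the user-facing spec out to each child and collect statuses."""
    status = {}
    for component, child_spec in app_spec.items():
        reconciler = CHILDREN.get(component)
        if reconciler is None:
            status[component] = "unknown component"
            continue
        status[component] = reconciler(child_spec)
    return status


app_status = reconcile_application({
    "database": {"replicas": 3},
    "cache": {"size": "2Gi"},
    "queue": {"partitions": 6},    # no child registered for this one
})
```

The user only ever writes the top-level spec; the aggregated status is what the meta-operator would report back on its own resource.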

Best Practices Worth Highlighting

From the whitepaper's extensive guidance:

One operator per application type: don't build monolithic operators
Design for operator absence: applications should keep running if the operator stops
Use Kubernetes primitives: leverage ReplicaSets, Services, and ConfigMaps rather than reimplementing them
Test against failure modes: simulate pod crashes, storage failures, and network partitions
One CRD per controller: keeps reconciliation logic clean and debuggable
Backward compatibility: support previous CRD versions during transitions
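To make the backward-compatibility point concrete, here is a sketch of upgrading an older resource version in code. The `example.com` group, the `Database` kind, and the rename of `size` to `replicas` are invented for illustration; real CRD version conversion is configured on the CRD itself and is not shown here.

```python
def to_v1(resource: dict) -> dict:
    """Convert a hypothetical Database resource to the v1 schema."""
    version = resource["apiVersion"]
    if version == "example.com/v1":
        return resource
    if version == "example.com/v1alpha1":
        spec = dict(resource["spec"])           # copy, don't mutate the input
        spec["replicas"] = spec.pop("size")     # field renamed in v1
        return {**resource, "apiVersion": "example.com/v1", "spec": spec}
    raise ValueError(f"unsupported version: {version}")


old_resource = {
    "apiVersion": "example.com/v1alpha1",
    "kind": "Database",
    "spec": {"size": 3},
}
converted = to_v1(old_resource)
```

Keeping a conversion like this in place lets existing resources keep working while users migrate to the new version at their own pace.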

Why Operators Matter for Platform Engineering

Operators are the automation layer of an Internal Developer Platform. They encode "how to run this thing properly" into repeatable, auditable, testable code.

In GoldenPath IDP, we apply operator-pattern thinking throughout:

Certified scripts that encode operational procedures: the same principle as operator reconciliation, applied to infrastructure automation
Governance policies that continuously validate state, exactly like a controller checking desired versus actual
Architecture Decision Records that capture the domain knowledge operators encode

The operator pattern isn't just for Kubernetes. It's a philosophy: encode human expertise into software, then let the software run continuously.

Read the full whitepaper: CNCF Operator Whitepaper

Building custom operators or automating operations? Get in touch. We can help you codify your expertise.
