# Kubernetes Operators Demystified: We Built One So You Don't Have To

*Operators are Kubernetes's best-kept power feature. We break down how they work, why they matter, and walk through building one from scratch — complete with the bugs we shipped to production.*
## The Moment It Clicked
We were three weeks into a client migration — replacing their managed PostgreSQL with CloudNativePG — when one of their junior engineers asked the question that stops every Kubernetes conversation dead:
"But who restarts the database if it crashes?"
In the old world, the answer was "AWS does." RDS handles failover, restores replicas, runs backups at 3am while you sleep. That's the deal. You pay Amazon, Amazon babysits your database.
In the Kubernetes world, the answer is: "The Operator does."
And the junior engineer said: "...what's an Operator?"
Fair question. Because despite being one of the most powerful patterns in all of Kubernetes, Operators are criminally under-explained. Most tutorials either hand-wave ("it's like a robot SRE!") or immediately dump you into controller-runtime Go code that reads like assembly language wrote a love letter to YAML.
Let's fix that.
## Operators, Explained Like You're at a Bar
Imagine you hire a bartender. You don't stand behind them all night saying "pour that beer, now wipe that glass, now take that order." You tell them once: "Keep the bar running." They know what that means. Low on IPA? Order more. Glass breaks? Clean it up. Customer gets rowdy? Handle it.
A Kubernetes Operator is that bartender, but for your infrastructure.
More precisely: an Operator is a custom controller that watches a custom resource (your "desired state") and continuously reconciles reality to match it. It encodes operational knowledge — the stuff your best SRE knows — into software that runs 24/7 and never needs coffee.
Here's the pattern in pseudocode:
```
while true:
    desired = read_custom_resource("my-database")
    actual  = observe_cluster_state()
    if actual != desired:
        take_action_to_reconcile(actual, desired)
    sleep(30 seconds)
```
That's it. That's the whole pattern. Everything else is implementation details.
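To make the loop concrete, here's a runnable toy version in Go — no Kubernetes involved, just a struct standing in for cluster state. The `Cluster` type and the one-step-at-a-time scaling are illustrative only; a real controller talks to the API server, not a field:

```go
package main

import "fmt"

// Cluster is a toy stand-in for observed cluster state: just a replica count.
type Cluster struct{ replicas int }

// reconcile compares actual state against desired state and takes one
// corrective step, returning true once the two match. This is the whole
// pattern: observe, diff, act.
func reconcile(c *Cluster, desired int) bool {
	switch {
	case c.replicas < desired:
		c.replicas++ // scale up
	case c.replicas > desired:
		c.replicas-- // scale down
	}
	return c.replicas == desired
}

func main() {
	c := &Cluster{replicas: 1}
	// The control loop: keep reconciling until actual == desired.
	for !reconcile(c, 3) {
	}
	fmt.Println("replicas:", c.replicas) // prints "replicas: 3"
}
```

Note that the loop converges no matter where the actual state starts — that's the property that makes the pattern resilient to drift.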
The Three Layers of an Operator
Every Operator has three components:
### 1. Custom Resource Definition (CRD)
This is your API — the thing users create to tell the Operator what they want. It extends the Kubernetes API with your own resource types.
```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: databases.oikonex.com
spec:
  group: oikonex.com
  versions:
    - name: v1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                engine:
                  type: string
                  enum: ["postgres", "mysql"]
                version:
                  type: string
                replicas:
                  type: integer
                  minimum: 1
                  maximum: 5
                storage:
                  type: string
                backupSchedule:
                  type: string
  scope: Namespaced
  names:
    plural: databases
    singular: database
    kind: Database
    shortNames:
      - db
```
Once this CRD is applied, users can do this:
```yaml
apiVersion: oikonex.com/v1
kind: Database
metadata:
  name: orders-db
  namespace: production
spec:
  engine: postgres
  version: "16"
  replicas: 3
  storage: 100Gi
  backupSchedule: "0 2 * * *"
```
And `kubectl get databases` Just Works. The Kubernetes API server treats your custom resource like a first-class citizen. It gets validation, RBAC, audit logging, watch events — the whole package.
### 2. The Controller (The Brain)
This is where the magic lives. The controller watches for changes to your CRD and reconciles the actual state of the cluster to match.
Here's a simplified version of what a database Operator's reconcile loop looks like:
```go
func (r *DatabaseReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	log := log.FromContext(ctx)

	// Fetch the Database custom resource
	var database oikonexv1.Database
	if err := r.Get(ctx, req.NamespacedName, &database); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// Ensure the StatefulSet exists with the correct replica count
	statefulSet := &appsv1.StatefulSet{}
	err := r.Get(ctx, types.NamespacedName{
		Name:      database.Name + "-postgres",
		Namespace: database.Namespace,
	}, statefulSet)
	if errors.IsNotFound(err) {
		// StatefulSet doesn't exist — create it
		log.Info("Creating StatefulSet", "name", database.Name)
		newSS := r.buildStatefulSet(&database)
		if err := r.Create(ctx, newSS); err != nil {
			return ctrl.Result{}, err
		}
		return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
	}
	if err != nil {
		// Any other read error: surface it so the controller retries with backoff
		return ctrl.Result{}, err
	}

	// StatefulSet exists — check if it matches desired state
	if *statefulSet.Spec.Replicas != int32(database.Spec.Replicas) {
		log.Info("Scaling StatefulSet",
			"from", *statefulSet.Spec.Replicas,
			"to", database.Spec.Replicas)
		statefulSet.Spec.Replicas = ptr.To(int32(database.Spec.Replicas))
		if err := r.Update(ctx, statefulSet); err != nil {
			return ctrl.Result{}, err
		}
	}

	// Ensure the backup CronJob exists
	if database.Spec.BackupSchedule != "" {
		if err := r.ensureBackupJob(ctx, &database); err != nil {
			return ctrl.Result{}, err
		}
	}

	// Update status
	database.Status.Ready = r.isClusterHealthy(ctx, &database)
	database.Status.Replicas = r.countReadyReplicas(ctx, &database)
	if err := r.Status().Update(ctx, &database); err != nil {
		return ctrl.Result{}, err
	}

	// Recheck every 60 seconds
	return ctrl.Result{RequeueAfter: 60 * time.Second}, nil
}
```
The key insight: reconciliation is idempotent. If the StatefulSet already exists and matches the spec, the controller does nothing. If someone manually deletes a replica pod, the controller notices on the next loop and recreates it. If you change the CRD from 3 replicas to 5, the controller picks it up and scales.
This is fundamentally different from imperative scripts that run once. The Operator continuously enforces your desired state. It's the difference between telling someone "set the thermostat to 72" and hiring someone to sit next to the thermostat forever.
### 3. The Status Subresource
Good Operators report back what's actually happening. The status subresource is where the Operator writes its observations:
```yaml
status:
  ready: true
  replicas: 3
  readyReplicas: 3
  currentVersion: "16.2"
  lastBackup: "2026-01-16T02:00:00Z"
  conditions:
    - type: Available
      status: "True"
      lastTransitionTime: "2026-01-15T10:30:00Z"
      reason: AllReplicasReady
      message: "All 3 replicas are running and healthy"
    - type: BackupSuccessful
      status: "True"
      lastTransitionTime: "2026-01-16T02:05:00Z"
      reason: BackupCompleted
      message: "Backup completed in 4m32s, 2.1GB written to MinIO"
```
This is what makes `kubectl get databases` show useful information instead of just names.
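One detail worth knowing: for that status to actually appear in `kubectl get` output, the CRD declares `additionalPrinterColumns` on the served version. A sketch of what that might look like for the `databases.oikonex.com` CRD above (the column names and paths are our own choices):

```yaml
# Inside spec.versions of the CustomResourceDefinition
versions:
  - name: v1
    served: true
    storage: true
    additionalPrinterColumns:
      - name: Engine
        type: string
        jsonPath: .spec.engine
      - name: Ready
        type: string
        jsonPath: .status.ready
      - name: Replicas
        type: integer
        jsonPath: .status.readyReplicas
      - name: Last Backup
        type: date
        jsonPath: .status.lastBackup
```

With this in place, `kubectl get db` prints the engine, readiness, and backup recency at a glance.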
## Operators We Use Every Day (And Why They're Brilliant)
Rather than building our own Operators for everything (we tried — it's a humbling experience), we rely on battle-tested ones for production workloads:
### CloudNativePG (PostgreSQL)
This is the Operator that made us believers. You give it a Cluster resource:
```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: app-db
spec:
  instances: 3
  storage:
    size: 100Gi
  backup:
    barmanObjectStore:
      destinationPath: s3://backups/postgres
      endpointURL: http://minio:9000
      s3Credentials:
        accessKeyId:
          name: minio-creds
          key: access-key
        secretAccessKey:
          name: minio-creds
          key: secret-key
```
And it handles: provisioning, replication, automatic failover, scheduled backups, point-in-time recovery, connection pooling, rolling upgrades, and TLS certificate management.
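The scheduled-backup piece is driven by its own resource that references the cluster. Roughly — treat the exact field names as an assumption drawn from the CNPG docs — it looks like this:

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: ScheduledBackup
metadata:
  name: app-db-nightly
spec:
  # CNPG cron expressions have six fields: the first is seconds
  schedule: "0 0 2 * * *"
  cluster:
    name: app-db
```

The Operator then creates `Backup` resources on that schedule and ships the results to the `barmanObjectStore` configured on the cluster.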
We had a node failure at 2:47am on a client's cluster. By 2:48am, CloudNativePG had promoted a replica, redirected connections, and started provisioning a replacement. No pages. No intervention. We found out about it at standup the next morning when someone mentioned the node replacement in the cluster events.
That's what a good Operator does.
### Strimzi (Kafka)
Kafka is notoriously painful to operate. Strimzi wraps the entire lifecycle — topic management, user authentication, rack awareness, rolling upgrades — into CRDs:
```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: event-bus
spec:
  kafka:
    version: 3.7.0
    replicas: 3
    listeners:
      - name: plain
        port: 9092
        type: internal
    storage:
      type: persistent-claim
      size: 50Gi
  zookeeper:
    replicas: 3
    storage:
      type: persistent-claim
      size: 10Gi
```
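Topic management works the same declarative way: instead of running shell scripts against a broker, you create a `KafkaTopic` labeled with the cluster name and let Strimzi's entity operator reconcile it. Topic name and settings here are illustrative:

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: orders
  labels:
    strimzi.io/cluster: event-bus   # binds this topic to the Kafka cluster above
spec:
  partitions: 12
  replicas: 3
  config:
    retention.ms: 604800000   # 7 days
```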
We migrated a client from Amazon MSK to Strimzi on bare-metal Kubernetes. The Strimzi Operator handled the ZooKeeper coordination, broker rebalancing, and topic replication — all the stuff that used to require a dedicated Kafka team. The client's Kafka admin said, "I feel like I just automated myself." (He didn't. He just got to work on more interesting problems.)
### cert-manager (TLS)
The unsung hero. cert-manager watches Certificate resources and automatically obtains, renews, and rotates TLS certificates:
```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: api-tls
spec:
  secretName: api-tls-secret
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
  dnsNames:
    - api.example.com
    - "*.api.example.com"
```
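The `letsencrypt-prod` issuer referenced above is itself just another resource. A sketch of a typical ACME `ClusterIssuer` — the email and ingress class are placeholders, and note that the wildcard name in the Certificate above would in practice require a DNS-01 solver rather than the HTTP-01 one shown:

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: ops@example.com            # placeholder
    privateKeySecretRef:
      name: letsencrypt-prod-account-key
    solvers:
      - http01:
          ingress:
            ingressClassName: nginx   # placeholder
```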
Before cert-manager, we had a cron job that ran certbot. It broke silently on a Tuesday, and our certificates expired on a Friday afternoon. At 4:58pm. Nothing teaches you the value of Operators like explaining to a client why their entire API is returning TLS errors right before the weekend.
## When to Use an Existing Operator vs. Build Your Own
We get asked this a lot. Here's our decision tree:
**Use an existing Operator when:**

- The problem is well-known (databases, message queues, certificates)
- An active community maintains it (check GitHub stars, release frequency, issue response time)
- Your requirements are standard (you're not doing anything exotic)

**Build your own when:**

- You have domain-specific operational knowledge that no existing Operator encodes
- You need to orchestrate multiple resources in a custom workflow
- Your application has unique scaling, backup, or recovery requirements
- You want to give your users a declarative API for your platform

**Don't build your own when:**

- A Helm chart would suffice (not everything needs a controller)
- You're just wrapping `kubectl apply` in Go (that's a script, not an Operator)
- Your team doesn't have Go experience (the learning curve is real)
## Building a Simple Operator: The Real Experience
Despite our advice to use existing Operators, we've built a few custom ones for clients with unique requirements. Here's what the experience is actually like.
### The Setup (Kubebuilder)
Kubebuilder is the standard scaffolding tool. It generates the boilerplate so you can focus on the reconciliation logic:
```bash
# Initialize a new operator project
kubebuilder init --domain oikonex.com --repo github.com/oikonex/app-operator

# Create a new API (CRD + Controller)
kubebuilder create api --group apps --version v1 --kind AppInstance
```
This generates ~40 files. Don't panic. You only need to care about two:
- `api/v1/appinstance_types.go` — your CRD schema
- `internal/controller/appinstance_controller.go` — your reconciliation logic
### The Types (5 Minutes of Joy)
Defining your CRD types in Go is the fun part:
```go
type AppInstanceSpec struct {
	// Version of the application to deploy
	Version string `json:"version"`

	// Number of replicas
	Replicas int32 `json:"replicas,omitempty"`

	// Environment-specific configuration
	Environment string `json:"environment"` // "staging" or "production"

	// Customer-specific settings
	CustomerID string `json:"customerID"`

	// Feature flags
	Features FeatureFlags `json:"features,omitempty"`
}

type FeatureFlags struct {
	AdvancedAnalytics bool `json:"advancedAnalytics,omitempty"`
	SSO               bool `json:"sso,omitempty"`
	AuditLogging      bool `json:"auditLogging,omitempty"`
}

type AppInstanceStatus struct {
	Ready          bool               `json:"ready"`
	CurrentVersion string             `json:"currentVersion,omitempty"`
	URL            string             `json:"url,omitempty"`
	Conditions     []metav1.Condition `json:"conditions,omitempty"`
}
```
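With those types registered, a user-facing manifest might look like this — all values are illustrative, and the `apps.oikonex.com/v1` group/version follows from the Kubebuilder flags shown above:

```yaml
apiVersion: apps.oikonex.com/v1
kind: AppInstance
metadata:
  name: acme-prod
spec:
  version: "2.4.1"
  replicas: 3
  environment: production
  customerID: acme
  features:
    sso: true
    auditLogging: true
```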
### The Controller (5 Days of Pain)
Then you write the reconciler. And you discover that the gap between "I understand the concept" and "this works in production" is roughly the width of the Grand Canyon.
Things that bit us:
1. Ownership references matter. If your Operator creates a Deployment, you need to set the owner reference. Otherwise, deleting the custom resource leaves orphaned Deployments floating around like ghost ships:
```go
// ALWAYS set owner references on child resources
if err := ctrl.SetControllerReference(appInstance, deployment, r.Scheme); err != nil {
	return ctrl.Result{}, err
}
```
2. Status updates are separate from spec updates. You can't update .Status and .Spec in the same API call. We lost an afternoon to this:
```go
// This updates the spec
if err := r.Update(ctx, appInstance); err != nil { ... }

// This updates the status — separate call!
if err := r.Status().Update(ctx, appInstance); err != nil { ... }
```
3. Requeue timing is an art. Too fast and you hammer the API server. Too slow and your users think the Operator is broken:
```go
// Fast requeue for things that should resolve quickly
return ctrl.Result{RequeueAfter: 5 * time.Second}, nil

// Slow requeue for steady-state monitoring
return ctrl.Result{RequeueAfter: 60 * time.Second}, nil

// Immediate requeue for cascading changes
return ctrl.Result{Requeue: true}, nil
```
4. Idempotency is non-negotiable. Your reconcile function will be called multiple times for the same event. If it's not idempotent, you'll create duplicate resources, double-scale your deployments, or — in one memorable incident — send 847 duplicate webhook notifications to a client's Slack channel.
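The fix is to write every mutation as an "ensure" that checks observed state before acting. A toy illustration in plain Go — a map stands in for the cluster, where real code would read from the API server:

```go
package main

import "fmt"

// fakeCluster maps resource name -> replica count, standing in for real state.
type fakeCluster map[string]int

// ensureDeployment is idempotent: calling it once or ten times for the same
// event leaves the cluster in the same state. It only mutates when the
// observed state differs from the desired state.
func ensureDeployment(c fakeCluster, name string, replicas int) (changed bool) {
	if cur, ok := c[name]; ok && cur == replicas {
		return false // already converged — do nothing
	}
	c[name] = replicas // create, or correct drift
	return true
}

func main() {
	c := fakeCluster{}
	fmt.Println(ensureDeployment(c, "web", 3)) // prints "true"  — created
	fmt.Println(ensureDeployment(c, "web", 3)) // prints "false" — replay is a no-op
}
```

Because replays are no-ops, duplicate reconcile calls can't double-scale anything — or send 847 duplicate Slack notifications.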
## The Operator Maturity Model
The Operator SDK defines five capability levels. Think of them like video game difficulty settings:
| Level | Capability | Example |
|---|---|---|
| 1 - Basic Install | Automated provisioning | Helm chart wrapper |
| 2 - Seamless Upgrades | Rolling updates, version management | Upgrade from v1 to v2 without downtime |
| 3 - Full Lifecycle | Backup, restore, failure recovery | CloudNativePG, Strimzi |
| 4 - Deep Insights | Metrics, alerts, log aggregation | Operator exports Prometheus metrics |
| 5 - Auto Pilot | Auto-scaling, auto-tuning, anomaly detection | Operator adjusts replicas based on load |
Most production Operators we deploy are Level 3. Level 5 Operators exist, but they're rare — and honestly, we're still a little suspicious of anything that auto-tunes a database without human approval.
## Why This Matters for Portable Infrastructure
Here's the thing that connects Operators to everything else we do at Oikonex: Operators make your infrastructure portable.
When your PostgreSQL is managed by CloudNativePG, it runs the same way on EKS, GKE, AKS, bare metal, or a Raspberry Pi cluster in someone's closet (we've done this — long story). The Operator abstracts away the underlying infrastructure differences.
This is fundamentally different from managed services. RDS only works on AWS. Cloud SQL only works on GCP. But a CloudNativePG Cluster resource works on any Kubernetes cluster, anywhere.
The same applies to every Operator-managed service:
- Strimzi Kafka runs anywhere Kubernetes runs (unlike Amazon MSK)
- cert-manager handles certificates on any cluster (unlike AWS Certificate Manager)
- NATS Operator provides messaging anywhere (unlike Amazon SQS)
When we help clients achieve multi-cloud portability, Operators are the foundation. They're how you get the operational benefits of managed services without the lock-in of managed services.
## Getting Started
If you're new to Operators, here's the path we recommend:
1. **Use existing Operators first.** Install CloudNativePG or cert-manager. Get comfortable with the CRD pattern. See how declarative management feels.

2. **Read the source.** CloudNativePG is open source and well-written Go. Reading real Operator code teaches you more than any tutorial.

3. **Build a toy Operator.** Use Kubebuilder. Make something simple — maybe an Operator that watches a `Website` CRD and creates Deployments + Services + Ingress resources. Ship it to a dev cluster.

4. **Contribute to an existing Operator.** Fix a bug or add a feature to an Operator you use. You'll learn the patterns from experienced maintainers.

5. **Build a real one only when you need to.** Custom Operators are powerful but expensive to maintain. Make sure the operational knowledge you're encoding is actually unique to your domain.
The Kubernetes community has done remarkable work building Operators for common infrastructure. Our job — and yours — is to assemble the right ones into a portable platform that runs anywhere your business needs it to.
And if a junior engineer asks "what's an Operator?" — now you've got an answer that doesn't involve the phrase "it's like a robot SRE."
(Though honestly, that's not a terrible analogy. It's just not a complete one.)