
Kubernetes Operators Demystified: We Built One So You Don't Have To

Operators are Kubernetes's best-kept power feature. We break down how they work, why they matter, and walk through building one from scratch — complete with the bugs we shipped to production.

Oikonex Team · Jan 16, 2026 · 12 min read

The Moment It Clicked

We were three weeks into a client migration — replacing their managed PostgreSQL with CloudNativePG — when one of their junior engineers asked the question that stops every Kubernetes conversation dead:

"But who restarts the database if it crashes?"

In the old world, the answer was "AWS does." RDS handles failover, replaces dead replicas, runs backups at 3am while you sleep. That's the deal. You pay Amazon, Amazon babysits your database.

In the Kubernetes world, the answer is: "The Operator does."

And the junior engineer said: "...what's an Operator?"

Fair question. Because despite being one of the most powerful patterns in all of Kubernetes, Operators are criminally under-explained. Most tutorials either hand-wave ("it's like a robot SRE!") or immediately dump you into controller-runtime Go code that reads like assembly language wrote a love letter to YAML.

Let's fix that.

Operators, Explained Like You're at a Bar

Imagine you hire a bartender. You don't stand behind them all night saying "pour that beer, now wipe that glass, now take that order." You tell them once: "Keep the bar running." They know what that means. Low on IPA? Order more. Glass breaks? Clean it up. Customer gets rowdy? Handle it.

A Kubernetes Operator is that bartender, but for your infrastructure.

More precisely: an Operator is a custom controller that watches a custom resource (your "desired state") and continuously reconciles reality to match it. It encodes operational knowledge — the stuff your best SRE knows — into software that runs 24/7 and never needs coffee.

Here's the pattern in pseudocode:

while true:
  desired = read_custom_resource("my-database")
  actual  = observe_cluster_state()

  if actual != desired:
    take_action_to_reconcile(actual, desired)

  sleep(30 seconds)

That's it. That's the whole pattern. Everything else is implementation details.

The Three Layers of an Operator

Every Operator has three layers:

1. Custom Resource Definition (CRD)

This is your API — the thing users create to tell the Operator what they want. It extends the Kubernetes API with your own resource types.

apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: databases.oikonex.com
spec:
  group: oikonex.com
  versions:
    - name: v1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                engine:
                  type: string
                  enum: ["postgres", "mysql"]
                version:
                  type: string
                replicas:
                  type: integer
                  minimum: 1
                  maximum: 5
                storage:
                  type: string
                backupSchedule:
                  type: string
  scope: Namespaced
  names:
    plural: databases
    singular: database
    kind: Database
    shortNames:
      - db

Once this CRD is applied, users can do this:

apiVersion: oikonex.com/v1
kind: Database
metadata:
  name: orders-db
  namespace: production
spec:
  engine: postgres
  version: "16"
  replicas: 3
  storage: 100Gi
  backupSchedule: "0 2 * * *"

And kubectl get databases Just Works. The Kubernetes API server treats your custom resource like a first-class citizen. It gets validation, RBAC, audit logging, watch events — the whole package.
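
It feels just as ordinary from code. Here's a minimal sketch of listing Database objects with a controller-runtime client, assuming the kubebuilder-generated Go types for the CRD (oikonexv1.Database and its DatabaseList wrapper) are registered in the client's scheme; the repo import path is our hypothetical naming:

// Listing custom resources works exactly like listing Pods or Deployments.
import (
    "context"
    "fmt"

    "sigs.k8s.io/controller-runtime/pkg/client"

    oikonexv1 "github.com/oikonex/database-operator/api/v1"
)

func printDatabases(ctx context.Context, c client.Client) error {
    var dbs oikonexv1.DatabaseList
    if err := c.List(ctx, &dbs, client.InNamespace("production")); err != nil {
        return err
    }
    for _, db := range dbs.Items {
        fmt.Printf("%s: engine=%s, replicas=%d\n", db.Name, db.Spec.Engine, db.Spec.Replicas)
    }
    return nil
}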

2. The Controller (The Brain)

This is where the magic lives. The controller watches for changes to your CRD and reconciles the actual state of the cluster to match.
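
In controller-runtime terms, "watches for changes" is declared once, in a SetupWithManager method. This is roughly the shape kubebuilder scaffolds for you, extended to also watch the child resources the Operator creates:

// For() registers the primary watch on our CRD; Owns() re-triggers
// Reconcile whenever a child resource we created changes or disappears.
func (r *DatabaseReconciler) SetupWithManager(mgr ctrl.Manager) error {
    return ctrl.NewControllerManagedBy(mgr).
        For(&oikonexv1.Database{}).
        Owns(&appsv1.StatefulSet{}).
        Owns(&batchv1.CronJob{}).
        Complete(r)
}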

Here's a simplified version of what a database Operator's reconcile loop looks like:

func (r *DatabaseReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    log := log.FromContext(ctx)

    // Fetch the Database custom resource
    var database oikonexv1.Database
    if err := r.Get(ctx, req.NamespacedName, &database); err != nil {
        return ctrl.Result{}, client.IgnoreNotFound(err)
    }

    // Ensure the StatefulSet exists with correct replica count
    statefulSet := &appsv1.StatefulSet{}
    err := r.Get(ctx, types.NamespacedName{
        Name:      database.Name + "-postgres",
        Namespace: database.Namespace,
    }, statefulSet)

    if errors.IsNotFound(err) {
        // StatefulSet doesn't exist — create it
        log.Info("Creating StatefulSet", "name", database.Name)
        newSS := r.buildStatefulSet(&database)
        if err := r.Create(ctx, newSS); err != nil {
            return ctrl.Result{}, err
        }
        return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
    } else if err != nil {
        // Any other Get error (API server hiccup, RBAC): return it and retry,
        // rather than falling through and dereferencing an empty StatefulSet
        return ctrl.Result{}, err
    }

    // StatefulSet exists — check if it matches desired state
    if *statefulSet.Spec.Replicas != int32(database.Spec.Replicas) {
        log.Info("Scaling StatefulSet",
            "from", *statefulSet.Spec.Replicas,
            "to", database.Spec.Replicas)
        statefulSet.Spec.Replicas = ptr.To(int32(database.Spec.Replicas))
        if err := r.Update(ctx, statefulSet); err != nil {
            return ctrl.Result{}, err
        }
    }

    // Ensure backup CronJob exists
    if database.Spec.BackupSchedule != "" {
        if err := r.ensureBackupJob(ctx, &database); err != nil {
            return ctrl.Result{}, err
        }
    }

    // Update status
    database.Status.Ready = r.isClusterHealthy(ctx, &database)
    database.Status.Replicas = r.countReadyReplicas(ctx, &database)
    if err := r.Status().Update(ctx, &database); err != nil {
        return ctrl.Result{}, err
    }

    // Recheck every 60 seconds
    return ctrl.Result{RequeueAfter: 60 * time.Second}, nil
}

The key insight: reconciliation is idempotent. If the StatefulSet already exists and matches the spec, the controller does nothing. If someone manually deletes a replica pod, the controller notices on the next loop and recreates it. If you change the Database resource from 3 replicas to 5, the controller picks it up and scales.

This is fundamentally different from imperative scripts that run once. The Operator continuously enforces your desired state. It's the difference between telling someone "set the thermostat to 72" and hiring someone to sit next to the thermostat forever.
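
controller-runtime ships a helper that makes this ensure-style logic hard to get wrong: controllerutil.CreateOrUpdate fetches the object, applies your mutate function, and only writes if something actually changed. A sketch of the StatefulSet step from above, rewritten with it:

// controllerutil is sigs.k8s.io/controller-runtime/pkg/controller/controllerutil.
// Desired state goes inside the closure so the helper can diff it against
// what the API server returned.
ss := &appsv1.StatefulSet{
    ObjectMeta: metav1.ObjectMeta{
        Name:      database.Name + "-postgres",
        Namespace: database.Namespace,
    },
}
op, err := controllerutil.CreateOrUpdate(ctx, r.Client, ss, func() error {
    ss.Spec.Replicas = ptr.To(int32(database.Spec.Replicas))
    // Owner reference: lets Kubernetes garbage-collect the StatefulSet
    // when the parent Database is deleted.
    return ctrl.SetControllerReference(&database, ss, r.Scheme)
})
if err != nil {
    return ctrl.Result{}, err
}
log.Info("Reconciled StatefulSet", "operation", op)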

3. The Status Subresource

Good Operators report back what's actually happening. The status subresource is where the Operator writes its observations:

status:
  ready: true
  replicas: 3
  readyReplicas: 3
  currentVersion: "16.2"
  lastBackup: "2026-01-16T02:00:00Z"
  conditions:
    - type: Available
      status: "True"
      lastTransitionTime: "2026-01-15T10:30:00Z"
      reason: AllReplicasReady
      message: "All 3 replicas are running and healthy"
    - type: BackupSuccessful
      status: "True"
      lastTransitionTime: "2026-01-16T02:05:00Z"
      reason: BackupCompleted
      message: "Backup completed in 4m32s, 2.1GB written to MinIO"

This is what makes kubectl get databases show useful information instead of just names.
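
Two kubebuilder markers on the Go type make that work: one enables the status subresource, and printcolumn markers tell kubectl which status fields to surface. A sketch for our Database type (the column choices are our own):

// +kubebuilder:object:root=true
// +kubebuilder:subresource:status
// +kubebuilder:printcolumn:name="Ready",type="boolean",JSONPath=".status.ready"
// +kubebuilder:printcolumn:name="Replicas",type="integer",JSONPath=".status.readyReplicas"
// +kubebuilder:printcolumn:name="Version",type="string",JSONPath=".status.currentVersion"

// Database is the Schema for the databases API.
type Database struct {
    metav1.TypeMeta   `json:",inline"`
    metav1.ObjectMeta `json:"metadata,omitempty"`

    Spec   DatabaseSpec   `json:"spec,omitempty"`
    Status DatabaseStatus `json:"status,omitempty"`
}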

Operators We Use Every Day (And Why They're Brilliant)

Rather than building our own Operators for everything (we tried — it's a humbling experience), we rely on battle-tested ones for production workloads:

CloudNativePG (PostgreSQL)

This is the Operator that made us believers. You give it a Cluster resource:

apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: app-db
spec:
  instances: 3
  storage:
    size: 100Gi
  backup:
    barmanObjectStore:
      destinationPath: s3://backups/postgres
      endpointURL: http://minio:9000
      s3Credentials:
        accessKeyId:
          name: minio-creds
          key: access-key
        secretAccessKey:
          name: minio-creds
          key: secret-key

And it handles: provisioning, replication, automatic failover, scheduled backups, point-in-time recovery, connection pooling, rolling upgrades, and TLS certificate management.

We had a node failure at 2:47am on a client's cluster. By 2:48am, CloudNativePG had promoted a replica, redirected connections, and started provisioning a replacement. No pages. No intervention. We found out about it at standup the next morning when someone mentioned the node replacement in the cluster events.

That's what a good Operator does.

Strimzi (Kafka)

Kafka is notoriously painful to operate. Strimzi wraps the entire lifecycle — topic management, user authentication, rack awareness, rolling upgrades — into CRDs:

apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: event-bus
spec:
  kafka:
    version: 3.7.0
    replicas: 3
    listeners:
      - name: plain
        port: 9092
        type: internal
    storage:
      type: persistent-claim
      size: 50Gi
  zookeeper:
    replicas: 3
    storage:
      type: persistent-claim
      size: 10Gi

We migrated a client from Amazon MSK to Strimzi on bare-metal Kubernetes. The Strimzi Operator handled the ZooKeeper coordination, broker rebalancing, and topic replication — all the stuff that used to require a dedicated Kafka team. The client's Kafka admin said, "I feel like I just automated myself." (He didn't. He just got to work on more interesting problems.)

cert-manager (TLS)

The unsung hero. cert-manager watches Certificate resources and automatically obtains, renews, and rotates TLS certificates:

apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: api-tls
spec:
  secretName: api-tls-secret
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
  dnsNames:
    - api.example.com
    - "*.api.example.com"

Before cert-manager, we had a cron job that ran certbot. It broke silently on a Tuesday, and our certificates expired on a Friday afternoon. At 4:58pm. Nothing teaches you the value of Operators like explaining to a client why their entire API is returning TLS errors right before the weekend.

When to Use an Existing Operator vs. Build Your Own

We get asked this a lot. Here's our decision tree:

Use an existing Operator when:

  • The problem is well-known (databases, message queues, certificates)
  • An active community maintains it (check GitHub stars, release frequency, issue response time)
  • Your requirements are standard (you're not doing anything exotic)

Build your own when:

  • You have domain-specific operational knowledge that no existing Operator encodes
  • You need to orchestrate multiple resources in a custom workflow
  • Your application has unique scaling, backup, or recovery requirements
  • You want to give your users a declarative API for your platform

Don't build your own when:

  • A Helm chart would suffice (not everything needs a controller)
  • You're just wrapping kubectl apply in Go (that's a script, not an Operator)
  • Your team doesn't have Go experience (the learning curve is real)

Building a Simple Operator: The Real Experience

Despite our advice to use existing Operators, we've built a few custom ones for clients with unique requirements. Here's what the experience is actually like.

The Setup (Kubebuilder)

Kubebuilder is the standard scaffolding tool. It generates the boilerplate so you can focus on the reconciliation logic:

# Initialize a new operator project
kubebuilder init --domain oikonex.com --repo github.com/oikonex/app-operator

# Create a new API (CRD + Controller)
kubebuilder create api --group apps --version v1 --kind AppInstance

This generates ~40 files. Don't panic. You only need to care about two:

  • api/v1/appinstance_types.go — your CRD schema
  • internal/controller/appinstance_controller.go — your reconciliation logic
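
The scaffolded main.go already wires your reconciler into a manager, which owns the shared cache, client, and leader election, so you rarely touch it. But it's worth knowing what it does. Trimmed to its essentials, it looks roughly like this:

// Scheme setup (registering core types plus your API group) is omitted;
// the scaffold generates it in an init() block.
func main() {
    mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{Scheme: scheme})
    if err != nil {
        os.Exit(1)
    }

    // SetupWithManager declares what the reconciler watches.
    if err := (&controller.AppInstanceReconciler{
        Client: mgr.GetClient(),
        Scheme: mgr.GetScheme(),
    }).SetupWithManager(mgr); err != nil {
        os.Exit(1)
    }

    // Blocks until the process receives SIGTERM or SIGINT.
    if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
        os.Exit(1)
    }
}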

The Types (5 Minutes of Joy)

Defining your CRD types in Go is the fun part:

type AppInstanceSpec struct {
    // Version of the application to deploy
    Version string `json:"version"`

    // Number of replicas
    Replicas int32 `json:"replicas,omitempty"`

    // Environment-specific configuration
    Environment string `json:"environment"` // "staging" or "production"

    // Customer-specific settings
    CustomerID string `json:"customerID"`

    // Feature flags
    Features FeatureFlags `json:"features,omitempty"`
}

type FeatureFlags struct {
    AdvancedAnalytics bool `json:"advancedAnalytics,omitempty"`
    SSO               bool `json:"sso,omitempty"`
    AuditLogging      bool `json:"auditLogging,omitempty"`
}

type AppInstanceStatus struct {
    Ready          bool   `json:"ready"`
    CurrentVersion string `json:"currentVersion,omitempty"`
    URL            string `json:"url,omitempty"`
    Conditions     []metav1.Condition `json:"conditions,omitempty"`
}
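
Before you run make manifests to generate the CRD, it's worth sprinkling in kubebuilder validation markers. They compile straight into the OpenAPI schema, so the API server rejects bad specs before your reconciler ever sees them. A sketch against the fields above (the bounds, default, and version pattern are our own choices, not anything kubebuilder imposes):

type AppInstanceSpec struct {
    // Version must look like a semver tag, e.g. "v1.4.2".
    // +kubebuilder:validation:Pattern=`^v\d+\.\d+\.\d+$`
    Version string `json:"version"`

    // The API server fills in the default when replicas is omitted.
    // +kubebuilder:validation:Minimum=1
    // +kubebuilder:validation:Maximum=10
    // +kubebuilder:default=2
    Replicas int32 `json:"replicas,omitempty"`

    // Enum turns the "staging or production" comment into a hard rule.
    // +kubebuilder:validation:Enum=staging;production
    Environment string `json:"environment"`
}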

The Controller (5 Days of Pain)

Then you write the reconciler. And you discover that the gap between "I understand the concept" and "this works in production" is roughly the width of the Grand Canyon.

Things that bit us:

1. Owner references matter. If your Operator creates a Deployment, you need to set the owner reference. Otherwise, deleting the custom resource leaves orphaned Deployments floating around like ghost ships:

// ALWAYS set owner references on child resources
if err := ctrl.SetControllerReference(appInstance, deployment, r.Scheme); err != nil {
    return ctrl.Result{}, err
}

2. Status updates are separate from spec updates. You can't update .Status and .Spec in the same API call. We lost an afternoon to this:

// This updates the spec
if err := r.Update(ctx, appInstance); err != nil { ... }

// This updates the status — separate call!
if err := r.Status().Update(ctx, appInstance); err != nil { ... }

3. Requeue timing is an art. Too fast and you hammer the API server. Too slow and your users think the Operator is broken:

// Fast requeue for things that should resolve quickly
return ctrl.Result{RequeueAfter: 5 * time.Second}, nil

// Slow requeue for steady-state monitoring
return ctrl.Result{RequeueAfter: 60 * time.Second}, nil

// Immediate requeue for cascading changes
return ctrl.Result{Requeue: true}, nil
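
And there's a fourth option the snippet above doesn't show: returning a non-nil error. controller-runtime requeues those automatically with exponential backoff, which is usually the right behavior for transient failures:

// Errors requeue via the controller's rate limiter (exponential backoff
// by default), so transient failures retry without hammering the API server.
return ctrl.Result{}, fmt.Errorf("dependency not ready: %w", err)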

4. Idempotency is non-negotiable. Your reconcile function will be called multiple times for the same event. If it's not idempotent, you'll create duplicate resources, double-scale your deployments, or — in one memorable incident — send 847 duplicate webhook notifications to a client's Slack channel.
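
The defensive pattern that would have prevented the Slack flood: record one-shot side effects in a status condition and check it before firing. A sketch, where sendSlackNotification and the ProvisionNotified condition type are hypothetical stand-ins for whatever external call you're guarding:

// meta is k8s.io/apimachinery/pkg/api/meta. The condition persists in
// .status, so repeated reconciles of the same event skip the side effect.
if !meta.IsStatusConditionTrue(appInstance.Status.Conditions, "ProvisionNotified") {
    if err := sendSlackNotification(ctx, appInstance); err != nil {
        return ctrl.Result{}, err
    }
    meta.SetStatusCondition(&appInstance.Status.Conditions, metav1.Condition{
        Type:   "ProvisionNotified",
        Status: metav1.ConditionTrue,
        Reason: "NotificationSent",
    })
    if err := r.Status().Update(ctx, appInstance); err != nil {
        return ctrl.Result{}, err
    }
}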

The Operator Maturity Model

The Operator SDK defines five capability levels. Think of them like video game difficulty settings:

Level                  Capability                                     Example
1 - Basic Install      Automated provisioning                         Helm chart wrapper
2 - Seamless Upgrades  Rolling updates, version management            Upgrade from v1 to v2 without downtime
3 - Full Lifecycle     Backup, restore, failure recovery              CloudNativePG, Strimzi
4 - Deep Insights      Metrics, alerts, log aggregation               Operator exports Prometheus metrics
5 - Auto Pilot         Auto-scaling, auto-tuning, anomaly detection   Operator adjusts replicas based on load

Most production Operators we deploy are Level 3. Level 5 Operators exist, but they're rare — and honestly, we're still a little suspicious of anything that auto-tunes a database without human approval.

Why This Matters for Portable Infrastructure

Here's the thing that connects Operators to everything else we do at Oikonex: Operators make your infrastructure portable.

When your PostgreSQL is managed by CloudNativePG, it runs the same way on EKS, GKE, AKS, bare metal, or a Raspberry Pi cluster in someone's closet (we've done this — long story). The Operator abstracts away the underlying infrastructure differences.

This is fundamentally different from managed services. RDS only works on AWS. Cloud SQL only works on GCP. But a CloudNativePG Cluster resource works on any Kubernetes cluster, anywhere.

The same applies to every Operator-managed service:

  • Strimzi Kafka runs anywhere Kubernetes runs (unlike Amazon MSK)
  • cert-manager handles certificates on any cluster (unlike AWS Certificate Manager)
  • NATS Operator provides messaging anywhere (unlike Amazon SQS)

When we help clients achieve multi-cloud portability, Operators are the foundation. They're how you get the operational benefits of managed services without the lock-in of managed services.

Getting Started

If you're new to Operators, here's the path we recommend:

  1. Use existing Operators first. Install CloudNativePG or cert-manager. Get comfortable with the CRD pattern. See how declarative management feels.

  2. Read the source. CloudNativePG is open source and well-written Go. Reading real Operator code teaches you more than any tutorial.

  3. Build a toy Operator. Use Kubebuilder. Make something simple — maybe an Operator that watches a Website CRD and creates Deployments + Services + Ingress resources. Ship it to a dev cluster. (There's a sketch of exactly this after the list.)

  4. Contribute to an existing Operator. Fix a bug or add a feature to an Operator you use. You'll learn the patterns from experienced maintainers.

  5. Build a real one only when you need to. Custom Operators are powerful but expensive to maintain. Make sure the operational knowledge you're encoding is actually unique to your domain.
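
To make step 3 concrete, here's the skeleton of that toy Website reconciler. Every name here (webv1.Website, site.Spec.Image) is hypothetical, and it's a sketch of the shape rather than a finished controller:

func (r *WebsiteReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    // Fetch the Website resource; bail quietly if it was already deleted.
    var site webv1.Website
    if err := r.Get(ctx, req.NamespacedName, &site); err != nil {
        return ctrl.Result{}, client.IgnoreNotFound(err)
    }

    // Ensure the Deployment. Only name/namespace go in the stub; desired
    // state is set inside the closure so CreateOrUpdate can diff it.
    labels := map[string]string{"app": site.Name}
    deploy := &appsv1.Deployment{
        ObjectMeta: metav1.ObjectMeta{Name: site.Name, Namespace: site.Namespace},
    }
    if _, err := controllerutil.CreateOrUpdate(ctx, r.Client, deploy, func() error {
        deploy.Spec.Replicas = ptr.To(int32(1))
        deploy.Spec.Selector = &metav1.LabelSelector{MatchLabels: labels}
        deploy.Spec.Template.Labels = labels
        deploy.Spec.Template.Spec.Containers = []corev1.Container{{
            Name: "web", Image: site.Spec.Image,
        }}
        return ctrl.SetControllerReference(&site, deploy, r.Scheme)
    }); err != nil {
        return ctrl.Result{}, err
    }

    // The Service and Ingress follow the exact same ensure pattern.
    return ctrl.Result{RequeueAfter: time.Minute}, nil
}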

The Kubernetes community has done remarkable work building Operators for common infrastructure. Our job — and yours — is to assemble the right ones into a portable platform that runs anywhere your business needs it to.

And if a junior engineer asks "what's an Operator?" — now you've got an answer that doesn't involve the phrase "it's like a robot SRE."

(Though honestly, that's not a terrible analogy. It's just not a complete one.)

