
The Great Cloud Jailbreak: A Data-Driven Guide to Escaping Vendor Lock-In

91% of orgs waste money on cloud. $100 billion in market value is at risk. Here's the escape playbook with real configs, real data, and the open-source tools that actually work.

Oikonex Team · Feb 10, 2026 · 15 min read

$100 Billion. That's the Number.

Not hypothetical. Not projected. Not "up to." Andreessen Horowitz analyzed the 50 highest-grossing public software companies and found that cloud infrastructure costs are destroying roughly $100 billion in market value. Their analysis showed that repatriating workloads — moving them off the hyperscalers and onto owned or open infrastructure — could cut costs to between one-third and one-half of what those companies currently pay.

A hundred billion dollars. That's not a rounding error. That's the GDP of a mid-sized country sitting in the delta between what companies pay for cloud and what they'd pay running equivalent workloads themselves. Most of those companies got there the same way: "Don't worry about infrastructure. Just focus on your product. We'll handle the rest."

The "rest" turned out to be an expanding set of managed services so deeply woven into your architecture that switching providers becomes a multi-quarter engineering project. Not because anyone did anything nefarious — that's just how proprietary services work.

This is the escape playbook. Real data. Real configs. Real open-source tools. No fairy tales.


The Trap: How Lock-In Actually Happens

Nobody gets locked in on day one. Lock-in is a gradient, not a cliff. It starts innocently — you pick RDS because managed PostgreSQL is convenient. Then you add SQS because decoupling services is smart. Then S3 for file storage, because of course. Then Secrets Manager because hardcoding passwords is bad. Then Lambda because your team read a blog post about serverless.

Each decision is individually reasonable. Collectively, they add up to a migration that nobody wants to touch.

Here's the progression for a typical AWS-native stack:

Year 1: EC2, RDS, S3, Route 53. Reasonable. Portable-ish.

Year 2: Add SQS, SNS, Secrets Manager, ElastiCache. Your SDK imports now look like import boto3 on every other line.

Year 3: Lambda functions, API Gateway, DynamoDB, Step Functions, CloudWatch everything. Your architecture diagrams are indistinguishable from the AWS service catalog.

Year 4: The renewal email arrives. You run the numbers on migrating. The numbers are daunting. You sign the renewal — not because the price is great, but because the switching cost is higher.

This isn't a technology problem. It's a procurement problem that emerges naturally when every convenience comes with a proprietary API.


The Math: What the Data Actually Says

Let's stop talking in abstractions and look at what the numbers say. Every stat here is sourced from public reports — no proprietary data, no cherry-picking.

The Industry-Wide Picture

The Flexera 2025 State of the Cloud Report surveyed 759 respondents and found numbers that should make every CFO lose sleep:

  • 27% of cloud spend is wasted — not "could be optimized," wasted. Gone. Poof.
  • 84% of organizations struggle to manage cloud spend effectively
  • Organizations exceed their cloud budgets by 17% on average
  • 89% have adopted multi-cloud — but most are doing it poorly, running separate silos rather than portable workloads

The HashiCorp/Forrester 2024 State of the Cloud report paints an even bleaker picture: 91% of organizations report wasted cloud spending, and 78% are using or planning to adopt multi-cloud strategies. The gap between "planning to" and "actually doing it well" is where all the money goes to die.

Let's do the arithmetic. If global cloud spending is north of $600 billion annually (Gartner's number), and 27% of that is waste (Flexera's number), that's $162 billion per year lighting itself on fire. For context, that's roughly the combined annual revenue of the three largest US airlines.

The Companies That Actually Left

Theory is nice. Receipts are better. Here are three companies that actually did the math, then did the migration.

37signals (Basecamp/HEY)

David Heinemeier Hansson and the 37signals team documented their cloud exit in exhaustive detail. The numbers, per their public accounting:

  • Annual AWS bill before: $3.2 million
  • Annual infrastructure cost after: $1.3 million
  • Annual savings: ~$2 million/year
  • Projected 5-year savings: $10 million
  • Payback period on hardware investment: under 6 months

They bought their own servers. Racked them in a colo. Ran their own Kubernetes. And their operations team didn't grow — the same people who were clicking buttons in the AWS console are now running their own iron. DHH's take: "We spent years massively overpaying for cloud."

Dropbox

Dropbox moved the majority of their storage workloads off AWS and onto custom infrastructure they called "Magic Pocket." The result, reported by GeekWire in 2018: $74.6 million saved over two years. Not a typo. Seventy-four point six million dollars.

Dropbox's scale is unusual, but the economics aren't. Their cost-per-gigabyte on owned infrastructure was a fraction of S3 pricing. The same math applies at smaller scales — the crossover point is just lower than most people think.

GEICO

This one's less well-known but arguably more instructive. The Stack reported that GEICO's cloud costs increased 2.5x over a decade, reaching a cloud budget north of $300 million. After initiating a repatriation effort, they achieved a 50% cost reduction per compute core. Half. For an insurance company running workloads that are not exactly cutting-edge compute — they're running batch processing and CRUD apps, not training LLMs.

The Pattern

These aren't outliers. The a16z analysis looked at 50 companies and found the same pattern everywhere: cloud costs scale linearly (or worse) with revenue, and at scale, the unit economics of renting someone else's computers stop making sense. Their headline finding: repatriation delivers infrastructure costs at one-third to one-half of public cloud pricing (a16z, 2021).

You don't need to be Dropbox-sized for this math to work. You need to be spending enough that the delta between "managed" and "self-hosted" exceeds the engineering cost of running it yourself. For most companies, that threshold is somewhere around $50,000-$100,000/month in cloud spend.


The Swap: Every Service Has an Open-Source Replacement

Here's where we get concrete. Below are the three most common AWS services that create lock-in, paired with their open-source replacements and production-ready configs. These aren't toy examples — they're the patterns that survive real traffic.

RDS -> CloudNativePG

CloudNativePG (5,000+ GitHub stars) is the leading PostgreSQL operator for Kubernetes. It handles what RDS handles — automated failover, continuous backups, point-in-time recovery — but runs on any Kubernetes cluster, anywhere.

apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: app-primary
  namespace: database
spec:
  instances: 3  # Primary + 2 streaming replicas; automatic failover.
                # Replication is asynchronous by default; set minSyncReplicas /
                # maxSyncReplicas if you need synchronous guarantees.

  imageName: ghcr.io/cloudnative-pg/postgresql:16.4-1
  # Pin your images. An unpinned PostgreSQL image is a surprise
  # waiting to happen during a rollout.

  postgresql:
    parameters:
      max_connections: "200"
      shared_buffers: "512MB"
      effective_cache_size: "1536MB"
      work_mem: "8MB"
      maintenance_work_mem: "256MB"
      wal_buffers: "16MB"
      max_wal_size: "4GB"
      checkpoint_completion_target: "0.9"
      random_page_cost: "1.1"
      effective_io_concurrency: "200"
      max_worker_processes: "4"
      max_parallel_workers_per_gather: "2"
      max_parallel_workers: "4"

  storage:
    size: 100Gi
    storageClass: fast-ssd  # gp3 on AWS, pd-ssd on GCP, managed-premium on Azure

  # Continuous backup to MinIO (not S3 — that's the whole point)
  backup:
    barmanObjectStore:
      destinationPath: s3://pg-backups/app-primary
      endpointURL: http://minio.storage.svc.cluster.local:9000
      s3Credentials:
        accessKeyId:
          name: minio-credentials
          key: access-key
        secretAccessKey:
          name: minio-credentials
          key: secret-key
      wal:
        compression: gzip
        maxParallel: 2
      data:
        compression: gzip
    retentionPolicy: "30d"

  resources:
    requests:
      memory: "2Gi"
      cpu: "1"
    limits:
      memory: "4Gi"
      cpu: "4"

  monitoring:
    enablePodMonitor: true

  enablePDB: true  # CloudNativePG's PodDisruptionBudget toggle (on by default)
---
apiVersion: postgresql.cnpg.io/v1
kind: ScheduledBackup
metadata:
  name: app-primary-daily
  namespace: database
spec:
  schedule: "0 0 2 * * *"  # six-field cron (seconds first): daily at 02:00
  backupOwnerReference: self
  cluster:
    name: app-primary

What you get: automated failover (tested it — primary dies, replica promotes in under 10 seconds), continuous WAL archiving, point-in-time recovery, Prometheus metrics, and a connection string that looks like app-primary-rw.database.svc.cluster.local:5432. Standard PostgreSQL. No IAM auth tokens. No RDS Proxy. No vendor-specific anything.
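For the skeptics, here's what connecting looks like from application code: a minimal psycopg sketch. It assumes CloudNativePG's default bootstrap (an "app" database plus the app-primary-app credentials Secret it generates), surfaced here as environment variables; adjust to your setup.

import os
import psycopg  # psycopg 3: a plain PostgreSQL driver, no AWS SDK anywhere

# Credentials come from the app-primary-app Secret that CNPG creates,
# mounted into the pod as PGUSER / PGPASSWORD.
conn = psycopg.connect(
    host="app-primary-rw.database.svc.cluster.local",  # -rw always targets the primary
    port=5432,
    dbname="app",
    user=os.environ["PGUSER"],
    password=os.environ["PGPASSWORD"],
)
with conn.cursor() as cur:
    cur.execute("SELECT version()")
    print(cur.fetchone())

CloudNativePG also creates an app-primary-ro service that load-balances across the replicas, which gives you a read-scaling path for free once you're off RDS.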

S3 -> MinIO

MinIO has 60,000+ GitHub stars and over a billion Docker pulls. It implements the S3 API so faithfully that most applications literally cannot tell the difference. Your existing boto3, aws-sdk, or minio-go code works without changes — you just point it at a different endpoint.

apiVersion: v1
kind: Secret
metadata:
  name: minio-credentials
  namespace: storage
type: Opaque
stringData:
  root-user: minio-admin
  root-password: change-me-use-external-secrets-operator
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: minio
  namespace: storage
spec:
  serviceName: minio
  replicas: 4  # Minimum for erasure coding. Lose any 1 node, keep all data.
  selector:
    matchLabels:
      app: minio
  template:
    metadata:
      labels:
        app: minio
    spec:
      containers:
      - name: minio
        image: minio/minio:RELEASE.2025-02-01T00-00-00Z
        args:
        - server
        - http://minio-{0...3}.minio.storage.svc.cluster.local/data
        - --console-address
        - ":9001"
        env:
        - name: MINIO_ROOT_USER
          valueFrom:
            secretKeyRef:
              name: minio-credentials
              key: root-user
        - name: MINIO_ROOT_PASSWORD
          valueFrom:
            secretKeyRef:
              name: minio-credentials
              key: root-password
        ports:
        - containerPort: 9000
          name: api
        - containerPort: 9001
          name: console
        volumeMounts:
        - name: data
          mountPath: /data
        resources:
          requests:
            memory: "1Gi"
            cpu: "500m"
          limits:
            memory: "2Gi"
            cpu: "2"
        livenessProbe:
          httpGet:
            path: /minio/health/live
            port: 9000
          initialDelaySeconds: 60
          periodSeconds: 20
        readinessProbe:
          httpGet:
            path: /minio/health/ready
            port: 9000
          initialDelaySeconds: 30
          periodSeconds: 20
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: fast-ssd
      resources:
        requests:
          storage: 500Gi
---
apiVersion: v1
kind: Service
metadata:
  name: minio
  namespace: storage
spec:
  clusterIP: None  # Headless: required so the minio-{0...3} pod DNS names
                   # used above resolve. Clients still reach the API at
                   # minio.storage.svc.cluster.local:9000.
  selector:
    app: minio
  ports:
  - port: 9000
    targetPort: api
    name: api
  - port: 9001
    targetPort: console
    name: console

The migration? Change one environment variable:

# Before
S3_ENDPOINT=https://s3.us-east-1.amazonaws.com

# After
S3_ENDPOINT=http://minio.storage.svc.cluster.local:9000

Your application code, your SDKs, your presigned URLs — they all keep working. MinIO speaks the same language.
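In boto3 terms, the endpoint swap is a single argument. A minimal sketch; the bucket name and credential variables here are illustrative:

import os
import boto3

# The same client code talks to S3 or MinIO; only S3_ENDPOINT changes.
s3 = boto3.client(
    "s3",
    endpoint_url=os.environ["S3_ENDPOINT"],
    aws_access_key_id=os.environ["S3_ACCESS_KEY"],
    aws_secret_access_key=os.environ["S3_SECRET_KEY"],
)

s3.put_object(Bucket="app-assets", Key="report.pdf", Body=b"%PDF-1.7 ...")

# Presigned URLs work identically against MinIO.
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "app-assets", "Key": "report.pdf"},
    ExpiresIn=3600,
)
print(url)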

SQS -> NATS JetStream

NATS is a CNCF Incubating project that handles messaging, streaming, and key-value storage. JetStream is its persistence layer — it gives you the durability guarantees of SQS with sub-millisecond latency compared to SQS's typical 15-25ms.

apiVersion: v1
kind: ConfigMap
metadata:
  name: nats-config
  namespace: messaging
data:
  nats.conf: |
    port: 4222
    http_port: 8222

    jetstream {
      store_dir: /data
      max_memory_store: 1GB
      max_file_store: 20GB
    }

    cluster {
      name: nats-cluster
      port: 6222
      routes: [
        nats://nats-0.nats.messaging.svc.cluster.local:6222
        nats://nats-1.nats.messaging.svc.cluster.local:6222
        nats://nats-2.nats.messaging.svc.cluster.local:6222
      ]
    }
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: nats
  namespace: messaging
spec:
  serviceName: nats
  replicas: 3
  selector:
    matchLabels:
      app: nats
  template:
    metadata:
      labels:
        app: nats
    spec:
      containers:
      - name: nats
        image: nats:2.10-alpine
        args:
        - --config
        - /etc/nats/nats.conf
        - --name
        - $(POD_NAME)
        env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        ports:
        - containerPort: 4222
          name: client
        - containerPort: 6222
          name: cluster
        - containerPort: 8222
          name: monitor
        volumeMounts:
        - name: config
          mountPath: /etc/nats
        - name: data
          mountPath: /data
        resources:
          requests:
            memory: "512Mi"
            cpu: "250m"
          limits:
            memory: "1Gi"
            cpu: "1"
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8222
          initialDelaySeconds: 10
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /healthz?js-enabled-only=true
            port: 8222
          initialDelaySeconds: 10
          periodSeconds: 10
      volumes:
      - name: config
        configMap:
          name: nats-config
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: fast-ssd
      resources:
        requests:
          storage: 20Gi

The SQS-to-NATS translation: queues become streams, visibility timeout becomes ack_wait, dead letter queues become max_deliver with a dead-letter stream. The programming model is cleaner — NATS pushes messages to you instead of you long-polling SQS over HTTP. Your message processing loop goes from "poll, receive, process, delete, handle errors, poll again" to "subscribe, process, ack."
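Here's that loop with the nats-py client. A sketch: the stream, subject, and durable names are illustrative, and the knobs map straight onto SQS concepts.

import asyncio
import nats
from nats.js.api import ConsumerConfig

async def main():
    nc = await nats.connect("nats://nats.messaging.svc.cluster.local:4222")
    js = nc.jetstream()

    # A stream is roughly an SQS queue (but replayable and fan-out capable).
    await js.add_stream(name="ORDERS", subjects=["orders.*"])

    async def handler(msg):
        process(msg.data)   # your business logic
        await msg.ack()     # the moral equivalent of deleting an SQS message

    # ack_wait ~ visibility timeout; max_deliver ~ redrive count before you
    # shunt the message to a dead-letter stream.
    await js.subscribe(
        "orders.created",
        durable="order-worker",
        cb=handler,
        manual_ack=True,
        config=ConsumerConfig(ack_wait=30, max_deliver=5),  # seconds / attempts
    )
    await asyncio.Event().wait()  # keep the worker alive

def process(data: bytes) -> None:
    print("processing", data)

asyncio.run(main())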

Before and After: The Bill

Here's what a typical mid-stage startup's AWS bill looks like for these three services versus their Kubernetes-native replacements, assuming modest scale (~500GB storage, ~10M messages/month, moderate query volume):

Service                          AWS Managed (Monthly)   K8s-Native (Monthly)   Notes
RDS (db.r6g.xlarge, Multi-AZ)    $1,400                  $0*                    CloudNativePG on existing cluster
S3 (500GB + transfer)            $45                     $0*                    MinIO on existing cluster storage
SQS (10M messages)               $4                      $0*                    NATS on existing cluster
Secrets Manager (50 secrets)     $20                     $0*                    Vault on existing cluster
ElastiCache (cache.r6g.large)    $365                    $0*                    Redis/Dragonfly on existing cluster
Data transfer / NAT Gateway      $200                    $0                     Pod-to-pod is free
CloudWatch monitoring            $150                    $0*                    Prometheus + Grafana on cluster
Total                            ~$2,184/mo              ~$0/mo*

*Not actually free — these run on your Kubernetes cluster's compute and storage. The point is that you're already paying for K8s nodes. The marginal cost of running these workloads on existing capacity is dramatically lower than paying per-service pricing. Realistic marginal infrastructure cost: $400-$800/month for the additional compute and storage, depending on your cluster's headroom.

Net savings: roughly $1,400-$1,800/month, or $17,000-$21,000/year. At higher scale the delta widens sharply — 37signals' $2M/year savings didn't come from a 500GB database.


The Playbook: How to Actually Execute This

Knowing what to swap is the easy part. Actually executing the migration without setting production on fire is the hard part. Here's the phased approach that minimizes risk.

Phase 0: Inventory and Assess (1 Week)

Before touching anything, map your cloud dependencies. Every import boto3, every AWS SDK call, every CloudFormation resource — document it; a quick scanner sketch follows the tool list below. Tools that help:

  • Crossplane (CNCF Graduated, 3,000+ contributors): If you're going to manage infrastructure as Kubernetes resources, Crossplane is the control plane. It lets you define cloud resources as K8s manifests, with providers for AWS, GCP, Azure, and beyond. Start here for new infrastructure — it forces you to think in portable abstractions from day one.
  • OpenTofu (23,000+ GitHub stars): The open-source fork of Terraform. If your infrastructure is already in Terraform, OpenTofu is a drop-in replacement that removes the HashiCorp licensing concerns. Your existing .tf files work unchanged.
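If you want a head start on the inventory itself, a rough scanner like this surfaces the obvious SDK and CloudFormation touchpoints. The pattern list is a sketch you'd extend for your own languages and IaC tools:

import pathlib
import re

# Illustrative patterns; add your own SDK imports and IaC markers.
PATTERNS = re.compile(
    r"\bimport boto3\b|\bfrom boto3\b"   # Python SDK
    r"|aws-sdk|@aws-sdk/"                # JavaScript SDKs
    r"|\bAWS::"                          # CloudFormation resource types
)
SUFFIXES = {".py", ".js", ".ts", ".yaml", ".yml", ".json", ".tf"}

for path in pathlib.Path(".").rglob("*"):
    if path.suffix not in SUFFIXES or not path.is_file():
        continue
    for lineno, line in enumerate(
        path.read_text(errors="ignore").splitlines(), start=1
    ):
        if PATTERNS.search(line):
            print(f"{path}:{lineno}: {line.strip()}")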

Phase 1: Stop the Bleeding (Week 1-2)

New rule, effective immediately: no new direct dependencies on proprietary managed services. New service needs a database? CloudNativePG. New service needs a queue? NATS. New service needs object storage? Code against the S3 API with an endpoint variable that can point anywhere.

You're not migrating anything yet. You're just stopping the lock-in from getting worse.

Phase 2: Stand Up the Parallel Stack (Month 1)

Deploy the open-source alternatives alongside your existing AWS services. Run them in staging. Get your team comfortable with the operational patterns:

  • CloudNativePG for PostgreSQL (replaces RDS)
  • MinIO for object storage (replaces S3)
  • NATS JetStream for messaging (replaces SQS/SNS)
  • HashiCorp Vault for secrets (replaces Secrets Manager) — wiring sketch below

Don't cut over anything. Don't migrate data. Just prove the stack works and build operational muscle memory.
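Here's what the Vault wiring looks like with the External Secrets Operator you'll lean on in Phase 3. A sketch: the Vault path, auth role, and key names are illustrative.

apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
  name: vault-backend
  namespace: database
spec:
  provider:
    vault:
      server: http://vault.vault.svc.cluster.local:8200
      path: secret              # KV v2 mount
      version: v2
      auth:
        kubernetes:
          mountPath: kubernetes
          role: external-secrets  # Vault role bound to the ESO service account
---
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: app-db-credentials
  namespace: database
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: vault-backend
    kind: SecretStore
  target:
    name: app-db-credentials    # materialized as a normal K8s Secret
  data:
  - secretKey: password
    remoteRef:
      key: app/database         # path within the KV mount
      property: password

Applications keep reading plain Kubernetes Secrets; only the operator knows Vault exists — which is exactly the decoupling Phase 3 relies on.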

Phase 3: Migrate by Risk Level (Month 2-4)

Migrate in order of blast radius, lowest risk first:

  1. Object storage (MinIO): Change an endpoint variable. Run both in parallel. Validate. Cut over.
  2. Secrets (Vault + External Secrets Operator): Sync secrets from Vault into K8s Secrets. Applications read environment variables — they never know the backend changed.
  3. Messaging (NATS): Dual-publish to SQS and NATS during the transition (sketch below). Consume from NATS. Fall back to SQS if needed.
  4. Database (CloudNativePG): Last. Always last. Use pg_dump/pg_restore or logical replication. Test backup and recovery before cutover. Then test again.
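For step 3, the dual-publish shim can be a few lines. A sketch: queue URL, subject, and payload are illustrative, and in real code you'd reuse one NATS connection rather than dialing per message.

import asyncio
import os
import boto3
import nats

sqs = boto3.client("sqs")
QUEUE_URL = os.environ["SQS_QUEUE_URL"]

async def publish_order_created(payload: bytes) -> None:
    # Old path: keep existing SQS consumers working during the transition.
    sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=payload.decode())

    # New path: the same event into JetStream. Once NATS consumers are
    # validated, delete the SQS call and this shim becomes one publish.
    nc = await nats.connect("nats://nats.messaging.svc.cluster.local:4222")
    try:
        js = nc.jetstream()
        await js.publish("orders.created", payload)
    finally:
        await nc.close()

asyncio.run(publish_order_created(b'{"order_id": 42}'))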

Phase 4: Prove Portability (Month 5-6)

Deploy your application to a second environment. A different cloud, an on-prem cluster, a rack in a colo — doesn't matter where. The point is proving that your stack is decoupled from any single provider.

This is the leverage. When your cloud vendor's sales team knows you can deploy elsewhere — because you've demonstrably done it — the renewal negotiation goes from "take what we offer" to an actual negotiation. Flexera's data shows that multi-cloud-capable organizations negotiate 15-25% better discounts (Flexera, 2025).


The Business Case: Why Your CFO Should Care

Let's tie the data together into a single argument.

The problem: 91% of organizations waste money on cloud (HashiCorp/Forrester, 2024). The average organization exceeds its cloud budget by 17% (Flexera, 2025). And 27% of total cloud spend is pure waste.

The opportunity: Repatriation or hybrid strategies can cut infrastructure costs by 50-66% (a16z, 2021). This isn't theoretical — 37signals saved $2M/year, Dropbox saved $74.6M over two years, and GEICO cut per-core compute costs by 50%.

The how: You don't need to leave the cloud entirely. You need to break the dependency on proprietary services so you have options. Kubernetes-native alternatives exist for every major managed service — PostgreSQL, object storage, messaging, secrets, caching. The tools are mature, the communities are large, and the migration patterns are well-understood.

The leverage: Even if you never leave AWS, the ability to leave changes the economics. Negotiation leverage alone — better discounts, waived egress fees, removed spend commitments — can save 15-25% annually. For a company spending $1M/year on cloud, that's $150,000-$250,000 in savings from optionality alone, before you've migrated a single workload.

The cloud was supposed to be about agility. Somewhere along the way, it became about dependency. The tools to reverse that dependency are open-source, battle-tested, and sitting on GitHub waiting for a helm install.

The only question is whether you start the jailbreak now, or wait until the next renewal email makes the decision for you.


Sources

  • Andreessen Horowitz, "The Cost of Cloud, a Trillion Dollar Paradox" (2021)
  • Flexera, 2025 State of the Cloud Report
  • HashiCorp / Forrester, 2024 State of the Cloud
  • 37signals, public cloud-exit cost accounting (David Heinemeier Hansson)
  • GeekWire, reporting on Dropbox's infrastructure savings (2018)
  • The Stack, reporting on GEICO's cloud repatriation

multi-cloud · vendor-lock-in · kubernetes · migration
