Building a Portable Data Layer: Replacing RDS, S3, ElastiCache, and SQS

We ripped out four AWS managed services and replaced them with Kubernetes-native alternatives. Here's the config that survived production — and the config that didn't.

Oikonex Team | Jan 9, 2026 | 19 min read

The Great Data Layer Heist

Six weeks. That's what we told the client. "We'll have your entire data layer migrated off AWS managed services and running on Kubernetes-native alternatives in six weeks."

Weeks one through three were smooth. We were high-fiving. We were on schedule. We were heroes.

Week four was when PostgreSQL decided that our WAL archiving config had a subtle race condition that only manifested under load. Week four was when we learned that "S3-compatible" doesn't mean "S3-identical." Week four was when someone on the team said the phrase "it works on my cluster" without irony.

We still finished in six weeks. But week four aged us all by approximately five years. Here's everything we learned — the configs that survived production and the ones that caught fire at 3am.

PostgreSQL: The DBA Who Learned to Love Operators

If you've been running RDS for a while, you've gotten comfortable. Automated backups. Push-button replicas. Somebody else gets paged when the disk fills up. CloudNativePG is what happens when you take all of that comfort and say "I'll do it myself, but in YAML."

And honestly? It's kind of great.

The Cluster Manifest That Actually Survived Production

This is not a tutorial example. This is what we're running. Every parameter has a story behind it.

apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: app-database
  namespace: production
spec:
  instances: 3  # Primary + 2 replicas. We tried 5 once. The WAL shipping
                 # overhead was not worth it for our write volume.

  imageName: ghcr.io/cloudnative-pg/postgresql:16.2-1
  # Pin your image. We learned this when a minor version bump
  # changed the default value of jit_above_cost and our query
  # planner decided to JIT-compile a 4-line SELECT.

  storage:
    size: 100Gi
    storageClass: fast-ssd  # Use io2 or gp3-equivalent. We started
                             # with gp2-equivalent and the IOPS ceiling
                             # hit us during a VACUUM FULL. Never again.

  postgresql:
    parameters:
      # === Connection Settings ===
      max_connections: "200"
      # 200 is generous. If you need more than this, you need
      # PgBouncer, not more connections. PostgreSQL's process-per-
      # connection model means 500 connections = 500 processes =
      # your OOM killer's favorite afternoon snack.

      # === Memory Settings ===
      shared_buffers: "256MB"         # ~25% of pod memory. Don't go over 40%.
      effective_cache_size: "768MB"   # Tell the planner about OS cache
      work_mem: "6553kB"              # Per-operation sort memory
      maintenance_work_mem: "128MB"   # VACUUM and CREATE INDEX get more

      # === WAL Settings ===
      wal_buffers: "16MB"
      min_wal_size: "1GB"
      max_wal_size: "4GB"
      checkpoint_completion_target: "0.9"  # Spread checkpoints. Your disks
                                            # will thank you.

      # === Planner Settings ===
      random_page_cost: "1.1"        # SSDs make random reads almost as
                                      # fast as sequential. Tell Postgres.
      effective_io_concurrency: "200" # SSDs can handle concurrent I/O.
      default_statistics_target: "100"

      # === Parallelism ===
      max_worker_processes: "4"
      max_parallel_workers_per_gather: "2"
      max_parallel_workers: "4"

  # === The Backup Config That Saved Us at 3am ===
  backup:
    barmanObjectStore:
      destinationPath: s3://backups/postgres
      endpointURL: http://minio.storage:9000
      s3Credentials:
        accessKeyId:
          name: minio-credentials
          key: access-key
        secretAccessKey:
          name: minio-credentials
          key: secret-key
      wal:
        compression: gzip
        maxParallel: 2  # We bumped this to 2 after WAL archiving
                        # fell behind during a bulk import. The default
                        # of 1 is fine until it isn't.
      data:
        compression: gzip
    retentionPolicy: "30d"
    # 30 days. Not 7. We once had a customer discover data corruption
    # that happened 10 days ago. 7-day retention would have meant
    # "have you considered re-entering that data by hand?"

  enablePDB: true

  resources:
    requests:
      memory: "1Gi"
      cpu: "500m"
    limits:
      memory: "2Gi"
      cpu: "2"
      # Don't set CPU limits too tight. PostgreSQL's VACUUM processes
      # need burst capacity. A throttled VACUUM is a slow VACUUM,
      # and a slow VACUUM is a table that's 80% dead tuples.
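
The memory numbers in that manifest follow simple ratios of the pod's memory request. Here is a minimal sketch of the arithmetic, assuming the 25% / 75% split the manifest implies for a 1Gi request. The function name and the worst-case sort check are ours for illustration, not anything CloudNativePG provides.

```python
def pg_memory_settings(pod_memory_mb, max_connections, work_mem_kb):
    """Derive memory parameters from the pod's memory request (in MB)."""
    shared_buffers_mb = pod_memory_mb // 4        # ~25% of pod memory
    effective_cache_mb = pod_memory_mb * 3 // 4   # planner hint, not an allocation
    # Worst case if every connection runs one work_mem-sized sort at once.
    worst_case_sort_mb = max_connections * work_mem_kb // 1024
    return {
        "shared_buffers": f"{shared_buffers_mb}MB",
        "effective_cache_size": f"{effective_cache_mb}MB",
        "worst_case_sort_mb": worst_case_sort_mb,
    }


settings = pg_memory_settings(pod_memory_mb=1024, max_connections=200, work_mem_kb=6553)
print(settings)
```

With the manifest's values, the worst case comes out larger than the 1Gi request, which is one more reason to reach for a pooler instead of raising max_connections.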

Scheduled Backups (Because "I'll Do It Manually" Is Not a Strategy)

apiVersion: postgresql.cnpg.io/v1
kind: ScheduledBackup
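metadata:
  name: app-database-daily
  namespace: production
spec:
  # CloudNativePG uses a six-field cron expression; the first field is seconds.
  schedule: "0 0 2 * * *"  # 2 AM daily. Not noon. Not "whenever Jenkins feels like it."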
  backupOwnerReference: self
  cluster:
    name: app-database

The 3am Incident: Point-in-Time Recovery

It was 3:17am on a Wednesday. A deployment script ran UPDATE users SET role = 'admin' without a WHERE clause. Everyone was an admin. Everyone.

This is what we applied at 3:22am (after the initial five minutes of staring at the screen in disbelief):

apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: app-database-restored
  namespace: production
spec:
  instances: 3

  storage:
    size: 100Gi
    storageClass: fast-ssd

  bootstrap:
    recovery:
      source: app-database
      recoveryTarget:
        # Restore to 3:16am — one minute before the deployment from hell.
        targetTime: "2025-11-12 03:16:00.000000+00"

  externalClusters:
    - name: app-database
      barmanObjectStore:
        destinationPath: s3://backups/postgres
        endpointURL: http://minio.storage:9000
        s3Credentials:
          accessKeyId:
            name: minio-credentials
            key: access-key
          secretAccessKey:
            name: minio-credentials
            key: secret-key

Thirteen minutes from incident to recovered database. Try doing that with an RDS snapshot restore (spoiler: it takes 20-45 minutes depending on how big your database is and how much AWS likes you that day).
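
At 3am nobody wants to hand-type a timestamp. Below is a small sketch for building the targetTime string from the incident time, assuming the incident timestamp is known in UTC and a one-minute safety margin is enough. The helper name is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def recovery_target(incident_utc, margin=timedelta(minutes=1)):
    """Return a targetTime string just before the incident, in UTC."""
    target = incident_utc.astimezone(timezone.utc) - margin
    return target.strftime("%Y-%m-%d %H:%M:%S.%f") + "+00"


incident = datetime(2025, 11, 12, 3, 17, tzinfo=timezone.utc)
print(recovery_target(incident))
```

The recovery can only reach a point covered by archived WAL, so the margin should stay small and the target should come after the most recent base backup you intend to restore from.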

The PgBouncer Gotcha Nobody Warns You About

PgBouncer transaction mode will break your prepared statements. Ask us how we know.

Here's what happened: we turned on PgBouncer in transaction mode because session mode doesn't pool effectively. Everything seemed fine. Then, two hours later, a service that used Prisma started throwing prepared statement "s0" already exists errors. Why? Because in transaction mode, PgBouncer hands the backend connection to another client after each transaction, but prepared statements live on the backend connection. Client 1 prepares s0 on connection A. Client 2 is then handed connection A, tries to prepare its own s0, and PostgreSQL says it already exists. The mirror-image failure shows up too: your ORM prepares s0 on connection A, its next transaction lands on connection B, and PostgreSQL says "I don't know what s0 is."

The fix:

# In your connection string, tell Prisma it is behind PgBouncer so it stops
# relying on named prepared statements (other drivers have their own switch):
- DATABASE_URL=postgres://user:pass@app-database-rw:5432/app
+ DATABASE_URL=postgres://user:pass@app-database-rw:5432/app?pgbouncer=true

Or switch to session mode and accept less efficient pooling. There's no free lunch. There's barely even a free snack.
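
A toy model makes the failure easier to see. This is a simulation with made-up classes, not PgBouncer or PostgreSQL code. It assumes two backend connections and a pooler that hands out whichever one is free at the start of each transaction.

```python
class Backend:
    """One PostgreSQL server connection with its own prepared statements."""

    def __init__(self):
        self.prepared = {}

    def prepare(self, name, sql):
        if name in self.prepared:
            raise RuntimeError(f'prepared statement "{name}" already exists')
        self.prepared[name] = sql

    def execute(self, name):
        if name not in self.prepared:
            raise RuntimeError(f'prepared statement "{name}" does not exist')
        return self.prepared[name]


conn_a, conn_b = Backend(), Backend()

# Transaction 1: client 1 is handed connection A and prepares s0 there.
conn_a.prepare("s0", "SELECT 1")

errors = []
# Transaction 2: client 1 is handed connection B, where s0 was never prepared.
try:
    conn_b.execute("s0")
except RuntimeError as exc:
    errors.append(str(exc))

# Transaction 3: client 2 is handed connection A and prepares its own s0.
try:
    conn_a.prepare("s0", "SELECT 2")
except RuntimeError as exc:
    errors.append(str(exc))

print(errors)
```

Both errors come from the same root cause: statement names are scoped to the backend connection, and in transaction mode the client no longer owns one.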

Service Endpoints

# Direct to primary (for writes)
app-database-rw.production.svc.cluster.local:5432

# Read-only replicas (for reads)
app-database-ro.production.svc.cluster.local:5432

# Any instance, primary or replica (for reads that don't care which; use sparingly)
app-database-r.production.svc.cluster.local:5432

Update your connection string and move on with your life:

- DATABASE_URL=postgres://user:pass@mydb.abc123.us-east-1.rds.amazonaws.com:5432/app
+ DATABASE_URL=postgres://user:pass@app-database-rw.production.svc.cluster.local:5432/app

MinIO: S3, But Make It Portable

We literally changed one environment variable and it worked.

We're not exaggerating. The AWS SDK is designed to talk to any S3-compatible endpoint. We pointed it at MinIO instead of S3, ran the integration tests, and they passed. All of them. We spent the next three hours looking for what else we needed to change. There was nothing. We went home early.

It was the single best afternoon of the entire migration.

Production Deployment

apiVersion: v1
kind: Secret
metadata:
  name: minio-credentials
  namespace: storage
type: Opaque
stringData:
  root-user: admin
  root-password: your-secure-password-here  # Obviously not this.
                                              # Use a sealed secret or
                                              # external secrets operator.
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: minio
  namespace: storage
spec:
  serviceName: minio
  replicas: 4  # 4 is the minimum for erasure coding. More on this below.
  selector:
    matchLabels:
      app: minio
  template:
    metadata:
      labels:
        app: minio
    spec:
      containers:
      - name: minio
        image: minio/minio:RELEASE.2025-01-01T00-00-00Z
        args:
        - server
        - http://minio-{0...3}.minio.storage.svc.cluster.local/data
        - --console-address
        - ":9001"
        env:
        - name: MINIO_ROOT_USER
          valueFrom:
            secretKeyRef:
              name: minio-credentials
              key: root-user
        - name: MINIO_ROOT_PASSWORD
          valueFrom:
            secretKeyRef:
              name: minio-credentials
              key: root-password
        ports:
        - containerPort: 9000
          name: api
        - containerPort: 9001
          name: console
        volumeMounts:
        - name: data
          mountPath: /data
        resources:
          requests:
            memory: "1Gi"
            cpu: "500m"
          limits:
            memory: "2Gi"
            cpu: "2"
        livenessProbe:
          httpGet:
            path: /minio/health/live
            port: 9000
          initialDelaySeconds: 120
          periodSeconds: 20
        readinessProbe:
          httpGet:
            path: /minio/health/ready
            port: 9000
          initialDelaySeconds: 30
          periodSeconds: 20
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: fast-ssd
      resources:
        requests:
          storage: 500Gi

Erasure Coding: Explained With Pizza

Imagine you order a pizza with 8 slices. Normal storage is like needing all 8 slices to have a complete pizza (duh). Erasure coding is like a magical pizza where you only need any 4 of the 8 slices to reconstruct the entire pizza.

MinIO with 4 nodes uses erasure coding to split your data into data and parity shards. You can lose up to half your nodes and still read all your data. You can lose one node and still write new data. It's the closest thing to magic in storage engineering.

The catch? You need a minimum of 4 nodes, and your usable capacity is roughly 50% of raw capacity (the other 50% is parity). That's the price of being able to yank a server out of the rack without losing data.
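
The capacity and failure-tolerance numbers fall out of a little arithmetic. This is a rough sketch, assuming one drive per node and parity equal to half the drives (which is what gives the 50% figure). The quorum rule in the comment reflects our understanding of MinIO's behavior and is worth checking against the docs for your version.

```python
def erasure_profile(drives, drive_size_gib, parity):
    """Capacity and failure tolerance for a single erasure set."""
    data = drives - parity
    # Reads need only the data shards, so they survive losing `parity` drives.
    reads_survive = parity
    # When parity is exactly half the drives, writes need one extra drive online.
    writes_survive = parity - 1 if parity * 2 == drives else parity
    return {
        "usable_gib": data * drive_size_gib,
        "usable_pct": 100 * data // drives,
        "reads_survive": reads_survive,
        "writes_survive": writes_survive,
    }


profile = erasure_profile(drives=4, drive_size_gib=500, parity=2)
print(profile)
```

For the four 500Gi volumes in the StatefulSet, that works out to half the raw space being usable, reads surviving two lost nodes, and writes surviving one.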

The Three-Language Endpoint Swap

This is the part that felt too easy. Here's the change for every language we had in the project:

Node.js (AWS SDK v3)

import { S3Client, PutObjectCommand, GetObjectCommand } from "@aws-sdk/client-s3";

const s3Client = new S3Client({
  // The endpoint is the only value that changes (plus forcePathStyle below). That's the migration.
  endpoint: process.env.S3_ENDPOINT || "http://minio.storage.svc.cluster.local:9000",
  region: "us-east-1",      // Required by the SDK, ignored by MinIO. Just put anything.
  credentials: {
    accessKeyId: process.env.S3_ACCESS_KEY,
    secretAccessKey: process.env.S3_SECRET_KEY,
  },
  forcePathStyle: true,  // IMPORTANT: MinIO uses path-style URLs, not virtual-hosted.
                          // Without this, the SDK tries to use bucket.endpoint as the
                          // hostname and your DNS will look at you like you're speaking
                          // Klingon.
});

// Everything else is identical. Same PutObject. Same GetObject. Same presigned URLs.
const upload = await s3Client.send(new PutObjectCommand({
  Bucket: "app-uploads",
  Key: `documents/${userId}/${filename}`,
  Body: fileBuffer,
  ContentType: mimeType,
}));

Python (boto3)

import boto3
import os

s3 = boto3.client(
    's3',
    endpoint_url=os.environ.get('S3_ENDPOINT', 'http://minio.storage.svc.cluster.local:9000'),
    aws_access_key_id=os.environ['S3_ACCESS_KEY'],
    aws_secret_access_key=os.environ['S3_SECRET_KEY'],
)

# Literally the same code you already have.
s3.upload_file('/tmp/report.pdf', 'app-reports', 'monthly/2025-01.pdf')
presigned = s3.generate_presigned_url('get_object', Params={
    'Bucket': 'app-reports',
    'Key': 'monthly/2025-01.pdf',
}, ExpiresIn=3600)

Go

cfg, err := config.LoadDefaultConfig(context.TODO(),
    config.WithRegion("us-east-1"),
    config.WithCredentialsProvider(credentials.NewStaticCredentialsProvider(
        os.Getenv("S3_ACCESS_KEY"),
        os.Getenv("S3_SECRET_KEY"),
        "",
    )),
)
if err != nil {
    log.Fatalf("unable to load SDK config: %v", err)
}

client := s3.NewFromConfig(cfg, func(o *s3.Options) {
    o.BaseEndpoint = aws.String(os.Getenv("S3_ENDPOINT"))
    o.UsePathStyle = true  // Don't forget this or prepare for confusing DNS errors
})

// Same API. Same operations. Different endpoint.
_, err = client.PutObject(context.TODO(), &s3.PutObjectInput{
    Bucket: aws.String("app-uploads"),
    Key:    aws.String("documents/user-123/file.pdf"),
    Body:   file,
})

Three languages. One environment variable change each. We kept refreshing the test results waiting for something to fail. Nothing did.


Redis: ElastiCache Without the Elasti-Price

ElastiCache is lovely until you look at the bill. A cache.r6g.xlarge with a replica runs you about $730/month. A Redis pod with equivalent resources on your existing Kubernetes cluster costs... whatever your compute costs. Which you're already paying for.

The migration itself was straightforward. The lesson came later.

The Sentinel Setup That Actually Works

apiVersion: v1
kind: ConfigMap
metadata:
  name: redis-config
  namespace: cache
data:
  redis.conf: |
    maxmemory 2gb
    # READ THIS. READ IT AGAIN. (redis.conf only allows comments on their own line.)
    maxmemory-policy allkeys-lru
    appendonly yes
    appendfsync everysec
    save ""
    # We don't use RDB snapshots (hence save "") because AOF gives us better durability.
    # If you're using Redis purely as a cache, you can skip AOF too.
    # But if there's any state you care about (sessions, rate limit
    # counters), keep AOF on.
---
apiVersion: databases.spotahome.com/v1
kind: RedisFailover
metadata:
  name: redis-cache
  namespace: cache
spec:
  sentinel:
    replicas: 3  # Always odd. Sentinel uses majority voting.
                  # 2 sentinels means 1 failure = no quorum = no failover.
                  # 3 sentinels means 1 failure = still works. Math.
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
      limits:
        cpu: 200m
        memory: 256Mi
  redis:
    replicas: 3  # 1 master + 2 replicas
    resources:
      requests:
        cpu: 500m
        memory: 1Gi
      limits:
        cpu: 1
        memory: 2Gi
    storage:
      keepAfterDeletion: true  # Don't delete PVCs when the CR is deleted.
                                # We learned this the hard way during a
                                # Helm chart upgrade gone wrong.
      persistentVolumeClaim:
        metadata:
          name: redis-data
        spec:
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 10Gi
          storageClassName: fast-ssd
    customConfig:
      - "maxmemory 2gb"
      - "maxmemory-policy allkeys-lru"
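
The "always odd" rule for Sentinel is plain majority arithmetic. A tiny sketch, assuming failover needs a strict majority of sentinels to agree:

```python
def tolerated_failures(sentinels):
    """How many sentinels can fail while a majority is still reachable."""
    majority = sentinels // 2 + 1
    return sentinels - majority


tolerance = {n: tolerated_failures(n) for n in (2, 3, 4, 5)}
for n, failures in tolerance.items():
    print(f"{n} sentinels -> {failures} failure(s) tolerated")
```

Four sentinels tolerate no more failures than three, so the even count only adds pods.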

The maxmemory-policy Lesson

The default maxmemory-policy in Redis is noeviction. Read that word carefully. No. Eviction. It means when Redis fills up, it doesn't remove old keys to make room. It just... stops accepting writes. It returns errors. OOM command not allowed when used memory > 'maxmemory'.

We learned this at 2pm on a Tuesday. Not 2am — 2pm, in the middle of a product demo. The cache filled up, Redis started returning errors, and the application fell over because apparently nobody had written a fallback for "what if the cache just says no."

The fix is allkeys-lru (evict the least recently used key when memory is full). If you're using Redis as a cache — and you probably are — this is what you want. Set it explicitly. Don't trust the default.

# redis.conf
  maxmemory 2gb
- # maxmemory-policy noeviction  (the default that ruined our Tuesday)
+ maxmemory-policy allkeys-lru
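
To see the two policies side by side, here is a toy cache that is limited by key count instead of bytes. It is a simplified model with a made-up class: real Redis tracks memory and uses an approximate LRU, but the observable difference is the same.

```python
from collections import OrderedDict


class TinyCache:
    """Toy cache with a key-count limit and a Redis-like eviction policy."""

    def __init__(self, max_keys, policy):
        self.max_keys = max_keys
        self.policy = policy
        self.data = OrderedDict()

    def get(self, key):
        if key in self.data:
            self.data.move_to_end(key)  # mark as most recently used
        return self.data.get(key)

    def set(self, key, value):
        if key not in self.data and len(self.data) >= self.max_keys:
            if self.policy == "noeviction":
                raise MemoryError("OOM command not allowed when used memory > 'maxmemory'")
            self.data.popitem(last=False)  # allkeys-lru: drop the least recently used key
        self.data[key] = value
        self.data.move_to_end(key)


lru = TinyCache(max_keys=2, policy="allkeys-lru")
lru.set("a", 1)
lru.set("b", 2)
lru.get("a")     # "b" is now the least recently used key
lru.set("c", 3)  # evicts "b"
print(list(lru.data))

strict = TinyCache(max_keys=2, policy="noeviction")
strict.set("a", 1)
strict.set("b", 2)
try:
    strict.set("c", 3)
    rejected = False
except MemoryError:
    rejected = True
print(rejected)
```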

Connection Configuration

// Node.js (ioredis) — Sentinel-aware connection
const Redis = require('ioredis');

const redis = new Redis({
  sentinels: [
    { host: 'rfs-redis-cache.cache.svc.cluster.local', port: 26379 },
  ],
  name: 'mymaster',
  password: process.env.REDIS_PASSWORD,
  // Enable lazy connect so the app starts even if Redis is temporarily down
  lazyConnect: true,
  // Retry strategy: linear backoff capped at 2 seconds, plus jitter
  retryStrategy(times) {
    const delay = Math.min(times * 50, 2000);
    return delay + Math.random() * 100;  // Jitter prevents thundering herd
  },
});

// Always handle errors. Always. ALWAYS.
redis.on('error', (err) => {
  console.error('Redis connection error:', err.message);
  // Don't crash. Degrade gracefully. Serve from the database.
  // It'll be slower but it'll work.
});
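
The "degrade gracefully" part is the piece that was missing during the demo. Below is a minimal cache-aside sketch, assuming the cache and database are passed in as plain callables. The function and parameter names are hypothetical, and a real implementation would catch narrower exception types.

```python
def get_with_fallback(key, cache_get, cache_set, db_get):
    """Cache-aside read that keeps working when the cache says no."""
    try:
        cached = cache_get(key)
        if cached is not None:
            return cached
    except Exception:
        pass  # cache is down or refusing commands: fall through to the database
    value = db_get(key)
    try:
        cache_set(key, value)
    except Exception:
        pass  # a failed cache write must not fail the request
    return value


def broken_cache(*_args):
    raise ConnectionError("redis unavailable")


result = get_with_fallback("user:1", broken_cache, broken_cache, lambda k: {"id": k})
print(result)
```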

NATS: SQS is a Queue. NATS is a Lifestyle.

We used to joke that SQS was designed by someone who thought messages should take a scenic route. NATS is what happens when messages take a fighter jet.

The latency difference was so dramatic that when we first ran our benchmarks, we assumed the test was broken. SQS was giving us 15-25ms per message. NATS JetStream was giving us sub-millisecond. Our latency monitoring dashboards, which had been gently rolling hills, became a flatline. We briefly thought the service was down.

It was not down. It was just that fast.

JetStream Configuration

apiVersion: v1
kind: ConfigMap
metadata:
  name: nats-config
  namespace: messaging
data:
  nats.conf: |
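    # JetStream clustering needs a unique server_name per pod.
    # Assumes the StatefulSet exposes the pod name as the POD_NAME env var.
    server_name: $POD_NAME
    port: 4222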
    http_port: 8222

    jetstream {
      store_dir: /data
      max_memory_store: 1Gi
      max_file_store: 10Gi
      # max_file_store is your durability budget: the total disk JetStream
      # may use on this server. Per-stream limits (max_bytes, max_age,
      # discard: 'old') decide when the oldest messages get dropped.
      # Size it for your peak message volume + some headroom.
    }

    cluster {
      name: nats-cluster
      port: 6222
      routes: [
        nats://nats-0.nats.messaging.svc.cluster.local:6222
        nats://nats-1.nats.messaging.svc.cluster.local:6222
        nats://nats-2.nats.messaging.svc.cluster.local:6222
      ]
    }

Stream and Consumer Setup (The SQS Queue Equivalent)

const { connect, StringCodec, AckPolicy, DeliverPolicy } = require('nats');

async function setupMessaging() {
  const nc = await connect({
    servers: 'nats://nats.messaging.svc.cluster.local:4222',
    // reconnect automatically — NATS clients are persistent by default,
    // unlike SQS which is "fire HTTP request, pray, repeat"
  });

  const jsm = await nc.jetstreamManager();

  // Create a stream — think of this as the SQS queue itself
  await jsm.streams.add({
    name: 'ORDERS',
    subjects: ['orders.*'],        // Wildcard subjects! SQS can't do this.
                                    // orders.created, orders.shipped, orders.cancelled
                                    // all go to the same stream.
    retention: 'limits',
    max_msgs: 1_000_000,
    max_bytes: 1024 * 1024 * 1024,  // 1GB
    max_age: 7 * 24 * 60 * 60 * 1e9,  // 7 days in nanoseconds (yes, nanoseconds)
    storage: 'file',                // 'file' for durability, 'memory' for speed
    num_replicas: 3,                // Replicate across 3 NATS servers
    discard: 'old',                 // When full, discard oldest messages
    duplicate_window: 120e9,        // 2-minute dedup window. This saved us from
                                    // a producer that double-published during
                                    // network blips.
  });

  // Create a durable consumer. Several workers can share it, like workers polling one SQS queue.
  await jsm.consumers.add('ORDERS', {
    durable_name: 'order-processor',
    ack_policy: AckPolicy.Explicit,
    max_deliver: 5,            // Retry up to 5 times before giving up
    ack_wait: 30 * 1e9,        // 30-second ack timeout (like SQS visibility timeout)
    deliver_policy: DeliverPolicy.All,
    filter_subject: 'orders.*', // Process all order events (must fall within the stream's subjects)
  });

  console.log('Streams and consumers configured.');
  await nc.close();
}
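
Two details in that setup trip people up: how the subject wildcards match, and the nanosecond durations. This is an illustrative re-implementation of the matching rules as we understand them (`*` is exactly one token, `>` is one or more trailing tokens), not code from the NATS client. Both helper names are ours.

```python
def subject_matches(pattern, subject):
    """NATS-style match: '*' is one token, '>' is one or more trailing tokens."""
    p_tokens, s_tokens = pattern.split("."), subject.split(".")
    for i, p in enumerate(p_tokens):
        if p == ">":
            return i == len(p_tokens) - 1 and len(s_tokens) > i
        if i >= len(s_tokens) or (p != "*" and p != s_tokens[i]):
            return False
    return len(p_tokens) == len(s_tokens)


def to_nanos(seconds):
    """JetStream durations are integers in nanoseconds."""
    return int(seconds * 1_000_000_000)


print(subject_matches("orders.*", "orders.created"))
print(subject_matches("orders.*", "orders.us.created"))
print(subject_matches("orders.>", "orders.us.created"))
print(to_nanos(7 * 24 * 60 * 60))
```

A stream bound to orders.* will not capture a three-token subject such as orders.us.created. If you plan a deeper hierarchy, bind the stream to orders.> from the start.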

Producer Code

async function publishOrder(order) {
  const nc = await connect({
    servers: 'nats://nats.messaging.svc.cluster.local:4222'
  });

  const js = nc.jetstream();
  const sc = StringCodec();

  // Publish with a message ID for deduplication
  const ack = await js.publish(
    `orders.${order.type}`,
    sc.encode(JSON.stringify(order)),
    {
      msgID: `order-${order.id}`,  // Dedup key. If the producer retries,
                                    // NATS ignores the duplicate. SQS FIFO
                                    // has this too, but regular SQS? Nope.
    }
  );

  console.log(`Published order ${order.id}, stream seq: ${ack.seq}`);
  await nc.drain();
}
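
The dedup window is easier to reason about as a small model. This sketch assumes the server remembers each message ID for the length of the window and then forgets it. The class is a toy, not JetStream's implementation.

```python
class DedupWindow:
    """Toy model of a duplicate window keyed by message ID."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.seen = {}  # msg_id -> time it was first accepted

    def publish(self, msg_id, now):
        first = self.seen.get(msg_id)
        if first is not None and now - first < self.window:
            return "duplicate"
        self.seen[msg_id] = now
        return "stored"


stream = DedupWindow(window_seconds=120)
results = [
    stream.publish("order-42", now=0),    # first publish
    stream.publish("order-42", now=5),    # retry after a network blip
    stream.publish("order-42", now=200),  # same ID, outside the 2-minute window
]
print(results)
```

The last call is the catch: a retry that arrives after the window has passed is stored again, so the window should be longer than your producer's longest retry delay.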

Consumer Code

async function processOrders() {
  const nc = await connect({
    servers: 'nats://nats.messaging.svc.cluster.local:4222'
  });

  const js = nc.jetstream();
  const sc = StringCodec();

  const consumer = await js.consumers.get('ORDERS', 'order-processor');

  // consume() gives you a pull-based iterator — way cleaner than
  // SQS's "poll, process, delete" dance
  const messages = await consumer.consume();

  for await (const msg of messages) {
    const order = JSON.parse(sc.decode(msg.data));
    // Despite the name, redeliveryCount is the delivery count: 1 on first delivery.
    const attempt = msg.info.redeliveryCount;
    console.log(`Processing order ${order.id} [attempt ${attempt}]`);

    try {
      await handleOrder(order);
      msg.ack();  // Success: the consumer moves on. With 'limits' retention the
                  // message stays in the stream until max_age/max_bytes expire it.
    } catch (err) {
      console.error(`Failed to process order ${order.id}:`, err.message);

      if (attempt >= 5) {
        // Max retries reached (matches max_deliver: 5). Ack it and send to a dead letter stream.
        await publishToDeadLetter(order, err);
        msg.ack();
        console.warn(`Order ${order.id} moved to dead letter stream after 5 attempts.`);
      } else {
        msg.nak();  // Negative ack: redelivered right away. Pass a delay in ms,
                    // e.g. msg.nak(5000), to back off between attempts.
      }
    }
  }
}
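
The retry and dead-letter decision in that handler boils down to a small pure function, which is easier to unit test than the consumer loop. This is a sketch that counts deliveries from 1 and assumes the same limit of 5 as max_deliver. The names are hypothetical.

```python
MAX_DELIVER = 5  # mirrors max_deliver on the consumer


def next_action(delivery_count, succeeded, max_deliver=MAX_DELIVER):
    """Decide what to do with a message; delivery_count is 1 on first delivery."""
    if succeeded:
        return "ack"
    if delivery_count >= max_deliver:
        return "dead_letter_then_ack"
    return "nak"


actions = [next_action(n, succeeded=False) for n in range(1, 6)]
print(actions)
```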

SQS to NATS: The Rosetta Stone

If you're coming from SQS, here's your translation guide:

| SQS Concept | NATS JetStream Equivalent | Notes |
|---|---|---|
| Queue | Stream + Consumer | NATS separates storage (stream) from consumption (consumer). This is actually better. |
| Message | Message | Same concept, faster delivery. |
| Visibility Timeout | ack_wait | How long before an unacknowledged message is redelivered. |
| Dead Letter Queue | max_deliver + separate stream | You build DLQ logic yourself, but it's ~10 lines of code. |
| Long Polling | consume() / fetch() | consume() keeps pull requests open for you and hands back an iterator. No hand-rolled polling loop. No wasted HTTP requests. |
| FIFO Queue | Stream + consumer with max_ack_pending: 1 | Streams store messages in publish order; max_ack_pending: 1 makes processing strictly one at a time. |
| Message Groups | Subject hierarchy (orders.us.created) | Way more flexible than SQS message groups. |
| Batch Operations | fetch({ max_messages: 10 }) | Native batch consumption. |

The Numbers: Before and After

We're engineers. We don't trust vibes. Here are the actual numbers from the migration, measured over a 7-day window with production traffic:

| Metric | AWS Managed | Self-Hosted (K8s) | Delta |
|---|---|---|---|
| PostgreSQL query latency (p50) | 3.2ms | 0.9ms | -72% |
| PostgreSQL query latency (p99) | 12.1ms | 3.4ms | -72% |
| S3/MinIO PUT latency (p50) | 48ms | 8ms | -83% |
| S3/MinIO GET latency (p50) | 35ms | 5ms | -86% |
| ElastiCache/Redis GET latency (p50) | 0.8ms | 0.3ms | -63% |
| SQS/NATS publish latency (p50) | 18ms | 0.4ms | -98% |
| SQS/NATS end-to-end latency (p50) | 45ms | 1.2ms | -97% |
| Monthly infrastructure cost | $4,280 | $1,650 | -61% |
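
The Delta column is the plain percentage change from the managed number to the self-hosted one. A quick sketch that recomputes a few rows from the table (the helper name is ours):

```python
def pct_change(before, after):
    """Percentage change from before to after, rounded to a whole percent."""
    return round((after - before) / before * 100)


rows = {
    "PostgreSQL p50 (ms)": (3.2, 0.9),
    "S3/MinIO PUT p50 (ms)": (48, 8),
    "SQS/NATS publish p50 (ms)": (18, 0.4),
    "Monthly cost ($)": (4280, 1650),
}
deltas = {name: pct_change(before, after) for name, (before, after) in rows.items()}
for name, delta in deltas.items():
    print(f"{name}: {delta}%")
```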

Why is everything faster? One word: locality. When your database, cache, queue, and application are all running in the same Kubernetes cluster, there's no cross-AZ network hop. There's no NAT gateway. There's no VPC endpoint routing. It's just pod-to-pod communication over a virtual network, and it's fast.

The honest caveat: AWS managed services absorb operational work that self-hosting hands back to you. You need to budget for engineering time on backups, monitoring, upgrades, and incident response. More on that in the cost section.


Migration Order: Start With What Scares You Least

Don't migrate everything at once. We did this migration in four phases, and the ordering matters:

  1. Object Storage (MinIO) — Lowest risk. Your S3 SDK code doesn't change. If something goes wrong, you can point back to S3 in seconds. Start here. Build confidence.

  2. Cache (Redis) — The data is ephemeral by design. If you lose the cache, the application gets slower but doesn't break (if you wrote your cache layer correctly — big "if"). Easy rollback: delete the Redis deployment, point back to ElastiCache.

  3. Message Queue (NATS) — Requires code changes (the API is different from SQS), but message data is transient. Run both in parallel during the transition: publish to both SQS and NATS, consume from NATS, and fall back to SQS if NATS has issues.

  4. Database (PostgreSQL) — Last. Always last. This is stateful, critical, and the hardest to roll back. Use pg_dump/pg_restore or logical replication to migrate data. Test your backup and recovery procedures before cutting over. Then test them again.
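
The parallel run for the queue in phase 3 can be sketched as a thin wrapper around two publishers. This assumes each broker client is wrapped in a plain callable. The function names are hypothetical and the real publishers would be your SQS and NATS clients.

```python
def publish_dual(message, publish_nats, publish_sqs):
    """Send a message to both brokers and report which ones accepted it."""
    delivered = []
    for name, publish in (("nats", publish_nats), ("sqs", publish_sqs)):
        try:
            publish(message)
            delivered.append(name)
        except Exception as exc:
            print(f"{name} publish failed: {exc}")
    if not delivered:
        raise RuntimeError("message was not accepted by either broker")
    return delivered


def choose_consumer_source(nats_healthy):
    """Consume from NATS; fall back to SQS only while NATS is unhealthy."""
    return "nats" if nats_healthy else "sqs"


nats_log, sqs_log = [], []
accepted = publish_dual({"id": 1}, nats_log.append, sqs_log.append)
print(accepted, choose_consumer_source(nats_healthy=False))
```

Because every message exists in both systems during the transition, a fallback to SQS can replay work that NATS consumers already finished, so handlers need to be idempotent.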

For each service:

  1. Deploy the portable alternative alongside the managed service
  2. Run both in parallel with traffic mirroring or splitting
  3. Validate functionality and performance (give it at least a week)
  4. Cut over completely
  5. Keep the managed service running for one more week (just in case)
  6. Decommission the managed service and enjoy your smaller AWS bill

The Honest Cost Analysis: Self-Hosting Isn't Free

Let's not pretend this is all upside. Here's what it actually costs to run your own data layer:

Monthly Infrastructure (Our Client's Numbers)

| Category | AWS Managed | Self-Hosted | Notes |
|---|---|---|---|
| Compute (database, cache, queue pods) | Included in service pricing | $850 | Existing K8s cluster capacity |
| Storage (PVCs) | Included in service pricing | $400 | gp3-equivalent, 800Gi total |
| Backup storage (MinIO for PG backups) | $15 (S3) | $50 | Dedicated MinIO bucket |
| Monitoring (Prometheus, Grafana) | CloudWatch: $180 | $150 | Self-hosted monitoring stack |
| Total infrastructure | $4,280 | $1,650 | |

But also budget for:

| Hidden Cost | Estimate |
|---|---|
| Initial setup and migration (one-time) | 4-6 weeks of engineering time |
| Ongoing operations (patching, upgrades) | ~4 hours/month |
| Incident response (when things break) | ~2 hours/month (averaged) |
| Knowledge maintenance (staying current) | ~2 hours/month |

The monthly savings (~$2,600 in this case) more than cover the operational overhead. But if your team is already stretched thin, the cognitive load of running four more stateful services might not be worth it.

Our recommendation: If you're migrating for portability (you need to run on multiple clouds or on-prem), the cost-benefit is clear. If you're migrating purely to save money, do the math carefully. The breakeven point is usually around $3,000/month in managed service spend.
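
"Do the math carefully" can start as a few lines. This sketch nets the ongoing engineering hours against the infrastructure savings. The hourly rate is a placeholder assumption of ours, so plug in your own, and note that it leaves out the one-time migration effort.

```python
def net_monthly_savings(managed_cost, self_hosted_cost, ops_hours, hourly_rate):
    """Infrastructure savings minus the cost of ongoing operations time."""
    return (managed_cost - self_hosted_cost) - ops_hours * hourly_rate


OPS_HOURS = 4 + 2 + 2  # patching + incident response + staying current
HOURLY_RATE = 150      # hypothetical loaded engineering rate: an assumption

net = net_monthly_savings(4280, 1650, OPS_HOURS, HOURLY_RATE)
print(net)
```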


The Postmortem

Six weeks. Four managed services replaced. One 3am incident. Zero data lost.

The client can now deploy their entire stack on AWS, GCP, Azure, or a rack in their own data center. The application code barely changed — new connection strings, one new environment variable for MinIO, and a rewrite of the queue consumer from SQS to NATS.

Week four was rough. But every week since has been cheaper, faster, and more portable.

If you're staring at your AWS bill and wondering whether your managed services are managing you, reach out. We've done this migration enough times that week four doesn't scare us anymore.

Well. It scares us a little less.
