Building a Portable Data Layer: Replacing RDS, S3, ElastiCache, and SQS
We ripped out four AWS managed services and replaced them with Kubernetes-native alternatives. Here's the config that survived production — and the config that didn't.
The Great Data Layer Heist
Six weeks. That's what we told the client. "We'll have your entire data layer migrated off AWS managed services and running on Kubernetes-native alternatives in six weeks."
Weeks one through three were smooth. We were high-fiving. We were on schedule. We were heroes.
Week four was when we discovered that our WAL archiving config had a subtle race condition that only manifested under load. Week four was when we learned that "S3-compatible" doesn't mean "S3-identical." Week four was when someone on the team said the phrase "it works on my cluster" without irony.
We still finished in six weeks. But week four aged us all by approximately five years. Here's everything we learned — the configs that survived production and the ones that caught fire at 3am.
PostgreSQL: The DBA Who Learned to Love Operators
If you've been running RDS for a while, you've gotten comfortable. Automated backups. Push-button replicas. Somebody else gets paged when the disk fills up. CloudNativePG is what happens when you take all of that comfort and say "I'll do it myself, but in YAML."
And honestly? It's kind of great.
The Cluster Manifest That Actually Survived Production
This is not a tutorial example. This is what we're running. Every parameter has a story behind it.
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
name: app-database
namespace: production
spec:
instances: 3 # Primary + 2 replicas. We tried 5 once. The WAL shipping
# overhead was not worth it for our write volume.
  imageName: ghcr.io/cloudnative-pg/postgresql:16.2-1
  # Pin your image. We learned this when a minor version bump
  # changed the default value of jit_above_cost and our query
  # planner decided to JIT-compile a 4-line SELECT.
storage:
size: 100Gi
storageClass: fast-ssd # Use io2 or gp3-equivalent. We started
# with gp2-equivalent and the IOPS ceiling
# hit us during a VACUUM FULL. Never again.
postgresql:
parameters:
# === Connection Settings ===
max_connections: "200"
# 200 is generous. If you need more than this, you need
# PgBouncer, not more connections. PostgreSQL's process-per-
# connection model means 500 connections = 500 processes =
# your OOM killer's favorite afternoon snack.
# === Memory Settings ===
shared_buffers: "256MB" # ~25% of pod memory. Don't go over 40%.
effective_cache_size: "768MB" # Tell the planner about OS cache
work_mem: "6553kB" # Per-operation sort memory
maintenance_work_mem: "128MB" # VACUUM and CREATE INDEX get more
# === WAL Settings ===
wal_buffers: "16MB"
min_wal_size: "1GB"
max_wal_size: "4GB"
checkpoint_completion_target: "0.9" # Spread checkpoints. Your disks
# will thank you.
# === Planner Settings ===
random_page_cost: "1.1" # SSDs make random reads almost as
# fast as sequential. Tell Postgres.
effective_io_concurrency: "200" # SSDs can handle concurrent I/O.
default_statistics_target: "100"
# === Parallelism ===
max_worker_processes: "4"
max_parallel_workers_per_gather: "2"
max_parallel_workers: "4"
# === The Backup Config That Saved Us at 3am ===
backup:
barmanObjectStore:
destinationPath: s3://backups/postgres
endpointURL: http://minio.storage:9000
s3Credentials:
accessKeyId:
name: minio-credentials
key: access-key
secretAccessKey:
name: minio-credentials
key: secret-key
wal:
compression: gzip
        maxParallel: 2 # We bumped this to 2 after WAL archiving
                       # fell behind during a bulk import. The default
                       # of 1 is fine until it isn't.
data:
compression: gzip
retentionPolicy: "30d"
# 30 days. Not 7. We once had a customer discover data corruption
# that happened 10 days ago. 7-day retention would have meant
# "have you considered re-entering that data by hand?"
  enablePDB: true # CNPG manages PodDisruptionBudgets for you. Leave this on.
resources:
requests:
memory: "1Gi"
cpu: "500m"
limits:
memory: "2Gi"
cpu: "2"
# Don't set CPU limits too tight. PostgreSQL's VACUUM processes
# need burst capacity. A throttled VACUUM is a slow VACUUM,
# and a slow VACUUM is a table that's 80% dead tuples.
Scheduled Backups (Because "I'll Do It Manually" Is Not a Strategy)
apiVersion: postgresql.cnpg.io/v1
kind: ScheduledBackup
metadata:
name: app-database-daily
spec:
schedule: "0 2 * * *" # 2 AM daily. Not noon. Not "whenever Jenkins feels like it."
backupOwnerReference: self
cluster:
name: app-database
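Cron covers the steady state. Before anything risky (a schema migration, a bulk import), take a one-off backup too. CloudNativePG also accepts on-demand Backup resources; a minimal sketch, with a name we made up:

apiVersion: postgresql.cnpg.io/v1
kind: Backup
metadata:
  name: app-database-pre-migration # Hypothetical name; use whatever fits the change
  namespace: production
spec:
  cluster:
    name: app-database

kubectl apply it, wait for the status to flip to completed, then run the scary thing.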
The 3am Incident: Point-in-Time Recovery
It was 3:17am on a Wednesday. A deployment script ran UPDATE users SET role = 'admin' without a WHERE clause. Everyone was an admin. Everyone.
This is what we applied at 3:22am (after the initial five minutes of staring at the screen in disbelief):
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
name: app-database-restored
spec:
  instances: 3
  storage:
    size: 100Gi # Match the original cluster's storage spec
bootstrap:
recovery:
source: app-database
recoveryTarget:
# Restore to 3:16am — one minute before the deployment from hell.
targetTime: "2025-11-12 03:16:00.000000+00"
externalClusters:
- name: app-database
barmanObjectStore:
destinationPath: s3://backups/postgres
endpointURL: http://minio.storage:9000
s3Credentials:
accessKeyId:
name: minio-credentials
key: access-key
secretAccessKey:
name: minio-credentials
key: secret-key
Thirteen minutes from incident to recovered database. Try doing that with an RDS snapshot restore (spoiler: it takes 20-45 minutes depending on how big your database is and how much AWS likes you that day).
The PgBouncer Gotcha Nobody Warns You About
PgBouncer transaction mode will break your prepared statements. Ask us how we know.
Here's what happened: we turned on PgBouncer in transaction mode because session mode doesn't pool effectively. Everything seemed fine. Then, two hours later, a service that used Prisma started throwing prepared statement "s0" already exists errors. Why? Because in transaction mode, PgBouncer releases the backend connection after each transaction. But prepared statements are bound to a connection. So your ORM prepares a statement on connection A, the next transaction gets connection B, and PostgreSQL says "I don't know what s0 is."
The fix:
# In your connection string, tell the client to skip prepared statements.
# For Prisma, that's its dedicated pgbouncer flag; other clients have their
# own spelling (postgres.js uses prepare: false in the client options).
- DATABASE_URL=postgres://user:pass@app-database-rw:5432/app
+ DATABASE_URL=postgres://user:pass@app-database-rw:5432/app?pgbouncer=true
Or switch to session mode and accept less efficient pooling. There's no free lunch. There's barely even a free snack.
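If you're on CloudNativePG, you don't need a hand-rolled PgBouncer Deployment either: the operator ships it as a first-class Pooler resource. A minimal sketch (instance counts and parameters are illustrative, not values from this migration):

apiVersion: postgresql.cnpg.io/v1
kind: Pooler
metadata:
  name: app-database-pooler-rw
  namespace: production
spec:
  cluster:
    name: app-database
  instances: 2
  type: rw                    # Pool connections to the primary
  pgbouncer:
    poolMode: transaction     # The mode that breaks prepared statements (see above)
    parameters:
      max_client_conn: "1000"
      default_pool_size: "20"

Flipping poolMode to session is the one-line escape hatch if prepared statements matter more to you than pool density.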
Service Endpoints
# Direct to primary (for writes)
app-database-rw.production.svc.cluster.local:5432
# Read-only replicas (for reads)
app-database-ro.production.svc.cluster.local:5432
# Any instance (for admin tasks — use sparingly)
app-database-r.production.svc.cluster.local:5432
Update your connection string and move on with your life:
- DATABASE_URL=postgres://user:pass@mydb.abc123.us-east-1.rds.amazonaws.com:5432/app
+ DATABASE_URL=postgres://user:pass@app-database-rw.production.svc.cluster.local:5432/app
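Those three endpoints also make read/write splitting trivial. A sketch with node-postgres; the helper functions are ours, not from the client's codebase:

const { Pool } = require('pg');

// Writes and anything transactional go to the primary via -rw.
const writePool = new Pool({
  connectionString: 'postgres://user:pass@app-database-rw.production.svc.cluster.local:5432/app',
});

// Read-only queries fan out across the replicas via -ro.
// Mind replication lag: don't read your own just-committed writes here.
const readPool = new Pool({
  connectionString: 'postgres://user:pass@app-database-ro.production.svc.cluster.local:5432/app',
});

async function getUser(id) {
  const { rows } = await readPool.query('SELECT * FROM users WHERE id = $1', [id]);
  return rows[0];
}

async function setUserEmail(id, email) {
  await writePool.query('UPDATE users SET email = $2 WHERE id = $1', [id, email]);
}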
MinIO: S3, But Make It Portable
We literally changed one environment variable and it worked.
We're not exaggerating. The AWS SDK is designed to talk to any S3-compatible endpoint. We pointed it at MinIO instead of S3, ran the integration tests, and they passed. All of them. We spent the next three hours looking for what else we needed to change. There was nothing. We went home early.
It was the single best afternoon of the entire migration.
Production Deployment
apiVersion: v1
kind: Secret
metadata:
name: minio-credentials
namespace: storage
type: Opaque
stringData:
root-user: admin
root-password: your-secure-password-here # Obviously not this.
# Use a sealed secret or
# external secrets operator.
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: minio
namespace: storage
spec:
serviceName: minio
replicas: 4 # 4 is the minimum for erasure coding. More on this below.
selector:
matchLabels:
app: minio
template:
metadata:
labels:
app: minio
spec:
containers:
- name: minio
image: minio/minio:RELEASE.2025-01-01T00-00-00Z
args:
- server
- http://minio-{0...3}.minio.storage.svc.cluster.local/data
- --console-address
- ":9001"
env:
- name: MINIO_ROOT_USER
valueFrom:
secretKeyRef:
name: minio-credentials
key: root-user
- name: MINIO_ROOT_PASSWORD
valueFrom:
secretKeyRef:
name: minio-credentials
key: root-password
ports:
- containerPort: 9000
name: api
- containerPort: 9001
name: console
volumeMounts:
- name: data
mountPath: /data
resources:
requests:
memory: "1Gi"
cpu: "500m"
limits:
memory: "2Gi"
cpu: "2"
livenessProbe:
httpGet:
path: /minio/health/live
port: 9000
initialDelaySeconds: 120
periodSeconds: 20
readinessProbe:
httpGet:
path: /minio/health/ready
port: 9000
initialDelaySeconds: 30
periodSeconds: 20
volumeClaimTemplates:
- metadata:
name: data
spec:
accessModes: ["ReadWriteOnce"]
storageClassName: fast-ssd
resources:
requests:
storage: 500Gi
Erasure Coding: Explained With Pizza
Imagine you order a pizza with 8 slices. Normal storage is like needing all 8 slices to have a complete pizza (duh). Erasure coding is like a magical pizza where you only need any 4 of the 8 slices to reconstruct the entire pizza.
MinIO with 4 nodes uses erasure coding to split your data into data and parity shards. You can lose up to half your nodes and still read all your data. You can lose one node and still write new data. It's the closest thing to magic in storage engineering.
The catch? You need a minimum of 4 nodes, and your usable capacity is roughly 50% of raw capacity (the other 50% is parity). That's the price of being able to yank a server out of the rack without losing data.
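One operational note: buckets don't create themselves, and clicking around the MinIO console isn't reproducible. A one-shot Job wrapping the mc CLI does the trick. A sketch, assuming the minio/mc image and the credentials Secret from above; the bucket names match the examples in this post:

apiVersion: batch/v1
kind: Job
metadata:
  name: minio-create-buckets
  namespace: storage
spec:
  template:
    spec:
      restartPolicy: OnFailure
      containers:
        - name: mc
          image: minio/mc:latest # Pin this in production, like everything else
          env:
            - name: MINIO_ROOT_USER
              valueFrom:
                secretKeyRef:
                  name: minio-credentials
                  key: root-user
            - name: MINIO_ROOT_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: minio-credentials
                  key: root-password
          command: ["/bin/sh", "-c"]
          args:
            - |
              mc alias set local http://minio.storage.svc.cluster.local:9000 "$MINIO_ROOT_USER" "$MINIO_ROOT_PASSWORD"
              mc mb --ignore-existing local/app-uploads local/app-reports local/backups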
The Three-Language Endpoint Swap
This is the part that felt too easy. Here's the change for every language we had in the project:
Node.js (AWS SDK v3)
import { S3Client, PutObjectCommand, GetObjectCommand } from "@aws-sdk/client-s3";
const s3Client = new S3Client({
// This is the only line that changes. That's it. That's the migration.
endpoint: process.env.S3_ENDPOINT || "http://minio.storage.svc.cluster.local:9000",
region: "us-east-1", // Required by the SDK, ignored by MinIO. Just put anything.
credentials: {
accessKeyId: process.env.S3_ACCESS_KEY,
secretAccessKey: process.env.S3_SECRET_KEY,
},
forcePathStyle: true, // IMPORTANT: MinIO uses path-style URLs, not virtual-hosted.
// Without this, the SDK tries to use bucket.endpoint as the
// hostname and your DNS will look at you like you're speaking
// Klingon.
});
// Everything else is identical. Same PutObject. Same GetObject. Same presigned URLs.
const upload = await s3Client.send(new PutObjectCommand({
Bucket: "app-uploads",
Key: `documents/${userId}/${filename}`,
Body: fileBuffer,
ContentType: mimeType,
}));
Python (boto3)
import boto3
import os
s3 = boto3.client(
's3',
endpoint_url=os.environ.get('S3_ENDPOINT', 'http://minio.storage.svc.cluster.local:9000'),
aws_access_key_id=os.environ['S3_ACCESS_KEY'],
aws_secret_access_key=os.environ['S3_SECRET_KEY'],
)
# Literally the same code you already have.
s3.upload_file('/tmp/report.pdf', 'app-reports', 'monthly/2025-01.pdf')
presigned = s3.generate_presigned_url('get_object', Params={
'Bucket': 'app-reports',
'Key': 'monthly/2025-01.pdf',
}, ExpiresIn=3600)
Go
package main

import (
	"context"
	"log"
	"os"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/credentials"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

func main() {
	cfg, err := config.LoadDefaultConfig(context.TODO(),
		config.WithRegion("us-east-1"), // Required by the SDK, ignored by MinIO
		config.WithCredentialsProvider(credentials.NewStaticCredentialsProvider(
			os.Getenv("S3_ACCESS_KEY"),
			os.Getenv("S3_SECRET_KEY"),
			"",
		)),
	)
	if err != nil {
		log.Fatalf("unable to load SDK config: %v", err)
	}

	client := s3.NewFromConfig(cfg, func(o *s3.Options) {
		o.BaseEndpoint = aws.String(os.Getenv("S3_ENDPOINT"))
		o.UsePathStyle = true // Don't forget this or prepare for confusing DNS errors
	})

	file, err := os.Open("file.pdf") // Any io.Reader works as the Body
	if err != nil {
		log.Fatalf("open: %v", err)
	}
	defer file.Close()

	// Same API. Same operations. Different endpoint.
	_, err = client.PutObject(context.TODO(), &s3.PutObjectInput{
		Bucket: aws.String("app-uploads"),
		Key:    aws.String("documents/user-123/file.pdf"),
		Body:   file,
	})
	if err != nil {
		log.Fatalf("PutObject: %v", err)
	}
}
Three languages. One endpoint change and one path-style flag each. We kept refreshing the test results waiting for something to fail. Nothing did.
Redis: ElastiCache Without the Elasti-Price
ElastiCache is lovely until you look at the bill. A cache.r6g.xlarge with a replica runs you about $730/month. A Redis pod with equivalent resources on your existing Kubernetes cluster costs... whatever your compute costs. Which you're already paying for.
The migration itself was straightforward. The lesson came later.
The Sentinel Setup That Actually Works
apiVersion: v1
kind: ConfigMap
metadata:
name: redis-config
namespace: cache
data:
redis.conf: |
maxmemory 2gb
maxmemory-policy allkeys-lru # READ THIS. READ IT AGAIN.
appendonly yes
appendfsync everysec
# We don't use RDB snapshots because AOF gives us better durability.
# If you're using Redis purely as a cache, you can skip AOF too.
# But if there's any state you care about (sessions, rate limit
# counters), keep AOF on.
---
apiVersion: databases.spotahome.com/v1
kind: RedisFailover
metadata:
name: redis-cache
namespace: cache
spec:
sentinel:
replicas: 3 # Always odd. Sentinel uses majority voting.
# 2 sentinels means 1 failure = no quorum = no failover.
# 3 sentinels means 1 failure = still works. Math.
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 200m
memory: 256Mi
redis:
replicas: 3 # 1 master + 2 replicas
resources:
requests:
cpu: 500m
memory: 1Gi
limits:
cpu: 1
memory: 2Gi
storage:
keepAfterDeletion: true # Don't delete PVCs when the CR is deleted.
# We learned this the hard way during a
# Helm chart upgrade gone wrong.
persistentVolumeClaim:
metadata:
name: redis-data
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 10Gi
storageClassName: fast-ssd
customConfig:
- "maxmemory 2gb"
- "maxmemory-policy allkeys-lru"
The maxmemory-policy Lesson
The default maxmemory-policy in Redis is noeviction. Read that word carefully. No. Eviction. It means when Redis fills up, it doesn't remove old keys to make room. It just... stops accepting writes. It returns errors. OOM command not allowed when used memory > 'maxmemory'.
We learned this at 2pm on a Tuesday. Not 2am — 2pm, in the middle of a product demo. The cache filled up, Redis started returning errors, and the application fell over because apparently nobody had written a fallback for "what if the cache just says no."
The fix is allkeys-lru (evict the least recently used key when memory is full). If you're using Redis as a cache — and you probably are — this is what you want. Set it explicitly. Don't trust the default.
# redis.conf
maxmemory 2gb
- # maxmemory-policy noeviction (the default that ruined our Tuesday)
+ maxmemory-policy allkeys-lru
Connection Configuration
// Node.js (ioredis) — Sentinel-aware connection
const Redis = require('ioredis');
const redis = new Redis({
sentinels: [
{ host: 'rfs-redis-cache.cache.svc.cluster.local', port: 26379 },
],
name: 'mymaster',
password: process.env.REDIS_PASSWORD,
// Enable lazy connect so the app starts even if Redis is temporarily down
lazyConnect: true,
// Retry strategy: exponential backoff with jitter
retryStrategy(times) {
const delay = Math.min(times * 50, 2000);
return delay + Math.random() * 100; // Jitter prevents thundering herd
},
});
// Always handle errors. Always. ALWAYS.
redis.on('error', (err) => {
console.error('Redis connection error:', err.message);
// Don't crash. Degrade gracefully. Serve from the database.
// It'll be slower but it'll work.
});
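And here's what "degrade gracefully" actually looks like: wrap every cache read so a Redis error falls through to the database instead of taking the request down. A sketch using the redis client from above; fetchUserFromDb is a stand-in for your data layer:

async function getCachedUser(id) {
  const key = `user:${id}`;
  try {
    const hit = await redis.get(key);
    if (hit) return JSON.parse(hit);
  } catch (err) {
    // Redis is down, or full and saying no. Log it and fall through.
    console.warn('Cache read failed, serving from DB:', err.message);
  }

  const user = await fetchUserFromDb(id); // Hypothetical data-layer call

  try {
    // Best-effort write-back with a TTL. A cache error must never fail the request.
    await redis.set(key, JSON.stringify(user), 'EX', 300);
  } catch (err) {
    console.warn('Cache write failed (non-fatal):', err.message);
  }
  return user;
}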
NATS: SQS is a Queue. NATS is a Lifestyle.
We used to joke that SQS was designed by someone who thought messages should take a scenic route. NATS is what happens when messages take a fighter jet.
The latency difference was so dramatic that when we first ran our benchmarks, we assumed the test was broken. SQS was giving us 15-25ms per message. NATS JetStream was giving us sub-millisecond. Our latency monitoring dashboards, which had been gently rolling hills, became a flatline. We briefly thought the service was down.
It was not down. It was just that fast.
JetStream Configuration
apiVersion: v1
kind: ConfigMap
metadata:
name: nats-config
namespace: messaging
data:
nats.conf: |
port: 4222
http_port: 8222
jetstream {
store_dir: /data
max_memory_store: 1Gi
max_file_store: 10Gi
# max_file_store is your durability budget. When this fills up,
# the oldest messages get discarded (if your stream uses 'limits'
# retention). Size it for your peak message volume + some headroom.
}
cluster {
name: nats-cluster
port: 6222
routes: [
nats://nats-0.nats.messaging.svc.cluster.local:6222
nats://nats-1.nats.messaging.svc.cluster.local:6222
nats://nats-2.nats.messaging.svc.cluster.local:6222
]
}
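Before creating streams, it's worth confirming JetStream actually came up with the limits you configured. The nats.js client exposes account stats; a quick check:

const { connect } = require('nats');

async function checkJetStream() {
  const nc = await connect({ servers: 'nats://nats.messaging.svc.cluster.local:4222' });
  const jsm = await nc.jetstreamManager(); // Throws if JetStream isn't enabled
  const info = await jsm.getAccountInfo();
  console.log(`JetStream is up: ${info.streams} streams, ` +
    `${info.memory} bytes in memory, ${info.storage} bytes on disk`);
  await nc.close();
}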
Stream and Consumer Setup (The SQS Queue Equivalent)
const { connect, StringCodec, AckPolicy, DeliverPolicy } = require('nats');
async function setupMessaging() {
const nc = await connect({
servers: 'nats://nats.messaging.svc.cluster.local:4222',
// reconnect automatically — NATS clients are persistent by default,
// unlike SQS which is "fire HTTP request, pray, repeat"
});
const jsm = await nc.jetstreamManager();
// Create a stream — think of this as the SQS queue itself
await jsm.streams.add({
name: 'ORDERS',
subjects: ['orders.*'], // Wildcard subjects! SQS can't do this.
// orders.created, orders.shipped, orders.cancelled
// all go to the same stream.
retention: 'limits',
max_msgs: 1_000_000,
max_bytes: 1024 * 1024 * 1024, // 1GB
max_age: 7 * 24 * 60 * 60 * 1e9, // 7 days in nanoseconds (yes, nanoseconds)
storage: 'file', // 'file' for durability, 'memory' for speed
num_replicas: 3, // Replicate across 3 NATS servers
discard: 'old', // When full, discard oldest messages
duplicate_window: 120e9, // 2-minute dedup window. This saved us from
// a producer that double-published during
// network blips.
});
// Create a durable consumer — like an SQS consumer group
await jsm.consumers.add('ORDERS', {
durable_name: 'order-processor',
ack_policy: AckPolicy.Explicit,
max_deliver: 5, // Retry up to 5 times before giving up
ack_wait: 30 * 1e9, // 30-second ack timeout (like SQS visibility timeout)
deliver_policy: DeliverPolicy.All,
filter_subject: 'orders.>', // Process all order events
});
console.log('Streams and consumers configured.');
await nc.close();
}
Producer Code
async function publishOrder(order) {
const nc = await connect({
servers: 'nats://nats.messaging.svc.cluster.local:4222'
});
const js = nc.jetstream();
const sc = StringCodec();
// Publish with a message ID for deduplication
const ack = await js.publish(
`orders.${order.type}`,
sc.encode(JSON.stringify(order)),
{
msgID: `order-${order.id}`, // Dedup key. If the producer retries,
// NATS ignores the duplicate. SQS FIFO
// has this too, but regular SQS? Nope.
}
);
console.log(`Published order ${order.id}, stream seq: ${ack.seq}`);
await nc.drain();
}
Consumer Code
async function processOrders() {
const nc = await connect({
servers: 'nats://nats.messaging.svc.cluster.local:4222'
});
const js = nc.jetstream();
const sc = StringCodec();
const consumer = await js.consumers.get('ORDERS', 'order-processor');
// consume() gives you a pull-based iterator — way cleaner than
// SQS's "poll, process, delete" dance
const messages = await consumer.consume();
for await (const msg of messages) {
const order = JSON.parse(sc.decode(msg.data));
console.log(`Processing order ${order.id} [attempt ${msg.info.redeliveryCount}]`); // redeliveryCount is 1 on first delivery
try {
await handleOrder(order);
msg.ack(); // Success — remove from stream
} catch (err) {
console.error(`Failed to process order ${order.id}:`, err.message);
if (msg.info.redeliveryCount >= 5) { // 5th delivery = the last one max_deliver: 5 allows
// Max retries reached. Ack it and send to a dead letter stream.
await publishToDeadLetter(order, err);
msg.ack();
console.warn(`Order ${order.id} moved to dead letter stream after 5 attempts.`);
} else {
msg.nak(); // Negative ack — will be redelivered after ack_wait
}
}
}
}
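The publishToDeadLetter helper above isn't magic: it's one more JetStream publish into a separate stream. A sketch, assuming a hypothetical ORDERS_DLQ stream and the connection nc in scope (module-level, or passed in):

// One-time setup, e.g. next to the ORDERS stream definition:
//   await jsm.streams.add({ name: 'ORDERS_DLQ', subjects: ['dlq.orders.>'] });
// Keep DLQ subjects outside 'orders.*' so dead letters don't loop back into ORDERS.
async function publishToDeadLetter(order, err) {
  const js = nc.jetstream();
  const sc = StringCodec();
  await js.publish(
    `dlq.orders.${order.type}`,
    sc.encode(JSON.stringify({
      order,
      error: err.message,
      failedAt: new Date().toISOString(),
    })),
  );
}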
SQS to NATS: The Rosetta Stone
If you're coming from SQS, here's your translation guide:
| SQS Concept | NATS JetStream Equivalent | Notes |
|---|---|---|
| Queue | Stream + Consumer | NATS separates storage (stream) from consumption (consumer). This is actually better. |
| Message | Message | Same concept, faster delivery. |
| Visibility Timeout | ack_wait | How long before an unacknowledged message is redelivered. |
| Dead Letter Queue | max_deliver + separate stream | You build DLQ logic yourself, but it's ~10 lines of code. |
| Long Polling | consume() / fetch() | The client pulls in the background and hands you messages as they arrive. No HTTP polling loop. |
| FIFO Queue | Per-subject ordering (built in) | JetStream stores and delivers messages on a subject in publish order; process serially to keep it strict. |
| Message Groups | Subject hierarchy (orders.us.created) | Way more flexible than SQS message groups. |
| Batch Operations | fetch({ max_messages: 10 }) | Native batch consumption. |
The Numbers: Before and After
We're engineers. We don't trust vibes. Here are the actual numbers from the migration, measured over a 7-day window with production traffic:
| Metric | AWS Managed | Self-Hosted (K8s) | Delta |
|---|---|---|---|
| PostgreSQL query latency (p50) | 3.2ms | 0.9ms | -72% |
| PostgreSQL query latency (p99) | 12.1ms | 3.4ms | -72% |
| S3/MinIO PUT latency (p50) | 48ms | 8ms | -83% |
| S3/MinIO GET latency (p50) | 35ms | 5ms | -86% |
| ElastiCache/Redis GET latency (p50) | 0.8ms | 0.3ms | -63% |
| SQS/NATS publish latency (p50) | 18ms | 0.4ms | -98% |
| SQS/NATS end-to-end latency (p50) | 45ms | 1.2ms | -97% |
| Monthly infrastructure cost | $4,280 | $1,650 | -61% |
Why is everything faster? One word: locality. When your database, cache, queue, and application are all running in the same Kubernetes cluster, there's no cross-AZ network hop. There's no NAT gateway. There's no VPC endpoint routing. It's just pod-to-pod communication over a virtual network, and it's fast.
The honest caveat: AWS managed services include operational overhead that self-hosting doesn't. You need to budget for engineering time on backups, monitoring, upgrades, and incident response. More on that in the cost section.
Migration Order: Start With What Scares You Least
Don't migrate everything at once. We did this migration in four phases, and the ordering matters:
1. **Object Storage (MinIO)** — Lowest risk. Your S3 SDK code doesn't change. If something goes wrong, you can point back to S3 in seconds. Start here. Build confidence.
2. **Cache (Redis)** — The data is ephemeral by design. If you lose the cache, the application gets slower but doesn't break (if you wrote your cache layer correctly — big "if"). Easy rollback: delete the Redis deployment, point back to ElastiCache.
3. **Message Queue (NATS)** — Requires code changes (the API is different from SQS), but message data is transient. Run both in parallel during the transition: publish to both SQS and NATS, consume from NATS, and fall back to SQS if NATS has issues (see the sketch after this list).
4. **Database (PostgreSQL)** — Last. Always last. This is stateful, critical, and the hardest to roll back. Use `pg_dump`/`pg_restore` or logical replication to migrate data. Test your backup and recovery procedures before cutting over. Then test them again.
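Phase 3's parallel run, sketched. The queue URL and wiring are illustrative; the point is that the producer writes to both systems until you trust NATS:

const { SQSClient, SendMessageCommand } = require('@aws-sdk/client-sqs');
const { StringCodec } = require('nats');

const sc = StringCodec();

// Transition-period publisher: NATS is the primary path, SQS stays warm
// so rollback is a config change rather than a code change.
async function dualPublish(order, js, sqs) {
  const body = JSON.stringify(order);

  // New path: JetStream, with the same dedup msgID as the producer above.
  await js.publish(`orders.${order.type}`, sc.encode(body), {
    msgID: `order-${order.id}`,
  });

  // Old path: SQS. Shadow-publish failures are logged, not fatal.
  try {
    await sqs.send(new SendMessageCommand({
      QueueUrl: process.env.SQS_QUEUE_URL, // Illustrative env var
      MessageBody: body,
    }));
  } catch (err) {
    console.warn('SQS shadow publish failed:', err.message);
  }
}

Call it with js = nc.jetstream() and sqs = new SQSClient({}); once the parallel week is over, delete the SQS half and the import.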
For each service:
- Deploy the portable alternative alongside the managed service
- Run both in parallel with traffic mirroring or splitting
- Validate functionality and performance (give it at least a week)
- Cut over completely
- Keep the managed service running for one more week (just in case)
- Decommission the managed service and enjoy your smaller AWS bill
The Honest Cost Analysis: Self-Hosting Isn't Free
Let's not pretend this is all upside. Here's what it actually costs to run your own data layer:
Monthly Infrastructure (Our Client's Numbers)
| Category | AWS Managed | Self-Hosted | Notes |
|---|---|---|---|
| Compute (database, cache, queue pods) | Included in service pricing | $850 | Existing K8s cluster capacity |
| Storage (PVCs) | Included in service pricing | $400 | gp3-equivalent, 800Gi total |
| Backup storage (MinIO for PG backups) | $15 (S3) | $50 | Dedicated MinIO bucket |
| Monitoring (Prometheus, Grafana) | CloudWatch: $180 | $150 | Self-hosted monitoring stack |
| Total infrastructure | $4,280 | $1,650 | |
But also budget for:
| Hidden Cost | Estimate |
|---|---|
| Initial setup and migration (one-time) | 4-6 weeks of engineering time |
| Ongoing operations (patching, upgrades) | ~4 hours/month |
| Incident response (when things break) | ~2 hours/month (averaged) |
| Knowledge maintenance (staying current) | ~2 hours/month |
The monthly savings (~$2,600 in this case) more than cover the operational overhead. But if your team is already stretched thin, the cognitive load of running four more stateful services might not be worth it.
Our recommendation: If you're migrating for portability (you need to run on multiple clouds or on-prem), the cost-benefit is clear. If you're migrating purely to save money, do the math carefully. The breakeven point is usually around $3,000/month in managed service spend.
The Postmortem
Six weeks. Four managed services replaced. One 3am incident. Zero data lost.
The client can now deploy their entire stack on AWS, GCP, Azure, or a rack in their own data center. The application code barely changed — new connection strings, one new environment variable for MinIO, and a rewrite of the queue consumer from SQS to NATS.
Week four was rough. But every week since has been cheaper, faster, and more portable.
If you're staring at your AWS bill and wondering whether your managed services are managing you, reach out. We've done this migration enough times that week four doesn't scare us anymore.
Well. It scares us a little less.