The Multi-Cloud Migration Playbook: From Locked-In to Portable in 6 Phases
The battle-tested framework we use after watching too many multi-cloud initiatives crash and burn. Includes the mistakes everyone makes and how to avoid them.
23 Migrations. 15 Failures. Here's What the Survivors Did.
We've been in the room for 23 multi-cloud migrations. We've watched senior leadership announce bold multi-cloud strategies at all-hands meetings. We've seen the Gantt charts. We've attended the kickoff meetings where everyone is optimistic and the timeline says "Q3."
Fifteen of those migrations failed. Not "failed" as in "took longer than expected." Failed as in: the initiative was quietly shelved, the Kubernetes clusters were decommissioned, and everyone went back to clicking around in the AWS console pretending it never happened.
Eight survived.
This playbook is reverse-engineered from those eight. It's the sequence of operations, decision points, and hard truths that separate the migrations that work from the ones that become cautionary tales whispered in the hallways at re:Invent.
Phase 0: Clarify the Why (Or Don't Start)
Multi-cloud is not free. It's not even cheap. It adds complexity to every layer of your stack -- infrastructure, deployment, networking, observability, incident response, hiring. You are choosing to make your life harder. You need a reason good enough to justify that.
Valid Reasons
- Negotiating leverage: Your AWS contract renewal is in 6 months and you want to negotiate from a position of strength, not desperation. ("We could move to Azure" hits different when you actually can.)
- Customer requirements: Your enterprise customers mandate Azure, GCP, or on-premises deployment. This isn't theoretical -- they're waving purchase orders.
- Regulatory compliance: Data sovereignty laws require your infrastructure to exist in regions where your current provider doesn't operate. Or your industry regulator has opinions about single-vendor concentration.
- Risk mitigation: The board is asking what happens if AWS has a multi-region outage. Again. (us-east-1 has entered the chat.)
- Cost optimization: You've done the math, and specific workloads are genuinely cheaper on a different provider. Not "we read a blog post" cheaper, but "we ran a 3-month proof of concept and have the receipts" cheaper.
Invalid Reasons (Be Honest With Yourself)
- "Multi-cloud is best practice." Says who? Multi-cloud is a tool. You don't use a chainsaw to butter toast just because chainsaws are powerful tools. Context matters.
- "We might need it someday." You're paying the complexity tax today for an uncertain future benefit. That's called speculation. Put the money in an index fund instead.
- Resume-Driven Development: Your engineers want Kubernetes experience. We get it. We've been there. But that's a training budget, not a migration budget. Buy them a Pluralsight subscription and a homelab. Don't refactor production because someone wants to put "Istio" on their LinkedIn.
- "Our new CTO came from Google." Respectfully: what worked at Google's scale, with Google's engineering resources, and Google's custom infrastructure might not apply to your 12-person team running a B2B SaaS.
If you can't articulate a valid reason in one sentence that makes your CFO nod, stop here. Seriously. We've watched millions of dollars evaporate on migrations that didn't have a clear business driver. Go build features instead. Come back when the reason is real.
Phase 1: Inventory and Classify (Weeks 1-2)
This is the "opening your credit card statements" phase. You know it's going to be bad. You're doing it anyway because you can't fix what you don't understand.
Step 1: Export Everything
#!/usr/bin/env bash
# cloud-inventory.sh - Get a reality check on your AWS usage
# Run this and then sit down before reading the output.
set -euo pipefail
echo "=== AWS Resource Inventory ==="
echo "Account: $(aws sts get-caller-identity --query 'Account' --output text)"
echo "Generated: $(date -u +%Y-%m-%dT%H:%M:%SZ)"
echo ""
# Count resources by service
echo "--- Resources by Service ---"
aws resourcegroupstaggingapi get-resources \
--query 'ResourceTagMappingList[].ResourceARN' \
--output text | tr '\t' '\n' | cut -d: -f3 | sort | uniq -c | sort -rn
echo ""
echo "--- Lambda Functions ---"
aws lambda list-functions \
--query 'Functions[].{Name:FunctionName,Runtime:Runtime,Memory:MemorySize}' \
--output table
echo ""
echo "--- ECS Services ---"
for cluster in $(aws ecs list-clusters --query 'clusterArns[]' --output text); do
echo "Cluster: ${cluster##*/}"
aws ecs list-services --cluster "${cluster}" \
--query 'serviceArns[]' --output text | tr '\t' '\n'
done
echo ""
echo "--- RDS Instances ---"
aws rds describe-db-instances \
--query 'DBInstances[].{Name:DBInstanceIdentifier,Engine:Engine,Size:DBInstanceClass,MultiAZ:MultiAZ,Storage:AllocatedStorage}' \
--output table
echo ""
echo "--- S3 Buckets ---"
aws s3api list-buckets --query 'Buckets[].Name' --output text | tr '\t' '\n'
echo ""
echo "Total buckets: $(aws s3api list-buckets --query 'length(Buckets[])' --output text)"
# The number that matters most
echo ""
echo "--- Monthly Cost (Last 3 Months) ---"
aws ce get-cost-and-usage \
--time-period Start=$(date -u -d "3 months ago" +%Y-%m-01),End=$(date -u +%Y-%m-01) \
--granularity MONTHLY \
--metrics BlendedCost \
--query 'ResultsByTime[].{Period:TimePeriod.Start,Cost:Total.BlendedCost.Amount}' \
--output table
The first time we ran this for a client, the room went silent. They had 847 Lambda functions. Eight hundred and forty-seven. Someone had set up an architecture where every API endpoint was its own Lambda. It was like discovering your house is actually 847 tiny houses duct-taped together.
Step 2: Classify Every Workload
Create a spreadsheet. Yes, a spreadsheet. Not a Notion database. Not a Jira board. A spreadsheet, because you need to sort and filter and have uncomfortable conversations about prioritization, and Google Sheets handles that better than anything else.
| Workload | Cloud Service | Lock-in Depth | Data Sensitivity | Portability Effort | Business Value | Migration Phase |
|---|---|---|---|---|---|---|
| User API | ECS + ALB | Low | High | Low | Critical | Phase 3 |
| Auth | Cognito | Extreme | High | High | Critical | Phase 4 |
| Billing | Lambda + DynamoDB | High | High | High | Critical | Phase 5 |
| Analytics Pipeline | Kinesis + Athena + S3 | Extreme | Medium | Very High | Medium | Phase 5 or Never |
| Email Service | SES + Lambda | Medium | Low | Medium | Medium | Phase 3 |
| Image Processing | Lambda + S3 | Medium | Low | Medium | Low | Phase 3 |
| Search | OpenSearch (managed) | Medium | Medium | Medium | High | Phase 3 |
| Cron Jobs | EventBridge + Lambda | Medium | Low | Low | Low | Phase 3 |
That "or Never" next to the analytics pipeline? That's intentional. Not everything needs to be portable. Some workloads are deeply entangled with proprietary services, have low business criticality, and would cost more to migrate than they're worth. It's okay to leave them. Multi-cloud doesn't mean zero cloud. It means choice.
Step 3: Identify Your Portable Core
Look for the intersection of:
- High business value
- Low-to-medium portability effort
- Stateless (or uses standard databases)
These are your Phase 3 candidates. You want early wins. Quick wins build momentum, momentum builds organizational support, and organizational support is what keeps your migration alive when Phase 4 gets hard. (Phase 4 always gets hard.)
Phase 2: Build the Foundation (Weeks 3-6)
Before you migrate a single workload, you need somewhere for it to land. Think of this phase as building the airport before the planes arrive.
Step 1: Choose Your Kubernetes Distribution
| If You Need... | Consider... | Our Take |
|---|---|---|
| Managed simplicity | EKS, GKE, AKS | Start here. Fight fewer battles at once. |
| On-premises support | Rancher (RKE2), OpenShift, Tanzu | RKE2 is our go-to for on-prem |
| Edge/air-gapped | k3s, RKE2 | k3s is absurdly good for its size |
| Maximum portability | Vanilla Kubernetes | Only if you enjoy suffering |
Our actual recommendation: Start with a managed Kubernetes service in your primary cloud. Yes, that's still AWS. The goal of Phase 2 isn't to leave AWS -- it's to build the muscle memory of deploying to Kubernetes. You'll leave AWS later, from a position of competence instead of panic.
The critical rule: do not use cloud-specific Kubernetes features. No AWS ALB Ingress Controller. No GKE Config Connector. No Azure Workload Identity (yet). If it has your cloud provider's name in it, don't use it in Phase 2.
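To make the rule concrete, here's what a portable Ingress on this foundation looks like. The hostname and the letsencrypt-prod issuer are placeholders we're assuming for illustration; the point is what's absent -- no alb.ingress.kubernetes.io/* annotations, no cloud-named ingress class, nothing that breaks when the manifest lands on a different provider.

# A portable Ingress sketch -- hostname and issuer name are illustrative
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: user-api
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod  # cert-manager, not a cloud cert service
spec:
  ingressClassName: nginx  # the ingress-nginx controller you'll deploy in Step 3
  tls:
    - hosts:
        - api.acme.example
      secretName: user-api-tls
  rules:
    - host: api.acme.example
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: user-api
                port:
                  name: http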
Step 2: Set Up GitOps
Your infrastructure is code now. All of it. No more clicking in consoles. No more "I'll just apply this manifest real quick." Everything goes through Git. Everything gets reviewed. Everything is auditable.
We use ArgoCD. Here's a real application definition from a migration we ran:
# argocd/applications/platform-services.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
name: platform-services
namespace: argocd
# This finalizer ensures ArgoCD cleans up resources when the app is deleted
finalizers:
- resources-finalizer.argocd.argoproj.io
spec:
project: default
source:
repoURL: https://github.com/acme-corp/k8s-manifests.git
path: clusters/production/platform-services
targetRevision: main
# Helm values from Git -- single source of truth
helm:
valueFiles:
- values.yaml
- values-production.yaml
destination:
server: https://kubernetes.default.svc
namespace: platform
syncPolicy:
automated:
prune: true # Delete resources removed from Git
selfHeal: true # Revert manual changes
syncOptions:
- CreateNamespace=true
- ServerSideApply=true # Handles large CRDs better
retry:
limit: 5
backoff:
duration: 5s
factor: 2
maxDuration: 3m
# argocd/projects/platform.yaml
# ArgoCD Projects provide RBAC for what can be deployed where
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
name: platform
namespace: argocd
spec:
description: Platform services
sourceRepos:
- 'https://github.com/acme-corp/k8s-manifests.git'
- 'https://charts.bitnami.com/bitnami'
destinations:
- namespace: 'platform'
server: https://kubernetes.default.svc
- namespace: 'platform-*'
server: https://kubernetes.default.svc
# Don't let anyone deploy cluster-scoped resources from this project
clusterResourceWhitelist: []
namespaceResourceWhitelist:
- group: '*'
kind: '*'
Step 3: Deploy the Portable Services Baseline
Before any application workloads arrive, deploy these:
# helmfile.yaml -- our portable services baseline
# Every service here runs identically on any Kubernetes cluster
repositories:
- name: ingress-nginx
url: https://kubernetes.github.io/ingress-nginx
- name: jetstack
url: https://charts.jetstack.io
- name: prometheus-community
url: https://prometheus-community.github.io/helm-charts
- name: grafana
url: https://grafana.github.io/helm-charts
releases:
# Ingress -- NOT the AWS ALB controller
- name: ingress-nginx
namespace: ingress-nginx
chart: ingress-nginx/ingress-nginx
version: 4.9.0
values:
- controller:
replicaCount: 2
metrics:
enabled: true
# TLS certificate management
- name: cert-manager
namespace: cert-manager
chart: jetstack/cert-manager
version: 1.14.0
values:
- installCRDs: true
# Monitoring -- Prometheus + Grafana
- name: kube-prometheus-stack
namespace: monitoring
chart: prometheus-community/kube-prometheus-stack
version: 56.0.0
values:
- grafana:
adminPassword: "changeme" # Use a secret in production!
persistence:
enabled: true
size: 10Gi
# Logging -- Loki
- name: loki
namespace: logging
chart: grafana/loki-stack
version: 2.10.0
values:
- loki:
persistence:
enabled: true
size: 50Gi
promtail:
enabled: true
This is your "deployment target." When workloads arrive in Phase 3, they land on a platform that already has ingress, TLS, monitoring, and logging. No workload-specific infrastructure decisions needed.
Phase 3: Migrate the Portable Core (Weeks 7-12)
This is the tutorial level. The enemies are easy, the mechanics are forgiving, and you're building the skills you'll need for the boss fights later. Don't skip it. Don't rush it. Don't let anyone convince you to "just jump straight to the database migration because that's the hard part." That's like saying "let's skip the tutorial and go straight to the final boss." You will die. Metaphorically.
The Migration Pattern
For every workload, the same dance:
- Containerize -- Dockerfile, build pipeline, push to registry
- Helm chart -- Deployment, Service, ConfigMap, health checks
- Deploy side-by-side -- New K8s deployment runs alongside old ECS/Lambda
- Mirror traffic -- Send a copy of production traffic to the new deployment
- Compare -- Are the responses identical? Is latency comparable? Error rates? (A lightweight harness for this is sketched after this list.)
- Cut over -- Update DNS or load balancer
- Bake -- Run both for 2 weeks. Verify everything.
- Decommission -- Delete the old deployment
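Steps 4 and 5 don't need heavy tooling to get started. Here's a minimal sketch of the kind of comparison harness we mean -- the base URLs and sample paths are placeholders, and in practice you'd feed it paths sampled from real access logs rather than a hardcoded list:

// compare-responses.ts -- replay sampled request paths against both stacks and diff the results
const LEGACY = process.env.LEGACY_BASE_URL ?? 'https://api-legacy.acme.example';
const CANDIDATE = process.env.CANDIDATE_BASE_URL ?? 'https://api-k8s.acme.example';

// In the real thing, pull these from access logs so the sample reflects production traffic
const SAMPLE_PATHS = ['/healthz', '/v1/users/42', '/v1/users/42/preferences'];

async function probe(base: string, path: string) {
  const started = Date.now();
  const res = await fetch(`${base}${path}`, { headers: { accept: 'application/json' } });
  return { status: res.status, body: await res.text(), ms: Date.now() - started };
}

for (const path of SAMPLE_PATHS) {
  const [legacy, candidate] = await Promise.all([probe(LEGACY, path), probe(CANDIDATE, path)]);
  const match = legacy.status === candidate.status && legacy.body === candidate.body;
  console.log(
    `${match ? 'OK  ' : 'DIFF'} ${path} ` +
    `legacy=${legacy.status}/${legacy.ms}ms candidate=${candidate.status}/${candidate.ms}ms`,
  );
}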
Real Example: ECS to Kubernetes
Here's an actual migration we performed for a user-facing API. Before and after, with the warts included.
Before: ECS Task Definition
{
"family": "user-api",
"networkMode": "awsvpc",
"requiresCompatibilities": ["FARGATE"],
"cpu": "512",
"memory": "1024",
"containerDefinitions": [{
"name": "user-api",
"image": "123456789.dkr.ecr.us-east-1.amazonaws.com/user-api:latest",
"portMappings": [{"containerPort": 8080}],
"environment": [
{"name": "DATABASE_URL", "value": "postgresql://rds-endpoint:5432/users"},
{"name": "REDIS_URL", "value": "redis://elasticache-endpoint:6379"},
{"name": "AWS_REGION", "value": "us-east-1"},
{"name": "S3_BUCKET", "value": "acme-user-uploads"}
],
"logConfiguration": {
"logDriver": "awslogs",
"options": {
"awslogs-group": "/ecs/user-api",
"awslogs-region": "us-east-1",
"awslogs-stream-prefix": "ecs"
}
}
}]
}
Notice the land mines: hardcoded AWS region, RDS endpoint in an environment variable (not a secret!), CloudWatch-specific logging, :latest tag. Every one of these is a portability problem and, frankly, a best-practices problem.
After: Kubernetes Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
name: user-api
labels:
app.kubernetes.io/name: user-api
app.kubernetes.io/version: "1.4.2"
app.kubernetes.io/managed-by: helm
spec:
replicas: 3
selector:
matchLabels:
app.kubernetes.io/name: user-api
template:
metadata:
labels:
app.kubernetes.io/name: user-api
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "8080"
prometheus.io/path: "/metrics"
spec:
serviceAccountName: user-api
securityContext:
runAsNonRoot: true
runAsUser: 1001
fsGroup: 1001
containers:
- name: user-api
image: registry.internal.acme.com/user-api:1.4.2
ports:
- name: http
containerPort: 8080
protocol: TCP
env:
- name: DATABASE_URL
valueFrom:
secretKeyRef:
name: user-api-db
key: url
- name: REDIS_URL
valueFrom:
secretKeyRef:
name: user-api-redis
key: url
- name: S3_ENDPOINT
value: "https://minio.storage.svc.cluster.local"
- name: S3_BUCKET
value: "user-uploads"
resources:
requests:
cpu: 250m
memory: 512Mi
limits:
cpu: 500m
memory: 1Gi
livenessProbe:
httpGet:
path: /healthz
port: http
initialDelaySeconds: 15
periodSeconds: 10
readinessProbe:
httpGet:
path: /ready
port: http
initialDelaySeconds: 5
periodSeconds: 5
topologySpreadConstraints:
- maxSkew: 1
topologyKey: kubernetes.io/hostname
whenUnsatisfiable: DoNotSchedule
labelSelector:
matchLabels:
app.kubernetes.io/name: user-api
What changed and why:
- Secrets are in Kubernetes Secrets, not environment variables in a task definition (the Secret shape is sketched after this list)
- S3 endpoint is configurable (points to MinIO internally, could point to S3, GCS, or anything S3-compatible)
- Specific image tag, not :latest
- Non-root security context
- Resource requests AND limits
- Prometheus annotations for automatic scraping
- Topology spread constraints so pods don't all land on one node
- Standard Kubernetes labels (app.kubernetes.io/*)
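For that first bullet, the Secret behind the secretKeyRef is plain Kubernetes. A sketch with placeholder values -- in a GitOps repo you'd commit a SealedSecret or ExternalSecret rather than plaintext, but the shape the Deployment consumes is the same:

# Sketch of the user-api-db Secret referenced by the Deployment above (values are placeholders)
apiVersion: v1
kind: Secret
metadata:
  name: user-api-db
type: Opaque
stringData:
  url: postgresql://user_api:CHANGE_ME@db-host.internal:5432/users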
This migration took us 3 days for the Kubernetes manifests and about 2 weeks of parallel running before we were confident enough to cut over.
Phase 4: Replace Proprietary Auth and Identity (Weeks 13-18)
This is the phase where everyone says "how hard can it be?" and then disappears for three months.
Authentication is the load-bearing wall of your application. Every service touches it. Every user flow depends on it. Every session, every token, every permission check runs through it. And you're going to rip it out and replace it with something else. While the application is running. In production.
It's like performing open-heart surgery on a patient who's running a marathon and refusing to slow down.
Common Migrations
| From | To | Difficulty | Our Honest Assessment |
|---|---|---|---|
| AWS Cognito | Keycloak | Hard | Keycloak is powerful but configuring it correctly is a full-time job |
| AWS Cognito | Zitadel | Medium | Newer, cleaner API, less ecosystem support |
| Firebase Auth | Keycloak | Hard | Firebase's client SDK integration runs deep |
| Auth0 (cloud) | Keycloak | Medium | Auth0's OIDC compliance actually makes this easier |
The Strategy That Works
┌────────────────┐
│ API Gateway │
│ / Auth Proxy │
└───────┬────────┘
│
┌───────▼────────┐
│ Auth Facade │ ← Routes based on user cohort
└───┬────────┬───┘
│ │
┌───────▼──┐ ┌──▼───────┐
│ Cognito │ │ Keycloak │
│ (legacy) │ │ (new) │
└──────────┘ └──────────┘
- Deploy Keycloak alongside Cognito. Both live. Both work.
- Build an auth facade that inspects tokens and routes to the correct provider (a minimal routing sketch follows this list). New signups go to Keycloak. Existing sessions continue on Cognito.
- Migrate users in waves:
- Wave 1: Internal team (your own employees). They're forgiving.
- Wave 2: New signups. They don't know or care what the auth backend is.
- Wave 3: Active users, prompted to "re-verify" on next login (which silently migrates them).
- Wave 4: Dormant users. Migrated in bulk. If they come back and something's broken, they probably forgot their password anyway.
- Keep Cognito in read-only mode for 90 days after the last migration. Someone will find an edge case. Someone always finds an edge case.
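The facade's routing logic is genuinely small -- the hard part of this phase is operational, not algorithmic. A minimal sketch using the jose library; the env var names are assumptions, and the JWKS paths shown are the standard OIDC locations for each provider:

// auth-facade.ts -- route token validation by issuer during the migration window
import { createRemoteJWKSet, jwtVerify, decodeJwt } from 'jose';

const COGNITO_ISSUER = process.env.COGNITO_ISSUER!;   // e.g. https://cognito-idp.<region>.amazonaws.com/<pool-id>
const KEYCLOAK_ISSUER = process.env.KEYCLOAK_ISSUER!; // e.g. https://keycloak.acme.example/realms/platform

const cognitoJwks = createRemoteJWKSet(new URL(`${COGNITO_ISSUER}/.well-known/jwks.json`));
const keycloakJwks = createRemoteJWKSet(new URL(`${KEYCLOAK_ISSUER}/protocol/openid-connect/certs`));

export async function validateToken(token: string) {
  // Peek at the issuer claim (no verification yet), then verify against that provider's keys
  const { iss } = decodeJwt(token);
  if (iss === KEYCLOAK_ISSUER) {
    return (await jwtVerify(token, keycloakJwks, { issuer: KEYCLOAK_ISSUER })).payload;
  }
  if (iss === COGNITO_ISSUER) {
    return (await jwtVerify(token, cognitoJwks, { issuer: COGNITO_ISSUER })).payload;
  }
  throw new Error(`Unknown token issuer: ${iss}`);
}

Downstream services keep validating tokens exactly as before; only the facade knows two providers exist.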
The Ugly Truth
We budgeted 4 weeks for an auth migration once. It took 11. Here's what we didn't anticipate:
- Social login callback URLs all pointed to Cognito-specific endpoints. Every OAuth app registration (Google, GitHub, Microsoft) needed updating, and some of those require business verification that takes weeks.
- Password hashing algorithms differed between providers. We couldn't import password hashes directly. Users had to reset passwords or "re-verify" (which is a nicer way of saying "reset your password but we'll pretend it's for security").
- Custom auth flows (MFA enrollment, magic links, passwordless) had to be reimplemented from scratch in Keycloak.
- JWT token format differences broke three downstream services that were parsing tokens instead of validating them. (Don't parse tokens. Validate them. This is a PSA.)
Budget 2x whatever you think auth migration will take. Then add a buffer.
Phase 5: Migrate the Data Layer (Weeks 19-30)
Here be dragons.
If Phase 3 was the tutorial and Phase 4 was the mid-game boss, Phase 5 is the final boss. The one with multiple health bars. The one where the floor drops out halfway through and the music changes.
The data layer is hard because:
- Data has gravity. Moving terabytes is slow and expensive.
- Downtime must be near-zero. Your SLA doesn't care about your migration timeline.
- Data loss is unacceptable. Not "minimize data loss." Zero. Data. Loss.
- Every application that reads or writes data must be updated simultaneously or have a compatibility layer.
PostgreSQL: RDS to CloudNativePG
CloudNativePG is the operator we trust for PostgreSQL on Kubernetes. It manages the full lifecycle: provisioning, failover, backups, connection pooling, monitoring. Here's the real migration setup:
# cloudnativepg-cluster.yaml
# This is the target cluster that will receive replicated data from RDS
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
name: platform-db
namespace: database
spec:
instances: 3 # 1 primary, 2 replicas
imageName: ghcr.io/cloudnative-pg/postgresql:16.1
postgresql:
parameters:
max_connections: "200"
shared_buffers: "512MB"
effective_cache_size: "1536MB"
work_mem: "4MB"
maintenance_work_mem: "128MB"
wal_level: "logical" # Required for logical replication from RDS
max_wal_senders: "10"
max_replication_slots: "10"
storage:
size: 100Gi
storageClass: gp3-encrypted # Use whatever your cluster offers
# Automated backups to S3-compatible storage
backup:
barmanObjectStore:
destinationPath: "s3://acme-db-backups/platform/"
endpointURL: "https://minio.storage.svc.cluster.local"
s3Credentials:
accessKeyId:
name: db-backup-creds
key: ACCESS_KEY_ID
secretAccessKey:
name: db-backup-creds
key: SECRET_ACCESS_KEY
retentionPolicy: "30d"
# Connection pooling matters here: 200 application pods each opening 10 connections
# = 2,000 connections, and PostgreSQL will not be happy about that. PgBouncer itself
# is deployed as a separate CloudNativePG Pooler resource (sketched after this
# manifest); this section only adds an extra read-write Service.
managed:
services:
additional:
- selectorType: rw
serviceTemplate:
metadata:
name: platform-db-rw-pooled
spec:
type: ClusterIP
monitoring:
enablePodMonitor: true # Prometheus scraping
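One thing worth making explicit: in CloudNativePG, PgBouncer isn't a field on the Cluster -- it's a separate Pooler resource. A minimal sketch; the instance count and pool sizing below are assumptions to tune against your real connection counts:

# pgbouncer-pooler.yaml -- connection pooling in front of the platform-db cluster
apiVersion: postgresql.cnpg.io/v1
kind: Pooler
metadata:
  name: platform-db-pooler-rw
  namespace: database
spec:
  cluster:
    name: platform-db
  instances: 2
  type: rw
  pgbouncer:
    poolMode: transaction
    parameters:
      max_client_conn: "2000"
      default_pool_size: "25"

Applications then connect to the Service the Pooler creates (named after the Pooler) instead of hitting the primary directly.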
The replication dance:
-- Step 1: On the RDS source, enable logical replication
-- (Requires rds.logical_replication = 1 in parameter group)
-- Step 2: Create a publication on RDS
CREATE PUBLICATION migration_pub FOR ALL TABLES;
-- Step 3: On the CloudNativePG target, create a subscription
CREATE SUBSCRIPTION migration_sub
CONNECTION 'host=your-rds-endpoint.rds.amazonaws.com port=5432
dbname=platform user=replication_user password=xxx
sslmode=require'
PUBLICATION migration_pub
WITH (
copy_data = true, -- Initial data copy
create_slot = true, -- Create replication slot on source
enabled = true -- Start replicating immediately
);
-- Step 4: Monitor replication lag
SELECT
subname,
received_lsn,
latest_end_lsn,
latest_end_time
FROM pg_stat_subscription;
-- Step 5: When lag is consistently zero, schedule the cutover
-- The cutover window:
-- 1. Stop application writes (maintenance mode)
-- 2. Wait for final replication sync (usually seconds)
-- 3. Drop the subscription
-- 4. Update application connection strings
-- 5. Resume application writes
-- Total downtime: 30-120 seconds if you've practiced
We've done this cutover six times now. The shortest was 28 seconds. The longest was 4 minutes because someone's VPN disconnected mid-migration and we had to wait for them to reconnect to verify the final sync. Always have a backup communication channel. Always.
Object Storage: S3 to MinIO
This one is comparatively painless because MinIO speaks the S3 API. Your application code barely changes -- usually just adding an endpoint URL configuration:
import { S3Client } from '@aws-sdk/client-s3';

// Before: Hardcoded to AWS S3
const s3 = new S3Client({
region: 'us-east-1',
});
// After: Configurable endpoint
const s3 = new S3Client({
region: process.env.S3_REGION || 'us-east-1',
endpoint: process.env.S3_ENDPOINT || undefined, // undefined = real AWS S3
forcePathStyle: process.env.S3_FORCE_PATH_STYLE === 'true', // MinIO needs this
credentials: {
accessKeyId: process.env.S3_ACCESS_KEY_ID!,
secretAccessKey: process.env.S3_SECRET_ACCESS_KEY!,
},
});
The data migration is just mc mirror:
# mc = MinIO Client
# Mirror S3 bucket to MinIO (can run continuously until cutover)
mc alias set aws https://s3.amazonaws.com $AWS_ACCESS_KEY $AWS_SECRET_KEY
mc alias set minio https://minio.internal.acme.com $MINIO_ACCESS_KEY $MINIO_SECRET_KEY
# Initial mirror (this takes a while for large buckets)
mc mirror --watch aws/acme-user-uploads minio/user-uploads
# --watch keeps it running, syncing new objects in real-time
# Let it run until cutover day
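Before the cutover, it's worth a verification pass. mc diff lists objects that differ between the two targets; an empty result (plus a spot-check of object counts) is the green light:

# Compare source and destination once the mirror has caught up
mc diff aws/acme-user-uploads minio/user-uploads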
Message Queues: SQS to NATS JetStream
This one requires actual code changes. SQS and NATS have different semantics, different SDKs, and different failure modes. Plan for it.
// Before: SQS consumer
import { SQSClient, ReceiveMessageCommand, DeleteMessageCommand } from '@aws-sdk/client-sqs';
const sqs = new SQSClient({ region: 'us-east-1' });
async function pollMessages() {
const response = await sqs.send(new ReceiveMessageCommand({
QueueUrl: process.env.SQS_QUEUE_URL,
MaxNumberOfMessages: 10,
WaitTimeSeconds: 20,
}));
for (const message of response.Messages || []) {
await processMessage(JSON.parse(message.Body!));
await sqs.send(new DeleteMessageCommand({
QueueUrl: process.env.SQS_QUEUE_URL,
ReceiptHandle: message.ReceiptHandle!,
}));
}
}
// After: NATS JetStream consumer
import { connect, JetStreamClient, AckPolicy, DeliverPolicy } from 'nats';
const nc = await connect({
servers: process.env.NATS_URL || 'nats://nats.messaging.svc.cluster.local:4222',
});
const js = nc.jetstream();
const consumer = await js.consumers.get('EVENTS', 'platform-worker');
// This is genuinely nicer than SQS polling
for await (const msg of await consumer.consume()) {
try {
await processMessage(msg.json());
msg.ack();
} catch (err) {
// NAK with delay = retry after backoff
// (SQS visibility timeout equivalent, but more explicit)
msg.nak(5000); // Retry in 5 seconds
}
}
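The consumer above assumes the EVENTS stream and the durable platform-worker consumer already exist. A one-time setup sketch -- the subject names, retention, and delivery limits are assumptions to adjust to your own event taxonomy:

// jetstream-setup.ts -- create the stream and durable consumer the worker expects
import { connect, AckPolicy, RetentionPolicy } from 'nats';

const nc = await connect({
  servers: process.env.NATS_URL || 'nats://nats.messaging.svc.cluster.local:4222',
});
const jsm = await nc.jetstreamManager();

await jsm.streams.add({
  name: 'EVENTS',
  subjects: ['events.>'],                      // everything published under events.*
  retention: RetentionPolicy.Limits,
  max_age: 7 * 24 * 60 * 60 * 1_000_000_000,   // 7 days, in nanoseconds
});

await jsm.consumers.add('EVENTS', {
  durable_name: 'platform-worker',
  ack_policy: AckPolicy.Explicit,              // explicit ack/nak, like the consumer loop above
  max_deliver: 5,                              // stop redelivering after 5 attempts
});

await nc.drain();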
Phase 6: Validate Portability (Ongoing)
You're not portable until you've proven it. And you need to keep proving it. Portability is like fitness -- you lose it if you stop exercising.
The Quarterly Fire Drill
Every quarter, we deploy the entire stack to a fresh environment. Different cloud. Different region. Sometimes on-premises on a rack in the office that we affectionately call "the pain cabinet."
#!/usr/bin/env bash
# quarterly-portability-drill.sh
# If this script takes more than 4 hours, you're not as portable as you think.
set -euo pipefail
START_TIME=$(date +%s)
DRILL_ENV="portability-drill-$(date +%Y%m%d)"
echo "=== Quarterly Portability Drill ==="
echo "Environment: ${DRILL_ENV}"
echo "Started: $(date)"
echo ""
# Step 1: Provision fresh cluster
echo "[1/6] Provisioning Kubernetes cluster..."
# This could be EKS, GKE, AKS, or bare metal -- the whole point
# is that the rest of the script doesn't care
./scripts/provision-cluster.sh "${DRILL_ENV}"
# Step 2: Deploy foundation
echo "[2/6] Deploying platform foundation..."
helmfile -f helmfile-foundation.yaml sync
# Step 3: Deploy application
echo "[3/6] Deploying application stack..."
helmfile -f helmfile-application.yaml sync
# Step 4: Wait for readiness
echo "[4/6] Waiting for all pods to be ready..."
kubectl wait --for=condition=ready pod --all --all-namespaces --timeout=600s
# Step 5: Run test suite
echo "[5/6] Running integration tests..."
# Capture the exit code without tripping set -e, so results get recorded and teardown still runs
TEST_EXIT=0
./scripts/run-integration-tests.sh "${DRILL_ENV}" || TEST_EXIT=$?
# Record results
END_TIME=$(date +%s)
DURATION=$(( (END_TIME - START_TIME) / 60 ))
echo ""
echo "=== Drill Complete ==="
echo "Duration: ${DURATION} minutes"
echo "Test Result: $([ $TEST_EXIT -eq 0 ] && echo 'PASS' || echo 'FAIL')"
echo ""
if [ $DURATION -gt 240 ]; then
echo "WARNING: Drill took more than 4 hours. Investigate bottlenecks."
fi
# Step 6: Tear down (don't leave expensive clusters running)
echo "[6/6] Tearing down drill environment..."
./scripts/destroy-cluster.sh "${DRILL_ENV}"
What To Measure
| Metric | Target | Red Flag |
|---|---|---|
| Time to deploy from zero | < 4 hours | > 8 hours |
| Manual interventions required | 0 | > 3 |
| Test pass rate vs. primary | > 99% | < 95% |
| Data restoration time | < 30 minutes | > 2 hours |
| Unique cloud-specific workarounds | 0 | > 5 |
If your quarterly drill consistently hits these targets, congratulations: you're actually multi-cloud, not just multi-cloud on a slide deck.
The Graveyard of Failed Approaches
Before we give you the timeline, let's pour one out for the strategies that sounded great in planning meetings but crashed and burned in practice:
"Let's use Terraform to abstract away the clouds." Terraform is an infrastructure provisioning tool, not a portability layer. You still write cloud-specific resources. You just write them in HCL instead of YAML. The cloud lock-in is still there; it's just wearing a different hat.
"We'll use a cloud abstraction SDK." Libraries like Pulumi's cloud-agnostic resources or Apache Libcloud try to abstract cloud differences. In practice, the abstractions leak. The S3 abstraction doesn't support GCS-specific features you need. The RDS abstraction doesn't map cleanly to Cloud SQL options. You end up fighting the abstraction layer more than the cloud.
"Let's go multi-cloud from day one for the new project." Building for three clouds simultaneously triples your infrastructure work, triples your testing matrix, and slows feature development to a crawl. Build for one cloud on Kubernetes. Make it portable later. Portability is a refactoring exercise, not a greenfield architecture decision.
"We'll just use managed Kubernetes and we're multi-cloud." EKS, GKE, and AKS are all Kubernetes, but they differ in networking, storage, IAM, load balancing, and dozens of other details. "Runs on Kubernetes" does not mean "runs on any Kubernetes." Your Helm charts need to abstract these differences, and that takes real work.
"Containers are portable, so we're already multi-cloud." Your containers might run anywhere, but if they need RDS, SQS, Cognito, and S3 to function, they're about as portable as a desktop computer. Sure, you can technically move it, but you need to unplug a lot of cables first.
Timeline Reality Check
For a typical mid-sized SaaS (20-50 services, standard data stores, 3-5 engineers dedicated to migration):
| Phase | Duration | Can Overlap With | Video Game Equivalent |
|---|---|---|---|
| 0. Clarify the Why | 1-2 weeks | Nothing | Character creation screen |
| 1. Inventory | 2 weeks | Phase 2 start | Opening the map for the first time |
| 2. Foundation | 4 weeks | Phase 1 end | Building your base camp |
| 3. Portable Core | 6-8 weeks | -- | Tutorial level |
| 4. Auth Migration | 6-8 weeks | Phase 3 end | Mid-game boss |
| 5. Data Layer | 10-14 weeks | Phase 4 end | Final boss (multiple health bars) |
| 6. Validation | Ongoing | Everything | New Game+ |
Total: 7-10 months for meaningful portability. Anyone promising 3 months is either scoping a much smaller migration or selling you something.
The critical path is usually Phase 5. Everything else can be parallelized to some degree, but the data layer migration requires application changes, careful cutover coordination, and extensive testing. It's the phase where you earn your scars.
Team Structure That Actually Works
| Role | Headcount | Notes |
|---|---|---|
| Platform Team | 2-3 engineers | Kubernetes infra, GitOps, shared services. These people live in Phase 2 and stay there. |
| Migration Squad | 2-4 engineers (rotating) | Workload-by-workload migration. Rotate people through so knowledge spreads. |
| Embedded SRE | 1 engineer | Someone who knows both old and new systems. Your incident bridge between worlds. |
Anti-pattern we've seen kill migrations: Asking application teams to migrate their own services while maintaining feature velocity. That's like asking someone to rebuild the engine of a car while driving it at highway speed. Migration is a project. It needs dedicated humans.
The Payoff
Organizations that survive this journey report:
- 25-40% better pricing on cloud contract renewals (turns out "we can actually leave" is a powerful negotiating position)
- Faster enterprise sales because you can deploy wherever customers need you
- Reduced compliance overhead because data residency is solved at the architecture level, not the "please fill out this 200-question spreadsheet" level
- Genuine disaster recovery that you've actually tested, not theoretical DR that exists only in a PowerPoint from 2023
The investment is significant. The payoff is strategic. And the first time you deploy your entire stack to a second cloud provider in under 4 hours and everything just works -- that feeling makes the whole journey worth it.
Start with Phase 0. Be honest about the why. If the reason is real, the rest is just execution.
Difficult, sometimes painful, occasionally hilarious execution. But execution nonetheless.