Cloud Migration

The Multi-Cloud Migration Playbook: From Locked-In to Portable in 6 Phases

The battle-tested framework we use after watching too many multi-cloud initiatives crash and burn. Includes the mistakes everyone makes and how to avoid them.

Oikonex Team · Jan 12, 2026 · 24 min read

23 Migrations. 15 Failures. Here's What the Survivors Did.

We've been in the room for 23 multi-cloud migrations. We've watched senior leadership announce bold multi-cloud strategies at all-hands meetings. We've seen the Gantt charts. We've attended the kickoff meetings where everyone is optimistic and the timeline says "Q3."

Fifteen of those migrations failed. Not "failed" as in "took longer than expected." Failed as in: the initiative was quietly shelved, the Kubernetes clusters were decommissioned, and everyone went back to clicking around in the AWS console pretending it never happened.

Eight survived.

This playbook is reverse-engineered from those eight. It's the sequence of operations, decision points, and hard truths that separate the migrations that work from the ones that become cautionary tales whispered in the hallways at re:Invent.

Phase 0: Clarify the Why (Or Don't Start)

Multi-cloud is not free. It's not even cheap. It adds complexity to every layer of your stack -- infrastructure, deployment, networking, observability, incident response, hiring. You are choosing to make your life harder. You need a reason good enough to justify that.

Valid Reasons

  • Negotiating leverage: Your AWS contract renewal is in 6 months and you want to negotiate from a position of strength, not desperation. ("We could move to Azure" hits different when you actually can.)
  • Customer requirements: Your enterprise customers mandate Azure, GCP, or on-premises deployment. This isn't theoretical -- they're waving purchase orders.
  • Regulatory compliance: Data sovereignty laws require your infrastructure to exist in regions where your current provider doesn't operate. Or your industry regulator has opinions about single-vendor concentration.
  • Risk mitigation: The board is asking what happens if AWS has a multi-region outage. Again. (us-east-1 has entered the chat.)
  • Cost optimization: You've done the math, and specific workloads are genuinely cheaper on a different provider. Not "we read a blog post" cheaper, but "we ran a 3-month proof of concept and have the receipts" cheaper.

Invalid Reasons (Be Honest With Yourself)

  • "Multi-cloud is best practice." Says who? Multi-cloud is a tool. You don't use a chainsaw to butter toast just because chainsaws are powerful tools. Context matters.
  • "We might need it someday." You're paying the complexity tax today for an uncertain future benefit. That's called speculation. Put the money in an index fund instead.
  • Resume-Driven Development: Your engineers want Kubernetes experience. We get it. We've been there. But that's a training budget, not a migration budget. Buy them a Pluralsight subscription and a homelab. Don't refactor production because someone wants to put "Istio" on their LinkedIn.
  • "Our new CTO came from Google." Respectfully: what worked at Google's scale, with Google's engineering resources, and Google's custom infrastructure might not apply to your 12-person team running a B2B SaaS.

If you can't articulate a valid reason in one sentence that makes your CFO nod, stop here. Seriously. We've watched millions of dollars evaporate on migrations that didn't have a clear business driver. Go build features instead. Come back when the reason is real.

Phase 1: Inventory and Classify (Weeks 1-2)

This is the "opening your credit card statements" phase. You know it's going to be bad. You're doing it anyway because you can't fix what you don't understand.

Step 1: Export Everything

#!/usr/bin/env bash
# cloud-inventory.sh - Get a reality check on your AWS usage
# Run this and then sit down before reading the output.

set -euo pipefail

echo "=== AWS Resource Inventory ==="
echo "Account: $(aws sts get-caller-identity --query 'Account' --output text)"
echo "Generated: $(date -u +%Y-%m-%dT%H:%M:%SZ)"
echo ""

# Count resources by service
echo "--- Resources by Service ---"
aws resourcegroupstaggingapi get-resources \
  --query 'ResourceTagMappingList[].ResourceARN' \
  --output text | tr '\t' '\n' | cut -d: -f3 | sort | uniq -c | sort -rn

echo ""
echo "--- Lambda Functions ---"
aws lambda list-functions \
  --query 'Functions[].{Name:FunctionName,Runtime:Runtime,Memory:MemorySize}' \
  --output table

echo ""
echo "--- ECS Services ---"
for cluster in $(aws ecs list-clusters --query 'clusterArns[]' --output text); do
  echo "Cluster: ${cluster##*/}"
  aws ecs list-services --cluster "${cluster}" \
    --query 'serviceArns[]' --output text | tr '\t' '\n'
done

echo ""
echo "--- RDS Instances ---"
aws rds describe-db-instances \
  --query 'DBInstances[].{Name:DBInstanceIdentifier,Engine:Engine,Size:DBInstanceClass,MultiAZ:MultiAZ,Storage:AllocatedStorage}' \
  --output table

echo ""
echo "--- S3 Buckets ---"
aws s3api list-buckets --query 'Buckets[].Name' --output text | tr '\t' '\n'
echo ""
echo "Total buckets: $(aws s3api list-buckets --query 'length(Buckets[])' --output text)"

# The number that matters most
echo ""
echo "--- Monthly Cost (Last 3 Months) ---"
# Note: `date -d` is GNU date; on macOS use `date -v-3m` for the start date
aws ce get-cost-and-usage \
  --time-period Start=$(date -u -d "3 months ago" +%Y-%m-01),End=$(date -u +%Y-%m-01) \
  --granularity MONTHLY \
  --metrics BlendedCost \
  --query 'ResultsByTime[].{Period:TimePeriod.Start,Cost:Total.BlendedCost.Amount}' \
  --output table

The first time we ran this for a client, the room went silent. They had 847 Lambda functions. Eight hundred and forty-seven. Someone had set up an architecture where every API endpoint was its own Lambda. It was like discovering your house is actually 847 tiny houses duct-taped together.

Step 2: Classify Every Workload

Create a spreadsheet. Yes, a spreadsheet. Not a Notion database. Not a Jira board. A spreadsheet, because you need to sort and filter and have uncomfortable conversations about prioritization, and Google Sheets handles that better than anything else.

| Workload | Cloud Service | Lock-in Depth | Data Sensitivity | Portability Effort | Business Value | Migration Phase |
|---|---|---|---|---|---|---|
| User API | ECS + ALB | Low | High | Low | Critical | Phase 3 |
| Auth | Cognito | Extreme | High | High | Critical | Phase 4 |
| Billing | Lambda + DynamoDB | High | High | High | Critical | Phase 5 |
| Analytics Pipeline | Kinesis + Athena + S3 | Extreme | Medium | Very High | Medium | Phase 5 or Never |
| Email Service | SES + Lambda | Medium | Low | Medium | Medium | Phase 3 |
| Image Processing | Lambda + S3 | Medium | Low | Medium | Low | Phase 3 |
| Search | OpenSearch (managed) | Medium | Medium | Medium | High | Phase 3 |
| Cron Jobs | EventBridge + Lambda | Medium | Low | Low | Low | Phase 3 |

That "or Never" next to the analytics pipeline? That's intentional. Not everything needs to be portable. Some workloads are deeply entangled with proprietary services, have low business criticality, and would cost more to migrate than they're worth. It's okay to leave them. Multi-cloud doesn't mean zero cloud. It means choice.

Step 3: Identify Your Portable Core

Look for the intersection of:

  • High business value
  • Low-to-medium portability effort
  • Stateless (or uses standard databases)

These are your Phase 3 candidates. You want early wins. Quick wins build momentum, momentum builds organizational support, and organizational support is what keeps your migration alive when Phase 4 gets hard. (Phase 4 always gets hard.)
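If your classification sheet exports cleanly to JSON, the filter is mechanical. A minimal sketch -- the field names and thresholds below are our own, matched to the spreadsheet columns, not a standard:

```typescript
// Sketch: filter the classification sheet down to Phase 3 candidates.
type Effort = 'Low' | 'Medium' | 'High' | 'Very High';
type Value = 'Low' | 'Medium' | 'High' | 'Critical';

interface Workload {
  name: string;
  effort: Effort;       // Portability Effort column
  value: Value;         // Business Value column
  stateless: boolean;   // or backed by a standard database
}

function phase3Candidates(workloads: Workload[]): string[] {
  return workloads
    .filter(w =>
      (w.value === 'High' || w.value === 'Critical') &&
      (w.effort === 'Low' || w.effort === 'Medium') &&
      w.stateless,
    )
    .map(w => w.name);
}

// A few rows from the classification table
const sheet: Workload[] = [
  { name: 'User API', effort: 'Low', value: 'Critical', stateless: true },
  { name: 'Auth', effort: 'High', value: 'Critical', stateless: false },
  { name: 'Cron Jobs', effort: 'Low', value: 'Low', stateless: true },
];

console.log(phase3Candidates(sheet)); // → [ 'User API' ]
```

Twenty lines of code won't make the prioritization fight go away, but it does make the criteria explicit and arguable.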

Phase 2: Build the Foundation (Weeks 3-6)

Before you migrate a single workload, you need somewhere for it to land. Think of this phase as building the airport before the planes arrive.

Step 1: Choose Your Kubernetes Distribution

| If You Need... | Consider... | Our Take |
|---|---|---|
| Managed simplicity | EKS, GKE, AKS | Start here. Fight fewer battles at once. |
| On-premises support | Rancher (RKE2), OpenShift, Tanzu | RKE2 is our go-to for on-prem |
| Edge/air-gapped | k3s, RKE2 | k3s is absurdly good for its size |
| Maximum portability | Vanilla Kubernetes | Only if you enjoy suffering |

Our actual recommendation: Start with a managed Kubernetes service in your primary cloud. Yes, that's still AWS. The goal of Phase 2 isn't to leave AWS -- it's to build the muscle memory of deploying to Kubernetes. You'll leave AWS later, from a position of competence instead of panic.

The critical rule: do not use cloud-specific Kubernetes features. No AWS ALB Ingress Controller. No GKE Config Connector. No Azure Workload Identity (yet). If it has your cloud provider's name in it, don't use it in Phase 2.

Step 2: Set Up GitOps

Your infrastructure is code now. All of it. No more clicking in consoles. No more "I'll just apply this manifest real quick." Everything goes through Git. Everything gets reviewed. Everything is auditable.

We use ArgoCD. Here's a real application definition from a migration we ran:

# argocd/applications/platform-services.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: platform-services
  namespace: argocd
  # This finalizer ensures ArgoCD cleans up resources when the app is deleted
  finalizers:
    - resources-finalizer.argocd.argoproj.io
spec:
  project: default
  source:
    repoURL: https://github.com/acme-corp/k8s-manifests.git
    path: clusters/production/platform-services
    targetRevision: main
    # Helm values from Git -- single source of truth
    helm:
      valueFiles:
        - values.yaml
        - values-production.yaml
  destination:
    server: https://kubernetes.default.svc
    namespace: platform
  syncPolicy:
    automated:
      prune: true       # Delete resources removed from Git
      selfHeal: true     # Revert manual changes
    syncOptions:
      - CreateNamespace=true
      - ServerSideApply=true  # Handles large CRDs better
    retry:
      limit: 5
      backoff:
        duration: 5s
        factor: 2
        maxDuration: 3m
---
# argocd/projects/platform.yaml
# ArgoCD Projects provide RBAC for what can be deployed where
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: platform
  namespace: argocd
spec:
  description: Platform services
  sourceRepos:
    - 'https://github.com/acme-corp/k8s-manifests.git'
    - 'https://charts.bitnami.com/bitnami'
  destinations:
    - namespace: 'platform'
      server: https://kubernetes.default.svc
    - namespace: 'platform-*'
      server: https://kubernetes.default.svc
  # Don't let anyone deploy cluster-scoped resources from this project
  clusterResourceWhitelist: []
  namespaceResourceWhitelist:
    - group: '*'
      kind: '*'

Step 3: Deploy the Portable Services Baseline

Before any application workloads arrive, deploy these:

# helmfile.yaml -- our portable services baseline
# Every service here runs identically on any Kubernetes cluster
repositories:
  - name: ingress-nginx
    url: https://kubernetes.github.io/ingress-nginx
  - name: jetstack
    url: https://charts.jetstack.io
  - name: prometheus-community
    url: https://prometheus-community.github.io/helm-charts
  - name: grafana
    url: https://grafana.github.io/helm-charts

releases:
  # Ingress -- NOT the AWS ALB controller
  - name: ingress-nginx
    namespace: ingress-nginx
    chart: ingress-nginx/ingress-nginx
    version: 4.9.0
    values:
      - controller:
          replicaCount: 2
          metrics:
            enabled: true

  # TLS certificate management
  - name: cert-manager
    namespace: cert-manager
    chart: jetstack/cert-manager
    version: 1.14.0
    values:
      - installCRDs: true

  # Monitoring -- Prometheus + Grafana
  - name: kube-prometheus-stack
    namespace: monitoring
    chart: prometheus-community/kube-prometheus-stack
    version: 56.0.0
    values:
      - grafana:
          adminPassword: "changeme"  # Use a secret in production!
          persistence:
            enabled: true
            size: 10Gi

  # Logging -- Loki
  - name: loki
    namespace: logging
    chart: grafana/loki-stack
    version: 2.10.0
    values:
      - loki:
          persistence:
            enabled: true
            size: 50Gi
        promtail:
          enabled: true

This is your "deployment target." When workloads arrive in Phase 3, they land on a platform that already has ingress, TLS, monitoring, and logging. No workload-specific infrastructure decisions needed.

Phase 3: Migrate the Portable Core (Weeks 7-12)

This is the tutorial level. The enemies are easy, the mechanics are forgiving, and you're building the skills you'll need for the boss fights later. Don't skip it. Don't rush it. Don't let anyone convince you to "just jump straight to the database migration because that's the hard part." That's like saying "let's skip the tutorial and go straight to the final boss." You will die. Metaphorically.

The Migration Pattern

For every workload, the same dance:

  1. Containerize -- Dockerfile, build pipeline, push to registry
  2. Helm chart -- Deployment, Service, ConfigMap, health checks
  3. Deploy side-by-side -- New K8s deployment runs alongside old ECS/Lambda
  4. Mirror traffic -- Send a copy of production traffic to the new deployment
  5. Compare -- Are the responses identical? Is latency comparable? Error rates?
  6. Cut over -- Update DNS or load balancer
  7. Bake -- Run both for 2 weeks. Verify everything.
  8. Decommission -- Delete the old deployment

Real Example: ECS to Kubernetes

Here's an actual migration we performed for a user-facing API. Before and after, with the warts included.

Before: ECS Task Definition

{
  "family": "user-api",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "512",
  "memory": "1024",
  "containerDefinitions": [{
    "name": "user-api",
    "image": "123456789.dkr.ecr.us-east-1.amazonaws.com/user-api:latest",
    "portMappings": [{"containerPort": 8080}],
    "environment": [
      {"name": "DATABASE_URL", "value": "postgresql://rds-endpoint:5432/users"},
      {"name": "REDIS_URL", "value": "redis://elasticache-endpoint:6379"},
      {"name": "AWS_REGION", "value": "us-east-1"},
      {"name": "S3_BUCKET", "value": "acme-user-uploads"}
    ],
    "logConfiguration": {
      "logDriver": "awslogs",
      "options": {
        "awslogs-group": "/ecs/user-api",
        "awslogs-region": "us-east-1",
        "awslogs-stream-prefix": "ecs"
      }
    }
  }]
}

Notice the land mines: hardcoded AWS region, RDS endpoint in an environment variable (not a secret!), CloudWatch-specific logging, :latest tag. Every one of these is a portability problem and, frankly, a best-practices problem.

After: Kubernetes Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: user-api
  labels:
    app.kubernetes.io/name: user-api
    app.kubernetes.io/version: "1.4.2"
    app.kubernetes.io/managed-by: helm
spec:
  replicas: 3
  selector:
    matchLabels:
      app.kubernetes.io/name: user-api
  template:
    metadata:
      labels:
        app.kubernetes.io/name: user-api
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
        prometheus.io/path: "/metrics"
    spec:
      serviceAccountName: user-api
      securityContext:
        runAsNonRoot: true
        runAsUser: 1001
        fsGroup: 1001
      containers:
      - name: user-api
        image: registry.internal.acme.com/user-api:1.4.2
        ports:
        - name: http
          containerPort: 8080
          protocol: TCP
        env:
        - name: DATABASE_URL
          valueFrom:
            secretKeyRef:
              name: user-api-db
              key: url
        - name: REDIS_URL
          valueFrom:
            secretKeyRef:
              name: user-api-redis
              key: url
        - name: S3_ENDPOINT
          value: "https://minio.storage.svc.cluster.local"
        - name: S3_BUCKET
          value: "user-uploads"
        resources:
          requests:
            cpu: 250m
            memory: 512Mi
          limits:
            cpu: 500m
            memory: 1Gi
        livenessProbe:
          httpGet:
            path: /healthz
            port: http
          initialDelaySeconds: 15
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /ready
            port: http
          initialDelaySeconds: 5
          periodSeconds: 5
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app.kubernetes.io/name: user-api

What changed and why:

  • Secrets are in Kubernetes Secrets, not environment variables in a task definition
  • S3 endpoint is configurable (points to MinIO internally, could point to S3, GCS, or anything S3-compatible)
  • Specific image tag, not :latest
  • Non-root security context
  • Resource requests AND limits
  • Prometheus annotations for automatic scraping
  • Topology spread constraints so pods don't all land on one node
  • Standard Kubernetes labels (app.kubernetes.io/*)

This migration took us 3 days for the Kubernetes manifests and about 2 weeks of parallel running before we were confident enough to cut over.

Phase 4: Replace Proprietary Auth and Identity (Weeks 13-18)

This is the phase where everyone says "how hard can it be?" and then disappears for three months.

Authentication is the load-bearing wall of your application. Every service touches it. Every user flow depends on it. Every session, every token, every permission check runs through it. And you're going to rip it out and replace it with something else. While the application is running. In production.

It's like performing open-heart surgery on a patient who's running a marathon and refusing to slow down.

Common Migrations

| From | To | Difficulty | Our Honest Assessment |
|---|---|---|---|
| AWS Cognito | Keycloak | Hard | Keycloak is powerful but configuring it correctly is a full-time job |
| AWS Cognito | Zitadel | Medium | Newer, cleaner API, less ecosystem support |
| Firebase Auth | Keycloak | Hard | Firebase's client SDK integration runs deep |
| Auth0 (cloud) | Keycloak | Medium | Auth0's OIDC compliance actually makes this easier |

The Strategy That Works

       ┌────────────────┐
       │   API Gateway   │
       │  / Auth Proxy   │
       └───────┬────────┘
               │
       ┌───────▼────────┐
       │  Auth Facade    │ ← Routes based on user cohort
       └───┬────────┬───┘
           │        │
   ┌───────▼──┐ ┌──▼───────┐
   │ Cognito  │ │ Keycloak │
   │ (legacy) │ │  (new)   │
   └──────────┘ └──────────┘
  1. Deploy Keycloak alongside Cognito. Both live. Both work.
  2. Build an auth facade that inspects tokens and routes to the correct provider. New signups go to Keycloak. Existing sessions continue on Cognito.
  3. Migrate users in waves:
    • Wave 1: Internal team (your own employees). They're forgiving.
    • Wave 2: New signups. They don't know or care what the auth backend is.
    • Wave 3: Active users, prompted to "re-verify" on next login (which silently migrates them).
    • Wave 4: Dormant users. Migrated in bulk. If they come back and something's broken, they probably forgot their password anyway.
  4. Keep Cognito in read-only mode for 90 days after the last migration. Someone will find an edge case. Someone always finds an edge case.
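The facade's routing decision itself can be tiny. A sketch in TypeScript, assuming legacy tokens carry a Cognito issuer in their `iss` claim -- note that we only decode the payload to pick a backend; the chosen provider still validates the signature, so nothing security-relevant trusts this peek:

```typescript
// Sketch: route a request to the right auth provider based on the
// (unverified) issuer claim in its JWT. Issuer matching is an assumption
// based on Cognito's issuer URL format.
function issuerOf(jwt: string): string | null {
  const parts = jwt.split('.');
  if (parts.length !== 3) return null;
  try {
    const payload = JSON.parse(
      Buffer.from(parts[1], 'base64url').toString('utf8'),
    );
    return typeof payload.iss === 'string' ? payload.iss : null;
  } catch {
    return null; // malformed payload
  }
}

function routeFor(jwt: string): 'cognito' | 'keycloak' {
  const iss = issuerOf(jwt) ?? '';
  // Legacy sessions carry Cognito-issued tokens; everything else,
  // including new signups and malformed tokens, goes to Keycloak.
  return iss.includes('cognito-idp') ? 'cognito' : 'keycloak';
}
```

Defaulting unknowns to the new provider means the facade naturally shrinks the Cognito cohort over time instead of growing it.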

The Ugly Truth

We budgeted 4 weeks for an auth migration once. It took 11. Here's what we didn't anticipate:

  • Social login callback URLs all pointed to Cognito-specific endpoints. Every OAuth app registration (Google, GitHub, Microsoft) needed updating, and some of those require business verification that takes weeks.
  • Password hashing algorithms differed between providers. We couldn't import password hashes directly. Users had to reset passwords or "re-verify" (which is a nicer way of saying "reset your password but we'll pretend it's for security").
  • Custom auth flows (MFA enrollment, magic links, passwordless) had to be reimplemented from scratch in Keycloak.
  • JWT token format differences broke three downstream services that were parsing tokens instead of validating them. (Don't parse tokens. Validate them. This is a PSA.)

Budget 2x whatever you think auth migration will take. Then add a buffer.

Phase 5: Migrate the Data Layer (Weeks 19-30)

Here be dragons.

If Phase 3 was the tutorial and Phase 4 was the mid-game boss, Phase 5 is the final boss. The one with multiple health bars. The one where the floor drops out halfway through and the music changes.

The data layer is hard because:

  • Data has gravity. Moving terabytes is slow and expensive.
  • Downtime must be near-zero. Your SLA doesn't care about your migration timeline.
  • Data loss is unacceptable. Not "minimize data loss." Zero. Data. Loss.
  • Every application that reads or writes data must be updated simultaneously or have a compatibility layer.

PostgreSQL: RDS to CloudNativePG

CloudNativePG is the operator we trust for PostgreSQL on Kubernetes. It manages the full lifecycle: provisioning, failover, backups, connection pooling, monitoring. Here's the real migration setup:

# cloudnativepg-cluster.yaml
# This is the target cluster that will receive replicated data from RDS
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: platform-db
  namespace: database
spec:
  instances: 3  # 1 primary, 2 replicas
  imageName: ghcr.io/cloudnative-pg/postgresql:16.1

  postgresql:
    parameters:
      max_connections: "200"
      shared_buffers: "512MB"
      effective_cache_size: "1536MB"
      work_mem: "4MB"
      maintenance_work_mem: "128MB"
      wal_level: "logical"  # Required for logical replication from RDS
      max_wal_senders: "10"
      max_replication_slots: "10"

  storage:
    size: 100Gi
    storageClass: gp3-encrypted  # Use whatever your cluster offers

  # Automated backups to S3-compatible storage
  backup:
    barmanObjectStore:
      destinationPath: "s3://acme-db-backups/platform/"
      endpointURL: "https://minio.storage.svc.cluster.local"
      s3Credentials:
        accessKeyId:
          name: db-backup-creds
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: db-backup-creds
          key: SECRET_ACCESS_KEY
    retentionPolicy: "30d"

  # Extra Service for pooled traffic. Note: the pooling itself comes from
  # CloudNativePG's separate Pooler resource (which runs PgBouncer); this
  # section only publishes the Service. Without pooling, 200 application
  # pods each opening 10 connections = 2000 connections, and PostgreSQL
  # will not be happy about that.
  managed:
    services:
      additional:
        - selectorType: rw
          serviceTemplate:
            metadata:
              name: platform-db-rw-pooled
            spec:
              type: ClusterIP

  monitoring:
    enablePodMonitor: true  # Prometheus scraping

The replication dance:

-- Step 1: On the RDS source, enable logical replication
-- (Requires rds.logical_replication = 1 in parameter group)

-- Step 2: Create a publication on RDS
CREATE PUBLICATION migration_pub FOR ALL TABLES;

-- Step 3: On the CloudNativePG target, create a subscription
CREATE SUBSCRIPTION migration_sub
  CONNECTION 'host=your-rds-endpoint.rds.amazonaws.com port=5432
              dbname=platform user=replication_user password=xxx
              sslmode=require'
  PUBLICATION migration_pub
  WITH (
    copy_data = true,    -- Initial data copy
    create_slot = true,  -- Create replication slot on source
    enabled = true       -- Start replicating immediately
  );

-- Step 4: Monitor replication lag
SELECT
  subname,
  received_lsn,
  latest_end_lsn,
  latest_end_time
FROM pg_stat_subscription;

-- Step 5: When lag is consistently zero, schedule the cutover
-- The cutover window:
--   1. Stop application writes (maintenance mode)
--   2. Wait for final replication sync (usually seconds)
--   3. Drop the subscription
--   4. Update application connection strings
--   5. Resume application writes
-- Total downtime: 30-120 seconds if you've practiced

We've done this cutover six times now. The shortest was 28 seconds. The longest was 4 minutes because someone's VPN disconnected mid-migration and we had to wait for them to reconnect to verify the final sync. Always have a backup communication channel. Always.
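One thing that kept our cutovers honest: an explicit gate instead of someone eyeballing the lag query. A sketch -- the zero-lag-for-N-samples rule is our own heuristic, not a CloudNativePG feature, and the samples would come from polling `pg_stat_subscription`:

```typescript
// Sketch: decide whether it's safe to schedule the cutover window,
// given recent replication-lag samples (bytes behind, newest last).
function readyToCutOver(lagBytes: number[], samplesNeeded = 10): boolean {
  if (lagBytes.length < samplesNeeded) return false;
  // Require the last N consecutive samples to all show zero lag
  return lagBytes.slice(-samplesNeeded).every(lag => lag === 0);
}

console.log(readyToCutOver([1024, 512, 0, 0, 0], 3)); // → true
console.log(readyToCutOver([0, 0, 4096], 3));         // → false
```

The point isn't the ten lines of code; it's that "are we ready?" becomes a yes/no answer everyone on the bridge call agrees on in advance.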

Object Storage: S3 to MinIO

This one is comparatively painless because MinIO speaks the S3 API. Your application code barely changes -- usually just adding an endpoint URL configuration:

// Before: Hardcoded to AWS S3
const s3 = new S3Client({
  region: 'us-east-1',
});

// After: Configurable endpoint
const s3 = new S3Client({
  region: process.env.S3_REGION || 'us-east-1',
  endpoint: process.env.S3_ENDPOINT || undefined,  // undefined = real AWS S3
  forcePathStyle: process.env.S3_FORCE_PATH_STYLE === 'true',  // MinIO needs this
  credentials: {
    accessKeyId: process.env.S3_ACCESS_KEY_ID!,
    secretAccessKey: process.env.S3_SECRET_ACCESS_KEY!,
  },
});

The data migration is just mc mirror:

# mc = MinIO Client
# Mirror S3 bucket to MinIO (can run continuously until cutover)
mc alias set aws https://s3.amazonaws.com $AWS_ACCESS_KEY $AWS_SECRET_KEY
mc alias set minio https://minio.internal.acme.com $MINIO_ACCESS_KEY $MINIO_SECRET_KEY

# Initial mirror (this takes a while for large buckets)
mc mirror --watch aws/acme-user-uploads minio/user-uploads

# --watch keeps it running, syncing new objects in real-time
# Let it run until cutover day
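Before cutover day, verify the mirror actually converged. A sketch of the drift check we'd run against listings from both sides -- the `ObjectInfo` shape is an assumption; adapt it to whatever `mc ls --json` or the S3 SDK gives you:

```typescript
// Sketch: compare object listings from source (S3) and target (MinIO)
// and report keys that are missing or differ in size.
interface ObjectInfo {
  key: string;
  size: number;
}

function mirrorDrift(
  source: ObjectInfo[],
  target: ObjectInfo[],
): { missing: string[]; sizeMismatch: string[] } {
  const targetByKey = new Map(target.map(o => [o.key, o.size]));
  const missing: string[] = [];
  const sizeMismatch: string[] = [];
  for (const obj of source) {
    const size = targetByKey.get(obj.key);
    if (size === undefined) missing.push(obj.key);
    else if (size !== obj.size) sizeMismatch.push(obj.key);
  }
  return { missing, sizeMismatch };
}

const drift = mirrorDrift(
  [{ key: 'avatars/1.png', size: 100 }, { key: 'avatars/2.png', size: 200 }],
  [{ key: 'avatars/1.png', size: 100 }],
);
console.log(drift); // → { missing: [ 'avatars/2.png' ], sizeMismatch: [] }
```

For buckets with millions of objects, compare ETags on a random sample instead of full listings; size-only comparison misses corrupted transfers.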

Message Queues: SQS to NATS JetStream

This one requires actual code changes. SQS and NATS have different semantics, different SDKs, and different failure modes. Plan for it.

// Before: SQS consumer
import { SQSClient, ReceiveMessageCommand, DeleteMessageCommand } from '@aws-sdk/client-sqs';

const sqs = new SQSClient({ region: 'us-east-1' });

async function pollMessages() {
  const response = await sqs.send(new ReceiveMessageCommand({
    QueueUrl: process.env.SQS_QUEUE_URL,
    MaxNumberOfMessages: 10,
    WaitTimeSeconds: 20,
  }));

  for (const message of response.Messages || []) {
    await processMessage(JSON.parse(message.Body!));
    await sqs.send(new DeleteMessageCommand({
      QueueUrl: process.env.SQS_QUEUE_URL,
      ReceiptHandle: message.ReceiptHandle!,
    }));
  }
}

// After: NATS JetStream consumer
import { connect } from 'nats';

const nc = await connect({
  servers: process.env.NATS_URL || 'nats://nats.messaging.svc.cluster.local:4222',
});
const js = nc.jetstream();

const consumer = await js.consumers.get('EVENTS', 'platform-worker');

// This is genuinely nicer than SQS polling
for await (const msg of await consumer.consume()) {
  try {
    await processMessage(msg.json());
    msg.ack();
  } catch (err) {
    // NAK with delay = retry after backoff
    // (SQS visibility timeout equivalent, but more explicit)
    msg.nak(5000); // Retry in 5 seconds
  }
}
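During the overlap window you'll likely want producers feeding both systems so either consumer fleet can drain events. A sketch of a dual-publish wrapper -- the narrowed client interfaces here are ours, standing in for the real SQS and JetStream clients:

```typescript
// Sketch: publish every event to both SQS (legacy) and JetStream (new)
// during the migration overlap.
interface LegacyQueue {
  send(body: string): Promise<void>; // wraps SQS SendMessage
}
interface Stream {
  publish(subject: string, body: string): Promise<void>; // wraps JetStream publish
}

async function dualPublish(
  sqs: LegacyQueue,
  js: Stream,
  subject: string,
  event: object,
): Promise<void> {
  const body = JSON.stringify(event);
  // NATS first: if the new system can't take writes, we want to know
  // loudly while SQS is still feeding the legacy consumers.
  await js.publish(subject, body);
  await sqs.send(body);
}
```

Make consumers idempotent before you turn this on: during the overlap, the same event exists in both systems, and eventually something will process it twice.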

Phase 6: Validate Portability (Ongoing)

You're not portable until you've proven it. And you need to keep proving it. Portability is like fitness -- you lose it if you stop exercising.

The Quarterly Fire Drill

Every quarter, we deploy the entire stack to a fresh environment. Different cloud. Different region. Sometimes on-premises on a rack in the office that we affectionately call "the pain cabinet."

#!/usr/bin/env bash
# quarterly-portability-drill.sh
# If this script takes more than 4 hours, you're not as portable as you think.

set -euo pipefail

START_TIME=$(date +%s)
DRILL_ENV="portability-drill-$(date +%Y%m%d)"

echo "=== Quarterly Portability Drill ==="
echo "Environment: ${DRILL_ENV}"
echo "Started: $(date)"
echo ""

# Step 1: Provision fresh cluster
echo "[1/6] Provisioning Kubernetes cluster..."
# This could be EKS, GKE, AKS, or bare metal -- the whole point
# is that the rest of the script doesn't care
./scripts/provision-cluster.sh "${DRILL_ENV}"

# Step 2: Deploy foundation
echo "[2/6] Deploying platform foundation..."
helmfile -f helmfile-foundation.yaml sync

# Step 3: Deploy application
echo "[3/6] Deploying application stack..."
helmfile -f helmfile-application.yaml sync

# Step 4: Wait for readiness
echo "[4/6] Waiting for all pods to be ready..."
kubectl wait --for=condition=ready pod --all --all-namespaces --timeout=600s

# Step 5: Run test suite
echo "[5/6] Running integration tests..."
# set -e would abort the script before we could record a failure,
# so capture the exit code explicitly
TEST_EXIT=0
./scripts/run-integration-tests.sh "${DRILL_ENV}" || TEST_EXIT=$?

# Step 6: Record results
END_TIME=$(date +%s)
DURATION=$(( (END_TIME - START_TIME) / 60 ))

echo ""
echo "=== Drill Complete ==="
echo "Duration: ${DURATION} minutes"
echo "Test Result: $([ $TEST_EXIT -eq 0 ] && echo 'PASS' || echo 'FAIL')"
echo ""

if [ $DURATION -gt 240 ]; then
  echo "WARNING: Drill took more than 4 hours. Investigate bottlenecks."
fi

# Step 7: Tear down (don't leave expensive clusters running)
echo "Tearing down drill environment..."
./scripts/destroy-cluster.sh "${DRILL_ENV}"

What To Measure

| Metric | Target | Red Flag |
|---|---|---|
| Time to deploy from zero | < 4 hours | > 8 hours |
| Manual interventions required | 0 | > 3 |
| Test pass rate vs. primary | > 99% | < 95% |
| Data restoration time | < 30 minutes | > 2 hours |
| Unique cloud-specific workarounds | 0 | > 5 |

If your quarterly drill consistently hits these targets, congratulations: you're actually multi-cloud, not just multi-cloud on a slide deck.

The Graveyard of Failed Approaches

Before we give you the timeline, let's pour one out for the strategies that sounded great in planning meetings but crashed and burned in practice:

"Let's use Terraform to abstract away the clouds." Terraform is an infrastructure provisioning tool, not a portability layer. You still write cloud-specific resources. You just write them in HCL instead of YAML. The cloud lock-in is still there; it's just wearing a different hat.

"We'll use a cloud abstraction SDK." Libraries like Pulumi's cloud-agnostic resources or Apache Libcloud try to abstract cloud differences. In practice, the abstractions leak. The S3 abstraction doesn't support GCS-specific features you need. The RDS abstraction doesn't map cleanly to Cloud SQL options. You end up fighting the abstraction layer more than the cloud.

"Let's go multi-cloud from day one for the new project." Building for three clouds simultaneously triples your infrastructure work, triples your testing matrix, and slows feature development to a crawl. Build for one cloud on Kubernetes. Make it portable later. Portability is a refactoring exercise, not a greenfield architecture decision.

"We'll just use managed Kubernetes and we're multi-cloud." EKS, GKE, and AKS are all Kubernetes, but they differ in networking, storage, IAM, load balancing, and dozens of other details. "Runs on Kubernetes" does not mean "runs on any Kubernetes." Your Helm charts need to abstract these differences, and that takes real work.

"Containers are portable, so we're already multi-cloud." Your containers might run anywhere, but if they need RDS, SQS, Cognito, and S3 to function, they're about as portable as a desktop computer. Sure, you can technically move it, but you need to unplug a lot of cables first.

Timeline Reality Check

For a typical mid-sized SaaS (20-50 services, standard data stores, 3-5 engineers dedicated to migration):

| Phase | Duration | Can Overlap With | Video Game Equivalent |
|---|---|---|---|
| 0. Clarify the Why | 1-2 weeks | Nothing | Character creation screen |
| 1. Inventory | 2 weeks | Phase 2 start | Opening the map for the first time |
| 2. Foundation | 4 weeks | Phase 1 end | Building your base camp |
| 3. Portable Core | 6-8 weeks | -- | Tutorial level |
| 4. Auth Migration | 6-8 weeks | Phase 3 end | Mid-game boss |
| 5. Data Layer | 10-14 weeks | Phase 4 end | Final boss (multiple health bars) |
| 6. Validation | Ongoing | Everything | New Game+ |

Total: 7-10 months for meaningful portability. Anyone promising 3 months is either scoping a much smaller migration or selling you something.

The critical path is usually Phase 5. Everything else can be parallelized to some degree, but the data layer migration requires application changes, careful cutover coordination, and extensive testing. It's the phase where you earn your scars.

Team Structure That Actually Works

| Role | Headcount | Notes |
|---|---|---|
| Platform Team | 2-3 engineers | Kubernetes infra, GitOps, shared services. These people live in Phase 2 and stay there. |
| Migration Squad | 2-4 engineers (rotating) | Workload-by-workload migration. Rotate people through so knowledge spreads. |
| Embedded SRE | 1 engineer | Someone who knows both old and new systems. Your incident bridge between worlds. |

Anti-pattern we've seen kill migrations: Asking application teams to migrate their own services while maintaining feature velocity. That's like asking someone to rebuild the engine of a car while driving it at highway speed. Migration is a project. It needs dedicated humans.

The Payoff

Organizations that survive this journey report:

  • 25-40% better pricing on cloud contract renewals (turns out "we can actually leave" is a powerful negotiating position)
  • Faster enterprise sales because you can deploy wherever customers need you
  • Reduced compliance overhead because data residency is solved at the architecture level, not the "please fill out this 200-question spreadsheet" level
  • Genuine disaster recovery that you've actually tested, not theoretical DR that exists only in a PowerPoint from 2023

The investment is significant. The payoff is strategic. And the first time you deploy your entire stack to a second cloud provider in under 4 hours and everything just works -- that feeling makes the whole journey worth it.

Start with Phase 0. Be honest about the why. If the reason is real, the rest is just execution.

Difficult, sometimes painful, occasionally hilarious execution. But execution nonetheless.

Cloud Migration · Multi-Cloud · Kubernetes
