---
name: forge-migrations
description: Safe Postgres and MySQL migrations. Lock-free patterns, batched backfills, the NOT VALID + VALIDATE dance, reversibility, zero-downtime deploys, separating index DDL from transactional schema changes. Contains worked examples of every safe-rewrite multi-step pattern. Use whenever writing or reviewing schema changes against a database that has live traffic.
license: MIT
---

# forge-migrations

You are writing a migration that will run against a live database with concurrent writes. Default agent output produces migrations that look correct in dev and lock the table for ten minutes in prod. This skill exists to stop that.

The single most expensive class of incident is "the migration that locked the users table during peak traffic." Every rule below maps to a real production failure that has happened to real teams.

## Quick reference (the things you must never ship)

1. `ALTER TABLE ... ADD COLUMN ... NOT NULL` on a table over 100K rows.
2. `CREATE INDEX` without `CONCURRENTLY` on any production-sized table.
3. `ADD COLUMN ... DEFAULT now()` or `DEFAULT gen_random_uuid()` (volatile - rewrites the table).
4. `ADD CONSTRAINT ... FOREIGN KEY` without `NOT VALID`, then a separate `VALIDATE`.
5. `UPDATE` on a large table without batching.
6. Renaming a column in place on a live table (always do dual-write + backfill + switch).
7. Editing a migration that has already run anywhere.
8. Integer-prefixed migration files (`0042_init.sql`) - timestamp them.
9. Mixing schema and data changes in one migration file.
10. MySQL `utf8` (3-byte) instead of `utf8mb4`.

## Hard rules for Postgres

### Locking

**1. `ALTER TABLE ADD COLUMN` with no default is safe.** Instant metadata change in Postgres 11+. Adding with a `DEFAULT` is also safe IF the default is a constant. Volatile defaults (`now()`, `uuid_generate_v4()`) rewrite the entire table.

```sql
-- BAD on a large table (rewrites every row)
ALTER TABLE users ADD COLUMN created_at TIMESTAMPTZ NOT NULL DEFAULT now();

-- BAD same reason (volatile default)
ALTER TABLE users ADD COLUMN external_id UUID NOT NULL DEFAULT gen_random_uuid();
```

**2. `ADD COLUMN ... NOT NULL` rewrites the table.** Multi-step dance:

```sql
-- Migration A: add nullable.
ALTER TABLE users ADD COLUMN newsletter BOOLEAN;

-- Application deploy: write to the new column on all paths.

-- Migration B: backfill in batches (see rule 6).
-- Run as a separate script, NOT inside a migration file.

-- Migration C: add NOT NULL constraint with NOT VALID, then VALIDATE.
ALTER TABLE users ADD CONSTRAINT users_newsletter_not_null
  CHECK (newsletter IS NOT NULL) NOT VALID;
ALTER TABLE users VALIDATE CONSTRAINT users_newsletter_not_null;

-- Migration D (later, optional): convert the CHECK into a real NOT NULL.
ALTER TABLE users ALTER COLUMN newsletter SET NOT NULL;
-- This is fast IF the CHECK is already VALIDATED.
```

**3. `CREATE INDEX` locks the table. Always `CREATE INDEX CONCURRENTLY`.** It cannot run inside a transaction.

```sql
-- BAD: locks writes for the duration
CREATE INDEX users_email_idx ON users (email);

-- GOOD: non-blocking, but cannot be in a transaction
CREATE INDEX CONCURRENTLY users_email_idx ON users (email);
```

In a migration file: put concurrent indexes OUTSIDE any `BEGIN/COMMIT` block.

```sql
-- migrations/0023_add_email_index.sql
BEGIN;
-- ... transactional schema changes ...
COMMIT;

-- index creation outside any transaction
CREATE INDEX CONCURRENTLY users_email_idx ON users (lower(email));
```

**4. Foreign keys added with `NOT VALID` first, then validated.**

```sql
-- BAD: takes ACCESS EXCLUSIVE lock for the entire table scan
ALTER TABLE orders ADD CONSTRAINT orders_user_fk
  FOREIGN KEY (user_id) REFERENCES users (id);

-- GOOD: brief metadata lock, then unlocked scan
ALTER TABLE orders ADD CONSTRAINT orders_user_fk
  FOREIGN KEY (user_id) REFERENCES users (id) NOT VALID;
ALTER TABLE orders VALIDATE CONSTRAINT orders_user_fk;
```

**5. Renaming a column on a live table is a multi-step dance.** Add the new column, dual-write from the application, backfill, switch reads, drop the old. Never rename in place.

### Backfills

**6. Backfill in batches of 1000-10000 rows. Never in one statement.**

```sql
-- BAD: holds locks for the duration; bloats the WAL
UPDATE users SET feature_flag = false WHERE feature_flag IS NULL;

-- GOOD: batched, with COMMIT between, with a small sleep
DO $$
DECLARE updated INT;
BEGIN
  LOOP
    UPDATE users SET feature_flag = false
    WHERE id IN (
      SELECT id FROM users
      WHERE feature_flag IS NULL
      LIMIT 5000
      FOR UPDATE SKIP LOCKED
    );
    GET DIAGNOSTICS updated = ROW_COUNT;
    EXIT WHEN updated = 0;
    COMMIT;
    PERFORM pg_sleep(0.1);
  END LOOP;
END $$;
```

**7. Backfills for large tables run outside the migration tool.** Many tools wrap everything in a transaction; a multi-hour backfill in a transaction is a disaster. Run from a script.

### Reversibility

**8. Every migration has a down step. Even if you never plan to run it.** Writing the down forces you to think about reversibility.

**9. Irreversible operations error loudly in down.** Dropping a column with data is irreversible. Down step should refuse to run.

**10. Never drop a column in the same release as the code that stops using it.** Ship one release that stops writing/reading. Wait at least one deploy. Then drop. Rollback safety.

## Hard rules for MySQL

**11. `pt-online-schema-change` or `gh-ost` for any table over 100K rows.** Direct `ALTER TABLE` rewrites the table and locks writes.

**12. InnoDB foreign keys are slow to add.** No NOT VALID equivalent. Validate in the application, then add during low-traffic.

**13. `utf8` is not UTF-8 in MySQL.** Always `utf8mb4` and `utf8mb4_0900_ai_ci`. If you see `utf8` anywhere, flag it.

## Cross-engine discipline

**14. Migrations are immutable after merge.** Never edit one that has run anywhere. Add a new migration to fix.

**15. Migrations ordered by timestamp, not integer.** `20260522_143000_add_user_phone.sql`. Integers collide on branches.

**16. One logical change per migration.** Adding a column + an index + a constraint that relate, fine. Three unrelated columns to three tables, not fine.

**17. No data migrations mixed with schema migrations.** Seeds and transforms go in separate files labeled `_data.sql`.

**18. Migrations reviewed by someone who has been paged at 3am.** Most common cause of prod incidents.

## Common AI-output patterns to reject

| Pattern | Why dangerous | Fix |
| --- | --- | --- |
| `ADD COLUMN x NOT NULL DEFAULT 'value'` on big table | Rewrites every row | Multi-step (nullable, backfill, NOT VALID + VALIDATE) |
| `CREATE INDEX ...` no CONCURRENTLY | Locks writes | `CREATE INDEX CONCURRENTLY` outside transaction |
| `DEFAULT now()` in ADD COLUMN | Volatile, rewrites table | Constant default or app-level fill |
| `ADD FOREIGN KEY ... REFERENCES` no NOT VALID | Full-table scan under exclusive lock | NOT VALID + VALIDATE in two steps |
| `UPDATE users SET ...` no batching | Locks all matched rows, bloats WAL | DO $$ LOOP with LIMIT 5000 + COMMIT |
| `ALTER COLUMN ... RENAME` on live | Code referring to old name breaks | Dual-write column dance |
| Migration files `0001`, `0002` | Branch collision | Timestamped: `20260522_143000_...` |
| Schema + INSERT seed data in same file | Hard to review, hard to rollback | Separate `_data.sql` |
| MySQL CHARACTER SET utf8 | Truncates 4-byte UTF-8 (emoji) | utf8mb4 + utf8mb4_0900_ai_ci |
| `DROP COLUMN` in the same release | No rollback if release fails | Wait one deploy cycle |

## Worked example: adding a NOT NULL column to a 50M-row table

Goal: add `phone_country_code` to `users` with 50M rows.

```sql
-- migrations/20260522_143000_add_phone_country_code_step_1.sql
-- Step 1: add nullable. Safe instant metadata change.
ALTER TABLE users ADD COLUMN phone_country_code CHAR(2);
```

```ts
// Application change: deploy AFTER step 1.
// Every write path that touches `users` fills phone_country_code.
```

```ts
// scripts/backfill_phone_country_code.ts
// Run AFTER the application change has been deployed.
// Run OUTSIDE the migration tool; this can take hours.
import { pool } from "../src/db.js";

async function main() {
  let updated = -1;
  while (updated !== 0) {
    const result = await pool.query(`
      WITH batch AS (
        SELECT id FROM users
        WHERE phone_country_code IS NULL
        ORDER BY id LIMIT 5000
        FOR UPDATE SKIP LOCKED
      )
      UPDATE users
      SET phone_country_code = COALESCE(infer_country(phone), 'XX')
      WHERE id IN (SELECT id FROM batch)
    `);
    updated = result.rowCount ?? 0;
    console.log(`backfilled ${updated}`);
    await new Promise((r) => setTimeout(r, 100));
  }
}
main();
```

```sql
-- migrations/20260601_120000_add_phone_country_code_step_2.sql
-- Step 2: CHECK NOT VALID, then VALIDATE.
ALTER TABLE users
  ADD CONSTRAINT users_phone_country_code_not_null
  CHECK (phone_country_code IS NOT NULL) NOT VALID;
ALTER TABLE users VALIDATE CONSTRAINT users_phone_country_code_not_null;
```

```sql
-- migrations/20260615_120000_add_phone_country_code_step_3.sql
-- Step 3 (optional): promote CHECK to real NOT NULL. Fast.
ALTER TABLE users ALTER COLUMN phone_country_code SET NOT NULL;
ALTER TABLE users DROP CONSTRAINT users_phone_country_code_not_null;
```

Total elapsed: hours to days. Total lock time on `users`: milliseconds. Single-statement version would have locked `users` 40+ minutes mid-day.

## Worked example: adding an index to a busy table

```sql
-- migrations/20260522_143000_add_orders_status_idx.sql
-- Index creation outside any transaction.
CREATE INDEX CONCURRENTLY orders_status_created_at_idx
  ON orders (status, created_at DESC)
  WHERE status IN ('pending', 'paid');
```

Notes: `CONCURRENTLY` cannot run inside `BEGIN/COMMIT`. Partial index (`WHERE status IN (...)`) keeps the index small and fast.

## Workflow

1. **State what you are doing at the top of the file.**
2. **Estimate table size.** Under 10K, anything goes. 10K-1M, careful. Over 1M, multi-step.
3. **Test against production-shaped data.** Restore a backup to staging; run migration there first.
4. **Write the down.**
5. **Have someone else review.**

## Verification

```bash
bash skills/data/forge-migrations/verify/check_unsafe_patterns.sh path/to/migration.sql
```

Flags: `ADD COLUMN ... NOT NULL`, `CREATE INDEX` without `CONCURRENTLY`, volatile defaults, FK without `NOT VALID`, MySQL `utf8`.

## When to skip this skill

- Brand-new project where no table has more than a few thousand rows.
- Greenfield schema design (use forge-schema-design).
- Test fixtures or seed data scripts.

## Related skills

- [`forge-schema-design`](../forge-schema-design/SKILL.md) - what the schema should look like in the first place.
- [`forge-api-design`](../../backend/forge-api-design/SKILL.md) - schema changes drive API changes.
