Not all schema changes are created equal. Adding a new nullable column is harmless. Dropping a column that a background job still references is a 3 AM pager. After reviewing hundreds of migration incidents, we've identified the five schema changes that cause the most production breakage — and the checks that prevent them.
1. Dropping a column that's still in use

Why it breaks: Your migration runs successfully. The column is gone. Then a background job, API endpoint, or reporting query tries to read it — and crashes.
Real-world story: A team dropped legacy_user_id after migrating to UUIDs. The migration passed CI. Two hours later, a nightly ETL job failed because it still selected that column. The rollback required restoring from backup.
How to catch it: Search your entire codebase for the column name before dropping. Include background jobs, cron scripts, analytics pipelines, and third-party integrations. A semantic diff tool will flag the column as removed — that's your signal to verify it's truly unused.
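One low-risk pattern is to rename the column first and watch for breakage before dropping it for good. A sketch in PostgreSQL syntax, reusing the legacy_user_id column from the story above (the users table name is an assumption):

```sql
-- Step 1: rename instead of dropping. Anything still reading the column
-- now fails loudly and immediately, instead of after the data is gone.
ALTER TABLE users RENAME COLUMN legacy_user_id TO legacy_user_id_deprecated;

-- Step 2: wait at least one full cycle of nightly jobs and reports.
-- If nothing breaks, drop for real in a later migration:
ALTER TABLE users DROP COLUMN legacy_user_id_deprecated;
```

The rename is cheap and instantly reversible; the drop is neither. A failed nightly job against a renamed column is an annoyance, not a restore-from-backup.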
2. Adding a NOT NULL column without a default

Why it breaks: ALTER TABLE ... ADD COLUMN ... NOT NULL with no DEFAULT will fail on a table with existing rows in most databases. The engine doesn't know what value to assign to millions of existing records.
Real-world story: A developer added timezone VARCHAR(50) NOT NULL to a 10-million-row events table. The migration locked the table for 45 seconds, then failed. The deploy pipeline halted. The fix required a three-step migration: add as nullable, backfill, then add the constraint.
How to catch it: Never add NOT NULL without a default in the same migration. Review every new column's nullability. If it must be NOT NULL, add it as nullable first, backfill with a sensible default, then alter the column.
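The three-step version of the migration from the story might look like this (PostgreSQL syntax; 'UTC' is a stand-in default):

```sql
-- Step 1: add the column as nullable (fast, no table rewrite in modern engines)
ALTER TABLE events ADD COLUMN timezone VARCHAR(50);

-- Step 2: backfill. On a multi-million-row table, run this in batches
-- to avoid holding long row locks.
UPDATE events SET timezone = 'UTC' WHERE timezone IS NULL;

-- Step 3: add the constraint only once every row has a value
ALTER TABLE events ALTER COLUMN timezone SET NOT NULL;
```

Each step is individually quick to run and quick to roll back, which is exactly what the single-statement version lacks.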
3. Dropping an index that's still serving queries

Why it breaks: Indexes are invisible until they're gone. Queries that ran in milliseconds suddenly scan entire tables. CPU spikes. Timeouts cascade. Customers notice before your alerts do.
Real-world story: A "cleanup" migration dropped three indexes that were "not in the ORM definitions." They were actually used by raw SQL reporting queries. Query latency on the orders table went from 12ms to 4.2 seconds. The incident lasted 23 minutes.
How to catch it: Before dropping an index, check your query planner logs and slow query log. Look for Seq Scan on large tables. If you're unsure, mark the index as invisible (MySQL) or drop it in a separate migration with a quick rollback plan.
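Both checks can be done in SQL. A sketch, where the index name idx_orders_user_id is illustrative; note that PostgreSQL's idx_scan counter resets with statistics and says nothing about queries running on replicas:

```sql
-- PostgreSQL: indexes with zero scans since the last stats reset
-- are candidates for removal (not proof of safety)
SELECT indexrelname, idx_scan
FROM pg_stat_user_indexes
WHERE idx_scan = 0;

-- MySQL 8.0+: hide the index from the planner without dropping it.
-- If latency spikes, restoring it is instant -- no rebuild needed.
ALTER TABLE orders ALTER INDEX idx_orders_user_id INVISIBLE;
-- ALTER TABLE orders ALTER INDEX idx_orders_user_id VISIBLE;  -- rollback
```

An invisible index keeps being maintained on writes, so the "drop" costs nothing to undo.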
4. Narrowing a column type

Why it breaks: Depending on the database and SQL mode, changing VARCHAR(500) to VARCHAR(100) or TEXT to VARCHAR(255) either fails mid-migration (PostgreSQL, strict-mode MySQL) or silently truncates data that exceeds the new limit (MySQL in non-strict mode). In the silent case, the migration succeeds. The data is corrupted.
Real-world story: A team changed description TEXT to description VARCHAR(500) to "enforce UI limits." 2% of descriptions were longer than 500 characters. Those records were truncated. Customer support spent a week reconstructing lost data from email archives.
How to catch it: Before narrowing a type, query for the maximum length of existing data. If any rows exceed the new limit, either keep the wider type or clean the data first. Treat type narrowing as a data migration, not a schema change.
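The pre-check is one query. A sketch using the story's description column (the products table name is hypothetical; use CHAR_LENGTH in MySQL, where LENGTH counts bytes, not characters):

```sql
-- Count rows that the new limit would truncate or reject
SELECT COUNT(*) AS too_long
FROM products
WHERE LENGTH(description) > 500;

-- Only if too_long = 0 is the narrowing safe (PostgreSQL syntax):
ALTER TABLE products ALTER COLUMN description TYPE VARCHAR(500);
```

If too_long is anything but zero, you have a data cleanup task to finish before the schema change is even on the table.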
5. Adding a foreign key without an index

Why it breaks: Adding a foreign key constraint without an existing index on the column forces the database to validate every row with a full table scan. On large tables, this can take hours and hold heavy locks.
Real-world story: A team added a foreign key from orders.user_id to users.id on a 50-million-row table. There was no index on orders.user_id. The migration ran for 3 hours, blocking all writes to the orders table. The deploy window expired. The migration had to be killed and rescheduled during a maintenance window.
How to catch it: Always create the index before adding the foreign key. In SQL Server, use WITH NOCHECK to add the constraint without validating existing rows, then validate separately. In PostgreSQL, create the index with CREATE INDEX CONCURRENTLY, add the constraint as NOT VALID, and run VALIDATE CONSTRAINT in a separate step.
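In PostgreSQL, the story's orders/users migration could be staged so that no single step blocks writes for long. A sketch; the index and constraint names are illustrative, and note that CREATE INDEX CONCURRENTLY cannot run inside a transaction block:

```sql
-- 1. Build the index first, without blocking writes
CREATE INDEX CONCURRENTLY idx_orders_user_id ON orders (user_id);

-- 2. Add the constraint without scanning the 50M existing rows;
--    it is enforced for new writes immediately
ALTER TABLE orders
  ADD CONSTRAINT fk_orders_user_id
  FOREIGN KEY (user_id) REFERENCES users (id) NOT VALID;

-- 3. Validate separately. This still scans the table, but holds a
--    much weaker lock, so normal traffic keeps flowing.
ALTER TABLE orders VALIDATE CONSTRAINT fk_orders_user_id;
```

The total work is the same as the single-statement version; the difference is that none of it happens inside a deploy window while holding a write-blocking lock.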
Every incident above shares one root cause: the migration was not reviewed against reality before running in production. The developer knew what they intended to change. They didn't know what else would break.
A good schema review process answers three questions:

1. What exactly changed between the old schema and the new one?
2. Is each change safe to run against production data and traffic?
3. Does anything else (code, jobs, queries) still depend on what's being removed or altered?
The first question is mechanical — a semantic diff answers it instantly. The second and third require human judgment. But most teams skip the first question because they don't have a fast way to answer it. They rely on memory and eyeballing SQL dumps.
Here's a lightweight process that catches 90% of dangerous schema changes:

1. Diff the old and new schema dumps so every change is explicit, with nothing left to memory or eyeballing.
2. For every dropped column or index, search the codebase, background jobs, and query logs for references.
3. For every new constraint or narrowed type, query production data for rows that would violate it.
4. For anything that rewrites or scans a large table, plan a staged migration: nullable first, index first, validate separately.
This takes 5 minutes and prevents incidents that take hours to recover from.
SchemaLens compares two CREATE TABLE dumps and shows you a visual diff with a generated migration script. Catch dangerous changes in seconds.
Try SchemaLens Free →

Further reading: Want a complete checklist? Read The Schema Review Checklist Every Engineering Team Needs for a 15-point review process you can add to your PR template.