Normalization
Normalization is arranging columns across tables so every fact is stored in exactly one place. It is not bureaucracy - it is what stops your data from quietly contradicting itself. The fastest way to understand it is to build a bad table and watch it break.
One big table goes wrong
Imagine cramming orders, customers, and products into a single wide table - the customer's email repeated on every one of their order rows:
| order_id | customer_name | customer_email | product | price |
|---|---|---|---|---|
| 101 | Ana | ana@old.test | Running shoe | 40.00 |
| 102 | Ana | ana@old.test | Trail socks | 12.50 |
| 103 | Ben | ben@x.test | Water bottle | 8.00 |
Ana's email lives in two rows already. That duplication causes three classic anomalies.
Update anomaly. Change one fact, chase it through many rows. Update Ana's email and you must hit every order she ever placed - miss one and the data now disagrees with itself:
Insertion anomaly. You cannot record a new customer until they place an order - there is no row to put them in. A product with no orders has nowhere to live.
Deletion anomaly. Delete Ben's only order and you also erase the only record that Ben, and the water bottle, ever existed.
The cure for all three is the same: split the table so each fact sits once.
The normal forms
Normalization proceeds in steps. The first three handle almost every real schema.
First normal form (1NF): atomic values, no repeating groups. Each cell holds one value, and you do not stuff a list into a column (sizes: '40,41,42') or repeat columns (product_1, product_2). A separate row per value instead.
Second normal form (2NF): no partial dependencies. Applies when the key is composite (several columns). Every non-key column must depend on the whole key, not part of it. In an order_items(order_id, product_id, quantity, product_name) table, product_name depends only on product_id - half the key - so it belongs in products, not here.
Third normal form (3NF): no transitive dependencies. No non-key column may depend on another non-key column. An orders(id, customer_id, customer_email) table breaks this: customer_email depends on customer_id, which is not the key. Move it to customers, where customer_id is the key.
Apply all three and the wide table becomes the store's actual schema - customers, orders, products, order_items - each fact in one home. Now Ana's email is a single row to update:
A working rule of thumb: every non-key column should depend on the key, the whole key, and nothing but the key.
Denormalization: breaking the rule on purpose
Normalization optimizes for correct writes. Sometimes you deliberately trade it away for faster reads:
- Precomputed values - storing
orders.totalinstead of summing line items on every read. - Reporting and analytics - a warehouse star schema is denormalized so dashboards avoid huge joins.
- Caching - duplicating a hot value to skip a join.
The cost is the update anomaly you just saw: duplicated data can drift, so you take on the job of keeping copies in sync (triggers, scheduled jobs, application code). The rule for 2026: normalize by default; denormalize only with a measured reason, and document the duplication.
Quick quiz
Normalization
4 questions1Updating one customer's email forces you to change it in many rows. Which anomaly is this?
2Storing '40,41,42' as a comma-separated list in one column violates which rule?
3In orders(id, customer_id, customer_email), customer_email depends on customer_id (a non-key column). Which form does this break?
4When is denormalization a reasonable choice?
The design is sound. Functions and procedures closes the stage by packaging reusable logic in the database itself.