Normalization

Normalization is arranging columns across tables so every fact is stored in exactly one place. It is not bureaucracy - it is what stops your data from quietly contradicting itself. The fastest way to understand it is to build a bad table and watch it break.

One big table goes wrong

Imagine cramming orders, customers, and products into a single wide table - the customer's email repeated on every one of their order rows:

order_id	customer_name	customer_email	product	price
101	Ana	ana@old.test	Running shoe	40.00
102	Ana	ana@old.test	Trail socks	12.50
103	Ben	ben@x.test	Water bottle	8.00

Ana's email lives in two rows already. That duplication causes three classic anomalies.

Update anomaly. Change one fact, chase it through many rows. Update Ana's email and you must hit every order she ever placed - miss one and the data now disagrees with itself:

Loading SQL sandbox…

Insertion anomaly. You cannot record a new customer until they place an order - there is no row to put them in. A product with no orders has nowhere to live.

Deletion anomaly. Delete Ben's only order and you also erase the only record that Ben, and the water bottle, ever existed.

The cure for all three is the same: split the table so each fact sits once.

The normal forms

Normalization proceeds in steps. The first three handle almost every real schema.

First normal form (1NF): atomic values, no repeating groups. Each cell holds one value, and you do not stuff a list into a column (sizes: '40,41,42') or repeat columns (product_1, product_2). A separate row per value instead.

Second normal form (2NF): no partial dependencies. Applies when the key is composite (several columns). Every non-key column must depend on the whole key, not part of it. In an order_items(order_id, product_id, quantity, product_name) table, product_name depends only on product_id - half the key - so it belongs in products, not here.

Third normal form (3NF): no transitive dependencies. No non-key column may depend on another non-key column. An orders(id, customer_id, customer_email) table breaks this: customer_email depends on customer_id, which is not the key. Move it to customers, where customer_id is the key.

Apply all three and the wide table becomes the store's actual schema - customers, orders, products, order_items - each fact in one home. Now Ana's email is a single row to update:

Loading SQL sandbox…

A working rule of thumb: every non-key column should depend on the key, the whole key, and nothing but the key.

Denormalization: breaking the rule on purpose

Normalization optimizes for correct writes. Sometimes you deliberately trade it away for faster reads:

Precomputed values - storing orders.total instead of summing line items on every read.
Reporting and analytics - a warehouse star schema is denormalized so dashboards avoid huge joins.
Caching - duplicating a hot value to skip a join.

The cost is the update anomaly you just saw: duplicated data can drift, so you take on the job of keeping copies in sync (triggers, scheduled jobs, application code). The rule for 2026: normalize by default; denormalize only with a measured reason, and document the duplication.

Quick quiz

Normalization

4 questions

1Updating one customer's email forces you to change it in many rows. Which anomaly is this?

2Storing '40,41,42' as a comma-separated list in one column violates which rule?

3In orders(id, customer_id, customer_email), customer_email depends on customer_id (a non-key column). Which form does this break?

4When is denormalization a reasonable choice?

Next up

The design is sound. Functions and procedures closes the stage by packaging reusable logic in the database itself.

One big table goes wrong​

The normal forms​

Denormalization: breaking the rule on purpose​

Quick quiz​

One big table goes wrong

The normal forms

Denormalization: breaking the rule on purpose

Quick quiz