Skip to main content

Transactions and isolation

You met transactions in Stage 1: a group of changes that all succeed or all roll back, with the ACID guarantees. The classic case is a transfer - debit one account, credit another, all or nothing:

Loading SQL sandbox…

That is the A of ACID. This lesson is about the I - isolation - which matters the moment two transactions run at the same time.

What goes wrong concurrently

Run a thousand transfers at once and, without protection, transactions can see each other's half-finished work. Three anomalies are worth naming:

  • Dirty read - T2 reads a row T1 has changed but not yet committed. If T1 then rolls back, T2 acted on data that never existed.
  • Non-repeatable read - T1 reads a row, T2 commits a change to it, T1 reads it again and gets a different value within the same transaction.
  • Phantom read - T1 runs SELECT ... WHERE ... twice; between them T2 inserts a matching row, so new rows appear the second time.

Isolation levels

An isolation level is the dial that trades safety against concurrency. Each higher level forbids more anomalies, at some cost to throughput. The SQL standard defines four:

LevelDirty readNon-repeatable readPhantom read
Read Uncommittedpossiblepossiblepossible
Read Committedpreventedpossiblepossible
Repeatable Readpreventedpreventedpossible*
Serializablepreventedpreventedprevented

*PostgreSQL's Repeatable Read also prevents phantoms, via snapshots - stricter than the standard requires.

Serializable is the gold standard: the result is guaranteed to match some order of running the transactions one after another. You pay for it in concurrency, so you use it where correctness is non-negotiable (money, inventory) and a cheaper level elsewhere.

How the database enforces it

Two mechanisms, introduced with the ACID isolation point in Stage 0:

  • Locking (pessimistic) - a transaction locks the rows it touches so others wait. Simple, but contention and deadlocks lurk.
  • MVCC (multi-version concurrency control, optimistic) - each transaction sees a consistent snapshot of the data as of when it started; writers create new versions instead of blocking readers. PostgreSQL, Oracle, and MySQL's InnoDB all use MVCC, which is why readers don't block writers.

Deadlocks: two transactions, opposite order

Locking buys safety but introduces a new failure: a deadlock. Two transactions each hold a lock the other needs, and each waits forever. The classic recipe is acquiring locks in the opposite order:

The database does not hang. It runs a deadlock detector, spots the cycle in the wait-for graph, and aborts one transaction as the victim (PostgreSQL raises SQLState 40P01), rolling it back so the other proceeds. The losing transaction's work is undone - so the application must catch the error and retry.

You reduce deadlocks by always acquiring locks in a consistent order (e.g. always touch the lower account id first), keeping transactions short, and not holding locks across user think-time.

Retry logic: expect to be the victim

Because the engine can abort any transaction - a deadlock victim, or a serialization failure under Serializable - production code wraps transactions in a retry loop. The pattern: retry only on the specific transient error, cap the attempts, and back off exponentially with jitter so retriers do not stampede in lockstep.

# Retry a transaction on deadlock / serialization failure (SQLSTATE 40001, 40P01)
import time, random

def run_txn_with_retry(conn, work, max_attempts=5):
for attempt in range(max_attempts):
try:
with conn.transaction(): # BEGIN ... COMMIT, ROLLBACK on error
return work(conn)
except SerializationFailure: # e.g. psycopg errors for 40001 / 40P01
if attempt == max_attempts - 1:
raise # give up after the last attempt
# exponential backoff with jitter: 0.1, 0.2, 0.4, ... + random
sleep = (2 ** attempt) * 0.1 + random.uniform(0, 0.1)
time.sleep(sleep)

The rule: a serialization/deadlock failure is not a bug - it is the engine telling you to try again. Retry the whole transaction (the rolled-back work is gone), do not just re-run the last statement.

Savepoints: partial rollback within a transaction

A transaction is all-or-nothing - but a savepoint lets you roll back part of one without abandoning the whole thing. SAVEPOINT name marks a point; ROLLBACK TO name undoes everything after it while keeping earlier work and the transaction open; RELEASE name discards the marker. This runs in SQLite:

Loading SQL sandbox…

Savepoints are how a long transaction tries a risky step and recovers if it fails, without throwing away the work before it - useful for processing a batch where one bad item should not sink the rest.

A read-side alternative: materialized views

When a read is slow because it aggregates or joins a lot, raising the isolation level will not help - the work is just expensive. A materialized view stores the result of a query on disk, like a snapshot you refresh on a schedule, so reads hit precomputed rows instead of recomputing every time. It is a read-optimization, not a concurrency tool (PostgreSQL):

-- PostgreSQL: precompute an expensive aggregate, refresh on a schedule
CREATE MATERIALIZED VIEW customer_totals AS
SELECT customer_id, SUM(total) AS lifetime_total
FROM orders
GROUP BY customer_id;

-- reads are now a cheap lookup; refresh when the data changes
REFRESH MATERIALIZED VIEW customer_totals;

The trade-off is the familiar one: the stored result is stale between refreshes. Use it for dashboards and reports that tolerate slight lag, not for data that must be exact to the second.

Dialect note

Materialized views are server-side (PostgreSQL, Oracle, SQL Server's indexed views) and do not run in this SQLite sandbox. A plain CREATE VIEW (Stage 2) stores the query and is always live; a materialized view stores the rows and must be refreshed.

In practice (2026)

  • Defaults differ: PostgreSQL, Oracle, and SQL Server default to Read Committed; MySQL/InnoDB defaults to Repeatable Read. Know your engine's default - it shapes what anomalies your code can hit.
  • Serializable is cheaper than it used to be. PostgreSQL's Serializable Snapshot Isolation (SSI) makes true serializability practical for correctness-critical paths, rather than a last resort.
  • SQLite (this sandbox) sidesteps the dial: it allows only one writer at a time, so write transactions are effectively serialized. Concurrency tuning there is about WAL mode and busy-timeouts, not isolation levels.

Rule of thumb: start at your engine's default, and raise the level only for the specific transactions whose correctness demands it.

Case study: Shopify replaces Redis with MySQL for inventory reservations

Source

This section summarizes Shopify's engineering post "We replaced Redis with MySQL for inventory reservations—and it scaled" by Emilie Noel (May 12, 2026). All facts, figures, and design decisions below are theirs; we retell them here because the story exercises nearly every concept in this lesson - isolation levels, deadlocks, lock ordering, and ACID - on a system that handled $5.1M in sales per minute at the Black Friday 2025 peak.

The problem is a race for the last unit. When a buyer clicks "Complete purchase," the items must still be available. Get it wrong one way and two buyers claim the same unit (oversell); the other way, you tell a buyer it's sold out when it isn't (undersell). Shopify's oversell protection does two things: reserve inventory when payment starts, then claim it (permanently deduct) when payment succeeds.

The old design split the source of truth across two systems. Reservations lived in Redis (DECR/INCR on a quantity key); the inventory ledger lived in MySQL. The claim step had to update MySQL and clean up Redis - and those two writes could not be wrapped in one atomic transaction. Depending on order, a crash between them caused overselling or underselling. This is the A and I of ACID breaking down precisely because the work spanned two stores. Moving reservations into the same MySQL database as the ledger let everything sit in a single ACID transaction.

The new design: one row per unit, plus SKIP LOCKED. Instead of one row with a quantity column (a single hot row every checkout fights over), an item with 10 available units gets 10 rows. Reserving three units selects and moves three rows in one transaction:

-- Grab available units without waiting on rows another checkout already holds
SELECT id FROM reservation_units
WHERE shop_id = ? AND inventory_item_id = ? AND inventory_group_id = ?
LIMIT 3
FOR UPDATE SKIP LOCKED;

FOR UPDATE SKIP LOCKED is the key: if another transaction has locked some rows, MySQL skips them and returns other available rows instead of blocking. No waiting on the same row means far less contention - this is the pessimistic-locking path from earlier, redesigned so transactions rarely collide. A bounded pool (capped at 1,000 rows per item/location) keeps the SKIP LOCKED scan fast; a replenishment process refills it from the ledger.

Three of their hard-won fixes are direct applications of this lesson:

  • Consistent lock ordering to kill deadlocks. Reserve and claim touched two tables in different orders and deadlocked - exactly the opposite-order recipe above. The fix was the textbook one: make every path acquire locks in the same order (reserve always DELETEs from the units table first, then INSERTs into reserved_quantities).
  • READ COMMITTED to avoid gap locks. Running SELECT ... FOR UPDATE SKIP LOCKED on a near-empty table under MySQL's default Repeatable Read took gap locks (including on the "supremum" pseudo-record) that blocked replenishment from inserting and risked deadlocks. Dropping those transactions to Read Committed - which doesn't take gap locks the same way - let replenishment proceed. A real instance of the rule: raise or lower the level only for the specific transactions that need it.
  • Composite primary key for fewer locks. With an auto-increment PK, InnoDB locked two indexes per reservation (the secondary index in the WHERE and the clustered PK). Switching to a composite PK on the filter columns cut it to one lock per row - index design directly changing lock count and throughput.

The twist: the real bottleneck wasn't the database engine. They optimized queries and locks for weeks, but hit a throughput ceiling with CPU not maxed and latency fine. By tagging every SQL statement with the business process holding the connection (/* conn_tag:checkout_completion */) and aggregating at the proxy layer, they found other checkout code holding connections too long - reservations were just the straw that broke an already-depleted connection pool. Cleaning that path removed 50% of reads and 33% of transactions on the primary. The lesson they draw - "the answer is often in the plumbing, not the engine" - is why this lesson stresses keeping transactions short and not holding locks (or connections) across slow work.

They cut over with a shadow mode dual-write (both systems live, Redis the source of truth) until MySQL proved correct on real traffic, then flipped the source of truth with a kill switch to revert. The payoff: writer CPU stayed under 50% during flash sales, with no oversells.

Quick quiz

Transactions and isolation

8 questions

1A transaction reads a row that another transaction has changed but not yet committed. What is that?

2Which isolation level guarantees the result equals some serial order of the transactions?

3What does MVCC give you that pure locking does not?

4Two transactions deadlock - each holds a lock the other needs. What does the database do?

5Your retry loop hits a serialization failure. What is the right backoff?

6Inside an open transaction you want to undo just the last few statements, not all of it. What do you use?

7A dashboard query is slow because it aggregates many rows. Which tool helps most?

8What's the practical advice on isolation levels?

Next up

Your app is now correct and fast. Stage 4 - Databases in Real Apps covers shipping it: ORMs, migrations, connection pooling, and the access control you saw with DCL.