Stage 5 review
This page pulls Stage 5 together. The skill is matching the data shape and workload to the right model - relational is the default, but document, columnar, graph, time-series, and JSON each win in their lane. Work the scenarios before opening the answers, then take the quiz and keep the cheatsheet handy.
Go back to a lesson if a scenario stumps you: OLTP vs OLAP, Warehousing, NoSQL modeling, Graph and time-series, JSON in a relational database.
Scenarios
Scenario 1 - OLTP or OLAP?
(a) The checkout writes an order and decrements stock, thousands of times a minute. (b) Finance asks for revenue by region by month for the last three years. Which workload is each, and where does each run?
Show the answer
- (a) OLTP - many small, fast reads and writes. Runs on your live row-store (PostgreSQL, MySQL) tuned for transactions.
- (b) OLAP - a few large aggregate questions over lots of history. Runs on a columnar warehouse (Snowflake, BigQuery, ClickHouse, DuckDB).
Running (b) on the live OLTP database is slow and competes with real orders, so data is copied to a warehouse first. OLTP records what happens; OLAP explains what happened.
Scenario 2 - Star or snowflake, ETL or ELT?
You are designing a warehouse for sales analytics. When do you snowflake a dimension instead of keeping a flat star, and why is ELT (not ETL) the modern default?
Show the answer
Default to a star schema: one central fact table (sales) surrounded by flat dimension tables (date, product, store). Simple joins, fast queries. Snowflake (normalize a dimension into sub-tables) only when a dimension is large and shared enough that the duplication genuinely hurts - otherwise the extra joins cost more than they save.
ELT loads raw data into the warehouse first, then transforms it there with SQL (often via dbt). It beats classic ETL because cheap, scalable warehouse compute does the transform, raw data stays available to re-transform, and transformations are version-controlled SQL.
Scenario 3 - Embed or reference?
In a document database, you store blog posts and their comments. Should comments be embedded in the post document or stored as separate documents that reference it?
Show the answer
Model by access pattern. If you almost always read a post with its comments, and comments are bounded, embed them - one read returns everything. If comments are unbounded, written heavily on their own, or queried independently, reference them - keep posts and comments separate to avoid a giant, contended document.
Rule of thumb: embed what you read together and that stays bounded; reference what grows without limit or is accessed on its own.
Scenario 4 - Which specialized store?
Pick the best fit and say why: (a) "people who bought this also bought" recommendations; (b) per-second CPU metrics from 10,000 servers, kept 30 days; (c) a product catalog where every item has different attributes.
Show the answer
- (a) Graph database - the question is the relationships between purchases; multi-hop traversals (
MATCH) are cheap, versus painful recursive self-joins in SQL. - (b) Time-series database (TimescaleDB, InfluxDB) - optimized for time-ordered ingest, downsampling, retention/TTL, and continuous aggregates. (Postgres + Timescale is often enough.)
- (c) Document database - each record carries its own shape, so varied attributes need no schema migration. A relational table would force every item into the same columns.
Scenario 5 - JSON column or real columns?
You have a product table. Some fields are queried and filtered constantly (price, category); others are rare, vendor-specific extras. How do you model them?
Show the answer
Promote the hot, structured fields to real typed columns - price and category get types, constraints, and indexes, and the planner can use them. Park the rare, irregular extras in a JSON column - flexibility without a migration per vendor quirk.
It is the same judgement as normalization: structure what you rely on, leave genuinely variable data loose. Reach for a full document database only when most of the record is irregular.
Runnable: a graph question answered in SQL
Stage 5 showed graph queries in Cypher. The same "who can Ana reach by following" question runs here in plain SQLite with a WITH RECURSIVE walk over an edge table:
This works, but every extra hop is another join, and traversals get awkward fast. A graph database makes multi-hop questions natural and fast - which is the whole point of choosing one.
Comprehensive quiz
Stage 5 cumulative review
6 questions1A dashboard aggregates three years of sales by region. Which storage and system fit best?
2In a star schema, the fact table holds:
3ELT differs from ETL chiefly in that:
4A document database guideline for embedding versus referencing is:
5Change Data Capture (CDC) feeds downstream systems by:
6Why store rare, vendor-specific product attributes in a JSON column rather than new columns?
Stage 5 cheatsheet
Open the Stage 5 cheatsheet
OLTP vs OLAP
- OLTP - many small reads/writes, row storage, live app DB. Records what happens.
- OLAP - few big aggregates over history, columnar warehouse. Explains what happened.
- Move data OLTP -> OLAP via ETL/ELT or CDC (log-based streaming).
Warehouse design
- Star = one fact table + flat dimensions (default). Snowflake = normalized dimensions (only when duplication hurts).
- Fact = measures + FKs; dimension = descriptive attributes you slice by.
- ELT = load raw, transform in-warehouse with SQL (dbt: models, DAG, tests, incremental).
- Lakehouse = table formats (Iceberg, Delta) add ACID, schema evolution, time travel on object storage.
NoSQL families
- Document (MongoDB) - per-item shape; embed vs reference by access pattern.
- Key-value (Redis) - fastest single-key lookups; caches, sessions.
- Column-family (Cassandra) - huge write volume; partition + clustering keys.
- Graph (Neo4j) - relationships are the question;
MATCHtraversals; ISO GQL (2024). - Modeling rule: embed what you read together and stays bounded; reference what grows or is read alone.
Specialized stores
- Time-series (TimescaleDB, InfluxDB) - time-ordered ingest, downsampling, retention/TTL, continuous aggregates.
- Graph in SQL - possible via
WITH RECURSIVE, but multi-hop gets painful; that is when a graph DB earns its place.
JSON in relational
- Promote hot, structured fields to typed columns (types, constraints, indexes).
- Park rare, irregular fields in a JSON column; extract with
json_extract/->.
You can pick the right model for the data. Stage 6 - Scaling and Advanced takes it to many machines: replication, sharding, consistency trade-offs, and vector search for AI.