Skip to main content

Glossary

Every key term in the course, defined in one line. Use it as a quick reference while you read, or to refresh a term you have half-forgotten. Terms are grouped alphabetically.

A

  • ACID - the guarantee that a transaction is Atomic, Consistent, Isolated, and Durable.
  • Aggregate function - a function that collapses many rows into one value (COUNT, SUM, AVG, MIN, MAX).
  • Anomaly (insert / update / delete) - data problems caused by a poorly normalized schema duplicating facts.
  • ANN (Approximate Nearest Neighbor) - fast similarity search that trades a little accuracy for big speed on vectors.
  • Atomicity - the "A" in ACID: all steps of a transaction happen, or none do.
  • Autovacuum - PostgreSQL's background job that reclaims dead-row space and refreshes planner statistics.
  • Availability - the "A" in CAP: every request gets a (non-error) response, even during a partition.

B

  • Base backup - a full physical copy of a database cluster, the starting point for point-in-time recovery.
  • BM25 - a keyword-relevance ranking used by full-text search; often combined with vectors in hybrid search.
  • B-tree - the default balanced-tree index structure, good for equality and range lookups.
  • Bloat - wasted space left by dead rows; reclaimed by vacuuming.

C

  • Cache - a fast store holding copies of hot data to avoid repeated database work.
  • Cache-aside - a caching pattern: read cache, on miss read the database and populate, invalidate on write.
  • CAP theorem - during a network partition a distributed system must choose Consistency or Availability.
  • Cardinality - how many distinct values a column has; high cardinality makes indexes more useful.
  • CASE expression - SQL conditional logic that returns different values per branch.
  • CDC (Change Data Capture) - streaming row-level changes out of a database, usually from its write-ahead log.
  • Chunking - splitting documents into pieces before embedding them for retrieval-augmented generation.
  • Column - a named, typed field shared by every row of a table.
  • Column-family store - a NoSQL model (Cassandra) built for huge write volume across many machines.
  • Columnar storage - storing data by column, making analytical aggregations fast (used by warehouses).
  • Composite index - an index on several columns, usable left-to-right (the left-prefix rule).
  • Composite key - a primary key made of more than one column.
  • Connection pool - a reused set of open database connections shared across application requests.
  • Consistency (ACID) - the "C" in ACID: every transaction leaves the data in a valid state.
  • Consistency (distributed) - every node returns the same latest value; the property traded in CAP.
  • Consensus - an algorithm (Raft, Paxos) letting distributed nodes agree on an ordered set of writes.
  • Constraint - a rule the database enforces on data (NOT NULL, UNIQUE, CHECK, FOREIGN KEY).
  • Correlated subquery - a subquery that references the outer query and is evaluated per outer row.
  • Covering index - an index that contains every column a query needs, so the table is never read.
  • Crow's foot - ER-diagram notation whose forked end means "many".
  • CTE (Common Table Expression) - a named temporary result defined with WITH to keep queries readable.
  • Cypher - the query language of the Neo4j graph database.

D

  • Data - raw facts with no meaning on their own.
  • Database - an organized collection of data that software can store, find, and change reliably.
  • DBMS - the database management system: the engine that stores, protects, and queries the data.
  • DCL (Data Control Language) - SQL for permissions: GRANT and REVOKE.
  • Deadlock - two transactions each holding a lock the other needs, frozen until one is aborted.
  • DDL (Data Definition Language) - SQL that defines structure: CREATE, ALTER, DROP.
  • Denormalization - deliberately duplicating data for read speed, accepting a sync cost.
  • Dimension table - a warehouse table of descriptive attributes you slice facts by.
  • Dirty read - reading another transaction's uncommitted change.
  • DML (Data Manipulation Language) - SQL that changes rows: INSERT, UPDATE, DELETE.
  • Document store - a NoSQL model (MongoDB) holding JSON-like records that can each differ in shape.
  • Downsampling - reducing time-series resolution (e.g. per-second to per-minute) to save space.
  • Durability - the "D" in ACID: once committed, a change survives crashes.

E

  • ELT - load raw data into the warehouse, then transform it there with SQL.
  • Embedding - a list of numbers capturing the meaning of text, an image, or audio.
  • Embed vs reference - the document-modeling choice between nesting related data or linking to it.
  • ER model (Entity-Relationship) - a diagram of entities, their attributes, and the relationships between them.
  • ETL - transform data before loading it into the warehouse.
  • EXISTS - a subquery operator that is true when the subquery returns any row.
  • EXPLAIN / EXPLAIN ANALYZE - shows a query's execution plan; ANALYZE adds actual run statistics.
  • Expand-contract - a zero-downtime migration: add the new, dual-write, migrate reads, drop the old.

F

  • Fact table - the central warehouse table holding measures plus foreign keys to dimensions.
  • Foreign key (FK) - a column whose value must match a primary key in another table.
  • Full-text search - indexed search over natural-language text (Postgres tsvector/GIN, SQLite FTS5).

G

  • Generated column - a column whose value is computed from other columns (GENERATED ALWAYS AS).
  • GIN index - a PostgreSQL index for multi-valued data like jsonb and full-text vectors.
  • GQL - the ISO graph query language standard (2024).
  • Graph database - a store of nodes and edges, built for relationship-heavy questions.
  • GRANT / REVOKE - DCL statements that give or remove privileges.
  • GROUP BY - groups rows so aggregates are computed per group.

H

  • HAVING - filters groups after aggregation (unlike WHERE, which filters rows before).
  • HNSW - a graph-based approximate-nearest-neighbor index for vector search.
  • Hotspot - a shard or partition that receives a disproportionate share of traffic.
  • Hybrid search - combining keyword (BM25) and vector similarity, then reranking the results.

I

  • Idempotency key - a client-supplied identifier that makes a retried write apply only once.
  • Index - a sorted lookup structure that speeds reads on chosen columns, at a small write cost.
  • Isolation - the "I" in ACID: concurrent transactions do not see each other's partial work.
  • Isolation level - how strictly transactions are isolated (Read Uncommitted to Serializable).

J

  • JOIN - combines rows from two tables on a matching condition (INNER, LEFT, RIGHT, FULL, CROSS).
  • JSON column - a column holding a JSON document, queryable inside a relational table.

K

  • Keyset pagination - paging with WHERE id > :last instead of OFFSET; scales to deep pages.
  • Key-value store - a NoSQL model (Redis) mapping a key to a value for very fast lookups.

L

  • Lakehouse - a data lake with table formats (Iceberg, Delta) adding ACID, schema evolution, and time travel.
  • Least privilege - granting each role only the permissions it actually needs.
  • Left-prefix rule - a composite index is usable from its leftmost column rightward.
  • LIMIT - caps how many rows a query returns.
  • Lock - a reservation a transaction takes on data so others wait their turn.

M

  • Materialized view - a stored, refreshable snapshot of a query's result, for fast reads.
  • Migration - a versioned, repeatable change to the database schema.
  • MVCC (Multi-Version Concurrency Control) - giving each transaction a consistent snapshot so readers do not block writers.

N

  • N+1 problem - firing one query per row in a loop instead of one combined query.
  • Natural key - a primary key made of real-world data (can change, so often risky).
  • NewSQL - distributed databases offering SQL and ACID with horizontal scale (CockroachDB, Spanner, TiDB).
  • Non-repeatable read - reading the same row twice in a transaction and getting different values.
  • Normalization - structuring tables so each fact is stored once.
  • NoSQL - a family of non-relational models (document, key-value, column-family, graph).
  • NULL - the absence of a value; comparisons with it use three-valued logic.

O

  • OLAP - Online Analytical Processing: few large aggregate queries over lots of history.
  • OLTP - Online Transaction Processing: many small, fast reads and writes.
  • ON DELETE (CASCADE / RESTRICT / SET NULL) - what happens to child rows when a parent row is deleted.
  • ORDER BY - sorts the result rows.
  • ORM (Object-Relational Mapping) - a library that maps database rows to application objects.

P

  • PACELC - extends CAP: if Partition then C-or-A, Else trade Latency-or-Consistency.
  • Pagination - returning results in pages (offset-based or keyset).
  • Parameterized query - sending SQL with placeholders and binding values separately; the core injection defense.
  • Partial index - an index over only the rows matching a condition (CREATE INDEX ... WHERE).
  • Partitioning - splitting one large table into pieces by a key, often within a machine.
  • pgvector - a PostgreSQL extension storing and searching embeddings next to relational data.
  • Phantom read - a query's matching row set changes within a transaction as new rows appear.
  • PITR (Point-In-Time Recovery) - restoring a database to an exact past moment using a base backup plus WAL.
  • Polyglot persistence - using several database types in one system, the right tool per job.
  • Primary key (PK) - the column(s) that uniquely identify each row.
  • Primary-replica - a replication topology where one node takes writes and replicas serve reads.

Q

  • Query - a question asked of the database; in SQL you describe the result, not the steps.
  • Query plan - the database's chosen strategy for executing a query (see SCAN vs SEARCH).

R

  • RAG (Retrieval-Augmented Generation) - an LLM answers using documents retrieved by vector search.
  • Read replica - a copy of the database that serves read traffic to offload the primary.
  • Re-embedding - regenerating all stored vectors after changing the embedding model.
  • Reranking - reordering retrieved candidates with a stronger relevance model.
  • Replication - keeping copies of data on multiple nodes.
  • RETURNING - a clause that returns rows affected by an INSERT, UPDATE, or DELETE.
  • Role - a named set of privileges granted to users (DCL).
  • Row - a single record in a table.
  • Row-level security (RLS) - per-row access rules enforced by the database.
  • RPO (Recovery Point Objective) - how much data loss, in time, is acceptable.
  • RTO (Recovery Time Objective) - how long a restore may take.

S

  • Savepoint - a marker letting you roll back part of a transaction without aborting all of it.
  • SCAN vs SEARCH - a full-table read versus an index seek in a query plan.
  • Schema - the defined structure of a database: its tables, columns, types, and constraints.
  • SELECT - the SQL statement that reads rows.
  • Selectivity - the fraction of rows a condition matches; high selectivity makes indexes pay off.
  • Serverless database - a managed database that scales (often to zero) without you running servers.
  • Set operation - UNION, INTERSECT, or EXCEPT combining the result sets of two queries.
  • Shard key - the column deciding which shard a row lives on; needs high cardinality and even access.
  • Sharding - spreading data across many machines by a shard key.
  • Slowly changing dimension - a warehouse dimension whose attributes change over time (Type 1 overwrite, Type 2 history).
  • Snapshot - a frozen point-in-time copy of data.
  • Snowflake schema - a star schema whose dimensions are further normalized into sub-tables.
  • Soft delete - marking a row deleted with a timestamp instead of removing it.
  • Star schema - a central fact table surrounded by flat dimension tables.
  • Stored procedure - reusable procedural logic stored and run in the database.
  • Subquery - a query nested inside another query.
  • Surrogate key - a generated, stable identifier used as the primary key.

T

  • Table - a grid of rows and columns holding one kind of thing.
  • TCL (Transaction Control Language) - SQL controlling transactions: BEGIN, COMMIT, ROLLBACK.
  • Three-valued logic - SQL boolean logic with TRUE, FALSE, and UNKNOWN (from NULL).
  • Time-series database - a store optimized for time-stamped data, retention, and downsampling.
  • TLS / sslmode - encryption for database connections; verify-full also verifies server identity.
  • Transaction - a group of changes treated as a single all-or-nothing unit.
  • tsvector - PostgreSQL's preprocessed full-text search column type.

U

  • UNIQUE - a constraint forbidding duplicate values in a column.
  • UPSERT - insert a row, or update it if it already exists (INSERT ... ON CONFLICT).

V

  • VACUUM - reclaims space from dead rows and updates statistics.
  • Vector database - a store that finds items similar in meaning using embeddings.
  • View - a saved query that behaves like a read-only table.

W

  • WAL (Write-Ahead Log) - a durability log written before changes are applied; also feeds replication and CDC.
  • Warehouse - a database built for analytics (columnar, OLAP).
  • WHERE - filters which rows a query returns.
  • Window function - a function computing across a set of rows related to the current row, without collapsing them (OVER).