You have the Bureau of Labor Statistics CPI data. One user wants the national CPI as a time series. Another wants CPI broken down by metropolitan area. A third wants it sliced by expenditure category: food, housing, transportation.
The instinct is to pre-compute three different tables. Flatten it out for each use case. Then someone asks for CPI by area and category, and now you’re writing a fourth ETL pipeline. Then the BLS publishes a new year of data, and you’re re-running all four.
This is the denormalization trap, and we walked right into it early on. Here’s how we got out.
## The Denormalization Trap
Pre-computing every possible view of a dataset doesn’t scale. Not because of compute cost (though that matters), but because of schema drift. Each materialized view is a snapshot of decisions you made when you built the pipeline. New columns arrive. Categories change. Date ranges expand. Your carefully crafted wide tables slowly diverge from the source.
You end up maintaining two systems: the raw data and the pre-computed views. Bugs hide in the gap between them. A user reports that your national CPI number doesn’t match the BLS website, and you spend a day tracking down which view pipeline is stale.
We tried this approach. It was wrong. The better answer turned out to be simpler: store the data exactly once, in the flattest format possible, and build every view on read.
## Long Format: One Row Per Observation
The raw BLS CPI data looks like this:
| series_id | year | period | value |
|---|---|---|---|
| CUSR0000SA0 | 2025 | M12 | 315.605 |
| CUSR0000SA0 | 2025 | M11 | 315.493 |
| CUSR0000SA0L1E | 2025 | M12 | 327.682 |
| CUSR0000SAF | 2025 | M12 | 332.145 |
| CUUR0100SA0 | 2025 | M12 | 309.842 |
One row per observation. The series_id encodes everything: CUSR0000SA0 is all-items CPI for the US, seasonally adjusted. CUUR0100SA0 is all-items for the Northeast, not seasonally adjusted. The area, the item category, the seasonal adjustment, all packed into that one identifier.
If you’re used to spreadsheets, this looks messy. Where are the column headers for each year? Where’s the separate “food” sheet? But this format has a property that matters a lot when you’re building a platform that ingests data from hundreds of sources: it never breaks when new data arrives.
A new month shows up? More rows. A new metro area starts getting tracked? More rows. The schema doesn’t change. There’s no migration. No column rename. No “oh, we need to add a 2026 column” moment.
Wide format (years as columns, categories as separate tables) looks clean in a spreadsheet, but it breaks constantly in production. Every time the source updates, your schema changes. Long format is ugly and stable. We’ll take stable.
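The difference is easy to see in miniature. Here's a plain-Python sketch (the rows are taken from the sample above; the `pivot` helper is illustrative): new observations are appended as rows, and any wide shape is derived at read time.

```python
# Long format: one record per observation, schema fixed forever.
long_rows = [
    {"series_id": "CUSR0000SA0", "year": 2025, "period": "M12", "value": 315.605},
    {"series_id": "CUSR0000SA0", "year": 2025, "period": "M11", "value": 315.493},
    {"series_id": "CUSR0000SAF", "year": 2025, "period": "M12", "value": 332.145},
]

# A new month or a new series? Just more rows -- no migration, no new column.
long_rows.append(
    {"series_id": "CUUR0100SA0", "year": 2025, "period": "M12", "value": 309.842}
)

def pivot(rows):
    """Derive a wide view (series x month) at read time; nothing is stored."""
    wide = {}
    for r in rows:
        wide.setdefault(r["series_id"], {})[(r["year"], r["period"])] = r["value"]
    return wide

print(pivot(long_rows)["CUSR0000SA0"][(2025, "M12")])  # 315.605
```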
## Six Views From One Parquet File
Here’s our actual BLS CPI-U dataset config. One Parquet file, six completely different views, zero pre-computation:
```yaml
spec:
  source_url: https://download.bls.gov/pub/time.series/cu/
  default_view: enriched
  views:
    - name: enriched
    - name: national
    - name: core
    - name: by-category
    - name: by-area
    - name: raw
```
Each view defines its own filters, computed columns, and joins. They all read from the same Parquet file at query time. Let’s look at a few.
### The National View
The headline inflation number that makes the news:
```yaml
- name: national
  description: >
    US national CPI - all items, seasonally adjusted
    (the headline inflation number)
  filters:
    - sql: "{period} LIKE 'M%' AND {period} != 'M13'"
    - sql: "TRY_CAST({value} AS DOUBLE) IS NOT NULL"
  computed:
    - name: date
      sql: "make_date({year}, period_to_month({period}), 1)"
    - name: cpi_value
      sql: "TRY_CAST({value} AS DOUBLE)"
    - name: inflation_rate
      sql: >
        ROUND(
          (TRY_CAST({value} AS DOUBLE) -
           LAG(TRY_CAST({value} AS DOUBLE), 1) OVER
             (PARTITION BY {series_id}, {period} ORDER BY {year}))
          / LAG(TRY_CAST({value} AS DOUBLE), 1) OVER
             (PARTITION BY {series_id}, {period} ORDER BY {year})
          * 100, 1)
  default_filters:
    series_id: CUSR0000SA0
  columns: [date, cpi_value, inflation_rate]
  default_sort: date
  default_sort_order: desc
```
From raw rows with cryptic period codes and string values, this view produces a clean time series: date, CPI value, and year-over-year inflation rate. The period_to_month function translates BLS period codes (M01 through M12) into integers. The LAG() window function computes inflation by comparing each month to the same month in the prior year. The M13 filter drops annual averages.
All of that happens at query time. Nothing is stored.
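The window-function logic is the only subtle part, so here it is mirrored in plain Python: partitioning by (series_id, period) and ordering by year means "previous row" is the same month one year earlier. The 2024 value below is made up for illustration.

```python
# Plain-Python mirror of the inflation_rate expression:
# LAG(value) OVER (PARTITION BY series_id, period ORDER BY year)
# picks the same month in the prior year.

obs = {
    # (series_id, period, year) -> CPI value; the 2024 value is illustrative
    ("CUSR0000SA0", "M12", 2024): 306.746,
    ("CUSR0000SA0", "M12", 2025): 315.605,
}

def inflation_rate(series_id, period, year):
    cur = obs.get((series_id, period, year))
    prev = obs.get((series_id, period, year - 1))  # the LAG(..., 1) row
    if cur is None or prev is None:
        return None  # first year in the partition has no prior value
    return round((cur - prev) / prev * 100, 1)

print(inflation_rate("CUSR0000SA0", "M12", 2025))  # 2.9
```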
### The By-Category View
CPI broken down by what people actually spend money on:
```yaml
- name: by-category
  description: >
    CPI by major expenditure category
    (Food, Housing, Transportation, etc.)
  filters:
    - sql: "{m.period} LIKE 'M%' AND {m.period} != 'M13'"
    - sql: "TRY_CAST({m.value} AS DOUBLE) IS NOT NULL"
    - column: item_code
      operator: in
      value: [SA0, SAF, SAH, SAA, SAT, SAM, SAR, SAE, SAG]
  joins:
    - dataset: bls/cpi-u/series
      alias: series
      key: series_id
      select: [area_code, item_code, seasonal]
    - dataset: bls/cpi-u/item
      alias: item
      key: item_code
      select: [item_name]
  default_filters:
    area_code: "0000"
    seasonal: S
  columns: [date, item_name, cpi_value, inflation_rate]
```
Same Parquet file. But now we’re joining two dimension tables (series and item) to get human-readable category names. The item_code filter narrows it to the nine major expenditure categories. And the default_filters lock it to the national average with seasonal adjustment.
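Conceptually, the joins just turn the encoded series_id into readable labels. A toy version of the two dimension lookups (the dictionary contents are illustrative stand-ins for the BLS series and item files):

```python
# One fact row plus two tiny dimension "tables" (illustrative contents).
fact = {"series_id": "CUSR0000SAF", "year": 2025, "period": "M12", "value": "332.145"}

series_dim = {"CUSR0000SAF": {"area_code": "0000", "item_code": "SAF", "seasonal": "S"}}
item_dim = {"SAF": {"item_name": "Food and beverages"}}

def enrich(row):
    s = series_dim[row["series_id"]]  # join on series_id
    i = item_dim[s["item_code"]]      # join on item_code
    return {**row, **s, **i}

print(enrich(fact)["item_name"])  # Food and beverages
```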
### The Enriched View
The full picture, with four dimension table joins:
```yaml
- name: enriched
  description: >
    Human-readable CPI data with item/area names and proper dates
  computed:
    - name: date
      sql: "make_date({m.year}, period_to_month({m.period}), 1)"
    - name: cpi_value
      sql: "TRY_CAST({m.value} AS DOUBLE)"
    - name: inflation_rate
      sql: "ROUND((...) * 100, 1)"
  joins:
    - dataset: bls/cpi-u/series
      alias: series
      key: series_id
      select: [area_code, item_code, series_title, seasonal]
    - dataset: bls/cpi-u/area
      alias: area
      key: area_code
      select: [area_name]
    - dataset: bls/cpi-u/item
      alias: item
      key: item_code
      select: [item_name]
    - dataset: bls/cpi-u/seasonal
      alias: seas
      key:
        source: seasonal
        target: seasonal_code
      select: [seasonal_text]
  columns:
    [date, cpi_value, inflation_rate, series_title,
     area_name, item_name, seasonal_text]
```
Four LEFT JOINs, three computed columns, two filters, and the result is a fully denormalized, human-readable dataset with proper dates, readable names, and calculated inflation rates. All from flat rows that look like CUSR0000SA0, 2025, M12, 315.605.
The raw view? It’s almost nothing:
```yaml
- name: raw
  description: Raw CPI data with all columns and periods
  default_sort: year
  default_sort_order: desc
```
No transforms. No filters. Just the data as-is. Sometimes that’s what you want.
## DuckDB as the Query Engine
We use DuckDB for all dataset queries and PostgreSQL for all metadata. Each engine does what it’s good at.
DuckDB is a columnar analytical database. When you query a Parquet file for just the value and year columns, DuckDB reads only those columns from disk. It doesn’t load the entire file. For a dataset with 20 columns where you need 3, that’s a significant difference.
The critical architecture decision: each web request creates a throwaway DuckDB instance. The code is straightforward:
```python
def _get_connection(self) -> duckdb.DuckDBPyConnection:
    """Create a new in-memory DuckDB connection."""
    conn = duckdb.connect(":memory:")
    conn.execute("SET enable_progress_bar = false")
    return conn
```
A request comes in. We create an in-memory DuckDB connection. It reads the Parquet file, applies the view transforms, returns the results, and the connection gets garbage collected. No connection pooling, no write conflicts, no stale caches.
This works because DuckDB is embedded (no server process) and in-memory connections are lightweight. The real bottleneck is I/O from reading Parquet files, not connection overhead.
DuckDB has a fundamental constraint: single-writer, multiple-reader. If two web requests tried to write to the same .duckdb file simultaneously, one would block. In-memory connections sidestep this entirely because we never write to DuckDB at all. We only read Parquet files. All writes go through PostgreSQL (metadata) or directly to Parquet files (dataset content via the ingestion pipeline).
From the codebase:
```
DuckDB is single-writer, multiple-reader. We use IN-MEMORY connections
for all queries to avoid write conflicts across multiple backend instances.
DO NOT change to persistent DuckDB files without understanding the implications.
```
This gives us complete isolation between requests. No connection pool to tune. No lock contention to debug. Each query is independent.
## The Expression Compiler
The {column} syntax you saw in the YAML isn’t just string interpolation. Those expressions go through a proper compilation pipeline.
Take this computed column:
```yaml
- name: date
  sql: "make_date({year}, period_to_month({period}), 1)"
```
Here’s what happens:

- Column extraction: {year} and {period} are identified as column references
- Preprocessing: column refs become placeholders (__COL_year__), and stdlib functions get prefixed to avoid collisions with sqlglot builtins
- AST parsing: sqlglot parses the expression into an abstract syntax tree
- Stdlib expansion: period_to_month() is recognized as a stdlib function and expanded into a CASE expression that maps BLS period codes to month integers
- Validation: the AST is checked against a whitelist of allowed functions (no DROP TABLE, no subqueries)
- SQL generation: the AST is rendered back to SQL with quoted column identifiers
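The first two steps are simple enough to sketch in a few lines. This is illustrative, not the platform's code: `extract_columns` is a hypothetical helper that follows the placeholder scheme described above.

```python
import re

# Find {column} references and swap them for placeholders before AST parsing.
COLUMN_REF = re.compile(r"\{(\w+)\}")

def extract_columns(expr: str) -> tuple[str, list[str]]:
    cols = COLUMN_REF.findall(expr)  # step 1: column extraction
    pre = COLUMN_REF.sub(lambda m: f"__COL_{m.group(1)}__", expr)  # step 2: preprocessing
    return pre, cols

pre, cols = extract_columns("make_date({year}, period_to_month({period}), 1)")
print(cols)  # ['year', 'period']
print(pre)   # make_date(__COL_year__, period_to_month(__COL_period__), 1)
```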
The period_to_month function, for example, expands into:
```sql
CASE
  WHEN "period" LIKE 'M%' THEN
    CASE WHEN CAST(SUBSTRING("period" FROM 2 FOR 2) AS INTEGER) <= 12
         THEN CAST(SUBSTRING("period" FROM 2 FOR 2) AS INTEGER)
         ELSE NULL
    END
  WHEN "period" LIKE 'Q%' THEN
    CASE SUBSTRING("period" FROM 2 FOR 2)
      WHEN '01' THEN 1
      WHEN '02' THEN 4
      WHEN '03' THEN 7
      WHEN '04' THEN 10
      ELSE 1
    END
  WHEN "period" LIKE 'S%' THEN
    CASE WHEN "period" = 'S01' THEN 1 ELSE 7 END
  ELSE NULL
END
```
This is domain-specific logic that BLS uses across all their time series data. By putting it in a stdlib function, every BLS dataset can use period_to_month() without copy-pasting that CASE expression.
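For reference, here is a plain-Python reading of that CASE expression (an illustrative mirror, not the platform's implementation):

```python
def period_to_month(period: str):
    """Map a BLS period code to a month integer, mirroring the SQL above."""
    if period.startswith("M"):
        n = int(period[1:3])           # SUBSTRING("period" FROM 2 FOR 2)
        return n if n <= 12 else None  # M13 (annual average) has no month
    if period.startswith("Q"):
        # First month of each quarter.
        return {"01": 1, "02": 4, "03": 7, "04": 10}.get(period[1:3], 1)
    if period.startswith("S"):
        return 1 if period == "S01" else 7  # semiannual halves
    return None

print(period_to_month("M07"))  # 7
print(period_to_month("M13"))  # None
print(period_to_month("Q03"))  # 7
```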
The compiler is strict. It validates against a whitelist of allowed SQL functions (aggregates, math, string, date, window functions) and rejects anything that looks like injection. Semicolons, subqueries, DDL statements: all blocked at the AST level, not with regex.
```python
class ExpressionCompiler:
    ALLOWED_FUNCTIONS: ClassVar[set[str]] = {
        "COUNT", "SUM", "AVG", "MIN", "MAX",
        "ROUND", "ABS", "FLOOR", "CEIL",
        "MAKE_DATE", "DATE_TRUNC", "EXTRACT",
        "LAG", "LEAD", "ROW_NUMBER", "RANK",
        "TRY_CAST", "COALESCE", "NULLIF",
        # ... and more
    }
```
## When This Breaks Down
This approach has real limitations. Worth naming them.
Views can’t reference other views. The national view and the by-category view both compute inflation_rate with the same window function. You can’t define inflation_rate once and reuse it. Each view repeats the full expression. If the formula changes, you update it in six places.
No caching of intermediate results. If ten users request the enriched view in the same minute, that’s ten separate DuckDB instances all reading the same Parquet file, computing the same joins, running the same window functions. There’s no shared cache. For most datasets this is fine because DuckDB is fast and Parquet is efficient. For large datasets with complex join chains, you’ll feel it.
In-memory means no persistent indexes. Every query starts cold. DuckDB reads column metadata from the Parquet file, but there’s no persistent index to speed up filtered queries. This is offset by Parquet’s built-in row group statistics (min/max values per column chunk), which DuckDB uses for predicate pushdown. But it’s not the same as a B-tree index.
The expression compiler supports a subset of SQL. You can’t use CTEs, recursive queries, or arbitrary subqueries in computed columns. The whitelist is intentionally conservative. If you need something outside it, you’re writing a custom connector instead of a view definition.
Pre-computation still wins for dashboards. If you need sub-100ms responses for a real-time dashboard hitting the same query thousands of times per minute, query-time transforms are the wrong tool. Materialize those views. This architecture is optimized for the read-heavy, moderate-complexity, diverse-query-pattern workload of a data platform, not for powering a live analytics dashboard.
These are real tradeoffs, not hypothetical ones. We’ve hit all of them. The architecture is the right default for what we’re building, but it’s not the only answer.
## The Hybrid Storage Picture
The full architecture uses two databases, each doing what it’s best at:
| Layer | Engine | Why |
|---|---|---|
| Metadata (providers, datasets, jobs) | PostgreSQL | OLTP, transactions, foreign keys, full-text search |
| Dataset content (the actual data) | Parquet + DuckDB | OLAP, columnar reads, vectorized execution |
PostgreSQL tracks what datasets exist, who submitted them, when they were last ingested, and whether the ingestion succeeded. DuckDB reads the actual dataset content from Parquet files and applies view transforms at query time.
The two never compete. Postgres handles lots of small transactional writes (user submits a URL, job status updates, star counts). DuckDB handles fewer, larger analytical reads (give me CPI by metropolitan area, sorted by date, with inflation rates). Different access patterns, different engines.
One Parquet file. Six views. No pre-computation. That’s the architecture, and so far, it’s held up.