Store Flat, Transform on Read

Why we store all data in long format and apply transforms at query time instead of pre-computing views. A technical deep dive into DuckDB, Parquet, and the architecture behind OpenData's query engine.

Riley Hilliard
Creator of OpenData · Mar 19, 2026 · 9 min

You have the Bureau of Labor Statistics CPI data. One user wants the national CPI as a time series. Another wants CPI broken down by metropolitan area. A third wants it sliced by expenditure category: food, housing, transportation.

The instinct is to pre-compute three different tables. Flatten it out for each use case. Then someone asks for CPI by area and category, and now you’re writing a fourth ETL pipeline. Then the BLS publishes a new year of data, and you’re re-running all four.

This is the denormalization trap, and we walked right into it early on. Here’s how we got out.

The Denormalization Trap

Pre-computing every possible view of a dataset doesn’t scale. Not because of compute cost (though that matters), but because of schema drift. Each materialized view is a snapshot of decisions you made when you built the pipeline. New columns arrive. Categories change. Date ranges expand. Your carefully crafted wide tables slowly diverge from the source.

You end up maintaining two systems: the raw data and the pre-computed views. Bugs hide in the gap between them. A user reports that your national CPI number doesn’t match the BLS website, and you spend a day tracking down which view pipeline is stale.

We tried this approach. It was wrong. The better answer turned out to be simpler: store the data exactly once, in the flattest format possible, and build every view on read.

Long Format: One Row Per Observation

The raw BLS CPI data looks like this:

series_id          year  period  value
CUSR0000SA0        2025  M12     315.605
CUSR0000SA0        2025  M11     315.493
CUSR0000SA0L1E     2025  M12     327.682
CUSR0000SAF        2025  M12     332.145
CUUR0100SA0        2025  M12     309.842

One row per observation. The series_id encodes everything: CUSR0000SA0 is all-items CPI for the US, seasonally adjusted. CUUR0100SA0 is all-items for the Northeast, not seasonally adjusted. The area, the item category, the seasonal adjustment, all packed into that one identifier.
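The packing is positional, so decoding it is mechanical. Here's a minimal sketch of a decoder, assuming the documented CPI-U layout (2-character survey prefix, seasonal-adjustment flag, periodicity code, 4-digit area code, item code); the function name is ours, not the platform's:

```python
def parse_cpi_series_id(series_id: str) -> dict:
    """Decode a BLS CPI-U series_id into its positional components."""
    return {
        "survey": series_id[0:2],      # CU = CPI for All Urban Consumers
        "seasonal": series_id[2],      # S = seasonally adjusted, U = unadjusted
        "periodicity": series_id[3],   # R = monthly data
        "area_code": series_id[4:8],   # 0000 = US city average, 0100 = Northeast
        "item_code": series_id[8:],    # SA0 = all items, SAF = food, ...
    }

parse_cpi_series_id("CUUR0100SA0")
# → {'survey': 'CU', 'seasonal': 'U', 'periodicity': 'R',
#    'area_code': '0100', 'item_code': 'SA0'}
```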

If you’re used to spreadsheets, this looks messy. Where are the column headers for each year? Where’s the separate “food” sheet? But this format has a property that matters a lot when you’re building a platform that ingests data from hundreds of sources: it never breaks when new data arrives.

A new month shows up? More rows. A new metro area starts getting tracked? More rows. The schema doesn’t change. There’s no migration. No column rename. No “oh, we need to add a 2026 column” moment.

Wide format (years as columns, categories as separate tables) looks clean in a spreadsheet, but it breaks constantly in production. Every time the source updates, your schema changes. Long format is ugly and stable. We’ll take stable.

Six Views From One Parquet File

Here’s our actual BLS CPI-U dataset config. One Parquet file, six completely different views, zero pre-computation:

spec:
  source_url: https://download.bls.gov/pub/time.series/cu/
  default_view: enriched
  views:
    - name: enriched
    - name: national
    - name: core
    - name: by-category
    - name: by-area
    - name: raw

Each view defines its own filters, computed columns, and joins. They all read from the same Parquet file at query time. Let’s look at a few.

The National View

The headline inflation number that makes the news:

- name: national
  description: >
    US national CPI - all items, seasonally adjusted
    (the headline inflation number)
  filters:
    - sql: "{period} LIKE 'M%' AND {period} != 'M13'"
    - sql: "TRY_CAST({value} AS DOUBLE) IS NOT NULL"
  computed:
    - name: date
      sql: "make_date({year}, period_to_month({period}), 1)"
    - name: cpi_value
      sql: "TRY_CAST({value} AS DOUBLE)"
    - name: inflation_rate
      sql: >
        ROUND(
          (TRY_CAST({value} AS DOUBLE) -
            LAG(TRY_CAST({value} AS DOUBLE), 1) OVER
              (PARTITION BY {series_id}, {period} ORDER BY {year}))
          / LAG(TRY_CAST({value} AS DOUBLE), 1) OVER
              (PARTITION BY {series_id}, {period} ORDER BY {year})
          * 100, 1)
  default_filters:
    series_id: CUSR0000SA0
  columns: [date, cpi_value, inflation_rate]
  default_sort: date
  default_sort_order: desc

From raw rows with cryptic period codes and string values, this view produces a clean time series: date, CPI value, and year-over-year inflation rate. The period_to_month function translates BLS period codes (M01 through M12) into integers. The LAG() window function computes inflation by comparing each month to the same month in the prior year. The M13 filter drops annual averages.

All of that happens at query time. Nothing is stored.

The By-Category View

CPI broken down by what people actually spend money on:

- name: by-category
  description: >
    CPI by major expenditure category
    (Food, Housing, Transportation, etc.)
  filters:
    - sql: "{m.period} LIKE 'M%' AND {m.period} != 'M13'"
    - sql: "TRY_CAST({m.value} AS DOUBLE) IS NOT NULL"
    - column: item_code
      operator: in
      value: [SA0, SAF, SAH, SAA, SAT, SAM, SAR, SAE, SAG]
  joins:
    - dataset: bls/cpi-u/series
      alias: series
      key: series_id
      select: [area_code, item_code, seasonal]
    - dataset: bls/cpi-u/item
      alias: item
      key: item_code
      select: [item_name]
  default_filters:
    area_code: "0000"
    seasonal: S
  columns: [date, item_name, cpi_value, inflation_rate]

Same Parquet file. But now we’re joining two dimension tables (series and item) to get human-readable category names. The item_code filter narrows it to the nine major expenditure categories. And the default_filters lock it to the national average with seasonal adjustment.

The Enriched View

The full picture, with four dimension table joins:

- name: enriched
  description: >
    Human-readable CPI data with item/area names and proper dates
  computed:
    - name: date
      sql: "make_date({m.year}, period_to_month({m.period}), 1)"
    - name: cpi_value
      sql: "TRY_CAST({m.value} AS DOUBLE)"
    - name: inflation_rate
      sql: "ROUND((...) * 100, 1)"
  joins:
    - dataset: bls/cpi-u/series
      alias: series
      key: series_id
      select: [area_code, item_code, series_title, seasonal]
    - dataset: bls/cpi-u/area
      alias: area
      key: area_code
      select: [area_name]
    - dataset: bls/cpi-u/item
      alias: item
      key: item_code
      select: [item_name]
    - dataset: bls/cpi-u/seasonal
      alias: seas
      key:
        source: seasonal
        target: seasonal_code
      select: [seasonal_text]
  columns:
    [date, cpi_value, inflation_rate, series_title,
     area_name, item_name, seasonal_text]

Four LEFT JOINs and three computed columns later, the result is a fully denormalized, human-readable dataset with proper dates, readable names, and calculated inflation rates. All from flat rows that look like CUSR0000SA0, 2025, M12, 315.605.

The raw view? It’s literally nothing:

- name: raw
  description: Raw CPI data with all columns and periods
  default_sort: year
  default_sort_order: desc

No transforms. No filters. Just the data as-is. Sometimes that’s what you want.

DuckDB as the Query Engine

We use DuckDB for all dataset queries and PostgreSQL for all metadata. Each engine does what it’s good at.

DuckDB is a columnar analytical database. When you query a Parquet file for just the value and year columns, DuckDB reads only those columns from disk. It doesn’t load the entire file. For a dataset with 20 columns where you need 3, that’s a significant difference.

The critical architecture decision: each web request creates a throwaway DuckDB instance. The code is straightforward:

def _get_connection(self) -> duckdb.DuckDBPyConnection:
    """Create a new in-memory DuckDB connection."""
    conn = duckdb.connect(":memory:")
    conn.execute("SET enable_progress_bar = false")
    return conn

A request comes in. We create an in-memory DuckDB connection. It reads the Parquet file, applies the view transforms, returns the results, and the connection gets garbage collected. No connection pooling, no write conflicts, no stale caches.

This works because DuckDB is embedded (no server process) and in-memory connections are lightweight. The real bottleneck is I/O from reading Parquet files, not connection overhead.

DuckDB has a fundamental constraint: single-writer, multiple-reader. If two web requests tried to write to the same .duckdb file simultaneously, one would block. In-memory connections sidestep this entirely because we never write to DuckDB at all. We only read Parquet files. All writes go through PostgreSQL (metadata) or directly to Parquet files (dataset content via the ingestion pipeline).

From the codebase:

DuckDB is single-writer, multiple-reader. We use IN-MEMORY connections for all queries to avoid write conflicts across multiple backend instances. DO NOT change to persistent DuckDB files without understanding the implications.

This gives us complete isolation between requests. No connection pool to tune. No lock contention to debug. Each query is independent.

The Expression Compiler

The {column} syntax you saw in the YAML isn’t just string interpolation. Those expressions go through a proper compilation pipeline.

Take this computed column:

- name: date
  sql: "make_date({year}, period_to_month({period}), 1)"

Here’s what happens:

  1. Column extraction: {year} and {period} are identified as column references
  2. Preprocessing: Column refs become placeholders (__COL_year__), stdlib functions get prefixed to avoid collisions with sqlglot builtins
  3. AST parsing: sqlglot parses the expression into an abstract syntax tree
  4. Stdlib expansion: period_to_month() is recognized as a stdlib function and expanded into a CASE expression that maps BLS period codes to month integers
  5. Validation: The AST is checked against a whitelist of allowed functions (no DROP TABLE, no subqueries)
  6. SQL generation: The AST is rendered back to SQL with quoted column identifiers

The period_to_month function, for example, expands into:

CASE
  WHEN "period" LIKE 'M%' THEN
    CASE WHEN CAST(SUBSTRING("period" FROM 2 FOR 2) AS INTEGER) <= 12
         THEN CAST(SUBSTRING("period" FROM 2 FOR 2) AS INTEGER)
         ELSE NULL
    END
  WHEN "period" LIKE 'Q%' THEN
    CASE SUBSTRING("period" FROM 2 FOR 2)
      WHEN '01' THEN 1
      WHEN '02' THEN 4
      WHEN '03' THEN 7
      WHEN '04' THEN 10
      ELSE 1
    END
  WHEN "period" LIKE 'S%' THEN
    CASE WHEN "period" = 'S01' THEN 1 ELSE 7 END
  ELSE NULL
END

This is domain-specific logic that BLS uses across all their time series data. By putting it in a stdlib function, every BLS dataset can use period_to_month() without copy-pasting that CASE expression.

The compiler is strict. It validates against a whitelist of allowed SQL functions (aggregates, math, string, date, window functions) and rejects anything that looks like injection. Semicolons, subqueries, DDL statements: all blocked at the AST level, not with regex.

class ExpressionCompiler:
    ALLOWED_FUNCTIONS: ClassVar[set[str]] = {
        "COUNT", "SUM", "AVG", "MIN", "MAX",
        "ROUND", "ABS", "FLOOR", "CEIL",
        "MAKE_DATE", "DATE_TRUNC", "EXTRACT",
        "LAG", "LEAD", "ROW_NUMBER", "RANK",
        "TRY_CAST", "COALESCE", "NULLIF",
        # ... and more
    }

When This Breaks Down

This approach has real limitations. Worth naming them.

Views can’t reference other views. The national view and the by-category view both compute inflation_rate with the same window function. You can’t define inflation_rate once and reuse it. Each view repeats the full expression. If the formula changes, you update it in six places.

No caching of intermediate results. If ten users request the enriched view in the same minute, that’s ten separate DuckDB instances all reading the same Parquet file, computing the same joins, running the same window functions. There’s no shared cache. For most datasets this is fine because DuckDB is fast and Parquet is efficient. For large datasets with complex join chains, you’ll feel it.

In-memory means no persistent indexes. Every query starts cold. DuckDB reads column metadata from the Parquet file, but there’s no persistent index to speed up filtered queries. This is offset by Parquet’s built-in row group statistics (min/max values per column chunk), which DuckDB uses for predicate pushdown. But it’s not the same as a B-tree index.

The expression compiler supports a subset of SQL. You can’t use CTEs, recursive queries, or arbitrary subqueries in computed columns. The whitelist is intentionally conservative. If you need something outside it, you’re writing a custom connector instead of a view definition.

Pre-computation still wins for dashboards. If you need sub-100ms responses for a real-time dashboard hitting the same query thousands of times per minute, query-time transforms are the wrong tool. Materialize those views. This architecture is optimized for the read-heavy, moderate-complexity, diverse-query-pattern workload of a data platform, not for powering a live analytics dashboard.

These are real tradeoffs, not hypothetical ones. We’ve hit all of them. The architecture is the right default for what we’re building, but it’s not the only answer.

The Hybrid Storage Picture

The full architecture uses two databases, each doing what it’s best at:

Layer                                  Engine            Why
Metadata (providers, datasets, jobs)   PostgreSQL        OLTP, transactions, foreign keys, full-text search
Dataset content (the actual data)      Parquet + DuckDB  OLAP, columnar reads, vectorized execution

PostgreSQL tracks what datasets exist, who submitted them, when they were last ingested, and whether the ingestion succeeded. DuckDB reads the actual dataset content from Parquet files and applies view transforms at query time.

The two never compete. Postgres handles lots of small transactional writes (user submits a URL, job status updates, star counts). DuckDB handles fewer, larger analytical reads (give me CPI by metropolitan area, sorted by date, with inflation rates). Different access patterns, different engines.

One Parquet file. Six views. No pre-computation. That’s the architecture, and so far, it’s held up.

Riley Hilliard
Creator of OpenData

At 13, I secretly drilled holes in my parents' wood floor to route a 56k modem line to my bedroom for late-night Age of Empires marathons. That same scrappy curiosity carried through 3 acquisitions, 9 years as a LinkedIn Staff Engineer building infrastructure for 1B+ users, and now fuels my side projects, like OpenData.
