Public Data Has a Discovery Problem

Government data is technically public but practically inaccessible. Here's what that actually costs researchers, journalists, and anyone trying to answer a question with data.

Riley Hilliard
Creator of OpenData · Mar 5, 2026 · 7 min

You’re a policy researcher studying how air quality correlates with economic conditions across US states. Straightforward question. You need three datasets.

First stop: the EPA. You want PM2.5 air quality measurements. The EPA’s Air Quality System gives you a zip file containing a CSV with 30+ columns. The ones you care about are arithmetic_mean, aqi, and observation_count, but you won’t know that until you’ve downloaded the file, unzipped it, opened it, and scrolled sideways through columns like method_code, poc, and datum.

Second stop: the Bureau of Labor Statistics. You want unemployment data. BLS serves tab-delimited flat files organized in directory listings that look like they were designed in 1996 (because they were). The data uses period codes like M01 through M12 for months, M13 for annual averages, and series IDs like CUSR0000SA0 that mean nothing without a separate lookup table.

Third stop: the Census Bureau. You want state population data so you can normalize per capita. The Census API returns JSON where columns are codes, not names. B01001_001E is total population. B19013_001E is median household income. You only know this if you’ve memorized the American Community Survey variable list or spent 20 minutes on their documentation site figuring it out.
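
To make that concrete, here's roughly what decoding the raw response looks like by hand. The endpoint follows the published ACS API pattern, but treat the exact URL, year, and variables as illustrative:

import requests

# The ACS API returns a list of lists: the first row is the header
# (variable codes, not names), the rest are data rows.
url = "https://api.census.gov/data/2022/acs/acs1"
rows = requests.get(url, params={"get": "B01001_001E,NAME", "for": "state:*"}).json()

header, data = rows[0], rows[1:]
states = [dict(zip(header, row)) for row in data]
# You still have to know that B01001_001E means "total population".
print(states[0])  # e.g. {'B01001_001E': '...', 'NAME': '...', 'state': '...'}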

Three agencies, three portals, three formats. None of them talk to each other. And you haven’t started your actual analysis yet.

According to an Anaconda survey, data scientists spend roughly 45% of their time on data preparation and cleaning before they can do any real work. If you’ve been in this world for any amount of time, that number probably feels low. The formats are inconsistent, the documentation is scattered, and the data models assume you already know how each agency organizes its information. “Public data” is a misnomer. It’s more like “technically downloadable data if you know where to look and can decode the format.”

What “Actually Accessible” Looks Like

Here’s the same scenario, but with the data already ingested and normalized.

The Census dataset config transforms cryptic column codes into readable names at ingestion time:

ingest:
  transform:
    - rename:
        B01001_001E: total_population
        NAME: state_name
        state: state_fips
    - cast:
        state_fips: int
        total_population: int

B01001_001E becomes total_population. No lookup table required. The rename happens once, during ingestion, and every query after that uses the human-readable name.
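
If you were doing this by hand instead, it's a rename-and-cast in pandas. A minimal sketch of what the config replaces (sample row invented for illustration):

import pandas as pd

# Raw rows as the Census API returns them: codes for column names,
# strings for every value.
df = pd.DataFrame([["39538223", "California", "06"]],
                  columns=["B01001_001E", "NAME", "state"])

# The hand-rolled equivalent of the rename/cast transform above.
df = df.rename(columns={
    "B01001_001E": "total_population",
    "NAME": "state_name",
    "state": "state_fips",
}).astype({"state_fips": int, "total_population": int})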

The FRED unemployment dataset filters out FRED’s placeholder values (they use a literal "." for missing data, which will break any parser that expects a number):

ingest:
  json_path: "$.observations[*]"
  transform:
    - filter:
        column: value
        operator: ne
        value: "."
    - rename:
        value: unemployment_rate
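
That "." fails in the most literal way possible: a numeric cast dies the moment it hits one. A sketch of the failure mode and the hand-rolled equivalent of the filter above (sample values invented):

import pandas as pd

obs = pd.DataFrame({"date": ["2026-01-01", "2026-02-01"],
                    "value": ["4.1", "."]})

# obs["value"].astype(float) raises ValueError on the "." placeholder,
# so you have to filter first, then cast.
clean = obs[obs["value"] != "."].copy()
clean["unemployment_rate"] = clean["value"].astype(float)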

The BLS CPI dataset is where things get interesting. Period codes like M01 need to become actual dates. Series IDs like CUSR0000SA0 need to become “All items in U.S. city average, seasonally adjusted.” These are query-time transforms, defined in YAML:

computed:
  - name: date
    sql: "make_date({year}, period_to_month({period}), 1)"
  - name: cpi_value
    sql: "TRY_CAST({value} AS DOUBLE)"

And joins to dimension tables translate those opaque codes into something a human can read:

joins:
  - dataset: bls/cpi-u/series
    key: series_id
    select: [area_code, item_code, series_title]
  - dataset: bls/cpi-u/area
    key: area_code
    select: [area_name]
  - dataset: bls/cpi-u/item
    key: item_code
    select: [item_name]

One dataset, six different views: enriched (human-readable with all joins), national (the headline CPI number, series CUSR0000SA0), core (less food and energy), by-category, by-area, and raw. All of these are query-time transforms defined in the YAML config. Nothing is pre-computed or duplicated. The raw Parquet file stays the same; the views just reshape how you see it.

The result is a single API call:

curl "https://opendata.place/v1/datasets/bls/cpi-u/query?view=national&limit=5"

Instead of downloading a directory of tab-delimited files, joining them manually, converting period codes, and filtering out annual averages, you get clean JSON with real dates and readable column names. That’s the difference between data that’s technically public and data that’s actually usable.
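
From code, it's the same single call. A sketch in Python, assuming the response wraps rows in a data array (the exact envelope may differ):

import requests

resp = requests.get(
    "https://opendata.place/v1/datasets/bls/cpi-u/query",
    params={"view": "national", "limit": 5},
)
resp.raise_for_status()

# Assumed shape: {"data": [{"date": ..., "cpi_value": ..., ...}, ...]}
for row in resp.json()["data"]:
    print(row["date"], row["cpi_value"])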

The Connectors That Do the Work

The real complexity lives in the connectors. Different data sources have wildly different access patterns, and pretending they’re all “just HTTP endpoints” doesn’t work.

We handle this with three tiers of connectors:

Native Python connectors for complex sources that need custom logic. BLS serves data as directory listings of flat files that need to be discovered and parsed. Census has an API with pagination quirks and variable codes that need translating. FRED requires authentication and returns a nested JSON structure with placeholder values. Each of these has a dedicated connector: opendata/bls, opendata/census, opendata/fred.

Declarative YAML connectors for sources with simpler but non-trivial patterns. The World Bank, for example, distributes data as zipped CSVs with metadata files mixed in and four header rows to skip:

spec:
  connector: opendata/worldbank
  source_url: https://api.worldbank.org/v2/en/indicator/NY.GDP.MKTP.CD?downloadformat=csv
  ingest:
    skip_rows: 4
    file_pattern: "*.csv"
    exclude_pattern: "Metadata_*"
    transform:
      - wide_to_long:
          id_vars: [Country Name, Country Code]
          value_vars: "^\\d{4}$"
          var_name: year
          value_name: gdp

The connector knows how to handle the World Bank’s zip packaging. The YAML config handles the reshape (wide-to-long pivot, cleaning, type casting). Treasury works similarly, with its own connector for daily CSV downloads.
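
The wide_to_long step is the same reshape you'd otherwise write with pandas.melt. A minimal sketch of what the transform does to the World Bank layout (sample values invented):

import pandas as pd

# World Bank CSVs are wide: one column per year.
wide = pd.DataFrame({
    "Country Name": ["Aruba"],
    "Country Code": ["ABW"],
    "2022": [3.5e9],
    "2023": [3.6e9],
})

# Columns matching ^\d{4}$ become rows: (country, year, gdp).
year_cols = [c for c in wide.columns if c.isdigit() and len(c) == 4]
long = wide.melt(id_vars=["Country Name", "Country Code"],
                 value_vars=year_cols, var_name="year", value_name="gdp")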

Generic HTTP connector for everything else. EPA air quality data, NOAA weather data, Our World in Data CSVs. These are sources where the access pattern is straightforward (download a file, parse it) and the real work is in the transform step. The generic connector handles HTTP fetching, decompression, and format detection. The YAML config handles the rest.

This tiered approach means community contributions can happen at whatever level makes sense. Adding a new EPA dataset that uses the same zip-of-CSVs pattern? That’s a YAML file. Adding a new data provider with a paginated API? That might need a Python connector. Both are valid, and both end up producing the same thing: a queryable Parquet file with clean column names and documented schema.
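
The connector interface itself isn't shown in this post, but a hypothetical sketch makes the Python tier concrete. Names and signatures here are illustrative, not the repo's actual API:

from typing import Iterable, Protocol

class Connector(Protocol):
    """Hypothetical connector interface; the real one may differ."""

    def discover(self, source_url: str) -> Iterable[str]:
        """List the resources (files, series, tables) at the source."""
        ...

    def fetch(self, resource: str) -> bytes:
        """Download one resource, handling auth, pagination, retries."""
        ...

Under an interface like this, a BLS connector's discover() would parse directory listings while a FRED connector's fetch() would handle the API key, but downstream code wouldn't care which is which.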

The CLI makes this concrete. opendata discover <url> analyzes a data source and generates a starter dataset.yaml config. opendata sync registers configs with the database. opendata add <url> does both and kicks off ingestion. For the common case of "here's a URL to a CSV or JSON file," no code is required. Just a YAML file that describes what to rename, filter, and cast.

Why This Matters

The people who most need public data are often the least equipped to wrangle it.

A journalist fact-checking a claim about rising crime rates needs the actual Bureau of Justice Statistics data, not a 30-day FOIA wait or a secondhand chart from someone’s blog post. The data is public. The barrier isn’t access rights, it’s the practical effort of finding it, downloading it, figuring out the format, and cleaning it enough to answer a simple question.

A city council member comparing their district’s air quality to neighboring counties shouldn’t need a data engineer on staff. The EPA publishes the data. But “publishes” and “makes accessible” are very different things when the data arrives as a 200MB zip file with 30 columns of monitoring station metadata you don’t need.

A grad student shouldn’t spend half their thesis timeline on data wrangling. The Anaconda number bears repeating here: 45% of time on prep. That’s not a rounding error. For a two-year master’s program, that’s nearly a year of formatting CSVs and decoding column names instead of doing actual research.

Right now, a policy researcher in DC and a grad student in Austin and a journalist in New York are all writing essentially the same Python scripts to parse BLS directory listings, decode Census variable codes, and unzip EPA archives. That work gets done and then thrown away. It doesn’t accumulate.

We have over 200 datasets from 35+ providers, each ingested, normalized, and queryable through the same API. BLS labor statistics, FRED economic indicators, NOAA weather data, Our World in Data global metrics. All with documented columns and human-readable names. The goal is to solve each data wrangling problem once so nobody has to solve it again.

Check out the API or browse the source on GitHub. If you work with public data and have opinions about which sources should be added next, we'd like to hear them.

Riley Hilliard

Creator of OpenData

At 13, I secretly drilled holes in my parents' wood floor to route a 56k modem line to my bedroom for late-night Age of Empires marathons. That same scrappy curiosity carried through 3 acquisitions, 9 years as a LinkedIn Staff Engineer building infrastructure for 1B+ users, and now fuels my side projects, like OpenData.


