The State of Open Data Infrastructure in 2026

A survey of the open data landscape: what data.gov, Socrata, FRED, Kaggle, Hugging Face, and Datasette do well, what's still broken, and where the connective tissue between data sources is finally being built.

Riley Hilliard
Creator of OpenData · Feb 12, 2026 · 9 min

There’s no shortage of places to find public data in 2026. data.gov lists 300,000+ datasets. Hugging Face hosts over 500,000. Kaggle, Socrata, FRED, Datasette, CKAN, and a growing list of specialized portals each serve their own corner of the ecosystem.

The tooling has gotten genuinely better. But a specific problem has stubbornly persisted: if you need data from more than one source, you’re on your own. Each tool is an island. Combining BLS unemployment figures with Census demographics and EPA air quality data still means writing the same glue code that people were writing five years ago.

Here’s where the landscape stands, what each tool does well, and where the gaps remain.

The Landscape Today

Here’s what the major players do and where they stop.

data.gov is the US government’s central data catalog. It lists over 300,000 datasets harvested from federal agencies, powered by CKAN under the hood. It’s good at what it is: a directory. You search for a dataset, find a listing, and get redirected to whatever agency hosts it. The problem is that “whatever agency hosts it” means a different portal, a different format, and a different access pattern every time. data.gov tells you the data exists. It doesn’t give you a consistent way to get it.

Socrata (now Tyler Technologies’ Data & Insights division) and CKAN are the hosting platforms underneath many city, state, and federal portals. Socrata provides SODA, a SQL-like query API that lets you filter and retrieve data programmatically. They recently shipped SODA3 with better performance and caching. CKAN is the open-source equivalent, powering data portals across the US, UK, Canada, and the EU. Both platforms solved the “where do we put government data” problem for individual agencies. Neither solved the “how do I combine data from five different portals” problem. Each portal is its own island.
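
SODA queries are parameterized with SoQL clauses like `$select`, `$where`, and `$limit` passed as query-string parameters. As a rough sketch (the domain is a real Socrata portal, but the dataset ID here is hypothetical), building such a request looks like:

```python
from urllib.parse import urlencode

def soda_query_url(domain: str, dataset_id: str, **soql) -> str:
    """Build a SODA query URL from SoQL parameters ($select, $where, $limit...)."""
    params = {f"${k}": v for k, v in soql.items()}
    return f"https://{domain}/resource/{dataset_id}.json?{urlencode(params)}"

# "abcd-1234" is a placeholder -- each Socrata portal assigns its own dataset IDs.
url = soda_query_url(
    "data.cityofchicago.org", "abcd-1234",
    select="date, primary_type", where="year = 2025", limit=10,
)
print(url)
```

The catch is that every portal has its own domain and its own dataset IDs, so even this clean API only gets you one island at a time.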

FRED (Federal Reserve Economic Data) is the gold standard for what a single-domain data API should look like. Clean REST endpoints, consistent time series format, solid documentation. If you need GDP, unemployment, interest rates, or any of the 800,000+ economic time series the Federal Reserve tracks, FRED is the answer. But it only covers the Fed’s domain. Need air quality data? School enrollment figures? Crime statistics? You’re back to navigating individual agency portals.
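
To see why FRED is held up as the standard: fetching a series is one documented endpoint (`fred/series/observations`) with an API key, a series ID, and a format flag. A minimal request-building sketch:

```python
from urllib.parse import urlencode

FRED_BASE = "https://api.stlouisfed.org/fred/series/observations"

def fred_observations_url(series_id: str, api_key: str) -> str:
    """Build a FRED observations request; the response is a JSON time series."""
    return FRED_BASE + "?" + urlencode(
        {"series_id": series_id, "api_key": api_key, "file_type": "json"}
    )

url = fred_observations_url("UNRATE", "YOUR_KEY")  # UNRATE = US unemployment rate
print(url)
```

Consistent, discoverable, documented. The complaint isn't with FRED; it's that almost nothing outside the Fed's domain works this way.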

Kaggle is the community-driven dataset platform most data scientists encounter first. It’s excellent for ML competitions and exploratory datasets, and its notebooks feature lets you run analysis right next to the data. But it’s fundamentally a file dump. You download CSVs. Quality varies wildly because anyone can upload anything, and there’s no ingestion pipeline normalizing formats or validating schemas. For reproducible research that needs to stay current as sources update, Kaggle isn’t the right tool.

Hugging Face Datasets has seen massive growth, now hosting over 500,000 datasets. It’s become the default distribution channel for ML training data. The ecosystem integration is strong, especially with PyTorch and TensorFlow data loaders. But it prioritizes volume over curation, and the licensing metadata problem is serious. The Data Provenance Initiative’s audit, published in Nature Machine Intelligence, found that over 70% of datasets on popular hosting platforms had missing license information, and over 50% had incorrect metadata. A follow-up study found that over 80% of source content in widely-used datasets carries non-commercial restrictions, even when the dataset labels say otherwise.

Datasette, Simon Willison’s open-source SQLite explorer, takes a different approach entirely. Point it at a SQLite database and you get a queryable, browsable, API-enabled website for that dataset. It’s brilliant for publishing individual datasets. But Datasette is designed around single databases, not cross-source aggregation. If you need to query across BLS, Census, and EPA data simultaneously, you’d need to load everything into SQLite yourself first. That’s the hard part Datasette doesn’t try to solve.

All solid tools. None of them handle the cross-source problem: getting data out of dozens of sources with different formats and schemas, and making it queryable through one interface.

What’s Still Broken

The individual tools have gotten better. The space between them hasn’t.

Cross-source querying doesn’t exist

Combining BLS unemployment data with Census demographics and EPA air quality measurements still requires manual ETL. You download from each source, write transformation scripts, load into your own database, then query. Every researcher, journalist, and analyst who needs multi-source data does this independently, solving the same data wrangling problems from scratch.

This is wildly inefficient. A policy researcher in DC and a grad student in Austin and a journalist in New York are all writing essentially the same Python scripts to parse BLS directory listings, decode Census variable codes, and unzip EPA archives. The work gets done and then thrown away. It doesn’t accumulate into shared infrastructure.

Schema normalization is everyone’s problem and nobody’s product

Every data source invents its own naming conventions. BLS uses series IDs like CUSR0000SA0 that pack area codes, item codes, and seasonal adjustment flags into a single string. Census uses variable codes like B01001_001E. FRED uses tickers like UNRATE. The World Bank uses indicator codes like NY.GDP.MKTP.CD. These are all reasonable choices made by individual agencies for their own internal purposes.

But if you want to join unemployment data from BLS with GDP data from FRED and population data from the World Bank, you need to know that these three different naming systems are all talking about the same countries and time periods. That translation layer doesn’t exist in any of the tools listed above. You build it every time.
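
The translation layer amounts to a crosswalk: a mapping from each source's ID scheme to a shared concept, so records from different systems can be joined on common dimensions. A minimal sketch (the series IDs are real, but the crosswalk structure is illustrative, not any tool's actual schema):

```python
# Hypothetical crosswalk: map each source's ID scheme to one canonical concept.
CROSSWALK = {
    ("bls", "LNS14000000"): "us_unemployment_rate",   # BLS series ID
    ("fred", "UNRATE"): "us_unemployment_rate",        # FRED ticker
    ("worldbank", "SP.POP.TOTL"): "population_total",  # World Bank indicator
}

def normalize(source: str, series_id: str, period: str, value: float) -> dict:
    """Translate a source-specific record into canonical (concept, period, value)."""
    concept = CROSSWALK[(source, series_id)]
    return {"concept": concept, "period": period, "value": value}

a = normalize("bls", "LNS14000000", "2025-06", 4.1)
b = normalize("fred", "UNRATE", "2025-06", 4.1)
# Two sources, one concept: the records now join on (concept, period).
```

Trivial in miniature; the real work is maintaining that mapping across hundreds of sources as they evolve.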

Format fragmentation compounds everything else

CSV with various delimiters. Nested JSON with placeholder values (FRED uses a literal "." for missing data, which will break any numeric parser). Zipped archives with metadata files mixed in alongside the actual data. Tab-delimited flat files. PDF tables. Fixed-width text files from the 1990s.

The Bureau of Labor Statistics alone is a case study in format archaeology. Their data lives in directory listings that serve tab-delimited files organized by series prefix. Some files have header rows, some don’t. Period codes use M01 through M12 for months and M13 for annual averages. To even discover what files exist, you need to crawl HTML directory listings and parse the file naming conventions.
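
Even the period codes need a decoder before the data is usable. A small sketch of the M01–M13 convention described above:

```python
def parse_bls_period(year: str, period: str) -> dict:
    """Decode a BLS period code: M01-M12 are months, M13 is the annual average."""
    if period == "M13":
        return {"year": int(year), "frequency": "annual"}
    if period.startswith("M"):
        return {"year": int(year), "month": int(period[1:]), "frequency": "monthly"}
    raise ValueError(f"unrecognized period code: {period}")

monthly = parse_bls_period("2025", "M06")  # a June observation
annual = parse_bls_period("2025", "M13")   # the annual-average row
```

Ten lines, but every consumer of BLS flat files has written some version of them.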

This isn’t a criticism of BLS. They’ve been publishing data consistently for decades, which is more than most organizations can say. But the format choices that made sense for FTP distribution in 1996 create real friction for programmatic access in 2026.

What a Unified Layer Looks Like

The concept isn’t replacing existing portals. It’s sitting on top of them. A layer that speaks each portal’s language, normalizes the output, and serves it through a consistent API. Think of it like a package manager for public datasets: you don’t need to know how each data source distributes its files, you just need to know the dataset name.

This is what OpenData is building. It’s open source (github.com/rileyhilliard/opendata), Apache 2.0 licensed, and designed around a specific workflow.

You point the CLI at a data source URL:

opendata discover https://download.bls.gov/pub/time.series/cu/

The discover command analyzes the source. It detects the format (directory listing of tab-delimited files, in this case), identifies the structure, and generates a YAML config with reasonable defaults. That config specifies how to fetch, parse, and transform the data.

Then opendata add runs the config through the ingestion pipeline: it fetches the data, applies transforms (renaming cryptic column codes, casting types, filtering placeholder values), and stores the result as Parquet files.

Now it’s queryable through a REST API:

curl "https://opendata.place/api/v1/datasets/bls/cpi-u/query?view=national&limit=5"

From a directory listing of tab-delimited files to a clean JSON API response. The BLS didn’t change anything. The data is the same data. The difference is that someone wrote a connector config once, and now anyone can query it without understanding BLS’s internal file organization.

The same pattern works for every source. Census variable codes get renamed to human-readable names during ingestion (B01001_001E becomes total_population). FRED’s placeholder dots get filtered out. World Bank’s zipped CSV archives get unpacked and the metadata files get discarded. The quirks of each source are handled once, in the connector config, not repeatedly by every person who wants to use the data.
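
In spirit, the ingestion transforms boil down to a loop like the following. This is an illustrative sketch, not OpenData's actual pipeline code; the rename map and rules are assumptions standing in for what a connector config would declare:

```python
# Illustrative transform step: rename map and rules are assumptions,
# not OpenData's actual config schema.
RENAMES = {"B01001_001E": "total_population"}

def transform(rows: list[dict]) -> list[dict]:
    """Rename cryptic codes, cast numerics, and drop placeholder values."""
    out = []
    for row in rows:
        row = {RENAMES.get(k, k): v for k, v in row.items()}
        if row.get("value") == ".":      # FRED's missing-data placeholder
            continue
        if "value" in row:
            row["value"] = float(row["value"])
        out.append(row)
    return out

raw = [
    {"date": "2025-01-01", "value": "4.1"},
    {"date": "2025-02-01", "value": "."},             # dropped
    {"B01001_001E": "331000000", "date": "2025-01-01"},  # renamed
]
clean = transform(raw)
```

The point isn't the code; it's that this logic lives in one connector instead of in every downstream notebook.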

Three Tiers of Connectors

Not all data sources are equally complex to ingest. The architectural approach needs to account for that range.

Native Python connectors handle sources with complex access patterns that can’t be expressed in configuration alone. BLS requires crawling HTML directory listings to discover individual files, then parsing file naming conventions to understand what each file contains. Census has a paginated API with variable code translation that requires cross-referencing separate metadata endpoints. FRED needs API key authentication and returns nested JSON with placeholder values that need special handling. World Bank distributes zipped CSVs with metadata files mixed in and header rows that need skipping. Treasury has daily CSV downloads with date-based URL patterns. Each of these has a dedicated Python connector that encapsulates the source’s specific quirks.

Declarative YAML connectors handle sources with simpler but non-trivial patterns. A dataset from the ECB, OECD, or FAOSTAT might just need configuration for header row skipping, file pattern matching, column renaming, and type casting. No custom code required. You write a YAML file that describes where the data lives, what format it’s in, and how to transform it. The generic connector handles the rest.
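
A declarative connector for one of these sources might look something like this. The field names here are hypothetical, chosen to illustrate the shape, not OpenData's actual config schema:

```yaml
# Illustrative connector config -- field names are hypothetical.
source:
  url: https://example.org/data/indicators.csv
  format: csv
parse:
  skip_header_rows: 2
  delimiter: ","
transform:
  rename:
    OBS_VALUE: value
    TIME_PERIOD: period
  cast:
    value: float
  filter:
    drop_if: "value == ''"
```

Everything a reviewer needs to audit, and nothing a contributor needs to compile.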

The generic HTTP connector handles everything straightforward. Our World in Data CSVs, NOAA weather data, FiveThirtyEight datasets. The access pattern is simple (download a file, parse it), and any necessary transforms are all YAML configuration: rename this column, cast that type, filter these rows.

This tiered approach means adding a new dataset ranges from “write a YAML file” (the common case) to “write a Python connector” (complex sources like BLS or Census). Both paths produce the same output: a queryable Parquet file served through the same API. Currently, OpenData has over 200 datasets from 30+ providers, all defined as YAML configs in the repository’s datasets/ directory. Anyone can submit a new one.

Where This Is Going

A few patterns are emerging across the ecosystem, independent of any single project.

Version control for data is an unsolved problem at scale. Datasets change. BLS publishes revisions to previous months’ numbers. Census updates population estimates. FRED adjusts seasonal factors retroactively. Git solved this for code decades ago: you can see exactly what changed, when, and why. For data, the best most platforms offer is “here’s the latest version.” The ability to query a dataset as it existed at a specific point in time, or to diff two versions row by row, is infrastructure that doesn’t really exist yet. Some projects are starting to chip away at it (DVC for ML pipelines, LakeFS for data lakes), but nothing has the simplicity of git log for datasets.
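
What a row-level diff would even mean is easy to sketch: key each observation on its identifying dimensions, then compare values across snapshots. A toy version (the snapshot format here is an assumption for illustration):

```python
# Sketch of a row-level diff between two dataset snapshots, keyed on the
# dimensions that identify an observation (series id + period here).
def diff_versions(old: list[dict], new: list[dict], key=("series", "period")):
    """Return observations whose value changed between two snapshots."""
    index = {tuple(r[k] for k in key): r["value"] for r in old}
    changes = []
    for row in new:
        k = tuple(row[dim] for dim in key)
        if k in index and index[k] != row["value"]:
            changes.append({"key": k, "old": index[k], "new": row["value"]})
    return changes

v1 = [{"series": "LNS14000000", "period": "2025-05", "value": 4.0}]
v2 = [{"series": "LNS14000000", "period": "2025-05", "value": 4.1}]  # BLS revision
changes = diff_versions(v1, v2)
```

Doing this at scale, across billions of rows and arbitrary schemas, is the part nobody has made as easy as `git log`.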

Cross-dataset discovery is a harder problem. If you’re looking at unemployment data, you probably also want CPI, GDP, and population data to contextualize it. Automated relationship detection between datasets from different sources, based on shared dimensions like geography and time periods, would save researchers the manual work of figuring out which datasets pair well together. This is graph problem territory, and it’s where the interesting work will happen over the next few years.
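
The naive version of that relationship detection is just schema intersection: two datasets that share geography and time columns are candidates for joining. A sketch, with illustrative column names:

```python
# Naive relatedness heuristic: datasets sharing dimension columns
# (geography, time) are join candidates. Column names are illustrative.
DIMENSIONS = {"state_fips", "county_fips", "country_code", "year", "month"}

def shared_dimensions(schema_a: set[str], schema_b: set[str]) -> set[str]:
    """Columns both datasets expose that look like joinable dimensions."""
    return schema_a & schema_b & DIMENSIONS

unemployment = {"state_fips", "year", "month", "unemployment_rate"}
air_quality = {"state_fips", "county_fips", "year", "month", "pm25"}
shared = shared_dimensions(unemployment, air_quality)
```

The hard versions involve fuzzy matching across naming conventions and inferring semantics, not just string-equal column names, which is why it's graph problem territory.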

Community-driven data curation follows the Wikipedia model. Any individual or organization can submit a dataset config. The community reviews it, improves the transform rules, adds query views for common use cases, and flags quality issues. OpenData’s dataset configs are just YAML files in a git repository. Submitting a new dataset is a pull request. Improving an existing one is another pull request. The data itself stays at the source. What gets version-controlled is the recipe for accessing and normalizing it.

Simon Willison has been working toward something similar through Datasette and his broader advocacy for data infrastructure as a public good. The CKAN community’s push toward “AI-Ready Data Infrastructure” points the same direction. These efforts are complementary, not competing.

The Layer That’s Missing

Every tool in this landscape solves a piece of the problem. None of them bridge the gaps between pieces. The missing layer ingests from any source, normalizes the schema, and serves it through one interface. Not replacing existing portals, but connecting them.

Building that layer means understanding the quirks of hundreds of data sources, maintaining connectors as those sources change, and building community around data curation the way open source built community around code. It’s a large, ongoing engineering problem. But it’s the kind that gets easier with every connector someone contributes and every dataset config that gets shared instead of rewritten from scratch.

Riley Hilliard

Creator of OpenData

At 13, I secretly drilled holes in my parents' wood floor to route a 56k modem line to my bedroom for late-night Age of Empires marathons. That same scrappy curiosity carried through 3 acquisitions, 9 years as a LinkedIn Staff Engineer building infrastructure for 1B+ users, and now fuels my side projects, like OpenData.
