There’s no shortage of places to find public data in 2026. data.gov lists 300,000+ datasets. Hugging Face hosts over 500,000. Kaggle, Socrata, FRED, Datasette, CKAN, and a growing list of specialized portals each serve their own corner of the ecosystem.
The tooling has gotten genuinely better. But a specific problem has stubbornly persisted: if you need data from more than one source, you’re on your own. Each tool is an island. Combining BLS unemployment figures with Census demographics and EPA air quality data still means writing the same glue code that people were writing five years ago.
Here’s where the landscape stands, what each tool does well, and where the gaps remain.
The Landscape Today
A tour of the major players: what each one does well, and where it stops.
data.gov is the US government’s central data catalog. It lists over 300,000 datasets harvested from federal agencies, powered by CKAN under the hood. It’s good at what it is: a directory. You search for a dataset, find a listing, and get redirected to whatever agency hosts it. The problem is that “whatever agency hosts it” means a different portal, a different format, and a different access pattern every time. data.gov tells you the data exists. It doesn’t give you a consistent way to get it.
Socrata (now Tyler Technologies’ Data & Insights division) and CKAN are the hosting platforms underneath many city, state, and federal portals. Socrata provides SODA, a SQL-like query API that lets you filter and retrieve data programmatically. They recently shipped SODA3 with better performance and caching. CKAN is the open-source equivalent, powering data portals across the US, UK, Canada, and the EU. Both platforms solved the “where do we put government data” problem for individual agencies. Neither solved the “how do I combine data from five different portals” problem. Each portal is its own island.
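The SODA pattern is worth seeing concretely. Every Socrata portal exposes datasets at `/resource/<id>.json` and accepts standard `$select`, `$where`, and `$limit` parameters. A minimal sketch of building such a query; the portal host and resource ID here are hypothetical:

```python
from urllib.parse import urlencode

def soda_query_url(host, resource_id, select=None, where=None, limit=1000):
    """Build a SODA query URL from the standard $-prefixed parameters."""
    params = {"$limit": limit}
    if select:
        params["$select"] = ",".join(select)
    if where:
        params["$where"] = where
    return f"https://{host}/resource/{resource_id}.json?{urlencode(params)}"

# Hypothetical host and resource ID -- each portal assigns its own.
url = soda_query_url(
    "data.example.gov", "abcd-1234",
    select=["year", "value"],
    where="year >= 2020",
    limit=50,
)
```

This is a genuinely good API. The catch, as above, is that each portal is a separate deployment with its own resource IDs and column names.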
FRED (Federal Reserve Economic Data) is the gold standard for what a single-domain data API should look like. Clean REST endpoints, consistent time series format, solid documentation. If you need GDP, unemployment, interest rates, or any of the 800,000+ economic time series the Federal Reserve tracks, FRED is the answer. But it only covers the Fed’s domain. Need air quality data? School enrollment figures? Crime statistics? You’re back to navigating individual agency portals.
Kaggle is the community-driven dataset platform most data scientists encounter first. It’s excellent for ML competitions and exploratory datasets, and its notebooks feature lets you run analysis right next to the data. But it’s fundamentally a file dump. You download CSVs. Quality varies wildly because anyone can upload anything, and there’s no ingestion pipeline normalizing formats or validating schemas. For reproducible research that needs to stay current as sources update, Kaggle isn’t the right tool.
Hugging Face Datasets has seen massive growth, now hosting over 500,000 datasets. It’s become the default distribution channel for ML training data. The ecosystem integration is strong, especially with PyTorch and TensorFlow data loaders. But it prioritizes volume over curation, and the licensing metadata problem is serious. The Data Provenance Initiative’s audit, published in Nature Machine Intelligence, found that over 70% of datasets on popular hosting platforms had missing license information, and over 50% had incorrect metadata. A follow-up study found that over 80% of source content in widely-used datasets carries non-commercial restrictions, even when the dataset labels say otherwise.
Datasette, Simon Willison’s open-source SQLite explorer, takes a different approach entirely. Point it at a SQLite database and you get a queryable, browsable, API-enabled website for that dataset. It’s brilliant for publishing individual datasets. But Datasette is designed around single databases, not cross-source aggregation. If you need to query across BLS, Census, and EPA data simultaneously, you’d need to load everything into SQLite yourself first. That’s the hard part Datasette doesn’t try to solve.
All solid tools. None of them handle the cross-source problem: getting data out of dozens of sources with different formats and schemas, and making it queryable through one interface.
What’s Still Broken
The individual tools have gotten better. The space between them hasn’t.
Cross-source querying doesn’t exist
Combining BLS unemployment data with Census demographics and EPA air quality measurements still requires manual ETL. You download from each source, write transformation scripts, load into your own database, then query. Every researcher, journalist, and analyst who needs multi-source data does this independently, solving the same data wrangling problems from scratch.
This is wildly inefficient. A policy researcher in DC and a grad student in Austin and a journalist in New York are all writing essentially the same Python scripts to parse BLS directory listings, decode Census variable codes, and unzip EPA archives. The work gets done and then thrown away. It doesn’t accumulate into shared infrastructure.
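The heart of those throwaway scripts is always the same operation: join observations from two sources on a shared time key. A sketch of that glue code, with hypothetical values standing in for parsed BLS and Census downloads:

```python
def merge_on_period(*sources):
    """Inner-join several {(year, month): value} series into rows of
    (year, month, v1, v2, ...) -- the join every multi-source script rewrites."""
    shared = set.intersection(*(set(s) for s in sources))
    return [
        (year, month, *(s[(year, month)] for s in sources))
        for (year, month) in sorted(shared)
    ]

# Illustrative values, not real figures.
unemployment = {(2025, 11): 4.1, (2025, 12): 4.0, (2026, 1): 3.9}
population   = {(2025, 12): 341.2, (2026, 1): 341.4, (2026, 2): 341.5}

rows = merge_on_period(unemployment, population)
# Only the overlapping periods survive the inner join.
```

Ten lines, written thousands of times, shared zero times.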
Schema normalization is everyone’s problem and nobody’s product
Every data source invents its own naming conventions. BLS uses series IDs like CUSR0000SA0 that pack area codes, item codes, and seasonal adjustment flags into a single string. Census uses variable codes like B01001_001E. FRED uses tickers like UNRATE. The World Bank uses indicator codes like NY.GDP.MKTP.CD. These are all reasonable choices made by individual agencies for their own internal purposes.
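To make the BLS case concrete, a CPI series ID can be unpacked positionally. The field layout below follows BLS's published CPI series ID structure (2-character survey prefix, seasonal adjustment flag, periodicity code, 4-digit area code, item code); the code comments gloss the codes as BLS documents them:

```python
def parse_cpi_series_id(series_id):
    """Split a BLS CPI series ID into its packed fields."""
    return {
        "survey": series_id[0:2],     # CU = CPI, All Urban Consumers
        "seasonal": series_id[2],     # S = seasonally adjusted, U = not
        "periodicity": series_id[3],  # R = monthly, S = semiannual
        "area": series_id[4:8],       # 0000 = U.S. city average
        "item": series_id[8:],        # SA0 = all items
    }

fields = parse_cpi_series_id("CUSR0000SA0")
```

Reasonable internal design. Opaque to anyone who hasn't read the BLS format documentation.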
But if you want to join unemployment data from BLS with GDP data from FRED and population data from the World Bank, you need to know that these three different naming systems are all talking about the same countries and time periods. That translation layer doesn’t exist in any of the tools listed above. You build it every time.
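A minimal version of that translation layer is just a hand-maintained table mapping each source's code to a shared concept. The series codes below are the commonly used ones for each source; a real registry would also need geography and unit mappings:

```python
# Toy concept registry: each canonical concept maps to the code
# each source uses for it.
CONCEPTS = {
    "unemployment_rate": {"bls": "LNS14000000", "fred": "UNRATE"},
    "gdp_current_usd":   {"fred": "GDP", "worldbank": "NY.GDP.MKTP.CD"},
}

def resolve(source, code):
    """Find which canonical concept a source-specific code refers to."""
    for concept, codes in CONCEPTS.items():
        if codes.get(source) == code:
            return concept
    return None
```

This is exactly the kind of thing that gets rebuilt in every research project and never shared.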
Format fragmentation compounds everything else
CSV with various delimiters. Nested JSON with placeholder values (FRED uses a literal "." for missing data, which breaks any numeric parser that isn't expecting it). Zipped archives with metadata files mixed in alongside the actual data. Tab-delimited flat files. PDF tables. Fixed-width text files from the 1990s.
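The FRED dot is a one-line fix once you know it's there. The shape below mirrors FRED's observations payload, a list of objects with string values and "." for missing data:

```python
def clean_fred_observations(observations):
    """FRED returns values as strings and uses a literal "." for
    missing data; drop placeholders and cast the rest to float."""
    return [
        {"date": obs["date"], "value": float(obs["value"])}
        for obs in observations
        if obs["value"] != "."
    ]

raw = [
    {"date": "2020-01-01", "value": "3.5"},
    {"date": "2020-02-01", "value": "."},   # missing observation
    {"date": "2020-03-01", "value": "4.4"},
]
cleaned = clean_fred_observations(raw)
```

Trivial once known, invisible until your parser throws mid-pipeline.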
The Bureau of Labor Statistics alone is a case study in format archaeology. Their data lives in directory listings that serve tab-delimited files organized by series prefix. Some files have header rows, some don’t. Period codes use M01 through M12 for months and M13 for annual averages. To even discover what files exist, you need to crawl HTML directory listings and parse the file naming conventions.
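The period-code convention alone is a small parser. A sketch handling the M01–M13 scheme described above; the tab-delimited column layout here is simplified for illustration:

```python
def parse_bls_period(year, period):
    """BLS period codes: M01-M12 are calendar months, M13 is the
    annual average. Returns (year, month) or (year, 'annual')."""
    if not period.startswith("M"):
        raise ValueError(f"unexpected period code: {period}")
    month = int(period[1:])
    if month == 13:
        return (int(year), "annual")
    if 1 <= month <= 12:
        return (int(year), month)
    raise ValueError(f"unexpected period code: {period}")

# Simplified tab-delimited row: series_id, year, period, value.
row = "CUSR0000SA0\t2025\tM13\t321.5".split("\t")
key = parse_bls_period(row[1], row[2])
```

Miss the M13 convention and your "monthly" dataset silently contains thirteen observations per year.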
This isn’t a criticism of BLS. They’ve been publishing data consistently for decades, which is more than most organizations can say. But the format choices that made sense for FTP distribution in 1996 create real friction for programmatic access in 2026.
What a Unified Layer Looks Like
The concept isn’t replacing existing portals. It’s sitting on top of them. A layer that speaks each portal’s language, normalizes the output, and serves it through a consistent API. Think of it like a package manager for public datasets: you don’t need to know how each data source distributes its files, you just need to know the dataset name.
This is what OpenData is building. It’s open source (github.com/rileyhilliard/opendata), Apache 2.0 licensed, and designed around a specific workflow.
You point the CLI at a data source URL:
opendata discover https://download.bls.gov/pub/time.series/cu/
The discover command analyzes the source. It detects the format (directory listing of tab-delimited files, in this case), identifies the structure, and generates a YAML config with reasonable defaults. That config specifies how to fetch, parse, and transform the data.
Then opendata add runs the config through the ingestion pipeline: it fetches the data, applies transforms (renaming cryptic column codes, casting types, filtering placeholder values), and stores the result as Parquet files.
Now it’s queryable through a REST API:
curl "https://opendata.place/api/v1/datasets/bls/cpi-u/query?view=national&limit=5"
From a directory listing of tab-delimited files to a clean JSON API response. The BLS didn’t change anything. The data is the same data. The difference is that someone wrote a connector config once, and now anyone can query it without understanding BLS’s internal file organization.
The same pattern works for every source. Census variable codes get renamed to human-readable names during ingestion (B01001_001E becomes total_population). FRED’s placeholder dots get filtered out. World Bank’s zipped CSV archives get unpacked and the metadata files get discarded. The quirks of each source are handled once, in the connector config, not repeatedly by every person who wants to use the data.
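The renaming step itself is trivial, which is the point: it only needs writing once per source. A sketch with a hypothetical rename map (the second Census variable and the sample values are illustrative):

```python
# Per-source rename map, the kind of thing a connector config carries.
CENSUS_RENAMES = {
    "B01001_001E": "total_population",
    "B19013_001E": "median_household_income",
}

def apply_renames(record, renames):
    """Rename cryptic source columns to human-readable names,
    leaving unmapped columns untouched."""
    return {renames.get(col, col): val for col, val in record.items()}

raw = {"B01001_001E": 331449281, "NAME": "United States"}
clean = apply_renames(raw, CENSUS_RENAMES)
```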
Three Tiers of Connectors
Not all data sources are equally complex to ingest. The architectural approach needs to account for that range.
Native Python connectors handle sources with complex access patterns that can’t be expressed in configuration alone. BLS requires crawling HTML directory listings to discover individual files, then parsing file naming conventions to understand what each file contains. Census has a paginated API with variable code translation that requires cross-referencing separate metadata endpoints. FRED needs API key authentication and returns nested JSON with placeholder values that need special handling. World Bank distributes zipped CSVs with metadata files mixed in and header rows that need skipping. Treasury has daily CSV downloads with date-based URL patterns. Each of these has a dedicated Python connector that encapsulates the source’s specific quirks.
Declarative YAML connectors handle sources with simpler but non-trivial patterns. A dataset from the ECB, OECD, or FAOSTAT might just need configuration for header row skipping, file pattern matching, column renaming, and type casting. No custom code required. You write a YAML file that describes where the data lives, what format it’s in, and how to transform it. The generic connector handles the rest.
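A sketch of what such a config might look like; the schema and field names here are illustrative, not OpenData's actual format:

```yaml
# Hypothetical declarative connector config -- field names illustrative.
id: example/annual-indicator
source:
  url: https://stats.example.org/downloads/indicator.csv
  format: csv
  skip_header_rows: 2
transform:
  rename:
    OBS_VALUE: value
    TIME_PERIOD: year
  cast:
    value: float
    year: int
  filter:
    drop_if_equals:
      value: ".."
```

The appeal of the declarative path is reviewability: a pull request changing a YAML file is far easier to audit than one changing parsing code.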
The generic HTTP connector handles everything straightforward. Our World in Data CSVs, NOAA weather data, FiveThirtyEight datasets. The access pattern is simple (download a file, parse it), and any necessary transforms are all YAML configuration: rename this column, cast that type, filter these rows.
This tiered approach means adding a new dataset ranges from “write a YAML file” (the common case) to “write a Python connector” (complex sources like BLS or Census). Both paths produce the same output: a queryable Parquet file served through the same API. Currently, OpenData has over 200 datasets from 30+ providers, all defined as YAML configs in the repository’s datasets/ directory. Anyone can submit a new one.
Where This Is Going
A few patterns are emerging across the ecosystem, independent of any single project.
Version control for data is an unsolved problem at scale. Datasets change. BLS publishes revisions to previous months’ numbers. Census updates population estimates. FRED adjusts seasonal factors retroactively. Git solved this for code decades ago: you can see exactly what changed, when, and why. For data, the best most platforms offer is “here’s the latest version.” The ability to query a dataset as it existed at a specific point in time, or to diff two versions row by row, is infrastructure that doesn’t really exist yet. Some projects are starting to chip away at it (DVC for ML pipelines, LakeFS for data lakes), but nothing has the simplicity of git log for datasets.
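A row-level diff is conceptually simple once rows have a stable key. A sketch of what a git-log-for-data primitive might compute, assuming each observation is keyed by its period (the values are illustrative):

```python
def diff_versions(old, new):
    """Diff two versions of a keyed dataset ({key: value}) into
    added, removed, and revised keys -- the BLS-revision case."""
    added   = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    revised = sorted(k for k in set(old) & set(new) if old[k] != new[k])
    return {"added": added, "removed": removed, "revised": revised}

# A typical revision cycle: December's figure was revised,
# January's was appended.
v1 = {"2025-11": 4.1, "2025-12": 4.0}
v2 = {"2025-11": 4.1, "2025-12": 3.9, "2026-01": 3.9}
changes = diff_versions(v1, v2)
```

The hard part isn't the diff; it's storing and indexing every historical version so queries like this are cheap at scale.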
Cross-dataset discovery is a harder problem. If you’re looking at unemployment data, you probably also want CPI, GDP, and population data to contextualize it. Automated relationship detection between datasets from different sources, based on shared dimensions like geography and time periods, would save researchers the manual work of figuring out which datasets pair well together. This is graph problem territory, and it’s where the interesting work will happen over the next few years.
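A crude first cut at relationship detection: score dataset pairs by overlap in their dimension columns, assuming each dataset declares which of its columns are dimensions (the dimension sets below are illustrative):

```python
def dimension_overlap(a_dims, b_dims):
    """Jaccard overlap between two datasets' dimension sets -- a naive
    signal for 'these datasets can probably be joined'."""
    a, b = set(a_dims), set(b_dims)
    return len(a & b) / len(a | b) if a | b else 0.0

unemployment = {"state", "year", "month"}
air_quality  = {"state", "county", "year", "month"}
stock_prices = {"ticker", "date"}

score_related   = dimension_overlap(unemployment, air_quality)   # joinable on state/year/month
score_unrelated = dimension_overlap(unemployment, stock_prices)  # no shared dimensions
```

The real version needs entity resolution too ("NY" vs "New York" vs FIPS 36), which is why this is graph territory rather than a weekend project.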
Community-driven data curation follows the Wikipedia model. Any individual or organization can submit a dataset config. The community reviews it, improves the transform rules, adds query views for common use cases, and flags quality issues. OpenData’s dataset configs are just YAML files in a git repository. Submitting a new dataset is a pull request. Improving an existing one is another pull request. The data itself stays at the source. What gets version-controlled is the recipe for accessing and normalizing it.
Simon Willison has been working toward something similar through Datasette and his broader advocacy for data infrastructure as a public good. The CKAN community’s push toward “AI-Ready Data Infrastructure” points the same direction. These efforts are complementary, not competing.
The Layer That’s Missing
Every tool in this landscape solves a piece of the problem. None of them bridge the gaps between pieces. The missing layer ingests from any source, normalizes the schema, and serves it through one interface. Not replacing existing portals, but connecting them.
Building that layer means understanding the quirks of hundreds of data sources, maintaining connectors as those sources change, and building community around data curation the way open source built community around code. It’s a large, ongoing engineering problem. But it’s the kind that gets easier with every connector someone contributes and every dataset config that gets shared instead of rewritten from scratch.