There are over 300,000 public datasets on data.gov alone. Teacher salaries, poverty rates, workplace injuries, census demographics. All published by government agencies, all technically “open.”
Try to actually use one. You’ll find ZIP files inside ZIP files. Excel spreadsheets where the columns move every year. Files that changed format halfway through their history. Documentation links that 404. Most people give up and use the most recent year or two, because going further back isn’t worth the fight.
We built OpenData to fix that. Here’s the same dataset, before and after.
Same data. The top block is one row from the actual source file: 7,691 semicolon-delimited fields, no headers, no documentation. The bottom is the same dataset on OpenData. Clean JSON. Sortable. Filterable by year, school, city, or any column. Change the year to 1997 and it still works. One API call, one consistent format, spanning 28 years of data that originally came from 29 different files in two completely different formats.
The state of public data in 2026
A researcher studying school funding trends needs data from their state education board. A journalist fact-checking a claim about poverty rates needs Census data. A policy analyst comparing teacher salaries across states needs Bureau of Labor Statistics files.
All of it is public and free. Almost none of it is usable.
Every agency publishes data differently. The Census Bureau labels columns with codes like B01001_001E (that means “Total Population”). The BLS serves files through directory listings that look like they were built in 1996. FRED uses a period for missing values. State agencies change their file formats every few years without notice.
So researchers spend most of their time cleaning data and almost none of it doing research. Most give up and use only the most recent 2-3 years of any dataset, because going further back means fighting format changes, broken links, and inconsistent schemas. The historical picture, usually the most valuable part, goes unused.
That waste is what we’re building OpenData to fix.
What “publicly available” actually looks like
Let’s make this concrete. The Illinois State Board of Education publishes an annual “Report Card” for every public school in the state: enrollment, attendance, graduation rates, teacher staffing, test scores. It’s one of the best state-level education datasets in the country.
If you want to see what working with this data looks like before OpenData, visit the source yourself. You’ll find over 200 disorganized individual download links. Some are ZIP files containing text files. Others are Excel spreadsheets and even PDF printouts. The formats changed in 2018. The column layouts changed almost every year.
A reasonable person might expect to download these files, load them into a database, and start asking questions. An afternoon of work, maybe a day.
Here’s what you actually get: each year’s file has over 600 columns. Every school in the state gets a single row, and every metric from enrollment to test scores to salary gets its own column. Finding “Teacher Avg Salary” means scanning hundreds of unlabeled columns, and the position it lives in changed almost every year. Each data point in the chart below had to be pulled from a different file, hunting through a different column layout. That’s a ton of work just to extract a single trend line.
That chart is for one field. Attendance rate moved 8 times. Enrollment moved 5 times. Every metric had its own migration path. And if you get even one position wrong, the data still looks plausible. You just get the wrong numbers. No error message. No warning.
This isn’t a technology problem. Government agencies operate on legislative cycles, not software release cycles. When a new reporting requirement gets added, someone inserts a column in the spreadsheet. Nobody publishes a changelog. Nobody updates a schema version. The documentation for half the years we needed wasn’t even online anymore.
ISBE isn’t uniquely bad. They actually publish more data than most state agencies. This is just how public data works in 2026.
Solving it once
We spent days mapping every column for every year, verifying values against known ground-truth data, normalizing formats, and reconciling different column availability across eras. The result is 28 years of Illinois school data, queryable through a single endpoint.
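To give a sense of what that mapping work looks like, here’s a minimal sketch. The column positions below are illustrative, not the actual ISBE layouts, but the shape of the problem is exactly this: a hand-verified map per year, per metric.

```typescript
// Illustrative only: these positions are not the real ISBE layouts, but every
// year needs a map like this, verified against known ground-truth values.
type ColumnMap = Record<string, number>;

const reportCardLayouts: Record<number, ColumnMap> = {
  1999: { schoolId: 0, teacherAvgSalary: 412 }, // semicolon-delimited text era
  2005: { schoolId: 0, teacherAvgSalary: 438 }, // an upstream reporting change shifted everything
  2019: { schoolId: 2, teacherAvgSalary: 511 }, // post-2018 Excel era, new layout entirely
};

function extractTeacherSalary(year: number, fields: string[]): number | null {
  const layout = reportCardLayouts[year];
  if (!layout) return null; // no verified layout for this year
  const raw = fields[layout.teacherAvgSalary];
  // Some years include dollar signs and thousands separators; strip them.
  const value = Number(raw.replace(/[$,]/g, ""));
  return Number.isFinite(value) ? value : null;
}
```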
Once it’s clean, you can start asking questions that span decades. Here’s one: have Illinois teacher salaries kept up with inflation? To answer that, you need salary data from ISBE and consumer price index data from the Bureau of Labor Statistics. Two agencies, two completely different data formats.
This chart combines data from two different agencies (ISBE and BLS) that publish in completely different formats. The salary data came from 29 files across two formats. The CPI data came from a separate BLS time series. On OpenData, pulling both into one visualization is just two API calls.
118,971 school records. 28 years. Every school in Illinois. The API doesn’t care about the format changes, the column shifts, or the dollar signs that appeared in some years and not others. It returns numbers.
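Here’s a rough sketch of those two calls and the inflation adjustment, with endpoint paths, dataset slugs, and field names simplified for illustration rather than taken from the documented API:

```typescript
// Simplified for illustration: paths, dataset slugs, and field names are
// placeholders, not the documented OpenData API.
const BASE = "https://api.tryopendata.ai";

async function getRows(path: string): Promise<{ year: number; value: number }[]> {
  const res = await fetch(`${BASE}${path}`);
  if (!res.ok) throw new Error(`${path} failed with ${res.status}`);
  return (await res.json()).rows;
}

// One call per source: ISBE average teacher salary by year, BLS CPI by year.
const salaries = await getRows("/datasets/isbe/report-card?metric=teacher_avg_salary&group_by=year");
const cpi = await getRows("/datasets/bls/cpi-u?group_by=year");

// Express every year's average salary in the latest year's dollars.
const cpiByYear = new Map(cpi.map(r => [r.year, r.value]));
const latestCpi = cpi[cpi.length - 1].value;

const realSalaries = salaries.map(r => ({
  year: r.year,
  nominal: r.value,
  real: r.value * (latestCpi / (cpiByYear.get(r.year) ?? latestCpi)),
}));
```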
This cleanup only needs to happen once. The next person who needs Illinois school data just calls the API.
GitHub for datasets
GitHub solved this problem for code. Before GitHub, sharing code meant emailing ZIP files or hosting tarballs on personal websites. The code existed. Collaboration was theoretically possible. But the friction was high enough that most people didn’t bother.
GitHub didn’t invent version control. It made collaboration so easy that it became the default. A URL, a README, a pull request. The infrastructure disappeared and the work became visible.
OpenData is building the same thing for public data, in phases. Today we’re in the first phase: absorbing the complexity that sits between raw government files and usable, queryable datasets. We ingest sources, normalize them, and serve them through a consistent API. Users can already create custom views on top of any dataset, filtering and reshaping the data without touching the underlying source.
The next phases open this up. Users will be able to contribute and maintain their own datasets, the way developers maintain repositories on GitHub. Fork a dataset, improve it, share it back. Combined with the cross-dataset connections we’re building (more on that below), the catalog becomes a collaborative resource that gets better as more people use it.
The catalog already spans multiple agencies, each with its own format quirks. The BLS has files that exceed 3 gigabytes. Census data uses geographic codes that don’t match education data. Every one went through its own version of the ISBE cleanup. On OpenData, they all work the same way:
Same API. Same filtering syntax. Same response format. The messiness of each source is absorbed during ingestion and never exposed to the person asking the question.
What becomes possible
Once data is clean and queryable, you can start asking real questions. We wrote an analysis of chronic absence in Illinois schools using this exact dataset, cross-referenced with Census poverty data and national test scores.
The findings: one in four Illinois students is chronically absent. The gap between the richest and poorest schools tripled after COVID. Majority-Black schools never recovered. Schools lost 153,000 students in eight years.
Most researchers settle for 2-3 years of data because the cleanup work exceeds their deadline. Having 28 years is what makes trends visible and claims defensible. We could do this analysis in an afternoon because the data prep was already done.
Cross-dataset connections
Individual datasets only tell part of the story. The real value shows up when you start connecting them.
The chronic absence analysis became sharper when we cross-referenced school data with Census poverty estimates. We could show that chronic absence tracks with poverty, not school quality. That finding required combining data from two different agencies that use different geographic identifiers and publish in completely different formats.
On a traditional portal, making that connection is a research project in itself. On OpenData, it’s a query that joins two datasets.
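A cross-dataset join like the one behind that finding can be expressed as a single SQL query posted to the API. The table names, join keys, and endpoint below are simplified for illustration, not the actual schema:

```typescript
// Illustrative: the table names, join keys, and /sql endpoint shape are
// simplified placeholders, not the actual OpenData schema.
const sql = `
  SELECT s.district_name,
         s.chronic_absence_rate,
         p.child_poverty_rate
  FROM   isbe_report_card AS s
  JOIN   census_saipe     AS p
    ON   s.county_fips = p.county_fips
   AND   s.year        = p.year
  WHERE  s.year = 2023
`;

const res = await fetch("https://api.tryopendata.ai/sql", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ query: sql }),
});
const joined = await res.json();
```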
GitHub’s real value was never hosting individual repositories. It’s the network effects: one repo depends on another, you discover related projects through the dependency graph, a change in one library ripples through thousands of projects.
We’re building the same connective tissue for datasets. A knowledge graph that detects relationships between datasets from different providers, based on shared columns, overlapping time periods, and semantic similarity. Here’s what the current graph looks like:
School data connects to poverty estimates through geographic identifiers. Poverty data connects to employment data through state and county codes. Labor statistics connect back to education data through demographic dimensions. Add one more dataset to that chain, say county-level health outcomes, and it connects to all three. The catalog gets more useful every time it grows.
Complementary, not competitive
Data.gov has 300,000 datasets. The World Bank has 20,000. The EU Open Data Portal has 1.6 million. None of them know the others exist.
Search for “poverty rates by state” on data.gov and you’ll find Census datasets. You won’t find the World Bank’s comparable poverty data, the BLS employment data that correlates with it, or the state education data that tracks alongside it. Each portal is a silo. The data exists across all of them, but discovery stops at the portal boundary.
OpenData solves this by centralizing the decentralized open data ecosystem. We pull from data.gov, the World Bank, BLS, Census, state agencies, and dozens of other sources into a single searchable catalog. When you search for poverty data on OpenData, you find datasets from every agency that publishes it, with the connections between them already mapped.
We’re not competing with these portals. They’re the source of truth. We’re the layer that connects them and makes their data actually usable. When we clean the ISBE Report Card, more people use ISBE data. When we index World Bank datasets alongside Census data, researchers find connections they never would have on either portal alone.
The closer analogy is the relationship between npm and GitHub. npm didn’t replace GitHub. It made packages from GitHub repositories accessible in a standardized way. OpenData does the same for public datasets: standardized access to authoritative sources, with the discoverability layer that no single portal can provide on its own.
Where this goes
What we’ve built today is the foundation. Here’s what we’re building on top of it.
Visualization
Data is only useful if people can understand it. The charts in this article are rendered by @tryopendata/openchart, our open-source visualization library. It produces publication-quality charts from structured data with minimal configuration.
Creating a chart from an OpenData query should be as easy as embedding a YouTube video. No D3 expertise or data visualization team required.
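In practice that means something like the sketch below, where the import and options object are simplified for illustration rather than the library’s exact API:

```typescript
// Simplified for illustration: the renderChart export and options shown here
// are a sketch, not the library's exact API.
import { renderChart } from "@tryopendata/openchart";

// salaryRows: rows returned from an OpenData query, e.g. the
// inflation-adjusted salary series from earlier in this post.
declare const salaryRows: { year: number; nominal: number; real: number }[];

renderChart(document.getElementById("salary-chart")!, {
  type: "line",
  data: salaryRows,
  x: "year",
  y: ["nominal", "real"],
  title: "Illinois teacher salaries: nominal vs. inflation-adjusted",
});
```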
The knowledge graph
When you put thousands of datasets in one place and map their relationships, you can answer questions that no individual dataset can:
“What datasets can I combine with this one?”
“Which datasets cover the same geographic region and time period as the school data I’m using?”
“Show me everything that connects poverty rates to educational outcomes.”
The graph tracks shared columns, overlapping time periods, and joinable keys between datasets. We’re building this as a dataset relationship graph that gets smarter as the catalog grows.
AI as infrastructure, not product
There’s a pattern in tech right now: take an existing product, bolt a chatbot onto it, call it “AI-powered.” The result is usually a GPT wrapper that adds latency and hallucination risk to something that worked fine before. The product can’t stand on its own without the AI, and the AI doesn’t actually make it better.
OpenData takes the opposite approach. The product is built around the data, and AI is applied where it actually makes the experience better. There’s no chatbot. No “ask AI” button.
Making cryptic data understandable. Government datasets ship with column names like B01001_001E (that’s Census for “Total Population”) or CUSR0000SA0 (BLS consumer price index). Our enrichment pipeline reads each dataset’s documentation and generates human-readable names, descriptions, and semantic types for every column. This runs once during ingestion. Every query after that gets clean, documented columns without anyone looking up what B01001_001E means ever again.
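The output of that step looks roughly like this. The Census code and its meaning are real; the surrounding field names are a simplified illustration:

```typescript
// Simplified illustration of enriched column metadata. B01001_001E really is
// the Census code for total population; the field names here are a sketch.
const enrichedColumn = {
  sourceName: "B01001_001E",
  displayName: "Total Population",
  description: "Estimated total population for the geography, from ACS table B01001.",
  semanticType: "count",
  unit: "people",
};
```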
Finding connections between datasets. We generate vector embeddings for each dataset’s metadata and use cosine similarity to surface related datasets automatically. The Census poverty estimates surface alongside ISBE school data because they cover overlapping geographies and time periods, not because someone manually tagged them.
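Cosine similarity itself is the standard measure: the closer two metadata embeddings point in the same direction, the more related the datasets are treated as being.

```typescript
// Cosine similarity between two embedding vectors: near 1.0 means the
// metadata points in the same direction, near 0.0 means unrelated.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```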
Powering search that understands intent. When you search for “school funding by county”, keyword matching alone misses datasets labeled “per-pupil expenditure” or “instructional spending.” Our hybrid search combines keyword matching with semantic similarity, so you find relevant datasets even when the terminology doesn’t match.
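A reasonable mental model for hybrid search is a weighted blend of the keyword score and the semantic score. The weights below are an illustrative choice, not our actual tuning:

```typescript
// Illustrative blend of keyword and semantic relevance. The 0.4 / 0.6 split
// is an example weighting, not OpenData's actual tuning.
interface SearchHit {
  datasetId: string;
  keywordScore: number;   // normalized to [0, 1]
  semanticScore: number;  // cosine similarity, also in [0, 1]
}

function rankHybrid(hits: SearchHit[]): SearchHit[] {
  const score = (h: SearchHit) => 0.4 * h.keywordScore + 0.6 * h.semanticScore;
  return [...hits].sort((a, b) => score(b) - score(a));
}
```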
None of this surfaces as an “AI feature.” You just search and find what you need, read column names that make sense, and discover related datasets without digging. The AI disappears into the product.
Bring your own AI
We’re never going to out-build Anthropic, OpenAI, or Google at AI. That’s their thing. What we can do is make OpenData the best possible data platform for the AI tools people already use.
The API is designed for AI agents as a first-class user cohort. Conventional REST patterns. A discovery endpoint so an agent can find relevant datasets on its own. Error responses that explain what went wrong and show how to fix it. A SQL POST endpoint, because LLMs are genuinely good at SQL and some queries are nearly impossible to express through REST parameters alone. Both OpenChart and the OpenData API ship with agent skills, so any AI agent can be onboarded as an expert in querying the platform and building visualizations from its data.
Here’s what that looks like in practice. You open your preferred AI agent, load the OpenData skills, and type a prompt: “Compare US economic health over time: GDP growth, unemployment, and inflation. Find relevant datasets and create a visual report.”
The agent hits OpenData’s discovery endpoint and finds three separate Federal Reserve datasets: fred/gdp, fred/unemployment-rate, and fred/cpi. It queries each one, cross-references the time periods, and generates a full research report with interactive charts, narrative analysis, and source attribution. Every data point traces back to its authoritative government source. The whole thing takes about a minute. This isn’t a roadmap item; we’ve built it and it works today.
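Under the hood, the agent’s calls look roughly like this. Endpoint paths and response shapes are simplified for illustration; the dataset slugs are the ones named above:

```typescript
// Simplified for illustration: endpoint paths and response shapes are
// placeholders; the dataset slugs are the ones named above.
const BASE = "https://api.tryopendata.ai";

// 1. Discovery: the agent describes what it needs and gets candidate datasets.
const candidates = await fetch(
  `${BASE}/discover?q=${encodeURIComponent("US GDP growth unemployment inflation")}`
).then(r => r.json());

// 2. Query: the same consistent call for each dataset the agent selects.
const [gdp, unemployment, cpiSeries] = await Promise.all(
  ["fred/gdp", "fred/unemployment-rate", "fred/cpi"].map(slug =>
    fetch(`${BASE}/datasets/${slug}?start=2000-01-01`).then(r => r.json())
  )
);

// 3. Align the three series on their shared dates before charting.
type Point = { date: string; gdp?: number; unemployment?: number; cpi?: number };
const byDate = new Map<string, Point>(
  gdp.rows.map((r: { date: string; value: number }) => [r.date, { date: r.date, gdp: r.value }])
);
for (const r of unemployment.rows) { const p = byDate.get(r.date); if (p) p.unemployment = r.value; }
for (const r of cpiSeries.rows) { const p = byDate.get(r.date); if (p) p.cpi = r.value; }
const aligned = [...byDate.values()];
```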
Here’s one of those AI-generated charts. This was produced by an agent querying OpenData, not by a human writing code:
That chart is one piece of a full research report the agent produced. The same approach works across any topic the platform has data for: climate trends combining emissions, temperature, and renewable energy data from OWID and NOAA. Cost of living combining CPI, mortgage rates, and median income from FRED. Severe weather combining storm event data with population and temperature records. College ROI combining tuition costs with graduate salary outcomes. Each report pulls from 3-6 datasets across different agencies, joins them, and produces a narrative with publication-quality visualizations.
People already pay for and like their AI tools. If you prefer Claude, ChatGPT, Gemini, or whatever comes next, those are all first-class citizens on OpenData. We’d rather make every AI tool better at research than try to build our own chatbot nobody asked for. The data is what matters: every claim traceable to its source, every number backed by an authoritative government dataset.
The work nobody wants to do
Nobody wants to spend days downloading ZIP files, cross-referencing layout documents, and verifying column positions. What people actually want is to ask a question and get a data-backed answer they can trust. “Have teacher salaries kept up with inflation?” “Which decade was it hardest to buy a house?” “Is chronic school absence in my district getting worse?” The cleanup work is just the tax you pay before you can get to the actual research. It’s tedious, error-prone, and completely invisible when done right.
But the data cleanup only needs to be done once.
Every researcher, journalist, policy analyst, and civic hacker who wants to understand Illinois schools currently has to solve these problems independently. Most don’t bother; they use the last few years of data and skip the historical picture because the effort exceeds their deadline.
Right now, every person who touches a messy dataset solves the same cleanup problem from scratch. We solve it once, and nobody after us has to solve it again. The next person who needs 28 years of Illinois school data gets it in ~200 milliseconds.
OpenData is in active development. You can explore the datasets mentioned in this article at tryopendata.ai, and the visualization library at github.com/opendata-ai/openchart.