You’re a policy researcher studying how air quality correlates with economic conditions across US states. Straightforward question. You need three datasets.
First stop: the EPA. You want PM2.5 air quality measurements. The EPA’s Air Quality System gives you a zip file containing a CSV with 30+ columns. The ones you care about are arithmetic_mean, aqi, and observation_count, but you won’t know that until you’ve downloaded the file, unzipped it, opened it, and scrolled sideways through columns like method_code, poc, and datum.
Second stop: the Bureau of Labor Statistics. You want unemployment data. BLS serves tab-delimited flat files organized in directory listings that look like they were designed in 1996 (because they were). The data uses period codes like M01 through M12 for months, M13 for annual averages, and series IDs like CUSR0000SA0 that mean nothing without a separate lookup table.
Third stop: the Census Bureau. You want state population data so you can normalize per capita. The Census API returns JSON where columns are codes, not names. B01001_001E is total population. B19013_001E is median household income. You only know this if you’ve memorized the American Community Survey variable list or spent 20 minutes on their documentation site figuring it out.
Three agencies, three portals, three formats. None of them talk to each other. And you haven’t started your actual analysis yet.
According to an Anaconda survey, data scientists spend roughly 45% of their time on data preparation and cleaning before they can do any real work. If you’ve been in this world for any amount of time, that number probably feels low. The formats are inconsistent, the documentation is scattered, and the data models assume you already know how each agency organizes its information. “Public data” is a misnomer. It’s more like “technically downloadable data if you know where to look and can decode the format.”
What “Actually Accessible” Looks Like
Here’s the same scenario, but with the data already ingested and normalized.
The Census dataset config transforms cryptic column codes into readable names at ingestion time:
ingest:
  transform:
    - rename:
        B01001_001E: total_population
        NAME: state_name
        state: state_fips
    - cast:
        state_fips: int
        total_population: int
B01001_001E becomes total_population. No lookup table required. The rename happens once, during ingestion, and every query after that uses the human-readable name.
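To make the rename step concrete, here is a rough Python equivalent of what happens at ingestion time. The Census API really does return row-oriented JSON with the column codes in the first row; the sample payload and the `normalize` helper are illustrative, not part of the actual pipeline.

```python
# Sketch of the ingestion-time rename + cast, in plain Python.
# The Census API returns a list of rows; the first row holds the
# variable codes. The sample data below is illustrative only.
RENAME = {
    "B01001_001E": "total_population",
    "NAME": "state_name",
    "state": "state_fips",
}
CASTS = {"state_fips": int, "total_population": int}

def normalize(rows):
    """Turn [[codes...], [values...]] into dicts with readable names."""
    header = [RENAME.get(code, code) for code in rows[0]]
    out = []
    for raw in rows[1:]:
        record = dict(zip(header, raw))
        for col, cast in CASTS.items():
            record[col] = cast(record[col])
        out.append(record)
    return out

sample = [
    ["NAME", "B01001_001E", "state"],
    ["Texas", "29145505", "48"],
]
print(normalize(sample))
# → [{'state_name': 'Texas', 'total_population': 29145505, 'state_fips': 48}]
```

The point is that this logic lives in config, not in every downstream script: once the rename map is declared, no consumer ever sees `B01001_001E` again.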
The FRED unemployment dataset filters out FRED’s placeholder values (they use a literal "." for missing data, which will break any parser that expects a number):
ingest:
  json_path: "$.observations[*]"
  transform:
    - filter:
        column: value
        operator: ne
        value: "."
    - rename:
        value: unemployment_rate
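What that filter guards against is easy to show. FRED genuinely uses a bare `"."` string for missing observations; the helper and sample rows below are an illustrative sketch, not the actual connector code.

```python
# Sketch: FRED marks missing observations with a literal "." string,
# which float() would reject. A tolerant parser handles it explicitly.
def parse_value(raw):
    """Return a float, or None for FRED's '.' placeholder."""
    if raw == ".":
        return None
    return float(raw)

observations = [
    {"date": "2020-03-01", "value": "4.4"},
    {"date": "2020-04-01", "value": "."},  # placeholder row
]
rates = [
    {"date": o["date"], "unemployment_rate": parse_value(o["value"])}
    for o in observations
    if o["value"] != "."  # same effect as the declarative filter step
]
print(rates)
# → [{'date': '2020-03-01', 'unemployment_rate': 4.4}]
```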
The BLS CPI dataset is where things get interesting. Period codes like M01 need to become actual dates. Series IDs like CUSR0000SA0 need to become “All items in U.S. city average, seasonally adjusted.” These are query-time transforms, defined in YAML:
computed:
  - name: date
    sql: "make_date({year}, period_to_month({period}), 1)"
  - name: cpi_value
    sql: "TRY_CAST({value} AS DOUBLE)"
And joins to dimension tables translate those opaque codes into something a human can read:
joins:
  - dataset: bls/cpi-u/series
    key: series_id
    select: [area_code, item_code, series_title]
  - dataset: bls/cpi-u/area
    key: area_code
    select: [area_name]
  - dataset: bls/cpi-u/item
    key: item_code
    select: [item_name]
One dataset, six different views: enriched (human-readable with all joins), national (the headline CPI number, series CUSR0000SA0), core (less food and energy), by-category, by-area, and raw. All of these are query-time transforms defined in the YAML config. Nothing is pre-computed or duplicated. The raw Parquet file stays the same; the views just reshape how you see it.
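A way to picture "views with nothing pre-computed": each view is just a named predicate applied to the same rows at query time. This is a hypothetical sketch, not the platform's implementation; the `national` series ID comes from the text, while the `core` series ID (`CUSR0000SA0L1E`) and the sample values are assumptions for illustration.

```python
# Hypothetical sketch: views as query-time filters over one dataset.
# Nothing is duplicated; a view is just a predicate with a name.
VIEWS = {
    "raw": lambda row: True,
    "national": lambda row: row["series_id"] == "CUSR0000SA0",
    "core": lambda row: row["series_id"] == "CUSR0000SA0L1E",  # assumed ID
}

def query(rows, view="raw"):
    return [r for r in rows if VIEWS[view](r)]

rows = [
    {"series_id": "CUSR0000SA0", "value": 310.3},     # illustrative values
    {"series_id": "CUSR0000SA0L1E", "value": 312.1},
]
print(query(rows, view="national"))  # just the headline series
```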
The result is a single API call:
curl "https://opendata.place/v1/datasets/bls/cpi-u/query?view=national&limit=5"
Instead of downloading a directory of tab-delimited files, joining them manually, converting period codes, and filtering out annual averages, you get clean JSON with real dates and readable column names. That’s the difference between data that’s technically public and data that’s actually usable.
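The same call is a few lines of standard-library Python. The URL shape comes straight from the curl example above; the `query_url` helper is my own wrapper, not part of any official client.

```python
from urllib.parse import urlencode

BASE = "https://opendata.place/v1/datasets"

def query_url(dataset: str, **params) -> str:
    """Build a query URL matching the shape of the curl example."""
    return f"{BASE}/{dataset}/query?{urlencode(params)}"

url = query_url("bls/cpi-u", view="national", limit=5)
# url == "https://opendata.place/v1/datasets/bls/cpi-u/query?view=national&limit=5"

# To actually fetch (network call, so left commented here):
# import json
# from urllib.request import urlopen
# rows = json.load(urlopen(url))
```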
The Connectors That Do the Work
The real complexity lives in the connectors. Different data sources have wildly different access patterns, and pretending they’re all “just HTTP endpoints” doesn’t work.
We handle this with three tiers of connectors:
Native Python connectors for complex sources that need custom logic. BLS serves data as directory listings of flat files that need to be discovered and parsed. Census has an API with pagination quirks and variable code translation. FRED requires authentication and returns a nested JSON structure with placeholder values. Each of these has a dedicated connector: opendata/bls, opendata/census, opendata/fred.
Declarative YAML connectors for sources with simpler but non-trivial patterns. The World Bank, for example, distributes data as zipped CSVs with metadata files mixed in and four header rows to skip:
spec:
  connector: opendata/worldbank
  source_url: https://api.worldbank.org/v2/en/indicator/NY.GDP.MKTP.CD?downloadformat=csv
ingest:
  skip_rows: 4
  file_pattern: "*.csv"
  exclude_pattern: "Metadata_*"
  transform:
    - wide_to_long:
        id_vars: [Country Name, Country Code]
        value_vars: "^\\d{4}$"
        var_name: year
        value_name: gdp
The connector knows how to handle the World Bank’s zip packaging. The YAML config handles the reshape (wide-to-long pivot, cleaning, type casting). Treasury works similarly, with its own connector for daily CSV downloads.
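The `wide_to_long` reshape itself is a classic melt: one column per year becomes one row per (country, year) pair. A minimal sketch in pure Python, assuming the World Bank's wide layout (the column names match the real files; the GDP values here are made up):

```python
import re

# Sketch of the wide_to_long step: columns matching ^\d{4}$ are melted
# into (country, year, gdp) rows. Sample values are illustrative only.
YEAR = re.compile(r"^\d{4}$")

def wide_to_long(rows, id_vars, var_name, value_name):
    tidy = []
    for row in rows:
        ids = {k: row[k] for k in id_vars}
        for col, val in row.items():
            if YEAR.match(col):
                tidy.append({**ids, var_name: int(col), value_name: val})
    return tidy

wide = [{"Country Name": "Chile", "Country Code": "CHL",
         "2020": 252.7, "2021": 316.6}]
tidy = wide_to_long(wide, ["Country Name", "Country Code"], "year", "gdp")
print(tidy)
# → two rows, one per year column
```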
Generic HTTP connector for everything else. EPA air quality data, NOAA weather data, Our World in Data CSVs. These are sources where the access pattern is straightforward (download a file, parse it) and the real work is in the transform step. The generic connector handles HTTP fetching, decompression, and format detection. The YAML config handles the rest.
This tiered approach means community contributions can happen at whatever level makes sense. Adding a new EPA dataset that uses the same zip-of-CSVs pattern? That’s a YAML file. Adding a new data provider with a paginated API? That might need a Python connector. Both are valid, and both end up producing the same thing: a queryable Parquet file with clean column names and documented schema.
The CLI makes this concrete. opendata discover <url> analyzes a data source and generates a starter dataset.yaml config. opendata sync registers configs with the database. opendata add <url> does both plus kicks off ingestion. For the common case of “here’s a URL to a CSV or JSON file,” no code is required. Just a YAML file that describes what to rename, filter, and cast.
Why This Matters
The people who most need public data are often the least equipped to wrangle it.
A journalist fact-checking a claim about rising crime rates needs the actual Bureau of Justice Statistics data, not a 30-day FOIA wait or a secondhand chart from someone’s blog post. The data is public. The barrier isn’t access rights, it’s the practical effort of finding it, downloading it, figuring out the format, and cleaning it enough to answer a simple question.
A city council member comparing their district’s air quality to neighboring counties shouldn’t need a data engineer on staff. The EPA publishes the data. But “publishes” and “makes accessible” are very different things when the data arrives as a 200MB zip file with 30 columns of monitoring station metadata you don’t need.
A grad student shouldn’t spend half their thesis timeline on data wrangling. The Anaconda number bears repeating here: 45% of time on prep. That’s not a rounding error. For a two-year master’s program, that’s nearly a year of formatting CSVs and decoding column names instead of doing actual research.
Right now, a policy researcher in DC and a grad student in Austin and a journalist in New York are all writing essentially the same Python scripts to parse BLS directory listings, decode Census variable codes, and unzip EPA archives. That work gets done and then thrown away. It doesn’t accumulate.
We have over 200 datasets from 35+ providers, each ingested, normalized, and queryable through the same API. BLS labor statistics, FRED economic indicators, NOAA weather data, Our World in Data global metrics. All with documented columns and human-readable names. The goal is to solve each data wrangling problem once so nobody has to solve it again.
Check out the API or browse the source on GitHub. If you work with public data and have opinions about which sources should be added next, we’d like to hear about it.