Say you’re building the case for expanding opioid treatment in three underserved counties. You need county-level healthcare spending from CMS, overdose mortality from the CDC, and uninsured population rates from the Census.
The three agencies use different geographic identifiers, different file formats, and different update schedules. CMS gives you wide-format CSVs where every year gets its own column. CDC gives you nested indicator codes built around ICD classifications. Census gives you XLSX files with multi-row merged headers that crash standard parsers.
The data exists, it’s free, and it’s public. Getting it into the same spreadsheet still takes longer than the analysis you were hired to do.
OpenData puts all of it in one place: same API, same filtering syntax, same column names that actually make sense. The cross-agency join that eats your Monday is a URL parameter.
The $5.3 trillion paper trail
The US spent $5.3 trillion on healthcare in 2024, or $15,474 per person, up from $146 in 1960. The data describing that spending is fragmented across CMS, CDC, FDA, Census, OECD, WHO, and dozens of state agencies, each publishing in its own format.
CMS Part D publishes drug spending in a wide-format spreadsheet where each year gets its own column (Tot_Spndng_2019, Tot_Spndng_2020, all the way through Tot_Spndng_2023), with dollar signs embedded in the numeric fields and a codebook on a separate sheet. CDC’s provisional overdose data uses clinical classification codes in the indicator column. FDA’s NDC Directory is a zipped JSON file. Census health insurance coverage arrives as an XLSX with multi-row merged headers.
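To make that cleanup concrete, here is a minimal sketch of the reshaping a wide Part D style file needs before it can be sorted or joined. It assumes a CSV export with a Brnd_Name column and Tot_Spndng_YYYY columns; the sample rows and amounts are invented for illustration, and this is not a description of how OpenData does it internally.

```python
import csv
import io
import re

# Hypothetical two-row excerpt in the wide layout described above.
# The drug names and dollar amounts are made up for illustration.
WIDE_CSV = """Brnd_Name,Tot_Spndng_2022,Tot_Spndng_2023
Drug A,"$1,200.50","$1,350.00"
Drug B,$980.00,
"""

# Matches the per-year spending columns and captures the year.
YEAR_COLUMN = re.compile(r"^Tot_Spndng_(\d{4})$")


def parse_dollars(raw):
    """Turn '$1,200.50' into 1200.5; return None for blank or missing cells."""
    cleaned = (raw or "").replace("$", "").replace(",", "").strip()
    return float(cleaned) if cleaned else None


def wide_to_long(text):
    """Return one (brand, year, spending) tuple per non-empty year column."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        for column, raw in record.items():
            match = YEAR_COLUMN.match(column)
            if match is None:
                continue  # not a per-year spending column
            amount = parse_dollars(raw)
            if amount is not None:
                rows.append((record["Brnd_Name"], int(match.group(1)), amount))
    return rows


long_rows = wide_to_long(WIDE_CSV)
```

Each drug-year pair becomes its own row with a plain numeric value, which is the long format that makes sorting and filtering by year straightforward.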
Then there’s the identifier problem. CMS Part D reports drugs by brand name, NADAC (the pharmacy acquisition cost file, also from CMS) reports by NDC product code, and the IRA negotiated prices file uses a different NDC format entirely. Same drug, three different identifiers across three different files, and the only bridge between them is the FDA NDC Directory.
Most people who work with this data spend more time cleaning it than analyzing it. That ratio is backwards.
Same data, before and after
Here’s what one of those files actually looks like, followed by the same data on OpenData.
Same data, two experiences. The top block is the actual CMS file: dollar signs embedded in numeric fields, five years of spending crammed into separate columns, and a codebook that lives on a different sheet. The bottom is that same dataset on OpenData, with clean numbers in long format, sortable by any column. Change the sort to -avg_spending_per_dosage_unit and you see a completely different story about which drugs actually cost the most per dose.
When you open that Part D dataset on the platform, the sidebar shows related datasets: NADAC pharmacy acquisition costs, IRA negotiated drug prices, the FDA NDC Directory, Medicaid drug spending. Those cross-agency connections are already mapped. You don’t discover them by Googling “how to join CMS to FDA data” at 9 a.m. on a Monday.
And the column names? Avg_Spndng_Per_Dsg_Unt_Wgtd_2023 becomes “Average Spending Per Dosage Unit” with a description explaining it’s the weighted average across manufacturers, including ingredient cost, dispensing fee, and sales tax. No codebook hunting.
What the clean data shows
Once the data engineering is solved, questions that used to take a week of prep become an afternoon of analysis. Here are three stories from three different sources, all pulled from the same platform.
Where Medicare dollars go
Part D alone accounted for over $250 billion in drug spending in 2023. A single drug, Eliquis, cost Medicare $18.3 billion, nearly double the second-place Ozempic. The top ten drugs collectively account for a significant share of Part D spending, but the story changes entirely depending on whether you sort by total spend, per-unit cost, or growth rate.
The US as outlier
Those are domestic numbers. Zoom out and the picture gets sharper. The US spends roughly double what peer nations spend per person on healthcare, and it has for decades.
That OWID dataset surfaced because searching “healthcare spending by country” on OpenData returns it alongside CMS national health expenditure and OECD pharma spending. The platform understands that “healthcare spending” and “health expenditure per capita” are the same concept, even when the terms don’t match. That’s semantic search working behind the scenes, not just keyword matching.
The fentanyl curve is bending
From spending to outcomes: CDC’s provisional drug overdose data shows something that hasn’t gotten enough attention yet. Synthetic opioid (fentanyl) deaths are declining for the first time since the crisis began, with the 12-month trailing count peaking around 75,000 in early 2023 and dropping to roughly 47,000 by early 2025.
Three sources (CMS, OWID, CDC), three completely different source formats, one platform. The spending chart, the international comparison, and the overdose trend all came from the same API with the same filtering syntax. That’s what makes cross-cutting healthcare analysis possible without a data engineering team.
The join that eats the morning
Most healthcare analysis eventually requires linking data across agencies. Say you want to answer a simple question: is Medicare overpaying for drugs compared to what pharmacies actually pay?
Two CMS datasets have the pieces. Part D Spending tracks what Medicare reimburses for each drug by brand name (“Eliquis”). NADAC tracks what pharmacies pay wholesalers for the same drug, but identifies it by a numeric product code and its chemical name (“APIXABAN 2.5 MG TABLET”). Same agency, same drug, no obvious way to connect them.
The link between them is a third dataset from a different agency entirely: the FDA’s NDC Directory, a lookup table that maps every product code to its brand name, generic name, and manufacturer. Match Part D to the NDC Directory by brand name, then match the NDC Directory to NADAC by product code, and suddenly you can compare Medicare’s reimbursement rate against the pharmacy’s wholesale cost for every drug in the system. Three datasets, two agencies, connected through a shared reference.
On OpenData, that three-way link is a single SQL query:
SELECT
    p.brand_name,
    ROUND(p.total_spending / 1e9, 1) AS spending_billions,
    ROUND(p.avg_spending_per_dosage_unit, 2) AS medicare_cost_per_unit,
    ROUND(AVG(n.nadac_per_unit), 2) AS pharmacy_cost_per_unit,
    -- total spending times the share of the unit price above pharmacy cost
    ROUND(p.total_spending
          * (1 - AVG(n.nadac_per_unit)
                 / p.avg_spending_per_dosage_unit) / 1e6, 0) AS overpay_millions
FROM "cms/part-d-spending" p
JOIN "fda/ndc-directory" d
    ON UPPER(p.brand_name) = UPPER(d.brand_name)
JOIN "cms/nadac" n
    -- FDA: hyphenated labeler-product text, padded to 5 + 4 digits
    -- NADAC: integer NDC, padded to 11 digits, first 9 kept
    ON LPAD(SPLIT_PART(d.product_ndc, '-', 1), 5, '0')
       || LPAD(SPLIT_PART(d.product_ndc, '-', 2), 4, '0')
       = SUBSTR(LPAD(CAST(n.ndc AS VARCHAR), 11, '0'), 1, 9)
WHERE p.manufacturer = 'Overall'
GROUP BY p.brand_name, p.total_spending,
         p.avg_spending_per_dosage_unit
ORDER BY overpay_millions DESC
The middle join looks messy because the FDA and CMS format their product codes differently (hyphenated text vs. plain integers). That kind of identifier mismatch is exactly why this analysis usually takes a morning of data cleanup before you even get to the interesting part. But the results are worth it. Here’s the gap between what Medicare reimburses and what pharmacies actually pay wholesalers, ranked by total dollar overpayment:
Lantus (insulin glargine) tops the list: Medicare reimburses $30 per unit while pharmacies pay wholesalers about $6, an 80% gap that adds up to $2.5 billion annually. But the pattern varies wildly. Eliquis has the largest total Medicare spend ($18.3B) but only an 8% gap, while Abiraterone (a cancer drug with 19 generic manufacturers competing on price) shows an 86% gap. None of this is visible from any single dataset. It only shows up when you connect the three.
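The gap percentages above follow from the same arithmetic as the overpay_millions column in the query. As a rough sketch, with function names that are purely illustrative, the calculation looks like this:

```python
def reimbursement_gap(medicare_per_unit, pharmacy_per_unit):
    """Share of Medicare's per-unit price that sits above pharmacy cost."""
    if medicare_per_unit <= 0:
        raise ValueError("medicare_per_unit must be positive")
    return 1 - pharmacy_per_unit / medicare_per_unit


def implied_overpayment(total_spending, medicare_per_unit, pharmacy_per_unit):
    """Total spending scaled by the gap, mirroring the SQL expression."""
    return total_spending * reimbursement_gap(medicare_per_unit, pharmacy_per_unit)


# Lantus figures quoted in the text: $30 reimbursed vs. about $6 wholesale.
lantus_gap = reimbursement_gap(30.0, 6.0)  # 1 - 6/30, i.e. the 80% gap

# Hypothetical spending total, used only to show the scaling step.
example_overpay = implied_overpayment(1_000_000.0, 30.0, 6.0)
```

Note that the gap is measured as a share of Medicare's price, not as a markup over pharmacy cost, so an 80% gap means the wholesale price is one fifth of the reimbursed price.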
That same pattern, linking datasets that weren’t designed to talk to each other, opens up another question: what happens when the Inflation Reduction Act kicks in? Starting in 2026, Medicare can negotiate drug prices directly with manufacturers for the first time. Adding the IRA negotiated prices dataset to the mix shows the expected impact:
Januvia drops 79%, Farxiga drops 69%, and Eliquis, the single most expensive drug in Medicare at $18.3 billion a year, drops 57%. The first chart showed that pharmacies already pay less than Medicare reimburses. The IRA goes much further, negotiating prices 57-79% below current reimbursement rates. Both findings come from linking datasets that the government publishes separately, with different naming conventions and file formats.
The FDA NDC Directory is the connective tissue: 133,000 products with standardized codes, generic names, brand names, manufacturers, and dosage forms. It’s the kind of reference table that makes cross-agency analysis possible.
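For readers who want to see the product-code matching from the join outside of SQL, here is a small sketch of the same normalization. It assumes, as the query does, that FDA product codes are hyphenated labeler-product strings and that NADAC codes are 11-digit NDCs stored as integers with leading zeros dropped. The function names and sample codes are hypothetical.

```python
def fda_product_key(product_ndc):
    """Hyphenated FDA code -> 9-digit key: labeler padded to 5, product to 4."""
    labeler, product = product_ndc.split("-")[:2]
    return labeler.zfill(5) + product.zfill(4)


def nadac_product_key(ndc):
    """Integer or string NDC -> first 9 digits of the zero-padded 11-digit code."""
    return str(ndc).zfill(11)[:9]


# Hypothetical codes, chosen only to show that the two formats line up.
fda_key = fda_product_key("0003-0894")   # 4-4 format: labeler gains a leading zero
nadac_key = nadac_product_key(3089421)   # integer that lost its leading zeros
same_product = fda_key == nadac_key
```

Dropping the last two digits of the 11-digit code discards the package segment, so every package size of a product maps to the same key. That is why the query averages nadac_per_unit across the matching NADAC rows.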
93 datasets and counting
Drug pricing is just one corner of the platform’s healthcare coverage. As of April 2026, there are 93 healthcare datasets from 22 sources on OpenData, and the catalog grows every week. Here’s a partial map of what’s queryable today:
Drug pricing and utilization: CMS Part D spending, Part B spending, NADAC weekly acquisition costs, ASP physician-administered drug pricing, IRA negotiated prices, Medicaid state drug utilization, Medicaid spending by drug, VA Federal Supply Schedule drug prices, California new drug launch prices and WAC increases
Drug information: FDA NDC Directory, Orange Book (patent/exclusivity data), Drugs@FDA (all approvals since 1939), new molecular entity approvals, drug recalls, drug shortages, Medicaid drug rebate products
Disease surveillance: CDC overdose deaths (state and county), cancer incidence, influenza (ILINet + virology), vaccination coverage, BRFSS behavioral risk factors, WONDER mortality, NNDSS notifiable diseases, vital statistics births, WHO noncommunicable disease mortality
Opioid crisis: CMS opioid prescribing (Medicare and Medicaid), CDC overdose deaths (including county-level), SAMHSA treatment episodes, OECD international opioid prescribing rates, UNODC World Drug Report, EU drug-related deaths and treatment demand
Healthcare quality: CMS hospital general information, hospital readmissions, HCAHPS patient satisfaction, nursing home provider info, dialysis facility compare, home health compare, ACA marketplace plans, Medicaid/CHIP enrollment
Spending and economics: CMS national health expenditure (1960-2024), OWID healthcare spending by country, OECD health expenditure by function, OECD pharmaceutical spending, WHO health expenditure by source, FRED pharmaceutical and medical CPIs/PPIs
International comparison: Commonwealth Fund Mirror Mirror rankings, WHO global mortality, WHO universal health coverage index, OWID global health indicators (obesity, mental health, life expectancy, HIV, tuberculosis, malaria, maternal/child mortality)
Local and county level: County Health Rankings (premature death, overdose, mental health by county), HRSA shortage areas for primary care, dental, and mental health
Every dataset connects to related ones. Browse Part D spending and you find NADAC, IRA prices, and Medicaid drug utilization in the sidebar. Open CDC overdose deaths and you find SAMHSA treatment episodes and CMS opioid prescribing rates. The connections are mapped automatically through shared columns, overlapping geographies, and semantic similarity. The catalog gets more useful every time it grows.
Your Monday morning
The county-level analysis from the opening? CMS health expenditure, CDC overdose deaths, and Census insurance coverage are three queries with the same filtering syntax. The three-way crosswalk between Part D, NADAC, and the FDA NDC Directory is a SQL join. The 64-year spending trend, the fentanyl decline, the IRA price comparison: each one is a query, and every number traces back to its authoritative government source.
Right now, every health economist, policy researcher, and nonprofit analyst who wants to answer a cross-agency question has to solve the same data engineering problem from scratch. Download the files, fix the formats, reconcile the identifiers, hope nothing changed since last quarter. Most people give up before they get to the analysis. The ones who don’t spend more time cleaning data than using it.
That work only needs to happen once. OpenData has already done it for 93 healthcare datasets across 22 sources, and the catalog keeps growing. The next person who needs to compare Medicare reimbursement rates against pharmacy acquisition costs gets the answer in 157 milliseconds instead of a morning of spreadsheet wrangling.
Every dataset in this article is free to query at tryopendata.ai with no signup required.
OpenData is in active development. New healthcare datasets are added weekly.







