The Hidden Mess Inside 'Clean' Government Data

Government data has a reputation for being clean and reliable. Anyone who's tried to ingest it programmatically knows that's not the full story. Here are the real encoding quirks, format traps, and silent failures hiding in data from FRED, BLS, Census, the World Bank, and the EPA.

Riley Hilliard
Creator of OpenData · Feb 19, 2026 · 8 min

Government data is supposed to be the good stuff. These are official statistics from federal agencies with dedicated data teams, peer review processes, and public accountability. The numbers themselves are generally solid. So when you sit down to build a pipeline that ingests this data programmatically, you expect the hard part to be the analysis, not the parsing.

That expectation lasts about fifteen minutes.

The truth is that every government agency invented its own encoding conventions, its own missing value representations, its own way of packing metadata into column names. None of them coordinated with each other. The data is accurate, but the formatting is a minefield. And unlike a bug that throws an error, most of these issues fail silently. Your parser runs fine. Your numbers just come out wrong.

We’ve built ingestion connectors for dozens of government sources at this point. What follows isn’t hypothetical. These are the actual problems we hit, the actual configs we wrote to handle them, and the actual code you’d need if you’re doing this yourself.

FRED Uses a Literal Dot for Missing Data

The Federal Reserve Economic Data system is one of the most popular sources for economic time series in the US. Unemployment rate, GDP, fed funds rate, CPI. Hundreds of thousands of series, all available through a clean JSON API. Sounds great.

Then you look at what FRED returns when a data point is missing. Not null. Not an empty string. Not NaN. Not -999. A period. A literal "." character sitting in a field that’s supposed to be a number.

Here’s what the raw JSON response looks like for a time series with gaps:

{
  "observations": [
    { "date": "2024-01-01", "value": "3.7" },
    { "date": "2024-02-01", "value": "." },
    { "date": "2024-03-01", "value": "3.8" }
  ]
}

That middle record will quietly poison your dataset. If you cast the value column to float without filtering first, you get one of two outcomes: a thrown exception that kills your whole pipeline, or (worse) a silent NaN that propagates through every downstream calculation.
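Guarding against it is a single predicate before the cast. Here's a plain-Python sketch of the same idea (the helper name is made up, not FRED client code):

```python
def parse_fred_observations(payload):
    """Convert FRED API observations to (date, float) pairs,
    dropping the "." sentinel FRED uses for missing values."""
    return [
        (obs["date"], float(obs["value"]))
        for obs in payload["observations"]
        if obs["value"] != "."
    ]

payload = {
    "observations": [
        {"date": "2024-01-01", "value": "3.7"},
        {"date": "2024-02-01", "value": "."},
        {"date": "2024-03-01", "value": "3.8"},
    ]
}
# The "." observation is dropped before float() ever sees it.
```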

Here’s the actual YAML config from our FRED unemployment rate dataset:

ingest:
  json_path: "$.observations[*]"
  transform:
    - select:
        - date
        - value
    - filter:
        column: value
        operator: ne
        value: "."
    - cast:
        value: float
    - rename:
        value: unemployment_rate

The filter step removes every observation where value equals "." before the cast ever happens. This isn’t a one-off fix for the unemployment dataset. Every single FRED series uses the same convention. We apply this pattern across all 17 FRED datasets we currently ingest.

The frustrating part: FRED’s API documentation does mention this behavior, buried in a footnote. But if you’re writing a generic JSON parser and you’ve never encountered FRED data before, nothing about a period character screams “missing value.”

BLS Period Codes Are Their Own Language

The Bureau of Labor Statistics publishes some of the most important economic data in the country: inflation (CPI), employment, wages, productivity. The data itself is excellent. The encoding system is… creative.

Instead of dates, BLS uses period codes. M01 through M12 for January through December. That part is fine. Then you find M13, which means “annual average.” That’s a 13th month that doesn’t exist on any calendar but shows up in every monthly time series.

It gets better. Quarterly data uses Q01 through Q04, which is reasonable. But then there’s Q05, which means “annual” in quarterly datasets. Semiannual data uses S01 and S02. Each of these needs to be translated into an actual month number before you can construct a real date.

Here’s the period_to_month function we built into our standard library. Every BLS dataset uses it:

CASE
  WHEN "period" LIKE 'M%' THEN
    CASE WHEN CAST(SUBSTRING("period" FROM 2 FOR 2) AS INTEGER) <= 12
         THEN CAST(SUBSTRING("period" FROM 2 FOR 2) AS INTEGER)
         ELSE NULL
    END
  WHEN "period" LIKE 'Q%' THEN
    CASE SUBSTRING("period" FROM 2 FOR 2)
      WHEN '01' THEN 1
      WHEN '02' THEN 4
      WHEN '03' THEN 7
      WHEN '04' THEN 10
      ELSE NULL
    END
  WHEN "period" LIKE 'S%' THEN
    CASE WHEN "period" = 'S01' THEN 1 ELSE 7 END
  ELSE NULL
END

M13 returns NULL, which means annual averages get filtered out when you construct dates. That’s intentional. If you include annual averages alongside monthly observations, you double-count and skew any aggregation.
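The same mapping, sketched in Python for pipelines that do this outside SQL. Note that it sends Q05 (the annual row in quarterly data) to None for the same reason M13 gets NULL:

```python
def period_to_month(period):
    """Map a BLS period code to the month that starts the period.
    Annual rows (M13, Q05) and unknown codes return None so they
    drop out when real dates are constructed."""
    if period.startswith("M"):
        month = int(period[1:])
        return month if 1 <= month <= 12 else None  # M13 = annual average
    if period.startswith("Q"):
        # Q05 = annual, so it falls through to None
        return {"Q01": 1, "Q02": 4, "Q03": 7, "Q04": 10}.get(period)
    if period.startswith("S"):
        return 1 if period == "S01" else 7  # semiannual halves
    return None
```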

And that’s just the period codes. BLS series IDs are another layer entirely. CUSR0000SA0 is the Consumer Price Index for All Urban Consumers, US city average, all items, seasonally adjusted. The CU prefix means CPI-U. The S in position 3 means seasonally adjusted (U means unadjusted). R means current base period. 0000 is the area code for US city average. SA0 is the item code for all items. You need lookup tables to decode any of this, and those lookup tables are separate files on the BLS FTP server.
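The positional layout described above can be split mechanically, even if the meanings still require the lookup tables. A sketch based on the CUSR0000SA0 example (the function name and field names are ours, not BLS's):

```python
def decode_cpi_series_id(series_id):
    """Split a BLS CPI series ID into its positional components,
    following the layout of CUSR0000SA0. Decoding what each code
    *means* still requires the BLS lookup tables."""
    return {
        "survey": series_id[0:2],   # CU = CPI for All Urban Consumers
        "seasonal": series_id[2],   # S = seasonally adjusted, U = unadjusted
        "base": series_id[3],       # R = current base period
        "area": series_id[4:8],     # 0000 = US city average
        "item": series_id[8:],      # SA0 = all items
    }
```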

Census Column Codes Are an Inside Joke

The Census Bureau’s American Community Survey might have the most opaque column naming convention in all of federal data. When you hit the Census API, you get back columns like B01001_001E and B19013_001E. Unless you’ve memorized the ACS variable dictionary, those mean nothing.

Here’s the decoder ring. The B prefix means it’s a detailed table (as opposed to C for collapsed tables or S for subject tables). 01001 is the table number, which in this case is “Sex by Age.” The 001 is the variable position within that table, and position 001 happens to be the total population. The E suffix means “estimate.” There’s also an M suffix for margin of error, so B01001_001M is the margin of error on total population.

B19013_001E? Table 19013 is “Median Household Income in the Past 12 Months.” Variable 001 is the overall median. The E is estimate.
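The convention is regular enough to parse. Here's a sketch covering the detailed-table codes shown above (the helper is hypothetical, and it only handles the B/C pattern from these examples):

```python
import re

def decode_acs_variable(code):
    """Parse an ACS detailed/collapsed table code like B01001_001E
    into table type, table number, variable position, and measure
    (E = estimate, M = margin of error)."""
    m = re.match(r"^([BC])(\d{5})_(\d{3})([EM])$", code)
    if m is None:
        raise ValueError(f"not a recognized ACS variable code: {code}")
    table_type, table, position, suffix = m.groups()
    return {
        "table_type": table_type,  # B = detailed, C = collapsed
        "table": table,            # e.g. 01001 = Sex by Age
        "variable": position,      # 001 = first variable in the table
        "measure": "estimate" if suffix == "E" else "margin_of_error",
    }
```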

None of this context appears in the data itself. If you download the JSON response, you just see columns of numbers labeled with codes. The rename transform handles it:

ingest:
  transform:
    - rename:
        NAME: state_name
        B01001_001E: population
        B19013_001E: median_household_income
    - cast:
        population: int
        median_household_income: int
        state: str

After ingestion, every query uses population and median_household_income. The Census codes disappear. This rename happens once, and it’s the difference between a dataset that’s immediately usable and one that requires a documentation scavenger hunt every time someone new looks at it.

The ACS has thousands of these variables. Table B25077 is median home value. Table B23025 is employment status. Table B15003 is educational attainment. Each one follows the same naming convention, and none of them are self-documenting.

The Format Detection Problem

Before you even get to agency-specific encoding quirks, there’s a more fundamental problem: figuring out what file format you’re looking at. This is harder than it sounds because government servers routinely lie about their content types.

Here’s the detection chain we use, in the order we consult it:

Content-Type header. This is supposed to be the canonical answer, but it’s wrong often enough to be dangerous. Government servers return text/plain for CSVs, application/octet-stream for basically everything, or the MIME type of the compression format instead of the actual content. A server that gzips a CSV on the fly might report application/gzip even though the underlying data is comma-separated text.

python-magic MIME detection. This reads the first few bytes of the file (the “magic bytes”) to identify binary formats. Great for Parquet files (PAR1 header) and ZIP archives (PK\x03\x04). Less reliable for text formats. A perfectly valid CSV file often gets identified as text/plain, which tells you nothing.

Delimiter heuristics. Sample the first few kilobytes, split into lines, count delimiters (tabs, commas, semicolons) per line. If the count is consistent across lines and greater than zero, you probably have a delimited file. The catch: wide CSVs with hundreds of columns can have rows that exceed 64KB, and if the heuristic only samples the first chunk, it might not see enough complete lines to detect the pattern. We’ve seen python-magic correctly identify text/csv in cases where our delimiter detection returns nothing, specifically because the first 64KB of a wide file contained fewer than two complete lines.

URL extension fallback. Check if the URL ends in .csv, .json, .tsv. Last resort because many government URLs don’t have file extensions at all, or they end in something like /download or ?format=csv with the actual extension buried in a query parameter.
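Condensed into code, the chain looks something like this. It's a sketch, not our actual detector: the Content-Type step is omitted (the section above explains why we don't trust it), and a hand-rolled magic-byte check stands in for python-magic:

```python
def sniff_format(head, url):
    """Best-effort format detection. `head` is the first few KB of
    the file. Order: magic bytes for binary containers, delimiter
    counting for text, URL extension as a last resort."""
    # 1. Magic bytes: reliable for binary formats.
    if head.startswith(b"PAR1"):
        return "parquet"
    if head.startswith(b"PK\x03\x04"):
        return "zip"
    if head.startswith(b"\x1f\x8b"):
        return "gzip"
    # 2. Delimiter heuristic: a consistent nonzero count per line
    #    suggests a delimited file. The last sampled line may be
    #    truncated mid-row, so it's excluded.
    lines = head.decode("utf-8", errors="replace").splitlines()[:20]
    complete = lines[:-1] if len(lines) > 1 else []
    for delim, name in ((",", "csv"), ("\t", "tsv"), (";", "csv")):
        counts = {line.count(delim) for line in complete}
        if len(counts) == 1 and counts.pop() > 0:
            return name
    # 3. URL extension fallback, ignoring query parameters.
    for ext, name in ((".csv", "csv"), (".tsv", "tsv"), (".json", "json")):
        if url.split("?")[0].endswith(ext):
            return name
    return "unknown"
```

The wide-CSV failure mode described above is visible here: if `head` holds fewer than two complete lines, the delimiter step finds nothing and detection falls through to the URL.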

There’s also a subtle trap with compression. When a server sends Content-Encoding: gzip, the HTTP library (httpx in our case) automatically decompresses the response. That’s transport-level compression, and it’s transparent. But when the actual file is a gzip archive (like a .csv.gz that was uploaded to a server), you need a separate decompression step. And after you decompress, you have to reset the content type to application/octet-stream so the format detector examines the actual content bytes, not the compression MIME. Skipping that reset means the detector sees “this is gzip” and never looks at what’s inside.
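The file-level case reduces to a magic-byte check plus the content-type reset. A minimal sketch (transport-level gzip never reaches this code because the HTTP client has already stripped it):

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"

def maybe_decompress(body, content_type):
    """Handle file-level gzip (e.g. a .csv.gz stored on the server).
    After decompressing, reset the content type to octet-stream so
    format detection examines the actual bytes instead of stopping
    at the compression MIME."""
    if body.startswith(GZIP_MAGIC):
        return gzip.decompress(body), "application/octet-stream"
    return body, content_type
```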

Wide Format Breaks When New Data Arrives

The World Bank distributes economic data with years as columns. The GDP dataset has columns like 1960, 1961, 1962, all the way through the current year. When the World Bank publishes 2025 data, a new column appears. Your schema changes. Your column count changes. Any code that references columns by position breaks.

This is wide format, and it’s popular in the spreadsheet world because it’s easy to read. For programmatic ingestion, it’s a headache. Here’s the actual config for the World Bank GDP dataset:

ingest:
  skip_rows: 4
  file_pattern: "*.csv"
  exclude_pattern: "Metadata_*"
  transform:
    - wide_to_long:
        id_vars: [Country Name, Country Code, Indicator Name, Indicator Code]
        value_vars: "^\\d{4}$"
        var_name: year
        value_name: gdp
    - clean_names: true
    - cast:
        year: int
        gdp: float
    - drop_na:
        - gdp
    - select:
        - country_code
        - country_name
        - year
        - gdp

The wide_to_long transform pivots the data from one row per country (with dozens of year columns) to one row per country per year. The regex ^\d{4}$ matches any column name that’s exactly four digits. New years get automatically pivoted. The output schema is always country_code, country_name, year, gdp, regardless of how many years the World Bank has published.

Also note skip_rows: 4. The World Bank CSV starts with four header rows of metadata before the actual data begins. And exclude_pattern: "Metadata_*" because the download is a ZIP containing both the data file and separate metadata files you don’t want to ingest.

Four lines of config to handle a format that would otherwise require rewriting your parser every January.
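For the shape of what wide_to_long actually does, here's a minimal plain-Python sketch (the real transform runs inside our pipeline; this just shows the pivot). Empty values are skipped, mirroring the drop_na step:

```python
import re

YEAR_RE = re.compile(r"^\d{4}$")

def wide_to_long(rows, id_vars, var_name="year", value_name="gdp"):
    """Pivot wide rows (one per country, a column per year) into long
    rows (one per country-year). Any column matching ^\\d{4}$ counts
    as a year, so newly published years pivot automatically."""
    out = []
    for row in rows:
        base = {k: row[k] for k in id_vars}
        for col, val in row.items():
            if YEAR_RE.match(col) and val not in (None, ""):
                out.append({**base, var_name: int(col), value_name: float(val)})
    return out
```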

EPA’s 30 Columns of Mystery

EPA air quality data arrives as a zipped CSV with over 30 columns. When you unzip it and open the file, you see columns like method_code, poc, datum, cbsa_code, date_of_last_change, state_code, county_code, site_num, and about twenty more. Some of these are useful (arithmetic_mean is the actual PM2.5 reading, aqi is the Air Quality Index value). Most of them are internal monitoring station metadata that exists for regulatory compliance, not analysis.

The column names aren’t documented in the file. There’s no header row that says “this is a float” or “this is an internal code you can ignore.” You figure out what each column means by reading the EPA’s AQS documentation, which is a separate PDF on a different part of their website.

Making things worse, column names across different EPA datasets don’t follow consistent capitalization. Some files use State_Name, others use state_name, others use STATE_NAME. The clean_names transform handles this:

ingest:
  file_pattern: "*.csv"
  transform:
    - clean_names: true
    - cast:
        arithmetic_mean: float
        aqi: int
        observation_count: int
        observation_percent: float
        date_local: date

clean_names lowercases everything, strips leading and trailing whitespace, and normalizes separators to underscores. It’s a one-liner that prevents an entire class of “column not found” errors that only show up when you switch between files from the same agency.
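The normalization itself is a few lines. A sketch of what a clean_names implementation boils down to (ours handles a few more separator cases, but this is the core):

```python
import re

def clean_name(col):
    """Normalize a column name: strip surrounding whitespace,
    lowercase, and collapse spaces, dashes, and repeated
    underscores into single underscores."""
    col = col.strip().lower()
    col = re.sub(r"[\s\-]+", "_", col)   # separators -> underscore
    col = re.sub(r"_+", "_", col)        # collapse runs of underscores
    return col
```

With this applied at ingestion, State_Name, state_name, and STATE_NAME all land on the same column.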

The EPA data is thorough and well-collected. It’s just packaged for regulators, not analysts. The 30+ columns make perfect sense if you’re a compliance officer tracking which monitoring method was used at which station. If you’re a researcher studying air quality trends by county, about 80% of those columns are noise.

The Pattern

Every government agency solved its own data encoding problem independently. BLS invented period codes. Census invented variable codes. FRED chose dots for missing values. The World Bank chose wide format with years as columns. EPA chose to include every possible field and let users sort it out.

None of these are bugs. They’re all reasonable choices made by individual teams that were optimizing for their own workflows and their own users. The BLS period code system makes sense if you’re a labor economist who has been working with BLS data for twenty years. The Census variable naming scheme makes sense if you’re inside the Census Bureau and you know all the table numbers by heart.

The mess isn’t in any single dataset. It’s in the gaps between them. When you try to join FRED unemployment data with BLS CPI data and Census population data, you’re bridging three entirely different encoding philosophies. The dates are in different formats. The missing values are represented differently. The column names follow different conventions. Each source is internally consistent, but nothing is consistent across sources.

This is what data engineering actually looks like for government data. Not the glamorous distributed-systems stuff. Just figuring out that FRED uses dots, BLS has a 13th month, Census thinks B01001_001E is a perfectly good column name, and the World Bank adds a new column to your schema every year.

The cleanest-looking data often has the most surprises buried in it. The agencies that publish it know their own formats intimately. The rest of us have to learn them one ingestion failure at a time.
