We asked Claude to write a data-driven report about US healthcare costs using our Research framework (currently in Alpha). It dispatched 4 parallel research agents, made 72 API calls, and spent ~4 minutes on the research phase before a single word of the report was written.
The report turned out great: eight charts, five narrative sections covering spending trends, life expectancy, drug prices, insurance coverage, and medical inflation, all correlated and backed by real data. It's the kind of thing that would take a human researcher multiple days to assemble. But 72 API calls to fetch data from 13 datasets across 7 providers is absurd. Most of those calls were the agents inspecting schemas, retrying queries that failed on quoting, and re-discovering datasets that another agent had already found.
The API worked; it just wasn't designed well for the thing using it.
The Feedback Loop
We didn’t redesign the API by guessing what agents need. We watched them work, asked them what went wrong, and let them write the improvement spec.
The cycle looked like this:
- Agent performs a task (write the healthcare cost report)
- We run a post-mortem asking specifically about research phase friction
- The agent produces structured feedback with specific API improvements, estimated impact, and implementation suggestions
- We ship the improvements and re-run the same research task
- Repeat
We ran this cycle three times in one afternoon. Each round, the agent’s feedback got more specific because the big problems were solved and subtler issues surfaced.
The feedback comes from the actual consumer, not from a human proxy guessing what the agent needs. The post-mortem produces structured, actionable output. Not “the API is slow” but:
SQL string quoting causes ~10-15% of queries to fail on first try because of triple-nested escaping in shell/JSON/SQL.
Suggested fix: parameterized queries with ? placeholders and a params array.
Each round is a complete loop: ship changes, re-run the exact same task, measure the difference.
Round 1: The 72-Call Baseline
The post-mortem from the initial healthcare report identified six problems. Each one matters specifically because the API consumer is an LLM, not a human.
Discovery redundancy
Four parallel agents each called /v1/discover independently with overlapping queries. Three datasets appeared in multiple agents’ results. No way to batch discover calls or deduplicate.
Multi-agent parallelism is the default pattern for complex research. Agents don’t coordinate with each other. If three agents all need the same OECD healthcare spending dataset, they’ll each discover it separately. The API needs to handle deduplication server-side because the agents won’t do it themselves.
Schema inspection round-trips
For every dataset the agent wanted to use, it had to call /columns to check types and units, then /views to see if curated views existed. That’s 2 extra calls per dataset before writing a single query. Across 13 datasets: ~21 calls just to understand schemas.
Humans can skim a docs page and infer what columns mean. Agents need explicit metadata (units, value ranges, semantic types) or they have to probe the data to figure it out. And probing means more API calls.
The string escaping problem
[Footage: Claude trying to work with nested escaped quotes]
Agents construct SQL inside JSON inside shell commands. A country name like “United States” requires escaping at three levels. ~10-15% of SQL queries failed on the first try due to quoting errors, requiring retries.
Humans use SQL clients with proper escaping. Agents are constructing curl commands with string interpolation. The '"'"'United States'"'"' pattern is an error factory.
# What the agent actually tried to construct:
curl -s -X POST "https://api.tryopendata.ai/v1/datasets/owid/healthcare-spending/query" \
-H "Content-Type: application/json" \
-d '{"sql": "SELECT country, year, spending_per_capita FROM data WHERE country IN ('"'"'United States'"'"', '"'"'Japan'"'"', '"'"'Germany'"'"') AND year >= 2020 ORDER BY year"}'
# First attempt often failed. Agent retried with different escaping.
# Sometimes took 2-3 attempts to get the quoting right.
Cross-dataset join anxiety
Agents avoided server-side joins even when they’d be more efficient, because they couldn’t confirm that country_code in dataset A used the same values as country_code in dataset B without querying both first. So they’d fetch both datasets separately and join in their reasoning, wasting tokens and context window.
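As a rough illustration of what that looked like (a sketch, not a captured trace), an agent would probe the join key on both sides before daring a server-side join:
# Sketch: probe the join key in both datasets before risking a server-side join.
curl -s -X POST "https://api.tryopendata.ai/v1/datasets/owid/healthcare-spending/query" \
  -H "Content-Type: application/json" \
  -d '{"sql": "SELECT DISTINCT country_code FROM data LIMIT 25"}'
curl -s -X POST "https://api.tryopendata.ai/v1/datasets/owid/life-expectancy/query" \
  -H "Content-Type: application/json" \
  -d '{"sql": "SELECT DISTINCT country_code FROM data LIMIT 25"}'
# Only after comparing the two value sets would the agent consider a server-side join.
# More often it fetched both result sets and merged them in its own reasoning instead.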
Discover-then-query always required 2+ round-trips
Even for simple questions (“what’s US healthcare spending?”), agents had to: discover the dataset, inspect the schema, then query. No way to get sample data or column metadata from the discover response itself.
No category browsing
Agents couldn’t narrow searches by domain (health, economics, education) without knowing the taxonomy. Every search was keyword-based, which meant broader results and more time spent filtering.
Here’s what a single dataset workflow looked like in Round 1:
1. GET /v1/discover?q=healthcare+spending → find datasets
2. GET /v1/datasets/owid/healthcare-spending/columns → inspect schema
3. GET /v1/datasets/owid/healthcare-spending/views → check for views
4. POST /v1/datasets/owid/healthcare-spending/query → probe: SELECT DISTINCT country LIMIT 20
5. POST /v1/datasets/owid/healthcare-spending/query → probe: SELECT MIN(year), MAX(year) FROM data
6. POST /v1/datasets/owid/healthcare-spending/query → real query (fails on quoting)
7. POST /v1/datasets/owid/healthcare-spending/query → retry with fixed quoting
= 7 calls for one dataset. Some discover calls were shared, but across 13 datasets
this pattern added up to ~72 total calls.
What We Shipped (Round 2)
Three main changes, all driven directly by the agent’s feedback.
Batch discover
POST /v1/discover/batch takes multiple queries in one call and deduplicates server-side. The response separates query_results (per-query relevance scores) from datasets (the unique set of matching datasets with full metadata).
curl -s -X POST "https://api.tryopendata.ai/v1/discover/batch" \
-H "Authorization: Bearer $OPENDATA_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"queries": [
"healthcare spending per capita by country",
"life expectancy and health outcomes by country",
"drug prices and pharmaceutical spending",
"health insurance coverage and medical inflation US"
],
"limit_per_query": 5,
"deduplicate": true
}'
# 1 call. 17 unique datasets. 3 overlapping datasets deduplicated.
Four discover calls became one. The three datasets that appeared across multiple agents’ queries were correctly deduplicated in the response.
Self-describing responses
Think about what happens when you hand a new hire a database and say “write me a query.” They don’t start typing SQL. They open a schema browser, click through tables, check column types, maybe run a SELECT * to see what the data looks like. Only then do they write the query. That’s exactly what our agents were doing: 21 calls just to understand what they were looking at before they could ask for what they needed.
The fix wasn't making those inspection calls faster; it was making them unnecessary. We made the discover response self-describing: everything an agent needs to write a query is in the same response that told it the dataset exists.
{
"columns": [
{
"name": "spending_per_capita",
"type": "float",
"display_name": "Spending per Capita (USD PPP)",
"value_range": { "min": 10.53, "max": 12009.9 },
"sample_values": ["85.86", "118.41", "146.84", "261.57", "461.18"],
"distinct_count": 996
},
{
"name": "country_code",
"type": "string",
"display_name": "Country Code",
"sample_values": ["AFG", "ALB", "DZA", "AGO", "ARG"],
"distinct_count": 206
}
],
"available_views": [
{ "name": "latest", "description": "Most recent data by country" },
{ "name": "us-vs-peers", "description": "US compared to peer nations" }
]
}
Value ranges, display names, sample values, available views, all inline. The agent reads one response and knows: this column is a float that ranges from 10 to 12,000, it represents per-capita spending in USD PPP, and there’s a pre-built view that compares the US to peer nations. That’s enough context to write a targeted SQL query immediately.
This is the core idea of an LLM-first API: the response itself is the documentation. An agent can reference external docs, but it shouldn’t have to. Every response should carry enough context for the consumer to take the next step without looking anything up.
Separating values from queries
Remember the string escaping problem from Round 1? An agent building a SQL query has to embed values like “United States” inside SQL, inside JSON, inside a shell command. Three layers of escaping, each with its own quoting rules. The agent would construct something like '"'"'United States'"'"' and hope it parsed correctly. About 10-15% of the time, it didn’t.
The fix was to stop making agents embed values in query strings at all. Instead, the SQL goes in one field with ? placeholders, and the values go in a separate params array. The API binds them together server-side:
curl -s -X POST "https://api.tryopendata.ai/v1/datasets/owid/healthcare-spending/query" \
-H "Content-Type: application/json" \
-d '{
"sql": "SELECT country, year, spending_per_capita FROM data WHERE country IN (?, ?, ?) AND year >= ? ORDER BY year",
"params": ["United States", "Japan", "Germany", 2020]
}'
No escaping. The string "United States" stays a plain JSON string, never touches SQL quoting. If the agent passes the wrong number of parameters, it gets a structured error (PARAM_COUNT_MISMATCH) with a hint instead of a cryptic SQL parse failure. This works on both single-dataset and cross-dataset SQL endpoints.
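For illustration, a PARAM_COUNT_MISMATCH response looks roughly like this (a sketch; the exact field names may differ):
{
  "error": {
    "code": "PARAM_COUNT_MISMATCH",
    "message": "SQL has 4 placeholders but params contains 3 values",
    "hint": "Provide one entry in params for every ? placeholder, in order."
  }
}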
Round 2 results
The original estimate was a 30-40% improvement. Actual was 87%. The savings compound: batch discover eliminates redundant discovery, enriched metadata eliminates schema inspection, and parameterized SQL eliminates quoting retries. Entire categories of round-trips disappeared rather than each one getting a little faster.
Round 3: The Details
Round 2 feedback identified remaining gaps. The big problems were solved, so now the subtler issues surfaced.
Standalone unit field on columns
Previously, units were buried in display_name (“Spending per Capita (USD PPP)”). Agents had to parse units from strings with inconsistent formatting. Now there’s a dedicated unit field:
{
"name": "spending_per_capita",
"type": "float",
"unit": "per capita",
"display_name": "Spending per Capita (USD PPP)",
"value_range": { "min": 10.53, "max": 12009.9 }
}
These two fields work together. When an agent sees a column with unit: "percent" and a range of 0-100, it knows to format labels as 12.5%. When it sees unit: "proportion" with a range of 0-1, it knows that 0.125 means the same thing but needs different formatting. Without this metadata, agents guess wrong and charts end up showing 1,250% instead of 12.5%.
Telling agents what each column means
A column called country_code is obviously geographic. A column called year is obviously temporal (obvious to a human, anyway). An agent looking at a schema doesn’t know whether code is a country code, a product code, or an error code without being told.
We added a label to each column that describes what kind of data it holds: is this a place, a date, a measurement, or a category? When two datasets both have a column labeled “geographic, country,” the agent knows it can join them on that column without having to query both datasets first to check.
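A labeled column might look something like this (the semantic_type field name here is illustrative; the point is that the column's role is stated rather than inferred):
{
  "name": "country_code",
  "type": "string",
  "semantic_type": "geographic, country",
  "sample_values": ["AFG", "ALB", "DZA"]
}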
Category filter on discover
?category=health narrows results. available_categories is always returned in the response (14 categories total). Agents can browse the taxonomy instead of guessing keywords.
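In practice that's one extra query parameter on discover, something like:
curl -s "https://api.tryopendata.ai/v1/discover?q=drug+prices&category=health" \
  -H "Authorization: Bearer $OPENDATA_API_KEY"
# Results are scoped to health datasets; the response also lists available_categories.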
Round 3 results
The last remaining schema probe from Round 2 disappeared in Round 3 because the column labels and unit metadata gave the agent enough confidence to write queries without checking first.
Here’s what the same research task looks like in Round 3:
1. POST /v1/discover/batch → all 4 queries, deduplicated
2. POST /v1/datasets/owid/healthcare-spending/query → parameterized SQL
3. POST /v1/datasets/owid/life-expectancy/query → parameterized SQL
4. POST /v1/datasets/oecd/pharma-spending/query → parameterized SQL
5. POST /v1/datasets/fred/cpi-medical/query → parameterized SQL
6. POST /v1/datasets/census/health-insurance-coverage/query → parameterized SQL
7. POST /v1/datasets/cms/part-d-spending/query → parameterized SQL
8. POST /v1/query → cross-dataset join
= 8 calls total for 6 datasets + 1 cross-dataset join. All responses columnar.
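Step 8, the cross-dataset join, uses the same parameterized shape as the single-dataset endpoint. The way the datasets are referenced inside the SQL below is our illustrative guess, not the exact syntax:
curl -s -X POST "https://api.tryopendata.ai/v1/query" \
  -H "Content-Type: application/json" \
  -d '{
    "sql": "SELECT h.country, h.year, h.spending_per_capita, l.life_expectancy FROM healthcare_spending h JOIN life_expectancy l ON h.country_code = l.country_code AND h.year = l.year WHERE h.year >= ?",
    "params": [2015]
  }'
# Table names and the life_expectancy column are illustrative; the parameterized
# sql + params shape is the same one shown above for single-dataset queries.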
The Columnar Format Decision
Once the round-trip count was optimized, we looked at what was left: the response payloads themselves. Was the format wasting tokens?
Standard JSON API responses return rows as arrays of objects:
{
"data": [
{
"country": "United States",
"country_code": "USA",
"spending_per_capita": 12586.196,
"year": 2022
},
{
"country": "United States",
"country_code": "USA",
"spending_per_capita": 13473.193,
"year": 2023
},
{
"country": "Japan",
"country_code": "JPN",
"spending_per_capita": 5845.99,
"year": 2022
},
{
"country": "Japan",
"country_code": "JPN",
"spending_per_capita": 6102.44,
"year": 2023
},
{
"country": "Germany",
"country_code": "DEU",
"spending_per_capita": 8306.01,
"year": 2022
},
{
"country": "Germany",
"country_code": "DEU",
"spending_per_capita": 8733.18,
"year": 2023
}
]
}
Every key name repeated on every row. For 6 rows with 4 columns, that’s 24 redundant key strings. At 73 rows: ~292 redundant keys. This is the JSON convention. Every API does it. Nobody questions it because humans read one object and skip the rest.
But LLMs tokenize the entire response. Every repeated "country": costs tokens. Tokens cost money and consume context window.
Columnar format:
{
"columns": ["country", "country_code", "spending_per_capita", "year"],
"types": ["string", "string", "float", "integer"],
"rows": [
["United States", "USA", 12586.196, 2022],
["United States", "USA", 13473.193, 2023],
["Japan", "JPN", 5845.99, 2022],
["Japan", "JPN", 6102.44, 2023],
["Germany", "DEU", 8306.01, 2022],
["Germany", "DEU", 8733.18, 2023]
]
}
Same data, ~45% fewer tokens. The 73-row healthcare spending response went from ~2,200 tokens to ~1,200 tokens. Across a full research session with 6 queries, that’s 3,000-5,000 tokens saved on response payloads alone.
We made columnar the default on all data-returning endpoints.
Columnar responses are a well-established pattern in data systems (Parquet, Arrow, DuckDB all work this way). Client-side remapping to objects is a few lines of code if you need it, and we kept ?response_format=objects for anyone who prefers the verbose format. But the default path should be the efficient one.
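The remapping really is a few lines. With jq, for instance, a columnar response can be turned back into row objects like this (a minimal sketch, assuming columns and rows sit at the top level of the response as shown above):
# Sketch: convert {"columns": [...], "rows": [[...], ...]} into an array of row objects.
curl -s -X POST "https://api.tryopendata.ai/v1/datasets/owid/healthcare-spending/query" \
  -H "Content-Type: application/json" \
  -d '{"sql": "SELECT country, year, spending_per_capita FROM data WHERE year >= ?", "params": [2020]}' \
  | jq '.columns as $cols | .rows | map([$cols, .] | transpose | map({(.[0]): .[1]}) | add)'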
What We’d Tell Other API Teams
None of this is speculative: every one of the takeaways below came from watching agents fail and fixing the specific thing that tripped them up.
- Minimize round-trips above all else. Why it matters: every API call is a tool invocation (latency + tokens + context). What we saw: 72 calls cut to 8 by eliminating whole categories of calls.
- Front-load metadata. Why it matters: agents can't read docs, they probe instead. What we saw: 21 schema inspection calls eliminated by enriching discover.
- Eliminate string construction hazards. Why it matters: SQL in JSON in shell is an escaping nightmare. What we saw: ~8 quoting retries per session down to zero with parameterized queries.
- Optimize for tokens, not readability. Why it matters: agents tokenize every repeated key name. What we saw: ~45% token savings from the columnar response format.
- Design for multi-agent parallelism. Why it matters: agents don't coordinate, so the API must deduplicate. What we saw: batch discover with server-side dedup across 4 agents.
- Let agents tell you what's broken. Why it matters: structured post-mortems beat guessing. What we saw: an estimated 30-40% improvement that turned out to be 87%.
The Feedback Loop Is the Product
The best way to design an API for LLMs is to let LLMs use it and tell you what’s wrong. Three post-mortem cycles in one afternoon took us from 72 calls to 8 and from ~2,200 tokens per response to ~1,200.
A lot of the REST conventions we take for granted (verbose JSON, schema-on-demand, one-resource-per-endpoint) were designed for humans reading responses in browser dev tools. Those assumptions don’t hold when your primary consumer is a model with a token budget. The good news is that most of the fixes are simple, and the agents will tell you exactly what to fix if you ask.