Bootstrapping a Data Platform on Two Mac Minis

OpenData runs in production on two Mac Minis at $0/month infrastructure cost. Here's the architecture, the tradeoffs, and the specific triggers that would move us to cloud.

Riley Hilliard · Creator of OpenData · Jan 29, 2026 · 10 min

The default playbook for launching a data platform goes something like this: spin up a Kubernetes cluster on AWS, set up managed Postgres on RDS, add Redis via ElastiCache, throw in S3 for object storage, and put CloudFront in front of everything. Monthly bill before your first user signs up: $200-500. If you’re doing it “right” with multi-AZ redundancy, monitoring, and alerting, closer to $1,000.

We’re running OpenData on two Mac Minis sitting on a desk. Total monthly infrastructure cost: $0. The Minis were already here. Cloudflare Tunnel provides public HTTPS access with DDoS protection on the free tier. The platform serves real traffic today, 200+ datasets from 30+ government data providers, queryable via API with no downloads or reformatting.

It’s a deliberate strategy: validate the product before spending money on infrastructure you don’t need yet. Cloud is a scaling solution. If you don’t have a scaling problem, it’s a premature optimization that burns cash and adds complexity during the phase when you should be moving fast and learning what users actually want.

The infrastructure conversation in the startup world has a weird bias toward over-provisioning. You see it in blog posts, on Hacker News, in YC advice. “Set up your infra properly from day one so you don’t have to redo it later.” But “properly” gets conflated with “on AWS” and “redo it later” assumes your architecture can’t migrate. Ours can. That’s the whole point.

The Two-Box Split

The architecture decision that made this work: separating serving from processing onto two dedicated machines.

Running everything on one box works until background processing competes with user-facing requests for resources. Data ingestion downloads gigabytes of files from government portals. LLM enrichment burns CPU generating embeddings, extracting metadata, and translating cryptic column names into something human-readable. Neo4j’s Graph Data Science library crunches relationship algorithms across datasets. When all of this runs alongside the API handling user requests, response times spike unpredictably.

The fix was simple: two Mac Minis, each doing what it’s good at.

Serving box (M1 Mac Mini, 16GB RAM):

  • FastAPI backend (API mode only, no background workers)
  • React frontend (SSR via React Router)
  • Redis (caching, rate limiting)
  • DuckDB (in-memory queries against local Parquet files)
  • Cloudflare Tunnel (public HTTPS access)

Processing box (M4 Mac Mini, 32GB RAM):

  • PostgreSQL with pgvector (metadata, full-text search, vector similarity)
  • Neo4j with Graph Data Science (dataset relationship graph)
  • Enrichment worker (LLM calls, embedding generation, metadata extraction)
  • Ingestion jobs (fetching and parsing data from government sources)

This split looks clean, but one thing probably seems backwards: PostgreSQL lives on the processing box, not the serving box. The reason is write pressure. The enrichment worker writes to Postgres constantly. Every dataset that gets ingested triggers a cascade of metadata updates, embedding generation, category assignments, and relationship mapping. Co-locating the heaviest writer with the database eliminates network latency for those writes.

The serving box connects to Postgres over the local LAN. Both machines are on the same desk, connected through the same router. That adds 1-2ms per query, which is acceptable for API responses where the total response time is 50-200ms. Having every single enrichment write cross the network would be worse.

DuckDB stays on the serving box because dataset content queries (the main thing users do) need the fastest possible disk access to Parquet files. Those files are synced from the processing box via rsync, and since they're write-once (a new file is created per ingestion; the old one is never modified in place), the sync is efficient: rsync only copies files that changed.
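That sync step can be sketched as a thin wrapper around rsync. The host name and directory paths here are placeholders, not OpenData's actual layout:

```python
import subprocess

# Placeholder locations -- the real host name and directories will differ.
SOURCE = "processing-box:/data/parquet/"
DEST = "/data/parquet/"

def build_sync_command(source: str, dest: str) -> list[str]:
    # -a preserves permissions and timestamps; because Parquet files are
    # write-once, rsync's default size/mtime comparison cheaply skips
    # every file that already arrived in a previous run.
    return ["rsync", "-a", source, dest]

def sync_parquet() -> None:
    # check=True raises if rsync exits non-zero, so a failed sync is loud
    # instead of silently leaving the serving box with stale files.
    subprocess.run(build_sync_command(SOURCE, DEST), check=True)
```

Run on a cron or launchd schedule, this keeps the serving box's Parquet directory at most a few minutes behind the processing box.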

Both machines run Docker. The deploy pipeline builds ARM64 images, pushes them to GitHub Container Registry, and does zero-downtime rolling restarts on the serving box via Traefik health checks. The processing box deploys manually because there’s no user-facing traffic to worry about interrupting.

Cloudflare Tunnel: Free DDoS Protection and SSL

The piece that makes self-hosting viable for a public-facing service is Cloudflare Tunnel. It exposes the serving Mac Mini to the internet without opening router ports, without a static IP, and without revealing your home IP address. It works behind CGNAT, which is increasingly common with residential ISPs. Free tier. Unlimited bandwidth.

How it works: a cloudflared daemon runs on the serving box and establishes an outbound-only connection to Cloudflare’s edge network. Incoming requests from users route through Cloudflare’s global infrastructure, through the tunnel, and arrive at your local machine. You get automatic SSL certificate provisioning, DDoS protection, and Cloudflare’s CDN in front of your API.

Setup is three commands:

cloudflared tunnel create opendata
cloudflared tunnel route dns opendata api.opendata.place
cloudflared tunnel run opendata

Your Mac Mini is now serving HTTPS traffic at a custom domain with Cloudflare’s entire edge network in front of it. No nginx configuration. No Let’s Encrypt renewal scripts. No firewall rules to maintain.

The main limitation: Cloudflare Tunnel is HTTP/HTTPS only. No raw TCP passthrough, no WebSocket support without additional configuration. For a REST API that serves JSON responses (which is all OpenData does), this is exactly what you need. If you were building a real-time collaboration tool or a game server, you’d need a different approach.

There’s also an implicit trust tradeoff. All your traffic flows through Cloudflare. They can see it. For a platform serving public government data via a public API, this is fine. For a platform handling sensitive user data, you’d want to think harder about that.

DuckDB + Parquet: The Analytical Engine

The dataset query engine is DuckDB reading Parquet files. Each API request creates a throwaway in-memory DuckDB connection, reads the relevant Parquet file from disk, applies view transforms (filters, computed columns, joins between related tables), and returns results.

# From the query service -- each request gets its own throwaway connection.
def _get_connection(self) -> duckdb.DuckDBPyConnection:
    conn = duckdb.connect(":memory:")
    conn.execute("SET enable_progress_bar = false")
    return conn

No connection pool. No persistent database file. No write conflicts to manage.

DuckDB has a fundamental concurrency constraint: single writer, multiple readers. If you use a persistent database file, you need to coordinate writes carefully across multiple backend instances. By using in-memory connections that only read Parquet files, we sidestep this entirely. Each request is completely isolated. There’s nothing shared to conflict with.

This sounds wasteful. A new database connection per request? But DuckDB’s embedded architecture makes connection creation surprisingly cheap. There’s no network round-trip, no authentication handshake, no connection negotiation. It’s an in-process library call. The real bottleneck is I/O: reading Parquet column chunks from disk.

On the M1’s SSD, that’s fast enough for our current dataset sizes. And DuckDB’s columnar engine is smart about what it reads. Parquet files store data in column groups. If a dataset has 20 columns and your query only touches 3 of them, DuckDB reads roughly 15% of the file. A 100MB Parquet file effectively becomes a 15MB read for a typical query. Column pruning, row group filtering, and predicate pushdown all happen automatically.

The result: most queries complete in under 500ms, including connection creation, Parquet read, transform application, and JSON serialization. That’s fast enough for an API. It’s not fast enough for 10,000 concurrent users, but we’ll get to that.

The $0 Cost Breakdown

Here’s the honest breakdown of what this costs:

Component                   Monthly Cost          Notes
Mac Minis                   $0                    Already owned (M1 + M4, purchased for other work)
Cloudflare Tunnel           $0                    Free tier, unlimited bandwidth
Cloudflare DNS              $0                    Free tier with registered domain
GitHub Actions CI/CD        $0                    Public repository gets unlimited minutes
GitHub Container Registry   $0                    Free for public packages
Electricity                 ~$10-15               Two Mac Minis draw about 20-30W each at idle
Internet                    $0 (already paying)   Residential connection, not a dedicated line
Domain registration         ~$1                   Amortized annually, effectively nothing

Total real cost: roughly $10-15/month in electricity. Everything else is either hardware we already owned or services on free tiers that are genuinely free (not “free trial for 30 days” free).

What you give up for that price: no SLA, no geographic redundancy, no automatic failover. If home internet goes down, the platform goes down. If a Mac Mini dies, you restore from backup onto the remaining one and run in degraded mode. If there’s a power outage, everything is offline until power comes back.

For a platform in validation phase with around 50 weekly active users, this is a fine tradeoff. You’re exchanging operational resilience for cost control during the period when cost control matters most. For a platform with 10,000 daily active users and enterprise customers paying for uptime guarantees, it absolutely is not fine. The strategy accounts for that transition. It just doesn’t rush it.

Migration Triggers: When to Leave the Mac Minis

The strategy isn’t “stay on Mac Minis forever.” It’s “stay on Mac Minis until specific, measurable conditions force the move.” Not vibes. Not anxiety about scaling. Actual metrics with actual thresholds.

Disk usage exceeds 80%. The M1 has a 512GB SSD. When Parquet files cross 400GB, it’s time for external storage. The first move isn’t cloud, it’s Cloudflare R2 (S3-compatible object storage with zero egress fees at $0.015/GB/month). DuckDB can read Parquet files from S3-compatible storage natively. The code change is minimal: swap a file path for an R2 URL.

Query latency consistently exceeds 2 seconds. DuckDB on local Parquet handles most queries in under 500ms. If dataset sizes grow or query complexity pushes p95 latency above 2 seconds consistently (not a one-off slow query, but a trend), we need more compute. This likely means bigger Parquet files from sources with decades of historical data, or more complex view transforms with multiple joins.

Uptime requirements exceed “best effort.” The moment someone is paying for the service, or an integration partner depends on the API being available, you need an SLA. An SLA means redundancy. Redundancy means at minimum two serving boxes in different locations. That’s cloud territory.

Concurrent users exceed 100. The M1 can handle 50-100 concurrent API requests before response times degrade noticeably. Beyond that, you need horizontal scaling: more serving instances behind a load balancer. Docker makes this straightforward, but you need somewhere to run the additional instances.

Monthly growth exceeds 20% for three or more consecutive months. This is the real trigger. Sustained growth means the scaling problem is real, not hypothetical. A single viral Hacker News post that spikes traffic for a day is not a reason to migrate. Three months of compounding user growth is. That’s when the Mac Minis stop being a clever strategy and start being a bottleneck.

None of these triggers have fired yet. When they do, we know exactly where we’re going.

The Hetzner Escape Hatch

The next step isn’t AWS. It’s Hetzner.

For roughly $55-65/month total, you get dedicated servers with the same two-box architecture:

Serving box (Hetzner Cloud CX32 or CX42): 4-8 vCPU, 8-16GB RAM, $7-17/month. This replaces the M1 Mac Mini. Same Docker Compose file, same container images, same Cloudflare Tunnel config pointing at a new IP.

Processing box (Hetzner Dedicated AX42): 8-core AMD Ryzen 7, 64GB DDR5, 2x 512GB NVMe in RAID, $46/month. This replaces the M4 Mac Mini. Significantly more RAM for PostgreSQL and Neo4j, and NVMe RAID for faster ingestion writes.

The key insight: the architecture is identical. Same Docker Compose files. Same two-box split. Same Parquet-on-disk strategy. Same Cloudflare Tunnel for public access. The migration is mechanical:

  1. Provision two Hetzner boxes
  2. Pull Docker images from GitHub Container Registry (already public)
  3. Restore PostgreSQL from a pg_dump backup
  4. Rsync the Parquet files to the new serving box
  5. Update the Cloudflare Tunnel to point at the new serving box IP
  6. Wait for DNS propagation

The entire migration takes a few hours of focused work. No code changes. No architecture redesign. No rewrite of deployment scripts. The Docker containers genuinely do not know they moved from a Mac Mini to a Hetzner box. The ARM64 images we build for the M-series Macs also run on Hetzner’s ARM instances, or we build amd64 variants in the same CI pipeline.

Why Hetzner instead of AWS or GCP? Cost, primarily. The equivalent setup on AWS would run $200-500/month. Managed RDS for PostgreSQL ($60+ for a small instance), EC2 instances ($40+ each for something comparable), ElastiCache for Redis ($15+), NAT gateway charges ($30+ if you need private subnets), S3 storage plus egress fees that sneak up on you. Hetzner gives you more raw compute for roughly a quarter of the price.

The tradeoff is less managed infrastructure. There’s no one-click RDS. You run your own PostgreSQL, manage your own backups, handle your own upgrades. For a team comfortable with Docker, SSH, and pg_dump, this is fine. For a team that wants to focus purely on product and never think about database maintenance, managed services are worth the premium.

Hetzner also has no egress fees between boxes on the same private network, EU data centers if GDPR matters later, and straightforward pricing with no surprise line items. You know what you’re paying before you provision anything.

What This Strategy Isn’t

This approach works for a specific situation, and it’s worth being clear about where it doesn’t.

If you’re backed by a well-funded startup with $2M in the bank, spending $500/month on AWS is not a meaningful expense. Your time is more valuable than your infrastructure bill. Go with managed services, skip the ops overhead, and focus on building product. The self-hosting strategy is for bootstrapped projects where every dollar of runway matters.

If you have enterprise customers on day one, or contractual uptime requirements, or compliance needs that require specific data residency guarantees, go straight to cloud. The Mac Mini strategy is for the validation phase, when you’re still figuring out whether anyone wants what you’re building. Once you know they do, invest accordingly.

It also requires genuine comfort with operations. You need to be the person who SSHs into the box when something breaks at midnight. You need to manage your own backups and verify they actually restore. You need to notice when disk usage is creeping up, not discover it when writes start failing. The cloud managed-service premium buys operational peace of mind. Self-hosting buys cost control and complete understanding of your stack. Both are valid. You’re picking between them based on which resource is scarcer for you right now: money or time.

And critically, this only works if your architecture is portable. Docker containers that run identically everywhere. Standard PostgreSQL (not Aurora, not CockroachDB, not anything proprietary). S3-compatible storage abstractions so you can swap local disk for R2 or S3 with a config change. The migration path from Mac Mini to Hetzner to AWS only exists if you haven’t taken hard dependencies on cloud-specific services. Every managed service you adopt makes the previous tier of the strategy harder to go back to. That’s fine as long as the decision is deliberate.

The Decision Tree

The biggest risk in early-stage platform development isn’t that you’ll fail to scale. It’s that you’ll spend months building infrastructure for traffic that never arrives. The Mac Minis cost nothing. The Hetzner escape hatch costs $65/month. AWS costs $300+ and three weeks of DevOps configuration before you serve your first request.

Start cheap. Validate the product. Scale when the metrics tell you to, not when your inner architecture astronaut tells you to.

We’re at $0/month serving real users right now. When we hit the migration triggers, we know exactly where we’re going and how long it takes to get there. Stay on Mac Minis until you can’t, move to Hetzner until you can’t, move to hyperscale cloud when revenue justifies it. Each step is a direct response to measured growth, not anticipated growth.

Riley Hilliard

Creator of OpenData

At 13, I secretly drilled holes in my parents' wood floor to route a 56k modem line to my bedroom for late-night Age of Empires marathons. That same scrappy curiosity carried through 3 acquisitions, 9 years as a LinkedIn Staff Engineer building infrastructure for 1B+ users, and now fuels my side projects, like OpenData.
