How We Built a Job Data Pipeline on a $30 Database
Most job market data products are wrappers around a single API. Pull from LinkedIn, slap a dashboard on it, charge $500/month. The data is stale, the coverage is narrow, and the moment LinkedIn changes its TOS, the whole thing dies.
We wanted something different. Subspace scores thousands of companies on hiring health through a holistic analysis of public data — no LinkedIn dependency, no single point of failure.
Here's how we built it.
Scraping Job Listings Across Every Major ATS
There is no unified "job listings API." Every company posts jobs through a different Applicant Tracking System, and each ATS has its own page structure, API conventions, and quirks.
We support over a dozen ATS providers through a fleet of dedicated scrapers. Greenhouse and Lever are the easy ones — clean JSON APIs, predictable pagination. Workday is the other end of the spectrum — multiple shard domains, JavaScript-heavy pages, and descriptions that require separate fetches per listing. Some providers omit job descriptions from their bulk endpoints entirely, so we run separate backfill jobs for those.
The harder problem is ATS discovery. When a new company enters our pipeline, we usually don't know which ATS they use. So we built a brute-force waterfall:
1. Check if we already have the ATS on file
2. Try every known provider with common slug variations of the company name
3. Fall back to the company's root domain and look for career page patterns
4. If all else fails, use an LLM to infer the ATS from page structure
Most companies resolve in step 2. The waterfall catches the long tail — companies using obscure configurations or self-hosted career pages that don't follow any standard pattern.
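The waterfall is simple enough to sketch. A minimal version in Python — the helper names (`known_ats_for`, `probe`, `infer_with_llm`) are placeholders standing in for our real functions, injected here so the logic stays testable:

```python
# Minimal sketch of the discovery waterfall. Helper names are
# placeholders, not our real API.

def slug_variants(name: str) -> list:
    """Common slug variations: 'Acme Corp' -> ['acmecorp', 'acme-corp', 'acme']."""
    words = name.lower().split()
    variants = ["".join(words), "-".join(words)]
    if words:
        variants.append(words[0])
    return list(dict.fromkeys(variants))  # keep order, drop duplicates

def discover_ats(name, domain, known_ats_for, providers, probe, infer_with_llm):
    """Walk the waterfall; each stage returns (provider, board) or None."""
    # Step 1: check if we already have the ATS on file
    cached = known_ats_for(name)
    if cached:
        return cached
    # Step 2: try every known provider with common slug variations
    for provider in providers:
        for slug in slug_variants(name):
            hit = probe(provider, slug)  # e.g. GET boards.greenhouse.io/<slug>
            if hit:
                return hit
    # Step 3: fall back to career-page patterns on the root domain
    for path in ("/careers", "/jobs"):
        hit = probe("site", domain + path)
        if hit:
            return hit
    # Step 4: last resort, ask an LLM to infer the ATS from page structure
    return infer_with_llm(name, domain)
```

The point of the ordering is cost: cached lookups are free, slug probes are cheap HTTP calls, and the LLM only ever sees the long tail.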
Domain Validation: How We Nuked 6,500 Fake Companies
The pipeline needs companies to score. We source them through a mix of manual curation, public directories, industry lists, and press coverage — building a curated list of real companies with real career pages.
Early on, we tried to shortcut this by using LLMs to generate company lists at scale. Bad idea.
LLMs hallucinate company names. They'll confidently tell you "NovaTech Solutions" is a real Series B startup in Austin. It isn't. We ingested a few thousand LLM-generated companies before we realized roughly a third of them pointed to parked domains, redirect farms, or simply didn't exist.
So we built a three-stage domain validation gate that every company must pass before entering the pipeline:
Stage 1 — Static checks. Is the domain a known ATS infrastructure domain? Is it on our blocklist? Does the company name bear any resemblance to the domain? ("Ballet Bookstore" mapping to amazon.com is a real example of what LLM domain resolution produces.)
Stage 2 — HTTP reachability. Does the domain resolve? Is the SSL cert valid? Does it redirect to a parked page or domain registrar?
Stage 3 — LLM plausibility. For borderline cases, we ask a small model whether the company name and domain plausibly belong together. This catches the subtle cases — domains that resolve but host a personal blog, or companies that were acquired and the domain now redirects.
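A stripped-down sketch of the gate — the blocklists, similarity check, and helper names (`fetch`, `ask_llm`) here are illustrative, and the real checks are more thorough:

```python
# Illustrative three-stage gate. Network and LLM calls are injected so
# the stages stay pure and testable; the real data sets are far larger.

ATS_DOMAINS = {"boards.greenhouse.io", "jobs.lever.co"}  # never a company's own domain
BLOCKLIST = {"example.com", "parked-domains.net"}

def stage1_static(name: str, domain: str) -> bool:
    """Cheap checks needing no network: infra domains, blocklist, and a
    crude name/domain overlap test ('Ballet Bookstore' -> amazon.com fails)."""
    if domain in ATS_DOMAINS or domain in BLOCKLIST:
        return False
    label = domain.split(".")[0]
    tokens = [t for t in name.lower().split() if len(t) > 2]
    return any(t in label or label in t for t in tokens)

def stage2_reachable(domain: str, fetch) -> bool:
    """fetch(domain) -> (status, final_url) or None. Rejects dead domains
    and redirects to parked pages or registrars."""
    result = fetch(domain)
    if result is None:
        return False
    status, final_url = result
    return status == 200 and not any(p in final_url for p in ("parked", "sedo", "godaddy"))

def validate(name, domain, fetch, ask_llm):
    """Run the stages in order; the LLM only sees survivors."""
    return (stage1_static(name, domain)
            and stage2_reachable(domain, fetch)
            and ask_llm(name, domain))
```

Ordering matters here too: the static stage is free, reachability costs one request, and the LLM call — the most expensive check — runs only on borderline survivors.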
After deploying these gates, we ran a cleanup pass across the entire database. Over 6,500 companies got nuked. The pipeline went from noisy to clean overnight.
We also enforce a strict one-to-one mapping between companies and domains. If multiple company names point to the same domain, something went wrong in resolution. One company, one domain, no exceptions.
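Auditing that invariant is a single GROUP BY. A sketch against a toy SQLite table (our production schema differs):

```python
import sqlite3

# Sketch of the one-company-one-domain audit. Table and column names
# are illustrative, not our production schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE companies (name TEXT, domain TEXT)")
conn.executemany(
    "INSERT INTO companies VALUES (?, ?)",
    [("Acme", "acme.com"),
     ("Acme Inc", "acme.com"),   # resolution bug: two names, one domain
     ("Globex", "globex.com")],
)

# Any domain claimed by more than one company name failed resolution.
dupes = conn.execute("""
    SELECT domain, COUNT(DISTINCT name)
    FROM companies
    GROUP BY domain
    HAVING COUNT(DISTINCT name) > 1
""").fetchall()
```

Running a query like this on a schedule turns the "no exceptions" rule from a hope into an alert.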
Data Enrichment: Going Beyond Job Listings
Job listings alone tell only part of the story. A company could have great-looking job posts but be three months from layoffs. To catch that, we pull from dozens of enrichment sources — government filings, financial data, corporate registries across multiple countries, press and news feeds, and more.
Some sources are "broadcast" — they produce a continuous firehose of data that we match against our company list. Others are "domain-targeted" — we query them for each company individually. Each source implements a common interface, so adding a new data feed is a single file.
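A minimal sketch of what that common interface can look like in Python — the names (`EnrichmentSource`, `PressFeed`, the payload shape) are illustrative of the pattern, not our exact code:

```python
from typing import List, Literal, Optional, Protocol

Payload = dict  # standardized fields: source, company_domain, signals, ...

class EnrichmentSource(Protocol):
    """What every source file must provide to plug into the pipeline."""
    name: str
    mode: Literal["broadcast", "domain-targeted"]

    def fetch(self, domain: Optional[str] = None) -> List[Payload]:
        """Broadcast sources ignore `domain` and return a firehose batch
        (matched downstream); domain-targeted sources query one company."""
        ...

class PressFeed:
    """Example broadcast source: emits everything it sees."""
    name = "press_feed"
    mode = "broadcast"

    def fetch(self, domain=None):
        return [{"source": self.name, "company_domain": "acme.com",
                 "signals": {"press_mentions": 3}}]

def run_enrichment(sources, domains):
    """Dispatch each source according to its declared mode."""
    payloads = []
    for src in sources:
        if src.mode == "broadcast":
            payloads.extend(src.fetch())
        else:
            for d in domains:
                payloads.extend(src.fetch(domain=d))
    return payloads
```

Because every source declares its own mode and returns the same payload shape, the dispatcher never needs to know what a given feed actually is.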
The enrichment layer runs before scoring. This is a hard rule we violated twice early on and got garbage results both times. Without enrichment, the scoring engine only sees surface-level signals and thinks everything looks healthy.
Running a Distributed Pipeline on 7 VMs
We run on 7 VMs across three cloud providers — Vultr, IBM Cloud (burning through free credits), and an Oracle Cloud free-tier micro instance running on 1GB RAM and swap.
Every VM runs the same pipeline script. The queue is the database itself — each VM grabs the company with the oldest check timestamp and claims it atomically, preventing overlap. No Redis, no RabbitMQ, no message broker. Just an atomic UPDATE with a WHERE clause.
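The claim itself is a compare-and-swap: select a candidate, then update it only if nobody else got there first. A sketch using SQLite so it runs anywhere — in Postgres you would typically add `FOR UPDATE SKIP LOCKED` and `RETURNING`, and the real schema is more involved:

```python
import sqlite3

# Toy stand-in for the database-as-queue claim.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE companies (
    id INTEGER PRIMARY KEY, name TEXT,
    last_checked INTEGER DEFAULT 0, claimed_by TEXT)""")
conn.executemany("INSERT INTO companies (name) VALUES (?)",
                 [("Acme",), ("Globex",)])

def claim_next(conn, worker):
    """Atomically claim the company with the oldest check timestamp."""
    while True:
        row = conn.execute(
            "SELECT id, name FROM companies WHERE claimed_by IS NULL "
            "ORDER BY last_checked, id LIMIT 1").fetchone()
        if row is None:
            return None  # queue drained
        cid, name = row
        # The WHERE clause is what makes the claim atomic: only one
        # worker's UPDATE can match while claimed_by is still NULL.
        cur = conn.execute(
            "UPDATE companies SET claimed_by = ? "
            "WHERE id = ? AND claimed_by IS NULL", (worker, cid))
        if cur.rowcount == 1:
            return cid, name  # we won the race
        # lost the race; loop and try the next-oldest row
```

If two VMs race for the same row, one UPDATE matches and the other sees `rowcount == 0` and retries — no broker required.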
The full pipeline per company runs everything end-to-end: discover ATS boards, scrape jobs, score listings, run enrichment, compute company health, snapshot results. No shortcuts. Every company gets the same treatment every cycle.
The system is designed to scale to thousands of companies per day as the pipeline matures.
What You Can Actually Build on a $30 Postgres Database
Everything runs on Supabase's Small tier. 2 CPU cores, 2GB RAM, $30/month.
This is… constraining. A few things we learned the hard way:
- Sorting large tables in the database layer is a guaranteed timeout on a 2-core machine. We push sorting to application code.
- The API layer silently caps queries at 1,000 rows — we had undercounting bugs in three separate places before we caught this and added explicit pagination everywhere.
- Full-text search across job descriptions was pulling 8-second queries until we narrowed the search scope and added date filters, bringing it down to under 400ms.
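The pagination lesson is worth internalizing: assume every query is capped, and loop until a short page proves you've seen everything. A sketch, where `fetch_page` stands in for the client's range query:

```python
# Defensive pagination over an API layer that silently caps results.

PAGE_SIZE = 1000  # match the layer's cap so the short-page test is exact

def fetch_all(fetch_page):
    """fetch_page(offset, limit) -> list of rows. A page shorter than
    PAGE_SIZE is the last one; anything else means keep going."""
    rows, offset = [], 0
    while True:
        page = fetch_page(offset, PAGE_SIZE)
        rows.extend(page)
        if len(page) < PAGE_SIZE:
            return rows
        offset += PAGE_SIZE
```

The subtle bug this prevents: a query that returns exactly 1,000 rows looks complete but isn't — only the loop-until-short-page pattern distinguishes "last page" from "capped page".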
You can build a lot on a tiny database if you respect its constraints. The key is knowing where the bottlenecks are and designing around them instead of throwing money at the problem.
9-Second Deploys Without CI/CD
We built a deploy controller that watches the main branch. When a new commit lands, it SSHs into all 7 VMs in parallel, pulls the latest code, and restarts services. Total deploy time: about 9 seconds.
No GitHub Actions, no CI/CD platform, no Docker registry. The controller also runs a health check loop — if a VM goes dark, it flags immediately.
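The fan-out is small enough to sketch. Hostnames, the repo path, and the restart command below are placeholders, and the subprocess runner is injectable so the logic can be exercised without real SSH:

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Sketch of the deploy fan-out. Hosts, paths, and the service name
# are placeholders.
HOSTS = ["vm1.example.com", "vm2.example.com"]  # all 7 in practice
DEPLOY_CMD = "cd /opt/pipeline && git pull --ff-only && sudo systemctl restart pipeline"

def deploy(host, run=subprocess.run):
    """SSH in, pull, restart; returns (host, succeeded)."""
    result = run(["ssh", "-o", "ConnectTimeout=5", host, DEPLOY_CMD],
                 capture_output=True, text=True)
    return host, result.returncode == 0

def deploy_all(hosts=HOSTS, run=subprocess.run):
    """Fan out in parallel; wall time ~= the slowest VM, not the sum."""
    with ThreadPoolExecutor(max_workers=len(hosts)) as pool:
        return dict(pool.map(lambda h: deploy(h, run), hosts))
```

Parallelism is the whole trick: seven sequential SSH sessions would take a minute; seven concurrent ones take as long as the slowest VM.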
Is this "production-grade" by big-company standards? No. Does it work perfectly for a system where every VM runs the same stateless pipeline script? Yes.
If You're Building a Data Pipeline From Scratch
We made most of the mistakes so you don't have to. Here's what we'd tell someone starting something similar.
Start with the data model, not the scraper. It's tempting to write scrapers first and figure out the schema later. Don't. Define what a "company" means in your system upfront — one domain, one canonical name, one record. Every ambiguity you leave in the data model will multiply into bugs downstream. We spent more time cleaning up entity resolution messes than we ever spent writing scrapers.
Build a validation gate before you build scale. The instinct is to get as many records into the pipeline as fast as possible. Resist it. Every bad record that enters your system will poison scoring, waste compute, and erode trust in your output. We wish we'd built the three-stage domain validation gate on day one instead of after 6,500 fake companies were already in the database. A simple reachability check and a blocklist of known junk domains would have caught 90% of them.
Use your database as the queue. If your workers are all running the same stateless script, you don't need Kafka or Redis or SQS. An atomic UPDATE ... WHERE claim on the oldest unprocessed row is simple, debuggable, and has zero operational overhead. The tradeoff is you won't get sub-second dispatch latency — but for batch pipelines processing thousands of records per day, that doesn't matter.
Design your enrichment as a plugin system early. We built a common interface for enrichment sources — each one declares whether it's broadcast or domain-targeted, exposes a fetch() method, and returns a standard payload. Adding a new data source is now a single file that implements the interface. If we'd hard-coded the first few sources inline, refactoring them later would have been brutal.
Respect the cheap database. You can run a serious data pipeline on a $30 managed Postgres instance if you play by its rules. Move sorting to application code. Paginate everything — assume the API layer has invisible limits. Avoid full-text search on wide columns; scope your queries tightly with date filters and narrow WHERE clauses. The moment you treat a 2-core database like an 8-core database, everything times out.
Don't build CI/CD until you need CI/CD. Our deploy is SSH + git pull + process restart, and it's been rock solid. Tools like GitHub Actions and Docker registries solve real problems — at a certain scale. Below that scale, they're overhead that slows you down and adds failure modes. Start with the simplest thing that works and add complexity only when the simple thing breaks.
Enrich before you score — always. If your scoring engine can run without enrichment data and still produce a result, it will. And that result will be wrong. Make enrichment a hard prerequisite, not an optional enhancement. We violated this rule twice and got meaningless scores both times.
Lessons From Building Job Market Infrastructure
LLM-generated data is a draft, not a source of truth. Every piece of LLM output needs a validation gate before it touches production. We learned this the hard way with 6,500 fake companies.
Pipeline data starts incomplete and fills over cycles. Early on, we kept flagging "the scoring is broken" when the real issue was that the pipeline simply hadn't reached those companies yet. Diagnosis before prescription.
Cost is not a reason to skip quality. When we were tempted to skip enrichment to save API calls, the scores came back meaningless. Never silently degrade quality — flag the tradeoff and let a human decide.
Simple beats clever. Atomic database queue claiming instead of a message broker. SSH deploys instead of a CI/CD pipeline. Scoped search instead of a full-text index. Every "simple" choice saved us weeks of operational complexity.
We're processing thousands of companies daily across public data sources that most people don't know exist. If you want to see what this data looks like in practice, check out our listing quality report, try the live scanner, or install the free browser extension to scan job listings as you browse.
— Subspace