How We Built a Job Data Pipeline on a $30 Database
Most job market data products are wrappers around a single API. Pull from LinkedIn, slap a dashboard on it, charge $500/month. The data is stale, the coverage is narrow, and the moment LinkedIn changes its TOS, the whole thing dies.
We wanted something different. Subspace scores thousands of companies on hiring health through a holistic analysis of public data — no LinkedIn dependency, no single point of failure.
Here's how we built it.
Scraping Job Listings Across Every Major ATS
There is no unified "job listings API." Every company posts jobs through a different Applicant Tracking System, and each ATS has its own page structure, API conventions, and quirks.
We support over a dozen ATS providers through a fleet of dedicated scrapers. Greenhouse and Lever are the easy ones — clean JSON APIs, predictable pagination. Workday is the other end of the spectrum — multiple shard domains, JavaScript-heavy pages, and descriptions that require separate fetches per listing. Some providers omit job descriptions from their bulk endpoints entirely, so we run separate backfill jobs for those.
The harder problem is ATS discovery. When a new company enters our pipeline, we usually don't know which ATS they use. So we built a brute-force waterfall:
1. Check if we already have the ATS on file
2. Try every known provider with common slug variations of the company name
3. Fall back to the company's root domain and look for career page patterns
4. If all else fails, use an LLM to infer the ATS from page structure
Most companies resolve in step 2. The waterfall catches the long tail — companies using obscure configurations or self-hosted career pages that don't follow any standard pattern.
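The waterfall is simple enough to sketch. A minimal version in Python — the helper names (`known_ats_for`, `probe`, `infer_with_llm`) are placeholders standing in for our real functions, injected here so the logic stays testable:

```python
# Minimal sketch of the discovery waterfall. Helper names are
# placeholders, not our real API.

def slug_variants(name: str) -> list:
    """Common slug variations: 'Acme Corp' -> ['acmecorp', 'acme-corp', 'acme']."""
    words = name.lower().split()
    variants = ["".join(words), "-".join(words)]
    if words:
        variants.append(words[0])
    return list(dict.fromkeys(variants))  # keep order, drop duplicates

def discover_ats(name, domain, known_ats_for, providers, probe, infer_with_llm):
    """Walk the waterfall; each stage returns (provider, board) or None."""
    # Step 1: check if we already have the ATS on file
    cached = known_ats_for(name)
    if cached:
        return cached
    # Step 2: try every known provider with common slug variations
    for provider in providers:
        for slug in slug_variants(name):
            hit = probe(provider, slug)  # e.g. GET boards.greenhouse.io/<slug>
            if hit:
                return hit
    # Step 3: fall back to career-page patterns on the root domain
    for path in ("/careers", "/jobs"):
        hit = probe("site", domain + path)
        if hit:
            return hit
    # Step 4: last resort, ask an LLM to infer the ATS from page structure
    return infer_with_llm(name, domain)
```

The point of the ordering is cost: cached lookups are free, slug probes are cheap HTTP calls, and the LLM only ever sees the long tail.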
Domain Validation: How We Nuked 6,500 Fake Companies
The pipeline needs companies to score. We source them through a mix of manual curation, public directories, industry lists, and press coverage — building a curated list of real companies with real career pages.
Early on, we tried to shortcut this by using LLMs to generate company lists at scale. Bad idea.
LLMs hallucinate company names. They'll confidently tell you "NovaTech Solutions" is a real Series B startup in Austin. It isn't. We ingested a few thousand LLM-generated companies before we realized roughly a third of them pointed to parked domains, redirect farms, or simply didn't exist.
So we built a three-stage domain validation gate that every company must pass before entering the pipeline:
Stage 1 — Static checks. Is the domain a known ATS infrastructure domain? Is it on our blocklist? Does the company name bear any resemblance to the domain? ("Ballet Bookstore" mapping to amazon.com is a real example of what LLM domain resolution produces.)
Stage 2 — HTTP reachability. Does the domain resolve? Is the SSL cert valid? Does it redirect to a parked page or domain registrar?
Stage 3 — LLM plausibility. For borderline cases, we ask a small model whether the company name and domain plausibly belong together. This catches the subtle cases — domains that resolve but host a personal blog, or companies that were acquired and the domain now redirects.
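A stripped-down sketch of the gate — the blocklists, similarity check, and helper names (`fetch`, `ask_llm`) here are illustrative, and the real checks are more thorough:

```python
# Illustrative three-stage gate. Network and LLM calls are injected so
# the stages stay pure and testable; the real data sets are far larger.

ATS_DOMAINS = {"boards.greenhouse.io", "jobs.lever.co"}  # never a company's own domain
BLOCKLIST = {"example.com", "parked-domains.net"}

def stage1_static(name: str, domain: str) -> bool:
    """Cheap checks needing no network: infra domains, blocklist, and a
    crude name/domain overlap test ('Ballet Bookstore' -> amazon.com fails)."""
    if domain in ATS_DOMAINS or domain in BLOCKLIST:
        return False
    label = domain.split(".")[0]
    tokens = [t for t in name.lower().split() if len(t) > 2]
    return any(t in label or label in t for t in tokens)

def stage2_reachable(domain: str, fetch) -> bool:
    """fetch(domain) -> (status, final_url) or None. Rejects dead domains
    and redirects to parked pages or registrars."""
    result = fetch(domain)
    if result is None:
        return False
    status, final_url = result
    return status == 200 and not any(p in final_url for p in ("parked", "sedo", "godaddy"))

def validate(name, domain, fetch, ask_llm):
    """Run the stages in order; the LLM only sees survivors."""
    return (stage1_static(name, domain)
            and stage2_reachable(domain, fetch)
            and ask_llm(name, domain))
```

Ordering matters here too: the static stage is free, reachability costs one request, and the LLM call — the most expensive check — runs only on borderline survivors.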
After deploying these gates, we ran a cleanup pass across the entire database. Over 6,500 companies got nuked. The pipeline went from noisy to clean overnight.
We also enforce a strict one-to-one mapping between companies and domains. If multiple company names point to the same domain, something went wrong in resolution. One company, one domain, no exceptions.
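Auditing that invariant is a single GROUP BY. A sketch against a toy SQLite table (our production schema differs):

```python
import sqlite3

# Sketch of the one-company-one-domain audit. Table and column names
# are illustrative, not our production schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE companies (name TEXT, domain TEXT)")
conn.executemany(
    "INSERT INTO companies VALUES (?, ?)",
    [("Acme", "acme.com"),
     ("Acme Inc", "acme.com"),   # resolution bug: two names, one domain
     ("Globex", "globex.com")],
)

# Any domain claimed by more than one company name failed resolution.
dupes = conn.execute("""
    SELECT domain, COUNT(DISTINCT name)
    FROM companies
    GROUP BY domain
    HAVING COUNT(DISTINCT name) > 1
""").fetchall()
```

Running a query like this on a schedule turns the "no exceptions" rule from a hope into an alert.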
Data Enrichment: Going Beyond Job Listings
Job listings alone tell only part of the story. A company could have great-looking job posts but be three months from layoffs. To catch that, we pull from dozens of enrichment sources — government filings, financial data, corporate registries across multiple countries, press and news feeds, and more.
Some sources are "broadcast" — they produce a continuous firehose of data that we match against our company list. Others are "domain-targeted" — we query them for each company individually. Each source implements a common interface, so adding a new data feed is a single file.
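A minimal sketch of what that common interface can look like in Python — the names (`EnrichmentSource`, `PressFeed`, the payload shape) are illustrative of the pattern, not our exact code:

```python
from typing import List, Literal, Optional, Protocol

Payload = dict  # standardized fields: source, company_domain, signals, ...

class EnrichmentSource(Protocol):
    """What every source file must provide to plug into the pipeline."""
    name: str
    mode: Literal["broadcast", "domain-targeted"]

    def fetch(self, domain: Optional[str] = None) -> List[Payload]:
        """Broadcast sources ignore `domain` and return a firehose batch
        (matched downstream); domain-targeted sources query one company."""
        ...

class PressFeed:
    """Example broadcast source: emits everything it sees."""
    name = "press_feed"
    mode = "broadcast"

    def fetch(self, domain=None):
        return [{"source": self.name, "company_domain": "acme.com",
                 "signals": {"press_mentions": 3}}]

def run_enrichment(sources, domains):
    """Dispatch each source according to its declared mode."""
    payloads = []
    for src in sources:
        if src.mode == "broadcast":
            payloads.extend(src.fetch())
        else:
            for d in domains:
                payloads.extend(src.fetch(domain=d))
    return payloads
```

Because every source declares its own mode and returns the same payload shape, the dispatcher never needs to know what a given feed actually is.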
The enrichment layer runs before scoring. This is a hard rule we violated twice early on and got garbage results both times. Without enrichment, the scoring engine only sees surface-level signals and thinks everything looks healthy.
Running a Distributed Pipeline on 7 VMs
We run on 7 VMs across three cloud providers — Vultr, IBM Cloud (burning through free credits), and an Oracle Cloud free-tier micro instance running on 1GB RAM and swap.
Every VM runs the same pipeline script. The queue is the database itself — each VM grabs the company with the oldest check timestamp and claims it atomically, preventing overlap. No Redis, no RabbitMQ, no message broker. Just an atomic UPDATE with a WHERE clause.
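The claim itself is a compare-and-swap: select a candidate, then update it only if nobody else got there first. A sketch using SQLite so it runs anywhere — in Postgres you would typically add `FOR UPDATE SKIP LOCKED` and `RETURNING`, and the real schema is more involved:

```python
import sqlite3

# Toy stand-in for the database-as-queue claim.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE companies (
    id INTEGER PRIMARY KEY, name TEXT,
    last_checked INTEGER DEFAULT 0, claimed_by TEXT)""")
conn.executemany("INSERT INTO companies (name) VALUES (?)",
                 [("Acme",), ("Globex",)])

def claim_next(conn, worker):
    """Atomically claim the company with the oldest check timestamp."""
    while True:
        row = conn.execute(
            "SELECT id, name FROM companies WHERE claimed_by IS NULL "
            "ORDER BY last_checked, id LIMIT 1").fetchone()
        if row is None:
            return None  # queue drained
        cid, name = row
        # The WHERE clause is what makes the claim atomic: only one
        # worker's UPDATE can match while claimed_by is still NULL.
        cur = conn.execute(
            "UPDATE companies SET claimed_by = ? "
            "WHERE id = ? AND claimed_by IS NULL", (worker, cid))
        if cur.rowcount == 1:
            return cid, name  # we won the race
        # lost the race; loop and try the next-oldest row
```

If two VMs race for the same row, one UPDATE matches and the other sees `rowcount == 0` and retries — no broker required.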
The full pipeline per company runs everything end-to-end: discover ATS boards, scrape jobs, score listings, run enrichment, compute company health, snapshot results. No shortcuts. Every company gets the same treatment every cycle.
The system is designed to scale to thousands of companies per day as the pipeline matures.
What You Can Actually Build on a $30 Postgres Database
Everything runs on Supabase's Small tier. 2 CPU cores, 2GB RAM, $30/month.
This is… constraining. A few things we learned the hard way:
- Sorting large tables in the database layer is a guaranteed timeout on a 2-core machine. We push sorting to application code.
- The API layer silently caps queries at 1,000 rows — we had undercounting bugs in three separate places before we caught this and added explicit pagination everywhere.
- Full-text search across job descriptions was pulling 8-second queries until we narrowed the search scope and added date filters, bringing it down to under 400ms.
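The pagination lesson is worth internalizing: assume every query is capped, and loop until a short page proves you've seen everything. A sketch, where `fetch_page` stands in for the client's range query:

```python
# Defensive pagination over an API layer that silently caps results.

PAGE_SIZE = 1000  # match the layer's cap so the short-page test is exact

def fetch_all(fetch_page):
    """fetch_page(offset, limit) -> list of rows. A page shorter than
    PAGE_SIZE is the last one; anything else means keep going."""
    rows, offset = [], 0
    while True:
        page = fetch_page(offset, PAGE_SIZE)
        rows.extend(page)
        if len(page) < PAGE_SIZE:
            return rows
        offset += PAGE_SIZE
```

The subtle bug this prevents: a query that returns exactly 1,000 rows looks complete but isn't — only the loop-until-short-page pattern distinguishes "last page" from "capped page".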
You can build a lot on a tiny database if you respect its constraints. The key is knowing where the bottlenecks are and designing around them instead of throwing money at the problem.
9-Second Deploys Without CI/CD
We built a deploy controller that watches the main branch. When a new commit lands, it SSHs into all 7 VMs in parallel, pulls the latest code, and restarts services. Total deploy time: about 9 seconds.
No GitHub Actions, no CI/CD platform, no Docker registry. The controller also runs a health check loop — if a VM goes dark, it flags immediately.
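The fan-out is small enough to sketch. Hostnames, the repo path, and the restart command below are placeholders, and the subprocess runner is injectable so the logic can be exercised without real SSH:

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Sketch of the deploy fan-out. Hosts, paths, and the service name
# are placeholders.
HOSTS = ["vm1.example.com", "vm2.example.com"]  # all 7 in practice
DEPLOY_CMD = "cd /opt/pipeline && git pull --ff-only && sudo systemctl restart pipeline"

def deploy(host, run=subprocess.run):
    """SSH in, pull, restart; returns (host, succeeded)."""
    result = run(["ssh", "-o", "ConnectTimeout=5", host, DEPLOY_CMD],
                 capture_output=True, text=True)
    return host, result.returncode == 0

def deploy_all(hosts=HOSTS, run=subprocess.run):
    """Fan out in parallel; wall time ~= the slowest VM, not the sum."""
    with ThreadPoolExecutor(max_workers=len(hosts)) as pool:
        return dict(pool.map(lambda h: deploy(h, run), hosts))
```

Parallelism is the whole trick: seven sequential SSH sessions would take a minute; seven concurrent ones take as long as the slowest VM.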
Is this "production-grade" by big-company standards? No. Does it work perfectly for a system where every VM runs the same stateless pipeline script? Yes.
If You're Building a Data Pipeline From Scratch
We made most of the mistakes so you don't have to. Here's what we'd tell someone starting something similar.
Start with the data model, not the scraper. It's tempting to write scrapers first and figure out the schema later. Don't. Define what a "company" means in your system upfront — one domain, one canonical name, one record. Every ambiguity you leave in the data model will multiply into bugs downstream. We spent more time cleaning up entity resolution messes than we ever spent writing scrapers.
Build a validation gate before you build scale. The instinct is to get as many records into the pipeline as fast as possible. Resist it. Every bad record that enters your system will poison scoring, waste compute, and erode trust in your output. We wish we'd built the three-stage domain validation gate on day one instead of after 6,500 fake companies were already in the database. A simple reachability check and a blocklist of known junk domains would have caught 90% of them.
Use your database as the queue. If your workers are all running the same stateless script, you don't need Kafka or Redis or SQS. An atomic UPDATE ... WHERE claim on the oldest unprocessed row is simple, debuggable, and has zero operational overhead. The tradeoff is you won't get sub-second dispatch latency — but for batch pipelines processing thousands of records per day, that doesn't matter.
Design your enrichment as a plugin system early. We built a common interface for enrichment sources — each one declares whether it's broadcast or domain-targeted, exposes a fetch() method, and returns a standard payload. Adding a new data source is now a single file that implements the interface. If we'd hard-coded the first few sources inline, refactoring them later would have been brutal.
Respect the cheap database. You can run a serious data pipeline on a $30 managed Postgres instance if you play by its rules. Move sorting to application code. Paginate everything — assume the API layer has invisible limits. Avoid full-text search on wide columns; scope your queries tightly with date filters and narrow WHERE clauses. The moment you treat a 2-core database like an 8-core database, everything times out.
Don't build CI/CD until you need CI/CD. Our deploy is SSH + git pull + process restart, and it's been rock solid. Tools like GitHub Actions and Docker registries solve real problems — at a certain scale. Below that scale, they're overhead that slows you down and adds failure modes. Start with the simplest thing that works and add complexity only when the simple thing breaks.
Enrich before you score — always. If your scoring engine can run without enrichment data and still produce a result, it will. And that result will be wrong. Make enrichment a hard prerequisite, not an optional enhancement. We violated this rule twice and got meaningless scores both times.
Lessons From Building Job Market Infrastructure
LLM-generated data is a draft, not a source of truth. Every piece of LLM output needs a validation gate before it touches production. We learned this the hard way with 6,500 fake companies.
Pipeline data starts incomplete and fills over cycles. Early on, we kept flagging "the scoring is broken" when the real issue was that the pipeline simply hadn't reached those companies yet. Diagnosis before prescription.
Cost is not a reason to skip quality. When we were tempted to skip enrichment to save API calls, the scores came back meaningless. Never silently degrade quality — flag the tradeoff and let a human decide.
Simple beats clever. Atomic database queue claiming instead of a message broker. SSH deploys instead of a CI/CD pipeline. Scoped search instead of a full-text index. Every "simple" choice saved us weeks of operational complexity.
We're processing thousands of companies daily across public data sources that most people don't know exist. If you want to see what this data looks like in practice, check out our listing quality report, try the live scanner, or install the free browser extension to scan job listings as you browse.
— Subspace