Guide·Jun 26, 2026·9 min read

Job scraper: the build-vs-buy guide for 2026

A job scraper is easy to demo and expensive to keep alive - anti-bot, proxies, changing HTML, dedup, and a 2am pager. Here's the honest build-vs-buy math, by source, and when an API wins.

Eng team

Engineering

A job scraper is one of those projects that demos in an afternoon and bills you for years. You point a script at a careers page, parse the listings into JSON, and it works. Then it ships, and the real job begins: the page that loaded fine yesterday returns a CAPTCHA today, the HTML class names changed overnight, and the same role shows up four times because three boards syndicated it.

This is the honest build-vs-buy guide for anyone weighing a job scraper in 2026 - what the work actually involves, which sources are hard, the cost that never shows up in the prototype, and the cases where building your own is genuinely the right call.

What a job scraper actually has to do

Fetching a page is the 10% everyone sees. The other 90% is the part that decides whether the data is usable:

  • Discovery. You can scrape one company’s board if you know its URL. Scraping the market means first finding every board - thousands of Greenhouse and Lever tenants, hundreds of Workday subdomains - none of which publish a directory.
  • Rendering. Indeed, Glassdoor, and LinkedIn ship jobs through JavaScript and bot-managed endpoints, so a plain HTTP GET sees nothing. You end up running a headless browser fleet, which is slow and expensive.
  • Normalization. Every source phrases salary, location, and employment type differently. “$180k-DOE”, “Remote (US)”, and “Contract / W2” all have to become one schema before the data is queryable.
  • Deduplication. The same job is posted to the company ATS, reposted to Indeed, and syndicated to three aggregators. Without cross-source dedup your “100,000 jobs” is closer to 40,000.
  • Freshness and expiry. A posting that closed last week is worse than no data. You need to re-crawl constantly and detect removals, not just additions.
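
The dedup and expiry steps above can be sketched in a few lines. This is a minimal illustration, not a production matcher: the field names (company, title, location), the fingerprint function, and the sample records are all assumptions made for the example, and real cross-source dedup usually needs fuzzier matching than exact normalized keys.

```python
import hashlib
import re


def _norm(text: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return " ".join(text.split())


def fingerprint(job: dict) -> str:
    """Stable key for 'the same job', whichever board it came from (hypothetical schema)."""
    key = "|".join(_norm(job.get(f, "")) for f in ("company", "title", "location"))
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


def dedupe(jobs: list[dict]) -> dict[str, dict]:
    """Keep the first posting seen for each fingerprint."""
    unique: dict[str, dict] = {}
    for job in jobs:
        unique.setdefault(fingerprint(job), job)
    return unique


def removed_since(previous: set[str], current: set[str]) -> set[str]:
    """Fingerprints present in the last crawl but missing from this one."""
    return previous - current


# Made-up sample: the first two rows are the same role from two sources.
crawl = [
    {"company": "Acme, Inc.", "title": "Software Engineer", "location": "Remote (US)", "source": "ats"},
    {"company": "acme inc", "title": "Software  Engineer", "location": "remote us", "source": "aggregator"},
    {"company": "Acme, Inc.", "title": "Data Analyst", "location": "Austin, TX", "source": "ats"},
]
unique = dedupe(crawl)
print(len(crawl), "postings,", len(unique), "unique jobs")
```

Storing the set of fingerprints from each crawl and passing two consecutive sets to removed_since is one simple way to detect removals rather than only additions.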

The cost that does not show up in the demo

The prototype is free. Production has a standing bill, and most of it is operational rather than compute:

  • Residential proxies. The big boards rate-limit and block datacenter IPs, so you rent residential or mobile proxy pools - priced per GB, and job pages are not small.
  • Anti-bot churn. Akamai, Cloudflare, and PerimeterX ship detection updates on their schedule, not yours. Each one is an unplanned firefight.
  • HTML drift. Selectors break silently. The scraper keeps running and quietly returns empty fields until someone notices the data went stale.
  • The pager. All of the above tends to break at the worst time. A scraper is a 24/7 system pretending to be a script.
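
Silent HTML drift is the one item on this list that a cheap check can catch early. A rough sketch, assuming each crawl yields a batch of parsed records as dicts: measure how often each field is actually populated and flag any field that falls below a threshold. The field names and the 0.9 threshold are placeholders for illustration.

```python
def fill_rates(records: list[dict], fields: tuple[str, ...]) -> dict[str, float]:
    """Share of records in which each field is present and non-empty."""
    if not records:
        # An empty batch is itself a red flag, so report every field as 0% filled.
        return {f: 0.0 for f in fields}
    return {
        f: sum(1 for r in records if r.get(f) not in (None, "", [])) / len(records)
        for f in fields
    }


def drifted_fields(records: list[dict], fields: tuple[str, ...], min_rate: float = 0.9) -> list[str]:
    """Fields whose fill rate dropped below min_rate, likely because a selector broke."""
    return sorted(f for f, rate in fill_rates(records, fields).items() if rate < min_rate)


# Made-up batch: the salary selector has stopped matching on most pages.
batch = [
    {"title": "Software Engineer", "location": "Remote", "salary": "$150k"},
    {"title": "Data Analyst", "location": "Austin, TX", "salary": ""},
    {"title": "Product Designer", "location": "Berlin", "salary": None},
    {"title": "SRE", "location": "London", "salary": ""},
]
alerts = drifted_fields(batch, ("title", "location", "salary"))
print(alerts)
```

Running a check like this after every crawl turns "someone notices the data went stale" into an alert on the same day the selector breaks.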

Difficulty by source

Not all targets are equal. Roughly, from easiest to hardest to scrape reliably at scale:

| Source | Difficulty | Why |
| --- | --- | --- |
| Greenhouse / Lever / Ashby | Low | Public JSON board endpoints - if you can find every tenant |
| Workday | Medium | Structured JSON feed, but Akamai bot management and 10k result caps |
| Indeed | High | JS-rendered, aggressive bot detection, official API deprecated |
| Glassdoor | High | Heavy rendering, strict TOS, login walls |
| LinkedIn | Very high | Bot-managed, account bans, the hardest target on the list |
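
To show what "Low" looks like in practice, here is a hedged sketch of reading one Greenhouse tenant. It assumes the URL pattern and response shape described in Greenhouse's public Job Board API documentation (a top-level jobs array with id, title, location.name, absolute_url, and updated_at); verify both against the current docs before relying on them. The board token examplecorp and the sample payload are invented, and the output schema is our own illustration.

```python
import json
import urllib.request

# Assumed URL pattern for Greenhouse's public job board endpoint.
GREENHOUSE_JOBS_URL = "https://boards-api.greenhouse.io/v1/boards/{token}/jobs"


def parse_greenhouse(payload: dict, token: str) -> list[dict]:
    """Map a Greenhouse board response onto a small illustrative schema."""
    rows = []
    for job in payload.get("jobs", []):
        rows.append({
            "source": "greenhouse",
            "tenant": token,
            "id": str(job.get("id")),
            "title": job.get("title"),
            "location": (job.get("location") or {}).get("name"),
            "apply_url": job.get("absolute_url"),
            "updated_at": job.get("updated_at"),
        })
    return rows


def fetch_greenhouse(token: str, timeout: float = 10.0) -> list[dict]:
    """Fetch and parse one tenant's board. Not called here, since it needs network access."""
    url = GREENHOUSE_JOBS_URL.format(token=token)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_greenhouse(json.load(resp), token)


# Invented payload in the assumed shape, so the parser can be exercised offline.
sample = {
    "jobs": [
        {
            "id": 101,
            "title": "Software Engineer",
            "location": {"name": "Remote"},
            "absolute_url": "https://boards.greenhouse.io/examplecorp/jobs/101",
            "updated_at": "2026-06-01T09:00:00-04:00",
        }
    ]
}
rows = parse_greenhouse(sample, "examplecorp")
print(rows[0]["title"], rows[0]["location"])
```

The parser is the easy part; the caveat in the table still applies, because nothing here tells you which board tokens exist.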

We’ve written up the worst offenders in detail: Indeed, Glassdoor, and Workday.

The build-vs-buy math

The question is never “can we build a scraper” - you can. It is whether the data is core enough to justify a permanent team owning the proxy bill, the anti-bot firefights, and the pager. For one or two easy-mode ATS boards, build it; the public JSON endpoints are stable. The moment you need Indeed, Glassdoor, or LinkedIn at scale, or you need every tenant rather than a handful, the maintenance curve goes vertical and a managed API is almost always cheaper than the loaded cost of an engineer maintaining it.
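
One way to make that comparison concrete is to put your own numbers into a small model. The sketch below is only the arithmetic: every figure in it is a placeholder, not a vendor quote or a measurement, and the cost categories (engineer time, proxy traffic, other infrastructure) are a simplification of the list above.

```python
from dataclasses import dataclass


@dataclass
class BuildCosts:
    engineer_fte: float             # fraction of an engineer spent on upkeep
    loaded_engineer_monthly: float  # fully loaded monthly cost of that engineer
    proxy_gb: float                 # residential proxy traffic per month, in GB
    proxy_price_per_gb: float       # proxy price per GB
    infra_monthly: float            # headless browsers, storage, monitoring

    def monthly_total(self) -> float:
        """Standing monthly bill for running the scraper in-house."""
        return (
            self.engineer_fte * self.loaded_engineer_monthly
            + self.proxy_gb * self.proxy_price_per_gb
            + self.infra_monthly
        )


def cheaper_option(build: BuildCosts, api_monthly: float) -> str:
    """Return 'build' if the in-house total is lower than the API plan, else 'buy'."""
    return "build" if build.monthly_total() < api_monthly else "buy"


# Placeholder inputs only; substitute your own figures.
two_easy_boards = BuildCosts(0.0625, 16000, 0, 0, 50)
hard_sources_at_scale = BuildCosts(0.5, 16000, 200, 8, 500)
api_plan_monthly = 2000

print(two_easy_boards.monthly_total(), cheaper_option(two_easy_boards, api_plan_monthly))
print(hard_sources_at_scale.monthly_total(), cheaper_option(hard_sources_at_scale, api_plan_monthly))
```

In most such models the engineer term dominates, which is why the answer tends to flip as soon as the source list includes anything beyond easy-mode ATS boards.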

When building your own is the right call

  • You need one or two known boards with public JSON, and nothing else.
  • Scraping is your product, and the moat is worth a dedicated team.
  • You have a hard requirement that no vendor can meet (a private source, a bespoke field).

For everything else - sourcing tools, market dashboards, candidate matching, compensation benchmarks - the data is an input, not the product, and you want it to just arrive.

The API alternative

JobsPipe runs the scraping infrastructure once, for everyone: cross-source discovery, the proxy pools, the anti-bot handling, the normalization, and the dedup. You get one authenticated endpoint that returns the same JSON schema from 30+ ATS and job-board sources.

curl https://api.jobspipe.dev/v1/jobs/search \
  -H "Authorization: Bearer jp_live_your_key_here" \
  -H "Content-Type: application/json" \
  -d '{
    "job_title_or": ["software engineer"],
    "remote": true,
    "posted_at_max_age_days": 7,
    "limit": 50
  }'

Same normalized record from every source - title, company, parsed compensation, resolved location, posted_at, and an apply_url back at the original listing. No proxies, no headless fleet, no pager.
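
The same request from Python, as a rough sketch using only the standard library. The endpoint, headers, and body fields are copied from the curl example above; the helper names and the JOBSPIPE_API_KEY environment variable are assumptions for illustration, and the response is returned as parsed JSON without assuming anything about its shape.

```python
import json
import os
import urllib.request

SEARCH_URL = "https://api.jobspipe.dev/v1/jobs/search"


def build_search_request(api_key: str, titles: list[str], remote: bool = True,
                         max_age_days: int = 7, limit: int = 50) -> urllib.request.Request:
    """Build the POST request with the same body fields as the curl example."""
    body = {
        "job_title_or": titles,
        "remote": remote,
        "posted_at_max_age_days": max_age_days,
        "limit": limit,
    }
    return urllib.request.Request(
        SEARCH_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def search_jobs(titles: list[str], timeout: float = 30.0):
    """Send the request and return the parsed JSON response as-is."""
    # JOBSPIPE_API_KEY is a hypothetical variable name chosen for this sketch.
    req = build_search_request(os.environ["JOBSPIPE_API_KEY"], titles)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


# Building the request is side-effect free; nothing is sent until search_jobs runs.
req = build_search_request("jp_live_your_key_here", ["software engineer"])
print(req.get_method(), req.full_url)
```

Keeping request construction separate from sending makes the call easy to unit test without a key or network access.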

Skip the scraper - 30+ sources, one API, 5,000 requests/month free.
