How to build a job aggregator in 2026.
A complete technical guide. We’ll cover what a job aggregator actually is, the four problems most people underestimate, the architecture you’ll need, where to get the jobs, how to normalize and deduplicate them, and how to decide whether to build the data layer yourself or rent it from an API.
On this page
- What is a job aggregator?
- Job aggregator vs job board vs ATS
- The 4 hard problems most people underestimate
- Architecture: the 4 layers you have to build
- Where to get the jobs (the data-supply decision)
- Normalization: taming 30+ schemas
- Deduplication: the same job on five sites
- Freshness, retention, and dead jobs
- Search and serving
- Monetization models
- Niche vs broad: picking your wedge
- Build vs buy
- FAQ
What is a job aggregator?
A job aggregator is a service that collects job postings from many different sources, deduplicates them, normalizes the data into one consistent shape, and exposes the result either as a website or as an API. Indeed, Google Jobs, ZipRecruiter, Glassdoor, SimplyHired, and Adzuna are the consumer-facing examples most people know. JobsPipe, Coresignal, JSearch, and TheirStack are the API-facing examples developers tend to encounter.
The defining feature is multi-source ingestion. A job board hosts postings that employers paid to publish on that board. An aggregator pulls postings other people published elsewhere and republishes them in one place. The line gets blurry - Indeed is technically a hybrid because it accepts direct paid postings alongside aggregated content - but the aggregation is what makes the product valuable.
If you’re here because you want to build a job-aggregator website, an aggregator API, a niche job board with broader coverage than the niche alone provides, or a candidate-matching product that needs comprehensive job data underneath it, this guide is for you.
Job aggregator vs job board vs ATS
Three terms get mixed up. Getting them straight saves arguments later when you’re explaining the product to investors, partners, or your team.
- Applicant Tracking System (ATS). Software that one company uses to manage its own hiring pipeline: Workday, Greenhouse, Lever, Ashby, SmartRecruiters, iCIMS, BambooHR, Workable. The ATS is the system of record for that company’s open roles. Most ATSs publish a public career page where the company’s current openings render.
- Job board. A site that hosts postings employers paid to publish on the board. Stack Overflow Jobs (RIP), Hacker News Who’s Hiring (informal), We Work Remotely, Wellfound (formerly AngelList Talent). Job boards typically don’t crawl anywhere else for content - you pay them, they list you.
- Job aggregator. A site that crawls many ATSs and many job boards and republishes the postings under one search interface. Indeed, Google Jobs, ZipRecruiter, Glassdoor. Aggregators don’t need employers to pay to appear - they pull what’s already public.
Is LinkedIn a job board or an aggregator? Both. LinkedIn accepts direct paid postings (job-board behavior) and also syndicates jobs from corporate ATSs (aggregator behavior). Most of the jobs you see on LinkedIn originated on a Workday, Greenhouse, or Lever-hosted career page first.
Is Indeed a job aggregator? Yes, and also a job board. Indeed’s engine is fundamentally an aggregator - Indeed’s bots find postings on corporate career sites and index them - but Indeed also accepts paid Sponsored Jobs directly from employers. The aggregation is what gave Indeed its content moat.
The 4 hard problems most people underestimate
Job aggregators look easy from the outside - crawl some sites, show the results, ship it. They’re not. Four problems eat 80% of the engineering effort, and they’re all problems you don’t notice on day one and can’t ignore by month three.
1. Data supply at scale
Every ATS has a different way of exposing public postings. Some publish a clean unauthenticated REST endpoint (Greenhouse Job Board API, Lever Postings API). Some require partner-program onboarding (Paylocity, ZipRecruiter Partner API). Some have no public API at all and the only path is parsing the career-site HTML (BambooHR, Paycom, most Workday tenants). The combinatorics add up fast: 30 ATSs × varying API shapes × thousands of tenants per ATS × occasional vendor updates = a permanent maintenance load.
2. Deduplication
The same job posting appears on the company’s Workday page, on LinkedIn (via syndication), on Indeed (via Indeed’s scrapers), on Glassdoor (because Glassdoor pulls from Indeed), and possibly on a niche board the recruiter posted to manually. The title is spelled four different ways. The location is “San Francisco, CA” in one place, “San Francisco Bay Area” in another, “Remote - US” in a third. Naive dedup by URL gives you five copies. Naive dedup by title-plus-company gives you false positives (two open senior-engineer roles at the same company are two real, distinct postings). We’ll come back to this.
3. Freshness
Job postings have a half-life of roughly 30 days, but the most valuable signal is the first 48 hours. A user who searches for “senior staff engineer, Bay Area, remote-friendly” and clicks an apply link that 404s because the job filled yesterday is a user you don’t get back. The cost of crawling everything every hour grows linearly with coverage, and users’ freshness expectations don’t relax as your source count grows.
4. Schema normalization
One ATS returns salary as {compensation: {amount, currency}}, another as a free-text string in the description, another as {minSalary, maxSalary, salaryInterval}, and most don’t expose salary at all. Location is similar: some ATSs return structured city/state/country, some return an opaque office-name string (“HQ - 2nd floor”), some return a comma-separated list when the role can be in multiple places. If you want users to filter by salary range or remote eligibility, you have to extract structured data from everything.
Architecture: the 4 layers you have to build
Every working job aggregator has roughly the same shape, even if the marketing pages don’t describe it that way. There are four layers, and the only question is whether you build each one yourself or buy it.
Layer 1 - Ingest
A pool of workers that pull from sources on a schedule. Each worker knows how to talk to one source type: a Workday worker that hits internal myworkdayjobs.com JSON endpoints, a Greenhouse worker that walks the Job Board API, a Lever worker that hits api.lever.co/v0/postings, an Indeed worker that respects Indeed’s robots.txt and rate limits. The ingest layer pushes raw JSON into a queue downstream.
Layer 2 - Normalize
A transformer that takes raw per-source JSON and emits records conforming to one schema. Salary parsing (regex on the description if no structured field exists), location geocoding, employment-type mapping (“FTE” → full_time, “Permanent” → full_time), remote-eligibility detection, and any description cleanup happen here.
Layer 3 - Store + dedup
A datastore that holds normalized postings, with a deduplication layer that merges duplicates into a single record with multiple sources[] and apply_urls. Postgres works fine for the canonical store; you’ll want a search index (Elasticsearch, OpenSearch, Meilisearch, or Postgres full-text) for query-time filtering. The dedup logic is its own service - we cover it below.
Layer 4 - Serve
A read API that powers your website, your customers’ integrations, or both. Filtering by location, salary, employment type, remote, posted date, source. Pagination. Rate limiting if external. Authentication if you’re selling API access.
Each layer is a separately scaling, separately failing component. The most common architecture mistake is to put ingest, normalize, and dedup in one monolithic job - when the Workday integration breaks, the whole pipeline halts. Decouple with queues from the start.
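A minimal sketch of that decoupling, assuming in-process queues stand in for a real message broker. The stage names, the run_stage helper, and the dead-letter list are illustrative, not part of any particular framework. The point it shows: a record that fails to normalize is parked, and the rest of the pipeline keeps moving.

```python
import queue
import threading


def run_stage(name, inbox, outbox, handler, dead_letters):
    """Consume items from inbox, apply handler, forward results to outbox."""
    while True:
        item = inbox.get()
        if item is None:  # sentinel: shut down and tell the next stage
            if outbox is not None:
                outbox.put(None)
            break
        try:
            result = handler(item)
            if outbox is not None:
                outbox.put(result)
        except Exception as exc:
            # A bad record is parked for inspection instead of halting the stage.
            dead_letters.append((name, item, repr(exc)))


def normalize(raw):
    """Toy normalizer: fails on records that lack a title."""
    if "title" not in raw:
        raise ValueError("missing title")
    return {"source": raw["source"], "title": raw["title"].strip()}


def run_pipeline(raw_items):
    raw_q, norm_q = queue.Queue(), queue.Queue()
    stored, dead_letters = [], []
    stages = [
        threading.Thread(
            target=run_stage,
            args=("normalize", raw_q, norm_q, normalize, dead_letters),
        ),
        threading.Thread(
            target=run_stage,
            args=("store", norm_q, None, stored.append, dead_letters),
        ),
    ]
    for stage in stages:
        stage.start()
    for item in raw_items:
        raw_q.put(item)
    raw_q.put(None)
    for stage in stages:
        stage.join()
    return stored, dead_letters
```

In production each stage would be its own deployable worker reading from a durable queue, so a broken source integration only backs up its own inbox.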
Where to get the jobs (the data-supply decision)
This is the most important decision you’ll make. It determines your competitive moat, your monthly infra bill, your engineering burn rate, and your legal exposure. The realistic paths split three ways: lean on per-ATS public APIs, build your own scraping fleet, or pay an aggregator for the data layer.
Per-ATS public APIs
Several major ATSs expose a public, unauthenticated endpoint that returns one tenant’s open postings. The endpoints exist; the catch is they don’t list tenants for you, so you have to discover slugs separately.
- Greenhouse Job Board API. boards-api.greenhouse.io/v1/boards/<company>/jobs - public, no auth. See our Greenhouse source page for what JobsPipe layers on top.
- Lever Postings API. api.lever.co/v0/postings/<company> - public, no auth. See Lever.
- Workable Jobs Widget. Per-tenant career endpoint at apply.workable.com/api/v1/widget/accounts/<slug>.
- Ashby. Public board JSON behind jobs.ashbyhq.com.
Many other ATSs (Workday, BambooHR, Paylocity, Paycom, Jobvite, Taleo, SuccessFactors, Zoho Recruit) have no comparable public endpoint. For those, you’re either parsing the career-site HTML or finding a partner-API path. We document the specifics for each on our sources index.
This works if you only need a handful of well-known ATSs and you have engineering capacity to maintain per-vendor parsers as they evolve. Budget one engineer-week per source up front, plus ongoing maintenance whenever an ATS changes its HTML or rotates its bot detection.
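As a rough sketch of what one per-ATS worker looks like, here is a Greenhouse fetch-and-flatten using only the standard library. The URL pattern is the one listed above. The response field names (jobs, id, title, location.name, absolute_url, updated_at) are assumptions about the payload shape and should be checked against Greenhouse’s current documentation. The User-Agent string and the output record keys are hypothetical.

```python
import json
import urllib.request

GREENHOUSE_JOBS_URL = "https://boards-api.greenhouse.io/v1/boards/{slug}/jobs"


def board_url(slug):
    """Build the public job-board URL for one tenant slug."""
    return GREENHOUSE_JOBS_URL.format(slug=slug)


def fetch_json(url, timeout=10):
    """GET a URL and decode the JSON body (no retries, no rate limiting)."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "example-aggregator/0.1"}  # hypothetical UA
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def parse_greenhouse(payload, slug):
    """Flatten a board response into minimal raw records.

    Field names are assumed, not guaranteed; missing keys become None.
    """
    return [
        {
            "source": "greenhouse",
            "tenant": slug,
            "source_id": str(job.get("id")),
            "title": job.get("title"),
            "location_raw": (job.get("location") or {}).get("name"),
            "apply_url": job.get("absolute_url"),
            "updated_at": job.get("updated_at"),
        }
        for job in payload.get("jobs", [])
    ]
```

A worker would call parse_greenhouse(fetch_json(board_url(slug)), slug) for each slug it has discovered and push the records onto the ingest queue. Slug discovery is still your problem.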
Roll your own scrapers
Spin up your own crawler. Headless Chromium (Playwright or Puppeteer) for sites that render client-side, plain HTTP + HTML parsing for sites that render server-side. Add a proxy rotation service (Bright Data, Oxylabs, ScraperAPI) so you don’t get IP-banned. Add an HTML-diffing layer so when a career site redesigns, you find out before your data dries up.
The legal layer. Scraping public corporate career sites is generally lower-risk than scraping consumer aggregators (LinkedIn, Indeed) for two reasons: corporate career pages are designed to be indexed by Google and rarely have ToS that prohibit programmatic access, and the data you’re taking is job listings the company actively wants people to see. The hiQ Labs v. LinkedIn cases established that scraping public data probably doesn’t violate the CFAA, but breach-of-contract claims based on ToS can survive independently. This is not legal advice.
Pick this if proprietary data coverage is your moat and you have the engineering and legal appetite to maintain it. Realistic burn: two to four engineers full time once you’re past 5,000 tenants.
Pay an aggregator API
Pay someone else for the data layer and put your engineering time into the product layer. The honest landscape:
- JobsPipe. Every public ATS posting we cover (50+ sources), normalized into one JSON schema, 24-hour freshness floor, 12 months of historical retention, free tier for the first 5,000 requests/month. Designed specifically for the aggregator and candidate-matching use case. Docs.
- Coresignal. Large company-data and job-postings datasets, sold as bulk downloads more than streaming API. Enterprise pricing.
- TheirStack. Company-firmographic data with job postings as a signal. Heavy on the firmographic side.
- RapidAPI marketplace feeds (JSearch, Indeed12, etc.). Variable quality, variable freshness, thin docs. Useful for prototyping.
- Bright Data / Oxylabs job data products. Scraping-as-a-service oriented; you specify URLs and they return parsed JSON. Powerful but priced per request.
This is the right call when the data isn’t your differentiator. Recruiter products competing on AI matching, candidate-CRM features, niche curation, or sourcing workflows almost always pay for the data layer - there’s no customer benefit from your engineers rebuilding the same pipeline a vendor already runs.
Normalization: taming 30+ schemas
Once raw JSON is flowing in, you have to turn it into something queryable. The minimum useful normalized record looks something like this:
{
  "id": "jp_kfd83hsk",
  "source": "workday",
  "tenant": "salesforce",
  "title": "Senior Software Engineer",
  "company": "Salesforce",
  "location": {
    "city": "San Francisco",
    "region": "CA",
    "country": "US",
    "remote": false
  },
  "salary": {
    "min": 165000,
    "max": 220000,
    "currency": "USD",
    "period": "year",
    "source": "explicit"
  },
  "employment_type": "full_time",
  "posted_at": "2026-05-14T09:23:00Z",
  "expires_at": null,
  "apply_url": "https://salesforce.wd1.myworkdayjobs.com/.../job/...",
  "description": "..."
}
Every field is a normalization problem.
Title
Trim, normalize case, strip the seniority noise some companies attach (“Sr. Software Engineer II - L5 - SWE” → “Senior Software Engineer”). Optional: tag seniority and discipline as structured fields for filtering.
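One way to approach the title cleanup, as a sketch under simple assumptions: level codes are appended after a spaced dash or pipe, roman numerals and codes like “L5” are noise, and a small hypothetical abbreviation table covers the common cases. It deliberately leaves casing alone so it doesn’t mangle titles like “iOS Engineer”.

```python
import re

# Hypothetical abbreviation table; extend it from what your sources send.
ABBREVIATIONS = {"sr": "Senior", "jr": "Junior", "mgr": "Manager"}

# Tokens treated as level noise: roman numerals, "L5"-style codes, bare digits.
LEVEL_TOKEN = re.compile(r"^(?:[ivx]+|l\d+|\d+)$", re.IGNORECASE)


def normalize_title(raw):
    """Return a cleaned title with level codes and abbreviations resolved."""
    # Keep only the part before the first spaced " - " or " | " separator.
    head = re.split(r"\s+[-|]\s+", raw.strip())[0]
    words = []
    for token in head.split():
        bare = token.strip(".,").lower()
        if LEVEL_TOKEN.match(bare):
            continue  # drop "II", "L5", "3", ...
        words.append(ABBREVIATIONS.get(bare, token.strip(",")))
    return " ".join(words)
```

If you also want seniority as a structured field, capture the dropped tokens instead of discarding them.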
Location
Geocode every free-text string against a gazetteer. Detect multi-location postings (one record per location, or one record with a locations[] array). Detect remote: regex for “remote”, “work from anywhere”, “WFH”, “virtual”, and explicit country-only locations.
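The remote check can start as small as the sketch below. It applies the phrases listed above to the location string only (running it on titles would flag “Remote Sensing Engineer”), and the country-only set is a tiny illustrative sample, not a real gazetteer.

```python
import re

REMOTE_PATTERN = re.compile(
    r"\b(?:remote|work from anywhere|wfh|virtual)\b", re.IGNORECASE
)

# Illustrative sample only; a real system would use its gazetteer here.
COUNTRY_ONLY = {"united states", "us", "usa", "canada", "united kingdom", "uk"}


def detect_remote(location_text):
    """Guess remote eligibility from a raw location string."""
    text = (location_text or "").strip()
    if REMOTE_PATTERN.search(text):
        return True
    # A bare country with no city is treated as a remote signal.
    return text.lower() in COUNTRY_ONLY
```

Treat the result as a hint, not a fact: phrases like “not remote” will fool a plain regex, so it’s worth storing how the flag was derived.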
Salary
If the ATS exposes structured salary, normalize currency and period (some sources mix annual and hourly within one tenant). If it doesn’t, regex the description for patterns like \$[0-9,]+\s*-\s*\$[0-9,]+ (the dollar signs must be escaped, otherwise the regex engine reads them as end-of-line anchors) and attempt to extract a range. Tag the source as explicit vs parsed so downstream consumers know how much to trust it. Around 20% of postings from Workday and Greenhouse tenants in California and New York expose explicit salary ranges thanks to pay-transparency laws - that’s your highest-quality slice.
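A hedged sketch of the description-parsing fallback. It handles dollar ranges with optional “k” suffixes, and it makes two loud assumptions: a “$” means USD (it could just as well be CAD or AUD), and any range topping out under 1,000 is an hourly rate. The output keys mirror the salary object in the sample record above.

```python
import re

# "$165,000 - $220,000", "$120k-$150k", "$45 to $60"; also accepts an en dash.
SALARY_RANGE = re.compile(
    r"\$\s?([0-9][0-9,]*)(k)?\s*(?:[-\u2013]|to)\s*\$\s?([0-9][0-9,]*)(k)?",
    re.IGNORECASE,
)


def parse_salary(description):
    """Extract the first dollar range from free text, or return None."""
    match = SALARY_RANGE.search(description)
    if not match:
        return None
    low = int(match.group(1).replace(",", "")) * (1000 if match.group(2) else 1)
    high = int(match.group(3).replace(",", "")) * (1000 if match.group(4) else 1)
    if low > high:
        return None  # probably not a salary range at all
    # Assumption: small amounts are hourly rates, everything else is annual.
    period = "hour" if high < 1000 else "year"
    return {
        "min": low,
        "max": high,
        "currency": "USD",  # assumption: "$" is treated as US dollars
        "period": period,
        "source": "parsed",
    }
```

Because the result is tagged "parsed", consumers can still choose to filter on explicit salaries only.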
Employment type
Map every variant the wild gives you (“FTE”, “Full-time Regular”, “Permanent”, “Permanent Full Time”) to a small fixed enum: full_time, part_time, contract, internship, temporary.
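The mapping itself can be a short ordered lookup. This sketch matches whole tokens rather than substrings, so “International” doesn’t get read as “intern”. The keyword sets are illustrative and will need extending as new variants show up.

```python
import re

# Checked in order, most specific first, so "Permanent Part Time" is part_time.
TOKEN_TO_TYPE = [
    ({"intern", "internship"}, "internship"),
    ({"part", "parttime"}, "part_time"),
    ({"contract", "contractor", "freelance"}, "contract"),
    ({"temp", "temporary", "seasonal"}, "temporary"),
    ({"full", "fulltime", "fte", "permanent", "regular"}, "full_time"),
]


def map_employment_type(raw):
    """Map a free-text employment label to the fixed enum, or None if unknown."""
    tokens = set(re.findall(r"[a-z]+", (raw or "").lower()))
    for keywords, value in TOKEN_TO_TYPE:
        if tokens & keywords:
            return value
    return None
```

Returning None for unknown labels, rather than defaulting to full_time, keeps the gaps visible so you can log them and grow the table.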
Description
Some ATSs return HTML, some return Markdown, some return both in different fields. Pick one canonical format (Markdown is friendly to LLM downstream tasks) and convert. Strip ATS tracking pixels and the legally-required EEO boilerplate if you want cleaner display.
Deduplication: the same job on five sites
The fundamental dedup challenge: the same job has different identities on every site that hosts it. A Stripe senior engineering role lives canonically on Stripe’s Greenhouse board, gets syndicated to LinkedIn, gets scraped into Indeed, and appears on Glassdoor because Glassdoor pulls from Indeed. You need to recognize all four as one posting.
The dedup hierarchy that works in production:
- Apply URL canonicalization. If two postings resolve to the same apply URL after following redirects, they’re the same job. This handles LinkedIn-to-Workday syndication trivially whenever LinkedIn’s apply button deep-links to the underlying ATS; postings that use an on-site apply flow fall through to the next tiers.
- Company + title + location + posted-week. Two postings with the same canonicalized company, same fuzzy-matched title (Levenshtein distance under a threshold), same geocoded location, and posted within the same calendar week are very likely the same job.
- Description shingling. Compute MinHash or SimHash signatures of the description text. Postings whose signatures match above a threshold are the same job. This catches cases where the title varies (“Sr Engineer” vs “Senior Software Engineer”) but the body is identical.
- Conservative manual rules. One or two specific companies that post intentionally-similar roles (e.g. Google opens five hundred “Software Engineer” reqs in one batch) need rules to prevent dedup. The minority case where you have to override the algorithm.
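The first three tiers of that hierarchy can be sketched with the standard library alone. Several substitutions are worth naming: the URL step only strips tracking parameters and does not follow redirects, difflib’s similarity ratio stands in for a Levenshtein threshold, and the tracking-parameter list, thresholds, and record keys are all assumptions.

```python
import hashlib
from datetime import date
from difflib import SequenceMatcher
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed tracking parameters; tune this to what your sources actually append.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "gh_src", "ref"}


def canonical_url(url):
    """Tier 1: lowercase host, drop tracking params, fragment, trailing slash."""
    parts = urlsplit(url.strip())
    query = [(k, v) for k, v in parse_qsl(parts.query)
             if k.lower() not in TRACKING_PARAMS]
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path.rstrip("/"), urlencode(sorted(query)), ""))


def same_by_fields(a, b, threshold=0.85):
    """Tier 2: same company, location, ISO week, and a fuzzy title match."""
    if a["company"].lower() != b["company"].lower():
        return False
    if a["location"] != b["location"]:
        return False
    if a["posted"].isocalendar()[:2] != b["posted"].isocalendar()[:2]:
        return False
    ratio = SequenceMatcher(None, a["title"].lower(), b["title"].lower()).ratio()
    return ratio >= threshold


def shingles(text, k=5):
    """Overlapping k-word windows of the description."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def minhash(text, num_hashes=64):
    """Tier 3: MinHash signature, one salted hash function per slot."""
    grams = shingles(text)
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(g.encode(), digest_size=8, salt=salt).digest(), "big")
            for g in grams
        ))
    return signature


def signature_similarity(sig_a, sig_b):
    """Share of matching slots, an estimate of shingle-set Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)
```

Run the tiers in order and stop at the first match. Comparing every pair of signatures doesn’t scale, so at volume you’d bucket signatures (for example with locality-sensitive hashing) before comparing.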
Store the result as one canonical record with an array of sources and an array of apply URLs. Surface the primary source (the ATS that originally posted) to users; the duplicates become useful metadata for “also seen on” affordances.
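Once a group of postings has been judged to be the same job, the merge step might look like this sketch. The list of sources treated as primary and the record keys (source, apply_url) are assumptions taken from the sample record earlier in this guide.

```python
# Assumed set of sources treated as the original publisher.
PRIMARY_SOURCES = ("workday", "greenhouse", "lever", "ashby")


def merge_duplicates(postings, primary_sources=PRIMARY_SOURCES):
    """Collapse postings already matched as one job into a canonical record."""
    # Stable sort: ATS-hosted copies first, otherwise keep arrival order.
    ordered = sorted(
        postings, key=lambda p: 0 if p["source"] in primary_sources else 1
    )
    merged = dict(ordered[0])  # the primary copy supplies the field values
    # dict.fromkeys removes repeats while preserving order.
    merged["sources"] = list(dict.fromkeys(p["source"] for p in ordered))
    merged["apply_urls"] = list(dict.fromkeys(p["apply_url"] for p in ordered))
    return merged
```

Keeping the primary copy’s fields wholesale is the simplest policy. A field-by-field merge (say, taking salary from whichever copy has an explicit range) is a reasonable next step.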
Freshness, retention, and dead jobs
Job postings have a half-life of roughly 30 days, but the most valuable freshness window is the first 48 hours. Engineering roles on hot tech ATSs (Greenhouse, Ashby) routinely close within two weeks because the requisition fills.
A practical crawl-cadence policy:
- 6-hour cadence for high-volume tenants (anyone with 500+ active postings).
- 24-hour cadence for everything else.
- Adaptive same-day re-check for any URL that surfaced a posting yesterday but might have closed today - particularly important for tracking which jobs are still live vs which have been filled.
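That policy fits in a few lines of scheduling logic. In this sketch the 500-posting threshold and the 6-hour and 24-hour intervals come from the list above, while the function names and the calendar-day reading of “yesterday” are assumptions.

```python
from datetime import datetime, timedelta, timezone

HIGH_VOLUME_THRESHOLD = 500


def crawl_interval(active_postings):
    """6 hours for high-volume tenants, 24 hours for everyone else."""
    if active_postings >= HIGH_VOLUME_THRESHOLD:
        return timedelta(hours=6)
    return timedelta(hours=24)


def is_due(last_crawled_at, active_postings, now):
    """True when a tenant's regular crawl interval has elapsed."""
    return now - last_crawled_at >= crawl_interval(active_postings)


def needs_same_day_recheck(first_seen_at, last_checked_at, now):
    """True for postings first surfaced yesterday and not yet re-checked today."""
    yesterday = (now - timedelta(days=1)).date()
    return first_seen_at.date() == yesterday and last_checked_at.date() < now.date()
```

Pass timezone-aware datetimes in a single zone (UTC is the easy choice) so that “yesterday” means the same thing on every worker.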
A posting that disappears from the source between crawls is a signal. Mark it as expired_at = now, but don’t hard-delete it. Retention of historical postings is valuable for labor-market analytics, salary trend analysis, and detecting hiring patterns at specific companies. JobsPipe retains 12 months - most pure aggregators retain only what’s live.
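A sketch of the soft-expire step, assuming postings live in a dict keyed by ID and each crawl yields the set of IDs still present for one tenant. The store layout and function name are hypothetical; expired_at follows the field name used above.

```python
from datetime import datetime, timezone


def mark_expired(stored, tenant, seen_ids, now):
    """Soft-expire one tenant's postings that were absent from the latest crawl.

    Returns the IDs newly marked as expired; nothing is deleted.
    """
    newly_expired = []
    for posting_id, posting in stored.items():
        if posting["tenant"] != tenant:
            continue  # only judge postings from the tenant we just crawled
        if posting_id not in seen_ids and posting.get("expired_at") is None:
            posting["expired_at"] = now
            newly_expired.append(posting_id)
    return newly_expired
```

Only call this after a crawl that actually succeeded. Otherwise a failed fetch with an empty result would expire the tenant’s entire board.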
Search and serving
Once data is normalized, deduped, and stored, you need a way for users to find what they’re looking for. Three search considerations dominate.
Filterable facets
Users filter by location, salary range, employment type, remote, posted date, company, and (for niche aggregators) by tags or categories you assign. Plan your index around the facets - Elasticsearch with aggregations, Meilisearch for the cheap-and-fast path, or Postgres GIN indexes if you want to avoid a second datastore.
Full-text search on title and description
Tokenize, stem, and synonym-expand. “ML engineer” should match “Machine Learning Engineer”. “Front end” should match “frontend”. Maintain a synonym list specific to job-search vocabulary; the default English stemmer in most search engines is not enough.
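Most search engines let you configure synonyms at index or query time. If you’d rather keep the logic in application code, a query-side expansion can be as simple as this sketch. The synonym table is a tiny hypothetical sample.

```python
import re

# Hypothetical sample; a real list is specific to job-search vocabulary.
SYNONYMS = {
    "ml": ["machine learning"],
    "front end": ["frontend"],
}


def expand_query(query):
    """Return the normalized query plus one variant per matching synonym."""
    normalized = re.sub(r"\s+", " ", query.strip().lower())
    variants = [normalized]
    for phrase, alternatives in SYNONYMS.items():
        pattern = rf"\b{re.escape(phrase)}\b"
        if re.search(pattern, normalized):
            for alternative in alternatives:
                variants.append(re.sub(pattern, alternative, normalized))
    return variants
```

The variants are then OR-ed together in the search request, so “ML engineer” also retrieves postings titled “Machine Learning Engineer”.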
Semantic search for matching products
If you’re building a recruiter tool or candidate-matching product (not just a public aggregator), embed both job descriptions and candidate resumes into a vector space and match by cosine similarity. OpenAI text-embedding-3-large, Voyage AI, or open-source bge-large-en are all production-ready as of 2026. Hybrid retrieval (BM25 + vectors, fused with reciprocal rank fusion) typically outperforms either alone, because keyword search catches exact skill names while vectors catch paraphrases.
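Reciprocal rank fusion itself is only a few lines: each result list contributes 1 / (k + rank) to a document’s score, and the fused ranking sorts by the total. The sketch below assumes you already have ranked ID lists from the keyword and vector searches. k = 60 is a commonly used default, not a tuned value.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for deterministic output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
```

Because it works on ranks rather than raw scores, you don’t have to calibrate BM25 scores against cosine similarities.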
Monetization models
Five models pay the bills for job aggregators. Most successful products run two or three concurrently.
1. Sponsored listings
Employers pay to push their postings to the top of relevant searches. Indeed’s Sponsored Jobs is the canonical example, and it’s most of how Indeed makes money. The trade-off: you need real user traffic before sponsorship has value. Don’t bet on this as your day-one revenue.
2. Affiliate / cost-per-click
Earn a commission when a user clicks through to apply. Many ATSs and aggregators run affiliate programs - you embed a referral parameter in the apply URL and earn $0.20-$2.00 per click depending on the source and seniority of the role. Glassdoor and ZipRecruiter built much of their early revenue on this. Best for high-traffic aggregators with millions of monthly clicks.
3. Subscription
Paid access to advanced filters, salary data, instant alerts, or AI matching. Wellfound, Hired, and several niche tech aggregators run subscription models. The advantage: predictable MRR. The challenge: most consumers won’t pay for what they can get free on Indeed.
4. Leads
Sell candidate signal to recruiters. Hired’s original model: candidates list themselves, recruiters pay for warm intros. Wellfound runs something similar. Operationally heavier because you’re running two-sided marketplace mechanics, not just a content site.
5. Direct paid postings
Accept direct postings from employers alongside the aggregated content. Most niche job boards (We Work Remotely is a typical example) run this as their primary model. Easier to start than sponsored listings because employers can buy a posting without traffic-based pricing.
Niche vs broad: picking your wedge
Trying to out-aggregate Indeed is a losing game. They’ve been crawling the public web since 2004, they have brand recognition consumers actually trust, and their content moat is broader than any new entrant can match without massive capital. The path to a sustainable job aggregator in 2026 runs through a wedge.
Niches that have worked recently:
- Remote-only. We Work Remotely, Remote.co, Working Nomads. Filter aggregator content by remote-eligible and curate.
- Tech-only. Hacker News Who’s Hiring, Wellfound, BuiltIn, RemoteOK. Filter to software, ML, design, product.
- Industry-specific. Healthcare (Health eCareers), legal (LawCrossing), academic (HigherEdJobs), government (USAJobs, federal-only).
- Seniority-specific. Executive search aggregators (Ladders for $100k+). Distinct demand profile.
- Geography-specific. Welcome to the Jungle (France-first), Stepstone (Germany), Naukri (India). Localized language, localized salary norms, localized employer awareness.
- Stage-specific. “Jobs at YC-funded startups” (YC Work at a Startup, Wellfound’s startup vertical). The job-aggregator equivalent of a vertical SaaS.
The pattern: pick a slice where you can have an opinion about quality. “Every remote engineering role at a Series-B-or-later startup with explicit salary” is a product. “Every job” is Indeed.
Build vs buy
You’ve seen the layers and the data-supply options. The build-vs-buy call comes down to one question: what is your differentiator?
If your differentiator is broader or fresher data than competitors - because you cover sources others don’t, or because your freshness is measurably better - then build the data layer yourself. Your engineering effort goes into ingest, normalization, and deduplication, because that’s what customers are buying.
If your differentiator is everything that sits on top of the data - better search, better curation, better matching, recruiter tools, candidate experience, AI features - then buy the data layer. Use an aggregator API for ingest and put your engineers on the product. Most successful niche aggregators of the past five years went this route. Re-building a Workday parser is neither a moat nor a story you can tell investors.
The data layer, already built. Plug us in and ship the product on top.
- 50+ sources covered
- 24h freshness floor
- 12 mo history retained
- 5,000 req/mo free
FAQ
What is a job aggregator?
A job aggregator is a service that collects job postings from many different sources - corporate ATSs, public job boards, niche sites - deduplicates them, normalizes the data into one consistent shape, and exposes the result either as a website (Indeed, Google Jobs, ZipRecruiter) or as an API. The defining feature is multi-source ingestion: a single job board hosts postings paid for by employers; an aggregator pulls postings other people published and republishes them.
Is a job aggregator the same as a job board?
No. A job board hosts postings that employers paid to publish on that board. An aggregator pulls postings from other places (ATSs, other boards, company career sites) and republishes them. Indeed started as an aggregator and became a hybrid: it both aggregates and sells direct postings. LinkedIn is a hybrid too. Lever and Greenhouse are pure ATSs - they host postings for one company at a time and are not aggregators.
Is Indeed a job aggregator?
Indeed is a hybrid. It aggregates postings from corporate career sites and ATSs, and it also accepts direct paid postings from employers (Sponsored Jobs). Most of the content on Indeed originated elsewhere - Indeed's bots discovered it on a Workday career site, an Ashby board, or similar - and Indeed re-hosts it with an Apply button that often deep-links back to the original ATS.
Is it legal to scrape job postings?
It depends on where you scrape, what data you take, and what the source's Terms of Service say. The hiQ Labs v. LinkedIn litigation (Ninth Circuit 2019; vacated and remanded by the Supreme Court in 2021; reaffirmed by the Ninth Circuit on remand in 2022) clarified that scraping public data probably does not violate the Computer Fraud and Abuse Act, but ToS-based contract claims survive. Scraping public-facing corporate career sites (Workday, Greenhouse, Lever) is generally lower-risk than scraping consumer-facing aggregators (LinkedIn, Indeed) because corporate career sites are designed to be indexed by search engines and rarely have ToS that prohibit programmatic access. This is not legal advice - get a lawyer if you're operating at scale.
How fresh does aggregator data need to be?
Job postings have a half-life of roughly 30 days, but the most valuable signal is the first 48 hours. Engineering postings on hot ATSs (Greenhouse, Ashby) often fill within two weeks. A practical refresh budget: 6-hour cadence for high-volume sources (Workday tenants with 500+ active postings), 24-hour cadence for everything else, and a same-day check on any URL that returned a posting yesterday.
What's the hardest part of building a job aggregator?
Deduplication and freshness, in that order. Deduplication is hard because the same job appears on the company's Workday site, on LinkedIn (via syndication), on Indeed (via scraping), and possibly on a niche board the recruiter posted to manually - and the title, company name, and location are spelled slightly differently every time. Freshness is hard because the cost of crawling everything every hour grows linearly with coverage, and your users will notice within minutes when a job they're interested in turns out to be already closed.
Can I build a job aggregator without writing scrapers?
Yes, but only if you accept dependency on someone else's data supply. Options include aggregator APIs (JobsPipe, ZipRecruiter Publisher Feed, Bright Data, RapidAPI feeds), per-ATS public endpoints (Greenhouse Job Board API, Lever Postings API), and partner-feed programs (Indeed, Google Jobs). The trade-off is control: you can build faster but you depend on the API provider's coverage, schema, and uptime.
How do job aggregators make money?
Five common models: (1) Sponsored listings - employers pay to boost their postings to the top, (2) Affiliate / cost-per-click - earn a commission when a user clicks through to apply, (3) Subscription - paid access to advanced filters, alerts, or salary data, (4) Leads - sell candidate signal to recruiters, (5) Paid postings - accept direct submissions alongside aggregated content. Indie aggregators typically start with #5 (paid postings) or #3 (subscription), because sponsored listings only pay off once you have real user traffic.
Should I build my own job aggregator or use an API?
If your differentiator is the data supply (broader coverage than competitors, fresher data, unique sources), build it yourself. If your differentiator is on top of the data (better search, niche curation, AI matching, recruiter tools), use an API and put your engineering time into the differentiator. Most successful niche job boards (Hacker News Who's Hiring, Wellfound, We Work Remotely) compete on curation, not on data-supply engineering.
Skip the data-layer build. Start with JobsPipe’s free tier - 5,000 requests/month, every source we cover, no credit card.