
Job posting deduplication

Definition

Job posting deduplication is the process of detecting that several collected job records describe the same real-world opening and collapsing them into one, so an aggregated index does not count the same job many times.

Also called: job dedup, duplicate job detection, job posting dedup.

Key points

One real job appears across many sources and over time, so an aggregator collects many records per opening.
Exact field matching barely works - the same job differs in title, location, description, and apply URL on every source.
Effective deduplication is fuzzy matching across company, title, location, and description, with a tuned similarity threshold.
Duplicate rate is a core quality metric - duplicates inflate coverage, waste spend, and corrupt analytics.

Why duplicates happen

A single real job - one opening at one company - shows up in many places. It is listed on the company's own careers page, hosted by its applicant tracking system (ATS); it has been syndicated to Indeed and LinkedIn; a recruiter may have reposted it; and a niche board may carry it too. An aggregator that collects from all of those sources ends up with five records for one job. Without deduplication, the index looks five times bigger than it is and search results are full of repeats.

Duplicates also appear over time. The same posting gets recrawled, lightly edited by the employer, or reposted with a new identifier after it has been open for a while. Deduplication has to handle both the same-job-many-sources case and the same-job-over-time case.

Why exact matching is not enough

The naive approach - treat two records as the same job if their fields are identical - barely works. The same job has a slightly different title on each source ('Senior Engineer' versus 'Senior Software Engineer'), a location written three ways, a description that one source truncated, and a different apply URL everywhere. Exact matching catches almost nothing.

Real deduplication is fuzzy matching: normalize the fields, then decide whether two records are the same job based on a similarity judgment across company, title, location, and description. That means choosing which fields are strong signals, how much fuzziness to tolerate, and where to set the threshold - too strict and duplicates leak through, too loose and genuinely different jobs get merged.
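The normalize-then-compare idea can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a production matcher: the field weights, the 0.8 threshold, and all function names are hypothetical, and the standard library's difflib.SequenceMatcher stands in for whatever similarity measure a real system would tune.

```python
import re
from difflib import SequenceMatcher

# Hypothetical weights and threshold; real values would be tuned on labeled pairs.
WEIGHTS = {"company": 0.4, "title": 0.3, "location": 0.2, "description": 0.1}
THRESHOLD = 0.8


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return " ".join(text.split())


def field_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] between two normalized strings."""
    na, nb = normalize(a), normalize(b)
    if not na or not nb:
        return 0.0  # a missing field gives no evidence of a match
    return SequenceMatcher(None, na, nb).ratio()


def job_similarity(rec_a: dict, rec_b: dict) -> float:
    """Weighted sum of per-field similarities (weights sum to 1)."""
    return sum(
        weight * field_similarity(rec_a.get(field, ""), rec_b.get(field, ""))
        for field, weight in WEIGHTS.items()
    )


def is_duplicate(rec_a: dict, rec_b: dict, threshold: float = THRESHOLD) -> bool:
    """Treat two records as the same job when the score reaches the threshold."""
    return job_similarity(rec_a, rec_b) >= threshold
```

With these assumed weights, two records that agree on company, location, and description but differ as 'Senior Engineer' versus 'Senior Software Engineer' still score above the threshold, while a different role at the same company does not. Raising the threshold makes the matcher stricter; lowering it makes it looser.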

Why deduplication is core to jobs data

Duplicate rate is one of the clearest quality signals for a jobs dataset. A high duplicate rate inflates coverage numbers, wastes whatever the consumer pays per record, and corrupts any analytics built on the data - hiring-volume trends, for instance, become meaningless if one job counts as five. For matching and search products, duplicates directly degrade the user experience.
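The text does not fix a formula for duplicate rate, so the sketch below assumes one common definition: the share of records that are redundant copies of a job already present. The function names and the idea of a per-record job identifier (assigned by a deduplication step or by hand-labeling a sample) are assumptions made for illustration.

```python
from __future__ import annotations


def duplicate_rate(job_ids: list[str]) -> float:
    """Fraction of records that are redundant copies of another record.

    job_ids holds one entry per collected record, naming the real-world
    job that record was resolved to.
    """
    if not job_ids:
        return 0.0
    return 1 - len(set(job_ids)) / len(job_ids)


def inflation_factor(job_ids: list[str]) -> float:
    """How many times larger the raw index looks than the set of real jobs."""
    if not job_ids:
        return 1.0
    return len(job_ids) / len(set(job_ids))
```

For example, if one job was collected five times alongside five jobs collected once each, the input has ten records for six real jobs, so the assumed duplicate rate is 0.4.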

Deduplication is also genuinely hard at scale, which is part of why a jobs-data API is worth paying for. Deduplicating millions of postings refreshed daily, across sources that each describe jobs differently, is a standing engineering problem - not something solved once and forgotten.

FAQ

Why can't I deduplicate jobs by apply URL?

Because the apply URL is usually different on every source. The company's own careers page, Indeed, and LinkedIn each link to their own version of the apply flow. The apply URL is a useful weak signal when two records do share it, but most duplicates of the same job have entirely different URLs, so it cannot be the primary key.
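One way to treat the apply URL as a weak signal is to canonicalize it and add a small bonus to the match score only when two records share it. The sketch below uses Python's urllib.parse; the list of tracking parameters, the bonus size, and the function names are assumptions for illustration, and the URLs in any usage are placeholders.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical set of query parameters that only track the traffic source.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "ref", "src"}


def canonical_url(url: str) -> str:
    """Lowercase scheme and host, drop tracking params, fragment, and trailing slash."""
    parts = urlsplit(url.strip())
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in TRACKING_PARAMS
    ]
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path.rstrip("/"),
        urlencode(sorted(query)),
        "",  # discard the fragment
    ))


def url_bonus(url_a: str, url_b: str, bonus: float = 0.1) -> float:
    """Return a small score bonus when both URLs exist and canonicalize identically."""
    if not url_a or not url_b:
        return 0.0
    return bonus if canonical_url(url_a) == canonical_url(url_b) else 0.0
```

A shared URL only nudges the score upward; a differing URL contributes nothing rather than counting against the match, since most true duplicates have different URLs.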

What fields are best for detecting duplicate job postings?

Company is the strongest anchor - normalize it first. Then title and location, both fuzzy-matched because each source phrases them differently. Description similarity helps confirm a match, and posted date narrows the time window. No single field is decisive; deduplication combines several into a similarity judgment.
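Using company as the anchor and posted date as a window can be sketched as a candidate-generation step: group records by normalized company, then only compare records posted within some number of days of each other. The legal-suffix list, the 30-day window, and the record keys ("id", "company", "posted") are assumptions for this example, not a known schema.

```python
import re
from collections import defaultdict
from datetime import date
from itertools import combinations

# Hypothetical list of legal suffixes to strip from company names.
LEGAL_SUFFIXES = {"inc", "llc", "ltd", "corp", "corporation", "gmbh", "co"}


def normalize_company(name: str) -> str:
    """Lowercase, drop punctuation, and strip trailing legal suffixes."""
    tokens = re.sub(r"[^a-z0-9 ]+", " ", name.lower()).split()
    # Keep at least one token so a name like "Co" is not reduced to nothing.
    while len(tokens) > 1 and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def candidate_pairs(records, max_days_apart: int = 30):
    """Yield (id, id) pairs that share a company and fall inside the date window."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[normalize_company(rec["company"])].append(rec)
    for block in blocks.values():
        for a, b in combinations(block, 2):
            if abs((a["posted"] - b["posted"]).days) <= max_days_apart:
                yield a["id"], b["id"]
```

Only the pairs this step yields would then go on to fuzzy title, location, and description comparison, which keeps the number of comparisons far below all-pairs.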

Is some duplication unavoidable?

Yes. Deduplication is a precision-versus-recall trade-off. Tune it strict and some duplicates survive (lower recall); tune it loose and you risk merging two genuinely distinct openings (lower precision), which is worse. A small residual duplicate rate is normal and usually preferable to aggressive merging. The goal is a duplicate rate that is low and measured, not zero.
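The trade-off can be measured on a hand-labeled sample of record pairs. The sketch below assumes each pair carries a similarity score and a true/false label for whether it really is the same job; the function name and the sample values used with it are made up for illustration.

```python
def precision_recall(scored_pairs, threshold: float):
    """Compute (precision, recall) of 'score >= threshold means duplicate'.

    scored_pairs is an iterable of (similarity_score, is_true_duplicate).
    """
    tp = fp = fn = 0
    for score, is_dup in scored_pairs:
        predicted = score >= threshold
        if predicted and is_dup:
            tp += 1  # correctly merged
        elif predicted and not is_dup:
            fp += 1  # distinct jobs wrongly merged
        elif not predicted and is_dup:
            fn += 1  # duplicate that leaked through
    # With no predictions (or no true duplicates) the ratio is undefined;
    # this sketch reports 1.0 in that case.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


# Made-up labeled sample: (similarity score, really the same job?)
SAMPLE = [(0.95, True), (0.88, True), (0.82, False),
          (0.75, True), (0.60, False), (0.40, False)]
```

On this sample, a strict threshold of 0.9 merges nothing wrongly but misses two of the three true duplicates, while a loose threshold of 0.7 catches all three at the cost of one wrong merge. Sweeping the threshold over such a sample is one way to pick an operating point deliberately.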

JobsPipe is the jobs-data API behind this glossary - 30+ sources, one schema, free tier included.

Job aggregator

Job posting schema

Job board API