How to Deduplicate Postal Codes and ZIP Codes Without Losing Real Duplicates

Case and spacing variants of UK and Canadian postcodes survive a plain dedup. Here is exactly what the Postal Code Deduplicator folds, what it keeps separate, and why.

Published By Li Lei
#postal codes #deduplication #data cleanup #operations

A plain text dedup treats every line as a string. That works fine for US ZIP codes, where 94105 is 94105 no matter who typed it. It falls apart the moment a UK or Canadian address enters the list, because those codes carry case and an internal space that people format three different ways.

Here is the case that keeps a delivery list dirty: sw1a1aa, SW1A1AA, and SW1A 1AA are one UK postcode written three ways. A plain dedup keeps all three, because as raw strings they are not equal. Meaningful postal-code dedup has to normalize each value to one canonical form (uppercase, consistent spacing) first, then compare. This post walks through how the Postal Code Deduplicator handles that, where it succeeds, and one honest limit you should know before you trust the count.

Why a plain dedup overcounts your distinct zones

When operations teams merge two order exports to count how many delivery zones they actually serve, they paste both lists into one box and run a uniqueness pass. If that pass is a naive string comparison, three problems show up immediately.

The first is case. CRM A stores M5V 3L9. A scraped page gives you m5v 3l9. Those are the same Toronto postcode, but a case-sensitive compare sees two zones.

The second is spacing. A spreadsheet export might keep the canonical space (SW1A 1AA), a form submission might strip it (SW1A1AA), and a copy-paste from a PDF might glue in a double space. Same code, three byte sequences.

The third is the surrounding junk: quotes, trailing commas, a stray > from copied HTML. All of it makes a real duplicate look unique to a string comparison.

Counting distinct zones is the whole point of the exercise. If the dedup over-counts, you over-estimate coverage and ship a wrong number up the chain.
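
To make the over-count concrete, here is a minimal Python sketch of a naive uniqueness pass. The sample values are made up for illustration and are not taken from any real export.

```python
# Raw values as they might arrive from two merged exports (illustrative data).
raw = ["M5V 3L9", "m5v 3l9", "SW1A 1AA", "SW1A  1AA", '"94105",', "94105"]

# A plain dedup compares exact strings, so every variant counts as its own zone.
naive_distinct = set(raw)

print(len(naive_distinct))  # 6, although only three real codes are present
```

Case, a doubled space, and wrapping punctuation each turn one real code into two distinct strings, so the reported zone count doubles.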

What this tool actually normalizes before comparing

I checked the tool's normalization against the source so I could write to its real behavior rather than guess. For the postal-code profile it does two things to each value before it becomes a dedup key:

  1. It strips outer wrapping characters (quotes, brackets, trailing punctuation).
  2. It uppercases the value and collapses any run of whitespace down to a single space.

The comparison key is then lowercased by default, so matching ignores case no matter how the value is displayed.

So to answer the question directly: yes, this tool uppercases, and yes, it collapses repeated whitespace. That means sw1a1aa and SW1A1AA fold into one row. A double-spaced SW1A  1AA folds with a single-spaced SW1A 1AA. Case never blocks a match.
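
The behavior described above can be approximated in a few lines of Python. This is a sketch under stated assumptions, not the tool's source: the function names and the exact set of wrapper characters in STRIP_CHARS are hypothetical.

```python
import re

# Assumed set of outer wrapping characters, plus whitespace (hypothetical list).
STRIP_CHARS = "\"'()[]{}<>,;. \t\r\n"

def normalize(value: str) -> str:
    """Strip outer wrappers, uppercase, collapse whitespace runs to one space."""
    stripped = value.strip(STRIP_CHARS)
    return re.sub(r"\s+", " ", stripped.upper())

def dedup_key(value: str) -> str:
    """Case-insensitive comparison key built from the normalized value."""
    return normalize(value).lower()

print(normalize('"sw1a  1aa",'))                      # SW1A 1AA
print(dedup_key("sw1a1aa") == dedup_key("SW1A1AA"))   # True
print(dedup_key("SW1A1AA") == dedup_key("SW1A 1AA"))  # False
```

Internal whitespace is collapsed but never removed, which is why the last comparison is False.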

But there is a limit worth stating plainly: the normalizer collapses extra spaces, but it does not remove the single canonical space. So SW1A1AA (no space) and SW1A 1AA (one space) produce different keys and stay as two separate rows. If your sources mix spaced and unspaced UK or Canadian codes, run them through the Postal Code Normalizer first to force one consistent spacing convention, then dedupe. That is the one gap that a casual user will trip over, so I am calling it out rather than pretending the dedup is perfect on raw mixed input.
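
If you want to script that pre-pass yourself, one possible approach is sketched below. It is not the Postal Code Normalizer's implementation; force_canonical_space is a hypothetical helper. It relies on the fact that the UK inward code and the second half of a Canadian code are both three characters long.

```python
import re

def force_canonical_space(code: str) -> str:
    """Remove all whitespace, then put one space before the last three characters.

    Values without letters (for example US ZIP codes) are returned unsplit,
    and so is anything too short to be a UK or Canadian code.
    """
    compact = re.sub(r"\s+", "", code).upper()
    has_letter = any(ch.isalpha() for ch in compact)
    if not has_letter or len(compact) < 5:
        return compact
    return f"{compact[:-3]} {compact[-3:]}"

print(force_canonical_space("sw1a1aa"))    # SW1A 1AA
print(force_canonical_space("SW1A  1AA"))  # SW1A 1AA
print(force_canonical_space("m5v3l9"))     # M5V 3L9
print(force_canonical_space("94105"))      # 94105
```

After this pass the spaced and unspaced spellings share one form, so the dedup can fold them.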

A worked example: three spellings, two results

Paste this into the tool:

sw1a1aa
SW1A1AA
SW1A 1AA
m5v 3l9
M5V 3L9
94105
94105

With deduplication on, the normalized output collapses to:

SW1A1AA   (count 2)
SW1A 1AA  (count 1)
M5V 3L9   (count 2)
94105     (count 2)

Read that carefully. The two no-space SW1A1AA lines merged, the two M5V 3L9 lines merged across the case difference, and both 94105 lines merged. Four distinct rows out of seven input lines. The one row that did not merge with its siblings is SW1A 1AA, because its single canonical space survives normalization while the others have no space at all. That is the spacing limit made concrete: case folds, multiple spaces fold, but a present-versus-absent canonical space stays distinct.

The count column is the part operations people care about. It tells you SW1A1AA showed up twice, which preserves the evidence that a duplicate existed instead of silently throwing it away.
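
The same seven lines can be replayed in Python to check the counts. This sketch keys on the uppercased, whitespace-collapsed value; that is an assumption that is equivalent to a lowercased key for matching purposes, not a copy of the tool's code.

```python
import re
from collections import Counter

lines = ["sw1a1aa", "SW1A1AA", "SW1A 1AA", "m5v 3l9", "M5V 3L9", "94105", "94105"]

def normalize(value: str) -> str:
    # Uppercase and collapse whitespace runs; a single internal space survives.
    return re.sub(r"\s+", " ", value.strip().upper())

counts = Counter(normalize(line) for line in lines)

for code, n in counts.items():
    print(f"{code}  (count {n})")
```

Counter keeps first-seen order, so the rows print in the same order as the output above: four rows from seven lines.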

Keep invalid rows so your zone count stays honest

A code like 9410 is one digit short of a US ZIP. It is not safe to fold it under a lookalike, because you cannot prove it was meant to be 94105 rather than 94100. The tool sets such rows aside with a reason rather than collapsing them under a neighbor, which keeps your count of distinct valid zones clean while still surfacing the broken input for review.
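
A rough sketch of that set-aside step follows. The three patterns are simplified format checks chosen for illustration; they are assumptions, not the tool's validation rules, and they do not confirm that a code actually exists.

```python
import re

# Simplified format checks (assumptions for illustration only).
PATTERNS = {
    "US ZIP": re.compile(r"\d{5}(-\d{4})?"),
    "UK postcode": re.compile(r"[A-Z]{1,2}\d[A-Z\d]? ?\d[A-Z]{2}"),
    "Canadian postal code": re.compile(r"[A-Z]\d[A-Z] ?\d[A-Z]\d"),
}

def split_valid_invalid(values):
    """Return (valid codes, invalid rows with line number and reason)."""
    valid, invalid = [], []
    for line_no, value in enumerate(values, start=1):
        code = value.strip().upper()
        if any(p.fullmatch(code) for p in PATTERNS.values()):
            valid.append(code)
        else:
            # Keep the original text; never guess which neighbor was meant.
            invalid.append((line_no, value, "matches no known postal-code format"))
    return valid, invalid

valid, invalid = split_valid_invalid(["94105", "9410", "SW1A 1AA", "m5v 3l9"])
print(valid)    # ['94105', 'SW1A 1AA', 'M5V 3L9']
print(invalid)  # [(2, '9410', 'matches no known postal-code format')]
```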

This matters for audit trails. If a teammate asks why your merged list has 412 zones and the raw exports had 540 lines, you want to point at the duplicate counts and the invalid-row reasons, not at a black-box number. Export to CSV with line numbers and the whole reduction is explainable.
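
For a sense of what such an audit export can look like, here is a small sketch that records the source line numbers behind each surviving row. The column names and the audit_csv helper are hypothetical and do not describe the tool's actual CSV layout.

```python
import csv
import io
import re
from collections import defaultdict

def audit_csv(lines):
    """Build a CSV with one row per distinct code, its count, and source lines."""
    seen = defaultdict(list)
    for line_no, raw in enumerate(lines, start=1):
        key = re.sub(r"\s+", " ", raw.strip().upper())
        seen[key].append(line_no)

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["code", "count", "source_lines"])
    for code, nums in seen.items():
        writer.writerow([code, len(nums), " ".join(map(str, nums))])
    return out.getvalue()

report = audit_csv(["94105", "m5v 3l9", "94105", "M5V 3L9"])
print(report)
```

With the line numbers in hand, anyone can trace a merged row back to the exact inputs that produced it.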

My own workflow with this on a real merge

The first time I leaned on this was a two-region order export, US plus a smaller UK pilot. The naive count in my spreadsheet said 318 unique postcodes. I did not believe it, because the UK column was visibly full of the same handful of codes in different shapes. I dumped both columns into the deduplicator, kept invalid rows for review, and the real distinct count came back at 274. The gap was almost entirely UK case-and-spacing noise plus a dozen truncated codes someone had typed by hand. On a second pass I normalized the spacing before deduplicating, and the number held steady. That second pass is the habit I would pass on: when UK or Canadian codes are in the mix, normalize spacing, then dedupe, then trust the count.

Get a clean, shareable list out the other side

Once the list is deduped you usually need it somewhere else. The tool can emit the same result as plain lines, CSV, JSON, Markdown, a SQL IN clause, or a TypeScript union, so the clean list goes straight into a query, a script, or a CRM import without hand-adding quotes and commas. Everything runs in your browser tab, so customer address lists never leave your machine to be compared.
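
If you ever need to reproduce a few of those formats outside the tool, the shapes are simple to build. The sketch below is an illustration with a hypothetical sql_quote helper, not the tool's exporter.

```python
import json

codes = ["SW1A 1AA", "M5V 3L9", "94105"]

def sql_quote(value: str) -> str:
    # Standard SQL string literal: wrap in single quotes, double any inner quote.
    return "'" + value.replace("'", "''") + "'"

sql_in = "IN (" + ", ".join(sql_quote(c) for c in codes) + ")"
ts_union = " | ".join(json.dumps(c) for c in codes)
as_json = json.dumps(codes)

print(sql_in)    # IN ('SW1A 1AA', 'M5V 3L9', '94105')
print(ts_union)  # "SW1A 1AA" | "M5V 3L9" | "94105"
print(as_json)   # ["SW1A 1AA", "M5V 3L9", "94105"]
```

For anything that touches a live database, prefer parameterized queries over a pasted IN clause when the driver supports them.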

If your input is messier than postal codes alone, two neighbors help upstream: the Text File Cleaner for stripping hidden whitespace and stray characters from copied web text, and the Postal Code List Validator when you want format checks before you commit to the dedup. Chain them when a source is rough, run the deduplicator alone when it is already tidy.

The short version: this tool folds case and collapses extra spaces, so the most common UK and Canadian duplicates that survive a plain dedup get caught here. Normalize spacing first when your sources disagree on the canonical space, keep the invalid rows for review, and the distinct-zone count you report will hold up to scrutiny.


Made by Toolora · Updated 2026-06-13