How to Deduplicate Domains and Keep One Clean Copy of Every Unique Domain

Case, www, protocol wrapping, and trailing dots make one site survive a plain dedup as several rows. Here is how to collapse a domain list into unique domains.

Published By Li Lei
#domains #deduplication #data-cleanup #dns #text-tools

A domain list almost never arrives clean. You merge an allowlist from a DNS export, a few hostnames a teammate copied out of an nginx config, and a column someone pasted from a spreadsheet. You run a plain dedup, expecting a tidy set of unique domains, and the count barely drops. The reason is simple and annoying: the same site shows up in shapes that are different text even though they point at the same place.

This post is about why that happens and how to actually collapse a list down to one clean copy per domain.

Why a plain dedup fails on domains

A plain line-based dedup compares strings byte for byte. Two lines survive as separate entries unless every character matches. Domains break that assumption in four common ways:

  • Case. DNS is case-insensitive, but text is not. Example.com and example.com are the same host, yet a string dedup keeps both.
  • The www prefix. www.example.com and example.com usually serve the same site, but they are genuinely different hostnames in DNS.
  • Protocol and path wrapping. https://example.com/ carries the same domain as a bare example.com, wrapped in a scheme and a trailing slash.
  • Trailing dots. example.com. (the fully qualified form with the root dot) and example.com resolve identically, but the trailing . is one more byte the string comparison trips on.

Here is the concrete trap. Example.com, www.example.com, and https://example.com/ are, for most practical purposes, the same site, but they are three distinct text rows. Feed those three lines into a naive dedup and you keep all three. To get to unique domains you have to normalize before you compare: lowercase the value, strip the protocol and path wrapper, and drop the trailing dot, then dedup. Whether www.example.com should also fold into example.com is a separate decision, because those two really are different hostnames.
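The normalize-then-compare idea can be sketched in a few lines of Python. This is a minimal illustration, not the tool's code: the function name dedup_key is made up for this example, and it ignores ports, userinfo, and internationalized names.

```python
def dedup_key(line: str) -> str:
    """Build a comparison key: lowercase, no scheme, no path, no root dot."""
    value = line.strip().lower()
    # Drop a leading scheme such as "https://" if one is present.
    if "://" in value:
        value = value.split("://", 1)[1]
    # Keep only the host part, before any path, query, or fragment.
    for sep in ("/", "?", "#"):
        value = value.split(sep, 1)[0]
    # Remove a single trailing root dot.
    if value.endswith("."):
        value = value[:-1]
    return value


rows = ["Example.com", "example.com.", "https://example.com/", "www.example.com"]

# Plain dedup: compares raw text, so every row survives.
plain = list(dict.fromkeys(rows))

# Normalized dedup: compare keys instead of raw text, keeping first-seen order.
normalized = list(dict.fromkeys(dedup_key(r) for r in rows))

print(len(plain), plain)
print(len(normalized), normalized)
```

The plain pass keeps all four rows. The normalized pass keeps two, example.com and www.example.com, because nothing in the key touches the www label.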

What the tool actually normalizes

I checked this against the parser rather than guessing, because "domain dedup" can mean different things and the details matter.

The Domain Name Deduplicator extracts each domain with a pattern, then builds a dedup key by lowercasing the value and removing a single trailing dot. So all three of these collapse into one row:

  • EXAMPLE.com
  • example.com.
  • example.com

The extractor also pulls the bare domain out of a wrapper. Paste https://example.com/ and it reads example.com — the scheme, the slash, and any path fall away during extraction, so a wrapped URL and a bare domain land on the same key. That covers case, protocol wrapping, and trailing dots out of the box.

The one shape it does not silently merge is the www prefix. www.example.com keeps its www. label and stays a separate row from example.com. That is technically correct — they are different hostnames, and plenty of setups serve different content on the apex versus the www subdomain. If you genuinely want the apex and www treated as one entry, normalize that step yourself before pasting, for example with the Domain Name Normalizer, then run the dedup. I would rather the tool not assume www.shop.example.com is the same as shop.example.com, because for subdomains that assumption is often wrong.
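If you do want apex and www merged before pasting, one cautious way to do it yourself is to fold www only for domains you have explicitly listed. The sketch below is an assumption about how you might script that step, not how the Domain Name Normalizer works; fold_www and its fold_for parameter are hypothetical names.

```python
def fold_www(host: str, fold_for: set) -> str:
    """Strip a leading "www." only when the remainder is an apex you opted in."""
    prefix = "www."
    if host.startswith(prefix):
        rest = host[len(prefix):]
        if rest in fold_for:
            return rest
    # Anything not explicitly listed keeps its www label.
    return host


# Apex domains where www and the bare name are known to serve the same site.
same_site = {"example.com"}

print(fold_www("www.example.com", same_site))
print(fold_www("www.shop.example.com", same_site))
```

The explicit set keeps the decision in your hands: www.example.com folds into example.com, while www.shop.example.com is left alone because shop.example.com was never opted in.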

A worked example

Say you pasted this messy list, stitched together from three exports:

Example.com
example.com
example.com.
https://example.com/
www.example.com
API.Example.com
api.example.com
toolora.info
http://broken

Run the deduplicator with "keep unique only" and you get one canonical row per distinct domain, with a count of how many source lines folded into it:

domain           count
example.com      4
www.example.com  1
api.example.com  2
toolora.info     1

Walk through it. The first four input lines (Example.com, example.com, example.com., and https://example.com/) all normalize to example.com, so they collapse into one row with a count of 4. www.example.com survives on its own because the www label makes it a different host. API.Example.com and api.example.com differ only by case, so they fold into api.example.com with a count of 2. toolora.info is unique. And http://broken has a leftover scheme but no valid TLD, so it is flagged invalid with a reason rather than being passed through as a real domain.
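To check the arithmetic yourself, here is a rough Python sketch that reproduces those counts from the same nine lines. The regular expression is my own approximation of "extract a domain with a pattern", not the tool's actual pattern, and dedup_with_counts is a hypothetical name.

```python
import re
from collections import Counter

# Assumed pattern (not the tool's real one): dot-separated labels, an
# alphabetic TLD, and an optional trailing root dot.
DOMAIN_RE = re.compile(
    r"(?:[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\.)+[a-z]{2,63}\.?",
    re.IGNORECASE,
)


def dedup_with_counts(lines):
    """Return (counts per normalized domain, list of invalid (line_no, text))."""
    counts = Counter()
    invalid = []
    for number, line in enumerate(lines, start=1):
        match = DOMAIN_RE.search(line)
        if match is None:
            # Nothing in the line looks like labels plus a TLD.
            invalid.append((number, line))
            continue
        key = match.group(0).lower()
        if key.endswith("."):
            key = key[:-1]  # drop a single trailing root dot
        counts[key] += 1
    return counts, invalid


messy = [
    "Example.com",
    "example.com",
    "example.com.",
    "https://example.com/",
    "www.example.com",
    "API.Example.com",
    "api.example.com",
    "toolora.info",
    "http://broken",
]

counts, invalid = dedup_with_counts(messy)
for domain, count in counts.items():
    print(f"{domain:<16} {count}")
print("invalid:", invalid)
```

Counter keeps keys in first-seen order, so the printed rows come out in the same order as the table, and http://broken lands in the invalid list with its line number, 9.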

That count column is the part I lean on most. When you merge several exports, the count tells you which domains were duplicated and how badly, which is exactly the evidence you need when someone asks why the merged list is shorter than the inputs.

Keep the audit trail, not just the clean list

The deduplicator keeps the first source line for each value and can preserve invalid rows for review instead of dropping them. That matters for cleanup work that someone else has to trust. If you only copy the final list of unique domains, you lose the answer to "where did this come from" and "what got thrown away." Downloading CSV or Markdown with line numbers keeps that trail intact.
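If you are scripting the same cleanup, the audit trail is cheap to keep. The sketch below records the first source line and its line number for each key and writes a CSV. The column names are invented for this example and are not the tool's export format, and the key only handles case and the trailing dot.

```python
import csv
import io


def audit_rows(lines):
    """One record per unique key: domain, count, first line number, first source text."""
    records = {}
    for number, line in enumerate(lines, start=1):
        key = line.strip().lower()
        if key.endswith("."):
            key = key[:-1]  # drop a single trailing root dot
        if key not in records:
            # Remember where this value was first seen.
            records[key] = {"domain": key, "count": 0, "first_line": number, "source": line}
        records[key]["count"] += 1
    return list(records.values())


def to_csv(records):
    """Serialize the audit records as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["domain", "count", "first_line", "source"])
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


records = audit_rows(["Example.com", "example.com.", "toolora.info"])
print(to_csv(records))
```

Keeping the original source text next to the normalized domain is what lets a reviewer see that example.com came from a line that actually read Example.com.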

The invalid rows are worth a second look before you apply anything. A row flagged invalid is usually a domain with an illegal character, a missing TLD, or a leftover http:// prefix that the extractor could not resolve into a clean host. Those are precisely the entries you want to fix in the source — an allowlist with a malformed line will silently fail to match the host you meant to allow.
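A rough pre-check for those three failure modes might look like the following. It is meant for values that should already be bare hosts, the reason strings and the invalid_reason name are my own, and it is deliberately loose: it does not verify that the TLD really exists, and it will accept an IPv4 address as if it were a domain.

```python
import re

# One DNS label: letters, digits, and inner hyphens, at most 63 characters.
LABEL_RE = re.compile(r"[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?")


def invalid_reason(host: str):
    """Return a short reason string, or None if the value looks like a host."""
    value = host.strip().lower()
    if "://" in value:
        return "leftover scheme"
    if value.endswith("."):
        value = value[:-1]  # a single root dot is fine
    labels = value.split(".")
    if len(labels) < 2:
        return "missing TLD"
    for label in labels:
        if LABEL_RE.fullmatch(label) is None:
            return "illegal character or malformed label"
    return None


for candidate in ["example.com", "http://broken", "exa mple.com", "localhost"]:
    print(candidate, "->", invalid_reason(candidate))
```

Returning a reason instead of a bare true or false mirrors the point above: the reason is what tells you how to fix the line at the source.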

My own workflow

The first time I used this in anger, I had three allowlist exports to merge before a config change, and a plain sort | uniq had left me with what looked like 40 unique domains. I pasted everything in, deduped, and the real count was 31. Seven of the nine "extra" rows were just variants of domains already in the list: case and trailing-dot differences, plus two https://-wrapped copies. The remaining two were genuine invalids: a typo'd TLD and a host with a stray space in the middle that I had been about to push into production. I exported the clean CSV with line numbers, fixed the two broken entries at the source, and shipped a list I could actually defend in review.

The short version

A plain dedup compares raw text, so case, the www prefix, protocol wrapping, and trailing dots let the same site survive as several rows. Domain dedup has to normalize first — lowercase, strip the scheme and path, drop the trailing dot — and only then compare. The deduplicator does the lowercase, wrapper, and trailing-dot steps for you, keeps a count and source line for every value, and flags the invalid rows so you clean them before they reach a real config. Decide for yourself whether apex and www are the same entry, normalize that one step if you need to, and you end up with a list of unique domains you can hand off without second-guessing.

If your inputs are still buried inside logs or copied web pages, pull the hosts out first with the Domain Name Extractor, then run them through the deduplicator.


Made by Toolora · Updated 2026-06-13