How to Normalize IPv6 Addresses to One Canonical Form (RFC 5952)

The first time I tried to deduplicate an IPv6 allowlist, I got the count wrong twice. The list had 2001:0db8:0000:0000:0000:0000:0000:0001 on one line and 2001:db8::1 on another, and a plain text dedupe treated them as two different hosts. They are the same host. That is the whole problem with IPv6: a single address has many valid spellings, and string comparison does not know that.

This post walks through why that happens, what the canonical form actually is, and how to flatten a messy paste into one consistent list.

Why one IPv6 address has so many spellings

An IPv6 address is 128 bits, written as eight groups (hextets) of four hexadecimal digits separated by colons. The format allows three kinds of optional shortening, and each one is legal, so the same bits can be typed several ways:

Case. Hex digits a through f can be uppercase or lowercase. DB8, db8, and Db8 are identical.
Leading zeros. Inside a group, leading zeros are optional. 0db8 and db8 mean the same thing, and 0000 can shrink to 0.
Zero-run compression. A single run of consecutive all-zero groups can be replaced by a double colon (::). You can use it once per address.

Multiply those choices together and one host explodes into a pile of look-alikes. None of them is wrong. But when you feed that pile into a firewall config, a SQL IN clause, or a dedupe step, the mismatched spellings quietly produce wrong results.

To see how wide the gap gets, take the loopback address ::1. Written out in full it is 0000:0000:0000:0000:0000:0000:0000:0001. A person can also legally type 0:0:0:0:0:0:0:1, or ::0:1, or 0::1. Counting ::1 itself, all five spellings describe the exact same 128 bits, but a naive comparison treats them as five distinct hosts. The longer the address and the more zero groups it contains, the more spellings it spawns. Without a single agreed-upon form, every tool downstream has to guess whether two strings mean the same thing, and guessing is where the bugs live.
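To make that gap concrete, here is a minimal Python sketch using the standard library's ipaddress module (my choice for illustration, not something the post's tooling depends on). It compares the loopback spellings above first as raw strings and then as parsed 128-bit values.

```python
import ipaddress

# Legal spellings of the loopback address, including ::1 itself.
spellings = [
    "::1",
    "0000:0000:0000:0000:0000:0000:0000:0001",
    "0:0:0:0:0:0:0:1",
    "::0:1",
    "0::1",
]

# As raw strings, every spelling is different.
distinct_strings = len(set(spellings))

# Parsed into 128-bit values, they are one and the same address.
distinct_addresses = len({ipaddress.IPv6Address(s) for s in spellings})

print(distinct_strings, distinct_addresses)  # 5 1
```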

What RFC 5952 actually fixes

RFC 5952 settles the ambiguity by defining one canonical text form. The rules that matter in practice:

Lowercase every hex digit. 2001:0DB8 becomes 2001:0db8, then the next rule shortens it further.
Drop leading zeros in each group. 0db8 becomes db8; a group of 0000 becomes 0.
Collapse the longest run of zero groups to a single ::. If two runs tie in length, compress the first one. A run of exactly one zero group is not shortened to :: — the double colon is reserved for runs of two or more, so you do not write 2001:db8:0:1:1:1:1:1 as 2001:db8::1:1:1:1:1.

Apply all three and every spelling of a host converges on the same string. Here is the concrete claim worth memorizing: 2001:0DB8:0000:0000:0000:0000:0000:0001 and 2001:db8::1 are the same address, and only after both are rewritten to the canonical 2001:db8::1 will a deduplicator recognize them as one. The canonical form is what makes dedupe, sorting, and exact-match filtering correct. That is exactly the target form the IPv6 Address Normalizer writes out.
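The three rules are small enough to sketch by hand. The function below is an illustration in Python, not the code behind any particular tool: canonical_ipv6 is a name I made up, it leans on the standard ipaddress module only to validate and expand the input, and it ignores zone IDs and embedded IPv4 notation.

```python
import ipaddress


def canonical_ipv6(text: str) -> str:
    """Rewrite one IPv6 address using the three rules described above."""
    # Let the standard library validate and expand to eight 4-digit groups.
    exploded = ipaddress.IPv6Address(text.strip()).exploded

    # Rules 1 and 2: lowercase, then drop leading zeros in each group.
    groups = [g.lower().lstrip("0") or "0" for g in exploded.split(":")]

    # Rule 3: find the longest run of "0" groups; the first run wins a tie.
    best_start, best_len = -1, 0
    run_start, run_len = -1, 0
    for i, g in enumerate(groups):
        if g == "0":
            if run_len == 0:
                run_start = i
            run_len += 1
            if run_len > best_len:  # strictly longer, so ties keep the first
                best_start, best_len = run_start, run_len
        else:
            run_len = 0

    # A lone zero group stays as "0"; only runs of two or more become "::".
    if best_len < 2:
        return ":".join(groups)
    head = ":".join(groups[:best_start])
    tail = ":".join(groups[best_start + best_len:])
    return f"{head}::{tail}"


print(canonical_ipv6("2001:0DB8:0000:0000:0000:0000:0000:0001"))  # 2001:db8::1
print(canonical_ipv6("2001:db8:0:0:1:0:0:1"))                     # 2001:db8::1:0:0:1
print(canonical_ipv6("2001:db8:0:1:1:1:1:1"))                     # 2001:db8:0:1:1:1:1:1
```

The second call shows the tie-break (two runs of two zero groups, the first one is compressed) and the third shows a single zero group being left alone.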

A worked example

Say a teammate pastes this address from a router export:

2001:0DB8:0000:0000:0000:0000:0000:0001

Walk it through the rules:

Lowercase the hex: 2001:0db8:0000:0000:0000:0000:0000:0001
Drop leading zeros per group: 2001:db8:0:0:0:0:0:1
Find the longest zero run (five consecutive 0 groups in the middle) and collapse it: 2001:db8::1

Final canonical form:

2001:db8::1

That 39-character verbose address and the 11-character compressed one encode exactly the same 128 bits, so they are the same host. Once both are normalized, a dedupe pass keeps exactly one of them.
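If you would rather check the worked example than trust my arithmetic, a few lines of Python will do it. This is a quick sanity check, assuming the standard ipaddress module; for this address its string output matches the canonical form derived above.

```python
import ipaddress

verbose = "2001:0DB8:0000:0000:0000:0000:0000:0001"
compact = "2001:db8::1"

a = ipaddress.IPv6Address(verbose)
b = ipaddress.IPv6Address(compact)

print(len(verbose), len(compact))  # 39 11
print(a == b)                      # True: same 128-bit value
print(str(a))                      # 2001:db8::1
print(a.exploded)                  # 2001:0db8:0000:0000:0000:0000:0000:0001
```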

Doing it on a real, messy paste

Hand-applying three rules to one address is fine. Doing it to four hundred addresses scraped out of a log slice, a CSV export, and a support ticket is not. That is where I reach for the tool. I paste the raw text, and the browser-side parser pulls out each address, rewrites it to the RFC 5952 form, lowercases the hextets, and collapses the longest zero run — all locally, with nothing sent to a server. What used to be a "did I miscount again?" moment is now a clean, sorted list I can trust.
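For readers who want to script the same idea, here is a rough Python sketch of the extract-then-normalize flow. It is not how the browser tool is implemented; the CANDIDATE pattern and the normalize_paste name are my own assumptions, and the pattern is deliberately loose because the parser makes the final call on what counts as an address.

```python
import ipaddress
import re

# Hypothetical pattern: maximal runs of hex digits and colons that are not
# glued to other letters, digits, or colons on either side.
CANDIDATE = re.compile(r"(?<![0-9A-Za-z:])[0-9A-Fa-f:]{2,}(?![0-9A-Za-z:])")


def normalize_paste(raw: str) -> list[str]:
    """Extract IPv6 addresses from free text; return sorted canonical strings."""
    found = set()
    for token in CANDIDATE.findall(raw):
        if token.count(":") < 2:
            continue  # too few colons to be an IPv6 address
        try:
            found.add(ipaddress.IPv6Address(token))
        except ValueError:
            continue  # e.g. a timestamp such as 12:34:56
    return [str(addr) for addr in sorted(found)]


paste = """
12:34:56 accepted src=2001:0DB8:0000:0000:0000:0000:0000:0001
ticket note: please allow 2001:db8::1 and FE80::0:1.
"""
print(normalize_paste(paste))  # ['2001:db8::1', 'fe80::1']
```

Collecting parsed addresses in a set means the dedupe happens on the 128-bit values, so the two spellings of 2001:db8::1 merge before anything is written out.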

The part I appreciate most is that the work stays on my machine. IPv6 addresses inside an internal log can be sensitive — they map to real hosts on a real network — so a tool that quietly shipped my paste to a server would be a non-starter. Reading the uploaded files locally with the browser File API keeps the source text where it belongs.

A few things I lean on every time:

Keep unique rows only. After normalization, identical hosts that were spelled differently finally collapse into one entry, so the dedupe is actually correct.
Surface what cannot be normalized. Addresses with two ::, a group over four hex digits, or non-hex characters do not silently vanish. They stay listed with a reason, so a broken address never sneaks into the clean output. If you only want that validation pass, the IPv6 Address List Validator reports the same reasons without rewriting.
Export the shape you need. The normalized list can go out as plain lines, CSV, JSON, Markdown, a SQL IN list, or a TypeScript union, so I am not hand-adding quotes and commas.
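The same three habits can be sketched in Python. Treat this as an illustration under my own assumptions: split_valid_invalid is a hypothetical helper, and the rejection reasons are whatever the standard ipaddress parser reports, which will not match the wording any specific tool uses.

```python
import ipaddress
import json


def split_valid_invalid(lines):
    """Return (sorted unique canonical addresses, [(original, reason), ...])."""
    valid, rejected = set(), []
    for line in lines:
        text = line.strip()
        if not text:
            continue
        try:
            valid.add(ipaddress.IPv6Address(text))
        except ValueError as exc:
            rejected.append((text, str(exc)))  # keep the parser's reason
    return [str(a) for a in sorted(valid)], rejected


rows = [
    "2001:0DB8:0000:0000:0000:0000:0000:0001",
    "2001:db8::1",
    "2001::db8::1",     # two "::"
    "2001:db8::12345",  # group longer than four hex digits
    "2001:db8::zz",     # non-hex characters
]
clean, rejected = split_valid_invalid(rows)

# Two export shapes built from the clean list.
as_json = json.dumps(clean)
as_sql_in = "(" + ", ".join(f"'{a}'" for a in clean) + ")"

print(as_json)    # ["2001:db8::1"]
print(as_sql_in)  # ('2001:db8::1')
for original, reason in rejected:
    print(original, "->", reason)
```

Quoting values straight into the SQL fragment is tolerable here only because every value came back out of the parser, so it contains nothing but hex digits and colons. For anything else, use bound parameters.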

Common traps

Two mistakes burn people repeatedly.

First, treating a passing format check as proof the host exists. Canonical form means the address is well-formed, not that anything answers at that address. Normalization cleans the spelling; it does not ping the network.

Second, deduplicating before normalizing. Copied web text carries hidden whitespace and mixed casing, so a dedupe run before canonicalization will keep look-alikes that should have merged. Normalize first, then dedupe. The order is not optional — it is the entire reason the count comes out right.
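The ordering problem is easy to reproduce. In this small Python sketch (standard ipaddress module again, with sample rows I made up), the same three entries are counted both ways.

```python
import ipaddress

raw = [
    "2001:0DB8:0000:0000:0000:0000:0000:0001",
    "2001:db8::1 ",  # trailing whitespace from a copy-paste
    "2001:DB8::1",   # mixed casing
]

# Wrong order: dedupe the raw strings first.
dedupe_first = len(set(raw))

# Right order: trim and normalize each entry, then dedupe.
normalize_first = len({str(ipaddress.IPv6Address(s.strip())) for s in raw})

print(dedupe_first, normalize_first)  # 3 1
```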

When to use it

Reach for canonical normalization any time IPv6 addresses cross a boundary: importing an allowlist into a CRM or ticket system, building a config from scattered exports, generating fixtures for tests, or auditing a log. Anywhere two systems compare addresses as strings, one canonical spelling per host is the difference between a clean diff and a phantom mismatch.

If your source is raw and you only need the addresses pulled out first, start by extracting them with the IPv6 Address Extractor, then normalize the result. Either way, the goal is the same: every host, written exactly one way.

That last line is the whole point. The RFC 5952 authors did not invent the canonical form to be pedantic — they did it so that two engineers, two scripts, and two databases looking at the same host all produce the same string. Lowercase the hex, drop the leading zeros, collapse the longest zero run once, and the ambiguity disappears. Do it by hand for one address to understand the rules, then let a local tool do it for the other four hundred.

Made by Toolora · Updated 2026-06-13