How to Deduplicate UUIDs When Case and Format Variants Hide Copies

A plain text dedup keeps three rows for one UUID when case and braces differ. Here is how to fold those variants and get a clean, copyable list.

Published By Li Lei
#uuid #deduplication #data-cleanup #developer-tools

A UUID is supposed to be one value. The problem is that the same value travels through your data wearing different clothes. One service writes 550e8400-e29b-41d4-a716-446655440000. Another logs it in uppercase. A third wraps it in braces because that is how its config files store identifiers. To a person reading the list, those are clearly the same ID. To a plain text dedup, they are three strangers.

This is the exact gap that bites teams merging exports. You run a "remove duplicate lines" pass, the row count drops, and you ship the result believing it is clean. It is not. Every casing or formatting variant slipped straight through, because text dedup compares bytes, not meaning.

Why a plain text dedup keeps the same ID three times

Take one concrete UUID and write it three ways:

  • 550e8400-e29b-41d4-a716-446655440000 (lowercase, hyphenated)
  • 550E8400-E29B-41D4-A716-446655440000 (uppercase)
  • {550e8400-e29b-41d4-a716-446655440000} (brace-wrapped, the form Windows registry tools and some C# code emit)

These are one identifier. A byte-for-byte dedup sees three distinct strings, because 5 and 5 match but e and E do not, and a leading { changes the string entirely. So all three survive as separate rows. Sort them, eyeball them, and you might still miss it: in a byte-order sort, uppercase letters come before lowercase ones and { comes after both, so the three variants can land far apart in a long list.

The fix is not "look harder." The fix is to fold case and format before you compare. UUID deduplication has to normalize first, then deduplicate. Lowercase the hex, strip the wrapping punctuation, and only then ask whether two rows are equal.
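
The normalize-then-compare idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the tool's code: the helper name fold_key is hypothetical, and it strips only a single pair of outer braces.

```python
def fold_key(raw: str) -> str:
    """Build a comparison key: trim whitespace, drop one pair of outer braces, lowercase."""
    text = raw.strip()
    if text.startswith("{") and text.endswith("}"):
        text = text[1:-1]
    return text.lower()


rows = [
    "550e8400-e29b-41d4-a716-446655440000",
    "550E8400-E29B-41D4-A716-446655440000",
    "{550e8400-e29b-41d4-a716-446655440000}",
]

plain = set(rows)                      # byte-for-byte comparison
folded = {fold_key(r) for r in rows}   # comparison after folding

print(len(plain), len(folded))  # prints: 3 1
```

The set built from the raw strings keeps all three variants; the set built from folded keys keeps one.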

What the UUID Deduplicator actually folds (and what it does not)

I checked the parser against the manifest claims rather than taking the FAQ at face value, because "we handle variants" is easy to say and easy to get wrong. Here is the verified behavior.

The dedup key for every row is the value after normalization, and normalization does two things: it strips outer wrapping characters and it lowercases the result. So:

  • Case folds. 550E8400-... and 550e8400-... produce the same key and collapse to one row.
  • Braces fold. {550e8400-...} has its outer { and } stripped during cleanup, so it lands on the same key as the bare hyphenated form.

That covers the two variants people hit most: a column that is uppercase because the database driver returned it that way, and a column wrapped in {…} because a Windows or .NET tool wrote it.

There is one honest limit worth knowing before you trust the output. The extractor matches the canonical hyphenated shape, the standard 8-4-4-4-12 layout. A fully unhyphenated 32-character hex blob, such as 550e8400e29b41d4a716446655440000, is not pulled in as a UUID at all, so it will not be folded against its hyphenated twin. If your source mixes hyphenated and bare 32-char forms, hyphenate the bare ones first (a UUID Normalizer handles that step) and then deduplicate. The tool case-folds and brace-folds; it does not re-insert missing hyphens for you.

That distinction matters. Knowing exactly which variants collapse and which do not is the difference between a list you can trust and a list that looks clean but is not.
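
To see why the bare form falls through, here is a rough Python sketch. The regular expression is my own approximation of an 8-4-4-4-12 matcher, not the tool's actual pattern, and hyphenate is a hypothetical helper standing in for the normalizing step.

```python
import re

# Assumed pattern for the canonical hyphenated layout (8-4-4-4-12 hex digits).
HYPHENATED = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)
BARE = re.compile(r"[0-9a-fA-F]{32}")


def hyphenate(bare: str) -> str:
    """Re-insert hyphens into a bare 32-character hex string and lowercase it."""
    if not BARE.fullmatch(bare):
        raise ValueError(f"not a 32-character hex string: {bare!r}")
    groups = (bare[0:8], bare[8:12], bare[12:16], bare[16:20], bare[20:32])
    return "-".join(groups).lower()


bare = "550e8400e29b41d4a716446655440000"

print(HYPHENATED.search(bare))  # prints: None
print(hyphenate(bare))          # prints: 550e8400-e29b-41d4-a716-446655440000
```

A hyphen-requiring pattern never matches the bare string, so the row is invisible to the dedup until the hyphens are put back.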

A worked example: three rows in, one row out

Here is a list with the duplicates baked in. Paste this and run the dedupe:

550e8400-e29b-41d4-a716-446655440000
550E8400-E29B-41D4-A716-446655440000
{550e8400-e29b-41d4-a716-446655440000}
6f9619ff-8b86-d011-b42d-00cf4fc964ff
6F9619FF-8B86-D011-B42D-00CF4FC964FF

A plain text dedup leaves five rows, or three if you happened to lowercase first by hand, because the brace-wrapped row still differs from its bare twin. The UUID Deduplicator folds case and braces, so the normalized output is two rows:

550e8400-e29b-41d4-a716-446655440000   (count 3, first seen line 1)
6f9619ff-8b86-d011-b42d-00cf4fc964ff   (count 2, first seen line 4)

The first UUID appeared three times across three formats, the second twice across two cases, and both collapse to a single canonical lowercase row. The duplicate count and first-seen line stay attached, so you keep the evidence of where the copies came from instead of silently discarding it.
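
The same result can be reproduced with a short Python sketch. It is an approximation of the behavior described above, not the tool's implementation; the pattern and the function name dedupe_with_counts are assumptions made for illustration.

```python
import re

# Assumed matcher for the hyphenated 8-4-4-4-12 shape; searching inside the
# line means surrounding braces are simply ignored.
UUID_RE = re.compile(r"[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}")


def dedupe_with_counts(lines):
    """Map each lowercase UUID to (count, first_seen_line), in first-seen order."""
    seen = {}
    for line_no, line in enumerate(lines, start=1):
        match = UUID_RE.search(line)
        if match is None:
            continue  # no UUID-shaped value on this line
        key = match.group(0).lower()
        count, first_line = seen.get(key, (0, line_no))
        seen[key] = (count + 1, first_line)
    return seen


rows = [
    "550e8400-e29b-41d4-a716-446655440000",
    "550E8400-E29B-41D4-A716-446655440000",
    "{550e8400-e29b-41d4-a716-446655440000}",
    "6f9619ff-8b86-d011-b42d-00cf4fc964ff",
    "6F9619FF-8B86-D011-B42D-00CF4FC964FF",
]

result = dedupe_with_counts(rows)
for uuid_text, (count, first_line) in result.items():
    print(f"{uuid_text}   (count {count}, first seen line {first_line})")
```

Storing the count and first-seen line next to the key, rather than collapsing into a plain set, is what keeps the evidence available later.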

Keep the audit trail, not just the answer

The reason the count and line number ride along with each row is that "deduplicated" is a claim you sometimes have to defend. When a teammate asks why a merged export shrank from 4,000 rows to 3,100, "the tool said so" is not an answer. "These 900 were case or brace variants of IDs already present, here is the line each first appeared on" is.

You can also keep invalid rows visible rather than dropping them. A truncated UUID, an extra-character typo, or a stray non-hex digit will not fold into a valid ID — and that is the point. Those are exactly the rows a naive dedup would have left in your final list, looking like real identifiers. Seeing them flagged tells you which IDs need a human before import.
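
One way to keep invalid rows visible is to sort every line into a valid or invalid bucket instead of dropping the failures. The sketch below assumes a strict whole-line match after folding; the function name classify and the sample typos are made up for illustration, and the tool's own validation rules may differ.

```python
import re

# Assumed strict check: the whole folded value must be 8-4-4-4-12 lowercase hex.
UUID_RE = re.compile(r"[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12}")


def classify(lines):
    """Split lines into (valid, invalid) lists, keeping 1-based line numbers."""
    valid, invalid = [], []
    for line_no, raw in enumerate(lines, start=1):
        text = raw.strip()
        if text.startswith("{") and text.endswith("}"):
            text = text[1:-1]
        text = text.lower()
        if UUID_RE.fullmatch(text):
            valid.append((line_no, text))
        else:
            invalid.append((line_no, raw))  # keep the original text for review
    return valid, invalid


sample = [
    "550E8400-E29B-41D4-A716-446655440000",
    "550e8400-e29b-41d4-a716-44665544000",    # truncated: last group has 11 digits
    "550e8400-e29b-41d4-a716-4466554400000",  # one extra character
    "550e8400-e29b-41d4-a716-44665544000g",   # non-hex character
]

valid, invalid = classify(sample)
print([n for n, _ in valid])    # prints: [1]
print([n for n, _ in invalid])  # prints: [2, 3, 4]
```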

When you need the cleaned list somewhere downstream, export it as CSV or Markdown with line numbers for an audit trail, or as a SQL IN list or TypeScript union when you are feeding code. The output is the same canonical, folded set in whichever shape the next step wants.
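
If you want to produce those code-facing shapes yourself, a small Python sketch looks like this. The exact formatting the tool emits is not documented here, so the output layout, the function names, and the type name KnownId are all assumptions. The helpers expect IDs that have already been validated, which is why plain quoting is enough.

```python
def to_sql_in(ids):
    """Render validated UUID strings as a SQL IN list, e.g. ('a', 'b')."""
    if not ids:
        raise ValueError("an empty IN list is not valid SQL")
    return "(" + ", ".join(f"'{i}'" for i in ids) + ")"


def to_ts_union(ids, name="KnownId"):
    """Render validated UUID strings as a TypeScript string-literal union."""
    if not ids:
        raise ValueError("a union needs at least one member")
    return f"type {name} = " + " | ".join(f'"{i}"' for i in ids) + ";"


clean = [
    "550e8400-e29b-41d4-a716-446655440000",
    "6f9619ff-8b86-d011-b42d-00cf4fc964ff",
]

print(to_sql_in(clean))
print(to_ts_union(clean))
```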

A note on what dedup does and does not prove

One caution I have learned the slow way: a deduplicated, validated UUID list tells you the values are well-formed and unique. It does not tell you the rows they point at still exist. Format validity is not existence. After you fold the list down, the IDs are clean — but whether each one still maps to a live account, record, or resource is a separate lookup you have to run against your own system. Treat the clean list as a starting point for that check, not as the check itself.

Everything here runs in the browser. Parsing, normalization, deduplication, and export all happen locally, and uploaded text files are read with the File API rather than sent to a server — which matters when the IDs in question belong to customers. If you want to try the worked example above, the UUID Deduplicator is open, and you can paste your own mixed-format list to see exactly which rows fold together.

The takeaway is simple. Identical UUIDs hide inside case and format differences, and a plain text dedup cannot see through them. Fold case and braces first, then deduplicate, and keep the counts so the result is something you can explain — not just something that looks tidy.


Made by Toolora · Updated 2026-06-13