How to Deduplicate Base64 Blocks Without Quietly Corrupting Your Data

Most deduplication is forgiving. When you clean a contact list, Ada@Example.com and ada@example.com are the same person, so folding case is the right call. When you collapse a UUID dump, A1B2 and a1b2 point at the same record. So nearly every dedup tool lowercases first and compares second. That habit is correct almost everywhere, and it is exactly wrong for Base64.

Base64 is case-significant. The character a and the character A are different alphabet positions, so they decode to different bytes. aGVsbG8= decodes to hello. AGVSBG8= decodes to a completely different byte sequence. They are not one value written two ways. They are two values that happen to look similar. If your dedup pass lowercases before comparing, it will merge those two rows, throw one away, and hand you output that silently lost real data. Nobody gets an error. You just ship a list that is missing a key.
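If you want to check this claim yourself, a few lines of Python using only the standard library's `base64` module are enough. This is a quick demonstration, not part of the tool:

```python
import base64

# The two strings differ only in letter case.
lower = base64.b64decode("aGVsbG8=")  # b"hello"
upper = base64.b64decode("AGVSBG8=")

print(lower)
print(upper)
print(lower == upper)  # False: different characters, different bytes
```

Both strings decode without an error, which is exactly why a case-folding merge goes unnoticed.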

That single difference is why I built base64-block-deduplicator to behave the opposite way from the email and UUID dedup tools that sit right next to it.

The one rule: never fold case on Base64

The whole risk lives in the comparison key. To decide whether two blocks are duplicates, a deduplicator builds a normalized key for each one and groups by that key. The question is what "normalize" is allowed to touch.

For Base64, exactly one thing is safe to normalize before comparing: layout. Whitespace and line-wrapping carry no meaning, so a 64-column wrapped PEM body and the same body on one line are genuinely the same value. URL-safe Base64 (- and _) is just an alternate spelling of standard Base64 (+ and /), so unifying the alphabet is safe too. Everything else, especially case, is real payload. Touch it and you have changed the data.

So the dedup key here strips whitespace and line breaks, maps the URL-safe alphabet onto the standard one, and then stops. It never lowercases. Two blocks collapse into one row only when they are byte-for-byte identical after layout is cleaned up.
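The key described above can be sketched in a few lines. This is a minimal illustration under my own naming; `dedup_key` is a hypothetical helper, not the tool's actual source:

```python
# Hypothetical helper; the name and signature are illustrative only.
_URLSAFE_TO_STANDARD = str.maketrans("-_", "+/")


def dedup_key(block: str) -> str:
    """Build a comparison key that normalizes layout and alphabet, never case."""
    # str.split() with no arguments splits on any run of whitespace,
    # so joining the pieces removes spaces, tabs, and line breaks.
    compact = "".join(block.split())
    # Map the URL-safe characters (- and _) onto the standard ones (+ and /).
    return compact.translate(_URLSAFE_TO_STANDARD)


print(dedup_key("aGVs\n  bG8=") == dedup_key("aGVsbG8="))  # True: layout only
print(dedup_key("AGVSBG8=") == dedup_key("aGVsbG8="))      # False: case is payload
```

Note what is absent: there is no call to `lower()` anywhere, and the sketch does not try to repair padding.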

This is enforced at the tool level, not left to a checkbox. The Base64 profile is marked case-significant, which overrides the usual case-insensitive default. Even if you switch the "case sensitive" option off, Base64 blocks are still compared with case intact, because folding case would produce wrong answers and there is no honest reason to offer it. By contrast, the email deduplicator in the same family does lowercase, because for email that is correct. Same engine, opposite rule, chosen per data type.
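One way to picture "same engine, opposite rule" is a per-type profile whose flag wins over the user's checkbox. The sketch below is an assumption about the shape of such a design; `Profile`, `case_significant`, and `comparison_key` are names I made up for illustration:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    """Hypothetical per-data-type settings for a shared dedup engine."""
    name: str
    case_significant: bool  # True means case must never be folded


EMAIL = Profile("email", case_significant=False)
BASE64 = Profile("base64", case_significant=True)


def comparison_key(value: str, profile: Profile, user_case_sensitive: bool = False) -> str:
    # Whitespace handling is simplified here: all whitespace is removed.
    key = "".join(value.split())
    # The profile flag overrides the user option: a case-significant
    # profile keeps case intact even when the checkbox is off.
    if profile.case_significant or user_case_sensitive:
        return key
    return key.lower()


print(comparison_key("Ada@Example.com", EMAIL))  # ada@example.com
print(comparison_key("AGVSBG8=", BASE64))        # AGVSBG8=
```

The design point is that the unsafe option is unreachable for Base64 rather than merely discouraged.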

A worked example

Paste this into the tool. It is the kind of mess you get from merging two exports, where one row was hand-edited and a wrapped cert body shows up twice with different line breaks:

eyJ0b29sIjoiVG9vbG9yYSIsIm9rIjp0cnVlfQ==
AGVSBG8=
aGVsbG8=
eyJ0b29sIjoiVG9vbG9yYSIsIm9rIjp0cnVl
fQ==

Run dedup keeping unique rows only. You get three rows back, not two:

| value | count |
|---|---|
| eyJ0b29sIjoiVG9vbG9yYSIsIm9rIjp0cnVlfQ== | 2 |
| AGVSBG8= | 1 |
| aGVsbG8= | 1 |

The first block and the wrapped two-line block at the end collapsed into one row with a count of 2, because the wrapped block normalizes to the same value as the single-line one. That is the layout-only normalization doing its job. But AGVSBG8= and aGVsbG8= stayed as two separate rows. A case-folding deduplicator would have merged them and reported a count of 2, deleting one decoded payload you actually needed. Here they survive, because they are different data and the tool knows it.
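The same result can be reproduced with a short script. This is a sketch under two assumptions: the input has already been split into blocks (so the wrapped block arrives as one string containing a line break), and `dedup_key` and `dedupe` are hypothetical helpers, not the tool's code:

```python
_URLSAFE_TO_STANDARD = str.maketrans("-_", "+/")


def dedup_key(block: str) -> str:
    # Layout-only normalization: drop whitespace, unify the alphabet, keep case.
    return "".join(block.split()).translate(_URLSAFE_TO_STANDARD)


def dedupe(blocks, key=dedup_key):
    """Return (value, count) pairs in order of first appearance."""
    counts = {}
    for block in blocks:
        k = key(block)
        counts[k] = counts.get(k, 0) + 1
    return list(counts.items())


blocks = [
    "eyJ0b29sIjoiVG9vbG9yYSIsIm9rIjp0cnVlfQ==",
    "AGVSBG8=",
    "aGVsbG8=",
    "eyJ0b29sIjoiVG9vbG9yYSIsIm9rIjp0cnVl\nfQ==",  # the wrapped block
]

safe = dedupe(blocks)
# For contrast only: what a case-folding key would do to the same input.
folded = dedupe(blocks, key=lambda b: dedup_key(b).lower())

print(len(safe))    # 3
print(len(folded))  # 2: the two distinct payloads were merged
```

The only difference between the two runs is one `.lower()` call, and it costs a row.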

Where this actually bites

The blocks people dedupe are rarely toy strings. They are certificate bodies, private and public key material, JWT payloads, and base64-encoded log fields, pulled out of bundles, support tickets, copied HTML, and CSV exports. These are precisely the values where a silent merge is most expensive, because the difference between two near-identical blocks might be one flipped bit in a key, and that one bit is the whole point.

In my own work the common shape is a certificate or secrets bundle that got concatenated from two sources during an incident, and somewhere in there is a repeated block I want gone and a near-twin block I absolutely must keep. Pasting the whole thing in, deduping on the exact case-sensitive value, and reading off the duplicate counts is faster and far safer than eyeballing wrapped Base64 by hand, where every block looks like noise and your eyes cannot tell l from I or 0 from O anyway. The tool keeps the first occurrence of each unique value, shows how many times it appeared, and records the source line, so when someone asks where a duplicate came from you can point at it instead of guessing.

A few habits make it reliable:

Normalize layout, never content. Strip whitespace and unify the URL-safe alphabet, then stop. If you find yourself wanting to lowercase, you are about to break something.
Keep invalid rows visible. A block that fails validation usually means broken padding or a stray non-alphabet character. That is the row you hand back to whoever exported the dump, so it stays in the output with a reason rather than vanishing.
Treat the count column as evidence. If a block you expected to be unique shows a count above one, that tells you the duplication happened upstream, in whatever produced the input, not in the tool.
Keep an audit trail when it matters. Download CSV or Markdown with line numbers instead of copying only the final list, so the cleanup is reproducible.
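These habits translate fairly directly into code. The sketch below keeps the first occurrence, counts repeats, records the first source line, and leaves invalid rows in the output with a reason. Everything in it is an assumption for illustration: `Row`, `invalid_reason`, `build_report`, and `to_csv` are hypothetical names, and the validity check is deliberately simple:

```python
import csv
import io
import re
from dataclasses import dataclass
from typing import Optional

_BASE64_SHAPE = re.compile(r"[A-Za-z0-9+/]*={0,2}")
_URLSAFE_TO_STANDARD = str.maketrans("-_", "+/")


@dataclass
class Row:
    value: str
    count: int
    first_line: int
    reason: Optional[str]  # None when the block looks well-formed


def dedup_key(block: str) -> str:
    return "".join(block.split()).translate(_URLSAFE_TO_STANDARD)


def invalid_reason(key: str) -> Optional[str]:
    if not _BASE64_SHAPE.fullmatch(key):
        return "non-alphabet character or misplaced padding"
    if len(key) % 4 != 0:
        return "length is not a multiple of 4"
    return None


def build_report(numbered_blocks):
    """numbered_blocks: iterable of (source_line, block) pairs."""
    rows = {}
    for line_no, block in numbered_blocks:
        key = dedup_key(block)
        if key in rows:
            rows[key].count += 1  # keep the first occurrence, count the repeat
        else:
            rows[key] = Row(key, 1, line_no, invalid_reason(key))
    return list(rows.values())


def to_csv(rows) -> str:
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["value", "count", "first_line", "reason"])
    for row in rows:
        writer.writerow([row.value, row.count, row.first_line, row.reason or ""])
    return buffer.getvalue()


report = build_report([(1, "aGVsbG8="), (2, "aGVsbG8"), (3, "aGVs\nbG8="), (4, "aGVs!G8=")])
print(to_csv(report))
```

Invalid rows are never dropped here; they simply carry a non-empty reason column, which is the part you hand back to whoever produced the dump.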

Doing one job per pass

Deduplication is one stage in a pipeline, and it works best when each stage does exactly one thing. If you need to lift blocks out of surrounding prose first, run base64-block-extractor and feed its output here. If your worry is whether each block is well-formed rather than repeated, base64-block-list-validator checks padding and alphabet per row. When messy line-wrapping is the actual problem, base64-block-normalizer settles the layout before you compare anything, and base64-block-list-converter reshapes the clean list into JSON, a SQL IN, or a TypeScript union for your script. And if half your "Base64 blocks" are really JWTs you want pulled apart, jwt-token-extractor is the right starting point.
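To make the last stage concrete, here is roughly what reshaping a clean list could look like. This is my own sketch of the idea, not the output format of base64-block-list-converter, and the function names are hypothetical:

```python
import json


def to_json(values) -> str:
    return json.dumps(list(values), indent=2)


def to_sql_in(values) -> str:
    # Plain single-quoting is only acceptable because validated Base64
    # contains no quote characters. Do not reuse this for arbitrary strings.
    return "(" + ", ".join(f"'{v}'" for v in values) + ")"


def to_ts_union(values) -> str:
    # json.dumps yields a double-quoted string literal for each value.
    return " | ".join(json.dumps(v) for v in values)


clean = ["aGVsbG8=", "AGVSBG8="]
print(to_json(clean))
print(to_sql_in(clean))    # ('aGVsbG8=', 'AGVSBG8=')
print(to_ts_union(clean))  # "aGVsbG8=" | "AGVSBG8="
```

Each function does one reshaping step and assumes the list was already deduplicated and validated upstream, which is the one-job-per-pass idea in miniature.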

The thread through all of them is the same principle this tool is built on: know what your data actually is before you decide what counts as the same. For email, that means folding case. For Base64, it means refusing to.

Made by Toolora · Updated 2026-06-13