
How to Deduplicate Card Numbers When Three Formats Hide the Same Value

A plain text dedupe keeps 4111 1111 1111 1111, 4111-1111-1111-1111, and the bare-digit version as three rows. Here is how to collapse formatting variants into one.

Published By Li Lei
#test-data #deduplication #redaction #data-cleanup

If you work with test fixtures or redaction exports, you have probably opened a file, run a quick sort-and-unique on it, watched the line count drop, and assumed the list was clean. For card-shaped numbers, that assumption is usually wrong. The same value shows up wearing different clothes, and a plain dedupe treats every outfit as a separate person.

This post is about that specific failure, and how to collapse the variants properly. It is aimed at people who handle card-shaped numbers as data, not as accounts: QA engineers seeding a sandbox, support staff scrubbing a ticket, anyone building a redaction map. It is not about collecting real cards, and nothing here helps with that.

Why a plain dedupe keeps duplicates

A card number has no canonical printed form. Payment sandbox documentation often prints 4111 1111 1111 1111 with spaces. A CSV export might emit 4111-1111-1111-1111 with dashes. A log line or a hand-typed fixture often gives you 4111111111111111 with no separators at all. To a person, those are obviously one number. To a byte-comparison dedupe, they are three distinct strings, because the spaces and dashes change the raw bytes.

So 4111 1111 1111 1111, 4111-1111-1111-1111, and 4111111111111111 are the same value in three formats, and a plain dedupe of a test-data or redaction export keeps all three. Your "deduplicated" list still has the duplicate; it is just hidden behind spaces and dashes. A meaningful dedupe has to strip each value down to its bare digits first, then compare. Anything less is counting punctuation, not numbers.

That sounds obvious once you say it out loud, but it is exactly the kind of thing that slips past a uniq pipeline or a spreadsheet "Remove Duplicates" button, because both of those compare the cell text as written.
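
To make the difference concrete, here is a minimal Python sketch of the two comparisons. It is an illustration of the idea, not the tool's code; the digits_only name is borrowed from the helper described in the next section, and its body here is an assumption.

```python
import re

def digits_only(value: str) -> str:
    """Drop every character that is not an ASCII digit."""
    return re.sub(r"[^0-9]", "", value)

rows = [
    "4111 1111 1111 1111",
    "4111-1111-1111-1111",
    "4111111111111111",
]

# Byte-for-byte comparison: three different strings survive.
plain_unique = set(rows)

# Compare on the bare digits instead: the three variants share one key.
normalized_unique = {digits_only(row) for row in rows}

print(len(plain_unique))       # 3
print(len(normalized_unique))  # 1
```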

What this tool actually does before comparing

The Credit Card Number Deduplicator does the normalization step for you, and it is worth being precise about how, because the answer determines whether you can trust the result.

Internally, every parsed value runs through a digits-only pass before the dedupe key is computed. The masking function that produces the display value first calls a digitsOnly helper that drops every non-digit character, then keeps the first six digits, masks the middle, and shows the last four. The dedupe key is built from that normalized, digits-only form. In plain terms: spaces and dashes are removed before two values are ever compared, so the bare-digit twin and its spaced or dashed siblings land on the same key and collapse into one row.

That is the behavior the manifest promises in its own FAQ, and the code matches it: a number written with spaces or dashes is treated as the same as its bare-digit twin, and only the first occurrence survives. The output you see is masked, so you are looking at 411111******1111, not the raw value, while still getting the duplicate count and the source line number.
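
The split between "what gets compared" and "what gets shown" can be sketched as follows. This is a Python approximation of the behavior described above, not the tool's source: the function names are hypothetical, and the handling of very short inputs is an assumption made only so the sketch never prints raw digits.

```python
import re

def digits_only(value: str) -> str:
    """Drop every non-digit character (spaces, dashes, anything else)."""
    return re.sub(r"[^0-9]", "", value)

def mask(value: str) -> str:
    """Display form: first six digits, starred middle, last four digits."""
    digits = digits_only(value)
    if len(digits) <= 10:
        # Assumption for this sketch: too short to leave a middle, so hide it all.
        return "*" * len(digits)
    return digits[:6] + "*" * (len(digits) - 10) + digits[-4:]

variants = ["4111 1111 1111 1111", "4111-1111-1111-1111", "4111111111111111"]

keys = {digits_only(v) for v in variants}  # what gets compared
displays = {mask(v) for v in variants}     # what gets shown

print(keys)      # {'4111111111111111'}
print(displays)  # {'411111******1111'}
```

Comparing on the full digit string rather than on the masked display matters in this sketch, because two different numbers can share the same first six and last four digits.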

One more detail that matters for cleanup work: invalid rows are not silently dropped. A value that fails the Luhn checksum or has the wrong digit length is flagged with a reason and kept, because a bad row in an import usually points at a bad source, and you want to see it rather than lose it.
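
A flag-and-keep check along those lines might look like the sketch below. The Luhn routine is the standard algorithm; the 13 to 19 digit bounds and the wording of the length message are assumptions for illustration, while the Luhn message is copied from the sample output later in this post.

```python
import re

def digits_only(value: str) -> str:
    return re.sub(r"[^0-9]", "", value)

def luhn_ok(digits: str) -> bool:
    """Standard Luhn check: double every second digit from the right."""
    total = 0
    for index, char in enumerate(reversed(digits)):
        digit = int(char)
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0

def check(value: str) -> tuple[bool, str]:
    """Return (valid, reason) without ever discarding the row."""
    digits = digits_only(value)
    if not 13 <= len(digits) <= 19:  # assumed bounds, not taken from the tool
        return False, "Card number has the wrong digit length."  # hypothetical wording
    if not luhn_ok(digits):
        return False, "Card number does not pass the Luhn checksum."
    return True, "OK"

print(check("4111 1111 1111 1111"))  # (True, 'OK')
print(check("4111 1111 1111 111"))   # (False, 'Card number does not pass the Luhn checksum.')
```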

A worked example

Here is the kind of messy paste a redaction or test-data task actually produces. Imagine a teammate handed you this fragment, stitched together from two exports and a support note:

4111 1111 1111 1111
4111-1111-1111-1111
4111111111111111
5500 0000 0000 0004
5500-0000-0000-0004
4111 1111 1111 111

Six lines. A plain sort | uniq returns six lines, because every line is a unique string. Run the same input through the deduplicator and you get something like this (masked, CSV-style):

value,count,first_line,valid,reason
411111******1111,3,1,true,OK
550000******0004,2,4,true,OK
4111 1111 1111 111,1,6,false,Card number does not pass the Luhn checksum.

Three distinct values instead of six rows. The 4111 value collapsed its spaced, dashed, and bare forms into one row and reports a count of 3 with the first occurrence on line 1. The 5500 value collapsed its two formats into one row with a count of 2. The sixth line, fifteen digits long, failed Luhn and stayed visible with its reason so you can go fix the source. That last row is the whole point of keeping invalid entries: it is the one that tells you an export is broken.
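
For readers who want to reproduce that table outside the browser, the following Python sketch produces the same CSV from the same six lines. It is a reconstruction under assumptions, not the tool's implementation: the function names are hypothetical, the 13 to 19 digit bounds and the length message are guesses, and invalid rows are shown as typed only because the sample output above shows them that way.

```python
import csv
import io
import re

SAMPLE = """4111 1111 1111 1111
4111-1111-1111-1111
4111111111111111
5500 0000 0000 0004
5500-0000-0000-0004
4111 1111 1111 111"""

def digits_only(value: str) -> str:
    return re.sub(r"[^0-9]", "", value)

def mask(digits: str) -> str:
    # First six and last four stay visible; everything between is starred.
    return digits[:6] + "*" * max(len(digits) - 10, 0) + digits[-4:]

def luhn_ok(digits: str) -> bool:
    total = 0
    for index, char in enumerate(reversed(digits)):
        digit = int(char)
        if index % 2 == 1:
            digit = digit * 2 - 9 if digit > 4 else digit * 2
        total += digit
    return total % 10 == 0

def dedupe_to_csv(text: str) -> str:
    rows = {}  # digits-only key -> output row, in first-seen order
    for line_number, line in enumerate(text.splitlines(), start=1):
        key = digits_only(line)
        if not key:
            continue  # nothing card-shaped on this line
        if key in rows:
            rows[key]["count"] += 1  # later variant: count it, keep the first
            continue
        if not 13 <= len(key) <= 19:  # assumed bounds
            valid, reason = False, "Card number has the wrong digit length."
        elif not luhn_ok(key):
            valid, reason = False, "Card number does not pass the Luhn checksum."
        else:
            valid, reason = True, "OK"
        rows[key] = {
            "value": mask(key) if valid else line.strip(),
            "count": 1,
            "first_line": line_number,
            "valid": "true" if valid else "false",
            "reason": reason,
        }
    out = io.StringIO()
    fields = ["value", "count", "first_line", "valid", "reason"]
    writer = csv.DictWriter(out, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows.values())
    return out.getvalue()

print(dedupe_to_csv(SAMPLE), end="")
```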

How I use it on real cleanup tasks

I reach for this most often when I am merging test fixtures across two repos that were seeded independently. The first time I tried it, I had pasted a sandbox card list and assumed my old awk one-liner had already deduped it; the tool reported a count of 4 on a value I was sure appeared once, which is how I learned my fixture had four copies of the same Visa test number scattered across the file in three formats. I had been shipping the same card four times under three spellings and never noticed. Now I run the paste through, sort the normalized output, export CSV with the line numbers, and keep that CSV as the audit trail instead of just copying the final list. The line numbers are what let me walk back to each source and explain where a duplicate came from, which is exactly what a reviewer asks for.

A note on responsible use

This is a cleanup and redaction tool, not a collection tool. The numbers it is built for are test data and the kind of card-shaped strings you find while scrubbing logs, tickets, or sample files for a redaction pass. Passing the Luhn checksum tells you a string is well-formed; it tells you nothing about whether a real account exists, and you should never treat validation as proof of anything real. Everything runs in the browser and the output is masked, but you are still responsible for the source text: if it contains live customer data, handle it under your own data-access rules and do not paste what you are not permitted to touch. The legitimate jobs here are deduplication, validation, and redaction prep, not assembling card numbers for use.

Where it fits in a cleanup pipeline

The deduplicator is one stop in a longer flow. If you first need to pull the card-shaped values out of a noisy paste, start with the Credit Card Number Extractor and feed its output here. For redaction work on other token types, the same browser-local approach is available in tools like the Postal Code Extractor, which helps when an export mixes addresses into the same blob.

A practical sequence looks like this: extract the candidates, deduplicate them against their bare-digit keys, keep the invalid rows for review, then export CSV with line numbers as your hand-off artifact. Each step stays in the tab, nothing leaves the browser, and the version you hand a teammate is one clean copy per value with the evidence to back it up.
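
The first step of that sequence, pulling candidates out of noisy text while remembering their line numbers, can be approximated in a few lines of Python. This is a rough sketch of the idea and not how the Credit Card Number Extractor works internally; the pattern, the 13 to 19 digit range, and the function name are all assumptions.

```python
import re

# 13 to 19 digits, each optionally preceded by one space or dash,
# and not glued to a longer run of digits on either side.
CANDIDATE = re.compile(r"(?<!\d)\d(?:[ -]?\d){12,18}(?!\d)")

def extract_candidates(text: str) -> list[tuple[int, str]]:
    """Return (line_number, raw_match) pairs for card-shaped strings."""
    found = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for match in CANDIDATE.finditer(line):
            found.append((line_number, match.group()))
    return found

note = (
    "ticket 881: customer typed 4111-1111-1111-1111 twice\n"
    "no card here, order 12345\n"
    "retry with 5500 0000 0000 0004 please"
)

for line_number, raw in extract_candidates(note):
    print(line_number, raw)
```

The raw matches keep their original separators, so they still need the digits-only pass before they are compared.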

That is the difference between a list that looks deduplicated and one that is: the second one compared the numbers, not the spaces between them.


Made by Toolora · Updated 2026-06-13