How to Deduplicate MAC Addresses When Notation and Case Disagree

The same NIC shows up as colon, hyphen, and Cisco-dot notation in different exports. Here is how to deduplicate MAC addresses so one card counts once.

Published By Li Lei
#mac-address #deduplication #networking #text-cleanup

The first time I merged two switch port tables, my "device count" was wrong by about a third. The cause was not phantom hardware. It was three ways of writing the same six bytes. One export used colons, another used hyphens, and a vendor dump used the Cisco dotted form. A plain text dedup treats those strings as different rows, so a single network card survives as several lines. If you have ever stacked an ARP table on top of a DHCP lease list and watched the totals refuse to reconcile, this is usually why.

The same card, written three ways

Take one network interface. Here are three notations a real toolchain will hand you:

  • 00:1A:2B:3C:4D:5E, colon-separated, the EUI-48 notation most Linux tools print
  • 00-1A-2B-3C-4D-5E, hyphen-separated, what Windows getmac and many CSV exports use
  • 001a.2b3c.4d5e, the Cisco dotted form, lowercase, three groups of four hex digits (two bytes each)

All three are the same NIC: 00 1A 2B 3C 4D 5E. A naive deduplicator compares raw strings, so it keeps all three and reports the card three times. Case makes it worse. 00:1A:2B:3C:4D:5E and 00:1a:2b:3c:4d:5e are byte-for-byte identical hardware, but a case-sensitive string compare sees two values. Stack four exports from four teams and one card can appear five or six times before you have done anything wrong.

Meaningful MAC deduplication has to fold notation and case before it compares anything. The comparison key cannot be the text you pasted. It has to be a canonical form of the bytes.
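To make the difference concrete, here is a minimal Python sketch that compares the same four strings twice: once as raw text, once by the bytes they encode. The helper name mac_key is mine, chosen for illustration, and the sketch assumes one address per entry.

```python
import re

def mac_key(text: str) -> bytes:
    """Return the six raw bytes behind a MAC string, ignoring separators and case."""
    hex_digits = re.sub(r"[^0-9A-Fa-f]", "", text)
    if len(hex_digits) != 12:
        raise ValueError(f"not a 48-bit MAC: {text!r}")
    return bytes.fromhex(hex_digits)  # fromhex accepts upper and lower case

variants = [
    "00:1A:2B:3C:4D:5E",
    "00-1A-2B-3C-4D-5E",
    "001a.2b3c.4d5e",
    "00:1a:2b:3c:4d:5e",
]

naive_count = len(set(variants))                    # compares raw text
folded_count = len({mac_key(v) for v in variants})  # compares bytes

print(naive_count, folded_count)  # prints: 4 1
```

The raw-text set keeps four entries; the byte-keyed set keeps one.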

What this tool actually folds

I checked the parser in the MAC Address Deduplicator rather than trust the marketing line, because "deduplicate" means very different things across tools. Here is the honest behavior.

The normalizer strips every character that is not a hex digit, lowercases the result, and regroups the twelve digits into colon-separated byte pairs. So 00:1A:2B:3C:4D:5E, 00-1A-2B-3C-4D-5E, and 001a.2b3c.4d5e all reduce to 00:1a:2b:3c:4d:5e. The dedup key is that normalized value, lowercased by default. The practical result: colon versus hyphen does not matter, and uppercase versus lowercase does not matter. Those five or six copies collapse to one row, and the tool keeps the first occurrence with a count and the source line number so you can still explain where each duplicate came from.
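The behavior described above can be approximated in a few lines of Python. This is a sketch of the same idea, not the tool's actual code: the names normalize_mac, Row, and dedupe are hypothetical, and validation of malformed input is deliberately left out here.

```python
import re
from dataclasses import dataclass, field

def normalize_mac(text: str) -> str:
    """Strip non-hex characters, lowercase, and regroup as colon-separated pairs."""
    digits = re.sub(r"[^0-9A-Fa-f]", "", text).lower()
    return ":".join(digits[i:i + 2] for i in range(0, len(digits), 2))

@dataclass
class Row:
    mac: str          # canonical form, used as the dedup key
    first_line: int   # 1-based line number of the first occurrence
    count: int = 0
    lines: list = field(default_factory=list)  # every source line that folded here

def dedupe(lines):
    rows = {}  # dicts keep insertion order, so first occurrences stay in input order
    for number, raw in enumerate(lines, start=1):
        if not raw.strip():
            continue  # skip blank lines
        key = normalize_mac(raw)
        row = rows.setdefault(key, Row(mac=key, first_line=number))
        row.count += 1
        row.lines.append(number)
    return list(rows.values())

result = dedupe(["00:1A:2B:3C:4D:5E", "00-1A-2B-3C-4D-5E", "001a.2b3c.4d5e"])
for row in result:
    print(row.mac, row.count, row.lines)  # prints: 00:1a:2b:3c:4d:5e 3 [1, 2, 3]
```

Keeping the list of source lines alongside the count is what lets you trace each duplicate back to the export it came from.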

There is one real limit, and I would rather state it than let you discover it on a Friday. The extractor that pulls MAC addresses out of pasted free text matches the colon and hyphen forms only. The Cisco dotted form 001a.2b3c.4d5e, sitting on its own in a wall of log text, is not recognized as a MAC token by the extractor, so it will not be pulled out in the first place. The normalizer can fold a dotted value if that value reaches it as an isolated line, but the free-text scanner will not find a bare dotted address mixed into prose. If your source is Cisco-heavy, paste the dotted addresses one per line, or run them through the MAC Address Normalizer first so every value arrives in canonical colon form, then deduplicate the clean list.
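I have not seen the extractor's actual pattern, so the following Python sketch only reproduces the behavior described above. Both regular expressions are my own assumptions: the first matches colon and hyphen forms, the second is the kind of pre-pass you could run yourself to catch dotted addresses before pasting.

```python
import re

# Hypothetical pattern: six hex pairs joined by one consistent separator, ':' or '-'.
COLON_OR_HYPHEN = re.compile(
    r"\b[0-9A-Fa-f]{2}([:-])(?:[0-9A-Fa-f]{2}\1){4}[0-9A-Fa-f]{2}\b"
)
# Hypothetical pattern for the Cisco dotted form: three groups of four hex digits.
CISCO_DOTTED = re.compile(r"\b[0-9A-Fa-f]{4}\.[0-9A-Fa-f]{4}\.[0-9A-Fa-f]{4}\b")

log = (
    "Gi1/0/3 learned 001a.2b3c.4d5e on vlan 10; "
    "lease for 00:1A:2B:3C:4D:5E renewed; "
    "getmac reports 00-1A-2B-3C-4D-5E"
)

# A colon/hyphen-only scanner finds two tokens and never sees the dotted one.
found = [m.group(0) for m in COLON_OR_HYPHEN.finditer(log)]
# A separate dotted pass recovers what the first scanner skipped.
missed = [m.group(0) for m in CISCO_DOTTED.finditer(log)]

print(found)   # ['00:1A:2B:3C:4D:5E', '00-1A-2B-3C-4D-5E']
print(missed)  # ['001a.2b3c.4d5e']
```

If your logs are Cisco-heavy, a pre-pass like the second pattern gives you the one-per-line list the normalizer can handle.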

A worked example

Here is the kind of input you get after merging a Linux ARP dump, a Windows export, and a vendor sheet:

00:1A:2B:3C:4D:5E
00-1A-2B-3C-4D-5E
00:1a:2b:3c:4d:5e
AA:BB:CC:DD:EE:FF
aa-bb-cc-dd-ee-ff
00:1A:2B:3C:4D

Six lines. To the eye there are clearly two real devices, one truncated entry, and a pile of notation noise. Deduplicating with notation and case folded gives:

00:1a:2b:3c:4d:5e   count 3
aa:bb:cc:dd:ee:ff   count 2

The first three lines fold to one device with a count of 3. Lines four and five fold to a second device with a count of 2. That leaves line six, 00:1A:2B:3C:4D, which has only five byte pairs. It is not a valid 48-bit MAC, and the tool refuses to fold it under a lookalike — there is no honest way to decide which full address a truncated one belongs to. With the include-invalid option on, it is kept aside as a separate row marked with the reason, so it never silently inflates or deflates your device count. Two real devices, one flagged line for review. That is the answer the raw list was hiding.
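The same result can be reproduced with a short Python sketch. The function classify and the wording of the reason string are hypothetical, and the length check is a simplification: it counts hex digits after stripping separators, so it would not reject every malformed string a stricter parser would.

```python
import re

def classify(raw: str):
    """Return (canonical_mac, None) for a valid address or (None, reason) otherwise."""
    digits = re.sub(r"[^0-9A-Fa-f]", "", raw).lower()
    if len(digits) != 12:
        return None, f"expected 12 hex digits, got {len(digits)}"
    return ":".join(digits[i:i + 2] for i in range(0, 12, 2)), None

lines = [
    "00:1A:2B:3C:4D:5E",
    "00-1A-2B-3C-4D-5E",
    "00:1a:2b:3c:4d:5e",
    "AA:BB:CC:DD:EE:FF",
    "aa-bb-cc-dd-ee-ff",
    "00:1A:2B:3C:4D",
]

counts = {}    # canonical MAC -> number of source lines that folded into it
invalid = []   # (line number, raw text, reason), kept aside and never merged

for number, raw in enumerate(lines, start=1):
    mac, reason = classify(raw)
    if mac is None:
        invalid.append((number, raw, reason))
    else:
        counts[mac] = counts.get(mac, 0) + 1

print(counts)   # {'00:1a:2b:3c:4d:5e': 3, 'aa:bb:cc:dd:ee:ff': 2}
print(invalid)  # [(6, '00:1A:2B:3C:4D', 'expected 12 hex digits, got 10')]
```

The truncated line never touches the counts dictionary, which is the property that keeps the device total honest.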

Why the invalid row matters

It is tempting to want a deduplicator that "cleans up" everything, including the broken entries. Resist that. A five-pair address like 00:1A:2B:3C:4D could be a copy-paste truncation of any address that starts with those bytes. Folding it under the nearest full match would be a guess dressed up as data, and in a device inventory a wrong guess is worse than a flagged unknown. Keeping it visible with its reason and line number means whoever owns the source data can go back and fix the export, instead of trusting a total that was quietly invented. The same applies to hidden whitespace from copied web tables — normalize before you deduplicate, or a trailing space turns one card into two.

Getting a list you can hand off

Once the duplicates collapse, the point is to produce something a teammate or a script can use without manual cleanup. The output can go out as CSV with line numbers for an audit trail, JSON for an API payload, a SQL IN clause, a TypeScript union, Markdown for a ticket, or plain lines for a config file. I default to CSV with line numbers when I am handing the result to someone else, because the count and source-line columns let them verify my merge instead of taking it on faith. For a fleet-wide job, the manifest's own advice is sound: split by switch, deduplicate each table locally, then deduplicate the combined list. A few megabytes of stacked tables runs comfortably in the browser, and nothing leaves the tab.
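If you need to produce a few of these formats yourself, a rough Python sketch looks like this. The column names mac, count, and first_line are assumptions for illustration; the tool's real export headers may differ.

```python
import csv
import io
import json

# Hypothetical deduplicated rows: (canonical MAC, count, first source line).
rows = [
    ("00:1a:2b:3c:4d:5e", 3, 1),
    ("aa:bb:cc:dd:ee:ff", 2, 4),
]

def to_csv(rows) -> str:
    """CSV with count and source-line columns, for an audit trail."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["mac", "count", "first_line"])
    writer.writerows(rows)
    return buffer.getvalue()

def to_json(rows) -> str:
    """JSON array of objects, suitable as an API payload."""
    return json.dumps(
        [{"mac": m, "count": c, "first_line": n} for m, c, n in rows], indent=2
    )

def to_sql_in(rows) -> str:
    """SQL IN clause. Safe only because the values are already canonical
    hex-and-colon strings; use bound parameters for unvalidated input."""
    return "IN (" + ", ".join(f"'{m}'" for m, _, _ in rows) + ")"

print(to_csv(rows))
print(to_sql_in(rows))  # IN ('00:1a:2b:3c:4d:5e', 'aa:bb:cc:dd:ee:ff')
```

The CSV variant carries the count and first-line columns, so the person receiving it can check the merge against the original exports.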

The takeaway is small but it has bitten me more than once. A MAC address is six bytes, but text is text: three notations times two letter cases already give six string variants of one card, and mixed case or stray whitespace adds more. Fold the notation and the case first, treat the broken entries as questions rather than guesses, and the device count finally tells the truth.


Made by Toolora · Updated 2026-06-13