How to Deduplicate IP Addresses and Get a Clean List of Unique IPs
Leading-zero IPv4 variants like 010.0.0.1 survive a plain dedup and inflate your unique-IP count. Here is how to normalize first, then collapse an access log to one row per host.
If you have ever pulled a day of traffic out of an nginx access log and tried to answer the simple question "how many distinct clients hit us?", you have probably been bitten by a quiet bug: the same host shows up twice. Not because two machines share an address, but because the same address was written two different ways. A plain text dedup treats 010.0.0.1 and 10.0.0.1 as two separate strings, keeps both, and your unique-visitor count is suddenly one too high. Multiply that across a noisy log and the error stops being cute.
This post walks through why IPv4 deduplication needs to normalize before it compares, what that means for cleaning an allowlist or an access log, and how the IPv4 Address Deduplicator handles the edge cases that trip up a one-liner.
Why a Plain Dedup Keeps Both 010.0.0.1 and 10.0.0.1
An IPv4 address is a 32-bit number. The dotted-quad text you read in a log is just one of several ways to spell that number. 10.0.0.1, 010.0.0.1, and 010.000.000.001 all describe the exact same host: the leading zeros are cosmetic padding, the kind you get when a logging library zero-pads octets to a fixed width or when someone pastes from a tool that aligns columns.
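To make the "same number, different spelling" point concrete, here is a minimal Python sketch. The helper name `ipv4_to_int` is made up for illustration, and it follows this article's reading of a leading zero as decimal padding. That reading is an assumption: some C-library parsers such as `inet_aton` read a leading zero as octal, and recent versions of Python's `ipaddress` module reject leading zeros outright, which is why the sketch parses the octets by hand.

```python
def ipv4_to_int(text: str) -> int:
    """Read a dotted-quad as a 32-bit integer, treating each octet as decimal."""
    parts = text.split(".")
    if len(parts) != 4 or not all(p.isascii() and p.isdigit() for p in parts):
        raise ValueError(f"not a dotted-quad IPv4 address: {text!r}")
    octets = [int(p, 10) for p in parts]  # int("010", 10) is 10, not octal 8
    if any(o > 255 for o in octets):
        raise ValueError(f"octet out of range: {text!r}")
    value = 0
    for octet in octets:
        value = (value << 8) | octet  # shift in one byte per octet
    return value


spellings = ["10.0.0.1", "010.0.0.1", "010.000.000.001"]
for s in spellings:
    print(s, "->", ipv4_to_int(s))  # every spelling prints 167772161
```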
A naive dedup — sort -u, a Set in JavaScript, a DISTINCT on a text column — compares the raw strings. To a string comparison, 010.0.0.1 and 10.0.0.1 are no more equal than cat and dog. Both survive. You end up with a list that looks deduplicated but still double-counts every host that appeared in more than one spelling.
The fix is to canonicalize each address before the comparison: strip the leading zeros so every spelling of a host collapses to one form, then dedup on that canonical value. The IPv4 Address Deduplicator does exactly this. Its parser reads each dotted-quad, so an exact repeat and a zero-padded spelling such as 010.000.001.001 (which is 10.0.1.1) both fold to the same address, and the first occurrence is kept. The comparison happens on the real host, not on the text.
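The canonicalize-then-compare idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the tool's actual parser: `canonicalize` and `dedupe` are hypothetical names, leading zeros are treated as decimal padding, and anything that is not four numeric octets in the range 0 to 255 is simply skipped.

```python
from typing import Iterable, List, Optional


def canonicalize(text: str) -> Optional[str]:
    """Return the address with leading zeros stripped, or None if it is not valid."""
    parts = text.strip().split(".")
    if len(parts) != 4 or not all(p.isascii() and p.isdigit() for p in parts):
        return None
    values = [int(p) for p in parts]  # int() drops the zero padding
    if any(v > 255 for v in values):
        return None
    return ".".join(str(v) for v in values)


def dedupe(lines: Iterable[str]) -> List[str]:
    """Keep the first occurrence of each host, compared on the canonical form."""
    seen = set()
    unique = []
    for line in lines:
        canon = canonicalize(line)
        if canon is None or canon in seen:
            continue
        seen.add(canon)
        unique.append(canon)
    return unique


raw = ["10.0.1.1", "010.000.001.001", "10.0.1.1"]
print(len(set(raw)))  # 2: a raw string dedup keeps both spellings
print(dedupe(raw))    # ['10.0.1.1']
```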
A Worked Example: Collapsing a Log to Unique Hosts
Here is a small slice of an access log with the format duplicates a real export tends to carry — leading zeros, repeated hits, and one broken line:
10.0.0.1
010.0.0.1
192.168.1.50
10.0.0.1
010.000.000.001
192.168.001.050
203.0.113.7
10.0.0
A plain sort -u on this gives you seven lines: the truncated 10.0.0 plus six "unique" addresses, because 010.0.0.1 and 010.000.000.001 differ from 10.0.0.1 as text, and 192.168.001.050 differs from 192.168.1.50. That number is wrong.
Run the same input through the deduplicator with unique rows kept, and the four occurrences of 10.0.0.1 (10.0.0.1, 010.0.0.1, 10.0.0.1 again, 010.000.000.001) collapse to a single canonical row. The two spellings of 192.168.1.50 collapse to one. You are left with the real picture:
| host | count | first line |
| --- | --- | --- |
| 10.0.0.1 | 4 | 1 |
| 192.168.1.50 | 2 | 3 |
| 203.0.113.7 | 1 | 7 |
Three unique hosts, not six. The count column tells you how many times each host actually appeared, and the first-line column points you back to the source so you can explain where a duplicate came from. The truncated 10.0.0 does not pretend to be a host — it stays out of the deduped set, and if you turn on the option to keep invalid rows, it shows up flagged for review rather than silently dropped.
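The table above can be reproduced with a short Python sketch that tracks a count and a first line number per canonical host, and sets invalid rows aside instead of dropping them. The function names and the shape of the returned rows are assumptions made for illustration; the tool's internals are not shown in this post.

```python
def canonicalize(text):
    """Strip leading zeros from each octet; return None for anything invalid."""
    parts = text.strip().split(".")
    if len(parts) != 4 or not all(p.isascii() and p.isdigit() for p in parts):
        return None
    values = [int(p) for p in parts]
    if any(v > 255 for v in values):
        return None
    return ".".join(str(v) for v in values)


def summarize(lines):
    """Return (rows, invalid): rows are (host, count, first_line) in first-seen order."""
    hosts = {}    # canonical host -> [count, first line number]
    invalid = []  # (line number, raw text) for rows that failed to parse
    for number, line in enumerate(lines, start=1):
        canon = canonicalize(line)
        if canon is None:
            invalid.append((number, line))
            continue
        entry = hosts.setdefault(canon, [0, number])
        entry[0] += 1
    rows = [(host, count, first) for host, (count, first) in hosts.items()]
    return rows, invalid


log = [
    "10.0.0.1", "010.0.0.1", "192.168.1.50", "10.0.0.1",
    "010.000.000.001", "192.168.001.050", "203.0.113.7", "10.0.0",
]
rows, invalid = summarize(log)
for row in rows:
    print(row)
print("flagged for review:", invalid)
```

The dictionary keeps insertion order, so the rows come out in the order each host was first seen, which is what makes the first-line column useful for tracing a duplicate back to its source.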
Cleaning an Allowlist Down to Unique Entries
The same normalization matters when you are building or auditing a firewall allowlist. An allowlist that contains both 10.0.0.1 and 010.0.0.1 is not just untidy — depending on how the downstream system parses it, the two lines may be treated as distinct rules, or one may be rejected as malformed while the other is accepted. Either way you have drift between what you think the rule set is and what it actually enforces.
Paste the list in, dedup it, and you get one canonical entry per host. From there you can sort the output and export it in the shape the next system wants — the tool can emit plain lines, CSV, JSON, a SQL IN clause, a TypeScript union, or Markdown, so the clean list goes straight into a config file, a migration, or a code constant without you hand-adding quotes and commas.
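As a rough idea of what those export shapes look like, here is a Python sketch that sorts a canonical list numerically and renders three of the formats. The exact output the tool produces is not shown in this post, so the quoting, spacing, and the `AllowedIp` type name below are assumptions. The sketch also assumes the input is already canonical, so each entry contains only digits and dots and needs no escaping.

```python
import json


def sort_hosts(hosts):
    """Sort by numeric octets so 9.9.9.9 comes before 10.0.0.1."""
    return sorted(hosts, key=lambda h: tuple(int(o) for o in h.split(".")))


def to_sql_in(hosts):
    """Render a SQL IN clause with single-quoted literals."""
    return "IN (" + ", ".join(f"'{h}'" for h in hosts) + ")"


def to_ts_union(hosts, name="AllowedIp"):
    """Render a TypeScript string-literal union type."""
    return f"type {name} = " + " | ".join(f'"{h}"' for h in hosts) + ";"


hosts = sort_hosts(["192.168.1.50", "10.0.0.1", "203.0.113.7"])
print(to_sql_in(hosts))    # IN ('10.0.0.1', '192.168.1.50', '203.0.113.7')
print(json.dumps(hosts))   # ["10.0.0.1", "192.168.1.50", "203.0.113.7"]
print(to_ts_union(hosts))
```

Sorting on the octet values rather than the text matters as soon as the list mixes one-digit and two-digit first octets, since a plain string sort would place 10.0.0.1 ahead of 9.9.9.9.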
If you also need to validate the entries — catch an out-of-range 999.1.1.1 or an address with a stray port like 1.2.3.4:80 before they reach production — the companion IPv4 Address List Validator is built for that pass, and the IPv4 Address Normalizer handles the canonicalize-only step when you do not need deduplication at all.
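For a sense of what that validation pass checks, the sketch below classifies an entry and names the reason it fails. The function name and the reason strings are invented for this example and are not the validator's real output.

```python
def check(text):
    """Return 'ok' or a short reason why the entry is not a plain IPv4 address."""
    entry = text.strip()
    host, sep, port = entry.partition(":")
    if sep:
        # Anything after a colon means the entry is more than a bare address.
        return "stray port" if port.isdigit() else "unexpected ':'"
    parts = host.split(".")
    if len(parts) != 4:
        return "wrong octet count"
    if not all(p.isascii() and p.isdigit() for p in parts):
        return "non-numeric octet"
    if any(int(p) > 255 for p in parts):
        return "octet out of range"
    return "ok"


for entry in ["10.0.0.1", "999.1.1.1", "1.2.3.4:80", "10.0.0"]:
    print(entry, "->", check(entry))
```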
What I Hit the First Time I Tried This
The first time I ran into this, I was reconciling two CSV exports of "blocked IPs" from two different appliances, and I genuinely could not figure out why my merged count was higher than either source on its own. I had run sort -u, so as far as I was concerned the list was unique. It took me an embarrassing few minutes of staring at the file to notice that one appliance zero-padded its octets and the other did not, so every address that appeared on both lists was being counted as two. The moment I normalized leading zeros before deduping, the numbers lined up. Now I never trust a unique-IP count that came out of a raw string dedup — I assume there is a spelling collision until the canonicalization step proves otherwise.
A Few Things That Trip People Up
- Format equality is not string equality. 010.0.0.1 and 10.0.0.1 are the same host but different text, so any dedup that compares strings keeps both. Normalize the octets first, then compare.
- Copied text carries hidden whitespace. Pasting from a web page or a rendered table often drags in invisible characters that make two identical-looking addresses compare as different. Clean the text before you dedup it; the Text File Cleaner is good for stripping that noise out of a paste.
- Keep the invalid rows when you need an audit trail. A truncated 10.0.0 or an out-of-range octet is not a duplicate, but it is a signal that your source has problems. Keeping those rows visible beats silently discarding them.
- Validation is not existence. A perfectly formatted address is not proof that a host is live or that an account behind it is real. Dedup and validation clean the list; they do not verify the world.
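The hidden-whitespace point is easy to reproduce. The Python sketch below removes Unicode format characters (category Cf, which covers the zero-width space and the byte-order mark) and trims ordinary and non-breaking spaces from the ends. It is one possible cleanup under those assumptions, not a description of how the Text File Cleaner works, and `clean` is a name chosen for this example.

```python
import unicodedata


def clean(text):
    """Drop invisible format characters, then trim surrounding whitespace."""
    visible = "".join(ch for ch in text if unicodedata.category(ch) != "Cf")
    return visible.strip()  # strip() also removes non-breaking spaces at the ends


pasted = "\u200b10.0.0.1\u00a0"   # zero-width space in front, NBSP behind
print(pasted == "10.0.0.1")        # False: looks identical, compares different
print(clean(pasted) == "10.0.0.1")  # True
```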
Everything runs in the browser. The addresses you paste from a log, and any local text file you load through the File API, are parsed in the tab and never sent to a server — which matters when the log you are cleaning is full of real customer IPs.
Wrapping Up
Deduplicating IPv4 addresses sounds like a solved problem until a leading zero quietly doubles one of your hosts. The reliable approach is the same every time: canonicalize each address so every spelling of a host collapses to one form, then dedup and count on that canonical value. Do that and your unique-IP number actually means "unique hosts" instead of "unique strings." Drop your messy log or allowlist into the IPv4 Address Deduplicator, keep unique rows, and read off the real count.
Made by Toolora · Updated 2026-06-13