Deduplicate Dates the Right Way: Build a Clean Unique Dates List

A plain dedup keeps 2024-03-04 and 2024-3-4 as two rows even though they're the same day. Here's how to fold format variants into one canonical date.

Published By Li Lei
#dates #data-cleaning #iso-8601 #deduplication

Run a plain dedup over a column of dates and you'll often get a "unique" list that still has the same calendar day in it three times. The string 2024-03-04 and the string 2024-3-4 describe the exact same Monday, but to a text deduplicator they are two different sequences of characters, so both survive. Add 03/04/2024 from a third export and now one day occupies three rows. Multiply that across a few hundred records and your "cleaned" list is quietly wrong.

This is the gap between deduplicating text and deduplicating dates. Text dedup compares strings. Date dedup has to compare the calendar day the string points to. The only reliable way to do the second is to normalize every value to one canonical form before you compare, then keep one row per canonical value. That is exactly the job of the ISO Date Deduplicator.

Why format variants survive a plain dedup

A string deduplicator answers a narrow question: "have I seen this exact sequence of characters before?" It has no model of what a date is. So these four values are four distinct strings:

  • 2024-03-04
  • 2024-3-4
  • 2024-03-04T00:00:00Z
  • 2024-03-04T00:00:00+00:00

A human reads all four as the fourth of March, 2024. A Set of strings reads four members. The difference matters because date data is born in different formats. One system zero-pads the month, another doesn't. One writes the UTC offset as Z, another spells it +00:00. A spreadsheet that someone opened and re-saved may rewrite the whole column to its locale default. When you merge those exports, every formatting decision becomes a phantom duplicate.
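To make the point concrete, here is a tiny Python sketch (Python is just the illustration language here, not what the tool runs on). It puts the four strings above into a plain set and counts the members:

```python
# The four spellings from the list above, compared purely as text.
values = [
    "2024-03-04",
    "2024-3-4",
    "2024-03-04T00:00:00Z",
    "2024-03-04T00:00:00+00:00",
]

# A set of strings only asks "same characters?", so nothing collapses.
unique_strings = set(values)
print(len(unique_strings))  # 4
```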

The fix is not to write a clever regex that matches "dates that look similar." Similarity is a trap — 2024-03-04 and 2024-04-03 look similar and are a month apart. The fix is to parse each value into the instant it actually represents, then compare instants.

Normalize first, then compare

Here is the rule that makes date dedup correct: convert every value to a single canonical form before deduplicating, then deduplicate the canonical forms. ISO 8601 is the natural canonical form because it is unambiguous — year, then month, then day, fixed width, no locale guessing.
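As a rough illustration of "normalize first, then compare", the sketch below parses each value into a UTC instant and compares the instants. The function name to_instant and the pattern are hypothetical, and two choices are assumptions rather than anything the tool documents: non-padded months and days are tolerated, and a value with no time or offset is read as midnight UTC.

```python
import re
from datetime import datetime, timezone

# Hypothetical pattern: date, optional time, optional offset after the time.
# Fractional seconds are deliberately left out to keep the sketch short.
_ISO_LIKE = re.compile(
    r"^(\d{4})-(\d{1,2})-(\d{1,2})"
    r"(?:T(\d{2}):(\d{2}):(\d{2})(Z|[+-]\d{2}:\d{2})?)?$"
)

def to_instant(text: str) -> datetime:
    """Resolve an ISO-like string to an aware UTC datetime, or raise ValueError."""
    m = _ISO_LIKE.match(text.strip())
    if m is None:
        raise ValueError(f"not an ISO-like date: {text!r}")
    year, month, day = int(m[1]), int(m[2]), int(m[3])
    hh, mm, ss = m[4] or "00", m[5] or "00", m[6] or "00"
    # ASSUMPTION: a missing offset means UTC; "Z" is spelled out as +00:00.
    offset = "+00:00" if m[7] in (None, "Z") else m[7]
    padded = f"{year:04d}-{month:02d}-{day:02d}T{hh}:{mm}:{ss}{offset}"
    # fromisoformat rejects impossible dates such as month 13 with ValueError.
    return datetime.fromisoformat(padded).astimezone(timezone.utc)

values = [
    "2024-03-04",
    "2024-3-4",
    "2024-03-04T00:00:00Z",
    "2024-03-04T00:00:00+00:00",
]
instants = {to_instant(v) for v in values}
print(len(instants))  # 1
```

Comparing instants also avoids the similarity trap: to_instant("2024-03-04") and to_instant("2024-04-03") are simply unequal, however alike the strings look.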

I checked the ISO Date Deduplicator against its own manifest before writing this, because "does it normalize before deduping?" is the question that decides whether a date deduplicator is honest or decorative. Its parser reads each ISO 8601 timestamp, resolves it to the instant it points to, and folds values that resolve to the same instant. The manifest spells out the canonical case directly: 2024-03-01T12:00:00Z and 2024-03-01T12:00:00+00:00 are two different strings that name the same moment, so they collapse into one row, and the first occurrence is the one kept. That ordering — parse, resolve, then fold — is what separates a real date dedup from a string Set.

One honest caveat about scope, because the tool's name is precise on purpose. This is an ISO date deduplicator: it parses ISO 8601. A value like 03/04/2024 is not ISO 8601, and it is genuinely ambiguous — is it March 4th or April 3rd? Rather than guess, the tool treats unparseable values as invalid and surfaces them for review instead of silently folding them into a row they might not belong to. That is the safer default. If your source mixes slash formats and ISO, push the slash dates through the ISO Date Normalizer first to get everything into ISO 8601, then run the dedup. Normalize, then deduplicate — in that order.
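The ambiguity is easy to demonstrate. The sketch below is not how the ISO Date Normalizer works internally; slash_to_iso is a hypothetical helper that shows why a slash date can only be converted once someone declares which convention the source used.

```python
from datetime import datetime

def slash_to_iso(text: str, day_first: bool) -> str:
    """Convert a slash date to ISO 8601, given an explicitly declared field order."""
    fmt = "%d/%m/%Y" if day_first else "%m/%d/%Y"
    # strptime raises ValueError when the value does not fit the declared order.
    return datetime.strptime(text.strip(), fmt).date().isoformat()

raw = "03/04/2024"
month_first = slash_to_iso(raw, day_first=False)  # "2024-03-04"
day_first = slash_to_iso(raw, day_first=True)     # "2024-04-03"
print(month_first, day_first)
```

Same string, two valid calendar days a month apart. That is why guessing inside the dedup step would be the wrong place to resolve it.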

A worked example

Take a date column pulled from three merged exports. Paste it in:

2024-03-01T12:00:00Z
2024-03-04
2024-03-01T12:00:00+00:00
2024-03-04T00:00:00Z
2026-13-40
2024-03-04
2024-03-09

Seven input lines. A plain string dedup would return six "unique" rows, because only the literal repeat of 2024-03-04 gets caught. But two of these lines are the same instant on March 1st (Z and +00:00 are the same offset), and three are midnight on March 4th once you account for the date-only lines resolving to T00:00:00Z. After parsing and folding by instant, the unique list is:

2024-03-01T12:00:00Z   (count 2, first seen line 1)
2024-03-04T00:00:00Z   (count 3, first seen line 2)
2024-03-09T00:00:00Z   (count 1, first seen line 7)
2026-13-40             invalid — month 13, day 40 do not exist

Three real calendar instants, each appearing once, plus the junk row flagged rather than discarded. The duplicate counts tell you how many source lines collapsed into each row, and the first-seen line number tells you where the survivor came from — so a teammate can trace it back to the original export. The invalid row stays visible because hiding it would hide a data problem; 2026-13-40 is not a duplicate of anything, it's a broken value that needs fixing at the source.
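For readers who want to reproduce the fold by hand, here is a minimal Python sketch of the same idea. It is not the tool's implementation: parse_strict and fold_by_instant are hypothetical names, and the sketch assumes that date-only and offset-less values resolve to midnight UTC. It keeps the first occurrence, a duplicate count, and the first source line, and sets invalid lines aside instead of dropping them.

```python
from datetime import datetime, timezone

def parse_strict(text: str) -> datetime:
    """Parse a zero-padded ISO 8601 value to UTC; raises ValueError if invalid."""
    dt = datetime.fromisoformat(text.strip().replace("Z", "+00:00"))
    if dt.tzinfo is None:
        # ASSUMPTION: no offset (including date-only input) means UTC.
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc)

def fold_by_instant(lines):
    """Return (rows, invalid); rows keep first-seen order with counts."""
    rows = {}     # instant -> row; dicts preserve insertion order
    invalid = []  # (line number, original text) for values that fail to parse
    for line_no, text in enumerate(lines, start=1):
        try:
            instant = parse_strict(text)
        except ValueError:
            invalid.append((line_no, text))
            continue
        row = rows.setdefault(
            instant, {"value": text, "count": 0, "first_line": line_no}
        )
        row["count"] += 1
    return list(rows.values()), invalid

pasted = [
    "2024-03-01T12:00:00Z",
    "2024-03-04",
    "2024-03-01T12:00:00+00:00",
    "2024-03-04T00:00:00Z",
    "2026-13-40",
    "2024-03-04",
    "2024-03-09",
]
rows, invalid = fold_by_instant(pasted)
for row in rows:
    print(row["value"], row["count"], row["first_line"])
print("invalid:", invalid)
```

Keying the dictionary on the parsed instant rather than the text is the whole trick; the original string is carried along only as evidence.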

Keep the evidence, then export

A unique list is more useful when it carries proof of how it was built. The deduplicator keeps a duplicate count and the first source line for every surviving row, so the output is not just "here are the unique dates" but "here are the unique dates, here's how many copies each had, and here's where to find the original." That audit trail is the difference between a list you can defend and a list you have to re-derive when someone questions it.

You can keep unique rows only, or preserve the invalid rows for repair work. You can sort the normalized output so the list is scannable. And you can export the exact artifact the next step needs — plain lines for a quick paste, CSV or Markdown for a hand-off with line numbers, or JSON, a SQL IN clause, or a TypeScript union when the destination is code. No hand-adding quotes and commas; the tool emits the punctuation for you.
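To show what "emits the punctuation for you" saves, here is a short Python sketch of three of those export shapes. The function names and the exact output layout are assumptions for illustration; the tool's own output may be formatted differently.

```python
import json

def to_json(values):
    """JSON array of strings."""
    return json.dumps(values, indent=2)

def to_sql_in(values):
    """SQL IN clause; single quotes are doubled as a basic escaping step."""
    quoted = ", ".join("'" + v.replace("'", "''") + "'" for v in values)
    return f"IN ({quoted})"

def to_ts_union(name, values):
    """TypeScript string-literal union; json.dumps supplies the double quotes."""
    members = " | ".join(json.dumps(v) for v in values)
    return f"type {name} = {members};"

unique_dates = ["2024-03-01T12:00:00Z", "2024-03-09"]
print(to_json(unique_dates))
print(to_sql_in(unique_dates))
print(to_ts_union("UniqueDate", unique_dates))
```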

Everything runs in the browser. Dates are pasted text or read from a local file with the File API, and nothing is sent to a server — which matters when the column is customer timestamps or internal event logs.

When to reach for it

Reach for date dedup whenever you're merging exports and need each calendar day to appear once: consolidating sign-up dates from two CRMs, collapsing a log timeline into distinct events, or cleaning a date column before a script imports it. The trap to avoid is trusting a plain dedup on a date field — it will look like it worked, the row count will drop, and you'll ship a list that still double-counts March 4th under two spellings.

The discipline is simple and it's the whole point: don't compare the text, compare the day. Normalize to one canonical ISO form, then deduplicate. The ISO Date Deduplicator does that fold for you and hands back a clean, traceable list of genuinely unique dates.


Made by Toolora · Updated 2026-06-13