How to Deduplicate Version Numbers When v1.2.0 and 1.2.0 Are the Same Release
A plain text dedup keeps v1.2.0 and 1.2.0 as two rows. See how to dedupe semver lists so the v-prefix folds and partial versions get flagged instead.
The first time I merged two git tag dumps from different machines, my "unique" list still had 41 entries when I knew there were 38 releases. The culprit was boring: one box wrote tags as v1.2.0, the other wrote 1.2.0, and a teammate had pasted 1.2 into a changelog. A plain text dedup looked at those three strings, saw three different sequences of characters, and kept all three. They are the same release — or close enough that nobody should ship a list that pretends otherwise.
That gap between "same characters" and "same version" is the whole problem with deduplicating semantic versions. This post walks through why it happens, what a meaningful semver dedup actually has to do first, and exactly how the Semantic Version Deduplicator handles each case — including the one it deliberately does not fold.
Why a plain dedup keeps duplicates
Deduplication by string equality is the default everywhere: sort -u, a spreadsheet's "remove duplicates," a Set in your favorite language. Every one of them compares the raw bytes. That works perfectly for a column of email addresses and falls apart for versions, because versions have more than one valid spelling for the same value.
Consider this list pasted from three sources:
v1.2.0
1.2.0
1.2
2.0.0-beta.1
2.0.0-beta.1
V1.2.0
A byte-for-byte dedup collapses only the two identical 2.0.0-beta.1 lines. You are left with five rows. But v1.2.0, 1.2.0, and V1.2.0 are the same release written three ways, and 1.2 is a human shorthand for it. A reviewer reading the deduped output would reasonably assume those are four distinct releases. They are not.
The fix is to normalize before comparing. For semver that means deciding, up front, what counts as the canonical form, transforming every input into that form, and only then asking whether two values are equal.
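To make the gap concrete, here is a minimal Python sketch of a byte-for-byte dedup over the six-line list above. The variable names are illustrative only; this is the naive approach, not how the tool works.

```python
# Order-preserving dedup by raw string equality: the same comparison
# that sort -u, a spreadsheet, or a Set performs.
pasted = ["v1.2.0", "1.2.0", "1.2", "2.0.0-beta.1", "2.0.0-beta.1", "V1.2.0"]

# dict.fromkeys keeps the first occurrence of each distinct string.
plain_unique = list(dict.fromkeys(pasted))

print(plain_unique)       # ['v1.2.0', '1.2.0', '1.2', '2.0.0-beta.1', 'V1.2.0']
print(len(plain_unique))  # 5
```

Only the two identical prerelease lines collapse; every spelling of 1.2.0 survives as its own entry.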
The v-prefix is the most common trap
The leading v is cosmetic. The semver specification does not include it; it is a convention from Git tagging (git tag v1.2.0) and a handful of registries. So v1.2.0 and 1.2.0 point at the exact same release. Any dedup worth running has to strip that v before it compares anything.
This is precisely what the Semantic Version Deduplicator does. Internally it derives a comparison key for each value by stripping a leading v or V, case-insensitively, and matching on the result. So v1.2.0, V1.2.0, and 1.2.0 all reduce to the key 1.2.0 and collapse into a single canonical row — the manifest is explicit about this, and the FAQ states it plainly: "1.2.0 and v1.2.0 count as one."
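The tool's source is not shown here, so the following is only a sketch of the behavior described: derive a key by dropping one leading v or V, then compare keys. The function name comparison_key is my own, not the tool's.

```python
def comparison_key(value: str) -> str:
    """Drop a single leading 'v' or 'V'; leave everything else untouched."""
    if value[:1] in ("v", "V"):
        return value[1:]
    return value


# All three spellings reduce to the same key, so they count as one release.
keys = {comparison_key(v) for v in ["v1.2.0", "V1.2.0", "1.2.0"]}
print(keys)  # {'1.2.0'}
```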
What it does not do is fold prerelease casing or build metadata. The spec compares prerelease identifiers case-sensitively, so 2.0.0-beta.1 and 2.0.0-Beta.1 are different versions, and the tool keeps them distinct unless you turn off case sensitivity. That is the honest, spec-faithful choice, not a shortcut.
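A sketch of how a case-sensitivity switch could interact with the key, under assumptions: the parameter name case_sensitive and the use of lowercasing to fold case are hypothetical, chosen only to illustrate the behavior described.

```python
def version_key(value: str, case_sensitive: bool = True) -> str:
    """Strip a leading v/V; optionally fold case for comparison (assumed)."""
    text = value[1:] if value[:1] in ("v", "V") else value
    return text if case_sensitive else text.lower()


# Default: prerelease casing is significant, so these stay distinct.
print(version_key("2.0.0-beta.1") == version_key("2.0.0-Beta.1"))  # False

# With case sensitivity turned off, they fold together.
print(version_key("2.0.0-beta.1", case_sensitive=False)
      == version_key("2.0.0-Beta.1", case_sensitive=False))         # True
```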
What about 1.2 versus 1.2.0?
Here is where I want to be exact rather than reassuring. Semver is strict: a version is MAJOR.MINOR.PATCH. A two-part 1.2 is not a valid semantic version — it is missing the patch segment. A "pad the partial version" dedup would rewrite 1.2 to 1.2.0 and fold it into the same release. Some normalizers do exactly that.
This tool does not pad. Its validator requires all three segments, so 1.2 fails validation and is reported as an invalid row rather than being silently rewritten and merged. So to answer the question directly: 1.2 and 1.2.0 do not fold here. v1.2.0 and 1.2.0 fold; 1.2 and 1.2.0 do not.
I think that is the right call, and the reason is the same reason you keep invalid rows visible at all: 1.2 in your tag list is usually a mistake your release script will choke on. Quietly upgrading it to 1.2.0 hides the bug. Flagging it tells you which tag a human typed wrong. If you genuinely want partial versions normalized to three parts before deduping, run them through a semver normalizer first, then bring the cleaned list back here.
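A three-segment check can be sketched with a regular expression. This pattern is a simplified approximation I wrote for illustration, not the tool's validator and not the full grammar from the semver specification (for instance, it does not reject leading zeros in numeric prerelease identifiers).

```python
import re

# MAJOR.MINOR.PATCH, each without leading zeros, plus optional
# -prerelease and +build parts made of dot-separated identifiers.
SEMVER_SHAPE = re.compile(
    r"(0|[1-9][0-9]*)\.(0|[1-9][0-9]*)\.(0|[1-9][0-9]*)"
    r"(-[0-9A-Za-z-]+(\.[0-9A-Za-z-]+)*)?"
    r"(\+[0-9A-Za-z-]+(\.[0-9A-Za-z-]+)*)?"
)


def is_valid_semver(text: str) -> bool:
    """True only when the whole string has all three numeric segments."""
    return SEMVER_SHAPE.fullmatch(text) is not None


print(is_valid_semver("1.2.0"))         # True
print(is_valid_semver("1.2"))           # False: patch segment missing
print(is_valid_semver("2.0.0-beta.1"))  # True
```

Because the check rejects 1.2 instead of padding it, the two-part string never gets a chance to merge with 1.2.0.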
A worked example
Paste the six-line list above and keep the defaults (dedupe on, include invalid rows on). The CSV output looks like this:
value,normalized,line,count,valid,reason
v1.2.0,1.2.0,1,3,valid,
2.0.0-beta.1,2.0.0-beta.1,4,2,valid,
1.2,1.2,3,1,invalid,Semantic version should be MAJOR.MINOR.PATCH with optional prerelease/build metadata.
Read that closely. The first row carries count=3: it absorbed v1.2.0, 1.2.0, and V1.2.0, kept the first occurrence (v1.2.0 on line 1) in the value column as evidence, and shows the folded form 1.2.0 in the normalized column. The 2.0.0-beta.1 row carries count=2 for the two identical prerelease lines. And 1.2 survives as its own invalid row with the validation reason attached, so nobody mistakes it for another release or a duplicate of the first.
Six pasted lines became three rows, and the one that needs a human decision is the one labeled invalid. That is the difference between a dedup that lies and a dedup you can hand to a teammate.
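The whole pass can be sketched in a few lines of Python. This reproduces the CSV shown above under my own assumptions: the regex is a simplified semver shape, the function names are hypothetical, and listing valid rows before invalid ones is inferred from the sample output rather than documented behavior.

```python
import csv
import io
import re

SEMVER = re.compile(
    r"(0|[1-9][0-9]*)\.(0|[1-9][0-9]*)\.(0|[1-9][0-9]*)"
    r"(-[0-9A-Za-z-]+(\.[0-9A-Za-z-]+)*)?"
    r"(\+[0-9A-Za-z-]+(\.[0-9A-Za-z-]+)*)?"
)
REASON = ("Semantic version should be MAJOR.MINOR.PATCH "
          "with optional prerelease/build metadata.")
FIELDS = ["value", "normalized", "line", "count", "valid", "reason"]


def dedupe_versions(lines):
    """Fold v-prefixed duplicates, count them, and flag invalid rows."""
    rows = {}  # comparison key -> row; dicts preserve insertion order
    for line_no, raw in enumerate(lines, start=1):
        key = raw[1:] if raw[:1] in ("v", "V") else raw
        if key in rows:
            rows[key]["count"] += 1  # duplicate: keep the first occurrence
            continue
        ok = SEMVER.fullmatch(key) is not None
        rows[key] = {
            "value": raw, "normalized": key, "line": line_no, "count": 1,
            "valid": "valid" if ok else "invalid",
            "reason": "" if ok else REASON,
        }
    # Stable sort: valid rows first, invalid rows after (assumed ordering).
    return sorted(rows.values(), key=lambda row: row["valid"] != "valid")


def to_csv(rows):
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


pasted = ["v1.2.0", "1.2.0", "1.2", "2.0.0-beta.1", "2.0.0-beta.1", "V1.2.0"]
print(to_csv(dedupe_versions(pasted)))
```

Keeping the first raw value and its line number next to the folded key is what lets the output double as evidence, not just a result.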
Picking the right output and keeping the audit trail
Once the list is clean you rarely want it as prose. The tool switches the same deduped result between plain lines, CSV, JSON, Markdown, a SQL IN (...) clause, and a TypeScript union type, so the canonical versions drop straight into a query, a migration, or a type definition without you hand-adding quotes and commas.
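As a rough illustration of what two of those targets involve, here is a sketch that renders a deduped list as a SQL IN clause and a TypeScript union type. The exact text the tool emits is not shown in this post, so the formatting, the function names, and the type name Version are all assumptions.

```python
import json


def sql_in_clause(versions):
    """Single-quote each value, doubling any embedded quote."""
    quoted = ", ".join("'" + v.replace("'", "''") + "'" for v in versions)
    return f"IN ({quoted})"


def ts_union(versions, name="Version"):
    """Render string-literal members joined with |."""
    members = " | ".join(json.dumps(v) for v in versions)
    return f"type {name} = {members};"


canonical = ["1.2.0", "2.0.0-beta.1"]
print(sql_in_clause(canonical))  # IN ('1.2.0', '2.0.0-beta.1')
print(ts_union(canonical))       # type Version = "1.2.0" | "2.0.0-beta.1";
```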
Two habits keep the result trustworthy:
- Download CSV or Markdown, not just the final lines. The line-numbered, count-bearing export is your audit trail. When someone asks "where did 2.0.0-beta.1 come from twice," the line and count columns answer it. Copying only the bare list throws that evidence away.
- Normalize messy paste first. Versions copied from a web page or a rendered changelog often carry hidden whitespace or stray markup. Strip that before deduping; a text file cleaner handles the whitespace and invisible characters that otherwise turn one release into two near-identical keys.
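If you would rather clean the paste yourself, a small pre-pass covers the usual offenders. This is a sketch under assumptions: the set of invisible characters is my own pick, not what any particular cleaner removes, and it does not attempt to strip markup.

```python
import unicodedata

# Zero-width space, non-joiner, joiner, word joiner, and BOM,
# each mapped to None so str.translate deletes them.
INVISIBLE = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u2060\ufeff"))


def clean_line(raw: str) -> str:
    """Normalize compatibility forms, drop invisibles, trim whitespace."""
    text = unicodedata.normalize("NFKC", raw)  # e.g. no-break space -> space
    return text.translate(INVISIBLE).strip()


messy = ["\u200bv1.2.0", "v1.2.0\u00a0", "  v1.2.0\t"]
print({clean_line(item) for item in messy})  # {'v1.2.0'}
```

Without the pre-pass, those three inputs would produce three different keys for one tag.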
Everything runs in the browser. Pasted text stays on the page and uploaded local files are read with the File API; neither is sent to a server, which matters when your tag dump contains internal product names or a release scheme you have not announced yet.
The takeaway is small but it saves you from a wrong list: a meaningful semver dedup strips the v before it compares, so v1.2.0 and 1.2.0 become one — but it refuses to pad 1.2 into 1.2.0, because that two-part string is a typo you should see, not a duplicate you should hide.
Made by Toolora · Updated 2026-06-13