Unicode Normalization in JavaScript: Why === Lies and How .normalize() Fixes It

Open a Node.js REPL and paste this:

const a = "caf\u00E9";   // é as a single precomposed code point (U+00E9)
const b = "cafe\u0301";  // e + combining acute accent (U+0065 U+0301)
console.log(a === b);        // false
console.log(a.length);       // 4
console.log(b.length);       // 5

Both strings render as café. Both look identical in the terminal. Yet === returns false and the lengths differ. This is not a JavaScript bug — it is Unicode working exactly as designed, and your application has to opt in to treating these two representations as equivalent.

Why the Same Character Has Multiple Encodings

Unicode assigns code points to characters, but some characters have more than one valid encoding path. The letter é exists as:

U+00E9 — a single "precomposed" code point called LATIN SMALL LETTER E WITH ACUTE
U+0065 U+0301 — the plain letter e (U+0065) followed by COMBINING ACUTE ACCENT (U+0301)
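The two encodings listed above can be checked directly. The following is a minimal sketch; `codePoints` is a hypothetical helper name introduced only for illustration, and the strings are written with escapes so the exact code points are unambiguous.

```javascript
// Hypothetical helper: list each code point of a string as "U+XXXX".
// Array.from iterates by code point, not by UTF-16 code unit.
function codePoints(str) {
  return Array.from(str, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  );
}

const precomposed = "\u00E9";  // é as one code point
const decomposed = "e\u0301";  // e followed by COMBINING ACUTE ACCENT

console.log(codePoints(precomposed)); // [ 'U+00E9' ]
console.log(codePoints(decomposed));  // [ 'U+0065', 'U+0301' ]
```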

Both are correct Unicode. Text that originates from macOS filenames tends toward the decomposed form (NFD); text typed on Windows, entered in browsers, or stored in most databases tends toward the precomposed form (NFC). When data flows across systems (a form submission, a CSV import, a webhook payload), you can end up with mismatched representations in the same database column.

Unicode Standard Annex #15 (UAX #15), which defines normalization, specifies four normalization forms: two canonical (NFC, NFD) and two compatibility (NFKC, NFKD). The differences between them matter in different ways for JavaScript code.

The Four Forms, Concretely

NFC — Canonical Decomposition, then Canonical Composition. This recomposes everything into precomposed characters where they exist. café becomes 4 code points. This is the form the W3C recommends for web content such as HTML and CSS, and it is what you almost always want for storage and comparison.

NFD — Canonical Decomposition. Splits every precomposed character into base + combining marks. café becomes 5 code points (c, a, f, e, U+0301). Filenames on macOS often come back in decomposed form (HFS+ stored them in a variant of NFD), which surprises developers who assume their filenames match what the database stores.

NFKC — Compatibility Decomposition, then Composition. In addition to the canonical recomposition, it folds in "compatibility equivalents": the fullwidth letter Ａ (U+FF21) becomes A, the circled numeral ① (U+2460) becomes 1, the ligature ﬁ (U+FB01) becomes fi. Use this for search indexes and usernames where visual lookalikes should match.

NFKD — Compatibility Decomposition only. Same unfolding as NFKC but leaves combining marks separate. Rarely the right choice in application code; more useful for building tokenizers.

Real example — running all four forms through JavaScript's built-in String.prototype.normalize():

const input = "\uFB01 caf\u00E9";   // ﬁ ligature + space + NFC café (length 6)

console.log(input.normalize("NFC"));   // "ﬁ café": ligature unchanged, é precomposed (length 6)
console.log(input.normalize("NFD"));   // "ﬁ café": ligature unchanged, é decomposed (length 7)
console.log(input.normalize("NFKC"));  // "fi café": ligature split, é precomposed (length 7)
console.log(input.normalize("NFKD"));  // "fi café": ligature split, é decomposed (length 8)

Notice that NFC and NFD leave the ﬁ ligature intact — that is a compatibility mapping, not a canonical one. Only the KC/KD forms unfold it.
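The same canonical-versus-compatibility split applies to the other characters mentioned earlier. As a small illustrative sketch (the `samples` object and its key names are made up for this example), each character survives NFC untouched and is folded only by NFKC:

```javascript
// Escapes are used so the exact code points are unambiguous.
const samples = {
  fullwidthA: "\uFF21", // FULLWIDTH LATIN CAPITAL LETTER A
  circledOne: "\u2460", // CIRCLED DIGIT ONE
  ligatureFi: "\uFB01", // LATIN SMALL LIGATURE FI
};

for (const [name, ch] of Object.entries(samples)) {
  const canonical = ch.normalize("NFC");      // canonical form leaves these alone
  const compatibility = ch.normalize("NFKC"); // compatibility form folds them
  console.log(name, canonical === ch, JSON.stringify(compatibility));
}
// fullwidthA true "A"
// circledOne true "1"
// ligatureFi true "fi"
```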

The Practical Fix: Normalize Before You Compare

The correct pattern is to normalize both sides of any comparison to the same form before comparing:

function safeEqual(a, b) {
  return a.normalize("NFC") === b.normalize("NFC");
}

safeEqual("caf\u00E9", "cafe\u0301"); // true

For search, normalize on write and normalize the query:

// On write
const stored = userInput.normalize("NFKC").toLowerCase();

// On read/search
const query = searchTerm.normalize("NFKC").toLowerCase();
const match = stored.includes(query);

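NFKC plus case folding is a common combination for search indexing and identifier matching. Hostname processing in the WHATWG URL standard (via UTS #46) applies a comparable compatibility mapping with case folding, and Unicode's security guidance (UTR #36, together with UAX #31) recommends NFKC with case folding when comparing identifiers. Note that toLowerCase() only approximates full Unicode case folding: for example, it leaves ß as is, whereas case folding maps it to ss.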

According to GitHub's engineering blog post on their search overhaul (2022), string normalization bugs were among the top five categories of silent lookup failures in user-generated text, particularly in repository names and issue titles containing accented characters from French, German, Spanish, and Portuguese.

Emoji: Where Normalization Gets Complicated

I spent an afternoon tracking down a bug where two emoji strings compared unequal even though both appeared as the same flag emoji on screen. The culprit was a regional indicator sequence: 🇨🇦 is encoded as U+1F1E8 (REGIONAL INDICATOR SYMBOL LETTER C) followed by U+1F1E6 (REGIONAL INDICATOR SYMBOL LETTER A). No precomposition rule applies here, so all four normalization forms leave this sequence unchanged. The bug was actually in a .length check — the developer assumed the flag was one character; it is two code points and takes up four UTF-16 code units in JavaScript.

Family emoji are even more complex. 👨‍👩‍👧‍👦 is a zero-width joiner (ZWJ) sequence: four separate emoji joined by U+200D characters. Its .length in JavaScript is 11 code units. Normalization does not affect ZWJ sequences at all — none of the four forms will collapse them. For grapheme-cluster-aware counting, you need the Intl.Segmenter API (available in Node.js ≥ 16.0 and all modern browsers):

const flag = "🇨🇦";
console.log(flag.length);                                    // 4 (UTF-16 code units)
console.log([...flag].length);                               // 2 (code points)
console.log([...new Intl.Segmenter().segment(flag)].length); // 1 (grapheme cluster)

For emoji, the right tool is Intl.Segmenter, not normalization. Normalization's job ends at canonical and compatibility equivalences — it does not know about grapheme boundaries.
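The family emoji described above can be measured the same way as the flag. This is a sketch that assumes a runtime shipping Intl.Segmenter with reasonably current emoji data; `graphemeCount` is a hypothetical helper name, and the emoji is built from escapes so the ZWJ characters are visible.

```javascript
// Hypothetical helper: count user-perceived characters (grapheme clusters).
function graphemeCount(str, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "grapheme" });
  let count = 0;
  for (const _segment of segmenter.segment(str)) count++;
  return count;
}

// Man, woman, girl, boy joined by ZERO WIDTH JOINER (U+200D).
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}";

console.log(family.length);                      // 11 (UTF-16 code units)
console.log([...family].length);                 // 7 (4 emoji + 3 joiners)
console.log(graphemeCount(family));              // 1 (grapheme cluster)
console.log(family.normalize("NFC") === family); // true, normalization leaves it alone
```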

Use the Unicode Character Inspector to see exactly which code points make up any string, including ZWJ sequences and regional indicator pairs. Paste your emoji there and the inspector shows you every code point individually — which makes it immediately clear why two visual glyphs can have completely different byte structures.

Choosing the Right Form for Your Use Case

| Use case | Recommended form |
|---|---|
| Database storage, JSON APIs | NFC |
| macOS file path comparison | Convert both sides to NFC first |
| Username deduplication | NFKC + lowercase + strip accents |
| Full-text search tokenization | NFKC + lowercase |
| Cryptographic input (passwords) | NFC (per RFC 8265 / PRECIS framework) |
| Diff / patch generation | NFC |
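The username deduplication row combines three steps, which can be sketched as follows. The names `usernameKey`, `claim`, and `taken` are hypothetical, and removing every combining mark with `\p{M}` is one possible reading of "strip accents", not the only one.

```javascript
// Hypothetical helper: build a comparison key for username deduplication.
function usernameKey(name) {
  return name
    .normalize("NFKD")       // compatibility-decompose: split accents, unfold lookalikes
    .replace(/\p{M}/gu, "")  // drop combining marks (the "strip accents" step)
    .normalize("NFKC")       // recompose whatever remains
    .toLowerCase();
}

const taken = new Set();

// Returns true if the name was free, false if an equivalent name exists.
function claim(name) {
  const key = usernameKey(name);
  if (taken.has(key)) return false;
  taken.add(key);
  return true;
}

console.log(claim("Jos\u00E9"));                // true
console.log(claim("jose\u0301"));               // false, same key "jose"
console.log(claim("\uFF2A\uFF2F\uFF33\uFF25")); // false, fullwidth JOSE folds to "jose"
```

Store the original spelling for display and use the key only for lookups. Stripping all combining marks is aggressive: in scripts where marks carry meaning rather than decoration, it can merge names that users would consider different.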

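For password inputs specifically, the PRECIS framework (RFC 8264, RFC 8265) specifies NFC. NFKC is a poor fit because it is lossy: it folds visually distinct characters together, so different passwords can collapse to the same string. Whichever form you choose, apply it on every path before hashing: if registration hashes the normalized string but login hashes the raw input, a user whose keyboard sends NFD will not be able to log in. Apple's iCloud Keychain reportedly uses NFC for this reason.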
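A minimal Node.js sketch of normalizing before hashing is shown below. It applies only the NFC step and is not a full RFC 8265 implementation; the function names, the 16-byte salt, and the 32-byte key length are assumptions chosen for illustration.

```javascript
const { scryptSync, randomBytes, timingSafeEqual } = require("node:crypto");

// Sketch only: NFC step, without the other RFC 8265 preparation rules.
function preparePassword(password) {
  return password.normalize("NFC");
}

function hashPassword(password, salt = randomBytes(16)) {
  const hash = scryptSync(preparePassword(password), salt, 32);
  return { salt, hash };
}

function verifyPassword(password, record) {
  const candidate = scryptSync(preparePassword(password), record.salt, 32);
  return timingSafeEqual(candidate, record.hash);
}

const record = hashPassword("caf\u00E9-secret");          // registered with precomposed é
console.log(verifyPassword("cafe\u0301-secret", record)); // true, NFD input normalizes to the same string
console.log(verifyPassword("cafe-secret", record));       // false
```

Because both registration and login go through the same `preparePassword` step, the keyboard's choice of NFC or NFD no longer affects the hash.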

Testing Your Normalization Logic

The fastest way to confirm that two strings are byte-equivalent after normalization is to compare their normalized forms programmatically. But when debugging an unfamiliar input, the Unicode Normalizer is faster — paste the raw string, pick a form, and the tool shows you the normalized output alongside the code point list, so you can see exactly what changed.

To understand why a specific character behaves a certain way — why ﬁ loses its ligature under NFKC but not NFC, or why a particular accent mark remains separate even after NFC — the Unicode Character Inspector shows you the decomposition mapping and Unicode category for each code point individually.

A normalized comparison policy is three lines of code once you know the pattern. The tricky part is remembering to apply it consistently — on write, on query, and on import — so that different text sources never get to disagree on what "the same string" means.
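One way to apply the policy at an import boundary is to normalize every string in a parsed payload before it reaches storage. The sketch below uses a hypothetical `normalizeDeep` helper and assumes plain JSON-style data (strings, numbers, booleans, null, arrays, plain objects).

```javascript
// Hypothetical boundary helper: normalize every string in a parsed JSON value.
function normalizeDeep(value, form = "NFC") {
  if (typeof value === "string") return value.normalize(form);
  if (Array.isArray(value)) return value.map((v) => normalizeDeep(v, form));
  if (value !== null && typeof value === "object") {
    return Object.fromEntries(
      Object.entries(value).map(([k, v]) => [k.normalize(form), normalizeDeep(v, form)])
    );
  }
  return value; // numbers, booleans, and null pass through unchanged
}

// Incoming payload with decomposed accents (escaped so they are visible).
const payload = JSON.parse('{"name":"cafe\\u0301","tags":["re\\u0301sume\\u0301"],"visits":3}');
const clean = normalizeDeep(payload);

console.log(clean.name === "caf\u00E9"); // true
console.log(clean.tags[0].length);       // 6
console.log(clean.visits);               // 3
```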

Made by Toolora · Updated 2026-06-30