Unicode Normalization in Practice: NFC vs NFD vs NFKC vs NFKD for Filenames, Search, and Deduplication

Run ls on a folder synced between a Mac and a Linux box and you can end up staring at two entries that both read Résumé.pdf. Same glyphs, same apparent name, two distinct files. Neither ls nor Finder is lying to you. The two names are different byte sequences that render identically, and the reason is Unicode normalization — specifically, that macOS and the rest of your stack disagree about which form to store.

This post walks through what the four normalization forms (NFC, NFD, NFKC, NFKD) actually do to real strings, why filenames are the place developers usually meet the problem first, and how to build a dedup step that doesn't get fooled by visually identical text.

What the four forms do to a real string

Unicode allows the same visible character to be encoded more than one way. The letter é can be one precomposed code point, U+00E9, or two code points: a plain e (U+0065) followed by a combining acute accent (U+0301). Both render as é. Normalization converts text into one agreed-upon representation.

Here is the word café pushed through each form in a JavaScript console, with the actual code points that come out:

const s = "café";                     // stored as NFC here

s.normalize("NFC");
// "café" → U+0063 U+0061 U+0066 U+00E9   (4 code points)

s.normalize("NFD");
// "café" → U+0063 U+0061 U+0066 U+0065 U+0301   (5 code points)

"ﬁle①.txt".normalize("NFC");
// "ﬁle①.txt" — unchanged: ﬁ (U+FB01) and ① (U+2460) survive

"ﬁle①.txt".normalize("NFKC");
// "file1.txt" — the ligature splits into f+i, the circled one becomes a plain 1

The two axes are easy to keep straight once you see them side by side. The C/D suffix decides whether characters end up composed (é as one code point) or decomposed (e + accent). The K decides whether compatibility characters get rewritten: ligatures like ﬁ, fullwidth digits like ４, circled numbers, superscripts. NFC and NFD only reorganize canonically equivalent text and never change what the string means. NFKC and NFKD are lossy by design — after NFKC there is no way to know the input contained a ﬁ ligature rather than the letters f and i.
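
To see both axes on one input, a small helper that prints code points for every form is handy. This is a minimal sketch: codePoints and allForms are names I made up for illustration, and the only built-in it relies on is String.prototype.normalize.

```javascript
// Format a string as a space-separated list of U+XXXX code points.
function codePoints(str) {
  return Array.from(str, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  ).join(" ");
}

// Run one input through all four normalization forms.
function allForms(str) {
  const result = {};
  for (const form of ["NFC", "NFD", "NFKC", "NFKD"]) {
    result[form] = codePoints(str.normalize(form));
  }
  return result;
}

// Input: the fi ligature (U+FB01) followed by precomposed e-acute (U+00E9).
console.log(allForms("\uFB01\u00E9"));
// NFC:  U+FB01 U+00E9
// NFD:  U+FB01 U+0065 U+0301
// NFKC: U+0066 U+0069 U+00E9
// NFKD: U+0066 U+0069 U+0065 U+0301
```

The ligature only changes in the K rows, and the accent only splits in the D rows, which is the whole two-axis model in four lines of output.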

The byte cost is measurable. In UTF-8, precomposed é (U+00E9) encodes as 2 bytes (c3 a9), while the NFD pair encodes as 3 bytes (65 cc 81) — a 50% size increase for that character, straight from the UTF-8 encoding rules in the Unicode Standard. For accent-heavy French or Vietnamese text, NFD storage is consistently larger than NFC, which is one reason the W3C recommends NFC for content on the web.
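
The byte counts above are easy to check yourself. The sketch below assumes an environment with the standard TextEncoder global (modern browsers and Node.js); utf8Length and byteCost are hypothetical helper names, not part of any library.

```javascript
// Count the UTF-8 bytes a string occupies.
function utf8Length(str) {
  return new TextEncoder().encode(str).length;
}

// Compare the UTF-8 size of the NFC and NFD forms of the same text.
function byteCost(str) {
  const nfc = utf8Length(str.normalize("NFC"));
  const nfd = utf8Length(str.normalize("NFD"));
  return { nfc, nfd, overhead: nfd - nfc };
}

console.log(byteCost("\u00e9"));      // { nfc: 2, nfd: 3, overhead: 1 }
console.log(byteCost("caf\u00e9"));   // { nfc: 5, nfd: 6, overhead: 1 }
```

Each accented Latin letter that has a precomposed form costs one extra byte in NFD, so the overhead scales with the number of accents rather than the length of the text.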

Filenames: why your Mac is the odd one out

Apple's HFS+ filesystem stored filenames in decomposed form (a variant of NFD, documented in Apple Technical Note TN1150). Modern APFS is normalization-insensitive rather than normalizing, but files created by macOS tooling still frequently carry decomposed names. Linux filesystems like ext4 do no normalization at all: a filename is a byte string, and Résumé in NFD (each é stored as e + U+0301) and Résumé in NFC (each é stored as U+00E9) are simply two different names that happen to render the same.

You can see the difference with nothing but a shell:

$ echo -n 'é' | xxd          # typed on Linux, NFC
00000000: c3a9

$ echo -n 'é' | xxd          # copied from a macOS filename, NFD
00000000: 65cc 81

That is the entire bug in five bytes: two for the NFC form, three for the NFD form. Sync tools, tarballs, Git checkouts, and object-store uploads all move byte strings, so a file created on a Mac and re-uploaded from Linux becomes a sibling instead of an overwrite. Git has a core.precomposeUnicode setting for exactly this; rsync grew --iconv for the same reason.
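
Before renaming anything, it helps to find which names would collide once normalized. The sketch below is one way to do that under simple assumptions: findNormalizationCollisions is a made-up name, and the input is a plain array of names (in Node.js that could come from a directory listing such as fs.readdirSync, which I leave out so the example stays self-contained).

```javascript
// Group names by their NFC form and report groups where more than one
// distinct byte-level spelling maps to the same normalized name.
function findNormalizationCollisions(names) {
  const groups = new Map();
  for (const name of names) {
    const key = name.normalize("NFC");
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(name);
  }
  // A group is a collision only if it holds at least two different raw strings.
  return [...groups.values()].filter((group) => new Set(group).size > 1);
}

const listing = [
  "R\u00e9sum\u00e9.pdf",     // NFC spelling
  "Re\u0301sume\u0301.pdf",   // NFD spelling, renders the same
  "notes.txt",
];
console.log(findNormalizationCollisions(listing).length); // 1
```

An empty result means a batch rename to NFC cannot overwrite anything; a non-empty one tells you which pairs need a human decision first.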

When I hit this myself, it was an rsync job from a MacBook to a Linux NAS that kept producing duplicate folders for every band name with an accent in a music library — Beyoncé twice, Motörhead twice. Nothing in either file manager showed a difference. What settled it was inspecting the two names code point by code point in the Unicode Character Inspector: one folder name ended in U+00E9, the other in U+0065 U+0301. After batch-renaming the NAS side to NFC, the next sync collapsed the duplicates and the job went from copying 1,900 "changed" files to 0.

If you need to fix strings rather than diagnose them, the Unicode Normalizer converts pasted text between all four forms in the browser and shows the resulting code points, which is the fastest way to confirm what a filename actually contains before you script a rename.

Deduplication: normalize before you hash

Every dedup pipeline has the same skeleton: derive a key from each record, group by key, keep one per group. Normalization belongs in the key derivation, before hashing or comparison, or equivalent records sail straight past each other.
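
That skeleton can be sketched in a few lines. This is a minimal version under my own assumptions: records are plain strings, the first occurrence wins, and dedupe is a hypothetical name rather than any library's API.

```javascript
// Derive a key by normalizing, group by key, keep the first record per key.
function dedupe(records, form = "NFC") {
  const seen = new Map();
  for (const record of records) {
    const key = record.normalize(form);
    if (!seen.has(key)) seen.set(key, record);
  }
  return [...seen.values()];
}

const names = [
  "Beyonc\u00e9",               // NFC
  "Beyonce\u0301",              // NFD, renders the same
  "\uff46\uff49\uff4c\uff45",   // "file" in fullwidth letters
  "file",
];
console.log(dedupe(names, "NFC").length);  // 3
console.log(dedupe(names, "NFKC").length); // 2
```

The original record is kept as the value and only the key is normalized, so a lossy form like NFKC never leaks into the data you return.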

Take this list of tags collected from user input across platforms:

café
café
ﬁle-manager
file-manager

A naive exact-match dedup keeps all four lines, because line 1 is NFC, line 2 is NFD, and line 3 contains the U+FB01 ligature. Apply NFC first and lines 1–2 merge. Apply NFKC and all four collapse into two distinct tags: café and file-manager. Which form you pick is a real decision:

NFC for keys that must stay faithful to the original text: filenames you will write back to disk, URLs, anything user-visible.
NFKC for search indexes and fuzzy identity, where you want ﬁle, ｆｉｌｅ (fullwidth), and file to be the same word. Combine it with case folding. Identifier systems use it for the same reason: security-sensitive username matching relies on it, and Python 3 normalizes source identifiers to NFKC per the language reference, so ﬁnd and find are the same variable.
NFD / NFKD mostly as intermediate steps — decomposing first makes it trivial to strip accents by dropping combining marks (U+0300–U+036F) before slugging or building a diacritic-insensitive index.
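
The last two choices in that list translate into very short functions. Treat this as a sketch with labeled shortcuts: searchKey and stripAccents are hypothetical names, toLowerCase stands in for full Unicode case folding (JavaScript has no built-in case-fold operation), and the accent stripper only removes marks in the U+0300–U+036F block.

```javascript
// Search key: compatibility-normalize, then lowercase.
// Assumption: toLowerCase() is an approximation of true case folding.
function searchKey(str) {
  return str.normalize("NFKC").toLowerCase();
}

// Accent stripping: decompose, drop combining marks U+0300-U+036F,
// then recompose whatever is left.
function stripAccents(str) {
  return str
    .normalize("NFD")
    .replace(/[\u0300-\u036f]/g, "")
    .normalize("NFC");
}

console.log(searchKey("\uFB01LE"));          // "file"
console.log(stripAccents("Mot\u00f6rhead")); // "Motorhead"
```

Both outputs are derived values. I would index or slug with them, but keep the NFC original as the thing that gets displayed and written back.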

For line-based cleanup jobs, I run lists through the Text Deduplicator after normalizing, since its trim and case-fold options catch the mundane duplicates while normalization catches the invisible ones.

The rules I actually follow

Four forms sounds like four decisions, but in practice it reduces to three habits. Store and transmit NFC, because it is the web default and the most compact canonical form. Normalize at the boundary — the moment text enters your system from a file API, an upload, or a form — rather than sprinkling .normalize() calls at every comparison site. And reserve the K forms for derived values like search keys and slugs, never for data you will show back to a user, because compatibility mapping is one-way.
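
Here is roughly what normalizing at the boundary looks like in code. It is a sketch, not a prescription: ingest is a hypothetical entry-point function, the field names are my own, and toLowerCase is again a stand-in for proper case folding.

```javascript
// Normalize once, where text enters the system.
// text:      NFC, safe to store, display, and write back.
// searchKey: NFKC + lowercase, derived and never shown to the user.
function ingest(raw) {
  const text = raw.normalize("NFC");
  const searchKey = raw.normalize("NFKC").toLowerCase();
  return { text, searchKey };
}

const fromMac = ingest("Cafe\u0301 \u2460");   // NFD accent, circled digit one
const fromLinux = ingest("Caf\u00e9 \u2460");  // NFC accent, circled digit one

// After ingestion, plain equality is enough at every comparison site.
console.log(fromMac.text === fromLinux.text);  // true
console.log(fromMac.searchKey);                // "café 1"
```

The stored text still contains the circled digit, because NFC leaves compatibility characters alone; only the derived key flattens it to a plain 1.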

Unicode 16.0 defines 154,998 characters (per the Unicode Consortium, 2024), and every canonical-equivalent pair among them is a potential phantom duplicate in a system that skips normalization. The fix costs one function call, as long as you make it before the bytes get compared.

Made by Toolora · Updated 2026-07-02