Hunting Zero-Width Characters: How Invisible Unicode Breaks Your Text
Zero-width spaces, joiners, and the BOM hide inside text and break comparisons, JSON parsing, and validation. Here is how to find and strip them out.
Some of the worst bugs I have chased were not bugs at all. The code was correct, the logic was sound, and two strings that looked identical on screen, character for character, refused to compare equal. The problem was a character I could not see, sitting quietly inside the text, doing exactly what Unicode says it should do.
Zero-width characters are real code points that take up no visual space. They render as nothing, occupy no column, and yet they exist in the bytes. The common offenders are the zero-width space (U+200B), the zero-width non-joiner (U+200C), the zero-width joiner (U+200D), and the byte-order mark (U+FEFF). Because you cannot see them, they slip past your eyes, past your validation, and straight into a comparison that then fails for no reason you can spot.
Where invisible characters come from
You rarely type these characters on purpose. They arrive as passengers. Copy a function name out of a styled API doc and you may pick up a U+200B between two words. Paste a config snippet from a chat app and a no-break space (U+00A0) lands where you expected a plain one. Export a file from a Windows tool and it stamps a BOM at the very front. Lift a paragraph out of a PDF and you inherit a scattering of format controls that the layout engine used for line breaking.
The pattern is consistent: invisibles enter when text crosses a boundary between two systems that disagree about formatting. Chat clients, rich-text editors, CMS fields, PDF extractors, and AI model output are the usual sources. Each one is convinced it is helping.
Why they break things
The damage falls into three buckets.
Broken comparisons. A string with a hidden U+200B is not equal to the same string without it, even though both look the same. If you have ever watched an if (a === b) fail while staring at two values that read identically in the log, this is why.
Failed validation and parsing. A leading U+FEFF at the start of a JSON file makes JSON.parse throw a SyntaxError (usually reported as an unexpected token) even though the file looks flawless in your editor. A no-break space where a normal space belongs quietly breaks split(' '), so a tokenizer returns one fat token instead of two. A regex that expects \s may or may not match the exact invisible you got, depending on the regex flavor, so validation passes locally and fails in production.
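Each of these failures is easy to reproduce. The sketch below writes the invisibles as escapes so they show up in the source; the sample strings and the helper name stripBom are made up for illustration, and the regex results apply to JavaScript's \s specifically.

```javascript
// A BOM (U+FEFF) in front of otherwise valid JSON.
const withBom = "\uFEFF" + '{"ok":true}';

let parseError = null;
try {
  JSON.parse(withBom);
} catch (err) {
  parseError = err; // the exact message varies by engine
}
console.log(parseError instanceof SyntaxError); // true

// Hypothetical helper: drop a single leading BOM before parsing.
function stripBom(text) {
  return text.charCodeAt(0) === 0xfeff ? text.slice(1) : text;
}
console.log(JSON.parse(stripBom(withBom)).ok); // true

// A no-break space (U+00A0) is not the character split(" ") looks for.
const label = "price:\u00A09";
console.log(label.split(" ").length); // 1

// In JavaScript, \s matches U+00A0 and U+FEFF but not U+200B.
console.log(/\s/.test("\u00A0")); // true
console.log(/\s/.test("\u200B")); // false
```

Other regex engines draw the \s line differently, which is one way a check can pass in one environment and fail in another.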
Hidden watermarks and tracking. This one is more deliberate. Some systems weave a pattern of zero-width spaces, joiners, and non-joiners between visible letters to fingerprint a document or watermark generated text. The prose reads normally, but a run of U+200D characters threaded through it acts as an invisible signature that travels with every copy-paste.
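A rough way to spot this kind of threading is to look for runs of the three characters named above. This is only a sketch: the helper name findZeroWidthRuns and the sample string are invented here, and it does not decode any particular watermarking scheme.

```javascript
// Zero-width space, non-joiner, and joiner.
const ZERO_WIDTH_RUN = /[\u200B\u200C\u200D]+/g;

// Hypothetical helper: report where each run starts and which code points it holds.
function findZeroWidthRuns(text) {
  const runs = [];
  for (const match of text.matchAll(ZERO_WIDTH_RUN)) {
    runs.push({
      index: match.index,
      codePoints: Array.from(match[0], (ch) =>
        "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
      ),
    });
  }
  return runs;
}

// Made-up sample: the visible text is "hello", with invisibles threaded through it.
const marked = "he\u200D\u200Cll\u200Bo";
console.log(findZeroWidthRuns(marked));
// two runs: index 2 (U+200D, U+200C) and index 6 (U+200B)
```

Legitimate uses, such as joiners inside emoji sequences, show up in the same report, so a hit is a reason to look closer rather than proof of a watermark.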
A worked example that fails an equality check
Here is the case that taught me to stop trusting my eyes. I had a lookup keyed on a product code. Two values printed as SKU-4490 in the console, character for character the same, yet the lookup missed.
const fromForm = "SKU-4490"; // typed by hand
const fromImport = "SKU\u200B-4490"; // pasted from a spreadsheet export; the hidden U+200B is written as an escape here
console.log(fromForm); // SKU-4490
console.log(fromImport); // SKU-4490
console.log(fromForm === fromImport); // false
console.log(fromImport.length); // 9, not 8
Both lines log SKU-4490. The strings are not equal. The imported value carries a zero-width space (U+200B) between SKU and -4490, so its length is 9 instead of 8 and the equality check fails. Nothing in the rendered text gives this away. The only honest signals are the length and the character inventory. Strip the U+200B and the comparison passes immediately.
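To take that inventory by hand, a few lines are enough. This is a minimal sketch; the helper name codePoints and the variable names are chosen for illustration, and the zero-width space is written as an escape so it is visible.

```javascript
// Hypothetical helper: list every code point in a string as U+XXXX.
function codePoints(text) {
  return Array.from(text, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  );
}

const typed = "SKU-4490";
const imported = "SKU\u200B-4490";

console.log(codePoints(imported).join(" "));
// U+0053 U+004B U+0055 U+200B U+002D U+0034 U+0034 U+0039 U+0030

// UTF-8 byte counts differ too: U+200B encodes as three bytes.
console.log(new TextEncoder().encode(typed).length); // 8
console.log(new TextEncoder().encode(imported).length); // 11

// Remove the zero-width space, then compare again.
const cleaned = imported.replace(/\u200B/g, "");
console.log(cleaned === typed); // true
```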
Cleaning them out, locally
The fix is to see the invisibles by code point, then remove them. Paste your text into the Zero Width Character Detector and it lists every hit with its code point, its Unicode name, and how many times it appears. The preview marks each position in red, so you stop guessing where the character lives and start knowing.
From there you choose how to clean:
- Remove all deletes every invisible. Use this for an identifier, a SKU, or a watermark you want gone.
- Normalize turns invisible spaces like U+00A0 into a plain U+0020 while deleting the zero-width family. Use this for prose where you want word spacing preserved, so price: 9 written with a no-break space becomes price: 9 with a plain space rather than price:9.
Normal ASCII spaces and newlines are never touched. Only the lookalikes get cleaned, so you do not have to worry about your real spacing being collapsed.
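If you would rather script the same two modes, they can be approximated in a few lines. This is a sketch, not the detector's actual implementation: the function names and the two character sets are assumptions, and a real tool likely covers more code points.

```javascript
// Assumed character sets for this sketch only.
const ZERO_WIDTH = /[\u200B\u200C\u200D\uFEFF]/g;
const INVISIBLE_SPACES = /[\u00A0\u2007\u202F]/g; // no-break, figure, narrow no-break

// "Remove all": delete every character in both sets.
function removeAll(text) {
  return text.replace(ZERO_WIDTH, "").replace(INVISIBLE_SPACES, "");
}

// "Normalize": turn space lookalikes into U+0020, delete the zero-width family.
function normalize(text) {
  return text.replace(INVISIBLE_SPACES, " ").replace(ZERO_WIDTH, "");
}

const sample = "price:\u00A09\u200B";
console.log(JSON.stringify(removeAll(sample))); // "price:9"
console.log(JSON.stringify(normalize(sample))); // "price: 9"

// Plain spaces and newlines pass through unchanged.
console.log(normalize("a b\nc") === "a b\nc"); // true
```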
One caution worth repeating: the zero-width joiner U+200D is load-bearing inside emoji. It glues multi-part sequences like the family emoji together. Remove-all will split those into separate glyphs, so glance at the preview before you clean text that legitimately contains emoji.
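The effect is easy to see in code. The sketch below builds a family emoji from escapes (man, joiner, woman, joiner, girl) and strips the joiners the way a blanket removal would; the variable names are illustrative.

```javascript
// Man + ZWJ + woman + ZWJ + girl: rendered as one glyph where the platform supports it.
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}";

// Stripping U+200D leaves three independent emoji.
const separated = family.replace(/\u200D/g, "");

console.log(Array.from(family).length); // 5 code points
console.log(Array.from(separated).length); // 3 code points
console.log(separated === "\u{1F468}\u{1F469}\u{1F467}"); // true
```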
Scanning stays on your machine
Pasted text is often sensitive. It might be a leaked secret you are auditing, a private document, or AI output under review before it ships. Detection, cleaning, and the preview all run as plain JavaScript inside your browser tab. Nothing you paste is uploaded, logged, or sent to a server. Only the cleaning mode lives in the URL so a shared link reopens with your chosen option; the text itself is deliberately kept out of the link. Close the tab and nothing remains.
If you want to go deeper on a single character rather than clean a whole blob, the Unicode Character Inspector breaks any string down code point by code point with names, categories, and escapes, which pairs well with the detector when you are trying to understand exactly what arrived in your text.
A habit worth building
The lesson that stuck with me is simple: when text that looks right behaves wrong, check the bytes, not the glyphs. An equality test that fails on identical-looking strings, a JSON file that refuses to parse, a split that returns the wrong count, a search that misses an obvious match: these are all symptoms of something you cannot see. Run the text through a detector, read the code points, strip what does not belong, and the mysterious failure usually disappears in one pass.
Invisible characters are not magic. They are ordinary Unicode doing its job in the wrong place. Once you can see them, they stop being scary and start being a five-second fix.
Made by Toolora · Updated 2026-06-13