Unicode Code Points, UTF-8 Bytes, and UTF-16 Units: Debugging Invisible Characters and Database Column Traps
A developer's practical guide to understanding Unicode code points versus UTF-8 bytes versus UTF-16 code units — with real examples of invisible character bugs, MySQL column sizing errors, and surrogate pair traps.
Three developers on my team spent a combined four hours debugging a config parsing failure last quarter. The YAML file looked pristine in every editor. The parser disagreed. The culprit was a single character no one could see — a U+00A0 NO-BREAK SPACE that had slipped in from a copy-paste out of a PDF. The fix took 30 seconds once we knew where to look.
That story repeats itself in codebases everywhere, in different disguises. Sometimes it's a database column that silently truncates an emoji. Sometimes it's a JavaScript string whose .length property confidently returns the wrong number. The underlying confusion is always the same: developers conflate code points, UTF-8 bytes, and UTF-16 code units, and those three things are not interchangeable.
The Three Numbers You Keep Confusing
A Unicode code point is the abstract number assigned to a character. The letter A is U+0041. The euro sign is U+20AC. The waving hand emoji is U+1F44B. Code points are integers from 0 to 1,114,111. They say nothing about how a character is stored.
UTF-8 bytes are how most files, databases, and network sockets actually store characters. UTF-8 is variable-width: ASCII characters use 1 byte each, common European and Middle Eastern characters use 2, most CJK characters use 3, and emoji and supplementary symbols use 4.
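The width rule follows directly from the numeric range a code point falls in. A minimal sketch of that mapping, where `utf8ByteLength` is a hypothetical helper name rather than a built-in:

```javascript
// Hypothetical helper: how many bytes UTF-8 needs for one code point.
// Surrogate code points (U+D800..U+DFFF) are not special-cased in this sketch.
function utf8ByteLength(codePoint) {
  if (codePoint < 0 || codePoint > 0x10ffff) {
    throw new RangeError("not a Unicode code point");
  }
  if (codePoint <= 0x7f) return 1;   // ASCII
  if (codePoint <= 0x7ff) return 2;  // Latin-1 Supplement, Greek, Cyrillic, Hebrew, Arabic, ...
  if (codePoint <= 0xffff) return 3; // rest of the Basic Multilingual Plane, incl. most CJK
  return 4;                          // supplementary planes, incl. most emoji
}

console.log(utf8ByteLength(0x41));    // 1 (A)
console.log(utf8ByteLength(0xe9));    // 2 (é)
console.log(utf8ByteLength(0x20ac));  // 3 (€)
console.log(utf8ByteLength(0x1f44b)); // 4 (👋)
```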
UTF-16 code units are what JavaScript's String.prototype.length counts. JavaScript strings are stored internally as UTF-16 sequences. Characters in the Basic Multilingual Plane (U+0000–U+FFFF) take one 16-bit code unit. Characters above U+FFFF — including most emoji — take two code units called a surrogate pair.
For everyday ASCII text, all three numbers match. Once you leave ASCII, they diverge.
| Character | Code points | UTF-8 bytes | UTF-16 units |
|---|---|---|---|
| hello | 5 | 5 | 5 |
| café | 4 | 5 | 4 |
| 日本語 | 3 | 9 | 3 |
| 👋🏽 | 2 | 8 | 4 |
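The rows in this table can be reproduced with three one-liners. This is a sketch; `countUnits` is an illustrative name, and it assumes a runtime with a global `TextEncoder` (modern browsers and Node.js).

```javascript
// Count the three units for a string.
function countUnits(str) {
  return {
    codePoints: [...str].length,                     // the string iterator walks code points
    utf8Bytes: new TextEncoder().encode(str).length, // length of the encoded byte array
    utf16Units: str.length,                          // what .length reports
  };
}

for (const s of ["hello", "café", "日本語", "👋🏽"]) {
  console.log(s, countUnits(s));
}
```

The café row assumes the precomposed é (U+00E9). If the same word arrives as e followed by U+0301 COMBINING ACUTE ACCENT, it looks identical but counts 5 code points and 6 UTF-8 bytes.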
When a Config File Looks Fine But the Parser Disagrees
I ran the YAML failure through the Unicode Code Point Inspector and the table revealed the problem immediately. The "space" between two tokens in the config was U+00A0 (NO-BREAK SPACE), not the U+0020 (SPACE) that YAML tokenizers expect.
Here is the exact inspection result for the string "timeout: 30" with a hidden no-break space:
Char Code Point Name UTF-8
t U+0074 LATIN SMALL LETTER T 74
i U+0069 LATIN SMALL LETTER I 69
m U+006D ... 6D
e U+0065 ... 65
o U+006F ... 6F
u U+0075 ... 75
t U+0074 ... 74
: U+003A COLON 3A
U+00A0 NO-BREAK SPACE C2 A0 ← two bytes, not one
3 U+0033 DIGIT THREE 33
0 U+0030 DIGIT ZERO 30
The NO-BREAK SPACE (U+00A0) encodes to two bytes in UTF-8 (0xC2 0xA0), not one. YAML does not count it as whitespace, so the colon is no longer followed by a space and the parser stops seeing a key and a value: the whole line is read as a single plain scalar, with "30" glued onto what should have been the key. The editor rendered it as a space, so every visual check passed.
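If no inspector is at hand, a rough equivalent of that table takes a few lines. This is only a sketch: `inspect` is a hypothetical helper, and it prints code points and UTF-8 bytes but does not look up Unicode character names.

```javascript
// One row per code point: the character, its U+ number, and its UTF-8 bytes in hex.
function inspect(str) {
  const encoder = new TextEncoder();
  return [...str].map((ch) => {
    const cp = ch.codePointAt(0);
    return {
      char: ch,
      codePoint: "U+" + cp.toString(16).toUpperCase().padStart(4, "0"),
      utf8: [...encoder.encode(ch)]
        .map((b) => b.toString(16).toUpperCase().padStart(2, "0"))
        .join(" "),
    };
  });
}

// The escape \u00A0 makes the hidden character explicit in source code.
console.table(inspect("timeout:\u00A030"));
```

Any row whose utf8 column holds more than one byte in a file that is supposed to be plain ASCII deserves a second look.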
Other invisible troublemakers worth scanning for:
- U+FEFF BYTE ORDER MARK: appears at the start of files saved as UTF-8-with-BOM, invisible to most editors, confuses many parsers
- U+200B ZERO WIDTH SPACE: common in text copied from certain web apps, breaks token splitting
- U+2019 RIGHT SINGLE QUOTATION MARK: looks like an apostrophe ('), fails string literal parsing
The inspector's Name column is the fastest way to catch all of them. Paste the offending line and scan that column; any unexpected ZERO WIDTH, NO-BREAK, or BYTE ORDER MARK entry is your answer.
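The same scan can be automated, for example as a pre-commit check on config files. A minimal sketch, assuming a short watch list built only from the characters named in this section; `SUSPECTS` and `findSuspects` are hypothetical names, and the list is not exhaustive.

```javascript
// Watch list: code point -> name to report.
const SUSPECTS = new Map([
  [0x00a0, "NO-BREAK SPACE"],
  [0x200b, "ZERO WIDTH SPACE"],
  [0x2019, "RIGHT SINGLE QUOTATION MARK"],
  [0xfeff, "BYTE ORDER MARK"],
]);

// Report each suspect character with its 1-based line and column.
function findSuspects(text) {
  const hits = [];
  text.split("\n").forEach((line, lineIndex) => {
    let column = 1;
    for (const ch of line) {
      const name = SUSPECTS.get(ch.codePointAt(0));
      if (name) hits.push({ line: lineIndex + 1, column, name });
      column += 1; // columns are counted in code points
    }
  });
  return hits;
}

console.log(findSuspects("\uFEFFname: app\ntimeout:\u00A030"));
```

For the sample input this reports a BYTE ORDER MARK at line 1, column 1 and a NO-BREAK SPACE at line 2, column 9.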
MySQL's Dirty Secret: utf8 Is Not UTF-8
If your application has ever hit an "Incorrect string value: '\xF0\x9F\x98\x80'" error in MySQL, you know this pain. The root cause: MySQL's utf8 charset is actually utf8mb3, which only stores 1–3 byte UTF-8 sequences. Per the MySQL 8.0 reference manual, utf8 has been an alias for the 3-byte-max variant since the charset was introduced. Characters above U+FFFF (most emoji, the CJK extension blocks in the supplementary planes, mathematical alphanumeric symbols) cannot be stored: in strict SQL mode the statement fails with that error, and otherwise the value is truncated at the offending character with a warning.
The fix is utf8mb4, which supports the full 4-byte range. But the column sizing trap goes deeper.
A VARCHAR(100) in MySQL means 100 characters, where "characters" is measured in code points. With utf8mb4, each code point can take up to 4 bytes, so a VARCHAR(100) column counts for up to 400 bytes, plus a 1- or 2-byte length prefix, against the row size limit. MySQL caps a row at 65,535 bytes across all columns (BLOB and TEXT contents excluded), regardless of storage engine or row format, and a table with several long utf8mb4 VARCHAR columns will hit that limit faster than you expect.
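The arithmetic is easy to run before writing the schema. The sketch below is a rough estimate only: it counts VARCHAR columns and their length prefixes, ignores NULL flags, other column types, and any storage-engine page limits, and the column lengths are made up for illustration.

```javascript
const MAX_ROW_BYTES = 65535;  // MySQL row size limit
const UTF8MB4_MAX_BYTES = 4;  // worst case per character in utf8mb4

// Worst-case bytes one VARCHAR(n) column counts toward the row size limit:
// the data itself plus a length prefix (1 byte up to 255 data bytes, else 2).
function varcharWorstCase(chars, bytesPerChar = UTF8MB4_MAX_BYTES) {
  const dataBytes = chars * bytesPerChar;
  const prefixBytes = dataBytes > 255 ? 2 : 1;
  return dataBytes + prefixBytes;
}

// Hypothetical table: declared lengths of its VARCHAR columns.
const columns = [100, 255, 4000, 4000, 8100];
const total = columns.reduce((sum, n) => sum + varcharWorstCase(n), 0);
console.log(total, total > MAX_ROW_BYTES ? "over the limit" : "fits"); // 65830 over the limit
```

Five columns that look modest in character terms already exceed the limit by 295 bytes, and MySQL rejects the CREATE TABLE statement for such a definition.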
To calculate the actual storage cost of a specific string before you size a column, paste it into the UTF-8 Byte Counter and read the UTF-8 byte count directly. A display name like "Jean-Baptiste Léonard 🎵" gives you 23 code points but 27 UTF-8 bytes: 1 extra byte for the é and 3 extra for the music note emoji. That gap is what breaks columns sized by character count instead of byte count.
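When a field really is limited in bytes, cutting the string with a byte-level slice risks splitting a multi-byte sequence in half. One possible approach, sketched here with a hypothetical `truncateToUtf8Bytes` helper, is to add whole code points until the budget is spent:

```javascript
// Keep as many whole code points as fit into maxBytes of UTF-8.
function truncateToUtf8Bytes(str, maxBytes) {
  const encoder = new TextEncoder();
  let used = 0;
  let out = "";
  for (const ch of str) {               // iterates by code point, not code unit
    const size = encoder.encode(ch).length;
    if (used + size > maxBytes) break;  // the next code point would overflow
    used += size;
    out += ch;
  }
  return out;
}

console.log(truncateToUtf8Bytes("café 🎵", 10)); // "café 🎵" (exactly 10 bytes)
console.log(truncateToUtf8Bytes("café 🎵", 9));  // "café "   (the 4-byte emoji no longer fits)
```

This keeps the output valid UTF-8, but it can still separate an emoji from a following modifier or combining mark; cutting on grapheme boundaries would need a segmenter on top.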
Surrogate Pairs and the JavaScript String Length Trap
JavaScript's String.prototype.length returns UTF-16 code unit count, not code point count. For any emoji in the supplementary planes, that count is 2 per emoji instead of 1.
"hello".length // 5 ✓
"café".length // 4 ✓
"👋🏽".length // 4 ✗ (one emoji plus a skin tone modifier: two code points, 2 units each)
The 👋🏽 waving hand is actually two code points: U+1F44B (WAVING HAND SIGN) followed by U+1F3FD (MEDIUM SKIN TONE modifier). Each encodes to a surrogate pair in UTF-16, so .length returns 4. Capping a display name at a .length of 30 leaves room for only 15 plain supplementary-plane emoji, or 7 of these skin-toned hands, so a user can be blocked long before they have typed 30 visible characters.
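The surrogate pair itself is plain bit arithmetic on the code point. A short sketch of the standard UTF-16 split, with `toSurrogatePair` as an illustrative name:

```javascript
// Split a supplementary code point (above U+FFFF) into its UTF-16 surrogate pair.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10ffff) {
    throw new RangeError("only U+10000..U+10FFFF need a surrogate pair");
  }
  const offset = codePoint - 0x10000;    // 20-bit value
  const high = 0xd800 + (offset >> 10);  // top 10 bits
  const low = 0xdc00 + (offset & 0x3ff); // bottom 10 bits
  return [high, low];
}

const hex = (n) => n.toString(16).toUpperCase();
console.log(toSurrogatePair(0x1f44b).map(hex)); // [ 'D83D', 'DC4B' ]
console.log(toSurrogatePair(0x1f3fd).map(hex)); // [ 'D83C', 'DFFD' ]

// The same units are what charCodeAt sees inside the string.
console.log("👋".charCodeAt(0) === 0xd83d, "👋".charCodeAt(1) === 0xdc4b); // true true
```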
The correct approach depends on what you need to limit:
- Limit UTF-8 bytes (for database or HTTP payload sizing): use the TextEncoder API: new TextEncoder().encode(str).length
- Limit Unicode code points (for character count that matches user expectation for most scripts): [...str].length
- Limit grapheme clusters (for true "visible character" count including ZWJ emoji): [...new Intl.Segmenter().segment(str)].length
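Those options can sit behind one validator so the unit is an explicit choice rather than an accident of .length. A sketch under assumptions: `measure` and `withinLimit` are hypothetical names, and Intl.Segmenter needs a reasonably recent runtime (Node.js 16 or later, current browsers).

```javascript
// Measure a string in the unit the limit is actually about.
function measure(str, unit) {
  switch (unit) {
    case "utf8":
      return new TextEncoder().encode(str).length;
    case "utf16":
      return str.length;
    case "codePoints":
      return [...str].length;
    case "graphemes":
      // The default granularity of Intl.Segmenter is "grapheme".
      return [...new Intl.Segmenter().segment(str)].length;
    default:
      throw new Error(`unknown unit: ${unit}`);
  }
}

function withinLimit(str, unit, max) {
  return measure(str, unit) <= max;
}

const name = "👋🏽".repeat(10); // ten skin-toned waving hands
console.log(withinLimit(name, "graphemes", 30));  // true  (10 graphemes)
console.log(withinLimit(name, "codePoints", 30)); // true  (20 code points)
console.log(withinLimit(name, "utf16", 30));      // false (40 code units)
console.log(withinLimit(name, "utf8", 30));       // false (80 bytes)
```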
A Repeatable Debugging Workflow
When an encoding bug surfaces, I follow this order:
- Paste the suspect string into the Unicode Code Point Inspector. The per-code-point table shows you exactly what is there, including invisible characters, combining marks, and unexpected control codes. The Name column catches anything that looks like a space but isn't.
- Check the UTF-8 byte count with the UTF-8 Byte Counter. If the byte count is larger than you expected from the character count, you have multi-byte code points. The counter shows you UTF-8 bytes, UTF-16 units, code points, and graphemes all at once — the four numbers your code needs to pick from.
- If you need to escape a specific code point for JavaScript or HTML, the inspector's table has the JS escape and HTML entity columns. Click any cell to copy. This removes the guesswork from writing \u{1F44B} vs \uD83D\uDC4B (the surrogate pair form).
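For the escaping step, the forms can also be generated in code. A minimal sketch, with `escapeForms` as an illustrative name; it derives the surrogate pair form by walking the string's UTF-16 code units rather than computing it by hand.

```javascript
// Build the common escape forms for one code point.
function escapeForms(codePoint) {
  const hex = codePoint.toString(16).toUpperCase();
  const str = String.fromCodePoint(codePoint);
  const units = [];
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i).toString(16).toUpperCase().padStart(4, "0");
    units.push("\\u" + unit);
  }
  return {
    jsCodePoint: `\\u{${hex}}`,  // ES2015 code point escape
    jsCodeUnits: units.join(""), // one \uXXXX per UTF-16 code unit
    htmlEntity: `&#x${hex};`,    // numeric character reference
  };
}

console.log(escapeForms(0x1f44b));
```

For U+1F44B this yields \u{1F44B}, \uD83D\uDC4B, and &#x1F44B;. For a BMP character such as U+00A0 the code unit form is a single \u00A0.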
This workflow caught the YAML bug in under a minute. It also surfaced a database column sizing mistake on the same project — a VARCHAR(50) column that was being asked to store usernames with emoji, which in utf8mb4 would require up to 200 bytes for a 50-code-point string.
The underlying point is that encoding bugs are not mysterious once you can see the actual numbers. The three counts — code points, UTF-8 bytes, UTF-16 units — are always deterministic. You just need a tool that shows them to you per character, not as a summary.
Made by Toolora · Updated 2026-07-01