UTF-8, UTF-16, and ASCII Demystified: Why JavaScript's String Length Lies

ASCII, UTF-8, and UTF-16 explained from first principles — with exact byte sequences and a clear account of why '😀'.length returns 2 in JavaScript.

Published
#utf-8 #utf-16 #ascii #character-encoding #javascript #unicode #web-development

Open a browser console and type this:

"😀".length  // → 2

One emoji. Two characters? That result is not a bug — it is the direct consequence of how JavaScript stores strings internally. Understanding why requires a short walk through ASCII, Unicode, UTF-8, and UTF-16: the four layers that together explain almost every encoding surprise a web developer will hit.

ASCII: 128 Slots and a Problem

ASCII (American Standard Code for Information Interchange) was standardized in 1963 and maps 128 characters — uppercase and lowercase English letters, digits 0–9, punctuation, and 33 control codes — to the numbers 0 through 127. Each value fits in 7 bits, so a single byte stores any ASCII character with one bit to spare.

The scheme is elegant for English text. A is 65, z is 122, <newline> is 10. Every C runtime, every HTTP header parser, and every JSON tokenizer still treats bytes 0–127 as ASCII-compatible.
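The values quoted above are easy to confirm in a console. The snippet below is a minimal sketch; `isAscii` is a hypothetical helper name introduced here for illustration, not something from a library.

```javascript
// ASCII values quoted in the text, read back from JavaScript strings.
const codeOfA = "A".charCodeAt(0);        // 65
const codeOfZ = "z".charCodeAt(0);        // 122
const codeOfNewline = "\n".charCodeAt(0); // 10

// Hypothetical helper: true when every code unit fits in 7 bits (0-127).
function isAscii(text) {
  for (let i = 0; i < text.length; i++) {
    if (text.charCodeAt(i) > 0x7f) return false;
  }
  return true;
}

console.log(codeOfA, codeOfZ, codeOfNewline);  // 65 122 10
console.log(isAscii("GET / HTTP/1.1"));        // true
console.log(isAscii("café"));                  // false
```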

The problem: 128 slots leave no room for é, ñ, 中, or 😀. Dozens of incompatible 8-bit extensions (Latin-1, cp1252, KOI8-R) tried to fill the gap by using the 128 values above 127 for regional characters. A file marked "ASCII" in Paris meant something different from one marked "ASCII" in Moscow.

Unicode: One Number for Every Character

Unicode solves the chaos by assigning a code point — a unique integer — to every character across every writing system. The letter A is U+0041; the emoji 😀 is U+1F600; the Chinese character 中 is U+4E2D. As of Unicode 15.1, the standard covers 149,813 characters.

Unicode is not an encoding — it does not specify how bytes are stored on disk or sent over a network. That is what UTF-8 and UTF-16 do.
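To see code points as plain numbers, independent of any byte encoding, JavaScript's built-in `codePointAt` and `String.fromCodePoint` are enough. This is a small sketch; the variable names are illustrative.

```javascript
// Read the code point of the first character in each sample string.
const samples = ["A", "中", "😀"];
const labels = samples.map((ch) => {
  const cp = ch.codePointAt(0);
  return "U+" + cp.toString(16).toUpperCase().padStart(4, "0");
});
console.log(labels.join(" "));  // U+0041 U+4E2D U+1F600

// Going the other way: build a string from a code point number.
const grin = String.fromCodePoint(0x1f600);
console.log(grin === "😀");     // true
```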

UTF-8: Why the Web Chose Variable-Width

UTF-8 encodes each code point using 1 to 4 bytes:

| Code point range | Bytes | Example |
|---|---|---|
| U+0000 – U+007F | 1 | A → 0x41 |
| U+0080 – U+07FF | 2 | é → 0xC3 0xA9 |
| U+0800 – U+FFFF | 3 | 中 → 0xE4 0xB8 0xAD |
| U+10000 – U+10FFFF | 4 | 😀 → 0xF0 0x9F 0x98 0x80 |

The critical property: every byte in the 0–127 range means exactly what ASCII says. A UTF-8 file containing only English text is byte-for-byte identical to the ASCII version. This backwards compatibility is the main reason UTF-8 now appears on 98.2% of web pages surveyed by W3Techs (2024 data). Switching from Latin-1 to UTF-8 broke zero ASCII-only pages; adding é or 😀 simply required more bytes per character.
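The bit layout behind the 1-to-4-byte ranges can be sketched by hand. The encoder below is an illustration only: `utf8Bytes` and `hex` are hypothetical helper names, and the function assumes a valid code point that is not a lone surrogate. In real code, the built-in `TextEncoder` does this for you; it is used here only as a cross-check.

```javascript
// Hand-rolled UTF-8 encoder for a single code point (illustrative sketch).
// Assumes 0 <= codePoint <= 0x10FFFF and that it is not a surrogate (U+D800-U+DFFF).
function utf8Bytes(codePoint) {
  if (codePoint <= 0x7f) {
    return [codePoint];                          // 0xxxxxxx (same byte as ASCII)
  }
  if (codePoint <= 0x7ff) {
    return [
      0xc0 | (codePoint >> 6),                   // 110xxxxx
      0x80 | (codePoint & 0x3f),                 // 10xxxxxx
    ];
  }
  if (codePoint <= 0xffff) {
    return [
      0xe0 | (codePoint >> 12),                  // 1110xxxx
      0x80 | ((codePoint >> 6) & 0x3f),          // 10xxxxxx
      0x80 | (codePoint & 0x3f),                 // 10xxxxxx
    ];
  }
  return [
    0xf0 | (codePoint >> 18),                    // 11110xxx
    0x80 | ((codePoint >> 12) & 0x3f),           // 10xxxxxx
    0x80 | ((codePoint >> 6) & 0x3f),            // 10xxxxxx
    0x80 | (codePoint & 0x3f),                   // 10xxxxxx
  ];
}

// Format a byte list as "0xNN 0xNN ...".
const hex = (bytes) =>
  Array.from(bytes, (b) => "0x" + b.toString(16).toUpperCase().padStart(2, "0")).join(" ");

for (const ch of ["A", "é", "中", "😀"]) {
  const manual = hex(utf8Bytes(ch.codePointAt(0)));
  const builtIn = hex(new TextEncoder().encode(ch));
  console.log(ch, manual, manual === builtIn);
}
```

Every row should report `true`, and the single byte for `A` is the same 0x41 that ASCII uses.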

You can verify these byte sequences yourself with the UTF-8 Byte Counter, which shows exact byte counts and code-unit breakdowns for any string you type.

UTF-16: Where JavaScript Strings Live

UTF-16 takes a different trade-off. It uses 2 bytes (one code unit) for any code point in the Basic Multilingual Plane (U+0000 – U+FFFF), and 4 bytes (two code units, called a surrogate pair) for anything above U+FFFF.
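The surrogate-pair arithmetic is short enough to write out. The following is a sketch of the standard calculation; `toSurrogatePair` is a hypothetical name, and the function assumes a valid code point as input.

```javascript
// Sketch of the UTF-16 encoding rule for one code point.
function toSurrogatePair(codePoint) {
  if (codePoint <= 0xffff) return [codePoint];  // BMP: a single code unit
  const offset = codePoint - 0x10000;            // leaves a 20-bit value
  const high = 0xd800 + (offset >> 10);          // top 10 bits
  const low = 0xdc00 + (offset & 0x3ff);         // bottom 10 bits
  return [high, low];
}

const units = toSurrogatePair(0x1f600);
console.log(units.map((u) => "0x" + u.toString(16).toUpperCase()).join(" "));
// 0xD83D 0xDE00

// Cross-check against what the JavaScript string actually holds.
console.log("😀".charCodeAt(0) === units[0], "😀".charCodeAt(1) === units[1]);
// true true
```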

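JavaScript, Java, and the Windows API all expose strings as sequences of UTF-16 code units (a runtime may use a more compact representation internally, but that is invisible to your code). The .length property in JavaScript counts UTF-16 code units, not Unicode code points or visible characters. Here is the precise breakdown: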

"A".length        // → 1   (U+0041, one code unit)
"é".length        // → 1   (U+00E9, one code unit)
"中".length       // → 1   (U+4E2D, one code unit)
"😀".length       // → 2   (U+1F600, above U+FFFF → surrogate pair: two code units)
"👨‍👩‍👧‍👦".length   // → 11  (family emoji = multiple code points joined with ZWJ)

The surrogate pair for 😀 consists of the high surrogate 0xD83D and the low surrogate 0xDE00. Both are valid UTF-16 code units but neither represents a complete character on its own. Programs that split strings at arbitrary .length positions can accidentally cut a surrogate pair in half, producing the replacement character (U+FFFD) or garbled output.
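One way to avoid cutting a pair in half is to check the last code unit you are about to keep. The helper below is a minimal sketch under that assumption; `truncateSafely` is a hypothetical name, and it protects surrogate pairs only, not longer sequences such as ZWJ emoji.

```javascript
// Keep at most maxUnits UTF-16 code units without splitting a surrogate pair.
function truncateSafely(text, maxUnits) {
  if (text.length <= maxUnits) return text;
  let end = maxUnits;
  const last = text.charCodeAt(end - 1);
  // A high surrogate (0xD800-0xDBFF) at the cut point would lose its partner: drop it too.
  if (end > 0 && last >= 0xd800 && last <= 0xdbff) end -= 1;
  return text.slice(0, end);
}

const message = "hi😀";                      // 4 code units: h, i, 0xD83D, 0xDE00
const naive = message.slice(0, 3);           // ends with a lone high surrogate
const safe = truncateSafely(message, 3);     // "hi"

console.log(naive.charCodeAt(2).toString(16)); // d83d
console.log(safe);                             // hi
```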

Real-World Debugging: Four Encoding Mistakes

I tracked down a bug last year where user display names were silently truncated in a database. The column was VARCHAR(50) in MySQL using the old utf8 charset, which in MySQL is an alias for utf8mb3: UTF-8 limited to 3 bytes per character, not 4. Any character above U+FFFF (including many emoji) was either rejected with an Incorrect string value error in strict SQL mode or, in non-strict mode, stored truncated at the first such character with only a warning. The fix was a one-line schema change to utf8mb4, MySQL's actual four-byte UTF-8 charset, plus utf8mb4_unicode_ci collation.
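A cheap application-side guard is to detect such characters before they reach the database. This is a sketch rather than a fix for the schema itself; `needsFourByteUtf8` and `utf8ByteLength` are hypothetical helper names.

```javascript
// True when the text contains any code point above U+FFFF,
// i.e. anything a 3-byte-only UTF-8 column cannot store.
function needsFourByteUtf8(text) {
  return /[\u{10000}-\u{10FFFF}]/u.test(text);
}

// Number of bytes the text occupies once encoded as UTF-8.
function utf8ByteLength(text) {
  return new TextEncoder().encode(text).length;
}

console.log(needsFourByteUtf8("Zoë 中"));  // false (all characters fit in 3 bytes)
console.log(needsFourByteUtf8("Zoë 😀"));  // true
console.log(utf8ByteLength("😀"));         // 4
```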

Here are four similar traps developers commonly hit:

1. HTTP Content-Type without charset. Sending Content-Type: text/html without ; charset=utf-8 lets the browser guess — often incorrectly for pages containing non-ASCII text. Always declare the charset explicitly, either in the header or in a <meta charset="utf-8"> tag.

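2. JSON encoding assumptions. RFC 8259 requires JSON exchanged between systems that are not part of a closed ecosystem to be encoded in UTF-8; the older RFC 7159 also allowed UTF-16 and UTF-32. In practice, every modern parser expects UTF-8, and the RFC forbids adding a byte order mark to JSON sent over a network. Sending a JSON body in UTF-16 will confuse most parsers.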

3. URL percent-encoding. The %E4%B8%AD you see in URLs is the UTF-8 byte sequence for 中, percent-encoded. Tools like the URL Encoder / Decoder make it straightforward to inspect or construct those sequences. The key point: the browser percent-encodes the UTF-8 bytes of the character, not the Unicode code point directly.

4. MySQL utf8 vs utf8mb4. As noted above, MySQL's utf8 charset cannot store characters above U+FFFF: depending on the SQL mode, it either rejects them with an error or truncates the value. Switch to utf8mb4 for any column that might store emoji or supplementary CJK characters.
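The percent-encoding trap (item 3) is the easiest of the four to reproduce in a console. The sketch below encodes by hand and compares against the built-in `encodeURIComponent`; `percentEncodeUtf8` is a hypothetical helper that, unlike the built-in, escapes every byte, including ASCII ones.

```javascript
// Percent-encode by hand: take the UTF-8 bytes, then write each one as %XX.
function percentEncodeUtf8(text) {
  const bytes = new TextEncoder().encode(text);
  return Array.from(bytes, (b) => "%" + b.toString(16).toUpperCase().padStart(2, "0")).join("");
}

console.log(percentEncodeUtf8("中"));           // %E4%B8%AD
console.log(encodeURIComponent("中"));          // %E4%B8%AD
console.log(decodeURIComponent("%E4%B8%AD"));   // 中

// The code point is U+4E2D, yet no "4E2D" appears in the URL form:
console.log(encodeURIComponent("中").includes("4E"));  // false
```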

Choosing Between UTF-8 and UTF-16

For web development, the answer is almost always UTF-8:

  • All HTML and CSS files should be UTF-8.
  • All JSON APIs should produce and consume UTF-8.
  • Databases should use utf8mb4 (MySQL/MariaDB) or UTF8 (PostgreSQL, which has always meant full four-byte UTF-8).

UTF-16 makes sense when you are interfacing with Windows APIs, Java internals, or the DOM (which exposes JavaScript's UTF-16 representation). When you read element.textContent.length, remember you are counting UTF-16 code units, not characters. To count code points instead, use the spread operator or Array.from (neither counts grapheme clusters, which is what Intl.Segmenter is for):

[..."😀"].length           // → 1  (code points)
Array.from("👨‍👩‍👧‍👦").length  // → 7  (code points, not grapheme clusters)
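When the number you actually want is what a user would call "characters", the built-in `Intl.Segmenter` can split a string into grapheme clusters. The following is a sketch that assumes a runtime with `Intl.Segmenter` available (current browsers and recent Node.js releases); `countGraphemes` is a hypothetical helper name, and the family emoji is written with escapes so the invisible joiners are explicit.

```javascript
// Count user-perceived characters (grapheme clusters) in a string.
function countGraphemes(text) {
  const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
  return Array.from(segmenter.segment(text)).length;
}

// Man + ZWJ + woman + ZWJ + girl + ZWJ + boy
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}";

console.log(family.length);              // 11 (UTF-16 code units)
console.log(Array.from(family).length);  // 7  (code points)
console.log(countGraphemes(family));     // 1  (grapheme cluster)
```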

For a precise breakdown of any string into code points, byte sequences, and UTF-16 units at once, the Unicode Character Inspector does all three in the browser with no server round-trip.

What to Remember

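  • ASCII maps 128 characters to bytes 0–127. Those bytes are valid UTF-8 unchanged; UTF-16 keeps the same code point values but stores each one in a 2-byte code unit, so it is not byte-compatible with ASCII.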
  • UTF-8 is the encoding of the web: backward-compatible with ASCII, variable-width (1–4 bytes), used on 98.2% of websites.
  • UTF-16 is the encoding of JavaScript strings: 2 bytes for most characters, 4 bytes (surrogate pair) for characters above U+FFFF.
  • .length in JavaScript counts UTF-16 code units, not characters or grapheme clusters — which is why "😀".length === 2.
  • MySQL's utf8 is not real UTF-8 — use utf8mb4 for full Unicode support.

Encoding bugs are almost always the result of one layer assuming a different encoding than the layer next to it. Naming the encoding explicitly at every boundary — HTTP headers, database column definitions, file open calls — is the most reliable way to prevent them.
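In JavaScript, naming the encoding at a boundary can look like the sketch below: decode incoming bytes with an explicit label and ask `TextDecoder` to fail loudly instead of substituting U+FFFD. `decodeUtf8Strict` is a hypothetical helper, and the sample bytes are made up for illustration.

```javascript
// Decode bytes as UTF-8, throwing on invalid input instead of guessing.
function decodeUtf8Strict(bytes) {
  return new TextDecoder("utf-8", { fatal: true }).decode(bytes);
}

const utf8Bytes = Uint8Array.from([0xe4, 0xb8, 0xad]);          // "中" in UTF-8
const latin1Bytes = Uint8Array.from([0x63, 0x61, 0x66, 0xe9]);  // "café" in Latin-1

console.log(decodeUtf8Strict(utf8Bytes));  // 中

let rejected = false;
try {
  decodeUtf8Strict(latin1Bytes);           // 0xE9 alone is not valid UTF-8
} catch (err) {
  rejected = true;
}
console.log(rejected);                     // true

// The default (non-fatal) decoder hides the mismatch behind U+FFFD.
const lenient = new TextDecoder("utf-8").decode(latin1Bytes);
console.log(lenient === "caf\uFFFD");      // true
```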


Made by Toolora · Updated 2026-06-29