UTF-8 vs UTF-16 vs ASCII: Character Encoding Explained for Web Developers

A practical guide to ASCII, UTF-8, and UTF-16 encoding — with real byte-level examples showing exactly why your emoji broke and why UTF-8 dominates the web.

#encoding #utf-8 #utf-16 #ascii #unicode #web-development

The string "café" is four characters. Copy it into a JavaScript length check and you get 4. Serialize it to a byte array and you might get 4, 5, or 8 bytes — depending entirely on which encoding you used. Pick the wrong one and your API payload silently mangles names, your database field truncates mid-character, or your emoji renders as ðŸŒ. Understanding exactly how ASCII, UTF-8, and UTF-16 work is not academic — it is the difference between a bug you can reproduce and one you spend three hours chasing.

What ASCII Actually Does (and Why It Falls Short)

ASCII (the American Standard Code for Information Interchange) encodes 128 characters using 7 bits. That covers the 26 uppercase and 26 lowercase Latin letters, digits 0–9, punctuation, and 33 control characters (tab, newline, etc.). Each character is stored in one byte, and the mapping is simple: A is decimal 65 (0x41), a is 97 (0x61), 0 is 48 (0x30).

This simplicity made ASCII fast and easy to implement in the 1960s hardware it was designed for. The problem is obvious the moment you leave the English-speaking world. The letter é (U+00E9, as in "café") sits at code point 233 — above ASCII's ceiling of 127. German's ü, Arabic's ع, and every Chinese character are completely outside ASCII's range. Attempting to encode them produces garbled output or an encoding error, depending on your runtime.
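As a quick check of the range described above, here is a minimal sketch that tests whether a string stays inside ASCII's 0–127 ceiling. The helper name isAscii is hypothetical, chosen only for illustration.

```javascript
// Returns true only if every UTF-16 code unit is within ASCII's 7-bit range (0-127).
// `isAscii` is an illustrative helper name, not a built-in.
function isAscii(str) {
  for (let i = 0; i < str.length; i++) {
    if (str.charCodeAt(i) > 0x7f) return false;
  }
  return true;
}

console.log("A".charCodeAt(0));      // 65  (0x41)
console.log("\u00E9".charCodeAt(0)); // 233 (0xE9), above the ASCII ceiling
console.log(isAscii("cafe"));        // true
console.log(isAscii("caf\u00E9"));   // false
```

A check like this is only a guard. It tells you that a string needs a wider encoding, not how to encode it.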

Extended ASCII variants (ISO-8859-1, Windows-1252) added a second half by using the 8th bit, reaching 256 characters. That covered Western European languages, but each region defined its own 128 extra slots differently. A file encoded in ISO-8859-5 (Cyrillic) and decoded as Windows-1252 renders Russian letters as accented Latin letters and symbols.
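To make the code-page clash concrete, the sketch below decodes the same six bytes two ways. Both decoder functions are hypothetical helpers written for this example, and the ISO-8859-5 one is deliberately partial: it only covers ASCII and the main Cyrillic block (bytes 0xB0–0xEF).

```javascript
// Bytes for the Russian word "Привет" as stored in ISO-8859-5.
const cyrillicBytes = Uint8Array.from([0xbf, 0xe0, 0xd8, 0xd2, 0xd5, 0xe2]);

// ISO-8859-1 maps each byte straight to the code point with the same value.
// Windows-1252 agrees with it for every byte from 0xA0 to 0xFF.
function decodeLatin1(bytes) {
  return Array.from(bytes, (b) => String.fromCharCode(b)).join("");
}

// Partial ISO-8859-5 decoder (illustrative only): ASCII plus the main
// Cyrillic block, where bytes 0xB0-0xEF map to U+0410-U+044F.
function decodeIso88595Partial(bytes) {
  return Array.from(bytes, (b) => {
    if (b <= 0x7f) return String.fromCharCode(b);
    if (b >= 0xb0 && b <= 0xef) return String.fromCharCode(b + 0x360);
    throw new RangeError("byte outside the range this sketch covers: " + b);
  }).join("");
}

console.log(decodeIso88595Partial(cyrillicBytes)); // "Привет"
console.log(decodeLatin1(cyrillicBytes));          // "¿àØÒÕâ"
```

The bytes never change. Only the table used to interpret them does, which is why the encoding has to travel with the data.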

UTF-8: Why 98% of Websites Use It

UTF-8 (Unicode Transformation Format, 8-bit) is a variable-width encoding that maps every Unicode scalar value (the 1,114,112 code points minus the 2,048 surrogate code points reserved for UTF-16) to between 1 and 4 bytes. Its key design decision: the first 128 code points are encoded in exactly one byte, identical to ASCII. A UTF-8 file containing only ASCII characters is byte-for-byte identical to an ASCII file.

Per W3Techs' global survey (June 2024), UTF-8 accounts for 98.2% of all websites. That dominance comes from a few concrete properties:

  • Backward compatibility. Any tool or protocol that handled ASCII safely also handles UTF-8 for ASCII input, because the bytes are the same.
  • No byte-order mark required. UTF-8 bytes are unambiguous without a BOM.
  • Space efficiency for ASCII-heavy content. An HTML document that is mostly English and punctuation stays compact, using one byte per character for the common cases.

The encoding rules for multi-byte sequences follow a clear pattern: a character above U+007F uses 2–4 bytes, with the first byte signaling the length. U+00E9 (é) uses 2 bytes: 0xC3 0xA9. U+1F30D (🌍) uses 4 bytes: 0xF0 0x9F 0x8C 0x8D.
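The bit layout behind those byte sequences can be sketched by hand. The function below is a minimal illustration, not a replacement for the built-in TextEncoder. The names encodeCodePointUtf8 and hexOf are hypothetical.

```javascript
// Encode one Unicode code point into UTF-8 bytes by hand.
// The lead byte's high bits signal the sequence length;
// every continuation byte starts with the bits 10.
function encodeCodePointUtf8(cp) {
  if (cp < 0 || cp > 0x10ffff || (cp >= 0xd800 && cp <= 0xdfff)) {
    throw new RangeError("not a Unicode scalar value");
  }
  if (cp <= 0x7f) return [cp]; // 0xxxxxxx
  if (cp <= 0x7ff) {
    return [0xc0 | (cp >> 6), 0x80 | (cp & 0x3f)]; // 110xxxxx 10xxxxxx
  }
  if (cp <= 0xffff) {
    // 1110xxxx 10xxxxxx 10xxxxxx
    return [0xe0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3f), 0x80 | (cp & 0x3f)];
  }
  // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
  return [
    0xf0 | (cp >> 18),
    0x80 | ((cp >> 12) & 0x3f),
    0x80 | ((cp >> 6) & 0x3f),
    0x80 | (cp & 0x3f),
  ];
}

// Format a byte array as upper-case hex pairs.
function hexOf(bytes) {
  return bytes.map((b) => b.toString(16).toUpperCase().padStart(2, "0")).join(" ");
}

console.log(hexOf(encodeCodePointUtf8(0x41)));    // "41"
console.log(hexOf(encodeCodePointUtf8(0xe9)));    // "C3 A9"
console.log(hexOf(encodeCodePointUtf8(0x1f30d))); // "F0 9F 8C 8D"
```

In real code, new TextEncoder().encode(str) does the same job for whole strings.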

You can see the exact byte counts for any string using the UTF-8 Byte Counter, which breaks down bytes, UTF-16 code units, and Unicode code points side by side.

UTF-16: JavaScript's Hidden Encoding

UTF-16 encodes every Unicode code point as either one 16-bit unit (2 bytes) or a surrogate pair of two 16-bit units (4 bytes). Code points from U+0000 to U+FFFF — the Basic Multilingual Plane — fit in a single 16-bit unit. Code points above U+FFFF (most emoji, many historic scripts) require a surrogate pair.
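The surrogate-pair arithmetic is short enough to sketch directly. The helper names toUtf16CodeUnits and fromSurrogatePair are hypothetical, and the sketch assumes its inputs are valid (it does no range checking).

```javascript
// Split a code point into UTF-16 code units.
// BMP code points fit in one unit; anything above U+FFFF becomes a surrogate pair.
function toUtf16CodeUnits(cp) {
  if (cp <= 0xffff) return [cp];
  const offset = cp - 0x10000;            // leaves a 20-bit value
  const high = 0xd800 + (offset >> 10);   // top 10 bits
  const low = 0xdc00 + (offset & 0x3ff);  // bottom 10 bits
  return [high, low];
}

// Reverse the mapping: combine a high and a low surrogate into one code point.
function fromSurrogatePair(high, low) {
  return 0x10000 + ((high - 0xd800) << 10) + (low - 0xdc00);
}

console.log(toUtf16CodeUnits(0x1f30d).map((u) => u.toString(16))); // [ 'd83c', 'df0d' ]
console.log(fromSurrogatePair(0xd83c, 0xdf0d).toString(16));       // "1f30d"
```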

Java, C#, JavaScript, and the Windows API all use UTF-16 internally. This is why "🌍".length returns 2 in JavaScript — the runtime is counting UTF-16 code units, not Unicode characters:

"🌍".length       // → 2   (two UTF-16 code units: 0xD83C, 0xDF0D)
[..."🌍"].length  // → 1   (spread forces code-point iteration)
"🌍".codePointAt(0).toString(16)  // → "1f30d"

This causes real production bugs. I ran into one in a user-facing username validation: a length limit of 20 characters was silently rejecting usernames with 19 Latin letters and 1 emoji, because the emoji counted as two code units and pushed the measured length to 21. Switching to [...str].length or iterating with str.codePointAt fixed it.
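A code-point-aware length check along those lines might look like the following sketch. The function names and the default limit of 20 are assumptions for illustration, not the original validation code.

```javascript
// Count Unicode code points rather than UTF-16 code units.
// String iteration walks code points, so a surrogate pair counts once.
function countCodePoints(str) {
  let count = 0;
  for (const _ of str) count++;
  return count;
}

// Hypothetical validator: the limit applies to code points.
function isUsernameWithinLimit(username, maxCodePoints = 20) {
  return countCodePoints(username) <= maxCodePoints;
}

const candidate = "a".repeat(19) + "\u{1F30D}"; // 19 letters + 1 emoji
console.log(candidate.length);                  // 21 (code units)
console.log(countCodePoints(candidate));        // 20 (code points)
console.log(isUsernameWithinLimit(candidate));  // true
```

Code points are still not the same as what a user perceives as one character: an emoji with a skin-tone modifier or a ZWJ sequence spans several code points. Where that matters, Intl.Segmenter (in runtimes that provide it) can count grapheme clusters instead.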

UTF-16 also has a byte-order problem. Because each unit is 2 bytes, a reader needs to know which byte comes first. A UTF-16 file typically signals this with a BOM at the start: 0xFE 0xFF for big-endian, 0xFF 0xFE for little-endian. Read the units in the wrong order and every character decodes to a different code point.
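The effect of guessing the byte order wrong can be shown with a small sketch. Both helpers are hypothetical names written for this example. The decoder ignores any BOM and simply reads 16-bit units in the order you ask for.

```javascript
// Read raw bytes as UTF-16 code units in the requested byte order.
function decodeUtf16Bytes(bytes, littleEndian) {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  let out = "";
  for (let i = 0; i + 1 < bytes.byteLength; i += 2) {
    out += String.fromCharCode(view.getUint16(i, littleEndian));
  }
  return out;
}

// Inspect the first two bytes for a UTF-16 byte-order mark.
function detectUtf16Bom(bytes) {
  if (bytes.length >= 2) {
    if (bytes[0] === 0xfe && bytes[1] === 0xff) return "BE";
    if (bytes[0] === 0xff && bytes[1] === 0xfe) return "LE";
  }
  return null; // no BOM: the byte order must come from somewhere else
}

// "café" in UTF-16 BE, without a BOM.
const cafeBe = Uint8Array.from([0x00, 0x63, 0x00, 0x61, 0x00, 0x66, 0x00, 0xe9]);
console.log(decodeUtf16Bytes(cafeBe, false)); // "café"
console.log(decodeUtf16Bytes(cafeBe, true));  // U+6300 U+6100 U+6600 U+E900: unrelated characters
console.log(detectUtf16Bom(cafeBe));          // null
```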

A Real Encoding Comparison: "café" and "🌍"

Here are the exact byte representations of two strings in each encoding:

String: café (4 Unicode code points: U+0063, U+0061, U+0066, U+00E9)

| Encoding | Bytes | Count |
|----------|-------|-------|
| ASCII | cannot encode é | n/a |
| UTF-8 | 63 61 66 C3 A9 | 5 bytes |
| UTF-16 BE | 00 63 00 61 00 66 00 E9 | 8 bytes |

String: 🌍 (1 Unicode code point: U+1F30D)

| Encoding | Bytes | Count |
|----------|-------|-------|
| ASCII | cannot encode | n/a |
| UTF-8 | F0 9F 8C 8D | 4 bytes |
| UTF-16 BE | D8 3C DF 0D (surrogate pair) | 4 bytes |
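
The rows in both tables can be reproduced with a few lines of JavaScript. This is a sketch: hexBytes, utf8Hex, and utf16BeHex are hypothetical helper names, and the UTF-16 output is big-endian with no BOM, to match the tables.

```javascript
// Format bytes as upper-case hex pairs separated by spaces.
function hexBytes(bytes) {
  return Array.from(bytes, (b) => b.toString(16).toUpperCase().padStart(2, "0")).join(" ");
}

// UTF-8 bytes via the built-in TextEncoder.
function utf8Hex(str) {
  return hexBytes(new TextEncoder().encode(str));
}

// UTF-16 BE bytes: each code unit is written high byte first.
function utf16BeHex(str) {
  const bytes = [];
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    bytes.push(unit >> 8, unit & 0xff);
  }
  return hexBytes(bytes);
}

console.log(utf8Hex("caf\u00E9"));    // "63 61 66 C3 A9"
console.log(utf16BeHex("caf\u00E9")); // "00 63 00 61 00 66 00 E9"
console.log(utf8Hex("\u{1F30D}"));    // "F0 9F 8C 8D"
console.log(utf16BeHex("\u{1F30D}")); // "D8 3C DF 0D"
```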

Notice that for this particular emoji, UTF-8 and UTF-16 use the same number of bytes. For a string that is mostly ASCII (an HTML document, a JSON API payload, most English source code), UTF-8 is close to 50% smaller than UTF-16, and exactly 50% smaller when the text is pure ASCII (1 byte per character versus 2). For a string that is entirely characters in the U+0800–U+FFFF range (Chinese, Japanese, Korean in the BMP), UTF-8 uses 3 bytes per character while UTF-16 uses 2, so UTF-16 wins there.
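To compare sizes for your own content, a rough measurement like the one below is usually enough. The function name encodedSizes is hypothetical, and the UTF-16 figure excludes any BOM.

```javascript
// Byte sizes of the same string in UTF-8 and UTF-16 (no BOM).
function encodedSizes(str) {
  const utf8 = new TextEncoder().encode(str).length; // actual encoded bytes
  const utf16 = str.length * 2;                      // 2 bytes per code unit
  return { utf8, utf16 };
}

console.log(encodedSizes("hello"));              // { utf8: 5, utf16: 10 }
console.log(encodedSizes("caf\u00E9"));          // { utf8: 5, utf16: 8 }
console.log(encodedSizes("\u65E5\u672C\u8A9E")); // { utf8: 9, utf16: 6 }  ("日本語")
console.log(encodedSizes("\u{1F30D}"));          // { utf8: 4, utf16: 4 }
```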

You can verify any string's exact byte breakdown with the Unicode Character Inspector, which shows each code point's UTF-8 byte sequence, its Unicode block, and its category.

Which Encoding Should You Use and When?

Use UTF-8 for almost everything:

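  • HTML, XML, JSON: the HTML standard requires UTF-8 for new documents, XML parsers assume UTF-8 when no encoding is declared and there is no BOM, and RFC 8259 requires UTF-8 for JSON exchanged between systems. Declaring charset=utf-8 in the HTTP Content-Type header removes the guesswork for text formats.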
  • Source code files: most modern editors and toolchains default to UTF-8 without a BOM.
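  • Database text columns: declare CHARACTER SET utf8mb4 in MySQL (not utf8, an alias for utf8mb3 that stores at most 3 bytes per character and therefore cannot hold emoji; depending on the SQL mode, the insert fails or the value is truncated with a warning).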
  • File storage and APIs — UTF-8 is the interchange format the rest of the world expects.

UTF-16 is unavoidable when working within environments that mandate it:

  • Java String and char are UTF-16 code units internally.
  • Windows API wide-string functions (wchar_t, LPWSTR) are UTF-16 LE.
  • JavaScript string methods measure length in UTF-16 code units — be explicit about whether you want length (code units) or [...str].length (code points) in validation logic.

Avoid ASCII-only encoding for any text that might travel outside your system. Even a field you call "English only" will eventually receive a curly apostrophe (’, U+2019) pasted from Word, and your ASCII-only handler will either mangle it or throw.

The practical rule: set UTF-8 at the boundary (database connection, HTTP Content-Type, file write) and handle code points (not code units) when measuring or splitting user-facing strings.
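
One place this rule bites is a storage limit expressed in bytes. The sketch below truncates a string to a UTF-8 byte budget without cutting a character in half. The names utf8Length and truncateUtf8 are hypothetical, and the approach (walk code points, stop before the budget is exceeded) is one reasonable option rather than the only one.

```javascript
// Number of UTF-8 bytes needed for one code point.
function utf8Length(codePoint) {
  if (codePoint <= 0x7f) return 1;
  if (codePoint <= 0x7ff) return 2;
  if (codePoint <= 0xffff) return 3;
  return 4;
}

// Keep as many whole code points as fit in maxBytes of UTF-8.
function truncateUtf8(str, maxBytes) {
  let used = 0;
  let out = "";
  for (const ch of str) {                 // iterates by code point
    const size = utf8Length(ch.codePointAt(0));
    if (used + size > maxBytes) break;    // next character would not fit
    used += size;
    out += ch;
  }
  return out;
}

console.log(truncateUtf8("caf\u00E9", 4));   // "caf"  (é needs 2 bytes, only 1 left)
console.log(truncateUtf8("caf\u00E9", 5));   // "café"
console.log(truncateUtf8("a\u{1F30D}b", 4)); // "a"    (the emoji needs 4 bytes, only 3 left)
```

This works at the code-point level, so it can still split a multi-code-point emoji sequence. It only guarantees that the output is valid UTF-8 within the byte budget.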


Made by Toolora · Updated 2026-06-19