Unicode, UTF-8, UTF-16, and UTF-32: A Byte-Level Breakdown Every Developer Needs

When I first hit a UnicodeDecodeError in Python at 2 AM, I had no idea there was a difference between Unicode and UTF-8. I just wanted the emoji to render. The confusion makes sense: the terms get mixed up constantly, even in official documentation.

The short version: Unicode is a character inventory. UTF-8, UTF-16, and UTF-32 are three different ways to store that inventory as bytes. Treating them as synonyms is how you get garbled text, file-read errors, and that classic Ã© appearing where é should be. (The bytes 0xC3 0xA9 are the correct UTF-8 encoding of é; the garbage appears when they are decoded as Latin-1.)

Here is what each one actually does to your bytes — with real numbers.

Unicode Is Not an Encoding — It's a Map

Unicode assigns a unique number, called a code point, to every character in every writing system. The letter A is U+0041. The euro sign € is U+20AC. The guitar emoji 🎸 is U+1F3B8. As of Unicode 16.0, the standard covers 154,998 characters across 168 scripts.

A code point is just an integer from 0 to 1,114,111 (0x10FFFF). How you store that integer in a file or send it across a network is a separate question — and that is where encodings come in.
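
To make the "just an integer" point concrete, here is a minimal Python sketch using the built-in `ord()` and `chr()`. The helper name `code_point_label` is made up for this example and is not part of any library.

```python
# Hypothetical helper for illustration: format a character's code point
# in the conventional U+XXXX notation.
def code_point_label(ch: str) -> str:
    return f"U+{ord(ch):04X}"

# Upper bound of the Unicode code space (1,114,111 in decimal).
MAX_CODE_POINT = 0x10FFFF

# A, the euro sign, and the guitar emoji, written as escapes so the
# source file itself stays pure ASCII.
for ch in ["A", "\u20ac", "\U0001F3B8"]:
    print(code_point_label(ch), ord(ch))

# chr() goes the other way: integer to character.
euro = chr(0x20AC)
```

Nothing here involves bytes yet. The integers only become bytes once an encoding is chosen.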

Think of it this way: Unicode is the phone book. UTF-8, UTF-16, and UTF-32 are three different paper sizes you could print it on. The entries are identical; the layout differs.

UTF-8 — Why 98% of the Web Chose It

UTF-8 is a variable-width encoding. It uses 1 to 4 bytes per character depending on where the code point falls:

| Code point range | Bytes used | Characters in this range |
|---|---|---|
| U+0000–U+007F | 1 | ASCII: A–Z, 0–9, punctuation |
| U+0080–U+07FF | 2 | Latin extensions, Greek, Cyrillic, Arabic, Hebrew |
| U+0800–U+FFFF | 3 | CJK ideographs, most symbols |
| U+10000–U+10FFFF | 4 | Rare CJK extensions, most modern emoji |
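
The ranges in the table translate almost directly into code. The sketch below is one way to write that lookup in Python; `utf8_length` is a name invented for this example, and the loop at the end cross-checks it against Python's own UTF-8 encoder.

```python
def utf8_length(code_point: int) -> int:
    """Bytes UTF-8 needs for one code point, following the ranges in the table."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode code space")
    if 0xD800 <= code_point <= 0xDFFF:
        # Surrogates are reserved for UTF-16 and are not valid in UTF-8.
        raise ValueError("surrogate code points cannot be encoded in UTF-8")
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    return 4

# Cross-check against the real encoder for one sample from each range:
# A, e-acute, euro sign, guitar emoji.
for cp in (0x41, 0xE9, 0x20AC, 0x1F3B8):
    assert utf8_length(cp) == len(chr(cp).encode("utf-8"))
```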

As of January 2024, UTF-8 is used by 98.2% of websites surveyed by W3Techs, up from 40% in 2008. The driving reason is practical: ASCII text (English, most source code, JSON keys) encodes to 1 byte per character. A 10 KB JavaScript file containing only ASCII is exactly 10 KB in UTF-8. In UTF-32 that same file would be 40 KB.

The critical property is backward compatibility with ASCII. The UTF-8 byte sequence for any character in the U+0000–U+007F range is identical to its single-byte ASCII value. A C string parser from 1980 can read UTF-8 ASCII without a single modification.

Here is what "Hello" looks like in raw bytes under each encoding (excluding the byte-order mark for now):

UTF-8:  48 65 6C 6C 6F                          (5 bytes)
UTF-16: 48 00 65 00 6C 00 6C 00 6F 00           (10 bytes, little-endian)
UTF-32: 48 00 00 00 65 00 00 00 6C 00 00 00 ...  (20 bytes, little-endian)

That 5-to-20 byte ratio for plain ASCII text is a large part of why network protocols default to UTF-8.
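
The byte dump above can be reproduced with a few lines of Python. This is a small sketch, with `hex_dump` as a made-up helper name; it relies on the `utf-16-le` and `utf-32-le` codec names, which encode in little-endian order without writing a byte-order mark.

```python
def hex_dump(text: str, encoding: str) -> str:
    """Encode text and return its bytes as space-separated uppercase hex."""
    return text.encode(encoding).hex(" ").upper()

for encoding in ("utf-8", "utf-16-le", "utf-32-le"):
    size = len("Hello".encode(encoding))
    print(f"{encoding:10} {size:2} bytes  {hex_dump('Hello', encoding)}")

# ASCII compatibility: for pure ASCII text the two encoders agree byte for byte.
ascii_matches_utf8 = "Hello".encode("ascii") == "Hello".encode("utf-8")
```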

UTF-16 — The Windows, Java, and JavaScript Encoding

UTF-16 uses 2 or 4 bytes per character. Characters in the Basic Multilingual Plane (U+0000–U+FFFF) take 2 bytes. Characters above U+FFFF — including most modern emoji — require a surrogate pair: two consecutive 2-byte values that together encode the code point.
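
The surrogate-pair arithmetic is short enough to write out. The sketch below follows the standard UTF-16 rule: subtract 0x10000, then split the remaining 20 bits into two 10-bit halves. The function name `to_surrogate_pair` is an assumption for this example.

```python
def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a code point above U+FFFF into its UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low

high, low = to_surrogate_pair(0x1F3B8)     # guitar emoji
print(f"{high:04X} {low:04X}")

# Cross-check against Python's big-endian UTF-16 encoder.
expected = "\U0001F3B8".encode("utf-16-be")
assert expected == high.to_bytes(2, "big") + low.to_bytes(2, "big")
```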

Windows internals, Java String, .NET string, and the JavaScript/ECMAScript string type all use UTF-16 internally. That is why "🎸".length === 2 in JavaScript: the engine counts UTF-16 code units, not characters. This is the source of off-by-one bugs whenever you slice a string by index in languages with UTF-16 runtime strings.
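
Python strings are indexed by code point rather than by UTF-16 code unit, which makes it a convenient place to see both counts side by side. In this sketch, `utf16_code_units` is a hypothetical helper that reports the number a UTF-16 runtime would give for a string's length.

```python
def utf16_code_units(text: str) -> int:
    """Length of text as a UTF-16 runtime would count it (2 bytes per unit)."""
    return len(text.encode("utf-16-le")) // 2

guitar = "\U0001F3B8"   # guitar emoji, one code point above U+FFFF
euro = "\u20ac"         # euro sign, inside the Basic Multilingual Plane

print(len(guitar), utf16_code_units(guitar))  # code points vs code units
print(len(euro), utf16_code_units(euro))
```

The two counts differ only for characters above U+FFFF, which is why these bugs tend to surface the first time emoji show up in real data.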

UTF-16 also introduces byte order. The two-byte sequence FF FE or FE FF at the start of a file — the byte-order mark (BOM) — tells the reader whether the 2-byte pairs are little-endian or big-endian. Forgetting the BOM is how UTF-16 files end up scrambled when opened on a different operating system.
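
A rough sketch of BOM sniffing in Python, using the constants from the standard `codecs` module. The function name `detect_utf16_byte_order` is invented here, and the check is deliberately simplistic: a UTF-32 little-endian BOM (FF FE 00 00) also starts with FF FE, so a production detector would need to rule that out first.

```python
import codecs

def detect_utf16_byte_order(data: bytes):
    """Return the byte order announced by a UTF-16 BOM, or None if there is none."""
    if data.startswith(codecs.BOM_UTF16_LE):   # FF FE
        return "little-endian"
    if data.startswith(codecs.BOM_UTF16_BE):   # FE FF
        return "big-endian"
    return None

# With a BOM, the generic "utf-16" codec picks the right byte order and strips the mark.
with_bom = codecs.BOM_UTF16_BE + "A".encode("utf-16-be")
decoded = with_bom.decode("utf-16")

# Without one, guessing wrong turns A (41 00) into U+4100, an unrelated CJK ideograph.
scrambled = "A".encode("utf-16-le").decode("utf-16-be")
print(detect_utf16_byte_order(with_bom), ascii(decoded), ascii(scrambled))
```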

You can inspect any character's exact UTF-16 encoding (and compare it to UTF-8) using the Unicode Code Point Explorer, which shows the full byte sequence for any code point side by side.

UTF-32 — Maximum Simplicity, Maximum Size

UTF-32 uses exactly 4 bytes for every code point without exception. In little-endian order, the character A (U+0041) becomes 41 00 00 00 and the character 🎸 (U+1F3B8) becomes B8 F3 01 00. No variable width and no surrogates; byte order is the only detail left to get right.

The advantage is O(1) random access. string[7] means "code point at index 7", with no surrogate calculation and no variable-length counting. (A user-perceived character built from combining marks can still span several code points.) Some C and C++ programs that do heavy string manipulation use UTF-32 internally via std::u32string, then convert to UTF-8 at I/O boundaries.
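
The constant-time lookup amounts to one multiplication and one slice. Here is a minimal Python sketch over raw little-endian UTF-32 bytes; `code_point_at` is a name made up for this example.

```python
def code_point_at(utf32_le: bytes, index: int) -> int:
    """Read the code point at a given index from little-endian UTF-32 bytes."""
    start = index * 4                      # every code point is exactly 4 bytes
    if index < 0 or start + 4 > len(utf32_le):
        raise IndexError("code point index out of range")
    return int.from_bytes(utf32_le[start:start + 4], "little")

# A, euro sign, guitar emoji, encoded without a BOM.
data = "A\u20ac\U0001F3B8".encode("utf-32-le")
print(f"U+{code_point_at(data, 2):04X}")
```

The same lookup on UTF-8 or UTF-16 bytes would have to scan from the start, since earlier characters may be wider or narrower than the one being sought.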

The cost is clear from the numbers. A 1,000-character Japanese text in UTF-8 uses roughly 3,000 bytes (3 bytes per CJK character). In UTF-32 that same text uses exactly 4,000 bytes. For anything stored on disk or sent over a network, UTF-32 is rarely worth the overhead.

Real Example — Three Characters, Three Encodings

I ran the string "A€🎸" through a byte inspector to get the exact output under each encoding (little-endian, no BOM):

| Character | Code Point | UTF-8 | UTF-16 | UTF-32 |
|---|---|---|---|---|
| A | U+0041 | 41 (1 byte) | 41 00 (2 bytes) | 41 00 00 00 (4 bytes) |
| € | U+20AC | E2 82 AC (3 bytes) | AC 20 (2 bytes) | AC 20 00 00 (4 bytes) |
| 🎸 | U+1F3B8 | F0 9F 8E B8 (4 bytes) | 3C D8 B8 DF (surrogate pair, 4 bytes) | B8 F3 01 00 (4 bytes) |
| Total | | 8 bytes | 8 bytes | 12 bytes |

For this three-character mix of ASCII, a BMP symbol, and an emoji, UTF-8 and UTF-16 tie at 8 bytes. UTF-32 needs 12. For pure CJK text, UTF-16 can actually be smaller than UTF-8 (2 bytes vs 3 per character). For ASCII-heavy content — most English source code — UTF-8 wins by roughly 2×. UTF-32 never wins on size.
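
These size comparisons are easy to check for any string. The sketch below measures three samples in Python: the mixed string from the table, a run of 1,000 copies of one CJK ideograph standing in for Japanese text, and 1,000 ASCII letters. `encoded_sizes` is a hypothetical helper name.

```python
def encoded_sizes(text: str) -> dict[str, int]:
    """Byte count of text under each encoding (little-endian, no BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }

mixed = encoded_sizes("A\u20ac\U0001F3B8")   # A, euro sign, guitar emoji
cjk = encoded_sizes("\u65e5" * 1000)         # U+65E5 repeated 1,000 times
ascii_only = encoded_sizes("a" * 1000)

for label, sizes in (("mixed", mixed), ("cjk", cjk), ("ascii", ascii_only)):
    print(label, sizes)
```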

To measure exactly how many bytes UTF-8 uses for your own text, the UTF-8 Byte Counter tool counts bytes, characters, and code points separately so you can see the difference for any string.

Which Encoding to Use and When

Default to UTF-8 for almost everything: files, databases, REST APIs, HTML pages, JSON, configuration files, and log output. The web's near-universal adoption of UTF-8 means every major library, parser, and browser handles it correctly without configuration.

Meet UTF-16 at its boundaries. When calling Windows APIs (LPCWSTR), writing Java or .NET string operations, or debugging JavaScript string length bugs, you are in UTF-16 territory. Convert to UTF-8 at system edges and you avoid most interoperability problems.

Reserve UTF-32 for specialized in-memory work. If your code needs constant-time character indexing and the data stays in memory without being serialized, u32string or a []rune slice (Go's approach) can simplify string logic. Write UTF-8 when saving or sending.

One practical rule covers most encoding bugs: if your text contains emoji or non-Latin script and you are seeing wrong lengths, missing characters, or two boxes where one character should appear, check which encoding layer you are on. A £ rendered as Â£, or a € rendered as â‚¬, is usually a UTF-8 sequence being decoded as Latin-1 or Windows-1252. A 🎸 that counts as two characters is usually a UTF-16 surrogate pair being counted as code units.
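
The mis-decoding symptom can be reproduced on purpose, which helps when confirming a diagnosis. This Python sketch uses the standard `latin-1` and `cp1252` (Windows-1252) codecs; `mis_decode` and `repair` are names invented for this example, and the repair only works when the garbled text has not been altered further.

```python
def mis_decode(text: str, wrong_encoding: str) -> str:
    """Encode as UTF-8, then decode with the wrong codec, as a buggy reader would."""
    return text.encode("utf-8").decode(wrong_encoding)

def repair(garbled: str, wrong_encoding: str) -> str:
    """Undo mis_decode by reversing the two steps."""
    return garbled.encode(wrong_encoding).decode("utf-8")

# Pound sign: UTF-8 bytes C2 A3 read as Latin-1 become two characters.
pound_garbled = mis_decode("\u00a3", "latin-1")
# Euro sign: UTF-8 bytes E2 82 AC read as Windows-1252 become three characters.
euro_garbled = mis_decode("\u20ac", "cp1252")

# ascii() keeps the output readable on terminals that are not UTF-8.
print(ascii(pound_garbled), ascii(euro_garbled))
print(ascii(repair(euro_garbled, "cp1252")))
```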

Understanding the difference between what Unicode is (a map of code points) and what the UTFs do (convert those code points to bytes) is the single mental model that makes every encoding error diagnosable.

Made by Toolora · Updated 2026-06-29