UTF-8 vs ASCII vs Unicode: A Practical Guide to Debugging Broken Characters

What mojibake like cafÃ© actually tells you, how UTF-8, ASCII, and Unicode relate, and a byte-level workflow for finding where text got corrupted.

Published By Li Lei
#utf-8 #unicode #ascii #encoding #debugging

You know the symptoms. A customer named José shows up in your admin panel as JosÃ©. A CSV export renders â€œquotesâ€ where curly quotes should be. A log line contains garbage characters and nothing else. These are not random glitches: each broken shape is a specific, diagnosable failure, and once you can read the pattern you can usually name the bug before opening the code. This guide sorts out what UTF-8, ASCII, and Unicode actually are, then walks through debugging broken characters at the byte level.

Unicode is the catalog, UTF-8 is the wire format, ASCII is the legacy subset

The three terms get used interchangeably, and that confusion is where many encoding bugs start.

Unicode is a catalog of characters. It assigns every character a number called a code point: A is U+0041, é is U+00E9, 中 is U+4E2D, 🎉 is U+1F389. Unicode 16.0 defines 154,998 characters (Unicode Consortium, September 2024). A code point is an abstract number; it says nothing about bytes.

UTF-8 is one way to turn those numbers into bytes. It is variable-width: code points up to U+007F take 1 byte, up to U+07FF take 2, most CJK takes 3, and emoji take 4. Per W3Techs, UTF-8 is the declared encoding of more than 98% of all websites as of 2024, so on the web "text" effectively means "UTF-8 bytes" unless something has gone wrong.
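The width rule is easy to check for yourself. The following is a minimal Python sketch using only the standard library; the sample characters are the four code points named earlier (U+0041, U+00E9, U+4E2D, U+1F389), and the variable names are illustrative.

```python
# Encode one sample character from each UTF-8 width class and count the bytes.
samples = ["A", "\u00e9", "\u4e2d", "\U0001F389"]  # A, e-acute, a CJK ideograph, an emoji

for ch in samples:
    encoded = ch.encode("utf-8")
    # Print the code point, the byte count, and the raw bytes in hex.
    print(f"U+{ord(ch):04X}: {len(encoded)} byte(s) -> {encoded.hex(' ')}")

# Collect the widths so they can be compared against the 1/2/3/4 rule.
widths = [len(ch.encode("utf-8")) for ch in samples]
```

The hex column is the useful part when debugging: c3 a9 for é is the byte pair that reappears in every mojibake example in this guide.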

ASCII is a 1963-era encoding covering 128 characters: the English alphabet, digits, punctuation, and control codes. Its entire range fits in bytes 0x00–0x7F, and UTF-8 was deliberately designed so that those bytes mean the same thing in both. That is why a file containing only English text is simultaneously valid ASCII and valid UTF-8, and why encoding bugs stay invisible until the first é or 中 enters your data. If you want to see exactly which 128 characters are "safe" this way, the ASCII table reference lays out the full 0–127 range with hex and decimal values.

So the relationship is: Unicode assigns the numbers, UTF-8 encodes the numbers as bytes, and ASCII is the 128-character subset where every encoding agrees.
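A short Python sketch makes the subset relationship concrete. It assumes nothing beyond the standard codecs; the sample strings and variable names are illustrative.

```python
# English-only text: the ASCII and UTF-8 encodings are byte-for-byte identical.
english = "Hello, world"
same_bytes = english.encode("ascii") == english.encode("utf-8")
all_low = all(b <= 0x7F for b in english.encode("utf-8"))

# One accented character is enough to make the encodings disagree.
accented = "caf\u00e9"                   # "café"
utf8 = accented.encode("utf-8")          # b'caf\xc3\xa9' (5 bytes)
latin1 = accented.encode("latin-1")      # b'caf\xe9'     (4 bytes)

# The ASCII part still matches; only the non-ASCII tail differs.
ascii_prefix_same = utf8[:3] == latin1[:3]

# ASCII itself cannot represent the character at all.
try:
    accented.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
```

This is why a test suite built on English fixtures tends to pass regardless of how the encodings are configured.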

Reading mojibake: each broken shape names its own bug

Here is a real byte-level example. The string café encoded as UTF-8 is five bytes:

Input:  café
UTF-8:  0x63 0x61 0x66 0xC3 0xA9
         c    a    f    é (2 bytes)

Now suppose a consumer reads those five bytes believing they are Windows-1252 (the default of many older Windows tools and misconfigured databases). In Windows-1252, 0xC3 is Ã and 0xA9 is ©, so the output becomes:

Output: cafÃ©

That exact shape, one accented character turning into two Latin-looking characters, usually starting with Ã or â, is the signature of UTF-8 bytes decoded as a single-byte encoding. The data is intact; only the reader is wrong. If you decode the same five bytes as UTF-8, you get café back.
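The misread and its reversal can be reproduced in a few lines. This is a minimal sketch using Python's built-in cp1252 codec as the stand-in for Windows-1252; the variable names are illustrative.

```python
original = "caf\u00e9"                      # "café"
utf8_bytes = original.encode("utf-8")       # 63 61 66 c3 a9

# Wrong reader: the same five bytes interpreted as Windows-1252.
mojibake = utf8_bytes.decode("cp1252")      # c, a, f, then U+00C3 and U+00A9

# Undo: turn the mojibake back into the original bytes, then decode correctly.
repaired = mojibake.encode("cp1252").decode("utf-8")

print(mojibake, "->", repaired)
```

One caveat: Python's cp1252 codec leaves five byte values (0x81, 0x8D, 0x8F, 0x90, 0x9D) undefined, so the misread step raises an error for UTF-8 sequences containing them. Real-world tools that produced the damage often mapped those bytes to control characters instead, so a repair script may need to handle that case separately.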

Other shapes point elsewhere:

  • caf�: the replacement character U+FFFD means the decoder hit bytes that are not valid UTF-8, typically Latin-1 data (where é is the lone byte 0xE9) being read as UTF-8. Here the reader is right and the data is wrong.
  • caf?: a question mark means a lossy conversion already happened, usually a database or terminal transcoding into a charset that lacks the character. This one is unrecoverable; find the conversion point.
  • cafÃƒÂ©: double mojibake. UTF-8 was decoded as Windows-1252 and the wrong result was re-encoded as UTF-8. Each round trip roughly doubles the junk, which is why you sometimes see four or six garbage characters per accent.
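Each of the three shapes above can be reproduced on purpose, which is a quick way to confirm a diagnosis. A sketch with Python's standard codecs; the variable names are illustrative, and ASCII stands in here for "a charset that lacks the character".

```python
text = "caf\u00e9"  # "café"

# Shape 1: Latin-1 bytes read as UTF-8 produce the replacement character U+FFFD.
latin1_bytes = text.encode("latin-1")                        # b'caf\xe9'
replaced = latin1_bytes.decode("utf-8", errors="replace")    # 'caf\ufffd'

# Shape 2: lossy transcoding into a charset without the character gives '?'.
lossy = text.encode("ascii", errors="replace").decode("ascii")   # 'caf?'

# Shape 3: double mojibake, i.e. the UTF-8-as-Windows-1252 misread applied twice.
once = text.encode("utf-8").decode("cp1252")
twice = once.encode("utf-8").decode("cp1252")

# Double mojibake is still reversible: peel off one layer at a time.
peeled = twice.encode("cp1252").decode("utf-8")              # back to `once`
recovered = peeled.encode("cp1252").decode("utf-8")          # back to `text`
```

Shapes 1 and 2 have already lost information (the original byte is gone), while shape 3 only looks worse: every layer can be undone as long as no byte was dropped along the way.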

A first-person debugging session

Last month I debugged a signup form where German umlauts arrived in the database as MÃ¼ller. My first move was not reading code; it was pinning down the bytes at each hop. I pasted Müller into the UTF-8 byte counter and confirmed the correct UTF-8 form is 7 bytes (ü = 0xC3 0xBC), then pasted the corrupted MÃ¼ller and got 9 bytes, because Ã and ¼ each cost 2 bytes in UTF-8. That 7-vs-9 gap told me the corruption was already baked into stored bytes, not a display-layer problem: the database genuinely contained the re-encoded garbage, so setting <meta charset> or fiddling with response headers could never fix it. The actual culprit was a MySQL connection opened without charset=utf8mb4, so the driver defaulted to latin1 and transcoded every INSERT. One connection-string parameter, plus a one-off repair script that re-decoded the damaged rows, closed the ticket.

The lesson that stuck: debug encodings by comparing byte counts and byte sequences, not by staring at rendered glyphs. Glyphs lie — your terminal, editor, and browser each apply their own decoding before you see anything. Bytes don't.
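The byte-count comparison from that session, and the general shape of a repair step, might look like the sketch below. This is not the actual script from the incident: repair_mojibake is a hypothetical helper, and cp1252 is assumed as the wrong codec.

```python
good = "M\u00fcller"                                # "Müller"
stored = good.encode("utf-8").decode("cp1252")      # simulate the damaged row

good_len = len(good.encode("utf-8"))       # 7 bytes: u-umlaut is c3 bc
stored_len = len(stored.encode("utf-8"))   # 9 bytes: two 2-byte characters replace it


def repair_mojibake(value: str, wrong_codec: str = "cp1252") -> str:
    """Hypothetical helper: undo one UTF-8-read-as-single-byte round trip."""
    try:
        return value.encode(wrong_codec).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The value does not look like this kind of damage; leave it untouched.
        return value


fixed = repair_mojibake(stored)
untouched = repair_mojibake(good)   # already-correct text passes through unchanged
```

The try/except matters for a one-off repair over a whole table: rows written after the fix are already correct, and re-decoding them blindly would fail or corrupt them.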

A checklist for finding where text breaks

Encoding bugs are always a disagreement between a writer and a reader somewhere along the pipe. Walk the hops in order:

  1. Identify the exact damaged characters. Paste the broken string into the Unicode character inspector to see the real code points. Ã is U+00C3 — seeing it confirms UTF-8-as-Latin-1 misreading rather than font problems or invisible characters.
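  2. Check the storage layer. On MySQL, run SHOW VARIABLES LIKE 'character_set%'; you want utf8mb4 everywhere. The older utf8 alias (utf8mb3) caps at 3 bytes per character, so it cannot store emoji: with strict SQL mode off, the string is cut at the first 4-byte character with only a warning, and with strict mode on the INSERT fails.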
  3. Check the transport. HTTP responses should carry Content-Type: text/html; charset=utf-8; APIs should emit application/json; charset=utf-8. A missing charset lets clients guess, and old clients guess Latin-1.
  4. Check file ingestion. CSVs exported from Excel are frequently Windows-1252, not UTF-8. Decode with the source's real encoding, then re-encode to UTF-8 at the boundary.
  5. Normalize after decoding. Two visually identical strings can differ at the code-point level: é as one code point (U+00E9, NFC) versus e + combining accent (U+0065 U+0301, NFD — what macOS filenames use). String comparison then fails even though rendering matches. Run both sides through the Unicode normalizer to see the two forms and pick one convention (NFC is the usual choice) at your system boundary.
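Steps 1, 4, and 5 of the checklist can also be done locally. The sketch below uses Python's unicodedata module; code_points is a hypothetical helper, and the CSV bytes are a made-up stand-in for an Excel export.

```python
import unicodedata


def code_points(s):
    """Hypothetical helper: list each character's code point and Unicode name."""
    return [f"U+{ord(c):04X} {unicodedata.name(c, '<unnamed>')}" for c in s]


# Step 1: inspect the damaged characters instead of trusting the glyphs.
damaged = "caf\u00c3\u00a9"
inspected = code_points(damaged)

# Step 4: decode with the source's real encoding, re-encode at the boundary.
excel_bytes = "Zo\u00eb,\u201cnote\u201d\r\n".encode("cp1252")   # stand-in export
utf8_out = excel_bytes.decode("cp1252").encode("utf-8")

# Step 5: NFC and NFD forms render alike but compare unequal until normalized.
composed = "caf\u00e9"       # e-acute as one code point (NFC)
decomposed = "cafe\u0301"    # e + combining acute accent (NFD)
raw_equal = composed == decomposed
nfc_equal = (unicodedata.normalize("NFC", composed)
             == unicodedata.normalize("NFC", decomposed))
```

Printing inspected shows LATIN CAPITAL LETTER A WITH TILDE followed by COPYRIGHT SIGN, which is the UTF-8-as-single-byte signature spelled out by name.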

The durable rule is the classic "UTF-8 sandwich": decode bytes to text at the input edge, work in text everywhere inside, encode back to UTF-8 only at the output edge. ASCII stopped being a safe assumption the day your product got its second user, and Unicode without a declared encoding is just a number with no bytes. Pin the encoding at every boundary and the Ã© never shows up.
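As a rough illustration of the sandwich, here is a minimal sketch. The function handle_request, its parameters, and the title-casing step are all hypothetical; the point is only where decoding and encoding happen.

```python
import unicodedata


def handle_request(raw: bytes, declared_encoding: str = "utf-8") -> bytes:
    """Hypothetical handler: bytes in, text inside, UTF-8 bytes out."""
    # Input edge: decode once, using the encoding the sender declared.
    # Strict decoding (the default) raises on invalid bytes instead of guessing.
    text = raw.decode(declared_encoding)
    text = unicodedata.normalize("NFC", text)

    # Inside: operate on str only; no bytes and no codecs past this point.
    result = text.strip().title()

    # Output edge: encode once, always to UTF-8.
    return result.encode("utf-8")


# A Windows-1252 client and a UTF-8 client produce identical output bytes.
from_cp1252 = handle_request("  jos\u00e9 ".encode("cp1252"), "cp1252")
from_utf8 = handle_request("  jos\u00e9 ".encode("utf-8"))
```

Failing loudly at the input edge is a deliberate choice in this sketch: a decode error at the boundary is far cheaper to trace than mojibake discovered in stored rows later.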


Made by Toolora · Updated 2026-07-02