Character Encoding Explained: UTF-8, UTF-16, ASCII, and Latin-1 for Web Apps

A practical guide to character encodings — UTF-8, UTF-16, ASCII, and Latin-1 — with real input/output examples and tips for handling international text in web applications.

#encoding #utf-8 #unicode #web-development #internationalization

If you have ever seen a page display Ã© where é should appear, or watched a JSON payload turn “ into â€œ, you have hit an encoding mismatch. Understanding the four most common character encodings (ASCII, Latin-1, UTF-16, and UTF-8) gives you the mental model to diagnose these problems in seconds rather than hours.

ASCII: The 7-Bit Foundation

ASCII (American Standard Code for Information Interchange) was standardized in 1963 and covers 128 code points: 0–31 and 127 are control characters (newline, tab, bell, delete), and 32–126 are printable characters: the English alphabet, digits, and common punctuation.

Every byte in ASCII fits in 7 bits, so the eighth bit is always 0. That means any byte value above 127 is, by definition, not ASCII. This matters because many older systems or protocols assume pure ASCII and silently mishandle anything outside that range.

Real example — ASCII encoding of "Hello":

| Character | Decimal | Hex  | Binary   |
|-----------|---------|------|----------|
| H         | 72      | 0x48 | 01001000 |
| e         | 101     | 0x65 | 01100101 |
| l         | 108     | 0x6C | 01101100 |
| l         | 108     | 0x6C | 01101100 |
| o         | 111     | 0x6F | 01101111 |

You can verify this with the Text to Hex Converter — paste "Hello" and it produces 48 65 6C 6C 6F.
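The same table can be reproduced locally. The following is a minimal Python sketch; the helper name `ascii_table` is hypothetical and chosen only for illustration.

```python
def ascii_table(text: str) -> list[tuple[str, int, str, str]]:
    """Return (character, decimal, hex, binary) rows for an ASCII string."""
    # str.encode("ascii") raises UnicodeEncodeError if any character is above 127.
    data = text.encode("ascii")
    return [(chr(b), b, f"0x{b:02X}", f"{b:08b}") for b in data]


for row in ascii_table("Hello"):
    print(row)

# Space-separated hex dump of the same bytes.
print("Hello".encode("ascii").hex(" ").upper())  # 48 65 6C 6C 6F
```

Note that the leading bit of every binary value is 0, which is the 7-bit property described above.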

Latin-1 (ISO 8859-1): Filling the Eighth Bit

ASCII leaves the upper half of the byte range (values 128–255) unused. Latin-1 claims them for Western European characters: accented vowels (é, ü, ñ), currency symbols (£, ¥), and typographic marks (©, ®). Code points 128–159 are control characters; 160–255 are printable.

The catch: Latin-1 is a single-byte encoding, so it can represent exactly 256 characters. Every language outside Western Europe is simply absent — Arabic, Chinese, Japanese, Korean, and thousands of others cannot be expressed at all. This made Latin-1 workable in the 1990s for English and French websites, but it is a dead end for any app targeting a global audience.

Mojibake in practice: When a server sends Latin-1 bytes but the browser interprets them as UTF-8, é (Latin-1 byte 0xE9) becomes garbage because 0xE9 is not a valid single-byte UTF-8 sequence. The browser either replaces it with (the Unicode replacement character) or misreads the following bytes.
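Both directions of this mismatch can be reproduced with Python's built-in codecs. This is a small sketch, assuming the word "café" as sample input; a browser's actual recovery behaviour may differ from Python's `errors="replace"` policy.

```python
text = "café"

latin1_bytes = text.encode("latin-1")  # b'caf\xe9'
utf8_bytes = text.encode("utf-8")      # b'caf\xc3\xa9'

# Latin-1 bytes read as UTF-8: 0xE9 starts a 3-byte sequence that never
# completes, so the decoder substitutes U+FFFD.
as_utf8 = latin1_bytes.decode("utf-8", errors="replace")

# UTF-8 bytes read as Latin-1: every byte maps to some character, so there
# is no error, just two wrong characters in place of one.
as_latin1 = utf8_bytes.decode("latin-1")

print(repr(as_utf8))    # 'caf\ufffd'
print(repr(as_latin1))  # 'cafÃ©'
```

The second case is the more dangerous one in practice because it never raises an error.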

UTF-16: Unicode's Variable-Width Predecessor

UTF-16 encodes every Unicode code point in either 2 or 4 bytes. Code points in the Basic Multilingual Plane (U+0000–U+FFFF) use a single 16-bit unit; code points above U+FFFF — emoji, many historic scripts, and some CJK extension characters — use a pair of 16-bit surrogates.
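The surrogate-pair arithmetic can be sketched in a few lines of Python. The function name `to_surrogates` is hypothetical; the constants come from the UTF-16 definition.

```python
def to_surrogates(code_point: int) -> tuple[int, int]:
    """Split a code point above U+FFFF into a (high, low) UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need surrogates")
    offset = code_point - 0x10000          # 20 significant bits remain
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


high, low = to_surrogates(0x1F600)  # 😀
print(f"{high:04X} {low:04X}")      # D83D DE00

# Cross-check against Python's own big-endian UTF-16 encoder.
print("😀".encode("utf-16-be").hex(" ").upper())  # D8 3D DE 00
```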

Where UTF-16 lives: Windows internally uses UTF-16LE (little-endian). Java String, JavaScript's charCodeAt, and C# char are all UTF-16 under the hood. A file saved by Notepad on Windows with "Unicode" encoding is actually UTF-16LE with a BOM (byte order mark: FF FE).

This surprises developers who assume JavaScript strings are UTF-8. They are not — '😀'.length returns 2, not 1, because the emoji sits at U+1F600 and requires two UTF-16 surrogates. I ran into this personally when building a character-count feature: a 140-"character" limit enforced with .length allowed only 70 emoji, which was nowhere near the expected behavior.
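The difference between UTF-16 code units (what JavaScript's `.length` reports) and code points can be checked from Python, whose `len()` counts code points. A rough sketch with hypothetical helper names; neither count matches user-perceived characters for sequences such as flags or skin-tone modifiers, which need grapheme segmentation.

```python
def utf16_units(text: str) -> int:
    """Number of 16-bit code units, equivalent to JavaScript's String.length."""
    return len(text.encode("utf-16-le")) // 2


def code_points(text: str) -> int:
    """Number of Unicode code points (Python's native string length)."""
    return len(text)


print(utf16_units("😀"), code_points("😀"))  # 2 1
print(utf16_units("😀" * 70))                # 140, which hits a 140-unit limit
```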

UTF-16 is a poor choice for web data transfer: ASCII text doubles in size (the letter A becomes 41 00 in UTF-16LE), and the BOM can confuse parsers that expect UTF-8.

UTF-8: Why It Won the Web

UTF-8 uses 1–4 bytes per code point, with a clever variable-length design:

| Code point range   | Byte pattern                          | Byte count |
|--------------------|---------------------------------------|------------|
| U+0000 – U+007F    | 0xxxxxxx                              | 1          |
| U+0080 – U+07FF    | 110xxxxx 10xxxxxx                     | 2          |
| U+0800 – U+FFFF    | 1110xxxx 10xxxxxx 10xxxxxx            | 3          |
| U+10000 – U+10FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx   | 4          |
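To make the bit patterns concrete, here is a hand-written encoder that follows the table row by row. It is an illustrative sketch only (the name `utf8_encode` is hypothetical); real code should use `str.encode("utf-8")`.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode a single Unicode code point as UTF-8 using the bit patterns above."""
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("code point out of Unicode range")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogates are not encodable in UTF-8")
    if cp <= 0x7F:      # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:     # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:    # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


for ch in "Aé€😀":
    print(f"U+{ord(ch):04X}", utf8_encode(ord(ch)).hex(" ").upper())
```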

Three properties make UTF-8 the dominant encoding on the web:

  1. ASCII compatibility. Any valid ASCII file is also valid UTF-8. Servers, parsers, and log tools built for ASCII work on UTF-8 without modification.
  2. Self-synchronising. Because continuation bytes always start with 10 and lead bytes never do, you can find the start of any character from any byte position by stepping back at most three bytes. Many legacy multi-byte encodings, such as Shift JIS, lack this property.
  3. Compact for Latin text. English and other mostly-ASCII content stays at 1 byte per character, and accented Latin letters such as é take 2. Most Chinese and Japanese characters expand to 3 bytes, which is larger than UTF-16's 2, but the tradeoff is worth it for ASCII compatibility.
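The self-synchronising property from the list above can be sketched as a short backward scan. The function name `char_start` is hypothetical, and the sketch assumes the input is valid UTF-8.

```python
def char_start(data: bytes, index: int) -> int:
    """Return the index of the first byte of the character containing data[index]."""
    # Continuation bytes match the pattern 10xxxxxx, i.e. (byte & 0xC0) == 0x80.
    while index > 0 and (data[index] & 0xC0) == 0x80:
        index -= 1
    return index


data = "a€b".encode("utf-8")   # 61 E2 82 AC 62
print(char_start(data, 3))     # 1: byte 3 is the last byte of €, which starts at 1
print(char_start(data, 4))     # 4: 'b' is a single-byte character
```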

According to the W3Techs survey published in May 2025, UTF-8 accounts for 98.2% of all websites with a known character encoding. Latin-1 is the next most common at 0.8%, with UTF-16 barely registering.

Real encoding output — the character é (U+00E9):

| Encoding | Bytes (hex)   |
|----------|---------------|
| ASCII    | cannot encode |
| Latin-1  | E9            |
| UTF-16LE | E9 00         |
| UTF-8    | C3 A9         |

You can verify the UTF-8 bytes with the UTF-8 Byte Counter — paste é and the tool shows 2 bytes and the code point U+00E9.

Common Encoding Bugs and How to Fix Them

Bug 1 — Content-Type missing the charset parameter. HTTP headers like Content-Type: text/html without ; charset=utf-8 let browsers sniff the encoding. On a Latin-1 page they often guess correctly, but on a UTF-8 page with no non-ASCII content in the first 512 bytes, they may assume Latin-1 and corrupt any UTF-8 characters that appear later. Fix: always declare Content-Type: text/html; charset=utf-8.
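As one way to apply this fix, here is a minimal WSGI application in Python that declares the charset in the header and encodes the body to match. The name `app` and the sample markup are assumptions for illustration; your framework or server will likely have its own setting for this.

```python
def app(environ, start_response):
    """Minimal WSGI app that sends UTF-8 HTML with an explicit charset."""
    html = '<!doctype html><meta charset="utf-8"><p>café</p>'
    body = html.encode("utf-8")  # the bytes must match the declared charset
    headers = [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),  # length in bytes, not characters
    ]
    start_response("200 OK", headers)
    return [body]
```

For a quick local check, the standard library's `wsgiref.simple_server.make_server("", 8000, app)` can serve it.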

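Bug 2 — Database column declared as latin1 storing UTF-8 bytes. MySQL's default charset was latin1 until version 8.0. Applications that saved UTF-8 bytes into a latin1 column appeared to work (bytes round-trip correctly) until a LIKE search on an accented character failed, or until the data was dumped and re-imported with correct character set handling. Fix: move the columns to utf8mb4, but note that a plain ALTER TABLE … CONVERT TO CHARACTER SET utf8mb4 transcodes the stored bytes as if they were latin1 text, which double-encodes data that is already UTF-8. For such columns, change the type to a binary one (BLOB or VARBINARY) first and then to utf8mb4, so the bytes are relabelled rather than converted.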
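What happens to UTF-8 bytes that are treated as latin1 text and then transcoded can be simulated outside the database. This Python sketch uses Python's `latin-1` codec as a stand-in for MySQL's latin1 (which is closer to Windows-1252), so treat it as an approximation of the failure rather than an exact model.

```python
original = "café"

# The application wrote UTF-8 bytes into a column labelled latin1.
stored = original.encode("utf-8")                       # b'caf\xc3\xa9'

# A charset conversion that trusts the label decodes as Latin-1, then
# re-encodes as UTF-8, producing double-encoded data.
converted = stored.decode("latin-1").encode("utf-8")    # b'caf\xc3\x83\xc2\xa9'
print(converted.decode("utf-8"))                        # cafÃ©

# Repair: undo the wrong step by going back through Latin-1.
repaired = converted.decode("utf-8").encode("latin-1").decode("utf-8")
print(repaired)                                         # café
```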

Bug 3 — Double encoding. A string gets HTML-entity-encoded twice: < → &lt; → &amp;lt;. This usually happens when user input goes through an HTML encoder at input time and again at output time. The HTML Entities Encoder lets you inspect what a single encoding pass produces so you can audit whether your pipeline is encoding once or twice.
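The effect of one pass versus two is easy to see with Python's standard `html` module. A short sketch, with a sample input chosen only for illustration:

```python
import html

raw = "<b>Tom & Jerry</b>"

once = html.escape(raw)    # correct: encode exactly once, at output time
twice = html.escape(once)  # bug: the & of each entity is escaped again

print(once)   # &lt;b&gt;Tom &amp; Jerry&lt;/b&gt;
print(twice)  # &amp;lt;b&amp;gt;Tom &amp;amp; Jerry&amp;lt;/b&amp;gt;

# One unescape pass only peels off one layer, which is how the bug shows up
# on screen as literal "&lt;" text.
print(html.unescape(twice) == once)  # True
```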

Bug 4 — Surrogate pairs mishandled in JavaScript. '𝄞'.codePointAt(0) returns 119070 (U+1D11E, the musical G clef), but '𝄞'[0] is the high surrogate 0xD834, not a character. Always iterate with for…of or Array.from when processing user-supplied strings that may contain characters above U+FFFF.

Practical Checklist for International Web Apps

Before shipping any feature that handles user text:

  • HTML meta tag: <meta charset="UTF-8"> in <head>, before any other text.
  • HTTP header: Content-Type: text/html; charset=utf-8 (server-level, not just meta).
  • Database: utf8mb4 in MySQL/MariaDB (not utf8, an alias for utf8mb3 that stores at most 3 bytes per character, so 4-byte characters such as emoji are rejected or truncated depending on SQL mode).
  • File I/O: Specify encoding explicitly — open(path, encoding='utf-8') in Python, fs.readFileSync(path, 'utf8') in Node.
  • Length checks: Measure bytes, not characters, if the downstream system has a byte limit (HTTP header fields, SMTP lines, database VARCHAR byte limits).
  • Normalization: NFC is the standard form for web input. Two users can type "é" as a single precomposed code point (U+00E9) or as "e" + combining accent (U+0065 U+0301) — they look identical but differ as bytes. The Unicode Normalizer converts between all four normalization forms.
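Two items from the checklist above, byte-based length checks and NFC normalization, can be sketched together in Python. The helper name `truncate_utf8` is hypothetical, and the sketch cuts at code point boundaries only; it does not try to keep combining sequences or emoji sequences intact.

```python
import unicodedata

composed = "\u00e9"      # é as one precomposed code point
decomposed = "e\u0301"   # e followed by a combining acute accent

print(composed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == composed)  # True


def truncate_utf8(text: str, max_bytes: int) -> str:
    """Normalize to NFC, then cut so the UTF-8 encoding fits in max_bytes."""
    data = unicodedata.normalize("NFC", text).encode("utf-8")
    if len(data) <= max_bytes:
        return data.decode("utf-8")
    # errors="ignore" drops the incomplete character left by the byte slice.
    return data[:max_bytes].decode("utf-8", errors="ignore")


print(truncate_utf8("é😀", 4))  # é (2 bytes); the 4-byte emoji does not fit
```

Normalizing before measuring matters here: the decomposed form of é is 3 bytes in UTF-8, while the NFC form is 2.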

When I audited a multilingual form for a client whose database was rejecting emoji reactions, three items on this checklist had been missed independently: a MySQL utf8 column, no HTTP charset header, and String.length used for a byte-budget check. Fixing all three together resolved every reported encoding bug in one deploy.


Made by Toolora · Updated 2026-06-27