Unicode, UTF-8, and Emoji Character Counts: Why Text Length Tools Disagree
JavaScript says 11, Python says 7, your database says 25, and your eyes say 1. A practical walk through graphemes, code points, UTF-16 units, and UTF-8 bytes — with real measured numbers.
Paste the family emoji 👩👩👧👦 into a few different "character counter" websites and you will get a small argument. Some report 1. Some report 7. Some report 11. None of them are broken. They are answering four different questions that all happen to be phrased as "how long is this text?"
This post pins down what each number means, shows real measured counts for real strings, and explains which count matters for the limit you are actually trying to hit — an SMS, a tweet, a VARCHAR column, or a form validator.
Four different answers to "how long is this string?"
Every length disagreement between text tools comes down to which of these four units the tool counts:
Grapheme clusters are what a human sees: one visible symbol. The family emoji is one grapheme. So is é, even when it is stored as two separate pieces. This is what Intl.Segmenter in JavaScript or Swift's String.count measures.
Code points are Unicode's numbered entries. 👍 is a single code point, U+1F44D. But 👩👩👧👦 is seven code points: four person emoji glued together with three invisible zero-width joiners (U+200D). Python's len() counts code points.
UTF-16 code units are how JavaScript, Java, and C# store strings internally. Any code point above U+FFFF — which includes almost every emoji — needs two units, a surrogate pair. This is what JavaScript's .length returns, and it is why so many web counters report numbers that look inflated.
UTF-8 bytes are what goes over the network and into most databases. ASCII characters take 1 byte, most European accented letters take 2, most CJK characters take 3, and every code point above U+FFFF, which covers almost every emoji, takes 4.
A counter is only "wrong" if it doesn't tell you which of the four it is counting. Toolora's UTF-8 Byte Counter shows byte length, code point count, and JavaScript .length side by side for exactly this reason — the disagreement is the information.
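The per-code-point rules behind the last two units are small enough to sketch directly. The helper names below (utf16Units, utf8Bytes, toSurrogates) are illustrative, not part of any library, and the sketch assumes its input is a valid Unicode code point:

```javascript
// Number of UTF-16 code units needed for one code point.
function utf16Units(codePoint) {
  return codePoint > 0xffff ? 2 : 1;
}

// Number of UTF-8 bytes needed for one code point.
function utf8Bytes(codePoint) {
  if (codePoint < 0x80) return 1;    // ASCII
  if (codePoint < 0x800) return 2;   // e.g. most accented Latin letters
  if (codePoint < 0x10000) return 3; // rest of the BMP, including most CJK
  return 4;                          // supplementary planes, including most emoji
}

// Split a code point above U+FFFF into its surrogate pair.
function toSurrogates(codePoint) {
  const offset = codePoint - 0x10000;
  return [0xd800 + (offset >> 10), 0xdc00 + (offset & 0x3ff)];
}

const thumbsUp = 0x1f44d;
console.log(utf16Units(thumbsUp), utf8Bytes(thumbsUp));          // 2 4
console.log(toSurrogates(thumbsUp).map((u) => u.toString(16))); // [ 'd83d', 'dc4d' ]
```

Summing these per-code-point values over a string should give the same totals as .length and Buffer.byteLength for well-formed input.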
One emoji, four numbers: a measured example
Here is the string Hi 👩👩👧👦 — the word "Hi", a space, and one family emoji — measured in Node.js 20:
const s = "Hi 👩👩👧👦";
[...new Intl.Segmenter("en", {granularity: "grapheme"}).segment(s)].length
// → 4 (graphemes: H, i, space, family)
[...s].length // → 10 (code points)
s.length // → 14 (UTF-16 code units)
Buffer.byteLength(s) // → 28 (UTF-8 bytes)
The same input produces 4, 10, 14, or 28 depending on the question. The emoji alone accounts for the spread: 1 grapheme, 7 code points, 11 UTF-16 units, 25 UTF-8 bytes (4 bytes × 4 person emoji + 3 bytes × 3 zero-width joiners).
Flags behave the same way at smaller scale. 🇩🇪 is one visible symbol built from two "regional indicator" code points — D and E markers — so it measures 1 grapheme, 2 code points, 4 UTF-16 units, 8 UTF-8 bytes. If you want to see the individual pieces inside any emoji, paste it into the Emoji to Unicode converter and it will list every code point, joiners included.
Accents add one more wrinkle: é can be stored as a single code point (U+00E9) or as e plus a combining accent (U+0065 U+0301). Both render identically, but the first is 1 code point and 2 UTF-8 bytes while the second is 2 code points and 3 bytes. Two files that look identical can fail an equality check — the Unicode Character Inspector makes the difference visible character by character.
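A minimal sketch of that equality failure, and one common way around it. The sameText helper is a hypothetical name, and choosing NFC rather than NFD as the shared form is an assumption; either works as long as both sides use the same one:

```javascript
const composed = "\u00e9";    // é as a single code point
const decomposed = "e\u0301"; // e followed by a combining acute accent

console.log(composed === decomposed);                                    // false
console.log([...composed].length, [...decomposed].length);               // 1 2
console.log(Buffer.byteLength(composed), Buffer.byteLength(decomposed)); // 2 3

// Normalize both sides to the same form before comparing.
function sameText(a, b) {
  return a.normalize("NFC") === b.normalize("NFC");
}

console.log(sameText(composed, decomposed)); // true
```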
Where the mismatch actually costs you
These distinctions are not trivia. Three real limits, three different units:
SMS counts GSM-7 septets, then collapses. A standard SMS carries 160 characters in the GSM 7-bit alphabet, but that alphabet contains no emoji. Add a single 😀 and the entire message silently switches to UCS-2 encoding, dropping the per-message limit from 160 characters to 70 UTF-16 code units (per the 3GPP TS 23.038 character-set spec), of which the emoji itself uses 2. Multi-part messages give up a few more units per segment to the concatenation header, leaving 67. So one emoji can turn a 150-character, one-segment message into a three-segment one, and SMS providers bill per segment.
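The segment arithmetic can be sketched as follows. Everything here is a simplification under stated assumptions: smsSegments is a hypothetical helper, the GSM-7 test uses only a conservative subset of the real alphabet (a production check would use the full table from 3GPP TS 23.038, which also includes characters such as é and £), and the multi-part sizes of 153 septets and 67 UCS-2 units assume the usual concatenation header:

```javascript
// Conservative subset of the GSM 7-bit basic alphabet (assumption, see above).
const GSM7_SUBSET = /^[A-Za-z0-9 \r\n!"#%&'()*+,\-.\/:;<=>?@$_]*$/;

function smsSegments(text) {
  if (GSM7_SUBSET.test(text)) {
    // Every character in the subset occupies exactly one septet.
    const units = text.length;
    return { encoding: "GSM-7", units, segments: units <= 160 ? 1 : Math.ceil(units / 153) };
  }
  // UCS-2 limits are counted in UTF-16 code units, which is what .length returns.
  const units = text.length;
  return { encoding: "UCS-2", units, segments: units <= 70 ? 1 : Math.ceil(units / 67) };
}

const plain = "a".repeat(150);
console.log(smsSegments(plain));               // { encoding: 'GSM-7', units: 150, segments: 1 }
console.log(smsSegments(plain + "\u{1F600}")); // { encoding: 'UCS-2', units: 152, segments: 3 }
```

Real gateways may also avoid splitting a surrogate pair across segments, which this sketch ignores.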
X (Twitter) counts weighted code points. The 280-character limit is not 280 graphemes: per the open-source twitter-text parsing library that X publishes, most emoji are weighted as 2 characters, and so are CJK characters. A tweet that "looks" 150 characters long can be over the limit: 150 CJK characters weigh in at 300. The Social Media Character Counter applies the per-platform rules instead of a naive .length.
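To make the weighting concrete, here is a deliberately simplified model of the rule as described above. It is not the twitter-text algorithm: weightedLength is a hypothetical helper, the emoji and CJK tests are rough approximations built on Unicode property escapes, and URL handling is left out entirely:

```javascript
const graphemeSegmenter = new Intl.Segmenter("en", { granularity: "grapheme" });

// Rough stand-ins for "is an emoji" and "is CJK" (assumptions, not the real tables).
const EMOJI = /\p{Extended_Pictographic}|\p{Regional_Indicator}/u;
const CJK = /[\p{Script=Han}\p{Script=Hiragana}\p{Script=Katakana}\p{Script=Hangul}]/u;

function weightedLength(text) {
  let total = 0;
  for (const { segment } of graphemeSegmenter.segment(text)) {
    // Emoji and CJK graphemes weigh 2; everything else weighs 1 in this model.
    total += (EMOJI.test(segment) || CJK.test(segment)) ? 2 : 1;
  }
  return total;
}

console.log(weightedLength("hello"));            // 5
console.log(weightedLength("\u65e5".repeat(150))); // 300, over a 280 limit
```

For anything that has to match the platform exactly, the published library is the safer choice than a reimplementation like this one.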
MySQL VARCHAR(n) counts code points, but indexes count bytes. A VARCHAR(20) column in utf8mb4 happily stores the 7-code-point family emoji string, but the old 767-byte InnoDB index-key limit was famously blown up by exactly this kind of 4-bytes-per-character content: an indexed VARCHAR(255) can need up to 1,020 bytes, which is why so many schemas shrank those columns to VARCHAR(191). The unit that matters flips between "characters" and "bytes" depending on which error message you are staring at.
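The arithmetic behind that flip is short. The sketch below assumes the 4-byte worst case for utf8mb4 and the 767-byte limit mentioned above; fitsColumn is a hypothetical application-side check, not something MySQL provides:

```javascript
const MAX_BYTES_PER_CHAR = 4; // utf8mb4 worst case
const OLD_INDEX_LIMIT = 767;  // bytes

// Largest VARCHAR(n) whose worst case still fits inside the index key.
const maxIndexableChars = Math.floor(OLD_INDEX_LIMIT / MAX_BYTES_PER_CHAR);
console.log(maxIndexableChars); // 191

// Check a value against both kinds of limit before sending it to the database.
function fitsColumn(value, maxChars, maxBytes) {
  const chars = [...value].length;                // code points, as VARCHAR(n) counts them
  const bytes = Buffer.byteLength(value, "utf8"); // bytes, as index limits count them
  return { chars, bytes, ok: chars <= maxChars && bytes <= maxBytes };
}

const family = "\u{1F469}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}";
console.log(fitsColumn(family, 20, 767)); // { chars: 7, bytes: 25, ok: true }
```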
I measured the spread myself before trusting it
While writing this, I didn't take the numbers above from documentation — I ran them. In Node.js 20 I measured 👍, 🇩🇪, and 👩👩👧👦 one by one: the thumbs-up came out as 1 grapheme / 1 code point / 2 UTF-16 units / 4 bytes, the German flag as 1 / 2 / 4 / 8, and the family emoji as 1 / 7 / 11 / 25. Three symbols that read as "three characters" to any human eye add anywhere from 3 to 37 to a length total depending on the unit. When I pasted the same three emoji into Toolora's Word Counter, its character figure tracked the human-visible count, while the byte counter agreed with Buffer.byteLength to the byte. That spread — 3 versus 37 from identical input — is the entire reason two honest tools can disagree by more than a factor of ten.
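For anyone who wants to repeat the measurement, a sketch along these lines should reproduce the figures in Node.js. The measure helper is an illustrative name, and the emoji are written as escape sequences so that the zero-width joiners cannot be lost in copy and paste:

```javascript
const graphemes = new Intl.Segmenter("en", { granularity: "grapheme" });

function measure(s) {
  return {
    graphemes: [...graphemes.segment(s)].length,
    codePoints: [...s].length,
    utf16Units: s.length,
    utf8Bytes: Buffer.byteLength(s, "utf8"),
  };
}

const samples = {
  thumbsUp: "\u{1F44D}",
  germanFlag: "\u{1F1E9}\u{1F1EA}",
  family: "\u{1F469}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}",
};

for (const [name, s] of Object.entries(samples)) {
  console.log(name, measure(s));
}

// All three together: 3 graphemes versus 37 bytes.
console.log(measure(Object.values(samples).join("")));
// { graphemes: 3, codePoints: 10, utf16Units: 17, utf8Bytes: 37 }
```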
Which count should you use?
Match the unit to the limit, not to intuition:
- Form validation a human will read ("bio must be under 160 characters"): count graphemes. Users think in visible symbols, and rejecting "160 characters" of text that looks like 150 feels broken.
- Database column sizing: check both code points (for VARCHAR(n) semantics) and UTF-8 bytes (for index and row-size limits). Budget 4 bytes per code point, which means up to 25 bytes for a single family emoji.
- API payloads, HTTP headers, URL length: UTF-8 bytes, always.
- JavaScript string slicing: be careful. s.slice(0, 5) cuts UTF-16 units and can split a surrogate pair in half, producing the � replacement character. Use [...s].slice(0, 5).join("") for code points or Intl.Segmenter for graphemes.
- Platform limits (X, SMS, push notifications): use the platform's own weighting rules, because none of the four "natural" units matches them exactly.
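The slicing item in the list above is the easiest one to get wrong, so here is a sketch of the three truncation strategies side by side. The function names are illustrative only:

```javascript
const seg = new Intl.Segmenter("en", { granularity: "grapheme" });

// Cuts UTF-16 units: may leave half of a surrogate pair behind.
const truncateUnits = (s, n) => s.slice(0, n);

// Cuts code points: never splits a pair, but can split a joined emoji sequence.
const truncateCodePoints = (s, n) => [...s].slice(0, n).join("");

// Cuts graphemes: keeps every visible symbol whole.
const truncateGraphemes = (s, n) =>
  [...seg.segment(s)].slice(0, n).map((g) => g.segment).join("");

const family = "\u{1F469}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}";
const text = "Hi " + family + "!";

console.log(truncateUnits(text, 4).charCodeAt(3).toString(16));  // d83d, a lone high surrogate
console.log(truncateCodePoints(text, 4) === "Hi \u{1F469}");     // true, only the first person survives
console.log(truncateGraphemes(text, 4) === "Hi " + family);      // true, the whole family survives
```

Grapheme truncation is the slowest of the three, so it tends to be worth it mainly for text a person will actually see.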
The one-line summary: there is no such thing as the length of a string. There are at least four, they diverge the moment an emoji or accent appears, and a good counting tool earns its keep by showing you all of them at once instead of picking one silently.
Made by Toolora · Updated 2026-06-12