URL Encoding vs HTML Entities vs Base64: When to Use Each Text Encoding Method — and How to Layer Them

Most explanations of text encoding stop at "Base64 is for binary, percent-encoding is for URLs, entities are for HTML." That's true, but it isn't where the bugs live. The bugs live in the seams — when a Base64 token travels through a query string, or a URL gets printed into an HTML attribute. Choosing an encoding is the easy half. Knowing which order to apply them, and which layer decodes what, is the half that breaks production.

One Question Decides Everything: Who Parses the String Next?

Encoding is not a property of your data. It's a property of the parser your data is about to meet.

A URL parser treats certain bytes as structure. RFC 3986 reserves exactly 18 characters — 7 "gen-delims" like : / ? # and 11 "sub-delims" like & = + — and if your data contains any of them where structure is expected, the parser will slice your value apart. Percent-encoding exists to hide those bytes from that one parser.
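To make the count concrete, here is a minimal Python sketch listing the two reserved groups from RFC 3986 section 2.2 and percent-encoding each character. The constant names are illustrative, and Python's `urllib.parse.quote` stands in for whatever encoder you use; JavaScript's `encodeURIComponent` behaves slightly differently and leaves `! ' ( ) *` unencoded.

```python
from urllib.parse import quote

# Reserved characters as listed in RFC 3986, section 2.2.
GEN_DELIMS = ":/?#[]@"       # 7 general delimiters
SUB_DELIMS = "!$&'()*+,;="   # 11 sub-delimiters
RESERVED = GEN_DELIMS + SUB_DELIMS

# safe="" tells quote() not to exempt any reserved character
# (by default it would leave "/" untouched).
encoded = {ch: quote(ch, safe="") for ch in RESERVED}

for ch, enc in encoded.items():
    print(repr(ch), "->", enc)
```

Every one of the 18 characters comes out as a `%XX` escape, which is what hides it from the URL parser.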

An HTML parser has a different set of dangerous bytes: < > & ". The WHATWG HTML specification defines 2,231 named character references (&lt;, &amp;, &nbsp;, and so on) precisely so text can survive that parser without being read as markup.

Base64 doesn't target a parser at all. It targets channels — systems that mangle raw bytes but pass printable ASCII through untouched. That's why it's the right choice for binary payloads in JSON, data: URIs, and HTTP Basic Auth, and the wrong choice for "making a string URL-safe," which it does not actually do.

So the decision procedure is short: ask what reads the string next. URL parser → percent-encode with a tool like the URL encoder. HTML parser → escape entities with the HTML entities encoder. A text-only channel carrying bytes → Base64-encode with the Base64 encoder.
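The same three-way choice can be sketched with Python's standard library. This is an illustration rather than the method behind any particular tool, and the sample string and variable names are made up for the example.

```python
import base64
import html
from urllib.parse import quote

raw = 'Tom & "Jerry" <3'

# Next reader is a URL parser: percent-encode.
# safe="" also encodes "/", which quote() leaves alone by default.
for_url = quote(raw, safe="")

# Next reader is an HTML parser: escape markup characters as entities.
for_html = html.escape(raw, quote=True)

# Next reader is a text-only channel carrying bytes: Base64.
for_channel = base64.b64encode(raw.encode("utf-8")).decode("ascii")

print(for_url)   # Tom%20%26%20%22Jerry%22%20%3C3
print(for_html)  # Tom &amp; &quot;Jerry&quot; &lt;3
print(for_channel)
```

The three outputs are not interchangeable: each one is only meaningful to the parser or channel it was produced for.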

The Layering Rule: Encode for the Innermost Context First

Real strings pass through several parsers, and each hop needs its own layer. The rule that makes this manageable: work from the inside out when encoding, and each layer decodes only its own encoding.

Take a login redirect. You want to send the user back to this URL after they sign in:

https://example.com/search?q=cats&page=2

Step 1 — this URL is about to become a value inside another URL's query string, so percent-encode it:

https%3A%2F%2Fexample.com%2Fsearch%3Fq%3Dcats%26page%3D2

Note what happened to &page=2. Unencoded, the outer URL parser would have read page as a separate parameter of the login URL and your redirect would silently lose its second parameter. Encoded as %26, it stays inside the value.
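A short Python sketch shows both outcomes. Here `parse_qs` stands in for the server's query-string parser, which is an assumption; your framework's parser may differ in detail but splits on & the same way.

```python
from urllib.parse import parse_qs, quote, urlsplit

target = "https://example.com/search?q=cats&page=2"

# Wrong: paste the raw URL in. Its "&" becomes structure of the outer URL.
broken = "/login?next=" + target
broken_params = parse_qs(urlsplit(broken).query)
# "page" has leaked out of the value and become a parameter of /login.

# Right: percent-encode the inner URL before it becomes a value.
encoded = quote(target, safe="")
fixed = "/login?next=" + encoded
fixed_params = parse_qs(urlsplit(fixed).query)

print(broken_params)  # {'next': ['https://example.com/search?q=cats'], 'page': ['2']}
print(fixed_params)   # {'next': ['https://example.com/search?q=cats&page=2']}
```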

Step 2 — that login link now gets printed into an HTML page with a second parameter:

<a href="/login?next=https%3A%2F%2Fexample.com%2Fsearch%3Fq%3Dcats%26page%3D2&amp;src=nav">

The & that separates next from src belongs to the outer URL, so it is not percent-encoded. But because this URL is sitting inside an HTML attribute, the HTML layer escapes it to &amp;. The browser's HTML parser turns &amp; back into & when it reads the attribute, then its URL parser sees a clean two-parameter URL, and finally the server percent-decodes next back to the original address. Three parsers, three layers, each one undoing exactly its own encoding. Skip a layer or apply one twice, and the chain breaks.
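The full round trip can be sketched in Python. In this sketch `html.unescape` plays the role of the browser's HTML parser and `parse_qs` plays the server; both are stand-ins chosen for illustration, not a model of any specific browser or framework.

```python
import html
from urllib.parse import parse_qs, quote, urlsplit

target = "https://example.com/search?q=cats&page=2"

# Encode from the inside out.
# Layer 1: the inner URL becomes a query-string value.
login_url = "/login?next=" + quote(target, safe="") + "&src=nav"
# Layer 2: the whole URL becomes an HTML attribute value.
attr_value = html.escape(login_url, quote=True)
anchor = '<a href="' + attr_value + '">'

# Decode from the outside in, each layer undoing only its own encoding.
seen_by_url_parser = html.unescape(attr_value)             # HTML layer
params = parse_qs(urlsplit(seen_by_url_parser).query)      # URL layer

print(anchor)
print(params["next"][0])  # https://example.com/search?q=cats&page=2
print(params["src"][0])   # nav
```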

The Bug That Taught Me This: Base64's `+` Meets a Query String

I hit the classic version of this bug wiring up an email verification flow. The token was standard Base64, appended to a link as ?token=.... It worked in every test — until a token happened to contain a +. The input >>> encodes to:

Pj4+

Put Pj4+ in a query string raw, and the server's form decoder converts + to a space (a rule inherited from application/x-www-form-urlencoded). The server received "Pj4 " with a trailing space where the + had been, Base64 decoding failed, and the user got "invalid token" on a perfectly valid link. Roughly 1 in 64 characters of random Base64 output is a +, so short tokens pass tests for days and then fail for real users.
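The failure is easy to reproduce. This Python sketch assumes the server decodes the query string with form-encoding rules, which is what `parse_qs` does; the variable names are illustrative.

```python
import base64
import binascii
from urllib.parse import parse_qs

token = base64.b64encode(b">>>").decode("ascii")   # "Pj4+"
query = "token=" + token                            # token placed in the URL raw

# Form decoding turns "+" into a space before the app ever sees the token.
received = parse_qs(query)["token"][0]              # "Pj4 "

try:
    base64.b64decode(received, validate=True)
    decode_failed = False
except binascii.Error:
    # A space is not in the Base64 alphabet, so strict decoding rejects it.
    decode_failed = True

print(repr(token), repr(received), decode_failed)
```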

There are two correct fixes, and I've used both. Either percent-encode the token before it enters the URL — Pj4+ becomes Pj4%2B, which survives the round trip — or switch to the Base64URL alphabet, which replaces + with - and / with _ so the token needs no second layer at all. That variant exists precisely for this seam; it's what JWTs use, and you can convert between the two alphabets with the Base64URL encoder/decoder. What you must not do is nothing, which is what most first implementations do.
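Both fixes fit in a few lines of Python. This is a sketch using the standard library, with `parse_qs` again standing in for the server. Note that Python's `urlsafe_b64encode` still emits = padding when the input length requires it, whereas JWTs strip it.

```python
import base64
from urllib.parse import parse_qs, quote

raw = b">>>"
standard = base64.b64encode(raw).decode("ascii")          # "Pj4+"

# Fix 1: keep standard Base64, add a percent-encoding layer for the URL.
fix1_query = "token=" + quote(standard, safe="")          # token=Pj4%2B
fix1_token = parse_qs(fix1_query)["token"][0]             # "Pj4+" again
fix1_bytes = base64.b64decode(fix1_token, validate=True)

# Fix 2: use the Base64URL alphabet ("-" and "_" instead of "+" and "/").
urlsafe = base64.urlsafe_b64encode(raw).decode("ascii")   # "Pj4-"
fix2_query = "token=" + urlsafe
fix2_token = parse_qs(fix2_query)["token"][0]
fix2_bytes = base64.urlsafe_b64decode(fix2_token)

print(fix1_query, fix2_query)
```

Either way the server recovers the original bytes; the difference is whether the URL layer has any work to do.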

Double Encoding: How `%2520` Sneaks Into Your Logs

The mirror-image failure is applying the same layer twice. If a value is percent-encoded once (cats & dogs → cats%20%26%20dogs) and some middleware "helpfully" encodes it again, every % becomes %25:

cats%2520%2526%2520dogs

When you see %2520 in a log or an address bar, that's the fingerprint: %25 is the percent sign itself, encoded. One layer decoded it back to %20 and stopped, so the user sees a literal %20 in their search box instead of a space.
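The fingerprint can be reproduced in Python in a few lines. Here the second `quote` call plays the part of the over-helpful middleware, which is a made-up scenario for illustration.

```python
from urllib.parse import quote, unquote

raw = "cats & dogs"

once = quote(raw, safe="")     # the legitimate layer
twice = quote(once, safe="")   # a second, unwanted layer: every "%" becomes "%25"

decoded_once = unquote(twice)          # what the user sees if only one layer is removed
decoded_twice = unquote(decoded_once)  # the original, only after two decodes

print(once)          # cats%20%26%20dogs
print(twice)         # cats%2520%2526%2520dogs
print(decoded_once)  # cats%20%26%20dogs
```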

Double encoding happens for an honest reason: the code can't tell whether a string is already encoded, because cats%20dogs is also a legal raw string. The cure is ownership, not detection. Decide which function owns each encoding layer — typically, encode at the last moment before the string enters its context, decode at the first moment after it leaves — and make every other function handle only raw values. Heuristics like "encode only if it doesn't look encoded" are how %2520 gets into logs in the first place.
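One way to express that ownership in code is sketched below in Python. The function names `build_search_url` and `read_search_query` are hypothetical; the point is only that exactly one function encodes, exactly one decodes, and everything else passes raw strings around.

```python
from urllib.parse import parse_qs, quote, urlencode, urlsplit

def build_search_url(query: str) -> str:
    """Owns the encoding layer. Callers must pass a raw, unencoded value."""
    # quote_via=quote emits %20 for spaces instead of "+".
    return "/search?" + urlencode({"q": query}, quote_via=quote)

def read_search_query(url: str) -> str:
    """Owns the decoding layer. Returns a raw value."""
    return parse_qs(urlsplit(url).query)["q"][0]

# A raw string that merely looks encoded is still treated as raw,
# so it survives the round trip unchanged.
for raw in ["cats & dogs", "cats%20dogs"]:
    url = build_search_url(raw)
    assert read_search_query(url) == raw
    print(raw, "->", url)
```

No function here inspects a string to guess whether it is already encoded, which is the heuristic the paragraph above warns against.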

The Decision Table I Actually Use

| Your string is going into… | Apply | Never |
|---|---|---|
| A query-string value or path segment | Percent-encoding (encodeURIComponent) | Base64 "to be safe": + / = still break URLs |
| HTML body text or an attribute | HTML entities | Percent-encoding: the HTML parser won't decode %3C |
| JSON, email, or any ASCII-only channel, carrying bytes | Base64 | Entities: there's no HTML parser in this path |
| A URL that lives inside an HTML attribute | Percent-encode the inner value first, then entity-escape the whole attribute | Applying either layer twice |
| A Base64 token inside a URL | Base64URL alphabet, or percent-encode the token | Raw standard Base64: + becomes a space |

Two encodings in one cell is not a smell; it's the normal case for the web. The smell is not knowing which layer each transformation belongs to.

Made by Toolora · Updated 2026-07-02