URL Percent Encoding by Component: Why the Same Character Encodes Differently in Paths, Queries, and Fragments

The most common mistake I see developers make with percent encoding is treating a URL like a uniform string where encoding rules are the same everywhere. They are not. RFC 3986, the standard that defines URI syntax, gives each URL component its own grammar, which means a / in a path segment means something entirely different from a / in a query value, and an encoding that is correct for one component is not necessarily correct for another.

I tracked down a bug in a search API where results for the query AI & ML worked in the browser but failed in curl. The problem turned out to be a single character — & — encoded correctly for the path but not for the query string. Understanding why took me deeper into RFC 3986 than I expected.

The Three Character Buckets in RFC 3986

Before getting into per-component rules, you need the vocabulary. RFC 3986 sorts characters into three categories:

Unreserved characters — these are always safe to use literally anywhere in a URL without encoding: A–Z, a–z, 0–9, hyphen -, period ., underscore _, and tilde ~. If you only ever transmit these characters, you will never have an encoding problem.

Reserved characters — these carry structural meaning in a URL and must be encoded when they appear as data rather than structure: : / ? # [ ] @ ! $ & ' ( ) * + , ; =. The key insight is that reserved characters are only safe to appear literally when they are performing their structural role. The same character appearing as data content must become %XX.

Everything else — must always be percent-encoded. This includes spaces (%20), Unicode characters (UTF-8 byte sequence, each byte encoded separately), and control characters.
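The three buckets can be sketched as a small classifier. This is a minimal illustration using only Python's standard library; the names UNRESERVED, RESERVED, and classify are made up for this example and are not part of any library.

```python
import string

# Character sets as listed in the text above (RFC 3986 unreserved and reserved).
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
RESERVED = set(":/?#[]@!$&'()*+,;=")


def classify(ch: str) -> str:
    """Return the bucket a single character falls into (hypothetical helper)."""
    if ch in UNRESERVED:
        return "unreserved"
    if ch in RESERVED:
        return "reserved"
    return "other"  # must always be percent-encoded


for ch in ["a", "~", "/", "&", " ", "é"]:
    print(repr(ch), classify(ch))
```

Note that the bucket alone does not tell you whether a reserved character needs encoding; that depends on the component it appears in, which the following sections cover.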

The percent-encoding itself is straightforward: % followed by exactly two hexadecimal digits representing the byte value. The digits are case-insensitive when decoding, but RFC 3986 recommends that producers emit uppercase. A space is byte 0x20, so it encodes to %20. The character é is bytes 0xC3 0xA9 in UTF-8, so it encodes to %C3%A9.
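To make the byte-level mechanics concrete, here is a small sketch. The percent_encode helper is hypothetical and deliberately encodes every byte, including unreserved ones, just to show the mapping; in real code you would use urllib.parse.quote, shown alongside it.

```python
from urllib.parse import quote, unquote


def percent_encode(text: str) -> str:
    """Encode every UTF-8 byte of text as %XX (illustration only)."""
    return "".join(f"%{byte:02X}" for byte in text.encode("utf-8"))


print(percent_encode(" "))    # %20
print(percent_encode("é"))    # %C3%A9

# The standard library leaves unreserved characters alone and encodes the rest.
print(quote("café", safe=""))  # caf%C3%A9

# Decoders accept lowercase hex digits as well.
print(unquote("%c3%a9"))       # é
```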

Path Segments: Slashes Are the Separator

A URL path like /blog/2026/url-guide has three segments: blog, 2026, and url-guide. The / character is the segment delimiter — it must appear literally to create that structure.

The characters allowed literally inside a path segment (excluding the delimiter /) are: unreserved characters plus ! $ & ' ( ) * + , ; = : @. Everything else must be encoded.

The crucial implication: if your data contains a slash, that slash must be encoded as %2F. A file path like docs/api/v2.md passed as a single path parameter must be encoded to docs%2Fapi%2Fv2.md or the router will split it into three separate segments.
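A sketch of encoding a single path segment in Python, assuming the allowed set listed above. The constant PATH_SEGMENT_SAFE, the helper encode_segment, and the /files/ route are hypothetical names chosen for this example.

```python
from urllib.parse import quote

# Characters allowed literally in a path segment besides the unreserved set.
# "/" is deliberately absent so that a slash in the data becomes %2F.
PATH_SEGMENT_SAFE = "!$&'()*+,;=:@"


def encode_segment(value: str) -> str:
    """Percent-encode a value for use as one path segment."""
    return quote(value, safe=PATH_SEGMENT_SAFE)


file_param = "docs/api/v2.md"
url_path = "/files/" + encode_segment(file_param)
print(url_path)  # /files/docs%2Fapi%2Fv2.md
```

The detail that matters is the safe argument: quote treats / as safe by default, so calling it without safe would leave the slash in place and produce three segments.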

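However, many web frameworks and proxies decode %2F before routing. Express.js exposes routing options such as caseSensitive but has no built-in allowEncodedSlashes-style option, so you need a separate package or middleware. nginx matches location blocks against a normalized URI in which %XX sequences, including %2F, have already been decoded; the merge_slashes directive only controls whether consecutive slashes are collapsed and does not change that decoding. When nginx proxies with a proxy_pass directive that has no URI part, the upstream receives the request URI in the form the client sent it. Whether a path-embedded slash survives is therefore framework-specific, not just a URL encoding question.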

A safe strategy: avoid slashes in path parameter values. Use a different delimiter or encode the whole value in a slash-free format — base64url is a common choice for binary IDs.
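One way to sketch the base64url approach in Python. The helper names to_path_token and from_path_token are hypothetical; stripping the = padding is a common convention rather than a requirement, and it is restored before decoding.

```python
import base64


def to_path_token(raw: bytes) -> str:
    """Encode bytes as unpadded base64url, which never contains '/' or '+'."""
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def from_path_token(token: str) -> bytes:
    """Restore the padding that to_path_token stripped, then decode."""
    padding = "=" * (-len(token) % 4)
    return base64.urlsafe_b64decode(token + padding)


token = to_path_token(b"docs/api/v2.md")
print(token)
print(from_path_token(token))  # b'docs/api/v2.md'
```

The trade-off is readability: the token is opaque in logs and address bars, so this suits IDs better than human-facing slugs.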

Query Strings: `=`, `&`, and the `+` Sign Problem

The query string starts after ? and before the optional #. Within it, = separates keys from values, and & separates key-value pairs. Those two characters — = and & — must be encoded when they appear as data.

RFC 3986 defines the query component as allowing: unreserved characters plus ! $ & ' ( ) * + , ; = : @ / ?. Wait — that includes & and =! The catch is that while RFC 3986 permits them, the application/x-www-form-urlencoded format (used by HTML forms and most APIs) reserves & and = as delimiters within the query. In practice, always encode & as %26 and = as %3D when they appear inside a query value.

The + sign is where things get genuinely complicated. RFC 3986 does not define + as an encoding for space. That convention comes from the older HTML form encoding spec, where + means space and %2B means a literal plus. Modern APIs that follow RFC 3986 should use %20 for space. But most web frameworks accept both conventions because form submissions use +. The result: if you send C++ (the language name) in a query parameter without encoding the plus signs, form-style decoders give you C followed by two spaces.

The safe rule: encode + as %2B when you want a literal plus sign. Use %20 for space, not +, in non-form contexts.
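The ambiguity is easy to reproduce with Python's standard library, which ships both decoders: unquote follows the plain percent-decoding rules, while unquote_plus and parse_qs apply the form convention. The parameter name lang is just an example.

```python
from urllib.parse import parse_qs, unquote, unquote_plus

# Form-style decoding: a literal '+' in the query is read as a space.
print(parse_qs("lang=C++"))        # {'lang': ['C  ']}

# Encoding the plus signs as %2B survives the same decoder.
print(parse_qs("lang=C%2B%2B"))    # {'lang': ['C++']}

# Plain percent-decoding leaves '+' untouched.
print(unquote("C++"))              # C++
print(repr(unquote_plus("C++")))   # 'C  '
```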

I verified this with a real example using Toolora's URL Encoder. Input:

name=C++ & Python&version=3.x

Treating name and version as two separate parameters, the correct form-encoded query string is:

name=C%2B%2B+%26+Python&version=3.x

Broken down: ++ → %2B%2B (literal plus signs encoded), spaces → + (form-encoding convention), & → %26 (ampersand encoded as data, not as delimiter). The outer & between name=... and version=... remains literal — it is the delimiter.
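The same result can be reproduced in Python, assuming the input is read as two parameters: name with the value "C++ & Python" and version with the value "3.x". urlencode uses the form convention by default; passing quote_via=quote switches spaces to %20.

```python
from urllib.parse import parse_qsl, quote, urlencode

params = {"name": "C++ & Python", "version": "3.x"}

form_style = urlencode(params)                  # spaces become '+'
rfc_style = urlencode(params, quote_via=quote)  # spaces become '%20'

print(form_style)  # name=C%2B%2B+%26+Python&version=3.x
print(rfc_style)   # name=C%2B%2B%20%26%20Python&version=3.x

# Both forms decode back to the same key-value pairs.
print(parse_qsl(form_style))
print(parse_qsl(rfc_style))
```

Either output is safe to send to a form-style decoder, because the literal plus signs are %2B in both.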

Fragment Identifiers: Client-Side Only

The fragment (everything after #) identifies a specific section within the resource. RFC 3986 allows fragments to contain: unreserved characters plus ! $ & ' ( ) * + , ; = : @ / ?. The # character itself marks the start of the fragment and must be encoded as %23 inside a fragment value.
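A sketch of encoding a fragment value with that allowed set. FRAGMENT_SAFE and encode_fragment are hypothetical names, and the example.com URL is a placeholder.

```python
from urllib.parse import quote, unquote, urlsplit

# Characters allowed literally in a fragment besides the unreserved set.
# "#" is not in the list, so it is encoded as %23.
FRAGMENT_SAFE = "!$&'()*+,;=:@/?"


def encode_fragment(value: str) -> str:
    """Percent-encode a value for use after the '#' of a URL."""
    return quote(value, safe=FRAGMENT_SAFE)


url = "https://example.com/docs#" + encode_fragment("section #2")
print(url)  # https://example.com/docs#section%20%232

# urlsplit returns the fragment still encoded; unquote recovers the raw value.
print(unquote(urlsplit(url).fragment))  # section #2
```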

A practical difference: browsers never send the fragment to the server. It is processed client-side only. This matters significantly for single-page applications that use hash routing — the server sees /app regardless of whether the user is at /app#profile or /app#settings. If you are building a shareable link for a page-internal anchor, the fragment is handled by the browser and does not require server-side URL decoding.
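To illustrate what the server actually sees, here is a rough sketch that derives the path-plus-query an HTTP client would put on the request line. The helper request_target is a hypothetical name and the example.com URLs are placeholders; real clients do more normalization than this.

```python
from urllib.parse import urlsplit


def request_target(url: str) -> str:
    """Return the path and query of a URL, dropping the fragment."""
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return target


print(request_target("https://example.com/app#profile"))        # /app
print(request_target("https://example.com/app#settings"))       # /app
print(request_target("https://example.com/app?tab=1#profile"))  # /app?tab=1
```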

One behavior worth knowing about browser URL parsers, which follow the WHATWG URL standard: a malformed percent sequence such as %zz is recorded as a validation error but does not make parsing fail, so the browser accepts it and, in a fragment, passes it through to client JavaScript unchanged. The same leniency applies to the path and query at parse time; the difference is that those components are sent to the server, where a stricter parser or router may reject the request. A malformed fragment never gets that far, because fragments are UI state, not server routing state.
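Rather than relying on any parser's tolerance, you can check for malformed percent sequences yourself. The following is a minimal sketch; MALFORMED and malformed_components are hypothetical names, and the check only looks for a % that is not followed by two hex digits.

```python
import re
from typing import List
from urllib.parse import urlsplit

# A '%' that is not followed by exactly two hexadecimal digits.
MALFORMED = re.compile(r"%(?![0-9A-Fa-f]{2})")


def malformed_components(url: str) -> List[str]:
    """Name the components of url that contain a malformed percent sequence."""
    parts = urlsplit(url)
    return [
        name
        for name in ("path", "query", "fragment")
        if MALFORMED.search(getattr(parts, name))
    ]


print(malformed_components("https://example.com/a%zz?q=100%#frag%2"))
print(malformed_components("https://example.com/a%20b#x%23y"))
```

The first call reports all three components; the second returns an empty list because every % is followed by two hex digits.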

The Double-Encoding Trap

Double encoding occurs when you encode an already-encoded string. hello world → hello%20world is correct. Encoding that result again gives hello%2520world — the % was encoded to %25, so the backend receives the literal string hello%20world with characters %, 2, 0 instead of a space.
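The effect is easy to reproduce. The sketch below also includes a rough heuristic, looks_double_encoded (a hypothetical helper), that flags %25 followed by two hex digits; it can produce false positives when the data legitimately contains a literal percent sign followed by hex digits.

```python
import re
from urllib.parse import quote, unquote

once = quote("hello world")
twice = quote(once)

print(once)   # hello%20world
print(twice)  # hello%2520world

# One round of decoding only undoes the outer layer.
print(unquote(twice))           # hello%20world
print(unquote(unquote(twice)))  # hello world


def looks_double_encoded(value: str) -> bool:
    """Heuristic: an encoded '%' followed by two hex digits."""
    return re.search(r"%25[0-9A-Fa-f]{2}", value) is not None


print(looks_double_encoded(twice))  # True
print(looks_double_encoded(once))   # False
```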

I debugged a redirect service where URLs were stored in a database (already percent-encoded), then fed to a redirect generator that encoded them again. Every link was broken with %25-prefixed sequences.

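Fix: encode exactly once. If a stored value is already percent-encoded, decode it back to its raw form before passing it to anything that encodes, or store raw values and encode only when building the final URL. Toolora's URL Parser shows you the encoded and decoded form of each URL component side by side, which makes it easy to spot double-encoded values in a shared link or stored URL.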

Quick Reference

| Component | Literal / allowed? | Space as +? | Notes |
|-----------|---------------------|---------------|-------|
| Path segment | No, encode as %2F | No, use %20 | Framework may decode %2F anyway |
| Query value | Yes (allowed by RFC 3986, avoid in practice) | Form encoding only | Encode &, =, + as data |
| Fragment | Yes | No, use %20 | Never sent to server |
| Hostname | No | No | Use Punycode for internationalized domains |

The reliable way to avoid these edge cases is to always use a URL-aware library (Python's urllib.parse.urlencode, JavaScript's URLSearchParams, Go's url.QueryEscape) rather than string concatenation, and to test with inputs that contain +, &, /, %, and non-ASCII characters before shipping.
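A round-trip check along those lines might look like this in Python. The TRICKY list and the parameter name q are just examples; extend the list with whatever your own inputs can contain.

```python
from urllib.parse import parse_qs, urlencode

# Values chosen to exercise +, &, /, %, =, # and non-ASCII characters.
TRICKY = ["C++", "AI & ML", "a/b", "100%", "café", "key=value", "x#y"]

for value in TRICKY:
    query = urlencode({"q": value})
    decoded = parse_qs(query)["q"][0]
    # The value must survive an encode/decode round trip unchanged.
    assert decoded == value, (value, query, decoded)
    print(f"{value!r:14} -> {query}")
```

If the same loop is run against a hand-rolled string concatenation instead of urlencode, the values containing + and & are typically the first to fail.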

Made by Toolora · Updated 2026-06-26