HTML Entity Encoding by Context: When to Escape and Which Named Entities to Use

A practical guide to HTML entity encoding — how the right characters and named entities differ between text nodes, attribute values, URL attributes, and script contexts.

Published By Li Lei
#html #encoding #web-development #security

The advice "escape your HTML" sounds simple until you realise that an ampersand inside an href attribute needs a different treatment than the same character inside a paragraph. I spent an afternoon chasing a broken link that traced back to a double-encoded & in a URL attribute — it looked correct in source, rendered wrong in the browser, and broke quietly with no error. That bug taught me that HTML entity encoding is not one rule but four different rules depending on where in the document you are writing.

Why Context Determines Which Characters to Escape

HTML parsers treat the same byte sequence differently depending on parsing mode. Inside a text node the parser is looking for < and &. Inside a double-quoted attribute value it watches for " and &. Inside href or src the HTML parser first decodes character references, exactly as it does for any other attribute, and only then hands the decoded value to a URL parser that interprets % sequences. Inside <script> the tokenizer stops decoding character references altogether and passes the raw text to the JavaScript engine.

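This means escaping < as &lt; in a <script> block does nothing useful: the HTML parser does not decode character references there, so the JavaScript engine receives the literal text &lt;. And percent-encoding an & as %26 inside an onclick attribute only confuses things further, because nothing percent-decodes an event-handler attribute. Four contexts, four rule sets.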
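
The layering for a URL attribute can be imitated with Python's standard library. This is a rough model under stated assumptions, not a real browser: html.unescape stands in for the HTML parser's attribute decoding, urllib.parse.unquote stands in for the URL layer, and the attribute value is made up for illustration.

```python
import html
from urllib.parse import unquote

# Hypothetical href value exactly as it would appear in HTML source.
raw_attribute = "/search?q=fish%26chips&amp;page=2"

# Layer 1: the HTML parser decodes character references in the attribute.
after_html = html.unescape(raw_attribute)

# Layer 2: the URL layer splits the query on "&", then percent-decodes each value.
query = after_html.split("?", 1)[1]
params = {}
for pair in query.split("&"):
    name, value = pair.split("=", 1)
    params[name] = unquote(value)

print(after_html)  # /search?q=fish%26chips&page=2
print(params)      # {'q': 'fish&chips', 'page': '2'}
```

The &amp; becomes a parameter separator at the HTML layer, while the %26 survives until the URL layer and ends up as data inside the q value.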

The HTML5 spec formally defines 2,231 named character references (per the W3C HTML5 reference, 2014 edition), but in day-to-day work fewer than fifteen of them appear with any frequency. Knowing which fifteen and where to use them is more practical than memorising the full table.

Text Nodes: Escape Three Characters, No More

Inside element content — everything between <p> and </p>, for example — the minimum safe set is:

| Character | Named entity | Decimal | When you see it |
|-----------|-------------|---------|----------------|
| < | &lt; | &#60; | Any literal less-than sign |
| > | &gt; | &#62; | After ]] in CDATA sections; optional elsewhere |
| & | &amp; | &#38; | Any literal ampersand |

In practice I skip escaping > in text nodes unless I am writing XHTML. Browsers tolerate unescaped > in HTML5 text content, but I always escape < and & without exception.

Real example: suppose you want to display the expression a && b < c in a blog post code sample.

Input: a && b < c

Output HTML: a &amp;&amp; b &lt; c

That is all. Two rules, one output. A tool like HTML Entity Encoder / Decoder handles this in one paste.
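
If you would rather do it in code, here is a minimal sketch using Python's standard html module. The escape_text name is my own wrapper, not a library function. Note that html.escape also escapes >, which is harmless in a text node.

```python
import html

def escape_text(value: str) -> str:
    # quote=False leaves " and ' untouched, which is enough for element content.
    return html.escape(value, quote=False)

print(escape_text("a && b < c"))  # a &amp;&amp; b &lt; c
```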

Attribute Values: Add the Quote Character

Inside an attribute value the required set expands. Which extra character to escape depends on which quote delimiter you chose:

  • Double-quoted attribute: also escape " as &quot;
  • Single-quoted attribute: also escape ' as &#39; (HTML 4 has no &apos;; HTML5 added it, but older parsers may not recognise it)

Both forms in markup:

<!-- Correct -->
<input placeholder="Search &amp; filter" title="Use &quot;exact&quot; match">

<!-- Also correct, single quotes -->
<button title='It&#39;s working'>Click</button>

A common mistake I see in templating engines: developers escape < and & correctly in attributes but forget the quote character, which lets an attacker close the attribute and inject new ones.

In the HTML parsing algorithm (WHATWG Living Standard), an unescaped " inside a double-quoted attribute value is not treated as anything special: it simply ends the value, and whatever follows is tokenized as further attributes. Every conforming browser does this the same way, which is exactly what makes attribute injection a dependable XSS vector.
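
To see the attribute break-out concretely, here is a small sketch that feeds an unescaped and an escaped value through Python's html.parser. That parser is a tolerant approximation, not a browser's, and AttrCollector, parsed_attrs, and the attacker string are hypothetical names and data chosen for the demonstration.

```python
import html
from html.parser import HTMLParser

class AttrCollector(HTMLParser):
    """Records the (name, value) attribute pairs of every start tag."""

    def __init__(self):
        super().__init__()
        self.collected = []

    def handle_starttag(self, tag, attrs):
        self.collected.extend(attrs)

def parsed_attrs(markup: str):
    collector = AttrCollector()
    collector.feed(markup)
    collector.close()
    return collector.collected

# Hypothetical attacker-controlled value.
user_value = 'x" onmouseover="alert(1)'

unsafe = f'<input title="{user_value}">'
safe = f'<input title="{html.escape(user_value, quote=True)}">'

print(parsed_attrs(unsafe))  # [('title', 'x'), ('onmouseover', 'alert(1)')]
print(parsed_attrs(safe))    # [('title', 'x" onmouseover="alert(1)')]
```

In the unescaped case the quote ends the title value and a second attribute appears. In the escaped case the whole string stays inside title as inert text.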

URL Attributes: Percent-Encoding Wins

The href, src, and action attributes, along with any data-* attribute that carries a URL, are different again. A URL that contains an ampersand, as practically every analytics-tagged link does, needs two layers of encoding: percent-encode the query string values themselves according to the URL spec, then write the & separator between query parameters as &amp; when that URL sits inside HTML.

Input URL with two query params: https://example.com/page?utm_source=blog&utm_medium=organic

Correct inside an HTML href: https://example.com/page?utm_source=blog&amp;utm_medium=organic

Correct in a CSS url() value inside a stylesheet or <style> block: just the raw URL, no entity encoding.
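
The two layers can be sketched in Python as below. build_href is a hypothetical helper of my own, and it assumes the base URL has no existing query string. urlencode handles the URL layer (it encodes spaces as +), and html.escape handles the HTML layer.

```python
import html
from urllib.parse import urlencode

def build_href(base: str, params: dict) -> str:
    # Step 1 (URL layer): percent-encode each parameter name and value.
    url = f"{base}?{urlencode(params)}"
    # Step 2 (HTML layer): escape the finished URL once for the attribute.
    return html.escape(url, quote=True)

href = build_href(
    "https://example.com/page",
    {"utm_source": "blog", "utm_medium": "organic"},
)
print(f'<a href="{href}">link</a>')
# <a href="https://example.com/page?utm_source=blog&amp;utm_medium=organic">link</a>
```

An ampersand that is part of a value never reaches the HTML layer as a bare &: build_href("/s", {"q": "fish & chips"}) yields /s?q=fish+%26+chips.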

Where it goes wrong: if your CMS or template already HTML-encodes the & to &amp;, and then a second pass encodes it again, the separator in the href becomes &amp;amp;. The browser decodes that only once, to the literal text &amp;, so the server receives a parameter named amp;utm_medium instead of utm_medium. I have seen this in WordPress shortcodes and Jinja templates that call |escape on a string that was already escaped. The fix is always to escape once, at the point of output, never upstream.
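
A short sketch of that failure, assuming html.unescape is a fair stand-in for the browser's single decoding pass. server_params is a hypothetical helper that splits the query the way a typical server would.

```python
import html
from urllib.parse import urlsplit

url = "https://example.com/page?utm_source=blog&utm_medium=organic"

once = html.escape(url)    # correct: one pass at output time
twice = html.escape(once)  # bug: an already-escaped string escaped again

def server_params(attribute_value: str) -> dict:
    # The browser decodes character references exactly once ...
    decoded = html.unescape(attribute_value)
    # ... and the query string is then split on "&".
    query = urlsplit(decoded).query
    return dict(pair.split("=", 1) for pair in query.split("&"))

print(server_params(once))   # {'utm_source': 'blog', 'utm_medium': 'organic'}
print(server_params(twice))  # {'utm_source': 'blog', 'amp;utm_medium': 'organic'}
```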

To inspect a URL's exact encoding before pasting it into HTML, I run it through URL Encoder / Decoder — it separates percent-encoding from entity encoding visually, which makes the double-encoding problem obvious at a glance.

The Named Entities Worth Memorising

Beyond the five escaping references (&lt; &gt; &amp; &quot; &#39;), a handful of named entities appear often enough that knowing them saves a trip to a reference table each time:

| Use case | Named entity | Glyph |
|----------|-------------|-------|
| Non-breaking space between a number and its unit | &nbsp; | (invisible) |
| Em dash in prose | &mdash; | — |
| En dash in ranges | &ndash; | – |
| Copyright notice | &copy; | © |
| Registered trademark | &reg; | ® |
| Ellipsis (single code point) | &hellip; | … |
| Left/right typographic quotes | &ldquo; &rdquo; | “ ” |
| Multiplication sign in formulas | &times; | × |
| Arrow in keyboard shortcut docs | &rarr; | → |

Everything else I look up as needed using HTML Entities Encoder, which lets me search by glyph or by name and copy the exact format I need (named, decimal, or hex).

One naming trap: &nbsp; is a no-break space (U+00A0), not a zero-width space. Stacking four &nbsp; to fake an indent is a layout smell — use CSS padding or <pre> instead. The entity is correct for "10 kg" where the number and unit must not wrap across lines, not for pushing text sideways.
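
For a quick lookup without leaving the terminal, Python's html.entities module carries the HTML 4 name table. The sketch below prints the named, decimal, and hex forms for the entities in the table above. entity_forms is a hypothetical helper name.

```python
from html.entities import name2codepoint

def entity_forms(name: str) -> dict:
    # name2codepoint maps an HTML 4 entity name (without "&" or ";") to its code point.
    codepoint = name2codepoint[name]
    return {
        "named": f"&{name};",
        "decimal": f"&#{codepoint};",
        "hex": f"&#x{codepoint:X};",
        "glyph": chr(codepoint),
    }

for name in ("nbsp", "mdash", "ndash", "copy", "reg", "hellip", "times", "rarr"):
    print(entity_forms(name))

# &nbsp; resolves to U+00A0, the no-break space.
print(entity_forms("nbsp")["hex"])  # &#xA0;
```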

A Quick Decision Checklist

Before I write any character into HTML markup I ask three questions in order:

  1. Am I inside a <script> or <style> block? If yes, follow JavaScript/CSS escaping rules — HTML entities are irrelevant here.
  2. Am I inside a URL attribute? If yes, percent-encode the query string, then use &amp; for the & separator at the HTML layer.
  3. Which quote character surrounds my attribute value? Escape " as &quot; (double-quoted) or ' as &#39; (single-quoted), plus always escape < and &.

If none of the above applies, I am in a text node and only < and & need escaping.
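
The checklist can be sketched as a single dispatch function. This is an illustration under assumptions, not a complete sanitiser: the context names are my own invention, the script branch only covers a value embedded as a JavaScript string literal, and <style> content is left out entirely.

```python
import html
import json
from urllib.parse import quote

def escape_for(context: str, value: str) -> str:
    """Escape value for one output context (hypothetical context names)."""
    if context == "script":
        # JavaScript string literal; "<" is written as \u003c so that
        # "</script>" can never appear inside the block.
        return json.dumps(value).replace("<", "\\u003c")
    if context == "url_param":
        # One query-string value: percent-encode first, then HTML-escape.
        return html.escape(quote(value, safe=""), quote=True)
    if context == "attribute":
        # Covers both quote styles: " becomes &quot; and ' becomes &#x27;.
        return html.escape(value, quote=True)
    if context == "text":
        return html.escape(value, quote=False)
    raise ValueError(f"unknown context: {context}")

print(escape_for("script", "</script>"))        # "\u003c/script>"
print(escape_for("url_param", "fish & chips"))  # fish%20%26%20chips
print(escape_for("attribute", 'say "hi"'))      # say &quot;hi&quot;
print(escape_for("text", "a < b"))              # a &lt; b
```

The branch order mirrors the questions: script first, URL second, attribute third, text node as the fallback.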

The key insight is that double-encoding breaks things just as reliably as under-encoding — a &amp;amp; in a URL attribute produces a broken link, not a security fix. Encode once, at the right layer, with the right rule for that context.


Made by Toolora · Updated 2026-06-29