HTML to Markdown: Convert Web Pages and Rich Text to Clean .md
A practical guide to converting HTML to Markdown — how tables, code blocks, and links are handled, why inline styles get dropped, and how to clean dirty CMS or Word HTML before you convert.
Most HTML-to-Markdown jobs do not start with hand-written HTML. They start with a mess: a blog post copied out of WordPress, a chunk of rich text pasted from a CMS editor, or the body of a page you scraped with "View Source." That HTML carries hundreds of inline styles, wrapper divs, and framework class names that mean nothing once the content leaves its original site. The goal of conversion is not to mirror the HTML — it is to keep the structure (headings, lists, links, code, tables) and throw the rest away.
This guide walks through how that conversion actually works, where it breaks, and how to clean dirty HTML before you feed it in. If you just want to paste and go, open the HTML to Markdown tool and skip to the examples.
What "clean Markdown" actually means
When people say a conversion is "clean," they mean the output has no leftover styling and no orphaned tags — only the portable subset Markdown can represent. That subset is small on purpose: headings h1 through h6, paragraphs, bold and italic, ordered and unordered lists with nesting, links, images, blockquotes, inline code, fenced code blocks, horizontal rules, and pipe tables.
Everything outside that list — style="color:red", text-align, font-family, a <div class="mso-list">, a tracking <span> — gets dropped. That is the right default. Markdown has no concept of color or alignment, so trying to preserve it would just produce raw HTML embedded in your .md, which defeats the purpose. If visual fidelity matters more than portability, stay in HTML.
How the parser reads your HTML
A reliable converter does not use regular expressions to find tags. Regex can't handle nesting, malformed markup, or unclosed elements, and real-world HTML is full of all three. Instead, the HTML to Markdown tool hands your input to the browser's native DOMParser, which uses the same HTML parser the browser runs when it loads a page. It builds a real DOM tree, then walks that tree node by node and emits Markdown for each element it recognizes.
This matters because the browser's parser is forgiving in exactly the ways you need: it auto-closes a <p> you forgot to close, ignores a stray <span>, and still produces a valid tree from copy-pasted soup. Walking the parsed DOM also means nesting is preserved naturally — a list inside a list inside a blockquote comes out correctly indented, because the tree already encodes that depth. And because DOMParser runs entirely in the page, nothing you paste is uploaded or written to the URL; you can convert internal or unpublished content without it leaving your machine.
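The parse-then-walk idea can be sketched outside the browser too. The example below is a minimal illustration in Python, using the standard library's html.parser as a stand-in for DOMParser. That substitution is an assumption for the sake of a runnable example: html.parser does not repair broken markup the way a browser does, and the names Node, TreeBuilder, and html_to_markdown are hypothetical, not the tool's actual code. It handles only headings, paragraphs, lists, and a few inline elements, and it assumes inline content is wrapped in a p, heading, or li.

```python
import re
from html.parser import HTMLParser

VOID = {"br", "hr", "img"}  # elements that never get a closing tag


class Node:
    def __init__(self, tag, attrs=()):
        self.tag = tag
        self.attrs = dict(attrs)
        self.children = []  # Node objects or plain text strings


class TreeBuilder(HTMLParser):
    """Turns parser events into a tree. It does not repair markup like a browser."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self.stack[-1].children.append(node)
        if tag not in VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close back to the nearest matching open element, if there is one.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        self.stack[-1].children.append(data)


def inline(node):
    """Render a node's children as inline Markdown; attributes are never read."""
    parts = []
    for child in node.children:
        if isinstance(child, str):
            parts.append(re.sub(r"\s+", " ", child))
        elif child.tag in ("strong", "b"):
            parts.append(f"**{inline(child).strip()}**")
        elif child.tag in ("em", "i"):
            parts.append(f"*{inline(child).strip()}*")
        elif child.tag == "code":
            parts.append(f"`{inline(child)}`")
        elif child.tag == "a":
            parts.append(f"[{inline(child).strip()}]({child.attrs.get('href') or ''})")
        else:
            parts.append(inline(child))  # unknown wrapper (span, font): keep the text
    return "".join(parts)


def render_list(node, depth):
    """Render ul/ol; nesting depth comes straight from the tree."""
    lines, count = [], 0
    for li in node.children:
        if isinstance(li, str) or li.tag != "li":
            continue
        count += 1
        marker = f"{count}." if node.tag == "ol" else "-"
        nested = [c for c in li.children
                  if not isinstance(c, str) and c.tag in ("ul", "ol")]
        text = Node("li")
        text.children = [c for c in li.children if c not in nested]
        lines.append("    " * depth + f"{marker} {inline(text).strip()}")
        lines.extend(render_list(sub, depth + 1) for sub in nested)
    return "\n".join(lines)


def blocks(node):
    """Walk block-level children and return a list of Markdown blocks."""
    out = []
    for child in node.children:
        if isinstance(child, str):
            if child.strip():
                out.append(re.sub(r"\s+", " ", child).strip())
        elif re.fullmatch(r"h[1-6]", child.tag):
            out.append("#" * int(child.tag[1]) + " " + inline(child).strip())
        elif child.tag == "p":
            out.append(inline(child).strip())
        elif child.tag in ("ul", "ol"):
            out.append(render_list(child, 0))
        else:
            out.extend(blocks(child))  # div and other wrappers: recurse, drop the tag
    return out


def html_to_markdown(source):
    builder = TreeBuilder()
    builder.feed(source)
    builder.close()
    return "\n\n".join(blocks(builder.root))


sample = ('<div class="x"><h2 style="color:red">Setup</h2>'
          '<ul><li>One<ul><li>Two</li></ul></li></ul></div>')
print(html_to_markdown(sample))
```

The sample prints a level-two heading followed by a list whose nested item is indented four spaces. Nothing in the walker tracks indentation separately; the depth passed to render_list mirrors how deep the list sits in the tree.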
Tables, code, and links: the three that go wrong
These three element types cause the most pain, so it's worth knowing what to expect.
Tables become pipe tables with a header separator row. Cells with inline content — a link, a bit of bold — convert in place. The limit is block content: a list inside a table cell can't be represented in pipe syntax, so it falls back to escaped text. If your source HTML leans on complex table cells, that's a sign the data wants a different shape; a Markdown table generator is often a cleaner way to rebuild the table from scratch than to wrestle the converted output.
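For reference, the pipe-table shape is simple enough to sketch in a few lines. The function below is a hypothetical helper, not the tool's code. It assumes each cell has already been reduced to inline text, and it flattens newlines and escapes literal pipes so a cell cannot break its row.

```python
def to_pipe_table(header, rows):
    """Render a header row and body rows as a Markdown pipe table."""

    def cell(text):
        # A raw pipe or newline would end the cell early, so escape or flatten it.
        return str(text).replace("|", "\\|").replace("\n", " ").strip()

    def line(cells):
        return "| " + " | ".join(cell(c) for c in cells) + " |"

    width = len(header)
    out = [line(header), line(["---"] * width)]  # header, then the separator row
    for row in rows:
        # Pad or trim ragged rows so every line has the same number of cells.
        padded = (list(row) + [""] * width)[:width]
        out.append(line(padded))
    return "\n".join(out)


print(to_pipe_table(["Flag", "Meaning"], [["-g", "install globally"], ["a|b", "either"]]))
```

The padding step matters for scraped tables, where rows with missing or extra cells are common and a ragged row is easy to misread in the rendered output.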
Code splits two ways. Inline <code> becomes a backticked span. A <pre><code> block becomes a fenced block with triple backticks, and the contents are emitted verbatim, with no escaping of the < and > inside, which is exactly what you want for code samples.
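One edge case is worth a sketch: code that itself contains backticks. Under CommonMark, a code span or fence has to use a longer backtick run than anything inside it. The helpers below are hypothetical and show one way to pick the delimiter length; whether the tool does exactly this is an assumption. html.unescape stands in for the entity decoding a DOM does when you read an element's text.

```python
import html
import re


def longest_backtick_run(text):
    """Length of the longest run of consecutive backticks in text (0 if none)."""
    return max((len(run) for run in re.findall(r"`+", text)), default=0)


def inline_code(text):
    """Wrap text in a code span whose delimiter is longer than any run inside."""
    ticks = "`" * (longest_backtick_run(text) + 1)
    # A space keeps a leading or trailing backtick from merging with the delimiter.
    pad = " " if text.startswith("`") or text.endswith("`") else ""
    return f"{ticks}{pad}{text}{pad}{ticks}"


def fenced_block(code, language=""):
    """Emit code verbatim inside a fence of at least three backticks."""
    fence = "`" * max(3, longest_backtick_run(code) + 1)
    return f"{fence}{language}\n{code.rstrip()}\n{fence}"


# Entities in the source HTML become literal characters in the output.
print(fenced_block(html.unescape("&lt;p&gt;hi&lt;/p&gt;"), "html"))
print(inline_code("npm i -g toolora"))
```

Choosing the fence from the longest run anywhere in the code is more conservative than strictly needed, since only runs at the start of a line can close a fence, but it keeps the rule easy to check.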
Links and images keep their href and src. A relative URL stays relative, so if you scraped a page from example.com, you may need to rewrite paths to absolute URLs afterward depending on where the Markdown will live.
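If the Markdown will live somewhere other than the original site, that rewrite can be scripted. The sketch below uses urljoin from Python's standard library. The absolutize function and its regex are illustrative assumptions: the pattern covers plain inline links and images, not reference-style links or URLs containing parentheses.

```python
import re
from urllib.parse import urljoin

# Captures the "[text](" or "![alt](" prefix, then the URL up to a space or ")".
LINK = re.compile(r"(!?\[[^\]]*\]\()([^)\s]+)")


def absolutize(markdown, base_url):
    """Resolve relative link and image targets against base_url."""
    return LINK.sub(lambda m: m.group(1) + urljoin(base_url, m.group(2)), markdown)


text = "Edit [the config](/config) and see ![logo](img/a.png)."
print(absolutize(text, "https://example.com/docs/"))
```

urljoin leaves targets that already carry their own scheme, such as https: or mailto: links, unchanged, so running this over a document with mixed links is safe.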
A real example
Here is a snippet of typical CMS HTML, the kind WordPress emits, with block class names, an inline style, and a wrapper div:
<div class="entry-content">
<h2 class="wp-block-heading">Setup</h2>
<p style="text-align:left">Install the CLI with <code>npm i -g toolora</code>.</p>
<ul class="wp-block-list">
<li>Run <strong>toolora init</strong></li>
<li>Edit <a href="/config">the config</a></li>
</ul>
</div>
Converted, that collapses to portable Markdown with every class, style, and wrapper gone:
## Setup
Install the CLI with `npm i -g toolora`.
- Run **toolora init**
- Edit [the config](/config)
Notice what survived: the heading level, the inline code, the bold, the link, the list. Notice what didn't: class="wp-block-heading", style="text-align:left", the <div> wrapper. That's the whole job in one block — keep structure, drop decoration.
Cleaning dirty HTML before you convert
The single biggest cause of bad output is pasting too much. If you copy a full page, you get the nav menu, the cookie banner, and the footer converted into headings and lists right alongside your article. Copy only the article body — the content element, not the whole document.
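When you have a saved page rather than a selection, the same trimming can be done in code. Here is a minimal sketch using Python's html.parser that keeps only what sits inside the content element. It assumes the article lives in an <article> tag, which many sites do not use, so the tag name is a parameter; BodyExtractor and extract_body are hypothetical names.

```python
import html
from html.parser import HTMLParser


class BodyExtractor(HTMLParser):
    """Collects the inner HTML of every element with the given tag name."""

    def __init__(self, tag="article"):
        super().__init__()  # text arrives with entities already decoded
        self.tag = tag
        self.depth = 0      # greater than 0 while inside the content element
        self.out = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.out.append(self.get_starttag_text())
        if tag == self.tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth:
            self.depth -= 1
        if self.depth:
            self.out.append(f"</{tag}>")

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied once, with no closing tag.
        if self.depth:
            self.out.append(self.get_starttag_text())

    def handle_data(self, data):
        if self.depth:
            # Re-escape so the extracted fragment is still valid HTML.
            self.out.append(html.escape(data, quote=False))


def extract_body(page_html, tag="article"):
    parser = BodyExtractor(tag)
    parser.feed(page_html)
    parser.close()
    return "".join(parser.out).strip()


page = ('<nav><a href="/">Home</a></nav>'
        '<article><h1>Title</h1><p>Body &amp; more<br/></p></article>'
        '<footer>Contact us</footer>')
print(extract_body(page))
```

The nav and footer never reach the converter, so they cannot turn into stray headings or lists in the output.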
Word and Google Docs are their own category. Their paste HTML hides structure in non-semantic markup (in Word's case, mso-list styles and <o:p> tags) that has no Markdown equivalent, so list numbering and some paragraph breaks vanish. For those, save the document as .html first and clean it, or run it through a plain HTML editor before converting.
In my own workflow I migrated a batch of articles out of a CMS this way, and the step that saved the most time was minifying first: I ran each page's body through an HTML minifier to strip comments and collapse whitespace, which made it obvious where the real content boundaries were before I copied the body into the converter. When the source is structured data rather than prose — config tables, API responses — I skip the HTML round-trip entirely and reach for YAML to JSON or CSV to JSON instead, since those formats round-trip far more cleanly than HTML tables do.
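The minifying step can be approximated in a few lines if you don't have a minifier handy. This is a rough pre-pass, not a parser and not a real minifier: it uses regular expressions only to delete comments and collapse whitespace, and it leaves <pre> blocks alone so code samples keep their line breaks. The function name is hypothetical.

```python
import re

COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
PRE = re.compile(r"(<pre\b.*?</pre>)", re.DOTALL | re.IGNORECASE)


def minify_html(source):
    """Strip comments and collapse whitespace runs outside <pre> blocks."""
    source = COMMENT.sub("", source)
    # Splitting on a capturing group puts the <pre> blocks at the odd indexes.
    parts = PRE.split(source)
    cleaned = [
        part if i % 2 else re.sub(r"\s+", " ", part)
        for i, part in enumerate(parts)
    ]
    return "".join(cleaned).strip()


messy = "<div>\n  <!-- ad slot -->\n  <p>Hi   there</p>\n<pre>a\n  b</pre>\n</div>"
print(minify_html(messy))
```

Whitespace is collapsed to a single space rather than removed, because deleting the space between two inline elements would glue their text together.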
When to use a different tool
HTML to Markdown is for portability — getting content into version control, a docs repo, or an LLM prompt where every <p class="..."> would otherwise burn tokens. If you need the reverse direction, Markdown to HTML closes the loop, and for the supported subset the round-trip is stable. If you're heading into a React codebase rather than a .md file, convert straight to components with HTML to JSX instead of going through Markdown.
The rule of thumb: convert to Markdown when you want clean, diffable, portable text. Stay in HTML — or pick a structured format — when layout, styling, or complex nesting is the point.
Made by Toolora · Updated 2026-06-13