How to Extract HTML Links and Build a Clean URL Inventory

Pull every anchor href, image, and script URL out of an HTML file, then deduplicate, split internal from external, and turn the list into an SEO audit or sitemap.

Published By Li Lei
#html #seo #links #audit #web

Every HTML page hides a small graph of URLs: the pages it points to, the images it loads, the scripts it pulls in, the stylesheet it depends on. Most of the time you never look at that graph. Then you migrate a site, tighten a content security policy, or run an SEO audit, and suddenly you need the whole list in one place. Copy-pasting from "View Source" is miserable, and writing a one-off parser for a single audit is overkill.

This post walks through how to extract links from HTML cleanly, what "extract a link" actually means at the markup level, and how to turn a raw dump into something useful: a sitemap seed, a dead-link checklist, or an internal-versus-external map for an SEO review.

What "a link" really means in HTML

When people say "get all the links," they usually mean the href attribute of every <a> tag. That is the navigational layer of the page, and it is the right place to start. A line like <a href="/pricing/" class="nav-item">Pricing</a> carries exactly one URL worth extracting: the value of href, which is /pricing/. The class, the link text, and the surrounding markup are noise for this job.

But a page references far more than anchors. A realistic inventory also includes:

  • <img src="..."> — image assets
  • <script src="..."> — JavaScript dependencies
  • <link href="..."> — stylesheets, canonical tags, preloads, icons
  • <meta content="..."> values that happen to be URLs, like og:image

If your goal is an SEO audit, the anchor href list is the core. If your goal is CSP hardening or a migration manifest, you want all of the above, because a third-party script URL is exactly the kind of dependency that breaks after you move servers.
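As a rough sketch of what collecting those tag and attribute pairs can look like, here is a small Python version built on the standard library's html.parser. It is an illustration, not the code of any particular tool. The names URL_ATTRS, LinkCollector, and extract_urls are made up for this example, and treating a meta value as a URL only when it starts with http:// or https:// is a simplifying assumption.

```python
from html.parser import HTMLParser

# Which attribute carries the URL for each tag of interest.
URL_ATTRS = {"a": "href", "img": "src", "script": "src", "link": "href"}


class LinkCollector(HTMLParser):
    """Collects (tag, url) pairs from anchors, images, scripts, and link tags."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        wanted = URL_ATTRS.get(tag)
        if wanted and attr_map.get(wanted):
            self.found.append((tag, attr_map[wanted]))
        # Assumption: a <meta> value counts as a URL only if it starts with http(s).
        if tag == "meta":
            content = attr_map.get("content") or ""
            if content.startswith(("http://", "https://")):
                self.found.append((tag, content))


def extract_urls(html):
    """Return every (tag, url) pair found in the markup, in document order."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.found
```

For an SEO audit you would keep only the rows whose tag is "a". For a CSP review or migration manifest you would keep all of them.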

A worked example: from markup to a link list

Here is a small, realistic HTML fragment:

<header>
  <a href="/">Home</a>
  <a href="/blog/">Blog</a>
  <a href="https://twitter.com/toolora">Follow us</a>
</header>
<article>
  <a href="/blog/">Back to blog</a>
  <img src="/img/hero.png" alt="Hero">
  <a href="https://partner.example.com/ref?id=42">Partner</a>
</article>
<script src="https://cdn.example.com/app.js"></script>

Pull out the anchor href values and you get five raw entries:

/
/blog/
https://twitter.com/toolora
/blog/
https://partner.example.com/ref?id=42

Notice that /blog/ appears twice. Three of the five entries are relative (/ once, /blog/ twice) and two are absolute. Now run two simple transforms, deduplicate and then classify by whether the URL has a host, and the list becomes an audit artifact:

Internal (relative):
/
/blog/

External (absolute):
https://twitter.com/toolora
https://partner.example.com/ref?id=42

Add the asset rows (/img/hero.png, https://cdn.example.com/app.js) and you have a full resource inventory for one page: two internal pages, two outbound links, one local image, one third-party script. That last script line is what a CSP review or privacy check cares about most.
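The two transforms are short enough to sketch in Python. This is one possible version, using urllib.parse from the standard library; the function names dedupe and classify are illustrative, and the raw list is the five anchor values from the fragment above.

```python
from urllib.parse import urlsplit

raw = [
    "/",
    "/blog/",
    "https://twitter.com/toolora",
    "/blog/",
    "https://partner.example.com/ref?id=42",
]


def dedupe(urls):
    # dict keys keep insertion order, so each URL stays at its first position.
    return list(dict.fromkeys(urls))


def classify(urls):
    internal, external = [], []
    for url in urls:
        # A URL with a network location (host) is treated as external here.
        (external if urlsplit(url).netloc else internal).append(url)
    return internal, external


internal, external = classify(dedupe(raw))
```

One limit of the host test: an absolute URL that points at your own domain lands in the external bucket. If the markup mixes both styles, compare the host against your own domain instead of checking only whether a host is present.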

The HTML Link Extractor does exactly this. You paste markup or upload an HTML export, it parses the document with the browser's built-in DOMParser, and it writes anchors, image sources, script URLs, stylesheet and canonical links, and URL-shaped meta content into a CSV. Nothing is uploaded — the parsing happens locally — which matters when the file is an internal template full of private endpoints.

Why a browser DOMParser beats a regex

It is tempting to grab links with a regular expression like href="([^"]+)". It works on toy input and falls apart on real pages. Single-quoted attributes, unquoted attributes, commented-out blocks, escaped characters, and other attributes whose names or values happen to contain href= (data-href, for example) all trip up a naive pattern. You end up either missing links or capturing text that is not a link at all.

Parsing the document as an actual DOM tree sidesteps all of that. The parser builds the same node structure the browser would, then you ask each <a> node for its href. Whether the source wrote href='...', href="...", or wrapped the tag across three lines, the extracted value is the same. This is the difference between a tool that works on your own clean markup and one that survives a vendor template or a CMS export.
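To make the difference concrete, here is a small comparison in Python. Note the substitution: it uses the standard library's html.parser, which tokenizes tags and does not build a full DOM tree the way a browser's DOMParser does, so treat it as a stand-in for the idea and not the same mechanism. SAMPLE and AnchorCollector are made up for the demonstration.

```python
import re
from html.parser import HTMLParser

SAMPLE = """
<a href='/single-quoted/'>One</a>
<a href=about.html>Two</a>
<a
  class="nav"
  href="/multi-line/">Three</a>
<!-- <a href="/commented-out/">Old</a> -->
"""

# The naive pattern: only double-quoted values, and blind to comments.
naive = re.findall(r'href="([^"]+)"', SAMPLE)


class AnchorCollector(HTMLParser):
    """Records the href of every <a> start tag the parser reports."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


collector = AnchorCollector()
collector.feed(SAMPLE)
collector.close()

print("regex :", naive)            # misses two links, keeps the commented-out one
print("parser:", collector.hrefs)  # three live links, comment ignored
```

The regex misses the single-quoted and unquoted links and reports the commented-out one. The parser returns the three live links regardless of quoting or line breaks.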

Turning the list into an SEO audit or sitemap

Once you have a clean, deduplicated URL list, several common jobs fall out of it almost for free.

Sitemap seed. Filter the inventory to internal anchor URLs only, drop fragments and query strings if you do not want them indexed separately, and you have a starting list of routes. It is not a substitute for a real crawler, but for a small static site it is often the fastest way to bootstrap a sitemap.xml.
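A minimal sketch of that filter in Python, assuming you already have a list of anchor href values and know the site's base URL. The function names sitemap_seed and to_sitemap_xml are hypothetical, and dropping every query string is a choice you may not want for your own site.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit
from xml.sax.saxutils import escape


def sitemap_seed(hrefs, base_url):
    """Keep same-host http(s) URLs, stripped of query and fragment, deduplicated."""
    base_host = urlsplit(base_url).netloc
    routes = []
    for href in hrefs:
        parts = urlsplit(urljoin(base_url, href))
        if parts.scheme not in ("http", "https") or parts.netloc != base_host:
            continue  # skip external links, mailto:, tel:, and similar
        clean = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
        if clean not in routes:
            routes.append(clean)
    return routes


def to_sitemap_xml(routes):
    """Wrap the routes in a bare-bones sitemap.xml document."""
    rows = "\n".join(f"  <url><loc>{escape(url)}</loc></url>" for url in routes)
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        f"{rows}\n"
        "</urlset>"
    )
```

Calling sitemap_seed(["/", "/blog/", "/blog/?page=2", "/blog/#top"], "https://example.com/") would leave two routes, the home page and /blog/, ready to pass to to_sitemap_xml.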

Dead-link checklist. The extractor gives you the list of targets; feed those targets to any link checker or a quick batch of HEAD requests, and the broken ones surface. The key is that you are now checking a finite, deduped list instead of re-discovering links by hand on every page.
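The "quick batch of HEAD requests" can be sketched with Python's urllib. This is a bare-bones version under a few assumptions: the URLs are already absolute, a status of 400 or above counts as broken, and the 10-second timeout is arbitrary. The names head_status and find_broken are illustrative.

```python
import urllib.error
import urllib.request


def head_status(url, timeout=10):
    """Return the HTTP status of a HEAD request, or None if no response arrived."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with a 4xx or 5xx status
    except (OSError, ValueError):
        return None  # DNS failure, refused connection, timeout, or malformed URL


def find_broken(urls, fetch=head_status):
    """Return (url, status) pairs for targets that failed or returned >= 400."""
    broken = []
    for url in urls:
        status = fetch(url)
        if status is None or status >= 400:
            broken.append((url, status))
    return broken
```

Some servers reject HEAD while serving GET normally, so a 405 in the results is worth rechecking with a GET before you call the link dead. The fetch parameter exists so the checker can be swapped for a stub or a rate-limited client.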

Internal vs external balance. A page that links out 30 times and links internally twice may be sending most of its link equity off-site, and it gives crawlers little to go on about your site structure. Splitting the list by host makes that imbalance obvious at a glance.
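Counting links per host takes only a few lines of Python. In this sketch, host_counts is a hypothetical helper and own_host is whatever domain the page lives on; relative URLs are counted under it.

```python
from collections import Counter
from urllib.parse import urlsplit


def host_counts(urls, own_host):
    """Count links per host, attributing relative URLs to own_host."""
    counts = Counter()
    for url in urls:
        host = urlsplit(url).netloc or own_host
        counts[host] += 1
    return counts
```

Sorting the result with counts.most_common() puts the heaviest destinations first, which is usually the view you want when checking the balance.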

One caution: relative URLs come out as written. /blog/ is extracted as /blog/, not as https://yoursite.com/blog/. Before you hand the list to a crawler, resolve relative paths against the site's base URL. And because this is static parsing, links that JavaScript injects at runtime are not in the source HTML, so they will not appear. That is a property of the page, not a bug in the extraction.
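Resolving relative paths is a one-liner with urljoin from Python's standard library. A minimal sketch, where resolve_all is an illustrative name and the base URL is the placeholder domain used above:

```python
from urllib.parse import urljoin


def resolve_all(hrefs, base_url):
    """Resolve each href against base_url; absolute URLs pass through unchanged."""
    return [urljoin(base_url, href) for href in hrefs]


resolved = resolve_all(
    ["/blog/", "../pricing/", "https://twitter.com/toolora"],
    "https://yoursite.com/docs/intro/",
)
```

Use the URL of the page the markup came from as the base, not just the site root, because paths like ../pricing/ resolve differently depending on where the page lives. If the document declares a base element, its href should take the place of the page URL.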

How I use it in a real audit

The first time this saved me real time was a template handoff. A client shipped me a 400-line HTML email-and-landing template from an agency, and the question was simple but urgent: which external domains does this thing call, and are any of them tracking the recipient? Reading the markup top to bottom, I kept losing my place between inline styles and conditional comments. So I dropped the whole file into the extractor, took the CSV, and sorted the URLs by host. In under a minute I had the answer: two CDNs I expected, one analytics pixel nobody mentioned, and a font URL pointing at a domain that had since lapsed. The lapsed font domain was the actual bug — a dead <link> that would have failed silently in production. I would never have spotted it by scrolling.

A small toolkit around the inventory

The link list is rarely the final deliverable. A few neighboring jobs come up constantly. If your URLs are buried inside query strings and you only need the parameters, the URL Query Params Extractor breaks ?id=42&utm_source=x into a clean key-value table. If the CSV the extractor produces needs to be sliced down to a single column before you hand it to another script, the CSV Column Extractor pulls one field out without a spreadsheet. Both pair naturally with the link inventory: extract once, then reshape for whatever consumes the list next.
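If you prefer to script those two reshaping steps, Python's standard library covers both. This is a sketch of the idea, not of how either tool works internally; query_params and csv_column are made-up names, and the CSV layout in the usage note (a "type" and a "url" column) is an assumption for illustration.

```python
import csv
import io
from urllib.parse import parse_qsl, urlsplit


def query_params(url):
    """Return the query string as an ordered list of (key, value) pairs."""
    return parse_qsl(urlsplit(url).query, keep_blank_values=True)


def csv_column(csv_text, column):
    """Return every value from one named column of a CSV with a header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row[column] for row in reader]
```

For example, query_params("https://partner.example.com/ref?id=42&utm_source=x") yields the pairs ("id", "42") and ("utm_source", "x"), and csv_column(text, "url") pulls just the URL field from an export whose header row includes a url column.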

The pattern that ties all of these together is the same one that makes the extractor worth using in the first place: do the messy parsing once, locally, against the real document structure — then work with a clean list instead of raw markup. An HTML page's link graph is small and knowable. You just need to see it laid flat.


Made by Toolora · Updated 2026-06-13