How to Audit an XML Sitemap Before You Submit It

A practical walkthrough of auditing an XML sitemap: reading urlset and loc, checking lastmod, and catching redirects, 404s, and non-canonical URLs.

Published By Li Lei
#seo #sitemap #xml #crawling #technical-seo

A sitemap is a small file that does one big job: it tells search engines which URLs on your site you actually want crawled and indexed. When it is clean, Google spends its crawl budget on pages that matter. When it is messy, the crawler wastes requests on redirects, dead pages, and URLs you told it to ignore elsewhere. Most teams generate a sitemap with a plugin or a build step and submit it without reading it once. That is usually where the trouble starts.

This post walks through what an XML sitemap contains, the issues that quietly creep in, and how to audit one before it reaches Search Console.

What an XML sitemap actually contains

Open any sitemap and you will see a predictable shape. The root element is <urlset>, and every page lives inside its own <url> block:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/en/guides/sitemap-basics/</loc>
    <lastmod>2026-05-30</lastmod>
  </url>
</urlset>

The <loc> tag holds the URL itself. The optional <lastmod> tag records when the page last changed. Two more optional tags, <priority> and <changefreq>, exist but carry little weight with modern search engines, so do not lose sleep over them.
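If you would rather read the file with a script than by eye, here is a minimal sketch using Python's standard library. The function name parse_sitemap is my own, and it assumes a plain <urlset> file (not a sitemap index) that uses the standard sitemaps.org namespace.

```python
import xml.etree.ElementTree as ET

# Namespace declared on <urlset> by the sitemap protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_text):
    """Return a list of (loc, lastmod) tuples; lastmod is None when absent.

    Hypothetical helper: assumes a <urlset> document, not a <sitemapindex>.
    """
    root = ET.fromstring(xml_text.strip())
    entries = []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", default="", namespaces=NS).strip()
        lastmod = url.findtext("sm:lastmod", default=None, namespaces=NS)
        entries.append((loc, lastmod.strip() if lastmod else None))
    return entries
```

Fed the sample above, this returns a single tuple: the guide URL paired with "2026-05-30".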

The core rule is simpler than most people expect: a sitemap should list each canonical URL once in a <loc> tag, with an optional <lastmod>, and every URL in it should be indexable and return a 200 status. Anything else is noise that contradicts the signal you are trying to send. One more hard limit worth memorizing: a single sitemap file is capped at 50,000 URLs and 50 MB uncompressed. Cross either ceiling and you need a sitemap index, which is a <sitemapindex> file that points to several child sitemaps.
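The two ceilings are easy to check mechanically. The sketch below is one way to do it, not a reference implementation: check_limits and chunk_urls are hypothetical names, and it reads "50 MB" as 52,428,800 bytes, the figure the sitemaps.org protocol gives.

```python
import xml.etree.ElementTree as ET

MAX_URLS = 50_000
MAX_BYTES = 52_428_800  # 50 MB uncompressed, as stated by sitemaps.org
NS_PREFIX = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def check_limits(xml_bytes):
    """Report URL count and byte size for one uncompressed sitemap file."""
    root = ET.fromstring(xml_bytes)
    url_count = len(root.findall(f"{NS_PREFIX}url"))
    size = len(xml_bytes)
    return {
        "url_count": url_count,
        "size_bytes": size,
        # Either ceiling alone is enough to require a sitemap index.
        "needs_index": url_count > MAX_URLS or size > MAX_BYTES,
    }


def chunk_urls(urls, size=MAX_URLS):
    """Split a URL list into groups small enough for one child sitemap each."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]
```

Splitting by count alone does not guarantee each child file stays under the byte limit, so it is worth running check_limits on every generated child as well.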

The issues that quietly break a sitemap

A sitemap rarely fails loudly. It just slowly loses the trust of the crawler. These are the problems I look for every time:

  • Non-canonical or redirecting URLs. If a <loc> points to a URL that 301-redirects somewhere else, you are telling the crawler "index this" while your redirect says "no, go there instead." The two signals fight, and the sitemap loses.
  • 404s and dead pages. Removed a product or a blog post but never regenerated the sitemap? Dead URLs linger for weeks and burn crawl requests on nothing.
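  • Pages blocked by robots.txt or noindex. Listing a URL in the sitemap while blocking it in robots.txt is a direct contradiction: the crawler cannot fetch it, yet you asked it to. A noindex page is the same contradiction in a different place, since the sitemap requests indexing and the page itself refuses it.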
  • Trailing-slash and HTTP/HTTPS conflicts. https://example.com/page and https://example.com/page/ are different URLs to a crawler. Mixing them, or leaving stray http:// entries, splits signals across duplicates.
  • Stale or missing lastmod. A <lastmod> that has not moved in years tells the crawler nothing useful. A missing one removes a freshness hint entirely.
  • Busting the size cap. Sites that auto-generate sitemaps from a CMS can quietly sail past 50,000 URLs after a content import, and the file silently stops being valid.
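Several of these problems can be spotted from the file alone, without fetching anything. Here is a rough sketch of those static checks. The function name audit_entries, the (loc, lastmod) tuple shape, and the 365-day staleness threshold are all assumptions of mine; it also assumes lastmod values start with a full YYYY-MM-DD date. Redirects, 404s, and robots.txt blocks need live requests and are out of scope here.

```python
from collections import Counter
from datetime import date, timedelta
from urllib.parse import urlsplit


def audit_entries(entries, today, stale_after_days=365):
    """Static checks on (loc, lastmod) tuples; lastmod is a string or None.

    Assumes lastmod begins with YYYY-MM-DD. The staleness threshold is an
    arbitrary default, not a rule from any search engine.
    """
    locs = [loc for loc, _ in entries]
    counts = Counter(locs)
    unique = set(locs)

    http_urls = [loc for loc in locs if urlsplit(loc).scheme == "http"]
    duplicates = sorted(loc for loc, n in counts.items() if n > 1)
    # Conflict: the same URL is listed both with and without a trailing "/".
    slash_conflicts = sorted(
        loc for loc in unique if not loc.endswith("/") and loc + "/" in unique
    )

    cutoff = today - timedelta(days=stale_after_days)
    missing_lastmod, stale_lastmod = [], []
    for loc, lastmod in entries:
        if not lastmod:
            missing_lastmod.append(loc)
        elif date.fromisoformat(lastmod[:10]) < cutoff:
            stale_lastmod.append(loc)

    return {
        "total": len(locs),
        "http_urls": http_urls,
        "duplicates": duplicates,
        "slash_conflicts": slash_conflicts,
        "missing_lastmod": missing_lastmod,
        "stale_lastmod": stale_lastmod,
    }
```

Passing today in explicitly, rather than calling date.today() inside the function, keeps the result reproducible when you rerun the audit on a saved sitemap.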

A worked example: the redirect you would never notice

Here is the kind of thing that hides in plain sight. Imagine your sitemap contains this entry:

<url>
  <loc>http://example.com/blog/old-pricing</loc>
  <lastmod>2024-01-12</lastmod>
</url>

Three separate problems sit in four lines. The protocol is http://, not https://, so the URL will redirect to the secure version. The path /blog/old-pricing was renamed to /pricing/ last year, so it redirects a second time. And the lastmod is from early 2024, which reads as stale on any site that updates its pages more than once a year.

Paste a full sitemap into the Sitemap URL Auditor and it surfaces exactly these signals at a glance: the count of HTTP URLs, duplicate <loc> values, trailing-slash conflicts, stale lastmod dates, and the total URL count against the 50,000 ceiling. It parses the XML locally and never fetches a single page, so it is fast and nothing leaves your machine. You still confirm live status with a crawler afterward, but the audit tells you where to point it.

There is a second, subtler version of this: a URL marked noindex in its HTML head that still appears in the sitemap. The auditor cannot read the page's meta tags because it does not fetch URLs, so this is the gap you close manually. A page you have asked search engines not to index has no business being in a file whose entire purpose is to request indexing. Pull those URLs out by hand once the tool has narrowed the list.
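Once you have a page's HTML from your crawler, the noindex check itself is small. The sketch below uses Python's built-in html.parser; has_noindex is a hypothetical helper, and it only looks at meta robots and googlebot tags. A noindex sent through the X-Robots-Tag HTTP header would need a separate check on the response headers.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Record whether any robots meta tag carries a noindex directive."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() not in ("robots", "googlebot"):
            return
        directives = [d.strip().lower() for d in attr.get("content", "").split(",")]
        # "none" is shorthand for "noindex, nofollow".
        if "noindex" in directives or "none" in directives:
            self.noindex = True


def has_noindex(html_text):
    """Return True if the HTML contains a robots meta tag with noindex."""
    parser = RobotsMetaParser()
    parser.feed(html_text)
    return parser.noindex
```

Any sitemap URL for which has_noindex returns True belongs on the removal list.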

How I run a sitemap audit

When I inherit a site, the first thing I do is grab its sitemap and read it before I touch anything else. It is the fastest map of what the previous team thought was important. I copy the XML into the auditor, scan the metrics, and three numbers tell me most of the story: the HTTP-URL count (should be zero), the duplicate count (should be zero), and the total against 50,000. A sitemap with 200 duplicates and 40 http:// entries is a sitemap nobody has looked at in a long time, and that usually means there is deeper neglect underneath it.

From there I export the URL inventory as CSV and sort by path. Patterns jump out fast: a stray /draft/ directory, a ?utm_source= query string that should never be canonical, a batch of .pdf links someone added by accident. That CSV is also handy for diffing two sitemaps during a migration, so you can see exactly which URLs were dropped or added.
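If you want to build the same inventory and diff yourself, a sketch like this covers both. The names inventory_rows, to_csv, and diff_sitemaps, and the three-column layout, are my own choices for illustration rather than the format any particular tool exports.

```python
import csv
import io
from urllib.parse import urlsplit


def inventory_rows(locs):
    """Build one row per URL, sorted by path so related URLs sit together."""
    rows = []
    for loc in locs:
        parts = urlsplit(loc)
        rows.append({"path": parts.path, "has_query": bool(parts.query), "url": loc})
    return sorted(rows, key=lambda row: (row["path"], row["url"]))


def to_csv(rows):
    """Serialize inventory rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["path", "has_query", "url"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def diff_sitemaps(old_locs, new_locs):
    """Compare two URL lists, e.g. before and after a migration."""
    old, new = set(old_locs), set(new_locs)
    return {"dropped": sorted(old - new), "added": sorted(new - old)}
```

The has_query column makes stray tracking parameters such as ?utm_source= easy to filter, and the diff compares URLs as exact strings, so a trailing-slash change counts as one URL dropped and one added.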

Why a clean sitemap helps crawling

Crawl budget is real, especially for large sites. Every URL a search engine fetches that turns out to be a redirect, a 404, or a blocked page is a request it did not spend on a page you care about. A tight sitemap, containing only live canonical URLs that return 200, makes every crawl request count and helps fresh content get discovered faster.

A clean sitemap also makes your other diagnostics honest. Search Console's Page indexing report (formerly the Coverage report) becomes meaningful when the sitemap reflects reality, because the count of submitted URLs that are indexed actually means something. Once your sitemap is in shape, the next step is usually checking the links inside your pages. The HTML Link Extractor pairs well for that: it pulls every link out of a page so you can confirm your internal linking matches what the sitemap promises.

Audit the sitemap, fix the contradictions, then crawl to confirm. Do that once before submission and you save yourself the slow, invisible cost of a search engine learning to distrust your map.


Made by Toolora · Updated 2026-06-13