Part of the Agent Readiness course. Measure any page with the Core Agent Vitals analyzer.
What it is
An XML sitemap (/sitemap.xml) is a machine-readable list of every public URL on your site, each with an optional <lastmod> date. It's the standard way to tell crawlers "here is everything worth indexing, and here's when it last changed." The format is defined at sitemaps.org.
Why agents need it
Agents and crawlers discover pages two ways: by following links, and by reading your sitemap. Link-following alone is shallow — it finds what's reachable from your homepage in a few hops and misses the long tail: individual products, doc pages, pricing tiers, deep articles. Those deep pages are exactly what answer specific user questions.
A sitemap flattens your whole site into one list an agent can consume in a single fetch, and <lastmod> tells it what changed, so it re-fetches the right pages instead of re-crawling everything or nothing. Without a sitemap, your deep inventory is invisible unless an agent happens to click its way there.
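To make the <lastmod> point concrete, here is a minimal sketch of what an agent might do with a sitemap it has already downloaded: pick only the URLs that changed since its last visit. The function name urls_to_refetch and the last_crawl parameter are hypothetical, and real crawlers apply their own scheduling rules on top of this.

```python
from datetime import date
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_to_refetch(sitemap_xml: str, last_crawl: date) -> list[str]:
    """Return the <loc> values that look changed since last_crawl.

    Hypothetical helper: entries with a missing or partial <lastmod>
    are treated as changed, because there is no signal to skip them.
    """
    root = ET.fromstring(sitemap_xml)
    to_fetch = []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", default="", namespaces=NS).strip()
        lastmod = url.findtext("sm:lastmod", default="", namespaces=NS).strip()
        if not loc:
            continue
        try:
            # The first 10 characters of a full date or datetime are YYYY-MM-DD.
            # ">=" is deliberate: a same-day edit may have landed after the crawl.
            changed = date.fromisoformat(lastmod[:10]) >= last_crawl
        except ValueError:
            changed = True
        if changed:
            to_fetch.append(loc)
    return to_fetch
```

With one fetch and one pass over the list, the agent knows which deep pages to revisit, which is exactly what link-following cannot tell it.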
How to implement
Generate sitemap.xml at build time from your routes (every major framework and CMS has a plugin), and list real, canonical, public URLs:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://your-site.com/</loc>
    <lastmod>2026-07-01</lastmod>
  </url>
  <url>
    <loc>https://your-site.com/docs/quickstart</loc>
    <lastmod>2026-06-28</lastmod>
  </url>
</urlset>
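If your stack has no plugin, the build step is small enough to write by hand. The following is a minimal sketch using only the Python standard library. It assumes you can already produce a list of (URL, last-modified date) pairs from your routes; the function name build_sitemap and that input shape are hypothetical.

```python
from datetime import date
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages: list[tuple[str, date]]) -> bytes:
    """Serialize (canonical URL, real last-modified date) pairs as a sitemap."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, last_modified in pages:
        url = ET.SubElement(urlset, "url")
        # ElementTree escapes characters such as "&" in the URL for us.
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = last_modified.isoformat()
    return ET.tostring(urlset, encoding="UTF-8", xml_declaration=True)


if __name__ == "__main__":
    pages = [
        ("https://your-site.com/", date(2026, 7, 1)),
        ("https://your-site.com/docs/quickstart", date(2026, 6, 28)),
    ]
    with open("sitemap.xml", "wb") as fh:
        fh.write(build_sitemap(pages))
```

Building the document with an XML library rather than string concatenation keeps the output well-formed even when a URL contains characters that need escaping.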
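A single sitemap file may hold at most 50,000 URLs and 50 MB uncompressed. For larger sites, split the list into multiple sitemaps and reference them from a sitemap index file (for example, sitemap_index.xml). Then advertise the sitemap, or the index if you use one, in robots.txt: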
Sitemap: https://your-site.com/sitemap.xml
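For the large-site case, the split can be sketched as follows. This example only enforces the 50,000-URL limit, not the byte-size limit, and the child file names (sitemap-1.xml, sitemap-2.xml, ...) plus the helper names chunk and build_sitemap_index are assumptions for illustration.

```python
from datetime import date
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_FILE = 50_000  # per-file URL limit from the sitemap protocol


def chunk(urls: list, size: int = MAX_URLS_PER_FILE) -> list[list]:
    """Split a URL list into consecutive groups of at most `size` entries."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def build_sitemap_index(base_url: str, chunk_count: int, last_modified: date) -> bytes:
    """Build an index that points at sitemap-1.xml .. sitemap-N.xml (hypothetical names)."""
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for n in range(1, chunk_count + 1):
        entry = ET.SubElement(index, "sitemap")
        ET.SubElement(entry, "loc").text = f"{base_url}/sitemap-{n}.xml"
        ET.SubElement(entry, "lastmod").text = last_modified.isoformat()
    return ET.tostring(index, encoding="UTF-8", xml_declaration=True)
```

Each group returned by chunk would be written as its own sitemap file, and the index is the single URL you then list on the Sitemap: line.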
Validate
curl -s https://your-site.com/sitemap.xml | head -20
Confirm valid XML, real <loc> entries, and recent <lastmod> values. The Core Agent Vitals analyzer checks for the sitemap at /sitemap.xml and /sitemap_index.xml, validates it has URL entries, and flags a stale one.
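The same three checks can be scripted for CI. This is a rough sketch, not a reimplementation of the analyzer: the function name check_sitemap is hypothetical, and the 365-day staleness threshold is an arbitrary assumption you should tune to how often your content changes.

```python
from datetime import date, timedelta
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def check_sitemap(xml_text: str, today: date, max_age_days: int = 365) -> list[str]:
    """Return a list of human-readable problems; an empty list means it looks fine."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return [f"not well-formed XML: {exc}"]

    problems = []
    locs = [(el.text or "").strip() for el in root.iter(f"{{{SITEMAP_NS}}}loc")]
    if not locs:
        problems.append("no <loc> entries")
    for loc in locs:
        if not loc.startswith(("http://", "https://")):
            problems.append(f"<loc> is not an absolute URL: {loc!r}")

    dates = []
    for el in root.iter(f"{{{SITEMAP_NS}}}lastmod"):
        text = (el.text or "").strip()
        try:
            dates.append(date.fromisoformat(text[:10]))
        except ValueError:
            # Year-only or year-month values are skipped in this sketch.
            continue
    # Assumed rule: the sitemap is stale if even its newest entry is old.
    if dates and max(dates) < today - timedelta(days=max_age_days):
        problems.append(f"newest <lastmod> is {max(dates)}, older than {max_age_days} days")
    return problems
```

Passing today in as a parameter, rather than calling date.today() inside the function, keeps the check deterministic in tests.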
Common mistakes
- No sitemap at all. The default for many hand-built sites, and a silent cap on how much of you agents can find.
- Faked lastmod. Setting every page's lastmod to today (or build time) trains crawlers to ignore the signal. Emit the real content-change date.
- Listing non-canonical or redirecting URLs. Every <loc> should be a 200, canonical, indexable URL: not a redirect, not a noindex page.
- Forgetting the robots.txt reference. Without the Sitemap: line, agents have to guess the location.
- Letting it drift. A sitemap generated once and never regenerated slowly diverges from reality. Build it in your pipeline so it can't rot.
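The robots.txt mistake in particular is easy to catch automatically. As a small sketch, Python's standard urllib.robotparser (3.8 or later) can extract the Sitemap: lines from robots.txt content you have already fetched; the wrapper name sitemap_urls_from_robots is hypothetical.

```python
from urllib.robotparser import RobotFileParser


def sitemap_urls_from_robots(robots_txt: str) -> list[str]:
    """Return every URL declared on a Sitemap: line, or an empty list if none."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when robots.txt declares no sitemap.
    return parser.site_maps() or []
```

An empty result in your pipeline means agents are left guessing where your sitemap lives.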
Next: JSON-LD Structured Data — telling agents what a page is, not just what links to it.