2026-07-02 · 3 min read · Rafael Lopes

Sitemaps for Agent Discovery

AI agents agent-readiness sitemap web-standards SEO

Part of the Agent Readiness course. Measure any page with the Core Agent Vitals analyzer.

What it is

An XML sitemap (/sitemap.xml) is a machine-readable list of every public URL on your site, each with an optional <lastmod> date. It's the standard way to tell crawlers "here is everything worth indexing, and here's when it last changed." The format is defined at sitemaps.org.

Why agents need it

Agents and crawlers discover pages two ways: by following links, and by reading your sitemap. Link-following alone is shallow — it finds what's reachable from your homepage in a few hops and misses the long tail: individual products, doc pages, pricing tiers, deep articles. Those deep pages are exactly what answer specific user questions.

A sitemap flattens your whole site into one list an agent can consume in a single fetch, and <lastmod> tells it what changed so it re-fetches the right pages instead of re-crawling everything or nothing. No sitemap = your deep inventory is invisible unless an agent happens to click its way there.
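To make the re-fetch logic concrete, here is a minimal sketch of how an agent might use <lastmod> to decide what to fetch again. The function name urls_changed_since and the SAMPLE data are illustrative assumptions, not part of any real crawler, and the sketch assumes every <lastmod> starts with a full YYYY-MM-DD date.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Namespace used by the sitemaps.org 0.9 schema.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_changed_since(sitemap_xml: str, last_crawl: date) -> list[str]:
    """Return <loc> values whose <lastmod> is later than last_crawl.

    Entries without <lastmod> are included, because nothing says they
    are unchanged. Hypothetical helper for illustration only.
    """
    root = ET.fromstring(sitemap_xml)
    changed = []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", default="", namespaces=NS).strip()
        lastmod = url.findtext("sm:lastmod", default="", namespaces=NS).strip()
        if not loc:
            continue
        # Assumption: lastmod begins with YYYY-MM-DD (date or full datetime).
        if not lastmod or date.fromisoformat(lastmod[:10]) > last_crawl:
            changed.append(loc)
    return changed


SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://your-site.com/</loc><lastmod>2026-07-01</lastmod></url>
  <url><loc>https://your-site.com/docs/quickstart</loc><lastmod>2026-06-28</lastmod></url>
</urlset>"""

# Only the homepage changed after the last crawl on 2026-06-30.
print(urls_changed_since(SAMPLE, date(2026, 6, 30)))
```

With one fetch and one date comparison per entry, the agent skips the quickstart page and re-fetches only the homepage.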

How to implement

Generate sitemap.xml at build time from your routes (most major frameworks and CMSs have a plugin or built-in support), and list only real, canonical, public URLs:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://your-site.com/</loc>
    <lastmod>2026-07-01</lastmod>
  </url>
  <url>
    <loc>https://your-site.com/docs/quickstart</loc>
    <lastmod>2026-06-28</lastmod>
  </url>
</urlset>
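If your framework has no plugin, a build step can emit the same file from a route list. The following is a rough sketch using only the Python standard library. The build_sitemap name and the PAGES list are assumptions for illustration; in a real build the dates would come from your content source (for example, the last commit that touched each page).

```python
import xml.etree.ElementTree as ET
from datetime import date

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages: list[tuple[str, date]]) -> bytes:
    """Serialize (canonical URL, real content-change date) pairs as a sitemap."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, changed in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        # date.isoformat() yields YYYY-MM-DD, a valid <lastmod> value.
        ET.SubElement(url, "lastmod").text = changed.isoformat()
    return ET.tostring(urlset, encoding="UTF-8", xml_declaration=True)


# Hypothetical route list; replace with your own routes and dates.
PAGES = [
    ("https://your-site.com/", date(2026, 7, 1)),
    ("https://your-site.com/docs/quickstart", date(2026, 6, 28)),
]

if __name__ == "__main__":
    with open("sitemap.xml", "wb") as fh:
        fh.write(build_sitemap(PAGES))
```

Using an XML library rather than string concatenation means characters such as & in URLs are escaped for you.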

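A single sitemap file can hold at most 50,000 URLs and 50 MB uncompressed. If your site exceeds either limit, split it into multiple sitemaps and reference them from a sitemap index file such as sitemap_index.xml. Then advertise the sitemap (or the index) in robots.txt: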

Sitemap: https://your-site.com/sitemap.xml
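For the large-site case, the split can be sketched as below. This handles only the 50,000-URL limit, not the 50 MB size limit, and the file names in the usage lines (sitemap-1.xml and so on) are hypothetical. The <sitemapindex> and <sitemap> element names come from the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_FILE = 50_000  # per-file URL limit from the sitemaps.org protocol


def chunk_urls(urls: list[str], size: int = MAX_URLS_PER_FILE) -> list[list[str]]:
    """Split a flat URL list into groups small enough for one sitemap file each."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def build_index(sitemap_urls: list[str]) -> bytes:
    """Build a sitemap index that points at each child sitemap file."""
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for loc in sitemap_urls:
        entry = ET.SubElement(index, "sitemap")
        ET.SubElement(entry, "loc").text = loc
    return ET.tostring(index, encoding="UTF-8", xml_declaration=True)


# Usage with made-up data: 120,000 product URLs need three child sitemaps.
all_urls = [f"https://your-site.com/products/{n}" for n in range(120_000)]
chunks = chunk_urls(all_urls)
child_files = [f"https://your-site.com/sitemap-{i}.xml" for i in range(1, len(chunks) + 1)]
index_xml = build_index(child_files)
```

In that setup the Sitemap: line in robots.txt would point at the index file rather than at an individual child sitemap.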

Validate

curl -s https://your-site.com/sitemap.xml | head -20

Confirm valid XML, real <loc> entries, and recent <lastmod> values. The Core Agent Vitals analyzer checks for the sitemap at /sitemap.xml and /sitemap_index.xml, validates it has URL entries, and flags a stale one.
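The same three checks can be scripted for CI. This is a sketch of the idea, not the analyzer's actual logic: check_sitemap is a hypothetical helper, the 90-day staleness threshold is an arbitrary assumption, and the code assumes each <lastmod> starts with YYYY-MM-DD.

```python
import xml.etree.ElementTree as ET
from datetime import date, timedelta

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def check_sitemap(xml_text: str, today: date, max_age_days: int = 90) -> list[str]:
    """Return a list of human-readable problems; an empty list means it passed."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return [f"not well-formed XML: {exc}"]

    problems = []
    urls = root.findall("sm:url", NS)
    if not urls:
        problems.append("no <url> entries found")

    newest = None
    for url in urls:
        loc = url.findtext("sm:loc", default="", namespaces=NS).strip()
        if not loc.startswith(("http://", "https://")):
            problems.append(f"missing or relative <loc>: {loc!r}")
        lastmod = url.findtext("sm:lastmod", default="", namespaces=NS).strip()
        if lastmod:
            changed = date.fromisoformat(lastmod[:10])
            newest = changed if newest is None else max(newest, changed)

    # Assumed rule: the sitemap is stale if nothing changed within max_age_days.
    if newest is not None and today - newest > timedelta(days=max_age_days):
        problems.append(f"newest <lastmod> is {newest}, older than {max_age_days} days")
    return problems
```

In a pipeline you would feed it the body returned by the curl command above and fail the build when the list is non-empty.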

Common mistakes

No sitemap at all. The default for many hand-built sites, and a silent cap on how much of your site agents can find.
Faked lastmod. Setting every page's lastmod to today (or build time) trains crawlers to ignore the signal. Emit the real content-change date.
Listing non-canonical or redirecting URLs. Every <loc> should be a 200, canonical, indexable URL — not a redirect, not a noindex page.
Forgetting the robots.txt reference. Without the Sitemap: line, agents have to guess the location.
Letting it drift. A sitemap generated once and never regenerated slowly diverges from reality. Build it in your pipeline so it can't rot.
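Two of these mistakes are easy to catch mechanically. The sketch below uses Python's standard urllib.robotparser (its site_maps() method needs Python 3.8 or later) to look for the Sitemap: line, plus a crude heuristic for faked lastmod values. Both function names are hypothetical, and identical dates across a site are only a hint that something is off, not proof.

```python
from urllib.robotparser import RobotFileParser


def sitemaps_in_robots(robots_txt: str) -> list[str]:
    """Return every URL declared on a Sitemap: line in robots.txt content."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap: line is present.
    return parser.site_maps() or []


def lastmod_looks_faked(lastmods: list[str]) -> bool:
    """Heuristic: more than one page, and every page shares one lastmod value."""
    return len(lastmods) > 1 and len(set(lastmods)) == 1


robots = "User-agent: *\nAllow: /\nSitemap: https://your-site.com/sitemap.xml\n"
print(sitemaps_in_robots(robots))                 # the declared sitemap URL
print(sitemaps_in_robots("User-agent: *\n"))      # empty: agents must guess
print(lastmod_looks_faked(["2026-07-02"] * 40))   # every page "changed" today
```

Run as part of the same pipeline that generates the sitemap, checks like these help keep the file from drifting unnoticed.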

Next: JSON-LD Structured Data — telling agents what a page is, not just what links to it.

Built, then written

Tested on my own homelab before publishing: a four-architecture cluster (ARM · AMD ROCm · NVIDIA CUDA · Apple Silicon) running this blog, the RAG pipeline, and a sovereign research copilot. Refined as I learn.

Rafael Lopes

Production AI Engineer in Vancouver, BC. Brazilian. Builds and ships production AI on a self-hosted homelab — RAG pipelines, distributed LLM inference, web performance, and platform engineering.
