Original research on how prepared the open web actually is for AI agents and answer engines. Citable statistics sampled continuously from the Spacemen Digital Website AI Agent Readiness Check.
Each finding below cites the percentage of audited sites with the signal present or missing. Use these as citable benchmarks in your reports, decks and pitches.
llms.txt is an emerging standard for declaring to AI agents which URLs on a site matter most. We audit for its presence at the root domain. Despite growing adoption in the SEO community, 94% of sites we scan still have no llms.txt file.
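A minimal sketch of what a root-level presence check could look like. This is not the auditor's actual code: the function names, the user-agent string and the "HTTP 200 means present" rule are all assumptions made for illustration.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.request import Request, urlopen


def llms_txt_url(page_url: str) -> str:
    """Map any page URL to the llms.txt location at the site root."""
    parts = urlsplit(page_url)
    # Keep scheme and host; drop the path, query string and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/llms.txt", "", ""))


def has_llms_txt(page_url: str, timeout: float = 5.0) -> bool:
    """Return True if the root llms.txt answers with HTTP 200 (assumed rule)."""
    request = Request(
        llms_txt_url(page_url),
        headers={"User-Agent": "readiness-sketch/0.1"},  # hypothetical UA
    )
    try:
        with urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except (OSError, ValueError):
        # Covers HTTP errors, DNS failures, timeouts and malformed URLs.
        return False
```

For example, llms_txt_url("https://example.com/blog/post?x=1") resolves to https://example.com/llms.txt, which is the URL has_llms_txt would then request.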
AI Instructions pages are a 2025 pattern: a dedicated page with canonical, citable information about the brand written explicitly for AI agents. Reports indicate ChatGPT cites these pages within 48 hours of publishing. Yet only 1% of audited sites have one.
We test robots.txt against GPTBot, ClaudeBot, PerplexityBot, Google-Extended, Applebot-Extended, Bytespider, Meta-ExternalAgent and CCBot. Over half of audited sites block at least one. Most often the block is unintentional, inherited from overly aggressive bot rules added to defend against scraping.
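One way to reproduce this kind of check locally is with Python's standard-library robots.txt parser. The sketch below is an illustration rather than the auditor's implementation: the function name is hypothetical, and testing only the "/" path is an assumption.

```python
from urllib.robotparser import RobotFileParser

# The eight crawler tokens named in the text above.
AI_CRAWLERS = [
    "GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended",
    "Applebot-Extended", "Bytespider", "Meta-ExternalAgent", "CCBot",
]


def blocked_ai_crawlers(robots_txt: str, path: str = "/") -> list[str]:
    """Return the AI crawlers that robots.txt disallows for `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [bot for bot in AI_CRAWLERS if not parser.can_fetch(bot, path)]


# A targeted block versus a blanket rule that catches every crawler.
targeted = "User-agent: GPTBot\nDisallow: /\n\nUser-agent: *\nAllow: /\n"
blanket = "User-agent: *\nDisallow: /\n"

print(blocked_ai_crawlers(targeted))
print(len(blocked_ai_crawlers(blanket)))
```

The blanket rule is the unintentional case described above: a wildcard Disallow written with scrapers in mind also shuts out every AI crawler on the list.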
Article schema with full author, datePublished and dateModified is one of the strongest AEO signals. Citation engines weight attribution heavily when deciding which sources to surface. Yet most sites either omit Article schema entirely or ship it without author attribution.
We test for seven schema types that drive AI citation: Organization, WebSite, SearchAction, Article, Article-with-author-and-date, FAQPage and BreadcrumbList. The median audited site has three of the seven. The top decile has six.
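The seven-signal count can be approximated by reading a page's JSON-LD blocks. The sketch below assumes the markup is JSON-LD in script tags (Microdata and RDFa are ignored) and matches type names exactly, so subtypes such as NewsArticle are not counted. All class and function names are hypothetical.

```python
import json
from html.parser import HTMLParser

TRACKED = {
    "Organization", "WebSite", "SearchAction", "Article",
    "Article-with-author-and-date", "FAQPage", "BreadcrumbList",
}


class JsonLdExtractor(HTMLParser):
    """Collect parsed JSON-LD documents from <script type="application/ld+json">."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld, self._buffer = True, []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.documents.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # Skip malformed blocks instead of failing the scan.


def iter_nodes(value):
    """Yield every dict nested anywhere inside a JSON-LD structure."""
    if isinstance(value, dict):
        yield value
        for child in value.values():
            yield from iter_nodes(child)
    elif isinstance(value, list):
        for item in value:
            yield from iter_nodes(item)


def schema_signals(html: str) -> set:
    """Return which of the seven tracked signals appear in the page."""
    extractor = JsonLdExtractor()
    extractor.feed(html)
    found = set()
    for node in iter_nodes(extractor.documents):
        raw = node.get("@type")
        types = [raw] if isinstance(raw, str) else raw if isinstance(raw, list) else []
        found.update(t for t in types if isinstance(t, str))
        attributed = all(node.get(k) for k in ("author", "datePublished", "dateModified"))
        if "Article" in types and attributed:
            found.add("Article-with-author-and-date")
    return found & TRACKED
```

Because iter_nodes walks nested objects, a SearchAction declared inside a WebSite's potentialAction is picked up as well; len(schema_signals(html)) then gives a score out of seven.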
We test for 23 frontier standards including MCP Server Cards, Agent Skills, WebMCP, API Catalog (RFC 9727), OAuth Discovery (RFC 8414), OAuth Protected Resource (RFC 9728), Web Bot Auth, x402, NLWeb, ai-plugin.json, OIDC discovery, DID configuration, DNS for AI Discovery (DNS-AID), Content Signals, security.txt, humans.txt and RSS/Atom auto-discovery. The median site supports zero. Top sites (Cloudflare, Stripe, GitHub) support six or more.
Server response time matters for AI agents the same way it matters for browsers. Aggressive AI crawlers time out before parsing slow pages, removing those sites from citation pools. 38% of audited sites exceed the 1.5-second threshold where AI agent timeouts begin.
XML sitemaps are foundational discoverability infrastructure. When a sitemap appears to be missing, it is usually not because the team never created one but because it lives at a non-standard URL and isn't declared in robots.txt. Our auditor checks 5+ common locations and parses robots.txt for the Sitemap directive, and 23% of sites still come up empty.
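The two-step lookup described above (declared Sitemap lines first, then well-known paths) might be sketched as follows. The list of fallback paths is an assumption; the locations the auditor actually probes are not published here, and the function names are hypothetical.

```python
from urllib.parse import urljoin

# Assumed fallback locations, not the auditor's actual list.
COMMON_SITEMAP_PATHS = [
    "/sitemap.xml",
    "/sitemap_index.xml",
    "/sitemap-index.xml",
    "/sitemap.xml.gz",
    "/wp-sitemap.xml",
]


def declared_sitemaps(robots_txt: str) -> list[str]:
    """Extract URLs from `Sitemap:` lines (the directive is case-insensitive)."""
    urls = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # Drop trailing comments.
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls


def sitemap_candidates(site_root: str, robots_txt: str) -> list[str]:
    """Declared sitemaps first, then common paths, without duplicates."""
    candidates = declared_sitemaps(robots_txt)
    candidates += [urljoin(site_root, path) for path in COMMON_SITEMAP_PATHS]
    return list(dict.fromkeys(candidates))  # Order-preserving dedupe.
```

A site comes up empty only when none of the candidate URLs returns a parseable sitemap; adding a single Sitemap line to robots.txt puts a non-standard location at the front of the list.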
Content Signals (Content-Signal, CF-Content-Signal, TDM-Reservation, noai/noimageai headers and meta tags) let sites declare AI training policy. 86% of audited sites declare nothing, leaving AI usage policy implicit. This becomes increasingly important as EU TDM regulation and AI training opt-outs mature.
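As a rough illustration, a header-level check for these declarations could look like the sketch below. It covers HTTP response headers only (meta tags are left out for brevity), and reading noai/noimageai from an X-Robots-Tag header is an assumption about where those tokens appear. The function name is hypothetical.

```python
# Header names taken from the text above, lower-cased for comparison.
SIGNAL_HEADERS = ("content-signal", "cf-content-signal", "tdm-reservation")
ROBOTS_TOKENS = ("noai", "noimageai")


def declared_ai_policy(headers: dict[str, str]) -> dict[str, str]:
    """Return the AI-policy signals present in a response's headers."""
    lowered = {name.lower(): value for name, value in headers.items()}
    found = {name: lowered[name] for name in SIGNAL_HEADERS if name in lowered}

    # Assumption: noai / noimageai arrive as comma-separated X-Robots-Tag tokens.
    tokens = {t.strip().lower() for t in lowered.get("x-robots-tag", "").split(",")}
    for token in ROBOTS_TOKENS:
        if token in tokens:
            found[token] = "x-robots-tag"
    return found
```

An empty result corresponds to the "declare nothing" case: no signal is present, so the site's AI usage policy stays implicit.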
All data is collected via the free Spacemen Digital Website AI Agent Readiness Check tool, which scans any URL for 50+ signals in six categories: AI Crawler Access, Discoverability, Structured Meaning, Rendering, Trust Signals and Frontier Agentic Standards.
Statistics in this report represent the aggregate state of all domains scanned through the public tool, deduped to one entry per domain (latest scan wins). Data is refreshed continuously. Only aggregate statistics are published; individual domain scans are not exposed.
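The "latest scan wins" rule can be sketched in a few lines. The Scan record and its fields are hypothetical stand-ins, since the tool's real data model is not published, and treating domains as case-insensitive is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Scan:
    """Hypothetical scan record; the real schema is not public."""
    domain: str
    scanned_at: datetime
    has_llms_txt: bool


def latest_per_domain(scans: list[Scan]) -> list[Scan]:
    """Keep one scan per domain, preferring the most recent."""
    latest: dict[str, Scan] = {}
    for scan in scans:
        key = scan.domain.lower()  # Assumption: domains compare case-insensitively.
        if key not in latest or scan.scanned_at > latest[key].scanned_at:
            latest[key] = scan
    return list(latest.values())


def percent_missing_llms_txt(scans: list[Scan]) -> float:
    """Share of deduplicated domains without llms.txt, as a percentage."""
    deduped = latest_per_domain(scans)
    if not deduped:
        return 0.0
    return 100 * sum(not s.has_llms_txt for s in deduped) / len(deduped)
```

Deduplicating before aggregating keeps a frequently rescanned domain from being counted more than once in any percentage.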
This data is released under Creative Commons Attribution 4.0. You may quote any statistic in your own content, deck or report. Please credit "Spacemen Digital AI Agent Readiness Index 2026" with a link back to spacemendigital.com/data/.
Run the free Website AI Agent Readiness Check on your URL. Takes about 10 seconds. 50+ signals tested.