How AI systems actually read your site — and what most developers are missing

A technical walkthrough of rendering, robots.txt, llms.txt, and schema markup — with real Laravel examples and a complete audit checklist.

This is the technical companion to Your product exists. Does the AI know? That piece covers the strategy; this one covers the implementation.

When I added llms.txt to gabana.dev, I realised the file itself was the easy part. The harder question was whether the AI crawler could actually reach my site at all — and whether what it read made any sense once it got there. Most guides skip that part.

This is a technical walkthrough of what's actually happening under the hood, where most sites break silently, and what to fix.

How do AI answer engines actually work?

Before optimising for something, you need a working mental model of how it behaves. AI answer engines aren't a single system — they're pipelines, and each stage has its own constraints.

The retrieval layer

Most production AI answer engines use Retrieval-Augmented Generation (RAG). When a user asks a question, the system doesn't answer purely from trained knowledge. It runs a search — against a web index, a vector database, or both — retrieves a set of candidate documents, and feeds those documents into the model as context before generating a response.
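
To make that shape concrete, here is a deliberately simplified sketch. The helpers searchIndex() and generateAnswer() are hypothetical stand-ins for whatever index and model a given engine uses; the point is only the two-stage flow:

// Hypothetical RAG pipeline: retrieval first, then generation from the retrieved context
function answerQuestion(string $question): string
{
    // Stage 1: retrieval, pulling candidate documents for this question
    $documents = searchIndex($question); // web index, vector database, or both

    // Stage 2: generation, where the model answers from the retrieved context
    $context = implode("\n\n", array_map(fn ($doc) => $doc['text'], $documents));

    return generateAnswer("Answer from the sources below and cite them.\n\n{$context}\n\nQuestion: {$question}");
}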

This matters because your content has two jobs: getting into the retrieval set, and being legible enough once there that the model cites you. This guide covers both — access first, then structure.

How crawling works

AI crawlers behave more like the Googlebot of ten years ago than the browser-based crawlers of today. They issue HTTP GET requests and read the response body. They do not execute JavaScript. They do not wait for dynamic content. The DOM your user sees in their browser and the HTML the crawler receives can be entirely different documents — and the crawler only ever sees one of them.

The crawlers you care about, and the user-agent tokens to allow in robots.txt:

OAI-SearchBot       — OpenAI / ChatGPT
PerplexityBot       — Perplexity
Google-Extended     — Google Gemini training
Applebot-Extended   — Apple AI features
anthropic-ai        — Anthropic / Claude
cohere-ai           — Cohere

If your robots.txt is blocking any of these — deliberately or by accident — those systems have no path to your content. Verify this first.

Is your site actually reachable by AI crawlers?

Two things silently block more sites than people realise.

robots.txt

A permissive configuration for AI crawlers looks like this:

User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: Applebot-Extended
Allow: /

User-agent: anthropic-ai
Allow: /

User-agent: cohere-ai
Allow: /

If you're using a wildcard disallow and selectively allowing bots, make sure these agents are explicitly permitted. The common mistake is blocking everything by default and forgetting to whitelist AI crawlers alongside Googlebot.
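
A hypothetical configuration that shows the trap: the wildcard disallow blocks everything, and only the agents listed explicitly get through.

User-agent: *
Disallow: /

User-agent: Googlebot
Allow: /

# AI crawlers need their own entries too, otherwise the wildcard blocks them
User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /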

The Cloudflare problem

Cloudflare updated its default bot management configuration to block AI crawlers. If your infrastructure sits behind Cloudflare and you haven't changed this, your content is likely unreachable to most AI systems right now — regardless of what your robots.txt says.

Fix: Cloudflare dashboard → Security → Bots → Bot Fight Mode. Either disable it for AI crawlers specifically, or create custom firewall rules that allow the user agents listed above while keeping protection active for everything else.
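
If you go the custom-rule route, a rule in Cloudflare's expression language along these lines, paired with a skip action for the bot-protection features, is one way to let these crawlers through. Confirm the exact dashboard labels against Cloudflare's current docs; OAI-SearchBot and PerplexityBot are shown because those strings appear in their User-Agent headers:

(http.user_agent contains "OAI-SearchBot") or
(http.user_agent contains "PerplexityBot")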

gabana.dev runs behind Cloudflare. This was the first thing I checked.

Why does JavaScript rendering make your site invisible to AI?

This is where most modern web applications fail silently.

If your stack renders on the client — React SPA, Vue, Angular with client-side routing — the HTML your server returns typically looks something like this:

<!DOCTYPE html>
<html>
  <head>
    <title>Your Product</title>
  </head>
  <body>
    <div id="root"></div>
    <script src="/bundle.js"></script>
  </body>
</html>

That's what the AI crawler sees. Not your product description, not your features, not your pricing. An empty div and a script tag it can't execute.

How to check what the crawler actually sees

Run this against your own domain:

curl -A "OAI-SearchBot" https://gabana.dev/ | grep -i "gabana"
# Replace with your domain and something you expect to find in the response

If you get nothing back, your content isn't in the initial HTML. The crawler sees the same thing.

Fixing it: server-side rendering for critical pages

The pages that describe what your product is and does need to return their content in the initial HTML response. JavaScript can hydrate on top — but the content must exist first.

In Next.js, the App Router handles this by default — server components render on the server:

// Server component — renders on the server, content in initial HTML
export default async function ProductPage() {
  const features = await getFeatures()

  return (
    <main>
      <h1>What the product does</h1>
      {features.list.map(feature => (
        <section key={feature.id}>
          <h2>{feature.name}</h2>
          <p>{feature.description}</p>
        </section>
      ))}
    </main>
  )
}

In Laravel (Blade templates), this is the default behaviour — Blade renders server-side. If you're on a standard Laravel stack, your content is already in the initial response. Verify it with the curl command above.

{{-- resources/views/product.blade.php --}}
{{-- Blade renders on the server — content is in the HTML response --}}
<main>
    <h1>{{ $product->name }}</h1>
    <p>{{ $product->description }}</p>
    @foreach($product->features as $feature)
        <section>
            <h2>{{ $feature->name }}</h2>
            <p>{{ $feature->description }}</p>
        </section>
    @endforeach
</main>

Content inside tabs, accordions, or modals that require a click to reveal is effectively hidden. If it matters enough to show a user, it should be in the DOM on load — visible or visually hidden with CSS, but present in the markup.
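
If you want to keep the tabbed presentation, one pattern is to render every panel in the initial markup and let CSS control visibility. The class names below are illustrative, not a prescribed convention:

{{-- Every panel is in the HTML response; JavaScript only toggles the class --}}
<div class="tab-panel is-active">
    <h3>Pricing</h3>
    <p>Monthly from Ksh 500 base + Ksh 300 per console.</p>
</div>
<div class="tab-panel">
    <h3>What's included</h3>
    <p>Session tracking, shift reconciliation, and WhatsApp reporting.</p>
</div>

<style>
    .tab-panel { display: none; }
    .tab-panel.is-active { display: block; }
</style>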

How do you implement llms.txt?

The llms.txt standard was proposed by Jeremy Howard in late 2024. It's a markdown file at yourdomain.com/llms.txt that gives AI systems a structured map of your site — which pages exist, what they contain, and which matter most.

The format

# Your Site or Product Name

> One sentence describing what this is and who it's for.

Optional additional context about the product or how the content
on this site is organised.

## Core

- [Homepage](https://yourdomain.com/): What the product does and who it's for
- [How it works](https://yourdomain.com/how-it-works): Overview of the system
- [Pricing](https://yourdomain.com/pricing): Plans and pricing

## Case studies

- [Client name](https://yourdomain.com/case-studies/client):
  What problem they had and what the outcome was

## Optional

- [Blog](https://yourdomain.com/blog): Writing on product and engineering

The H1 is your site name. The blockquote is your one-sentence description — make it precise. Each list item is a markdown link with a short description of what that page answers, not just its title.

Here's how PsTally's entry reads in my own llms.txt:

- [PsTally](https://pstally.com): Gaming lounge management software built
  for Kenyan lounges — session tracking, shift reconciliation, staff
  accountability, and WhatsApp reporting for owners who aren't on-site

"PsTally — Gaming Lounge Management" tells the model a category. The description above tells it the problem the product solves, the market it's for, and what the owner actually gets. That distinction determines whether the model cites it when someone asks the right question.

What the model does with it

When a system supporting llms.txt encounters your site, it fetches the file first and uses it to build context before crawling individual pages. The descriptions you write next to each URL are read directly — they shape how the model understands each page before it even fetches the content.

Implementing it in Laravel

Place the file at public/llms.txt in your Laravel project. Laravel serves everything in public/ automatically — no route needed.

your-laravel-app/
├── public/
│   ├── llms.txt        ← place it here
│   ├── robots.txt
│   └── index.php

Verify it's accessible:

curl https://yourdomain.com/llms.txt
# Should return plain markdown content

If you want to generate it dynamically — pulling product names or page titles from your database — you can add a route instead:

// routes/web.php
Route::get('/llms.txt', function () {
    $content = view('llms')->render();
    return response($content, 200, ['Content-Type' => 'text/plain']);
});
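
The llms view in that route isn't shown above. Here's a minimal sketch of what it could contain, assuming the route passes a $products collection (for example view('llms', ['products' => Product::all()])) and that each product exposes name, url, and a one-line summary:

{{-- resources/views/llms.blade.php --}}
# Gabana

> Full-stack product engineer building operations SaaS for East African markets.

## Products
@foreach ($products as $product)
- [{{ $product->name }}]({{ $product->url }}): {{ $product->summary }}
@endforeach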

The extended variant: llms-full.txt

For sites with substantial documentation or content, you can also provide llms-full.txt — a single file containing the complete text of your most important pages, pre-formatted for LLM consumption. This removes the need for the crawler to make multiple requests. Some documentation-heavy sites are moving toward this, particularly where deep technical content benefits from being read as a whole.
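
In Laravel, one straightforward way to assemble it is a route that renders a handful of dedicated views and concatenates them. The view names here are placeholders for whatever pages you decide to include:

// routes/web.php
Route::get('/llms-full.txt', function () {
    $sections = collect(['llms-full.homepage', 'llms-full.how-it-works', 'llms-full.pricing'])
        ->map(fn ($view) => view($view)->render())
        ->implode("\n\n---\n\n");

    return response($sections, 200, ['Content-Type' => 'text/plain']);
});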

Current support

As of mid-2025: OpenAI and Perplexity honour llms.txt. Google Gemini does not currently support it. Implementing it now costs an hour and signals to the systems that do support it that you've thought about how they navigate your site.

How does structured data help with AI discoverability?

Schema markup communicates structured facts to systems that parse content programmatically. For AI systems using RAG, well-structured schema reduces the ambiguity the model has to resolve when deciding what your content is about.

Organization schema — tell the model who you are

{
  "@context": "https://schema.org",
  "@type": "Organization",
  "name": "Gabana",
  "url": "https://gabana.dev",
  "description": "Full-stack product engineer building operations SaaS for East African markets",
  "sameAs": [
    "https://github.com/gabana-dev",
    "https://linkedin.com/in/gabana-k-ab57991b7/"
  ]
}

The sameAs array creates entity links across platforms. This helps AI systems resolve your brand as a single consistent entity rather than treating each platform as a separate, unrelated source.

SoftwareApplication schema — for product pages

{
  "@context": "https://schema.org",
  "@type": "SoftwareApplication",
  "name": "PsTally",
  "applicationCategory": "BusinessApplication",
  "description": "Gaming lounge management software for session tracking, shift reconciliation, and staff accountability. Built for Kenyan gaming lounges and eSports centers.",
  "operatingSystem": "Web",
  "url": "https://pstally.com",
  "offers": {
    "@type": "Offer",
    "priceCurrency": "KES",
    "price": "500",
    "description": "Monthly from Ksh 500 base + Ksh 300 per console"
  }
}

FAQPage schema — high citation value

{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "How does PsTally track gaming sessions?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "PsTally uses automated timers for every console with locked start and end timestamps. Sessions are recorded the moment they begin and closed by the system at end time — staff cannot modify the record after the fact."
      }
    }
  ]
}

FAQPage schema is particularly effective because AI systems are fundamentally matching user questions to content. Explicit question-answer pairs with schema make the retrieval step more reliable and the answer quality higher.

In Laravel, add schema to your Blade layouts via a @push stack or directly in the <head>:

{{-- In your layout or page view --}}
<script type="application/ld+json">
{!! json_encode($schemaData, JSON_UNESCAPED_SLASHES | JSON_PRETTY_PRINT) !!}
</script>
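
One way to wire that up is a stack: the layout declares where schema lands in the <head>, and each page pushes its own block. The array below mirrors the SoftwareApplication example; the exact fields are up to you:

{{-- resources/views/layouts/app.blade.php, inside <head> --}}
@stack('schema')

{{-- resources/views/product.blade.php --}}
@push('schema')
    <script type="application/ld+json">
    {!! json_encode([
        '@context' => 'https://schema.org',
        '@type' => 'SoftwareApplication',
        'name' => $product->name,
        'description' => $product->description,
        'url' => url()->current(),
    ], JSON_UNESCAPED_SLASHES | JSON_PRETTY_PRINT) !!}
    </script>
@endpush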

Verifying your schema

Google's Rich Results Test (search.google.com/test/rich-results) will confirm your markup is structurally valid — but it tests Google's parser, not ChatGPT or Perplexity. For those systems, the practical check is the curl approach: fetch your page as an AI crawler and confirm the <script type="application/ld+json"> block is present in the response body. If it's there, the crawler can read it. There's no equivalent validation tool for non-Google AI systems yet.
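
In practice that check is one line; the grep pattern just confirms the block is present:

curl -s -A "OAI-SearchBot" https://yourdomain.com/ | grep -i "application/ld+json"
# A match means the schema is in the HTML the crawler receives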

What content structure do AI systems actually extract?

Once a crawler reaches your page and reads it, the model needs to parse meaning from the structure. The signals it uses are different from what traditional SEO rewarded.

Use semantic HTML

Heading hierarchy matters more for AI systems than it ever did for traditional search. The model uses headings to build a structural map of the document — what the main topic is, what the subtopics are, how they relate. A flat structure with no hierarchy is genuinely harder to parse.

<!-- Harder for AI to parse -->
<div class="section">
  <div class="title">How shift reconciliation works</div>
  <div class="content">...</div>
</div>

<!-- Structurally legible -->
<section>
  <h2>How shift reconciliation works</h2>
  <p>...</p>
</section>

Write answer-first

Content structured as direct answers to questions performs significantly better in AI citation than content that buries the answer in explanatory text. Put the answer before the explanation.

# Don't — explanation first

Reconciliation in retail is a process that varies based on your payment
mix, staffing model, and end-of-day routine. To understand how Stoka
handles this, it helps to first understand how we define a shift...

[answer appears three paragraphs later]

# Do — answer first

Stoka compares expected cash against actual counted cash at the end
of every shift and flags the difference immediately.

Here's how that calculation works in practice...

Specificity signals credibility

A stat, a specific number, a named example — these signal credibility to AI systems in a way that general claims don't. "882 sessions tracked across 7 weeks with a close variance of Ksh 0–50 per shift" is more citable than "accurate session tracking." The model is looking for content that behaves like a primary source, and specificity is one of the strongest signals of that.

The complete audit — run this in order

Each step assumes the previous ones are working.

1. Crawler access

  • Check robots.txt for AI bot user agents — are they allowed?
  • Verify Cloudflare bot settings if applicable
  • Confirm key pages return HTTP 200 without authentication

2. Rendering

  • Fetch your homepage and key product pages with curl using an AI bot user agent
  • Check that important content — product descriptions, features, pricing — is in the response body
  • Optionally run key pages through Google's Rich Results Test; note that it executes JavaScript, so the curl check below is the one that reflects what an AI crawler receives

# Test with an AI crawler user agent
curl -A "OAI-SearchBot" https://yourdomain.com/ | grep -i "your product name"

# If this returns nothing, the content isn't in the initial HTML

3. llms.txt

  • Create the file at the root of your public directory
  • Verify it's accessible at yourdomain.com/llms.txt
  • Check that descriptions are specific — what each page answers, not just its title
  • Confirm the file is plain text (Content-Type: text/plain)
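
A quick check for both, against your own domain:

curl -sI https://yourdomain.com/llms.txt
# Look for a 200 status and a Content-Type of text/plain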

4. Schema markup

  • Validate existing schema at validator.schema.org
  • Add Organization schema to every page if not present
  • Add SoftwareApplication schema to product pages
  • Add FAQPage schema to any page with question-and-answer content
  • Verify the <script type="application/ld+json"> block appears in your curl response — Google's Rich Results Test validates structure but doesn't confirm AI crawler access

5. Content structure

  • Verify heading hierarchy on key pages — is there a clear H1 → H2 → H3 structure?
  • Identify any important content behind interactive elements (tabs, accordions, modals)
  • Find your best citation candidates — pages that answer a specific question directly and completely

The honest picture of where this is going

The llms.txt standard is still forming. Not every AI system supports it. Google — which still accounts for the majority of search traffic — doesn't currently honour it. The ROI is real but targeted: primarily ChatGPT and Perplexity, which together represent a meaningful and growing share of AI-driven discovery.

The rendering and schema work is different — that applies across all AI systems and overlaps heavily with existing SEO hygiene. Fixing client-side rendering on important pages makes you more discoverable to every system simultaneously. That's the work with the widest return.

The deeper shift is this: AI systems reward content that was written to be understood, not content that was written to rank. Ranking signals — keyword density, backlink volume — are proxies for quality that a system measures indirectly. AI systems are getting better at measuring quality directly. That means the proxies matter less and the actual clarity of your content matters more.

Say what you mean. Structure it clearly. Make it specific. Give the system something it can extract and cite with confidence.

That's not a new idea. It's what good technical writing has always required. The difference now is that there's a machine on the other end, making decisions about your product's discoverability every time someone asks it a question your product could answer.

Frequently asked questions

Which AI crawlers should I allow in robots.txt? The main ones: OAI-SearchBot (ChatGPT), PerplexityBot (Perplexity), Google-Extended (Gemini training), Applebot-Extended (Apple AI), anthropic-ai (Claude), cohere-ai (Cohere). Add explicit Allow: / rules for each, especially if you use a wildcard disallow.

How do I test whether an AI crawler can actually read my site? Run curl -A "OAI-SearchBot" https://yourdomain.com/ and check the response body for your product name and key content. If it's not there, you have a rendering problem. Also verify your Cloudflare bot settings aren't blocking the request before it reaches your server.

Do I need server-side rendering to be visible to AI systems? For static or Laravel/Blade sites: your content is already server-rendered by default. For React/Vue SPAs: yes, critical pages need SSR or static generation. Next.js App Router handles this automatically. The test is the curl command above — if your content is in the response, you're fine.

What's the difference between llms.txt and llms-full.txt? llms.txt is a structured index — a map of your site with links and short descriptions. llms-full.txt is the full text content of your most important pages in a single file, pre-formatted for LLM consumption. Use llms.txt as the baseline. Add llms-full.txt if you have deep documentation that benefits from being read as a whole.

How do I verify my llms.txt is working? Confirm it's accessible at yourdomain.com/llms.txt with a curl request. Check it returns Content-Type: text/plain. There's no official validator yet, but a useful test: paste the file into any LLM and ask it to describe your site. What it tells you is roughly what AI systems will extract from it.

You can see a live implementation at gabana.dev/llms.txt. The file is served as a static asset from the Laravel public directory — no route, no controller, no configuration required beyond creating the file.
