Research · 2026-05

How AI agents actually search — and what merchants can influence at each step

Modern agent search has five stages: query reformulation, retrieval, reranking, synthesis, citation. Research from 2023-2026 shows where each one breaks predictably — and where merchants can move the needle.

"Optimize for AI agents" is a one-line slogan that hides a five-stage pipeline. Each stage runs on different signals, fails in different ways, and rewards different merchant work. If you skip the pipeline view, you end up making changes that lift one stage and accidentally break another.

This is a tour of what published research shows for each stage — and the lever a merchant has at that exact moment.

Stage 1 — Query reformulation

Before any retrieval happens, the agent rewrites the user's prompt. A question like "what's a comfortable everyday sneaker?" becomes 3-5 search-style queries: "best comfortable men's sneakers," "everyday walking shoes lightweight," "most comfortable casual sneakers 2026." These reformulations are model-specific and opaque to merchants.

Research: Jagerman et al. (Google, 2023) showed that LLM-based query expansion produces high variance across models, and that recall depends heavily on the expansion. Two models given the same user prompt route to non-overlapping document sets a non-trivial fraction of the time.

What a merchant can influence:

Coverage breadth. Rather than optimize for one phrasing, surface multiple semantic angles in your description: "sneaker" AND "walking shoe," "lightweight" AND "230g per shoe," "sustainable" AND "merino wool."
Test multiple reformulations. Our Query Coverage Map runs five different category queries through three frontier models on every scan — the goal is not to win one query but to be present across the realistic spread.
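
A rough way to sanity-check coverage breadth before running a full scan is to measure how many tokens from each plausible reformulation appear in your description. The sketch below is a purely lexical proxy, not how the Query Coverage Map works: the function names, the sample description, and the sample queries are all illustrative assumptions, and there is no stemming, so "sneakers" will not match "sneaker".

```python
import re

def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def query_coverage(description: str, queries: list[str]) -> dict[str, float]:
    """For each query, return the fraction of its tokens found in the description."""
    doc_tokens = tokenize(description)
    coverage = {}
    for query in queries:
        q_tokens = tokenize(query)
        # Guard against empty queries to avoid dividing by zero.
        coverage[query] = len(q_tokens & doc_tokens) / len(q_tokens) if q_tokens else 0.0
    return coverage

# Hypothetical description and reformulations, for illustration only.
description = (
    "A lightweight everyday sneaker and walking shoe in merino wool, "
    "230g per shoe, comfortable for casual wear."
)
queries = [
    "best comfortable men's sneakers",
    "everyday walking shoes lightweight",
    "most comfortable casual sneakers 2026",
]
for query, score in query_coverage(description, queries).items():
    print(f"{score:.2f}  {query}")
```

A low score on one reformulation is a prompt to add that angle to the copy, not proof that an agent will miss the page.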

Stage 2 — Retrieval

The reformulated queries hit a hybrid retrieval system: a dense embedding index (built from training data) plus a sparse keyword index (BM25-ish) plus, increasingly, a live web-search call. ChatGPT routes web search through Bing; Claude through Brave; Perplexity through its own index; Gemini through Google.

Research: The hybrid-retrieval literature (BEIR benchmark, MS MARCO, and the MTEB embedding benchmark) demonstrates that pure dense or pure sparse retrieval both underperform a hybrid stack with reranking on top. Modern commercial agents converge on this hybrid shape.
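
To make the hybrid shape concrete, here is a minimal sketch of reciprocal rank fusion (RRF), one widely used way to merge a dense ranking and a sparse ranking into a single candidate list. It is an illustration of the general idea only; nothing here claims that any particular commercial agent fuses results this way, and the document IDs are made up.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document IDs into one fused ranking.

    Each document scores 1 / (k + rank) for every list it appears in;
    k = 60 is the conventional default for this method.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# Hypothetical results from a dense (embedding) and a sparse (keyword) retriever.
dense_results = ["a", "b", "c"]
sparse_results = ["b", "d", "a"]
print(reciprocal_rank_fusion([dense_results, sparse_results]))
```

The practical point for a merchant: a page that shows up in both lists, even at middling ranks, tends to outrank a page that tops only one of them.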

What a merchant can influence:

Traditional SEO still matters here. Your Bing ranking largely determines whether ChatGPT's web search retrieves you; your Google ranking does the same for Gemini. The retrieval stage of agent search is bolted on top of search engines that SEO already optimizes for.
Training-corpus presence. Allowing GPTBot and ClaudeBot in robots.txt lets those crawlers collect your page, which is a precondition for it appearing in any index built from training data. The downstream effect is long-term: it determines whether your page is even a retrieval candidate when the user asks a category question cold.
Title-query semantic match. Your <title> and h1 are the first thing the sparse index keys off. Pack the category tokens a real shopper would use.
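
The robots.txt lever is easy to verify locally. The sketch below uses Python's standard urllib.robotparser to check whether GPTBot and ClaudeBot may fetch a given URL; the robots.txt content and the example.com URL are invented for illustration, and the helper name crawler_access is an assumption rather than part of any audit tool.

```python
from urllib.robotparser import RobotFileParser

def crawler_access(robots_txt: str, url: str, agents: list[str]) -> dict[str, bool]:
    """Return, for each user agent, whether robots.txt permits fetching the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}

# Hypothetical robots.txt that allows GPTBot but blocks ClaudeBot.
robots_txt = """\
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Disallow: /

User-agent: *
Allow: /
"""

result = crawler_access(
    robots_txt,
    "https://example.com/products/wool-runner",
    ["GPTBot", "ClaudeBot"],
)
print(result)
```

In this made-up file ClaudeBot is blocked, which is the kind of accidental exclusion worth catching before it quietly removes a page from consideration.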

Stage 3 — Reranking

The retrieval stack pulls back 20-50 candidates. A second model — typically a smaller LLM-based reranker (Cohere Rerank, Voyage rerank, OpenAI's own) — re-scores those against the original user prompt and selects 5-10 to put in front of the answer model. This stage is invisible to merchants but high-leverage.

Research: The 2024 RAG reranking surveys show that reranker scores weight title-query semantic similarity, content density relative to query, source domain authority, and recency. Long documents are typically chunked, with the reranker scoring chunks individually — meaning whole-page density can mislead; chunk-level density is what counts.

What a merchant can influence:

Per-chunk density. Don't bury all your statistics in one paragraph. A reranker scoring three chunks, two dense and one filler, gives the page a lower average than it gives three uniformly dense chunks.
Visible last-updated date. Recency is a documented reranker factor. We added a check for this: a <time> tag, a JSON-LD dateModified, or a visible "Updated YYYY-MM-DD" line is enough.
Information-rich titles. "Men's Wool Runner" is fine for humans; "Men's Wool Runner — 230g, machine washable, ZQ-certified" is much stronger for a reranker comparing it to other walking-shoe pages.
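
Per-chunk density can be approximated without a reranker. The sketch below splits a description into fixed-size word chunks and reports the share of words in each chunk that contain a digit. Everything about it is an assumption made for illustration: real rerankers chunk and score differently, and the chunk size, the digit-based notion of "density," and the min_density threshold are arbitrary choices.

```python
import re

NUMERIC = re.compile(r"\d")

def chunk_words(text: str, chunk_size: int = 40) -> list[list[str]]:
    """Split text into consecutive chunks of at most chunk_size words."""
    words = text.split()
    return [words[i:i + chunk_size] for i in range(0, len(words), chunk_size)]

def chunk_density(text: str, chunk_size: int = 40) -> list[float]:
    """Return, per chunk, the fraction of words that contain at least one digit."""
    densities = []
    for chunk in chunk_words(text, chunk_size):
        hits = sum(1 for word in chunk if NUMERIC.search(word))
        densities.append(hits / len(chunk))
    return densities

def filler_chunks(text: str, chunk_size: int = 40, min_density: float = 0.05) -> list[int]:
    """Return the indexes of chunks whose numeric density falls below min_density."""
    return [i for i, d in enumerate(chunk_density(text, chunk_size)) if d < min_density]

# Tiny hypothetical example with 4-word chunks so the behaviour is easy to follow.
sample = "weighs 230 g only soft cozy warm nice rated 4.6 by 1247"
print(chunk_density(sample, chunk_size=4))
print(filler_chunks(sample, chunk_size=4))
```

Chunks flagged as filler are candidates for a concrete spec, a rating, or a dated claim.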

Stage 4 — Answer synthesis

The 5-10 reranked chunks get packed into the answer model's context window. The model then composes a single answer drawing from the retrieved material. This is where position bias becomes brutal.

Research: "Lost in the Middle" (Liu et al., Stanford, 2023) is the foundational result. LLMs preferentially attend to the beginning and end of long contexts and effectively ignore documents placed in the middle. The same fact placed in document position 1 vs. a middle position (say, position 3 out of 5) can show a 50%+ difference in extraction rate. Models do not know they have this bias.

What a merchant can influence:

Distribute density across the page. Move 2+ key statistics into the first paragraph and 2+ into the last. Burying everything in the middle is the single most overlooked AEO anti-pattern. We added a positional-density check that flags pages with all numeric tokens clustered mid-description.
Use structured data as a position-bias hedge. JSON-LD can be extracted independently of where a fact sits in the body, so the structured fields reach the model as first-class facts. If your aggregateRating is in JSON-LD, it is reliably available; if it appears only in a mid-page testimonial, it may be skipped.
Repeat critical facts in multiple positions. Brand name, key spec, key claim — saying each once in the opening and once in the closing isn't keyword stuffing, it's position-bias-aware redundancy.
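
A positional-density check along these lines can be sketched in a few lines. This is an assumed, simplified version and not the audit's actual implementation: it splits the description into thirds by word count, treats any word containing a digit as a numeric token, and flags text whose numeric tokens all sit in the middle third.

```python
import re

NUMERIC = re.compile(r"\d")

def positional_density(text: str) -> dict[str, int]:
    """Count words containing a digit in the first, middle, and last third of the text."""
    words = text.split()
    n = len(words)
    counts = {"start": 0, "middle": 0, "end": 0}
    for i, word in enumerate(words):
        if not NUMERIC.search(word):
            continue
        if i < n / 3:
            counts["start"] += 1
        elif i < 2 * n / 3:
            counts["middle"] += 1
        else:
            counts["end"] += 1
    return counts

def clustered_mid_description(text: str) -> bool:
    """True when every numeric token falls in the middle third of the text."""
    counts = positional_density(text)
    return counts["middle"] > 0 and counts["start"] == 0 and counts["end"] == 0

# Hypothetical nine-word descriptions: one buries its numbers, one spreads them out.
buried = "soft cozy warm 230g 4.6 1247 nice great shoe"
spread = "230g soft cozy warm nice great shoe rated 4.6"
print(positional_density(buried), clustered_mid_description(buried))
print(positional_density(spread), clustered_mid_description(spread))
```

Splitting into thirds is a deliberately coarse choice; a finer split would be needed for long descriptions.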

Stage 5 — Citation generation

The synthesis model decides which retrieved sources to cite. Citation policies vary: Perplexity cites almost always; ChatGPT cites when the answer relies heavily on web-fetched material; Claude cites when web-search is the source.

Research: Liu et al. (2023) "Evaluating Verifiability in Generative Search Engines" — and Mallen et al. follow-up work — found that LLMs strongly prefer to cite sources from which they can extract a clean verbatim snippet. Pages with clear "number + verb-of-evaluation + named authority" sentences ("rated 4.6/5 by 1,247 Wirecutter readers") are cited at materially higher rates than equivalent paraphrased content.

What a merchant can influence:

Citation-ready phrasing. Write at least one sentence in the exact shape: metric + verb-of-evaluation + named authority. "Tested by Outside magazine in their 2025 winter boot review" works; "a popular pick among reviewers" does not. We added a check that detects this pattern.
Quotability beats authority alone. A page that contains a clean verbatim quote is cited more than a page from a more authoritative domain that has nothing extractable. Wrap key claims in <blockquote> or use schema.org Review markup so the LLM can extract them confidently.
Pair every claim with a source. Free-floating numbers are weaker than numbers with a named source attached, even if the source is your own certification page.
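
The citation-ready shape is regular enough to detect with a heuristic. The sketch below is one possible approximation, not the check we ship: it splits text into sentences and looks for a verb of evaluation followed by "by" and a capitalized name, optionally preceded by a count. The verb list and the "capitalized word equals named authority" shortcut are assumptions, and the pattern does not require a metric, so it will produce both false positives and false negatives.

```python
import re

# Split on whitespace that follows sentence-ending punctuation ("4.6/5" stays intact).
SENTENCE_SPLIT = re.compile(r"(?<=[.!?])\s+")

CITATION_READY = re.compile(
    r"\b(?i:rated|tested|reviewed|certified|ranked|recommended|awarded)\b"  # verb of evaluation
    r".*?\bby\s+"                                                           # anything, then "by"
    r"(?:[\d,]+\s+)?"                                                       # optional count, e.g. "1,247"
    r"[A-Z][\w&'-]*"                                                        # capitalized token as a crude authority proxy
)

def citation_ready_sentences(text: str) -> list[str]:
    """Return the sentences that match the verb + 'by' + named-authority heuristic."""
    sentences = SENTENCE_SPLIT.split(text.strip())
    return [s for s in sentences if CITATION_READY.search(s)]

# Example sentences adapted from the phrasing discussed above.
copy = (
    "Rated 4.6/5 by 1,247 Wirecutter readers. "
    "A popular pick among reviewers. "
    "Tested by Outside magazine in their 2025 winter boot review."
)
for sentence in citation_ready_sentences(copy):
    print(sentence)
```

A page with zero matches is worth a manual read to see whether any claim names its source at all.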

How our scoring maps to the pipeline

Our content score now instruments levers across all five stages. Roughly:

Stage 1 (reformulation): Query Coverage Map — 5 queries × 3 models.
Stage 2 (retrieval): robots.txt allow rules + Product JSON-LD + traditional title / description coverage.
Stage 3 (reranking): positional density (new), recency signal (new), title length sweet spot, structured data depth.
Stage 4 (synthesis): positional density (new), data signals, social proof, authority signals, structured data.
Stage 5 (citation): citation-ready claim phrasing (new), quotation density, external authority links.

None of these levers exist in isolation. A page with citation-ready phrasing buried in the middle of the description loses the citation game to position bias. A page with great positional density but no recency signal loses the reranking game. The point of the audit is the joint distribution, not any single signal.

What's well-established vs still emerging

Well-established as of 2026:

Lost-in-the-middle position bias (Liu et al., 2023, "Lost in the Middle"; replicated by multiple labs).
Citation preference for verbatim-quotable sources (Liu et al., 2023, "Evaluating Verifiability in Generative Search Engines").
Hybrid retrieval as the dominant agent-search architecture (BEIR, MTEB benchmarks).
Query reformulation variance across models (Jagerman et al., 2023).

Still emerging:

Multi-step agent planning (planner → executor → reviewer loops). Effects on commerce surfaces are early.
Tool-use during shopping. Some agents call structured commerce endpoints (UCP) instead of reading HTML.
Per-model citation policy differences. Public data is incomplete.

We'll update this post as new results land. Re-running an audit will pick up the scoring updates automatically.

Audit your store against all 5 stages

Free, no account. The audit instruments query coverage, retrieval signals, reranking factors, position-aware density, and citation-ready phrasing.
