Experiment · 2026-05

We A/B tested 4 AEO fixes on 27 DTC stores. Here's what actually moved AI recommendation rate.

Stats lift recommendation rate +13.5pp. Combined fix lifts +20.9pp. Citations alone are actively negative. Raw data, methodology, and limitations from a controlled commerce-specific replication of the GEO paper.

Headline findings

→ Adding 10 specific numbers to a product description more than doubles AI recommendation rate (9.8% → 23.3%, n=27).
→ Combined optimization lifts recommendation rate by ~21 percentage points (9.8% → 30.7%) — best result across all four variants tested.
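→ Citation FORMAT matters more than citation content. The same 3 citations in <cite> tags: +9.9pp recommendation, +6.1pp identification, -7.5pp high-confidence. The same citations as naked bracket URLs: +7.3pp recommendation, -1.2pp identification, -12.6pp high-confidence. As <a href> links, identification drops 3.7pp. Format can flip the sign of the identification effect.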
→ Identification scores barely move with body rewrites. Body-text changes don't change what the product is — they change whether the AI is confident enough to recommend it.

Methodology

We took the 27 stores from our public benchmark, scraped each one, and generated four rewrites of each product description via a single LLM call per store with a strict variant schema. Together with the untouched control, that gives five variants:

V0 — Original. Untouched control.
V1 — +Statistics. Original + at least 10 specific numeric claims (dimensions, weight, %, durability, comparisons).
V2 — +Quotations. Original + at least 2 verbatim customer quotes in <blockquote> tags.
V3 — +Citations. Original + at least 3 outbound citation phrases with bracketed URLs.
V4 — All + diverse vocab. Combined V1+V2+V3 plus instruction to avoid repeating any single keyword more than 3 times.
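The minimum counts in the variant list can be checked mechanically before a rewrite enters the experiment. The following is a minimal sketch, not the pipeline's actual validator: the regexes, the `count_signals` and `satisfies` names, and the `MINIMUMS` table are all hypothetical, and the number matcher is deliberately crude (it also counts digits inside URLs). The V4 keyword-repetition rule is not checked here.

```python
import re

# Hypothetical heuristics for the three injected signal types.
NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")
BLOCKQUOTE_RE = re.compile(
    r"<blockquote\b[^>]*>.*?</blockquote>", re.IGNORECASE | re.DOTALL
)
BRACKET_URL_RE = re.compile(r"\[https?://[^\]\s]+\]")

# Minimum counts per variant, taken from the variant list (V0 has none).
MINIMUMS = {
    "V0": {},
    "V1": {"numbers": 10},
    "V2": {"quotes": 2},
    "V3": {"citations": 3},
    "V4": {"numbers": 10, "quotes": 2, "citations": 3},
}


def count_signals(html: str) -> dict[str, int]:
    """Count numeric claims, blockquotes, and bracketed URLs in a description."""
    return {
        "numbers": len(NUMBER_RE.findall(html)),
        "quotes": len(BLOCKQUOTE_RE.findall(html)),
        "citations": len(BRACKET_URL_RE.findall(html)),
    }


def satisfies(variant: str, html: str) -> bool:
    """True if the description meets every minimum count for the variant."""
    counts = count_signals(html)
    return all(counts[key] >= need for key, need in MINIMUMS[variant].items())
```

A rewrite that fails `satisfies` could be regenerated rather than scored, so that each variant actually contains the signal it is meant to test.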

Each variant was then fed to our identification simulator: GPT-5.4 mini, Claude Haiku 4.5, and Gemini 3.5 Flash each receive the scraped page data and answer four structured questions — what is this product, what's missing, would you recommend it, why. We aggregate three response-level signals across the three models per variant:

Identification score: % of models with confidence ≥ medium.
Recommendation rate: % of models that said "would recommend: yes".
High-confidence rate: % of models that returned high confidence.
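The three signals above reduce to simple proportions over a set of model responses. This is a minimal sketch under stated assumptions: the `ModelResponse` type, the `"low"`/`"medium"`/`"high"` labels, and the `aggregate` function are hypothetical, and the way per-store results are averaged into the published figures is not specified in this write-up.

```python
from dataclasses import dataclass


@dataclass
class ModelResponse:
    confidence: str        # assumed labels: "low", "medium", "high"
    would_recommend: bool  # the model's answer to "would you recommend it"


CONFIDENCE_RANK = {"low": 0, "medium": 1, "high": 2}


def aggregate(responses: list[ModelResponse]) -> dict[str, float]:
    """Return the three signals as percentages of the given responses."""
    n = len(responses)
    if n == 0:
        raise ValueError("need at least one response")
    medium = CONFIDENCE_RANK["medium"]
    identified = sum(CONFIDENCE_RANK[r.confidence] >= medium for r in responses)
    recommended = sum(r.would_recommend for r in responses)
    high = sum(r.confidence == "high" for r in responses)
    return {
        "identification": 100 * identified / n,
        "recommendation": 100 * recommended / n,
        "high_confidence": 100 * high / n,
    }
```

Note that identification counts both medium and high responses, so it can stay flat while the high-confidence rate falls, which is the pattern the results table shows.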

27 stores × 5 variants × 3 models = 405 frontier-model calls, plus 27 rewrite calls. Total run cost <$2; wall-clock 11 minutes.

Results

Variant	n	Content %	Identification	Recommend	High-confidence
V0 — Original	27	37.2%	84.1%	9.8%	74.1%
V1 — +10 statistics	27	38.0% (+0.8)	82.9% (-1.2)	23.3% (+13.5)	68.0% (-6.1)
V2 — +2 quotations	27	38.0% (+0.8)	81.6% (-2.5)	13.6% (+3.8)	55.6% (-18.5)
V3 — +3 citations	27	38.0% (+0.8)	81.6% (-2.5)	4.9% (-4.9)	48.2% (-25.9)
V4 — All + diverse vocab	27	38.0% (+0.8)	86.5% (+2.4)	30.7% (+20.9)	68.1% (-6.0)

Reading the table: every row except V0 shows, in parentheses, the percentage-point delta vs the unmodified baseline (positive = lift, negative = drop). The recommendation column is the most actionable: it captures whether an AI shopping assistant, given the page in context, would confidently recommend the product if a shopper asked.

Finding 1 — Statistics are the single highest-leverage fix

V1 (stats only) lifts recommendation rate from 9.8% to 23.3%: a 13.5 percentage point absolute increase, or a 2.4× multiplier on the baseline. High-confidence rate drops slightly (-6.1pp), plausibly because the descriptions got longer and denser, but the models are more willing to recommend when faced with concrete numbers.
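The two ways of stating the V1 effect, absolute percentage points and a relative multiplier, come from the same pair of numbers. A small sketch of the arithmetic (the `lift` helper is a hypothetical name, the inputs are the table values):

```python
def lift(baseline_pct: float, variant_pct: float) -> tuple[float, float]:
    """Return (absolute lift in percentage points, relative multiplier)."""
    return variant_pct - baseline_pct, variant_pct / baseline_pct


# V0 and V1 recommendation rates from the results table.
abs_pp, multiplier = lift(9.8, 23.3)
print(f"{abs_pp:+.1f}pp, {multiplier:.1f}x")  # +13.5pp, 2.4x
```

The multiplier looks large partly because the baseline is low, so the percentage-point figure is the safer one to compare across variants.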

This matches the GEO paper's top-3 finding. What our experiment adds is a commerce-specific magnitude. If a merchant has time for exactly one change, "add 10+ specific numbers" produces measurable lift on its own.

Finding 2 — Citations alone are a foot-gun

V3 (citations only) was the surprise. The GEO paper showed citations have the largest positive effect on subjective impression score in general-text contexts. On commerce pages, inline URLs without supporting structure produced worse outcomes than the original: recommendation rate −4.9pp, high-confidence rate −25.9pp.

Two hypotheses we can't yet distinguish:

Spam-pattern detection. Commerce pages full of bare URLs in body text look like SEO-spam to a model. Properly marked-up citations (<cite>, schema.org references, footnote-style numbered links) may not trigger the same pattern. We'll test this in the next run.
Dilution. Citations replace product-specific language with authority-attribution language. Without paired statistics, this reads as hand-waving ("tested by Wirecutter" without a metric) and reduces perceived expertise.

Practical implication: never add citations to a page that doesn't already pass the data-signals check. Stats first, then citations.

Finding 3 — Combined fix beats any single fix

V4 (all three injections + diverse vocabulary) produces the only positive identification lift (+2.4pp) and the highest recommendation lift (+20.9pp), with a high-confidence drop (-6.0pp) no worse than V1's. This is the strongest evidence we have for the joint-distribution argument: AEO levers compound rather than substitute.
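One rough way to read "compound rather than substitute" is to compare V4's lift with the sum of the three single-variant lifts. The sketch below does only that arithmetic on the table values. It is not a formal interaction test: V4 also carries the diverse-vocabulary instruction, and with n=27 the gap may be within noise.

```python
# Recommendation-rate lifts in percentage points, from the results table.
single_lifts = {"V1 stats": 13.5, "V2 quotations": 3.8, "V3 citations": -4.9}
combined_lift = 20.9  # V4

# What V4 would show if the three levers simply added up.
additive_expectation = sum(single_lifts.values())
excess = combined_lift - additive_expectation

print(f"sum of single lifts: {additive_expectation:+.1f}pp")  # +12.4pp
print(f"V4 excess over additive: {excess:+.1f}pp")            # +8.5pp
```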

For a merchant, the practical takeaway is to either commit to the full stats+quotes+citations rewrite or do just the stats. Partial-credit fixes can underperform doing nothing: citations-only lowered recommendation rate by 4.9pp, and quotes-only bought a 3.8pp recommendation gain at the cost of 18.5pp of high-confidence rate.

Finding 4 — Identification is mostly about JSON-LD, not body copy

Identification scores barely moved across variants (every delta within ±2.5pp). This is consistent with the fact that every variant has the same brand, name, image, and structured-data identifiers; only the body description differs. If a model already knows what the product is, body-copy rewrites don't move that score. They move the model's willingness to recommend.

The corollary: structured-data fixes (Product JSON-LD, brand, GTIN, aggregateRating) move identification. Body-copy fixes move recommendation. They're complementary levers targeting different stages of the agent pipeline.
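For readers who have not seen the structured-data side, here is a minimal sketch of a Product JSON-LD payload carrying the identifiers named above. The property names (`brand`, `gtin13`, `aggregateRating`, `ratingValue`, `reviewCount`) are standard schema.org vocabulary; the `product_jsonld` helper and every value passed to it are placeholders, not data from the experiment.

```python
import json


def product_jsonld(name: str, brand: str, gtin13: str,
                   rating_value: float, review_count: int) -> str:
    """Serialize a minimal schema.org Product as a JSON-LD string."""
    data = {
        "@context": "https://schema.org",
        "@type": "Product",
        "name": name,
        "brand": {"@type": "Brand", "name": brand},
        "gtin13": gtin13,
        "aggregateRating": {
            "@type": "AggregateRating",
            "ratingValue": rating_value,
            "reviewCount": review_count,
        },
    }
    return json.dumps(data, indent=2)


# Placeholder values for illustration only.
print(product_jsonld("Example Bottle", "ExampleBrand", "0000000000000", 4.6, 128))
```

The output would typically be embedded in a `<script type="application/ld+json">` element on the product page.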

Sub-experiment — Citation markup format matters more than citation content

Finding 2 left an open question. V3 (citations alone) hurt — but was that the citations themselves, or the way we formatted them? We ran a follow-up sub-experiment on the same 27 stores. Each store got the SAME 3 citations in 5 different markup formats; only the formatting changed.

Citation format	n	Identification	Recommend	High-confidence
V0 — Original	27	82.9%	7.4%	69.3%
V_naked — bracket URLs	27	81.7% (-1.2)	14.7% (+7.3)	56.7% (-12.6)
V_cite — <cite> tag	27	89.0% (+6.1)	17.3% (+9.9)	61.8% (-7.5)
V_anchor — <a href> link	27	79.2% (-3.7)	14.7% (+7.3)	61.7% (-7.6)
V_quote — verbatim quote	27	82.9% (0.0)	16.0% (+8.6)	63.0% (-6.3)
V_footnote — superscript + refs	27	86.5% (+3.6)	11.1% (+3.7)	58.0% (-11.3)

The pattern is sharp:

<cite> tag posts the best overall result. Identification +6.1pp and recommendation +9.9pp are the largest lifts of any format, and its high-confidence drop (-7.5pp) is the second smallest after verbatim quotes (-6.3pp).
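Naked bracket URLs cost the most high-confidence rate (-12.6pp), the largest drop of any format. That points at the original V3 format, rather than the act of citing, as what hurts the page. One caveat: naked URLs still lifted recommendation in this run (+7.3pp), whereas V3 lowered it in the main run (-4.9pp), so the two runs are not directly comparable on that metric.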
Inline <a href> anchor links hurt identification. Surprising: the "proper" HTML hyperlink lowers the identification score (-3.7pp). Hypothesis: anchor markup adds noise the model treats as link-spam, even though humans read it as a normal link.
Verbatim quotes hold confidence best (-6.3 vs -12.6 for naked URLs) while lifting recommendations almost as much as <cite>. A safe second choice when <cite> isn't available.
Footnote style is balanced but mediocre. The reference block at the end doesn't excite the model the way an inline <cite> does.

Practical prescription: if a merchant has failing authority-signals, don't just say "add 3 outbound citations." Say "wrap each citation in <cite>Wirecutter (2025)</cite> tags, not bracket URLs or anchor links." The same citations in the wrong format cost up to 12.6pp of high-confidence rate and can lower identification; in <cite> tags they lifted AI recommendation rate by 9.9pp in this run.
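To make the prescription concrete, here is a minimal sketch that renders one citation in three of the tested styles. Only the `<cite>Wirecutter (2025)</cite>` form is taken from the text above. The `format_citation` helper, the style names, and the exact naked and anchor renderings are assumptions, since the variant templates are not published here.

```python
from html import escape


def format_citation(source: str, year: int, url: str, style: str) -> str:
    """Render a citation as 'cite', 'naked', or 'anchor' markup."""
    label = f"{escape(source)} ({year})"
    if style == "cite":
        # The format recommended above.
        return f"<cite>{label}</cite>"
    if style == "naked":
        # Assumed shape of the bracketed-URL format used in V3.
        return f"{label} [{url}]"
    if style == "anchor":
        # escape() also escapes quotes, so the URL is safe inside href="...".
        return f'<a href="{escape(url)}">{label}</a>'
    raise ValueError(f"unknown style: {style}")


print(format_citation("Wirecutter", 2025, "https://example.com/review", "cite"))
# <cite>Wirecutter (2025)</cite>
```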

This is the kind of finding that wouldn't come out of paper-level GEO research, because the paper tested signal categories, not the markup of those signals. It's also why empirical replication on commerce pages matters — the implementation detail inverts the lift.

What our scoring system gets right (and wrong)

We can use these results to sanity-check our current weights:

Stats weight should go up. Currently 8 points (out of 114). The effect size on recommendation rate suggests it deserves more like 12-15.
Quotations weight is roughly right. Currently 6 points; modest isolated effect supports a modest weight.
Citations weight might need conditional scoring. Citations on a page with statistics deserve full weight; citations on a page without stats should arguably score zero. We may add an interaction term in a future scoring version.
Keyword stuffing is rare. Of 27 stores, only 5 had top-content-word density above 4%. The negative-signal check fires correctly when it matters, but it's not the universal problem we expected.
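The keyword-stuffing check in the last item can be approximated in a few lines. This is a sketch under assumptions, not the audit's implementation: the stopword list is a tiny illustrative one, `top_word_density` is a hypothetical name, and density is assumed to mean occurrences of the most frequent content word divided by the total word count. Only the 4% threshold comes from the text.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real check would need a fuller one.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "for",
             "with", "is", "it", "this", "that", "on"}

DENSITY_THRESHOLD = 0.04  # the 4% cutoff mentioned above


def top_word_density(text: str) -> tuple[str, float]:
    """Return the most frequent content word and its share of all words."""
    words = re.findall(r"[a-z']+", text.lower())
    content = [w for w in words if w not in STOPWORDS]
    if not content:
        return "", 0.0
    word, count = Counter(content).most_common(1)[0]
    return word, count / len(words)


word, density = top_word_density("serum serum serum for the face")
print(word, density, density > DENSITY_THRESHOLD)  # serum 0.5 True
```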

Cross-sectional pilot — what predicts cold visibility

In parallel we computed per-check Pearson correlation between pass/fail and cross-agent recall on the same 27 stores. The strongest signal in that analysis is unsurprising and humbling: brand mind-share dominates everything. Stores that score badly on protocol (large well-known brands without a UCP manifest) often have higher visibility than well-instrumented Shopify stores with no brand recognition.

This tells us page-quality fixes affect inference-time visibility (does the model recommend you when it has your page in context?) much more than training-time recall (does the model remember you cold from training?). The latter is determined by years of brand-building and third-party citations, not by structured data hygiene. If your brand is unknown, page optimization won't fix that — outreach will. Page optimization helps when an AI shopping browser is actually looking at your page right now.
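The per-check correlation in this pilot is a Pearson correlation where one variable is binary (pass = 1, fail = 0), which is the same quantity as a point-biserial correlation. A minimal stdlib sketch follows; the `pearson` helper and the example values are illustrative, not data from the 27 stores.

```python
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with n >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant input")
    return cov / sqrt(var_x * var_y)


# Made-up example: check pass/fail per store vs a recall score in [0, 1].
passed = [1, 0, 1, 0]
recall = [0.2, 0.7, 0.3, 0.9]
print(round(pearson(passed, recall), 3))
```

A check that every store passes (or every store fails) has zero variance, so no correlation can be computed for it; the sketch raises in that case rather than returning a misleading number.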

Limitations we won't hide

n=27 is small. Effect sizes are directional, not statistically confirmed at p<0.05.
Variants are LLM-generated, not real merchant edits. Synthetic quotes might trigger spam detection the way real customer reviews would not.
Identification probe ≠ cold recall. Our dependent variable is what models do when given a page in context. Cold recall (the model recommending you from training memory when asked about the category) is a different mechanism we cannot A/B test short-term.
Single domain (DTC commerce on Shopify-heavy dataset). Results may not transfer to B2B, marketplaces, or non-product content.
Model choice matters. We tested GPT-5.4 mini, Claude Haiku 4.5, Gemini 3.5 Flash. The full-size models may exhibit different biases.

We'll re-run the experiment when we have 100+ stores and real-merchant rewrites (Phase B of our roadmap). The numbers above are what we can defensibly publish today.
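Because every store appears in every variant, the design is paired, and a simple way to ask whether a lift is distinguishable from noise is an exact sign test on per-store deltas. The sketch below is one possible check, not an analysis we ran: `sign_test_p` is a hypothetical helper and the deltas in the example are invented.

```python
from math import comb


def sign_test_p(deltas: list[float]) -> float:
    """Two-sided exact sign test on paired deltas; zero deltas are dropped."""
    pos = sum(d > 0 for d in deltas)
    neg = sum(d < 0 for d in deltas)
    n = pos + neg
    if n == 0:
        return 1.0
    k = max(pos, neg)
    # Probability of a split at least this lopsided under a fair coin.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Invented example: 27 stores, 10 improve, 2 get worse, 15 unchanged.
example = [1.0] * 10 + [-1.0] * 2 + [0.0] * 15
print(round(sign_test_p(example), 4))  # 0.0386
```

The sign test ignores the size of each delta, so it is conservative; with per-store outcomes in hand, a paired bootstrap on the mean delta would be a natural complement.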

What this changes about how we audit

Top suggestion priority. Our scan results now sort fixes by empirical lift, not theoretical weight. Stats first, combined rewrite second.
Warning when citations are added in isolation. If a merchant passes the citation check but fails data-signals, we now flag this as a potential trap.
Cross-agent recall is treated as a brand-strength signal, not a page-fix signal. Citation Discovery (real outreach targets) is the correct lever there.
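The citation warning, together with the conditional-scoring idea raised in the scoring section, can be sketched as two small functions. Everything here is hypothetical: the check keys (`"citations"`, `"data_signals"`), the function names, the warning text, and the `full_weight` parameter, since the citation weight is not stated in this write-up.

```python
from typing import Optional


def citation_score(passes_citations: bool, passes_data_signals: bool,
                   full_weight: float) -> float:
    """Give citations full weight only when statistics are also present."""
    if passes_citations and passes_data_signals:
        return full_weight
    return 0.0


def citation_warning(checks: dict[str, bool]) -> Optional[str]:
    """Flag pages that pass the citation check but fail data-signals."""
    if checks.get("citations", False) and not checks.get("data_signals", False):
        return ("Citations found without supporting statistics: "
                "add specific numbers before relying on citations.")
    return None


print(citation_warning({"citations": True, "data_signals": False}))
```

Scoring zero without statistics is the hard-cutoff version of an interaction term; a softer variant could scale the citation weight instead of dropping it entirely.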

Raw dataset, variant text, per-store outcomes — all available at the repository on request. Run your own scan on the home page.
