What 1,402 scraped marketing articles taught us

We pulled 1,402 marketing articles from across the public web over 8 weeks. The asset-type distribution shocked us. Three observations that should change how marketing content gets produced.

DMOOP Research June 2, 2026 5 min read

Over the last eight weeks, DMOOP's scraper pulled 1,402 marketing articles across 13 asset types and 13 marketing intents. Some of what we saw was expected. Some genuinely surprised us. Here are three observations worth your time.

Observation 1 — Reports dominate. Everything else is noise.

The single most-scraped asset type is "report": Gartner and Forrester research, plus vendor benchmarks. Reports account for 38% of the corpus despite being 1 of 13 asset types. Case studies come second at 18%, followed by articles at 11% and whitepapers at 9%.

The asset types that each fall below 5% (podcasts, videos, ebooks, social posts, ad campaigns) together account for less than 12% of the corpus.
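
For readers who want to run the same breakdown on their own corpus, here is a minimal Python sketch of the share calculation. It assumes each scraped article carries a single asset-type label; the function names and the toy labels are hypothetical and are not the real corpus counts.

```python
from collections import Counter


def asset_type_shares(asset_types):
    """Return {asset_type: share of corpus}, ordered from largest share to smallest."""
    counts = Counter(asset_types)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: n / total for name, n in counts.most_common()}


def long_tail_share(shares, threshold=0.05):
    """Combined share of all asset types that individually fall below `threshold`."""
    return sum(share for share in shares.values() if share < threshold)


# Toy labels for illustration only; these are not the real corpus counts.
toy_labels = (
    ["report"] * 40 + ["case_study"] * 20 + ["article"] * 12 + ["whitepaper"] * 10
    + ["webinar"] * 8 + ["podcast"] * 4 + ["video"] * 3 + ["ebook"] * 3
)
shares = asset_type_shares(toy_labels)
for name, share in shares.items():
    print(f"{name:<12}{share:.0%}")
print(f"long tail (<5% each): {long_tail_share(shares):.0%}")
```

The long-tail number is the one to watch: it shows how little of the corpus the sub-5% formats cover even when added together.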

This isn't because we biased the scrape. The Tavily queries are roughly balanced across asset types. Reports dominate because that's what gets published, gets indexed, and surfaces in search. The marketing internet is shaped like a pyramid with reports at the wide base.

The implication for content marketers is uncomfortable: if you're optimizing your editorial calendar for blog posts and social posts (the conventional wisdom), you're competing in a slice of at most 23% of the surface (articles at 11% plus the long tail at under 12%), where attention compounds slowly. If you publish one solid benchmark report a quarter, you're in the 38% slice where citations accrete.

Observation 2 — Tavily can't reach 70% of marketing content

Of the asset types we want to scrape (podcasts, webinars, LinkedIn posts, gated whitepapers, ebooks), Tavily reliably reaches maybe 30% of the surface area. The rest sits behind login walls, is locked in audio/video formats Tavily doesn't process, or sits behind soft paywalls that return 200 OK to crawlers but render gated content to humans.
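
The soft-paywall case is the hardest to spot, because the status code alone says nothing is wrong. Below is a minimal Python sketch of one possible check, assuming a gated page is a 200 response whose visible text is short and contains gating language. The marker phrases, the `min_words` threshold, and the function names are all hypothetical, and this is not a description of how Tavily or DMOOP's scraper actually works.

```python
import re

# Hypothetical marker phrases; a real list would need tuning against real pages.
GATE_MARKERS = (
    "sign in to continue",
    "log in to view",
    "subscribe to read",
    "fill out the form",
)


def visible_text(html):
    """Rough tag stripper: drop script/style blocks, then every remaining tag."""
    html = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", html)
    return re.sub(r"\s+", " ", text).strip()


def looks_gated(status_code, html, min_words=200):
    """Flag a 200 response that probably holds a teaser or form, not the article."""
    if status_code != 200:
        return False  # hard failures (401, 403, 404) are a separate case
    text = visible_text(html).lower()
    has_marker = any(marker in text for marker in GATE_MARKERS)
    too_short = len(text.split()) < min_words
    return has_marker and too_short


teaser = "<html><body><h1>Benchmark</h1><p>Subscribe to read the full report.</p></body></html>"
full_page = "<html><body><p>" + "word " * 300 + "</p></body></html>"
print(looks_gated(200, teaser))
print(looks_gated(200, full_page))
```

Requiring both signals is a deliberate choice in this sketch: a short page with no gating language may simply be a short page, and a long article may mention a newsletter signup without being gated.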

This is a structural fact about the public web, not a Tavily limitation. The most valuable marketing content has been moving into closed ecosystems for five years. The public-web corpus is increasingly dominated by SEO-optimized blog posts written for search ranking, not for the marketers who read them.

The architectural response is to lean on user uploads. DMOOP's Brand Library exists in part because the most valuable content for any specific brand is content that lives on their own laptops, not on the public web.

Observation 3 — Source diversity matters more than source authority

We track unique source URLs per intent. The intents with the most publisher diversity (10+ unique source domains per scrape cycle) produce noticeably better training pairs than the intents dominated by 2-3 publishers.
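
The diversity count described above can be sketched in a few lines of Python. This is an illustration under assumptions: the `(intent, url)` record shape, the function names, and the "middle" band label are hypothetical. Only the 10+ and 2-3 thresholds come from our numbers.

```python
from collections import defaultdict
from urllib.parse import urlparse


def domains_per_intent(records):
    """Map each intent to the set of distinct source domains seen for it.

    `records` is an iterable of (intent, url) pairs; that shape is an assumption.
    """
    seen = defaultdict(set)
    for intent, url in records:
        host = urlparse(url).hostname  # already lowercased; None if no host
        if host:
            if host.startswith("www."):
                host = host[4:]  # treat www.example.com and example.com as one source
            seen[intent].add(host)
    return dict(seen)


def diversity_band(domain_count):
    """Bucket an intent by how many unique source domains it has in one cycle."""
    if domain_count >= 10:
        return "diverse"
    if domain_count <= 3:
        return "concentrated"
    return "middle"


records = [
    ("awareness", "https://www.example.com/a"),
    ("awareness", "https://example.com/b"),
    ("awareness", "https://blog.example.org/c"),
    ("retention", "not a url"),
]
for intent, domains in domains_per_intent(records).items():
    print(intent, len(domains), diversity_band(len(domains)))
```

Counting domains rather than raw URLs matters here: ten articles from one publisher are still one voice.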

The translation for content marketers: being one of 12 voices on a topic is a worse position than being one of 4. The 2-3 dominant publishers in your niche have already optimized for AI citations. The path to being cited is finding the niche where there are 4-8 voices and adding a substantive one.

This also explains why niche topics with low search volume are increasingly valuable. The top-10 marketing trend topics in Q2 each had 50+ unique source publishers writing about them — meaning any individual piece's citation odds are tiny. The 30 next-tier topics had 4-6 each, with concentration on a few dominant voices. That's the gap.

What we're doing about it

Three internal shifts based on these observations:

Reduce scraper coverage of saturated topics and increase coverage of mid-tier topics where publisher diversity is 4-8 voices.
Stop expecting parity across asset types. Cap podcast/video/social_post scrape budget at 5% rather than trying to compensate for low yield.
Encourage Brand Library uploads as the path to high-value content that the public web doesn't reach.
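
The budget cap in the second shift could look something like the Python sketch below. It assumes the budget is a count of queries per cycle and that the 5% cap applies to podcast, video, and social_post combined rather than to each one; both are assumptions for illustration, as are the function names.

```python
LOW_YIELD = ("podcast", "video", "social_post")


def split_evenly(pool, names):
    """Divide an integer pool across names, giving leftovers to the first ones."""
    if not names:
        return {}
    base, extra = divmod(pool, len(names))
    return {name: base + (1 if i < extra else 0) for i, name in enumerate(names)}


def allocate_budget(total_queries, asset_types, low_yield=LOW_YIELD, cap_percent=5):
    """Give low-yield types at most `cap_percent` of the budget, combined.

    Everything else is split evenly across the remaining asset types.
    """
    low = [t for t in asset_types if t in low_yield]
    rest = [t for t in asset_types if t not in low_yield]
    low_pool = total_queries * cap_percent // 100 if low else 0
    plan = split_evenly(low_pool, low)
    plan.update(split_evenly(total_queries - low_pool, rest))
    return plan


plan = allocate_budget(
    1000, ["report", "case_study", "podcast", "video", "social_post"]
)
for asset_type, queries in plan.items():
    print(f"{asset_type:<12}{queries}")
```

Integer arithmetic keeps the plan summing exactly to the total, so no queries are lost to rounding.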

The public marketing web is shaped more by SEO incentives than by what marketers actually need. Knowing that changes where you spend attention.

