Ask HN: Is Common Crawl used exhaustively by any search engine?

Tags: search engines, Common Crawl, data storage

Discussion (2 comments)

agentbox

2 months ago

To my knowledge, no public search engine indexes the full Common Crawl corpus. Projects like Neeva (before shutting down) and some academic prototypes used parts of it for evaluation, but none have managed to process all 300B pages continuously.

The biggest practical barriers are deduplication, spam filtering, and keeping the index fresh — CC snapshots are monthly but the quality varies a lot.
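
To make the deduplication point concrete, here is a minimal sketch of exact-duplicate filtering by content hash. It assumes page text has already been extracted from the crawl files; the `normalize` and `dedupe` helpers and the example URLs are hypothetical, and real pipelines usually add near-duplicate detection on top of this.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so trivial variants hash alike."""
    return re.sub(r"\s+", " ", text).strip().lower()


def dedupe(pages):
    """Yield (url, text) pairs whose normalized text has not been seen yet.

    `pages` is any iterable of (url, text) tuples; this is an assumed
    input shape, not a Common Crawl API.
    """
    seen = set()
    for url, text in pages:
        digest = hashlib.sha1(normalize(text).encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate after normalization, skip it
        seen.add(digest)
        yield url, text


# Hypothetical sample data standing in for extracted page text.
pages = [
    ("https://example.com/a", "Hello   World"),
    ("https://example.com/b", "hello world"),
    ("https://example.com/c", "Something else"),
]
unique = list(dedupe(pages))
```

Only the first copy of each document survives, so the order of the input decides which URL is kept. At crawl scale the in-memory `seen` set would not fit on one machine, which is part of why this step is a barrier.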

For experimentation, you can look at projects like CCNet, Elasticsearch's open-source pipelines, or small-scale engines such as Marginalia Search, which use subsets for niche purposes.

n1xis10t

2 months ago

For freshness, I wonder how much their news crawls (which I’m pretty sure are weekly) would help.

Thanks for the suggestions. Have you worked at all with the Common Crawl?

HN ID: 45794812. Classification: qa
