The Internet Is Not Linguistically Neutral: How Common Crawl Became the Invisible Curriculum of AI
What we now call “large language models” are often described in terms of architecture, scaling laws, and parameter counts. Yet the more fundamental layer lies elsewhere: in the corpora that quietly define what language these systems actually see.
Among these, Common Crawl occupies a uniquely foundational position.
Maintained by the Common Crawl Foundation, it crawls the web on a recurring schedule and publishes archives spanning billions of pages, making it one of the largest publicly available text corpora used in machine learning.
At first glance, this appears to be a neutral act of collection. In practice, it is a powerful act of selection through infrastructure.
Why it matters for AI systems
Common Crawl:
- Extracts text from billions of web pages
- Feeds downstream preprocessing pipelines for NLP
- Underlies many training datasets used in large language models
- Serves as a foundational layer for modern web-scale corpora
It is not the final dataset in most systems, but it is often the source of sources: filtered derivatives such as C4 and OSCAR are built from its archives.
The linguistic implication
From a linguistic perspective, the implications are structural. A corpus built from the web inherits the web's own distribution of languages, and that distribution carries a silent asymmetry:
- English dominates web content
- High-resource languages are overrepresented
- Many languages exist only in fragmented or informal forms
The result is not just imbalance; it is a redefinition of linguistic reality through data availability.
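One way to make this asymmetry measurable is to tally per-document language labels into shares. The following is a minimal Python sketch, assuming every document already carries a language label. The function name `language_shares` and the sample labels are hypothetical and purely illustrative; they are not real Common Crawl statistics.

```python
from collections import Counter


def language_shares(labels):
    """Return (language, share) pairs, sorted from largest share to smallest."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        # No documents: there are no shares to report.
        return []
    return [(lang, n / total) for lang, n in counts.most_common()]


# Hypothetical toy labels, one per document (NOT real crawl data).
sample_labels = ["eng"] * 6 + ["deu"] * 2 + ["spa"] + ["yor"]

for lang, share in language_shares(sample_labels):
    print(f"{lang}: {share:.0%}")
```

Run over the labels of a real corpus, the same tally would show how much of what a model sees belongs to each language, and how thin the tail of that distribution is.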
A key insight
Common Crawl does not simply reflect the internet.
It reconstructs the internet as a machine-readable language space.
And that reconstructed space becomes the implicit curriculum of modern AI.

