Corpus Linguistics: Terms Every Data-Driven Linguist Should Know
Mining Meaning from Mountains of Words:
Corpus Linguistics: Terms for Researchers
1. Corpus (pl. Corpora)
A large, structured set of authentic texts, stored electronically, used for linguistic analysis.
E.g., British National Corpus (BNC)
2. Annotation
The process of adding linguistic information (e.g., POS tags, syntactic structure, semantic roles) to a corpus for analysis.
3. Tokenization
Breaking down texts into smaller units (words, punctuation, sentences). A key step in preparing corpora.
“It’s raining” → [“It”, “’s”, “raining”]
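A minimal sketch of the split shown above, using only Python's standard library. The pattern and the name `tokenize` are illustrative assumptions, not a standard tokenizer; real corpus tools handle many more cases (abbreviations, hyphens, URLs).

```python
import re

# Hypothetical pattern: an apostrophe clitic ('s, ’s), a run of word
# characters, or a single punctuation mark, tried in that order.
TOKEN_RE = re.compile(r"[’']\w+|\w+|[^\w\s]")

def tokenize(text):
    """Split text into word, clitic, and punctuation tokens."""
    return TOKEN_RE.findall(text)

print(tokenize("It’s raining"))  # ['It', '’s', 'raining']
```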
4. Lemma / Lemmatization
A lemma is the base or dictionary form of a word. Lemmatization is the process of mapping inflected forms to that lemma.
“Running”, “ran” → lemma: “run”
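The mapping above can be sketched as a toy lookup. The table `LEMMA_TABLE` and the function `lemmatize` are hypothetical and cover only a handful of forms; real lemmatizers rely on large lexicons and usually on part-of-speech information.

```python
# Toy lookup table (an assumption for illustration only).
LEMMA_TABLE = {
    "running": "run",
    "ran": "run",
    "runs": "run",
}

def lemmatize(token):
    """Return the lemma for a token, or the lowercased token if unknown."""
    key = token.lower()
    return LEMMA_TABLE.get(key, key)

print([lemmatize(w) for w in ["Running", "ran"]])  # ['run', 'run']
```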
5. Type-Token Ratio (TTR)
A measure of lexical diversity in a text.
Formula: Number of types ÷ Number of tokens
A higher TTR indicates more vocabulary variation. TTR falls as a text gets longer, so it is best compared across texts of similar length.
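A short sketch of the formula, assuming the text is already tokenized and that case differences should be ignored (both are choices made here, not part of the definition). The name `type_token_ratio` is illustrative.

```python
def type_token_ratio(tokens):
    """Number of distinct word forms (types) divided by total tokens."""
    if not tokens:
        return 0.0
    lowered = [t.lower() for t in tokens]
    return len(set(lowered)) / len(lowered)

tokens = "the cat sat on the mat".split()
# 5 types (the, cat, sat, on, mat) over 6 tokens
print(round(type_token_ratio(tokens), 3))  # 0.833
```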
6. Collocation
Words that tend to occur together more often than chance would predict.
E.g., “strong tea,” “make a decision”
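One common way to score how strongly two adjacent words attract each other is pointwise mutual information (PMI). The sketch below is one possible formulation under simple assumptions (adjacent word pairs only, a made-up sample sentence, a hypothetical `min_count` threshold). Other association measures, such as log-likelihood or t-score, are also widely used.

```python
import math
from collections import Counter

def pmi_scores(tokens, min_count=2):
    """PMI (log base 2) for adjacent word pairs seen at least min_count times."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs give unreliable PMI values
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores

# Made-up sample text, for illustration only.
sample = "strong tea and strong tea but weak coffee".split()
print(pmi_scores(sample))
```

A positive score means the pair occurs together more often than its individual word frequencies would suggest.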
7. Concordance
A keyword-in-context (KWIC) display of occurrences of a word across a corpus, showing patterns of usage.
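A KWIC display can be sketched in a few lines. The function name `kwic`, the `window` parameter, and the sample sentence are assumptions for illustration; concordancers offer sorting, regular-expression search, and much more.

```python
def kwic(tokens, keyword, window=3):
    """Return (left context, keyword, right context) for each hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, tok, right))
    return lines

# Made-up sample text, for illustration only.
sample = "she made a decision and he made a mistake".split()
for left, key, right in kwic(sample, "made", window=2):
    # Right-align the left context so the keyword lines up in one column.
    print(f"{left:>10}  [{key}]  {right}")
```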
8. N-gram
A contiguous sequence of n items (words, syllables, or letters) in text.
Bigram: “new car”; Trigram: “as soon as”
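Word n-grams can be extracted with a sliding window over a token list. This is a minimal sketch; the name `ngrams` is illustrative, and the same approach works on letters or syllables if those are the items in the list.

```python
def ngrams(tokens, n):
    """Return all contiguous n-item sequences as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams("as soon as possible".split(), 3))
# [('as', 'soon', 'as'), ('soon', 'as', 'possible')]
```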
9. Corpus Pragmatics
A blend of corpus linguistics and pragmatics, analyzing contextual use of language in real-life corpora.
10. Semantic Prosody
A word’s tendency to appear in either positive or negative collocational environments.
“Cause” often collocates with negative words such as “problem” and “harm.”
11. Reference Corpus
A large corpus representative of a language variety, often used as a baseline for comparison.
E.g., Corpus of Contemporary American English (COCA)
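Because corpora differ in size, a baseline comparison usually starts from normalized frequencies. The sketch below assumes a per-million-words rate and a simple ratio; the function names and all counts are invented for illustration and are not figures from COCA or any real corpus.

```python
def per_million(count, corpus_size):
    """Normalize a raw count to occurrences per million tokens."""
    return count / corpus_size * 1_000_000

def frequency_ratio(count_study, size_study, count_ref, size_ref):
    """How many times more frequent a word is in the study corpus than in the reference."""
    ref_rate = per_million(count_ref, size_ref)
    if ref_rate == 0:
        return float("inf")  # word is absent from the reference corpus
    return per_million(count_study, size_study) / ref_rate

# Invented counts: 50 hits in a 100,000-token study corpus
# versus 100 hits in a 1,000,000-token reference corpus.
print(round(frequency_ratio(50, 100_000, 100, 1_000_000), 2))  # 5.0
```

A ratio well above 1 suggests the word is characteristic of the study corpus, though statistical tests such as log-likelihood are normally used before drawing conclusions.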
12. Monitor Corpus
A continuously updated corpus, ideal for tracking language change over time.
Used in lexicography and trend analysis.
13. Comparable Corpus
Corpora built from texts in multiple languages on similar topics and genres, where the texts are not translations of one another. Used for cross-linguistic analysis.
14. Parallel Corpus
Texts in one language aligned with their translations in another — vital in translation studies and contrastive linguistics.
15. Corpus Triangulation
Using multiple corpora or methods to strengthen findings and avoid reliance on a single dataset.
16. POS Tagging
Assigning part-of-speech labels to each word in a corpus.
E.g., “book” → noun or verb depending on context.
17. Treebank
A parsed corpus where sentences are annotated with syntactic or grammatical structure (often as phrase structure trees).
18. Corpus Stylistics
Application of corpus methods to literary analysis, identifying patterns in authorial style or genre.
19. Corpus-Informed vs Corpus-Based
- Corpus-Informed: Draws on corpus findings as inspiration for theory.
- Corpus-Based: Uses corpus data as evidence to test a theory empirically.
20. Data-Driven Learning (DDL)
An approach where learners explore corpus data directly to observe patterns and make inferences — promoting autonomy.