Corpus Linguistics Terms


Corpus Linguistics: Terms Every Data-Driven Linguist Should Know

Mining Meaning from Mountains of Words:

Corpus Linguistics: Terms for Researchers


1. Corpus (pl. Corpora)

A large, structured set of authentic texts, stored electronically, used for linguistic analysis.
E.g., British National Corpus (BNC)


2. Annotation

The process of adding linguistic information (e.g., POS tags, syntactic structure, semantic roles) to a corpus for analysis.


3. Tokenization

Breaking down texts into smaller units (words, punctuation, sentences). A key step in preparing corpora.
“It’s raining” → [“It”, “’s”, “raining”]
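
The split shown above can be reproduced with a small regular expression. This is only a minimal sketch: the pattern and the `tokenize` helper are illustrative assumptions, not a standard tokenizer, and real corpus tools handle many more cases (quotation marks, hyphens, abbreviations, contractions such as "n't").

```python
import re

# Toy pattern, tried left to right at each position:
#   1. an apostrophe (straight or curly) followed by word characters, e.g. ’s
#   2. a run of word characters
#   3. any single character that is neither a word character nor whitespace
TOKEN_RE = re.compile(r"[’']\w+|\w+|[^\w\s]")

def tokenize(text):
    """Split text into word, clitic, and punctuation tokens (hypothetical helper)."""
    return TOKEN_RE.findall(text)

print(tokenize("It’s raining"))  # ['It', '’s', 'raining']
```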


4. Lemma / Lemmatization

A lemma is the base or dictionary form of a word. Lemmatization is the process of mapping each inflected form in a text to its lemma.
“Running”, “ran” → lemma: “run”
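
As a rough illustration, lemmatization can be pictured as a lookup from word forms to lemmas. The table below is a hand-made assumption covering only a few forms of "run"; real lemmatizers rely on large lexicons and usually on part-of-speech information as well.

```python
# Hypothetical, hand-built lookup table: inflected form -> lemma.
LEMMA_TABLE = {
    "running": "run",
    "ran": "run",
    "runs": "run",
}

def lemmatize(token):
    """Return the lemma for a token, or the lowercased token if it is not in the table."""
    lowered = token.lower()
    return LEMMA_TABLE.get(lowered, lowered)

for form in ["Running", "ran"]:
    print(form, "->", lemmatize(form))
```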


5. Type-Token Ratio (TTR)

A measure of lexical diversity in a text.
Formula: Number of types ÷ Number of tokens
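A higher TTR indicates more vocabulary variation. Because TTR tends to fall as a text gets longer, it is best compared across texts of similar length.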
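
The formula translates directly into code. In this sketch, the `type_token_ratio` name and the choice to lowercase tokens before counting types are assumptions; some studies count case-sensitive forms or lemmas instead.

```python
def type_token_ratio(tokens):
    """Number of distinct types divided by number of tokens (case-insensitive)."""
    if not tokens:
        raise ValueError("TTR is undefined for an empty token list")
    lowered = [t.lower() for t in tokens]
    return len(set(lowered)) / len(lowered)

tokens = "The cat sat on the mat".split()
# 5 types (the, cat, sat, on, mat) over 6 tokens
print(round(type_token_ratio(tokens), 3))  # 0.833
```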


6. Collocation

Words that tend to occur together frequently.
E.g., “strong tea,” “make a decision”
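
One common way to quantify "tend to occur together" is pointwise mutual information (PMI), which compares how often two adjacent words co-occur with how often they would if they were independent. PMI is named here as one possible association measure, not the only one; the `pmi_scores` function and the `min_count` threshold are illustrative assumptions.

```python
import math
from collections import Counter

def pmi_scores(tokens, min_count=2):
    """Score adjacent word pairs by PMI (log base 2), skipping rare pairs."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_tokens = len(tokens)
    n_bigrams = n_tokens - 1
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # PMI is unreliable for very rare pairs
        p_pair = count / n_bigrams
        p_w1 = unigrams[w1] / n_tokens
        p_w2 = unigrams[w2] / n_tokens
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores

tokens = "strong tea and strong tea but weak coffee".split()
for pair, score in pmi_scores(tokens).items():
    print(pair, round(score, 2))
```

On a toy input like this the numbers mean little; the measure only becomes informative on a corpus large enough for the counts to be stable.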


7. Concordance

A keyword-in-context (KWIC) display of occurrences of a word across a corpus, showing patterns of usage.
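
A KWIC display is easy to sketch over a list of tokens. The `kwic` function and its `window` parameter (the number of context tokens on each side) are assumptions made for illustration; concordancers typically offer sorting and character-based windows too.

```python
def kwic(tokens, keyword, window=3):
    """Return (left context, keyword, right context) for each case-insensitive match."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, tok, right))
    return lines

tokens = "she made a decision and he made a mistake".split()
for left, kw, right in kwic(tokens, "made", window=2):
    # Right-align the left context so the keyword lines up in a column.
    print(f"{left:>12}  [{kw}]  {right}")
```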


8. N-gram

A contiguous sequence of n items (words, syllables, or letters) in text.
Bigram: “new car”; Trigram: “as soon as”
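
Extracting word n-grams amounts to sliding a window of size n over the token list. A minimal sketch, where the `ngrams` helper name is an assumption:

```python
def ngrams(tokens, n):
    """Return every contiguous sequence of n tokens as a tuple."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "as soon as possible".split()
print(ngrams(tokens, 2))
print(ngrams(tokens, 3))
```

The same function works for letter n-grams if it is given a list of characters instead of words.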


9. Corpus Pragmatics

A blend of corpus linguistics and pragmatics, analyzing contextual use of language in real-life corpora.


10. Semantic Prosody

A word’s tendency to appear in either positive or negative collocational environments.
E.g., “cause” often collocates with negative nouns such as “problem” and “harm.”


11. Reference Corpus

A large corpus representative of a language variety, often used as a baseline for comparison.
E.g., Corpus of Contemporary American English (COCA)


12. Monitor Corpus

A continuously updated corpus, ideal for tracking language change over time.
Used in lexicography and trend analysis.


13. Comparable Corpus

Corpora built from texts in two or more languages that share similar topics and genres but are not translations of one another; used for cross-linguistic analysis.


14. Parallel Corpus

Texts in one language aligned with their translations in another — vital in translation studies and contrastive linguistics.


15. Corpus Triangulation

Using multiple corpora or methods to strengthen findings and avoid reliance on a single dataset.


16. POS Tagging

Assigning part-of-speech labels to each word in a corpus.
E.g., “book” → noun or verb depending on context.


17. Treebank

A parsed corpus where sentences are annotated with syntactic or grammatical structure (often as phrase structure trees).


18. Corpus Stylistics

Application of corpus methods to literary analysis, identifying patterns in authorial style or genre.


19. Corpus-Informed vs Corpus-Based

  • Corpus-Informed: Uses a corpus as a source of inspiration for theory.
  • Corpus-Based: Uses a corpus as evidence to test a theory empirically.


20. Data-Driven Learning (DDL)

An approach where learners explore corpus data directly to observe patterns and make inferences — promoting autonomy.
