Corpus Linguistics: Terms Every Data-Driven Linguist Should Know

Mining Meaning from Mountains of Words:

Corpus Linguistics: Terms for Researchers

1. Corpus (pl. Corpora)

A large, structured set of authentic texts, stored electronically, used for linguistic analysis.
E.g., British National Corpus (BNC)

2. Annotation

The process of adding linguistic information (e.g., POS tags, syntactic structure, semantic roles) to a corpus for analysis.

3. Tokenization

Breaking down texts into smaller units (words, punctuation, sentences). A key step in preparing corpora.
“It’s raining” → [“It”, “’s”, “raining”]
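
As a rough illustration of the split shown above, here is a minimal regex-based sketch. The pattern and the name `tokenize` are assumptions made for this example, not a standard tokenizer; real corpus tools handle many more cases (for instance, this pattern splits “don’t” into “don” and “’t”).

```python
import re

# Hypothetical pattern, tried left to right at each position:
#   1. an apostrophe followed by letters (clitics such as ’s)
#   2. a run of word characters
#   3. any single character that is neither a word character nor whitespace
TOKEN_PATTERN = re.compile(r"[’']\w+|\w+|[^\w\s]")


def tokenize(text: str) -> list[str]:
    """Split text into word, clitic, and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


print(tokenize("It’s raining"))  # ['It', '’s', 'raining']
```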

4. Lemma / Lemmatization

A lemma is the base or dictionary form of a word; lemmatization is the process of mapping each inflected form to its lemma.
“Running”, “ran” → lemma: “run”
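
The mapping above can be sketched with a simple lookup table. The table `LEMMA_TABLE` and the function `lemmatize` are hypothetical names for this toy example; real lemmatizers rely on large lexicons and usually on part-of-speech information as well.

```python
# Hypothetical mini-lexicon mapping inflected forms to their lemmas.
LEMMA_TABLE = {
    "running": "run",
    "ran": "run",
    "runs": "run",
}


def lemmatize(token: str) -> str:
    """Return the lemma of a token, or the lowercased token if it is unknown."""
    lowered = token.lower()
    return LEMMA_TABLE.get(lowered, lowered)


print([lemmatize(t) for t in ["Running", "ran"]])  # ['run', 'run']
```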

5. Type-Token Ratio (TTR)

A measure of lexical diversity in a text.
Formula: Number of types ÷ Number of tokens
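A higher TTR indicates more vocabulary variation. Because TTR tends to fall as a text gets longer, it is best compared across texts of similar length.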
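
The formula can be computed in a few lines. This sketch assumes the text is already tokenized and treats case variants (“The”, “the”) as one type; that choice, and the name `type_token_ratio`, are assumptions of the example.

```python
def type_token_ratio(tokens: list[str]) -> float:
    """Number of distinct word forms (types) divided by number of tokens."""
    if not tokens:
        return 0.0  # avoid division by zero on empty input
    types = {token.lower() for token in tokens}
    return len(types) / len(tokens)


sample = "the cat sat on the mat".split()
# 5 types (the, cat, sat, on, mat) over 6 tokens
print(round(type_token_ratio(sample), 3))  # 0.833
```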

6. Collocation

Words that tend to occur together more often than chance would predict.
E.g., “strong tea,” “make a decision”

7. Concordance

A keyword-in-context (KWIC) display of occurrences of a word across a corpus, showing patterns of usage.
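
A KWIC display can be sketched as follows, assuming a pre-tokenized text and a context window counted in tokens. The function name `kwic` and the `window` parameter are illustrative choices, not the interface of any particular concordancer.

```python
def kwic(tokens: list[str], keyword: str, window: int = 3) -> list[tuple[str, str, str]]:
    """Return (left context, node word, right context) for each hit of keyword."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, token, right))
    return lines


tokens = "She made a decision quickly and he made a mess".split()
for left, node, right in kwic(tokens, "made", window=2):
    # Right-align the left context so the node words line up in a column.
    print(f"{left:>15}  {node}  {right}")
```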

8. N-gram

A contiguous sequence of n items (words, syllables, or letters) in text.
Bigram: “new car”; Trigram: “as soon as”
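
Extracting word n-grams from a tokenized text takes only a sliding window. The helper name `ngrams` is an assumption of this sketch.

```python
def ngrams(tokens: list[str], n: int) -> list[tuple[str, ...]]:
    """Return every contiguous sequence of n tokens, in order of appearance."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "as soon as possible".split()
print(ngrams(tokens, 2))  # [('as', 'soon'), ('soon', 'as'), ('as', 'possible')]
print(ngrams(tokens, 3))  # [('as', 'soon', 'as'), ('soon', 'as', 'possible')]
```

Passing the result to `collections.Counter` gives n-gram frequencies.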

9. Corpus Pragmatics

A blend of corpus linguistics and pragmatics, analyzing contextual use of language in real-life corpora.

10. Semantic Prosody

A word’s tendency to appear in either positive or negative collocational environments.
E.g., “cause” often collocates with negative words such as “problem” and “harm.”

11. Reference Corpus

A large corpus representative of a language variety, often used as a baseline for comparison.
E.g., Corpus of Contemporary American English (COCA)
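
One typical use of a reference corpus as a baseline is a keyness test: checking whether a word is unusually frequent in a study corpus relative to the reference. The sketch below uses the log-likelihood statistic, a common choice for this comparison. The function name and the example counts are invented for illustration and are not taken from COCA or any real corpus.

```python
import math


def log_likelihood(freq_study: int, size_study: int, freq_ref: int, size_ref: int) -> float:
    """Log-likelihood keyness of one word, from its counts in two corpora."""
    total_size = size_study + size_ref
    total_freq = freq_study + freq_ref
    # Expected counts if the word were equally frequent in both corpora.
    expected_study = size_study * total_freq / total_size
    expected_ref = size_ref * total_freq / total_size
    score = 0.0
    for observed, expected in ((freq_study, expected_study), (freq_ref, expected_ref)):
        if observed > 0:  # a zero count contributes nothing to the sum
            score += observed * math.log(observed / expected)
    return 2 * score


# Invented counts: 50 hits in a 1,000-word study corpus versus
# 100 hits in a 10,000-word reference corpus.
print(round(log_likelihood(50, 1000, 100, 10000), 2))
```

Values above about 3.84 correspond to p < 0.05 with one degree of freedom; a score of 0 means the word has the same relative frequency in both corpora.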

12. Monitor Corpus

A continuously updated corpus, ideal for tracking language change over time.
Used in lexicography and trend analysis.

13. Comparable Corpus

Corpora built from texts in multiple languages on similar topics and genres, but not translations of one another; used for cross-linguistic analysis.

14. Parallel Corpus

Texts in one language aligned with their translations in another; vital in translation studies and contrastive linguistics.

15. Corpus Triangulation

Using multiple corpora or methods to strengthen findings and avoid reliance on a single dataset.

16. POS Tagging

Assigning part-of-speech labels to each word in a corpus.
E.g., “book” → noun or verb depending on context.

17. Treebank

A parsed corpus where sentences are annotated with syntactic or grammatical structure (often as phrase structure trees).
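
Phrase structure trees are often stored as bracketed strings. The sketch below reads one such string into nested tuples and recovers the words. The example sentence, its labels, and the helper names `parse_tree` and `leaves` are assumptions of this illustration, and the reader expects well-formed input.

```python
import re


def parse_tree(bracketed: str):
    """Parse a bracketed tree string into (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", bracketed)
    pos = 0

    def parse_node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '('")
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(parse_node())  # nested constituent
            else:
                children.append(tokens[pos])  # a word (leaf)
                pos += 1
        pos += 1  # consume ')'
        return (label, children)

    return parse_node()


def leaves(tree) -> list[str]:
    """Collect the words of a parsed tree from left to right."""
    _, children = tree
    words = []
    for child in children:
        words.extend(leaves(child) if isinstance(child, tuple) else [child])
    return words


tree = parse_tree("(S (NP (DT the) (NN cat)) (VP (VBD sat)))")
print(tree[0], leaves(tree))  # S ['the', 'cat', 'sat']
```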

18. Corpus Stylistics

Application of corpus methods to literary analysis, identifying patterns in authorial style or genre.

19. Corpus-Informed vs Corpus-Based

Corpus-Informed: Uses a corpus as inspiration for theory.
Corpus-Based: Uses a corpus as evidence to test theory empirically.

20. Data-Driven Learning (DDL)

An approach where learners explore corpus data directly to observe patterns and make inferences, which promotes learner autonomy.

Riaz Laghari
