NLTK Assignment – 3
POS Tagging & Morphological Analysis
Preparation (run once before starting)
import nltk
# 'tagsets' is needed for nltk.help.upenn_tagset; 'wordnet' for the lemmatizer questions
nltk.download('punkt'); nltk.download('averaged_perceptron_tagger'); nltk.download('tagsets')
nltk.download('universal_tagset'); nltk.download('brown'); nltk.download('treebank'); nltk.download('wordnet')
Basic Questions
- Print help for the Penn Treebank tagset for these tags using nltk.help.upenn_tagset: NN, NNS, NNP, VB, VBD, VBG, JJ, RB, IN, DT; for each tag, include the short definition in your output.
- Tag the sentence “Aarav booked tickets for Priya and Rohan yesterday.” with Penn tags using nltk.pos_tag, then re-tag the same tokens with Universal tags using nltk.pos_tag(…, tagset='universal'); print both tag sequences (a minimal sketch of this pattern appears after this list).
- Tokenize the paragraph “Kavya will present the project tomorrow. Arjun objects to one section. Ananya will present again next week.” into sentences and words, then tag each sentence with Penn tags and print the (token, tag) pairs for each sentence.
- Re-tag the sentences from Question 3 with Universal tags and list which Universal tags appear; print their counts using a dictionary.
- Write a function tag_counts(sent_tokens, tagset) that returns a frequency dict of tags for a tokenized sentence; test it on the tokenized sentence ["Ritika", "is", "reading", "in", "Mumbai", "."] for both Penn and Universal tagsets.
- From nltk.corpus.treebank.sents(), take the first 3 sentences, tag them with Penn tags, and also map each Penn tag to its Universal equivalent using nltk.map_tag; print (token, PTB, Universal) triplets.
- From Brown news (brown.tagged_words(categories='news')), compute the top 5 most frequent tags and print them with counts (note that the Brown corpus uses the Brown tagset, not Penn).
- From the tagged Brown news sample, extract and print the first 30 tokens whose tags are proper nouns (NP or NPS in the Brown tagset).
- From treebank.tagged_words(), find 20 surface forms that occur with more than one tag (e.g., “present” as NN/VB); print the token and its distinct tag set.
- Tag the list ["Aarav", "paid", "₹500", "on", "12-08-2024", "."] and print the Penn and Universal tags assigned to each token.
- Pretty-print the tagged sentence from Question 2 as two aligned rows: first row tokens, second row tags; ensure spacing aligns.
- Tag the sentence “Please book a cab for Sanya to the airport” and print the tags assigned to the ambiguous word “book”; explain via a comment which sense the tag implies.
- Using Universal tags, count how many tokens are open-class (NOUN, VERB, ADJ, ADV) vs closed-class (all others) in the sentence “Raghav quickly finished work at the office in Delhi.”; print the ratio open/total.
- Randomly sample 5 sentences from Brown news (random.seed(42)) and compute the proportion of verbs per sentence (Universal VERB; NLTK's universal tagset folds auxiliaries into VERB); print the list of proportions.
- Tag the sentence “Ishan did not attend the meeting” and print the tags for “not” and the adjacent verbs; state in a comment whether “not” is tagged as an adverb.
- For the tagged sentence from Question 2, build and print the first 15 bigrams of (word, tag) pairs.
- Convert the first 500 (word, tag) pairs from treebank.tagged_words() to Universal tags using nltk.map_tag and report how many tags were mapped (should equal 500).
- Tag the sentence “Payment is due by 10:30 am on 15/09/2025” and print the tokens that are numerals (Penn CD or Universal NUM).
- Tag the sentence “ARJUN met arjun at Arjun’s cafe” and then tag the lowercased version; print any tag changes caused by case folding (e.g., NNP→NN).
- Save a JSON list of dictionaries to disk for the sentence in Question 2 with keys {"token": …, "penn": …, "universal": …}.
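A minimal sketch of the Penn-vs-Universal tagging pattern and the tag_counts helper that several of the questions above rely on; the variable names (sent, tokens, penn, universal) are illustrative, not prescribed.

from collections import Counter
from nltk import word_tokenize, pos_tag

sent = "Aarav booked tickets for Priya and Rohan yesterday."
tokens = word_tokenize(sent)
penn = pos_tag(tokens)                           # fine-grained Penn Treebank tags
universal = pos_tag(tokens, tagset='universal')  # coarse Universal tags
print(penn)
print(universal)

def tag_counts(sent_tokens, tagset=None):
    # Frequency dict of tags for an already-tokenized sentence; tagset=None means Penn.
    tagged = pos_tag(sent_tokens, tagset=tagset)
    return dict(Counter(tag for _, tag in tagged))

print(tag_counts(["Ritika", "is", "reading", "in", "Mumbai", "."]))
print(tag_counts(["Ritika", "is", "reading", "in", "Mumbai", "."], tagset='universal'))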
Intermediate Questions
- Create train/dev/test splits from treebank.tagged_sents() by index with an 80/10/10 split (no shuffling): first 80% train, next 10% dev, last 10% test; print the number of sentences in each split.
- Train a nltk.DefaultTagger whose single tag is the most frequent Penn tag in the training data; evaluate accuracy on the dev set and print the accuracy.
- Build a nltk.RegexpTagger with rules that tag numerals (^\d+([.,:]\d+)?$ → CD), punctuation (^[^\w\s]$ → .), determiners (^(the|a|an)$ → DT), and simple verbs ending with -ing → VBG; back off to the DefaultTagger; evaluate on dev and print accuracy.
- Train a nltk.UnigramTagger on train with backoff to your Regexp+Default chain; evaluate on dev and test, printing both accuracies.
- Train a nltk.BigramTagger with backoff to Unigram; evaluate on dev and print accuracy; ensure you handle sparse contexts by using the backoff to avoid exceptions.
- Train a nltk.TrigramTagger with backoff to Bigram→Unigram→Regexp→Default; evaluate on test and print the final accuracy (a sketch of this full backoff chain appears after this list).
- Compute the OOV (out-of-vocabulary) rate of test tokens relative to the training vocabulary and print the OOV rate as a percentage; then compute tagging accuracy on only the OOV tokens using the test gold tags.
- Build a Penn-tag confusion matrix for your best model on the test set using nltk.ConfusionMatrix (gold vs predicted tags) and print the 15 most frequent confusions as pairs with counts.
- Compute per-tag accuracy for the 20 most frequent Penn tags in the test set (correct/total per tag) and print a sorted table by accuracy ascending.
- Map both gold and predicted tags on the test set to Universal tags using nltk.map_tag and recompute accuracy; print Penn vs Universal accuracies side by side.
- Tag tokenized Brown news sentences with pos_tag and train on them, then test on Brown fiction sentences (also tagged with pos_tag as a gold proxy); compute Universal-tag accuracy and print the cross-genre accuracy drop compared to an in-genre split.
- Add an nltk.AffixTagger (suffixes up to length 3) trained on the training data with backoff to Unigram; evaluate on dev and report whether accuracy improved over Unigram alone.
- Print 25 random tagging errors from the test set for your best model (token, gold tag, predicted tag, previous two predicted tags, next token if available) to help error analysis.
- Create a backoff ablation table: evaluate and print accuracies for Default only, Regexp+Default, Unigram→Regexp→Default, Bigram→Unigram→…, and Trigram→Bigram→… using the same splits.
- Implement a simple majority-vote ensemble of Unigram, Bigram, and Trigram predictions (vote on each token; tie-break by backoff order); evaluate on test and print accuracy.
- Compute sentence-level accuracy (share of sentences where all tags are correct) for your best model; print this metric.
- Train Unigram+Bigram+Trigram on {10%, 25%, 50%, 75%, 100%} of the training sentences (by prefix) and print a learning curve table of train size vs dev accuracy.
- On the training data, compute suffix statistics for suffixes {s, ed, ing, ly} (lowercased); for each suffix print the top 3 Penn tags observed with counts.
- Create a morphology-aware RegexpTagger using your suffix rules from Question 18 (e.g., default -ed→VBD unless in a small exception list you define) and insert it at the top of the backoff chain; re-evaluate on dev and print the accuracy change.
- Save the best performing tagger using pickle to best_pos_tagger.pkl and write a short script that loads it and tags the sentence “Aarav and Sanya are planning a trip to Pune next month.”, printing tokens with tags.
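A sketch of one way to wire and evaluate the backoff chain used throughout the intermediate questions, assuming the 80/10/10 index split and an 'NN' default tag; the regular-expression list and variable names are illustrative.

import nltk
from nltk.corpus import treebank

sents = treebank.tagged_sents()
n = len(sents)
train, dev, test = sents[:int(0.8*n)], sents[int(0.8*n):int(0.9*n)], sents[int(0.9*n):]

patterns = [
    (r'^\d+([.,:]\d+)?$', 'CD'),   # numerals
    (r'^[^\w\s]$', '.'),           # single punctuation marks
    (r'^(the|a|an)$', 'DT'),       # determiners
    (r'.*ing$', 'VBG'),            # words ending in -ing
]

default = nltk.DefaultTagger('NN')                      # assumes NN is the most frequent tag
regexp = nltk.RegexpTagger(patterns, backoff=default)
unigram = nltk.UnigramTagger(train, backoff=regexp)
bigram = nltk.BigramTagger(train, backoff=unigram)
trigram = nltk.TrigramTagger(train, backoff=bigram)

for name, tagger in [('default', default), ('regexp', regexp),
                     ('unigram', unigram), ('bigram', bigram), ('trigram', trigram)]:
    print(name, round(tagger.accuracy(dev), 4))         # use tagger.evaluate(dev) on older NLTK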
Advanced Questions
- From treebank.tagged_words(), compute lexical ambiguity for all token types with ≥20 occurrences: number of distinct tags per token; print the top 30 most ambiguous tokens with counts and two example contexts from the corpus for each (show the 3-token window around the token); a sketch of the ambiguity computation appears after this list.
- Implement morphological disambiguation heuristics in a post-processing step for your Unigram predictions: for words ending -s decide NNS vs VBZ using the previous tag (noun vs pronoun subject), for -ed resolve VBD vs VBN using following auxiliary (has/have/had), for -ing resolve VBG vs NN using determiner presence; measure error reduction on tokens affected in the test set.
- Compare three systems on the same test set: (A) rule-only chain (Regexp+Default), (B) Unigram+backoff, (C) Trigram+backoff. For each, print overall accuracy, OOV accuracy, and per-tag accuracies for NN, VB, and JJ, and include a brief code comment summarizing pros/cons.
- Train on treebank and test on raw Brown fiction: tag Brown fiction sentences with your trained model to get predictions, and obtain a “silver” reference by tagging the same sentences with pos_tag (Universal tagset). Align tokens by simple whitespace tokenization and compute Universal-tag accuracy; print the top 10 most confused Universal tag pairs.
- (Optional if available) Train nltk.tag.hmm.HiddenMarkovModelTagger (via its train classmethod) on a subset of treebank.tagged_sents() (e.g., the first 2000 sentences) and compare its test accuracy to your Trigram backoff model on the same split; print both accuracies and 15 example errors.
- (Optional) Train a Brill tagger using nltk.tag.brill_trainer.BrillTaggerTrainer (with a standard template set such as nltk.tag.brill.fntbl37()) initialized with your Unigram backoff tagger; evaluate on test, and print the top 10 learned transformations (pattern → replacement) with their scores.
- Build a Universal-tag confusion heat table (rows gold, cols predicted) normalized by gold counts for your best model; highlight cells where error rate > 20% and print those cell coordinates with values.
- For tokens tagged VBG in the test set, attempt to infer the base form by stripping -ing and applying simple spelling rules (double-consonant handling and final-e restoration); compare guessed lemmas to WordNetLemmatizer().lemmatize(token, 'v') and print precision on the VBG subset.
- For the ambiguous words {set, run, object, present}, compute per-word accuracies under Unigram, Bigram, and Trigram models on test (restrict evaluation to occurrences of these surface forms); print a table of model vs accuracy per word.
- Package your end-to-end POS tagging suite: training, evaluation (overall/OOV/sentence-level), confusion pairs, suffix stats, morphology-aware rules, model serialization, and a demo that tags “Priya and Arjun submitted the final report in Bengaluru today.”; run it and print all metrics and the demo output.
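A sketch of the lexical-ambiguity computation from the first advanced question, restricted to token types with at least 20 occurrences; lowercasing is an assumption here, and the printing of example contexts (the 3-token windows) is left out.

from collections import defaultdict, Counter
from nltk.corpus import treebank

tag_sets = defaultdict(set)
freqs = Counter()
for word, tag in treebank.tagged_words():
    w = word.lower()
    tag_sets[w].add(tag)
    freqs[w] += 1

ambiguous = [(w, len(tags)) for w, tags in tag_sets.items()
             if freqs[w] >= 20 and len(tags) > 1]
ambiguous.sort(key=lambda pair: pair[1], reverse=True)
for w, n_tags in ambiguous[:30]:
    print(w, n_tags, sorted(tag_sets[w]))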
