NLTK Assignment – 2
Corpora, Lexical Resources & Data Handling
Preparation (run once before starting)
import nltk
nltk.download('punkt'); nltk.download('gutenberg'); nltk.download('brown')
nltk.download('reuters'); nltk.download('inaugural'); nltk.download('stopwords')
nltk.download('wordnet'); nltk.download('omw-1.4')
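The questions below also assume the following imports; this is only a convenience sketch collecting the NLTK modules referenced throughout (all names are NLTK's own):

import nltk
from nltk import FreqDist, ConditionalFreqDist, pos_tag
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import gutenberg, brown, reuters, inaugural, stopwords
from nltk.corpus import wordnet as wn
from nltk.corpus import PlaintextCorpusReader
from nltk.stem import WordNetLemmatizer
from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures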
Basic Questions
- Load the Gutenberg corpus and list all file IDs using nltk.corpus.gutenberg.fileids(); then, for ['austen-emma.txt', 'melville-moby_dick.txt'], print token count, sentence count (using sent_tokenize), and vocabulary size computed from lowercased alphabetic tokens.
- Using Gutenberg file 'austen-emma.txt', compute average word length (alphabetic tokens only) and average sentence length in tokens; print both values rounded to two decimals.
- Using Gutenberg file 'austen-emma.txt', print 25 KWIC concordance lines for the word “marriage” (case-insensitive) with a window of 50 characters; ensure you tokenize to words first and join back to raw string for KWIC display.
- Using Gutenberg file 'melville-moby_dick.txt', find the first 30 token indices where the lowercased token equals “whale”; print those indices.
- Using Gutenberg file 'carroll-alice.txt', compute the top 20 bigram collocations by PMI with BigramCollocationFinder (alphabetic tokens of length ≥ 3, lowercased, stopwords removed via nltk.corpus.stopwords.words('english')); print the pairs and PMI scores (a minimal BigramCollocationFinder sketch appears after this list).
- List Brown corpus categories via nltk.corpus.brown.categories() and, for ['news', 'romance'], print total token counts and type–token ratio (TTR) from lowercased alphabetic tokens.
- From the Reuters corpus, print the total number of documents via len(reuters.fileids()), list the first 10 categories with reuters.categories(), and print the number of docs in categories 'crude' and 'trade'.
- From the Inaugural corpus, list inaugural.fileids(), extract the year (first four characters of each file id), sort the years ascending, and print the sorted unique years.
- For Gutenberg file 'shakespeare-hamlet.txt', build nltk.FreqDist over lowercased alphabetic tokens and print the 30 most common tokens with counts.
- Re-run Question 9 but remove English stopwords (nltk.corpus.stopwords.words('english')) and punctuation; print the 30 most common remaining tokens.
- Build a ConditionalFreqDist from the Brown corpus for two genres 'news' and 'romance', conditioning on genre and counting lowercased alphabetic tokens; print the 20 most common tokens per genre.
- Show 10 tokens from any corpus (you choose) where .casefold() differs from .lower(); print each token and its transformed versions to illustrate Unicode casefolding.
- Using WordNet, list all noun synsets for the lemma “bank” with their synset names, definitions, and up to 5 lemma names per synset; print results clearly.
- Using WordNet synset good.a.01, list all lemma names and collect any available antonyms across those lemmas; print unique antonyms.
- Using WordNet synset dog.n.01, print the names of all direct hypernyms and the first 15 hyponyms (lemma names only).
- Compute WordNet path similarity and Wu–Palmer (wup) similarity for pairs (car.n.01, automobile.n.01) and (car.n.01, journey.n.01); print both metrics for both pairs.
- Using WordNetLemmatizer, lemmatize the tokens “better/JJ”, “running/VBG”, “mice/NNS”, and “leaves/NNS” by mapping each Penn tag to the corresponding WordNet POS ('a', 'v', 'n') and passing it to .lemmatize(); print token → lemma mappings.
- Attempt to find WordNet synsets for the concepts “social_media”, “social medium”, and “social network” (as nouns); print which queries return synsets and which do not.
- For the string “run”, count how many WordNet synsets exist per POS (noun, verb, adj, adv) and print the counts by POS.
- Create a folder on disk named my_corpus containing three UTF-8 text files you author (e.g., note1.txt, note2.txt, note3.txt) with short passages that use only Indian names such as Aarav, Priya, Rohan, Kavya, and Arjun; load them using nltk.corpus.PlaintextCorpusReader('my_corpus', r'.*\.txt') and print file IDs and total token count.
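For reference, a minimal sketch of the Carroll “Alice” collocation question above, assuming the downloads from Preparation; the filtering (lowercased alphabetic tokens of length ≥ 3, English stopwords removed) follows the question text, and the variable names are illustrative only:

from nltk.corpus import gutenberg, stopwords
from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures

stop = set(stopwords.words('english'))
# keep lowercased alphabetic tokens of length >= 3 that are not stopwords
tokens = [w.lower() for w in gutenberg.words('carroll-alice.txt')
          if w.isalpha() and len(w) >= 3 and w.lower() not in stop]
finder = BigramCollocationFinder.from_words(tokens)
measures = BigramAssocMeasures()
for pair, score in finder.score_ngrams(measures.pmi)[:20]:  # top 20 bigrams by PMI
    print(pair, round(score, 2))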
Intermediate Questions
- For Gutenberg 'austen-emma.txt' and Brown genre 'news', build FreqDist of the top 200 lowercased alphabetic tokens each; print the size of the overlap set and list 30 tokens unique to each source.
- Using Brown genres ['news', 'romance', 'government'], compute the 20 most frequent bigrams (alphabetic, lowercased) for each genre and store them in a ConditionalFreqDist keyed by genre; print the top 10 bigrams per genre.
- In Reuters, compute the category distribution: a dict of {category: document_count} across all categories; print the top 10 categories by document count.
- On all Reuters words (concatenate all docs), compute top 25 PMI bigram collocations after removing stopwords, punctuation, digits, and tokens with length < 3; print bigram and PMI score.
- Compare Brown genres 'news' vs 'romance' by computing per-token relative frequency (after stopword removal) and print the top 20 tokens overrepresented in each genre using simple log-ratio or proportion difference.
- Implement a function kwic(text_tokens, keyword, width=35) that prints KWIC lines from the token list; demonstrate it on Gutenberg 'austen-sense.txt' for the keyword “love” (lowercased). One possible shape of kwic() is sketched after this list.
- Build a 5×5 pairwise WUP similarity table for WordNet noun synsets of lemmas ['car', 'bus', 'bicycle', 'road', 'gasoline'] (use synsets(word, pos='n')[0] as the default sense for each); print a formatted matrix.
- For bird.n.01, print all hypernym paths up to entity.n.01, then compute and print the lowest common hypernym between bird.n.01 and mammal.n.01.
- For adjectives ['hot', 'cold', 'happy', 'sad', 'fast', 'slow'], list lemma names and any antonyms available; print which adjectives have no antonym links in WordNet.
- For set.n.01 and set.v.01, print up to eight lemma names each and then print the total number of synsets in WordNet for the surface form “set” across all POS.
- Group Inaugural speeches by decade using the year from inaugural.fileids(); within each decade, compute the top 10 bigram collocations (alphabetic, stopwords removed) using PMI; print results per decade.
- Using Brown, compute the proportion of stopwords vs non-stopwords per genre (lowercased alphabetic tokens); print a table sorted by the proportion of stopwords descending.
- For Gutenberg 'milton-paradise.txt', build FreqDist of lowercased alphabetic tokens and print the cumulative frequency coverage of the top 50 tokens (as a proportion between 0 and 1).
- In Gutenberg 'shakespeare-macbeth.txt', detect acts by lines starting with 'ACT '; for names ['Macbeth', 'Lady', 'Banquo'] (exact surface forms), compute a conditional frequency of name × act and print the table.
- Tag tokens from Gutenberg 'chesterton-thursday.txt' with nltk.pos_tag (Penn tagset) and build a ConditionalFreqDist of word frequencies conditioned on POS prefix (NN, VB, JJ, RB, etc.); print the top 10 tokens per POS prefix.
- In Reuters, pick categories ['crude', 'trade', 'grain', 'money-fx', 'interest']. For each category, clean tokens (lowercased alphabetic, stopwords removed) and print reduction ratio: cleaned_token_count / raw_token_count.
- Using the custom corpus loaded from my_corpus (created in Basic Q20), implement a concordance search across all files for the keyword “Aarav” (case-insensitive) and print 20 KWIC lines with a window of 40 characters.
- For 10 frequent ambiguous words ['run', 'set', 'bank', 'line', 'light', 'fair', 'right', 'left', 'object', 'present'], print the count of synsets per POS from WordNet, and for each POS print the most common lemma (first in synset) as a quick proxy.
- For synsets ['animal.n.01', 'vehicle.n.01', 'plant.n.02', 'instrumentality.n.03', 'artifact.n.01'], compute the number of descendant hyponyms by traversing synset.closure(lambda s: s.hyponyms()); print a sorted list of (synset_name, descendant_count).
- For Reuters category 'crude' and Brown genre 'news', compute the top 25 non-stopword tokens (lowercased alphabetic) from each and print the set difference tokens unique to each source.
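One possible shape for the kwic() helper referenced above; it reads width as a character window on each side of the keyword (one reasonable interpretation of the question), and the 20-token slice is just a generous context buffer:

from nltk.corpus import gutenberg

def kwic(text_tokens, keyword, width=35):
    # print keyword-in-context lines with `width` characters of context per side
    keyword = keyword.lower()
    for i, tok in enumerate(text_tokens):
        if tok.lower() == keyword:
            left = ' '.join(text_tokens[max(0, i - 20):i])[-width:]
            right = ' '.join(text_tokens[i + 1:i + 21])[:width]
            print(f'{left:>{width}} {tok} {right}')

kwic(list(gutenberg.words('austen-sense.txt')), 'love')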
Advanced Questions
- Build a four-part (2×2) “genre keyness” report comparing Brown 'news' vs 'romance': (A) top 30 key terms per genre using log-likelihood or log-ratio on cleaned tokens; (B) top bigram PMI collocations per genre; (C) conditional frequency of POS prefixes (NN, VB, JJ, RB) per genre; (D) 3 KWIC lines each for two key terms you select; print clean textual tables (no plots required).
- For Reuters categories ['trade', 'crude', 'grain', 'money-fx', 'interest'], compute: (i) top 25 cleaned tokens per category, (ii) top 10 PMI bigrams per category, and (iii) a ConditionalFreqDist of year (parse from fileid prefix) × category for the token “oil”; print all three outputs clearly.
- In Inaugural, for lemmas ['freedom', 'economy', 'war', 'peace', 'America'], lemmatize (verb/noun appropriately) and build a ConditionalFreqDist of decade × lemma counts; print the top 5 decades for each lemma.
- Create a list of 20 transport-related nouns (e.g., car, bus, train, aircraft, bicycle, scooter, ship, taxi, metro, highway, bridge, truck, auto_rickshaw, motorcycle, ferry, tram, airport, runway, station, ticket) and map each to its most common WordNet noun synset; compute all pairwise WUP similarities and save edges with similarity ≥ 0.9 to a CSV with columns u,v,wup. A sketch of the pairwise WUP/CSV step appears after this list.
- For nouns ['dog', 'cat', 'whale', 'eagle', 'salmon'], compute the maximum hypernym path depth (distance to root) and print a sorted table; also print the lowest common hypernym for pairs (dog, cat) and (eagle, salmon).
- In Gutenberg 'austen-emma.txt', POS-tag tokens and extract adjective–noun bigrams (JJ NN) before computing PMI; print the top 30 adjective–noun collocations after stopword removal.
- Using your custom corpus my_corpus, build a document–term matrix (documents × terms) from cleaned tokens (min_df ≥ 2); print the top 15 terms per document and the global top 30 terms with counts.
- For Brown genres 'romance' and 'government', extract key bigrams and trigrams (PMI) that are exclusive to each genre (appear in one but not the other after thresholding); print 20 exclusive phrases per genre.
- For 15 polysemous nouns of your choice, choose the pairwise most similar senses by maximizing WUP over all sense combinations; print a table (w1, w2, best_sense1, best_sense2, wup) for 15 interesting pairs.
- Build a labeled mini-corpus on disk under data_labels/{pos,neg,neutral}/ with 10 UTF-8 files per label (use Indian names like Aarav, Priya, Rohan in the text). Load with PlaintextCorpusReader, compute per-label FreqDist, top PMI bigrams, and save a JSON metadata file containing counts, vocabulary size per label, and the top 20 terms per label.
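A sketch of the pairwise WUP/CSV step for the transport-noun question above; the word list is abbreviated here (extend it to the full 20), the most common sense is approximated by the first noun synset as in the 5×5 table question, and the output filename wup_edges.csv is only a placeholder:

import csv
from itertools import combinations
from nltk.corpus import wordnet as wn

words = ['car', 'bus', 'train', 'bicycle', 'ship', 'taxi', 'truck', 'ferry']  # abbreviated list
# first noun synset as a proxy for the most common sense; skip words with no noun synset
senses = {w: wn.synsets(w, pos='n')[0] for w in words if wn.synsets(w, pos='n')}

with open('wup_edges.csv', 'w', newline='', encoding='utf-8') as f:  # placeholder filename
    writer = csv.writer(f)
    writer.writerow(['u', 'v', 'wup'])
    for u, v in combinations(senses, 2):
        wup = senses[u].wup_similarity(senses[v])
        if wup is not None and wup >= 0.9:  # keep only high-similarity edges
            writer.writerow([u, v, round(wup, 3)])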