NLTK Assignment – 2
Corpora, Lexical Resources & Data Handling
Preparation (run once before starting)
import nltk
nltk.download('punkt'); nltk.download('gutenberg'); nltk.download('brown')
nltk.download('reuters'); nltk.download('inaugural'); nltk.download('stopwords')
nltk.download('wordnet'); nltk.download('omw-1.4')
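The questions below also assume the following imports; this is only a convenience sketch collecting the NLTK modules referenced throughout (all names are NLTK's own):

import nltk
from nltk import FreqDist, ConditionalFreqDist, pos_tag
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import gutenberg, brown, reuters, inaugural, stopwords
from nltk.corpus import wordnet as wn
from nltk.corpus import PlaintextCorpusReader
from nltk.stem import WordNetLemmatizer
from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures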
Basic Questions
- Load the Gutenberg corpus and list all file IDs using nltk.corpus.gutenberg.fileids(); then, for ['austen-emma.txt', 'melville-moby_dick.txt'], print token count, sentence count (using sent_tokenize), and vocabulary size computed from lowercased alphabetic tokens.
- Using Gutenberg file 'austen-emma.txt', compute average word length (alphabetic tokens only) and average sentence length in tokens; print both values rounded to two decimals.
- Using Gutenberg file 'austen-emma.txt', print 25 KWIC concordance lines for the word “marriage” (case-insensitive) with a window of 50 characters; ensure you tokenize to words first and join back to raw string for KWIC display.
- Using Gutenberg file 'melville-moby_dick.txt', find the first 30 token indices where the lowercased token equals “whale”; print those indices.
- Using Gutenberg file 'carroll-alice.txt', compute the top 20 bigram collocations by PMI with BigramCollocationFinder (alphabetic tokens of length ≥ 3, lowercased, stopwords removed via nltk.corpus.stopwords.words('english')); print the pairs and PMI scores (a minimal BigramCollocationFinder sketch appears after this list).
- List Brown corpus categories via nltk.corpus.brown.categories() and, for ['news', 'romance'], print total token counts and type–token ratio (TTR) from lowercased alphabetic tokens.
- From the Reuters corpus, print the total number of documents via len(reuters.fileids()), list the first 10 categories with reuters.categories(), and print the number of docs in categories 'crude' and 'trade'.
- From the Inaugural corpus, list inaugural.fileids(), extract the year (first four characters of each file id), sort the years ascending, and print the sorted unique years.
- For Gutenberg file 'shakespeare-hamlet.txt', build nltk.FreqDist over lowercased alphabetic tokens and print the 30 most common tokens with counts.
- Re-run Question 9 but remove English stopwords (nltk.corpus.stopwords.words('english')) and punctuation; print the 30 most common remaining tokens.
- Build a ConditionalFreqDist from the Brown corpus for two genres 'news' and 'romance', conditioning on genre and counting lowercased alphabetic tokens; print the 20 most common tokens per genre.
- Show 10 tokens from any corpus (you choose) where .casefold() differs from .lower(); print each token and its transformed versions to illustrate Unicode casefolding.
- Using WordNet, list all noun synsets for the lemma “bank” with their synset names, definitions, and up to 5 lemma names per synset; print results clearly.
- Using WordNet synset good.a.01, list all lemma names and collect any available antonyms across those lemmas; print unique antonyms.
- Using WordNet synset dog.n.01, print the names of all direct hypernyms and the first 15 hyponyms (lemma names only).
- Compute WordNet path similarity and Wu–Palmer (wup) similarity for pairs (car.n.01, automobile.n.01) and (car.n.01, journey.n.01); print both metrics for both pairs.
- Using WordNetLemmatizer, lemmatize the tokens “better/JJ”, “running/VBG”, “mice/NNS”, and “leaves/NNS” by mapping each Penn tag to the corresponding WordNet POS ('a', 'v', 'n') and passing it to .lemmatize(); print token → lemma mappings.
- Attempt to find WordNet synsets for the concepts “social_media”, “social medium”, and “social network” (as nouns); print which queries return synsets and which do not.
- For the string “run”, count how many WordNet synsets exist per POS (noun, verb, adj, adv) and print the counts by POS.
- Create a folder on disk named my_corpus containing three UTF-8 text files you author (e.g., note1.txt, note2.txt, note3.txt) with short passages that use only Indian names such as Aarav, Priya, Rohan, Kavya, and Arjun; load them using nltk.corpus.PlaintextCorpusReader('my_corpus', r'.*\.txt') and print file IDs and total token count.
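For reference, a minimal sketch of the Carroll “Alice” collocation question above, assuming the downloads from Preparation; the filtering (lowercased alphabetic tokens of length ≥ 3, English stopwords removed) follows the question text, and the variable names are illustrative only:

from nltk.corpus import gutenberg, stopwords
from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures

stop = set(stopwords.words('english'))
# keep lowercased alphabetic tokens of length >= 3 that are not stopwords
tokens = [w.lower() for w in gutenberg.words('carroll-alice.txt')
          if w.isalpha() and len(w) >= 3 and w.lower() not in stop]
finder = BigramCollocationFinder.from_words(tokens)
measures = BigramAssocMeasures()
for pair, score in finder.score_ngrams(measures.pmi)[:20]:  # top 20 bigrams by PMI
    print(pair, round(score, 2))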
Intermediate Questions
- For Gutenberg 'austen-emma.txt' and Brown genre 'news', build FreqDist of the top 200 lowercased alphabetic tokens each; print the size of the overlap set and list 30 tokens unique to each source.
- Using Brown genres ['news', 'romance', 'government'], compute the 20 most frequent bigrams (alphabetic, lowercased) for each genre and store them in a ConditionalFreqDist keyed by genre; print the top 10 bigrams per genre.
- In Reuters, compute the category distribution: a dict of {category: document_count} across all categories; print the top 10 categories by document count.
- On all Reuters words (concatenate all docs), compute top 25 PMI bigram collocations after removing stopwords, punctuation, digits, and tokens with length < 3; print bigram and PMI score.
- Compare Brown genres 'news' vs 'romance' by computing per-token relative frequency (after stopword removal) and print the top 20 tokens overrepresented in each genre using simple log-ratio or proportion difference.
- Implement a function kwic(text_tokens, keyword, width=35) that prints KWIC lines from the token list; demonstrate it on Gutenberg 'austen-sense.txt' for the keyword “love” (lowercased). One possible shape of kwic() is sketched after this list.
- Build a 5×5 pairwise WUP similarity table for WordNet noun synsets of lemmas ['car', 'bus', 'bicycle', 'road', 'gasoline'] (use synsets(word, pos='n')[0] as the default sense for each); print a formatted matrix.
- For bird.n.01, print all hypernym paths up to entity.n.01, then compute and print the lowest common hypernym between bird.n.01 and mammal.n.01.
- For adjectives ['hot', 'cold', 'happy', 'sad', 'fast', 'slow'], list lemma names and any antonyms available; print which adjectives have no antonym links in WordNet.
- For set.n.01 and set.v.01, print up to eight lemma names each and then print the total number of synsets in WordNet for the surface form “set” across all POS.
- Group Inaugural speeches by decade using the year from inaugural.fileids(); within each decade, compute the top 10 bigram collocations (alphabetic, stopwords removed) using PMI; print results per decade.
- Using Brown, compute the proportion of stopwords vs non-stopwords per genre (lowercased alphabetic tokens); print a table sorted by the proportion of stopwords descending.
- For Gutenberg 'milton-paradise.txt', build FreqDist of lowercased alphabetic tokens and print the cumulative frequency coverage of the top 50 tokens (as a proportion between 0 and 1).
- In Gutenberg 'shakespeare-macbeth.txt', detect acts by lines starting with 'ACT '; for names ['Macbeth', 'Lady', 'Banquo'] (exact surface forms), compute a conditional frequency of name × act and print the table.
- Tag tokens from Gutenberg 'chesterton-thursday.txt' with nltk.pos_tag (Penn tagset) and build a ConditionalFreqDist of word frequencies conditioned on POS prefix (NN, VB, JJ, RB, etc.); print the top 10 tokens per POS prefix.
- In Reuters, pick categories ['crude', 'trade', 'grain', 'money-fx', 'interest']. For each category, clean tokens (lowercased alphabetic, stopwords removed) and print reduction ratio: cleaned_token_count / raw_token_count.
- Using the custom corpus loaded from my_corpus (created in Basic Q20), implement a concordance search across all files for the keyword “Aarav” (case-insensitive) and print 20 KWIC lines with a window of 40 characters.
- For 10 frequent ambiguous words ['run', 'set', 'bank', 'line', 'light', 'fair', 'right', 'left', 'object', 'present'], print the count of synsets per POS from WordNet, and for each POS print the most common lemma (first in synset) as a quick proxy.
- For synsets ['animal.n.01', 'vehicle.n.01', 'plant.n.02', 'instrumentality.n.03', 'artifact.n.01'], compute the number of descendant hyponyms by traversing synset.closure(lambda s: s.hyponyms()); print a sorted list of (synset_name, descendant_count).
- For Reuters category 'crude' and Brown genre 'news', compute the top 25 non-stopword tokens (lowercased alphabetic) from each and print the set difference tokens unique to each source.
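One possible shape for the kwic() helper referenced above; it reads width as a character window on each side of the keyword (one reasonable interpretation of the question), and the 20-token slice is just a generous context buffer:

from nltk.corpus import gutenberg

def kwic(text_tokens, keyword, width=35):
    # print keyword-in-context lines with `width` characters of context per side
    keyword = keyword.lower()
    for i, tok in enumerate(text_tokens):
        if tok.lower() == keyword:
            left = ' '.join(text_tokens[max(0, i - 20):i])[-width:]
            right = ' '.join(text_tokens[i + 1:i + 21])[:width]
            print(f'{left:>{width}} {tok} {right}')

kwic(list(gutenberg.words('austen-sense.txt')), 'love')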
Advanced Questions
- Build a four-part (2×2) “genre keyness” report comparing Brown 'news' vs 'romance': (A) top 30 key terms per genre using log-likelihood or log-ratio on cleaned tokens; (B) top bigram PMI collocations per genre; (C) conditional frequency of POS prefixes (NN, VB, JJ, RB) per genre; (D) 3 KWIC lines each for two key terms you select; print clean textual tables (no plots required).
- For Reuters categories ['trade', 'crude', 'grain', 'money-fx', 'interest'], compute: (i) top 25 cleaned tokens per category, (ii) top 10 PMI bigrams per category, and (iii) a ConditionalFreqDist of year (parse from fileid prefix) × category for the token “oil”; print all three outputs clearly.
- In Inaugural, for lemmas ['freedom', 'economy', 'war', 'peace', 'America'], lemmatize (verb/noun appropriately) and build a ConditionalFreqDist of decade × lemma counts; print the top 5 decades for each lemma.
- Create a list of 20 transport-related nouns (e.g., car, bus, train, aircraft, bicycle, scooter, ship, taxi, metro, highway, bridge, truck, auto_rickshaw, motorcycle, ferry, tram, airport, runway, station, ticket) and map each to its most common WordNet noun synset; compute all pairwise WUP similarities and save edges with similarity ≥ 0.9 to a CSV with columns u,v,wup. A sketch of the pairwise WUP/CSV step appears after this list.
- For nouns ['dog', 'cat', 'whale', 'eagle', 'salmon'], compute the maximum hypernym path depth (distance to root) and print a sorted table; also print the lowest common hypernym for pairs (dog, cat) and (eagle, salmon).
- In Gutenberg 'austen-emma.txt', POS-tag tokens and extract adjective–noun bigrams (JJ NN) before computing PMI; print the top 30 adjective–noun collocations after stopword removal.
- Using your custom corpus my_corpus, build a document–term matrix (documents × terms) from cleaned tokens (min_df ≥ 2); print the top 15 terms per document and the global top 30 terms with counts.
- For Brown genres 'romance' and 'government', extract key bigrams and trigrams (PMI) that are exclusive to each genre (appear in one but not the other after thresholding); print 20 exclusive phrases per genre.
- For 15 polysemous nouns of your choice, choose the pairwise most similar senses by maximizing WUP over all sense combinations; print a table (w1, w2, best_sense1, best_sense2, wup) for 15 interesting pairs.
- Build a labeled mini-corpus on disk under data_labels/{pos,neg,neutral}/ with 10 UTF-8 files per label (use Indian names like Aarav, Priya, Rohan in the text). Load with PlaintextCorpusReader, compute per-label FreqDist, top PMI bigrams, and save a JSON metadata file containing counts, vocabulary size per label, and the top 20 terms per label.
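A sketch of the pairwise WUP/CSV step for the transport-noun question above; the word list is abbreviated here (extend it to the full 20), the most common sense is approximated by the first noun synset as in the 5×5 table question, and the output filename wup_edges.csv is only a placeholder:

import csv
from itertools import combinations
from nltk.corpus import wordnet as wn

words = ['car', 'bus', 'train', 'bicycle', 'ship', 'taxi', 'truck', 'ferry']  # abbreviated list
# first noun synset as a proxy for the most common sense; skip words with no noun synset
senses = {w: wn.synsets(w, pos='n')[0] for w in words if wn.synsets(w, pos='n')}

with open('wup_edges.csv', 'w', newline='', encoding='utf-8') as f:  # placeholder filename
    writer = csv.writer(f)
    writer.writerow(['u', 'v', 'wup'])
    for u, v in combinations(senses, 2):
        wup = senses[u].wup_similarity(senses[v])
        if wup is not None and wup >= 0.9:  # keep only high-similarity edges
            writer.writerow([u, v, round(wup, 3)])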