NLTK Assignment – 7
Language Modeling & Applications
Preparation (run once before starting)
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('brown')
Toy sentences (use exactly these 12 for all “toy corpus” tasks below; keep punctuation when tokenizing):
toy_sentences = [
"Aarav likes samosa.",
"Priya loves filter coffee.",
"Rohan travels to Delhi by train.",
"Kavya studies data science in Pune.",
"Arjun works at TCS in Mumbai.",
"Ananya visits Bengaluru every month.",
"Sanya enjoys cricket with Ishan.",
"Neha buys a ticket from IRCTC.",
"Dev books a cab in Chennai.",
"Pooja reads a new book at home.",
"Ritika orders lunch online.",
"Ishan plays badminton in Hyderabad."
]
Stopwords (English) for filtering where asked: from nltk.corpus import stopwords; STOP = set(stopwords.words('english'))
Random seed to use wherever sampling is required: seed = 42
Basic Questions
1. Using the Brown corpus news category (nltk.corpus.brown.words(categories='news')), lowercase and keep only alphabetic tokens (str.isalpha()), build a unigram FreqDist; print the total token count, the vocabulary size, and the 20 most common tokens with counts.
2. From the same Brown news tokens in 1, build a bigram ConditionalFreqDist (CFD) over adjacent tokens; for the contexts "the", "of", and "in", print the top 10 most frequent next words for each context.
3. Convert the unigram FreqDist from 1 into an MLE probability distribution (nltk.probability.MLEProbDist); print the probabilities of 'the', 'of', 'to', and 'india' (all lowercase). If a term is unseen, print 0.0.
4. From the bigram counts in 2, create a ConditionalProbDist using MLEProbDist. For the contexts "the", "of", and "to", print the probability of the next word 'government' (lowercase) under each context.
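Hint: a minimal sketch of turning the counts into MLE distributions for 3 and 4 (assumes uni_fd and bi_cfd from the previous hint):
from nltk.probability import MLEProbDist, ConditionalProbDist

uni_pd = MLEProbDist(uni_fd)                    # unigram MLE distribution
for w in ['the', 'of', 'to', 'india']:
    print(w, uni_pd.prob(w))                    # prob() returns 0.0 for unseen samples

bi_cpd = ConditionalProbDist(bi_cfd, MLEProbDist)   # one MLEProbDist per context
for ctx in ['the', 'of', 'to']:
    print(ctx, bi_cpd[ctx].prob('government'))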
5. Using the Brown news bigram ConditionalProbDist from 4, generate 25 tokens of text by starting with the context token "the" and repeatedly sampling the next word from the conditional distribution (use random.Random(42) for reproducibility). Join with spaces and print the generated sequence.
6. Rebuild the bigram model from 2, but with Laplace smoothing on the conditionals (ConditionalProbDist with LaplaceProbDist). Print the smoothed probability of 'india' after the context "in" and compare it to the unsmoothed value from 4.
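Hint: a minimal sketch of reproducible sampling for 5 and of the Laplace-smoothed model for 6 (assumes uni_fd, bi_cfd, bi_cpd from the earlier hints):
import random
from nltk.probability import ConditionalProbDist, LaplaceProbDist

def sample_next(cpd, context, rng):
    dist = cpd[context]
    words = list(dist.samples())                # only observed continuations
    weights = [dist.prob(w) for w in words]
    return rng.choices(words, weights=weights, k=1)[0]

rng = random.Random(42)
ctx, out = 'the', []
for _ in range(25):
    ctx = sample_next(bi_cpd, ctx, rng)
    out.append(ctx)
print(' '.join(out))

# Laplace-smoothed conditionals: pass the vocabulary size as the bins argument
bi_cpd_lap = ConditionalProbDist(bi_cfd, LaplaceProbDist, len(uni_fd))
print(bi_cpd_lap['in'].prob('india'), bi_cpd['in'].prob('india'))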
7. Build a trigram model on Brown news: create bigram contexts (w_{i-2}, w_{i-1}) → w_i via a ConditionalFreqDist keyed by bigram tuples. Use MLEProbDist to make a ConditionalProbDist. Starting from the seed ("in", "the"), generate 30 tokens greedily by always choosing the most probable next word, and print the sequence.
8. On the Brown news tokens (lowercase, alphabetic), compute bigram collocations using nltk.collocations.BigramCollocationFinder.from_words with stopwords removed (use STOP); print the top 20 PMI bigrams with their PMI scores (rounded to 3 decimals).
9. Repeat 8, but rank bigrams by the t-test association (BigramAssocMeasures.student_t); print the top 20 with scores (rounded to 3 decimals).
10. Repeat 8, but rank by chi-square (BigramAssocMeasures.chi_sq); print the top 20 with scores (rounded to 3 decimals).
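Hint: a minimal sketch of the collocation finder; swapping the association measure covers 8–10 (assumes tokens and STOP from earlier):
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)
finder.apply_word_filter(lambda w: w in STOP)   # drop bigrams containing a stopword

for name, measure in [('pmi', measures.pmi),
                      ('t', measures.student_t),
                      ('chi_sq', measures.chi_sq)]:
    top = finder.score_ngrams(measure)[:20]
    print(name, [(bg, round(score, 3)) for bg, score in top])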
11. Tokenize toy_sentences using nltk.word_tokenize (do not lowercase for this question). Build a unigram FreqDist and print the frequencies of the tokens 'Aarav', 'Delhi', 'IRCTC', and 'samosa'.
12. Using the toy corpus tokens (this time lowercase, alphabetic only), build a bigram ConditionalFreqDist. For the contexts 'in' and 'to', print the sorted list of next words with their counts.
13. On the toy corpus, create a bigram ConditionalProbDist with Laplace smoothing and write a function generate_bigram(seed_word, n=15) that starts at seed_word and samples n words. With the seed word 'priya' (lowercase), generate 15 words using random.Random(42) and print the output.
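Hint: a minimal sketch of the toy-corpus bigram generator for 13 (Laplace smoothing over the toy vocabulary; this version samples only among observed continuations):
import random
import nltk
from nltk import ConditionalFreqDist, bigrams
from nltk.probability import ConditionalProbDist, LaplaceProbDist

toy_tokens = [w.lower() for s in toy_sentences for w in nltk.word_tokenize(s) if w.isalpha()]
toy_cfd = ConditionalFreqDist(bigrams(toy_tokens))
toy_cpd = ConditionalProbDist(toy_cfd, LaplaceProbDist, len(set(toy_tokens)))

def generate_bigram(seed_word, n=15, rng=None):
    rng = rng or random.Random(42)
    word, out = seed_word, []
    for _ in range(n):
        dist = toy_cpd[word]
        words = list(dist.samples())
        if not words:                           # dead end: the word never occurs as a context
            break
        word = rng.choices(words, weights=[dist.prob(w) for w in words], k=1)[0]
        out.append(word)
    return ' '.join(out)

print(generate_bigram('priya'))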
14. On the toy corpus, build a trigram ConditionalProbDist (MLE) and generate a 20-token sequence by starting with the seed bigram ('rohan', 'travels') (lowercase) and sampling each next word using random.Random(42); print the generated sequence.
15. On the toy corpus, compute bigram PMI collocations after removing stopwords (lowercase tokens); print the top 10 bigrams with PMI scores.
16. Compute a simple sentence probability under the Brown bigram MLE model (no smoothing) for the sentence "aarav travels to delhi by train" (all lowercase, alphabetic tokens). If any bigram is unseen, treat its probability as 0 and print 0.0. Otherwise, print the product probability in scientific notation.
17. Recompute the probability in 16 under Laplace-smoothed bigrams (add-1) and print the (non-zero) value in scientific notation.
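Hint: a minimal sketch of the product-of-bigrams computation for 16 and 17 (assumes bi_cpd and bi_cpd_lap from earlier hints; whether to include a factor for the first token is a convention you should state):
from nltk import bigrams

def sentence_prob_bigram(sent, cpd):
    p = 1.0
    for w1, w2 in bigrams(sent.split()):
        p *= cpd[w1].prob(w2)                   # 0.0 propagates if a bigram is unseen (MLE)
    return p

s = "aarav travels to delhi by train"
print('{:.3e}'.format(sentence_prob_bigram(s, bi_cpd)))      # MLE
print('{:.3e}'.format(sentence_prob_bigram(s, bi_cpd_lap)))  # Laplace (add-1)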
18. Using the Brown unigram FreqDist, compute and print the perplexity of the token sequence ["the", "government", "in", "india"] (lowercase), using unigram probabilities with MLE. If any token is unseen, report perplexity: inf.
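Hint: the perplexity of w1…wN under a unigram model is (∏ P(wi))^(-1/N); a minimal sketch (assumes uni_pd from earlier):
import math

def unigram_perplexity(words, pd):
    logp = 0.0
    for w in words:
        p = pd.prob(w)
        if p == 0.0:
            return float('inf')                 # unseen token -> infinite perplexity
        logp += math.log(p)
    return math.exp(-logp / len(words))

print('perplexity:', unigram_perplexity(["the", "government", "in", "india"], uni_pd))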
19. Build a character 3-gram count model on the single toy sentence "Aarav likes samosa." (lowercased; include spaces and the period) and print the top 10 most frequent character 3-grams with counts.
20. Save to disk (as a small JSON file) the following items derived from the Brown news model: the 20 most common unigrams with normalized probabilities, and, for the context "in", the top 20 next-word probabilities from the Laplace-smoothed bigram model.
Intermediate Questions
1. Create a unigram class that wraps FreqDist and exposes prob(w) for both MLE and Laplace-smoothed probabilities. Use it on Brown news and print prob('india') under both MLE and Laplace.
2. Build interpolated trigram probabilities on Brown news: P(w3 ∣ w1, w2) = λ3⋅MLE_trigram + λ2⋅MLE_bigram + λ1⋅MLE_unigram, with (λ3, λ2, λ1) = (0.6, 0.3, 0.1). Compute this value for the contexts ("in", "the") → 'city', ("of", "the") → 'government', and ("to", "the") → 'market', and print the three probabilities (rounded to 6 decimals).
3. Implement sentence generation from the interpolated model in 2 with a maximum length of 30, seeded by ("in", "the"); sample at each step with random.Random(42) and stop if a period '.' is generated. Print the generated sentence.
4. Using Brown news, compute a ConditionalFreqDist of trigrams mapping (w1, w2) to w3 counts. For the contexts ("in", "the") and ("at", "the"), print the top 10 w3 continuations with counts.
5. On Brown news (lowercase alphabetic), compute trigram PMI using TrigramAssocMeasures.pmi via TrigramCollocationFinder. Filter tokens to length ≥ 3 and remove stopwords. Print the top 20 trigrams with PMI (rounded to 3 decimals).
6. For the toy corpus, compute bigram chi-square collocations and print the top 10 with scores. Also print whether ('to', 'delhi') appears in the ranked list.
7. Implement Jelinek–Mercer interpolation for bigrams on Brown news: P(w2 ∣ w1) = α⋅MLE_bigram + (1 − α)⋅MLE_unigram, with α = 0.7. Compute this for 'in' → 'india', 'to' → 'the', and 'of' → 'government'.
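Hint: a minimal sketch of the Jelinek–Mercer mixture for 7 (same building blocks as in the earlier hints):
def jm_bigram(w1, w2, alpha=0.7):
    return alpha * bi_cpd[w1].prob(w2) + (1 - alpha) * uni_pd.prob(w2)

for w1, w2 in [('in', 'india'), ('to', 'the'), ('of', 'government')]:
    print(w1, w2, jm_bigram(w1, w2))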
8. Measure the perplexity of two sentences under the Brown bigram Laplace-smoothed model:
a) "aarav works at tcs in mumbai"
b) "priya studies data science in pune"
Print both values (scientific notation) and state which sentence is less perplexing.
9. Using the toy corpus, build a simple Markov text generator that alternates between bigram sampling and a forced newline token '<br>' after every 8 generated words. Generate 24 words (3 line blocks) seeded with 'aarav' and print the formatted output with line breaks.
10. On Brown news (lowercase alphabetic), compute the top 30 bigram collocations by PMI, then recompute after stopword removal. Print the overlap count and list 10 bigrams that dropped out when stopwords were removed.
11. Create a backoff bigram model for Brown: if MLE_bigram(context, next) == 0, fall back to a Lidstone unigram with γ = 0.2. Compute P('india' | 'in'), P('market' | 'to'), and P('government' | 'of').
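Hint: a minimal sketch of the backoff rule for 11 (assumes uni_fd and bi_cpd from earlier; a Lidstone unigram needs a bins count):
from nltk.probability import LidstoneProbDist

uni_lid = LidstoneProbDist(uni_fd, 0.2, bins=len(uni_fd))

def backoff_bigram(w1, w2):
    p = bi_cpd[w1].prob(w2)                     # MLE bigram first
    return p if p > 0 else uni_lid.prob(w2)     # otherwise back off to the Lidstone unigram

for w1, w2 in [('in', 'india'), ('to', 'market'), ('of', 'government')]:
    print("P(%r | %r) = %g" % (w2, w1, backoff_bigram(w1, w2)))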
12. Train a toy intent classifier for a chatbot using unigram presence over the following labeled utterances (train on all of them):
('greet', ['hi','hello','namaste','hello aarav','hi priya']),
('bye', ['bye','see you','good night','thanks bye']),
('thanks', ['thanks','thank you so much','thanks aarav']),
('refund', ['how to get train ticket refund','refund for cancelled irctc ticket','aarav needs ticket refund information']),
('bank', ['how to open bank account in delhi','documents for savings account','minimum balance for account']),
('weather', ['weather in mumbai today','is it raining in pune','temperature in bengaluru'])
Build a FreqDist-based vocabulary, a boolean feature extractor, classify with nltk.NaiveBayesClassifier, and print the predicted intent for these 4 inputs: "hello priya", "ticket refund rules", "open bank account delhi", "rain in pune".
13. Add a bigram language-model fallback to the chatbot from 12: if the classifier confidence is < 0.6, generate a 15-word bigram sample from the toy corpus seeded by 'priya' and reply with "I can share more: <generated text>". Demonstrate on the input "tell me something".
14. Implement a pattern-to-template response table for the chatbot intents (e.g., greet → "Namaste! How can I help?"; refund → "IRCTC refunds depend on timing; do you want steps?"; bank → "To open a bank account in Delhi, you need KYC documents."; weather → "Please specify city."; thanks → "Happy to help!"; bye → "Goodbye!"). Using the predictions from 12, print the chosen response for each input.
15. Build a tiny retrieval module: given the 6 toy documents from Assignment 6, Basic Q20 (repeat them in code), compute document unigram probabilities (MLE) and a query likelihood with Jelinek–Mercer α = 0.7 against each document model, using the query "aarav needs train ticket refund information" (lowercase). Print the top 3 documents.
16. For Brown news, compute the pointwise mutual information (PMI) for the specific bigram ("new", "delhi") and print its PMI value (if the bigram does not occur, print "not found").
17. On Brown news, compute the t-score for the bigram ("of", "india") and print the statistic (or "not found" if absent). Clearly show the counts used: f(w1,w2), f(w1), f(w2), and N.
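Hint: both statistics can be checked directly from raw counts using the classic formulas PMI = log2(N⋅f(w1,w2) / (f(w1)⋅f(w2))) and t ≈ (f(w1,w2) − f(w1)⋅f(w2)/N) / sqrt(f(w1,w2)); NLTK's measure implementations may differ slightly in detail. A minimal sketch (assumes tokens and uni_fd from the Basic hints):
import math
from nltk import FreqDist, bigrams

bi_fd = FreqDist(bigrams(tokens))               # raw bigram counts
N = len(tokens)

def pmi_and_t(w1, w2):
    f12, f1, f2 = bi_fd[(w1, w2)], uni_fd[w1], uni_fd[w2]
    if f12 == 0:
        return None                             # "not found"
    pmi = math.log2(N * f12 / (f1 * f2))
    t = (f12 - f1 * f2 / N) / math.sqrt(f12)
    return {'f(w1,w2)': f12, 'f(w1)': f1, 'f(w2)': f2, 'N': N, 'pmi': pmi, 't': t}

print(pmi_and_t('new', 'delhi'))
print(pmi_and_t('of', 'india'))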
18. Using the toy corpus, train separate unigram models per intent class from 12 by pooling the utterances for each label. Classify the query "documents needed for account in delhi" by computing the log-likelihood under each class model plus the log prior (uniform prior), and print the chosen class.
19. Combine collocation features with the intent classifier: add binary features for the presence of the top 5 bigram collocations (by PMI) mined from all training utterances in 12; retrain the NB classifier and print the new predictions for the same 4 inputs from 12.
20. Save to disk (pickle) the following chatbot artifacts: the vocabulary, the NB classifier, the bigram generator ConditionalProbDist built on the toy corpus, and the response template dict. Load them back and run a live test on "open savings account mumbai"; print the predicted intent and response (and, if low-confidence, the generated fallback text).
Advanced Questions
1. Build K-fold (k=5) splits of the Brown news tokens to evaluate bigram models by perplexity on held-out folds (Laplace smoothing). Print the mean perplexity across folds for the sentence "india plans new policy in the market" (lowercase).
2. Implement interpolated trigram generation with (λ3, λ2, λ1) = (0.6, 0.3, 0.1) on Brown news and add a repeat penalty: if the sampled token equals the previous token, resample once. Generate 50 tokens starting from ("in", "the") with random.Random(42) and print the result.
3. Compare Laplace vs. Lidstone (γ = 0.2) smoothing for Brown bigrams by computing the log-probability of the sentence "priya studies data science in pune" under both; print both log-probs and the absolute difference.
4. Build a trigram backoff model: try trigram MLE; if unseen, back off to a bigram Lidstone model with γ = 0.2; if still unseen, back off to a unigram Lidstone model with γ = 0.2. Compute the probability of the sequence "aarav works at tcs in mumbai" and print the value.
5. Using Brown news, compute the top 20 trigrams by PMI and then extract those containing an Indian city token from the set {mumbai, delhi, pune, bengaluru, chennai, hyderabad} (lowercase). Print all matches (or "none" if empty).
6. Extend the chatbot to intent + slot filling with simple regexes:
a) for refund, extract {"mode": "train|flight", "city": <CityName>} if mentioned;
b) for bank, extract {"city": <CityName>};
c) for weather, extract {"city": <CityName>}.
Test on: "ticket refund in mumbai", "open bank account delhi", "weather in pune today"; print the predicted intent and the extracted slots.
7. Create a response ranking step: for inputs that map to refund, rank three candidate responses by query likelihood under a bigram model trained on all refund-labeled utterances (from your earlier dataset). Candidates:
a) "IRCTC refunds depend on timing and class."
b) "You can apply for ticket refund online."
c) "Refunds are not allowed for all tickets."
Print the winning response for "aarav needs ticket refund information".
8. Perform a collocation significance comparison on Brown news by computing the rank positions of a target bigram (choose ("railway", "ticket"), or print "not found" if absent) under PMI, t-test, and chi-square; print the three ranks side by side (or a clear message if missing).
9. Build a document-aware fallback: if the chatbot confidence is < 0.6 and the user mentions any of {'refund', 'ticket', 'account', 'bank', 'weather'}, then instead of text generation, return the best matching document from the 6 toy docs using the query likelihood method (as defined earlier). Demonstrate on "need ticket refund rules" and "account opening delhi" and print the returned titles (first 6 words of each document).
10. Package a mini LM toolkit lm_utils.py exposing:
a) build_unigram(tokens) -> FreqDist
b) build_cfd_bigrams(tokens) -> ConditionalFreqDist
c) cpd_from_cfd(cfd, smoother='mle'|'laplace'|'lidstone', gamma=0.2)
d) generate_bigram(cpd, seed, n, rng)
e) p_sentence_bi(tokens, cpd, unk_unigram=None) (optional backoff)
f) top_pmi_bigrams(tokens, stop=None, n=20)
Demonstrate by training on Brown news, generating 20 tokens from the seed 'the', printing the PMI top 10, and computing the bigram probability of "india plans policy" (lowercase, alphabetic).
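Hint: a minimal skeleton of lm_utils.py under the interface above (the bodies are illustrative, not the required implementation; the extra bins argument in cpd_from_cfd is an added convenience, not part of the listed signature):
# lm_utils.py -- minimal sketch
from nltk import FreqDist, ConditionalFreqDist, bigrams
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
from nltk.probability import (ConditionalProbDist, MLEProbDist,
                              LaplaceProbDist, LidstoneProbDist)

def build_unigram(tokens):
    return FreqDist(tokens)

def build_cfd_bigrams(tokens):
    return ConditionalFreqDist(bigrams(tokens))

def cpd_from_cfd(cfd, smoother='mle', gamma=0.2, bins=None):
    if smoother == 'mle':
        return ConditionalProbDist(cfd, MLEProbDist)
    if smoother == 'laplace':
        return ConditionalProbDist(cfd, LaplaceProbDist, bins)
    return ConditionalProbDist(cfd, LidstoneProbDist, gamma, bins)

def generate_bigram(cpd, seed, n, rng):
    word, out = seed, []
    for _ in range(n):
        dist = cpd[word]
        words = list(dist.samples())
        if not words:
            break
        word = rng.choices(words, weights=[dist.prob(w) for w in words], k=1)[0]
        out.append(word)
    return out

def p_sentence_bi(tokens, cpd, unk_unigram=None):
    p = 1.0
    for w1, w2 in bigrams(tokens):
        q = cpd[w1].prob(w2)
        if q == 0 and unk_unigram is not None:
            q = unk_unigram.prob(w2)            # optional unigram backoff
        p *= q
    return p

def top_pmi_bigrams(tokens, stop=None, n=20):
    finder = BigramCollocationFinder.from_words(tokens)
    if stop:
        finder.apply_word_filter(lambda w: w in stop)
    return finder.score_ngrams(BigramAssocMeasures().pmi)[:n]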