NLTK Assignment – 6

Semantic Analysis

Preparation (run once before starting)

import nltk
nltk.download('punkt')        
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')      
nltk.download('omw-1.4')      
# nltk.download('framenet_v17') # only for questions that mention FrameNet

Basic Questions

  1. Use the Lesk WSD algorithm (nltk.wsd.lesk) on the sentence:
    “Arjun went to the bank to deposit cash after work.”
    Disambiguate the word “bank” (target token) and print the chosen synset name and definition.
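A minimal sketch of the Lesk call for Question 1. The call is guarded so it degrades gracefully if the WordNet data from the preparation step is not installed; restricting to `pos="n"` is an optional refinement, not required by the question.

```python
try:
    from nltk.wsd import lesk

    # Context as a token list; word_tokenize yields the same tokens here.
    context = "Arjun went to the bank to deposit cash after work .".split()
    sense = lesk(context, "bank", pos="n")  # restrict to noun senses
    chosen = sense.name() if sense else None
    print(chosen, "-", sense.definition() if sense else "no sense found")
except (ImportError, LookupError):
    # nltk or the WordNet corpus is missing; run the preparation cell first.
    chosen = None
```

Note that Lesk's pick depends on gloss overlap, so the chosen sense may look surprising; printing the definition makes it easy to judge.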

  2. Run Lesk WSD on a different context:
    “Aarav sat on the bank of the river near Pune.”
    Disambiguate “bank” again and print the chosen synset and definition; then print whether it differs from Question 1.

  3. Disambiguate the ambiguous noun “bat” in two sentences using Lesk and print the synset names and definitions:
    a) “Rohan swung the bat during the cricket match in Mumbai.”
    b) “Bats fly at night near the old fort in Jaipur.”

  4. For each of these pairs of nouns, compute and print path similarity and Wu–Palmer (wup) similarity using WordNet noun synsets (use the first noun sense for each word):
    • (“car”, “automobile”), (“car”, “bus”), (“train”, “vehicle”), (“river”, “bank”).
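The "first noun sense" lookup in Question 4 can be sketched as follows, guarded in the same way since it needs the WordNet corpus:

```python
try:
    from nltk.corpus import wordnet as wn

    pairs = [("car", "automobile"), ("car", "bus"),
             ("train", "vehicle"), ("river", "bank")]
    results = {}
    for a, b in pairs:
        s1 = wn.synsets(a, pos=wn.NOUN)[0]  # first noun sense
        s2 = wn.synsets(b, pos=wn.NOUN)[0]
        results[(a, b)] = (s1.path_similarity(s2), s1.wup_similarity(s2))
        print(a, b, results[(a, b)])
except (ImportError, LookupError):
    # WordNet corpus not installed; run the preparation cell first.
    results = {}
```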

  5. For the verbs in the sentence “Priya bought a ticket and travelled to Delhi”, get the first verb synset for “buy” and “travel” and print their wup similarity; also print both synset definitions.

  6. List all synonyms (lemma names) and antonyms for the adjective “good” using WordNet. Print two lists: synonyms and antonyms (unique, sorted, ASCII only).

  7. Create a simple paraphrase of the sentence “Kavya is a good singer” by replacing “good” with one WordNet synonym that is an adjective. Print original and paraphrased sentences.

  8. Create a simple antonym flip paraphrase of “The report by Ishan is helpful” by replacing “helpful” with a WordNet antonym if available. Print original and paraphrased sentences.

  9. Tokenize and POS-tag the sentence “Sanya paid ₹500 to Neha in Bengaluru yesterday”. Using a rule-of-thumb SRL mapping, assign roles:
    • subject NP → AGENT
    • direct object NP → PATIENT
    • PP with to → RECIPIENT
    • PP with city → LOCATION
    • time adverb → TIME
    Print a dictionary like {"AGENT": "Sanya", "PATIENT": "₹500", "RECIPIENT": "Neha", "LOCATION": "Bengaluru", "TIME": "yesterday"}.
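One way to realize the role rules above, shown on a hand-tagged token list so the rule logic runs without the tagger. The tags, the tiny city gazetteer, and the time-word list are illustrative assumptions (the real tagger may even split "₹500" into two tokens); ACTION is recorded too since Question 18 reuses the same heuristic.

```python
# (token, POS) pairs for "Sanya paid ₹500 to Neha in Bengaluru yesterday".
tagged = [("Sanya", "NNP"), ("paid", "VBD"), ("₹500", "CD"),
          ("to", "TO"), ("Neha", "NNP"), ("in", "IN"),
          ("Bengaluru", "NNP"), ("yesterday", "NN")]

CITIES = {"Bengaluru", "Mumbai", "Delhi", "Pune", "Chennai"}   # toy gazetteer
TIME_WORDS = {"yesterday", "today", "tomorrow", "Monday", "Friday"}

roles = {}
for i, (tok, pos) in enumerate(tagged):
    prev = tagged[i - 1][0] if i > 0 else None
    if pos.startswith("VB"):
        roles.setdefault("ACTION", tok)          # main verb
    elif tok in TIME_WORDS:
        roles["TIME"] = tok
    elif prev == "to":
        roles["RECIPIENT"] = tok                 # NP inside a "to" PP
    elif tok in CITIES and prev in {"in", "at", "near"}:
        roles["LOCATION"] = tok
    elif "ACTION" not in roles and pos == "NNP":
        roles["AGENT"] = tok                     # proper noun before the verb
    elif "ACTION" in roles and "PATIENT" not in roles:
        roles["PATIENT"] = tok                   # first NP right after the verb

print(roles)
```

These rules are deliberately naive and order-sensitive; they work for simple subject-verb-object sentences like the ones in this assignment.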

  10. Using the same SRL heuristic as Question 9, annotate the roles for “Dev delivered the parcel to Pooja at Mumbai on Monday” and print the role dictionary.

  11. Build a synonym set for “ticket” by collecting lemma names from all noun synsets of “ticket”. Print the unique list.

  12. From WordNet, print the hypernyms (one level up) for the noun synset train.n.01 and the verb synset buy.v.01 (synset names and definitions).

  13. Construct a query expansion candidate list for the query words [“refund”, “ticket”] by adding up to 5 synonyms and up to 3 hypernyms per word from WordNet. Print the expanded term list (lowercased, unique).
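A sketch of the Question 13 expansion, guarded against missing WordNet data; the caps of 5 synonyms and 3 hypernyms are applied after deduplication:

```python
try:
    from nltk.corpus import wordnet as wn

    def expand(word, max_syn=5, max_hyper=3):
        """Up to max_syn synonym lemmas and max_hyper hypernym lemmas (nouns)."""
        syns, hypers = [], []
        for ss in wn.synsets(word, pos=wn.NOUN):
            syns += [l.name().lower().replace("_", " ") for l in ss.lemmas()]
            hypers += [h.lemmas()[0].name().lower().replace("_", " ")
                       for h in ss.hypernyms()]

        def dedup(seq, n):
            return list(dict.fromkeys(seq))[:n]  # order-preserving unique

        return dedup(syns, max_syn), dedup(hypers, max_hyper)

    expanded = []
    for w in ["refund", "ticket"]:
        syns, hypers = expand(w)
        expanded += [w] + syns + hypers
    expanded = sorted(set(expanded))
    print(expanded)
except (ImportError, LookupError):
    # WordNet corpus not installed; run the preparation cell first.
    expanded = []
```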

  14. Repeat Question 13 but for the query words [“account”, “bank”]. Print synonyms and hypernyms separately and then the combined expanded list.

  15. Using Lesk WSD, disambiguate “deposit” in “Arjun will deposit the cheque at the bank in Delhi”. Print the selected verb synset and two alternative verb synsets (names + definitions) for comparison.

  16. Using WordNet, print definitions for two senses of “cricket”: one noun sense for the sport and one noun sense for the insect. Also print two example sentences you write with Indian names (one for each sense).

  17. For the nouns ["city", "vehicle", "road", "bridge"], get the first noun synset of each and print a 4×4 wup similarity matrix (rounded to 3 decimals).

  18. Using the sentence “Raghav quickly repaired the phone in Pune”, tag POS and extract AGENT, ACTION, PATIENT, LOCATION using the same SRL heuristic (treat the main verb as ACTION). Print a small dict.

  19. For the sentence “Meera liked the movie but Arjun disliked the ending”, generate a synonym-based paraphrase where “liked” and “disliked” are replaced by WordNet synonyms that keep the same polarity (positive/negative). Print original and paraphrased sentences.

  20. Build a toy document set (Python list of 6 short strings):
    a) “IRCTC refund process for cancelled train ticket in Mumbai”
    b) “Steps to open a savings bank account in Delhi”
    c) “How to transfer money with UPI in Bengaluru”
    d) “Cancellation rules for flight ticket from Pune”
    e) “Railway ticket refund timeline and charges”
    f) “Documents required for new bank account in Chennai”
    Given the user query “Aarav needs train ticket refund information”, expand the query with synonyms/hypernyms from WordNet (like Question 13) and print the top 3 matching documents ranked by simple Jaccard overlap between expanded query terms and document terms (lowercased word sets).
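The ranking step of Question 20 reduces to a set operation. Here is a minimal sketch with the expanded query terms hard-coded ("repayment" and "railway" stand in for WordNet output; in the real exercise they come from the Question 13 expansion):

```python
docs = [
    "IRCTC refund process for cancelled train ticket in Mumbai",
    "Steps to open a savings bank account in Delhi",
    "How to transfer money with UPI in Bengaluru",
    "Cancellation rules for flight ticket from Pune",
    "Railway ticket refund timeline and charges",
    "Documents required for new bank account in Chennai",
]

# Expanded query terms for "Aarav needs train ticket refund information".
expanded = {"train", "ticket", "refund", "information", "repayment", "railway"}

def jaccard(a, b):
    """Jaccard overlap of two word sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

scores = sorted(
    ((jaccard(expanded, set(d.lower().split())), d) for d in docs),
    reverse=True,
)
for score, doc in scores[:3]:
    print(round(score, 3), doc)
```

With these terms, document (e) wins: it shares three of its six words with the query, while (a) shares three of nine.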

Intermediate Questions

  1. Create a function wsd_sense(word, sentence) that tokenizes a sentence with word_tokenize and returns the Lesk synset for the target word. Evaluate it on 8 sentences you write with Indian names: 4 sentences for “bank” (finance vs river), 2 sentences for “bat” (cricket vs animal), 2 sentences for “cricket” (sport vs insect). Print the chosen synset names for all 8 sentences.

  2. For each target in Question 1, manually list the gold sense label (sport/insect; finance/river; animal/cricket-bat) and compute accuracy of your Lesk predictions. Print correct/total.

  3. Implement POS-constrained WSD by filtering synsets to the POS of the target (e.g., only noun synsets when target is tagged NN). Re-run Lesk for the 8 sentences and print the new accuracy vs Question 2.

  4. Build a similarity table for the nouns ["car", "bus", "train", "vehicle", "road", "bridge"] using path and wup similarities (first noun senses). Print two formatted 6×6 tables.

  5. Using your SRL heuristic, annotate roles in these sentences and print role dicts:
    a) “Aarav sent an email to Priya from Pune” (AGENT, PATIENT, RECIPIENT, SOURCE)
    b) “Neha received the parcel from Rohan at Delhi” (AGENT/PATIENT/SOURCE/LOCATION)
    c) “Ishan presented the proposal to Kavya in Bengaluru on Friday” (AGENT/PATIENT/RECIPIENT/LOCATION/TIME)

  6. Extract FrameNet information for one verb (requires nltk.download('framenet_v17') before usage). For the verb “buy”, list up to 3 frames that include it and print the core frame elements for the first frame (e.g., Buyer, Goods, Seller). If FrameNet is not available, skip this question and state that FrameNet is not available.

  7. Create a paraphrase generator that replaces adjectives with WordNet synonyms while keeping POS ADJ. Apply it to:
    • “This is a good plan by Rohan”
    • “The tasty snacks were served by Pooja”
    Print 1–2 paraphrases for each sentence.

  8. Create an antonym rewriter that flips sentiment words in:
    • “Arjun is happy with the result”
    • “Kavya is unhappy about the delay”
    Use WordNet antonyms, keep names intact, and print the new sentences (if antonym not found, leave the word as-is).

  9. Build a query expansion pipeline for the query words ["refund", "ticket", "train"]: collect up to 5 synonyms and 3 hypernyms per word (lowercased), remove stopwords and near-duplicates, assign weights (base term = 1.0, synonym = 0.8, hypernym = 0.5). Print the weighted dictionary, e.g., {"refund": 1.0, "repayment": 0.8, "ticket": 1.0, …}.
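The weighting scheme in Question 9 is independent of where the synonyms come from, so it can be sketched with illustrative hand-made synonym and hypernym maps (in the exercise these come from WordNet). First assigned weight wins, so a base term is never downgraded to 0.8:

```python
def weighted_expansion(bases, synonyms, hypernyms, stopwords=frozenset()):
    """Base term = 1.0, synonym = 0.8, hypernym = 0.5; first weight wins."""
    weights = {}
    for term in bases:
        weights.setdefault(term, 1.0)
    for term in bases:
        for s in synonyms.get(term, []):
            if s not in stopwords:
                weights.setdefault(s, 0.8)
        for h in hypernyms.get(term, []):
            if h not in stopwords:
                weights.setdefault(h, 0.5)
    return weights

# Illustrative stand-ins for WordNet output (assumed, not fetched):
syn = {"refund": ["repayment"], "train": ["railroad train"]}
hyp = {"refund": ["payment"], "ticket": ["document"]}
w = weighted_expansion(["refund", "ticket", "train"], syn, hyp)
print(w)
```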

  10. Using the toy document set from Basic Q20, implement a weighted retrieval score. Each document’s score is the sum of weights for expanded terms it contains (term presence only, no counts). Print the top 3 documents with their scores.
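Question 10's presence-only weighted score can be sketched on its own; the weights dict below is a hand-made stand-in for the Question 9 pipeline output:

```python
docs = [
    "IRCTC refund process for cancelled train ticket in Mumbai",
    "Steps to open a savings bank account in Delhi",
    "How to transfer money with UPI in Bengaluru",
    "Cancellation rules for flight ticket from Pune",
    "Railway ticket refund timeline and charges",
    "Documents required for new bank account in Chennai",
]
weights = {"refund": 1.0, "ticket": 1.0, "train": 1.0,
           "repayment": 0.8, "railway": 0.5}

def score(doc, weights):
    """Sum weights of expanded terms present in the document (presence only)."""
    terms = set(doc.lower().split())
    return sum(w for t, w in weights.items() if t in terms)

ranked = sorted(docs, key=lambda d: score(d, weights), reverse=True)
for d in ranked[:3]:
    print(round(score(d, weights), 2), d)
```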

  11. Create a sense-aware expansion for the query “Rohan sat on the bank” by first disambiguating “bank” with Lesk and then adding only synonyms from the selected river-bank sense. Print the final expanded list.

  12. For the ambiguous word “charge”, write two sentences with Indian names showing the commerce sense and the legal accusation sense. Run Lesk WSD on both and print synset names and definitions.

  13. For the verb pair (“buy”, “sell”), compute wup similarity and also print whether they appear as antonyms in any of their verb lemmas. Print your findings.

  14. For the nouns ["doctor", "engineer", "teacher"], compute and print the top 3 closest nouns to each by wup similarity (search across the set only, first noun senses).

  15. Implement a role-driven paraphrase: in “Ananya sent the invitation to Dev in Chennai”, replace only the ACTION (verb) with a WordNet synonym while keeping AGENT/PATIENT/RECIPIENT/LOCATION intact. Print original and paraphrase.

  16. Create 6 short FAQ-like user queries that include Indian entities (e.g., “Priya wants bank account opening details in Delhi”). For each query, produce an expanded query using your pipeline from Question 9 and print the expanded term sets.

  17. For the queries in Question 16, compute top-1 matches against the same toy document set from Basic Q20 using weighted scores (Question 10). Print the matched document index and score for each query.

  18. Build a small gold mapping for 8 WSD cases (2 each for “bank”, “bat”, “cricket”, “charge”) and compute per-target accuracy of Lesk with POS filtering. Print a table: target, correct/total, accuracy.

  19. For the sentence “Meera found the plot weak but the music strong”, generate a dual paraphrase: (i) keep polarity, swap “weak” and “strong” with same-polarity synonyms, (ii) flip polarity for both using antonyms. Print both paraphrases.

  20. Package your query expansion code into two functions expand_query(tokens) and rank_docs(expanded, docs). Demonstrate with the query “Aarav needs information on ticket refund in Pune” on the toy document list and print the top 3 ranked results.

Advanced Questions

  1. Build a sense-tagged mini dataset of 12 sentences you write (Indian names only) covering ambiguous targets {bank, bat, cricket, charge} (3 per word; distinct senses). Create a gold table mapping sentence → intended synset, run Lesk with POS filtering, and print overall accuracy plus per-target accuracy.

  2. Implement a WSD+Expansion search: given a user query sentence “Priya asked about railway ticket refund timeline in Mumbai”, tokenize and POS-tag, WSD-disambiguate “refund” and “ticket”, expand only with synonyms/hypernyms from the selected senses, rank the Basic Q20 toy documents by weighted score. Print top 3 results and show which expanded terms matched each document.

  3. Create a semantic similarity explorer for a word list ["train", "bus", "metro", "vehicle", "road", "bridge"]: compute wup similarities pairwise, for each word print the two nearest neighbors and the two farthest (excluding self), and print any pairs with wup ≥ 0.9.

  4. Implement a rule-based SRL approximator: detect AGENT (leftmost proper-noun NP before main verb), ACTION (main finite verb), PATIENT (direct object NP), RECIPIENT (NP in to PP), LOCATION (GPE-like proper noun after in/at/near), TIME (tokens like yesterday, Monday, or date pattern dd-mm-yyyy). Apply it to these sentences and print JSON-like dicts:
    a) “Arjun emailed the file to Priya in Delhi on Monday”
    b) “Kavya received a parcel from Rohan at Bengaluru”
    c) “Ishan will submit the report in Pune tomorrow”
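The TIME rule in Question 4 combines a word list with a date pattern; a small sketch of just that piece (the time-word list is a sample, not exhaustive):

```python
import re

DATE_RE = re.compile(r"\b\d{2}-\d{2}-\d{4}\b")  # dd-mm-yyyy
TIME_WORDS = {"yesterday", "today", "tomorrow", "monday", "friday"}

def find_time(tokens):
    """Return the first token that looks like a TIME expression, else None."""
    for tok in tokens:
        if tok.lower() in TIME_WORDS or DATE_RE.fullmatch(tok):
            return tok
    return None

print(find_time("Ishan will submit the report in Pune tomorrow".split()))
print(find_time("Arjun paid on 15-08-2024 in Delhi".split()))
```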

  5. Build a paraphrase generator with constraints: given “Neha enjoyed the concert in Hyderabad”, produce two paraphrases using WordNet synonyms such that (i) the name and city remain unchanged, (ii) the verb and one adjective/noun are replaced by same-POS synonyms, (iii) tokens remain grammatical (surface form change allowed). Print both.

  6. Evaluate query expansion impact: using the 6 toy documents from Basic Q20 and 8 user queries from Intermediate Q16, compute Precision@1 before expansion (base terms only) and after sense-aware expansion (Advanced Q2 method). Print both P@1 values and absolute improvement.

  7. Create a failure analysis for WSD: for 6 sentences where Lesk fails (choose from your mini dataset), print the sentence, target word, predicted synset, gold synset, and top 8 content words in the sentence. In a brief code comment, note a likely reason (short context, POS mismatch, rare sense).

  8. Construct a semantic neighborhood: for “account.n.01” gather all direct hypernyms and hyponyms. Print the set size, names, and definitions. Then print three lemmas you would add to expand the query “open bank account” based on this neighborhood.

  9. Build a sense profile for “bank” by collecting its top 15 collocates (content words only) from two corpora slices you choose (e.g., 500 sentences you author or from a local text you have permission to use) labeled as finance vs river contexts. Print two collocate lists and note (in a code comment) which collocates are most discriminative.

  10. Package an end-to-end module semantic_utils.py that exposes:
    lesk_pos(word, sent)
    similarity_table(words, metric)
    heuristic_srl(sent)
    expand_query(tokens, sense_aware=True)
    rank_docs(expanded, docs, weighted=True)

    Demonstrate by running on the sentence “Ritika wants information about ticket refunds in Chennai” and printing: the WSD senses chosen, SRL roles, expanded terms, and the top 3 ranked documents from the Basic Q20 set.