NLTK Assignment 4

Parsing & Syntax Trees

Preparation (run once before starting)

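import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
# NLTK 3.9 and later look for differently named resources;
# download these as well if a LookupError asks for them.
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('maxent_ne_chunker_tab')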

Basic Questions

1. Create the following CFG with nltk.CFG.fromstring, build a ChartParser from it, and parse the token list ['Aarav','eats','jalebi']. Print every parse tree on its own line using t.pformat(margin=120).

2. Using the grammar from 1, parse ['Priya','sees','Rohan','in','Pune'] and print all parse trees; if multiple trees exist, print the count of parses at the end.

3. Using the grammar from 1, parse ['Arjun','drinks','coffee','at','Mumbai']; print the tree if exactly one parse exists, otherwise print all trees.
4. Using the grammar from 1, switch to nltk.parse.EarleyChartParser and parse ['Kavya','visits','the','park','in','Delhi']; print all resulting trees.
5. Extend the grammar from 1 by adding Conj -> 'and' and the rules NP -> NP Conj NP and VP -> VP Conj VP. Add V -> 'eat' and N -> 'samosa' if they are missing, then parse ['Aarav','and','Priya','eat','samosa'] and print all trees.
6. Build a tiny CFG in a new variable grammar_np for only noun phrases:

Parse ['the','delicious','samosa'] and ['Rohan'] using a ChartParser; print the trees.
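
Hint (sketch only): the rules for grammar_np are not reproduced in this sheet, so the example below uses an assumed stand-in grammar. The symbols NP, Det, Adj, N, and PropN and the variable np_parses are illustrative choices, not the assignment's own rules. It shows one way the build-and-parse steps might fit together.

```python
import nltk

# Assumed stand-in for grammar_np; the assignment's own rules are not shown above.
grammar_np = nltk.CFG.fromstring("""
NP -> Det Adj N | Det N | PropN
Det -> 'the'
Adj -> 'delicious'
N -> 'samosa'
PropN -> 'Rohan'
""")

parser = nltk.ChartParser(grammar_np)

# Collect the parses for each input so they can be counted as well as printed.
np_parses = {}
for tokens in (['the', 'delicious', 'samosa'], ['Rohan']):
    trees = list(parser.parse(tokens))
    np_parses[' '.join(tokens)] = trees
    for t in trees:
        print(t.pformat(margin=120))
```

Because the start symbol is taken from the first rule, this grammar accepts bare noun phrases rather than full sentences.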

7. Using the grammar from 1, parse ['Priya','likes','the','new','project','in','Mumbai']; print all trees and the total count.
8. Tokenize the sentence "Aarav sees Priya in Pune." with nltk.word_tokenize, drop the final period token, and parse the remaining tokens with the grammar from 1; print the parse tree.
9. Create a PP-attachment test by parsing both ['Priya','sees','Rohan','with','a','report'] and ['Priya','sees','Rohan','in','the','park'] using the grammar from 1; print how many parses each sentence yields.
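
Hint (sketch only): the grammar from 1 is not reproduced in this sheet, so the example below counts parses under a hypothetical grammar (pp_grammar) that allows a PP to attach either to a VP or to an NP. The helper count_parses is likewise an illustrative name. With the real grammar the counts may differ.

```python
import nltk

# Hypothetical grammar standing in for the grammar from 1.
pp_grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> PropN | Det N | NP PP
VP -> V NP | VP PP
PP -> P NP
PropN -> 'Priya' | 'Rohan'
Det -> 'a' | 'the'
N -> 'report' | 'park'
V -> 'sees'
P -> 'with' | 'in'
""")

parser = nltk.ChartParser(pp_grammar)

def count_parses(tokens):
    """Return how many distinct trees the chart parser finds for tokens."""
    return sum(1 for _ in parser.parse(tokens))

sentences = [
    ['Priya', 'sees', 'Rohan', 'with', 'a', 'report'],
    ['Priya', 'sees', 'Rohan', 'in', 'the', 'park'],
]
for sent in sentences:
    print(' '.join(sent), '->', count_parses(sent), 'parse(s)')
```

Under this stand-in grammar each sentence has two parses: one where the PP modifies the verb phrase and one where it modifies the object noun phrase.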
10. For the grammar from 1, print the set of terminal symbols programmatically by inspecting productions and extracting str(p.rhs()[0]) where p.is_lexical() is True; print the sorted list of unique terminals.
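
Hint (sketch only): the extraction described above might look like the following. The grammar toy_grammar is a small hypothetical example, since the grammar from 1 is not reproduced in this sheet.

```python
import nltk

# Small hypothetical grammar used only to demonstrate the extraction.
toy_grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> PropN
VP -> V NP | V NP PP
PP -> P NP
PropN -> 'Priya' | 'Rohan' | 'Pune'
V -> 'sees'
P -> 'in'
""")

# A production is lexical when its right-hand side contains a terminal;
# for rules such as  V -> 'sees'  the terminal is the first RHS element.
terminals = sorted({str(p.rhs()[0])
                    for p in toy_grammar.productions()
                    if p.is_lexical()})
print(terminals)
```

Note that sorted() orders capitalized words before lowercase ones, because it compares code points.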
11. Write a chunker using nltk.RegexpParser with the rule NP: {<DT>?<JJ>*<NN|NNS|NNP|NNPS>+}; POS-tag the tokens of "Priya bought a new train ticket in Mumbai" using nltk.pos_tag, run the chunker, and print the chunk tree.
12. Add a chink rule to the chunker from 11 so that prepositions and verbs are excluded from NP chunks: NP: {<.*>+} }<IN|VB.*>+{; re-run on the same sentence and print the updated chunk tree.
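
Hint (sketch only): the chunk-then-chink rule can be tried as shown below. The POS tags are written by hand (an assumption) so that the example runs without the tagger model; in the exercise they would come from nltk.pos_tag. The names chink_grammar and np_phrases are illustrative.

```python
import nltk

# Hand-written Penn Treebank tags (an assumption); normally produced by
# nltk.pos_tag(nltk.word_tokenize(sentence)).
tagged = [('Priya', 'NNP'), ('bought', 'VBD'), ('a', 'DT'), ('new', 'JJ'),
          ('train', 'NN'), ('ticket', 'NN'), ('in', 'IN'), ('Mumbai', 'NNP')]

chink_grammar = r"""
NP:
    {<.*>+}          # chunk every token into one NP
    }<IN|VB.*>+{     # then chink out prepositions and verbs
"""

chunker = nltk.RegexpParser(chink_grammar)
chunk_tree = chunker.parse(tagged)
print(chunk_tree)

# Collect the surface form of each NP chunk that survives the chink rule.
np_phrases = [' '.join(word for word, tag in st.leaves())
              for st in chunk_tree.subtrees(lambda t: t.label() == 'NP')]
print(np_phrases)
```

The chink rule splits the single all-inclusive chunk wherever a verb or preposition occurs, leaving the noun phrases on either side.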
13. Using nltk.ne_chunk (non-binary), perform NER on the POS-tagged tokens of "Aarav joined TCS in Mumbai after graduating from IIT Bombay"; print the named entity tree and also extract and print lists of PERSON, ORGANIZATION, and GPE leaves.
14. Using nltk.ne_chunk in binary mode (ne_chunk(tagged, binary=True)), perform NER on "Priya works at Infosys in Bengaluru"; print the tree and count how many NE chunks were found.
15. Write a function that takes a parse tree (from ChartParser) and prints all the NP subtrees using subtrees(lambda t: t.label()=='NP'); run it on the first parse of ['Arjun','writes','a','report','in','Delhi'] using the grammar from 1.
16. Modify the grammar from 1 by adding Adv -> 'quickly' and VP -> Adv VP | VP Adv; parse ['Rohan','quickly','writes','a','report'] and print all trees.
17. Using the grammar from 1, parse ['Kavya','visits','Mumbai','with','Priya']; count and print the parse trees, and state for each tree whether 'with Priya' attaches to the NP or to the VP.
18. Build a minimal CFG grammar_intrans:

Parse ['Ishan','runs'] and print the tree.

19. Using nltk.Tree.fromstring, manually construct the tree from the string "(S (NP (PropN Aarav)) (VP (V eats) (NP (N jalebi))))" and print it; then print all of its leaves using .leaves().
20. Using the grammar from 1, parse ['Priya','drinks','coffee','in','the','city'] with both ChartParser and EarleyChartParser; for each parser, print the elapsed time in milliseconds (measured with time.time() around the parse call) and the number of parses found.

Intermediate Questions

1. Create a new CFG grammar_coord:

Parse ['Aarav','and','Priya','drink','tea'] (add V -> 'drink' if needed) and print all parse trees.

2. Extend the grammar from 1 by adding relative clauses:

Parse ['Priya','who','likes','samosa','visits','Delhi'] and print all trees.

3. Build a PP-heavy sentence and show ambiguity counts by parsing ['Arjun','sees','Priya','with','a','new','project','in','Delhi'] using the grammar from 1. Print the number of parses and the first two trees.
4. Write a function that takes a Tree and returns its depth (t.height()) and its number of nodes (the count of subtrees). Parse ['Kavya','writes','a','report','in','Mumbai'] with the grammar from 1 and print the depth and node count for the first parse.
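
Hint (sketch only): one possible shape for the depth and node-count function. The tree below is built by hand with Tree.fromstring as a stand-in for a parser result, and its bracketing is an illustrative assumption, as is the name tree_stats.

```python
from nltk import Tree

def tree_stats(t):
    """Return (depth, node_count) for an nltk Tree.

    depth is t.height(); node_count is the number of subtrees,
    counting the root and excluding the leaf strings.
    """
    return t.height(), sum(1 for _ in t.subtrees())

# Hand-built stand-in for a parse tree returned by a ChartParser.
sample = Tree.fromstring(
    "(S (NP (PropN Kavya)) (VP (V writes) (NP (Det a) (N report))))")

depth, nodes = tree_stats(sample)
print("depth:", depth, "nodes:", nodes)
```

Tree.height() counts a preterminal such as (V writes) as height 2, so the reported depth includes the leaf level.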
5. Write a chunk grammar to extract verb phrases:

POS-tag the sentence "Aarav will quickly finish work in Pune" and print the chunked VP spans (start and end token indexes, located using treepositions()).

6. Using nltk.ne_chunk, perform NER on the POS-tagged sentence "Rohan and Priya visited IIT Delhi and ISRO Bengaluru". Print separate lists for PERSON, ORGANIZATION, and GPE entities detected.
7. Create a chunk grammar for NP with chinking to remove PPs:

POS-tag "Priya saw a big temple in Jaipur with Aarav" and print the before/after NP chunks (first run without the chink rule, then with it).

8. Define a CFG grammar_dates:

Write a small preprocessing step that joins the date tokens in ['Sanya','paid','Priya','on','15','-','09','-','2025'] into the single token '15-09-2025', then parse the resulting token list and print the parse.
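
Hint (sketch only): the preprocessing step might be written as below. The function name join_date_tokens and the five-token window test are illustrative assumptions; the parse with grammar_dates is left out because that grammar is not reproduced in this sheet.

```python
def join_date_tokens(tokens):
    """Merge runs of the form  DD - MM - YYYY  into one 'DD-MM-YYYY' token."""
    out, i = [], 0
    while i < len(tokens):
        window = tokens[i:i + 5]
        is_date = (len(window) == 5
                   and window[1] == '-' and window[3] == '-'
                   and all(part.isdigit() for part in window[0::2]))
        if is_date:
            # Keep only the numeric parts and rejoin them with hyphens.
            out.append('-'.join(window[0::2]))
            i += 5
        else:
            out.append(tokens[i])
            i += 1
    return out

raw = ['Sanya', 'paid', 'Priya', 'on', '15', '-', '09', '-', '2025']
joined = join_date_tokens(raw)
print(joined)
```

The joined list can then be passed to the parser, provided the grammar has a lexical rule for the date token.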

9. Build a custom tokenizer for "Aarav paid ₹500 in Mumbai on 12-08-2024" that keeps the date as one token and removes the currency symbol, yielding ['Aarav','paid','500','in','Mumbai','on','12-08-2024']. POS-tag it and show the tag sequence.
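
Hint (sketch only): a regular-expression tokenizer is one way to get this token list. The pattern and the name tokenize_payment are assumptions chosen for this one sentence, not a general-purpose tokenizer. The tagging step (nltk.pos_tag on the result) is omitted so that the example runs without the tagger model.

```python
import re

# The date alternative comes first so it wins over the plain word pattern.
# The rupee sign matches neither alternative, so it is dropped.
TOKEN_RE = re.compile(r'\d{2}-\d{2}-\d{4}|\w+')

def tokenize_payment(text):
    """Return word and number tokens, keeping DD-MM-YYYY dates whole."""
    return TOKEN_RE.findall(text)

tokens = tokenize_payment("Aarav paid ₹500 in Mumbai on 12-08-2024")
print(tokens)
```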
10. Create a chunk grammar that extracts location NPs ending with city names:

Run it on POS-tagged tokens of "Priya moved to New Delhi from Pune". Print NP_LOC chunks (treat New and Delhi as separate NNP tokens from the tokenizer).

11. Build an nltk.parse.EarleyChartParser for the grammar from Intermediate Q2 (with relative clauses) and parse ['Arjun','who','drinks','tea','visits','Mumbai']. Print all trees and the total count.
12. Write a small function that maps the tags of a Penn-tagged sentence to a simpler, chunk-friendly tagset (NN*→NN, VB*→VB, JJ*→JJ, etc.). Run it on "Kavya submitted the final report in Bengaluru" and print original vs mapped tags.
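
Hint (sketch only): a prefix-based mapping is one simple approach. The tags below are written by hand (an assumption) instead of coming from nltk.pos_tag, and including RB in the prefix list is a guess at what "etc." covers. The name simplify_tag is illustrative.

```python
# Prefixes to collapse; RB is an assumed addition under "etc.".
COARSE_PREFIXES = ('NN', 'VB', 'JJ', 'RB')

def simplify_tag(tag):
    """Map a Penn Treebank tag to its coarse prefix, or return it unchanged."""
    for prefix in COARSE_PREFIXES:
        if tag.startswith(prefix):
            return prefix
    return tag

# Hand-written tags (an assumption) so the sketch runs without the tagger model.
tagged = [('Kavya', 'NNP'), ('submitted', 'VBD'), ('the', 'DT'),
          ('final', 'JJ'), ('report', 'NN'), ('in', 'IN'), ('Bengaluru', 'NNP')]

mapped = [(word, simplify_tag(tag)) for word, tag in tagged]
for (word, original), (_, coarse) in zip(tagged, mapped):
    print(f"{word}\t{original}\t{coarse}")
```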
13. Create a regex chunker that extracts organization names as two consecutive NNP tokens (very naive):

Run it on POS-tagged "Aarav joined Tata Consultancy" and print ORG chunks.

14. Using ne_chunk, perform NER on "Arjun joined Indian Institute of Technology Bombay in Mumbai" and print whether the organization was detected as a single ORGANIZATION node or split. Include the chunked tree output.
15. Using the grammar from 1, parse ['Neha','drinks','tea'] after adding 'Neha' to PropN and 'tea' to N or V, whichever is appropriate. Print the parse tree.
16. Write code to linearize any Tree back to a sentence by joining leaves with spaces. Test on the first parse of ['Rohan','writes','a','report','in','Delhi'] from the grammar in 1.
17. Build a noun-compound chunk rule:

Apply it to POS-tagged "Priya read project report in Pune" and print NC chunks found.

18. Parse three sentences in a loop using the grammar from 1:
['Aarav','visits','Delhi'], ['Priya','likes','a','samosa'], ['Rohan','sees','Kavya','in','Mumbai']. Print the number of parses for each.
19. Using ne_chunk, run NER on "Sanya met Dev at ISRO in Hyderabad" and print three lists: PERSON names, ORGANIZATION names, and GPE names detected.
20. Create a chunk grammar that first finds NP, then chinks out any verbs/prepositions inside the NP. Run this two-stage chunking on "Arjun bought a new phone in Chennai with Priya" and print the final chunk tree.

Advanced Questions

1. Write a richer CFG with subordinate clauses and complementizers. Use nltk.parse.EarleyChartParser to parse both sentences below and print all parse trees:

a) ['Aarav','said','that','Priya','prepared','the','report','in','Delhi']
b) ['Rohan','who','likes','tea','scheduled','the','meeting','in','Mumbai']

2. Construct a controlled-ambiguity CFG by including both VP -> V NP PP and VP -> V PP NP. Parse ['Priya','sees','Aarav','with','a','report'] and print the number of parses and the first two trees. Then remove VP -> V PP NP and show that the parse count drops.
3. Build a rule-based NP chunker with the following sequence of rules (in order) and apply it on POS-tagged "Kavya bought a new laptop for Arjun in Pune". Print the final chunk tree after applying all rules:

4. Using ne_chunk (non-binary), run NER on the POS-tagged paragraph "Priya and Aarav joined Infosys in Bengaluru. Rohan visited IIT Madras in Chennai.". Programmatically traverse the returned tree to collect and print a dictionary with keys 'PERSON', 'ORGANIZATION', and 'GPE' mapping to sorted lists of unique surface forms.
5. Write a function that converts a parse Tree into a dependency-like edge list of (head, dependent) pairs by treating each parent as the head of its children (a simple approximation). Parse ['Arjun','writes','a','report','in','Delhi'] with the grammar from 1 and print the edge list.
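
Hint (sketch only): the parent-as-head approximation can be written as a walk over subtrees. The tree below is built by hand as a stand-in for a parser result (its bracketing is an assumption), and tree_to_edges is an illustrative name.

```python
from nltk import Tree

def tree_to_edges(tree):
    """Return (head, dependent) pairs, treating each parent as head of its children.

    Nonterminal children are represented by their label, leaves by the word itself.
    """
    edges = []
    for parent in tree.subtrees():
        for child in parent:
            dependent = child.label() if isinstance(child, Tree) else child
            edges.append((parent.label(), dependent))
    return edges

# Hand-built stand-in for the first parse of the sentence.
sample = Tree.fromstring(
    "(S (NP (PropN Arjun)) "
    "(VP (VP (V writes) (NP (Det a) (N report))) (PP (P in) (NP (PropN Delhi)))))")

edges = tree_to_edges(sample)
for head, dependent in edges:
    print(head, '->', dependent)
```

Because edges are keyed by label, two nodes with the same label (for example the two NP nodes) are not distinguished; using tree positions instead of labels would keep them apart.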
6. Create a grammar that handles proper-noun multiword names by tokenizing "New Delhi" as two tokens ['New','Delhi'] and including NMulti -> 'New' 'Delhi' with NP -> NMulti. Parse ['Priya','visits','New','Delhi'] and print the tree.
7. Implement a chunk accuracy experiment: define gold NP spans manually for the sentence "Sanya bought a blue dress in Jaipur" as character offsets. Run your NP chunker from Intermediate Q7, convert chunk spans to character offsets, and print precision, recall, and F1 with respect to the gold NP spans.
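
Hint (sketch only): the scoring part of the experiment might look like this. The gold spans are one possible manual annotation, and predicted_token_spans is a made-up placeholder for chunker output, not a real result. The helper names char_span and prf are illustrative, and char_span assumes the tokens are joined by single spaces.

```python
def char_span(tokens, start, end):
    """Character offsets (begin, end) of tokens[start:end] in ' '.join(tokens)."""
    begin = sum(len(tok) + 1 for tok in tokens[:start])
    return begin, begin + len(' '.join(tokens[start:end]))

def prf(gold, predicted):
    """Exact-match precision, recall, and F1 over two collections of spans."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

tokens = "Sanya bought a blue dress in Jaipur".split()

# One possible gold annotation: "Sanya", "a blue dress", "Jaipur".
gold = [(0, 5), (13, 25), (29, 35)]

# Placeholder chunker output as token index ranges (hypothetical, for illustration).
predicted_token_spans = [(0, 1), (2, 5), (5, 7)]
predicted = [char_span(tokens, s, e) for s, e in predicted_token_spans]

precision, recall, f1 = prf(gold, predicted)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

Exact-match scoring gives no credit for the placeholder span "in Jaipur", which overlaps the gold span "Jaipur" without matching it.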
8. Write a PP-attachment heuristic resolver: after parsing ['Aarav','sees','Priya','with','a','camera'] (grammar from 1), inspect trees and select the parse where "with a camera" attaches to the verb (VP) rather than the NP. Print only that tree.
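
Hint (sketch only): one heuristic is to keep the trees in which some VP node has a PP as a direct child. The grammar below is a hypothetical stand-in for the grammar from 1, which is not reproduced in this sheet, and pp_attaches_to_vp is an illustrative name.

```python
import nltk

# Hypothetical stand-in grammar with both VP and NP attachment for PPs.
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> PropN | Det N | NP PP
VP -> V NP | VP PP
PP -> P NP
PropN -> 'Aarav' | 'Priya'
Det -> 'a'
N -> 'camera'
V -> 'sees'
P -> 'with'
""")

def pp_attaches_to_vp(tree):
    """True if some VP node in the tree has a PP as a direct child."""
    return any(isinstance(child, nltk.Tree) and child.label() == 'PP'
               for vp in tree.subtrees(lambda t: t.label() == 'VP')
               for child in vp)

tokens = ['Aarav', 'sees', 'Priya', 'with', 'a', 'camera']
all_trees = list(nltk.ChartParser(grammar).parse(tokens))
vp_attached = [t for t in all_trees if pp_attaches_to_vp(t)]

for t in vp_attached:
    print(t.pformat(margin=120))
```

The check depends on how the grammar encodes attachment; a grammar with a rule such as VP -> V NP PP would pass the same test, while other encodings may need a different condition.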
9. Build a small pipeline: tokenize → POS-tag → NP chunk → NER for the sentence "Dev met Neha at TCS in Hyderabad". Print tokens, tags, NP chunks (as list of phrases), and named entities (as (label, phrase) tuples).
10. Create and parse a mini Indian address using a CFG:

Tokenize "Priya , 108 MG Road , Mumbai" into ['Priya',',','108','MG','Road',',','Mumbai'], parse with a ChartParser, and print the resulting tree.