NLTK Assignment – 4
Parsing & Syntax Trees
Preparation (run once before starting)
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
Basic Questions
1. Create the following CFG in NLTK using nltk.CFG.fromstring and a ChartParser, then parse the token list ['Aarav','eats','jalebi']. Print every parse tree on its own line using t.pformat(margin=120).
S -> NP VP
NP -> PropN | Det N | Det Adj N | PropN PP
VP -> V | V NP | V NP PP | V PP
PP -> P NP
Det -> 'a' | 'an' | 'the'
Adj -> 'big' | 'small' | 'delicious' | 'new'
N -> 'samosa' | 'jalebi' | 'train' | 'report' | 'park' | 'project' | 'coffee' | 'city'
PropN -> 'Aarav' | 'Priya' | 'Rohan' | 'Kavya' | 'Arjun' | 'Delhi' | 'Mumbai' | 'Pune'
V -> 'eats' | 'sees' | 'likes' | 'writes' | 'visits' | 'drinks'
P -> 'in' | 'at' | 'to' | 'with'
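A minimal sketch of task 1, assuming the grammar above is used verbatim. The helper name parse_all and the second token list (with a determiner added) are illustrative assumptions, not part of the assignment. Going by the rules as written, NP never rewrites to a bare N, so the determiner-less list is expected to yield no tree, while the variant should parse.

```python
import nltk

GRAMMAR_1 = """
S -> NP VP
NP -> PropN | Det N | Det Adj N | PropN PP
VP -> V | V NP | V NP PP | V PP
PP -> P NP
Det -> 'a' | 'an' | 'the'
Adj -> 'big' | 'small' | 'delicious' | 'new'
N -> 'samosa' | 'jalebi' | 'train' | 'report' | 'park' | 'project' | 'coffee' | 'city'
PropN -> 'Aarav' | 'Priya' | 'Rohan' | 'Kavya' | 'Arjun' | 'Delhi' | 'Mumbai' | 'Pune'
V -> 'eats' | 'sees' | 'likes' | 'writes' | 'visits' | 'drinks'
P -> 'in' | 'at' | 'to' | 'with'
"""

grammar = nltk.CFG.fromstring(GRAMMAR_1)
parser = nltk.ChartParser(grammar)

def parse_all(tokens):
    """Return every parse tree the chart parser finds (hypothetical helper)."""
    return list(parser.parse(tokens))

task_tokens = ['Aarav', 'eats', 'jalebi']
variant_tokens = ['Aarav', 'eats', 'a', 'jalebi']  # assumption: determiner added

for tokens in (task_tokens, variant_tokens):
    trees = parse_all(tokens)
    print(tokens, '->', len(trees), 'parse(s)')
    for t in trees:
        # One tree per line, as the task asks
        print(t.pformat(margin=120))
```

Collecting the parser's iterator into a list makes it easy to print both the trees and their count, which several later tasks ask for.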
2. Using the grammar from 1, parse ['Priya','sees','Rohan','in','Pune'] and print all parse trees; if multiple trees exist, print the count of parses at the end.
3. Using the grammar from 1, parse ['Arjun','drinks','coffee','at','Mumbai'] and print the single tree if only one exists; otherwise print all trees.
4. Using the grammar from 1, switch to nltk.parse.EarleyChartParser and parse ['Kavya','visits','the','park','in','Delhi']; print all resulting trees.
5. Extend the grammar from 1 by adding Conj -> 'and' and the rules NP -> NP Conj NP and VP -> VP Conj VP; parse ['Aarav','and','Priya','eat','samosa'] after adding V -> 'eat' and N -> 'samosa' if missing; print all trees.
6. Build a tiny CFG in a new variable grammar_np for only noun phrases:
NP -> Det Adj* N | PropN
Det -> 'a' | 'the'
Adj -> 'delicious' | 'new'
N -> 'samosa' | 'project'
PropN -> 'Rohan' | 'Priya'
Parse ['the','delicious','samosa'] and ['Rohan'] using a ChartParser; print the trees.
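One possible way to encode the noun-phrase grammar, sketched under an assumption: to my understanding nltk.CFG.fromstring accepts only plain symbols on the right-hand side, so the Adj* shorthand is rewritten here with a recursive helper symbol. The name Nom is hypothetical and not part of the assignment.

```python
import nltk

# Assumption: 'Adj*' is expressed through the recursive helper symbol Nom
# (zero or more adjectives followed by a noun).
grammar_np = nltk.CFG.fromstring("""
NP -> Det Nom | PropN
Nom -> Adj Nom | N
Det -> 'a' | 'the'
Adj -> 'delicious' | 'new'
N -> 'samosa' | 'project'
PropN -> 'Rohan' | 'Priya'
""")

np_parser = nltk.ChartParser(grammar_np)

np_results = {}
for tokens in (['the', 'delicious', 'samosa'], ['Rohan']):
    trees = list(np_parser.parse(tokens))
    np_results[' '.join(tokens)] = trees
    for tree in trees:
        print(tree)
```

The recursion adds one Nom node per adjective; listing the alternatives explicitly (Det N, Det Adj N, and so on) would give flatter trees at the cost of a fixed maximum number of adjectives.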
7. Using the grammar from 1, parse ['Priya','likes','the','new','project','in','Mumbai']; print all trees and the total count.
8. Tokenize the sentence "Aarav sees Priya in Pune." into words using nltk.word_tokenize (remove the period) and parse it with the grammar from 1; print the parse tree.
9. Create a PP-attachment test by parsing both ['Priya','sees','Rohan','with','a','report'] and ['Priya','sees','Rohan','in','the','park'] using the grammar from 1; print how many parses each sentence yields.
10. For the grammar from 1, print the set of terminal symbols programmatically by inspecting productions and extracting str(p.rhs()[0]) where p.is_lexical() is True; print the sorted list of unique terminals.
11. Write a chunker using nltk.RegexpParser with the rule NP: {<DT>?<JJ>*<NN|NNS|NNP|NNPS>+}; POS-tag the tokens of "Priya bought a new train ticket in Mumbai" using nltk.pos_tag, run the chunker, and print the chunk tree.
12. Add a chink rule to the chunker from 11 so that prepositions and verbs are excluded from NP chunks:
NP: {<.*>+}
}<IN|VB.*>+{
Re-run on the same sentence and print the updated chunk tree.
13. Using nltk.ne_chunk (non-binary), perform NER on the POS-tagged tokens of "Aarav joined TCS in Mumbai after graduating from IIT Bombay"; print the named entity tree and also extract and print lists of PERSON, ORGANIZATION, and GPE leaves.
14. Using nltk.ne_chunk in binary mode (ne_chunk(tagged, binary=True)), perform NER on "Priya works at Infosys in Bengaluru"; print the tree and count how many NE chunks were found.
15. Write a function that takes a parse tree (from ChartParser) and prints all the NP subtrees using subtrees(lambda t: t.label()=='NP'); run it on the best parse of ['Arjun','writes','a','report','in','Delhi'] using the grammar from 1.
16. Modify the grammar from 1 by adding Adv -> 'quickly' and VP -> Adv VP | VP Adv; parse ['Rohan','quickly','writes','a','report'] and print all trees.
17. Using the grammar from 1, parse ['Kavya','visits','Mumbai','with','Priya']; print whether 'with Priya' attaches to NP or VP by counting the parse trees and printing them.
18. Build a minimal CFG grammar_intrans:
S -> NP VP
NP -> PropN
VP -> V
PropN -> 'Ishan'
V -> 'runs'
Parse ['Ishan','runs'] and print the tree.
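A short sketch of the intransitive grammar in use. The variable names other than grammar_intrans are illustrative.

```python
import nltk

grammar_intrans = nltk.CFG.fromstring("""
S -> NP VP
NP -> PropN
VP -> V
PropN -> 'Ishan'
V -> 'runs'
""")

intrans_parser = nltk.ChartParser(grammar_intrans)
intrans_trees = list(intrans_parser.parse(['Ishan', 'runs']))
for tree in intrans_trees:
    print(tree)
```

With only one rule per nonterminal, the grammar is unambiguous, so a single tree is expected.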
19. Using nltk.Tree.fromstring, manually construct the tree from the string "(S (NP (PropN Aarav)) (VP (V eats) (NP (N jalebi))))" and print it; then traverse and print all leaves using .leaves().
20. Using the grammar from 1, parse ['Priya','drinks','coffee','in','the','city'] with both ChartParser and EarleyChartParser; print the timing (in milliseconds) for each method using time.time() around parsing, along with the number of parses found.
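A rough sketch of the timing comparison, assuming the grammar from 1 is reproduced verbatim. The helper timed_parse and the results dictionary are hypothetical names. Both parsers should report the same number of parses, since they differ in search strategy rather than in the grammar they accept.

```python
import time
import nltk
from nltk.parse import EarleyChartParser

grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> PropN | Det N | Det Adj N | PropN PP
VP -> V | V NP | V NP PP | V PP
PP -> P NP
Det -> 'a' | 'an' | 'the'
Adj -> 'big' | 'small' | 'delicious' | 'new'
N -> 'samosa' | 'jalebi' | 'train' | 'report' | 'park' | 'project' | 'coffee' | 'city'
PropN -> 'Aarav' | 'Priya' | 'Rohan' | 'Kavya' | 'Arjun' | 'Delhi' | 'Mumbai' | 'Pune'
V -> 'eats' | 'sees' | 'likes' | 'writes' | 'visits' | 'drinks'
P -> 'in' | 'at' | 'to' | 'with'
""")

def timed_parse(parser, tokens):
    """Return (number of parses, elapsed milliseconds) for one parser."""
    start = time.time()
    trees = list(parser.parse(tokens))  # consume the iterator inside the timed region
    elapsed_ms = (time.time() - start) * 1000.0
    return len(trees), elapsed_ms

tokens = ['Priya', 'drinks', 'coffee', 'in', 'the', 'city']
results = {}
for name, parser in (('ChartParser', nltk.ChartParser(grammar)),
                     ('EarleyChartParser', EarleyChartParser(grammar))):
    results[name] = timed_parse(parser, tokens)
    count, ms = results[name]
    print(f'{name}: {count} parse(s) in {ms:.3f} ms')
```

For intervals this short a single time.time() measurement is noisy; repeating the parse in a loop, or using time.perf_counter(), generally gives steadier numbers.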
Intermediate Questions
1. Create a new CFG grammar_coord:
S -> NP VP
NP -> PropN | Det N | NP Conj NP
VP -> V NP | V
Conj -> 'and' | 'or'
Det -> 'a' | 'the'
N -> 'samosa' | 'tea'
V -> 'eats' | 'drinks'
PropN -> 'Aarav' | 'Priya'
Parse ['Aarav','and','Priya','drink','tea'] (add V -> 'drink' if needed) and print all parse trees.
2. Extend the grammar from 1 by adding relative clauses:
NP -> NP RelClause
RelClause -> RelPro VP
RelPro -> 'who' | 'that'
Parse ['Priya','who','likes','samosa','visits','Delhi'] and print all trees.
3. Build a PP-heavy sentence and show ambiguity counts by parsing ['Arjun','sees','Priya','with','a','new','project','in','Delhi'] using the grammar from 1. Print the number of parses and the first two trees.
4. Write a function that takes a Tree and returns the depth (t.height()) and number of nodes (count subtrees). Parse ['Kavya','writes','a','report','in','Mumbai'] with the grammar from 1 and print the depth and node count for the first parse.
5. Write a chunk grammar to extract verb phrases:
VP: {<VB.*><RB.*>*<VB.*>*<NN.*|PRP>?}
POS-tag the sentence "Aarav will quickly finish work in Pune" and print the chunked VP spans (start/end indexes using treepositions()).
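A sketch of how chunk spans could be recovered with treepositions(). The tags below are assigned by hand so that the example is deterministic; nltk.pos_tag may produce different tags. The helper chunk_spans is a hypothetical name, and spans are reported with an exclusive end index.

```python
import nltk

# Assumption: hand-written tags standing in for nltk.pos_tag output.
tagged = [('Aarav', 'NNP'), ('will', 'MD'), ('quickly', 'RB'),
          ('finish', 'VB'), ('work', 'NN'), ('in', 'IN'), ('Pune', 'NNP')]

chunker = nltk.RegexpParser(r"VP: {<VB.*><RB.*>*<VB.*>*<NN.*|PRP>?}")
tree = chunker.parse(tagged)

def chunk_spans(tree, label):
    """Return sorted (start, end) token indexes of chunks with this label."""
    spans = {}
    # The k-th leaf position corresponds to the k-th token of the sentence.
    for token_index, pos in enumerate(tree.treepositions('leaves')):
        parent_pos = pos[:-1]          # () means the leaf hangs off the root
        if parent_pos and tree[parent_pos].label() == label:
            start, _ = spans.get(parent_pos, (token_index, None))
            spans[parent_pos] = (start, token_index + 1)
    return sorted(spans.values())

vp_spans = chunk_spans(tree, 'VP')
print(tree)
print(vp_spans)
```

With these tags the modal "will" is tagged MD, so it does not match <VB.*> and the chunk starts at "finish".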
6. Using nltk.ne_chunk, perform NER on the POS-tagged sentence "Rohan and Priya visited IIT Delhi and ISRO Bengaluru". Print separate lists for PERSON, ORGANIZATION, and GPE entities detected.
7. Create a chunk grammar for NP with chinking to remove PPs:
NP: {<DT>?<JJ>*<NN.*>+}
}<IN|TO><.*>*{
POS-tag "Priya saw a big temple in Jaipur with Aarav" and print the before/after NP chunks (first run without the chink rule, then with it).
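A sketch of the before/after comparison, using hand-assigned tags (an assumption; nltk.pos_tag output may differ). The helper np_phrases is a hypothetical name. With these tags the prepositions are never inside an NP chunk in the first place, so the chink rule is expected to leave the chunks unchanged.

```python
import nltk

# Assumption: hand-written tags standing in for nltk.pos_tag output.
tagged = [('Priya', 'NNP'), ('saw', 'VBD'), ('a', 'DT'), ('big', 'JJ'),
          ('temple', 'NN'), ('in', 'IN'), ('Jaipur', 'NNP'),
          ('with', 'IN'), ('Aarav', 'NNP')]

chunk_only = nltk.RegexpParser(r"NP: {<DT>?<JJ>*<NN.*>+}")
chunk_and_chink = nltk.RegexpParser(r"""
NP: {<DT>?<JJ>*<NN.*>+}
    }<IN|TO><.*>*{
""")

def np_phrases(tree):
    """Join the words of every NP chunk into a phrase string."""
    return [' '.join(word for word, _ in st.leaves())
            for st in tree.subtrees(lambda t: t.label() == 'NP')]

before = np_phrases(chunk_only.parse(tagged))
after = np_phrases(chunk_and_chink.parse(tagged))
print('without chink:', before)
print('with chink:   ', after)
```

A chink only removes material that an earlier rule has already chunked, so it changes the result when the chunk rule is broad enough to swallow prepositions, for example {<.*>+}.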
8. Define a CFG grammar_dates:
S -> NP VP
NP -> PropN
VP -> V NP | V NP PP
PP -> P DATE
DATE -> NUM '-' NUM '-' NUM
PropN -> 'Sanya'
V -> 'paid'
P -> 'on'
NUM -> '12' | '08' | '2024' | '15' | '09' | '2025'
Parse ['Sanya','paid','Priya','on','15','-','09','-','2025'] after pre-joining the date tokens into '15-09-2025' (write a small preprocessing step), and print the parse.
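One way the preprocessing step might look, sketched under two assumptions that are not in the assignment: once the date is a single token, DATE is rewritten as a terminal built from that token, and 'Priya' is added to PropN so that every input token is covered by the grammar. The helper join_dates is a hypothetical name.

```python
import nltk

def join_dates(tokens):
    """Merge NUM '-' NUM '-' NUM runs into a single 'NUM-NUM-NUM' token."""
    joined, i = [], 0
    while i < len(tokens):
        window = tokens[i:i + 5]
        is_date = (len(window) == 5
                   and window[1] == '-' and window[3] == '-'
                   and all(window[k].isdigit() for k in (0, 2, 4)))
        if is_date:
            joined.append(''.join(window))
            i += 5
        else:
            joined.append(tokens[i])
            i += 1
    return joined

raw = ['Sanya', 'paid', 'Priya', 'on', '15', '-', '09', '-', '2025']
tokens = join_dates(raw)

# Assumptions: DATE is a terminal built from the joined token, and
# 'Priya' is added to PropN so the parser can cover all tokens.
date_token = tokens[-1]
grammar_dates = nltk.CFG.fromstring(f"""
S -> NP VP
NP -> PropN
VP -> V NP | V NP PP
PP -> P DATE
DATE -> '{date_token}'
PropN -> 'Sanya' | 'Priya'
V -> 'paid'
P -> 'on'
""")

date_trees = list(nltk.ChartParser(grammar_dates).parse(tokens))
for tree in date_trees:
    print(tree)
```

Keeping DATE -> NUM '-' NUM '-' NUM and skipping the join step would be the alternative design; the two approaches should not be mixed, because the joined token is not a NUM.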
9. Build a custom tokenizer for "Aarav paid ₹500 in Mumbai on 12-08-2024" that keeps the date as one token and removes the currency symbol, yielding ['Aarav','paid','500','in','Mumbai','on','12-08-2024']. POS-tag it and show the tag sequence.
10. Create a chunk grammar that extracts location NPs ending with city names:
NP_LOC: {<NNP>+<IN>?<NNP>+}
Run it on POS-tagged tokens of "Priya moved to New Delhi from Pune". Print NP_LOC chunks (treat New and Delhi as separate NNP tokens from the tokenizer).
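A small sketch of the NP_LOC rule in action, with hand-assigned tags (an assumption; nltk.pos_tag may tag differently). The variable names are illustrative.

```python
import nltk

# Assumption: hand-written tags standing in for nltk.pos_tag output.
tagged = [('Priya', 'NNP'), ('moved', 'VBD'), ('to', 'TO'), ('New', 'NNP'),
          ('Delhi', 'NNP'), ('from', 'IN'), ('Pune', 'NNP')]

loc_chunker = nltk.RegexpParser(r"NP_LOC: {<NNP>+<IN>?<NNP>+}")
loc_tree = loc_chunker.parse(tagged)

loc_chunks = [' '.join(word for word, _ in st.leaves())
              for st in loc_tree.subtrees(lambda t: t.label() == 'NP_LOC')]
print(loc_tree)
print(loc_chunks)
```

Because the pattern is greedy and the optional <IN> can bridge "from", both city names end up in one chunk here; "Priya" on its own is skipped because the rule needs at least two NNP tokens.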
11. Write an Earley parser for the grammar from 2 (with relative clauses) and parse ['Arjun','who','drinks','tea','visits','Mumbai']. Print all trees and the total count.
12. Write a small function that converts a Penn-tagged sentence to a simple chunk-friendly tagset mapping (NN*→NN, VB*→VB, JJ*→JJ, etc.). Run it on "Kavya submitted the final report in Bengaluru" and print original vs mapped tags.
13. Create a regex chunker that extracts organization names as two consecutive NNP tokens (very naive):
ORG: {<NNP><NNP>}
Run it on POS-tagged "Aarav joined Tata Consultancy" and print ORG chunks.
14. Using ne_chunk, perform NER on "Arjun joined Indian Institute of Technology Bombay in Mumbai" and print whether the organization was detected as a single ORGANIZATION node or split. Include the chunked tree output.
15. Using the grammar from 1, parse ['Neha','drinks','tea'] after adding the tokens 'Neha' to PropN and 'tea' to N or V as appropriate. Print the parse tree.
16. Write code to linearize any Tree back to a sentence by joining leaves with spaces. Test on the best parse of ['Rohan','writes','a','report','in','Delhi'] from the grammar in 1.
17. Build a noun-compound chunk rule:
NC: {<NNP><NNP>|<NN><NN>}
Apply it to POS-tagged "Priya read project report in Pune" and print NC chunks found.
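A sketch of the noun-compound rule, again with hand-assigned tags as an assumption. It mainly illustrates that alternatives inside a chunk rule can be separated with | at the top level.

```python
import nltk

# Assumption: hand-written tags standing in for nltk.pos_tag output.
tagged = [('Priya', 'NNP'), ('read', 'VBD'), ('project', 'NN'),
          ('report', 'NN'), ('in', 'IN'), ('Pune', 'NNP')]

nc_chunker = nltk.RegexpParser(r"NC: {<NNP><NNP>|<NN><NN>}")
nc_tree = nc_chunker.parse(tagged)

nc_chunks = [' '.join(word for word, _ in st.leaves())
             for st in nc_tree.subtrees(lambda t: t.label() == 'NC')]
print(nc_chunks)
```

The rule does not match mixed pairs such as <NNP><NN>; adding them would take another alternative.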
18. Parse three sentences in a loop using the grammar from 1: ['Aarav','visits','Delhi'], ['Priya','likes','a','samosa'], ['Rohan','sees','Kavya','in','Mumbai']. Print the number of parses for each.
19. Using ne_chunk, run NER on "Sanya met Dev at ISRO in Hyderabad" and print three lists: PERSON names, ORGANIZATION names, and GPE names detected.
20. Create a chunk grammar that first finds NP, then chinks out any verbs/prepositions inside the NP. Run this two-stage chunking on "Arjun bought a new phone in Chennai with Priya" and print the final chunk tree.
Advanced Questions
1. Write a richer CFG with subordinate clauses and complementizers. Use nltk.EarleyChartParser to parse both sentences listed after the grammar and print all parse trees:
S -> NP VP | S Conj S
NP -> PropN | Det N | Det Adj N | NP PP | NP RelClause
VP -> V | V NP | V NP PP | V SBar | V PP
SBar -> Comp S
RelClause -> RelPro VP
PP -> P NP
Det -> 'a' | 'the'
Adj -> 'new' | 'important'
N -> 'report' | 'meeting' | 'project' | 'tea'
PropN -> 'Aarav' | 'Priya' | 'Rohan' | 'Kavya' | 'Delhi' | 'Mumbai'
V -> 'said' | 'thinks' | 'likes' | 'scheduled' | 'drinks' | 'prepared'
P -> 'in' | 'at' | 'with'
Conj -> 'and'
Comp -> 'that'
RelPro -> 'who' | 'that'
Sentences to parse:
a) ['Aarav','said','that','Priya','prepared','the','report','in','Delhi']
b) ['Rohan','who','likes','tea','scheduled','the','meeting','in','Mumbai']
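A sketch of the Earley run over both sentences, assuming the grammar above is copied verbatim. The names grammar_rich, sentences, and parse_counts are illustrative. Going by the rules as written, sentence (a) should be ambiguous between attaching "in Delhi" to the verb or to "the report", while sentence (b) contains the bare noun "tea", which no NP rule covers, so it is expected to yield no tree unless a rule such as NP -> N is added.

```python
import nltk
from nltk.parse import EarleyChartParser

grammar_rich = nltk.CFG.fromstring("""
S -> NP VP | S Conj S
NP -> PropN | Det N | Det Adj N | NP PP | NP RelClause
VP -> V | V NP | V NP PP | V SBar | V PP
SBar -> Comp S
RelClause -> RelPro VP
PP -> P NP
Det -> 'a' | 'the'
Adj -> 'new' | 'important'
N -> 'report' | 'meeting' | 'project' | 'tea'
PropN -> 'Aarav' | 'Priya' | 'Rohan' | 'Kavya' | 'Delhi' | 'Mumbai'
V -> 'said' | 'thinks' | 'likes' | 'scheduled' | 'drinks' | 'prepared'
P -> 'in' | 'at' | 'with'
Conj -> 'and'
Comp -> 'that'
RelPro -> 'who' | 'that'
""")

sentences = {
    'a': ['Aarav', 'said', 'that', 'Priya', 'prepared', 'the', 'report', 'in', 'Delhi'],
    'b': ['Rohan', 'who', 'likes', 'tea', 'scheduled', 'the', 'meeting', 'in', 'Mumbai'],
}

earley = EarleyChartParser(grammar_rich)
parse_counts = {}
for key, tokens in sentences.items():
    trees = list(earley.parse(tokens))
    parse_counts[key] = len(trees)
    print(f'({key}) {len(trees)} parse(s)')
    for tree in trees:
        print(tree)
```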
2. Construct a controlled-ambiguity CFG by including both VP -> V NP PP and VP -> V PP NP. Parse ['Priya','sees','Aarav','with','a','report'] and print the number of parses and the first two trees. Then remove VP -> V PP NP and show that the parse count drops.
3. Build a rule-based NP chunker with the following sequence of rules (in order) and apply it on POS-tagged "Kavya bought a new laptop for Arjun in Pune". Print the final chunk tree after applying all rules:
NP: {<DT>?<JJ>*<NN.*>+}
PP: {<IN><NP>}
VP: {<VB.*><NP|PP|CLAUSE>+$}
CLAUSE: {<NP><VP>}
}<VB.*|IN>+{ # chink out verbs and preps from NP
4. Using ne_chunk (non-binary), run NER on the POS-tagged paragraph "Priya and Aarav joined Infosys in Bengaluru. Rohan visited IIT Madras in Chennai.". Programmatically traverse the returned tree to collect and print a dictionary with keys 'PERSON', 'ORGANIZATION', and 'GPE' mapping to sorted lists of unique surface forms.
5. Write a function that converts a parse Tree into a dependency-like edge list (head, dependent) by treating each parent as the head of its children (simple approximation). Parse ['Arjun','writes','a','report','in','Delhi'] with the grammar from 1 and print the edge list.
6. Create a grammar that handles proper-noun multiword names by tokenizing "New Delhi" as two tokens ['New','Delhi'] and including NMulti -> 'New' 'Delhi' with NP -> NMulti. Parse ['Priya','visits','New','Delhi'] and print the tree.
7. Implement a chunk accuracy experiment: define gold NP spans manually for the sentence "Sanya bought a blue dress in Jaipur" as character offsets. Run your NP chunker from Intermediate Q7, convert chunk spans to character offsets, and print precision, recall, and F1 with respect to the gold NP spans.
8. Write a PP-attachment heuristic resolver: after parsing ['Aarav','sees','Priya','with','a','camera'] (grammar from 1), inspect trees and select the parse where "with a camera" attaches to the verb (VP) rather than the NP. Print only that tree.
9. Build a small pipeline: tokenize → POS-tag → NP chunk → NER for the sentence "Dev met Neha at TCS in Hyderabad". Print tokens, tags, NP chunks (as list of phrases), and named entities (as (label, phrase) tuples).
10. Create and parse a mini Indian address using a CFG:
ADDR -> NAME COMMA HOUSE COMMA CITY
NAME -> 'Priya' | 'Aarav' | 'Rohan'
HOUSE -> NUM 'MG' 'Road'
CITY -> 'Mumbai' | 'Delhi' | 'Pune'
NUM -> '12' | '24' | '108'
COMMA -> ','
Tokenize "Priya , 108 MG Road , Mumbai" into ['Priya',',','108','MG','Road',',','Mumbai'], parse with a ChartParser, and print the resulting tree.
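A sketch of the address grammar in use. It assumes whitespace splitting is enough for tokenization, which works here only because the commas are already separated by spaces in the input string. The variable names are illustrative.

```python
import nltk

grammar_addr = nltk.CFG.fromstring("""
ADDR -> NAME COMMA HOUSE COMMA CITY
NAME -> 'Priya' | 'Aarav' | 'Rohan'
HOUSE -> NUM 'MG' 'Road'
CITY -> 'Mumbai' | 'Delhi' | 'Pune'
NUM -> '12' | '24' | '108'
COMMA -> ','
""")

# Whitespace split keeps ',' as its own token for this pre-spaced input.
addr_tokens = "Priya , 108 MG Road , Mumbai".split()

addr_trees = list(nltk.ChartParser(grammar_addr).parse(addr_tokens))
for tree in addr_trees:
    print(tree)
```

The start symbol is taken from the left-hand side of the first production, so ADDR has to stay on the first line of the grammar string.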