NLTK Assignment – 5

Text Classification

Preparation (run once before starting)

import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('movie_reviews')
nltk.download('reuters')
(If a question asks for TF-IDF with scikit-learn or χ² feature selection, install once: pip install scikit-learn.)

Basic Questions

  1. Using the movie_reviews corpus, build a training set of documents where each document is the raw text of one file and the label is its category ('pos' or 'neg'); split into 80% train and 20% test using a fixed seed (e.g., random.seed(42)), and print the number of train and test documents.
  2. From the training documents in Question 1, create a bag-of-words presence feature extractor: lowercase, tokenize with nltk.word_tokenize, remove English stopwords (nltk.corpus.stopwords.words('english')) and punctuation, and represent each document as {word: True} for the top 2,000 most frequent training tokens; show the first 10 keys of one feature dict (a combined sketch of Questions 1–3 appears after this list).
  3. Train nltk.NaiveBayesClassifier on the features from Question 2 and evaluate on the test set; print overall accuracy (proportion correct to 3 decimals).
  4. Using the same split, create a unigram count feature extractor (counts instead of booleans) and train nltk.DecisionTreeClassifier; print accuracy and the size of the learned tree (classifier.nodecount() if available, otherwise len(classifier.pseudocode().splitlines())).
  5. On the Naive Bayes model from Question 3, print the 15 most informative features using show_most_informative_features(15).
  6. Build a bigram presence feature set: add the top 500 bigrams (by frequency over training tokens using nltk.bigrams) as features alongside unigrams; retrain Naive Bayes and print the new accuracy.
  7. Compute a simple TF (term frequency) vector (counts) for each training document and print the top 10 terms by global frequency (training set only).
  8. Compute TF-IDF weights using either nltk.text.TextCollection (tc.tf_idf(term, doc)) or sklearn.feature_extraction.text.TfidfVectorizer (state which you used); print the 10 highest-weighted terms in one chosen training document (a TextCollection sketch appears after this list).
  9. Create a tiny toy dataset with 8 short sentences you write that only use Indian names (e.g., “Aarav loved the movie”, “Priya disliked the ending” …) labeled pos/neg; vectorize with bag-of-words and train Naive Bayes; print predictions for two new sentences with Indian names (e.g., “Rohan enjoyed the songs”, “Kavya hated the plot”).
  10. Using the test predictions from Question 3, build a confusion matrix with labels ['neg', 'pos'] using nltk.ConfusionMatrix(gold, pred); print the matrix (a sketch covering Questions 10–12 appears after this list).
  11. From the confusion matrix in Question 10, compute precision, recall, F1 for the pos class manually from counts (TP, FP, FN) and print the three metrics to 3 decimals.
  12. Repeat Question 11 for the neg class and print the three metrics to 3 decimals.
  13. Re-tokenize the training data using a RegexpTokenizer(r'\w+') to drop punctuation at tokenization time; rebuild features (unigram presence, 2,000 features) and retrain Naive Bayes; print accuracy to compare with Question 3.
  14. Using lemmatization (WordNetLemmatizer with a naive POS assumption of noun), rebuild unigram presence features; train Naive Bayes; print accuracy to see the effect of lemmatization.
  15. Print the top 20 tokens that contribute most to the Naive Bayes log-probability for the pos label (hint: inspect classifier._feature_probdist if available, or recompute from show_most_informative_features output).
  16. Using the reuters corpus, create a binary classification dataset for whether a document belongs to the 'crude' category (positive) vs. “not crude” (negative); split 80/20 with a fixed seed, vectorize with unigram presence (2,000 features), train Naive Bayes, and print accuracy.
  17. For the Reuters task in Question 16, print the number of positive and negative training instances to show class balance/imbalance.
  18. For the same Reuters split, train nltk.DecisionTreeClassifier on unigram presence features and print accuracy; note which classifier (NB vs DT) was better.
  19. For the movie_reviews NB model from Question 3, print the first 25 (gold, pred, filename) triples where the prediction is incorrect.
  20. Save the trained movie_reviews Naive Bayes model from Question 3 to disk using pickle and load it back; verify by predicting the first 5 test documents and showing that predictions match before and after saving.
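
A minimal sketch covering Basic Questions 1–3, assuming the split and vocabulary sizes stated in those questions; names such as docs, vocab, and featurize are illustrative, not required:

import random, string
import nltk
from nltk.corpus import movie_reviews, stopwords

# Question 1: one raw-text document per file, labelled by its category, 80/20 split with a fixed seed
docs = [(movie_reviews.raw(fid), cat)
        for cat in movie_reviews.categories()
        for fid in movie_reviews.fileids(cat)]
random.seed(42)
random.shuffle(docs)
split = int(0.8 * len(docs))
train_docs, test_docs = docs[:split], docs[split:]
print(len(train_docs), len(test_docs))

# Question 2: presence features over the top 2,000 training tokens (stopwords and punctuation removed)
stop = set(stopwords.words('english')) | set(string.punctuation)
def tokens(text):
    return [t for t in nltk.word_tokenize(text.lower()) if t not in stop]

fd = nltk.FreqDist(t for text, _ in train_docs for t in tokens(text))
vocab = {w for w, _ in fd.most_common(2000)}
def featurize(text):
    return {t: True for t in tokens(text) if t in vocab}

train_set = [(featurize(text), label) for text, label in train_docs]
test_set = [(featurize(text), label) for text, label in test_docs]

# Question 3: train Naive Bayes and report accuracy on the held-out 20%
clf = nltk.NaiveBayesClassifier.train(train_set)
print(round(nltk.classify.accuracy(clf, test_set), 3))

Questions 5 and 20 operate directly on clf; Question 19 additionally needs the file ids, so you may want to carry fid alongside the raw text in docs.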
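
For Basic Question 8 via NLTK, TextCollection exposes tf_idf(term, doc) directly; this snippet assumes the tokens() helper and train_docs list from the Question 1–3 sketch, and the choice of the first document to inspect is arbitrary:

from nltk.text import TextCollection

train_token_lists = [tokens(text) for text, _ in train_docs]
tc = TextCollection(train_token_lists)
doc = train_token_lists[0]                     # any training document works here
weights = {term: tc.tf_idf(term, doc) for term in set(doc)}
for term, w in sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:10]:
    print(term, round(w, 4))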
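
For Basic Questions 10–12, the per-class metrics can be read straight off the ConfusionMatrix counts; this reuses clf and test_set from the Question 1–3 sketch:

# Question 10: gold vs. predicted labels on the test set
gold = [label for _, label in test_set]
pred = [clf.classify(feats) for feats, _ in test_set]
cm = nltk.ConfusionMatrix(gold, pred)
print(cm)

# Question 11: precision/recall/F1 for the 'pos' class from raw counts
tp = cm['pos', 'pos']   # gold pos, predicted pos
fp = cm['neg', 'pos']   # gold neg, predicted pos
fn = cm['pos', 'neg']   # gold pos, predicted neg
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(f"pos: P={precision:.3f} R={recall:.3f} F1={f1:.3f}")

Swapping 'pos' and 'neg' in the three index lookups gives the Question 12 metrics for the neg class.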

Intermediate Questions

  1. Using the movie_reviews split, build character n-gram features: top 3,000 character 3-grams by document frequency over training; train Naive Bayes; print accuracy and compare to unigram word features.
  2. Build a trigram word presence model (top 1,000 trigrams + top 1,500 unigrams combined) and train Naive Bayes; print accuracy.
  3. Train a Maximum Entropy (MaxEnt) classifier with NLTK’s MaxentClassifier (GIS algorithm, e.g., max_iter=20) on unigram presence features for movie_reviews; print accuracy and training time in seconds.
  4. For the MaxEnt model in Question 3, print the 20 highest-weight features for the pos label (use classifier.weights() or classifier.show_most_informative_features(20) depending on NLTK version).
  5. Implement k-fold cross-validation (k=5) on movie_reviews with unigram presence features and Naive Bayes; print the mean and standard deviation of accuracy across folds (use the same random seed to shuffle document ids); a sketch appears after this list.
  6. For the Reuters 'crude' vs. not-crude task, demonstrate class imbalance by creating a deliberately imbalanced training set (keep all positives, downsample negatives to 30%); train Naive Bayes and print precision, recall, and F1 for the positive class.
  7. Apply class weighting or prior adjustment: up-weight positive features (e.g., duplicate positive training instances or add a prior count) for the imbalanced Reuters set; retrain Naive Bayes and print precision/recall/F1 to show the change.
  8. Use TF-IDF vectors from sklearn.feature_extraction.text.TfidfVectorizer with min_df=5, ngram_range=(1,2) on movie_reviews; train a linear classifier outside NLTK (LogisticRegression) as a sanity baseline; print accuracy (this is for comparison; you must still keep NLTK models in the other questions).
  9. Compute feature selection with χ² (sklearn.feature_selection.chi2) on movie_reviews (unigram counts); keep the top 1,500 features; train Naive Bayes on the reduced feature set and print accuracy.
  10. Repeat Question 9 with mutual information (sklearn.feature_selection.mutual_info_classif) instead of χ²; print accuracy and compare.
  11. For the movie_reviews Naive Bayes model, list (a textual print is fine) the top false positives: the 10 test documents predicted pos with the highest NB log-probability whose gold label is neg (print the filename and the top 5 indicative features present).
  12. Build a hybrid feature set: for each document add two numeric meta-features—document length (tokens) and average token length—alongside unigram presence; train MaxEnt and print accuracy.
  13. Evaluate per-class precision, recall, F1 for movie_reviews Naive Bayes using manual calculations from the confusion matrix; print a clean table with rows neg and pos.
  14. For the Reuters 'crude' task, compute a ROC-like threshold sweep (if your model yields probabilities): vary the threshold for predicting positive from 0.1 to 0.9 and print precision/recall pairs per threshold.
  15. Implement a pipeline function train_and_eval(docs, labels, vectorizer_fn, classifier_cls) that returns accuracy and a confusion matrix; run it for (a) unigram presence + Naive Bayes, (b) unigram presence + Decision Tree, (c) bigram presence + Naive Bayes on movie_reviews; print all results (a skeleton of the function appears after this list).
  16. Show tokenization sensitivity: run the same movie_reviews Naive Bayes with (a) word_tokenize and (b) RegexpTokenizer(r"\w+"); print both accuracies and the size of the resulting vocabulary in each case.
  17. For the imbalanced Reuters set, apply random undersampling of negatives to achieve 1:1 class ratio; train Naive Bayes; print confusion matrix and F1 for the positive class.
  18. For the same Reuters task, apply random oversampling of positives (duplicate positive documents until 1:1); train Naive Bayes; print confusion matrix and F1 for the positive class.
  19. On movie_reviews, compute a learning curve by training Naive Bayes on {10%, 25%, 50%, 75%, 100%} of the training data and evaluating on the same held-out test set; print the table of train_size vs accuracy.
  20. Save a compact model artifact: persist (a) the list of selected features (vocabulary), (b) stopword list used, and (c) the trained classifier via pickle; load it and classify three new short review snippets you write that include Indian names (e.g., “Arjun said the film was superb”, “Priya felt the acting was dull”, “Rohan liked the music but not the story”).
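
One possible shape for the 5-fold cross-validation in Intermediate Question 5; rebuilding the 2,000-word vocabulary from each training fold is an assumption made here to avoid leaking test vocabulary, and stopword/punctuation handling mirrors Basic Question 2:

import random, statistics, string
import nltk
from nltk.corpus import movie_reviews, stopwords

stop = set(stopwords.words('english')) | set(string.punctuation)
def tokens(text):
    return [t for t in nltk.word_tokenize(text.lower()) if t not in stop]

docs = [(movie_reviews.raw(fid), cat)
        for cat in movie_reviews.categories()
        for fid in movie_reviews.fileids(cat)]
random.seed(42)
random.shuffle(docs)

k = 5
fold_size = len(docs) // k
accuracies = []
for i in range(k):
    test_fold = docs[i * fold_size:(i + 1) * fold_size]
    train_fold = docs[:i * fold_size] + docs[(i + 1) * fold_size:]
    # rebuild the 2,000-word vocabulary from the training fold only
    fd = nltk.FreqDist(t for text, _ in train_fold for t in tokens(text))
    vocab = {w for w, _ in fd.most_common(2000)}
    featurize = lambda text: {t: True for t in tokens(text) if t in vocab}
    clf = nltk.NaiveBayesClassifier.train([(featurize(t), y) for t, y in train_fold])
    accuracies.append(nltk.classify.accuracy(clf, [(featurize(t), y) for t, y in test_fold]))
print(f"mean={statistics.mean(accuracies):.3f} std={statistics.stdev(accuracies):.3f}")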
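
Intermediate Question 15 fixes the signature of train_and_eval; the body below is only one way to fill it in (the 80/20 split and the fixed seed are illustrative choices):

import random
import nltk

def train_and_eval(docs, labels, vectorizer_fn, classifier_cls):
    # docs/labels are parallel lists; vectorizer_fn maps raw text to a feature dict;
    # classifier_cls is an NLTK classifier class such as nltk.NaiveBayesClassifier.
    data = list(zip(docs, labels))
    random.seed(42)
    random.shuffle(data)
    split = int(0.8 * len(data))
    train = [(vectorizer_fn(d), y) for d, y in data[:split]]
    test = [(vectorizer_fn(d), y) for d, y in data[split:]]
    clf = classifier_cls.train(train)
    gold = [y for _, y in test]
    pred = [clf.classify(f) for f, _ in test]
    return nltk.classify.accuracy(clf, test), nltk.ConfusionMatrix(gold, pred)

Cases (a)–(c) then differ only in the vectorizer_fn and classifier_cls you pass in, e.g. a unigram-presence extractor with nltk.NaiveBayesClassifier versus nltk.DecisionTreeClassifier.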

Advanced Questions

  1. Build a complete evaluation report for movie_reviews with Naive Bayes (unigram presence): overall accuracy, confusion matrix, per-class precision/recall/F1, top 20 informative features per class, and 10 most confident errors (filenames and key features); print the report in a readable, labeled format.
  2. Implement a feature ablation study on movie_reviews: start with unigrams (2,000), then add bigrams (+500), then add trigrams (+300), then add char 3-grams (+3,000); retrain Naive Bayes each time; print accuracy after each addition to quantify marginal gains.
  3. Create a domain-shift experiment: train Naive Bayes on movie_reviews, then test on a mixed set of 100 items: 50 Reuters sentences you select that talk about products/earnings, labelled negative (e.g., anything containing words like “profit”, “loss”, “sales”), plus 50 handmade Indian-name “review-style” snippets you write, labelled positive; print accuracy and discuss (in a brief code comment) why cross-domain performance drops.
  4. For the Reuters 'crude' vs. not-crude task, implement cost-sensitive classification by duplicating positives 3× in training; compare precision/recall/F1 against (a) baseline, (b) undersampling, and (c) oversampling; print a comparison table.
  5. Implement feature selection inside NLTK features: compute χ² scores on training data, keep only features above a threshold, and ensure your feature extractor filters tokens at extraction time; retrain Naive Bayes and print accuracy and number of active features.
  6. Train a MaxEnt classifier with TF-IDF magnitude and two numeric meta-features (doc length and punctuation ratio); print per-class precision/recall/F1 and compare with the Naive Bayes unigram baseline.
  7. Build a news vs reviews classifier: combine 1,000 movie_reviews docs (balanced) and 1,000 reuters docs (pick any categories) into a binary dataset (label review vs news), vectorize with TF-IDF (unigram + bigram), train Logistic Regression (sklearn) and also NLTK MaxEnt; print accuracy for both and the top 15 features for the review class.
  8. Implement a nested cross-validation on movie_reviews to tune two hyperparameters of a linear classifier from sklearn (e.g., C and ngram_range) while still using NLTK for tokenization/cleanup; print mean accuracy of the outer folds and the chosen parameters per fold.
  9. Create a hard negative set of 200 sentences that you write with only Indian names and mixed sentiment (e.g., “Kavya liked the cinematography but Arjun found the dialogues weak”); evaluate your best movie_reviews classifier on this set; print precision/recall/F1 for pos and neg on this hard set.
  10. Package a reusable classification module (tc_pipeline.py) exposing: build_vocab(train_docs), extract_features(doc), train_nb(train_features), predict(classifier, doc), and evaluate(gold, pred); demonstrate on movie_reviews by training once, saving artifacts, and classifying three new review snippets featuring Indian names (e.g., “Neha loved the direction”, “Ishan thought the climax was average”, “Sanya said the soundtrack was brilliant”) and printing the predicted labels (a possible module skeleton follows this list).
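
A possible skeleton for the tc_pipeline.py module in Advanced Question 10; the five exposed function names come from the question, while the module-level _VOCAB cache and the _tokens() helper are design choices of this sketch:

import string
import nltk
from nltk.corpus import stopwords

_STOP = set(stopwords.words('english')) | set(string.punctuation)
_VOCAB = set()

def _tokens(doc):
    return [t for t in nltk.word_tokenize(doc.lower()) if t not in _STOP]

def build_vocab(train_docs, size=2000):
    # remember the top `size` training tokens; extract_features() filters against this set
    global _VOCAB
    fd = nltk.FreqDist(t for doc in train_docs for t in _tokens(doc))
    _VOCAB = {w for w, _ in fd.most_common(size)}
    return _VOCAB

def extract_features(doc):
    return {t: True for t in _tokens(doc) if t in _VOCAB}

def train_nb(train_features):
    # train_features: list of (feature_dict, label) pairs
    return nltk.NaiveBayesClassifier.train(train_features)

def predict(classifier, doc):
    return classifier.classify(extract_features(doc))

def evaluate(gold, pred):
    cm = nltk.ConfusionMatrix(gold, pred)
    accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    return accuracy, cm

The vocabulary, stopword list, and trained classifier can then be persisted with pickle, as in Basic Question 20 and Intermediate Question 20, before classifying the three new snippets.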