NLTK Assignment 1
Text Preprocessing
Preparation (run once before starting)
import nltk
nltk.download('punkt') # sentence/word tokenizers
nltk.download('stopwords') # stopword list
nltk.download('wordnet') # lemmatization dictionary
nltk.download('omw-1.4') # multilingual WordNet data used by lemmatizer
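If you prefer to download only what is missing, a minimal check-then-download sketch is shown below; the resources dictionary simply maps each download name to its path under the NLTK data directory and is only one way to organize the check.

import nltk

# Download a resource only if it is not already installed.
resources = {
    'punkt': 'tokenizers/punkt',
    'stopwords': 'corpora/stopwords',
    'wordnet': 'corpora/wordnet',
    'omw-1.4': 'corpora/omw-1.4',
}
for name, path in resources.items():
    try:
        nltk.data.find(path)   # raises LookupError if the resource is missing
    except LookupError:
        nltk.download(name)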
Basic Questions
- Install/import NLTK and ensure resources are available (punkt, stopwords, wordnet, omw-1.4). Write a check that downloads missing ones at runtime.
- Create a small raw text string (5–6 sentences). Tokenize into sentences using nltk.sent_tokenize and print the list.
- Using the same text, tokenize into words with nltk.word_tokenize. Count total tokens and unique tokens.
- Tokenize with nltk.wordpunct_tokenize and compare its output length vs word_tokenize for the same text.
- Build a RegexpTokenizer that extracts word tokens only (pattern \w+) and apply it to the text. Print the first 20 tokens.
- Perform case folding: convert all tokens to lowercase (use .casefold(), not .lower()) and show 10 example strings where casefold() gives a different result than lower() (e.g., “ß”).
- Remove punctuation from tokens using string.punctuation (keep numbers and words). Show 20 remaining tokens.
- Remove English stopwords (nltk.corpus.stopwords.words('english')). Print the 20 most frequent remaining tokens.
- Apply PorterStemmer to your cleaned tokens. Show a mapping (original → stem) for 20 tokens.
- Apply LancasterStemmer to the same tokens. Show differences vs Porter for 15 tokens.
- Apply SnowballStemmer('english') and compare stems against Porter/Lancaster for 15 tokens.
- Use WordNetLemmatizer (default POS) on your cleaned tokens. Show 20 (token → lemma) pairs.
- Show that lemmatization depends on POS: lemmatize “better, running, mice” with and without POS hints.
- Build a function to strip digits from tokens (keep words) and apply it. Output the first 30 tokens.
- Build a pipeline: sentence tokenize → word tokenize → casefold → punctuation+stopword removal. Print the final token count (a starter sketch is shown after this list).
- Handle Unicode accents: create a string with characters like “café, naïve, coöperate”. Normalize with unicodedata.normalize('NFD') and remove combining marks; print before/after.
- Detect emoji tokens using a regex range and list unique emojis from a sample text that contains at least 5 emojis.
- Remove URLs from noisy text using regex and then tokenize words. Show tokens that remain.
- Remove Twitter handles (@user) and hashtags from a line of social text while keeping the words themselves (e.g., keep “festival” from “#festival”).
- Save the final cleaned tokens (after your pipeline) to a list and write to a text file, one token per line.
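For the pipeline item above (sentence tokenize → word tokenize → casefold → punctuation+stopword removal), here is a minimal starter sketch; the sample text and the variable names raw_text and clean_tokens are placeholders to replace with your own.

import string
import nltk
from nltk.corpus import stopwords

raw_text = "Replace this with your own 5-6 sentence sample text."
stop_words = set(stopwords.words('english'))

clean_tokens = []
for sentence in nltk.sent_tokenize(raw_text):
    for token in nltk.word_tokenize(sentence):
        token = token.casefold()                           # case folding
        if all(ch in string.punctuation for ch in token):  # drop punctuation-only tokens
            continue
        if token in stop_words:                            # drop English stopwords
            continue
        clean_tokens.append(token)

print(len(clean_tokens))   # final token count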
Intermediate Questions
- Create a custom RegexpTokenizer that preserves contractions (e.g., “don’t”, “I’m”) as single tokens and tokenizes the rest normally. Test on a sentence with 6–8 contractions.
- Build a tokenizer that keeps hyphenated words (e.g., “state-of-the-art”) intact but splits other punctuation.
- Write a regex tokenizer that does not split decimal numbers (e.g., “3.14”) or ISO dates (YYYY-MM-DD). Demonstrate on a mixed string.
- Implement two stopword lists: (i) default NLTK, (ii) custom list that keeps negations (no, not, never). Compare token counts after each.
- Construct a whitelist of domain terms (e.g., “AI”, “GPU”) that should never be removed even if they appear in stopwords; integrate into your pipeline.
- Compute and display top 20 most frequent stems using Porter, Lancaster, and Snowball separately; show how many original tokens map to each stem (frequency of stems).
- POS-tag your tokens (nltk.pos_tag) and map the tags to WordNet POS. Lemmatize with the POS-aware lemmatizer and compare vocabulary size vs the default lemmatizer (see the tag-mapping sketch after this list).
- Create a function that normalizes elongated words (e.g., “soooo coooool” → “soo cool” or “so cool” by limiting repeats), then re-tokenize. Show before/after tokens.
- Remove HTML tags and HTML entities (e.g., &amp;) from text before tokenization. Show the cleaned plain-text output.
- Build an end-to-end clean_text(text) function parameterized by flags: lower=True, remove_punct=True, remove_stop=True, stemmer='porter'|'lancaster'|'snowball'|None, lemmatize=True|False. Demonstrate 3 different configurations.
- Read text with encoding issues (simulate bytes encoded as latin-1 vs utf-8). Safely decode with errors='replace' and normalize to NFC. Tokenize and show a lossless result when possible.
- Create a frequency distribution (nltk.FreqDist) over cleaned tokens; plot the top 30 tokens with counts (text-only print is fine).
- Build bigrams and trigrams from cleaned tokens using nltk.ngrams. List top 10 most frequent n-grams.
- Identify hapax legomena (tokens that occur once) from cleaned tokens and print 30 of them.
- For a paragraph containing mixed English + Hindi/Devanagari, use a regex to separate scripts; tokenize each portion and print counts per script.
- Compare casefold() vs lower() on edge-case tokens (German “ß”, Turkish dotted “İ”). Show the differences in results and explain them with code comments.
- Build a regression test set: a small set of 15 tricky strings (URLs, emojis, tags, contractions, decimals). Run your pipeline and assert expected token outputs.
- Implement punctuation-aware sentence segmentation: split sentences ensuring ellipses (“…”) and quotes are handled; compare against sent_tokenize.
- Create a custom domain stopword list from top 50 most frequent tokens after initial cleaning; re-run pipeline and measure vocabulary reduction.
- Package your pipeline as a reusable module (preprocess.py) exposing clean_text and tokenize_text; show example usage.
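For the POS-aware lemmatization item above, one possible mapping from Penn Treebank tags (as returned by nltk.pos_tag) to WordNet POS constants is sketched below. The helper name penn_to_wordnet is illustrative, and nltk.pos_tag needs the averaged_perceptron_tagger resource, which is not part of the preparation downloads.

import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

# nltk.download('averaged_perceptron_tagger')   # required once for nltk.pos_tag

def penn_to_wordnet(tag):
    # Map Penn Treebank tags to the POS constants the lemmatizer expects.
    if tag.startswith('J'):
        return wordnet.ADJ
    if tag.startswith('V'):
        return wordnet.VERB
    if tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN   # the lemmatizer's default POS

lemmatizer = WordNetLemmatizer()
tokens = nltk.word_tokenize("The mice were running and things got better")
tagged = nltk.pos_tag(tokens)
lemmas = [lemmatizer.lemmatize(tok.casefold(), penn_to_wordnet(tag)) for tok, tag in tagged]
print(list(zip(tokens, lemmas)))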
Advanced Questions
- Stemming vs Lemmatization Study: On a medium text (≥10k tokens), run four variants: Porter, Lancaster, Snowball, and POS-aware Lemmatizer. Report: vocabulary size, top-25 tokens for each method, and overlap between sets.
- Accuracy-Oriented Pipeline: Create a pipeline that retains negations, keeps domain terms, uses POS-aware lemmatization, and limits elongated chars. Evaluate on a small sentiment-like sample by printing bigram collocations before/after.
- Speed/Memory Benchmark: For a 100k-token synthetic text, benchmark runtime and memory (approx via tracemalloc or simple timers) for: (i) pure regex tokenizer, (ii) word_tokenize, (iii) regexp + POS-lemmatize. Summarize results.
- Noisy Social Text Cleaner: Design a robust cleaner for tweets: remove URLs, mentions, hashtags (keep the word), emojis (optionally keep as token “<EMOJI>”), collapse repeats, normalize accents; output tokens. Provide a toggle to preserve emojis as separate tokens.
- Unicode Normalization Suite: Given text containing mixed normalization forms (NFC/NFD/NFKC) and zero-width characters, build a function that strips zero-width joiners, normalizes to NFKC, then tokenizes. Prove correctness with repr() outputs.
- Language-Aware Stemming: For two languages (English + one more you choose supported by Snowball), stem and compare how many tokens are reduced. Handle language-specific stopwords if available; keep processes separated.
- Custom RegexpTokenizer Design: Create a single regex that handles: emails, URLs, @mentions, #hashtags (capturing words), emojis, decimals, ISO dates, and words with apostrophes. Show 15 diverse inputs → tokens.
- Human-in-the-Loop Review: Build a small CLI that prints the top 200 cleaned tokens and lets the user add items to a custom stopword file interactively; rerun cleaning with the updated list.
- Error-Resilient File Loader: Implement a reader that loads multiple text files with unknown encodings, tries utf-8 → latin-1 → cp1252, logs failures, and yields normalized Unicode strings for downstream tokenization (a decoding-fallback sketch appears at the end of this sheet).
- Mini Project – Preprocessing Report: Create a notebook/script that:
  - Loads a text corpus (or a few plain files),
  - Runs two pipelines (aggressive stemming vs POS-lemmatization),
  - Produces side-by-side stats (token count, vocab size, top-20 tokens, top-20 bigrams),
  - Saves the cleaned tokens of both pipelines to disk, and prints a short conclusion (in comments) on which pipeline you’d choose and why.
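For the error-resilient file loader in the advanced list, a minimal decoding-fallback sketch is given below. The function name read_files and the print-based logging are placeholders; note that latin-1 accepts any byte sequence, so the final errors='replace' branch is only a safety net.

import unicodedata

def read_files(paths, encodings=('utf-8', 'latin-1', 'cp1252')):
    # Yield NFC-normalized text, trying each encoding in turn for every file.
    for path in paths:
        with open(path, 'rb') as handle:
            raw = handle.read()
        text = None
        for enc in encodings:
            try:
                text = raw.decode(enc)
                break
            except UnicodeDecodeError:
                print(f'{path}: failed to decode as {enc}')   # simple failure log
        if text is None:
            text = raw.decode('utf-8', errors='replace')      # last-resort fallback
        yield unicodedata.normalize('NFC', text)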