NLTK Assignment 1
Text Preprocessing
Preparation (run once before starting)
import nltk
nltk.download('punkt') # sentence/word tokenizers
nltk.download('stopwords') # stopword list
nltk.download('wordnet') # lemmatization dictionary
nltk.download('omw-1.4') # multilingual WordNet data used by lemmatizer
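If you prefer to download only what is missing, a minimal check-then-download sketch is shown below; the resources dictionary simply maps each download name to its path under the NLTK data directory and is only one way to organize the check.

import nltk

# Download a resource only if it is not already installed.
resources = {
    'punkt': 'tokenizers/punkt',
    'stopwords': 'corpora/stopwords',
    'wordnet': 'corpora/wordnet',
    'omw-1.4': 'corpora/omw-1.4',
}
for name, path in resources.items():
    try:
        nltk.data.find(path)   # raises LookupError if the resource is missing
    except LookupError:
        nltk.download(name)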
Basic Questions
- Install/import NLTK and ensure resources are available (punkt, stopwords, wordnet, omw-1.4). Write a check that downloads missing ones at runtime.
- Create a small raw text string (5–6 sentences). Tokenize into sentences using nltk.sent_tokenize and print the list.
- Using the same text, tokenize into words with nltk.word_tokenize. Count total tokens and unique tokens.
- Tokenize with nltk.wordpunct_tokenize and compare its output length vs word_tokenize for the same text.
- Build a RegexpTokenizer that extracts word tokens only (pattern \w+) and apply it to the text. Print the first 20 tokens.
- Perform case folding: convert all tokens to lowercase (use .casefold(), not .lower()) and show 10 example strings where casefold() gives a different result than lower() (e.g., “ß”).
- Remove punctuation from tokens using string.punctuation (keep numbers and words). Show 20 remaining tokens.
- Remove English stopwords (nltk.corpus.stopwords.words('english')). Print the 20 most frequent remaining tokens.
- Apply PorterStemmer to your cleaned tokens. Show a mapping (original → stem) for 20 tokens.
- Apply LancasterStemmer to the same tokens. Show differences vs Porter for 15 tokens.
- Apply SnowballStemmer('english') and compare stems against Porter/Lancaster for 15 tokens.
- Use WordNetLemmatizer (default POS) on your cleaned tokens. Show 20 (token → lemma) pairs.
- Show that lemmatization depends on POS: lemmatize “better, running, mice” with and without POS hints.
- Build a function to strip digits from tokens (keep words) and apply it. Output the first 30 tokens.
- Build a pipeline: sentence tokenize → word tokenize → casefold → punctuation+stopword removal. Print the final token count (a starter sketch is shown after this list).
- Handle Unicode accents: create a string with characters like “café, naïve, coöperate”. Normalize with unicodedata.normalize('NFD') and remove combining marks; print before/after.
- Detect emoji tokens using a regex range and list unique emojis from a sample text that contains at least 5 emojis.
- Remove URLs from noisy text using regex and then tokenize words. Show tokens that remain.
- Remove Twitter handles (@user) and hashtags from a line of social text while keeping the words themselves (e.g., keep “festival” from “#festival”).
- Save the final cleaned tokens (after your pipeline) to a list and write to a text file, one token per line.
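For the pipeline item above (sentence tokenize → word tokenize → casefold → punctuation+stopword removal), here is a minimal starter sketch; the sample text and the variable names raw_text and clean_tokens are placeholders to replace with your own.

import string
import nltk
from nltk.corpus import stopwords

raw_text = "Replace this with your own 5-6 sentence sample text."
stop_words = set(stopwords.words('english'))

clean_tokens = []
for sentence in nltk.sent_tokenize(raw_text):
    for token in nltk.word_tokenize(sentence):
        token = token.casefold()                           # case folding
        if all(ch in string.punctuation for ch in token):  # drop punctuation-only tokens
            continue
        if token in stop_words:                            # drop English stopwords
            continue
        clean_tokens.append(token)

print(len(clean_tokens))   # final token count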
Intermediate Questions
- Create a custom RegexpTokenizer that preserves contractions (e.g., “don’t”, “I’m”) as single tokens and tokenizes the rest normally. Test on a sentence with 6–8 contractions.
- Build a tokenizer that keeps hyphenated words (e.g., “state-of-the-art”) intact but splits other punctuation.
- Write a regex tokenizer that does not split decimal numbers (e.g., “3.14”) or ISO dates (YYYY-MM-DD). Demonstrate on a mixed string.
- Implement two stopword lists: (i) default NLTK, (ii) custom list that keeps negations (no, not, never). Compare token counts after each.
- Construct a whitelist of domain terms (e.g., “AI”, “GPU”) that should never be removed even if they appear in stopwords; integrate into your pipeline.
- Compute and display top 20 most frequent stems using Porter, Lancaster, and Snowball separately; show how many original tokens map to each stem (frequency of stems).
- POS-tag your tokens (nltk.pos_tag) and map the tags to WordNet POS. Lemmatize with the POS-aware lemmatizer and compare vocabulary size vs the default lemmatizer (see the tag-mapping sketch after this list).
- Create a function that normalizes elongated words (e.g., “soooo coooool” → “soo cool” or “so cool” by limiting repeats), then re-tokenize. Show before/after tokens.
- Remove HTML tags and HTML entities (e.g., &amp;) from text before tokenization. Show the cleaned plain-text output.
- Build an end-to-end clean_text(text) function parameterized by flags: lower=True, remove_punct=True, remove_stop=True, stemmer='porter'|'lancaster'|'snowball'|None, lemmatize=True|False. Demonstrate 3 different configurations.
- Read text with encoding issues (simulate bytes encoded as latin-1 vs utf-8). Safely decode with errors='replace' and normalize to NFC. Tokenize and show a lossless result when possible.
- Create a frequency distribution (nltk.FreqDist) over cleaned tokens; plot the top 30 tokens with counts (text-only print is fine).
- Build bigrams and trigrams from cleaned tokens using nltk.ngrams. List top 10 most frequent n-grams.
- Identify hapax legomena (tokens that occur once) from cleaned tokens and print 30 of them.
- For a paragraph containing mixed English + Hindi/Devanagari, use a regex to separate scripts; tokenize each portion and print counts per script.
- Compare casefold() vs lower() on edge-case tokens (German “ß”, Turkish dotted “İ”). Show the differences in results and explain them with code comments.
- Build a regression test set: a small set of 15 tricky strings (URLs, emojis, tags, contractions, decimals). Run your pipeline and assert expected token outputs.
- Implement punctuation-aware sentence segmentation: split sentences ensuring ellipses (“…”) and quotes are handled; compare against sent_tokenize.
- Create a custom domain stopword list from top 50 most frequent tokens after initial cleaning; re-run pipeline and measure vocabulary reduction.
- Package your pipeline as a reusable module (preprocess.py) exposing clean_text and tokenize_text; show example usage.
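For the POS-aware lemmatization item above, one possible mapping from Penn Treebank tags (as returned by nltk.pos_tag) to WordNet POS constants is sketched below. The helper name penn_to_wordnet is illustrative, and nltk.pos_tag needs the averaged_perceptron_tagger resource, which is not part of the preparation downloads.

import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

# nltk.download('averaged_perceptron_tagger')   # required once for nltk.pos_tag

def penn_to_wordnet(tag):
    # Map Penn Treebank tags to the POS constants the lemmatizer expects.
    if tag.startswith('J'):
        return wordnet.ADJ
    if tag.startswith('V'):
        return wordnet.VERB
    if tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN   # the lemmatizer's default POS

lemmatizer = WordNetLemmatizer()
tokens = nltk.word_tokenize("The mice were running and things got better")
tagged = nltk.pos_tag(tokens)
lemmas = [lemmatizer.lemmatize(tok.casefold(), penn_to_wordnet(tag)) for tok, tag in tagged]
print(list(zip(tokens, lemmas)))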
Advanced Questions
- Stemming vs Lemmatization Study: On a medium text (≥10k tokens), run four variants: Porter, Lancaster, Snowball, and POS-aware Lemmatizer. Report: vocabulary size, top-25 tokens for each method, and overlap between sets.
- Accuracy-Oriented Pipeline: Create a pipeline that retains negations, keeps domain terms, uses POS-aware lemmatization, and limits elongated chars. Evaluate on a small sentiment-like sample by printing bigram collocations before/after.
- Speed/Memory Benchmark: For a 100k-token synthetic text, benchmark runtime and memory (approx via tracemalloc or simple timers) for: (i) pure regex tokenizer, (ii) word_tokenize, (iii) regexp + POS-lemmatize. Summarize results.
- Noisy Social Text Cleaner: Design a robust cleaner for tweets: remove URLs, mentions, hashtags (keep the word), emojis (optionally keep as token “<EMOJI>”), collapse repeats, normalize accents; output tokens. Provide a toggle to preserve emojis as separate tokens.
- Unicode Normalization Suite: Given text containing mixed normalization forms (NFC/NFD/NFKC) and zero-width characters, build a function that strips zero-width joiners, normalizes to NFKC, then tokenizes. Prove correctness with repr() outputs.
- Language-Aware Stemming: For two languages (English + one more you choose supported by Snowball), stem and compare how many tokens are reduced. Handle language-specific stopwords if available; keep processes separated.
- Custom RegexpTokenizer Design: Create a single regex that handles: emails, URLs, @mentions, #hashtags (capturing words), emojis, decimals, ISO dates, and words with apostrophes. Show 15 diverse inputs → tokens.
- Human-in-the-Loop Review: Build a small CLI that prints the top 200 cleaned tokens and lets the user add items to a custom stopword file interactively; rerun cleaning with the updated list.
- Error-Resilient File Loader: Implement a reader that loads multiple text files with unknown encodings, tries utf-8 → latin-1 → cp1252, logs failures, and yields normalized Unicode strings for downstream tokenization (a decoding-fallback sketch appears at the end of this sheet).
- Mini Project – Preprocessing Report: Create a notebook/script that:
  - Loads a text corpus (or a few plain files),
  - Runs two pipelines (aggressive stemming vs POS-lemmatization),
  - Produces side-by-side stats (token count, vocab size, top-20 tokens, top-20 bigrams),
  - Saves the cleaned tokens of both pipelines to disk, and prints a short conclusion (in comments) on which pipeline you’d choose and why.
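For the error-resilient file loader in the advanced list, a minimal decoding-fallback sketch is given below. The function name read_files and the print-based logging are placeholders; note that latin-1 accepts any byte sequence, so the final errors='replace' branch is only a safety net.

import unicodedata

def read_files(paths, encodings=('utf-8', 'latin-1', 'cp1252')):
    # Yield NFC-normalized text, trying each encoding in turn for every file.
    for path in paths:
        with open(path, 'rb') as handle:
            raw = handle.read()
        text = None
        for enc in encodings:
            try:
                text = raw.decode(enc)
                break
            except UnicodeDecodeError:
                print(f'{path}: failed to decode as {enc}')   # simple failure log
        if text is None:
            text = raw.decode('utf-8', errors='replace')      # last-resort fallback
        yield unicodedata.normalize('NFC', text)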