Pandas Assignment - 4

Data Cleaning & Preprocessing

Basic Questions

Create a DataFrame with some missing values. Use df.dropna() to remove all rows with at least one NaN.
Use the same DataFrame and apply df.dropna(axis=1) to remove columns containing any NaN.
Create a Series [1,2,None,4,None] and replace NaN with 0 using .fillna(0).
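Fill missing values in a DataFrame column with the column's mean using .fillna(df['col'].mean()).
Demonstrate forward-filling NaN values in a Series with .ffill() (the older .fillna(method='ffill') form is deprecated since pandas 2.1).
Demonstrate backward-filling NaN values in a Series with .bfill() (the older .fillna(method='bfill') form is deprecated since pandas 2.1).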
Create a DataFrame with duplicate rows. Use .duplicated() to identify them.
Drop duplicate rows using .drop_duplicates() and print the cleaned DataFrame.
Create a DataFrame with duplicate values in a single column. Drop the rows that repeat a value in that column using .drop_duplicates(subset=['col']).
Create a DataFrame with columns ['A','B','C']. Rename column 'A' to 'X' using .rename(columns={'A':'X'}).
Rename row index 0 to 'first' using .rename(index={0:'first'}).
Create a DataFrame column with integers and convert it to float using .astype(float).
Convert a column of numbers stored as strings ('1','2','3') into integers using .astype(int).
Create a Series of strings ['a','B','c']. Convert all to lowercase using .str.lower().
Convert the same Series to uppercase using .str.upper().
Create a Series of email IDs. Use .str.contains('@gmail') to filter Gmail addresses.
Use .str.replace('foo','bar') to replace substrings inside a string column.
Create a Series of sentences. Split each string into a list of words using .str.split().
Create a DataFrame column of numbers [1,2,3,4]. Use .map(lambda x: x**2) to square each element.
Use .apply(sum, axis=1) on a DataFrame to compute row sums (without axis=1, .apply works column by column).
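
The basic questions above can be sketched as follows. This is a minimal illustration rather than a model answer: the sample data and every variable and column name (df, dup, emails, and so on) are assumptions made up for the example, and .ffill()/.bfill() are used in place of the deprecated method= argument.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; names and values are illustrative only.
df = pd.DataFrame({
    "A": [1.0, np.nan, 3.0],
    "B": ["x", "y", None],
    "C": [10, 20, 30],
})

# Missing values
rows_without_nan = df.dropna()            # only row 0 has no NaN
cols_without_nan = df.dropna(axis=1)      # only column C has no NaN
a_mean_filled = df["A"].fillna(df["A"].mean())

s = pd.Series([1, 2, None, 4, None])
filled_zero = s.fillna(0)
forward = s.ffill()                       # carry the last valid value forward
backward = s.bfill()                      # trailing NaN has nothing to copy from

# Duplicates
dup = pd.DataFrame({"name": ["ann", "bob", "ann"], "city": ["NY", "LA", "NY"]})
dup_mask = dup.duplicated()               # True for the repeated third row
deduped = dup.drop_duplicates()
unique_names = dup.drop_duplicates(subset=["name"])

# Renaming and dtype conversion
renamed = df.rename(columns={"A": "X"}, index={0: "first"})
as_float = df["C"].astype(float)
as_int = pd.Series(["1", "2", "3"]).astype(int)

# String methods
letters = pd.Series(["a", "B", "c"])
lower = letters.str.lower()
upper = letters.str.upper()
emails = pd.Series(["a@gmail.com", "b@yahoo.com"])
gmail = emails[emails.str.contains("@gmail")]
words = pd.Series(["hello big world"]).str.split()

# Element-wise and row-wise functions
squared = pd.Series([1, 2, 3, 4]).map(lambda x: x ** 2)
nums = pd.DataFrame({"p": [1, 2], "q": [3, 4]})
row_sums = nums.apply(sum, axis=1)        # axis=1 sums across each row
```

Each result is kept in its own variable so it can be printed and compared with the original data.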

Intermediate Questions

Create a DataFrame with missing values in multiple columns. Use df.dropna(thresh=2) to keep rows with at least 2 non-NaN values.
Fill missing values in a DataFrame column with the median instead of mean.
Use .fillna({'col1':0, 'col2':'missing'}) to fill different columns with different values.
Demonstrate chained filling: first forward-fill, then backward-fill to handle remaining NaNs.
Create a DataFrame with duplicates. Keep only the last occurrence of each duplicate using .drop_duplicates(keep='last').
Use .duplicated(subset=['col1']) to find duplicates based only on 'col1'.
Rename multiple columns at once using .rename(columns={'A':'alpha','B':'beta'}).
Change the dtype of a float column to integer with .astype('int64').
Create a string Series of names. Use .str.len() to compute the length of each string.
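Use .str.contains('pattern', case=False) to search strings ignoring case.
Replace all digits in a string column with '#' using a regex inside .str.replace(), for example .str.replace(r'\d', '#', regex=True); since pandas 2.0 the pattern is treated as a literal string unless regex=True is passed.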
Split a string column on space and expand into multiple columns using .str.split(expand=True).
Apply a function column-wise using .apply(np.max) to compute column maximums.
Apply a row-wise function using .apply(lambda row: row.sum(), axis=1).
Use .applymap(lambda x: x*2) to multiply all elements of a numeric DataFrame by 2 (in pandas 2.1 and later, DataFrame.map replaces the deprecated .applymap).
Create a column of prices as strings like '$100'. Remove the dollar sign using .str.replace('$', '', regex=False) and convert to integer with .astype(int).
Demonstrate converting a datetime column stored as string into proper datetime dtype using pd.to_datetime.
Write code to detect which columns of a DataFrame have object type and convert them to string type using .astype('string').
Fill missing values in a column by carrying forward the last non-null value with .ffill() (the older .fillna(method='ffill') form is deprecated since pandas 2.1).
Show how to replace NaN in an entire DataFrame with the string 'NA'.
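
A few of the intermediate questions above, sketched under stated assumptions: the data, the column names (col1, col2, col3) and all variable names are invented for illustration, and the explicit regex= arguments are a choice made here so the calls behave the same across pandas versions.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; col2 is forced to object dtype for the dtype exercise.
df = pd.DataFrame({
    "col1": [1.0, np.nan, np.nan, 4.0],
    "col2": pd.Series(["a", None, None, "d"], dtype=object),
    "col3": [np.nan, 2.0, np.nan, 8.0],
})

# Missing values
at_least_two = df.dropna(thresh=2)                 # keep rows with >= 2 non-NaN values
median_filled = df["col3"].fillna(df["col3"].median())
per_column = df.fillna({"col1": 0, "col2": "missing"})
chained = df["col3"].ffill().bfill()               # bfill covers the leading NaN
na_text = df.astype(object).fillna("NA")           # cast first so mixed values are explicit

# Duplicates
dup = pd.DataFrame({"col1": ["x", "y", "x"], "val": [1, 2, 3]})
by_col1 = dup.duplicated(subset=["col1"])
last_kept = dup.drop_duplicates(subset=["col1"], keep="last")

# String handling
text = pd.Series(["Room 101", "Gate 7B"])
lengths = text.str.len()
has_room = text.str.contains("room", case=False)
masked = text.str.replace(r"\d", "#", regex=True)  # regex=True so \d means "a digit"
parts = text.str.split(" ", expand=True)           # one column per piece

# Type conversion
prices = pd.Series(["$100", "$250"])
price_int = prices.str.replace("$", "", regex=False).astype(int)
truncated = pd.Series([1.0, 2.9]).astype("int64")  # drops the fractional part
dates = pd.to_datetime(pd.Series(["2024-01-15", "2024-02-01"]))
obj_cols = df.select_dtypes(include="object").columns
typed = df.astype({col: "string" for col in obj_cols})

# Column-wise and row-wise functions
nums = pd.DataFrame({"p": [1, 5], "q": [7, 3]})
col_max = nums.apply(np.max)
row_total = nums.apply(lambda row: row.sum(), axis=1)
```

Note that .astype('int64') raises an error if the float column still contains NaN, so missing values need to be filled or dropped first.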

Advanced Questions

Create a messy DataFrame with missing values, duplicates, and inconsistent casing. Clean it fully using: drop duplicates, fill NaN with defaults, and normalize strings to lowercase.
Build a DataFrame with numeric and categorical columns. Fill missing values in numeric with column mean and in categorical with the most frequent value (mode).
Write a script that drops rows if more than 50% of their values are NaN.
Demonstrate column renaming to follow snake_case convention (all lowercase, underscores instead of spaces).
Clean a DataFrame of customer emails: strip spaces, convert to lowercase, and filter invalid emails (not containing '@').
Create a DataFrame with salaries stored as strings ['10k','20k','15k']. Remove 'k' using .str.replace(), convert to int, and compute mean salary.
Use .apply() with a custom function to categorize numbers into 'small', 'medium', and 'large'.
Apply .map() on a Series of gender abbreviations ('M','F') to map them to full form ('Male','Female').
Apply .applymap() (DataFrame.map in pandas 2.1 and later) to a numeric DataFrame with mixed integer and float columns to add 5 to all elements, then apply .astype(float) to convert every column to float.
Design a preprocessing pipeline with Pandas operations: load a CSV, drop duplicates, fill missing with appropriate strategies, rename columns, convert dtypes, clean string fields, and output a clean DataFrame summary.
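
One possible shape for the pipeline in the last question is sketched below. Everything specific in it is an assumption made for illustration: the inline CSV text (read through io.StringIO so the example is self-contained), the column names, the order of the steps, the function names clean and salary_band, and the band thresholds. A real dataset would need its own fill strategies and validation rules.

```python
import io
import pandas as pd

# Hypothetical raw data standing in for a CSV file on disk.
RAW_CSV = """Customer Name,Email Address,Gender,Salary,Age
Ann, ANN@Example.com ,F,10k,30
Bob,bob@example.com,M,20k,
Ann, ANN@Example.com ,F,10k,30
Cara,cara@example.com,,15k,40
Dan,dan.example.com,M,12k,50
Eve,eve@example.com,F,18k,25
,,,,
"""

def salary_band(value):
    """Categorize a salary (in thousands); thresholds are arbitrary."""
    if value < 12:
        return "small"
    if value < 18:
        return "medium"
    return "large"

def clean(raw_csv):
    df = pd.read_csv(io.StringIO(raw_csv))
    # Rename columns to snake_case.
    df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_", regex=False)
    # Drop rows where more than 50% of the values are NaN.
    df = df[df.isna().mean(axis=1) <= 0.5].copy()
    # Normalize string fields, then drop exact duplicate rows.
    for col in ["customer_name", "email_address"]:
        df[col] = df[col].str.strip().str.lower()
    df = df.drop_duplicates()
    # Keep only emails containing '@'.
    valid = df["email_address"].str.contains("@", regex=False, na=False)
    df = df[valid].reset_index(drop=True).copy()
    # Convert '10k'-style strings to integers (still in thousands).
    df["salary"] = df["salary"].str.replace("k", "", regex=False).astype(int)
    # Fill numeric columns with the mean, other columns with the mode.
    num_cols = df.select_dtypes(include="number").columns
    df[num_cols] = df[num_cols].fillna(df[num_cols].mean())
    for col in df.columns.difference(num_cols):
        modes = df[col].mode()
        if not modes.empty:
            df[col] = df[col].fillna(modes.iloc[0])
    # Expand abbreviations and derive a category column.
    df["gender"] = df["gender"].map({"M": "Male", "F": "Female"})
    df["salary_band"] = df["salary"].apply(salary_band)
    return df

clean_df = clean(RAW_CSV)
print(clean_df)
print(clean_df.dtypes)
print("Mean salary (thousands):", clean_df["salary"].mean())
```

In this sketch the invalid email is filtered out before the missing values are filled, so the removed row does not influence the mean and mode used for filling; reversing those two steps is equally defensible and gives slightly different fill values.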