Pandas Assignment 4
Data Cleaning & Preprocessing
Basic Questions
- Create a DataFrame with some missing values. Use df.dropna() to remove all rows with at least one NaN.
- Use the same DataFrame and apply df.dropna(axis=1) to remove columns containing any NaN.
- Create a Series [1,2,None,4,None] and replace NaN with 0 using .fillna(0).
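- Fill missing values in a DataFrame column with the column's mean using .fillna(df['col'].mean()).
- Demonstrate forward-filling NaN values in a Series with .ffill() (the older .fillna(method='ffill') spelling is deprecated as of pandas 2.1).
- Demonstrate backward-filling NaN values in a Series with .bfill() (the older .fillna(method='bfill') spelling is likewise deprecated).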
- Create a DataFrame with duplicate rows. Use .duplicated() to identify them.
- Drop duplicate rows using .drop_duplicates() and print the cleaned DataFrame.
- Create a DataFrame with duplicate values in a single column. Drop the rows that repeat a value in that column using .drop_duplicates(subset=['col']).
- Create a DataFrame with columns ['A','B','C']. Rename column 'A' to 'X' using .rename(columns={'A':'X'}).
- Rename row index 0 to 'first' using .rename(index={0:'first'}).
- Create a DataFrame column with integers and convert it to float using .astype(float).
- Convert a column of numbers stored as strings ('1','2','3') into integers using .astype(int).
- Create a Series of strings ['a','B','c']. Convert all to lowercase using .str.lower().
- Convert the same Series to uppercase using .str.upper().
- Create a Series of email IDs. Use .str.contains('@gmail') to filter Gmail addresses.
- Use .str.replace('foo','bar') to replace substrings inside a string column.
- Create a Series of sentences. Split each string into a list of words using .str.split().
- Create a DataFrame column of numbers [1,2,3,4]. Use .map(lambda x: x**2) to square each element.
- Use .apply(sum, axis=1) on a DataFrame to compute row sums (without axis=1, .apply works column by column).
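
The basic tasks above could be approached along the following lines. This is only a minimal sketch, not a model answer: the sample data, column names, and variable names are made up for illustration, and .ffill()/.bfill() are used in place of the method= keyword because recent pandas versions deprecate that spelling.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; rows 2 and 3 are identical on purpose.
df = pd.DataFrame({
    "A": [1.0, np.nan, 3.0, 3.0],
    "B": ["x", "y", None, None],
    "C": [10, 20, 30, 30],
})

# Missing values
rows_without_nan = df.dropna()            # drop rows containing any NaN
cols_without_nan = df.dropna(axis=1)      # drop columns containing any NaN
mean_filled = df["A"].fillna(df["A"].mean())

s = pd.Series([1, 2, None, 4, None])
filled_zero = s.fillna(0)
filled_fwd = s.ffill()                    # carry the last valid value forward
filled_bwd = s.bfill()                    # pull the next valid value backward

# Duplicates
dup_mask = df.duplicated()                # True for repeated rows
deduped = df.drop_duplicates()
deduped_on_c = df.drop_duplicates(subset=["C"])

# Renaming and dtype conversion
renamed = df.rename(columns={"A": "X"}).rename(index={0: "first"})
as_float = df["C"].astype(float)
as_int = pd.Series(["1", "2", "3"]).astype(int)

# String methods
letters = pd.Series(["a", "B", "c"])
lower = letters.str.lower()
upper = letters.str.upper()
emails = pd.Series(["ann@gmail.com", "bob@example.org"])
gmail_only = emails[emails.str.contains("@gmail", regex=False)]
replaced = pd.Series(["foo fighters"]).str.replace("foo", "bar", regex=False)
words = pd.Series(["hello world", "data cleaning"]).str.split()

# Element-wise and row-wise functions
squares = pd.Series([1, 2, 3, 4]).map(lambda x: x ** 2)
nums = pd.DataFrame({"p": [1, 2], "q": [3, 4]})
row_sums = nums.apply(sum, axis=1)        # axis=1 passes each row to sum
```

Note that the backward fill leaves the final None in the Series untouched, since there is no later value to pull from.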
Intermediate Questions
- Create a DataFrame with missing values in multiple columns. Use df.dropna(thresh=2) to keep rows with at least 2 non-NaN values.
- Fill missing values in a DataFrame column with the median instead of mean.
- Use .fillna({'col1':0, 'col2':'missing'}) to fill different columns with different values.
- Demonstrate chained filling: first forward-fill, then backward-fill to handle remaining NaNs.
- Create a DataFrame with duplicates. Keep only the last occurrence of each duplicate using .drop_duplicates(keep='last').
- Use .duplicated(subset=['col1']) to find duplicates based only on 'col1'.
- Rename multiple columns at once using .rename(columns={'A':'alpha','B':'beta'}).
- Change the dtype of a float column to integer with .astype('int64').
- Create a string Series of names. Use .str.len() to compute the length of each string.
- Use .str.contains('pattern', case=False) to search strings ignoring case.
- Replace all digits in a string column with '#' using a regex inside .str.replace(), for example .str.replace(r'\d', '#', regex=True).
- Split a string column on space and expand into multiple columns using .str.split(expand=True).
- Apply a function column-wise using .apply(np.max) to compute column maximums.
- Apply a row-wise function using .apply(lambda row: row.sum(), axis=1).
- Use DataFrame.map(lambda x: x*2) (named .applymap() before pandas 2.1) to multiply all elements of a numeric DataFrame by 2.
- Create a column of prices as strings like '$100'. Remove the dollar sign using .str.replace('$', '', regex=False) and convert to integer with .astype(int).
- Demonstrate converting a datetime column stored as string into proper datetime dtype using pd.to_datetime.
- Write code to detect which columns of a DataFrame have object type and convert them to string type using .astype('string').
- Fill missing values in a column by carrying forward the last non-null value with .ffill() (the older .fillna(method='ffill') spelling is deprecated as of pandas 2.1).
- Show how to replace NaN in an entire DataFrame with the string 'NA'.
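
A possible way to work through several of the intermediate tasks is sketched below. All data and names are hypothetical. Two choices here are assumptions rather than requirements of the assignment: the element-wise step falls back to .applymap() when DataFrame.map is unavailable (older pandas), and the frame is cast to object before filling with 'NA' so that numeric columns accept a string.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps in several columns.
df = pd.DataFrame({
    "col1": [1.0, np.nan, np.nan, 4.0],
    "col2": ["a", None, None, "d"],
    "col3": [np.nan, np.nan, 3.0, 4.0],
})

kept = df.dropna(thresh=2)                          # rows with >= 2 non-NaN values
median_filled = df["col1"].fillna(df["col1"].median())
per_column = df.fillna({"col1": 0, "col2": "missing"})
chained = df["col3"].ffill().bfill()                # bfill covers the leading NaNs
na_text = df.astype(object).fillna("NA")            # string 'NA' everywhere

# Duplicates judged on one column only
dups = pd.DataFrame({"col1": [1, 1, 2], "val": ["x", "y", "z"]})
dup_mask = dups.duplicated(subset=["col1"])
last_kept = dups.drop_duplicates(subset=["col1"], keep="last")

# String methods
names = pd.Series(["Ann Lee", "bob ray"])
lengths = names.str.len()
has_ann = names.str.contains("ann", case=False)
masked = pd.Series(["id42", "x7"]).str.replace(r"\d", "#", regex=True)
parts = names.str.split(" ", expand=True)           # one column per word

# Column-wise, row-wise, and element-wise functions
nums = pd.DataFrame({"p": [1, 2], "q": [3, 4]})
col_max = nums.apply(np.max)
row_sum = nums.apply(lambda row: row.sum(), axis=1)
# DataFrame.map exists from pandas 2.1; fall back to applymap before that.
elementwise = nums.map if hasattr(nums, "map") else nums.applymap
doubled = elementwise(lambda x: x * 2)

# Type conversions
prices = pd.Series(["$100", "$250"]).str.replace("$", "", regex=False).astype(int)
dates = pd.to_datetime(pd.Series(["2024-01-05", "2024-02-10"]))
mixed = pd.DataFrame({"name": pd.Series(["a", "b"], dtype=object), "n": [1, 2]})
obj_cols = mixed.select_dtypes(include="object").columns
mixed = mixed.astype({col: "string" for col in obj_cols})
```

Passing regex=False when stripping the dollar sign matters on older pandas versions, where '$' would otherwise be read as a regex end-of-string anchor and nothing would be removed.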
Advanced Questions
- Create a messy DataFrame with missing values, duplicates, and inconsistent casing. Clean it fully using: drop duplicates, fill NaN with defaults, and normalize strings to lowercase.
- Build a DataFrame with numeric and categorical columns. Fill missing values in numeric with column mean and in categorical with the most frequent value (mode).
- Write a script that drops rows if more than 50% of their values are NaN.
- Demonstrate column renaming to follow snake_case convention (all lowercase, underscores instead of spaces).
- Clean a DataFrame of customer emails: strip spaces, convert to lowercase, and filter invalid emails (not containing '@').
- Create a DataFrame with salaries stored as strings ['10k','20k','15k']. Remove 'k' using .str.replace(), convert to int, and compute mean salary.
- Use .apply() with a custom function to categorize numbers into 'small', 'medium', and 'large'.
- Apply .map() on a Series of gender abbreviations ('M','F') to map them to full form ('Male','Female').
- Apply DataFrame.map() (named .applymap() before pandas 2.1) to a DataFrame mixing integer and float columns to add 5 to all elements, then apply .astype(float) to convert the result to float.
- Design a preprocessing pipeline with Pandas operations: load a CSV, drop duplicates, fill missing with appropriate strategies, rename columns, convert dtypes, clean string fields, and output a clean DataFrame summary.
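
One way the pipeline task might be structured is sketched below. Everything specific in it is an assumption made for illustration: the inline CSV text stands in for a real file, the helper names (to_snake_case, size_label, clean) are hypothetical, the 12000/18000 thresholds are arbitrary, and treating '10k' as 10000 is one reading of the salary format.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a real file on disk.
RAW_CSV = """Customer Name,Email Address,Salary,Gender
 Ann Lee , ANN@Example.com ,10k,F
 Ann Lee , ANN@Example.com ,10k,F
Bob Ray,bob.example.com,20k,M
Cy Diaz,cy@example.com,,M
Dee Fox,dee@example.com,15k,
Eli Orr,eli@example.com,20k,M
"""


def to_snake_case(name):
    """Lowercase a column name and replace spaces with underscores."""
    return name.strip().lower().replace(" ", "_")


def size_label(value):
    """Bucket a salary; the thresholds are arbitrary illustration values."""
    if value < 12000:
        return "small"
    if value < 18000:
        return "medium"
    return "large"


def clean(raw):
    df = pd.read_csv(io.StringIO(raw))
    df = df.rename(columns=to_snake_case)
    df = df.drop_duplicates()
    # Keep rows where at most half of the values are missing.
    df = df[df.isna().mean(axis=1) <= 0.5].copy()

    # Normalize string fields, then drop emails without an '@'.
    for col in ["customer_name", "email_address"]:
        df[col] = df[col].str.strip().str.lower()
    df = df[df["email_address"].str.contains("@", regex=False)].copy()

    # '10k' -> 10000; missing salaries take the column mean.
    salary = pd.to_numeric(df["salary"].str.replace("k", "", regex=False)) * 1000
    df["salary"] = salary.fillna(salary.mean()).round().astype("int64")

    # Missing categories take the most frequent value, then expand the codes.
    gender = df["gender"].fillna(df["gender"].mode()[0])
    df["gender"] = gender.map({"M": "Male", "F": "Female"})

    df["salary_band"] = df["salary"].apply(size_label)
    return df.reset_index(drop=True)


clean_df = clean(RAW_CSV)
print(clean_df.dtypes)   # short summary of the cleaned frame
print(clean_df)
```

The email filter runs before the salary mean is computed, so the invalid row does not influence the fill value; swapping those two steps would change the result.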