Pandas Assignment 4

Data Cleaning & Preprocessing

Basic Questions

  1. Create a DataFrame with some missing values. Use df.dropna() to remove all rows with at least one NaN.
  2. Use the same DataFrame and apply df.dropna(axis=1) to remove columns containing any NaN.
  3. Create a Series [1,2,None,4,None] and replace NaN with 0 using .fillna(0).
  4. Fill missing values in a DataFrame column with the column's mean using .fillna(df['col'].mean()).
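  5. Demonstrate forward-filling NaN values in a Series with .ffill() (the older .fillna(method='ffill') form is deprecated in pandas 2.1 and later).
  6. Demonstrate backward-filling NaN values in a Series with .bfill() (the older .fillna(method='bfill') form is deprecated in pandas 2.1 and later).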
  7. Create a DataFrame with duplicate rows. Use .duplicated() to identify them.
  8. Drop duplicate rows using .drop_duplicates() and print the cleaned DataFrame.
  9. Create a DataFrame with duplicate values in a single column. Drop the rows that repeat a value in that column using .drop_duplicates(subset=['col']).
  10. Create a DataFrame with columns ['A','B','C']. Rename column 'A' to 'X' using .rename(columns={'A':'X'}).
  11. Rename row index 0 to 'first' using .rename(index={0:'first'}).
  12. Create a DataFrame column with integers and convert it to float using .astype(float).
  13. Convert a column of numbers stored as strings ('1','2','3') into integers using .astype(int).
  14. Create a Series of strings ['a','B','c']. Convert all to lowercase using .str.lower().
  15. Convert the same Series to uppercase using .str.upper().
  16. Create a Series of email IDs. Use .str.contains('@gmail') to filter Gmail addresses.
  17. Use .str.replace('foo','bar') to replace substrings inside a string column.
  18. Create a Series of sentences. Split each string into a list of words using .str.split().
  19. Create a DataFrame column of numbers [1,2,3,4]. Use .map(lambda x: x**2) to square each element.
  20. Use .apply(sum, axis=1) on a DataFrame to compute row sums (the default axis=0 sums each column instead).
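
The basic operations above can be sketched in a few lines. This is only an illustrative sketch, not a model answer: the DataFrame, its column names ("name", "score") and the sample email addresses are made up for the example, and .ffill()/.bfill() are used because they work across old and new pandas versions.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the column names are illustrative only.
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Ann", "Cid"],
    "score": [90.0, np.nan, 90.0, 70.0],
})

# Missing values: drop rows or columns that contain any NaN.
rows_without_nan = df.dropna()
cols_without_nan = df.dropna(axis=1)

# Fill missing scores with the column mean (on a copy, so df stays intact).
filled = df.copy()
filled["score"] = filled["score"].fillna(filled["score"].mean())

# Constant fill, forward fill and backward fill on a Series.
s = pd.Series([1, 2, None, 4, None])
zero_filled = s.fillna(0)
forward_filled = s.ffill()
backward_filled = s.bfill()  # the trailing NaN has nothing after it to copy

# Duplicates: the third row repeats the first one exactly.
dup_mask = df.duplicated()
deduped = df.drop_duplicates()

# Rename one column and one index label.
renamed = df.rename(columns={"name": "student"}, index={0: "first"})

# Numbers stored as strings converted to integers.
as_int = pd.Series(["1", "2", "3"]).astype(int)

# Case-sensitive substring filter on a string Series.
emails = pd.Series(["a@gmail.com", "b@yahoo.com", "C@GMAIL.com"])
gmail_only = emails[emails.str.contains("@gmail", regex=False)]

# Row sums: axis=1 passes each row to the built-in sum.
row_sums = pd.DataFrame({"x": [1, 2], "y": [10, 20]}).apply(sum, axis=1)
```

Note that the Gmail filter is case-sensitive here, so "C@GMAIL.com" is not matched; passing case=False would include it.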

Intermediate Questions

  1. Create a DataFrame with missing values in multiple columns. Use df.dropna(thresh=2) to keep rows with at least 2 non-NaN values.
  2. Fill missing values in a DataFrame column with the median instead of mean.
  3. Use .fillna({'col1':0, 'col2':'missing'}) to fill different columns with different values.
  4. Demonstrate chained filling: first forward-fill, then backward-fill to handle remaining NaNs.
  5. Create a DataFrame with duplicates. Keep only the last occurrence of each duplicate using .drop_duplicates(keep='last').
  6. Use .duplicated(subset=['col1']) to find duplicates based only on 'col1'.
  7. Rename multiple columns at once using .rename(columns={'A':'alpha','B':'beta'}).
  8. Change the dtype of a float column to integer with .astype('int64').
  9. Create a string Series of names. Use .str.len() to compute the length of each string.
  10. Use .str.contains('pattern', case=False) to search strings ignoring case.
  11. Replace all digits in a string column with '#' using a regex inside .str.replace() (pass regex=True so the pattern is treated as a regular expression).
  12. Split a string column on space and expand into multiple columns using .str.split(expand=True).
  13. Apply a function column-wise using .apply(np.max) to compute column maximums.
  14. Apply a row-wise function using .apply(lambda row: row.sum(), axis=1).
  15. Use .applymap(lambda x: x*2) to multiply all elements of a numeric DataFrame by 2 (in pandas 2.1 and later, .applymap() is deprecated in favor of DataFrame.map()).
  16. Create a column of prices as strings like '$100'. Remove the dollar sign using .str.replace('$', '', regex=False) and convert to integer with .astype(int).
  17. Demonstrate converting a datetime column stored as string into proper datetime dtype using pd.to_datetime.
  18. Write code to detect which columns of a DataFrame have object type and convert them to string type using .astype('string').
  19. Fill missing values in a column by carrying forward the last non-null value with .ffill() (the older .fillna(method='ffill') form is deprecated in pandas 2.1 and later).
  20. Show how to replace NaN in an entire DataFrame with the string 'NA'.
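
A few of the intermediate tasks are sketched below as a starting point. The sample values and variable names are invented for the illustration; only the column names 'col1' and 'col2' come from the questions. The string replacements pass regex explicitly, since the default of that parameter has differed between pandas versions.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with gaps in several columns.
df = pd.DataFrame({
    "col1": [1.0, np.nan, np.nan, 4.0],
    "col2": ["a", None, None, "d"],
    "col3": [10.0, 20.0, np.nan, 40.0],
})

# Keep only rows that have at least 2 non-NaN values.
kept = df.dropna(thresh=2)

# Different fill values per column; col3 is left untouched.
per_column = df.fillna({"col1": 0, "col2": "missing"})

# Chained fill: forward-fill first, then backward-fill the leading gap.
gappy = pd.Series([np.nan, 1.0, np.nan, 3.0])
chained = gappy.ffill().bfill()

# Keep the last occurrence of each duplicated row.
dups = pd.DataFrame({"k": ["x", "y", "x"], "v": [1, 2, 1]})
last_kept = dups.drop_duplicates(keep="last")

# Regex replacement: every digit becomes '#'.
masked = pd.Series(["room 12", "b4"]).str.replace(r"\d", "#", regex=True)

# Split on a space and expand into separate columns.
name_parts = pd.Series(["Ada Lovelace", "Alan Turing"]).str.split(" ", expand=True)

# '$' is a regex metacharacter, so treat it as a literal here.
prices = pd.Series(["$100", "$250"]).str.replace("$", "", regex=False).astype(int)

# Strings to a proper datetime dtype.
dates = pd.to_datetime(pd.Series(["2024-01-15", "2024-02-01"]))

# Convert every object-typed column to the dedicated string dtype.
converted = df.copy()
for col in converted.select_dtypes(include="object").columns:
    converted[col] = converted[col].astype("string")

# Replace NaN everywhere with 'NA'; casting to object first makes the
# mixed number/string columns explicit.
na_filled = df.astype(object).fillna("NA")
```

Replacing NaN with the string 'NA' turns numeric columns into object columns, so it is usually best kept as a final display step rather than done before further numeric work.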

Advanced Questions

  1. Create a messy DataFrame with missing values, duplicates, and inconsistent casing. Clean it fully using: drop duplicates, fill NaN with defaults, and normalize strings to lowercase.
  2. Build a DataFrame with numeric and categorical columns. Fill missing values in numeric with column mean and in categorical with the most frequent value (mode).
  3. Write a script that drops rows if more than 50% of their values are NaN.
  4. Demonstrate column renaming to follow snake_case convention (all lowercase, underscores instead of spaces).
  5. Clean a DataFrame of customer emails: strip spaces, convert to lowercase, and filter invalid emails (not containing '@').
  6. Create a DataFrame with salaries stored as strings ['10k','20k','15k']. Remove 'k' using .str.replace(), convert to int, and compute mean salary.
  7. Use .apply() with a custom function to categorize numbers into 'small', 'medium', and 'large'.
  8. Apply .map() on a Series of gender abbreviations ('M','F') to map them to full form ('Male','Female').
  9. Apply .applymap() (DataFrame.map() in pandas 2.1 and later) to a DataFrame with mixed integer and float columns to add 5 to all elements, then apply .astype(float) to convert every column to float.
  10. Design a preprocessing pipeline with Pandas operations: load a CSV, drop duplicates, fill missing with appropriate strategies, rename columns, convert dtypes, clean string fields, and output a clean DataFrame summary.
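
One possible shape for such a pipeline is sketched below. Everything specific in it is an assumption made for the illustration: the inline CSV text (standing in for a real file), the column names, the helper names to_snake_case, salary_band and clean, and the band thresholds. Salaries are kept as floats rather than ints because the column contains a missing value before it is filled.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file on disk.
RAW_CSV = """Customer Name,Email Address,Salary,City
Ann, ANN@Example.com ,10k,Pune
Bob,bob.example.com,20k,Delhi
Ann, ANN@Example.com ,10k,Pune
Cid,cid@example.com,,Pune
Dee,dee@example.com,15k,
"""


def to_snake_case(name):
    """Lowercase a column name and replace spaces with underscores."""
    return name.strip().lower().replace(" ", "_")


def salary_band(value):
    """Bucket a salary; the thresholds are arbitrary illustration values."""
    if value < 12000:
        return "small"
    if value < 15000:
        return "medium"
    return "large"


def clean(raw):
    # Drop exact duplicate rows, then normalize the column names.
    df = raw.drop_duplicates().rename(columns=to_snake_case)

    # Clean the email strings and keep only those containing '@'.
    df["email_address"] = df["email_address"].str.strip().str.lower()
    df = df[df["email_address"].str.contains("@", regex=False)].copy()

    # '10k' -> 10000.0; missing salaries stay NaN until filled with the mean.
    df["salary"] = df["salary"].str.replace("k", "", regex=False).astype(float) * 1000
    df["salary"] = df["salary"].fillna(df["salary"].mean())

    # Categorical column: fill with the most frequent value (mode).
    df["city"] = df["city"].fillna(df["city"].mode().iloc[0])

    df["salary_band"] = df["salary"].apply(salary_band)
    return df.reset_index(drop=True)


cleaned = clean(pd.read_csv(io.StringIO(RAW_CSV)))
print(cleaned.dtypes)
print(cleaned)
```

The order of the steps matters in this sketch: invalid emails are filtered out before the mean and mode are computed, so the fill values are derived only from rows that survive cleaning. A different ordering is equally defensible as long as it is a deliberate choice.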