Pandas Assignment 1

Pandas Basics, Creating & Importing Data

Basic Questions

  1. Create a Series from the list [10, 20, 30, 40] with custom index ['a','b','c','d']; print the Series and then print its shape, ndim, size, and dtype.
  2. Create a DataFrame from a dictionary {'name':['A','B','C'], 'age':[21,25,19]}; print the DataFrame, then print shape, ndim, size, dtypes, and call info().
  3. Build a Series from a tuple ('x','y','z'); print values and index, then convert to a DataFrame with a single column named 'col'.
  4. Create a DataFrame from a list of dictionaries [{'a':1,'b':2},{'a':3,'b':4}]; print column names and dtypes.
  5. Make a DataFrame with columns 'city', 'temp', 'rain' from three equal-length lists; print the first and last rows using .head(1) and .tail(1).
  6. Create a Series of booleans from a list [True, False, True, False]; print ndim, size, and dtype, then cast to int and print again.
  7. Create an index from a list ['r1','r2','r3'] and assign it to a 3×2 DataFrame; print the index and columns.
  8. Build a DataFrame from a dictionary of tuples: {'p':(1,2,3), 'q':(4,5,6)}; print shape, then swap the column order to ['q','p'] and print again.
  9. Create a Series from a range range(5); print .values, .index, and convert to Python list via .tolist().
  10. Create a DataFrame from nested lists [[1,'A'],[2,'B'],[3,'C']] with columns ['id','label']; print info() and dtypes.
  11. Save a small DataFrame {'x':[1,2], 'y':[3,4]} to 'mini.csv' using to_csv(index=False); read it back with read_csv and print.
  12. Write a DataFrame to 'mini.json' using to_json(orient='records'); read it back with read_json and print.
  13. Create an Excel file 'mini.xlsx' with a single sheet from a DataFrame; read it back with read_excel and print shape.
  14. Read a simple HTML table from a small HTML string via pd.read_html(StringIO(html)) (use io.StringIO); print the number of tables parsed and the first one.
  15. Demonstrate read_clipboard by attempting to read into a DataFrame (use try/except); if it fails, print a short message 'clipboard not available'.
  16. Create a temporary SQLite table using sqlite3 in memory (e.g., 'people'), insert 3 rows, and use pd.read_sql('SELECT * FROM people', conn) to read them back; print the DataFrame.
  17. From a DataFrame {'a':[1,2,3],'b':[4,5,6]}, print column data types via .dtypes and the memory usage via .info(memory_usage='deep').
  18. Create a Series with mixed types [1, 2.5, 'three', True]; print dtype, then convert to string dtype and print dtype again.
  19. Build a DataFrame from an array-like list of lists (without NumPy), add column names ['c1','c2','c3']; print .columns and .index.
  20. Read 'mini.csv' again but specify dtype={'x':'int64','y':'int64'}; print dtypes to confirm.
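
To illustrate the style of answer expected, questions 1 and 11 above might be sketched as follows (a minimal sketch; the file name 'mini.csv' comes from the prompt):

```python
import pandas as pd

# Question 1: a Series with a custom index and its basic attributes.
s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])
print(s)
print(s.shape, s.ndim, s.size, s.dtype)  # (4,) 1 4 int64

# Question 11: round-trip a small DataFrame through CSV, omitting the index.
df = pd.DataFrame({'x': [1, 2], 'y': [3, 4]})
df.to_csv('mini.csv', index=False)
print(pd.read_csv('mini.csv'))
```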

Intermediate Questions

  1. Create a DataFrame of 5 rows from a dictionary of lists; print shape, ndim, size, and dtypes; since info() prints rather than returns its report, capture the text by passing a buffer (buf = io.StringIO(); df.info(buf=buf)) and then print buf.getvalue().
  2. Build a Series from a dictionary {'math':88,'phy':92,'chem':79}; access a value by label and by position (.iloc); convert to a DataFrame with a column 'marks'.
  3. From parallel lists ids=[101,102,103], names=['Ann','Ben','Cal'], scores=[77,88,93], create a DataFrame and set 'ids' as the index; print index, columns, and dtypes.
  4. Create a DataFrame and write it to 'students.csv' with a custom separator ';' and without the index; read it back with the same separator and confirm equality.
  5. Create a DataFrame and write it to an Excel file with two sheets ('S1', 'S2'); read each sheet with read_excel(sheet_name=…) and print the shapes.
  6. Create a nested list-of-tuples data source and convert it to a DataFrame with explicit column names; print dtypes and use .astype({'colX':'Int64'}) to set a nullable integer type.
  7. Assemble a DataFrame with an object column and a numeric column; call info(memory_usage='deep') to compare memory before and after converting the object column to 'category'.
  8. Read a small JSON string with records [{"k":1,"v":"a"},{"k":2,"v":"b"}] using read_json(orient='records'); set 'k' as the index and print.
  9. Create a DataFrame and export it to JSON using three orientations: 'records', 'split', and 'table'; print the string lengths to compare.
  10. Read a simple HTML table using read_html and show how to pass header=0 and index_col=0; print the resulting index and columns.
  11. Prepare a TSV-like clipboard text (fallback: StringIO) and parse it with read_csv(sep='\t'); print the DataFrame.
  12. In-memory SQL: create two tables 'dept' and 'emp' in SQLite; use read_sql to read each; then run a join query and print the result.
  13. Build a DataFrame from a list of dictionaries where some keys are missing; print dtypes, then fill missing with defaults using .fillna and print again.
  14. Create a Series of dates from a list of strings; convert to datetime with pd.to_datetime; build a DataFrame with an additional text column; print dtypes.
  15. Construct a DataFrame, then reorder columns to a new list order; verify using list(df.columns) before and after.
  16. Create a DataFrame with mixed types and coerce the numeric column from strings to integers using pd.to_numeric(errors='coerce'); print before/after dtypes.
  17. Write a DataFrame to 'wide.csv', then read only selected columns using usecols; print shape and columns to verify the selection.
  18. Save a DataFrame to Excel with a custom index name and custom column names; read it back and ensure the index name is preserved.
  19. Demonstrate read_csv parameters: create a CSV with missing values; read it with na_values=['NA','-'] and keep_default_na=True; print .isna().sum() per column.
  20. Create a DataFrame and write to CSV with date_format for a datetime column; read back and show that parsed dtype is object unless parse_dates is used; then read with parse_dates and compare dtypes.
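Questions 19 and 20 above hinge on the same two read_csv behaviors, sketched here (the column names 'day' and 'temp' are illustrative, not part of the prompts):

```python
import io
import pandas as pd

# A small CSV with two missing-value markers: 'NA' and '-'.
csv_text = "day,temp\n2024-01-01,15\n2024-01-02,NA\n2024-01-03,-\n"

# Without parse_dates, the 'day' column comes back as plain object strings.
raw = pd.read_csv(io.StringIO(csv_text), na_values=['NA', '-'])
print(raw.dtypes)
print(raw.isna().sum())  # temp: 2 missing values

# With parse_dates, the same column is parsed to datetime64[ns].
parsed = pd.read_csv(io.StringIO(csv_text), na_values=['NA', '-'],
                     parse_dates=['day'])
print(parsed.dtypes)
```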

Advanced Questions

  1. Build a moderately wide DataFrame (e.g., 100 columns × 100 rows) from generated Python lists; inspect shape, dtypes, size, info(), and compute approximate memory footprint with df.memory_usage(deep=True).sum().
  2. Create a nested dictionary {'A':{'k1':1,'k2':2}, 'B':{'k1':3,'k3':4}}; convert it to a DataFrame, fill missing values with 0, and write to 'nested.csv'; read it back and verify equality.
  3. Simulate incremental CSV writing: append three small DataFrames to the same 'log.csv' (header on the first write only), then read the final CSV and print the row count.
  4. Generate a DataFrame from a list of tuples representing transactions (tx_id, user, amount), ensure tx_id is unique by setting it as the index, and write/read via JSON 'records'; confirm that the reloaded DataFrame preserves the dtype of amount.
  5. Use read_html to parse multiple tables from an HTML string with two <table> elements; collect the returned list and vertically concatenate them; print final shape.
  6. Create an in-memory SQLite database with a numeric and a text field; insert rows with missing text; use read_sql to read and then export to Excel, verifying dtypes survive round-trip where possible.
  7. Benchmark reading a medium CSV (synthesize 50k rows in-memory to a temp file) with and without selecting columns (usecols); print elapsed times and shape to show performance effect.
  8. Build DataFrames from four different Python sources (list of lists, dict of lists, list of dicts, list of tuples) representing the same data; assert they are equal after column ordering and astype alignment.
  9. Create a DataFrame with columns of Python object strings and integers; convert strings to 'string' dtype and integers to nullable 'Int64'; write to JSON with orient='table', read back, and compare schemas.
  10. Compose a mini ETL:
    • Create a raw DataFrame from lists (names, depts, salaries).
    • Save to 'employees_raw.csv'.
    • Read with read_csv, enforce dtype for salary as Int64, normalize casing of names, and add a computed column ‘annual’ = salary*12.
    • Save clean data to 'employees_clean.xlsx' and to 'employees_clean.json'.
    • Print final shape, dtypes, and a 3-row sample with .head(3).
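
The equality check in question 8 above can be sketched as follows (the column names 'id' and 'name' and the sample values are illustrative assumptions, not part of the prompt):

```python
import pandas as pd

# The same two-row table built from four different Python sources.
cols = ['id', 'name']
from_lists  = pd.DataFrame([[1, 'A'], [2, 'B']], columns=cols)
from_dict   = pd.DataFrame({'id': [1, 2], 'name': ['A', 'B']})
from_dicts  = pd.DataFrame([{'id': 1, 'name': 'A'}, {'id': 2, 'name': 'B'}])
from_tuples = pd.DataFrame([(1, 'A'), (2, 'B')], columns=cols)

# Align column order and dtypes before asserting pairwise equality.
frames = [f[cols].astype({'id': 'int64'}) for f in
          (from_lists, from_dict, from_dicts, from_tuples)]
assert all(frames[0].equals(f) for f in frames[1:])
print("all four constructions are equal")
```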