Pandas Assignment– 1
Pandas Basics, Creating & Importing Data
Basic Questions
- Create a Series from the list [10, 20, 30, 40] with custom index [‘a’,’b’,’c’,’d’]; print the Series and then print its shape, ndim, size, and dtype.
- Create a DataFrame from a dictionary {‘name’:[‘A’,’B’,’C’], ‘age’:[21,25,19]}; print the DataFrame, then print shape, ndim, size, dtypes, and call info().
- Build a Series from a tuple (‘x’,’y’,’z’); print values and index, then convert to a DataFrame with a single column named ‘col’.
- Create a DataFrame from a list of dictionaries [{‘a’:1,’b’:2},{‘a’:3,’b’:4}]; print column names and dtypes.
- Make a DataFrame with columns ‘city’, ‘temp’, ‘rain’ from three equal-length lists; print the first and last rows using .head(1) and .tail(1).
- Create a Series of booleans from a list [True, False, True, False]; print ndim, size, and dtype, then cast to int and print again.
- Create a DataFrame index from a list [‘r1′,’r2′,’r3’] and assign it to a 3×2 DataFrame; print the index and columns.
- Build a DataFrame from a dictionary of tuples: {‘p’:(1,2,3), ‘q’:(4,5,6)}; print shape, then swap columns order to [‘q’,’p’] and print again.
- Create a Series from a range range(5); print .values, .index, and convert to Python list via .tolist().
- Create a DataFrame from nested lists [[1,’A’],[2,’B’],[3,’C’]] with columns [‘id’,’label’]; print info() and dtypes.
- Save a small DataFrame {‘x’:[1,2], ‘y’:[3,4]} to ‘mini.csv’ using to_csv(index=False); read it back with read_csv and print.
- Write a DataFrame to ‘mini.json’ using to_json(orient=’records’); read it back with read_json and print.
- Create an Excel file ‘mini.xlsx’ with a single sheet from a DataFrame; read it back with read_excel and print shape.
- Read a simple HTML table from a small HTML string via pd.read_html(StringIO(html)) (use io.StringIO); print the number of tables parsed and the first one.
- Demonstrate read_clipboard by attempting to read into a DataFrame (use try/except); if it fails, print a short message ‘clipboard not available’.
- Create a temporary SQLite table using sqlite3 in memory (e.g., ‘people’), insert 3 rows, use pd.read_sql(‘SELECT * FROM people’, conn) to read; print the DataFrame.
- From a DataFrame {‘a’:[1,2,3],’b’:[4,5,6]}, print column data types via .dtypes and the memory usage via .info(memory_usage=’deep’).
- Create a Series with mixed types [1, 2.5, ‘three’, True]; print dtype and then convert to string dtype; print dtype again.
- Build a DataFrame from an array-like list of lists (without NumPy), add column names [‘c1′,’c2′,’c3’]; print .columns and .index.
- Read ‘mini.csv’ again but specify dtype={‘x’:’int64′,’y’:’int64′}; print dtypes to confirm.
Intermediate Questions
- Create a DataFrame of 5 rows from a dictionary of lists; print shape, ndim, size, dtypes, and store the result of info() into a variable via buf=io.StringIO() to capture the text (then print the captured text).
- Build a Series from a dictionary {‘math’:88,’phy’:92,’chem’:79}; access by label and by position; convert to DataFrame with a column ‘marks’.
- From parallel lists ids=[101,102,103], names=[‘Ann’,’Ben’,’Cal’], scores=[77,88,93], create a DataFrame and set ‘ids’ as index; print index, columns, and dtypes.
- Create a DataFrame and write to ‘students.csv’ with a custom separator ‘;’ and without index; read it back with the same separator and confirm equality.
- Create a DataFrame and write to an Excel file with two sheets (‘S1’, ‘S2’); read each sheet with read_excel(sheet_name=…) and print shapes.
- Create a nested list-of-tuples data source and convert it to a DataFrame with explicit column names; print dtypes and use .astype({‘colX’:’Int64′}) to set a nullable integer type.
- Assemble a DataFrame with an object column and a numeric column; call info(memory_usage=’deep’) to compare memory before and after converting the object column to ‘category’.
- Read a small JSON string with records [{“k”:1,”v”:”a”},{“k”:2,”v”:”b”}] using read_json(orient=’records’); set ‘k’ as index and print.
- Create a DataFrame and export to JSON using three orientations: ‘records’, ‘split’, and ‘table’; print the string sizes to compare.
- Read a simple HTML table using read_html and show how to pass header=0 and index_col=0; print the resulting index and columns.
- Prepare a TSV-like clipboard text (fallback: StringIO) and parse with read_csv(sep=’\t’); print the DataFrame.
- In-memory SQL: create two tables ‘dept’ and ’emp’ in SQLite; use read_sql to read each; then run a join query and print the result.
- Build a DataFrame from a list of dictionaries where some keys are missing; print dtypes, then fill missing with defaults using .fillna and print again.
- Create a Series of dates from a list of strings; convert to datetime with pd.to_datetime; build a DataFrame with an additional text column; print dtypes.
- Construct a DataFrame, then reorder columns to a new list order; verify using list(df.columns) before and after.
- Create a DataFrame with mixed types and coerce the numeric column from strings to integers using pd.to_numeric(errors=’coerce’); print before/after dtypes.
- Write a DataFrame to ‘wide.csv’, then read only selected columns using usecols; print shape and columns to verify selection.
- Save a DataFrame to Excel with a custom index name and custom column names; read it back and ensure the index name is preserved.
- Demonstrate read_csv parameters: create a CSV with missing values; read with na_values=[‘NA’,’-‘], keep_default_na=True; print .isna().sum() per column.
- Create a DataFrame and write to CSV with date_format for a datetime column; read back and show that parsed dtype is object unless parse_dates is used; then read with parse_dates and compare dtypes.
Advanced Questions
- Build a moderately wide DataFrame (e.g., 100 columns × 100 rows) from generated Python lists; inspect shape, dtypes, size, info(), and compute approximate memory footprint with df.memory_usage(deep=True).sum().
- Create a nested dictionary {‘A’:{‘k1′:1,’k2’:2}, ‘B’:{‘k1′:3,’k3’:4}}; convert to a DataFrame, fill missing with 0, and write to ‘nested.csv’; read back and verify equality.
- Simulate incremental CSV writing: append three small DataFrames to the same ‘log.csv’ (header on first write only), read the final CSV and print row count.
- Generate a DataFrame from a list of tuples representing transactions (tx_id, user, amount), ensure tx_id is unique by setting it as index, and write/read via JSON ‘records’; confirm that reloaded DataFrame preserves dtypes for amount.
- Use read_html to parse multiple tables from an HTML string with two <table> elements; collect the returned list and vertically concatenate them; print final shape.
- Create an in-memory SQLite database with a numeric and a text field; insert rows with missing text; use read_sql to read and then export to Excel, verifying dtypes survive round-trip where possible.
- Benchmark reading a medium CSV (synthesize 50k rows in-memory to a temp file) with and without selecting columns (usecols); print elapsed times and shape to show performance effect.
- Build DataFrames from four different Python sources (list of lists, dict of lists, list of dicts, list of tuples) representing the same data; assert they are equal after column ordering and astype alignment.
- Create a DataFrame with columns of Python object strings and integers; convert strings to ‘string’ dtype and integers to nullable ‘Int64′; write to JSON with orient=’table’, read back, and compare schemas.
- Compose a mini ETL:
- Create a raw DataFrame from lists (names, depts, salaries).
- Save to ’employees_raw.csv’.
- Read with read_csv, enforce dtype for salary as Int64, normalize casing of names, and add a computed column ‘annual’ = salary*12.
- Save clean data to ’employees_clean.xlsx’ and to ’employees_clean.json’.
- Print final shape, dtypes, and a 3-row sample with .head(3).