Pandas Assignment – 8
Exporting & Optimization
Basic Questions
- Create a small DataFrame and write it to 'data_basic.csv' with to_csv(index=False); read it back with read_csv and print shape (a round-trip sketch follows this list).
- Write the same DataFrame to 'data_basic.xlsx' with to_excel(index=False, sheet_name='S1'); read it back with read_excel and print columns.
- Export the DataFrame to 'data_basic.json' using to_json(orient='records'); read it with read_json and print the first row.
- Use to_html('data_basic.html', index=False) to export; confirm by reading back with pd.read_html('data_basic.html') and printing the number of tables.
- Create an in-memory SQLite connection via sqlite3.connect(':memory:'); write a DataFrame to SQL table 'items' using to_sql(…, index=False); read back with read_sql('SELECT * FROM items', conn).
- Synthesize a CSV with 1,000 rows to disk; read with read_csv(…, chunksize=200) and sum a numeric column across chunks (a chunked-read sketch follows this list).
- Read the same file without chunksize and compute the same sum; assert both sums match.
- Show memory usage of a DataFrame using df.info(memory_usage='deep') (a dtype-optimization sketch follows this list).
- Convert an object string column to the pandas 'string' dtype with astype('string'); reprint memory usage.
- Convert an integer column from int64 to int32 using astype('int32'); verify dtypes change.
- Convert a float column from float64 to float32 and show effect on .memory_usage(deep=True).sum().
- Convert a high-cardinality text column to 'category'; print .cat.categories.size and memory before/after.
- Demonstrate vectorization: given column 'x', compute 'y = 3*x + 2' without loops and print elapsed time (a timing sketch follows this list).
- Compute the same 'y' using a Python for loop and append pattern; print elapsed time and compare with the vectorized approach.
- Use df.eval('z = a + b * c') to create a new column; confirm with an equivalent vectorized expression and check equality with (df['z'] == z2).all().
- Filter rows using df.query('a > 10 and b < 5'); confirm the number of rows equals a boolean-mask equivalent.
- Save a filtered subset to 'subset.csv' and to 'subset.json' (records); verify row counts.
- Export a styled HTML table from a numeric DataFrame using df.style.format('{:.2f}').to_html('styled.html') (no need to re-read).
- Write multiple sheets to 'multi.xlsx' using the ExcelWriter context manager; read back with sheet_name=None and print the keys.
- Demonstrate index_label='id' in to_csv and verify on read back that the first column is parsed as the index with index_col=0.
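A minimal round-trip sketch for the CSV/Excel/JSON/HTML export items above; the DataFrame contents are placeholders, and the Excel and HTML steps assume openpyxl and lxml (or bs4) are installed:

```python
import pandas as pd

# Small illustrative DataFrame (placeholder data).
df = pd.DataFrame({"name": ["a", "b", "c"], "score": [10, 20, 30]})

# CSV round-trip: write without the index, read back, inspect shape.
df.to_csv("data_basic.csv", index=False)
print(pd.read_csv("data_basic.csv").shape)

# Excel round-trip (requires an Excel engine such as openpyxl).
df.to_excel("data_basic.xlsx", index=False, sheet_name="S1")
print(pd.read_excel("data_basic.xlsx", sheet_name="S1").columns.tolist())

# JSON round-trip with one object per row.
df.to_json("data_basic.json", orient="records")
print(pd.read_json("data_basic.json").iloc[0])

# HTML export; read_html returns a list of the tables it finds.
df.to_html("data_basic.html", index=False)
print(len(pd.read_html("data_basic.html")))
```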
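A possible shape for the chunked-sum exercise; the file name rows.csv and the column name score are illustrative choices, not part of the assignment:

```python
import numpy as np
import pandas as pd

# Synthesize a 1,000-row CSV on disk (column names are placeholders).
rng = np.random.default_rng(0)
pd.DataFrame({"id": range(1000), "score": rng.random(1000)}).to_csv("rows.csv", index=False)

# Chunked read: sum the numeric column piece by piece.
chunk_sum = 0.0
for chunk in pd.read_csv("rows.csv", chunksize=200):
    chunk_sum += chunk["score"].sum()

# Full read: compute the same sum in one pass and compare.
full_sum = pd.read_csv("rows.csv")["score"].sum()
assert np.isclose(chunk_sum, full_sum)
print(chunk_sum, full_sum)
```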
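One way the memory and dtype-conversion items might be combined into a single script; the column names are invented for the example:

```python
import numpy as np
import pandas as pd

n = 100_000
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "text": rng.choice(["red", "green", "blue"], size=n),  # object dtype
    "count": np.arange(n, dtype="int64"),
    "value": rng.random(n),                                # float64
})

df.info(memory_usage="deep")
print("before:", df.memory_usage(deep=True).sum() / 1e6, "MB")

# object -> pandas 'string' dtype, then 'category' for the repetitive column.
df["text"] = df["text"].astype("string")
df["text"] = df["text"].astype("category")
print("categories:", df["text"].cat.categories.size)

# Narrow the numeric dtypes.
df["count"] = df["count"].astype("int32")
df["value"] = df["value"].astype("float32")

print(df.dtypes)
print("after:", df.memory_usage(deep=True).sum() / 1e6, "MB")
```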
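A rough timing sketch for the vectorization, eval, and query items; the column names a, b, c, x follow the prompts, while the row count is arbitrary:

```python
import time

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 20, size=(500_000, 4)), columns=["a", "b", "c", "x"])

# Vectorized y = 3*x + 2.
t0 = time.perf_counter()
y_vec = 3 * df["x"] + 2
print("vectorized:", time.perf_counter() - t0, "s")

# Loop-and-append version of the same computation.
t0 = time.perf_counter()
y_loop = []
for value in df["x"]:
    y_loop.append(3 * value + 2)
print("loop:", time.perf_counter() - t0, "s")

# eval-created column vs an explicit vectorized expression.
df = df.eval("z = a + b * c")
z2 = df["a"] + df["b"] * df["c"]
assert (df["z"] == z2).all()

# query vs an equivalent boolean mask.
q = df.query("a > 10 and b < 5")
mask = df[(df["a"] > 10) & (df["b"] < 5)]
assert len(q) == len(mask)
```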
Intermediate Questions
- Create a 200k-row DataFrame in memory; write to 'big.csv'; read with chunksize=50_000 and compute the global mean of a numeric column by aggregating chunk-wise sums and counts (a chunk-aggregation sketch follows this list).
- Using chunksize, compute per-chunk value_counts() of a categorical column and merge the counts across chunks with add(fill_value=0).
- Read 'big.csv' with usecols to limit columns; compare load time and memory vs reading all columns.
- Optimize dtypes on load: pass dtype={'id':'int32','flag':'boolean','score':'float32'} to read_csv; print dtypes (an optimized-load sketch follows this list).
- Convert a verbose text column to 'category'; compute a groupby mean on another column and compare speed vs the object dtype.
- Use to_sql to append data in two batches to table 'logs' with if_exists='append'; verify the total row count with SELECT COUNT(*).
- Create a wide DataFrame and export only a subset to JSON with orient='columns'; import back and check that the keys match the selected columns.
- Export a DataFrame with datetimes to CSV using date_format='%Y-%m-%d'; read with parse_dates=['date'] and confirm the dtype is datetime64.
- Use eval to compute a composite metric score = 0.7*math + 0.3*eng; compare runtime vs chaining vectorized operations.
- Use query with a local variable thr via @thr to filter rows; check that the boolean-mask version matches (a short comparison sketch follows this list).
- Build a pipeline: read 'big.csv' in chunks → filter each chunk with query → downcast numerics with astype/pd.to_numeric(downcast='integer') → append to a list → pd.concat → write 'big_filtered.parquet' (if Parquet support is unavailable, skip the write but show a size estimate).
- Create a function to_small(df) that downcasts all float64 to float32 and int64 to int32; apply it with pipe() and validate the memory reduction (a to_small sketch follows this list).
- Demonstrate exporting a MultiIndex DataFrame to CSV and round-trip via read_csv(index_col=[0,1]).
- Export a DataFrame to HTML and include a caption with Styler.set_caption('Report'); write to 'report.html'.
- Use to_excel with the xlsxwriter engine (or engine_kwargs) to auto-adjust column widths; document with comments if the engine is not available.
- Compare vectorized np.where vs a Python loop with if/else to tag values as 'high'/'low'; print runtimes (a timing sketch follows this list).
- Use groupby.transform('mean') to add a normalized column x_norm = x - group_mean; time it with a %timeit-style alternative using time.perf_counter().
- On a large DataFrame, compare df['a'] + df['b'] vs df.eval('a + b') in terms of timing; report the winner.
- Use read_csv chunksize to write each chunk to SQL with to_sql(if_exists='append'); verify the total rows inserted.
- Create a memory report function that prints rows, cols, df.memory_usage(deep=True).sum()/1e6 MB, dtypes.value_counts(); run before and after dtype optimization.
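A sketch of the chunk-wise aggregation pattern behind the first two intermediate items, assuming big.csv exists and contains a numeric score column and a categorical group column (both column names are assumptions):

```python
import pandas as pd

total = 0.0
count = 0
merged_counts = pd.Series(dtype="float64")

for chunk in pd.read_csv("big.csv", chunksize=50_000):
    # Global mean from per-chunk sums and counts.
    total += chunk["score"].sum()
    count += chunk["score"].count()
    # Merge per-chunk value_counts, filling missing categories with 0.
    merged_counts = merged_counts.add(chunk["group"].value_counts(), fill_value=0)

print("global mean:", total / count)
print(merged_counts.sort_values(ascending=False).head())
```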
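The optimized-load comparison might look roughly like this; it assumes big.csv exists with id, flag, and score columns and that flag holds True/False or 0/1 values:

```python
import time

import pandas as pd

dtype_map = {"id": "int32", "flag": "boolean", "score": "float32"}

# Optimized load: only the needed columns, with explicit narrow dtypes.
t0 = time.perf_counter()
slim = pd.read_csv("big.csv", usecols=list(dtype_map), dtype=dtype_map)
print("optimized load:", time.perf_counter() - t0, "s,",
      slim.memory_usage(deep=True).sum() / 1e6, "MB")
print(slim.dtypes)

# Default load of all columns for comparison.
t0 = time.perf_counter()
full = pd.read_csv("big.csv")
print("default load:", time.perf_counter() - t0, "s,",
      full.memory_usage(deep=True).sum() / 1e6, "MB")
```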
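A minimal comparison for the @-variable query item; the variable name thr and column a come from the prompt, the data is made up:

```python
import pandas as pd

df = pd.DataFrame({"a": range(100)})
thr = 42

# query can reference local Python variables with the @ prefix.
via_query = df.query("a > @thr")
via_mask = df[df["a"] > thr]

assert via_query.equals(via_mask)
print(len(via_query), "rows above the threshold")
```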
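One plausible to_small implementation for the pipe() item; the selection logic shown is a reasonable choice, not the only one:

```python
import pandas as pd

def to_small(df: pd.DataFrame) -> pd.DataFrame:
    """Downcast float64 -> float32 and int64 -> int32; return a copy."""
    out = df.copy()
    for col in out.select_dtypes(include="float64").columns:
        out[col] = out[col].astype("float32")
    for col in out.select_dtypes(include="int64").columns:
        out[col] = out[col].astype("int32")
    return out

df = pd.DataFrame({"a": range(1_000_000), "b": [0.5] * 1_000_000})
before = df.memory_usage(deep=True).sum()
small = df.pipe(to_small)
after = small.memory_usage(deep=True).sum()
print(f"{before / 1e6:.1f} MB -> {after / 1e6:.1f} MB")
```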
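A small timing harness for the np.where tagging item; the 0.5 threshold is an arbitrary example:

```python
import time

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.random(1_000_000)})

# Vectorized tagging with np.where.
t0 = time.perf_counter()
tags_vec = np.where(df["x"] > 0.5, "high", "low")
print("np.where:", time.perf_counter() - t0, "s")

# Equivalent Python loop with if/else.
t0 = time.perf_counter()
tags_loop = ["high" if v > 0.5 else "low" for v in df["x"]]
print("loop:", time.perf_counter() - t0, "s")

assert list(tags_vec) == tags_loop
```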
Advanced Questions
- Build a scalable CSV → SQL loader: stream 'events.csv' in chunks of 100k, infer minimal dtypes (downcast numerics, categorize low-cardinality strings), and append to SQLite; validate counts and the memory usage of a sample chunk before/after optimization (a loader sketch follows this list).
- Design an export bundle function that writes a DataFrame to CSV, Excel (multiple sheets if needed), JSON (records), and HTML report (styled) with consistent naming; return file paths; test on a sample dataset.
- Implement out-of-core aggregation: compute the global median of a huge CSV in chunks by first building per-chunk histograms (bins) and then approximating the overall median; compare with the exact median on a smaller sample.
- Create a mixed-type dataset (ints, floats, booleans, categories, datetimes), write to SQL, read back with read_sql, and verify schema fidelity; where types drift (e.g., booleans), document and fix with astype.
- Benchmark four ways to compute z = a*b + c on a 1M-row DataFrame: pure vectorization, an np.where branch, eval, and numexpr via pd.eval(engine='numexpr') if available; report timings and equality.
- Implement a safe_read_csv(path, dtype_map, usecols, chunksize) helper that logs elapsed time and the memory footprint of the first chunk, and returns either a generator (for chunks) or a full DataFrame depending on chunksize; test both modes.
- Create a categorical encoding pipeline: detect object columns with low cardinality, convert them to 'category', and export to CSV and JSON; reload both and ensure the categories round-trip (note that JSON does not preserve categories; recreate them after loading).
- Build a partitioned export: split a large DataFrame by year (from a datetime column) and write one CSV per year; verify the files and that the combined row count equals the original (a partitioning sketch follows this list).
- Construct a query-accelerated workflow: load a medium DataFrame; run 5 complex filters using query vs equivalent boolean masks; time and tabulate results; pick the fastest for each case.
- Compose a reproducible performance notebook script:
- Generate 5M-row synthetic dataset (with ints, floats, strings).
- Save as CSV.
- Load with optimized dtype, usecols, and chunksize.
- Downcast with astype, convert low-cardinality strings to category.
- Add computed columns with eval/query.
- Export cleaned results to CSV and SQL.
- Print a final summary: rows, cols, total MB, and elapsed time per stage.
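A compressed sketch of the streaming CSV-to-SQL loader; the table name events, the cardinality cutoff, and the existence of events.csv are all assumptions:

```python
import sqlite3

import pandas as pd

def optimize(chunk: pd.DataFrame, max_card: int = 100) -> pd.DataFrame:
    """Downcast numeric columns and categorize low-cardinality string columns."""
    for col in chunk.columns:
        kind = chunk[col].dtype.kind
        if kind == "i":
            chunk[col] = pd.to_numeric(chunk[col], downcast="integer")
        elif kind == "f":
            chunk[col] = pd.to_numeric(chunk[col], downcast="float")
        elif kind == "O" and chunk[col].nunique() <= max_card:
            chunk[col] = chunk[col].astype("category")
    return chunk

conn = sqlite3.connect(":memory:")
total_rows = 0
for i, chunk in enumerate(pd.read_csv("events.csv", chunksize=100_000)):
    before = chunk.memory_usage(deep=True).sum() / 1e6
    chunk = optimize(chunk)
    after = chunk.memory_usage(deep=True).sum() / 1e6
    if i == 0:  # report the sample chunk's memory before/after optimization
        print(f"sample chunk: {before:.1f} MB -> {after:.1f} MB")
    chunk.to_sql("events", conn, if_exists="append", index=False)
    total_rows += len(chunk)

# Validate that the row count in SQLite matches what was streamed in.
inserted = pd.read_sql("SELECT COUNT(*) AS n FROM events", conn)["n"].iloc[0]
assert inserted == total_rows
print(inserted, "rows loaded")
```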
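The partitioned export could be sketched as follows; the date column name and the part_<year>.csv naming pattern are assumptions:

```python
import pandas as pd

# Illustrative frame with a datetime column spanning several years.
df = pd.DataFrame({
    "date": pd.date_range("2020-01-01", periods=1000, freq="D"),
    "value": range(1000),
})

paths = []
for year, part in df.groupby(df["date"].dt.year):
    path = f"part_{year}.csv"
    part.to_csv(path, index=False)
    paths.append(path)

# Verify that the combined row count matches the original.
combined = pd.concat(pd.read_csv(p, parse_dates=["date"]) for p in paths)
assert len(combined) == len(df)
print(paths)
```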