Pandas Assignment – 8

Exporting & Optimization

Basic Questions

  1. Create a small DataFrame and write it to 'data_basic.csv' with to_csv(index=False); read it back with read_csv and print shape.
  2. Write the same DataFrame to 'data_basic.xlsx' with to_excel(index=False, sheet_name='S1'); read it back with read_excel and print columns.
  3. Export the DataFrame to 'data_basic.json' using to_json(orient='records'); read it with read_json and print the first row.
  4. Use to_html('data_basic.html', index=False) to export; confirm by reading back with pd.read_html('data_basic.html') and printing the number of tables.
  5. Create an in-memory SQLite connection via sqlite3.connect(':memory:'); write a DataFrame to SQL table 'items' using to_sql(…, index=False); read back with read_sql('SELECT * FROM items', conn).
  6. Synthesize a CSV with 1,000 rows to disk; read with read_csv(…, chunksize=200) and sum a numeric column across chunks.
  7. Read the same file without chunksize and compute the same sum; assert both sums match (a sketch of this chunking pattern appears after this list).
  8. Show memory usage of a DataFrame using df.info(memory_usage='deep').
  9. Convert an object string column to the pandas 'string' dtype with astype('string'); reprint memory usage.
  10. Convert an integer column from int64 to int32 using astype('int32'); verify dtypes change.
  11. Convert a float column from float64 to float32 and show effect on .memory_usage(deep=True).sum().
  12. Convert a high-cardinality text column to 'category'; print .cat.categories.size and memory before/after (see the dtype-optimization sketch after this list).
  13. Demonstrate vectorization: given column 'x', compute 'y = 3*x + 2' without loops and print elapsed time.
  14. Compute the same 'y' using a Python for loop and append pattern; print elapsed time and compare with the vectorized approach.
  15. Use df.eval('z = a + b * c') to create a new column; confirm with an equivalent vectorized expression and check equality with (df['z'] == z2).all().
  16. Filter rows using df.query('a > 10 and b < 5'); confirm the number of rows equals a boolean-mask equivalent.
  17. Save a filtered subset to 'subset.csv' and to 'subset.json' (records); verify row counts.
  18. Export a styled HTML table from a numeric DataFrame using df.style.format('{:.2f}').to_html('styled.html') (no need to re-read).
  19. Write multiple sheets to 'multi.xlsx' using the ExcelWriter context manager; read back with sheet_name=None and print the keys.
  20. Demonstrate index_label='id' in to_csv and verify on read-back that the first column is parsed as the index with index_col=0.
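
A minimal sketch of the chunked-read pattern from questions 6 and 7; the file name and the 'value' column are placeholders, not part of the assignment.

    import numpy as np
    import pandas as pd

    # Synthesize a small CSV on disk (question 6).
    pd.DataFrame({"value": np.random.rand(1_000)}).to_csv("data_chunks.csv", index=False)

    # Sum the numeric column chunk by chunk.
    chunk_total = 0.0
    for chunk in pd.read_csv("data_chunks.csv", chunksize=200):
        chunk_total += chunk["value"].sum()

    # Compare against a single full read (question 7).
    full_total = pd.read_csv("data_chunks.csv")["value"].sum()
    assert np.isclose(chunk_total, full_total)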
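
A hedged sketch of the dtype-optimization steps in questions 10-12; the columns, sizes, and cardinalities are assumptions for illustration.

    import numpy as np
    import pandas as pd

    # Illustrative frame with an object, an int64, and a float64 column.
    df = pd.DataFrame({
        "city": np.random.choice(["Pune", "Delhi", "Mumbai"], size=10_000),
        "count": np.arange(10_000, dtype="int64"),
        "score": np.random.rand(10_000),
    })
    before = df.memory_usage(deep=True).sum()

    df["city"] = df["city"].astype("category")    # question 12
    df["count"] = df["count"].astype("int32")     # question 10
    df["score"] = df["score"].astype("float32")   # question 11

    after = df.memory_usage(deep=True).sum()
    print(f"memory: {before / 1e6:.2f} MB -> {after / 1e6:.2f} MB")
    print(df["city"].cat.categories.size, "categories")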

Intermediate Questions

  1. Create a 200k-row DataFrame in memory; write to 'big.csv'; read with chunksize=50_000 and compute the global mean of a numeric column by aggregating chunk-wise sums and counts (see the chunk-aggregation sketch after this list).
  2. Using chunksize, compute per-chunk value_counts() of a categorical column and combine the counts across chunks with .add(fill_value=0).
  3. Read 'big.csv' with usecols to limit columns; compare load time and memory vs reading all columns.
  4. Optimize dtypes on load: pass dtype={'id': 'int32', 'flag': 'boolean', 'score': 'float32'} to read_csv; print dtypes.
  5. Convert a verbose text column to 'category'; compute groupby mean on another column and compare speed vs object dtype.
  6. Use to_sql to append data in two batches to table 'logs' with if_exists='append'; verify total row count with SELECT COUNT(*).
  7. Create a wide DataFrame and export only a subset to JSON with orient='columns'; import back and check that keys match selected columns.
  8. Export a DataFrame with datetimes to CSV using date_format='%Y-%m-%d'; read with parse_dates=['date'] and confirm dtype is datetime64.
  9. Use eval to compute a composite metric score = 0.7*math + 0.3*eng; compare the runtime against chained vectorized operations.
  10. Use query with a local variable thr via @thr to filter rows; check that the boolean-mask version matches.
  11. Build a pipeline: read 'big.csv' in chunks → filter each chunk with query → downcast numerics with astype/pd.to_numeric(downcast='integer') → append to a list → pd.concat → write 'big_filtered.parquet' (if Parquet is unavailable, skip the write but show a size estimate).
  12. Create a function to_small(df) that downcasts all float64 to float32 and int64 to int32; apply with pipe() and validate memory reduction.
  13. Demonstrate exporting a MultiIndex DataFrame to CSV and round-trip via read_csv(index_col=[0,1]).
  14. Export a DataFrame to HTML and include a caption with Styler.set_caption('Report'); write to 'report.html'.
  15. Use to_excel with engine='xlsxwriter' (or an ExcelWriter with engine_kwargs) to auto-adjust column widths; document with comments if the engine is not available.
  16. Compare vectorized np.where vs a Python if/else loop to tag values as 'high'/'low'; print runtimes.
  17. Use groupby.transform('mean') to add a normalized column x_norm = x - group_mean; time it with a %timeit-style alternative using time.perf_counter().
  18. On a large DataFrame, compare df['a'] + df['b'] vs df.eval('a + b') in terms of timing; report the winner.
  19. Use read_csv chunksize to write each chunk to SQL with to_sql(if_exists='append'); verify total rows inserted.
  20. Create a memory report function that prints rows, columns, df.memory_usage(deep=True).sum()/1e6 (MB), and dtypes.value_counts(); run it before and after dtype optimization (a sketch of this helper, together with to_small from question 12, appears after this list).
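
A sketch of the chunk-wise aggregation in questions 1 and 2; 'big.csv' comes from question 1, while the 'score' and 'grade' column names are assumptions.

    import pandas as pd

    total, count = 0.0, 0
    counts = pd.Series(dtype="float64")

    for chunk in pd.read_csv("big.csv", chunksize=50_000):
        # Global mean from chunk-wise sums and counts (question 1).
        total += chunk["score"].sum()
        count += chunk["score"].count()
        # Merge per-chunk value_counts, filling missing categories with 0 (question 2).
        counts = counts.add(chunk["grade"].value_counts(), fill_value=0)

    print("global mean:", total / count)
    print(counts.astype("int64").sort_values(ascending=False))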
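
A sketch of the helpers asked for in questions 12 and 20; the function names follow the question text, everything else is illustrative.

    import pandas as pd

    def memory_report(df: pd.DataFrame, label: str = "") -> None:
        # Rows, columns, total deep memory in MB, and dtype distribution (question 20).
        mb = df.memory_usage(deep=True).sum() / 1e6
        print(f"{label} rows={len(df)} cols={df.shape[1]} memory={mb:.2f} MB")
        print(df.dtypes.value_counts(), end="\n\n")

    def to_small(df: pd.DataFrame) -> pd.DataFrame:
        # Downcast float64 -> float32 and int64 -> int32 (question 12).
        out = df.copy()
        for col in out.select_dtypes(include="float64").columns:
            out[col] = out[col].astype("float32")
        for col in out.select_dtypes(include="int64").columns:
            out[col] = out[col].astype("int32")
        return out

    # Usage: memory_report(df, "before"); df = df.pipe(to_small); memory_report(df, "after")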

Advanced Questions

  1. Build a scalable CSV → SQL loader: stream 'events.csv' in chunks of 100k, infer minimal dtypes (downcast numerics, categorize low-cardinality strings), and append to SQLite; validate counts and memory usage of a sample chunk before/after optimization.
  2. Design an export bundle function that writes a DataFrame to CSV, Excel (multiple sheets if needed), JSON (records), and HTML report (styled) with consistent naming; return file paths; test on a sample dataset.
  3. Implement out-of-core aggregation: compute the global median of a huge CSV by chunking: first build per-chunk histograms (bins), then approximate the overall median; compare with the exact median on a smaller sample.
  4. Create a mixed-type dataset (ints, floats, booleans, categories, datetimes), write to SQL, read back with read_sql, and verify schema fidelity; where types drift (e.g., booleans), document and fix with astype.
  5. Benchmark four ways to compute z = a*b + c on a 1M-row DataFrame: pure vectorization, np.where branch, eval, numexpr via pd.eval(engine='numexpr') if available; report timings and equality.
  6. Implement a safe_read_csv(path, dtype_map, usecols, chunksize) helper that logs elapsed time, memory footprint of the first chunk, and returns either a generator (for chunks) or a full DataFrame depending on chunksize; test both modes.
  7. Create a categorical encoding pipeline: detect object columns with low cardinality, convert to 'category', export to CSV and JSON; reload both and ensure categories round-trip (note that JSON does not preserve categories; recreate them after loading).
  8. Build a partitioned export: split a large DataFrame by year (from a datetime column) and write one CSV per year; verify the files and that the combined row count equals the original (see the partitioning sketch at the end of this section).
  9. Construct a query-accelerated workflow: load a medium DataFrame; run 5 complex filters using query vs equivalent boolean masks; time and tabulate results; pick the fastest for each case.
  10. Compose a reproducible performance notebook script:
    • Generate 5M-row synthetic dataset (with ints, floats, strings).
    • Save as CSV.
    • Load with optimized dtype, usecols, and chunksize.
    • Downcast with astype, convert low-cardinality strings to category.
    • Add computed columns with eval/query.
    • Export cleaned results to CSV and SQL.
    • Print a final summary: rows, cols, total MB, and elapsed time per stage.
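
A sketch of the partitioned export from question 8; the frame, the file-name pattern, and the column names are assumptions.

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        "date": pd.date_range("2020-01-01", periods=1_000, freq="D"),
        "value": np.random.rand(1_000),
    })

    # Write one CSV per year and track the combined row count.
    written = 0
    for year, part in df.groupby(df["date"].dt.year):
        path = f"events_{year}.csv"
        part.to_csv(path, index=False)
        written += len(part)
        print(path, len(part), "rows")

    assert written == len(df)  # combined row count equals the original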