Data Science using Python
Interview Questions with Answers

1. What are Python’s key data structures used in data science?
Answer:
Python offers several built-in and library-provided data structures that are essential in data science and Generative AI:
Data Structure | Description | Usage in Data Science & AI |
---|---|---|
list | Ordered, mutable collection | Arrays, sequences of data |
tuple | Ordered, immutable collection | Coordinates, hashable keys |
set | Unordered, unique elements | Removing duplicates, set operations |
dict | Key-value pairs | Feature mapping, JSON-like structures |
defaultdict | Returns default value for missing keys | Counting, grouping |
Counter | Subclass of dict for counting hashables | Word/token frequency, n-grams |
deque | Double-ended queue | Sliding windows, efficient pops/appends |
DataFrame (Pandas) | 2D labeled data structure | Tabular data analysis |
ndarray (NumPy) | N-dimensional array | Vectorized math, matrices, tensors |
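A quick sketch of a few of the collections-module structures above (printed values shown as comments):
from collections import Counter, defaultdict, deque

tokens = ["a", "b", "a", "c", "a"]
print(Counter(tokens))             # Counter({'a': 3, 'b': 1, 'c': 1}), i.e. token frequencies

groups = defaultdict(list)
groups["even"].append(2)           # a missing key silently gets an empty list

window = deque([1, 2, 3], maxlen=3)
window.append(4)                   # oldest element is dropped: deque([2, 3, 4], maxlen=3)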
2. How do list comprehension and generator expressions differ?
Answer:
Feature | List Comprehension | Generator Expression |
Syntax | [x for x in iterable] | (x for x in iterable) |
Output | Returns full list | Returns generator object |
Memory Usage | Stores all elements in memory | Lazy evaluation (memory-efficient) |
Performance | Fast for small-to-medium datasets | Ideal for large or infinite datasets |
Use Case | Eager evaluation | Stream processing, pipelines |
Example
# List comprehension
squares = [x*x for x in range(5)]
# Generator expression
squares_gen = (x*x for x in range(5))
3. Explain the difference between is, ==, and in.
Answer:
Operator | Purpose | Example |
== | Compares values | 'abc' == 'abc' → True |
is | Compares object identities | a is b → True if a and b refer to the same object |
in | Membership check | 'a' in 'cat' → True |
Example
a = [1, 2]; b = a
print(a == b) # True (values are equal)
print(a is b) # True (same object)
print(2 in a) # True (element exists)
4. How is memory managed in Python?
Answer:
Python’s memory management includes:
- Automatic memory allocation using:
- Private heap memory where all Python objects and data structures are stored.
- Reference Counting:
- Every object has a reference count.
- When count drops to 0, object is deallocated.
- Garbage Collector:
- Handles cyclic references using gc module.
- Uses generational collection (3 generations: young → old).
- Memory Pools:
- Implemented by the PyMalloc allocator for efficiency.
Example
import gc
gc.collect() # Triggers garbage collection manually
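For illustration, reference counts can be inspected with sys.getrefcount (exact numbers are interpreter-dependent, since the call itself adds a temporary reference):
import sys

data = []
print(sys.getrefcount(data))   # e.g. 2: the variable plus the temporary argument reference
alias = data
print(sys.getrefcount(data))   # e.g. 3: one more reference now exists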
5. What is the difference between deepcopy and copy?
Answer:
- copy() (Shallow Copy): Creates a new object but references original nested objects.
- deepcopy() (Deep Copy): Creates a completely independent clone, including nested objects.
Example
import copy
original = [[1, 2], [3, 4]]
shallow = copy.copy(original)
deep = copy.deepcopy(original)
original[0][0] = 99
print(shallow[0][0]) # 99 (same inner list)
print(deep[0][0]) # 1 (independent inner list)
6. What are Python’s *args and **kwargs used for?
Answer:
- *args: Collects extra positional arguments into a tuple.
- **kwargs: Collects extra keyword arguments into a dictionary.
Both are used to create flexible functions.
Example
def sample(*args, **kwargs):
    print(args)    # Tuple of positional values
    print(kwargs)  # Dict of keyword arguments
sample(1, 2, a=3, b=4)
7. Explain the difference between @staticmethod and @classmethod.
Answer:
Decorator | @staticmethod | @classmethod |
Access to self | No | No (uses cls) |
Access to cls | No | Yes (class itself) |
Use Case | Utility functions inside a class | Factory methods or methods acting on class |
Example
class MyClass:
    @staticmethod
    def greet():
        return "Hello"

    @classmethod
    def create(cls):
        return cls()
8. How do you handle missing data in Python?
Answer:
Using Pandas, typical steps:
import pandas as pd
df = pd.DataFrame({'a': [1, None, 3], 'b': [4, 5, None]})
df.isnull()           # Boolean mask of missing values
df.dropna()           # Drop rows with any missing values
df.fillna(0)          # Fill missing with 0
df.fillna(df.mean())  # Fill with column mean
Other techniques
- Interpolation: df.interpolate()
- Back/forward fill: df.fillna(method='ffill')
9. What is a lambda function and where is it useful?
Answer:
Lambda function is an anonymous, one-line function.
add = lambda x, y: x + y
print(add(3, 4)) # Output: 7
Use cases
- map(), filter(), reduce()
- Sorting by custom keys:
- sorted(data, key=lambda x: x[1])
Note: Limited to single expression, no statements or annotations.
10. What are Python’s most important built-in libraries for data science?
Answer:
Library | Purpose |
NumPy | Numerical computations, n-dimensional arrays |
Pandas | Data manipulation and analysis |
Matplotlib | 2D plotting |
Seaborn | Statistical data visualization |
Scikit-learn | ML models, preprocessing, metrics |
SciPy | Scientific computing (linear algebra, stats) |
Statsmodels | Statistical modeling and tests |
TensorFlow, PyTorch | Deep learning frameworks |
NLTK, spaCy | Natural language processing |
OpenCV | Image processing and computer vision |
11. What is the difference between a NumPy array and a Python list?
Answer:
Feature | NumPy Array | Python List |
Homogeneity | Elements must be of the same type | Can store mixed data types |
Memory Efficiency | More efficient (C-contiguous blocks) | Less efficient (pointers to objects) |
Speed | Much faster due to vectorization | Slower for numerical operations |
Broadcasting | Supported | Not supported |
Operations | Element-wise arithmetic | Requires loops |
Example
import numpy as np
arr = np.array([1, 2, 3])
print(arr * 2)              # [2 4 6]
lst = [1, 2, 3]
print([x*2 for x in lst])   # [2, 4, 6]
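Broadcasting (mentioned in the table) lets arrays of compatible shapes combine without explicit loops; a minimal sketch:
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])
row_means = matrix.mean(axis=1, keepdims=True)   # shape (2, 1)
centered = matrix - row_means                    # (2, 3) minus (2, 1) via broadcasting
print(centered)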
12. How do you handle missing values in Pandas?
Answer:
Detect:
df.isnull() # Boolean DataFrame
df.isnull().sum() # Count missing per column
Handle:
df.dropna() # Drop rows with missing values
df.dropna(axis=1) # Drop columns with missing values
df.fillna(0) # Replace with 0
df.fillna(df.mean()) # Replace with column mean
df.fillna(method='ffill') # Forward fill
df.interpolate() # Interpolate missing values
13. Explain the difference between .loc[] and .iloc[].
Answer:
Feature | .loc[] | .iloc[] |
Access by | Labels (row/column names) | Integer position (like array indices) |
Syntax | df.loc[row_label, col_label] | df.iloc[row_index, col_index] |
Supports | Slicing with labels | Slicing with integers |
Example
df = pd.DataFrame({'A': [10, 20], 'B': [30, 40]}, index=['x', 'y'])
df.loc['x', 'A']   # 10
df.iloc[0, 0]      # 10
14. How can you apply a function to every row in a Pandas DataFrame?
Answer:
Using apply() with axis=1:
def process_row(row):
return row['A'] + row['B']
df['sum'] = df.apply(process_row, axis=1)
For vectorized operations, prefer direct column-wise calculations:
df['sum'] = df['A'] + df['B'] # Faster than apply
15. What are vectorized operations in NumPy and why are they faster?
Answer:
Feature | Loop-Based | Vectorized (NumPy) |
Memory Efficient | No | Yes |
Speed | Slower (Python loop) | Faster (compiled code) |
Syntax | Verbose | Concise |
Example
# Vectorized
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
result = a + b   # Element-wise addition
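A rough timing comparison of the two approaches (absolute numbers depend on the machine, but the vectorized version is typically much faster):
import time
import numpy as np

a = np.arange(1_000_000)

start = time.perf_counter()
loop_result = [x * 2 for x in a]   # interpreted Python, element by element
loop_time = time.perf_counter() - start

start = time.perf_counter()
vec_result = a * 2                 # single call into compiled NumPy code
vec_time = time.perf_counter() - start

print(f"loop: {loop_time:.4f}s, vectorized: {vec_time:.4f}s")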
16. How do you merge, join, and concatenate datasets in Pandas?
Answer:
Concatenate:
pd.concat([df1, df2], axis=0)   # Stack rows
pd.concat([df1, df2], axis=1)   # Stack columns
Merge (SQL-style joins):
pd.merge(df1, df2, on='key', how='inner') # or 'left', 'right', 'outer'
Join (index-based):
df1.join(df2, how='left')   # joins on the index of df2
17. What is the difference between groupby() and pivot_table()?
Answer:
Feature | groupby() | pivot_table() |
Aggregation | Yes (must use .agg(), .sum(), etc.) | Yes (with default aggfunc='mean') |
Multi-level index | Yes (when grouping by multiple keys) | Yes (hierarchical rows/columns) |
Use Case | Group data and apply functions | Reshape and summarize (like Excel pivot) |
Example
df.groupby('category')['sales'].sum()
df.pivot_table(values='sales', index='category', columns='region', aggfunc='sum')
18. How do you deal with outliers in a dataset?
Answer:
Detection Techniques:
- IQR method:
Q1 = df['col'].quantile(0.25)
Q3 = df['col'].quantile(0.75)
IQR = Q3 - Q1
outliers = df[(df['col'] < Q1 - 1.5*IQR) | (df['col'] > Q3 + 1.5*IQR)]
- Z-score method:
from scipy.stats import zscore
df['z'] = zscore(df['col'])
outliers = df[df['z'].abs() > 3]
Handling Strategies:
- Remove
- Replace with median or capped value (winsorization)
- Use robust models (e.g., decision trees)
19. What are the advantages of using apply() vs a loop in Pandas?
Answer:
Feature | apply() | Python loop |
Speed | Usually faster than an explicit loop, though not truly vectorized | Slower (interpreted, row-by-row) |
Syntax | Cleaner and more Pythonic | Verbose |
Flexibility | High (can use custom functions) | High but inefficient |
Example
df['squared'] = df['col'].apply(lambda x: x**2) # Faster than for-loop
20. How do you detect and handle duplicate records?
Answer:
Detect Duplicates:
df.duplicated() # Returns boolean Series
df[df.duplicated()] # Get duplicate rows
df.duplicated(subset=['col']) # Check specific column
Remove Duplicates:
df.drop_duplicates(inplace=True)
Keep First/Last:
df.drop_duplicates(keep='last')
21. How do you normalize and standardize data?
Answer:
Normalization (Min-Max Scaling): Scales features to a range [0, 1].
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
Standardization (Z-score): Centers the data with mean = 0 and std = 1.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_standardized = scaler.fit_transform(X)
Method | Use When… |
Normalization | Data is not Gaussian, bounded features (e.g., pixel values) |
Standardization | Data follows Gaussian distribution or for ML models sensitive to scale (e.g., SVM, KNN, logistic regression) |
22. What is label encoding vs one-hot encoding?
Answer:
Encoding Type | Description | Example |
Label Encoding | Converts categories to integer labels | Red → 0, Green → 1 |
One-Hot Encoding | Creates binary columns for each category | Red → [1, 0, 0] |
Label Encoding (Ordinal/Tree-based Models):
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['encoded'] = le.fit_transform(df['color'])
One-Hot Encoding (Linear/Distance-based Models):
pd.get_dummies(df['color'], prefix='color') # or use OneHotEncoder
23. How do you handle categorical variables with high cardinality?
Answer:
Strategies:
1. Frequency/Count Encoding: Replace categories with their frequency.
df['encoded'] = df['category'].map(df['category'].value_counts())
2. Target Encoding (Mean Encoding): Replace category with mean target value (careful with leakage).
df['encoded'] = df.groupby('category')['target'].transform('mean')
3. Hash Encoding (e.g., CategoryEncoders library): Efficient for large categorical spaces.
from category_encoders import HashingEncoder
encoder = HashingEncoder()
df_encoded = encoder.fit_transform(df)
4. Embedding layers (for deep learning models using PyTorch or TensorFlow)
24. Explain feature scaling and when to use it.
Answer:
Feature scaling brings all features to the same scale so that no variable dominates others.
When to Use:
- Distance-based algorithms: KNN, K-Means
- Gradient-based algorithms: Logistic Regression, Neural Networks
- PCA and SVM (sensitive to feature scale)
Techniques:
- Min-Max Scaling → [0, 1]
- Z-score (Standardization) → mean = 0, std = 1
- Robust Scaling → median-centered (good for outliers)
Example
from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()
X_scaled = scaler.fit_transform(X)
25. What is the difference between .fit(), .transform(), and .fit_transform()?
Answer:
Method | Purpose |
.fit() | Learns the parameters from data (e.g., mean, std) |
.transform() | Applies learned transformation |
.fit_transform() | Combines both steps (faster, cleaner) |
Example
scaler = StandardScaler()
scaler.fit(X) # learns mean and std
X_scaled = scaler.transform(X)
# OR directly
X_scaled = scaler.fit_transform(X)
26. How do you impute missing values using scikit-learn?
Answer:
Use SimpleImputer from sklearn.impute.
Example
from sklearn.impute import SimpleImputer
# For numerical features
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)
# For categorical features
imputer = SimpleImputer(strategy='most_frequent')
X_cat = imputer.fit_transform(X_cat)
Strategies: 'mean', 'median', 'most_frequent', 'constant'
27. How do you treat multicollinearity in features?
Answer:
Multicollinearity = high correlation between independent variables → leads to unstable models.
Detection:
- Correlation matrix
- VIF (Variance Inflation Factor):
Example
from statsmodels.stats.outliers_influence import variance_inflation_factor
vif = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
Treatment:
- Drop one of the correlated features
- Use PCA to reduce dimensions
- Use regularization (Ridge, Lasso)
28. What are binning and discretization?
Answer:
Binning: Group continuous variables into discrete intervals or bins.
Types:
- Equal-width binning:
pd.cut(df['age'], bins=3, labels=['young', 'middle', 'old'])
- Equal-frequency binning (quantile-based):
pd.qcut(df['income'], q=4)
Use Cases:
- Transform skewed numeric features
- Reduce overfitting
- Interpretability
29. How do you encode cyclic variables (like days, months)?
Answer:
Problem: Numeric encoding of cyclic features (e.g., 0 for Jan, 11 for Dec) fails to capture circular nature.
Solution: Use sine and cosine transformation.
import numpy as np
df['month_sin'] = np.sin(2 * np.pi * df['month'] / 12)
df['month_cos'] = np.cos(2 * np.pi * df['month'] / 12)
Why?
- sin/cos encodes direction on a circle
- January (0) is close to December (11)
30. What are the best practices for feature engineering?
Answer:
General Best Practices:
- Understand domain before creating features.
- Remove duplicates, missing values, and outliers.
- Use feature scaling where needed.
- Encode categorical variables appropriately.
- Use binning for skewed distributions.
- Create interaction terms, polynomial features, and log transformations.
- Use statistical tests (ANOVA, Chi-Square) for feature selection.
- Leverage external sources (e.g., geolocation, time).
- Avoid data leakage (no target info in features).
- Always validate with cross-validation or a holdout set.
Tools:
- Feature-engine
- scikit-learn pipelines
- category_encoders
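A minimal pipeline sketch combining several of these steps with scikit-learn (column names here are hypothetical placeholders):
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

numeric_cols = ['age', 'income']   # hypothetical numeric features
categorical_cols = ['city']        # hypothetical categorical feature

numeric_pipe = Pipeline([('impute', SimpleImputer(strategy='median')),
                         ('scale', StandardScaler())])
categorical_pipe = Pipeline([('impute', SimpleImputer(strategy='most_frequent')),
                             ('encode', OneHotEncoder(handle_unknown='ignore'))])

preprocess = ColumnTransformer([('num', numeric_pipe, numeric_cols),
                                ('cat', categorical_pipe, categorical_cols)])

model = Pipeline([('prep', preprocess), ('clf', LogisticRegression())])
# model.fit(X_train, y_train)   # fitting only on training data avoids leakage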
31. What steps do you follow during EDA (Exploratory Data Analysis)?
Answer:
Steps in EDA:
- Understand the dataset
- Load data (pandas.read_csv)
- Check shape, column names, and data types
- Missing value analysis
- df.isnull().sum()
- Visualize with seaborn.heatmap(df.isnull())
- Summary statistics
- df.describe()
- df.info()
- Univariate analysis
- Histograms, boxplots, value counts
- Bivariate/multivariate analysis
- Scatter plots, pairplots, heatmaps, correlation
- Outlier detection
- Boxplots, IQR, Z-score
- Skewness check
- df.skew()
- Feature Engineering
- Creating new features or transforming existing ones
- Encoding & Scaling
- Label encoding, one-hot encoding, normalization
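A minimal end-to-end sketch of these steps (file name and column names are placeholders):
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv('data.csv')        # placeholder file name
print(df.shape, df.dtypes)          # structure
print(df.isnull().sum())            # missing values
print(df.describe())                # summary statistics

sns.histplot(df['price'], kde=True)                           # univariate (placeholder column)
sns.heatmap(df.select_dtypes('number').corr(), annot=True)    # correlations
plt.show()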
32. How do you find correlations between variables?
Answer:
Use the Pearson correlation coefficient (linear), Spearman (non-linear), or Kendall (ordinal).
Example
correlation_matrix = df.corr(method='pearson')
import seaborn as sns
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
Interpretation:
- +1 = perfect positive
- 0 = no correlation
- -1 = perfect negative
Use scipy.stats.pearsonr(x, y) for p-value.
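A small sketch of that call, assuming two numeric columns in df (placeholder names):
from scipy.stats import pearsonr

r, p = pearsonr(df['feature_1'], df['feature_2'])
print(f"r = {r:.2f}, p-value = {p:.4f}")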
33. How do you deal with skewed data distributions?
Answer:
Detection:
df.skew()
sns.histplot(df['feature'], kde=True)
Techniques to handle skewness:
1. Log Transform:
df['feature'] = np.log1p(df['feature'])
2. Box-Cox (strictly positive data) / Yeo-Johnson (handles zero and negative values):
from sklearn.preprocessing import PowerTransformer
pt = PowerTransformer(method='yeo-johnson')
df[['feature']] = pt.fit_transform(df[['feature']])
3. Square root or reciprocal transforms
34. What are the key plots for EDA and when to use each?
Answer:
Plot Type | Use Case | Library |
Histogram | Distribution of numerical variable | seaborn.histplot() |
Boxplot | Outlier detection, spread | sns.boxplot() |
Scatter Plot | Relationship between two numeric vars | sns.scatterplot() |
Bar Plot | Frequency of categorical variables | sns.barplot() |
Pairplot | Pairwise relationship across features | sns.pairplot() |
Heatmap | Correlation matrix | sns.heatmap() |
Violin Plot | Distribution + probability density | sns.violinplot() |
Line Plot | Trend over time | sns.lineplot() |
35. How do you visualize multivariate relationships?
Answer:
Multivariate plots:
1. Pairplot – all pairs of numerical variables
sns.pairplot(df, hue='target')
2. Heatmap – correlation matrix
sns.heatmap(df.corr(), annot=True)
3. 3D Scatter Plot – 3 features
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

ax = plt.axes(projection='3d')
ax.scatter3D(df.x, df.y, df.z)
4. Grouped boxplots/violin plots – Category vs Numeric
36. What is the use of pairplot and heatmap in Seaborn?
Answer:
pairplot()
- Used for visualizing relationships between multiple pairs of numerical features.
- Highlights correlation and distribution patterns.
- Often includes hue for category analysis.
Example
sns.pairplot(df, hue='species')
heatmap()
- Used for displaying correlation matrices or missing values visually.
- Helpful in feature selection and dependency analysis.
Example
sns.heatmap(df.corr(), annot=True)
37. What’s the difference between histogram and bar plot?
Answer:
Feature | Histogram | Bar Plot |
Data Type | Continuous numerical data | Categorical data |
X-axis | Numeric ranges (bins) | Categories |
Purpose | Show distribution | Show frequency or comparison |
Gaps Between Bars | No | Yes |
Example
# Histogram
sns.histplot(df['age'])
# Bar Plot
sns.countplot(x='gender', data=df)
38. How do you choose which variables to keep during EDA?
Answer:
Techniques:
- Low variance removal: Use VarianceThreshold from sklearn
- Correlation Analysis: Drop one of highly correlated pairs (|corr| > 0.85)
- Univariate feature selection: SelectKBest, f_classif, mutual_info_classif
- Model-based Selection: Feature importance from tree models (Random Forest)
- Recursive Feature Elimination (RFE)
Example
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
rfe = RFE(model, n_features_to_select=5)
fit = rfe.fit(X, y)
39. What are the different types of feature distributions and how to interpret them?
Answer:
Distribution Type | Characteristics | Action |
Normal | Bell-shaped, mean ≈ median ≈ mode | Good for parametric models |
Skewed Right | Long tail on right (positive skew) | Log transform or Box-Cox |
Skewed Left | Long tail on left (negative skew) | Square or cube transform |
Bimodal | Two peaks | Segment the data or investigate classes |
Uniform | Equal frequency | Often okay to use directly |
Check with:
sns.histplot(df['feature'], kde=True)
print(df['feature'].skew())
40. What is the role of outliers and how do you detect them?
Answer:
Outliers can distort mean, standard deviation, and model accuracy.
Detection Methods:
1. Boxplot & IQR
Q1 = df['feature'].quantile(0.25)
Q3 = df['feature'].quantile(0.75)
IQR = Q3 - Q1
outliers = df[(df['feature'] < Q1 - 1.5 * IQR) | (df['feature'] > Q3 + 1.5 * IQR)]
2. Z-score
from scipy.stats import zscore
df['zscore'] = zscore(df['feature'])
3. Isolation Forest / LOF (Advanced models)
Treatment:
- Remove
- Cap (winsorization)
- Transform
- Bin or categorize
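For the capping option above, a small sketch using quantile-based clipping (the 1st/99th percentile cutoffs are just a common rule of thumb):
low, high = df['feature'].quantile([0.01, 0.99])
df['feature_capped'] = df['feature'].clip(lower=low, upper=high)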
41. What is the difference between population and sample?
Answer:
Population | Sample |
Entire group of individuals or events | A subset taken from the population |
Has parameters (e.g., μ, σ) | Has statistics (e.g., x̄, s) |
Not always practical to measure | More feasible for analysis |
Example
- Population: All customers of Amazon.
- Sample: 1000 randomly selected Amazon customers for a survey.
42. Explain the Central Limit Theorem with an example.
Answer:
Central Limit Theorem (CLT): The sampling distribution of the sample mean approaches a normal distribution, regardless of the population’s distribution, as the sample size becomes large (typically n ≥ 30).
Key Points:
- Works for independent, identically distributed samples
- Allows us to use normal approximation even for non-normal data
Example
import numpy as np
import matplotlib.pyplot as plt
# Skewed population
population = np.random.exponential(scale=2, size=10000)
# Sampling means
means = [np.mean(np.random.choice(population, 50)) for _ in range(1000)]
plt.hist(means, bins=30)
plt.title("Sampling Distribution Approaches Normal")
plt.show()
43. What is p-value and how is it used in hypothesis testing?
Answer:
p-value is the probability of observing the test statistic or something more extreme assuming the null hypothesis is true.
- Low p-value (≤ 0.05): Reject the null hypothesis
- High p-value (> 0.05): Fail to reject the null hypothesis
Example
If p = 0.01, there is only a 1% chance of observing results at least this extreme if the null hypothesis were true, which is strong evidence against the null hypothesis.
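A small worked example with made-up measurements, using a one-sample t-test:
from scipy.stats import ttest_1samp
import numpy as np

sample = np.array([5.1, 4.9, 5.3, 5.2, 4.8, 5.0])    # made-up data
t_stat, p_value = ttest_1samp(sample, popmean=5.5)   # H0: population mean is 5.5
print(p_value)   # if p <= 0.05, reject H0 at the 5% significance level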
44. What is the difference between Type I and Type II errors?
Answer:
Type I Error (False Positive) | Type II Error (False Negative) |
Rejecting a true null hypothesis | Failing to reject a false null hypothesis |
Probability is α (the significance level) | Probability is β; power = 1 − β |
“Crying wolf” | “Missing a real signal” |
Example
- Type I: Diagnosing disease when not ill.
- Type II: Missing diagnosis when disease is present.
45. What is the difference between confidence interval and prediction interval?
Answer:
Confidence Interval (CI) | Prediction Interval (PI) |
Range for population parameter (e.g., mean) | Range for future individual observation |
Narrower | Wider (includes extra variability) |
Example
- CI: Mean income is $50K ± $2K
- PI: Next person’s income likely falls in $30K–$70K
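A small sketch of a 95% confidence interval for a sample mean, using made-up numbers:
import numpy as np
from scipy import stats

incomes = np.array([48, 52, 50, 47, 55, 49, 51])   # made-up sample (in $K)
mean = incomes.mean()
sem = stats.sem(incomes)                           # standard error of the mean
ci_low, ci_high = stats.t.interval(0.95, df=len(incomes) - 1, loc=mean, scale=sem)
print(ci_low, ci_high)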
46. What are correlation and covariance?
Answer:
Concept | Meaning | Range |
Covariance | How two variables vary together | (−∞ to ∞) |
Correlation | Standardized measure of covariance (unitless) | [−1, 1] |
Formula
- Covariance: cov(X, Y) = Σ((X − X̄)(Y − Ȳ)) / n
- Correlation: corr = cov(X,Y) / (σX * σY)
Example
import numpy as np
x = [1, 2, 3, 4]
y = [2, 4, 6, 8]
print(np.cov(x, y)) # Covariance
print(np.corrcoef(x, y)) # Correlation
47. Explain t-test, chi-square test, and ANOVA.
Answer:
Test | Use Case | Data Type |
t-test | Compare means of 2 groups | Continuous |
Chi-square | Test for independence or goodness-of-fit | Categorical |
ANOVA | Compare means of 3+ groups | Continuous + Categorical |
Example (t-test)
from scipy.stats import ttest_ind
ttest_ind([1, 2, 3], [2, 3, 4])
ANOVA Example
from scipy.stats import f_oneway
f_oneway([1, 2, 3], [2, 4, 6], [3, 5, 7])
48. What is overfitting and underfitting in statistical models?
Answer:
Overfitting | Underfitting |
High training accuracy, poor test accuracy | Poor performance on both train and test |
Model is too complex | Model is too simple |
Low bias, high variance | High bias, low variance |
Example
- Overfitting: Memorizing training data with noise
- Underfitting: Using linear model on non-linear data
49. What are bias and variance in a predictive model?
Answer:
Bias | Variance |
Error from overly simple assumptions | Error from model sensitivity to data |
Leads to underfitting | Leads to overfitting |
High bias = high training error | High variance = high testing error |
Ideal model minimizes both bias and variance.
Visualization (bias-variance tradeoff): as model complexity increases, squared bias falls while variance rises; total error is lowest where the two are balanced.
50. When would you use a non-parametric test?
Answer:
Use non-parametric tests when:
- Data does not follow normal distribution
- Sample size is small
- Data is ordinal or ranked
- You want robustness to outliers
Common Non-parametric Tests:
Test | Use Case |
Mann–Whitney U Test | Compare medians of 2 independent samples |
Wilcoxon Test | Compare paired samples |
Kruskal–Wallis Test | Compare 3+ groups |
Chi-square | Categorical independence |
Example
from scipy.stats import mannwhitneyu
mannwhitneyu([1, 2, 3], [4, 5, 6])
51. What’s the difference between supervised and unsupervised learning?
Answer:
Supervised Learning | Unsupervised Learning |
Data includes input and labeled output | Data has only input (no labeled output) |
Goal: Learn a mapping from input → output | Goal: Find patterns or structures in data |
Examples: Classification, Regression | Examples: Clustering, Dimensionality Reduction |
Example
# Supervised
from sklearn.linear_model import LinearRegression
model = LinearRegression().fit(X_train, y_train)
# Unsupervised
from sklearn.cluster import KMeans
model = KMeans(n_clusters=3).fit(X)
52. Explain the bias-variance tradeoff in machine learning.
Answer:
Bias: Error due to overly simple model assumptions
Variance: Error due to high model sensitivity to training data
High Bias | High Variance |
Underfitting | Overfitting |
High training error | High test error |
Tradeoff Goal: Balance bias and variance for optimal performance on unseen data.
53. How do you evaluate a classification model (precision, recall, F1-score)?
Answer:
Metric | Formula | Meaning |
Precision | TP / (TP + FP) | Correct positive predictions |
Recall | TP / (TP + FN) | Coverage of actual positives |
F1-score | 2 * (Precision * Recall) / (Precision + Recall) | Harmonic mean of P and R |
Example
from sklearn.metrics import precision_score, recall_score, f1_score
precision_score(y_true, y_pred)
recall_score(y_true, y_pred)
f1_score(y_true, y_pred)
54. What is the ROC curve and AUC score?
Answer:
- ROC Curve (Receiver Operating Characteristic):
- Plots True Positive Rate vs False Positive Rate at various thresholds
- AUC (Area Under Curve):
- Measures entire 2D area under the ROC curve (range: 0 to 1)
- Higher AUC = Better model performance
Example
from sklearn.metrics import roc_curve, roc_auc_score
fpr, tpr, thresholds = roc_curve(y_true, y_scores)
auc = roc_auc_score(y_true, y_scores)
55. How does cross-validation help in model evaluation?
Answer:
Cross-validation splits data into multiple folds and rotates the validation set to:
- Reduce overfitting
- Get more robust performance estimates
- Use data more efficiently
Types:
- k-Fold CV
- Stratified k-Fold (for classification)
- Leave-One-Out CV
Example
from sklearn.model_selection import cross_val_score
cross_val_score(model, X, y, cv=5)
56. What is feature importance and how is it determined?
Answer:
Feature importance indicates how much a feature contributes to the model’s prediction.
Methods to determine:
- Tree-based models (e.g., RandomForest, XGBoost)
- Permutation importance
- SHAP (SHapley Additive exPlanations)
Example
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier().fit(X, y)
print(model.feature_importances_)
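Permutation importance (listed above) is also available in scikit-learn; a short sketch reusing the fitted model:
from sklearn.inspection import permutation_importance

result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
print(result.importances_mean)   # average drop in score when each feature is shuffled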
57. How do decision trees and random forests differ?
Answer:
Decision Tree | Random Forest |
Single tree, prone to overfitting | Ensemble of trees, reduces overfitting |
Fast, interpretable | Slower, more accurate |
High variance | Lower variance (bagging + randomness) |
Example
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

tree = DecisionTreeClassifier().fit(X, y)
forest = RandomForestClassifier().fit(X, y)
58. What are the pros and cons of using logistic regression?
Answer:
Pros:
- Simple, interpretable
- Efficient with linearly separable data
- Outputs probabilities
Cons:
- Assumes linear relationship
- Not suitable for complex patterns
- Sensitive to multicollinearity
Use Case: Email spam detection, credit risk prediction
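A minimal sketch, assuming X_train, y_train, and X_test are already defined:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
probs = clf.predict_proba(X_test)[:, 1]   # probability of the positive class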
59. What is the purpose of regularization (L1 vs L2)?
Answer:
Regularization adds a penalty term to the loss function to prevent overfitting.
Type | Penalty Term | Effect |
L1 (Lasso) | λ * Σ|wᵢ| | Shrinks some weights to exactly 0 (feature selection) |
L2 (Ridge) | λ * Σwᵢ² | Shrinks weights smoothly (no 0s) |
Example
from sklearn.linear_model import Lasso, Ridge
Lasso(alpha=0.1).fit(X, y)
Ridge(alpha=0.1).fit(X, y)
60. How do you tune hyperparameters in a model (GridSearchCV, RandomizedSearchCV)?
Answer:
Hyperparameter Tuning: Optimizing model configuration for best performance.
Method | Description |
GridSearchCV | Exhaustive search over given parameter grid |
RandomizedSearchCV | Randomly samples combinations |
Example
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

param_grid = {'n_estimators': [100, 200], 'max_depth': [4, 8]}
grid = GridSearchCV(RandomForestClassifier(), param_grid, cv=5)
grid.fit(X, y)
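For comparison, a RandomizedSearchCV sketch over a similar (illustrative) parameter space:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier

param_dist = {'n_estimators': [100, 200, 500], 'max_depth': [4, 8, None]}
search = RandomizedSearchCV(RandomForestClassifier(), param_dist, n_iter=5, cv=5, random_state=0)
search.fit(X, y)
print(search.best_params_)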