Data Science using Python
Interview Questions with Answers

1. What are Python’s key data structures used in data science?

Answer:

Python offers several built-in and library-provided data structures that are essential in data science and Generative AI:

Data Structure | Description | Usage in Data Science & AI
list | Ordered, mutable collection | Arrays, sequences of data
tuple | Ordered, immutable collection | Coordinates, hashable keys
set | Unordered, unique elements | Removing duplicates, set operations
dict | Key-value pairs | Feature mapping, JSON-like structures
defaultdict | Returns a default value for missing keys | Counting, grouping
Counter | dict subclass for counting hashables | Word/token frequency, n-grams
deque | Double-ended queue | Sliding windows, efficient pops/appends
DataFrame (Pandas) | 2D labeled data structure | Tabular data analysis
ndarray (NumPy) | N-dimensional array | Vectorized math, matrices, tensors
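
A quick sketch of a few of these structures from the collections module (the token list below is made up for illustration):

from collections import Counter, defaultdict, deque

tokens = ["the", "cat", "sat", "on", "the", "mat"]   # toy token list

# Counter: word/token frequency
freq = Counter(tokens)
print(freq.most_common(2))        # [('the', 2), ('cat', 1)]

# defaultdict: grouping without key-existence checks
by_length = defaultdict(list)
for t in tokens:
    by_length[len(t)].append(t)
print(dict(by_length))

# deque: efficient sliding window of size 3
window = deque(maxlen=3)
for t in tokens:
    window.append(t)
print(list(window))               # last three tokens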


2. How do list comprehension and generator expressions differ?

Answer:

Feature | List Comprehension | Generator Expression
Syntax | [x for x in iterable] | (x for x in iterable)
Output | Returns a full list | Returns a generator object
Memory Usage | Stores all elements in memory | Lazy evaluation (memory-efficient)
Performance | Fast for small-to-medium datasets | Ideal for large or infinite datasets
Use Case | Eager evaluation | Stream processing, pipelines

Example

# List comprehension

squares = [x*x for x in range(5)]

# Generator expression

squares_gen = (x*x for x in range(5))

 

3. Explain the difference between is, ==, and in.

Answer:

Operator | Purpose | Example
== | Compares values | 'abc' == 'abc' → True
is | Compares object identity | a is b → True if a and b refer to the same object
in | Membership check | 'a' in 'cat' → True

Example

a = [1, 2]; b = a
print(a == b)       # True (values are equal)
print(a is b)        # True (same object)
print(2 in a)        # True (element exists)

 

4. How is memory managed in Python?

Answer:

Python’s memory management includes:

  1. Automatic memory allocation:
  • A private heap where all Python objects and data structures are stored.
  2. Reference Counting:
  • Every object has a reference count.
  • When the count drops to 0, the object is deallocated.
  3. Garbage Collector:
  • Handles cyclic references using the gc module.
  • Uses generational collection (3 generations: young → old).
  4. Memory Pools:
  • Implemented by the PyMalloc allocator for efficiency.

Example

import gc

gc.collect()         # Triggers garbage collection manually
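
A small illustration of reference counting in CPython (note that sys.getrefcount reports one extra reference for its own temporary argument):

import sys

data = [1, 2, 3]
print(sys.getrefcount(data))   # e.g. 2: the name 'data' plus the argument reference

alias = data                   # another reference to the same list
print(sys.getrefcount(data))   # count increases by 1

del alias                      # dropping the reference decreases the count
print(sys.getrefcount(data))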

 

5. What is the difference between deepcopy and copy?

Answer:

  • copy() (Shallow Copy): Creates a new object but references original nested objects.
  • deepcopy() (Deep Copy): Creates a completely independent clone, including nested objects.

Example

import copy

original = [[1, 2], [3, 4]]
shallow = copy.copy(original)
deep = copy.deepcopy(original)
original[0][0] = 99
print(shallow[0][0])        # 99 (same inner list)
print(deep[0][0])           # 1  (independent inner list)

 

6. What are Python’s *args and **kwargs used for?

Answer:

  • *args: Collects extra positional arguments into a tuple.
  • **kwargs: Collects extra keyword arguments into a dictionary.

Both are used to create flexible functions.

Example

def sample(*args, **kwargs):
    print(args)      # Tuple of values
    print(kwargs)    # Dict of keyword arguments

sample(1, 2, a=3, b=4)

 

7. Explain the difference between @staticmethod and @classmethod.

Answer:

Decorator | @staticmethod | @classmethod
Access to self | No | No (uses cls)
Access to cls | No | Yes (receives the class itself)
Use Case | Utility functions inside a class | Factory methods or methods acting on the class

Example

class MyClass:
    @staticmethod
    def greet():
        return "Hello"

    @classmethod
    def create(cls):
        return cls()

 

8. How do you handle missing data in Python?

Answer:

Using Pandas, typical steps:

import pandas as pd

df = pd.DataFrame({'a': [1, None, 3], 'b': [4, 5, None]})
df.isnull()           # Boolean mask of missing
df.dropna()           # Drop rows with any missing values
df.fillna(0)          # Fill missing with 0
df.fillna(df.mean())  # Fill with column mean

Other techniques

  • Interpolation: df.interpolate()
  • Back/forward fill: df.fillna(method='ffill') or df.ffill()

 

9. What is a lambda function and where is it useful?

Answer:

Lambda function is an anonymous, one-line function.

add = lambda x, y: x + y
print(add(3, 4))                 # Output: 7

Use cases

  • map(), filter(), reduce()
  • Sorting by custom keys: sorted(data, key=lambda x: x[1])

Note: Limited to single expression, no statements or annotations.

 

10. What are Python’s most important built-in libraries for data science?

Answer:

Library | Purpose
NumPy | Numerical computations, n-dimensional arrays
Pandas | Data manipulation and analysis
Matplotlib | 2D plotting
Seaborn | Statistical data visualization
Scikit-learn | ML models, preprocessing, metrics
SciPy | Scientific computing (linear algebra, stats)
Statsmodels | Statistical modeling and tests
TensorFlow, PyTorch | Deep learning frameworks
NLTK, spaCy | Natural language processing
OpenCV | Image processing and computer vision

 

11. What is the difference between a NumPy array and a Python list?

Answer:

Feature | NumPy Array | Python List
Homogeneity | Elements must be of the same type | Can store mixed data types
Memory Efficiency | More efficient (contiguous C blocks) | Less efficient (pointers to objects)
Speed | Much faster due to vectorization | Slower for numerical operations
Broadcasting | Supported | Not supported
Operations | Element-wise arithmetic | Requires loops

Example

import numpy as np

arr = np.array([1, 2, 3])
print(arr * 2)  # [2, 4, 6]
lst = [1, 2, 3]
print([x*2 for x in lst])  # [2, 4, 6]

12. How do you handle missing values in Pandas?

Answer:

Detect:

df.isnull()                     # Boolean DataFrame
df.isnull().sum()               # Count missing per column

Handle:

df.dropna()                     # Drop rows with missing values
df.dropna(axis=1)           # Drop columns with missing values
df.fillna(0)                    # Replace with 0
df.fillna(df.mean())         # Replace with column mean
df.fillna(method='ffill')  # Forward fill
df.interpolate()                # Interpolate missing values



13. Explain the difference between .loc[] and .iloc[].

Answer:

Feature | .loc[] | .iloc[]
Access by | Labels (row/column names) | Integer position (like array indices)
Syntax | df.loc[row_label, col_label] | df.iloc[row_index, col_index]
Supports | Slicing with labels | Slicing with integers

Example

df = pd.DataFrame({'A': [10, 20], 'B': [30, 40]}, index=['x', 'y'])
df.loc['x', 'A']     # 10
df.iloc[0, 0]        # 10

14. How can you apply a function to every row in a Pandas DataFrame?

Answer:

Using apply() with axis=1:

def process_row(row):
    return row['A'] + row['B']

df['sum'] = df.apply(process_row, axis=1)

For vectorized operations, prefer direct column-wise calculations:

df['sum'] = df['A'] + df['B']             # Faster than apply

 

15. What are vectorized operations in NumPy and why are they faster?

Answer:

Vectorized operations apply operations to entire arrays without explicit loops, using optimized C-based backends.

Feature | Loop-Based | Vectorized (NumPy)
Memory Efficient | No | Yes
Speed | Slower (Python loop) | Faster (compiled code)
Syntax | Verbose | Concise

Example

# Vectorized
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
result = a + b                     # Element-wise addition
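
A rough timing sketch comparing a Python-level loop with the vectorized version (exact numbers depend on the machine; the arrays are illustrative):

import timeit
import numpy as np

n = 100_000
a_list, b_list = list(range(n)), list(range(n))
a_arr, b_arr = np.arange(n), np.arange(n)

loop_time = timeit.timeit(lambda: [x + y for x, y in zip(a_list, b_list)], number=100)
vec_time = timeit.timeit(lambda: a_arr + b_arr, number=100)
print(f"loop: {loop_time:.3f}s  vectorized: {vec_time:.3f}s")   # vectorized is typically much faster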

16. How do you merge, join, and concatenate datasets in Pandas?

Answer:

Concatenate:

pd.concat([df1, df2], axis=0)     # Stack rows
pd.concat([df1, df2], axis=1)     # Stack columns

Merge (SQL-style joins):

pd.merge(df1, df2, on='key', how='inner')   # or 'left', 'right', 'outer'

Join (index-based):

df1.join(df2, how='left')                   # df2 must have index to join

 

17. What is the difference between groupby() and pivot_table()?

Answer:

Feature | groupby() | pivot_table()
Aggregation | Yes (must use .agg(), .sum(), etc.) | Yes (default aggfunc='mean')
Multi-level index | Yes | Returns a reshaped DataFrame (index/columns grid)
Use Case | Group data and apply functions | Reshape and summarize (like an Excel pivot)

Example

df.groupby('category')['sales'].sum()
df.pivot_table(values='sales', index='category', columns='region', aggfunc='sum')

18. How do you deal with outliers in a dataset?

Answer:

Detection Techniques:

  • IQR method:

Q1 = df['col'].quantile(0.25)
Q3 = df['col'].quantile(0.75)
IQR = Q3 - Q1
outliers = df[(df['col'] < Q1 - 1.5*IQR) | (df['col'] > Q3 + 1.5*IQR)]

  • Z-score method:

from scipy.stats import zscore
df['z'] = zscore(df['col'])
outliers = df[df['z'].abs() > 3]

Handling Strategies:

  • Remove
  • Replace with median or capped value (winsorization)
  • Use robust models (e.g., decision trees)

 

19. What are the advantages of using apply() vs a loop in Pandas?

Answer:

Feature | apply() | Python loop
Speed | Usually faster than an explicit loop, though still row-wise Python (not truly vectorized) | Slower (interpreted, row-by-row)
Syntax | Cleaner and more Pythonic | Verbose
Flexibility | High (can use custom functions) | High but inefficient

Example

df['squared'] = df['col'].apply(lambda x: x**2)  # Faster than for-loop

20. How do you detect and handle duplicate records?

Answer:

Detect Duplicates:

df.duplicated()                         # Returns boolean Series
df[df.duplicated()]                     # Get duplicate rows
df.duplicated(subset=['col'])     # Check specific column

Remove Duplicates:

df.drop_duplicates(inplace=True)

Keep First/Last:

df.drop_duplicates(keep='last')


21. How do you normalize and standardize data?

Answer:

Normalization (Min-Max Scaling): Scales features to a range [0, 1].

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

 Standardization (Z-score): Centers the data with mean = 0 and std = 1.

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_standardized = scaler.fit_transform(X)

Method | Use When
Normalization | Data is not Gaussian; features are bounded (e.g., pixel values)
Standardization | Data is roughly Gaussian, or the ML model is sensitive to scale (e.g., SVM, KNN, logistic regression)

22. What is label encoding vs one-hot encoding?

Answer:

Encoding Type | Description | Example
Label Encoding | Converts categories to integer labels | Red → 0, Green → 1
One-Hot Encoding | Creates a binary column for each category | Red → [1, 0, 0]

 Label Encoding (Ordinal/Tree-based Models):

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['encoded'] = le.fit_transform(df['color']) 

One-Hot Encoding (Linear/Distance-based Models):

pd.get_dummies(df['color'], prefix='color')        # or use OneHotEncoder

23. How do you handle categorical variables with high cardinality?

Answer:

Strategies:

  1. Frequency/Count Encoding: Replace categories with their frequency.

df['encoded'] = df['category'].map(df['category'].value_counts())

  2. Target Encoding (Mean Encoding): Replace each category with the mean target value (beware of target leakage).

df['encoded'] = df.groupby('category')['target'].transform('mean')

  3. Hash Encoding (e.g., the category_encoders library): Efficient for large categorical spaces.

from category_encoders import HashingEncoder
encoder = HashingEncoder()
df_encoded = encoder.fit_transform(df)

  4. Embedding Layers (for deep learning models using PyTorch or TensorFlow)

 

24. Explain feature scaling and when to use it.

Answer:

Feature scaling brings all features to the same scale so that no variable dominates others. 

When to Use:

  • Distance-based algorithms: KNN, K-Means
  • Gradient-based algorithms: Logistic Regression, Neural Networks
  • PCA and SVM (sensitive to feature scale) 

Techniques:

  • Min-Max Scaling → [0, 1]
  • Z-score (Standardization) → mean = 0, std = 1
  • Robust Scaling → median-centered (good for outliers)

Example

from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()
X_scaled = scaler.fit_transform(X)

25. What is the difference between .fit(), .transform(), and .fit_transform()?

Answer:

Method | Purpose
.fit() | Learns the parameters from the data (e.g., mean, std)
.transform() | Applies the learned transformation
.fit_transform() | Combines both steps (faster, cleaner)

Example

scaler = StandardScaler()
scaler.fit(X)                          # learns mean and std
X_scaled = scaler.transform(X)

# OR directly
X_scaled = scaler.fit_transform(X)

 

26. How do you impute missing values using scikit-learn?

Answer:

Use SimpleImputer from sklearn.impute.

Example

from sklearn.impute import SimpleImputer

# For numerical features

imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)

# For categorical features

imputer = SimpleImputer(strategy='most_frequent')
X_cat = imputer.fit_transform(X_cat)

Strategies: 'mean', 'median', 'most_frequent', 'constant'

27. How do you treat multicollinearity in features?

Answer:

Multicollinearity = high correlation between independent variables → leads to unstable models. 

Detection:

  • Correlation matrix
  • VIF (Variance Inflation Factor):

Example

from statsmodels.stats.outliers_influence import variance_inflation_factor
vif = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])] 

Treatment:

  • Drop one of the correlated features
  • Use PCA to reduce dimensions
  • Use regularization (Ridge, Lasso)

28. What are binning and discretization?

Answer:

Binning: Group continuous variables into discrete intervals or bins. 

Types:

  • Equal-width binning:
pd.cut(df['age'], bins=3, labels=['young', 'middle', 'old'])
  • Equal-frequency binning (quantile-based):
pd.qcut(df['income'], q=4)

Use Cases:

  • Transform skewed numeric features
  • Reduce overfitting
  • Interpretability

29. How do you encode cyclic variables (like days, months)?

Answer:

Problem: Numeric encoding of cyclic features (e.g., 0 for Jan, 11 for Dec) fails to capture circular nature.

Solution: Use sine and cosine transformation.

import numpy as np

df['month_sin'] = np.sin(2 * np.pi * df['month']/12)
df['month_cos'] = np.cos(2 * np.pi * df['month']/12)

Why?

  • sin/cos encodes direction on a circle
  • January (0) is close to December (11)

30. What are the best practices for feature engineering?

Answer:

General Best Practices:

  1. Understand domain before creating features.
  2. Remove duplicates, missing values, and outliers.
  3. Use feature scaling where needed.
  4. Encode categorical variables appropriately.
  5. Use binning for skewed distributions.
  6. Create interaction terms, polynomial features, and log transformations.
  7. Use statistical tests (ANOVA, Chi-Square) for feature selection.
  8. Leverage external sources (e.g., geolocation, time).
  9. Avoid data leakage (no target info in features).
  10. Always validate with cross-validation or a holdout set. 

Tools:

  • Feature-engine
  • scikit-learn pipelines
  • category_encoders
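
A minimal scikit-learn Pipeline sketch tying several of these practices together (the column names and estimator choices are illustrative assumptions):

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_cols = ['age', 'income']          # hypothetical numeric features
categorical_cols = ['city']               # hypothetical categorical feature

preprocess = ColumnTransformer([
    ('num', Pipeline([('impute', SimpleImputer(strategy='median')),
                      ('scale', StandardScaler())]), numeric_cols),
    ('cat', Pipeline([('impute', SimpleImputer(strategy='most_frequent')),
                      ('encode', OneHotEncoder(handle_unknown='ignore'))]), categorical_cols),
])

model = Pipeline([('prep', preprocess), ('clf', LogisticRegression())])
# model.fit(X_train, y_train)  # fit on training data only, to avoid leakage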

31. What steps do you follow during EDA (Exploratory Data Analysis)?

Answer:

Steps in EDA:

  1. Understand the dataset
    • Load data (pandas.read_csv)
    • Check shape, column names, and data types
  2. Missing value analysis
    • df.isnull().sum()
    • Visualize with seaborn.heatmap(df.isnull())
  3. Summary statistics
    • df.describe()
    • df.info()
  4. Univariate analysis
    • Histograms, boxplots, value counts
  5. Bivariate/multivariate analysis
    • Scatter plots, pairplots, heatmaps, correlation
  6. Outlier detection
    • Boxplots, IQR, Z-score
  7. Skewness check
    • df.skew()
  8. Feature Engineering
    • Creating new features or transforming existing ones
  9. Encoding & Scaling
    • Label encoding, one-hot encoding, normalization
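
A condensed sketch of these steps on a tiny made-up DataFrame (a real EDA would also include the plots listed above):

import pandas as pd

df = pd.DataFrame({'age': [25, 32, None, 41, 29],
                   'income': [40_000, 52_000, 61_000, None, 45_000],
                   'gender': ['M', 'F', 'F', 'M', 'F']})

print(df.shape, df.dtypes)              # structure and data types
print(df.isnull().sum())                # missing values per column
print(df.describe())                    # summary statistics
print(df['gender'].value_counts())      # univariate analysis of a categorical column
print(df.skew(numeric_only=True))       # skewness check
print(df.corr(numeric_only=True))       # bivariate (correlation) analysis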

32. How do you find correlations between variables?

Answer:

Use the Pearson correlation coefficient (linear), Spearman (non-linear), or Kendall (ordinal). 

Example

correlation_matrix = df.corr(method='pearson')

import seaborn as sns

sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm') 

Interpretation:

  • +1 = perfect positive
  • 0 = no correlation
  • -1 = perfect negative

Use scipy.stats.pearsonr(x, y) for p-value.

 

33. How do you deal with skewed data distributions?

Answer:

Detection:

df.skew()
sns.histplot(df['feature'], kde=True)

Techniques to handle skewness:

  1. Log Transform:

df['feature'] = np.log1p(df['feature'])

  2. Box-Cox / Yeo-Johnson Transform (Box-Cox needs strictly positive data; Yeo-Johnson also handles zero and negative values):

from sklearn.preprocessing import PowerTransformer
pt = PowerTransformer(method='yeo-johnson')
df[['feature']] = pt.fit_transform(df[['feature']])

  3. Square root or reciprocal transforms

 

34. What are the key plots for EDA and when to use each?

Answer:

Plot Type | Use Case | Function (Seaborn)
Histogram | Distribution of a numerical variable | sns.histplot()
Boxplot | Outlier detection, spread | sns.boxplot()
Scatter Plot | Relationship between two numeric variables | sns.scatterplot()
Bar Plot | Frequency of categorical variables | sns.barplot()
Pairplot | Pairwise relationships across features | sns.pairplot()
Heatmap | Correlation matrix | sns.heatmap()
Violin Plot | Distribution + probability density | sns.violinplot()
Line Plot | Trend over time | sns.lineplot()
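
A brief sketch of two of these plots on a small made-up DataFrame (assumes seaborn and matplotlib are installed):

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame({'age': [22, 25, 31, 35, 41, 47, 52],
                   'segment': ['A', 'A', 'B', 'B', 'B', 'A', 'B']})

fig, axes = plt.subplots(1, 2, figsize=(8, 3))
sns.histplot(df['age'], ax=axes[0])                     # distribution of a numeric variable
sns.boxplot(x='segment', y='age', data=df, ax=axes[1])  # spread / outliers per category
plt.tight_layout()
plt.show()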

 

35. How do you visualize multivariate relationships?

Answer:

Multivariate plots:

  1. Pairplot – all pairs of numerical variables

sns.pairplot(df, hue='target')

  2. Heatmap – correlation matrix

sns.heatmap(df.corr(), annot=True)

  3. 3D Scatter Plot – three features

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

ax = plt.axes(projection='3d')
ax.scatter3D(df.x, df.y, df.z)

  4. Grouped boxplots/violin plots – category vs. numeric

 

36. What is the use of pairplot and heatmap in Seaborn?

Answer:

pairplot()

  • Used for visualizing relationships between multiple pairs of numerical features.
  • Highlights correlation and distribution patterns.
  • Often includes hue for category analysis.

Example

sns.pairplot(df, hue='species')

heatmap()

  • Used for displaying correlation matrices or missing values visually.
  • Helpful in feature selection and dependency analysis.

Example

sns.heatmap(df.corr(), annot=True)

37. What’s the difference between histogram and bar plot?

Answer:

Feature | Histogram | Bar Plot
Data Type | Continuous numerical data | Categorical data
X-axis | Numeric ranges (bins) | Categories
Purpose | Show distribution | Show frequency or comparison
Gaps Between Bars | No | Yes

Example

# Histogram

sns.histplot(df['age'])

# Bar Plot

sns.countplot(x='gender', data=df)

38. How do you choose which variables to keep during EDA?

Answer:

Techniques:

  1. Low variance removal: Use VarianceThreshold from sklearn
  2. Correlation Analysis: Drop one of highly correlated pairs (|corr| > 0.85)
  3. Univariate feature selection: SelectKBest, f_classif, mutual_info_classif
  4. Model-based Selection: Feature importance from tree models (Random Forest)
  5. Recursive Feature Elimination (RFE)

Example

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
rfe = RFE(model, n_features_to_select=5)
fit = rfe.fit(X, y)

39. What are the different types of feature distributions and how to interpret them?

Answer:

Distribution Type | Characteristics | Action
Normal | Bell-shaped, mean ≈ median ≈ mode | Good for parametric models
Skewed Right | Long tail on the right (positive skew) | Log transform or Box-Cox
Skewed Left | Long tail on the left (negative skew) | Square or cube transform
Bimodal | Two peaks | Segment the data or investigate classes
Uniform | Roughly equal frequency | Often fine to use directly

 Check with:

sns.histplot(df['feature'], kde=True)
print(df['feature'].skew())

40. What is the role of outliers and how do you detect them?

Answer:

Outliers can distort mean, standard deviation, and model accuracy.

Detection Methods:

  1. Boxplot & IQR

Q1 = df['feature'].quantile(0.25)
Q3 = df['feature'].quantile(0.75)
IQR = Q3 - Q1
outliers = df[(df['feature'] < Q1 - 1.5 * IQR) | (df['feature'] > Q3 + 1.5 * IQR)]

  2. Z-score

from scipy.stats import zscore
df['zscore'] = zscore(df['feature'])

  3. Isolation Forest / LOF (advanced models)

Treatment:

  • Remove
  • Cap (winsorization)
  • Transform
  • Bin or categorize


41. What is the difference between population and sample?

Answer:

Population | Sample
Entire group of individuals or events | A subset taken from the population
Has parameters (e.g., μ, σ) | Has statistics (e.g., x̄, s)
Not always practical to measure | More feasible for analysis

Example

  • Population: All customers of Amazon.
  • Sample: 1000 randomly selected Amazon customers for a survey.
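
A quick numeric illustration with a simulated population (synthetic numbers):

import numpy as np

rng = np.random.default_rng(0)
population = rng.normal(loc=50, scale=10, size=100_000)    # the whole population
sample = rng.choice(population, size=1000, replace=False)  # a random sample

print(population.mean())   # parameter (μ)
print(sample.mean())       # statistic (x̄), an estimate of μ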

42. Explain the Central Limit Theorem with an example.

Answer:

Central Limit Theorem (CLT): The sampling distribution of the sample mean approaches a normal distribution, regardless of the population’s distribution, as the sample size becomes large (typically n ≥ 30). 

Key Points:

  • Works for independent, identically distributed samples
  • Allows us to use normal approximation even for non-normal data

Example

import numpy as np

import matplotlib.pyplot as plt

# Skewed population

population = np.random.exponential(scale=2, size=10000)

# Sampling means

means = [np.mean(np.random.choice(population, 50)) for _ in range(1000)]
plt.hist(means, bins=30)
plt.title("Sampling Distribution Approaches Normal")
plt.show()

43. What is p-value and how is it used in hypothesis testing?

Answer:

p-value is the probability of observing the test statistic or something more extreme assuming the null hypothesis is true.

  • Low p-value (≤ 0.05): Reject the null hypothesis
  • High p-value (> 0.05): Fail to reject the null hypothesis 

Example
If p = 0.01, then, assuming the null hypothesis is true, there is only a 1% chance of observing results at least this extreme; this is strong evidence against the null hypothesis.
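
A small one-sample t-test sketch, where the p-value decides whether to reject the null hypothesis that the true mean is 50 (the sample values are made up):

from scipy.stats import ttest_1samp

sample = [52, 55, 49, 61, 58, 47, 53, 60]
stat, p = ttest_1samp(sample, popmean=50)

print(p)
if p <= 0.05:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")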

44. What is the difference between Type I and Type II errors?

Answer:

Type I Error (False Positive) | Type II Error (False Negative)
Rejecting a true null hypothesis | Failing to reject a false null hypothesis
Controlled by α (significance level) | Probability is β (power = 1 − β)
"Crying wolf" | "Missing a real signal"

Example

  • Type I: Diagnosing disease when not ill.
  • Type II: Missing diagnosis when disease is present.

45. What is the difference between confidence interval and prediction interval?

Answer:

Confidence Interval (CI) | Prediction Interval (PI)
Range for a population parameter (e.g., mean) | Range for a future individual observation
Narrower | Wider (includes extra variability)

Example

  • CI: Mean income is $50K ± $2K
  • PI: Next person’s income likely falls in $30K–$70K
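
A rough sketch contrasting the two intervals under a normality assumption (synthetic data; 1.96 is the approximate 95% z-value):

import numpy as np

rng = np.random.default_rng(1)
incomes = rng.normal(50_000, 10_000, size=200)

mean, std, n = incomes.mean(), incomes.std(ddof=1), len(incomes)

ci = (mean - 1.96 * std / np.sqrt(n), mean + 1.96 * std / np.sqrt(n))                   # CI for the mean
pi = (mean - 1.96 * std * np.sqrt(1 + 1/n), mean + 1.96 * std * np.sqrt(1 + 1/n))       # PI for one new observation

print(ci)   # narrow: where the population mean likely lies
print(pi)   # wide: where a single new observation likely falls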

46. What are correlation and covariance?

Answer:

Concept | Meaning | Range
Covariance | How two variables vary together | (−∞, ∞)
Correlation | Standardized measure of covariance (unitless) | [−1, 1]

Formula

  • Covariance: cov(X, Y) = Σ((X − X̄)(Y − Ȳ)) / n
  • Correlation: corr = cov(X,Y) / (σX * σY)

Example

import numpy as np
x = [1, 2, 3, 4]
y = [2, 4, 6, 8]
print(np.cov(x, y))               # Covariance
print(np.corrcoef(x, y))          # Correlation

47. Explain t-test, chi-square test, and ANOVA.

Answer:

Test | Use Case | Data Type
t-test | Compare means of 2 groups | Continuous
Chi-square | Test for independence or goodness-of-fit | Categorical
ANOVA | Compare means of 3+ groups | Continuous outcome, categorical groups

Example (t-test)

from scipy.stats import ttest_ind
ttest_ind([1,2,3], [2,3,4])

ANOVA Example

from scipy.stats import f_oneway
f_oneway([1,2,3], [2,4,6], [3,5,7])
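
Chi-square Example (using a made-up 2×2 contingency table):

from scipy.stats import chi2_contingency

table = [[30, 10],    # e.g. group A: outcome yes / no
         [20, 40]]    # e.g. group B: outcome yes / no
chi2, p, dof, expected = chi2_contingency(table)
print(chi2, p)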

48. What is overfitting and underfitting in statistical models?

Answer:

Overfitting | Underfitting
High training accuracy, poor test accuracy | Poor performance on both train and test
Model is too complex | Model is too simple
Low bias, high variance | High bias, low variance

Example

  • Overfitting: Memorizing training data with noise
  • Underfitting: Using linear model on non-linear data

49. What are bias and variance in a predictive model?

Answer:

Bias | Variance
Error from overly simple assumptions | Error from model sensitivity to the training data
Leads to underfitting | Leads to overfitting
High bias = high training error | High variance = high test error

Ideal model minimizes both bias and variance.

Bias-Variance Tradeoff: total error ≈ bias² + variance + irreducible error. As model complexity grows, bias falls while variance rises, so total error follows a U-shaped curve; the best model sits near its minimum.

50. When would you use a non-parametric test?

Answer:

Use non-parametric tests when:

  • Data does not follow normal distribution
  • Sample size is small
  • Data is ordinal or ranked
  • You want robustness to outliers

Common Non-parametric Tests:

Test | Use Case
Mann–Whitney U Test | Compare medians of 2 independent samples
Wilcoxon Test | Compare paired samples
Kruskal–Wallis Test | Compare 3+ groups
Chi-square | Categorical independence

Example

from scipy.stats import mannwhitneyu
mannwhitneyu([1, 2, 3], [4, 5, 6])

51. What’s the difference between supervised and unsupervised learning?

Answer:

Supervised Learning | Unsupervised Learning
Data includes inputs and labeled outputs | Data has only inputs (no labeled output)
Goal: learn a mapping from input → output | Goal: find patterns or structure in the data
Examples: Classification, Regression | Examples: Clustering, Dimensionality Reduction

 Example

# Supervised

from sklearn.linear_model import LinearRegression
model = LinearRegression().fit(X_train, y_train)

# Unsupervised

from sklearn.cluster import KMeans
model = KMeans(n_clusters=3).fit(X)
 

52. Explain the bias-variance tradeoff in machine learning.

Answer:

Bias: Error due to overly simple model assumptions

Variance: Error due to high model sensitivity to training data

High Bias | High Variance
Underfitting | Overfitting
High training error | High test error

 Tradeoff Goal: Balance bias and variance for optimal performance on unseen data.

53. How do you evaluate a classification model (precision, recall, F1-score)?

Answer:

Metric | Formula | Meaning
Precision | TP / (TP + FP) | Accuracy of positive predictions
Recall | TP / (TP + FN) | Coverage of actual positives
F1-score | 2 * (Precision * Recall) / (Precision + Recall) | Harmonic mean of precision and recall

 Example

from sklearn.metrics import precision_score, recall_score, f1_score
precision_score(y_true, y_pred)
recall_score(y_true, y_pred)
f1_score(y_true, y_pred)

54. What is the ROC curve and AUC score?

Answer:

  • ROC Curve (Receiver Operating Characteristic):
    • Plots True Positive Rate vs False Positive Rate at various thresholds
  • AUC (Area Under Curve):
    • Measures entire 2D area under the ROC curve (range: 0 to 1)
    • Higher AUC = Better model performance

Example

from sklearn.metrics import roc_curve, roc_auc_score
fpr, tpr, thresholds = roc_curve(y_true, y_scores)
auc = roc_auc_score(y_true, y_scores)

55. How does cross-validation help in model evaluation?

Answer:

Cross-validation splits data into multiple folds and rotates the validation set to:

  • Reduce overfitting
  • Get more robust performance estimates
  • Use data more efficiently

Types:

  • k-Fold CV
  • Stratified k-Fold (for classification)
  • Leave-One-Out CV 

Example

from sklearn.model_selection import cross_val_score
cross_val_score(model, X, y, cv=5)

56. What is feature importance and how is it determined?

Answer:

Feature importance indicates how much a feature contributes to the model’s prediction.

Methods to determine:

  • Tree-based models (e.g., RandomForest, XGBoost)
  • Permutation importance
  • SHAP (SHapley Additive exPlanations) 

Example

from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier().fit(X, y)
print(model.feature_importances_)
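
A sketch of permutation importance, one of the methods listed above (make_classification is used only to keep the example self-contained):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
print(result.importances_mean)   # drop in score when each feature is shuffled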

57. How do decision trees and random forests differ?

Answer:

Decision Tree | Random Forest
Single tree, prone to overfitting | Ensemble of trees, reduces overfitting
Fast, interpretable | Slower, usually more accurate
High variance | Lower variance (bagging + feature randomness)

Example

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
tree = DecisionTreeClassifier().fit(X, y)
forest = RandomForestClassifier().fit(X, y)

58. What are the pros and cons of using logistic regression?

Answer:

Pros:

  • Simple, interpretable
  • Efficient with linearly separable data
  • Outputs probabilities

Cons:

  • Assumes linear relationship
  • Not suitable for complex patterns
  • Sensitive to multicollinearity

Use Case: Email spam detection, credit risk prediction
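
A minimal sketch on synthetic data illustrating the probability outputs mentioned above:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)
print(clf.predict_proba(X_test[:3]))   # class probabilities for the first 3 test rows
print(clf.score(X_test, y_test))       # accuracy on the test split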

59. What is the purpose of regularization (L1 vs L2)?

Answer:

Regularization adds a penalty term to the loss function to prevent overfitting.

Type | Penalty Term | Effect
L1 (Lasso) | λ * Σ|wᵢ| | Can shrink some weights to exactly 0 (feature selection)
L2 (Ridge) | λ * Σwᵢ² | Shrinks weights smoothly (no exact 0s)

Example

from sklearn.linear_model import Lasso, Ridge
Lasso(alpha=0.1).fit(X, y)
Ridge(alpha=0.1).fit(X, y)

60. How do you tune hyperparameters in a model (GridSearchCV, RandomizedSearchCV)?

Answer:

Hyperparameter Tuning: Optimizing model configuration for best performance.

Method | Description
GridSearchCV | Exhaustive search over a given parameter grid
RandomizedSearchCV | Randomly samples a fixed number of parameter combinations

Example

from sklearn.model_selection import GridSearchCV
param_grid = {'n_estimators': [100, 200], 'max_depth': [4, 8]}
grid = GridSearchCV(RandomForestClassifier(), param_grid, cv=5)
grid.fit(X, y)
Note: The interview questions and answers provided on this page have been thoughtfully compiled by our academic team. However, as the content is manually created, there may be occasional errors or omissions. If you have any questions or identify any inaccuracies, please contact us at team@learn2earnlabs.com. We appreciate your feedback and strive for continuous improvement.