Data Science using Python
Interview Questions with Answers

1. What are Python’s key data structures used in data science?
Answer:
Python offers several built-in and library-provided data structures that are essential in data science and Generative AI:
Data Structure | Description | Usage in Data Science & AI |
---|---|---|
list | Ordered, mutable collection | Arrays, sequences of data |
tuple | Ordered, immutable collection | Coordinates, hashable keys |
set | Unordered, unique elements | Removing duplicates, set operations |
dict | Key-value pairs | Feature mapping, JSON-like structures |
defaultdict | Returns default value for missing keys | Counting, grouping |
Counter | Subclass of dict for counting hashables | Word/token frequency, n-grams |
deque | Double-ended queue | Sliding windows, efficient pops/appends |
DataFrame (Pandas) | 2D labeled data structure | Tabular data analysis |
ndarray (NumPy) | N-dimensional array | Vectorized math, matrices, tensors |
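A quick sketch of a few of the collections-module structures above (printed values shown as comments):
from collections import Counter, defaultdict, deque

tokens = ["a", "b", "a", "c", "a"]
print(Counter(tokens))             # Counter({'a': 3, 'b': 1, 'c': 1}), i.e. token frequencies

groups = defaultdict(list)
groups["even"].append(2)           # a missing key silently gets an empty list

window = deque([1, 2, 3], maxlen=3)
window.append(4)                   # oldest element is dropped: deque([2, 3, 4], maxlen=3)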
2. How do list comprehension and generator expressions differ?
Answer:
Feature | List Comprehension | Generator Expression |
Syntax | [x for x in iterable] | (x for x in iterable) |
Output | Returns full list | Returns generator object |
Memory Usage | Stores all elements in memory | Lazy evaluation (memory-efficient) |
Performance | Fast for small-to-medium datasets | Ideal for large or infinite datasets |
Use Case | Eager evaluation | Stream processing, pipelines |
Example
# List comprehension
squares = [x*x for x in range(5)]
# Generator expression
squares_gen = (x*x for x in range(5))
3. Explain the difference between is, ==, and in.
Answer:
Operator | Purpose | Example |
== | Compares values | 'abc' == 'abc' → True |
is | Compares object identities | a is b → True if a and b refer to the same object |
in | Membership check | 'a' in 'cat' → True |
Example
a = [1, 2]; b = a
print(a == b) # True (values are equal)
print(a is b) # True (same object)
print(2 in a) # True (element exists)
4. How is memory managed in Python?
Answer:
Python’s memory management includes:
- Automatic memory allocation using:
- Private heap memory where all Python objects and data structures are stored.
- Reference Counting:
- Every object has a reference count.
- When count drops to 0, object is deallocated.
- Garbage Collector:
- Handles cyclic references using gc module.
- Uses generational collection (3 generations: young → old).
- Memory Pools:
- Implemented by the PyMalloc allocator for efficiency.
Example
import gc
gc.collect() # Triggers garbage collection manually
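For illustration, reference counts can be inspected with sys.getrefcount (exact numbers are interpreter-dependent, since the call itself adds a temporary reference):
import sys

data = []
print(sys.getrefcount(data))   # e.g. 2: the variable plus the temporary argument reference
alias = data
print(sys.getrefcount(data))   # e.g. 3: one more reference now exists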
5. What is the difference between deepcopy and copy?
Answer:
- copy() (Shallow Copy): Creates a new object but references original nested objects.
- deepcopy() (Deep Copy): Creates a completely independent clone, including nested objects.
Example
import copy
original = [[1, 2], [3, 4]]
shallow = copy.copy(original)
deep = copy.deepcopy(original)
original[0][0] = 99
print(shallow[0][0]) # 99 (same inner list)
print(deep[0][0]) # 1 (independent inner list)
6. What are Python’s *args and **kwargs used for?
Answer:
- *args: Collects extra positional arguments into a tuple.
- **kwargs: Collects extra keyword arguments into a dictionary.
Both are used to create flexible functions.
Example
def sample(*args, **kwargs):
    print(args)    # Tuple of positional values
    print(kwargs)  # Dict of keyword arguments
sample(1, 2, a=3, b=4)
7. Explain the difference between @staticmethod and @classmethod.
Answer:
Decorator | @staticmethod | @classmethod |
Access to self | No | No (uses cls) |
Access to cls | No | Yes (class itself) |
Use Case | Utility functions inside a class | Factory methods or methods acting on class |
Example
class MyClass:
    @staticmethod
    def greet():
        return "Hello"

    @classmethod
    def create(cls):
        return cls()
8. How do you handle missing data in Python?
Answer:
Using Pandas, typical steps:
import pandas as pd
df = pd.DataFrame({'a': [1, None, 3], 'b': [4, 5, None]})
df.isnull()           # Boolean mask of missing values
df.dropna()           # Drop rows with any missing values
df.fillna(0)          # Fill missing with 0
df.fillna(df.mean())  # Fill with column mean
Other techniques
- Interpolation: df.interpolate()
- Back/forward fill: df.fillna(method='ffill')
9. What is a lambda function and where is it useful?
Answer:
Lambda function is an anonymous, one-line function.
add = lambda x, y: x + y
print(add(3, 4)) # Output: 7
Use cases
- map(), filter(), reduce()
- Sorting by custom keys:
- sorted(data, key=lambda x: x[1])
Note: Limited to single expression, no statements or annotations.
10. What are Python’s most important built-in libraries for data science?
Answer:
Library | Purpose |
NumPy | Numerical computations, n-dimensional arrays |
Pandas | Data manipulation and analysis |
Matplotlib | 2D plotting |
Seaborn | Statistical data visualization |
Scikit-learn | ML models, preprocessing, metrics |
SciPy | Scientific computing (linear algebra, stats) |
Statsmodels | Statistical modeling and tests |
TensorFlow, PyTorch | Deep learning frameworks |
NLTK, spaCy | Natural language processing |
OpenCV | Image processing and computer vision |
11. What is the difference between a NumPy array and a Python list?
Answer:
Feature | NumPy Array | Python List |
Homogeneity | Elements must be of the same type | Can store mixed data types |
Memory Efficiency | More efficient (C-contiguous blocks) | Less efficient (pointers to objects) |
Speed | Much faster due to vectorization | Slower for numerical operations |
Broadcasting | Supported | Not supported |
Operations | Element-wise arithmetic | Requires loops |
Example
import numpy as np
arr = np.array([1, 2, 3])
print(arr * 2)              # [2 4 6]
lst = [1, 2, 3]
print([x*2 for x in lst])   # [2, 4, 6]
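Broadcasting (mentioned in the table) lets arrays of compatible shapes combine without explicit loops; a minimal sketch:
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])
row_means = matrix.mean(axis=1, keepdims=True)   # shape (2, 1)
centered = matrix - row_means                    # (2, 3) minus (2, 1) via broadcasting
print(centered)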
12. How do you handle missing values in Pandas?
Answer:
Detect:
df.isnull() # Boolean DataFrame
df.isnull().sum() # Count missing per column
Handle:
df.dropna() # Drop rows with missing values
df.dropna(axis=1) # Drop columns with missing values
df.fillna(0) # Replace with 0
df.fillna(df.mean()) # Replace with column mean
df.fillna(method='ffill') # Forward fill
df.interpolate() # Interpolate missing values
13. Explain the difference between .loc[] and .iloc[].
Answer:
Feature | .loc[] | .iloc[] |
Access by | Labels (row/column names) | Integer position (like array indices) |
Syntax | df.loc[row_label, col_label] | df.iloc[row_index, col_index] |
Supports | Slicing with labels | Slicing with integers |
Example
df = pd.DataFrame({'A': [10, 20], 'B': [30, 40]}, index=['x', 'y'])
df.loc['x', 'A']   # 10
df.iloc[0, 0]      # 10
14. How can you apply a function to every row in a Pandas DataFrame?
Answer:
Using apply() with axis=1:
def process_row(row):
return row['A'] + row['B']
df['sum'] = df.apply(process_row, axis=1)
For vectorized operations, prefer direct column-wise calculations:
df['sum'] = df['A'] + df['B'] # Faster than apply
15. What are vectorized operations in NumPy and why are they faster?
Answer:
Feature | Loop-Based | Vectorized (NumPy) |
Memory Efficient | No | Yes |
Speed | Slower (Python loop) | Faster (compiled code) |
Syntax | Verbose | Concise |
Example
# Vectorized
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
result = a + b   # Element-wise addition
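A rough timing comparison of the two approaches (absolute numbers depend on the machine, but the vectorized version is typically much faster):
import time
import numpy as np

a = np.arange(1_000_000)

start = time.perf_counter()
loop_result = [x * 2 for x in a]   # interpreted Python, element by element
loop_time = time.perf_counter() - start

start = time.perf_counter()
vec_result = a * 2                 # single call into compiled NumPy code
vec_time = time.perf_counter() - start

print(f"loop: {loop_time:.4f}s, vectorized: {vec_time:.4f}s")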
16. How do you merge, join, and concatenate datasets in Pandas?
Answer:
Concatenate:
pd.concat([df1, df2], axis=0)   # Stack rows
pd.concat([df1, df2], axis=1)   # Stack columns
Merge (SQL-style joins):
pd.merge(df1, df2, on='key', how='inner') # or 'left', 'right', 'outer'
Join (index-based):
df1.join(df2, how='left')   # joins on the index of df2
17. What is the difference between groupby() and pivot_table()?
Answer:
Feature | groupby() | pivot_table() |
Aggregation | Yes (must use .agg(), .sum(), etc.) | Yes (with default aggfunc='mean') |
Multi-level index | Yes (when grouping by multiple keys) | Yes (hierarchical rows/columns) |
Use Case | Group data and apply functions | Reshape and summarize (like Excel pivot) |
Example
df.groupby('category')['sales'].sum()
df.pivot_table(values='sales', index='category', columns='region', aggfunc='sum')
18. How do you deal with outliers in a dataset?
Answer:
Detection Techniques:
- IQR method:
Q1 = df['col'].quantile(0.25)
Q3 = df['col'].quantile(0.75)
IQR = Q3 - Q1
outliers = df[(df['col'] < Q1 - 1.5*IQR) | (df['col'] > Q3 + 1.5*IQR)]
- Z-score method:
from scipy.stats import zscore
df['z'] = zscore(df['col'])
outliers = df[df['z'].abs() > 3]
Handling Strategies:
- Remove
- Replace with median or capped value (winsorization)
- Use robust models (e.g., decision trees)
19. What are the advantages of using apply() vs a loop in Pandas?
Answer:
Feature | apply() | Python loop |
Speed | Usually faster than an explicit loop, though not truly vectorized | Slower (interpreted, row-by-row) |
Syntax | Cleaner and more Pythonic | Verbose |
Flexibility | High (can use custom functions) | High but inefficient |
Example
df['squared'] = df['col'].apply(lambda x: x**2) # Faster than for-loop
20. How do you detect and handle duplicate records?
Answer:
Detect Duplicates:
df.duplicated() # Returns boolean Series
df[df.duplicated()] # Get duplicate rows
df.duplicated(subset=['col']) # Check specific column
Remove Duplicates:
df.drop_duplicates(inplace=True)
Keep First/Last:
df.drop_duplicates(keep='last')
21. How do you normalize and standardize data?
Answer:
Normalization (Min-Max Scaling): Scales features to a range [0, 1].
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
Standardization (Z-score): Centers the data with mean = 0 and std = 1.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_standardized = scaler.fit_transform(X)
Method | Use When… |
Normalization | Data is not Gaussian, bounded features (e.g., pixel values) |
Standardization | Data follows Gaussian distribution or for ML models sensitive to scale (e.g., SVM, KNN, logistic regression) |
22. What is label encoding vs one-hot encoding?
Answer:
Encoding Type | Description | Example |
Label Encoding | Converts categories to integer labels | Red → 0, Green → 1 |
One-Hot Encoding | Creates binary columns for each category | Red → [1, 0, 0] |
Label Encoding (Ordinal/Tree-based Models):
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['encoded'] = le.fit_transform(df['color'])
One-Hot Encoding (Linear/Distance-based Models):
pd.get_dummies(df['color'], prefix='color') # or use OneHotEncoder
23. How do you handle categorical variables with high cardinality?
Answer:
Strategies:
1. Frequency/Count Encoding: Replace categories with their frequency.
df['encoded'] = df['category'].map(df['category'].value_counts())
2. Target Encoding (Mean Encoding): Replace category with mean target value (careful with leakage).
df['encoded'] = df.groupby('category')['target'].transform('mean')
3. Hash Encoding (e.g., CategoryEncoders library): Efficient for large categorical spaces.
from category_encoders import HashingEncoder
encoder = HashingEncoder()
df_encoded = encoder.fit_transform(df)
4. Embedding layers (for deep learning models using PyTorch or TensorFlow)
24. Explain feature scaling and when to use it.
Answer:
Feature scaling brings all features to the same scale so that no variable dominates others.
When to Use:
- Distance-based algorithms: KNN, K-Means
- Gradient-based algorithms: Logistic Regression, Neural Networks
- PCA and SVM (sensitive to feature scale)
Techniques:
- Min-Max Scaling → [0, 1]
- Z-score (Standardization) → mean = 0, std = 1
- Robust Scaling → median-centered (good for outliers)
Example
from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()
X_scaled = scaler.fit_transform(X)
25. What is the difference between .fit(), .transform(), and .fit_transform()?
Answer:
Method | Purpose |
.fit() | Learns the parameters from data (e.g., mean, std) |
.transform() | Applies learned transformation |
.fit_transform() | Combines both steps (faster, cleaner) |
Example
scaler = StandardScaler()
scaler.fit(X) # learns mean and std
X_scaled = scaler.transform(X)
# OR directly
X_scaled = scaler.fit_transform(X)
26. How do you impute missing values using scikit-learn?
Answer:
Use SimpleImputer from sklearn.impute.
Example
from sklearn.impute import SimpleImputer
# For numerical features
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)
# For categorical features
imputer = SimpleImputer(strategy='most_frequent')
X_cat = imputer.fit_transform(X_cat)
Strategies: 'mean', 'median', 'most_frequent', 'constant'
27. How do you treat multicollinearity in features?
Answer:
Multicollinearity = high correlation between independent variables → leads to unstable models.
Detection:
- Correlation matrix
- VIF (Variance Inflation Factor):
Example
from statsmodels.stats.outliers_influence import variance_inflation_factor
vif = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
Treatment:
- Drop one of the correlated features
- Use PCA to reduce dimensions
- Use regularization (Ridge, Lasso)
28. What are binning and discretization?
Answer:
Binning: Group continuous variables into discrete intervals or bins.
Types:
- Equal-width binning:
pd.cut(df['age'], bins=3, labels=['young', 'middle', 'old'])
- Equal-frequency binning (quantile-based):
pd.qcut(df['income'], q=4)
Use Cases:
- Transform skewed numeric features
- Reduce overfitting
- Interpretability
29. How do you encode cyclic variables (like days, months)?
Answer:
Problem: Numeric encoding of cyclic features (e.g., 0 for Jan, 11 for Dec) fails to capture circular nature.
Solution: Use sine and cosine transformation.
import numpy as np
df['month_sin'] = np.sin(2 * np.pi * df['month'] / 12)
df['month_cos'] = np.cos(2 * np.pi * df['month'] / 12)
Why?
- sin/cos encodes direction on a circle
- January (0) is close to December (11)
30. What are the best practices for feature engineering?
Answer:
General Best Practices:
- Understand domain before creating features.
- Remove duplicates, missing values, and outliers.
- Use feature scaling where needed.
- Encode categorical variables appropriately.
- Use binning for skewed distributions.
- Create interaction terms, polynomial features, and log transformations.
- Use statistical tests (ANOVA, Chi-Square) for feature selection.
- Leverage external sources (e.g., geolocation, time).
- Avoid data leakage (no target info in features).
- Always validate with cross-validation or a holdout set.
Tools:
- Feature-engine
- scikit-learn pipelines
- category_encoders
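A minimal pipeline sketch combining several of these steps with scikit-learn (column names here are hypothetical placeholders):
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

numeric_cols = ['age', 'income']   # hypothetical numeric features
categorical_cols = ['city']        # hypothetical categorical feature

numeric_pipe = Pipeline([('impute', SimpleImputer(strategy='median')),
                         ('scale', StandardScaler())])
categorical_pipe = Pipeline([('impute', SimpleImputer(strategy='most_frequent')),
                             ('encode', OneHotEncoder(handle_unknown='ignore'))])

preprocess = ColumnTransformer([('num', numeric_pipe, numeric_cols),
                                ('cat', categorical_pipe, categorical_cols)])

model = Pipeline([('prep', preprocess), ('clf', LogisticRegression())])
# model.fit(X_train, y_train)   # fitting only on training data avoids leakage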
31. What steps do you follow during EDA (Exploratory Data Analysis)?
Answer:
Steps in EDA:
- Understand the dataset
- Load data (pandas.read_csv)
- Check shape, column names, and data types
- Missing value analysis
- df.isnull().sum()
- Visualize with seaborn.heatmap(df.isnull())
- Summary statistics
- df.describe()
- df.info()
- Univariate analysis
- Histograms, boxplots, value counts
- Bivariate/multivariate analysis
- Scatter plots, pairplots, heatmaps, correlation
- Outlier detection
- Boxplots, IQR, Z-score
- Skewness check
- df.skew()
- Feature Engineering
- Creating new features or transforming existing ones
- Encoding & Scaling
- Label encoding, one-hot encoding, normalization
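A minimal end-to-end sketch of these steps (file name and column names are placeholders):
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv('data.csv')        # placeholder file name
print(df.shape, df.dtypes)          # structure
print(df.isnull().sum())            # missing values
print(df.describe())                # summary statistics

sns.histplot(df['price'], kde=True)                           # univariate (placeholder column)
sns.heatmap(df.select_dtypes('number').corr(), annot=True)    # correlations
plt.show()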
32. How do you find correlations between variables?
Answer:
Use the Pearson correlation coefficient (linear), Spearman (non-linear), or Kendall (ordinal).
Example
correlation_matrix = df.corr(method='pearson')
import seaborn as sns
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
Interpretation:
- +1 = perfect positive
- 0 = no correlation
- -1 = perfect negative
Use scipy.stats.pearsonr(x, y) for p-value.
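A small sketch of that call, assuming two numeric columns in df (placeholder names):
from scipy.stats import pearsonr

r, p = pearsonr(df['feature_1'], df['feature_2'])
print(f"r = {r:.2f}, p-value = {p:.4f}")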
33. How do you deal with skewed data distributions?
Answer:
Detection:
df.skew()
sns.histplot(df['feature'], kde=True)
Techniques to handle skewness:
1. Log Transform:
df['feature'] = np.log1p(df['feature'])
2. Box-Cox (strictly positive data) / Yeo-Johnson (handles zero and negative values):
from sklearn.preprocessing import PowerTransformer
pt = PowerTransformer(method='yeo-johnson')
df[['feature']] = pt.fit_transform(df[['feature']])
3. Square root or reciprocal transforms
34. What are the key plots for EDA and when to use each?
Answer:
Plot Type | Use Case | Library |
Histogram | Distribution of numerical variable | seaborn.histplot() |
Boxplot | Outlier detection, spread | sns.boxplot() |
Scatter Plot | Relationship between two numeric vars | sns.scatterplot() |
Bar Plot | Frequency of categorical variables | sns.barplot() |
Pairplot | Pairwise relationship across features | sns.pairplot() |
Heatmap | Correlation matrix | sns.heatmap() |
Violin Plot | Distribution + probability density | sns.violinplot() |
Line Plot | Trend over time | sns.lineplot() |
35. How do you visualize multivariate relationships?
Answer:
Multivariate plots:
1. Pairplot – all pairs of numerical variables
sns.pairplot(df, hue='target')
2. Heatmap – correlation matrix
sns.heatmap(df.corr(), annot=True)
3. 3D Scatter Plot – 3 features
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

ax = plt.axes(projection='3d')
ax.scatter3D(df.x, df.y, df.z)
4. Grouped boxplots/violin plots – Category vs Numeric
36. What is the use of pairplot and heatmap in Seaborn?
Answer:
pairplot()
- Used for visualizing relationships between multiple pairs of numerical features.
- Highlights correlation and distribution patterns.
- Often includes hue for category analysis.
Example
sns.pairplot(df, hue='species')
heatmap()
- Used for displaying correlation matrices or missing values visually.
- Helpful in feature selection and dependency analysis.
Example
sns.heatmap(df.corr(), annot=True)
37. What’s the difference between histogram and bar plot?
Answer:
Feature | Histogram | Bar Plot |
Data Type | Continuous numerical data | Categorical data |
X-axis | Numeric ranges (bins) | Categories |
Purpose | Show distribution | Show frequency or comparison |
Gaps Between Bars | No | Yes |
Example
# Histogram
sns.histplot(df['age'])
# Bar Plot
sns.countplot(x='gender', data=df)
38. How do you choose which variables to keep during EDA?
Answer:
Techniques:
- Low variance removal: Use VarianceThreshold from sklearn
- Correlation Analysis: Drop one of highly correlated pairs (|corr| > 0.85)
- Univariate feature selection: SelectKBest, f_classif, mutual_info_classif
- Model-based Selection: Feature importance from tree models (Random Forest)
- Recursive Feature Elimination (RFE)
Example
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
rfe = RFE(model, n_features_to_select=5)
fit = rfe.fit(X, y)
39. What are the different types of feature distributions and how to interpret them?
Answer:
Distribution Type | Characteristics | Action |
Normal | Bell-shaped, mean ≈ median ≈ mode | Good for parametric models |
Skewed Right | Long tail on right (positive skew) | Log transform or Box-Cox |
Skewed Left | Long tail on left (negative skew) | Square or cube transform |
Bimodal | Two peaks | Segment the data or investigate classes |
Uniform | Equal frequency | Often okay to use directly |
Check with:
sns.histplot(df['feature'], kde=True)
print(df['feature'].skew())
40. What is the role of outliers and how do you detect them?
Answer:
Outliers can distort mean, standard deviation, and model accuracy.
Detection Methods:
1. Boxplot & IQR
Q1 = df['feature'].quantile(0.25)
Q3 = df['feature'].quantile(0.75)
IQR = Q3 - Q1
outliers = df[(df['feature'] < Q1 - 1.5 * IQR) | (df['feature'] > Q3 + 1.5 * IQR)]
2. Z-score
from scipy.stats import zscore
df['zscore'] = zscore(df['feature'])
3. Isolation Forest / LOF (Advanced models)
Treatment:
- Remove
- Cap (winsorization)
- Transform
- Bin or categorize
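For the capping option above, a small sketch using quantile-based clipping (the 1st/99th percentile cutoffs are just a common rule of thumb):
low, high = df['feature'].quantile([0.01, 0.99])
df['feature_capped'] = df['feature'].clip(lower=low, upper=high)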
41. What is the difference between population and sample?
Answer:
Population | Sample |
Entire group of individuals or events | A subset taken from the population |
Has parameters (e.g., μ, σ) | Has statistics (e.g., x̄, s) |
Not always practical to measure | More feasible for analysis |
Example
- Population: All customers of Amazon.
- Sample: 1000 randomly selected Amazon customers for a survey.
42. Explain the Central Limit Theorem with an example.
Answer:
Central Limit Theorem (CLT): The sampling distribution of the sample mean approaches a normal distribution, regardless of the population’s distribution, as the sample size becomes large (typically n ≥ 30).
Key Points:
- Works for independent, identically distributed samples
- Allows us to use normal approximation even for non-normal data
Example
import numpy as np
import matplotlib.pyplot as plt
# Skewed population
population = np.random.exponential(scale=2, size=10000)
# Sampling means
means = [np.mean(np.random.choice(population, 50)) for _ in range(1000)]
plt.hist(means, bins=30)
plt.title("Sampling Distribution Approaches Normal")
plt.show()
43. What is p-value and how is it used in hypothesis testing?
Answer:
p-value is the probability of observing the test statistic or something more extreme assuming the null hypothesis is true.
- Low p-value (≤ 0.05): Reject the null hypothesis
- High p-value (> 0.05): Fail to reject the null hypothesis
Example
If p = 0.01, there is only a 1% chance of observing results at least this extreme if the null hypothesis were true, which is strong evidence against the null hypothesis.
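A small worked example with made-up measurements, using a one-sample t-test:
from scipy.stats import ttest_1samp
import numpy as np

sample = np.array([5.1, 4.9, 5.3, 5.2, 4.8, 5.0])    # made-up data
t_stat, p_value = ttest_1samp(sample, popmean=5.5)   # H0: population mean is 5.5
print(p_value)   # if p <= 0.05, reject H0 at the 5% significance level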
44. What is the difference between Type I and Type II errors?
Answer:
Type I Error (False Positive) | Type II Error (False Negative) |
Rejecting a true null hypothesis | Failing to reject a false null hypothesis |
Probability is α (the significance level) | Probability is β; power = 1 − β |
“Crying wolf” | “Missing a real signal” |
Example
- Type I: Diagnosing disease when not ill.
- Type II: Missing diagnosis when disease is present.
45. What is the difference between confidence interval and prediction interval?
Answer:
Confidence Interval (CI) | Prediction Interval (PI) |
Range for population parameter (e.g., mean) | Range for future individual observation |
Narrower | Wider (includes extra variability) |
Example
- CI: Mean income is $50K ± $2K
- PI: Next person’s income likely falls in $30K–$70K
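A small sketch of a 95% confidence interval for a sample mean, using made-up numbers:
import numpy as np
from scipy import stats

incomes = np.array([48, 52, 50, 47, 55, 49, 51])   # made-up sample (in $K)
mean = incomes.mean()
sem = stats.sem(incomes)                           # standard error of the mean
ci_low, ci_high = stats.t.interval(0.95, df=len(incomes) - 1, loc=mean, scale=sem)
print(ci_low, ci_high)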
46. What are correlation and covariance?
Answer:
Concept | Meaning | Range |
Covariance | How two variables vary together | (−∞ to ∞) |
Correlation | Standardized measure of covariance (unitless) | [−1, 1] |
Formula
- Covariance: cov(X, Y) = Σ((X − X̄)(Y − Ȳ)) / n
- Correlation: corr = cov(X,Y) / (σX * σY)
Example
import numpy as np
x = [1, 2, 3, 4]
y = [2, 4, 6, 8]
print(np.cov(x, y)) # Covariance
print(np.corrcoef(x, y)) # Correlation
47. Explain t-test, chi-square test, and ANOVA.
Answer:
Test | Use Case | Data Type |
t-test | Compare means of 2 groups | Continuous |
Chi-square | Test for independence or goodness-of-fit | Categorical |
ANOVA | Compare means of 3+ groups | Continuous + Categorical |
Example (t-test)
from scipy.stats import ttest_ind
ttest_ind([1, 2, 3], [2, 3, 4])
ANOVA Example
from scipy.stats import f_oneway
f_oneway([1, 2, 3], [2, 4, 6], [3, 5, 7])
48. What is overfitting and underfitting in statistical models?
Answer:
Overfitting | Underfitting |
High training accuracy, poor test accuracy | Poor performance on both train and test |
Model is too complex | Model is too simple |
Low bias, high variance | High bias, low variance |
Example
- Overfitting: Memorizing training data with noise
- Underfitting: Using linear model on non-linear data
49. What are bias and variance in a predictive model?
Answer:
Bias | Variance |
Error from overly simple assumptions | Error from model sensitivity to data |
Leads to underfitting | Leads to overfitting |
High bias = high training error | High variance = high testing error |
Ideal model minimizes both bias and variance.
Visualization (bias-variance tradeoff): as model complexity increases, squared bias falls while variance rises; total error is lowest where the two are balanced.
50. When would you use a non-parametric test?
Answer:
Use non-parametric tests when:
- Data does not follow normal distribution
- Sample size is small
- Data is ordinal or ranked
- You want robustness to outliers
Common Non-parametric Tests:
Test | Use Case |
Mann–Whitney U Test | Compare medians of 2 independent samples |
Wilcoxon Test | Compare paired samples |
Kruskal–Wallis Test | Compare 3+ groups |
Chi-square | Categorical independence |
Example
from scipy.stats import mannwhitneyu
mannwhitneyu([1, 2, 3], [4, 5, 6])
51. What’s the difference between supervised and unsupervised learning?
Answer:
Supervised Learning | Unsupervised Learning |
Data includes input and labeled output | Data has only input (no labeled output) |
Goal: Learn a mapping from input → output | Goal: Find patterns or structures in data |
Examples: Classification, Regression | Examples: Clustering, Dimensionality Reduction |
Example
# Supervised
from sklearn.linear_model import LinearRegression
model = LinearRegression().fit(X_train, y_train)
# Unsupervised
from sklearn.cluster import KMeans
model = KMeans(n_clusters=3).fit(X)
52. Explain the bias-variance tradeoff in machine learning.
Answer:
Bias: Error due to overly simple model assumptions
Variance: Error due to high model sensitivity to training data
High Bias | High Variance |
Underfitting | Overfitting |
High training error | High test error |
Tradeoff Goal: Balance bias and variance for optimal performance on unseen data.
53. How do you evaluate a classification model (precision, recall, F1-score)?
Answer:
Metric | Formula | Meaning |
Precision | TP / (TP + FP) | Correct positive predictions |
Recall | TP / (TP + FN) | Coverage of actual positives |
F1-score | 2 * (Precision * Recall) / (Precision + Recall) | Harmonic mean of P and R |
Example
from sklearn.metrics import precision_score, recall_score, f1_score
precision_score(y_true, y_pred)
recall_score(y_true, y_pred)
f1_score(y_true, y_pred)
54. What is the ROC curve and AUC score?
Answer:
- ROC Curve (Receiver Operating Characteristic):
- Plots True Positive Rate vs False Positive Rate at various thresholds
- AUC (Area Under Curve):
- Measures entire 2D area under the ROC curve (range: 0 to 1)
- Higher AUC = Better model performance
Example
from sklearn.metrics import roc_curve, roc_auc_score
fpr, tpr, thresholds = roc_curve(y_true, y_scores)
auc = roc_auc_score(y_true, y_scores)
55. How does cross-validation help in model evaluation?
Answer:
Cross-validation splits data into multiple folds and rotates the validation set to:
- Reduce overfitting
- Get more robust performance estimates
- Use data more efficiently
Types:
- k-Fold CV
- Stratified k-Fold (for classification)
- Leave-One-Out CV
Example
from sklearn.model_selection import cross_val_score
cross_val_score(model, X, y, cv=5)
56. What is feature importance and how is it determined?
Answer:
Feature importance indicates how much a feature contributes to the model’s prediction.
Methods to determine:
- Tree-based models (e.g., RandomForest, XGBoost)
- Permutation importance
- SHAP (SHapley Additive exPlanations)
Example
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier().fit(X, y)
print(model.feature_importances_)
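Permutation importance (listed above) is also available in scikit-learn; a short sketch reusing the fitted model:
from sklearn.inspection import permutation_importance

result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
print(result.importances_mean)   # average drop in score when each feature is shuffled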
57. How do decision trees and random forests differ?
Answer:
Decision Tree | Random Forest |
Single tree, prone to overfitting | Ensemble of trees, reduces overfitting |
Fast, interpretable | Slower, more accurate |
High variance | Lower variance (bagging + randomness) |
Example
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

tree = DecisionTreeClassifier().fit(X, y)
forest = RandomForestClassifier().fit(X, y)
58. What are the pros and cons of using logistic regression?
Answer:
Pros:
- Simple, interpretable
- Efficient with linearly separable data
- Outputs probabilities
Cons:
- Assumes linear relationship
- Not suitable for complex patterns
- Sensitive to multicollinearity
Use Case: Email spam detection, credit risk prediction
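A minimal sketch, assuming X_train, y_train, and X_test are already defined:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
probs = clf.predict_proba(X_test)[:, 1]   # probability of the positive class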
59. What is the purpose of regularization (L1 vs L2)?
Answer:
Regularization adds a penalty term to the loss function to prevent overfitting.
Type | Penalty Term | Effect |
L1 (Lasso) | λ * Σ|wᵢ| | Shrinks some weights to exactly 0 (feature selection) |
L2 (Ridge) | λ * Σwᵢ² | Shrinks weights smoothly (no 0s) |
Example
from sklearn.linear_model import Lasso, Ridge
Lasso(alpha=0.1).fit(X, y)
Ridge(alpha=0.1).fit(X, y)
60. How do you tune hyperparameters in a model (GridSearchCV, RandomizedSearchCV)?
Answer:
Hyperparameter Tuning: Optimizing model configuration for best performance.
Method | Description |
GridSearchCV | Exhaustive search over given parameter grid |
RandomizedSearchCV | Randomly samples combinations |
Example
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

param_grid = {'n_estimators': [100, 200], 'max_depth': [4, 8]}
grid = GridSearchCV(RandomForestClassifier(), param_grid, cv=5)
grid.fit(X, y)
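For comparison, a RandomizedSearchCV sketch over a similar (illustrative) parameter space:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier

param_dist = {'n_estimators': [100, 200, 500], 'max_depth': [4, 8, None]}
search = RandomizedSearchCV(RandomForestClassifier(), param_dist, n_iter=5, cv=5, random_state=0)
search.fit(X, y)
print(search.best_params_)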