Machine Learning using Python
Interview Questions with Answers

1. What is a dataset in Machine Learning?
Answer:
A dataset is a collection of data used to train or test a Machine Learning model. It typically contains rows (samples) and columns (features or attributes).
2. What is a model in Machine Learning?
Answer:
A model is the outcome of a Machine Learning algorithm trained on data. It represents a learned pattern and is used to make predictions or classifications.
3. How do you import NumPy and what is it used for in ML?
Answer:
import numpy as np
NumPy is used for numerical computations, especially arrays, matrix operations, and efficient data handling in ML pipelines.
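For example, a few common NumPy operations:
import numpy as np
a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])
print(a + b)    # element-wise addition
print(a @ b)    # matrix multiplication
print(a.mean()) # basic statistics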
4. What is the role of pandas in Machine Learning?
Answer:
pandas is used to load, manipulate, clean, and analyze datasets. It provides DataFrame structures to handle tabular data easily.
5. How do you check the shape and type of a dataset in Python?
Answer:
import pandas as pd
data = pd.read_csv("file.csv")
print(data.shape) # Rows and columns
print(type(data)) # Type of object
6. How do you install scikit-learn in Python?
Answer:
pip install scikit-learn
7. What is a target variable?
Answer:
The target variable is the output the model is trying to predict. It’s also known as the dependent variable or label.
8. What does value_counts() do in pandas?
Answer:
data['column_name'].value_counts()
It counts and displays the frequency of unique values in a column. Useful for understanding class distributions.
9. What is a scatter plot and why is it used?
Answer:
A scatter plot shows the relationship between two variables. In ML, it’s used to visualize trends and patterns between features.
import matplotlib.pyplot as plt
plt.scatter(data['x'], data['y'])
10. How can you check for null values in a dataset?
Answer:
data.isnull().sum()
This checks for missing (NaN) values column-wise in a dataset.
11. What is train_test_split used for?
Answer:
train_test_split splits data into training and testing sets. It ensures that models are trained and validated on different data.
12. What is a Jupyter Notebook and why is it useful in ML?
Answer:
A Jupyter Notebook is an interactive coding environment used for data analysis, visualization, and prototyping ML models in Python.
13. What is a CSV file and how do you open it in Python?
Answer:
A CSV (Comma Separated Values) file stores tabular data.
import pandas as pd
df = pd.read_csv("filename.csv")
14. How do you rename a column in pandas?
Answer:
data.rename(columns={'old_name': 'new_name'}, inplace=True)
15. What is a histogram used for in ML?
Answer:
A histogram is used to visualize the distribution of a numeric feature, helping understand skewness and outliers.
import matplotlib.pyplot as plt
data['column'].hist()
16. What is Machine Learning and how is it different from traditional programming?
Answer:
Machine Learning is a subset of AI where computers learn from data and improve over time without being explicitly programmed. In traditional programming, rules are hard-coded, but in ML, the algorithm identifies patterns from data to make decisions.
17. What are the types of Machine Learning?
Answer:
- Supervised Learning – Labeled data (e.g., regression, classification).
- Unsupervised Learning – Unlabeled data (e.g., clustering).
- Reinforcement Learning – Agent learns by interacting with the environment via rewards/punishments.
18. How do you import and load a dataset using Python?
Answer:
import pandas as pd
data = pd.read_csv("data.csv")
print(data.head())
19. What is the difference between classification and regression?
Answer:
- Classification predicts discrete labels (e.g., spam or not).
- Regression predicts continuous values (e.g., house price).
20. What is a feature and a label in ML?
Answer:
- Feature: Input variable (independent variable).
- Label: Output variable (dependent variable).
In a dataset, features help predict labels.
21. How do you split data into training and testing sets using Python?
Answer:
from sklearn.model_selection import train_test_split
X = data.drop("target", axis=1)
y = data["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
22. What is overfitting and how to prevent it?
Answer:
Overfitting occurs when a model performs well on training data but poorly on unseen data.
Prevention:
- Use cross-validation
- Regularization
- Prune decision trees
- Reduce features
- Use more training data
23. What is the use of the fit() and predict() methods?
Answer:
- fit(): Trains the model on training data.
- predict(): Predicts outcomes on test/new data.
model.fit(X_train, y_train)
predictions = model.predict(X_test)
24. How do you evaluate a classification model?
Answer:
from sklearn.metrics import accuracy_score, confusion_matrix
accuracy = accuracy_score(y_test, predictions)
matrix = confusion_matrix(y_test, predictions)
print(accuracy, matrix)
Metrics include:
- Accuracy
- Precision
- Recall
- F1-Score
- Confusion Matrix
25. What is a confusion matrix?
Answer:
A confusion matrix shows:
- True Positives (TP)
- True Negatives (TN)
- False Positives (FP)
- False Negatives (FN)
It helps evaluate classification performance in detail.
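A short example of reading the matrix (scikit-learn's convention: rows are actual classes, columns are predicted classes):
from sklearn.metrics import confusion_matrix
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]
print(confusion_matrix(y_true, y_pred))
# [[1 1]   row 0: actual negatives -> 1 TN, 1 FP
#  [1 3]]  row 1: actual positives -> 1 FN, 3 TP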
26. How do you handle missing values in a dataset using Python?
Answer:
data.fillna(data.mean(), inplace=True) # Replace missing with mean
# OR
data.dropna(inplace=True) # Drop missing rows
27. What is normalization and why is it important?
Answer:
Normalization scales features to a similar range, improving model performance and convergence.
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
28. What is cross-validation in ML?
Answer:
Cross-validation divides data into folds to test model robustness and avoid overfitting.
Example: K-Fold Cross-Validation
from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, X, y, cv=5)
29. How do you perform linear regression using Python?
Answer:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
30. What libraries are commonly used for Machine Learning in Python?
Answer:
- NumPy – numerical operations
- Pandas – data handling
- Matplotlib / Seaborn – data visualization
- Scikit-learn – machine learning models
- TensorFlow / PyTorch – deep learning
31. What is the bias-variance tradeoff in Machine Learning?
Answer:
- Bias: Error from overly simplistic model assumptions (underfitting).
- Variance: Error from excessive sensitivity to the training data (overfitting).
- Tradeoff: The goal is a model complex enough to capture the signal but simple enough to generalize, balancing low bias with low variance; the sketch below illustrates this by varying model complexity.
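A minimal sketch on a synthetic dataset: a very shallow decision tree underfits (high bias), while an unrestricted tree overfits (high variance), visible as a gap between training and test accuracy.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
for depth in [1, 5, None]:  # None lets the tree grow until the leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    print(depth, tree.score(X_train, y_train), tree.score(X_test, y_test))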
32. How do you handle categorical variables in Python?
Answer:
# One-hot encoding
pd.get_dummies(data['category'])
# Label encoding
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
data['encoded'] = le.fit_transform(data['category'])
33. What is feature selection and why is it important?
Answer:
Feature selection picks the most relevant variables for training. It improves:
- Model accuracy
- Training speed
- Resistance to overfitting
Common methods: correlation analysis, univariate selection, and recursive feature elimination (a short RFE sketch follows).
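A minimal recursive feature elimination sketch, assuming a numeric feature matrix X and target y as in earlier questions:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5)
selector.fit(X, y)
print(selector.support_)  # boolean mask of selected features
print(selector.ranking_)  # rank 1 = selected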
34. How do you handle imbalanced datasets?
Answer:
- Undersample the majority class
- Oversample the minority class (e.g., SMOTE)
- Use class weights (see the sketch below)
- Use evaluation metrics such as F1-score or ROC-AUC instead of accuracy
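Many scikit-learn estimators also accept a class_weight parameter; a minimal sketch (SMOTE itself is shown in question 58):
from sklearn.linear_model import LogisticRegression
# 'balanced' reweights classes inversely proportional to their frequencies
model = LogisticRegression(class_weight='balanced', max_iter=1000)
model.fit(X_train, y_train)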
35. What is the difference between ROC curve and AUC score?
Answer:
- ROC curve: Plots True Positive Rate vs False Positive Rate.
- AUC (Area Under Curve): A single number summary (0–1) of how well the model distinguishes classes.
Higher AUC means better performance.
36. What is Grid Search and how do you implement it in Python?
Answer:
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
params = {'n_neighbors': [3, 5, 7]}
model = GridSearchCV(KNeighborsClassifier(), params, cv=5)
model.fit(X_train, y_train)
print(model.best_params_)
GridSearchCV exhaustively tries every combination of the listed hyperparameter values and is used for hyperparameter tuning.
37. Explain standardization and normalization.
Answer:
- Standardization: Rescales data to have mean = 0 and std = 1.
from sklearn.preprocessing import StandardScaler
StandardScaler().fit_transform(X)
- Normalization: Rescales features to a range [0, 1].
from sklearn.preprocessing import MinMaxScaler
MinMaxScaler().fit_transform(X)
38. What is regularization? Explain L1 and L2.
Answer:
Regularization reduces overfitting by penalizing large coefficients:
- L1 (Lasso): Can shrink coefficients to zero (feature selection).
- L2 (Ridge): Shrinks coefficients toward zero but rarely makes them exactly zero.
from sklearn.linear_model import Lasso, Ridge
Lasso(alpha=0.1)
Ridge(alpha=1.0)
39. How do you evaluate a regression model in Python?
Answer:
from sklearn.metrics import mean_squared_error, r2_score
print(mean_squared_error(y_test, y_pred))
print(r2_score(y_test, y_pred))
Metrics:
- MSE (Mean Squared Error)
- RMSE
- MAE
- R² Score
40. What is the purpose of random_state in scikit-learn?
Answer:
random_state ensures reproducibility by fixing the random seed. Useful in train-test splitting, cross-validation, and model initialization.
41. What is cross-validation and how is it different from train-test split?
Answer:
- Train-test split divides data once.
- Cross-validation splits data into multiple folds and rotates training/testing sets.
Cross-validation gives a more robust evaluation.
42. How do decision trees work and how can you visualize one in Python?
Answer:
Decision Trees split data by feature thresholds to maximize information gain.
from sklearn.tree import DecisionTreeClassifier, plot_tree
model = DecisionTreeClassifier()
model.fit(X, y)
plot_tree(model)
43. What are pipelines in scikit-learn and why are they used?
Answer:
Pipelines chain preprocessing and modeling steps into a single estimator, keeping the workflow clean, reproducible, and less prone to data leakage.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
pipe = Pipeline([('scaler', StandardScaler()), ('model', LogisticRegression())])
pipe.fit(X_train, y_train)
44. What are ensemble methods? Give examples.
Answer:
Ensemble methods combine multiple models for better performance:
- Bagging: e.g., Random Forest
- Boosting: e.g., Gradient Boosting, XGBoost
- Stacking: Combining multiple classifiers via a meta-classifier
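A minimal sketch contrasting a bagging ensemble with a boosting ensemble, assuming X_train and y_train from earlier questions:
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
bagging_model = RandomForestClassifier(n_estimators=100, random_state=42)       # bagging
boosting_model = GradientBoostingClassifier(n_estimators=100, random_state=42)  # boosting
bagging_model.fit(X_train, y_train)
boosting_model.fit(X_train, y_train)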
45. How do you save and load a trained ML model in Python?
Answer:
import joblib
# Save model
joblib.dump(model, 'model.pkl')
# Load model
model = joblib.load('model.pkl')
46. What is the Curse of Dimensionality and how do you handle it?
Answer:
As dimensions (features) increase, the data becomes sparse, and distance metrics (like Euclidean) become less meaningful.
Solutions:
- Dimensionality reduction (e.g., PCA, t-SNE)
- Feature selection
- Regularization
47. How does Principal Component Analysis (PCA) work in Python?
Answer:
PCA reduces dimensions by projecting data onto new axes (principal components) that maximize variance.
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
48. Explain the difference between Bagging and Boosting.
Answer:
- Bagging: Builds models independently on random subsets (e.g., Random Forest). Reduces variance.
- Boosting: Builds models sequentially, correcting errors from previous ones (e.g., XGBoost). Reduces bias.
49. What is XGBoost and why is it popular?
Answer:
XGBoost is an optimized gradient boosting algorithm with:
- Regularization (prevents overfitting)
- Parallel computation
- High accuracy and speed
import xgboost as xgb
model = xgb.XGBClassifier()
model.fit(X_train, y_train)
50. How does a Support Vector Machine (SVM) work?
Answer:
SVM finds the best hyperplane that separates classes with the maximum margin.
It can handle non-linear data using kernel tricks like RBF.
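A minimal scikit-learn sketch, assuming X_train, y_train, and X_test from earlier questions:
from sklearn.svm import SVC
# The RBF kernel handles non-linearly separable data; C controls margin softness
model = SVC(kernel='rbf', C=1.0, gamma='scale')
model.fit(X_train, y_train)
predictions = model.predict(X_test)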
51. How do you deal with multicollinearity in features?
Answer:
- Use VIF (Variance Inflation Factor) to detect it.
- Remove or combine correlated features.
- Use regularization (L1 or L2).
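A minimal VIF sketch, assuming statsmodels is installed and X is a numeric feature DataFrame:
from statsmodels.stats.outliers_influence import variance_inflation_factor
vif = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print(dict(zip(X.columns, vif)))  # values above roughly 5-10 suggest multicollinearity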
52. How can you handle time series data in machine learning?
Answer:
- Feature engineering (lag features, rolling means, date-time features; see the sketch below)
- Train-test splits must preserve temporal order (no random shuffling).
- Models: ARIMA, Prophet, LSTM (for deep learning)
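A minimal pandas sketch of lag and rolling-mean features, assuming a DataFrame df sorted by a datetime index with a numeric 'sales' column (the column name is illustrative):
df['lag_1'] = df['sales'].shift(1)  # previous period's value
df['rolling_mean_7'] = df['sales'].rolling(window=7).mean()  # 7-period moving average
df['month'] = df.index.month  # date-time feature from the DatetimeIndex
df = df.dropna()  # drop rows where lag/rolling features are undefined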
53. What is a ROC Curve and how do you plot it in Python?
Answer:
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt
fpr, tpr, _ = roc_curve(y_test, y_proba)  # y_proba: predicted probabilities for the positive class
roc_auc = auc(fpr, tpr)
plt.plot(fpr, tpr, label=f'AUC = {roc_auc:.2f}')
plt.legend()
The ROC curve shows model performance across different classification thresholds; an AUC close to 1 is ideal.
54. What is the difference between batch and online learning?
Answer:
- Batch Learning: Trains on the whole dataset at once.
- Online Learning: Trains incrementally, useful for streaming or large data.
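A minimal sketch of online learning with scikit-learn's SGDClassifier, assuming the data arrives as an iterable of (X_batch, y_batch) chunks called batches:
from sklearn.linear_model import SGDClassifier
import numpy as np
model = SGDClassifier()
classes = np.array([0, 1])  # all classes must be declared on the first call
for X_batch, y_batch in batches:  # 'batches' is an assumed data stream
    model.partial_fit(X_batch, y_batch, classes=classes)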
55. How do you prevent data leakage in ML pipelines?
Answer:
- Split the data before fitting any feature engineering or scaling steps.
- Use pipelines so transformations are fit only on the training data and then applied to the test data.
- Be careful with time-based features that leak future information.
56. What is SHAP and how is it used in model interpretability?
Answer:
SHAP (SHapley Additive exPlanations) explains individual predictions by computing feature contributions.
import shap
explainer = shap.Explainer(model, X_train)
shap_values = explainer(X_test)
shap.plots.beeswarm(shap_values)
57. How do you tune hyperparameters using RandomizedSearchCV?
Answer:
from sklearn.model_selection import RandomizedSearchCV
param_grid = {'n_estimators': [100, 200], 'max_depth': [3, 5, 10]}
search = RandomizedSearchCV(model, param_distributions=param_grid, n_iter=5, cv=3)
search.fit(X_train, y_train)
58. How do you handle class imbalance with SMOTE?
Answer:
from imblearn.over_sampling import SMOTE
smote = SMOTE()
X_resampled, y_resampled = smote.fit_resample(X, y)
SMOTE generates synthetic examples for the minority class.
59. What is the difference between Hard and Soft Voting in ensemble models?
Answer:
- Hard Voting: Takes the majority class predicted by classifiers.
- Soft Voting: Takes the average of predicted probabilities.
from sklearn.ensemble import VotingClassifier
VotingClassifier(estimators=[...], voting='soft')
60. How do you deploy a machine learning model as an API?
Answer:
Save the model:
import joblib
joblib.dump(model, 'model.pkl')
Use Flask or FastAPI to create endpoints:
from flask import Flask, request, jsonify
import joblib

app = Flask(__name__)
model = joblib.load('model.pkl')

@app.route('/predict', methods=['POST'])
def predict():
    data = request.get_json()
    prediction = model.predict([data['features']])
    return jsonify({'prediction': prediction.tolist()})

if __name__ == '__main__':
    app.run()