Chapter 10. Classification Models for Business Decisions
Classification is one of the most widely applied machine learning techniques in business analytics. From predicting customer churn and detecting fraudulent transactions to assessing credit risk and targeting marketing campaigns, classification models help organizations make data-driven decisions that directly impact revenue, risk, and customer satisfaction.
This chapter introduces the fundamental concepts of classification, explores both basic and advanced algorithms, addresses the critical challenge of class imbalance, and demonstrates how to interpret and evaluate classification models. We conclude with a comprehensive Python implementation focused on credit scoring—a classic business application where accurate classification can mean the difference between profit and loss.
10.1 Classification Problems in Business
Classification is a supervised learning task where the goal is to predict a categorical label (the target or class) based on input features. Unlike regression, which predicts continuous values, classification assigns observations to discrete categories.
Common Business Classification Problems
Customer Churn Prediction
Identifying customers likely to stop using a service or product. Telecom companies, subscription services, and banks use churn models to proactively retain valuable customers through targeted interventions.
- Target: Churned (1) vs. Retained (0)
- Features: Usage patterns, customer demographics, service complaints, contract type
- Business Impact: Reducing churn by even 5% can significantly increase lifetime customer value
Fraud Detection
Detecting fraudulent transactions in credit cards, insurance claims, or online payments. Some recent studies report that combining traditional ML models with techniques like SMOTE can achieve over 99% accuracy in fraud detection, though accuracy alone can be misleading when fraud is rare (see Section 10.4).
- Target: Fraudulent (1) vs. Legitimate (0)
- Features: Transaction amount, location, time, merchant category, user behavior
- Business Impact: Prevents financial losses while minimizing false positives that frustrate customers
Credit Scoring
Assessing the creditworthiness of loan applicants to determine approval and interest rates. Financial institutions rely on classification models to balance risk and opportunity.
- Target: Default (1) vs. Repay (0)
- Features: Income, employment history, existing debt, credit history, loan amount
- Business Impact: Reduces default rates while expanding access to credit for qualified borrowers
Marketing Response Prediction
Predicting which customers will respond to marketing campaigns, enabling targeted outreach and efficient resource allocation.
- Target: Responder (1) vs. Non-responder (0)
- Features: Past purchase behavior, demographics, engagement metrics
- Business Impact: Increases campaign ROI and reduces marketing costs
Medical Diagnosis
Classifying patients as having or not having a particular condition based on symptoms, test results, and medical history.
- Target: Disease Present (1) vs. Absent (0)
- Features: Lab results, vital signs, patient history, imaging data
- Business Impact: Improves patient outcomes and optimizes healthcare resource allocation
Key Characteristics of Business Classification Problems
- Imbalanced Classes: In most business scenarios, the event of interest (fraud, churn, default) is rare, creating significant class imbalance
- Cost-Sensitive: Misclassification costs are often asymmetric—missing a fraud case may be more costly than a false alarm
- Interpretability Matters: Stakeholders often need to understand why a prediction was made, especially in regulated industries
- Dynamic Patterns: Customer behavior and fraud tactics evolve, requiring models to be regularly updated
10.2 Basic Algorithms
10.2.1 Logistic Regression
Despite its name, logistic regression is a classification algorithm. It models the probability that an observation belongs to a particular class using the logistic (sigmoid) function.
Mathematical Foundation
For binary classification, logistic regression models:
$$P(y=1 \mid X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p)}}$$
Where:
- P(y=1∣X) is the probability of the positive class
- β0,β1,...,βp are coefficients learned from data
- The decision boundary is typically set at P=0.5
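To make the formula concrete, here is a minimal sketch that evaluates the logistic function for a single observation. The coefficient values, the intercept, and the function name `predict_proba` are illustrative assumptions, not values from a fitted model.

```python
import math

def predict_proba(features, coefficients, intercept):
    """Return P(y=1 | X) for one observation under a logistic model."""
    # Linear predictor (log-odds): beta_0 + beta_1*x_1 + ... + beta_p*x_p
    log_odds = intercept + sum(b * x for b, x in zip(coefficients, features))
    # The sigmoid maps any real number into the (0, 1) interval
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical coefficients for two standardized features
coefficients = [0.8, -1.2]
intercept = -0.5

p = predict_proba([1.0, 0.5], coefficients, intercept)
label = int(p >= 0.5)  # apply the usual 0.5 decision boundary
print(f"P(y=1|X) = {p:.3f}, predicted class = {label}")
```

Here the log-odds are -0.5 + 0.8 - 0.6 = -0.3, so the probability falls just below 0.5 and the observation is assigned to the negative class.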
Advantages
- Interpretable: Coefficients indicate feature importance and direction of effect
- Probabilistic output: Provides calibrated probability estimates
- Efficient: Fast to train and predict, even on large datasets
- Regularization: L1 (Lasso) and L2 (Ridge) regularization prevent overfitting
Limitations
- Linear decision boundary: Assumes a linear relationship between features and log-odds
- Feature engineering required: May need polynomial features or interactions for complex patterns
- Sensitive to outliers: Extreme values can influence coefficients
Business Use Cases
- Credit scoring (interpretability required for regulatory compliance)
- Email spam detection
- Customer conversion prediction
AI Prompt for Logistic Regression:
"Explain how logistic regression coefficients can be interpreted in a credit scoring model.
If the coefficient for 'income' is 0.05, what does this mean for loan approval probability?"
10.2.2 Decision Trees
Decision trees recursively partition the feature space into regions, making predictions based on simple decision rules learned from data. Each internal node represents a test on a feature, each branch represents an outcome, and each leaf node represents a class label.
How Decision Trees Work
- Splitting: At each node, the algorithm selects the feature and threshold that best separates the classes (using metrics like Gini impurity or information gain)
- Recursion: The process repeats for each child node until a stopping criterion is met (max depth, minimum samples, purity)
- Prediction: New observations traverse the tree from root to leaf, following the decision rules
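The splitting step can be sketched in a few lines of plain Python. The example below computes Gini impurity and picks the threshold that gives the lowest weighted impurity for one feature; the toy data and the helper names `gini` and `split_impurity` are assumptions made for illustration, and real implementations are considerably more optimized.

```python
def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k^2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    proportions = [labels.count(c) / n for c in set(labels)]
    return 1.0 - sum(p * p for p in proportions)

def split_impurity(values, labels, threshold):
    """Weighted Gini impurity of the two children produced by a threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Toy data: number of late payments and whether the customer defaulted
late_payments = [0, 0, 1, 1, 2, 3, 4, 5]
defaulted     = [0, 0, 0, 0, 1, 0, 1, 1]

# Try each candidate threshold and keep the one with the lowest impurity
candidates = sorted(set(late_payments))[:-1]
best = min(candidates, key=lambda t: split_impurity(late_payments, defaulted, t))
print(f"Parent impurity: {gini(defaulted):.3f}")
print(f"Best threshold: late_payments <= {best}")
```

On this toy data the best split is `late_payments <= 1`, which leaves a pure left child (no defaults) and a right child that is mostly defaults.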
Key Hyperparameters
- max_depth: Maximum depth of the tree (controls complexity)
- min_samples_split: Minimum samples required to split a node
- min_samples_leaf: Minimum samples required in a leaf node
- criterion: Splitting criterion ('gini' or 'entropy')
Advantages
- Highly interpretable: Can be visualized and explained to non-technical stakeholders
- Non-linear: Captures complex, non-linear relationships
- No feature scaling needed: Works with features on different scales
- Handles mixed data types: Works with both numerical and categorical features
Limitations
- Overfitting: Deep trees can memorize training data
- Instability: Small changes in data can lead to very different trees
- Biased toward dominant classes: In imbalanced datasets, may favor the majority class
Business Use Cases
- Customer segmentation
- Loan approval decisions (when interpretability is critical)
- Medical diagnosis
AI Prompt for Decision Trees:
"I have a decision tree for churn prediction with 15 leaf nodes. How can I simplify this tree
to make it more interpretable for business stakeholders while maintaining reasonable accuracy?"
10.3 More Advanced Algorithms
10.3.1 Random Forests
Random Forest is an ensemble method that combines multiple decision trees to improve prediction accuracy and reduce overfitting. Each tree is trained on a random subset of data (bootstrap sample) and considers only a random subset of features at each split.
Key Concepts:
- Bagging (Bootstrap Aggregating): Each tree sees a different sample of data
- Feature Randomness: Each split considers only a subset of features
- Voting: Final prediction is the majority vote (classification) or average (regression)
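Two of these ideas, bootstrap sampling and majority voting, are simple enough to sketch directly. The snippet below is only an illustration of those two mechanics (it does not train trees or implement feature randomness), and the function names and vote values are hypothetical.

```python
import random
from collections import Counter

def bootstrap_indices(n_rows, rng):
    """Sample n_rows row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]

def majority_vote(tree_predictions):
    """Combine the class labels predicted by individual trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

rng = random.Random(42)
sample = bootstrap_indices(10, rng)
print("Bootstrap sample of row indices:", sample)
print("Distinct rows seen by this tree:", len(set(sample)))

# Hypothetical votes from five trees for one loan applicant (1 = default)
votes = [1, 0, 1, 1, 0]
print("Forest prediction:", majority_vote(votes))
```

Because sampling is done with replacement, each tree typically sees some rows more than once and misses others entirely, which is what makes the trees differ from one another.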
Advantages:
- Robust: Less prone to overfitting than single decision trees
- Feature importance: Provides measures of feature relevance
- Handles high-dimensional data: Works well even with many features
- Minimal hyperparameter tuning: Often performs well with default settings
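Some recent studies report Random Forest reaching 99.5% accuracy in credit card fraud detection when combined with SMOTE for handling class imbalance. Because fraud is rare, accuracy figures like this should be read alongside precision and recall.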
10.3.2 Gradient Boosting
Gradient Boosting builds trees sequentially, where each new tree corrects the errors of the previous ensemble. Popular implementations include XGBoost, LightGBM, and CatBoost. These are among the strongest-performing models for structured (tabular) data. For datasets with many categorical features, we recommend CatBoost, which handles categorical variables natively.
Key Concepts:
- Sequential learning: Trees are added one at a time
- Error correction: Each tree focuses on the residuals (errors) of the previous ensemble
- Learning rate: Controls how much each tree contributes to the final prediction
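The sequential, residual-fitting idea can be sketched with one-split trees (stumps). To keep the example short, this sketch uses squared-error loss on a numeric target rather than a classification loss; the data and the helper names `fit_stump`, `boost`, and `mse` are assumptions for illustration and do not mirror any particular library's implementation.

```python
def fit_stump(x, residuals):
    """Find the one-split regression stump that best fits the residuals."""
    best = None
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean

def boost(x, y, n_trees=50, learning_rate=0.1):
    """Sequentially add stumps, each fitted to the current residuals."""
    base = sum(y) / len(y)              # start from the mean prediction
    stumps = []
    predictions = [base] * len(y)
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        # Shrink each tree's contribution by the learning rate
        predictions = [pi + learning_rate * stump(xi)
                       for pi, xi in zip(predictions, x)]
    return lambda xi: base + learning_rate * sum(s(xi) for s in stumps)

def mse(model, x, y):
    """Mean squared error of a fitted model on (x, y)."""
    return sum((model(xi) - yi) ** 2 for xi, yi in zip(x, y)) / len(y)

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.0, 1.2, 0.9, 3.1, 3.0, 3.2, 5.1, 4.9]

for n in (1, 10, 50):
    model = boost(x, y, n_trees=n)
    print(f"{n:>2} trees -> training MSE = {mse(model, x, y):.4f}")
```

The training error shrinks as trees are added. A smaller learning rate slows that progress, which is why it is usually tuned together with the number of trees.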
Advantages:
- State-of-the-art performance: Often wins machine learning competitions
- Handles complex patterns: Captures intricate relationships in data
- Built-in regularization: Techniques like shrinkage prevent overfitting
Disadvantages:
- Computationally expensive: Slower to train than Random Forest
- More hyperparameters: Requires careful tuning
- Less interpretable: Harder to explain than single trees
Business Applications:
- Credit scoring (highest accuracy)
- Fraud detection
- Customer lifetime value prediction
10.3.3 Neural Networks
Neural networks, particularly deep learning models, have gained prominence in classification tasks involving unstructured data (images, text, audio). For structured business data, simpler models often suffice, but neural networks can capture highly complex patterns.
Basic Architecture:
- Input layer: One neuron per feature
- Hidden layers: Intermediate layers that learn representations
- Output layer: Neurons corresponding to classes (with softmax activation for multi-class)
Advantages:
- Universal approximators: Can model any function given enough neurons
- Automatic feature learning: Learns relevant features from raw data
- Scalability: Handles massive datasets efficiently with GPUs
Disadvantages:
- Black box: Difficult to interpret
- Data hungry: Requires large amounts of training data
- Computationally intensive: Needs significant resources
- Hyperparameter sensitivity: Many parameters to tune
Business Use Cases:
- Image-based fraud detection (e.g., check fraud)
- Natural language processing for customer sentiment
- Complex pattern recognition in high-dimensional data
10.4 Handling Class Imbalance
Class imbalance occurs when one class significantly outnumbers the other(s). In business problems like fraud detection (0.17% fraud rate) or churn prediction (typically 5-20% churn), this is the norm rather than the exception.
Why Class Imbalance is Problematic
- Biased Models: Algorithms optimize for overall accuracy, which can be achieved by simply predicting the majority class
- Poor Minority Class Performance: The model fails to learn patterns in the rare but important class
- Misleading Metrics: 99% accuracy is meaningless if it's achieved by predicting "no fraud" for every transaction
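The following sketch illustrates the misleading-accuracy problem with a hypothetical test set of 10,000 transactions at a 0.17% fraud rate and a "model" that always predicts the majority class. The numbers are constructed for illustration only.

```python
# Hypothetical test set: 10,000 transactions, 17 of them fraudulent (0.17%)
n_total = 10_000
n_fraud = 17
y_true = [1] * n_fraud + [0] * (n_total - n_fraud)

# A "model" that always predicts the majority class (no fraud)
y_pred = [0] * n_total

correct = sum(t == p for t, p in zip(y_true, y_pred))
caught = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))

accuracy = correct / n_total
recall = caught / n_fraud
print(f"Accuracy: {accuracy:.2%}")   # 99.83%
print(f"Recall:   {recall:.2%}")     # 0.00%
```

The classifier looks excellent on accuracy yet catches none of the fraud cases, which is why the metrics in Section 10.5.2 matter.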
Techniques for Handling Class Imbalance
1. Resampling Methods
Undersampling: Reduce the number of majority class samples
- Random Undersampling: Randomly remove majority class samples
- Tomek Links: Remove majority class samples that are close to minority class samples
- Pros: Faster training, balanced dataset
- Cons: Loss of potentially useful information
Oversampling: Increase the number of minority class samples
- Random Oversampling: Duplicate minority class samples
- Pros: No information loss
- Cons: Risk of overfitting, increased training time
SMOTE (Synthetic Minority Over-sampling Technique)
SMOTE creates synthetic minority class samples by interpolating between existing minority class samples. Research shows that SMOTE significantly improves model performance on imbalanced datasets.
How SMOTE Works:
- For each minority class sample, find its k nearest neighbors (typically k=5)
- Randomly select one of these neighbors
- Create a synthetic sample along the line segment connecting the two samples
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
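To show what happens inside the resampler, here is a simplified, pure-Python sketch of the three steps described above. It is an illustration under stated assumptions (Euclidean distance, a tiny hand-made minority set, a hypothetical helper named `smote_sample`), not the imbalanced-learn implementation.

```python
import math
import random

def smote_sample(minority, k=5, rng=None):
    """Create one synthetic point from a list of minority-class points."""
    rng = rng or random.Random()
    base = rng.choice(minority)
    # Step 1: find the k nearest minority neighbours of the base point
    others = [p for p in minority if p is not base]
    neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
    # Step 2: pick one of those neighbours at random
    neighbour = rng.choice(neighbours)
    # Step 3: interpolate at a random position on the connecting segment
    gap = rng.random()
    return [b + gap * (n - b) for b, n in zip(base, neighbour)]

# Hypothetical minority-class points with two features each
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2],
            [1.2, 2.5], [1.8, 1.6], [2.2, 2.0]]

rng = random.Random(42)
synthetic = [smote_sample(minority, k=3, rng=rng) for _ in range(4)]
for point in synthetic:
    print([round(v, 3) for v in point])
```

Every synthetic point lies on a line segment between two real minority samples, so new points stay inside the region the minority class already occupies.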
SMOTE-Tomek: Combines SMOTE oversampling with Tomek Links undersampling to clean the decision boundary
2. Algorithm-Level Techniques
Class Weights: Assign higher penalties to misclassifying the minority class
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(class_weight='balanced')
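With `class_weight='balanced'`, scikit-learn weights each class by `n_samples / (n_classes * class_count)`. The sketch below reproduces that formula in plain Python using the class counts from this chapter's credit scoring training set (7,731 non-defaults and 269 defaults); the helper name `balanced_class_weights` is hypothetical.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weights computed as n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * count) for c, count in counts.items()}

# Class counts taken from the chapter's credit scoring training set
y_train_example = [0] * 7731 + [1] * 269
weights = balanced_class_weights(y_train_example)
print({c: round(w, 3) for c, w in weights.items()})
```

Each default therefore counts roughly 29 times as much as each non-default in the loss, which pushes the model to pay attention to the rare class.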
Threshold Adjustment: Instead of using 0.5 as the decision threshold, optimize it based on business costs
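As a minimal sketch of cost-based threshold selection, the example below scans a grid of thresholds and keeps the one with the lowest total misclassification cost. The labels, predicted probabilities, and the two cost figures are assumptions chosen for illustration; in practice they would come from a validation set and from the business.

```python
def expected_cost(y_true, y_proba, threshold, cost_fn, cost_fp):
    """Total misclassification cost at a given decision threshold."""
    cost = 0.0
    for actual, p in zip(y_true, y_proba):
        predicted = 1 if p >= threshold else 0
        if actual == 1 and predicted == 0:
            cost += cost_fn      # missed default (false negative)
        elif actual == 0 and predicted == 1:
            cost += cost_fp      # good customer rejected (false positive)
    return cost

# Hypothetical labels, predicted default probabilities, and costs
y_true  = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
y_proba = [0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.40, 0.55, 0.70, 0.90]
COST_FN = 5000   # assumed loss from approving a loan that defaults
COST_FP = 500    # assumed profit forgone by rejecting a good applicant

thresholds = [t / 100 for t in range(5, 100, 5)]
best = min(thresholds,
           key=lambda t: expected_cost(y_true, y_proba, t, COST_FN, COST_FP))

for t in (0.5, best):
    cost = expected_cost(y_true, y_proba, t, COST_FN, COST_FP)
    print(f"threshold = {t:.2f} -> total cost = {cost:,.0f}")
```

With these assumed costs, lowering the threshold from 0.50 to 0.40 catches one more default at the price of no extra false alarms, cutting the total cost from 5,500 to 500.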
3. Ensemble Methods
Balanced Random Forest: Each tree is trained on a balanced bootstrap sample
from imblearn.ensemble import BalancedRandomForestClassifier
model = BalancedRandomForestClassifier(random_state=42)
EasyEnsemble: Creates multiple balanced subsets and trains an ensemble
Choosing the Right Technique
- Small datasets: SMOTE or SMOTE-Tomek
- Large datasets: Undersampling or class weights
- Extreme imbalance (< 1% minority): Combination of techniques
- Real-time systems: Class weights (no preprocessing needed)
10.5 Interpreting Classification Models
10.5.1 Coefficients, Feature Importance, and Partial Dependence (Conceptual)
Logistic Regression Coefficients
Coefficients indicate the change in log-odds for a one-unit increase in the feature:
- Positive coefficient: Increases probability of positive class
- Negative coefficient: Decreases probability of positive class
- Magnitude: Indicates strength of effect
Example: In credit scoring, if the coefficient for income is 0.0005, then a $10,000 increase in income increases the log-odds of approval by 5.
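A change in log-odds is easier to communicate once it is converted into an odds ratio or a probability. The sketch below does this for the example above; the baseline probability of 1% and the helper name `shift_probability` are assumptions for illustration.

```python
import math

coefficient = 0.0005      # change in log-odds per $1 of income (from the example)
income_change = 10_000    # dollars

delta_log_odds = coefficient * income_change      # about 5
odds_ratio = math.exp(delta_log_odds)             # the odds are multiplied by this

def shift_probability(p, delta):
    """Apply a log-odds shift to a baseline probability."""
    log_odds = math.log(p / (1 - p)) + delta
    return 1 / (1 + math.exp(-log_odds))

baseline = 0.01  # hypothetical baseline approval probability
updated = shift_probability(baseline, delta_log_odds)

print(f"Change in log-odds: {delta_log_odds:.2f}")
print(f"Odds ratio:         {odds_ratio:.1f}")
print(f"Probability:        {baseline:.2%} -> {updated:.2%}")
```

The odds are multiplied by about 148, but the effect on probability depends on the starting point: the same shift moves a 1% baseline to roughly 60%, while a baseline already near 100% would barely change.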
Feature Importance (Tree-Based Models)
Feature importance measures how much each feature contributes to reducing impurity across all trees:
- Higher values: More important features
- Interpretation: Relative, not absolute
import pandas as pd
importances = model.feature_importances_
feature_importance_df = pd.DataFrame({
'feature': X_train.columns,
'importance': importances
}).sort_values('importance', ascending=False)
Partial Dependence Plots (PDP)
PDPs show the marginal effect of a feature on the predicted outcome, averaged over the observed values of the other features. They help visualize non-linear relationships.
SHAP (SHapley Additive exPlanations)
SHAP values provide a unified measure of feature importance based on game theory, showing how much each feature contributes to a specific prediction.
10.5.2 Metrics: Precision, Recall, Confusion Matrix, F1, AUC
Accuracy alone is insufficient for evaluating classification models, especially with imbalanced data. We need a comprehensive set of metrics.
Confusion Matrix
A confusion matrix summarizes prediction results:
| | Predicted Negative | Predicted Positive |
|---|---|---|
| Actual Negative | True Negative (TN) | False Positive (FP) |
| Actual Positive | False Negative (FN) | True Positive (TP) |
Key Metrics
Accuracy: Overall correctness
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
- Limitation: Misleading with imbalanced data
Precision: Of all positive predictions, how many were correct?
$$\text{Precision} = \frac{TP}{TP + FP}$$
- Business Interpretation: In fraud detection, high precision means few false alarms
Recall (Sensitivity): Of all actual positives, how many did we catch?
$$\text{Recall} = \frac{TP}{TP + FN}$$
- Business Interpretation: In fraud detection, high recall means we catch most fraud cases
F1-Score: Harmonic mean of precision and recall
$$F_1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
- Use Case: When you need a balance between precision and recall
Specificity: Of all actual negatives, how many did we correctly identify?
$$\text{Specificity} = \frac{TN}{TN + FP}$$
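The metrics above can all be computed directly from the four confusion-matrix counts. The following sketch does so for a hypothetical fraud-detection result; the counts and the function name `classification_metrics` are assumptions for illustration.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the metrics defined above from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
    }

# Hypothetical fraud-detection results on 10,000 transactions
metrics = classification_metrics(tp=80, tn=9850, fp=50, fn=20)
for name, value in metrics.items():
    print(f"{name:<12} {value:.3f}")
```

In this example accuracy is above 99% while precision is only about 62%, which shows how differently the metrics can describe the same model.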
ROC Curve and AUC
The Receiver Operating Characteristic (ROC) curve plots True Positive Rate (Recall) vs. False Positive Rate at various threshold settings.
AUC (Area Under the Curve): Measures the model's ability to distinguish between classes
- AUC = 1.0: Perfect classifier
- AUC = 0.5: Random guessing
- AUC > 0.8: Generally considered good
Business Interpretation: AUC represents the probability that the model ranks a random positive example higher than a random negative example.
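This ranking interpretation can be checked by brute force on a small example: compare every positive with every negative and count how often the positive receives the higher score. The labels, scores, and the function name `pairwise_auc` below are hypothetical, and the pairwise loop is meant for illustration rather than for large datasets.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))

y_true = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.3, 0.35, 0.6, 0.7, 0.9]
print(f"AUC = {pairwise_auc(y_true, scores):.3f}")
```

Of the nine positive-negative pairs, eight are ranked correctly, giving an AUC of about 0.889. The only mistake is the negative scored 0.6 outranking the positive scored 0.35.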
Choosing the Right Metric
- Fraud detection: Prioritize Recall (catch all fraud) and AUC
- Spam filtering: Prioritize Precision (avoid false positives)
- Credit scoring: Balance Precision and Recall (F1-Score), consider business costs
- Medical diagnosis: Prioritize Recall (don't miss diseases)
10.6 Implementing Classification in Python
Credit Scoring Example: Complete Implementation
We'll build a comprehensive credit scoring model using a synthetic dataset that mimics real-world credit data. This example demonstrates data preparation, handling class imbalance, model training, evaluation, and interpretation.
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import (classification_report, confusion_matrix,
                             roc_curve, roc_auc_score, precision_recall_curve,
                             f1_score, accuracy_score,
                             precision_score, recall_score)  # needed for the threshold analysis in Step 6
from imblearn.over_sampling import SMOTE
from imblearn.combine import SMOTETomek
import warnings
warnings.filterwarnings('ignore')
# Set style for better visualizations
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)
print("Libraries imported successfully!")
Step 1: Generate Synthetic Credit Scoring Dataset
# Set random seed for reproducibility
np.random.seed(42)
# Generate synthetic credit data
n_samples = 10000
# Create features
data = {
    'age': np.random.randint(18, 70, n_samples),
    'income': np.random.gamma(shape=2, scale=25000, size=n_samples),  # Right-skewed income
    'credit_history_length': np.random.randint(0, 30, n_samples),  # Years
    'num_credit_lines': np.random.poisson(lam=3, size=n_samples),
    'debt_to_income_ratio': np.random.beta(a=2, b=5, size=n_samples),  # Typically < 0.5
    'num_late_payments': np.random.poisson(lam=1, size=n_samples),
    'credit_utilization': np.random.beta(a=2, b=3, size=n_samples),  # 0 to 1
    'num_inquiries_6m': np.random.poisson(lam=1, size=n_samples),
    'loan_amount': np.random.gamma(shape=2, scale=10000, size=n_samples),
    'employment_length': np.random.randint(0, 25, n_samples),
}
df = pd.DataFrame(data)
# Create target variable (default) based on realistic risk factors
# Higher risk of default with: low income, high debt ratio, late payments, high utilization
risk_score = (
    -0.00001 * df['income'] +
    0.5 * df['debt_to_income_ratio'] +
    0.3 * df['num_late_payments'] +
    0.4 * df['credit_utilization'] +
    0.1 * df['num_inquiries_6m'] +
    -0.02 * df['credit_history_length'] +
    -0.01 * df['employment_length'] +
    np.random.normal(0, 0.3, n_samples)  # Add noise
)
# Convert risk score to probability and then to binary outcome
default_probability = 1 / (1 + np.exp(-risk_score))
df['default'] = (default_probability > 0.7).astype(int) # Create imbalance
# Add some categorical features
df['home_ownership'] = np.random.choice(['RENT', 'OWN', 'MORTGAGE'], n_samples, p=[0.3, 0.2, 0.5])
df['loan_purpose'] = np.random.choice(['debt_consolidation', 'credit_card', 'home_improvement',
'major_purchase', 'other'], n_samples)
print(f"Dataset shape: {df.shape}")
print(f"\nFirst few rows:")
print(df.head())
print(f"\nClass distribution:")
print(df['default'].value_counts())
print(f"\nDefault rate: {df['default'].mean():.2%}")
Step 2: Exploratory Data Analysis (EDA)
# Create comprehensive EDA visualizations
fig, axes = plt.subplots(3, 3, figsize=(18, 15))
fig.suptitle('Credit Scoring Dataset: Exploratory Data Analysis', fontsize=16, fontweight='bold')
# 1. Class distribution
ax = axes[0, 0]
df['default'].value_counts().plot(kind='bar', ax=ax, color=['#2ecc71', '#e74c3c'])
ax.set_title('Class Distribution', fontweight='bold')
ax.set_xlabel('Default Status')
ax.set_ylabel('Count')
ax.set_xticklabels(['No Default (0)', 'Default (1)'], rotation=0)
for container in ax.containers:
    ax.bar_label(container)
# 2. Income distribution by default status
ax = axes[0, 1]
df.boxplot(column='income', by='default', ax=ax)
ax.set_title('Income Distribution by Default Status', fontweight='bold')
ax.set_xlabel('Default Status')
ax.set_ylabel('Income ($)')
plt.sca(ax)
plt.xticks([1, 2], ['No Default', 'Default'])
# 3. Debt-to-Income Ratio by default status
ax = axes[0, 2]
df.boxplot(column='debt_to_income_ratio', by='default', ax=ax)
ax.set_title('Debt-to-Income Ratio by Default Status', fontweight='bold')
ax.set_xlabel('Default Status')
ax.set_ylabel('Debt-to-Income Ratio')
plt.sca(ax)
plt.xticks([1, 2], ['No Default', 'Default'])
# 4. Credit utilization by default status
ax = axes[1, 0]
df.boxplot(column='credit_utilization', by='default', ax=ax)
ax.set_title('Credit Utilization by Default Status', fontweight='bold')
ax.set_xlabel('Default Status')
ax.set_ylabel('Credit Utilization')
plt.sca(ax)
plt.xticks([1, 2], ['No Default', 'Default'])
# 5. Number of late payments
ax = axes[1, 1]
df.boxplot(column='num_late_payments', by='default', ax=ax)
ax.set_title('Late Payments by Default Status', fontweight='bold')
ax.set_xlabel('Default Status')
ax.set_ylabel('Number of Late Payments')
plt.sca(ax)
plt.xticks([1, 2], ['No Default', 'Default'])
# 6. Age distribution
ax = axes[1, 2]
df[df['default']==0]['age'].hist(bins=20, alpha=0.5, label='No Default', ax=ax, color='#2ecc71')
df[df['default']==1]['age'].hist(bins=20, alpha=0.5, label='Default', ax=ax, color='#e74c3c')
ax.set_title('Age Distribution by Default Status', fontweight='bold')
ax.set_xlabel('Age')
ax.set_ylabel('Frequency')
ax.legend()
# 7. Correlation heatmap
ax = axes[2, 0]
numeric_cols = df.select_dtypes(include=[np.number]).columns
corr_matrix = df[numeric_cols].corr()
sns.heatmap(corr_matrix[['default']].sort_values(by='default', ascending=False),
annot=True, fmt='.2f', cmap='RdYlGn_r', center=0, ax=ax, cbar_kws={'label': 'Correlation'})
ax.set_title('Feature Correlation with Default', fontweight='bold')
# 8. Home ownership distribution
ax = axes[2, 1]
pd.crosstab(df['home_ownership'], df['default'], normalize='index').plot(kind='bar', ax=ax,
color=['#2ecc71', '#e74c3c'])
ax.set_title('Default Rate by Home Ownership', fontweight='bold')
ax.set_xlabel('Home Ownership')
ax.set_ylabel('Proportion')
ax.legend(['No Default', 'Default'])
ax.set_xticklabels(ax.get_xticklabels(), rotation=45)
# 9. Loan purpose distribution
ax = axes[2, 2]
pd.crosstab(df['loan_purpose'], df['default'], normalize='index').plot(kind='bar', ax=ax,
color=['#2ecc71', '#e74c3c'])
ax.set_title('Default Rate by Loan Purpose', fontweight='bold')
ax.set_xlabel('Loan Purpose')
ax.set_ylabel('Proportion')
ax.legend(['No Default', 'Default'])
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha='right')
plt.tight_layout()
plt.show()
# Print summary statistics
print("\n" + "="*60)
print("SUMMARY STATISTICS BY DEFAULT STATUS")
print("="*60)
print(df.groupby('default')[['income', 'debt_to_income_ratio', 'credit_utilization',
'num_late_payments', 'credit_history_length']].mean())
Output
============================================================
SUMMARY STATISTICS BY DEFAULT STATUS
============================================================
income debt_to_income_ratio credit_utilization \
default
0 51044.020129 0.283362 0.395485
1 24959.954392 0.329210 0.449313
num_late_payments credit_history_length
default
0 0.918771 14.773282
1 2.833333 9.806548
Step 3: Data Preprocessing
# Encode categorical variables
df_encoded = pd.get_dummies(df, columns=['home_ownership', 'loan_purpose'], drop_first=True)
# Separate features and target
X = df_encoded.drop('default', axis=1)
y = df_encoded['default']
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
print(f"Training set size: {X_train.shape}")
print(f"Test set size: {X_test.shape}")
print(f"\nTraining set class distribution:")
print(y_train.value_counts())
print(f"Default rate in training set: {y_train.mean():.2%}")
# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Convert back to DataFrame for easier handling
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns, index=X_train.index)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_test.columns, index=X_test.index)
print("\nData preprocessing completed!")
Output
Training set size: (8000, 16)
Test set size: (2000, 16)
Training set class distribution:
default
0 7731
1 269
Name: count, dtype: int64
Default rate in training set: 3.36%
Step 4: Handle Class Imbalance with SMOTE
# Visualize class imbalance before and after SMOTE
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
# Original distribution
ax = axes[0]
y_train.value_counts().plot(kind='bar', ax=ax, color=['#2ecc71', '#e74c3c'])
ax.set_title('Original Training Set\nClass Distribution', fontweight='bold', fontsize=12)
ax.set_xlabel('Default Status')
ax.set_ylabel('Count')
ax.set_xticklabels(['No Default (0)', 'Default (1)'], rotation=0)
for container in ax.containers:
    ax.bar_label(container)
# Apply SMOTE
smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train_scaled, y_train)
# SMOTE distribution
ax = axes[1]
pd.Series(y_train_smote).value_counts().plot(kind='bar', ax=ax, color=['#2ecc71', '#e74c3c'])
ax.set_title('After SMOTE\nClass Distribution', fontweight='bold', fontsize=12)
ax.set_xlabel('Default Status')
ax.set_ylabel('Count')
ax.set_xticklabels(['No Default (0)', 'Default (1)'], rotation=0)
for container in ax.containers:
    ax.bar_label(container)
# Apply SMOTE-Tomek
smote_tomek = SMOTETomek(random_state=42)
X_train_smote_tomek, y_train_smote_tomek = smote_tomek.fit_resample(X_train_scaled, y_train)
# SMOTE-Tomek distribution
ax = axes[2]
pd.Series(y_train_smote_tomek).value_counts().plot(kind='bar', ax=ax, color=['#2ecc71', '#e74c3c'])
ax.set_title('After SMOTE-Tomek\nClass Distribution', fontweight='bold', fontsize=12)
ax.set_xlabel('Default Status')
ax.set_ylabel('Count')
ax.set_xticklabels(['No Default (0)', 'Default (1)'], rotation=0)
for container in ax.containers:
    ax.bar_label(container)
plt.tight_layout()
plt.show()
print(f"Original training set: {len(y_train)} samples")
print(f"After SMOTE: {len(y_train_smote)} samples")
print(f"After SMOTE-Tomek: {len(y_train_smote_tomek)} samples")
Output
Original training set: 8000 samples
After SMOTE: 15462 samples
After SMOTE-Tomek: 15460 samples
Step 5: Train Multiple Classification Models
# Define models
models = {
'Logistic Regression': LogisticRegression(random_state=42, max_iter=1000),
'Logistic Regression (Balanced)': LogisticRegression(random_state=42, max_iter=1000, class_weight='balanced'),
'Decision Tree': DecisionTreeClassifier(random_state=42, max_depth=5),
'Random Forest': RandomForestClassifier(random_state=42, n_estimators=100),
'Gradient Boosting': GradientBoostingClassifier(random_state=42, n_estimators=100)
}
# Train models on original data
results_original = {}
for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    y_pred_proba = model.predict_proba(X_test_scaled)[:, 1]
    results_original[name] = {
        'model': model,
        'y_pred': y_pred,
        'y_pred_proba': y_pred_proba,
        'accuracy': accuracy_score(y_test, y_pred),
        'f1': f1_score(y_test, y_pred),
        'auc': roc_auc_score(y_test, y_pred_proba)
    }
# Train models on SMOTE data
results_smote = {}
for name, model in models.items():
    if 'Balanced' in name:  # Skip balanced version for SMOTE
        continue
    model_smote = type(model)(**model.get_params())  # Create new, unfitted instance
    model_smote.fit(X_train_smote, y_train_smote)
    y_pred = model_smote.predict(X_test_scaled)
    y_pred_proba = model_smote.predict_proba(X_test_scaled)[:, 1]
    results_smote[name + ' (SMOTE)'] = {
        'model': model_smote,
        'y_pred': y_pred,
        'y_pred_proba': y_pred_proba,
        'accuracy': accuracy_score(y_test, y_pred),
        'f1': f1_score(y_test, y_pred),
        'auc': roc_auc_score(y_test, y_pred_proba)
    }
# Combine results
all_results = {**results_original, **results_smote}
# Create comparison DataFrame
comparison_df = pd.DataFrame({
    name: {
        'Accuracy': results['accuracy'],
        'F1-Score': results['f1'],
        'AUC': results['auc']
    }
    for name, results in all_results.items()
}).T.sort_values('F1-Score', ascending=False)
print("\n" + "="*80)
print("MODEL PERFORMANCE COMPARISON")
print("="*80)
print(comparison_df.round(4))
Output:
================================================================================
MODEL PERFORMANCE COMPARISON
================================================================================
Accuracy F1-Score AUC
Logistic Regression 0.9785 0.6195 0.9712
Gradient Boosting 0.9775 0.5872 0.9489
Gradient Boosting (SMOTE) 0.9605 0.5434 0.9575
Random Forest (SMOTE) 0.9680 0.5152 0.9488
Decision Tree 0.9710 0.4630 0.8939
Logistic Regression (SMOTE) 0.9080 0.3987 0.9720
Random Forest 0.9725 0.3956 0.9395
Logistic Regression (Balanced) 0.8970 0.3758 0.9717
Decision Tree (SMOTE) 0.9020 0.3423 0.8957
Step 6: Detailed Evaluation of Best Model
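# Select the model to evaluate in detail (Random Forest with SMOTE).
# Note: in the comparison above, plain Logistic Regression has the highest F1-score;
# change best_model_name to evaluate a different entry of all_results.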
best_model_name = 'Random Forest (SMOTE)'
best_model = all_results[best_model_name]['model']
y_pred_best = all_results[best_model_name]['y_pred']
y_pred_proba_best = all_results[best_model_name]['y_pred_proba']
# Create comprehensive evaluation plots
fig = plt.figure(figsize=(20, 12))
gs = fig.add_gridspec(3, 3, hspace=0.3, wspace=0.3)
# 1. Confusion Matrix
ax1 = fig.add_subplot(gs[0, 0])
cm = confusion_matrix(y_test, y_pred_best)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=ax1, cbar_kws={'label': 'Count'})
ax1.set_title('Confusion Matrix\n(Random Forest with SMOTE)', fontweight='bold', fontsize=12)
ax1.set_ylabel('Actual')
ax1.set_xlabel('Predicted')
ax1.set_xticklabels(['No Default', 'Default'])
ax1.set_yticklabels(['No Default', 'Default'])
# 2. ROC Curve
ax2 = fig.add_subplot(gs[0, 1])
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba_best)
auc_score = roc_auc_score(y_test, y_pred_proba_best)
ax2.plot(fpr, tpr, linewidth=2, label=f'ROC Curve (AUC = {auc_score:.3f})', color='#3498db')
ax2.plot([0, 1], [0, 1], 'k--', linewidth=1, label='Random Classifier')
ax2.set_xlabel('False Positive Rate')
ax2.set_ylabel('True Positive Rate (Recall)')
ax2.set_title('ROC Curve', fontweight='bold', fontsize=12)
ax2.legend()
ax2.grid(alpha=0.3)
# 3. Precision-Recall Curve
ax3 = fig.add_subplot(gs[0, 2])
precision, recall, thresholds_pr = precision_recall_curve(y_test, y_pred_proba_best)
ax3.plot(recall, precision, linewidth=2, color='#e74c3c')
ax3.set_xlabel('Recall')
ax3.set_ylabel('Precision')
ax3.set_title('Precision-Recall Curve', fontweight='bold', fontsize=12)
ax3.grid(alpha=0.3)
# 4. Feature Importance
ax4 = fig.add_subplot(gs[1, :])
feature_importance = pd.DataFrame({
'feature': X_train.columns,
'importance': best_model.feature_importances_
}).sort_values('importance', ascending=False).head(15)
sns.barplot(data=feature_importance, x='importance', y='feature', ax=ax4, palette='viridis')
ax4.set_title('Top 15 Feature Importances', fontweight='bold', fontsize=12)
ax4.set_xlabel('Importance')
ax4.set_ylabel('Feature')
# 5. Prediction Distribution
ax5 = fig.add_subplot(gs[2, 0])
ax5.hist(y_pred_proba_best[y_test==0], bins=50, alpha=0.6, label='No Default (Actual)', color='#2ecc71')
ax5.hist(y_pred_proba_best[y_test==1], bins=50, alpha=0.6, label='Default (Actual)', color='#e74c3c')
ax5.axvline(0.5, color='black', linestyle='--', linewidth=2, label='Decision Threshold')
ax5.set_xlabel('Predicted Probability of Default')
ax5.set_ylabel('Frequency')
ax5.set_title('Prediction Distribution by Actual Class', fontweight='bold', fontsize=12)
ax5.legend()
# 6. Threshold Analysis
ax6 = fig.add_subplot(gs[2, 1])
thresholds_analysis = np.linspace(0, 1, 100)
precision_scores = []
recall_scores = []
f1_scores = []
for threshold in thresholds_analysis:
    y_pred_threshold = (y_pred_proba_best >= threshold).astype(int)
    precision_scores.append(precision_score(y_test, y_pred_threshold, zero_division=0))
    recall_scores.append(recall_score(y_test, y_pred_threshold, zero_division=0))
    f1_scores.append(f1_score(y_test, y_pred_threshold, zero_division=0))
ax6.plot(thresholds_analysis, precision_scores, label='Precision', linewidth=2, color='#3498db')
ax6.plot(thresholds_analysis, recall_scores, label='Recall', linewidth=2, color='#e74c3c')
ax6.plot(thresholds_analysis, f1_scores, label='F1-Score', linewidth=2, color='#2ecc71')
ax6.axvline(0.5, color='black', linestyle='--', linewidth=1, alpha=0.5)
ax6.set_xlabel('Classification Threshold')
ax6.set_ylabel('Score')
ax6.set_title('Metrics vs. Classification Threshold', fontweight='bold', fontsize=12)
ax6.legend()
ax6.grid(alpha=0.3)
# 7. Classification Report
ax7 = fig.add_subplot(gs[2, 2])
ax7.axis('off')
report = classification_report(y_test, y_pred_best, target_names=['No Default', 'Default'], output_dict=True)
n_total = report['No Default']['support'] + report['Default']['support']
report_text = f"""
Classification Report:
              precision    recall  f1-score   support
No Default    {report['No Default']['precision']:>9.2f} {report['No Default']['recall']:>9.2f} {report['No Default']['f1-score']:>9.2f} {report['No Default']['support']:>9.0f}
Default       {report['Default']['precision']:>9.2f} {report['Default']['recall']:>9.2f} {report['Default']['f1-score']:>9.2f} {report['Default']['support']:>9.0f}
accuracy      {'':>9} {'':>9} {report['accuracy']:>9.2f} {n_total:>9.0f}
macro avg     {report['macro avg']['precision']:>9.2f} {report['macro avg']['recall']:>9.2f} {report['macro avg']['f1-score']:>9.2f} {n_total:>9.0f}
weighted avg  {report['weighted avg']['precision']:>9.2f} {report['weighted avg']['recall']:>9.2f} {report['weighted avg']['f1-score']:>9.2f} {n_total:>9.0f}
"""
ax7.text(0.1, 0.5, report_text, fontsize=10, family='monospace', verticalalignment='center')
ax7.set_title('Detailed Classification Report', fontweight='bold', fontsize=12)
plt.suptitle('Comprehensive Model Evaluation: Random Forest with SMOTE',
fontsize=16, fontweight='bold', y=0.995)
plt.show()
# Print detailed metrics
print("\n" + "="*80)
print("DETAILED EVALUATION METRICS")
print("="*80)
print(f"\nConfusion Matrix:")
print(cm)
print(f"\nTrue Negatives: {cm[0,0]}")
print(f"False Positives: {cm[0,1]}")
print(f"False Negatives: {cm[1,0]}")
print(f"True Positives: {cm[1,1]}")
print(f"\nAccuracy: {accuracy_score(y_test, y_pred_best):.4f}")
print(f"Precision: {precision_score(y_test, y_pred_best):.4f}")
print(f"Recall: {recall_score(y_test, y_pred_best):.4f}")
print(f"F1-Score: {f1_score(y_test, y_pred_best):.4f}")
print(f"AUC-ROC: {auc_score:.4f}")
================================================================================
DETAILED EVALUATION METRICS
================================================================================
Confusion Matrix:
[[1902 31]
[ 33 34]]
True Negatives: 1902
False Positives: 31
False Negatives: 33
True Positives: 34
Accuracy: 0.9680
Precision: 0.5231
Recall: 0.5075
F1-Score: 0.5152
AUC-ROC: 0.9488
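To see where these figures come from, the metrics can be recomputed by hand from the four confusion-matrix cells. The sketch below uses only the counts printed above. The baseline_accuracy line is an added illustration, not part of the original pipeline: it shows what a rule that predicts "No Default" for every applicant would score on the same test set.

```python
# Confusion-matrix cells from the output above (rows = actual, columns = predicted)
tn, fp, fn, tp = 1902, 31, 33, 34
total = tn + fp + fn + tp

accuracy = (tp + tn) / total          # share of all predictions that are correct
precision = tp / (tp + fp)            # share of predicted defaults that really default
recall = tp / (tp + fn)               # share of actual defaults that are caught
f1 = 2 * precision * recall / (precision + recall)

# Illustrative baseline: predict "No Default" for everyone (no positives at all)
baseline_accuracy = (tn + fp) / total

print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1-Score:  {f1:.4f}")
print(f"Accuracy of always predicting 'No Default': {baseline_accuracy:.4f}")
```

The baseline reaches 96.65% accuracy without catching a single default, almost the same as the model's 96.80%. That is why precision, recall, and F1 on the default class are the figures to watch for this dataset.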
Step 7: Business Interpretation
# Create a business-focused summary
print("\n" + "="*80)
print("BUSINESS INSIGHTS AND RECOMMENDATIONS")
print("="*80)
# Calculate business metrics
total_loans = len(y_test)
actual_defaults = y_test.sum()
predicted_defaults = y_pred_best.sum()
true_positives = cm[1,1]
false_positives = cm[0,1]
false_negatives = cm[1,0]
avg_loan_amount = df['loan_amount'].mean()
estimated_loss_per_default = avg_loan_amount * 0.5 # Assume 50% loss on default
# Financial impact
prevented_losses = true_positives * estimated_loss_per_default
missed_losses = false_negatives * estimated_loss_per_default
opportunity_cost = false_positives * (avg_loan_amount * 0.05) # Assume 5% profit margin
net_benefit = prevented_losses - missed_losses - opportunity_cost
print(f"\n1. MODEL PERFORMANCE SUMMARY:")
print(f" - Total loans evaluated: {total_loans:,}")
print(f" - Actual defaults: {actual_defaults} ({actual_defaults/total_loans:.1%})")
print(f" - Predicted defaults: {predicted_defaults}")
print(f" - Correctly identified defaults: {true_positives} ({true_positives/actual_defaults:.1%} recall)")
print(f" - Missed defaults: {false_negatives}")
print(f" - False alarms: {false_positives}")
print(f"\n2. FINANCIAL IMPACT (Estimated):")
print(f" - Average loan amount: ${avg_loan_amount:,.2f}")
print(f" - Estimated loss per default: ${estimated_loss_per_default:,.2f}")
print(f" - Prevented losses: ${prevented_losses:,.2f}")
print(f" - Missed losses: ${missed_losses:,.2f}")
print(f" - Opportunity cost (rejected good loans): ${opportunity_cost:,.2f}")
print(f" - Net benefit: ${net_benefit:,.2f}")
print(f"\n3. KEY RISK FACTORS (Top 5):")
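# Number the factors by rank; the DataFrame index still holds the original column positions
for rank, (_, row) in enumerate(feature_importance.head(5).iterrows(), start=1):
    print(f" {rank}. {row['feature']}: {row['importance']:.4f}")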
print(f"\n4. RECOMMENDATIONS:")
print(f" - The model achieves {recall_score(y_test, y_pred_best):.1%} recall, so {false_negatives} of {actual_defaults} actual defaults are still missed")
print(f" - Precision of {precision_score(y_test, y_pred_best):.1%} means {false_positives} good applicants were rejected")
print(f" - Consider adjusting the threshold based on business risk tolerance")
print(f" - Focus on top risk factors for manual review of borderline cases")
print(f" - Regularly retrain the model as new data becomes available")
================================================================================
BUSINESS INSIGHTS AND RECOMMENDATIONS
================================================================================
1. MODEL PERFORMANCE SUMMARY:
- Total loans evaluated: 2,000
- Actual defaults: 67 (3.4%)
- Predicted defaults: 65
- Correctly identified defaults: 34 (50.7% recall)
- Missed defaults: 33
- False alarms: 31
2. FINANCIAL IMPACT (Estimated):
- Average loan amount: $19,991.66
- Estimated loss per default: $9,995.83
- Prevented losses: $339,858.24
- Missed losses: $329,862.41
- Opportunity cost (rejected good loans): $30,987.07
- Net benefit: $-20,991.24
3. KEY RISK FACTORS (Top 5):
1. num_late_payments: 0.5007
2. income: 0.1509
3. num_inquiries_6m: 0.0762
4. credit_history_length: 0.0678
5. employment_length: 0.0377
4. RECOMMENDATIONS:
- The model achieves 50.7% recall, so 33 of 67 actual defaults are still missed
- Precision of 52.3% means 31 good applicants were rejected
- Consider adjusting the threshold based on business risk tolerance
- Focus on top risk factors for manual review of borderline cases
- Regularly retrain the model as new data becomes available
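The recommendation to adjust the threshold can be made concrete by sweeping candidate cutoffs and scoring each one with the same simplified cost model used above (prevented losses minus missed losses minus opportunity cost). The following is a minimal sketch under stated assumptions: the scores are synthetic stand-ins for y_test and y_pred_proba_best, the dollar amounts are rounded versions of the estimates above, and net_benefit is a hypothetical helper rather than a function from the chapter's pipeline.

```python
import random

def net_benefit(y_true, y_score, threshold, loss_per_default, profit_per_good_loan):
    """Net benefit of flagging every score >= threshold, under the simplified cost model."""
    tp = fp = fn = 0
    for label, score in zip(y_true, y_score):
        flagged = score >= threshold
        if flagged and label == 1:
            tp += 1          # default correctly rejected
        elif flagged and label == 0:
            fp += 1          # good applicant rejected
        elif not flagged and label == 1:
            fn += 1          # default approved
    prevented = tp * loss_per_default
    missed = fn * loss_per_default
    opportunity = fp * profit_per_good_loan
    return prevented - missed - opportunity

# Synthetic data for illustration only; NOT the chapter's test set.
rng = random.Random(42)
y_true = [1 if rng.random() < 0.035 else 0 for _ in range(2000)]
y_score = [min(1.0, max(0.0, rng.gauss(0.55 if y else 0.15, 0.15))) for y in y_true]

loss_per_default = 10_000.0      # assumption: about 50% of a $20,000 loan
profit_per_good_loan = 1_000.0   # assumption: about 5% margin on a $20,000 loan

# Evaluate a grid of thresholds (the grid includes the default cutoff of 0.50)
thresholds = [i / 100 for i in range(1, 100)]
results = [
    (net_benefit(y_true, y_score, t, loss_per_default, profit_per_good_loan), t)
    for t in thresholds
]
best_net, best_threshold = max(results)
default_net = net_benefit(y_true, y_score, 0.5, loss_per_default, profit_per_good_loan)

print(f"Net benefit at threshold 0.50: ${default_net:,.2f}")
print(f"Best threshold on this grid: {best_threshold:.2f} (net benefit ${best_net:,.2f})")
```

Because a missed default is assumed to cost ten times as much as a rejected good loan, the best cutoff under this cost model will usually sit below 0.5. In practice the threshold should be chosen on a validation set, not the final test set, so that the reported test metrics remain an honest estimate.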
AI Prompt for Further Learning:
"I've built a Random Forest model for credit scoring with about 51% recall and 52% precision on the default class. The business wants to reduce false positives (rejected good applicants) without significantly increasing defaults. What strategies can I use to optimize this trade-off?"
Exercises
Exercise 1: Formulate a Churn Prediction Problem
Task: You are a data analyst at a telecommunications company. Formulate a customer churn prediction problem by defining:
- Target variable: What constitutes "churn" in this context?
- Features: List at least 10 features you would collect to predict churn
- Evaluation metric: Which metric(s) would you prioritize and why?
- Business objective: How would you measure the success of this model in business terms?
Hint: Consider that retaining a customer costs less than acquiring a new one, and different customer segments have different lifetime values.
Exercise 2: Implement Logistic Regression for Binary Classification
Task: Using the credit scoring dataset from Section 10.6 (or a similar dataset of your choice):
- Train a logistic regression model on the original (imbalanced) data
- Train another logistic regression model with class_weight='balanced'
- Compare the two models using precision, recall, F1-score, and AUC
- Interpret the coefficients: Which features have the strongest positive and negative effects on default probability?
- Create a visualization showing the top 10 most important features
Bonus: Experiment with L1 (Lasso) and L2 (Ridge) regularization and observe the effect on coefficients.
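Before running the comparison, it helps to know what class_weight='balanced' actually does. scikit-learn documents the weight for each class as n_samples / (n_classes * count_of_that_class). The sketch below reproduces that formula in plain Python; balanced_class_weights is a hypothetical helper name, and the class counts are borrowed from the test set in Section 10.6 purely for illustration (the training split will have slightly different counts).

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weights per class: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Illustrative label vector: 1,933 non-defaults and 67 defaults
labels = [0] * 1933 + [1] * 67
weights = balanced_class_weights(labels)

for cls in sorted(weights):
    print(f"class {cls}: weight {weights[cls]:.4f}")
```

With this class mix, each default counts roughly 29 times as much as each non-default in the loss function, which pushes the model toward higher recall on defaults, typically at the cost of lower precision.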
Exercise 3: Compare Decision Tree and Logistic Regression
Task: Train both a decision tree and logistic regression model on the same dataset:
- Evaluate both models using a confusion matrix, ROC curve, and classification report
- Visualize the decision tree (limit depth to 3-4 for interpretability)
- Compare the models in terms of:
  - Accuracy and F1-score
  - Interpretability: Which model is easier to explain to non-technical stakeholders?
  - Overfitting: Use cross-validation to assess generalization
- Write a brief report (200-300 words) recommending which model to deploy and why
Hint: Consider the trade-off between performance and interpretability in a regulated industry like banking.
Exercise 4: Analyze the Impact of Class Imbalance
Task: Using the credit scoring dataset:
- Train a Random Forest model on the original imbalanced data
- Apply SMOTE and train another Random Forest model
- Apply SMOTE-Tomek and train a third Random Forest model
- Compare all three models using:
  - Confusion matrices
  - Precision, recall, and F1-score for both classes
  - ROC curves on the same plot
- Calculate the cost-sensitive performance: Assume that missing a default costs $10,000, while rejecting a good applicant costs $500. Which model minimizes total cost?
Bonus: Experiment with different SMOTE parameters (e.g., k_neighbors) and observe the effect on model performance.
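As a starting point for the cost-sensitive comparison, the total cost of a model can be computed directly from its false negatives and false positives. The sketch below is one possible helper (total_misclassification_cost is a hypothetical name). It is applied to the confusion matrix from Section 10.6 and to two trivial policies, so you have reference points before plugging in the counts from your own three models.

```python
def total_misclassification_cost(fn, fp, cost_fn=10_000, cost_fp=500):
    """Total cost given missed defaults (fn) and rejected good applicants (fp)."""
    return fn * cost_fn + fp * cost_fp

# Counts from the Section 10.6 test set: 1,933 good applicants, 67 defaults
scenarios = {
    "Random Forest with SMOTE (Section 10.6)": {"fn": 33, "fp": 31},
    "Approve everyone": {"fn": 67, "fp": 0},
    "Reject everyone": {"fn": 0, "fp": 1933},
}

costs = {name: total_misclassification_cost(**counts) for name, counts in scenarios.items()}

for name, cost in sorted(costs.items(), key=lambda item: item[1]):
    print(f"{name}: ${cost:,}")
```

Any model worth deploying should cost less than both trivial policies. Repeat the calculation for each of your three Random Forest variants and compare the totals.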
Summary
In this chapter, we explored classification models for business decision-making:
- Business Applications: Churn prediction, fraud detection, credit scoring, and marketing response
- Basic Algorithms: Logistic regression (interpretable, probabilistic) and decision trees (non-linear, visual)
- Advanced Algorithms: Random Forests and Gradient Boosting (strong predictive performance on tabular data), Neural Networks (for complex patterns)
- Class Imbalance: Techniques like SMOTE, SMOTE-Tomek, class weights, and threshold adjustment
- Evaluation Metrics: Confusion matrix, precision, recall, F1-score, and AUC-ROC
- Python Implementation: Complete credit scoring example with EDA, preprocessing, modeling, and business interpretation
Key Takeaways:
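- Accuracy is not enough for imbalanced datasets: the credit scoring model reached 96.8% accuracy while catching only 50.7% of defaults, so report precision, recall, and F1-score as well
- SMOTE and ensemble methods can improve minority class detection, but the gain should be verified on held-out test data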
- Feature importance helps identify key risk factors and guide business strategy
- Model interpretability matters in regulated industries and for stakeholder buy-in
- Business context should drive metric selection and threshold tuning
In the next chapter, we'll explore regression models for predicting continuous outcomes like sales, prices, and customer lifetime value.