Home All Chapters Previous Next

Chapter 8. Data Preparation and Feature Engineering in Python

"Garbage in, garbage out."

This old adage is especially true in analytics. You can have the most sophisticated machine learning algorithm, but if your data is messy, incomplete, or poorly structured, your results will be worthless.

Data scientists often say they spend 80% of their time on data preparation and only 20% on modeling. This isn't a sign of inefficiency—it's the reality of working with real-world data. The quality of your data preparation directly determines the quality of your insights.

This chapter covers the practical skills you need to transform raw, messy data into clean, structured datasets ready for analysis. We'll focus on Python and pandas, the standard tools for data preparation in business analytics.

8.1 The Importance of Data Preparation in Analytics Projects

Why Data Preparation Matters

Raw data is rarely analysis-ready. Real-world datasets have:

The consequences of poor data preparation:

  1. Wrong conclusions : Outliers skew averages, missing data creates bias
  2. Model failures : Algorithms break on missing values or inconsistent formats
  3. Wasted time : Debugging model problems that are actually data problems
  4. Lost trust : Stakeholders lose confidence when they spot obvious data errors

The benefits of good data preparation:

  1. Better models : Clean data → better predictions
  2. Faster iteration : Automated pipelines let you test ideas quickly
  3. Reproducibility : Documented processes can be repeated and audited
  4. Trust : Stakeholders trust analysis built on clean, well-understood data

The Data Preparation Workflow

RAW DATA

    ↓

1. UNDERSTAND

   • What does each column mean?

   • What's the grain of the data?

   • What are the data types?

    ↓

2. CLEAN

   • Handle missing values

   • Remove/fix outliers

   • Fix inconsistencies

   • Remove duplicates

    ↓

3. TRANSFORM

   • Scale/normalize

   • Encode categories

   • Parse dates

    ↓

4. ENGINEER FEATURES

   • Create new variables

   • Aggregate information

   • Extract patterns

    ↓

5. VALIDATE

   • Check distributions

   • Verify logic

   • Test for leakage

    ↓

ANALYSIS-READY DATA

Key principle : Understand before you clean.  Don't blindly apply transformations. First understand what the data represents, then decide how to prepare it.

8.2 Data Cleaning

8.2.1 Handling Missing Data

Missing data is inevitable.  The question is: how do you handle it?

Types of missing data:

  1. Missing Completely at Random (MCAR) : Missingness has no relationship to any variable
  1. Missing at Random (MAR) : Missingness related to observed variables, not the missing value itself
  1. Missing Not at Random (MNAR) : Missingness related to the unobserved value

Why this matters : The type of missingness determines the best handling strategy.

Strategies for Handling Missing Data

1. Delete rows with missing values

import pandas as pd
import numpy as np

# Sample data
df = pd.DataFrame({
   'customer_id': [1, 2, 3, 4, 5],
   'age': [25, 30, np.nan, 45, 28],
   'income': [50000, np.nan, 75000, 90000, 60000],
   'purchases': [5, 3, 8, np.nan, 4]
})

# Drop any row with missing values
df_complete = df.dropna()
print(f"Original rows: {len(df)}, After dropna: {len(df_complete)}")
# Output: Original rows: 5, After dropna: 2

# Drop rows where specific columns are missing
df_age_complete = df.dropna(subset=['age'])
print(f"Rows with age: {len(df_age_complete)}")

# Output: Rows with age: 4

When to use:

When NOT to use:


2. Impute with simple statistics

# Mean imputation
df['age_imputed'] = df['age'].fillna(df['age'].mean())

# Median imputation (better for skewed data)
df['income_imputed'] = df['income'].fillna(df['income'].median())

# Mode imputation (for categorical data)
df['category'] = ['A', 'B', np.nan, 'A', 'B']
df['category_imputed'] = df['category'].fillna(df['category'].mode()[0])
print(df[['age', 'age_imputed', 'income', 'income_imputed']])

When to use:

Caution :


3. Forward fill / Backward fill (for time series)

# Time series data
df_ts = pd.DataFrame({
   'date': pd.date_range('2024-01-01', periods=5),
   'temperature': [20, np.nan, np.nan, 23, 24]
})

# Forward fill: use last known value
df_ts['temp_ffill'] = df_ts['temperature'].fillna(method='ffill')

# Backward fill: use next known value
df_ts['temp_bfill'] = df_ts['temperature'].fillna(method='bfill')
print(df_ts)

Output:

       date  temperature  temp_ffill  temp_bfill

0 2024-01-01         20.0        20.0        20.0

1 2024-01-02          NaN        20.0        23.0

2 2024-01-03          NaN        20.0        23.0

3 2024-01-04         23.0        23.0        23.0

4 2024-01-05         24.0        24.0        24.0

When to use:

4. Impute with predictive models

from sklearn.impute import KNNImputer

# Use K-Nearest Neighbors to impute based on similar records
imputer = KNNImputer(n_neighbors=3)
df_numeric = df[['age', 'income', 'purchases']]
df_imputed = pd.DataFrame(
   imputer.fit_transform(df_numeric),
   columns=df_numeric.columns
)

print("Original:\n", df_numeric)
print("\nImputed:\n", df_imputed)

When to use:

Other sophisticated methods:

8.2.2 Detecting and Treating Outliers

Outliers  are data points that are significantly different from others.

Two types:

  1. Errors : Data entry mistakes, measurement errors, system glitches
  1. Legitimate extremes : Real but unusual values

Detecting Outliers

Method 1: Visual inspection

import matplotlib.pyplot as plt

# Sample data with outliers

data = pd.DataFrame({

    'salary': [50000, 55000, 52000, 48000, 51000, 53000, 500000, 49000, 54000, 50000]

})

# Box plot

plt.figure(figsize=(10, 4))

plt.subplot(1, 2, 1)

plt.boxplot(data['salary'])

plt.title('Box Plot')

plt.ylabel('Salary')

plt.subplot(1, 2, 2)

plt.hist(data['salary'], bins=20, edgecolor='black')

plt.title('Histogram')

plt.xlabel('Salary')

plt.ylabel('Frequency')

plt.tight_layout()

plt.show()

Method 2: Statistical methods

IQR (Interquartile Range) method:

def detect_outliers_iqr(df, column):
   """Detect outliers using IQR method"""
   Q1 = df[column].quantile(0.25)
   Q3 = df[column].quantile(0.75)
   IQR = Q3 - Q1    

    lower_bound = Q1 - 1.5 * IQR
   upper_bound = Q3 + 1.5 * IQR    

    outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)]

    print(f"Q1: {Q1}, Q3: {Q3}, IQR: {IQR}")
   print(f"Lower bound: {lower_bound}, Upper bound: {upper_bound}")
   print(f"Number of outliers: {len(outliers)}")    
   return outliers

outliers = detect_outliers_iqr(data, 'salary')
print("\nOutliers:")
print(outliers)

Z-score method:

from scipy import stats
def detect_outliers_zscore(df, column, threshold=3):
   """Detect outliers using Z-score method"""
   z_scores = np.abs(stats.zscore(df[column]))
   outliers = df[z_scores > threshold]  

    print(f"Number of outliers (|z| > {threshold}): {len(outliers)}")
   return outliers

outliers_z = detect_outliers_zscore(data, 'salary')
print("\nOutliers (Z-score):")
print(outliers_z)

When to use each:


Treating Outliers

Option 1: Remove

# Remove outliers using IQR

Q1 = data['salary'].quantile(0.25)

Q3 = data['salary'].quantile(0.75)

IQR = Q3 - Q1

data_clean = data[

    (data['salary'] >= Q1 - 1.5 * IQR) &

    (data['salary'] <= Q3 + 1.5 * IQR)

]

print(f"Original: {len(data)} rows, After removal: {len(data_clean)} rows")

When to use:


Option 2: Cap (Winsorize)

# Cap at 5th and 95th percentiles

lower = data['salary'].quantile(0.05)

upper = data['salary'].quantile(0.95)

data['salary_capped'] = data['salary'].clip(lower=lower, upper=upper)

print("Original vs Capped:")

print(data[['salary', 'salary_capped']])

When to use:


Option 3: Transform

# Log transformation (reduces impact of large values)

data['salary_log'] = np.log1p(data['salary'])  # log1p = log(1 + x), handles zeros

# Compare distributions

print("Original - Mean: {:.0f}, Std: {:.0f}".format(

    data['salary'].mean(), data['salary'].std()

))

print("Log transformed - Mean: {:.2f}, Std: {:.2f}".format(

    data['salary_log'].mean(), data['salary_log'].std()

))

When to use:


Business Context Matters

Example: Retail transactions

transactions = pd.DataFrame({

    'transaction_id': range(1, 11),

    'amount': [25, 30, 28, 32, 27, 29, 5000, 31, 26, 30]

})

# $5,000 transaction is an outlier

# But is it an error or a bulk purchase?

# Check context

transactions['is_outlier'] = (

    np.abs(stats.zscore(transactions['amount'])) > 3

)

print(transactions)

# Decision:

# - If it's an error → remove

# - If it's a bulk purchase → keep but analyze separately

# - For average transaction analysis → exclude

# - For total revenue analysis → include

Key insight : There's no universal rule. The right approach depends on:


8.2.3 Dealing with Inconsistent and Duplicate Records

Inconsistencies  make it impossible to group, aggregate, or analyze data correctly.

Common inconsistencies:

  1. Text variations : "New York", "NY", "new york", "N.Y."
  2. Date formats : "01/02/2024", "2024-01-02", "Jan 2, 2024"
  3. Units : Some prices in dollars, others in cents
  4. Encoding : Special characters, extra spaces
  5. Categories : Typos, abbreviations, multiple naming conventions

Cleaning Text Data

# Sample data with inconsistencies

df = pd.DataFrame({

    'customer_id': [1, 2, 3, 4, 5],

    'city': ['New York', 'new york', 'NY', 'NEW YORK', 'New York City'],

    'product': ['  iPhone 14  ', 'iphone 14', 'iPhone14', 'IPHONE 14', 'iPhone-14']

})

# Clean text

df['city_clean'] = (

    df['city']

    .str.strip()              # Remove leading/trailing spaces

    .str.lower()              # Convert to lowercase

    .str.replace('.', '')     # Remove periods

)

df['product_clean'] = (

    df['product']

    .str.strip()

    .str.lower()

    .str.replace(' ', '')     # Remove spaces

    .str.replace('-', '')     # Remove hyphens

)

print(df[['city', 'city_clean', 'product', 'product_clean']])

Output:

          city city_clean        product product_clean

0     New York   new york    iPhone 14      iphone14

1     new york   new york     iphone 14      iphone14

2           NY         ny       iPhone14      iphone14

3     NEW YORK   new york     IPHONE 14      iphone14

4  New York City new york city   iPhone-14      iphone14


Standardizing with Mappings

# Create mapping for known variations

city_mapping = {

    'new york': 'New York',

    'ny': 'New York',

    'new york city': 'New York',

    'nyc': 'New York',

    'los angeles': 'Los Angeles',

    'la': 'Los Angeles',

    'l.a.': 'Los Angeles'

}

df['city_standardized'] = df['city_clean'].map(city_mapping)

print(df[['city', 'city_standardized']])

For large datasets with many variations:

# Use fuzzy matching

from fuzzywuzzy import process

def standardize_city(city, valid_cities, threshold=80):

    """Match to closest valid city name"""

    match, score = process.extractOne(city, valid_cities)

    if score >= threshold:

        return match

    return city

valid_cities = ['New York', 'Los Angeles', 'Chicago', 'Houston']

df['city_fuzzy'] = df['city_clean'].apply(

    lambda x: standardize_city(x, valid_cities)

)


Detecting and Removing Duplicates

Types of duplicates:

  1. Exact duplicates : All columns identical
  2. Partial duplicates : Key columns identical (e.g., same customer_id)
  3. Fuzzy duplicates : Nearly identical (e.g., typos in names)

# Sample data with duplicates

df = pd.DataFrame({

    'customer_id': [1, 2, 2, 3, 4, 4],

    'name': ['Alice', 'Bob', 'Bob', 'Charlie', 'David', 'David'],

    'email': ['alice@email.com', 'bob@email.com', 'bob@email.com',

              'charlie@email.com', 'david@email.com', 'david2@email.com'],

    'purchase_date': ['2024-01-01', '2024-01-02', '2024-01-02',

                      '2024-01-03', '2024-01-04', '2024-01-05']

})

# Check for duplicates

print("Duplicate rows (all columns):")

print(df[df.duplicated()])

print("\nDuplicate customer_ids:")

print(df[df.duplicated(subset=['customer_id'], keep=False)])

# Remove exact duplicates

df_no_exact_dupes = df.drop_duplicates()

print(f"\nOriginal: {len(df)}, After removing exact duplicates: {len(df_no_exact_dupes)}")

# Remove duplicates based on customer_id (keep first occurrence)

df_no_customer_dupes = df.drop_duplicates(subset=['customer_id'], keep='first')

print(f"After removing customer_id duplicates: {len(df_no_customer_dupes)}")

# Keep last occurrence instead

df_keep_last = df.drop_duplicates(subset=['customer_id'], keep='last')

Which duplicates to keep?

# Keep the most recent record

df['purchase_date'] = pd.to_datetime(df['purchase_date'])

df_sorted = df.sort_values('purchase_date', ascending=False)

df_latest = df_sorted.drop_duplicates(subset=['customer_id'], keep='first')

print("Keeping most recent record per customer:")

print(df_latest)


Identifying Fuzzy Duplicates

# Customers with similar names (possible duplicates)

customers = pd.DataFrame({

    'customer_id': [1, 2, 3, 4],

    'name': ['John Smith', 'Jon Smith', 'John Smyth', 'Jane Doe'],

    'email': ['john@email.com', 'jon@email.com', 'jsmith@email.com', 'jane@email.com']

})

from fuzzywuzzy import fuzz

# Compare all pairs

for i in range(len(customers)):

    for j in range(i+1, len(customers)):

        name1 = customers.iloc[i]['name']

        name2 = customers.iloc[j]['name']

        similarity = fuzz.ratio(name1, name2)

       

        if similarity > 80:  # Threshold for potential duplicates

            print(f"Potential duplicate: '{name1}' vs '{name2}' (similarity: {similarity})")

Output:

Potential duplicate: 'John Smith' vs 'Jon Smith' (similarity: 95)

Potential duplicate: 'John Smith' vs 'John Smyth' (similarity: 91)

Action : Manually review potential duplicates and merge if appropriate.


8.3 Data Transformation

8.3.1 Scaling and Normalization

Why scale data?

Many machine learning algorithms are sensitive to the scale of features:

Example of the problem:

df = pd.DataFrame({

    'age': [25, 30, 35, 40, 45],

    'income': [50000, 60000, 70000, 80000, 90000],

    'purchases': [5, 8, 12, 15, 20]

})

print("Standard deviations:")

print(df.std())

Output:

age             7.91

income      15811.39

purchases       5.70

Income has a much larger scale than age or purchases. In a distance-based algorithm, income would dominate.


Scaling Methods

1. Standardization (Z-score normalization)

Transform to mean=0, std=1:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

df_standardized = pd.DataFrame(

    scaler.fit_transform(df),

    columns=df.columns

)

print("Standardized data:")

print(df_standardized)

print("\nMeans:", df_standardized.mean())

print("Stds:", df_standardized.std())

Output:

Standardized data:

        age    income  purchases

0 -1.414214 -1.414214  -1.414214

1 -0.707107 -0.707107  -0.707107

2  0.000000  0.000000   0.000000

3  0.707107  0.707107   0.707107

4  1.414214  1.414214   1.414214

Means: age         -7.401487e-17

       income      -7.401487e-17

       purchases   -7.401487e-17

Stds:  age         1.0

       income      1.0

       purchases   1.0

When to use:


2. Min-Max Scaling

Transform to range [0, 1]:

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

df_minmax = pd.DataFrame(

    scaler.fit_transform(df),

    columns=df.columns

)

print("Min-Max scaled data:")

print(df_minmax)

print("\nMins:", df_minmax.min())

print("Maxs:", df_minmax.max())

Output:

Min-Max scaled data:

   age  income  purchases

0  0.0     0.0        0.0

1  0.25    0.25       0.2

2  0.5     0.5        0.467

3  0.75    0.75       0.667

4  1.0     1.0        1.0

Mins: age         0.0

      income      0.0

      purchases   0.0

Maxs: age         1.0

      income      1.0

      purchases   1.0

When to use:


3. Robust Scaling

Uses median and IQR instead of mean and std (robust to outliers):

from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()

df_robust = pd.DataFrame(

    scaler.fit_transform(df),

    columns=df.columns

)

print("Robust scaled data:")

print(df_robust)

When to use:


Important: Fit on training data only

from sklearn.model_selection import train_test_split

# Split data

X_train, X_test = train_test_split(df, test_size=0.2, random_state=42)

# Fit scaler on training data only

scaler = StandardScaler()

scaler.fit(X_train)

# Transform both training and test data

X_train_scaled = scaler.transform(X_train)

X_test_scaled = scaler.transform(X_test)

# WRONG: Don't do this

# scaler.fit(X_test)  # This would leak information from test set

Why?  If you fit on test data, you're using information from the future, which creates data leakage.


8.3.2 Encoding Categorical Variables

Machine learning algorithms need numbers, not text. Encoding  converts categorical variables to numerical format.

Types of categorical variables:

  1. Nominal : No inherent order (color, city, product type)
  2. Ordinal : Natural order (education level, satisfaction rating)

Encoding Methods

1. Label Encoding

Assign each category a number:

from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({

    'city': ['New York', 'Los Angeles', 'Chicago', 'New York', 'Chicago']

})

encoder = LabelEncoder()

df['city_encoded'] = encoder.fit_transform(df['city'])

print(df)

print("\nMapping:", dict(zip(encoder.classes_, encoder.transform(encoder.classes_))))

Output:

         city  city_encoded

0     New York             2

1  Los Angeles             1

2      Chicago             0

3     New York             2

4      Chicago             0

Mapping: {'Chicago': 0, 'Los Angeles': 1, 'New York': 2}

Problem : Implies ordering (Chicago < Los Angeles < New York), which doesn't make sense for nominal variables.

When to use:


2. One-Hot Encoding

Create binary column for each category:

df = pd.DataFrame({

    'city': ['New York', 'Los Angeles', 'Chicago', 'New York', 'Chicago']

})

# Using pandas

df_onehot = pd.get_dummies(df, columns=['city'], prefix='city')

print(df_onehot)

Output:

  city_Chicago  city_Los Angeles  city_New York

0             0                 0              1

1             0                 1              0

2             1                 0              0

3             0                 0              1

4             1                 0              0

Using sklearn (better for pipelines):

from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(sparse=False, drop='first')  # drop='first' avoids multicollinearity

city_encoded = encoder.fit_transform(df[['city']])

df_encoded = pd.DataFrame(

    city_encoded,

    columns=encoder.get_feature_names_out(['city'])

)

print(df_encoded)

When to use:

Problem : High cardinality (many categories) creates too many columns.


3. Frequency Encoding

Replace category with its frequency:

df = pd.DataFrame({

    'city': ['New York', 'Los Angeles', 'Chicago', 'New York', 'Chicago',

             'New York', 'Chicago', 'Chicago']

})

# Calculate frequencies

freq = df['city'].value_counts(normalize=True)

df['city_freq'] = df['city'].map(freq)

print(df)

Output:

         city  city_freq

0     New York      0.375

1  Los Angeles      0.125

2      Chicago      0.500

3     New York      0.375

4      Chicago      0.500

5     New York      0.375

6      Chicago      0.500

7      Chicago      0.500

When to use:


4. Target Encoding

Replace category with mean of target variable for that category:

df = pd.DataFrame({

    'city': ['New York', 'Los Angeles', 'Chicago', 'New York', 'Chicago'],

    'sales': [100, 150, 200, 120, 180]

})

# Calculate mean sales per city

target_means = df.groupby('city')['sales'].mean()

df['city_target_encoded'] = df['city'].map(target_means)

print(df)

Output:

          city  sales  city_target_encoded

0     New York    100                110.0

1  Los Angeles    150                150.0

2      Chicago    200                190.0

3     New York    120                110.0

4      Chicago    180                190.0

When to use:

Caution : Can cause overfitting. Use cross-validation or smoothing.


Handling Ordinal Variables

df = pd.DataFrame({

    'education': ['High School', 'Bachelor', 'Master', 'PhD', 'High School']

})

# Define order

education_order = {

    'High School': 1,

    'Bachelor': 2,

    'Master': 3,

    'PhD': 4

}

df['education_encoded'] = df['education'].map(education_order)

print(df)

Output:

    education  education_encoded

0  High School                  1

1     Bachelor                  2

2       Master                  3

3          PhD                  4

4  High School                  1


8.3.3 Date/Time Handling and Time-Based Features

Dates contain rich information, but need to be extracted properly.

Parsing Dates

df = pd.DataFrame({

    'transaction_date': ['2024-01-15', '2024-02-20', '2024-03-10', '2024-04-05']

})

# Convert to datetime

df['date'] = pd.to_datetime(df['transaction_date'])

print(df.dtypes)

Extracting Time Components

# Extract components

df['year'] = df['date'].dt.year

df['month'] = df['date'].dt.month

df['day'] = df['date'].dt.day

df['day_of_week'] = df['date'].dt.dayofweek  # Monday=0, Sunday=6

df['day_name'] = df['date'].dt.day_name()

df['quarter'] = df['date'].dt.quarter

df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)

print(df)

Output:

 transaction_date       date  year  month  day  day_of_week day_name  quarter  is_weekend

0       2024-01-15 2024-01-15  2024      1   15            0   Monday        1           0

1       2024-02-20 2024-02-20  2024      2   20            1  Tuesday        1           0

2       2024-03-10 2024-03-10  2024      3   10            6   Sunday        1           1

3       2024-04-05 2024-04-05  2024      4    5            4   Friday        2           0


Time-Based Features

# Days since reference date

reference_date = pd.to_datetime('2024-01-01')

df['days_since_start'] = (df['date'] - reference_date).dt.days

# Time between events

df = df.sort_values('date')

df['days_since_last'] = df['date'].diff().dt.days

# Cyclical encoding (for periodic features like month, day of week)

df['month_sin'] = np.sin(2 * np.pi * df['month'] / 12)

df['month_cos'] = np.cos(2 * np.pi * df['month'] / 12)

print(df[['date', 'month', 'month_sin', 'month_cos']])

Why cyclical encoding?

December (month=12) and January (month=1) are adjacent, but numerically far apart. Cyclical encoding preserves this relationship.


Business-Relevant Date Features

# Is it a holiday?

holidays = pd.to_datetime(['2024-01-01', '2024-07-04', '2024-12-25'])

df['is_holiday'] = df['date'].isin(holidays).astype(int)

# Is it month-end? (important for B2B sales)

df['is_month_end'] = (df['date'].dt.is_month_end).astype(int)

# Days until month-end

df['days_to_month_end'] = df['date'].dt.days_in_month - df['date'].dt.day

# Fiscal quarter (if fiscal year starts in April)

df['fiscal_quarter'] = ((df['month'] - 4) % 12) // 3 + 1

print(df[['date', 'is_month_end', 'days_to_month_end', 'fiscal_quarter']])


8.4 Feature Engineering

Feature engineering  is the process of creating new variables from existing data to improve model performance.

Why it matters:

"Applied machine learning is basically feature engineering." — Andrew Ng

Good features can make a simple model outperform a complex model with poor features.

8.4.1 Domain-Driven Feature Construction

The best features come from business understanding, not just data manipulation.

Example: Customer Churn Prediction

Raw data:

customers = pd.DataFrame({

    'customer_id': [1, 2, 3, 4, 5],

    'signup_date': pd.to_datetime(['2023-01-15', '2023-03-20', '2023-06-10', '2023-08-05', '2023-10-12']),

    'last_login': pd.to_datetime(['2024-01-10', '2023-12-15', '2024-01-08', '2023-09-20', '2024-01-12']),

    'total_purchases': [15, 3, 22, 1, 8],

    'total_spent': [1500, 200, 3200, 50, 650],

    'support_tickets': [2, 5, 1, 8, 3],

    'email_opens': [45, 10, 78, 5, 32]

})

Domain-driven features:

# Recency: Days since last activity (recent activity = less likely to churn)

today = pd.to_datetime('2024-01-15')

customers['days_since_last_login'] = (today - customers['last_login']).dt.days

# Tenure: How long they've been a customer

customers['tenure_days'] = (today - customers['signup_date']).dt.days

# Engagement: Activity per unit time

customers['purchases_per_month'] = customers['total_purchases'] / (customers['tenure_days'] / 30)

customers['logins_per_month'] = 30 / customers['days_since_last_login']  # Approximate

# Value: Average order value

customers['avg_order_value'] = customers['total_spent'] / customers['total_purchases']

# Support intensity: Tickets per purchase (high = problems)

customers['tickets_per_purchase'] = customers['support_tickets'] / customers['total_purchases']

# Engagement score: Email engagement

customers['email_engagement'] = customers['email_opens'] / customers['tenure_days'] * 30

print(customers[[

    'customer_id', 'days_since_last_login', 'tenure_days',

    'purchases_per_month', 'avg_order_value', 'tickets_per_purchase'

]])

Business logic behind each feature:


8.4.2 Aggregations, Ratios, and Interaction Features

Aggregations

Summarize detailed data to customer/product level:

# Transaction-level data

transactions = pd.DataFrame({

    'customer_id': [1, 1, 1, 2, 2, 3, 3, 3, 3],

    'transaction_date': pd.to_datetime([

        '2024-01-01', '2024-01-15', '2024-01-20',

        '2024-01-05', '2024-01-25',

        '2024-01-10', '2024-01-12', '2024-01-18', '2024-01-22'

    ]),

    'amount': [100, 150, 200, 50, 75, 300, 100, 250, 180],

    'category': ['Electronics', 'Clothing', 'Electronics',

                 'Clothing', 'Clothing',

                 'Electronics', 'Home', 'Electronics', 'Clothing']

})

# Aggregate to customer level

customer_features = transactions.groupby('customer_id').agg({

    'amount': ['sum', 'mean', 'std', 'min', 'max', 'count'],

    'transaction_date': ['min', 'max']

}).reset_index()

customer_features.columns = ['customer_id', 'total_spent', 'avg_transaction',

                              'std_transaction', 'min_transaction', 'max_transaction',

                              'num_transactions', 'first_purchase', 'last_purchase']

# Time-based features

customer_features['days_active'] = (

    customer_features['last_purchase'] - customer_features['first_purchase']

).dt.days

customer_features['purchase_frequency'] = (

    customer_features['num_transactions'] / (customer_features['days_active'] + 1)

)

print(customer_features)


Ratios

Ratios often reveal more than raw numbers:

# E-commerce metrics

df = pd.DataFrame({

    'sessions': [100, 200, 150, 300],

    'page_views': [500, 800, 600, 1200],

    'add_to_cart': [20, 35, 25, 50],

    'purchases': [5, 10, 8, 15],

    'revenue': [500, 1200, 900, 1800]

})

# Conversion funnel ratios

df['pages_per_session'] = df['page_views'] / df['sessions']

df['add_to_cart_rate'] = df['add_to_cart'] / df['sessions']

df['conversion_rate'] = df['purchases'] / df['sessions']

df['cart_to_purchase_rate'] = df['purchases'] / df['add_to_cart']

# Value metrics

df['avg_order_value'] = df['revenue'] / df['purchases']

df['revenue_per_session'] = df['revenue'] / df['sessions']

print(df[[

    'conversion_rate', 'cart_to_purchase_rate',

    'avg_order_value', 'revenue_per_session'

]])

Why ratios matter:


Interaction Features

Capture relationships between variables:

# Marketing data

df = pd.DataFrame({

    'ad_spend': [1000, 2000, 1500, 3000],

    'email_sends': [5000, 10000, 7500, 15000],

    'conversions': [50, 120, 80, 200]

})

# Interaction: Combined effect of ad spend and email

df['ad_email_interaction'] = df['ad_spend'] * df['email_sends']

# Polynomial features: Non-linear relationships

df['ad_spend_squared'] = df['ad_spend'] ** 2

# Efficiency ratios

df['cost_per_conversion'] = df['ad_spend'] / df['conversions']

df['emails_per_conversion'] = df['email_sends'] / df['conversions']

print(df)

When to use interactions:

8.4.3 Handling Categorical Data

Categorical data is common in business analytics and often needs to be transformed into numerical formats for machine learning models. Here are three widely used techniques to handle categorical variables:

One-Hot Encoding

One-hot encoding converts each category value into a new binary column (feature) with 1s and 0s indicating the presence of that category. This method is simple and effective for nominal categories without intrinsic order.

Example in Python:

import pandas as pd

data = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Blue']}) one_hot = pd.get_dummies(data['Color'], prefix='Color') print(one_hot)

Output:

Blue

Green

Red

0

0

1

1

0

0

0

1

0

1

0

0

Use case:  Best for categorical variables with a small number of unique values. Beware of high dimensionality if categories are many.

Advantages:

Disadvantages:

Handling Large Number of Categories:

Label Encoding

Label encoding assigns each unique category an integer value. This is useful for ordinal categories where the order matters but can mislead models if used on nominal data.

Example in Python:

from sklearn.preprocessing import LabelEncoder
data = ['Low', 'Medium', 'High', 'Medium']
le = LabelEncoder()
encoded = le.fit_transform(data)
print(encoded)

Output:

[1 2 0 2]

Advantages:

Disadvantages:

Handling Large Number of Categories:

Embeddings

Embeddings map categories to dense, low-dimensional vectors learned during model training, capturing semantic relationships between categories. Commonly used in deep learning models, embeddings can represent high-cardinality categorical variables efficiently.

Example concept (using TensorFlow/Keras):

from tensorflow.keras.layers import Input, Embedding, Flatten
from tensorflow.keras.models import Model

input_cat = Input(shape=(1,))
embedding = Embedding(input_dim=100, output_dim=8)(input_cat)
flat = Flatten()(embedding)

model = Model(inputs=input_cat, outputs=flat)

Use case:  Ideal for large cardinality categorical variables and when relationships between categories matter. Requires more complex models and training.

Advantages:

Disadvantages:

Handling Large Number of Categories:

Additional Techniques for Very Large Cardinality Categorical Variables

Summary Table

Encoding Method

Advantages

Disadvantages

Handling Large Cardinality

One-Hot Encoding

Simple, no ordinal assumptions

High dimensionality, sparse

Group rare categories, dimensionality reduction

Label Encoding

Memory efficient, fast

Imposes ordinal relation on nominal data

Use only for ordinal data, combine with other methods

Embeddings

Low-dimensional, captures relations

Complex, less interpretable

Natural for large cardinality, use hashing or regularization

Frequency Encoding

Simple, scalable

Loses category identity

Good for very large cardinality

Target Encoding

Powerful, uses target info

Risk of leakage

Use with cross-validation

Hashing Trick

Fixed size, scalable

Collisions possible

Useful for streaming or huge datasets

Choosing the right approach depends on the data, model, and business context. For very large cardinality categorical variables, embeddings or hashing-based methods are often preferred in modern analytics pipelines.

8.4.4 Text Features (Basic NLP Features)

Text data is everywhere: customer reviews, support tickets, product descriptions, emails.

Basic Text Features

reviews = pd.DataFrame({

    'review_id': [1, 2, 3, 4],

    'text': [

        'Great product! Love it.',

        'Terrible quality. Very disappointed.',

        'Good value for money. Recommended.',

        'Amazing! Best purchase ever!!!'

    ]

})

# Length features

reviews['char_count'] = reviews['text'].str.len()

reviews['word_count'] = reviews['text'].str.split().str.len()

reviews['avg_word_length'] = reviews['char_count'] / reviews['word_count']

# Punctuation (excitement, emphasis)

reviews['exclamation_count'] = reviews['text'].str.count('!')

reviews['question_count'] = reviews['text'].str.count('\?')

# Uppercase (shouting, emphasis)

reviews['uppercase_ratio'] = (

    reviews['text'].str.count(r'[A-Z]') / reviews['char_count']

)

print(reviews)


Sentiment and Keywords

# Simple sentiment (count positive/negative words)

positive_words = ['great', 'love', 'good', 'amazing', 'best', 'excellent', 'recommended']

negative_words = ['terrible', 'bad', 'disappointed', 'worst', 'poor', 'awful']

reviews['text_lower'] = reviews['text'].str.lower()

reviews['positive_word_count'] = reviews['text_lower'].apply(

    lambda x: sum(word in x for word in positive_words)

)

reviews['negative_word_count'] = reviews['text_lower'].apply(

    lambda x: sum(word in x for word in negative_words)

)

reviews['sentiment_score'] = (

    reviews['positive_word_count'] - reviews['negative_word_count']

)

print(reviews[['text', 'positive_word_count', 'negative_word_count', 'sentiment_score']])

For more sophisticated NLP:

# Using TextBlob for sentiment analysis

from textblob import TextBlob

reviews['polarity'] = reviews['text'].apply(

    lambda x: TextBlob(x).sentiment.polarity

)

reviews['subjectivity'] = reviews['text'].apply(

    lambda x: TextBlob(x).sentiment.subjectivity

)

print(reviews[['text', 'polarity', 'subjectivity']])

Polarity : -1 (negative) to +1 (positive)
Subjectivity : 0 (objective) to 1 (subjective)


TF-IDF for Document Similarity

from sklearn.feature_extraction.text import TfidfVectorizer

# Convert text to TF-IDF features

vectorizer = TfidfVectorizer(max_features=10, stop_words='english')

tfidf_matrix = vectorizer.fit_transform(reviews['text'])

# Convert to dataframe

tfidf_df = pd.DataFrame(

    tfidf_matrix.toarray(),

    columns=vectorizer.get_feature_names_out()

)

print("TF-IDF features:")

print(tfidf_df)

Use cases:


8.5 Train–Test Splits and Data Leakage Prevention

Data leakage  is when information from the future "leaks" into your training data, making your model look better than it actually is.

The result : Great performance in development, terrible performance in production.

Types of Data Leakage

1. Target leakage

Using information that wouldn't be available at prediction time.

Example: Predicting loan default

# WRONG: Using payment_status to predict default

# payment_status is only known AFTER you know if they defaulted

df = pd.DataFrame({

    'customer_id': [1, 2, 3, 4],

    'income': [50000, 60000, 45000, 70000],

    'payment_status': ['late', 'on_time', 'late', 'on_time'],  # ← This is leakage!

    'defaulted': [1, 0, 1, 0]

})

# payment_status is determined by whether they default

# You can't use it to predict default

How to avoid : Ask "Would I have this information at the time I need to make the prediction?"


2. Train-test contamination

Information from test set leaking into training.

Example: Scaling before splitting

from sklearn.preprocessing import StandardScaler

from sklearn.model_selection import train_test_split

df = pd.DataFrame({

    'feature1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],

    'target': [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

})

# WRONG: Scale before split

scaler = StandardScaler()

df_scaled = scaler.fit_transform(df[['feature1']])  # Uses ALL data

X_train, X_test = train_test_split(df_scaled, test_size=0.2)

# RIGHT: Split first, then scale

X_train, X_test, y_train, y_test = train_test_split(

    df[['feature1']], df['target'], test_size=0.2, random_state=42

)

scaler = StandardScaler()

scaler.fit(X_train)  # Fit only on training data

X_train_scaled = scaler.transform(X_train)

X_test_scaled = scaler.transform(X_test)  # Transform test using training parameters

Why it matters : If you scale using the full dataset, the test set's statistics influence the training data.


3. Temporal leakage

Using future information to predict the past.

Example: Time series forecasting

# Sales data

df = pd.DataFrame({

    'date': pd.date_range('2024-01-01', periods=10),

    'sales': [100, 110, 105, 115, 120, 125, 130, 135, 140, 145]

})

# WRONG: Random split for time series

X_train, X_test = train_test_split(df, test_size=0.2)  # Mixes past and future

# RIGHT: Time-based split

train_size = int(0.8 * len(df))

train = df[:train_size]  # First 80%

test = df[train_size:]   # Last 20%

print("Training period:", train['date'].min(), "to", train['date'].max())

print("Test period:", test['date'].min(), "to", test['date'].max())

Rule : Always predict future from past, never past from future.


4. Feature leakage from aggregations

Example: Customer churn

# WRONG: Including future behavior in features

df = pd.DataFrame({

    'customer_id': [1, 1, 1, 2, 2],

    'month': [1, 2, 3, 1, 2],

    'purchases': [2, 1, 0, 3, 2],

    'churned_month_3': [1, 1, 1, 0, 0]  # Churn status at month 3

})

# Calculate total purchases (includes future purchases!)

df['total_purchases'] = df.groupby('customer_id')['purchases'].transform('sum')

# This is leakage: total_purchases includes purchases AFTER prediction time

RIGHT: Use only past information

# Calculate cumulative purchases up to current month

df = df.sort_values(['customer_id', 'month'])

df['cumulative_purchases'] = df.groupby('customer_id')['purchases'].cumsum()

# For month 3 prediction, use data from months 1-2 only

prediction_month = 3

train_data = df[df['month'] < prediction_month]

Proper Train-Test Split

from sklearn.model_selection import train_test_split

# Basic split

X_train, X_test, y_train, y_test = train_test_split(

    X, y,

    test_size=0.2,      # 20% for testing

    random_state=42,    # Reproducibility

    stratify=y          # Maintain class proportions (for classification)

)

print(f"Training set: {len(X_train)} samples")

print(f"Test set: {len(X_test)} samples")

print(f"Class distribution in train: {y_train.value_counts(normalize=True)}")

print(f"Class distribution in test: {y_test.value_counts(normalize=True)}")


Cross-Validation (Better than single split)

from sklearn.model_selection import cross_val_score

from sklearn.linear_model import LogisticRegression

model = LogisticRegression()

# 5-fold cross-validation

scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')

print(f"Cross-validation scores: {scores}")

print(f"Mean accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")

Why cross-validation?


Time Series Cross-Validation

from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)

for train_index, test_index in tscv.split(X):

    X_train, X_test = X[train_index], X[test_index]

    y_train, y_test = y[train_index], y[test_index]

   

    # Train and evaluate model

    # ...

Ensures : Each test set is always in the future relative to its training set.


8.6 Documenting and Automating Data Preparation Pipelines

Why pipelines matter:

  1. Reproducibility : Same transformations every time
  2. Efficiency : Automate repetitive tasks
  3. Deployment : Easy to apply to new data
  4. Debugging : Clear what transformations were applied

Building a Pipeline with sklearn

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

# Define pipeline
pipeline = Pipeline([
   ('imputer', SimpleImputer(strategy='median')),
   ('scaler', StandardScaler()),
   ('model', LogisticRegression())
])

# Fit pipeline (applies all steps)
pipeline.fit(X_train, y_train)

# Predict (applies same transformations)
y_pred = pipeline.predict(X_test)

# Score
score = pipeline.score(X_test, y_test)
print(f"Accuracy: {score:.3f}")

Benefits:


Handling Different Column Types

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer

# Define which columns get which transformations
numeric_features = ['age', 'income', 'purchases']
categorical_features = ['city', 'product_category']

# Numeric pipeline
numeric_transformer = Pipeline([
   ('imputer', SimpleImputer(strategy='median')),
   ('scaler', StandardScaler())
])

# Categorical pipeline
categorical_transformer = Pipeline([
   ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
   ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combine transformers
preprocessor = ColumnTransformer([
   ('num', numeric_transformer, numeric_features),
   ('cat', categorical_transformer, categorical_features)
])

# Full pipeline
full_pipeline = Pipeline([
   ('preprocessor', preprocessor),
   ('model', LogisticRegression())
])

# Fit and predict
full_pipeline.fit(X_train, y_train)
y_pred = full_pipeline.predict(X_test)

Custom Transformers

from sklearn.base import BaseEstimator, TransformerMixin
class FeatureEngineer(BaseEstimator, TransformerMixin):
   """Custom transformer for domain-specific features"""    

    def fit(self, X, y=None):
       return self
   

    def transform(self, X):
       X = X.copy()        
       # Create new features
       X['purchases_per_month'] = X['total_purchases'] / X['tenure_months']
       X['avg_order_value'] = X['total_spent'] / X['total_purchases']
       X['support_intensity'] = X['support_tickets'] / X['total_purchases']        
       return X

# Use in pipeline

pipeline = Pipeline([
   ('feature_engineer', FeatureEngineer()),
   ('imputer', SimpleImputer(strategy='median')),
   ('scaler', StandardScaler()),
   ('model', LogisticRegression())
])

Saving and Loading Pipelines

import joblib
# Save pipeline
joblib.dump(pipeline, 'model_pipeline.pkl')
# Load pipeline
loaded_pipeline = joblib.load('model_pipeline.pkl')

# Use on new data
new_predictions = loaded_pipeline.predict(new_data)


Documentation Best Practices

Data Preparation Pipeline for Customer Churn Prediction

Author: [Your Name]

Date: 2024-01-15

Last Updated: 2024-01-15

DATA SOURCE:

- customers.csv: Customer demographic and behavioral data

- transactions.csv: Transaction history

PREPROCESSING STEPS:

1. Missing Value Handling:

   - age: Impute with median

   - income: Impute with median, create missing indicator

   - purchases: Impute with mean

2. Outlier Treatment:

   - income: Cap at 5th and 95th percentiles

   - purchases: Remove values > 3 standard deviations

3. Feature Engineering:

   - purchases_per_month: total_purchases / tenure_months

   - avg_order_value: total_spent / total_purchases

   - days_since_last_purchase: today - last_purchase_date

4. Encoding:

   - city: One-hot encoding

   - education: Ordinal encoding (HS=1, Bachelor=2, Master=3, PhD=4)

5. Scaling:

   - All numeric features: StandardScaler

VALIDATION:

- Train-test split: 80/20

- Stratified by churn status

- Random state: 42

KNOWN ISSUES:

- Small sample size for some cities (< 10 customers)

- Missing data for income is not random (higher for high earners)

NEXT STEPS:

- Consider target encoding for high-cardinality categorical variables

- Experiment with polynomial features for non-linear relationships

Key Takeaways

  1. Data preparation is 80% of the work —and it determines the quality of your results.
  2. Understand before you clean : Don't blindly apply transformations. Understand what the data represents and why it's messy.
  3. Missing data strategy depends on why it's missing : MCAR, MAR, and MNAR require different approaches.
  4. Outliers aren't always errors : Understand whether they're mistakes or legitimate extremes before removing them.
  5. Scaling matters for some algorithms : Distance-based and gradient-based methods need scaled features.
  6. Encoding depends on variable type : Nominal vs. ordinal, low vs. high cardinality.
  7. Feature engineering is where domain knowledge shines : The best features come from understanding the business, not just manipulating data.
  8. Data leakage will ruin your model : Always ask "Would I have this information at prediction time?"
  9. Pipelines ensure reproducibility : Automate your preprocessing to avoid errors and enable deployment.
  10. Document everything : Future you (and your colleagues) will thank you.

Exercises

Exercise 1: Clean a Raw Dataset

You're given a customer dataset with various data quality issues. Clean it using pandas.

Dataset: customers_raw.csv

i mport pandas as pd
import numpy as np

# Create sample messy data
np.random.seed(42)
df = pd.DataFrame({
   'customer_id': [1, 2, 2, 3, 4, 5, 6, 7, 8, 9],  # Duplicate
   'name': ['Alice Smith', 'bob jones', 'Bob Jones', 'CHARLIE BROWN',
            'Diana Prince', 'Eve Adams', None, 'Frank Miller', 'Grace Lee', 'Henry Ford'],
   'age': [25, 30, 30, 250, 45, np.nan, 28, 35, 40, 50],  # Outlier, missing
   'city': ['New York', 'new york', 'NY', 'Los Angeles', 'Chicago',
            'chicago', 'Houston', 'Phoenix', 'Philadelphia', 'San Antonio'],
   'income': [50000, 60000, 60000, 75000, np.nan, 55000, 62000, 58000, 70000, 80000],    'signup_date': ['2023-01-15', '2023-02-20', '2023-02-20', '01/03/2023',
'2023-04-10', '2023-05-15', '2023-06-20', '2023-07-25', '2023-08-30', '2023-09-15']
})

df.to_csv('customers_raw.csv', index=False)

Your tasks:

a) Identify and remove duplicate records

b) Standardize text fields (name, city):

c) Handle missing values:

d) Detect and treat outliers: