Chapter 4. Statistical and Probabilistic Foundations for Business
4.1 Why Statistics Matters for Business Decisions
Every business decision involves uncertainty. Should we launch a new product? Will customers respond to this marketing campaign? Is this supplier reliable? Which job candidate will perform best?
In the absence of perfect information—which is always—we rely on data and statistics to reduce uncertainty and make better decisions.
But here's the critical insight: statistics is not about finding "the truth" in data. It's about quantifying uncertainty so we can make informed choices.
Consider these scenarios:
Scenario 1: The Underperforming Store
A retail chain has 200 stores. Store #47 had 8% lower sales than the chain average last month. The regional manager wants to investigate what's wrong with that store.
But is there actually something wrong? Or is this just normal variation? If you flip a coin 100 times, you probably won't get exactly 50 heads; you might get 45 or 55. Similarly, even if all stores were identical, some would naturally perform above average and some below, just by chance.
Statistics helps us answer: Is this 8% difference large enough that it's unlikely to be just random variation? Or is it within the range of normal fluctuation?
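The coin-flip intuition can be checked with a short simulation. This is only an illustrative sketch: it assumes a fair coin, and the seed and the number of repeated experiments are arbitrary choices that do not come from the scenario.

```python
import numpy as np

# Assumptions: fair coin, 100 flips per experiment, 10,000 repeated experiments.
rng = np.random.default_rng(seed=42)  # arbitrary seed so the run is repeatable
heads = rng.binomial(n=100, p=0.5, size=10_000)

# How often does a fair coin land outside the 45-55 range purely by chance?
share_outside = np.mean((heads < 45) | (heads > 55))

print(f"Average heads per 100 flips: {heads.mean():.1f}")
print(f"Share of experiments outside 45-55 heads: {share_outside:.1%}")
```

In runs like this, roughly a quarter of the experiments fall outside the 45-55 range even though nothing is "wrong" with the coin. Store-to-store differences deserve the same scrutiny before anyone investigates Store #47.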
Scenario 2: The A/B Test
An e-commerce company tests two versions of their checkout page. Version A (current) has a 3.2% conversion rate. Version B (new) has a 3.5% conversion rate based on 10,000 visitors to each version.
Should they switch to Version B?
The answer isn't obvious. Even if the two versions were identical, we'd expect some difference just by chance. Maybe the 10,000 people who saw Version B happened to be slightly more ready to buy.
Statistics helps us answer: How confident can we be that Version B is actually better, not just luckier?
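One common way to put a number on "better or just luckier" is a two-proportion z-test. The sketch below applies it to the figures from the scenario using only the standard library. It assumes visitors were randomly assigned and act independently, and it uses the normal approximation, so treat it as an illustration rather than a complete analysis.

```python
import math

# Figures from the scenario: 10,000 visitors per version.
n_a, rate_a = 10_000, 0.032
n_b, rate_b = 10_000, 0.035

# Pooled conversion rate under the assumption that both versions are identical.
pooled = (rate_a * n_a + rate_b * n_b) / (n_a + n_b)
std_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))

# z-score: how many standard errors apart the two observed rates are.
z = (rate_b - rate_a) / std_error

# Two-sided p-value from the standard normal distribution (CDF via math.erf).
normal_cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
p_value = 2 * (1 - normal_cdf)

print(f"z = {z:.2f}, two-sided p-value = {p_value:.3f}")
```

With these inputs the p-value comes out well above the conventional 0.05 threshold, so a gap of this size is quite plausible even if the two pages convert equally well. The data alone do not make a strong case for switching.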
Scenario 3: The Predictive Model
A bank builds a model to predict loan defaults. The model says Customer X has a 15% probability of default.
What does this mean? It doesn't mean Customer X will partially default; they'll either default or not. It means that among customers with similar characteristics, historically about 15% defaulted.
Statistics helps us answer: How should we use this probabilistic information to make a decision? What's the expected cost of approving vs. denying this loan?
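The expected-cost question can be sketched as a small expected-value calculation. The 15% default probability comes from the scenario. The dollar amounts are hypothetical, chosen only to show the arithmetic.

```python
# All dollar figures are hypothetical and only illustrate the arithmetic.
p_default = 0.15           # from the scenario
profit_if_repaid = 2_000   # assumed interest earned on a repaid loan
loss_if_default = 10_000   # assumed principal lost on a default

# Expected value of approving: probability-weighted average of both outcomes.
ev_approve = (1 - p_default) * profit_if_repaid - p_default * loss_if_default
ev_deny = 0.0              # denying earns nothing and loses nothing

# Default probability at which approving and denying break even.
break_even = profit_if_repaid / (profit_if_repaid + loss_if_default)

print(f"Expected value of approving: ${ev_approve:,.0f}")
print(f"Expected value of denying:   ${ev_deny:,.0f}")
print(f"Break-even default probability: {break_even:.1%}")
```

Under these assumed amounts, approving is only slightly better than denying, and the decision would flip if the default probability rose a little. The right choice depends on the payoffs as much as on the probability.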
The Core Questions Statistics Answers
- What happened? (Descriptive statistics)
  - What were our average sales last quarter?
  - How much variation is there in customer satisfaction scores?
  - Are there outliers or unusual patterns?
- What might happen? (Probability)
  - What's the probability of meeting our sales target?
  - What's the risk of a supply chain disruption?
  - What's the expected return on this investment?
- Is this real or just chance? (Inference)
  - Is the difference between these two groups meaningful?
  - Can we generalize from this sample to the broader population?
  - How confident are we in this estimate?
- What's related to what? (Correlation and regression)
  - Do higher prices lead to lower sales?
  - What factors predict customer churn?
  - How much does advertising spending affect revenue?
Why Business People Often Struggle with Statistics
Statistics is often taught as a collection of formulas and procedures, disconnected from real decision-making. Students learn to "reject the null hypothesis at α = 0.05" without understanding what that means for business action.
Here's a better way to think about it:
Statistics is a language for talking about uncertainty.
Just as you need to understand financial statements to make investment decisions, you need to understand statistics to make data-driven decisions. You don't need to be a statistician any more than you need to be an accountant—but you need to be statistically literate.
What Statistical Literacy Means
- Understanding what an average does and doesn't tell you
- Recognizing when a difference is meaningful vs. just noise
- Knowing that correlation doesn't prove causation (but might suggest it)
- Appreciating that larger samples give more reliable results
- Understanding that "statistically significant" doesn't always mean "practically important"
- Recognizing when you're being misled by cherry-picked data or misleading visualizations
The Role of AI in Statistical Analysis
Modern AI tools, including Large Language Models and code-generation tools, have dramatically changed how we do statistical analysis. You no longer need to memorize formulas or be an expert programmer.
But, and this is crucial, AI tools don't replace statistical thinking. They amplify it.
AI can:
- Write code to calculate statistics
- Generate visualizations
- Explain statistical concepts
- Suggest appropriate tests
- Interpret results
AI cannot:
- Decide what question to ask
- Determine if your data is appropriate
- Judge whether a result is practically meaningful
- Make the business decision
Throughout this chapter, we'll show how to use AI tools (particularly LLMs and Python) to perform statistical analyses. But we'll focus on understanding what you're doing and why, not just getting numbers.
A Note on Mathematical Rigor
This chapter takes a practical, intuitive approach to statistics. We'll use formulas when they're helpful for understanding, but we won't derive theorems or prove properties.
If you need deeper mathematical foundations, excellent textbooks exist. Our goal is different: to help you use statistics effectively in business contexts, with modern tools, to make better decisions.
Let's begin.
4.2 Descriptive Statistics
Descriptive statistics summarize and describe data. They're the foundation of all statistical analysis—before you can make inferences or predictions, you need to understand what's in your data.
4.2.1 Measures of Central Tendency and Dispersion
Imagine you're analyzing salaries at your company. You have data for 100 employees. How do you summarize this information?
Measures of Central Tendency tell you where the "center" of the data is:
1. Mean (Average)
The mean is the sum of all values divided by the count.
When to use it: When you want to know the typical value and your data doesn't have extreme outliers.
Example: Average salary = $65,000
What it means: If you distributed all salary dollars equally, everyone would get $65,000.
Limitation: Sensitive to outliers. If the CEO makes $2 million, it pulls the average up, making it unrepresentative of typical employees.
2. Median (Middle Value)
The median is the middle value when data is sorted. Half the values are above it, half below.
When to use it: When you have outliers or skewed data (like salaries, house prices, income).
Example: Median salary = $58,000
What it means: Half of employees make more than $58,000, half make less.
Why it differs from mean: The CEO's $2 million salary doesn't affect the median much; the CEO is just one person at the top.
3. Mode (Most Common Value)
The mode is the value that appears most frequently.
When to use it: For categorical data (most common product category, most frequent customer complaint) or when you want to know the most common value.
Example: Modal salary = $55,000 (maybe many entry-level employees at this level)
Limitation: Not always meaningful for continuous data with few repeated values.
Measures of Dispersion tell you how spread out the data is:
1. Range
The difference between the maximum and minimum values.
Example: Salary range = $2,000,000 - $35,000 = $1,965,000
Limitation: Tells you nothing about the distribution between the extremes. Heavily influenced by outliers.
2. Variance
The average squared distance from the mean.
Formula: Variance = Σ(x - mean)² / n for a full population. For a sample, divide by n - 1 instead (this is what ddof=1 does in the Python code later in this section).
What it measures: How much values deviate from the mean, on average.
Limitation: Units are squared (dollars²), which is hard to interpret.
3. Standard Deviation
The square root of variance.
Formula: SD = √Variance
What it measures: Typical distance from the mean, in the original units.
Example: Salary SD = $45,000
What it means: If salaries were roughly bell-shaped, about two-thirds of them would fall within one SD of the mean ($65,000), that is, between $20,000 and $110,000. For skewed data such as salaries, treat this as a rough guide only.
Why it matters: Tells you if data is tightly clustered (small SD) or widely spread (large SD).
4. Coefficient of Variation (CV)
The standard deviation divided by the mean, expressed as a percentage.
Formula: CV = (SD / Mean) × 100%
Example: Salary CV = ($45,000 / $65,000) × 100% = 69%
Why it's useful: Allows comparison of variability across different scales. A $10,000 SD is large for salaries but small for house prices.
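The dispersion measures above can be computed in a few lines. The sketch below uses a small set of hypothetical monthly sales figures (not data from this chapter) to show how dividing by n versus n - 1 changes the variance, and how the CV follows from the standard deviation and the mean.

```python
import numpy as np

# Hypothetical monthly sales figures (in $ thousands) for one store.
sales = np.array([42.0, 45.0, 39.0, 51.0, 48.0, 44.0])

pop_var = np.var(sales)             # divides by n (population formula)
sample_var = np.var(sales, ddof=1)  # divides by n - 1 (usual choice for a sample)
sample_sd = np.sqrt(sample_var)     # back in the original units
cv = sample_sd / sales.mean() * 100

print(f"Population variance: {pop_var:.2f}")
print(f"Sample variance:     {sample_var:.2f}")
print(f"Standard deviation:  {sample_sd:.2f}")
print(f"Coefficient of variation: {cv:.1f}%")
```

The sample variance is always a little larger than the population variance, and the gap shrinks as the number of observations grows.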
Practical Example with Python and AI
Let's analyze actual salary data. We'll use AI to help us write the code.
Prompt to AI:
I have a list of employee salaries in Python. Write code to calculate:
1. Mean, median, and mode
2. Range, variance, and standard deviation
3. Display the results in a clear format
Use this sample data:
salaries = [45000, 52000, 48000, 55000, 62000, 58000, 51000, 49000,
67000, 72000, 55000, 59000, 61000, 48000, 53000, 2000000]
Python Code:
import numpy as np
from scipy import stats
# Sample salary data
salaries = [45000, 52000, 48000, 55000, 62000, 58000, 51000, 49000,
67000, 72000, 55000, 59000, 61000, 48000, 53000, 2000000]
# Measures of central tendency
mean_salary = np.mean(salaries)
median_salary = np.median(salaries)
mode_result = stats.mode(salaries, keepdims=True)
mode_salary = mode_result.mode[0]
# Measures of dispersion
salary_range = np.max(salaries) - np.min(salaries)
variance = np.var(salaries, ddof=1) # ddof=1 for sample variance
std_dev = np.std(salaries, ddof=1)
cv = (std_dev / mean_salary) * 100
# Display results
print("=== SALARY ANALYSIS ===\n")
print("Central Tendency:")
print(f" Mean: ${mean_salary:,.2f}")
print(f" Median: ${median_salary:,.2f}")
print(f" Mode: ${mode_salary:,.2f}")
print(f"\nDispersion:")
print(f" Range: ${salary_range:,.2f}")
print(f" Variance: ${variance:,.2f}")
print(f" Standard Deviation: ${std_dev:,.2f}")
print(f" Coefficient of Variation: {cv:.1f}%")
Output:
=== SALARY ANALYSIS ===
Central Tendency:
Mean: $177,187.50
Median: $55,000.00
Mode: $48,000.00
Dispersion:
Range: $1,955,000.00
Variance: $236,330,295,833.33
Standard Deviation: $486,138.14
Coefficient of Variation: 274.4%
Interpretation:
Notice the huge difference between mean ($177,188) and median ($55,000). This tells us immediately that we have extreme outliers pulling the mean up.
The standard deviation ($486,138) is actually larger than the mean. This is unusual and indicates extreme variability.
The coefficient of variation (274%) confirms this is highly variable data.
Business insight : The mean is misleading here. If you told employees "average salary is $177,000," they'd be confused because most people make around $55,000. The median is a much better representation of typical salary.
Let's remove the outlier and recalculate:
Prompt to AI:
Modify the previous code to:
1. Remove salaries above $500,000
2. Recalculate all statistics
3. Compare before and after
Python Code:
# Remove outliers
salaries_clean = [s for s in salaries if s <= 500000]
# Recalculate
mean_clean = np.mean(salaries_clean)
median_clean = np.median(salaries_clean)
std_clean = np.std(salaries_clean, ddof=1)
print("\n=== COMPARISON: WITH vs WITHOUT OUTLIER ===\n")
print(f" With Outlier Without Outlier")
print(f"Mean: ${mean_salary:>12,.0f} ${mean_clean:>12,.0f}")
print(f"Median: ${median_salary:>12,.0f} ${median_clean:>12,.0f}")
print(f"Std Deviation: ${std_dev:>12,.0f} ${std_clean:>12,.0f}")
print(f"\nNumber of employees: {len(salaries)} → {len(salaries_clean)}")
Output:
=== COMPARISON: WITH vs WITHOUT OUTLIER ===
With Outlier Without Outlier
Mean: $ 177,188 $ 55,667
Median: $ 55,000 $ 55,000
Std Deviation: $ 486,138 $ 7,556
Number of employees: 16 → 15
Key Insight : One outlier (the CEO) completely distorted the mean and standard deviation. The median was barely affected. This is why median is preferred for skewed data like salaries, house prices, and wealth.
Visualizing Central Tendency and Dispersion
Numbers are important, but visualizations make patterns obvious.
Prompt to AI:
Create a visualization showing:
1. Histogram of salaries (without outlier)
2. Vertical lines for mean and median
3. Shaded region for ±1 standard deviation from mean
Python Code:
import matplotlib.pyplot as plt
# Create histogram
plt.figure(figsize=(10, 6))
plt.hist(salaries_clean, bins=10, color='skyblue', edgecolor='black', alpha=0.7)
# Add mean and median lines
plt.axvline(mean_clean, color='red', linestyle='--', linewidth=2, label=f'Mean: ${mean_clean:,.0f}')
plt.axvline(median_clean, color='green', linestyle='--', linewidth=2, label=f'Median: ${median_clean:,.0f}')
# Add ±1 SD shading
plt.axvspan(mean_clean - std_clean, mean_clean + std_clean,
alpha=0.2, color='red', label='±1 Std Dev')
plt.xlabel('Salary ($)', fontsize=12)
plt.ylabel('Number of Employees', fontsize=12)
plt.title('Employee Salary Distribution', fontsize=14, fontweight='bold')
plt.legend()
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()
This visualization immediately shows:
- The distribution is slightly right-skewed (tail toward higher salaries)
- Mean and median are close (because we removed the extreme outlier)
- Most employees fall within one standard deviation of the mean
- There are a few higher earners, but nothing extreme
When to Use Each Measure: A Decision Guide
| Situation | Best Measure of Center | Best Measure of Spread |
|---|---|---|
| Symmetric data, no outliers | Mean | Standard Deviation |
| Skewed data or outliers | Median | Interquartile Range (IQR) |
| Categorical data | Mode | N/A |
| Comparing variability across different scales | Mean | Coefficient of Variation |
| Want to understand "typical" value | Median | IQR |
| Want to understand total/sum | Mean | Variance |
4.2.2 Percentiles, Quartiles, and Outliers
Sometimes we want to know more than just the center and spread. We want to understand the distribution of values.
Percentiles
A percentile tells you the value below which a certain percentage of data falls.
Examples:
- 25th percentile (P25): 25% of values are below this, 75% above
- 50th percentile (P50): Same as the median
- 75th percentile (P75): 75% of values are below this, 25% above
- 90th percentile (P90): 90% of values are below this, 10% above
Business applications:
- Performance evaluation: "You're in the 90th percentile of sales reps" means you outperformed 90% of your peers
- Service level agreements: "99th percentile response time < 2 seconds" means 99% of requests are answered within 2 seconds
- Pricing: "Our prices are at the 60th percentile of the market" means 60% of competitors charge less, 40% charge more
Quartiles
Quartiles divide data into four equal parts:
- Q1 (First Quartile) : 25th percentile
- Q2 (Second Quartile) : 50th percentile (median)
- Q3 (Third Quartile) : 75th percentile
Interquartile Range (IQR)
IQR = Q3 - Q1
This is the range containing the middle 50% of data. It's a robust measure of spread: a few extreme values barely change it.
Example : If Q1 = $48,000 and Q3 = $62,000, then IQR = $14,000. The middle 50% of salaries span a $14,000 range.
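To see the robustness of the IQR in practice, the sketch below reuses the salary data from Section 4.2.1 and compares how the standard deviation and the IQR react when the CEO's salary is dropped. The helper name `iqr` is just an illustrative choice.

```python
import numpy as np

# Salary data from the example in Section 4.2.1 (the last value is the CEO).
salaries = np.array([45000, 52000, 48000, 55000, 62000, 58000, 51000, 49000,
                     67000, 72000, 55000, 59000, 61000, 48000, 53000, 2000000])
without_ceo = salaries[salaries <= 500_000]

def iqr(values):
    """Interquartile range: 75th percentile minus 25th percentile."""
    q1, q3 = np.percentile(values, [25, 75])
    return q3 - q1

for label, data in [("With CEO", salaries), ("Without CEO", without_ceo)]:
    sd = np.std(data, ddof=1)
    print(f"{label:<12} SD: ${sd:>10,.0f}   IQR: ${iqr(data):>8,.0f}")
```

The standard deviation changes by well over an order of magnitude when the single extreme value is removed, while the IQR moves only slightly.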
Identifying Outliers
An outlier is a value that's unusually far from the rest of the data.
Common definition : A value is an outlier if it's:
- Below Q1 - 1.5 × IQR, or
- Above Q3 + 1.5 × IQR
This is the definition used in box plots.
Why 1.5 × IQR? It's a convention that works well in practice. For normally distributed data, this rule flags about 0.7% of values as outliers.
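The 0.7% figure can be verified directly from the standard normal distribution. This sketch uses `statistics.NormalDist` from the Python standard library and assumes perfectly normal data, which real business data rarely is.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1
q1, q3 = z.inv_cdf(0.25), z.inv_cdf(0.75)
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Probability mass outside the fences, both tails combined.
share_flagged = z.cdf(lower_fence) + (1 - z.cdf(upper_fence))

print(f"Fences at {lower_fence:.2f} and {upper_fence:.2f} standard deviations")
print(f"Share of values flagged as outliers: {share_flagged:.2%}")
```

The fences sit at about 2.7 standard deviations from the mean. In skewed or heavy-tailed data, the same rule will typically flag a noticeably larger share of values.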
Practical Example: Analyzing Customer Purchase Amounts
Let's say you're analyzing customer purchase amounts for an online store.
Prompt to AI:
I have customer purchase data. Write Python code to:
1. Calculate quartiles and IQR
2. Identify outliers using the 1.5×IQR rule
3. Create a box plot
4. Show summary statistics
Use this data:
purchases = [23, 45, 38, 52, 61, 48, 55, 42, 39, 58, 67, 44, 51, 49,
47, 53, 62, 41, 56, 59, 350, 28, 46, 54, 50]
Python Code:
import numpy as np
import matplotlib.pyplot as plt
purchases = [23, 45, 38, 52, 61, 48, 55, 42, 39, 58, 67, 44, 51, 49,
47, 53, 62, 41, 56, 59, 350, 28, 46, 54, 50]
# Calculate quartiles
Q1 = np.percentile(purchases, 25)
Q2 = np.percentile(purchases, 50) # median
Q3 = np.percentile(purchases, 75)
IQR = Q3 - Q1
# Calculate outlier boundaries
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
# Identify outliers
outliers = [x for x in purchases if x < lower_bound or x > upper_bound]
normal_values = [x for x in purchases if lower_bound <= x <= upper_bound]
# Display results
print("=== QUARTILE ANALYSIS ===\n")
print(f"Q1 (25th percentile): ${Q1:.2f}")
print(f"Q2 (50th percentile/Median): ${Q2:.2f}")
print(f"Q3 (75th percentile): ${Q3:.2f}")
print(f"IQR: ${IQR:.2f}")
print(f"\nOutlier Boundaries:")
print(f" Lower: ${lower_bound:.2f}")
print(f" Upper: ${upper_bound:.2f}")
print(f"\nOutliers detected: {outliers}")
print(f"Number of outliers: {len(outliers)} out of {len(purchases)} ({len(outliers)/len(purchases)*100:.1f}%)")
# Create box plot
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
# Box plot
ax1.boxplot(purchases, vert=False)
ax1.set_xlabel('Purchase Amount ($)', fontsize=11)
ax1.set_title('Box Plot of Purchase Amounts', fontsize=12, fontweight='bold')
ax1.grid(axis='x', alpha=0.3)
# Histogram with outliers highlighted
ax2.hist(normal_values, bins=15, color='skyblue', edgecolor='black', alpha=0.7, label='Normal')
ax2.hist(outliers, bins=5, color='red', edgecolor='black', alpha=0.7, label='Outliers')
ax2.axvline(Q2, color='green', linestyle='--', linewidth=2, label=f'Median: ${Q2:.0f}')
ax2.set_xlabel('Purchase Amount ($)', fontsize=11)
ax2.set_ylabel('Frequency', fontsize=11)
ax2.set_title('Distribution with Outliers Highlighted', fontsize=12, fontweight='bold')
ax2.legend()
ax2.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()
Output:
=== QUARTILE ANALYSIS ===
Q1 (25th percentile): $44.00
Q2 (50th percentile/Median): $50.00
Q3 (75th percentile): $56.00
IQR: $12.00
Outlier Boundaries:
Lower: $26.00
Upper: $74.00
Outliers detected: [23, 350]
Number of outliers: 2 out of 25 (8.0%)
Interpretation:
The box plot shows:
- The "box" contains the middle 50% of purchases ($44 to $56)
- The line inside the box is the median ($50)
- The "whiskers" extend to the minimum and maximum non-outlier values ($28 and $67)
- The dots beyond the whiskers are the outliers ($23 on the low side and $350 on the high side)
Business questions to ask:
- Is this outlier an error? Maybe someone accidentally entered $350 instead of $35.00. Check the data.
- Is this outlier legitimate but unusual? Maybe one customer made a bulk purchase. This is real data but not representative of typical behavior.
- Should we include or exclude it?
  - Include if you're calculating total revenue (you did receive $350)
  - Exclude if you're trying to understand typical customer behavior
  - Analyze separately if you're segmenting customers (this might be a "high-value" customer segment)
Percentile Analysis for Business Insights
Let's calculate various percentiles to understand the distribution better.
Prompt to AI:
Calculate and display the 10th, 25th, 50th, 75th, 90th, and 95th percentiles
of the purchase data (excluding the outlier). Explain what each means in
business terms.
Python Code:
# Remove the extreme high purchase for this analysis (the low $23 purchase stays in)
purchases_clean = [x for x in purchases if x <= upper_bound]
# Calculate percentiles
percentiles = [10, 25, 50, 75, 90, 95]
values = [np.percentile(purchases_clean, p) for p in percentiles]
print("=== PERCENTILE ANALYSIS ===\n")
for p, v in zip(percentiles, values):
    print(f"P{p:2d}: ${v:6.2f} → {p}% of purchases are below ${v:.2f}")
print("\n=== BUSINESS INSIGHTS ===\n")
print(f"• Bottom 10% of customers spend less than ${values[0]:.2f}")
print(f"• Middle 50% of customers spend between ${values[1]:.2f} and ${values[3]:.2f}")
print(f"• Top 10% of customers spend more than ${values[4]:.2f}")
print(f"• Top 5% of customers spend more than ${values[5]:.2f}")
Output:
=== PERCENTILE ANALYSIS ===
P10: $ 38.30 → 10% of purchases are below $38.30
P25: $ 43.50 → 25% of purchases are below $43.50
P50: $ 49.50 → 50% of purchases are below $49.50
P75: $ 55.25 → 75% of purchases are below $55.25
P90: $ 60.40 → 90% of purchases are below $60.40
P95: $ 61.85 → 95% of purchases are below $61.85
=== BUSINESS INSIGHTS ===
• Bottom 10% of customers spend less than $38.30
• Middle 50% of customers spend between $43.50 and $55.25
• Top 10% of customers spend more than $60.40
• Top 5% of customers spend more than $61.85
How to use this in business:
- Pricing strategy: About 75% of purchases are $43.50 or more, so a product priced at or below $43.50 sits within what roughly 75% of customers already spend
- Promotions: Target the bottom 25% (spending < $43.50) with incentives to increase purchase size
- VIP programs: Create a premium tier for the top 10% (spending > $60.40)
- Inventory planning: Stock products that appeal to the middle 50% ($43.50-$55.25 range)
- Performance benchmarks: "Our goal is to move the median purchase from about $50 to $55"
The Five-Number Summary
A common way to summarize a distribution is the five-number summary :
- Minimum
- Q1 (25th percentile)
- Median (50th percentile)
- Q3 (75th percentile)
- Maximum
This is exactly what a box plot visualizes.
Prompt to AI:
Create a function that returns a five-number summary and displays it nicely.
Python Code:
def five_number_summary(data, name="Data"):
    """Calculate and display five-number summary."""
    minimum = np.min(data)
    q1 = np.percentile(data, 25)
    median = np.percentile(data, 50)
    q3 = np.percentile(data, 75)
    maximum = np.max(data)
    print(f"=== FIVE-NUMBER SUMMARY: {name} ===\n")
    print(f" Minimum: ${minimum:,.2f}")
    print(f" Q1: ${q1:,.2f}")
    print(f" Median: ${median:,.2f}")
    print(f" Q3: ${q3:,.2f}")
    print(f" Maximum: ${maximum:,.2f}")
    print(f"\n Range: ${maximum - minimum:,.2f}")
    print(f" IQR: ${q3 - q1:,.2f}")
    return {"min": minimum, "q1": q1, "median": median, "q3": q3, "max": maximum}
# Use it
five_number_summary(purchases_clean, "Customer Purchases")
Output:
=== FIVE-NUMBER SUMMARY: Customer Purchases ===
Minimum: $23.00
Q1: $43.50
Median: $49.50
Q3: $55.25
Maximum: $67.00
Range: $44.00
IQR: $11.75
This gives you a complete picture of the distribution in just five numbers.
Key Takeaways: Percentiles and Outliers
- Percentiles give you more information than just mean and median: they show the shape of the distribution
- IQR is a robust measure of spread: unlike standard deviation, it is barely affected by outliers
- Outliers aren't always errors: they might be important business insights (VIP customers, fraud, rare events)
- Box plots are excellent for comparing distributions: you can put multiple box plots side-by-side to compare groups
- Always investigate outliers: don't automatically remove them. Understand what they represent.
4.3 Introduction to Probability
Probability is the language of uncertainty. In business, almost nothing is certain—customers might buy or not, projects might succeed or fail, markets might rise or fall. Probability helps us quantify and reason about these uncertainties.
4.3.1 Events, Sample Spaces, and Basic Rules
Sample Space
The sample space is the set of all possible outcomes of a random process.
Examples:
- Flipping a coin: {Heads, Tails}
- Rolling a die: {1, 2, 3, 4, 5, 6}
- Customer response to email: {Opens, Doesn't Open}
- Product quality: {Defective, Non-Defective}
Event
An event is a specific outcome or set of outcomes we're interested in.
Examples:
- Rolling an even number: {2, 4, 6}
- Customer makes a purchase: {Purchase}
- Project finishes on time or early: {On Time, Early}
Probability
The probability of an event is a number between 0 and 1 that represents how likely it is to occur.
- P = 0 : Impossible (will never happen)
- P = 0.5 : Equally likely to happen or not
- P = 1 : Certain (will definitely happen)
How to calculate probability:
For equally likely outcomes:
P(Event) = Number of favorable outcomes / Total number of possible outcomes
Example : Probability of rolling a 4 on a fair die:
P(4) = 1/6 ≈ 0.167 or 16.7%
For real-world events, we often estimate probability from historical data:
P(Event) = Number of times event occurred / Total number of observations
Example : If 1,200 out of 10,000 customers clicked an ad:
P(Click) = 1,200/10,000 = 0.12 or 12%
Basic Probability Rules
Rule 1: Complement Rule
The probability that an event does NOT occur is:
P(not A) = 1 - P(A)
Example : If P(Customer Buys) = 0.15, then:
P(Customer Doesn't Buy) = 1 - 0.15 = 0.85 or 85%
Rule 2: Addition Rule (OR)
For mutually exclusive events (can't both happen):
P(A or B) = P(A) + P(B)
Example : Probability of rolling a 2 OR a 5:
P(2 or 5) = P(2) + P(5) = 1/6 + 1/6 = 2/6 = 1/3
For non-mutually exclusive events (can both happen):
P(A or B) = P(A) + P(B) - P(A and B)
Example : In a group of customers, 60% are female, 40% are premium members, and 25% are both. What's the probability a randomly selected customer is female OR a premium member?
P(Female or Premium) = 0.60 + 0.40 - 0.25 = 0.75 or 75%
Why subtract P(A and B)? Because we counted those customers twice—once in P(Female) and once in P(Premium).
Rule 3: Multiplication Rule (AND)
For independent events (one doesn't affect the other):
P(A and B) = P(A) × P(B)
Example : Probability of flipping heads twice in a row:
P(Heads and Heads) = 0.5 × 0.5 = 0.25 or 25%
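For dependent events (one affects the other), multiply by the conditional probability instead:
P(A and B) = P(A) × P(B given A)
Example: If 30% of website visitors add items to cart, and 40% of those who add items complete purchase, what's the probability a random visitor completes a purchase? Here the 40% is already conditional on adding to cart, so:
P(Add to Cart and Purchase) = P(Add to Cart) × P(Purchase given Add to Cart) = 0.30 × 0.40 = 0.12 or 12%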
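The three rules can be checked numerically with the figures from the examples above. The sketch also tests whether "female" and "premium member" are independent in that customer group, which is an extra check the text does not make. The cart example multiplies by a rate that is already conditional on adding to cart.

```python
# Figures taken from the examples in this section.
p_buy = 0.15
p_female, p_premium, p_both = 0.60, 0.40, 0.25
p_add_to_cart, p_purchase_given_cart = 0.30, 0.40

# Complement rule: P(not A) = 1 - P(A)
p_not_buy = 1 - p_buy

# Addition rule for overlapping events: subtract the double-counted overlap.
p_female_or_premium = p_female + p_premium - p_both

# Multiplication rule for independent events: two fair coin flips.
p_two_heads = 0.5 * 0.5

# Multiplication rule with a conditional rate: P(A and B) = P(A) x P(B given A).
p_cart_and_purchase = p_add_to_cart * p_purchase_given_cart

# Independence check: independent events satisfy P(A and B) = P(A) x P(B).
independent = abs(p_both - p_female * p_premium) < 1e-9

print(f"P(doesn't buy)          = {p_not_buy:.2f}")
print(f"P(female or premium)    = {p_female_or_premium:.2f}")
print(f"P(two heads)            = {p_two_heads:.2f}")
print(f"P(cart and purchase)    = {p_cart_and_purchase:.2f}")
print(f"Female/premium independent? {independent}")
```

In the customer example, 0.60 × 0.40 = 0.24 differs from the stated overlap of 0.25, so the two attributes are close to independent but not exactly. Multiplying the two probabilities would slightly understate the overlap.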
Practical Example: Marketing Campaign Analysis
You're analyzing a marketing campaign. Historical data shows:
- 20% of recipients open the email
- 10% of those who open click the link
- 5% of those who click make a purchase
Questions:
- What's the probability a recipient makes a purchase?
- What's the probability a recipient does NOT open the email?
- If you send to 50,000 people, how many purchases do you expect?
Prompt to AI:
I have a marketing funnel with these conversion rates:
- Open rate: 20%
- Click rate (given open): 10%
- Purchase rate (given click): 5%
Write Python code to:
1. Calculate probability of purchase
2. Calculate probability of NOT opening
3. Calculate expected purchases from 50,000 emails
4. Visualize the funnel
Python Code:
import matplotlib.pyplot as plt
# Conversion rates
p_open = 0.20
p_click_given_open = 0.10
p_purchase_given_click = 0.05
# Calculate probabilities
p_not_open = 1 - p_open
p_purchase = p_open * p_click_given_open * p_purchase_given_click
# Expected outcomes from 50,000 emails
total_emails = 50000
expected_opens = total_emails * p_open
expected_clicks = expected_opens * p_click_given_open
expected_purchases = expected_clicks * p_purchase_given_click
# Display results
print("=== MARKETING FUNNEL ANALYSIS ===\n")
print(f"Probability of opening: {p_open:.1%}")
print(f"Probability of NOT opening: {p_not_open:.1%}")
print(f"Probability of clicking (given open): {p_click_given_open:.1%}")
print(f"Probability of purchase (given click): {p_purchase_given_click:.1%}")
print(f"\nOverall probability of purchase: {p_purchase:.3%}")
print(f"\n=== EXPECTED OUTCOMES FROM {total_emails:,} EMAILS ===\n")
print(f"Opens: {expected_opens:>10,.0f} ({p_open:.1%})")
print(f"Clicks: {expected_clicks:>10,.0f} ({expected_clicks/total_emails:.2%})")
print(f"Purchases: {expected_purchases:>10,.0f} ({p_purchase:.3%})")
# Visualize funnel
stages = ['Sent', 'Opened', 'Clicked', 'Purchased']
values = [total_emails, expected_opens, expected_clicks, expected_purchases]
colors = ['#3498db', '#2ecc71', '#f39c12', '#e74c3c']
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
# Funnel chart
ax1.barh(stages, values, color=colors, edgecolor='black')
for i, (stage, value) in enumerate(zip(stages, values)):
    ax1.text(value + 1000, i, f'{value:,.0f}', va='center', fontweight='bold')
ax1.set_xlabel('Number of People', fontsize=11)
ax1.set_title('Marketing Funnel: Expected Outcomes', fontsize=12, fontweight='bold')
ax1.grid(axis='x', alpha=0.3)
# Conversion rates
conversion_rates = [100, p_open*100, (p_open*p_click_given_open)*100, p_purchase*100]
ax2.plot(stages, conversion_rates, marker='o', linewidth=2, markersize=10, color='#e74c3c')
ax2.fill_between(range(len(stages)), conversion_rates, alpha=0.3, color='#e74c3c')
for i, (stage, rate) in enumerate(zip(stages, conversion_rates)):
    ax2.text(i, rate + 2, f'{rate:.2f}%', ha='center', fontweight='bold')
ax2.set_ylabel('Percentage (%)', fontsize=11)
ax2.set_title('Conversion Rates Through Funnel', fontsize=12, fontweight='bold')
ax2.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()
Output:
=== MARKETING FUNNEL ANALYSIS ===
Probability of opening: 20.0%
Probability of NOT opening: 80.0%
Probability of clicking (given open): 10.0%
Probability of purchase (given click): 5.0%
Overall probability of purchase: 0.100%
=== EXPECTED OUTCOMES FROM 50,000 EMAILS ===
Opens: 10,000 (20.0%)
Clicks: 1,000 (2.00%)
Purchases: 50 (0.100%)
Business Insights:
- Only 0.1% of recipients will purchase. This might sound low, but it's typical for cold email campaigns.
- The biggest absolute drop-off is at the open stage: 80% of recipients (40,000 people) never open the email. This suggests:
  - Improve subject lines
  - Better audience targeting
  - Test send times
- Expected 50 purchases from 50,000 emails. If average purchase value is $100, that's $5,000 revenue. Compare this to campaign cost to determine ROI.
- Each stage multiplies probabilities, so small improvements at each stage compound. If you improve open rate from 20% to 25%, expected purchases increase by 25% (from 50 to 62.5).
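The compounding effect can be made concrete with a small sensitivity check. The sketch below starts from the funnel rates in the example and applies a hypothetical 25% relative improvement, first to one stage at a time and then to all three. The function name `expected_purchases` and the size of the lift are illustrative assumptions.

```python
# Baseline funnel rates from the campaign example.
base_rates = {"open": 0.20, "click": 0.10, "purchase": 0.05}
total_emails = 50_000
lift = 1.25  # hypothetical 25% relative improvement at a stage

def expected_purchases(rates, emails):
    """Multiply the stage rates together and scale by the number of emails."""
    probability = 1.0
    for rate in rates.values():
        probability *= rate
    return emails * probability

baseline = expected_purchases(base_rates, total_emails)
print(f"Baseline: {baseline:.1f} purchases")

# Improve one stage at a time: each gives the same 25% gain in purchases.
for stage in base_rates:
    improved = {**base_rates, stage: base_rates[stage] * lift}
    result = expected_purchases(improved, total_emails)
    print(f"Improve {stage:<8} by 25%: {result:.1f} purchases")

# Improve every stage at once: the lifts multiply (1.25 ** 3 is about 1.95).
all_improved = {stage: rate * lift for stage, rate in base_rates.items()}
combined = expected_purchases(all_improved, total_emails)
print(f"Improve all stages by 25%: {combined:.1f} purchases")
```

Because the stages multiply, a 25% lift at any single stage yields the same 25% gain in purchases, while lifting all three nearly doubles the expected total.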
4.3.2 Conditional Probability and Bayes' Theorem
Conditional Probability
Conditional probability is the probability of an event occurring, given that another event has already occurred.
Notation : P(A|B) reads as "probability of A given B"
Formula :
P(A|B) = P(A and B) / P(B)
Intuition : We're restricting our attention to only those cases where B occurred, and asking how often A also occurs in those cases.
Example :
In a company:
- 60% of employees are in Sales
- 40% of employees are in Engineering
- 30% of Sales employees have MBA degrees
- 50% of Engineering employees have MBA degrees
Question : If you randomly select an employee with an MBA, what's the probability they're in Engineering?
This is asking: P(Engineering | MBA)
Let's calculate:
Prompt to AI:
Given:
- P(Sales) = 0.60
- P(Engineering) = 0.40
- P(MBA | Sales) = 0.30
- P(MBA | Engineering) = 0.50
Calculate:
1. P(MBA and Sales)
2. P(MBA and Engineering)
3. P(MBA) - total probability of having MBA
4. P(Engineering | MBA) - probability of being in Engineering given MBA
Show the calculations step by step.
Python Code:
# Given probabilities
p_sales = 0.60
p_engineering = 0.40
p_mba_given_sales = 0.30
p_mba_given_engineering = 0.50
# Step 1: Calculate P(MBA and Sales)
p_mba_and_sales = p_sales * p_mba_given_sales
# Step 2: Calculate P(MBA and Engineering)
p_mba_and_engineering = p_engineering * p_mba_given_engineering
# Step 3: Calculate P(MBA) using law of total probability
p_mba = p_mba_and_sales + p_mba_and_engineering
# Step 4: Calculate P(Engineering | MBA) using Bayes' theorem
p_engineering_given_mba = p_mba_and_engineering / p_mba
# Display results
print("=== CONDITIONAL PROBABILITY ANALYSIS ===\n")
print("Given Information:")
print(f" P(Sales) = {p_sales:.0%}")
print(f" P(Engineering) = {p_engineering:.0%}")
print(f" P(MBA | Sales) = {p_mba_given_sales:.0%}")
print(f" P(MBA | Engineering) = {p_mba_given_engineering:.0%}")
print("\nCalculations:")
print(f" P(MBA and Sales) = P(Sales) × P(MBA|Sales)")
print(f" = {p_sales:.2f} × {p_mba_given_sales:.2f} = {p_mba_and_sales:.2f}")
print(f"\n P(MBA and Engineering) = P(Engineering) × P(MBA|Engineering)")
print(f" = {p_engineering:.2f} × {p_mba_given_engineering:.2f} = {p_mba_and_engineering:.2f}")
print(f"\n P(MBA) = P(MBA and Sales) + P(MBA and Engineering)")
print(f" = {p_mba_and_sales:.2f} + {p_mba_and_engineering:.2f} = {p_mba:.2f}")
print(f"\n P(Engineering | MBA) = P(MBA and Engineering) / P(MBA)")
print(f" = {p_mba_and_engineering:.2f} / {p_mba:.2f} = {p_engineering_given_mba:.2f}")
print(f"\n=== ANSWER ===")
print(f"If an employee has an MBA, there's a {p_engineering_given_mba:.1%} chance they're in Engineering")
print(f"and a {1-p_engineering_given_mba:.1%} chance they're in Sales.")
Output:
=== CONDITIONAL PROBABILITY ANALYSIS ===
Given Information:
P(Sales) = 60%
P(Engineering) = 40%
P(MBA | Sales) = 30%
P(MBA | Engineering) = 50%
Calculations:
P(MBA and Sales) = P(Sales) × P(MBA|Sales)
= 0.60 × 0.30 = 0.18
P(MBA and Engineering) = P(Engineering) × P(MBA|Engineering)
= 0.40 × 0.50 = 0.20
P(MBA) = P(MBA and Sales) + P(MBA and Engineering)
= 0.18 + 0.20 = 0.38
P(Engineering | MBA) = P(MBA and Engineering) / P(MBA)
= 0.20 / 0.38 = 0.53
=== ANSWER ===
If an employee has an MBA, there's a 52.6% chance they're in Engineering
and a 47.4% chance they're in Sales.
Key Insight: Even though only 40% of employees are in Engineering, 52.6% of MBA holders are in Engineering. Why? Because Engineering employees are more likely to have MBAs (50% vs. 30%).
This is Bayes' Theorem in action.
Bayes' Theorem
Bayes' Theorem is one of the most important formulas in statistics. It lets us "reverse" conditional probabilities.
Formula:
P(A|B) = [P(B|A) × P(A)] / P(B)
In words:
P(A given B) = [P(B given A) × P(A)] / P(B)
Why it matters: Often we know P(B|A) but want to find P(A|B).
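As a minimal sketch, the formula can be wrapped in a small helper. The function name `bayes_posterior` and its argument names are illustrative, not part of any library, and the helper assumes only two mutually exclusive cases (A and not-A) so that P(B) can be expanded with the law of total probability.

```python
def bayes_posterior(prior, likelihood, likelihood_if_not):
    """Return P(A|B) given P(A), P(B|A), and P(B|not A)."""
    # Law of total probability: P(B) = P(B|A)*P(A) + P(B|not A)*P(not A)
    evidence = likelihood * prior + likelihood_if_not * (1 - prior)
    # Bayes' Theorem: P(A|B) = P(B|A)*P(A) / P(B)
    return likelihood * prior / evidence


# Re-check the MBA example: A = "in Engineering", B = "has an MBA"
p_eng_given_mba = bayes_posterior(prior=0.40, likelihood=0.50, likelihood_if_not=0.30)
print(f"P(Engineering | MBA) = {p_eng_given_mba:.1%}")  # 52.6%
```

The same helper works for any two-case problem: pass the base rate as `prior` and the two conditional rates as the likelihoods.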
Classic Example: Medical Testing
A disease affects 1% of the population. A test for the disease detects 95% of true cases (95% sensitivity) and has a 5% false positive rate (it incorrectly indicates disease in 5% of people who don't have it).
You test positive. What's the probability you actually have the disease?
Intuition says: 95% (the test's detection rate)
Reality: Much lower!
Let's calculate:
Prompt to AI:
Use Bayes' Theorem to solve this medical testing problem:
- P(Disease) = 0.01 (1% of population has disease)
- P(Positive Test | Disease) = 0.95 (test detects 95% of cases)
- P(Positive Test | No Disease) = 0.05 (5% false positive rate)
Calculate P(Disease | Positive Test)
Show all steps and create a visualization.
Python Code:
# Given probabilities
p_disease = 0.01
p_no_disease = 1 - p_disease
p_positive_given_disease = 0.95
p_positive_given_no_disease = 0.05
# Calculate P(Positive Test) using law of total probability
p_positive = (p_positive_given_disease * p_disease +
p_positive_given_no_disease * p_no_disease)
# Apply Bayes' Theorem
p_disease_given_positive = (p_positive_given_disease * p_disease) / p_positive
# Display results
print("=== BAYES' THEOREM: MEDICAL TEST EXAMPLE ===\n")
print("Given:")
print(f" P(Disease) = {p_disease:.1%}")
print(f" P(Positive | Disease) = {p_positive_given_disease:.0%}")
print(f" P(Positive | No Disease) = {p_positive_given_no_disease:.0%}")
print("\nStep 1: Calculate P(Positive Test)")
print(f" P(Positive) = P(Positive|Disease) × P(Disease) + P(Positive|No Disease) × P(No Disease)")
print(f" = {p_positive_given_disease:.2f} × {p_disease:.2f} + {p_positive_given_no_disease:.2f} × {p_no_disease:.2f}")
print(f" = {p_positive_given_disease * p_disease:.4f} + {p_positive_given_no_disease * p_no_disease:.4f}")
print(f" = {p_positive:.4f}")
print("\nStep 2: Apply Bayes' Theorem")
print(f" P(Disease | Positive) = P(Positive|Disease) × P(Disease) / P(Positive)")
print(f" = {p_positive_given_disease:.2f} × {p_disease:.2f} / {p_positive:.4f}")
print(f" = {p_positive_given_disease * p_disease:.4f} / {p_positive:.4f}")
print(f" = {p_disease_given_positive:.4f}")
print(f"\n=== ANSWER ===")
print(f"If you test positive, the probability you actually have the disease is {p_disease_given_positive:.1%}")
print(f"\nThis seems surprisingly low! Here's why:")
print(f" • The disease is rare (only {p_disease:.1%} of people have it)")
print(f" • So most positive tests come from the {p_no_disease:.0%} who don't have it")
print(f" • Even with a low false positive rate ({p_positive_given_no_disease:.0%}), there are many false positives")
# Visualization: Out of 10,000 people
population = 10000
# round() rather than int(): int() truncates, so a float such as 494.9999 would become 494
people_with_disease = round(population * p_disease)
people_without_disease = population - people_with_disease
true_positives = round(people_with_disease * p_positive_given_disease)
false_negatives = people_with_disease - true_positives
false_positives = round(people_without_disease * p_positive_given_no_disease)
true_negatives = people_without_disease - false_positives
print(f"\n=== VISUALIZATION: OUT OF {population:,} PEOPLE ===\n")
print(f"Have disease ({p_disease:.1%}): {people_with_disease:>4} people")
print(f" Test Positive (True Positive): {true_positives:>4}")
print(f" Test Negative (False Negative): {false_negatives:>4}")
print(f"\nDon't have disease ({p_no_disease:.0%}): {people_without_disease:>4} people")
print(f" Test Positive (False Positive): {false_positives:>4}")
print(f" Test Negative (True Negative): {true_negatives:>4}")
print(f"\nTotal Positive Tests: {true_positives + false_positives}")
print(f" Of these, {true_positives} actually have disease ({true_positives/(true_positives+false_positives):.1%})")
print(f" And {false_positives} don't have disease ({false_positives/(true_positives+false_positives):.1%})")
# Create visualization
import matplotlib.pyplot as plt
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))
# Population breakdown
categories = ['True\nPositive', 'False\nNegative', 'False\nPositive', 'True\nNegative']
values = [true_positives, false_negatives, false_positives, true_negatives]
colors = ['#2ecc71', '#e74c3c', '#e67e22', '#3498db']
ax1.bar(categories, values, color=colors, edgecolor='black', linewidth=1.5)
for i, (cat, val) in enumerate(zip(categories, values)):
    ax1.text(i, val + 50, f'{val:,}', ha='center', fontweight='bold', fontsize=11)
ax1.set_ylabel('Number of People', fontsize=11)
ax1.set_title(f'Test Results for {population:,} People', fontsize=12, fontweight='bold')
ax1.grid(axis='y', alpha=0.3)
# Among positive tests
positive_labels = ['Actually\nHave Disease', 'Actually\nDon\'t Have Disease']
positive_values = [true_positives, false_positives]
positive_colors = ['#2ecc71', '#e67e22']
ax2.bar(positive_labels, positive_values, color=positive_colors, edgecolor='black', linewidth=1.5)
for i, val in enumerate(positive_values):
    pct = val / (true_positives + false_positives) * 100
    ax2.text(i, val + 10, f'{val}\n({pct:.1f}%)', ha='center', fontweight='bold', fontsize=11)
ax2.set_ylabel('Number of People', fontsize=11)
ax2.set_title('Among Those Who Test Positive', fontsize=12, fontweight='bold')
ax2.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()
Output:
=== BAYES' THEOREM: MEDICAL TEST EXAMPLE ===
Given:
P(Disease) = 1.0%
P(Positive | Disease) = 95%
P(Positive | No Disease) = 5%
Step 1: Calculate P(Positive Test)
P(Positive) = P(Positive|Disease) × P(Disease) + P(Positive|No Disease) × P(No Disease)
= 0.95 × 0.01 + 0.05 × 0.99
= 0.0095 + 0.0495
= 0.0590
Step 2: Apply Bayes' Theorem
P(Disease | Positive) = P(Positive|Disease) × P(Disease) / P(Positive)
= 0.95 × 0.01 / 0.0590
= 0.0095 / 0.0590
= 0.1610
=== ANSWER ===
If you test positive, the probability you actually have the disease is 16.1%
This seems surprisingly low! Here's why:
• The disease is rare (only 1.0% of people have it)
• So most positive tests come from the 99% who don't have it
• Even with a low false positive rate (5%), there are many false positives
=== VISUALIZATION: OUT OF 10,000 PEOPLE ===
Have disease (1.0%): 100 people
Test Positive (True Positive): 95
Test Negative (False Negative): 5
Don't have disease (99%): 9900 people
Test Positive (False Positive): 495
Test Negative (True Negative): 9405
Total Positive Tests: 590
Of these, 95 actually have disease (16.1%)
And 495 don't have disease (83.9%)
This is shocking! Even though the test catches 95% of cases, a positive result means there's only a 16.1% chance you actually have the disease.
Why? Because the disease is rare. Out of 10,000 people:
- 100 have the disease → 95 test positive (true positives)
- 9,900 don't have the disease → 495 test positive (false positives)
- Total positive tests: 590
- Only 95 out of 590 (16.1%) actually have the disease
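The same result can be checked by simulation rather than by formula. The sketch below is only an illustration: the population size and random seed are arbitrary choices, and it uses Python's standard `random` module to draw each simulated person's disease status and test result.

```python
import random

random.seed(42)  # fixed seed so the run is repeatable

P_DISEASE = 0.01
P_POS_GIVEN_DISEASE = 0.95
P_POS_GIVEN_HEALTHY = 0.05
N_PEOPLE = 200_000  # simulated population size (arbitrary)

true_positives = 0
false_positives = 0
for _ in range(N_PEOPLE):
    has_disease = random.random() < P_DISEASE
    # Pick the positive-test probability that applies to this person
    p_positive = P_POS_GIVEN_DISEASE if has_disease else P_POS_GIVEN_HEALTHY
    if random.random() < p_positive:
        if has_disease:
            true_positives += 1
        else:
            false_positives += 1

share_sick = true_positives / (true_positives + false_positives)
print(f"Positive tests: {true_positives + false_positives:,}")
print(f"Share of positives who actually have the disease: {share_sick:.1%}")
```

The simulated share should land close to the 16.1% from Bayes' Theorem, with small run-to-run differences if the seed is changed.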
Business Application: Fraud Detection
This same logic applies to fraud detection, spam filtering, and any rare event detection.
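If fraud is rare (say, 0.5% of transactions), most "fraud alerts" will be false positives even from a model that is right 90% of the time on both fraudulent and legitimate transactions, because legitimate transactions vastly outnumber fraudulent ones. This is why fraud teams need to balance sensitivity (catching fraud) with specificity (not overwhelming investigators with false alarms).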
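To put rough numbers on this, the sketch below assumes that the model flags 90% of fraudulent transactions (sensitivity) and correctly clears 90% of legitimate ones (specificity). That split is an assumption made for illustration, since the text gives only a single 90% figure, and the transaction count is arbitrary.

```python
# Assumed rates: treating "90% accurate" as 90% sensitivity and 90% specificity
p_fraud = 0.005       # 0.5% of transactions are fraudulent
sensitivity = 0.90    # P(alert | fraud)
specificity = 0.90    # P(no alert | legitimate)

transactions = 100_000  # arbitrary volume for illustration
fraud_cases = transactions * p_fraud
legit_cases = transactions - fraud_cases

true_alerts = fraud_cases * sensitivity
false_alerts = legit_cases * (1 - specificity)

# Precision = P(fraud | alert)
precision = true_alerts / (true_alerts + false_alerts)

print(f"True alerts:  {true_alerts:,.0f}")
print(f"False alerts: {false_alerts:,.0f}")
print(f"Share of alerts that are real fraud: {precision:.1%}")
```

Under these assumed rates, only about 4% of alerts point to real fraud, so investigators would spend most of their time on false alarms.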
Practical Business Example: Customer Churn Prediction
You're analyzing customer churn. Historical data shows:
- 10% of customers churn each year
- 70% of customers who churn had a support ticket in the last month
- 20% of customers who don't churn had a support ticket in the last month
Question: If a customer has a support ticket, what's the probability they'll churn?
Prompt to AI:
Use Bayes' Theorem:
- P(Churn) = 0.10
- P(Support Ticket | Churn) = 0.70
- P(Support Ticket | No Churn) = 0.20
Calculate P(Churn | Support Ticket) and interpret for business.
Python Code:
# Given probabilities
p_churn = 0.10
p_no_churn = 1 - p_churn
p_ticket_given_churn = 0.70
p_ticket_given_no_churn = 0.20
# Calculate P(Support Ticket)
p_ticket = (p_ticket_given_churn * p_churn +
p_ticket_given_no_churn * p_no_churn)
# Apply Bayes' Theorem
p_churn_given_ticket = (p_ticket_given_churn * p_churn) / p_ticket
print("=== CUSTOMER CHURN ANALYSIS ===\n")
print(f"Base churn rate: {p_churn:.0%}")
print(f"Churn rate among customers with support ticket: {p_churn_given_ticket:.1%}")
print(f"\nIncrease in churn risk: {p_churn_given_ticket/p_churn:.1f}x")
print(f"\n=== BUSINESS INSIGHT ===")
print(f"Customers with support tickets are {p_churn_given_ticket/p_churn:.1f}x more likely to churn.")
print(f"This suggests:")
print(f" • Support tickets indicate customer dissatisfaction")
print(f" • Proactive outreach to these customers could reduce churn")
print(f" • Improving support quality is critical for retention")
# Calculate expected impact of intervention
customers = 10000
customers_with_tickets = int(customers * p_ticket)
expected_churns_with_tickets = int(customers_with_tickets * p_churn_given_ticket)
print(f"\n=== EXPECTED IMPACT ===")
print(f"Out of {customers:,} customers:")
print(f" • {customers_with_tickets:,} will have support tickets")
print(f" • {expected_churns_with_tickets:,} of those will churn")
print(f"\nIf you could reduce churn by 50% among ticket holders:")
print(f" • You'd save {expected_churns_with_tickets//2:,} customers")
print(f" • At $1,000 lifetime value, that's ${expected_churns_with_tickets//2 * 1000:,} in retained revenue")
Output:
=== CUSTOMER CHURN ANALYSIS ===
Base churn rate: 10%
Churn rate among customers with support ticket: 28.0%
Increase in churn risk: 2.8x
=== BUSINESS INSIGHT ===
Customers with support tickets are 2.8x more likely to churn.
This suggests:
• Support tickets indicate customer dissatisfaction
• Proactive outreach to these customers could reduce churn
• Improving support quality is critical for retention
=== EXPECTED IMPACT ===
Out of 10,000 customers:
• 2,500 will have support tickets
• 700 of those will churn
If you could reduce churn by 50% among ticket holders:
• You'd save 350 customers
• At $1,000 lifetime value, that's $350,000 in retained revenue
This is actionable! You now know:
- Support tickets are a strong churn signal
- You can quantify the risk (28% vs. 10% baseline)
- You can estimate the value of intervention ($350,000)
This justifies investing in better support, proactive outreach, or retention campaigns for customers with tickets.
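The 2.8x figure compares ticket holders with the average customer. A complementary comparison is against customers with no ticket. The sketch below reuses the three probabilities from the example; the variable names are illustrative.

```python
p_churn = 0.10
p_ticket_given_churn = 0.70
p_ticket_given_no_churn = 0.20

# P(no ticket) via the law of total probability
p_no_ticket = ((1 - p_ticket_given_churn) * p_churn
               + (1 - p_ticket_given_no_churn) * (1 - p_churn))

# Bayes' Theorem for each group
p_churn_given_no_ticket = (1 - p_ticket_given_churn) * p_churn / p_no_ticket
p_churn_given_ticket = p_ticket_given_churn * p_churn / (1 - p_no_ticket)

# How much riskier is a ticket holder than a customer without a ticket?
relative_risk = p_churn_given_ticket / p_churn_given_no_ticket

print(f"Churn rate with a ticket:    {p_churn_given_ticket:.1%}")
print(f"Churn rate without a ticket: {p_churn_given_no_ticket:.1%}")
print(f"Relative risk: {relative_risk:.1f}x")
```

Measured this way the gap is wider (28% versus 4%), because the 10% baseline already includes the ticket holders themselves.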
Key Takeaways: Conditional Probability and Bayes' Theorem
- Conditional probability lets you update beliefs based on new information
- P(A|B) is not the same as P(B|A): don't confuse them!
- Bayes' Theorem is essential for rare event detection: medical testing, fraud detection, spam filtering
- Base rates matter enormously: a rare event will have many false positives even with an accurate test
- Business applications are everywhere: churn prediction, customer segmentation, risk assessment, A/B test analysis
4.4 Common Probability Distributions in Business
Real-world business data often follows recognizable patterns called probability distributions. Understanding these distributions helps you:
- Model uncertainty
- Make predictions
- Calculate probabilities
- Simulate scenarios
We'll cover four distributions that appear constantly in business analytics.
4.4.1 Binomial, Poisson, Normal, Exponential
1. Binomial Distribution
When to use it: Counting successes in a fixed number of independent trials, where each trial has the same probability of success.
Examples:
- Number of customers who buy out of 100 who visit your store
- Number of defective items in a batch of 50
- Number of emails opened out of 1,000 sent
Parameters:
- n: number of trials
- p: probability of success on each trial
Key properties:
- Mean = n × p
- Standard deviation = √(n × p × (1-p))
Business Example: Email Campaign
You send 1,000 emails. Historically, 15% of recipients click. What's the probability that exactly 140 people click? What's the probability that at least 160 people click?
Prompt to AI:
Use the binomial distribution with n=1000, p=0.15 to:
1. Calculate probability of exactly 140 clicks
2. Calculate probability of at least 160 clicks
3. Calculate mean and standard deviation
4. Plot the distribution
Python Code:
from scipy import stats
import numpy as np
import matplotlib.pyplot as plt
# Parameters
n = 1000 # number of emails
p = 0.15 # click probability
# Create binomial distribution
binom_dist = stats.binom(n, p)
# Calculate probabilities
prob_exactly_140 = binom_dist.pmf(140)
prob_at_least_160 = 1 - binom_dist.cdf(159) # P(X >= 160) = 1 - P(X <= 159)
# Calculate mean and std
mean = n * p
std = np.sqrt(n * p * (1-p))
print("=== BINOMIAL DISTRIBUTION: EMAIL CLICKS ===\n")
print(f"Parameters: n={n}, p={p:.0%}")
print(f"\nExpected clicks: {mean:.0f}")
print(f"Standard deviation: {std:.1f}")
print(f"\nP(exactly 140 clicks) = {prob_exactly_140:.4f} or {prob_exactly_140:.2%}")
print(f"P(at least 160 clicks) = {prob_at_least_160:.4f} or {prob_at_least_160:.2%}")
# Interpretation
print(f"\n=== INTERPRETATION ===")
print(f"• We expect about {mean:.0f} clicks, give or take {std:.0f}")
print(f"• 140 clicks is {(mean-140)/std:.1f} standard deviations below the mean")
print(f"• 160 clicks is {(160-mean)/std:.1f} standard deviations above the mean")
print(f"• Getting 160+ clicks by chance alone is not unusual ({prob_at_least_160:.1%} chance)")
# Plot distribution
x = np.arange(100, 200)
pmf = binom_dist.pmf(x)
plt.figure(figsize=(12, 6))
plt.bar(x, pmf, color='skyblue', edgecolor='black', alpha=0.7)
plt.axvline(mean, color='red', linestyle='--', linewidth=2, label=f'Mean: {mean:.0f}')
plt.axvline(140, color='orange', linestyle='--', linewidth=2, label='140 clicks')
plt.axvline(160, color='green', linestyle='--', linewidth=2, label='160 clicks')
plt.xlabel('Number of Clicks', fontsize=11)
plt.ylabel('Probability', fontsize=11)
plt.title('Binomial Distribution: Email Clicks (n=1000, p=0.15)', fontsize=12, fontweight='bold')
plt.legend()
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()
Output:
=== BINOMIAL DISTRIBUTION: EMAIL CLICKS ===
Parameters: n=1000, p=15%
Expected clicks: 150
Standard deviation: 11.3
P(exactly 140 clicks) = 0.0244 or 2.44%
P(at least 160 clicks) = 0.1992 or 19.92%
=== INTERPRETATION ===
• We expect about 150 clicks, give or take 11
• 140 clicks is 0.9 standard deviations below the mean
• 160 clicks is 0.9 standard deviations above the mean
• Getting 160+ clicks by chance alone is not unusual (19.9% chance)
Business Application:
If you get 160+ clicks, should you conclude your campaign performed better than usual? Not necessarily: there's a 19.9% chance (about one in five) of getting that many just by random variation. You'd need significantly more (say, 175+) to be confident the campaign truly outperformed.
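Rather than guessing a cutoff, you can compute the click count at which chance alone becomes an unlikely explanation. The sketch below uses only the standard library (log-gamma in place of SciPy's binomial functions), and the 5% cutoff `alpha` is an illustrative choice, not a rule from the text.

```python
import math


def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p), computed in log space to avoid overflow."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log(1 - p))
    return math.exp(log_pmf)


def upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(binom_pmf(i, n, p) for i in range(k, n + 1))


n, p = 1000, 0.15
alpha = 0.05  # illustrative cutoff for "unlikely to be chance"

# Smallest click count whose chance-alone probability is at most alpha
threshold = next(k for k in range(n + 1) if upper_tail(k, n, p) <= alpha)
print(f"Clicks needed so that P(X >= k) <= {alpha:.0%}: {threshold}")
print(f"P(X >= {threshold}) = {upper_tail(threshold, n, p):.2%}")
```

Lowering `alpha` (for example to 0.01) raises the threshold, which is the usual trade-off between false alarms and missed improvements.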
2. Poisson Distribution
When to use it: Counting events that occur randomly over time or space, when events are independent and the average rate is constant.
Examples:
- Number of customer service calls per hour
- Number of defects per square meter of fabric
- Number of website visits per minute
- Number of accidents per month
Parameter:
- λ (lambda): average rate of events
Key properties:
- Mean = λ
- Standard deviation = √λ
- Variance = λ
Business Example: Customer Service Calls
Your call center receives an average of 12 calls per hour. What's the probability of receiving exactly 15 calls in the next hour? What's the probability of receiving more than 20 calls?
Python Code:
from scipy import stats
import numpy as np
import matplotlib.pyplot as plt
# Parameter
lambda_rate = 12 # average calls per hour
# Create Poisson distribution
poisson_dist = stats.poisson(lambda_rate)
# Calculate probabilities
prob_exactly_15 = poisson_dist.pmf(15)
prob_more_than_20 = 1 - poisson_dist.cdf(20) # P(X > 20) = 1 - P(X <= 20)
prob_fewer_than_8 = poisson_dist.cdf(7) # P(X < 8) = P(X <= 7)
print("=== POISSON DISTRIBUTION: CALL CENTER ===\n")
print(f"Average rate: λ = {lambda_rate} calls/hour")
print(f"Standard deviation: {np.sqrt(lambda_rate):.2f}")
print(f"\nP(exactly 15 calls) = {prob_exactly_15:.4f} or {prob_exactly_15:.2%}")
print(f"P(more than 20 calls) = {prob_more_than_20:.4f} or {prob_more_than_20:.2%}")
print(f"P(fewer than 8 calls) = {prob_fewer_than_8:.4f} or {prob_fewer_than_8:.2%}")
# Staffing implications
print(f"\n=== STAFFING IMPLICATIONS ===")
print(f"• If you staff for 12 calls/hour, you'll be understaffed {1-poisson_dist.cdf(12):.1%} of the time")
print(f"• If you staff for 15 calls/hour, you'll be understaffed {1-poisson_dist.cdf(15):.1%} of the time")
print(f"• If you staff for 18 calls/hour, you'll be understaffed {1-poisson_dist.cdf(18):.1%} of the time")
# Calculate 95th percentile (capacity needed to handle 95% of hours)
capacity_95 = poisson_dist.ppf(0.95)
print(f"\n• To handle 95% of hours, staff for {capacity_95:.0f} calls/hour")
# Plot distribution
x = np.arange(0, 30)
pmf = poisson_dist.pmf(x)
plt.figure(figsize=(12, 6))
plt.bar(x, pmf, color='lightcoral', edgecolor='black', alpha=0.7)
plt.axvline(lambda_rate, color='red', linestyle='--', linewidth=2, label=f'Mean: {lambda_rate}')
plt.axvline(capacity_95, color='green', linestyle='--', linewidth=2, label=f'95th percentile: {capacity_95:.0f}')
plt.xlabel('Number of Calls per Hour', fontsize=11)
plt.ylabel('Probability', fontsize=11)
plt.title(f'Poisson Distribution: Call Arrivals (λ={lambda_rate})', fontsize=12, fontweight='bold')
plt.legend()
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()
Output:
=== POISSON DISTRIBUTION: CALL CENTER ===
Average rate: λ = 12 calls/hour
Standard deviation: 3.46
P(exactly 15 calls) = 0.0724 or 7.24%
P(more than 20 calls) = 0.0116 or 1.16%
P(fewer than 8 calls) = 0.0895 or 8.95%
=== STAFFING IMPLICATIONS ===
• If you staff for 12 calls/hour, you'll be understaffed 42.4% of the time
• If you staff for 15 calls/hour, you'll be understaffed 15.6% of the time
• If you staff for 18 calls/hour, you'll be understaffed 3.7% of the time
• To handle 95% of hours, staff for 18 calls/hour
Business Insight:
Even though the average is 12 calls/hour, you need to staff for 18 calls/hour to handle 95% of hours. This is the nature of random variation—you need capacity above the average to handle peaks.
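The same calculation extends to other service targets. The sketch below is a standard-library version (it computes the Poisson CDF directly instead of calling SciPy), and the helper names `poisson_cdf` and `capacity_for` are illustrative.

```python
import math


def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam)."""
    return sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k + 1))


def capacity_for(service_level, lam):
    """Smallest hourly capacity c with P(X <= c) >= service_level."""
    c = 0
    while poisson_cdf(c, lam) < service_level:
        c += 1
    return c


lam = 12  # average calls per hour, as in the call center example
for level in (0.90, 0.95, 0.99):
    c = capacity_for(level, lam)
    print(f"{level:.0%} of hours covered with capacity for {c} calls/hour")
```

Each extra increment of coverage costs more capacity than the last, which is why very high service targets get expensive quickly.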
3. Normal Distribution (Gaussian)
When to use it: Continuous data that clusters around a mean, with symmetric tails. The most important distribution in statistics.
Examples:
- Heights, weights
- Test scores
- Measurement errors
- Many business metrics (when aggregated)
Parameters:
- μ (mu): mean
- σ (sigma): standard deviation
Key properties:
- Bell-shaped, symmetric
- 68% of data within ±1 standard deviation of mean
- 95% of data within ±2 standard deviations
- 99.7% of data within ±3 standard deviations
The Central Limit Theorem: Even if individual data points aren't normally distributed, averages of large samples tend to be approximately normally distributed, as long as the observations are independent and have finite variance. This is why the normal distribution is so important.
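A quick simulation makes the Central Limit Theorem concrete. The sketch below is illustrative only: it draws from a strongly right-skewed exponential distribution (an arbitrary choice of source data), averages samples of 50, and checks whether those averages behave like a normal distribution. The sample size, number of samples, and seed are all assumptions.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

MEAN_GAP = 200.0   # mean of the skewed source distribution (illustrative)
SAMPLE_SIZE = 50   # observations per sample
N_SAMPLES = 5_000  # number of sample averages to collect

# Each entry is the average of SAMPLE_SIZE draws from a right-skewed distribution
sample_means = [
    statistics.fmean([random.expovariate(1 / MEAN_GAP) for _ in range(SAMPLE_SIZE)])
    for _ in range(N_SAMPLES)
]

center = statistics.fmean(sample_means)
spread = statistics.stdev(sample_means)
within_1sd = sum(abs(m - center) <= spread for m in sample_means) / N_SAMPLES

print(f"Mean of sample averages: {center:.1f} (source mean: {MEAN_GAP:.0f})")
print(f"Std of sample averages:  {spread:.1f} (theory: {MEAN_GAP / SAMPLE_SIZE**0.5:.1f})")
print(f"Share within ±1 std:     {within_1sd:.1%} (normal rule of thumb: about 68%)")
```

Even though individual draws are heavily skewed, the averages cluster symmetrically around the source mean, with a spread close to the source standard deviation divided by the square root of the sample size.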
Business Example: Product Weights
Your factory produces packages with a target weight of 500g. The actual weight follows a normal distribution with mean 500g and standard deviation 5g.
What percentage of packages weigh less than 490g? What weight represents the 95th percentile?
Prompt to AI:
Use the normal distribution with μ=500, σ=5 to:
1. Calculate percentage below 490g
2. Calculate percentage between 495g and 505g
3. Find the 95th percentile weight
4. Plot the distribution with shaded regions
Python Code:
from scipy import stats
import numpy as np
import matplotlib.pyplot as plt
# Parameters
mu = 500 # mean weight (g)
sigma = 5 # standard deviation (g)
# Create normal distribution
normal_dist = stats.norm(mu, sigma)
# Calculate probabilities
prob_below_490 = normal_dist.cdf(490)
prob_between_495_505 = normal_dist.cdf(505) - normal_dist.cdf(495)
percentile_95 = normal_dist.ppf(0.95)
print("=== NORMAL DISTRIBUTION: PACKAGE WEIGHTS ===\n")
print(f"Mean: μ = {mu}g")
print(f"Standard Deviation: σ = {sigma}g")
print(f"\nP(weight < 490g) = {prob_below_490:.4f} or {prob_below_490:.2%}")
print(f"P(495g < weight < 505g) = {prob_between_495_505:.4f} or {prob_between_495_505:.2%}")
print(f"95th percentile weight = {percentile_95:.2f}g")
# Quality control implications
print(f"\n=== QUALITY CONTROL ===")
print(f"• {prob_below_490:.2%} of packages are more than 2σ below target")
print(f"• {prob_between_495_505:.2%} of packages are within ±1σ of target")
# Calculate percentage outside specification limits
spec_lower = 485
spec_upper = 515
prob_below_spec = normal_dist.cdf(spec_lower)
prob_above_spec = 1 - normal_dist.cdf(spec_upper)
# Both tails are measured against the spec limits (not the 490g threshold used above)
prob_out_of_spec = prob_below_spec + prob_above_spec
print(f"\nIf specification limits are {spec_lower}g to {spec_upper}g:")
print(f"• {prob_below_spec:.4%} are below {spec_lower}g")
print(f"• {prob_above_spec:.4%} are above {spec_upper}g")
print(f"• {prob_out_of_spec:.2%} are out of specification")
# Plot distribution
x = np.linspace(mu - 4*sigma, mu + 4*sigma, 1000)
y = normal_dist.pdf(x)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
# Plot 1: Show key regions
ax1.plot(x, y, 'b-', linewidth=2, label='Normal Distribution')
ax1.fill_between(x, y, where=(x < 490), color='red', alpha=0.3, label='Below 490g')
ax1.fill_between(x, y, where=((x >= 495) & (x <= 505)), color='green', alpha=0.3, label='495-505g')
ax1.axvline(mu, color='black', linestyle='--', linewidth=2, label=f'Mean: {mu}g')
ax1.axvline(percentile_95, color='orange', linestyle='--', linewidth=1.5, label=f'95th percentile: {percentile_95:.1f}g')
ax1.set_xlabel('Weight (g)', fontsize=11)
ax1.set_ylabel('Probability Density', fontsize=11)
ax1.set_title('Package Weight Distribution', fontsize=12, fontweight='bold')
ax1.legend()
ax1.grid(alpha=0.3)
# Plot 2: Show 68-95-99.7 rule
ax2.plot(x, y, 'b-', linewidth=2)
ax2.fill_between(x, y, where=((x >= mu-sigma) & (x <= mu+sigma)),
color='green', alpha=0.3, label='±1σ (68%)')
ax2.fill_between(x, y, where=((x >= mu-2*sigma) & (x <= mu+2*sigma)),
color='yellow', alpha=0.2, label='±2σ (95%)')
ax2.fill_between(x, y, where=((x >= mu-3*sigma) & (x <= mu+3*sigma)),
color='red', alpha=0.1, label='±3σ (99.7%)')
ax2.axvline(mu, color='black', linestyle='--', linewidth=2)
ax2.set_xlabel('Weight (g)', fontsize=11)
ax2.set_ylabel('Probability Density', fontsize=11)
ax2.set_title('68-95-99.7 Rule', fontsize=12, fontweight='bold')
ax2.legend()
ax2.grid(alpha=0.3)
plt.tight_layout()
plt.show()
Output:
=== NORMAL DISTRIBUTION: PACKAGE WEIGHTS ===
Mean: μ = 500g
Standard Deviation: σ = 5g
P(weight < 490g) = 0.0228 or 2.28%
P(495g < weight < 505g) = 0.6827 or 68.27%
95th percentile weight = 508.22g
=== QUALITY CONTROL ===
• 2.28% of packages are more than 2σ below target
• 68.27% of packages are within ±1σ of target
If specification limits are 485g to 515g:
• 0.1350% are below 485g
• 0.1350% are above 515g
• 0.27% are out of specification
Business Application:
This tells you:
- Your process is reasonably well controlled (about 0.27% out of spec, or roughly 27 packages in every 10,000)
- 2.28% of packages are "light" (below 490g), which might concern customers
- You could tighten quality control by reducing σ (less variation)
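To see how much a smaller σ would help, the sketch below recomputes the out-of-spec share for a few process standard deviations. It uses the standard library's `statistics.NormalDist`; the 4g and 3g values are hypothetical improved processes, not figures from the example.

```python
from statistics import NormalDist

MU = 500                           # target weight (g)
SPEC_LOWER, SPEC_UPPER = 485, 515  # specification limits (g)


def out_of_spec_share(sigma):
    """Share of packages outside the spec limits for a given process sigma."""
    dist = NormalDist(MU, sigma)
    return dist.cdf(SPEC_LOWER) + (1 - dist.cdf(SPEC_UPPER))


for sigma in (5, 4, 3):  # 4g and 3g are hypothetical improved processes
    share = out_of_spec_share(sigma)
    print(f"σ = {sigma}g: {share:.4%} out of spec (~{share * 1_000_000:,.0f} per million)")
```

Because the tails of the normal distribution thin out so quickly, a modest reduction in σ cuts the out-of-spec share by far more than the same proportion.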
4. Exponential Distribution
When to use it: Modeling time between events in a Poisson process.
Examples:
- Time between customer arrivals
- Time until equipment failure
- Time between purchases
- Duration of phone calls
Parameter:
- λ (lambda): rate parameter (events per unit time)
- Mean time between events = 1/λ
Key property:
- "Memoryless" property: The probability of an event in the next time period doesn't depend on how long you've already waited
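The memoryless property can be checked numerically. In the sketch below the rate and the two time spans are arbitrary illustrative values; the point is that the chance of waiting another 100 hours is the same whether or not 150 hours have already passed.

```python
import math

rate = 1 / 200  # illustrative rate: one event per 200 hours on average


def survival(t):
    """P(T > t) for an exponential waiting time with the rate above."""
    return math.exp(-rate * t)


already_waited = 150  # hours already elapsed without an event
extra = 100           # additional hours we ask about

# P(T > 250 | T > 150) = P(T > 250) / P(T > 150)
conditional = survival(already_waited + extra) / survival(already_waited)
# P(T > 100) starting from zero
fresh = survival(extra)

print(f"P(no event in next {extra}h | waited {already_waited}h) = {conditional:.4f}")
print(f"P(no event in first {extra}h)                   = {fresh:.4f}")
```

Both lines print the same probability, which is exactly what "memoryless" means. Data where risk rises with age, such as wear-out failures, will not satisfy this check.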
Business Example: Equipment Maintenance
A machine fails on average once every 200 hours (λ = 1/200 = 0.005 failures per hour). What's the probability it fails within the next 100 hours? What's the probability it lasts more than 300 hours?
Prompt to AI:
Use the exponential distribution with mean=200 hours to:
1. Calculate probability of failure within 100 hours
2. Calculate probability of lasting more than 300 hours
3. Find the median time to failure
4. Plot the distribution
Python Code:
from scipy import stats
import numpy as np
import matplotlib.pyplot as plt
# Parameters
mean_time = 200 # mean time between failures (hours)
lambda_rate = 1 / mean_time # rate parameter
# Create exponential distribution
exp_dist = stats.expon(scale=mean_time) # scale = 1/λ = mean
# Calculate probabilities
prob_fail_within_100 = exp_dist.cdf(100)
prob_last_more_than_300 = 1 - exp_dist.cdf(300)
median_time = exp_dist.median()
print("=== EXPONENTIAL DISTRIBUTION: EQUIPMENT FAILURE ===\n")
print(f"Mean time between failures: {mean_time} hours")
print(f"Rate: λ = {lambda_rate:.4f} failures/hour")
print(f"\nP(failure within 100 hours) = {prob_fail_within_100:.4f} or {prob_fail_within_100:.2%}")
print(f"P(lasts more than 300 hours) = {prob_last_more_than_300:.4f} or {prob_last_more_than_300:.2%}")
print(f"Median time to failure = {median_time:.1f} hours")
# Maintenance planning
print(f"\n=== MAINTENANCE PLANNING ===")
for hours in [50, 100, 150, 200, 250]:
    prob_survive = 1 - exp_dist.cdf(hours)
    print(f"• Probability of surviving {hours:3d} hours: {prob_survive:.2%}")
# Calculate time for 90% reliability
time_90_reliability = exp_dist.ppf(0.10) # 10% failure = 90% survival
print(f"\n• For 90% reliability, perform maintenance every {time_90_reliability:.0f} hours")
# Plot distribution
x = np.linspace(0, 600, 1000)
y = exp_dist.pdf(x)
cdf_y = exp_dist.cdf(x)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
ax1.plot(x, y, 'b-', linewidth=2)
ax1.fill_between(x, y, where=(x <= 100), color='red', alpha=0.3, label='Fail within 100h')
ax1.fill_between(x, y, where=(x >= 300), color='green', alpha=0.3, label='Last beyond 300h')
ax1.axvline(mean_time, color='black', linestyle='--', linewidth=2, label=f'Mean: {mean_time}h')
ax1.axvline(median_time, color='orange', linestyle='--', linewidth=1.5, label=f'Median: {median_time:.0f}h')
ax1.set_xlabel('Time (hours)', fontsize=11)
ax1.set_ylabel('Probability Density', fontsize=11)
ax1.set_title('Time to Failure Distribution (PDF)', fontsize=12, fontweight='bold')
ax1.legend()
ax1.grid(alpha=0.3)
# CDF (Reliability curve)
ax2.plot(x, 1-cdf_y, 'g-', linewidth=2, label='Reliability (Survival)')
ax2.axhline(0.90, color='red', linestyle='--', linewidth=1.5, alpha=0.7)
ax2.axvline(time_90_reliability, color='red', linestyle='--', linewidth=1.5,
label=f'90% reliability: {time_90_reliability:.0f}h')
ax2.axhline(0.50, color='orange', linestyle='--', linewidth=1.5, alpha=0.7)
ax2.axvline(median_time, color='orange', linestyle='--', linewidth=1.5,
label=f'50% reliability: {median_time:.0f}h')
ax2.set_xlabel('Time (hours)', fontsize=11)
ax2.set_ylabel('Probability of Survival', fontsize=11)
ax2.set_title('Reliability Curve', fontsize=12, fontweight='bold')
ax2.legend()
ax2.grid(alpha=0.3)
plt.tight_layout()
plt.show()
Output:
=== EXPONENTIAL DISTRIBUTION: EQUIPMENT FAILURE ===
Mean time between failures: 200 hours
Rate: λ = 0.0050 failures/hour
P(failure within 100 hours) = 0.3935 or 39.35%
P(lasts more than 300 hours) = 0.2231 or 22.31%
Median time to failure = 138.6 hours
=== MAINTENANCE PLANNING ===
• Probability of surviving 50 hours: 77.88%
• Probability of surviving 100 hours: 60.65%
• Probability of surviving 150 hours: 47.24%
• Probability of surviving 200 hours: 36.79%
• Probability of surviving 250 hours: 28.65%
• For 90% reliability, perform maintenance every 21 hours
Business Insight:
Notice the median (138.6 hours) is less than the mean (200 hours). This is because the exponential distribution is right-skewed—most failures happen relatively early, but a few machines last much longer, pulling the mean up.
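For maintenance planning: There is only a 90% chance that the machine runs 21 hours without failing, so a 90% reliability target implies an inspection or replacement window of about 21 hours, even though the average time to failure is 200 hours. This is the cost of high reliability. One caveat: under a strictly memoryless model the failure rate never changes with age, so scheduled maintenance pays off mainly when real equipment wears out over time.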
4.4.2 Applications in Demand, Risk, and Reliability
Let's see how these distributions apply to real business problems.
Application 1: Demand Forecasting
Scenario: A retailer needs to decide how much inventory to stock. Daily demand follows a normal distribution with mean 100 units and standard deviation 20 units.
Question: How much should they stock to meet demand 95% of the time?
Prompt to AI:
Daily demand: Normal(μ=100, σ=20)
Calculate the inventory level needed for 95% service level.
Also calculate expected stockouts and excess inventory.
Python Code:
from scipy import stats
import numpy as np
# Demand distribution
mu_demand = 100
sigma_demand = 20
demand_dist = stats.norm(mu_demand, sigma_demand)
# Calculate inventory for different service levels
service_levels = [0.80, 0.90, 0.95, 0.99]
print("=== INVENTORY PLANNING ===\n")
print(f"Daily demand: Normal(μ={mu_demand}, σ={sigma_demand})")
print(f"\nService Level Inventory Needed Safety Stock")
print("-" * 50)
for sl in service_levels:
    inventory = demand_dist.ppf(sl)
    safety_stock = inventory - mu_demand
    print(f" {sl:.0%} {inventory:>6.0f} {safety_stock:>+6.0f}")
# Detailed analysis for 95% service level
inventory_95 = demand_dist.ppf(0.95)
safety_stock_95 = inventory_95 - mu_demand
print(f"\n=== 95% SERVICE LEVEL ANALYSIS ===")
print(f"Stock level: {inventory_95:.0f} units")
print(f"Safety stock: {safety_stock_95:.0f} units (buffer above mean)")
# Calculate expected outcomes
prob_stockout = 1 - 0.95
expected_demand_when_stockout = mu_demand + sigma_demand * stats.norm.pdf(stats.norm.ppf(0.95)) / (1 - 0.95)
expected_stockout_units = (expected_demand_when_stockout - inventory_95) * prob_stockout
print(f"\nExpected outcomes:")
print(f"• Stockout probability: {prob_stockout:.1%}")
print(f"• When demand exceeds {inventory_95:.0f}, average demand is {expected_demand_when_stockout:.0f}")
print(f"• Expected lost sales per day: {expected_stockout_units:.1f} units")
# Cost analysis
holding_cost_per_unit = 2 # $ per unit per day
stockout_cost_per_unit = 10 # $ per lost sale
expected_holding_cost = safety_stock_95 * holding_cost_per_unit
expected_stockout_cost = expected_stockout_units * stockout_cost_per_unit
total_expected_cost = expected_holding_cost + expected_stockout_cost
print(f"\n=== COST ANALYSIS ===")
print(f"Holding cost: ${holding_cost_per_unit}/unit/day")
print(f"Stockout cost: ${stockout_cost_per_unit}/unit")
print(f"\nExpected daily costs:")
print(f"• Holding cost: ${expected_holding_cost:.2f}")
print(f"• Stockout cost: ${expected_stockout_cost:.2f}")
print(f"• Total: ${total_expected_cost:.2f}")
Output:
=== INVENTORY PLANNING ===
Daily demand: Normal(μ=100, σ=20)
Service Level Inventory Needed Safety Stock
--------------------------------------------------
80% 117 +17
90% 126 +26
95% 133 +33
99% 147 +47
=== 95% SERVICE LEVEL ANALYSIS ===
Stock level: 133 units
Safety stock: 33 units (buffer above mean)
Expected outcomes:
• Stockout probability: 5.0%
• When demand exceeds 133, average demand is 141
• Expected lost sales per day: 0.4 units
=== COST ANALYSIS ===
Holding cost: $2/unit/day
Stockout cost: $10/unit
Expected daily costs:
• Holding cost: $65.79
• Stockout cost: $4.18
• Total: $69.97
Business Decision:
You can now compare different service levels:
- 95% service level: Stock 133 units, total cost about $70/day
- 90% service level: Stock 126 units, lower holding cost but more stockouts
- 99% service level: Stock 147 units, almost no stockouts but high holding cost
The optimal choice depends on your specific holding and stockout costs.
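To make that comparison concrete, the sketch below recomputes the expected daily cost at several service levels. It reuses the demand distribution and cost figures from the example above and keeps this section's simplification of charging holding cost on safety stock only. The helper name `expected_daily_cost` is illustrative, not from a library.

```python
from statistics import NormalDist

MU_DEMAND, SIGMA_DEMAND = 100, 20   # daily demand ~ Normal(100, 20), as above
HOLDING_COST = 2                    # $ per unit of safety stock per day
STOCKOUT_COST = 10                  # $ per lost sale

std_normal = NormalDist()

def expected_daily_cost(service_level):
    """Return (holding, stockout, total) expected cost for one service level."""
    z = std_normal.inv_cdf(service_level)
    safety_stock = z * SIGMA_DEMAND
    # Expected units short per day: sigma * [pdf(z) - z * P(Z > z)]
    expected_short = SIGMA_DEMAND * (std_normal.pdf(z) - z * (1 - service_level))
    holding = safety_stock * HOLDING_COST
    stockout = expected_short * STOCKOUT_COST
    return holding, stockout, holding + stockout

levels = [0.80, 0.90, 0.95, 0.99]
costs = {sl: expected_daily_cost(sl) for sl in levels}
for sl, (holding, stockout, total) in costs.items():
    print(f"{sl:.0%}: holding ${holding:6.2f} + stockout ${stockout:5.2f} = ${total:6.2f}")

best_level = min(costs, key=lambda sl: costs[sl][2])
print(f"Cheapest of these levels: {best_level:.0%}")
```

Under this cost model the total is minimized where the stockout probability equals the holding cost divided by the stockout cost (2/10 = 20%), so a higher service level is not automatically the cheaper one.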
Application 2: Risk Assessment
Scenario : A project has uncertain completion time. Based on historical data, similar projects follow a normal distribution with mean 120 days and standard deviation 15 days.
Question : What's the probability of finishing within 100 days? What deadline should you commit to if you want 90% confidence?
Prompt to AI:
Project duration: Normal(μ=120, σ=15)
Calculate:
1. Probability of finishing within 100 days
2. Deadline for 90% confidence
3. Create a risk visualization
Python Code:
from scipy import stats
import numpy as np
import matplotlib.pyplot as plt
# Project duration distribution
mu_duration = 120 # days
sigma_duration = 15 # days
duration_dist = stats.norm(mu_duration, sigma_duration)
# Calculate probabilities
prob_within_100 = duration_dist.cdf(100)
deadline_90_confidence = duration_dist.ppf(0.90)
deadline_95_confidence = duration_dist.ppf(0.95)
print("=== PROJECT RISK ANALYSIS ===\n")
print(f"Expected duration: {mu_duration} days")
print(f"Standard deviation: {sigma_duration} days")
print(f"\nP(finish within 100 days) = {prob_within_100:.2%}")
print(f" → This is {(mu_duration - 100)/sigma_duration:.1f} standard deviations below the mean")
print(f" → Very unlikely!")
print(f"\nRecommended deadlines:")
print(f"• 50% confidence: {mu_duration:.0f} days (expected duration)")
print(f"• 90% confidence: {deadline_90_confidence:.0f} days")
print(f"• 95% confidence: {deadline_95_confidence:.0f} days")
# Risk table
print(f"\n=== RISK TABLE ===")
print(f"Deadline Probability Risk Level")
print("-" * 45)
deadlines = [100, 110, 120, 130, 140, 150]
for d in deadlines:
    prob = duration_dist.cdf(d)  # probability of finishing by this deadline
    risk = 1 - prob  # probability of missing it
    risk_level = "VERY HIGH" if risk > 0.3 else "HIGH" if risk > 0.1 else "MEDIUM" if risk > 0.05 else "LOW"
    print(f"{d:3d} days {prob:>5.1%} {risk_level}")
# Visualization
x = np.linspace(mu_duration - 4*sigma_duration, mu_duration + 4*sigma_duration, 1000)
y = duration_dist.pdf(x)
plt.figure(figsize=(12, 6))
plt.plot(x, y, 'b-', linewidth=2, label='Duration Distribution')
# Shade regions
plt.fill_between(x, y, where=(x <= 100), color='red', alpha=0.3, label='Within 100 days (very unlikely)')
plt.fill_between(x, y, where=((x > 100) & (x <= deadline_90_confidence)),
                 color='yellow', alpha=0.3, label=f'100-{deadline_90_confidence:.0f} days')
plt.fill_between(x, y, where=(x > deadline_90_confidence),
                 color='green', alpha=0.3, label=f'Beyond {deadline_90_confidence:.0f} days')
# Add reference lines
plt.axvline(mu_duration, color='black', linestyle='--', linewidth=2, label=f'Expected: {mu_duration} days')
plt.axvline(100, color='red', linestyle='--', linewidth=1.5, label='Aggressive: 100 days')
plt.axvline(deadline_90_confidence, color='green', linestyle='--', linewidth=1.5,
label=f'90% confidence: {deadline_90_confidence:.0f} days')
plt.xlabel('Project Duration (days)', fontsize=11)
plt.ylabel('Probability Density', fontsize=11)
plt.title('Project Duration Risk Analysis', fontsize=12, fontweight='bold')
plt.legend()
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()
Output:
=== PROJECT RISK ANALYSIS ===
Expected duration: 120 days
Standard deviation: 15 days
P(finish within 100 days) = 9.12%
→ This is 1.3 standard deviations below the mean
→ Very unlikely!
Recommended deadlines:
• 50% confidence: 120 days (expected duration)
• 90% confidence: 139 days
• 95% confidence: 145 days
=== RISK TABLE ===
Deadline Probability Risk Level
---------------------------------------------
100 days 9.1% VERY HIGH
110 days 25.2% VERY HIGH
120 days 50.0% VERY HIGH
130 days 74.8% HIGH
140 days 90.9% MEDIUM
150 days 97.7% LOW
Business Communication:
When your manager asks "Can we finish in 100 days?", you can now say:
"Based on historical data, there's only a 9% chance of finishing within 100 days. I recommend committing to 140 days, which gives us 90% confidence. If we absolutely must commit to 100 days, we need to understand we'll likely miss that deadline and should plan contingencies."
This is much better than saying "I think so" or "probably not."
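One way to sanity-check figures like these before quoting them is a quick simulation. The sketch below draws hypothetical project durations from the same Normal(120, 15) model using only the standard library. The seed and the number of draws are arbitrary choices.

```python
import random
from statistics import NormalDist

random.seed(7)                      # arbitrary seed, for reproducibility only
MU_DAYS, SIGMA_DAYS = 120, 15       # same model as the example above
N_DRAWS = 100_000

durations = sorted(random.gauss(MU_DAYS, SIGMA_DAYS) for _ in range(N_DRAWS))

# Share of simulated projects that finish within 100 days
sim_prob_100 = sum(d <= 100 for d in durations) / N_DRAWS
# Deadline met by 90% of simulated projects (empirical 90th percentile)
sim_deadline_90 = durations[int(0.90 * N_DRAWS)]

model = NormalDist(MU_DAYS, SIGMA_DAYS)
print(f"P(finish within 100 days): simulated {sim_prob_100:.3f}, exact {model.cdf(100):.3f}")
print(f"90% deadline: simulated {sim_deadline_90:.1f}, exact {model.inv_cdf(0.90):.1f}")
```

If the simulated and exact values disagree by more than sampling noise, that usually points to a mistake in the formula or the parameters rather than in the simulation.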
Application 3: Reliability Engineering
Scenario : You're evaluating two suppliers for a critical component.
- Supplier A : Components fail following exponential distribution with mean time to failure = 1000 hours
- Supplier B : Components fail following exponential distribution with mean time to failure = 1500 hours, but cost 30% more
Question : Which supplier offers better value?
Prompt to AI:
Compare two suppliers:
- Supplier A: MTTF = 1000 hours, cost = \$100
- Supplier B: MTTF = 1500 hours, cost = \$130
Calculate:
1. Reliability at 500, 1000, 1500 hours
2. Expected number of failures over 5000 hours
3. Total cost of ownership
Python Code:
from scipy import stats
import numpy as np
import matplotlib.pyplot as plt
# Supplier parameters
mttf_a = 1000 # hours
mttf_b = 1500 # hours
cost_a = 100 # $
cost_b = 130 # $
# Create distributions
dist_a = stats.expon(scale=mttf_a)
dist_b = stats.expon(scale=mttf_b)
# Calculate reliability at key timepoints
timepoints = [500, 1000, 1500, 2000]
print("=== SUPPLIER RELIABILITY COMPARISON ===\n")
print(f"Supplier A: MTTF = {mttf_a}h, Cost = ${cost_a}")
print(f"Supplier B: MTTF = {mttf_b}h, Cost = ${cost_b} (+{(cost_b/cost_a-1)*100:.0f}%)")
print(f"\nReliability (Probability of Survival):")
print(f"Time (hours) Supplier A Supplier B Advantage")
print("-" * 55)
for t in timepoints:
    rel_a = 1 - dist_a.cdf(t)  # survival probability for Supplier A
    rel_b = 1 - dist_b.cdf(t)  # survival probability for Supplier B
    advantage = "B" if rel_b > rel_a else "A"
    print(f" {t:>4} {rel_a:>5.1%} {rel_b:>5.1%} {advantage} (+{abs(rel_b-rel_a):.1%})")
# Calculate expected failures over 5000 hours
operating_hours = 5000
expected_failures_a = operating_hours / mttf_a
expected_failures_b = operating_hours / mttf_b
print(f"\n=== TOTAL COST OF OWNERSHIP (5000 hours) ===\n")
# Assume replacement cost = component cost
total_cost_a = cost_a * expected_failures_a
total_cost_b = cost_b * expected_failures_b
print(f"Supplier A:")
print(f" Expected failures: {expected_failures_a:.1f}")
print(f" Total cost: ${total_cost_a:.2f}")
print(f" Cost per hour: ${total_cost_a/operating_hours:.3f}")
print(f"\nSupplier B:")
print(f" Expected failures: {expected_failures_b:.1f}")
print(f" Total cost: ${total_cost_b:.2f}")
print(f" Cost per hour: ${total_cost_b/operating_hours:.3f}")
print(f"\n=== RECOMMENDATION ===")
if total_cost_a < total_cost_b:
    savings = total_cost_b - total_cost_a
    print(f"Choose Supplier A - saves ${savings:.2f} over 5000 hours ({savings/total_cost_b*100:.1f}%)")
else:
    savings = total_cost_a - total_cost_b
    print(f"Choose Supplier B - saves ${savings:.2f} over 5000 hours ({savings/total_cost_a*100:.1f}%)")
# Visualization
x = np.linspace(0, 3000, 1000)
reliability_a = 1 - dist_a.cdf(x)
reliability_b = 1 - dist_b.cdf(x)
plt.figure(figsize=(12, 6))
plt.plot(x, reliability_a, 'b-', linewidth=2, label=f'Supplier A (MTTF={mttf_a}h, ${cost_a})')
plt.plot(x, reliability_b, 'g-', linewidth=2, label=f'Supplier B (MTTF={mttf_b}h, ${cost_b})')
# Add reference lines
for t in [500, 1000, 1500]:
    plt.axvline(t, color='gray', linestyle=':', alpha=0.5)
plt.axhline(0.5, color='red', linestyle='--', alpha=0.5, label='50% reliability')
plt.xlabel('Operating Hours', fontsize=11)
plt.ylabel('Reliability (Probability of Survival)', fontsize=11)
plt.title('Supplier Reliability Comparison', fontsize=12, fontweight='bold')
plt.legend()
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()
Output:
=== SUPPLIER RELIABILITY COMPARISON ===
Supplier A: MTTF = 1000h, Cost = \$100
Supplier B: MTTF = 1500h, Cost = \$130 (+30%)
Reliability (Probability of Survival):
Time (hours) Supplier A Supplier B Advantage
-------------------------------------------------------
500 60.7% 71.7% B (+11.0%)
1000 36.8% 51.3% B (+14.6%)
1500 22.3% 36.8% B (+14.5%)
2000 13.5% 26.4% B (+12.8%)
=== TOTAL COST OF OWNERSHIP (5000 hours) ===
Supplier A:
Expected failures: 5.0
Total cost: \$500.00
Cost per hour: \$0.100
Supplier B:
Expected failures: 3.3
Total cost: \$433.33
Cost per hour: \$0.087
=== RECOMMENDATION ===
Choose Supplier B - saves \$66.67 over 5000 hours (13.3%)
Business Insight:
Even though Supplier B costs 30% more per component, they're actually cheaper in the long run because you replace them less frequently. Supplier B saves $66.67 (13.3%) over 5000 hours of operation.
This is a classic example of why you need to consider total cost of ownership, not just purchase price.
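The comparison above reduces to cost per operating hour, which is unit price divided by MTTF. The sketch below uses that to find the highest price at which Supplier B still comes out ahead. It keeps the example's assumption that each failure costs exactly one replacement unit, and the function names are illustrative.

```python
import math

def reliability(t_hours, mttf_hours):
    """Survival probability at time t for exponentially distributed lifetimes."""
    return math.exp(-t_hours / mttf_hours)

def cost_per_hour(unit_cost, mttf_hours):
    """Long-run replacement cost per operating hour (one unit per failure)."""
    return unit_cost / mttf_hours

cost_a, mttf_a = 100, 1000
cost_b, mttf_b = 130, 1500

# B is cheaper per hour as long as its price stays below this threshold
break_even_price_b = cost_a * mttf_b / mttf_a

print(f"Cost per hour: A ${cost_per_hour(cost_a, mttf_a):.3f}, B ${cost_per_hour(cost_b, mttf_b):.3f}")
print(f"Supplier B stays cheaper up to a unit price of ${break_even_price_b:.2f}")
print(f"Reliability at 1000 h: A {reliability(1000, mttf_a):.1%}, B {reliability(1000, mttf_b):.1%}")
```

A break-even price is often more useful in a negotiation than a single comparison, because it shows how much room there is before the recommendation flips.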
4.5 Statistical Inference
Descriptive statistics and probability tell us about data we have. Statistical inference lets us draw conclusions about populations based on samples.
This is crucial in business because we almost never have complete data:
- We survey 1,000 customers, not all 1 million
- We test a new feature with 10,000 users, not all 50 million
- We analyze last quarter's sales, not future sales
The fundamental question of inference : What can we confidently say about the whole population based on our sample?
4.5.1 Sampling and Sampling Distributions
Population vs. Sample
- Population : The entire group you want to understand (all customers, all transactions, all products)
- Sample : A subset of the population you actually observe
- Parameter : A numerical characteristic of the population (e.g., population mean μ)
- Statistic : A numerical characteristic of the sample (e.g., sample mean x̄)
Example:
- Population: All 1 million customers
- Sample: 1,000 randomly selected customers
- Parameter: True average customer satisfaction (unknown)
- Statistic: Average satisfaction in our sample of 1,000 (known)
The Challenge:
The sample mean (x̄) is our best estimate of the population mean (μ), but it won't be exactly right. If we took a different sample, we'd get a different sample mean.
Question : How much does the sample mean vary? How confident can we be that it's close to the true population mean?
Answer : The sampling distribution tells us.
Sampling Distribution
If you took many samples and calculated the mean of each, those sample means would form a distribution called the sampling distribution of the mean .
Key facts (from the Central Limit Theorem):
- The sampling distribution is approximately normal when the sample size is reasonably large (a common rule of thumb is n ≥ 30), even if the population isn't
- The mean of the sampling distribution equals the population mean (μ)
- The standard deviation of the sampling distribution (called standard error ) is:
SE = σ / √n
Where:
- σ = population standard deviation (usually unknown, so in practice it is estimated with the sample standard deviation s)
- n = sample size
What this means:
- Larger samples (bigger n) give more precise estimates (smaller SE)
- About 95% of the time, the sample mean falls within roughly 2 standard errors of the true population mean
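These facts can be checked by simulation. The sketch below is an illustration under assumed values: it draws repeated samples from a deliberately skewed (exponential) population with mean 10 and compares the spread of the sample means with σ / √n.

```python
import random
import statistics

random.seed(1)                  # arbitrary seed
POP_MEAN = 10.0                 # exponential population: mean = SD = 10 (assumed values)
N = 50                          # size of each sample
NUM_SAMPLES = 4000              # number of repeated samples

sample_means = [
    statistics.fmean([random.expovariate(1 / POP_MEAN) for _ in range(N)])
    for _ in range(NUM_SAMPLES)
]

theoretical_se = POP_MEAN / N ** 0.5           # sigma / sqrt(n)
empirical_se = statistics.stdev(sample_means)  # spread of the simulated sample means
mean_of_means = statistics.fmean(sample_means)
within_2se = sum(abs(m - POP_MEAN) <= 2 * theoretical_se for m in sample_means) / NUM_SAMPLES

print(f"Mean of sample means: {mean_of_means:.2f} (population mean {POP_MEAN})")
print(f"Empirical SE: {empirical_se:.3f}, theoretical SE: {theoretical_se:.3f}")
print(f"Share of sample means within 2 SE of the population mean: {within_2se:.1%}")
```

Even though individual observations from this population are strongly skewed, the sample means cluster symmetrically around the population mean with a spread close to σ / √n.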
Practical Example: Customer Satisfaction
You survey 100 customers and find mean satisfaction = 7.2 (on a 1-10 scale) with standard deviation = 1.5.
What can you say about the true average satisfaction of all customers?
Prompt to AI:
Sample: n=100, mean=7.2, SD=1.5
Calculate:
1. Standard error
2. Likely range for true population mean
3. Visualize sampling distribution
Python Code:
from scipy import stats
import numpy as np
import matplotlib.pyplot as plt
# Sample statistics
n = 100
sample_mean = 7.2
sample_sd = 1.5
# Calculate standard error
se = sample_sd / np.sqrt(n)
print("=== SAMPLING DISTRIBUTION ANALYSIS ===\n")
print(f"Sample size: n = {n}")
print(f"Sample mean: x̄ = {sample_mean}")
print(f"Sample SD: s = {sample_sd}")
print(f"Standard error: SE = {se:.3f}")
print(f"\n=== INTERPRETATION ===")
print(f"The true population mean is likely within:")
print(f" • ±1 SE: {sample_mean - se:.2f} to {sample_mean + se:.2f} (68% confidence)")
print(f" • ±2 SE: {sample_mean - 2*se:.2f} to {sample_mean + 2*se:.2f} (95% confidence)")
print(f" • ±3 SE: {sample_mean - 3*se:.2f} to {sample_mean + 3*se:.2f} (99.7% confidence)")
# Simulate sampling distribution
np.random.seed(42)
num_samples = 10000
sample_means = []
# Simulate taking many samples
for _ in range(num_samples):
    # Generate a sample (assuming population mean = 7.2, SD = 1.5)
    sample = np.random.normal(sample_mean, sample_sd, n)
    sample_means.append(np.mean(sample))
sample_means = np.array(sample_means)
# Plot
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
# Histogram of sample means
ax1.hist(sample_means, bins=50, density=True, color='skyblue', edgecolor='black', alpha=0.7)
# Overlay theoretical normal distribution
x = np.linspace(sample_mean - 4*se, sample_mean + 4*se, 1000)
y = stats.norm.pdf(x, sample_mean, se)
ax1.plot(x, y, 'r-', linewidth=2, label='Theoretical')
ax1.axvline(sample_mean, color='black', linestyle='--', linewidth=2, label=f'Mean: {sample_mean}')
ax1.axvline(sample_mean - 2*se, color='green', linestyle='--', linewidth=1.5, alpha=0.7)
ax1.axvline(sample_mean + 2*se, color='green', linestyle='--', linewidth=1.5, alpha=0.7, label='±2 SE')
ax1.set_xlabel('Sample Mean', fontsize=11)
ax1.set_ylabel('Density', fontsize=11)
ax1.set_title('Sampling Distribution of the Mean', fontsize=12, fontweight='bold')
ax1.legend()
ax1.grid(alpha=0.3)
# Show effect of sample size
sample_sizes = [10, 30, 100, 300]
ax2.set_xlabel('Sample Mean', fontsize=11)
ax2.set_ylabel('Density', fontsize=11)
ax2.set_title('Effect of Sample Size on Standard Error', fontsize=12, fontweight='bold')
for n_size in sample_sizes:
    se_size = sample_sd / np.sqrt(n_size)
    x = np.linspace(sample_mean - 4*se_size, sample_mean + 4*se_size, 1000)
    y = stats.norm.pdf(x, sample_mean, se_size)
    ax2.plot(x, y, linewidth=2, label=f'n={n_size}, SE={se_size:.3f}')
ax2.axvline(sample_mean, color='black', linestyle='--', linewidth=1, alpha=0.5)
ax2.legend()
ax2.grid(alpha=0.3)
plt.tight_layout()
plt.show()
print(f"\n=== KEY INSIGHT ===")
print(f"Increasing sample size reduces standard error:")
for n_size in [10, 30, 100, 300, 1000]:
    se_size = sample_sd / np.sqrt(n_size)
    print(f" n={n_size:>4}: SE = {se_size:.3f}")
Output:
=== SAMPLING DISTRIBUTION ANALYSIS ===
Sample size: n = 100
Sample mean: x̄ = 7.2
Sample SD: s = 1.5
Standard error: SE = 0.150
=== INTERPRETATION ===
The true population mean is likely within:
• ±1 SE: 7.05 to 7.35 (68% confidence)
• ±2 SE: 6.90 to 7.50 (95% confidence)
• ±3 SE: 6.75 to 7.65 (99.7% confidence)
=== KEY INSIGHT ===
Increasing sample size reduces standard error:
n= 10: SE = 0.474
n= 30: SE = 0.274
n= 100: SE = 0.150
n= 300: SE = 0.087
n=1000: SE = 0.047
Business Insight:
With n=100, you can be 95% confident the true average satisfaction is between 6.90 and 7.50. That's a fairly narrow range!
If you need more precision, you'd need a larger sample. Quadrupling the sample size (to 400) would cut the standard error in half.
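The same relationship can be turned around to plan a survey: pick the margin of error you can live with and solve for n. The sketch below assumes the standard deviation stays near the 1.5 observed above and uses a normal critical value. `required_sample_size` is an illustrative helper, not a library function.

```python
import math
from statistics import NormalDist

def required_sample_size(sd, margin, confidence=0.95):
    """Smallest n whose confidence interval half-width is at most `margin`."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # about 1.96 for 95%
    return math.ceil((z * sd / margin) ** 2)

sd = 1.5   # sample SD from the satisfaction survey above
for margin in [0.30, 0.15, 0.10]:
    n_needed = required_sample_size(sd, margin)
    print(f"Margin of error ±{margin:.2f}: n = {n_needed}")
```

Halving the margin of error roughly quadruples the required sample, which is the same square-root relationship in reverse.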
4.5.2 Confidence Intervals and Hypothesis Testing
Confidence Intervals
A confidence interval gives a range of plausible values for a population parameter.
Formula for confidence interval for a mean:
x̄ ± (critical value) × SE
For a 95% confidence interval:
x̄ ± 1.96 × SE (when n is large)
x̄ ± t* × SE (when n is small, use t-distribution)
Interpretation:
"We are 95% confident that the true population mean is between [lower bound] and [upper bound]."
What "95% confident" means:
If we repeated this process many times (take a sample, calculate a confidence interval), about 95% of those intervals would contain the true population mean.
It does NOT mean "there's a 95% probability the true mean is in this interval." The true mean is fixed (we just don't know it); the interval is what's random.
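The repeated-sampling interpretation can be demonstrated directly. The sketch below assumes a population whose true mean we control (7.2, with SD 1.5, borrowed from the satisfaction example). It builds a 95% interval from each of many simulated samples and counts how often the interval captures the true mean.

```python
import random
import statistics

random.seed(3)                       # arbitrary seed
TRUE_MEAN, TRUE_SD = 7.2, 1.5        # assumed population values for the simulation
N = 100                              # sample size per survey
NUM_SURVEYS = 2000
Z_CRITICAL = 1.96                    # large-sample 95% critical value

hits = 0
for _ in range(NUM_SURVEYS):
    sample = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(N)]
    mean = statistics.fmean(sample)
    se = statistics.stdev(sample) / N ** 0.5
    lower, upper = mean - Z_CRITICAL * se, mean + Z_CRITICAL * se
    if lower <= TRUE_MEAN <= upper:   # did this interval capture the true mean?
        hits += 1

coverage = hits / NUM_SURVEYS
print(f"{coverage:.1%} of the {NUM_SURVEYS} intervals contain the true mean")
```

Each individual interval either contains 7.2 or it doesn't. The 95% describes the long-run success rate of the procedure.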
Practical Example: A/B Test
You're testing two website designs:
- Version A (current): 1,000 visitors, 32 conversions (3.2%)
- Version B (new): 1,000 visitors, 38 conversions (3.8%)
Is Version B really better, or could this be just random variation?
Prompt to AI:
A/B test data:
- Version A: 32/1000 = 3.2% conversion
- Version B: 38/1000 = 3.8% conversion
Calculate:
1. Confidence intervals for each version
2. Confidence interval for the difference
3. Determine if the difference is statistically significant
Python Code:
from scipy import stats
import numpy as np
# Data
n_a = 1000
conversions_a = 32
rate_a = conversions_a / n_a
n_b = 1000
conversions_b = 38
rate_b = conversions_b / n_b
# Standard errors (for proportions: SE = sqrt(p*(1-p)/n))
se_a = np.sqrt(rate_a * (1 - rate_a) / n_a)
se_b = np.sqrt(rate_b * (1 - rate_b) / n_b)
# 95% confidence intervals
z_critical = 1.96 # for 95% confidence
ci_a_lower = rate_a - z_critical * se_a
ci_a_upper = rate_a + z_critical * se_a
ci_b_lower = rate_b - z_critical * se_b
ci_b_upper = rate_b + z_critical * se_b
print("=== A/B TEST ANALYSIS ===\n")
print(f"Version A: {conversions_a}/{n_a} = {rate_a:.1%}")
print(f" 95% CI: [{ci_a_lower:.2%}, {ci_a_upper:.2%}]")
print(f"\nVersion B: {conversions_b}/{n_b} = {rate_b:.1%}")
print(f" 95% CI: [{ci_b_lower:.2%}, {ci_b_upper:.2%}]")
# Difference
diff = rate_b - rate_a
se_diff = np.sqrt(se_a**2 + se_b**2)
ci_diff_lower = diff - z_critical * se_diff
ci_diff_upper = diff + z_critical * se_diff
print(f"\nDifference (B - A): {diff:.2%}")
print(f" 95% CI: [{ci_diff_lower:.2%}, {ci_diff_upper:.2%}]")
# Statistical significance
if ci_diff_lower > 0:
    print(f"\n✓ Version B is statistically significantly better (CI doesn't include 0)")
elif ci_diff_upper < 0:
    print(f"\n✗ Version A is statistically significantly better (CI doesn't include 0)")
else:
    print(f"\n○ No statistically significant difference (CI includes 0)")
# Calculate p-value using z-test for proportions
z_score = diff / se_diff
p_value = 2 * (1 - stats.norm.cdf(abs(z_score))) # two-tailed test
print(f"\nZ-score: {z_score:.2f}")
print(f"P-value: {p_value:.3f}")
if p_value < 0.05:
    print(f" → Statistically significant at α=0.05")
else:
    print(f" → NOT statistically significant at α=0.05")
# Business interpretation
print(f"\n=== BUSINESS INTERPRETATION ===")
print(f"Observed improvement: {diff:.2%} ({(diff/rate_a)*100:.1f}% relative increase)")
print(f"With 95% confidence, true improvement is between {ci_diff_lower:.2%} and {ci_diff_upper:.2%}")
if p_value >= 0.05:
    print(f"\nRECOMMENDATION: Don't switch to Version B yet.")
    print(f"The observed difference could easily be due to chance.")
    print(f"Consider running the test longer to collect more data.")
else:
    print(f"\nRECOMMENDATION: Version B shows a statistically significant improvement.")
    print(f"However, consider if a {diff:.2%} improvement is practically meaningful for your business.")
Output:
=== A/B TEST ANALYSIS ===
Version A: 32/1000 = 3.2%
95% CI: [2.11%, 4.29%]
Version B: 38/1000 = 3.8%
95% CI: [2.61%, 4.99%]
Difference (B - A): 0.60%
95% CI: [-1.01%, 2.21%]
○ No statistically significant difference (CI includes 0)
Z-score: 0.73
P-value: 0.465
→ NOT statistically significant at α=0.05
=== BUSINESS INTERPRETATION ===
Observed improvement: 0.60% (18.8% relative increase)
With 95% confidence, true improvement is between -1.01% and 2.21%
RECOMMENDATION: Don't switch to Version B yet.
The observed difference could easily be due to chance.
Consider running the test longer to collect more data.
Key Insight:
Even though Version B had 6 more conversions (18.8% relative increase!), this difference is not statistically significant. The confidence interval for the difference includes 0, meaning the true difference could be negative (Version A better), zero (no difference), or positive (Version B better).
You need more data to draw a conclusion.
How much data do you need?
Prompt to AI:
Calculate required sample size for A/B test:
- Baseline conversion rate: 3.2%
- Minimum detectable effect: 0.6 percentage points (to 3.8%)
- Desired power: 80%
- Significance level: 5%
Python Code:
from statsmodels.stats.power import zt_ind_solve_power
from statsmodels.stats.proportion import proportion_effectsize
# Parameters
p1 = 0.032 # baseline rate
p2 = 0.038 # target rate
alpha = 0.05 # significance level
power = 0.80 # desired power
n_a = n_b = 1000 # visitors per version in the A/B test above
# Calculate effect size (Cohen's h); target rate first so the value is positive
effect_size = proportion_effectsize(p2, p1)
# Calculate required sample size per group
n_required = zt_ind_solve_power(effect_size=effect_size, alpha=alpha, power=power, alternative='two-sided')
print("=== SAMPLE SIZE CALCULATION ===\n")
print(f"Baseline conversion rate: {p1:.1%}")
print(f"Target conversion rate: {p2:.1%}")
print(f"Minimum detectable effect: {p2-p1:.2%}")
print(f"Significance level (α): {alpha:.0%}")
print(f"Desired power: {power:.0%}")
print(f"\nRequired sample size per group: {n_required:,.0f}")
print(f"Total sample size (both groups): {2*n_required:,.0f}")
print(f"\n=== INTERPRETATION ===")
print(f"To reliably detect a {p2-p1:.2%} difference with {power:.0%} power:")
print(f" • You need {n_required:,.0f} visitors per version")
print(f" • Total: {2*n_required:,.0f} visitors")
print(f" • Your current test ({n_a + n_b} visitors) is underpowered")
print(f" • You need {2*n_required - (n_a + n_b):,.0f} more visitors")
Output:
=== SAMPLE SIZE CALCULATION ===
Baseline conversion rate: 3.2%
Target conversion rate: 3.8%
Minimum detectable effect: 0.60%
Significance level (α): 5%
Desired power: 80%
Required sample size per group: 14,701
Total sample size (both groups): 29,402
=== INTERPRETATION ===
To reliably detect a 0.60% difference with 80% power:
• You need 14,701 visitors per version
• Total: 29,402 visitors
• Your current test (2000 visitors) is underpowered
• You need 27,402 more visitors
Business Reality Check:
To detect a 0.6 percentage point improvement with confidence, you need roughly 29,400 visitors, not 2,000. This is why many A/B tests are inconclusive: they're stopped too early.
Options:
- Run the test longer until you reach roughly 29,400 visitors
- Test a bigger change that would produce a larger effect (easier to detect)
- Accept the uncertainty and make a judgment call based on other factors
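To see how strongly the required sample depends on the size of the effect, the sketch below uses a common closed-form approximation for comparing two proportions: n per group = (z_α/2 + z_β)² × [p1(1 − p1) + p2(1 − p2)] / (p2 − p1)². This is a different approximation from the arcsine-based effect size that statsmodels uses above, so the results will differ slightly. The helper name is illustrative.

```python
import math
from statistics import NormalDist

def n_per_group(p1, p2, alpha=0.05, power=0.80):
    """Approximate visitors per version for a two-sided two-proportion z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance_sum / (p2 - p1) ** 2)

baseline = 0.032
for lift in [0.006, 0.012, 0.024]:   # absolute lifts in conversion rate
    n = n_per_group(baseline, baseline + lift)
    print(f"Detect {baseline:.1%} -> {baseline + lift:.1%}: about {n:,} visitors per version")
```

Doubling the detectable lift cuts the required traffic to roughly a quarter, which is why testing a bolder change is often the most practical of the three options.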
4.5.3 p-Values, Effect Sizes, and Practical Significance
p-Values
A p-value is the probability of observing data at least as extreme as what you saw, assuming there's no real effect (the "null hypothesis" is true).
Common misconception: "p < 0.05 means there's a 95% chance the effect is real."
Reality: p < 0.05 means "if there were no real effect, we'd see data at least this extreme less than 5% of the time."
Interpretation guide:
- p < 0.001 : Very strong evidence against null hypothesis
- p < 0.01 : Strong evidence
- p < 0.05 : Moderate evidence (conventional threshold)
- p < 0.10 : Weak evidence
- p > 0.10 : Insufficient evidence
Important: p-values measure the strength of evidence against the null hypothesis. They don't tell you whether an effect is large or important!
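A simulation makes the definition of a p-value tangible. In the sketch below both groups are drawn from the same distribution, so the null hypothesis is true by construction. The population values, group size, and seed are arbitrary assumptions. Roughly 5% of these no-effect experiments should still produce p < 0.05.

```python
import random
import statistics
from statistics import NormalDist

random.seed(11)                 # arbitrary seed
N_PER_GROUP = 100
NUM_EXPERIMENTS = 2000
std_normal = NormalDist()

def two_sided_p_value(group_a, group_b):
    """Large-sample z-test for a difference in means."""
    se = (statistics.variance(group_a) / len(group_a)
          + statistics.variance(group_b) / len(group_b)) ** 0.5
    z = (statistics.fmean(group_b) - statistics.fmean(group_a)) / se
    return 2 * (1 - std_normal.cdf(abs(z)))

false_positives = 0
for _ in range(NUM_EXPERIMENTS):
    # Both groups come from the same population: no real effect exists
    a = [random.gauss(50, 10) for _ in range(N_PER_GROUP)]
    b = [random.gauss(50, 10) for _ in range(N_PER_GROUP)]
    if two_sided_p_value(a, b) < 0.05:
        false_positives += 1

false_positive_rate = false_positives / NUM_EXPERIMENTS
print(f"Share of no-effect experiments with p < 0.05: {false_positive_rate:.1%}")
```

This is what the 5% threshold buys: a false alarm about one time in twenty when nothing is going on. It says nothing about how large a real effect would be.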
Effect Size
Effect size measures the magnitude of a difference, independent of sample size.
Why it matters : With a huge sample, even tiny, meaningless differences become "statistically significant."
Example :
- Company A: Mean salary = $65,000
- Company B: Mean salary = $65,100
- With n=100,000 employees each, this $100 difference might be statistically significant (p < 0.05)
- But is it practically meaningful? Probably not!
Common effect size measures:
- Cohen's d (for comparing means):
  - d = (Mean1 - Mean2) / Pooled SD
  - Small: d = 0.2
  - Medium: d = 0.5
  - Large: d = 0.8
- Percentage difference (for business metrics):
  - Absolute: "3.8% vs. 3.2% = 0.6 percentage points"
  - Relative: "3.8% vs. 3.2% = 18.8% relative increase"
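The salary example can be worked through numerically. The figures in the text don't include a standard deviation, so the sketch below assumes a salary SD of $10,000 in both companies purely for illustration.

```python
from statistics import NormalDist

mean_a, mean_b = 65_000, 65_100   # mean salaries from the example
sd = 10_000                       # ASSUMED common standard deviation (not given in the text)
n = 100_000                       # employees per company

# Effect size: difference in means measured in standard deviations
cohens_d = (mean_b - mean_a) / sd

# Two-sample z-test on the same difference
se_diff = sd * (2 / n) ** 0.5
z = (mean_b - mean_a) / se_diff
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"Cohen's d = {cohens_d:.2f}")
print(f"z = {z:.2f}, p = {p_value:.3f}")
```

Under that assumption the difference clears the p < 0.05 bar while d sits far below even the "small" benchmark of 0.2.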
Practical Significance
Statistical significance ≠ Practical significance
Statistical significance : The effect is unlikely to be due to chance
Practical significance : The effect is large enough to matter for business decisions
Example: Marketing Campaign
You test a new email campaign:
- Old campaign: 10,000 emails, 500 clicks (5.0%)
- New campaign: 10,000 emails, 520 clicks (5.2%)
- Difference: 0.2 percentage points
Prompt to AI:
Analyze this A/B test for both statistical and practical significance:
- Control: 500/10000 = 5.0%
- Treatment: 520/10000 = 5.2%
- Cost per email: \$0.10
- Revenue per click: \$5.00
Determine:
1. Is it statistically significant?
2. Is it practically significant (worth the effort)?
Python Code:
from scipy import stats
import numpy as np
# Data
n_control = 10000
clicks_control = 500
rate_control = clicks_control / n_control
n_treatment = 10000
clicks_treatment = 520
rate_treatment = clicks_treatment / n_treatment
# Statistical significance
diff = rate_treatment - rate_control
se_control = np.sqrt(rate_control * (1 - rate_control) / n_control)
se_treatment = np.sqrt(rate_treatment * (1 - rate_treatment) / n_treatment)
se_diff = np.sqrt(se_control**2 + se_treatment**2)
z_score = diff / se_diff
p_value = 2 * (1 - stats.norm.cdf(abs(z_score)))
print("=== STATISTICAL SIGNIFICANCE ===\n")
print(f"Control: {clicks_control}/{n_control} = {rate_control:.2%}")
print(f"Treatment: {clicks_treatment}/{n_treatment} = {rate_treatment:.2%}")
print(f"Difference: {diff:.2%} ({(diff/rate_control)*100:.1f}% relative increase)")
print(f"\nZ-score: {z_score:.2f}")
print(f"P-value: {p_value:.3f}")
if p_value < 0.05:
    print(f"✓ Statistically significant at α=0.05")
else:
    print(f"✗ NOT statistically significant at α=0.05")
# Practical significance
cost_per_email = 0.10
revenue_per_click = 5.00
print(f"\n=== PRACTICAL SIGNIFICANCE ===\n")
# Calculate ROI for both campaigns
cost_control = n_control * cost_per_email
revenue_control = clicks_control * revenue_per_click
profit_control = revenue_control - cost_control
roi_control = (profit_control / cost_control) * 100
cost_treatment = n_treatment * cost_per_email
revenue_treatment = clicks_treatment * revenue_per_click
profit_treatment = revenue_treatment - cost_treatment
roi_treatment = (profit_treatment / cost_treatment) * 100
print(f"Control Campaign:")
print(f" Cost: ${cost_control:,.0f}")
print(f" Revenue: ${revenue_control:,.0f}")
print(f" Profit: ${profit_control:,.0f}")
print(f" ROI: {roi_control:.1f}%")
print(f"\nTreatment Campaign:")
print(f" Cost: ${cost_treatment:,.0f}")
print(f" Revenue: ${revenue_treatment:,.0f}")
print(f" Profit: ${profit_treatment:,.0f}")
print(f" ROI: {roi_treatment:.1f}%")
profit_increase = profit_treatment - profit_control
print(f"\nProfit increase: ${profit_increase:,.0f} ({(profit_increase/profit_control)*100:.1f}%)")
# Decision
print(f"\n=== RECOMMENDATION ===")
if p_value < 0.05 and profit_increase > 0:
    print(f"✓ Switch to new campaign")
    print(f" • Statistically significant improvement")
    print(f" • Generates ${profit_increase:,.0f} additional profit per 10,000 emails")
    print(f" • At 1 million emails/month, that's ${profit_increase * 100:,.0f}/month")
elif p_value >= 0.05:
    print(f"○ Inconclusive - need more data")
    print(f" • Difference is not statistically significant")
    print(f" • Could be due to random variation")
else:
    print(f"✗ Don't switch")
    print(f" • No meaningful business impact")
Output:
=== STATISTICAL SIGNIFICANCE ===
Control: 500/10000 = 5.00%
Treatment: 520/10000 = 5.20%
Difference: 0.20% (4.0% relative increase)
Z-score: 0.64
P-value: 0.520
✗ NOT statistically significant at α=0.05
=== PRACTICAL SIGNIFICANCE ===
Control Campaign:
Cost: \$1,000
Revenue: \$2,500
Profit: \$1,500
ROI: 150.0%
Treatment Campaign:
Cost: \$1,000
Revenue: \$2,600
Profit: \$1,600
ROI: 160.0%
Profit increase: \$100 (6.7%)
=== RECOMMENDATION ===
○ Inconclusive - need more data
• Difference is not statistically significant
• Could be due to random variation
Key Insight:
The new campaign shows a $100 profit increase per 10,000 emails. If you send 1 million emails/month, that's $10,000/month additional profit—potentially meaningful!
But the difference isn't statistically significant (p ≈ 0.52), so you can't be confident it's real. You need more data before making a decision.
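One way to hold both ideas at once is to express the uncertainty in dollars. The sketch below converts a 95% confidence interval for the difference in click rates into a range for extra profit per 10,000 emails. It assumes, as the example does, that both campaigns cost the same to send, so only revenue changes.

```python
from statistics import NormalDist

n_emails = 10_000
clicks_control, clicks_treatment = 500, 520
revenue_per_click = 5.00

rate_control = clicks_control / n_emails
rate_treatment = clicks_treatment / n_emails
diff = rate_treatment - rate_control

# Unpooled standard error of the difference, as in the analysis above
se_diff = (rate_control * (1 - rate_control) / n_emails
           + rate_treatment * (1 - rate_treatment) / n_emails) ** 0.5
z_crit = NormalDist().inv_cdf(0.975)

def profit_change(rate_diff):
    """Extra profit per 10,000 emails implied by a difference in click rate."""
    return rate_diff * n_emails * revenue_per_click

point = profit_change(diff)
low = profit_change(diff - z_crit * se_diff)
high = profit_change(diff + z_crit * se_diff)
print(f"Extra profit per 10,000 emails: {point:,.0f} dollars "
      f"(95% CI {low:,.0f} to {high:,.0f} dollars)")
```

An interval that runs from a loss to a gain several times the point estimate is another way of saying the test hasn't settled the question.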
The Complete Picture: Statistical + Practical + Business Context
Good decision-making requires