🛠 Workshop — A/B Test Design & Analysis
From hypothesis → sample size → data generation → analysis with scipy.stats → conclusions on statistical & practical significance. Output: a Jupyter Notebook + an A/B Test Report!
🎯 Workshop objectives
After completing this workshop, you will be able to:
- Design an experiment — write a SMART hypothesis and calculate the required sample size
- Generate a sample A/B test results dataset
- Analyze it with Python — chi-square test, t-test, confidence intervals
- Draw conclusions — statistical significance + practical significance
- Write an A/B Test Report — a summary for stakeholders
🧰 Prerequisites
| Requirement | Details |
|---|---|
| Knowledge | Completed the Session 15 theory lecture (A/B Testing) |
| Python | pandas, numpy, scipy.stats, matplotlib (covered in Sessions 7–10) |
| Time | 90–120 minutes |
| Output | Jupyter Notebook + A/B Test Report (markdown summary) |
💡 Naming convention
Name your file: HoTen_Buoi15_ABTest.ipynb
📦 Scenario: E-commerce Checkout A/B Test
You are a Data Analyst at ShopVN, an e-commerce platform with 80,000 daily visitors. The product team has just redesigned the checkout page: a simplified 2-step checkout (Treatment) versus the current 4-step checkout (Control).
Business context:
- Current checkout CVR: ~12%
- Average Order Value (AOV): ~350,000 VND
- Monthly revenue: ~100 billion VND
- Goal: increase checkout CVR → increase revenue
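A quick sanity check shows that the scenario's figures hang together (a sketch; all inputs are the approximate values stated above):

```python
# Rough consistency check of the scenario figures (all approximate)
daily_visitors = 80_000
cvr = 0.12            # current checkout CVR
aov = 350_000         # average order value, VND

monthly_revenue = daily_visitors * 30 * cvr * aov
print(f"Implied monthly revenue: {monthly_revenue / 1e9:.1f} billion VND")  # ≈ 100 billion VND
```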
Part 1: Experiment Design (20 minutes)
Step 1.1: Write a SMART Hypothesis
python
# ============================================
# PHẦN 1: EXPERIMENT DESIGN
# ============================================
# SMART Hypothesis
hypothesis = """
HYPOTHESIS:
- H₀ (Null): 2-step checkout CVR = 4-step checkout CVR
- H₁ (Alternative): 2-step checkout increases CVR by at least 8% relative vs 4-step
SMART:
- Specific: 2-step checkout instead of 4-step
- Measurable: Checkout CVR (conversions / visitors reaching checkout)
- Actionable: If CVR lifts ≥ 8% relative → roll out 2-step to 100% of users
- Relevant: Checkout CVR directly drives revenue
- Time-bound: Run the test for 14 days (2 full business cycles)
"""
print(hypothesis)
Step 1.2: Calculate the Sample Size
python
import numpy as np
from scipy import stats
def calculate_sample_size(baseline_rate, mde_relative, alpha=0.05, power=0.80):
"""
Calculate sample size per group for two-proportion z-test.
Parameters:
baseline_rate: Current conversion rate (e.g., 0.12 for 12%)
mde_relative: Minimum Detectable Effect as relative change (e.g., 0.08 for 8%)
alpha: Significance level (default 0.05)
power: Statistical power (default 0.80)
Returns:
Sample size per group (int)
"""
p1 = baseline_rate
p2 = p1 * (1 + mde_relative)
z_alpha = stats.norm.ppf(1 - alpha / 2) # two-tailed
z_beta = stats.norm.ppf(power)
numerator = (z_alpha + z_beta) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2))
denominator = (p2 - p1) ** 2
n = int(np.ceil(numerator / denominator))
return n
# Parameters
baseline_cvr = 0.12 # 12% current checkout CVR
mde = 0.08 # 8% relative lift (target: 12.96%)
alpha = 0.05 # 5% significance level
power = 0.80 # 80% statistical power
daily_traffic = 80_000 # daily visitors to checkout
# Calculate
n_per_group = calculate_sample_size(baseline_cvr, mde, alpha, power)
n_total = n_per_group * 2
duration_days = int(np.ceil(n_total / daily_traffic))
recommended_duration = max(duration_days, 14) # minimum 14 days
print("📐 SAMPLE SIZE CALCULATION")
print("=" * 55)
print(f" Baseline CVR: {baseline_cvr:.1%}")
print(f" Target CVR: {baseline_cvr * (1 + mde):.2%} ({mde:.0%} relative lift)")
print(f" Significance (α): {alpha}")
print(f" Power (1-β): {power}")
print(f"")
print(f" Sample per group: {n_per_group:,}")
print(f" Total sample: {n_total:,}")
print(f" Daily traffic: {daily_traffic:,}")
print(f" Min days (sample): {duration_days} days")
print(f" Recommended: {recommended_duration} days (≥ 2 business cycles)")
Step 1.3: Experiment Plan Summary
python
experiment_plan = f"""
📋 EXPERIMENT PLAN
{'=' * 55}
Experiment ID: EXP-2026-015
Name: Simplified Checkout (2-step vs 4-step)
Owner: [Your Name], Data Analyst
HYPOTHESIS:
H₀: CVR_control = CVR_treatment
H₁: CVR_treatment > CVR_control (≥ 8% relative)
DESIGN:
Type: A/B (2 variants)
Randomization: User-level (cookie hash)
Split: 50/50
METRICS:
Primary: Checkout CVR
Guardrail #1: Revenue per Visitor (must not drop > 5%)
Guardrail #2: Cart Abandonment Rate (must not rise > 3pp)
Guardrail #3: Customer Support Tickets (must not rise > 10%)
SAMPLE SIZE:
Per group: {n_per_group:,}
Total: {n_total:,}
DURATION: {recommended_duration} days
STATISTICAL TEST: Chi-square test (proportions)
DECISION RULE: Deploy if p < 0.05 AND lift ≥ MDE AND guardrails OK
⚠️ NO PEEKING at p-value until Day {recommended_duration}!
"""
print(experiment_plan)
Part 2: Generate the A/B Test Dataset (15 minutes)
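The generator in the next step models each visitor as an independent Bernoulli trial. A minimal sketch of that idea (the seed and sample size here are illustrative) shows why the observed CVR fluctuates around the true rate — which is exactly why we need significance tests:

```python
import numpy as np

rng = np.random.default_rng(42)  # illustrative seed
true_cvr = 0.12
# One Bernoulli draw per visitor: 1 = converted, 0 = not converted
conversions = rng.binomial(1, true_cvr, size=22_000)
print(f"True CVR: {true_cvr:.2%} | Observed CVR: {conversions.mean():.2%}")
```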
Step 2.1: Generate Conversion Data (Primary Metric)
python
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import warnings
warnings.filterwarnings('ignore')
np.random.seed(15) # Reproducible results
# ============================================
# GENERATE A/B TEST DATA
# ============================================
n_control = 22_000
n_treatment = 22_000
# True conversion rates (simulating reality)
true_cvr_control = 0.120 # 12.0%
true_cvr_treatment = 0.133 # 13.3% (~10.8% relative lift — slightly above MDE)
# Generate user-level data
def generate_ab_data(n_users, cvr, group_name, start_date='2026-01-15'):
"""Generate realistic A/B test data for each user."""
start = pd.Timestamp(start_date)
# Random visit dates over 14 days
visit_day = np.random.randint(0, 14, size=n_users)
visit_dates = [start + timedelta(days=int(d)) for d in visit_day]
# Conversions
converted = np.random.binomial(1, cvr, size=n_users)
    # Revenue (only for converted users); lognormal tuned so the mean
    # exp(12.64 + 0.5²/2) ≈ 350,000 VND, matching the scenario's stated AOV
    aov = np.where(converted == 1,
                   np.random.lognormal(mean=12.64, sigma=0.5, size=n_users).astype(int),
                   0)
# Platform
platform = np.random.choice(['mobile', 'desktop'], size=n_users, p=[0.65, 0.35])
# User type
user_type = np.random.choice(['new', 'returning'], size=n_users, p=[0.40, 0.60])
# Time on checkout page (seconds)
base_time = np.random.exponential(scale=45, size=n_users)
time_on_page = np.where(converted == 1, base_time * 1.5, base_time * 0.7)
return pd.DataFrame({
'user_id': [f'{group_name}_{i:05d}' for i in range(n_users)],
'group': group_name,
'visit_date': visit_dates,
'platform': platform,
'user_type': user_type,
'converted': converted,
'revenue': aov,
'time_on_page_sec': time_on_page.round(1)
})
# Generate data
control_df = generate_ab_data(n_control, true_cvr_control, 'control')
treatment_df = generate_ab_data(n_treatment, true_cvr_treatment, 'treatment')
# Combine
ab_data = pd.concat([control_df, treatment_df], ignore_index=True)
print(f"✅ A/B Test Data Generated!")
print(f" Total users: {len(ab_data):,}")
print(f" Control: {len(control_df):,}")
print(f" Treatment: {len(treatment_df):,}")
print(f" Date range: {ab_data['visit_date'].min().date()} → {ab_data['visit_date'].max().date()}")
print(f"\n📋 Sample data:")
print(ab_data.head(10).to_string(index=False))
Step 2.2: Quick Data Exploration
python
# ============================================
# DATA EXPLORATION
# ============================================
print("📊 GROUP SUMMARY")
print("=" * 60)
summary = ab_data.groupby('group').agg(
users=('user_id', 'count'),
conversions=('converted', 'sum'),
cvr=('converted', 'mean'),
total_revenue=('revenue', 'sum'),
avg_revenue_per_visitor=('revenue', 'mean'),
avg_time_on_page=('time_on_page_sec', 'mean')
).round(4)
for group, row in summary.iterrows():
print(f"\n {'📘' if group == 'control' else '📙'} {group.upper()}")
print(f" Users: {int(row['users']):,}")
print(f" Conversions: {int(row['conversions']):,}")
print(f" CVR: {row['cvr']:.2%}")
print(f" Total Revenue: {int(row['total_revenue']):,} VND")
print(f" Rev/Visitor: {row['avg_revenue_per_visitor']:,.0f} VND")
print(f" Avg Time: {row['avg_time_on_page']:.1f} sec")
# Check balance (Sample Ratio Mismatch)
print(f"\n🔍 SRM Check:")
total = len(ab_data)
control_pct = len(control_df) / total * 100
treatment_pct = len(treatment_df) / total * 100
print(f" Control: {control_pct:.1f}% | Treatment: {treatment_pct:.1f}%")
print(f" Split OK? {'✅ Yes (balanced)' if abs(control_pct - 50) < 2 else '⚠️ Check randomization!'}")
Part 3: Statistical Analysis (30 minutes)
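The primary test in Step 3.1 is a chi-square test on a 2×2 table. It's worth knowing that this is equivalent to a pooled two-proportion z-test (χ² = z² when no continuity correction is applied), so either can cross-check the other. A sketch with illustrative counts (not the generated workshop data):

```python
import numpy as np
from scipy import stats

# Illustrative counts: control 12.0% CVR, treatment 13.3% CVR
conv_c, n_c = 2_640, 22_000
conv_t, n_t = 2_926, 22_000

# Pooled two-proportion z-test
p_pool = (conv_c + conv_t) / (n_c + n_t)
se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))
z = (conv_t / n_t - conv_c / n_c) / se
p_z = 2 * stats.norm.sf(abs(z))

# Chi-square WITHOUT Yates' correction matches z² exactly
observed = [[conv_c, n_c - conv_c], [conv_t, n_t - conv_t]]
chi2, p_chi2, _, _ = stats.chi2_contingency(observed, correction=False)
print(f"z² = {z**2:.4f}, χ² = {chi2:.4f}, p: {p_z:.4f} vs {p_chi2:.4f}")
```

Note that scipy's default (`correction=True` for 2×2 tables, used in Step 3.1) applies Yates' continuity correction, so its χ² differs slightly from z².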
Step 3.1: Chi-square Test — Primary Metric (CVR)
python
from scipy import stats
# ============================================
# CHI-SQUARE TEST — CONVERSION RATE
# ============================================
# Extract data
control = ab_data[ab_data['group'] == 'control']
treatment = ab_data[ab_data['group'] == 'treatment']
n_c = len(control)
n_t = len(treatment)
conv_c = control['converted'].sum()
conv_t = treatment['converted'].sum()
cvr_c = conv_c / n_c
cvr_t = conv_t / n_t
# Build contingency table
observed = np.array([
[conv_c, n_c - conv_c], # Control: [converted, not converted]
[conv_t, n_t - conv_t] # Treatment: [converted, not converted]
])
# Chi-square test (note: scipy applies Yates' continuity correction to 2×2 tables by default)
chi2, p_value, dof, expected = stats.chi2_contingency(observed)
# Relative lift
absolute_lift = cvr_t - cvr_c
relative_lift = absolute_lift / cvr_c * 100
print("📊 CHI-SQUARE TEST — CHECKOUT CVR")
print("=" * 55)
print(f"\n CONVERSION RATES:")
print(f" Control: {conv_c:,}/{n_c:,} = {cvr_c:.2%}")
print(f" Treatment: {conv_t:,}/{n_t:,} = {cvr_t:.2%}")
print(f" Absolute Δ: {absolute_lift:+.2%} ({absolute_lift*100:+.2f} pp)")
print(f" Relative Δ: {relative_lift:+.1f}%")
print(f"\n STATISTICAL TEST:")
print(f" Chi-square: {chi2:.4f}")
print(f" df: {dof}")
print(f" p-value: {p_value:.4f}")
print(f"\n DECISION:")
if p_value < 0.05:
print(f" ✅ SIGNIFICANT (p = {p_value:.4f} < 0.05)")
print(f" → Reject H₀: Treatment CVR ≠ Control CVR")
else:
print(f" ❌ NOT SIGNIFICANT (p = {p_value:.4f} ≥ 0.05)")
print(f" → Fail to reject H₀: insufficient evidence of difference")
Step 3.2: Confidence Interval
python
# ============================================
# CONFIDENCE INTERVAL FOR DIFFERENCE IN CVR
# ============================================
def proportion_ci(conv_a, n_a, conv_b, n_b, alpha=0.05):
"""Calculate CI for difference in proportions (B - A)."""
p_a = conv_a / n_a
p_b = conv_b / n_b
diff = p_b - p_a
se = np.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
z = stats.norm.ppf(1 - alpha / 2)
return diff, diff - z * se, diff + z * se, se
diff, ci_low, ci_high, se = proportion_ci(conv_c, n_c, conv_t, n_t)
print("📊 95% CONFIDENCE INTERVAL")
print("=" * 55)
print(f" Point estimate: {diff:.4f} ({diff*100:+.2f} pp)")
print(f" 95% CI: [{ci_low*100:+.2f} pp, {ci_high*100:+.2f} pp]")
print(f" Standard Error: {se:.6f}")
print(f"")
if ci_low > 0:
    print(f" ✅ CI entirely above 0 → strong evidence Treatment improves CVR")
    print(f" Treatment increases CVR by {ci_low*100:.2f}pp to {ci_high*100:.2f}pp (95% CI)")
elif ci_high < 0:
print(f" ❌ CI entirely below 0 → Treatment decreases CVR")
else:
print(f" ⚠️ CI includes 0 → Effect could be 0 (inconclusive)")
Step 3.3: t-test — Revenue per Visitor
python
# ============================================
# T-TEST — REVENUE PER VISITOR (continuous metric)
# ============================================
rev_control = control['revenue'].values
rev_treatment = treatment['revenue'].values
# Welch's t-test (don't assume equal variance)
t_stat, p_value_rev = stats.ttest_ind(rev_control, rev_treatment, equal_var=False)
# Effect size (Cohen's d)
pooled_std = np.sqrt((rev_control.std()**2 + rev_treatment.std()**2) / 2)
cohens_d = (rev_treatment.mean() - rev_control.mean()) / pooled_std if pooled_std > 0 else 0
# Classify effect size
if abs(cohens_d) < 0.2:
effect_label = "Negligible/Small"
elif abs(cohens_d) < 0.5:
effect_label = "Small-Medium"
elif abs(cohens_d) < 0.8:
effect_label = "Medium-Large"
else:
effect_label = "Large"
print("📊 WELCH'S T-TEST — REVENUE PER VISITOR")
print("=" * 55)
print(f"\n REVENUE PER VISITOR:")
print(f" Control: {rev_control.mean():,.0f} VND (SD: {rev_control.std():,.0f})")
print(f" Treatment: {rev_treatment.mean():,.0f} VND (SD: {rev_treatment.std():,.0f})")
print(f" Difference: {rev_treatment.mean() - rev_control.mean():+,.0f} VND")
print(f"\n STATISTICAL TEST:")
print(f" t-statistic: {t_stat:.4f}")
print(f" p-value: {p_value_rev:.4f}")
print(f" Cohen's d: {cohens_d:.4f} ({effect_label})")
print(f"\n GUARDRAIL CHECK:")
rev_change_pct = (rev_treatment.mean() - rev_control.mean()) / rev_control.mean() * 100
if rev_change_pct > -5:
print(f" ✅ Revenue/Visitor change: {rev_change_pct:+.1f}% (within -5% guardrail)")
else:
print(f" ❌ Revenue/Visitor change: {rev_change_pct:+.1f}% (VIOLATES -5% guardrail!)")
Step 3.4: Segment Analysis
python
# ============================================
# SEGMENT ANALYSIS — CHECK FOR SIMPSON'S PARADOX
# ============================================
print("📊 SEGMENT ANALYSIS")
print("=" * 60)
for segment_col in ['platform', 'user_type']:
print(f"\n 📂 By {segment_col.upper()}:")
print(f" {'Segment':<12} | {'Control CVR':>11} | {'Treatment CVR':>13} | {'Lift':>8} | {'p-value':>8}")
print(f" {'-'*12}-+-{'-'*11}-+-{'-'*13}-+-{'-'*8}-+-{'-'*8}")
for seg_val in ab_data[segment_col].unique():
seg_c = control[control[segment_col] == seg_val]
seg_t = treatment[treatment[segment_col] == seg_val]
cvr_seg_c = seg_c['converted'].mean()
cvr_seg_t = seg_t['converted'].mean()
lift_seg = (cvr_seg_t - cvr_seg_c) / cvr_seg_c * 100 if cvr_seg_c > 0 else 0
# Chi-square for segment
obs_seg = np.array([
[seg_c['converted'].sum(), len(seg_c) - seg_c['converted'].sum()],
[seg_t['converted'].sum(), len(seg_t) - seg_t['converted'].sum()]
])
_, p_seg, _, _ = stats.chi2_contingency(obs_seg)
sig_icon = "✅" if p_seg < 0.05 else " "
print(f" {seg_val:<12} | {cvr_seg_c:>10.2%} | {cvr_seg_t:>12.2%} | {lift_seg:>+7.1f}% | {p_seg:>7.4f} {sig_icon}")
print(f"\n ⚠️ Check: Do all segments show same direction as overall?")
print(f" If not → Simpson's Paradox → investigate before deploying!")
Part 4: Visualization (15 minutes)
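Before charting, consider adding error bars to the CVR bars — bar heights alone hide the uncertainty. A sketch of a Wilson score interval helper you could feed to `ax.errorbar` (the counts below are illustrative, not tied to the generated data):

```python
import numpy as np
from scipy import stats

def wilson_ci(conversions, n, alpha=0.05):
    """Wilson score interval for a proportion — more reliable than the
    normal approximation when the rate is near 0 or 1."""
    z = stats.norm.ppf(1 - alpha / 2)
    p = conversions / n
    center = (p + z**2 / (2 * n)) / (1 + z**2 / n)
    half = (z / (1 + z**2 / n)) * np.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

# Illustrative counts: 2,640 conversions out of 22,000 visitors
lo, hi = wilson_ci(2_640, 22_000)
print(f"Control CVR 95% CI: [{lo:.2%}, {hi:.2%}]")
```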
Step 4.1: CVR by Group Bar Chart
python
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
fig, axes = plt.subplots(1, 3, figsize=(16, 5))
# --- Chart 1: CVR Comparison ---
ax1 = axes[0]
groups = ['Control\n(4-step)', 'Treatment\n(2-step)']
cvrs = [cvr_c * 100, cvr_t * 100]
colors = ['#2196F3', '#4CAF50']
bars = ax1.bar(groups, cvrs, color=colors, width=0.5, edgecolor='white', linewidth=2)
ax1.set_ylabel('Checkout CVR (%)')
ax1.set_title('📊 Checkout CVR: Control vs Treatment', fontsize=12, fontweight='bold')
ax1.set_ylim(0, max(cvrs) * 1.3)
for bar, cvr in zip(bars, cvrs):
ax1.text(bar.get_x() + bar.get_width()/2., bar.get_height() + 0.2,
f'{cvr:.2f}%', ha='center', va='bottom', fontweight='bold', fontsize=13)
ax1.axhline(y=cvr_c * 100, color='gray', linestyle='--', alpha=0.5, label='Baseline')
# --- Chart 2: CVR by Platform ---
ax2 = axes[1]
platforms = ab_data['platform'].unique()
x = np.arange(len(platforms))
width = 0.3
cvr_c_plat = [control[control['platform']==p]['converted'].mean()*100 for p in platforms]
cvr_t_plat = [treatment[treatment['platform']==p]['converted'].mean()*100 for p in platforms]
ax2.bar(x - width/2, cvr_c_plat, width, label='Control', color='#2196F3', edgecolor='white')
ax2.bar(x + width/2, cvr_t_plat, width, label='Treatment', color='#4CAF50', edgecolor='white')
ax2.set_xticks(x)
ax2.set_xticklabels(platforms)
ax2.set_ylabel('CVR (%)')
ax2.set_title('📱 CVR by Platform', fontsize=12, fontweight='bold')
ax2.legend()
# --- Chart 3: Revenue Distribution ---
ax3 = axes[2]
rev_c_pos = rev_control[rev_control > 0]
rev_t_pos = rev_treatment[rev_treatment > 0]
ax3.hist(rev_c_pos, bins=50, alpha=0.5, label=f'Control (μ={rev_c_pos.mean():,.0f})', color='#2196F3', density=True)
ax3.hist(rev_t_pos, bins=50, alpha=0.5, label=f'Treatment (μ={rev_t_pos.mean():,.0f})', color='#4CAF50', density=True)
ax3.set_xlabel('Revenue (VND)')
ax3.set_ylabel('Density')
ax3.set_title('💰 Revenue Distribution (Converted Users)', fontsize=12, fontweight='bold')
ax3.legend(fontsize=9)
plt.tight_layout()
plt.savefig('ab_test_results.png', dpi=150, bbox_inches='tight')
plt.show()
print("📊 Charts saved to ab_test_results.png")
Step 4.2: Daily CVR Trend
python
# ============================================
# DAILY CVR TREND (check for novelty effect)
# ============================================
daily = ab_data.groupby(['visit_date', 'group']).agg(
visitors=('user_id', 'count'),
conversions=('converted', 'sum')
).reset_index()
daily['cvr'] = daily['conversions'] / daily['visitors'] * 100
fig, ax = plt.subplots(figsize=(12, 5))
for group, color, marker in [('control', '#2196F3', 'o'), ('treatment', '#4CAF50', 's')]:
g = daily[daily['group'] == group]
ax.plot(g['visit_date'], g['cvr'], marker=marker, linewidth=2,
color=color, label=group.capitalize(), markersize=6)
ax.set_xlabel('Date')
ax.set_ylabel('Daily CVR (%)')
ax.set_title('📈 Daily CVR Trend — Control vs Treatment', fontsize=13, fontweight='bold')
ax.legend()
ax.grid(True, alpha=0.3)
plt.xticks(rotation=45)
plt.tight_layout()
plt.savefig('ab_test_daily_trend.png', dpi=150, bbox_inches='tight')
plt.show()
print("📈 Check: Is the lift consistent across days? Any novelty effect (Treatment high early, then drops)?")
Part 5: Final Report & Conclusion (15 minutes)
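The projection in Step 5.1 uses point estimates only; a more honest report also propagates the CVR confidence interval into a revenue range. A sketch with hypothetical CI bounds (substitute the actual `ci_low`/`ci_high` you computed in Step 3.2):

```python
# Translate a CVR lift CI into a monthly revenue range (illustrative inputs)
monthly_visitors = 80_000 * 30
aov = 350_000  # VND, assumed average order value

# Hypothetical 95% CI for the absolute lift (replace with your Step 3.2 values)
ci_low_abs, ci_high_abs = 0.0035, 0.0095

rev_low = monthly_visitors * ci_low_abs * aov
rev_high = monthly_visitors * ci_high_abs * aov
print(f"Monthly Δ revenue (95% CI): {rev_low/1e9:.1f} – {rev_high/1e9:.1f} billion VND")
```

Reporting a range like this makes it much harder for stakeholders to over-anchor on a single projected number.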
Step 5.1: Business Impact Calculation
python
# ============================================
# BUSINESS IMPACT CALCULATION
# ============================================
daily_checkout_visitors = 80_000
monthly_visitors = daily_checkout_visitors * 30
# Current
current_monthly_conversions = monthly_visitors * cvr_c
current_aov = control[control['converted']==1]['revenue'].mean() if control['converted'].sum() > 0 else 350000
current_monthly_revenue = current_monthly_conversions * current_aov
# Projected (with Treatment)
projected_monthly_conversions = monthly_visitors * cvr_t
projected_aov = treatment[treatment['converted']==1]['revenue'].mean() if treatment['converted'].sum() > 0 else 350000
projected_monthly_revenue = projected_monthly_conversions * projected_aov
delta_revenue = projected_monthly_revenue - current_monthly_revenue
delta_annual = delta_revenue * 12
print("💰 BUSINESS IMPACT PROJECTION")
print("=" * 55)
print(f"\n CURRENT (Control — 4-step checkout):")
print(f" Monthly visitors: {monthly_visitors:>12,}")
print(f" CVR: {cvr_c:>12.2%}")
print(f" Monthly conversions: {current_monthly_conversions:>12,.0f}")
print(f" AOV: {current_aov:>12,.0f} VND")
print(f" Monthly revenue: {current_monthly_revenue:>12,.0f} VND")
print(f"\n PROJECTED (Treatment — 2-step checkout):")
print(f" Monthly visitors: {monthly_visitors:>12,}")
print(f" CVR: {cvr_t:>12.2%}")
print(f" Monthly conversions: {projected_monthly_conversions:>12,.0f}")
print(f" AOV: {projected_aov:>12,.0f} VND")
print(f" Monthly revenue: {projected_monthly_revenue:>12,.0f} VND")
print(f"\n IMPACT:")
print(f" Monthly Δ revenue: {delta_revenue:>+12,.0f} VND")
print(f" Annual Δ revenue: {delta_annual:>+12,.0f} VND")
print(f" Annual Δ revenue: {delta_annual/1e9:>+12.1f} billion VND")
Step 5.2: Generate the Final Report
python
# ============================================
# FINAL A/B TEST REPORT
# ============================================
report = f"""
{'='*60}
📋 A/B TEST REPORT — CHECKOUT SIMPLIFICATION
{'='*60}
📌 EXPERIMENT SUMMARY
Experiment: 2-step vs 4-step checkout
Duration: 14 days (2026-01-15 → 2026-01-28)
Sample: {n_c:,} (Control) + {n_t:,} (Treatment)
📊 PRIMARY METRIC: CHECKOUT CVR
Control: {cvr_c:.2%} ({conv_c:,} conversions)
Treatment: {cvr_t:.2%} ({conv_t:,} conversions)
Absolute Δ: {absolute_lift*100:+.2f} percentage points
Relative Δ: {relative_lift:+.1f}%
p-value: {p_value:.4f} {'✅ SIGNIFICANT' if p_value < 0.05 else '❌ NOT SIGNIFICANT'}
95% CI: [{ci_low*100:+.2f}pp, {ci_high*100:+.2f}pp]
🛡️ GUARDRAIL METRICS
Revenue/Visitor: {rev_change_pct:+.1f}% change {'✅ OK' if rev_change_pct > -5 else '❌ VIOLATED'}
Segment check: {'✅ Consistent direction across segments' if relative_lift > 0 else '⚠️ Check segments'}
💰 BUSINESS IMPACT (if deployed)
Monthly Δ revenue: {delta_revenue:+,.0f} VND
Annual Δ revenue: {delta_annual/1e9:+.1f} billion VND
🎯 RECOMMENDATION
"""
if p_value < 0.05 and relative_lift >= 8 and rev_change_pct > -5:
report += """ ✅ DEPLOY TREATMENT (2-step checkout)
Rationale:
- Statistical significance: p < 0.05
- Practical significance: relative lift meets MDE (≥ 8%)
- Guardrail metrics: OK (revenue/visitor not degraded)
- Segment analysis: consistent across platforms and user types
Deployment plan:
1. Gradual rollout: 10% → 50% → 100% over 2 weeks
2. Monitor CVR + Revenue daily during rollout
3. Set alert: if CVR drops below baseline → auto-rollback
"""
elif p_value < 0.05 and relative_lift < 8:
report += """ ⚠️ HOLD — STATISTICALLY SIGNIFICANT BUT BELOW MDE
While p < 0.05, the observed lift is below the 8% MDE target.
Consider: is the observed lift still practically meaningful?
"""
else:
report += """ ❌ DO NOT DEPLOY — NOT SIGNIFICANT
Insufficient evidence that Treatment is better than Control.
Options: redesign experiment, increase MDE, or try different change.
"""
report += f"""
📝 LEARNINGS
1. {'2-step checkout reduces friction → more completions' if relative_lift > 0 else 'Simplification did not improve CVR'}
2. Segment analysis shows {'consistent' if relative_lift > 0 else 'mixed'} results
3. Revenue per visitor {'maintained' if rev_change_pct > -5 else 'degraded'} — AOV not affected
{'='*60}
Report generated: {datetime.now().strftime('%Y-%m-%d %H:%M')}
{'='*60}
"""
print(report)
✅ Completion checklist
EXPERIMENT DESIGN:
✅ SMART hypothesis written
✅ Sample size calculated (formula + code)
✅ Duration planned (≥ 14 days)
✅ Primary metric + guardrail metrics defined
✅ Statistical test pre-selected (chi-square for CVR)
DATA & ANALYSIS:
✅ Dataset generated (control + treatment)
✅ SRM check (sample ratio balanced)
✅ Chi-square test for primary metric (CVR)
✅ 95% Confidence Interval calculated
✅ t-test for guardrail metric (Revenue/Visitor)
✅ Effect size (Cohen's d) calculated
✅ Segment analysis (platform, user type)
VISUALIZATION:
✅ CVR bar chart (Control vs Treatment)
✅ CVR by segment chart
✅ Revenue distribution chart
✅ Daily CVR trend chart
REPORT:
✅ Business impact calculated (monthly + annual revenue)
✅ Final recommendation with rationale
✅ Learnings documented
🎁 Bonus Challenge (Optional)
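A good warm-up before the bonuses: verify the sample-size formula empirically. The sketch below (seed and simulation count are illustrative, kept small for speed) simulates many experiments at the computed per-group n — assuming the formula's output for baseline 12% and 8% relative MDE — with a true lift exactly at the MDE, and checks that roughly 80% of them reject H₀:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)   # illustrative seed
n = 18_601                       # ≈ per-group n from the formula (baseline 12%, 8% rel. MDE)
p1, p2 = 0.12, 0.12 * 1.08       # true rates: control vs treatment at exactly the MDE
sims = 2_000

# Vectorized: draw total conversion counts for every simulated experiment at once
conv_c = rng.binomial(n, p1, size=sims)
conv_t = rng.binomial(n, p2, size=sims)

# Pooled two-proportion z-test per simulated experiment
p_pool = (conv_c + conv_t) / (2 * n)
se = np.sqrt(p_pool * (1 - p_pool) * 2 / n)
z = (conv_t / n - conv_c / n) / se
p_values = 2 * stats.norm.sf(np.abs(z))

power_hat = (p_values < 0.05).mean()
print(f"Empirical power at n={n:,}: {power_hat:.1%} (target: 80%)")
```

The empirical power should land near 0.80, up to Monte Carlo noise — a cheap way to catch bugs in a hand-rolled sample-size formula.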
Bonus 1: Power Analysis Visualization
python
# How does sample size change with different MDE?
mde_range = np.arange(0.03, 0.25, 0.01)
sample_sizes = [calculate_sample_size(0.12, m) for m in mde_range]
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(mde_range * 100, [s/1000 for s in sample_sizes], 'b-o', markersize=4, linewidth=2)
ax.set_xlabel('Minimum Detectable Effect (% relative)')
ax.set_ylabel('Sample Size per Group (thousands)')
ax.set_title('📐 Sample Size vs MDE (Baseline CVR = 12%, α=0.05, Power=80%)', fontweight='bold')
ax.grid(True, alpha=0.3)
ax.axvline(x=8, color='red', linestyle='--', label='Our MDE = 8%')
ax.legend()
plt.tight_layout()
plt.show()
Bonus 2: Simulate the Peeking Effect
python
# Simulate how peeking inflates the false positive rate
def simulate_peeking(n_simulations=5000, true_cvr=0.12,
                     peek_days=[3, 5, 7, 10, 14], daily_users=700):
    """Simulate peeking under a TRUE null (no difference between A and B).

    Each simulated experiment accumulates data day by day, and the SAME
    growing dataset is tested at every peek. Returns the false positive
    rate per peek day and the rate of at least one false positive
    across all peeks (the "stop at first p < 0.05" strategy).
    """
    n_max = max(peek_days) * daily_users
    fp_by_day = {d: 0 for d in peek_days}
    any_fp = 0
    for _ in range(n_simulations):
        # Generate the full experiment under H0 (no real difference)
        control = np.random.binomial(1, true_cvr, size=n_max)
        treatment = np.random.binomial(1, true_cvr, size=n_max)  # SAME rate!
        fp_this_run = False
        for day in peek_days:
            n_so_far = day * daily_users
            c, t = control[:n_so_far], treatment[:n_so_far]
            observed = np.array([
                [c.sum(), n_so_far - c.sum()],
                [t.sum(), n_so_far - t.sum()]
            ])
            if observed.min() > 0:
                _, p, _, _ = stats.chi2_contingency(observed)
                if p < 0.05:
                    fp_by_day[day] += 1
                    fp_this_run = True
        if fp_this_run:
            any_fp += 1
    per_day = {d: count / n_simulations * 100 for d, count in fp_by_day.items()}
    return per_day, any_fp / n_simulations * 100

print("🔍 Simulating peeking effect (n=5000 simulations)...")
print("   (Under H₀: no real difference between A and B)")
fp_rates, any_fp_rate = simulate_peeking(n_simulations=5000)
print("\n Day | False Positive Rate | Expected: 5%")
print(" ----+--------------------+-------------")
for day, rate in sorted(fp_rates.items()):
    bar = '█' * int(rate)
    print(f" {day:>3} | {rate:>5.1f}% {bar}")
print(f"\n ⚠️ Each single peek has a ≈ 5% false positive rate,")
print(f"    but if you stop the first time p < 0.05 across all 5 peeks,")
print(f"    your actual false positive rate ≈ {any_fp_rate:.0f}%, not 5%!")
📚 References
| Resource | Description |
|---|---|
| Trustworthy Online Controlled Experiments (Kohavi et al.) | The definitive A/B testing reference |
| Evan Miller Sample Size Calculator | evanmiller.org/ab-testing |
| scipy.stats documentation | docs.scipy.org/doc/scipy/reference/stats.html |
| Statistical Methods in Online A/B Testing (Georgiev) | Detailed statistics for A/B tests |