
🛠 Workshop — A/B Test Design & Analysis

From hypothesis → sample size → data generation → analysis with scipy.stats → conclusions on statistical & practical significance. Output: a Jupyter Notebook + an A/B Test Report!

🎯 Workshop objectives

After completing this workshop, you will:

  1. Design the experiment — write a SMART hypothesis, calculate the sample size
  2. Generate a sample A/B test results dataset
  3. Analyze with Python — chi-square test, t-test, confidence interval
  4. Conclude — statistical significance + practical significance
  5. Write an A/B Test Report — a summary for stakeholders

🧰 Prerequisites

  • Knowledge: completed the Session 15 theory lesson (A/B Testing)
  • Python: pandas, numpy, scipy.stats, matplotlib (covered in Sessions 7–10)
  • Time: 90–120 minutes
  • Output: Jupyter Notebook + A/B Test Report (markdown summary)

💡 Naming convention

Name your file: HoTen_Buoi15_ABTest.ipynb


📦 Scenario: E-commerce Checkout A/B Test

You are a DA at ShopVN — an e-commerce platform with 80,000 daily visitors. The product team has just redesigned the checkout page: a simplified 2-step checkout (Treatment) vs the current 4-step checkout (Control).

Business context:

  • Current checkout CVR: ~12%
  • Average Order Value (AOV): ~350,000 VND
  • Monthly revenue: ~100 billion VND
  • Goal: increase checkout CVR → increase revenue
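As a quick sanity check (an addition to the original brief), the business-context numbers are mutually consistent — 80,000 daily visitors at a 12% CVR and a 350,000 VND AOV imply roughly 100 billion VND of monthly checkout revenue:

```python
# Sanity check: do the business-context numbers add up?
daily_visitors = 80_000
cvr = 0.12
aov_vnd = 350_000

monthly_revenue = daily_visitors * 30 * cvr * aov_vnd
print(f"~{monthly_revenue / 1e9:.1f} billion VND/month")  # ≈ 100 billion VND
```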

Part 1: Experiment Design (20 min)

Step 1.1: Write a SMART Hypothesis

python
# ============================================
# PHẦN 1: EXPERIMENT DESIGN
# ============================================

# SMART Hypothesis
hypothesis = """
HYPOTHESIS:
- H₀ (Null): Checkout CVR of the 2-step checkout = CVR of the 4-step checkout
- H₁ (Alternative): The 2-step checkout increases CVR by at least 8% relative to the 4-step

SMART:
- Specific:   2-step checkout instead of 4-step
- Measurable: Checkout CVR (conversions / visitors reaching checkout)
- Actionable: If CVR rises ≥ 8% relative → deploy 2-step to 100% of users
- Relevant:   Checkout CVR directly drives revenue
- Time-bound: Run the test for 14 days (2 full business cycles)
"""
print(hypothesis)

Step 1.2: Calculate the Sample Size

python
import numpy as np
from scipy import stats

def calculate_sample_size(baseline_rate, mde_relative, alpha=0.05, power=0.80):
    """
    Calculate sample size per group for two-proportion z-test.

    Parameters:
        baseline_rate: Current conversion rate (e.g., 0.12 for 12%)
        mde_relative: Minimum Detectable Effect as relative change (e.g., 0.08 for 8%)
        alpha: Significance level (default 0.05)
        power: Statistical power (default 0.80)

    Returns:
        Sample size per group (int)
    """
    p1 = baseline_rate
    p2 = p1 * (1 + mde_relative)

    z_alpha = stats.norm.ppf(1 - alpha / 2)  # two-tailed
    z_beta = stats.norm.ppf(power)

    numerator = (z_alpha + z_beta) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2))
    denominator = (p2 - p1) ** 2

    n = int(np.ceil(numerator / denominator))
    return n

# Parameters
baseline_cvr = 0.12      # 12% current checkout CVR
mde = 0.08               # 8% relative lift (target: 12.96%)
alpha = 0.05             # 5% significance level
power = 0.80             # 80% statistical power
daily_traffic = 80_000   # daily visitors to checkout

# Calculate
n_per_group = calculate_sample_size(baseline_cvr, mde, alpha, power)
n_total = n_per_group * 2
duration_days = int(np.ceil(n_total / daily_traffic))
recommended_duration = max(duration_days, 14)  # minimum 14 days

print("📐 SAMPLE SIZE CALCULATION")
print("=" * 55)
print(f"  Baseline CVR:      {baseline_cvr:.1%}")
print(f"  Target CVR:        {baseline_cvr * (1 + mde):.2%} ({mde:.0%} relative lift)")
print(f"  Significance (α):  {alpha}")
print(f"  Power (1-β):       {power}")
print(f"")
print(f"  Sample per group:  {n_per_group:,}")
print(f"  Total sample:      {n_total:,}")
print(f"  Daily traffic:     {daily_traffic:,}")
print(f"  Min days (sample): {duration_days} days")
print(f"  Recommended:       {recommended_duration} days (≥ 2 business cycles)")

Step 1.3: Experiment Plan Summary

python
experiment_plan = f"""
📋 EXPERIMENT PLAN
{'=' * 55}
  Experiment ID:    EXP-2026-015
  Name:             Simplified Checkout (2-step vs 4-step)
  Owner:            [Your Name], Data Analyst

  HYPOTHESIS:
    H₀: CVR_control = CVR_treatment
    H₁: CVR_treatment > CVR_control (≥ 8% relative)

  DESIGN:
    Type:            A/B (2 variants)
    Randomization:   User-level (cookie hash)
    Split:           50/50

  METRICS:
    Primary:         Checkout CVR
    Guardrail #1:    Revenue per Visitor (must not drop > 5%)
    Guardrail #2:    Cart Abandonment Rate (must not rise > 3pp)
    Guardrail #3:    Customer Support Tickets (must not rise > 10%)

  SAMPLE SIZE:
    Per group:       {n_per_group:,}
    Total:           {n_total:,}

  DURATION:          {recommended_duration} days

  STATISTICAL TEST:  Chi-square test (proportions)
  DECISION RULE:     Deploy if p < 0.05 AND lift ≥ MDE AND guardrails OK

  ⚠️ NO PEEKING at p-value until Day {recommended_duration}!
"""
print(experiment_plan)

Part 2: Generate the A/B Test Dataset (15 min)

Step 2.1: Generate Conversion Data (Primary Metric)

python
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import warnings
warnings.filterwarnings('ignore')

np.random.seed(15)  # Reproducible results

# ============================================
# GENERATE A/B TEST DATA
# ============================================

n_control = 22_000
n_treatment = 22_000

# True conversion rates (simulating reality)
true_cvr_control = 0.120      # 12.0%
true_cvr_treatment = 0.133    # 13.3% (~10.8% relative lift — slightly above MDE)

# Generate user-level data
def generate_ab_data(n_users, cvr, group_name, start_date='2026-01-15'):
    """Generate realistic A/B test data for each user."""
    start = pd.Timestamp(start_date)

    # Random visit dates over 14 days
    visit_day = np.random.randint(0, 14, size=n_users)
    visit_dates = [start + timedelta(days=int(d)) for d in visit_day]

    # Conversions
    converted = np.random.binomial(1, cvr, size=n_users)

    # Revenue (only for converted users)
    aov = np.where(converted == 1,
                   np.random.lognormal(mean=12.4, sigma=0.5, size=n_users).astype(int),
                   0)

    # Platform
    platform = np.random.choice(['mobile', 'desktop'], size=n_users, p=[0.65, 0.35])

    # User type
    user_type = np.random.choice(['new', 'returning'], size=n_users, p=[0.40, 0.60])

    # Time on checkout page (seconds)
    base_time = np.random.exponential(scale=45, size=n_users)
    time_on_page = np.where(converted == 1, base_time * 1.5, base_time * 0.7)

    return pd.DataFrame({
        'user_id': [f'{group_name}_{i:05d}' for i in range(n_users)],
        'group': group_name,
        'visit_date': visit_dates,
        'platform': platform,
        'user_type': user_type,
        'converted': converted,
        'revenue': aov,
        'time_on_page_sec': time_on_page.round(1)
    })

# Generate data
control_df = generate_ab_data(n_control, true_cvr_control, 'control')
treatment_df = generate_ab_data(n_treatment, true_cvr_treatment, 'treatment')

# Combine
ab_data = pd.concat([control_df, treatment_df], ignore_index=True)

print(f"✅ A/B Test Data Generated!")
print(f"   Total users:    {len(ab_data):,}")
print(f"   Control:        {len(control_df):,}")
print(f"   Treatment:      {len(treatment_df):,}")
print(f"   Date range:     {ab_data['visit_date'].min().date()}{ab_data['visit_date'].max().date()}")
print(f"\n📋 Sample data:")
print(ab_data.head(10).to_string(index=False))

Step 2.2: Quick Data Exploration

python
# ============================================
# DATA EXPLORATION
# ============================================

print("📊 GROUP SUMMARY")
print("=" * 60)

summary = ab_data.groupby('group').agg(
    users=('user_id', 'count'),
    conversions=('converted', 'sum'),
    cvr=('converted', 'mean'),
    total_revenue=('revenue', 'sum'),
    avg_revenue_per_visitor=('revenue', 'mean'),
    avg_time_on_page=('time_on_page_sec', 'mean')
).round(4)

for group, row in summary.iterrows():
    print(f"\n  {'📘' if group == 'control' else '📙'} {group.upper()}")
    print(f"     Users:         {int(row['users']):,}")
    print(f"     Conversions:   {int(row['conversions']):,}")
    print(f"     CVR:           {row['cvr']:.2%}")
    print(f"     Total Revenue: {int(row['total_revenue']):,} VND")
    print(f"     Rev/Visitor:   {row['avg_revenue_per_visitor']:,.0f} VND")
    print(f"     Avg Time:      {row['avg_time_on_page']:.1f} sec")

# Check balance (Sample Ratio Mismatch)
print(f"\n🔍 SRM Check:")
total = len(ab_data)
control_pct = len(control_df) / total * 100
treatment_pct = len(treatment_df) / total * 100
print(f"   Control:   {control_pct:.1f}% | Treatment: {treatment_pct:.1f}%")
print(f"   Split OK?  {'✅ Yes (balanced)' if abs(control_pct - 50) < 2 else '⚠️ Check randomization!'}")

Part 3: Statistical Analysis (30 min)

Step 3.1: Chi-square Test — Primary Metric (CVR)

python
from scipy import stats

# ============================================
# CHI-SQUARE TEST — CONVERSION RATE
# ============================================

# Extract data
control = ab_data[ab_data['group'] == 'control']
treatment = ab_data[ab_data['group'] == 'treatment']

n_c = len(control)
n_t = len(treatment)
conv_c = control['converted'].sum()
conv_t = treatment['converted'].sum()

cvr_c = conv_c / n_c
cvr_t = conv_t / n_t

# Build contingency table
observed = np.array([
    [conv_c, n_c - conv_c],        # Control: [converted, not converted]
    [conv_t, n_t - conv_t]         # Treatment: [converted, not converted]
])

# Chi-square test
chi2, p_value, dof, expected = stats.chi2_contingency(observed)

# Relative lift
absolute_lift = cvr_t - cvr_c
relative_lift = absolute_lift / cvr_c * 100

print("📊 CHI-SQUARE TEST — CHECKOUT CVR")
print("=" * 55)
print(f"\n  CONVERSION RATES:")
print(f"    Control:     {conv_c:,}/{n_c:,} = {cvr_c:.2%}")
print(f"    Treatment:   {conv_t:,}/{n_t:,} = {cvr_t:.2%}")
print(f"    Absolute Δ:  {absolute_lift:+.2%} ({absolute_lift*100:+.2f} pp)")
print(f"    Relative Δ:  {relative_lift:+.1f}%")
print(f"\n  STATISTICAL TEST:")
print(f"    Chi-square:  {chi2:.4f}")
print(f"    df:          {dof}")
print(f"    p-value:     {p_value:.4f}")
print(f"\n  DECISION:")
if p_value < 0.05:
    print(f"    ✅ SIGNIFICANT (p = {p_value:.4f} < 0.05)")
    print(f"    → Reject H₀: Treatment CVR ≠ Control CVR")
else:
    print(f"    ❌ NOT SIGNIFICANT (p = {p_value:.4f} ≥ 0.05)")
    print(f"    → Fail to reject H₀: insufficient evidence of difference")

Step 3.2: Confidence Interval

python
# ============================================
# CONFIDENCE INTERVAL FOR DIFFERENCE IN CVR
# ============================================

def proportion_ci(conv_a, n_a, conv_b, n_b, alpha=0.05):
    """Calculate CI for difference in proportions (B - A)."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    diff = p_b - p_a

    se = np.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = stats.norm.ppf(1 - alpha / 2)

    return diff, diff - z * se, diff + z * se, se

diff, ci_low, ci_high, se = proportion_ci(conv_c, n_c, conv_t, n_t)

print("📊 95% CONFIDENCE INTERVAL")
print("=" * 55)
print(f"  Point estimate:  {diff:.4f} ({diff*100:+.2f} pp)")
print(f"  95% CI:          [{ci_low*100:+.2f} pp, {ci_high*100:+.2f} pp]")
print(f"  Standard Error:  {se:.6f}")
print(f"")
if ci_low > 0:
    print(f"  ✅ CI entirely above 0 → Treatment DEFINITELY improves CVR")
    print(f"     Treatment increases CVR by {ci_low*100:.2f}pp to {ci_high*100:.2f}pp")
elif ci_high < 0:
    print(f"  ❌ CI entirely below 0 → Treatment decreases CVR")
else:
    print(f"  ⚠️ CI includes 0 → Effect could be 0 (inconclusive)")

Step 3.3: t-test — Revenue per Visitor

python
# ============================================
# T-TEST — REVENUE PER VISITOR (continuous metric)
# ============================================

rev_control = control['revenue'].values
rev_treatment = treatment['revenue'].values

# Welch's t-test (don't assume equal variance)
t_stat, p_value_rev = stats.ttest_ind(rev_control, rev_treatment, equal_var=False)

# Effect size (Cohen's d)
pooled_std = np.sqrt((rev_control.std()**2 + rev_treatment.std()**2) / 2)
cohens_d = (rev_treatment.mean() - rev_control.mean()) / pooled_std if pooled_std > 0 else 0

# Classify effect size
if abs(cohens_d) < 0.2:
    effect_label = "Negligible"
elif abs(cohens_d) < 0.5:
    effect_label = "Small"
elif abs(cohens_d) < 0.8:
    effect_label = "Medium"
else:
    effect_label = "Large"

print("📊 WELCH'S T-TEST — REVENUE PER VISITOR")
print("=" * 55)
print(f"\n  REVENUE PER VISITOR:")
print(f"    Control:     {rev_control.mean():,.0f} VND (SD: {rev_control.std():,.0f})")
print(f"    Treatment:   {rev_treatment.mean():,.0f} VND (SD: {rev_treatment.std():,.0f})")
print(f"    Difference:  {rev_treatment.mean() - rev_control.mean():+,.0f} VND")
print(f"\n  STATISTICAL TEST:")
print(f"    t-statistic: {t_stat:.4f}")
print(f"    p-value:     {p_value_rev:.4f}")
print(f"    Cohen's d:   {cohens_d:.4f} ({effect_label})")
print(f"\n  GUARDRAIL CHECK:")
rev_change_pct = (rev_treatment.mean() - rev_control.mean()) / rev_control.mean() * 100
if rev_change_pct > -5:
    print(f"    ✅ Revenue/Visitor change: {rev_change_pct:+.1f}% (within -5% guardrail)")
else:
    print(f"    ❌ Revenue/Visitor change: {rev_change_pct:+.1f}% (VIOLATES -5% guardrail!)")

Step 3.4: Segment Analysis

python
# ============================================
# SEGMENT ANALYSIS — CHECK FOR SIMPSON'S PARADOX
# ============================================

print("📊 SEGMENT ANALYSIS")
print("=" * 60)

for segment_col in ['platform', 'user_type']:
    print(f"\n  📂 By {segment_col.upper()}:")
    print(f"  {'Segment':<12} | {'Control CVR':>11} | {'Treatment CVR':>13} | {'Lift':>8} | {'p-value':>8}")
    print(f"  {'-'*12}-+-{'-'*11}-+-{'-'*13}-+-{'-'*8}-+-{'-'*8}")

    for seg_val in ab_data[segment_col].unique():
        seg_c = control[control[segment_col] == seg_val]
        seg_t = treatment[treatment[segment_col] == seg_val]

        cvr_seg_c = seg_c['converted'].mean()
        cvr_seg_t = seg_t['converted'].mean()
        lift_seg = (cvr_seg_t - cvr_seg_c) / cvr_seg_c * 100 if cvr_seg_c > 0 else 0

        # Chi-square for segment
        obs_seg = np.array([
            [seg_c['converted'].sum(), len(seg_c) - seg_c['converted'].sum()],
            [seg_t['converted'].sum(), len(seg_t) - seg_t['converted'].sum()]
        ])
        _, p_seg, _, _ = stats.chi2_contingency(obs_seg)

        sig_icon = "✅" if p_seg < 0.05 else "  "
        print(f"  {seg_val:<12} | {cvr_seg_c:>10.2%} | {cvr_seg_t:>12.2%} | {lift_seg:>+7.1f}% | {p_seg:>7.4f} {sig_icon}")

print(f"\n  ⚠️ Check: Do all segments show same direction as overall?")
print(f"     If not → Simpson's Paradox → investigate before deploying!")

Part 4: Visualization (15 min)

Step 4.1: CVR by Group Bar Chart

python
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick

fig, axes = plt.subplots(1, 3, figsize=(16, 5))

# --- Chart 1: CVR Comparison ---
ax1 = axes[0]
groups = ['Control\n(4-step)', 'Treatment\n(2-step)']
cvrs = [cvr_c * 100, cvr_t * 100]
colors = ['#2196F3', '#4CAF50']
bars = ax1.bar(groups, cvrs, color=colors, width=0.5, edgecolor='white', linewidth=2)
ax1.set_ylabel('Checkout CVR (%)')
ax1.set_title('📊 Checkout CVR: Control vs Treatment', fontsize=12, fontweight='bold')
ax1.set_ylim(0, max(cvrs) * 1.3)
for bar, cvr in zip(bars, cvrs):
    ax1.text(bar.get_x() + bar.get_width()/2., bar.get_height() + 0.2,
             f'{cvr:.2f}%', ha='center', va='bottom', fontweight='bold', fontsize=13)
ax1.axhline(y=cvr_c * 100, color='gray', linestyle='--', alpha=0.5, label='Baseline')

# --- Chart 2: CVR by Platform ---
ax2 = axes[1]
platforms = ab_data['platform'].unique()
x = np.arange(len(platforms))
width = 0.3

cvr_c_plat = [control[control['platform']==p]['converted'].mean()*100 for p in platforms]
cvr_t_plat = [treatment[treatment['platform']==p]['converted'].mean()*100 for p in platforms]

ax2.bar(x - width/2, cvr_c_plat, width, label='Control', color='#2196F3', edgecolor='white')
ax2.bar(x + width/2, cvr_t_plat, width, label='Treatment', color='#4CAF50', edgecolor='white')
ax2.set_xticks(x)
ax2.set_xticklabels(platforms)
ax2.set_ylabel('CVR (%)')
ax2.set_title('📱 CVR by Platform', fontsize=12, fontweight='bold')
ax2.legend()

# --- Chart 3: Revenue Distribution ---
ax3 = axes[2]
rev_c_pos = rev_control[rev_control > 0]
rev_t_pos = rev_treatment[rev_treatment > 0]
ax3.hist(rev_c_pos, bins=50, alpha=0.5, label=f'Control (μ={rev_c_pos.mean():,.0f})', color='#2196F3', density=True)
ax3.hist(rev_t_pos, bins=50, alpha=0.5, label=f'Treatment (μ={rev_t_pos.mean():,.0f})', color='#4CAF50', density=True)
ax3.set_xlabel('Revenue (VND)')
ax3.set_ylabel('Density')
ax3.set_title('💰 Revenue Distribution (Converted Users)', fontsize=12, fontweight='bold')
ax3.legend(fontsize=9)

plt.tight_layout()
plt.savefig('ab_test_results.png', dpi=150, bbox_inches='tight')
plt.show()
print("📊 Charts saved to ab_test_results.png")

Step 4.2: Daily CVR Trend

python
# ============================================
# DAILY CVR TREND (check for novelty effect)
# ============================================

daily = ab_data.groupby(['visit_date', 'group']).agg(
    visitors=('user_id', 'count'),
    conversions=('converted', 'sum')
).reset_index()
daily['cvr'] = daily['conversions'] / daily['visitors'] * 100

fig, ax = plt.subplots(figsize=(12, 5))
for group, color, marker in [('control', '#2196F3', 'o'), ('treatment', '#4CAF50', 's')]:
    g = daily[daily['group'] == group]
    ax.plot(g['visit_date'], g['cvr'], marker=marker, linewidth=2,
            color=color, label=group.capitalize(), markersize=6)

ax.set_xlabel('Date')
ax.set_ylabel('Daily CVR (%)')
ax.set_title('📈 Daily CVR Trend — Control vs Treatment', fontsize=13, fontweight='bold')
ax.legend()
ax.grid(True, alpha=0.3)
plt.xticks(rotation=45)
plt.tight_layout()
plt.savefig('ab_test_daily_trend.png', dpi=150, bbox_inches='tight')
plt.show()
print("📈 Check: Is the lift consistent across days? Any novelty effect (Treatment high early, then drops)?")

Part 5: Final Report & Conclusion (15 min)

Step 5.1: Business Impact Calculation

python
# ============================================
# BUSINESS IMPACT CALCULATION
# ============================================

daily_checkout_visitors = 80_000
monthly_visitors = daily_checkout_visitors * 30

# Current
current_monthly_conversions = monthly_visitors * cvr_c
current_aov = control[control['converted']==1]['revenue'].mean() if control['converted'].sum() > 0 else 350000
current_monthly_revenue = current_monthly_conversions * current_aov

# Projected (with Treatment)
projected_monthly_conversions = monthly_visitors * cvr_t
projected_aov = treatment[treatment['converted']==1]['revenue'].mean() if treatment['converted'].sum() > 0 else 350000
projected_monthly_revenue = projected_monthly_conversions * projected_aov

delta_revenue = projected_monthly_revenue - current_monthly_revenue
delta_annual = delta_revenue * 12

print("💰 BUSINESS IMPACT PROJECTION")
print("=" * 55)
print(f"\n  CURRENT (Control — 4-step checkout):")
print(f"    Monthly visitors:    {monthly_visitors:>12,}")
print(f"    CVR:                 {cvr_c:>12.2%}")
print(f"    Monthly conversions: {current_monthly_conversions:>12,.0f}")
print(f"    AOV:                 {current_aov:>12,.0f} VND")
print(f"    Monthly revenue:     {current_monthly_revenue:>12,.0f} VND")
print(f"\n  PROJECTED (Treatment — 2-step checkout):")
print(f"    Monthly visitors:    {monthly_visitors:>12,}")
print(f"    CVR:                 {cvr_t:>12.2%}")
print(f"    Monthly conversions: {projected_monthly_conversions:>12,.0f}")
print(f"    AOV:                 {projected_aov:>12,.0f} VND")
print(f"    Monthly revenue:     {projected_monthly_revenue:>12,.0f} VND")
print(f"\n  IMPACT:")
print(f"    Monthly Δ revenue:   {delta_revenue:>+12,.0f} VND")
print(f"    Annual Δ revenue:    {delta_annual:>+12,.0f} VND")
print(f"    Annual Δ revenue:    {delta_annual/1e9:>+12.1f} tỷ VND")

Step 5.2: Generate the Final Report

python
# ============================================
# FINAL A/B TEST REPORT
# ============================================

report = f"""
{'='*60}
        📋 A/B TEST REPORT — CHECKOUT SIMPLIFICATION
{'='*60}

📌 EXPERIMENT SUMMARY
   Experiment:    2-step vs 4-step checkout
   Duration:      14 days (2026-01-15 → 2026-01-28)
   Sample:        {n_c:,} (Control) + {n_t:,} (Treatment)

📊 PRIMARY METRIC: CHECKOUT CVR
   Control:       {cvr_c:.2%} ({conv_c:,} conversions)
   Treatment:     {cvr_t:.2%} ({conv_t:,} conversions)
   Absolute Δ:    {absolute_lift*100:+.2f} percentage points
   Relative Δ:    {relative_lift:+.1f}%
   p-value:       {p_value:.4f} {'✅ SIGNIFICANT' if p_value < 0.05 else '❌ NOT SIGNIFICANT'}
   95% CI:        [{ci_low*100:+.2f}pp, {ci_high*100:+.2f}pp]

🛡️ GUARDRAIL METRICS
   Revenue/Visitor:  {rev_change_pct:+.1f}% change {'✅ OK' if rev_change_pct > -5 else '❌ VIOLATED'}
   Segment check:    {'✅ Consistent direction across segments' if relative_lift > 0 else '⚠️ Check segments'}

💰 BUSINESS IMPACT (if deployed)
   Monthly Δ revenue:  {delta_revenue:+,.0f} VND
   Annual Δ revenue:   {delta_annual/1e9:+.1f} billion VND

🎯 RECOMMENDATION
"""

if p_value < 0.05 and relative_lift >= 8 and rev_change_pct > -5:
    report += """   ✅ DEPLOY TREATMENT (2-step checkout)

   Rationale:
   - Statistical significance: p < 0.05
   - Practical significance: relative lift meets MDE (≥ 8%)
   - Guardrail metrics: OK (revenue/visitor not degraded)
   - Segment analysis: consistent across platforms and user types

   Deployment plan:
   1. Gradual rollout: 10% → 50% → 100% over 2 weeks
   2. Monitor CVR + Revenue daily during rollout
   3. Set alert: if CVR drops below baseline → auto-rollback
"""
elif p_value < 0.05 and relative_lift < 8:
    report += """   ⚠️ HOLD — STATISTICALLY SIGNIFICANT BUT BELOW MDE

   While p < 0.05, the observed lift is below the 8% MDE target.
   Consider: is the observed lift still practically meaningful?
"""
else:
    report += """   ❌ DO NOT DEPLOY — NOT SIGNIFICANT

   Insufficient evidence that Treatment is better than Control.
   Options: run the test longer to gain power, redesign the experiment, or try a different change.
"""

report += f"""
📝 LEARNINGS
   1. {'2-step checkout reduces friction → more completions' if relative_lift > 0 else 'Simplification did not improve CVR'}
   2. Segment analysis shows {'consistent' if relative_lift > 0 else 'mixed'} results
   3. Revenue per visitor {'maintained' if rev_change_pct > -5 else 'degraded'} — AOV not affected

{'='*60}
   Report generated: {datetime.now().strftime('%Y-%m-%d %H:%M')}
{'='*60}
"""

print(report)

✅ Completion Checklist

EXPERIMENT DESIGN:
✅ SMART hypothesis written
✅ Sample size calculated (formula + code)
✅ Duration planned (≥ 14 days)
✅ Primary metric + guardrail metrics defined
✅ Statistical test pre-selected (chi-square for CVR)

DATA & ANALYSIS:
✅ Dataset generated (control + treatment)
✅ SRM check (sample ratio balanced)
✅ Chi-square test for primary metric (CVR)
✅ 95% Confidence Interval calculated
✅ t-test for guardrail metric (Revenue/Visitor)
✅ Effect size (Cohen's d) calculated
✅ Segment analysis (platform, user type)

VISUALIZATION:
✅ CVR bar chart (Control vs Treatment)
✅ CVR by segment chart
✅ Revenue distribution chart
✅ Daily CVR trend chart

REPORT:
✅ Business impact calculated (monthly + annual revenue)
✅ Final recommendation with rationale
✅ Learnings documented

🎁 Bonus Challenge (Optional)

Bonus 1: Power Analysis Visualization

python
# How does sample size change with different MDE?
mde_range = np.arange(0.03, 0.25, 0.01)
sample_sizes = [calculate_sample_size(0.12, m) for m in mde_range]

fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(mde_range * 100, [s/1000 for s in sample_sizes], 'b-o', markersize=4, linewidth=2)
ax.set_xlabel('Minimum Detectable Effect (% relative)')
ax.set_ylabel('Sample Size per Group (thousands)')
ax.set_title('📐 Sample Size vs MDE (Baseline CVR = 12%, α=0.05, Power=80%)', fontweight='bold')
ax.grid(True, alpha=0.3)
ax.axvline(x=8, color='red', linestyle='--', label='Our MDE = 8%')
ax.legend()
plt.tight_layout()
plt.show()

Bonus 2: Simulate Peeking Effect

python
# Simulate how peeking inflates false positive rate
def simulate_peeking(n_simulations=5000, true_cvr=0.12,
                     peek_days=[3, 5, 7, 10, 14], daily_users=700):
    """Simulate peeking under H0 (no real difference) and count false positives.

    Each simulated experiment generates its full dataset ONCE; every peek
    then re-tests a growing prefix of that same data, as in a real test.
    """
    false_positives_by_day = {d: 0 for d in peek_days}
    any_significant = 0  # experiments where AT LEAST ONE peek hit p < 0.05
    n_max = max(peek_days) * daily_users

    for _ in range(n_simulations):
        control = np.random.binomial(1, true_cvr, size=n_max)
        treatment = np.random.binomial(1, true_cvr, size=n_max)  # SAME rate!
        hit = False

        for day in peek_days:
            n_so_far = day * daily_users
            conv_c = control[:n_so_far].sum()
            conv_t = treatment[:n_so_far].sum()

            observed = np.array([
                [conv_c, n_so_far - conv_c],
                [conv_t, n_so_far - conv_t]
            ])

            if observed.min() > 0:
                _, p, _, _ = stats.chi2_contingency(observed)
                if p < 0.05:
                    false_positives_by_day[day] += 1
                    hit = True

        if hit:
            any_significant += 1

    per_day = {d: c / n_simulations * 100 for d, c in false_positives_by_day.items()}
    return per_day, any_significant / n_simulations * 100

print("🔍 Simulating peeking effect (n=5000 simulations)...")
print("   (Under H₀: no real difference between A and B)")
fp_rates, fp_any = simulate_peeking(n_simulations=5000)

print("\n   Day | False Positive Rate | Expected: 5%")
print("   ----+--------------------+-------------")
for day, rate in sorted(fp_rates.items()):
    bar = '█' * int(rate)
    print(f"   {day:>3} | {rate:>5.1f}%  {bar}")
print(f"\n   ⚠️  If you peek on all 5 days and stop as soon as p < 0.05,")
print(f"       your actual false positive rate ≈ {fp_any:.0f}%, not 5%!")

📚 Further Reading

  • Trustworthy Online Controlled Experiments (Kohavi et al.) — the "bible" of A/B testing
  • Evan Miller's Sample Size Calculator — evanmiller.org/ab-testing
  • scipy.stats documentation — docs.scipy.org/doc/scipy/reference/stats.html
  • Statistical Methods in Online A/B Testing (Georgiev) — detailed statistics for A/B tests