📘 Session 15: A/B Testing, experimenting with data to make the right decisions

Don't guess, test. A/B testing turns gut feeling into science.

🎯 Session objectives

After this session, learners will be able to:

Understand the principles of A/B testing: control vs treatment, statistical significance
Design an experiment: hypothesis, sample size, duration
Analyze results: p-value, confidence interval, effect size
Avoid the pitfalls: peeking, multiple testing, selection bias

📋 Overview

In Session 14 you covered Industry Case Studies (Marketing, Finance, Supply Chain) and learned which metrics and KPIs matter in which domain. But when you want to change something (a new landing page, new pricing, a new feature), how do you know the change is actually BETTER?

Session 15 moves from "measuring the present" (Sessions 13-14) to "experimenting on the future": A/B testing. Instead of guessing, you randomly split users into two groups, show group A the old version (Control) and group B the new version (Treatment), measure the outcome, and use statistics to decide whether the change is a real improvement or just noise.

According to Harvard Business Review (2024), data-driven companies run 2.5x more A/B tests than their competitors and grow 30-40% faster. Google runs 10,000+ experiments per year. Booking.com runs 25,000+ experiments per year. Microsoft Bing found that a single experiment could add $100M in revenue per year.

A/B testing is the skill that turns a DA from "the person who reports numbers" into "the person who designs experiments", from reactive to proactive. You don't wait for the data to speak; you generate the data through experiments.

mermaid

flowchart LR
    A["📥 Obtain<br/>Session 7: Python"] --> B["🧹 Scrub<br/>Session 8: Pandas"]
    B --> C["🔍 Explore<br/>Session 9: EDA"]
    C --> D["📊 iNterpret<br/>Sessions 10-11: Chart + BI"]
    D --> E["📖 Storytelling<br/>Session 12: Presentation"]
    E --> F["💼 Business Metrics<br/>Session 13: KPI, Funnel"]
    F --> G["🏭 Industry Cases<br/>Session 14: Domain Analytics"]
    G --> H["🧪 A/B Testing<br/>✅ Session 15: Experiment"]
    style H fill:#e8f5e9,stroke:#4caf50,stroke-width:3px

💡 Why does A/B testing matter for a DA?

Scenario	Without testing	With an A/B test
PM wants to change the CTA button color	"Red looks better than blue" → HiPPO (highest paid person's opinion) decision	Split 50/50, measure CVR → p < 0.05 → deploy the winner
Marketing wants to test an email subject line	Pick by "feel"	3 variants, 10% of traffic each, winner goes to the remaining 70%
Product wants to test pricing	"A 10% price increase is probably fine"	Control (old price) vs Treatment (new price), measure revenue/user
CEO wants to redesign the homepage	Agency designs it → deploy immediately	Old vs New, measure bounce rate + CVR for 2 weeks
Growth team wants to test the onboarding flow	Copy a competitor	3-step vs 5-step onboarding, measure Day 7 retention

📌 Part 1: A/B Testing Fundamentals (Control vs Treatment)

What is A/B testing?

A/B testing (split testing) is an experimental method that compares two versions (A and B) to determine which one performs better, based on statistical evidence rather than intuition.

mermaid

flowchart TD
    A["👥 Total Users<br/>100,000 visitors"] --> B["🎲 Random Split"]
    B --> C["👤 Control Group (A)<br/>50,000 users<br/>Sees the OLD version"]
    B --> D["👤 Treatment Group (B)<br/>50,000 users<br/>Sees the NEW version"]
    C --> E["📊 Measure: CVR = 4.2%"]
    D --> F["📊 Measure: CVR = 4.8%"]
    E --> G["🧮 Statistical Test<br/>p-value < 0.001 < 0.05"]
    F --> G
    G --> H["✅ Treatment B wins!<br/>Deploy the new version"]

Core terminology

Term	Definition	Example
Control (A)	The current version, the baseline	Current landing page
Treatment (B)	The changed version, the hypothesis under test	New landing page, red button
Metric (KPI)	The measure of the experiment's outcome	CVR, CTR, Revenue/User, Retention
Null Hypothesis ( $H_{0}$ )	"There is no difference between A and B"	$CVR_{A} = CVR_{B}$
Alternative Hypothesis ( $H_{1}$ )	"There is a difference: B is better than A"	$CVR_{B} > CVR_{A}$
p-value	Probability of observing a result at least this extreme IF $H_{0}$ is true	p = 0.03 → if A and B were truly equal, a gap this large would show up only 3% of the time
Statistical Significance	The result is strong enough to reject $H_{0}$	p < 0.05 → "significant"
Significance Level ( $α$ )	The decision threshold, usually 0.05 (5%)	Accept a 5% false positive rate
Power ( $1 - β$ )	Probability of detecting an effect that REALLY exists	80% power → 80% chance of detecting it
Type I Error (False Positive)	Concluding B is better when it ACTUALLY is not	Probability = $α$ (5%)
Type II Error (False Negative)	Concluding B is not better when it ACTUALLY is	Probability = $β$ (20%)
Effect Size	The actual magnitude of the difference between A and B	CVR increases by 0.6 percentage points
MDE (Minimum Detectable Effect)	The smallest effect the test is designed to detect	"I want to detect at least a 5% relative lift"

Hypothesis Testing Flow

mermaid

flowchart TD
    A["🧪 State the Hypothesis<br/>H₀: CVR_A = CVR_B<br/>H₁: CVR_B > CVR_A"] --> B["📏 Choose Significance Level<br/>α = 0.05 (5%)"]
    B --> C["📐 Compute Sample Size<br/>n = f(α, power, MDE)"]
    C --> D["🎲 Random Split & Run<br/>50/50 split, run the full duration"]
    D --> E["📊 Collect Data<br/>Control: 4.2% | Treatment: 4.8%"]
    E --> F["🧮 Statistical Test<br/>t-test or chi-square"]
    F --> G{"p-value < α?"}
    G -->|"Yes: p = 0.03 < 0.05"| H["✅ Reject H₀<br/>Treatment B wins!"]
    G -->|"No: p = 0.18 > 0.05"| I["❌ Fail to Reject H₀<br/>Not enough evidence"]
    H --> J["🚀 Deploy Treatment"]
    I --> K["🔄 Keep Control<br/>or redesign the experiment"]

p-value: reading it correctly

p-value = 0.03 means: "If there is REALLY no difference between A and B ( $H_{0}$ is true), the probability of observing a result this extreme (or more extreme) is only 3%."

The p-value is NOT:

❌ "The probability that B is better than A is 97%"
❌ "The probability that $H_{0}$ is true is 3%"
❌ "A 3% chance that the result is wrong"

The p-value IS:

✅ "If $H_{0}$ is true, the probability of seeing data this extreme is 3%"
✅ p < $α$ (0.05) → Reject $H_{0}$ → Conclusion: the difference is statistically significant

p-value	Conclusion	Action
p < 0.01	Highly significant	Deploy with high confidence
0.01 ≤ p < 0.05	Significant	Deploy, monitor closely
0.05 ≤ p < 0.10	Marginally significant	Think it over; more data may be needed
p ≥ 0.10	Not significant	Keep Control, redesign the test
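
To make the definition concrete, the sketch below simulates A/A tests, where both groups share the same true CVR so $H_{0}$ holds by construction. About 5% of them should still come out with p < 0.05, which is exactly what $α$ means. The function name, the simulated traffic numbers, and the use of a pooled two-proportion z-test are illustrative assumptions, not part of the lesson's data.

```python
import numpy as np
from scipy import stats

def simulate_aa_pvalues(n_tests=2000, n_per_group=5000, cvr=0.05, seed=42):
    """Simulate A/A tests (identical true CVR in both groups) and return their p-values."""
    rng = np.random.default_rng(seed)
    conv_a = rng.binomial(n_per_group, cvr, size=n_tests)
    conv_b = rng.binomial(n_per_group, cvr, size=n_tests)
    # Pooled two-proportion z-test, vectorized over all simulated tests
    p_pool = (conv_a + conv_b) / (2 * n_per_group)
    se = np.sqrt(p_pool * (1 - p_pool) * 2 / n_per_group)
    z = (conv_b - conv_a) / n_per_group / se
    return 2 * stats.norm.sf(np.abs(z))  # two-sided p-values

p_values = simulate_aa_pvalues()
false_positive_rate = float(np.mean(p_values < 0.05))
print(f"Share of A/A tests with p < 0.05: {false_positive_rate:.1%}")
```

A "significant" result in any one of these runs is noise by design, which is why a single p < 0.05 is evidence against $H_{0}$ rather than proof that B is better.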

Type I vs Type II Errors

	$H_{0}$ is actually TRUE (A=B)	$H_{0}$ is actually FALSE (B is better)
Reject $H_{0}$ (conclude B is better)	❌ Type I Error (False Positive): B is deployed although it is not better	✅ Correct: the improvement is detected
Fail to reject $H_{0}$ (conclude no difference)	✅ Correct: A is rightly kept	❌ Type II Error (False Negative): the improvement is missed

$$α = P(\text{Type I Error}) = P(\text{Reject } H_{0} \mid H_{0} \text{ true})$$

$$β = P(\text{Type II Error}) = P(\text{Fail to reject } H_{0} \mid H_{0} \text{ false})$$

$$\text{Power} = 1 - β$$

⚠️ In a business context

Type I Error = Deploying a new version that is actually NOT better → wasted effort, potential negative impact
Type II Error = Keeping the old version and missing an improvement → missed opportunity, competitor advantage
Typical settings: $α = 0.05$ (5% chance of a false positive), Power = 80% ( $β = 0.20$ , 20% chance of a false negative)

📌 Part 2: Experiment Design (Hypothesis, Sample Size, Duration)

SMART Hypothesis

A good hypothesis is SMART: Specific, Measurable, Actionable, Relevant, Time-bound.

SMART	Bad Hypothesis	Good Hypothesis
Specific	"The new landing page will be better"	"A red CTA button will increase CVR"
Measurable	"Users like version B more"	"CVR increases by at least 10% (relative)"
Actionable	"Revenue will go up"	"If CVR increases by ≥ 10%, deploy to 100% of traffic"
Relevant	"Bounce rate goes down" (but the KPI is CVR)	"CVR is the primary metric, bounce rate is a guardrail"
Time-bound	"Run the test until it is significant"	"Run for 14 days, covering 2 full business cycles"

Hypothesis template:

"Changing [X] will increase/decrease [primary metric] by at least [MDE]% within [duration], compared with the current version, for [target audience]."

Example: "Changing the CTA button from blue to red will increase CVR by at least 10% (relative) within 14 days, compared with the current blue button, for all mobile visitors."

Sample Size Calculation

Why does sample size matter? With too few users, the test lacks the power to detect the effect. With too many, you waste traffic and time.

Sample size formula for a two-proportion z-test:

$$n = \frac{(Z_{α / 2} + Z_{β})^{2} \cdot [p_{1} (1 - p_{1}) + p_{2} (1 - p_{2})]}{(p_{2} - p_{1})^{2}}$$

Where:

$n$ = sample size PER group
$Z_{α / 2}$ = z-score for the significance level (α = 0.05 → $Z_{0.025}$ = 1.96)
$Z_{β}$ = z-score for the power (80% power → $Z_{0.20}$ = 0.84)
$p_{1}$ = baseline conversion rate (Control)
$p_{2}$ = expected conversion rate (Treatment) = $p_{1} \times (1 + \mathrm{MDE})$

python

# Sample Size Calculator for A/B Test
import numpy as np
from scipy import stats

def calculate_sample_size(baseline_rate, mde_relative, alpha=0.05, power=0.80):
    """
    Calculate sample size per group for A/B test (two-proportion z-test).

    Parameters:
        baseline_rate: Current conversion rate (e.g., 0.05 for 5%)
        mde_relative: Minimum Detectable Effect as relative change (e.g., 0.10 for 10% lift)
        alpha: Significance level (default 0.05)
        power: Statistical power (default 0.80)

    Returns:
        Sample size per group
    """
    p1 = baseline_rate
    p2 = p1 * (1 + mde_relative)

    z_alpha = stats.norm.ppf(1 - alpha / 2)  # two-tailed
    z_beta = stats.norm.ppf(power)

    numerator = (z_alpha + z_beta) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2))
    denominator = (p2 - p1) ** 2

    n = np.ceil(numerator / denominator)
    return int(n)

# Example scenarios: baseline CVR and the relative lift (MDE) we want to detect
scenarios = [
    {'baseline': 0.05, 'mde': 0.10, 'label': 'CVR 5%, MDE 10%'},
    {'baseline': 0.05, 'mde': 0.05, 'label': 'CVR 5%, MDE 5%'},
    {'baseline': 0.10, 'mde': 0.10, 'label': 'CVR 10%, MDE 10%'},
    {'baseline': 0.02, 'mde': 0.20, 'label': 'CVR 2%, MDE 20%'},
    {'baseline': 0.30, 'mde': 0.05, 'label': 'CVR 30%, MDE 5%'},
]

print("📐 SAMPLE SIZE CALCULATOR")
print("=" * 70)
for s in scenarios:
    n = calculate_sample_size(s['baseline'], s['mde'])
    total = n * 2
    print(f"  {s['label']:25s} → n = {n:>8,}/group → Total: {total:>8,} users")

Common sample sizes (computed with the formula above):

Baseline CVR	MDE (relative)	$α$	Power	n per group	Total
5%	10% lift	0.05	80%	~31,000	~62,000
5%	5% lift	0.05	80%	~122,000	~244,000
10%	10% lift	0.05	80%	~14,700	~29,500
2%	20% lift	0.05	80%	~21,100	~42,200
30%	5% lift	0.05	80%	~14,900	~29,700

💡 Sample Size Rules of Thumb

Lower baseline → more samples for the same relative MDE (a 5% lift needs ~122,000 per group at 5% CVR but only ~14,900 at 30% CVR)
Smaller MDE → MANY more samples (halving the MDE roughly quadruples n; detecting a 5% lift is much harder than a 20% lift)
Higher power → more samples (90% power needs about 34% more than 80%)
If daily traffic is low → raise the MDE, or as a last resort accept lower power, so the test finishes in a reasonable time
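
One way to sanity-check a sample size is to simulate the experiment many times and count how often the test detects the effect. The sketch below does this for the 5% baseline / 10% relative lift scenario, using the per-group size the formula gives (31,231) and, for contrast, roughly half of it. The function name, simulation count, and seed are illustrative assumptions.

```python
import numpy as np
from scipy import stats

def simulated_power(p1, p2, n_per_group, alpha=0.05, n_sims=4000, seed=0):
    """Estimate power: the share of simulated experiments with p < alpha when p2 != p1."""
    rng = np.random.default_rng(seed)
    conv_a = rng.binomial(n_per_group, p1, size=n_sims)
    conv_b = rng.binomial(n_per_group, p2, size=n_sims)
    # Pooled two-proportion z-test, vectorized over all simulations
    p_pool = (conv_a + conv_b) / (2 * n_per_group)
    se = np.sqrt(p_pool * (1 - p_pool) * 2 / n_per_group)
    z = (conv_b - conv_a) / n_per_group / se
    p_values = 2 * stats.norm.sf(np.abs(z))
    return float(np.mean(p_values < alpha))

power_full = simulated_power(0.05, 0.055, n_per_group=31_231)
power_half = simulated_power(0.05, 0.055, n_per_group=15_000)
print(f"Power with n = 31,231 per group: {power_full:.1%}")
print(f"Power with n = 15,000 per group: {power_half:.1%}")
```

With the full sample the simulated power should land near the 80% target; with about half the sample it drops to roughly a coin flip, which is what an underpowered test looks like in practice.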

Duration Calculation

$$\text{Duration (days)} = \frac{\text{Total Sample Size}}{\text{Daily Traffic} \times \text{\% Traffic in Test}}$$

python

# Duration Calculator
def calculate_duration(sample_size_total, daily_traffic, traffic_pct=1.0):
    """Calculate test duration in days."""
    return int(np.ceil(sample_size_total / (daily_traffic * traffic_pct)))

# Example
daily_traffic = 10_000  # users/day
sample_needed = 62_000  # total (both groups combined)
traffic_in_test = 1.0   # 100% of traffic enters the test

duration = calculate_duration(sample_needed, daily_traffic, traffic_in_test)
print(f"\n⏱️ DURATION ESTIMATE:")
print(f"   Sample needed: {sample_needed:,} users")
print(f"   Daily traffic: {daily_traffic:,} users/day")
print(f"   Traffic in test: {traffic_in_test*100:.0f}%")
print(f"   Duration: {duration} days")
print(f"   Recommendation: Run minimum {max(duration, 14)} days (cover 2 business cycles)")

Randomization: how to split the groups

Method	Description	Pros	Cons
Simple Random	Each user is randomly assigned to A or B	Simplest	Small imbalances are possible
Stratified	Randomize WITHIN each stratum (mobile/desktop, new/returning)	Balanced on key dimensions	More complex
Cluster	Randomize by a larger unit (city, company)	Avoids contamination	Needs many clusters
Cookie/User ID	Hash user_id → deterministic assignment	Consistent experience	Cookie clearing = re-randomization
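
The "Cookie/User ID" row can be sketched as follows: hash the user ID together with an experiment name, map the hash to a number in [0, 1), and compare it with the treatment share. The function name `assign_variant`, the experiment name, and the choice of SHA-256 are assumptions made for illustration; production experimentation platforms differ in the details.

```python
import hashlib

def assign_variant(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically map a user to 'control' or 'treatment'."""
    # Salting with the experiment name keeps assignments independent across experiments
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # first 32 bits scaled to [0, 1)
    return "treatment" if bucket < treatment_share else "control"

# The same user always gets the same variant for a given experiment
first_call = assign_variant("user_123", "cta_color_test")
second_call = assign_variant("user_123", "cta_color_test")

# Across many users the split should be close to 50/50
assignments = [assign_variant(f"user_{i}", "cta_color_test") for i in range(20_000)]
treatment_rate = assignments.count("treatment") / len(assignments)
print(f"Stable assignment: {first_call == second_call}, treatment share: {treatment_rate:.1%}")
```

Because the assignment depends only on the ID and the experiment name, no lookup table is needed and a returning user sees the same version on every visit, as long as the ID itself is stable.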

mermaid

flowchart TD
    A["🎯 Define Hypothesis"] --> B["📐 Calculate Sample Size"]
    B --> C["⏱️ Estimate Duration"]
    C --> D{"Duration OK?<br/>(≤ 4 weeks)"}
    D -->|Yes| E["🎲 Randomize Users"]
    D -->|No| F["Adjust: ↑ MDE<br/>or ↓ Power<br/>or ↑ Traffic"]
    F --> B
    E --> G["🚀 Launch Experiment"]
    G --> H["📊 Wait Full Duration<br/>NO PEEKING!"]
    H --> I["🧮 Analyze Results"]

Guardrail Metrics

An A/B test has a primary metric (e.g. CVR) and guardrail metrics, which are metrics that must not get worse:

Type	Metric	Meaning
Primary	CVR (Conversion Rate)	The metric you want to IMPROVE
Guardrail	Bounce Rate	Must not increase by more than 5%
Guardrail	Page Load Time	Must not increase by more than 200ms
Guardrail	Revenue per User	Must not decrease (CVR can rise while revenue falls, e.g. if users shift to cheaper items)
Guardrail	Customer Complaints	Must not increase

📌 Part 3: Analyzing Results (t-test, Chi-square, Confidence Interval)

Which statistical test should you use?

mermaid

flowchart TD
    A["📊 Metric type?"] --> B{"What is the metric?"}
    B -->|"Proportion / Rate<br/>(CVR, CTR)"| C["Chi-square test<br/>or Z-test for proportions"]
    B -->|"Continuous / Mean<br/>(Revenue, Time on Page)"| D{"Data distribution?"}
    D -->|"Normal or n > 30"| E["Independent t-test<br/>(Welch's t-test)"]
    D -->|"Non-normal, small n"| F["Mann-Whitney U test<br/>(non-parametric)"]
    B -->|"Count data<br/>(# clicks, # errors)"| G["Chi-square test<br/>or Poisson test"]

Chi-square Test: comparing proportions

Use when: the metric is a rate (CVR, CTR, signup rate).

python

# Chi-square Test for A/B Testing (Proportions)
from scipy import stats
import numpy as np

# Data: Landing page A/B test
control_visitors = 15000
control_conversions = 630     # CVR = 4.20%
treatment_visitors = 15000
treatment_conversions = 720   # CVR = 4.80%

# Chi-square test
observed = np.array([
    [control_conversions, control_visitors - control_conversions],
    [treatment_conversions, treatment_visitors - treatment_conversions]
])

chi2, p_value, dof, expected = stats.chi2_contingency(observed)

# Results
cvr_control = control_conversions / control_visitors * 100
cvr_treatment = treatment_conversions / treatment_visitors * 100
relative_lift = (cvr_treatment - cvr_control) / cvr_control * 100

print("📊 A/B TEST RESULTS — CHI-SQUARE TEST")
print("=" * 55)
print(f"  Control:   {control_conversions:,}/{control_visitors:,} = {cvr_control:.2f}%")
print(f"  Treatment: {treatment_conversions:,}/{treatment_visitors:,} = {cvr_treatment:.2f}%")
print(f"  Absolute Lift:  {cvr_treatment - cvr_control:+.2f} percentage points")
print(f"  Relative Lift:  {relative_lift:+.1f}%")
print(f"\n  Chi-square stat: {chi2:.4f}")
print(f"  p-value:         {p_value:.4f}")
print(f"  Significant?     {'✅ YES (p < 0.05)' if p_value < 0.05 else '❌ NO (p >= 0.05)'}")
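
The flowchart above also lists the z-test for proportions as an alternative. The sketch below applies it to the same landing page data; the helper name `two_proportion_ztest` is an assumption for illustration. For a 2×2 table the squared z statistic equals the chi-square statistic computed without Yates' continuity correction, which the last lines check. Note that `stats.chi2_contingency` applies that correction by default for 2×2 tables, so its p-value is slightly larger than the z-test's.

```python
import numpy as np
from scipy import stats

def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference between two conversion rates (pooled SE)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * stats.norm.sf(abs(z))
    return z, p_value

z_stat, p_val = two_proportion_ztest(630, 15000, 720, 15000)

# Cross-check against the uncorrected chi-square statistic on the same 2x2 table
table = np.array([[630, 15000 - 630], [720, 15000 - 720]])
chi2_uncorrected = stats.chi2_contingency(table, correction=False)[0]

print(f"z = {z_stat:.3f}, p = {p_val:.4f}")
print(f"z^2 = {z_stat**2:.3f}, chi-square (no correction) = {chi2_uncorrected:.3f}")
```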

Independent t-test: comparing means

Use when: the metric is a continuous value (revenue per user, time on page, AOV).

python

# Independent t-test for A/B Testing (Means)
from scipy import stats
import numpy as np

np.random.seed(15)

# Data: Revenue per user in A/B test
control_revenue = np.random.lognormal(mean=10.5, sigma=1.2, size=5000)    # Control
treatment_revenue = np.random.lognormal(mean=10.6, sigma=1.2, size=5000)  # Treatment (slightly higher)

# Welch's t-test (does not assume equal variances)
t_stat, p_value = stats.ttest_ind(control_revenue, treatment_revenue, equal_var=False)

# Effect size (Cohen's d), using sample standard deviations (ddof=1)
pooled_std = np.sqrt((control_revenue.std(ddof=1)**2 + treatment_revenue.std(ddof=1)**2) / 2)
cohens_d = (treatment_revenue.mean() - control_revenue.mean()) / pooled_std

print("📊 A/B TEST RESULTS — WELCH'S T-TEST")
print("=" * 55)
print(f"  Control mean:    {control_revenue.mean():,.0f} VND (n={len(control_revenue):,})")
print(f"  Treatment mean:  {treatment_revenue.mean():,.0f} VND (n={len(treatment_revenue):,})")
print(f"  Difference:      {treatment_revenue.mean() - control_revenue.mean():+,.0f} VND")
print(f"\n  t-statistic:     {t_stat:.4f}")
print(f"  p-value:         {p_value:.4f}")
print(f"  Cohen's d:       {cohens_d:.4f}")
print(f"  Significant?     {'✅ YES (p < 0.05)' if p_value < 0.05 else '❌ NO (p >= 0.05)'}")
print(f"  Effect size:     {'Negligible' if abs(cohens_d) < 0.2 else 'Small' if abs(cohens_d) < 0.5 else 'Medium' if abs(cohens_d) < 0.8 else 'Large'}")

Confidence Interval

Confidence Interval (CI): a range of plausible values for the true effect. A $(1 - α)$ CI is built by a procedure that, over repeated experiments, covers the true effect $(1 - α)$ of the time (95% when $α = 0.05$).

$$CI = (\hat{p}_{B} - \hat{p}_{A}) \pm Z_{α / 2} \cdot \sqrt{\frac{\hat{p}_{A} (1 - \hat{p}_{A})}{n_{A}} + \frac{\hat{p}_{B} (1 - \hat{p}_{B})}{n_{B}}}$$

python

# Confidence Interval for A/B Test
from scipy import stats
import numpy as np

def ab_test_confidence_interval(conv_a, n_a, conv_b, n_b, alpha=0.05):
    """Calculate CI for difference in proportions."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    diff = p_b - p_a

    se = np.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = stats.norm.ppf(1 - alpha / 2)

    ci_lower = diff - z * se
    ci_upper = diff + z * se

    return diff, ci_lower, ci_upper, se

# Example
diff, ci_low, ci_high, se = ab_test_confidence_interval(
    conv_a=630, n_a=15000,
    conv_b=720, n_b=15000,
    alpha=0.05
)

print("📊 CONFIDENCE INTERVAL")
print("=" * 55)
print(f"  Difference:    {diff*100:+.2f} percentage points")
print(f"  95% CI:        [{ci_low*100:+.2f}, {ci_high*100:+.2f}] percentage points")
print(f"  Standard Error: {se*100:.4f} percentage points")
print(f"\n  Interpretation: the 95% CI for the change in CVR runs from "
      f"{ci_low*100:+.2f} to {ci_high*100:+.2f} percentage points.")
if ci_low > 0:
    print("  ✅ CI lies entirely above 0 → evidence of a positive effect")
elif ci_high < 0:
    print("  ❌ CI lies entirely below 0 → evidence of a negative effect")
else:
    print("  ⚠️ CI contains 0 → the effect may be 0 (an improvement is not established)")

Effect Size: Statistical vs Practical Significance

Concept	Question	Example
Statistical Significance	"Is the difference REAL?"	p = 0.02 < 0.05 → Yes!
Practical Significance	"Is the difference LARGE ENOUGH to care about?"	CVR increases by 0.01% → statistically significant, but...
Effect Size	"HOW LARGE is the difference?"	Cohen's d = 0.05 → negligible

Cohen's d interpretation:

Cohen's d	Interpretation	A/B test example
< 0.2	Negligible/Small	Revenue up 500 VND/user: may not be worth deploying
0.2 - 0.5	Small-Medium	CVR up 0.3pp → worth further testing
0.5 - 0.8	Medium-Large	CTR up 50% relative → strong improvement
> 0.8	Large	Rare in A/B tests: re-check the data

⚠️ Statistical Significance ≠ Practical Significance

With a very large sample (n > 1,000,000), almost ANY small difference becomes "statistically significant". A product at Google's scale, with around a billion users, could in principle detect a CVR difference on the order of 0.001%: significant, but NOT practical.

Rule of thumb: Always ask, "If we deploy this, how much money and how many users does it actually affect?"
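
The gap between the two kinds of significance is easy to reproduce. In the sketch below, two groups of 10 million users differ by only 0.03 percentage points in CVR, yet the test is clearly significant. All counts are hypothetical, chosen only to illustrate the point.

```python
import numpy as np
from scipy import stats

# Hypothetical experiment at very large scale
n_per_group = 10_000_000
conv_control = 500_000     # CVR = 5.00%
conv_treatment = 503_000   # CVR = 5.03%

p_a = conv_control / n_per_group
p_b = conv_treatment / n_per_group

# Pooled two-proportion z-test
p_pool = (conv_control + conv_treatment) / (2 * n_per_group)
se = np.sqrt(p_pool * (1 - p_pool) * 2 / n_per_group)
z = (p_b - p_a) / se
p_value = 2 * stats.norm.sf(abs(z))

relative_lift = (p_b - p_a) / p_a
print(f"Absolute lift: {(p_b - p_a) * 100:+.2f} pp, relative lift: {relative_lift:+.1%}")
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

The p-value says the difference is unlikely to be noise; it says nothing about whether a 0.6% relative lift justifies the cost of shipping and maintaining the change.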

Decision Framework

mermaid

flowchart TD
    A["📊 Test finished"] --> B{"p < α?"}
    B -->|No| C["❌ Not Significant<br/>Keep Control"]
    B -->|Yes| D{"Effect size<br/>large enough?<br/>(Practical significance)"}
    D -->|No| E["⚠️ Statistically sig<br/>but negligible effect<br/>Weigh the business impact"]
    D -->|Yes| F{"Guardrail metrics<br/>OK?"}
    F -->|No| G["❌ Guardrail violated<br/>Investigate before deploying"]
    F -->|Yes| H["✅ Deploy Treatment!<br/>Monitor post-launch"]

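📌 Part 4: Common Pitfalls (Peeking, Multiple Testing, Bias)

Pitfall 1: Peeking

Peeking = checking the p-value BEFORE the test has finished and stopping as soon as it dips below $α$. With repeated looks this inflates the false positive rate well above the nominal 5%; figures of 25-30% are commonly cited for frequent peeking.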

mermaid

flowchart LR
    A["Day 1: p = 0.28"] --> B["Day 3: p = 0.12"]
    B --> C["Day 5: p = 0.04 ⭐"]
    C --> D["🚨 STOP! Ship it!"]
    D --> E["❌ WRONG!<br/>p fluctuates early<br/>Wait for full duration"]
    C --> F["Day 7: p = 0.15"]
    F --> G["Day 10: p = 0.08"]
    G --> H["Day 14: p = 0.06<br/>Full duration → p > 0.05"]
    H --> I["✅ Correct: Not significant"]

Why is peeking dangerous?

Day	p-value	Decision if you peek	Correct decision
Day 5	0.04	"Significant! Ship!" ❌	Sample not yet complete → wait
Day 7	0.15	"Not significant..."	Duration not yet over → wait
Day 14	0.06	"Close to significant..."	Full duration → NOT significant → keep Control

Solution: Fix the duration BEFORE launching and do NOT act on the p-value until the full duration has elapsed. If you need early looks, use a method designed for them, such as sequential testing with alpha spending, or a Bayesian A/B framework with a pre-specified decision rule.
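
The inflation can be measured with a simulation. The sketch below runs A/A tests (no true difference) for 14 days and compares two policies: stopping at the first daily look with p < 0.05 versus testing once at the end. The daily traffic, number of simulations, and seed are illustrative assumptions.

```python
import numpy as np
from scipy import stats

def peeking_false_positive_rates(n_sims=2000, days=14, daily_users=1000,
                                 cvr=0.05, alpha=0.05, seed=1):
    """Return (FPR when stopping at the first significant daily look, FPR at full duration)."""
    rng = np.random.default_rng(seed)
    # Daily conversions per arm in an A/A test: any "win" is a false positive
    daily_a = rng.binomial(daily_users, cvr, size=(n_sims, days))
    daily_b = rng.binomial(daily_users, cvr, size=(n_sims, days))
    cum_a = daily_a.cumsum(axis=1)
    cum_b = daily_b.cumsum(axis=1)
    n = daily_users * np.arange(1, days + 1)  # cumulative users per arm on each day
    # Pooled two-proportion z-test evaluated at every daily look
    p_pool = (cum_a + cum_b) / (2 * n)
    se = np.sqrt(p_pool * (1 - p_pool) * 2 / n)
    z = (cum_b - cum_a) / n / se
    p_values = 2 * stats.norm.sf(np.abs(z))
    stop_at_first_hit = (p_values < alpha).any(axis=1)
    wait_full_duration = p_values[:, -1] < alpha
    return float(stop_at_first_hit.mean()), float(wait_full_duration.mean())

fpr_peek, fpr_full = peeking_false_positive_rates()
print(f"False positive rate with daily peeking: {fpr_peek:.1%}")
print(f"False positive rate at full duration:   {fpr_full:.1%}")
```

Waiting for the full duration keeps the false positive rate near 5%, while acting on the first significant daily look pushes it several times higher, even though nothing about the product changed.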

Pitfall 2: Multiple Testing (Bonferroni Correction)

Multiple testing = testing MANY variants or MANY metrics at once → inflated false positive rate.

If you test 20 metrics at $α = 0.05$ , you expect 1 metric to be "significant" purely by chance (5% × 20 = 1).

$$P(\text{at least 1 false positive}) = 1 - (1 - α)^{m}$$

With m = 20 independent metrics: $P = 1 - (1 - 0.05)^{20} = 1 - 0.358 = 64.2\%$ chance of at least 1 false positive!

Bonferroni Correction:

$$α_{\text{adjusted}} = \frac{α}{m}$$

Number of metrics	Original $α$	Bonferroni $α$	Meaning
1	0.05	0.050	Standard
3	0.05	0.017	Stricter (only p < 0.017 counts as significant)
5	0.05	0.010	Very strict
20	0.05	0.0025	Extremely strict: needs a very large sample
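
The two formulas above can be evaluated side by side. This minimal sketch, which assumes the m tests are independent, prints the family-wise error rate before and after the Bonferroni adjustment for the values of m in the table; the function name is illustrative.

```python
def family_wise_error_rate(alpha, m):
    """P(at least one false positive) across m independent tests, each run at level alpha."""
    return 1 - (1 - alpha) ** m

alpha = 0.05
for m in (1, 3, 5, 20):
    uncorrected = family_wise_error_rate(alpha, m)
    corrected = family_wise_error_rate(alpha / m, m)  # Bonferroni: each test at alpha / m
    print(f"m = {m:>2} | uncorrected FWER = {uncorrected:.1%} | "
          f"Bonferroni alpha = {alpha / m:.4f} | corrected FWER = {corrected:.1%}")
```

After the correction the family-wise rate stays just under 5% for every m, at the price of a stricter threshold for each individual metric.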

python

# Multiple Testing Correction
from scipy import stats
import numpy as np

# Suppose we test 5 metrics and obtain 5 p-values
p_values = {
    'CVR':              0.032,
    'Revenue/User':     0.048,
    'Bounce Rate':      0.280,
    'Time on Page':     0.008,
    'Pages per Session': 0.041
}

alpha = 0.05
n_tests = len(p_values)
alpha_bonferroni = alpha / n_tests

print("📊 MULTIPLE TESTING CORRECTION")
print("=" * 65)
print(f"  Number of tests: {n_tests}")
print(f"  Original α: {alpha}")
print(f"  Bonferroni α: {alpha_bonferroni:.4f}")
print()

for metric, p in p_values.items():
    sig_original = "✅ Sig" if p < alpha else "❌ Not"
    sig_bonferroni = "✅ Sig" if p < alpha_bonferroni else "❌ Not"
    print(f"  {metric:22s} | p = {p:.3f} | Original: {sig_original} | Bonferroni: {sig_bonferroni}")

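Pitfall 3: Simpson's Paradox in A/B Testing

Simpson's Paradox = a trend that holds within the subgroups weakens or reverses when the subgroups are combined, because the groups being compared have different segment mixes. A milder and more common relative is a blended result that hides opposite effects in different segments:

Segment	Control CVR	Treatment CVR	"Winner"
Mobile (80% of traffic)	3.0%	3.5%	Treatment ✅
Desktop (20% of traffic)	8.0%	7.0%	Control ✅
Blended	4.0%	4.2%	Treatment

Wait: why does the blended result say Treatment wins while Desktop says Control wins?

Because the blended CVR is a traffic-weighted average: 0.8 × 3.0% + 0.2 × 8.0% = 4.0% for Control and 0.8 × 3.5% + 0.2 × 7.0% = 4.2% for Treatment. Mobile carries 80% of the weight, so its gain masks the desktop regression. A full Simpson's reversal appears when the traffic mix differs between the two arms (for example, Treatment receives proportionally more low-converting mobile traffic): the blended comparison then reflects the mix shift rather than the treatment effect.

Solution: Always run a segment analysis (by platform, user type, country) before concluding, and verify that the segment mix is the same in both arms.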
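
A full reversal is easiest to see with numbers. In the sketch below, Treatment beats Control in both segments, yet loses on the blended CVR because it received mostly low-converting mobile traffic. The counts are hypothetical and deliberately extreme; in a properly randomized test such an imbalance would itself signal broken randomization.

```python
# Hypothetical (users, conversions) per arm and segment, with unequal traffic mix
data = {
    "control":   {"mobile": (2_000, 60),  "desktop": (8_000, 640)},
    "treatment": {"mobile": (8_000, 280), "desktop": (2_000, 180)},
}

def cvr(users, conversions):
    """Conversion rate for one (users, conversions) pair."""
    return conversions / users

def blended_cvr(arm):
    """Overall conversion rate of an arm, pooling all of its segments."""
    users = sum(u for u, _ in data[arm].values())
    conversions = sum(c for _, c in data[arm].values())
    return conversions / users

for segment in ("mobile", "desktop"):
    c = cvr(*data["control"][segment])
    t = cvr(*data["treatment"][segment])
    print(f"{segment:8s} | Control {c:.1%} | Treatment {t:.1%} | "
          f"winner: {'Treatment' if t > c else 'Control'}")

c_all, t_all = blended_cvr("control"), blended_cvr("treatment")
print(f"{'blended':8s} | Control {c_all:.1%} | Treatment {t_all:.1%} | "
      f"winner: {'Treatment' if t_all > c_all else 'Control'}")
```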

Pitfall 4: Selection Bias

Selection bias = the users in Control and Treatment are NOT comparable from the start.

Bias	Cause	Consequence
Day-of-week bias	Test launched on a Friday; weekend traffic ≠ weekday traffic	Effect confounded with seasonality
New vs returning	Treatment receives more new users	New users behave differently
Geo bias	Treatment runs in Ho Chi Minh City, Control in Hanoi	Regional differences ≠ treatment effect
Survivorship bias	Only users who complete the flow are analyzed	Dropout users are ignored

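Pitfall 5: Survivorship Bias in Experiments

Survivorship bias = analyzing only the users who "survive" in the experiment and ignoring those who drop out early.

Example: an A/B test of a 3-step onboarding flow (Control) vs a 5-step flow (Treatment). Treatment has a completion rate of 78% (vs 85% for Control), but users who complete Treatment have a Day 7 retention of 45% (vs 38% for Control).

The mistake: "Treatment is better! Retention is higher!" → NO. That comparison ignores the 22% of Treatment users (and the 15% of Control users) who dropped out, and the users who finish a longer flow are a self-selected, more motivated group. Compare ALL randomized users instead (intention-to-treat). If dropouts are counted as not retained, Treatment is 0.78 × 45% = 35.1% and Control is 0.85 × 38% = 32.3%: the apparent 7-point gap shrinks to under 3 points, and it can shrink further or even reverse if Control's dropouts come back more often than Treatment's.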
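
The intention-to-treat comparison for this example can be sketched as below. The completion and retention rates come from the example above; the retention rate of users who did not finish onboarding is not given in the lesson, so it is an explicit assumption (a parameter that defaults to 0), as is the 20% used in the sensitivity check.

```python
def itt_retention(completion_rate, retention_completers, retention_dropouts=0.0):
    """Day 7 retention over ALL randomized users (intention-to-treat).

    retention_dropouts is an assumption: the retention of users who did not finish onboarding.
    """
    return (completion_rate * retention_completers
            + (1 - completion_rate) * retention_dropouts)

# Rates from the onboarding example; dropouts assumed not retained
itt_control = itt_retention(completion_rate=0.85, retention_completers=0.38)
itt_treatment = itt_retention(completion_rate=0.78, retention_completers=0.45)

print(f"Completers only: Control 38.0% vs Treatment 45.0% (gap +7.0 pp)")
print(f"All users (ITT): Control {itt_control:.1%} vs Treatment {itt_treatment:.1%} "
      f"(gap {100 * (itt_treatment - itt_control):+.1f} pp)")

# Sensitivity check: assume 20% of Control dropouts still return on Day 7
itt_control_alt = itt_retention(0.85, 0.38, retention_dropouts=0.20)
print(f"If 20% of Control dropouts return: Control {itt_control_alt:.1%} "
      f"vs Treatment {itt_treatment:.1%}")
```

Under the second assumption Control edges ahead, which shows how sensitive the conclusion is to the users that a completers-only analysis throws away.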

Summary: A/B Testing Checklist

PRE-TEST:
✅ Define SMART hypothesis
✅ Choose primary metric + guardrail metrics
✅ Calculate sample size (MDE, power, significance)
✅ Estimate duration (whole business cycles; at least 7 days, ideally 14 days as in the duration example)
✅ Set up randomization (user-level, cookie-based)
✅ Document experiment plan BEFORE launching

DURING TEST:
✅ NO PEEKING at p-value!
✅ Monitor guardrail metrics (bugs, crashes, server errors)
✅ Check sample ratio mismatch (50/50 split still balanced?)
✅ Run for FULL pre-determined duration

POST-TEST:
✅ Run appropriate statistical test (chi-square / t-test)
✅ Check statistical significance (p < α)
✅ Check practical significance (effect size, business impact)
✅ Check guardrail metrics (no degradation)
✅ Segment analysis (mobile/desktop, new/returning, geo)
✅ Document results + learnings
✅ Deploy or iterate
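
The "sample ratio mismatch" item in the checklist can be checked with a chi-square goodness-of-fit test on the group sizes. In this sketch the function name, the example counts, and the 0.001 alert threshold are assumptions for illustration; a strict threshold is used because a flagged mismatch points to a broken randomization or logging pipeline rather than to a treatment effect.

```python
from scipy import stats

def srm_check(n_control, n_treatment, expected_ratio=0.5, threshold=0.001):
    """Chi-square goodness-of-fit test of observed group sizes against the planned split."""
    total = n_control + n_treatment
    expected = [total * expected_ratio, total * (1 - expected_ratio)]
    _, p_value = stats.chisquare([n_control, n_treatment], f_exp=expected)
    return p_value, bool(p_value < threshold)

# Hypothetical group sizes for a planned 50/50 split
for n_c, n_t in [(15_000, 15_000), (15_000, 14_850), (15_000, 14_200)]:
    p_value, mismatch = srm_check(n_c, n_t)
    status = "🚨 SRM detected, investigate before analyzing" if mismatch else "✅ split looks fine"
    print(f"Control {n_c:,} vs Treatment {n_t:,} | p = {p_value:.4g} | {status}")
```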

🔗 Connecting it all

A/B testing in the DA journey

Session	Skill	How it relates to A/B testing
Sessions 7-8	Python + Pandas	Code the analysis of experiment results
Session 9	EDA	Explore experiment data, check distributions
Sessions 10-11	Visualization + BI	Dashboards for experiment results
Session 12	Data Storytelling	Present experiment findings to stakeholders
Session 13	Business Metrics	North Star metric = primary metric of the experiment
Session 14	Industry Cases	A/B tests for Marketing (landing page), Product (features)
Session 15	A/B Testing	Design + analyze experiments

Checklist "A/B Testing Literacy"

✅ Understand Control vs Treatment, Null vs Alternative Hypothesis
✅ Understand what the p-value really means (not "the probability that B is better than A")
✅ Compute the sample size for an experiment (MDE, power, significance)
✅ Choose the appropriate statistical test (chi-square vs t-test)
✅ Analyze results: p-value, CI, effect size
✅ Distinguish statistical from practical significance
✅ Know what guardrail metrics are and why they are needed
✅ Avoid peeking, multiple testing, selection bias
✅ Run a segment analysis before concluding
✅ Document the experiment plan + results

📚 References

Title	Author	Main content
Trustworthy Online Controlled Experiments	Ron Kohavi, Diane Tang, Ya Xu	The "bible" of A/B testing, from Microsoft/Google
Statistical Methods in Online A/B Testing	Georgi Georgiev	Detailed statistics for A/B tests
Experimentation Works	Stefan Thomke	Harvard Business School: the culture of experimentation
Lean Analytics	Alistair Croll & Ben Yoskovitz	A/B testing in a startup context
Evan Miller's Sample Size Calculator	Evan Miller	Online sample size calculator (evanmiller.org)
Booking.com Experimentation Culture	Lukas Vermeer	25,000 experiments per year at Booking.com

🎯 Exercises and practice

Workshop: Design & analyze an A/B test (hypothesis, sample size, t-test, chi-square, conclusion)
Case Study: Netflix thumbnails, Booking.com experiments, Microsoft Bing $100M experiment
Mini Game: Experiment Lab, 5 A/B test scenarios, Gold ≥ 85 XP
Blog: Dũng's story, a Growth DA and an A/B test that went wrong from start to finish
Standards: DMAIC, CUPED, Reproducible Experiments
