🎮 Experiment Lab: Designing & Analyzing A/B Tests!

You have just been appointed Lead Experimentation Analyst at TestLab Inc. 🧪, a consultancy that advises startups and enterprises on A/B testing. Every week, clients send in experiment proposals and results: "Is this test designed correctly? Can the results be trusted? Should we deploy or hold?" There are 5 scenarios; in each one you either evaluate an experiment design or analyze results. A correct choice earns XP. A wrong choice means the client ships a broken feature, loses revenue, or misses an opportunity! 🚨

🎯 Learning objectives

After completing the game, you will be able to:

Evaluate experiment design: hypothesis, sample size, duration, randomization
Analyze test results: p-value, CI, effect size, practical significance
Spot pitfalls: peeking, multiple testing, Simpson's Paradox, selection bias
Make the call: deploy, hold, or redesign the experiment
Think end-to-end: from hypothesis to analysis to decision

📜 Rules of the game

┌──────────────────────────────────────────────────────┐
│  YOU = Lead Experimentation Analyst 🧪               │
│  CLIENTS = 5 companies that need A/B testing advice  │
│  EACH ROUND = 1 experiment scenario → 4 options      │
│  PICK the best-fitting insight + action → earn XP    │
│  GOAL = Collect ≥ 85 XP to reach Gold rank 🥇        │
└──────────────────────────────────────────────────────┘

How each round is scored:

Component	XP
Correct answer	+16 XP (Rounds 1–2), +18 XP (Rounds 3–4), +20 XP (Round 5)
Wrong answer	+0 XP
No hint used	+2 XP bonus ⚡
Correct reasoning explained	+3 XP bonus 🧠

Maximum base XP: 16+16+18+18+20 = 88 XP (bonuses not included)

Key principles:

🧪 Experiment design determines quality: a badly designed test produces meaningless results
📊 Statistical significance ≠ practical significance: p < 0.05 alone is not enough to deploy
🚨 Peeking, multiple testing, selection bias: the 3 biggest enemies of an A/B test

🏆 Leaderboard & Badges

Ranks

Rank	XP	Description
🥇 Gold: Experiment Master	≥ 85 XP	You design experiments like Netflix and analyze like Bing!
🥈 Silver: Experiment Apprentice	≥ 60 XP	Good! More practice needed: reread the pitfalls section.
🥉 Bronze: Test Beginner	≥ 35 XP	You grasp the basics but still have many gaps: review Session 15.
💀 Game Over	< 35 XP	The client lost money by shipping the wrong feature: go back and study everything!

Special badges

Badge	Condition	Description
🧪 Design Expert	Rounds 1 + 2 correct	Master experiment design!
📊 Analysis Pro	Rounds 3 + 4 correct	Statistical analysis on point!
🛡️ Pitfall Detector	Pitfall rounds correct (Rounds 2, 4, 5)	Spot the trap before you fall into it!
🔥 Full Streak	5 rounds correct in a row	Perfect experiment judgment!
🏆 Perfect Score	All 5 rounds correct	Experiment legend!
💡 No Hints Hero	No hints used all game	Pure experimentation intuition!

🎲 Tracking stats

┌─────────────────────────────┐
│  🧪 SCOREBOARD              │
│  ─────────────────────────  │
│  Current XP:     ___/88     │
│  Current round:  ___/5      │
│  Streak:         ___        │
│  Hints used:     ___        │
│  Design W/L:     ___        │
│  Analysis W/L:   ___        │
│  Projected rank: ___        │
└─────────────────────────────┘

🎮 START THE GAME!

🔵 ROUND 1: "How much sample is enough?" (+16 XP)

🏷️ Category: Experiment Design

Scenario:

The PM at ShopVN, an e-commerce platform with 50,000 daily visitors, wants to A/B test a new checkout page (simplified 2-step vs the current 4-step). Current checkout CVR = 8%. The PM wants to detect at least a 5% relative lift (i.e., a target CVR of 8.4%).

The PM asks you: "How much sample is enough? We get 50K visitors a day, so running for 2 days should do it, right?"

You calculate:

Parameter	Value
Baseline CVR	8%
MDE	5% relative → target CVR 8.4%
α	0.05 (two-sided)
Power	80%
Calculated n/group	~74,000
Total needed	~148,000
Daily traffic	50,000

4 options:

	Recommendation	Logic
A	"50K/day × 3 days = 150K > 148K. Three days is enough sample!"	Quick math, hit sample threshold
B	"Sample size ~148K → at least 3 days at full traffic. BUT the test must run at least 7-14 days to cover a full business cycle (weekday and weekend behavior differ). Recommend 14 days, reducing traffic allocation if needed."	Duration > sample size alone
C	"A 5% MDE is too small and needs too much sample. Raise the MDE to 20% to cut the sample size."	Relax requirements
D	"50K/day is too little for an A/B test. You need at least 500K/day. Wait for the platform to grow."	More traffic needed

💡 Hint (−2 XP)

Sample size is only one condition. An A/B test must cover a full business cycle: weekday vs weekend, start of month vs end of month. A test of 2 or 3 days sees only a slice of the week → biased. Duration ≥ 7 days (1 week minimum), ideally 14 days.

✅ Answer

B. The ~148K sample can be reached in about 3 days at full traffic, but a run that short is NOT enough. Reasons:

Day-of-week effect: Shopping behavior on Sat/Sun ≠ Mon-Fri. A 2-3 day test covers only part of the week → biased sample.
Paycheck effect: Start-of-month spending ≠ end-of-month spending.
Novelty effect: Users react differently to a NEW design → it takes time for behavior to normalize.

Correct approach: Run 14 days and allocate 50% of traffic → 25,000/day × 14 = 350K total (far above the ~148K minimum). Hitting the sample threshold is necessary but not sufficient; the duration requirement must be met as well.

C is WRONG because the 5% MDE is a business requirement: the PM needs to detect a 5% lift. Raising the MDE means ignoring smaller improvements that may still be valuable.

Lesson: The sample size calculation gives you the MINIMUM sample. The business cycle gives you the MINIMUM time. BOTH must be met. Never run a test shorter than 1 full business cycle.
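
The sample-size and duration reasoning in this round can be sketched in a few lines of Python. This is a minimal sketch, assuming the common normal-approximation formula for a two-sided two-proportion z-test with equal group sizes; the function names and the "round up to whole weeks" rule are illustrative choices, not something the scenario prescribes. With the scenario's inputs (8% baseline, 5% relative MDE, α = 0.05, 80% power) this approximation lands near 74,000 users per group; calculators that use other formulas or corrections will differ somewhat.

```python
import math
from statistics import NormalDist


def sample_size_per_group(baseline, relative_mde, alpha=0.05, power=0.80):
    """Per-group n for a two-sided two-proportion z-test (normal approximation)."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for two-sided alpha
    z_beta = NormalDist().inv_cdf(power)           # quantile matching the desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


def planned_duration_days(n_per_group, daily_visitors, allocation=1.0, min_days=14):
    """Days to run: enough for the sample AND at least min_days, in whole weeks."""
    days_for_sample = math.ceil(2 * n_per_group / (daily_visitors * allocation))
    days = max(days_for_sample, min_days)
    return math.ceil(days / 7) * 7  # whole weeks keep weekdays and weekends balanced


n = sample_size_per_group(baseline=0.08, relative_mde=0.05)
days_for_sample = math.ceil(2 * n / 50_000)  # at 100% traffic allocation
duration = planned_duration_days(n, daily_visitors=50_000, allocation=0.5)
print(f"n per group: {n}, days to reach sample: {days_for_sample}, planned run: {duration} days")
```

The point of the second function is that the sample requirement and the business-cycle requirement are combined with `max`, so whichever is longer wins.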

🔵 ROUND 2: "A/B test results: trust them or not?" (+16 XP)

🏷️ Category: Pitfall Detection (Peeking)

Scenario:

The Growth Lead at QuickChat, a messaging app with 200K DAU, is running an A/B test on notification frequency (3 notifications/day vs the current 5/day). Hypothesis: fewer notifications → less annoyance → higher Day 7 retention.

The Growth Lead sends you an email:

"Good news! Day 7, p-value = 0.04. Treatment (3 notifs) retention 52% vs Control (5 notifs) 49%. Statistically significant! I want to deploy right at the end of this week. Do you approve?"

You ask follow-up questions and find out:

Detail	Information
Planned duration	21 days
Actual duration so far	7 days
Planned sample size	80,000/group
Current sample	28,000/group
Times Growth Lead checked p-value	Every day since Day 3
Day 3 p-value	0.18
Day 5 p-value	0.09
Day 7 p-value	0.04

4 options:

	Recommendation	Logic
A	"p = 0.04 < 0.05, significant! Deploy now. Retention is a long-term metric, so the sooner we deploy the better."	Trust the p-value
B	"🚨 Peeking alert! The Growth Lead has checked the p-value every day since Day 3, and p drifted 0.18 → 0.09 → 0.04. With 5 peeks, the real false positive rate is well above the nominal 5% (up to ~23% if the looks were independent). The test MUST keep running to Day 21 as planned. Do not deploy."	Peeking invalidation
C	"28K/group < 80K target. More sample is needed. Run 7 more days = 14 days total."	Insufficient sample
D	"A retention metric needs at least 30 days. 7-day retention is not reliable."	Wrong metric timeframe

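💡 Hint (−2 XP)

The Growth Lead has checked the p-value 5 times (Day 3, 4, 5, 6, 7). Each check is one more chance to cross the threshold by luck. If the 5 checks at α = 0.05 were independent tests, the chance of at least one false positive would be:

$P(\text{at least 1 false positive}) = 1 - (1 - 0.05)^{5} \approx 22.6\%$

Repeated looks at accumulating data are positively correlated, so 22.6% is an upper bound; the true rate is lower, but still well above the nominal 5%. p = 0.04 may therefore be a false positive. This is the peeking problem, enemy #1 of A/B testing.

✅ Answer

B: Peeking invalidation. The Growth Lead peeked 5 times → the real false positive rate is far above 5% (up to ~23% under the independence approximation). p = 0.04 may be noise.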

Why B beats C and D:

C (insufficient sample) is true but does not identify the root cause: the main problem is PEEKING, not just sample size
D (retention timeframe) sounds reasonable, but Day 7 retention is a standard metric for messaging apps, so it is not the main problem

Action:

Do NOT deploy
Keep running until Day 21 (planned duration)
Do NOT check the p-value until Day 21
Day 21: analyze with the full sample → decide

Lesson: Peeking = flipping a coin repeatedly and stopping as soon as it lands heads. Planned duration = commitment. If you need to check early → use Sequential Testing / Group Sequential Design with an alpha spending function.
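
To see how much daily peeking inflates the error rate, one option is to simulate A/A tests (no true effect) and count how often any peek crosses the significance threshold. The sketch below is a simplified model, not the client's actual pipeline: it assumes equal daily traffic and represents each day's evidence as one standard-normal increment, so the z-statistic after `day` days is the running sum divided by `sqrt(day)`. The function name and parameters are hypothetical.

```python
import math
import random


def peeking_false_positive_rate(peek_days, n_sims=20_000, z_crit=1.96, seed=0):
    """Share of simulated A/A tests declared 'significant' at any of the peeks."""
    rng = random.Random(seed)
    last_day = max(peek_days)
    peeks = set(peek_days)
    hits = 0
    for _ in range(n_sims):
        cumulative = 0.0
        for day in range(1, last_day + 1):
            cumulative += rng.gauss(0.0, 1.0)  # one day's standardized difference under H0
            # Stop at the first peek that looks significant, as an eager analyst would.
            if day in peeks and abs(cumulative / math.sqrt(day)) > z_crit:
                hits += 1
                break
    return hits / n_sims


single_look = peeking_false_positive_rate([7])
daily_peeks = peeking_false_positive_rate([3, 4, 5, 6, 7])
print(f"one look on Day 7:     {single_look:.3f}")
print(f"peeking on Days 3-7:   {daily_peeks:.3f}")
print(f"independence bound:    {1 - 0.95 ** 5:.3f}")
```

Under this model a single look stays close to the nominal 5%, while the daily-peeking rate comes out clearly higher yet below the 22.6% figure, because consecutive looks share most of their data.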

🟡 ROUND 3: "Statistically significant, but should we deploy?" (+18 XP)

🏷️ Category: Statistical vs Practical Significance

Scenario:

The Product Manager at BookBuddy, a reading app with 500K MAU, A/B tests a new recommendation algorithm. The test was run properly: pre-registered hypothesis, correct sample size, 21-day duration, no peeking.

Results:

Metric	Control (Old Algorithm)	Treatment (New Algorithm)	Lift	p-value
Books Started/User/Month	2.34	2.38	+1.7%	0.012 ✅
Books Finished/User/Month	1.12	1.13	+0.9%	0.34
Reading Time/Day	22.1 min	22.3 min	+0.9%	0.28
App Opens/Day	3.1	3.1	+0.0%	0.95
Engineering cost to maintain new algo	n/a	+$8K/month server costs	n/a	n/a

The PM says: "Books Started is significant at p = 0.012! The new algorithm finds exactly the books users want to read. Deploy!"

4 options:

	Recommendation	Logic
A	"p = 0.012 for Books Started: deploy! Data doesn't lie."	Trust statistical significance
B	"Books Started +1.7% (0.04 books/user/month), but Books Finished, Reading Time, and App Opens are all NOT significant. The effect size is too small (0.04 books ≈ 1 extra book per 25 users). Plus $8K/month in extra cost. Practical impact is close to 0. Do NOT deploy."	Practical significance check
C	"Deploy but monitor for 30 days. If Books Finished follows upward → keep."	Conditional deploy
D	"The sample size may not be enough for Books Finished. Run for another 21 days."	More data needed

💡 Hint (−2 XP)

The primary metric (Books Started) is significant at p = 0.012, but the effect size is +0.04 books/user/month (2.34 → 2.38). Across 500K MAU that is 20,000 extra books started per month. None of the downstream metrics (Finished, Reading Time, App Opens) improved significantly. Add $8K/month in cost → negative ROI?

✅ Answer

B: Practical significance check. Analysis:

Statistical significance: ✅ p = 0.012 < 0.05. The difference is unlikely to be chance alone.

Practical significance: ❌

Effect: +0.04 books/user/month = 1 extra book per 25 users per month → negligible
Downstream: Books Finished, Reading Time, App Opens → NO significant improvement
Users do not finish more books → "Started" goes up but engagement does NOT
Cost: $8K/month extra → $96K/year
Revenue impact: close to 0 (users do not read more and do not pay more)

ROI = negative. $96K/year in cost for ~0 revenue uplift.

C (conditional deploy) is risky because it adds $8K/month in cost for an unclear benefit. D (more data) does not solve the problem: even with more data, the effect size stays negligible.

Lesson: With a large sample (500K MAU), a test can flag even very small differences as statistically significant. "Significant" ≠ "Important." Always ask: "How much $$ is the impact? How many users are actually affected?"
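
The practical-significance arithmetic above can be written as a small helper. This is only a sketch using the figures given in the scenario; the function and key names are hypothetical, and revenue per book is deliberately left out because the scenario does not provide it.

```python
def practical_impact(mau, lift_per_user, monthly_cost):
    """Translate a per-user lift into totals a business owner can judge."""
    extra_units = mau * lift_per_user
    return {
        "extra_books_started_per_month": extra_units,
        "users_per_extra_book": 1 / lift_per_user,
        "cost_per_extra_book_started": monthly_cost / extra_units,
        "annual_cost": monthly_cost * 12,
    }


impact = practical_impact(mau=500_000, lift_per_user=2.38 - 2.34, monthly_cost=8_000)
for name, value in impact.items():
    print(f"{name}: {value:,.2f}")
```

Read this way, the new algorithm costs about $0.40 for every additional book started, so it only pays off if an extra started book is worth more than that, which is hard to argue when Books Finished and Reading Time did not move.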

🟡 ROUND 4: "Results flip when segmented" (+18 XP)

🏷️ Category: Simpson's Paradox

Scenario:

The Head of Growth at FitTrack, a fitness app with 300K MAU, A/B tests a new onboarding flow (video tutorial vs the current text-based flow). The test looks sound: enough sample, enough duration.

Overall results:

Metric	Control (Text)	Treatment (Video)	Lift	p-value
Day 7 Retention	34.2%	36.8%	+7.6%	0.003 ✅
Sample size	45,000	45,000

Head of Growth: "Video onboarding wins! +7.6% retention, p = 0.003. Ship it!"

You run a segment analysis:

Segment	Control Retention	Treatment Retention	p-value	Control n	Treatment n
iOS (60% of Control)	38.0%	37.5%	0.58	27,000	~41,700
Android (40% of Control)	28.5%	28.0%	0.64	18,000	~3,300

Wait: BOTH segments show Treatment LOSING to Control. Yet Treatment wins overall?

4 options:

	Explanation + Action	Logic
A	"The overall result is significant; segment fluctuation is noise. Deploy Treatment."	Trust overall
B	"Simpson's Paradox! Treatment LOSES on both iOS AND Android, yet wins overall. Cause: the Treatment group received more iOS users (iOS retention is naturally higher). The traffic split is NOT balanced by platform → confounded result. Do NOT deploy. Investigate randomization."	Simpson's Paradox detection
C	"iOS and Android users behave differently. Deploy Treatment for iOS, keep Control for Android."	Segment-specific deploy
D	"The segment p-values are > 0.05 because the samples are smaller. Collect more data for each segment."	Underpowered segments

💡 Hint (−2 XP)

Treatment LOSES to Control in BOTH segments (iOS: 37.5% < 38.0%, Android: 28.0% < 28.5%) but WINS overall (36.8% > 34.2%). This is Simpson's Paradox: it occurs when a confounding variable (platform mix) differs between Control and Treatment. Check: Treatment has more iOS users → iOS retention is higher → this pulls the Treatment overall rate up.

✅ Answer

B: Simpson's Paradox. Let's verify with the math:

Control group:

iOS: 27,000 users × 38.0% = 10,260 retained
Android: 18,000 users × 28.5% = 5,130 retained
Total: 15,390 / 45,000 = 34.2% ✅

Treatment group, if the platform mix is DIFFERENT:

Suppose Treatment received 32,000 iOS + 13,000 Android (instead of Control's 27K/18K)
iOS: 32,000 × 37.5% = 12,000 retained
Android: 13,000 × 28.0% = 3,640 retained
Total: 15,640 / 45,000 = 34.8%

But the report says Treatment = 36.8%. Solving 37.5% × s + 28.0% × (1 − s) = 36.8% for the iOS share s gives s ≈ 92.6%, i.e. roughly 41,700 iOS and 3,300 Android users, so the Treatment group must be heavily skewed toward iOS.

Root cause: a randomization BUG. Treatment received far more iOS users. iOS users naturally retain better → this makes Treatment look better. But within EACH platform, Treatment is actually slightly WORSE (and neither per-platform difference is statistically significant).
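
The pooled-rate arithmetic behind the paradox can be checked with a short script. This is a minimal sketch built only from the retention rates and Control counts quoted in this round; the helper names are hypothetical, and `required_share` simply solves the weighted-average equation for the iOS share that would produce the reported 36.8%.

```python
def overall_rate(segments):
    """segments: list of (n_users, retention_rate); returns the pooled retention rate."""
    total = sum(n for n, _ in segments)
    retained = sum(n * rate for n, rate in segments)
    return retained / total


def required_share(rate_a, rate_b, target):
    """Share of segment A needed so that a mix of A and B averages to `target`."""
    return (target - rate_b) / (rate_a - rate_b)


control = overall_rate([(27_000, 0.380), (18_000, 0.285)])
# Treatment with the SAME 60/40 platform mix as Control: it would lose overall too.
treatment_balanced = overall_rate([(27_000, 0.375), (18_000, 0.280)])
# iOS share that Treatment needs in order to report 36.8% overall.
ios_share = required_share(0.375, 0.280, 0.368)
treatment_skewed = overall_rate(
    [(45_000 * ios_share, 0.375), (45_000 * (1 - ios_share), 0.280)]
)
print(f"Control overall:               {control:.3f}")
print(f"Treatment, balanced mix:       {treatment_balanced:.3f}")
print(f"iOS share implied by 36.8%:    {ios_share:.3f}")
print(f"Treatment, skewed mix:         {treatment_skewed:.3f}")
```

With a balanced mix, Treatment's pooled rate falls below Control's, matching the per-platform picture; only a mix of roughly 93% iOS reproduces the reported overall win.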

Action:

❌ Do NOT deploy
Investigate the randomization mechanism: run an SRM (Sample Ratio Mismatch) check by platform
Fix randomization → re-run the experiment
Stratify by platform when setting up the next test
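
An SRM check by platform, as listed in the actions above, can be sketched as a chi-square goodness-of-fit test on the Control and Treatment counts within each platform. This is a sketch under assumptions: the intended split is taken to be 50/50, the Treatment counts of about 41,700 iOS and 3,300 Android are illustrative values implied by the reported 36.8% overall retention, and the p < 0.001 alarm threshold is a commonly used strict convention rather than a fixed rule.

```python
import math


def srm_p_value(control_n, treatment_n, expected_treatment_share=0.5):
    """Chi-square goodness-of-fit test (1 degree of freedom) for a two-arm split."""
    total = control_n + treatment_n
    expected_t = total * expected_treatment_share
    expected_c = total - expected_t
    chi2 = ((control_n - expected_c) ** 2 / expected_c
            + (treatment_n - expected_t) ** 2 / expected_t)
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    return chi2, math.erfc(math.sqrt(chi2 / 2))


# (control_n, treatment_n) per slice; Treatment platform counts are illustrative.
counts = {
    "overall": (45_000, 45_000),
    "iOS": (27_000, 41_700),
    "Android": (18_000, 3_300),
}
srm = {name: srm_p_value(c, t) for name, (c, t) in counts.items()}
for name, (chi2, p) in srm.items():
    flag = "SRM!" if p < 0.001 else "ok"
    print(f"{name:8s} chi2={chi2:10.1f}  p={p:.3g}  {flag}")
```

The overall 45,000 vs 45,000 split passes, which is why the imbalance goes unnoticed until the check is repeated within each platform.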

C is wrong because Treatment trails Control in BOTH segments: there is no segment where Treatment does better.

Lesson: Always run segment analysis before deploying. The overall result can be confounded by unbalanced splits. Simpson's Paradox = the average hides the truth.

🔴 ROUND 5: "3 variants, 5 metrics, and a claim of 'significant'" (+20 XP)

🏷️ Category: Multiple Testing + Comprehensive Analysis

Scenario:

The Marketing Director at FreshMeal, a meal delivery app, runs an A/B/C test on 3 email subject line variants. Test duration is 14 days, with 20K users per variant. The Marketing Director sends you a report:

"Amazing results! We tested 3 subject lines and measured 5 metrics. Here are the results, with lots of significant findings!"

Metric	Control (A)	Variant B	p (B vs A)	Variant C	p (C vs A)
Open Rate	22.0%	24.1%	0.031 ✅	23.5%	0.082
Click Rate	4.2%	4.5%	0.180	4.8%	0.040 ✅
Purchase Rate	1.8%	2.0%	0.210	1.9%	0.350
Unsubscribe Rate	0.5%	0.6%	0.280	0.4%	0.150
Revenue/Email	3,600 VND	3,800 VND	0.320	4,100 VND	0.048 ✅

Marketing Director: "B wins on Open Rate (p=0.031), C wins on Click Rate (p=0.040) AND Revenue/Email (p=0.048). I want to deploy C: 2 significant metrics!"

4 options:

	Recommendation	Logic
A	"C has 2 significant results: deploy C."	Count significant metrics
B	"B wins on Open Rate: deploy B. Open Rate is top-of-funnel and matters most."	Prioritize top-funnel
C	"Multiple testing problem: 2 treatment-vs-control comparisons × 5 metrics = 10 comparisons (B vs A: 5, C vs A: 5). Bonferroni α = 0.05/10 = 0.005. NO metric has p < 0.005. All the 'significant' results may be false positives. Needed: (1) define 1 primary metric, (2) run a focused A/B (2 variants, not 3), (3) recalculate the sample size."	Multiple testing correction
D	"Purchase Rate, the most important metric, is NOT significant for either B or C. The experiment failed. Keep Control."	Focus on business metric

💡 Hint (−2 XP)

Total number of comparisons:

B vs A: 5 metrics = 5 tests
C vs A: 5 metrics = 5 tests
Total: 10 tests

Probability of at least 1 false positive, treating the 10 tests as independent: $P = 1 - (1 - 0.05)^{10} \approx 40.1\%$

Roughly a 40% chance that at least 1 metric comes out "significant" purely by chance (the metrics are correlated and share one control group, so this is an approximation)! Bonferroni correction: α = 0.05/10 = 0.005. The metrics with p = 0.031, 0.040, 0.048 are all ABOVE 0.005 → not significant after correction.

✅ Answer

C: Multiple testing correction. Detailed analysis:

Problem 1: 10 comparisons without correction

Comparison	p-value	α = 0.05	α = 0.005 (Bonferroni)
B Open Rate	0.031	✅	❌
B Click Rate	0.180	❌	❌
B Purchase	0.210	❌	❌
B Unsub	0.280	❌	❌
B Revenue	0.320	❌	❌
C Open Rate	0.082	❌	❌
C Click Rate	0.040	✅	❌
C Purchase	0.350	❌	❌
C Unsub	0.150	❌	❌
C Revenue	0.048	✅	❌

After Bonferroni: 0 significant results. None of the 3 "significant" metrics survives correction, so they cannot be distinguished from false positives.
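
The correction in the table can be reproduced programmatically. The sketch below applies Bonferroni to the ten p-values from the report and, as an addition not used in the text above, the Holm step-down procedure, which is a less conservative alternative that still controls the family-wise error rate. Function names are illustrative.

```python
p_values = {
    "B Open Rate": 0.031, "B Click Rate": 0.180, "B Purchase": 0.210,
    "B Unsub": 0.280, "B Revenue": 0.320,
    "C Open Rate": 0.082, "C Click Rate": 0.040, "C Purchase": 0.350,
    "C Unsub": 0.150, "C Revenue": 0.048,
}


def family_wise_error_rate(m, alpha=0.05):
    """Chance of at least one false positive across m independent tests."""
    return 1 - (1 - alpha) ** m


def bonferroni(p_values, alpha=0.05):
    """Names of comparisons whose p-value is below alpha / m."""
    threshold = alpha / len(p_values)
    return [name for name, p in p_values.items() if p < threshold]


def holm(p_values, alpha=0.05):
    """Holm step-down: compare the k-th smallest p-value (k from 0) with alpha / (m - k)."""
    m = len(p_values)
    rejected = []
    for k, (name, p) in enumerate(sorted(p_values.items(), key=lambda kv: kv[1])):
        if p > alpha / (m - k):
            break  # once one comparison fails, all larger p-values fail too
        rejected.append(name)
    return rejected


uncorrected = [name for name, p in p_values.items() if p < 0.05]
fwer = family_wise_error_rate(len(p_values))
print("significant at 0.05 (uncorrected):", uncorrected)
print(f"FWER for 10 independent tests: {fwer:.3f}")
print("significant after Bonferroni:", bonferroni(p_values))
print("significant after Holm:", holm(p_values))
```

Both corrections leave nothing standing here, because even the smallest p-value (0.031) is above 0.05/10 = 0.005.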

Problem 2: No primary metric pre-defined

The Marketing Director measured 5 metrics and cherry-picked the "significant" ones → p-hacking / HARKing behavior.

D also has some logic (Purchase Rate not significant = experiment failed), but C identifies the ROOT CAUSE (multiple testing) and proposes a fix (define a primary metric, fewer variants).

Recommended fix:

Define 1 primary metric (Revenue/Email)
Run an A/B test (2 variants: Control vs the best candidate from the exploratory run)
Calculate the sample size for 1 comparison, 1 metric
Pre-register the hypothesis
No peeking; run the full duration

Lesson: More variants × more metrics = more comparisons, and the chance of at least one false positive climbs quickly with each one. A/B testing works best as a focused test: 1 change, 1 primary metric, 2 variants. Anything more means you must adjust for multiple testing.

🏁 RESULTS

Add up your XP

Round 1 (Sample Size + Duration):   ___/16 XP    (+ bonus hint ___/2 + reasoning ___/3)
Round 2 (Peeking Detection):        ___/16 XP    (+ bonus hint ___/2 + reasoning ___/3)
Round 3 (Stat vs Practical Sig):    ___/18 XP    (+ bonus hint ___/2 + reasoning ___/3)
Round 4 (Simpson's Paradox):        ___/18 XP    (+ bonus hint ___/2 + reasoning ___/3)
Round 5 (Multiple Testing):         ___/20 XP    (+ bonus hint ___/2 + reasoning ___/3)
────────────────────────────────────────
TOTAL:  ___/88 XP    (max 113 XP with all bonuses)

Final leaderboard

Rank	XP	Description
🥇 Gold: Experiment Master	≥ 85 XP	You design experiments like Netflix and analyze like Microsoft Bing!
🥈 Silver: Experiment Apprentice	≥ 60 XP	Good! More practice on pitfalls needed: reread Part 4 of Session 15.
🥉 Bronze: Test Beginner	≥ 35 XP	You grasp the basics but fall into traps easily: review all of Session 15.
💀 Game Over	< 35 XP	You shipped a broken feature 3 times: go back and read all of Session 15!

📝 Knowledge recap

Round	Category	Answer	Lesson
1	Design	Duration ≥ 1 business cycle, NOT just sample threshold	Sample size + Duration BOTH must be met
2	Pitfall	Peeking invalidates the p-value → false positive rate far above 5% (up to ~23%)	Do NOT check the p-value before the full duration
3	Analysis	Statistically significant (p=0.012) but the effect is negligible → do NOT deploy	Always check practical significance + ROI
4	Pitfall	Simpson's Paradox: overall trend reverses in segments	Segment analysis BEFORE deploying
5	Pitfall	10 comparisons × α=0.05 → ~40% chance of a false positive	Multiple testing → Bonferroni correction

💡 Golden rules of A/B Testing

Design: Calculate the sample size + plan the duration BEFORE running the test
Peeking: Do not act on p-values before the full duration, unless the test was designed for sequential analysis
Significance: Statistical significance ≠ practical significance; ask "what is the impact in $$$?"
Segments: Always run segment analysis: the overall average can hide Simpson's Paradox
Multiple testing: 1 primary metric, 2 variants. Anything more → Bonferroni correction
