🏆 Standards: A/B Testing

These standards help you design and analyze A/B tests within a sound framework, reduce variance, and keep results reproducible: DMAIC for the process, CUPED for variance reduction, and Reproducible Experiments for transparency.

Overview of the standards for Session 15

Session 15 moves from Industry Case Studies (Session 14) to A/B Testing: experimentation design and analysis. A/B testing takes more than knowing what a p-value is. It needs a standard process, variance-reduction techniques, and reproducibility:

DMAIC: Define, Measure, Analyze, Improve, Control → the Six Sigma process applied to experiment design
CUPED: Controlled-experiment Using Pre-Experiment Data → Microsoft's variance-reduction technique, which detects smaller effects with the same sample size
Reproducible Experiments: standards for documentation, version control, and pre-registration → ensure an experiment can be reproduced and verified

📋 List of related standards

#	Standard	Organization / Author	Applied in Session 15
1	DMAIC	Six Sigma / Motorola / GE	A systematic process for experiment design
2	CUPED	Microsoft ExP (Deng et al., 2013)	Variance reduction for A/B tests → higher sensitivity
3	Reproducible Experiments	OSF / AEA RCT Registry	Transparency, pre-registration, reproducibility

1️⃣ DMAIC: Six Sigma for Experiment Design

Introduction

DMAIC (Define, Measure, Analyze, Improve, Control) is the process-improvement framework from Six Sigma, developed at Motorola in the 1980s and popularized by GE under CEO Jack Welch in the 1990s. DMAIC was originally built for manufacturing, but it maps well onto experiment design in A/B testing.

DMAIC keeps you from "jumping into a test" with steps missing: define the problem → measure the baseline → analyze the root cause → improve (run the experiment) → control (monitor after deployment).

The 5 DMAIC Steps for A/B Testing

Step	Name	Description	Output
D	Define	Identify the problem, goal, and scope	Problem statement, SMART hypothesis
M	Measure	Measure the baseline metric, identify data sources	Baseline CVR, sample size calculation
A	Analyze	Analyze the root cause, choose the change to test	Root cause analysis, treatment design
I	Improve	Run the A/B test, collect data	Experiment results, statistical analysis
C	Control	Deploy the winner, monitor, document	Deployment, monitoring dashboard, experiment report

DMAIC Applied to an A/B Test: Example

Problem: signup CVR (conversion rate) on the landing page is 3.2%, below the industry benchmark of 5%.

Phase	Activity	Output
Define	Problem: CVR 3.2% < benchmark 5%. Goal: raise CVR by ≥ 15% relative. Scope: landing page, desktop + mobile.	Hypothesis: "A simplified form (4 fields → 3 fields) will raise CVR by ≥ 15% relative"
Measure	Baseline CVR: 3.2% (30-day average). Daily traffic: 5,000. Current form: 4 fields (name, email, password, company).	Sample size: 14,000/group. Duration: at least 6 days to reach the sample size (2 × 14,000 users at 5,000/day), planned as 14 days to cover full weekly cycles. Primary: CVR. Guardrail: Revenue/User.
Analyze	Funnel analysis: 40% of users abandon at the "Company Name" field (not required, but it feels mandatory). Heatmap: users hesitate at the Company field.	Root cause: the "Company Name" field creates friction. Treatment: remove the Company field.
Improve	A/B test: Control (4 fields) vs Treatment (3 fields, Company removed). Run for 14 days. Result: Treatment CVR 3.85% vs Control 3.20%. p = 0.008.	Winner: Treatment. Lift +20.3% relative. Guardrails OK.
Control	Deploy Treatment to 100% of traffic. Monitor CVR in week 1, week 2, and week 4. Document the experiment in the wiki. Set an alert if CVR drops below 3.5%.	Deployment checklist. Monitoring dashboard. Experiment report.
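The sample size and duration in the Measure phase can be estimated with the usual normal-approximation formula for comparing two proportions. The sketch below is one possible calculation, not the method the example used: the significance level (α = 0.05, two-sided), the 80% power, and the function names `sample_size_per_group` and `min_days` are all assumptions. The example does not state its power or sidedness, so its 14,000 figure need not match; under the assumed settings the formula asks for a noticeably larger group (about 22,600).

```python
import math
from statistics import NormalDist


def sample_size_per_group(baseline, relative_mde, alpha=0.05, power=0.80):
    """Approximate users per group for a two-sided two-proportion z-test.

    alpha and power are assumed defaults, not values taken from the example.
    """
    p1 = baseline
    p2 = baseline * (1 + relative_mde)              # CVR if the MDE is achieved
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # critical value, two-sided
    z_beta = NormalDist().inv_cdf(power)            # quantile for the target power
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance_sum / (p2 - p1) ** 2
    return math.ceil(n)


def min_days(n_per_group, daily_traffic, groups=2):
    """Days needed to collect n_per_group users in every group."""
    return math.ceil(groups * n_per_group / daily_traffic)


n = sample_size_per_group(baseline=0.032, relative_mde=0.15)
print("users per group:", n)
print("minimum days at 5,000 users/day:", min_days(n, daily_traffic=5_000))
```

Whatever number comes out, the duration should still be rounded up to whole business cycles, as the checklist below requires.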

DMAIC Checklist for A/B Testing

DEFINE:
✅ Clear problem statement (which metric is underperforming?)
✅ SMART hypothesis (specific, measurable, actionable, relevant, time-bound)
✅ Success criteria (MDE = minimum detectable effect, significance level)
✅ Stakeholder alignment (PM, Engineering, and Design agree to the test)

MEASURE:
✅ Baseline metric (30-day average, not a 1-day snapshot)
✅ Sample size calculation (formula or tool)
✅ Duration estimate (covers ≥ 1 business cycle)
✅ Data pipeline verified (tracking events fire correctly)

ANALYZE:
✅ Root cause analysis (why is the metric underperforming?)
✅ Treatment design (what changes, and why?)
✅ Confounding variables identified and controlled

IMPROVE:
✅ Randomization setup (user-level, cookie-based)
✅ Guardrail metrics defined
✅ NO-peeking commitment
✅ Statistical test pre-selected (chi-square / t-test)

CONTROL:
✅ Deployment plan (gradual rollout: 10% → 50% → 100%)
✅ Monitoring dashboard (primary + guardrails)
✅ Rollback plan (if post-deploy metrics deteriorate)
✅ Experiment documented (wiki, Notion, internal tool)
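The "Randomization setup" item is often implemented as a deterministic hash of the user (or cookie) ID, so the same user always sees the same variant. The following is a minimal sketch under assumptions: the function name `assign_variant`, the use of SHA-256, and the ID format are illustrative choices, not something this checklist prescribes.

```python
import hashlib


def assign_variant(user_id: str, experiment_id: str, treatment_share: float = 0.5) -> str:
    """Deterministically map a user to 'control' or 'treatment'.

    Hashing the experiment ID together with the user ID keeps assignments
    independent across experiments (assumed design choice).
    """
    key = f"{experiment_id}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000   # value in [0, 1)
    return "treatment" if bucket < treatment_share else "control"


# Hypothetical user IDs, used only to show that the split is roughly even.
counts = {"control": 0, "treatment": 0}
for i in range(10_000):
    counts[assign_variant(f"user-{i}", "EXP-2026-015")] += 1
print(counts)
```

Because the assignment depends only on the IDs, it can be recomputed later during analysis or an audit without storing a separate assignment table.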

Pros & Cons of DMAIC for A/B Testing

Pros	Cons
✅ Structured: no important steps get missed	❌ Can be over-engineering for simple tests
✅ Pre-analysis: understand the root cause before testing	❌ Slow: DMAIC needs 1-2 weeks of prep before the test
✅ Post-deployment monitoring: no "ship and forget"	❌ Rigid: a poor fit for rapid experimentation (Booking.com style)
✅ Documentation: every experiment has a paper trail	❌ Bureaucracy risk: if too strict, it lowers experiment velocity

💡 When to use DMAIC?

High-stakes experiments: testing changes to pricing, core flow, or payment → use full DMAIC
Low-stakes experiments: testing button color or a copy change → DMAIC lite (Define + Improve + Control)
Rapid experimentation: Booking.com style → simplified DMAIC, automated platform, skip the heavy Analyze phase

2️⃣ CUPED: Variance Reduction for A/B Tests

Introduction

CUPED (Controlled-experiment Using Pre-Experiment Data) is a variance-reduction technique developed at Microsoft (Alex Deng et al., 2013, WSDM). It is one of the most important innovations in online experimentation and has been adopted by Microsoft, LinkedIn, Netflix, Uber, and Airbnb.

The problem CUPED solves: an A/B test needs a large sample to detect a small effect. With limited traffic you either raise the MDE (and miss small effects) or run the test longer. CUPED reduces the variance of the metric, so the same sample size detects a smaller effect. Depending on how strongly the metric correlates with pre-experiment data, this is often equivalent to a sample size increase of 50-80% or more, with no extra traffic.

The CUPED Principle

Core idea: the difference observed in an A/B test metric mixes two components:

Treatment effect: the real difference between A and B (what you want to detect)
User-level noise: each user behaves differently (what you want to REDUCE)

CUPED uses pre-experiment data (data collected BEFORE the experiment starts) to adjust for user-level noise.

Formula:

$$\hat{Y}_{cuped} = \hat{Y} - \theta \cdot (X - \bar{X})$$

Where:

$\hat{Y}$ = the metric mean during the experiment
$X$ = the pre-experiment covariate (e.g. the user's value of the same metric BEFORE the test starts)
$\bar{X}$ = the mean of the covariate
$\theta = \frac{Cov(Y, X)}{Var(X)}$ = the adjustment coefficient

Variance reduction:

$$Var(\hat{Y}_{cuped}) = Var(\hat{Y}) \cdot (1 - \rho^{2})$$

where $\rho$ is the correlation between the pre-experiment and in-experiment metric.
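The $(1 - \rho^{2})$ factor follows from a short derivation. The sketch below treats $\bar{X}$ as a constant and writes $Cov(\hat{Y}, X)$ for the covariance that appears in the adjustment coefficient; both are simplifying assumptions made for illustration.

```latex
\begin{aligned}
Var(\hat{Y}_{cuped})
  &= Var\big(\hat{Y} - \theta (X - \bar{X})\big) \\
  &= Var(\hat{Y}) - 2\theta\, Cov(\hat{Y}, X) + \theta^{2}\, Var(X) \\
  &= Var(\hat{Y}) - \frac{Cov(\hat{Y}, X)^{2}}{Var(X)}
     \qquad \text{with } \theta = \frac{Cov(\hat{Y}, X)}{Var(X)} \\
  &= Var(\hat{Y}) \left(1 - \frac{Cov(\hat{Y}, X)^{2}}{Var(\hat{Y})\, Var(X)}\right)
   = Var(\hat{Y}) \cdot (1 - \rho^{2})
\end{aligned}
```

The second line is a quadratic in the adjustment coefficient, and the value $Cov / Var(X)$ is exactly the one that minimizes it, which is why that coefficient is used.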

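How does CUPED work?

Example: an A/B test measures Revenue per User over 14 days.

User A: pre-experiment spend $50/week, in-experiment spend $55 (Treatment)
User B: pre-experiment spend $10/week, in-experiment spend $12 (Treatment)

WITHOUT CUPED: the gap between User A ($55) and User B ($12) is very large, so the variance is high and the treatment effect is hard to detect.

WITH CUPED: adjust by pre-experiment behavior. User A was already spending $50, so most of the $55 is natural behavior. Taking θ = 1 for simplicity, the adjustment reduces each user to the change from their own baseline: $55 - $50 = $5 for User A and $12 - $10 = $2 for User B. Variance drops because the baseline differences are removed. The treatment effect itself is still estimated by comparing the adjusted means of Treatment and Control.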
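The adjustment can be tried on simulated data. This is a minimal sketch using only the standard library: the spend distribution, the true effect of 1.0, the alternating assignment, and the helper names `cuped_adjust` and `diff_in_means` are all made up for illustration, and the coefficient is estimated on the pooled data of both groups (an assumed but common choice).

```python
import random
from statistics import fmean, variance


def cuped_adjust(y, x):
    """Return (adjusted values y_i - theta * (x_i - mean(x)), theta)."""
    x_bar, y_bar = fmean(x), fmean(y)
    cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (len(x) - 1)
    theta = cov_xy / variance(x)
    return [yi - theta * (xi - x_bar) for xi, yi in zip(x, y)], theta


rng = random.Random(42)
n_users = 10_000
true_effect = 1.0                                      # simulated lift for Treatment
is_treatment = [i % 2 == 1 for i in range(n_users)]    # alternate users between groups
pre = [rng.gauss(30, 15) for _ in range(n_users)]      # pre-experiment spend
post = [0.9 * x + rng.gauss(3, 8) + (true_effect if t else 0.0)
        for x, t in zip(pre, is_treatment)]            # in-experiment spend

adjusted, theta = cuped_adjust(post, pre)


def diff_in_means(values):
    """Treatment mean minus Control mean."""
    treat = [v for v, t in zip(values, is_treatment) if t]
    ctrl = [v for v, t in zip(values, is_treatment) if not t]
    return fmean(treat) - fmean(ctrl)


print("variance raw / adjusted:", variance(post), variance(adjusted))
print("effect estimate raw / adjusted:", diff_in_means(post), diff_in_means(adjusted))
```

Both estimates target the same treatment effect; the adjusted one simply fluctuates less from sample to sample because the baseline spread has been removed.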

Impact of CUPED

Metric	Typical $\rho$ (pre vs in-experiment)	Variance Reduction	Equivalent Sample Size Increase
Revenue per User	0.6-0.8	36-64%	56-178% (extra traffic for free)
Sessions per User	0.7-0.85	49-72%	96-260%
Conversion Rate	0.3-0.5	9-25%	10-33%
Time on Site	0.5-0.7	25-49%	33-96%

Practical example:

An A/B test needs 100,000 users/group (at the baseline MDE)
With CUPED ($\rho = 0.7$): $Var_{cuped} = Var \times (1 - 0.49) = 0.51 \times Var$
Equivalent: only 51,000 users/group are needed, or the same sample detects an effect about 29% smaller (the MDE scales with $\sqrt{0.51} \approx 0.71$)
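The figures in the table and in the example above all follow from the correlation alone. A small helper makes the arithmetic explicit; the name `cuped_impact` and the dictionary keys are hypothetical.

```python
def cuped_impact(rho: float) -> dict:
    """Translate a pre/in-experiment correlation into CUPED's expected gains."""
    variance_reduction = rho ** 2          # share of variance removed
    remaining = 1 - variance_reduction     # share of variance left
    return {
        "variance_reduction": variance_reduction,
        "equivalent_sample_increase": variance_reduction / remaining,
        "sample_size_multiplier": remaining,   # required n shrinks by this factor
        "mde_multiplier": remaining ** 0.5,    # detectable effect shrinks by this factor
    }


impact = cuped_impact(0.7)
print("users per group needed:", round(100_000 * impact["sample_size_multiplier"]))
print("MDE multiplier:", round(impact["mde_multiplier"], 3))
```

The equivalent sample increase grows quickly as the correlation approaches 1, which is why metrics such as revenue and sessions benefit far more than conversion rate.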

When to use CUPED

Good fit	Poor fit
✅ The metric correlates strongly with pre-experiment data (revenue, sessions)	❌ New users (no pre-experiment data)
✅ Low-traffic platform that needs more sensitivity	❌ A new metric that has never been measured (no historical data)
✅ Detecting small effects (1-3% lift) that the business cares about	❌ Low correlation ($\rho$ < 0.3) → small improvement

⚠️ CUPED prerequisites

Pre-experiment data must be available: ≥ 2-4 weeks of historical data per user
The covariate must NOT be affected by the treatment: use only data from BEFORE the experiment starts
The correlation must be high enough ($\rho > 0.3$) for CUPED to have a meaningful impact

3️⃣ Reproducible Experiments: Transparency & Pre-registration

Introduction

Reproducible Experiments is a set of practices that make sure an A/B test can be reproduced, verified, and audited by others. It grew out of the Replication Crisis in science: more than half of the published psychology studies re-run in a large replication project failed to replicate (Open Science Collaboration, 2015).

Tech has the same problem: a data analyst (DA) runs an A/B test and reports a "significant result", but six months later nobody has verified it, nobody knows which parameters were used, and nobody can reproduce the analysis.

3 pillars of Reproducible Experiments:

Pillar	Description	Tool/Platform
Pre-registration	Register the hypothesis + analysis plan BEFORE running the test	AEA RCT Registry, OSF, internal wiki
Version Control	Code, config, data → tracked in Git	GitHub, GitLab, internal repo
Automated Reporting	Automated analysis pipeline, no manual calculations	Jupyter Notebooks, experimentation platform

Pre-registration: Why does it matter?

Pre-registration = an experiment plan documented BEFORE the experiment runs, covering:

Element	Pre-registered	Why it is needed
Hypothesis	"Simplified form raises CVR by ≥ 15% relative"	Prevents HARKing (Hypothesizing After Results Known)
Primary metric	CVR	Prevents cherry-picking a metric after seeing the results
Sample size	14,000/group	Prevents running "until it's significant"
Duration	14 days	Prevents early stopping / peeking
Analysis plan	Chi-square test, α = 0.05	Prevents switching tests after seeing the results
Guardrails	Revenue/User, Bounce Rate	Documented upfront, not added afterwards
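A pre-registered analysis plan such as "Chi-square test, α = 0.05" can be written down as code before any data arrives. The sketch below implements a Pearson chi-square test on a 2×2 table with one degree of freedom and no continuity correction, using only the standard library. The function name `chi_square_2x2` and the conversion counts are hypothetical and are not the results of the example experiment.

```python
import math

ALPHA = 0.05  # significance level fixed in the analysis plan


def chi_square_2x2(conv_a, n_a, conv_b, n_b):
    """Pearson chi-square test (1 df, no continuity correction) for two conversion rates."""
    observed = [[conv_a, n_a - conv_a], [conv_b, n_b - conv_b]]
    total = n_a + n_b
    row_totals = [n_a, n_b]
    col_totals = [conv_a + conv_b, total - conv_a - conv_b]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts: 10,000 users per group, 320 vs 385 conversions.
chi2, p = chi_square_2x2(conv_a=320, n_a=10_000, conv_b=385, n_b=10_000)
print("chi2:", round(chi2, 2), "p:", round(p, 4), "significant:", p < ALPHA)
```

Committing a script like this alongside the pre-registration makes it hard to switch tests quietly once the results are visible.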

HARKing (Hypothesizing After Results Known), a common mistake:

❌ HARKing flow:
1. Run the test → collect data
2. Look at the results → "Oh, CVR is not significant but Bounce Rate is!"
3. Write the report: "Hypothesis: Treatment reduces Bounce Rate" (FALSE: this was not the original hypothesis)
4. p-value for Bounce Rate = 0.04 → "SIGNIFICANT!"
5. In reality this is a multiple testing problem → likely false positive

✅ Pre-registered flow:
1. Document the hypothesis: "Treatment raises CVR by ≥ 15%"
2. Run the test → collect data
3. Analyze according to plan: CVR p = 0.12 → NOT significant
4. Report: "Hypothesis not supported. Observed: CVR +8% but not significant."
5. Note: "Bounce Rate decreased (p=0.04): exploratory finding, needs a follow-up test"
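The multiple testing problem in step 5 of the HARKing flow can be quantified. Assuming the metrics are independent and none of them is truly affected (both simplifying assumptions), the chance that at least one looks "significant" grows quickly with the number of metrics inspected. The function name `family_wise_error_rate` is illustrative.

```python
def family_wise_error_rate(num_metrics: int, alpha: float = 0.05) -> float:
    """P(at least one false positive) over independent tests when every null is true."""
    return 1 - (1 - alpha) ** num_metrics


for k in (1, 3, 5, 10):
    print(f"{k} metrics checked -> {family_wise_error_rate(k):.1%} chance of a false positive")
```

With ten metrics checked after the fact, the chance of at least one spurious hit is about 40%, which is why a finding like the Bounce Rate result belongs in the exploratory notes and not in the headline conclusion.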

Experiment Documentation Template

markdown

# EXPERIMENT REPORT: [Experiment Name]

## 1. Pre-registration
- **Date registered:** YYYY-MM-DD
- **Experiment ID:** EXP-2026-015
- **Owner:** [Name, Team]

## 2. Hypothesis
- **H₀:** [Null hypothesis]
- **H₁:** [Alternative hypothesis]
- **MDE:** [Minimum Detectable Effect]

## 3. Design
- **Type:** A/B (2 variants) / A/B/C (3 variants)
- **Randomization:** User-level, cookie-based hash
- **Sample size:** [n per group]
- **Duration:** [days]
- **Traffic allocation:** 50/50

## 4. Metrics
- **Primary:** [1 metric]
- **Guardrails:** [2-3 metrics]
- **Statistical test:** Chi-square / t-test / Mann-Whitney

## 5. Results
- **Control:** [metric value]
- **Treatment:** [metric value]
- **Lift:** [absolute and relative]
- **p-value:** [value]
- **95% CI:** [lower, upper]
- **Effect size:** [Cohen's d or relative]

## 6. Decision
- **Deploy / Don't deploy / Need more data**
- **Rationale:** [Why]

## 7. Post-deployment Monitoring
- **Week 1:** [metrics]
- **Week 4:** [metrics]
- **Anomalies:** [if any]

## 8. Learnings
- [What did we learn, even if test failed?]
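One way to make the pre-registration section of this template auditable is to store the plan as structured data and record a fingerprint of it before the test starts. The following is a minimal sketch under assumptions: the class `PreRegistration`, its field names, and the idea of publishing a SHA-256 hash in the wiki or repository are illustrative, not a requirement of any registry mentioned here.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PreRegistration:
    """Immutable record of the plan agreed before the experiment runs."""
    experiment_id: str
    hypothesis: str
    primary_metric: str
    guardrails: tuple
    sample_size_per_group: int
    duration_days: int
    statistical_test: str
    alpha: float

    def fingerprint(self) -> str:
        """SHA-256 of the canonical JSON form; changes if any field changes."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


plan = PreRegistration(
    experiment_id="EXP-2026-015",
    hypothesis="Simplified form raises CVR by >= 15% relative",
    primary_metric="CVR",
    guardrails=("Revenue/User", "Bounce Rate"),
    sample_size_per_group=14_000,
    duration_days=14,
    statistical_test="chi-square",
    alpha=0.05,
)
print("pre-registration fingerprint:", plan.fingerprint()[:12])
```

If the fingerprint is recorded on the registration date, any later edit to the hypothesis, metric, or duration produces a different hash and is therefore easy to spot in review.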

Pros & Cons of Reproducible Experiments

Pros	Cons
✅ Prevents HARKing, p-hacking, cherry-picking	❌ Overhead: every experiment needs documentation
✅ Institutional memory: a searchable experiment database	❌ Slows down rapid experimentation
✅ Auditability: stakeholders can verify results	❌ Culture change: the team needs to buy in
✅ Learning from failures: documented failed experiments = knowledge	❌ Not foolproof: people can still cheat

🔗 Combining the 3 Frameworks

The three frameworks serve three different aspects of A/B testing and complement each other:

Layer	Framework	Aspect	The DA uses it when
Process	DMAIC	End-to-end experiment process	Designing an experiment systematically
Statistics	CUPED	Variance reduction	Increasing sensitivity, detecting small effects
Governance	Reproducible Experiments	Transparency & documentation	Making sure results are trustworthy

Combined example:

DMAIC → Define problem, Measure baseline, Analyze root cause
   ↓
Design experiment → Pre-register hypothesis (Reproducible)
   ↓
Run experiment → Apply CUPED for variance reduction
   ↓
Analyze → Report results in documented template (Reproducible)
   ↓
Control → Monitor post-deployment (DMAIC)

Checklist for the DA when applying the standards

✅ DMAIC: Does every experiment have all 5 steps (D-M-A-I-C)?
✅ CUPED: Is pre-experiment data available? Is ρ > 0.3? Is variance reduction applied?
✅ Reproducible: Is the hypothesis pre-registered? Is the analysis plan documented?
✅ Reproducible: Is the code under version control? Are the results reproducible?
✅ Overall: Is the experiment report documented in the standard template?

📚 References

Source	Author/Organization	Year	Content
The Six Sigma Way	Peter Pande, Robert Neuman, Roland Cavanagh	2000	DMAIC framework reference
"Improving the Sensitivity of Online Controlled Experiments by Utilizing Pre-Experiment Data"	Alex Deng et al. (Microsoft)	2013	Original CUPED paper (WSDM 2013)
Trustworthy Online Controlled Experiments	Ron Kohavi, Diane Tang, Ya Xu	2020	Reproducible experiments, OEC, ExP platform
"Estimating the Reproducibility of Psychological Science"	Open Science Collaboration	2015	Replication crisis → motivation for pre-registration
OSF Pre-registration	Center for Open Science	Ongoing	Platform for experiment pre-registration
"Democratizing Online Controlled Experiments at Booking.com"	Raphael Lopez Kaufman, Jegar Pitchforth, Lukas Vermeer	2017	Experimentation culture + platform design
