Appearance
🛠 Workshop — Analysis Sprint: Từ Raw Data đến Dashboard
Chạy sprint phân tích hoàn chỉnh: Data Cleaning → EDA → Answer 5 Business Questions → Build Dashboard 3 pages → Peer Review → Executive Summary Draft. Output: Jupyter Notebook + Dashboard v1 + 1-page Summary!
🎯 Mục tiêu workshop
Sau khi hoàn thành workshop này, bạn sẽ:
- Clean data — xử lý missing, duplicates, types cho capstone dataset
- EDA hoàn chỉnh — distributions, correlations, segment analysis
- Answer 5 business questions — mỗi question có evidence + finding
- Build dashboard — 3 pages: Overview, Deep Dive, Recommendations
- Peer review — nhận và cho feedback có cấu trúc
- Draft executive summary — 5 findings theo What → So What → Now What
🧰 Yêu cầu
| Yêu cầu | Chi tiết |
|---|---|
| Kiến thức | Đã hoàn thành Buổi 18 (Data Collection + Data Audit) + Buổi 19 lý thuyết |
| Data | Capstone dataset đã thu thập từ Buổi 18 |
| Tools | Python (pandas, matplotlib, seaborn, scipy), Power BI / Tableau |
| Thời gian | 120–150 phút (sprint time-boxed) |
| Output | Jupyter Notebook + Dashboard v1 + Executive Summary Draft |
💡 Naming convention
- Notebook:
HoTen_Buoi19_CapstoneEDA.ipynb - Dashboard:
HoTen_Buoi19_Dashboard.pbix(hoặc Tableau workbook) - Summary:
HoTen_Buoi19_ExecSummary.md(hoặc .docx)
⏰ Sprint Timeline
┌──────────────────────────────────────────────────────────────┐
│ ⏰ SPRINT TIMELINE — 150 PHÚT │
│ │
│ ┌────────────────┐ 00:00 — 00:15 Sprint Planning │
│ │ 🎯 PLAN │ Review data audit, confirm 5 questions │
│ └────────────────┘ │
│ │
│ ┌────────────────┐ 00:15 — 00:45 Data Cleaning │
│ │ 🧹 CLEAN │ Missing values, duplicates, types │
│ └────────────────┘ │
│ │
│ ┌────────────────┐ 00:45 — 01:30 EDA + Business Questions │
│ │ 🔍 ANALYZE │ Distributions, correlations, Q1—Q5 │
│ └────────────────┘ │
│ │
│ ┌────────────────┐ 01:30 — 02:00 Dashboard Build │
│ │ 📊 DASHBOARD │ 3 pages: Overview, Deep Dive, Insights │
│ └────────────────┘ │
│ │
│ ┌────────────────┐ 02:00 — 02:20 Peer Review │
│ │ 👥 REVIEW │ Present 3 min, feedback 5 min, fix 12 │
│ └────────────────┘ │
│ │
│ ┌────────────────┐ 02:20 — 02:30 Executive Summary │
│ │ 📝 SUMMARY │ 5 findings, 3 recommendations │
│ └────────────────┘ │
└──────────────────────────────────────────────────────────────┘⚠️ Time-box nghiêm ngặt!
Nếu 30 phút mà cleaning chưa xong → dừng, tạm accept "good enough" cleaning, move on. Sprint = progress > perfection. Bạn có thể quay lại polish sau.
Phần 1: Sprint Planning (15 phút)
Bước 1.1: Review Data Audit
Mở Data Audit từ Buổi 18 và fill bảng:
┌──────────────────────────────────────────────────────────────┐
│ 📋 SPRINT PLANNING — DATA REVIEW │
│ │
│ Dataset: ________________________ │
│ Rows: ________ Columns: ________ │
│ Date Range: ________________________ │
│ Source: ________________________ │
│ │
│ Key Columns (top 10): │
│ 1. _____________ (type: _____, missing: ___%) │
│ 2. _____________ (type: _____, missing: ___%) │
│ 3. _____________ (type: _____, missing: ___%) │
│ 4. _____________ (type: _____, missing: ___%) │
│ 5. _____________ (type: _____, missing: ___%) │
│ 6. _____________ (type: _____, missing: ___%) │
│ 7. _____________ (type: _____, missing: ___%) │
│ 8. _____________ (type: _____, missing: ___%) │
│ 9. _____________ (type: _____, missing: ___%) │
│ 10. ____________ (type: _____, missing: ___%) │
└──────────────────────────────────────────────────────────────┘Bước 1.2: Confirm 5 Business Questions
┌──────────────────────────────────────────────────────────────┐
│ 📋 5 BUSINESS QUESTIONS │
│ │
│ Q1 (Descriptive): ____________________________________ │
│ Analysis method: ____________________________________ │
│ │
│ Q2 (Diagnostic): ____________________________________ │
│ Analysis method: ____________________________________ │
│ │
│ Q3 (Comparative): ____________________________________ │
│ Analysis method: ____________________________________ │
│ │
│ Q4 (Predictive/Diagnostic): ____________________________ │
│ Analysis method: ____________________________________ │
│ │
│ Q5 (Prescriptive): ____________________________________ │
│ Analysis method: ____________________________________ │
└──────────────────────────────────────────────────────────────┘Phần 2: Data Cleaning (30 phút)
Bước 2.1: Setup & Load
python
# ============================================
# CAPSTONE ANALYSIS SPRINT
# Author: [Your Name]
# Date: [Today's Date]
# Dataset: [Your Dataset Name]
# ============================================
# Table of Contents:
# 1. Setup & Load Data
# 2. Data Cleaning
# 3. EDA — Univariate
# 4. EDA — Bivariate
# 5. Q1: [Your Question]
# 6. Q2: [Your Question]
# 7. Q3: [Your Question]
# 8. Q4: [Your Question]
# 9. Q5: [Your Question]
# 10. Summary of Findings
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
# Settings
plt.rcParams['figure.figsize'] = (12, 6)
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('Set2')
np.random.seed(42)
print("✅ Libraries loaded!")Bước 2.2: Load & Inspect
python
# ============================================
# LOAD DATA
# ============================================
df = pd.read_csv('data/your_dataset.csv') # ← THAY PATH CỦA BẠN
# Quick overview
print(f"📊 Dataset Shape: {df.shape[0]:,} rows × {df.shape[1]} columns")
print(f"\n📋 Data Types:")
print(df.dtypes)
print(f"\n🔍 First 5 rows:")
df.head()Bước 2.3: Missing Values
python
# ============================================
# MISSING VALUES AUDIT
# ============================================
missing = df.isnull().sum()
missing_pct = (missing / len(df) * 100).round(1)
missing_report = pd.DataFrame({
'Missing': missing,
'%': missing_pct
}).query('Missing > 0').sort_values('%', ascending=False)
if len(missing_report) > 0:
print(f"⚠️ MISSING VALUES:")
print(missing_report)
print(f"\nTotal rows with any missing: {df.isnull().any(axis=1).sum():,}")
else:
print("✅ No missing values!")Bước 2.4: Handle Missing + Duplicates + Types
python
# ============================================
# DATA CLEANING
# ============================================
# 1. Handle missing values (customize per your dataset)
# Strategy: < 5% → drop, 5-30% → impute, > 30% → drop column
# Example:
# df['column_name'].fillna(df['column_name'].median(), inplace=True)
# df.dropna(subset=['critical_column'], inplace=True)
# YOUR CLEANING CODE HERE:
# _______________________________________
# _______________________________________
# _______________________________________
# 2. Remove duplicates
before = len(df)
df.drop_duplicates(inplace=True)
after = len(df)
print(f"🔁 Duplicates removed: {before - after}")
# 3. Fix data types
# df['date_column'] = pd.to_datetime(df['date_column'])
# df['category_column'] = df['category_column'].astype('category')
# YOUR TYPE FIXES HERE:
# _______________________________________
# 4. Final check
print(f"\n✅ Cleaned dataset: {df.shape[0]:,} rows × {df.shape[1]} columns")
print(f" Missing values: {df.isnull().sum().sum()}")
print(f" Duplicates: {df.duplicated().sum()}")Bước 2.5: Save Cleaned Data
python
# Save cleaned version
df.to_csv('data/your_dataset_cleaned.csv', index=False)
print("💾 Cleaned dataset saved!")Phần 3: EDA + Business Questions (45 phút)
Bước 3.1: Univariate Analysis
python
# ============================================
# EDA — UNIVARIATE
# ============================================
# Numeric columns distribution
numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
n_cols = min(len(numeric_cols), 6) # Show top 6
fig, axes = plt.subplots(2, 3, figsize=(16, 10))
axes = axes.flatten()
for idx, col in enumerate(numeric_cols[:n_cols]):
ax = axes[idx]
df[col].hist(bins=30, ax=ax, color='steelblue', edgecolor='white', alpha=0.8)
ax.axvline(df[col].median(), color='red', linestyle='--', label=f'Median: {df[col].median():,.1f}')
ax.set_title(col, fontsize=12, fontweight='bold')
ax.legend(fontsize=9)
# Hide empty subplots
for idx in range(n_cols, 6):
axes[idx].set_visible(False)
plt.suptitle('Distribution of Key Numeric Variables', fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()
# Summary stats
print("📊 Summary Statistics:")
print(df[numeric_cols].describe().round(2))Bước 3.2: Bivariate Analysis
python
# ============================================
# EDA — BIVARIATE (Correlation)
# ============================================
corr = df[numeric_cols].corr()
plt.figure(figsize=(12, 8))
mask = np.triu(np.ones_like(corr, dtype=bool))
sns.heatmap(corr, mask=mask, annot=True, cmap='RdBu_r', center=0,
fmt='.2f', square=True, linewidths=0.5,
cbar_kws={'label': 'Correlation'})
plt.title('Correlation Matrix — Key Variables', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
# Top positive & negative correlations
corr_pairs = corr.unstack().sort_values()
corr_pairs = corr_pairs[(corr_pairs < 1) & (corr_pairs > -1)]
print("\n🔝 Top 5 Positive Correlations:")
print(corr_pairs.tail(5))
print("\n🔻 Top 5 Negative Correlations:")
print(corr_pairs.head(5))Bước 3.3: Answer Business Questions
Template cho mỗi question (copy-paste và customize):
python
# ============================================
# Q1: [YOUR QUESTION HERE]
# ============================================
# Analysis
# _______________________________________
# _______________________________________
# Visualization
# _______________________________________
# Finding
print("""
┌──────────────────────────────────────────────────────┐
│ 📋 FINDING #1 │
│ │
│ WHAT: ___________________________________________ │
│ SO WHAT: ________________________________________ │
│ NOW WHAT: _______________________________________ │
│ │
│ Evidence: [chart above / statistic] │
│ Confidence: [High / Medium / Low] │
└──────────────────────────────────────────────────────┘
""")Ví dụ Q3 — Comparative Analysis:
python
# ============================================
# Q3: Khách hàng mới vs returning khác nhau thế nào về [metric]?
# ============================================
# Group comparison
group_a = df[df['customer_type'] == 'New']['order_value']
group_b = df[df['customer_type'] == 'Returning']['order_value']
# Stats
print(f"New Customers: Mean = {group_a.mean():,.0f}, Median = {group_a.median():,.0f}, N = {len(group_a):,}")
print(f"Returning Customers: Mean = {group_b.mean():,.0f}, Median = {group_b.median():,.0f}, N = {len(group_b):,}")
# T-test
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"\nt-test: t = {t_stat:.3f}, p = {p_value:.4f}")
print(f"Significant? {'✅ Yes' if p_value < 0.05 else '❌ No'} (α = 0.05)")
# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Box plot
df.boxplot(column='order_value', by='customer_type', ax=axes[0])
axes[0].set_title('Order Value by Customer Type')
axes[0].set_xlabel('')
# Distribution overlay
group_a.hist(bins=30, ax=axes[1], alpha=0.6, label='New', color='steelblue')
group_b.hist(bins=30, ax=axes[1], alpha=0.6, label='Returning', color='coral')
axes[1].set_title('Distribution Comparison')
axes[1].legend()
plt.suptitle('Q3: New vs Returning Customer — Order Value', fontsize=13, fontweight='bold')
plt.tight_layout()
plt.show()
print("""
┌──────────────────────────────────────────────────────┐
│ 📋 FINDING #3 │
│ │
│ WHAT: Returning customers có AOV cao hơn 35% │
│ SO WHAT: Retention tăng 10% → revenue tăng ước tính │
│ 15% do AOV effect │
│ NOW WHAT: Invest vào loyalty program + personalized │
│ recommendations cho returning customers │
│ │
│ Evidence: t-test p < 0.001, effect size = 0.42 │
│ Confidence: High │
└──────────────────────────────────────────────────────┘
""")Bước 3.4: Summary of Findings
python
# ============================================
# SUMMARY OF FINDINGS
# ============================================
print("""
╔══════════════════════════════════════════════════════════════╗
║ 📊 SUMMARY OF FINDINGS ║
║ ║
║ Q1: ___________________________________________________ ║
║ Finding: ___________________________________________ ║
║ ║
║ Q2: ___________________________________________________ ║
║ Finding: ___________________________________________ ║
║ ║
║ Q3: ___________________________________________________ ║
║ Finding: ___________________________________________ ║
║ ║
║ Q4: ___________________________________________________ ║
║ Finding: ___________________________________________ ║
║ ║
║ Q5: ___________________________________________________ ║
║ Finding: ___________________________________________ ║
║ ║
║ 📌 TOP INSIGHT: ________________________________________ ║
║ 💡 TOP RECOMMENDATION: _________________________________ ║
╚══════════════════════════════════════════════════════════════╝
""")Phần 4: Dashboard Build (30 phút)
Bước 4.1: Export Cleaned Data
python
# Export for Power BI / Tableau
df.to_csv('data/dashboard_data.csv', index=False)
# If needed — create summary tables
# monthly_summary = df.groupby('month').agg({...}).reset_index()
# monthly_summary.to_csv('data/monthly_summary.csv', index=False)
print("📦 Data exported for dashboard!")Bước 4.2: Dashboard Mockup (5 phút)
Trước khi mở Power BI / Tableau, sketch trên giấy:
┌──────────────────────────────────────────────────────────────┐
│ 📊 MY DASHBOARD MOCKUP │
│ │
│ PAGE 1: _______________ │
│ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ │
│ │ KPI │ │ KPI │ │ KPI │ │ KPI │ │
│ │ ___ │ │ ___ │ │ ___ │ │ ___ │ │
│ └──────┘ └──────┘ └──────┘ └──────┘ │
│ ┌──────────────────┐ ┌──────────────┐ │
│ │ Chart: ________ │ │ Chart: ____ │ │
│ │ Type: _________ │ │ Type: _____ │ │
│ └──────────────────┘ └──────────────┘ │
│ │
│ PAGE 2: _______________ │
│ [Sketch your deep dive page] │
│ │
│ PAGE 3: _______________ │
│ [Sketch your insights/recommendations page] │
└──────────────────────────────────────────────────────────────┘Bước 4.3: Build Dashboard (25 phút)
Checklist per page:
PAGE 1 — OVERVIEW:
☐ Import cleaned data
☐ Create date table (if needed)
☐ KPI Card 1: _______ (with YoY/MoM comparison)
☐ KPI Card 2: _______ (with comparison)
☐ KPI Card 3: _______ (with comparison)
☐ KPI Card 4: _______ (with comparison)
☐ Main chart: _______ (type: _______)
☐ Secondary chart: _______ (type: _______)
☐ Filters: Date, _______, _______
PAGE 2 — DEEP DIVE:
☐ Segment comparison chart: _______
☐ Cross-filter working? ☐ Yes ☐ No
☐ Detail chart/table: _______
PAGE 3 — INSIGHTS:
☐ Annotated key finding chart
☐ Recommendations summary
☐ Dashboard title + date range + sourcePhần 5: Peer Review (20 phút)
Bước 5.1: Self-Review (3 phút)
Trước peer review, tự check:
NOTEBOOK SELF-CHECK:
☐ Kernel → Restart & Run All → no errors
☐ All sections have markdown headers
☐ 5 findings documented (What/So What/Now What)
☐ Charts have titles + labels
☐ No unused/debug cells
DASHBOARD SELF-CHECK:
☐ 3 pages built
☐ KPI cards with comparison
☐ Charts have titles + axis labels
☐ Filters working
☐ Colors consistentBước 5.2: Peer Review Session (12 phút)
Pair up với 1 bạn. Mỗi người:
- Present 3 phút: Walk through notebook → dashboard → findings
- Review 5 phút: Partner fills review form
Peer Review Form
┌──────────────────────────────────────────────────────────────┐
│ 📋 PEER REVIEW FORM │
│ │
│ Builder: ________________ Reviewer: ________________ │
│ │
│ NOTEBOOK Score (1-5) │
│ ☐ Code runs without errors [___] │
│ ☐ Structure clear (headers, sections) [___] │
│ ☐ Findings documented (What/So What/Now What) [___] │
│ ☐ Charts informative + labeled [___] │
│ │
│ DASHBOARD Score (1-5) │
│ ☐ Answers business questions [___] │
│ ☐ Visual hierarchy (KPI → Trend → Detail) [___] │
│ ☐ Colors consistent, charts appropriate [___] │
│ ☐ Interactivity works (filters, cross-filter) [___] │
│ │
│ 🌟 WHAT WORKS WELL: │
│ _________________________________________________________ │
│ _________________________________________________________ │
│ │
│ 🔧 TOP 3 IMPROVEMENTS: │
│ 1. _____________________________________________________ │
│ 2. _____________________________________________________ │
│ 3. _____________________________________________________ │
│ │
│ Overall: [___] / 5 │
└──────────────────────────────────────────────────────────────┘Bước 5.3: Fix Top 3 (5 phút)
Nhận feedback → fix top 3 issues ngay:
FIX LOG:
☐ Issue 1: _________________ → Fixed: ☐
☐ Issue 2: _________________ → Fixed: ☐
☐ Issue 3: _________________ → Fixed: ☐Phần 6: Executive Summary (10 phút)
Bước 6.1: Draft Summary
Viết 1 trang — dùng template:
markdown
# Executive Summary — [Your Project Title]
**Author:** [Name] | **Date:** [Date] | **Dataset:** [Description, N rows, period]
---
## Objective
[1-2 câu: bài toán gì, data gì, business context]
## Key Findings
### Finding 1: [Title]
- **What:** [Data says...]
- **So What:** [Business impact...]
- **Now What:** [Recommended action...]
### Finding 2: [Title]
- **What:** [Data says...]
- **So What:** [Business impact...]
- **Now What:** [Recommended action...]
### Finding 3: [Title]
- **What:** [Data says...]
- **So What:** [Business impact...]
- **Now What:** [Recommended action...]
### Finding 4: [Title]
- **What:** [Data says...]
- **So What:** [Business impact...]
- **Now What:** [Recommended action...]
### Finding 5: [Title]
- **What:** [Data says...]
- **So What:** [Business impact...]
- **Now What:** [Recommended action...]
## Top 3 Recommendations
| # | Recommendation | Expected Impact | Priority |
|---|---------------|----------------|---------|
| 1 | [Action] | [Quantified impact] | 🔴 High |
| 2 | [Action] | [Quantified impact] | 🟡 Medium |
| 3 | [Action] | [Quantified impact] | 🟢 Low |
## Appendix
- Jupyter Notebook: [filename]
- Dashboard: [filename]
- Data Source: [description]✅ Sprint Completion Checklist
┌──────────────────────────────────────────────────────────────┐
│ ✅ SPRINT COMPLETION — FINAL CHECK │
│ │
│ DELIVERABLE 1: JUPYTER NOTEBOOK │
│ ☐ File: HoTen_Buoi19_CapstoneEDA.ipynb │
│ ☐ Restart & Run All = no errors │
│ ☐ Structured sections (≤ 50 cells, organized) │
│ ☐ 5 business questions answered with evidence │
│ ☐ Findings documented (What/So What/Now What) │
│ ☐ Charts titled + labeled │
│ │
│ DELIVERABLE 2: DASHBOARD v1 │
│ ☐ File: HoTen_Buoi19_Dashboard.pbix / .twbx │
│ ☐ 3 pages: Overview + Deep Dive + Insights │
│ ☐ KPI cards with period comparison │
│ ☐ Filters working │
│ ☐ Peer reviewed — top issues fixed │
│ │
│ DELIVERABLE 3: EXECUTIVE SUMMARY DRAFT │
│ ☐ File: HoTen_Buoi19_ExecSummary.md │
│ ☐ 5 findings (What/So What/Now What) │
│ ☐ 3 recommendations with quantified impact │
│ ☐ ≤ 2 pages │
│ │
│ PEER REVIEW: │
│ ☐ Reviewed by: ________________ (score: ___/5) │
│ ☐ Top 3 feedback items addressed │
│ │
│ 📌 STATUS: ☐ COMPLETE ☐ NEEDS ITERATION │
└──────────────────────────────────────────────────────────────┘📊 Rubric đánh giá
| Tiêu chí | Excellent (5) | Good (4) | Adequate (3) | Needs Work (2) |
|---|---|---|---|---|
| Data Cleaning | Complete, documented, no missing | Complete, minor gaps | Partial, some issues | Incomplete, many errors |
| EDA | Deep, insightful, well-visualized | Good coverage, clear charts | Basic distributions only | Minimal, messy |
| Business Questions | 5 questions, strong evidence + findings | 4-5 questions, adequate evidence | 3 questions, weak evidence | < 3 questions |
| Dashboard | 3+ pages, interactive, annotated, polished | 3 pages, working, clean | 2 pages, basic | 1 page, broken |
| Executive Summary | 5 findings, quantified recommendations | 4-5 findings, some quantification | 3 findings, vague recs | < 3 findings |
| Peer Review | Given + received, all fixes applied | Given + received, most fixes | Received only | Not done |