Skip to content

🛠 Workshop — Analysis Sprint: Từ Raw Data đến Dashboard

Chạy sprint phân tích hoàn chỉnh: Data Cleaning → EDA → Answer 5 Business Questions → Build Dashboard 3 pages → Peer Review → Executive Summary Draft. Output: Jupyter Notebook + Dashboard v1 + 1-page Summary!

🎯 Mục tiêu workshop

Sau khi hoàn thành workshop này, bạn sẽ:

  1. Clean data — xử lý missing, duplicates, types cho capstone dataset
  2. EDA hoàn chỉnh — distributions, correlations, segment analysis
  3. Answer 5 business questions — mỗi question có evidence + finding
  4. Build dashboard — 3 pages: Overview, Deep Dive, Recommendations
  5. Peer review — nhận và cho feedback có cấu trúc
  6. Draft executive summary — 5 findings theo What → So What → Now What

🧰 Yêu cầu

Yêu cầuChi tiết
Kiến thứcĐã hoàn thành Buổi 18 (Data Collection + Data Audit) + Buổi 19 lý thuyết
DataCapstone dataset đã thu thập từ Buổi 18
ToolsPython (pandas, matplotlib, seaborn, scipy), Power BI / Tableau
Thời gian120–150 phút (sprint time-boxed)
OutputJupyter Notebook + Dashboard v1 + Executive Summary Draft

💡 Naming convention

  • Notebook: HoTen_Buoi19_CapstoneEDA.ipynb
  • Dashboard: HoTen_Buoi19_Dashboard.pbix (hoặc Tableau workbook)
  • Summary: HoTen_Buoi19_ExecSummary.md (hoặc .docx)

⏰ Sprint Timeline

┌──────────────────────────────────────────────────────────────┐
│  ⏰ SPRINT TIMELINE — 150 PHÚT                               │
│                                                               │
│  ┌────────────────┐  00:00 — 00:15   Sprint Planning          │
│  │ 🎯 PLAN        │  Review data audit, confirm 5 questions   │
│  └────────────────┘                                           │
│                                                               │
│  ┌────────────────┐  00:15 — 00:45   Data Cleaning            │
│  │ 🧹 CLEAN       │  Missing values, duplicates, types        │
│  └────────────────┘                                           │
│                                                               │
│  ┌────────────────┐  00:45 — 01:30   EDA + Business Questions │
│  │ 🔍 ANALYZE     │  Distributions, correlations, Q1—Q5      │
│  └────────────────┘                                           │
│                                                               │
│  ┌────────────────┐  01:30 — 02:00   Dashboard Build          │
│  │ 📊 DASHBOARD   │  3 pages: Overview, Deep Dive, Insights  │
│  └────────────────┘                                           │
│                                                               │
│  ┌────────────────┐  02:00 — 02:20   Peer Review              │
│  │ 👥 REVIEW      │  Present 3 min, feedback 5 min, fix 12   │
│  └────────────────┘                                           │
│                                                               │
│  ┌────────────────┐  02:20 — 02:30   Executive Summary        │
│  │ 📝 SUMMARY     │  5 findings, 3 recommendations            │
│  └────────────────┘                                           │
└──────────────────────────────────────────────────────────────┘

⚠️ Time-box nghiêm ngặt!

Nếu 30 phút mà cleaning chưa xong → dừng, tạm accept "good enough" cleaning, move on. Sprint = progress > perfection. Bạn có thể quay lại polish sau.


Phần 1: Sprint Planning (15 phút)

Bước 1.1: Review Data Audit

Mở Data Audit từ Buổi 18 và fill bảng:

┌──────────────────────────────────────────────────────────────┐
│  📋 SPRINT PLANNING — DATA REVIEW                            │
│                                                               │
│  Dataset: ________________________                            │
│  Rows: ________   Columns: ________                          │
│  Date Range: ________________________                         │
│  Source: ________________________                              │
│                                                               │
│  Key Columns (top 10):                                        │
│  1. _____________ (type: _____, missing: ___%)               │
│  2. _____________ (type: _____, missing: ___%)               │
│  3. _____________ (type: _____, missing: ___%)               │
│  4. _____________ (type: _____, missing: ___%)               │
│  5. _____________ (type: _____, missing: ___%)               │
│  6. _____________ (type: _____, missing: ___%)               │
│  7. _____________ (type: _____, missing: ___%)               │
│  8. _____________ (type: _____, missing: ___%)               │
│  9. _____________ (type: _____, missing: ___%)               │
│  10. ____________ (type: _____, missing: ___%)               │
└──────────────────────────────────────────────────────────────┘

Bước 1.2: Confirm 5 Business Questions

┌──────────────────────────────────────────────────────────────┐
│  📋 5 BUSINESS QUESTIONS                                      │
│                                                               │
│  Q1 (Descriptive):   ____________________________________    │
│  Analysis method:    ____________________________________    │
│                                                               │
│  Q2 (Diagnostic):    ____________________________________    │
│  Analysis method:    ____________________________________    │
│                                                               │
│  Q3 (Comparative):   ____________________________________    │
│  Analysis method:    ____________________________________    │
│                                                               │
│  Q4 (Predictive/Diagnostic): ____________________________    │
│  Analysis method:    ____________________________________    │
│                                                               │
│  Q5 (Prescriptive):  ____________________________________    │
│  Analysis method:    ____________________________________    │
└──────────────────────────────────────────────────────────────┘

Phần 2: Data Cleaning (30 phút)

Bước 2.1: Setup & Load

python
# ============================================
# CAPSTONE ANALYSIS SPRINT
# Author: [Your Name]
# Date: [Today's Date]
# Dataset: [Your Dataset Name]
# ============================================

# Table of Contents:
# 1. Setup & Load Data
# 2. Data Cleaning
# 3. EDA — Univariate
# 4. EDA — Bivariate
# 5. Q1: [Your Question]
# 6. Q2: [Your Question]
# 7. Q3: [Your Question]
# 8. Q4: [Your Question]
# 9. Q5: [Your Question]
# 10. Summary of Findings

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Settings
plt.rcParams['figure.figsize'] = (12, 6)
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('Set2')
np.random.seed(42)

print("✅ Libraries loaded!")

Bước 2.2: Load & Inspect

python
# ============================================
# LOAD DATA
# ============================================

df = pd.read_csv('data/your_dataset.csv')  # ← THAY PATH CỦA BẠN

# Quick overview
print(f"📊 Dataset Shape: {df.shape[0]:,} rows × {df.shape[1]} columns")
print(f"\n📋 Data Types:")
print(df.dtypes)
print(f"\n🔍 First 5 rows:")
df.head()

Bước 2.3: Missing Values

python
# ============================================
# MISSING VALUES AUDIT
# ============================================

missing = df.isnull().sum()
missing_pct = (missing / len(df) * 100).round(1)
missing_report = pd.DataFrame({
    'Missing': missing,
    '%': missing_pct
}).query('Missing > 0').sort_values('%', ascending=False)

if len(missing_report) > 0:
    print(f"⚠️ MISSING VALUES:")
    print(missing_report)
    print(f"\nTotal rows with any missing: {df.isnull().any(axis=1).sum():,}")
else:
    print("✅ No missing values!")

Bước 2.4: Handle Missing + Duplicates + Types

python
# ============================================
# DATA CLEANING
# ============================================

# 1. Handle missing values (customize per your dataset)
# Strategy: < 5% → drop, 5-30% → impute, > 30% → drop column
# Example:
# df['column_name'].fillna(df['column_name'].median(), inplace=True)
# df.dropna(subset=['critical_column'], inplace=True)

# YOUR CLEANING CODE HERE:
# _______________________________________
# _______________________________________
# _______________________________________

# 2. Remove duplicates
before = len(df)
df.drop_duplicates(inplace=True)
after = len(df)
print(f"🔁 Duplicates removed: {before - after}")

# 3. Fix data types
# df['date_column'] = pd.to_datetime(df['date_column'])
# df['category_column'] = df['category_column'].astype('category')

# YOUR TYPE FIXES HERE:
# _______________________________________

# 4. Final check
print(f"\n✅ Cleaned dataset: {df.shape[0]:,} rows × {df.shape[1]} columns")
print(f"   Missing values: {df.isnull().sum().sum()}")
print(f"   Duplicates: {df.duplicated().sum()}")

Bước 2.5: Save Cleaned Data

python
# Save cleaned version
df.to_csv('data/your_dataset_cleaned.csv', index=False)
print("💾 Cleaned dataset saved!")

Phần 3: EDA + Business Questions (45 phút)

Bước 3.1: Univariate Analysis

python
# ============================================
# EDA — UNIVARIATE
# ============================================

# Numeric columns distribution
numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
n_cols = min(len(numeric_cols), 6)  # Show top 6

fig, axes = plt.subplots(2, 3, figsize=(16, 10))
axes = axes.flatten()

for idx, col in enumerate(numeric_cols[:n_cols]):
    ax = axes[idx]
    df[col].hist(bins=30, ax=ax, color='steelblue', edgecolor='white', alpha=0.8)
    ax.axvline(df[col].median(), color='red', linestyle='--', label=f'Median: {df[col].median():,.1f}')
    ax.set_title(col, fontsize=12, fontweight='bold')
    ax.legend(fontsize=9)

# Hide empty subplots
for idx in range(n_cols, 6):
    axes[idx].set_visible(False)

plt.suptitle('Distribution of Key Numeric Variables', fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

# Summary stats
print("📊 Summary Statistics:")
print(df[numeric_cols].describe().round(2))

Bước 3.2: Bivariate Analysis

python
# ============================================
# EDA — BIVARIATE (Correlation)
# ============================================

corr = df[numeric_cols].corr()

plt.figure(figsize=(12, 8))
mask = np.triu(np.ones_like(corr, dtype=bool))
sns.heatmap(corr, mask=mask, annot=True, cmap='RdBu_r', center=0,
            fmt='.2f', square=True, linewidths=0.5,
            cbar_kws={'label': 'Correlation'})
plt.title('Correlation Matrix — Key Variables', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

# Top positive & negative correlations
corr_pairs = corr.unstack().sort_values()
corr_pairs = corr_pairs[(corr_pairs < 1) & (corr_pairs > -1)]
print("\n🔝 Top 5 Positive Correlations:")
print(corr_pairs.tail(5))
print("\n🔻 Top 5 Negative Correlations:")
print(corr_pairs.head(5))

Bước 3.3: Answer Business Questions

Template cho mỗi question (copy-paste và customize):

python
# ============================================
# Q1: [YOUR QUESTION HERE]
# ============================================

# Analysis
# _______________________________________
# _______________________________________

# Visualization
# _______________________________________

# Finding
print("""
┌──────────────────────────────────────────────────────┐
│  📋 FINDING #1                                        │
│                                                       │
│  WHAT: ___________________________________________   │
│  SO WHAT: ________________________________________   │
│  NOW WHAT: _______________________________________   │
│                                                       │
│  Evidence: [chart above / statistic]                 │
│  Confidence: [High / Medium / Low]                   │
└──────────────────────────────────────────────────────┘
""")

Ví dụ Q3 — Comparative Analysis:

python
# ============================================
# Q3: Khách hàng mới vs returning khác nhau thế nào về [metric]?
# ============================================

# Group comparison
group_a = df[df['customer_type'] == 'New']['order_value']
group_b = df[df['customer_type'] == 'Returning']['order_value']

# Stats
print(f"New Customers:     Mean = {group_a.mean():,.0f}, Median = {group_a.median():,.0f}, N = {len(group_a):,}")
print(f"Returning Customers: Mean = {group_b.mean():,.0f}, Median = {group_b.median():,.0f}, N = {len(group_b):,}")

# T-test
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"\nt-test: t = {t_stat:.3f}, p = {p_value:.4f}")
print(f"Significant? {'✅ Yes' if p_value < 0.05 else '❌ No'} (α = 0.05)")

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Box plot
df.boxplot(column='order_value', by='customer_type', ax=axes[0])
axes[0].set_title('Order Value by Customer Type')
axes[0].set_xlabel('')

# Distribution overlay
group_a.hist(bins=30, ax=axes[1], alpha=0.6, label='New', color='steelblue')
group_b.hist(bins=30, ax=axes[1], alpha=0.6, label='Returning', color='coral')
axes[1].set_title('Distribution Comparison')
axes[1].legend()

plt.suptitle('Q3: New vs Returning Customer — Order Value', fontsize=13, fontweight='bold')
plt.tight_layout()
plt.show()

print("""
┌──────────────────────────────────────────────────────┐
│  📋 FINDING #3                                        │
│                                                       │
│  WHAT: Returning customers có AOV cao hơn 35%        │
│  SO WHAT: Retention tăng 10% → revenue tăng ước tính │
│           15% do AOV effect                          │
│  NOW WHAT: Invest vào loyalty program + personalized │
│            recommendations cho returning customers   │
│                                                       │
│  Evidence: t-test p < 0.001, effect size = 0.42      │
│  Confidence: High                                    │
└──────────────────────────────────────────────────────┘
""")

Bước 3.4: Summary of Findings

python
# ============================================
# SUMMARY OF FINDINGS
# ============================================

print("""
╔══════════════════════════════════════════════════════════════╗
║  📊 SUMMARY OF FINDINGS                                      ║
║                                                               ║
║  Q1: ___________________________________________________     ║
║      Finding: ___________________________________________     ║
║                                                               ║
║  Q2: ___________________________________________________     ║
║      Finding: ___________________________________________     ║
║                                                               ║
║  Q3: ___________________________________________________     ║
║      Finding: ___________________________________________     ║
║                                                               ║
║  Q4: ___________________________________________________     ║
║      Finding: ___________________________________________     ║
║                                                               ║
║  Q5: ___________________________________________________     ║
║      Finding: ___________________________________________     ║
║                                                               ║
║  📌 TOP INSIGHT: ________________________________________     ║
║  💡 TOP RECOMMENDATION: _________________________________     ║
╚══════════════════════════════════════════════════════════════╝
""")

Phần 4: Dashboard Build (30 phút)

Bước 4.1: Export Cleaned Data

python
# Export for Power BI / Tableau
df.to_csv('data/dashboard_data.csv', index=False)

# If needed — create summary tables
# monthly_summary = df.groupby('month').agg({...}).reset_index()
# monthly_summary.to_csv('data/monthly_summary.csv', index=False)

print("📦 Data exported for dashboard!")

Bước 4.2: Dashboard Mockup (5 phút)

Trước khi mở Power BI / Tableau, sketch trên giấy:

┌──────────────────────────────────────────────────────────────┐
│  📊 MY DASHBOARD MOCKUP                                      │
│                                                               │
│  PAGE 1: _______________                                      │
│  ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐                        │
│  │ KPI  │ │ KPI  │ │ KPI  │ │ KPI  │                        │
│  │ ___  │ │ ___  │ │ ___  │ │ ___  │                        │
│  └──────┘ └──────┘ └──────┘ └──────┘                        │
│  ┌──────────────────┐ ┌──────────────┐                       │
│  │  Chart: ________  │ │  Chart: ____ │                      │
│  │  Type: _________  │ │  Type: _____ │                      │
│  └──────────────────┘ └──────────────┘                       │
│                                                               │
│  PAGE 2: _______________                                      │
│  [Sketch your deep dive page]                                 │
│                                                               │
│  PAGE 3: _______________                                      │
│  [Sketch your insights/recommendations page]                  │
└──────────────────────────────────────────────────────────────┘

Bước 4.3: Build Dashboard (25 phút)

Checklist per page:

PAGE 1 — OVERVIEW:
☐ Import cleaned data
☐ Create date table (if needed)
☐ KPI Card 1: _______ (with YoY/MoM comparison)
☐ KPI Card 2: _______ (with comparison)
☐ KPI Card 3: _______ (with comparison)
☐ KPI Card 4: _______ (with comparison)
☐ Main chart: _______ (type: _______)
☐ Secondary chart: _______ (type: _______)
☐ Filters: Date, _______, _______

PAGE 2 — DEEP DIVE:
☐ Segment comparison chart: _______
☐ Cross-filter working? ☐ Yes ☐ No
☐ Detail chart/table: _______

PAGE 3 — INSIGHTS:
☐ Annotated key finding chart
☐ Recommendations summary
☐ Dashboard title + date range + source

Phần 5: Peer Review (20 phút)

Bước 5.1: Self-Review (3 phút)

Trước peer review, tự check:

NOTEBOOK SELF-CHECK:
☐ Kernel → Restart & Run All → no errors
☐ All sections have markdown headers
☐ 5 findings documented (What/So What/Now What)
☐ Charts have titles + labels
☐ No unused/debug cells

DASHBOARD SELF-CHECK:
☐ 3 pages built
☐ KPI cards with comparison
☐ Charts have titles + axis labels
☐ Filters working
☐ Colors consistent

Bước 5.2: Peer Review Session (12 phút)

Pair up với 1 bạn. Mỗi người:

  • Present 3 phút: Walk through notebook → dashboard → findings
  • Review 5 phút: Partner fills review form

Peer Review Form

┌──────────────────────────────────────────────────────────────┐
│  📋 PEER REVIEW FORM                                          │
│                                                               │
│  Builder: ________________  Reviewer: ________________       │
│                                                               │
│  NOTEBOOK                                      Score (1-5)   │
│  ☐ Code runs without errors                    [___]         │
│  ☐ Structure clear (headers, sections)          [___]         │
│  ☐ Findings documented (What/So What/Now What)  [___]         │
│  ☐ Charts informative + labeled                [___]         │
│                                                               │
│  DASHBOARD                                     Score (1-5)   │
│  ☐ Answers business questions                   [___]         │
│  ☐ Visual hierarchy (KPI → Trend → Detail)     [___]         │
│  ☐ Colors consistent, charts appropriate        [___]         │
│  ☐ Interactivity works (filters, cross-filter)  [___]         │
│                                                               │
│  🌟 WHAT WORKS WELL:                                         │
│  _________________________________________________________   │
│  _________________________________________________________   │
│                                                               │
│  🔧 TOP 3 IMPROVEMENTS:                                      │
│  1. _____________________________________________________    │
│  2. _____________________________________________________    │
│  3. _____________________________________________________    │
│                                                               │
│  Overall: [___] / 5                                          │
└──────────────────────────────────────────────────────────────┘

Bước 5.3: Fix Top 3 (5 phút)

Nhận feedback → fix top 3 issues ngay:

FIX LOG:
☐ Issue 1: _________________ → Fixed: ☐
☐ Issue 2: _________________ → Fixed: ☐
☐ Issue 3: _________________ → Fixed: ☐

Phần 6: Executive Summary (10 phút)

Bước 6.1: Draft Summary

Viết 1 trang — dùng template:

markdown
# Executive Summary — [Your Project Title]

**Author:** [Name] | **Date:** [Date] | **Dataset:** [Description, N rows, period]

---

## Objective
[1-2 câu: bài toán gì, data gì, business context]

## Key Findings

### Finding 1: [Title]
- **What:** [Data says...]
- **So What:** [Business impact...]
- **Now What:** [Recommended action...]

### Finding 2: [Title]
- **What:** [Data says...]
- **So What:** [Business impact...]
- **Now What:** [Recommended action...]

### Finding 3: [Title]
- **What:** [Data says...]
- **So What:** [Business impact...]
- **Now What:** [Recommended action...]

### Finding 4: [Title]
- **What:** [Data says...]
- **So What:** [Business impact...]
- **Now What:** [Recommended action...]

### Finding 5: [Title]
- **What:** [Data says...]
- **So What:** [Business impact...]
- **Now What:** [Recommended action...]

## Top 3 Recommendations

| # | Recommendation | Expected Impact | Priority |
|---|---------------|----------------|---------|
| 1 | [Action] | [Quantified impact] | 🔴 High |
| 2 | [Action] | [Quantified impact] | 🟡 Medium |
| 3 | [Action] | [Quantified impact] | 🟢 Low |

## Appendix
- Jupyter Notebook: [filename]
- Dashboard: [filename]
- Data Source: [description]

✅ Sprint Completion Checklist

┌──────────────────────────────────────────────────────────────┐
│  ✅ SPRINT COMPLETION — FINAL CHECK                           │
│                                                               │
│  DELIVERABLE 1: JUPYTER NOTEBOOK                              │
│  ☐ File: HoTen_Buoi19_CapstoneEDA.ipynb                     │
│  ☐ Restart & Run All = no errors                             │
│  ☐ Structured sections (≤ 50 cells, organized)               │
│  ☐ 5 business questions answered with evidence               │
│  ☐ Findings documented (What/So What/Now What)               │
│  ☐ Charts titled + labeled                                   │
│                                                               │
│  DELIVERABLE 2: DASHBOARD v1                                  │
│  ☐ File: HoTen_Buoi19_Dashboard.pbix / .twbx                │
│  ☐ 3 pages: Overview + Deep Dive + Insights                 │
│  ☐ KPI cards with period comparison                          │
│  ☐ Filters working                                           │
│  ☐ Peer reviewed — top issues fixed                          │
│                                                               │
│  DELIVERABLE 3: EXECUTIVE SUMMARY DRAFT                       │
│  ☐ File: HoTen_Buoi19_ExecSummary.md                        │
│  ☐ 5 findings (What/So What/Now What)                        │
│  ☐ 3 recommendations with quantified impact                  │
│  ☐ ≤ 2 pages                                                │
│                                                               │
│  PEER REVIEW:                                                 │
│  ☐ Reviewed by: ________________ (score: ___/5)             │
│  ☐ Top 3 feedback items addressed                            │
│                                                               │
│  📌 STATUS: ☐ COMPLETE  ☐ NEEDS ITERATION                   │
└──────────────────────────────────────────────────────────────┘

📊 Rubric đánh giá

Tiêu chíExcellent (5)Good (4)Adequate (3)Needs Work (2)
Data CleaningComplete, documented, no missingComplete, minor gapsPartial, some issuesIncomplete, many errors
EDADeep, insightful, well-visualizedGood coverage, clear chartsBasic distributions onlyMinimal, messy
Business Questions5 questions, strong evidence + findings4-5 questions, adequate evidence3 questions, weak evidence< 3 questions
Dashboard3+ pages, interactive, annotated, polished3 pages, working, clean2 pages, basic1 page, broken
Executive Summary5 findings, quantified recommendations4-5 findings, some quantification3 findings, vague recs< 3 findings
Peer ReviewGiven + received, all fixes appliedGiven + received, most fixesReceived onlyNot done

🔗 Liên kết