
🛠 Workshop — Project Kickoff: From Idea to Git Repo

Choose a topic → download the dataset → run a data audit → write the project brief → set up the Git repo + README → push to GitHub. Output: a complete Git repo + project brief + data audit report — the foundation for Session 19 (Analysis & Dashboard)!

🎯 Workshop Objectives

After completing this workshop, you will:

  1. Choose a topic + dataset — evaluated with DRBST + the 10-point checklist
  2. Write a project brief — 5 business questions + methods + tools
  3. Set up a Git repo — standard folder structure + .gitignore + README.md
  4. Run a data audit — profile the dataset, assess quality, document findings

🧰 Requirements

| Requirement | Detail |
|-------------|--------|
| Knowledge | Completed Session 18 theory (Project Kickoff) |
| Tools | Python (pandas, numpy), Git, GitHub account, text editor |
| Time | 90–120 minutes |
| Output | Git repo (GitHub) + Project Brief + Data Audit Report |

💡 Naming convention

Repo name: capstone-[domain]-analysis (e.g. capstone-ecommerce-analysis)


📦 Scenario

You are a fresh-grad DA preparing your first portfolio project. Goal: complete a capstone project in 2 weeks — from data to dashboard to presentation. Today (Day 1): kickoff — choose the topic, set up the repository, audit the data.

By the end of the workshop, you must have:

  • ✅ GitHub repo live (public)
  • ✅ Project brief document
  • ✅ Data audit report
  • ✅ Clean folder structure
  • ✅ README.md v1

Part 1: Choose a Topic + Dataset (20 min)

Step 1.1: Brainstorm & Evaluate Ideas

Use the DRBST rubric to score 3–5 ideas:

python
# ============================================
# PART 1: CHOOSE TOPIC — DRBST EVALUATION
# ============================================

import pandas as pd

# Brainstorm ideas — fill in your own ideas here
ideas = pd.DataFrame({
    'Idea': [
        'E-commerce Customer Segmentation (Olist)',
        'HR Employee Attrition (IBM)',
        'Marketing Campaign ROI',
        'Telecom Customer Churn',
        # 'Your idea here'
    ],
    'D_Data': [5, 5, 4, 5],           # 1-5: Data available + quality
    'R_Real_Context': [5, 4, 4, 5],   # 1-5: Real business context
    'B_Business_Impact': [5, 5, 4, 4], # 1-5: Actionable insights
    'S_Scope_2weeks': [4, 5, 4, 4],   # 1-5: Feasible in 2 weeks
    'T_Tellable': [5, 4, 4, 4]        # 1-5: Can present in 5 min
})

ideas['Total'] = ideas[['D_Data', 'R_Real_Context', 'B_Business_Impact',
                          'S_Scope_2weeks', 'T_Tellable']].sum(axis=1)
ideas['Average'] = ideas['Total'] / 5

print("=" * 70)
print("📊 DRBST EVALUATION — CAPSTONE IDEAS")
print("=" * 70)
print(ideas.sort_values('Total', ascending=False).to_string(index=False))
print(f"\n🏆 Best Idea: {ideas.loc[ideas['Total'].idxmax(), 'Idea']}")
print(f"   Score: {ideas['Total'].max()}/25 (Average: {ideas['Average'].max():.1f}/5)")

Step 1.2: Download the Dataset

After choosing a topic, download its dataset. Example with Olist E-commerce (Kaggle):

python
# ============================================
# DOWNLOAD DATASET
# ============================================

# Option 1: Kaggle CLI (if the Kaggle API is set up)
# Terminal: kaggle datasets download -d olistbr/brazilian-ecommerce

# Option 2: Manual download from Kaggle
# Link: https://www.kaggle.com/datasets/olistbr/brazilian-ecommerce

# Option 3: Use the suggested sample dataset (smaller, ready to use)
# Create a sample dataset if you haven't downloaded one
import numpy as np
np.random.seed(42)

# === SAMPLE: E-commerce Dataset (10,000 orders) ===
n = 10000

orders = pd.DataFrame({
    'order_id': [f'ORD_{i:05d}' for i in range(n)],
    'customer_id': [f'CUST_{np.random.randint(1, 5000):05d}' for _ in range(n)],
    'order_date': pd.date_range('2024-01-01', periods=n, freq='52min'),
    'product_category': np.random.choice(
        ['Electronics', 'Fashion', 'Home', 'Beauty', 'Sports',
         'Books', 'Food', 'Toys', 'Health', 'Auto'],
        n, p=[0.15, 0.20, 0.12, 0.10, 0.08, 0.10, 0.08, 0.05, 0.07, 0.05]
    ),
    'order_value': np.round(np.random.lognormal(4.5, 0.8, n), 2),
    'quantity': np.random.choice([1, 2, 3, 4, 5], n, p=[0.45, 0.25, 0.15, 0.10, 0.05]),
    'payment_method': np.random.choice(
        ['Credit Card', 'Debit Card', 'Bank Transfer', 'E-wallet', 'COD'],
        n, p=[0.35, 0.20, 0.15, 0.15, 0.15]
    ),
    'delivery_days': np.random.choice(range(1, 31), n),
    'review_score': np.random.choice([1, 2, 3, 4, 5], n, p=[0.05, 0.08, 0.15, 0.32, 0.40]),
    'city': np.random.choice(
        ['Ho Chi Minh', 'Ha Noi', 'Da Nang', 'Can Tho', 'Hai Phong',
         'Nha Trang', 'Hue', 'Bien Hoa', 'Vung Tau', 'Buon Ma Thuot'],
        n, p=[0.30, 0.25, 0.10, 0.08, 0.07, 0.05, 0.05, 0.04, 0.03, 0.03]
    ),
    'customer_segment': np.random.choice(
        ['New', 'Regular', 'VIP', 'At-Risk', 'Churned'],
        n, p=[0.25, 0.35, 0.15, 0.15, 0.10]
    )
})

# Inject realistic missing values
mask_delivery = np.random.random(n) < 0.03  # 3% missing delivery
orders.loc[mask_delivery, 'delivery_days'] = np.nan

mask_review = np.random.random(n) < 0.05  # 5% missing reviews
orders.loc[mask_review, 'review_score'] = np.nan

print(f"✅ Dataset created: {orders.shape[0]:,} orders × {orders.shape[1]} columns")
print(f"📅 Date range: {orders['order_date'].min()} → {orders['order_date'].max()}")
orders.head()

Part 2: Data Audit — Profile & Evaluate (25 min)

Step 2.1: Data Profiling

python
# ============================================
# PART 2: DATA AUDIT — PROFILE DATASET
# ============================================

print("=" * 65)
print("📊 DATA AUDIT REPORT")
print("=" * 65)

# --- 1. SHAPE ---
print(f"\n{'─' * 65}")
print(f"📐 1. SHAPE")
print(f"{'─' * 65}")
print(f"   Rows:    {orders.shape[0]:,}")
print(f"   Columns: {orders.shape[1]}")
print(f"   Pass ≥ 1,000 rows: {'✅ YES' if orders.shape[0] >= 1000 else '❌ NO'}")

# --- 2. DATA TYPES ---
print(f"\n{'─' * 65}")
print(f"📋 2. DATA TYPES")
print(f"{'─' * 65}")
for col in orders.columns:
    print(f"   {col:25s}{str(orders[col].dtype):15s} | Unique: {orders[col].nunique():,}")

# --- 3. MISSING VALUES ---
print(f"\n{'─' * 65}")
print(f"❓ 3. MISSING VALUES")
print(f"{'─' * 65}")
missing = orders.isnull().sum()
missing_pct = (missing / len(orders) * 100).round(2)
total_missing = missing.sum()
total_cells = orders.shape[0] * orders.shape[1]
overall_missing_pct = (total_missing / total_cells * 100)

for col in orders.columns:
    status = "✅" if missing[col] == 0 else ("⚠️" if missing_pct[col] < 10 else "❌")
    print(f"   {status} {col:25s}{missing[col]:5,} missing ({missing_pct[col]:5.2f}%)")

print(f"\n   Overall: {total_missing:,} / {total_cells:,} ({overall_missing_pct:.2f}%)")
print(f"   Pass ≤ 30%: {'✅ YES' if overall_missing_pct <= 30 else '❌ NO'}")

# --- 4. NUMERIC SUMMARY ---
print(f"\n{'─' * 65}")
print(f"📈 4. NUMERIC SUMMARY")
print(f"{'─' * 65}")
numeric_cols = orders.select_dtypes(include=[np.number]).columns
for col in numeric_cols:
    stats = orders[col].describe()
    print(f"   {col}:")
    print(f"      Mean: {stats['mean']:.2f} | Std: {stats['std']:.2f}")
    print(f"      Min: {stats['min']:.2f} | Max: {stats['max']:.2f}")
    print(f"      Q1: {stats['25%']:.2f} | Median: {stats['50%']:.2f} | Q3: {stats['75%']:.2f}")

Step 2.2: Categorical Summary

python
# --- 5. CATEGORICAL SUMMARY ---
print(f"\n{'─' * 65}")
print(f"🏷️ 5. CATEGORICAL SUMMARY")
print(f"{'─' * 65}")

cat_cols = orders.select_dtypes(include='object').columns
for col in cat_cols:
    print(f"\n   📂 {col} ({orders[col].nunique()} unique):")
    top5 = orders[col].value_counts().head(5)
    for val, count in top5.items():
        pct = count / len(orders) * 100
        bar = "█" * int(pct / 2)
        print(f"      {val:20s}{count:5,} ({pct:5.1f}%) {bar}")

Step 2.3: Duplicate & Date Check

python
# --- 6. DUPLICATES ---
print(f"\n{'─' * 65}")
print(f"🔄 6. DUPLICATES")
print(f"{'─' * 65}")
dups = orders.duplicated().sum()
print(f"   Exact duplicates: {dups} ({dups/len(orders)*100:.2f}%)")
print(f"   Pass: {'✅ No duplicates' if dups == 0 else f'⚠️ {dups} duplicates found'}")

# --- 7. DATE RANGE ---
print(f"\n{'─' * 65}")
print(f"📅 7. DATE RANGE")
print(f"{'─' * 65}")
if 'order_date' in orders.columns:
    orders['order_date'] = pd.to_datetime(orders['order_date'])
    date_range = (orders['order_date'].max() - orders['order_date'].min()).days
    print(f"   Start: {orders['order_date'].min()}")
    print(f"   End:   {orders['order_date'].max()}")
    print(f"   Span:  {date_range} days ({date_range/30:.1f} months)")
    print(f"   Has temporal data: ✅ YES")
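Exact-row duplicates can be zero while a business key still repeats (the same order_id appearing twice with different values). It is worth auditing the key column separately — a self-contained sketch with a hypothetical mini-frame:

```python
import pandas as pd

df = pd.DataFrame({
    "order_id": ["ORD_1", "ORD_2", "ORD_2", "ORD_3"],
    "order_value": [10.0, 25.5, 30.0, 7.2],
})
exact_dups = df.duplicated().sum()            # identical full rows
key_dups = df["order_id"].duplicated().sum()  # repeated keys only
print(f"Exact duplicates: {exact_dups} | Key duplicates: {key_dups}")
```

Here the exact-row check reports 0 while the key check catches the repeated ORD_2.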

Step 2.4: Dataset Viability Score

python
# ============================================
# DATASET VIABILITY SCORE (10-point checklist)
# ============================================

print(f"\n{'=' * 65}")
print(f"📋 DATASET VIABILITY CHECKLIST")
print(f"{'=' * 65}")

# Evaluate each criterion
checks = {
    '1. Size ≥ 1,000 rows': orders.shape[0] >= 1000,
    '2. ≥ 8 columns (mix types)': orders.shape[1] >= 8,
    '3. Has date/time column': 'order_date' in orders.columns,
    '4. Missing ≤ 30% overall': overall_missing_pct <= 30,
    '5. Has documentation': True,  # Manual check
    '6. Business context clear': True,  # Manual check
    '7. Has target variable': True,  # Manual check (review_score, segment)
    '8. License OK': True,  # Manual check
    '9. Data ≤ 5 years old': True,  # Manual check
    '10. Not overused': True,  # Manual check
}

score = sum(checks.values())
for criterion, passed in checks.items():
    status = "✅" if passed else "❌"
    print(f"   {status} {criterion}")

print(f"\n{'─' * 65}")
print(f"   📊 SCORE: {score}/10")
print(f"   {'✅ PASS — Dataset suitable for capstone!' if score >= 7 else '❌ FAIL — Find another dataset'}")
print(f"{'=' * 65}")

Part 3: Write the Project Brief (15 min)

Step 3.1: Create the Project Brief

python
# ============================================
# PART 3: PROJECT BRIEF
# ============================================

project_brief = """
╔═══════════════════════════════════════════════════════════════╗
║                    📋 PROJECT BRIEF                           ║
╠═══════════════════════════════════════════════════════════════╣
║ Title: E-commerce Customer Segmentation & Revenue Insights    ║
║ Author: [YOUR NAME]                                           ║
║ Date: [TODAY'S DATE]                                          ║
╠═══════════════════════════════════════════════════════════════╣
║ 🎯 BUSINESS PROBLEM:                                         ║
║ An e-commerce marketplace needs to understand customer        ║
║ behavior to optimize marketing spend, improve delivery        ║
║ experience, and identify high-value customer segments.        ║
║                                                               ║
║ 📊 DATASET:                                                   ║
║ - Source: Kaggle / Generated sample                           ║
║ - Size: {rows:,} rows × {cols} columns                       ║
║ - Period: {start} → {end}                                    ║
║ - Key variables: order_value, product_category,               ║
║   delivery_days, review_score, customer_segment, city         ║
║                                                               ║
║ ❓ BUSINESS QUESTIONS:                                        ║
║ 1. Which customer segments have highest LTV?                  ║
║    → Marketing strategy per segment                           ║
║ 2. How does delivery time affect review ratings?              ║
║    → Delivery threshold for negative experience               ║
║ 3. What are revenue trends & seasonal patterns?               ║
║    → Inventory and marketing planning                         ║
║ 4. Which product categories are growing fastest?              ║
║    → Product mix optimization                                 ║
║ 5. What factors predict customer churn?                       ║
║    → Retention strategy for at-risk customers                 ║
║                                                               ║
║ 🔧 METHODS & TOOLS:                                          ║
║ - Analysis: EDA, RFM Segmentation, Correlation,              ║
║   Trend Analysis, optional ML (Logistic Regression)           ║
║ - Tools: Python (pandas, matplotlib, seaborn), SQL,           ║
║   Tableau / Power BI                                          ║
║ - Libraries: pandas, numpy, matplotlib, seaborn,              ║
║   scikit-learn (optional)                                     ║
║                                                               ║
║ 📦 DELIVERABLES:                                              ║
║ - [x] Jupyter Notebooks (analysis + code)                     ║
║ - [ ] Dashboard (Tableau / Power BI)                          ║
║ - [ ] Presentation (10-15 slides)                             ║
║ - [x] GitHub README (project documentation)                   ║
║                                                               ║
║ 📅 TIMELINE:                                                  ║
║ - Week 1: Data cleaning + EDA + Segmentation analysis         ║
║ - Week 2: Dashboard + Presentation + Polish repo              ║
╚═══════════════════════════════════════════════════════════════╝
""".format(
    rows=orders.shape[0],
    cols=orders.shape[1],
    start=orders['order_date'].min().strftime('%Y-%m-%d'),
    end=orders['order_date'].max().strftime('%Y-%m-%d')
)

print(project_brief)
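To make the brief a tracked artifact rather than just notebook output, write it to a file in the repo — a sketch; the `reports/project_brief.md` path is an assumption, and the placeholder string stands in for the `project_brief` variable built above:

```python
from pathlib import Path

# Placeholder — substitute the project_brief string built above
brief_text = "# Project Brief\n\n(placeholder)"
out = Path("reports") / "project_brief.md"  # assumed location; adjust to your layout
out.parent.mkdir(parents=True, exist_ok=True)
out.write_text(brief_text, encoding="utf-8")
print(f"✅ Brief saved: {out}")
```

Commit it alongside the data audit report so reviewers can find it without opening a notebook.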

Step 3.2: Questions → Methods Mapping

python
# ============================================
# QUESTIONS → METHODS MAPPING
# ============================================

mapping = pd.DataFrame({
    'Question': [
        'Q1: Customer segments by LTV?',
        'Q2: Delivery → Rating impact?',
        'Q3: Revenue trends & seasonality?',
        'Q4: Fastest growing categories?',
        'Q5: Churn prediction factors?'
    ],
    'Method': [
        'RFM Analysis + Clustering',
        'Correlation + Threshold Analysis',
        'Time Series Decomposition',
        'Growth Rate + Trend Analysis',
        'Logistic Regression (optional)'
    ],
    'Tool': [
        'Python (pandas, sklearn)',
        'Python (seaborn, scipy)',
        'Python (matplotlib, statsmodels)',
        'Python (pandas) + Tableau',
        'Python (sklearn)'
    ],
    'Output': [
        'Segment profiles + marketing recs',
        'Delivery threshold + improvement recs',
        'Forecast chart + seasonal calendar',
        'Category growth dashboard',
        'Risk factors + retention strategy'
    ],
    'Priority': ['P0', 'P0', 'P1', 'P1', 'P2']
})

print("📋 QUESTIONS → METHODS → TOOLS → OUTPUT MAPPING")
print("=" * 80)
print(mapping.to_string(index=False))
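The RFM method behind Q1 boils down to three aggregates per customer: days since last order (Recency), order count (Frequency), and total spend (Monetary). A minimal sketch on a toy transaction frame (quartile scoring and clustering come later; column names mirror the sample dataset):

```python
import pandas as pd

tx = pd.DataFrame({
    "customer_id": ["A", "A", "B", "C", "C", "C"],
    "order_date": pd.to_datetime([
        "2024-01-05", "2024-03-01", "2024-02-10",
        "2024-01-20", "2024-02-25", "2024-03-15",
    ]),
    "order_value": [50.0, 70.0, 20.0, 30.0, 40.0, 90.0],
})

snapshot = tx["order_date"].max() + pd.Timedelta(days=1)  # "today" reference
rfm = tx.groupby("customer_id").agg(
    recency=("order_date", lambda d: (snapshot - d.max()).days),
    frequency=("order_date", "count"),
    monetary=("order_value", "sum"),
)
print(rfm)
```

Customer C (recent, frequent, high spend) would score highest on all three dimensions.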

Part 4: Set Up Git Repo + README (25 min)

Step 4.1: Git Init & Folder Structure

Open a terminal and run the commands one by one:

bash
# ============================================
# PART 4: GIT SETUP
# ============================================

# 1. Configure Git (first-time setup only)
git config --global user.name "Your Name"
git config --global user.email "email@example.com"

# 2. Create the project folder
mkdir capstone-ecommerce-analysis
cd capstone-ecommerce-analysis

# 3. Init Git repository
git init

# 4. Create the folder structure (TDSP-inspired)
mkdir -p data/raw
mkdir -p data/processed
mkdir -p notebooks
mkdir -p dashboards
mkdir -p reports/figures
mkdir -p src

# 5. Verify structure
# Windows:
dir /s /b
# Mac/Linux:
# find . -type d

Step 4.2: Create .gitignore

bash
# ============================================
# CREATE .gitignore
# ============================================

# Create the .gitignore file (Windows PowerShell)
@"
# === Data files (too large for Git) ===
*.csv
*.xlsx
*.parquet
*.json
data/raw/

# === Jupyter ===
.ipynb_checkpoints/
*.ipynb_metadata/

# === Python ===
__pycache__/
*.pyc
*.pyo
.env
venv/
.venv/
*.egg-info/

# === OS ===
.DS_Store
Thumbs.db
desktop.ini
~$*

# === IDE ===
.vscode/
.idea/
*.swp
*.swo

# === Tableau ===
*.twbx
*.hyper
"@ | Out-File -Encoding utf8 .gitignore

# Or on Mac/Linux:
# cat > .gitignore << 'EOF'
# [same content as above]
# EOF

Step 4.3: Create README.md

bash
# ============================================
# CREATE README.md
# ============================================

# Create README.md (use a text editor or VS Code)
# code README.md

README.md contents:

markdown
# 📊 E-commerce Customer Segmentation & Revenue Insights

> Analysis of 10,000+ orders from an e-commerce marketplace — segmenting customers
> by RFM, analyzing delivery impact on ratings, and identifying growth categories.

![Project Status](https://img.shields.io/badge/Status-In%20Progress-yellow)
![Python](https://img.shields.io/badge/Python-3.11-blue)
![License](https://img.shields.io/badge/License-MIT-green)

<!-- ![Dashboard Screenshot](dashboards/dashboard_screenshot.png) -->

## 🎯 Business Problem

An e-commerce marketplace needs to understand customer behavior to:
- **Optimize marketing spend** by targeting the right customer segments
- **Improve delivery experience** by identifying pain point thresholds
- **Forecast revenue** for better inventory and budget planning

## 📊 Dataset

| Attribute | Detail |
|-----------|--------|
| **Source** | Kaggle / Generated sample |
| **Size** | 10,000 rows × 11 columns |
| **Period** | Jan 2024 — Dec 2024 |
| **Key Variables** | order_value, product_category, delivery_days, review_score, customer_segment |

## ❓ Key Questions

1. Which customer segments (RFM) have the highest LTV, and what marketing strategy suits each?
2. How does delivery time affect review ratings? What's the negative experience threshold?
3. What are the 12-month revenue trends and seasonal patterns?
4. Which product categories are growing fastest?
5. What factors predict customer churn?

## 🔧 Tools & Methods

- **Python**: pandas, numpy, matplotlib, seaborn, scikit-learn
- **SQL**: Data extraction and aggregation
- **BI**: Tableau / Power BI interactive dashboard
- **Methods**: EDA, RFM Segmentation, Correlation Analysis, Trend Analysis

## 📈 Key Findings

🔄 *In progress — will be updated after analysis*

## 💡 Recommendations

🔄 *In progress — will be updated after analysis*

## 📁 Project Structure

capstone-ecommerce-analysis/
├── README.md
├── .gitignore
├── requirements.txt
├── data/
│   ├── raw/           # Original data (not tracked)
│   └── processed/     # Cleaned data
├── notebooks/
│   ├── 01_data_audit.ipynb
│   ├── 02_data_cleaning.ipynb
│   ├── 03_eda.ipynb
│   ├── 04_segmentation.ipynb
│   └── 05_analysis.ipynb
├── dashboards/
│   └── dashboard_screenshot.png
├── reports/
│   ├── figures/
│   └── presentation.pdf
└── src/
    └── utils.py


## 🚀 How to Run

1. Clone: `git clone https://github.com/[username]/capstone-ecommerce-analysis.git`
2. Install: `pip install -r requirements.txt`
3. Run notebooks in order: 01 → 02 → 03 → 04 → 05

## 👤 Author

**[Your Name]** — Aspiring Data Analyst
- 🔗 [LinkedIn](https://linkedin.com/in/your-profile)
- 📧 your.email@example.com

Step 4.4: Create requirements.txt

bash
# ============================================
# CREATE requirements.txt
# ============================================

# Create requirements.txt (Windows PowerShell)
@"
pandas==2.1.4
numpy==1.26.2
matplotlib==3.8.2
seaborn==0.13.0
scikit-learn==1.3.2
jupyter==1.0.0
openpyxl==3.1.2
"@ | Out-File -Encoding utf8 requirements.txt

# Or, if packages are already installed:
# pip freeze > requirements.txt
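Before relying on the pins, you can sanity-check that the active environment actually matches them — a Python sketch (`check_pins` is a hypothetical helper; the two packages shown come from the file above):

```python
from importlib import metadata

def check_pins(pins: dict[str, str]) -> dict[str, str]:
    """Compare pinned versions against installed ones; return a status per package."""
    statuses = {}
    for pkg, want in pins.items():
        try:
            have = metadata.version(pkg)
            statuses[pkg] = "OK" if have == want else f"mismatch (installed {have})"
        except metadata.PackageNotFoundError:
            statuses[pkg] = "NOT INSTALLED"
    return statuses

for pkg, status in check_pins({"pandas": "2.1.4", "numpy": "1.26.2"}).items():
    print(f"{pkg}: {status}")
```

A mismatch here usually means you froze from a different environment than the one you run notebooks in.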

Step 4.5: First Commit + Push

bash
# ============================================
# FIRST COMMIT + PUSH TO GITHUB
# ============================================

# 1. Check status
git status

# 2. Add all files
git add .

# 3. First commit
git commit -m "feat: initial project setup — folder structure, README, .gitignore, requirements"

# 4. Verify commit
git log --oneline

# 5. Create the GitHub repo (in the browser: github.com/new)
# Name: capstone-ecommerce-analysis
# Visibility: Public
# Do NOT add README/.gitignore (already created locally)

# 6. Connect + Push
git remote add origin https://github.com/[username]/capstone-ecommerce-analysis.git
git branch -M main
git push -u origin main

# 7. Verify: open a browser → github.com/[username]/capstone-ecommerce-analysis

Step 4.6: Save the Data Audit to the Repo

python
# ============================================
# SAVE DATA AUDIT REPORT
# ============================================

# Save dataset (processed, for reference)
orders.to_csv('data/processed/orders_sample.csv', index=False)
print("✅ Dataset saved: data/processed/orders_sample.csv")

# Save data audit report as markdown
audit_report = f"""# 📊 Data Audit Report

## Dataset Overview
- **Source:** Kaggle / Generated sample
- **Size:** {orders.shape[0]:,} rows × {orders.shape[1]} columns
- **Date range:** {orders['order_date'].min().strftime('%Y-%m-%d')} → {orders['order_date'].max().strftime('%Y-%m-%d')}

## Data Types
| Column | Type | Unique | Missing |
|--------|------|--------|---------|
"""

for col in orders.columns:
    audit_report += f"| {col} | {orders[col].dtype} | {orders[col].nunique():,} | {orders[col].isnull().sum()} ({orders[col].isnull().mean()*100:.1f}%) |\n"

audit_report += f"""
## Viability Score
- **Score:** {score}/10
- **Result:** {'✅ PASS' if score >= 7 else '❌ FAIL'}

## Notes
- Delivery days: {orders['delivery_days'].isnull().sum()} missing values ({orders['delivery_days'].isnull().mean()*100:.1f}%) → impute with median
- Review score: {orders['review_score'].isnull().sum()} missing values ({orders['review_score'].isnull().mean()*100:.1f}%) → drop or impute
- Date range covers ~{(orders['order_date'].max() - orders['order_date'].min()).days} days
"""

with open('data/data_dictionary.md', 'w', encoding='utf-8') as f:
    f.write(audit_report)

print("✅ Data audit report saved: data/data_dictionary.md")
bash
# Commit data audit
git add data/data_dictionary.md data/processed/orders_sample.csv
git commit -m "data: add sample dataset and data audit report"
git push

Part 5: Final Verification (5 min)

Step 5.1: Final Checklist

python
# ============================================
# PART 5: FINAL VERIFICATION
# ============================================

import os

# Check all required files exist
required_files = [
    'README.md',
    '.gitignore',
    'requirements.txt',
    'data/data_dictionary.md',
    'data/processed/orders_sample.csv',
]

required_dirs = [
    'data/raw',
    'data/processed',
    'notebooks',
    'dashboards',
    'reports/figures',
    'src',
]

print("=" * 60)
print("📋 FINAL VERIFICATION CHECKLIST")
print("=" * 60)

print("\n📄 Required Files:")
for f in required_files:
    exists = os.path.exists(f)
    print(f"   {'✅' if exists else '❌'} {f}")

print("\n📁 Required Directories:")
for d in required_dirs:
    exists = os.path.isdir(d)
    print(f"   {'✅' if exists else '❌'} {d}")

# Check Git status
print("\n🔧 Git Status:")
print("   Run: git status")
print("   Run: git log --oneline")
print("   Run: git remote -v")

print("\n" + "=" * 60)
print("🎯 DAY 1 COMPLETE CHECKLIST:")
print("=" * 60)
checklist = [
    "☐ Topic chosen (DRBST ≥ 4/5)",
    "☐ Dataset downloaded + audited (≥ 7/10)",
    "☐ Project brief written (5 questions + methods)",
    "☐ Git repo initialized + pushed to GitHub",
    "☐ README.md v1 live on GitHub",
    "☐ .gitignore + requirements.txt committed",
    "☐ Data audit report saved",
    "☐ Folder structure TDSP-style",
]
for item in checklist:
    print(f"   {item}")

print(f"\n✅ Ready for Session 19: Data Cleaning + EDA + Analysis!")
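The manual git checks above can also be scripted. A sketch using subprocess — it only reports rather than asserts, since the right answers differ per repo, and the `git` helper is a hypothetical wrapper:

```python
import shutil
import subprocess

def git(*args: str) -> str:
    """Run a git subcommand; return stdout, or "" if git is missing or the call fails."""
    if shutil.which("git") is None:
        return ""
    result = subprocess.run(["git", *args], capture_output=True, text=True)
    return result.stdout.strip() if result.returncode == 0 else ""

dirty = git("status", "--porcelain")  # one line per uncommitted change
print("Working tree:", "clean" if not dirty else f"{len(dirty.splitlines())} uncommitted change(s)")
print("Remotes:", git("remote", "-v") or "(none configured)")
```

Run it from the repo root; "(none configured)" after Step 4.5 means the `git remote add origin` step was skipped.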

📊 Sample Output — What You Should Have

| Deliverable | File/Location | Status |
|-------------|---------------|--------|
| Git repo | github.com/[user]/capstone-ecommerce-analysis | ☐ Live |
| README.md | Root of repo | ☐ v1 committed |
| .gitignore | Root of repo | ☐ Committed |
| requirements.txt | Root of repo | ☐ Committed |
| Data audit report | data/data_dictionary.md | ☐ Committed |
| Dataset | data/processed/orders_sample.csv | ☐ Committed |
| Project brief | Notebook or document | ☐ Written |
| Folder structure | See TDSP layout | ☐ Created |

🎯 Next Up — Session 19

With the repo set up and the data audited, the next session covers:

  1. Data cleaning: Handle missing, duplicates, types
  2. EDA: Distributions, correlations, key patterns
  3. Analysis: Answer business questions 1-5
  4. Dashboard: Build interactive BI dashboard
Day 1 (Today): ✅ Kickoff — dataset + repo + brief
Day 2 (Next):  → Data cleaning + wrangling
Day 3-4:       → EDA + visualizations
Day 5-7:       → Deep analysis + answer questions
Day 8-9:       → Dashboard
Day 10-14:     → Presentation + polish

💡 Commit every day!

Commit at least once per working session. Green squares on GitHub are evidence that you're staying active:

bash
git add .
git commit -m "feat: [short description]"
git push