🏆 Standards: Capstone Project Kickoff
These standards help you kick off the capstone project with the right process and structure, and in a reproducible way: TDSP for the project lifecycle, Git Workflow for version control, and Reproducible Analysis for transparency.
Overview of Session 18 standards
Session 18 starts the Capstone Project, from choosing a topic and collecting data to setting up the repository. For the project to be portfolio-ready, follow three industry standards:
- TDSP (Microsoft): Team Data Science Process → a project lifecycle process for data science and analytics
- Git Workflow Standards: Conventional Commits + Branching Strategy → professional version control
- Reproducible Analysis: reproducibility standards for analytical work → other people can verify your results
📋 Related standards
| # | Standard | Organization / Author | Applies to Session 18 |
|---|---|---|---|
| 1 | TDSP | Microsoft (2016-present) | Project lifecycle, folder structure, documentation |
| 2 | Conventional Commits + Git Flow | Git community / Vincent Driessen | Commit messages, branching, version control workflow |
| 3 | Reproducible Analysis | rOpenSci / Turing Way / Wilson et al. | Environment management, documentation, code practices |
1️⃣ TDSP: Team Data Science Process (Microsoft)
Introduction
TDSP (Team Data Science Process) is a framework that Microsoft introduced in 2016 and has continued to update. It provides a structured lifecycle for data science and analytics projects. Compared with CRISP-DM (Session 17), TDSP puts more emphasis on team collaboration, a standardized project structure, and Git integration, which makes it a good fit for a capstone portfolio project.
TDSP has 5 lifecycle stages, which are iterative and data-driven:
mermaid
flowchart TD
A["1️⃣ Business<br/>Understanding"] --> B["2️⃣ Data Acquisition<br/>& Understanding"]
B --> C["3️⃣ Modeling"]
C --> D["4️⃣ Deployment"]
D --> E["5️⃣ Customer<br/>Acceptance"]
E -->|"Iterate"| A
B -->|"Need clarity"| A
C -->|"Need more data"| B
5 Lifecycle Stages
| Stage | Purpose | Key Activities | Capstone Output |
|---|---|---|---|
| 1. Business Understanding | Define project goals, metrics, and stakeholders | Define objectives, identify data sources, create project plan | Project Brief document |
| 2. Data Acquisition & Understanding | Collect data, run EDA, assess quality | Ingest data, explore, clean | Data Audit Report, Clean Dataset |
| 3. Modeling | Build and evaluate models/analyses | Feature engineering and selection, model training, evaluation | Analysis Notebooks, Model Results |
| 4. Deployment | Put results into production or the portfolio | Dashboard, report, API, presentation | Dashboard + Presentation |
| 5. Customer Acceptance | Stakeholder review, handoff | Present findings, document handoff, feedback | Final Presentation + Peer Review |
TDSP Standardized Project Structure
TDSP defines a standard folder structure for every data project, and the adapted version in this section applies directly to the capstone.
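The adapted layout in this section can be scaffolded with a short script rather than created by hand. The following is a minimal sketch using only the Python standard library. The function name `scaffold_project` and the empty placeholder files are illustrative assumptions, not something TDSP prescribes.

```python
from pathlib import Path

# Folders and placeholder files mirroring the adapted capstone tree.
# The names come from the tree in this section; the helper is hypothetical.
FOLDERS = [
    "docs",
    "data/raw",
    "data/processed",
    "data/external",
    "notebooks",
    "src",
    "reports/figures",
]
FILES = [
    "README.md",
    ".gitignore",
    "requirements.txt",
    "docs/project_charter.md",
    "docs/data_dictionary.md",
    "docs/data_report.md",
    "docs/final_report.md",
    "src/data_processing.py",
    "src/feature_engineering.py",
    "src/visualization.py",
]


def scaffold_project(root):
    """Create the folders and empty placeholder files under `root`.

    Existing paths are left untouched, so rerunning is safe.
    Returns the list of paths that were newly created.
    """
    root = Path(root)
    created = []
    for folder in FOLDERS:
        path = root / folder
        if not path.exists():
            path.mkdir(parents=True)
            created.append(path)
    for name in FILES:
        path = root / name
        if not path.exists():
            path.parent.mkdir(parents=True, exist_ok=True)
            path.touch()
            created.append(path)
    return created
```

Calling `scaffold_project("capstone-project")` (a hypothetical folder name) creates the tree once and returns an empty list on later runs. Git does not track empty directories, so folders such as `data/raw` only show up in the repository once they contain at least one file.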
📁 TDSP Project Structure (adapted for Capstone)
├── 📄 README.md ← Project overview (charter)
├── 📄 .gitignore
├── 📁 docs/
│ ├── 📄 project_charter.md ← Business understanding
│ ├── 📄 data_dictionary.md ← Data documentation
│ ├── 📄 data_report.md ← Data quality & EDA summary
│ └── 📄 final_report.md ← Analysis findings & recs
├── 📁 data/
│ ├── 📁 raw/ ← Original, immutable data
│ ├── 📁 processed/ ← Cleaned, transformed data
│ └── 📁 external/ ← Third-party data sources
├── 📁 notebooks/ ← Jupyter notebooks (numbered)
├── 📁 src/ ← Source code (reusable)
│ ├── 📄 data_processing.py
│ ├── 📄 feature_engineering.py
│ └── 📄 visualization.py
├── 📁 reports/ ← Generated reports, slides
│ ├── 📁 figures/ ← Saved plots/charts
│ └── 📄 presentation.pdf
└── 📄 requirements.txt ← Python dependencies
TDSP Project Charter Template
Microsoft recommends starting every project with a Project Charter, a 1-2 page document:
PROJECT CHARTER
===============
1. Project Title: [A fitting, descriptive name]
2. Business Background: [Why this project? Which stakeholders?]
3. Scope:
- In scope: [Which analyses will be done?]
- Out of scope: [What will NOT be done?]
4. Metrics: [Success metrics, e.g. accuracy, business KPI impact]
5. Architecture:
- Data: [Sources, size, format]
- Tools: [Python, SQL, BI tool]
- Deliverables: [Notebook, dashboard, presentation]
6. Timeline: [Weekly milestones]
7. Team: [Roles; in the capstone you fill all roles]
8. Communication: [Update schedule, feedback loops]
TDSP Checklist for Capstone
STAGE 1 — BUSINESS UNDERSTANDING:
✅ Project Charter hoàn thành
✅ 3-5 business questions cụ thể
✅ Success metrics defined
✅ Scope clearly bounded (in vs out)
STAGE 2 — DATA ACQUISITION & UNDERSTANDING:
✅ Dataset sourced + downloaded
✅ Data dictionary created
✅ Data quality report (audit)
✅ EDA completed, patterns documented
STAGE 3 — MODELING:
✅ Analysis/models aligned with business questions
✅ Results validated + interpreted
✅ Limitations documented
STAGE 4 — DEPLOYMENT:
✅ Dashboard deployed (Tableau Public / Power BI)
✅ GitHub repo organized per TDSP structure
✅ README updated with findings
STAGE 5 — CUSTOMER ACCEPTANCE:
✅ Presentation rehearsed
✅ Peer review completed
✅ Feedback incorporated
2️⃣ Git Workflow Standards: Conventional Commits + Branching
Introduction
Git Workflow Standards cover two conventions that matter for DA projects:
- Conventional Commits: a standard for consistent, machine-readable commit messages
- Git Flow / GitHub Flow: a branching strategy for project development
Even though the capstone is a solo project, following Git standards shows that you work professionally, and hiring managers may review your commit history.
Conventional Commits Specification
Format:
<type>[optional scope]: <description>
[optional body]
[optional footer]
Common types for a DA project (the specification itself defines only feat and fix and permits additional types; data and viz are custom types for analytics work):
| Type | Description | Example |
|---|---|---|
| feat | New feature / analysis | feat: add RFM customer segmentation |
| fix | Bug or error fix | fix: correct revenue calculation formula |
| data | Data-related changes | data: add cleaned dataset v2 |
| docs | Documentation only | docs: update README with key findings |
| viz | Visualization changes | viz: create monthly revenue trend chart |
| refactor | Code restructure | refactor: extract helper functions to utils.py |
| style | Formatting, no logic change | style: consistent chart color palette |
| test | Adding/updating tests | test: add data validation checks |
| chore | Maintenance tasks | chore: update .gitignore, add requirements.txt |
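Because the header format is machine-readable, it can be checked automatically. The sketch below validates only the first line of a message against the `<type>[optional scope]: <description>` pattern and the types from the table. The function name `check_commit_header` and the exact regular expression are assumptions made for illustration, not part of the specification.

```python
import re

# Types taken from the table; `data` and `viz` are project-specific additions.
ALLOWED_TYPES = (
    "feat", "fix", "data", "docs", "viz", "refactor", "style", "test", "chore",
)

# <type>[optional scope]: <description>, checked on the first line only.
HEADER_RE = re.compile(
    r"^(?P<type>[a-z]+)(\((?P<scope>[^()\s]+)\))?: (?P<description>\S.*)$"
)


def check_commit_header(message):
    """Return (ok, reason) for the first line of a commit message."""
    header = message.splitlines()[0] if message.strip() else ""
    match = HEADER_RE.match(header)
    if match is None:
        return False, "header does not match '<type>[optional scope]: <description>'"
    if match.group("type") not in ALLOWED_TYPES:
        return False, f"unknown type '{match.group('type')}'"
    return True, "ok"


for message in ["feat: add RFM customer segmentation", "fix stuff", "wip: try things"]:
    print(message, "->", check_commit_header(message))
```

A check like this could be called from a Git `commit-msg` hook so that non-conforming messages are rejected before they reach the history. That wiring is left out here.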
Commit Message Best Practices
# ❌ BAD commits (non-informative):
"update"
"fix stuff"
"WIP"
"."
"final version"
"final final version v2 REAL FINAL"
# ✅ GOOD commits (Conventional Commits):
"feat: add exploratory data analysis notebook with distribution plots"
"data: clean missing values in customer_age using median imputation"
"viz: create Tableau dashboard with 4 tabs: overview, segments, delivery, revenue"
"docs: add data dictionary with descriptions for all 15 columns"
"fix: remove duplicate orders (2,341 rows) from analysis dataset"
Branching Strategy for DA
Simplified GitHub Flow (recommended for the capstone):
mermaid
gitGraph
commit id: "init"
commit id: "docs: README"
branch eda
commit id: "feat: EDA notebook"
commit id: "viz: distribution plots"
checkout main
merge eda id: "merge: EDA complete"
branch analysis
commit id: "feat: segmentation"
commit id: "feat: delivery analysis"
checkout main
merge analysis id: "merge: Analysis done"
commit id: "docs: final README"
| Branch | Purpose | When to use |
|---|---|---|
| main | Stable, presentable version | Only merge tested code |
| eda | Exploratory analysis | Data cleaning + EDA phase |
| analysis | Deep analysis / modeling | When answering specific questions |
| dashboard | BI development | Dashboard build phase |
| docs | Documentation updates | README, reports |
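One phase of this branching strategy comes down to a short, repeatable sequence of Git commands. The sketch below only builds and prints that sequence; it does not run anything. The helper name `phase_commands`, the staged `notebooks/` path, and the use of `--no-ff` (to keep an explicit merge commit, as in the graph) are assumptions for illustration.

```python
import shlex


def phase_commands(branch, commit_message, base="main"):
    """Return the git commands for one capstone phase as argument lists.

    Nothing is executed here; the plan can be printed for review or passed
    to subprocess.run one command at a time.
    """
    return [
        ["git", "checkout", base],
        ["git", "checkout", "-b", branch],        # start the phase branch
        ["git", "add", "notebooks/"],             # stage specific paths (assumed folder)
        ["git", "commit", "-m", commit_message],
        ["git", "checkout", base],
        ["git", "merge", "--no-ff", branch],      # keep an explicit merge commit
        ["git", "push", "origin", base],
    ]


for command in phase_commands("eda", "feat: EDA notebook"):
    print(shlex.join(command))
```

Without `--no-ff`, Git would fast-forward `main` whenever it has no new commits of its own, and the per-phase merge points would not appear in the history.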
Git Workflow Checklist
DAILY WORKFLOW:
✅ git status: check the working tree state before you start
✅ git add <files>: stage specific files (do NOT reach for git add . every time)
✅ git commit -m "type: description": make small, frequent commits
✅ git push: push to the remote at the end of every session
✅ git log --oneline: review the commit history
PROJECT WORKFLOW:
✅ Set up .gitignore BEFORE the first commit
✅ One branch per major phase (eda, analysis, dashboard)
✅ Merge into main when a phase is complete
✅ Tag releases: v0.1 (data ready), v0.5 (analysis done), v1.0 (complete)
✅ Update the README at every new milestone
3️⃣ Reproducible Analysis: Standards & Practices
Introduction
Reproducible Analysis is the principle that anyone can rerun your analysis and get the SAME results. It is a core standard in scientific research and is increasingly important in DA, especially when an analysis influences business decisions.
According to The Turing Way (Alan Turing Institute, 2024): "Reproducibility is the minimum standard for credible research and analysis. If it can't be reproduced, it shouldn't be trusted."
For the capstone project, reproducibility means that if a hiring manager clones the repo and runs the code, the results must match the output you present.
5 Pillars of Reproducible Analysis
mermaid
flowchart TD
A["🔄 Reproducible Analysis"] --> B["1️⃣ Environment<br/>Management"]
A --> C["2️⃣ Data<br/>Management"]
A --> D["3️⃣ Code<br/>Practices"]
A --> E["4️⃣ Documentation"]
A --> F["5️⃣ Version<br/>Control"]
| Pillar | Principle | Capstone Implementation |
|---|---|---|
| 1. Environment | Pin dependencies, isolate environment | requirements.txt with pinned versions, venv or conda env |
| 2. Data | Raw data immutable, transformations scripted | data/raw/ (never modify), data/processed/ (generated by code) |
| 3. Code | No manual steps, everything in scripts/notebooks | Notebooks run sequentially (01 → 02 → 03), no "run this cell first" chaos |
| 4. Documentation | Assumptions, decisions, limitations documented | Markdown cells, data dictionary, README |
| 5. Version Control | Track changes, meaningful history | Git commits, tags, branching |
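The Data pillar (raw data immutable, transformations scripted) can be made checkable in code. The sketch below assumes a CSV input and uses only the standard library. The function names and the cleaning rule (dropping exact duplicate rows) are illustrative assumptions. The point is that the processed file is generated by code and the raw file's checksum is the same before and after.

```python
import csv
import hashlib
from pathlib import Path


def file_sha256(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_processed(raw_path, processed_path):
    """Write a cleaned copy of the raw CSV without modifying the raw file.

    Cleaning rule (illustrative assumption): drop exact duplicate rows.
    Returns the number of data rows written, excluding the header.
    """
    raw_path, processed_path = Path(raw_path), Path(processed_path)
    before = file_sha256(raw_path)

    with open(raw_path, newline="", encoding="utf-8") as source:
        reader = csv.reader(source)
        header = next(reader)
        seen, rows = set(), []
        for row in reader:
            key = tuple(row)
            if key not in seen:
                seen.add(key)
                rows.append(row)

    processed_path.parent.mkdir(parents=True, exist_ok=True)
    with open(processed_path, "w", newline="", encoding="utf-8") as target:
        writer = csv.writer(target)
        writer.writerow(header)
        writer.writerows(rows)

    # The raw file must be byte-for-byte identical after processing.
    assert file_sha256(raw_path) == before, "raw data was modified"
    return len(rows)
```

Recording the raw file's checksum in the README or data dictionary gives anyone who downloads the dataset a way to confirm they are starting from the same bytes.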
Environment Management
bash
# === Create a virtual environment ===
python -m venv venv
source venv/bin/activate # Mac/Linux
# .\venv\Scripts\activate # Windows (run this instead of the line above)
# === Install packages ===
pip install pandas==2.1.4 numpy==1.26.2 matplotlib==3.8.2 seaborn==0.13.0 scikit-learn==1.3.2
# === Freeze requirements ===
pip freeze > requirements.txt
# === Reproduce on another machine ===
pip install -r requirements.txt
Sample requirements.txt for the capstone:
pandas==2.1.4
numpy==1.26.2
matplotlib==3.8.2
seaborn==0.13.0
scikit-learn==1.3.2
jupyter==1.0.0
openpyxl==3.1.2
Code Practices for Reproducibility
| Practice | Why? | Example |
|---|---|---|
| Set random seed | Ensures the same results on every run | np.random.seed(42) |
| Relative paths | Code runs on other machines | pd.read_csv('data/raw/data.csv'), NOT C:\Users\... |
| No hardcoded values | Easy to change + transparent | THRESHOLD = 0.05 at the top of the notebook |
| Functions, not copy-paste | DRY principle, maintainable | def clean_data(df): ... |
| Sequential notebooks | Clear execution order | 01_cleaning.ipynb → 02_eda.ipynb → 03_analysis.ipynb |
| Output regeneratable | Charts/reports are generated from code | plt.savefig('reports/figures/chart1.png') |
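Several of these practices fit into one small module. The sketch below is dependency-free on purpose: it uses the standard library's `random` module where a notebook would typically call `np.random.seed(42)`. The names `SEED`, `SAMPLE_SIZE`, `sample_mean`, and `figure_path` are hypothetical and only illustrate the pattern.

```python
import random
import statistics
from pathlib import Path

# Constants at the top instead of magic numbers scattered through the code.
SEED = 42
SAMPLE_SIZE = 5
RAW_FILE = Path("data") / "raw" / "data.csv"     # relative to the repo root
FIGURES_DIR = Path("reports") / "figures"


def sample_mean(values, sample_size=SAMPLE_SIZE, seed=SEED):
    """Mean of a random sample; a fixed seed makes the sample repeatable."""
    rng = random.Random(seed)  # local generator, no hidden global state
    return statistics.mean(rng.sample(values, sample_size))


def figure_path(name):
    """Relative output path for a regenerated chart, e.g. for plt.savefig."""
    return FIGURES_DIR / name


values = list(range(100))
first = sample_mean(values)
second = sample_mean(values)
print(first == second)  # same seed, same sample, same result
print(figure_path("chart1.png").as_posix())
```

Passing the seed as a parameter instead of seeding once at import time keeps each function's result independent of the order in which cells or functions run.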
Reproducibility Checklist for Capstone
╔═══════════════════════════════════════════════════════════════╗
║  REPRODUCIBILITY CHECKLIST (CAPSTONE)                         ║
╠═══════════════════════════════════════════════════════════════╣
║  ENVIRONMENT:                                                 ║
║    ☐ requirements.txt with pinned versions                    ║
║    ☐ Python version documented (≥ 3.9)                        ║
║    ☐ Virtual environment instructions in README               ║
║                                                               ║
║  DATA:                                                        ║
║    ☐ Raw data unmodified (or download link)                   ║
║    ☐ Data processing fully scripted (no manual Excel edits)   ║
║    ☐ Data dictionary (column descriptions)                    ║
║                                                               ║
║  CODE:                                                        ║
║    ☐ Random seed set                                          ║
║    ☐ Relative paths only                                      ║
║    ☐ Notebooks numbered + sequential                          ║
║    ☐ No "magic numbers" (constants at top)                    ║
║    ☐ Functions for repeated logic                             ║
║                                                               ║
║  DOCUMENTATION:                                               ║
║    ☐ README "How to Run" section                              ║
║    ☐ Assumptions documented                                   ║
║    ☐ Limitations acknowledged                                 ║
║                                                               ║
║  VERSION CONTROL:                                             ║
║    ☐ .gitignore comprehensive                                 ║
║    ☐ Meaningful commit messages                               ║
║    ☐ No secrets/keys in repo                                  ║
╚═══════════════════════════════════════════════════════════════╝
📊 Comparison of the 3 standards
| Aspect | TDSP | Git Workflow | Reproducible Analysis |
|---|---|---|---|
| Focus | Project lifecycle & structure | Version control & collaboration | Correctness & verifiability |
| Origin | Microsoft Azure ML team | Open-source Git community | Scientific research community |
| Key benefit for capstone | Professional folder structure | Clean commit history impresses hiring managers | Hiring manager can verify your work |
| Most important element | Project Charter + TDSP folder structure | Conventional Commits + consistent pushing | requirements.txt + relative paths + seed |
| Time investment | 30 min setup | 5 min per commit session | 15 min initial setup |
| ROI | Portfolio looks professional | Evidence of consistent work habits | Builds trust in your results |
🔑 Summary
| # | Standard | Key Action for Capstone |
|---|---|---|
| 1 | TDSP | Create a Project Charter + TDSP-standard folder structure |
| 2 | Git Workflow | Conventional Commits, small + frequent commits, branch per phase |
| 3 | Reproducible Analysis | requirements.txt + seed + relative paths + sequential notebooks |
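Some of these key actions can be verified mechanically before a milestone. The sketch below checks a repository folder for a few of them: required files and folders, pinned requirements, and numbered notebooks. The function name `check_repo`, the list of required paths, and the two regular expressions are simplifying assumptions. For instance, only plain `name==version` lines count as pinned.

```python
import re
from pathlib import Path

REQUIRED_PATHS = ["README.md", ".gitignore", "requirements.txt", "data/raw", "notebooks"]
PINNED_RE = re.compile(r"^[A-Za-z0-9._-]+==[A-Za-z0-9.]+$")    # e.g. pandas==2.1.4
NUMBERED_NOTEBOOK_RE = re.compile(r"^\d{2}_.+\.ipynb$")         # e.g. 01_cleaning.ipynb


def check_repo(root):
    """Return a list of problems found; an empty list means all checks passed."""
    root = Path(root)
    problems = []

    for relative in REQUIRED_PATHS:
        if not (root / relative).exists():
            problems.append(f"missing: {relative}")

    requirements = root / "requirements.txt"
    if requirements.is_file():
        for line in requirements.read_text(encoding="utf-8").splitlines():
            line = line.strip()
            if line and not line.startswith("#") and not PINNED_RE.match(line):
                problems.append(f"unpinned dependency: {line}")

    notebooks = root / "notebooks"
    if notebooks.is_dir():
        for notebook in sorted(notebooks.glob("*.ipynb")):
            if not NUMBERED_NOTEBOOK_RE.match(notebook.name):
                problems.append(f"notebook not numbered: {notebook.name}")

    return problems
```

Running `check_repo(".")` from the repository root before tagging a release gives a quick list of gaps. It does not replace the manual checklists, which also cover things code cannot judge, such as documented assumptions.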
💡 Why do standards matter for a portfolio?
Hiring managers evaluate not only the results but also the process. A capstone project that follows TDSP, a Git workflow, and reproducibility standards shows that you work like a professional, not just that you can code.