
🏆 Standards: Capstone Project Kickoff

These standards help you kick off the capstone project with the right process, the right structure, and reproducible results: TDSP for the project lifecycle, Git Workflow for version control, and Reproducible Analysis for transparency.

Overview of Session 18 standards

Session 18 starts the Capstone Project, from choosing a topic and collecting data to setting up the repository. For the project to be portfolio-ready, it should follow 3 industry standards:

  1. TDSP (Microsoft): Team Data Science Process → a project lifecycle process for data science/analytics
  2. Git Workflow Standards: Conventional Commits + Branching Strategy → professional version control
  3. Reproducible Analysis: reproducibility standards for analytical work → others can verify the results

📋 Related standards

| # | Standard | Organization / Author | Applies to Session 18 |
|---|----------|-----------------------|-----------------------|
| 1 | TDSP | Microsoft (2016-present) | Project lifecycle, folder structure, documentation |
| 2 | Conventional Commits + Git Flow | Git community / Vincent Driessen | Commit messages, branching, version control workflow |
| 3 | Reproducible Analysis | rOpenSci / Turing Way / Wilson et al. | Environment management, documentation, code practices |

1️⃣ TDSP — Team Data Science Process (Microsoft)

Introduction

TDSP (Team Data Science Process) is a framework developed by Microsoft (2016, continuously updated) that provides a structured lifecycle for data science and analytics projects. TDSP differs from CRISP-DM (Session 17) in its focus on team collaboration, a standardized project structure, and Git integration, which makes it a good fit for a capstone portfolio project.

TDSP has 5 lifecycle stages and is iterative and data-driven:

mermaid
flowchart TD
    A["1️⃣ Business<br/>Understanding"] --> B["2️⃣ Data Acquisition<br/>& Understanding"]
    B --> C["3️⃣ Modeling"]
    C --> D["4️⃣ Deployment"]
    D --> E["5️⃣ Customer<br/>Acceptance"]
    E -->|"Iterate"| A
    B -->|"Need clarity"| A
    C -->|"Need more data"| B

5 Lifecycle Stages

| Stage | Purpose | Key Activities | Output for Capstone |
|-------|---------|----------------|---------------------|
| 1. Business Understanding | Define project goals, metrics, stakeholders | Define objectives, identify data sources, create project plan | Project Brief document |
| 2. Data Acquisition & Understanding | Collect data, run EDA, assess quality | Ingest data, explore, clean, feature engineering | Data Audit Report, Clean Dataset |
| 3. Modeling | Build and evaluate models/analyses | Feature selection, model training, evaluation | Analysis Notebooks, Model Results |
| 4. Deployment | Deploy results to production/portfolio | Dashboard, report, API, presentation | Dashboard + Presentation |
| 5. Customer Acceptance | Stakeholder review, handoff | Present findings, document handoff, feedback | Final Presentation + Peer Review |

TDSP Standardized Project Structure

TDSP defines a standard folder structure for every data project, which applies directly to the capstone:

📁 TDSP Project Structure (adapted for Capstone)
├── 📄 README.md                    ← Project overview (charter)
├── 📄 .gitignore
├── 📁 docs/
│   ├── 📄 project_charter.md      ← Business understanding
│   ├── 📄 data_dictionary.md      ← Data documentation
│   ├── 📄 data_report.md          ← Data quality & EDA summary
│   └── 📄 final_report.md         ← Analysis findings & recs
├── 📁 data/
│   ├── 📁 raw/                    ← Original, immutable data
│   ├── 📁 processed/              ← Cleaned, transformed data
│   └── 📁 external/               ← Third-party data sources
├── 📁 notebooks/                   ← Jupyter notebooks (numbered)
├── 📁 src/                         ← Source code (reusable)
│   ├── 📄 data_processing.py
│   ├── 📄 feature_engineering.py
│   └── 📄 visualization.py
├── 📁 reports/                     ← Generated reports, slides
│   ├── 📁 figures/                ← Saved plots/charts
│   └── 📄 presentation.pdf
└── 📄 requirements.txt            ← Python dependencies
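
The layout above can be created by hand, but a small script makes the setup repeatable. The following is a minimal sketch, not part of TDSP itself: the function name `scaffold` and the two lists are hypothetical, and it only creates the directories and empty placeholder files shown in the tree (the `src/` modules and `presentation.pdf` are left for you to add).

```python
from pathlib import Path

# Directories from the adapted TDSP layout above.
TDSP_DIRS = [
    "docs",
    "data/raw",
    "data/processed",
    "data/external",
    "notebooks",
    "src",
    "reports/figures",
]

# Empty placeholder files; fill them in as the project progresses.
TDSP_FILES = [
    "README.md",
    ".gitignore",
    "requirements.txt",
    "docs/project_charter.md",
    "docs/data_dictionary.md",
    "docs/data_report.md",
    "docs/final_report.md",
]


def scaffold(root: Path) -> list[Path]:
    """Create the folder layout under `root` and return the paths created."""
    created = []
    for rel in TDSP_DIRS:
        path = root / rel
        if not path.exists():
            path.mkdir(parents=True)
            created.append(path)
    for rel in TDSP_FILES:
        path = root / rel
        if not path.exists():
            path.touch()  # existing files are never overwritten
            created.append(path)
    return created
```

Calling `scaffold(Path("capstone"))` a second time returns an empty list, so the script is safe to rerun. Git does not track empty directories, so folders such as `data/raw/` only appear in the repository once they contain at least one file.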

TDSP Project Charter Template

Microsoft recommends that every project start with a Project Charter, a 1-2 page document:

PROJECT CHARTER
===============
1. Project Title: [A fitting, descriptive name]
2. Business Background: [Why this project? Which stakeholders?]
3. Scope:
   - In scope: [Which analyses will be done?]
   - Out of scope: [What will NOT be done?]
4. Metrics: [Success metrics: accuracy, business KPI impact]
5. Architecture:
   - Data: [Sources, size, format]
   - Tools: [Python, SQL, BI tool]
   - Deliverables: [Notebook, dashboard, presentation]
6. Timeline: [Weekly milestones]
7. Team: [Roles; in the capstone, you fill all roles]
8. Communication: [Update schedule, feedback loops]

TDSP Checklist for the Capstone

STAGE 1 — BUSINESS UNDERSTANDING:
✅ Project Charter completed
✅ 3-5 specific business questions
✅ Success metrics defined
✅ Scope clearly bounded (in vs out)

STAGE 2 — DATA ACQUISITION & UNDERSTANDING:
✅ Dataset sourced + downloaded
✅ Data dictionary created
✅ Data quality report (audit)
✅ EDA completed, patterns documented

STAGE 3 — MODELING:
✅ Analysis/models aligned with business questions
✅ Results validated + interpreted
✅ Limitations documented

STAGE 4 — DEPLOYMENT:
✅ Dashboard deployed (Tableau Public / Power BI)
✅ GitHub repo organized per TDSP structure
✅ README updated with findings

STAGE 5 — CUSTOMER ACCEPTANCE:
✅ Presentation rehearsed
✅ Peer review completed
✅ Feedback incorporated

2️⃣ Git Workflow Standards — Conventional Commits + Branching

Introduction

Git Workflow Standards cover 2 conventions that matter for DA projects:

  1. Conventional Commits: a standard for consistent, machine-readable commit messages
  2. Git Flow / GitHub Flow: a branching strategy for project development

Even though the capstone is a solo project, following Git standards shows that you work professionally; hiring managers check the commit history.

Conventional Commits Specification

Format:

<type>[optional scope]: <description>

[optional body]

[optional footer]

Common types for a DA project (`data` and `viz` are project-specific additions; the specification allows types beyond `feat` and `fix`):

| Type | Description | Example |
|------|-------------|---------|
| feat | New feature / analysis | feat: add RFM customer segmentation |
| fix | Bug / error fix | fix: correct revenue calculation formula |
| data | Data-related changes | data: add cleaned dataset v2 |
| docs | Documentation only | docs: update README with key findings |
| viz | Visualization changes | viz: create monthly revenue trend chart |
| refactor | Code restructure | refactor: extract helper functions to utils.py |
| style | Formatting, no logic change | style: consistent chart color palette |
| test | Adding/updating tests | test: add data validation checks |
| chore | Maintenance tasks | chore: update .gitignore, add requirements.txt |
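
Because the header format is machine-readable, it can be checked automatically. The sketch below is one possible check, not an official validator: `ALLOWED_TYPES`, `HEADER_RE`, and `is_conventional` are hypothetical names, the type list is simply the one from the table above, and only the first line (`<type>[optional scope]: <description>`) is examined.

```python
import re

# Types taken from the table in this section. `data` and `viz` are
# project-specific; the specification itself only mandates feat and fix.
ALLOWED_TYPES = ("feat", "fix", "data", "docs", "viz", "refactor", "style", "test", "chore")

# <type>[optional scope]: <description>
HEADER_RE = re.compile(
    r"^(?P<type>" + "|".join(ALLOWED_TYPES) + r")"
    r"(?:\((?P<scope>[^()\s]+)\))?"   # optional scope in parentheses
    r": (?P<description>\S.*)$"       # colon, one space, non-empty description
)


def is_conventional(message: str) -> bool:
    """Return True if the first line of `message` matches the header format."""
    lines = message.splitlines()
    return bool(lines) and HEADER_RE.match(lines[0]) is not None


if __name__ == "__main__":
    for msg in ("feat: add RFM customer segmentation", "fix stuff"):
        print(f"{msg!r}: {is_conventional(msg)}")
```

A function like this could be called from a Git `commit-msg` hook so that non-conforming messages are rejected before they reach the history; wiring that up is left as an exercise.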

Commit Message Best Practices

# ❌ BAD commits (non-informative):
"update"
"fix stuff"
"WIP"
"."
"final version"
"final final version v2 REAL FINAL"

# ✅ GOOD commits (Conventional Commits):
"feat: add exploratory data analysis notebook with distribution plots"
"data: clean missing values in customer_age using median imputation"
"viz: create Tableau dashboard with 4 tabs — overview, segments, delivery, revenue"
"docs: add data dictionary with descriptions for all 15 columns"
"fix: remove duplicate orders (2,341 rows) from analysis dataset"

Branching Strategy for DA

Simplified GitHub Flow (recommended for the capstone):

mermaid
gitGraph
    commit id: "init"
    commit id: "docs: README"
    branch eda
    commit id: "feat: EDA notebook"
    commit id: "viz: distribution plots"
    checkout main
    merge eda id: "merge: EDA complete"
    branch analysis
    commit id: "feat: segmentation"
    commit id: "feat: delivery analysis"
    checkout main
    merge analysis id: "merge: Analysis done"
    commit id: "docs: final README"

| Branch | Purpose | When to use |
|--------|---------|-------------|
| main | Stable, presentable version | Only merge tested code |
| eda | Exploratory analysis | Data cleaning + EDA phase |
| analysis | Deep analysis / modeling | When answering specific questions |
| dashboard | BI development | Dashboard build phase |
| docs | Documentation updates | README, reports |

Git Workflow Checklist

DAILY WORKFLOW:
✅ git status: check the state before you start working
✅ git add <files>: stage specific files (do NOT default to git add . every time)
✅ git commit -m "type: description": small, frequent commits
✅ git push: push to the remote at the end of every session
✅ git log --oneline: review the commit history

PROJECT WORKFLOW:
✅ Set up .gitignore BEFORE the first commit
✅ One branch per major phase (eda, analysis, dashboard)
✅ Merge into main when a phase is complete
✅ Tag releases: v0.1 (data ready), v0.5 (analysis done), v1.0 (complete)
✅ Update the README at every new milestone

3️⃣ Reproducible Analysis — Standards & Practices

Introduction

Reproducible Analysis is the principle that anyone can rerun your analysis and get the SAME results. It is a core standard in scientific research and is increasingly important in DA, especially when an analysis influences business decisions.

According to The Turing Way (Alan Turing Institute, 2024): "Reproducibility is the minimum standard for credible research and analysis. If it can't be reproduced, it shouldn't be trusted."

For the capstone project, reproducibility means that if a hiring manager clones the repo and runs the code, the results must be identical to the output you presented.

5 Pillars of Reproducible Analysis

mermaid
flowchart TD
    A["🔄 Reproducible Analysis"] --> B["1️⃣ Environment<br/>Management"]
    A --> C["2️⃣ Data<br/>Management"]
    A --> D["3️⃣ Code<br/>Practices"]
    A --> E["4️⃣ Documentation"]
    A --> F["5️⃣ Version<br/>Control"]

| Pillar | Principle | Implementation for Capstone |
|--------|-----------|-----------------------------|
| 1. Environment | Pin dependencies, isolate environment | requirements.txt with versions, venv or conda env |
| 2. Data | Raw data immutable, transformations scripted | data/raw/ (never modify), data/processed/ (generated by code) |
| 3. Code | No manual steps, everything in scripts/notebooks | Notebooks run sequentially (01 → 02 → 03), no "run this cell first" chaos |
| 4. Documentation | Assumptions, decisions, limitations documented | Markdown cells, data dictionary, README |
| 5. Version Control | Track changes, meaningful history | Git commits, tags, branching |
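
The "raw data immutable" principle is easier to keep when it can be checked. One way to do that, sketched here under the assumption that raw files live in `data/raw/`, is to record a SHA-256 checksum per file and compare later. The helper names `sha256_of`, `build_manifest`, and `changed_files` are hypothetical and use only the Python standard library.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large datasets do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(raw_dir: Path) -> dict[str, str]:
    """Map each file under `raw_dir` (relative POSIX path) to its SHA-256."""
    return {
        path.relative_to(raw_dir).as_posix(): sha256_of(path)
        for path in sorted(raw_dir.rglob("*"))
        if path.is_file()
    }


def changed_files(raw_dir: Path, manifest: dict[str, str]) -> list[str]:
    """Return files that were added, removed, or modified since `manifest`."""
    current = build_manifest(raw_dir)
    names = set(current) | set(manifest)
    return sorted(name for name in names if current.get(name) != manifest.get(name))
```

A typical use would be to save the manifest (for example as JSON in `docs/`) when the data is first downloaded, then run `changed_files` at the top of the first notebook; an empty list means the raw data is still untouched.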

Environment Management

bash
# === Create a virtual environment ===
python -m venv venv
source venv/bin/activate      # Mac/Linux
# .\venv\Scripts\activate     # Windows (run this instead of the line above)

# === Install packages ===
pip install pandas==2.1.4 numpy==1.26.2 matplotlib==3.8.2 seaborn==0.13.0 scikit-learn==1.3.2

# === Freeze requirements ===
pip freeze > requirements.txt

# === Reproduce on another machine ===
pip install -r requirements.txt

Sample requirements.txt for the capstone:

pandas==2.1.4
numpy==1.26.2
matplotlib==3.8.2
seaborn==0.13.0
scikit-learn==1.3.2
jupyter==1.0.0
openpyxl==3.1.2
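
Pinning only helps if every line is actually pinned. The following is a rough sketch of a check for that, assuming the simple `name==version` style used in the sample above. `PINNED_RE` and `unpinned_requirements` are hypothetical names, and other forms that pip accepts (URLs, editable installs, environment markers) are deliberately reported as unpinned rather than parsed.

```python
import re

# A pinned line looks like `name==version`, optionally with extras: name[extra]==1.2.3
PINNED_RE = re.compile(r"^[A-Za-z0-9][A-Za-z0-9._-]*(\[[^\]]+\])?==[^=\s]+$")


def unpinned_requirements(text: str) -> list[str]:
    """Return requirement lines that are not pinned with `==`."""
    problems = []
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if not PINNED_RE.match(line):
            problems.append(line)
    return problems


if __name__ == "__main__":
    sample = "pandas==2.1.4\nnumpy>=1.26\nseaborn\n"
    print(unpinned_requirements(sample))
```

Running it against the contents of `requirements.txt` before tagging a release gives a quick signal that the environment can be rebuilt with the same versions.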

Code Practices for Reproducibility

| Practice | Why? | Example |
|----------|------|---------|
| Set random seed | Ensures the same results on every run | np.random.seed(42) |
| Relative paths | Code runs on other machines | pd.read_csv('data/raw/data.csv') NOT C:\Users\... |
| No hardcoded values | Easy to change + transparent | THRESHOLD = 0.05 at the top of the notebook |
| Functions, not copy-paste | DRY principle, maintainable | def clean_data(df): ... |
| Sequential notebooks | Clear execution order | 01_cleaning.ipynb → 02_eda.ipynb → 03_analysis.ipynb |
| Regenerable output | Charts/reports are created from code | plt.savefig('reports/figures/chart1.png') |
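
Several of these practices fit together in a few lines. The sketch below uses only the standard library so it runs without installing anything; in a real notebook the seeded generator would typically be NumPy's (`np.random.seed(42)` as in the table), and the names `SEED`, `SAMPLE_SIZE`, `draw_sample`, and `resolve` are illustrative assumptions.

```python
import random
from pathlib import Path

# Constants at the top instead of magic numbers scattered through the code.
SEED = 42
SAMPLE_SIZE = 5
RAW_DATA = Path("data") / "raw" / "data.csv"   # relative path, as in the table
FIGURES_DIR = Path("reports") / "figures"


def draw_sample(population: list[int], size: int, seed: int) -> list[int]:
    """Draw a sample with a local, seeded generator so reruns match."""
    rng = random.Random(seed)
    return rng.sample(population, size)


def resolve(project_root: Path, relative: Path) -> Path:
    """Build a path from the project root rather than a machine-specific prefix."""
    return project_root / relative


if __name__ == "__main__":
    first = draw_sample(list(range(100)), SAMPLE_SIZE, SEED)
    second = draw_sample(list(range(100)), SAMPLE_SIZE, SEED)
    print(first == second)  # True: same seed, same sample
    print(resolve(Path.cwd(), RAW_DATA))
```

Using a local generator (`random.Random(seed)`) rather than seeding global state keeps the result independent of what other cells or modules have already drawn.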

Reproducibility Checklist for the Capstone

╔═══════════════════════════════════════════════════════════════╗
║          REPRODUCIBILITY CHECKLIST — CAPSTONE                 ║
╠═══════════════════════════════════════════════════════════════╣
║  ENVIRONMENT:                                                 ║
║  ☐ requirements.txt with pinned versions                      ║
║  ☐ Python version documented (≥ 3.9)                          ║
║  ☐ Virtual environment instructions in README                 ║
║                                                               ║
║  DATA:                                                        ║
║  ☐ Raw data unmodified (or download link)                     ║
║  ☐ Data processing fully scripted (no manual Excel edits)     ║
║  ☐ Data dictionary (column descriptions)                      ║
║                                                               ║
║  CODE:                                                        ║
║  ☐ Random seed set                                            ║
║  ☐ Relative paths only                                        ║
║  ☐ Notebooks numbered + sequential                            ║
║  ☐ No "magic numbers" (constants at top)                      ║
║  ☐ Functions for repeated logic                               ║
║                                                               ║
║  DOCUMENTATION:                                               ║
║  ☐ README "How to Run" section                                ║
║  ☐ Assumptions documented                                     ║
║  ☐ Limitations acknowledged                                   ║
║                                                               ║
║  VERSION CONTROL:                                             ║
║  ☐ .gitignore comprehensive                                   ║
║  ☐ Meaningful commit messages                                 ║
║  ☐ No secrets/keys in repo                                    ║
╚═══════════════════════════════════════════════════════════════╝

📊 Comparing the 3 standards

| Aspect | TDSP | Git Workflow | Reproducible Analysis |
|--------|------|--------------|-----------------------|
| Focus | Project lifecycle & structure | Version control & collaboration | Correctness & verifiability |
| Origin | Microsoft Azure ML team | Open-source Git community | Scientific research community |
| Key benefit for capstone | Professional folder structure | Clean commit history impresses hiring managers | Hiring manager can verify your work |
| Most important element | Project Charter + TDSP folder structure | Conventional Commits + consistent pushing | requirements.txt + relative paths + seed |
| Time investment | 30 min setup | 5 min per commit session | 15 min initial setup |
| ROI | Portfolio looks professional | Evidence of consistent work habits | Builds trust in your results |

🔑 Summary

| # | Standard | Key Action for Capstone |
|---|----------|-------------------------|
| 1 | TDSP | Create a Project Charter + a TDSP-style folder structure |
| 2 | Git Workflow | Conventional Commits, small + frequent commits, branch per phase |
| 3 | Reproducible Analysis | requirements.txt + seed + relative paths + sequential notebooks |

💡 Why do standards matter for a portfolio?

Hiring managers do not only evaluate results; they evaluate the process. A capstone project that follows TDSP, a Git workflow, and reproducibility standards shows that you work like a professional, not just that you can code.