Data & Cohorts
Step 1 – Data & Cohorts: deciding which cases are in your study
Here you decide which patients, slides, or blocks are part of your project and collect all the key information about them in one place. In practice this means building a “cohort table” (usually a CSV or spreadsheet) with IDs, diagnoses, outcomes, and any other clinicopathologic variables you care about.
Technical name: Data & Cohorts
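As a rough illustration of the cohort table described above, the sketch below builds a tiny table in pandas and round-trips it through CSV. All rows and column names (`patient_id`, `slide_id`, `er_status`, and so on) are made-up placeholders, not a required schema.

```python
import io

import pandas as pd

# Hypothetical example rows; the column names are illustrative only.
cohort = pd.DataFrame(
    {
        "patient_id": ["P001", "P002", "P003"],
        "slide_id": ["S001", "S002", "S003"],
        "diagnosis": ["IDC", "ILC", "IDC"],
        "er_status": ["positive", "negative", "positive"],
        "followup_years": [6.2, 3.1, 5.0],
        "vital_status": ["alive", "deceased", "alive"],
    }
)

# This toy table has one row per slide, so slide IDs must not repeat.
assert cohort["slide_id"].is_unique

# Round-trip through CSV text (an in-memory buffer stands in for a file).
buffer = io.StringIO()
cohort.to_csv(buffer, index=False)
buffer.seek(0)

# Read IDs back as text so they are never reinterpreted as numbers.
reloaded = pd.read_csv(buffer, dtype={"patient_id": str, "slide_id": str})
print(reloaded.head())
```

Reading the ID columns as strings is a deliberate choice here: purely numeric IDs would otherwise be parsed as integers and could lose leading zeros.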
What this is
Everything done before you open slides under the digital microscope:
- Decide which patients/cases are in your study.
- Ensure you know which slide belongs to which patient.
- Track diagnosis, treatment, outcomes, biomarkers, and more.
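One way to make the slide-to-patient link explicit is to derive the patient ID from the slide filename. The sketch below assumes TCGA-style filenames, where the first three hyphen-separated barcode fields identify the participant. The filenames and the helper `tcga_patient_id` are hypothetical; for other datasets the mapping usually has to come from a lookup table instead.

```python
from pathlib import Path

import pandas as pd


def tcga_patient_id(slide_filename: str) -> str:
    """Return the first three barcode fields, e.g. 'TCGA-XX-0001'."""
    # The barcode is the part of the file name before the first dot.
    barcode = Path(slide_filename).name.split(".")[0]
    return "-".join(barcode.split("-")[:3])


# Made-up filenames that only follow the TCGA naming pattern.
slide_files = [
    "TCGA-XX-0001-01Z-00-DX1.hypothetical-uuid-1.svs",
    "TCGA-XX-0001-01A-01-TS1.hypothetical-uuid-2.svs",
    "TCGA-XX-0002-01Z-00-DX1.hypothetical-uuid-3.svs",
]

slides = pd.DataFrame({"slide_file": slide_files})
slides["patient_id"] = slides["slide_file"].map(tcga_patient_id)

# Count slides per patient to spot patients with several slides.
slides_per_patient = slides.groupby("patient_id")["slide_file"].count()
print(slides_per_patient)
```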
Typical questions
- “Give me all ER+ breast cancer cases with at least 5 years of follow‑up.”
- “Which TCGA cases actually have whole‑slide images available?”
- “How many normal controls do I have for this tumor group?”
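Questions like these translate fairly directly into table filters. The following is a minimal sketch on invented data; the column names (`tissue_type`, `er_status`, `followup_years`) and their coding are assumptions, and a real cohort table may encode them differently.

```python
import pandas as pd

# Hypothetical case table; values are invented for illustration.
cases = pd.DataFrame(
    {
        "patient_id": ["P001", "P002", "P003", "P004", "P005"],
        "tissue_type": ["tumor", "tumor", "tumor", "normal", "normal"],
        "er_status": ["positive", "positive", "negative", "not_tested", "not_tested"],
        "followup_years": [6.2, 4.0, 7.5, 2.0, 5.5],
    }
)

# "All ER+ cases with at least 5 years of follow-up."
er_pos_5y = cases[
    (cases["tissue_type"] == "tumor")
    & (cases["er_status"] == "positive")
    & (cases["followup_years"] >= 5)
]

# "How many normal controls do I have?"
n_normal = int((cases["tissue_type"] == "normal").sum())

print(er_pos_5y["patient_id"].tolist(), n_normal)
```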
Common tasks
- Download public datasets (TCGA, CAMELYON, etc.).
- Clean spreadsheets so IDs are consistent.
- Define inclusion/exclusion criteria and save the cohort list.
- Build a simple case table reused across projects.
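The cleaning and inclusion/exclusion tasks above might look like the sketch below. The ID normalization rules, the criteria (invasive diagnoses only, at least 2 years of follow-up), and the data are all illustrative assumptions; your study protocol defines the real ones.

```python
import pandas as pd

# Hypothetical messy export: stray spaces, mixed case, a duplicate row.
raw = pd.DataFrame(
    {
        "patient_id": [" p001", "P002 ", "p-003", "P002", "P004"],
        "diagnosis": ["IDC", "IDC", "ILC", "IDC", "DCIS"],
        "followup_years": [6.0, 1.5, 8.0, 1.5, 5.5],
    }
)

# Normalize IDs: trim whitespace, upper-case, drop hyphens.
clean = raw.copy()
clean["patient_id"] = (
    clean["patient_id"].str.strip().str.upper().str.replace("-", "", regex=False)
)
clean = clean.drop_duplicates(subset="patient_id", keep="first")

# Record why each case is excluded instead of silently dropping it.
clean["exclusion_reason"] = ""
out_of_scope = ~clean["diagnosis"].isin(["IDC", "ILC"])
clean.loc[out_of_scope, "exclusion_reason"] = "diagnosis not in scope"
short_followup = (clean["exclusion_reason"] == "") & (clean["followup_years"] < 2)
clean.loc[short_followup, "exclusion_reason"] = "follow-up < 2 years"

# The final cohort is every case without an exclusion reason.
cohort = clean[clean["exclusion_reason"] == ""].drop(columns="exclusion_reason")

# CSV text of the cohort list; pass a file path to to_csv to save it.
csv_text = cohort.to_csv(index=False)
print(csv_text)
```

Keeping an `exclusion_reason` column makes it easy to report how many cases were dropped at each step.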
Core tools (examples)
- GDC Data Portal / gdc‑client — download TCGA cases and slides.
- cBioPortal / UCSC Xena — explore cohorts by mutation, expression, survival.
- Pandas (Python) — like Excel in code for filtering/merging tables.
- Plain CSV + spreadsheet — totally fine for small projects.
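To make the "Excel in code" comparison concrete, here is a small sketch of a pandas merge, roughly the equivalent of a spreadsheet lookup between a clinical table and a slide inventory. The tables and column names are invented; `indicator=True` and `validate` are standard `DataFrame.merge` options.

```python
import pandas as pd

# Hypothetical clinical table: one row per patient.
clinical = pd.DataFrame(
    {
        "patient_id": ["P001", "P002", "P003"],
        "diagnosis": ["IDC", "ILC", "IDC"],
    }
)

# Hypothetical slide inventory: possibly several slides per patient.
slides = pd.DataFrame(
    {
        "patient_id": ["P001", "P001", "P003", "P999"],
        "slide_file": ["a.svs", "b.svs", "c.svs", "orphan.svs"],
    }
)

# An outer merge keeps unmatched rows from both sides; the _merge column
# says where each row came from. validate raises an error if a patient
# appears more than once in the clinical table.
merged = clinical.merge(
    slides, on="patient_id", how="outer", indicator=True, validate="one_to_many"
)

cases_without_slides = merged.loc[merged["_merge"] == "left_only", "patient_id"].tolist()
slides_without_case = merged.loc[merged["_merge"] == "right_only", "slide_file"].tolist()
usable = merged[merged["_merge"] == "both"].drop(columns="_merge")

print(cases_without_slides, slides_without_case, len(usable))
```

The same pattern answers questions such as which cases actually have whole-slide images available.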
Clinician mental model
This is your case log and study cohort spreadsheet, but made precise enough for a computer to follow.
Ready-to-use code
- PFP-01: Project folder & path blocks — set PROJECT_ROOT, define standard folders, create them, and confirm your images are visible before building cohort tables.
- Build the dataset (pandas) — inventory slides, inspect cohort tables, apply exclusions, and split datasets without leaving Python.
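For orientation only, a project-folder setup along these lines could be sketched as follows. This is not the actual PFP-01 block: the folder names, the extension list, and the temporary `PROJECT_ROOT` are assumptions chosen so the example runs anywhere without touching real data.

```python
import tempfile
from pathlib import Path

# Assumption: a throwaway folder for the demo. In a real project,
# replace this with the path to your own project directory.
PROJECT_ROOT = Path(tempfile.mkdtemp(prefix="cohort_demo_"))

# Hypothetical standard layout; adjust names to your own conventions.
FOLDERS = {
    "slides": PROJECT_ROOT / "data" / "slides",
    "cohort": PROJECT_ROOT / "data" / "cohort",
    "results": PROJECT_ROOT / "results",
}

# Create each folder if it does not exist yet.
for folder in FOLDERS.values():
    folder.mkdir(parents=True, exist_ok=True)

# Check that slide images are visible (empty in this fresh demo folder).
SLIDE_EXTENSIONS = {".svs", ".ndpi", ".tif", ".tiff"}
slide_paths = sorted(
    p for p in FOLDERS["slides"].rglob("*") if p.suffix.lower() in SLIDE_EXTENSIONS
)
print(f"Found {len(slide_paths)} slide file(s) under {FOLDERS['slides']}")
```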