

Step 1 – Data & Cohorts: deciding which cases are in your study


Here you decide which patients, slides, or blocks are part of your project and collect all the key information about them in one place. In practice this means building one or more cohort tables (usually CSVs) with patient IDs, slide IDs, diagnoses, outcomes, and other clinicopathologic variables you care about—backed by a clear folder structure and consistent paths.
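A master cohort table can start as a small pandas DataFrame saved to CSV. The sketch below is minimal and the column names are illustrative, not prescriptive; your project will have its own fields:

```python
import pandas as pd

# Illustrative columns only -- a real cohort table will carry whatever
# clinicopathologic variables your study needs.
cohort = pd.DataFrame(
    {
        "patient_id": ["P001", "P001", "P002"],
        "slide_id": ["P001-S1", "P001-S2", "P002-S1"],
        "diagnosis": ["IDC", "IDC", "DCIS"],
        "er_status": ["positive", "positive", "negative"],
        "followup_years": [6.2, 6.2, 3.1],
    }
)

# Everything downstream reads this one file instead of ad-hoc spreadsheets.
cohort.to_csv("cohort.csv", index=False)
```

Note that one patient can contribute several slides (two rows share `patient_id` P001), which is exactly why splits later need to happen at the patient level.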

Technical name: Data & Cohorts

Everything done before you open slides under the digital microscope or run models:

  • Decide which patients/cases are in your study.
  • Ensure you know which slide belongs to which patient (and how many slides per case).
  • Merge file inventories with LIS/clinical tables.
  • Track diagnosis, treatment, outcomes, biomarkers, and more.
  • Define and lock in patient‑level train/test splits to avoid leakage later.

By the end of this step you should have:

  • a clean project folder layout (PFP‑01), and
  • at least one master cohort.csv (DAT‑01) that everything else reads from.

With these in place, you can answer questions like:

  • “Give me all ER+ breast cancer cases with at least 5 years of follow‑up.”
  • “Which TCGA cases actually have whole‑slide images available?”
  • “How many normal controls do I have for this tumor group?”

Typical tasks in this step:

  • Download public datasets (for example TCGA, CAMELYON) and save them into a standard project structure.
  • Inventory slides on disk by crawling storage (NAS, external drives, cloud mounts).
  • Clean spreadsheets so IDs are consistent (patient IDs, slide barcodes, accession numbers).
  • Merge slide inventories with clinical/LIS exports into a single table.
  • Define inclusion/exclusion criteria (for example prior chemotherapy, missing diagnosis) and apply them in code.
  • Perform patient‑level splits (train/test, and optionally validation) and record them in the cohort file.
  • Build simple, reusable case tables that multiple experiments can share.

Common tools:

  • GDC Data Portal / gdc‑client — download TCGA cases and slides.
  • cBioPortal / UCSC Xena — explore cohorts by mutation, expression, survival.
  • Python pathlib — define PROJECT_ROOT, DATA_DIR, and standard folders once per project.
  • Pandas (Python) — like Excel in code for filtering/merging tables and saving cohort CSVs.
  • scikit‑learn group splitters — create patient‑level train/test/validation splits without leakage.
  • Plain CSV + spreadsheet — still fine for small projects or manual review, as long as they are generated reproducibly.
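The core workflow above — inventory slides, merge with a clinical export, apply inclusion criteria in code — can be sketched in a few lines of pandas. The tables here are toy rows so the example runs anywhere; in a real project the inventory comes from crawling storage with pathlib (for example `Path("data/slides").glob("*.svs")`) and the clinical table from `pd.read_csv` on an LIS export. All column names are assumptions:

```python
import pandas as pd

# 1. Slide inventory -- normally built by crawling disk with pathlib;
#    toy rows stand in for the crawl here.
inventory = pd.DataFrame({"slide_id": ["P001-S1", "P002-S1", "P003-S1"]})

# 2. Clinical/LIS export -- normally pd.read_csv("clinical_export.csv");
#    columns are hypothetical.
clinical = pd.DataFrame(
    {
        "slide_id": ["P001-S1", "P002-S1"],
        "patient_id": ["P001", "P002"],
        "er_status": ["positive", "negative"],
        "followup_years": [6.2, 4.0],
    }
)

# 3. Merge: an inner join keeps only slides that actually have clinical rows,
#    answering "which cases on disk have usable annotations?"
cohort = inventory.merge(clinical, on="slide_id", how="inner")

# 4. Inclusion criteria in code: ER+ with at least 5 years of follow-up.
selected = cohort[
    (cohort["er_status"] == "positive") & (cohort["followup_years"] >= 5)
]
print(selected["patient_id"].tolist())  # ['P001']
```

Because the criteria live in code rather than in manual spreadsheet edits, the same filter can be re-run whenever the inventory or clinical export changes.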

This is your case log and study cohort spreadsheet, but made precise enough for a computer to follow and re‑run:

  • Think of PFP‑01 as designing the filing cabinets for the study.
  • DAT‑01 is accessioning: every slide and patient is logged, filtered, and assigned to train vs test.
  • Downstream steps (Slides & Viewing, Preprocessing, ML) simply look up rows in cohort.csv instead of hunting through folders.

The guides in this step:

  • PFP-01: Project folder & path blocks — set PROJECT_ROOT, define standard folders, create them, and confirm your images are visible before building cohort tables.
  • DAT-01: Master cohort CSV — crawl slides on disk, merge with clinical tables, apply patient-level exclusions, and create a leakage-safe train/test split in a single cohort.csv.
  • Build the dataset (pandas) — extend and inspect cohort tables, apply additional exclusions, and run summary statistics without leaving Python.
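The leakage-safe split that DAT-01 calls for can be made with scikit-learn's group splitters: grouping by `patient_id` guarantees that all slides from one patient land on the same side. A minimal sketch with toy rows (column names are illustrative):

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy cohort: patient P001 contributes two slides, the others one each.
cohort = pd.DataFrame(
    {
        "slide_id": ["P001-S1", "P001-S2", "P002-S1", "P003-S1", "P004-S1"],
        "patient_id": ["P001", "P001", "P002", "P003", "P004"],
    }
)

# Split at the patient level: groups= makes the splitter assign whole
# patients, never individual slides, to train or test.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(cohort, groups=cohort["patient_id"]))

# Record the split in the cohort table itself, as DAT-01 recommends.
cohort["split"] = "train"
cohort.loc[cohort.index[test_idx], "split"] = "test"

# Sanity check: no patient appears in both splits.
train_patients = set(cohort.loc[cohort["split"] == "train", "patient_id"])
test_patients = set(cohort.loc[cohort["split"] == "test", "patient_id"])
assert not train_patients & test_patients
```

Saving the `split` column into cohort.csv (rather than re-splitting on the fly) is what locks the split in, so every later experiment sees the same train/test assignment.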