Data & Cohorts – DAT-01: Building the Master Cohort CSV
Technical name: Cohort Curation & Splitting
Build a single, auditable cohort.csv that knows:
- which slides exist on disk
- which patient each slide belongs to
- which cases should be excluded
- which group (train vs test) each patient is assigned to
All later steps (viewing, preprocessing, modeling) should read from this file—not from ad‑hoc folders.
What it is
A set of Python blocks that:
- crawls your project storage to find every slide,
- merges that list with your clinical/diagnosis table,
- filters out bad or excluded cases, and
- performs a patient‑level split into train vs test.
Why a clinician would want to use it
- Safety: Ensures you never accidentally train on patients who should be excluded (for example, prior chemotherapy or wrong diagnosis).
- Leakage prevention: Guarantees that if Patient A has 3 slides, all 3 end up in the same group (train or test), so the AI cannot “cheat” by seeing nearly identical slides in both sets.
- Audit trail: Creates a single cohort.csv that serves as the source of truth for your project. You can always review who was included, excluded, and why.
Prerequisite
Run PFP‑01: Project Folder & Path Blocks first so that PROJECT_ROOT, DATA_DIR, RAW_IMAGES_DIR, and METADATA_DIR are defined.
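If you are unsure whether PFP‑01 has been run in the current session, a small guard can report which variables are missing. The sketch below is not part of PFP‑01. The helper name missing_path_variables is hypothetical, and it assumes the four variables are pathlib Path objects, which is how the blocks on this page use them.

```python
from pathlib import Path

# Names this page expects PFP-01 to have defined.
REQUIRED_NAMES = ["PROJECT_ROOT", "DATA_DIR", "RAW_IMAGES_DIR", "METADATA_DIR"]


def missing_path_variables(namespace):
    """Return the required names that are absent or are not Path objects."""
    return [
        name for name in REQUIRED_NAMES
        if not isinstance(namespace.get(name), Path)
    ]


# In a notebook you would pass globals(); here we use a toy namespace
# in which only two of the four variables exist.
example_namespace = {"PROJECT_ROOT": Path("."), "DATA_DIR": Path("data")}
missing = missing_path_variables(example_namespace)
print(f"Missing path variables: {missing}")
```

In a notebook, calling missing_path_variables(globals()) before Block A gives a readable message instead of a NameError halfway through the crawl.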
Lego Building Blocks (Code)
Block A: The Crawler (Find the files)
Goal: Scan your raw data folder and list every Whole Slide Image (WSI). This bridges the physical world (files on disk) to the digital world (a structured table in Python).
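One pitfall worth checking once the inventory exists: slide_id is taken from the filename without its extension, so two files in different subfolders (or with different extensions) can end up with the same ID. The sketch below runs on a toy table with made-up paths; the variable names df_inventory and df_duplicates are illustrative only.

```python
import pandas as pd

# Toy inventory with hypothetical paths; two files share the stem "S23-4092".
df_inventory = pd.DataFrame({
    "slide_id": ["S23-4092", "S23-4093", "S23-4092"],
    "full_path": [
        "/nas/batch1/S23-4092.svs",
        "/nas/batch1/S23-4093.svs",
        "/nas/batch2/S23-4092.tiff",
    ],
})

# keep=False marks every member of a duplicate group, not just the repeats.
dup_mask = df_inventory["slide_id"].duplicated(keep=False)
df_duplicates = df_inventory[dup_mask].sort_values("slide_id")

if df_duplicates.empty:
    print("No duplicate slide IDs found.")
else:
    n_ids = df_duplicates["slide_id"].nunique()
    print(f"{n_ids} slide ID(s) appear more than once:")
    print(df_duplicates)
```

Duplicates are best resolved by hand (rescans, copies, renamed files) before merging, since a repeated slide_id would otherwise multiply rows in the join.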
import pandas as pd
from pathlib import Path

# 1. Define where your raw slides live (the "NAS")
# We reuse RAW_IMAGES_DIR from PFP-01; adjust if your layout differs.
RAW_DATA_DIR = RAW_IMAGES_DIR

# 2. Crawl the folder for common WSI extensions
# We use glob patterns to find .svs, .ndpi, .tif, .tiff (recursive in subfolders)
extensions = ["*.svs", "*.ndpi", "*.tif", "*.tiff"]
file_list = []

for ext in extensions:
    # rglob means "recursive glob" (look in subfolders too)
    for file_path in RAW_DATA_DIR.rglob(ext):
        # Optional but recommended: store a path relative to PROJECT_ROOT for portability.
        # This assumes RAW_DATA_DIR sits inside PROJECT_ROOT; otherwise relative_to raises ValueError.
        relative_path = file_path.relative_to(PROJECT_ROOT)

        file_list.append({
            "slide_id": file_path.stem,  # filename without extension (e.g. "S23-4092")
            "full_path": str(file_path),  # complete path for the code to find it later
            "project_rel_path": relative_path.as_posix(),  # portable path from PROJECT_ROOT
            "extension": file_path.suffix,
        })

# 3. Convert to a DataFrame
# Naming the columns keeps the table well-formed even if no slides were found.
df_slides = pd.DataFrame(
    file_list,
    columns=["slide_id", "full_path", "project_rel_path", "extension"],
)

print(f"Found {len(df_slides)} slides on disk.")
print(df_slides.head())

Block B: The Merge & Filter (Connect clinical data)
Goal: Merge the slide inventory with clinical data (the LIS extract) and apply your inclusion/exclusion criteria.
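An inner join silently drops slides that have no clinical row, and a plain filter silently drops excluded cases. If you want both steps to leave a trace, the following sketch shows one possible approach on toy tables: an outer join with pandas' indicator and validate options, followed by a reason column instead of an immediate drop. The column name exclusion_reason and the toy table names are hypothetical.

```python
import pandas as pd

# Toy tables standing in for the slide inventory and the clinical extract.
df_slides_toy = pd.DataFrame({"slide_id": ["S1", "S2", "S3"]})
df_clinical_toy = pd.DataFrame({
    "slide_id": ["S1", "S2", "S4"],
    "patient_id": ["P1", "P1", "P2"],
    "prior_chemo": [False, True, False],
})

# indicator=True adds a "_merge" column with the values
# "both", "left_only" (slide without clinical row) and
# "right_only" (clinical row without slide).
# validate="one_to_one" raises an error if slide_id repeats on either side.
df_audit = pd.merge(
    df_slides_toy,
    df_clinical_toy,
    on="slide_id",
    how="outer",
    indicator=True,
    validate="one_to_one",
)
print(df_audit["_merge"].value_counts())

# Keep matched rows, then record a reason instead of dropping rows outright.
df_matched = df_audit[df_audit["_merge"] == "both"].drop(columns="_merge").copy()
df_matched["exclusion_reason"] = ""  # hypothetical column; empty means "included"
df_matched.loc[df_matched["prior_chemo"] == True, "exclusion_reason"] = "prior_chemo"
print(df_matched[["slide_id", "patient_id", "exclusion_reason"]])
```

Rows with a non-empty exclusion_reason can be written to a separate file or kept in the cohort table with a flag, so that the reason for each exclusion stays reviewable.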
# 1. Load your clinical data (the "LIS")
# In real life, this is an Excel or CSV extract from hospital IT.
# Here we create a small dummy table for demonstration.
clinical_data = {
    "slide_id": ["S23-4092", "S23-4093", "S23-4094", "S23-4095", "S23-4096"],
    "patient_id": ["PAT_001", "PAT_001", "PAT_002", "PAT_003", "PAT_004"],
    "diagnosis": ["Tumor", "Tumor", "Normal", "Tumor", "Tumor"],
    "prior_chemo": [False, False, False, True, False],  # exclusion criterion
}
df_clinical = pd.DataFrame(clinical_data)

# 2. Merge the islands
# We assume 'slide_id' is the common key (barcode).
# 'inner' join means: only keep slides that HAVE clinical data.
df_cohort = pd.merge(df_slides, df_clinical, on="slide_id", how="inner")

print(f"Slides with clinical data: {len(df_cohort)}")

# 3. Apply filters (the CONSORT logic)
# Example: exclude patients who had prior chemotherapy.
# reset_index gives the filtered table a clean 0..n-1 index, so row positions
# and index labels agree in the split step.
df_clean = df_cohort[df_cohort["prior_chemo"] == False].copy()
df_clean = df_clean.reset_index(drop=True)

print(f"Slides after removing chemo cases: {len(df_clean)}")

Block C: The Patient-Level Split (The safety lock)
Goal: Split by patient, not by slide. All slides from a single patient stay together in either the training or testing group to prevent data leakage.
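To see what the patient-level split protects against, the sketch below compares a naive slide-level split with a grouped split on a toy cohort. Each toy patient has two slides and the naive split holds out three slides, so at least one patient must have one slide on each side. The helper shared_patients is hypothetical and only reports patients present in both sets.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit, ShuffleSplit

# Toy cohort: 5 patients with 2 slides each.
df_toy = pd.DataFrame({
    "slide_id": [f"S{i}" for i in range(10)],
    "patient_id": [f"P{i // 2}" for i in range(10)],
})


def shared_patients(df, train_pos, test_pos):
    """Patients with slides on both sides of a split (row positions in, set out)."""
    train = set(df.iloc[train_pos]["patient_id"])
    test = set(df.iloc[test_pos]["patient_id"])
    return train & test


# Naive split: 3 slides are held out without regard to patient.
naive = ShuffleSplit(n_splits=1, test_size=3, random_state=0)
naive_train, naive_test = next(naive.split(df_toy))

# Grouped split: whole patients are assigned to one side.
grouped = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
group_train, group_test = next(
    grouped.split(df_toy, groups=df_toy["patient_id"])
)

print("Shared patients, naive split:  ", shared_patients(df_toy, naive_train, naive_test))
print("Shared patients, grouped split:", shared_patients(df_toy, group_train, group_test))
```

The grouped split reports an empty set by construction, while the naive split reports at least one shared patient in this setup.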
from sklearn.model_selection import GroupShuffleSplit

# 1. Initialize a single grouped train/test split
splitter = GroupShuffleSplit(
    n_splits=1,       # only one train/test split
    test_size=0.2,    # ~20% of patients in the test set
    random_state=42,  # reproducible
)

# 2. Generate indices using patient_id as the grouping key
# Note: the splitter returns row positions (0, 1, 2, ...), not index labels.
train_idx, test_idx = next(
    splitter.split(df_clean, groups=df_clean["patient_id"])
)

# 3. Create and populate the split column
# Positions are mapped back to index labels, because the index of a filtered
# table may have gaps and .loc works on labels.
df_clean["split"] = "train"
df_clean.loc[df_clean.index[test_idx], "split"] = "test"

# 4. Basic sanity checks
print("Split distribution (rows):")
print(df_clean["split"].value_counts())

train_pats = set(df_clean.loc[df_clean["split"] == "train", "patient_id"])
test_pats = set(df_clean.loc[df_clean["split"] == "test", "patient_id"])
overlap = train_pats & test_pats

if not overlap:
    print("SUCCESS: no patient appears in both train and test.")
else:
    print(f"WARNING: data leakage detected; patients in both sets: {overlap}")

Block D: Save the master list
Goal: Save this curated table as a single cohort.csv. All later steps (preprocessing, training, evaluation) should read this file, never raw folders directly.
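Since later steps depend on this file, they can guard against a stale or hand-edited copy by validating it on load. The sketch below is one possible loader, not a required part of this recipe. The names load_cohort and REQUIRED_COLUMNS are hypothetical, and the demo writes a toy file to a temporary folder rather than to METADATA_DIR.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Columns that later steps are assumed to need.
REQUIRED_COLUMNS = {"slide_id", "patient_id", "split"}


def load_cohort(csv_path):
    """Read a cohort file and check the invariants later steps rely on."""
    df = pd.read_csv(csv_path)

    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"cohort file is missing columns: {sorted(missing)}")

    # Each patient must map to exactly one split value.
    splits_per_patient = df.groupby("patient_id")["split"].nunique()
    leaking = splits_per_patient[splits_per_patient > 1].index.tolist()
    if leaking:
        raise ValueError(f"patients assigned to more than one split: {leaking}")

    return df


# Demo with a toy file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = Path(tmp_dir) / "cohort.csv"
    pd.DataFrame({
        "slide_id": ["S1", "S2", "S3"],
        "patient_id": ["P1", "P1", "P2"],
        "split": ["train", "train", "test"],
    }).to_csv(toy_path, index=False)

    df_loaded = load_cohort(toy_path)

print(df_loaded["split"].value_counts())
```

Calling a loader like this at the top of the viewing and preprocessing steps turns a silent leakage problem into an immediate error.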
from pathlib import Path

# 1. Define the output path (reusing METADATA_DIR from PFP-01)
CSV_PATH = METADATA_DIR / "cohort.csv"

# 2. Save
# Create the folder first in case PFP-01 did not already do so.
METADATA_DIR.mkdir(parents=True, exist_ok=True)
df_clean.to_csv(CSV_PATH, index=False)

print(f"Master cohort list saved to: {CSV_PATH}")
print("Ready for Step 2 (Viewing) and Step 3 (Preprocessing).")

Notes and Extensions
- Replace the dummy clinical_data with a real hospital extract (pd.read_excel or pd.read_csv).
- Expand the CONSORT‑style filters (prior treatments, missing labels, image quality flags) before splitting.
- If you need separate validation and test sets, extend this group‑based split logic (for example with GroupShuffleSplit or GroupKFold) to carve out more than two non‑overlapping groups of patients.
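The last point can be sketched with two grouped splits applied one after the other: first hold out the test patients, then carve a validation set out of the remainder. This is a minimal illustration on toy data, not the only way to do it. The label "val", the fractions, and the toy variable names are assumptions.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy cohort: 10 patients with 2 slides each.
df_toy = pd.DataFrame({
    "slide_id": [f"S{i}" for i in range(20)],
    "patient_id": [f"P{i // 2}" for i in range(20)],
})

# Stage 1: hold out about 20% of patients as the test set.
outer = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
dev_pos, test_pos = next(outer.split(df_toy, groups=df_toy["patient_id"]))

# Stage 2: from the remaining patients, hold out 25% as validation
# (a quarter of the remaining 80% is roughly 20% of all patients).
df_dev = df_toy.iloc[dev_pos]
inner = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=42)
train_pos, val_pos = next(inner.split(df_dev, groups=df_dev["patient_id"]))

# The splitters return row positions, so map them back to index labels.
df_toy["split"] = "train"
df_toy.loc[df_dev.index[val_pos], "split"] = "val"
df_toy.loc[df_toy.index[test_pos], "split"] = "test"

# Every patient should carry exactly one split label.
labels_per_patient = df_toy.groupby("patient_id")["split"].nunique()
print("Patients per split:")
print(df_toy.groupby("split")["patient_id"].nunique())
```

GroupKFold would be the alternative when you want several rotating validation folds rather than one fixed validation set.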