
Data & Cohorts – DAT-01: Building the Master Cohort CSV

Technical name: Cohort Curation & Splitting

Build a single, auditable cohort.csv that knows:

  • which slides exist on disk
  • which patient each slide belongs to
  • which cases should be excluded
  • which group (train vs test) each patient is assigned to

All later steps (viewing, preprocessing, modeling) should read from this file—not from ad‑hoc folders.

A set of Python blocks that:

  • crawl your project storage to find every slide,
  • merge that list with your clinical/diagnosis table,
  • filter out bad or excluded cases, and
  • perform a patient‑level split into train vs test.

Why this matters:

  • Safety: you never accidentally train on patients who should be excluded (for example, prior chemotherapy or wrong diagnosis).
  • Leakage prevention: if Patient A has 3 slides, all 3 end up in the same group (train or test), so the AI cannot “cheat” by seeing nearly identical slides in both sets (see the toy sketch after this list).
  • Audit trail: a single cohort.csv serves as the source of truth for your project. You can always review who was included, excluded, and why.
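
To make the leakage point concrete, here is a toy sketch (dummy IDs, not part of the cohort built below) comparing a naive slide‑level split with a grouped, patient‑level split:

import pandas as pd
from sklearn.model_selection import GroupShuffleSplit, train_test_split
# Toy table: PAT_001 contributes two slides (dummy data for illustration only).
toy = pd.DataFrame({
    "slide_id": ["S1", "S2", "S3", "S4", "S5"],
    "patient_id": ["PAT_001", "PAT_001", "PAT_002", "PAT_003", "PAT_004"],
})
# Naive slide-level split: PAT_001's slides may land on both sides.
naive_train, naive_test = train_test_split(toy, test_size=0.4, random_state=0)
print("Naive split shares patients:",
      bool(set(naive_train["patient_id"]) & set(naive_test["patient_id"])))
# Grouped split: every slide of a patient stays on one side.
gss = GroupShuffleSplit(n_splits=1, test_size=0.4, random_state=0)
tr, te = next(gss.split(toy, groups=toy["patient_id"]))
print("Grouped split shares patients:",
      bool(set(toy.iloc[tr]["patient_id"]) & set(toy.iloc[te]["patient_id"])))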

Prerequisite
Run PFP‑01: Project Folder & Path Blocks first so that PROJECT_ROOT, DATA_DIR, RAW_IMAGES_DIR, and METADATA_DIR are defined.
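
If you are running this section standalone, a minimal stand‑in for those variables could look like this (the paths are placeholders, not the real PFP‑01 values; adjust them to your storage):

from pathlib import Path
# Placeholder layout mirroring PFP-01; adapt to your environment.
PROJECT_ROOT = Path("/path/to/project")
DATA_DIR = PROJECT_ROOT / "data"
RAW_IMAGES_DIR = DATA_DIR / "raw_images"
METADATA_DIR = DATA_DIR / "metadata"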


Block A: The Inventory (Find every slide on disk)

Goal: Scan your raw data folder and list every Whole Slide Image (WSI). This bridges the physical world (files on disk) to the digital world (a structured table in Python).

import pandas as pd
from pathlib import Path
# 1. Define where your raw slides live (the "NAS")
# We reuse RAW_IMAGES_DIR from PFP-01; adjust if your layout differs.
RAW_DATA_DIR = RAW_IMAGES_DIR
# 2. Crawl the folder for common WSI extensions
# We use glob patterns to find .svs, .ndpi, .tif, .tiff (recursive in subfolders)
extensions = ["*.svs", "*.ndpi", "*.tif", "*.tiff"]
file_list = []
for ext in extensions:
    # rglob means "recursive glob" (look in subfolders too)
    for file_path in RAW_DATA_DIR.rglob(ext):
        # Optional but recommended: store a path relative to PROJECT_ROOT for portability
        relative_path = file_path.relative_to(PROJECT_ROOT)
        file_list.append({
            "slide_id": file_path.stem,  # filename without extension (e.g. "S23-4092")
            "full_path": str(file_path),  # complete path for the code to find it later
            "project_rel_path": relative_path.as_posix(),  # portable path from PROJECT_ROOT
            "extension": file_path.suffix,
        })
# 3. Convert to a DataFrame
df_slides = pd.DataFrame(file_list)
print(f"Found {len(df_slides)} slides on disk.")
print(df_slides.head())
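
Before merging, it can pay to sanity‑check the inventory, for instance for duplicate slide IDs caused by re‑scans or copies in several folders. A small sketch:

# Duplicate slide_id values usually mean re-scans or duplicate copies on disk.
dupes = df_slides[df_slides["slide_id"].duplicated(keep=False)]
if dupes.empty:
    print("No duplicate slide IDs found.")
else:
    print(f"WARNING: {dupes['slide_id'].nunique()} slide IDs appear more than once:")
    print(dupes.sort_values("slide_id")[["slide_id", "project_rel_path"]])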

Block B: The Merge & Filter (Connect clinical data)

Goal: Merge the slide inventory with clinical data (the LIS extract) and apply your inclusion/exclusion criteria.

# 1. Load your clinical data (the "LIS")
# In real life, this is an Excel or CSV extract from hospital IT.
# Here we create a small dummy table for demonstration.
clinical_data = {
    "slide_id": ["S23-4092", "S23-4093", "S23-4094", "S23-4095", "S23-4096"],
    "patient_id": ["PAT_001", "PAT_001", "PAT_002", "PAT_003", "PAT_004"],
    "diagnosis": ["Tumor", "Tumor", "Normal", "Tumor", "Tumor"],
    "prior_chemo": [False, False, False, True, False],  # exclusion criterion
}
df_clinical = pd.DataFrame(clinical_data)
# 2. Merge the islands
# We assume 'slide_id' is the common key (barcode).
# 'inner' join means: only keep slides that HAVE clinical data.
df_cohort = pd.merge(df_slides, df_clinical, on="slide_id", how="inner")
print(f"Slides with clinical data: {len(df_cohort)}")
# 3. Apply filters (the CONSORT logic)
# Example: exclude patients who had prior chemotherapy
df_clean = df_cohort[~df_cohort["prior_chemo"]].copy()
print(f"Slides after removing chemo cases: {len(df_clean)}")

Block C: The Patient-Level Split (The safety lock)

Goal: Split by patient, not by slide. All slides from a single patient stay together in either the training or testing group to prevent data leakage.

from sklearn.model_selection import GroupShuffleSplit
# 0. Reset the index: the splitter returns POSITIONAL indices, and after the
#    filtering in Block B the row labels no longer match positions.
df_clean = df_clean.reset_index(drop=True)
# 1. Initialize a single grouped train/test split
splitter = GroupShuffleSplit(
    n_splits=1,       # only one train/test split
    test_size=0.2,    # ~20% of patients in the test set
    random_state=42,  # reproducible
)
# 2. Generate indices using patient_id as the grouping key
train_idx, test_idx = next(
    splitter.split(df_clean, groups=df_clean["patient_id"])
)
# 3. Create and populate the split column
df_clean["split"] = "train"
df_clean.loc[test_idx, "split"] = "test"
# 4. Basic sanity checks
print("Split distribution (rows):")
print(df_clean["split"].value_counts())
train_pats = set(df_clean.loc[df_clean["split"] == "train", "patient_id"])
test_pats = set(df_clean.loc[df_clean["split"] == "test", "patient_id"])
overlap = train_pats & test_pats
if not overlap:
    print("SUCCESS: no patient appears in both train and test.")
else:
    print(f"WARNING: data leakage detected; patients in both sets: {overlap}")

Block D: The Save (Write the master cohort.csv)

Goal: Save this curated table as a single cohort.csv. All later steps (preprocessing, training, evaluation) should read this file, never raw folders directly.

# 1. Define the output path (reusing METADATA_DIR from PFP-01)
CSV_PATH = METADATA_DIR / "cohort.csv"
# 2. Save (create the metadata folder first in case it does not exist yet)
METADATA_DIR.mkdir(parents=True, exist_ok=True)
df_clean.to_csv(CSV_PATH, index=False)
print(f"Master cohort list saved to: {CSV_PATH}")
print("Ready for Step 2 (Viewing) and Step 3 (Preprocessing).")

Next steps:

  • Replace the dummy clinical_data with a real hospital extract (pd.read_excel or pd.read_csv).
  • Expand the CONSORT‑style filters (prior treatments, missing labels, image quality flags) before splitting.
  • If you need separate validation and test sets, extend this group‑based split logic (for example with GroupShuffleSplit or GroupKFold) to carve out more than two non‑overlapping groups of patients, as sketched below.
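
For example, one way to carve a validation set out of the training patients with a second grouped split (the 20% size is an arbitrary example):

from sklearn.model_selection import GroupShuffleSplit
# Second grouped split: move ~20% of the TRAIN patients into a validation set.
df_train = df_clean[df_clean["split"] == "train"].reset_index(drop=True)
val_splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
tr_idx, val_idx = next(val_splitter.split(df_train, groups=df_train["patient_id"]))
# Map positions back to slide IDs (assumes slide_id is unique; see the Block A check).
val_slides = set(df_train.iloc[val_idx]["slide_id"])
df_clean.loc[df_clean["slide_id"].isin(val_slides), "split"] = "val"
print(df_clean["split"].value_counts())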