Data & Cohorts – Build the Dataset (Pandas)

Build your cohort tables in code with pandas so you can scan, clean, and split datasets reproducibly.
1 - Data & Cohorts (Building the Dataset)

Name of Tool

Pandas (The Programmable Spreadsheet)
Technical Explanation

Pandas is a high-performance data manipulation library built on top of NumPy. It uses the DataFrame object, a 2D labeled data structure with columns of potentially different types, to handle structured data. Vectorized operations let you filter, group, and clean data faster and with fewer errors than manual spreadsheet editing or Python loops.
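As a minimal sketch of what a DataFrame and a vectorized operation look like (the column names here are invented for illustration, not part of the cohort built later):

```python
import pandas as pd

# A tiny DataFrame: columns can hold different types (strings, integers)
df = pd.DataFrame({
    "slide_id": ["Case_001", "Case_002", "Case_003"],
    "diagnosis": ["Tumor", "Normal", "Tumor"],
    "age": [61, 54, 70],
})

# Vectorized filter: one expression instead of a Python loop over rows
tumors = df[df["diagnosis"] == "Tumor"]
print(len(tumors))  # 2
```

The boolean expression `df["diagnosis"] == "Tumor"` is evaluated for every row at once; indexing with it keeps only the matching rows.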
Simplified Explanation

This is your “Digital Accessioning Department.” In a physical lab you do not just drop samples on a bench; you log them into the LIS, assign IDs, and check that requisitions match specimens. Pandas is that system: it turns raw files into a structured “Excel sheet” the computer can read, sort, and clean instantly.
What can it do?

- Inventory: Scan your drive and list every image file into a table.
- Link: Merge image filenames with diagnosis data from the hospital.
- Quality Control: Find and remove patients with missing diagnoses or corrupt files.
- Stats: Calculate cohort distributions (for example, male vs female counts) in milliseconds.
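The “Link” step above can be sketched with `pandas.merge`; the table contents here are hypothetical stand-ins for an image inventory and a LIS diagnosis export:

```python
import pandas as pd

# Hypothetical image inventory and diagnosis export
inventory = pd.DataFrame({
    "slide_id": ["Case_001", "Case_002", "Case_003"],
    "full_path": ["/data/Case_001.svs", "/data/Case_002.svs", "/data/Case_003.svs"],
})
diagnoses = pd.DataFrame({
    "slide_id": ["Case_001", "Case_002"],
    "diagnosis": ["Tumor", "Normal"],
})

# A left merge keeps every image; slides without a matching diagnosis
# get NaN, which makes missing labels easy to spot during QC
cohort = inventory.merge(diagnoses, on="slide_id", how="left")
print(cohort["diagnosis"].isnull().sum())  # 1 slide lacks a diagnosis
```

A left merge is the safer default here: it never silently drops an image, it only flags the ones with no matching record.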
Situations where it’s used (Medical Examples)

- Exclusion Criteria: Dataset of 1,000 lung biopsies; exclude all “Crush Artifact” cases in one line: df = df[df["quality"] != "crush"].
- Cohort Balancing: You have 500 benign slides but 50 malignant. Use pandas to randomly select 50 benign slides to match the tumor group.
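The cohort-balancing step can be sketched as follows, using a synthetic 500/50 table in place of real data:

```python
import pandas as pd

# Synthetic imbalanced cohort: 500 benign rows, 50 malignant rows
df = pd.DataFrame({
    "slide_id": [f"S{i:03d}" for i in range(550)],
    "diagnosis": ["benign"] * 500 + ["malignant"] * 50,
})

malignant = df[df["diagnosis"] == "malignant"]

# Randomly draw the same number of benign slides;
# random_state makes the draw reproducible
benign_sample = df[df["diagnosis"] == "benign"].sample(
    n=len(malignant), random_state=42
)

balanced = pd.concat([malignant, benign_sample])
print(balanced["diagnosis"].value_counts())  # 50 of each class
```

Note that downsampling discards data; it is one simple balancing strategy, not the only one.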
Why it’s important to pathologists

Garbage In, Garbage Out. An AI model will learn whatever you feed it. If your spreadsheet has duplicates, mislabels, or data leakage, your scientific results are invalid. Pandas lets you audit and lock down the cohort before processing a single pixel.
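A minimal audit for the duplicate and missing-label problems mentioned above might look like this (the rows are invented for illustration):

```python
import pandas as pd

# Hypothetical cohort with a duplicated accession and a missing label
df = pd.DataFrame({
    "slide_id": ["Case_001", "Case_002", "Case_002", "Case_003"],
    "diagnosis": ["Tumor", "Normal", "Normal", None],
})

# Audit before touching any pixels
n_dupes = df.duplicated(subset=["slide_id"]).sum()
n_missing = df["diagnosis"].isnull().sum()
print(f"Duplicate slide IDs: {n_dupes}, missing diagnoses: {n_missing}")

# Lock down the cohort: drop the repeats, then rows without a label
clean = df.drop_duplicates(subset=["slide_id"]).dropna(subset=["diagnosis"])
print(len(clean))  # 2 usable slides remain
```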
Installation Instructions

Open your terminal (Mac/Linux) or Command Prompt/PowerShell (Windows) and run:

Windows:

```
pip install pandas
```

Mac / Linux:

```
pip3 install pandas
```

If you use Anaconda: conda install pandas.
Lego Building Blocks (Code)

Block A: The Census (Creating the Master List)

The Situation: Thousands of slide images sit in nested folders. Typing filenames into Excel would take days and introduce typos.

The Solution: Crawl the folders, capture each slide path, infer a label from the parent folder, and save a CSV master list.
```python
import pandas as pd

# 1. Setup the path (using the variable from Step 0)
# We look into the RAW_IMAGES_DIR we defined earlier
target_folder = RAW_IMAGES_DIR

# 2. The Scan: find every whole-slide image
# .rglob searches recursively (in subfolders too)
image_files = list(target_folder.rglob("*.svs"))

# 3. The Extraction: build a table in memory
data = []
for file in image_files:
    # Example: /data/raw_images/Tumor/slide_01.svs --> label = "Tumor"
    label = file.parent.name
    data.append({
        "slide_id": file.stem,    # filename without extension
        "diagnosis": label,       # parent folder name
        "full_path": str(file),   # absolute path on disk
    })

# 4. The Save: write the census to CSV
df = pd.DataFrame(data)
save_path = METADATA_DIR / "master_cohort.csv"
df.to_csv(save_path, index=False)

print(f"Census complete. Found {len(df)} slides.")
print(f"Saved to {save_path}")
```

Simulated output:

```
Census complete. Found 480 slides.
Saved to /Users/DrFernando/Projects/Melanoma_AI/metadata/master_cohort.csv
```

Block B: The Chart Review (Inspecting the Data)

The Situation: You have a CSV, but it is a “black box.” Are diagnoses missing? Is the cohort balanced?

The Solution: Peek inside with pandas to review the first rows, count diagnoses, and flag missing labels.
```python
import pandas as pd

# 1. Load the Master List
load_path = METADATA_DIR / "master_cohort.csv"
df = pd.read_csv(load_path)

# 2. The "Quick Look"
print("--- First 5 Rows of the Cohort ---")
print(df.head())

# 3. The "Demographics" Check
print("\n--- Diagnosis Breakdown ---")
counts = df["diagnosis"].value_counts()
print(counts)

# 4. The "Missing Data" Check
missing = df["diagnosis"].isnull().sum()
print(f"\nSlides with missing diagnosis: {missing}")
```

Simulated output:

```
--- First 5 Rows of the Cohort ---
   slide_id diagnosis                                         full_path
0  Case_001     Tumor  /Users/DrFernando/Projects/Melanoma_AI/data/...
1  Case_002     Tumor  /Users/DrFernando/Projects/Melanoma_AI/data/...
2  Case_003    Normal  /Users/DrFernando/Projects/Melanoma_AI/data/...

--- Diagnosis Breakdown ---
Normal    350
Tumor     130
Name: diagnosis, dtype: int64

Slides with missing diagnosis: 0
```

Block C: The Exclusion Criteria (Cleaning)

The Situation: After inspection, you notice unwanted labels (for example, “Ambiguous”). Feeding these to the model will confuse it.

The Solution: Apply filters in code to drop missing labels and keep only the diagnoses you trust.
```python
# 1. Drop Missing Data
df_clean = df.dropna(subset=["diagnosis"])

# 2. Apply Exclusion Criteria
valid_diagnoses = ["Tumor", "Normal"]
df_final = df_clean[df_clean["diagnosis"].isin(valid_diagnoses)]

print(f"Original size: {len(df)}")
print(f"Cleaned size: {len(df_final)}")
```

Simulated output:

```
Original size: 480
Cleaned size: 460
```

Resource Site

- Official documentation: https://pandas.pydata.org/docs/user_guide/index.html
- Cheat sheet: Pandas Getting Started Guide (PDF)
1B - Concepts: The Rules of the Game (Splitting & Fitting)

Name of Concept

Model Generalization (Train/Test Splits, Overfitting, & Underfitting)
Technical Explanation

Machine learning aims to minimize generalization error: the gap between performance on training data and unseen data.
- Overfitting (High Variance): The model memorizes noise and fails on new data.
- Underfitting (High Bias): The model is too simple to capture structure (for example, always guessing the majority class).
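Both failure modes can be demonstrated on a small synthetic dataset (not pathology data); this sketch uses scikit-learn decision trees of different depths:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy two-class data (flip_y adds 20% label noise)
X, y = make_classification(n_samples=400, n_features=20,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

# Overfitting: an unconstrained tree memorizes the training set,
# noise included, so its train score is perfect but its test score drops
deep = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print(f"Deep tree - train: {deep.score(X_train, y_train):.2f}, "
      f"test: {deep.score(X_test, y_test):.2f}")

# Underfitting: a depth-1 "stump" is too simple to capture the structure
stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_train, y_train)
print(f"Stump     - train: {stump.score(X_train, y_train):.2f}, "
      f"test: {stump.score(X_test, y_test):.2f}")
```

The large train/test gap of the deep tree is the overfitting signature; the stump scores poorly on both sets, the underfitting signature.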
Simplified Explanation

This is the difference between learning and memorizing.
- Goal: Build an AI that understands what cancer looks like.
- The Split: Hide some slides (the test set) and reveal them only at the end.
- Overfitting: Like a student who memorizes the answer key; knows “Slide 001 is cancer” but not why.
- Underfitting: Like a lazy student guessing “Benign” on everything because most slides are benign.
The Three Buckets (The Ratios)

- Simple Split (80/20): 80% training / 20% testing.
- Pro Split (70/15/15): 70% training / 15% validation / 15% testing.
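The 70/15/15 “pro split” can be produced with two chained calls to `train_test_split`; the slide IDs here are placeholders:

```python
from sklearn.model_selection import train_test_split

# Hypothetical list of 100 slide IDs standing in for a cohort table
slides = [f"Case_{i:03d}" for i in range(100)]

# Stage 1: carve off 70% for training, leaving 30% aside
train, temp = train_test_split(slides, test_size=0.30, random_state=42)

# Stage 2: split the remaining 30% in half -> 15% validation, 15% test
val, test = train_test_split(temp, test_size=0.50, random_state=42)

print(len(train), len(val), len(test))  # 70 15 15
```

The validation set tunes the model during development; the test set is opened only once, at the very end.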
Situations where it’s used (Medical Examples)

The “Leaky” Split: You have 10 slides from Patient A. Putting 8 in training and 2 in testing lets the model memorize Patient A’s stain style. It “passes” the test by recognition, not by learning cancer morphology.
Why it’s important to pathologists

Training and testing on slides from the same patients invites leakage. A memorizing model inflates metrics and fails on new patients. Splitting by patient avoids this.
Installation Instructions

Scikit-learn handles most splitting utilities.

Windows:

```
pip install scikit-learn
```

Mac / Linux:

```
pip3 install scikit-learn
```

Lego Building Blocks (Code)

Block A: The Stratified Split (The “Fair” Split)

The Situation: You have 100 normal slides and only 20 tumor slides. A random split might send all tumors to the test set, starving training.

The Solution: Use stratification to preserve class ratios in both train and test.
```python
from sklearn.model_selection import train_test_split

# 1. Define the labels (y) for stratification
y = df_final["diagnosis"]

# 2. Perform the split
train_df, test_df = train_test_split(
    df_final,
    test_size=0.2,     # reserve 20% for testing
    stratify=y,        # keep the Tumor/Normal ratio equal
    random_state=42,   # reproducible split
)

# 3. Verify the ratios (sanity check)
print("--- Split Report ---")
orig_rate = len(df_final[df_final.diagnosis == "Tumor"]) / len(df_final)
train_rate = len(train_df[train_df.diagnosis == "Tumor"]) / len(train_df)

print(f"Original Tumor Rate: {orig_rate:.2%}")
print(f"Training Tumor Rate: {train_rate:.2%}")
```

Simulated output:

```
--- Split Report ---
Original Tumor Rate: 15.00%
Training Tumor Rate: 15.00%
```

Block B: The Patient-Level Split (Preventing Leakage)

The Situation: Multiple slides per patient can leak style information across splits.

The Solution: Group by patient so all of a patient’s slides stay together in either train or test.
```python
from sklearn.model_selection import GroupShuffleSplit

# 1. Define the groups (patient IDs). Replace slide_id if you have patient_id.
groups = df_final["slide_id"]  # TODO: use df_final["patient_id"] when available

# 2. Initialize the splitter
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)

# 3. Perform the split
train_idx, test_idx = next(splitter.split(df_final, groups=groups))
train_patient_df = df_final.iloc[train_idx]
test_patient_df = df_final.iloc[test_idx]

# 4. Save final splits (optional)
train_patient_df.to_csv(METADATA_DIR / "cohort_train.csv", index=False)
test_patient_df.to_csv(METADATA_DIR / "cohort_test.csv", index=False)

print("Success: Patient-level split complete.")
print(f"Training on {len(train_patient_df)} slides.")
print(f"Testing on {len(test_patient_df)} slides.")
```

Simulated output:

```
Success: Patient-level split complete.
Training on 368 slides.
Testing on 92 slides.
```

Resource Site

- Visual guide: Underfitting vs Overfitting (Scikit-learn)
- Official documentation: https://scikit-learn.org/stable/modules/cross_validation.html