Data & Cohorts – Build the Dataset (Pandas)

Build your cohort tables in code with pandas so you can scan, clean, and split datasets reproducibly.
1 - Data & Cohorts (Building the Dataset)

Name of Tool

Pandas (The Programmable Spreadsheet)
Technical Explanation

Pandas is a high-performance data manipulation library built on top of NumPy. It uses the DataFrame object, a 2D labeled data structure with columns of potentially different types, to handle structured data. Vectorized operations let you filter, group, and clean data faster and with fewer errors than manual spreadsheet editing or Python loops.
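As a minimal sketch of what a DataFrame and a vectorized operation look like (the column names here are invented for illustration, not part of the cohort built later):

```python
import pandas as pd

# A tiny DataFrame: columns can hold different types (strings, integers)
df = pd.DataFrame({
    "slide_id": ["Case_001", "Case_002", "Case_003"],
    "diagnosis": ["Tumor", "Normal", "Tumor"],
    "age": [61, 54, 70],
})

# Vectorized filter: one expression instead of a Python loop over rows
tumors = df[df["diagnosis"] == "Tumor"]
print(len(tumors))  # 2
```

The boolean expression `df["diagnosis"] == "Tumor"` is evaluated for every row at once; indexing with it keeps only the matching rows.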
Simplified Explanation

This is your “Digital Accessioning Department.” In a physical lab you do not just drop samples on a bench; you log them into the LIS, assign IDs, and check that requisitions match specimens. Pandas is that system: it turns raw files into a structured “Excel sheet” the computer can read, sort, and clean instantly.
What can it do?

- Inventory: Scan your drive and list every image file into a table.
- Link: Merge image filenames with diagnosis data from the hospital.
- Quality Control: Find and remove patients with missing diagnoses or corrupt files.
- Stats: Calculate cohort distributions (for example, male vs female counts) in milliseconds.
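The “Link” step above can be sketched with `pandas.merge`; the table contents here are hypothetical stand-ins for an image inventory and a LIS diagnosis export:

```python
import pandas as pd

# Hypothetical image inventory and diagnosis export
inventory = pd.DataFrame({
    "slide_id": ["Case_001", "Case_002", "Case_003"],
    "full_path": ["/data/Case_001.svs", "/data/Case_002.svs", "/data/Case_003.svs"],
})
diagnoses = pd.DataFrame({
    "slide_id": ["Case_001", "Case_002"],
    "diagnosis": ["Tumor", "Normal"],
})

# A left merge keeps every image; slides without a matching diagnosis
# get NaN, which makes missing labels easy to spot during QC
cohort = inventory.merge(diagnoses, on="slide_id", how="left")
print(cohort["diagnosis"].isnull().sum())  # 1 slide lacks a diagnosis
```

A left merge is the safer default here: it never silently drops an image, it only flags the ones with no matching record.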
Situations where it’s used (Medical Examples)

- Exclusion Criteria: Dataset of 1,000 lung biopsies; exclude all “Crush Artifact” cases in one line: df = df[df["quality"] != "crush"].
- Cohort Balancing: You have 500 benign slides but 50 malignant. Use pandas to randomly select 50 benign slides to match the tumor group.
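The cohort-balancing step can be sketched as follows, using a synthetic 500/50 table in place of real data:

```python
import pandas as pd

# Synthetic imbalanced cohort: 500 benign rows, 50 malignant rows
df = pd.DataFrame({
    "slide_id": [f"S{i:03d}" for i in range(550)],
    "diagnosis": ["benign"] * 500 + ["malignant"] * 50,
})

malignant = df[df["diagnosis"] == "malignant"]

# Randomly draw the same number of benign slides;
# random_state makes the draw reproducible
benign_sample = df[df["diagnosis"] == "benign"].sample(
    n=len(malignant), random_state=42
)

balanced = pd.concat([malignant, benign_sample])
print(balanced["diagnosis"].value_counts())  # 50 of each class
```

Note that downsampling discards data; it is one simple balancing strategy, not the only one.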
Why it’s important to pathologists

Garbage In, Garbage Out. An AI model will learn whatever you feed it. If your spreadsheet has duplicates, mislabels, or data leakage, your scientific results are invalid. Pandas lets you audit and lock down the cohort before processing a single pixel.
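A minimal audit for the duplicate and missing-label problems mentioned above might look like this (the rows are invented for illustration):

```python
import pandas as pd

# Hypothetical cohort with a duplicated accession and a missing label
df = pd.DataFrame({
    "slide_id": ["Case_001", "Case_002", "Case_002", "Case_003"],
    "diagnosis": ["Tumor", "Normal", "Normal", None],
})

# Audit before touching any pixels
n_dupes = df.duplicated(subset=["slide_id"]).sum()
n_missing = df["diagnosis"].isnull().sum()
print(f"Duplicate slide IDs: {n_dupes}, missing diagnoses: {n_missing}")

# Lock down the cohort: drop the repeats, then rows without a label
clean = df.drop_duplicates(subset=["slide_id"]).dropna(subset=["diagnosis"])
print(len(clean))  # 2 usable slides remain
```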
Installation Instructions

Open your terminal (Mac/Linux) or Command Prompt/PowerShell (Windows) and run:

Windows:

```
pip install pandas
```

Mac / Linux:

```
pip3 install pandas
```

If you use Anaconda: conda install pandas.
Lego Building Blocks (Code)

Block A: The Census (Creating the Master List)

The Situation: Thousands of slide images sit in nested folders. Typing filenames into Excel would take days and introduce typos.

The Solution: Crawl the folders, capture each slide path, infer a label from the parent folder, and save a CSV master list.
```python
import pandas as pd

# 1. Setup the path (using the variable from Step 0)
# We look into the RAW_IMAGES_DIR we defined earlier
target_folder = RAW_IMAGES_DIR

# 2. The Scan: find every whole-slide image
# .rglob searches recursively (in subfolders too)
image_files = list(target_folder.rglob("*.svs"))

# 3. The Extraction: build a table in memory
data = []
for file in image_files:
    # Example: /data/raw_images/Tumor/slide_01.svs --> label = "Tumor"
    label = file.parent.name
    data.append({
        "slide_id": file.stem,    # filename without extension
        "diagnosis": label,       # parent folder name
        "full_path": str(file),   # absolute path on disk
    })

# 4. The Save: write the census to CSV
df = pd.DataFrame(data)
save_path = METADATA_DIR / "master_cohort.csv"
df.to_csv(save_path, index=False)

print(f"Census complete. Found {len(df)} slides.")
print(f"Saved to {save_path}")
```

Simulated output:

```
Census complete. Found 480 slides.
Saved to /Users/DrFernando/Projects/Melanoma_AI/metadata/master_cohort.csv
```

Block B: The Chart Review (Inspecting the Data)

The Situation: You have a CSV, but it is a “black box.” Are diagnoses missing? Is the cohort balanced?

The Solution: Peek inside with pandas to review the first rows, count diagnoses, and flag missing labels.
```python
import pandas as pd

# 1. Load the Master List
load_path = METADATA_DIR / "master_cohort.csv"
df = pd.read_csv(load_path)

# 2. The "Quick Look"
print("--- First 5 Rows of the Cohort ---")
print(df.head())

# 3. The "Demographics" Check
print("\n--- Diagnosis Breakdown ---")
counts = df["diagnosis"].value_counts()
print(counts)

# 4. The "Missing Data" Check
missing = df["diagnosis"].isnull().sum()
print(f"\nSlides with missing diagnosis: {missing}")
```

Simulated output:

```
--- First 5 Rows of the Cohort ---
   slide_id diagnosis                                         full_path
0  Case_001     Tumor  /Users/DrFernando/Projects/Melanoma_AI/data/...
1  Case_002     Tumor  /Users/DrFernando/Projects/Melanoma_AI/data/...
2  Case_003    Normal  /Users/DrFernando/Projects/Melanoma_AI/data/...

--- Diagnosis Breakdown ---
Normal    350
Tumor     130
Name: diagnosis, dtype: int64

Slides with missing diagnosis: 0
```

Block C: The Exclusion Criteria (Cleaning)

The Situation: After inspection, you notice unwanted labels (for example, “Ambiguous”). Feeding these to the model will confuse it.

The Solution: Apply filters in code to drop missing labels and keep only the diagnoses you trust.
```python
# 1. Drop Missing Data
df_clean = df.dropna(subset=["diagnosis"])

# 2. Apply Exclusion Criteria
valid_diagnoses = ["Tumor", "Normal"]
df_final = df_clean[df_clean["diagnosis"].isin(valid_diagnoses)]

print(f"Original size: {len(df)}")
print(f"Cleaned size: {len(df_final)}")
```

Simulated output:

```
Original size: 480
Cleaned size: 460
```

Resource Site

- Official documentation: https://pandas.pydata.org/docs/user_guide/index.html
- Cheat sheet: Pandas Getting Started Guide (PDF)
1B - Concepts: The Rules of the Game (Splitting & Fitting)

Name of Concept

Model Generalization (Train/Test Splits, Overfitting, & Underfitting)
Technical Explanation

Machine learning aims to minimize generalization error: the gap between performance on training data and unseen data.
- Overfitting (High Variance): The model memorizes noise and fails on new data.
- Underfitting (High Bias): The model is too simple to capture structure (for example, always guessing the majority class).
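Both failure modes can be demonstrated on a small synthetic dataset (not pathology data); this sketch uses scikit-learn decision trees of different depths:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy two-class data (flip_y adds 20% label noise)
X, y = make_classification(n_samples=400, n_features=20,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

# Overfitting: an unconstrained tree memorizes the training set,
# noise included, so its train score is perfect but its test score drops
deep = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print(f"Deep tree - train: {deep.score(X_train, y_train):.2f}, "
      f"test: {deep.score(X_test, y_test):.2f}")

# Underfitting: a depth-1 "stump" is too simple to capture the structure
stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_train, y_train)
print(f"Stump     - train: {stump.score(X_train, y_train):.2f}, "
      f"test: {stump.score(X_test, y_test):.2f}")
```

The large train/test gap of the deep tree is the overfitting signature; the stump scores poorly on both sets, the underfitting signature.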
Simplified Explanation

This is the difference between learning and memorizing.
- Goal: Build an AI that understands what cancer looks like.
- The Split: Hide some slides (the test set) and reveal them only at the end.
- Overfitting: Like a student who memorizes the answer key; knows “Slide 001 is cancer” but not why.
- Underfitting: Like a lazy student guessing “Benign” on everything because most slides are benign.
The Three Buckets (The Ratios)

- Simple Split (80/20): 80% training / 20% testing.
- Pro Split (70/15/15): 70% training / 15% validation / 15% testing.
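The 70/15/15 “pro split” can be produced with two chained calls to `train_test_split`; the slide IDs here are placeholders:

```python
from sklearn.model_selection import train_test_split

# Hypothetical list of 100 slide IDs standing in for a cohort table
slides = [f"Case_{i:03d}" for i in range(100)]

# Stage 1: carve off 70% for training, leaving 30% aside
train, temp = train_test_split(slides, test_size=0.30, random_state=42)

# Stage 2: split the remaining 30% in half -> 15% validation, 15% test
val, test = train_test_split(temp, test_size=0.50, random_state=42)

print(len(train), len(val), len(test))  # 70 15 15
```

The validation set tunes the model during development; the test set is opened only once, at the very end.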
Situations where it’s used (Medical Examples)

The “Leaky” Split: You have 10 slides from Patient A. Putting 8 in training and 2 in testing lets the model memorize Patient A’s stain style. It “passes” the test by recognition, not by learning cancer morphology.
Why it’s important to pathologists

Training and testing on slides from the same patients invites leakage. A memorizing model inflates metrics and fails on new patients. Splitting by patient avoids this.
Installation Instructions

Scikit-learn handles most splitting utilities.

Windows:

```
pip install scikit-learn
```

Mac / Linux:

```
pip3 install scikit-learn
```

Lego Building Blocks (Code)

Block A: The Stratified Split (The “Fair” Split)

The Situation: You have 100 normal slides and only 20 tumor slides. A random split might send all tumors to the test set, starving training.

The Solution: Use stratification to preserve class ratios in both train and test.
```python
from sklearn.model_selection import train_test_split

# 1. Define the labels (y) for stratification
y = df_final["diagnosis"]

# 2. Perform the split
train_df, test_df = train_test_split(
    df_final,
    test_size=0.2,     # reserve 20% for testing
    stratify=y,        # keep the Tumor/Normal ratio equal
    random_state=42,   # reproducible split
)

# 3. Verify the ratios (sanity check)
print("--- Split Report ---")
orig_rate = len(df_final[df_final.diagnosis == "Tumor"]) / len(df_final)
train_rate = len(train_df[train_df.diagnosis == "Tumor"]) / len(train_df)

print(f"Original Tumor Rate: {orig_rate:.2%}")
print(f"Training Tumor Rate: {train_rate:.2%}")
```

Simulated output:

```
--- Split Report ---
Original Tumor Rate: 15.00%
Training Tumor Rate: 15.00%
```

Block B: The Patient-Level Split (Preventing Leakage)

The Situation: Multiple slides per patient can leak style information across splits.

The Solution: Group by patient so all of a patient’s slides stay together in either train or test.
```python
from sklearn.model_selection import GroupShuffleSplit

# 1. Define the groups (patient IDs). Replace slide_id if you have patient_id.
groups = df_final["slide_id"]  # TODO: use df_final["patient_id"] when available

# 2. Initialize the splitter
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)

# 3. Perform the split
train_idx, test_idx = next(splitter.split(df_final, groups=groups))
train_patient_df = df_final.iloc[train_idx]
test_patient_df = df_final.iloc[test_idx]

# 4. Save final splits (optional)
train_patient_df.to_csv(METADATA_DIR / "cohort_train.csv", index=False)
test_patient_df.to_csv(METADATA_DIR / "cohort_test.csv", index=False)

print("Success: Patient-level split complete.")
print(f"Training on {len(train_patient_df)} slides.")
print(f"Testing on {len(test_patient_df)} slides.")
```

Simulated output:

```
Success: Patient-level split complete.
Training on 368 slides.
Testing on 92 slides.
```

Resource Site

- Visual guide: Underfitting vs Overfitting (Scikit-learn)
- Official documentation: https://scikit-learn.org/stable/modules/cross_validation.html