Step 6 – ML & Modeling (The Diagnosis Phase)

Name of Tool

Scikit-Learn (The Statistician) & PyTorch (The Neural Network)

Technical Explanation

Scikit-Learn (sklearn) is a CPU-focused library for classical ML on structured data (tables). It offers regression, SVMs, random forests, and more. PyTorch (torch) is a deep learning framework that runs on CPUs and is accelerated by GPUs; it uses tensors and automatic differentiation to train neural networks, typically on unstructured data (raw images, text).
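
To make "tensors and automatic differentiation" concrete, here is a minimal sketch (the function y = x^2 + 2x is an arbitrary choice for illustration). PyTorch records the operations applied to a tensor and can then compute the derivative for you, which is the mechanism used to adjust a network's weights during training.

```python
import torch

# A tensor that tracks the operations applied to it
x = torch.tensor(3.0, requires_grad=True)

# y = x^2 + 2x, built from ordinary tensor operations
y = x ** 2 + 2 * x

# Backpropagate: autograd computes dy/dx = 2x + 2 at x = 3
y.backward()

print(f"y = {y.item():.1f}")            # 15.0
print(f"dy/dx = {x.grad.item():.1f}")   # 8.0
```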

Simplified Explanation

This is the “Medical Board Exam.”

Path A (Scikit-Learn): A checklist. You hand over measured features (for example, nucleus area, circularity) and it follows decision rules (like a flowchart) to classify. Fast and interpretable, limited by the features you provide.
Path B (PyTorch): A resident. You show 10,000 images and it learns patterns you did not explicitly describe (texture, chromatin). Heavier compute, but more powerful.

What can it do?

Classification: Tumor vs Normal (binary) or Grade 1/2/3 (multiclass).
Regression: Predict a continuous value (for example, survival months).
Segmentation (PyTorch): Pixel-level outlines (for example, tumor mask).
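
As a rough illustration of the regression case, the sketch below swaps the classifier for scikit-learn's RandomForestRegressor. The survival_months column and all of its values are made up for this example; they are not real patient data.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Made-up toy values for illustration only (not real patient data)
df = pd.DataFrame({
    "area": [100, 120, 110, 400, 420, 450],
    "circularity": [0.9, 0.85, 0.92, 0.4, 0.3, 0.5],
    "survival_months": [60, 58, 62, 14, 10, 18],
})

X = df[["area", "circularity"]]
y = df["survival_months"]

# Same fit/predict pattern as a classifier, but the target is a continuous number
reg = RandomForestRegressor(n_estimators=100, random_state=42)
reg.fit(X, y)

new_case = pd.DataFrame([[410, 0.35]], columns=["area", "circularity"])
predicted_months = reg.predict(new_case)[0]
print(f"Predicted survival: {predicted_months:.1f} months")
```

The only changes from classification are the estimator class and the type of the target column; the prediction is an average over the trees rather than a class label.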

Situations where it’s used (Medical Examples)

Feature Approach (Path A): A Random Forest trained on the circularity values from Step 5 may learn a rule such as “Circularity < 0.6 suggests malignancy.”
End-to-End (Path B): Raw H&E patches are fed into a ResNet-50, which can learn signals such as stromal orientation to predict metastasis.
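
To show how a threshold rule like the one in the Feature Approach can come out of the data, here is a small sketch using a single decision tree (the building block of a Random Forest). The six circularity values are toy numbers, so the cut-off the tree finds will not be exactly 0.6.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy values for illustration only (0 = benign, 1 = malignant)
df = pd.DataFrame({
    "circularity": [0.9, 0.85, 0.92, 0.4, 0.3, 0.5],
    "diagnosis": [0, 0, 0, 1, 1, 1],
})

# A depth-1 tree can ask exactly one yes/no question
tree = DecisionTreeClassifier(max_depth=1, random_state=0)
tree.fit(df[["circularity"]], df["diagnosis"])

# Print the learned rule as a text flowchart
rules = export_text(tree, feature_names=["circularity"])
print(rules)

# The cut-off the tree chose for its single question
threshold = tree.tree_.threshold[0]
print(f"Learned cut-off: circularity <= {threshold:.3f}")
```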

Why it’s important to pathologists

This is the engine. Everything before it (tiling, normalizing, feature extraction) prepared the fuel; here the computer actually attempts a diagnosis.

Installation Instructions

Run in terminal:

pip install scikit-learn torch torchvision

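For PyTorch with GPUs, use the install command that pytorch.org generates for your operating system and CUDA version. What the plain pip command above installs depends on the platform: typically a CPU-only build on Windows and macOS, and a CUDA-enabled build on Linux.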
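
A quick way to confirm the installation is sketched below. The helper name check_install is made up for this example; it only uses the standard library to look for each package, so it runs even when something is missing.

```python
import importlib
import importlib.util


def check_install(package_names):
    """Return {import name: version string, or None if not installed}."""
    status = {}
    for name in package_names:
        if importlib.util.find_spec(name) is None:
            status[name] = None
            continue
        module = importlib.import_module(name)
        status[name] = getattr(module, "__version__", "unknown")
    return status


# Import names differ from pip names: scikit-learn is imported as sklearn
status = check_install(["sklearn", "torch", "torchvision"])
for name, version in status.items():
    print(f"{name}: {version if version else 'NOT INSTALLED'}")

# Only ask about CUDA if PyTorch is actually present
if status["torch"] is not None:
    import torch
    print(f"CUDA available: {torch.cuda.is_available()}")
```

If "CUDA available" prints False, PyTorch will still work, but it will run on the CPU.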

Path A: Classical ML (The Random Forest)

Use this when you have a CSV of numeric features (for example, from Step 5).

Block A1: Train a Random Forest on tabular features

The Situation: You measured nuclei (area, perimeter, circularity) and have labels (0 = Benign, 1 = Malignant).
The Solution: Train a Random Forest: 100 decision trees each make a prediction, and the forest combines their votes into a diagnosis.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# 1. Load your feature table (replace this dummy data with pd.read_csv)
data = {
    "area": [100, 120, 110, 400, 420, 450],          # Smaller often benign
    "circularity": [0.9, 0.85, 0.92, 0.4, 0.3, 0.5], # Round vs irregular
    "diagnosis": [0, 0, 0, 1, 1, 1],                 # 0=Benign, 1=Malignant
}
df = pd.DataFrame(data)

# 2. Split into train/test
X = df[["area", "circularity"]]
y = df["diagnosis"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

# 3. Model
clf = RandomForestClassifier(n_estimators=100, random_state=42)

# 4. Train
clf.fit(X_train, y_train)

# 5. Predict and score
predictions = clf.predict(X_test)
acc = accuracy_score(y_test, predictions)
print(f"Model Accuracy: {acc:.2%}")

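# 6. Inference on a new sample (same column names as the training table,
#    which avoids scikit-learn's "feature names" warning)
new_cell = pd.DataFrame([[410, 0.35]], columns=["area", "circularity"])
result = clf.predict(new_cell)
print(f"Prediction for new cell: {'Malignant' if result[0]==1 else 'Benign'}")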

Simulated output (with only six toy rows, the test set holds two cells, so the accuracy figure is not meaningful):

Model Accuracy: 100.00%
Prediction for new cell: Malignant
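
Beyond a single yes/no answer, a fitted forest can report how confident it is and which features it relied on. The sketch below reuses the toy table from Block A1 and calls predict_proba and feature_importances_, both standard scikit-learn attributes. Note that scikit-learn averages the class probabilities of the individual trees rather than counting hard votes, which amounts to nearly the same thing when the trees are fully grown.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Same toy table as Block A1 (illustration only)
df = pd.DataFrame({
    "area": [100, 120, 110, 400, 420, 450],
    "circularity": [0.9, 0.85, 0.92, 0.4, 0.3, 0.5],
    "diagnosis": [0, 0, 0, 1, 1, 1],
})
features = ["area", "circularity"]

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(df[features], df["diagnosis"])

# Averaged tree probabilities for one new cell: [P(benign), P(malignant)]
new_cell = pd.DataFrame([[410, 0.35]], columns=features)
proba = clf.predict_proba(new_cell)[0]
print(f"P(benign) = {proba[0]:.2f}, P(malignant) = {proba[1]:.2f}")

# Relative contribution of each feature to the forest's splits (sums to 1)
for name, importance in zip(features, clf.feature_importances_):
    print(f"{name}: {importance:.2f}")
```

On a table this small the importances are not stable; they become informative only with a realistic number of cells.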

Path B: Deep Learning (The Convolutional Neural Network)

Use this when you have raw images (for example, patches from Step 2/3).

Block B1: Architecture setup with transfer learning (ResNet-18)

import torch
import torch.nn as nn
from torchvision import models

# 1. Load pretrained ResNet-18
model = models.resnet18(weights="DEFAULT")

# 2. Replace the final layer to predict 2 classes (Tumor/Normal)
num_features = model.fc.in_features  # 512 for ResNet-18
model.fc = nn.Linear(num_features, 2)

# 3. Move to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

print("Model architecture modified for 2 classes (Tumor/Normal).")
print(f"Running on: {device}")

Simulated output:

Model architecture modified for 2 classes (Tumor/Normal).
Running on: cpu

Block B2: Single-image inference

# This block reuses `model` and `device` from Block B1
from torchvision import transforms
from PIL import Image
import torch.nn.functional as F

# 1. Preprocessing pipeline
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

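# 2. Load and preprocess an image
img_path = "path/to/project/data/raw_images/test_patch.jpg"  # replace with a real patch path
# img = Image.open(img_path).convert("RGB")       # use this line for a real patch
img = Image.new("RGB", (300, 300), color="pink")  # demo stand-in so the block runs without a file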
input_tensor = preprocess(img)
input_batch = input_tensor.unsqueeze(0).to(device)

# 3. Forward pass
model.eval()
with torch.no_grad():
    output = model(input_batch)  # logits

# 4. Probabilities
probabilities = F.softmax(output[0], dim=0)

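print(f"Raw Output scores: {output}")
print(f"Probability Class 0 (Normal): {probabilities[0].item():.4f}")
print(f"Probability Class 1 (Tumor):  {probabilities[1].item():.4f}")

Simulated output (your values will differ: the new final layer from Block B1 is randomly initialized, so these probabilities carry no diagnostic meaning until the model is trained):

Raw Output scores: tensor([[-0.5612,  0.3421]])
Probability Class 0 (Normal): 0.2884
Probability Class 1 (Tumor):  0.7116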
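
The "Probabilities" step in Block B2 uses softmax to turn raw scores (logits) into values that sum to 1. As a worked example, the same arithmetic can be written in plain Python and applied to the example scores -0.5612 and 0.3421; this is a sketch of the formula, not PyTorch's internal implementation.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)
    # Subtracting the largest score does not change the result,
    # but it keeps exp() from overflowing on big inputs
    exps = [math.exp(value - largest) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


logits = [-0.5612, 0.3421]  # example raw scores for class 0 and class 1
probs = softmax(logits)

print(f"Class 0: {probs[0]:.4f}")
print(f"Class 1: {probs[1]:.4f}")
print(f"Sum:     {sum(probs):.4f}")
```

The class with the larger raw score always gets the larger probability; softmax only rescales the scores, it does not change their order.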

Resource Site

Scikit-Learn User Guide: https://scikit-learn.org/stable/user_guide.html
PyTorch Beginner Tutorial: https://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html
TorchVision Models (ResNet): https://pytorch.org/vision/stable/models.html