Step 7 – Evaluation & Explainability: checking if you trust the model

A model that runs is not automatically a model you should trust. Here you evaluate performance using appropriate metrics (such as accuracy, AUC, sensitivity, specificity), look at error patterns, and apply simple explainability tools to see what the model is focusing on. The goal is to decide whether the model is safe and useful in a real pathology workflow.

Technical name: Evaluation & Explainability

Assess whether a model is good, fair, and understandable:

  • Performance on independent cases.
  • Subtype/rare‑pattern coverage.
  • Visual insight into slide regions influencing decisions.

Questions this step answers:

  • “What’s sensitivity/specificity on an independent set?”
  • “Is performance similar across age/site/scanner groups?” (a per-group check is sketched just below)
  • “Where does the model focus when making decisions?”
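
One way to answer the subgroup question is to compute the same metric per group and compare. A minimal sketch, assuming a per-case table with hypothetical `y_true`, `y_score`, and `scanner` columns:

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

# Hypothetical per-case predictions; in practice, load your own results.
df = pd.DataFrame({
    "y_true":  [0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0],
    "y_score": [0.2, 0.9, 0.7, 0.1, 0.8, 0.4, 0.6, 0.95, 0.3, 0.2, 0.85, 0.15],
    "scanner": ["A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B"],
})

# Same metric per subgroup; a large gap between groups flags possible
# scanner/site bias worth investigating before deployment.
for group, sub in df.groupby("scanner"):
    auc = roc_auc_score(sub["y_true"], sub["y_score"])
    print(f"scanner {group}: AUC = {auc:.3f} (n = {len(sub)})")
```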

Typical tasks:

  • Compute performance metrics (AUC, F1, sensitivity, specificity) (a scikit-learn sketch follows this list).
  • Plot ROC curves, calibration curves, and confusion matrices.
  • Visualize heatmaps/attention overlays on slides.
  • Check for biases and failure modes.
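
As a starting point for the first two tasks, here is a minimal scikit-learn sketch using synthetic labels and scores as stand-ins for real model output; the 0.5 threshold is an assumption, not a recommendation:

```python
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.metrics import confusion_matrix, f1_score, roc_auc_score, roc_curve

# Synthetic stand-ins for ground truth and model probabilities.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)
y_score = np.clip(0.6 * y_true + rng.normal(0.2, 0.25, size=200), 0, 1)
y_pred = (y_score >= 0.5).astype(int)  # the threshold is a modelling choice

# Sensitivity and specificity fall out of the binary confusion matrix.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)   # true positive rate
specificity = tn / (tn + fp)   # true negative rate

print(f"AUC  = {roc_auc_score(y_true, y_score):.3f}")
print(f"F1   = {f1_score(y_true, y_pred):.3f}")
print(f"sens = {sensitivity:.3f}  spec = {specificity:.3f}")

# Points for the ROC and calibration plots.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
prob_true, prob_pred = calibration_curve(y_true, y_score, n_bins=10)
```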

Common tools:

  • scikit‑learn metrics — ROC/AUC, classification reports, calibration.
  • TensorBoard — visualize training runs.
  • Weights & Biases — tracking, dashboards, comparisons.
  • Grad‑CAM / Captum — saliency/attention maps (a Captum sketch follows this list).
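
For the explainability side, here is a sketch of Grad-CAM via Captum on a torchvision ResNet-18; the target layer (`model.layer4`), the class index, and the random input are all assumptions standing in for a trained model and a real tile:

```python
import torch
from torchvision.models import resnet18
from captum.attr import LayerGradCam, LayerAttribution

# Untrained ResNet-18 as a stand-in; load your own trained weights here.
model = resnet18(weights=None).eval()
gradcam = LayerGradCam(model, model.layer4)  # last conv block (an assumed choice)

x = torch.randn(1, 3, 224, 224)              # stand-in for a normalized tile
attr = gradcam.attribute(x, target=1)        # target class index is arbitrary

# Upsample the coarse map to input resolution for overlay on the image.
heatmap = LayerAttribution.interpolate(attr, x.shape[2:])
print(heatmap.shape)  # torch.Size([1, 1, 224, 224])
```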

Treat this step like validation and peer review for models: apply the same rigor you expect from any diagnostic tool.

  • EVAL-01: Confusion, ROC, heatmaps — scikit-learn + Seaborn blocks for confusion matrices, ROC/AUC, and overlay heatmaps so you see error types, threshold trade-offs, and where the model is “looking.”
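
A sketch of the confusion-matrix piece of that block, with toy predictions and placeholder class names:

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

# Toy labels/predictions; class names are illustrative placeholders.
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

cm = confusion_matrix(y_true, y_pred)
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
            xticklabels=["benign", "tumor"],
            yticklabels=["benign", "tumor"])
plt.xlabel("Predicted label")
plt.ylabel("True label")
plt.title("Confusion matrix")
plt.tight_layout()
plt.show()
```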