Step 5 – Feature Extraction (The Morphometrics Phase)
Step 3 gave you high-quality tiles and QC metrics.
Step 4 projected pathologist-drawn regions onto those tiles (and optionally into masks).
Step 5 takes those tiles and masks and turns them into numbers:
- At tile level: one feature vector per tile → tile_features.csv.
- Optionally, at cell/object level: one feature vector per nucleus or object → cell_features.csv.
These feature tables are the main input to Step 6 – ML & Modeling, where classical machine-learning models and deep models can be trained and evaluated.
At a high level:
Inputs (from previous steps)
- patch_cohort.csv (Step 3): one row per tile, with slide_id, tile_path, QC metrics, and split.
- patch_cohort_labels.csv (Step 4): tile-level labels and optional class fractions (for example primary_label, frac_tumor).
- Optional: tile-level masks (for example nuclei segmentation, tumor region masks).
Outputs (for next steps)
- tile_features.csv: one row per tile, with:
  - IDs: slide_id, tile_path, split
  - Labels: primary_label (tumor, normal, unlabeled, …)
  - Features: morphometrics, texture, intensity, and any other numeric descriptors.
- Optional cell_features.csv: one row per nucleus/object, for detailed analyses or MIL-style aggregation.
This page (FEAT-01) shows how to build those features with Scikit-Image and Pillow, integrated into the pipeline objects from Steps 1–4. Scikit-Image provides scientific image-processing routines (including geometry and texture operations), while Pillow handles image loading and basic manipulation.
Name of Tool
Scikit-Image (The Measuring Toolkit) & Pillow (The Image Loader & Converter)
- Scikit-Image (skimage) is a scientific image-processing library that builds on NumPy and SciPy to provide geometry, segmentation, filtering, and texture-analysis functions.
- Pillow (PIL) is the maintained fork of the Python Imaging Library, used here to open tiles from disk and convert them into arrays that Scikit-Image can measure.
Together, they are the measurement engine for Step 5.
Technical Explanation
Scikit-Image represents images as NumPy arrays and provides a rich toolbox for measurements:
- Object-level morphometrics
  skimage.measure.regionprops computes shape and intensity properties for labeled regions:
  - area, perimeter
  - eccentricity, solidity
  - major/minor axis lengths
  - mean intensity (if an intensity image is provided).
- Texture features (GLCM / Haralick-type)
  skimage.feature.graycomatrix constructs gray-level co-occurrence matrices; graycoprops extracts features such as:
  - contrast
  - homogeneity
  - energy
  - correlation.
- Intensity & color handling
  Utility functions in skimage.exposure and skimage.color help with intensity rescaling and color conversions (for example RGB → grayscale), which are often needed before texture and intensity measurements.
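To make the calls above concrete, here is a minimal sketch on a small synthetic mask and a random intensity image (both purely illustrative, not real tiles):

import numpy as np
from skimage.measure import label, regionprops
from skimage.feature import graycomatrix, graycoprops

# Toy binary "nuclei" mask and a matching fake intensity image
toy_mask = np.zeros((64, 64), dtype=np.uint8)
toy_mask[10:30, 10:30] = 1          # a square "nucleus"
toy_mask[40:55, 35:60] = 1          # an elongated "nucleus"
toy_intensity = np.random.default_rng(0).integers(0, 256, (64, 64), dtype=np.uint8)

# Object-level morphometrics on the labeled mask
for region in regionprops(label(toy_mask), intensity_image=toy_intensity):
    print(region.area, region.eccentricity, region.solidity, region.mean_intensity)

# GLCM texture on the (already 8-bit) intensity image
glcm = graycomatrix(toy_intensity, distances=[1], angles=[0], levels=256,
                    symmetric=True, normed=True)
print(graycoprops(glcm, "contrast"), graycoprops(glcm, "homogeneity"))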
Pillow provides:
- Image I/O – PIL.Image.open() to load each tile using the tile_path from patch_cohort.csv.
- Format conversion – img.convert("L") for grayscale; optional resizing or cropping as needed.
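A minimal loading sketch; the tile path shown here is illustrative, whereas in the pipeline it comes from the tile_path column of patch_cohort.csv:

from pathlib import Path
import numpy as np
from PIL import Image

tile_path = Path("/path/to/project/tiles/SLIDE_001/tile_00042.png")  # illustrative path

img_rgb = Image.open(tile_path)      # load the RGB tile from disk
img_gray = img_rgb.convert("L")      # 8-bit grayscale for texture/intensity work

tile_rgb = np.array(img_rgb)         # H x W x 3 array, if color features are needed
tile_gray = np.array(img_gray)       # H x W array for GLCM / intensity statistics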
Where it sits in the pipeline
Data coming in (from Steps 3–4)
- patch_cohort.csv contains one row per tile with paths and QC metrics.
- patch_cohort_labels.csv contains tile-level labels and optional class fractions created by projecting annotations (Step 4).
- Optional segmentation outputs provide masks (for example nuclei masks, tumor-region masks) at tile level.
Measurement phase (Step 5)
For each tile (typically starting with the training split):
- Load the RGB tile with Pillow; convert to grayscale for texture and intensity measurements.
- Optionally load corresponding masks (for example nuclei segmentation) and run regionprops for object-level morphometrics.
- Compute:
  - per-object features (area, circularity, eccentricity, mean intensity), and
  - per-tile features (GLCM texture, intensity statistics, aggregated morphometrics).
Aggregation & feature tables
- Object-level features: stored in cell_features.csv (one row per nucleus/object) if needed.
- Tile-level features: aggregated statistics (mean, standard deviation, percentiles, counts) per tile, merged with labels and split information into a single tile_features.csv.
Units & scaling
On its own, Scikit-Image measures in pixel units.
If you need physical units (for example µm² for area), combine Step 3’s MPP information with Step 5 outputs:
area_um2 = area_px * (mpp_x * mpp_y)
perimeter_um = perimeter_px * mpp
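As a small sketch of those formulas in code (the averaged mpp used for perimeter is an approximation introduced here for near-isotropic pixels):

def to_physical_units(area_px: float, perimeter_px: float, mpp_x: float, mpp_y: float):
    """Convert pixel-based measurements to microns using microns-per-pixel (MPP)."""
    area_um2 = area_px * (mpp_x * mpp_y)
    # Perimeter needs a single mpp value; averaging x/y is a reasonable
    # approximation when pixels are (nearly) isotropic.
    perimeter_um = perimeter_px * ((mpp_x + mpp_y) / 2.0)
    return area_um2, perimeter_um

# Example: a 500-pixel nucleus at 0.25 µm/px ≈ 31.25 µm²
print(to_physical_units(500, 90, 0.25, 0.25))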
Downstream (Step 6), tile_features.csv becomes the input to:
- classical ML models (logistic regression, random forests, gradient boosting),
- survival / outcome models,
- and it can serve as an interpretable companion to deep-learning feature embeddings.
Simplified Explanation
Scikit-Image + Pillow are the digital ruler, protractor, and texture meter of the pipeline.
Pathologists think and talk in qualitative terms:
- “Nuclei are large and irregular.”
- “The stroma here is dense and chaotic.”
- “These nuclei look very dark (hyperchromatic).”
Computers cannot work with those descriptions. They need numbers in a table.
Feature extraction translates those adjectives into metrics:
- “Large” → average nuclear area is high; distribution skewed toward big nuclei.
- “Irregular” → circularity is low, eccentricity is high.
- “Hyperchromatic” → mean intensity is low; intensity histogram is shifted toward darker values.
- “Chaotic stroma” → texture contrast is high, homogeneity is low.
Step 5 is the translation layer:
Human words → geometric / texture / intensity measurements → rows and columns in a CSV.
From this point on, you work with feature tables just like any other tabular dataset in machine learning.
What can it do?
Within this pipeline, Scikit-Image + Pillow can:
- Measure nuclear morphometrics
  - Per-nucleus features: area, perimeter, eccentricity, solidity, circularity, mean intensity.
  - Per-tile summaries: average nucleus size, variability in size, fraction of highly irregular nuclei, nuclei density per tile.
- Quantify texture
  - GLCM-based features to distinguish:
    - smooth regions (normal stroma, homogeneous tissue)
    - from rough/chaotic regions (desmoplastic stroma, certain tumor regions).
- Quantify intensity & stain behavior
  - Measure how dark or light nuclei or regions are (hyperchromasia vs pale nuclei).
  - Track variation in intensity across tiles for QC and biological interpretation.
- Build reusable feature tables
  - tile_features.csv for tile-level modeling.
  - Optional cell_features.csv for detailed morphometric analyses, clustering, or MIL aggregation.
- Support interpretability
  - Provide human-understandable numbers you can correlate with outcomes or use to explain model decisions.
Situations where it’s used (Medical Examples)
Nuclear pleomorphism in tumor grading
- Segment nuclei in annotated tumor tiles.
- Use regionprops to compute area, circularity, eccentricity, and mean intensity.
- Summarize per slide or per region:
  - higher mean and variance in nuclear area → more pleomorphism,
  - lower circularity → more irregular shapes,
  - lower mean intensity → more hyperchromatic nuclei.
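For example, a per-slide summary could be sketched with pandas as below, assuming cell_features.csv stores slide_id next to the per-nucleus columns produced in Block A:

import pandas as pd

cells = pd.read_csv("metadata/cell_features.csv")  # adjust path to your project root

pleomorphism = cells.groupby("slide_id").agg(
    mean_area=("area", "mean"),
    var_area=("area", "var"),
    mean_circularity=("circularity", "mean"),
    mean_intensity=("mean_intensity", "mean"),
)
# Higher mean/variance of area, lower circularity, and lower intensity
# point toward more pleomorphic, hyperchromatic nuclei.
print(pleomorphism.head())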
Stromal reaction and desmoplasia
- Focus on tiles labeled as stroma.
- Compute texture features (for example GLCM contrast, homogeneity).
- Show that desmoplastic stroma has higher contrast and lower homogeneity than normal collagen, supporting quantification of the stromal reaction.
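A possible sketch of that comparison on tile_features.csv (built in Block C); the stroma label names are illustrative and depend on your Step 4 labeling scheme:

import pandas as pd

tile_features = pd.read_csv("metadata/tile_features.csv")  # adjust path to your project root

# Label values below are illustrative; use whatever stroma labels Step 4 produced
stroma = tile_features[
    tile_features["primary_label"].isin(["stroma_normal", "stroma_desmoplastic"])
]
summary = stroma.groupby("primary_label")[
    ["glcm_contrast_mean", "glcm_homogeneity_mean"]
].agg(["mean", "std"])
print(summary)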
Tumor–immune microenvironment
If nuclei segmentation is available:
- separate lymphocyte-like nuclei (small, round) from larger tumor nuclei based on morphometrics.
- compute, per tile:
  - lymphocyte density,
  - tumor-cell density,
  - ratios of lymphocytes to tumor cells.
- correlate these with outcomes or immune signatures.
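One possible sketch of the lymphocyte/tumor split, using the per-nucleus rows returned by nuclei_features() in Block A; the area and circularity thresholds are illustrative and would need tuning on your own data:

import pandas as pd

# cell_rows = nuclei_features(nuclei_mask, intensity_image=tile_gray)  # see Block A
cell_df = pd.DataFrame(cell_rows)

# Illustrative thresholds: small, round nuclei treated as lymphocyte-like
is_lymphocyte_like = (cell_df["area"] < 120) & (cell_df["circularity"] > 0.85)

lymph_count = int(is_lymphocyte_like.sum())
tumor_like_count = int((~is_lymphocyte_like).sum())
lymph_to_tumor_ratio = lymph_count / max(tumor_like_count, 1)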
QC and artifact monitoring beyond Step 3
Combine intensity and texture metrics to:
- detect tiles with unusual intensity distributions (possible stain issues),
- identify areas where tissue folding or crushing leads to abnormal texture patterns.
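For example, a simple intensity screen on tile_features.csv might look like the sketch below; the 3-standard-deviation cutoff is an arbitrary illustrative threshold:

import pandas as pd

tile_features = pd.read_csv("metadata/tile_features.csv")  # adjust path to your project root

# Flag tiles whose mean intensity sits far from the cohort average (possible stain issues)
mu = tile_features["mean_intensity"].mean()
sigma = tile_features["mean_intensity"].std()
tile_features["intensity_zscore"] = (tile_features["mean_intensity"] - mu) / sigma

suspect = tile_features[tile_features["intensity_zscore"].abs() > 3]
print(f"{len(suspect)} tiles with unusual intensity distributions")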
Why it’s important to pathologists
Feature extraction is crucial because it:
- Makes pathology quantitative
  Moves from “it looks bad” to “median nuclear area is 2.5× normal with high variance; texture contrast is elevated in tumor stroma.”
- Improves reproducibility
  Numeric measurements allow grading rules, risk scores, and research claims to be tested and reproduced across cohorts and institutions.
- Supports clinical research
  Morphometric and texture features can be correlated with:
  - grade, stage, molecular subtypes,
  - treatment response and survival.
- Stays interpretable
  Features like area, circularity, contrast, and intensity are closer to pathologists’ intuitions than deep-network internal activations. Even if you later rely on deep learning, classical features remain useful as a sanity check and as an explanatory layer.
- Builds a bridge to ML
  Step 5 is where domain knowledge (what pathologists care about) is encoded as features that machine-learning models can understand and learn from.
Installation Instructions
If you followed Step 0 – Initial System Setup, your histopath-core environment already has:
numpy, pandas, scikit-learn, matplotlib, pillow, and openslide-python.
To add Scikit-Image (and ensure the other libraries are present), install into the same environment:
pip install scikit-image pillow numpy matplotlib

For more detailed and up-to-date installation instructions, see:
- Scikit-Image install guide: https://scikit-image.org/docs/stable/user_guide/install.html
- Pillow install docs: https://python-pillow.github.io/
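A quick sanity check that both libraries import in the active environment:

import skimage
import PIL

print("scikit-image:", skimage.__version__)
print("Pillow:", PIL.__version__)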
Lego Building Blocks (Code)
These blocks are teaching examples that follow the pipeline assumptions:
- A project root folder (PFP-01).
- metadata/patch_cohort.csv from Step 3.
- metadata/patch_cohort_labels.csv from Step 4.
You can adapt them to your own column names and file layout.
Block A: Geometric Features (Measuring Shape in Real Tiles)
The Situation
You have nuclei segmentation masks for tiles (from Step 4 or a separate segmentation stage). For each nucleus you want morphometric features, and then per-tile summaries.
The Solution
Use skimage.measure.regionprops on the nuclei mask to compute area, perimeter, eccentricity, solidity, and circularity. Optionally use the grayscale tile as an intensity image.
from pathlib import Path
import math

import numpy as np
import pandas as pd
from PIL import Image
from skimage.measure import label, regionprops

PROJECT_ROOT = Path("/path/to/project").resolve()

# Load tile-level cohort and labels
patch_df = pd.read_csv(PROJECT_ROOT / "metadata" / "patch_cohort.csv")
labels_df = pd.read_csv(PROJECT_ROOT / "metadata" / "patch_cohort_labels.csv")

# Merge to have tile metadata + labels in one table
patch_df = patch_df.merge(
    labels_df,
    on=["slide_id", "tile_path"],
    how="left",
    suffixes=("", "_label"),
)

# Optional: keep only QC-passing tiles
if "qc_status_tile" in patch_df.columns:
    patch_df = patch_df[patch_df["qc_status_tile"] != "FAIL"].copy()

# Example: restrict to the training split (keep everything if there is no split column)
if "split" in patch_df.columns:
    train_tiles = patch_df[patch_df["split"] == "train"].copy()
else:
    train_tiles = patch_df.copy()


def iter_tiles(df: pd.DataFrame, project_root: Path = PROJECT_ROOT):
    """
    Iterate over tiles, yielding (row, tile_gray).

    row: pandas Series with metadata + labels.
    tile_gray: 2D NumPy array (grayscale).
    """
    for _, row in df.iterrows():
        tile_rel = row["tile_path"]
        tile_path = project_root / tile_rel

        img = Image.open(tile_path)
        tile_gray = img.convert("L")

        yield row, np.array(tile_gray)


def nuclei_features(
    nuclei_mask: np.ndarray,
    intensity_image: np.ndarray | None = None,
    min_area: int = 20,
):
    """
    Compute per-nucleus features from a nuclei mask.

    nuclei_mask: 2D array
        0 = background, >0 = nucleus label, or a binary mask that will be labeled.
    intensity_image: 2D array (optional)
        grayscale tile for intensity statistics.
    """
    if nuclei_mask.dtype == bool or nuclei_mask.max() <= 1:
        labeled = label(nuclei_mask)
    else:
        labeled = nuclei_mask

    rows = []

    for region in regionprops(labeled, intensity_image=intensity_image):
        if region.area < min_area:
            continue  # remove tiny specks

        perimeter = region.perimeter
        if perimeter <= 0:
            circularity = float("nan")
        else:
            circularity = (4.0 * math.pi * region.area) / (perimeter ** 2)

        row = {
            "area": float(region.area),
            "perimeter": float(perimeter),
            "eccentricity": float(region.eccentricity),
            "solidity": float(region.solidity),
            "circularity": float(circularity),
        }

        if intensity_image is not None:
            row["mean_intensity"] = float(region.mean_intensity)
        else:
            row["mean_intensity"] = float("nan")

        rows.append(row)

    return rows

You can store these per-nucleus rows in cell_features.csv if you want object-level analyses, or aggregate them per tile (for example mean nucleus area, number of nuclei per tile) and merge with the tile-level features.
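If you also want to write cell_features.csv, one possible sketch is shown below; load_nuclei_mask() is the same user-defined placeholder that Block C references, and the extra ID columns are just one reasonable choice:

import pandas as pd

all_cells = []
for row, tile_gray in iter_tiles(train_tiles):
    nuclei_mask = load_nuclei_mask(row)   # hypothetical, user-defined mask loader
    if nuclei_mask is None:
        continue
    for cell in nuclei_features(nuclei_mask, intensity_image=tile_gray):
        # Keep IDs so object-level rows can be joined back to tiles and slides
        cell.update({"slide_id": row["slide_id"], "tile_path": row["tile_path"]})
        all_cells.append(cell)

cell_features_df = pd.DataFrame(all_cells)
cell_features_df.to_csv(PROJECT_ROOT / "metadata" / "cell_features.csv", index=False)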
Block B: Texture Features (Haralick/GLCM on Tiles)
The Situation
You want to distinguish tiles with smooth texture (for example normal stroma) from tiles with chaotic texture (for example desmoplastic reaction or dense tumor).
The Solution
Use the gray-level co-occurrence matrix (GLCM) to compute contrast, homogeneity, energy, and correlation.
import numpy as np
from skimage import exposure
from skimage.feature import graycomatrix, graycoprops


def tile_texture_intensity_features(tile_gray: np.ndarray, levels: int = 32):
    """
    Compute GLCM texture features + simple intensity statistics from a grayscale tile.

    tile_gray: 2D array, grayscale tile (uint8 or float).
    levels: number of gray levels to quantize into.
    """
    # 1. Quantize intensities to a fixed range of gray levels
    tile_scaled = exposure.rescale_intensity(
        tile_gray,
        in_range="image",
        out_range=(0, levels - 1),
    ).astype("uint8")

    # 2. Build GLCM at multiple distances and angles
    glcm = graycomatrix(
        tile_scaled,
        distances=[1, 2, 4],
        angles=[0, np.pi / 4, np.pi / 2, 3 * np.pi / 4],
        levels=levels,
        symmetric=True,
        normed=True,
    )

    features = {}

    # 3. Extract and average properties across all distances/angles
    for prop_name in ("contrast", "homogeneity", "energy", "correlation"):
        values = graycoprops(glcm, prop_name)
        features[f"glcm_{prop_name}_mean"] = float(values.mean())

    # 4. Basic intensity statistics
    features["mean_intensity"] = float(tile_scaled.mean())
    features["std_intensity"] = float(tile_scaled.std())

    return features

These features summarize both the “roughness” and “brightness” of each tile.
Block C: Intensity Features, Hyperchromasia & tile_features.csv
The Situation
You want a single CSV (tile_features.csv) containing:
- slide_id, tile_path, split, primary_label
- texture and intensity features
- optional aggregated morphometric features (if available).
- optional aggregated morphometric features (if available).
The Solution
Iterate over tiles, compute features, attach labels, and save everything in one table.
import numpy as np

feature_rows = []

for row, tile_gray in iter_tiles(train_tiles):
    primary_label = row.get("primary_label", "unlabeled")

    # 1. Tile-level texture + intensity
    tile_feats = tile_texture_intensity_features(tile_gray)

    # 2. (Optional) aggregated nuclei features if you have nuclei masks
    # You would implement `load_nuclei_mask(row)` to find and load
    # a mask file for this tile. Comment out if not using masks yet.
    #
    # nuclei_mask = load_nuclei_mask(row)  # user-defined function
    # if nuclei_mask is not None:
    #     cell_rows = nuclei_features(nuclei_mask, intensity_image=tile_gray)
    #     if cell_rows:
    #         cell_df = pd.DataFrame(cell_rows)
    #         tile_feats["mean_nucleus_area"] = float(cell_df["area"].mean())
    #         tile_feats["std_nucleus_area"] = float(cell_df["area"].std())
    #         tile_feats["mean_nucleus_circularity"] = float(
    #             cell_df["circularity"].mean()
    #         )
    #         tile_feats["nuclei_count"] = int(len(cell_df))
    #     else:
    #         tile_feats["mean_nucleus_area"] = np.nan
    #         tile_feats["std_nucleus_area"] = np.nan
    #         tile_feats["mean_nucleus_circularity"] = np.nan
    #         tile_feats["nuclei_count"] = 0

    feature_row = {
        "slide_id": row["slide_id"],
        "tile_path": row["tile_path"],
        "split": row.get("split", "train"),
        "primary_label": primary_label,
        **tile_feats,
    }

    feature_rows.append(feature_row)

# 3. Build the tile_features table
tile_features_df = pd.DataFrame(feature_rows)

# 4. Save for Step 6
out_path = PROJECT_ROOT / "metadata" / "tile_features.csv"
tile_features_df.to_csv(out_path, index=False)

print(f"Saved {len(tile_features_df)} rows to {out_path}")
print(tile_features_df.head())

This tile_features.csv is the main output of Step 5:
- It joins IDs, labels, and numeric features into one table.
- It feeds directly into Step 6 (ML & modeling) alongside your cohort and patch-level metadata.
Resource Site
- Scikit-Image documentation (landing page): https://scikit-image.org/docs/stable/
- Scikit-Image User Guide: https://scikit-image.org/docs/stable/user_guide/index.html
- Scikit-Image Examples Gallery: https://scikit-image.org/docs/stable/auto_examples/index.html
- Pillow documentation and examples: https://python-pillow.github.io/