
Step 5 – Feature Extraction (The Morphometrics Phase)


Step 3 gave you high-quality tiles and QC metrics.
Step 4 projected pathologist-drawn regions onto those tiles (and optionally into masks).

Step 5 takes those tiles and masks and turns them into numbers:

  • At tile level: one feature vector per tile → tile_features.csv.
  • Optionally, at cell/object level: one feature vector per nucleus or object → cell_features.csv.

These feature tables are the main input to Step 6 – ML & Modeling, where classical machine-learning models and deep models can be trained and evaluated.

At a high level:

  • patch_cohort.csv (Step 3): one row per tile, with slide_id, tile_path, QC metrics, and split.
  • patch_cohort_labels.csv (Step 4): tile-level labels and optional class fractions (for example primary_label, frac_tumor).
  • Optional: tile-level masks (for example nuclei segmentation, tumor region masks).
  • tile_features.csv: one row per tile, with:
    • IDs: slide_id, tile_path, split
    • Labels: primary_label (tumor, normal, unlabeled, …)
    • Features: morphometrics, texture, intensity, and any other numeric descriptors.
  • Optional cell_features.csv: one row per nucleus/object, for detailed analyses or MIL-style aggregation.

This page (FEAT-01) shows how to build those features with Scikit-Image and Pillow, integrated into the pipeline objects from Steps 1–4. Scikit-Image provides scientific image-processing routines (including geometry and texture operations), while Pillow handles image loading and basic manipulation.

Scikit-Image (The Measuring Toolkit) & Pillow (The Image Loader & Converter)

  • Scikit-Image (skimage) is a scientific image-processing library that builds on NumPy and SciPy to provide geometry, segmentation, filtering, and texture-analysis functions.
  • Pillow (PIL) is the maintained fork of the Python Imaging Library, used here to open tiles from disk and convert them into arrays that Scikit-Image can measure.

Together, they are the measurement engine for Step 5.

Scikit-Image represents images as NumPy arrays and provides a rich toolbox for measurements:

  • Object-level morphometrics
    skimage.measure.regionprops computes shape and intensity properties for labeled regions:

    • area, perimeter
    • eccentricity, solidity
    • major/minor axis lengths
    • mean intensity (if an intensity image is provided).
  • Texture features (GLCM / Haralick-type)
    skimage.feature.graycomatrix constructs gray-level co-occurrence matrices; graycoprops extracts features such as:

    • contrast
    • homogeneity
    • energy
    • correlation.
  • Intensity & color handling
    Utility functions in skimage.exposure and skimage.color help with intensity rescaling and color conversions (for example RGB → grayscale), which are often needed before texture and intensity measurements.

Pillow provides:

  • Image I/O: PIL.Image.open() to load each tile using the tile_path from patch_cohort.csv.
  • Format conversion: img.convert("L") for grayscale; optional resizing or cropping as needed.
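A minimal sketch of this loading path (a synthetic in-memory tile stands in for a real `tile_path`, which would normally be opened with `Image.open`):

```python
import numpy as np
from PIL import Image

# Create a small synthetic RGB "tile" in place of a real file from
# patch_cohort.csv (a real pipeline would call Image.open(tile_path)).
rgb = np.random.default_rng(0).integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
img = Image.fromarray(rgb)

# Convert to grayscale ("L" mode) and back to a 2D NumPy array,
# which is the representation Scikit-Image functions expect.
tile_gray = np.array(img.convert("L"))

print(tile_gray.shape)  # (64, 64)
print(tile_gray.dtype)  # uint8
```

The resulting 2D `uint8` array can be passed directly to the Scikit-Image measurement functions below.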

Data coming in (from Steps 3–4)

  • patch_cohort.csv contains one row per tile with paths and QC metrics.
  • patch_cohort_labels.csv contains tile-level labels and optional class fractions created by projecting annotations (Step 4).
  • Optional segmentation outputs provide masks (for example nuclei masks, tumor-region masks) at tile level.

Measurement phase (Step 5)

For each tile (typically starting with the training split):

  • Load the RGB tile with Pillow; convert to grayscale for texture and intensity measurements.
  • Optionally load corresponding masks (for example nuclei segmentation) and run regionprops for object-level morphometrics.
  • Compute:
    • per-object features (area, circularity, eccentricity, mean intensity), and
    • per-tile features (GLCM texture, intensity statistics, aggregated morphometrics).

Aggregation & feature tables

  • Object-level features: stored in cell_features.csv (one row per nucleus/object) if needed.
  • Tile-level features: aggregated statistics (mean, standard deviation, percentiles, counts) per tile merged with labels and split into a single tile_features.csv.

On its own, Scikit-Image measures in pixel units.
If you need physical units (for example µm² for area), combine Step 3’s MPP information with Step 5 outputs:

  • area_um2 = area_px * (mpp_x * mpp_y)
  • perimeter_um = perimeter_px * mpp (using a single mpp when pixels are near-isotropic, mpp_x ≈ mpp_y)
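The conversion is a simple vectorized operation once the MPP columns from Step 3 are merged in (the column names and values below are illustrative):

```python
import pandas as pd

# Hypothetical per-tile features measured in pixels, plus MPP from Step 3.
df = pd.DataFrame({
    "area_px": [400.0, 900.0],
    "perimeter_px": [80.0, 120.0],
    "mpp_x": [0.25, 0.25],  # microns per pixel, x axis
    "mpp_y": [0.25, 0.25],  # microns per pixel, y axis
})

# Area scales by the product of the per-axis pixel sizes ...
df["area_um2"] = df["area_px"] * (df["mpp_x"] * df["mpp_y"])
# ... while perimeter scales linearly (assuming near-isotropic pixels).
df["perimeter_um"] = df["perimeter_px"] * df["mpp_x"]

print(df[["area_um2", "perimeter_um"]])
```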

Downstream (Step 6), tile_features.csv becomes the input to:

  • classical ML models (logistic regression, random forests, gradient boosting),
  • survival / outcome models,
  • or it serves as an interpretable companion to deep-learning feature embeddings.
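As a brief preview of how Step 6 might consume the table (synthetic data; the feature and label columns here are illustrative stand-ins for tile_features.csv):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Stand-in for metadata/tile_features.csv: numeric feature columns
# plus the primary_label produced in Steps 4-5 (values are synthetic).
rng = np.random.default_rng(42)
n = 200
tile_features_df = pd.DataFrame({
    "glcm_contrast_mean": rng.normal(0, 1, n),
    "mean_intensity": rng.normal(0, 1, n),
    "primary_label": rng.choice(["tumor", "normal"], n),
})

feature_cols = ["glcm_contrast_mean", "mean_intensity"]
X = tile_features_df[feature_cols].to_numpy()
y = (tile_features_df["primary_label"] == "tumor").astype(int)

# A classical, interpretable baseline model on the feature table.
clf = LogisticRegression().fit(X, y)
print(clf.predict(X[:5]).shape)  # one prediction per tile
```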

Scikit-Image + Pillow are the digital ruler, protractor, and texture meter of the pipeline.

Pathologists think and talk in qualitative terms:

  • “Nuclei are large and irregular.”
  • “The stroma here is dense and chaotic.”
  • “These nuclei look very dark (hyperchromatic).”

Computers cannot work with those descriptions. They need numbers in a table.

Feature extraction translates those adjectives into metrics:

  • “Large” → average nuclear area is high; distribution skewed toward big nuclei.
  • “Irregular” → circularity is low, eccentricity is high.
  • “Hyperchromatic” → mean intensity is low; intensity histogram is shifted toward darker values.
  • “Chaotic stroma” → texture contrast is high, homogeneity is low.

Step 5 is the translation layer:

Human words → geometric / texture / intensity measurements → rows and columns in a CSV.

From this point on, you work with feature tables just like any other tabular dataset in machine learning.

Within this pipeline, Scikit-Image + Pillow can:

  • Measure nuclear morphometrics

    • Per-nucleus features: area, perimeter, eccentricity, solidity, circularity, mean intensity.
    • Per-tile summaries: average nucleus size, variability in size, fraction of highly irregular nuclei, nuclei density per tile.
  • Quantify texture

    • GLCM-based features to distinguish:
      • smooth regions (normal stroma, homogeneous tissue)
      • from rough/chaotic regions (desmoplastic stroma, certain tumor regions).
  • Quantify intensity & stain behavior

    • Measure how dark or light nuclei or regions are (hyperchromasia vs pale nuclei).
    • Track variation in intensity across tiles for QC and biological interpretation.
  • Build reusable feature tables

    • tile_features.csv for tile-level modeling.
    • Optional cell_features.csv for detailed morphometric analyses, clustering, or MIL aggregation.
  • Support interpretability

    • Provide human-understandable numbers you can correlate with outcomes or use to explain model decisions.

Situations where it’s used (Medical Examples)

Nuclear pleomorphism:

  • Segment nuclei in annotated tumor tiles.
  • Use regionprops to compute area, circularity, eccentricity, and mean intensity.
  • Summarize per slide or per region:
    • higher mean and variance in nuclear area → more pleomorphism,
    • lower circularity → more irregular shapes,
    • lower mean intensity → more hyperchromatic nuclei.

Stromal texture:

  • Focus on tiles labeled as stroma.
  • Compute texture features (for example GLCM contrast, homogeneity).
  • Show that desmoplastic stroma has higher contrast and lower homogeneity than normal collagen, supporting quantification of the stromal reaction.
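The per-slide summaries above reduce to a simple groupby over cell-level rows (the values below are hardcoded illustrations, not real measurements):

```python
import pandas as pd

# Hypothetical cell-level rows (as nuclei_features in Block A would produce),
# tagged with the slide they came from.
cell_df = pd.DataFrame({
    "slide_id": ["S1", "S1", "S1", "S2", "S2"],
    "area": [40.0, 90.0, 160.0, 50.0, 55.0],
    "circularity": [0.9, 0.6, 0.4, 0.85, 0.8],
    "mean_intensity": [120.0, 90.0, 70.0, 130.0, 125.0],
})

# Per-slide summaries: higher mean/variance of area suggests pleomorphism,
# lower circularity suggests irregular shapes, lower mean intensity
# suggests hyperchromasia.
summary = cell_df.groupby("slide_id").agg(
    mean_area=("area", "mean"),
    var_area=("area", "var"),
    mean_circularity=("circularity", "mean"),
    mean_intensity=("mean_intensity", "mean"),
)
print(summary)
```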

If nuclei segmentation is available:

  • separate lymphocyte-like nuclei (small, round) from larger tumor nuclei based on morphometrics.
  • compute, per tile:
    • lymphocyte density,
    • tumor-cell density,
    • ratios of lymphocytes to tumor cells.
  • correlate these with outcomes or immune signatures.
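A rule-based split of this kind can be sketched as below; the area and circularity thresholds are placeholders for illustration, not validated cut-offs:

```python
import pandas as pd

# Illustrative per-nucleus features for a single tile.
cells = pd.DataFrame({
    "tile_path": ["t1.png"] * 4,
    "area": [30.0, 35.0, 150.0, 200.0],
    "circularity": [0.95, 0.9, 0.5, 0.4],
})

# Simple rule: small, round nuclei -> lymphocyte-like; otherwise tumor-like.
is_lymph = (cells["area"] < 60) & (cells["circularity"] > 0.8)
cells["cell_class"] = is_lymph.map({True: "lymphocyte", False: "tumor"})

# Per-tile counts and lymphocyte-to-tumor ratio.
counts = cells.groupby("tile_path")["cell_class"].value_counts().unstack(fill_value=0)
counts["lymph_tumor_ratio"] = counts["lymphocyte"] / counts["tumor"]
print(counts)
```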

Combine intensity and texture metrics to:

  • detect tiles with unusual intensity distributions (possible stain issues),
  • identify areas where tissue folding or crushing leads to abnormal texture patterns.
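One possible heuristic for the first point is a z-score cut-off on per-tile intensity statistics (the data below is synthetic, and the cut-off of 3 is an assumption, not a fixed standard):

```python
import numpy as np
import pandas as pd

# Synthetic per-tile mean-intensity values with one clearly abnormal tile.
rng = np.random.default_rng(0)
df = pd.DataFrame({"mean_intensity": rng.normal(150, 10, 100)})
df.loc[0, "mean_intensity"] = 40.0  # inject a possible stain/scan artifact

# Flag tiles whose mean intensity is far from the cohort distribution.
z = (df["mean_intensity"] - df["mean_intensity"].mean()) / df["mean_intensity"].std()
df["intensity_outlier"] = z.abs() > 3

print(df["intensity_outlier"].iloc[0])
```

Flagged tiles can then be reviewed manually or excluded before modeling.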

Feature extraction is crucial because it:

  • Makes pathology quantitative
    Moves from “it looks bad” to “median nuclear area is 2.5× normal with high variance; texture contrast is elevated in tumor stroma.”

  • Improves reproducibility
    Numeric measurements allow grading rules, risk scores, and research claims to be tested and reproduced across cohorts and institutions.

  • Supports clinical research
    Morphometric and texture features can be correlated with:

    • grade, stage, molecular subtypes,
    • treatment response and survival.
  • Stays interpretable
    Features like area, circularity, contrast, and intensity are closer to pathologists’ intuitions than deep-network internal activations. Even if you later rely on deep learning, classical features remain useful as a sanity check and as an explanatory layer.

  • Builds a bridge to ML
    Step 5 is where domain knowledge (what pathologists care about) is encoded as features that machine-learning models can understand and learn from.

If you followed Step 0 – Initial System Setup, your histopath-core environment already has:

  • numpy, pandas, scikit-learn, matplotlib, pillow, and openslide-python.

To add Scikit-Image (and ensure the other libraries are present), install into the same environment:

pip install scikit-image pillow numpy matplotlib

For more detailed and up-to-date installation instructions, see the official Scikit-Image and Pillow documentation.

These blocks are teaching examples that follow the pipeline assumptions:

  • A project root folder (PFP-01).
  • metadata/patch_cohort.csv from Step 3.
  • metadata/patch_cohort_labels.csv from Step 4.

You can adapt them to your own column names and file layout.

Block A: Geometric Features (Measuring Shape in Real Tiles)


The Situation
You have nuclei segmentation masks for tiles (from Step 4 or a separate segmentation stage). For each nucleus you want morphometric features, and then per-tile summaries.

The Solution
Use skimage.measure.regionprops on the nuclei mask, compute area, perimeter, eccentricity, solidity, and circularity. Optionally use the grayscale tile as an intensity image.

from pathlib import Path
import math

import numpy as np
import pandas as pd
from PIL import Image
from skimage.measure import label, regionprops

PROJECT_ROOT = Path("/path/to/project").resolve()

# Load tile-level cohort and labels
patch_df = pd.read_csv(PROJECT_ROOT / "metadata" / "patch_cohort.csv")
labels_df = pd.read_csv(PROJECT_ROOT / "metadata" / "patch_cohort_labels.csv")

# Merge to have tile metadata + labels in one table
patch_df = patch_df.merge(
    labels_df,
    on=["slide_id", "tile_path"],
    how="left",
    suffixes=("", "_label"),
)

# Optional: keep only QC-passing tiles
if "qc_status_tile" in patch_df.columns:
    patch_df = patch_df[patch_df["qc_status_tile"] != "FAIL"].copy()

# Example: restrict to training split (fall back to all tiles if
# no split column exists)
if "split" in patch_df.columns:
    train_tiles = patch_df[patch_df["split"] == "train"].copy()
else:
    train_tiles = patch_df.copy()


def iter_tiles(df: pd.DataFrame, project_root: Path = PROJECT_ROOT):
    """
    Iterate over tiles, yielding (row, tile_gray).

    row: pandas Series with metadata + labels.
    tile_gray: 2D NumPy array (grayscale).
    """
    for _, row in df.iterrows():
        tile_rel = row["tile_path"]
        tile_path = project_root / tile_rel
        img = Image.open(tile_path)
        tile_gray = img.convert("L")
        yield row, np.array(tile_gray)


def nuclei_features(
    nuclei_mask: np.ndarray,
    intensity_image: np.ndarray | None = None,
    min_area: int = 20,
):
    """
    Compute per-nucleus features from a nuclei mask.

    nuclei_mask: 2D array
        0 = background, >0 = nucleus label,
        or a binary mask that will be labeled.
    intensity_image: 2D array (optional)
        grayscale tile for intensity statistics.
    """
    if nuclei_mask.dtype == bool or nuclei_mask.max() <= 1:
        labeled = label(nuclei_mask)
    else:
        labeled = nuclei_mask

    rows = []
    for region in regionprops(labeled, intensity_image=intensity_image):
        if region.area < min_area:
            continue  # remove tiny specks

        perimeter = region.perimeter
        if perimeter <= 0:
            circularity = float("nan")
        else:
            circularity = (4.0 * math.pi * region.area) / (perimeter ** 2)

        row = {
            "area": float(region.area),
            "perimeter": float(perimeter),
            "eccentricity": float(region.eccentricity),
            "solidity": float(region.solidity),
            "circularity": float(circularity),
        }
        if intensity_image is not None:
            row["mean_intensity"] = float(region.mean_intensity)
        else:
            row["mean_intensity"] = float("nan")
        rows.append(row)

    return rows

You can store these per-nucleus rows into cell_features.csv if you want object-level analyses, or aggregate them per tile (for example mean nucleus area, number of nuclei per tile) and merge with the tile-level features.
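A minimal sketch of that per-tile aggregation, using hardcoded example rows in place of real nuclei_features() output:

```python
import pandas as pd

# Rows shaped like the dicts nuclei_features() returns for one tile
# (values here are illustrative).
cell_rows = [
    {"area": 50.0, "circularity": 0.9, "mean_intensity": 110.0},
    {"area": 80.0, "circularity": 0.7, "mean_intensity": 95.0},
    {"area": 65.0, "circularity": 0.8, "mean_intensity": 100.0},
]

cell_df = pd.DataFrame(cell_rows)

# Aggregate object-level rows into a single tile-level summary.
tile_summary = {
    "mean_nucleus_area": float(cell_df["area"].mean()),
    "std_nucleus_area": float(cell_df["area"].std()),
    "mean_nucleus_circularity": float(cell_df["circularity"].mean()),
    "nuclei_count": int(len(cell_df)),
}
print(tile_summary)
```

These summary keys can then be merged into the tile-level feature row built in Block C.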

Block B: Texture Features (Haralick/GLCM on Tiles)


The Situation
You want to distinguish tiles with smooth texture (for example normal stroma) from tiles with chaotic texture (for example desmoplastic reaction or dense tumor).

The Solution
Use the gray-level co-occurrence matrix (GLCM) to compute contrast, homogeneity, energy, and correlation.

import numpy as np
from skimage import exposure
from skimage.feature import graycomatrix, graycoprops


def tile_texture_intensity_features(tile_gray: np.ndarray, levels: int = 32):
    """
    Compute GLCM texture features + simple intensity statistics
    from a grayscale tile.

    tile_gray: 2D array, grayscale tile (uint8 or float).
    levels: number of gray levels to quantize into.
    """
    # 1. Quantize intensities to a fixed range of gray levels
    tile_scaled = exposure.rescale_intensity(
        tile_gray,
        in_range="image",
        out_range=(0, levels - 1),
    ).astype("uint8")

    # 2. Build GLCM at multiple distances and angles
    glcm = graycomatrix(
        tile_scaled,
        distances=[1, 2, 4],
        angles=[0, np.pi / 4, np.pi / 2, 3 * np.pi / 4],
        levels=levels,
        symmetric=True,
        normed=True,
    )

    features = {}
    # 3. Extract and average properties across all distances/angles
    for prop_name in ("contrast", "homogeneity", "energy", "correlation"):
        values = graycoprops(glcm, prop_name)
        features[f"glcm_{prop_name}_mean"] = float(values.mean())

    # 4. Basic intensity statistics
    features["mean_intensity"] = float(tile_scaled.mean())
    features["std_intensity"] = float(tile_scaled.std())

    return features

These features summarize both the “roughness” and “brightness” of each tile.

Block C: Intensity Features, Hyperchromasia & tile_features.csv


The Situation
You want a single CSV (tile_features.csv) containing:

  • slide_id, tile_path, split, primary_label
  • texture and intensity features
  • optional aggregated morphometric features (if available).

The Solution
Iterate over tiles, compute features, attach labels, and save everything in one table.

import numpy as np
import pandas as pd

feature_rows = []
for row, tile_gray in iter_tiles(train_tiles):
    primary_label = row.get("primary_label", "unlabeled")

    # 1. Tile-level texture + intensity
    tile_feats = tile_texture_intensity_features(tile_gray)

    # 2. (Optional) aggregated nuclei features if you have nuclei masks.
    #    You would implement `load_nuclei_mask(row)` to find and load
    #    a mask file for this tile. Leave commented out if not using masks yet.
    #
    # nuclei_mask = load_nuclei_mask(row)  # user-defined function
    # if nuclei_mask is not None:
    #     cell_rows = nuclei_features(nuclei_mask, intensity_image=tile_gray)
    #     if cell_rows:
    #         cell_df = pd.DataFrame(cell_rows)
    #         tile_feats["mean_nucleus_area"] = float(cell_df["area"].mean())
    #         tile_feats["std_nucleus_area"] = float(cell_df["area"].std())
    #         tile_feats["mean_nucleus_circularity"] = float(
    #             cell_df["circularity"].mean()
    #         )
    #         tile_feats["nuclei_count"] = int(len(cell_df))
    #     else:
    #         tile_feats["mean_nucleus_area"] = np.nan
    #         tile_feats["std_nucleus_area"] = np.nan
    #         tile_feats["mean_nucleus_circularity"] = np.nan
    #         tile_feats["nuclei_count"] = 0

    feature_row = {
        "slide_id": row["slide_id"],
        "tile_path": row["tile_path"],
        "split": row.get("split", "train"),
        "primary_label": primary_label,
        **tile_feats,
    }
    feature_rows.append(feature_row)

# 3. Build the tile_features table
tile_features_df = pd.DataFrame(feature_rows)

# 4. Save for Step 6
out_path = PROJECT_ROOT / "metadata" / "tile_features.csv"
tile_features_df.to_csv(out_path, index=False)
print(f"Saved {len(tile_features_df)} rows to {out_path}")
print(tile_features_df.head())

This tile_features.csv is the main output of Step 5:

  • It joins IDs, labels, and numeric features into one table.
  • It feeds directly into Step 6 (ML & modeling) alongside your cohort and patch-level metadata.