
Step 5 – Feature Extraction (The Morphometrics Phase)


Step 3 gave you high-quality tiles and QC metrics.
Step 4 projected pathologist-drawn regions onto those tiles (and optionally into masks).

Step 5 takes those tiles and masks and turns them into numbers:

  • At tile level: one feature vector per tile → tile_features.csv.
  • Optionally, at cell/object level: one feature vector per nucleus or object → cell_features.csv.

These feature tables are the main input to Step 6 – ML & Modeling, where classical machine-learning models and deep models can be trained and evaluated.

At a high level:

  • patch_cohort.csv (Step 3): one row per tile, with slide_id, tile_path, QC metrics, and split.
  • patch_cohort_labels.csv (Step 4): tile-level labels and optional class fractions (for example primary_label, frac_tumor).
  • Optional: tile-level masks (for example nuclei segmentation, tumor region masks).
  • tile_features.csv: one row per tile, with:
    • IDs: slide_id, tile_path, split
    • Labels: primary_label (tumor, normal, unlabeled, …)
    • Features: morphometrics, texture, intensity, and any other numeric descriptors.
  • Optional cell_features.csv: one row per nucleus/object, for detailed analyses or MIL-style aggregation.

This page (FEAT-01) shows how to build those features with Scikit-Image and Pillow, integrated into the pipeline objects from Steps 1–4. Scikit-Image provides scientific image-processing routines (including geometry and texture operations), while Pillow handles image loading and basic manipulation.

Scikit-Image (The Measuring Toolkit) & Pillow (The Image Loader & Converter)

  • Scikit-Image (skimage) is a scientific image-processing library that builds on NumPy and SciPy to provide geometry, segmentation, filtering, and texture-analysis functions.
  • Pillow (PIL) is the maintained fork of the Python Imaging Library, used here to open tiles from disk and convert them into arrays that Scikit-Image can measure.

Together, they are the measurement engine for Step 5.

Scikit-Image represents images as NumPy arrays and provides a rich toolbox for measurements:

  • Object-level morphometrics
    skimage.measure.regionprops computes shape and intensity properties for labeled regions:

    • area, perimeter
    • eccentricity, solidity
    • major/minor axis lengths
    • mean intensity (if an intensity image is provided).
  • Texture features (GLCM / Haralick-type)
    skimage.feature.graycomatrix constructs gray-level co-occurrence matrices; graycoprops extracts features such as:

    • contrast
    • homogeneity
    • energy
    • correlation.
  • Intensity & color handling
    Utility functions in skimage.exposure and skimage.color help with intensity rescaling and color conversions (for example RGB → grayscale), which are often needed before texture and intensity measurements.

Pillow provides:

  • Image I/O: PIL.Image.open() to load each tile using the tile_path from patch_cohort.csv.
  • Format conversion: img.convert("L") for grayscale; optional resizing or cropping as needed.
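A minimal sketch of this loading path (a synthetic in-memory tile stands in for a real `tile_path`, which would normally be opened with `Image.open`):

```python
import numpy as np
from PIL import Image

# Create a small synthetic RGB "tile" in place of a real file from
# patch_cohort.csv (a real pipeline would call Image.open(tile_path)).
rgb = np.random.default_rng(0).integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
img = Image.fromarray(rgb)

# Convert to grayscale ("L" mode) and back to a 2D NumPy array,
# which is the representation Scikit-Image functions expect.
tile_gray = np.array(img.convert("L"))

print(tile_gray.shape)  # (64, 64)
print(tile_gray.dtype)  # uint8
```

The resulting 2D `uint8` array can be passed directly to the Scikit-Image measurement functions below.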

Data coming in (from Steps 3–4)

  • patch_cohort.csv contains one row per tile with paths and QC metrics.
  • patch_cohort_labels.csv contains tile-level labels and optional class fractions created by projecting annotations (Step 4).
  • Optional segmentation outputs provide masks (for example nuclei masks, tumor-region masks) at tile level.

Measurement phase (Step 5)

For each tile (typically starting with the training split):

  • Load the RGB tile with Pillow; convert to grayscale for texture and intensity measurements.
  • Optionally load corresponding masks (for example nuclei segmentation) and run regionprops for object-level morphometrics.
  • Compute:
    • per-object features (area, circularity, eccentricity, mean intensity), and
    • per-tile features (GLCM texture, intensity statistics, aggregated morphometrics).

Aggregation & feature tables

  • Object-level features: stored in cell_features.csv (one row per nucleus/object) if needed.
  • Tile-level features: aggregated statistics (mean, standard deviation, percentiles, counts) per tile merged with labels and split into a single tile_features.csv.

On its own, Scikit-Image measures in pixel units.
If you need physical units (for example µm² for area), combine Step 3’s MPP information with Step 5 outputs:

  • area_um2 = area_px * (mpp_x * mpp_y)
  • perimeter_um = perimeter_px * mpp (using a single mpp when pixels are near-isotropic, mpp_x ≈ mpp_y)
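The conversion is a simple vectorized operation once the MPP columns from Step 3 are merged in (the column names and values below are illustrative):

```python
import pandas as pd

# Hypothetical per-tile features measured in pixels, plus MPP from Step 3.
df = pd.DataFrame({
    "area_px": [400.0, 900.0],
    "perimeter_px": [80.0, 120.0],
    "mpp_x": [0.25, 0.25],  # microns per pixel, x axis
    "mpp_y": [0.25, 0.25],  # microns per pixel, y axis
})

# Area scales by the product of the per-axis pixel sizes ...
df["area_um2"] = df["area_px"] * (df["mpp_x"] * df["mpp_y"])
# ... while perimeter scales linearly (assuming near-isotropic pixels).
df["perimeter_um"] = df["perimeter_px"] * df["mpp_x"]

print(df[["area_um2", "perimeter_um"]])
```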

Downstream (Step 6), tile_features.csv becomes the input to:

  • classical ML models (logistic regression, random forests, gradient boosting),
  • survival / outcome models,
  • or it serves as an interpretable companion to deep-learning feature embeddings.
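As a brief preview of how Step 6 might consume the table (synthetic data; the feature and label columns here are illustrative stand-ins for tile_features.csv):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Stand-in for metadata/tile_features.csv: numeric feature columns
# plus the primary_label produced in Steps 4-5 (values are synthetic).
rng = np.random.default_rng(42)
n = 200
tile_features_df = pd.DataFrame({
    "glcm_contrast_mean": rng.normal(0, 1, n),
    "mean_intensity": rng.normal(0, 1, n),
    "primary_label": rng.choice(["tumor", "normal"], n),
})

feature_cols = ["glcm_contrast_mean", "mean_intensity"]
X = tile_features_df[feature_cols].to_numpy()
y = (tile_features_df["primary_label"] == "tumor").astype(int)

# A classical, interpretable baseline model on the feature table.
clf = LogisticRegression().fit(X, y)
print(clf.predict(X[:5]).shape)  # one prediction per tile
```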

Scikit-Image + Pillow are the digital ruler, protractor, and texture meter of the pipeline.

Pathologists think and talk in qualitative terms:

  • “Nuclei are large and irregular.”
  • “The stroma here is dense and chaotic.”
  • “These nuclei look very dark (hyperchromatic).”

Computers cannot work with those descriptions. They need numbers in a table.

Feature extraction translates those adjectives into metrics:

  • “Large” → average nuclear area is high; distribution skewed toward big nuclei.
  • “Irregular” → circularity is low, eccentricity is high.
  • “Hyperchromatic” → mean intensity is low; intensity histogram is shifted toward darker values.
  • “Chaotic stroma” → texture contrast is high, homogeneity is low.

Step 5 is the translation layer:

Human words → geometric / texture / intensity measurements → rows and columns in a CSV.

From this point on, you work with feature tables just like any other tabular dataset in machine learning.

Within this pipeline, Scikit-Image + Pillow can:

  • Measure nuclear morphometrics

    • Per-nucleus features: area, perimeter, eccentricity, solidity, circularity, mean intensity.
    • Per-tile summaries: average nucleus size, variability in size, fraction of highly irregular nuclei, nuclei density per tile.
  • Quantify texture

    • GLCM-based features to distinguish:
      • smooth regions (normal stroma, homogeneous tissue)
      • from rough/chaotic regions (desmoplastic stroma, certain tumor regions).
  • Quantify intensity & stain behavior

    • Measure how dark or light nuclei or regions are (hyperchromasia vs pale nuclei).
    • Track variation in intensity across tiles for QC and biological interpretation.
  • Build reusable feature tables

    • tile_features.csv for tile-level modeling.
    • Optional cell_features.csv for detailed morphometric analyses, clustering, or MIL aggregation.
  • Support interpretability

    • Provide human-understandable numbers you can correlate with outcomes or use to explain model decisions.

Situations where it’s used (Medical Examples)

Nuclear pleomorphism:

  • Segment nuclei in annotated tumor tiles.
  • Use regionprops to compute area, circularity, eccentricity, and mean intensity.
  • Summarize per slide or per region:
    • higher mean and variance in nuclear area → more pleomorphism,
    • lower circularity → more irregular shapes,
    • lower mean intensity → more hyperchromatic nuclei.

Stromal texture:

  • Focus on tiles labeled as stroma.
  • Compute texture features (for example GLCM contrast, homogeneity).
  • Show that desmoplastic stroma has higher contrast and lower homogeneity than normal collagen, supporting quantification of the stromal reaction.
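The per-slide summaries above reduce to a simple groupby over cell-level rows (the values below are hardcoded illustrations, not real measurements):

```python
import pandas as pd

# Hypothetical cell-level rows (as nuclei_features in Block A would produce),
# tagged with the slide they came from.
cell_df = pd.DataFrame({
    "slide_id": ["S1", "S1", "S1", "S2", "S2"],
    "area": [40.0, 90.0, 160.0, 50.0, 55.0],
    "circularity": [0.9, 0.6, 0.4, 0.85, 0.8],
    "mean_intensity": [120.0, 90.0, 70.0, 130.0, 125.0],
})

# Per-slide summaries: higher mean/variance of area suggests pleomorphism,
# lower circularity suggests irregular shapes, lower mean intensity
# suggests hyperchromasia.
summary = cell_df.groupby("slide_id").agg(
    mean_area=("area", "mean"),
    var_area=("area", "var"),
    mean_circularity=("circularity", "mean"),
    mean_intensity=("mean_intensity", "mean"),
)
print(summary)
```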

If nuclei segmentation is available:

  • separate lymphocyte-like nuclei (small, round) from larger tumor nuclei based on morphometrics.
  • compute, per tile:
    • lymphocyte density,
    • tumor-cell density,
    • ratios of lymphocytes to tumor cells.
  • correlate these with outcomes or immune signatures.
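A rule-based split of this kind can be sketched as below; the area and circularity thresholds are placeholders for illustration, not validated cut-offs:

```python
import pandas as pd

# Illustrative per-nucleus features for a single tile.
cells = pd.DataFrame({
    "tile_path": ["t1.png"] * 4,
    "area": [30.0, 35.0, 150.0, 200.0],
    "circularity": [0.95, 0.9, 0.5, 0.4],
})

# Simple rule: small, round nuclei -> lymphocyte-like; otherwise tumor-like.
is_lymph = (cells["area"] < 60) & (cells["circularity"] > 0.8)
cells["cell_class"] = is_lymph.map({True: "lymphocyte", False: "tumor"})

# Per-tile counts and lymphocyte-to-tumor ratio.
counts = cells.groupby("tile_path")["cell_class"].value_counts().unstack(fill_value=0)
counts["lymph_tumor_ratio"] = counts["lymphocyte"] / counts["tumor"]
print(counts)
```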

Combine intensity and texture metrics to:

  • detect tiles with unusual intensity distributions (possible stain issues),
  • identify areas where tissue folding or crushing leads to abnormal texture patterns.
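One possible heuristic for the first point is a z-score cut-off on per-tile intensity statistics (the data below is synthetic, and the cut-off of 3 is an assumption, not a fixed standard):

```python
import numpy as np
import pandas as pd

# Synthetic per-tile mean-intensity values with one clearly abnormal tile.
rng = np.random.default_rng(0)
df = pd.DataFrame({"mean_intensity": rng.normal(150, 10, 100)})
df.loc[0, "mean_intensity"] = 40.0  # inject a possible stain/scan artifact

# Flag tiles whose mean intensity is far from the cohort distribution.
z = (df["mean_intensity"] - df["mean_intensity"].mean()) / df["mean_intensity"].std()
df["intensity_outlier"] = z.abs() > 3

print(df["intensity_outlier"].iloc[0])
```

Flagged tiles can then be reviewed manually or excluded before modeling.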

Feature extraction is crucial because it:

  • Makes pathology quantitative
    Moves from “it looks bad” to “median nuclear area is 2.5× normal with high variance; texture contrast is elevated in tumor stroma.”

  • Improves reproducibility
    Numeric measurements allow grading rules, risk scores, and research claims to be tested and reproduced across cohorts and institutions.

  • Supports clinical research
    Morphometric and texture features can be correlated with:

    • grade, stage, molecular subtypes,
    • treatment response and survival.
  • Stays interpretable
    Features like area, circularity, contrast, and intensity are closer to pathologists’ intuitions than deep-network internal activations. Even if you later rely on deep learning, classical features remain useful as a sanity check and as an explanatory layer.

  • Builds a bridge to ML
    Step 5 is where domain knowledge (what pathologists care about) is encoded as features that machine-learning models can understand and learn from.

If you followed Step 0 – Initial System Setup, your histopath-core environment already has:

  • numpy, pandas, scikit-learn, matplotlib, pillow, and openslide-python.

To add Scikit-Image (and ensure the other libraries are present), install into the same environment:

pip install scikit-image pillow numpy matplotlib

For more detailed and up-to-date installation instructions, see the official Scikit-Image and Pillow documentation.

These blocks are teaching examples that follow the pipeline assumptions:

  • A project root folder (PFP-01).
  • metadata/patch_cohort.csv from Step 3.
  • metadata/patch_cohort_labels.csv from Step 4.

You can adapt them to your own column names and file layout.

Block A: Geometric Features (Measuring Shape in Real Tiles)


The Situation
You have nuclei segmentation masks for tiles (from Step 4 or a separate segmentation stage). For each nucleus you want morphometric features, and then per-tile summaries.

The Solution
Use skimage.measure.regionprops on the nuclei mask, compute area, perimeter, eccentricity, solidity, and circularity. Optionally use the grayscale tile as an intensity image.

from pathlib import Path
import math

import numpy as np
import pandas as pd
from PIL import Image
from skimage.measure import label, regionprops

PROJECT_ROOT = Path("/path/to/project").resolve()

# Load tile-level cohort and labels
patch_df = pd.read_csv(PROJECT_ROOT / "metadata" / "patch_cohort.csv")
labels_df = pd.read_csv(PROJECT_ROOT / "metadata" / "patch_cohort_labels.csv")

# Merge to have tile metadata + labels in one table
patch_df = patch_df.merge(
    labels_df,
    on=["slide_id", "tile_path"],
    how="left",
    suffixes=("", "_label"),
)

# Optional: keep only QC-passing tiles
if "qc_status_tile" in patch_df.columns:
    patch_df = patch_df[patch_df["qc_status_tile"] != "FAIL"].copy()

# Example: restrict to training split (fall back to all tiles if
# no split column exists)
if "split" in patch_df.columns:
    train_tiles = patch_df[patch_df["split"] == "train"].copy()
else:
    train_tiles = patch_df.copy()


def iter_tiles(df: pd.DataFrame, project_root: Path = PROJECT_ROOT):
    """
    Iterate over tiles, yielding (row, tile_gray).

    row: pandas Series with metadata + labels.
    tile_gray: 2D NumPy array (grayscale).
    """
    for _, row in df.iterrows():
        tile_rel = row["tile_path"]
        tile_path = project_root / tile_rel
        img = Image.open(tile_path)
        tile_gray = img.convert("L")
        yield row, np.array(tile_gray)


def nuclei_features(
    nuclei_mask: np.ndarray,
    intensity_image: np.ndarray | None = None,
    min_area: int = 20,
):
    """
    Compute per-nucleus features from a nuclei mask.

    nuclei_mask: 2D array
        0 = background, >0 = nucleus label,
        or a binary mask that will be labeled.
    intensity_image: 2D array (optional)
        grayscale tile for intensity statistics.
    """
    if nuclei_mask.dtype == bool or nuclei_mask.max() <= 1:
        labeled = label(nuclei_mask)
    else:
        labeled = nuclei_mask

    rows = []
    for region in regionprops(labeled, intensity_image=intensity_image):
        if region.area < min_area:
            continue  # remove tiny specks

        perimeter = region.perimeter
        if perimeter <= 0:
            circularity = float("nan")
        else:
            circularity = (4.0 * math.pi * region.area) / (perimeter ** 2)

        row = {
            "area": float(region.area),
            "perimeter": float(perimeter),
            "eccentricity": float(region.eccentricity),
            "solidity": float(region.solidity),
            "circularity": float(circularity),
        }
        if intensity_image is not None:
            row["mean_intensity"] = float(region.mean_intensity)
        else:
            row["mean_intensity"] = float("nan")
        rows.append(row)

    return rows

You can store these per-nucleus rows into cell_features.csv if you want object-level analyses, or aggregate them per tile (for example mean nucleus area, number of nuclei per tile) and merge with the tile-level features.
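A minimal sketch of that per-tile aggregation, using hardcoded example rows in place of real nuclei_features() output:

```python
import pandas as pd

# Rows shaped like the dicts nuclei_features() returns for one tile
# (values here are illustrative).
cell_rows = [
    {"area": 50.0, "circularity": 0.9, "mean_intensity": 110.0},
    {"area": 80.0, "circularity": 0.7, "mean_intensity": 95.0},
    {"area": 65.0, "circularity": 0.8, "mean_intensity": 100.0},
]

cell_df = pd.DataFrame(cell_rows)

# Aggregate object-level rows into a single tile-level summary.
tile_summary = {
    "mean_nucleus_area": float(cell_df["area"].mean()),
    "std_nucleus_area": float(cell_df["area"].std()),
    "mean_nucleus_circularity": float(cell_df["circularity"].mean()),
    "nuclei_count": int(len(cell_df)),
}
print(tile_summary)
```

These summary keys can then be merged into the tile-level feature row built in Block C.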

Block B: Texture Features (Haralick/GLCM on Tiles)


The Situation
You want to distinguish tiles with smooth texture (for example normal stroma) from tiles with chaotic texture (for example desmoplastic reaction or dense tumor).

The Solution
Use the gray-level co-occurrence matrix (GLCM) to compute contrast, homogeneity, energy, and correlation.

import numpy as np
from skimage import exposure
from skimage.feature import graycomatrix, graycoprops


def tile_texture_intensity_features(tile_gray: np.ndarray, levels: int = 32):
    """
    Compute GLCM texture features + simple intensity statistics
    from a grayscale tile.

    tile_gray: 2D array, grayscale tile (uint8 or float).
    levels: number of gray levels to quantize into.
    """
    # 1. Quantize intensities to a fixed range of gray levels
    tile_scaled = exposure.rescale_intensity(
        tile_gray,
        in_range="image",
        out_range=(0, levels - 1),
    ).astype("uint8")

    # 2. Build GLCM at multiple distances and angles
    glcm = graycomatrix(
        tile_scaled,
        distances=[1, 2, 4],
        angles=[0, np.pi / 4, np.pi / 2, 3 * np.pi / 4],
        levels=levels,
        symmetric=True,
        normed=True,
    )

    features = {}
    # 3. Extract and average properties across all distances/angles
    for prop_name in ("contrast", "homogeneity", "energy", "correlation"):
        values = graycoprops(glcm, prop_name)
        features[f"glcm_{prop_name}_mean"] = float(values.mean())

    # 4. Basic intensity statistics
    features["mean_intensity"] = float(tile_scaled.mean())
    features["std_intensity"] = float(tile_scaled.std())

    return features

These features summarize both the “roughness” and “brightness” of each tile.

Block C: Intensity Features, Hyperchromasia & tile_features.csv


The Situation
You want a single CSV (tile_features.csv) containing:

  • slide_id, tile_path, split, primary_label
  • texture and intensity features
  • optional aggregated morphometric features (if available).

The Solution
Iterate over tiles, compute features, attach labels, and save everything in one table.

import numpy as np
import pandas as pd

feature_rows = []
for row, tile_gray in iter_tiles(train_tiles):
    primary_label = row.get("primary_label", "unlabeled")

    # 1. Tile-level texture + intensity
    tile_feats = tile_texture_intensity_features(tile_gray)

    # 2. (Optional) aggregated nuclei features if you have nuclei masks.
    #    You would implement `load_nuclei_mask(row)` to find and load
    #    a mask file for this tile. Leave commented out if not using masks yet.
    #
    # nuclei_mask = load_nuclei_mask(row)  # user-defined function
    # if nuclei_mask is not None:
    #     cell_rows = nuclei_features(nuclei_mask, intensity_image=tile_gray)
    #     if cell_rows:
    #         cell_df = pd.DataFrame(cell_rows)
    #         tile_feats["mean_nucleus_area"] = float(cell_df["area"].mean())
    #         tile_feats["std_nucleus_area"] = float(cell_df["area"].std())
    #         tile_feats["mean_nucleus_circularity"] = float(
    #             cell_df["circularity"].mean()
    #         )
    #         tile_feats["nuclei_count"] = int(len(cell_df))
    #     else:
    #         tile_feats["mean_nucleus_area"] = np.nan
    #         tile_feats["std_nucleus_area"] = np.nan
    #         tile_feats["mean_nucleus_circularity"] = np.nan
    #         tile_feats["nuclei_count"] = 0

    feature_row = {
        "slide_id": row["slide_id"],
        "tile_path": row["tile_path"],
        "split": row.get("split", "train"),
        "primary_label": primary_label,
        **tile_feats,
    }
    feature_rows.append(feature_row)

# 3. Build the tile_features table
tile_features_df = pd.DataFrame(feature_rows)

# 4. Save for Step 6
out_path = PROJECT_ROOT / "metadata" / "tile_features.csv"
tile_features_df.to_csv(out_path, index=False)
print(f"Saved {len(tile_features_df)} rows to {out_path}")
print(tile_features_df.head())

This tile_features.csv is the main output of Step 5:

  • It joins IDs, labels, and numeric features into one table.
  • It feeds directly into Step 6 (ML & modeling) alongside your cohort and patch-level metadata.