Step 4 – Annotation & Labeling (The Digital Stencil)

4 - Annotation & Labeling (The Digital Stencil)

Step 4 is where the hand-drawn regions from a pathologist become math the computer can use.

This page is the basic build for LAB-01/02:

Export regions from QuPath (or similar) as GeoJSON.
Load them in Python with Shapely.
Use geometry to:
- answer “is this cell inside this tumor region?”, and
- project slide-level polygons onto tiles from patch_cohort.csv to build tile-level labels or pixel masks.

Upstream assumption

Step 4 assumes you already have a QuPath (or similar) project where:

Slides correspond to entries in cohort.csv (ideally with matching slide_id names).
Regions of interest (ROIs) have been drawn and assigned classes (for example tumor, normal, stroma, necrosis, artifact).
These ROIs are exported as one GeoJSON file per slide into something like: annotations/slide_<slide_id>.geojson.

Inputs and Outputs of Step 4

Inputs

cohort.csv (from Step 1).
cohort_qc.csv (from Step 3) – same rows as cohort.csv, plus QC metrics.
patch_cohort.csv (from Step 3) – one row per tile, with at least:
- slide_id
- tile identifier (tile_path or tile_id, or tile_row / tile_col)
- enough information to derive tile geometry in slide pixels (for example tile_x_px, tile_y_px, tile_width_px, tile_height_px, or a known tiling grid + level downsample).
annotations/slide_<slide_id>.geojson – exported QuPath (or other tool) ROIs in slide coordinate space.

Outputs

patch_cohort_labels.csv (or similar), with:
- slide_id
- tile identifier (tile_path / tile_id)
- primary_label (for example tumor, normal, unlabeled)
- optional: per-class coverage fractions (frac_tumor, frac_normal, …)

This labels table can be joined back to patch_cohort.csv on (slide_id, tile identifier) and used for supervised modeling.

Slides without a GeoJSON file are not discarded – their tiles are labeled unlabeled and remain available for unsupervised / self-supervised workflows.

Slide-level labels (for example “tumor present: yes/no”) are typically used later in MIL / weak-supervision models and don’t require geometry, so they are not the focus of LAB-01/02.

Name of Tool

Shapely (The Geometry Engine) & GeoJSON (The Universal File Format)

Technical Explanation

GeoJSON is an open standard for representing geographic / image geometries plus attributes in JSON.

A file is typically a FeatureCollection of features.
Each feature has a geometry (polygon, point, etc.) and properties (like classification.name = "tumor").
QuPath (and other WSI annotation tools) can export regions of interest (ROIs) as GeoJSON in slide pixel coordinates (usually level‑0 / full‑resolution).

Shapely converts those JSON geometries into in-memory geometry objects (Polygon, Point, box, etc.) and provides operations such as:

intersects, contains, intersection
area, union, and more

In Step 4 we treat everything as geometry in the same coordinate system:

Annotation polygons from QuPath → Shapely Polygons.
Tiles from patch_cohort.csv → Shapely box(...) rectangles.
Optional: nucleus centers → Shapely Points.

Then we do overlap math:

For each tile, compute how much of its area intersects each class.
Turn those area fractions into:
- a primary label (for example tumor, normal, unlabeled), and/or
- pixel-wise masks for segmentation.

The output becomes a table such as patch_cohort_labels.csv that can be joined back to patch_cohort.csv for modeling.

Coordinate systems & levels (careful!)

GeoJSON from QuPath is typically in level‑0 slide pixels.
Your tiles may come from a downsampled level (for example DeepZoom level 3).
You must ensure that the tile geometry you use for Shapely (for example tile_x_px, tile_y_px, tile_width_px, tile_height_px) is expressed in the same coordinate system as the polygons.

In a production pipeline it is safer to:

store explicit tile_x_px, tile_y_px, tile_width_px, tile_height_px in patch_cohort.csv, or
make the level/downsample assumptions extremely clear in the code.

Simplified Explanation

This is your “Digital Tracing Paper + Cookie Cutter.”

Step 3 gave you a chessboard of tiles (patch_cohort.csv).
Step 4 gives you colored regions (tumor, normal, necrosis, etc.) drawn in QuPath.
GeoJSON is how those drawings leave QuPath.
Shapely is the geometry brain that answers:
- “Which squares on the chessboard are mostly red (tumor)?”
- “Is this little dot (a nucleus) inside the tumor zone?”

Why export to .geojson?

It is just text — you can open it in Notepad.
It handles nested shapes cleanly:
- CSV is great for rows/columns.
- GeoJSON is great for polygons with holes, multipolygons, and points.
It is the Universal Adapter:
- QuPath, ASAP, Cytomine, DSA/HistomicsUI, etc. can export GeoJSON.
- Python, web viewers, GIS tools, and maps all know how to read it.

Your annotations are not locked inside QuPath and can be reused across experiments.

What does Shapely do?

Shapely turns those coordinate lists into real geometric objects so you can ask questions like:

“Is this cell centroid inside the tumor polygon?”
“What fraction of this tile lies inside tumor vs normal vs artifact?”
“What is the area of this ROI in pixels or mm²?”

In other words:

QuPath draws,
GeoJSON stores,
Shapely does the math.

What can it do?

Read
Import ROIs from QuPath, ASAP, Cytomine, etc. into Python.
Filter
Drop tiny or noisy annotations (for example “ignore regions with area < 500 pixels²”).
Tile labeling
- Intersect polygons with tile rectangles from patch_cohort.csv.
- Compute per-class area fractions per tile.
- Assign a primary label to each tile (tumor, normal, unlabeled, etc.).
- Slides without annotation files → all tiles labeled unlabeled, still usable for unsupervised/self‑supervised work.
Rasterize
Turn polygons into binary or multi-class masks for segmentation (U‑Net style training).
Object labeling
Use point‑in‑polygon checks to classify detected cells (for example “mitosis inside tumor region”).
Plays nicely with slide-level labels
Region-based labels from Step 4 can be combined later with slide-level labels (“tumor present: yes/no”) for MIL / weak‑supervision models.

Situations where it’s used (Medical Examples)

The “Donut” Problem

You circled a tumor with a necrotic center. GeoJSON supports polygons with holes; Shapely understands them; your mask / rasterization respects the hollow middle instead of filling it in.

Tile-Level “Gold Standard”

You want tiles that are clearly tumor for a strong supervision dataset.

Draw tumor ROIs in QuPath.
Use Shapely to compute how much of each tile lies inside tumor.
Keep tiles where tumor_fraction ≥ 0.7 for your “pure tumor” training set.

Cell-Level Labels without Manual Click-Per-Cell

A detection model finds candidate nuclei. Instead of labeling each nucleus manually:

Pathologist annotates tumor vs non‑tumor regions.
A Shapely point-in-polygon pass assigns each nucleus a class using those regions as the ground truth stencil.

Why it’s important to pathologists

An AI model only learns what you actually show it, not what you meant to show.

If your contour never becomes math, the model sees a mix of tumor, normal, and artifact and learns a muddled definition of “tumor.”

By exporting ROIs to GeoJSON and turning them into masks/labels, you ensure the model sees exactly the regions you intended.

It also keeps your annotations:

Reusable:
Different models (tile classifier, MIL, segmentation, detection) can all consume the same annotation source.
Auditable:
You can revisit or revise labels later without redrawing everything.
You can inspect the GeoJSON + label generation code to understand how labels were produced.

Installation Instructions

Run in terminal:

pip install shapely geojson numpy opencv-python pandas

numpy and opencv-python are already used in Step 3. pandas is used here to join labels with patch_cohort.csv.

Lego Building Blocks (Code)

Block A: The Bridge (Reading QuPath Annotations)

The Situation:
You drew tumor / normal regions in QuPath. By default they live inside a QuPath project and Python cannot see them.

The Solution:
Export each slide’s annotations from QuPath as GeoJSON (for example annotations/slide_0001.geojson) and load them into Shapely polygons.

import json
from pathlib import Path

from shapely.geometry import shape  # text -> geometry


# 1. Helper: load one QuPath GeoJSON export
def load_qupath_geojson(geojson_path: Path):
    """
    Return a list of dicts:
    [
      {"class_name": "tumor", "geometry": <Polygon>},
      ...
    ]
    """
    with geojson_path.open() as f:
        data = json.load(f)

    records = []
    for feature in data.get("features", []):
        props = feature.get("properties", {})
        classification = props.get("classification") or {}
        class_name = classification.get("name")

        if class_name is None:
            # unlabeled or background region
            continue

        geom = shape(feature["geometry"])  # GeoJSON -> Shapely geometry
        records.append({"class_name": class_name, "geometry": geom})

    return records


# 2. Example: load annotations for a single slide
slide_id = "slide_0001"
ANNOTATIONS_DIR = Path("annotations")

geojson_path = ANNOTATIONS_DIR / f"{slide_id}.geojson"
ann_records = load_qupath_geojson(geojson_path)

print(f"Loaded {len(ann_records)} regions for slide {slide_id}.")
print("First region class:", ann_records[0]["class_name"])
print("First region area (pixels²):", ann_records[0]["geometry"].area)

In practice, you will have one .geojson per slide. The file name simply needs to include slide_id so you can match it to cohort.csv / patch_cohort.csv.

Block B: The Inclusion Check (Point-in-Polygon)

The Situation:
You have cell detections (for example nucleus centroids) and need to know which cells are inside tumor regions.

The Solution:
Use Shapely’s contains to test whether a Point lies inside a tumor polygon.

from shapely.geometry import Point

# Use the first polygon from Block A as an example
tumor_polygon = ann_records[0]["geometry"]

# 1. Example cell coordinate (x, y) in slide pixels
cell_x, cell_y = 50, 50
cell_point = Point(cell_x, cell_y)

# 2. Test inclusion
is_inside = tumor_polygon.contains(cell_point)
print(f"Cell at ({cell_x}, {cell_y}) is inside tumor? {is_inside}")

# 3. Test a point clearly outside
outside_point = Point(2000, 2000)
print(
    f"Cell at (2000, 2000) is inside tumor? {tumor_polygon.contains(outside_point)}"
)

Block C: The Coloring Book (Rasterizing to a Mask)

The Situation:
Segmentation models want pixel masks, not coordinate lists.

The Solution:
Convert polygons to a binary mask with OpenCV’s fillPoly.

import numpy as np
import cv2
import matplotlib.pyplot as plt

# Assume 'tumor_polygon' from Block B (a Shapely Polygon)

# 1. Define the size of the mask you want (e.g. a patch window)
mask_h, mask_w = 200, 200  # example patch size
mask = np.zeros((mask_h, mask_w), dtype=np.uint8)

# 2. Prepare polygon coordinates for OpenCV
# Shapely polygon is in slide coords; here we assume
# the polygon is already in the same coordinate frame as this mask.
coords = np.array(tumor_polygon.exterior.coords, dtype=np.int32)

# 3. Draw the mask (255 = white)
cv2.fillPoly(mask, [coords], 255)

# 4. Display
plt.figure(figsize=(4, 4))
plt.imshow(mask, cmap="gray")
plt.title("Binary Tumor Mask")
plt.axis("off")
plt.show()

In a real pipeline, you would:

shift/scale polygon coordinates to the patch window you are rasterizing, and
repeat for multiple classes (tumor, stroma, necrosis) to build multi-channel or multi-class masks.

Block D: From Slide Polygons to Tile-Level Labels (`patch_cohort_labels.csv`)

The Situation:
Step 3 has already produced patch_cohort.csv with thousands of tiles per slide. Now you want a table that says, for each tile:

which slide it came from (slide_id),
which tile it is (tile_path or similar), and
what its primary label is (for example tumor, normal, unlabeled).

The Solution:
For each tile:

Build a Shapely box representing the tile extent in slide pixels.
Intersect that box with all polygons for the slide.
Compute area fraction per class and choose the majority class if it exceeds a threshold.

This example assumes:

You have a patch_cohort.csv from Step 3 with at least:
- slide_id
- tile_path (or tile_row / tile_col)
Tiles were cut on a fixed grid with size TILE_SIZE at a known level.

Adjust for your own tiling if needed (for example DeepZoom level downsampling) so the tile boxes match the slide coordinate system used by GeoJSON.

from pathlib import Path

import pandas as pd
from shapely.geometry import box

PROJECT_ROOT = Path("/path/to/your/project/root")
METADATA_DIR = PROJECT_ROOT / "metadata"
ANNOTATIONS_DIR = PROJECT_ROOT / "annotations"

patch_df = pd.read_csv(METADATA_DIR / "patch_cohort.csv")

# Example tiling assumptions (must match Step 3)
TILE_SIZE = 512           # same as used in libvips dzsave
LEVEL_DOWNSAMPLE = 1.0    # if tiles come from a downsampled level, update this

INTERSECTION_THRESHOLD = 0.5  # min fraction of tile area to assign a class


def tile_box_from_row(row):
    """
    Build a Shapely box for one tile.

    Here we derive coordinates from (tile_row, tile_col).
    Adjust this logic if your patch_cohort stores explicit x/y/width/height.
    """
    row_idx = int(row["tile_row"])
    col_idx = int(row["tile_col"])

    # convert tile grid indices to slide pixel coordinates (simplified)
    x = col_idx * TILE_SIZE * LEVEL_DOWNSAMPLE
    y = row_idx * TILE_SIZE * LEVEL_DOWNSAMPLE
    w = TILE_SIZE * LEVEL_DOWNSAMPLE
    h = TILE_SIZE * LEVEL_DOWNSAMPLE

    return box(x, y, x + w, y + h)


def label_tiles_for_slide(slide_id: str) -> pd.DataFrame:
    # Tiles for this slide
    slide_tiles = patch_df[patch_df["slide_id"] == slide_id].copy()
    if slide_tiles.empty:
        raise ValueError(f"No tiles found for slide_id={slide_id}")

    # Annotations for this slide
    ann_path = ANNOTATIONS_DIR / f"{slide_id}.geojson"
    if not ann_path.exists():
        # No annotations: label everything as unlabeled
        slide_tiles["primary_label"] = "unlabeled"
        return slide_tiles[["slide_id", "tile_path", "primary_label"]]

    ann_records = load_qupath_geojson(ann_path)
    ann_geoms = [r["geometry"] for r in ann_records]
    ann_classes = [r["class_name"] for r in ann_records]

    labeled_rows = []
    for _, row in slide_tiles.iterrows():
        tile_geom = tile_box_from_row(row)
        tile_area = tile_geom.area

        class_areas = {}

        # accumulate overlapping area per class
        for geom, cls in zip(ann_geoms, ann_classes):
            if not geom.intersects(tile_geom):
                continue
            inter_area = geom.intersection(tile_geom).area
            if inter_area <= 0:
                continue
            class_areas[cls] = class_areas.get(cls, 0.0) + inter_area

        if class_areas:
            frac_by_class = {cls: area / tile_area for cls, area in class_areas.items()}
            primary_cls = max(frac_by_class, key=frac_by_class.get)
            primary_frac = frac_by_class[primary_cls]
        else:
            frac_by_class = {}
            primary_cls = "unlabeled"
            primary_frac = 0.0

        if primary_frac < INTERSECTION_THRESHOLD:
            primary_cls = "unlabeled"

        out = {
            "slide_id": slide_id,
            "tile_path": row["tile_path"],
            "primary_label": primary_cls,
        }

        # optional: keep per-class fractions for analysis
        for cls, frac in frac_by_class.items():
            out[f"frac_{cls}"] = frac

        labeled_rows.append(out)

    return pd.DataFrame(labeled_rows)


# Example: build labels for a single slide
slide_id = "slide_0001"
labels_df = label_tiles_for_slide(slide_id)

output_path = METADATA_DIR / f"patch_cohort_labels_{slide_id}.csv"
labels_df.to_csv(output_path, index=False)

print(f"Saved {len(labels_df)} labeled tiles to {output_path}")
print(labels_df.head())

In a full project, you would loop over all slide_ids in cohort.csv / cohort_qc.csv, concatenate the per-slide label tables, and save a single patch_cohort_labels.csv. That table can then be merged with patch_cohort.csv on (slide_id, tile_path) for modeling.

Resource Site

GeoJSON Specification: https://geojson.org/
Shapely Documentation: https://shapely.readthedocs.io/en/stable/manual.html