Step 4 – Annotation & Labeling (The Digital Stencil)
4 - Annotation & Labeling (The Digital Stencil)
Section titled “4 - Annotation & Labeling (The Digital Stencil)”Step 4 is where the hand-drawn regions from a pathologist become math the computer can use.
This page is the basic build for LAB-01/02:
- Export regions from QuPath (or similar) as GeoJSON.
- Load them in Python with Shapely.
- Use geometry to:
- answer “is this cell inside this tumor region?”, and
- project slide-level polygons onto tiles from
patch_cohort.csvto build tile-level labels or pixel masks.
Upstream assumption
Section titled “Upstream assumption”Step 4 assumes you already have a QuPath (or similar) project where:
- Slides correspond to entries in
cohort.csv(ideally with matchingslide_idnames). - Regions of interest (ROIs) have been drawn and assigned classes (for example tumor, normal, stroma, necrosis, artifact).
- These ROIs are exported as one GeoJSON file per slide into something like:
annotations/slide_<slide_id>.geojson.
Inputs and Outputs of Step 4
Section titled “Inputs and Outputs of Step 4”Inputs
Section titled “Inputs”cohort.csv(from Step 1).cohort_qc.csv(from Step 3) – same rows ascohort.csv, plus QC metrics.patch_cohort.csv(from Step 3) – one row per tile, with at least:slide_id- tile identifier (
tile_pathortile_id, ortile_row/tile_col) - enough information to derive tile geometry in slide pixels (for example
tile_x_px,tile_y_px,tile_width_px,tile_height_px, or a known tiling grid + level downsample).
annotations/slide_<slide_id>.geojson– exported QuPath (or other tool) ROIs in slide coordinate space.
Outputs
Section titled “Outputs”patch_cohort_labels.csv(or similar), with:slide_id- tile identifier (
tile_path/tile_id) primary_label(for example tumor, normal, unlabeled)- optional: per-class coverage fractions (
frac_tumor,frac_normal, …)
This labels table can be joined back to patch_cohort.csv on (slide_id, tile identifier) and used for supervised modeling.
Slides without a GeoJSON file are not discarded – their tiles are labeled unlabeled and remain available for unsupervised / self-supervised workflows.
Slide-level labels (for example “tumor present: yes/no”) are typically used later in MIL / weak-supervision models and don’t require geometry, so they are not the focus of LAB-01/02.
Name of Tool
Section titled “Name of Tool”Shapely (The Geometry Engine) & GeoJSON (The Universal File Format)
Technical Explanation
Section titled “Technical Explanation”GeoJSON is an open standard for representing geographic / image geometries plus attributes in JSON.
- A file is typically a
FeatureCollectionof features. - Each feature has a
geometry(polygon, point, etc.) andproperties(likeclassification.name = "tumor"). - QuPath (and other WSI annotation tools) can export regions of interest (ROIs) as GeoJSON in slide pixel coordinates (usually level‑0 / full‑resolution).
Shapely converts those JSON geometries into in-memory geometry objects (Polygon, Point, box, etc.) and provides operations such as:
intersects,contains,intersectionarea,union, and more
In Step 4 we treat everything as geometry in the same coordinate system:
- Annotation polygons from QuPath → Shapely
Polygons. - Tiles from
patch_cohort.csv→ Shapelybox(...)rectangles. - Optional: nucleus centers → Shapely
Points.
Then we do overlap math:
- For each tile, compute how much of its area intersects each class.
- Turn those area fractions into:
- a primary label (for example tumor, normal, unlabeled), and/or
- pixel-wise masks for segmentation.
The output becomes a table such as patch_cohort_labels.csv that can be joined back to patch_cohort.csv for modeling.
Coordinate systems & levels (careful!)
Section titled “Coordinate systems & levels (careful!)”- GeoJSON from QuPath is typically in level‑0 slide pixels.
- Your tiles may come from a downsampled level (for example DeepZoom level 3).
- You must ensure that the tile geometry you use for Shapely (for example
tile_x_px,tile_y_px,tile_width_px,tile_height_px) is expressed in the same coordinate system as the polygons.
In a production pipeline it is safer to:
- store explicit
tile_x_px,tile_y_px,tile_width_px,tile_height_pxinpatch_cohort.csv, or - make the level/downsample assumptions extremely clear in the code.
Simplified Explanation
Section titled “Simplified Explanation”This is your “Digital Tracing Paper + Cookie Cutter.”
- Step 3 gave you a chessboard of tiles (
patch_cohort.csv). - Step 4 gives you colored regions (tumor, normal, necrosis, etc.) drawn in QuPath.
- GeoJSON is how those drawings leave QuPath.
- Shapely is the geometry brain that answers:
- “Which squares on the chessboard are mostly red (tumor)?”
- “Is this little dot (a nucleus) inside the tumor zone?”
Why export to .geojson?
Section titled “Why export to .geojson?”- It is just text — you can open it in Notepad.
- It handles nested shapes cleanly:
- CSV is great for rows/columns.
- GeoJSON is great for polygons with holes, multipolygons, and points.
- It is the Universal Adapter:
- QuPath, ASAP, Cytomine, DSA/HistomicsUI, etc. can export GeoJSON.
- Python, web viewers, GIS tools, and maps all know how to read it.
Your annotations are not locked inside QuPath and can be reused across experiments.
What does Shapely do?
Section titled “What does Shapely do?”Shapely turns those coordinate lists into real geometric objects so you can ask questions like:
- “Is this cell centroid inside the tumor polygon?”
- “What fraction of this tile lies inside tumor vs normal vs artifact?”
- “What is the area of this ROI in pixels or mm²?”
In other words:
- QuPath draws,
- GeoJSON stores,
- Shapely does the math.
What can it do?
Section titled “What can it do?”-
Read
Import ROIs from QuPath, ASAP, Cytomine, etc. into Python. -
Filter
Drop tiny or noisy annotations (for example “ignore regions with area < 500 pixels²”). -
Tile labeling
- Intersect polygons with tile rectangles from
patch_cohort.csv. - Compute per-class area fractions per tile.
- Assign a primary label to each tile (tumor, normal, unlabeled, etc.).
- Slides without annotation files → all tiles labeled
unlabeled, still usable for unsupervised/self‑supervised work.
- Intersect polygons with tile rectangles from
-
Rasterize
Turn polygons into binary or multi-class masks for segmentation (U‑Net style training). -
Object labeling
Use point‑in‑polygon checks to classify detected cells (for example “mitosis inside tumor region”). -
Plays nicely with slide-level labels
Region-based labels from Step 4 can be combined later with slide-level labels (“tumor present: yes/no”) for MIL / weak‑supervision models.
Situations where it’s used (Medical Examples)
Section titled “Situations where it’s used (Medical Examples)”The “Donut” Problem
Section titled “The “Donut” Problem”You circled a tumor with a necrotic center. GeoJSON supports polygons with holes; Shapely understands them; your mask / rasterization respects the hollow middle instead of filling it in.
Tile-Level “Gold Standard”
Section titled “Tile-Level “Gold Standard””You want tiles that are clearly tumor for a strong supervision dataset.
- Draw tumor ROIs in QuPath.
- Use Shapely to compute how much of each tile lies inside tumor.
- Keep tiles where
tumor_fraction ≥ 0.7for your “pure tumor” training set.
Cell-Level Labels without Manual Click-Per-Cell
Section titled “Cell-Level Labels without Manual Click-Per-Cell”A detection model finds candidate nuclei. Instead of labeling each nucleus manually:
- Pathologist annotates tumor vs non‑tumor regions.
- A Shapely point-in-polygon pass assigns each nucleus a class using those regions as the ground truth stencil.
Why it’s important to pathologists
Section titled “Why it’s important to pathologists”An AI model only learns what you actually show it, not what you meant to show.
If your contour never becomes math, the model sees a mix of tumor, normal, and artifact and learns a muddled definition of “tumor.”
By exporting ROIs to GeoJSON and turning them into masks/labels, you ensure the model sees exactly the regions you intended.
It also keeps your annotations:
-
Reusable:
Different models (tile classifier, MIL, segmentation, detection) can all consume the same annotation source. -
Auditable:
You can revisit or revise labels later without redrawing everything.
You can inspect the GeoJSON + label generation code to understand how labels were produced.
Installation Instructions
Section titled “Installation Instructions”Run in terminal:
pip install shapely geojson numpy opencv-python pandasnumpy and opencv-python are already used in Step 3. pandas is used here to join labels with patch_cohort.csv.
Lego Building Blocks (Code)
Section titled “Lego Building Blocks (Code)”Block A: The Bridge (Reading QuPath Annotations)
Section titled “Block A: The Bridge (Reading QuPath Annotations)”The Situation:
You drew tumor / normal regions in QuPath. By default they live inside a QuPath project and Python cannot see them.
The Solution:
Export each slide’s annotations from QuPath as GeoJSON (for example annotations/slide_0001.geojson) and load them into Shapely polygons.
import jsonfrom pathlib import Path
from shapely.geometry import shape # text -> geometry
# 1. Helper: load one QuPath GeoJSON exportdef load_qupath_geojson(geojson_path: Path): """ Return a list of dicts: [ {"class_name": "tumor", "geometry": <Polygon>}, ... ] """ with geojson_path.open() as f: data = json.load(f)
records = [] for feature in data.get("features", []): props = feature.get("properties", {}) classification = props.get("classification") or {} class_name = classification.get("name")
if class_name is None: # unlabeled or background region continue
geom = shape(feature["geometry"]) # GeoJSON -> Shapely geometry records.append({"class_name": class_name, "geometry": geom})
return records
# 2. Example: load annotations for a single slideslide_id = "slide_0001"ANNOTATIONS_DIR = Path("annotations")
geojson_path = ANNOTATIONS_DIR / f"{slide_id}.geojson"ann_records = load_qupath_geojson(geojson_path)
print(f"Loaded {len(ann_records)} regions for slide {slide_id}.")print("First region class:", ann_records[0]["class_name"])print("First region area (pixels²):", ann_records[0]["geometry"].area)In practice, you will have one .geojson per slide. The file name simply needs to include slide_id so you can match it to cohort.csv / patch_cohort.csv.
Block B: The Inclusion Check (Point-in-Polygon)
Section titled “Block B: The Inclusion Check (Point-in-Polygon)”The Situation:
You have cell detections (for example nucleus centroids) and need to know which cells are inside tumor regions.
The Solution:
Use Shapely’s contains to test whether a Point lies inside a tumor polygon.
from shapely.geometry import Point
# Use the first polygon from Block A as an exampletumor_polygon = ann_records[0]["geometry"]
# 1. Example cell coordinate (x, y) in slide pixelscell_x, cell_y = 50, 50cell_point = Point(cell_x, cell_y)
# 2. Test inclusionis_inside = tumor_polygon.contains(cell_point)print(f"Cell at ({cell_x}, {cell_y}) is inside tumor? {is_inside}")
# 3. Test a point clearly outsideoutside_point = Point(2000, 2000)print( f"Cell at (2000, 2000) is inside tumor? {tumor_polygon.contains(outside_point)}")Block C: The Coloring Book (Rasterizing to a Mask)
Section titled “Block C: The Coloring Book (Rasterizing to a Mask)”The Situation:
Segmentation models want pixel masks, not coordinate lists.
The Solution:
Convert polygons to a binary mask with OpenCV’s fillPoly.
import numpy as npimport cv2import matplotlib.pyplot as plt
# Assume 'tumor_polygon' from Block B (a Shapely Polygon)
# 1. Define the size of the mask you want (e.g. a patch window)mask_h, mask_w = 200, 200 # example patch sizemask = np.zeros((mask_h, mask_w), dtype=np.uint8)
# 2. Prepare polygon coordinates for OpenCV# Shapely polygon is in slide coords; here we assume# the polygon is already in the same coordinate frame as this mask.coords = np.array(tumor_polygon.exterior.coords, dtype=np.int32)
# 3. Draw the mask (255 = white)cv2.fillPoly(mask, [coords], 255)
# 4. Displayplt.figure(figsize=(4, 4))plt.imshow(mask, cmap="gray")plt.title("Binary Tumor Mask")plt.axis("off")plt.show()In a real pipeline, you would:
- shift/scale polygon coordinates to the patch window you are rasterizing, and
- repeat for multiple classes (tumor, stroma, necrosis) to build multi-channel or multi-class masks.
Block D: From Slide Polygons to Tile-Level Labels (patch_cohort_labels.csv)
Section titled “Block D: From Slide Polygons to Tile-Level Labels (patch_cohort_labels.csv)”The Situation:
Step 3 has already produced patch_cohort.csv with thousands of tiles per slide. Now you want a table that says, for each tile:
- which slide it came from (
slide_id), - which tile it is (
tile_pathor similar), and - what its primary label is (for example tumor, normal, unlabeled).
The Solution:
For each tile:
- Build a Shapely
boxrepresenting the tile extent in slide pixels. - Intersect that box with all polygons for the slide.
- Compute area fraction per class and choose the majority class if it exceeds a threshold.
This example assumes:
- You have a
patch_cohort.csvfrom Step 3 with at least:slide_idtile_path(ortile_row/tile_col)
- Tiles were cut on a fixed grid with size
TILE_SIZEat a known level.
Adjust for your own tiling if needed (for example DeepZoom level downsampling) so the tile boxes match the slide coordinate system used by GeoJSON.
from pathlib import Path
import pandas as pdfrom shapely.geometry import box
PROJECT_ROOT = Path("/path/to/your/project/root")METADATA_DIR = PROJECT_ROOT / "metadata"ANNOTATIONS_DIR = PROJECT_ROOT / "annotations"
patch_df = pd.read_csv(METADATA_DIR / "patch_cohort.csv")
# Example tiling assumptions (must match Step 3)TILE_SIZE = 512 # same as used in libvips dzsaveLEVEL_DOWNSAMPLE = 1.0 # if tiles come from a downsampled level, update this
INTERSECTION_THRESHOLD = 0.5 # min fraction of tile area to assign a class
def tile_box_from_row(row): """ Build a Shapely box for one tile.
Here we derive coordinates from (tile_row, tile_col). Adjust this logic if your patch_cohort stores explicit x/y/width/height. """ row_idx = int(row["tile_row"]) col_idx = int(row["tile_col"])
# convert tile grid indices to slide pixel coordinates (simplified) x = col_idx * TILE_SIZE * LEVEL_DOWNSAMPLE y = row_idx * TILE_SIZE * LEVEL_DOWNSAMPLE w = TILE_SIZE * LEVEL_DOWNSAMPLE h = TILE_SIZE * LEVEL_DOWNSAMPLE
return box(x, y, x + w, y + h)
def label_tiles_for_slide(slide_id: str) -> pd.DataFrame: # Tiles for this slide slide_tiles = patch_df[patch_df["slide_id"] == slide_id].copy() if slide_tiles.empty: raise ValueError(f"No tiles found for slide_id={slide_id}")
# Annotations for this slide ann_path = ANNOTATIONS_DIR / f"{slide_id}.geojson" if not ann_path.exists(): # No annotations: label everything as unlabeled slide_tiles["primary_label"] = "unlabeled" return slide_tiles[["slide_id", "tile_path", "primary_label"]]
ann_records = load_qupath_geojson(ann_path) ann_geoms = [r["geometry"] for r in ann_records] ann_classes = [r["class_name"] for r in ann_records]
labeled_rows = [] for _, row in slide_tiles.iterrows(): tile_geom = tile_box_from_row(row) tile_area = tile_geom.area
class_areas = {}
# accumulate overlapping area per class for geom, cls in zip(ann_geoms, ann_classes): if not geom.intersects(tile_geom): continue inter_area = geom.intersection(tile_geom).area if inter_area <= 0: continue class_areas[cls] = class_areas.get(cls, 0.0) + inter_area
if class_areas: frac_by_class = {cls: area / tile_area for cls, area in class_areas.items()} primary_cls = max(frac_by_class, key=frac_by_class.get) primary_frac = frac_by_class[primary_cls] else: frac_by_class = {} primary_cls = "unlabeled" primary_frac = 0.0
if primary_frac < INTERSECTION_THRESHOLD: primary_cls = "unlabeled"
out = { "slide_id": slide_id, "tile_path": row["tile_path"], "primary_label": primary_cls, }
# optional: keep per-class fractions for analysis for cls, frac in frac_by_class.items(): out[f"frac_{cls}"] = frac
labeled_rows.append(out)
return pd.DataFrame(labeled_rows)
# Example: build labels for a single slideslide_id = "slide_0001"labels_df = label_tiles_for_slide(slide_id)
output_path = METADATA_DIR / f"patch_cohort_labels_{slide_id}.csv"labels_df.to_csv(output_path, index=False)
print(f"Saved {len(labels_df)} labeled tiles to {output_path}")print(labels_df.head())In a full project, you would loop over all slide_ids in cohort.csv / cohort_qc.csv, concatenate the per-slide label tables, and save a single patch_cohort_labels.csv. That table can then be merged with patch_cohort.csv on (slide_id, tile_path) for modeling.
Resource Site
Section titled “Resource Site”- GeoJSON Specification: https://geojson.org/
- Shapely Documentation: https://shapely.readthedocs.io/en/stable/manual.html