
Step 3 – Preprocessing & QC (The Grossing Room)



OpenCV (The Scalpel), NumPy (The Calculator) & Libvips (The Industrial Slicer)

OpenCV (cv2) is a computer vision library used for real-time artifact detection (masking). NumPy (numpy) is the fundamental package for scientific computing; it treats images as multi-dimensional matrices (grids of numbers), allowing us to slice channels, create structural elements (kernels), and calculate statistical variance. Libvips (pyvips) is a streaming image processing library that allows us to slice gigapixel images without loading them into RAM.

This is your “Digital Grossing Station.”

  • OpenCV is the “Eye” that sees the difference between tissue and glass.
  • NumPy is the “Math Brain” that calculates exactly how blurry an image is and creates the “brush” we use to clean up dust. (OpenCV images are, in fact, NumPy arrays, as the sketch below shows.)
  • Libvips is the “Slicer” that cuts the massive 40GB image into thousands of bite-sized squares (tiles) for the AI.
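
Since OpenCV images are NumPy arrays, every mask and tile in this step is just math on a grid. A tiny sketch to make that concrete (the file name is a placeholder):

import cv2
img = cv2.imread("thumbnail.png")  # placeholder file; loads as a NumPy array in BGR order
print(type(img))             # <class 'numpy.ndarray'>
print(img.shape, img.dtype)  # e.g. (1024, 1024, 3) uint8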

Technical Breakdown: The “Tool → Function” Map

Medical Goal | The Tool | The Specific Function | How It Works
Format Conversion (RGB → HSV) | OpenCV | cv2.cvtColor(img, cv2.COLOR_RGB2HSV) | Converts the image from Red‑Green‑Blue (RGB) to Hue‑Saturation‑Value (HSV) so we can isolate the Saturation channel (S) for tissue detection.
Tissue Detection (Isolate Color) | NumPy | hsv_image[:, :, 1] | Array slicing: extracts the Saturation channel from an HSV image. High saturation = tissue; low saturation = glass/background.
Noise Reduction (Smooth Sensor/Dust Noise) | OpenCV | cv2.GaussianBlur() | Applies a small Gaussian blur (for example 3×3 or 5×5) to smear out single‑pixel noise. Real tissue blobs remain, but isolated dust pixels disappear before thresholding.
Tissue Detection (Thresholding) | OpenCV | cv2.threshold() | Otsu’s Method: automatically calculates the cutoff value that separates “foreground” (tissue) from “background” (glass).
Cleaning the Mask (Create Brush) | OpenCV | cv2.getStructuringElement() | Creates a structuring element (kernel) such as a 5×5 ellipse or square: the “brush” shape for morphological operations that clean up the mask.
Artifact Removal (Dust/Specks) | OpenCV | cv2.morphologyEx() | Morphological closing: uses the structuring element to erase tiny specks and fill small gaps in the mask, turning scattered tissue into a more solid, contiguous region.
Blur Detection (Sharpness Score) | OpenCV + NumPy | cv2.Laplacian(gray_image, cv2.CV_64F).var() | Computes a Laplacian edge image, then takes the variance. Low variance = flat/blurry (few edges); high variance = sharp (many strong edges).
Slicing (Pre-Tiling to Disk) | Libvips | image.dzsave() | DeepZoom save: exports the giant image into a grid of tiles on disk. Useful when you want a reusable dataset of image files for many experiments.
Extraction (Virtual Tiling) | OpenSlide | slide.read_region((x, y), level, size) | Reads a patch directly from the WSI into memory at the coordinates you choose, instead of writing tiles to disk; ideal for “map first, then extract specific tiles” flows.

Pro‑tip: you can also use cv2.connectedComponentsWithStats to remove small isolated blobs by area (for example, delete any component smaller than a few hundred pixels). This is useful for removing floating debris or pen marks.

Situations where it’s used (Medical Examples)

  • The “Green Pen” Problem: A surgeon circled the tumor with a green marker. OpenCV detects this high-saturation artifact so we can exclude it.
  • The “Impossible” File: You cannot load a 40GB image into Python to crop it. Libvips streams it from the hard drive, cutting it like a deli slicer without crashing your computer.

Why this matters: efficiency and safety. Training on empty glass wastes time; training on marker ink teaches the AI to cheat (for example, “green ink = cancer”). Preprocessing ensures the data represents biology, not artifacts.

Run in terminal:

pip install opencv-python numpy matplotlib pyvips

Note: Libvips also requires the system binary to be installed, as covered in Phase 0.

Block A: Tissue Detection (Otsu Thresholding)


The Situation: You need a binary mask (white = tissue, black = glass), but hard thresholds fail because stain intensity varies.
The Solution: Use NumPy slicing to grab the Saturation channel and OpenCV to auto-threshold it.
⚠️ Safety Note: Simple saturation checks may pick up marker pen. In production, we often add a “Hue” filter to exclude green/blue ink (a minimal sketch follows the code below).

In real pipelines, the thumbnail used here is usually generated directly from the WSI using OpenSlide.get_thumbnail(...) or a downsampled libvips export, rather than being hand‑saved.

import cv2
import numpy as np
import matplotlib.pyplot as plt
# 1. Load a thumbnail (small version of the slide)
# In real pipelines, this thumbnail is usually generated from a WSI via
# OpenSlide.get_thumbnail(...) or a downsampled libvips export.
# Here we use a saved PNG for simplicity.
img_path = "thumbnail.png"
img_bgr = cv2.imread(img_path) # OpenCV loads as BGR
img_rgb = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2RGB)
# 2. Convert RGB → HSV and grab the Saturation channel
img_hsv = cv2.cvtColor(img_rgb, cv2.COLOR_RGB2HSV)
# Tool Used: NumPy slicing ([:, :, 1])
# Why: We only need the 2nd channel (Saturation), not Hue or Value.
sat = img_hsv[:, :, 1]
# 3. Light Gaussian blur to reduce single-pixel noise
# Tool Used: OpenCV (cv2.GaussianBlur)
sat_blur = cv2.GaussianBlur(sat, (5, 5), 0)
# 4. Apply Otsu Thresholding on blurred saturation
# Tool Used: OpenCV (cv2.threshold)
# Why: Automatically finds the "magic number" that separates tissue from glass.
threshold_val, tissue_mask = cv2.threshold(
    sat_blur, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU
)
# 5. Display results
fig, ax = plt.subplots(1, 2, figsize=(10, 5))
ax[0].imshow(img_rgb)
ax[0].set_title("Original Thumbnail")
ax[1].imshow(tissue_mask, cmap="gray")
ax[1].set_title(f"Tissue Mask (Otsu Thresh: {threshold_val:.1f})")
plt.show()
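
As flagged in the safety note, a saturation-only mask can keep marker ink. Here is a minimal sketch of a hue-based ink filter, reusing img_hsv and tissue_mask from above; the 35–130 hue band (roughly green to blue on OpenCV’s 0–179 hue scale) is an assumption to tune for your pens:

# Hue channel (OpenCV hue runs 0-179)
hue = img_hsv[:, :, 0]
# Assumed green-to-blue ink band; tune for your markers
ink_mask = cv2.inRange(hue, 35, 130)
# Keep tissue pixels that are NOT ink
tissue_mask_no_ink = cv2.bitwise_and(tissue_mask, cv2.bitwise_not(ink_mask))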

Block B: Cleaning the Mask (Morphology)

The Situation: The mask is “noisy”: it has dust specks (tiny white dots) and small holes (fat/lumens).
The Solution: Use OpenCV to build a “brush” (a small kernel, itself a NumPy array) and scrub the mask clean.

# Assumes 'tissue_mask' from Block A exists (binary mask where tissue ≈ white)
# 1. Define a structuring element (brush)
# Tool Used: OpenCV (cv2.getStructuringElement)
# Why: Creates a 5×5 elliptical kernel that defines the shape of our "brush" for morphology.
kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
# 2. (Optional) Remove tiny noise with an opening
# Tool Used: OpenCV (cv2.morphologyEx)
mask_open = cv2.morphologyEx(tissue_mask, cv2.MORPH_OPEN, kernel)
# 3. Fill small gaps and holes with a closing
# Tool Used: OpenCV (cv2.morphologyEx)
# Why: Closing fills black holes inside tissue regions and connects nearby fragments.
mask_clean = cv2.morphologyEx(mask_open, cv2.MORPH_CLOSE, kernel)
# 4. Calculate tissue percentage
# Tool Used: OpenCV (cv2.countNonZero) + NumPy (.size)
# Why: We divide the count of white pixels by the total number of pixels in the array.
tissue_percentage = (cv2.countNonZero(mask_clean) / mask_clean.size) * 100
print(f"Tissue Area: {tissue_percentage:.2f}% of slide.")

Optionally, you can also run cv2.connectedComponentsWithStats on mask_clean to remove very small components by area (for example, drop anything smaller than a few hundred pixels) to get rid of floating debris or pen marks.
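
A minimal sketch of that area filter, assuming mask_clean from above (MIN_AREA_PX is an assumed value; tune it for your thumbnail resolution):

MIN_AREA_PX = 300  # assumed cutoff in pixels
n_labels, labels, stats, _ = cv2.connectedComponentsWithStats(mask_clean, connectivity=8)
filtered_mask = np.zeros_like(mask_clean)
for i in range(1, n_labels):  # label 0 is the background
    if stats[i, cv2.CC_STAT_AREA] >= MIN_AREA_PX:
        filtered_mask[labels == i] = 255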

Block C: Blur Detection (Laplacian Variance)

The Situation: A scan strip may be blurry due to a scanner error.
The Solution: Use Laplacian variance (OpenCV + NumPy) as a sharpness score for focus.

# 1. Convert to grayscale (color not needed for blur)
gray = cv2.cvtColor(img_rgb, cv2.COLOR_RGB2GRAY)
# 2. Calculate Laplacian (edge strength)
lap = cv2.Laplacian(gray, cv2.CV_64F)
# 3. Calculate Laplacian variance (sharpness score)
# Tool Used: OpenCV + NumPy (.var())
# Why: We need a single number representing the spread of edge intensities.
# High variance = crisp edges (Sharp). Low variance = flat colors (Blurry).
sharpness_score = lap.var()
BLUR_THRESHOLD = 100.0 # example value; tune for your scanner
print(f"Sharpness Score: {sharpness_score:.2f}")
if sharpness_score < BLUR_THRESHOLD:
    print("QC STATUS: FAIL (Blurry)")
else:
    print("QC STATUS: PASS (Sharp)")

In practice, thresholds like BLUR_THRESHOLD are dataset-specific hyperparameters rather than universal constants—you should tune them by plotting the distribution of sharpness_score values and visually inspecting borderline slides.

Block D: The Industrial Slicer (Libvips Tiling)


The Situation: You have a QC-passed slide. Now you need to cut the massive 40GB file into 512 × 512 tiles for the AI.
The Solution: Use pyvips to stream and slice the image. This is the “Factory” engine.

import pyvips
import os
# 1. Load the massive image (Streams from disk, does not fill RAM)
# Replace with your actual .svs file path
slide_path = "patient_001.svs"
slide = pyvips.Image.new_from_file(slide_path, level=0)
# 2. Define output directory and tile size
output_dir = "patient_001_tiles"
tile_size = 512
# 3. The "DeepZoom" Slicer
# Function Used: slide.dzsave
# Why: Efficiently chops the 40GB image into tiles without crashing RAM.
slide.dzsave(
    output_dir,
    suffix=".jpg",
    tile_size=tile_size,
    overlap=0,
    depth="one",
)
print(f"Slicing Complete. Tiles saved to: {os.path.abspath(output_dir)}")

Block D2: Tile Metadata & Tissue Filtering (patch_cohort.csv)


Block D writes tiles to disk. Block D2 walks those tiles, measures simple QC metrics, and builds a patch_cohort.csv table that downstream code (for example the Step 2B “Lightbox” patch viewer) can reuse.

import pandas as pd
import numpy as np
import cv2
from pathlib import Path
PROJECT_ROOT = Path("/path/to/your/project/root")
DATA_DIR = PROJECT_ROOT / "data"
METADATA_DIR = PROJECT_ROOT / "metadata"
tiles_root = DATA_DIR / "tiles"
records = []
TILE_TISSUE_MIN = 0.50  # project-specific; tune (see note below)
TILE_BLUR_MIN = 50.0    # project-specific; tune (see note below)
for slide_dir in tiles_root.iterdir():
    if not slide_dir.is_dir():
        continue
    slide_id = slide_dir.name
    for tile_path in slide_dir.glob("*.png"):
        stem = tile_path.stem  # e.g. "tile_3_5"
        parts = stem.split("_")
        if len(parts) >= 3:
            row_idx = int(parts[-2])
            col_idx = int(parts[-1])
        else:
            row_idx = None
            col_idx = None
        img = cv2.imread(str(tile_path))
        if img is None:
            continue
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        # Simple tissue mask on the tile
        _, mask = cv2.threshold(
            gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU
        )
        mask = 255 - mask  # invert so tissue ≈ white (tissue is darker than glass)
        tile_tissue_pct = float(mask.mean() / 255.0)
        # Tile-level blur metric (Laplacian variance)
        lap = cv2.Laplacian(gray, cv2.CV_64F)
        tile_sharpness = float(lap.var())
        tile_pass = (
            (tile_tissue_pct >= TILE_TISSUE_MIN) and
            (tile_sharpness >= TILE_BLUR_MIN)
        )
        qc_status_tile = "PASS" if tile_pass else "FAIL"
        records.append({
            "slide_id": slide_id,
            "tile_path": str(tile_path),
            "tile_row": row_idx,
            "tile_col": col_idx,
            "tile_tissue_pct": tile_tissue_pct,
            "tile_sharpness": tile_sharpness,
            "qc_status_tile": qc_status_tile,
        })
patch_df = pd.DataFrame(records)
cohort_qc = pd.read_csv(METADATA_DIR / "cohort_qc.csv")
patch_df = patch_df.merge(
    cohort_qc[["slide_id", "split"]],
    on="slide_id",
    how="left",
)
output_path = METADATA_DIR / "patch_cohort.csv"
patch_df.to_csv(output_path, index=False)
print(f"Saved patch-level QC table to {output_path}")

The tile filename pattern (tile_row_col.png) is just an example and must match how you name tiles in Block D. Thresholds such as TILE_TISSUE_MIN and TILE_BLUR_MIN are project‑specific hyperparameters—tune them by plotting their distributions and inspecting borderline tiles.
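
For reference, a minimal sketch of walking the Block D output instead, assuming pyvips’ default DeepZoom layout (a name_files/level/col_row.jpg tree, with a single level directory when depth="one"). Verify against your actual output; note that Block D writes .jpg tiles:

from pathlib import Path
tiles_dir = Path("patient_001_tiles_files")  # dzsave appends "_files" to the base name
level_dir = sorted(tiles_dir.iterdir())[-1]  # single level directory when depth="one"
for tile_path in sorted(level_dir.glob("*.jpg")):
    col_idx, row_idx = (int(p) for p in tile_path.stem.split("_"))  # "<col>_<row>.jpg"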

Optional: Virtual Tiling with read_region (No Files on Disk)


Instead of pre‑tiling to disk, you can extract patches directly from WSIs using coordinates. This is useful when you want to avoid writing thousands of image files and prefer a “map first, then extract” workflow.

import openslide
from pathlib import Path
wsi_path = Path("path/to/slide.svs")
slide = openslide.OpenSlide(str(wsi_path))
# Example coordinate list (x, y, level, size); in practice it would be
# generated from a tissue mask (a sketch of that step follows below)
coords = [
    (10000, 20000, 0, (256, 256)),
    (12000, 22000, 0, (256, 256)),
    # ...
]
patches = []
for x, y, level, size in coords:
    region = slide.read_region((x, y), level, size)  # RGBA PIL Image
    patches.append(region)
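
A minimal sketch of the “map first” step that would generate those coordinates, assuming mask_clean (Block B) was computed on a thumbnail of this same slide; the grid step and the “mostly tissue” cutoff are assumptions:

import numpy as np
tile_px = 256
# Scale between level-0 pixels and thumbnail pixels
downsample = slide.dimensions[0] / mask_clean.shape[1]
step = max(1, int(tile_px / downsample))  # tile footprint in thumbnail pixels
coords = []
for ty in range(0, mask_clean.shape[0] - step + 1, step):
    for tx in range(0, mask_clean.shape[1] - step + 1, step):
        window = mask_clean[ty:ty + step, tx:tx + step]
        if window.mean() > 127:  # mostly tissue (mask values are 0/255)
            coords.append((int(tx * downsample), int(ty * downsample), 0, (tile_px, tile_px)))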

Libvips pre‑tiling (Block D) is Option 1 when you want a reusable on‑disk dataset. Virtual tiling with read_region is Option 2 when you value flexibility and want to generate patches on the fly.

Block E: Stain Normalization (The Equalizer) 🎨


The Situation:
Lab A uses a lot of Hematoxylin (slides look purple).
Lab B uses a lot of Eosin (slides look pink).

The AI’s Mistake:
It learns “Purple = Benign” and “Pink = Cancer” just because the cancer cases came from Lab B. This is a disaster.

The Solution:
Use Macenko Normalization. It mathematically forces every single tile to have the same color distribution as a “Target Template.”

  • Macenko: Best for preserving structure (standard for pathology).
  • Reinhard: Faster, but just matches statistical color mean (can look “washed out”).

In real workflows, Macenko is typically applied to patches or tiles after tiling, rather than to an entire WSI at once, to keep memory usage manageable.

Code Concept (Python):
We typically use a library like torchstain (fast, GPU) or staintools.

# Block E: Stain Normalization (Macenko)
import torchstain
import cv2
import matplotlib.pyplot as plt
# 1. Define a "Target" (A perfect slide you want everyone to look like)
target_img = cv2.imread("perfect_slide_reference.png")
target_img = cv2.cvtColor(target_img, cv2.COLOR_BGR2RGB)
# 2. Initialize the Normalizer (Fit to the target)
normalizer = torchstain.normalizers.MacenkoNormalizer(backend="numpy")
normalizer.fit(target_img)
# 3. Load a "Weird" slide (Too pink)
weird_tile = cv2.imread("lab_B_tile.png")
weird_tile = cv2.cvtColor(weird_tile, cv2.COLOR_BGR2RGB)
# 4. Normalize (Force it to match the target)
normalized_tile, H, E = normalizer.normalize(I=weird_tile, stains=True)
# 5. Display
fig, ax = plt.subplots(1, 2)
ax[0].imshow(weird_tile)
ax[0].set_title("Original (Too Pink)")
ax[1].imshow(normalized_tile)
ax[1].set_title("Normalized (Standardized)")
plt.show()

Block F: Color Deconvolution (The Separator) 🧪


The Situation:
You want to count nuclei (purple), but they are overlapping with cytoplasm (pink). It is hard to threshold just the purple pixels because they are mixed.

The Solution:
Use Color Deconvolution to mathematically un-mix the stains into separate grayscale images: a Hematoxylin channel and an Eosin channel.

Code Concept (Python):
We use scikit-image for this.

Most colour deconvolution utilities expect either 8‑bit images in [0, 255] or floating‑point images in [0, 1]; feeding in unexpected ranges can produce strange or inverted stain channels.

# Block F: Color Deconvolution
from skimage.color import separate_stains, hed_from_rgb
# Reuses 'weird_tile' (RGB uint8) from Block E; skimage converts it to float internally.
# 1. Separate the stains
# 'hed_from_rgb' is the stain matrix for Hematoxylin + Eosin (+ DAB);
# ('hdx_from_rgb' is Hematoxylin + DAB, not H&E)
stains_separated = separate_stains(weird_tile, hed_from_rgb)
# 2. Extract channels (0 = Hematoxylin, 1 = Eosin)
nuclei_channel = stains_separated[:, :, 0]     # just the purple stuff
cytoplasm_channel = stains_separated[:, :, 1]  # just the pink stuff
# Now you can easily threshold 'nuclei_channel' to find cells!
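
To make that last comment concrete, here is a minimal sketch that counts nuclei blobs by thresholding the Hematoxylin channel (Otsu on the optical-density values; the blob count is a rough proxy, not a validated cell count):

from skimage.filters import threshold_otsu
from skimage.measure import label
nuclei_binary = nuclei_channel > threshold_otsu(nuclei_channel)
nuclei_labels = label(nuclei_binary)
print(f"Approximate nuclei blob count: {nuclei_labels.max()}")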

The “Silent Killer”: Resolution (Microns Per Pixel, MPP) 📏


There is one more silent killer of AI models that we have not listed yet. It is even more dangerous than color variation: resolution mismatch.

  • Scanner A: Scans at 40× (0.25 microns per pixel). Cells look huge.
  • Scanner B: Scans at 20× (0.50 microns per pixel). Cells look tiny.

If you mix these tiles in a folder, the AI will think the “big cells” are giant monsters. You must rescale them to a common resolution (for example, “Rescale everything to 0.5 MPP”).

Recommendation:

  • Add a small check inside Block D (The Slicer) or a dedicated “Pre-check” block.
  • Read the MPP from metadata (from Step 2).
  • Logic (conceptually):
    • If slide_mpp == 0.25, resize by 0.5 before tiling.

This keeps cell size consistent across scanners so the AI learns biology, not scanner zoom.

Block G: Rescaling to a Target MPP (Resolution Normalization)


To avoid resolution mismatch, you can rescale slides so they all have a common physical resolution (for example 0.5 µm/px) before tiling.

import openslide
import pyvips
from pathlib import Path
TARGET_MPP = 0.5 # microns per pixel
input_wsi_path = Path("path/to/slide.svs")
output_wsi_path = Path("path/to/slide_mpp0.5.tif")
slide = openslide.OpenSlide(str(input_wsi_path))
# Prefer the generic OpenSlide key; vendor-specific keys (e.g. "aperio.MPP") vary.
mpp_x = slide.properties.get(openslide.PROPERTY_NAME_MPP_X)
if mpp_x is None:
    mpp_x = slide.properties.get("aperio.MPP")  # vendor-specific fallback
if mpp_x is None:
    raise ValueError("No MPP metadata found; check your scanner's property keys.")
mpp_x = float(mpp_x)
print(f"Original MPP: {mpp_x} µm/px")
scale_factor = mpp_x / TARGET_MPP
print(f"Scale factor: {scale_factor}")
image = pyvips.Image.new_from_file(str(input_wsi_path), access="sequential")
# Moving from a finer MPP to a coarser target shrinks the image:
# new_pixels = old_pixels * (original_mpp / target_mpp), i.e. resize by scale_factor
# (e.g. 0.25 → 0.5 µm/px gives scale_factor 0.5, halving the pixel dimensions)
image_rescaled = image.resize(scale_factor)
image_rescaled.tiffsave(
    str(output_wsi_path),
    tile=True,
    pyramid=True,
    compression="jpeg",
    Q=90,
)
print(f"Saved rescaled slide at approx. {TARGET_MPP} µm/px → {output_wsi_path}")

In practice, this kind of resolution normalization is run as a batch job over many slides. OpenSlide exposes a generic "openslide.mpp-x" property (openslide.PROPERTY_NAME_MPP_X) for many formats, but vendor-specific keys such as "aperio.MPP" still appear in the wild, so confirm what your scanner actually writes.

Block Z: Slide-Level QC Loop (Connecting to cohort.csv)


Blocks A–C show how to build tissue masks and blur metrics for a single image. Block Z applies these ideas across all slides listed in your cohort.csv and writes a cohort_qc.csv table with slide‑level QC metrics.

import pandas as pd
from pathlib import Path
import cv2
import numpy as np
import openslide
PROJECT_ROOT = Path("/path/to/your/project/root")
METADATA_DIR = PROJECT_ROOT / "metadata"
cohort_path = METADATA_DIR / "cohort.csv"
df = pd.read_csv(cohort_path)
TISSUE_PCT_MIN = 0.30
BLUR_THRESHOLD = 100.0
qc_records = []
for _, row in df.iterrows():
    slide_id = row["slide_id"]
    wsi_path = Path(row["full_path"])
    print(f"Processing slide: {slide_id}")
    slide = openslide.OpenSlide(str(wsi_path))
    thumb = slide.get_thumbnail((2048, 2048))
    img_rgb = np.array(thumb)
    img_hsv = cv2.cvtColor(img_rgb, cv2.COLOR_RGB2HSV)
    sat = img_hsv[:, :, 1]
    sat_blur = cv2.GaussianBlur(sat, (5, 5), 0)
    # Tissue mask on blurred saturation
    _, tissue_mask = cv2.threshold(
        sat_blur, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU
    )
    # With a saturation-based Otsu mask, tissue (high saturation) is already
    # white, so no inversion is needed (unlike the grayscale tile mask in Block D2)
    tissue_pct = float(tissue_mask.mean() / 255.0)
    gray = cv2.cvtColor(img_rgb, cv2.COLOR_RGB2GRAY)
    lap = cv2.Laplacian(gray, cv2.CV_64F)
    sharpness_score = float(lap.var())
    # Prefer the generic OpenSlide MPP key; fall back to a vendor-specific one
    mpp_x = (slide.properties.get(openslide.PROPERTY_NAME_MPP_X)
             or slide.properties.get("aperio.MPP"))
    mpp_val = float(mpp_x) if mpp_x is not None else None
    qc_pass = (
        (tissue_pct >= TISSUE_PCT_MIN) and
        (sharpness_score >= BLUR_THRESHOLD)
    )
    qc_status = "PASS" if qc_pass else "FAIL"
    qc_records.append({
        "slide_id": slide_id,
        "tissue_pct": tissue_pct,
        "sharpness_score": sharpness_score,
        "mpp": mpp_val,
        "qc_status": qc_status,
    })
df_qc = df.merge(pd.DataFrame(qc_records), on="slide_id", how="left")
output_path = METADATA_DIR / "cohort_qc.csv"
df_qc.to_csv(output_path, index=False)
print(f"Saved slide-level QC table to {output_path}")

Here, TISSUE_PCT_MIN and BLUR_THRESHOLD are again project‑specific hyperparameters. A good practice is to visualise their distributions, review borderline cases, and adjust thresholds until the automatic QC aligns with your clinical judgement.
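
A minimal sketch of that tuning step, reading the cohort_qc.csv written above (assumes the same session, so METADATA_DIR and the two thresholds are still defined):

import pandas as pd
import matplotlib.pyplot as plt
df_qc = pd.read_csv(METADATA_DIR / "cohort_qc.csv")
fig, ax = plt.subplots(1, 2, figsize=(10, 4))
ax[0].hist(df_qc["tissue_pct"], bins=30)
ax[0].axvline(TISSUE_PCT_MIN, color="red", linestyle="--")  # current cutoff
ax[0].set_title("tissue_pct per slide")
ax[1].hist(df_qc["sharpness_score"], bins=30)
ax[1].axvline(BLUR_THRESHOLD, color="red", linestyle="--")  # current cutoff
ax[1].set_title("sharpness_score per slide")
plt.show()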

Summary of the “Complete” Step 3 Toolbox

Block | Name | Function | Mandatory?
A | Tissue Detection | Find tissue, ignore glass (Otsu). | YES
B | Cleaning | Remove dust/holes (Morphology). | YES
C | QC (Blur) | Reject bad scans (Laplacian). | YES
D | Slicing (Libvips) | Cut 40GB slides into tiles. | YES
D2 | Tile QC & Metadata | Build patch_cohort.csv with tile metrics. | High Priority (for DL)
E | Normalization | Fix color differences (Macenko). | High Priority (for DL)
F | Deconvolution | Separate H vs E channels. | Optional (for nuclei counting)
G | Rescaling (MPP) | Ensure all cells are the same physical size. | CRITICAL (often forgotten)
Z | Slide QC Loop | Write cohort_qc.csv with QC metrics. | YES