

Step 3 – Preprocessing & Quality Control: cleaning bad images and files


Here you deal with the messy reality of digital slides. You look for scans that are out of focus, covered in pen marks, folded, poorly stained, or simply missing. You may also run basic numeric checks (for example, image size, brightness, or file integrity) to automatically flag obviously bad slides before they ever reach a model.

Technical name: Preprocessing & QC

Prepare slides so they don’t waste time or confuse models:

  • Goals:
    • Keep only relevant parts.
    • Remove obvious junk/background.
    • Reduce cross‑site/scanner stain differences.
  • Typical questions:
    • “Can we crop to just the tumor, not blank glass?”
    • “Can we remove tiles that are all white or out of focus?”
    • “Why do these H&Es look different? Can we make them more uniform?”
  • Typical actions:
    • Crop to tissue area or ROI.
    • Split slides into tiles/patches (e.g., 256×256).
    • Filter tiles without tissue/marker/focus.
    • Basic stain normalization across labs/scanners.

Step 3 is not just a collection of tricks; it behaves like a pipeline stage with a clear contract.

From earlier steps, Step 3 expects:

  • cohort.csv from Step 1 (DAT-01), containing at least:
    • slide_id – unique slide identifier.
    • full_path – absolute path to the WSI file.
    • project_rel_path – path relative to PROJECT_ROOT for portability.
    • split – train/test (and later validation) assignment at the patient level.
  • The raw whole-slide image files listed in cohort.csv (for example .svs, .ndpi, .tif, .tiff).
  • Optional: scanner metadata such as microns-per-pixel (MPP), objective magnification, and vendor tags (if collected in earlier steps).
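
Before any QC runs, it is worth sanity-checking these inputs. The snippet below is a minimal sketch, assuming pandas; the required column names follow the cohort.csv contract above, while the file name and the file_exists helper column are illustrative.

```python
# Minimal sketch: sanity-check cohort.csv before any QC runs.
# Assumes pandas; the required columns follow the contract above,
# and the file_exists column is an illustrative addition.
from pathlib import Path

import pandas as pd

REQUIRED_COLS = {"slide_id", "full_path", "project_rel_path", "split"}


def load_cohort(csv_path: str = "cohort.csv") -> pd.DataFrame:
    """Load the Step 1 slide table and flag rows whose WSI file is missing."""
    df = pd.read_csv(csv_path)

    missing_cols = REQUIRED_COLS - set(df.columns)
    if missing_cols:
        raise ValueError(f"cohort.csv is missing columns: {sorted(missing_cols)}")

    # Basic file-integrity check: does each referenced slide exist on disk?
    df["file_exists"] = df["full_path"].map(lambda p: Path(p).is_file())
    print(f"{len(df)} slides listed, {(~df['file_exists']).sum()} files missing")
    return df
```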

After Step 3, we want two main products:

  1. A QC-annotated slide table (for example cohort_qc.csv) with the same rows as cohort.csv, plus extra columns such as:

    • tissue_pct – estimated percentage of slide covered by tissue.
    • sharpness_score – focus/blur metric (for example Laplacian variance on a thumbnail).
    • mpp – microns-per-pixel (if available).
    • qc_status – quick QC decision (for example PASS or FAIL).
  2. A patch-level / tile-level table (for example patch_cohort.csv) that describes tiles generated from the WSIs (see the sketch after this list), with columns such as:

    • slide_id
    • tile_id or tile_path
    • tile_row, tile_col or (x, y) coordinates
    • mpp – resolution at which tiles were generated.
    • tile_tissue_pct – tissue coverage within the tile.
    • tile_sharpness – optional tile-level blur metric.
    • qc_status_tile – for example PASS / FAIL for each tile.
    • split – inherited from the parent slide.
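
As a rough sketch of how the tile-level table in (2) might be generated, the snippet below assumes the openslide-python package and enumerates a regular level-0 grid. The 256-pixel tile size and the helper name tile_grid are illustrative, and the QC columns (tile_tissue_pct, tile_sharpness, qc_status_tile) would be filled in by a later pass over the tiles.

```python
# Minimal sketch: enumerate a tile grid for one slide (patch_cohort.csv rows).
# Assumes openslide-python; tile size and naming are illustrative choices.
import openslide
import pandas as pd

TILE_SIZE = 256  # pixels at level 0 (illustrative)


def tile_grid(slide_id: str, wsi_path: str, split: str) -> pd.DataFrame:
    """List tile coordinates; QC columns get filled in by a later pass."""
    slide = openslide.OpenSlide(wsi_path)
    width, height = slide.dimensions
    mpp = float(slide.properties.get(openslide.PROPERTY_NAME_MPP_X, "nan"))

    rows = []
    for row, y in enumerate(range(0, height - TILE_SIZE + 1, TILE_SIZE)):
        for col, x in enumerate(range(0, width - TILE_SIZE + 1, TILE_SIZE)):
            rows.append({
                "slide_id": slide_id,
                "tile_id": f"{slide_id}_r{row}_c{col}",
                "tile_row": row,
                "tile_col": col,
                "x": x,
                "y": y,
                "mpp": mpp,
                "split": split,  # inherited from the parent slide
            })
    slide.close()
    return pd.DataFrame(rows)
```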

In terms of concrete files, Step 3 feeds the two downstream paths slightly differently:

  • Path A – Supervised (labels available):

    • Uses cohort_qc.csv to filter out obviously bad slides.
    • Typically works with patch_cohort.csv so that each tile knows:
      • which slide it came from,
      • its coordinates inside the slide,
      • whether it passed tile-level QC.
    • Step 4 (Annotations) will merge human labels from tools like QuPath back onto these slide/tile tables.
  • Path B – Unsupervised (no labels yet):

    • Also starts from cohort_qc.csv to remove unusable slides.
    • Uses patch_cohort.csv to feed unlabeled tiles into feature extraction and clustering (Step 5).
    • Later, cluster assignments can be projected back onto slides or viewed in tools like QuPath.

The important part: Step 3 should always leave you with tidy tables (cohort_qc.csv and patch_cohort.csv) that downstream steps can reuse.
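
For example, a downstream step could reload and filter the tidy tables roughly like this (a minimal sketch assuming pandas and the PASS/FAIL conventions above; labels.csv is a hypothetical label file used only to illustrate Path A):

```python
# Minimal sketch: reuse the tidy Step 3 tables downstream.
# Assumes pandas and the PASS/FAIL conventions above; labels.csv is hypothetical.
import pandas as pd

slides = pd.read_csv("cohort_qc.csv")
tiles = pd.read_csv("patch_cohort.csv")

# Drop slides and tiles that failed QC.
good_slides = slides[slides["qc_status"] == "PASS"]
good_tiles = tiles[
    tiles["qc_status_tile"].eq("PASS")
    & tiles["slide_id"].isin(good_slides["slide_id"])
]

# Path A (supervised): merge slide-level labels onto the surviving tiles.
# labels = pd.read_csv("labels.csv")                      # hypothetical file
# train_tiles = good_tiles.merge(labels, on="slide_id")

# Path B (unsupervised): feed good_tiles into feature extraction / clustering (Step 5).
print(f"{len(good_slides)} slides and {len(good_tiles)} tiles passed QC")
```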

  • libvips — fast CLI to crop, resize, and tile WSIs; a pyvips sketch follows this list. See /tools/libvips/
  • QuPath — detect tissue, export tiles, batch operations.
  • ImageJ / Fiji — general‑purpose processing for smaller images.
  • Histolab / TIAToolbox — WSI‑specific preprocessing and tiling.
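
As a concrete example, libvips can tile a slide in a couple of lines through its pyvips Python binding. This is a minimal sketch, assuming pyvips is installed and libvips can read the format; the input path, tile size, and suffix are illustrative.

```python
# Minimal sketch: DeepZoom-style tiling with libvips via pyvips.
# Assumes pyvips; the input path, tile size, and suffix are illustrative.
import pyvips

slide = pyvips.Image.new_from_file("example.svs", access="sequential")

# Writes example_tiles.dzi plus 256x256 PNG tiles under example_tiles_files/.
slide.dzsave("example_tiles", tile_size=256, overlap=0, suffix=".png")
```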

Think of this step as grossing and QC for digital slides: trim, clean, and standardize before deeper analysis.

  • QC-01: Grossing & QC blocks — OpenCV/NumPy tissue masking with Otsu, morphology cleanup, and blur detection to trim background and flag bad scans before modeling.
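
To make the QC-01 idea concrete, the snippet below is a simplified stand-in (not the QC-01 code itself): Otsu tissue masking, morphology cleanup, and Laplacian-variance blur detection on a slide thumbnail, with illustrative (not tuned) thresholds.

```python
# Minimal sketch in the spirit of QC-01 (not the QC-01 code itself).
# Assumes OpenCV (cv2) and NumPy; thresholds and kernel size are illustrative.
import cv2
import numpy as np


def qc_metrics(thumbnail_bgr: np.ndarray) -> dict:
    """Estimate tissue coverage and sharpness on a slide thumbnail."""
    gray = cv2.cvtColor(thumbnail_bgr, cv2.COLOR_BGR2GRAY)

    # Otsu threshold: tissue is darker than the bright glass background.
    _, mask = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

    # Morphology cleanup: remove specks and close small holes in the mask.
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
    tissue_pct = 100.0 * float((mask > 0).mean())

    # Blur detection: low Laplacian variance suggests an out-of-focus scan.
    sharpness = float(cv2.Laplacian(gray, cv2.CV_64F).var())

    qc_status = "PASS" if tissue_pct > 5 and sharpness > 50 else "FAIL"
    return {"tissue_pct": tissue_pct, "sharpness_score": sharpness, "qc_status": qc_status}
```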

If you want to go beyond the simple building blocks shown here, there are mature open-source libraries and toolkits that implement similar ideas at scale (slide-level QC, stain normalization, tiling, metadata handling, and more). The goal of Step 3 is not to replace those tools, but to make their inner logic visible so you can understand, debug, and customize your own workflows.