Preprocessing & QC
Step 3 – Preprocessing & Quality Control: cleaning bad images and files
Here you deal with the messy reality of digital slides. You look for scans that are out of focus, covered in pen marks, folded, poorly stained, or simply missing. You may also run basic numeric checks (for example, image size, brightness, or file integrity) to automatically flag obviously bad slides before they ever reach a model.
Technical name: Preprocessing & QC
What this is
Prepare slides so they don’t waste time or confuse models:
- Keep only relevant parts.
- Remove obvious junk/background.
- Reduce cross‑site/scanner stain differences.
Typical questions
- “Can we crop to just the tumor, not blank glass?”
- “Can we remove tiles that are all white or out of focus?”
- “Why do these H&Es look different? Can we make them more uniform?”
Common tasks
- Crop to tissue area or ROI.
- Split slides into tiles/patches (e.g., 256×256).
- Filter out tiles that contain no tissue, are covered by pen marker, or are out of focus.
- Basic stain normalization across labs/scanners.
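The tiling task above can be sketched as a simple grid walk. This is a minimal illustration, not the method used by any particular tool: the names `TileCoord` and `tile_grid` are hypothetical, tiles do not overlap, and partial tiles at the right and bottom edges are simply skipped (an assumption you may want to change).

```python
from typing import Iterator, NamedTuple


class TileCoord(NamedTuple):
    tile_row: int
    tile_col: int
    x: int  # left edge of the tile, in pixels
    y: int  # top edge of the tile, in pixels


def tile_grid(width: int, height: int, tile_size: int = 256) -> Iterator[TileCoord]:
    """Yield non-overlapping tiles that fit entirely inside the image.

    Partial tiles at the right and bottom edges are skipped.
    """
    if tile_size <= 0:
        raise ValueError("tile_size must be positive")
    for row in range(height // tile_size):
        for col in range(width // tile_size):
            yield TileCoord(row, col, col * tile_size, row * tile_size)


# A 1000 x 600 pixel region holds 3 columns x 2 rows of 256 x 256 tiles.
tiles = list(tile_grid(width=1000, height=600, tile_size=256))
print(len(tiles))
print(tiles[-1])
```

The row/column indices and pixel coordinates produced here are the kind of values a tile-level table would store, so each tile can be traced back to its position in the slide.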
Inputs and Outputs of Step 3
Step 3 is not just a collection of tricks; it behaves like a pipeline stage with a clear contract.
Inputs
From earlier steps, Step 3 expects:
- cohort.csv from Step 1 (DAT-01), containing at least:
  - slide_id – unique slide identifier.
  - full_path – absolute path to the WSI file.
  - project_rel_path – path relative to PROJECT_ROOT for portability.
  - split – train/test (and later validation) assignment at the patient level.
- The raw whole-slide image files listed in cohort.csv (for example .svs, .ndpi, .tif, .tiff).
- Optional: scanner metadata such as microns-per-pixel (MPP), objective magnification, and vendor tags (if collected in earlier steps).
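Because Step 3 depends on these columns, it can help to check the input contract before doing any image work. The sketch below uses only the Python standard library; the function name `check_cohort` and the uniqueness check are illustrative assumptions, and the example rows and paths are made up.

```python
import csv
import io
from typing import Dict, List

REQUIRED_COLUMNS = {"slide_id", "full_path", "project_rel_path", "split"}


def check_cohort(csv_file) -> List[Dict[str, str]]:
    """Read cohort rows and fail early if the Step 3 input contract is not met."""
    reader = csv.DictReader(csv_file)
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"cohort.csv is missing columns: {sorted(missing)}")
    rows = list(reader)
    slide_ids = [row["slide_id"] for row in rows]
    if len(slide_ids) != len(set(slide_ids)):
        raise ValueError("slide_id values must be unique")
    return rows


# Tiny in-memory stand-in for cohort.csv (paths are made up).
example = io.StringIO(
    "slide_id,full_path,project_rel_path,split\n"
    "S001,/data/wsi/S001.svs,wsi/S001.svs,train\n"
    "S002,/data/wsi/S002.ndpi,wsi/S002.ndpi,test\n"
)
rows = check_cohort(example)
print(len(rows), rows[0]["split"])
```

With a real file you would pass `open("cohort.csv", newline="")` instead of the in-memory example.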
Outputs
After Step 3, we want two main products:
- A QC-annotated slide table (for example cohort_qc.csv) with the same rows as cohort.csv, plus extra columns such as:
  - tissue_pct – estimated percentage of slide covered by tissue.
  - sharpness_score – focus/blur metric (for example Laplacian variance on a thumbnail).
  - mpp – microns-per-pixel (if available).
  - qc_status – quick QC decision (for example PASS or FAIL).
- A patch-level / tile-level table (for example patch_cohort.csv) that describes tiles generated from the WSIs, with columns such as:
  - slide_id
  - tile_id or tile_path
  - tile_row, tile_col or (x, y) coordinates
  - mpp – resolution at which tiles were generated.
  - tile_tissue_pct – tissue coverage within the tile.
  - tile_sharpness – optional tile-level blur metric.
  - qc_status_tile – for example PASS / FAIL for each tile.
  - split – inherited from the parent slide.
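To make the slide-level columns concrete, here is a minimal pure-Python sketch of how tissue_pct, sharpness_score, and qc_status could be computed from a small grayscale thumbnail. It is deliberately simplified: it uses a fixed brightness threshold instead of Otsu, works on nested lists rather than OpenCV/NumPy arrays, and every cut-off value (220, 5.0, 50.0) is a placeholder assumption that would need tuning on your own scans.

```python
from statistics import pvariance
from typing import List, Sequence

Gray = Sequence[Sequence[float]]  # rows of grayscale values, 0 (black) to 255 (white)


def tissue_pct(thumb: Gray, background_threshold: float = 220.0) -> float:
    """Percentage of pixels darker than the threshold (bright pixels count as glass)."""
    pixels = [value for row in thumb for value in row]
    tissue = sum(1 for value in pixels if value < background_threshold)
    return 100.0 * tissue / len(pixels)


def sharpness_score(thumb: Gray) -> float:
    """Variance of the 4-neighbour Laplacian over interior pixels; low values suggest blur.

    The thumbnail must be at least 3 x 3 pixels so that interior pixels exist.
    """
    responses: List[float] = []
    for r in range(1, len(thumb) - 1):
        for c in range(1, len(thumb[0]) - 1):
            responses.append(
                thumb[r - 1][c] + thumb[r + 1][c] + thumb[r][c - 1] + thumb[r][c + 1]
                - 4 * thumb[r][c]
            )
    return pvariance(responses)


def qc_status(tissue: float, sharpness: float,
              min_tissue_pct: float = 5.0, min_sharpness: float = 50.0) -> str:
    """Hypothetical PASS/FAIL rule; both cut-offs are placeholders."""
    return "PASS" if tissue >= min_tissue_pct and sharpness >= min_sharpness else "FAIL"


# Synthetic 6 x 6 thumbnails: a high-contrast checkerboard versus blank glass.
checker = [[40 if (r + c) % 2 else 200 for c in range(6)] for r in range(6)]
blank = [[255] * 6 for _ in range(6)]

for name, thumb in [("checker", checker), ("blank", blank)]:
    t, s = tissue_pct(thumb), sharpness_score(thumb)
    print(name, t, s, qc_status(t, s))
```

On real slides the same two numbers would be computed on a thumbnail read from the WSI and written into the tissue_pct and sharpness_score columns; the checkerboard and blank images here only show that the metrics move in the expected direction.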
In terms of concrete files, Step 3 feeds the two downstream paths slightly differently:
- Path A – Supervised (labels available):
  - Uses cohort_qc.csv to filter out obviously bad slides.
  - Typically works with patch_cohort.csv so that each tile knows:
    - which slide it came from,
    - its coordinates inside the slide,
    - whether it passed tile-level QC.
  - Step 4 (Annotations) will merge human labels from tools like QuPath back onto these slide/tile tables.
- Path B – Unsupervised (no labels yet):
  - Also starts from cohort_qc.csv to remove unusable slides.
  - Uses patch_cohort.csv to feed unlabeled tiles into feature extraction and clustering (Step 5).
  - Later, cluster assignments can be projected back onto slides or viewed in tools like QuPath.
The important part: Step 3 should always leave you with tidy tables (cohort_qc.csv and patch_cohort.csv) that downstream steps can reuse.
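As an illustration of how a downstream step might reuse the two tables, the sketch below keeps only tiles that passed tile-level QC and belong to a slide that passed slide-level QC. The function name `usable_tiles` is hypothetical, the column names follow the examples above, and the in-memory rows are made up.

```python
import csv
import io
from typing import Dict, List


def usable_tiles(cohort_qc_file, patch_cohort_file) -> List[Dict[str, str]]:
    """Keep tiles that pass tile-level QC and whose parent slide passed slide-level QC."""
    good_slides = {
        row["slide_id"]
        for row in csv.DictReader(cohort_qc_file)
        if row["qc_status"] == "PASS"
    }
    return [
        row
        for row in csv.DictReader(patch_cohort_file)
        if row["slide_id"] in good_slides and row["qc_status_tile"] == "PASS"
    ]


# In-memory stand-ins for cohort_qc.csv and patch_cohort.csv (values are made up).
cohort_qc = io.StringIO(
    "slide_id,split,qc_status\n"
    "S001,train,PASS\n"
    "S002,test,FAIL\n"
)
patch_cohort = io.StringIO(
    "slide_id,tile_id,qc_status_tile,split\n"
    "S001,S001_r0_c0,PASS,train\n"
    "S001,S001_r0_c1,FAIL,train\n"
    "S002,S002_r0_c0,PASS,test\n"
)
kept = usable_tiles(cohort_qc, patch_cohort)
print([row["tile_id"] for row in kept])
```

Both Path A and Path B could start from a filter like this; they differ only in what happens to the surviving tiles afterwards.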
Core tools (examples)
- libvips – fast CLI to crop, resize, tile WSIs. See /tools/libvips/
- QuPath – detect tissue, export tiles, batch operations.
- ImageJ / Fiji – general‑purpose processing for smaller images.
- Histolab / TIAToolbox – WSI‑specific preprocessing and tiling.
Clinician mental model
Think of grossing and QC for digital slides: trim, clean, and standardize before deeper analysis.
Ready-to-use code
- QC-01: Grossing & QC blocks – OpenCV/NumPy tissue masking with Otsu, morphology cleanup, and blur detection to trim background and flag bad scans before modeling.
Advanced QC and Stain Tools (Optional)
If you want to go beyond the simple building blocks shown here, there are mature open-source libraries and toolkits that implement similar ideas at scale (slide-level QC, stain normalization, tiling, metadata handling, and more). The goal of Step 3 is not to replace those tools, but to make their inner logic visible so you can understand, debug, and customize your own workflows.