PFP-01: Project Folder & Path Blocks (pathlib)

Data & Cohorts – PFP-01: Project Folder & Path Blocks

Set up a portable, organized project structure using only the Python standard library (pathlib) before you start building cohorts.

0 - Project Environment Setup (Python Standard Library)

Python pathlib (Project Structure Manager)

The pathlib module provides path objects that represent file system locations. Once you define a single PROJECT_ROOT, every other path (datasets, models, logs) is built relative to that root with the overloaded / operator. This removes hard-coded paths, keeps code portable because pathlib applies the correct separator for the platform (backslash \ on Windows, forward slash / elsewhere), and gives you one place to repoint a project.
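As a minimal sketch of the / operator described above, the example below joins the same subfolders onto a Unix-style root and a Windows-style root. The two root paths are hypothetical and exist only for illustration. It uses the "pure" path classes, which manipulate path text without touching the disk, so it runs on any operating system.

```python
from pathlib import PurePosixPath, PureWindowsPath

# Hypothetical roots, used only to illustrate the "/" operator
posix_root = PurePosixPath("/home/user/project")
windows_root = PureWindowsPath("C:/Users/user/project")

# The same join expression works for both flavours
posix_images = posix_root / "data" / "raw_images"
windows_images = windows_root / "data" / "raw_images"

print(posix_images)    # /home/user/project/data/raw_images
print(windows_images)  # C:\Users\user\project\data\raw_images
```

In everyday code you would normally use plain Path, which picks the right flavour for the machine it runs on.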

This is the “Digital Architect.” Before you bring any slides or data into your computer, you build the “Lab Room” where they will live.

  • The Root: The main address of your lab.
  • The Subfolders: The cabinets and drawers for “Raw Slides,” “Excel Sheets,” and “Final Reports.”

Define the address once and Python remembers where everything belongs.

  • Standardize: Enforce a clean structure (Data, Models, Results) for every project you start.
  • Automate: Create all necessary empty folders on your hard drive by running one short block of code.
  • Portability: Share code with a colleague; they change one line (PROJECT_ROOT) and the pipeline works on their machine.
  • Safety: Prevent accidental saves to the wrong place or overwriting raw data.

Situations where it’s used (Medical Examples)

  1. Moving from Laptop to Server: You develop on a MacBook, then train on the hospital GPU cluster. Change PROJECT_ROOT and the code finds data and results on the new server without breaking.
  2. Collaborative Research: You and a colleague in another city have different usernames and drives. Because every path is built from PROJECT_ROOT, the same code runs cleanly for both of you once each person sets their own root.
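One possible way to avoid editing the script at all when moving between machines is to read the root from an environment variable. The sketch below is an assumption, not part of the original blocks: the variable name PFP_PROJECT_ROOT and the helper resolve_project_root are hypothetical, and the fallback is the current working directory.

```python
import os
from pathlib import Path


def resolve_project_root(env_var: str = "PFP_PROJECT_ROOT", default: str = ".") -> Path:
    """Return the project root from an environment variable, else a default.

    The variable name is hypothetical; pick whatever suits your lab.
    """
    raw = os.environ.get(env_var, default)
    # expanduser() handles "~"; resolve() turns the result into an absolute path
    return Path(raw).expanduser().resolve()


PROJECT_ROOT = resolve_project_root()
print("Project root is set to:")
print(PROJECT_ROOT)
```

On the laptop the variable can point at a local folder and on the server at the shared storage location, while the rest of the pipeline stays unchanged.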

A physical pathology lab is strictly organized: biopsies go in the yellow bin, resections in the blue bin. Digital pathology is the same. You will generate thousands of files (patches, heatmaps, logs). Without a rigorous file structure set up before you start, you risk losing data, overwriting experiments, or confusing patient IDs.

Installation: none required. pathlib has been part of the Python standard library since Python 3.4.

Block A: The Foundation (Setting the Lab Address)

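The Situation: Your computer has thousands of folders. If you ask Python to open a file by name alone, it looks only in the current working directory, which changes depending on where the script or notebook was launched.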

The Solution: Plant a flag. Tell Python the “Main Entrance” of the project. Every other file is addressed relative to this spot.

from pathlib import Path
# Block A: Set the project root folder
# TODO: Change the text inside the quotes to the actual folder on your computer
PROJECT_ROOT = Path("/path/to/your/project/root")
print("Project root is set to:")
print(PROJECT_ROOT)

Simulated output (assuming PROJECT_ROOT was changed to /Users/DrFernando/Projects/Melanoma_AI):

Project root is set to:
/Users/DrFernando/Projects/Melanoma_AI
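A mistyped root is easy to miss, because later blocks would simply create a fresh folder tree in the wrong place. As a hedged sketch, the hypothetical helper checked_root below refuses to continue when the folder does not exist. The demo runs against a temporary folder so it works on any machine; in practice you would pass your own project path.

```python
import tempfile
from pathlib import Path


def checked_root(raw_path: str) -> Path:
    """Return an absolute project root, or raise if the folder is missing."""
    root = Path(raw_path).expanduser().resolve()
    if not root.is_dir():
        raise NotADirectoryError(f"Project root not found: {root}")
    return root


# Demo against a temporary folder so the example runs anywhere
with tempfile.TemporaryDirectory() as tmp:
    demo_root = checked_root(tmp)
    print("Project root is set to:", demo_root)
    root_was_valid = demo_root.is_dir()
```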

Block B: The Blueprint (Naming the Cabinets)

The Situation: You have the main room, but it is empty space. Tossing Excel sheets, images, and models on the floor will lose them.

The Solution: Draw a map. Decide that data, models, and results each have a clear home.

# Assumes you already ran Block A to set PROJECT_ROOT
# 1. Define the "Data Cabinet" (e.g., /path/to/project/data)
DATA_DIR = PROJECT_ROOT / "data"
# 2. Define the "Raw Slide Drawer" inside the Data Cabinet
RAW_IMAGES_DIR = DATA_DIR / "raw_images"
# 3. Define the "Log Book Shelf" (Metadata)
METADATA_DIR = PROJECT_ROOT / "metadata"
# 4. Define the "Freezer" (Models)
MODELS_DIR = PROJECT_ROOT / "models"
# 5. Define the "Report Rack" (Results)
RESULTS_DIR = PROJECT_ROOT / "results"
print("Plan for Lab Layout:")
print(f" Data lives at: {DATA_DIR}")
print(f" Images live at: {RAW_IMAGES_DIR}")
print(f" Metadata lives at: {METADATA_DIR}")
print(f" Models live at: {MODELS_DIR}")
print(f" Results live at: {RESULTS_DIR}")

Simulated output:

Plan for Lab Layout:
Data lives at: /Users/DrFernando/Projects/Melanoma_AI/data
Images live at: /Users/DrFernando/Projects/Melanoma_AI/data/raw_images
Metadata lives at: /Users/DrFernando/Projects/Melanoma_AI/metadata
Models live at: /Users/DrFernando/Projects/Melanoma_AI/models
Results live at: /Users/DrFernando/Projects/Melanoma_AI/results

Block C: The Construction Crew (Building the Folders)

The Situation: You have drawn the map, but the folders do not exist yet. Saving a file into a missing folder typically fails with a FileNotFoundError.

The Solution: Loop through the planned folders, create anything missing, and continue without errors when things already exist.

# Assumes you already ran Blocks A and B
# Create a list of all the folders planned in Block B
folders_to_create = [
    DATA_DIR,
    RAW_IMAGES_DIR,
    METADATA_DIR,
    MODELS_DIR,
    RESULTS_DIR,
]
# Loop through the list and build them
for folder in folders_to_create:
    # parents=True: create any missing parent folders
    # exist_ok=True: do not crash if the folder already exists
    folder.mkdir(parents=True, exist_ok=True)
print("Construction complete. All folders are ready:")
for folder in folders_to_create:
    print(" ", folder)

Simulated output:

Construction complete. All folders are ready:
/Users/DrFernando/Projects/Melanoma_AI/data
/Users/DrFernando/Projects/Melanoma_AI/data/raw_images
/Users/DrFernando/Projects/Melanoma_AI/metadata
/Users/DrFernando/Projects/Melanoma_AI/models
/Users/DrFernando/Projects/Melanoma_AI/results
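To reuse the same layout across projects, the planning and building steps can be folded into one function. This is a sketch under assumptions: the function name build_layout is hypothetical, and the demo builds the folders inside a temporary directory so nothing is left behind on your disk. It also calls the function twice to show that a second run does no harm.

```python
import tempfile
from pathlib import Path


def build_layout(project_root: Path) -> dict:
    """Create the standard folders under project_root and return them by name."""
    layout = {
        "data": project_root / "data",
        "raw_images": project_root / "data" / "raw_images",
        "metadata": project_root / "metadata",
        "models": project_root / "models",
        "results": project_root / "results",
    }
    for folder in layout.values():
        # Safe to repeat: existing folders are left untouched
        folder.mkdir(parents=True, exist_ok=True)
    return layout


with tempfile.TemporaryDirectory() as tmp:
    layout = build_layout(Path(tmp))
    build_layout(Path(tmp))  # second run changes nothing and raises no error
    all_ready = all(folder.is_dir() for folder in layout.values())
    print("All folders ready:", all_ready)
```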

Block D: The Quick Inspection (Checking for Slides)

The Situation: You built the shelves, but are they empty? Waiting 10 minutes for a script to fail because images were never copied is frustrating.

The Solution: Peek inside the drawer. Count image files in RAW_IMAGES_DIR before heavy work begins.

# Assumes you already ran Blocks A and B
# Define which file types to look for
# "*.svs" is already included; add "*.ndpi" or "*.tiff" if your scanner produces them
# Note: whether "*.svs" also matches "Case.SVS" depends on the operating system
image_extensions = ["*.png", "*.jpg", "*.tif", "*.svs"]
# Collect image files (rglob also searches subfolders)
image_files = []
for pattern in image_extensions:
    image_files.extend(RAW_IMAGES_DIR.rglob(pattern))
# Sort so the example file shown below is the same on every run
image_files.sort()
# Count them
num_images = len(image_files)
print(f"Found {num_images} image files in the drawer.")
# Show an example (sanity check)
if num_images > 0:
    print("Example file found:")
    print(image_files[0])
else:
    print("WARNING: The drawer is empty. Please copy your images into:", RAW_IMAGES_DIR)

Simulated output:

Found 480 image files in the drawer.
Example file found:
/Users/DrFernando/Projects/Melanoma_AI/data/raw_images/Tumor/Case_001.svs
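A single total can hide problems such as a folder that holds only thumbnails and no whole-slide images. The sketch below is one possible variation on Block D: it counts files per extension and compares suffixes in lower case so that Case_002.SVS is counted alongside Case_001.svs. The helper name count_images_by_suffix, the suffix list, and the demo file names are all assumptions made for illustration.

```python
import tempfile
from collections import Counter
from pathlib import Path

# Hypothetical list of suffixes; adjust it to the formats you actually use
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".svs", ".ndpi"}


def count_images_by_suffix(folder: Path) -> Counter:
    """Count image files under folder (recursively), grouped by lower-cased suffix."""
    counts = Counter()
    for path in folder.rglob("*"):
        suffix = path.suffix.lower()
        if path.is_file() and suffix in IMAGE_SUFFIXES:
            counts[suffix] += 1
    return counts


# Demo with empty placeholder files in a temporary folder
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp)
    (demo / "Tumor").mkdir()
    for name in ["Tumor/Case_001.svs", "Tumor/Case_002.SVS", "overview.PNG", "notes.txt"]:
        (demo / name).touch()
    demo_counts = count_images_by_suffix(demo)
    for suffix, n in sorted(demo_counts.items()):
        print(f"{suffix}: {n}")
```

To inspect a real project, call count_images_by_suffix(RAW_IMAGES_DIR) after running Blocks A to C.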