Parallel Execution in Calibration (PYCCAPT_PARALLEL_WORKERS)
PyCCAPT now opportunistically parallelizes a handful of hot loops in the calibration and raw-data analysis paths. The default is automatic: when a loop has enough independent work to amortize executor startup, it runs on multiple threads (or processes, for pure-Python kernels); otherwise it falls back to the serial code that’s always been there.
Two environment variables let you pin the behavior without changing any code – useful for benchmarking, reproducible runs, CI, and small-RAM machines that can’t afford a process pool.
Environment variables
- `PYCCAPT_PARALLEL_WORKERS`: Integer. `0` or `1` forces strictly serial execution (identical to the pre-parallel behavior, used by the test suite). Any larger integer caps the worker count. Unset = auto (`cpu_count - 1`, capped at 8).
- `PYCCAPT_PARALLEL_BACKEND`: One of `auto`, `thread`, `process`, `serial`. Unset = auto. `thread` is right for NumPy/SciPy-bound kernels (GIL-releasing); `process` for pure-Python kernels; `serial` for debugging.
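To make the rules for the two variables concrete, here is a minimal sketch of how they could be resolved. The helper names `resolve_workers` and `resolve_backend` are hypothetical and are not part of PyCCAPT. The handling of negative numbers, letter case, and invalid backend names is an assumption, as is treating a larger integer as the worker count itself.

```python
import os

# Backend names listed in the documentation above.
VALID_BACKENDS = ("auto", "thread", "process", "serial")


def resolve_workers(env=None, cpu_count=None):
    """Hypothetical helper: map PYCCAPT_PARALLEL_WORKERS to a worker count."""
    env = os.environ if env is None else env
    cpus = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    raw = env.get("PYCCAPT_PARALLEL_WORKERS", "").strip()
    if not raw:
        # Unset: auto = cpu_count - 1, capped at 8, never below 1.
        return max(1, min(cpus - 1, 8))
    value = int(raw)  # raises ValueError on non-integer input
    # 0 or 1 means strictly serial (one worker).
    # Assumption: negative values are also treated as serial.
    return max(1, value)


def resolve_backend(env=None):
    """Hypothetical helper: read and validate PYCCAPT_PARALLEL_BACKEND."""
    env = os.environ if env is None else env
    raw = env.get("PYCCAPT_PARALLEL_BACKEND", "").strip().lower()
    if not raw:
        return "auto"  # unset means auto
    if raw not in VALID_BACKENDS:
        # Assumption: the real library may handle bad values differently.
        raise ValueError(f"unknown backend {raw!r}; expected one of {VALID_BACKENDS}")
    return raw
```

Passing `env` and `cpu_count` explicitly keeps the sketch testable without touching the real process environment.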
What’s parallelized today
These are the loops where measurement has shown a real wall-clock win:
| Hot path | Measured speedup |
|---|---|
| | ~**3x** on 5M ions / 2.6k cells |
| | ~**2.2x** on 5M ions / 500 segs |
| | ~**1.5x** on >=1M sequences |
Loops where parallelism was tried and rejected (the per-task work is too cheap, so executor dispatch / pickling dominates):
- `bowl_correction` cartesian path (cell work is sub-millisecond on pre-sorted slices).
- Smaller Surface Concept datasets (<500k sequence records) – pickle overhead on the dict-of-lists data structure exceeds the inner compute.
- Adaptive residual calibration’s outer iterations (iterations depend on each other; can’t parallelize).
- Reflectron per-chunk correction (already vectorized; chunk parallelism contends with the streaming HDF5 writer).
Each of these paths has a comment in the source explaining the measurement that led to keeping it serial. When you re-measure on different hardware (e.g. a Linux server with fork-based multiprocessing) you may find the break-even shifts – the env-var override is the right knob to flip.
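If you want to re-measure the break-even point on your own hardware, a small timing harness along these lines may help. This is a generic sketch built on the standard library only. It does not call PyCCAPT, and `time_serial_vs_threads` is a hypothetical name. Executor startup is deliberately included in the threaded timing, since that is the overhead the work has to amortize.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def time_serial_vs_threads(kernel, items, workers=4, repeats=3):
    """Return (best_serial_seconds, best_threaded_seconds) for kernel over items."""
    items = list(items)

    def best_of(run):
        # Keep the fastest of several runs to reduce timing noise.
        timings = []
        for _ in range(repeats):
            start = time.perf_counter()
            run()
            timings.append(time.perf_counter() - start)
        return min(timings)

    def run_serial():
        return [kernel(x) for x in items]

    def run_threads():
        # The pool is created inside the timed region on purpose.
        with ThreadPoolExecutor(max_workers=workers) as pool:
            return list(pool.map(kernel, items))

    return best_of(run_serial), best_of(run_threads)


if __name__ == "__main__":
    # A deliberately cheap kernel: dispatch overhead is expected to dominate.
    serial_s, threaded_s = time_serial_vs_threads(lambda x: x * x, range(10_000))
    print(f"serial: {serial_s:.6f} s, threaded: {threaded_s:.6f} s")
```

Swapping in one of your own kernels and a realistic item count gives a rough indication of whether a given loop sits above or below the break-even on that machine.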
Usage
Pin behavior in tests / CI (the test suite already does this):
export PYCCAPT_PARALLEL_WORKERS=1 # serial, deterministic
pytest tests/
Run with explicit worker count:
export PYCCAPT_PARALLEL_WORKERS=8
jupyter notebook
Force a specific backend (useful when debugging worker-vs-serial divergence):
export PYCCAPT_PARALLEL_BACKEND=thread
pytest tests/calibration/test_calibration_state.py
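For benchmark scripts driven from Python rather than a shell, the same variables can be set on a child process. The sketch below uses only the standard library. `run_with_workers` is a hypothetical helper, and the probe command simply echoes the variable back rather than running any PyCCAPT code.

```python
import os
import subprocess
import sys


def run_with_workers(command, workers, backend=None):
    """Run `command` in a child process with the PYCCAPT_* variables set."""
    env = dict(os.environ)  # copy, so the parent environment is untouched
    env["PYCCAPT_PARALLEL_WORKERS"] = str(workers)
    if backend is not None:
        env["PYCCAPT_PARALLEL_BACKEND"] = backend
    return subprocess.run(
        command, env=env, capture_output=True, text=True, check=True
    )


if __name__ == "__main__":
    # Probe command: prints the worker setting the child process actually sees.
    probe = [
        sys.executable,
        "-c",
        "import os; print(os.environ['PYCCAPT_PARALLEL_WORKERS'])",
    ]
    for n in (1, 4, 8):
        print(run_with_workers(probe, n).stdout.strip())
```

Replacing `probe` with a real command, such as a pytest invocation, runs it once per worker setting without changing the calling shell's environment.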
Python-side overrides
Library code that wants finer control passes a ParallelConfig:
from pyccapt.calibration.core.parallel import ParallelConfig, parallel_map

results = parallel_map(
    my_kernel,
    items,
    config=ParallelConfig(backend="thread", workers=4, min_items=8),
    gil_releasing=True,
)
ParallelConfig.min_items overrides the per-backend default threshold
(32 for threads, 200 for processes). Set it explicitly when you’ve
already done coarse-grained batching at the caller level so each work item
represents real CPU work.
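The threshold rule can be sketched as a small predicate. The default values come from the text above. The function name `runs_in_parallel` and the exact comparison (a count equal to `min_items` counts as enough) are assumptions, not the library's code.

```python
# Per-backend defaults quoted above: 32 for threads, 200 for processes.
DEFAULT_MIN_ITEMS = {"thread": 32, "process": 200}


def runs_in_parallel(n_items, backend, min_items=None):
    """Hypothetical check: True if n_items clears the threshold for backend."""
    if backend == "serial":
        return False
    # An explicit min_items overrides the per-backend default.
    threshold = DEFAULT_MIN_ITEMS[backend] if min_items is None else min_items
    return n_items >= threshold


if __name__ == "__main__":
    print(runs_in_parallel(10, "thread"))               # below the default of 32
    print(runs_in_parallel(10, "thread", min_items=8))  # caller lowered the threshold
```

Lowering `min_items` only pays off when each item is already a coarse batch; with sub-millisecond items, dispatch overhead is likely to dominate whatever the threshold.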
Design notes
The parallelism layer lives in
pyccapt.calibration.core.parallel. Key invariants:
- Backward-compat: every caller keeps its old signature. Below the configured `min_items`, `parallel_map` runs in serial – identical results, no executor created.
- Determinism: results are returned in input order regardless of backend. Tests pin `PYCCAPT_PARALLEL_WORKERS=1` to reproduce serial output bit-for-bit.
- Graceful degradation: when a closure or notebook-defined function is handed to the auto backend with a pure-Python kernel, `parallel_map` silently downgrades from processes to threads so the caller still gets a correct answer. Setting `backend='process'` explicitly raises instead – the caller wanted parallelism and should know it failed.
- No nested parallelism: if you’re already inside a parallel worker, `parallel_map` still applies its thresholds; nested-pool deadlock is prevented by the auto-serial fallback for small workloads, not by explicit detection.
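These invariants can be illustrated with a simplified stand-in built on `concurrent.futures`. This is a sketch under stated assumptions, not PyCCAPT's implementation: `sketch_parallel_map` and `is_picklable` are hypothetical names, the flat keyword arguments replace `ParallelConfig`, and `ValueError` is an assumed exception type for the explicit-process failure.

```python
import pickle
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def is_picklable(func):
    """True if func can be sent to a worker process (lambdas and closures cannot)."""
    try:
        pickle.dumps(func)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


def sketch_parallel_map(func, items, backend="auto", workers=4,
                        min_items=32, gil_releasing=False):
    """Hypothetical stand-in showing the invariants; results keep input order."""
    items = list(items)
    # Serial fallback: below the threshold no executor is created at all.
    if backend == "serial" or workers <= 1 or len(items) < min_items:
        return [func(x) for x in items]
    if backend == "process" and not is_picklable(func):
        # Explicit request for processes: fail loudly (exception type is assumed).
        raise ValueError("backend='process' requires a picklable function")
    if backend == "auto":
        if gil_releasing:
            backend = "thread"
        else:
            # Pure-Python kernel: prefer processes, downgrade quietly to threads.
            backend = "process" if is_picklable(func) else "thread"
    pool_cls = ProcessPoolExecutor if backend == "process" else ThreadPoolExecutor
    with pool_cls(max_workers=workers) as pool:
        # Executor.map yields results in input order on both backends.
        return list(pool.map(func, items))
```

A lambda handed to the auto backend takes the thread path here, while the same lambda with `backend="process"` raises, mirroring the graceful-degradation rule described above.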
For the matching low-RAM / memory-mapped I/O work see Working with Big Datasets on Small RAM (Memory-Mapped I/O).