High-Performance Computing

Supported Parallel Algorithms

Algorithm	Serial	OpenMP	CUDA	MPI
GWRBasic	✅	✅	✅	✅
GWRMultiscale	✅	✅	✅	Serial × \(n_{var}\)
GWSS (Average)	✅	✅	—	—
GWSS (Correlation)	✅	✅	—	—

Notes:

Serial (SerialOnly): single-threaded; suitable for small datasets or debugging.
OpenMP: shared-memory multi-threading; suitable for single-node multi-core machines. Requires ENABLE_OPENMP at build time.
CUDA: NVIDIA GPU acceleration; suitable for large datasets. Requires ENABLE_CUDA at build time.
MPI: distributed computing; suitable for multi-node clusters. Requires ENABLE_MPI at build time.

Multi-Threading (OpenMP)

Enables shared-memory parallel computation via OpenMP. The core per-sample model fitting loop is parallelised, with each thread independently computing coefficient estimates for a subset of samples.
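The idea of splitting the per-sample loop across threads can be sketched in plain Python. This is only an illustration under stated assumptions, not the library's implementation: `fit_local` and `fit_all` are hypothetical names, the model is reduced to one predictor plus an intercept, and a fixed Gaussian kernel stands in for the configured weight. Python threads share the interpreter lock, so the sketch shows how the work is divided rather than the speed-up that native OpenMP threads provide.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def fit_local(i, coords, x, y, bandwidth):
    """Weighted least squares for y = b0 + b1 * x, centred on sample i."""
    cx0, cy0 = coords[i]
    # Gaussian kernel weights from the Euclidean distance to sample i.
    w = [
        math.exp(-0.5 * (math.hypot(cx - cx0, cy - cy0) / bandwidth) ** 2)
        for cx, cy in coords
    ]
    sw = sum(w)
    mean_x = sum(wj * xj for wj, xj in zip(w, x)) / sw
    mean_y = sum(wj * yj for wj, yj in zip(w, y)) / sw
    sxx = sum(wj * (xj - mean_x) ** 2 for wj, xj in zip(w, x))
    sxy = sum(wj * (xj - mean_x) * (yj - mean_y) for wj, xj, yj in zip(w, x, y))
    slope = sxy / sxx
    return (mean_y - slope * mean_x, slope)


def fit_all(coords, x, y, bandwidth, threads=4):
    """Fit every sample's local model; each fit is independent of the others."""

    def task(i):
        return fit_local(i, coords, x, y, bandwidth)

    # Each worker thread handles a subset of the sample indices.
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(task, range(len(coords))))
```

Because no local fit reads another fit's result, the loop needs no synchronisation beyond collecting the outputs, which is what makes it a natural target for shared-memory parallelism.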

Setting the number of threads:

from pygwmodel import GWRBasic, ParallelType, BandwidthWeight, CRSDistance

# data, y, and x are assumed to have been prepared beforehand.
algorithm = (
    GWRBasic(data, y, x,
             weight=BandwidthWeight(36.0, adaptive=True),
             distance=CRSDistance())
    .enable_parallel(ParallelType.OpenMP, threads=4)  # run on 4 OpenMP threads
    .fit()
)

GWRMultiscale also supports OpenMP:

from pygwmodel import GWRMultiscale, ParallelType

algorithm = (
    GWRMultiscale(data, y, x, weights)
    .enable_parallel(ParallelType.OpenMP, threads=8)  # run on 8 OpenMP threads
    .fit()
)

Recommended thread count: set to the number of physical CPU cores.
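One way to obtain a value for `threads` is sketched below. The helper name `recommended_threads` is hypothetical. The standard library only reports logical cores through `os.cpu_count()`, so the sketch asks the optional third-party package `psutil` for the physical count and falls back to the logical count when `psutil` is missing or cannot determine it.

```python
import os


def recommended_threads():
    """Return the physical core count if known, else the logical count, else 1."""
    try:
        import psutil  # optional third-party dependency

        physical = psutil.cpu_count(logical=False)  # may be None on some systems
    except ImportError:
        physical = None
    return physical or os.cpu_count() or 1
```

The result can be passed as the `threads` argument shown above. On machines with simultaneous multithreading the fallback value may be about twice the physical core count.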

GPU Acceleration (CUDA)

Offloads the matrix operations of locally weighted regression to an NVIDIA GPU; suitable for large datasets.

Group Size (`group_size`)

group_size controls how many samples’ coefficient estimates are computed together on the GPU in one batch. Larger groups make better use of GPU parallelism but are constrained by GPU memory. The internal constraint is:

\[k \times n \times g \times 8 < \text{GPU Memory}\]

where \(k\) is the number of independent variables, \(n\) is the number of samples, and \(g\) is the group size. The factor 8 is consistent with 8-byte double-precision values, with GPU memory measured in bytes.
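As a worked example of the constraint, the largest admissible group size can be computed directly. This is a minimal sketch: `max_group_size` is a hypothetical helper, not part of the library, and it assumes the factor 8 is the byte size of one stored value. In practice the GPU also holds other buffers, so the usable limit is likely lower than this bound.

```python
def max_group_size(k, n, gpu_memory_bytes, bytes_per_value=8):
    """Largest integer g satisfying k * n * g * bytes_per_value < gpu_memory_bytes."""
    per_group = k * n * bytes_per_value  # bytes needed per unit of group size
    if per_group <= 0:
        raise ValueError("k, n and bytes_per_value must be positive")
    # Strict inequality: g * per_group <= gpu_memory_bytes - 1.
    return max((gpu_memory_bytes - 1) // per_group, 0)


# 5 independent variables, 100,000 samples, 8 GiB of GPU memory
print(max_group_size(5, 100_000, 8 * 1024**3))  # prints 2147
```

A `group_size` of 64 or 128, as used in the examples on this page, sits well inside that bound for a dataset of this shape.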

Usage:

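from pygwmodel import GWRBasic, ParallelType, BandwidthWeight, CRSDistance

algorithm = (
    GWRBasic(data, y, x,
             weight=BandwidthWeight(36.0, adaptive=True),
             distance=CRSDistance())
    .enable_parallel(ParallelType.CUDA, gpu_id=0, group_size=64)  # GPU 0, 64 samples per batch
    .fit()
)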

For GWRMultiscale:

from pygwmodel import GWRMultiscale, ParallelType

algorithm = GWRMultiscale(data, y, x, weights).enable_parallel(
    ParallelType.CUDA, gpu_id=0, group_size=128
).fit()

Performance Tips

Scenario	Recommendation
Small datasets (\(n < 1000\))	Serial is sufficient; parallelisation gains little at this size
Medium datasets (\(1000 \le n < 10^4\))	Use OpenMP with the thread count equal to the number of physical CPU cores
Large datasets (\(n \ge 10^4\))	Use CUDA if available; otherwise OpenMP
Speeding up GWRMultiscale	Set `has_hat_matrix=False` to skip hat matrix computation, significantly reducing memory and computational cost
GWRMultiscale convergence	Increase `criterion_threshold` to reduce the number of iterations, at a slight cost in accuracy
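The size-based rows of the table can be expressed as a small selection rule. This is a sketch only: `suggest_backend` is a hypothetical helper that returns plain strings rather than library enum values, the thresholds are the rough guides from the table, and \(n = 10^4\) is treated as large by assumption.

```python
def suggest_backend(n, cuda_available=False, openmp_available=True):
    """Map a sample count to the backend suggested in the table above."""
    if n < 1000:
        return "Serial"  # small: parallelism is not needed
    fallback = "OpenMP" if openmp_available else "Serial"
    if n < 10_000:
        return fallback  # medium: multi-threading on one node
    return "CUDA" if cuda_available else fallback  # large: prefer the GPU


print(suggest_backend(500))                         # prints Serial
print(suggest_backend(5_000))                       # prints OpenMP
print(suggest_backend(50_000, cuda_available=True)) # prints CUDA
```

Whether a backend is actually available depends on the build flags described earlier (`ENABLE_OPENMP`, `ENABLE_CUDA`), so the two boolean arguments should reflect how the library was built.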