High-Performance Computing

Supported Parallel Algorithms

Algorithm

Serial

OpenMP

CUDA

MPI

GWRBasic

GWRMultiscale

Serial × \(n_{var}\)

GWSS (Average)

GWSS (Correlation)

Notes:

  • Serial (SerialOnly): single-threaded; suitable for small datasets or debugging.

  • OpenMP: shared-memory multi-threading; suitable for single-node multi-core machines. Requires ENABLE_OPENMP at build time.

  • CUDA: NVIDIA GPU acceleration; suitable for large datasets. Requires ENABLE_CUDA at build time.

  • MPI: distributed computing; suitable for multi-node clusters. Requires ENABLE_MPI at build time.

Multi-Threading (OpenMP)

Enables shared-memory parallel computation via OpenMP. The core per-sample model fitting loop is parallelised, with each thread independently computing coefficient estimates for a subset of samples.
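To make the partitioning concrete, here is a minimal pure-Python sketch of the idea. It is not the library's implementation. The names local_fit, fit_chunk and fit_all are hypothetical, and the model is deliberately reduced to one explanatory variable, 1-D coordinates and a fixed-bandwidth Gaussian kernel (all assumptions made for illustration). Python threads do not run this arithmetic concurrently because of the GIL, so the sketch only shows how the samples are divided among workers, not the speed-up.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def local_fit(i, coords, x, y, bandwidth):
    """Weighted least squares for y = a + b*x, centred on sample i."""
    # Gaussian kernel weights from the distance of every sample to sample i.
    w = [math.exp(-0.5 * ((c - coords[i]) / bandwidth) ** 2) for c in coords]
    sw = sum(w)
    mx = sum(wj * xj for wj, xj in zip(w, x)) / sw
    my = sum(wj * yj for wj, yj in zip(w, y)) / sw
    sxx = sum(wj * (xj - mx) ** 2 for wj, xj in zip(w, x))
    sxy = sum(wj * (xj - mx) * (yj - my) for wj, xj, yj in zip(w, x, y))
    slope = sxy / sxx
    return (my - slope * mx, slope)  # (intercept, slope) at sample i


def fit_chunk(indices, coords, x, y, bandwidth):
    """Work done by one thread: fit every sample in its own subset."""
    return [local_fit(i, coords, x, y, bandwidth) for i in indices]


def fit_all(coords, x, y, bandwidth, threads=4):
    """Split the samples into contiguous chunks, one per thread."""
    n = len(coords)
    chunks = [range(t * n // threads, (t + 1) * n // threads)
              for t in range(threads)]
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # map() preserves chunk order, so results stay in sample order.
        parts = list(pool.map(
            lambda c: fit_chunk(c, coords, x, y, bandwidth), chunks))
    return [beta for part in parts for beta in part]
```

Because each local fit reads the shared data but writes only its own result, the chunks need no synchronisation, which is what makes this loop a natural fit for shared-memory threading.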

Setting the number of threads:

from pygwmodel import GWRBasic, ParallelType, BandwidthWeight, CRSDistance

algorithm = GWRBasic(data, y, x,
                     weight=BandwidthWeight(36.0, adaptive=True),
                     distance=CRSDistance()).enable_parallel(
    ParallelType.OpenMP, threads=4
).fit()

GWRMultiscale also supports OpenMP:

from pygwmodel import GWRMultiscale, ParallelType

algorithm = GWRMultiscale(data, y, x, weights).enable_parallel(
    ParallelType.OpenMP, threads=8
).fit()

Recommended thread count: set to the number of physical CPU cores.
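One way to obtain that number from Python is sketched below. The helper name recommended_threads is hypothetical. The standard library only reports logical cores through os.cpu_count(), so the sketch tries the optional third-party psutil package first and falls back to the logical count when psutil is not installed or cannot determine the physical count.

```python
import os


def recommended_threads() -> int:
    """Best-effort physical core count, falling back to logical cores."""
    try:
        import psutil  # optional third-party package

        physical = psutil.cpu_count(logical=False)  # may return None
        if physical:
            return physical
    except ImportError:
        pass
    # os.cpu_count() counts logical cores and may also return None.
    return os.cpu_count() or 1


threads = recommended_threads()
```

The resulting value can then be passed as the threads argument of enable_parallel. On machines with simultaneous multithreading the fallback may return roughly twice the physical core count.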

GPU Acceleration (CUDA)

Offloads the matrix operations of the locally weighted regression to an NVIDIA GPU; suitable for larger datasets.

Group Size (group_size)

group_size controls how many samples’ coefficient estimates are computed together on the GPU in one batch. Larger groups make better use of GPU parallelism but are constrained by GPU memory. The internal constraint is:

\[k \times n \times g \times 8 < \text{GPU Memory}\]

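where \(k\) is the number of independent variables, \(n\) is the number of samples, and \(g\) is the group size. The factor 8 is presumably the size in bytes of one double-precision value, so GPU memory is measured in bytes.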
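The inequality can be solved for \(g\) to estimate the largest group size that fits on a given device. The sketch below does that arithmetic. The function name max_group_size is hypothetical, and the factor 8 is assumed to be the byte size of a double-precision value. It treats the whole device memory as available, so in practice some headroom should be left for other allocations.

```python
def max_group_size(k: int, n: int, gpu_memory_bytes: int,
                   bytes_per_value: int = 8) -> int:
    """Largest g satisfying k * n * g * bytes_per_value < gpu_memory_bytes."""
    per_group = k * n * bytes_per_value  # bytes needed per unit of g
    if per_group <= 0:
        raise ValueError("k, n and bytes_per_value must be positive")
    # Subtract 1 so the strict inequality holds when memory divides evenly.
    return max((gpu_memory_bytes - 1) // per_group, 0)


# Example: 10 independent variables, 50,000 samples, 8 GiB of GPU memory.
limit = max_group_size(k=10, n=50_000, gpu_memory_bytes=8 * 2**30)
```

For this example each unit of group size costs \(10 \times 50000 \times 8 = 4 \times 10^6\) bytes, so the upper bound is 2147. A value such as 64 or 128 sits well inside it.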

Usage:

algorithm = GWRBasic(data, y, x,
                     weight=BandwidthWeight(36.0, adaptive=True),
                     distance=CRSDistance()).enable_parallel(
    ParallelType.CUDA, gpu_id=0, group_size=64
).fit()

For GWRMultiscale:

algorithm = GWRMultiscale(data, y, x, weights).enable_parallel(
    ParallelType.CUDA, gpu_id=0, group_size=128
).fit()

Performance Tips

Recommendations by scenario:

  • Small datasets (\(n < 10^3\)): Serial is sufficient; overhead is negligible.

  • Medium datasets (\(10^3 \le n < 10^4\)): use OpenMP with the thread count equal to the number of physical CPU cores.

  • Large datasets (\(n \ge 10^4\)): use CUDA if available; otherwise OpenMP.

  • Speeding up GWRMultiscale: set has_hat_matrix=False to skip the hat matrix computation, which significantly reduces memory use and computational cost.

  • GWRMultiscale convergence: increase criterion_threshold to reduce the number of iterations, at a slight cost in accuracy.
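The dataset-size recommendations above can be condensed into a small helper. This is only a sketch of the rule of thumb. The function name choose_backend is hypothetical, it returns plain strings rather than ParallelType members so that it stays independent of the library, and it assumes that \(n = 10^4\) counts as a large dataset.

```python
def choose_backend(n_samples: int, cuda_available: bool = False) -> str:
    """Map a sample count to the parallel backend suggested above."""
    if n_samples < 1_000:
        return "Serial"   # small: parallelism is not needed
    if n_samples < 10_000:
        return "OpenMP"   # medium: multi-threading on one node
    # large: prefer the GPU, fall back to multi-threading
    return "CUDA" if cuda_available else "OpenMP"
```

The thresholds are guidelines rather than hard limits; the best choice also depends on the number of variables and the hardware at hand.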