High-Performance Computing =========================== .. _perf-overview: Supported Parallel Algorithms ----------------------------- .. list-table:: :header-rows: 1 :widths: 25 20 20 20 15 * - Algorithm - Serial - OpenMP - CUDA - MPI * - GWRBasic - ✅ - ✅ - ✅ - ✅ * - GWRMultiscale - ✅ - ✅ - ✅ - Serial × :math:`n_{var}` * - GWSS (Average) - ✅ - ✅ - — - — * - GWSS (Correlation) - ✅ - ✅ - — - — Notes: - **Serial** (SerialOnly): single-threaded; suitable for small datasets or debugging. - **OpenMP**: shared-memory multi-threading; suitable for single-node multi-core machines. Requires ``ENABLE_OPENMP`` at build time. - **CUDA**: NVIDIA GPU acceleration; suitable for large datasets. Requires ``ENABLE_CUDA`` at build time. - **MPI**: distributed computing; suitable for multi-node clusters. Requires ``ENABLE_MPI`` at build time. .. _perf-openmp: Multi-Threading (OpenMP) ------------------------ Enables shared-memory parallel computation via OpenMP. The core per-sample model fitting loop is parallelised, with each thread independently computing coefficient estimates for a subset of samples. Setting the number of threads: .. code-block:: python from pygwmodel import GWRBasic, ParallelType, BandwidthWeight, CRSDistance algorithm = GWRBasic(data, y, x, weight=BandwidthWeight(36.0, adaptive=True), distance=CRSDistance()).enable_parallel( ParallelType.OpenMP, threads=4 ).fit() GWRMultiscale also supports OpenMP: .. code-block:: python from pygwmodel import GWRMultiscale algorithm = GWRMultiscale(data, y, x, weights).enable_parallel( ParallelType.OpenMP, threads=8 ).fit() Recommended thread count: set to the number of physical CPU cores. .. _perf-cuda: GPU Acceleration (CUDA) ----------------------- Offloads the locally weighted regression matrix operations to a NVIDIA GPU, suitable for larger datasets. Group Size (``group_size``) ~~~~~~~~~~~~~~~~~~~~~~~~~~~ ``group_size`` controls how many samples' coefficient estimates are computed together on the GPU in one batch. Larger groups make better use of GPU parallelism but are constrained by GPU memory. The internal constraint is: .. math:: k \times n \times g \times 8 < \text{GPU Memory} where :math:`k` is the number of independent variables, :math:`n` is the number of samples, and :math:`g` is the group size. Usage: .. code-block:: python algorithm = GWRBasic(data, y, x, weight=BandwidthWeight(36.0, adaptive=True), distance=CRSDistance()).enable_parallel( ParallelType.CUDA, gpu_id=0, group_size=64 ).fit() For GWRMultiscale: .. code-block:: python algorithm = GWRMultiscale(data, y, x, weights).enable_parallel( ParallelType.CUDA, gpu_id=0, group_size=128 ).fit() .. _perf-tips: Performance Tips ---------------- .. list-table:: :header-rows: 1 :widths: 35 65 * - Scenario - Recommendation * - Small datasets (:math:`n < 1000`) - Serial is sufficient; overhead is negligible * - Medium datasets (:math:`n < 10^4`) - Use OpenMP with thread count equal to CPU cores * - Large datasets (:math:`n > 10^4`) - Use CUDA if available; otherwise OpenMP * - Speeding up GWRMultiscale - Set ``has_hat_matrix=False`` to skip hat matrix computation, significantly reducing memory and computational cost * - GWRMultiscale convergence - Increase ``criterion_threshold`` to reduce iterations (trades slight accuracy)