Multi-Backend Acceleration Design Spec

Date: 2026-03-17
Status: Draft
Scope: Revamp forge to support three backends — Metal (macOS), CUDA (Linux/NVIDIA), CPU+OpenMP (everywhere) — with clean compile-time dispatch.


1. Problem Statement

forge currently has three backend paths (Metal, CUDA via OpenACC, CPU), but only the Metal path is actively tested and maintained. The OpenACC-based CUDA path is vendor-locked to NVIDIA's nvc++ compiler, and OpenACC adoption is declining (the spec has stagnated at v3.3, November 2022). The CPU path is reached when non-PGI compilers ignore the OpenACC pragmas, but it has no explicit parallel threading.

Goals

  1. CPU+OpenMP backend: Portable CPU parallelism via OpenMP parallel for, working on GCC, Clang, Apple Clang (with brew install libomp), and MSVC (OpenMP 2.0).
  2. CUDA backend: Dedicated CUDA implementation mirroring Metal's pipeline architecture — RAII C++ classes, pre-allocated GPU buffers, cuBLAS for standard linear algebra, custom kernels where needed. No OpenACC dependency.
  3. Metal backend: Unchanged — already working and tested.
  4. Clean dispatch layer: Compile-time backend resolution via a unified dispatch header, replacing scattered #ifdef blocks in operators.
  5. Retire OpenACC: Remove all #pragma acc usage and the OPENACC_GPU/OPENACC_MP CMake flags.
  6. Drop NVIDIA HPC SDK dependency: Without OpenACC, the CUDA backend only needs nvcc (from the standard CUDA Toolkit) + any C++17 host compiler (GCC, Clang). No more PGI/NVHPC toolchain required. This dramatically simplifies the build — standard nvidia/cuda Docker images work, and CMake's enable_language(CUDA) handles nvcc natively.

Non-Goals

  • AMD GPU / ROCm support (no hardware available)
  • Runtime backend selection (compile-time is sufficient)
  • OpenMP target offload for GPU (less mature than dedicated CUDA)

2. Architecture Overview

CMake Configure Time
├── -DMETAL_COMPUTE=ON     → Metal backend (macOS Apple Silicon)
├── -DCUDA_COMPUTE=ON      → CUDA backend (NVIDIA discrete GPUs, sm_75–sm_90)
└── (neither)              → CPU backend (OpenMP threading, FFTW)

Each produces:
  build_{backend}/
  ├── libForgeCommon.{so,dylib,a}   (shared or static via BUILD_SHARED_LIBS)
  ├── forgeSense
  ├── forgePcSense
  ├── forgePcSenseTimeSeg
  └── {backend}_tests

Metal and CUDA are mutually exclusive at configure time (enforced by CMake FATAL_ERROR). In practice, Metal is macOS-only and CUDA is Linux-only, so this never conflicts.

Backend Selection in forge-studio

forge-studio selects the backend by choosing the executable path:

  • macOS: forge/bin/metal/ or forge/bin/cpu/
  • Linux: forge/bin/cuda/ or forge/bin/cpu/

Each path contains self-contained executables (static linking recommended for forge-studio distribution).


3. Backend Dispatch Layer

File: forge/Core/BackendDispatch.hpp

A thin compile-time dispatch header that maps operations to the active backend. This replaces the current pattern of #ifdef METAL_COMPUTE / #ifdef OPENACC_GPU blocks scattered throughout operators and data structures.

3.1 Dispatch Categories

GPU Memory Management:

namespace forge::backend {
    template<typename T> void gpu_malloc(T** ptr, size_t n_elem);
    template<typename T> void gpu_free(T* ptr);
    template<typename T> void gpu_copy_to_device(T* dst, const T* src, size_t n_elem);
    template<typename T> void gpu_copy_to_host(T* dst, const T* src, size_t n_elem);
    template<typename T> bool has_gpu();  // constexpr, returns false for CPU backend
}

Vector Algebra:

namespace forge::backend {
    template<typename T> void axpy(size_t n, T alpha, const T* x, T* y);
    template<typename T> T dot(size_t n, const T* x, const T* y);
    template<typename T> void scale(size_t n, T alpha, T* x);
    template<typename T> void elem_multiply(size_t n, const T* a, const T* b, T* out);
    // ... additional ops as needed
}

Pipeline Operations:

namespace forge::backend {
    // Forward/adjoint for NUFFT, DFT, gridding
    // Wraps Metal or CUDA pipeline calls
    // CPU backend: falls through to existing Armadillo implementations
}

3.2 Resolution Mechanism

#if defined(CUDA_COMPUTE)
    #include "CUDA/CudaVectorOps_dispatch.hpp"
    // gpu_malloc → cudaMalloc, axpy → cublasSaxpy, etc.
#elif defined(METAL_COMPUTE)
    #include "Metal/MetalVectorOps_dispatch.hpp"
    // existing Metal dispatch, wrapped in unified interface
#else
    // CPU: no GPU memory ops, vector algebra via Armadillo/OpenMP
#endif
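For the CPU backend the resolution is simple: no GPU memory ops, and the vector-algebra functions lower to OpenMP loops (or Armadillo calls). A minimal sketch of two operations from 3.1, with illustrative loop bodies rather than the final Armadillo-backed implementations; the signed loop indices keep the pragmas legal under MSVC's OpenMP 2.0:

```cpp
#include <cstddef>

namespace forge::backend {
    // CPU backend: no device, so has_gpu() is a compile-time false that
    // lets callers discard GPU-only branches with `if constexpr`.
    template<typename T> constexpr bool has_gpu() { return false; }

    // y = alpha * x + y, parallelized when compiled with OpenMP.
    template<typename T>
    void axpy(size_t n, T alpha, const T* x, T* y) {
        #pragma omp parallel for
        for (long long i = 0; i < (long long)n; ++i)
            y[i] += alpha * x[i];
    }

    // Dot product via an OpenMP reduction.
    template<typename T>
    T dot(size_t n, const T* x, const T* y) {
        T acc{};
        #pragma omp parallel for reduction(+:acc)
        for (long long i = 0; i < (long long)n; ++i)
            acc += x[i] * y[i];
        return acc;
    }
}
```

Because has_gpu<T>() resolves at compile time, operator code can guard GPU-only paths with if constexpr (forge::backend::has_gpu<T>()) instead of preprocessor conditionals.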

3.3 AccelerateDispatch and Backend Dispatch Are Orthogonal

AccelerateDispatch.hpp handles CPU-side SIMD (vDSP/BLAS) on macOS. It is orthogonal to the GPU backend dispatch layer — Accelerate optimizes host-side vector operations regardless of whether Metal, CUDA, or CPU is the active GPU backend. It remains unchanged and is not part of BackendDispatch.hpp.

3.4 Impact on Existing Code

Today in Gnufft.cpp:

#ifdef METAL_COMPUTE
    if (pipelineCtx != nullptr) {
        metal_nufft_forward(pipelineCtx, ...);
    } else if (metalCtx != nullptr) { ... }
#endif
    computeFd_CPU_Grid<T1>(...);

After:

if (pipelineCtx != nullptr) {
    forge::backend::nufft_forward(pipelineCtx, ...);
} else {
    computeFd_CPU_Grid<T1>(...);  // CPU/OpenMP path
}

The pipelineCtx type resolves to CudaNufftPipeline* or MetalNufftPipelineContext* based on the backend.

All operators with backend-specific code are refactored: Gnufft, Gdft, GdftR2, Gfft, SENSE, pcSENSE, pcSenseTimeSeg.


4. CPU+OpenMP Backend

4.1 What Changes

| Component | Current (OpenACC) | New (OpenMP) |
|---|---|---|
| gridding.cpp | #pragma acc parallel loop | #pragma omp parallel for with schedule and reduction clauses |
| gridding.cpp atomics | #pragma acc atomic | #pragma omp atomic |
| griddingSupport.cpp | 21 #pragma acc directives (density compensation, deapodization) | #pragma omp parallel for |
| forgeCol.hpp | #pragma acc enter/exit data | Removed — no GPU memory management for CPU backend |
| forgeMat.hpp | #pragma acc enter/exit data | Removed — no GPU memory management for CPU backend |
| forgeCol.hpp zeros()/ones() | #pragma acc parallel loop | #pragma omp parallel for |
| forgeMat.hpp zeros()/ones() | #pragma acc parallel loop | #pragma omp parallel for |
| ftCpu.cpp | #pragma acc kernels | #pragma omp parallel for |
| ftCpuWithGrads.cpp | Already uses #pragma omp | Unchanged |
| Gfft.cpp | #pragma acc + cuFFT plan code | OpenMP path / dispatch layer |
| FFT | fftCPU (FFTW) | Unchanged |
| Operators | Armadillo-based fallthrough | Unchanged |
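The atomics row is the one place where the translation is not mechanical: in adjoint gridding, samples from different threads can scatter into the same grid cell, so each accumulation needs #pragma omp atomic where OpenACC used #pragma acc atomic. A simplified 1-D sketch of the pattern (array names are illustrative, not forge's):

```cpp
#include <cstddef>

// Scatter n_samples complex samples into a shared grid. Two threads may
// target the same cell, so each += must be atomic.
void scatter_adjoint(size_t n_samples, const int* cell,
                     const float* re, const float* im,
                     float* grid_re, float* grid_im) {
    #pragma omp parallel for
    for (long long s = 0; s < (long long)n_samples; ++s) {
        int c = cell[s];
        #pragma omp atomic
        grid_re[c] += re[s];
        #pragma omp atomic
        grid_im[c] += im[s];
    }
}
```

Without -fopenmp the pragmas are ignored and the loop runs single-threaded with identical results, which is what makes step 1 of the development order below a valid correctness baseline.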

4.2 Development Order

  1. Strip OpenACC, verify single-threaded correctness: Use OpenACC pragmas as a roadmap — read each pragma to understand parallelism/data dependencies, then remove. Run existing test suite to confirm correctness.
  2. Add OpenMP parallelism: Add #pragma omp parallel for to hot loops identified from the OpenACC roadmap (gridding, DFT). Verify correctness and measure speedup.
  3. Apple Clang note: Requires brew install libomp, the -Xpreprocessor -fopenmp compile flag, and linking libomp. CMake's find_package(OpenMP) handles this automatically when libomp is installed.

4.3 forgeCol/forgeMat on CPU

  • isOnGPU is always false
  • All operators dispatch directly to Armadillo (existing behavior)
  • No GPU memory allocation or transfer code compiled
  • getArma() returns zero-copy view without sync (no device to sync from)
  • The isCopy flag (used only for OpenACC bookkeeping) becomes dead code — clean it up in this phase

5. CUDA Backend

5.1 New Directory: forge/CUDA/

| File | Purpose |
|---|---|
| CudaNufftPipeline.h/.cu | Full GPU NUFFT pipeline — gridding + cuFFT + deapodize. RAII C++ class with pre-allocated buffers and cached cuFFT plans. |
| CudaGridding.h/.cu | Gridding-only CUDA kernels (adjoint/forward, 2D/3D) |
| CudaDFT.h/.cu | Field-corrected DFT CUDA kernels |
| CudaVectorOps.h/.cu | Element-wise ops not covered by cuBLAS (complex multiply, magnitude, phase) |
| CudaVectorOps_dispatch.hpp | Dispatch header mapping forgeCol/forgeMat operators to cuBLAS or custom kernels |
| CudaContext.h/.cu | Device management, stream creation, error checking utilities |
| CudaFFT.h/.cu | cuFFT wrapper (migrated from existing fftGPU.h/.cpp, adapted for CUDA backend) |

5.2 Pipeline Architecture

class CudaNufftPipeline {
public:
    // Constructor: cudaMalloc all buffers, create cuFFT plans, upload LUT + trajectory
    CudaNufftPipeline(uword imgDim, uword nShots, const T* kx, const T* ky, const T* kz,
                      const T* LUT, uword sizeLUT, T gridOS, uword nInterp);

    // One cudaMemcpy in → all kernels on-device → one cudaMemcpy out
    void forward(const T* imageIn, T* samplesOut);
    void adjoint(const T* samplesIn, T* imageOut);

    // Destructor: cudaFree all buffers, destroy cuFFT plans
    ~CudaNufftPipeline();

private:
    // Pre-allocated GPU buffers
    T* d_lut;
    T* d_kx, *d_ky, *d_kz;
    T* d_image, *d_grid, *d_samples;
    cufftHandle fftPlan;
    // ...
};

Pipeline stages (matching Metal):

  • Forward: deapodize → zero_pad → fftshift → cuFFT → ifftshift → gridding
  • Adjoint: gridding → ifftshift → cuIFFT → fftshift → crop_center → deapodize
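The fftshift/ifftshift stages are pure index permutations, which is why they translate into trivial CUDA kernels. A host-side sketch of the 1-D mapping (the real kernels are 2-D/3-D, but the per-axis index arithmetic is the same; the function name is illustrative):

```cpp
#include <cstddef>
#include <vector>

// fftshift: rotate so the zero-frequency element moves from index 0 to the
// center. For even n this is its own inverse; for odd n the inverse
// (ifftshift) rotates by n - n/2 instead.
std::vector<float> fftshift1d(const std::vector<float>& in) {
    const size_t n = in.size(), half = n / 2;
    std::vector<float> out(n);
    for (size_t i = 0; i < n; ++i)
        out[(i + half) % n] = in[i];
    return out;
}
```

In the CUDA kernel the same arithmetic computes each thread's destination index, so the stage is a single gather/scatter pass with no synchronization.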

5.3 forgeCol/forgeMat CUDA Memory

forgeCol::operator+()  →  is data on CUDA GPU?
                            ├── yes → CudaVectorOps dispatch
                            │         ├── standard BLAS op? → cuBLAS (axpy, dot, scal, gemm)
                            │         └── non-standard op?  → custom CUDA kernel
                            ├── Metal? → MetalVectorOps dispatch (existing)
                            └── CPU?   → Armadillo (existing)

Memory lifecycle:

  1. putOnGPU() → cudaMalloc device buffer + cudaMemcpy host→device, isOnGPU = true
  2. Operators dispatch to cuBLAS / custom kernels (data stays on device)
  3. Non-const getArma() → cudaMemcpy device→host when host access needed
  4. Destructor → cudaFree

CUDA memory model details:

  • forgeCol maintains separate host and device pointers (unlike Metal's unified memory). The host pointer uses standard new T[]; the device pointer comes from cudaMalloc.
  • Consider cudaMallocHost (pinned host memory) for improved transfer bandwidth on discrete GPUs. This can be a later optimization — standard host memory works correctly; pinned memory just transfers faster.
  • forgeComplex<T> is layout-compatible with cuFloatComplex/cuDoubleComplex (two contiguous values). Verify at compile time:

static_assert(sizeof(forgeComplex<float>) == sizeof(cuFloatComplex));
static_assert(alignof(forgeComplex<float>) == alignof(cuFloatComplex));
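The same checks can be exercised without the CUDA headers, since cuComplex.h defines cuFloatComplex as a float2 (two contiguous floats). A sketch using a stand-in with that layout, assuming forgeComplex<T> stores re/im as its only members (both definitions here are illustrative, not forge's actual headers):

```cpp
#include <complex>

// Assumed layout of forgeComplex<T>: two contiguous values, no extra state.
template<typename T> struct forgeComplex { T re, im; };

// Stand-in mirroring cuComplex.h's float2-based cuFloatComplex.
struct float2_standin { float x, y; };

static_assert(sizeof(forgeComplex<float>) == sizeof(float2_standin));
static_assert(alignof(forgeComplex<float>) == alignof(float2_standin));

// std::complex<float> guarantees the same layout, so all three
// representations can share buffers without conversion copies.
static_assert(sizeof(forgeComplex<float>) == sizeof(std::complex<float>));
```

If the asserts hold, device buffers can be passed to cuBLAS/cuFFT with a pointer cast and no per-call conversion.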

5.4 Target Architectures

CUDA compute capabilities sm_75 through sm_90:

| CC | Architecture | Example GPUs |
|---|---|---|
| 7.5 | Turing | RTX 2080, T4 |
| 8.0 | Ampere | A100 |
| 8.6 | Ampere | RTX 3080 |
| 8.9 | Ada Lovelace | RTX 4090, L40 |
| 9.0 | Hopper | H100 |

CUDA Toolkit 12.6 fully supports Turing and newer, and the sm_75–sm_90 range remains future-proof through CUDA 13.x, which dropped only pre-Turing architectures.


6. CMake Build System Changes

6.1 New/Changed Flags

| Flag | Status | Effect |
|---|---|---|
| -DMETAL_COMPUTE=ON | Unchanged | Metal backend |
| -DCUDA_COMPUTE=ON | New | CUDA backend |
| -DBUILD_SHARED_LIBS=ON/OFF | New | Shared (system install) vs static (forge-studio) |
| -DOPENACC_GPU=ON | Retired | Removed |
| -DOPENACC_MP=ON | Retired | Removed |

6.2 Key CMake Logic

option(METAL_COMPUTE "Enable Apple Metal GPU backend" OFF)
option(CUDA_COMPUTE "Enable NVIDIA CUDA GPU backend" OFF)
option(BUILD_SHARED_LIBS "Build shared or static libraries" ON)

# Mutual exclusion
if(METAL_COMPUTE AND CUDA_COMPUTE)
    message(FATAL_ERROR "Cannot enable both METAL_COMPUTE and CUDA_COMPUTE")
endif()

# OpenMP is available for ALL builds (including Metal) for CPU fallback paths.
# Some operators fall back to CPU even in GPU builds (e.g., double precision on Metal).
# ftCpuWithGrads.cpp and TimeSegmentation.cpp already use #pragma omp.
find_package(OpenMP REQUIRED)

if(CUDA_COMPUTE)
    enable_language(CUDA)
    find_package(CUDAToolkit REQUIRED)
    set(CMAKE_CUDA_ARCHITECTURES "75;80;86;89;90")
endif()

Note: The existing add_library(ForgeCommon SHARED ...) in forge/CMakeLists.txt must be changed to add_library(ForgeCommon ...) (without SHARED) so that BUILD_SHARED_LIBS takes effect.

6.3 Source File Selection

  • forge/CUDA/*.cu — compiled only when CUDA_COMPUTE=ON
  • forge/Metal/*.mm — compiled only when METAL_COMPUTE=ON
  • Core sources — always compiled; BackendDispatch.hpp resolves backend at compile time

7. Testing Strategy

7.1 Existing Test Inventory

| Speed | Count | Examples |
|---|---|---|
| Fast (<1s) | ~80 | forgeComplex, forgeCol, forgeMat, forgePhase2, penalty, utility, CLI, NiftiOrientation |
| Medium (1-10s) | ~25 | operator adjoint checks, density compensation, smaller reconstructions |
| Slow (>10s) | ~15 | Full ISMRMRD/Spiral3D reconstructions, benchmarks |

91% of tests use synthetic data with no external file dependencies.

7.2 Test Executables Per Build

| Build | Binary | Content |
|---|---|---|
| CPU | cpu_tests | All CPU-agnostic tests + OpenMP correctness |
| CUDA | cuda_tests | Same correctness tests via CUDA dispatch + CUDA pipeline/memory tests |
| Metal | metal_tests | Same correctness tests via Metal dispatch + Metal pipeline tests (exists today) |

Shared test source files where possible — core algorithmic tests (gridding accuracy, operator adjoint, reconstruction quality) should be identical across backends.

7.3 Development Verification Order

  1. Single-threaded CPU — all existing fast+medium tests pass (correctness baseline)
  2. OpenMP CPU — same tests pass, results match single-threaded within tolerance
  3. CUDA — same tests pass, results match CPU within floating-point tolerance

7.4 CI Tiers

| Tier | Trigger | Runner | Tests |
|---|---|---|---|
| 1 | Every PR | Linux (GitHub-hosted) | CPU build, fast+medium tests only |
| 2 | Nightly / manual | Linux (GitHub-hosted) | CPU build, full test suite including slow reconstruction tests |
| 3 | Release / manual | macOS (self-hosted or GitHub) + Linux GPU (self-hosted) | Metal + CUDA builds, full test suites |

7.5 CI Dependencies

CPU builds (Tier 1/2): Linux GitHub-hosted runners (Ubuntu) require installing Armadillo, FFTW3, HDF5, ISMRMRD, Boost, SuperLU, and OpenMP. Options:

  • Use the existing docker/forge-dev/ Dockerfile as a CI container
  • Or install via apt-get in the workflow (slower but simpler to maintain)

Recommend the Docker container approach since the Dockerfiles already exist.

CUDA builds (Tier 3): Without the NVIDIA HPC SDK dependency, the CUDA build only needs the standard CUDA Toolkit plus a system GCC/Clang. This simplifies CI significantly:

  • Base image: nvidia/cuda:12.6-devel-ubuntu22.04 + apt-get install forge dependencies
  • Or use the Jimver/cuda-toolkit GitHub Action to install CUDA on a standard runner
  • A self-hosted runner with a GPU is still needed for running CUDA tests (not just building)


8. Implementation Priority

Phase 1: Strip OpenACC, Verify Single-Threaded Correctness

  1. Audit all OpenACC pragmas across the codebase as a parallelism/data-dependency roadmap
  2. Remove OpenACC pragmas, _OPENACC/OPENACC_GPU guards, and the isCopy flag
  3. Retire fftGPU.h/.cpp (cuFFT wrappers gated behind OPENACC_GPU — functionality moves to CUDA backend in Phase 4)
  4. Retire OPENACC_GPU and OPENACC_MP CMake flags and associated compiler options
  5. Clean up ForgeIncludes.h, Gfft.h/.cpp, gridding.h, griddingSupport.h, ftCpu.h, ftCpuWithGrads.h, and test files that have _OPENACC/OPENACC_GPU guards
  6. Verify all existing fast+medium tests pass single-threaded

Phase 2: Backend Dispatch Layer

  1. Define BackendDispatch.hpp interface (GPU memory, vector algebra, pipeline ops)
  2. Wrap existing Metal dispatch (MetalVectorOps_dispatch.hpp) in unified interface
  3. Refactor operators (Gnufft, Gdft, GdftR2, Gfft, SENSE, pcSENSE, pcSenseTimeSeg) to use dispatch instead of #ifdef blocks
  4. Verify Metal backend still passes all tests

Phase 3: Add OpenMP Parallelism to CPU Path

  1. Add #pragma omp parallel for to hot loops in gridding.cpp, griddingSupport.cpp, ftCpu.cpp
  2. Add OpenMP to forgeCol::zeros()/ones(), forgeMat::zeros()/ones()
  3. Verify parallel correctness (same results as single-threaded within tolerance)
  4. Benchmark single-threaded vs OpenMP speedup

Phase 4: CUDA Backend

  1. Implement CudaContext (device management, streams, error checking)
  2. Implement CudaVectorOps (cuBLAS + custom kernels) and dispatch header
  3. Implement CudaFFT (migrate cuFFT wrappers from retired fftGPU.h/.cpp)
  4. Implement CudaGridding kernels (translate from Metal/OpenACC kernels)
  5. Implement CudaNufftPipeline (full forward/adjoint pipeline with cuFFT) — includes CUDA kernels for deapodize, zero_pad, crop_center, fftshift (analogous to Metal compute shaders)
  6. Implement CudaDFT kernels
  7. Wire forgeCol/forgeMat CUDA memory management (dual host/device pointers, putOnGPU()/sync)
  8. Enable NVTX tracing (Tracer.hpp with USE_NVTX) for the CUDA backend
  9. Test against CPU results for numerical agreement
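Step 9's cross-backend comparison needs a tolerance metric rather than exact equality, since GPU scheduling changes floating-point summation order. A relative L2-error check is one reasonable sketch (the function name and threshold are illustrative):

```cpp
#include <cmath>
#include <cstddef>

// Relative L2 error between a backend result `a` and the CPU reference `b`:
// ||a - b||_2 / ||b||_2, falling back to the absolute norm when b is zero.
// Accumulates in double even for float inputs to keep the metric stable.
double rel_l2_error(const float* a, const float* b, size_t n) {
    double diff = 0.0, norm = 0.0;
    for (size_t i = 0; i < n; ++i) {
        double d = (double)a[i] - (double)b[i];
        diff += d * d;
        norm += (double)b[i] * (double)b[i];
    }
    return norm > 0.0 ? std::sqrt(diff / norm) : std::sqrt(diff);
}
```

A typical single-precision gate would be something like rel_l2_error(cuda_out, cpu_out, n) < 1e-5, tightened or loosened per operator once real data is in hand.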

Phase 5: Build System and CI

  1. Update CMakeLists.txt (new CUDA_COMPUTE flag, CUDA language, BUILD_SHARED_LIBS, change ForgeCommon from SHARED to default)
  2. Set up CI tiers (GitHub Actions, Docker or apt for Linux dependencies)
  3. Packaging support (shared vs static builds)

9. Files Modified

New Files

  • forge/Core/BackendDispatch.hpp
  • forge/CUDA/CudaContext.h, forge/CUDA/CudaContext.cu
  • forge/CUDA/CudaNufftPipeline.h, forge/CUDA/CudaNufftPipeline.cu
  • forge/CUDA/CudaGridding.h, forge/CUDA/CudaGridding.cu
  • forge/CUDA/CudaDFT.h, forge/CUDA/CudaDFT.cu
  • forge/CUDA/CudaFFT.h, forge/CUDA/CudaFFT.cu
  • forge/CUDA/CudaVectorOps.h, forge/CUDA/CudaVectorOps.cu
  • forge/CUDA/CudaVectorOps_dispatch.hpp
  • .github/workflows/ci.yml (CI pipeline)

Modified Files

Core data structures:

  • forge/Core/forgeCol.hpp — replace OpenACC pragmas with dispatch calls (CUDA) or OpenMP (CPU); remove isCopy flag
  • forge/Core/forgeMat.hpp — same
  • forge/Core/forgeSubview_Col.hpp — remove #pragma acc directives
  • forge/Core/ForgeIncludes.h — remove #ifdef OPENACC_GPU guards
  • forge/Core/Tracer.hpp — enable USE_NVTX for CUDA_COMPUTE (not just OpenACC)

Gridding:

  • forge/Gridding/gridding.h — remove #ifdef OPENACC_GPU / #elif _OPENACC guards
  • forge/Gridding/gridding.cpp — replace #pragma acc with #pragma omp
  • forge/Gridding/griddingSupport.h — remove _OPENACC guards
  • forge/Gridding/griddingSupport.cpp — replace 21 #pragma acc directives with #pragma omp
  • forge/Gridding/TimeSegmentation.cpp — un-comment #pragma omp parallel for

FFT:

  • forge/FFT/ftCpu.h — remove _OPENACC guards
  • forge/FFT/ftCpu.cpp — replace #pragma acc kernels with #pragma omp parallel for
  • forge/FFT/ftCpuWithGrads.h — remove _OPENACC guards

Operators:

  • forge/Operators/Gnufft.h, Gnufft.cpp — use dispatch layer instead of #ifdef blocks
  • forge/Operators/Gdft.h, Gdft.cpp — same
  • forge/Operators/GdftR2.h, GdftR2.cpp — same
  • forge/Operators/Gfft.h, Gfft.cpp — same (has OPENACC_GPU/OPENACC_MP guards and cuFFT plan code)
  • forge/Operators/SENSE.h, SENSE.cpp — same
  • forge/Operators/pcSENSE.h, pcSENSE.cpp — same
  • forge/Operators/pcSenseTimeSeg.h, pcSenseTimeSeg.cpp — same

Metal dispatch:

  • forge/Metal/MetalVectorOps_dispatch.hpp — wrap in unified dispatch interface

Build system:

  • CMakeLists.txt — new flags, CUDA language, retire OpenACC, BUILD_SHARED_LIBS
  • forge/CMakeLists.txt — CUDA source file list, change SHARED to default, conditional compilation

Tests (remove _OPENACC/OPENACC_GPU guards):

  • forge/Tests/forgeColTests.cpp
  • forge/Tests/forgeColCplxTests.cpp
  • forge/Tests/forgeMatTests.cpp
  • forge/Tests/MetalVectorOpsTests.cpp

Other:

  • forge/Microbenchmarks/forgeCol/main.cpp — remove _OPENACC guards

Retired

  • All #pragma acc directives across the codebase
  • OPENACC_GPU and OPENACC_MP CMake options
  • OpenACC-specific compiler flags (-cuda -acc -gpu=cuda12.6,cc60-cc89, -Mnollvm -O1 -acc -ta=multicore)
  • NVIDIA HPC SDK (nvc++) dependency — no longer needed; CUDA backend uses standard nvcc + GCC/Clang host compiler
  • forge/FFT/fftGPU.h, forge/FFT/fftGPU.cpp — cuFFT wrappers gated behind OPENACC_GPU; functionality migrated to forge/CUDA/CudaFFT.h/.cu
  • forge/my_sincosf.cuh — legacy CUDA header; evaluate for inclusion in forge/CUDA/ or retire
  • isCopy flag in forgeCol (OpenACC bookkeeping only)

Unchanged

  • forge/Core/AccelerateDispatch.hpp — orthogonal to backend dispatch, handles CPU-side vDSP/BLAS on macOS

10. Risks and Mitigations

| Risk | Mitigation |
|---|---|
| OpenMP atomic scatter in adjoint gridding may have different performance characteristics than OpenACC | Benchmark early; consider alternative accumulation strategies (thread-local accumulators + reduction) |
| cuBLAS complex type differences (cuFloatComplex vs forgeComplex<float>) | forgeComplex<T> is layout-compatible with cuFloatComplex (two contiguous floats); verify with static_assert |
| CUDA kernel launch parameters (block/grid size) need tuning per GPU | Start with reasonable defaults (256 threads/block); defer tuning to a later optimization pass |
| Apple Clang OpenMP requires brew install libomp | Document in README; CMake will fail clearly at find_package(OpenMP REQUIRED) |
| Dispatch layer abstraction may not cover all edge cases | Keep it thin — only abstract what's needed, allow backend-specific code in backend directories |
| CudaNufftPipeline needs CUDA kernels for deapodize, zero_pad, crop_center, fftshift (pipeline stages not covered by cuFFT or cuBLAS) | These are simple element-wise or copy kernels; include them in CudaNufftPipeline.cu alongside the pipeline orchestration |
| NVTX tracing currently tied to OpenACC via USE_NVTX | Move USE_NVTX define to trigger on CUDA_COMPUTE instead of OPENACC_GPU; NVTX is a standalone CUDA library |
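The first mitigation (thread-local accumulators + reduction) replaces per-write atomics with one private grid per thread and a final merge, trading memory (one grid copy per thread) for contention-free scatter. A sketch of the pattern with illustrative names:

```cpp
#include <cstddef>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Scatter with per-thread private grids, then reduce. Avoids `omp atomic`
// in the hot loop at the cost of n_threads copies of the grid.
void scatter_reduced(size_t n_samples, const int* cell, const float* val,
                     float* grid, size_t grid_size) {
#ifdef _OPENMP
    int n_threads = omp_get_max_threads();
#else
    int n_threads = 1;
#endif
    std::vector<float> local((size_t)n_threads * grid_size, 0.0f);
    #pragma omp parallel
    {
#ifdef _OPENMP
        // Each thread scatters into its own private slice of `local`.
        float* mine = local.data() + (size_t)omp_get_thread_num() * grid_size;
#else
        float* mine = local.data();
#endif
        #pragma omp for
        for (long long s = 0; s < (long long)n_samples; ++s)
            mine[cell[s]] += val[s];
    }
    // Reduction: sum the private grids into the shared one.
    for (int t = 0; t < n_threads; ++t)
        for (size_t g = 0; g < grid_size; ++g)
            grid[g] += local[(size_t)t * grid_size + g];
}
```

Whether this beats plain atomics depends on grid size versus sample count, which is why the mitigation says to benchmark early rather than commit to either strategy up front.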