Multi-Backend Acceleration Design Spec¶
Date: 2026-03-17 (updated 2026-03-18)
Status: Phases 1-3 complete (PR #14), Phases 4-5 in progress
Scope: Revamp forge to support three backends — Metal (macOS), CUDA (Linux/NVIDIA), CPU+OpenMP (everywhere) — with clean compile-time dispatch.
1. Problem Statement¶
forge currently has three backend paths (Metal, CUDA via OpenACC, CPU), but only the Metal path is actively tested and maintained. The OpenACC-based CUDA path is vendor-locked to NVIDIA's nvc++ compiler, and OpenACC adoption is declining (spec stagnated at v3.3, November 2022). The CPU path falls through when OpenACC pragmas are ignored by non-PGI compilers but has no explicit parallel threading.
Goals¶
- CPU+OpenMP backend: Portable CPU parallelism via OpenMP `parallel for`, working on GCC, Clang, Apple Clang (with `brew install libomp`), and MSVC (OpenMP 2.0).
- CUDA backend: Dedicated CUDA implementation mirroring Metal's pipeline architecture — RAII C++ classes, pre-allocated GPU buffers, cuBLAS for standard linear algebra, custom kernels where needed. No OpenACC dependency.
- Metal backend: Unchanged — already working and tested.
- Clean dispatch layer: Compile-time backend resolution via a unified dispatch header, replacing scattered `#ifdef` blocks in operators.
- Retire OpenACC: Remove all `#pragma acc` usage and the `OPENACC_GPU`/`OPENACC_MP` CMake flags.
- Drop NVIDIA HPC SDK dependency: Without OpenACC, the CUDA backend only needs `nvcc` (from the standard CUDA Toolkit) plus any C++17 host compiler (GCC, Clang). No more PGI/NVHPC toolchain required. This dramatically simplifies the build — standard `nvidia/cuda` Docker images work, and CMake's `enable_language(CUDA)` handles `nvcc` natively.
Non-Goals¶
- AMD GPU / ROCm support (no hardware available)
- Runtime backend selection (compile-time is sufficient)
- OpenMP target offload for GPU (less mature than dedicated CUDA)
2. Architecture Overview¶
CMake Configure Time
├── -DMETAL_COMPUTE=ON → Metal backend (macOS Apple Silicon)
├── -DCUDA_COMPUTE=ON → CUDA backend (NVIDIA GPUs, sm_75–sm_120)
└── (neither) → CPU backend (OpenMP threading, FFTW)
Each produces:
build_{backend}/
├── libForgeCommon.{so,dylib,a} (shared or static via BUILD_SHARED_LIBS)
├── forgeSense
├── forgePcSense
├── forgePcSenseTimeSeg
└── {backend}_tests
Metal and CUDA are mutually exclusive at configure time (enforced by CMake FATAL_ERROR). In practice, Metal is macOS-only and CUDA is Linux-only, so this never conflicts.
Backend Selection in forge-studio¶
forge-studio selects the backend by choosing the executable path:
- macOS: forge/bin/metal/ or forge/bin/cpu/
- Linux: forge/bin/cuda/ or forge/bin/cpu/
Each path contains self-contained executables (static linking recommended for forge-studio distribution).
3. Backend Dispatch Layer¶
File: forge/Core/BackendDispatch.hpp
A thin compile-time dispatch header that maps operations to the active backend. This replaces the current pattern of #ifdef METAL_COMPUTE / #ifdef OPENACC_GPU blocks scattered throughout operators and data structures.
3.1 Dispatch Categories¶
GPU Memory Management:
namespace forge::backend {
template<typename T> void gpu_malloc(T** ptr, size_t n_elem);
template<typename T> void gpu_free(T* ptr);
template<typename T> void gpu_copy_to_device(T* dst, const T* src, size_t n_elem);
template<typename T> void gpu_copy_to_host(T* dst, const T* src, size_t n_elem);
template<typename T> bool has_gpu(); // constexpr, returns false for CPU backend
}
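For the CPU backend, these entry points can degenerate to plain host allocation and copies; a minimal sketch with the signatures above but illustrative bodies (not the actual forge implementation):

```cpp
#include <cstddef>
#include <cstring>

namespace forge::backend {

// CPU backend: "device" memory is ordinary host memory, so the
// dispatch functions collapse to host allocation and memcpy.
template <typename T> void gpu_malloc(T** ptr, std::size_t n_elem) {
    *ptr = new T[n_elem];
}
template <typename T> void gpu_free(T* ptr) { delete[] ptr; }

template <typename T>
void gpu_copy_to_device(T* dst, const T* src, std::size_t n_elem) {
    std::memcpy(dst, src, n_elem * sizeof(T));  // trivially copyable T assumed
}
template <typename T>
void gpu_copy_to_host(T* dst, const T* src, std::size_t n_elem) {
    std::memcpy(dst, src, n_elem * sizeof(T));
}

// constexpr so callers can branch at compile time.
template <typename T> constexpr bool has_gpu() { return false; }

} // namespace forge::backend
```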
Vector Algebra:
namespace forge::backend {
template<typename T> void axpy(size_t n, T alpha, const T* x, T* y);
template<typename T> T dot(size_t n, const T* x, const T* y);
template<typename T> void scale(size_t n, T alpha, T* x);
template<typename T> void elem_multiply(size_t n, const T* a, const T* b, T* out);
// ... additional ops as needed
}
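A CPU/OpenMP sketch of two of these ops — illustrative bodies only; without `-fopenmp` the pragmas are ignored and the loops run serially with identical results:

```cpp
#include <cstddef>

namespace forge::backend {

// y := alpha*x + y, parallelized across elements.
template <typename T> void axpy(std::size_t n, T alpha, const T* x, T* y) {
    #pragma omp parallel for
    for (std::ptrdiff_t i = 0; i < static_cast<std::ptrdiff_t>(n); ++i)
        y[i] += alpha * x[i];
}

// Dot product via an OpenMP reduction (deterministic when serial).
template <typename T> T dot(std::size_t n, const T* x, const T* y) {
    T sum{};
    #pragma omp parallel for reduction(+ : sum)
    for (std::ptrdiff_t i = 0; i < static_cast<std::ptrdiff_t>(n); ++i)
        sum += x[i] * y[i];
    return sum;
}

} // namespace forge::backend
```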
Pipeline Operations:
namespace forge::backend {
// Forward/adjoint for NUFFT, DFT, gridding
// Wraps Metal or CUDA pipeline calls
// CPU backend: falls through to existing Armadillo implementations
}
3.2 Resolution Mechanism¶
#if defined(CUDA_COMPUTE)
#include "CUDA/CudaVectorOps_dispatch.hpp"
// gpu_malloc → cudaMalloc, axpy → cublasSaxpy, etc.
#elif defined(METAL_COMPUTE)
#include "Metal/MetalVectorOps_dispatch.hpp"
// existing Metal dispatch, wrapped in unified interface
#else
// CPU: no GPU memory ops, vector algebra via Armadillo/OpenMP
#endif
3.3 AccelerateDispatch and Backend Dispatch Are Orthogonal¶
AccelerateDispatch.hpp handles CPU-side SIMD (vDSP/BLAS) on macOS. It is orthogonal to the GPU backend dispatch layer — Accelerate optimizes host-side vector operations regardless of whether Metal, CUDA, or CPU is the active GPU backend. It remains unchanged and is not part of BackendDispatch.hpp.
3.4 Impact on Existing Code¶
Today in Gnufft.cpp:
#ifdef METAL_COMPUTE
if (pipelineCtx != nullptr) {
metal_nufft_forward(pipelineCtx, ...);
} else if (metalCtx != nullptr) { ... }
#endif
computeFd_CPU_Grid<T1>(...);
After:
if (pipelineCtx != nullptr) {
forge::backend::nufft_forward(pipelineCtx, ...);
} else {
computeFd_CPU_Grid<T1>(...); // CPU/OpenMP path
}
The pipelineCtx type resolves to CudaNufftPipeline* or MetalNufftPipelineContext* based on the backend.
All operators with backend-specific code are refactored: Gnufft, Gdft, GdftR2, Gfft, SENSE, pcSENSE, pcSenseTimeSeg.
4. CPU+OpenMP Backend¶
4.1 What Changes¶
| Component | Current (OpenACC) | New (OpenMP) |
|---|---|---|
| `gridding.cpp` | `#pragma acc parallel loop` | `#pragma omp parallel for` with schedule and reduction clauses |
| `gridding.cpp` atomics | `#pragma acc atomic` | `#pragma omp atomic` |
| `griddingSupport.cpp` | 21 `#pragma acc` directives (density compensation, deapodization) | `#pragma omp parallel for` |
| `forgeCol.hpp` | `#pragma acc enter/exit data` | Removed — no GPU memory management for CPU backend |
| `forgeMat.hpp` | `#pragma acc enter/exit data` | Removed — no GPU memory management for CPU backend |
| `forgeCol.hpp` `zeros()`/`ones()` | `#pragma acc parallel loop` | `#pragma omp parallel for` |
| `forgeMat.hpp` `zeros()`/`ones()` | `#pragma acc parallel loop` | `#pragma omp parallel for` |
| `ftCpu.cpp` | `#pragma acc kernels` | `#pragma omp parallel for` |
| `ftCpuWithGrads.cpp` | Already uses `#pragma omp` | Unchanged |
| `Gfft.cpp` | `#pragma acc` + cuFFT plan code | OpenMP path / dispatch layer |
| FFT | `fftCPU` (FFTW) | Unchanged |
| Operators | Armadillo-based fallthrough | Unchanged |
4.2 Development Order¶
- Strip OpenACC, verify single-threaded correctness: Use OpenACC pragmas as a roadmap — read each pragma to understand parallelism/data dependencies, then remove. Run existing test suite to confirm correctness.
- Add OpenMP parallelism: Add `#pragma omp parallel for` to hot loops identified from the OpenACC roadmap (gridding, DFT). Verify correctness and measure speedup.
- Apple Clang note: Requires `brew install libomp` and appropriate `-Xclang -fopenmp` flags. CMake's `find_package(OpenMP)` handles this when libomp is installed.
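A representative before/after translation (the loop body is an illustrative stand-in, not copied from `griddingSupport.cpp`):

```cpp
#include <cmath>
#include <cstddef>

// Before (OpenACC, removed in Phase 1):
//   #pragma acc parallel loop reduction(+:acc)
// After (OpenMP): same loop, explicit schedule and reduction clauses.
// The body is a density-compensation-style accumulation stand-in.
double sum_of_weights(const double* w, std::size_t n) {
    double acc = 0.0;
    #pragma omp parallel for schedule(static) reduction(+ : acc)
    for (std::ptrdiff_t i = 0; i < static_cast<std::ptrdiff_t>(n); ++i)
        acc += std::abs(w[i]);
    return acc;
}
```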
4.3 forgeCol/forgeMat on CPU¶
- `isOnGPU` is always false
- All operators dispatch directly to Armadillo (existing behavior)
- No GPU memory allocation or transfer code compiled
- `getArma()` returns a zero-copy view without sync (no device to sync from)
- The `isCopy` flag (used only for OpenACC bookkeeping) becomes dead code — clean it up in this phase
5. CUDA Backend¶
5.1 New Directory: forge/CUDA/¶
| File | Purpose |
|---|---|
| `CudaNufftPipeline.h/.cu` | Full GPU NUFFT pipeline — gridding + cuFFT + deapodize. RAII C++ class with pre-allocated buffers and cached cuFFT plans. |
| `CudaGridding.h/.cu` | Gridding-only CUDA kernels (adjoint/forward, 2D/3D) |
| `CudaDFT.h/.cu` | Field-corrected DFT CUDA kernels |
| `CudaVectorOps.h/.cu` | Element-wise ops not covered by cuBLAS (complex multiply, magnitude, phase) |
| `CudaVectorOps_dispatch.hpp` | Dispatch header mapping forgeCol/forgeMat operators to cuBLAS or custom kernels |
| `CudaContext.h/.cu` | Device management, stream creation, error checking utilities |
| `CudaFFT.h/.cu` | cuFFT wrapper (migrated from existing `fftGPU.h/.cpp`, adapted for CUDA backend) |
5.2 Pipeline Architecture¶
class CudaNufftPipeline {
public:
// Constructor: cudaMalloc all buffers, create cuFFT plans, upload LUT + trajectory
CudaNufftPipeline(uword imgDim, uword nShots, const T* kx, const T* ky, const T* kz,
const T* LUT, uword sizeLUT, T gridOS, uword nInterp);
// One cudaMemcpy in → all kernels on-device → one cudaMemcpy out
void forward(const T* imageIn, T* samplesOut);
void adjoint(const T* samplesIn, T* imageOut);
// Destructor: cudaFree all buffers, destroy cuFFT plans
~CudaNufftPipeline();
private:
// Pre-allocated GPU buffers
T* d_lut;
T* d_kx, *d_ky, *d_kz;
T* d_image, *d_grid, *d_samples;
cufftHandle fftPlan;
// ...
};
Pipeline stages (matching Metal):
- Forward: deapodize → zero_pad → fftshift → cuFFT → ifftshift → gridding
- Adjoint: gridding → ifftshift → cuIFFT → fftshift → crop_center → deapodize
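The shift and crop stages are pure index remaps; a host-side sketch of 1D `fftshift` shows the mapping a per-thread CUDA kernel would compute (illustrative, not the actual kernel):

```cpp
#include <cstddef>
#include <vector>

// 1D fftshift: rotate by n/2 so the zero-frequency bin moves to the
// center. The CUDA version launches one thread per output element and
// evaluates the same source-index formula.
template <typename T>
std::vector<T> fftshift(const std::vector<T>& in) {
    const std::size_t n = in.size(), half = n / 2;
    std::vector<T> out(n);
    for (std::size_t i = 0; i < n; ++i)
        out[i] = in[(i + half) % n];
    return out;
}
```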
5.3 forgeCol/forgeMat CUDA Memory¶
forgeCol::operator+() → is data on CUDA GPU?
├── yes → CudaVectorOps dispatch
│ ├── standard BLAS op? → cuBLAS (axpy, dot, scal, gemm)
│ └── non-standard op? → custom CUDA kernel
├── Metal? → MetalVectorOps dispatch (existing)
└── CPU? → Armadillo (existing)
Memory lifecycle:
1. putOnGPU() → cudaMalloc device buffer + cudaMemcpy host→device, isOnGPU = true
2. Operators dispatch to cuBLAS / custom kernels (data stays on device)
3. Non-const getArma() → cudaMemcpy device→host when host access needed
4. Destructor → cudaFree
CUDA memory model details:
- forgeCol maintains separate host and device pointers (unlike Metal's unified memory). Host pointer uses standard new T[]; device pointer via cudaMalloc.
- Consider cudaMallocHost (pinned host memory) for improved transfer bandwidth on discrete GPUs. This can be a later optimization — standard host memory works correctly, pinned memory just transfers faster.
- forgeComplex<T> is layout-compatible with cuFloatComplex/cuDoubleComplex (two contiguous values). Verify at compile time:
static_assert(sizeof(forgeComplex<float>) == sizeof(cuFloatComplex));
static_assert(alignof(forgeComplex<float>) == alignof(cuFloatComplex));
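A runnable approximation of these checks, using a stand-in `forgeComplex` definition and `std::complex<float>` as a host-side proxy for `cuFloatComplex` (which is unavailable off-device); the real checks should compare against the CUDA types directly:

```cpp
#include <complex>
#include <type_traits>

// Stand-in for forgeComplex<T> (assumed layout: two contiguous values).
template <typename T> struct forgeComplex { T re, im; };

// cuFloatComplex is float2 (two contiguous floats); std::complex<float>
// has the same size on mainstream ABIs, so it serves as a host-side
// proxy for the compile-time layout checks described above.
static_assert(sizeof(forgeComplex<float>) == sizeof(std::complex<float>),
              "size mismatch");
static_assert(alignof(forgeComplex<float>) <= alignof(std::complex<float>),
              "over-aligned");
static_assert(std::is_trivially_copyable<forgeComplex<float>>::value,
              "must be memcpy-able for cudaMemcpy");
```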
5.4 Target Architectures¶
CUDA compute capabilities sm_75 through sm_120:
| CC | Architecture | Example GPUs |
|---|---|---|
| 7.5 | Turing | RTX 2080, T4 |
| 8.0 | Ampere | A100 |
| 8.6 | Ampere | RTX 3080 |
| 8.9 | Ada Lovelace | RTX 4090, L40 |
| 9.0 | Hopper | H100 |
| 12.0 | Blackwell | B200 |
| 12.1 | Blackwell | GB10 (DGX Spark) |
CUDA Toolkit 13.0 — supports Turing through Blackwell. Pre-Turing (< sm_75) was dropped in CUDA 13.0.
5.5 Primary Development Hardware¶
DGX Spark (GB10):
- CPU: NVIDIA Grace (ARM64/aarch64)
- GPU: Blackwell GB10 (compute capability 12.1)
- Memory: 128GB unified (Grace-Blackwell)
- CUDA: 13.0, GCC 13.3, CMake 3.28
- OS: Ubuntu 24.04 (aarch64)
- Access: ssh clj@100.109.49.10
Secondary test hardware:
- RTX 2080 (compute capability 7.5, Turing) — discrete GPU with separate host/device memory
5.6 CUDA Memory Model¶
Explicit cudaMalloc + cudaMemcpy for all GPU memory operations. This works on both:
- Discrete GPUs (RTX 2080) — standard PCIe transfers
- Grace-Blackwell unified memory — the explicit API still works, driver optimizes internally
Unified memory (cudaMallocManaged) is NOT used as the default, to ensure predictable performance on discrete GPUs. Can be added as an optimization later if needed.
6. CMake Build System Changes¶
6.1 New/Changed Flags¶
| Flag | Status | Effect |
|---|---|---|
| `-DMETAL_COMPUTE=ON` | Unchanged | Metal backend |
| `-DCUDA_COMPUTE=ON` | New | CUDA backend |
| `-DBUILD_SHARED_LIBS=ON/OFF` | New | Shared (system install) vs static (forge-studio) |
| `-DOPENACC_GPU=ON` | Retired | Removed |
| `-DOPENACC_MP=ON` | Retired | Removed |
6.2 Key CMake Logic¶
option(METAL_COMPUTE "Enable Apple Metal GPU backend" OFF)
option(CUDA_COMPUTE "Enable NVIDIA CUDA GPU backend" OFF)
option(BUILD_SHARED_LIBS "Build shared or static libraries" ON)
# Mutual exclusion
if(METAL_COMPUTE AND CUDA_COMPUTE)
message(FATAL_ERROR "Cannot enable both METAL_COMPUTE and CUDA_COMPUTE")
endif()
# OpenMP is available for ALL builds (including Metal) for CPU fallback paths.
# Some operators fall back to CPU even in GPU builds (e.g., double precision on Metal).
# ftCpuWithGrads.cpp and TimeSegmentation.cpp already use #pragma omp.
find_package(OpenMP REQUIRED)
if(CUDA_COMPUTE)
enable_language(CUDA)
find_package(CUDAToolkit REQUIRED)
set(CMAKE_CUDA_ARCHITECTURES "75;80;86;89;90;120")
endif()
Note: The existing add_library(ForgeCommon SHARED ...) in forge/CMakeLists.txt must be changed to add_library(ForgeCommon ...) (without SHARED) so that BUILD_SHARED_LIBS takes effect.
6.3 Source File Selection¶
- `forge/CUDA/*.cu` — compiled only when `CUDA_COMPUTE=ON`
- `forge/Metal/*.mm` — compiled only when `METAL_COMPUTE=ON`
- Core sources — always compiled; `BackendDispatch.hpp` resolves backend at compile time
7. Testing Strategy¶
7.1 Existing Test Inventory¶
| Speed | Count | Examples |
|---|---|---|
| Fast (<1s) | ~80 | forgeComplex, forgeCol, forgeMat, forgePhase2, penalty, utility, CLI, NiftiOrientation |
| Medium (1-10s) | ~25 | operator adjoint checks, density compensation, smaller reconstructions |
| Slow (>10s) | ~15 | Full ISMRMRD/Spiral3D reconstructions, benchmarks |
91% of tests use synthetic data with no external file dependencies.
7.2 Test Executables Per Build¶
| Build | Binary | Content |
|---|---|---|
| CPU | `cpu_tests` | All CPU-agnostic tests + OpenMP correctness |
| CUDA | `cuda_tests` | Same correctness tests via CUDA dispatch + CUDA pipeline/memory tests |
| Metal | `metal_tests` | Same correctness tests via Metal dispatch + Metal pipeline tests (exists today) |
Shared test source files where possible — core algorithmic tests (gridding accuracy, operator adjoint, reconstruction quality) should be identical across backends.
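The operator adjoint checks reduce to the identity ⟨Ax, y⟩ = ⟨x, Aᵀy⟩; a backend-neutral sketch with a toy dense operator (the actual tests exercise each operator's forward/adjoint pair through the dispatch layer):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

double dot(const std::vector<double>& a, const std::vector<double>& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// Adjoint identity: <A x, y> == <x, A^T y> within floating-point
// tolerance. A is a toy dense matrix standing in for an operator's
// forward/adjoint; the same harness runs unchanged on every backend.
bool adjoint_ok(const std::vector<std::vector<double>>& A,
                const std::vector<double>& x, const std::vector<double>& y,
                double tol = 1e-12) {
    std::vector<double> Ax(A.size(), 0.0), Aty(x.size(), 0.0);
    for (std::size_t r = 0; r < A.size(); ++r)
        for (std::size_t c = 0; c < x.size(); ++c) {
            Ax[r]  += A[r][c] * x[c];   // forward
            Aty[c] += A[r][c] * y[r];   // adjoint (transpose for real A)
        }
    return std::abs(dot(Ax, y) - dot(x, Aty)) <= tol;
}
```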
7.3 Development Verification Order¶
- Single-threaded CPU — all existing fast+medium tests pass (correctness baseline)
- OpenMP CPU — same tests pass, results match single-threaded within tolerance
- CUDA — same tests pass, results match CPU within floating-point tolerance
7.4 CI Tiers¶
| Tier | Trigger | Runner | Platform | Tests | Est. min/run |
|---|---|---|---|---|---|
| 1 | Every PR | GitHub-hosted Ubuntu | x86_64 Linux | CPU build, fast+medium tests | ~5 min |
| 2 | Weekly / manual | GitHub-hosted Ubuntu | x86_64 Linux | CPU build, full test suite | ~14 min |
| 3 | Release / manual | DGX Spark (self-hosted) | aarch64 Linux + Blackwell GPU | CUDA build + tests, ARM64 CPU build + tests | Free |
| 3 | Release / manual | macOS (self-hosted or GitHub) | macOS ARM64 | Metal build + tests | Free (self-hosted) |
Budget: ~106 GitHub-hosted minutes/month (assuming ~10 PRs + weekly Tier 2). Tier 3 is free (self-hosted hardware).
7.5 CI Dependencies¶
Tier 1/2 — GitHub-hosted x86_64 Ubuntu runners:
Install via apt-get in the workflow (no Docker — simpler to maintain):
libarmadillo-dev libfftw3-dev libhdf5-dev libboost-program-options-dev
libboost-serialization-dev libsuperlu-dev libomp-dev
Tier 3 — DGX Spark (self-hosted aarch64):
Dependencies installed directly via apt-get on the machine (one-time setup):
sudo apt-get install -y libarmadillo-dev libfftw3-dev libhdf5-dev \
libboost-program-options-dev libboost-serialization-dev libsuperlu-dev
# ISMRMRD: build from source if not available via apt on aarch64
GitHub Actions runner (`actions-runner` for linux-arm64) registered with labels: `self-hosted`, `linux`, `arm64`, `cuda`.
Tier 3 — macOS (self-hosted): Dependencies via Homebrew (already available on dev machines).
7.6 DGX Spark Access¶
- Host: `ssh clj@100.109.49.10` (Tailscale network)
- GPU: NVIDIA GB10 (Blackwell, compute capability 12.1)
- CPU: NVIDIA Grace (ARM64), 128GB RAM
- CUDA: 13.0, GCC 13.3, CMake 3.28
- OS: Ubuntu 24.04 (aarch64)
8. Implementation Priority¶
Phase 1: Strip OpenACC, Verify Single-Threaded Correctness — COMPLETE (PR #14)¶
- ~~Audit all OpenACC pragmas across the codebase as a parallelism/data-dependency roadmap~~
- ~~Remove OpenACC pragmas, `_OPENACC`/`OPENACC_GPU` guards, and the `isCopy` flag~~
- ~~Retire `fftGPU.h/.cpp`, `my_sincosf.cuh`~~
- ~~Retire `OPENACC_GPU` and `OPENACC_MP` CMake flags~~
- ~~Verify all existing tests pass single-threaded~~
Phase 2: Backend Dispatch Layer — COMPLETE (PR #14)¶
- ~~Define `BackendDispatch.hpp` interface~~
- ~~Operators evaluated — clean 2-way `#ifdef METAL_COMPUTE`/`#else` is sufficient~~
Phase 3: Add OpenMP Parallelism to CPU Path — COMPLETE (PR #14)¶
- ~~Add `#pragma omp parallel for` to gridding, griddingSupport, ftCpu, ftCpuWithGrads, forgeCol/forgeMat~~
- ~~Fix OpenMP data races (shared variable scoping in griddingSupport)~~
- ~~Add Gnufft copy constructor for thread-safe parallel SENSE coil processing~~
- ~~Add FFTW planner mutex for thread safety~~
- ~~Benchmarked: 3.1x speedup (SENSE 256x256, 4 threads), 5.4x (pcSENSE 256x256, 8 threads)~~
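The FFTW planner mutex can be sketched as follows — `make_plan` is a hypothetical stand-in for the real `fftw_plan_dft_*` call; only plan creation/destruction needs serializing, since `fftw_execute` is itself thread-safe:

```cpp
#include <mutex>

// FFTW's planner API is not thread-safe, so one process-wide mutex
// serializes all plan creation and destruction across threads.
static std::mutex g_fftw_planner_mutex;
static int g_plans_created = 0;

// Stand-in for fftw_plan_dft_1d(...) — returns a fake plan handle.
int make_plan(/* dims, direction, flags ... */) {
    return ++g_plans_created;
}

// All call sites go through this wrapper instead of calling the
// planner directly.
int create_plan_locked() {
    std::lock_guard<std::mutex> lock(g_fftw_planner_mutex);
    return make_plan();
}
```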
Phase 5: CI Pipeline (do before Phase 4 — validates Phase 1-3 on Linux)¶
- Set up DGX Spark: install dependencies via `apt-get`, verify CPU-only build on ARM64
- Create `.github/workflows/ci.yml` with Tier 1 (every PR, x86_64 GitHub-hosted, CPU build + fast tests)
- Add Tier 2 (weekly, x86_64, full test suite)
- Set up DGX Spark as self-hosted GitHub Actions runner (linux-arm64 + cuda labels)
- Add Tier 3 workflow for DGX Spark (CUDA build + tests, ARM64 CPU build + tests)
- Add Tier 3 workflow for macOS (Metal build + tests, self-hosted or GitHub runner)
Phase 4: CUDA Backend¶
- Implement `CudaContext` (device management, streams, error checking)
- Implement `CudaVectorOps` (cuBLAS + custom kernels) and dispatch header
- Implement `CudaFFT` (cuFFT wrapper)
- Implement `CudaGridding` kernels (translate from Metal kernels + CPU gridding code)
- Implement `CudaNufftPipeline` (full forward/adjoint pipeline with cuFFT) — includes CUDA kernels for deapodize, zero_pad, crop_center, fftshift
- Implement `CudaDFT` kernels
- Wire `forgeCol`/`forgeMat` CUDA memory management (dual host/device pointers, `putOnGPU()`/sync via `cudaMalloc`/`cudaMemcpy`)
- Enable NVTX tracing (`Tracer.hpp` with `USE_NVTX`) for the CUDA backend
- Test on DGX Spark (Blackwell) and RTX 2080 (Turing) against CPU results for numerical agreement
9. Files Modified¶
New Files¶
- `forge/Core/BackendDispatch.hpp`
- `forge/CUDA/CudaContext.h`, `forge/CUDA/CudaContext.cu`
- `forge/CUDA/CudaNufftPipeline.h`, `forge/CUDA/CudaNufftPipeline.cu`
- `forge/CUDA/CudaGridding.h`, `forge/CUDA/CudaGridding.cu`
- `forge/CUDA/CudaDFT.h`, `forge/CUDA/CudaDFT.cu`
- `forge/CUDA/CudaFFT.h`, `forge/CUDA/CudaFFT.cu`
- `forge/CUDA/CudaVectorOps.h`, `forge/CUDA/CudaVectorOps.cu`
- `forge/CUDA/CudaVectorOps_dispatch.hpp`
- `.github/workflows/ci.yml` (CI pipeline)
Modified Files¶
Core data structures:
- forge/Core/forgeCol.hpp — replace OpenACC pragmas with dispatch calls (CUDA) or OpenMP (CPU); remove isCopy flag
- forge/Core/forgeMat.hpp — same
- forge/Core/forgeSubview_Col.hpp — remove #pragma acc directives
- forge/Core/ForgeIncludes.h — remove #ifdef OPENACC_GPU guards
- forge/Core/Tracer.hpp — enable USE_NVTX for CUDA_COMPUTE (not just OpenACC)
Gridding:
- forge/Gridding/gridding.h — remove #ifdef OPENACC_GPU / #elif _OPENACC guards
- forge/Gridding/gridding.cpp — replace #pragma acc with #pragma omp
- forge/Gridding/griddingSupport.h — remove _OPENACC guards
- forge/Gridding/griddingSupport.cpp — replace 21 #pragma acc directives with #pragma omp
- forge/Gridding/TimeSegmentation.cpp — un-comment #pragma omp parallel for
FFT:
- forge/FFT/ftCpu.h — remove _OPENACC guards
- forge/FFT/ftCpu.cpp — replace #pragma acc kernels with #pragma omp parallel for
- forge/FFT/ftCpuWithGrads.h — remove _OPENACC guards
Operators:
- forge/Operators/Gnufft.h, Gnufft.cpp — use dispatch layer instead of #ifdef blocks
- forge/Operators/Gdft.h, Gdft.cpp — same
- forge/Operators/GdftR2.h, GdftR2.cpp — same
- forge/Operators/Gfft.h, Gfft.cpp — same (has OPENACC_GPU/OPENACC_MP guards and cuFFT plan code)
- forge/Operators/SENSE.h, SENSE.cpp — same
- forge/Operators/pcSENSE.h, pcSENSE.cpp — same
- forge/Operators/pcSenseTimeSeg.h, pcSenseTimeSeg.cpp — same
Metal dispatch:
- forge/Metal/MetalVectorOps_dispatch.hpp — wrap in unified dispatch interface
Build system:
- CMakeLists.txt — new flags, CUDA language, retire OpenACC, BUILD_SHARED_LIBS
- forge/CMakeLists.txt — CUDA source file list, change SHARED to default, conditional compilation
Tests (remove _OPENACC/OPENACC_GPU guards):
- forge/Tests/forgeColTests.cpp
- forge/Tests/forgeColCplxTests.cpp
- forge/Tests/forgeMatTests.cpp
- forge/Tests/MetalVectorOpsTests.cpp
Other:
- forge/Microbenchmarks/forgeCol/main.cpp — remove _OPENACC guards
Retired¶
- All `#pragma acc` directives across the codebase
- `OPENACC_GPU` and `OPENACC_MP` CMake options
- OpenACC-specific compiler flags (`-cuda -acc -gpu=cuda12.6,cc60-cc89`, `-Mnollvm -O1 -acc -ta=multicore`)
- NVIDIA HPC SDK (nvc++) dependency — no longer needed; CUDA backend uses standard `nvcc` + GCC/Clang host compiler
- `forge/FFT/fftGPU.h`, `forge/FFT/fftGPU.cpp` — cuFFT wrappers gated behind `OPENACC_GPU`; functionality migrated to `forge/CUDA/CudaFFT.h/.cu`
- `forge/my_sincosf.cuh` — legacy CUDA header; evaluate for inclusion in `forge/CUDA/` or retire
- `isCopy` flag in `forgeCol` (OpenACC bookkeeping only)
Unchanged¶
- `forge/Core/AccelerateDispatch.hpp` — orthogonal to backend dispatch, handles CPU-side vDSP/BLAS on macOS
10. Risks and Mitigations¶
| Risk | Mitigation |
|---|---|
| OpenMP atomic scatter in adjoint gridding may have different performance characteristics than OpenACC | Benchmark early; consider alternative accumulation strategies (thread-local accumulators + reduction) |
| cuBLAS complex type differences (`cuFloatComplex` vs `forgeComplex<float>`) | `forgeComplex<T>` is layout-compatible with `cuFloatComplex` (two contiguous floats); verify with `static_assert` |
| CUDA kernel launch parameters (block/grid size) need tuning per GPU | Start with reasonable defaults (256 threads/block); treat tuning as a later optimization |
| Apple Clang OpenMP requires `brew install libomp` | Document in README; CMake will fail clearly at `find_package(OpenMP REQUIRED)` |
| Dispatch layer abstraction may not cover all edge cases | Keep it thin — only abstract what's needed, allow backend-specific code in backend directories |
| `CudaNufftPipeline` needs CUDA kernels for deapodize, zero_pad, crop_center, fftshift (pipeline stages not covered by cuFFT or cuBLAS) | These are simple element-wise or copy kernels; include them in `CudaNufftPipeline.cu` alongside the pipeline orchestration |
| NVTX tracing currently tied to OpenACC via `USE_NVTX` | Move the `USE_NVTX` define to trigger on `CUDA_COMPUTE` instead of `OPENACC_GPU`; NVTX is a standalone CUDA library |
| ARM64 Linux (DGX Spark) may have different library availability than x86_64 | ISMRMRD may need building from source on aarch64; verify all deps available via apt-get |
| CUDA 13.0 vs 12.6 API differences | Use CUDA 13.0 as the baseline; sm_120 (Blackwell) requires CUDA 13.0+. Test on both Blackwell (DGX Spark) and Turing (RTX 2080) |
| Grace-Blackwell unified memory may mask transfer bugs | Use explicit `cudaMalloc`/`cudaMemcpy` (not `cudaMallocManaged`) to ensure code works on discrete GPUs with separate memory |
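The thread-local-accumulator strategy from the first risk row can be sketched as a 1D scatter-add (illustrative; real adjoint gridding scatters kernel-weighted samples onto the grid):

```cpp
#include <cstddef>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Instead of one `#pragma omp atomic` per scatter, each thread
// accumulates into a private grid, and the private grids are reduced
// once at the end. Serial builds (no _OPENMP) use a single grid.
std::vector<double> scatter_add(const std::vector<std::size_t>& idx,
                                const std::vector<double>& val,
                                std::size_t grid_size) {
    int nthreads = 1;
#ifdef _OPENMP
    nthreads = omp_get_max_threads();
#endif
    std::vector<std::vector<double>> local(
        nthreads, std::vector<double>(grid_size, 0.0));

    #pragma omp parallel for
    for (std::ptrdiff_t i = 0; i < static_cast<std::ptrdiff_t>(idx.size()); ++i) {
        int t = 0;
#ifdef _OPENMP
        t = omp_get_thread_num();
#endif
        local[t][idx[i]] += val[i];  // private grid: no atomic needed
    }

    // Single-pass reduction over the per-thread grids.
    std::vector<double> grid(grid_size, 0.0);
    for (const auto& l : local)
        for (std::size_t j = 0; j < grid_size; ++j) grid[j] += l[j];
    return grid;
}
```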