CUDA Performance Optimization Design Spec
Date: 2026-03-19
Status: Draft
Scope: Eliminate memory allocation and transfer overhead in the CUDA backend PCG reconstruction loop.
1. Problem Statement
The CUDA backend (v0.13.0) achieves only a 1.45x speedup on the SENSE 256x256 reconstruction even though its GPU kernels complete in 56ms. Nsight Systems profiling reveals:
2D SENSE 256x256 profile:
| Category | Time | % of CUDA API | Calls |
|----------|------|---------------|-------|
| cudaMemcpy | 66ms | 62.5% | 640 |
| cudaMalloc | 16ms | 14.8% | 470 |
| cudaFree | 12ms | 11.6% | 470 |
| Kernel launches | 4ms | 3.6% | 1,232 |
| GPU kernel compute | 56ms | — | — |
3D Spiral3D profile:
| Category | Total | Count |
|----------|-------|-------|
| cudaMemcpyAsync | 74.6s (98.4% of CUDA API time) | — |
| Host→Device transfers | 95 GB | 21,729 copies |
| Device→Host transfers | 56 GB | 21,504 copies |
GPU kernels are fast. The bottleneck is memory management overhead.
Root Causes
- cudaMalloc/cudaFree churn: every forgeCol temporary in the PCG loop allocates and frees device memory; profiling counts 470 alloc/free pairs per 2D reconstruction.
- Host↔device transfers: the PCG solver creates temporary forgeCol objects that trigger uploads (when mixed with GPU operands via ensure_device), and the Gnufft pipeline copies data H2D/D2H on every call even though forward_device/adjoint_device variants exist.
2. Changes
2.1 cudaMallocAsync Foundation
Replace cudaMalloc/cudaFree with cudaMallocAsync/cudaFreeAsync in all forgeCol.hpp allocation paths. CUDA's stream-ordered memory pool (available since CUDA 11.2) reuses recently freed blocks, making temporary allocations near-zero cost.
Files modified: forge/Core/forgeCol.hpp
Allocation sites to change:
- putOnGPU(): cudaMalloc → cudaMallocAsync
- Destructor: cudaFree → cudaFreeAsync
- set_size(): cudaFree → cudaFreeAsync
- operator= (copy): cudaMalloc → cudaMallocAsync
- Copy constructor: cudaMalloc → cudaMallocAsync
All async calls use forge::cuda::get_stream().
Expected impact: Eliminates ~28ms (26%) of 2D overhead.
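Every site listed above follows the same pattern. A minimal sketch of what an allocation path might look like after the swap (the DeviceBuffer class, its members, and the stored stream are illustrative assumptions, not forge's actual forgeCol code; only the cudaMallocAsync/cudaFreeAsync calls and the use of a shared stream mirror this spec):

```cpp
#include <cuda_runtime.h>
#include <cstddef>

// Illustrative stand-in for a forgeCol-style device buffer; in forge the
// stream would come from forge::cuda::get_stream().
template <typename T>
struct DeviceBuffer {
    T* d_ptr = nullptr;
    std::size_t n = 0;
    cudaStream_t stream_ = nullptr;

    void putOnGPU(std::size_t count, cudaStream_t stream) {
        n = count;
        stream_ = stream;
        // Stream-ordered allocation (CUDA 11.2+): serviced from the default
        // memory pool, so a just-freed block of the same size is reused
        // without a round trip to the driver allocator.
        cudaMallocAsync(reinterpret_cast<void**>(&d_ptr), n * sizeof(T), stream_);
    }

    ~DeviceBuffer() {
        // Stream-ordered free: the block returns to the pool once prior work
        // on the stream completes; no implicit device-wide synchronization.
        if (d_ptr) cudaFreeAsync(d_ptr, stream_);
    }
};
```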
2.2 Keep PCG Vectors on GPU
Upload the persistent solver vectors once at the start of solve_pwls_pcg and keep them on device for the entire solve. Use the original Metal-style loop with temporaries, not the scratch-buffer approach that previously failed; cudaMallocAsync makes the temporaries cheap.
Files modified: forge/Solvers/solve_pwls_pcg.hpp
Changes:
- After creating x_pg, call x_pg.putOnGPU()
- Create mutable GPU copies of yi_pg and W_pg
- Replace references to yi_pg/W_pg in the loop body with GPU copies
- The forgeCol operators already dispatch to CUDA when operands are on GPU
- The cdot, norm free functions already have CUDA dispatch
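The intended shape of the solver, sketched in C++ (the operator A, the elementwise % product, and the loop body are illustrative assumptions; only the vector names, putOnGPU, and the cdot/norm dispatch come from this spec):

```cpp
// Upload persistent vectors once; every temporary created inside the loop
// is then a device-resident forgeCol whose allocation goes through
// cudaMallocAsync and is therefore cheap.
x_pg.putOnGPU();                          // solution estimate stays on device
auto yi_gpu = yi_pg;  yi_gpu.putOnGPU();  // mutable GPU copy of the data
auto W_gpu  = W_pg;   W_gpu.putOnGPU();   // mutable GPU copy of the weights

for (int iter = 0; iter < niter; ++iter) {
    // Because every operand is on the GPU, forgeCol's operators dispatch to
    // CUDA and the temporaries below never touch the host.
    auto resid = A.forward(x_pg) - yi_gpu;  // data-consistency residual
    auto grad  = A.adjoint(W_gpu % resid);  // weighted adjoint (elementwise %)
    // ... search direction, cdot/norm reductions (CUDA-dispatched), step update ...
}
```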
SENSE operator changes: forge/Operators/SENSE.cpp
The SENSE CUDA path currently stores per-coil results in outData_pg (forgeMat, host-only) via set_col, which forces getFromGPU() syncs. Fix by concatenating forgeCol results directly:
```cpp
// Instead of: staging per-coil results in host-only forgeMat storage,
// which forces a getFromGPU() sync on every set_col:
outData_pg.set_col(ii, coil_result);
// ...
return vectorise(outData_pg);

// Do: concatenate forgeCol results directly on the device:
forgeCol<forgeComplex<T1>> allCoils(n1 * nc);
// Copy each coil result into the right slice of allCoils (device-to-device)
```
This avoids forgeMat and keeps data on GPU. forgeMat CUDA support is deferred to follow-up work.
Expected impact: Eliminates ~66ms (62%) of 2D memcpy overhead. GPU kernels (56ms) + PCG vector ops become dominant.
2.3 Re-profile After 2.1 and 2.2
After implementing the above, re-run Nsight Systems profiling on both SENSE 256x256 and Spiral3D to:
- Verify that memcpy and malloc overhead is eliminated
- Identify the next bottleneck (kernel efficiency? cuFFT? PCG vector ops?)
- Decide whether circshift optimization, forgeMat CUDA support, or kernel tuning is needed
3. What Is NOT In Scope
- forgeMat CUDA support — follow-up work
- CudaDFT — Gdft/GdftR2 fall back to CPU
- circshift kernel optimization — re-profile first
- Scratch-buffer PCG loop — tried and failed due to data flow bugs; cudaMallocAsync makes it unnecessary
4. Success Criteria
- All tests pass (101 non-benchmark tests, correct NRMSE)
- Measurable speedup over current 1.45x on SENSE 256x256
- Profile confirms memcpy/malloc overhead is reduced
5. Risks
| Risk | Mitigation |
|---|---|
| cudaMallocAsync not available on older CUDA | Requires CUDA 11.2+; our minimum is CUDA 13.0 |
| PCG vectors on GPU: same divergence bug as before | Use original Metal-style loop, not scratch buffers. operator= is now fixed. |
| SENSE forgeCol concatenation correctness | Test adjoint property and NRMSE match CPU |