CUDA Performance Optimization Design Spec
Date: 2026-03-19
Status: Draft
Scope: Eliminate memory allocation and transfer overhead in the CUDA backend PCG reconstruction loop.
1. Problem Statement
The CUDA backend (v0.13.0) achieves only a 1.45x speedup on the SENSE 256x256 reconstruction even though its GPU kernels complete in 56ms. Nsight Systems profiling reveals:
2D SENSE 256x256 profile:
| Category | Time | % of CUDA API | Calls |
|----------|------|---------------|-------|
| cudaMemcpy | 66ms | 62.5% | 640 |
| cudaMalloc | 16ms | 14.8% | 470 |
| cudaFree | 12ms | 11.6% | 470 |
| Kernel launches | 4ms | 3.6% | 1,232 |
| GPU kernel compute | 56ms | — | — |
3D Spiral3D profile:
| Category | Total | Count |
|----------|-------|-------|
| cudaMemcpyAsync | 74.6s (98.4% of CUDA API time) | — |
| Host→Device transfers | 95 GB | 21,729 copies |
| Device→Host transfers | 56 GB | 21,504 copies |
GPU kernels are fast. The bottleneck is memory management overhead.
Root Causes
- cudaMalloc/cudaFree churn: every forgeCol temporary in the PCG loop allocates and frees device memory; profiling counts 470 alloc/free pairs per 2D reconstruction.
- Host↔device transfers: the PCG solver creates temporary forgeCol objects that trigger uploads (when mixed with GPU operands via ensure_device), and the Gnufft pipeline copies data H2D/D2H on every call even though forward_device/adjoint_device variants exist.
2. Changes
2.1 cudaMallocAsync Foundation
Replace cudaMalloc/cudaFree with cudaMallocAsync/cudaFreeAsync in all forgeCol.hpp allocation paths. CUDA's stream-ordered memory pool (available since CUDA 11.2) reuses recently freed blocks, making temporary allocations near-zero cost.
Files modified: forge/Core/forgeCol.hpp
Allocation sites to change:
- putOnGPU(): cudaMalloc → cudaMallocAsync
- Destructor: cudaFree → cudaFreeAsync
- set_size(): cudaFree → cudaFreeAsync
- operator= (copy): cudaMalloc → cudaMallocAsync
- Copy constructor: cudaMalloc → cudaMallocAsync
All async calls use forge::cuda::get_stream().
Expected impact: Eliminates ~28ms (26%) of 2D overhead.
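Every site listed above follows the same pattern. A minimal sketch of what an allocation path might look like after the swap (the DeviceBuffer class, its members, and the stored stream are illustrative assumptions, not forge's actual forgeCol code; only the cudaMallocAsync/cudaFreeAsync calls and the use of a shared stream mirror this spec):

```cpp
#include <cuda_runtime.h>
#include <cstddef>

// Illustrative stand-in for a forgeCol-style device buffer; in forge the
// stream would come from forge::cuda::get_stream().
template <typename T>
struct DeviceBuffer {
    T* d_ptr = nullptr;
    std::size_t n = 0;
    cudaStream_t stream_ = nullptr;

    void putOnGPU(std::size_t count, cudaStream_t stream) {
        n = count;
        stream_ = stream;
        // Stream-ordered allocation (CUDA 11.2+): serviced from the default
        // memory pool, so a just-freed block of the same size is reused
        // without a round trip to the driver allocator.
        cudaMallocAsync(reinterpret_cast<void**>(&d_ptr), n * sizeof(T), stream_);
    }

    ~DeviceBuffer() {
        // Stream-ordered free: the block returns to the pool once prior work
        // on the stream completes; no implicit device-wide synchronization.
        if (d_ptr) cudaFreeAsync(d_ptr, stream_);
    }
};
```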
2.2 Keep PCG Vectors on GPU
Upload the persistent solver vectors once at the start of solve_pwls_pcg and keep them on device for the entire solve. Use the original Metal-style loop with temporaries, not the scratch-buffer approach that previously failed; cudaMallocAsync makes the temporaries cheap.
Files modified: forge/Solvers/solve_pwls_pcg.hpp
Changes:
- After creating x_pg, call x_pg.putOnGPU()
- Create mutable GPU copies of yi_pg and W_pg
- Replace references to yi_pg/W_pg in the loop body with GPU copies
- The forgeCol operators already dispatch to CUDA when operands are on GPU
- The cdot, norm free functions already have CUDA dispatch
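The intended shape of the solver, sketched in C++ (the operator A, the elementwise % product, and the loop body are illustrative assumptions; only the vector names, putOnGPU, and the cdot/norm dispatch come from this spec):

```cpp
// Upload persistent vectors once; every temporary created inside the loop
// is then a device-resident forgeCol whose allocation goes through
// cudaMallocAsync and is therefore cheap.
x_pg.putOnGPU();                          // solution estimate stays on device
auto yi_gpu = yi_pg;  yi_gpu.putOnGPU();  // mutable GPU copy of the data
auto W_gpu  = W_pg;   W_gpu.putOnGPU();   // mutable GPU copy of the weights

for (int iter = 0; iter < niter; ++iter) {
    // Because every operand is on the GPU, forgeCol's operators dispatch to
    // CUDA and the temporaries below never touch the host.
    auto resid = A.forward(x_pg) - yi_gpu;  // data-consistency residual
    auto grad  = A.adjoint(W_gpu % resid);  // weighted adjoint (elementwise %)
    // ... search direction, cdot/norm reductions (CUDA-dispatched), step update ...
}
```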
SENSE operator changes: forge/Operators/SENSE.cpp
The SENSE CUDA path currently stores per-coil results in outData_pg (forgeMat, host-only) via set_col, which forces getFromGPU() syncs. Fix by concatenating forgeCol results directly:
```cpp
// Instead of: staging per-coil results in host-only forgeMat storage,
// which forces a getFromGPU() sync on every set_col:
outData_pg.set_col(ii, coil_result);
// ...
return vectorise(outData_pg);

// Do: concatenate forgeCol results directly on the device:
forgeCol<forgeComplex<T1>> allCoils(n1 * nc);
// Copy each coil result into the right slice of allCoils (device-to-device)
```
This avoids forgeMat and keeps data on GPU. forgeMat CUDA support is deferred to follow-up work.
Expected impact: Eliminates ~66ms (62%) of 2D memcpy overhead. GPU kernels (56ms) + PCG vector ops become dominant.
2.3 Re-profile After 2.1 and 2.2
After implementing the above, re-run Nsight Systems profiling on both SENSE 256x256 and Spiral3D to:
- Verify that memcpy and malloc overhead is eliminated
- Identify the next bottleneck (kernel efficiency? cuFFT? PCG vector ops?)
- Decide whether circshift optimization, forgeMat CUDA support, or kernel tuning is needed
3. What Is NOT In Scope
- forgeMat CUDA support — follow-up work
- CudaDFT — Gdft/GdftR2 fall back to CPU
- circshift kernel optimization — re-profile first
- Scratch-buffer PCG loop — tried and failed due to data flow bugs; cudaMallocAsync makes it unnecessary
4. Success Criteria
- All tests pass (101 non-benchmark tests, correct NRMSE)
- Measurable speedup over current 1.45x on SENSE 256x256
- Profile confirms memcpy/malloc overhead is reduced
5. Risks
| Risk | Mitigation |
|---|---|
| cudaMallocAsync not available on older CUDA | Requires CUDA 11.2+; our minimum is CUDA 13.0 |
| PCG vectors on GPU: same divergence bug as before | Use original Metal-style loop, not scratch buffers. operator= is now fixed. |
| SENSE forgeCol concatenation correctness | Test adjoint property and NRMSE match CPU |