# CUDA PCG Convergence Fix — Design Spec

- Date: 2026-03-19
- Branch: `feature/cuda-perf-optimization`
- Status: Draft
## Problem Statement

When PCG solver vectors (`x_pg`, `yi`, `W`) are placed on GPU via `putOnGPU()`, SENSE+Gnufft reconstruction diverges:

- Without `putOnGPU()`: NRMSE 0.20256 (correct), 686 ms
- With `putOnGPU()`: NRMSE 0.854, dAWAd = 4.28e+19, step sizes ~1e-6
The dAWAd value changes with code arrangement (reordering lines, adding/removing statements), which is the hallmark of undefined behavior — reading stale or uninitialized memory.
## Root Cause Analysis
Note: The analysis below is a hypothesis based on code review and the observation that dAWAd changes with code arrangement (hallmark of reading uninitialized/stale memory). Phase 1 diagnostics will confirm or refute this hypothesis. If Phase 1 reveals that GPU and CPU dAWAd values agree (both ~4e+19), the issue may be a normalization/scaling difference in the CUDA pipeline rather than stale data, and the fix direction would change accordingly.
### The Dual-Pointer Coherence Hazard
`forgeCol` maintains two data pointers: host (`mem`) and device (`d_mem`), with an `isOnGPU` flag indicating which is authoritative. When GPU operations produce results, a common pattern creates a hazard:

- `forgeCol result(n)` — allocates host `mem` (uninitialized garbage via `new T[n]`)
- `result.putOnGPU()` — allocates `d_mem`, copies garbage from `mem` to `d_mem`
- CUDA kernel writes correct data to `d_mem`
- Result: `d_mem` is correct, `mem` is garbage, `isOnGPU = true`

This is safe as long as every consumer checks `on_gpu()` and uses `device_memptr()`. But any code path that reads `memptr()` instead — copy constructors copying host data, `at()` accessors, arma fallback paths — reads garbage. The garbage content depends on previous heap allocations, explaining why results change with code arrangement.
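The hazard can be reproduced in miniature. The sketch below is a hypothetical stand-in for forgeCol (the names `mem`, `d_mem`, and `isOnGPU` follow the description above); "device" memory is simulated with a second host buffer so the pattern runs without a GPU:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for forgeCol's dual-pointer state. A second host
// buffer plays the role of device memory so the hazard runs without a GPU.
struct DualBuf {
    std::vector<float> mem;    // host buffer (uninitialized garbage in real code)
    std::vector<float> d_mem;  // stands in for device memory
    bool isOnGPU = false;

    // -999 plays the role of heap garbage left in a fresh allocation
    explicit DualBuf(std::size_t n) : mem(n, -999.0f) {}

    void putOnGPU()   { d_mem = mem; isOnGPU = true; }   // uploads whatever mem holds
    void getFromGPU() { mem = d_mem; isOnGPU = false; }  // explicit device -> host sync
};
```

After `putOnGPU()` and a "kernel" write to `d_mem`, any consumer reading `mem` still sees -999 (the stand-in for garbage) until `getFromGPU()` runs — exactly the stale-read path described above.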
### The SENSE Adjoint Host/Device Ping-Pong
The SENSE adjoint CUDA path (the `SENSE::operator/` forgeCol overload) has the most host/device transitions per call:

- Copy-constructs input data, then calls `getFromGPU()` to sync device → host (correct, but wasteful — the copy constructor first copies stale host data, then `getFromGPU` immediately overwrites it)
- Slices per-coil data via `subvec` (host-only operation)
- Runs the Gnufft adjoint via the `CudaNufftPipeline::adjoint()` host-pointer wrapper (internally does H2D copy → kernel execution → D2H copy → stream sync)
- Accumulates coil images on host
- Returns the result on host — which is then uploaded back to GPU in the next PCG operation
While each individual transition is correctly synchronized, the overall pattern creates many opportunities for coherence bugs as surrounding code evolves. More critically, it forces every PCG iteration through a GPU → host → GPU round-trip that is both slow and fragile.
### What Works (and Why)
Without `putOnGPU()` on PCG vectors, the solver works because:

- `x_pg` stays on host → `R.Gradient(x_pg)` uses CPU penalty loops (not CUDA kernels)
- `yi_ref` and `W_ref` are const references to host data → `ensure_device` uploads them to GPU as temporaries for each operation
- The CUDA vector operations (`cdot`, `rvec_cmul`, etc.), NUFFT pipeline, deapodization, and FFT normalization are correct in both cases, verified through code review. Both FFTW and cuFFT are unnormalized (the commented-out `normalize_fft2d/3d` in `gridding.cpp:431-439` confirms this), and the deapodization kernels use identical sin/sinh formulas.
### Secondary Bug: CudaPenalty Boundary Condition

`kernel_finite_diff_adj` in `CudaPenalty.cu` computes `d[i] - d[i+offset]` for `i < offset`, while the CPU version computes `0 - d[i+offset]`. Currently neutralized because the preceding `Cd` zeros those elements, but it is a latent correctness bug.
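A minimal numeric model of the discrepancy, using the boundary formulas stated above (real part only; function names are illustrative, not the actual kernel code):

```cpp
#include <vector>

// CPU version: the d[i] term is zeroed for i < offset
float adj_term_cpu(const std::vector<float>& d, int i, int offset) {
    return (i >= offset ? d[i] : 0.0f) - d[i + offset];
}

// Pre-fix kernel arithmetic: reads d[i] unconditionally
float adj_term_gpu_buggy(const std::vector<float>& d, int i, int offset) {
    return d[i] - d[i + offset];
}
```

With `d[0]` nonzero the two disagree at `i = 0`; in the current pipeline the preceding `Cd` zeroes those elements, which is why the bug stays latent.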
## Design
### Phase 1: Debug Instrumentation
Add validation checkpoints to confirm the stale-data hypothesis and locate the exact divergence point.
Add `debug_validate()` to forgeCol (`forge/Core/forgeCol.hpp`):

- Creates a temporary copy, syncs GPU → host, checks for NaN/Inf
- Gated behind `#ifndef NDEBUG` so it compiles out in release builds
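A host-only sketch of the NaN/Inf scan, with forgeCol modeled as a plain `std::vector<std::complex<float>>` (in the real method a copy and a `getFromGPU()` sync would come first, inside the `#ifndef NDEBUG` guard):

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <iostream>
#include <vector>

// Scan a host-synced vector for non-finite elements and report the first hit.
bool debug_validate(const std::vector<std::complex<float>>& v, const char* tag) {
    for (std::size_t i = 0; i < v.size(); ++i) {
        if (!std::isfinite(std::abs(v[i]))) {  // catches NaN/Inf in either part
            std::cerr << tag << ": non-finite element at index " << i << "\n";
            return false;
        }
    }
    return true;
}
```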
Add comparison checkpoints in PCG (`forge/Solvers/solve_pwls_pcg.hpp`):

After each major operation in the first PCG iteration, compute the same value on both GPU and CPU and compare:

- After `Ax_pg = A * x_pg` — sync Ax to host, check for NaN
- After `ngrad_pg = A / Wresidual` — sync, check magnitude
- After `Adir_pg = A * ddir_pg` — sync, check magnitude
- After `dAWAd = cdot(Adir, WAdir)` — also compute on host (sync both vectors, CPU loop), compare
The key diagnostic: if GPU cdot and CPU cdot of the same (synced) data agree, the data is correct and the magnitude is inherent. If they differ, the data reaching cdot is corrupt.
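The cross-check could look like the following sketch, assuming the GPU value has already been read back; the names (`cdot_cpu`, `agrees`) and the tolerance are illustrative:

```cpp
#include <algorithm>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// CPU reference for cdot (conjugated dot product), accumulated in double.
std::complex<double> cdot_cpu(const std::vector<std::complex<float>>& a,
                              const std::vector<std::complex<float>>& b) {
    std::complex<double> acc{0.0, 0.0};
    for (std::size_t i = 0; i < a.size(); ++i)
        acc += std::conj(std::complex<double>(a[i])) * std::complex<double>(b[i]);
    return acc;
}

// Comparison gate: a relative mismatch means the data reaching the GPU cdot
// was corrupt; agreement means the magnitude is inherent to the problem.
bool agrees(std::complex<double> gpu, std::complex<double> cpu, double rtol = 1e-4) {
    double denom = std::max(std::abs(cpu), 1e-30);
    return std::abs(gpu - cpu) / denom < rtol;
}
```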
Decision gate after Phase 1:
- If GPU/CPU dAWAd values disagree → stale-data hypothesis confirmed, proceed to Phase 2 (GPU-resident adjoint eliminates the coherence hazard)
- If GPU/CPU dAWAd values agree and are both ~4e+19 → the magnitude is inherent to the problem. Investigate whether the "working" case also has this magnitude (add same print). If so, the divergence cause is elsewhere (step-size computation, penalty chain, etc.). Phase 2 is still valuable for performance but would not be the convergence fix.
### Phase 2: GPU-Resident SENSE Adjoint
Rewrite the SENSE adjoint CUDA path to stay entirely on GPU, eliminating the host/device ping-pong.
Current flow (fragile):

```
d (GPU) → copy+getFromGPU → d_host (host) → subvec → dSlice (host)
  → CudaNufftPipeline::adjoint() host wrapper (internal H2D → kernels → D2H → sync)
  → coil_img (host) → coil_img %= conjSmap (host) → imgSum += coil_img (host) → return (host)
```

New flow (GPU-resident):

```
d (GPU) → device pointer arithmetic → CudaNufftPipeline::adjoint_device() (GPU)
  → coil_img (GPU) → coil_img %= conjSmap (GPU) → imgSum += coil_img (GPU) → return (GPU)
```
Key implementation details:

- Per-coil slicing uses device pointer arithmetic (`d.device_memptr() + ii * n1`) instead of `subvec`. The `adjoint_device` method accepts `const float*` device pointers and writes `n1` complex elements. Pointer arithmetic on `forgeComplex<T1>*` gives the correct byte offset.
- Sensitivity map columns are uploaded per-coil via `col_copy` + `putOnGPU` (same pattern as the forward path).
- Accumulator initialization: `imgSum.putOnGPU()` after `imgSum.zeros()` to place zeros on GPU. All subsequent `+=` operations dispatch to CUDA (`operator+=` checks `isOnGPU || pgA.on_gpu()`).
- Data flow verification: after the rewrite, all forgeCol objects in the loop (`coil_img`, `conjSmap_col`, `imgSum`) will have `isOnGPU = true`. The `%=` and `+=` operators dispatch to CUDA for `forgeComplex<float>` when either operand is on GPU (verified in the `forgeCol.hpp` operator implementations). No operation falls back to a host loop.
- Stream ordering: all operations use the singleton `forge::cuda::get_stream()`. Since `adjoint_device`, `%=` (CUDA kernel), and `+=` (cuBLAS saxpy) all submit to the same stream, they execute in order without explicit synchronization.
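A host-side model of the rewritten loop, showing the intended data flow. In the real path `d` stays on the device, the slice is `d.device_memptr() + ii * n1`, the adjoint is `CudaNufftPipeline::adjoint_device()`, and `%=`/`+=` dispatch to CUDA kernels on the shared stream; the stand-ins here are purely illustrative:

```cpp
#include <complex>
#include <cstddef>
#include <vector>

using cfloat = std::complex<float>;

// Host-side model of the GPU-resident adjoint loop. Assumes n k-space samples
// and n image pixels per coil so the stand-in "adjoint" can be a plain copy.
std::vector<cfloat> sense_adjoint_model(const std::vector<cfloat>& d,
                                        const std::vector<std::vector<cfloat>>& smap,
                                        std::size_t n) {
    std::vector<cfloat> imgSum(n, cfloat{0.0f, 0.0f});    // imgSum.zeros(); putOnGPU()
    for (std::size_t ii = 0; ii < smap.size(); ++ii) {
        const cfloat* dSlice = d.data() + ii * n;         // pointer slice, no subvec copy
        std::vector<cfloat> coil_img(dSlice, dSlice + n); // stand-in for adjoint_device()
        for (std::size_t j = 0; j < n; ++j) {
            coil_img[j] *= std::conj(smap[ii][j]);        // coil_img %= conjSmap
            imgSum[j]   += coil_img[j];                   // imgSum += coil_img
        }
    }
    return imgSum;
}
```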
File: forge/Operators/SENSE.cpp (SENSE::operator/ forgeCol overload)
### Phase 2b: Audit Penalty Chain for GPU Dispatch
The `Robject::Penalty()`, `Robject::Gradient()`, and `Robject::Denom()` forgeCol overloads call several operations on potentially GPU-resident data:

- `Cd()` and `Ctd()` — have CUDA dispatch (verified in `Robject.cpp`)
- `dpot()` / `wpot()` — return copies or host-allocated ones; consumed by GPU-dispatched `%=`
- `cdot()` — has CUDA dispatch (verified in `forgeCol.hpp`)
- `abs()` — verify CUDA dispatch exists for `forgeCol<forgeComplex<float>>`
- `sum()` — currently calls `getFromGPU()` then a CPU loop (verified in `forgeCol.hpp`); correct but slow
- `to_complex()` — verify behavior with GPU-resident input
If any of these lack CUDA dispatch, they will read stale host data when x_pg is on GPU. Phase 1 instrumentation should catch this (NaN/Inf checks), but an explicit audit during implementation ensures completeness.
File: forge/Core/forgeCol.hpp
### Phase 3: Fix CudaPenalty Boundary Bug
Fix `kernel_finite_diff_adj` to match the CPU implementation:

```cpp
// Before: unconditionally reads d[i], even when i < offset
float re = d[2 * i];
float im = d[2 * i + 1];

// After: zero the d[i] term at the boundary, matching the CPU path
float re = (i >= offset) ? d[2 * i] : 0.0f;
float im = (i >= offset) ? d[2 * i + 1] : 0.0f;
```

File: `forge/CUDA/CudaPenalty.cu`, lines 29-36
### Phase 4: Validation
- Run full test suite: `./build/cpu_tests ~"[Benchmark]"` — all 86 tests must pass
- Verify `[SENSE_bench][bench256]` gives NRMSE ~0.20256
- Verify the SENSE adjointness test passes (relError < 1e-5)
- Remove or gate Phase 1 debug instrumentation
- Run an Nsight Systems profile to confirm D2H/H2D elimination
## Scope

This spec addresses single-precision (`float`) only. All CUDA paths are gated by `if constexpr (std::is_same<T1, float>::value)`. Double-precision CUDA support is out of scope.
## Files Changed

| File | Phase | Change |
|---|---|---|
| `forge/Core/forgeCol.hpp` | 1, 2b | Add `debug_validate()` method; audit `abs()`, `to_complex()` for GPU dispatch |
| `forge/Solvers/solve_pwls_pcg.hpp` | 1, 4 | Add/remove validation checkpoints |
| `forge/Operators/SENSE.cpp` | 2 | GPU-resident adjoint path |
| `forge/CUDA/CudaPenalty.cu` | 3 | Fix `kernel_finite_diff_adj` boundary |
## Future Work
After this fix, the following architectural improvements should be considered (tracked in memory):

- GPU-resident `subvec` for forgeCol — device pointer slicing without a D2H copy
- Stale-host-data hazard elimination — null out `mem` or add a dirty flag when `d_mem` is modified; consider a debug-build NaN-poisoning mode where `putOnGPU()` fills `mem` with NaN after upload, making stale reads fail loudly
- forgeMat CUDA support — `d_mem`/`putOnGPU`/`getFromGPU` on forgeMat for GPU-resident sensitivity maps
- SENSE forward parallelism — per-coil Gnufft could use CUDA graphs or multiple streams for overlap
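The NaN-poisoning idea can be sketched as follows. `PoisonedBuf` is a hypothetical stand-in for forgeCol with device memory simulated by a second host buffer; the real change would gate the poisoning loop behind `#ifndef NDEBUG`:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for forgeCol; d_mem simulates device memory.
struct PoisonedBuf {
    std::vector<float> mem, d_mem;
    bool isOnGPU = false;

    explicit PoisonedBuf(std::size_t n) : mem(n, 0.0f) {}

    void putOnGPU() {
        d_mem = mem;                             // upload host -> "device"
        isOnGPU = true;
        for (auto& v : mem) v = std::nanf("");   // poison: stale host reads fail loudly
    }
};
```

Any code path that reads `mem` after the upload now propagates NaN instead of plausible-looking garbage, so Phase 1's NaN/Inf checks would catch the stale read immediately.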
## Testing Strategy
- Correctness gate: NRMSE within 1e-4 of CPU-only result for SENSE 256x256
- Adjointness gate: relError < 1e-5 for SENSE forward/adjoint
- Regression gate: all 86 existing tests pass
- Performance check: PCG solve time should decrease (no D2H/H2D per iteration)