CUDA PCG Convergence Fix — Design Spec

Date: 2026-03-19 Branch: feature/cuda-perf-optimization Status: Draft

Problem Statement

When PCG solver vectors (x_pg, yi, W) are placed on GPU via putOnGPU(), SENSE+Gnufft reconstruction diverges:

  • Without putOnGPU(): NRMSE 0.20256 (correct), 686ms
  • With putOnGPU(): NRMSE 0.854, dAWAd = 4.28e+19, step sizes ~1e-6

The dAWAd value changes with code arrangement (reordering lines, adding/removing statements), which is the hallmark of undefined behavior — reading stale or uninitialized memory.

Root Cause Analysis

Note: The analysis below is a hypothesis based on code review and the observation that dAWAd changes with code arrangement (hallmark of reading uninitialized/stale memory). Phase 1 diagnostics will confirm or refute this hypothesis. If Phase 1 reveals that GPU and CPU dAWAd values agree (both ~4e+19), the issue may be a normalization/scaling difference in the CUDA pipeline rather than stale data, and the fix direction would change accordingly.

The Dual-Pointer Coherence Hazard

forgeCol maintains two data pointers: host (mem) and device (d_mem), with an isOnGPU flag indicating which is authoritative. When GPU operations produce results, a common pattern creates a hazard:

  1. forgeCol result(n) — allocates host mem (uninitialized garbage via new T[n])
  2. result.putOnGPU() — allocates d_mem, copies garbage from mem to d_mem
  3. CUDA kernel writes correct data to d_mem
  4. Result: d_mem is correct, mem is garbage, isOnGPU = true

This is safe as long as every consumer checks on_gpu() and uses device_memptr(). But any code path that reads memptr() instead — copy constructors copying host data, at() accessors, arma fallback paths — reads garbage. The garbage content depends on previous heap allocations, explaining why results change with code arrangement.
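The hazard above can be reproduced without a GPU. The sketch below is a minimal stand-in for the dual-pointer pattern (`DualBuf`, `kernel_fill`, and the sentinel value are illustrative, not forgeCol's real API): plain vectors play the roles of `mem` and `d_mem`, so the stale-host-read failure mode is visible directly.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Minimal stand-in for the forgeCol dual-pointer pattern (names hypothetical).
// Host "mem" and device "d_mem" are both plain vectors here, so the hazard is
// observable without CUDA.
struct DualBuf {
    std::vector<float> mem;    // host copy (may be stale)
    std::vector<float> d_mem;  // stands in for device memory
    bool isOnGPU = false;

    // -999 plays the role of uninitialized heap garbage from `new T[n]`.
    explicit DualBuf(std::size_t n) : mem(n, -999.0f) {}

    // Uploads whatever mem currently holds -- including garbage.
    void putOnGPU() { d_mem = mem; isOnGPU = true; }

    // A "kernel" writing results touches only d_mem; mem is now stale.
    void kernel_fill(float v) { for (auto& x : d_mem) x = v; }

    // Correct consumer: checks the flag before choosing a pointer.
    float read_ok(std::size_t i) const { return isOnGPU ? d_mem[i] : mem[i]; }

    // Hazardous consumer: reads the host pointer unconditionally, like a
    // copy constructor or at() accessor that ignores on_gpu().
    float read_stale(std::size_t i) const { return mem[i]; }
};
```

After `putOnGPU()` and a kernel write, `read_ok` returns the kernel's result while `read_stale` returns the garbage sentinel, mirroring how `memptr()` readers see arbitrary heap contents.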

The SENSE Adjoint Host/Device Ping-Pong

The SENSE adjoint CUDA path (SENSE::operator/ forgeCol overload) has the most host/device transitions per call:

  1. Copy-constructs input data, then calls getFromGPU() to sync device → host (correct, but wasteful — the copy constructor first copies stale host data, then getFromGPU immediately overwrites it)
  2. Slices per-coil data via subvec (host-only operation)
  3. Runs Gnufft adjoint via CudaNufftPipeline::adjoint() host-pointer wrapper (internally does H2D copy → kernel execution → D2H copy → stream sync)
  4. Accumulates coil images on host
  5. Returns result on host — which then gets uploaded back to GPU in the next PCG operation

While each individual transition is correctly synchronized, the overall pattern creates many opportunities for coherence bugs as surrounding code evolves. More critically, it forces every PCG iteration through a GPU → host → GPU round-trip that is both slow and fragile.

What Works (and Why)

Without putOnGPU() on PCG vectors, the solver works because:

  • x_pg stays on host → R.Gradient(x_pg) uses CPU penalty loops (not CUDA kernels)
  • yi_ref and W_ref are const references to host data → ensure_device uploads them to GPU as temporaries for each operation
  • The CUDA NUFFT pipeline and vector operations (cdot, rvec_cmul, etc.) are confirmed correct — they work in both cases

The CUDA vector operations, NUFFT pipeline, deapodization, and FFT normalization are all verified correct through code review. Both FFTW and cuFFT are unnormalized (the commented-out normalize_fft2d/3d in gridding.cpp:431-439 confirms this). The deapodization kernels use identical sin/sinh formulas.
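The "both unnormalized" point can be illustrated with a naive host-side DFT (illustrative only, not the library's FFT code): with no 1/N factor, running the sign-flipped transform after the forward one scales the input by N, which is the convention both FFTW and cuFFT share.

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Naive DFT with no 1/N normalization, mirroring the FFTW/cuFFT convention:
// dft(dft(x, -1), +1) returns N * x, not x.
using cf = std::complex<double>;

std::vector<cf> dft(const std::vector<cf>& x, int sign) {
    const double PI = std::acos(-1.0);
    const std::size_t N = x.size();
    std::vector<cf> y(N);
    for (std::size_t k = 0; k < N; ++k)
        for (std::size_t n = 0; n < N; ++n)
            y[k] += x[n] * std::polar(1.0, sign * 2.0 * PI * double(k * n) / double(N));
    return y;
}
```

A delta input of length 4 comes back scaled to 4 after a forward/adjoint round trip, which is why a deliberate normalization step (here commented out in gridding.cpp) must live somewhere else in the pipeline.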

Secondary Bug: CudaPenalty Boundary Condition

kernel_finite_diff_adj in CudaPenalty.cu computes d[i] - d[i+offset] for i < offset, while the CPU version computes 0 - d[i+offset]. Currently neutralized because the preceding Cd zeros those elements, but it is a latent correctness bug.
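The boundary discrepancy can be shown with a host-side sketch of the adjoint term (scalar arrays here for clarity; the real kernel operates on interleaved complex data, and the offset/values are made up):

```cpp
#include <cstddef>
#include <vector>

// Host-side sketch (not the real kernel) of the finite-difference adjoint
// boundary term. For i < offset the CPU reference treats d[i] as 0; the
// current kernel reads d[i] anyway.
float adj_term_buggy(const std::vector<float>& d, std::size_t i, std::size_t offset) {
    return d[i] - d[i + offset];           // wrong for i < offset
}

float adj_term_fixed(const std::vector<float>& d, std::size_t i, std::size_t offset) {
    float di = (i >= offset) ? d[i] : 0.0f; // matches the CPU version
    return di - d[i + offset];
}
```

With d = {5, 1, 2, 3} and offset = 1, the buggy term at i = 0 is 5 − 1 = 4 while the CPU reference gives 0 − 1 = −1; the bug only stays hidden while the preceding Cd zeros d[0].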

Design

Phase 1: Debug Instrumentation

Add validation checkpoints to confirm the stale-data hypothesis and locate the exact divergence point.

Add debug_validate() to forgeCol (forge/Core/forgeCol.hpp):

  • Creates a temporary copy, syncs GPU → host, and checks for NaN/Inf
  • Gated behind #ifndef NDEBUG so it compiles out in release builds
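The checkpoint's core is just a finiteness scan over a synced host copy. A sketch of that piece (the forgeCol sync path is abstracted away; `has_nan_or_inf` is a hypothetical helper name):

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Core of the debug_validate() check: scan a synced host copy for NaN/Inf.
// Callers would first sync device -> host into `host_copy`.
bool has_nan_or_inf(const std::vector<float>& host_copy) {
    for (float v : host_copy)
        if (!std::isfinite(v)) return true;
    return false;
}

// In forgeCol this would be gated so it compiles out of release builds:
// #ifndef NDEBUG
//     void debug_validate() const { /* sync to temp, run has_nan_or_inf */ }
// #endif
```

Because the check allocates a temporary and forces a D2H sync, it must stay debug-only; in release builds the NDEBUG gate removes both the cost and the synchronization point.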

Add comparison checkpoints in PCG (forge/Solvers/solve_pwls_pcg.hpp): After each major operation in the first PCG iteration, compute the same value on both GPU and CPU and compare:

  1. After Ax_pg = A * x_pg — sync Ax to host, check for NaN
  2. After ngrad_pg = A / Wresidual — sync, check magnitude
  3. After Adir_pg = A * ddir_pg — sync, check magnitude
  4. After dAWAd = cdot(Adir, WAdir) — also compute on host (sync both vectors, CPU loop), compare

The key diagnostic: if GPU cdot and CPU cdot of the same (synced) data agree, the data is correct and the magnitude is inherent. If they differ, the data reaching cdot is corrupt.
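The CPU side of that cross-check is a plain conjugated dot product plus a relative-tolerance comparison. A sketch (assuming the usual conj(a)·b convention for cdot; the tolerance is a placeholder):

```cpp
#include <algorithm>
#include <complex>
#include <cstddef>
#include <vector>

// CPU reference for cdot, used to cross-check the GPU result after syncing
// both vectors to host. Assumes the conj(a) * b convention.
std::complex<float> cdot_ref(const std::vector<std::complex<float>>& a,
                             const std::vector<std::complex<float>>& b) {
    std::complex<float> acc{0.0f, 0.0f};
    for (std::size_t i = 0; i < a.size(); ++i)
        acc += std::conj(a[i]) * b[i];
    return acc;
}

// Agreement within a relative tolerance => the synced data is correct and the
// dAWAd magnitude is inherent; disagreement => corrupt data reached cdot.
bool agrees(std::complex<float> gpu, std::complex<float> cpu, float rel = 1e-4f) {
    return std::abs(gpu - cpu) <= rel * std::max(1.0f, std::abs(cpu));
}
```

A relative (not absolute) tolerance matters here: at magnitudes like 4e+19, any fixed absolute epsilon would flag harmless float round-off as a mismatch.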

Decision gate after Phase 1:

  • If GPU/CPU dAWAd values disagree → stale-data hypothesis confirmed; proceed to Phase 2 (GPU-resident adjoint eliminates the coherence hazard)
  • If GPU/CPU dAWAd values agree and are both ~4e+19 → the magnitude is inherent to the problem. Investigate whether the "working" case also has this magnitude (add the same print). If so, the divergence cause is elsewhere (step-size computation, penalty chain, etc.). Phase 2 is still valuable for performance but would not be the convergence fix.

Phase 2: GPU-Resident SENSE Adjoint

Rewrite the SENSE adjoint CUDA path to stay entirely on GPU, eliminating the host/device ping-pong.

Current flow (fragile):

d (GPU) → copy+getFromGPU → d_host (host) → subvec → dSlice (host)
→ CudaNufftPipeline::adjoint() host wrapper (internal H2D → kernels → D2H → sync)
→ coil_img (host) → coil_img %= conjSmap (host) → imgSum += coil_img (host) → return (host)

New flow (GPU-resident):

d (GPU) → device pointer arithmetic → CudaNufftPipeline::adjoint_device() (GPU)
→ coil_img (GPU) → coil_img %= conjSmap (GPU) → imgSum += coil_img (GPU) → return (GPU)

Key implementation details:

  1. Per-coil slicing uses device pointer arithmetic (d.device_memptr() + ii * n1) instead of subvec. The adjoint_device method accepts const float* device pointers and writes n1 complex elements. Pointer arithmetic on forgeComplex<T1>* gives the correct byte offset.

  2. Sensitivity map columns are uploaded per-coil via col_copy + putOnGPU (same pattern as the forward path).

  3. Accumulator initialization: imgSum.putOnGPU() after imgSum.zeros() to place zeros on GPU. All subsequent += operations dispatch to CUDA (operator+= checks isOnGPU || pgA.on_gpu()).

  4. Data flow verification: After the rewrite, all forgeCol objects in the loop (coil_img, conjSmap_col, imgSum) will have isOnGPU=true. The %= and += operators dispatch to CUDA for forgeComplex<float> when either operand is on GPU (verified in forgeCol.hpp operator implementations). No operation falls back to a host loop.

  5. Stream ordering: All operations use the singleton forge::cuda::get_stream(). Since adjoint_device, %= (CUDA kernel), and += (cuBLAS saxpy) all submit to the same stream, they execute in order without explicit synchronization.
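Item 1's typed pointer arithmetic can be sanity-checked with a host-side sketch. `Complex2f` is a stand-in for `forgeComplex<float>` (assuming the standard interleaved re/im layout, 8 bytes per element); the point is that `base + ii * n1` on a typed pointer already yields the correct byte offset, so no manual `sizeof` scaling is needed for the per-coil slice.

```cpp
#include <cstddef>

// Illustrative stand-in for forgeComplex<float>: interleaved re/im pair.
struct Complex2f { float re, im; };

// Byte offset produced by typed pointer arithmetic `base + ii * n1`, as used
// for per-coil slicing in place of subvec.
std::ptrdiff_t coil_byte_offset(const Complex2f* base,
                                std::size_t ii, std::size_t n1) {
    const Complex2f* coil = base + ii * n1;   // typed arithmetic: scales by
                                              // sizeof(Complex2f) automatically
    return reinterpret_cast<const char*>(coil) -
           reinterpret_cast<const char*>(base);
}
```

The same arithmetic applied to a raw `float*` (as in a `const float*` device-pointer signature) would need a factor of 2 per complex element, which is exactly the kind of off-by-a-stride mistake the typed-pointer form avoids.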

File: forge/Operators/SENSE.cpp (SENSE::operator/ forgeCol overload)

Phase 2b: Audit Penalty Chain for GPU Dispatch

The Robject::Penalty(), Robject::Gradient(), and Robject::Denom() forgeCol overloads call several operations on potentially GPU-resident data:

  • Cd() and Ctd() — have CUDA dispatch (verified in Robject.cpp)
  • dpot() / wpot() — return copies or host-allocated ones; consumed by GPU-dispatched %=
  • cdot() — has CUDA dispatch (verified in forgeCol.hpp)
  • abs() — verify CUDA dispatch exists for forgeCol<forgeComplex<float>>
  • sum() — currently calls getFromGPU() then CPU loop (verified in forgeCol.hpp); correct but slow
  • to_complex() — verify behavior with GPU-resident input

If any of these lack CUDA dispatch, they will read stale host data when x_pg is on GPU. Phase 1 instrumentation should catch this (NaN/Inf checks), but an explicit audit during implementation ensures completeness.

File: forge/Core/forgeCol.hpp

Phase 3: Fix CudaPenalty Boundary Bug

Fix kernel_finite_diff_adj to match the CPU implementation:

// Before:
float re = d[2 * i];
float im = d[2 * i + 1];

// After:
float re = (i >= offset) ? d[2 * i] : 0.0f;
float im = (i >= offset) ? d[2 * i + 1] : 0.0f;

File: forge/CUDA/CudaPenalty.cu, lines 29-36

Phase 4: Validation

  1. Run full test suite: ./build/cpu_tests ~"[Benchmark]" — all 86 tests must pass
  2. Verify [SENSE_bench][bench256] gives NRMSE ~0.20256
  3. Verify SENSE adjointness test passes (relError < 1e-5)
  4. Remove or gate Phase 1 debug instrumentation
  5. Run Nsight Systems profile to confirm D2H/H2D elimination

Scope

This spec addresses single-precision (float) only. All CUDA paths are gated by if constexpr (std::is_same<T1, float>::value). Double-precision CUDA support is out of scope.

Files Changed

File | Phase | Change
forge/Core/forgeCol.hpp | 1, 2b | Add debug_validate() method; audit abs(), to_complex() for GPU dispatch
forge/Solvers/solve_pwls_pcg.hpp | 1, 4 | Add/remove validation checkpoints
forge/Operators/SENSE.cpp | 2 | GPU-resident adjoint path
forge/CUDA/CudaPenalty.cu | 3 | Fix kernel_finite_diff_adj boundary

Future Work

After this fix, the following architectural improvements should be considered (tracked in memory):

  • GPU-resident subvec for forgeCol — device pointer slicing without D2H copy
  • Stale-host-data hazard elimination — null out mem or add dirty flag when d_mem is modified; consider a debug-build NaN-poisoning mode where putOnGPU() fills mem with NaN after upload, making stale reads fail loudly
  • forgeMat CUDA support — d_mem/putOnGPU/getFromGPU on forgeMat for GPU-resident sensitivity maps
  • SENSE forward parallelism — per-coil Gnufft could use CUDA graphs or multi-stream for overlap

Testing Strategy

  • Correctness gate: NRMSE within 1e-4 of CPU-only result for SENSE 256x256
  • Adjointness gate: relError < 1e-5 for SENSE forward/adjoint
  • Regression gate: all 86 existing tests pass
  • Performance check: PCG solve time should decrease (no D2H/H2D per iteration)
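The correctness gate's metric can be pinned down with a small reference (assuming the usual ||x − x_ref|| / ||x_ref|| definition; the test harness's exact normalization may differ):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// NRMSE as assumed by the correctness gate: relative L2 error against the
// CPU-only reference reconstruction.
float nrmse(const std::vector<float>& x, const std::vector<float>& xref) {
    double num = 0.0, den = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        double d = double(x[i]) - double(xref[i]);
        num += d * d;
        den += double(xref[i]) * double(xref[i]);
    }
    return float(std::sqrt(num / den));
}
```

Accumulating in double keeps the gate itself from adding float round-off at 256x256 scale; the 1e-4 tolerance then reflects genuine pipeline differences, not metric noise.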