# CUDA PCG Convergence Fix Implementation Plan
For agentic workers: REQUIRED: Use `superpowers:subagent-driven-development` (if subagents available) or `superpowers:executing-plans` to implement this plan. Steps use checkbox (`- [ ]`) syntax for tracking.
Goal: Fix the PCG solver divergence when vectors are placed on GPU by adding debug instrumentation, making the SENSE adjoint GPU-resident, and fixing the CudaPenalty boundary bug.
Architecture: Phased approach — Phase 1 adds diagnostics to confirm the stale-data hypothesis, Phase 2 rewrites the SENSE adjoint to stay GPU-resident (eliminating host/device ping-pong), Phase 2b audits the penalty chain for missing GPU sync, Phase 3 fixes a latent CudaPenalty kernel bug, and Phase 4 validates everything.
Tech Stack: C++17, CUDA, cuBLAS, cuFFT, Catch2 (testing)
Spec: docs/superpowers/specs/2026-03-19-cuda-pcg-convergence-fix-design.md
Build/test commands (DGX Spark):

```bash
# SSH to DGX Spark
ssh clj@100.109.49.10

# Build
cd /tmp/forge-cuda
cmake -B build -S . -DCUDA_COMPUTE=ON
cmake --build build -j$(nproc)

# Run tests (exclude benchmarks)
./build/cpu_tests ~"[Benchmark]"

# Run specific SENSE test
./build/cpu_tests "[SENSE_bench][bench256]"
```
## File Map

| File | Action | Responsibility |
|---|---|---|
| `forge/Core/forgeCol.hpp` | Modify | Add `debug_validate()`, fix `abs()` GPU sync |
| `forge/Solvers/solve_pwls_pcg.hpp` | Modify | Add Phase 1 debug checkpoints |
| `forge/Operators/SENSE.cpp` | Modify | Rewrite CUDA adjoint to be GPU-resident |
| `forge/CUDA/CudaPenalty.cu` | Modify | Fix `kernel_finite_diff_adj` boundary |
| `forge/Tests/operatorTests.cpp` | Modify | Add SENSE adjoint GPU-output test |
## Task 1: Phase 1 — Debug Instrumentation in PCG Solver
Add diagnostic checkpoints to the GPU PCG path that compare GPU-computed values with CPU-recomputed values, confirming whether stale data is the root cause.
Files:
- Modify: forge/Solvers/solve_pwls_pcg.hpp (GPU PCG path, lines ~155-290)
- [ ] Step 1: Add debug sync+compare helper at the top of the file

Add this helper inside the `#if defined(METAL_COMPUTE) || defined(CUDA_COMPUTE)` block, before the GPU PCG function body. It syncs a GPU `forgeCol` to host and computes a CPU-side norm for comparison:

```cpp
#ifndef NDEBUG
namespace pcg_debug {

/// Sync a GPU forgeCol to host (non-destructive copy) and return its L2 norm.
/// Used to verify GPU-resident data isn't corrupt.
template <typename T>
T sync_and_norm(const forgeCol<forgeComplex<T>>& v, const char* label)
{
    forgeCol<forgeComplex<T>> host_copy(v);
    if (host_copy.on_gpu()) host_copy.getFromGPU();
    T acc = T(0);
    for (uword ii = 0; ii < host_copy.n_elem; ii++) {
        acc += host_copy.at(ii).real() * host_copy.at(ii).real()
             + host_copy.at(ii).imag() * host_copy.at(ii).imag();
    }
    T n = std::sqrt(acc);
    fprintf(stderr, "[PCG_DEBUG] %s: norm=%g, n_elem=%lu, on_gpu=%d\n",
            label, (double)n, (unsigned long)v.n_elem, v.on_gpu());
    if (std::isnan(n) || std::isinf(n))
        fprintf(stderr, "[PCG_DEBUG] *** %s has NaN/Inf! ***\n", label);
    return n;
}

/// Compute cdot on CPU from GPU data (sync both, CPU loop), compare with GPU result.
template <typename T>
void compare_cdot(const forgeCol<forgeComplex<T>>& A, const forgeCol<forgeComplex<T>>& B,
                  forgeComplex<T> gpu_result, const char* label)
{
    forgeCol<forgeComplex<T>> a_host(A), b_host(B);
    if (a_host.on_gpu()) a_host.getFromGPU();
    if (b_host.on_gpu()) b_host.getFromGPU();
    T re = T(0), im = T(0);
    for (uword ii = 0; ii < a_host.n_elem; ii++) {
        // conj(A[ii]) * B[ii], expanded into real/imag parts
        forgeComplex<T> ca(a_host.at(ii).real(), -a_host.at(ii).imag());
        forgeComplex<T> prod(ca.real() * b_host.at(ii).real() - ca.imag() * b_host.at(ii).imag(),
                             ca.real() * b_host.at(ii).imag() + ca.imag() * b_host.at(ii).real());
        re += prod.real();
        im += prod.imag();
    }
    T gpu_re = gpu_result.real();
    T cpu_re = re;
    T rel_err = (std::abs(cpu_re) > 1e-30) ? std::abs(gpu_re - cpu_re) / std::abs(cpu_re)
                                           : std::abs(gpu_re - cpu_re);
    fprintf(stderr, "[PCG_DEBUG] %s: GPU_cdot=%g, CPU_cdot=%g, relErr=%g\n",
            label, (double)gpu_re, (double)cpu_re, (double)rel_err);
    if (rel_err > 1e-3)
        fprintf(stderr, "[PCG_DEBUG] *** %s: GPU/CPU cdot MISMATCH (relErr=%g)! ***\n",
                label, (double)rel_err);
}

} // namespace pcg_debug
#endif
```
- [ ] Step 2: Add checkpoints in the GPU PCG iteration loop

Inside the GPU PCG path, after the key operations in the first iteration only (gate with `if (ii == 0)`), add:

After `Ax_pg = A * x_pg` (line ~171):

```cpp
#ifndef NDEBUG
if (ii == 0) pcg_debug::sync_and_norm(Ax_pg, "Ax_pg (iter 0)");
#endif
```

After `ngrad_pg = A / Wresidual` (line ~192):

```cpp
#ifndef NDEBUG
if (ii == 0) pcg_debug::sync_and_norm(ngrad_pg, "ngrad_pg (iter 0)");
#endif
```

After `Adir_pg = A * ddir_pg` (line ~234):

```cpp
#ifndef NDEBUG
if (ii == 0) pcg_debug::sync_and_norm(Adir_pg, "Adir_pg (iter 0)");
#endif
```

After `dAWAd = cdot(Adir_pg, WAdir_pg)` (line ~237):

```cpp
#ifndef NDEBUG
if (ii == 0) pcg_debug::compare_cdot(Adir_pg, WAdir_pg, dAWAd, "dAWAd (iter 0)");
#endif
```
- [ ] Step 3: Build and run the SENSE benchmark test

```bash
ssh clj@100.109.49.10 "cd /tmp/forge-cuda && cmake --build build -j$(nproc) && ./build/cpu_tests '[SENSE_bench][bench256]' 2>&1 | grep -E 'PCG_DEBUG|NRMSE|step j='"
```

Expected output: `[PCG_DEBUG]` lines showing norms and cdot comparisons. If GPU/CPU cdot values disagree (relErr > 1e-3), the stale-data hypothesis is confirmed. If they agree and dAWAd is ~4e+19, also add the same print to the "working" case (temporarily remove `putOnGPU` calls and re-run) to compare.

- [ ] Step 4: Commit instrumentation

```bash
git add forge/Solvers/solve_pwls_pcg.hpp
git commit -m "debug: add PCG GPU/CPU comparison checkpoints for dAWAd diagnosis"
```
## Task 2: Phase 2 — GPU-Resident SENSE Adjoint
Rewrite the SENSE adjoint CUDA path to use adjoint_device with device pointer arithmetic, eliminating the host/device ping-pong.
Files:
- Modify: forge/Operators/SENSE.cpp (CUDA adjoint path, lines ~160-179)
Prerequisites: Task 1 completed. Decision gate: If Phase 1 shows GPU/CPU dAWAd values agree (both ~4e+19), investigate whether the "working" case (no putOnGPU) also has this magnitude before proceeding. Phase 2 is still valuable for performance regardless, but additional debugging may be needed for convergence.
- [ ] Step 1: Write a test for SENSE adjoint returning GPU-resident data

Add a test in forge/Tests/operatorTests.cpp after the existing SENSE adjointness test (around line 201). This test verifies that the SENSE adjoint returns GPU-resident data when given GPU-resident input, and that the result matches the CPU path.

Use the same setup pattern as the existing `SENSE<Gdft>` adjointness test at line 162, but with Gnufft (which has the CUDA pipeline):

```cpp
#ifdef CUDA_COMPUTE
TEST_CASE("SENSE<Gnufft> adjoint returns GPU-resident data and matches CPU", "[SENSE][CUDA]")
{
    // Match the existing SENSE adjointness test pattern
    uword Nx = 8, Ny = 8, Nz = 1;
    uword n1 = Nx * Ny; // image pixels
    uword nc = 2;       // coils
    Col<float> ix, iy, iz, kxG, kyG, kzG, kxN, kyN, kzN;
    makeCartesianCoords(Nx, Ny, ix, iy, iz, kxG, kyG, kzG, kxN, kyN, kzN);
    uword n2 = kxN.n_elem; // k-space samples per coil

    Gnufft<float> G(n2, 2.0, Nx, Ny, Nz, kxN, kyN, kzN, ix, iy, iz);

    // Constant sensitivity maps (flat Col<cx_float>, same as existing test)
    Col<cx_float> senseMap(n1 * nc);
    senseMap.subvec(0, n1 - 1).fill(cx_float(1.0f, 0.0f));
    senseMap.subvec(n1, 2 * n1 - 1).fill(cx_float(0.0f, 1.0f));

    // Constructor: SENSE(G, senseMap, n1_kspace_per_coil, n2_image_pixels, nc)
    SENSE<float, Gnufft<float>> S(G, senseMap, n2, n1, nc);

    // Create test k-space data (all coils concatenated)
    forgeCol<forgeComplex<float>> kdata(n2 * nc);
    for (uword i = 0; i < kdata.n_elem; i++)
        kdata.at(i) = forgeComplex<float>((float)(i % 7) / 7.0f, (float)(i % 5) / 5.0f);

    // Compute CPU reference (kdata on host → CPU/host-wrapper path)
    forgeCol<forgeComplex<float>> cpu_result = S / kdata;

    // Put kdata on GPU and compute (should use GPU-resident path)
    kdata.putOnGPU();
    forgeCol<forgeComplex<float>> gpu_result = S / kdata;

    // Verify result is on GPU
    REQUIRE(gpu_result.on_gpu());

    // Sync to host and compare element-wise
    gpu_result.getFromGPU();
    float maxErr = 0.0f;
    for (uword i = 0; i < cpu_result.n_elem; i++) {
        float dr = cpu_result.at(i).real() - gpu_result.at(i).real();
        float di = cpu_result.at(i).imag() - gpu_result.at(i).imag();
        float err = std::sqrt(dr * dr + di * di);
        if (err > maxErr) maxErr = err;
    }
    float cpuNorm = norm(cpu_result);
    float relErr = maxErr / (cpuNorm + 1e-30f);
    CAPTURE(maxErr, cpuNorm, relErr);
    REQUIRE(relErr < 1e-4f);
}
#endif
```
- [ ] Step 2: Build and verify the test fails (current adjoint returns on host)

```bash
ssh clj@100.109.49.10 "cd /tmp/forge-cuda && cmake --build build -j$(nproc) && ./build/cpu_tests '[SENSE][CUDA]' -v 2>&1"
```

Expected: The `REQUIRE(gpu_result.on_gpu())` assertion fails because the current adjoint returns its result on the host.
- [ ] Step 3: Rewrite the SENSE adjoint CUDA path

In forge/Operators/SENSE.cpp, replace the CUDA adjoint block (inside `#elif defined(CUDA_COMPUTE)` in `operator/`) with:

```cpp
#elif defined(CUDA_COMPUTE)
    if constexpr (std::is_same<T1, float>::value) {
        // GPU-resident adjoint: keep data on GPU, use per-coil D2D slicing
        forgeCol<forgeComplex<T1>> d_gpu(d);
        if (!d_gpu.on_gpu()) d_gpu.putOnGPU();

        forgeCol<forgeComplex<T1>> imgSum(n2);
        imgSum.zeros();
        imgSum.putOnGPU();

        for (unsigned int ii = 0; ii < this->nc; ii++) {
            // Slice per-coil data via D2D copy from the right offset
            forgeCol<forgeComplex<T1>> dSlice(n1);
            dSlice.putOnGPU();
            CUDA_CHECK(cudaMemcpyAsync(
                dSlice.device_memptr(),
                d_gpu.device_memptr() + ii * n1,
                n1 * sizeof(forgeComplex<T1>),
                cudaMemcpyDeviceToDevice,
                forge::cuda::get_stream()));

            // Gnufft::operator/(forgeCol) checks on_gpu() and dispatches
            // to adjoint_device internally — works for any Tobj
            forgeCol<forgeComplex<T1>> coil_img = (*this->G_obj) / dSlice;

            // Weight by conjugate sensitivity map (upload per-coil, same as forward path)
            forgeCol<forgeComplex<T1>> conjSmap_col = conjSMap.col_copy(ii);
            conjSmap_col.putOnGPU();
            coil_img %= conjSmap_col;

            imgSum += coil_img;
        }
        CUDA_CHECK(cudaStreamSynchronize(forge::cuda::get_stream()));
        return imgSum;
    }
#endif
```

Key differences from the old path:

- No `getFromGPU` / `subvec` / host accumulation
- Per-coil slicing via D2D `cudaMemcpyAsync` from offset `device_memptr() + ii * n1`
- Uses `(*G_obj) / dSlice` operator dispatch (not direct cudaPipelineCtx access), so it compiles for all Tobj instantiations (Gnufft, Gdft, TimeSegmentation)
- `dSlice.on_gpu()` is true, so Gnufft dispatches to `adjoint_device` internally
- `imgSum` initialized with `zeros()` then `putOnGPU()` to place zeros on GPU
- All `%=` and `+=` dispatch to CUDA (both operands on GPU)
- `cudaStreamSynchronize` before return (matches the forward path at line 124)
- [ ] Step 4: Build and run the new SENSE CUDA test

```bash
ssh clj@100.109.49.10 "cd /tmp/forge-cuda && cmake --build build -j$(nproc) && ./build/cpu_tests '[SENSE][CUDA]' -v 2>&1"
```

Expected: PASS — GPU result matches CPU within relErr < 1e-4, and `gpu_result.on_gpu()` is true.

- [ ] Step 5: Run the existing SENSE adjointness test

```bash
ssh clj@100.109.49.10 "cd /tmp/forge-cuda && ./build/cpu_tests '[SENSE_adj]' -v 2>&1"
```

Expected: PASS — the adjointness property still holds (relError < 1e-3).

- [ ] Step 6: Run the SENSE benchmark to check convergence

```bash
ssh clj@100.109.49.10 "cd /tmp/forge-cuda && ./build/cpu_tests '[SENSE_bench][bench256]' 2>&1 | grep -E 'NRMSE|step j=|PCG_DEBUG'"
```

Expected: NRMSE ~0.20256 (correct convergence). If this passes, the GPU-resident adjoint fixed the convergence issue. If not, the Phase 1 diagnostics guide further investigation.

- [ ] Step 7: Commit

```bash
git add forge/Operators/SENSE.cpp forge/Tests/operatorTests.cpp
git commit -m "fix: GPU-resident SENSE adjoint eliminates host/device ping-pong in PCG solver"
```
## Task 3: Phase 2b — Audit and Fix Penalty Chain GPU Sync
The `abs()` and `to_complex()` free functions in `forgeCol.hpp` read from host memory via `at()` without syncing GPU data. When called on GPU-resident forgeCol objects (from `R.Penalty(x_pg)` in the PCG loop), they read stale or garbage host data.

This only affects convergence monitoring (the displayed penalty value), not the step computation, but it should be fixed so the diagnostics are trustworthy.
Files:
- Modify: forge/Core/forgeCol.hpp (abs() and to_complex() functions)
- [ ] Step 1: Add GPU sync to `abs()` for complex forgeCol

In forge/Core/forgeCol.hpp, find the `abs()` function for `forgeCol<forgeComplex<T>>` (around line 1141). Add a GPU sync before the host loop:

```cpp
template <typename T> forgeCol<T> abs(const forgeCol<forgeComplex<T>>& A)
{
#ifdef CUDA_COMPUTE
    if (A.on_gpu()) {
        const_cast<forgeCol<forgeComplex<T>>&>(A).getFromGPU();
    }
#endif
    forgeCol<T> out(A.n_elem);
    for (uword ii = 0; ii < A.n_elem; ii++) {
        // ... existing host loop unchanged ...
```

This follows the same pattern used by `sum()` at line 960. Note: this is a stopgap — a CUDA `abs()` kernel would be faster for large vectors. Tracked in future work.
- [ ] Step 2: Add GPU sync to `to_complex()`

Find the `to_complex()` function (around line 1302). Add a GPU sync:

```cpp
template <typename T, ...>
forgeCol<forgeComplex<T>> to_complex(const forgeCol<T>& A)
{
#ifdef CUDA_COMPUTE
    if (A.on_gpu()) {
        const_cast<forgeCol<T>&>(A).getFromGPU();
    }
#endif
    forgeCol<forgeComplex<T>> out(A.n_elem);
    // ... existing host loop unchanged ...
```
- [ ] Step 3: Add GPU sync to `real()` and `imag()` free functions

Find `real()` (around line 1155) and `imag()` (around line 1168). Both iterate via `at()`, which reads host memory directly. Add the same GPU sync pattern:

```cpp
template <typename T> forgeCol<T> real(const forgeCol<forgeComplex<T>>& A)
{
#ifdef CUDA_COMPUTE
    if (A.on_gpu()) {
        const_cast<forgeCol<forgeComplex<T>>&>(A).getFromGPU();
    }
#endif
    forgeCol<T> out(A.n_elem);
    // ... existing host loop unchanged ...
```

Apply the same pattern to `imag()`.
- [ ] Step 4: Build and run full test suite

```bash
ssh clj@100.109.49.10 "cd /tmp/forge-cuda && cmake --build build -j$(nproc) && ./build/cpu_tests ~'[Benchmark]' 2>&1 | tail -5"
```

Expected: All tests pass. The GPU syncs are only exercised when GPU data is present, so existing CPU tests are unaffected.

- [ ] Step 5: Commit

```bash
git add forge/Core/forgeCol.hpp
git commit -m "fix: add GPU sync to abs(), to_complex(), real(), imag() for correct GPU-resident reads"
```
## Task 4: Phase 3 — Fix CudaPenalty Boundary Bug
Fix kernel_finite_diff_adj to match the CPU implementation for boundary elements.
Files:
- Modify: forge/CUDA/CudaPenalty.cu (lines 23-37)
- [ ] Step 1: Fix the kernel

In forge/CUDA/CudaPenalty.cu, replace the kernel body:

```cuda
__global__ void kernel_finite_diff_adj(const float* d, float* out, int n, int offset)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n)
        return;
    // For i < offset:     out[i] = 0 - d[i+offset] (matches CPU Robject::Ctd)
    // For i >= offset:    out[i] = d[i] - d[i+offset]
    // For i+offset >= n:  out[i] = d[i] (no subtraction term)
    float re = (i >= offset) ? d[2 * i] : 0.0f;
    float im = (i >= offset) ? d[2 * i + 1] : 0.0f;
    if (i + offset < n) {
        re -= d[2 * (i + offset)];
        im -= d[2 * (i + offset) + 1];
    }
    out[2 * i] = re;
    out[2 * i + 1] = im;
}
```
- [ ] Step 2: Build and run tests

```bash
ssh clj@100.109.49.10 "cd /tmp/forge-cuda && cmake --build build -j$(nproc) && ./build/cpu_tests ~'[Benchmark]' 2>&1 | tail -5"
```

Expected: All tests pass. This fix is currently neutralized by Cd zeroing, but ensures correctness if Ctd is ever called independently.

- [ ] Step 3: Commit

```bash
git add forge/CUDA/CudaPenalty.cu
git commit -m "fix: correct kernel_finite_diff_adj boundary for i < offset to match CPU"
```
## Task 5: Phase 4 — Validation and Cleanup
Run the full test suite, verify convergence, and clean up debug instrumentation.
Files:
- Modify: forge/Solvers/solve_pwls_pcg.hpp (remove debug checkpoints if convergence is fixed)
- [ ] Step 1: Run full test suite

```bash
ssh clj@100.109.49.10 "cd /tmp/forge-cuda && cmake --build build -j$(nproc) && ./build/cpu_tests ~'[Benchmark]' -v 2>&1"
```

Expected: All 86+ tests pass (including the new `[SENSE][CUDA]` test).

- [ ] Step 2: Verify SENSE benchmark convergence

```bash
ssh clj@100.109.49.10 "cd /tmp/forge-cuda && ./build/cpu_tests '[SENSE_bench][bench256]' 2>&1 | grep -E 'NRMSE|Time'"
```

Expected: NRMSE ~0.20256. Time should be faster than the 686ms baseline (no D2H/H2D per iteration).

- [ ] Step 3: Run Spiral3D reconstruction test

```bash
ssh clj@100.109.49.10 "cd /tmp/forge-cuda && ./build/cpu_tests '[Spiral3D]' 2>&1 | grep -E 'NRMSE|Time'"
```

Expected: NRMSE within acceptable range, no regressions.

- [ ] Step 4: Verify SENSE adjointness

```bash
ssh clj@100.109.49.10 "cd /tmp/forge-cuda && ./build/cpu_tests '[SENSE_adj]' -v 2>&1"
```

Expected: PASS with relError < 1e-3.
- [ ] Step 5: Remove or gate debug instrumentation

If convergence is fixed, remove the `pcg_debug` namespace and all `#ifndef NDEBUG` checkpoints added in Task 1 from solve_pwls_pcg.hpp. The pre-existing `fprintf(stderr, " step j=%u: ...")` at lines ~271-273 is a separate debug print from a prior debugging session — also remove it as part of cleanup.

If convergence is NOT fixed, keep the instrumentation and investigate based on the diagnostic output. See the decision gate in the spec.

- [ ] Step 6: Final commit

```bash
git add forge/Solvers/solve_pwls_pcg.hpp
git commit -m "chore: remove PCG debug instrumentation after convergence fix verified"
```

- [ ] Step 7: Run Nsight Systems profile (optional)

```bash
ssh clj@100.109.49.10 "cd /tmp/forge-cuda && nsys profile -o /tmp/forge_sense2d_fixed ./build/cpu_tests '[SENSE_bench][bench256]'"
```

Compare with the pre-fix profile (`/tmp/forge_sense2d.nsys-rep`) to verify D2H/H2D transfers are eliminated from the PCG iteration loop.