# CUDA PCG Convergence Fix Implementation Plan
For agentic workers: REQUIRED: Use `superpowers:subagent-driven-development` (if subagents available) or `superpowers:executing-plans` to implement this plan. Steps use checkbox (`- [ ]`) syntax for tracking.
Goal: Fix the PCG solver divergence when vectors are placed on GPU by adding debug instrumentation, making the SENSE adjoint GPU-resident, and fixing the CudaPenalty boundary bug.
Architecture: Phased approach — Phase 1 adds diagnostics to confirm the stale-data hypothesis, Phase 2 rewrites the SENSE adjoint to stay GPU-resident (eliminating host/device ping-pong), Phase 2b audits the penalty chain for missing GPU sync, Phase 3 fixes a latent CudaPenalty kernel bug, and Phase 4 validates everything.
Tech Stack: C++17, CUDA, cuBLAS, cuFFT, Catch2 (testing)
Spec: docs/superpowers/specs/2026-03-19-cuda-pcg-convergence-fix-design.md
Build/test commands (DGX Spark):

```bash
# SSH to DGX Spark
ssh clj@100.109.49.10

# Build
cd /tmp/forge-cuda
cmake -B build -S . -DCUDA_COMPUTE=ON
cmake --build build -j$(nproc)

# Run tests (exclude benchmarks)
./build/cpu_tests ~"[Benchmark]"

# Run specific SENSE test
./build/cpu_tests "[SENSE_bench][bench256]"
```
## File Map

| File | Action | Responsibility |
|---|---|---|
| `forge/Core/forgeCol.hpp` | Modify | Add `debug_validate()`, fix `abs()` GPU sync |
| `forge/Solvers/solve_pwls_pcg.hpp` | Modify | Add Phase 1 debug checkpoints |
| `forge/Operators/SENSE.cpp` | Modify | Rewrite CUDA adjoint to be GPU-resident |
| `forge/CUDA/CudaPenalty.cu` | Modify | Fix `kernel_finite_diff_adj` boundary |
| `forge/Tests/operatorTests.cpp` | Modify | Add SENSE adjoint GPU-output test |
## Task 1: Phase 1 — Debug Instrumentation in PCG Solver
Add diagnostic checkpoints to the GPU PCG path that compare GPU-computed values with CPU-recomputed values, confirming whether stale data is the root cause.
Files:
- Modify: forge/Solvers/solve_pwls_pcg.hpp (GPU PCG path, lines ~155-290)
- [ ] Step 1: Add debug sync+compare helper at the top of the file

Add this helper inside the `#if defined(METAL_COMPUTE) || defined(CUDA_COMPUTE)` block, before the GPU PCG function body. It syncs a GPU `forgeCol` to host and computes a CPU-side norm for comparison:

```cpp
#ifndef NDEBUG
namespace pcg_debug {

/// Sync a GPU forgeCol to host (non-destructive copy) and return its L2 norm.
/// Used to verify GPU-resident data isn't corrupt.
template <typename T>
T sync_and_norm(const forgeCol<forgeComplex<T>>& v, const char* label)
{
    forgeCol<forgeComplex<T>> host_copy(v);
    if (host_copy.on_gpu()) host_copy.getFromGPU();
    T acc = T(0);
    for (uword ii = 0; ii < host_copy.n_elem; ii++) {
        acc += host_copy.at(ii).real() * host_copy.at(ii).real()
             + host_copy.at(ii).imag() * host_copy.at(ii).imag();
    }
    T n = std::sqrt(acc);
    fprintf(stderr, "[PCG_DEBUG] %s: norm=%g, n_elem=%lu, on_gpu=%d\n",
            label, (double)n, (unsigned long)v.n_elem, v.on_gpu());
    if (std::isnan(n) || std::isinf(n))
        fprintf(stderr, "[PCG_DEBUG] *** %s has NaN/Inf! ***\n", label);
    return n;
}

/// Compute cdot on CPU from GPU data (sync both, CPU loop), compare with GPU result.
template <typename T>
void compare_cdot(const forgeCol<forgeComplex<T>>& A, const forgeCol<forgeComplex<T>>& B,
                  forgeComplex<T> gpu_result, const char* label)
{
    forgeCol<forgeComplex<T>> a_host(A), b_host(B);
    if (a_host.on_gpu()) a_host.getFromGPU();
    if (b_host.on_gpu()) b_host.getFromGPU();
    T re = T(0), im = T(0);
    for (uword ii = 0; ii < a_host.n_elem; ii++) {
        // conj(A[ii]) * B[ii], expanded into real/imag parts
        forgeComplex<T> ca(a_host.at(ii).real(), -a_host.at(ii).imag());
        forgeComplex<T> prod(ca.real() * b_host.at(ii).real() - ca.imag() * b_host.at(ii).imag(),
                             ca.real() * b_host.at(ii).imag() + ca.imag() * b_host.at(ii).real());
        re += prod.real();
        im += prod.imag();
    }
    T gpu_re = gpu_result.real();
    T cpu_re = re;
    T rel_err = (std::abs(cpu_re) > 1e-30) ? std::abs(gpu_re - cpu_re) / std::abs(cpu_re)
                                           : std::abs(gpu_re - cpu_re);
    fprintf(stderr, "[PCG_DEBUG] %s: GPU_cdot=%g, CPU_cdot=%g, relErr=%g\n",
            label, (double)gpu_re, (double)cpu_re, (double)rel_err);
    if (rel_err > 1e-3)
        fprintf(stderr, "[PCG_DEBUG] *** %s: GPU/CPU cdot MISMATCH (relErr=%g)! ***\n",
                label, (double)rel_err);
}

} // namespace pcg_debug
#endif
```
- [ ] Step 2: Add checkpoints in the GPU PCG iteration loop

Inside the GPU PCG path, after the key operations in the first iteration only (gate with `if (ii == 0)`), add:

After `Ax_pg = A * x_pg` (line ~171):

```cpp
#ifndef NDEBUG
if (ii == 0) pcg_debug::sync_and_norm(Ax_pg, "Ax_pg (iter 0)");
#endif
```

After `ngrad_pg = A / Wresidual` (line ~192):

```cpp
#ifndef NDEBUG
if (ii == 0) pcg_debug::sync_and_norm(ngrad_pg, "ngrad_pg (iter 0)");
#endif
```

After `Adir_pg = A * ddir_pg` (line ~234):

```cpp
#ifndef NDEBUG
if (ii == 0) pcg_debug::sync_and_norm(Adir_pg, "Adir_pg (iter 0)");
#endif
```

After `dAWAd = cdot(Adir_pg, WAdir_pg)` (line ~237):

```cpp
#ifndef NDEBUG
if (ii == 0) pcg_debug::compare_cdot(Adir_pg, WAdir_pg, dAWAd, "dAWAd (iter 0)");
#endif
```
- [ ] Step 3: Build and run the SENSE benchmark test

```bash
ssh clj@100.109.49.10 "cd /tmp/forge-cuda && cmake --build build -j$(nproc) && ./build/cpu_tests '[SENSE_bench][bench256]' 2>&1 | grep -E 'PCG_DEBUG|NRMSE|step j='"
```

Expected output: `[PCG_DEBUG]` lines showing norms and cdot comparisons. If GPU/CPU cdot values disagree (relErr > 1e-3), the stale-data hypothesis is confirmed. If they agree and dAWAd is ~4e+19, also add the same print to the "working" case (temporarily remove `putOnGPU` calls and re-run) to compare.

- [ ] Step 4: Commit instrumentation

```bash
git add forge/Solvers/solve_pwls_pcg.hpp
git commit -m "debug: add PCG GPU/CPU comparison checkpoints for dAWAd diagnosis"
```
## Task 2: Phase 2 — GPU-Resident SENSE Adjoint
Rewrite the SENSE adjoint CUDA path to use adjoint_device with device pointer arithmetic, eliminating the host/device ping-pong.
Files:
- Modify: forge/Operators/SENSE.cpp (CUDA adjoint path, lines ~160-179)
Prerequisites: Task 1 completed. Decision gate: If Phase 1 shows GPU/CPU dAWAd values agree (both ~4e+19), investigate whether the "working" case (no putOnGPU) also has this magnitude before proceeding. Phase 2 is still valuable for performance regardless, but additional debugging may be needed for convergence.
- [ ] Step 1: Write a test for SENSE adjoint returning GPU-resident data

Add a test in forge/Tests/operatorTests.cpp after the existing SENSE adjointness test (around line 201). This test verifies that the SENSE adjoint returns GPU-resident data when given GPU-resident input, and that the result matches the CPU path.

Use the same setup pattern as the existing `SENSE<Gdft>` adjointness test at line 162, but with Gnufft (which has the CUDA pipeline):

```cpp
#ifdef CUDA_COMPUTE
TEST_CASE("SENSE<Gnufft> adjoint returns GPU-resident data and matches CPU", "[SENSE][CUDA]")
{
    // Match the existing SENSE adjointness test pattern
    uword Nx = 8, Ny = 8, Nz = 1;
    uword n1 = Nx * Ny; // image pixels
    uword nc = 2;       // coils
    Col<float> ix, iy, iz, kxG, kyG, kzG, kxN, kyN, kzN;
    makeCartesianCoords(Nx, Ny, ix, iy, iz, kxG, kyG, kzG, kxN, kyN, kzN);
    uword n2 = kxN.n_elem; // k-space samples per coil

    Gnufft<float> G(n2, 2.0, Nx, Ny, Nz, kxN, kyN, kzN, ix, iy, iz);

    // Constant sensitivity maps (flat Col<cx_float>, same as existing test)
    Col<cx_float> senseMap(n1 * nc);
    senseMap.subvec(0, n1 - 1).fill(cx_float(1.0f, 0.0f));
    senseMap.subvec(n1, 2 * n1 - 1).fill(cx_float(0.0f, 1.0f));

    // Constructor: SENSE(G, senseMap, n1_kspace_per_coil, n2_image_pixels, nc)
    SENSE<float, Gnufft<float>> S(G, senseMap, n2, n1, nc);

    // Create test k-space data (all coils concatenated)
    forgeCol<forgeComplex<float>> kdata(n2 * nc);
    for (uword i = 0; i < kdata.n_elem; i++)
        kdata.at(i) = forgeComplex<float>((float)(i % 7) / 7.0f, (float)(i % 5) / 5.0f);

    // Compute CPU reference (kdata on host → CPU/host-wrapper path)
    forgeCol<forgeComplex<float>> cpu_result = S / kdata;

    // Put kdata on GPU and compute (should use GPU-resident path)
    kdata.putOnGPU();
    forgeCol<forgeComplex<float>> gpu_result = S / kdata;

    // Verify result is on GPU
    REQUIRE(gpu_result.on_gpu());

    // Sync to host and compare element-wise
    gpu_result.getFromGPU();
    float maxErr = 0.0f;
    for (uword i = 0; i < cpu_result.n_elem; i++) {
        float dr = cpu_result.at(i).real() - gpu_result.at(i).real();
        float di = cpu_result.at(i).imag() - gpu_result.at(i).imag();
        float err = std::sqrt(dr * dr + di * di);
        if (err > maxErr) maxErr = err;
    }
    float cpuNorm = norm(cpu_result);
    float relErr = maxErr / (cpuNorm + 1e-30f);
    CAPTURE(maxErr, cpuNorm, relErr);
    REQUIRE(relErr < 1e-4f);
}
#endif
```
- [ ] Step 2: Build and verify the test fails (current adjoint returns on host)

```bash
ssh clj@100.109.49.10 "cd /tmp/forge-cuda && cmake --build build -j$(nproc) && ./build/cpu_tests '[SENSE][CUDA]' -v 2>&1"
```

Expected: The `REQUIRE(gpu_result.on_gpu())` assertion fails because the current adjoint returns its result on the host.
- [ ] Step 3: Rewrite the SENSE adjoint CUDA path

In forge/Operators/SENSE.cpp, replace the CUDA adjoint block (inside `#elif defined(CUDA_COMPUTE)` in `operator/`) with:

```cpp
#elif defined(CUDA_COMPUTE)
    if constexpr (std::is_same<T1, float>::value) {
        // GPU-resident adjoint: keep data on GPU, use per-coil D2D slicing
        forgeCol<forgeComplex<T1>> d_gpu(d);
        if (!d_gpu.on_gpu()) d_gpu.putOnGPU();

        forgeCol<forgeComplex<T1>> imgSum(n2);
        imgSum.zeros();
        imgSum.putOnGPU();

        for (unsigned int ii = 0; ii < this->nc; ii++) {
            // Slice per-coil data via D2D copy from the right offset
            forgeCol<forgeComplex<T1>> dSlice(n1);
            dSlice.putOnGPU();
            CUDA_CHECK(cudaMemcpyAsync(
                dSlice.device_memptr(),
                d_gpu.device_memptr() + ii * n1,
                n1 * sizeof(forgeComplex<T1>),
                cudaMemcpyDeviceToDevice,
                forge::cuda::get_stream()));

            // Gnufft::operator/(forgeCol) checks on_gpu() and dispatches
            // to adjoint_device internally — works for any Tobj
            forgeCol<forgeComplex<T1>> coil_img = (*this->G_obj) / dSlice;

            // Weight by conjugate sensitivity map (upload per-coil, same as forward path)
            forgeCol<forgeComplex<T1>> conjSmap_col = conjSMap.col_copy(ii);
            conjSmap_col.putOnGPU();
            coil_img %= conjSmap_col;

            imgSum += coil_img;
        }
        CUDA_CHECK(cudaStreamSynchronize(forge::cuda::get_stream()));
        return imgSum;
    }
#endif
```

Key differences from the old path:

- No `getFromGPU` / `subvec` / host accumulation
- Per-coil slicing via D2D `cudaMemcpyAsync` from offset `device_memptr() + ii * n1`
- Uses `(*G_obj) / dSlice` operator dispatch (not direct cudaPipelineCtx access), so it compiles for all Tobj instantiations (Gnufft, Gdft, TimeSegmentation)
- `dSlice.on_gpu()` is true, so Gnufft dispatches to `adjoint_device` internally
- `imgSum` initialized with `zeros()` then `putOnGPU()` to place zeros on GPU
- All `%=` and `+=` dispatch to CUDA (both operands on GPU)
- `cudaStreamSynchronize` before return (matches the forward path at line 124)
- [ ] Step 4: Build and run the new SENSE CUDA test

```bash
ssh clj@100.109.49.10 "cd /tmp/forge-cuda && cmake --build build -j$(nproc) && ./build/cpu_tests '[SENSE][CUDA]' -v 2>&1"
```

Expected: PASS — GPU result matches CPU within relErr < 1e-4, and `gpu_result.on_gpu()` is true.

- [ ] Step 5: Run the existing SENSE adjointness test

```bash
ssh clj@100.109.49.10 "cd /tmp/forge-cuda && ./build/cpu_tests '[SENSE_adj]' -v 2>&1"
```

Expected: PASS — the adjointness property still holds (relError < 1e-3).

- [ ] Step 6: Run the SENSE benchmark to check convergence

```bash
ssh clj@100.109.49.10 "cd /tmp/forge-cuda && ./build/cpu_tests '[SENSE_bench][bench256]' 2>&1 | grep -E 'NRMSE|step j=|PCG_DEBUG'"
```

Expected: NRMSE ~0.20256 (correct convergence). If this passes, the GPU-resident adjoint fixed the convergence issue. If not, the Phase 1 diagnostics guide further investigation.

- [ ] Step 7: Commit

```bash
git add forge/Operators/SENSE.cpp forge/Tests/operatorTests.cpp
git commit -m "fix: GPU-resident SENSE adjoint eliminates host/device ping-pong in PCG solver"
```
## Task 3: Phase 2b — Audit and Fix Penalty Chain GPU Sync
The `abs()` and `to_complex()` free functions in `forgeCol.hpp` read from host memory via `at()` without syncing GPU data. When called on GPU-resident forgeCol objects (from `R.Penalty(x_pg)` in the PCG loop), they read stale or garbage host data.

This only affects convergence monitoring (the displayed penalty value), not the step computation, but it should be fixed so the diagnostics are trustworthy.
Files:
- Modify: forge/Core/forgeCol.hpp (abs() and to_complex() functions)
- [ ] Step 1: Add GPU sync to `abs()` for complex forgeCol

In forge/Core/forgeCol.hpp, find the `abs()` function for `forgeCol<forgeComplex<T>>` (around line 1141). Add a GPU sync before the host loop:

```cpp
template <typename T> forgeCol<T> abs(const forgeCol<forgeComplex<T>>& A)
{
#ifdef CUDA_COMPUTE
    if (A.on_gpu()) {
        const_cast<forgeCol<forgeComplex<T>>&>(A).getFromGPU();
    }
#endif
    forgeCol<T> out(A.n_elem);
    for (uword ii = 0; ii < A.n_elem; ii++) {
        // ... existing host loop unchanged ...
```

This follows the same pattern used by `sum()` at line 960. Note: this is a stopgap — a CUDA `abs()` kernel would be faster for large vectors. Tracked in future work.
- [ ] Step 2: Add GPU sync to `to_complex()`

Find the `to_complex()` function (around line 1302). Add a GPU sync:

```cpp
template <typename T, ...>
forgeCol<forgeComplex<T>> to_complex(const forgeCol<T>& A)
{
#ifdef CUDA_COMPUTE
    if (A.on_gpu()) {
        const_cast<forgeCol<T>&>(A).getFromGPU();
    }
#endif
    forgeCol<forgeComplex<T>> out(A.n_elem);
    // ... existing host loop unchanged ...
```
- [ ] Step 3: Add GPU sync to `real()` and `imag()` free functions

Find `real()` (around line 1155) and `imag()` (around line 1168). Both iterate via `at()`, which reads host memory directly. Add the same GPU sync pattern:

```cpp
template <typename T> forgeCol<T> real(const forgeCol<forgeComplex<T>>& A)
{
#ifdef CUDA_COMPUTE
    if (A.on_gpu()) {
        const_cast<forgeCol<forgeComplex<T>>&>(A).getFromGPU();
    }
#endif
    forgeCol<T> out(A.n_elem);
    // ... existing host loop unchanged ...
```

Apply the same pattern to `imag()`.
- [ ] Step 4: Build and run full test suite

```bash
ssh clj@100.109.49.10 "cd /tmp/forge-cuda && cmake --build build -j$(nproc) && ./build/cpu_tests ~'[Benchmark]' 2>&1 | tail -5"
```

Expected: All tests pass. The GPU syncs are only exercised when GPU data is present, so existing CPU tests are unaffected.

- [ ] Step 5: Commit

```bash
git add forge/Core/forgeCol.hpp
git commit -m "fix: add GPU sync to abs(), to_complex(), real(), imag() for correct GPU-resident reads"
```
## Task 4: Phase 3 — Fix CudaPenalty Boundary Bug
Fix kernel_finite_diff_adj to match the CPU implementation for boundary elements.
Files:
- Modify: forge/CUDA/CudaPenalty.cu (lines 23-37)
- [ ] Step 1: Fix the kernel

In forge/CUDA/CudaPenalty.cu, replace the kernel body:

```cuda
__global__ void kernel_finite_diff_adj(const float* d, float* out, int n, int offset)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n)
        return;
    // For i < offset:     out[i] = 0 - d[i+offset] (matches CPU Robject::Ctd)
    // For i >= offset:    out[i] = d[i] - d[i+offset]
    // For i+offset >= n:  out[i] = d[i] (no subtraction term)
    float re = (i >= offset) ? d[2 * i] : 0.0f;
    float im = (i >= offset) ? d[2 * i + 1] : 0.0f;
    if (i + offset < n) {
        re -= d[2 * (i + offset)];
        im -= d[2 * (i + offset) + 1];
    }
    out[2 * i] = re;
    out[2 * i + 1] = im;
}
```
- [ ] Step 2: Build and run tests

```bash
ssh clj@100.109.49.10 "cd /tmp/forge-cuda && cmake --build build -j$(nproc) && ./build/cpu_tests ~'[Benchmark]' 2>&1 | tail -5"
```

Expected: All tests pass. This fix is currently neutralized by Cd zeroing, but ensures correctness if Ctd is ever called independently.

- [ ] Step 3: Commit

```bash
git add forge/CUDA/CudaPenalty.cu
git commit -m "fix: correct kernel_finite_diff_adj boundary for i < offset to match CPU"
```
## Task 5: Phase 4 — Validation and Cleanup
Run the full test suite, verify convergence, and clean up debug instrumentation.
Files:
- Modify: forge/Solvers/solve_pwls_pcg.hpp (remove debug checkpoints if convergence is fixed)
- [ ] Step 1: Run full test suite

```bash
ssh clj@100.109.49.10 "cd /tmp/forge-cuda && cmake --build build -j$(nproc) && ./build/cpu_tests ~'[Benchmark]' -v 2>&1"
```

Expected: All 86+ tests pass (including the new `[SENSE][CUDA]` test).

- [ ] Step 2: Verify SENSE benchmark convergence

```bash
ssh clj@100.109.49.10 "cd /tmp/forge-cuda && ./build/cpu_tests '[SENSE_bench][bench256]' 2>&1 | grep -E 'NRMSE|Time'"
```

Expected: NRMSE ~0.20256. Time should be faster than the 686ms baseline (no D2H/H2D per iteration).

- [ ] Step 3: Run Spiral3D reconstruction test

```bash
ssh clj@100.109.49.10 "cd /tmp/forge-cuda && ./build/cpu_tests '[Spiral3D]' 2>&1 | grep -E 'NRMSE|Time'"
```

Expected: NRMSE within acceptable range, no regressions.

- [ ] Step 4: Verify SENSE adjointness

```bash
ssh clj@100.109.49.10 "cd /tmp/forge-cuda && ./build/cpu_tests '[SENSE_adj]' -v 2>&1"
```

Expected: PASS with relError < 1e-3.
- [ ] Step 5: Remove or gate debug instrumentation

If convergence is fixed, remove the `pcg_debug` namespace and all `#ifndef NDEBUG` checkpoints added in Task 1 from solve_pwls_pcg.hpp. The pre-existing `fprintf(stderr, " step j=%u: ...")` at lines ~271-273 is a separate debug print from a prior debugging session — also remove it as part of cleanup.

If convergence is NOT fixed, keep the instrumentation and investigate based on the diagnostic output. See the decision gate in the spec.

- [ ] Step 6: Final commit

```bash
git add forge/Solvers/solve_pwls_pcg.hpp
git commit -m "chore: remove PCG debug instrumentation after convergence fix verified"
```

- [ ] Step 7: Run Nsight Systems profile (optional)

```bash
ssh clj@100.109.49.10 "cd /tmp/forge-cuda && nsys profile -o /tmp/forge_sense2d_fixed ./build/cpu_tests '[SENSE_bench][bench256]'"
```

Compare with the pre-fix profile (`/tmp/forge_sense2d.nsys-rep`) to verify D2H/H2D transfers are eliminated from the PCG iteration loop.