
# CUDA Performance Optimization Implementation Plan

For agentic workers: REQUIRED: Use superpowers:subagent-driven-development (if subagents available) or superpowers:executing-plans to implement this plan. Steps use checkbox (- [ ]) syntax for tracking.

IMPORTANT: All CUDA builds and tests run on the DGX Spark via SSH. Workflow: edit locally → rsync → build+test via SSH.

```bash
rsync -az --exclude build --exclude build-cpu --exclude .cache --exclude .worktrees \
    --exclude '*.nii' --exclude '*.log' --exclude '*.dat' \
    /Users/acerjanic/Developer/forge/ clj@100.109.49.10:/tmp/forge-cuda/
ssh clj@100.109.49.10 "cd /tmp/forge-cuda && export PATH=/usr/local/cuda/bin:\$PATH && \
    cmake --build build --target cpu_tests -j\$(nproc)"
```

Goal: Eliminate cudaMalloc/cudaFree churn and host↔device transfer overhead in the CUDA PCG reconstruction loop, improving the current 1.45x speedup to 3x+ over CPU.

Architecture: Two changes: (1) replace cudaMalloc/cudaFree with cudaMallocAsync/cudaFreeAsync in forgeCol for near-zero-cost temporary allocations, (2) keep PCG solver vectors on GPU for the entire solve and route SENSE operator through forgeCol GPU path, avoiding forgeMat host syncs.

Tech Stack: CUDA 13.0 (cudaMallocAsync), cuFFT, cuBLAS, forgeCol GPU dispatch

Spec: docs/superpowers/specs/2026-03-19-cuda-performance-optimization-design.md

Baseline (DGX Spark, SENSE 256x256): CPU 936ms, CUDA 645ms (1.45x)


## File Structure

### Files to Modify

| File | Change |
|------|--------|
| forge/Core/forgeCol.hpp | Replace `cudaMalloc` → `cudaMallocAsync`, `cudaFree` → `cudaFreeAsync` at all allocation sites |
| forge/Solvers/solve_pwls_pcg.hpp | Put persistent PCG vectors on GPU at solver start; use `yi_ref`/`W_ref` aliases |
| forge/Operators/SENSE.cpp | CUDA path: avoid forgeMat — concatenate forgeCol results directly |

## Task 1: cudaMallocAsync in forgeCol

Files:

- Modify: `forge/Core/forgeCol.hpp`

- [ ] Step 1: Find all cudaMalloc/cudaFree calls in forgeCol.hpp

Search for `cudaMalloc(` and `cudaFree(` in the file. Expected locations:

- `putOnGPU()`: `cudaMalloc(&d_mem, ...)`
- Destructor: `cudaFree(d_mem)`
- `set_size()`: `cudaFree(d_mem)`
- `operator=` (copy): `cudaMalloc(&d_mem, ...)`
- Copy constructor: `cudaMalloc(&d_mem, ...)`

- [ ] Step 2: Replace all cudaMalloc with cudaMallocAsync

Change every `cudaMalloc(&d_mem, n_elem * sizeof(T))` to:

```cpp
cudaMallocAsync(&d_mem, n_elem * sizeof(T), forge::cuda::get_stream())
```

Change every `cudaFree(d_mem)` to:

```cpp
cudaFreeAsync(d_mem, forge::cuda::get_stream())
```

Note: `cudaFreeAsync` with a null pointer is a safe no-op, just like `cudaFree`.

- [ ] Step 3: Verify Metal build locally

```bash
cmake --build build --target ForgeCommon -j4
```
- [ ] Step 4: Verify CUDA build and tests on DGX Spark

```bash
# rsync + build + test
ssh clj@100.109.49.10 "cd /tmp/forge-cuda && export PATH=/usr/local/cuda/bin:\$PATH && \
    cmake --build build --target cpu_tests -j\$(nproc) && \
    ./build/cpu_tests '~[Benchmark]' '~[ISMRMRD]' '~[Spiral3D]' '~[directRecon]' \
    '~[SENSE_bench]' '~[pcSENSE_bench]' '~[TimeSeg_bench]' '~[Gdft_bench]'"
```

Expected: All 86 fast tests pass.

- [ ] Step 5: Benchmark

```bash
ssh clj@100.109.49.10 "cd /tmp/forge-cuda && \
    for i in 1 2 3; do ./build/cpu_tests '[SENSE_bench][bench256]' 2>&1 | grep 'Total\|NRMSE'; done"
```

Expected: NRMSE 0.20256 (unchanged). Time should be slightly faster than 645ms.

- [ ] Step 6: Commit

```bash
git add forge/Core/forgeCol.hpp
git commit -m "perf: use cudaMallocAsync/cudaFreeAsync in forgeCol for pooled GPU allocations"
```

## Task 2: GPU-Resident PCG Solver Vectors

Files:

- Modify: `forge/Solvers/solve_pwls_pcg.hpp`

- [ ] Step 1: Add GPU upload of persistent vectors

In the GPU PCG path (inside `#if defined(METAL_COMPUTE) || defined(CUDA_COMPUTE)`), after `forgeCol<forgeComplex<T1>> x_pg = xInitial;`, add:

```cpp
#ifdef CUDA_COMPUTE
        // Upload persistent vectors to GPU — they stay there for the entire solve
        x_pg.putOnGPU();
        forgeCol<forgeComplex<T1>> yi_gpu(yi_pg); yi_gpu.putOnGPU();
        forgeCol<T1> W_gpu(W_pg); W_gpu.putOnGPU();
        auto& yi_ref = yi_gpu;
        auto& W_ref = W_gpu;
#else
        auto& yi_ref = yi_pg;
        auto& W_ref = W_pg;
#endif
```
- [ ] Step 2: Replace yi_pg and W_pg references in the GPU loop body

In the GPU PCG loop body, replace every `yi_pg` with `yi_ref` and every `W_pg` with `W_ref`. Affected expressions:

- `W_pg % yi_pg` (yWy pre-computation)
- `yi_pg - Ax_pg` (residual computation, occurs 3 times)
- `W_pg % residual` / `W_pg % Wresidual` (weighted residual)
- `W_pg % Adir_pg` (WAdir computation)

Do NOT change the Armadillo fallback path (lines after #endif).

- [ ] Step 3: Verify Metal build locally

```bash
cmake --build build --target ForgeCommon -j4
```
- [ ] Step 4: Verify CUDA correctness on DGX Spark

```bash
ssh clj@100.109.49.10 "cd /tmp/forge-cuda && export PATH=/usr/local/cuda/bin:\$PATH && \
    cmake --build build --target cpu_tests -j\$(nproc) && \
    ./build/cpu_tests '[PCG convergence]' 2>&1 | head -15 && \
    ./build/cpu_tests '[SENSE adjoint]' 2>&1 | tail -3 && \
    ./build/cpu_tests '~[Benchmark]' '~[ISMRMRD]' '~[Spiral3D]' '~[directRecon]' \
    '~[SENSE_bench]' '~[pcSENSE_bench]' '~[TimeSeg_bench]' '~[Gdft_bench]' 2>&1 | tail -3"
```

Expected: PCG converges (error decreasing), all 86 tests pass.

- [ ] Step 5: Benchmark

```bash
ssh clj@100.109.49.10 "cd /tmp/forge-cuda && \
    for i in 1 2 3; do ./build/cpu_tests '[SENSE_bench][bench256]' 2>&1 | grep 'Total\|NRMSE'; done"
```

Expected: Significant speedup over 645ms. NRMSE 0.20256.

- [ ] Step 6: Commit

```bash
git add forge/Solvers/solve_pwls_pcg.hpp
git commit -m "perf: keep PCG solver vectors on GPU for entire solve — eliminates per-iteration H2D/D2H"
```

## Task 3: SENSE Operator GPU Data Flow

Files:

- Modify: `forge/Operators/SENSE.cpp`

The SENSE CUDA forward path currently stores per-coil Gnufft results in `outData_pg` (a forgeMat, host-only) via `set_col`, which forces `getFromGPU()` syncs. Fix: concatenate the forgeCol results directly.

- [ ] Step 1: Rewrite SENSE forward CUDA path

Replace the current CUDA forward path:

```cpp
#elif defined(CUDA_COMPUTE)
    if constexpr (std::is_same<T1, float>::value) {
        forgeCol<forgeComplex<T1>> d_gpu(d);
        d_gpu.putOnGPU();
        for (unsigned int ii = 0; ii < this->nc; ii++) {
            forgeCol<forgeComplex<T1>> smap_col = SMap.col_copy(ii);
            smap_col.putOnGPU();
            forgeCol<forgeComplex<T1>> weighted(d_gpu);
            weighted %= smap_col;
            forgeCol<forgeComplex<T1>> coil_result = (*this->G_obj) * weighted;
            coil_result.getFromGPU();
            outData_pg.set_col(ii, coil_result);
        }
        return vectorise(outData_pg);
    }
```

With:

```cpp
#elif defined(CUDA_COMPUTE)
    if constexpr (std::is_same<T1, float>::value) {
        // Keep data on GPU for the per-coil work — avoid forgeMat host syncs
        forgeCol<forgeComplex<T1>> d_gpu(d);
        if (!d_gpu.on_gpu()) d_gpu.putOnGPU();

        // Concatenate per-coil results directly into one output vector (host side)
        forgeCol<forgeComplex<T1>> allData(n1 * nc);

        for (unsigned int ii = 0; ii < this->nc; ii++) {
            forgeCol<forgeComplex<T1>> smap_col = SMap.col_copy(ii);
            smap_col.putOnGPU();
            forgeCol<forgeComplex<T1>> weighted(d_gpu);
            weighted %= smap_col;
            forgeCol<forgeComplex<T1>> coil_result = (*this->G_obj) * weighted;
            // Sync this coil's result to host, then copy it into its slice of allData
            coil_result.getFromGPU();
            std::memcpy(allData.memptr() + ii * n1, coil_result.memptr(),
                        n1 * sizeof(forgeComplex<T1>));
        }
        return allData;
    }
```

Note: We still call `getFromGPU()` per coil because `std::memcpy` operates on host pointers (include `<cstring>` if not already present). This is therefore a partial fix: full GPU-side concatenation would require a CUDA memcpy kernel or forgeMat CUDA support. It nonetheless skips the forgeMat entirely and eliminates the `set_col` + `vectorise` overhead. The result also does not need to stay on GPU, since the PCG solver's `A * x_pg` call wraps it in a new forgeCol that the mixed-operand dispatch puts back on the GPU.

- [ ] Step 2: Rewrite SENSE adjoint CUDA path

Same approach: avoid forgeMat and work with forgeCol results directly. The adjoint sums across coils rather than concatenating, so the pattern differs:

```cpp
#elif defined(CUDA_COMPUTE)
    if constexpr (std::is_same<T1, float>::value) {
        forgeCol<forgeComplex<T1>> imgSum(n2);
        imgSum.zeros();

        for (unsigned int ii = 0; ii < this->nc; ii++) {
            forgeCol<forgeComplex<T1>> dSlice = d.subvec(ii * n1, (ii + 1) * n1 - 1);
            forgeCol<forgeComplex<T1>> coil_img = (*this->G_obj) / dSlice;
            // Weight by conjugate sensitivity map
            forgeCol<forgeComplex<T1>> conjSmap_col = conjSMap.col_copy(ii);
            coil_img %= conjSmap_col;
            imgSum += coil_img;
        }
        return imgSum;
    }
```

This avoids forgeMat entirely — the sum accumulates in `imgSum` on the host (or on the GPU, if the operators push it there via mixed-operand dispatch).

- [ ] Step 3: Verify Metal build locally

```bash
cmake --build build --target ForgeCommon -j4
```

- [ ] Step 4: Verify all tests on DGX Spark

```bash
ssh clj@100.109.49.10 "cd /tmp/forge-cuda && export PATH=/usr/local/cuda/bin:\$PATH && \
    cmake --build build --target cpu_tests -j\$(nproc) && \
    ./build/cpu_tests '[SENSE adjoint]' 2>&1 | tail -3 && \
    ./build/cpu_tests '[SENSE_synthetic]' 2>&1 | tail -3 && \
    ./build/cpu_tests '[PCG convergence]' 2>&1 | tail -3 && \
    ./build/cpu_tests '~[Benchmark]' '~[ISMRMRD]' '~[Spiral3D]' '~[directRecon]' \
    '~[SENSE_bench]' '~[pcSENSE_bench]' '~[TimeSeg_bench]' '~[Gdft_bench]' 2>&1 | tail -3"
```

Expected: All tests pass.

- [ ] Step 5: Benchmark + profile

```bash
# Benchmark
ssh clj@100.109.49.10 "cd /tmp/forge-cuda && \
    echo '=== CUDA (3 runs) ===' && \
    for i in 1 2 3; do ./build/cpu_tests '[SENSE_bench][bench256]' 2>&1 | grep 'Total\|NRMSE'; done && \
    echo '=== CPU baseline (3 runs) ===' && \
    for i in 1 2 3; do ./build-cpu/cpu_tests '[SENSE_bench][bench256]' 2>&1 | grep Total; done"

# Re-profile
ssh clj@100.109.49.10 "cd /tmp/forge-cuda && \
    nsys profile --stats=true --force-overwrite=true -o /tmp/forge_sense2d_opt \
    ./build/cpu_tests '[SENSE_bench][bench256]' 2>&1 | tail -40"
```

Compare profile against baseline: memcpy and malloc should be dramatically reduced.

- [ ] Step 6: Commit

```bash
git add forge/Operators/SENSE.cpp
git commit -m "perf: SENSE CUDA path avoids forgeMat — direct forgeCol concatenation/summation"
```

## Task 4: Final Verification + PR

- [ ] Step 1: Run full test suite on DGX Spark

```bash
ssh clj@100.109.49.10 "cd /tmp/forge-cuda && \
    ./build/cpu_tests '~[Benchmark]' '~[SENSE_bench]' '~[pcSENSE_bench]' \
    '~[TimeSeg_bench]' '~[Gdft_bench]' '~[bench256]' 2>&1 | tail -3"
```

Expected: 101 tests pass.

- [ ] Step 2: Run the SENSE 256x256 benchmark 5 times for stable numbers

- [ ] Step 3: Verify Metal tests still pass locally

```bash
cmake --build build --target cpu_tests metal_tests -j4 && \
    ./build/cpu_tests '~[Benchmark]' && ./build/metal_tests '~[Benchmark]'
```
- [ ] Step 4: Push and create PR

## Summary of Commits

| Task | Commit Message |
|------|----------------|
| 1 | perf: use cudaMallocAsync/cudaFreeAsync in forgeCol for pooled GPU allocations |
| 2 | perf: keep PCG solver vectors on GPU for entire solve |
| 3 | perf: SENSE CUDA path avoids forgeMat — direct forgeCol concatenation/summation |