# CUDA Performance Optimization Implementation Plan
**For agentic workers:** REQUIRED: Use superpowers:subagent-driven-development (if subagents are available) or superpowers:executing-plans to implement this plan. Steps use checkbox (`- [ ]`) syntax for tracking.

**IMPORTANT:** All CUDA builds and tests run on the DGX Spark via SSH. Workflow: edit locally → rsync → build+test via SSH.

```bash
rsync -az --exclude build --exclude build-cpu --exclude .cache --exclude .worktrees \
  --exclude '*.nii' --exclude '*.log' --exclude '*.dat' \
  /Users/acerjanic/Developer/forge/ clj@100.109.49.10:/tmp/forge-cuda/
ssh clj@100.109.49.10 "cd /tmp/forge-cuda && export PATH=/usr/local/cuda/bin:\$PATH && \
  cmake --build build --target cpu_tests -j\$(nproc)"
```
Goal: Eliminate cudaMalloc/cudaFree churn and host↔device transfer overhead in the CUDA PCG reconstruction loop, improving the current 1.45x speedup to 3x+ over CPU.
Architecture: Two changes: (1) replace cudaMalloc/cudaFree with cudaMallocAsync/cudaFreeAsync in forgeCol for near-zero-cost temporary allocations, (2) keep PCG solver vectors on GPU for the entire solve and route SENSE operator through forgeCol GPU path, avoiding forgeMat host syncs.
Tech Stack: CUDA 13.0 (cudaMallocAsync), cuFFT, cuBLAS, forgeCol GPU dispatch
Spec: docs/superpowers/specs/2026-03-19-cuda-performance-optimization-design.md
Baseline (DGX Spark, SENSE 256x256): CPU 936ms, CUDA 645ms (1.45x)
## File Structure
### Files to Modify
| File | Change |
|---|---|
| `forge/Core/forgeCol.hpp` | Replace `cudaMalloc`→`cudaMallocAsync`, `cudaFree`→`cudaFreeAsync` at all allocation sites |
| `forge/Solvers/solve_pwls_pcg.hpp` | Put persistent PCG vectors on GPU at solver start; use `yi_ref`/`W_ref` aliases |
| `forge/Operators/SENSE.cpp` | CUDA path: avoid `forgeMat` — concatenate `forgeCol` results directly on GPU |
## Task 1: cudaMallocAsync in forgeCol
Files:
- Modify: forge/Core/forgeCol.hpp
- [ ] Step 1: Find all cudaMalloc/cudaFree calls in forgeCol.hpp
Search for cudaMalloc( and cudaFree( in the file. Expected locations:
- putOnGPU(): cudaMalloc(&d_mem, ...)
- Destructor: cudaFree(d_mem)
- set_size(): cudaFree(d_mem)
- operator= (copy): cudaMalloc(&d_mem, ...)
- Copy constructor: cudaMalloc(&d_mem, ...)
- [ ] Step 2: Replace all cudaMalloc with cudaMallocAsync

Change every `cudaMalloc(&d_mem, n_elem * sizeof(T))` to:

```cpp
cudaMallocAsync(&d_mem, n_elem * sizeof(T), forge::cuda::get_stream())
```

Change every `cudaFree(d_mem)` to:

```cpp
cudaFreeAsync(d_mem, forge::cuda::get_stream())
```

Note: cudaFreeAsync with a null pointer is safe (a no-op), just like cudaFree.
- [ ] Step 3: Verify Metal build locally

```bash
cmake --build build --target ForgeCommon -j4
```
- [ ] Step 4: Verify CUDA build and tests on DGX Spark

```bash
# rsync + build + test
ssh clj@100.109.49.10 "cd /tmp/forge-cuda && export PATH=/usr/local/cuda/bin:\$PATH && \
  cmake --build build --target cpu_tests -j\$(nproc) && \
  ./build/cpu_tests '~[Benchmark]' '~[ISMRMRD]' '~[Spiral3D]' '~[directRecon]' \
  '~[SENSE_bench]' '~[pcSENSE_bench]' '~[TimeSeg_bench]' '~[Gdft_bench]'"
```
Expected: All 86 fast tests pass.
- [ ] Step 5: Benchmark

```bash
ssh clj@100.109.49.10 "cd /tmp/forge-cuda && \
  for i in 1 2 3; do ./build/cpu_tests '[SENSE_bench][bench256]' 2>&1 | grep 'Total\|NRMSE'; done"
```

Expected: NRMSE 0.20256 (unchanged). Time should be slightly under the 645ms baseline.
- [ ] Step 6: Commit

```bash
git add forge/Core/forgeCol.hpp
git commit -m "perf: use cudaMallocAsync/cudaFreeAsync in forgeCol for pooled GPU allocations"
```
## Task 2: GPU-Resident PCG Solver Vectors
Files:
- Modify: forge/Solvers/solve_pwls_pcg.hpp
- [ ] Step 1: Add GPU upload of persistent vectors

In the GPU PCG path (inside `#if defined(METAL_COMPUTE) || defined(CUDA_COMPUTE)`), after `forgeCol<forgeComplex<T1>> x_pg = xInitial;`, add:

```cpp
#ifdef CUDA_COMPUTE
// Upload persistent vectors to GPU — they stay there for the entire solve
x_pg.putOnGPU();
forgeCol<forgeComplex<T1>> yi_gpu(yi_pg); yi_gpu.putOnGPU();
forgeCol<T1> W_gpu(W_pg); W_gpu.putOnGPU();
auto& yi_ref = yi_gpu;
auto& W_ref = W_gpu;
#else
auto& yi_ref = yi_pg;
auto& W_ref = W_pg;
#endif
```
- [ ] Step 2: Replace yi_pg and W_pg references in the GPU loop body

In the GPU PCG loop body, replace every `yi_pg` with `yi_ref` and every `W_pg` with `W_ref`. Affected expressions:
- `W_pg % yi_pg` (yWy pre-computation)
- `yi_pg - Ax_pg` (residual computation, occurs 3 times)
- `W_pg % residual` / `W_pg % Wresidual` (weighted residual)
- `W_pg % Adir_pg` (WAdir computation)

Do NOT change the Armadillo fallback path (the lines after `#endif`).
- [ ] Step 3: Verify Metal build locally

```bash
cmake --build build --target ForgeCommon -j4
```
- [ ] Step 4: Verify CUDA correctness on DGX Spark

```bash
ssh clj@100.109.49.10 "cd /tmp/forge-cuda && export PATH=/usr/local/cuda/bin:\$PATH && \
  cmake --build build --target cpu_tests -j\$(nproc) && \
  ./build/cpu_tests '[PCG convergence]' 2>&1 | head -15 && \
  ./build/cpu_tests '[SENSE adjoint]' 2>&1 | tail -3 && \
  ./build/cpu_tests '~[Benchmark]' '~[ISMRMRD]' '~[Spiral3D]' '~[directRecon]' \
  '~[SENSE_bench]' '~[pcSENSE_bench]' '~[TimeSeg_bench]' '~[Gdft_bench]' 2>&1 | tail -3"
```

Expected: PCG converges (error decreasing), all 86 tests pass.
- [ ] Step 5: Benchmark

```bash
ssh clj@100.109.49.10 "cd /tmp/forge-cuda && \
  for i in 1 2 3; do ./build/cpu_tests '[SENSE_bench][bench256]' 2>&1 | grep 'Total\|NRMSE'; done"
```

Expected: a significant speedup over the 645ms baseline; NRMSE 0.20256 unchanged.
- [ ] Step 6: Commit

```bash
git add forge/Solvers/solve_pwls_pcg.hpp
git commit -m "perf: keep PCG solver vectors on GPU for entire solve — eliminates per-iteration H2D/D2H"
```
## Task 3: SENSE Operator GPU Data Flow
Files:
- Modify: forge/Operators/SENSE.cpp
The SENSE CUDA forward path currently stores per-coil Gnufft results in `outData_pg` (a host-only `forgeMat`) via `set_col`, which forces `getFromGPU()` syncs. Fix: concatenate `forgeCol` results directly.
- [ ] Step 1: Rewrite SENSE forward CUDA path

Replace the current CUDA forward path:

```cpp
#elif defined(CUDA_COMPUTE)
if constexpr (std::is_same<T1, float>::value) {
    forgeCol<forgeComplex<T1>> d_gpu(d);
    d_gpu.putOnGPU();
    for (unsigned int ii = 0; ii < this->nc; ii++) {
        forgeCol<forgeComplex<T1>> smap_col = SMap.col_copy(ii);
        smap_col.putOnGPU();
        forgeCol<forgeComplex<T1>> weighted(d_gpu);
        weighted %= smap_col;
        forgeCol<forgeComplex<T1>> coil_result = (*this->G_obj) * weighted;
        coil_result.getFromGPU();
        outData_pg.set_col(ii, coil_result);
    }
    return vectorise(outData_pg);
}
```
With:

```cpp
#elif defined(CUDA_COMPUTE)
if constexpr (std::is_same<T1, float>::value) {
    // Keep data on GPU — avoid forgeMat host syncs
    forgeCol<forgeComplex<T1>> d_gpu(d);
    if (!d_gpu.on_gpu()) d_gpu.putOnGPU();
    // Concatenate per-coil results directly into the output vector
    forgeCol<forgeComplex<T1>> allData(n1 * nc);
    allData.putOnGPU();
    for (unsigned int ii = 0; ii < this->nc; ii++) {
        forgeCol<forgeComplex<T1>> smap_col = SMap.col_copy(ii);
        smap_col.putOnGPU();
        forgeCol<forgeComplex<T1>> weighted(d_gpu);
        weighted %= smap_col;
        forgeCol<forgeComplex<T1>> coil_result = (*this->G_obj) * weighted;
        // Host-side copy for now: sync the coil result, then memcpy it
        // into the right slice of allData
        coil_result.getFromGPU();
        std::memcpy(allData.memptr() + ii * n1, coil_result.memptr(),
                    n1 * sizeof(forgeComplex<T1>));
    }
    return allData;
}
```

Note: this is a partial fix. We still call `getFromGPU()` per coil because `std::memcpy` operates on host pointers; full GPU concatenation would require a device-to-device copy (a CUDA memcpy kernel) or `forgeMat` CUDA support. Even so, it eliminates the `forgeMat` `set_col` + `vectorise` overhead, and the result does not need to stay on GPU: the PCG solver's `A * x_pg` call wraps it in a new `forgeCol`, which the mixed-operand dispatch puts back on GPU.
- [ ] Step 2: Rewrite SENSE adjoint CUDA path

Same approach — avoid `forgeMat` and work with `forgeCol` results directly. The adjoint sums across coils rather than concatenating, so the pattern is different:

```cpp
#elif defined(CUDA_COMPUTE)
if constexpr (std::is_same<T1, float>::value) {
    forgeCol<forgeComplex<T1>> imgSum(n2);
    imgSum.zeros();
    for (unsigned int ii = 0; ii < this->nc; ii++) {
        forgeCol<forgeComplex<T1>> dSlice = d.subvec(ii * n1, (ii + 1) * n1 - 1);
        forgeCol<forgeComplex<T1>> coil_img = (*this->G_obj) / dSlice;
        // Weight by conjugate sensitivity map
        forgeCol<forgeComplex<T1>> conjSmap_col = conjSMap.col_copy(ii);
        coil_img %= conjSmap_col;
        imgSum += coil_img;
    }
    return imgSum;
}
```

This avoids `forgeMat` entirely — the sum accumulates in `imgSum` on the host (or on GPU, if the operators push it there via mixed-operand dispatch).
- [ ] Step 3: Verify Metal build locally
- [ ] Step 4: Verify all tests on DGX Spark
```bash
ssh clj@100.109.49.10 "cd /tmp/forge-cuda && export PATH=/usr/local/cuda/bin:\$PATH && \
  cmake --build build --target cpu_tests -j\$(nproc) && \
  ./build/cpu_tests '[SENSE adjoint]' 2>&1 | tail -3 && \
  ./build/cpu_tests '[SENSE_synthetic]' 2>&1 | tail -3 && \
  ./build/cpu_tests '[PCG convergence]' 2>&1 | tail -3 && \
  ./build/cpu_tests '~[Benchmark]' '~[ISMRMRD]' '~[Spiral3D]' '~[directRecon]' \
  '~[SENSE_bench]' '~[pcSENSE_bench]' '~[TimeSeg_bench]' '~[Gdft_bench]' 2>&1 | tail -3"
```
Expected: All tests pass.
- [ ] Step 5: Benchmark + profile

```bash
# Benchmark
ssh clj@100.109.49.10 "cd /tmp/forge-cuda && \
  echo '=== CUDA (3 runs) ===' && \
  for i in 1 2 3; do ./build/cpu_tests '[SENSE_bench][bench256]' 2>&1 | grep 'Total\|NRMSE'; done && \
  echo '=== CPU baseline (3 runs) ===' && \
  for i in 1 2 3; do ./build-cpu/cpu_tests '[SENSE_bench][bench256]' 2>&1 | grep Total; done"

# Re-profile
ssh clj@100.109.49.10 "cd /tmp/forge-cuda && \
  nsys profile --stats=true --force-overwrite=true -o /tmp/forge_sense2d_opt \
  ./build/cpu_tests '[SENSE_bench][bench256]' 2>&1 | tail -40"
```
Compare profile against baseline: memcpy and malloc should be dramatically reduced.
- [ ] Step 6: Commit

```bash
git add forge/Operators/SENSE.cpp
git commit -m "perf: SENSE CUDA path avoids forgeMat — direct forgeCol concatenation/summation"
```
## Task 4: Final Verification + PR
- [ ] Step 1: Run full test suite on DGX Spark

```bash
ssh clj@100.109.49.10 "cd /tmp/forge-cuda && \
  ./build/cpu_tests '~[Benchmark]' '~[SENSE_bench]' '~[pcSENSE_bench]' \
  '~[TimeSeg_bench]' '~[Gdft_bench]' '~[bench256]' 2>&1 | tail -3"
```
Expected: 101 tests pass.
- [ ] Step 2: Run the SENSE 256x256 benchmark 5 times for stable numbers
- [ ] Step 3: Verify Metal tests still pass locally
```bash
cmake --build build --target cpu_tests metal_tests -j4 && \
./build/cpu_tests '~[Benchmark]' && ./build/metal_tests '~[Benchmark]'
```
- [ ] Step 4: Push and create PR
## Summary of Commits
| Task | Commit Message |
|---|---|
| 1 | perf: use cudaMallocAsync/cudaFreeAsync in forgeCol for pooled GPU allocations |
| 2 | perf: keep PCG solver vectors on GPU for entire solve |
| 3 | perf: SENSE CUDA path avoids forgeMat — direct forgeCol concatenation/summation |