TimeSegmentation + pcSenseTimeSeg CUDA Implementation Plan¶
For agentic workers: REQUIRED: Use superpowers:subagent-driven-development (if subagents available) or superpowers:executing-plans to implement this plan. Steps use checkbox (- [ ]) syntax for tracking.
IMPORTANT: CUDA builds and tests run on the DGX Spark. Workflow: commit + push locally → pull on DGX Spark → build + test.
git push origin feature/cuda-perf-optimization
ssh acerj@100.109.49.10 "export PATH=/usr/local/cuda/bin:\$PATH && cd ~/forge && \
  git pull acerjanic feature/cuda-perf-optimization && \
  cmake --build build --target cpu_tests -j8"
Goal: Add CUDA forgeCol dispatch to TimeSegmentation and pcSenseTimeSeg, enabling fully GPU-resident field-corrected and phase-corrected SENSE reconstructions.
Architecture: No new CUDA kernels. Mirror the existing Metal forgeCol paths with #elif defined(CUDA_COMPUTE) blocks. Pre-upload coefficient vectors to GPU in constructors. All element-wise operations (%=, +=, copy constructor) already have CUDA dispatch in forgeCol — they run CUDA kernels automatically when both operands are on GPU.
Tech Stack: Existing forgeCol CUDA dispatch, CudaNufftPipeline (Gnufft), CudaDftPipeline (Gdft)
Spec: docs/superpowers/specs/2026-03-20-timeseg-pcsense-cuda-design.md
File Structure¶
Files to Modify¶
| File | Change |
|---|---|
| forge/Gridding/TimeSegmentation.cpp | Move coefficient vector population out of Metal block to common code; add CUDA constructor upload; add CUDA forgeCol forward/adjoint blocks |
| forge/Operators/pcSenseTimeSeg.h | Add shotSenseMap_gpu / conjShotSenseMap_gpu vectors under CUDA_COMPUTE |
| forge/Operators/pcSenseTimeSeg.cpp | Add CUDA constructor upload; add CUDA forgeCol forward/adjoint blocks |
No new files. No header changes needed for TimeSegmentation (member vectors already declared unconditionally).
Task 1: TimeSegmentation — Move Coefficient Population to Common Code¶
Files:
- Modify: forge/Gridding/TimeSegmentation.cpp:249-277
The coefficient vectors (Wo_cols, WoH_cols, AA_cols, conjAA_cols, tempD_cols, tempAD_cols) are currently populated inside #ifdef METAL_COMPUTE. This code needs to run on all GPU builds (Metal and CUDA). The only Metal-specific part is that Metal uses page-aligned memory (which forgeCol already handles via aligned_alloc on macOS).
- [ ] Step 1: Broaden the #ifdef METAL_COMPUTE guard so the vector population also runs for CUDA builds
In forge/Gridding/TimeSegmentation.cpp, the current code at lines 249-277 is:
#ifdef METAL_COMPUTE
if constexpr (std::is_same<T1, float>::value) {
Wo_cols.reserve(L);
// ... populate all 6 vectors ...
}
#endif
Change to:
#if defined(METAL_COMPUTE) || defined(CUDA_COMPUTE)
if constexpr (std::is_same<T1, float>::value) {
// Extract per-segment coefficient columns as separate forgeCol vectors.
// Each column is independently GPU-ready (Metal: page-aligned zero-copy,
// CUDA: uploaded to device in the CUDA block below).
Wo_cols.reserve(L);
WoH_cols.reserve(L);
AA_cols.reserve(L);
conjAA_cols.reserve(L);
tempD_cols.reserve(L);
tempAD_cols.reserve(L);
for (int ii = 0; ii < L; ii++) {
Wo_cols.emplace_back(n2);
std::memcpy(Wo_cols[ii].memptr(), reinterpret_cast<const forgeComplex<T1>*>(Wo.colptr(ii)),
sizeof(forgeComplex<T1>) * n2);
WoH_cols.emplace_back(n2);
std::memcpy(WoH_cols[ii].memptr(), reinterpret_cast<const forgeComplex<T1>*>(WoH.colptr(ii)),
sizeof(forgeComplex<T1>) * n2);
AA_cols.emplace_back(n1);
std::memcpy(AA_cols[ii].memptr(), reinterpret_cast<const forgeComplex<T1>*>(AA.colptr(ii)),
sizeof(forgeComplex<T1>) * n1);
conjAA_cols.emplace_back(n1);
std::memcpy(conjAA_cols[ii].memptr(), reinterpret_cast<const forgeComplex<T1>*>(conjAA.colptr(ii)),
sizeof(forgeComplex<T1>) * n1);
tempD_cols.emplace_back(n2);
tempAD_cols.emplace_back(n1);
}
}
#endif
#ifdef CUDA_COMPUTE
if constexpr (std::is_same<T1, float>::value) {
// Upload per-segment coefficients to GPU. These persist for the lifetime
// of the operator, eliminating per-call H2D transfers.
for (int ii = 0; ii < L; ii++) {
Wo_cols[ii].putOnGPU();
WoH_cols[ii].putOnGPU();
AA_cols[ii].putOnGPU();
conjAA_cols[ii].putOnGPU();
}
}
#endif
- [ ] Step 2: Verify Metal build locally
cmake --build build --target ForgeCommon -j4
Expected: builds without errors.
- [ ] Step 3: Verify CUDA build on DGX Spark
Push, pull, build, run existing TimeSegmentation tests:
git push origin feature/cuda-perf-optimization
ssh acerj@100.109.49.10 "export PATH=/usr/local/cuda/bin:\$PATH && cd ~/forge && \
git pull acerjanic feature/cuda-perf-optimization && \
cmake --build build --target cpu_tests -j8 && \
./build/cpu_tests '[TimeSeg_synthetic][adjointness]' 2>&1"
Expected: SENSE+TimeSeg adjointness test passes.
- [ ] Step 4: Commit
git add forge/Gridding/TimeSegmentation.cpp
git commit -m "refactor: move TimeSegmentation coefficient vectors to common code for Metal+CUDA"
Task 2: TimeSegmentation — CUDA forgeCol Forward and Adjoint¶
Files:
- Modify: forge/Gridding/TimeSegmentation.cpp:445-500 (forgeCol operators)
- [ ] Step 1: Add CUDA forward forgeCol block
In the forgeCol operator* (around line 445), extend the Metal #ifdef with an #elif defined(CUDA_COMPUTE) branch before the closing #endif:
#elif defined(CUDA_COMPUTE)
if constexpr (std::is_same<T1, float>::value) {
Tobj* G = this->obj;
// Forward: phase-modulate image by each segment's field correction,
// then NUFFT forward + interpolation weight, accumulate across segments.
//
// Data flow per segment l:
// copy Wo[l] (async D2D) → multiply by d (CUDA kernel) →
// G* forward (CudaNufftPipeline) → multiply by AA[l] (CUDA kernel) →
// accumulate into result (CUDA kernel)
//
// All operations stay on GPU — forgeCol auto-dispatches to CUDA when
// both operands have on_gpu() == true.
forgeCol<forgeComplex<T1>> tempD(Wo_cols[0]); // async D2D copy (GPU-resident source)
tempD %= d; // CUDA element-wise multiply
forgeCol<forgeComplex<T1>> result = (*G) * tempD; // Gnufft forward_device
result %= AA_cols[0]; // CUDA element-wise multiply
for (unsigned int ii = 1; ii < this->L; ii++) {
forgeCol<forgeComplex<T1>> seg_d(Wo_cols[ii]); // async D2D copy
seg_d %= d; // CUDA multiply
forgeCol<forgeComplex<T1>> seg_out = (*G) * seg_d; // forward_device
seg_out %= AA_cols[ii]; // CUDA multiply
result += seg_out; // CUDA accumulate
}
return result;
}
#endif
- [ ] Step 2: Add CUDA adjoint forgeCol block
In the forgeCol operator/ (around line 475), same pattern with conjugate coefficients:
#elif defined(CUDA_COMPUTE)
if constexpr (std::is_same<T1, float>::value) {
Tobj* G = this->obj;
// Adjoint: weight k-space by conjugate interpolation coefficients,
// then NUFFT adjoint + conjugate phase correction, accumulate.
//
// Data flow per segment l:
// copy conjAA[l] (async D2D) → multiply by d (CUDA kernel) →
// G/ adjoint (CudaNufftPipeline) → multiply by WoH[l] (CUDA kernel) →
// accumulate into result (CUDA kernel)
forgeCol<forgeComplex<T1>> tempAD(conjAA_cols[0]); // async D2D copy
tempAD %= d; // CUDA multiply
forgeCol<forgeComplex<T1>> result = (*G) / tempAD; // Gnufft adjoint_device
result %= WoH_cols[0]; // CUDA multiply
for (unsigned int ii = 1; ii < this->L; ii++) {
forgeCol<forgeComplex<T1>> seg_ad(conjAA_cols[ii]); // async D2D copy
seg_ad %= d; // CUDA multiply
forgeCol<forgeComplex<T1>> seg_out = (*G) / seg_ad; // adjoint_device
seg_out %= WoH_cols[ii]; // CUDA multiply
result += seg_out; // CUDA accumulate
}
return result;
}
#endif
- [ ] Step 3: Verify Metal build locally
cmake --build build --target ForgeCommon -j4
- [ ] Step 4: Verify all tests on DGX Spark
ssh acerj@100.109.49.10 "export PATH=/usr/local/cuda/bin:\$PATH && cd ~/forge && \
git pull acerjanic feature/cuda-perf-optimization && \
cmake --build build --target cpu_tests -j8 && \
./build/cpu_tests '[TimeSeg_synthetic][adjointness]' 2>&1 && \
./build/cpu_tests '~[Benchmark]~[Spiral3D]' 2>&1 | tail -3"
Expected: TimeSeg adjointness passes. All tests pass.
- [ ] Step 5: Commit
git add forge/Gridding/TimeSegmentation.cpp
git commit -m "feat: CUDA forgeCol dispatch for TimeSegmentation — GPU-resident field correction"
Task 3: pcSenseTimeSeg — CUDA Members and Constructor Upload¶
Files:
- Modify: forge/Operators/pcSenseTimeSeg.h:120-125 (add CUDA members)
- Modify: forge/Operators/pcSenseTimeSeg.cpp:123-128 (add CUDA constructor upload)
- [ ] Step 1: Add CUDA members to pcSenseTimeSeg.h
After the existing forgeMat members (around line 124), add:
#ifdef CUDA_COMPUTE
/// Per-shot combined sensitivity+phase maps, pre-uploaded to GPU (Ns*Nc vectors).
std::vector<forgeCol<forgeComplex<T1>>> shotSenseMap_gpu;
/// Conjugate per-shot maps for adjoint, pre-uploaded to GPU.
std::vector<forgeCol<forgeComplex<T1>>> conjShotSenseMap_gpu;
#endif
Add #include <vector> at top if not already present.
- [ ] Step 2: Add CUDA constructor upload in pcSenseTimeSeg.cpp
After the #ifdef METAL_COMPUTE block (line 123-128), add:
#ifdef CUDA_COMPUTE
if constexpr (std::is_same<T1, float>::value) {
// Pre-upload per-shot combined sensitivity+phase maps to GPU.
// shotSpecificSenseMap is (Ni x Nc*Ns) arma::Mat — extract each column
// as an owning forgeCol and upload. These persist for the lifetime of
// the operator, eliminating per-coil per-iteration H2D transfers.
shotSenseMap_gpu.reserve(Ns * Nc);
conjShotSenseMap_gpu.reserve(Ns * Nc);
for (unsigned int ii = 0; ii < Nc; ii++) {
for (unsigned int jj = 0; jj < Ns; jj++) {
uword col = jj + ii * Ns;
// Wrap arma column as forgeCol (copies data into forgeCol-owned buffer)
shotSenseMap_gpu.emplace_back(
forgeCol<forgeComplex<T1>>(shotSpecificSenseMap.col(col)));
shotSenseMap_gpu.back().putOnGPU();
conjShotSenseMap_gpu.emplace_back(
forgeCol<forgeComplex<T1>>(conjShotSpecificSenseMap.col(col)));
conjShotSenseMap_gpu.back().putOnGPU();
}
}
}
#endif
Also add the CUDA include at top of pcSenseTimeSeg.cpp:
#ifdef CUDA_COMPUTE
#include "CUDA/CudaContext.h"
#endif
- [ ] Step 3: Verify Metal build + CUDA build
- [ ] Step 4: Commit
git add forge/Operators/pcSenseTimeSeg.h forge/Operators/pcSenseTimeSeg.cpp
git commit -m "feat: pcSenseTimeSeg CUDA constructor — pre-upload shot-specific maps to GPU"
Task 4: pcSenseTimeSeg — CUDA forgeCol Forward and Adjoint¶
Files:
- Modify: forge/Operators/pcSenseTimeSeg.cpp:180-227 (forgeCol operators)
- [ ] Step 1: Add CUDA forward forgeCol block
In the forgeCol operator* (around line 180), extend the Metal #ifdef with an #elif defined(CUDA_COMPUTE) branch before the closing #endif:
#elif defined(CUDA_COMPUTE)
if constexpr (std::is_same<T1, float>::value) {
// GPU-resident forward: multiply image by combined sensitivity+phase map
// per shot/coil, forward through TimeSegmentation operator, concatenate
// results via D2D copy into output buffer.
forgeCol<forgeComplex<T1>> d_gpu(d);
if (!d_gpu.on_gpu()) d_gpu.putOnGPU();
forgeCol<forgeComplex<T1>> allData(Nd * Ns * Nc);
allData.allocateOnGPU();
for (unsigned int ii = 0; ii < Nc; ii++) {
for (unsigned int jj = 0; jj < Ns; jj++) {
uword col = jj + ii * Ns;
// Multiply image by precomputed combined sensitivity+phase map
forgeCol<forgeComplex<T1>> weighted(d_gpu); // async D2D copy
weighted %= shotSenseMap_gpu[col]; // CUDA element-wise mul
// Forward through per-shot time-segmented NUFFT
forgeCol<forgeComplex<T1>> coil_result = (*AObj[jj]) * weighted;
// D2D copy result into the correct slice of concatenated output
if (coil_result.on_gpu()) {
CUDA_CHECK(cudaMemcpyAsync(
allData.device_memptr() + col * Nd,
coil_result.device_memptr(),
Nd * sizeof(forgeComplex<T1>),
cudaMemcpyDeviceToDevice,
forge::cuda::get_stream()));
} else {
// Fallback: operator returned host data (shouldn't happen with CUDA Gnufft)
CUDA_CHECK(cudaMemcpyAsync(
allData.device_memptr() + col * Nd,
coil_result.memptr(),
Nd * sizeof(forgeComplex<T1>),
cudaMemcpyHostToDevice,
forge::cuda::get_stream()));
}
}
}
return allData;
}
#endif
- [ ] Step 2: Add CUDA adjoint forgeCol block
In the forgeCol operator/ (around line 205), extend the Metal #ifdef the same way:
#elif defined(CUDA_COMPUTE)
if constexpr (std::is_same<T1, float>::value) {
// GPU-resident adjoint: D2D slice per-shot per-coil k-space data,
// adjoint through TimeSegmentation, weight by conjugate map, accumulate.
forgeCol<forgeComplex<T1>> d_gpu(d);
if (!d_gpu.on_gpu()) d_gpu.putOnGPU();
forgeCol<forgeComplex<T1>> imgSum(Ni);
imgSum.zerosOnGPU();
for (unsigned int ii = 0; ii < Nc; ii++) {
for (unsigned int jj = 0; jj < Ns; jj++) {
uword col = jj + ii * Ns;
// D2D slice: extract per-shot per-coil k-space segment from input
forgeCol<forgeComplex<T1>> seg(Nd);
seg.allocateOnGPU();
CUDA_CHECK(cudaMemcpyAsync(
seg.device_memptr(),
d_gpu.device_memptr() + col * Nd,
Nd * sizeof(forgeComplex<T1>),
cudaMemcpyDeviceToDevice,
forge::cuda::get_stream()));
// Adjoint through per-shot time-segmented NUFFT
forgeCol<forgeComplex<T1>> adjResult = (*AObj[jj]) / seg;
// Weight by conjugate combined sensitivity+phase map and accumulate
adjResult %= conjShotSenseMap_gpu[col];
imgSum += adjResult;
}
}
return imgSum;
}
#endif
- [ ] Step 3: Verify Metal build + CUDA build and tests
ssh acerj@100.109.49.10 "export PATH=/usr/local/cuda/bin:\$PATH && cd ~/forge && \
git pull acerjanic feature/cuda-perf-optimization && \
cmake --build build --target cpu_tests -j8 && \
./build/cpu_tests '~[Benchmark]~[Spiral3D]' 2>&1 | tail -3"
Expected: All tests pass.
- [ ] Step 4: Commit
git add forge/Operators/pcSenseTimeSeg.cpp
git commit -m "feat: CUDA forgeCol dispatch for pcSenseTimeSeg — GPU-resident phase-corrected SENSE"
Task 5: Final Verification¶
- [ ] Step 1: Run full test suite on DGX Spark
ssh acerj@100.109.49.10 "export PATH=/usr/local/cuda/bin:\$PATH && cd ~/forge && \
./build/cpu_tests '~[Benchmark]' 2>&1 | tail -5"
Expected: All tests pass (including Spiral3D which exercises pcSenseTimeSeg).
- [ ] Step 2: Verify Metal tests locally
cmake --build build --target cpu_tests -j4 && ./build/cpu_tests '~[Benchmark]'
Expected: All tests pass on macOS Metal build.
- [ ] Step 3: Run SENSE+TimeSeg benchmark
ssh acerj@100.109.49.10 "export PATH=/usr/local/cuda/bin:\$PATH && cd ~/forge && \
./build/cpu_tests '[TimeSeg_bench]' 2>&1"
Compare CUDA performance with previous CPU-only results.
Summary of Commits¶
| Task | Commit Message |
|---|---|
| 1 | refactor: move TimeSegmentation coefficient vectors to common code for Metal+CUDA |
| 2 | feat: CUDA forgeCol dispatch for TimeSegmentation — GPU-resident field correction |
| 3 | feat: pcSenseTimeSeg CUDA constructor — pre-upload shot-specific maps to GPU |
| 4 | feat: CUDA forgeCol dispatch for pcSenseTimeSeg — GPU-resident phase-corrected SENSE |