
TimeSegmentation + pcSenseTimeSeg CUDA Implementation Plan

For agentic workers: REQUIRED: Use superpowers:subagent-driven-development (if subagents available) or superpowers:executing-plans to implement this plan. Steps use checkbox (- [ ]) syntax for tracking.

IMPORTANT: CUDA builds and tests run on the DGX Spark. Workflow: commit + push locally → pull on DGX Spark → build + test.

git push origin feature/cuda-perf-optimization
ssh acerj@100.109.49.10 "export PATH=/usr/local/cuda/bin:\$PATH && cd ~/forge && \
    git pull acerjanic feature/cuda-perf-optimization && \
    cmake --build build --target cpu_tests -j8"

Goal: Add CUDA forgeCol dispatch to TimeSegmentation and pcSenseTimeSeg, enabling fully GPU-resident field-corrected and phase-corrected SENSE reconstructions.

Architecture: No new CUDA kernels. Mirror the existing Metal forgeCol paths with #elif defined(CUDA_COMPUTE) blocks. Pre-upload coefficient vectors to GPU in constructors. All element-wise operations (%=, +=, copy constructor) already have CUDA dispatch in forgeCol — they run CUDA kernels automatically when both operands are on GPU.
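The auto-dispatch contract this plan relies on can be pictured with a minimal stand-in (hypothetical class and method names; the real forgeCol operators live in forge and launch actual CUDA kernels):

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

// Minimal sketch of the dispatch idea behind forgeCol's element-wise operators.
// The operator checks whether both operands are device-resident and, if so,
// takes the GPU path; otherwise it falls back to a host loop. Hypothetical
// names -- this is an illustration, not the forgeCol implementation.
template <typename T>
struct DeviceVec {
    std::vector<T> host;
    bool gpu = false;

    bool on_gpu() const { return gpu; }

    DeviceVec& operator%=(const DeviceVec& rhs) {  // element-wise multiply
        if (gpu && rhs.gpu) {
            // GPU path: the real code would launch a CUDA kernel here.
            // This sketch mimics the result with the same host loop below.
        }
        for (std::size_t i = 0; i < host.size(); ++i)
            host[i] *= rhs.host[i];
        return *this;
    }
};

DeviceVec<std::complex<float>> demo() {
    DeviceVec<std::complex<float>> a{{{1, 0}, {0, 2}}, true};
    DeviceVec<std::complex<float>> b{{{2, 0}, {3, 0}}, true};
    a %= b;  // both on_gpu() -> would auto-dispatch to the CUDA kernel
    return a;
}
```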

Tech Stack: Existing forgeCol CUDA dispatch, CudaNufftPipeline (Gnufft), CudaDftPipeline (Gdft)

Spec: docs/superpowers/specs/2026-03-20-timeseg-pcsense-cuda-design.md


File Structure

Files to Modify

forge/Gridding/TimeSegmentation.cpp: Move coefficient vector population out of the Metal block to common code; add CUDA constructor upload; add CUDA forgeCol forward/adjoint blocks
forge/Operators/pcSenseTimeSeg.h: Add shotSenseMap_gpu / conjShotSenseMap_gpu vectors under CUDA_COMPUTE
forge/Operators/pcSenseTimeSeg.cpp: Add CUDA constructor upload; add CUDA forgeCol forward/adjoint blocks

No new files. No header changes needed for TimeSegmentation (member vectors already declared unconditionally).


Task 1: TimeSegmentation — Move Coefficient Population to Common Code

Files: - Modify: forge/Gridding/TimeSegmentation.cpp:249-277

The coefficient vectors (Wo_cols, WoH_cols, AA_cols, conjAA_cols, tempD_cols, tempAD_cols) are currently populated inside #ifdef METAL_COMPUTE. This code needs to run on all GPU builds (Metal and CUDA). The only Metal-specific part is that Metal uses page-aligned memory (which forgeCol already handles via aligned_alloc on macOS).
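For context, Metal's zero-copy path (newBufferWithBytesNoCopy:) requires page-aligned allocations whose length is a multiple of the page size, which is why allocation is special-cased on macOS. A minimal sketch of that kind of allocation (hypothetical helper, not the actual forgeCol code):

```cpp
#include <cstdint>
#include <cstdlib>
#include <unistd.h>

// Hypothetical helper illustrating page-aligned allocation as required for
// Metal zero-copy buffers: the pointer must be page-aligned and the length a
// multiple of the page size. Not the actual forgeCol implementation.
void* page_aligned_alloc(std::size_t bytes) {
    const std::size_t page = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
    const std::size_t rounded = (bytes + page - 1) / page * page;
    // C11/C++17 aligned_alloc requires size to be a multiple of alignment.
    return std::aligned_alloc(page, rounded);
}
```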

  • [ ] Step 1: Move vector population before the #ifdef METAL_COMPUTE block

In forge/Gridding/TimeSegmentation.cpp, the current code at lines 249-277 is:

#ifdef METAL_COMPUTE
    if constexpr (std::is_same<T1, float>::value) {
        Wo_cols.reserve(L);
        // ... populate all 6 vectors ...
    }
#endif

Change to:

#if defined(METAL_COMPUTE) || defined(CUDA_COMPUTE)
    if constexpr (std::is_same<T1, float>::value) {
        // Extract per-segment coefficient columns as separate forgeCol vectors.
        // Each column is independently GPU-ready (Metal: page-aligned zero-copy,
        // CUDA: uploaded to device in the CUDA block below).
        Wo_cols.reserve(L);
        WoH_cols.reserve(L);
        AA_cols.reserve(L);
        conjAA_cols.reserve(L);
        tempD_cols.reserve(L);
        tempAD_cols.reserve(L);
        for (int ii = 0; ii < L; ii++) {
            Wo_cols.emplace_back(n2);
            std::memcpy(Wo_cols[ii].memptr(), reinterpret_cast<const forgeComplex<T1>*>(Wo.colptr(ii)),
                sizeof(forgeComplex<T1>) * n2);
            WoH_cols.emplace_back(n2);
            std::memcpy(WoH_cols[ii].memptr(), reinterpret_cast<const forgeComplex<T1>*>(WoH.colptr(ii)),
                sizeof(forgeComplex<T1>) * n2);
            AA_cols.emplace_back(n1);
            std::memcpy(AA_cols[ii].memptr(), reinterpret_cast<const forgeComplex<T1>*>(AA.colptr(ii)),
                sizeof(forgeComplex<T1>) * n1);
            conjAA_cols.emplace_back(n1);
            std::memcpy(conjAA_cols[ii].memptr(), reinterpret_cast<const forgeComplex<T1>*>(conjAA.colptr(ii)),
                sizeof(forgeComplex<T1>) * n1);
            tempD_cols.emplace_back(n2);
            tempAD_cols.emplace_back(n1);
        }
    }
#endif

#ifdef CUDA_COMPUTE
    if constexpr (std::is_same<T1, float>::value) {
        // Upload per-segment coefficients to GPU. These persist for the lifetime
        // of the operator, eliminating per-call H2D transfers.
        for (int ii = 0; ii < L; ii++) {
            Wo_cols[ii].putOnGPU();
            WoH_cols[ii].putOnGPU();
            AA_cols[ii].putOnGPU();
            conjAA_cols[ii].putOnGPU();
        }
    }
#endif
  • [ ] Step 2: Verify Metal build locally
cmake --build build --target ForgeCommon -j4

Expected: builds without errors.

  • [ ] Step 3: Verify CUDA build on DGX Spark

Push, pull, build, run existing TimeSegmentation tests:

git push origin feature/cuda-perf-optimization
ssh acerj@100.109.49.10 "export PATH=/usr/local/cuda/bin:\$PATH && cd ~/forge && \
    git pull acerjanic feature/cuda-perf-optimization && \
    cmake --build build --target cpu_tests -j8 && \
    ./build/cpu_tests '[TimeSeg_synthetic][adjointness]' 2>&1"

Expected: SENSE+TimeSeg adjointness test passes.
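For reference, the adjointness test verifies the defining inner-product identity <A x, y> = <x, A^H y>. A standalone numerical illustration with a small random dense operator standing in for SENSE+TimeSeg (this is not the forge test itself):

```cpp
#include <cassert>
#include <complex>
#include <random>
#include <vector>

using cf = std::complex<double>;

// Compute |<Ax, y> - <x, A^H y>| for a random dense A. For an exact adjoint
// pair (A, A^H) this gap is zero up to floating-point roundoff.
double adjointness_gap() {
    const int m = 5, n = 4;
    std::mt19937 rng(42);
    std::uniform_real_distribution<double> u(-1.0, 1.0);
    auto rc = [&] { return cf(u(rng), u(rng)); };

    std::vector<cf> A(m * n);  // column-major m x n
    for (auto& a : A) a = rc();
    std::vector<cf> x(n), y(m);
    for (auto& v : x) v = rc();
    for (auto& v : y) v = rc();

    // Apply A and its Hermitian adjoint A^H
    std::vector<cf> Ax(m, cf(0)), AHy(n, cf(0));
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < m; ++i) {
            Ax[i]  += A[i + j * m] * x[j];
            AHy[j] += std::conj(A[i + j * m]) * y[i];
        }

    // Inner products, conjugate-linear in the first argument
    cf lhs(0), rhs(0);
    for (int i = 0; i < m; ++i) lhs += std::conj(Ax[i]) * y[i];
    for (int j = 0; j < n; ++j) rhs += std::conj(x[j]) * AHy[j];
    return std::abs(lhs - rhs);
}
```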

  • [ ] Step 4: Commit
git add forge/Gridding/TimeSegmentation.cpp
git commit -m "refactor: move TimeSegmentation coefficient vectors to common code for Metal+CUDA"

Task 2: TimeSegmentation — CUDA forgeCol Forward and Adjoint

Files: - Modify: forge/Gridding/TimeSegmentation.cpp:445-500 (forgeCol operators)

  • [ ] Step 1: Add CUDA forward forgeCol block

In the forgeCol operator* (around line 445), change the #endif after the Metal block to add a CUDA path:

#elif defined(CUDA_COMPUTE)
    if constexpr (std::is_same<T1, float>::value) {
        Tobj* G = this->obj;

        // Forward: phase-modulate image by each segment's field correction,
        // then NUFFT forward + interpolation weight, accumulate across segments.
        //
        // Data flow per segment l:
        //   copy Wo[l] (async D2D) → multiply by d (CUDA kernel) →
        //   G* forward (CudaNufftPipeline) → multiply by AA[l] (CUDA kernel) →
        //   accumulate into result (CUDA kernel)
        //
        // All operations stay on GPU — forgeCol auto-dispatches to CUDA when
        // both operands have on_gpu() == true.

        forgeCol<forgeComplex<T1>> tempD(Wo_cols[0]); // async D2D copy (GPU-resident source)
        tempD %= d;                                     // CUDA element-wise multiply
        forgeCol<forgeComplex<T1>> result = (*G) * tempD; // Gnufft forward_device
        result %= AA_cols[0];                           // CUDA element-wise multiply

        for (unsigned int ii = 1; ii < this->L; ii++) {
            forgeCol<forgeComplex<T1>> seg_d(Wo_cols[ii]); // async D2D copy
            seg_d %= d;                                      // CUDA multiply
            forgeCol<forgeComplex<T1>> seg_out = (*G) * seg_d; // forward_device
            seg_out %= AA_cols[ii];                          // CUDA multiply
            result += seg_out;                               // CUDA accumulate
        }
        return result;
    }
#endif
  • [ ] Step 2: Add CUDA adjoint forgeCol block

In the forgeCol operator/ (around line 475), apply the same pattern with the conjugate coefficients:

#elif defined(CUDA_COMPUTE)
    if constexpr (std::is_same<T1, float>::value) {
        Tobj* G = this->obj;

        // Adjoint: weight k-space by conjugate interpolation coefficients,
        // then NUFFT adjoint + conjugate phase correction, accumulate.
        //
        // Data flow per segment l:
        //   copy conjAA[l] (async D2D) → multiply by d (CUDA kernel) →
        //   G/ adjoint (CudaNufftPipeline) → multiply by WoH[l] (CUDA kernel) →
        //   accumulate into result (CUDA kernel)

        forgeCol<forgeComplex<T1>> tempAD(conjAA_cols[0]); // async D2D copy
        tempAD %= d;                                        // CUDA multiply
        forgeCol<forgeComplex<T1>> result = (*G) / tempAD;  // Gnufft adjoint_device
        result %= WoH_cols[0];                              // CUDA multiply

        for (unsigned int ii = 1; ii < this->L; ii++) {
            forgeCol<forgeComplex<T1>> seg_ad(conjAA_cols[ii]); // async D2D copy
            seg_ad %= d;                                         // CUDA multiply
            forgeCol<forgeComplex<T1>> seg_out = (*G) / seg_ad;  // adjoint_device
            seg_out %= WoH_cols[ii];                             // CUDA multiply
            result += seg_out;                                   // CUDA accumulate
        }
        return result;
    }
#endif
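For reference, both forgeCol paths implement the standard time-segmented approximation of the field-corrected system matrix. Writing c_l for the l-th Wo column and b_l for the l-th AA interpolation column (notation assumed here, chosen to match the member names):

```latex
A \approx \sum_{l=1}^{L} \operatorname{diag}(b_l)\, G\, \operatorname{diag}(c_l),
\qquad
A^{H} \approx \sum_{l=1}^{L} \operatorname{diag}(\bar{c}_l)\, G^{H}\, \operatorname{diag}(\bar{b}_l)
```

which matches the two loops above term by term: the forward path accumulates AA_l ⊙ G(Wo_l ⊙ d) and the adjoint path accumulates WoH_l ⊙ Gᴴ(conjAA_l ⊙ d).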
  • [ ] Step 3: Verify Metal build locally
cmake --build build --target ForgeCommon -j4
  • [ ] Step 4: Verify all tests on DGX Spark
ssh acerj@100.109.49.10 "export PATH=/usr/local/cuda/bin:\$PATH && cd ~/forge && \
    git pull acerjanic feature/cuda-perf-optimization && \
    cmake --build build --target cpu_tests -j8 && \
    ./build/cpu_tests '[TimeSeg_synthetic][adjointness]' 2>&1 && \
    ./build/cpu_tests '~[Benchmark]~[Spiral3D]' 2>&1 | tail -3"

Expected: TimeSeg adjointness passes. All tests pass.

  • [ ] Step 5: Commit
git add forge/Gridding/TimeSegmentation.cpp
git commit -m "feat: CUDA forgeCol dispatch for TimeSegmentation — GPU-resident field correction"

Task 3: pcSenseTimeSeg — CUDA Members and Constructor Upload

Files:
- Modify: forge/Operators/pcSenseTimeSeg.h:120-125 (add CUDA members)
- Modify: forge/Operators/pcSenseTimeSeg.cpp:123-128 (add CUDA constructor upload)

  • [ ] Step 1: Add CUDA members to pcSenseTimeSeg.h

After the existing forgeMat members (around line 124), add:

#ifdef CUDA_COMPUTE
    /// Per-shot combined sensitivity+phase maps, pre-uploaded to GPU (Ns*Nc vectors).
    std::vector<forgeCol<forgeComplex<T1>>> shotSenseMap_gpu;
    /// Conjugate per-shot maps for adjoint, pre-uploaded to GPU.
    std::vector<forgeCol<forgeComplex<T1>>> conjShotSenseMap_gpu;
#endif

Add #include <vector> at top if not already present.

  • [ ] Step 2: Add CUDA constructor upload in pcSenseTimeSeg.cpp

After the #ifdef METAL_COMPUTE block (lines 123-128), add:

#ifdef CUDA_COMPUTE
    if constexpr (std::is_same<T1, float>::value) {
        // Pre-upload per-shot combined sensitivity+phase maps to GPU.
        // shotSpecificSenseMap is (Ni x Nc*Ns) arma::Mat — extract each column
        // as an owning forgeCol and upload. These persist for the lifetime of
        // the operator, eliminating per-coil per-iteration H2D transfers.
        shotSenseMap_gpu.reserve(Ns * Nc);
        conjShotSenseMap_gpu.reserve(Ns * Nc);
        for (unsigned int ii = 0; ii < Nc; ii++) {
            for (unsigned int jj = 0; jj < Ns; jj++) {
                uword col = jj + ii * Ns;
                // Wrap arma column as forgeCol (copies data into forgeCol-owned buffer)
                shotSenseMap_gpu.emplace_back(
                    forgeCol<forgeComplex<T1>>(shotSpecificSenseMap.col(col)));
                shotSenseMap_gpu.back().putOnGPU();
                conjShotSenseMap_gpu.emplace_back(
                    forgeCol<forgeComplex<T1>>(conjShotSpecificSenseMap.col(col)));
                conjShotSenseMap_gpu.back().putOnGPU();
            }
        }
    }
#endif

Also add the CUDA include at top of pcSenseTimeSeg.cpp:

#ifdef CUDA_COMPUTE
#include "CUDA/CudaContext.h"
#endif
  • [ ] Step 3: Verify Metal build + CUDA build (same commands as Task 1, Steps 2-3)

  • [ ] Step 4: Commit

git add forge/Operators/pcSenseTimeSeg.h forge/Operators/pcSenseTimeSeg.cpp
git commit -m "feat: pcSenseTimeSeg CUDA constructor — pre-upload shot-specific maps to GPU"

Task 4: pcSenseTimeSeg — CUDA forgeCol Forward and Adjoint

Files: - Modify: forge/Operators/pcSenseTimeSeg.cpp:180-227 (forgeCol operators)

  • [ ] Step 1: Add CUDA forward forgeCol block

In the forgeCol operator* (around line 180), change the #endif after the Metal block to add a CUDA path:

#elif defined(CUDA_COMPUTE)
    if constexpr (std::is_same<T1, float>::value) {
        // GPU-resident forward: multiply image by combined sensitivity+phase map
        // per shot/coil, forward through TimeSegmentation operator, concatenate
        // results via D2D copy into output buffer.
        forgeCol<forgeComplex<T1>> d_gpu(d);
        if (!d_gpu.on_gpu()) d_gpu.putOnGPU();

        forgeCol<forgeComplex<T1>> allData(Nd * Ns * Nc);
        allData.allocateOnGPU();

        for (unsigned int ii = 0; ii < Nc; ii++) {
            for (unsigned int jj = 0; jj < Ns; jj++) {
                uword col = jj + ii * Ns;

                // Multiply image by precomputed combined sensitivity+phase map
                forgeCol<forgeComplex<T1>> weighted(d_gpu);    // async D2D copy
                weighted %= shotSenseMap_gpu[col];              // CUDA element-wise mul

                // Forward through per-shot time-segmented NUFFT
                forgeCol<forgeComplex<T1>> coil_result = (*AObj[jj]) * weighted;

                // D2D copy result into the correct slice of concatenated output
                if (coil_result.on_gpu()) {
                    CUDA_CHECK(cudaMemcpyAsync(
                        allData.device_memptr() + col * Nd,
                        coil_result.device_memptr(),
                        Nd * sizeof(forgeComplex<T1>),
                        cudaMemcpyDeviceToDevice,
                        forge::cuda::get_stream()));
                } else {
                    // Fallback: operator returned host data (shouldn't happen with CUDA Gnufft)
                    CUDA_CHECK(cudaMemcpyAsync(
                        allData.device_memptr() + col * Nd,
                        coil_result.memptr(),
                        Nd * sizeof(forgeComplex<T1>),
                        cudaMemcpyHostToDevice,
                        forge::cuda::get_stream()));
                }
            }
        }
        return allData;
    }
#endif
  • [ ] Step 2: Add CUDA adjoint forgeCol block

In the forgeCol operator/ (around line 205), change the #endif after the Metal block to add a CUDA path:

#elif defined(CUDA_COMPUTE)
    if constexpr (std::is_same<T1, float>::value) {
        // GPU-resident adjoint: D2D slice per-shot per-coil k-space data,
        // adjoint through TimeSegmentation, weight by conjugate map, accumulate.
        forgeCol<forgeComplex<T1>> d_gpu(d);
        if (!d_gpu.on_gpu()) d_gpu.putOnGPU();

        forgeCol<forgeComplex<T1>> imgSum(Ni);
        imgSum.zerosOnGPU();

        for (unsigned int ii = 0; ii < Nc; ii++) {
            for (unsigned int jj = 0; jj < Ns; jj++) {
                uword col = jj + ii * Ns;

                // D2D slice: extract per-shot per-coil k-space segment from input
                forgeCol<forgeComplex<T1>> seg(Nd);
                seg.allocateOnGPU();
                CUDA_CHECK(cudaMemcpyAsync(
                    seg.device_memptr(),
                    d_gpu.device_memptr() + col * Nd,
                    Nd * sizeof(forgeComplex<T1>),
                    cudaMemcpyDeviceToDevice,
                    forge::cuda::get_stream()));

                // Adjoint through per-shot time-segmented NUFFT
                forgeCol<forgeComplex<T1>> adjResult = (*AObj[jj]) / seg;

                // Weight by conjugate combined sensitivity+phase map and accumulate
                adjResult %= conjShotSenseMap_gpu[col];
                imgSum += adjResult;
            }
        }
        return imgSum;
    }
#endif
  • [ ] Step 3: Verify Metal build + CUDA build and tests
ssh acerj@100.109.49.10 "export PATH=/usr/local/cuda/bin:\$PATH && cd ~/forge && \
    git pull acerjanic feature/cuda-perf-optimization && \
    cmake --build build --target cpu_tests -j8 && \
    ./build/cpu_tests '~[Benchmark]~[Spiral3D]' 2>&1 | tail -3"

Expected: All tests pass.

  • [ ] Step 4: Commit
git add forge/Operators/pcSenseTimeSeg.cpp
git commit -m "feat: CUDA forgeCol dispatch for pcSenseTimeSeg — GPU-resident phase-corrected SENSE"

Task 5: Final Verification

  • [ ] Step 1: Run full test suite on DGX Spark
ssh acerj@100.109.49.10 "export PATH=/usr/local/cuda/bin:\$PATH && cd ~/forge && \
    ./build/cpu_tests '~[Benchmark]' 2>&1 | tail -5"

Expected: All tests pass (including Spiral3D which exercises pcSenseTimeSeg).

  • [ ] Step 2: Verify Metal tests locally
cmake --build build --target cpu_tests -j4 && ./build/cpu_tests '~[Benchmark]'

Expected: All tests pass on macOS Metal build.

  • [ ] Step 3: Run SENSE+TimeSeg benchmark
ssh acerj@100.109.49.10 "export PATH=/usr/local/cuda/bin:\$PATH && cd ~/forge && \
    ./build/cpu_tests '[TimeSeg_bench]' 2>&1"

Compare CUDA performance with previous CPU-only results.


Summary of Commits

Task 1: refactor: move TimeSegmentation coefficient vectors to common code for Metal+CUDA
Task 2: feat: CUDA forgeCol dispatch for TimeSegmentation — GPU-resident field correction
Task 3: feat: pcSenseTimeSeg CUDA constructor — pre-upload shot-specific maps to GPU
Task 4: feat: CUDA forgeCol dispatch for pcSenseTimeSeg — GPU-resident phase-corrected SENSE