TimeSegmentation + pcSenseTimeSeg CUDA Support Design Spec

Date: 2026-03-20
Status: Approved
Scope: Add CUDA forgeCol dispatch to the TimeSegmentation and pcSenseTimeSeg operators, enabling GPU-resident field-corrected and phase-corrected SENSE reconstructions.


1. Problem Statement

TimeSegmentation and pcSenseTimeSeg have Metal forgeCol paths but no CUDA equivalent. On CUDA builds, their forgeCol operators fall through to the arma CPU path, forcing host↔device round-trips and losing the GPU-resident data flow established in the PCG solver.

All element-wise operations (%=, +=, copy constructor) already have CUDA dispatch in forgeCol. The wrapped operators (Gnufft, Gdft) already have CUDA pipelines. The missing piece is just the #elif defined(CUDA_COMPUTE) blocks that keep data on GPU through the per-segment and per-shot loops.


2. Approach

No new CUDA kernels or pipeline classes. Mirror the existing Metal forgeCol patterns with CUDA dispatch. Pre-upload coefficient vectors to GPU in constructors. Use forgeCol's automatic CUDA dispatch for element-wise operations.


3. Changes

3.1 TimeSegmentation CUDA Constructor Upload

File: forge/Gridding/TimeSegmentation.cpp (constructor, ~line 249)

The Metal path already stores the per-segment coefficient columns in std::vector<forgeCol<...>> members: Wo_cols, WoH_cols, AA_cols, and conjAA_cols, plus the scratch buffers tempD_cols and tempAD_cols.

Add a CUDA_COMPUTE block after the Metal block that uploads these vectors to GPU:

#elif defined(CUDA_COMPUTE)
    if constexpr (std::is_same<T1, float>::value) {
        // Pre-upload per-segment phase and interpolation coefficients to GPU.
        // These persist for the lifetime of the operator — eliminates per-call H2D copies.
        for (int ii = 0; ii < L; ii++) {
            Wo_cols[ii].putOnGPU();
            WoH_cols[ii].putOnGPU();
            AA_cols[ii].putOnGPU();
            conjAA_cols[ii].putOnGPU();
        }
    }

Note: the Wo_cols and related vectors are expected to be populated by constructor code common to all backends — only the putOnGPU calls should be conditional on #ifdef METAL_COMPUTE. Verify this in the constructor; if the population code turns out to live inside the Metal block, move it to common code first.

3.2 TimeSegmentation CUDA Forward forgeCol

File: forge/Gridding/TimeSegmentation.cpp (forgeCol operator*, ~line 445)

Add #elif defined(CUDA_COMPUTE) after the Metal block:

#elif defined(CUDA_COMPUTE)
    if constexpr (std::is_same<T1, float>::value) {
        Tobj* G = this->obj;

        // Phase-modulate image by each time segment's field correction.
        // Wo_cols[l] is GPU-resident (uploaded in constructor).
        // Copy constructor does async D2D when source is on GPU.
        // operator%= dispatches to CUDA kernel automatically.
        forgeCol<forgeComplex<T1>> tempD(Wo_cols[0]);
        tempD %= d;
        forgeCol<forgeComplex<T1>> result = (*G) * tempD;
        result %= AA_cols[0];

        for (unsigned int ii = 1; ii < this->L; ii++) {
            forgeCol<forgeComplex<T1>> seg_d(Wo_cols[ii]);
            seg_d %= d;
            forgeCol<forgeComplex<T1>> seg_out = (*G) * seg_d;
            seg_out %= AA_cols[ii];
            result += seg_out;
        }
        return result;
    }

Data flow: d (GPU) → copy Wo[l] (D2D) → %= (CUDA kernel) → G* (Gnufft forward_device) → %= (CUDA kernel) → += (CUDA kernel). Everything stays on GPU.

3.3 TimeSegmentation CUDA Adjoint forgeCol

Same pattern with conjAA_cols and WoH_cols:

#elif defined(CUDA_COMPUTE)
    if constexpr (std::is_same<T1, float>::value) {
        Tobj* G = this->obj;

        forgeCol<forgeComplex<T1>> tempAD(conjAA_cols[0]);
        tempAD %= d;
        forgeCol<forgeComplex<T1>> result = (*G) / tempAD;
        result %= WoH_cols[0];

        for (unsigned int ii = 1; ii < this->L; ii++) {
            forgeCol<forgeComplex<T1>> seg_ad(conjAA_cols[ii]);
            seg_ad %= d;
            forgeCol<forgeComplex<T1>> seg_out = (*G) / seg_ad;
            seg_out %= WoH_cols[ii];
            result += seg_out;
        }
        return result;
    }

3.4 pcSenseTimeSeg CUDA Members

File: forge/Operators/pcSenseTimeSeg.h

Add GPU-cached sensitivity map vectors:

#ifdef CUDA_COMPUTE
    std::vector<forgeCol<forgeComplex<T1>>> shotSenseMap_gpu;
    std::vector<forgeCol<forgeComplex<T1>>> conjShotSenseMap_gpu;
#endif

3.5 pcSenseTimeSeg CUDA Constructor Upload

File: forge/Operators/pcSenseTimeSeg.cpp (constructor)

After the Metal shotSpecificSenseMap_pg setup, add:

#elif defined(CUDA_COMPUTE)
    if constexpr (std::is_same<T1, float>::value) {
        // Pre-upload per-shot combined sensitivity+phase maps to GPU.
        shotSenseMap_gpu.reserve(Ns * Nc);
        conjShotSenseMap_gpu.reserve(Ns * Nc);
        for (uword ii = 0; ii < Nc; ii++) {
            for (uword jj = 0; jj < Ns; jj++) {
                shotSenseMap_gpu.push_back(
                    forgeCol<forgeComplex<T1>>(shotSpecificSenseMap.col(jj + ii * Ns)));
                shotSenseMap_gpu.back().putOnGPU();
                conjShotSenseMap_gpu.push_back(
                    forgeCol<forgeComplex<T1>>(conjShotSpecificSenseMap.col(jj + ii * Ns)));
                conjShotSenseMap_gpu.back().putOnGPU();
            }
        }
    }

3.6 pcSenseTimeSeg CUDA Forward forgeCol

File: forge/Operators/pcSenseTimeSeg.cpp (forgeCol operator*)

Follow the SENSE forward pattern — D2D concat into output buffer:

#elif defined(CUDA_COMPUTE)
    if constexpr (std::is_same<T1, float>::value) {
        forgeCol<forgeComplex<T1>> d_gpu(d);
        if (!d_gpu.on_gpu()) d_gpu.putOnGPU();

        forgeCol<forgeComplex<T1>> allData(Nd * Ns * Nc);
        allData.allocateOnGPU();

        for (unsigned int ii = 0; ii < Nc; ii++) {
            for (unsigned int jj = 0; jj < Ns; jj++) {
                uword col = jj + ii * Ns;
                // Multiply image by combined sensitivity+phase map
                forgeCol<forgeComplex<T1>> weighted(d_gpu);
                weighted %= shotSenseMap_gpu[col];
                // Forward through per-shot time-segmented operator
                forgeCol<forgeComplex<T1>> coil_result = (*AObj[jj]) * weighted;
                // D2D copy into concatenated output
                if (coil_result.on_gpu()) {
                    CUDA_CHECK(cudaMemcpyAsync(
                        allData.device_memptr() + col * Nd,
                        coil_result.device_memptr(),
                        Nd * sizeof(forgeComplex<T1>),
                        cudaMemcpyDeviceToDevice,
                        forge::cuda::get_stream()));
                } else {
                    CUDA_CHECK(cudaMemcpyAsync(
                        allData.device_memptr() + col * Nd,
                        coil_result.memptr(),
                        Nd * sizeof(forgeComplex<T1>),
                        cudaMemcpyHostToDevice,
                        forge::cuda::get_stream()));
                }
            }
        }
        return allData;
    }

3.7 pcSenseTimeSeg CUDA Adjoint forgeCol

D2D slice per shot×coil, accumulate on GPU:

#elif defined(CUDA_COMPUTE)
    if constexpr (std::is_same<T1, float>::value) {
        forgeCol<forgeComplex<T1>> d_gpu(d);
        if (!d_gpu.on_gpu()) d_gpu.putOnGPU();

        forgeCol<forgeComplex<T1>> imgSum(Ni);
        imgSum.zerosOnGPU();

        for (unsigned int ii = 0; ii < Nc; ii++) {
            for (unsigned int jj = 0; jj < Ns; jj++) {
                uword col = jj + ii * Ns;
                // D2D slice per-shot per-coil k-space data
                forgeCol<forgeComplex<T1>> seg(Nd);
                seg.allocateOnGPU();
                CUDA_CHECK(cudaMemcpyAsync(
                    seg.device_memptr(),
                    d_gpu.device_memptr() + col * Nd,
                    Nd * sizeof(forgeComplex<T1>),
                    cudaMemcpyDeviceToDevice,
                    forge::cuda::get_stream()));
                // Adjoint through per-shot time-segmented operator
                forgeCol<forgeComplex<T1>> adjResult = (*AObj[jj]) / seg;
                // Weight by conjugate combined map and accumulate
                adjResult %= conjShotSenseMap_gpu[col];
                imgSum += adjResult;
            }
        }
        return imgSum;
    }

3.8 Tests

File: forge/Tests/operatorTests.cpp

  • TimeSegmentation+SENSE adjointness with CUDA: the existing [TimeSeg adjoint] test should pass once TimeSegmentation has a CUDA forgeCol path
  • pcSenseTimeSeg: the existing pcSENSE benchmark tests exercise pcSenseTimeSeg; verify they still pass

4. What Is NOT In Scope

  • New CUDA kernels — all GPU work uses existing forgeCol dispatch
  • Batched multi-segment cuFFT — L is small (4-10), sequential is fine
  • pcSENSE standalone (uses Gdft without TimeSegmentation) — already has Gdft CUDA via CudaDftPipeline

5. Success Criteria

  • All existing tests pass on both CUDA and Metal builds
  • TimeSegmentation forgeCol operators dispatch to GPU when input is on GPU
  • pcSenseTimeSeg forgeCol operators dispatch to GPU when input is on GPU
  • Data stays GPU-resident through the PCG solver loop

6. Risks

  • Risk: TimeSegmentation coefficient vectors are not populated outside the Metal block.
    Mitigation: Check the constructor; move population to common code if needed.
  • Risk: pcSenseTimeSeg forgeMat col() returns a view that does not work on GPU.
    Mitigation: Use col_copy() to get an owning forgeCol, then upload.
  • Risk: Large number of GPU-cached vectors (Ns×Nc for pcSenseTimeSeg).
    Mitigation: Same memory footprint as the CPU path; typical Ns=4-8 and Nc=4-16 give 16-128 vectors.