CudaDFT Pipeline Implementation Plan

For agentic workers: REQUIRED: Use superpowers:subagent-driven-development (if subagents available) or superpowers:executing-plans to implement this plan. Steps use checkbox (- [ ]) syntax for tracking.

IMPORTANT: All CUDA builds and tests run on the DGX Spark via SSH. Workflow: edit locally → rsync → build+test via SSH.

rsync -avz --exclude build/ --exclude .git/ \
    /Users/acerjanic/Developer/forge/ clj@100.109.49.10:/tmp/forge-cuda/
ssh clj@100.109.49.10 "export PATH=/usr/local/cuda/bin:\$PATH && cd /tmp/forge-cuda && \
    cmake --build build --target cpu_tests -j8"

Goal: GPU-accelerate the field-corrected DFT (Gdft) and gradient-corrected DFT (GdftR2) operators via a new CudaDftPipeline class, unblocking GPU-resident pcSENSE and SENSE<Gdft> reconstructions.

Architecture: A CudaDftPipeline class in forge/CUDA/ pre-uploads trajectory and field map data to GPU at construction time, then provides forward_device/adjoint_device methods that launch one-thread-per-output CUDA kernels on interleaved complex data. Gdft and GdftR2 operators create the pipeline in their constructors and dispatch from forgeCol overloads, following the same pattern as Gnufft + CudaNufftPipeline.

Tech Stack: CUDA 13.0, sincosf() intrinsic, cudaMemcpyAsync, cudaMallocAsync

Spec: docs/superpowers/specs/2026-03-20-cuda-dft-pipeline-design.md


File Structure

Files to Create

File | Responsibility
forge/CUDA/CudaDftPipeline.h | Class declaration — constructor, destructor, forward/adjoint device+host APIs
forge/CUDA/CudaDftPipeline.cu | CUDA kernels (4) + pipeline method implementations

Files to Modify

File | Change
forge/CMakeLists.txt | Add CudaDftPipeline.cu to CUDA source list
forge/Operators/Gdft.h | Add cudaPipelineCtx member under CUDA_COMPUTE
forge/Operators/Gdft.cpp | Create pipeline in constructor, add CUDA forgeCol dispatch, add destructor
forge/Operators/GdftR2.h | Add cudaPipelineCtx member under CUDA_COMPUTE
forge/Operators/GdftR2.cpp | Create pipeline with gradients in constructor, add CUDA forgeCol dispatch, add destructor
forge/Tests/operatorTests.cpp | Add CPU vs CUDA comparison tests for Gdft and GdftR2

Task 1: CudaDftPipeline Header

Files:
- Create: forge/CUDA/CudaDftPipeline.h

  - [ ] Step 1: Write the header file
#pragma once
#ifdef CUDA_COMPUTE

#include <cuda_runtime.h>

namespace forge::cuda {

/// GPU-accelerated field-corrected DFT pipeline.
///
/// Computes the forward DFT:
///   s_j = Σ_k m_k · exp(-i·(2π(k_j·r_k) + Δω_k·t_j))
///
/// and its adjoint (conjugate phase). Optionally includes R₂* gradient
/// correction via sinc factors (GdftR2 variant).
///
/// All trajectory, field map, and gradient data are uploaded to GPU once
/// at construction. The forward_device/adjoint_device methods operate on
/// interleaved complex device pointers [re0,im0,re1,im1,...], matching
/// the forgeComplex<float> memory layout used throughout forge.
///
/// Follows the same API pattern as CudaNufftPipeline.
class CudaDftPipeline {
public:
    /// Construct a basic Gdft pipeline (no gradient correction).
    ///
    /// @param kx,ky,kz  k-space coordinates, length numSamples (host pointers)
    /// @param ix,iy,iz  Image-space coordinates, length numPixels (host pointers)
    /// @param FM        Off-resonance field map in rad/s, length numPixels
    /// @param t         Per-sample readout time in seconds, length numSamples
    /// @param numSamples  Number of k-space points (n1)
    /// @param numPixels   Number of image pixels (n2)
    CudaDftPipeline(const float* kx, const float* ky, const float* kz,
                    const float* ix, const float* iy, const float* iz,
                    const float* FM, const float* t,
                    int numSamples, int numPixels);

    /// Construct a GdftR2 pipeline with R₂* gradient correction.
    ///
    /// @param Gx,Gy,Gz    Gradient maps of the field map, length numPixels
    /// @param numX,numY,numZ  Image grid dimensions (for sinc normalization)
    /// Other parameters same as basic constructor.
    CudaDftPipeline(const float* kx, const float* ky, const float* kz,
                    const float* ix, const float* iy, const float* iz,
                    const float* FM, const float* t,
                    const float* Gx, const float* Gy, const float* Gz,
                    int numSamples, int numPixels,
                    int numX, int numY, int numZ);

    ~CudaDftPipeline();

    /// Forward DFT on device pointers (no host transfers).
    /// @param d_image  Input: interleaved complex image, 2*numPixels floats
    /// @param d_kdata  Output: interleaved complex k-space, 2*numSamples floats
    void forward_device(const float* d_image, float* d_kdata);

    /// Adjoint DFT on device pointers (no host transfers).
    /// @param d_kdata  Input: interleaved complex k-space, 2*numSamples floats
    /// @param d_image  Output: interleaved complex image, 2*numPixels floats
    void adjoint_device(const float* d_kdata, float* d_image);

    /// Forward DFT with host pointers (H2D, compute, D2H, sync).
    void forward(const float* h_image, float* h_kdata);

    /// Adjoint DFT with host pointers (H2D, compute, D2H, sync).
    void adjoint(const float* h_kdata, float* h_image);

    // Non-copyable (owns GPU resources)
    CudaDftPipeline(const CudaDftPipeline&) = delete;
    CudaDftPipeline& operator=(const CudaDftPipeline&) = delete;

private:
    // Persistent device arrays — uploaded once at construction
    float* d_kx = nullptr;   // k-space x-coords, length numSamples
    float* d_ky = nullptr;   // k-space y-coords, length numSamples
    float* d_kz = nullptr;   // k-space z-coords, length numSamples
    float* d_ix = nullptr;   // image x-coords, length numPixels
    float* d_iy = nullptr;   // image y-coords, length numPixels
    float* d_iz = nullptr;   // image z-coords, length numPixels
    float* d_FM = nullptr;   // field map (rad/s), length numPixels
    float* d_t  = nullptr;   // readout time (seconds), length numSamples

    // Optional gradient maps for GdftR2 (nullptr if basic Gdft)
    float* d_Gx = nullptr;
    float* d_Gy = nullptr;
    float* d_Gz = nullptr;
    bool hasGradients_ = false;

    // Scratch buffers for host-pointer wrappers
    float* d_input = nullptr;   // 2*max(numSamples,numPixels) floats
    float* d_output = nullptr;  // 2*max(numSamples,numPixels) floats

    int numSamples_;
    int numPixels_;
    int numX_, numY_, numZ_;    // grid dims (for sinc normalization)
    cudaStream_t stream_;
};

} // namespace forge::cuda

#endif // CUDA_COMPUTE
  - [ ] Step 2: Verify Metal build locally
cmake --build build --target ForgeCommon -j4

Expected: builds without errors (header is CUDA-only, not compiled on macOS).

  - [ ] Step 3: Commit
git add forge/CUDA/CudaDftPipeline.h
git commit -m "feat: add CudaDftPipeline header — GPU field-corrected DFT interface"

Task 2: CUDA Kernels and Pipeline Implementation

Files:
- Create: forge/CUDA/CudaDftPipeline.cu
- Modify: forge/CMakeLists.txt:51 (add to CUDA source list)

  - [ ] Step 1: Write the kernel and pipeline implementation

Create forge/CUDA/CudaDftPipeline.cu with:

  1. __device__ float sincPG(float x) — sinc helper matching CPU ftCpuWithGrads.cpp:76-83
  2. kernel_dft_forward — one thread per k-space sample, inner loop over pixels, sincosf(-phase), interleaved complex I/O
  3. kernel_dft_adjoint — one thread per image pixel, inner loop over samples, sincosf(+phase)
  4. kernel_dft_forward_grads — forward + sinc correction factor B(j,k)
  5. kernel_dft_adjoint_grads — adjoint + sinc correction factor
  6. Constructor (basic) — cudaMallocAsync + cudaMemcpy for 8 trajectory/field arrays, allocate scratch buffers
  7. Constructor (grads) — calls basic setup + uploads 3 gradient arrays
  8. Destructor — cudaFreeAsync for all device arrays
  9. forward_device/adjoint_device — dispatch appropriate kernel based on hasGradients_
  10. forward/adjoint — H2D input, call *_device, D2H output, cudaStreamSynchronize
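The sincPG helper (item 1) is small enough to sketch up front. The version below is a plain-C++ stand-in and an assumption — the standard sin(x)/x with a Taylor guard near zero — so verify it against the actual CPU helper at ftCpuWithGrads.cpp:76-83 before porting; on device it becomes a `__device__` function with the same body.

```cpp
#include <cassert>
#include <cmath>

// Assumed form of the sinc helper -- confirm against the CPU implementation
// in ftCpuWithGrads.cpp:76-83 before porting. On device this becomes
// __device__ float sincPG(float x) with the same body.
float sincPG(float x)
{
    // sin(x)/x loses precision near zero; switch to the Taylor expansion
    // sinc(x) ~= 1 - x^2/6, which is exact to float precision for |x| < 1e-3.
    if (fabsf(x) < 1e-3f)
        return 1.0f - x * x / 6.0f;
    return sinf(x) / x;
}
```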

Key implementation notes:
- Use sincosf(-phase, &s, &c) for forward (single trig eval, negative phase)
- Use sincosf(phase, &s, &c) for adjoint (positive phase = conjugate)
- Pre-scale k-coords by 2π in the outer (per-thread) section to avoid a per-pixel multiply
- Block size: 256 threads (same as other forge CUDA kernels)
- Comment the DFT equation at the top of each kernel
- Comment the interleaved complex layout: data[2*k] = real, data[2*k+1] = imag

Reference for kernel math:
- Forward: ftCpu.cpp:94-123 (split real/imag version — adapt to interleaved)
- Adjoint: ftCpu.cpp:132-167
- Grads: ftCpuWithGrads.cpp:93-125 and 157-171
- sincPG: ftCpuWithGrads.cpp:76-83
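As a scalar ground truth for the kernel math, the forward/adjoint pair can be prototyped on the CPU. The sketch below is illustrative, not forge code — names like dftForwardRef are made up here. Each kernel replaces the outer loop with one thread per output element and the cosf/sinf pair with a single sincosf() call, on the same interleaved layout.

```cpp
#include <cassert>
#include <cmath>

// Scalar CPU prototype of the field-corrected DFT pair (illustrative only).
// Interleaved complex layout throughout: data[2*n] = real, data[2*n+1] = imag.
//
// Forward:  s_j = sum_k m_k * exp(-i * phase(j,k))
// phase(j,k) = 2*pi*(kx_j*ix_k + ky_j*iy_k + kz_j*iz_k) + FM_k * t_j
void dftForwardRef(const float* kx, const float* ky, const float* kz,
                   const float* ix, const float* iy, const float* iz,
                   const float* FM, const float* t,
                   const float* image, float* kdata,
                   int numSamples, int numPixels)
{
    const float twoPi = 6.2831853f;
    for (int j = 0; j < numSamples; ++j) {  // kernel: one thread per sample j
        // Pre-scale k-coords by 2*pi once per output (per-thread in the kernel)
        const float kxj = twoPi * kx[j], kyj = twoPi * ky[j], kzj = twoPi * kz[j];
        float accRe = 0.0f, accIm = 0.0f;
        for (int k = 0; k < numPixels; ++k) {
            const float phase = kxj * ix[k] + kyj * iy[k] + kzj * iz[k] + FM[k] * t[j];
            const float c = cosf(phase);
            const float s = -sinf(phase);   // kernel: sincosf(-phase, &s, &c)
            const float re = image[2 * k], im = image[2 * k + 1];
            accRe += re * c - im * s;       // (re + i*im) * (c + i*s)
            accIm += re * s + im * c;
        }
        kdata[2 * j]     = accRe;
        kdata[2 * j + 1] = accIm;
    }
}

// Adjoint:  m_k = sum_j s_j * exp(+i * phase(j,k))   (conjugate phase)
void dftAdjointRef(const float* kx, const float* ky, const float* kz,
                   const float* ix, const float* iy, const float* iz,
                   const float* FM, const float* t,
                   const float* kdata, float* image,
                   int numSamples, int numPixels)
{
    const float twoPi = 6.2831853f;
    for (int k = 0; k < numPixels; ++k) {   // kernel: one thread per pixel k
        float accRe = 0.0f, accIm = 0.0f;
        for (int j = 0; j < numSamples; ++j) {
            const float phase = twoPi * (kx[j] * ix[k] + ky[j] * iy[k] + kz[j] * iz[k])
                                + FM[k] * t[j];
            const float c = cosf(phase);
            const float s = sinf(phase);    // kernel: sincosf(phase, &s, &c)
            const float re = kdata[2 * j], im = kdata[2 * j + 1];
            accRe += re * c - im * s;
            accIm += re * s + im * c;
        }
        image[2 * k]     = accRe;
        image[2 * k + 1] = accIm;
    }
}
```

A useful sanity check on any implementation of this pair is the adjointness identity ⟨Ax, y⟩ = ⟨x, Aᴴy⟩, which holds exactly for these sums.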

  - [ ] Step 2: Add to CMakeLists.txt

In forge/CMakeLists.txt, after line 51 (CudaPenalty.cu), add:

        ${PROJECT_SOURCE_DIR}/forge/CUDA/CudaDftPipeline.cu)
  - [ ] Step 3: Verify Metal build locally
cmake --build build --target ForgeCommon -j4

Expected: builds (.cu file only compiled on CUDA builds).

  - [ ] Step 4: Verify CUDA build on DGX Spark
rsync -avz --exclude build/ --exclude .git/ \
    /Users/acerjanic/Developer/forge/ clj@100.109.49.10:/tmp/forge-cuda/
ssh clj@100.109.49.10 "export PATH=/usr/local/cuda/bin:\$PATH && cd /tmp/forge-cuda && \
    cmake --build build --target ForgeCommon -j8 2>&1 | tail -5"

Expected: compiles without errors.

  - [ ] Step 5: Commit
git add forge/CUDA/CudaDftPipeline.cu forge/CMakeLists.txt
git commit -m "feat: CudaDftPipeline CUDA kernels — forward/adjoint DFT with optional R₂* gradients"

Task 3: Integrate CudaDftPipeline into Gdft Operator

Files:
- Modify: forge/Operators/Gdft.h:102-107 (add CUDA member alongside Metal member)
- Modify: forge/Operators/Gdft.cpp:37-60 (constructor), 62-72 (destructor), 146-214 (forgeCol operators)

  - [ ] Step 1: Add CUDA pipeline member to Gdft.h

After the existing #ifdef METAL_COMPUTE block (lines 102-107), add:

#ifdef CUDA_COMPUTE
    void* cudaPipelineCtx = nullptr;
    ~Gdft();
#endif

Note: The destructor declaration is conditional — Metal already has one, CUDA needs its own. If both Metal and CUDA are defined (shouldn't happen in practice), this would conflict. The existing pattern uses #ifdef METAL_COMPUTE exclusively, so #ifdef CUDA_COMPUTE is safe.

  - [ ] Step 2: Add CUDA constructor logic in Gdft.cpp

In the Gdft constructor, replace the #endif that closes the #ifdef METAL_COMPUTE block (line 59) with:

#elif defined(CUDA_COMPUTE)
    if constexpr (std::is_same<T1, float>::value) {
        cudaPipelineCtx = new forge::cuda::CudaDftPipeline(
            kx.memptr(), ky.memptr(), kz.memptr(),
            ix.memptr(), iy.memptr(), iz.memptr(),
            FM.memptr(), t.memptr(), (int)n1, (int)n2);
    }
#endif

Add include at top of file:

#ifdef CUDA_COMPUTE
#include "CUDA/CudaDftPipeline.h"
#endif

  - [ ] Step 3: Add CUDA destructor in Gdft.cpp

After the #ifdef METAL_COMPUTE destructor block (line 72), add:

#elif defined(CUDA_COMPUTE)
template <typename T1> Gdft<T1>::~Gdft()
{
    if constexpr (std::is_same<T1, float>::value) {
        delete static_cast<forge::cuda::CudaDftPipeline*>(cudaPipelineCtx);
        cudaPipelineCtx = nullptr;
    }
}
#endif

That is, the #elif defined(CUDA_COMPUTE) line above takes the place of the existing #endif on line 72; the new block supplies its own closing #endif.

  - [ ] Step 4: Add CUDA dispatch in forgeCol forward operator

In Gdft::operator*(forgeCol) (line 146), add a CUDA_COMPUTE block after the METAL_COMPUTE block. Follow the Gnufft pattern (Gnufft.cpp:530-548):

#elif defined(CUDA_COMPUTE)
    if constexpr (std::is_same<T1, float>::value) {
        auto* ctx = static_cast<forge::cuda::CudaDftPipeline*>(cudaPipelineCtx);
        if (ctx != nullptr) {
            if (d.on_gpu()) {
                // Data already on GPU — use device-pointer pipeline
                forgeCol<forgeComplex<T1>> result(n1);
                result.allocateOnGPU();
                ctx->forward_device(
                    reinterpret_cast<const float*>(d.device_memptr()),
                    reinterpret_cast<float*>(result.device_memptr()));
                return result;
            } else {
                // Data on host — use host-pointer wrapper (H2D + D2H internally)
                forgeCol<forgeComplex<T1>> result(n1);
                ctx->forward(
                    reinterpret_cast<const float*>(d.memptr()),
                    reinterpret_cast<float*>(result.memptr()));
                return result;
            }
        }
    }
#endif

That is, the #elif defined(CUDA_COMPUTE) line above takes the place of the existing #endif after the Metal block; the new block supplies its own closing #endif.

  - [ ] Step 5: Add CUDA dispatch in forgeCol adjoint operator

Same pattern for Gdft::operator/(forgeCol) (line 181), using adjoint_device/adjoint, output size n2.

  - [ ] Step 6: Verify Metal build locally
cmake --build build --target ForgeCommon -j4
  - [ ] Step 7: Verify CUDA build and existing tests on DGX Spark
rsync ... && ssh ... "cmake --build build --target cpu_tests -j8 && \
    ./build/cpu_tests '[Gdft adjoint]' 2>&1"

Expected: Gdft adjointness test passes with CUDA pipeline active.

  - [ ] Step 8: Commit
git add forge/Operators/Gdft.h forge/Operators/Gdft.cpp
git commit -m "feat: integrate CudaDftPipeline into Gdft — GPU-resident forward/adjoint DFT"

Task 4: Integrate CudaDftPipeline into GdftR2 Operator

Files:
- Modify: forge/Operators/GdftR2.h:123-128 (add CUDA member)
- Modify: forge/Operators/GdftR2.cpp:39-69 (constructor), 71-81 (destructor), 211-end (forgeCol operators)

  - [ ] Step 1: Add CUDA pipeline member to GdftR2.h

After the #ifdef METAL_COMPUTE block (lines 123-128), add the same pattern as Gdft:

#ifdef CUDA_COMPUTE
    void* cudaPipelineCtx = nullptr;
    ~GdftR2();
#endif
  - [ ] Step 2: Add CUDA constructor logic in GdftR2.cpp

In the constructor, replace the #endif that closes the #ifdef METAL_COMPUTE block (line 68) with:

#elif defined(CUDA_COMPUTE)
    if constexpr (std::is_same<T1, float>::value) {
        cudaPipelineCtx = new forge::cuda::CudaDftPipeline(
            kx.memptr(), ky.memptr(), kz.memptr(),
            ix.memptr(), iy.memptr(), iz.memptr(),
            FM.memptr(), t.memptr(),
            Gx.memptr(), Gy.memptr(), Gz.memptr(),
            (int)n1, (int)n2, numX, numY, numZ);
    }
#endif

Add the #include "CUDA/CudaDftPipeline.h" at the top of the file (guarded by CUDA_COMPUTE).

  - [ ] Step 3: Add CUDA destructor and forgeCol dispatch

Follow the exact same pattern as Task 3 Steps 3-5, but for GdftR2.

  - [ ] Step 4: Verify CUDA build and existing tests on DGX Spark
rsync ... && ssh ... "cmake --build build --target cpu_tests -j8 && \
    ./build/cpu_tests '~[Benchmark]~[Spiral3D]' 2>&1 | tail -3"

Expected: All existing tests pass (GdftR2 now uses CUDA pipeline).

  - [ ] Step 5: Commit
git add forge/Operators/GdftR2.h forge/Operators/GdftR2.cpp
git commit -m "feat: integrate CudaDftPipeline into GdftR2 — GPU-resident DFT with R₂* gradients"

Task 5: CPU vs CUDA Comparison Tests

Files:
- Modify: forge/Tests/operatorTests.cpp

  - [ ] Step 1: Write Gdft CPU vs CUDA comparison test

Add after the existing #ifdef CUDA_COMPUTE penalty tests:

#ifdef CUDA_COMPUTE
TEST_CASE("Gdft forward/adjoint: CPU vs CUDA", "[Gdft][CUDA]")
{
    uword Nx = 16, Ny = 16;
    uword n1 = Nx * Ny;
    uword n2 = Nx * Ny;

    Col<float> ix, iy, iz, kxG, kyG, kzG, kxN, kyN, kzN;
    makeCartesianCoords(Nx, Ny, ix, iy, iz, kxG, kyG, kzG, kxN, kyN, kzN);

    // Non-zero field map and time vector for meaningful phase correction
    Col<float> FM(n2);
    for (uword i = 0; i < n2; i++)
        FM(i) = 100.0f * sinf(2.0f * M_PI * i / n2);
    Col<float> t(n1);
    for (uword i = 0; i < n1; i++)
        t(i) = 1e-4f * i;

    Gdft<float> G(n1, n2, kxG, kyG, kzG, ix, iy, iz, FM, t);

    // Deterministic test image
    forgeCol<forgeComplex<float>> x(n2);
    for (uword i = 0; i < n2; i++)
        x.at(i) = forgeComplex<float>((float)(i % 17) / 17.0f, (float)(i % 13) / 13.0f);

    // CPU forward
    forgeCol<forgeComplex<float>> fwd_cpu = G * x;

    // GPU forward
    forgeCol<forgeComplex<float>> x_gpu(x);
    x_gpu.putOnGPU();
    forgeCol<forgeComplex<float>> fwd_gpu = G * x_gpu;
    if (fwd_gpu.on_gpu()) fwd_gpu.getFromGPU();

    SECTION("forward") {
        float err = relError(fwd_cpu, fwd_gpu);
        CAPTURE(err);
        REQUIRE(err < 1e-4f);
    }

    // CPU adjoint
    forgeCol<forgeComplex<float>> adj_cpu = G / fwd_cpu;

    // GPU adjoint
    forgeCol<forgeComplex<float>> fwd_gpu2(fwd_cpu);
    fwd_gpu2.putOnGPU();
    forgeCol<forgeComplex<float>> adj_gpu = G / fwd_gpu2;
    if (adj_gpu.on_gpu()) adj_gpu.getFromGPU();

    SECTION("adjoint") {
        float err = relError(adj_cpu, adj_gpu);
        CAPTURE(err);
        REQUIRE(err < 1e-4f);
    }
}
#endif

Note: relError helper was already added in this branch (operatorTests.cpp, line 258).

  - [ ] Step 2: Write GdftR2 CPU vs CUDA comparison test

Same structure but constructs GdftR2<float> with grid dimensions:

#ifdef CUDA_COMPUTE
TEST_CASE("GdftR2 forward/adjoint: CPU vs CUDA", "[GdftR2][CUDA]")
{
    uword Nx = 16, Ny = 16;
    uword n1 = Nx * Ny;
    uword n2 = Nx * Ny;

    Col<float> ix, iy, iz, kxG, kyG, kzG, kxN, kyN, kzN;
    makeCartesianCoords(Nx, Ny, ix, iy, iz, kxG, kyG, kzG, kxN, kyN, kzN);

    Col<float> FM(n2);
    for (uword i = 0; i < n2; i++)
        FM(i) = 100.0f * sinf(2.0f * M_PI * i / n2);
    Col<float> t(n1);
    for (uword i = 0; i < n1; i++)
        t(i) = 1e-4f * i;

    GdftR2<float> G(n1, n2, kxG, kyG, kzG, ix, iy, iz, FM, t, (int)Nx, (int)Ny, 1);

    forgeCol<forgeComplex<float>> x(n2);
    for (uword i = 0; i < n2; i++)
        x.at(i) = forgeComplex<float>((float)(i % 17) / 17.0f, (float)(i % 13) / 13.0f);

    forgeCol<forgeComplex<float>> fwd_cpu = G * x;

    forgeCol<forgeComplex<float>> x_gpu(x);
    x_gpu.putOnGPU();
    forgeCol<forgeComplex<float>> fwd_gpu = G * x_gpu;
    if (fwd_gpu.on_gpu()) fwd_gpu.getFromGPU();

    SECTION("forward") {
        float err = relError(fwd_cpu, fwd_gpu);
        CAPTURE(err);
        REQUIRE(err < 1e-4f);
    }

    forgeCol<forgeComplex<float>> adj_cpu = G / fwd_cpu;
    forgeCol<forgeComplex<float>> fwd_gpu2(fwd_cpu);
    fwd_gpu2.putOnGPU();
    forgeCol<forgeComplex<float>> adj_gpu = G / fwd_gpu2;
    if (adj_gpu.on_gpu()) adj_gpu.getFromGPU();

    SECTION("adjoint") {
        float err = relError(adj_cpu, adj_gpu);
        CAPTURE(err);
        REQUIRE(err < 1e-4f);
    }
}
#endif
  - [ ] Step 3: Build and run all tests on DGX Spark
rsync ... && ssh ... "cmake --build build --target cpu_tests -j8 && \
    ./build/cpu_tests '[Gdft][CUDA]' 2>&1 && \
    ./build/cpu_tests '[GdftR2][CUDA]' 2>&1 && \
    ./build/cpu_tests '~[Benchmark]~[Spiral3D]' 2>&1 | tail -3"

Expected: New CUDA tests pass. All existing tests pass.

  - [ ] Step 4: Commit
git add forge/Tests/operatorTests.cpp
git commit -m "test: add CPU vs CUDA comparison tests for Gdft and GdftR2"

Task 6: Final Verification

  - [ ] Step 1: Run full test suite on DGX Spark
ssh clj@100.109.49.10 "export PATH=/usr/local/cuda/bin:\$PATH && cd /tmp/forge-cuda && \
    ./build/cpu_tests '~[Benchmark]' 2>&1 | tail -5"

Expected: All tests pass (including Spiral3D which uses SENSE-adjacent operators).

  - [ ] Step 2: Verify Metal tests locally
cmake --build build --target cpu_tests -j4 && ./build/cpu_tests '~[Benchmark]'

Expected: All tests pass on macOS Metal build.

  - [ ] Step 3: Benchmark Gdft performance
ssh clj@100.109.49.10 "export PATH=/usr/local/cuda/bin:\$PATH && cd /tmp/forge-cuda && \
    ./build/cpu_tests '[Gdft_bench]' 2>&1"

Compare CUDA vs CPU Gdft timing. The direct DFT is O(numSamples × numPixels), so GPU acceleration should be significant at larger problem sizes.


Summary of Commits

Task | Commit Message
1 | feat: add CudaDftPipeline header — GPU field-corrected DFT interface
2 | feat: CudaDftPipeline CUDA kernels — forward/adjoint DFT with optional R₂* gradients
3 | feat: integrate CudaDftPipeline into Gdft — GPU-resident forward/adjoint DFT
4 | feat: integrate CudaDftPipeline into GdftR2 — GPU-resident DFT with R₂* gradients
5 | test: add CPU vs CUDA comparison tests for Gdft and GdftR2