
CudaDFT Pipeline Design Spec

Date: 2026-03-20
Status: Approved
Scope: GPU-accelerated field-corrected DFT for the Gdft and GdftR2 operators via a new CudaDftPipeline class.


1. Problem Statement

The Gdft and GdftR2 operators compute a field-corrected discrete Fourier transform (DFT) with O(N_samples × N_pixels) complexity. On CPU, this is parallelized with OpenMP (one thread per output element). There is no CUDA implementation — Gdft/GdftR2 fall back to the CPU path on CUDA builds, forcing host↔device round-trips in SENSE and pcSENSE when combined with GPU-resident solvers.

The DFT is embarrassingly parallel: each output element is an independent weighted sum over all input elements. This maps directly to GPU threads with no inter-thread communication.


2. DFT Mathematics

Forward Transform (image → k-space)

For each k-space sample j, sum over all image pixels k:

s_j = Σ_k m_k · exp(-i · (2π(k_j · r_k) + Δω_k · t_j))

Where:

  • s_j = k-space output at sample j
  • m_k = complex image value at pixel k
  • k_j = (kx_j, ky_j, kz_j) k-space coordinates at sample j
  • r_k = (ix_k, iy_k, iz_k) image-space coordinates at pixel k
  • Δω_k = off-resonance field map at pixel k (rad/s)
  • t_j = readout time at sample j (seconds)
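For reference, the forward sum can be sketched as a plain C++ double loop. This is an illustrative CPU sketch only, not the production OpenMP or CUDA operator; the function name and container types are assumptions.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Reference forward DFT: s_j = Σ_k m_k · exp(-i·(2π(k_j·r_k) + Δω_k·t_j)),
// written for a general 3-D trajectory. Argument names mirror the spec.
std::vector<std::complex<float>> dftForward(
    const std::vector<std::complex<float>>& m,        // image, length numPixels
    const std::vector<float>& kx, const std::vector<float>& ky,
    const std::vector<float>& kz,                     // k-space coords, length numSamples
    const std::vector<float>& ix, const std::vector<float>& iy,
    const std::vector<float>& iz,                     // image coords, length numPixels
    const std::vector<float>& FM,                     // field map (rad/s), length numPixels
    const std::vector<float>& t) {                    // readout times (s), length numSamples
    const float twoPi = 2.0f * 3.14159265358979323846f;
    const std::size_t numSamples = kx.size();
    const std::size_t numPixels = m.size();
    std::vector<std::complex<float>> s(numSamples);
    for (std::size_t j = 0; j < numSamples; ++j) {
        std::complex<float> acc(0.0f, 0.0f);
        for (std::size_t k = 0; k < numPixels; ++k) {
            float phase = twoPi * (kx[j] * ix[k] + ky[j] * iy[k] + kz[j] * iz[k])
                        + FM[k] * t[j];
            // exp(-i·phase) = cos(phase) - i·sin(phase)
            acc += m[k] * std::complex<float>(std::cos(phase), -std::sin(phase));
        }
        s[j] = acc;
    }
    return s;
}
```

A single pixel at the origin with a zero field map passes through unchanged (phase = 0), which makes a convenient sanity check.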

Adjoint Transform (k-space → image)

For each image pixel k, sum over all k-space samples j:

m_k = Σ_j d_j · exp(+i · (2π(k_j · r_k) + Δω_k · t_j))

The exponent is conjugated relative to the forward transform: the phase sign is positive instead of negative.
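Because the adjoint only conjugates the phase, the pair satisfies the inner-product identity ⟨A m, d⟩ = ⟨m, Aᴴ d⟩, which is what the adjointness tests later in this spec verify. A minimal 1-D sketch of both directions in one helper (function name and the 1-D reduction are illustrative assumptions):

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

using cfloat = std::complex<float>;

// Forward (sign = -1) and adjoint (sign = +1) share one loop structure; only
// the phase sign and the roles of the sample/pixel indices swap. 1-D sketch.
std::vector<cfloat> dftApply(const std::vector<cfloat>& in,
                             const std::vector<float>& k,   // k-space coords
                             const std::vector<float>& r,   // image coords
                             const std::vector<float>& FM,  // field map (rad/s)
                             const std::vector<float>& t,   // readout times (s)
                             float sign) {
    const float twoPi = 2.0f * 3.14159265358979323846f;
    const std::size_t nOut = (sign < 0) ? k.size() : r.size();
    const std::size_t nIn  = (sign < 0) ? r.size() : k.size();
    std::vector<cfloat> out(nOut, cfloat(0.0f, 0.0f));
    for (std::size_t o = 0; o < nOut; ++o)
        for (std::size_t i = 0; i < nIn; ++i) {
            std::size_t j = (sign < 0) ? o : i;  // sample index
            std::size_t p = (sign < 0) ? i : o;  // pixel index
            float phase = twoPi * k[j] * r[p] + FM[p] * t[j];
            out[o] += in[i] * std::polar(1.0f, sign * phase);
        }
    return out;
}
```

Checking ⟨A m, d⟩ against ⟨m, Aᴴ d⟩ on random-ish small inputs confirms the sign convention.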

GdftR2 Gradient Variant

Adds a sinc-based R₂* correction factor per sample-pixel pair:

B(j,k) = sinc(kx_j/Nx + Gx_k·t_j) · sinc(ky_j/Ny + Gy_k·t_j) · sinc(kz_j/Nz + Gz_k·t_j)

The forward/adjoint sums are multiplied by B(j,k). Gx, Gy, Gz are pre-computed gradient maps of the field map.


3. Changes

3.1 New File: forge/CUDA/CudaDftPipeline.h

Class declaration for the CUDA DFT pipeline.

class CudaDftPipeline {
    // Pre-uploaded trajectory and field data (persistent on GPU)
    float* d_kx;    // k-space x-coordinates, length numSamples
    float* d_ky;    // k-space y-coordinates, length numSamples
    float* d_kz;    // k-space z-coordinates, length numSamples
    float* d_ix;    // image x-coordinates, length numPixels
    float* d_iy;    // image y-coordinates, length numPixels
    float* d_iz;    // image z-coordinates, length numPixels
    float* d_FM;    // field map (rad/s), length numPixels
    float* d_t;     // readout time vector (seconds), length numSamples

    // Optional gradient maps for GdftR2 (nullptr if basic Gdft)
    float* d_Gx;    // x-gradient of field map, length numPixels
    float* d_Gy;    // y-gradient of field map, length numPixels
    float* d_Gz;    // z-gradient of field map, length numPixels
    bool hasGradients;

    int numSamples; // n1: number of k-space points
    int numPixels;  // n2: number of image pixels
    int numX, numY, numZ; // grid dimensions (for sinc normalization in GdftR2)
    cudaStream_t stream;

public:
    // Basic Gdft constructor (no gradients)
    CudaDftPipeline(const float* kx, const float* ky, const float* kz,
                    const float* ix, const float* iy, const float* iz,
                    const float* FM, const float* t,
                    int numSamples, int numPixels);

    // GdftR2 constructor (with gradient maps)
    CudaDftPipeline(const float* kx, const float* ky, const float* kz,
                    const float* ix, const float* iy, const float* iz,
                    const float* FM, const float* t,
                    const float* Gx, const float* Gy, const float* Gz,
                    int numSamples, int numPixels,
                    int numX, int numY, int numZ);

    ~CudaDftPipeline();

    // Device-pointer API: interleaved complex data [re0,im0,re1,im1,...]
    void forward_device(const float* d_image, float* d_kdata);
    void adjoint_device(const float* d_kdata, float* d_image);

    // Host-pointer wrappers: H2D upload, run pipeline, D2H download, sync
    void forward(const float* h_image, float* h_kdata);
    void adjoint(const float* h_kdata, float* h_image);
};

3.2 New File: forge/CUDA/CudaDftPipeline.cu

Contains four CUDA kernels and the pipeline implementation.

Kernel: kernel_dft_forward — one thread per k-space sample, no gradients

thread j (j < numSamples):
    sum_re = 0.0, sum_im = 0.0
    // Pre-scale k-space coords by 2π (avoids per-pixel multiply)
    kx_scaled = d_kx[j] * 2π
    ky_scaled = d_ky[j] * 2π
    kz_scaled = d_kz[j] * 2π
    t_j = d_t[j]

    for k = 0 to numPixels-1:
        // Phase: 2π(k·r) + Δω·t
        phase = kx_scaled*d_ix[k] + ky_scaled*d_iy[k] + kz_scaled*d_iz[k] + d_FM[k]*t_j

        // exp(-i·phase) = cos(phase) - i·sin(phase)
        sincosf(-phase, &s, &c)

        // Complex multiply: (c + i·s) * (re + i·im)
        re_in = d_image[2*k]
        im_in = d_image[2*k + 1]
        sum_re += c * re_in - s * im_in
        sum_im += s * re_in + c * im_in

    d_kdata[2*j] = sum_re
    d_kdata[2*j + 1] = sum_im
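Launching a one-thread-per-output kernel like this reduces to a ceil-division grid size. A minimal sketch (the 256-thread block size is an assumption for illustration, not taken from the spec):

```cpp
#include <cassert>

// Ceil-division grid size for a 1-D launch: one thread per output element.
// kernel<<<gridSize(numSamples), 256>>>(...) with a j < numSamples guard.
unsigned gridSize(unsigned numOutputs, unsigned blockSize = 256) {
    return (numOutputs + blockSize - 1) / blockSize;
}
```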

Kernel: kernel_dft_adjoint — one thread per image pixel, conjugate phase

Same structure, but with one thread per image pixel, an inner loop over k-space samples, and sincosf(+phase, ...) (positive phase sign).

Kernel: kernel_dft_forward_grads — same as forward + sinc correction

Adds per-accumulation factor:

    B = sincPG(kx[j]/numX + Gx[k]*t_j)
      * sincPG(ky[j]/numY + Gy[k]*t_j)
      * sincPG(kz[j]/numZ + Gz[k]*t_j)
    sum_re += B * (c * re_in - s * im_in)
    sum_im += B * (s * re_in + c * im_in)

Uses a __device__ sincPG(float x) helper: returns sinf(πx)/(πx) with special case x==0 → 1.0f.
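A CPU sketch of the sincPG helper described above (the CUDA version would use sinf and the __device__ qualifier):

```cpp
#include <cassert>
#include <cmath>

// Normalized sinc with the x == 0 special case:
// sincPG(x) = sin(pi*x)/(pi*x), and lim_{x->0} sincPG(x) = 1.
float sincPG(float x) {
    const float kPi = 3.14159265358979323846f;
    if (x == 0.0f) return 1.0f;  // avoid 0/0 at the singularity
    const float px = kPi * x;
    return std::sin(px) / px;
}
```

The x == 0 branch is what keeps B(j,k) well defined when a k-space coordinate and the local gradient term cancel exactly.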

Kernel: kernel_dft_adjoint_grads — adjoint with sinc correction.

Pipeline methods:

  • Constructor: cudaMallocAsync + cudaMemcpyAsync (H2D) for all trajectory/field/gradient arrays.
  • Destructor: cudaFreeAsync for all device arrays.
  • forward_device / adjoint_device: dispatch the appropriate kernel (with or without gradients based on hasGradients flag).
  • forward / adjoint: cudaMemcpyAsync H2D for input, call *_device, cudaMemcpyAsync D2H for output, cudaStreamSynchronize.

3.3 Modified: forge/Operators/Gdft.h

Add CudaDftPipeline* member:

#ifdef CUDA_COMPUTE
    void* cudaPipelineCtx = nullptr;
#endif

3.4 Modified: forge/Operators/Gdft.cpp

Constructor: Create CudaDftPipeline for float builds:

#ifdef CUDA_COMPUTE
    if constexpr (std::is_same<T1, float>::value) {
        cudaPipelineCtx = new forge::cuda::CudaDftPipeline(
            kx.memptr(), ky.memptr(), kz.memptr(),
            ix.memptr(), iy.memptr(), iz.memptr(),
            FM.memptr(), t.memptr(), n1, n2);
    }
#endif

Destructor: Delete pipeline.

forgeCol operators: Add CUDA_COMPUTE block identical to Gnufft's pattern — check d.on_gpu(), dispatch to forward_device or forward.

3.5 Modified: forge/Operators/GdftR2.h and GdftR2.cpp

Same pattern as Gdft but using the gradient constructor:

cudaPipelineCtx = new forge::cuda::CudaDftPipeline(
    kx.memptr(), ..., FM.memptr(), t.memptr(),
    Gx.memptr(), Gy.memptr(), Gz.memptr(),
    n1, n2, numX, numY, numZ);

3.6 Modified: forge/Tests/operatorTests.cpp

  • Gdft adjointness with CUDA — existing adjoint test should pass automatically since it uses forgeCol overloads.
  • New: Gdft CPU vs CUDA forward/adjoint comparison — deterministic input, compare element-wise relative error < 1e-4.
  • New: GdftR2 CPU vs CUDA comparison — same pattern.
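The CPU-vs-CUDA comparisons reduce to an element-wise relative-error check over interleaved buffers; a possible helper is sketched below (the name, the floor guard, and the signature are illustrative assumptions, not the actual operatorTests.cpp code):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Maximum element-wise relative error between two interleaved complex
// buffers (length n floats), with a small floor to guard division by
// near-zero reference values.
float maxRelError(const float* ref, const float* test, std::size_t n,
                  float floor = 1e-12f) {
    float worst = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        float denom = std::fmax(std::fabs(ref[i]), floor);
        float err = std::fabs(test[i] - ref[i]) / denom;
        if (err > worst) worst = err;
    }
    return worst;
}
```

A test would then assert maxRelError(cpuOut, cudaOut, 2 * numSamples) < 1e-4f.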

3.7 Modified: CMakeLists.txt

Add forge/CUDA/CudaDftPipeline.cu to the ForgeCommon sources under the existing CUDA source list.


4. Data Format

All device-pointer APIs use interleaved complex format: [re_0, im_0, re_1, im_1, ...]. This matches forgeComplex<float> memory layout and the CudaNufftPipeline convention. The reinterpret_cast<const float*>(d.device_memptr()) pattern used throughout the codebase works directly.
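The interleaved layout is the memory layout std::complex<float> guarantees (layout-compatible with float[2] since C++11), which is why the cast works without a copy. A small illustration (helper name is hypothetical):

```cpp
#include <cassert>
#include <complex>
#include <vector>

// View a std::complex<float> array as the interleaved float buffer the
// device-pointer API expects: [re0, im0, re1, im1, ...]. No copy needed.
const float* asInterleaved(const std::vector<std::complex<float>>& v) {
    return reinterpret_cast<const float*>(v.data());
}
```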

Trajectory arrays (kx, ky, kz, ix, iy, iz, FM, t, Gx, Gy, Gz) are real-valued float arrays.


5. What Is NOT In Scope

  • Double precision (T1 = double) CUDA kernels — float only, matching Metal and CudaNufftPipeline.
  • Shared memory tiling of the inner loop — the basic thread-per-output kernel is the first step. Tiling can be added later if profiling shows memory bandwidth is the bottleneck.
  • __half (FP16) or tensor core acceleration — future optimization.

6. Success Criteria

  • All existing Gdft and GdftR2 tests pass on CUDA builds.
  • New CPU vs CUDA comparison tests pass with relative error < 1e-4.
  • Gdft adjointness property holds on GPU (relError < 1e-5).
  • SENSE and pcSENSE now use GPU-resident DFT path when data is on GPU.

7. Risks

  • Risk: sinf/cosf precision differs from CPU. Mitigation: use a 1e-4 tolerance in the comparison tests; the DFT accumulates many terms, so small per-element differences compound.
  • Risk: large problem sizes exhaust GPU memory for trajectory arrays. Mitigation: the same sizes already work on CPU, and the trajectory arrays are small relative to the image data.
  • Risk: the GdftR2 sinc near zero causes numerical issues. Mitigation: use the same sincPG special-case handling as the CPU implementation.