CudaDFT Pipeline Design Spec¶
Date: 2026-03-20
Status: Approved
Scope: GPU-accelerated field-corrected DFT for the Gdft and GdftR2 operators via a new CudaDftPipeline class.
1. Problem Statement¶
The Gdft and GdftR2 operators compute a field-corrected discrete Fourier transform (DFT) with O(N_samples × N_pixels) complexity. On CPU, this is parallelized with OpenMP (one thread per output element). There is no CUDA implementation — Gdft/GdftR2 fall back to the CPU path on CUDA builds, forcing host↔device round-trips in SENSE and pcSENSE when combined with GPU-resident solvers.
The DFT is embarrassingly parallel: each output element is an independent weighted sum over all input elements. This maps directly to GPU threads with no inter-thread communication.
2. DFT Mathematics¶
Forward Transform (image → k-space)¶
For each k-space sample j, sum over all image pixels k:
s_j = Σ_k m_k · exp(-i · (2π(k_j · r_k) + Δω_k · t_j))
Where:
- s_j = k-space output at sample j
- m_k = complex image value at pixel k
- k_j = (kx_j, ky_j, kz_j) k-space coordinates
- r_k = (ix_k, iy_k, iz_k) image-space coordinates
- Δω_k = off-resonance field map at pixel k (rad/s)
- t_j = readout time at sample j (seconds)
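As a concrete reference for the sum above, a minimal CPU implementation could look like the following. This is an illustrative sketch, not the Gdft code itself; all names follow the definitions above.

```cpp
#include <cmath>
#include <complex>
#include <vector>

// Reference forward DFT: s_j = sum_k m_k * exp(-i*(2*pi*(k_j . r_k) + dW_k * t_j))
// One output per k-space sample j, summing over all image pixels k.
std::vector<std::complex<float>> dft_forward(
    const std::vector<float>& kx, const std::vector<float>& ky, const std::vector<float>& kz,
    const std::vector<float>& ix, const std::vector<float>& iy, const std::vector<float>& iz,
    const std::vector<float>& FM, const std::vector<float>& t,
    const std::vector<std::complex<float>>& image) {
  const float TWO_PI = 6.283185307179586f;
  std::vector<std::complex<float>> kdata(kx.size());
  for (size_t j = 0; j < kx.size(); ++j) {
    std::complex<float> sum(0.0f, 0.0f);
    for (size_t k = 0; k < ix.size(); ++k) {
      float phase = TWO_PI * (kx[j]*ix[k] + ky[j]*iy[k] + kz[j]*iz[k]) + FM[k]*t[j];
      // exp(-i*phase) = cos(phase) - i*sin(phase)
      sum += image[k] * std::complex<float>(std::cos(phase), -std::sin(phase));
    }
    kdata[j] = sum;
  }
  return kdata;
}
```

A single pixel at the spatial origin with zero field map is a handy sanity check: the phase vanishes for every sample, so the output equals the pixel value.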
Adjoint Transform (k-space → image)¶
For each image pixel k, sum over all k-space samples j:
m_k = Σ_j d_j · exp(+i · (2π(k_j · r_k) + Δω_k · t_j))
The phase sign is flipped (positive instead of negative), i.e. the exponential kernel is the complex conjugate of the forward one.
GdftR2 Gradient Variant¶
Adds a sinc-based R₂* correction factor per sample-pixel pair:
B(j,k) = sinc(kx_j/Nx + Gx_k·t_j) · sinc(ky_j/Ny + Gy_k·t_j) · sinc(kz_j/Nz + Gz_k·t_j)
The forward/adjoint sums are multiplied by B(j,k). Gx, Gy, Gz are pre-computed gradient maps of the field map.
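The sinc factor has a removable singularity at zero, so the helper must special-case it. A host-side sketch of the helper and the B(j,k) product follows; §3.2 names a `sincPG` helper with this definition, while `bFactor` is an illustrative name introduced here.

```cpp
#include <cmath>

// sinc(x) = sin(pi*x)/(pi*x), with sinc(0) = 1 (removable singularity).
float sincPG(float x) {
  if (x == 0.0f) return 1.0f;
  const float pix = 3.14159265358979f * x;
  return std::sin(pix) / pix;
}

// R2*-correction factor B(j,k) for one sample/pixel pair, per the formula above.
float bFactor(float kx_j, float ky_j, float kz_j,
              float Gx_k, float Gy_k, float Gz_k,
              float t_j, int Nx, int Ny, int Nz) {
  return sincPG(kx_j / Nx + Gx_k * t_j)
       * sincPG(ky_j / Ny + Gy_k * t_j)
       * sincPG(kz_j / Nz + Gz_k * t_j);
}
```

With zero gradients and a sample at the k-space origin, B is exactly 1, so the GdftR2 sums reduce to the plain Gdft sums.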
3. Changes¶
3.1 New File: forge/CUDA/CudaDftPipeline.h¶
Class declaration for the CUDA DFT pipeline.
class CudaDftPipeline {
// Pre-uploaded trajectory and field data (persistent on GPU)
float* d_kx; // k-space x-coordinates, length numSamples
float* d_ky; // k-space y-coordinates, length numSamples
float* d_kz; // k-space z-coordinates, length numSamples
float* d_ix; // image x-coordinates, length numPixels
float* d_iy; // image y-coordinates, length numPixels
float* d_iz; // image z-coordinates, length numPixels
float* d_FM; // field map (rad/s), length numPixels
float* d_t; // readout time vector (seconds), length numSamples
// Optional gradient maps for GdftR2 (nullptr if basic Gdft)
float* d_Gx; // x-gradient of field map, length numPixels
float* d_Gy; // y-gradient of field map, length numPixels
float* d_Gz; // z-gradient of field map, length numPixels
bool hasGradients;
int numSamples; // n1: number of k-space points
int numPixels; // n2: number of image pixels
int numX, numY, numZ; // grid dimensions (for sinc normalization in GdftR2)
cudaStream_t stream;
public:
// Basic Gdft constructor (no gradients)
CudaDftPipeline(const float* kx, const float* ky, const float* kz,
const float* ix, const float* iy, const float* iz,
const float* FM, const float* t,
int numSamples, int numPixels);
// GdftR2 constructor (with gradient maps)
CudaDftPipeline(const float* kx, const float* ky, const float* kz,
const float* ix, const float* iy, const float* iz,
const float* FM, const float* t,
const float* Gx, const float* Gy, const float* Gz,
int numSamples, int numPixels,
int numX, int numY, int numZ);
~CudaDftPipeline();
// Device-pointer API: interleaved complex data [re0,im0,re1,im1,...]
void forward_device(const float* d_image, float* d_kdata);
void adjoint_device(const float* d_kdata, float* d_image);
// Host-pointer wrappers: H2D upload, run pipeline, D2H download, sync
void forward(const float* h_image, float* h_kdata);
void adjoint(const float* h_kdata, float* h_image);
};
3.2 New File: forge/CUDA/CudaDftPipeline.cu¶
Contains four CUDA kernels and the pipeline implementation.
Kernel: kernel_dft_forward — one thread per k-space sample, no gradients
thread j (j < numSamples):
sum_re = 0.0, sum_im = 0.0
// Pre-scale k-space coords by 2π (avoids per-pixel multiply)
kx_scaled = d_kx[j] * 2π
ky_scaled = d_ky[j] * 2π
kz_scaled = d_kz[j] * 2π
t_j = d_t[j]
for k = 0 to numPixels-1:
// Phase: 2π(k·r) + Δω·t
phase = kx_scaled*d_ix[k] + ky_scaled*d_iy[k] + kz_scaled*d_iz[k] + d_FM[k]*t_j
// exp(-i·phase) = cos(phase) - i·sin(phase)
sincosf(-phase, &s, &c)
// Complex multiply: (c + i·s) * (re + i·im)
re_in = d_image[2*k]
im_in = d_image[2*k + 1]
sum_re += c * re_in - s * im_in
sum_im += s * re_in + c * im_in
d_kdata[2*j] = sum_re
d_kdata[2*j + 1] = sum_im
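Lowered to actual CUDA, the pseudocode above might compile to something like the following sketch; the parameter list and launch configuration are assumptions, not the final signature.

```cuda
// Forward DFT kernel (no gradients): one thread per k-space sample.
__global__ void kernel_dft_forward(const float* __restrict__ d_kx,
                                   const float* __restrict__ d_ky,
                                   const float* __restrict__ d_kz,
                                   const float* __restrict__ d_ix,
                                   const float* __restrict__ d_iy,
                                   const float* __restrict__ d_iz,
                                   const float* __restrict__ d_FM,
                                   const float* __restrict__ d_t,
                                   const float* __restrict__ d_image,
                                   float* __restrict__ d_kdata,
                                   int numSamples, int numPixels) {
  int j = blockIdx.x * blockDim.x + threadIdx.x;
  if (j >= numSamples) return;
  const float TWO_PI = 6.283185307179586f;
  float kx = d_kx[j] * TWO_PI;   // pre-scale once per thread
  float ky = d_ky[j] * TWO_PI;
  float kz = d_kz[j] * TWO_PI;
  float tj = d_t[j];
  float sum_re = 0.0f, sum_im = 0.0f;
  for (int k = 0; k < numPixels; ++k) {
    float phase = kx * d_ix[k] + ky * d_iy[k] + kz * d_iz[k] + d_FM[k] * tj;
    float s, c;
    sincosf(-phase, &s, &c);             // exp(-i*phase) = c + i*s
    float re = d_image[2 * k];
    float im = d_image[2 * k + 1];
    sum_re += c * re - s * im;           // complex multiply-accumulate
    sum_im += s * re + c * im;
  }
  d_kdata[2 * j] = sum_re;
  d_kdata[2 * j + 1] = sum_im;
}
```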
Kernel: kernel_dft_adjoint — one thread per image pixel, conjugate phase
Same structure, but with one thread per image pixel, the per-thread loop running over samples, and sincosf(+phase, ...) (positive sign).
Kernel: kernel_dft_forward_grads — same as forward + sinc correction
Adds a per-accumulation factor, computed with a __device__ sincPG(float x) helper that returns sinf(π·x)/(π·x) with the special case x == 0 → 1.0f:
B = sincPG(kx[j]/numX + Gx[k]*t_j)
* sincPG(ky[j]/numY + Gy[k]*t_j)
* sincPG(kz[j]/numZ + Gz[k]*t_j)
sum_re += B * (c * re_in - s * im_in)
sum_im += B * (s * re_in + c * im_in)
Kernel: kernel_dft_adjoint_grads — adjoint with sinc correction.
Pipeline methods:
- Constructor: cudaMallocAsync + cudaMemcpyAsync (H2D) for all trajectory/field/gradient arrays.
- Destructor: cudaFreeAsync for all device arrays.
- forward_device/adjoint_device: dispatch the appropriate kernel (with or without gradients, based on the hasGradients flag).
- forward/adjoint: cudaMemcpyAsync H2D for the input, call *_device, cudaMemcpyAsync D2H for the output, then cudaStreamSynchronize.
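The host-pointer forward() wrapper could then look roughly like this. It is a sketch only: the d_in/d_out scratch buffers are hypothetical members (sized 2·numPixels and 2·numSamples floats), and the real class may allocate them differently.

```cuda
#include <cuda_runtime.h>

void CudaDftPipeline::forward(const float* h_image, float* h_kdata) {
  // Upload interleaved complex image, run the DFT, download interleaved k-data.
  cudaMemcpyAsync(d_in, h_image, 2 * numPixels * sizeof(float),
                  cudaMemcpyHostToDevice, stream);
  forward_device(d_in, d_out);                  // launches the forward kernel
  cudaMemcpyAsync(h_kdata, d_out, 2 * numSamples * sizeof(float),
                  cudaMemcpyDeviceToHost, stream);
  cudaStreamSynchronize(stream);                // h_kdata valid after this
}
```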
3.3 Modified: forge/Operators/Gdft.h¶
Add an opaque pipeline context member (a void* rather than a CudaDftPipeline*, so the operator header does not pull in CUDA headers):
#ifdef CUDA_COMPUTE
void* cudaPipelineCtx = nullptr;
#endif
3.4 Modified: forge/Operators/Gdft.cpp¶
Constructor: Create CudaDftPipeline for float builds:
#ifdef CUDA_COMPUTE
if constexpr (std::is_same<T1, float>::value) {
cudaPipelineCtx = new forge::cuda::CudaDftPipeline(
kx.memptr(), ky.memptr(), kz.memptr(),
ix.memptr(), iy.memptr(), iz.memptr(),
FM.memptr(), t.memptr(), n1, n2);
}
#endif
Destructor: cast cudaPipelineCtx back to CudaDftPipeline* and delete it.
forgeCol operators: Add CUDA_COMPUTE block identical to Gnufft's pattern — check d.on_gpu(), dispatch to forward_device or forward.
3.5 Modified: forge/Operators/GdftR2.h and GdftR2.cpp¶
Same pattern as Gdft but using the gradient constructor:
cudaPipelineCtx = new forge::cuda::CudaDftPipeline(
kx.memptr(), ..., FM.memptr(), t.memptr(),
Gx.memptr(), Gy.memptr(), Gz.memptr(),
n1, n2, numX, numY, numZ);
3.6 Modified: forge/Tests/operatorTests.cpp¶
- Gdft adjointness with CUDA — existing adjoint test should pass automatically since it uses forgeCol overloads.
- New: Gdft CPU vs CUDA forward/adjoint comparison — deterministic input, compare element-wise relative error < 1e-4.
- New: GdftR2 CPU vs CUDA comparison — same pattern.
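The comparison tests need an element-wise relative-error metric; a minimal helper (a sketch, not the harness's actual API) could be:

```cpp
#include <cmath>
#include <vector>

// Maximum element-wise relative error between a reference and a test vector,
// with a floor on the denominator to avoid dividing by near-zero references.
float maxRelError(const std::vector<float>& ref, const std::vector<float>& got,
                  float denomFloor = 1e-12f) {
  float worst = 0.0f;
  for (size_t i = 0; i < ref.size(); ++i) {
    float denom = std::fabs(ref[i]) > denomFloor ? std::fabs(ref[i]) : denomFloor;
    float err = std::fabs(got[i] - ref[i]) / denom;
    if (err > worst) worst = err;
  }
  return worst;
}
```

The interleaved real/imaginary buffers can be compared directly with this helper against the 1e-4 threshold.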
3.7 Modified: CMakeLists.txt¶
Add forge/CUDA/CudaDftPipeline.cu to the ForgeCommon sources under the existing CUDA source list.
4. Data Format¶
All device-pointer APIs use interleaved complex format: [re_0, im_0, re_1, im_1, ...]. This matches forgeComplex<float> memory layout and the CudaNufftPipeline convention. The reinterpret_cast<const float*>(d.device_memptr()) pattern used throughout the codebase works directly.
Trajectory arrays (kx, ky, kz, ix, iy, iz, FM, t, Gx, Gy, Gz) are real-valued float arrays.
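As a quick illustration of the interleaved layout, std::complex<float> (used here as a stand-in for forgeComplex<float>, assuming the same layout guarantee) can be viewed as interleaved floats:

```cpp
#include <complex>
#include <vector>

// A contiguous buffer of N complex<float> values is 2N floats laid out as
// [re_0, im_0, re_1, im_1, ...] — the same pattern the spec uses with
// reinterpret_cast<const float*>(d.device_memptr()).
const float* interleavedView(const std::vector<std::complex<float>>& v) {
  return reinterpret_cast<const float*>(v.data());
}
```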
5. What Is NOT In Scope¶
- Double precision (T1 = double) CUDA kernels — float only, matching Metal and CudaNufftPipeline.
- Shared-memory tiling of the inner loop — the basic thread-per-output kernel is the first step; tiling can be added later if profiling shows memory bandwidth is the bottleneck.
- __half (FP16) or tensor core acceleration — future optimization.
6. Success Criteria¶
- All existing Gdft and GdftR2 tests pass on CUDA builds.
- New CPU vs CUDA comparison tests pass with relative error < 1e-4.
- Gdft adjointness property holds on GPU (relError < 1e-5).
- SENSE and pcSENSE now use the GPU-resident DFT path when data is on GPU.
7. Risks¶
| Risk | Mitigation |
|---|---|
| sinf/cosf precision differs from CPU | Use 1e-4 tolerance in comparison tests; DFT accumulates many terms so small per-element differences compound |
| Large problem sizes exhaust GPU memory for trajectory arrays | Same sizes work on CPU; trajectory arrays are small relative to image data |
| GdftR2 sinc near zero causes numerical issues | Use the same sincPG special-case handling as the CPU implementation |