# CudaDFT Pipeline Implementation Plan
For agentic workers: REQUIRED: Use superpowers:subagent-driven-development (if subagents available) or superpowers:executing-plans to implement this plan. Steps use checkbox (`- [ ]`) syntax for tracking.

IMPORTANT: All CUDA builds and tests run on the DGX Spark via SSH. Workflow: edit locally → rsync → build+test via SSH.

```shell
rsync -avz --exclude build/ --exclude .git/ \
  /Users/acerjanic/Developer/forge/ clj@100.109.49.10:/tmp/forge-cuda/
ssh clj@100.109.49.10 "export PATH=/usr/local/cuda/bin:\$PATH && cd /tmp/forge-cuda && \
  cmake --build build --target cpu_tests -j8"
```
Goal: GPU-accelerate the field-corrected DFT (Gdft) and gradient-corrected DFT (GdftR2) operators via a new CudaDftPipeline class, unblocking GPU-resident pcSENSE and SENSE<Gdft> reconstructions.
Architecture: A CudaDftPipeline class in forge/CUDA/ pre-uploads trajectory and field map data to GPU at construction time, then provides forward_device/adjoint_device methods that launch one-thread-per-output CUDA kernels on interleaved complex data. Gdft and GdftR2 operators create the pipeline in their constructors and dispatch from forgeCol overloads, following the same pattern as Gnufft + CudaNufftPipeline.
Tech Stack: CUDA 13.0, sincosf() intrinsic, cudaMemcpyAsync, cudaMallocAsync
Spec: docs/superpowers/specs/2026-03-20-cuda-dft-pipeline-design.md
## File Structure
### Files to Create
| File | Responsibility |
|---|---|
| forge/CUDA/CudaDftPipeline.h | Class declaration — constructor, destructor, forward/adjoint device+host APIs |
| forge/CUDA/CudaDftPipeline.cu | CUDA kernels (4) + pipeline method implementations |
### Files to Modify
| File | Change |
|---|---|
| forge/CMakeLists.txt | Add CudaDftPipeline.cu to CUDA source list |
| forge/Operators/Gdft.h | Add cudaPipelineCtx member under CUDA_COMPUTE |
| forge/Operators/Gdft.cpp | Create pipeline in constructor, add CUDA forgeCol dispatch, add destructor |
| forge/Operators/GdftR2.h | Add cudaPipelineCtx member under CUDA_COMPUTE |
| forge/Operators/GdftR2.cpp | Create pipeline with gradients in constructor, add CUDA forgeCol dispatch, add destructor |
| forge/Tests/operatorTests.cpp | Add CPU vs CUDA comparison tests for Gdft and GdftR2 |
## Task 1: CudaDftPipeline Header
Files:
- Create: forge/CUDA/CudaDftPipeline.h
- [ ] Step 1: Write the header file
```cpp
#pragma once
#ifdef CUDA_COMPUTE

#include <cuda_runtime.h>

namespace forge::cuda {

/// GPU-accelerated field-corrected DFT pipeline.
///
/// Computes the forward DFT:
///   s_j = Σ_k m_k · exp(-i·(2π(k_j·r_k) + Δω_k·t_j))
///
/// and its adjoint (conjugate phase). Optionally includes R₂* gradient
/// correction via sinc factors (GdftR2 variant).
///
/// All trajectory, field map, and gradient data are uploaded to GPU once
/// at construction. The forward_device/adjoint_device methods operate on
/// interleaved complex device pointers [re0,im0,re1,im1,...], matching
/// the forgeComplex<float> memory layout used throughout forge.
///
/// Follows the same API pattern as CudaNufftPipeline.
class CudaDftPipeline {
public:
    /// Construct a basic Gdft pipeline (no gradient correction).
    ///
    /// @param kx,ky,kz   k-space coordinates, length numSamples (host pointers)
    /// @param ix,iy,iz   Image-space coordinates, length numPixels (host pointers)
    /// @param FM         Off-resonance field map in rad/s, length numPixels
    /// @param t          Per-sample readout time in seconds, length numSamples
    /// @param numSamples Number of k-space points (n1)
    /// @param numPixels  Number of image pixels (n2)
    CudaDftPipeline(const float* kx, const float* ky, const float* kz,
                    const float* ix, const float* iy, const float* iz,
                    const float* FM, const float* t,
                    int numSamples, int numPixels);

    /// Construct a GdftR2 pipeline with R₂* gradient correction.
    ///
    /// @param Gx,Gy,Gz       Gradient maps of the field map, length numPixels
    /// @param numX,numY,numZ Image grid dimensions (for sinc normalization)
    /// Other parameters same as basic constructor.
    CudaDftPipeline(const float* kx, const float* ky, const float* kz,
                    const float* ix, const float* iy, const float* iz,
                    const float* FM, const float* t,
                    const float* Gx, const float* Gy, const float* Gz,
                    int numSamples, int numPixels,
                    int numX, int numY, int numZ);

    ~CudaDftPipeline();

    /// Forward DFT on device pointers (no host transfers).
    /// @param d_image Input: interleaved complex image, 2*numPixels floats
    /// @param d_kdata Output: interleaved complex k-space, 2*numSamples floats
    void forward_device(const float* d_image, float* d_kdata);

    /// Adjoint DFT on device pointers (no host transfers).
    /// @param d_kdata Input: interleaved complex k-space, 2*numSamples floats
    /// @param d_image Output: interleaved complex image, 2*numPixels floats
    void adjoint_device(const float* d_kdata, float* d_image);

    /// Forward DFT with host pointers (H2D, compute, D2H, sync).
    void forward(const float* h_image, float* h_kdata);

    /// Adjoint DFT with host pointers (H2D, compute, D2H, sync).
    void adjoint(const float* h_kdata, float* h_image);

    // Non-copyable (owns GPU resources)
    CudaDftPipeline(const CudaDftPipeline&) = delete;
    CudaDftPipeline& operator=(const CudaDftPipeline&) = delete;

private:
    // Persistent device arrays — uploaded once at construction
    float* d_kx = nullptr; // k-space x-coords, length numSamples
    float* d_ky = nullptr; // k-space y-coords, length numSamples
    float* d_kz = nullptr; // k-space z-coords, length numSamples
    float* d_ix = nullptr; // image x-coords, length numPixels
    float* d_iy = nullptr; // image y-coords, length numPixels
    float* d_iz = nullptr; // image z-coords, length numPixels
    float* d_FM = nullptr; // field map (rad/s), length numPixels
    float* d_t  = nullptr; // readout time (seconds), length numSamples

    // Optional gradient maps for GdftR2 (nullptr if basic Gdft)
    float* d_Gx = nullptr;
    float* d_Gy = nullptr;
    float* d_Gz = nullptr;
    bool hasGradients_ = false;

    // Scratch buffers for host-pointer wrappers
    float* d_input  = nullptr; // 2*max(numSamples,numPixels) floats
    float* d_output = nullptr; // 2*max(numSamples,numPixels) floats

    int numSamples_;
    int numPixels_;
    int numX_, numY_, numZ_; // grid dims (for sinc normalization)
    cudaStream_t stream_;
};

} // namespace forge::cuda

#endif // CUDA_COMPUTE
```
- [ ] Step 2: Verify Metal build locally
```shell
cmake --build build --target ForgeCommon -j4
```
Expected: builds without errors (header is CUDA-only, not compiled on macOS).
- [ ] Step 3: Commit
```shell
git add forge/CUDA/CudaDftPipeline.h
git commit -m "feat: add CudaDftPipeline header — GPU field-corrected DFT interface"
```
## Task 2: CUDA Kernels and Pipeline Implementation
Files:
- Create: forge/CUDA/CudaDftPipeline.cu
- Modify: forge/CMakeLists.txt:51 (add to CUDA source list)
- [ ] Step 1: Write the kernel and pipeline implementation
Create forge/CUDA/CudaDftPipeline.cu with:
- `__device__ float sincPG(float x)` — sinc helper matching CPU ftCpuWithGrads.cpp:76-83
- `kernel_dft_forward` — one thread per k-space sample, inner loop over pixels, `sincosf(-phase)`, interleaved complex I/O
- `kernel_dft_adjoint` — one thread per image pixel, inner loop over samples, `sincosf(+phase)`
- `kernel_dft_forward_grads` — forward + sinc correction factor B(j,k)
- `kernel_dft_adjoint_grads` — adjoint + sinc correction factor
- Constructor (basic) — `cudaMallocAsync` + `cudaMemcpy` for 8 trajectory/field arrays, allocate scratch buffers
- Constructor (grads) — calls basic setup + uploads 3 gradient arrays
- Destructor — `cudaFreeAsync` for all device arrays
- `forward_device`/`adjoint_device` — dispatch appropriate kernel based on `hasGradients_`
- `forward`/`adjoint` — H2D input, call `*_device`, D2H output, `cudaStreamSynchronize`
Key implementation notes:
- Use sincosf(-phase, &s, &c) for forward (single trig eval, negative phase)
- Use sincosf(phase, &s, &c) for adjoint (positive phase = conjugate)
- Pre-scale k-coords by 2π in the outer (per-thread) section to avoid per-pixel multiply
- Block size: 256 threads (same as other forge CUDA kernels)
- Comment the DFT equation at the top of each kernel
- Comment the interleaved complex layout: data[2*k] = real, data[2*k+1] = imag
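As a concrete illustration of the notes above, here is a minimal CPU sketch of the forward math (the hypothetical helper name `dftForwardRef` is mine; the repo's actual reference is ftCpu.cpp, which uses split real/imag arrays rather than interleaved). It shows the interleaved layout, the 2π pre-scaling per sample, and the negated-phase convention the forward kernel mirrors:

```cpp
#include <cmath>

// CPU sketch of the forward kernel's inner math (illustrative, not the
// repo's ftCpu.cpp). Interleaved complex layout: data[2*k] = real,
// data[2*k+1] = imag. Per sample j and pixel k:
//   phase = 2π(kx_j·ix_k + ky_j·iy_k + kz_j·iz_k) + FM_k·t_j
// Forward applies exp(-i·phase); the adjoint would use exp(+i·phase).
void dftForwardRef(const float* kx, const float* ky, const float* kz,
                   const float* ix, const float* iy, const float* iz,
                   const float* FM, const float* t,
                   const float* image, float* kdata,
                   int numSamples, int numPixels) {
    const float twoPi = 6.283185307f;
    for (int j = 0; j < numSamples; ++j) { // one CUDA thread per sample
        // Pre-scale k-coords by 2π once per sample (per-thread on GPU)
        float kxs = twoPi * kx[j], kys = twoPi * ky[j], kzs = twoPi * kz[j];
        float accRe = 0.0f, accIm = 0.0f;
        for (int k = 0; k < numPixels; ++k) {
            float phase = kxs * ix[k] + kys * iy[k] + kzs * iz[k] + FM[k] * t[j];
            // On GPU this pair becomes a single sincosf(-phase, &s, &c)
            float c = std::cos(-phase), s = std::sin(-phase);
            float re = image[2 * k], im = image[2 * k + 1];
            accRe += re * c - im * s; // complex multiply (re+i·im)·(c+i·s)
            accIm += re * s + im * c;
        }
        kdata[2 * j]     = accRe;
        kdata[2 * j + 1] = accIm;
    }
}
```

The CUDA kernel replaces the outer loop with one thread per sample; everything inside the inner loop stays the same.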
Reference for kernel math:
- Forward: ftCpu.cpp:94-123 (split real/imag version — adapt to interleaved)
- Adjoint: ftCpu.cpp:132-167
- Grads: ftCpuWithGrads.cpp:93-125 and 157-171
- sincPG: ftCpuWithGrads.cpp:76-83
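For orientation, a guarded sinc helper commonly takes the following shape — this is an assumption about ftCpuWithGrads.cpp:76-83, whose exact threshold and series terms may differ, so match the CPU source when porting:

```cpp
#include <cmath>

// Plausible sketch of a guarded sinc: sin(x)/x with a Taylor fallback
// near zero to avoid 0/0. The real sincPG in ftCpuWithGrads.cpp may use
// a different cutoff or expansion — verify against the CPU code.
float sincPG(float x) {
    if (std::fabs(x) < 1e-4f)
        return 1.0f - x * x / 6.0f; // second-order series of sin(x)/x
    return std::sin(x) / x;
}
```

Whatever form the helper takes, the CUDA and CPU versions must agree bit-for-bit in convention, or the CPU vs CUDA comparison tests in Task 5 will show spurious error.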
- [ ] Step 2: Add to CMakeLists.txt
In forge/CMakeLists.txt, after line 51 (CudaPenalty.cu), add:
```cmake
${PROJECT_SOURCE_DIR}/forge/CUDA/CudaDftPipeline.cu)
```
- [ ] Step 3: Verify Metal build locally
```shell
cmake --build build --target ForgeCommon -j4
```
Expected: builds (.cu file only compiled on CUDA builds).
- [ ] Step 4: Verify CUDA build on DGX Spark
```shell
rsync -avz --exclude build/ --exclude .git/ \
  /Users/acerjanic/Developer/forge/ clj@100.109.49.10:/tmp/forge-cuda/
ssh clj@100.109.49.10 "export PATH=/usr/local/cuda/bin:\$PATH && cd /tmp/forge-cuda && \
  cmake --build build --target ForgeCommon -j8 2>&1 | tail -5"
```
Expected: compiles without errors.
- [ ] Step 5: Commit
```shell
git add forge/CUDA/CudaDftPipeline.cu forge/CMakeLists.txt
git commit -m "feat: CudaDftPipeline CUDA kernels — forward/adjoint DFT with optional R₂* gradients"
```
## Task 3: Integrate CudaDftPipeline into Gdft Operator
Files:
- Modify: forge/Operators/Gdft.h:102-107 (add CUDA member alongside Metal member)
- Modify: forge/Operators/Gdft.cpp:37-60 (constructor), 62-72 (destructor), 146-214 (forgeCol operators)
- [ ] Step 1: Add CUDA pipeline member to Gdft.h
After the existing #ifdef METAL_COMPUTE block (lines 102-107), add:
```cpp
#ifdef CUDA_COMPUTE
void* cudaPipelineCtx = nullptr;
~Gdft();
#endif
```
Note: The destructor declaration is conditional — Metal already has one, CUDA needs its own. If both Metal and CUDA are defined (shouldn't happen in practice), this would conflict. The existing pattern uses #ifdef METAL_COMPUTE exclusively, so #ifdef CUDA_COMPUTE is safe.
- [ ] Step 2: Add CUDA constructor logic in Gdft.cpp
In the Gdft constructor (after the #ifdef METAL_COMPUTE block, line 59), add:
```cpp
#elif defined(CUDA_COMPUTE)
if constexpr (std::is_same<T1, float>::value) {
    cudaPipelineCtx = new forge::cuda::CudaDftPipeline(
        kx.memptr(), ky.memptr(), kz.memptr(),
        ix.memptr(), iy.memptr(), iz.memptr(),
        FM.memptr(), t.memptr(), (int)n1, (int)n2);
}
#endif
```
Add include at top of file:
```cpp
#ifdef CUDA_COMPUTE
#include "CUDA/CudaDftPipeline.h"
#endif
```
- [ ] Step 3: Add CUDA destructor in Gdft.cpp
After the #ifdef METAL_COMPUTE destructor block (line 72), add:
```cpp
#elif defined(CUDA_COMPUTE)
template <typename T1> Gdft<T1>::~Gdft()
{
    if constexpr (std::is_same<T1, float>::value) {
        delete static_cast<forge::cuda::CudaDftPipeline*>(cudaPipelineCtx);
        cudaPipelineCtx = nullptr;
    }
}
#endif
```
Change the existing #endif on line 72 to #elif defined(CUDA_COMPUTE).
- [ ] Step 4: Add CUDA dispatch in forgeCol forward operator
In Gdft::operator*(forgeCol) (line 146), add a CUDA_COMPUTE block after the METAL_COMPUTE block. Follow the Gnufft pattern (Gnufft.cpp:530-548):
```cpp
#elif defined(CUDA_COMPUTE)
if constexpr (std::is_same<T1, float>::value) {
    auto* ctx = static_cast<forge::cuda::CudaDftPipeline*>(cudaPipelineCtx);
    if (ctx != nullptr) {
        if (d.on_gpu()) {
            // Data already on GPU — use device-pointer pipeline
            forgeCol<forgeComplex<T1>> result(n1);
            result.allocateOnGPU();
            ctx->forward_device(
                reinterpret_cast<const float*>(d.device_memptr()),
                reinterpret_cast<float*>(result.device_memptr()));
            return result;
        } else {
            // Data on host — use host-pointer wrapper (H2D + D2H internally)
            forgeCol<forgeComplex<T1>> result(n1);
            ctx->forward(
                reinterpret_cast<const float*>(d.memptr()),
                reinterpret_cast<float*>(result.memptr()));
            return result;
        }
    }
}
#endif
```
Change the existing #endif after the Metal block to #elif defined(CUDA_COMPUTE).
- [ ] Step 5: Add CUDA dispatch in forgeCol adjoint operator
Same pattern for Gdft::operator/(forgeCol) (line 181), using adjoint_device/adjoint, output size n2.
- [ ] Step 6: Verify Metal build locally
```shell
cmake --build build --target ForgeCommon -j4
```
- [ ] Step 7: Verify CUDA build and existing tests on DGX Spark
```shell
rsync ... && ssh ... "cmake --build build --target cpu_tests -j8 && \
  ./build/cpu_tests '[Gdft adjoint]' 2>&1"
```
Expected: Gdft adjointness test passes with CUDA pipeline active.
- [ ] Step 8: Commit
```shell
git add forge/Operators/Gdft.h forge/Operators/Gdft.cpp
git commit -m "feat: integrate CudaDftPipeline into Gdft — GPU-resident forward/adjoint DFT"
```
## Task 4: Integrate CudaDftPipeline into GdftR2 Operator
Files:
- Modify: forge/Operators/GdftR2.h:123-128 (add CUDA member)
- Modify: forge/Operators/GdftR2.cpp:39-69 (constructor), 71-81 (destructor), 211-end (forgeCol operators)
- [ ] Step 1: Add CUDA pipeline member to GdftR2.h
After the #ifdef METAL_COMPUTE block (lines 123-128), add the same pattern as Gdft:
```cpp
#ifdef CUDA_COMPUTE
void* cudaPipelineCtx = nullptr;
~GdftR2();
#endif
```
- [ ] Step 2: Add CUDA constructor logic in GdftR2.cpp
After the #ifdef METAL_COMPUTE block in the constructor (line 68), add:
```cpp
#elif defined(CUDA_COMPUTE)
if constexpr (std::is_same<T1, float>::value) {
    cudaPipelineCtx = new forge::cuda::CudaDftPipeline(
        kx.memptr(), ky.memptr(), kz.memptr(),
        ix.memptr(), iy.memptr(), iz.memptr(),
        FM.memptr(), t.memptr(),
        Gx.memptr(), Gy.memptr(), Gz.memptr(),
        (int)n1, (int)n2, numX, numY, numZ);
}
#endif
```
Add the #include "CUDA/CudaDftPipeline.h" at the top of the file (guarded by CUDA_COMPUTE).
- [ ] Step 3: Add CUDA destructor and forgeCol dispatch
Follow the exact same pattern as Task 3 Steps 3-5, but for GdftR2.
- [ ] Step 4: Verify CUDA build and existing tests on DGX Spark
```shell
rsync ... && ssh ... "cmake --build build --target cpu_tests -j8 && \
  ./build/cpu_tests '~[Benchmark]~[Spiral3D]' 2>&1 | tail -3"
```
Expected: All existing tests pass (GdftR2 now uses CUDA pipeline).
- [ ] Step 5: Commit
```shell
git add forge/Operators/GdftR2.h forge/Operators/GdftR2.cpp
git commit -m "feat: integrate CudaDftPipeline into GdftR2 — GPU-resident DFT with R₂* gradients"
```
## Task 5: CPU vs CUDA Comparison Tests
Files:
- Modify: forge/Tests/operatorTests.cpp
- [ ] Step 1: Write Gdft CPU vs CUDA comparison test
Add after the existing #ifdef CUDA_COMPUTE penalty tests:
```cpp
#ifdef CUDA_COMPUTE
TEST_CASE("Gdft forward/adjoint: CPU vs CUDA", "[Gdft][CUDA]")
{
    uword Nx = 16, Ny = 16;
    uword n1 = Nx * Ny;
    uword n2 = Nx * Ny;
    Col<float> ix, iy, iz, kxG, kyG, kzG, kxN, kyN, kzN;
    makeCartesianCoords(Nx, Ny, ix, iy, iz, kxG, kyG, kzG, kxN, kyN, kzN);

    // Non-zero field map and time vector for meaningful phase correction
    Col<float> FM(n2);
    for (uword i = 0; i < n2; i++)
        FM(i) = 100.0f * sinf(2.0f * M_PI * i / n2);
    Col<float> t(n1);
    for (uword i = 0; i < n1; i++)
        t(i) = 1e-4f * i;

    Gdft<float> G(n1, n2, kxG, kyG, kzG, ix, iy, iz, FM, t);

    // Deterministic test image
    forgeCol<forgeComplex<float>> x(n2);
    for (uword i = 0; i < n2; i++)
        x.at(i) = forgeComplex<float>((float)(i % 17) / 17.0f, (float)(i % 13) / 13.0f);

    // CPU forward
    forgeCol<forgeComplex<float>> fwd_cpu = G * x;

    // GPU forward
    forgeCol<forgeComplex<float>> x_gpu(x);
    x_gpu.putOnGPU();
    forgeCol<forgeComplex<float>> fwd_gpu = G * x_gpu;
    if (fwd_gpu.on_gpu()) fwd_gpu.getFromGPU();

    SECTION("forward") {
        float err = relError(fwd_cpu, fwd_gpu);
        CAPTURE(err);
        REQUIRE(err < 1e-4f);
    }

    // CPU adjoint
    forgeCol<forgeComplex<float>> adj_cpu = G / fwd_cpu;

    // GPU adjoint
    forgeCol<forgeComplex<float>> fwd_gpu2(fwd_cpu);
    fwd_gpu2.putOnGPU();
    forgeCol<forgeComplex<float>> adj_gpu = G / fwd_gpu2;
    if (adj_gpu.on_gpu()) adj_gpu.getFromGPU();

    SECTION("adjoint") {
        float err = relError(adj_cpu, adj_gpu);
        CAPTURE(err);
        REQUIRE(err < 1e-4f);
    }
}
#endif
```
Note: relError helper was already added in this branch (operatorTests.cpp, line 258).
- [ ] Step 2: Write GdftR2 CPU vs CUDA comparison test
Same structure but constructs GdftR2<float> with grid dimensions:
```cpp
#ifdef CUDA_COMPUTE
TEST_CASE("GdftR2 forward/adjoint: CPU vs CUDA", "[GdftR2][CUDA]")
{
    uword Nx = 16, Ny = 16;
    uword n1 = Nx * Ny;
    uword n2 = Nx * Ny;
    Col<float> ix, iy, iz, kxG, kyG, kzG, kxN, kyN, kzN;
    makeCartesianCoords(Nx, Ny, ix, iy, iz, kxG, kyG, kzG, kxN, kyN, kzN);

    Col<float> FM(n2);
    for (uword i = 0; i < n2; i++)
        FM(i) = 100.0f * sinf(2.0f * M_PI * i / n2);
    Col<float> t(n1);
    for (uword i = 0; i < n1; i++)
        t(i) = 1e-4f * i;

    GdftR2<float> G(n1, n2, kxG, kyG, kzG, ix, iy, iz, FM, t, (int)Nx, (int)Ny, 1);

    forgeCol<forgeComplex<float>> x(n2);
    for (uword i = 0; i < n2; i++)
        x.at(i) = forgeComplex<float>((float)(i % 17) / 17.0f, (float)(i % 13) / 13.0f);

    forgeCol<forgeComplex<float>> fwd_cpu = G * x;

    forgeCol<forgeComplex<float>> x_gpu(x);
    x_gpu.putOnGPU();
    forgeCol<forgeComplex<float>> fwd_gpu = G * x_gpu;
    if (fwd_gpu.on_gpu()) fwd_gpu.getFromGPU();

    SECTION("forward") {
        float err = relError(fwd_cpu, fwd_gpu);
        CAPTURE(err);
        REQUIRE(err < 1e-4f);
    }

    forgeCol<forgeComplex<float>> adj_cpu = G / fwd_cpu;

    forgeCol<forgeComplex<float>> fwd_gpu2(fwd_cpu);
    fwd_gpu2.putOnGPU();
    forgeCol<forgeComplex<float>> adj_gpu = G / fwd_gpu2;
    if (adj_gpu.on_gpu()) adj_gpu.getFromGPU();

    SECTION("adjoint") {
        float err = relError(adj_cpu, adj_gpu);
        CAPTURE(err);
        REQUIRE(err < 1e-4f);
    }
}
#endif
```
- [ ] Step 3: Build and run all tests on DGX Spark
```shell
rsync ... && ssh ... "cmake --build build --target cpu_tests -j8 && \
  ./build/cpu_tests '[Gdft][CUDA]' 2>&1 && \
  ./build/cpu_tests '[GdftR2][CUDA]' 2>&1 && \
  ./build/cpu_tests '~[Benchmark]~[Spiral3D]' 2>&1 | tail -3"
```
Expected: New CUDA tests pass. All existing tests pass.
- [ ] Step 4: Commit
```shell
git add forge/Tests/operatorTests.cpp
git commit -m "test: add CPU vs CUDA comparison tests for Gdft and GdftR2"
```
## Task 6: Final Verification
- [ ] Step 1: Run full test suite on DGX Spark
```shell
ssh clj@100.109.49.10 "export PATH=/usr/local/cuda/bin:\$PATH && cd /tmp/forge-cuda && \
  ./build/cpu_tests '~[Benchmark]' 2>&1 | tail -5"
```
Expected: All tests pass (including Spiral3D, which uses SENSE<Gdft>).
- [ ] Step 2: Verify Metal tests locally
```shell
cmake --build build --target cpu_tests -j4 && ./build/cpu_tests '~[Benchmark]'
```
Expected: All tests pass on macOS Metal build.
- [ ] Step 3: Benchmark Gdft performance
```shell
ssh clj@100.109.49.10 "export PATH=/usr/local/cuda/bin:\$PATH && cd /tmp/forge-cuda && \
  ./build/cpu_tests '[Gdft_bench]' 2>&1"
```
Compare CUDA vs CPU Gdft timing. The DFT is O(N²) so GPU acceleration should be significant for larger problem sizes.
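For intuition on the O(N²) claim (illustrative arithmetic, not a measured figure — the hypothetical helper `dftPairCount` is mine): each forward call performs one sincosf plus a few FMAs per (sample, pixel) pair, so a fully sampled 128×128 2D problem already does 16384 × 16384 ≈ 2.7×10⁸ trig evaluations per operator application.

```cpp
#include <cstdint>

// Work estimate for the naive field-corrected DFT: one sincosf (plus a
// handful of FMAs) per (sample, pixel) pair, so cost scales as n1 * n2.
// This is exactly the workload that parallelizes cleanly one thread per
// output element.
std::int64_t dftPairCount(std::int64_t numSamples, std::int64_t numPixels) {
    return numSamples * numPixels;
}
```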
## Summary of Commits
| Task | Commit Message |
|---|---|
| 1 | feat: add CudaDftPipeline header — GPU field-corrected DFT interface |
| 2 | feat: CudaDftPipeline CUDA kernels — forward/adjoint DFT with optional R₂* gradients |
| 3 | feat: integrate CudaDftPipeline into Gdft — GPU-resident forward/adjoint DFT |
| 4 | feat: integrate CudaDftPipeline into GdftR2 — GPU-resident DFT with R₂* gradients |
| 5 | test: add CPU vs CUDA comparison tests for Gdft and GdftR2 |