CPU Backend + OpenMP Implementation Plan
For agentic workers: REQUIRED: Use superpowers:subagent-driven-development (if subagents available) or superpowers:executing-plans to implement this plan. Steps use checkbox (- [ ]) syntax for tracking.
Goal: Strip all OpenACC code, establish a clean CPU backend with OpenMP parallelism, and introduce a compile-time backend dispatch layer that unifies Metal and future CUDA paths.
Architecture: Three-phase approach: (1) remove OpenACC pragmas and guards to get a clean single-threaded CPU build, (2) introduce BackendDispatch.hpp to replace scattered #ifdef blocks in operators, (3) add OpenMP parallel for to hot loops for CPU parallelism.
Tech Stack: C++17, CMake 3.17+, OpenMP, FFTW3, Armadillo, Catch2
Spec: docs/superpowers/specs/2026-03-17-multi-backend-acceleration-design.md
Subsequent plans (not covered here):
- Phase 4: CUDA backend (forge/CUDA/ mirroring Metal)
- Phase 5: CI pipeline (GitHub Actions, Docker, tiered testing)
File Structure
Files to Create
| File | Responsibility |
|---|---|
| forge/Core/BackendDispatch.hpp | Compile-time dispatch: maps GPU memory ops and vector algebra to Metal, CUDA (future), or CPU no-ops |
Files to Modify (by component)
Build system:
- CMakeLists.txt — remove OPENACC_GPU/OPENACC_MP options, add OpenMP for all builds, add BUILD_SHARED_LIBS
- forge/CMakeLists.txt — remove fftGPU.cpp from sources, change SHARED to default
Core data structures:
- forge/Core/forgeCol.hpp — remove ~50 #pragma acc directives, remove isCopy flag, keep Metal allocation path
- forge/Core/forgeMat.hpp — remove ~40 #pragma acc directives
- forge/Core/forgeSubview_Col.hpp — remove draft #pragma acc directives
- forge/Core/ForgeIncludes.h — remove #ifdef OPENACC_GPU block (CUDA profiler includes)
- forge/Core/Tracer.hpp — change USE_NVTX guard to also trigger on CUDA_COMPUTE
Gridding:
- forge/Gridding/gridding.h — simplify 4-way #ifdef to 2-way (Metal vs CPU)
- forge/Gridding/gridding.cpp — replace #pragma acc with #pragma omp
- forge/Gridding/griddingSupport.h — remove _OPENACC guards, keep #pragma acc routine seq as comments for CUDA reference
- forge/Gridding/griddingSupport.cpp — replace 21 #pragma acc with #pragma omp
FFT:
- forge/FFT/ftCpu.h — remove _OPENACC include guard
- forge/FFT/ftCpu.cpp — replace #pragma acc kernels with #pragma omp parallel for
- forge/FFT/ftCpuWithGrads.h — remove _OPENACC include guard
- forge/FFT/ftCpuWithGrads.cpp — replace 6 #pragma acc directives with #pragma omp
Public header:
- forge/Core/forge.h — remove unconditional #include "FFT/fftGPU.h" (file is being retired)
- forge/FFT/fftCPU.h — remove @see fftGPU.h docstring reference
- forge/FFT/fftAccelerate.h — remove @see fftGPU.h docstring reference
Operators:
- forge/Operators/Gfft.h — simplify 4-way #ifdef to 2-way
- forge/Operators/Gfft.cpp — remove OPENACC_GPU cuFFT plan code
- forge/Operators/Gnufft.h, Gnufft.cpp — remove OPENACC_GPU blocks, keep Metal
- forge/Operators/Gdft.h, Gdft.cpp — remove OPENACC_GPU blocks
- forge/Operators/GdftR2.h, GdftR2.cpp — remove OPENACC_GPU blocks
- forge/Operators/SENSE.h, SENSE.cpp — remove OpenACC guards if present
- forge/Operators/pcSENSE.h, pcSENSE.cpp — same
- forge/Operators/pcSenseTimeSeg.h, pcSenseTimeSeg.cpp — same
Tests:
- forge/Tests/forgeColTests.cpp — remove _OPENACC/OPENACC_GPU guards
- forge/Tests/forgeColCplxTests.cpp — same
- forge/Tests/forgeMatTests.cpp — same
- forge/Tests/MetalVectorOpsTests.cpp — same
Files to Retire
- forge/FFT/fftGPU.h — cuFFT wrappers, entirely #ifdef OPENACC_GPU
- forge/FFT/fftGPU.cpp — same
- forge/my_sincosf.cuh — legacy CUDA header
Task 1: Update Root CMakeLists.txt — Remove OpenACC, Add OpenMP
Files:
- Modify: CMakeLists.txt
This is the foundation — get the build system right first.
- [ ] Step 1: Read the current CMakeLists.txt to identify all OpenACC sections
Read CMakeLists.txt in full. Identify:
- option(OPENACC_GPU ...) and option(OPENACC_MP ...) (lines ~52-53)
- The mutual exclusion check (lines ~56-58)
- PGI/NVHPC compiler flags (lines ~243-254)
- OPENACC_GPU conditionals for CUDA toolkit (lines ~83-110)
- USE_NVTX definition (line ~260)
- [ ] Step 2: Remove OPENACC_GPU and OPENACC_MP options and guards
Remove:
option(OPENACC_GPU "Use OpenACC to run on GPU" OFF)
option(OPENACC_MP "Use OpenACC to run on CPU multicore" OFF)
Remove the mutual exclusion check between OPENACC_GPU and OPENACC_MP.
Remove all if(OPENACC_GPU) blocks that:
- Find CUDA toolkit
- Set cuFFT link flags
- Create cuFFT plans
- Set PGI compiler flags (-cuda -acc -gpu=cuda12.6,cc60-cc89)
Remove the if(OPENACC_MP) block with -Mnollvm -O1 -acc -ta=multicore flags.
- [ ] Step 3: Add OpenMP for all builds
Add the following. Once Step 5 is done, it should sit after the Metal/CUDA mutual exclusion check that step introduces:
# OpenMP is available for ALL builds (including Metal) for CPU fallback paths
find_package(OpenMP REQUIRED)
Link OpenMP to ForgeCommon (will be done in forge/CMakeLists.txt).
- [ ] Step 4: Update USE_NVTX to trigger on CUDA_COMPUTE
Change the USE_NVTX definition from:
if(OPENACC_GPU)
  add_definitions(-DUSE_NVTX)
endif()
To:
if(CUDA_COMPUTE)
  add_definitions(-DUSE_NVTX)
endif()
- [ ] Step 5: Add BUILD_SHARED_LIBS and CUDA_COMPUTE options (stubs for future)
Add:
option(CUDA_COMPUTE "Enable NVIDIA CUDA GPU backend" OFF)
option(BUILD_SHARED_LIBS "Build shared or static libraries" ON)
if(METAL_COMPUTE AND CUDA_COMPUTE)
message(FATAL_ERROR "Cannot enable both METAL_COMPUTE and CUDA_COMPUTE")
endif()
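Add a stub if(CUDA_COMPUTE) block. CMAKE_CUDA_ARCHITECTURES is only honored by CMake 3.18 and later, so enabling this stub needs a newer CMake than the 3.17 minimum listed in the tech stack: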
if(CUDA_COMPUTE)
enable_language(CUDA)
find_package(CUDAToolkit REQUIRED)
set(CMAKE_CUDA_ARCHITECTURES "75;80;86;89;90")
endif()
- [ ] Step 6: Configure and verify the build still works
Run:
cmake -B build -S . -DMETAL_COMPUTE=ON
Expected: Configure succeeds. If it fails on missing OpenACC references, fix them.
- [ ] Step 7: Commit
git add CMakeLists.txt
git commit -m "build: remove OpenACC/OPENACC_MP options, add OpenMP and CUDA_COMPUTE stubs"
Task 2: Update forge/CMakeLists.txt — Remove fftGPU, Fix SHARED
Files:
- Modify: forge/CMakeLists.txt
- [ ] Step 1: Read forge/CMakeLists.txt
Identify:
- fftGPU.cpp in the source list (line ~11)
- add_library(ForgeCommon SHARED ...) (line ~46)
- Any OPENACC_GPU conditionals
- [ ] Step 2: Remove fftGPU.cpp from source list
Remove the line:
${PROJECT_SOURCE_DIR}/forge/FFT/fftGPU.cpp
- [ ] Step 3: Change SHARED to default
Change:
add_library(ForgeCommon SHARED ${PG_FILES})
To:
add_library(ForgeCommon ${PG_FILES})
This lets BUILD_SHARED_LIBS control shared vs static.
- [ ] Step 4: Link OpenMP to ForgeCommon
Add after the target_link_libraries section:
if(OpenMP_CXX_FOUND)
target_link_libraries(ForgeCommon PUBLIC OpenMP::OpenMP_CXX)
endif()
- [ ] Step 5: Build to verify
cmake -B build -S . -DMETAL_COMPUTE=ON && cmake --build build --target ForgeCommon -j4
Expected: Build fails because source files still reference fftGPU.h or OpenACC headers. That's expected — we'll fix those in subsequent tasks.
- [ ] Step 6: Commit
git add forge/CMakeLists.txt
git commit -m "build: remove fftGPU from sources, use BUILD_SHARED_LIBS, link OpenMP"
Task 3: Clean ForgeIncludes.h and Tracer.hpp
Files:
- Modify: forge/Core/ForgeIncludes.h
- Modify: forge/Core/Tracer.hpp
- [ ] Step 1: Read ForgeIncludes.h
Look for the #ifdef OPENACC_GPU block (lines ~47-50) that includes CUDA profiler headers.
- [ ] Step 2: Remove the OPENACC_GPU block from ForgeIncludes.h
Remove:
#ifdef OPENACC_GPU
#include <cuda_profiler_api.h>
#include <nvToolsExt.h>
#endif
These CUDA profiler includes will be handled by the CUDA backend directly when it's implemented.
- [ ] Step 3: Read Tracer.hpp
Identify the USE_NVTX guard structure.
- [ ] Step 4: Update Tracer.hpp — USE_NVTX now triggered by CUDA_COMPUTE
The #if defined(USE_NVTX) guard in Tracer.hpp is fine as-is — the CMakeLists.txt change (Task 1, Step 4) already ensures USE_NVTX is defined when CUDA_COMPUTE=ON. No code change needed in Tracer.hpp unless there are direct OPENACC_GPU references.
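To illustrate why the guard itself needs no change, here is a minimal sketch of a tracer gated on USE_NVTX. The class name `ScopedRange` is hypothetical and not taken from Tracer.hpp; `nvtxRangePushA` and `nvtxRangePop` are the NVTX calls it assumes. When the macro is absent, the class compiles to a no-op.

```cpp
#include <cassert>
#if defined(USE_NVTX)
#include <nvToolsExt.h>  // only reached when the build defines USE_NVTX
#endif

// Hypothetical RAII range marker; the real Tracer.hpp may look different.
class ScopedRange {
public:
    explicit ScopedRange(const char* name) {
        assert(name != nullptr);
#if defined(USE_NVTX)
        nvtxRangePushA(name);  // open a named NVTX range
#else
        (void)name;            // no profiler attached: nothing to record
#endif
    }

    ~ScopedRange() {
#if defined(USE_NVTX)
        nvtxRangePop();        // close the range opened in the constructor
#endif
    }

    ScopedRange(const ScopedRange&) = delete;
    ScopedRange& operator=(const ScopedRange&) = delete;

    // True when the build was configured with USE_NVTX.
    static constexpr bool enabled() {
#if defined(USE_NVTX)
        return true;
#else
        return false;
#endif
    }
};
```

Because only the macro is consulted, switching its trigger from OPENACC_GPU to CUDA_COMPUTE in CMake leaves code of this shape untouched.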
- [ ] Step 5: Commit
git add forge/Core/ForgeIncludes.h forge/Core/Tracer.hpp
git commit -m "refactor: remove OpenACC guards from ForgeIncludes.h and Tracer.hpp"
Task 4: Strip OpenACC from forgeCol.hpp
Files:
- Modify: forge/Core/forgeCol.hpp
This is the largest single file change — ~50 #pragma acc directives to remove. Work methodically top-to-bottom.
- [ ] Step 1: Read forgeCol.hpp in full
Map every #pragma acc and #ifdef _OPENACC / #ifdef OPENACC_GPU location. The exploration found them at approximately these locations:
- Lines 24-27: _OPENACC header includes (accel.h, openacc.h)
- Line 135: constructor enter data copyin
- Line 148: constructor enter data create
- Lines 167-219: Arma constructors with device updates
- Lines 231-246: Copy constructor with isCopy, acc_memcpy
- Lines 257-275: Move constructor with acc_attach
- Lines 281-296: Destructor with acc_delete, exit data
- Lines 307-318: reset_mem() with acc_detach
- Lines 330-362: set_size() with device allocation
- Lines 370, 381: zeros()/ones() with parallel loop
- Lines 395-425: getArma() with update host
- Lines 515-931: Operator overloads and free functions with scattered pragmas
- [ ] Step 2: Remove _OPENACC header includes
Remove:
#ifdef _OPENACC
#include <accel.h>
#include <openacc.h>
#endif
- [ ] Step 3: Remove all #pragma acc from constructors and destructor
For each constructor, remove lines like:
#pragma acc enter data copyin(this)
#pragma acc enter data create(mem[0:n_elem])
#pragma acc update device(this)
For the destructor, remove:
#pragma acc exit data finalize detach(mem) delete(mem[0:n_elem])
#pragma acc exit data delete(this)
Keep the underlying new[]/delete[] and Metal std::aligned_alloc/std::free allocation logic intact.
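The allocation logic that stays behind can be pictured as a pair of helpers. This is a sketch only: the function names and the 4096-byte page size are assumptions (the plan only says the Metal path is page-aligned), and the aligned path does not run constructors, so it suits plain numeric element types.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Assumed page size for illustration; the real value is platform-specific.
constexpr std::size_t kPageSize = 4096;

// std::aligned_alloc requires the size to be a multiple of the alignment.
constexpr std::size_t round_up_to_page(std::size_t bytes) {
    return ((bytes + kPageSize - 1) / kPageSize) * kPageSize;
}

// Hypothetical helper: allocate n elements on the active backend's path.
template <typename T>
T* allocate_elems(std::size_t n) {
    if (n == 0) return nullptr;
#ifdef METAL_COMPUTE
    // Page-aligned block so the GPU can address the same memory.
    return static_cast<T*>(std::aligned_alloc(kPageSize, round_up_to_page(n * sizeof(T))));
#else
    return new T[n];
#endif
}

// Hypothetical helper: release memory with the matching deallocator.
template <typename T>
void release_elems(T* p) {
#ifdef METAL_COMPUTE
    std::free(p);
#else
    delete[] p;
#endif
}
```

The point of pairing the two helpers is that each allocation path keeps its own matching release call once the OpenACC data directives are gone.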
- [ ] Step 4: Remove isCopy flag
Remove the bool isCopy member variable declaration.
Remove all references to isCopy in:
- Copy constructor (where isCopy = true is set)
- Move constructor
- Destructor (where if (!isCopy) guards cleanup)
- Any other locations
The copy constructor should still perform a deep copy of mem — just remove the isCopy bookkeeping.
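With isCopy removed, ownership can follow the ordinary copy/move rules. A minimal sketch, assuming a simplified stand-in class (`Column` is hypothetical; forgeCol has more members and a Metal allocation path):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>

// Hypothetical stand-in for forgeCol<T>, without any isCopy bookkeeping.
template <typename T>
class Column {
public:
    explicit Column(std::size_t n) : n_elem(n), mem(n ? new T[n]() : nullptr) {}

    // Deep copy: the new object owns its own buffer.
    Column(const Column& other)
        : n_elem(other.n_elem), mem(other.n_elem ? new T[other.n_elem] : nullptr) {
        std::copy(other.mem, other.mem + n_elem, mem);
    }

    // Move: take over the buffer and leave the source empty.
    Column(Column&& other) noexcept : n_elem(other.n_elem), mem(other.mem) {
        other.n_elem = 0;
        other.mem = nullptr;
    }

    // Copy-and-swap assignment covers both copy and move assignment.
    Column& operator=(Column other) noexcept {
        std::swap(n_elem, other.n_elem);
        std::swap(mem, other.mem);
        return *this;
    }

    // Every object owns its buffer, so the destructor needs no guard.
    ~Column() { delete[] mem; }

    T& operator[](std::size_t i) { assert(i < n_elem); return mem[i]; }
    const T& operator[](std::size_t i) const { assert(i < n_elem); return mem[i]; }
    std::size_t size() const { return n_elem; }
    const T* data() const { return mem; }

private:
    std::size_t n_elem;
    T* mem;
};
```

Since each instance owns exactly one buffer, the destructor can free unconditionally, which is what makes the `if (!isCopy)` guard unnecessary.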
- [ ] Step 5: Remove OpenACC from set_size() and reset_mem()
In set_size(), remove:
#ifdef _OPENACC
isOnGPU = true;
#pragma acc update device(this)
#pragma acc enter data create(mem[0:n_elem])
#endif
Keep the allocation paths (Metal aligned_alloc vs standard new) and the isOnGPU = false default.
In reset_mem(), remove acc_detach and device update pragmas.
- [ ] Step 6: Keep zeros()/ones() loops — replace pragma for Phase 3
For now, just remove #pragma acc parallel loop present(mem[0:n_elem]). The plain for loop remains and works single-threaded. OpenMP will be added in Task 11.
- [ ] Step 7: Remove OpenACC from getArma() variants
Remove #pragma acc update host(mem[0:n_elem]) from non-const getArma(). Keep the Metal #ifdef METAL_COMPUTE path that does Metal-specific sync.
- [ ] Step 8: Remove OpenACC from all operator overloads
Go through every operator+, operator-, operator*, operator%=, etc. and remove #pragma acc parallel loop directives. The loops remain as plain single-threaded loops. Keep Metal dispatch paths (#ifdef METAL_COMPUTE).
- [ ] Step 9: Remove OpenACC from free functions (sum, norm, etc.)
Same pattern — remove #pragma acc from sum(), norm(), and template operator functions at the bottom of the file.
- [ ] Step 10: Verify the file compiles
cmake --build build --target ForgeCommon -j4 2>&1 | head -50
Fix any compilation errors from dangling references to removed symbols.
- [ ] Step 11: Commit
git add forge/Core/forgeCol.hpp
git commit -m "refactor: strip all OpenACC pragmas and isCopy flag from forgeCol.hpp"
Task 5: Strip OpenACC from forgeMat.hpp
Files:
- Modify: forge/Core/forgeMat.hpp
Same pattern as forgeCol — ~40 #pragma acc directives.
- [ ] Step 1: Read forgeMat.hpp in full
- [ ] Step 2: Remove _OPENACC header includes
- [ ] Step 3: Remove all #pragma acc from constructors, destructor, set_size(), reset_mem()
Same approach as forgeCol. Keep Metal allocation paths.
- [ ] Step 4: Remove OpenACC from zeros()/ones()
Remove #pragma acc parallel loop. Plain loops for now.
- [ ] Step 5: Remove OpenACC from getArma() and all operator overloads
- [ ] Step 6: Remove OpenACC from sum() functions
The sum() functions have nested parallel loops with reductions — remove the pragmas, keep the loops.
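For orientation, a nested sum with the pragmas removed might look like the sketch below. The function name, signature and column-major layout are assumptions for illustration, not the actual forgeMat API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical free function: per-column sums of a column-major
// n_rows x n_cols matrix stored in one contiguous buffer.
template <typename T>
std::vector<T> sum_columns(const T* mem, std::size_t n_rows, std::size_t n_cols) {
    assert(mem != nullptr || n_rows * n_cols == 0);
    std::vector<T> out(n_cols, T(0));
    for (std::size_t c = 0; c < n_cols; ++c) {      // outer loop over columns
        T acc = T(0);
        for (std::size_t r = 0; r < n_rows; ++r) {  // inner reduction over rows
            acc += mem[c * n_rows + r];
        }
        out[c] = acc;
    }
    return out;
}
```

The loop nest and the local accumulator stay exactly as they are; only the directives disappear at this stage.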
- [ ] Step 7: Remove OpenACC from vectorise()
- [ ] Step 8: Verify compilation
cmake --build build --target ForgeCommon -j4 2>&1 | head -50
- [ ] Step 9: Commit
git add forge/Core/forgeMat.hpp
git commit -m "refactor: strip all OpenACC pragmas from forgeMat.hpp"
Task 6: Clean forgeSubview_Col.hpp and Test Files
Files:
- Modify: forge/Core/forgeSubview_Col.hpp
- Modify: forge/Tests/forgeColTests.cpp
- Modify: forge/Tests/forgeColCplxTests.cpp
- Modify: forge/Tests/forgeMatTests.cpp
- Modify: forge/Tests/MetalVectorOpsTests.cpp
- Modify: forge/Microbenchmarks/forgeCol/main.cpp
- [ ] Step 1: Read forgeSubview_Col.hpp
This is a draft/stub file. Remove any #pragma acc directives.
- [ ] Step 2: Clean test files — remove _OPENACC/OPENACC_GPU guards
In each test file, find and remove blocks like:
#ifdef _OPENACC
// OpenACC-specific test setup
#endif
Or:
#ifdef OPENACC_GPU
// ...
#endif
Keep #ifdef METAL_COMPUTE guards — those are still active.
- [ ] Step 3: Clean Microbenchmarks
Remove _OPENACC guards from forge/Microbenchmarks/forgeCol/main.cpp.
- [ ] Step 4: Verify compilation
cmake --build build --target cpu_tests -j4 2>&1 | head -50
- [ ] Step 5: Commit
git add forge/Core/forgeSubview_Col.hpp forge/Tests/forgeColTests.cpp forge/Tests/forgeColCplxTests.cpp forge/Tests/forgeMatTests.cpp forge/Tests/MetalVectorOpsTests.cpp forge/Microbenchmarks/forgeCol/main.cpp
git commit -m "refactor: remove OpenACC guards from subview, tests, and microbenchmarks"
Task 7: Strip OpenACC from Gridding
Files:
- Modify: forge/Gridding/gridding.h
- Modify: forge/Gridding/gridding.cpp
- Modify: forge/Gridding/griddingSupport.h
- Modify: forge/Gridding/griddingSupport.cpp
- Modify: forge/Gridding/TimeSegmentation.cpp
- [ ] Step 1: Read gridding.h
Simplify the 4-way #ifdef (lines ~54-81) from:
#ifdef OPENACC_GPU
// GPU version
#elif _OPENACC
// OpenACC without CUDA
#elif defined(METAL_COMPUTE)
// Metal
#else
// CPU
#endif
To:
#ifdef METAL_COMPUTE
// Metal
#else
// CPU
#endif
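A small way to confirm which branch a given build selects is a compile-time query. The helper below is hypothetical and not part of gridding.h; it only shows the shape of a 2-way switch.

```cpp
#include <cassert>
#include <string_view>

// Hypothetical helper: reports which of the two remaining branches is active.
constexpr std::string_view active_gridding_backend() {
#ifdef METAL_COMPUTE
    return "metal";
#else
    return "cpu";
#endif
}

// Exactly one branch is compiled, so the result is never empty.
static_assert(!active_gridding_backend().empty(), "one backend must be selected");
```

With only two branches, a build without -DMETAL_COMPUTE always lands on the CPU path, which is the property the later CPU-only verification relies on.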
- [ ] Step 2: Read and clean gridding.cpp
Remove all #pragma acc parallel loop, #pragma acc loop seq, #pragma acc loop independent, #pragma acc atomic update, and #pragma acc data directives. Leave the loops and their bodies intact.
Key locations:
- gridding_adjoint_2D(): ~6 pragmas (parallel loop, seq loops, atomic updates)
- gridding_adjoint_3D(): ~8 pragmas (similar structure)
- gridding_forward_2D(): ~4 pragmas
- gridding_forward_3D(): ~4 pragmas
Do NOT add OpenMP yet — that's Task 11. Just remove OpenACC, leaving plain single-threaded loops.
- [ ] Step 3: Read and clean griddingSupport.h
Remove #ifdef _OPENACC include guard for openacc.h.
Remove #pragma acc routine seq from function declarations (bessi0, kernel_value_LUT). These annotations told OpenACC the function could be called from device code — not relevant for CPU/OpenMP.
- [ ] Step 4: Read and clean griddingSupport.cpp
Remove all 21 #pragma acc directives:
- #pragma acc routine seq on function definitions
- #pragma acc parallel loop collapse(N) on deinterleave functions
- #pragma acc parallel loop on normalize functions
- #pragma acc parallel loop on deapodization functions
- #pragma acc data regions
Leave all loops intact as single-threaded.
- [ ] Step 5: Un-comment #pragma omp in TimeSegmentation.cpp
Find the commented-out #pragma omp parallel for (line ~143) and un-comment it:
#pragma omp parallel for
- [ ] Step 6: Verify compilation
cmake --build build --target ForgeCommon -j4 2>&1 | head -50
- [ ] Step 7: Commit
git add forge/Gridding/gridding.h forge/Gridding/gridding.cpp forge/Gridding/griddingSupport.h forge/Gridding/griddingSupport.cpp forge/Gridding/TimeSegmentation.cpp
git commit -m "refactor: strip OpenACC from gridding, griddingSupport, and TimeSegmentation"
Task 8: Strip OpenACC from FFT Files and Public Header
Files:
- Modify: forge/FFT/ftCpu.h
- Modify: forge/FFT/ftCpu.cpp
- Modify: forge/FFT/ftCpuWithGrads.h
- Modify: forge/FFT/ftCpuWithGrads.cpp
- Modify: forge/Core/forge.h
- Modify: forge/FFT/fftCPU.h (docstring cleanup)
- Modify: forge/FFT/fftAccelerate.h (docstring cleanup)
- Retire: forge/FFT/fftGPU.h (do not delete yet — just confirm it's excluded from build)
- Retire: forge/FFT/fftGPU.cpp (excluded from build in Task 2)
- [ ] Step 1: Read and clean ftCpu.h
Remove #ifdef _OPENACC block that includes openacc.h.
- [ ] Step 2: Read and clean ftCpu.cpp
Remove #pragma acc kernels, #pragma acc loop independent gang, #pragma acc loop vector(128), and any copyin/copyout clauses. Leave loops as single-threaded.
Both ftCpu() and iftCpu() functions have similar OpenACC structures — clean both.
- [ ] Step 3: Read and clean ftCpuWithGrads.h
Remove #ifdef _OPENACC include guard. This file already uses #pragma omp — leave those intact.
- [ ] Step 4: Read and clean ftCpuWithGrads.cpp
This file has 6 #pragma acc directives (lines ~107, 112, 123, 175, 179, 196). Remove all of them. The DFT loops are structurally identical to ftCpu.cpp. Leave existing #pragma omp directives intact.
- [ ] Step 5: Remove fftGPU.h include from forge.h
In forge/Core/forge.h (line ~33), remove:
#include "FFT/fftGPU.h"
This is an unconditional include of a file being retired. Without this fix, every translation unit that includes forge.h will fail to compile after fftGPU.h is deleted.
- [ ] Step 6: Clean up @see fftGPU.h docstring references
In forge/FFT/fftCPU.h and forge/FFT/fftAccelerate.h, remove @see fftGPU.h documentation references.
- [ ] Step 7: Verify fftGPU files are excluded from build
Confirm fftGPU.cpp was removed from forge/CMakeLists.txt in Task 2. The files still exist on disk but are no longer compiled.
- [ ] Step 8: Verify compilation
cmake --build build --target ForgeCommon -j4 2>&1 | head -50
- [ ] Step 9: Commit
git add forge/FFT/ftCpu.h forge/FFT/ftCpu.cpp forge/FFT/ftCpuWithGrads.h forge/FFT/ftCpuWithGrads.cpp forge/Core/forge.h forge/FFT/fftCPU.h forge/FFT/fftAccelerate.h
git commit -m "refactor: strip OpenACC from FFT files, remove fftGPU.h include from forge.h"
Task 9: Strip OpenACC from All Operators
Files:
- Modify: forge/Operators/Gfft.h, forge/Operators/Gfft.cpp
- Modify: forge/Operators/Gnufft.h, forge/Operators/Gnufft.cpp
- Modify: forge/Operators/Gdft.h, forge/Operators/Gdft.cpp
- Modify: forge/Operators/GdftR2.h, forge/Operators/GdftR2.cpp
- Modify: forge/Operators/SENSE.h, forge/Operators/SENSE.cpp
- Modify: forge/Operators/pcSENSE.h, forge/Operators/pcSENSE.cpp
- Modify: forge/Operators/pcSenseTimeSeg.h, forge/Operators/pcSenseTimeSeg.cpp
- [ ] Step 1: Read and clean Gfft.h
Simplify the 4-way #ifdef to 2-way:
#ifdef METAL_COMPUTE
// Metal includes
#else
// CPU includes (FFTW)
#endif
Remove both the OPENACC_GPU and OPENACC_MP branches — both include fftGPU.h which is being retired. Also remove OPENACC_GPU members (stream, cuFFT plan handle).
- [ ] Step 2: Read and clean Gfft.cpp
Remove #ifdef OPENACC_GPU blocks that:
- Create CUDA streams
- Create cuFFT plans
- Use #pragma acc data regions
- Call fftGPU functions
The CPU fallback path (using fftCPU or fftAccelerate) becomes the only non-Metal path.
- [ ] Step 3: Read and clean Gnufft.h and Gnufft.cpp
In Gnufft.h:
- Remove OPENACC_GPU members (cuFFT plan handle, etc.)
- Keep Metal members (pipelineCtx, metalCtx)
In Gnufft.cpp:
- Remove #ifdef OPENACC_GPU block in constructor (cuFFT plan setup, lines ~86-99)
- Remove #pragma acc enter data in constructor (lines ~116-118)
- In operator* and operator/: simplify from 3-way dispatch (Metal pipeline / Metal gridding / OpenACC) to 2-way (Metal / CPU)
- Remove any #pragma acc in the CPU computation path
- [ ] Step 4: Read and clean Gdft.h and Gdft.cpp
Same pattern — remove OPENACC_GPU blocks, keep Metal blocks, simplify dispatch.
- [ ] Step 5: Read and clean GdftR2.h and GdftR2.cpp
Same pattern.
- [ ] Step 6: Read and clean SENSE.h, SENSE.cpp, pcSENSE.h, pcSENSE.cpp, pcSenseTimeSeg.h, pcSenseTimeSeg.cpp
These higher-level operators may have fewer or no direct OpenACC pragmas — they delegate to Gnufft/Gdft. But check for any #ifdef OPENACC_GPU guards or #pragma acc data regions.
- [ ] Step 7: Verify compilation
cmake --build build --target ForgeCommon -j4 2>&1 | head -50
- [ ] Step 8: Commit
git add forge/Operators/
git commit -m "refactor: strip all OpenACC from operators (Gfft, Gnufft, Gdft, GdftR2, SENSE, pcSENSE, pcSenseTimeSeg)"
Task 10: Retire Legacy Files and Full Build + Test
Files:
- Retire: forge/FFT/fftGPU.h, forge/FFT/fftGPU.cpp
- Retire: forge/my_sincosf.cuh
- [ ] Step 1: Delete retired files
git rm forge/FFT/fftGPU.h forge/FFT/fftGPU.cpp
If forge/my_sincosf.cuh exists and is not referenced anywhere:
git rm forge/my_sincosf.cuh
- [ ] Step 2: Search for any remaining OpenACC references
# Search for any remaining #pragma acc
grep -rn "#pragma acc" forge/ --include="*.cpp" --include="*.hpp" --include="*.h"
# Search for any remaining _OPENACC or OPENACC_GPU
grep -rn "_OPENACC\|OPENACC_GPU\|OPENACC_MP" forge/ --include="*.cpp" --include="*.hpp" --include="*.h" --include="*.cmake"
grep -n "_OPENACC\|OPENACC_GPU\|OPENACC_MP" CMakeLists.txt forge/CMakeLists.txt
Expected: No results (or only in files already retired/excluded from build). Fix any stragglers.
- [ ] Step 3: Full clean build (Metal backend)
rm -rf build
cmake -B build -S . -DMETAL_COMPUTE=ON
cmake --build build -j4
Expected: Clean build with zero errors and zero OpenACC-related warnings.
- [ ] Step 4: Run all fast+medium tests
./build/cpu_tests '~[Benchmark]'
Expected: All tests pass. This is the single-threaded correctness baseline.
- [ ] Step 5: Run Metal tests
./build/metal_tests '~[Benchmark]'
Expected: All Metal tests pass. Metal backend is unchanged.
- [ ] Step 6: Full clean build (CPU-only backend)
rm -rf build_cpu
cmake -B build_cpu -S . -DMETAL_COMPUTE=OFF
cmake --build build_cpu -j4
Expected: Clean build on CPU-only path (no Metal, no CUDA, no OpenACC).
- [ ] Step 7: Run CPU-only tests
./build_cpu/cpu_tests '~[Benchmark]'
Expected: All CPU-agnostic tests pass. Metal-specific tests are excluded by #ifdef.
- [ ] Step 8: Commit
git add -A
git commit -m "refactor: retire fftGPU and my_sincosf.cuh, verify clean build on Metal and CPU"
Task 11: Add OpenMP Parallelism to Hot Loops
Files:
- Modify: forge/Gridding/gridding.cpp
- Modify: forge/Gridding/griddingSupport.cpp
- Modify: forge/FFT/ftCpu.cpp
- Modify: forge/FFT/ftCpuWithGrads.cpp
- Modify: forge/Core/forgeCol.hpp
- Modify: forge/Core/forgeMat.hpp
Now that single-threaded correctness is verified, layer in OpenMP.
- [ ] Step 1: Add OpenMP to gridding_adjoint_2D/3D in gridding.cpp
For the outer loop of gridding_adjoint_2D, add:
#pragma omp parallel for schedule(dynamic)
For the atomic grid accumulation inside the loop, add:
#pragma omp atomic update
Same pattern for gridding_adjoint_3D. The #pragma omp atomic replaces the old #pragma acc atomic — same semantics.
For gridding_forward_2D/3D, the gather operation is embarrassingly parallel with no atomics:
#pragma omp parallel for schedule(static)
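The scatter pattern with atomics can be sketched as follows. This is a simplified 1D illustration under stated assumptions: the function name, the precomputed cell indices, and the split real/imaginary grid storage are all hypothetical, and the real gridding_adjoint_2D/3D signatures differ. One detail worth noting is that `omp atomic` applies to scalar updates, so a complex accumulation is written as two separate atomic updates.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

// Hypothetical scatter: each sample adds its value into one grid cell.
// Different threads may target the same cell, so the adds must be atomic.
void scatter_adjoint(const std::vector<std::complex<float>>& samples,
                     const std::vector<std::size_t>& cell,
                     std::vector<float>& grid_re,
                     std::vector<float>& grid_im) {
    assert(samples.size() == cell.size());
    assert(grid_re.size() == grid_im.size());
    float* re_ptr = grid_re.data();
    float* im_ptr = grid_im.data();
    const long n = static_cast<long>(samples.size());

    #pragma omp parallel for schedule(dynamic)
    for (long i = 0; i < n; ++i) {
        const std::size_t idx = static_cast<std::size_t>(i);
        const float re = samples[idx].real();
        const float im = samples[idx].imag();
        const std::size_t c = cell[idx];
        // Real and imaginary parts are accumulated with separate scalar atomics.
        #pragma omp atomic update
        re_ptr[c] += re;
        #pragma omp atomic update
        im_ptr[c] += im;
    }
}
```

Without OpenMP enabled the pragmas are ignored and the function runs as a plain serial loop, so the same code serves as its own single-threaded reference.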
- [ ] Step 2: Add OpenMP to griddingSupport.cpp
For deinterleave_data2d:
#pragma omp parallel for collapse(2)
For deinterleave_data3d:
#pragma omp parallel for collapse(3)
For normalize_fft2d, normalize_fft3d:
#pragma omp parallel for
For deapodization functions:
#pragma omp parallel for
Follow the same loop structure the old OpenACC pragmas annotated — they identified which loops are safe to parallelize.
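As an illustration of the collapse(2) form, here is a sketch of a 2D deinterleave. The signature and the interleaved (re, im) layout are assumptions; the real deinterleave_data2d may be declared differently. collapse requires the two loops to be perfectly nested with bounds that do not depend on each other, which is why the index computation sits inside the inner body.

```cpp
#include <cassert>

// Hypothetical deinterleave: split interleaved (re, im) pairs of an
// nx x ny grid into separate real and imaginary arrays.
void deinterleave_2d(const float* interleaved, float* re, float* im, int nx, int ny) {
    assert(interleaved != nullptr && re != nullptr && im != nullptr);
    // collapse(2) fuses both loops into one iteration space of nx * ny.
    #pragma omp parallel for collapse(2)
    for (int x = 0; x < nx; ++x) {
        for (int y = 0; y < ny; ++y) {
            const int idx = x * ny + y;           // each (x, y) writes a distinct slot
            re[idx] = interleaved[2 * idx];
            im[idx] = interleaved[2 * idx + 1];
        }
    }
}
```

Every iteration writes to its own output element, so no atomics or reductions are needed here.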
- [ ] Step 3: Add OpenMP to ftCpu.cpp
For ftCpu() and iftCpu(), add:
#pragma omp parallel for
to the outer loop over k-space samples. The inner loop (over spatial positions) is a dot product — keep it sequential within each thread.
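The outer-parallel, inner-sequential structure can be sketched with a 1D direct DFT. The function name and argument list are hypothetical (the real ftCpu() takes different parameters); the sketch only shows where the pragma goes and why no synchronization is needed.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Hypothetical 1D direct DFT. kx holds k-space coordinates in cycles per
// unit length and x the spatial positions of the image samples.
std::vector<std::complex<double>> direct_dft(const std::vector<std::complex<double>>& image,
                                             const std::vector<double>& x,
                                             const std::vector<double>& kx) {
    assert(image.size() == x.size());
    const double two_pi = 2.0 * std::acos(-1.0);
    std::vector<std::complex<double>> out(kx.size());
    const long n_k = static_cast<long>(kx.size());

    #pragma omp parallel for
    for (long k = 0; k < n_k; ++k) {                    // one k-space sample per iteration
        const std::size_t ki = static_cast<std::size_t>(k);
        std::complex<double> acc(0.0, 0.0);             // private to this iteration
        for (std::size_t j = 0; j < x.size(); ++j) {    // sequential dot product
            const double phase = -two_pi * kx[ki] * x[j];
            acc += image[j] * std::complex<double>(std::cos(phase), std::sin(phase));
        }
        out[ki] = acc;                                  // distinct element per iteration
    }
    return out;
}
```

Each iteration owns its accumulator and writes a single output element, so the summation order inside one sample is the same as in the serial build.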
- [ ] Step 3b: Add OpenMP to ftCpuWithGrads.cpp
Same pattern as ftCpu.cpp — the DFT loops are structurally identical (with additional R2* exponential terms). Add #pragma omp parallel for to the outer loops of both ftCpuWithGrads() and iftCpuWithGrads().
- [ ] Step 4: Add OpenMP to forgeCol::zeros() and forgeCol::ones()
void zeros() {
    #pragma omp parallel for schedule(static)
    for (uword i = 0; i < n_elem; i++) {
        mem[i] = T(0);
    }
}
Same for ones(). These are large arrays (millions of elements for 3D images).
- [ ] Step 5: Add OpenMP to forgeMat::zeros() and forgeMat::ones()
Same pattern as forgeCol.
- [ ] Step 6: Build and test with OpenMP
cmake --build build_cpu -j4
./build_cpu/cpu_tests '~[Benchmark]'
Expected: All tests pass. Results should match single-threaded within floating-point tolerance.
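The tolerance matters because reductions and atomic accumulation combine partial results in an unspecified order, so the last bits can differ from a serial run. A minimal sketch of such a comparison, with hypothetical helper names and a caller-chosen tolerance:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Serial reference sum.
double sum_serial(const std::vector<double>& v) {
    double acc = 0.0;
    for (double e : v) acc += e;
    return acc;
}

// OpenMP reduction: per-thread partial sums are combined in an
// unspecified order, so rounding may differ from the serial result.
double sum_parallel(const std::vector<double>& v) {
    double acc = 0.0;
    const long n = static_cast<long>(v.size());
    #pragma omp parallel for reduction(+ : acc)
    for (long i = 0; i < n; ++i) acc += v[static_cast<std::size_t>(i)];
    return acc;
}

// Relative comparison with an absolute floor of rel_tol for values near zero.
bool nearly_equal(double a, double b, double rel_tol) {
    assert(rel_tol >= 0.0);
    const double scale = std::fmax(std::fabs(a), std::fabs(b));
    return std::fabs(a - b) <= rel_tol * std::fmax(scale, 1.0);
}
```

A test that compares with exact equality may therefore fail after this task even though the parallel code is correct.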
- [ ] Step 7: Quick performance sanity check
Run one of the medium-speed tests and compare wall time:
# Single-threaded
OMP_NUM_THREADS=1 time ./build_cpu/cpu_tests "[Gnufft adjoint]"
# All cores
time ./build_cpu/cpu_tests "[Gnufft adjoint]"
Expected: Multi-threaded is noticeably faster. Exact speedup depends on the test.
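If no speedup shows up, a first thing to check is whether OpenMP was actually enabled at compile time. The helpers below are a hypothetical sketch: `_OPENMP` is the macro a conforming compiler defines when OpenMP is on, and `omp_get_max_threads()` is the standard runtime query.

```cpp
#include <cassert>
#include <chrono>
#ifdef _OPENMP
#include <omp.h>
#endif

// Threads OpenMP would use for the next parallel region, or 1 when this
// translation unit was compiled without OpenMP support.
int available_threads() {
#ifdef _OPENMP
    const int n = omp_get_max_threads();
#else
    const int n = 1;
#endif
    assert(n >= 1);
    return n;
}

// Wall-clock seconds taken by a callable.
template <typename F>
double time_seconds(F&& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

A result of 1 from `available_threads()` with OMP_NUM_THREADS unset on a multi-core machine suggests the OpenMP flags never reached the compiler.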
- [ ] Step 8: Commit
git add forge/Gridding/gridding.cpp forge/Gridding/griddingSupport.cpp forge/FFT/ftCpu.cpp forge/FFT/ftCpuWithGrads.cpp forge/Core/forgeCol.hpp forge/Core/forgeMat.hpp
git commit -m "feat: add OpenMP parallelism to gridding, griddingSupport, ftCpu, ftCpuWithGrads, forgeCol, and forgeMat"
Task 12: Create BackendDispatch.hpp
Files:
- Create: forge/Core/BackendDispatch.hpp
- [ ] Step 1: Create the dispatch header
#pragma once
// BackendDispatch.hpp — Compile-time backend resolution
//
// Maps GPU memory management and vector algebra to the active backend.
// CPU backend: GPU ops are no-ops or static_asserts.
// Metal backend: delegates to MetalVectorOps_dispatch.hpp
// CUDA backend (future): delegates to CudaVectorOps_dispatch.hpp
#include <cstddef>
#include "forgeComplex.hpp"

// ============================================================
// GPU Memory Management
// ============================================================
// On CPU-only builds, these are no-ops or errors.
// On Metal, memory is unified (page-aligned alloc gives GPU access).
// On CUDA (future), these map to cudaMalloc/cudaMemcpy/cudaFree.
//
// Backend headers are included at global scope, before the namespace
// opens, so their declarations are not nested inside forge::backend.
#if defined(CUDA_COMPUTE)
// Future: #include "CUDA/CudaVectorOps_dispatch.hpp"
#elif defined(METAL_COMPUTE)
// Metal uses unified memory — no explicit GPU malloc/copy needed.
// MetalVectorOps_dispatch.hpp handles compute dispatch.
#include "Metal/MetalVectorOps_dispatch.hpp"
#else
// CPU-only: no GPU memory operations
#endif

namespace forge::backend {

// ============================================================
// Compile-time backend query
// ============================================================
constexpr bool has_gpu() {
#if defined(METAL_COMPUTE) || defined(CUDA_COMPUTE)
    return true;
#else
    return false;
#endif
}

} // namespace forge::backend
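A caller can branch on the compile-time query with `if constexpr`. The sketch below is self-contained, so it re-declares the query in a hypothetical `sketch` namespace; `scale_in_place` is an invented example function, and the GPU branch is a placeholder that falls back to the same loop to keep the example runnable.

```cpp
#include <cassert>
#include <cstddef>

namespace sketch {  // hypothetical namespace, standing in for forge::backend

constexpr bool has_gpu() {
#if defined(METAL_COMPUTE) || defined(CUDA_COMPUTE)
    return true;
#else
    return false;
#endif
}

// Hypothetical caller: scales a buffer and reports which branch ran.
// Returns true when the GPU branch was selected at compile time.
inline bool scale_in_place(float* data, std::size_t n, float factor) {
    assert(data != nullptr || n == 0);
    if constexpr (has_gpu()) {
        // A real backend would hand off to its dispatch header here.
        for (std::size_t i = 0; i < n; ++i) data[i] *= factor;
        return true;
    } else {
        // CPU path: plain loop.
        for (std::size_t i = 0; i < n; ++i) data[i] *= factor;
        return false;
    }
}

}  // namespace sketch
```

The branch is resolved by the compiler, so a CPU-only build carries no runtime check for a backend it does not have.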
- [ ] Step 2: Verify it compiles when included
Add a temporary #include "BackendDispatch.hpp" to one source file, build, and remove.
- [ ] Step 3: Commit
git add forge/Core/BackendDispatch.hpp
git commit -m "feat: add BackendDispatch.hpp compile-time backend resolution layer"
Task 13: Refactor Operators to Use Dispatch Layer
Files:
- Modify: forge/Operators/Gnufft.cpp
- Modify: forge/Operators/Gdft.cpp
- Modify: forge/Operators/GdftR2.cpp
- Modify: forge/Operators/Gfft.cpp
This task replaces the remaining #ifdef METAL_COMPUTE blocks in operators with dispatch-layer patterns. The goal is that operators have clean code without scattered #ifdef blocks — just if (pipelineCtx) runtime checks against the backend context.
- [ ] Step 1: Evaluate current Metal #ifdef blocks in Gnufft.cpp
After Task 9, the #ifdef blocks should be just Metal vs CPU. Evaluate whether the dispatch layer abstraction adds value here, or if the 2-way #ifdef METAL_COMPUTE is already clean enough.
Pragmatic decision: If the operator code is already clean with just #ifdef METAL_COMPUTE / #else guards, don't force it through the dispatch layer. The dispatch layer's main value is for GPU memory management (forgeCol/forgeMat) and vector algebra — not for pipeline-level operations which are inherently backend-specific.
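The runtime-check pattern mentioned in this task can be sketched as an operator that holds an optional backend context. Everything here is hypothetical (`PipelineContext`, `ScaleOperator`, the `calls` counter); only the idea of testing `pipelineCtx` at runtime comes from the plan.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical backend context; a real one would hold device state.
struct PipelineContext {
    int calls = 0;  // counts how often the backend path was taken
};

// Hypothetical operator: uses the backend path only when a context exists.
class ScaleOperator {
public:
    explicit ScaleOperator(float factor, std::unique_ptr<PipelineContext> ctx = nullptr)
        : factor_(factor), pipelineCtx(std::move(ctx)) {}

    std::vector<float> apply(const std::vector<float>& in) const {
        std::vector<float> out(in.size());
        if (pipelineCtx) {
            ++pipelineCtx->calls;  // a backend-specific pipeline would run here
        }
        // The sketch computes on the CPU in both cases to stay runnable.
        for (std::size_t i = 0; i < in.size(); ++i) out[i] = factor_ * in[i];
        return out;
    }

    bool has_context() const { return pipelineCtx != nullptr; }
    int backend_calls() const { return pipelineCtx ? pipelineCtx->calls : 0; }

private:
    float factor_;
    std::unique_ptr<PipelineContext> pipelineCtx;
};
```

With this shape the operator body contains no preprocessor branches; whether a context gets constructed is decided once, where the operator is created.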
- [ ] Step 2: If refactoring, update the operator dispatch pattern
If the #ifdef blocks are complex, refactor to use dispatch. If they're simple 2-way switches, leave them and document the pattern for the CUDA backend to follow.
- [ ] Step 3: Verify all tests still pass
cmake --build build -j4
./build/cpu_tests '~[Benchmark]'
./build/metal_tests '~[Benchmark]'
- [ ] Step 4: Commit
git add forge/Operators/ forge/Core/BackendDispatch.hpp
git commit -m "refactor: operators use BackendDispatch where beneficial, document pattern for CUDA"
Task 14: Final Verification and Documentation
Files:
- Modify: CLAUDE.md (update build instructions)
- [ ] Step 1: Full clean build — Metal backend
rm -rf build
cmake -B build -S . -DMETAL_COMPUTE=ON
cmake --build build -j4
- [ ] Step 2: Run complete Metal test suite
./build/metal_tests '~[Benchmark]'
./build/cpu_tests '~[Benchmark]'
Expected: All pass.
- [ ] Step 3: Full clean build — CPU-only backend
rm -rf build_cpu
cmake -B build_cpu -S .
cmake --build build_cpu -j4
Note: no -DMETAL_COMPUTE and no -DOPENACC_GPU — this is the pure CPU+OpenMP build.
- [ ] Step 4: Run CPU test suite
./build_cpu/cpu_tests '~[Benchmark]'
Expected: All CPU-agnostic tests pass.
- [ ] Step 5: OpenMP thread scaling check
OMP_NUM_THREADS=1 ./build_cpu/cpu_tests "[Gnufft adjoint]" --benchmark
OMP_NUM_THREADS=4 ./build_cpu/cpu_tests "[Gnufft adjoint]" --benchmark
Confirm OpenMP provides speedup.
- [ ] Step 6: Update CLAUDE.md build instructions
Update the build instructions to reflect new flags:
- Remove references to OPENACC_GPU and OPENACC_MP
- Add CUDA_COMPUTE (coming soon)
- Add BUILD_SHARED_LIBS
- Note that OpenMP is now required for all builds (Apple Clang needs brew install libomp)
- [ ] Step 7: Update CLAUDE.md Key CMake Options table
| Flag | Effect |
|---|---|
| -DMETAL_COMPUTE=ON | Enable Apple Metal GPU backend |
| -DCUDA_COMPUTE=ON | Enable NVIDIA CUDA GPU backend (future) |
| -DBUILD_SHARED_LIBS=OFF | Build static libraries (for forge-studio) |
| -DENABLE_DOUBLE_PRECISION=ON | Switch to double precision |
| -DBUILD_FORGEVIEW=ON | Build the forgeview TUI viewer (default ON) |
- [ ] Step 8: Final commit
git add CLAUDE.md
git commit -m "docs: update build instructions for OpenMP/CUDA_COMPUTE, remove OpenACC references"
Summary of Commits
| Task | Commit Message |
|---|---|
| 1 | build: remove OpenACC/OPENACC_MP options, add OpenMP and CUDA_COMPUTE stubs |
| 2 | build: remove fftGPU from sources, use BUILD_SHARED_LIBS, link OpenMP |
| 3 | refactor: remove OpenACC guards from ForgeIncludes.h and Tracer.hpp |
| 4 | refactor: strip all OpenACC pragmas and isCopy flag from forgeCol.hpp |
| 5 | refactor: strip all OpenACC pragmas from forgeMat.hpp |
| 6 | refactor: remove OpenACC guards from subview, tests, and microbenchmarks |
| 7 | refactor: strip OpenACC from gridding, griddingSupport, and TimeSegmentation |
| 8 | refactor: strip OpenACC from FFT files, remove fftGPU.h include from forge.h |
| 9 | refactor: strip all OpenACC from operators |
| 10 | refactor: retire fftGPU and my_sincosf.cuh, verify clean build on Metal and CPU |
| 11 | feat: add OpenMP parallelism to gridding, griddingSupport, ftCpu, ftCpuWithGrads, forgeCol, and forgeMat |
| 12 | feat: add BackendDispatch.hpp compile-time backend resolution layer |
| 13 | refactor: operators use BackendDispatch where beneficial |
| 14 | docs: update build instructions for OpenMP/CUDA_COMPUTE, remove OpenACC references |