CPU Backend + OpenMP Implementation Plan

For agentic workers: REQUIRED: Use superpowers:subagent-driven-development (if subagents available) or superpowers:executing-plans to implement this plan. Steps use checkbox (- [ ]) syntax for tracking.

Goal: Strip all OpenACC code, establish a clean CPU backend with OpenMP parallelism, and introduce a compile-time backend dispatch layer that unifies Metal and future CUDA paths.

Architecture: Three-phase approach: (1) remove OpenACC pragmas and guards to get a clean single-threaded CPU build, (2) introduce BackendDispatch.hpp to replace scattered #ifdef blocks in operators, (3) add OpenMP parallel for to hot loops for CPU parallelism.

Tech Stack: C++17, CMake 3.17+, OpenMP, FFTW3, Armadillo, Catch2

Spec: docs/superpowers/specs/2026-03-17-multi-backend-acceleration-design.md

Subsequent plans (not covered here):
- Phase 4: CUDA backend (forge/CUDA/ mirroring Metal)
- Phase 5: CI pipeline (GitHub Actions, Docker, tiered testing)

File Structure

Files to Create

File	Responsibility
`forge/Core/BackendDispatch.hpp`	Compile-time dispatch: maps GPU memory ops and vector algebra to Metal, CUDA (future), or CPU no-ops

Files to Modify (by component)

Build system:
- CMakeLists.txt: remove OPENACC_GPU/OPENACC_MP options, add OpenMP for all builds, add BUILD_SHARED_LIBS
- forge/CMakeLists.txt: remove fftGPU.cpp from sources, change SHARED to default

Core data structures:
- forge/Core/forgeCol.hpp: remove ~50 #pragma acc directives, remove isCopy flag, keep Metal allocation path
- forge/Core/forgeMat.hpp: remove ~40 #pragma acc directives
- forge/Core/forgeSubview_Col.hpp: remove draft #pragma acc directives
- forge/Core/ForgeIncludes.h: remove #ifdef OPENACC_GPU block (CUDA profiler includes)
- forge/Core/Tracer.hpp: change USE_NVTX guard to also trigger on CUDA_COMPUTE

Gridding:
- forge/Gridding/gridding.h: simplify 4-way #ifdef to 2-way (Metal vs CPU)
- forge/Gridding/gridding.cpp: replace #pragma acc with #pragma omp
- forge/Gridding/griddingSupport.h: remove _OPENACC guards, keep #pragma acc routine seq as comments for CUDA reference
- forge/Gridding/griddingSupport.cpp: replace 21 #pragma acc with #pragma omp
- forge/Gridding/TimeSegmentation.cpp: un-comment the existing #pragma omp parallel for

FFT:
- forge/FFT/ftCpu.h: remove _OPENACC include guard
- forge/FFT/ftCpu.cpp: replace #pragma acc kernels with #pragma omp parallel for
- forge/FFT/ftCpuWithGrads.h: remove _OPENACC include guard
- forge/FFT/ftCpuWithGrads.cpp: replace 6 #pragma acc directives with #pragma omp

Public header:
- forge/Core/forge.h: remove unconditional #include "FFT/fftGPU.h" (file is being retired)
- forge/FFT/fftCPU.h: remove @see fftGPU.h docstring reference
- forge/FFT/fftAccelerate.h: remove @see fftGPU.h docstring reference

Operators:
- forge/Operators/Gfft.h: simplify 4-way #ifdef to 2-way
- forge/Operators/Gfft.cpp: remove OPENACC_GPU cuFFT plan code
- forge/Operators/Gnufft.h, Gnufft.cpp: remove OPENACC_GPU blocks, keep Metal
- forge/Operators/Gdft.h, Gdft.cpp: remove OPENACC_GPU blocks
- forge/Operators/GdftR2.h, GdftR2.cpp: remove OPENACC_GPU blocks
- forge/Operators/SENSE.h, SENSE.cpp: remove OpenACC guards if present
- forge/Operators/pcSENSE.h, pcSENSE.cpp: same
- forge/Operators/pcSenseTimeSeg.h, pcSenseTimeSeg.cpp: same

Tests:
- forge/Tests/forgeColTests.cpp: remove _OPENACC/OPENACC_GPU guards
- forge/Tests/forgeColCplxTests.cpp: same
- forge/Tests/forgeMatTests.cpp: same
- forge/Tests/MetalVectorOpsTests.cpp: same

Microbenchmarks:
- forge/Microbenchmarks/forgeCol/main.cpp: remove _OPENACC guards

Files to Retire

forge/FFT/fftGPU.h — cuFFT wrappers, entirely #ifdef OPENACC_GPU
forge/FFT/fftGPU.cpp — same
forge/my_sincosf.cuh — legacy CUDA header

Task 1: Update Root CMakeLists.txt (Remove OpenACC, Add OpenMP)

Files:
- Modify: CMakeLists.txt

This is the foundation — get the build system right first.

[ ] Step 1: Read the current CMakeLists.txt to identify all OpenACC sections

Read CMakeLists.txt in full. Identify:
- option(OPENACC_GPU ...) and option(OPENACC_MP ...) (lines ~52-53)
- The mutual exclusion check (lines ~56-58)
- PGI/NVHPC compiler flags (lines ~243-254)
- OPENACC_GPU conditionals for CUDA toolkit (lines ~83-110)
- USE_NVTX definition (line ~260)

[ ] Step 2: Remove OPENACC_GPU and OPENACC_MP options and guards

Remove:

option(OPENACC_GPU "Use OpenACC to run on GPU" OFF)
option(OPENACC_MP "Use OpenACC to run on CPU multicore" OFF)

Remove the mutual exclusion check between OPENACC_GPU and OPENACC_MP.

Remove all if(OPENACC_GPU) blocks that:
- Find CUDA toolkit
- Set cuFFT link flags
- Create cuFFT plans
- Set PGI compiler flags (-cuda -acc -gpu=cuda12.6,cc60-cc89)

Remove the if(OPENACC_MP) block with -Mnollvm -O1 -acc -ta=multicore flags.

[ ] Step 3: Add OpenMP for all builds

Add after the Metal/CUDA mutual exclusion check (introduced in Step 5 below):

# OpenMP is available for ALL builds (including Metal) for CPU fallback paths
find_package(OpenMP REQUIRED)

Linking OpenMP to ForgeCommon happens in forge/CMakeLists.txt (Task 2, Step 4).

[ ] Step 4: Update USE_NVTX to trigger on CUDA_COMPUTE

Change the USE_NVTX definition from:

if(OPENACC_GPU)
    add_definitions(-DUSE_NVTX)
endif()

To:

if(CUDA_COMPUTE)
    add_definitions(-DUSE_NVTX)
endif()

[ ] Step 5: Add BUILD_SHARED_LIBS and CUDA_COMPUTE options (stubs for future)

Add:

option(CUDA_COMPUTE "Enable NVIDIA CUDA GPU backend" OFF)
option(BUILD_SHARED_LIBS "Build shared or static libraries" ON)

if(METAL_COMPUTE AND CUDA_COMPUTE)
    message(FATAL_ERROR "Cannot enable both METAL_COMPUTE and CUDA_COMPUTE")
endif()

Add a stub if(CUDA_COMPUTE) block:

if(CUDA_COMPUTE)
    enable_language(CUDA)
    find_package(CUDAToolkit REQUIRED)
    set(CMAKE_CUDA_ARCHITECTURES "75;80;86;89;90")
endif()

[ ] Step 6: Configure and verify the build still works

Run:

cmake -B build -S . -DMETAL_COMPUTE=ON

Expected: Configure succeeds. If it fails on missing OpenACC references, fix them.

[ ] Step 7: Commit

git add CMakeLists.txt
git commit -m "build: remove OpenACC/OPENACC_MP options, add OpenMP and CUDA_COMPUTE stubs"

Task 2: Update forge/CMakeLists.txt (Remove fftGPU, Fix SHARED)

Files:
- Modify: forge/CMakeLists.txt

[ ] Step 1: Read forge/CMakeLists.txt

Identify:
- fftGPU.cpp in the source list (line ~11)
- add_library(ForgeCommon SHARED ...) (line ~46)
- Any OPENACC_GPU conditionals

[ ] Step 2: Remove fftGPU.cpp from source list

Remove the line:

${PROJECT_SOURCE_DIR}/forge/FFT/fftGPU.cpp

[ ] Step 3: Change SHARED to default

Change:

add_library(ForgeCommon SHARED ${PG_FILES})

To:

add_library(ForgeCommon ${PG_FILES})

This lets BUILD_SHARED_LIBS control shared vs static.

[ ] Step 4: Link OpenMP to ForgeCommon

Add after the target_link_libraries section:

if(OpenMP_CXX_FOUND)
    target_link_libraries(ForgeCommon PUBLIC OpenMP::OpenMP_CXX)
endif()

[ ] Step 5: Build to verify

cmake -B build -S . -DMETAL_COMPUTE=ON && cmake --build build --target ForgeCommon -j4

Expected: The build fails at this stage because source files still reference fftGPU.h or OpenACC headers. Subsequent tasks fix those references.

[ ] Step 6: Commit

git add forge/CMakeLists.txt
git commit -m "build: remove fftGPU from sources, use BUILD_SHARED_LIBS, link OpenMP"

Task 3: Clean ForgeIncludes.h and Tracer.hpp

Files:
- Modify: forge/Core/ForgeIncludes.h
- Modify: forge/Core/Tracer.hpp

[ ] Step 1: Read ForgeIncludes.h

Look for the #ifdef OPENACC_GPU block (lines ~47-50) that includes CUDA profiler headers.

[ ] Step 2: Remove the OPENACC_GPU block from ForgeIncludes.h

Remove:

#ifdef OPENACC_GPU
#include <cuda_profiler_api.h>
#include <nvToolsExt.h>
#endif

These CUDA profiler includes will be handled by the CUDA backend directly when it's implemented.

[ ] Step 3: Read Tracer.hpp

Identify the USE_NVTX guard structure.

[ ] Step 4: Update Tracer.hpp — USE_NVTX now triggered by CUDA_COMPUTE

The #if defined(USE_NVTX) guard in Tracer.hpp is fine as-is — the CMakeLists.txt change (Task 1, Step 4) already ensures USE_NVTX is defined when CUDA_COMPUTE=ON. No code change needed in Tracer.hpp unless there are direct OPENACC_GPU references.
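To illustrate why a USE_NVTX guard needs no edits when only the build flag that defines it changes, here is a minimal sketch of a guarded RAII range marker. The class name ScopedRange and its depth counter are hypothetical and exist only for this example; the real Tracer.hpp may be structured differently. The NVTX calls nvtxRangePushA and nvtxRangePop come from nvToolsExt.h and are only compiled when USE_NVTX is defined.

```cpp
#include <cassert>

#if defined(USE_NVTX)
#include <nvToolsExt.h>  // provides nvtxRangePushA / nvtxRangePop
#endif

// Hypothetical RAII range marker (illustration only, not the Tracer.hpp API).
class ScopedRange {
public:
    explicit ScopedRange(const char* name) {
        assert(name != nullptr);
#if defined(USE_NVTX)
        nvtxRangePushA(name);  // open a named range for the NVIDIA profiler
#else
        (void)name;            // no profiler in this build: nothing to emit
#endif
        ++depth_;
    }

    ~ScopedRange() {
#if defined(USE_NVTX)
        nvtxRangePop();        // close the most recently opened range
#endif
        --depth_;
    }

    ScopedRange(const ScopedRange&) = delete;
    ScopedRange& operator=(const ScopedRange&) = delete;

    // Number of ranges currently open. Bookkeeping for this sketch only,
    // and not thread-safe.
    static int depth() { return depth_; }

private:
    static inline int depth_ = 0;  // C++17 inline static member
};
```

The guard tests only USE_NVTX, so the code is indifferent to whether the build system defines that macro from OPENACC_GPU or from CUDA_COMPUTE.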

[ ] Step 5: Commit

git add forge/Core/ForgeIncludes.h forge/Core/Tracer.hpp
git commit -m "refactor: remove OpenACC guards from ForgeIncludes.h and Tracer.hpp"

Task 4: Strip OpenACC from forgeCol.hpp

Files:
- Modify: forge/Core/forgeCol.hpp

This is the largest single file change — ~50 #pragma acc directives to remove. Work methodically top-to-bottom.

[ ] Step 1: Read forgeCol.hpp in full

Map every #pragma acc and #ifdef _OPENACC / #ifdef OPENACC_GPU location. The exploration found them at approximately these locations:
- Lines 24-27: _OPENACC header includes (accel.h, openacc.h)
- Line 135: constructor enter data copyin
- Line 148: constructor enter data create
- Lines 167-219: Arma constructors with device updates
- Lines 231-246: Copy constructor with isCopy, acc_memcpy
- Lines 257-275: Move constructor with acc_attach
- Lines 281-296: Destructor with acc_delete, exit data
- Lines 307-318: reset_mem() with acc_detach
- Lines 330-362: set_size() with device allocation
- Lines 370, 381: zeros()/ones() with parallel loop
- Lines 395-425: getArma() with update host
- Lines 515-931: Operator overloads and free functions with scattered pragmas

[ ] Step 2: Remove _OPENACC header includes

Remove:

#ifdef _OPENACC
#include <accel.h>
#include <openacc.h>
#endif

[ ] Step 3: Remove all #pragma acc from constructors and destructor

For each constructor, remove lines like:

#pragma acc enter data copyin(this)
#pragma acc enter data create(mem[0:n_elem])
#pragma acc update device(this)

For the destructor, remove:

#pragma acc exit data finalize detach(mem) delete(mem[0:n_elem])
#pragma acc exit data delete(this)

Keep the underlying new[]/delete[] and Metal std::aligned_alloc/std::free allocation logic intact.

[ ] Step 4: Remove isCopy flag

Remove the bool isCopy member variable declaration. Remove all references to isCopy in:
- Copy constructor (where isCopy = true is set)
- Move constructor
- Destructor (where if (!isCopy) guards cleanup)
- Any other locations

The copy constructor should still perform a deep copy of mem — just remove the isCopy bookkeeping.
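As a rough illustration of the ownership model after isCopy is gone, the sketch below uses a hypothetical stand-in class, MiniCol, rather than the real forgeCol. It assumes plain new[]/delete[] storage and leaves out the Metal allocation path entirely. Every instance owns its buffer, so the destructor frees unconditionally.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>

// Hypothetical stand-in for forgeCol<T>; only the ownership logic is shown.
template <typename T>
class MiniCol {
public:
    explicit MiniCol(std::size_t n) : n_elem(n), mem(n ? new T[n]() : nullptr) {}

    // Copy constructor: always a deep copy, so no isCopy bookkeeping is needed.
    MiniCol(const MiniCol& other)
        : n_elem(other.n_elem),
          mem(other.n_elem ? new T[other.n_elem] : nullptr) {
        std::copy(other.mem, other.mem + n_elem, mem);
    }

    // Move constructor: take over the buffer and leave the source empty.
    MiniCol(MiniCol&& other) noexcept
        : n_elem(std::exchange(other.n_elem, 0)),
          mem(std::exchange(other.mem, nullptr)) {}

    // Copy-and-swap assignment covers both copy and move assignment.
    MiniCol& operator=(MiniCol other) noexcept {
        std::swap(n_elem, other.n_elem);
        std::swap(mem, other.mem);
        return *this;
    }

    // Destructor: cleanup is unconditional because the buffer is never shared.
    ~MiniCol() { delete[] mem; }

    T& operator[](std::size_t i) { assert(i < n_elem); return mem[i]; }
    const T& operator[](std::size_t i) const { assert(i < n_elem); return mem[i]; }
    std::size_t size() const { return n_elem; }

private:
    std::size_t n_elem;
    T* mem;
};
```

Mutating the original after a copy leaves the copy unchanged, which is the behavior the deep copy is meant to preserve.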

[ ] Step 5: Remove OpenACC from set_size() and reset_mem()

In set_size(), remove:

#ifdef _OPENACC
    isOnGPU = true;
    #pragma acc update device(this)
    #pragma acc enter data create(mem[0:n_elem])
#endif

Keep the allocation paths (Metal aligned_alloc vs standard new) and the isOnGPU = false default.

In reset_mem(), remove acc_detach and device update pragmas.

[ ] Step 6: Keep zeros()/ones() loops, remove the pragma (OpenMP comes in Task 11)

For now, just remove #pragma acc parallel loop present(mem[0:n_elem]). The plain for loop remains and works single-threaded. OpenMP will be added in Task 11.

[ ] Step 7: Remove OpenACC from getArma() variants

Remove #pragma acc update host(mem[0:n_elem]) from non-const getArma(). Keep the Metal #ifdef METAL_COMPUTE path that does Metal-specific sync.

[ ] Step 8: Remove OpenACC from all operator overloads

Go through every operator+, operator-, operator*, operator%=, etc. and remove #pragma acc parallel loop directives. The loops remain as plain single-threaded loops. Keep Metal dispatch paths (#ifdef METAL_COMPUTE).

[ ] Step 9: Remove OpenACC from free functions (sum, norm, etc.)

Same pattern — remove #pragma acc from sum(), norm(), and template operator functions at the bottom of the file.

[ ] Step 10: Verify the file compiles

cmake --build build --target ForgeCommon -j4 2>&1 | head -50

Fix any compilation errors from dangling references to removed symbols.

[ ] Step 11: Commit

git add forge/Core/forgeCol.hpp
git commit -m "refactor: strip all OpenACC pragmas and isCopy flag from forgeCol.hpp"

Task 5: Strip OpenACC from forgeMat.hpp

Files:
- Modify: forge/Core/forgeMat.hpp

Same pattern as forgeCol — ~40 #pragma acc directives.

[ ] Step 1: Read forgeMat.hpp in full
[ ] Step 2: Remove _OPENACC header includes
[ ] Step 3: Remove all #pragma acc from constructors, destructor, set_size(), reset_mem()

Same approach as forgeCol. Keep Metal allocation paths.

[ ] Step 4: Remove OpenACC from zeros()/ones()

Remove #pragma acc parallel loop. Plain loops for now.

[ ] Step 5: Remove OpenACC from getArma() and all operator overloads
[ ] Step 6: Remove OpenACC from sum() functions

The sum() functions have nested parallel loops with reductions — remove the pragmas, keep the loops.
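For orientation, this is roughly what a reduction looks like once the pragmas are gone. The helper column_sums is hypothetical and assumes a flat column-major buffer (the layout Armadillo uses); the actual forgeMat sum() signatures may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: per-column sums of an n_rows x n_cols matrix stored
// column-major in a flat buffer.
template <typename T>
std::vector<T> column_sums(const T* mem, std::size_t n_rows, std::size_t n_cols) {
    assert(mem != nullptr || n_rows * n_cols == 0);
    std::vector<T> out(n_cols, T(0));
    // Previously the kind of loop an OpenACC parallel loop would annotate;
    // now a plain outer loop.
    for (std::size_t c = 0; c < n_cols; ++c) {
        T acc = T(0);
        // Previously a reduction clause; a local accumulator gives the same
        // result when run sequentially.
        for (std::size_t r = 0; r < n_rows; ++r) {
            acc += mem[c * n_rows + r];
        }
        out[c] = acc;
    }
    return out;
}
```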

[ ] Step 7: Remove OpenACC from vectorise()
[ ] Step 8: Verify compilation

cmake --build build --target ForgeCommon -j4 2>&1 | head -50

[ ] Step 9: Commit

git add forge/Core/forgeMat.hpp
git commit -m "refactor: strip all OpenACC pragmas from forgeMat.hpp"

Task 6: Clean forgeSubview_Col.hpp and Test Files

Files:
- Modify: forge/Core/forgeSubview_Col.hpp
- Modify: forge/Tests/forgeColTests.cpp
- Modify: forge/Tests/forgeColCplxTests.cpp
- Modify: forge/Tests/forgeMatTests.cpp
- Modify: forge/Tests/MetalVectorOpsTests.cpp
- Modify: forge/Microbenchmarks/forgeCol/main.cpp

[ ] Step 1: Read forgeSubview_Col.hpp

This is a draft/stub file. Remove any #pragma acc directives.

[ ] Step 2: Clean test files — remove _OPENACC/OPENACC_GPU guards

In each test file, find and remove blocks like:

#ifdef _OPENACC
    // OpenACC-specific test setup
#endif

Or:

#ifdef OPENACC_GPU
    // ...
#endif

Keep #ifdef METAL_COMPUTE guards — those are still active.

[ ] Step 3: Clean Microbenchmarks

Remove _OPENACC guards from forge/Microbenchmarks/forgeCol/main.cpp.

[ ] Step 4: Verify compilation

cmake --build build --target cpu_tests -j4 2>&1 | head -50

[ ] Step 5: Commit

git add forge/Core/forgeSubview_Col.hpp forge/Tests/forgeColTests.cpp forge/Tests/forgeColCplxTests.cpp forge/Tests/forgeMatTests.cpp forge/Tests/MetalVectorOpsTests.cpp forge/Microbenchmarks/forgeCol/main.cpp
git commit -m "refactor: remove OpenACC guards from subview, tests, and microbenchmarks"

Task 7: Strip OpenACC from Gridding

Files:
- Modify: forge/Gridding/gridding.h
- Modify: forge/Gridding/gridding.cpp
- Modify: forge/Gridding/griddingSupport.h
- Modify: forge/Gridding/griddingSupport.cpp
- Modify: forge/Gridding/TimeSegmentation.cpp

[ ] Step 1: Read gridding.h

Simplify the 4-way #ifdef (lines ~54-81) from:

#ifdef OPENACC_GPU
    // GPU version
#elif _OPENACC
    // OpenACC without CUDA
#elif defined(METAL_COMPUTE)
    // Metal
#else
    // CPU
#endif

To:

#ifdef METAL_COMPUTE
    // Metal
#else
    // CPU
#endif

[ ] Step 2: Read and clean gridding.cpp

Remove all #pragma acc parallel loop, #pragma acc loop seq, #pragma acc loop independent, #pragma acc atomic update, and #pragma acc data directives. Leave the loops and their bodies intact.

Key locations:
- gridding_adjoint_2D(): ~6 pragmas (parallel loop, seq loops, atomic updates)
- gridding_adjoint_3D(): ~8 pragmas (similar structure)
- gridding_forward_2D(): ~4 pragmas
- gridding_forward_3D(): ~4 pragmas

Do NOT add OpenMP yet — that's Task 11. Just remove OpenACC, leaving plain single-threaded loops.

[ ] Step 3: Read and clean griddingSupport.h

Remove #ifdef _OPENACC include guard for openacc.h. Remove #pragma acc routine seq from function declarations (bessi0, kernel_value_LUT). These annotations told OpenACC the function could be called from device code — not relevant for CPU/OpenMP.

[ ] Step 4: Read and clean griddingSupport.cpp

Remove all 21 #pragma acc directives:
- #pragma acc routine seq on function definitions
- #pragma acc parallel loop collapse(N) on deinterleave functions
- #pragma acc parallel loop on normalize functions
- #pragma acc parallel loop on deapodization functions
- #pragma acc data regions

Leave all loops intact as single-threaded.

[ ] Step 5: Un-comment #pragma omp in TimeSegmentation.cpp

Find the commented-out #pragma omp parallel for (line ~143) and un-comment it:

#pragma omp parallel for

[ ] Step 6: Verify compilation

cmake --build build --target ForgeCommon -j4 2>&1 | head -50

[ ] Step 7: Commit

git add forge/Gridding/gridding.h forge/Gridding/gridding.cpp forge/Gridding/griddingSupport.h forge/Gridding/griddingSupport.cpp forge/Gridding/TimeSegmentation.cpp
git commit -m "refactor: strip OpenACC from gridding, griddingSupport, and TimeSegmentation"

Task 8: Strip OpenACC from FFT Files and Public Header

Files:
- Modify: forge/FFT/ftCpu.h
- Modify: forge/FFT/ftCpu.cpp
- Modify: forge/FFT/ftCpuWithGrads.h
- Modify: forge/FFT/ftCpuWithGrads.cpp
- Modify: forge/Core/forge.h
- Modify: forge/FFT/fftCPU.h (docstring cleanup)
- Modify: forge/FFT/fftAccelerate.h (docstring cleanup)
- Retire: forge/FFT/fftGPU.h (do not delete yet; just confirm it's excluded from build)
- Retire: forge/FFT/fftGPU.cpp (excluded from build in Task 2)

[ ] Step 1: Read and clean ftCpu.h

Remove #ifdef _OPENACC block that includes openacc.h.

[ ] Step 2: Read and clean ftCpu.cpp

Remove #pragma acc kernels, #pragma acc loop independent gang, #pragma acc loop vector(128), and any copyin/copyout clauses. Leave loops as single-threaded.

Both ftCpu() and iftCpu() functions have similar OpenACC structures — clean both.

[ ] Step 3: Read and clean ftCpuWithGrads.h

Remove #ifdef _OPENACC include guard. This file already uses #pragma omp — leave those intact.

[ ] Step 4: Read and clean ftCpuWithGrads.cpp

This file has 6 #pragma acc directives (lines ~107, 112, 123, 175, 179, 196). Remove all of them. The DFT loops are structurally identical to ftCpu.cpp. Leave existing #pragma omp directives intact.

[ ] Step 5: Remove fftGPU.h include from forge.h

In forge/Core/forge.h (line ~33), remove:

#include "FFT/fftGPU.h"

This is an unconditional include of a file being retired. Without this fix, every translation unit that includes forge.h will fail to compile after fftGPU.h is deleted.

[ ] Step 6: Clean up @see fftGPU.h docstring references

In forge/FFT/fftCPU.h and forge/FFT/fftAccelerate.h, remove @see fftGPU.h documentation references.

[ ] Step 7: Verify fftGPU files are excluded from build

Confirm fftGPU.cpp was removed from forge/CMakeLists.txt in Task 2. The files still exist on disk but are no longer compiled.

[ ] Step 8: Verify compilation

cmake --build build --target ForgeCommon -j4 2>&1 | head -50

[ ] Step 9: Commit

git add forge/FFT/ftCpu.h forge/FFT/ftCpu.cpp forge/FFT/ftCpuWithGrads.h forge/FFT/ftCpuWithGrads.cpp forge/Core/forge.h forge/FFT/fftCPU.h forge/FFT/fftAccelerate.h
git commit -m "refactor: strip OpenACC from FFT files, remove fftGPU.h include from forge.h"

Task 9: Strip OpenACC from All Operators

Files:
- Modify: forge/Operators/Gfft.h, forge/Operators/Gfft.cpp
- Modify: forge/Operators/Gnufft.h, forge/Operators/Gnufft.cpp
- Modify: forge/Operators/Gdft.h, forge/Operators/Gdft.cpp
- Modify: forge/Operators/GdftR2.h, forge/Operators/GdftR2.cpp
- Modify: forge/Operators/SENSE.h, forge/Operators/SENSE.cpp
- Modify: forge/Operators/pcSENSE.h, forge/Operators/pcSENSE.cpp
- Modify: forge/Operators/pcSenseTimeSeg.h, forge/Operators/pcSenseTimeSeg.cpp

[ ] Step 1: Read and clean Gfft.h

Simplify the 4-way #ifdef to 2-way:

#ifdef METAL_COMPUTE
    // Metal includes
#else
    // CPU includes (FFTW)
#endif

Remove both the OPENACC_GPU and OPENACC_MP branches — both include fftGPU.h which is being retired. Also remove OPENACC_GPU members (stream, cuFFT plan handle).

[ ] Step 2: Read and clean Gfft.cpp

Remove #ifdef OPENACC_GPU blocks that:
- Create CUDA streams
- Create cuFFT plans
- Use #pragma acc data regions
- Call fftGPU functions

The CPU fallback path (using fftCPU or fftAccelerate) becomes the only non-Metal path.

[ ] Step 3: Read and clean Gnufft.h and Gnufft.cpp

In Gnufft.h:
- Remove OPENACC_GPU members (cuFFT plan handle, etc.)
- Keep Metal members (pipelineCtx, metalCtx)

In Gnufft.cpp:
- Remove #ifdef OPENACC_GPU block in constructor (cuFFT plan setup, lines ~86-99)
- Remove #pragma acc enter data in constructor (line ~116-118)
- In operator* and operator/: simplify from 3-way dispatch (Metal pipeline / Metal gridding / OpenACC) to 2-way (Metal / CPU)
- Remove any #pragma acc in the CPU computation path

[ ] Step 4: Read and clean Gdft.h and Gdft.cpp

Same pattern — remove OPENACC_GPU blocks, keep Metal blocks, simplify dispatch.

[ ] Step 5: Read and clean GdftR2.h and GdftR2.cpp

Same pattern.

[ ] Step 6: Read and clean SENSE.h, SENSE.cpp, pcSENSE.h, pcSENSE.cpp, pcSenseTimeSeg.h, pcSenseTimeSeg.cpp

These higher-level operators may have fewer or no direct OpenACC pragmas — they delegate to Gnufft/Gdft. But check for any #ifdef OPENACC_GPU guards or #pragma acc data regions.

[ ] Step 7: Verify compilation

cmake --build build --target ForgeCommon -j4 2>&1 | head -50

[ ] Step 8: Commit

git add forge/Operators/
git commit -m "refactor: strip all OpenACC from operators (Gfft, Gnufft, Gdft, GdftR2, SENSE, pcSENSE, pcSenseTimeSeg)"

Task 10: Retire Legacy Files and Full Build + Test

Files:
- Retire: forge/FFT/fftGPU.h, forge/FFT/fftGPU.cpp
- Retire: forge/my_sincosf.cuh

[ ] Step 1: Delete retired files

git rm forge/FFT/fftGPU.h forge/FFT/fftGPU.cpp

If forge/my_sincosf.cuh exists and is not referenced anywhere:

git rm forge/my_sincosf.cuh

[ ] Step 2: Search for any remaining OpenACC references

# Search for any remaining #pragma acc
grep -rn "#pragma acc" forge/ --include="*.cpp" --include="*.hpp" --include="*.h"

# Search for any remaining _OPENACC or OPENACC_GPU
grep -rn "_OPENACC\|OPENACC_GPU\|OPENACC_MP" forge/ --include="*.cpp" --include="*.hpp" --include="*.h" --include="*.cmake" --include="CMakeLists.txt"
grep -rn "_OPENACC\|OPENACC_GPU\|OPENACC_MP" CMakeLists.txt

Expected: No results (or only in files already retired/excluded from build). Fix any stragglers.

[ ] Step 3: Full clean build (Metal backend)

rm -rf build
cmake -B build -S . -DMETAL_COMPUTE=ON
cmake --build build -j4

Expected: Clean build with zero errors and zero OpenACC-related warnings.

[ ] Step 4: Run all fast+medium tests

./build/cpu_tests '~[Benchmark]'

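Expected: All tests pass. This is the correctness baseline before Task 11 adds OpenMP to the hot loops. The #pragma omp directives that already exist in ftCpuWithGrads.cpp and TimeSegmentation.cpp stay active, so set OMP_NUM_THREADS=1 if a strictly single-threaded run is needed.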

[ ] Step 5: Run Metal tests

./build/metal_tests '~[Benchmark]'

Expected: All Metal tests pass. Metal backend is unchanged.

[ ] Step 6: Full clean build (CPU-only backend)

rm -rf build_cpu
cmake -B build_cpu -S . -DMETAL_COMPUTE=OFF
cmake --build build_cpu -j4

Expected: Clean build on CPU-only path (no Metal, no CUDA, no OpenACC).

[ ] Step 7: Run CPU-only tests

./build_cpu/cpu_tests '~[Benchmark]'

Expected: All CPU-agnostic tests pass. Metal-specific tests are excluded by #ifdef.

[ ] Step 8: Commit

git add -A
git commit -m "refactor: retire fftGPU and my_sincosf.cuh, verify clean build on Metal and CPU"

Task 11: Add OpenMP Parallelism to Hot Loops

Files:
- Modify: forge/Gridding/gridding.cpp
- Modify: forge/Gridding/griddingSupport.cpp
- Modify: forge/FFT/ftCpu.cpp
- Modify: forge/FFT/ftCpuWithGrads.cpp
- Modify: forge/Core/forgeCol.hpp
- Modify: forge/Core/forgeMat.hpp

Now that single-threaded correctness is verified, layer in OpenMP.

[ ] Step 1: Add OpenMP to gridding_adjoint_2D/3D in gridding.cpp

For the outer loop of gridding_adjoint_2D, add:

#pragma omp parallel for schedule(dynamic)

For the atomic grid accumulation inside the loop, add:

#pragma omp atomic update

Same pattern for gridding_adjoint_3D. The #pragma omp atomic replaces the old #pragma acc atomic — same semantics.

For gridding_forward_2D/3D, the gather operation is embarrassingly parallel with no atomics:

#pragma omp parallel for schedule(static)
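The difference between the two directions can be sketched with a deliberately simplified nearest-cell model. The function names, the 1D grid, and the split real/imaginary storage are assumptions made for this example and do not reflect the actual gridding.cpp signatures. Storing the parts separately matters because omp atomic applies to scalar updates only. Cell indices are assumed to be within the grid.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Adjoint (scatter): each sample adds its value to one grid cell. Several
// samples may target the same cell, so each accumulation is atomic.
void scatter_nearest(const std::vector<float>& s_re, const std::vector<float>& s_im,
                     const std::vector<int>& cell,
                     std::vector<float>& g_re, std::vector<float>& g_im) {
    assert(s_re.size() == s_im.size() && s_re.size() == cell.size());
    assert(g_re.size() == g_im.size());
    float* re = g_re.data();
    float* im = g_im.data();
    const long n = static_cast<long>(cell.size());
    #pragma omp parallel for schedule(dynamic)
    for (long i = 0; i < n; ++i) {
        const int c = cell[i];
        #pragma omp atomic update
        re[c] += s_re[i];
        #pragma omp atomic update
        im[c] += s_im[i];
    }
}

// Forward (gather): each sample reads one grid cell and writes only its own
// output slot, so no atomics are needed.
void gather_nearest(const std::vector<float>& g_re, const std::vector<float>& g_im,
                    const std::vector<int>& cell,
                    std::vector<float>& o_re, std::vector<float>& o_im) {
    assert(g_re.size() == g_im.size());
    assert(o_re.size() == cell.size() && o_im.size() == cell.size());
    const long n = static_cast<long>(cell.size());
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; ++i) {
        o_re[i] = g_re[cell[i]];
        o_im[i] = g_im[cell[i]];
    }
}
```

When compiled without OpenMP enabled, the pragmas are ignored and both functions run sequentially with the same results.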

[ ] Step 2: Add OpenMP to griddingSupport.cpp

For deinterleave_data2d:

#pragma omp parallel for collapse(2)

For deinterleave_data3d:

#pragma omp parallel for collapse(3)

For normalize_fft2d, normalize_fft3d:

#pragma omp parallel for

For deapodization functions:

#pragma omp parallel for

Follow the same loop structure the old OpenACC pragmas annotated — they identified which loops are safe to parallelize.
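A minimal sketch of the collapse(2) pattern, assuming an interleaved (re, im, re, im, ...) input layout. The function deinterleave_2d is a hypothetical stand-in and is not the signature of deinterleave_data2d.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical 2D deinterleave: split an interleaved buffer of 2 * nx * ny
// floats into separate real and imaginary planes of nx * ny floats each.
void deinterleave_2d(const float* src, float* re, float* im, int nx, int ny) {
    assert(src != nullptr && re != nullptr && im != nullptr);
    assert(nx >= 0 && ny >= 0);
    // The loops are perfectly nested and every (y, x) pair writes a distinct
    // element, so collapse(2) can distribute all nx * ny iterations at once.
    #pragma omp parallel for collapse(2)
    for (int y = 0; y < ny; ++y) {
        for (int x = 0; x < nx; ++x) {
            const std::size_t idx = static_cast<std::size_t>(y) * nx + x;
            re[idx] = src[2 * idx];
            im[idx] = src[2 * idx + 1];
        }
    }
}
```

collapse requires that no statements sit between the two for headers, which is why the index computation lives inside the inner loop body.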

[ ] Step 3: Add OpenMP to ftCpu.cpp

For ftCpu() and iftCpu(), add:

#pragma omp parallel for

to the outer loop over k-space samples. The inner loop (over spatial positions) is a dot product — keep it sequential within each thread.
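The loop structure described above can be sketched with a simplified 1D direct transform. The function direct_ft, its arguments, and the sign and scaling conventions are assumptions for illustration; the real ftCpu() works on different data and may use other conventions.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <vector>

// Hypothetical 1D direct transform:
//   out[j] = sum_n img[n] * exp(-2*pi*i * k[j] * x[n])
std::vector<std::complex<double>> direct_ft(const std::vector<std::complex<double>>& img,
                                            const std::vector<double>& x,
                                            const std::vector<double>& k) {
    assert(img.size() == x.size());
    const double two_pi = 2.0 * std::acos(-1.0);
    const long n_k = static_cast<long>(k.size());
    const long n_x = static_cast<long>(x.size());
    std::vector<std::complex<double>> out(k.size());
    // Outer loop over k-space samples: each iteration writes only out[j],
    // so iterations are independent.
    #pragma omp parallel for
    for (long j = 0; j < n_k; ++j) {
        std::complex<double> acc(0.0, 0.0);
        // Inner loop over spatial positions: a sequential accumulation
        // owned by a single thread.
        for (long n = 0; n < n_x; ++n) {
            const double phase = -two_pi * k[j] * x[n];
            acc += img[n] * std::complex<double>(std::cos(phase), std::sin(phase));
        }
        out[j] = acc;
    }
    return out;
}
```

Because each thread keeps its own accumulator, no reduction clause or atomic is needed.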

[ ] Step 3b: Add OpenMP to ftCpuWithGrads.cpp

Same pattern as ftCpu.cpp — the DFT loops are structurally identical (with additional R2* exponential terms). Add #pragma omp parallel for to the outer loops of both ftCpuWithGrads() and iftCpuWithGrads().

[ ] Step 4: Add OpenMP to forgeCol::zeros() and forgeCol::ones()

void zeros() {
    #pragma omp parallel for schedule(static)
    for (uword i = 0; i < n_elem; i++) {
        mem[i] = T(0);
    }
}

Same for ones(). These are large arrays (millions of elements for 3D images).

[ ] Step 5: Add OpenMP to forgeMat::zeros() and forgeMat::ones()

Same pattern as forgeCol.

[ ] Step 6: Build and test with OpenMP

cmake --build build_cpu -j4
./build_cpu/cpu_tests '~[Benchmark]'

Expected: All tests pass. Results should match single-threaded within floating-point tolerance.
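Bitwise equality is not a reasonable expectation here, because atomic accumulation across threads changes the order of floating-point additions. One way to express "within floating-point tolerance" is a relative L2 error, sketched below. The helper relative_l2_error is hypothetical and is not part of the existing test suite.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Hypothetical helper: relative L2 error ||a - ref|| / ||ref||, accumulated
// in double precision regardless of the input precision.
template <typename T>
double relative_l2_error(const std::vector<std::complex<T>>& a,
                         const std::vector<std::complex<T>>& ref) {
    assert(a.size() == ref.size());
    double diff2 = 0.0;
    double ref2 = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const std::complex<double> ai(a[i]);
        const std::complex<double> ri(ref[i]);
        diff2 += std::norm(ai - ri);  // std::norm is the squared magnitude
        ref2 += std::norm(ri);
    }
    if (ref2 == 0.0) {
        return std::sqrt(diff2);      // fall back to the absolute error
    }
    return std::sqrt(diff2 / ref2);
}
```

A test could then compare a multi-threaded result against the single-threaded baseline and require the error to stay below a threshold chosen for the precision in use.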

[ ] Step 7: Quick performance sanity check

Run one of the medium-speed tests and compare wall time:

# Single-threaded
time OMP_NUM_THREADS=1 ./build_cpu/cpu_tests "[Gnufft adjoint]"

# All cores
time ./build_cpu/cpu_tests "[Gnufft adjoint]"

Expected: Multi-threaded is noticeably faster. Exact speedup depends on the test.

[ ] Step 8: Commit

git add forge/Gridding/gridding.cpp forge/Gridding/griddingSupport.cpp forge/FFT/ftCpu.cpp forge/FFT/ftCpuWithGrads.cpp forge/Core/forgeCol.hpp forge/Core/forgeMat.hpp
git commit -m "feat: add OpenMP parallelism to gridding, griddingSupport, ftCpu, ftCpuWithGrads, forgeCol, and forgeMat"

Task 12: Create BackendDispatch.hpp

Files:
- Create: forge/Core/BackendDispatch.hpp

[ ] Step 1: Create the dispatch header

#pragma once

// BackendDispatch.hpp: compile-time backend resolution
//
// Maps GPU memory management and vector algebra to the active backend.
// CPU backend: GPU ops are no-ops or static_asserts.
// Metal backend: delegates to MetalVectorOps_dispatch.hpp
// CUDA backend (future): delegates to CudaVectorOps_dispatch.hpp

#include <cstddef>
#include "forgeComplex.hpp"

// ============================================================
// Backend headers (GPU memory management and compute dispatch)
// ============================================================
// Included at file scope, before namespace forge::backend opens, so that
// their declarations are not nested inside that namespace.
// On CPU-only builds, GPU memory ops are no-ops or errors.
// On Metal, memory is unified (page-aligned alloc gives GPU access).
// On CUDA (future), these map to cudaMalloc/cudaMemcpy/cudaFree.

#if defined(CUDA_COMPUTE)
    // Future: #include "CUDA/CudaVectorOps_dispatch.hpp"
#elif defined(METAL_COMPUTE)
    // Metal uses unified memory, so no explicit GPU malloc/copy is needed.
    // MetalVectorOps_dispatch.hpp handles compute dispatch.
    #include "Metal/MetalVectorOps_dispatch.hpp"
#else
    // CPU-only: no GPU memory operations
#endif

namespace forge::backend {

// ============================================================
// Compile-time backend query
// ============================================================

constexpr bool has_gpu() {
#if defined(METAL_COMPUTE) || defined(CUDA_COMPUTE)
    return true;
#else
    return false;
#endif
}

} // namespace forge::backend

[ ] Step 2: Verify it compiles when included

Add a temporary #include "BackendDispatch.hpp" to one source file, build, and remove.
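Beyond checking that the header compiles, a caller could branch on the compile-time query with if constexpr. The sketch below is self-contained and lives in a hypothetical sketch namespace: has_gpu() repeats the query from the header, while backend_name(), needs_explicit_copy(), and bytes_to_upload() are illustrative additions and not part of the planned header.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

namespace sketch {  // hypothetical namespace for this example

// Same query as the planned forge::backend::has_gpu().
constexpr bool has_gpu() {
#if defined(METAL_COMPUTE) || defined(CUDA_COMPUTE)
    return true;
#else
    return false;
#endif
}

// Illustrative: name of the backend selected at compile time.
constexpr std::string_view backend_name() {
#if defined(CUDA_COMPUTE)
    return "cuda";
#elif defined(METAL_COMPUTE)
    return "metal";
#else
    return "cpu";
#endif
}

// True only for a backend with a separate device address space. Metal uses
// unified memory and the CPU build has no device, so both report false.
constexpr bool needs_explicit_copy() {
#if defined(CUDA_COMPUTE)
    return true;
#else
    return false;
#endif
}

// Illustrative hook: bytes an explicit host-to-device copy would move.
// On CPU and Metal builds the copy branch is discarded at compile time.
template <typename T>
std::size_t bytes_to_upload(const T* data, std::size_t n) {
    assert(data != nullptr || n == 0);
    (void)data;
    if constexpr (needs_explicit_copy()) {
        return n * sizeof(T);  // placeholder for a real backend transfer
    } else {
        return 0;              // no-op on CPU and unified-memory builds
    }
}

}  // namespace sketch
```

With neither METAL_COMPUTE nor CUDA_COMPUTE defined, has_gpu() is false, backend_name() is "cpu", and bytes_to_upload() returns 0.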

[ ] Step 3: Commit

git add forge/Core/BackendDispatch.hpp
git commit -m "feat: add BackendDispatch.hpp compile-time backend resolution layer"

Task 13: Refactor Operators to Use Dispatch Layer

Files:
- Modify: forge/Operators/Gnufft.cpp
- Modify: forge/Operators/Gdft.cpp
- Modify: forge/Operators/GdftR2.cpp
- Modify: forge/Operators/Gfft.cpp

This task replaces the remaining #ifdef METAL_COMPUTE blocks in operators with dispatch-layer patterns. The goal is that operators have clean code without scattered #ifdef blocks — just if (pipelineCtx) runtime checks against the backend context.

[ ] Step 1: Evaluate current Metal #ifdef blocks in Gnufft.cpp

After Task 9, the #ifdef blocks should be just Metal vs CPU. Evaluate whether the dispatch layer abstraction adds value here, or if the 2-way #ifdef METAL_COMPUTE is already clean enough.

Pragmatic decision: If the operator code is already clean with just #ifdef METAL_COMPUTE / #else guards, don't force it through the dispatch layer. The dispatch layer's main value is for GPU memory management (forgeCol/forgeMat) and vector algebra — not for pipeline-level operations which are inherently backend-specific.

[ ] Step 2: If refactoring, update the operator dispatch pattern

If the #ifdef blocks are complex, refactor to use dispatch. If they're simple 2-way switches, leave them and document the pattern for the CUDA backend to follow.
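For reference, the runtime-check pattern mentioned at the start of this task might look like the sketch below. PipelineCtx and ScaleOp are hypothetical stand-ins; the real operators and their context types are more involved, and the accelerated branch here simply reuses the CPU code so the example runs anywhere.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a backend pipeline context.
struct PipelineCtx {
    int dispatch_count = 0;  // how many times the accelerated path was taken
};

// Hypothetical operator that scales a vector and picks its path at run time.
class ScaleOp {
public:
    explicit ScaleOp(float factor, PipelineCtx* ctx = nullptr)
        : factor_(factor), pipelineCtx_(ctx) {}

    std::vector<float> operator*(const std::vector<float>& x) const {
        std::vector<float> y(x.size());
        if (pipelineCtx_) {
            // Accelerated path: a real backend would dispatch a kernel here.
            ++pipelineCtx_->dispatch_count;
            scale_cpu(x, y);  // placeholder computation for this sketch
        } else {
            scale_cpu(x, y);  // plain CPU fallback
        }
        return y;
    }

private:
    void scale_cpu(const std::vector<float>& x, std::vector<float>& y) const {
        assert(x.size() == y.size());
        for (std::size_t i = 0; i < x.size(); ++i) {
            y[i] = factor_ * x[i];
        }
    }

    float factor_;
    PipelineCtx* pipelineCtx_;  // null on CPU-only builds or when no device is set up
};
```

The operator body contains no preprocessor branches. Whether the context pointer is ever non-null would still be decided where the operator is constructed.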

[ ] Step 3: Verify all tests still pass

cmake --build build -j4
./build/cpu_tests '~[Benchmark]'
./build/metal_tests '~[Benchmark]'

[ ] Step 4: Commit

git add forge/Operators/ forge/Core/BackendDispatch.hpp
git commit -m "refactor: operators use BackendDispatch where beneficial, document pattern for CUDA"

Task 14: Final Verification and Documentation

Files:
- Modify: CLAUDE.md (update build instructions)

[ ] Step 1: Full clean build — Metal backend

rm -rf build
cmake -B build -S . -DMETAL_COMPUTE=ON
cmake --build build -j4

[ ] Step 2: Run complete Metal test suite

./build/metal_tests '~[Benchmark]'
./build/cpu_tests '~[Benchmark]'

Expected: All pass.

[ ] Step 3: Full clean build — CPU-only backend

rm -rf build_cpu
cmake -B build_cpu -S .
cmake --build build_cpu -j4

Note: no -DMETAL_COMPUTE and no -DOPENACC_GPU — this is the pure CPU+OpenMP build.

[ ] Step 4: Run CPU test suite

./build_cpu/cpu_tests '~[Benchmark]'

Expected: All CPU-agnostic tests pass.

[ ] Step 5: OpenMP thread scaling check

OMP_NUM_THREADS=1 ./build_cpu/cpu_tests "[Gnufft adjoint]" --benchmark
OMP_NUM_THREADS=4 ./build_cpu/cpu_tests "[Gnufft adjoint]" --benchmark

Confirm OpenMP provides speedup.

[ ] Step 6: Update CLAUDE.md build instructions

Update the build instructions to reflect new flags:
- Remove references to OPENACC_GPU and OPENACC_MP
- Add CUDA_COMPUTE (coming soon)
- Add BUILD_SHARED_LIBS
- Note that OpenMP is now required for all builds (Apple Clang needs brew install libomp)

[ ] Step 7: Update CLAUDE.md Key CMake Options table

Flag	Effect
`-DMETAL_COMPUTE=ON`	Enable Apple Metal GPU backend
`-DCUDA_COMPUTE=ON`	Enable NVIDIA CUDA GPU backend (future)
`-DBUILD_SHARED_LIBS=OFF`	Build static libraries (for forge-studio)
`-DENABLE_DOUBLE_PRECISION=ON`	Switch to double precision
`-DBUILD_FORGEVIEW=ON`	Build the forgeview TUI viewer (default ON)

[ ] Step 8: Final commit

git add CLAUDE.md
git commit -m "docs: update build instructions for OpenMP/CUDA_COMPUTE, remove OpenACC references"

Summary of Commits

Task	Commit Message
1	`build: remove OpenACC/OPENACC_MP options, add OpenMP and CUDA_COMPUTE stubs`
2	`build: remove fftGPU from sources, use BUILD_SHARED_LIBS, link OpenMP`
3	`refactor: remove OpenACC guards from ForgeIncludes.h and Tracer.hpp`
4	`refactor: strip all OpenACC pragmas and isCopy flag from forgeCol.hpp`
5	`refactor: strip all OpenACC pragmas from forgeMat.hpp`
6	`refactor: remove OpenACC guards from subview, tests, and microbenchmarks`
7	`refactor: strip OpenACC from gridding, griddingSupport, and TimeSegmentation`
8	`refactor: strip OpenACC from FFT files, remove fftGPU.h include from forge.h`
9	`refactor: strip all OpenACC from operators`
10	`refactor: retire fftGPU and my_sincosf.cuh, verify clean build on Metal and CPU`
11	`feat: add OpenMP parallelism to gridding, griddingSupport, ftCpu, ftCpuWithGrads, forgeCol, and forgeMat`
12	`feat: add BackendDispatch.hpp compile-time backend resolution layer`
13	`refactor: operators use BackendDispatch where beneficial`
14	`docs: update build instructions for OpenMP/CUDA_COMPUTE, remove OpenACC references`