CPU Backend + OpenMP Implementation Plan

For agentic workers: REQUIRED: Use superpowers:subagent-driven-development (if subagents available) or superpowers:executing-plans to implement this plan. Steps use checkbox (- [ ]) syntax for tracking.

Goal: Strip all OpenACC code, establish a clean CPU backend with OpenMP parallelism, and introduce a compile-time backend dispatch layer that unifies Metal and future CUDA paths.

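Architecture: Three phases: (1) remove OpenACC pragmas and guards to get a clean single-threaded CPU build, (2) introduce BackendDispatch.hpp to replace scattered #ifdef blocks in operators, (3) add #pragma omp parallel for to hot loops for CPU parallelism. The tasks below do not follow this numbering strictly: Tasks 1-10 cover phase 1, Task 11 covers phase 3, and Tasks 12-13 cover phase 2.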

Tech Stack: C++17, CMake 3.17+, OpenMP, FFTW3, Armadillo, Catch2

Spec: docs/superpowers/specs/2026-03-17-multi-backend-acceleration-design.md

Subsequent plans (not covered here):

  • Phase 4: CUDA backend (forge/CUDA/ mirroring Metal)
  • Phase 5: CI pipeline (GitHub Actions, Docker, tiered testing)


File Structure

Files to Create

File                            Responsibility
forge/Core/BackendDispatch.hpp  Compile-time dispatch: maps GPU memory ops and vector algebra to Metal, CUDA (future), or CPU no-ops

Files to Modify (by component)

Build system:

  • CMakeLists.txt: remove OPENACC_GPU/OPENACC_MP options, add OpenMP for all builds, add BUILD_SHARED_LIBS
  • forge/CMakeLists.txt: remove fftGPU.cpp from sources, change SHARED to default

Core data structures:

  • forge/Core/forgeCol.hpp: remove ~50 #pragma acc directives, remove isCopy flag, keep Metal allocation path
  • forge/Core/forgeMat.hpp: remove ~40 #pragma acc directives
  • forge/Core/forgeSubview_Col.hpp: remove draft #pragma acc directives
  • forge/Core/ForgeIncludes.h: remove #ifdef OPENACC_GPU block (CUDA profiler includes)
  • forge/Core/Tracer.hpp: change USE_NVTX guard to also trigger on CUDA_COMPUTE

Gridding:

  • forge/Gridding/gridding.h: simplify 4-way #ifdef to 2-way (Metal vs CPU)
  • forge/Gridding/gridding.cpp: replace #pragma acc with #pragma omp
  • forge/Gridding/griddingSupport.h: remove _OPENACC guards, keep #pragma acc routine seq as comments for CUDA reference
  • forge/Gridding/griddingSupport.cpp: replace 21 #pragma acc with #pragma omp

FFT:

  • forge/FFT/ftCpu.h: remove _OPENACC include guard
  • forge/FFT/ftCpu.cpp: replace #pragma acc kernels with #pragma omp parallel for
  • forge/FFT/ftCpuWithGrads.h: remove _OPENACC include guard
  • forge/FFT/ftCpuWithGrads.cpp: replace 6 #pragma acc directives with #pragma omp

Public header:

  • forge/Core/forge.h: remove unconditional #include "FFT/fftGPU.h" (file is being retired)
  • forge/FFT/fftCPU.h: remove @see fftGPU.h docstring reference
  • forge/FFT/fftAccelerate.h: remove @see fftGPU.h docstring reference

Operators:

  • forge/Operators/Gfft.h: simplify 4-way #ifdef to 2-way
  • forge/Operators/Gfft.cpp: remove OPENACC_GPU cuFFT plan code
  • forge/Operators/Gnufft.h, Gnufft.cpp: remove OPENACC_GPU blocks, keep Metal
  • forge/Operators/Gdft.h, Gdft.cpp: remove OPENACC_GPU blocks
  • forge/Operators/GdftR2.h, GdftR2.cpp: remove OPENACC_GPU blocks
  • forge/Operators/SENSE.h, SENSE.cpp: remove OpenACC guards if present
  • forge/Operators/pcSENSE.h, pcSENSE.cpp: same
  • forge/Operators/pcSenseTimeSeg.h, pcSenseTimeSeg.cpp: same

Tests:

  • forge/Tests/forgeColTests.cpp: remove _OPENACC/OPENACC_GPU guards
  • forge/Tests/forgeColCplxTests.cpp: same
  • forge/Tests/forgeMatTests.cpp: same
  • forge/Tests/MetalVectorOpsTests.cpp: same

Files to Retire

  • forge/FFT/fftGPU.h — cuFFT wrappers, entirely #ifdef OPENACC_GPU
  • forge/FFT/fftGPU.cpp — same
  • forge/my_sincosf.cuh — legacy CUDA header

Task 1: Update Root CMakeLists.txt — Remove OpenACC, Add OpenMP

Files: - Modify: CMakeLists.txt

This is the foundation — get the build system right first.

  • [ ] Step 1: Read the current CMakeLists.txt to identify all OpenACC sections

Read CMakeLists.txt in full. Identify:

  • option(OPENACC_GPU ...) and option(OPENACC_MP ...) (lines ~52-53)
  • The mutual exclusion check (lines ~56-58)
  • PGI/NVHPC compiler flags (lines ~243-254)
  • OPENACC_GPU conditionals for CUDA toolkit (lines ~83-110)
  • USE_NVTX definition (line ~260)

  • [ ] Step 2: Remove OPENACC_GPU and OPENACC_MP options and guards

Remove:

option(OPENACC_GPU "Use OpenACC to run on GPU" OFF)
option(OPENACC_MP "Use OpenACC to run on CPU multicore" OFF)

Remove the mutual exclusion check between OPENACC_GPU and OPENACC_MP.

Remove all if(OPENACC_GPU) blocks that:

  • Find CUDA toolkit
  • Set cuFFT link flags
  • Create cuFFT plans
  • Set PGI compiler flags (-cuda -acc -gpu=cuda12.6,cc60-cc89)

Remove the if(OPENACC_MP) block with -Mnollvm -O1 -acc -ta=multicore flags.

  • [ ] Step 3: Add OpenMP for all builds

Add this alongside the backend options, after the Metal/CUDA mutual exclusion check that Step 5 introduces:

# OpenMP is available for ALL builds (including Metal) for CPU fallback paths
find_package(OpenMP REQUIRED)

Link OpenMP to ForgeCommon (will be done in forge/CMakeLists.txt).

  • [ ] Step 4: Update USE_NVTX to trigger on CUDA_COMPUTE

Change the USE_NVTX definition from:

if(OPENACC_GPU)
    add_definitions(-DUSE_NVTX)
endif()
To:
if(CUDA_COMPUTE)
    add_definitions(-DUSE_NVTX)
endif()

  • [ ] Step 5: Add BUILD_SHARED_LIBS and CUDA_COMPUTE options (stubs for future)

Add:

option(CUDA_COMPUTE "Enable NVIDIA CUDA GPU backend" OFF)
option(BUILD_SHARED_LIBS "Build shared or static libraries" ON)

if(METAL_COMPUTE AND CUDA_COMPUTE)
    message(FATAL_ERROR "Cannot enable both METAL_COMPUTE and CUDA_COMPUTE")
endif()

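Add a stub if(CUDA_COMPUTE) block. Note that CMAKE_CUDA_ARCHITECTURES was introduced in CMake 3.18, so on the 3.17 baseline listed in the tech stack the variable is set but has no effect: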

if(CUDA_COMPUTE)
    enable_language(CUDA)
    find_package(CUDAToolkit REQUIRED)
    set(CMAKE_CUDA_ARCHITECTURES "75;80;86;89;90")
endif()

  • [ ] Step 6: Configure and verify the build still works

Run:

cmake -B build -S . -DMETAL_COMPUTE=ON

Expected: Configure succeeds. If it fails on missing OpenACC references, fix them.

  • [ ] Step 7: Commit
git add CMakeLists.txt
git commit -m "build: remove OpenACC/OPENACC_MP options, add OpenMP and CUDA_COMPUTE stubs"

Task 2: Update forge/CMakeLists.txt — Remove fftGPU, Fix SHARED

Files: - Modify: forge/CMakeLists.txt

  • [ ] Step 1: Read forge/CMakeLists.txt

Identify:

  • fftGPU.cpp in the source list (line ~11)
  • add_library(ForgeCommon SHARED ...) (line ~46)
  • Any OPENACC_GPU conditionals

  • [ ] Step 2: Remove fftGPU.cpp from source list

Remove the line:

${PROJECT_SOURCE_DIR}/forge/FFT/fftGPU.cpp

  • [ ] Step 3: Change SHARED to default

Change:

add_library(ForgeCommon SHARED ${PG_FILES})
To:
add_library(ForgeCommon ${PG_FILES})

This lets BUILD_SHARED_LIBS control shared vs static.

  • [ ] Step 4: Link OpenMP to ForgeCommon

Add after the target_link_libraries section:

if(OpenMP_CXX_FOUND)
    target_link_libraries(ForgeCommon PUBLIC OpenMP::OpenMP_CXX)
endif()

  • [ ] Step 5: Build to verify
cmake -B build -S . -DMETAL_COMPUTE=ON && cmake --build build --target ForgeCommon -j4

Expected: The build fails at this point because source files still reference fftGPU.h or OpenACC headers. Subsequent tasks fix those references.

  • [ ] Step 6: Commit
git add forge/CMakeLists.txt
git commit -m "build: remove fftGPU from sources, use BUILD_SHARED_LIBS, link OpenMP"

Task 3: Clean ForgeIncludes.h and Tracer.hpp

Files: - Modify: forge/Core/ForgeIncludes.h - Modify: forge/Core/Tracer.hpp

  • [ ] Step 1: Read ForgeIncludes.h

Look for the #ifdef OPENACC_GPU block (lines ~47-50) that includes CUDA profiler headers.

  • [ ] Step 2: Remove the OPENACC_GPU block from ForgeIncludes.h

Remove:

#ifdef OPENACC_GPU
#include <cuda_profiler_api.h>
#include <nvToolsExt.h>
#endif

These CUDA profiler includes will be handled by the CUDA backend directly when it's implemented.

  • [ ] Step 3: Read Tracer.hpp

Identify the USE_NVTX guard structure.

  • [ ] Step 4: Update Tracer.hpp — USE_NVTX now triggered by CUDA_COMPUTE

The #if defined(USE_NVTX) guard in Tracer.hpp is fine as-is — the CMakeLists.txt change (Task 1, Step 4) already ensures USE_NVTX is defined when CUDA_COMPUTE=ON. No code change needed in Tracer.hpp unless there are direct OPENACC_GPU references.
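To illustrate why a USE_NVTX guard needs no edits when the macro's trigger moves to CUDA_COMPUTE, here is a minimal sketch of a guarded RAII range marker. The class name ScopedRangeSketch and its depth() counter are hypothetical and are not taken from Tracer.hpp; the only real API calls are nvtxRangePushA and nvtxRangePop from the NVTX header.

```cpp
#include <cassert>
#if defined(USE_NVTX)
#include <nvToolsExt.h>  // NVTX header shipped with the CUDA toolkit
#endif

// Hypothetical RAII range marker; Tracer.hpp's actual class may look different.
class ScopedRangeSketch {
public:
    explicit ScopedRangeSketch(const char* name) {
        assert(name != nullptr);
        ++depth();
#if defined(USE_NVTX)
        nvtxRangePushA(name);  // opens a named range in the profiler timeline
#else
        (void)name;            // CPU and Metal builds: no profiler call
#endif
    }

    ~ScopedRangeSketch() {
#if defined(USE_NVTX)
        nvtxRangePop();        // closes the most recently opened range
#endif
        --depth();
    }

    ScopedRangeSketch(const ScopedRangeSketch&) = delete;
    ScopedRangeSketch& operator=(const ScopedRangeSketch&) = delete;

    // Nesting depth, kept only so the sketch has observable behaviour.
    static int& depth() {
        static int d = 0;
        return d;
    }
};
```

The code only ever asks whether USE_NVTX is defined, so which CMake option defines it is invisible at this level.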

  • [ ] Step 5: Commit
git add forge/Core/ForgeIncludes.h forge/Core/Tracer.hpp
git commit -m "refactor: remove OpenACC guards from ForgeIncludes.h and Tracer.hpp"

Task 4: Strip OpenACC from forgeCol.hpp

Files: - Modify: forge/Core/forgeCol.hpp

This is the largest single file change — ~50 #pragma acc directives to remove. Work methodically top-to-bottom.

  • [ ] Step 1: Read forgeCol.hpp in full

Map every #pragma acc and #ifdef _OPENACC / #ifdef OPENACC_GPU location. The exploration found them at approximately these locations:

  • Lines 24-27: _OPENACC header includes (accel.h, openacc.h)
  • Line 135: constructor enter data copyin
  • Line 148: constructor enter data create
  • Lines 167-219: Arma constructors with device updates
  • Lines 231-246: Copy constructor with isCopy, acc_memcpy
  • Lines 257-275: Move constructor with acc_attach
  • Lines 281-296: Destructor with acc_delete, exit data
  • Lines 307-318: reset_mem() with acc_detach
  • Lines 330-362: set_size() with device allocation
  • Lines 370, 381: zeros()/ones() with parallel loop
  • Lines 395-425: getArma() with update host
  • Lines 515-931: Operator overloads and free functions with scattered pragmas

  • [ ] Step 2: Remove _OPENACC header includes

Remove:

#ifdef _OPENACC
#include <accel.h>
#include <openacc.h>
#endif

  • [ ] Step 3: Remove all #pragma acc from constructors and destructor

For each constructor, remove lines like:

#pragma acc enter data copyin(this)
#pragma acc enter data create(mem[0:n_elem])
#pragma acc update device(this)

For the destructor, remove:

#pragma acc exit data finalize detach(mem) delete(mem[0:n_elem])
#pragma acc exit data delete(this)

Keep the underlying new[]/delete[] and Metal std::aligned_alloc/std::free allocation logic intact.

  • [ ] Step 4: Remove isCopy flag

Remove the bool isCopy member variable declaration. Remove all references to isCopy in:

  • Copy constructor (where isCopy = true is set)
  • Move constructor
  • Destructor (where if (!isCopy) guards cleanup)
  • Any other locations

The copy constructor should still perform a deep copy of mem — just remove the isCopy bookkeeping.
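As a rough picture of the ownership model once isCopy is gone, the sketch below shows a column type where every object owns its buffer, so the destructor can clean up unconditionally. ColSketch is a hypothetical stand-in, not forgeCol itself, and it ignores the Metal aligned-allocation path.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>

// Hypothetical stand-in for forgeCol<T>; names are illustrative only.
template <typename T>
class ColSketch {
public:
    explicit ColSketch(std::size_t n)
        : n_elem(n), mem(n ? new T[n]() : nullptr) {}

    // Copy constructor: always a deep copy, so each object owns its buffer.
    ColSketch(const ColSketch& other)
        : n_elem(other.n_elem),
          mem(other.n_elem ? new T[other.n_elem] : nullptr) {
        std::copy(other.mem, other.mem + n_elem, mem);
    }

    // Move constructor: take over the buffer and leave the source empty.
    ColSketch(ColSketch&& other) noexcept
        : n_elem(std::exchange(other.n_elem, 0)),
          mem(std::exchange(other.mem, nullptr)) {}

    ColSketch& operator=(const ColSketch&) = delete;
    ColSketch& operator=(ColSketch&&) = delete;

    // Destructor: no isCopy check needed; delete[] on nullptr is a no-op.
    ~ColSketch() { delete[] mem; }

    T& operator[](std::size_t i) {
        assert(i < n_elem);
        return mem[i];
    }
    const T& operator[](std::size_t i) const {
        assert(i < n_elem);
        return mem[i];
    }
    std::size_t size() const { return n_elem; }

private:
    std::size_t n_elem;
    T* mem;
};
```

With this shape a moved-from object holds a null pointer, which is what makes the unconditional delete[] safe.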

  • [ ] Step 5: Remove OpenACC from set_size() and reset_mem()

In set_size(), remove:

#ifdef _OPENACC
    isOnGPU = true;
    #pragma acc update device(this)
    #pragma acc enter data create(mem[0:n_elem])
#endif

Keep the allocation paths (Metal aligned_alloc vs standard new) and the isOnGPU = false default.

In reset_mem(), remove acc_detach and device update pragmas.

  • [ ] Step 6: Keep zeros()/ones() loops — replace pragma for Phase 3

For now, just remove #pragma acc parallel loop present(mem[0:n_elem]). The plain for loop remains and works single-threaded. OpenMP will be added in Task 11.

  • [ ] Step 7: Remove OpenACC from getArma() variants

Remove #pragma acc update host(mem[0:n_elem]) from non-const getArma(). Keep the Metal #ifdef METAL_COMPUTE path that does Metal-specific sync.

  • [ ] Step 8: Remove OpenACC from all operator overloads

Go through every operator+, operator-, operator*, operator%=, etc. and remove #pragma acc parallel loop directives. The loops remain as plain single-threaded loops. Keep Metal dispatch paths (#ifdef METAL_COMPUTE).

  • [ ] Step 9: Remove OpenACC from free functions (sum, norm, etc.)

Same pattern — remove #pragma acc from sum(), norm(), and template operator functions at the bottom of the file.

  • [ ] Step 10: Verify the file compiles
cmake --build build --target ForgeCommon -j4 2>&1 | head -50

Fix any compilation errors from dangling references to removed symbols.

  • [ ] Step 11: Commit
git add forge/Core/forgeCol.hpp
git commit -m "refactor: strip all OpenACC pragmas and isCopy flag from forgeCol.hpp"

Task 5: Strip OpenACC from forgeMat.hpp

Files: - Modify: forge/Core/forgeMat.hpp

Same pattern as forgeCol — ~40 #pragma acc directives.

  • [ ] Step 1: Read forgeMat.hpp in full

  • [ ] Step 2: Remove _OPENACC header includes

  • [ ] Step 3: Remove all #pragma acc from constructors, destructor, set_size(), reset_mem()

Same approach as forgeCol. Keep Metal allocation paths.

  • [ ] Step 4: Remove OpenACC from zeros()/ones()

Remove #pragma acc parallel loop. Plain loops for now.

  • [ ] Step 5: Remove OpenACC from getArma() and all operator overloads

  • [ ] Step 6: Remove OpenACC from sum() functions

The sum() functions have nested parallel loops with reductions — remove the pragmas, keep the loops.

  • [ ] Step 7: Remove OpenACC from vectorise()

  • [ ] Step 8: Verify compilation

cmake --build build --target ForgeCommon -j4 2>&1 | head -50
  • [ ] Step 9: Commit
git add forge/Core/forgeMat.hpp
git commit -m "refactor: strip all OpenACC pragmas from forgeMat.hpp"

Task 6: Clean forgeSubview_Col.hpp and Test Files

Files:

  • Modify: forge/Core/forgeSubview_Col.hpp
  • Modify: forge/Tests/forgeColTests.cpp
  • Modify: forge/Tests/forgeColCplxTests.cpp
  • Modify: forge/Tests/forgeMatTests.cpp
  • Modify: forge/Tests/MetalVectorOpsTests.cpp
  • Modify: forge/Microbenchmarks/forgeCol/main.cpp

  • [ ] Step 1: Read forgeSubview_Col.hpp

This is a draft/stub file. Remove any #pragma acc directives.

  • [ ] Step 2: Clean test files — remove _OPENACC/OPENACC_GPU guards

In each test file, find and remove blocks like:

#ifdef _OPENACC
    // OpenACC-specific test setup
#endif

Or:

#ifdef OPENACC_GPU
    // ...
#endif

Keep #ifdef METAL_COMPUTE guards — those are still active.

  • [ ] Step 3: Clean Microbenchmarks

Remove _OPENACC guards from forge/Microbenchmarks/forgeCol/main.cpp.

  • [ ] Step 4: Verify compilation
cmake --build build --target cpu_tests -j4 2>&1 | head -50
  • [ ] Step 5: Commit
git add forge/Core/forgeSubview_Col.hpp forge/Tests/forgeColTests.cpp forge/Tests/forgeColCplxTests.cpp forge/Tests/forgeMatTests.cpp forge/Tests/MetalVectorOpsTests.cpp forge/Microbenchmarks/forgeCol/main.cpp
git commit -m "refactor: remove OpenACC guards from subview, tests, and microbenchmarks"

Task 7: Strip OpenACC from Gridding

Files:

  • Modify: forge/Gridding/gridding.h
  • Modify: forge/Gridding/gridding.cpp
  • Modify: forge/Gridding/griddingSupport.h
  • Modify: forge/Gridding/griddingSupport.cpp
  • Modify: forge/Gridding/TimeSegmentation.cpp

  • [ ] Step 1: Read gridding.h

Simplify the 4-way #ifdef (lines ~54-81) from:

#ifdef OPENACC_GPU
    // GPU version
#elif _OPENACC
    // OpenACC without CUDA
#elif defined(METAL_COMPUTE)
    // Metal
#else
    // CPU
#endif
To:
#ifdef METAL_COMPUTE
    // Metal
#else
    // CPU
#endif

  • [ ] Step 2: Read and clean gridding.cpp

Remove all #pragma acc parallel loop, #pragma acc loop seq, #pragma acc loop independent, #pragma acc atomic update, and #pragma acc data directives. Leave the loops and their bodies intact.

Key locations:

  • gridding_adjoint_2D(): ~6 pragmas (parallel loop, seq loops, atomic updates)
  • gridding_adjoint_3D(): ~8 pragmas (similar structure)
  • gridding_forward_2D(): ~4 pragmas
  • gridding_forward_3D(): ~4 pragmas

Do NOT add OpenMP yet — that's Task 11. Just remove OpenACC, leaving plain single-threaded loops.

  • [ ] Step 3: Read and clean griddingSupport.h

Remove #ifdef _OPENACC include guard for openacc.h. Remove #pragma acc routine seq from function declarations (bessi0, kernel_value_LUT). These annotations told OpenACC the function could be called from device code — not relevant for CPU/OpenMP.

  • [ ] Step 4: Read and clean griddingSupport.cpp

Remove all 21 #pragma acc directives:

  • #pragma acc routine seq on function definitions
  • #pragma acc parallel loop collapse(N) on deinterleave functions
  • #pragma acc parallel loop on normalize functions
  • #pragma acc parallel loop on deapodization functions
  • #pragma acc data regions

Leave all loops intact as single-threaded.

  • [ ] Step 5: Un-comment #pragma omp in TimeSegmentation.cpp

Find the commented-out #pragma omp parallel for (line ~143) and un-comment it:

#pragma omp parallel for

  • [ ] Step 6: Verify compilation
cmake --build build --target ForgeCommon -j4 2>&1 | head -50
  • [ ] Step 7: Commit
git add forge/Gridding/gridding.h forge/Gridding/gridding.cpp forge/Gridding/griddingSupport.h forge/Gridding/griddingSupport.cpp forge/Gridding/TimeSegmentation.cpp
git commit -m "refactor: strip OpenACC from gridding, griddingSupport, and TimeSegmentation"

Task 8: Strip OpenACC from FFT Files and Public Header

Files:

  • Modify: forge/FFT/ftCpu.h
  • Modify: forge/FFT/ftCpu.cpp
  • Modify: forge/FFT/ftCpuWithGrads.h
  • Modify: forge/FFT/ftCpuWithGrads.cpp
  • Modify: forge/Core/forge.h
  • Modify: forge/FFT/fftCPU.h (docstring cleanup)
  • Modify: forge/FFT/fftAccelerate.h (docstring cleanup)
  • Retire: forge/FFT/fftGPU.h (do not delete yet; just confirm it is excluded from the build)
  • Retire: forge/FFT/fftGPU.cpp (excluded from build in Task 2)

  • [ ] Step 1: Read and clean ftCpu.h

Remove #ifdef _OPENACC block that includes openacc.h.

  • [ ] Step 2: Read and clean ftCpu.cpp

Remove #pragma acc kernels, #pragma acc loop independent gang, #pragma acc loop vector(128), and any copyin/copyout clauses. Leave loops as single-threaded.

Both ftCpu() and iftCpu() functions have similar OpenACC structures — clean both.

  • [ ] Step 3: Read and clean ftCpuWithGrads.h

Remove #ifdef _OPENACC include guard. This file already uses #pragma omp — leave those intact.

  • [ ] Step 4: Read and clean ftCpuWithGrads.cpp

This file has 6 #pragma acc directives (lines ~107, 112, 123, 175, 179, 196). Remove all of them. The DFT loops are structurally identical to ftCpu.cpp. Leave existing #pragma omp directives intact.

  • [ ] Step 5: Remove fftGPU.h include from forge.h

In forge/Core/forge.h (line ~33), remove:

#include "FFT/fftGPU.h"

This is an unconditional include of a file being retired. Without this fix, every translation unit that includes forge.h will fail to compile after fftGPU.h is deleted.

  • [ ] Step 6: Clean up @see fftGPU.h docstring references

In forge/FFT/fftCPU.h and forge/FFT/fftAccelerate.h, remove @see fftGPU.h documentation references.

  • [ ] Step 7: Verify fftGPU files are excluded from build

Confirm fftGPU.cpp was removed from forge/CMakeLists.txt in Task 2. The files still exist on disk but are no longer compiled.

  • [ ] Step 8: Verify compilation
cmake --build build --target ForgeCommon -j4 2>&1 | head -50
  • [ ] Step 9: Commit
git add forge/FFT/ftCpu.h forge/FFT/ftCpu.cpp forge/FFT/ftCpuWithGrads.h forge/FFT/ftCpuWithGrads.cpp forge/Core/forge.h forge/FFT/fftCPU.h forge/FFT/fftAccelerate.h
git commit -m "refactor: strip OpenACC from FFT files, remove fftGPU.h include from forge.h"

Task 9: Strip OpenACC from All Operators

Files:

  • Modify: forge/Operators/Gfft.h, forge/Operators/Gfft.cpp
  • Modify: forge/Operators/Gnufft.h, forge/Operators/Gnufft.cpp
  • Modify: forge/Operators/Gdft.h, forge/Operators/Gdft.cpp
  • Modify: forge/Operators/GdftR2.h, forge/Operators/GdftR2.cpp
  • Modify: forge/Operators/SENSE.h, forge/Operators/SENSE.cpp
  • Modify: forge/Operators/pcSENSE.h, forge/Operators/pcSENSE.cpp
  • Modify: forge/Operators/pcSenseTimeSeg.h, forge/Operators/pcSenseTimeSeg.cpp

  • [ ] Step 1: Read and clean Gfft.h

Simplify the 4-way #ifdef to 2-way:

#ifdef METAL_COMPUTE
    // Metal includes
#else
    // CPU includes (FFTW)
#endif

Remove both the OPENACC_GPU and OPENACC_MP branches — both include fftGPU.h which is being retired. Also remove OPENACC_GPU members (stream, cuFFT plan handle).

  • [ ] Step 2: Read and clean Gfft.cpp

Remove #ifdef OPENACC_GPU blocks that:

  • Create CUDA streams
  • Create cuFFT plans
  • Use #pragma acc data regions
  • Call fftGPU functions

The CPU fallback path (using fftCPU or fftAccelerate) becomes the only non-Metal path.

  • [ ] Step 3: Read and clean Gnufft.h and Gnufft.cpp

In Gnufft.h:

  • Remove OPENACC_GPU members (cuFFT plan handle, etc.)
  • Keep Metal members (pipelineCtx, metalCtx)

In Gnufft.cpp:

  • Remove #ifdef OPENACC_GPU block in constructor (cuFFT plan setup, lines ~86-99)
  • Remove #pragma acc enter data in constructor (lines ~116-118)
  • In operator* and operator/: simplify from 3-way dispatch (Metal pipeline / Metal gridding / OpenACC) to 2-way (Metal / CPU)
  • Remove any #pragma acc in the CPU computation path

  • [ ] Step 4: Read and clean Gdft.h and Gdft.cpp

Same pattern — remove OPENACC_GPU blocks, keep Metal blocks, simplify dispatch.

  • [ ] Step 5: Read and clean GdftR2.h and GdftR2.cpp

Same pattern.

  • [ ] Step 6: Read and clean SENSE.h, SENSE.cpp, pcSENSE.h, pcSENSE.cpp, pcSenseTimeSeg.h, pcSenseTimeSeg.cpp

These higher-level operators may have fewer or no direct OpenACC pragmas — they delegate to Gnufft/Gdft. But check for any #ifdef OPENACC_GPU guards or #pragma acc data regions.

  • [ ] Step 7: Verify compilation
cmake --build build --target ForgeCommon -j4 2>&1 | head -50
  • [ ] Step 8: Commit
git add forge/Operators/
git commit -m "refactor: strip all OpenACC from operators (Gfft, Gnufft, Gdft, GdftR2, SENSE, pcSENSE, pcSenseTimeSeg)"

Task 10: Retire Legacy Files and Full Build + Test

Files: - Retire: forge/FFT/fftGPU.h, forge/FFT/fftGPU.cpp - Retire: forge/my_sincosf.cuh

  • [ ] Step 1: Delete retired files
git rm forge/FFT/fftGPU.h forge/FFT/fftGPU.cpp

If forge/my_sincosf.cuh exists and is not referenced anywhere:

git rm forge/my_sincosf.cuh

  • [ ] Step 2: Search for any remaining OpenACC references
# Search for any remaining #pragma acc
grep -rn "#pragma acc" forge/ --include="*.cpp" --include="*.hpp" --include="*.h"

# Search for any remaining _OPENACC or OPENACC_GPU
grep -rn "_OPENACC\|OPENACC_GPU\|OPENACC_MP" forge/ --include="*.cpp" --include="*.hpp" --include="*.h" --include="*.cmake"
grep -rn "_OPENACC\|OPENACC_GPU\|OPENACC_MP" CMakeLists.txt

Expected: No results (or only in files already retired/excluded from build). Fix any stragglers.

  • [ ] Step 3: Full clean build (Metal backend)
rm -rf build
cmake -B build -S . -DMETAL_COMPUTE=ON
cmake --build build -j4

Expected: Clean build with zero errors and zero OpenACC-related warnings.

  • [ ] Step 4: Run all fast+medium tests
./build/cpu_tests '~[Benchmark]'

Expected: All tests pass. This is the single-threaded correctness baseline.

  • [ ] Step 5: Run Metal tests
./build/metal_tests '~[Benchmark]'

Expected: All Metal tests pass. Metal backend is unchanged.

  • [ ] Step 6: Full clean build (CPU-only backend)
rm -rf build_cpu
cmake -B build_cpu -S . -DMETAL_COMPUTE=OFF
cmake --build build_cpu -j4

Expected: Clean build on CPU-only path (no Metal, no CUDA, no OpenACC).

  • [ ] Step 7: Run CPU-only tests
./build_cpu/cpu_tests '~[Benchmark]'

Expected: All CPU-agnostic tests pass. Metal-specific tests are excluded by #ifdef.

  • [ ] Step 8: Commit
git add -A
git commit -m "refactor: retire fftGPU and my_sincosf.cuh, verify clean build on Metal and CPU"

Task 11: Add OpenMP Parallelism to Hot Loops

Files:

  • Modify: forge/Gridding/gridding.cpp
  • Modify: forge/Gridding/griddingSupport.cpp
  • Modify: forge/FFT/ftCpu.cpp
  • Modify: forge/FFT/ftCpuWithGrads.cpp
  • Modify: forge/Core/forgeCol.hpp
  • Modify: forge/Core/forgeMat.hpp

Now that single-threaded correctness is verified, layer in OpenMP.

  • [ ] Step 1: Add OpenMP to gridding_adjoint_2D/3D in gridding.cpp

For the outer loop of gridding_adjoint_2D, add:

#pragma omp parallel for schedule(dynamic)

For the atomic grid accumulation inside the loop, add:

#pragma omp atomic update

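Same pattern for gridding_adjoint_3D. #pragma omp atomic update replaces the old #pragma acc atomic update with the same semantics for scalar updates of the form x += expr. The directive applies only to scalar lvalues, so if the grid stores complex values, update the real and imaginary parts with two separate atomics.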
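A minimal sketch of the adjoint accumulation pattern, assuming the grid holds std::complex<float> and each sample lands in exactly one cell. The function name scatter_add and its parameters are hypothetical; the real gridding_adjoint_2D spreads each sample over a kernel window and has a different signature.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

// Hypothetical scatter-accumulate; not the actual gridding_adjoint_2D.
void scatter_add(const std::vector<std::complex<float>>& samples,
                 const std::vector<std::size_t>& cell,
                 std::vector<std::complex<float>>& grid) {
    assert(samples.size() == cell.size());

    // std::complex<float> may be accessed as float[2] (real, imag), which
    // lets each component be updated as a plain scalar.
    float* g = reinterpret_cast<float*>(grid.data());
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(samples.size());

    #pragma omp parallel for schedule(dynamic)
    for (std::ptrdiff_t i = 0; i < n; ++i) {
        const std::size_t c = cell[i];
        assert(c < grid.size());
        const float re = samples[i].real();
        const float im = samples[i].imag();

        // Several samples can hit the same cell, so each add must be atomic.
        #pragma omp atomic update
        g[2 * c] += re;
        #pragma omp atomic update
        g[2 * c + 1] += im;
    }
}
```

The two components are updated independently, so another thread may briefly observe a cell with a new real part and an old imaginary part; that is harmless as long as the grid is only read after the parallel loop ends.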

For gridding_forward_2D/3D, the gather operation is embarrassingly parallel with no atomics:

#pragma omp parallel for schedule(static)
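For contrast with the adjoint, the forward direction can be sketched as a pure gather. This is an illustration under simplifying assumptions: gather_sketch is a hypothetical name, and it reads one grid cell per sample, whereas the real gridding_forward_2D signature and interpolation are not shown in this plan.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

// Hypothetical gather; one grid cell per output sample.
std::vector<std::complex<float>> gather_sketch(
        const std::vector<std::complex<float>>& grid,
        const std::vector<std::size_t>& cell) {
    std::vector<std::complex<float>> samples(cell.size());
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(cell.size());

    // Threads may read the same grid cell, but each iteration writes only
    // samples[i], so no atomics or locks are needed.
    #pragma omp parallel for schedule(static)
    for (std::ptrdiff_t i = 0; i < n; ++i) {
        assert(cell[i] < grid.size());
        samples[i] = grid[cell[i]];
    }
    return samples;
}
```

Every iteration does the same amount of work here, which is the usual reason to prefer schedule(static) over schedule(dynamic) for this loop.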

  • [ ] Step 2: Add OpenMP to griddingSupport.cpp

For deinterleave_data2d:

#pragma omp parallel for collapse(2)

For deinterleave_data3d:

#pragma omp parallel for collapse(3)

For normalize_fft2d, normalize_fft3d:

#pragma omp parallel for

For deapodization functions:

#pragma omp parallel for

Follow the same loop structure the old OpenACC pragmas annotated — they identified which loops are safe to parallelize.
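The collapse clause is worth a concrete picture. The sketch below assumes a row-major buffer of interleaved (real, imag) float pairs; deinterleave_2d_sketch and its parameters are hypothetical and may not match the layout that deinterleave_data2d actually uses.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical 2D deinterleave: src holds (re, im) pairs in row-major order.
void deinterleave_2d_sketch(const std::vector<float>& src,
                            std::vector<float>& re,
                            std::vector<float>& im,
                            std::ptrdiff_t ny, std::ptrdiff_t nx) {
    assert(ny >= 0 && nx >= 0);
    assert(src.size() == static_cast<std::size_t>(2 * ny * nx));
    re.resize(static_cast<std::size_t>(ny * nx));
    im.resize(static_cast<std::size_t>(ny * nx));

    // collapse(2) fuses both loops into one iteration space of ny * nx.
    // It requires perfectly nested loops: no statements may sit between
    // the two for headers.
    #pragma omp parallel for collapse(2)
    for (std::ptrdiff_t y = 0; y < ny; ++y) {
        for (std::ptrdiff_t x = 0; x < nx; ++x) {
            const std::ptrdiff_t idx = y * nx + x;
            re[idx] = src[2 * idx];
            im[idx] = src[2 * idx + 1];
        }
    }
}
```

Collapsing mainly helps when the outer extent is small relative to the thread count; for tall grids a plain parallel for on the outer loop distributes work about as well.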

  • [ ] Step 3: Add OpenMP to ftCpu.cpp

For ftCpu() and iftCpu(), add:

#pragma omp parallel for

to the outer loop over k-space samples. The inner loop (over spatial positions) is a dot product — keep it sequential within each thread.
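The loop structure described above might look like the following one-dimensional sketch. The name dft_sketch, the argument layout, and the sign convention are assumptions for illustration; ftCpu() itself works on multi-dimensional trajectories and may differ in all three.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Hypothetical 1D direct DFT: out[i] = sum_j img[j] * exp(-2*pi*i * k[i] * x[j]).
std::vector<std::complex<double>> dft_sketch(
        const std::vector<std::complex<double>>& img,
        const std::vector<double>& k,    // k-space sample locations
        const std::vector<double>& x) {  // spatial positions
    assert(img.size() == x.size());
    const double two_pi = 6.283185307179586;
    const std::ptrdiff_t nk = static_cast<std::ptrdiff_t>(k.size());
    std::vector<std::complex<double>> out(k.size());

    // Each k-space sample writes only out[i], so outer iterations are independent.
    #pragma omp parallel for
    for (std::ptrdiff_t i = 0; i < nk; ++i) {
        std::complex<double> acc(0.0, 0.0);  // private to the iteration
        // Inner dot product stays sequential within the thread.
        for (std::size_t j = 0; j < x.size(); ++j) {
            const double phase = -two_pi * k[i] * x[j];
            acc += img[j] * std::complex<double>(std::cos(phase), std::sin(phase));
        }
        out[i] = acc;
    }
    return out;
}
```

Because the accumulator is declared inside the outer loop body, it is automatically private to each iteration and no reduction clause is needed.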

  • [ ] Step 3b: Add OpenMP to ftCpuWithGrads.cpp

Same pattern as ftCpu.cpp — the DFT loops are structurally identical (with additional R2* exponential terms). Add #pragma omp parallel for to the outer loops of both ftCpuWithGrads() and iftCpuWithGrads().

  • [ ] Step 4: Add OpenMP to forgeCol::zeros() and forgeCol::ones()
void zeros() {
    #pragma omp parallel for schedule(static)
    for (uword i = 0; i < n_elem; i++) {
        mem[i] = T(0);
    }
}

Same for ones(). These are large arrays (millions of elements for 3D images).

  • [ ] Step 5: Add OpenMP to forgeMat::zeros() and forgeMat::ones()

Same pattern as forgeCol.

  • [ ] Step 6: Build and test with OpenMP
cmake --build build_cpu -j4
./build_cpu/cpu_tests '~[Benchmark]'

Expected: All tests pass. Results should match single-threaded within floating-point tolerance.
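Atomic accumulation changes the order of floating-point additions, so bitwise equality with the single-threaded baseline is not guaranteed. One way to express "within tolerance" is a relative L2 error; the helper below is a hypothetical sketch, not part of the existing test suite, and the acceptable threshold is left to the tests.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: relative L2 error of `got` against `ref`.
// Falls back to the absolute L2 error when the reference is all zeros.
double relative_l2_error(const std::vector<float>& ref,
                         const std::vector<float>& got) {
    assert(ref.size() == got.size());
    double num = 0.0;  // squared norm of the difference
    double den = 0.0;  // squared norm of the reference
    for (std::size_t i = 0; i < ref.size(); ++i) {
        const double r = static_cast<double>(ref[i]);
        const double d = static_cast<double>(got[i]) - r;
        num += d * d;
        den += r * r;
    }
    return den > 0.0 ? std::sqrt(num / den) : std::sqrt(num);
}
```

Accumulating in double keeps the comparison itself from adding noticeable rounding error on top of the single-precision data.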

  • [ ] Step 7: Quick performance sanity check

Run one of the medium-speed tests and compare wall time:

# Single-threaded
OMP_NUM_THREADS=1 time ./build_cpu/cpu_tests "[Gnufft adjoint]"

# All cores
time ./build_cpu/cpu_tests "[Gnufft adjoint]"

Expected: Multi-threaded is noticeably faster. Exact speedup depends on the test.

  • [ ] Step 8: Commit
git add forge/Gridding/gridding.cpp forge/Gridding/griddingSupport.cpp forge/FFT/ftCpu.cpp forge/FFT/ftCpuWithGrads.cpp forge/Core/forgeCol.hpp forge/Core/forgeMat.hpp
git commit -m "feat: add OpenMP parallelism to gridding, griddingSupport, ftCpu, ftCpuWithGrads, forgeCol, and forgeMat"

Task 12: Create BackendDispatch.hpp

Files: - Create: forge/Core/BackendDispatch.hpp

  • [ ] Step 1: Create the dispatch header
#pragma once

// BackendDispatch.hpp: compile-time backend resolution
//
// Maps GPU memory management and vector algebra to the active backend.
// CPU backend: GPU ops are no-ops or static_asserts.
// Metal backend: delegates to MetalVectorOps_dispatch.hpp
// CUDA backend (future): delegates to CudaVectorOps_dispatch.hpp

#include <cstddef>
#include "forgeComplex.hpp"

// ============================================================
// Backend headers
// ============================================================
// Included at global scope, before the namespace opens, so that their
// declarations are not accidentally nested inside forge::backend.

#if defined(CUDA_COMPUTE)
    // Future: #include "CUDA/CudaVectorOps_dispatch.hpp"
#elif defined(METAL_COMPUTE)
    // Metal uses unified memory, so no explicit GPU malloc/copy is needed.
    // MetalVectorOps_dispatch.hpp handles compute dispatch.
    #include "Metal/MetalVectorOps_dispatch.hpp"
#else
    // CPU-only: no GPU memory operations
#endif

namespace forge::backend {

// ============================================================
// Compile-time backend query
// ============================================================

constexpr bool has_gpu() {
#if defined(METAL_COMPUTE) || defined(CUDA_COMPUTE)
    return true;
#else
    return false;
#endif
}

// ============================================================
// GPU Memory Management
// ============================================================
// On CPU-only builds, these are no-ops or errors.
// On Metal, memory is unified (page-aligned alloc gives GPU access).
// On CUDA (future), these map to cudaMalloc/cudaMemcpy/cudaFree.

} // namespace forge::backend
  • [ ] Step 2: Verify it compiles when included

Add a temporary #include "BackendDispatch.hpp" to one source file, build, and remove.
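As an illustration of how callers might consume the compile-time query, the sketch below branches on has_gpu() with if constexpr. It duplicates has_gpu() in a hypothetical sketch namespace so the example stands alone; active_path() is an invented caller, not an existing forge function.

```cpp
#include <cassert>
#include <string>

namespace sketch {

// Mirrors has_gpu() from BackendDispatch.hpp; duplicated so this compiles alone.
constexpr bool has_gpu() {
#if defined(METAL_COMPUTE) || defined(CUDA_COMPUTE)
    return true;
#else
    return false;
#endif
}

// Hypothetical caller. The condition is resolved at compile time, so no
// runtime branch is emitted. In a non-template function like this one,
// both branches must still be valid code for every backend.
inline std::string active_path() {
    if constexpr (has_gpu()) {
        return "gpu";
    } else {
        return "cpu";
    }
}

}  // namespace sketch
```

Code that names backend-only symbols cannot live in a discarded branch of a non-template function, so such call sites still need an #ifdef or a template wrapper.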

  • [ ] Step 3: Commit
git add forge/Core/BackendDispatch.hpp
git commit -m "feat: add BackendDispatch.hpp compile-time backend resolution layer"

Task 13: Refactor Operators to Use Dispatch Layer

Files:

  • Modify: forge/Operators/Gnufft.cpp
  • Modify: forge/Operators/Gdft.cpp
  • Modify: forge/Operators/GdftR2.cpp
  • Modify: forge/Operators/Gfft.cpp

This task replaces the remaining #ifdef METAL_COMPUTE blocks in operators with dispatch-layer patterns. The goal is that operators have clean code without scattered #ifdef blocks — just if (pipelineCtx) runtime checks against the backend context.

  • [ ] Step 1: Evaluate current Metal #ifdef blocks in Gnufft.cpp

After Task 9, the #ifdef blocks should be just Metal vs CPU. Evaluate whether the dispatch layer abstraction adds value here, or if the 2-way #ifdef METAL_COMPUTE is already clean enough.

Pragmatic decision: If the operator code is already clean with just #ifdef METAL_COMPUTE / #else guards, don't force it through the dispatch layer. The dispatch layer's main value is for GPU memory management (forgeCol/forgeMat) and vector algebra — not for pipeline-level operations which are inherently backend-specific.

  • [ ] Step 2: If refactoring, update the operator dispatch pattern

If the #ifdef blocks are complex, refactor to use dispatch. If they're simple 2-way switches, leave them and document the pattern for the CUDA backend to follow.

  • [ ] Step 3: Verify all tests still pass
cmake --build build -j4
./build/cpu_tests '~[Benchmark]'
./build/metal_tests '~[Benchmark]'
  • [ ] Step 4: Commit
git add forge/Operators/ forge/Core/BackendDispatch.hpp
git commit -m "refactor: operators use BackendDispatch where beneficial, document pattern for CUDA"

Task 14: Final Verification and Documentation

Files: - Modify: CLAUDE.md (update build instructions)

  • [ ] Step 1: Full clean build — Metal backend
rm -rf build
cmake -B build -S . -DMETAL_COMPUTE=ON
cmake --build build -j4
  • [ ] Step 2: Run complete Metal test suite
./build/metal_tests '~[Benchmark]'
./build/cpu_tests '~[Benchmark]'

Expected: All pass.

  • [ ] Step 3: Full clean build — CPU-only backend
rm -rf build_cpu
cmake -B build_cpu -S .
cmake --build build_cpu -j4

Note: no -DMETAL_COMPUTE and no -DOPENACC_GPU — this is the pure CPU+OpenMP build.

  • [ ] Step 4: Run CPU test suite
./build_cpu/cpu_tests '~[Benchmark]'

Expected: All CPU-agnostic tests pass.

  • [ ] Step 5: OpenMP thread scaling check
OMP_NUM_THREADS=1 ./build_cpu/cpu_tests "[Gnufft adjoint]" --benchmark
OMP_NUM_THREADS=4 ./build_cpu/cpu_tests "[Gnufft adjoint]" --benchmark

Confirm OpenMP provides speedup.

  • [ ] Step 6: Update CLAUDE.md build instructions

Update the build instructions to reflect new flags:

  • Remove references to OPENACC_GPU and OPENACC_MP
  • Add CUDA_COMPUTE (coming soon)
  • Add BUILD_SHARED_LIBS
  • Note that OpenMP is now required for all builds (Apple Clang needs brew install libomp)

  • [ ] Step 7: Update CLAUDE.md Key CMake Options table
Flag                          Effect
-DMETAL_COMPUTE=ON            Enable Apple Metal GPU backend
-DCUDA_COMPUTE=ON             Enable NVIDIA CUDA GPU backend (future)
-DBUILD_SHARED_LIBS=OFF       Build static libraries (for forge-studio)
-DENABLE_DOUBLE_PRECISION=ON  Switch to double precision
-DBUILD_FORGEVIEW=ON          Build the forgeview TUI viewer (default ON)
  • [ ] Step 8: Final commit
git add CLAUDE.md
git commit -m "docs: update build instructions for OpenMP/CUDA_COMPUTE, remove OpenACC references"

Summary of Commits

Task  Commit Message
1     build: remove OpenACC/OPENACC_MP options, add OpenMP and CUDA_COMPUTE stubs
2     build: remove fftGPU from sources, use BUILD_SHARED_LIBS, link OpenMP
3     refactor: remove OpenACC guards from ForgeIncludes.h and Tracer.hpp
4     refactor: strip all OpenACC pragmas and isCopy flag from forgeCol.hpp
5     refactor: strip all OpenACC pragmas from forgeMat.hpp
6     refactor: remove OpenACC guards from subview, tests, and microbenchmarks
7     refactor: strip OpenACC from gridding, griddingSupport, and TimeSegmentation
8     refactor: strip OpenACC from FFT files, remove fftGPU.h include from forge.h
9     refactor: strip all OpenACC from operators (Gfft, Gnufft, Gdft, GdftR2, SENSE, pcSENSE, pcSenseTimeSeg)
10    refactor: retire fftGPU and my_sincosf.cuh, verify clean build on Metal and CPU
11    feat: add OpenMP parallelism to gridding, griddingSupport, ftCpu, ftCpuWithGrads, forgeCol, and forgeMat
12    feat: add BackendDispatch.hpp compile-time backend resolution layer
13    refactor: operators use BackendDispatch where beneficial, document pattern for CUDA
14    docs: update build instructions for OpenMP/CUDA_COMPUTE, remove OpenACC references