# MPI Modernization Implementation Plan
For agentic workers: REQUIRED: Use superpowers:subagent-driven-development (if subagents available) or superpowers:executing-plans to implement this plan. Steps use checkbox (`- [ ]`) syntax for tracking.
Goal: Fix MPI build errors and fully modernize all MPI executables and helper classes to match the primary executables' patterns (forge types, structured logging, CLI parity, exit codes, signal handling).
Architecture: Remove conflicting custom op_circshift (Armadillo native suffices). Add Boost.Serialization to forgeCol/forgeMat for MPI communication. Convert all MPI code from raw Armadillo to forge types. Add full modern CLI infrastructure to all 3 MPI executables.
Tech Stack: C++17, Boost.MPI, Boost.Serialization, Armadillo 15.2.3, spdlog, Catch2
Spec: docs/superpowers/specs/2026-03-16-mpi-modernization-design.md
## Chunk 1: Shared Code Fixes (prerequisites for all builds)
### Task 1: Remove custom op_circshift and update fftshift.hpp
Files:
- Modify: Support/ArmaExtensions/arma_extensions.h
- Modify: forge/FFT/fftshift.hpp
- Delete: Support/ArmaExtensions/op_circshift_bones.hpp
- Delete: Support/ArmaExtensions/op_circshift_meat.hpp
- Delete: Support/ArmaExtensions/fn_circshift.hpp
Context: Armadillo 15.2.3 natively defines arma::op_circshift and
arma::circshift(). The custom versions in Support/ArmaExtensions/ conflict,
causing a redefinition error when both are in scope (MPI builds). The custom
circshift(X, dim, shift) has swapped argument order vs Armadillo's native
circshift(X, shift, dim).
- [ ] Step 1: Remove circshift includes from arma_extensions.h

  In Support/ArmaExtensions/arma_extensions.h, remove lines 36-42:

  ```cpp
  namespace arma {
  #include "op_circshift_bones.hpp"
  #include "fn_circshift.hpp"
  #include "op_circshift_meat.hpp"
  //#include "permute.hpp"
  }
  ```

  The permute.hpp include is already commented out, so remove the entire `namespace arma {}` block rather than keeping it for that.
- [ ] Step 2: Update fftshift.hpp call sites

  In forge/FFT/fftshift.hpp, change line 58 from:

  ```cpp
  T1 out = circshift(X, dim, std::floor(size / 2));
  ```

  to:

  ```cpp
  T1 out = arma::circshift(X, static_cast<arma::sword>(std::floor(size / 2)), dim);
  ```

  Change line 65 from:

  ```cpp
  T1 out = circshift(circshift(X, 0, std::floor(X.n_rows / 2)), 1, std::floor(X.n_cols / 2));
  ```

  to:

  ```cpp
  T1 out = arma::circshift(
      arma::circshift(X, static_cast<arma::sword>(std::floor(X.n_rows / 2)), 0),
      static_cast<arma::sword>(std::floor(X.n_cols / 2)), 1);
  ```

  Note: arma::circshift takes sword (signed word) for the shift amount. The explicit namespace prefix avoids ambiguity with any lingering `using namespace arma;`.
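As a sanity check on the argument-order change, here is a small standalone sketch (plain std::vector and std::rotate, no Armadillo; circshift1d and fftshift1d are invented names for illustration) of what a circular shift by `shift` along one dimension does, and why fftshift is just a circshift by floor(n/2):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Semantics Armadillo's native circshift(X, shift, dim) applies along
// one dimension: element i of the input ends up at index (i + shift) mod n.
std::vector<int> circshift1d(std::vector<int> v, std::ptrdiff_t shift)
{
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(v.size());
    if (n == 0) return v;
    const std::ptrdiff_t k = ((shift % n) + n) % n; // normalize negative shifts
    std::rotate(v.begin(), v.begin() + (n - k), v.end());
    return v;
}

// fftshift is circshift by floor(n/2): the two halves swap, moving the
// zero-frequency bin to the center. This matches the floor(size / 2)
// shift amount used in fftshift.hpp.
std::vector<int> fftshift1d(std::vector<int> v)
{
    const std::ptrdiff_t half = static_cast<std::ptrdiff_t>(v.size() / 2);
    return circshift1d(std::move(v), half);
}
```

The custom version's swapped `(X, dim, shift)` order would silently pass the shift amount as the dimension, which is why the call sites must be updated rather than just re-namespaced.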
- [ ] Step 3: Delete the custom circshift files

  ```sh
  git rm Support/ArmaExtensions/op_circshift_bones.hpp
  git rm Support/ArmaExtensions/op_circshift_meat.hpp
  git rm Support/ArmaExtensions/fn_circshift.hpp
  ```
- [ ] Step 4: Build and run ALL tests (non-MPI)

  ```sh
  cmake --build build --target cpu_tests --target metal_tests -j4
  ./build/cpu_tests '~[Benchmark]'
  ./build/metal_tests '~[Benchmark]'
  ```
- [ ] Step 5: Commit

  ```sh
  git add Support/ArmaExtensions/arma_extensions.h forge/FFT/fftshift.hpp
  git commit -m "fix: remove custom op_circshift, use Armadillo native circshift"
  ```
### Task 2: Add Boost.Serialization support to forgeCol and forgeMat
Files:
- Modify: forge/Core/forgeCol.hpp
- Modify: forge/Core/forgeMat.hpp
- [ ] Step 1: Add serialize() to forgeCol

  In forge/Core/forgeCol.hpp, add the following inside the forgeCol class (near the end, before the closing `};`), guarded with `#ifdef ForgeMPI`:

  ```cpp
  #ifdef ForgeMPI
      /// @brief Boost.Serialization support for MPI communication.
      template <class Archive>
      void serialize(Archive& ar, const unsigned int /*version*/)
      {
          arma::uword len = n_elem;
          ar & len;
          if (Archive::is_loading::value) {
              set_size(len);
          }
          ar & boost::serialization::make_array(mem, n_elem);
      }
  #endif
  ```

  Also add the necessary Boost.Serialization includes, guarded by `#ifdef ForgeMPI`, near the top of the file (after the existing includes):

  ```cpp
  #ifdef ForgeMPI
  #include <boost/serialization/array.hpp>
  #include <boost/serialization/serialization.hpp>
  #endif
  ```
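The save/load branching above can be exercised without Boost by mocking the archive interface. The sketch below (MockSaveArchive, MockLoadArchive, and ToyCol are invented stand-ins, not the real Boost or forge types) shows how a single serialize() drives both directions: length first, set_size() only on the load path, then the payload.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for Boost archives, to illustrate how one
// serialize() method handles both directions via is_loading.
struct MockSaveArchive {
    struct is_loading { static const bool value = false; };
    std::vector<double> buf;
    MockSaveArchive& operator&(double v) { buf.push_back(v); return *this; }
};

struct MockLoadArchive {
    struct is_loading { static const bool value = true; };
    std::vector<double> buf;
    std::size_t pos = 0;
    MockLoadArchive& operator&(double& v) { v = buf[pos++]; return *this; }
};

// A toy column mirroring the forgeCol pattern: serialize the length,
// resize only when loading, then serialize the payload.
struct ToyCol {
    std::vector<double> data;
    void set_size(std::size_t n) { data.resize(n); }

    template <class Archive>
    void serialize(Archive& ar, unsigned /*version*/)
    {
        double len = static_cast<double>(data.size());
        ar & len;
        if (Archive::is_loading::value) {
            set_size(static_cast<std::size_t>(len));
        }
        for (std::size_t i = 0; i < data.size(); ++i) {
            ar & data[i];
        }
    }
};
```

The same shape is what Boost's binary/MPI archives drive at the other end of a send/receive pair.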
- [ ] Step 2: Add serialize() to forgeMat

  Same pattern in forge/Core/forgeMat.hpp:

  ```cpp
  #ifdef ForgeMPI
      /// @brief Boost.Serialization support for MPI communication.
      template <class Archive>
      void serialize(Archive& ar, const unsigned int /*version*/)
      {
          arma::uword nr = n_rows;
          arma::uword nc = n_cols;
          ar & nr;
          ar & nc;
          if (Archive::is_loading::value) {
              set_size(nc, nr); // note: set_size takes (nCols, nRows)
          }
          ar & boost::serialization::make_array(mem, n_rows * n_cols);
      }
  #endif
  ```

  Add the same guarded includes near the top.
- [ ] Step 3: Build non-MPI tests to verify no regressions

  ```sh
  cmake --build build --target cpu_tests -j4
  ./build/cpu_tests '~[Benchmark]'
  ```
- [ ] Step 4: Build MPI to verify serialization compiles

  ```sh
  cmake -B build_mpi -S . -DMETAL_COMPUTE=ON -DOPENACC_GPU=OFF -DMPISupport=ON
  cmake --build build_mpi -j4 2>&1 | grep "error:" | head -10
  ```
Note: forgeCol::n_elem and forgeMat::n_rows/n_cols are const members. The
serialize() method uses a local copy for the ar & operation, then calls
set_size() on load. Verify that set_size() works on a const-member class —
if n_elem is truly const and set_size modifies it via placement-new or similar,
this will work. If not, you may need to use const_cast or a different approach.
Check the actual forgeCol::set_size() implementation.
- [ ] Step 5: Commit

  ```sh
  git add forge/Core/forgeCol.hpp forge/Core/forgeMat.hpp
  git commit -m "feat: add Boost.Serialization support to forgeCol and forgeMat for MPI"
  ```
## Chunk 2: MPI Helper Class Modernization
### Task 3: Modernize mpipcSENSE types and operator signatures
Files:
- Modify: apps/MPI/mpipcSENSE.h
- Modify: apps/MPI/mpipcSENSE.cpp
- [ ] Step 1: Update mpipcSENSE.h
  - Replace `typedef std::complex<T1> CxT1;` with `typedef forgeComplex<T1> CxT1;`
  - Convert member variables from Armadillo to forge types:
    - `Mat<CxT1> SMap` → `forgeMat<CxT1> SMap`
    - `Mat<T1> PMap` → `forgeMat<T1> PMap`
    - `Col<T1> FMap` → `forgeCol<T1> FMap`
    - `Mat<T1> Kx, Ky, Kz, Tvec` → `forgeMat<T1> Kx, Ky, Kz, Tvec`
    - `Col<T1> Ix, Iy, Iz` → `forgeCol<T1> Ix, Iy, Iz`
    - `Col<uword> shotList, coilList` → keep as `arma::Col<arma::uword>` (MPI serializes these)
  - Update operator signatures:

    ```cpp
    forgeCol<CxT1> operator*(const forgeCol<CxT1>& d) const;
    forgeCol<CxT1> operator/(const forgeCol<CxT1>& d) const;
    ```

  - Update the constructor signature to accept forge types:

    ```cpp
    mpipcSENSE(forgeCol<T1> kx, forgeCol<T1> ky, forgeCol<T1> kz,
        uword nx, uword ny, uword nz, uword nc,
        forgeCol<T1> t, forgeCol<CxT1> SENSEmap, forgeCol<T1> FieldMap,
        forgeCol<T1> ShotPhaseMap, bmpi::environment& en, bmpi::communicator& wor);
    ```

  - Update the extern template declaration to match the current `reconSolve` signature:

    ```cpp
    extern template forgeCol<forgeComplex<float>> reconSolve(
        forgeCol<forgeComplex<float>>, mpipcSENSE<float>&, QuadPenalty<float>,
        arma::Col<float>, arma::Col<float>, arma::Col<float>,
        arma::uword, arma::uword, arma::uword,
        arma::Col<float>, arma::uword, size_t, size_t);
    ```
- [ ] Step 2: Update mpipcSENSE.cpp
  - Update operator* and operator/ implementations:
    - Change return types and parameter types to `forgeCol<forgeComplex<T1>>`.
    - Internal computation: the operators use Armadillo expression templates heavily (`%`, `*`, `exp`, `conj`, `vectorise`, `reshape`). Keep the internal computation in Armadillo types by extracting views via `.getArma()`/`.getArmaComplex()` at the top of operator* and operator/, then wrap the final result back into forge types before returning.
    - Armadillo expression template pitfall: never use `auto` to capture expressions involving `.getArmaComplex()`; force evaluation into named `arma::Col<std::complex<T>>` variables.
  - For MPI gather/broadcast: with Boost.Serialization now on forge types, communication can use forge types directly. However, if the internal computation already uses Armadillo temporaries (`Mat<complex<T1>>`), those can be gathered/broadcast as-is since they already have serialization support via `ARMA_EXTRA` macros. Use whichever is most natural at each communication point.
  - Update the explicit template instantiation:

    ```cpp
    template forgeCol<forgeComplex<float>> reconSolve(
        forgeCol<forgeComplex<float>>, mpipcSENSE<float>&, QuadPenalty<float>,
        arma::Col<float>, arma::Col<float>, arma::Col<float>,
        arma::uword, arma::uword, arma::uword,
        arma::Col<float>, arma::uword, size_t, size_t);
    ```

  - Replace all `std::cout` with `FORGE_DEBUG`:

    ```cpp
    // Old: std::cout << "Task Number = " << tasks << std::endl;
    // New: FORGE_DEBUG("Task Number = {}", tasks);
    ```
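The expression-template pitfall can be made concrete with a toy lazy-add type (AddExpr and eval are invented for illustration; Armadillo's expression nodes are analogous but more elaborate): the expression object stores references to its operands, so capturing it with `auto` past the operands' lifetimes dangles, while assigning into a named concrete vector forces evaluation while the operands are still alive.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Minimal expression-template sketch: AddExpr holds REFERENCES to its
// operands and computes elements lazily, like Armadillo's % and * nodes.
struct AddExpr {
    const std::vector<double>& a;
    const std::vector<double>& b;
    double operator[](std::size_t i) const { return a[i] + b[i]; }
    std::size_t size() const { return a.size(); }
};

AddExpr operator+(const std::vector<double>& a, const std::vector<double>& b)
{
    return AddExpr{a, b};
}

// Forcing evaluation into a concrete vector (the safe pattern mandated
// above for .getArmaComplex() results) copies the values out while the
// operands are alive.
std::vector<double> eval(const AddExpr& e)
{
    std::vector<double> out(e.size());
    for (std::size_t i = 0; i < e.size(); ++i) out[i] = e[i];
    return out;
}
```

The unsafe form is `auto bad = x + make_temp();` — once the full statement ends, the temporary dies and `bad` holds a dangling reference, exactly the failure mode `auto` + `.getArmaComplex()` produces.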
- [ ] Step 3: Verify CMakeLists links Boost.MPI and Boost.Serialization

  Check that apps/MPI/CMakeLists.txt links Boost::mpi and Boost::serialization. These may already be pulled in transitively through ForgeCommon (the main CMakeLists.txt adds them to LIBS). If not, add:

  ```cmake
  target_link_libraries(forgePcSenseMPI ${mpiLIBS} ForgeCommon Boost::mpi Boost::serialization)
  ```
- [ ] Step 4: Build MPI to verify

  ```sh
  cmake --build build_mpi -j4 2>&1 | grep "error:" | head -20
  ```
- [ ] Step 5: Commit

  ```sh
  git add apps/MPI/mpipcSENSE.h apps/MPI/mpipcSENSE.cpp
  git commit -m "refactor(mpipcSENSE): modernize to forge types, update reconSolve signature"
  ```
### Task 4: Modernize mpipcSENSETimeSeg types and operator signatures
Files:
- Modify: apps/MPI/mpipcSENSETimeSeg.h
- Modify: apps/MPI/mpipcSENSETimeSeg.cpp
Same pattern as Task 3. Key differences:
- Has additional GObj (Gnufft) and AObj (TimeSegmentation) members
- Constructor takes extra L and type parameters
- Same MPI gather/broadcast patterns
- [ ] Step 1: Update mpipcSENSETimeSeg.h (same type conversions as Task 3)
- [ ] Step 2: Update mpipcSENSETimeSeg.cpp (same pattern, update explicit instantiation)
- [ ] Step 3: Build MPI to verify
- [ ] Step 4: Commit

  ```sh
  git add apps/MPI/mpipcSENSETimeSeg.h apps/MPI/mpipcSENSETimeSeg.cpp
  git commit -m "refactor(mpipcSENSETimeSeg): modernize to forge types, update reconSolve signature"
  ```
## Chunk 3: MPI Executable Modernization
Common pattern for all 3 executables. Each gets:
- New includes: `ForgeExitCodes.hpp`, `ForgeLog.hpp`, `SignalHandler.hpp`, `<cstdio>`
- Full modern CLI: `--version`, `--device` (string, default "auto"), `--log-level`, `--no-tui`, `--no-jsonl`, `--jsonl-output`, `--mpi-log-all`, `--voxel-size`, `--no-orient`
- `po::notify()` moved AFTER `--help`/`--version` checks
- FORGE_LOG_INIT + PG_TUI_SPAWN_MODE (rank 0 only for TUI)
- Rank-based logging suppression (non-rank-0 gets file-only unless `--mpi-log-all`)
- `forge_install_signal_handlers()` on all ranks
- ForgeExitCode enum for all return values
- Encoding limits: raw `.maximum` with `<=` loops (remove `+1` convention)
- All `std::cout` → `FORGE_DEBUG`/`FORGE_INFO`
- Convert local variables to forge types
- Progress tracking on rank 0
- FORGE_SET_IMAGE_DIMS, NIfTI orientation support
### Task 5: Modernize forgeSenseMPI.cpp
Files:
- Modify: apps/MPI/forgeSenseMPI.cpp
- [ ] Step 1: Add new includes

  After the existing includes, add:

  ```cpp
  #include "../forge/Core/ForgeExitCodes.hpp"
  #include "../forge/Core/ForgeLog.hpp"
  #include "../forge/Core/SignalHandler.hpp"
  #include <cstdio>
  ```
- [ ] Step 2: Add full modern CLI options
Replace the current desc.add_options() block with the full set matching primary
executables plus MPI-specific flags. Add: --version, --device (string,
default "auto"), --log-level, --no-tui, --no-jsonl, --jsonl-output,
--mpi-log-all, --voxel-size, --no-orient.
- [ ] Step 3: Fix po::notify ordering

  Move po::notify(vm) AFTER the --help/--version checks:

  ```cpp
  po::store(po::parse_command_line(argc, argv, desc), vm);
  if (vm.count("help")) {
      std::cout << desc << std::endl;
      return static_cast<int>(ForgeExitCode::Success);
  }
  if (vm.count("version")) {
      std::cout << "FORGE v" << FORGE_VERSION_STRING << " (forgeSenseMPI) built "
                << __DATE__ << std::endl;
      return static_cast<int>(ForgeExitCode::Success);
  }
  po::notify(vm);
  ```
- [ ] Step 4: Add logging and signal handler initialization

  After po::notify, add the JSONL flag processing block (same as the primary executables), then:

  ```cpp
  FORGE_LOG_INIT(vm["log-level"].as<std::string>(), "forge.log", no_jsonl, jsonl_output);
  PG_TUI_SPAWN_MODE(!no_tui, true);
  // Suppress non-rank-0 stderr output unless --mpi-log-all
  if (world.rank() != 0 && !vm["mpi-log-all"].as<bool>()) {
      // Reinit with file-only logging (no stderr/JSONL sink)
      auto level = spdlog::get("powergrid") ? spdlog::get("powergrid")->level() : spdlog::level::info;
      auto file_sink = std::make_shared<spdlog::sinks::rotating_file_sink_mt>(
          "forge.log", 10UL * 1024UL * 1024UL, 3);
      file_sink->set_level(spdlog::level::debug);
      auto logger = std::make_shared<spdlog::logger>("powergrid",
          spdlog::sinks_init_list{file_sink});
      logger->set_level(level);
      spdlog::set_default_logger(logger);
  }
  if (world.rank() == 0) {
      FORGE_TUI_START("forgeSenseMPI");
  }
  forge_install_signal_handlers();
  ```
- [ ] Step 5: Add --device auto handling

  ```cpp
  std::string device_str = vm["device"].as<std::string>();
  if (device_str == "auto") {
  #ifdef OPENACC_GPU
      int ngpus = acc_get_num_devices(acc_device_nvidia);
      if (ngpus > 0) {
          int gpunum = world.rank() % ngpus;
          acc_set_device_num(gpunum, acc_device_nvidia);
          FORGE_INFO("Rank {}: using GPU device {} (auto, {} GPUs available)",
              world.rank(), gpunum, ngpus);
      } else {
          acc_set_device_type(acc_device_host);
          FORGE_INFO("Rank {}: no GPUs available, using host", world.rank());
      }
  #elif defined(METAL_COMPUTE)
      FORGE_INFO("Rank {}: using Metal GPU backend (device 0)", world.rank());
  #else
      FORGE_INFO("Rank {}: CPU-only build", world.rank());
  #endif
  } else {
      int device = std::stoi(device_str);
      if (device < 0) {
          FORGE_ERROR("Invalid --device value: {}. Must be >= 0.", device);
          return static_cast<int>(ForgeExitCode::ValidationError);
      }
  #ifdef OPENACC_GPU
      acc_set_device_num(device, acc_device_default);
  #endif
  }
  ```
- [ ] Step 6: Convert local variables to forge types
Key variables that MUST change (will not compile otherwise):
- ix, iy, iz → forgeCol<float> (required by initImageSpaceCoords)
- data → forgeCol<forgeComplex<float>> (required by reconSolve first param)
- ImageTemp → forgeCol<forgeComplex<float>> (return type of reconSolve)
- FM, sen → forgeCol<float>, forgeCol<forgeComplex<float>> (required by
processISMRMRDInput)
Variables that CAN remain Armadillo (reconSolve still takes Col<T1> for these):
- kx, ky, kz, tvec — Col<float> is fine
- senSlice, fmSlice — Armadillo types are fine (SENSE constructor takes them)
Armadillo expression template pitfall (from MEMORY.md): Never use auto to
capture getArmaComplex() * getArma() — must use explicit
arma::Col<std::complex<T>> result = ... to force evaluation before temporaries
are destroyed.
Update processISMRMRDInput<float>() call to match primary executable pattern.
- [ ] Step 7: Fix encoding limits convention

  Change all encoding limit extractions from +1 to the raw .maximum:

  ```cpp
  int NShotMax = hdr.encoding[0].encodingLimits.kspace_encoding_step_1->maximum;
  int NParMax = hdr.encoding[0].encodingLimits.kspace_encoding_step_2->maximum;
  // etc.
  ```

  Update loop bounds from < NSliceMax to <= NSliceMax (matching the primary executables).
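The two conventions visit exactly the same indices because ISMRMRD maxima are inclusive (a maximum of N means indices 0..N exist). A tiny sketch (function names invented) confirming the equivalence of looping `i <= maximum` against the old store-max+1, loop-with-< style:

```cpp
#include <cassert>

// New convention: store the raw inclusive maximum, loop with <=.
int count_inclusive(int maximum)
{
    int count = 0;
    for (int i = 0; i <= maximum; ++i) ++count;
    return count;
}

// Old convention: store maximum + 1 as an exclusive bound, loop with <.
int count_plus_one(int maximum)
{
    int count = 0;
    const int bound = maximum + 1;
    for (int i = 0; i < bound; ++i) ++count;
    return count;
}
```

This is why Task 6 calls the normalization a style change rather than a bug fix.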
- [ ] Step 8: Replace all std::cout with FORGE_DEBUG/FORGE_INFO
- [ ] Step 9: Add progress tracking (rank 0 only)

  ```cpp
  if (world.rank() == 0) {
      FORGE_SET_IMAGE_DIMS(Nx, Ny, Nz);
      // Create progress bar and pass recon_bar_idx to reconSolve
  }
  ```
- [ ] Step 10: Add NIfTI orientation support
Same pattern as primary executables: extractNiftiOrientation, --voxel-size
parsing, --no-orient, pass orient to writeNiftiMagPhsImage.
- [ ] Step 11: Add FORGE_TUI_EXIT

  ```cpp
  if (world.rank() == 0) {
      FORGE_TUI_EXIT(static_cast<int>(ForgeExitCode::Success));
  }
  return static_cast<int>(ForgeExitCode::Success);
  ```
- [ ] Step 12: Build MPI to verify

  ```sh
  cmake --build build_mpi --target forgeSenseMPI -j4 2>&1 | tail -10
  ```
- [ ] Step 13: Commit

  ```sh
  git add apps/MPI/forgeSenseMPI.cpp
  git commit -m "feat(forgeSenseMPI): full modernization — forge types, CLI, logging, signals"
  ```
### Task 6: Modernize forgePcSenseMPI.cpp
Files:
- Modify: apps/MPI/forgePcSenseMPI.cpp
Same pattern as Task 5. Key differences:
- Uses mpipcSENSE<float> operator (MPI-distributed)
- Has broadcast-then-reconstruct pattern (rank 0 reads data, broadcasts to all)
- Encoding limits already use raw .maximum (correct). Loops already use <=
(correct) except for NRep < NRepMax + 1 at line 204 — normalize to
NRep <= NRepMax for consistency. This is NOT a bug fix (functionally
equivalent), just a style normalization.
- [ ] Steps 1-13: Same as Task 5, adapted for forgePcSenseMPI patterns.

  ```sh
  git add apps/MPI/forgePcSenseMPI.cpp
  git commit -m "feat(forgePcSenseMPI): full modernization — forge types, CLI, logging, signals"
  ```
### Task 7: Modernize forgePcSenseMPI_TS.cpp
Files:
- Modify: apps/MPI/forgePcSenseMPI_TS.cpp
Same pattern as Task 5. Key differences:
- Uses NON-MPI pcSenseTimeSeg<float> operator (not mpipcSENSETimeSeg)
- Has same task-distribution pattern as forgeSenseMPI
- Encoding limits use +1 convention — normalize
- [ ] Steps 1-13: Same as Task 5.

  ```sh
  git add apps/MPI/forgePcSenseMPI_TS.cpp
  git commit -m "feat(forgePcSenseMPI_TS): full modernization — forge types, CLI, logging, signals"
  ```
## Chunk 4: Testing, Documentation, Verification
### Task 8: Verify full MPI build
- [ ] Step 1: Clean MPI build

  ```sh
  rm -rf build_mpi
  cmake -B build_mpi -S . -DMETAL_COMPUTE=ON -DOPENACC_GPU=OFF -DMPISupport=ON
  cmake --build build_mpi -j4 2>&1 | tail -20
  ```
- [ ] Step 2: Verify non-MPI tests still pass

  ```sh
  cmake --build build --target cpu_tests --target metal_tests -j4
  ./build/cpu_tests '~[Benchmark]'
  ./build/metal_tests '~[Benchmark]'
  ```
- [ ] Step 3: Verify MPI executables run

  ```sh
  ./build_mpi/forgeSenseMPI --version 2>&1
  ./build_mpi/forgePcSenseMPI --version 2>&1
  ./build_mpi/forgePcSenseMPI_TS --version 2>&1
  ```

  Each should print `FORGE v0.10.0 ({binary}) built ...`. Also spot-check help output:

  ```sh
  ./build_mpi/forgeSenseMPI --help 2>&1 | head -5
  ```
- [ ] Step 4: Commit any build fixes
### Task 9: Add CLI tests for MPI executables
Files:
- Modify: forge/Tests/CLITests.cpp (or apps/MPI/CMakeLists.txt for paths)
- [ ] Step 1: Add MPI binary path defines
In CMakeLists.txt, add compile definitions for MPI test binary paths (similar
to existing FORGE_SENSE_PATH etc.), or use runtime detection.
- [ ] Step 2: Add CLI tests

  ```cpp
  #ifdef FORGE_SENSE_MPI_PATH
  TEST_CASE("forgeSenseMPI --version exits 0", "[CLI_MPI]")
  {
      auto [code, output] = run_command(
          std::string(FORGE_SENSE_MPI_PATH) + " --version");
      REQUIRE(code == 0);
      REQUIRE(output.find("FORGE v") != std::string::npos);
  }

  TEST_CASE("forgeSenseMPI --help exits 0", "[CLI_MPI]")
  {
      auto [code, output] = run_command(
          std::string(FORGE_SENSE_MPI_PATH) + " --help");
      REQUIRE(code == 0);
  }
  #endif
  ```
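For reference, a run_command() helper of the shape these tests assume could be sketched with POSIX popen as below. This is an assumption about the helper's signature; the actual implementation in CLITests.cpp may differ.

```cpp
#include <array>
#include <cassert>
#include <cstdio>
#include <string>
#include <sys/wait.h>
#include <utility>

// Hedged sketch: run a shell command, capture its combined output, and
// return (exit code, output). Hypothetical stand-in for the existing
// test helper, not its real implementation.
std::pair<int, std::string> run_command(const std::string& cmd)
{
    std::array<char, 256> chunk{};
    std::string output;
    FILE* pipe = popen((cmd + " 2>&1").c_str(), "r");
    if (!pipe) return {-1, output};
    while (fgets(chunk.data(), chunk.size(), pipe) != nullptr) {
        output += chunk.data();
    }
    // pclose() returns a wait()-style status; extract the exit code.
    const int status = pclose(pipe);
    const int code = (status != -1 && WIFEXITED(status)) ? WEXITSTATUS(status) : -1;
    return {code, output};
}
```

The `2>&1` redirection matters for the `--version` test, since some CLIs print version banners to stderr.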
- [ ] Step 3: Build and run

  ```sh
  cmake --build build_mpi --target cpu_tests -j4
  ./build_mpi/cpu_tests "[CLI_MPI]"
  ```
- [ ] Step 4: Commit

  ```sh
  git add forge/Tests/CLITests.cpp CMakeLists.txt
  git commit -m "test: add CLI tests for MPI executables"
  ```
### Task 10: Update documentation
Files:
- Modify: README.md
- Modify: CHANGELOG.md
- Modify: CLAUDE.md
- [ ] Step 1: Update README.md
Add an MPI section documenting:
- --device auto (rank-based GPU assignment) vs explicit device selection
- --mpi-log-all for multi-rank logging
- Example: `mpirun -np 4 forgeSenseMPI --device auto --no-tui [args]`
- [ ] Step 2: Update CHANGELOG.md

  Add under [Unreleased]:

  ```markdown
  ### Fixed
  - MPI build: removed custom op_circshift (conflicts with Armadillo 15.2.3 native).
  - MPI build: updated reconSolve extern template signatures to match modern forgeCol types.
  - MPI executables: fixed po::notify() ordering (was before --help check).
  - MPI executables: normalized encoding limit conventions (raw .maximum with <= loops).

  ### Changed
  - MPI executables fully modernized: forge types, ForgeLog structured logging,
    ForgeExitCodes, signal handling, --version/--device/--log-level/--no-tui/
    --no-jsonl/--jsonl-output/--mpi-log-all flags.
  - Added Boost.Serialization support to forgeCol and forgeMat for MPI communication.
  - fftshift.hpp: uses Armadillo native circshift() (argument order: shift, dim).
  ```
- [ ] Step 3: Update CLAUDE.md
Update the Active Branches section and note MPI executables are modernized.
- [ ] Step 4: Commit

  ```sh
  git add README.md CHANGELOG.md CLAUDE.md
  git commit -m "docs: document MPI modernization and op_circshift removal"
  ```
### Task 11: Final verification
- [ ] Step 1: Run clang-format

  ```sh
  find forge apps tools \( -name "*.h" -o -name "*.hpp" -o -name "*.cpp" \) ! -name "*.mm" -print0 | xargs -0 clang-format -i
  find forge apps tools \( -name "*.h" -o -name "*.hpp" -o -name "*.cpp" \) ! -name "*.mm" -print0 | xargs -0 clang-format --dry-run --Werror 2>&1
  ```

  Commit if needed: `git commit -m "style: apply clang-format"`
- [ ] Step 2: Full non-MPI test run

  ```sh
  ./build/cpu_tests '~[Benchmark]'
  ./build/metal_tests '~[Benchmark]'
  ```
- [ ] Step 3: Full MPI build verification

  ```sh
  cmake --build build_mpi -j4 2>&1 | grep "error:" | head -5
  ```
- [ ] Step 4: MPI executable smoke test

  ```sh
  ./build_mpi/forgeSenseMPI --version
  ./build_mpi/forgePcSenseMPI --version
  ./build_mpi/forgePcSenseMPI_TS --version
  ```