diff --git a/.gitignore b/.gitignore
index 435cadd12..ec8514e6e 100644
--- a/.gitignore
+++ b/.gitignore
@@ -2,7 +2,6 @@ __pycache__/
 *.pyc
 m5out/
 packer
-
 # For jekyll
 _site
 .jekyll-cache
diff --git a/src/npb-24.04-imgs/.gitignore b/src/npb-24.04-imgs/.gitignore
new file mode 100644
index 000000000..4a57e240e
--- /dev/null
+++ b/src/npb-24.04-imgs/.gitignore
@@ -0,0 +1,4 @@
+disk-image*/
+arm-ubuntu-*
+disk-image-arm-npb
+disk-image-x86-npb
\ No newline at end of file
diff --git a/src/npb-24.04-imgs/README.md b/src/npb-24.04-imgs/README.md
new file mode 100644
index 000000000..21579a78b
--- /dev/null
+++ b/src/npb-24.04-imgs/README.md
@@ -0,0 +1,186 @@
+---
+title: NPB ubuntu 24.04 disk images
+tags:
+    - x86
+    - arm
+    - fullsystem
+permalink: resources/npb-24.04-imgs
+shortdoc: >
+    This resource implements the NPB benchmark.
+author: ["Harshil Patel"]
+license: BSD-3-Clause
+---
+
+This document provides instructions to create an NPB Ubuntu 24.04 disk image, which, along with an example script, may be used to run NPB within gem5 simulations. The example script uses a pre-built disk image.
+
+A pre-built, gzipped disk image for x86 can be found here: [x86-ubuntu-24.04-npb-img](https://resources.gem5.org/resources/x86-ubuntu-24.04-npb-img?version=2.0.0)
+
+A pre-built, gzipped disk image for Arm can be found here:
+[arm-ubuntu-24.04-npb-img](https://resources.gem5.org/resources/arm-ubuntu-24.04-npb-img?version=2.0.0)
+
+## What's on the disk?
+
+- username: gem5
+- password: 12345
+
+- The `gem5-bridge` (m5) utility is installed in `/usr/local/bin/gem5-bridge`.
+- `libm5` is installed in `/usr/local/lib/`.
+- The headers for `libm5` are installed in `/usr/local/include/gem5-bridge`.
+- The `npb` benchmark suite with ROI annotations is included.
+
+Thus, you should be able to build packages on the disk and easily link to the gem5-bridge library.
+
+The disk has network disabled by default to improve boot time in gem5.
+
+If you want to enable networking, you need to modify the disk image and move the file `/etc/netplan/50-cloud-init.yaml.bak` to `/etc/netplan/50-cloud-init.yaml`.
+
+## Building the Disk Image
+
+### Arm-specific file requirement
+
+To create `flash0.img`, run the following commands in the `files` directory.
+
+```bash
+dd if=/dev/zero of=flash0.img bs=1M count=64
+dd if=/usr/share/qemu-efi-aarch64/QEMU_EFI.fd of=flash0.img conv=notrunc
+```
+
+**Note**: The `build-arm.sh` script creates this file for you.
+
+Assuming that you are in the `src/npb-24.04-imgs/` directory, run
+
+```sh
+./build-x86.sh          # downloads the packer binary and builds the image
+```
+
+to build the x86 disk image or
+
+```sh
+./build-arm.sh
+```
+
+to build the arm disk image.
+After this process succeeds, the disk image can be found at `npb-24.04-imgs/disk-image-x86-npb/disk-image-x86-npb` or `npb-24.04-imgs/disk-image-arm-npb/disk-image-arm-npb`, respectively.
+
+This npb image uses the prebuilt ubuntu 24.04 image as a base image. The npb image also throws the same exit events as the base image.
+
+Each benchmark also has its region of interest (ROI) annotated: it triggers a `gem5-bridge workbegin` exit event at the start of the ROI and a `gem5-bridge workend` exit event at the end.
+
+## Init Process and Exit Events
+
+This section outlines the disk image's boot process variations and the impact of specific boot parameters on its behavior.
+By default, the disk image boots with systemd in a non-interactive mode.
+Users can adjust this behavior through kernel arguments at boot time, influencing the init system and session interactivity.
+
+### Boot Parameters
+
+The disk image supports two main kernel arguments to adjust the boot process:
+
+- `no_systemd=true`: Disables systemd as the init system, allowing the system to boot without systemd's management.
+- `interactive=true`: Enables interactive mode, presenting a shell prompt to the user for interactive session management.
+
+Combining these parameters yields four possible boot configurations:
+
+1. **Default (Systemd, Non-Interactive)**: The system uses systemd for initialization and runs non-interactively.
+2. **Systemd and Interactive**: Systemd initializes the system, and the boot process enters an interactive mode, providing a user shell.
+3. **Without Systemd and Non-Interactive**: The system boots without systemd and proceeds non-interactively, executing predefined scripts.
+4. **Without Systemd and Interactive**: Boots without systemd and provides a shell for interactive use.
+
+### Note on Print Statements and Exit Events
+
+- The bold points in the sequence descriptions are `printf` statements in the code, indicating key moments in the boot process.
+- The `**` symbols mark gem5 exit events, essential for simulation purposes, dictating system shutdown or reboot actions based on the configured scenario.
+
+### Boot Sequences
+
+#### Default Boot Sequence (Systemd, Non-Interactive)
+
+- Kernel output
+- **Kernel Booted print message** **
+- Running systemd print message
+- Systemd output
+- autologin
+- **Running after_boot script** **
+- Print indicating **non-interactive** mode
+- **Reading run script file**
+- Script output
+- Exit **
+
+#### With Systemd and Interactive
+
+- Kernel output
+- **Kernel Booted print message** **
+- Running systemd print message
+- Systemd output
+- autologin
+- **Running after_boot script** **
+- Shell
+
+#### Without Systemd and Non-Interactive
+
+- Kernel output
+- **Kernel Booted print message** **
+- autologin
+- **Running after_boot script** **
+- Print indicating **non-interactive** mode
+- **Reading run script file**
+- Script output
+- Exit **
+
+#### Without Systemd and Interactive
+
+- Kernel output
+- **Kernel Booted print message** **
+- autologin
+- **Running after_boot script** **
+- Shell
+
+This detailed overview provides a foundational understanding of how different boot configurations affect the system's initialization and mode of operation.
+By selecting the appropriate parameters, users can customize the boot process for diverse environments, ranging from automated setups to hands-on interactive sessions.
+
+## Handling Exit Events in gem5
+
+The disk image triggers five exit events in total:
+
+- 3 `gem5-bridge exit` events
+- 1 `gem5-bridge workbegin` event
+- 1 `gem5-bridge workend` event
+
+To manage these events in gem5, you need to create three exit event handlers. Below is a code snippet showing how these handlers could be implemented and added to the `simulator` object in gem5:
+
+```python
+def handle_workbegin():
+    print("Done booting Linux")
+    print("Resetting stats at the start of ROI!")
+    m5.stats.reset()
+    processor.switch()
+    yield False
+
+# We expect that the ROI ends with `workend` or `simulate() limit reached`.
+def handle_workend():
+    print("Dumping stats at the end of the ROI!")
+    m5.stats.dump()
+    yield True
+
+def exit_event_handler():
+    print("First exit: Kernel booted")
+    yield False  # gem5 is now executing systemd startup
+    print("Second exit: Started `after_boot.sh` script")
+    # The after_boot.sh script is executed after the kernel and systemd have booted.
+    yield False  # gem5 is now executing the `after_boot.sh` script
+    print("Third exit: Finished `after_boot.sh` script")
+    # The after_boot.sh script will run a script if passed via m5 readfile.
+    # This is the last exit event before the simulation exits.
+    yield True
+
+simulator = Simulator(
+    board=board,
+    on_exit_event={
+        ExitEvent.WORKBEGIN: handle_workbegin(),
+        ExitEvent.WORKEND: handle_workend(),
+        ExitEvent.EXIT: exit_event_handler(),
+    },
+)
+```
+
+This script defines three handlers for different exit events (`WORKBEGIN`, `WORKEND`, and `EXIT`).
diff --git a/src/npb-24.04-imgs/arm-npb.pkr.hcl b/src/npb-24.04-imgs/arm-npb.pkr.hcl
new file mode 100644
index 000000000..4c3463633
--- /dev/null
+++ b/src/npb-24.04-imgs/arm-npb.pkr.hcl
@@ -0,0 +1,75 @@
+packer {
+  required_plugins {
+    qemu = {
+      source  = "github.com/hashicorp/qemu"
+      version = "~> 1"
+    }
+  }
+}
+
+variable "image_name" {
+  type    = string
+  default = "arm-ubuntu"
+}
+
+variable "ssh_password" {
+  type    = string
+  default = "12345"
+}
+
+variable "ssh_username" {
+  type    = string
+  default = "gem5"
+}
+
+source "qemu" "initialize" {
+  boot_command     = ["<wait130>",
+                      "gem5<enter><wait>",
+                      "12345<enter><wait>",
+                      "sudo mv /etc/netplan/50-cloud-init.yaml.bak /etc/netplan/50-cloud-init.yaml<enter><wait>",
+                      "12345<enter><wait>",
+                      "sudo netplan apply<enter><wait>",
+                      "<wait>"]
+  cpus             = "4"
+  disk_size        = "4600"
+  format           = "raw"
+  headless         = "true"
+  disk_image       = "true"
+  iso_checksum     = "sha256:eb94422a3908c6c5183c03666b278b6e8bcfbde04da3d7c3bb5374bc82e0ef48"
+  iso_urls         = ["./arm-ubuntu-24.04-20240823"]
+  memory           = "8192"
+  output_directory = "disk-image-arm-npb"
+  qemu_binary      = "/usr/bin/qemu-system-aarch64"
+  qemuargs         = [  ["-boot", "order=dc"],
+                        ["-bios", "./files/flash0.img"],
+                        ["-cpu", "host"],
+                        ["-enable-kvm"],
+                        ["-machine", "virt"],
+                        ["-machine", "gic-version=3"],
+                        ["-device","virtio-gpu-pci"],
+                        ["-device", "qemu-xhci"],
+                        ["-device","usb-kbd"],
+
+                      ]
+  shutdown_command = "echo '${var.ssh_password}'|sudo -S shutdown -P now"
+  ssh_password     = "${var.ssh_password}"
+  ssh_username     = "${var.ssh_username}"
+  ssh_wait_timeout = "60m"
+  vm_name          = "${var.image_name}"
+  ssh_handshake_attempts = "1000"
+}
+
+build {
+  sources = ["source.qemu.initialize"]
+
+  provisioner "file" {
+    source      = "npb-with-roi/NPB/NPB3.4-OMP"
+    destination = "/home/gem5/"
+  }
+
+  provisioner "shell" {
+    execute_command = "echo '${var.ssh_password}' | {{ .Vars }} sudo -E -S bash '{{ .Path }}'"
+    scripts         = ["scripts/post-installation.sh"]
+  }
+
+}
diff --git a/src/npb-24.04-imgs/build-arm.sh b/src/npb-24.04-imgs/build-arm.sh
new file mode 100755
index 000000000..33f627e42
--- /dev/null
+++ b/src/npb-24.04-imgs/build-arm.sh
@@ -0,0 +1,24 @@
+PACKER_VERSION="1.10.0"
+
+if [ ! -f ./packer ]; then
+    wget https://releases.hashicorp.com/packer/${PACKER_VERSION}/packer_${PACKER_VERSION}_linux_arm64.zip;
+    unzip packer_${PACKER_VERSION}_linux_arm64.zip;
+    rm packer_${PACKER_VERSION}_linux_arm64.zip;
+fi
+
+# make the flash0.img file
+mkdir files
+cd ./files
+dd if=/dev/zero of=flash0.img bs=1M count=64
+dd if=/usr/share/qemu-efi-aarch64/QEMU_EFI.fd of=flash0.img conv=notrunc
+cd ..
+
+# get the base image from gem5 resources
+wget https://storage.googleapis.com/dist.gem5.org/dist/develop/images/arm/ubuntu-24-04/arm-ubuntu-24.04-20240823.gz
+gunzip arm-ubuntu-24.04-20240823.gz
+
+# Install the needed plugins
+./packer init arm-npb.pkr.hcl
+
+# Build the image
+./packer build arm-npb.pkr.hcl
diff --git a/src/npb-24.04-imgs/build-x86.sh b/src/npb-24.04-imgs/build-x86.sh
new file mode 100755
index 000000000..50ec0e8ac
--- /dev/null
+++ b/src/npb-24.04-imgs/build-x86.sh
@@ -0,0 +1,21 @@
+#!/bin/bash
+
+# Copyright (c) 2024 The Regents of the University of California.
+# SPDX-License-Identifier: BSD-3-Clause
+
+PACKER_VERSION="1.10.0"
+
+if [ ! -f ./packer ]; then
+    wget https://releases.hashicorp.com/packer/${PACKER_VERSION}/packer_${PACKER_VERSION}_linux_amd64.zip;
+    unzip packer_${PACKER_VERSION}_linux_amd64.zip;
+    rm packer_${PACKER_VERSION}_linux_amd64.zip;
+fi
+
+wget https://storage.googleapis.com/dist.gem5.org/dist/develop/images/x86/ubuntu-24-04/x86-ubuntu-24-04-v2.gz
+gunzip x86-ubuntu-24-04-v2.gz
+
+# Install the needed plugins
+./packer init x86-npb.pkr.hcl
+
+# Build the image
+./packer build x86-npb.pkr.hcl
\ No newline at end of file
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/Changes.log b/src/npb-24.04-imgs/npb-with-roi/NPB/Changes.log
new file mode 100644
index 000000000..c91cf488c
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/Changes.log
@@ -0,0 +1,564 @@
+###########################################
+# Modification History of NPB3.x          #
+# ------------------------------          #
+#   NPB development team                  #
+#   NASA Ames Research Center             #
+#   npb@nas.nasa.gov                      #
+#   http://www.nas.nasa.gov/Software/NPB/ #
+###########################################
+
+
+------------------------------------------------------
+Changes in NPB3.4.2
+      ( NPB3.4-MPI, NPB3.4-OMP )
+------------------------------------------------------
+[20-Jul-20]
+
+This is a bug-fix release with following changes.
+
+  o Verification change for the EP benchmark (NPB-MPI and NPB-OMP)
+    - No. of Gaussian pairs is now part of the verification to broaden
+      the coverage
+
+    - Due to numeric sensitivity in subtracting two close numbers,
+      the verification of (SX,SY) could fail for the Class F problem
+      for the given threshold (1.d-8).  A new scheme has been implemented
+      to use the absolute values of (X,Y) in calculating (SX,SY) to mitigate
+      the sensitivity.  There is no change in the number of operations, but
+      the verification values have been regenerated.
+
+  o Change in NPB-MPI
+    - add back the VEC versions of BT and LU and make them available via
+      make option "VERSION=vec"
+
+    - minor format fix in common/print_results.f90
+
+    - fixed a bug in the BT-IO benchmark that can cause integer overflow
+      in CLASS=D or larger problems.  Setting FORTRAN_REC_SIZE in make.def
+      is no longer required.
+
+
+------------------------------------------------------
+Changes in NPB3.4.1
+      ( NPB3.4-MPI, NPB3.4-OMP )
+------------------------------------------------------
+[15-Feb-20]
+
+This is a minor release with following changes.
+
+  o Changed Fortran sources from fixed form to free form
+
+  o Change in the NPB-MPI version
+     - Fixed an inconsistency in enforcing process count requirement in
+       different benchmarks.  The environment variable NPB_NPROCS_STRICT
+       can be used to turn off the enforcement.
+
+  o Change in the NPB-OMP version
+     - Fixed the report of Fortran compiler flag (F77 -> FC).
+
+     - The blocking factor for FT can now be set via make option
+       "VERSION=blk<n>"
+
+
+------------------------------------------------------
+Changes in NPB3.4
+      ( NPB3.4-MPI, NPB3.4-OMP )
+------------------------------------------------------
+[13-May-18]
+
+1. General
+
+ - The serial version of NPBs (NPB-SER) is no longer included in
+   the distribution.  The same functionality can be achieved by
+   the OpenMP version compiled with OpenMP disabled.
+
+ - Version 3.4 uses Fortran modules and allocatable arrays to define 
+   and manage global data (to replace common blocks), and Fortran 2003 
+   IEEE arithmetic function to catch the NaN condition during verification.
+
+   So, the version requires a compiler that supports these features.
+   Examples of a few compilers that are known to work:
+      Intel compiler v12+, GCC v5+, PGI v10+.
+
+ - The environment variable NPB_TIMER_FLAG is now used to enable 
+   additional timers.  This method supersedes the use of the file
+   "timer.flag" in the working directory.
+
+ - The MPIF77 or F77 flag in make.def is renamed to MPIFC or FC to match
+   with the fact that a Fortran 90 or newer compiler is required.
+
+2. MPI version
+
+ - NPB3.4-MPI added the class E problem size for IS, and the class F
+   problem size for BT, LU, SP, CG, EP, FT, and MG.
+
+ - Version 3.4 uses the dynamic memory allocation feature
+   in Fortran 90 so that separate compilations for different
+   process counts are no longer necessary.  The number of processes
+   is solely determined and checked at runtime.
+
+ - The LU benchmark improvement:
+      * Reduced memory usage for working arrays (a,b,c,d) in the solver.
+        This could improve performance in some cases.
+
+      * Relaxed the number of processes allowed.  For example, the square
+        number of processes (3x3=9) is now allowed.
+
+ - The vector codes for the BT and LU benchmarks have been removed
+   due to the fact that these implementations were not portable and
+   successful vectorization highly depends on the compiler used.
+
+3. OMP version
+
+ - Added the class E problem size for IS, and the class F problem 
+   size for BT, LU, SP, CG, EP, FT, and MG.
+
+ - Improved loop-level parallelism with the use of the OpenMP
+   COLLAPSE clause available since OpenMP 3.0.  This version 
+   requires an OpenMP compiler that supports this feature.
+
+ - Changes specific to LU:
+      * The thread synchronization in the pipelined version of LU was
+        changed to use ATOMIC read/write available from OpenMP 3.0.
+
+      * Re-introduced the hyperplane implementation of LU in the 
+        distribution, which is accessible via the VERSION=HP make
+        option during compilation.
+
+      * Included a third version of LU that uses the DOACROSS feature 
+        of OpenMP 4.0.  This version requires an OpenMP compiler that 
+        supports this feature.
+
+ - Changes specific to BT and SP:
+      * Data access in RHS has been improved for better performance.
+
+      * Included a version with blocking factor in the solver to
+        improve cache performance. This version can be selected via 
+        the VERSION=BLK make option during compilation and supersedes 
+        the "vector" version that was introduced in version 3.3.
+
+ - Changes specific to UA:
+      * Included a version that uses array reduction for atomic updates.
+        This version is selectable via the VERSION=rd make option 
+        during compilation.
+
+
+------------------------------------------------------
+Changes in NPB3.3.1
+      (NPB3.3-SER, NPB3.3-OMP, NPB3.3-MPI )
+------------------------------------------------------
+[17-Feb-09]
+
+This is a bug fixing release of NPB3.3.
+
+1. All versions
+
+ - sys/setparams.c: fixed a problem in dealing with quoted (") flags
+   from make.def when producing npbparams.h for C.
+
+ - CG: ensure 'implicit none' used in all subroutines.
+
+2. MPI version
+
+ - Additional timers can be used for profiling purpose, similar
+   to those already included in the OMP and SER versions.
+
+ - LU:
+   * code clean up (suggested by Rob Van der Wijngaart)
+      > avoid using MPI_ANY_SOURCE in exchange_*.f, which might 
+        alter performance in some cases.
+      > delete references to sethyper and 'icomm*', which are 
+        no longer used since NPB2.2.
+   * change the low-bound limit on the sub-domain size in subdomain.f
+     from 4 to 3 in order to increase allowable process counts.
+   * allow number of processes other than power of two.
+
+ - FT: fix a non-portable way of broadcasting input parameters
+      (pointed out by Art Lazanoff)
+
+ - BT: include 'btio_cleanup' as part of the I/O timing
+
+3. OMP and SER versions
+
+ - DC: fix access to out-of-bound array elements in adc.c
+      Reported by Per Larsen of Denmark <pl@imm.dtu.dk>
+
+ - UA: fix the use of uninitialized array 'sje' in mortar_vertex() by
+      adding "call nr_init[_omp](sje,4*6*nelt,0)" in the main program.
+
+ - MG, UA: include additional timers for profiling purpose.
+
+ - Executables now use ".x" as a name extension
+
+
+------------------------------------------------------
+Changes in NPB3.3
+      (NPB3.3-SER, NPB3.3-OMP, NPB3.3-MPI )
+------------------------------------------------------
+[02-Aug-07]
+
+1. New and improvements
+
+ - The Class E problem has been introduced in seven of the benchmarks
+   (BT, SP, LU, CG, MG, FT, and EP) in all three implementations.
+
+ - The Class D problem has been added to the IS benchmark in all 
+   three implementations.  It requires the compiler support of 
+   64-bit "long" type in C.  The MPI version of IS now allows runs 
+   up to 1024 processes.
+
+ - The Bucket Sort option (USE_BUCKETS) has been added to
+   the OpenMP version of IS and made as the default.
+
+ - Introduced the "twiddle" array in the OpenMP FT benchmark,
+   which has been used in the MPI and SER versions and seems 
+   to improve performance for larger problem sizes.
+
+ - Merged vector codes for the BT and LU benchmarks into
+   the release.
+
+ - Updates to BTIO (MPI/BT with IO subtypes):
+    * added I/O stats (I/O timing, data size written, I/O data rate)
+    * added an option for interleaving reads between writes through
+      the inputbt.data file.  Although the data file size would be
+      smaller as a result, the total amount of data written is still
+      the same.
+
+ - Made documents more consistent throughout different versions
+   (README and README.install).
+
+2. Bug fixes
+
+ - MPI/FT: fixed a verification failure for cases where NX/=NY 
+   and the 2D decomposition are used.  The bug occurred at least
+   for (Class D, NPROCS=2048) and (Class B, NPROCS=512).
+
+   fixed an output printing format problem that occurred when
+   the number of processes >= 1000.
+
+ - MPI/SP: fixed a performance regression due to improper
+   padding of array dimensions.
+
+ - MPI/IS: minor fix to support large processor counts (>=512).
+
+ - OMP/UA: fixed a race condition in mason.f, avoided the use 
+   of the LASTPRIVATE directive.
+
+ - OMP/LU: minor fix in data flushing for pipelining.
+
+ - DC: There are a number of fixes -
+   * fixed segmentation fault in both OMP and SER versions
+     caused by accessing zero-length array elements.
+     Reported by Jeff Odom <jodom@cs.umd.edu>.
+
+   * fixed a race in reporting benchmark timing in the OMP version
+
+   * fixed the use of timer in the OMP version, which limited
+     the number of threads to 64.  The number of threads is now
+     lifted to a maximum of MAX_NUMBER_OF_TASKS (=256).
+
+   * made the benchmark output consistent with other NPBs.
+
+ - fixed a use of uninitialized variable in MPI/sys/setparams.c.
+   setparams in all three versions was updated to deal with 
+   make.def that contains carriage-return character ('\r').
+
+ - SER/FT: added 'implicit none' to all missing places.
+
+ - SER/IS: fixed missing variable declarations for the Bucket 
+   Sort option (when USE_BUCKETS is defined).
+
+3. Others
+
+ - The default value for collbuf_nodes in the BT I/O benchmark
+   is now set to 0, indicating no file hints will be used.
+   The setting can be changed by using the "inputbt.data" file.
+
+ - The hyperplane version of LU (LU-HP) is no longer included 
+   in the distribution.
+
+
+------------------------------------------------------
+Changes in NPB3.2.1
+      (NPB3.2-SER, NPB3.2-OMP, NPB3.2-MPI )
+------------------------------------------------------
+[27-Jul-05]
+
+This is a bug fixing release of NPB3.2.
+
+1. MPI version
+  - sys/setparams.c: removed a duplicated statement for writing
+      FT parameters and made invalid SUBTYPE as an error condition.
+      The 'duplicated statement' problem was fixed in NPB3.2 (See 
+      the note below).  However, during the final updating process, 
+      the fix was left out, even though the log file was updated.
+
+  - BT: included SUBTYPE=EPIO in the I/O verification.
+
+  - LU: bcast_inputs.f: fixed wrong data type (dp_type) used for 
+      communicating integers (nx0,ny0,nz0) with the correct type 
+      MPI_INTEGER.
+
+  - MG: fixed a mis-calculation of parameter "nr" in globals.h 
+      that caused run-time failure for NPROCS >= 512 
+      (reported by Donald Ferry of Cray).  Expanded to limit to 
+      131072 processes and added an error checking code.
+
+      The use of MPI_ANY_SOURCE for MPI_Irecv inside subroutine
+      ready() could cause MPI_Wait return a message meant for
+      the wrong k.  The problem is fixed with nbr(axis,-dir,k)
+      in place of MPI_ANY_SOURCE in the call to MPI_Irecv
+      (reported and suggested by Hideo Saito).
+
+2. OpenMP version
+  - EP: use THREADPRIVATE for working array storage. It should not
+      change performance but made some compiler happier.
+
+  - LU: add variable "v" to FLUSH to ensure solution data properly 
+      flushed for pipeline.  This change is needed according to
+      the OpenMP 2.5 standard.
+
+  - IS: reorganized working buffers so that the count for key 
+      population could be more naturally performed.  This version
+      uses much less stack space.
+
+  - UA: implemented atomic updates with locks in order to achieve
+      better scaling on those systems that have an inefficient
+      (or even buggy) ATOMIC implementation.
+
+
+------------------------------------------------------
+Changes in NPB3.2
+      (NPB3.2-SER, NPB3.2-OMP, NPB3.2-MPI )
+------------------------------------------------------
+[07-Jan-05]
+
+1. DC version in NPB3.2-SER was converted to C from C++
+   (CLASSES S, W, A, B). 
+   sys/setparams.c file was changed appropriately.
+   
+2. OpenMP version of DC was added to NPB3.2-OMP.
+
+3. Data Traffic benchmark DT was added to NPB3.2-MPI.
+
+[24-May-04]
+
+All versions:
+   - use assumed shape "(*)" declaration in CG
+   - fixed the use of an uninitialized variable in EP
+   - avoid using integer array for assumed shape dimensions in FT
+   - fix in UA:
+      * fix the reference to file "inputua.data"
+      * avoid overindexing
+      * avoid reference to out-of-bound array elements
+      * change declaration "real*8" to "double precision"
+
+OMP version:
+   - explicitly added "SCHEDULE(STATIC)" to the OMP version
+   - use the "omp_get_wtime()" function for timer if available
+   - removed the call to "getenv" for portability
+   - change in UA:
+      * implemented an alternative approach for atomic update
+
+MPI version:
+   - removed a duplicated declaration in FT (from setparams.c)
+   - removed a duplicated declaration in BT/full_mpiio.f
+   - fixed a missing "NPROCS=" in sys/suite.awk
+
+
+------------------------------------------------------
+Changes in NPB3.1
+      (NPB3.1-MPI, NPB3.1-SER, NPB3.1-OMP)
+------------------------------------------------------
+[22-Apr-04] NPB3.1-MPI
+
+Merged the NPB2.4-MPI branch into NPB3.1 with the following changes.
+
+  - Optimized the BT memory usage.  The new version is about 1/3 of
+    the memory used in NPB2.x.
+  - Fixed a bug in CG for running on a large number of processes
+  - Redefined the Class W size in MG so that the verification value
+    will not be too small. (see below for SER & OMP versions)
+  - Use the relative errors for verification in both CG and MG
+  - Fixed a race in 'make suite'
+
+[08-Apr-04] NPB3.1-SER and NPB3.1-OMP
+
+The following changes are made in both NPB3.1-SER and NPB3.1-OMP.
+
+1. Added the Class D problem
+   - verification values taken from NPB2.4-MPI
+   - modified variables to fit in large problem
+
+2. Improvements for LU and LU-HP:
+   - reduced the memory usage for the 'tv' variable in LU and LU-HP
+   - a more efficient memory access for variables "a,b,c,d" in LU-HP
+   - a dummy iteration added before the time step loop for consistency
+     with other benchmarks
+
+3. Improvement and fix in MG:
+   - verification in MG now uses the relative error
+     (instead of the absolute error).  This will avoid incorrect
+     verification for small reference values.
+   - redefined the class size for Class W so that the verification
+     value will not be too small.
+     In version 3.0 and earlier: 64x64x64,    40 iters
+     New size in version 3.1   : 128x128x128, 4 iters
+   - fixed incorrect verification values for Classes A and C.
+
+4. CG:
+   - use relative error for verification
+   - clean up codes for matrix initialization (makea).
+     The new code uses about 1/2 memory of the previous version.
+
+5. Fixed makefile related issues
+   - fixed dependence on make.def for files in common.
+   - fixed a race in 'make suite'
+   - added 'LU-HP' as a valid benchmark option in makefiles
+
+The following changes are made in NPB3.1-OMP.
+
+1. Included a hyper-plane version of the LU benchmark: LU-HP
+   - based on the serial version
+
+2. The dummy 'omp_lib_dum' library is no longer used for compilation 
+   without an OpenMP compiler. Conditional compilation is now used.
+
+3. Parallelization of the initialization part of MG.
+   It improves the turn-around time quite a bit for the larger
+   classes, such as class D.
+
+4. Parallelize codes for matrix initialization (makea) in CG.
+   The new code uses about 2/3 memory of the version in NPB3.0-OMP.
+
+5. Code clean up in SP so that the structure is more consistent
+   with the serial version.
+
+
+
+------------------------------------------------------
+Changes in NPB2.x MPI version
+------------------------------------------------------
+
+Changes in 2.4.1
+- fixed error in BT/Makefile (replaced "==" with "=")
+- added stub function accumulate_norms in BT/btio.f
+- changed type of Class B verification constants in BT/verify.f from 
+  single to double precision
+                                                       
+Changes in 2.4
+- Added I/O benchmark (subtype of BT).
+- Added Class D for all benchmarks except IS.
+- Reduced size of tabulated exponentials in FT.
+- Made minor changes to FT to prevent integer overflow for class D on 
+  systems with 32-bit integers. FT class D will not run on small 
+  numbers of processors anymore.
+
+
+------------------------------------------------------
+Changes in non-MPI versions of NPB (previously PBN3.0)
+      (NPB3.0-SER, NPB3.0-HPF, NPB3.0-OMP, NPB3.0-JAV)
+------------------------------------------------------
+
+[01-Mar-99] Initial Beta Release.
+
+[06-Apr-99] Based on report from Charles Grassl and Ramesh Menon (SGI).
+
+   1. NPB-SER, FT: file auxfnct.f -
+      lines 74 and 75 were interchanged:
+
+      double complex u0(d1+1,d2,d3), tmp(maxdim)
+      integer d1,d2,d3
+
+   2. NPB-OMP: The OpenMP standards requires reduction variable be scalars,
+      thus, changes made to remove the use of array variable for reduction.
+      Relevant modifications in EP, CG, LU, SP, and BT
+
+   3. NPB-OMP: Remove compiler warnings of "Referenced scalar variables 
+      use defaults" by declaring explicitly as shared.
+      Relevant modifications in FT, LU, and BT
+
+   4. NPB-OMP, README.openmp: Explicitly spell out the requirement of
+      the static scheduling (setenv OMP_SCHEDULE "static").
+
+
+[05-Oct-99] NPB3.0-non-MPI Beta Release (02)
+
+General change to all (NPB-SER, NPB-HPF, NPB-OMP) -
+   1. Update header information for all benchmarks.
+
+   2. Allow continuation lines in 'make.def' (modification done
+      in sys/setparams.c).
+
+Change made in NPB-OMP -
+   1. 'print_results' now prints Number-Of-Threads and Mflops/s/thread.
+      The printed number is the activated threads during the run, which
+      may not be the same as what's requested.
+
+   2. A initial data touch loop for array A is added in CG.
+
+   3. 'CRITICAL' section is used for reduction with array.
+      Relevant changes in EP, CG, LU, SP, and BT.
+
+   4. Reconfigure 'make.def' such that 'omp_lib_dum' can be activated
+      from the file for no directive compilation.
+
+   5. The "!$OMP END DO" seems needed before "!$OMP MASTER" in rhs.f
+      for both BT and SP for some f90 compilers.
+
+   6. "SCHEDULE(STATIC)" are used for the pipeline in LU to ensure
+      compliance with the OMP standard.
+
+Change made in NPB-HPF -
+   1. 'print_results' now prints Number-Of-Processes and Mflops/s/process.
+
+   2. Use more consistent output format (via print_results).
+
+   3. More consistent makefiles (via config/make.def).
+
+
+[04-Apr-00] NPB3.0-non-MPI Beta Release (03)
+
+Change made in NPB-OMP -
+   1. The OpenMP-C version of IS has been added, including more timers.
+
+   2. 'cprint_results' includes Number-Of-Threads and Mflops/s/thread.
+
+Change made in NPB-SER -
+   1. More timers included in IS.
+
+NPB-JAV has been included in NPB3.0-non-MPI.
+
+
+[31-May-01] NPB3.0-non-MPI Beta Release (04)
+
+Change made in NPB-OMP -
+   1. NPB-OMP/LU: Failure in verification for number of threads greater 
+      than the problem size is now fixed.
+
+   2. If OMP_NUM_THREADS is unset, the printout will report as "unset"
+      instead of "1"
+
+   3. NPB-OMP/IS: Allocating work_buff on the stack seems to cause problems
+      for large problem sizes (CLASS C).  "work_buff" is now allocated
+      by "malloc" on the heap for CLASS C.
+
+   4. NPB-OMP/IS: Reported by <RaeLyn.Crowell@compaq.com> - potential
+      synchronization problem could arise due to the use of "static"
+      variables inside "randlc()".  The declarations of these static variables
+      are moved out of randlc() and put in the threadprivate directive.
+
+General change to all (NPB-SER, NPB-HPF, NPB-OMP) -
+   1. Cleanup in makefiles
+
+
+[28-Aug-02] The Official NPB3.0 Release
+
+Change made in all -
+   1. Fixed a bogus verification for "NaN".
+
+   2. Name change from "PBN3.0" to "NPB3.0". Updated all the banners.
+
+   3. NPB-SER/FT: use a derived version from NPB2.3-serial.
+
+   4. NPB-HPF/FT: use a consistent printing format.
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-HPF.README b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-HPF.README
new file mode 100644
index 000000000..ff1e508d2
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-HPF.README
@@ -0,0 +1,4 @@
+The HPF version of NPB is not included in this distribution.
+Please download it from NPB3.0 instead.
+
+http://www.nas.nasa.gov/Software/NPB
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-JAV.README b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-JAV.README
new file mode 100644
index 000000000..b36e68676
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-JAV.README
@@ -0,0 +1,4 @@
+The Java version of NPB is not included in this distribution.
+Please download it from NPB3.0 instead.
+
+http://www.nas.nasa.gov/Software/NPB
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/Makefile b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/Makefile
new file mode 100644
index 000000000..e8439220a
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/Makefile
@@ -0,0 +1,94 @@
+SHELL=/bin/sh
+BENCHMARK=bt
+BENCHMARKU=BT
+VEC=
+
+include ../config/make.def
+
+
+OBJS = bt.o bt_data.o make_set.o initialize.o exact_solution.o \
+       exact_rhs.o set_constants.o adi.o define.o copy_faces.o \
+       rhs.o solve_subs.o x_solve$(VEC).o y_solve$(VEC).o z_solve$(VEC).o \
+       add.o error.o verify.o setup_mpi.o mpinpb.o \
+       ${COMMON}/get_active_nprocs.o \
+       ${COMMON}/print_results.o ${COMMON}/timers.o
+
+include ../sys/make.common
+
+# npbparams.h is included by bt_data module (via bt_data.o)
+
+${PROGRAM}: config
+	@if [ x$(VERSION) = xvec ] ; then	\
+		${MAKE} VEC=_vec exec;		\
+	elif [ x$(VERSION) = xVEC ] ; then	\
+		${MAKE} VEC=_vec exec;		\
+	else					\
+		${MAKE} exec;			\
+	fi
+
+exec: $(OBJS)
+	@if [ x$(SUBTYPE) = xfull -o x$(SUBTYPE) = xFULL ] ; then	\
+		${MAKE} bt-full;		\
+	elif [ x$(SUBTYPE) = xsimple -o x$(SUBTYPE) = xSIMPLE ] ; then	\
+		${MAKE} bt-simple;		\
+	elif [ x$(SUBTYPE) = xfortran -o x$(SUBTYPE) = xFORTRAN ] ; then \
+		${MAKE} bt-fortran;		\
+	elif [ x$(SUBTYPE) = xepio -o x$(SUBTYPE) = xEPIO ] ; then	\
+		${MAKE} bt-epio;		\
+	else					\
+		${MAKE} bt-bt;			\
+	fi
+
+bt-bt: ${OBJS} btio.o
+	${FLINK} ${FLINKFLAGS} -o ${PROGRAM} ${OBJS} btio.o ${FMPI_LIB}
+
+bt-full: ${OBJS} full_mpiio.o btio_common.o
+	${FLINK} ${FLINKFLAGS} -o ${PROGRAM}.mpi_io_full ${OBJS} btio_common.o full_mpiio.o ${FMPI_LIB}
+
+bt-simple: ${OBJS} simple_mpiio.o btio_common.o
+	${FLINK} ${FLINKFLAGS} -o ${PROGRAM}.mpi_io_simple ${OBJS} btio_common.o simple_mpiio.o ${FMPI_LIB}
+
+bt-fortran: ${OBJS} fortran_io.o btio_common.o
+	${FLINK} ${FLINKFLAGS} -o ${PROGRAM}.fortran_io ${OBJS} btio_common.o fortran_io.o ${FMPI_LIB}
+
+bt-epio: ${OBJS} epio.o btio_common.o
+	${FLINK} ${FLINKFLAGS} -o ${PROGRAM}.ep_io ${OBJS} btio_common.o epio.o ${FMPI_LIB}
+
+.f90.o:
+	${FCOMPILE} $<
+
+.c.o:
+	${CCOMPILE} $<
+
+
+bt.o:             bt.f90  bt_data.o mpinpb.o
+make_set.o:       make_set.f90  bt_data.o mpinpb.o
+initialize.o:     initialize.f90  bt_data.o
+exact_solution.o: exact_solution.f90  bt_data.o
+exact_rhs.o:      exact_rhs.f90  bt_data.o
+set_constants.o:  set_constants.f90  bt_data.o
+adi.o:            adi.f90  bt_data.o
+define.o:         define.f90  bt_data.o
+copy_faces.o:     copy_faces.f90  bt_data.o mpinpb.o
+rhs.o:            rhs.f90  bt_data.o
+x_solve$(VEC).o:  x_solve$(VEC).f90  bt_data.o mpinpb.o
+y_solve$(VEC).o:  y_solve$(VEC).f90  bt_data.o mpinpb.o
+z_solve$(VEC).o:  z_solve$(VEC).f90  bt_data.o mpinpb.o
+solve_subs.o:     solve_subs.f90
+add.o:            add.f90  bt_data.o
+error.o:          error.f90  bt_data.o mpinpb.o
+verify.o:         verify.f90  bt_data.o mpinpb.o
+setup_mpi.o:      setup_mpi.f90  bt_data.o mpinpb.o
+btio.o:           btio.f90  bt_data.o
+btio_common.o:    btio_common.f90  bt_data.o mpinpb.o
+fortran_io.o:     fortran_io.f90  bt_data.o mpinpb.o
+simple_mpiio.o:   simple_mpiio.f90  bt_data.o mpinpb.o
+full_mpiio.o:     full_mpiio.f90  bt_data.o mpinpb.o
+epio.o:           epio.f90  bt_data.o mpinpb.o
+bt_data.o:        bt_data$(VEC).f90 mpinpb.o npbparams.h
+	${FCOMPILE} -o $@ bt_data$(VEC).f90
+mpinpb.o:         mpinpb.f90
+
+clean:
+	- rm -f *.o *.mod *~ mputil*
+	- rm -f  npbparams.h core
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/add.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/add.f90
new file mode 100644
index 000000000..f8dd37913
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/add.f90
@@ -0,0 +1,31 @@
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine  add
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     addition of update to the vector u
+!---------------------------------------------------------------------
+
+      use bt_data
+      implicit none
+
+      integer  c, i, j, k, m
+
+      do     c = 1, ncells
+         do     k = start(3,c), cell_size(3,c)-end(3,c)-1
+            do     j = start(2,c), cell_size(2,c)-end(2,c)-1
+               do     i = start(1,c), cell_size(1,c)-end(1,c)-1
+                  do    m = 1, 5
+                     u(m,i,j,k,c) = u(m,i,j,k,c) + rhs(m,i,j,k,c)
+                  enddo
+               enddo
+            enddo
+         enddo
+      enddo
+
+      return
+      end
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/adi.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/adi.f90
new file mode 100644
index 000000000..78025ad8e
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/adi.f90
@@ -0,0 +1,21 @@
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine  adi
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      call copy_faces
+
+      call x_solve
+
+      call y_solve
+
+      call z_solve
+
+      call add
+
+      return
+      end
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/bt.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/bt.f90
new file mode 100644
index 000000000..f2f1ea9f9
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/bt.f90
@@ -0,0 +1,349 @@
+!-------------------------------------------------------------------------!
+!                                                                         !
+!        N  A  S     P A R A L L E L     B E N C H M A R K S  3.4         !
+!                                                                         !
+!                                   B T                                   !
+!                                                                         !
+!-------------------------------------------------------------------------!
+!                                                                         !
+!    This benchmark is part of the NAS Parallel Benchmark 3.4 suite.      !
+!    It is described in NAS Technical Reports 95-020 and 02-007.          !
+!                                                                         !
+!    Permission to use, copy, distribute and modify this software         !
+!    for any purpose with or without fee is hereby granted.  We           !
+!    request, however, that all derived work reference the NAS            !
+!    Parallel Benchmarks 3.4. This software is provided "as is"           !
+!    without express or implied warranty.                                 !
+!                                                                         !
+!    Information on NPB 3.4, including the technical report, the          !
+!    original specifications, source code, results and information        !
+!    on how to submit new results, is available at:                       !
+!                                                                         !
+!           http://www.nas.nasa.gov/Software/NPB/                         !
+!                                                                         !
+!    Send comments or suggestions to  npb@nas.nasa.gov                    !
+!                                                                         !
+!          NAS Parallel Benchmarks Group                                  !
+!          NASA Ames Research Center                                      !
+!          Mail Stop: T27A-1                                              !
+!          Moffett Field, CA   94035-1000                                 !
+!                                                                         !
+!          E-mail:  npb@nas.nasa.gov                                      !
+!          Fax:     (650) 604-3957                                        !
+!                                                                         !
+!-------------------------------------------------------------------------!
+
+!---------------------------------------------------------------------
+!
+! Authors: R. F. Van der Wijngaart
+!          T. Harris
+!          M. Yarrow
+!
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+       program MPBT
+!---------------------------------------------------------------------
+
+       use bt_data
+       use mpinpb
+
+       implicit none
+
+       integer i, niter, step, c, error, fstatus
+       double precision navg, mflops, mbytes, n3
+
+       external timer_read
+       double precision t, tmax, iorate(2), tpc, timer_read
+       logical verified
+       character class, cbuff*40
+       double precision t1(t_last), tsum(t_last),  &
+     &                  tming(t_last), tmaxg(t_last)
+       character        t_recs(t_last)*8
+
+       integer wr_interval
+
+       data t_recs/'total', 'i/o', 'rhs', 'xsolve', 'ysolve', 'zsolve',  &
+     &             'bpack', 'exch', 'xcomm', 'ycomm', 'zcomm',  &
+     &             ' totcomp', ' totcomm'/
+
+       call setup_mpi
+       if (.not. active) goto 999
+
+!---------------------------------------------------------------------
+!      Root node reads input file (if it exists) else takes
+!      defaults from parameters
+!---------------------------------------------------------------------
+       if (node .eq. root) then
+
+          write(*, 1000)
+
+          call check_timer_flag( timeron )
+
+          open (unit=2,file='inputbt.data',status='old', iostat=fstatus)
+!
+          rd_interval = 0
+          if (fstatus .eq. 0) then
+            write(*,233) 
+ 233        format(' Reading from input file inputbt.data')
+            read (2,*) niter
+            read (2,*) dt
+            read (2,*) grid_points(1), grid_points(2), grid_points(3)
+            if (iotype .ne. 0) then
+                read (2,'(A)') cbuff
+                read (cbuff,*,iostat=i) wr_interval, rd_interval
+                if (i .ne. 0) rd_interval = 0
+                if (wr_interval .le. 0) wr_interval = wr_default
+            endif
+            if (iotype .eq. 1) then
+                read (2,*) collbuf_nodes, collbuf_size
+                write(*,*) 'collbuf_nodes ', collbuf_nodes
+                write(*,*) 'collbuf_size  ', collbuf_size
+            endif
+            close(2)
+          else
+            write(*,234) 
+            niter = niter_default
+            dt    = dt_default
+            grid_points(1) = problem_size
+            grid_points(2) = problem_size
+            grid_points(3) = problem_size
+            wr_interval = wr_default
+            if (iotype .eq. 1) then
+!             set number of nodes involved in collective buffering to 4,
+!             unless total number of nodes is smaller than that.
+!             set buffer size for collective buffering to 1MB per node
+!             collbuf_nodes = min(4,no_nodes)
+!             set default to No-File-Hints with a value of 0
+              collbuf_nodes = 0
+              collbuf_size = 1000000
+            endif
+          endif
+ 234      format(' No input file inputbt.data. Using compiled defaults')
+
+          call set_class(niter, class)
+
+          write(*, 1001) grid_points(1), grid_points(2), grid_points(3),  &
+     &                   class
+          write(*, 1002) niter, dt
+          write(*, 1003) total_nodes
+          if (no_nodes .ne. total_nodes) write(*, 1004) no_nodes
+          write(*, *)
+
+          if (iotype .eq. 1) write(*, 1006) 'FULL MPI-IO', wr_interval
+          if (iotype .eq. 2) write(*, 1006) 'SIMPLE MPI-IO', wr_interval
+          if (iotype .eq. 3) write(*, 1006) 'EPIO', wr_interval
+          if (iotype .eq. 4) write(*, 1006) 'FORTRAN IO', wr_interval
+
+ 1000 format(//, ' NAS Parallel Benchmarks 3.4 -- BT Benchmark',/)
+ 1001     format(' Size: ', i4, 'x', i4, 'x', i4, '  (class ', a, ')' )
+ 1002     format(' Iterations: ', i4, '    dt: ', F11.7)
+ 1003     format(' Total number of processes: ', i6)
+ 1004     format(' WARNING: Number of processes is not a square number',  &
+     &           ' (', i0, ' active)')
+ 1006     format(' BTIO -- ', A, ' write interval: ', i3 /)
+
+       endif
+
+       call mpi_bcast(niter, 1, MPI_INTEGER,  &
+     &                root, comm_setup, error)
+
+       call mpi_bcast(dt, 1, dp_type,  &
+     &                root, comm_setup, error)
+
+       call mpi_bcast(grid_points(1), 3, MPI_INTEGER,  &
+     &                root, comm_setup, error)
+
+       call mpi_bcast(wr_interval, 1, MPI_INTEGER,  &
+     &                root, comm_setup, error)
+
+       call mpi_bcast(rd_interval, 1, MPI_INTEGER,  &
+     &                root, comm_setup, error)
+
+       call mpi_bcast(timeron, 1, MPI_LOGICAL,  &
+     &                root, comm_setup, error)
+
+       call alloc_space
+
+       call make_set
+
+       do  c = 1, maxcells
+          if ( (cell_size(1,c) .gt. IMAX) .or.  &
+     &         (cell_size(2,c) .gt. JMAX) .or.  &
+     &         (cell_size(3,c) .gt. KMAX) ) then
+             print *,node, c, (cell_size(i,c),i=1,3)
+             print *,' Problem size too big for compiled array sizes'
+             goto 999
+          endif
+       end do
+
+       do  i = 1, t_last
+          call timer_clear(i)
+       end do
+
+       call set_constants
+
+       call initialize
+
+       call setup_btio
+       idump = 0
+
+       call lhsinit
+
+       call exact_rhs
+
+       call compute_buffer_size(5)
+
+!---------------------------------------------------------------------
+!      do one time step to touch all code, and reinitialize
+!---------------------------------------------------------------------
+       call adi
+       call initialize
+
+!---------------------------------------------------------------------
+!      Synchronize before placing time stamp
+!---------------------------------------------------------------------
+       do  i = 1, t_last
+          call timer_clear(i)
+       end do
+       call mpi_barrier(comm_setup, error)
+
+       call timer_start(1)
+
+       do  step = 1, niter
+
+          if (node .eq. root) then
+             if (mod(step, 20) .eq. 0 .or. step .eq. niter .or.  &
+     &           step .eq. 1) then
+                write(*, 200) step
+ 200            format(' Time step ', i4)
+             endif
+          endif
+
+          call adi
+
+          if (iotype .ne. 0) then
+              if (mod(step, wr_interval).eq.0 .or. step .eq. niter) then
+                  if (node .eq. root) then
+                      print *, 'Writing data set, time step', step
+                  endif
+                  if (step .eq. niter .and. rd_interval .gt. 1) then
+                      rd_interval = 1
+                  endif
+                  call timer_start(2)
+                  call output_timestep
+                  call timer_stop(2)
+                  idump = idump + 1
+              endif
+          endif
+       end do
+
+       call timer_start(2)
+       call btio_cleanup
+       call timer_stop(2)
+
+       call timer_stop(1)
+       t = timer_read(1)
+       t1(1) = timer_read(t_enorm)
+
+       call timer_clear(t_enorm)
+       call verify(class, verified)
+
+       call mpi_reduce(t, tmax, 1,  &
+     &                 dp_type, MPI_MAX,  &
+     &                 root, comm_setup, error)
+
+       if (iotype .ne. 0) then
+          n3 = 0.d0
+          do c = 1,ncells
+             n3 = n3 + dble(cell_size(1,c)) * cell_size(2,c) * cell_size(3,c)
+          end do
+          mbytes = n3 * 40.0 * idump * 1.0d-6
+          t1(2) = timer_read(t_enorm)
+          do i = 1, 2
+             if (i .eq. 1) then
+                t = timer_read(t_io)
+             else
+                t = timer_read(t_iov)
+             endif
+             t = t - t1(i)                      ! remove enorm time
+             if (t .ne. 0.d0) t = mbytes / t	! rate MB/s
+             t1(i) = t
+          end do
+          if (rd_interval .gt. 0) t1(1) = t1(1) * 2
+          call mpi_reduce(t1, iorate, 2,  &
+     &                    dp_type, MPI_SUM,  &
+     &                    root, comm_setup, error)
+       endif
+
+       if( node .eq. root ) then
+          n3 = dble(grid_points(1))*grid_points(2)*grid_points(3)
+          navg = (grid_points(1)+grid_points(2)+grid_points(3))/3.d0
+          if( tmax .ne. 0. ) then
+             mflops = 1.0d-6*dble(niter)*  &
+     &                (3478.8*n3-17655.7*navg**2+28023.7*navg)  &
+     &                / tmax
+          else
+             mflops = 0.d0
+          endif
+
+          if (iotype .ne. 0) then
+             mbytes = n3 * 40.0 * idump * 1.0d-6
+             do i = 1, 2
+                t1(i) = 0.0
+                if (iorate(i) .ne. 0.d0) t1(i) = mbytes / iorate(i)
+             end do
+             if (rd_interval .gt. 0) t1(1) = t1(1) * 2
+             tpc = 0.0
+             if (tmax .ne. 0.) tpc = t1(1) * 100.0 / tmax
+             write(*,1100) t1(1), tpc, t1(2), mbytes, iorate(1)
+ 1100        format(/' BTIO -- statistics:'/  &
+     &               '   I/O timing in seconds   : ', f14.2/  &
+     &               '   I/O timing percentage   : ', f14.2/  &
+     &               '   I/O timing in verify    : ', f14.2/  &
+     &               '   Total data written (MB) : ', f14.2/  &
+     &               '   I/O data rate  (MB/sec) : ', f14.2)
+          endif
+
+         call print_results('BT', class, grid_points(1),  &
+     &     grid_points(2), grid_points(3), niter, no_nodes,  &
+     &     total_nodes, tmax, mflops, '          floating point',  &
+     &     verified, npbversion,compiletime, cs1, cs2, cs3, cs4, cs5,  &
+     &     cs6, '(none)')
+       endif
+
+       if (.not.timeron) goto 999
+
+       do i = 1, t_zcomm
+          t1(i) = timer_read(i)
+       end do
+       t1(t_xsolve) = t1(t_xsolve) - t1(t_xcomm)
+       t1(t_ysolve) = t1(t_ysolve) - t1(t_ycomm)
+       t1(t_zsolve) = t1(t_zsolve) - t1(t_zcomm)
+       t1(t_comm) = t1(t_xcomm)+t1(t_ycomm)+t1(t_zcomm)+t1(t_exch)
+       t1(t_comp) = t1(t_total) - t1(t_comm)
+
+       call MPI_Reduce(t1, tsum,  t_last, dp_type, MPI_SUM,  &
+     &                 0, comm_setup, error)
+       call MPI_Reduce(t1, tming, t_last, dp_type, MPI_MIN,  &
+     &                 0, comm_setup, error)
+       call MPI_Reduce(t1, tmaxg, t_last, dp_type, MPI_MAX,  &
+     &                 0, comm_setup, error)
+
+       if (node .eq. 0) then
+          write(*, 800) no_nodes
+          do i = 1, t_last
+             tsum(i) = tsum(i) / no_nodes
+             write(*, 810) i, t_recs(i), tming(i), tmaxg(i), tsum(i)
+          end do
+       endif
+ 800   format(' nprocs =', i6, 11x, 'minimum', 5x, 'maximum',  &
+     &        5x, 'average')
+ 810   format(' timer ', i2, '(', A8, ') :', 3(2x,f10.4))
+
+ 999   continue
+       call mpi_barrier(MPI_COMM_WORLD, error)
+       call mpi_finalize(error)
+
+       end
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/bt_data.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/bt_data.f90
new file mode 100644
index 000000000..750ed8fb7
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/bt_data.f90
@@ -0,0 +1,193 @@
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+!
+!  bt_data module
+!
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+ 
+      module bt_data
+
+!---------------------------------------------------------------------
+! The following include file is generated automatically by the
+! "setparams" utility. It defines 
+!      maxcells:      the square root of the maximum number of processors
+!      problem_size:  12, 64, 102, 162 (for class S, A, B, C)
+!      dt_default:    default time step for this problem size if no
+!                     config file
+!      niter_default: default number of iterations for this problem size
+!---------------------------------------------------------------------
+
+      include 'npbparams.h'
+
+      integer           aa, bb, cc, BLOCK_SIZE
+      parameter (aa=1, bb=2, cc=3, BLOCK_SIZE=5)
+
+      integer           ncells, grid_points(3)
+      double precision  elapsed_time
+
+      double precision  tx1, tx2, tx3, ty1, ty2, ty3, tz1, tz2, tz3,  &
+     &                  dx1, dx2, dx3, dx4, dx5, dy1, dy2, dy3, dy4,  &
+     &                  dy5, dz1, dz2, dz3, dz4, dz5, dssp, dt,  &
+     &                  ce(5,13), dxmax, dymax, dzmax, xxcon1, xxcon2,  &
+     &                  xxcon3, xxcon4, xxcon5, dx1tx1, dx2tx1, dx3tx1,  &
+     &                  dx4tx1, dx5tx1, yycon1, yycon2, yycon3, yycon4,  &
+     &                  yycon5, dy1ty1, dy2ty1, dy3ty1, dy4ty1, dy5ty1,  &
+     &                  zzcon1, zzcon2, zzcon3, zzcon4, zzcon5, dz1tz1,  &
+     &                  dz2tz1, dz3tz1, dz4tz1, dz5tz1, dnxm1, dnym1,  &
+     &                  dnzm1, c1c2, c1c5, c3c4, c1345, conz1, c1, c2,  &
+     &                  c3, c4, c5, c4dssp, c5dssp, dtdssp, dttx1, bt,  &
+     &                  dttx2, dtty1, dtty2, dttz1, dttz2, c2dttx1,  &
+     &                  c2dtty1, c2dttz1, comz1, comz4, comz5, comz6,  &
+     &                  c3c4tx3, c3c4ty3, c3c4tz3, c2iv, con43, con16
+
+      integer           EAST, WEST, NORTH, SOUTH,  &
+     &                  BOTTOM, TOP
+
+      parameter (EAST=2000, WEST=3000,      NORTH=4000, SOUTH=5000,  &
+     &           BOTTOM=6000, TOP=7000)
+
+      integer maxcells, IMAX, JMAX, KMAX, MAX_CELL_DIM, BUF_SIZE
+
+      integer predecessor(3), successor(3), grid_size(3)
+      integer, allocatable ::  &
+     &        cell_coord (:,:), cell_low (:,:),  &
+     &        cell_high  (:,:), cell_size(:,:),  &
+     &        start      (:,:), end      (:,:),  &
+     &        slice      (:,:)
+
+      double precision, allocatable ::  &
+     &        us      (    :,:,:,:),  &
+     &        vs      (    :,:,:,:),  &
+     &        ws      (    :,:,:,:),  &
+     &        qs      (    :,:,:,:),  &
+     &        rho_i   (    :,:,:,:),  &
+     &        square  (    :,:,:,:),  &
+     &        forcing (  :,:,:,:,:),  &
+     &        u       (  :,:,:,:,:),  &
+     &        rhs     (  :,:,:,:,:),  &
+     &        lhsc    (:,:,:,:,:,:),  &
+     &        backsub_info(:,:,:,:),  &
+     &        in_buffer(:), out_buffer(:)
+
+      double precision, allocatable ::  &
+     &        cv  (:), rhon(:),  &
+     &        rhos(:), rhoq(:),  &
+     &        cuf (:), q   (:),  &
+     &        ue(:,:), buf (:,:)
+
+      double precision, allocatable ::  &
+     &        fjac(:, :, :),  &
+     &        njac(:, :, :),  &
+     &        lhsa(:, :, :),  &
+     &        lhsb(:, :, :)
+
+      integer west_size, east_size, bottom_size, top_size,  &
+     &        north_size, south_size, start_send_west,  &
+     &        start_send_east, start_send_south, start_send_north,  &
+     &        start_send_bottom, start_send_top, start_recv_west,  &
+     &        start_recv_east, start_recv_south, start_recv_north,  &
+     &        start_recv_bottom, start_recv_top
+
+      double precision tmp_block(5,5), b_inverse(5,5), tmp_vec(5)
+
+!---------------------------------------------------------------------
+!     These are used by btio
+!---------------------------------------------------------------------
+      integer collbuf_nodes, collbuf_size, iosize, eltext,  &
+     &        combined_btype, fp, idump, record_length, element,  &
+     &        combined_ftype, idump_sub, rd_interval
+      double precision sum(niter_default), xce_sub(5)
+      integer(kind=8) :: iseek
+
+
+!---------------------------------------------------------------------
+!     Timer constants
+!---------------------------------------------------------------------
+      integer t_total, t_io, t_rhs, t_xsolve, t_ysolve, t_zsolve,  &
+     &        t_bpack, t_exch, t_xcomm, t_ycomm, t_zcomm,  &
+     &        t_comp, t_comm, t_enorm, t_iov, t_last
+      parameter (t_total=1, t_io=2, t_rhs=3, t_xsolve=4, t_ysolve=5,  &
+     &        t_zsolve=6, t_bpack=7, t_exch=8, t_xcomm=9,  &
+     &        t_ycomm=10, t_zcomm=11, t_comp=12, t_comm=13,  &
+     &        t_enorm=12, t_iov=13, t_last=13)
+      logical timeron
+
+      end module bt_data
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine alloc_space
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+! allocate space dynamically for data arrays
+!---------------------------------------------------------------------
+
+      use bt_data
+      use mpinpb
+
+      implicit none
+
+      integer ios, ierr
+
+      MAX_CELL_DIM = (problem_size/maxcells)+1
+
+      IMAX = MAX_CELL_DIM
+      JMAX = MAX_CELL_DIM
+      KMAX = MAX_CELL_DIM
+
+      BUF_SIZE = MAX_CELL_DIM*MAX_CELL_DIM*(maxcells-1)*60+1
+
+      allocate (  &
+     &         cell_coord (3,maxcells), cell_low (3,maxcells),  &
+     &         cell_high  (3,maxcells), cell_size(3,maxcells),  &
+     &         start      (3,maxcells), end      (3,maxcells),  &
+     &         slice      (3,maxcells),  &
+     &         stat = ios)
+
+      if (ios .eq. 0) allocate (  &
+     &   forcing (5,   0:IMAX-1, 0:JMAX-1, 0:KMAX-1, maxcells),  &
+     &   u       (5,  -2:IMAX+1,-2:JMAX+1,-2:KMAX+1, maxcells),  &
+     &   rhs     (5,  -1:IMAX-1,-1:JMAX-1,-1:KMAX-1, maxcells),  &
+     &   lhsc    (5,5,-1:IMAX-1,-1:JMAX-1,-1:KMAX-1, maxcells),  &
+     &   backsub_info (5, 0:MAX_CELL_DIM, 0:MAX_CELL_DIM, maxcells),  &
+     &   in_buffer(BUF_SIZE), out_buffer(BUF_SIZE),  &
+     &         stat = ios)
+
+      if (ios .eq. 0) allocate (  &
+     &         cv  (-2:MAX_CELL_DIM+1),  rhon(-2:MAX_CELL_DIM+1),  &
+     &         rhos(-2:MAX_CELL_DIM+1),  rhoq(-2:MAX_CELL_DIM+1),  &
+     &         cuf (-2:MAX_CELL_DIM+1),     q(-2:MAX_CELL_DIM+1),  &
+     &         ue  (-2:MAX_CELL_DIM+1,5), buf(-2:MAX_CELL_DIM+1,5),  &
+     &         stat = ios)
+
+      if (ios .eq. 0) allocate (  &
+     &         fjac(5, 5, -2:MAX_CELL_DIM+1),  &
+     &         njac(5, 5, -2:MAX_CELL_DIM+1),  &
+     &         lhsa(5, 5, -1:MAX_CELL_DIM),  &
+     &         lhsb(5, 5, -1:MAX_CELL_DIM),  &
+     &         stat = ios)
+
+      if (ios .eq. 0) allocate (  &
+     &         us    (-1:IMAX, -1:JMAX, -1:KMAX, maxcells),  &
+     &         vs    (-1:IMAX, -1:JMAX, -1:KMAX, maxcells),  &
+     &         ws    (-1:IMAX, -1:JMAX, -1:KMAX, maxcells),  &
+     &         qs    (-1:IMAX, -1:JMAX, -1:KMAX, maxcells),  &
+     &         rho_i (-1:IMAX, -1:JMAX, -1:KMAX, maxcells),  &
+     &         square(-1:IMAX, -1:JMAX, -1:KMAX, maxcells),  &
+     &         stat = ios)
+
+      if (ios .ne. 0) then
+         write(*,*) 'Error encountered in allocating space'
+         call MPI_Abort(MPI_COMM_WORLD, MPI_ERR_OTHER, ierr)
+         stop
+      endif
+
+      return
+      end
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/bt_data_vec.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/bt_data_vec.f90
new file mode 100644
index 000000000..6f248dd04
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/bt_data_vec.f90
@@ -0,0 +1,193 @@
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+!
+!  bt_data module
+!
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+ 
+      module bt_data
+
+!---------------------------------------------------------------------
+! The following include file is generated automatically by the
+! "setparams" utility. It defines 
+!      maxcells:      the square root of the maximum number of processors
+!      problem_size:  12, 64, 102, 162 (for class S, A, B, C)
+!      dt_default:    default time step for this problem size if no
+!                     config file
+!      niter_default: default number of iterations for this problem size
+!---------------------------------------------------------------------
+
+      include 'npbparams.h'
+
+      integer           aa, bb, cc, BLOCK_SIZE
+      parameter (aa=1, bb=2, cc=3, BLOCK_SIZE=5)
+
+      integer           ncells, grid_points(3)
+      double precision  elapsed_time
+
+      double precision  tx1, tx2, tx3, ty1, ty2, ty3, tz1, tz2, tz3,  &
+     &                  dx1, dx2, dx3, dx4, dx5, dy1, dy2, dy3, dy4,  &
+     &                  dy5, dz1, dz2, dz3, dz4, dz5, dssp, dt,  &
+     &                  ce(5,13), dxmax, dymax, dzmax, xxcon1, xxcon2,  &
+     &                  xxcon3, xxcon4, xxcon5, dx1tx1, dx2tx1, dx3tx1,  &
+     &                  dx4tx1, dx5tx1, yycon1, yycon2, yycon3, yycon4,  &
+     &                  yycon5, dy1ty1, dy2ty1, dy3ty1, dy4ty1, dy5ty1,  &
+     &                  zzcon1, zzcon2, zzcon3, zzcon4, zzcon5, dz1tz1,  &
+     &                  dz2tz1, dz3tz1, dz4tz1, dz5tz1, dnxm1, dnym1,  &
+     &                  dnzm1, c1c2, c1c5, c3c4, c1345, conz1, c1, c2,  &
+     &                  c3, c4, c5, c4dssp, c5dssp, dtdssp, dttx1, bt,  &
+     &                  dttx2, dtty1, dtty2, dttz1, dttz2, c2dttx1,  &
+     &                  c2dtty1, c2dttz1, comz1, comz4, comz5, comz6,  &
+     &                  c3c4tx3, c3c4ty3, c3c4tz3, c2iv, con43, con16
+
+      integer           EAST, WEST, NORTH, SOUTH,  &
+     &                  BOTTOM, TOP
+
+      parameter (EAST=2000, WEST=3000,      NORTH=4000, SOUTH=5000,  &
+     &           BOTTOM=6000, TOP=7000)
+
+      integer maxcells, IMAX, JMAX, KMAX, MAX_CELL_DIM, BUF_SIZE
+
+      integer predecessor(3), successor(3), grid_size(3)
+      integer, allocatable ::  &
+     &        cell_coord (:,:), cell_low (:,:),  &
+     &        cell_high  (:,:), cell_size(:,:),  &
+     &        start      (:,:), end      (:,:),  &
+     &        slice      (:,:)
+
+      double precision, allocatable ::  &
+     &        us      (    :,:,:,:),  &
+     &        vs      (    :,:,:,:),  &
+     &        ws      (    :,:,:,:),  &
+     &        qs      (    :,:,:,:),  &
+     &        rho_i   (    :,:,:,:),  &
+     &        square  (    :,:,:,:),  &
+     &        forcing (  :,:,:,:,:),  &
+     &        u       (  :,:,:,:,:),  &
+     &        rhs     (  :,:,:,:,:),  &
+     &        lhsc    (:,:,:,:,:,:),  &
+     &        backsub_info(:,:,:,:),  &
+     &        in_buffer(:), out_buffer(:)
+
+      double precision, allocatable ::  &
+     &        cv  (:), rhon(:),  &
+     &        rhos(:), rhoq(:),  &
+     &        cuf (:), q   (:),  &
+     &        ue(:,:), buf (:,:)
+
+      double precision, allocatable ::  &
+     &        fjac(:, :, :, :),  &
+     &        njac(:, :, :, :),  &
+     &        lhsa(:, :, :, :),  &
+     &        lhsb(:, :, :, :)
+
+      integer west_size, east_size, bottom_size, top_size,  &
+     &        north_size, south_size, start_send_west,  &
+     &        start_send_east, start_send_south, start_send_north,  &
+     &        start_send_bottom, start_send_top, start_recv_west,  &
+     &        start_recv_east, start_recv_south, start_recv_north,  &
+     &        start_recv_bottom, start_recv_top
+
+      double precision tmp_block(5,5), b_inverse(5,5), tmp_vec(5)
+
+!---------------------------------------------------------------------
+!     These are used by btio
+!---------------------------------------------------------------------
+      integer collbuf_nodes, collbuf_size, iosize, eltext,  &
+     &        combined_btype, fp, idump, record_length, element,  &
+     &        combined_ftype, idump_sub, rd_interval
+      double precision sum(niter_default), xce_sub(5)
+      integer(kind=8) :: iseek
+
+
+!---------------------------------------------------------------------
+!     Timer constants
+!---------------------------------------------------------------------
+      integer t_total, t_io, t_rhs, t_xsolve, t_ysolve, t_zsolve,  &
+     &        t_bpack, t_exch, t_xcomm, t_ycomm, t_zcomm,  &
+     &        t_comp, t_comm, t_enorm, t_iov, t_last
+      parameter (t_total=1, t_io=2, t_rhs=3, t_xsolve=4, t_ysolve=5,  &
+     &        t_zsolve=6, t_bpack=7, t_exch=8, t_xcomm=9,  &
+     &        t_ycomm=10, t_zcomm=11, t_comp=12, t_comm=13,  &
+     &        t_enorm=12, t_iov=13, t_last=13)
+      logical timeron
+
+      end module bt_data
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine alloc_space
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+! allocate space dynamically for data arrays
+!---------------------------------------------------------------------
+
+      use bt_data
+      use mpinpb
+
+      implicit none
+
+      integer ios, ierr
+
+      MAX_CELL_DIM = (problem_size/maxcells)+1
+
+      IMAX = MAX_CELL_DIM
+      JMAX = MAX_CELL_DIM
+      KMAX = MAX_CELL_DIM
+
+      BUF_SIZE = MAX_CELL_DIM*MAX_CELL_DIM*(maxcells-1)*60+1
+
+      allocate (  &
+     &         cell_coord (3,maxcells), cell_low (3,maxcells),  &
+     &         cell_high  (3,maxcells), cell_size(3,maxcells),  &
+     &         start      (3,maxcells), end      (3,maxcells),  &
+     &         slice      (3,maxcells),  &
+     &         stat = ios)
+
+      if (ios .eq. 0) allocate (  &
+     &   forcing (5,   0:IMAX-1, 0:JMAX-1, 0:KMAX-1, maxcells),  &
+     &   u       (5,  -2:IMAX+1,-2:JMAX+1,-2:KMAX+1, maxcells),  &
+     &   rhs     (5,  -1:IMAX-1,-1:JMAX-1,-1:KMAX-1, maxcells),  &
+     &   lhsc    (5,5,-1:IMAX-1,-1:JMAX-1,-1:KMAX-1, maxcells),  &
+     &   backsub_info (5, 0:MAX_CELL_DIM, 0:MAX_CELL_DIM, maxcells),  &
+     &   in_buffer(BUF_SIZE), out_buffer(BUF_SIZE),  &
+     &         stat = ios)
+
+      if (ios .eq. 0) allocate (  &
+     &         cv  (-2:MAX_CELL_DIM+1),  rhon(-2:MAX_CELL_DIM+1),  &
+     &         rhos(-2:MAX_CELL_DIM+1),  rhoq(-2:MAX_CELL_DIM+1),  &
+     &         cuf (-2:MAX_CELL_DIM+1),     q(-2:MAX_CELL_DIM+1),  &
+     &         ue  (-2:MAX_CELL_DIM+1,5), buf(-2:MAX_CELL_DIM+1,5),  &
+     &         stat = ios)
+
+      if (ios .eq. 0) allocate (  &
+     &         fjac(5, 5, -2:MAX_CELL_DIM+1, -2:MAX_CELL_DIM+1),  &
+     &         njac(5, 5, -2:MAX_CELL_DIM+1, -2:MAX_CELL_DIM+1),  &
+     &         lhsa(5, 5, -1:MAX_CELL_DIM,   -1:MAX_CELL_DIM),  &
+     &         lhsb(5, 5, -1:MAX_CELL_DIM,   -1:MAX_CELL_DIM),  &
+     &         stat = ios)
+
+      if (ios .eq. 0) allocate (  &
+     &         us    (-1:IMAX, -1:JMAX, -1:KMAX, maxcells),  &
+     &         vs    (-1:IMAX, -1:JMAX, -1:KMAX, maxcells),  &
+     &         ws    (-1:IMAX, -1:JMAX, -1:KMAX, maxcells),  &
+     &         qs    (-1:IMAX, -1:JMAX, -1:KMAX, maxcells),  &
+     &         rho_i (-1:IMAX, -1:JMAX, -1:KMAX, maxcells),  &
+     &         square(-1:IMAX, -1:JMAX, -1:KMAX, maxcells),  &
+     &         stat = ios)
+
+      if (ios .ne. 0) then
+         write(*,*) 'Error encountered in allocating space'
+         call MPI_Abort(MPI_COMM_WORLD, MPI_ERR_OTHER, ierr)
+         stop
+      endif
+
+      return
+      end
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/btio.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/btio.f90
new file mode 100644
index 000000000..3b36cca87
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/btio.f90
@@ -0,0 +1,72 @@
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine setup_btio
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      return
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine output_timestep
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      return
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine btio_cleanup
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      return
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine btio_verify(verified)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      logical verified
+
+      verified = .true.
+
+      return
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine accumulate_norms(xce_acc)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      double precision xce_acc(5)
+
+      return
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine checksum_timestep
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      return
+      end
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/btio_common.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/btio_common.f90
new file mode 100644
index 000000000..f06bc1cd3
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/btio_common.f90
@@ -0,0 +1,30 @@
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine clear_timestep
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      use bt_data
+      implicit none
+
+      integer cio, kio, jio, ix
+
+      do cio=1,ncells
+          do kio=0, cell_size(3,cio)-1
+              do jio=0, cell_size(2,cio)-1
+                  do ix=0,cell_size(1,cio)-1
+                            u(1,ix, jio,kio,cio) = 0
+                            u(2,ix, jio,kio,cio) = 0
+                            u(3,ix, jio,kio,cio) = 0
+                            u(4,ix, jio,kio,cio) = 0
+                            u(5,ix, jio,kio,cio) = 0
+                  enddo
+              enddo
+          enddo
+      enddo
+
+      return
+      end
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/copy_faces.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/copy_faces.f90
new file mode 100644
index 000000000..ff9ac3a35
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/copy_faces.f90
@@ -0,0 +1,324 @@
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine copy_faces
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     
+! This subroutine copies the face values of a variable defined on a set
+! of cells to the overlap locations of the adjacent sets of cells. 
+! Because a set of cells interfaces in each direction with exactly one 
+! other set, we only need to fill six different buffers. We could try to 
+! overlap communication with computation, by computing
+! some internal values while communicating boundary values, but this
+! adds so much overhead that it's not clearly useful. 
+!---------------------------------------------------------------------
+
+      use bt_data
+      use mpinpb
+
+      implicit none
+
+      integer i, j, k, c, m, requests(0:11), p0, p1,  &
+     &     p2, p3, p4, p5, b_size(0:5), ss(0:5),  &
+     &     sr(0:5), error, statuses(MPI_STATUS_SIZE, 0:11)
+
+!---------------------------------------------------------------------
+!     with a single node there are no faces to copy: compute rhs and return
+!---------------------------------------------------------------------
+      if (no_nodes .eq. 1) then
+         call compute_rhs
+         return
+      endif
+
+      ss(0) = start_send_east
+      ss(1) = start_send_west
+      ss(2) = start_send_north
+      ss(3) = start_send_south
+      ss(4) = start_send_top
+      ss(5) = start_send_bottom
+
+      sr(0) = start_recv_east
+      sr(1) = start_recv_west
+      sr(2) = start_recv_north
+      sr(3) = start_recv_south
+      sr(4) = start_recv_top
+      sr(5) = start_recv_bottom
+
+      b_size(0) = east_size   
+      b_size(1) = west_size   
+      b_size(2) = north_size  
+      b_size(3) = south_size  
+      b_size(4) = top_size    
+      b_size(5) = bottom_size 
+
+!---------------------------------------------------------------------
+!     because the difference stencil for the diagonalized scheme is 
+!     orthogonal, we do not have to perform the staged copying of faces, 
+!     but can send all face information simultaneously to the neighboring 
+!     cells in all directions          
+!---------------------------------------------------------------------
+      if (timeron) call timer_start(t_bpack)
+      p0 = 0
+      p1 = 0
+      p2 = 0
+      p3 = 0
+      p4 = 0
+      p5 = 0
+
+      do  c = 1, ncells
+
+!---------------------------------------------------------------------
+!     fill the buffer to be sent to eastern neighbors (i-dir)
+!---------------------------------------------------------------------
+         if (cell_coord(1,c) .ne. ncells) then
+            do   k = 0, cell_size(3,c)-1
+               do   j = 0, cell_size(2,c)-1
+                  do   i = cell_size(1,c)-2, cell_size(1,c)-1
+                     do   m = 1, 5
+                        out_buffer(ss(0)+p0) = u(m,i,j,k,c)
+                        p0 = p0 + 1
+                     end do
+                  end do
+               end do
+            end do
+         endif
+
+!---------------------------------------------------------------------
+!     fill the buffer to be sent to western neighbors 
+!---------------------------------------------------------------------
+         if (cell_coord(1,c) .ne. 1) then
+            do   k = 0, cell_size(3,c)-1
+               do   j = 0, cell_size(2,c)-1
+                  do   i = 0, 1
+                     do   m = 1, 5
+                        out_buffer(ss(1)+p1) = u(m,i,j,k,c)
+                        p1 = p1 + 1
+                     end do
+                  end do
+               end do
+            end do
+
+         endif
+
+!---------------------------------------------------------------------
+!     fill the buffer to be sent to northern neighbors (j-dir)
+!---------------------------------------------------------------------
+         if (cell_coord(2,c) .ne. ncells) then
+            do   k = 0, cell_size(3,c)-1
+               do   j = cell_size(2,c)-2, cell_size(2,c)-1
+                  do   i = 0, cell_size(1,c)-1
+                     do   m = 1, 5
+                        out_buffer(ss(2)+p2) = u(m,i,j,k,c)
+                        p2 = p2 + 1
+                     end do
+                  end do
+               end do
+            end do
+         endif
+
+!---------------------------------------------------------------------
+!     fill the buffer to be sent to southern neighbors 
+!---------------------------------------------------------------------
+         if (cell_coord(2,c).ne. 1) then
+            do   k = 0, cell_size(3,c)-1
+               do   j = 0, 1
+                  do   i = 0, cell_size(1,c)-1   
+                     do   m = 1, 5
+                        out_buffer(ss(3)+p3) = u(m,i,j,k,c)
+                        p3 = p3 + 1
+                     end do
+                  end do
+               end do
+            end do
+         endif
+
+!---------------------------------------------------------------------
+!     fill the buffer to be sent to top neighbors (k-dir)
+!---------------------------------------------------------------------
+         if (cell_coord(3,c) .ne. ncells) then
+            do   k = cell_size(3,c)-2, cell_size(3,c)-1
+               do   j = 0, cell_size(2,c)-1
+                  do   i = 0, cell_size(1,c)-1
+                     do   m = 1, 5
+                        out_buffer(ss(4)+p4) = u(m,i,j,k,c)
+                        p4 = p4 + 1
+                     end do
+                  end do
+               end do
+            end do
+         endif
+
+!---------------------------------------------------------------------
+!     fill the buffer to be sent to bottom neighbors
+!---------------------------------------------------------------------
+         if (cell_coord(3,c).ne. 1) then
+            do    k=0, 1
+               do   j = 0, cell_size(2,c)-1
+                  do   i = 0, cell_size(1,c)-1
+                     do   m = 1, 5
+                        out_buffer(ss(5)+p5) = u(m,i,j,k,c)
+                        p5 = p5 + 1
+                     end do
+                  end do
+               end do
+            end do
+         endif
+
+!---------------------------------------------------------------------
+!     end of loop over cells
+!---------------------------------------------------------------------
+      end do
+      if (timeron) call timer_stop(t_bpack)
+
+      if (timeron) call timer_start(t_exch)
+      call mpi_irecv(in_buffer(sr(0)), b_size(0),  &
+     &     dp_type, successor(1), WEST,  &
+     &     comm_rhs, requests(0), error)
+      call mpi_irecv(in_buffer(sr(1)), b_size(1),  &
+     &     dp_type, predecessor(1), EAST,  &
+     &     comm_rhs, requests(1), error)
+      call mpi_irecv(in_buffer(sr(2)), b_size(2),  &
+     &     dp_type, successor(2), SOUTH,  &
+     &     comm_rhs, requests(2), error)
+      call mpi_irecv(in_buffer(sr(3)), b_size(3),  &
+     &     dp_type, predecessor(2), NORTH,  &
+     &     comm_rhs, requests(3), error)
+      call mpi_irecv(in_buffer(sr(4)), b_size(4),  &
+     &     dp_type, successor(3), BOTTOM,  &
+     &     comm_rhs, requests(4), error)
+      call mpi_irecv(in_buffer(sr(5)), b_size(5),  &
+     &     dp_type, predecessor(3), TOP,   &
+     &     comm_rhs, requests(5), error)
+
+      call mpi_isend(out_buffer(ss(0)), b_size(0),  &
+     &     dp_type, successor(1),   EAST,  &
+     &     comm_rhs, requests(6), error)
+      call mpi_isend(out_buffer(ss(1)), b_size(1),  &
+     &     dp_type, predecessor(1), WEST,  &
+     &     comm_rhs, requests(7), error)
+      call mpi_isend(out_buffer(ss(2)), b_size(2),  &
+     &     dp_type,successor(2),   NORTH,  &
+     &     comm_rhs, requests(8), error)
+      call mpi_isend(out_buffer(ss(3)), b_size(3),  &
+     &     dp_type,predecessor(2), SOUTH,  &
+     &     comm_rhs, requests(9), error)
+      call mpi_isend(out_buffer(ss(4)), b_size(4),  &
+     &     dp_type,successor(3),   TOP,  &
+     &     comm_rhs,   requests(10), error)
+      call mpi_isend(out_buffer(ss(5)), b_size(5),  &
+     &     dp_type,predecessor(3), BOTTOM,  &
+     &     comm_rhs,requests(11), error)
+
+
+      call mpi_waitall(12, requests, statuses, error)
+      if (timeron) call timer_stop(t_exch)
+
+!---------------------------------------------------------------------
+!     unpack the data that has just been received
+!---------------------------------------------------------------------
+      if (timeron) call timer_start(t_bpack)
+      p0 = 0
+      p1 = 0
+      p2 = 0
+      p3 = 0
+      p4 = 0
+      p5 = 0
+
+      do   c = 1, ncells
+
+         if (cell_coord(1,c) .ne. 1) then
+            do   k = 0, cell_size(3,c)-1
+               do   j = 0, cell_size(2,c)-1
+                  do   i = -2, -1
+                     do   m = 1, 5
+                        u(m,i,j,k,c) = in_buffer(sr(1)+p0)
+                        p0 = p0 + 1
+                     end do
+                  end do
+               end do
+            end do
+         endif
+
+         if (cell_coord(1,c) .ne. ncells) then
+            do  k = 0, cell_size(3,c)-1
+               do  j = 0, cell_size(2,c)-1
+                  do  i = cell_size(1,c), cell_size(1,c)+1
+                     do   m = 1, 5
+                        u(m,i,j,k,c) = in_buffer(sr(0)+p1)
+                        p1 = p1 + 1
+                     end do
+                  end do
+               end do
+            end do
+         end if
+            
+         if (cell_coord(2,c) .ne. 1) then
+            do  k = 0, cell_size(3,c)-1
+               do   j = -2, -1
+                  do  i = 0, cell_size(1,c)-1
+                     do   m = 1, 5
+                        u(m,i,j,k,c) = in_buffer(sr(3)+p2)
+                        p2 = p2 + 1
+                     end do
+                  end do
+               end do
+            end do
+
+         endif
+            
+         if (cell_coord(2,c) .ne. ncells) then
+            do  k = 0, cell_size(3,c)-1
+               do   j = cell_size(2,c), cell_size(2,c)+1
+                  do  i = 0, cell_size(1,c)-1
+                     do   m = 1, 5
+                        u(m,i,j,k,c) = in_buffer(sr(2)+p3)
+                        p3 = p3 + 1
+                     end do
+                  end do
+               end do
+            end do
+         endif
+
+         if (cell_coord(3,c) .ne. 1) then
+            do  k = -2, -1
+               do  j = 0, cell_size(2,c)-1
+                  do  i = 0, cell_size(1,c)-1
+                     do   m = 1, 5
+                        u(m,i,j,k,c) = in_buffer(sr(5)+p4)
+                        p4 = p4 + 1
+                     end do
+                  end do
+               end do
+            end do
+         endif
+
+         if (cell_coord(3,c) .ne. ncells) then
+            do  k = cell_size(3,c), cell_size(3,c)+1
+               do  j = 0, cell_size(2,c)-1
+                  do  i = 0, cell_size(1,c)-1
+                     do   m = 1, 5
+                        u(m,i,j,k,c) = in_buffer(sr(4)+p5)
+                        p5 = p5 + 1
+                     end do
+                  end do
+               end do
+            end do
+         endif
+
+!---------------------------------------------------------------------
+!     end of loop over cells
+!---------------------------------------------------------------------
+      end do
+      if (timeron) call timer_stop(t_bpack)
+
+!---------------------------------------------------------------------
+!     do the rest of the rhs that uses the copied face values          
+!---------------------------------------------------------------------
+      call compute_rhs
+
+      return
+      end
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/define.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/define.f90
new file mode 100644
index 000000000..f42fcc4e0
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/define.f90
@@ -0,0 +1,65 @@
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine compute_buffer_size(dim)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      use bt_data
+      implicit none
+
+      integer  c, dim, face_size
+
+      if (ncells .eq. 1) return
+
+!---------------------------------------------------------------------
+!     compute the actual sizes of the buffers; note that there is 
+!     always one cell face that doesn't need buffer space, because it 
+!     is at the boundary of the grid
+!---------------------------------------------------------------------
+      west_size = 0
+      east_size = 0
+
+      do   c = 1, ncells
+         face_size = cell_size(2,c) * cell_size(3,c) * dim * 2
+         if (cell_coord(1,c).ne.1) west_size = west_size + face_size
+         if (cell_coord(1,c).ne.ncells) east_size = east_size +  &
+     &        face_size 
+      end do
+
+      north_size = 0
+      south_size = 0
+      do   c = 1, ncells
+         face_size = cell_size(1,c)*cell_size(3,c) * dim * 2
+         if (cell_coord(2,c).ne.1) south_size = south_size + face_size
+         if (cell_coord(2,c).ne.ncells) north_size = north_size +  &
+     &        face_size 
+      end do
+
+      top_size = 0
+      bottom_size = 0
+      do   c = 1, ncells
+         face_size = cell_size(1,c) * cell_size(2,c) * dim * 2
+         if (cell_coord(3,c).ne.1) bottom_size = bottom_size +  &
+     &        face_size
+         if (cell_coord(3,c).ne.ncells) top_size = top_size +  &
+     &        face_size     
+      end do
+
+      start_send_west   = 1
+      start_send_east   = start_send_west   + west_size
+      start_send_south  = start_send_east   + east_size
+      start_send_north  = start_send_south  + south_size
+      start_send_bottom = start_send_north  + north_size
+      start_send_top    = start_send_bottom + bottom_size
+      start_recv_west   = 1
+      start_recv_east   = start_recv_west   + west_size
+      start_recv_south  = start_recv_east   + east_size
+      start_recv_north  = start_recv_south  + south_size
+      start_recv_bottom = start_recv_north  + north_size
+      start_recv_top    = start_recv_bottom + bottom_size
+
+      return
+      end
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/epio.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/epio.f90
new file mode 100644
index 000000000..7c209656a
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/epio.f90
@@ -0,0 +1,174 @@
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine setup_btio
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      use bt_data
+      use mpinpb
+
+      implicit none
+
+      character(128) newfilenm
+      integer m
+
+      if (node .lt. 10000) then
+          write (newfilenm, 996) filenm,node
+      else
+          print *, 'error generating file names (> 10000 nodes)'
+          stop
+      endif
+
+996   format (a,'.',i4.4)
+
+      open (unit=99, file=newfilenm, form='unformatted',  &
+     &      status='unknown')
+
+      do m = 1, 5
+         xce_sub(m) = 0.d0
+      end do
+
+      idump_sub = 0
+
+      return
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine output_timestep
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      use bt_data
+      implicit none
+
+      integer ix, iio, jio, kio, cio, aio
+
+      do cio=1,ncells
+          write(99)  &
+     &         ((((u(aio,ix, jio,kio,cio),aio=1,5),  &
+     &             ix=0, cell_size(1,cio)-1),  &
+     &             jio=0, cell_size(2,cio)-1),  &
+     &             kio=0, cell_size(3,cio)-1)
+      enddo
+
+      idump_sub = idump_sub + 1
+      if (rd_interval .gt. 0) then
+         if (idump_sub .ge. rd_interval) then
+
+            rewind(99)
+            call acc_sub_norms(idump+1)
+
+            rewind(99)
+            idump_sub = 0
+         endif
+      endif
+
+      return
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine acc_sub_norms(idump_cur)
+
+      use bt_data
+      use mpinpb
+
+      implicit none
+
+      integer idump_cur
+
+      integer ix, jio, kio, cio, ii, m, ichunk
+      double precision xce_single(5)
+
+      ichunk = idump_cur - idump_sub + 1
+      do ii=0, idump_sub-1
+        do cio=1,ncells
+          read(99)  &
+     &         ((((u(m,ix, jio,kio,cio),m=1,5),  &
+     &             ix=0, cell_size(1,cio)-1),  &
+     &             jio=0, cell_size(2,cio)-1),  &
+     &             kio=0, cell_size(3,cio)-1)
+        enddo
+
+        if (node .eq. root) print *, 'Reading data set ', ii+ichunk
+
+        call error_norm(xce_single)
+        do m = 1, 5
+           xce_sub(m) = xce_sub(m) + xce_single(m)
+        end do
+      enddo
+
+      return
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine btio_cleanup
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      implicit none
+
+      close(unit=99)
+
+      return
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine accumulate_norms(xce_acc)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      use bt_data
+      use mpinpb
+
+      implicit none
+
+      double precision xce_acc(5)
+
+      character(128) newfilenm
+      integer m
+
+      if (rd_interval .gt. 0) goto 20
+
+      if (node .lt. 10000) then
+          write (newfilenm, 996) filenm,node
+      else
+          print *, 'error generating file names (> 10000 nodes)'
+          stop
+      endif
+
+996   format (a,'.',i4.4)
+
+      open (unit=99, file=newfilenm,  &
+     &      form='unformatted', action='read')
+
+!     clear the last time step
+
+      call clear_timestep
+
+!     read back the time steps and accumulate norms
+
+      call acc_sub_norms(idump)
+
+      close(unit=99)
+
+ 20   continue
+      do m = 1, 5
+         xce_acc(m) = xce_sub(m) / dble(idump)
+      end do
+
+      return
+      end
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/error.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/error.f90
new file mode 100644
index 000000000..b09998125
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/error.f90
@@ -0,0 +1,114 @@
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine error_norm(rms)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     this subroutine computes the norm of the difference between the
+!     computed solution and the exact solution
+!---------------------------------------------------------------------
+
+      use bt_data
+      use mpinpb
+
+      implicit none
+
+      integer c, i, j, k, m, ii, jj, kk, d, error
+      double precision xi, eta, zeta, u_exact(5), rms(5), rms_work(5),  &
+     &     add
+
+      call timer_start(t_enorm)
+
+      do m = 1, 5 
+         rms_work(m) = 0.0d0
+      enddo
+
+      do c = 1, ncells
+         kk = 0
+         do k = cell_low(3,c), cell_high(3,c)
+            zeta = dble(k) * dnzm1
+            jj = 0
+            do j = cell_low(2,c), cell_high(2,c)
+               eta = dble(j) * dnym1
+               ii = 0
+               do i = cell_low(1,c), cell_high(1,c)
+                  xi = dble(i) * dnxm1
+                  call exact_solution(xi, eta, zeta, u_exact)
+
+                  do m = 1, 5
+                     add = u(m,ii,jj,kk,c)-u_exact(m)
+                     rms_work(m) = rms_work(m) + add*add
+                  enddo
+                  ii = ii + 1
+               enddo
+               jj = jj + 1
+            enddo
+            kk = kk + 1
+         enddo
+      enddo
+
+      call mpi_allreduce(rms_work, rms, 5, dp_type,  &
+     &     MPI_SUM, comm_setup, error)
+
+      do m = 1, 5
+         do d = 1, 3
+            rms(m) = rms(m) / dble(grid_points(d)-2)
+         enddo
+         rms(m) = dsqrt(rms(m))
+      enddo
+
+      call timer_stop(t_enorm)
+
+      return
+      end
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine rhs_norm(rms)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      use bt_data
+      use mpinpb
+
+      implicit none
+
+      integer c, i, j, k, d, m, error
+      double precision rms(5), rms_work(5), add
+
+      do m = 1, 5
+         rms_work(m) = 0.0d0
+      enddo 
+
+      do c = 1, ncells
+         do k = start(3,c), cell_size(3,c)-end(3,c)-1
+            do j = start(2,c), cell_size(2,c)-end(2,c)-1
+               do i = start(1,c), cell_size(1,c)-end(1,c)-1
+                  do m = 1, 5
+                     add = rhs(m,i,j,k,c)
+                     rms_work(m) = rms_work(m) + add*add
+                  enddo 
+               enddo 
+            enddo 
+         enddo 
+      enddo 
+
+      call mpi_allreduce(rms_work, rms, 5, dp_type,  &
+     &     MPI_SUM, comm_setup, error)
+
+      do m = 1, 5
+         do d = 1, 3
+            rms(m) = rms(m) / dble(grid_points(d)-2)
+         enddo 
+         rms(m) = dsqrt(rms(m))
+      enddo 
+
+      return
+      end
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/exact_rhs.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/exact_rhs.f90
new file mode 100644
index 000000000..ed27eeaf9
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/exact_rhs.f90
@@ -0,0 +1,361 @@
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine exact_rhs
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     compute the right hand side based on exact solution
+!---------------------------------------------------------------------
+
+      use bt_data
+      implicit none
+
+      double precision dtemp(5), xi, eta, zeta, dtpp
+      integer          c, m, i, j, k, ip1, im1, jp1,  &
+     &     jm1, km1, kp1
+
+
+!---------------------------------------------------------------------
+!     loop over all cells owned by this node                   
+!---------------------------------------------------------------------
+      do c = 1, ncells
+
+!---------------------------------------------------------------------
+!     initialize                                  
+!---------------------------------------------------------------------
+         do k= 0, cell_size(3,c)-1
+            do j = 0, cell_size(2,c)-1
+               do i = 0, cell_size(1,c)-1
+                  do m = 1, 5
+                     forcing(m,i,j,k,c) = 0.0d0
+                  enddo
+               enddo
+            enddo
+         enddo
+
+!---------------------------------------------------------------------
+!     xi-direction flux differences                      
+!---------------------------------------------------------------------
+         do k = start(3,c), cell_size(3,c)-end(3,c)-1
+            zeta = dble(k+cell_low(3,c)) * dnzm1
+            do j = start(2,c), cell_size(2,c)-end(2,c)-1
+               eta = dble(j+cell_low(2,c)) * dnym1
+
+               do i=-2*(1-start(1,c)), cell_size(1,c)+1-2*end(1,c)
+                  xi = dble(i+cell_low(1,c)) * dnxm1
+
+                  call exact_solution(xi, eta, zeta, dtemp)
+                  do m = 1, 5
+                     ue(i,m) = dtemp(m)
+                  enddo
+
+                  dtpp = 1.0d0 / dtemp(1)
+
+                  do m = 2, 5
+                     buf(i,m) = dtpp * dtemp(m)
+                  enddo
+
+                  cuf(i)   = buf(i,2) * buf(i,2)
+                  buf(i,1) = cuf(i) + buf(i,3) * buf(i,3) +  &
+     &                 buf(i,4) * buf(i,4) 
+                  q(i) = 0.5d0*(buf(i,2)*ue(i,2) + buf(i,3)*ue(i,3) +  &
+     &                 buf(i,4)*ue(i,4))
+
+               enddo
+               
+               do i = start(1,c), cell_size(1,c)-end(1,c)-1
+                  im1 = i-1
+                  ip1 = i+1
+
+                  forcing(1,i,j,k,c) = forcing(1,i,j,k,c) -  &
+     &                 tx2*( ue(ip1,2)-ue(im1,2) )+  &
+     &                 dx1tx1*(ue(ip1,1)-2.0d0*ue(i,1)+ue(im1,1))
+
+                  forcing(2,i,j,k,c) = forcing(2,i,j,k,c) - tx2 * (  &
+     &                 (ue(ip1,2)*buf(ip1,2)+c2*(ue(ip1,5)-q(ip1)))-  &
+     &                 (ue(im1,2)*buf(im1,2)+c2*(ue(im1,5)-q(im1))))+  &
+     &                 xxcon1*(buf(ip1,2)-2.0d0*buf(i,2)+buf(im1,2))+  &
+     &                 dx2tx1*( ue(ip1,2)-2.0d0* ue(i,2)+ue(im1,2))
+
+                  forcing(3,i,j,k,c) = forcing(3,i,j,k,c) - tx2 * (  &
+     &                 ue(ip1,3)*buf(ip1,2)-ue(im1,3)*buf(im1,2))+  &
+     &                 xxcon2*(buf(ip1,3)-2.0d0*buf(i,3)+buf(im1,3))+  &
+     &                 dx3tx1*( ue(ip1,3)-2.0d0*ue(i,3) +ue(im1,3))
+                  
+                  forcing(4,i,j,k,c) = forcing(4,i,j,k,c) - tx2*(  &
+     &                 ue(ip1,4)*buf(ip1,2)-ue(im1,4)*buf(im1,2))+  &
+     &                 xxcon2*(buf(ip1,4)-2.0d0*buf(i,4)+buf(im1,4))+  &
+     &                 dx4tx1*( ue(ip1,4)-2.0d0* ue(i,4)+ ue(im1,4))
+
+                  forcing(5,i,j,k,c) = forcing(5,i,j,k,c) - tx2*(  &
+     &                 buf(ip1,2)*(c1*ue(ip1,5)-c2*q(ip1))-  &
+     &                 buf(im1,2)*(c1*ue(im1,5)-c2*q(im1)))+  &
+     &                 0.5d0*xxcon3*(buf(ip1,1)-2.0d0*buf(i,1)+  &
+     &                 buf(im1,1))+  &
+     &                 xxcon4*(cuf(ip1)-2.0d0*cuf(i)+cuf(im1))+  &
+     &                 xxcon5*(buf(ip1,5)-2.0d0*buf(i,5)+buf(im1,5))+  &
+     &                 dx5tx1*( ue(ip1,5)-2.0d0* ue(i,5)+ ue(im1,5))
+               enddo
+
+!---------------------------------------------------------------------
+!     Fourth-order dissipation                         
+!---------------------------------------------------------------------
+               if (start(1,c) .gt. 0) then
+                  do m = 1, 5
+                     i = 1
+                     forcing(m,i,j,k,c) = forcing(m,i,j,k,c) - dssp *  &
+     &                    (5.0d0*ue(i,m) - 4.0d0*ue(i+1,m) +ue(i+2,m))
+                     i = 2
+                     forcing(m,i,j,k,c) = forcing(m,i,j,k,c) - dssp *  &
+     &                    (-4.0d0*ue(i-1,m) + 6.0d0*ue(i,m) -  &
+     &                    4.0d0*ue(i+1,m) +       ue(i+2,m))
+                  enddo
+               endif
+
+               do i = start(1,c)*3, cell_size(1,c)-3*end(1,c)-1
+                  do m = 1, 5
+                     forcing(m,i,j,k,c) = forcing(m,i,j,k,c) - dssp*  &
+     &                    (ue(i-2,m) - 4.0d0*ue(i-1,m) +  &
+     &                    6.0d0*ue(i,m) - 4.0d0*ue(i+1,m) + ue(i+2,m))
+                  enddo
+               enddo
+
+               if (end(1,c) .gt. 0) then
+                  do m = 1, 5
+                     i = cell_size(1,c)-3
+                     forcing(m,i,j,k,c) = forcing(m,i,j,k,c) - dssp *  &
+     &                    (ue(i-2,m) - 4.0d0*ue(i-1,m) +  &
+     &                    6.0d0*ue(i,m) - 4.0d0*ue(i+1,m))
+                     i = cell_size(1,c)-2
+                     forcing(m,i,j,k,c) = forcing(m,i,j,k,c) - dssp *  &
+     &                    (ue(i-2,m) - 4.0d0*ue(i-1,m) + 5.0d0*ue(i,m))
+                  enddo
+               endif
+
+            enddo
+         enddo
+
+!---------------------------------------------------------------------
+!     eta-direction flux differences             
+!---------------------------------------------------------------------
+         do k = start(3,c), cell_size(3,c)-end(3,c)-1          
+            zeta = dble(k+cell_low(3,c)) * dnzm1
+            do i=start(1,c), cell_size(1,c)-end(1,c)-1
+               xi = dble(i+cell_low(1,c)) * dnxm1
+
+               do j=-2*(1-start(2,c)), cell_size(2,c)+1-2*end(2,c)
+                  eta = dble(j+cell_low(2,c)) * dnym1
+
+                  call exact_solution(xi, eta, zeta, dtemp)
+                  do m = 1, 5 
+                     ue(j,m) = dtemp(m)
+                  enddo
+                  
+                  dtpp = 1.0d0/dtemp(1)
+
+                  do m = 2, 5
+                     buf(j,m) = dtpp * dtemp(m)
+                  enddo
+
+                  cuf(j)   = buf(j,3) * buf(j,3)
+                  buf(j,1) = cuf(j) + buf(j,2) * buf(j,2) +  &
+     &                 buf(j,4) * buf(j,4)
+                  q(j) = 0.5d0*(buf(j,2)*ue(j,2) + buf(j,3)*ue(j,3) +  &
+     &                 buf(j,4)*ue(j,4))
+               enddo
+
+               do j = start(2,c), cell_size(2,c)-end(2,c)-1
+                  jm1 = j-1
+                  jp1 = j+1
+                  
+                  forcing(1,i,j,k,c) = forcing(1,i,j,k,c) -  &
+     &                 ty2*( ue(jp1,3)-ue(jm1,3) )+  &
+     &                 dy1ty1*(ue(jp1,1)-2.0d0*ue(j,1)+ue(jm1,1))
+
+                  forcing(2,i,j,k,c) = forcing(2,i,j,k,c) - ty2*(  &
+     &                 ue(jp1,2)*buf(jp1,3)-ue(jm1,2)*buf(jm1,3))+  &
+     &                 yycon2*(buf(jp1,2)-2.0d0*buf(j,2)+buf(jm1,2))+  &
+     &                 dy2ty1*( ue(jp1,2)-2.0d0* ue(j,2)+ ue(jm1,2))
+
+                  forcing(3,i,j,k,c) = forcing(3,i,j,k,c) - ty2*(  &
+     &                 (ue(jp1,3)*buf(jp1,3)+c2*(ue(jp1,5)-q(jp1)))-  &
+     &                 (ue(jm1,3)*buf(jm1,3)+c2*(ue(jm1,5)-q(jm1))))+  &
+     &                 yycon1*(buf(jp1,3)-2.0d0*buf(j,3)+buf(jm1,3))+  &
+     &                 dy3ty1*( ue(jp1,3)-2.0d0*ue(j,3) +ue(jm1,3))
+
+                  forcing(4,i,j,k,c) = forcing(4,i,j,k,c) - ty2*(  &
+     &                 ue(jp1,4)*buf(jp1,3)-ue(jm1,4)*buf(jm1,3))+  &
+     &                 yycon2*(buf(jp1,4)-2.0d0*buf(j,4)+buf(jm1,4))+  &
+     &                 dy4ty1*( ue(jp1,4)-2.0d0*ue(j,4)+ ue(jm1,4))
+
+                  forcing(5,i,j,k,c) = forcing(5,i,j,k,c) - ty2*(  &
+     &                 buf(jp1,3)*(c1*ue(jp1,5)-c2*q(jp1))-  &
+     &                 buf(jm1,3)*(c1*ue(jm1,5)-c2*q(jm1)))+  &
+     &                 0.5d0*yycon3*(buf(jp1,1)-2.0d0*buf(j,1)+  &
+     &                 buf(jm1,1))+  &
+     &                 yycon4*(cuf(jp1)-2.0d0*cuf(j)+cuf(jm1))+  &
+     &                 yycon5*(buf(jp1,5)-2.0d0*buf(j,5)+buf(jm1,5))+  &
+     &                 dy5ty1*(ue(jp1,5)-2.0d0*ue(j,5)+ue(jm1,5))
+               enddo
+
+!---------------------------------------------------------------------
+!     Fourth-order dissipation                      
+!---------------------------------------------------------------------
+               if (start(2,c) .gt. 0) then
+                  do m = 1, 5
+                     j = 1
+                     forcing(m,i,j,k,c) = forcing(m,i,j,k,c) - dssp *  &
+     &                    (5.0d0*ue(j,m) - 4.0d0*ue(j+1,m) +ue(j+2,m))
+                     j = 2
+                     forcing(m,i,j,k,c) = forcing(m,i,j,k,c) - dssp *  &
+     &                    (-4.0d0*ue(j-1,m) + 6.0d0*ue(j,m) -  &
+     &                    4.0d0*ue(j+1,m) +       ue(j+2,m))
+                  enddo
+               endif
+
+               do j = start(2,c)*3, cell_size(2,c)-3*end(2,c)-1
+                  do m = 1, 5
+                     forcing(m,i,j,k,c) = forcing(m,i,j,k,c) - dssp*  &
+     &                    (ue(j-2,m) - 4.0d0*ue(j-1,m) +  &
+     &                    6.0d0*ue(j,m) - 4.0d0*ue(j+1,m) + ue(j+2,m))
+                  enddo
+               enddo
+
+               if (end(2,c) .gt. 0) then
+                  do m = 1, 5
+                     j = cell_size(2,c)-3
+                     forcing(m,i,j,k,c) = forcing(m,i,j,k,c) - dssp *  &
+     &                    (ue(j-2,m) - 4.0d0*ue(j-1,m) +  &
+     &                    6.0d0*ue(j,m) - 4.0d0*ue(j+1,m))
+                     j = cell_size(2,c)-2
+                     forcing(m,i,j,k,c) = forcing(m,i,j,k,c) - dssp *  &
+     &                    (ue(j-2,m) - 4.0d0*ue(j-1,m) + 5.0d0*ue(j,m))
+
+                  enddo
+               endif
+
+            enddo
+         enddo
+
+!---------------------------------------------------------------------
+!     zeta-direction flux differences                      
+!---------------------------------------------------------------------
+         do j=start(2,c), cell_size(2,c)-end(2,c)-1
+            eta = dble(j+cell_low(2,c)) * dnym1
+            do i = start(1,c), cell_size(1,c)-end(1,c)-1
+               xi = dble(i+cell_low(1,c)) * dnxm1
+
+               do k=-2*(1-start(3,c)), cell_size(3,c)+1-2*end(3,c)
+                  zeta = dble(k+cell_low(3,c)) * dnzm1
+
+                  call exact_solution(xi, eta, zeta, dtemp)
+                  do m = 1, 5
+                     ue(k,m) = dtemp(m)
+                  enddo
+
+                  dtpp = 1.0d0/dtemp(1)
+
+                  do m = 2, 5
+                     buf(k,m) = dtpp * dtemp(m)
+                  enddo
+
+                  cuf(k)   = buf(k,4) * buf(k,4)
+                  buf(k,1) = cuf(k) + buf(k,2) * buf(k,2) +  &
+     &                 buf(k,3) * buf(k,3)
+                  q(k) = 0.5d0*(buf(k,2)*ue(k,2) + buf(k,3)*ue(k,3) +  &
+     &                 buf(k,4)*ue(k,4))
+               enddo
+
+               do k=start(3,c), cell_size(3,c)-end(3,c)-1
+                  km1 = k-1
+                  kp1 = k+1
+                  
+                  forcing(1,i,j,k,c) = forcing(1,i,j,k,c) -  &
+     &                 tz2*( ue(kp1,4)-ue(km1,4) )+  &
+     &                 dz1tz1*(ue(kp1,1)-2.0d0*ue(k,1)+ue(km1,1))
+
+                  forcing(2,i,j,k,c) = forcing(2,i,j,k,c) - tz2 * (  &
+     &                 ue(kp1,2)*buf(kp1,4)-ue(km1,2)*buf(km1,4))+  &
+     &                 zzcon2*(buf(kp1,2)-2.0d0*buf(k,2)+buf(km1,2))+  &
+     &                 dz2tz1*( ue(kp1,2)-2.0d0* ue(k,2)+ ue(km1,2))
+
+                  forcing(3,i,j,k,c) = forcing(3,i,j,k,c) - tz2 * (  &
+     &                 ue(kp1,3)*buf(kp1,4)-ue(km1,3)*buf(km1,4))+  &
+     &                 zzcon2*(buf(kp1,3)-2.0d0*buf(k,3)+buf(km1,3))+  &
+     &                 dz3tz1*(ue(kp1,3)-2.0d0*ue(k,3)+ue(km1,3))
+
+                  forcing(4,i,j,k,c) = forcing(4,i,j,k,c) - tz2 * (  &
+     &                 (ue(kp1,4)*buf(kp1,4)+c2*(ue(kp1,5)-q(kp1)))-  &
+     &                 (ue(km1,4)*buf(km1,4)+c2*(ue(km1,5)-q(km1))))+  &
+     &                 zzcon1*(buf(kp1,4)-2.0d0*buf(k,4)+buf(km1,4))+  &
+     &                 dz4tz1*( ue(kp1,4)-2.0d0*ue(k,4) +ue(km1,4))
+
+                  forcing(5,i,j,k,c) = forcing(5,i,j,k,c) - tz2 * (  &
+     &                 buf(kp1,4)*(c1*ue(kp1,5)-c2*q(kp1))-  &
+     &                 buf(km1,4)*(c1*ue(km1,5)-c2*q(km1)))+  &
+     &                 0.5d0*zzcon3*(buf(kp1,1)-2.0d0*buf(k,1)  &
+     &                 +buf(km1,1))+  &
+     &                 zzcon4*(cuf(kp1)-2.0d0*cuf(k)+cuf(km1))+  &
+     &                 zzcon5*(buf(kp1,5)-2.0d0*buf(k,5)+buf(km1,5))+  &
+     &                 dz5tz1*( ue(kp1,5)-2.0d0*ue(k,5)+ ue(km1,5))
+               enddo
+
+!---------------------------------------------------------------------
+!     Fourth-order dissipation                        
+!---------------------------------------------------------------------
+               if (start(3,c) .gt. 0) then
+                  do m = 1, 5
+                     k = 1
+                     forcing(m,i,j,k,c) = forcing(m,i,j,k,c) - dssp *  &
+     &                    (5.0d0*ue(k,m) - 4.0d0*ue(k+1,m) +ue(k+2,m))
+                     k = 2
+                     forcing(m,i,j,k,c) = forcing(m,i,j,k,c) - dssp *  &
+     &                    (-4.0d0*ue(k-1,m) + 6.0d0*ue(k,m) -  &
+     &                    4.0d0*ue(k+1,m) +       ue(k+2,m))
+                  enddo
+               endif
+
+               do k = start(3,c)*3, cell_size(3,c)-3*end(3,c)-1
+                  do m = 1, 5
+                     forcing(m,i,j,k,c) = forcing(m,i,j,k,c) - dssp*  &
+     &                    (ue(k-2,m) - 4.0d0*ue(k-1,m) +  &
+     &                    6.0d0*ue(k,m) - 4.0d0*ue(k+1,m) + ue(k+2,m))
+                  enddo
+               enddo
+
+               if (end(3,c) .gt. 0) then
+                  do m = 1, 5
+                     k = cell_size(3,c)-3
+                     forcing(m,i,j,k,c) = forcing(m,i,j,k,c) - dssp *  &
+     &                    (ue(k-2,m) - 4.0d0*ue(k-1,m) +  &
+     &                    6.0d0*ue(k,m) - 4.0d0*ue(k+1,m))
+                     k = cell_size(3,c)-2
+                     forcing(m,i,j,k,c) = forcing(m,i,j,k,c) - dssp *  &
+     &                    (ue(k-2,m) - 4.0d0*ue(k-1,m) + 5.0d0*ue(k,m))
+                  enddo
+               endif
+
+            enddo
+         enddo
+
+!---------------------------------------------------------------------
+!     now change the sign of the forcing function
+!---------------------------------------------------------------------
+         do k = start(3,c), cell_size(3,c)-end(3,c)-1
+            do j = start(2,c), cell_size(2,c)-end(2,c)-1
+               do i = start(1,c), cell_size(1,c)-end(1,c)-1
+                  do m = 1, 5
+                     forcing(m,i,j,k,c) = -1.d0 * forcing(m,i,j,k,c)
+                  enddo
+               enddo
+            enddo
+         enddo
+
+      enddo
+
+      return
+      end
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/exact_solution.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/exact_solution.f90
new file mode 100644
index 000000000..2ada9387a
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/exact_solution.f90
@@ -0,0 +1,30 @@
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine exact_solution(xi,eta,zeta,dtemp)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     this subroutine returns the exact solution at point xi, eta, zeta
+!---------------------------------------------------------------------
+
+      use bt_data
+      implicit none
+
+      double precision  xi, eta, zeta, dtemp(5)
+      integer m
+
+      do m = 1, 5
+         dtemp(m) =  ce(m,1) +  &
+     &     xi*(ce(m,2) + xi*(ce(m,5) + xi*(ce(m,8) + xi*ce(m,11)))) +  &
+     &     eta*(ce(m,3) + eta*(ce(m,6) + eta*(ce(m,9) + eta*ce(m,12))))+  &
+     &     zeta*(ce(m,4) + zeta*(ce(m,7) + zeta*(ce(m,10) +  &
+     &     zeta*ce(m,13))))
+      enddo
+
+      return
+      end
+
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/fortran_io.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/fortran_io.f90
new file mode 100644
index 000000000..d35781fbf
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/fortran_io.f90
@@ -0,0 +1,198 @@
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine setup_btio
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      use bt_data
+      use mpinpb
+
+      implicit none
+
+      double precision d5(5)
+      integer frec_sz, m, ierr
+
+!     determine a proper record_length to use
+      if (node.eq.root) then
+         frec_sz = fortran_rec_sz
+         if (frec_sz > 0) then
+            ! use the compiled value
+            record_length = 40/frec_sz
+         else
+            ! query directly
+            inquire(iolength=record_length) d5
+         endif
+         if (record_length < 1) record_length = 40
+      endif
+
+      call mpi_bcast(record_length, 1, MPI_INTEGER,  &
+     &                root, comm_setup, ierr)
+
+      open (unit=99, file=filenm,  &
+     &      form='unformatted', access='direct',  &
+     &      recl=record_length)
+
+      do m = 1, 5
+         xce_sub(m) = 0.d0
+      end do
+
+      idump_sub = 0
+
+      return
+      end
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine output_timestep
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      use bt_data
+      implicit none
+
+      integer ix, jio, kio, cio
+
+      do cio=1,ncells
+          do kio=0, cell_size(3,cio)-1
+              do jio=0, cell_size(2,cio)-1
+                  iseek=(cell_low(3,cio)+kio) +  &
+     &                   PROBLEM_SIZE*idump_sub
+                  iseek=(cell_low(2,cio)+jio) +  &
+     &                   PROBLEM_SIZE*iseek
+                  iseek=(cell_low(1,cio)) +  &
+     &                   PROBLEM_SIZE*iseek
+
+                  do ix=0,cell_size(1,cio)-1
+                      write(99, rec=iseek+ix+1)  &
+     &                      u(1,ix, jio,kio,cio),  &
+     &                      u(2,ix, jio,kio,cio),  &
+     &                      u(3,ix, jio,kio,cio),  &
+     &                      u(4,ix, jio,kio,cio),  &
+     &                      u(5,ix, jio,kio,cio)
+                  enddo
+              enddo
+          enddo
+      enddo
+
+      idump_sub = idump_sub + 1
+      if (rd_interval .gt. 0) then
+         if (idump_sub .ge. rd_interval) then
+
+            call acc_sub_norms(idump+1)
+
+            idump_sub = 0
+         endif
+      endif
+
+      return
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine acc_sub_norms(idump_cur)
+
+      use bt_data
+      use mpinpb
+
+      implicit none
+
+      integer idump_cur
+
+      integer ix, jio, kio, cio, ii, m, ichunk
+      double precision xce_single(5)
+
+      ichunk = idump_cur - idump_sub + 1
+      do ii=0, idump_sub-1
+        do cio=1,ncells
+          do kio=0, cell_size(3,cio)-1
+              do jio=0, cell_size(2,cio)-1
+                  iseek=(cell_low(3,cio)+kio) +  &
+     &                   PROBLEM_SIZE*ii
+                  iseek=(cell_low(2,cio)+jio) +  &
+     &                   PROBLEM_SIZE*iseek
+                  iseek=(cell_low(1,cio)) +  &
+     &                   PROBLEM_SIZE*iseek
+
+
+                  do ix=0,cell_size(1,cio)-1
+                      read(99, rec=iseek+ix+1)  &
+     &                      u(1,ix, jio,kio,cio),  &
+     &                      u(2,ix, jio,kio,cio),  &
+     &                      u(3,ix, jio,kio,cio),  &
+     &                      u(4,ix, jio,kio,cio),  &
+     &                      u(5,ix, jio,kio,cio)
+                  enddo
+              enddo
+          enddo
+        enddo
+
+        if (node .eq. root) print *, 'Reading data set ', ii+ichunk
+
+        call error_norm(xce_single)
+        do m = 1, 5
+           xce_sub(m) = xce_sub(m) + xce_single(m)
+        end do
+      enddo
+
+      return
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine btio_cleanup
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      implicit none
+
+      close(unit=99)
+
+      return
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine accumulate_norms(xce_acc)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      use bt_data
+      implicit none
+
+      double precision xce_acc(5)
+      integer m
+
+      if (rd_interval .gt. 0) goto 20
+
+      open (unit=99, file=filenm,  &
+     &      form='unformatted', access='direct',  &
+     &      recl=record_length, action='read')
+
+!     clear the last time step
+
+      call clear_timestep
+
+!     read back the time steps and accumulate norms
+
+      call acc_sub_norms(idump)
+
+      close(unit=99)
+
+ 20   continue
+      do m = 1, 5
+         xce_acc(m) = xce_sub(m) / dble(idump)
+      end do
+
+      return
+      end
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/full_mpiio.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/full_mpiio.f90
new file mode 100644
index 000000000..b14acd832
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/full_mpiio.f90
@@ -0,0 +1,319 @@
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine setup_btio
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      use bt_data
+      use mpinpb
+
+      implicit none
+
+      integer ierr
+      integer mstatus(MPI_STATUS_SIZE)
+      integer sizes(4), starts(4), subsizes(4)
+      integer cell_btype(maxcells), cell_ftype(maxcells)
+      integer cell_blength(maxcells)
+      integer info
+      character*20 cb_nodes, cb_size
+      integer c, m
+      integer cell_disp(maxcells)
+
+       call mpi_bcast(collbuf_nodes, 1, MPI_INTEGER,  &
+     &                root, comm_setup, ierr)
+
+       call mpi_bcast(collbuf_size, 1, MPI_INTEGER,  &
+     &                root, comm_setup, ierr)
+
+       if (collbuf_nodes .eq. 0) then
+          info = MPI_INFO_NULL
+       else
+          write (cb_nodes,*) collbuf_nodes
+          write (cb_size,*) collbuf_size
+          call MPI_Info_create(info, ierr)
+          call MPI_Info_set(info, 'cb_nodes', cb_nodes, ierr)
+          call MPI_Info_set(info, 'cb_buffer_size', cb_size, ierr)
+          call MPI_Info_set(info, 'collective_buffering', 'true', ierr)
+       endif
+
+       call MPI_Type_contiguous(5, MPI_DOUBLE_PRECISION,  &
+     &                          element, ierr)
+       call MPI_Type_commit(element, ierr)
+       call MPI_Type_extent(element, eltext, ierr)
+
+       do  c = 1, ncells
+!
+! Outer array dimensions are the same for every cell
+!
+           sizes(1) = IMAX+4
+           sizes(2) = JMAX+4
+           sizes(3) = KMAX+4
+!
+! 4th dimension is cell number, total of maxcells cells
+!
+           sizes(4) = maxcells
+!
+! Internal dimensions of cells can differ slightly between cells
+!
+           subsizes(1) = cell_size(1, c)
+           subsizes(2) = cell_size(2, c)
+           subsizes(3) = cell_size(3, c)
+!
+! Cell is 4th dimension, 1 cell per cell type to handle varying 
+! cell sub-array sizes
+!
+           subsizes(4) = 1
+
+!
+! type constructors use 0-based start addresses
+!
+           starts(1) = 2 
+           starts(2) = 2
+           starts(3) = 2
+           starts(4) = c-1
+
+! 
+! Create buftype for a cell
+!
+           call MPI_Type_create_subarray(4, sizes, subsizes,  &
+     &          starts, MPI_ORDER_FORTRAN, element,  &
+     &          cell_btype(c), ierr)
+!
+! block length and displacement for joining cells - 
+! 1 cell buftype per block, cell buftypes have their own displacement
+! generated from cell number (4th array dimension)
+!
+           cell_blength(c) = 1
+           cell_disp(c) = 0
+
+       enddo
+!
+! Create combined buftype for all cells
+!
+       call MPI_Type_struct(ncells, cell_blength, cell_disp,  &
+     &            cell_btype, combined_btype, ierr)
+       call MPI_Type_commit(combined_btype, ierr)
+
+       do  c = 1, ncells
+!
+! Entire array size
+!
+           sizes(1) = PROBLEM_SIZE
+           sizes(2) = PROBLEM_SIZE
+           sizes(3) = PROBLEM_SIZE
+
+!
+! Size of c'th cell
+!
+           subsizes(1) = cell_size(1, c)
+           subsizes(2) = cell_size(2, c)
+           subsizes(3) = cell_size(3, c)
+
+!
+! Starting point in full array of c'th cell
+!
+           starts(1) = cell_low(1,c)
+           starts(2) = cell_low(2,c)
+           starts(3) = cell_low(3,c)
+
+           call MPI_Type_create_subarray(3, sizes, subsizes,  &
+     &          starts, MPI_ORDER_FORTRAN,  &
+     &          element, cell_ftype(c), ierr)
+           cell_blength(c) = 1
+           cell_disp(c) = 0
+       enddo
+
+       call MPI_Type_struct(ncells, cell_blength, cell_disp,  &
+     &            cell_ftype, combined_ftype, ierr)
+       call MPI_Type_commit(combined_ftype, ierr)
+
+       iseek=0
+       if (node .eq. root) then
+          call MPI_File_delete(filenm, MPI_INFO_NULL, ierr)
+       endif
+
+
+      call MPI_Barrier(comm_solve, ierr)
+
+       call MPI_File_open(comm_solve,  &
+     &          filenm,  &
+     &          MPI_MODE_RDWR+MPI_MODE_CREATE,  &
+     &          MPI_INFO_NULL, fp, ierr)
+
+       if (ierr .ne. MPI_SUCCESS) then
+                print *, 'Error opening file'
+                stop
+       endif
+
+        call MPI_File_set_view(fp, iseek, element,  &
+     &          combined_ftype, 'native', info, ierr)
+
+       if (ierr .ne. MPI_SUCCESS) then
+                print *, 'Error setting file view'
+                stop
+       endif
+
+      do m = 1, 5
+         xce_sub(m) = 0.d0
+      end do
+
+      idump_sub = 0
+
+
+      return
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine output_timestep
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      use bt_data
+      use mpinpb
+
+      implicit none
+
+      integer mstatus(MPI_STATUS_SIZE)
+      integer ierr
+
+      call MPI_File_write_at_all(fp, iseek, u,  &
+     &                           1, combined_btype, mstatus, ierr)
+      if (ierr .ne. MPI_SUCCESS) then
+          print *, 'Error writing to file'
+          stop
+      endif
+
+      call MPI_Type_size(combined_btype, iosize, ierr)
+      iseek = iseek + iosize/eltext
+
+      idump_sub = idump_sub + 1
+      if (rd_interval .gt. 0) then
+         if (idump_sub .ge. rd_interval) then
+
+            iseek = 0
+            call acc_sub_norms(idump+1)
+
+            iseek = 0
+            idump_sub = 0
+         endif
+      endif
+
+      return
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine acc_sub_norms(idump_cur)
+
+      use bt_data
+      use mpinpb
+
+      implicit none
+
+      integer idump_cur
+
+      integer ii, m, ichunk
+      integer ierr
+      integer mstatus(MPI_STATUS_SIZE)
+      double precision xce_single(5)
+
+      ichunk = idump_cur - idump_sub + 1
+      do ii=0, idump_sub-1
+
+        call MPI_File_read_at_all(fp, iseek, u,  &
+     &                           1, combined_btype, mstatus, ierr)
+        if (ierr .ne. MPI_SUCCESS) then
+           print *, 'Error reading back file'
+           call MPI_File_close(fp, ierr)
+           stop
+        endif
+
+        call MPI_Type_size(combined_btype, iosize, ierr)
+        iseek = iseek + iosize/eltext
+
+        if (node .eq. root) print *, 'Reading data set ', ii+ichunk
+
+        call error_norm(xce_single)
+        do m = 1, 5
+           xce_sub(m) = xce_sub(m) + xce_single(m)
+        end do
+      enddo
+
+      return
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine btio_cleanup
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      use bt_data
+      use mpinpb
+
+      implicit none
+
+      integer ierr
+
+      call MPI_File_close(fp, ierr)
+
+      return
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+
+      subroutine accumulate_norms(xce_acc)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      use bt_data
+      use mpinpb
+
+      implicit none
+
+      double precision xce_acc(5)
+      integer m, ierr
+
+      if (rd_interval .gt. 0) goto 20
+
+      call MPI_File_open(comm_solve,  &
+     &          filenm,  &
+     &          MPI_MODE_RDONLY,  &
+     &          MPI_INFO_NULL,  &
+     &          fp,  &
+     &          ierr)
+
+      iseek = 0
+      call MPI_File_set_view(fp, iseek, element, combined_ftype,  &
+     &          'native', MPI_INFO_NULL, ierr)
+
+!     clear the last time step
+
+      call clear_timestep
+
+!     read back the time steps and accumulate norms
+
+      call acc_sub_norms(idump)
+
+      call MPI_File_close(fp, ierr)
+
+ 20   continue
+      do m = 1, 5
+         xce_acc(m) = xce_sub(m) / dble(idump)
+      end do
+
+      return
+      end
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/initialize.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/initialize.f90
new file mode 100644
index 000000000..aeca10abc
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/initialize.f90
@@ -0,0 +1,310 @@
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine  initialize
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     This subroutine initializes the field variable u using 
+!     tri-linear transfinite interpolation of the boundary values     
+!---------------------------------------------------------------------
+
+      use bt_data
+      implicit none
+      
+      integer c, i, j, k, m, ii, jj, kk, ix, iy, iz
+      double precision  xi, eta, zeta, Pface(5,3,2), Pxi, Peta,  &
+     &     Pzeta, temp(5)
+
+!---------------------------------------------------------------------
+!  Later (in compute_rhs) we compute 1/u for every element. A few of 
+!  the corner elements are not used, but it is convenient (and faster)
+!  to compute the whole thing with a simple loop. Make sure those 
+!  values are nonzero by initializing the whole thing here. 
+!---------------------------------------------------------------------
+      do c = 1, ncells
+         do kk = -1, KMAX
+            do jj = -1, JMAX
+               do ii = -1, IMAX
+                  do m = 1, 5
+                     u(m, ii, jj, kk, c) = 1.0d0
+                  end do
+               end do
+            end do
+         end do
+      end do
+!---------------------------------------------------------------------
+
+
+
+!---------------------------------------------------------------------
+!     first store the "interpolated" values everywhere on the grid    
+!---------------------------------------------------------------------
+      do c=1, ncells
+         kk = 0
+         do k = cell_low(3,c), cell_high(3,c)
+            zeta = dble(k) * dnzm1
+            jj = 0
+            do j = cell_low(2,c), cell_high(2,c)
+               eta = dble(j) * dnym1
+               ii = 0
+               do i = cell_low(1,c), cell_high(1,c)
+                  xi = dble(i) * dnxm1
+                  
+                  do ix = 1, 2
+                     call exact_solution(dble(ix-1), eta, zeta,  &
+     &                    Pface(1,1,ix))
+                  enddo
+
+                  do iy = 1, 2
+                     call exact_solution(xi, dble(iy-1) , zeta,  &
+     &                    Pface(1,2,iy))
+                  enddo
+
+                  do iz = 1, 2
+                     call exact_solution(xi, eta, dble(iz-1),   &
+     &                    Pface(1,3,iz))
+                  enddo
+
+                  do m = 1, 5
+                     Pxi   = xi   * Pface(m,1,2) +  &
+     &                    (1.0d0-xi)   * Pface(m,1,1)
+                     Peta  = eta  * Pface(m,2,2) +  &
+     &                    (1.0d0-eta)  * Pface(m,2,1)
+                     Pzeta = zeta * Pface(m,3,2) +  &
+     &                    (1.0d0-zeta) * Pface(m,3,1)
+                     
+                     u(m,ii,jj,kk,c) = Pxi + Peta + Pzeta -  &
+     &                    Pxi*Peta - Pxi*Pzeta - Peta*Pzeta +  &
+     &                    Pxi*Peta*Pzeta
+
+                  enddo
+                  ii = ii + 1
+               enddo
+               jj = jj + 1
+            enddo
+            kk = kk+1
+         enddo
+      enddo
+
+!---------------------------------------------------------------------
+!     now store the exact values on the boundaries        
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     west face                                                  
+!---------------------------------------------------------------------
+      c = slice(1,1)
+      ii = 0
+      xi = 0.0d0
+      kk = 0
+      do k = cell_low(3,c), cell_high(3,c)
+         zeta = dble(k) * dnzm1
+         jj = 0
+         do j = cell_low(2,c), cell_high(2,c)
+            eta = dble(j) * dnym1
+            call exact_solution(xi, eta, zeta, temp)
+            do m = 1, 5
+               u(m,ii,jj,kk,c) = temp(m)
+            enddo
+            jj = jj + 1
+         enddo
+         kk = kk + 1
+      enddo
+
+!---------------------------------------------------------------------
+!     east face                                                      
+!---------------------------------------------------------------------
+      c  = slice(1,ncells)
+      ii = cell_size(1,c)-1
+      xi = 1.0d0
+      kk = 0
+      do k = cell_low(3,c), cell_high(3,c)
+         zeta = dble(k) * dnzm1
+         jj = 0
+         do j = cell_low(2,c), cell_high(2,c)
+            eta = dble(j) * dnym1
+            call exact_solution(xi, eta, zeta, temp)
+            do m = 1, 5
+               u(m,ii,jj,kk,c) = temp(m)
+            enddo
+            jj = jj + 1
+         enddo
+         kk = kk + 1
+      enddo
+
+!---------------------------------------------------------------------
+!     south face                                                 
+!---------------------------------------------------------------------
+      c = slice(2,1)
+      jj = 0
+      eta = 0.0d0
+      kk = 0
+      do k = cell_low(3,c), cell_high(3,c)
+         zeta = dble(k) * dnzm1
+         ii = 0
+         do i = cell_low(1,c), cell_high(1,c)
+            xi = dble(i) * dnxm1
+            call exact_solution(xi, eta, zeta, temp)
+            do m = 1, 5
+               u(m,ii,jj,kk,c) = temp(m)
+            enddo
+            ii = ii + 1
+         enddo
+         kk = kk + 1
+      enddo
+
+
+!---------------------------------------------------------------------
+!     north face                                    
+!---------------------------------------------------------------------
+      c = slice(2,ncells)
+      jj = cell_size(2,c)-1
+      eta = 1.0d0
+      kk = 0
+      do k = cell_low(3,c), cell_high(3,c)
+         zeta = dble(k) * dnzm1
+         ii = 0
+         do i = cell_low(1,c), cell_high(1,c)
+            xi = dble(i) * dnxm1
+            call exact_solution(xi, eta, zeta, temp)
+            do m = 1, 5
+               u(m,ii,jj,kk,c) = temp(m)
+            enddo
+            ii = ii + 1
+         enddo
+         kk = kk + 1
+      enddo
+
+!---------------------------------------------------------------------
+!     bottom face                                       
+!---------------------------------------------------------------------
+      c = slice(3,1)
+      kk = 0
+      zeta = 0.0d0
+      jj = 0
+      do j = cell_low(2,c), cell_high(2,c)
+         eta = dble(j) * dnym1
+         ii = 0
+         do i =cell_low(1,c), cell_high(1,c)
+            xi = dble(i) *dnxm1
+            call exact_solution(xi, eta, zeta, temp)
+            do m = 1, 5
+               u(m,ii,jj,kk,c) = temp(m)
+            enddo
+            ii = ii + 1
+         enddo
+         jj = jj + 1
+      enddo
+
+!---------------------------------------------------------------------
+!     top face     
+!---------------------------------------------------------------------
+      c = slice(3,ncells)
+      kk = cell_size(3,c)-1
+      zeta = 1.0d0
+      jj = 0
+      do j = cell_low(2,c), cell_high(2,c)
+         eta = dble(j) * dnym1
+         ii = 0
+         do i =cell_low(1,c), cell_high(1,c)
+            xi = dble(i) * dnxm1
+            call exact_solution(xi, eta, zeta, temp)
+            do m = 1, 5
+               u(m,ii,jj,kk,c) = temp(m)
+            enddo
+            ii = ii + 1
+         enddo
+         jj = jj + 1
+      enddo
+
+      return
+      end
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine lhsinit
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      use bt_data
+      implicit none
+      
+      integer i, j, k, d, c, m, n
+
+!---------------------------------------------------------------------
+!     loop over all cells                                       
+!---------------------------------------------------------------------
+      do c = 1, ncells
+
+!---------------------------------------------------------------------
+!     first, initialize the start and end arrays
+!---------------------------------------------------------------------
+         do d = 1, 3
+            if (cell_coord(d,c) .eq. 1) then
+               start(d,c) = 1
+            else 
+               start(d,c) = 0
+            endif
+            if (cell_coord(d,c) .eq. ncells) then
+               end(d,c) = 1
+            else
+               end(d,c) = 0
+            endif
+         enddo
+
+!---------------------------------------------------------------------
+!     zero the whole left hand side for starters
+!---------------------------------------------------------------------
+         do k = 0, cell_size(3,c)-1
+            do j = 0, cell_size(2,c)-1
+               do i = 0, cell_size(1,c)-1
+                  do m = 1,5
+                     do n = 1, 5
+                        lhsc(m,n,i,j,k,c) = 0.0d0
+                     enddo
+                  enddo
+               enddo
+            enddo
+         enddo
+
+      enddo
+
+      return
+      end
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine lhsabinit(lhsa, lhsb, size)
+      implicit none
+
+      integer size
+      double precision lhsa(5, 5, -1:size), lhsb(5, 5, -1:size)
+
+      integer i, m, n
+
+!---------------------------------------------------------------------
+!     zero lhsa and lhsb, then set the diagonal of lhsb to 1 (overkill, but convenient)
+!---------------------------------------------------------------------
+      do i = 0, size
+         do m = 1, 5
+            do n = 1, 5
+               lhsa(m,n,i) = 0.0d0
+               lhsb(m,n,i) = 0.0d0
+            enddo
+            lhsb(m,m,i) = 1.0d0
+         enddo
+      enddo
+
+      return
+      end
+
+
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/inputbt.data.sample b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/inputbt.data.sample
new file mode 100644
index 000000000..776654e8d
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/inputbt.data.sample
@@ -0,0 +1,5 @@
+200       number of time steps
+0.0008d0  dt for class A = 0.0008d0. class B = 0.0003d0  class C = 0.0001d0
+64 64 64
+5 0        write interval (optional read interval) for BTIO
+0 1000000  number of nodes in collective buffering and buffer size for BTIO
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/make_set.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/make_set.f90
new file mode 100644
index 000000000..b24575109
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/make_set.f90
@@ -0,0 +1,126 @@
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine make_set
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     This function allocates space for a set of cells and fills the set     
+!     such that communication between cells on different nodes is only
+!     nearest neighbor                                                   
+!---------------------------------------------------------------------
+
+      use bt_data
+      use mpinpb
+
+      implicit none
+
+      integer p, i, j, c, dir, size, excess, ierr,ierrcode
+
+!---------------------------------------------------------------------
+!     compute square root; add small number to allow for roundoff
+!     (note: this is computed in setup_mpi.f also, but prefer to do
+!     it twice because of some include file problems).
+!---------------------------------------------------------------------
+      ncells = dint(dsqrt(dble(no_nodes) + 0.00001d0))
+
+!---------------------------------------------------------------------
+!     this makes coding easier
+!---------------------------------------------------------------------
+      p = ncells
+      
+!---------------------------------------------------------------------
+!     determine the location of the cell at the bottom of the 3D 
+!     array of cells
+!---------------------------------------------------------------------
+      cell_coord(1,1) = mod(node,p) 
+      cell_coord(2,1) = node/p 
+      cell_coord(3,1) = 0
+
+!---------------------------------------------------------------------
+!     set the cell_coords for cells in the rest of the z-layers; 
+!     this comes down to a simple linear numbering in the
+!     z-direction, and to the doubly-cyclic numbering in the other dirs
+!---------------------------------------------------------------------
+      do c=2, p
+         cell_coord(1,c) = mod(cell_coord(1,c-1)+1,p) 
+         cell_coord(2,c) = mod(cell_coord(2,c-1)-1+p,p) 
+         cell_coord(3,c) = c-1
+      end do
+
+!---------------------------------------------------------------------
+!     offset all the coordinates by 1 to adjust for Fortran arrays
+!---------------------------------------------------------------------
+      do dir = 1, 3
+         do c = 1, p
+            cell_coord(dir,c) = cell_coord(dir,c) + 1
+         end do
+      end do
+      
+!---------------------------------------------------------------------
+!     slice(dir,n) contains the sequence number of the cell that is in
+!     coordinate plane n in the dir direction
+!---------------------------------------------------------------------
+      do dir = 1, 3
+         do c = 1, p
+            slice(dir,cell_coord(dir,c)) = c
+         end do
+      end do
+
+
+!---------------------------------------------------------------------
+!     fill the predecessor and successor entries, using the indices 
+!     of the bottom cells (they are the same at each level of k 
+!     anyway) acting as if full periodicity pertains; note that p is
+!     added to those arguments to the mod functions that might
+!     otherwise return wrong values when using the modulo function
+!---------------------------------------------------------------------
+      i = cell_coord(1,1)-1
+      j = cell_coord(2,1)-1
+
+      predecessor(1) = mod(i-1+p,p) + p*j
+      predecessor(2) = i + p*mod(j-1+p,p)
+      predecessor(3) = mod(i+1,p) + p*mod(j-1+p,p)
+      successor(1)   = mod(i+1,p) + p*j
+      successor(2)   = i + p*mod(j+1,p)
+      successor(3)   = mod(i-1+p,p) + p*mod(j+1,p)
+
+!---------------------------------------------------------------------
+!     now compute the sizes of the cells                                    
+!---------------------------------------------------------------------
+      do dir= 1, 3
+!---------------------------------------------------------------------
+!     set cell_coord range for each direction                            
+!---------------------------------------------------------------------
+         size   = grid_points(dir)/p
+         excess = mod(grid_points(dir),p)
+         do c=1, ncells
+            if (cell_coord(dir,c) .le. excess) then
+               cell_size(dir,c) = size+1
+               cell_low(dir,c) = (cell_coord(dir,c)-1)*(size+1)
+               cell_high(dir,c) = cell_low(dir,c)+size
+            else 
+               cell_size(dir,c) = size
+               cell_low(dir,c)  = excess*(size+1)+  &
+     &              (cell_coord(dir,c)-excess-1)*size
+               cell_high(dir,c) = cell_low(dir,c)+size-1
+            endif
+            if (cell_size(dir, c) .le. 2) then
+               write(*,50)
+ 50            format(' Error: Cell size too small. Min size is 3')
+               ierrcode = 1
+               call MPI_Abort(mpi_comm_world,ierrcode,ierr)
+               stop
+            endif
+         end do
+      end do
+
+      return
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/mpinpb.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/mpinpb.f90
new file mode 100644
index 000000000..6fd83ac63
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/mpinpb.f90
@@ -0,0 +1,18 @@
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+!
+!  mpinpb module
+!
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      module mpinpb
+
+      include 'mpif.h'
+
+      integer  node, no_nodes, total_nodes, root, comm_setup,  &
+     &         comm_solve, comm_rhs, dp_type
+      logical  active
+
+      end module mpinpb
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/rhs.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/rhs.f90
new file mode 100644
index 000000000..b47cdf6c1
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/rhs.f90
@@ -0,0 +1,429 @@
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine compute_rhs
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      use bt_data
+      implicit none
+
+      integer c, i, j, k, m
+      double precision rho_inv, uijk, up1, um1, vijk, vp1, vm1,  &
+     &     wijk, wp1, wm1
+
+
+      if (timeron) call timer_start(t_rhs)
+!---------------------------------------------------------------------
+!     loop over all cells owned by this node                           
+!---------------------------------------------------------------------
+      do c = 1, ncells
+
+!---------------------------------------------------------------------
+!     compute the reciprocal of density, the velocity components,
+!     and the kinetic energy terms.
+!---------------------------------------------------------------------
+         do k = -1, cell_size(3,c)
+            do j = -1, cell_size(2,c)
+               do i = -1, cell_size(1,c)
+                  rho_inv = 1.0d0/u(1,i,j,k,c)
+                  rho_i(i,j,k,c) = rho_inv
+                  us(i,j,k,c) = u(2,i,j,k,c) * rho_inv
+                  vs(i,j,k,c) = u(3,i,j,k,c) * rho_inv
+                  ws(i,j,k,c) = u(4,i,j,k,c) * rho_inv
+                  square(i,j,k,c)     = 0.5d0* (  &
+     &                 u(2,i,j,k,c)*u(2,i,j,k,c) +  &
+     &                 u(3,i,j,k,c)*u(3,i,j,k,c) +  &
+     &                 u(4,i,j,k,c)*u(4,i,j,k,c) ) * rho_inv
+                  qs(i,j,k,c) = square(i,j,k,c) * rho_inv
+               enddo
+            enddo
+         enddo
+
+!---------------------------------------------------------------------
+! copy the exact forcing term to the right hand side;  because 
+! this forcing term is known, we can store it on the whole of every 
+! cell,  including the boundary                   
+!---------------------------------------------------------------------
+
+         do k = 0, cell_size(3,c)-1
+            do j = 0, cell_size(2,c)-1
+               do i = 0, cell_size(1,c)-1
+                  do m = 1, 5
+                     rhs(m,i,j,k,c) = forcing(m,i,j,k,c)
+                  enddo
+               enddo
+            enddo
+         enddo
+
+
+!---------------------------------------------------------------------
+!     compute xi-direction fluxes 
+!---------------------------------------------------------------------
+         do k = start(3,c), cell_size(3,c)-end(3,c)-1
+            do j = start(2,c), cell_size(2,c)-end(2,c)-1
+               do i = start(1,c), cell_size(1,c)-end(1,c)-1
+                  uijk = us(i,j,k,c)
+                  up1  = us(i+1,j,k,c)
+                  um1  = us(i-1,j,k,c)
+
+                  rhs(1,i,j,k,c) = rhs(1,i,j,k,c) + dx1tx1 *  &
+     &                 (u(1,i+1,j,k,c) - 2.0d0*u(1,i,j,k,c) +  &
+     &                 u(1,i-1,j,k,c)) -  &
+     &                 tx2 * (u(2,i+1,j,k,c) - u(2,i-1,j,k,c))
+
+                  rhs(2,i,j,k,c) = rhs(2,i,j,k,c) + dx2tx1 *  &
+     &                 (u(2,i+1,j,k,c) - 2.0d0*u(2,i,j,k,c) +  &
+     &                 u(2,i-1,j,k,c)) +  &
+     &                 xxcon2*con43 * (up1 - 2.0d0*uijk + um1) -  &
+     &                 tx2 * (u(2,i+1,j,k,c)*up1 -  &
+     &                 u(2,i-1,j,k,c)*um1 +  &
+     &                 (u(5,i+1,j,k,c)- square(i+1,j,k,c)-  &
+     &                 u(5,i-1,j,k,c)+ square(i-1,j,k,c))*  &
+     &                 c2)
+
+                  rhs(3,i,j,k,c) = rhs(3,i,j,k,c) + dx3tx1 *  &
+     &                 (u(3,i+1,j,k,c) - 2.0d0*u(3,i,j,k,c) +  &
+     &                 u(3,i-1,j,k,c)) +  &
+     &                 xxcon2 * (vs(i+1,j,k,c) - 2.0d0*vs(i,j,k,c) +  &
+     &                 vs(i-1,j,k,c)) -  &
+     &                 tx2 * (u(3,i+1,j,k,c)*up1 -  &
+     &                 u(3,i-1,j,k,c)*um1)
+
+                  rhs(4,i,j,k,c) = rhs(4,i,j,k,c) + dx4tx1 *  &
+     &                 (u(4,i+1,j,k,c) - 2.0d0*u(4,i,j,k,c) +  &
+     &                 u(4,i-1,j,k,c)) +  &
+     &                 xxcon2 * (ws(i+1,j,k,c) - 2.0d0*ws(i,j,k,c) +  &
+     &                 ws(i-1,j,k,c)) -  &
+     &                 tx2 * (u(4,i+1,j,k,c)*up1 -  &
+     &                 u(4,i-1,j,k,c)*um1)
+
+                  rhs(5,i,j,k,c) = rhs(5,i,j,k,c) + dx5tx1 *  &
+     &                 (u(5,i+1,j,k,c) - 2.0d0*u(5,i,j,k,c) +  &
+     &                 u(5,i-1,j,k,c)) +  &
+     &                 xxcon3 * (qs(i+1,j,k,c) - 2.0d0*qs(i,j,k,c) +  &
+     &                 qs(i-1,j,k,c)) +  &
+     &                 xxcon4 * (up1*up1 -       2.0d0*uijk*uijk +  &
+     &                 um1*um1) +  &
+     &                 xxcon5 * (u(5,i+1,j,k,c)*rho_i(i+1,j,k,c) -  &
+     &                 2.0d0*u(5,i,j,k,c)*rho_i(i,j,k,c) +  &
+     &                 u(5,i-1,j,k,c)*rho_i(i-1,j,k,c)) -  &
+     &                 tx2 * ( (c1*u(5,i+1,j,k,c) -  &
+     &                 c2*square(i+1,j,k,c))*up1 -  &
+     &                 (c1*u(5,i-1,j,k,c) -  &
+     &                 c2*square(i-1,j,k,c))*um1 )
+               enddo
+            enddo
+         enddo
+
+!---------------------------------------------------------------------
+!     add fourth order xi-direction dissipation               
+!---------------------------------------------------------------------
+         if (start(1,c) .gt. 0) then
+            do k = start(3,c), cell_size(3,c)-end(3,c)-1
+               do j = start(2,c), cell_size(2,c)-end(2,c)-1
+                  i = 1
+                  do m = 1, 5
+                     rhs(m,i,j,k,c) = rhs(m,i,j,k,c)- dssp *  &
+     &                    ( 5.0d0*u(m,i,j,k,c) - 4.0d0*u(m,i+1,j,k,c) +  &
+     &                    u(m,i+2,j,k,c))
+                  enddo
+
+                  i = 2
+                  do m = 1, 5
+                     rhs(m,i,j,k,c) = rhs(m,i,j,k,c) - dssp *  &
+     &                    (-4.0d0*u(m,i-1,j,k,c) + 6.0d0*u(m,i,j,k,c) -  &
+     &                    4.0d0*u(m,i+1,j,k,c) + u(m,i+2,j,k,c))
+                  enddo
+               enddo
+            enddo
+         endif
+
+         do k = start(3,c), cell_size(3,c)-end(3,c)-1
+            do j = start(2,c), cell_size(2,c)-end(2,c)-1
+               do i = 3*start(1,c),cell_size(1,c)-3*end(1,c)-1
+                  do m = 1, 5
+                     rhs(m,i,j,k,c) = rhs(m,i,j,k,c) - dssp *  &
+     &                    (  u(m,i-2,j,k,c) - 4.0d0*u(m,i-1,j,k,c) +  &
+     &                    6.0d0*u(m,i,j,k,c) - 4.0d0*u(m,i+1,j,k,c) +  &
+     &                    u(m,i+2,j,k,c) )
+                  enddo
+               enddo
+            enddo
+         enddo
+         
+
+         if (end(1,c) .gt. 0) then
+            do k = start(3,c), cell_size(3,c)-end(3,c)-1
+               do j = start(2,c), cell_size(2,c)-end(2,c)-1
+                  i = cell_size(1,c)-3
+                  do m = 1, 5
+                     rhs(m,i,j,k,c) = rhs(m,i,j,k,c) - dssp *  &
+     &                    ( u(m,i-2,j,k,c) - 4.0d0*u(m,i-1,j,k,c) +  &
+     &                    6.0d0*u(m,i,j,k,c) - 4.0d0*u(m,i+1,j,k,c) )
+                  enddo
+
+                  i = cell_size(1,c)-2
+                  do m = 1, 5
+                     rhs(m,i,j,k,c) = rhs(m,i,j,k,c) - dssp *  &
+     &                    ( u(m,i-2,j,k,c) - 4.d0*u(m,i-1,j,k,c) +  &
+     &                    5.d0*u(m,i,j,k,c) )
+                  enddo
+               enddo
+            enddo
+         endif
+
+!---------------------------------------------------------------------
+!     compute eta-direction fluxes 
+!---------------------------------------------------------------------
+         do k = start(3,c), cell_size(3,c)-end(3,c)-1
+            do j = start(2,c), cell_size(2,c)-end(2,c)-1
+               do i = start(1,c), cell_size(1,c)-end(1,c)-1
+                  vijk = vs(i,j,k,c)
+                  vp1  = vs(i,j+1,k,c)
+                  vm1  = vs(i,j-1,k,c)
+                  rhs(1,i,j,k,c) = rhs(1,i,j,k,c) + dy1ty1 *  &
+     &                 (u(1,i,j+1,k,c) - 2.0d0*u(1,i,j,k,c) +  &
+     &                 u(1,i,j-1,k,c)) -  &
+     &                 ty2 * (u(3,i,j+1,k,c) - u(3,i,j-1,k,c))
+                  rhs(2,i,j,k,c) = rhs(2,i,j,k,c) + dy2ty1 *  &
+     &                 (u(2,i,j+1,k,c) - 2.0d0*u(2,i,j,k,c) +  &
+     &                 u(2,i,j-1,k,c)) +  &
+     &                 yycon2 * (us(i,j+1,k,c) - 2.0d0*us(i,j,k,c) +  &
+     &                 us(i,j-1,k,c)) -  &
+     &                 ty2 * (u(2,i,j+1,k,c)*vp1 -  &
+     &                 u(2,i,j-1,k,c)*vm1)
+                  rhs(3,i,j,k,c) = rhs(3,i,j,k,c) + dy3ty1 *  &
+     &                 (u(3,i,j+1,k,c) - 2.0d0*u(3,i,j,k,c) +  &
+     &                 u(3,i,j-1,k,c)) +  &
+     &                 yycon2*con43 * (vp1 - 2.0d0*vijk + vm1) -  &
+     &                 ty2 * (u(3,i,j+1,k,c)*vp1 -  &
+     &                 u(3,i,j-1,k,c)*vm1 +  &
+     &                 (u(5,i,j+1,k,c) - square(i,j+1,k,c) -  &
+     &                 u(5,i,j-1,k,c) + square(i,j-1,k,c))  &
+     &                 *c2)
+                  rhs(4,i,j,k,c) = rhs(4,i,j,k,c) + dy4ty1 *  &
+     &                 (u(4,i,j+1,k,c) - 2.0d0*u(4,i,j,k,c) +  &
+     &                 u(4,i,j-1,k,c)) +  &
+     &                 yycon2 * (ws(i,j+1,k,c) - 2.0d0*ws(i,j,k,c) +  &
+     &                 ws(i,j-1,k,c)) -  &
+     &                 ty2 * (u(4,i,j+1,k,c)*vp1 -  &
+     &                 u(4,i,j-1,k,c)*vm1)
+                  rhs(5,i,j,k,c) = rhs(5,i,j,k,c) + dy5ty1 *  &
+     &                 (u(5,i,j+1,k,c) - 2.0d0*u(5,i,j,k,c) +  &
+     &                 u(5,i,j-1,k,c)) +  &
+     &                 yycon3 * (qs(i,j+1,k,c) - 2.0d0*qs(i,j,k,c) +  &
+     &                 qs(i,j-1,k,c)) +  &
+     &                 yycon4 * (vp1*vp1       - 2.0d0*vijk*vijk +  &
+     &                 vm1*vm1) +  &
+     &                 yycon5 * (u(5,i,j+1,k,c)*rho_i(i,j+1,k,c) -  &
+     &                 2.0d0*u(5,i,j,k,c)*rho_i(i,j,k,c) +  &
+     &                 u(5,i,j-1,k,c)*rho_i(i,j-1,k,c)) -  &
+     &                 ty2 * ((c1*u(5,i,j+1,k,c) -  &
+     &                 c2*square(i,j+1,k,c)) * vp1 -  &
+     &                 (c1*u(5,i,j-1,k,c) -  &
+     &                 c2*square(i,j-1,k,c)) * vm1)
+               enddo
+            enddo
+         enddo
+
+!---------------------------------------------------------------------
+!     add fourth order eta-direction dissipation         
+!---------------------------------------------------------------------
+         if (start(2,c) .gt. 0) then
+            do k = start(3,c), cell_size(3,c)-end(3,c)-1
+               j = 1
+               do i = start(1,c), cell_size(1,c)-end(1,c)-1
+                  do m = 1, 5
+                     rhs(m,i,j,k,c) = rhs(m,i,j,k,c)- dssp *  &
+     &                    ( 5.0d0*u(m,i,j,k,c) - 4.0d0*u(m,i,j+1,k,c) +  &
+     &                    u(m,i,j+2,k,c))
+                  enddo
+               enddo
+
+               j = 2
+               do i = start(1,c), cell_size(1,c)-end(1,c)-1
+                  do m = 1, 5
+                     rhs(m,i,j,k,c) = rhs(m,i,j,k,c) - dssp *  &
+     &                    (-4.0d0*u(m,i,j-1,k,c) + 6.0d0*u(m,i,j,k,c) -  &
+     &                    4.0d0*u(m,i,j+1,k,c) + u(m,i,j+2,k,c))
+                  enddo
+               enddo
+            enddo
+         endif
+
+         do k = start(3,c), cell_size(3,c)-end(3,c)-1
+            do j = 3*start(2,c), cell_size(2,c)-3*end(2,c)-1
+               do i = start(1,c),cell_size(1,c)-end(1,c)-1
+                  do m = 1, 5
+                     rhs(m,i,j,k,c) = rhs(m,i,j,k,c) - dssp *  &
+     &                    (  u(m,i,j-2,k,c) - 4.0d0*u(m,i,j-1,k,c) +  &
+     &                    6.0d0*u(m,i,j,k,c) - 4.0d0*u(m,i,j+1,k,c) +  &
+     &                    u(m,i,j+2,k,c) )
+                  enddo
+               enddo
+            enddo
+         enddo
+         
+         if (end(2,c) .gt. 0) then
+            do k = start(3,c), cell_size(3,c)-end(3,c)-1
+               j = cell_size(2,c)-3
+               do i = start(1,c), cell_size(1,c)-end(1,c)-1
+                  do m = 1, 5
+                     rhs(m,i,j,k,c) = rhs(m,i,j,k,c) - dssp *  &
+     &                    ( u(m,i,j-2,k,c) - 4.0d0*u(m,i,j-1,k,c) +  &
+     &                    6.0d0*u(m,i,j,k,c) - 4.0d0*u(m,i,j+1,k,c) )
+                  enddo
+               enddo
+
+               j = cell_size(2,c)-2
+               do i = start(1,c), cell_size(1,c)-end(1,c)-1
+                  do m = 1, 5
+                     rhs(m,i,j,k,c) = rhs(m,i,j,k,c) - dssp *  &
+     &                    ( u(m,i,j-2,k,c) - 4.d0*u(m,i,j-1,k,c) +  &
+     &                    5.d0*u(m,i,j,k,c) )
+                  enddo
+               enddo
+            enddo
+         endif
+
+!---------------------------------------------------------------------
+!     compute zeta-direction fluxes 
+!---------------------------------------------------------------------
+         do k = start(3,c), cell_size(3,c)-end(3,c)-1
+            do j = start(2,c), cell_size(2,c)-end(2,c)-1
+               do i = start(1,c), cell_size(1,c)-end(1,c)-1
+                  wijk = ws(i,j,k,c)
+                  wp1  = ws(i,j,k+1,c)
+                  wm1  = ws(i,j,k-1,c)
+
+                  rhs(1,i,j,k,c) = rhs(1,i,j,k,c) + dz1tz1 *  &
+     &                 (u(1,i,j,k+1,c) - 2.0d0*u(1,i,j,k,c) +  &
+     &                 u(1,i,j,k-1,c)) -  &
+     &                 tz2 * (u(4,i,j,k+1,c) - u(4,i,j,k-1,c))
+                  rhs(2,i,j,k,c) = rhs(2,i,j,k,c) + dz2tz1 *  &
+     &                 (u(2,i,j,k+1,c) - 2.0d0*u(2,i,j,k,c) +  &
+     &                 u(2,i,j,k-1,c)) +  &
+     &                 zzcon2 * (us(i,j,k+1,c) - 2.0d0*us(i,j,k,c) +  &
+     &                 us(i,j,k-1,c)) -  &
+     &                 tz2 * (u(2,i,j,k+1,c)*wp1 -  &
+     &                 u(2,i,j,k-1,c)*wm1)
+                  rhs(3,i,j,k,c) = rhs(3,i,j,k,c) + dz3tz1 *  &
+     &                 (u(3,i,j,k+1,c) - 2.0d0*u(3,i,j,k,c) +  &
+     &                 u(3,i,j,k-1,c)) +  &
+     &                 zzcon2 * (vs(i,j,k+1,c) - 2.0d0*vs(i,j,k,c) +  &
+     &                 vs(i,j,k-1,c)) -  &
+     &                 tz2 * (u(3,i,j,k+1,c)*wp1 -  &
+     &                 u(3,i,j,k-1,c)*wm1)
+                  rhs(4,i,j,k,c) = rhs(4,i,j,k,c) + dz4tz1 *  &
+     &                 (u(4,i,j,k+1,c) - 2.0d0*u(4,i,j,k,c) +  &
+     &                 u(4,i,j,k-1,c)) +  &
+     &                 zzcon2*con43 * (wp1 - 2.0d0*wijk + wm1) -  &
+     &                 tz2 * (u(4,i,j,k+1,c)*wp1 -  &
+     &                 u(4,i,j,k-1,c)*wm1 +  &
+     &                 (u(5,i,j,k+1,c) - square(i,j,k+1,c) -  &
+     &                 u(5,i,j,k-1,c) + square(i,j,k-1,c))  &
+     &                 *c2)
+                  rhs(5,i,j,k,c) = rhs(5,i,j,k,c) + dz5tz1 *  &
+     &                 (u(5,i,j,k+1,c) - 2.0d0*u(5,i,j,k,c) +  &
+     &                 u(5,i,j,k-1,c)) +  &
+     &                 zzcon3 * (qs(i,j,k+1,c) - 2.0d0*qs(i,j,k,c) +  &
+     &                 qs(i,j,k-1,c)) +  &
+     &                 zzcon4 * (wp1*wp1 - 2.0d0*wijk*wijk +  &
+     &                 wm1*wm1) +  &
+     &                 zzcon5 * (u(5,i,j,k+1,c)*rho_i(i,j,k+1,c) -  &
+     &                 2.0d0*u(5,i,j,k,c)*rho_i(i,j,k,c) +  &
+     &                 u(5,i,j,k-1,c)*rho_i(i,j,k-1,c)) -  &
+     &                 tz2 * ( (c1*u(5,i,j,k+1,c) -  &
+     &                 c2*square(i,j,k+1,c))*wp1 -  &
+     &                 (c1*u(5,i,j,k-1,c) -  &
+     &                 c2*square(i,j,k-1,c))*wm1)
+               enddo
+            enddo
+         enddo
+
+!---------------------------------------------------------------------
+!     add fourth order zeta-direction dissipation                
+!---------------------------------------------------------------------
+         if (start(3,c) .gt. 0) then
+            k = 1
+            do j = start(2,c), cell_size(2,c)-end(2,c)-1
+               do i = start(1,c), cell_size(1,c)-end(1,c)-1
+                  do m = 1, 5
+                     rhs(m,i,j,k,c) = rhs(m,i,j,k,c)- dssp *  &
+     &                    ( 5.0d0*u(m,i,j,k,c) - 4.0d0*u(m,i,j,k+1,c) +  &
+     &                    u(m,i,j,k+2,c))
+                  enddo
+               enddo
+            enddo
+
+            k = 2
+            do j = start(2,c), cell_size(2,c)-end(2,c)-1
+               do i = start(1,c), cell_size(1,c)-end(1,c)-1
+                  do m = 1, 5
+                     rhs(m,i,j,k,c) = rhs(m,i,j,k,c) - dssp *  &
+     &                    (-4.0d0*u(m,i,j,k-1,c) + 6.0d0*u(m,i,j,k,c) -  &
+     &                    4.0d0*u(m,i,j,k+1,c) + u(m,i,j,k+2,c))
+                  enddo
+               enddo
+            enddo
+         endif
+
+         do k = 3*start(3,c), cell_size(3,c)-3*end(3,c)-1
+            do j = start(2,c), cell_size(2,c)-end(2,c)-1
+               do i = start(1,c),cell_size(1,c)-end(1,c)-1
+                  do m = 1, 5
+                     rhs(m,i,j,k,c) = rhs(m,i,j,k,c) - dssp *  &
+     &                    (  u(m,i,j,k-2,c) - 4.0d0*u(m,i,j,k-1,c) +  &
+     &                    6.0d0*u(m,i,j,k,c) - 4.0d0*u(m,i,j,k+1,c) +  &
+     &                    u(m,i,j,k+2,c) )
+                  enddo
+               enddo
+            enddo
+         enddo
+         
+         if (end(3,c) .gt. 0) then
+            k = cell_size(3,c)-3
+            do j = start(2,c), cell_size(2,c)-end(2,c)-1
+               do i = start(1,c), cell_size(1,c)-end(1,c)-1
+                  do m = 1, 5
+                     rhs(m,i,j,k,c) = rhs(m,i,j,k,c) - dssp *  &
+     &                    ( u(m,i,j,k-2,c) - 4.0d0*u(m,i,j,k-1,c) +  &
+     &                    6.0d0*u(m,i,j,k,c) - 4.0d0*u(m,i,j,k+1,c) )
+                  enddo
+               enddo
+            enddo
+
+            k = cell_size(3,c)-2
+            do j = start(2,c), cell_size(2,c)-end(2,c)-1
+               do i = start(1,c), cell_size(1,c)-end(1,c)-1
+                  do m = 1, 5
+                     rhs(m,i,j,k,c) = rhs(m,i,j,k,c) - dssp *  &
+     &                    ( u(m,i,j,k-2,c) - 4.d0*u(m,i,j,k-1,c) +  &
+     &                    5.d0*u(m,i,j,k,c) )
+                  enddo
+               enddo
+            enddo
+         endif
+
+         do k = start(3,c), cell_size(3,c)-end(3,c)-1
+            do j = start(2,c), cell_size(2,c)-end(2,c)-1
+               do i = start(1,c), cell_size(1,c)-end(1,c)-1
+                  do m = 1, 5
+                     rhs(m,i,j,k,c) = rhs(m,i,j,k,c) * dt
+                  enddo
+               enddo
+            enddo
+         enddo
+
+      enddo
+     
+      if (timeron) call timer_stop(t_rhs)
+     
+      return
+      end
+
+
+
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/set_constants.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/set_constants.f90
new file mode 100644
index 000000000..1519e2cb7
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/set_constants.f90
@@ -0,0 +1,203 @@
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine  set_constants
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      use bt_data
+      implicit none
+      
+      ce(1,1)  = 2.0d0
+      ce(1,2)  = 0.0d0
+      ce(1,3)  = 0.0d0
+      ce(1,4)  = 4.0d0
+      ce(1,5)  = 5.0d0
+      ce(1,6)  = 3.0d0
+      ce(1,7)  = 0.5d0
+      ce(1,8)  = 0.02d0
+      ce(1,9)  = 0.01d0
+      ce(1,10) = 0.03d0
+      ce(1,11) = 0.5d0
+      ce(1,12) = 0.4d0
+      ce(1,13) = 0.3d0
+      
+      ce(2,1)  = 1.0d0
+      ce(2,2)  = 0.0d0
+      ce(2,3)  = 0.0d0
+      ce(2,4)  = 0.0d0
+      ce(2,5)  = 1.0d0
+      ce(2,6)  = 2.0d0
+      ce(2,7)  = 3.0d0
+      ce(2,8)  = 0.01d0
+      ce(2,9)  = 0.03d0
+      ce(2,10) = 0.02d0
+      ce(2,11) = 0.4d0
+      ce(2,12) = 0.3d0
+      ce(2,13) = 0.5d0
+
+      ce(3,1)  = 2.0d0
+      ce(3,2)  = 2.0d0
+      ce(3,3)  = 0.0d0
+      ce(3,4)  = 0.0d0
+      ce(3,5)  = 0.0d0
+      ce(3,6)  = 2.0d0
+      ce(3,7)  = 3.0d0
+      ce(3,8)  = 0.04d0
+      ce(3,9)  = 0.03d0
+      ce(3,10) = 0.05d0
+      ce(3,11) = 0.3d0
+      ce(3,12) = 0.5d0
+      ce(3,13) = 0.4d0
+
+      ce(4,1)  = 2.0d0
+      ce(4,2)  = 2.0d0
+      ce(4,3)  = 0.0d0
+      ce(4,4)  = 0.0d0
+      ce(4,5)  = 0.0d0
+      ce(4,6)  = 2.0d0
+      ce(4,7)  = 3.0d0
+      ce(4,8)  = 0.03d0
+      ce(4,9)  = 0.05d0
+      ce(4,10) = 0.04d0
+      ce(4,11) = 0.2d0
+      ce(4,12) = 0.1d0
+      ce(4,13) = 0.3d0
+
+      ce(5,1)  = 5.0d0
+      ce(5,2)  = 4.0d0
+      ce(5,3)  = 3.0d0
+      ce(5,4)  = 2.0d0
+      ce(5,5)  = 0.1d0
+      ce(5,6)  = 0.4d0
+      ce(5,7)  = 0.3d0
+      ce(5,8)  = 0.05d0
+      ce(5,9)  = 0.04d0
+      ce(5,10) = 0.03d0
+      ce(5,11) = 0.1d0
+      ce(5,12) = 0.3d0
+      ce(5,13) = 0.2d0
+
+      c1 = 1.4d0
+      c2 = 0.4d0
+      c3 = 0.1d0
+      c4 = 1.0d0
+      c5 = 1.4d0
+
+      bt = dsqrt(0.5d0)
+
+      dnxm1 = 1.0d0 / dble(grid_points(1)-1)
+      dnym1 = 1.0d0 / dble(grid_points(2)-1)
+      dnzm1 = 1.0d0 / dble(grid_points(3)-1)
+
+      c1c2 = c1 * c2
+      c1c5 = c1 * c5
+      c3c4 = c3 * c4
+      c1345 = c1c5 * c3c4
+
+      conz1 = (1.0d0-c1c5)
+
+      tx1 = 1.0d0 / (dnxm1 * dnxm1)
+      tx2 = 1.0d0 / (2.0d0 * dnxm1)
+      tx3 = 1.0d0 / dnxm1
+
+      ty1 = 1.0d0 / (dnym1 * dnym1)
+      ty2 = 1.0d0 / (2.0d0 * dnym1)
+      ty3 = 1.0d0 / dnym1
+      
+      tz1 = 1.0d0 / (dnzm1 * dnzm1)
+      tz2 = 1.0d0 / (2.0d0 * dnzm1)
+      tz3 = 1.0d0 / dnzm1
+
+      dx1 = 0.75d0
+      dx2 = 0.75d0
+      dx3 = 0.75d0
+      dx4 = 0.75d0
+      dx5 = 0.75d0
+
+      dy1 = 0.75d0
+      dy2 = 0.75d0
+      dy3 = 0.75d0
+      dy4 = 0.75d0
+      dy5 = 0.75d0
+
+      dz1 = 1.0d0
+      dz2 = 1.0d0
+      dz3 = 1.0d0
+      dz4 = 1.0d0
+      dz5 = 1.0d0
+
+      dxmax = dmax1(dx3, dx4)
+      dymax = dmax1(dy2, dy4)
+      dzmax = dmax1(dz2, dz3)
+
+      dssp = 0.25d0 * dmax1(dx1, dmax1(dy1, dz1) )
+
+      c4dssp = 4.0d0 * dssp
+      c5dssp = 5.0d0 * dssp
+
+      dttx1 = dt*tx1
+      dttx2 = dt*tx2
+      dtty1 = dt*ty1
+      dtty2 = dt*ty2
+      dttz1 = dt*tz1
+      dttz2 = dt*tz2
+
+      c2dttx1 = 2.0d0*dttx1
+      c2dtty1 = 2.0d0*dtty1
+      c2dttz1 = 2.0d0*dttz1
+
+      dtdssp = dt*dssp
+
+      comz1  = dtdssp
+      comz4  = 4.0d0*dtdssp
+      comz5  = 5.0d0*dtdssp
+      comz6  = 6.0d0*dtdssp
+
+      c3c4tx3 = c3c4*tx3
+      c3c4ty3 = c3c4*ty3
+      c3c4tz3 = c3c4*tz3
+
+      dx1tx1 = dx1*tx1
+      dx2tx1 = dx2*tx1
+      dx3tx1 = dx3*tx1
+      dx4tx1 = dx4*tx1
+      dx5tx1 = dx5*tx1
+      
+      dy1ty1 = dy1*ty1
+      dy2ty1 = dy2*ty1
+      dy3ty1 = dy3*ty1
+      dy4ty1 = dy4*ty1
+      dy5ty1 = dy5*ty1
+      
+      dz1tz1 = dz1*tz1
+      dz2tz1 = dz2*tz1
+      dz3tz1 = dz3*tz1
+      dz4tz1 = dz4*tz1
+      dz5tz1 = dz5*tz1
+
+      c2iv  = 2.5d0
+      con43 = 4.0d0/3.0d0
+      con16 = 1.0d0/6.0d0
+      
+      xxcon1 = c3c4tx3*con43*tx3
+      xxcon2 = c3c4tx3*tx3
+      xxcon3 = c3c4tx3*conz1*tx3
+      xxcon4 = c3c4tx3*con16*tx3
+      xxcon5 = c3c4tx3*c1c5*tx3
+
+      yycon1 = c3c4ty3*con43*ty3
+      yycon2 = c3c4ty3*ty3
+      yycon3 = c3c4ty3*conz1*ty3
+      yycon4 = c3c4ty3*con16*ty3
+      yycon5 = c3c4ty3*c1c5*ty3
+
+      zzcon1 = c3c4tz3*con43*tz3
+      zzcon2 = c3c4tz3*tz3
+      zzcon3 = c3c4tz3*conz1*tz3
+      zzcon4 = c3c4tz3*con16*tz3
+      zzcon5 = c3c4tz3*c1c5*tz3
+
+      return
+      end
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/setup_mpi.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/setup_mpi.f90
new file mode 100644
index 000000000..9d7939b7c
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/setup_mpi.f90
@@ -0,0 +1,48 @@
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine setup_mpi
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+! set up MPI stuff
+!---------------------------------------------------------------------
+
+      use bt_data
+      use mpinpb
+
+      implicit none
+
+      integer error, nc
+
+      call mpi_init(error)
+
+      if (.not. convertdouble) then
+         dp_type = MPI_DOUBLE_PRECISION
+      else
+         dp_type = MPI_REAL
+      endif
+
+!---------------------------------------------------------------------
+!     get a process grid that requires a square number of procs.
+!     excess ranks are marked as inactive.
+!---------------------------------------------------------------------
+      call get_active_nprocs(1, nc, maxcells, no_nodes,  &
+     &                       total_nodes, node, comm_setup, active)
+
+      if (.not. active) return
+
+      call mpi_comm_dup(comm_setup, comm_solve, error)
+      call mpi_comm_dup(comm_setup, comm_rhs, error)
+
+!---------------------------------------------------------------------
+!     let node 0 be the root for the group (there is only one)
+!---------------------------------------------------------------------
+      root = 0
+
+      return
+      end
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/simple_mpiio.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/simple_mpiio.f90
new file mode 100644
index 000000000..e47da2943
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/simple_mpiio.f90
@@ -0,0 +1,228 @@
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine setup_btio
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      use bt_data
+      use mpinpb
+
+      implicit none
+
+      integer m, ierr
+
+      iseek=0
+
+      if (node .eq. root) then
+          call MPI_File_delete(filenm, MPI_INFO_NULL, ierr)
+      endif
+
+      call MPI_Barrier(comm_solve, ierr)
+
+      call MPI_File_open(comm_solve,  &
+     &          filenm,  &
+     &          MPI_MODE_RDWR + MPI_MODE_CREATE,  &
+     &          MPI_INFO_NULL,  &
+     &          fp,  &
+     &          ierr)
+
+      call MPI_File_set_view(fp,  &
+     &          iseek, MPI_DOUBLE_PRECISION, MPI_DOUBLE_PRECISION,  &
+     &          'native', MPI_INFO_NULL, ierr)
+
+      if (ierr .ne. MPI_SUCCESS) then
+          print *, 'Error opening file'
+          stop
+      endif
+
+      do m = 1, 5
+         xce_sub(m) = 0.d0
+      end do
+
+      idump_sub = 0
+
+      return
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine output_timestep
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      use bt_data
+      use mpinpb
+
+      implicit none
+
+      integer count, jio, kio, cio, aio
+      integer ierr
+      integer mstatus(MPI_STATUS_SIZE)
+
+      do cio=1,ncells
+          do kio=0, cell_size(3,cio)-1
+              do jio=0, cell_size(2,cio)-1
+                  iseek=(cell_low(3,cio)+kio) +  &
+     &                   PROBLEM_SIZE*idump_sub
+                  iseek=(cell_low(2,cio)+jio) +  &
+     &                   PROBLEM_SIZE*iseek
+                  iseek=5*(cell_low(1,cio) +  &
+     &                   PROBLEM_SIZE*iseek)
+
+                  count=5*cell_size(1,cio)
+
+                  call MPI_File_write_at(fp, iseek,  &
+     &                  u(1,0,jio,kio,cio),  &
+     &                  count, MPI_DOUBLE_PRECISION,  &
+     &                  mstatus, ierr)
+
+                  if (ierr .ne. MPI_SUCCESS) then
+                      print *, 'Error writing to file'
+                      stop
+                  endif
+              enddo
+          enddo
+      enddo
+
+      idump_sub = idump_sub + 1
+      if (rd_interval .gt. 0) then
+         if (idump_sub .ge. rd_interval) then
+
+            call acc_sub_norms(idump+1)
+
+            idump_sub = 0
+         endif
+      endif
+
+      return
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine acc_sub_norms(idump_cur)
+
+      use bt_data
+      use mpinpb
+
+      implicit none
+
+      integer idump_cur
+
+      integer count, jio, kio, cio, ii, m, ichunk
+      integer ierr
+      integer mstatus(MPI_STATUS_SIZE)
+      double precision xce_single(5)
+
+      ichunk = idump_cur - idump_sub + 1
+      do ii=0, idump_sub-1
+        do cio=1,ncells
+          do kio=0, cell_size(3,cio)-1
+              do jio=0, cell_size(2,cio)-1
+                  iseek=(cell_low(3,cio)+kio) +  &
+     &                   PROBLEM_SIZE*ii
+                  iseek=(cell_low(2,cio)+jio) +  &
+     &                   PROBLEM_SIZE*iseek
+                  iseek=5*(cell_low(1,cio) +  &
+     &                   PROBLEM_SIZE*iseek)
+
+                  count=5*cell_size(1,cio)
+
+                  call MPI_File_read_at(fp, iseek,  &
+     &                  u(1,0,jio,kio,cio),  &
+     &                  count, MPI_DOUBLE_PRECISION,  &
+     &                  mstatus, ierr)
+
+                  if (ierr .ne. MPI_SUCCESS) then
+                      print *, 'Error reading back file'
+                      call MPI_File_close(fp, ierr)
+                      stop
+                  endif
+              enddo
+          enddo
+        enddo
+
+        if (node .eq. root) print *, 'Reading data set ', ii+ichunk
+
+        call error_norm(xce_single)
+        do m = 1, 5
+           xce_sub(m) = xce_sub(m) + xce_single(m)
+        end do
+      enddo
+
+      return
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine btio_cleanup
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      use bt_data
+      use mpinpb
+
+      implicit none
+
+      integer ierr
+
+      call MPI_File_close(fp, ierr)
+
+      return
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine accumulate_norms(xce_acc)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      use bt_data
+      use mpinpb
+
+      implicit none
+
+      double precision xce_acc(5)
+      integer m, ierr
+
+      if (rd_interval .gt. 0) goto 20
+
+      call MPI_File_open(comm_solve,  &
+     &          filenm,  &
+     &          MPI_MODE_RDONLY,  &
+     &          MPI_INFO_NULL,  &
+     &          fp,  &
+     &          ierr)
+
+      iseek = 0
+      call MPI_File_set_view(fp,  &
+     &          iseek, MPI_DOUBLE_PRECISION, MPI_DOUBLE_PRECISION,  &
+     &          'native', MPI_INFO_NULL, ierr)
+
+!     clear the last time step
+
+      call clear_timestep
+
+!     read back the time steps and accumulate norms
+
+      call acc_sub_norms(idump)
+
+      call MPI_File_close(fp, ierr)
+
+ 20   continue
+      do m = 1, 5
+         xce_acc(m) = xce_sub(m) / dble(idump)
+      end do
+
+      return
+      end
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/solve_subs.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/solve_subs.f90
new file mode 100644
index 000000000..913bd2778
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/solve_subs.f90
@@ -0,0 +1,642 @@
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine matvec_sub(ablock,avec,bvec)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     computes bvec = bvec - ablock*avec
+!---------------------------------------------------------------------
+
+      implicit none
+
+      double precision ablock,avec,bvec
+      dimension ablock(5,5),avec(5),bvec(5)
+
+!---------------------------------------------------------------------
+!            rhs(i,ic,jc,kc,ccell) = rhs(i,ic,jc,kc,ccell) 
+!     $           - lhs(i,1,ablock,ia,ja,ka,acell)*
+!---------------------------------------------------------------------
+         bvec(1) = bvec(1) - ablock(1,1)*avec(1)  &
+     &                     - ablock(1,2)*avec(2)  &
+     &                     - ablock(1,3)*avec(3)  &
+     &                     - ablock(1,4)*avec(4)  &
+     &                     - ablock(1,5)*avec(5)
+         bvec(2) = bvec(2) - ablock(2,1)*avec(1)  &
+     &                     - ablock(2,2)*avec(2)  &
+     &                     - ablock(2,3)*avec(3)  &
+     &                     - ablock(2,4)*avec(4)  &
+     &                     - ablock(2,5)*avec(5)
+         bvec(3) = bvec(3) - ablock(3,1)*avec(1)  &
+     &                     - ablock(3,2)*avec(2)  &
+     &                     - ablock(3,3)*avec(3)  &
+     &                     - ablock(3,4)*avec(4)  &
+     &                     - ablock(3,5)*avec(5)
+         bvec(4) = bvec(4) - ablock(4,1)*avec(1)  &
+     &                     - ablock(4,2)*avec(2)  &
+     &                     - ablock(4,3)*avec(3)  &
+     &                     - ablock(4,4)*avec(4)  &
+     &                     - ablock(4,5)*avec(5)
+         bvec(5) = bvec(5) - ablock(5,1)*avec(1)  &
+     &                     - ablock(5,2)*avec(2)  &
+     &                     - ablock(5,3)*avec(3)  &
+     &                     - ablock(5,4)*avec(4)  &
+     &                     - ablock(5,5)*avec(5)
+
+
+      return
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine matmul_sub(ablock, bblock, cblock)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     subtracts a(i,j,k) X b(i,j,k) from c(i,j,k)
+!---------------------------------------------------------------------
+
+      implicit none
+
+      double precision ablock, bblock, cblock
+      dimension ablock(5,5), bblock(5,5), cblock(5,5)
+
+
+         cblock(1,1) = cblock(1,1) - ablock(1,1)*bblock(1,1)  &
+     &                             - ablock(1,2)*bblock(2,1)  &
+     &                             - ablock(1,3)*bblock(3,1)  &
+     &                             - ablock(1,4)*bblock(4,1)  &
+     &                             - ablock(1,5)*bblock(5,1)
+         cblock(2,1) = cblock(2,1) - ablock(2,1)*bblock(1,1)  &
+     &                             - ablock(2,2)*bblock(2,1)  &
+     &                             - ablock(2,3)*bblock(3,1)  &
+     &                             - ablock(2,4)*bblock(4,1)  &
+     &                             - ablock(2,5)*bblock(5,1)
+         cblock(3,1) = cblock(3,1) - ablock(3,1)*bblock(1,1)  &
+     &                             - ablock(3,2)*bblock(2,1)  &
+     &                             - ablock(3,3)*bblock(3,1)  &
+     &                             - ablock(3,4)*bblock(4,1)  &
+     &                             - ablock(3,5)*bblock(5,1)
+         cblock(4,1) = cblock(4,1) - ablock(4,1)*bblock(1,1)  &
+     &                             - ablock(4,2)*bblock(2,1)  &
+     &                             - ablock(4,3)*bblock(3,1)  &
+     &                             - ablock(4,4)*bblock(4,1)  &
+     &                             - ablock(4,5)*bblock(5,1)
+         cblock(5,1) = cblock(5,1) - ablock(5,1)*bblock(1,1)  &
+     &                             - ablock(5,2)*bblock(2,1)  &
+     &                             - ablock(5,3)*bblock(3,1)  &
+     &                             - ablock(5,4)*bblock(4,1)  &
+     &                             - ablock(5,5)*bblock(5,1)
+         cblock(1,2) = cblock(1,2) - ablock(1,1)*bblock(1,2)  &
+     &                             - ablock(1,2)*bblock(2,2)  &
+     &                             - ablock(1,3)*bblock(3,2)  &
+     &                             - ablock(1,4)*bblock(4,2)  &
+     &                             - ablock(1,5)*bblock(5,2)
+         cblock(2,2) = cblock(2,2) - ablock(2,1)*bblock(1,2)  &
+     &                             - ablock(2,2)*bblock(2,2)  &
+     &                             - ablock(2,3)*bblock(3,2)  &
+     &                             - ablock(2,4)*bblock(4,2)  &
+     &                             - ablock(2,5)*bblock(5,2)
+         cblock(3,2) = cblock(3,2) - ablock(3,1)*bblock(1,2)  &
+     &                             - ablock(3,2)*bblock(2,2)  &
+     &                             - ablock(3,3)*bblock(3,2)  &
+     &                             - ablock(3,4)*bblock(4,2)  &
+     &                             - ablock(3,5)*bblock(5,2)
+         cblock(4,2) = cblock(4,2) - ablock(4,1)*bblock(1,2)  &
+     &                             - ablock(4,2)*bblock(2,2)  &
+     &                             - ablock(4,3)*bblock(3,2)  &
+     &                             - ablock(4,4)*bblock(4,2)  &
+     &                             - ablock(4,5)*bblock(5,2)
+         cblock(5,2) = cblock(5,2) - ablock(5,1)*bblock(1,2)  &
+     &                             - ablock(5,2)*bblock(2,2)  &
+     &                             - ablock(5,3)*bblock(3,2)  &
+     &                             - ablock(5,4)*bblock(4,2)  &
+     &                             - ablock(5,5)*bblock(5,2)
+         cblock(1,3) = cblock(1,3) - ablock(1,1)*bblock(1,3)  &
+     &                             - ablock(1,2)*bblock(2,3)  &
+     &                             - ablock(1,3)*bblock(3,3)  &
+     &                             - ablock(1,4)*bblock(4,3)  &
+     &                             - ablock(1,5)*bblock(5,3)
+         cblock(2,3) = cblock(2,3) - ablock(2,1)*bblock(1,3)  &
+     &                             - ablock(2,2)*bblock(2,3)  &
+     &                             - ablock(2,3)*bblock(3,3)  &
+     &                             - ablock(2,4)*bblock(4,3)  &
+     &                             - ablock(2,5)*bblock(5,3)
+         cblock(3,3) = cblock(3,3) - ablock(3,1)*bblock(1,3)  &
+     &                             - ablock(3,2)*bblock(2,3)  &
+     &                             - ablock(3,3)*bblock(3,3)  &
+     &                             - ablock(3,4)*bblock(4,3)  &
+     &                             - ablock(3,5)*bblock(5,3)
+         cblock(4,3) = cblock(4,3) - ablock(4,1)*bblock(1,3)  &
+     &                             - ablock(4,2)*bblock(2,3)  &
+     &                             - ablock(4,3)*bblock(3,3)  &
+     &                             - ablock(4,4)*bblock(4,3)  &
+     &                             - ablock(4,5)*bblock(5,3)
+         cblock(5,3) = cblock(5,3) - ablock(5,1)*bblock(1,3)  &
+     &                             - ablock(5,2)*bblock(2,3)  &
+     &                             - ablock(5,3)*bblock(3,3)  &
+     &                             - ablock(5,4)*bblock(4,3)  &
+     &                             - ablock(5,5)*bblock(5,3)
+         cblock(1,4) = cblock(1,4) - ablock(1,1)*bblock(1,4)  &
+     &                             - ablock(1,2)*bblock(2,4)  &
+     &                             - ablock(1,3)*bblock(3,4)  &
+     &                             - ablock(1,4)*bblock(4,4)  &
+     &                             - ablock(1,5)*bblock(5,4)
+         cblock(2,4) = cblock(2,4) - ablock(2,1)*bblock(1,4)  &
+     &                             - ablock(2,2)*bblock(2,4)  &
+     &                             - ablock(2,3)*bblock(3,4)  &
+     &                             - ablock(2,4)*bblock(4,4)  &
+     &                             - ablock(2,5)*bblock(5,4)
+         cblock(3,4) = cblock(3,4) - ablock(3,1)*bblock(1,4)  &
+     &                             - ablock(3,2)*bblock(2,4)  &
+     &                             - ablock(3,3)*bblock(3,4)  &
+     &                             - ablock(3,4)*bblock(4,4)  &
+     &                             - ablock(3,5)*bblock(5,4)
+         cblock(4,4) = cblock(4,4) - ablock(4,1)*bblock(1,4)  &
+     &                             - ablock(4,2)*bblock(2,4)  &
+     &                             - ablock(4,3)*bblock(3,4)  &
+     &                             - ablock(4,4)*bblock(4,4)  &
+     &                             - ablock(4,5)*bblock(5,4)
+         cblock(5,4) = cblock(5,4) - ablock(5,1)*bblock(1,4)  &
+     &                             - ablock(5,2)*bblock(2,4)  &
+     &                             - ablock(5,3)*bblock(3,4)  &
+     &                             - ablock(5,4)*bblock(4,4)  &
+     &                             - ablock(5,5)*bblock(5,4)
+         cblock(1,5) = cblock(1,5) - ablock(1,1)*bblock(1,5)  &
+     &                             - ablock(1,2)*bblock(2,5)  &
+     &                             - ablock(1,3)*bblock(3,5)  &
+     &                             - ablock(1,4)*bblock(4,5)  &
+     &                             - ablock(1,5)*bblock(5,5)
+         cblock(2,5) = cblock(2,5) - ablock(2,1)*bblock(1,5)  &
+     &                             - ablock(2,2)*bblock(2,5)  &
+     &                             - ablock(2,3)*bblock(3,5)  &
+     &                             - ablock(2,4)*bblock(4,5)  &
+     &                             - ablock(2,5)*bblock(5,5)
+         cblock(3,5) = cblock(3,5) - ablock(3,1)*bblock(1,5)  &
+     &                             - ablock(3,2)*bblock(2,5)  &
+     &                             - ablock(3,3)*bblock(3,5)  &
+     &                             - ablock(3,4)*bblock(4,5)  &
+     &                             - ablock(3,5)*bblock(5,5)
+         cblock(4,5) = cblock(4,5) - ablock(4,1)*bblock(1,5)  &
+     &                             - ablock(4,2)*bblock(2,5)  &
+     &                             - ablock(4,3)*bblock(3,5)  &
+     &                             - ablock(4,4)*bblock(4,5)  &
+     &                             - ablock(4,5)*bblock(5,5)
+         cblock(5,5) = cblock(5,5) - ablock(5,1)*bblock(1,5)  &
+     &                             - ablock(5,2)*bblock(2,5)  &
+     &                             - ablock(5,3)*bblock(3,5)  &
+     &                             - ablock(5,4)*bblock(4,5)  &
+     &                             - ablock(5,5)*bblock(5,5)
+
+              
+      return
+      end
+
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine binvcrhs( lhs,c,r )
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     Gauss-Jordan elimination (no pivoting): c = inv(lhs)*c, r = inv(lhs)*r
+!---------------------------------------------------------------------
+
+      implicit none
+
+      double precision pivot, coeff, lhs
+      dimension lhs(5,5)
+      double precision c(5,5), r(5)
+
+!---------------------------------------------------------------------
+!     
+!---------------------------------------------------------------------
+
+      pivot = 1.00d0/lhs(1,1)
+      lhs(1,2) = lhs(1,2)*pivot
+      lhs(1,3) = lhs(1,3)*pivot
+      lhs(1,4) = lhs(1,4)*pivot
+      lhs(1,5) = lhs(1,5)*pivot
+      c(1,1) = c(1,1)*pivot
+      c(1,2) = c(1,2)*pivot
+      c(1,3) = c(1,3)*pivot
+      c(1,4) = c(1,4)*pivot
+      c(1,5) = c(1,5)*pivot
+      r(1)   = r(1)  *pivot
+
+      coeff = lhs(2,1)
+      lhs(2,2)= lhs(2,2) - coeff*lhs(1,2)
+      lhs(2,3)= lhs(2,3) - coeff*lhs(1,3)
+      lhs(2,4)= lhs(2,4) - coeff*lhs(1,4)
+      lhs(2,5)= lhs(2,5) - coeff*lhs(1,5)
+      c(2,1) = c(2,1) - coeff*c(1,1)
+      c(2,2) = c(2,2) - coeff*c(1,2)
+      c(2,3) = c(2,3) - coeff*c(1,3)
+      c(2,4) = c(2,4) - coeff*c(1,4)
+      c(2,5) = c(2,5) - coeff*c(1,5)
+      r(2)   = r(2)   - coeff*r(1)
+
+      coeff = lhs(3,1)
+      lhs(3,2)= lhs(3,2) - coeff*lhs(1,2)
+      lhs(3,3)= lhs(3,3) - coeff*lhs(1,3)
+      lhs(3,4)= lhs(3,4) - coeff*lhs(1,4)
+      lhs(3,5)= lhs(3,5) - coeff*lhs(1,5)
+      c(3,1) = c(3,1) - coeff*c(1,1)
+      c(3,2) = c(3,2) - coeff*c(1,2)
+      c(3,3) = c(3,3) - coeff*c(1,3)
+      c(3,4) = c(3,4) - coeff*c(1,4)
+      c(3,5) = c(3,5) - coeff*c(1,5)
+      r(3)   = r(3)   - coeff*r(1)
+
+      coeff = lhs(4,1)
+      lhs(4,2)= lhs(4,2) - coeff*lhs(1,2)
+      lhs(4,3)= lhs(4,3) - coeff*lhs(1,3)
+      lhs(4,4)= lhs(4,4) - coeff*lhs(1,4)
+      lhs(4,5)= lhs(4,5) - coeff*lhs(1,5)
+      c(4,1) = c(4,1) - coeff*c(1,1)
+      c(4,2) = c(4,2) - coeff*c(1,2)
+      c(4,3) = c(4,3) - coeff*c(1,3)
+      c(4,4) = c(4,4) - coeff*c(1,4)
+      c(4,5) = c(4,5) - coeff*c(1,5)
+      r(4)   = r(4)   - coeff*r(1)
+
+      coeff = lhs(5,1)
+      lhs(5,2)= lhs(5,2) - coeff*lhs(1,2)
+      lhs(5,3)= lhs(5,3) - coeff*lhs(1,3)
+      lhs(5,4)= lhs(5,4) - coeff*lhs(1,4)
+      lhs(5,5)= lhs(5,5) - coeff*lhs(1,5)
+      c(5,1) = c(5,1) - coeff*c(1,1)
+      c(5,2) = c(5,2) - coeff*c(1,2)
+      c(5,3) = c(5,3) - coeff*c(1,3)
+      c(5,4) = c(5,4) - coeff*c(1,4)
+      c(5,5) = c(5,5) - coeff*c(1,5)
+      r(5)   = r(5)   - coeff*r(1)
+
+
+      pivot = 1.00d0/lhs(2,2)
+      lhs(2,3) = lhs(2,3)*pivot
+      lhs(2,4) = lhs(2,4)*pivot
+      lhs(2,5) = lhs(2,5)*pivot
+      c(2,1) = c(2,1)*pivot
+      c(2,2) = c(2,2)*pivot
+      c(2,3) = c(2,3)*pivot
+      c(2,4) = c(2,4)*pivot
+      c(2,5) = c(2,5)*pivot
+      r(2)   = r(2)  *pivot
+
+      coeff = lhs(1,2)
+      lhs(1,3)= lhs(1,3) - coeff*lhs(2,3)
+      lhs(1,4)= lhs(1,4) - coeff*lhs(2,4)
+      lhs(1,5)= lhs(1,5) - coeff*lhs(2,5)
+      c(1,1) = c(1,1) - coeff*c(2,1)
+      c(1,2) = c(1,2) - coeff*c(2,2)
+      c(1,3) = c(1,3) - coeff*c(2,3)
+      c(1,4) = c(1,4) - coeff*c(2,4)
+      c(1,5) = c(1,5) - coeff*c(2,5)
+      r(1)   = r(1)   - coeff*r(2)
+
+      coeff = lhs(3,2)
+      lhs(3,3)= lhs(3,3) - coeff*lhs(2,3)
+      lhs(3,4)= lhs(3,4) - coeff*lhs(2,4)
+      lhs(3,5)= lhs(3,5) - coeff*lhs(2,5)
+      c(3,1) = c(3,1) - coeff*c(2,1)
+      c(3,2) = c(3,2) - coeff*c(2,2)
+      c(3,3) = c(3,3) - coeff*c(2,3)
+      c(3,4) = c(3,4) - coeff*c(2,4)
+      c(3,5) = c(3,5) - coeff*c(2,5)
+      r(3)   = r(3)   - coeff*r(2)
+
+      coeff = lhs(4,2)
+      lhs(4,3)= lhs(4,3) - coeff*lhs(2,3)
+      lhs(4,4)= lhs(4,4) - coeff*lhs(2,4)
+      lhs(4,5)= lhs(4,5) - coeff*lhs(2,5)
+      c(4,1) = c(4,1) - coeff*c(2,1)
+      c(4,2) = c(4,2) - coeff*c(2,2)
+      c(4,3) = c(4,3) - coeff*c(2,3)
+      c(4,4) = c(4,4) - coeff*c(2,4)
+      c(4,5) = c(4,5) - coeff*c(2,5)
+      r(4)   = r(4)   - coeff*r(2)
+
+      coeff = lhs(5,2)
+      lhs(5,3)= lhs(5,3) - coeff*lhs(2,3)
+      lhs(5,4)= lhs(5,4) - coeff*lhs(2,4)
+      lhs(5,5)= lhs(5,5) - coeff*lhs(2,5)
+      c(5,1) = c(5,1) - coeff*c(2,1)
+      c(5,2) = c(5,2) - coeff*c(2,2)
+      c(5,3) = c(5,3) - coeff*c(2,3)
+      c(5,4) = c(5,4) - coeff*c(2,4)
+      c(5,5) = c(5,5) - coeff*c(2,5)
+      r(5)   = r(5)   - coeff*r(2)
+
+
+      pivot = 1.00d0/lhs(3,3)
+      lhs(3,4) = lhs(3,4)*pivot
+      lhs(3,5) = lhs(3,5)*pivot
+      c(3,1) = c(3,1)*pivot
+      c(3,2) = c(3,2)*pivot
+      c(3,3) = c(3,3)*pivot
+      c(3,4) = c(3,4)*pivot
+      c(3,5) = c(3,5)*pivot
+      r(3)   = r(3)  *pivot
+
+      coeff = lhs(1,3)
+      lhs(1,4)= lhs(1,4) - coeff*lhs(3,4)
+      lhs(1,5)= lhs(1,5) - coeff*lhs(3,5)
+      c(1,1) = c(1,1) - coeff*c(3,1)
+      c(1,2) = c(1,2) - coeff*c(3,2)
+      c(1,3) = c(1,3) - coeff*c(3,3)
+      c(1,4) = c(1,4) - coeff*c(3,4)
+      c(1,5) = c(1,5) - coeff*c(3,5)
+      r(1)   = r(1)   - coeff*r(3)
+
+      coeff = lhs(2,3)
+      lhs(2,4)= lhs(2,4) - coeff*lhs(3,4)
+      lhs(2,5)= lhs(2,5) - coeff*lhs(3,5)
+      c(2,1) = c(2,1) - coeff*c(3,1)
+      c(2,2) = c(2,2) - coeff*c(3,2)
+      c(2,3) = c(2,3) - coeff*c(3,3)
+      c(2,4) = c(2,4) - coeff*c(3,4)
+      c(2,5) = c(2,5) - coeff*c(3,5)
+      r(2)   = r(2)   - coeff*r(3)
+
+      coeff = lhs(4,3)
+      lhs(4,4)= lhs(4,4) - coeff*lhs(3,4)
+      lhs(4,5)= lhs(4,5) - coeff*lhs(3,5)
+      c(4,1) = c(4,1) - coeff*c(3,1)
+      c(4,2) = c(4,2) - coeff*c(3,2)
+      c(4,3) = c(4,3) - coeff*c(3,3)
+      c(4,4) = c(4,4) - coeff*c(3,4)
+      c(4,5) = c(4,5) - coeff*c(3,5)
+      r(4)   = r(4)   - coeff*r(3)
+
+      coeff = lhs(5,3)
+      lhs(5,4)= lhs(5,4) - coeff*lhs(3,4)
+      lhs(5,5)= lhs(5,5) - coeff*lhs(3,5)
+      c(5,1) = c(5,1) - coeff*c(3,1)
+      c(5,2) = c(5,2) - coeff*c(3,2)
+      c(5,3) = c(5,3) - coeff*c(3,3)
+      c(5,4) = c(5,4) - coeff*c(3,4)
+      c(5,5) = c(5,5) - coeff*c(3,5)
+      r(5)   = r(5)   - coeff*r(3)
+
+
+      pivot = 1.00d0/lhs(4,4)
+      lhs(4,5) = lhs(4,5)*pivot
+      c(4,1) = c(4,1)*pivot
+      c(4,2) = c(4,2)*pivot
+      c(4,3) = c(4,3)*pivot
+      c(4,4) = c(4,4)*pivot
+      c(4,5) = c(4,5)*pivot
+      r(4)   = r(4)  *pivot
+
+      coeff = lhs(1,4)
+      lhs(1,5)= lhs(1,5) - coeff*lhs(4,5)
+      c(1,1) = c(1,1) - coeff*c(4,1)
+      c(1,2) = c(1,2) - coeff*c(4,2)
+      c(1,3) = c(1,3) - coeff*c(4,3)
+      c(1,4) = c(1,4) - coeff*c(4,4)
+      c(1,5) = c(1,5) - coeff*c(4,5)
+      r(1)   = r(1)   - coeff*r(4)
+
+      coeff = lhs(2,4)
+      lhs(2,5)= lhs(2,5) - coeff*lhs(4,5)
+      c(2,1) = c(2,1) - coeff*c(4,1)
+      c(2,2) = c(2,2) - coeff*c(4,2)
+      c(2,3) = c(2,3) - coeff*c(4,3)
+      c(2,4) = c(2,4) - coeff*c(4,4)
+      c(2,5) = c(2,5) - coeff*c(4,5)
+      r(2)   = r(2)   - coeff*r(4)
+
+      coeff = lhs(3,4)
+      lhs(3,5)= lhs(3,5) - coeff*lhs(4,5)
+      c(3,1) = c(3,1) - coeff*c(4,1)
+      c(3,2) = c(3,2) - coeff*c(4,2)
+      c(3,3) = c(3,3) - coeff*c(4,3)
+      c(3,4) = c(3,4) - coeff*c(4,4)
+      c(3,5) = c(3,5) - coeff*c(4,5)
+      r(3)   = r(3)   - coeff*r(4)
+
+      coeff = lhs(5,4)
+      lhs(5,5)= lhs(5,5) - coeff*lhs(4,5)
+      c(5,1) = c(5,1) - coeff*c(4,1)
+      c(5,2) = c(5,2) - coeff*c(4,2)
+      c(5,3) = c(5,3) - coeff*c(4,3)
+      c(5,4) = c(5,4) - coeff*c(4,4)
+      c(5,5) = c(5,5) - coeff*c(4,5)
+      r(5)   = r(5)   - coeff*r(4)
+
+
+      pivot = 1.00d0/lhs(5,5)
+      c(5,1) = c(5,1)*pivot
+      c(5,2) = c(5,2)*pivot
+      c(5,3) = c(5,3)*pivot
+      c(5,4) = c(5,4)*pivot
+      c(5,5) = c(5,5)*pivot
+      r(5)   = r(5)  *pivot
+
+      coeff = lhs(1,5)
+      c(1,1) = c(1,1) - coeff*c(5,1)
+      c(1,2) = c(1,2) - coeff*c(5,2)
+      c(1,3) = c(1,3) - coeff*c(5,3)
+      c(1,4) = c(1,4) - coeff*c(5,4)
+      c(1,5) = c(1,5) - coeff*c(5,5)
+      r(1)   = r(1)   - coeff*r(5)
+
+      coeff = lhs(2,5)
+      c(2,1) = c(2,1) - coeff*c(5,1)
+      c(2,2) = c(2,2) - coeff*c(5,2)
+      c(2,3) = c(2,3) - coeff*c(5,3)
+      c(2,4) = c(2,4) - coeff*c(5,4)
+      c(2,5) = c(2,5) - coeff*c(5,5)
+      r(2)   = r(2)   - coeff*r(5)
+
+      coeff = lhs(3,5)
+      c(3,1) = c(3,1) - coeff*c(5,1)
+      c(3,2) = c(3,2) - coeff*c(5,2)
+      c(3,3) = c(3,3) - coeff*c(5,3)
+      c(3,4) = c(3,4) - coeff*c(5,4)
+      c(3,5) = c(3,5) - coeff*c(5,5)
+      r(3)   = r(3)   - coeff*r(5)
+
+      coeff = lhs(4,5)
+      c(4,1) = c(4,1) - coeff*c(5,1)
+      c(4,2) = c(4,2) - coeff*c(5,2)
+      c(4,3) = c(4,3) - coeff*c(5,3)
+      c(4,4) = c(4,4) - coeff*c(5,4)
+      c(4,5) = c(4,5) - coeff*c(5,5)
+      r(4)   = r(4)   - coeff*r(5)
+
+
+      return
+      end
+
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine binvrhs( lhs,r )
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     Gauss-Jordan elimination (no pivoting) on the 5x5 block lhs; r is overwritten with inverse(lhs)*r
+!---------------------------------------------------------------------
+
+      implicit none
+
+      double precision pivot, coeff, lhs
+      dimension lhs(5,5)
+      double precision r(5)
+
+!---------------------------------------------------------------------
+!     
+!---------------------------------------------------------------------
+
+
+      pivot = 1.00d0/lhs(1,1)
+      lhs(1,2) = lhs(1,2)*pivot
+      lhs(1,3) = lhs(1,3)*pivot
+      lhs(1,4) = lhs(1,4)*pivot
+      lhs(1,5) = lhs(1,5)*pivot
+      r(1)   = r(1)  *pivot
+
+      coeff = lhs(2,1)
+      lhs(2,2)= lhs(2,2) - coeff*lhs(1,2)
+      lhs(2,3)= lhs(2,3) - coeff*lhs(1,3)
+      lhs(2,4)= lhs(2,4) - coeff*lhs(1,4)
+      lhs(2,5)= lhs(2,5) - coeff*lhs(1,5)
+      r(2)   = r(2)   - coeff*r(1)
+
+      coeff = lhs(3,1)
+      lhs(3,2)= lhs(3,2) - coeff*lhs(1,2)
+      lhs(3,3)= lhs(3,3) - coeff*lhs(1,3)
+      lhs(3,4)= lhs(3,4) - coeff*lhs(1,4)
+      lhs(3,5)= lhs(3,5) - coeff*lhs(1,5)
+      r(3)   = r(3)   - coeff*r(1)
+
+      coeff = lhs(4,1)
+      lhs(4,2)= lhs(4,2) - coeff*lhs(1,2)
+      lhs(4,3)= lhs(4,3) - coeff*lhs(1,3)
+      lhs(4,4)= lhs(4,4) - coeff*lhs(1,4)
+      lhs(4,5)= lhs(4,5) - coeff*lhs(1,5)
+      r(4)   = r(4)   - coeff*r(1)
+
+      coeff = lhs(5,1)
+      lhs(5,2)= lhs(5,2) - coeff*lhs(1,2)
+      lhs(5,3)= lhs(5,3) - coeff*lhs(1,3)
+      lhs(5,4)= lhs(5,4) - coeff*lhs(1,4)
+      lhs(5,5)= lhs(5,5) - coeff*lhs(1,5)
+      r(5)   = r(5)   - coeff*r(1)
+
+
+      pivot = 1.00d0/lhs(2,2)
+      lhs(2,3) = lhs(2,3)*pivot
+      lhs(2,4) = lhs(2,4)*pivot
+      lhs(2,5) = lhs(2,5)*pivot
+      r(2)   = r(2)  *pivot
+
+      coeff = lhs(1,2)
+      lhs(1,3)= lhs(1,3) - coeff*lhs(2,3)
+      lhs(1,4)= lhs(1,4) - coeff*lhs(2,4)
+      lhs(1,5)= lhs(1,5) - coeff*lhs(2,5)
+      r(1)   = r(1)   - coeff*r(2)
+
+      coeff = lhs(3,2)
+      lhs(3,3)= lhs(3,3) - coeff*lhs(2,3)
+      lhs(3,4)= lhs(3,4) - coeff*lhs(2,4)
+      lhs(3,5)= lhs(3,5) - coeff*lhs(2,5)
+      r(3)   = r(3)   - coeff*r(2)
+
+      coeff = lhs(4,2)
+      lhs(4,3)= lhs(4,3) - coeff*lhs(2,3)
+      lhs(4,4)= lhs(4,4) - coeff*lhs(2,4)
+      lhs(4,5)= lhs(4,5) - coeff*lhs(2,5)
+      r(4)   = r(4)   - coeff*r(2)
+
+      coeff = lhs(5,2)
+      lhs(5,3)= lhs(5,3) - coeff*lhs(2,3)
+      lhs(5,4)= lhs(5,4) - coeff*lhs(2,4)
+      lhs(5,5)= lhs(5,5) - coeff*lhs(2,5)
+      r(5)   = r(5)   - coeff*r(2)
+
+
+      pivot = 1.00d0/lhs(3,3)
+      lhs(3,4) = lhs(3,4)*pivot
+      lhs(3,5) = lhs(3,5)*pivot
+      r(3)   = r(3)  *pivot
+
+      coeff = lhs(1,3)
+      lhs(1,4)= lhs(1,4) - coeff*lhs(3,4)
+      lhs(1,5)= lhs(1,5) - coeff*lhs(3,5)
+      r(1)   = r(1)   - coeff*r(3)
+
+      coeff = lhs(2,3)
+      lhs(2,4)= lhs(2,4) - coeff*lhs(3,4)
+      lhs(2,5)= lhs(2,5) - coeff*lhs(3,5)
+      r(2)   = r(2)   - coeff*r(3)
+
+      coeff = lhs(4,3)
+      lhs(4,4)= lhs(4,4) - coeff*lhs(3,4)
+      lhs(4,5)= lhs(4,5) - coeff*lhs(3,5)
+      r(4)   = r(4)   - coeff*r(3)
+
+      coeff = lhs(5,3)
+      lhs(5,4)= lhs(5,4) - coeff*lhs(3,4)
+      lhs(5,5)= lhs(5,5) - coeff*lhs(3,5)
+      r(5)   = r(5)   - coeff*r(3)
+
+
+      pivot = 1.00d0/lhs(4,4)
+      lhs(4,5) = lhs(4,5)*pivot
+      r(4)   = r(4)  *pivot
+
+      coeff = lhs(1,4)
+      lhs(1,5)= lhs(1,5) - coeff*lhs(4,5)
+      r(1)   = r(1)   - coeff*r(4)
+
+      coeff = lhs(2,4)
+      lhs(2,5)= lhs(2,5) - coeff*lhs(4,5)
+      r(2)   = r(2)   - coeff*r(4)
+
+      coeff = lhs(3,4)
+      lhs(3,5)= lhs(3,5) - coeff*lhs(4,5)
+      r(3)   = r(3)   - coeff*r(4)
+
+      coeff = lhs(5,4)
+      lhs(5,5)= lhs(5,5) - coeff*lhs(4,5)
+      r(5)   = r(5)   - coeff*r(4)
+
+
+      pivot = 1.00d0/lhs(5,5)
+      r(5)   = r(5)  *pivot
+
+      coeff = lhs(1,5)
+      r(1)   = r(1)   - coeff*r(5)
+
+      coeff = lhs(2,5)
+      r(2)   = r(2)   - coeff*r(5)
+
+      coeff = lhs(3,5)
+      r(3)   = r(3)   - coeff*r(5)
+
+      coeff = lhs(4,5)
+      r(4)   = r(4)   - coeff*r(5)
+
+
+      return
+      end
+
+
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/verify.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/verify.f90
new file mode 100644
index 000000000..977b95ec9
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/verify.f90
@@ -0,0 +1,529 @@
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+        subroutine set_class(no_time_steps, class)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!  set problem class based on problem size
+!---------------------------------------------------------------------
+
+        use bt_data
+        implicit none
+
+        integer no_time_steps
+        character class
+
+
+        if ( (grid_points(1)  .eq. 12     ) .and.  &
+     &       (grid_points(2)  .eq. 12     ) .and.  &
+     &       (grid_points(3)  .eq. 12     ) .and.  &
+     &       (no_time_steps   .eq. 60    ))  then
+
+           class = 'S'
+
+        elseif ( (grid_points(1) .eq. 24) .and.  &
+     &           (grid_points(2) .eq. 24) .and.  &
+     &           (grid_points(3) .eq. 24) .and.  &
+     &           (no_time_steps  .eq. 200) ) then
+
+           class = 'W'
+
+        elseif ( (grid_points(1) .eq. 64) .and.  &
+     &           (grid_points(2) .eq. 64) .and.  &
+     &           (grid_points(3) .eq. 64) .and.  &
+     &           (no_time_steps  .eq. 200) ) then
+
+           class = 'A'
+
+        elseif ( (grid_points(1) .eq. 102) .and.  &
+     &           (grid_points(2) .eq. 102) .and.  &
+     &           (grid_points(3) .eq. 102) .and.  &
+     &           (no_time_steps  .eq. 200) ) then
+
+           class = 'B'
+
+        elseif ( (grid_points(1) .eq. 162) .and.  &
+     &           (grid_points(2) .eq. 162) .and.  &
+     &           (grid_points(3) .eq. 162) .and.  &
+     &           (no_time_steps  .eq. 200) ) then
+
+           class = 'C'
+
+        elseif ( (grid_points(1) .eq. 408) .and.  &
+     &           (grid_points(2) .eq. 408) .and.  &
+     &           (grid_points(3) .eq. 408) .and.  &
+     &           (no_time_steps  .eq. 250) ) then
+
+           class = 'D'
+
+        elseif ( (grid_points(1) .eq. 1020) .and.  &
+     &           (grid_points(2) .eq. 1020) .and.  &
+     &           (grid_points(3) .eq. 1020) .and.  &
+     &           (no_time_steps  .eq. 250) ) then
+
+           class = 'E'
+
+        elseif ( (grid_points(1) .eq. 2560) .and.  &
+     &           (grid_points(2) .eq. 2560) .and.  &
+     &           (grid_points(3) .eq. 2560) .and.  &
+     &           (no_time_steps  .eq. 250) ) then
+
+           class = 'F'
+
+        else
+
+           class = 'U'
+
+        endif
+
+        return
+        end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+        subroutine verify(class, verified)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!  verification routine                         
+!---------------------------------------------------------------------
+
+        use, intrinsic :: ieee_arithmetic, only : ieee_is_nan
+
+        use bt_data
+        use mpinpb
+
+        implicit none
+
+        double precision xcrref(5),xceref(5),xcrdif(5),xcedif(5),  &
+     &                   epsilon, xce(5), xcr(5), dtref
+        integer m
+        character class
+        logical verified
+
+!---------------------------------------------------------------------
+!   tolerance level
+!---------------------------------------------------------------------
+        epsilon = 1.0d-08
+        verified = .true.
+
+!---------------------------------------------------------------------
+!   compute the error norm and the residual norm, and exit if not printing
+!---------------------------------------------------------------------
+
+        if (iotype .ne. 0) then
+           call timer_start(t_iov)
+           call accumulate_norms(xce)
+           call timer_stop(t_iov)
+        else
+           call error_norm(xce)
+        endif
+
+        call copy_faces
+
+        call rhs_norm(xcr)
+
+        do m = 1, 5
+           xcr(m) = xcr(m) / dt
+        enddo
+
+        if (node .ne. 0) return
+
+        do m = 1,5
+           xcrref(m) = 1.0
+           xceref(m) = 1.0
+        end do
+
+!---------------------------------------------------------------------
+!    reference data for 12X12X12 grids after 60 time steps, with DT = 1.0d-02
+!---------------------------------------------------------------------
+        if ( class .eq. 'S' ) then
+
+           dtref = 1.0d-2
+
+!---------------------------------------------------------------------
+!  Reference values of RMS-norms of residual.
+!---------------------------------------------------------------------
+         xcrref(1) = 1.7034283709541311d-01
+         xcrref(2) = 1.2975252070034097d-02
+         xcrref(3) = 3.2527926989486055d-02
+         xcrref(4) = 2.6436421275166801d-02
+         xcrref(5) = 1.9211784131744430d-01
+
+!---------------------------------------------------------------------
+!  Reference values of RMS-norms of solution error.
+!---------------------------------------------------------------------
+
+         if (iotype .eq. 0) then
+           xceref(1) = 4.9976913345811579d-04
+           xceref(2) = 4.5195666782961927d-05
+           xceref(3) = 7.3973765172921357d-05
+           xceref(4) = 7.3821238632439731d-05
+           xceref(5) = 8.9269630987491446d-04
+         else
+           xceref(1) = 0.1149036328945d+02
+           xceref(2) = 0.9156788904727d+00
+           xceref(3) = 0.2857899428614d+01
+           xceref(4) = 0.2598273346734d+01
+           xceref(5) = 0.2652795397547d+02
+         endif
+
+!---------------------------------------------------------------------
+!    reference data for 24X24X24 grids after 200 time steps, with DT = 0.8d-3
+!---------------------------------------------------------------------
+        elseif ( class .eq. 'W' ) then
+
+           dtref = 0.8d-3
+
+!---------------------------------------------------------------------
+!  Reference values of RMS-norms of residual.
+!---------------------------------------------------------------------
+           xcrref(1) = 0.1125590409344d+03
+           xcrref(2) = 0.1180007595731d+02
+           xcrref(3) = 0.2710329767846d+02
+           xcrref(4) = 0.2469174937669d+02
+           xcrref(5) = 0.2638427874317d+03
+
+!---------------------------------------------------------------------
+!  Reference values of RMS-norms of solution error.
+!---------------------------------------------------------------------
+
+         if (iotype .eq. 0) then
+           xceref(1) = 0.4419655736008d+01
+           xceref(2) = 0.4638531260002d+00
+           xceref(3) = 0.1011551749967d+01
+           xceref(4) = 0.9235878729944d+00
+           xceref(5) = 0.1018045837718d+02
+         else
+           xceref(1) = 0.6729594398612d+02
+           xceref(2) = 0.5264523081690d+01
+           xceref(3) = 0.1677107142637d+02
+           xceref(4) = 0.1508721463436d+02
+           xceref(5) = 0.1477018363393d+03
+         endif
+
+
+!---------------------------------------------------------------------
+!    reference data for 64X64X64 grids after 200 time steps, with DT = 0.8d-3
+!---------------------------------------------------------------------
+        elseif ( class .eq. 'A' ) then
+
+           dtref = 0.8d-3
+
+!---------------------------------------------------------------------
+!  Reference values of RMS-norms of residual.
+!---------------------------------------------------------------------
+         xcrref(1) = 1.0806346714637264d+02
+         xcrref(2) = 1.1319730901220813d+01
+         xcrref(3) = 2.5974354511582465d+01
+         xcrref(4) = 2.3665622544678910d+01
+         xcrref(5) = 2.5278963211748344d+02
+
+!---------------------------------------------------------------------
+!  Reference values of RMS-norms of solution error.
+!---------------------------------------------------------------------
+
+         if (iotype .eq. 0) then
+           xceref(1) = 4.2348416040525025d+00
+           xceref(2) = 4.4390282496995698d-01
+           xceref(3) = 9.6692480136345650d-01
+           xceref(4) = 8.8302063039765474d-01
+           xceref(5) = 9.7379901770829278d+00
+         else
+           xceref(1) = 0.6482218724961d+02
+           xceref(2) = 0.5066461714527d+01
+           xceref(3) = 0.1613931961359d+02
+           xceref(4) = 0.1452010201481d+02
+           xceref(5) = 0.1420099377681d+03
+         endif
+
+!---------------------------------------------------------------------
+!    reference data for 102X102X102 grids after 200 time steps,
+!    with DT = 3.0d-04
+!---------------------------------------------------------------------
+        elseif ( class .eq. 'B' ) then
+
+           dtref = 3.0d-4
+
+!---------------------------------------------------------------------
+!  Reference values of RMS-norms of residual.
+!---------------------------------------------------------------------
+         xcrref(1) = 1.4233597229287254d+03
+         xcrref(2) = 9.9330522590150238d+01
+         xcrref(3) = 3.5646025644535285d+02
+         xcrref(4) = 3.2485447959084092d+02
+         xcrref(5) = 3.2707541254659363d+03
+
+!---------------------------------------------------------------------
+!  Reference values of RMS-norms of solution error.
+!---------------------------------------------------------------------
+
+         if (iotype .eq. 0) then
+           xceref(1) = 5.2969847140936856d+01
+           xceref(2) = 4.4632896115670668d+00
+           xceref(3) = 1.3122573342210174d+01
+           xceref(4) = 1.2006925323559144d+01
+           xceref(5) = 1.2459576151035986d+02
+         else
+           xceref(1) = 0.1477545106464d+03
+           xceref(2) = 0.1108895555053d+02
+           xceref(3) = 0.3698065590331d+02
+           xceref(4) = 0.3310505581440d+02
+           xceref(5) = 0.3157928282563d+03
+         endif
+
+!---------------------------------------------------------------------
+!    reference data for 162X162X162 grids after 200 time steps,
+!    with DT = 1.0d-04
+!---------------------------------------------------------------------
+        elseif ( class .eq. 'C' ) then
+
+           dtref = 1.0d-4
+
+!---------------------------------------------------------------------
+!  Reference values of RMS-norms of residual.
+!---------------------------------------------------------------------
+         xcrref(1) = 0.62398116551764615d+04
+         xcrref(2) = 0.50793239190423964d+03
+         xcrref(3) = 0.15423530093013596d+04
+         xcrref(4) = 0.13302387929291190d+04
+         xcrref(5) = 0.11604087428436455d+05
+
+!---------------------------------------------------------------------
+!  Reference values of RMS-norms of solution error.
+!---------------------------------------------------------------------
+
+         if (iotype .eq. 0) then
+           xceref(1) = 0.16462008369091265d+03
+           xceref(2) = 0.11497107903824313d+02
+           xceref(3) = 0.41207446207461508d+02
+           xceref(4) = 0.37087651059694167d+02
+           xceref(5) = 0.36211053051841265d+03
+         else
+           xceref(1) = 0.2597156483475d+03
+           xceref(2) = 0.1985384289495d+02
+           xceref(3) = 0.6517950485788d+02
+           xceref(4) = 0.5757235541520d+02
+           xceref(5) = 0.5215668188726d+03
+         endif 
+
+
+!---------------------------------------------------------------------
+!    reference data for 408x408x408 grids after 250 time steps,
+!    with DT = 0.2d-04
+!---------------------------------------------------------------------
+        elseif ( class .eq. 'D' ) then
+
+           dtref = 0.2d-4
+
+!---------------------------------------------------------------------
+!  Reference values of RMS-norms of residual.
+!---------------------------------------------------------------------
+         xcrref(1) = 0.2533188551738d+05
+         xcrref(2) = 0.2346393716980d+04
+         xcrref(3) = 0.6294554366904d+04
+         xcrref(4) = 0.5352565376030d+04
+         xcrref(5) = 0.3905864038618d+05
+
+!---------------------------------------------------------------------
+!  Reference values of RMS-norms of solution error.
+!---------------------------------------------------------------------
+
+         if (iotype .eq. 0) then
+           xceref(1) = 0.3100009377557d+03
+           xceref(2) = 0.2424086324913d+02
+           xceref(3) = 0.7782212022645d+02
+           xceref(4) = 0.6835623860116d+02
+           xceref(5) = 0.6065737200368d+03
+         else
+           xceref(1) = 0.3813781566713d+03
+           xceref(2) = 0.3160872966198d+02
+           xceref(3) = 0.9593576357290d+02
+           xceref(4) = 0.8363391989815d+02
+           xceref(5) = 0.7063466087423d+03
+         endif
+
+
+!---------------------------------------------------------------------
+!    reference data for 1020x1020x1020 grids after 250 time steps,
+!    with DT = 0.4d-05
+!---------------------------------------------------------------------
+        elseif ( class .eq. 'E' ) then
+
+           dtref = 0.4d-5
+
+!---------------------------------------------------------------------
+!  Reference values of RMS-norms of residual.
+!---------------------------------------------------------------------
+         xcrref(1) = 0.9795372484517d+05
+         xcrref(2) = 0.9739814511521d+04
+         xcrref(3) = 0.2467606342965d+05
+         xcrref(4) = 0.2092419572860d+05
+         xcrref(5) = 0.1392138856939d+06
+
+!---------------------------------------------------------------------
+!  Reference values of RMS-norms of solution error.
+!---------------------------------------------------------------------
+
+         if (iotype .eq. 0) then
+           xceref(1) = 0.4327562208414d+03
+           xceref(2) = 0.3699051964887d+02
+           xceref(3) = 0.1089845040954d+03
+           xceref(4) = 0.9462517622043d+02
+           xceref(5) = 0.7765512765309d+03
+         else
+!  wr_interval = 5
+           xceref(1) = 0.4729898413058d+03
+           xceref(2) = 0.4145899331704d+02
+           xceref(3) = 0.1192850917138d+03
+           xceref(4) = 0.1032746026932d+03
+           xceref(5) = 0.8270322177634d+03
+!  wr_interval = 10
+!          xceref(1) = 0.4718135916251d+03
+!          xceref(2) = 0.4132620259096d+02
+!          xceref(3) = 0.1189831133503d+03
+!          xceref(4) = 0.1030212798803d+03
+!          xceref(5) = 0.8255924078458d+03
+         endif
+
+!---------------------------------------------------------------------
+!    reference data for 2560x2560x2560 grids after 250 time steps,
+!    with DT = 0.6d-06
+!---------------------------------------------------------------------
+        elseif ( class .eq. 'F' ) then
+
+           dtref = 0.6d-6
+
+!---------------------------------------------------------------------
+!  Reference values of RMS-norms of residual.
+!---------------------------------------------------------------------
+         xcrref(1) = 0.4240735175585d+06
+         xcrref(2) = 0.4348701133212d+05
+         xcrref(3) = 0.1078114688845d+06
+         xcrref(4) = 0.9142160938556d+05
+         xcrref(5) = 0.5879842143431d+06
+
+!---------------------------------------------------------------------
+!  Reference values of RMS-norms of solution error.
+!---------------------------------------------------------------------
+
+         if (iotype .eq. 0) then
+           xceref(1) = 0.5095577042351d+03
+           xceref(2) = 0.4557065541652d+02
+           xceref(3) = 0.1286632140581d+03
+           xceref(4) = 0.1111419378722d+03
+           xceref(5) = 0.8720011709356d+03
+         endif
+
+        else
+
+           verified = .false.
+
+        endif
+
+!---------------------------------------------------------------------
+!    verification test for residuals if gridsize is one of 
+!    the defined grid sizes above (class .ne. 'U')
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!    Compute the difference of solution values and the known reference 
+!    values.
+!---------------------------------------------------------------------
+        do m = 1, 5
+           
+           xcrdif(m) = dabs((xcr(m)-xcrref(m))/xcrref(m)) 
+           xcedif(m) = dabs((xce(m)-xceref(m))/xceref(m))
+           
+        enddo
+
+!---------------------------------------------------------------------
+!    Output the comparison of computed results to known cases.
+!---------------------------------------------------------------------
+
+        if (class .ne. 'U') then
+           write(*, 1990) class
+ 1990      format(' Verification being performed for class ', a)
+           write (*,2000) epsilon
+ 2000      format(' accuracy setting for epsilon = ', E20.13)
+           verified = (dabs(dt-dtref) .le. epsilon)
+           if (.not.verified) then  
+              class = 'U'
+              write (*,1000) dtref
+ 1000         format(' DT does not match the reference value of ',  &
+     &                 E15.8)
+           endif
+        else 
+           write(*, 1995)
+ 1995      format(' Unknown class')
+        endif
+
+
+        if (class .ne. 'U') then
+           write (*,2001) 
+        else
+           write (*, 2005)
+        endif
+
+ 2001   format(' Comparison of RMS-norms of residual')
+ 2005   format(' RMS-norms of residual')
+        do m = 1, 5
+           if (class .eq. 'U') then
+              write(*, 2015) m, xcr(m)
+           else if ((.not.ieee_is_nan(xcrdif(m))) .and.  &
+     &              xcrdif(m) .le. epsilon) then
+              write (*,2011) m,xcr(m),xcrref(m),xcrdif(m)
+           else 
+              verified = .false.
+              write (*,2010) m,xcr(m),xcrref(m),xcrdif(m)
+           endif
+        enddo
+
+        if (class .ne. 'U') then
+           write (*,2002)
+        else
+           write (*,2006)
+        endif
+ 2002   format(' Comparison of RMS-norms of solution error')
+ 2006   format(' RMS-norms of solution error')
+        
+        do m = 1, 5
+           if (class .eq. 'U') then
+              write(*, 2015) m, xce(m)
+           else if ((.not.ieee_is_nan(xcedif(m))) .and.  &
+     &              xcedif(m) .le. epsilon) then
+              write (*,2011) m,xce(m),xceref(m),xcedif(m)
+           else
+              verified = .false.
+              write (*,2010) m,xce(m),xceref(m),xcedif(m)
+           endif
+        enddo
+        
+ 2010   format(' FAILURE: ', i2, E20.13, E20.13, E20.13)
+ 2011   format('          ', i2, E20.13, E20.13, E20.13)
+ 2015   format('          ', i2, E20.13)
+        
+        if (class .eq. 'U') then
+           write(*, 2022)
+           write(*, 2023)
+ 2022      format(' No reference values provided')
+ 2023      format(' No verification performed')
+        else if (verified) then
+           write(*, 2020)
+ 2020      format(' Verification Successful')
+        else
+           write(*, 2021)
+ 2021      format(' Verification failed')
+        endif
+
+        return
+
+
+        end
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/x_solve.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/x_solve.f90
new file mode 100644
index 000000000..125a46894
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/x_solve.f90
@@ -0,0 +1,790 @@
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine x_solve
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     
+!     Performs line solves in X direction by first factoring
+!     the block-tridiagonal matrix into an upper triangular matrix, 
+!     and then performing back substitution to solve for the unknown
+!     vectors of each line.  
+!     
+!     Make sure we treat elements zero to cell_size in the direction
+!     of the sweep.
+!     
+!---------------------------------------------------------------------
+
+      use bt_data
+      use mpinpb
+
+      implicit none
+
+      integer  c, istart, stage,  &
+     &     first, last, recv_id, error, r_status(MPI_STATUS_SIZE),  &
+     &     isize,jsize,ksize,send_id
+
+      istart = 0
+
+      if (timeron) call timer_start(t_xsolve)
+!---------------------------------------------------------------------
+!     in our terminology stage is the number of the cell in the x-direction
+!     i.e. stage = 1 means the start of the line, stage = ncells means the end
+!---------------------------------------------------------------------
+      do stage = 1,ncells
+         c = slice(1,stage)
+         isize = cell_size(1,c) - 1
+         jsize = cell_size(2,c) - 1
+         ksize = cell_size(3,c) - 1
+         
+!---------------------------------------------------------------------
+!     set last-cell flag
+!---------------------------------------------------------------------
+         if (stage .eq. ncells) then
+            last = 1
+         else
+            last = 0
+         endif
+
+         if (stage .eq. 1) then
+!---------------------------------------------------------------------
+!     This is the first cell, so solve without receiving data
+!---------------------------------------------------------------------
+            first = 1
+!            call lhsx(c)
+            call x_solve_cell(first,last,c)
+         else
+!---------------------------------------------------------------------
+!     Not the first cell of this line, so receive info from
+!     processor working on preceding cell
+!---------------------------------------------------------------------
+            first = 0
+            if (timeron) call timer_start(t_xcomm)
+            call x_receive_solve_info(recv_id,c)
+!---------------------------------------------------------------------
+!     overlap computations and communications
+!---------------------------------------------------------------------
+!            call lhsx(c)
+!---------------------------------------------------------------------
+!     wait for completion
+!---------------------------------------------------------------------
+            call mpi_wait(send_id,r_status,error)
+            call mpi_wait(recv_id,r_status,error)
+            if (timeron) call timer_stop(t_xcomm)
+!---------------------------------------------------------------------
+!     install C'(istart) and rhs'(istart) to be used in this cell
+!---------------------------------------------------------------------
+            call x_unpack_solve_info(c)
+            call x_solve_cell(first,last,c)
+         endif
+
+         if (last .eq. 0) call x_send_solve_info(send_id,c)
+      enddo
+
+!---------------------------------------------------------------------
+!     now perform backsubstitution in reverse direction
+!---------------------------------------------------------------------
+      do stage = ncells, 1, -1
+         c = slice(1,stage)
+         first = 0
+         last = 0
+         if (stage .eq. 1) first = 1
+         if (stage .eq. ncells) then
+            last = 1
+!---------------------------------------------------------------------
+!     last cell, so perform back substitution without waiting
+!---------------------------------------------------------------------
+            call x_backsubstitute(first, last,c)
+         else
+            if (timeron) call timer_start(t_xcomm)
+            call x_receive_backsub_info(recv_id,c)
+            call mpi_wait(send_id,r_status,error)
+            call mpi_wait(recv_id,r_status,error)
+            if (timeron) call timer_stop(t_xcomm)
+            call x_unpack_backsub_info(c)
+            call x_backsubstitute(first,last,c)
+         endif
+         if (first .eq. 0) call x_send_backsub_info(send_id,c)
+      enddo
+
+      if (timeron) call timer_stop(t_xsolve)
+
+      return
+      end
+      
+      
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine x_unpack_solve_info(c)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     unpack C'(-1) and rhs'(-1) for
+!     all j and k
+!---------------------------------------------------------------------
+
+      use bt_data
+      implicit none
+
+      integer j,k,m,n,ptr,c,istart 
+
+      istart = 0
+      ptr = 0
+      do k=0,KMAX-1
+         do j=0,JMAX-1
+            do m=1,BLOCK_SIZE
+               do n=1,BLOCK_SIZE
+                  lhsc(m,n,istart-1,j,k,c) = out_buffer(ptr+n)
+               enddo
+               ptr = ptr+BLOCK_SIZE
+            enddo
+            do n=1,BLOCK_SIZE
+               rhs(n,istart-1,j,k,c) = out_buffer(ptr+n)
+            enddo
+            ptr = ptr+BLOCK_SIZE
+         enddo
+      enddo
+
+      return
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+      
+      subroutine x_send_solve_info(send_id,c)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     pack up and send C'(iend) and rhs'(iend) for
+!     all j and k
+!---------------------------------------------------------------------
+
+      use bt_data
+      use mpinpb
+
+      implicit none
+
+      integer j,k,m,n,isize,ptr,c,jp,kp
+      integer error,send_id,buffer_size 
+
+      isize = cell_size(1,c)-1
+      jp = cell_coord(2,c) - 1
+      kp = cell_coord(3,c) - 1
+      buffer_size=MAX_CELL_DIM*MAX_CELL_DIM*  &
+     &     (BLOCK_SIZE*BLOCK_SIZE + BLOCK_SIZE)
+
+!---------------------------------------------------------------------
+!     pack up buffer
+!---------------------------------------------------------------------
+      ptr = 0
+      do k=0,KMAX-1
+         do j=0,JMAX-1
+            do m=1,BLOCK_SIZE
+               do n=1,BLOCK_SIZE
+                  in_buffer(ptr+n) = lhsc(m,n,isize,j,k,c)
+               enddo
+               ptr = ptr+BLOCK_SIZE
+            enddo
+            do n=1,BLOCK_SIZE
+               in_buffer(ptr+n) = rhs(n,isize,j,k,c)
+            enddo
+            ptr = ptr+BLOCK_SIZE
+         enddo
+      enddo
+
+!---------------------------------------------------------------------
+!     send buffer 
+!---------------------------------------------------------------------
+      if (timeron) call timer_start(t_xcomm)
+      call mpi_isend(in_buffer, buffer_size,  &
+     &     dp_type, successor(1),  &
+     &     WEST+jp+kp*NCELLS, comm_solve,  &
+     &     send_id,error)
+      if (timeron) call timer_stop(t_xcomm)
+
+      return
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine x_send_backsub_info(send_id,c)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     pack up and send U(istart) for all j and k
+!---------------------------------------------------------------------
+
+      use bt_data
+      use mpinpb
+
+      implicit none
+
+      integer j,k,n,ptr,c,istart,jp,kp
+      integer error,send_id,buffer_size
+
+!---------------------------------------------------------------------
+!     Send element 0 to previous processor
+!---------------------------------------------------------------------
+      istart = 0
+      jp = cell_coord(2,c)-1
+      kp = cell_coord(3,c)-1
+      buffer_size=MAX_CELL_DIM*MAX_CELL_DIM*BLOCK_SIZE
+      ptr = 0
+      do k=0,KMAX-1
+         do j=0,JMAX-1
+            do n=1,BLOCK_SIZE
+               in_buffer(ptr+n) = rhs(n,istart,j,k,c)
+            enddo
+            ptr = ptr+BLOCK_SIZE
+         enddo
+      enddo
+      if (timeron) call timer_start(t_xcomm)
+      call mpi_isend(in_buffer, buffer_size,  &
+     &     dp_type, predecessor(1),  &
+     &     EAST+jp+kp*NCELLS, comm_solve,  &
+     &     send_id,error)
+      if (timeron) call timer_stop(t_xcomm)
+
+      return
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine x_unpack_backsub_info(c)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     unpack U(isize) for all j and k
+!---------------------------------------------------------------------
+
+      use bt_data
+      implicit none
+
+      integer j,k,n,ptr,c
+
+      ptr = 0
+      do k=0,KMAX-1
+         do j=0,JMAX-1
+            do n=1,BLOCK_SIZE
+               backsub_info(n,j,k,c) = out_buffer(ptr+n)
+            enddo
+            ptr = ptr+BLOCK_SIZE
+         enddo
+      enddo
+
+      return
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine x_receive_backsub_info(recv_id,c)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     post mpi receives
+!---------------------------------------------------------------------
+
+      use bt_data
+      use mpinpb
+
+      implicit none
+
+      integer error,recv_id,jp,kp,c,buffer_size
+
+      jp = cell_coord(2,c) - 1
+      kp = cell_coord(3,c) - 1
+      buffer_size=MAX_CELL_DIM*MAX_CELL_DIM*BLOCK_SIZE
+      call mpi_irecv(out_buffer, buffer_size,  &
+     &     dp_type, successor(1),  &
+     &     EAST+jp+kp*NCELLS, comm_solve,  &
+     &     recv_id, error)
+
+      return
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine x_receive_solve_info(recv_id,c)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     post mpi receives 
+!---------------------------------------------------------------------
+
+      use bt_data
+      use mpinpb
+
+      implicit none
+
+      integer jp,kp,recv_id,error,c,buffer_size
+
+      jp = cell_coord(2,c) - 1
+      kp = cell_coord(3,c) - 1
+      buffer_size=MAX_CELL_DIM*MAX_CELL_DIM*  &
+     &     (BLOCK_SIZE*BLOCK_SIZE + BLOCK_SIZE)
+      call mpi_irecv(out_buffer, buffer_size,  &
+     &     dp_type, predecessor(1),  &
+     &     WEST+jp+kp*NCELLS,  comm_solve,  &
+     &     recv_id, error)
+
+      return
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+      
+      subroutine x_backsubstitute(first, last, c)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     back solve: if last cell, then generate U(isize)=rhs(isize)
+!     else assume U(isize) is loaded in x_unpack_backsub_info
+!     so just use it
+!     after call u(istart) will be sent to next cell
+!---------------------------------------------------------------------
+
+      use bt_data
+      implicit none
+
+      integer first, last, c, i, j, k
+      integer m,n,isize,jsize,ksize,istart
+      
+      istart = 0
+      isize = cell_size(1,c)-1
+      jsize = cell_size(2,c)-end(2,c)-1      
+      ksize = cell_size(3,c)-end(3,c)-1
+      if (last .eq. 0) then
+         do k=start(3,c),ksize
+            do j=start(2,c),jsize
+!---------------------------------------------------------------------
+!     U(isize) uses info from previous cell if not last cell
+!---------------------------------------------------------------------
+               do m=1,BLOCK_SIZE
+                  do n=1,BLOCK_SIZE
+                     rhs(m,isize,j,k,c) = rhs(m,isize,j,k,c)  &
+     &                    - lhsc(m,n,isize,j,k,c)*  &
+     &                    backsub_info(n,j,k,c)
+!---------------------------------------------------------------------
+!     rhs(m,isize,j,k,c) = rhs(m,isize,j,k,c) 
+!     $                    - lhsc(m,n,isize,j,k,c)*rhs(n,isize+1,j,k,c)
+!---------------------------------------------------------------------
+                  enddo
+               enddo
+            enddo
+         enddo
+      endif
+      do k=start(3,c),ksize
+         do j=start(2,c),jsize
+            do i=isize-1,istart,-1
+               do m=1,BLOCK_SIZE
+                  do n=1,BLOCK_SIZE
+                     rhs(m,i,j,k,c) = rhs(m,i,j,k,c)  &
+     &                    - lhsc(m,n,i,j,k,c)*rhs(n,i+1,j,k,c)
+                  enddo
+               enddo
+            enddo
+         enddo
+      enddo
+
+      return
+      end
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine x_solve_cell(first,last,c)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     performs Gaussian elimination on this cell.
+!     
+!     assumes that unpacking routines for non-first cells 
+!     preload C' and rhs' from previous cell.
+!     
+!     assumes send happens outside this routine, but that
+!     c'(IMAX) and rhs'(IMAX) will be sent to next cell
+!---------------------------------------------------------------------
+
+      use bt_data
+      implicit none
+
+      double precision tmp1, tmp2, tmp3
+      integer first,last,c
+      integer i,j,k,isize,ksize,jsize,istart
+
+      istart = 0
+      isize = cell_size(1,c)-1
+      jsize = cell_size(2,c)-end(2,c)-1
+      ksize = cell_size(3,c)-end(3,c)-1
+
+      call lhsabinit(lhsa, lhsb, isize)
+
+      do k=start(3,c),ksize 
+         do j=start(2,c),jsize
+
+!---------------------------------------------------------------------
+!     This function computes the left hand side in the xi-direction
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     determine a (labeled f) and n jacobians for cell c
+!---------------------------------------------------------------------
+            do i = start(1,c)-1, cell_size(1,c) - end(1,c)
+
+               tmp1 = rho_i(i,j,k,c)
+               tmp2 = tmp1 * tmp1
+               tmp3 = tmp1 * tmp2
+!---------------------------------------------------------------------
+!     
+!---------------------------------------------------------------------
+               fjac(1,1,i) = 0.0d+00
+               fjac(1,2,i) = 1.0d+00
+               fjac(1,3,i) = 0.0d+00
+               fjac(1,4,i) = 0.0d+00
+               fjac(1,5,i) = 0.0d+00
+
+               fjac(2,1,i) = -(u(2,i,j,k,c) * tmp2 *  &
+     &              u(2,i,j,k,c))  &
+     &              + c2 * qs(i,j,k,c)
+               fjac(2,2,i) = ( 2.0d+00 - c2 )  &
+     &              * ( u(2,i,j,k,c) * tmp1 )
+               fjac(2,3,i) = - c2 * ( u(3,i,j,k,c) * tmp1 )
+               fjac(2,4,i) = - c2 * ( u(4,i,j,k,c) * tmp1 )
+               fjac(2,5,i) = c2
+
+               fjac(3,1,i) = - ( u(2,i,j,k,c)*u(3,i,j,k,c) ) * tmp2
+               fjac(3,2,i) = u(3,i,j,k,c) * tmp1
+               fjac(3,3,i) = u(2,i,j,k,c) * tmp1
+               fjac(3,4,i) = 0.0d+00
+               fjac(3,5,i) = 0.0d+00
+
+               fjac(4,1,i) = - ( u(2,i,j,k,c)*u(4,i,j,k,c) ) * tmp2
+               fjac(4,2,i) = u(4,i,j,k,c) * tmp1
+               fjac(4,3,i) = 0.0d+00
+               fjac(4,4,i) = u(2,i,j,k,c) * tmp1
+               fjac(4,5,i) = 0.0d+00
+
+               fjac(5,1,i) = ( c2 * 2.0d0 * qs(i,j,k,c)  &
+     &              - c1 * ( u(5,i,j,k,c) * tmp1 ) )  &
+     &              * ( u(2,i,j,k,c) * tmp1 )
+               fjac(5,2,i) = c1 *  u(5,i,j,k,c) * tmp1  &
+     &              - c2  &
+     &              * ( u(2,i,j,k,c)*u(2,i,j,k,c) * tmp2  &
+     &              + qs(i,j,k,c) )
+               fjac(5,3,i) = - c2 * ( u(3,i,j,k,c)*u(2,i,j,k,c) )  &
+     &              * tmp2
+               fjac(5,4,i) = - c2 * ( u(4,i,j,k,c)*u(2,i,j,k,c) )  &
+     &              * tmp2
+               fjac(5,5,i) = c1 * ( u(2,i,j,k,c) * tmp1 )
+
+               njac(1,1,i) = 0.0d+00
+               njac(1,2,i) = 0.0d+00
+               njac(1,3,i) = 0.0d+00
+               njac(1,4,i) = 0.0d+00
+               njac(1,5,i) = 0.0d+00
+
+               njac(2,1,i) = - con43 * c3c4 * tmp2 * u(2,i,j,k,c)
+               njac(2,2,i) =   con43 * c3c4 * tmp1
+               njac(2,3,i) =   0.0d+00
+               njac(2,4,i) =   0.0d+00
+               njac(2,5,i) =   0.0d+00
+
+               njac(3,1,i) = - c3c4 * tmp2 * u(3,i,j,k,c)
+               njac(3,2,i) =   0.0d+00
+               njac(3,3,i) =   c3c4 * tmp1
+               njac(3,4,i) =   0.0d+00
+               njac(3,5,i) =   0.0d+00
+
+               njac(4,1,i) = - c3c4 * tmp2 * u(4,i,j,k,c)
+               njac(4,2,i) =   0.0d+00 
+               njac(4,3,i) =   0.0d+00
+               njac(4,4,i) =   c3c4 * tmp1
+               njac(4,5,i) =   0.0d+00
+
+               njac(5,1,i) = - ( con43 * c3c4  &
+     &              - c1345 ) * tmp3 * (u(2,i,j,k,c)**2)  &
+     &              - ( c3c4 - c1345 ) * tmp3 * (u(3,i,j,k,c)**2)  &
+     &              - ( c3c4 - c1345 ) * tmp3 * (u(4,i,j,k,c)**2)  &
+     &              - c1345 * tmp2 * u(5,i,j,k,c)
+
+               njac(5,2,i) = ( con43 * c3c4  &
+     &              - c1345 ) * tmp2 * u(2,i,j,k,c)
+               njac(5,3,i) = ( c3c4 - c1345 ) * tmp2 * u(3,i,j,k,c)
+               njac(5,4,i) = ( c3c4 - c1345 ) * tmp2 * u(4,i,j,k,c)
+               njac(5,5,i) = ( c1345 ) * tmp1
+
+            enddo
+!---------------------------------------------------------------------
+!     now jacobians set, so form left hand side in x direction
+!---------------------------------------------------------------------
+            do i = start(1,c), isize - end(1,c)
+
+               tmp1 = dt * tx1
+               tmp2 = dt * tx2
+
+               lhsa(1,1,i) = - tmp2 * fjac(1,1,i-1)  &
+     &              - tmp1 * njac(1,1,i-1)  &
+     &              - tmp1 * dx1 
+               lhsa(1,2,i) = - tmp2 * fjac(1,2,i-1)  &
+     &              - tmp1 * njac(1,2,i-1)
+               lhsa(1,3,i) = - tmp2 * fjac(1,3,i-1)  &
+     &              - tmp1 * njac(1,3,i-1)
+               lhsa(1,4,i) = - tmp2 * fjac(1,4,i-1)  &
+     &              - tmp1 * njac(1,4,i-1)
+               lhsa(1,5,i) = - tmp2 * fjac(1,5,i-1)  &
+     &              - tmp1 * njac(1,5,i-1)
+
+               lhsa(2,1,i) = - tmp2 * fjac(2,1,i-1)  &
+     &              - tmp1 * njac(2,1,i-1)
+               lhsa(2,2,i) = - tmp2 * fjac(2,2,i-1)  &
+     &              - tmp1 * njac(2,2,i-1)  &
+     &              - tmp1 * dx2
+               lhsa(2,3,i) = - tmp2 * fjac(2,3,i-1)  &
+     &              - tmp1 * njac(2,3,i-1)
+               lhsa(2,4,i) = - tmp2 * fjac(2,4,i-1)  &
+     &              - tmp1 * njac(2,4,i-1)
+               lhsa(2,5,i) = - tmp2 * fjac(2,5,i-1)  &
+     &              - tmp1 * njac(2,5,i-1)
+
+               lhsa(3,1,i) = - tmp2 * fjac(3,1,i-1)  &
+     &              - tmp1 * njac(3,1,i-1)
+               lhsa(3,2,i) = - tmp2 * fjac(3,2,i-1)  &
+     &              - tmp1 * njac(3,2,i-1)
+               lhsa(3,3,i) = - tmp2 * fjac(3,3,i-1)  &
+     &              - tmp1 * njac(3,3,i-1)  &
+     &              - tmp1 * dx3 
+               lhsa(3,4,i) = - tmp2 * fjac(3,4,i-1)  &
+     &              - tmp1 * njac(3,4,i-1)
+               lhsa(3,5,i) = - tmp2 * fjac(3,5,i-1)  &
+     &              - tmp1 * njac(3,5,i-1)
+
+               lhsa(4,1,i) = - tmp2 * fjac(4,1,i-1)  &
+     &              - tmp1 * njac(4,1,i-1)
+               lhsa(4,2,i) = - tmp2 * fjac(4,2,i-1)  &
+     &              - tmp1 * njac(4,2,i-1)
+               lhsa(4,3,i) = - tmp2 * fjac(4,3,i-1)  &
+     &              - tmp1 * njac(4,3,i-1)
+               lhsa(4,4,i) = - tmp2 * fjac(4,4,i-1)  &
+     &              - tmp1 * njac(4,4,i-1)  &
+     &              - tmp1 * dx4
+               lhsa(4,5,i) = - tmp2 * fjac(4,5,i-1)  &
+     &              - tmp1 * njac(4,5,i-1)
+
+               lhsa(5,1,i) = - tmp2 * fjac(5,1,i-1)  &
+     &              - tmp1 * njac(5,1,i-1)
+               lhsa(5,2,i) = - tmp2 * fjac(5,2,i-1)  &
+     &              - tmp1 * njac(5,2,i-1)
+               lhsa(5,3,i) = - tmp2 * fjac(5,3,i-1)  &
+     &              - tmp1 * njac(5,3,i-1)
+               lhsa(5,4,i) = - tmp2 * fjac(5,4,i-1)  &
+     &              - tmp1 * njac(5,4,i-1)
+               lhsa(5,5,i) = - tmp2 * fjac(5,5,i-1)  &
+     &              - tmp1 * njac(5,5,i-1)  &
+     &              - tmp1 * dx5
+
+               lhsb(1,1,i) = 1.0d+00  &
+     &              + tmp1 * 2.0d+00 * njac(1,1,i)  &
+     &              + tmp1 * 2.0d+00 * dx1
+               lhsb(1,2,i) = tmp1 * 2.0d+00 * njac(1,2,i)
+               lhsb(1,3,i) = tmp1 * 2.0d+00 * njac(1,3,i)
+               lhsb(1,4,i) = tmp1 * 2.0d+00 * njac(1,4,i)
+               lhsb(1,5,i) = tmp1 * 2.0d+00 * njac(1,5,i)
+
+               lhsb(2,1,i) = tmp1 * 2.0d+00 * njac(2,1,i)
+               lhsb(2,2,i) = 1.0d+00  &
+     &              + tmp1 * 2.0d+00 * njac(2,2,i)  &
+     &              + tmp1 * 2.0d+00 * dx2
+               lhsb(2,3,i) = tmp1 * 2.0d+00 * njac(2,3,i)
+               lhsb(2,4,i) = tmp1 * 2.0d+00 * njac(2,4,i)
+               lhsb(2,5,i) = tmp1 * 2.0d+00 * njac(2,5,i)
+
+               lhsb(3,1,i) = tmp1 * 2.0d+00 * njac(3,1,i)
+               lhsb(3,2,i) = tmp1 * 2.0d+00 * njac(3,2,i)
+               lhsb(3,3,i) = 1.0d+00  &
+     &              + tmp1 * 2.0d+00 * njac(3,3,i)  &
+     &              + tmp1 * 2.0d+00 * dx3
+               lhsb(3,4,i) = tmp1 * 2.0d+00 * njac(3,4,i)
+               lhsb(3,5,i) = tmp1 * 2.0d+00 * njac(3,5,i)
+
+               lhsb(4,1,i) = tmp1 * 2.0d+00 * njac(4,1,i)
+               lhsb(4,2,i) = tmp1 * 2.0d+00 * njac(4,2,i)
+               lhsb(4,3,i) = tmp1 * 2.0d+00 * njac(4,3,i)
+               lhsb(4,4,i) = 1.0d+00  &
+     &              + tmp1 * 2.0d+00 * njac(4,4,i)  &
+     &              + tmp1 * 2.0d+00 * dx4
+               lhsb(4,5,i) = tmp1 * 2.0d+00 * njac(4,5,i)
+
+               lhsb(5,1,i) = tmp1 * 2.0d+00 * njac(5,1,i)
+               lhsb(5,2,i) = tmp1 * 2.0d+00 * njac(5,2,i)
+               lhsb(5,3,i) = tmp1 * 2.0d+00 * njac(5,3,i)
+               lhsb(5,4,i) = tmp1 * 2.0d+00 * njac(5,4,i)
+               lhsb(5,5,i) = 1.0d+00  &
+     &              + tmp1 * 2.0d+00 * njac(5,5,i)  &
+     &              + tmp1 * 2.0d+00 * dx5
+
+               lhsc(1,1,i,j,k,c) =  tmp2 * fjac(1,1,i+1)  &
+     &              - tmp1 * njac(1,1,i+1)  &
+     &              - tmp1 * dx1
+               lhsc(1,2,i,j,k,c) =  tmp2 * fjac(1,2,i+1)  &
+     &              - tmp1 * njac(1,2,i+1)
+               lhsc(1,3,i,j,k,c) =  tmp2 * fjac(1,3,i+1)  &
+     &              - tmp1 * njac(1,3,i+1)
+               lhsc(1,4,i,j,k,c) =  tmp2 * fjac(1,4,i+1)  &
+     &              - tmp1 * njac(1,4,i+1)
+               lhsc(1,5,i,j,k,c) =  tmp2 * fjac(1,5,i+1)  &
+     &              - tmp1 * njac(1,5,i+1)
+
+               lhsc(2,1,i,j,k,c) =  tmp2 * fjac(2,1,i+1)  &
+     &              - tmp1 * njac(2,1,i+1)
+               lhsc(2,2,i,j,k,c) =  tmp2 * fjac(2,2,i+1)  &
+     &              - tmp1 * njac(2,2,i+1)  &
+     &              - tmp1 * dx2
+               lhsc(2,3,i,j,k,c) =  tmp2 * fjac(2,3,i+1)  &
+     &              - tmp1 * njac(2,3,i+1)
+               lhsc(2,4,i,j,k,c) =  tmp2 * fjac(2,4,i+1)  &
+     &              - tmp1 * njac(2,4,i+1)
+               lhsc(2,5,i,j,k,c) =  tmp2 * fjac(2,5,i+1)  &
+     &              - tmp1 * njac(2,5,i+1)
+
+               lhsc(3,1,i,j,k,c) =  tmp2 * fjac(3,1,i+1)  &
+     &              - tmp1 * njac(3,1,i+1)
+               lhsc(3,2,i,j,k,c) =  tmp2 * fjac(3,2,i+1)  &
+     &              - tmp1 * njac(3,2,i+1)
+               lhsc(3,3,i,j,k,c) =  tmp2 * fjac(3,3,i+1)  &
+     &              - tmp1 * njac(3,3,i+1)  &
+     &              - tmp1 * dx3
+               lhsc(3,4,i,j,k,c) =  tmp2 * fjac(3,4,i+1)  &
+     &              - tmp1 * njac(3,4,i+1)
+               lhsc(3,5,i,j,k,c) =  tmp2 * fjac(3,5,i+1)  &
+     &              - tmp1 * njac(3,5,i+1)
+
+               lhsc(4,1,i,j,k,c) =  tmp2 * fjac(4,1,i+1)  &
+     &              - tmp1 * njac(4,1,i+1)
+               lhsc(4,2,i,j,k,c) =  tmp2 * fjac(4,2,i+1)  &
+     &              - tmp1 * njac(4,2,i+1)
+               lhsc(4,3,i,j,k,c) =  tmp2 * fjac(4,3,i+1)  &
+     &              - tmp1 * njac(4,3,i+1)
+               lhsc(4,4,i,j,k,c) =  tmp2 * fjac(4,4,i+1)  &
+     &              - tmp1 * njac(4,4,i+1)  &
+     &              - tmp1 * dx4
+               lhsc(4,5,i,j,k,c) =  tmp2 * fjac(4,5,i+1)  &
+     &              - tmp1 * njac(4,5,i+1)
+
+               lhsc(5,1,i,j,k,c) =  tmp2 * fjac(5,1,i+1)  &
+     &              - tmp1 * njac(5,1,i+1)
+               lhsc(5,2,i,j,k,c) =  tmp2 * fjac(5,2,i+1)  &
+     &              - tmp1 * njac(5,2,i+1)
+               lhsc(5,3,i,j,k,c) =  tmp2 * fjac(5,3,i+1)  &
+     &              - tmp1 * njac(5,3,i+1)
+               lhsc(5,4,i,j,k,c) =  tmp2 * fjac(5,4,i+1)  &
+     &              - tmp1 * njac(5,4,i+1)
+               lhsc(5,5,i,j,k,c) =  tmp2 * fjac(5,5,i+1)  &
+     &              - tmp1 * njac(5,5,i+1)  &
+     &              - tmp1 * dx5
+
+            enddo
+
+
+!---------------------------------------------------------------------
+!     outermost do loops - sweeping in i direction
+!---------------------------------------------------------------------
+            if (first .eq. 1) then 
+
+!---------------------------------------------------------------------
+!     multiply c(istart,j,k) by b_inverse and copy back to c
+!     multiply rhs(istart) by b_inverse(istart) and copy to rhs
+!---------------------------------------------------------------------
+               call binvcrhs( lhsb(1,1,istart),  &
+     &                        lhsc(1,1,istart,j,k,c),  &
+     &                        rhs(1,istart,j,k,c) )
+
+            endif
+
+!---------------------------------------------------------------------
+!     begin innermost do loop
+!     do all the elements of the cell unless last 
+!---------------------------------------------------------------------
+            do i=istart+first,isize-last
+
+!---------------------------------------------------------------------
+!     rhs(i) = rhs(i) - A*rhs(i-1)
+!---------------------------------------------------------------------
+               call matvec_sub(lhsa(1,1,i),  &
+     &                         rhs(1,i-1,j,k,c),rhs(1,i,j,k,c))
+
+!---------------------------------------------------------------------
+!     B(i) = B(i) - C(i-1)*A(i)
+!---------------------------------------------------------------------
+               call matmul_sub(lhsa(1,1,i),  &
+     &                         lhsc(1,1,i-1,j,k,c),  &
+     &                         lhsb(1,1,i))
+
+
+!---------------------------------------------------------------------
+!     multiply c(i,j,k) by b_inverse and copy back to c
+!     multiply rhs(1,j,k) by b_inverse(1,j,k) and copy to rhs
+!---------------------------------------------------------------------
+               call binvcrhs( lhsb(1,1,i),  &
+     &                        lhsc(1,1,i,j,k,c),  &
+     &                        rhs(1,i,j,k,c) )
+
+            enddo
+
+!---------------------------------------------------------------------
+!     Now finish up special cases for last cell
+!---------------------------------------------------------------------
+            if (last .eq. 1) then
+
+!---------------------------------------------------------------------
+!     rhs(isize) = rhs(isize) - A*rhs(isize-1)
+!---------------------------------------------------------------------
+               call matvec_sub(lhsa(1,1,isize),  &
+     &                         rhs(1,isize-1,j,k,c),rhs(1,isize,j,k,c))
+
+!---------------------------------------------------------------------
+!     B(isize) = B(isize) - C(isize-1)*A(isize)
+!---------------------------------------------------------------------
+               call matmul_sub(lhsa(1,1,isize),  &
+     &                         lhsc(1,1,isize-1,j,k,c),  &
+     &                         lhsb(1,1,isize))
+
+!---------------------------------------------------------------------
+!     multiply rhs() by b_inverse() and copy to rhs
+!---------------------------------------------------------------------
+               call binvrhs( lhsb(1,1,isize),  &
+     &                       rhs(1,isize,j,k,c) )
+
+            endif
+         enddo
+      enddo
+
+
+      return
+      end
+      
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/x_solve_vec.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/x_solve_vec.f90
new file mode 100644
index 000000000..593f6a3a6
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/x_solve_vec.f90
@@ -0,0 +1,813 @@
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine x_solve
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     
+!     Performs line solves in X direction by first factoring
+!     the block-tridiagonal matrix into an upper triangular matrix, 
+!     and then performing back substitution to solve for the unknown
+!     vectors of each line.  
+!     
+!     Make sure we treat elements zero to cell_size in the direction
+!     of the sweep.
+!     
+!---------------------------------------------------------------------
+
+      use bt_data
+      use mpinpb
+
+      implicit none
+
+      integer  c, istart, stage,  &
+     &     first, last, recv_id, error, r_status(MPI_STATUS_SIZE),  &
+     &     isize,jsize,ksize,send_id
+
+      istart = 0
+
+      if (timeron) call timer_start(t_xsolve)
+!---------------------------------------------------------------------
+!     in our terminology stage is the number of the cell in the x-direction
+!     i.e. stage = 1 means the start of the line, stage = ncells means the end
+!---------------------------------------------------------------------
+      do stage = 1,ncells
+         c = slice(1,stage)
+         isize = cell_size(1,c) - 1
+         jsize = cell_size(2,c) - 1
+         ksize = cell_size(3,c) - 1
+         
+!---------------------------------------------------------------------
+!     set last-cell flag
+!---------------------------------------------------------------------
+         if (stage .eq. ncells) then
+            last = 1
+         else
+            last = 0
+         endif
+
+         if (stage .eq. 1) then
+!---------------------------------------------------------------------
+!     This is the first cell, so solve without receiving data
+!---------------------------------------------------------------------
+            first = 1
+!            call lhsx(c)
+            call x_solve_cell(first,last,c)
+         else
+!---------------------------------------------------------------------
+!     Not the first cell of this line, so receive info from
+!     processor working on preceding cell
+!---------------------------------------------------------------------
+            first = 0
+            if (timeron) call timer_start(t_xcomm)
+            call x_receive_solve_info(recv_id,c)
+!---------------------------------------------------------------------
+!     overlap computations and communications
+!---------------------------------------------------------------------
+!            call lhsx(c)
+!---------------------------------------------------------------------
+!     wait for completion
+!---------------------------------------------------------------------
+            call mpi_wait(send_id,r_status,error)
+            call mpi_wait(recv_id,r_status,error)
+            if (timeron) call timer_stop(t_xcomm)
+!---------------------------------------------------------------------
+!     install C'(istart) and rhs'(istart) to be used in this cell
+!---------------------------------------------------------------------
+            call x_unpack_solve_info(c)
+            call x_solve_cell(first,last,c)
+         endif
+
+         if (last .eq. 0) call x_send_solve_info(send_id,c)
+      enddo
+
+!---------------------------------------------------------------------
+!     now perform backsubstitution in reverse direction
+!---------------------------------------------------------------------
+      do stage = ncells, 1, -1
+         c = slice(1,stage)
+         first = 0
+         last = 0
+         if (stage .eq. 1) first = 1
+         if (stage .eq. ncells) then
+            last = 1
+!---------------------------------------------------------------------
+!     last cell, so perform back substitute without waiting
+!---------------------------------------------------------------------
+            call x_backsubstitute(first, last,c)
+         else
+            if (timeron) call timer_start(t_xcomm)
+            call x_receive_backsub_info(recv_id,c)
+            call mpi_wait(send_id,r_status,error)
+            call mpi_wait(recv_id,r_status,error)
+            if (timeron) call timer_stop(t_xcomm)
+            call x_unpack_backsub_info(c)
+            call x_backsubstitute(first,last,c)
+         endif
+         if (first .eq. 0) call x_send_backsub_info(send_id,c)
+      enddo
+
+      if (timeron) call timer_stop(t_xsolve)
+
+      return
+      end
+      
+      
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine x_unpack_solve_info(c)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     unpack C'(-1) and rhs'(-1) for
+!     all j and k
+!---------------------------------------------------------------------
+
+      use bt_data
+      implicit none
+      integer j,k,m,n,ptr,c,istart 
+
+      istart = 0
+      ptr = 0
+      do k=0,KMAX-1
+         do j=0,JMAX-1
+            do m=1,BLOCK_SIZE
+               do n=1,BLOCK_SIZE
+                  lhsc(m,n,istart-1,j,k,c) = out_buffer(ptr+n)
+               enddo
+               ptr = ptr+BLOCK_SIZE
+            enddo
+            do n=1,BLOCK_SIZE
+               rhs(n,istart-1,j,k,c) = out_buffer(ptr+n)
+            enddo
+            ptr = ptr+BLOCK_SIZE
+         enddo
+      enddo
+
+      return
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+      
+      subroutine x_send_solve_info(send_id,c)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     pack up and send C'(iend) and rhs'(iend) for
+!     all j and k
+!---------------------------------------------------------------------
+
+      use bt_data
+      use mpinpb
+
+      implicit none
+
+      integer j,k,m,n,isize,ptr,c,jp,kp
+      integer error,send_id,buffer_size 
+
+      isize = cell_size(1,c)-1
+      jp = cell_coord(2,c) - 1
+      kp = cell_coord(3,c) - 1
+      buffer_size=MAX_CELL_DIM*MAX_CELL_DIM*  &
+     &     (BLOCK_SIZE*BLOCK_SIZE + BLOCK_SIZE)
+
+!---------------------------------------------------------------------
+!     pack up buffer
+!---------------------------------------------------------------------
+      ptr = 0
+      do k=0,KMAX-1
+         do j=0,JMAX-1
+            do m=1,BLOCK_SIZE
+               do n=1,BLOCK_SIZE
+                  in_buffer(ptr+n) = lhsc(m,n,isize,j,k,c)
+               enddo
+               ptr = ptr+BLOCK_SIZE
+            enddo
+            do n=1,BLOCK_SIZE
+               in_buffer(ptr+n) = rhs(n,isize,j,k,c)
+            enddo
+            ptr = ptr+BLOCK_SIZE
+         enddo
+      enddo
+
+!---------------------------------------------------------------------
+!     send buffer 
+!---------------------------------------------------------------------
+      if (timeron) call timer_start(t_xcomm)
+      call mpi_isend(in_buffer, buffer_size,  &
+     &     dp_type, successor(1),  &
+     &     WEST+jp+kp*NCELLS, comm_solve,  &
+     &     send_id,error)
+      if (timeron) call timer_stop(t_xcomm)
+
+      return
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine x_send_backsub_info(send_id,c)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     pack up and send U(istart) for all j and k
+!---------------------------------------------------------------------
+
+      use bt_data
+      use mpinpb
+
+      implicit none
+
+      integer j,k,n,ptr,c,istart,jp,kp
+      integer error,send_id,buffer_size
+
+!---------------------------------------------------------------------
+!     Send element 0 to previous processor
+!---------------------------------------------------------------------
+      istart = 0
+      jp = cell_coord(2,c)-1
+      kp = cell_coord(3,c)-1
+      buffer_size=MAX_CELL_DIM*MAX_CELL_DIM*BLOCK_SIZE
+      ptr = 0
+      do k=0,KMAX-1
+         do j=0,JMAX-1
+            do n=1,BLOCK_SIZE
+               in_buffer(ptr+n) = rhs(n,istart,j,k,c)
+            enddo
+            ptr = ptr+BLOCK_SIZE
+         enddo
+      enddo
+      if (timeron) call timer_start(t_xcomm)
+      call mpi_isend(in_buffer, buffer_size,  &
+     &     dp_type, predecessor(1),  &
+     &     EAST+jp+kp*NCELLS, comm_solve,  &
+     &     send_id,error)
+      if (timeron) call timer_stop(t_xcomm)
+
+      return
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine x_unpack_backsub_info(c)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     unpack U(isize) for all j and k
+!---------------------------------------------------------------------
+
+      use bt_data
+      implicit none
+      integer j,k,n,ptr,c
+
+      ptr = 0
+      do k=0,KMAX-1
+         do j=0,JMAX-1
+            do n=1,BLOCK_SIZE
+               backsub_info(n,j,k,c) = out_buffer(ptr+n)
+            enddo
+            ptr = ptr+BLOCK_SIZE
+         enddo
+      enddo
+
+      return
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine x_receive_backsub_info(recv_id,c)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     post mpi receives
+!---------------------------------------------------------------------
+
+      use bt_data
+      use mpinpb
+
+      implicit none
+
+      integer error,recv_id,jp,kp,c,buffer_size
+      jp = cell_coord(2,c) - 1
+      kp = cell_coord(3,c) - 1
+      buffer_size=MAX_CELL_DIM*MAX_CELL_DIM*BLOCK_SIZE
+      call mpi_irecv(out_buffer, buffer_size,  &
+     &     dp_type, successor(1),  &
+     &     EAST+jp+kp*NCELLS, comm_solve,  &
+     &     recv_id, error)
+
+      return
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine x_receive_solve_info(recv_id,c)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     post mpi receives 
+!---------------------------------------------------------------------
+
+      use bt_data
+      use mpinpb
+
+      implicit none
+
+      integer jp,kp,recv_id,error,c,buffer_size
+      jp = cell_coord(2,c) - 1
+      kp = cell_coord(3,c) - 1
+      buffer_size=MAX_CELL_DIM*MAX_CELL_DIM*  &
+     &     (BLOCK_SIZE*BLOCK_SIZE + BLOCK_SIZE)
+      call mpi_irecv(out_buffer, buffer_size,  &
+     &     dp_type, predecessor(1),  &
+     &     WEST+jp+kp*NCELLS,  comm_solve,  &
+     &     recv_id, error)
+
+      return
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+      
+      subroutine x_backsubstitute(first, last, c)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     back solve: if last cell, then generate U(isize)=rhs(isize)
+!     else assume U(isize) is loaded in x_unpack_backsub_info
+!     so just use it
+!     after call u(istart) will be sent to next cell
+!---------------------------------------------------------------------
+
+      use bt_data
+      implicit none
+
+      integer first, last, c, i, j, k
+      integer m,n,isize,jsize,ksize,istart
+      
+      istart = 0
+      isize = cell_size(1,c)-1
+      jsize = cell_size(2,c)-end(2,c)-1      
+      ksize = cell_size(3,c)-end(3,c)-1
+      if (last .eq. 0) then
+         do k=start(3,c),ksize
+            do j=start(2,c),jsize
+!---------------------------------------------------------------------
+!     U(isize) uses info from previous cell if not last cell
+!---------------------------------------------------------------------
+               do m=1,BLOCK_SIZE
+                  do n=1,BLOCK_SIZE
+                     rhs(m,isize,j,k,c) = rhs(m,isize,j,k,c)  &
+     &                    - lhsc(m,n,isize,j,k,c)*  &
+     &                    backsub_info(n,j,k,c)
+!---------------------------------------------------------------------
+!     rhs(m,isize,j,k,c) = rhs(m,isize,j,k,c) 
+!     $                    - lhsc(m,n,isize,j,k,c)*rhs(n,isize+1,j,k,c)
+!---------------------------------------------------------------------
+                  enddo
+               enddo
+            enddo
+         enddo
+      endif
+      do k=start(3,c),ksize
+         do j=start(2,c),jsize
+            do i=isize-1,istart,-1
+               do m=1,BLOCK_SIZE
+                  do n=1,BLOCK_SIZE
+                     rhs(m,i,j,k,c) = rhs(m,i,j,k,c)  &
+     &                    - lhsc(m,n,i,j,k,c)*rhs(n,i+1,j,k,c)
+                  enddo
+               enddo
+            enddo
+         enddo
+      enddo
+
+      return
+      end
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine x_solve_cell(first,last,c)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     performs Gaussian elimination on this cell.
+!     
+!     assumes that unpacking routines for non-first cells 
+!     preload C' and rhs' from previous cell.
+!     
+!     assumed send happens outside this routine, but that
+!     c'(IMAX) and rhs'(IMAX) will be sent to next cell
+!---------------------------------------------------------------------
+
+      use bt_data
+      implicit none
+
+      double precision tmp1, tmp2, tmp3
+      integer first,last,c
+      integer i,j,k,m,n,isize,ksize,jsize,istart
+
+      istart = 0
+      isize = cell_size(1,c)-1
+      jsize = cell_size(2,c)-end(2,c)-1
+      ksize = cell_size(3,c)-end(3,c)-1
+
+!---------------------------------------------------------------------
+!     zero the left hand side for starters
+!     set diagonal values to 1. This is overkill, but convenient
+!---------------------------------------------------------------------
+      do j = 0, jsize
+         do m = 1, 5
+            do n = 1, 5
+               lhsa(m,n,0,j) = 0.0d0
+               lhsb(m,n,0,j) = 0.0d0
+               lhsa(m,n,isize,j) = 0.0d0
+               lhsb(m,n,isize,j) = 0.0d0
+            enddo
+            lhsb(m,m,0,j) = 1.0d0
+            lhsb(m,m,isize,j) = 1.0d0
+         enddo
+      enddo
+
+      do k=start(3,c),ksize 
+
+!---------------------------------------------------------------------
+!     This function computes the left hand side in the xi-direction
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     determine a (labeled f) and n jacobians for cell c
+!---------------------------------------------------------------------
+         do j=start(2,c),jsize
+            do i = start(1,c)-1, cell_size(1,c) - end(1,c)
+
+               tmp1 = rho_i(i,j,k,c)
+               tmp2 = tmp1 * tmp1
+               tmp3 = tmp1 * tmp2
+!---------------------------------------------------------------------
+!     
+!---------------------------------------------------------------------
+               fjac(1,1,i,j) = 0.0d+00
+               fjac(1,2,i,j) = 1.0d+00
+               fjac(1,3,i,j) = 0.0d+00
+               fjac(1,4,i,j) = 0.0d+00
+               fjac(1,5,i,j) = 0.0d+00
+
+               fjac(2,1,i,j) = -(u(2,i,j,k,c) * tmp2 *  &
+     &              u(2,i,j,k,c))  &
+     &              + c2 * qs(i,j,k,c)
+               fjac(2,2,i,j) = ( 2.0d+00 - c2 )  &
+     &              * ( u(2,i,j,k,c) * tmp1 )
+               fjac(2,3,i,j) = - c2 * ( u(3,i,j,k,c) * tmp1 )
+               fjac(2,4,i,j) = - c2 * ( u(4,i,j,k,c) * tmp1 )
+               fjac(2,5,i,j) = c2
+
+               fjac(3,1,i,j) = - ( u(2,i,j,k,c)*u(3,i,j,k,c) ) * tmp2
+               fjac(3,2,i,j) = u(3,i,j,k,c) * tmp1
+               fjac(3,3,i,j) = u(2,i,j,k,c) * tmp1
+               fjac(3,4,i,j) = 0.0d+00
+               fjac(3,5,i,j) = 0.0d+00
+
+               fjac(4,1,i,j) = - ( u(2,i,j,k,c)*u(4,i,j,k,c) ) * tmp2
+               fjac(4,2,i,j) = u(4,i,j,k,c) * tmp1
+               fjac(4,3,i,j) = 0.0d+00
+               fjac(4,4,i,j) = u(2,i,j,k,c) * tmp1
+               fjac(4,5,i,j) = 0.0d+00
+
+               fjac(5,1,i,j) = ( c2 * 2.0d0 * qs(i,j,k,c)  &
+     &              - c1 * ( u(5,i,j,k,c) * tmp1 ) )  &
+     &              * ( u(2,i,j,k,c) * tmp1 )
+               fjac(5,2,i,j) = c1 *  u(5,i,j,k,c) * tmp1  &
+     &              - c2  &
+     &              * ( u(2,i,j,k,c)*u(2,i,j,k,c) * tmp2  &
+     &              + qs(i,j,k,c) )
+               fjac(5,3,i,j) = - c2 * ( u(3,i,j,k,c)*u(2,i,j,k,c) )  &
+     &              * tmp2
+               fjac(5,4,i,j) = - c2 * ( u(4,i,j,k,c)*u(2,i,j,k,c) )  &
+     &              * tmp2
+               fjac(5,5,i,j) = c1 * ( u(2,i,j,k,c) * tmp1 )
+
+               njac(1,1,i,j) = 0.0d+00
+               njac(1,2,i,j) = 0.0d+00
+               njac(1,3,i,j) = 0.0d+00
+               njac(1,4,i,j) = 0.0d+00
+               njac(1,5,i,j) = 0.0d+00
+
+               njac(2,1,i,j) = - con43 * c3c4 * tmp2 * u(2,i,j,k,c)
+               njac(2,2,i,j) =   con43 * c3c4 * tmp1
+               njac(2,3,i,j) =   0.0d+00
+               njac(2,4,i,j) =   0.0d+00
+               njac(2,5,i,j) =   0.0d+00
+
+               njac(3,1,i,j) = - c3c4 * tmp2 * u(3,i,j,k,c)
+               njac(3,2,i,j) =   0.0d+00
+               njac(3,3,i,j) =   c3c4 * tmp1
+               njac(3,4,i,j) =   0.0d+00
+               njac(3,5,i,j) =   0.0d+00
+
+               njac(4,1,i,j) = - c3c4 * tmp2 * u(4,i,j,k,c)
+               njac(4,2,i,j) =   0.0d+00 
+               njac(4,3,i,j) =   0.0d+00
+               njac(4,4,i,j) =   c3c4 * tmp1
+               njac(4,5,i,j) =   0.0d+00
+
+               njac(5,1,i,j) = - ( con43 * c3c4  &
+     &              - c1345 ) * tmp3 * (u(2,i,j,k,c)**2)  &
+     &              - ( c3c4 - c1345 ) * tmp3 * (u(3,i,j,k,c)**2)  &
+     &              - ( c3c4 - c1345 ) * tmp3 * (u(4,i,j,k,c)**2)  &
+     &              - c1345 * tmp2 * u(5,i,j,k,c)
+
+               njac(5,2,i,j) = ( con43 * c3c4  &
+     &              - c1345 ) * tmp2 * u(2,i,j,k,c)
+               njac(5,3,i,j) = ( c3c4 - c1345 ) * tmp2 * u(3,i,j,k,c)
+               njac(5,4,i,j) = ( c3c4 - c1345 ) * tmp2 * u(4,i,j,k,c)
+               njac(5,5,i,j) = ( c1345 ) * tmp1
+
+            enddo
+         enddo
+
+!---------------------------------------------------------------------
+!     now jacobians set, so form left hand side in x direction
+!---------------------------------------------------------------------
+         do j=start(2,c),jsize
+            do i = start(1,c), isize - end(1,c)
+
+               tmp1 = dt * tx1
+               tmp2 = dt * tx2
+
+               lhsa(1,1,i,j) = - tmp2 * fjac(1,1,i-1,j)  &
+     &              - tmp1 * njac(1,1,i-1,j)  &
+     &              - tmp1 * dx1 
+               lhsa(1,2,i,j) = - tmp2 * fjac(1,2,i-1,j)  &
+     &              - tmp1 * njac(1,2,i-1,j)
+               lhsa(1,3,i,j) = - tmp2 * fjac(1,3,i-1,j)  &
+     &              - tmp1 * njac(1,3,i-1,j)
+               lhsa(1,4,i,j) = - tmp2 * fjac(1,4,i-1,j)  &
+     &              - tmp1 * njac(1,4,i-1,j)
+               lhsa(1,5,i,j) = - tmp2 * fjac(1,5,i-1,j)  &
+     &              - tmp1 * njac(1,5,i-1,j)
+
+               lhsa(2,1,i,j) = - tmp2 * fjac(2,1,i-1,j)  &
+     &              - tmp1 * njac(2,1,i-1,j)
+               lhsa(2,2,i,j) = - tmp2 * fjac(2,2,i-1,j)  &
+     &              - tmp1 * njac(2,2,i-1,j)  &
+     &              - tmp1 * dx2
+               lhsa(2,3,i,j) = - tmp2 * fjac(2,3,i-1,j)  &
+     &              - tmp1 * njac(2,3,i-1,j)
+               lhsa(2,4,i,j) = - tmp2 * fjac(2,4,i-1,j)  &
+     &              - tmp1 * njac(2,4,i-1,j)
+               lhsa(2,5,i,j) = - tmp2 * fjac(2,5,i-1,j)  &
+     &              - tmp1 * njac(2,5,i-1,j)
+
+               lhsa(3,1,i,j) = - tmp2 * fjac(3,1,i-1,j)  &
+     &              - tmp1 * njac(3,1,i-1,j)
+               lhsa(3,2,i,j) = - tmp2 * fjac(3,2,i-1,j)  &
+     &              - tmp1 * njac(3,2,i-1,j)
+               lhsa(3,3,i,j) = - tmp2 * fjac(3,3,i-1,j)  &
+     &              - tmp1 * njac(3,3,i-1,j)  &
+     &              - tmp1 * dx3 
+               lhsa(3,4,i,j) = - tmp2 * fjac(3,4,i-1,j)  &
+     &              - tmp1 * njac(3,4,i-1,j)
+               lhsa(3,5,i,j) = - tmp2 * fjac(3,5,i-1,j)  &
+     &              - tmp1 * njac(3,5,i-1,j)
+
+               lhsa(4,1,i,j) = - tmp2 * fjac(4,1,i-1,j)  &
+     &              - tmp1 * njac(4,1,i-1,j)
+               lhsa(4,2,i,j) = - tmp2 * fjac(4,2,i-1,j)  &
+     &              - tmp1 * njac(4,2,i-1,j)
+               lhsa(4,3,i,j) = - tmp2 * fjac(4,3,i-1,j)  &
+     &              - tmp1 * njac(4,3,i-1,j)
+               lhsa(4,4,i,j) = - tmp2 * fjac(4,4,i-1,j)  &
+     &              - tmp1 * njac(4,4,i-1,j)  &
+     &              - tmp1 * dx4
+               lhsa(4,5,i,j) = - tmp2 * fjac(4,5,i-1,j)  &
+     &              - tmp1 * njac(4,5,i-1,j)
+
+               lhsa(5,1,i,j) = - tmp2 * fjac(5,1,i-1,j)  &
+     &              - tmp1 * njac(5,1,i-1,j)
+               lhsa(5,2,i,j) = - tmp2 * fjac(5,2,i-1,j)  &
+     &              - tmp1 * njac(5,2,i-1,j)
+               lhsa(5,3,i,j) = - tmp2 * fjac(5,3,i-1,j)  &
+     &              - tmp1 * njac(5,3,i-1,j)
+               lhsa(5,4,i,j) = - tmp2 * fjac(5,4,i-1,j)  &
+     &              - tmp1 * njac(5,4,i-1,j)
+               lhsa(5,5,i,j) = - tmp2 * fjac(5,5,i-1,j)  &
+     &              - tmp1 * njac(5,5,i-1,j)  &
+     &              - tmp1 * dx5
+
+               lhsb(1,1,i,j) = 1.0d+00  &
+     &              + tmp1 * 2.0d+00 * njac(1,1,i,j)  &
+     &              + tmp1 * 2.0d+00 * dx1
+               lhsb(1,2,i,j) = tmp1 * 2.0d+00 * njac(1,2,i,j)
+               lhsb(1,3,i,j) = tmp1 * 2.0d+00 * njac(1,3,i,j)
+               lhsb(1,4,i,j) = tmp1 * 2.0d+00 * njac(1,4,i,j)
+               lhsb(1,5,i,j) = tmp1 * 2.0d+00 * njac(1,5,i,j)
+
+               lhsb(2,1,i,j) = tmp1 * 2.0d+00 * njac(2,1,i,j)
+               lhsb(2,2,i,j) = 1.0d+00  &
+     &              + tmp1 * 2.0d+00 * njac(2,2,i,j)  &
+     &              + tmp1 * 2.0d+00 * dx2
+               lhsb(2,3,i,j) = tmp1 * 2.0d+00 * njac(2,3,i,j)
+               lhsb(2,4,i,j) = tmp1 * 2.0d+00 * njac(2,4,i,j)
+               lhsb(2,5,i,j) = tmp1 * 2.0d+00 * njac(2,5,i,j)
+
+               lhsb(3,1,i,j) = tmp1 * 2.0d+00 * njac(3,1,i,j)
+               lhsb(3,2,i,j) = tmp1 * 2.0d+00 * njac(3,2,i,j)
+               lhsb(3,3,i,j) = 1.0d+00  &
+     &              + tmp1 * 2.0d+00 * njac(3,3,i,j)  &
+     &              + tmp1 * 2.0d+00 * dx3
+               lhsb(3,4,i,j) = tmp1 * 2.0d+00 * njac(3,4,i,j)
+               lhsb(3,5,i,j) = tmp1 * 2.0d+00 * njac(3,5,i,j)
+
+               lhsb(4,1,i,j) = tmp1 * 2.0d+00 * njac(4,1,i,j)
+               lhsb(4,2,i,j) = tmp1 * 2.0d+00 * njac(4,2,i,j)
+               lhsb(4,3,i,j) = tmp1 * 2.0d+00 * njac(4,3,i,j)
+               lhsb(4,4,i,j) = 1.0d+00  &
+     &              + tmp1 * 2.0d+00 * njac(4,4,i,j)  &
+     &              + tmp1 * 2.0d+00 * dx4
+               lhsb(4,5,i,j) = tmp1 * 2.0d+00 * njac(4,5,i,j)
+
+               lhsb(5,1,i,j) = tmp1 * 2.0d+00 * njac(5,1,i,j)
+               lhsb(5,2,i,j) = tmp1 * 2.0d+00 * njac(5,2,i,j)
+               lhsb(5,3,i,j) = tmp1 * 2.0d+00 * njac(5,3,i,j)
+               lhsb(5,4,i,j) = tmp1 * 2.0d+00 * njac(5,4,i,j)
+               lhsb(5,5,i,j) = 1.0d+00  &
+     &              + tmp1 * 2.0d+00 * njac(5,5,i,j)  &
+     &              + tmp1 * 2.0d+00 * dx5
+
+               lhsc(1,1,i,j,k,c) =  tmp2 * fjac(1,1,i+1,j)  &
+     &              - tmp1 * njac(1,1,i+1,j)  &
+     &              - tmp1 * dx1
+               lhsc(1,2,i,j,k,c) =  tmp2 * fjac(1,2,i+1,j)  &
+     &              - tmp1 * njac(1,2,i+1,j)
+               lhsc(1,3,i,j,k,c) =  tmp2 * fjac(1,3,i+1,j)  &
+     &              - tmp1 * njac(1,3,i+1,j)
+               lhsc(1,4,i,j,k,c) =  tmp2 * fjac(1,4,i+1,j)  &
+     &              - tmp1 * njac(1,4,i+1,j)
+               lhsc(1,5,i,j,k,c) =  tmp2 * fjac(1,5,i+1,j)  &
+     &              - tmp1 * njac(1,5,i+1,j)
+
+               lhsc(2,1,i,j,k,c) =  tmp2 * fjac(2,1,i+1,j)  &
+     &              - tmp1 * njac(2,1,i+1,j)
+               lhsc(2,2,i,j,k,c) =  tmp2 * fjac(2,2,i+1,j)  &
+     &              - tmp1 * njac(2,2,i+1,j)  &
+     &              - tmp1 * dx2
+               lhsc(2,3,i,j,k,c) =  tmp2 * fjac(2,3,i+1,j)  &
+     &              - tmp1 * njac(2,3,i+1,j)
+               lhsc(2,4,i,j,k,c) =  tmp2 * fjac(2,4,i+1,j)  &
+     &              - tmp1 * njac(2,4,i+1,j)
+               lhsc(2,5,i,j,k,c) =  tmp2 * fjac(2,5,i+1,j)  &
+     &              - tmp1 * njac(2,5,i+1,j)
+
+               lhsc(3,1,i,j,k,c) =  tmp2 * fjac(3,1,i+1,j)  &
+     &              - tmp1 * njac(3,1,i+1,j)
+               lhsc(3,2,i,j,k,c) =  tmp2 * fjac(3,2,i+1,j)  &
+     &              - tmp1 * njac(3,2,i+1,j)
+               lhsc(3,3,i,j,k,c) =  tmp2 * fjac(3,3,i+1,j)  &
+     &              - tmp1 * njac(3,3,i+1,j)  &
+     &              - tmp1 * dx3
+               lhsc(3,4,i,j,k,c) =  tmp2 * fjac(3,4,i+1,j)  &
+     &              - tmp1 * njac(3,4,i+1,j)
+               lhsc(3,5,i,j,k,c) =  tmp2 * fjac(3,5,i+1,j)  &
+     &              - tmp1 * njac(3,5,i+1,j)
+
+               lhsc(4,1,i,j,k,c) =  tmp2 * fjac(4,1,i+1,j)  &
+     &              - tmp1 * njac(4,1,i+1,j)
+               lhsc(4,2,i,j,k,c) =  tmp2 * fjac(4,2,i+1,j)  &
+     &              - tmp1 * njac(4,2,i+1,j)
+               lhsc(4,3,i,j,k,c) =  tmp2 * fjac(4,3,i+1,j)  &
+     &              - tmp1 * njac(4,3,i+1,j)
+               lhsc(4,4,i,j,k,c) =  tmp2 * fjac(4,4,i+1,j)  &
+     &              - tmp1 * njac(4,4,i+1,j)  &
+     &              - tmp1 * dx4
+               lhsc(4,5,i,j,k,c) =  tmp2 * fjac(4,5,i+1,j)  &
+     &              - tmp1 * njac(4,5,i+1,j)
+
+               lhsc(5,1,i,j,k,c) =  tmp2 * fjac(5,1,i+1,j)  &
+     &              - tmp1 * njac(5,1,i+1,j)
+               lhsc(5,2,i,j,k,c) =  tmp2 * fjac(5,2,i+1,j)  &
+     &              - tmp1 * njac(5,2,i+1,j)
+               lhsc(5,3,i,j,k,c) =  tmp2 * fjac(5,3,i+1,j)  &
+     &              - tmp1 * njac(5,3,i+1,j)
+               lhsc(5,4,i,j,k,c) =  tmp2 * fjac(5,4,i+1,j)  &
+     &              - tmp1 * njac(5,4,i+1,j)
+               lhsc(5,5,i,j,k,c) =  tmp2 * fjac(5,5,i+1,j)  &
+     &              - tmp1 * njac(5,5,i+1,j)  &
+     &              - tmp1 * dx5
+
+            enddo
+         enddo
+
+
+!---------------------------------------------------------------------
+!     outermost do loops - sweeping in i direction
+!---------------------------------------------------------------------
+         if (first .eq. 1) then 
+
+!---------------------------------------------------------------------
+!     multiply c(istart,j,k) by b_inverse and copy back to c
+!     multiply rhs(istart) by b_inverse(istart) and copy to rhs
+!---------------------------------------------------------------------
+!dir$ ivdep
+            do j=start(2,c),jsize
+               call binvcrhs( lhsb(1,1,istart,j),  &
+     &                        lhsc(1,1,istart,j,k,c),  &
+     &                        rhs(1,istart,j,k,c) )
+            enddo
+
+         endif
+
+!---------------------------------------------------------------------
+!     begin innermost do loop
+!     do all the elements of the cell unless last 
+!---------------------------------------------------------------------
+         do i=istart+first,isize-last
+!dir$ ivdep
+            do j=start(2,c),jsize
+
+!---------------------------------------------------------------------
+!     rhs(i) = rhs(i) - A*rhs(i-1)
+!---------------------------------------------------------------------
+               call matvec_sub(lhsa(1,1,i,j),  &
+     &                         rhs(1,i-1,j,k,c),rhs(1,i,j,k,c))
+
+!---------------------------------------------------------------------
+!     B(i) = B(i) - C(i-1)*A(i)
+!---------------------------------------------------------------------
+               call matmul_sub(lhsa(1,1,i,j),  &
+     &                         lhsc(1,1,i-1,j,k,c),  &
+     &                         lhsb(1,1,i,j))
+
+
+!---------------------------------------------------------------------
+!     multiply c(i,j,k) by b_inverse and copy back to c
+!     multiply rhs(1,j,k) by b_inverse(1,j,k) and copy to rhs
+!---------------------------------------------------------------------
+               call binvcrhs( lhsb(1,1,i,j),  &
+     &                        lhsc(1,1,i,j,k,c),  &
+     &                        rhs(1,i,j,k,c) )
+
+            enddo
+         enddo
+
+!---------------------------------------------------------------------
+!     Now finish up special cases for last cell
+!---------------------------------------------------------------------
+         if (last .eq. 1) then
+
+!dir$ ivdep
+            do j=start(2,c),jsize
+!---------------------------------------------------------------------
+!     rhs(isize) = rhs(isize) - A*rhs(isize-1)
+!---------------------------------------------------------------------
+               call matvec_sub(lhsa(1,1,isize,j),  &
+     &                         rhs(1,isize-1,j,k,c),rhs(1,isize,j,k,c))
+
+!---------------------------------------------------------------------
+!     B(isize) = B(isize) - C(isize-1)*A(isize)
+!---------------------------------------------------------------------
+               call matmul_sub(lhsa(1,1,isize,j),  &
+     &                         lhsc(1,1,isize-1,j,k,c),  &
+     &                         lhsb(1,1,isize,j))
+
+!---------------------------------------------------------------------
+!     multiply rhs() by b_inverse() and copy to rhs
+!---------------------------------------------------------------------
+               call binvrhs( lhsb(1,1,isize,j),  &
+     &                       rhs(1,isize,j,k,c) )
+            enddo
+
+         endif
+      enddo
+
+
+      return
+      end
+      
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/y_solve.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/y_solve.f90
new file mode 100644
index 000000000..1398094a8
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/y_solve.f90
@@ -0,0 +1,797 @@
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine y_solve
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     Performs line solves in Y direction by first factoring
+!     the block-tridiagonal matrix into an upper triangular matrix, 
+!     and then performing back substitution to solve for the unknown
+!     vectors of each line.  
+!     
+!     Make sure we treat elements zero to cell_size in the direction
+!     of the sweep.
+!---------------------------------------------------------------------
+
+      use bt_data
+      use mpinpb
+
+      implicit none
+
+      integer  &
+     &     c, jstart, stage,  &
+     &     first, last, recv_id, error, r_status(MPI_STATUS_SIZE),  &
+     &     isize,jsize,ksize,send_id
+
+      jstart = 0
+
+      if (timeron) call timer_start(t_ysolve)
+!---------------------------------------------------------------------
+!     in our terminology stage is the number of the cell in the y-direction
+!     i.e. stage = 1 means the start of the line stage=ncells means end
+!---------------------------------------------------------------------
+      do stage = 1,ncells
+         c = slice(2,stage)
+         isize = cell_size(1,c) - 1
+         jsize = cell_size(2,c) - 1
+         ksize = cell_size(3,c) - 1
+
+!---------------------------------------------------------------------
+!     set last-cell flag
+!---------------------------------------------------------------------
+         if (stage .eq. ncells) then
+            last = 1
+         else
+            last = 0
+         endif
+
+         if (stage .eq. 1) then
+!---------------------------------------------------------------------
+!     This is the first cell, so solve without receiving data
+!---------------------------------------------------------------------
+            first = 1
+!            call lhsy(c)
+            call y_solve_cell(first,last,c)
+         else
+!---------------------------------------------------------------------
+!     Not the first cell of this line, so receive info from
+!     processor working on preceding cell
+!---------------------------------------------------------------------
+            first = 0
+            if (timeron) call timer_start(t_ycomm)
+            call y_receive_solve_info(recv_id,c)
+!---------------------------------------------------------------------
+!     overlap computations and communications
+!---------------------------------------------------------------------
+!            call lhsy(c)
+!---------------------------------------------------------------------
+!     wait for completion
+!---------------------------------------------------------------------
+            call mpi_wait(send_id,r_status,error)
+            call mpi_wait(recv_id,r_status,error)
+            if (timeron) call timer_stop(t_ycomm)
+!---------------------------------------------------------------------
+!     install C'(jstart+1) and rhs'(jstart+1) to be used in this cell
+!---------------------------------------------------------------------
+            call y_unpack_solve_info(c)
+            call y_solve_cell(first,last,c)
+         endif
+
+         if (last .eq. 0) call y_send_solve_info(send_id,c)
+      enddo
+
+!---------------------------------------------------------------------
+!     now perform backsubstitution in reverse direction
+!---------------------------------------------------------------------
+      do stage = ncells, 1, -1
+         c = slice(2,stage)
+         first = 0
+         last = 0
+         if (stage .eq. 1) first = 1
+         if (stage .eq. ncells) then
+            last = 1
+!---------------------------------------------------------------------
+!     last cell, so perform back substitute without waiting
+!---------------------------------------------------------------------
+            call y_backsubstitute(first, last,c)
+         else
+            if (timeron) call timer_start(t_ycomm)
+            call y_receive_backsub_info(recv_id,c)
+            call mpi_wait(send_id,r_status,error)
+            call mpi_wait(recv_id,r_status,error)
+            if (timeron) call timer_stop(t_ycomm)
+            call y_unpack_backsub_info(c)
+            call y_backsubstitute(first,last,c)
+         endif
+         if (first .eq. 0) call y_send_backsub_info(send_id,c)
+      enddo
+
+      if (timeron) call timer_stop(t_ysolve)
+
+      return
+      end
+      
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+      
+      subroutine y_unpack_solve_info(c)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     unpack C'(-1) and rhs'(-1) for
+!     all i and k
+!---------------------------------------------------------------------
+
+      use bt_data
+      implicit none
+
+      integer i,k,m,n,ptr,c,jstart 
+
+      jstart = 0
+      ptr = 0
+      do k=0,KMAX-1
+         do i=0,IMAX-1
+            do m=1,BLOCK_SIZE
+               do n=1,BLOCK_SIZE
+                  lhsc(m,n,i,jstart-1,k,c) = out_buffer(ptr+n)
+               enddo
+               ptr = ptr+BLOCK_SIZE
+            enddo
+            do n=1,BLOCK_SIZE
+               rhs(n,i,jstart-1,k,c) = out_buffer(ptr+n)
+            enddo
+            ptr = ptr+BLOCK_SIZE
+         enddo
+      enddo
+
+      return
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+      
+      subroutine y_send_solve_info(send_id,c)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     pack up and send C'(jend) and rhs'(jend) for
+!     all i and k
+!---------------------------------------------------------------------
+
+      use bt_data
+      use mpinpb
+
+      implicit none
+
+      integer i,k,m,n,jsize,ptr,c,ip,kp
+      integer error,send_id,buffer_size 
+
+      jsize = cell_size(2,c)-1
+      ip = cell_coord(1,c) - 1
+      kp = cell_coord(3,c) - 1
+      buffer_size=MAX_CELL_DIM*MAX_CELL_DIM*  &
+     &     (BLOCK_SIZE*BLOCK_SIZE + BLOCK_SIZE)
+
+!---------------------------------------------------------------------
+!     pack up buffer
+!---------------------------------------------------------------------
+      ptr = 0
+      do k=0,KMAX-1
+         do i=0,IMAX-1
+            do m=1,BLOCK_SIZE
+               do n=1,BLOCK_SIZE
+                  in_buffer(ptr+n) = lhsc(m,n,i,jsize,k,c)
+               enddo
+               ptr = ptr+BLOCK_SIZE
+            enddo
+            do n=1,BLOCK_SIZE
+               in_buffer(ptr+n) = rhs(n,i,jsize,k,c)
+            enddo
+            ptr = ptr+BLOCK_SIZE
+         enddo
+      enddo
+
+!---------------------------------------------------------------------
+!     send buffer 
+!---------------------------------------------------------------------
+      if (timeron) call timer_start(t_ycomm)
+      call mpi_isend(in_buffer, buffer_size,  &
+     &     dp_type, successor(2),  &
+     &     SOUTH+ip+kp*NCELLS, comm_solve,  &
+     &     send_id,error)
+      if (timeron) call timer_stop(t_ycomm)
+
+      return
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine y_send_backsub_info(send_id,c)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     pack up and send U(jstart) for all i and k
+!---------------------------------------------------------------------
+
+      use bt_data
+      use mpinpb
+
+      implicit none
+
+      integer i,k,n,ptr,c,jstart,ip,kp
+      integer error,send_id,buffer_size
+
+!---------------------------------------------------------------------
+!     Send element 0 to previous processor
+!---------------------------------------------------------------------
+      jstart = 0
+      ip = cell_coord(1,c)-1
+      kp = cell_coord(3,c)-1
+      buffer_size=MAX_CELL_DIM*MAX_CELL_DIM*BLOCK_SIZE
+      ptr = 0
+      do k=0,KMAX-1
+         do i=0,IMAX-1
+            do n=1,BLOCK_SIZE
+               in_buffer(ptr+n) = rhs(n,i,jstart,k,c)
+            enddo
+            ptr = ptr+BLOCK_SIZE
+         enddo
+      enddo
+      if (timeron) call timer_start(t_ycomm)
+      call mpi_isend(in_buffer, buffer_size,  &
+     &     dp_type, predecessor(2),  &
+     &     NORTH+ip+kp*NCELLS, comm_solve,  &
+     &     send_id,error)
+      if (timeron) call timer_stop(t_ycomm)
+
+      return
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine y_unpack_backsub_info(c)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     unpack U(jsize) for all i and k
+!---------------------------------------------------------------------
+
+      use bt_data
+      implicit none
+
+      integer i,k,n,ptr,c 
+
+      ptr = 0
+      do k=0,KMAX-1
+         do i=0,IMAX-1
+            do n=1,BLOCK_SIZE
+               backsub_info(n,i,k,c) = out_buffer(ptr+n)
+            enddo
+            ptr = ptr+BLOCK_SIZE
+         enddo
+      enddo
+
+      return
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine y_receive_backsub_info(recv_id,c)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     post mpi receives
+!---------------------------------------------------------------------
+
+      use bt_data
+      use mpinpb
+
+      implicit none
+
+      integer error,recv_id,ip,kp,c,buffer_size
+
+      ip = cell_coord(1,c) - 1
+      kp = cell_coord(3,c) - 1
+      buffer_size=MAX_CELL_DIM*MAX_CELL_DIM*BLOCK_SIZE
+      call mpi_irecv(out_buffer, buffer_size,  &
+     &     dp_type, successor(2),  &
+     &     NORTH+ip+kp*NCELLS, comm_solve,  &
+     &     recv_id, error)
+      return
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine y_receive_solve_info(recv_id,c)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     post mpi receives 
+!---------------------------------------------------------------------
+
+      use bt_data
+      use mpinpb
+
+      implicit none
+
+      integer ip,kp,recv_id,error,c,buffer_size
+
+      ip = cell_coord(1,c) - 1
+      kp = cell_coord(3,c) - 1
+      buffer_size=MAX_CELL_DIM*MAX_CELL_DIM*  &
+     &     (BLOCK_SIZE*BLOCK_SIZE + BLOCK_SIZE)
+      call mpi_irecv(out_buffer, buffer_size,  &
+     &     dp_type, predecessor(2),  &
+     &     SOUTH+ip+kp*NCELLS,  comm_solve,  &
+     &     recv_id, error)
+
+      return
+      end
+      
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine y_backsubstitute(first, last, c)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     back solve: if last cell, then generate U(jsize)=rhs(jsize)
+!     else assume U(jsize) is loaded in y_unpack_backsub_info
+!     so just use it
+!     after call u(jstart) will be sent to next cell
+!---------------------------------------------------------------------
+
+      use bt_data
+      implicit none
+
+      integer first, last, c, i, k
+      integer m,n,j,jsize,isize,ksize,jstart
+      
+      jstart = 0
+      isize = cell_size(1,c)-end(1,c)-1      
+      jsize = cell_size(2,c)-1
+      ksize = cell_size(3,c)-end(3,c)-1
+      if (last .eq. 0) then
+         do k=start(3,c),ksize
+            do i=start(1,c),isize
+!---------------------------------------------------------------------
+!     U(jsize) uses info from previous cell if not last cell
+!---------------------------------------------------------------------
+               do m=1,BLOCK_SIZE
+                  do n=1,BLOCK_SIZE
+                     rhs(m,i,jsize,k,c) = rhs(m,i,jsize,k,c)  &
+     &                    - lhsc(m,n,i,jsize,k,c)*  &
+     &                    backsub_info(n,i,k,c)
+                  enddo
+               enddo
+            enddo
+         enddo
+      endif
+      do k=start(3,c),ksize
+         do j=jsize-1,jstart,-1
+            do i=start(1,c),isize
+               do m=1,BLOCK_SIZE
+                  do n=1,BLOCK_SIZE
+                     rhs(m,i,j,k,c) = rhs(m,i,j,k,c)  &
+     &                    - lhsc(m,n,i,j,k,c)*rhs(n,i,j+1,k,c)
+                  enddo
+               enddo
+            enddo
+         enddo
+      enddo
+
+      return
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine y_solve_cell(first,last,c)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     performs Gaussian elimination on this cell.
+!     
+!     assumes that unpacking routines for non-first cells 
+!     preload C' and rhs' from previous cell.
+!     
+!     assumed send happens outside this routine, but that
+!     c'(JMAX) and rhs'(JMAX) will be sent to next cell
+!---------------------------------------------------------------------
+
+      use bt_data
+      implicit none
+
+      double precision tmp1, tmp2, tmp3
+      integer first,last,c
+      integer i,j,k,isize,ksize,jsize,jstart
+      double precision utmp(6,-2:JMAX+1)
+
+      jstart = 0
+      isize = cell_size(1,c)-end(1,c)-1
+      jsize = cell_size(2,c)-1
+      ksize = cell_size(3,c)-end(3,c)-1
+
+      call lhsabinit(lhsa, lhsb, jsize)
+
+      do k=start(3,c),ksize 
+         do i=start(1,c),isize
+
+!---------------------------------------------------------------------
+!     This function computes the left hand side for the three y-factors   
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     Compute the indices for storing the tri-diagonal matrix;
+!     determine a (labeled f) and n jacobians for cell c
+!---------------------------------------------------------------------
+            do j = start(2,c)-1, cell_size(2,c)-end(2,c)
+               utmp(1,j) = 1.0d0 / u(1,i,j,k,c)
+               utmp(2,j) = u(2,i,j,k,c)
+               utmp(3,j) = u(3,i,j,k,c)
+               utmp(4,j) = u(4,i,j,k,c)
+               utmp(5,j) = u(5,i,j,k,c)
+               utmp(6,j) = qs(i,j,k,c)
+            end do
+
+            do j = start(2,c)-1, cell_size(2,c)-end(2,c)
+
+               tmp1 = utmp(1,j)
+               tmp2 = tmp1 * tmp1
+               tmp3 = tmp1 * tmp2
+
+               fjac(1,1,j) = 0.0d+00
+               fjac(1,2,j) = 0.0d+00
+               fjac(1,3,j) = 1.0d+00
+               fjac(1,4,j) = 0.0d+00
+               fjac(1,5,j) = 0.0d+00
+
+               fjac(2,1,j) = - ( utmp(2,j)*utmp(3,j) )  &
+     &              * tmp2
+               fjac(2,2,j) = utmp(3,j) * tmp1
+               fjac(2,3,j) = utmp(2,j) * tmp1
+               fjac(2,4,j) = 0.0d+00
+               fjac(2,5,j) = 0.0d+00
+
+               fjac(3,1,j) = - ( utmp(3,j)*utmp(3,j)*tmp2)  &
+     &              + c2 * utmp(6,j)
+               fjac(3,2,j) = - c2 *  utmp(2,j) * tmp1
+               fjac(3,3,j) = ( 2.0d+00 - c2 )  &
+     &              *  utmp(3,j) * tmp1 
+               fjac(3,4,j) = - c2 * utmp(4,j) * tmp1 
+               fjac(3,5,j) = c2
+
+               fjac(4,1,j) = - ( utmp(3,j)*utmp(4,j) )  &
+     &              * tmp2
+               fjac(4,2,j) = 0.0d+00
+               fjac(4,3,j) = utmp(4,j) * tmp1
+               fjac(4,4,j) = utmp(3,j) * tmp1
+               fjac(4,5,j) = 0.0d+00
+
+               fjac(5,1,j) = ( c2 * 2.0d0 * utmp(6,j)  &
+     &              - c1 * utmp(5,j) * tmp1 )  &
+     &              * utmp(3,j) * tmp1 
+               fjac(5,2,j) = - c2 * utmp(2,j)*utmp(3,j)  &
+     &              * tmp2
+               fjac(5,3,j) = c1 * utmp(5,j) * tmp1  &
+     &              - c2 * ( utmp(6,j)  &
+     &              + utmp(3,j)*utmp(3,j) * tmp2 )
+               fjac(5,4,j) = - c2 * ( utmp(3,j)*utmp(4,j) )  &
+     &              * tmp2
+               fjac(5,5,j) = c1 * utmp(3,j) * tmp1 
+
+               njac(1,1,j) = 0.0d+00
+               njac(1,2,j) = 0.0d+00
+               njac(1,3,j) = 0.0d+00
+               njac(1,4,j) = 0.0d+00
+               njac(1,5,j) = 0.0d+00
+
+               njac(2,1,j) = - c3c4 * tmp2 * utmp(2,j)
+               njac(2,2,j) =   c3c4 * tmp1
+               njac(2,3,j) =   0.0d+00
+               njac(2,4,j) =   0.0d+00
+               njac(2,5,j) =   0.0d+00
+
+               njac(3,1,j) = - con43 * c3c4 * tmp2 * utmp(3,j)
+               njac(3,2,j) =   0.0d+00
+               njac(3,3,j) =   con43 * c3c4 * tmp1
+               njac(3,4,j) =   0.0d+00
+               njac(3,5,j) =   0.0d+00
+
+               njac(4,1,j) = - c3c4 * tmp2 * utmp(4,j)
+               njac(4,2,j) =   0.0d+00
+               njac(4,3,j) =   0.0d+00
+               njac(4,4,j) =   c3c4 * tmp1
+               njac(4,5,j) =   0.0d+00
+
+               njac(5,1,j) = - (  c3c4  &
+     &              - c1345 ) * tmp3 * (utmp(2,j)**2)  &
+     &              - ( con43 * c3c4  &
+     &              - c1345 ) * tmp3 * (utmp(3,j)**2)  &
+     &              - ( c3c4 - c1345 ) * tmp3 * (utmp(4,j)**2)  &
+     &              - c1345 * tmp2 * utmp(5,j)
+
+               njac(5,2,j) = (  c3c4 - c1345 ) * tmp2 * utmp(2,j)
+               njac(5,3,j) = ( con43 * c3c4  &
+     &              - c1345 ) * tmp2 * utmp(3,j)
+               njac(5,4,j) = ( c3c4 - c1345 ) * tmp2 * utmp(4,j)
+               njac(5,5,j) = ( c1345 ) * tmp1
+
+            enddo
+
+!---------------------------------------------------------------------
+!     now jacobians set, so form left hand side in y direction
+!---------------------------------------------------------------------
+            do j = start(2,c), jsize-end(2,c)
+
+               tmp1 = dt * ty1
+               tmp2 = dt * ty2
+
+               lhsa(1,1,j) = - tmp2 * fjac(1,1,j-1)  &
+     &              - tmp1 * njac(1,1,j-1)  &
+     &              - tmp1 * dy1 
+               lhsa(1,2,j) = - tmp2 * fjac(1,2,j-1)  &
+     &              - tmp1 * njac(1,2,j-1)
+               lhsa(1,3,j) = - tmp2 * fjac(1,3,j-1)  &
+     &              - tmp1 * njac(1,3,j-1)
+               lhsa(1,4,j) = - tmp2 * fjac(1,4,j-1)  &
+     &              - tmp1 * njac(1,4,j-1)
+               lhsa(1,5,j) = - tmp2 * fjac(1,5,j-1)  &
+     &              - tmp1 * njac(1,5,j-1)
+
+               lhsa(2,1,j) = - tmp2 * fjac(2,1,j-1)  &
+     &              - tmp1 * njac(2,1,j-1)
+               lhsa(2,2,j) = - tmp2 * fjac(2,2,j-1)  &
+     &              - tmp1 * njac(2,2,j-1)  &
+     &              - tmp1 * dy2
+               lhsa(2,3,j) = - tmp2 * fjac(2,3,j-1)  &
+     &              - tmp1 * njac(2,3,j-1)
+               lhsa(2,4,j) = - tmp2 * fjac(2,4,j-1)  &
+     &              - tmp1 * njac(2,4,j-1)
+               lhsa(2,5,j) = - tmp2 * fjac(2,5,j-1)  &
+     &              - tmp1 * njac(2,5,j-1)
+
+               lhsa(3,1,j) = - tmp2 * fjac(3,1,j-1)  &
+     &              - tmp1 * njac(3,1,j-1)
+               lhsa(3,2,j) = - tmp2 * fjac(3,2,j-1)  &
+     &              - tmp1 * njac(3,2,j-1)
+               lhsa(3,3,j) = - tmp2 * fjac(3,3,j-1)  &
+     &              - tmp1 * njac(3,3,j-1)  &
+     &              - tmp1 * dy3 
+               lhsa(3,4,j) = - tmp2 * fjac(3,4,j-1)  &
+     &              - tmp1 * njac(3,4,j-1)
+               lhsa(3,5,j) = - tmp2 * fjac(3,5,j-1)  &
+     &              - tmp1 * njac(3,5,j-1)
+
+               lhsa(4,1,j) = - tmp2 * fjac(4,1,j-1)  &
+     &              - tmp1 * njac(4,1,j-1)
+               lhsa(4,2,j) = - tmp2 * fjac(4,2,j-1)  &
+     &              - tmp1 * njac(4,2,j-1)
+               lhsa(4,3,j) = - tmp2 * fjac(4,3,j-1)  &
+     &              - tmp1 * njac(4,3,j-1)
+               lhsa(4,4,j) = - tmp2 * fjac(4,4,j-1)  &
+     &              - tmp1 * njac(4,4,j-1)  &
+     &              - tmp1 * dy4
+               lhsa(4,5,j) = - tmp2 * fjac(4,5,j-1)  &
+     &              - tmp1 * njac(4,5,j-1)
+
+               lhsa(5,1,j) = - tmp2 * fjac(5,1,j-1)  &
+     &              - tmp1 * njac(5,1,j-1)
+               lhsa(5,2,j) = - tmp2 * fjac(5,2,j-1)  &
+     &              - tmp1 * njac(5,2,j-1)
+               lhsa(5,3,j) = - tmp2 * fjac(5,3,j-1)  &
+     &              - tmp1 * njac(5,3,j-1)
+               lhsa(5,4,j) = - tmp2 * fjac(5,4,j-1)  &
+     &              - tmp1 * njac(5,4,j-1)
+               lhsa(5,5,j) = - tmp2 * fjac(5,5,j-1)  &
+     &              - tmp1 * njac(5,5,j-1)  &
+     &              - tmp1 * dy5
+
+               lhsb(1,1,j) = 1.0d+00  &
+     &              + tmp1 * 2.0d+00 * njac(1,1,j)  &
+     &              + tmp1 * 2.0d+00 * dy1
+               lhsb(1,2,j) = tmp1 * 2.0d+00 * njac(1,2,j)
+               lhsb(1,3,j) = tmp1 * 2.0d+00 * njac(1,3,j)
+               lhsb(1,4,j) = tmp1 * 2.0d+00 * njac(1,4,j)
+               lhsb(1,5,j) = tmp1 * 2.0d+00 * njac(1,5,j)
+
+               lhsb(2,1,j) = tmp1 * 2.0d+00 * njac(2,1,j)
+               lhsb(2,2,j) = 1.0d+00  &
+     &              + tmp1 * 2.0d+00 * njac(2,2,j)  &
+     &              + tmp1 * 2.0d+00 * dy2
+               lhsb(2,3,j) = tmp1 * 2.0d+00 * njac(2,3,j)
+               lhsb(2,4,j) = tmp1 * 2.0d+00 * njac(2,4,j)
+               lhsb(2,5,j) = tmp1 * 2.0d+00 * njac(2,5,j)
+
+               lhsb(3,1,j) = tmp1 * 2.0d+00 * njac(3,1,j)
+               lhsb(3,2,j) = tmp1 * 2.0d+00 * njac(3,2,j)
+               lhsb(3,3,j) = 1.0d+00  &
+     &              + tmp1 * 2.0d+00 * njac(3,3,j)  &
+     &              + tmp1 * 2.0d+00 * dy3
+               lhsb(3,4,j) = tmp1 * 2.0d+00 * njac(3,4,j)
+               lhsb(3,5,j) = tmp1 * 2.0d+00 * njac(3,5,j)
+
+               lhsb(4,1,j) = tmp1 * 2.0d+00 * njac(4,1,j)
+               lhsb(4,2,j) = tmp1 * 2.0d+00 * njac(4,2,j)
+               lhsb(4,3,j) = tmp1 * 2.0d+00 * njac(4,3,j)
+               lhsb(4,4,j) = 1.0d+00  &
+     &              + tmp1 * 2.0d+00 * njac(4,4,j)  &
+     &              + tmp1 * 2.0d+00 * dy4
+               lhsb(4,5,j) = tmp1 * 2.0d+00 * njac(4,5,j)
+
+               lhsb(5,1,j) = tmp1 * 2.0d+00 * njac(5,1,j)
+               lhsb(5,2,j) = tmp1 * 2.0d+00 * njac(5,2,j)
+               lhsb(5,3,j) = tmp1 * 2.0d+00 * njac(5,3,j)
+               lhsb(5,4,j) = tmp1 * 2.0d+00 * njac(5,4,j)
+               lhsb(5,5,j) = 1.0d+00  &
+     &              + tmp1 * 2.0d+00 * njac(5,5,j)  &
+     &              + tmp1 * 2.0d+00 * dy5
+
+               lhsc(1,1,i,j,k,c) =  tmp2 * fjac(1,1,j+1)  &
+     &              - tmp1 * njac(1,1,j+1)  &
+     &              - tmp1 * dy1
+               lhsc(1,2,i,j,k,c) =  tmp2 * fjac(1,2,j+1)  &
+     &              - tmp1 * njac(1,2,j+1)
+               lhsc(1,3,i,j,k,c) =  tmp2 * fjac(1,3,j+1)  &
+     &              - tmp1 * njac(1,3,j+1)
+               lhsc(1,4,i,j,k,c) =  tmp2 * fjac(1,4,j+1)  &
+     &              - tmp1 * njac(1,4,j+1)
+               lhsc(1,5,i,j,k,c) =  tmp2 * fjac(1,5,j+1)  &
+     &              - tmp1 * njac(1,5,j+1)
+
+               lhsc(2,1,i,j,k,c) =  tmp2 * fjac(2,1,j+1)  &
+     &              - tmp1 * njac(2,1,j+1)
+               lhsc(2,2,i,j,k,c) =  tmp2 * fjac(2,2,j+1)  &
+     &              - tmp1 * njac(2,2,j+1)  &
+     &              - tmp1 * dy2
+               lhsc(2,3,i,j,k,c) =  tmp2 * fjac(2,3,j+1)  &
+     &              - tmp1 * njac(2,3,j+1)
+               lhsc(2,4,i,j,k,c) =  tmp2 * fjac(2,4,j+1)  &
+     &              - tmp1 * njac(2,4,j+1)
+               lhsc(2,5,i,j,k,c) =  tmp2 * fjac(2,5,j+1)  &
+     &              - tmp1 * njac(2,5,j+1)
+
+               lhsc(3,1,i,j,k,c) =  tmp2 * fjac(3,1,j+1)  &
+     &              - tmp1 * njac(3,1,j+1)
+               lhsc(3,2,i,j,k,c) =  tmp2 * fjac(3,2,j+1)  &
+     &              - tmp1 * njac(3,2,j+1)
+               lhsc(3,3,i,j,k,c) =  tmp2 * fjac(3,3,j+1)  &
+     &              - tmp1 * njac(3,3,j+1)  &
+     &              - tmp1 * dy3
+               lhsc(3,4,i,j,k,c) =  tmp2 * fjac(3,4,j+1)  &
+     &              - tmp1 * njac(3,4,j+1)
+               lhsc(3,5,i,j,k,c) =  tmp2 * fjac(3,5,j+1)  &
+     &              - tmp1 * njac(3,5,j+1)
+
+               lhsc(4,1,i,j,k,c) =  tmp2 * fjac(4,1,j+1)  &
+     &              - tmp1 * njac(4,1,j+1)
+               lhsc(4,2,i,j,k,c) =  tmp2 * fjac(4,2,j+1)  &
+     &              - tmp1 * njac(4,2,j+1)
+               lhsc(4,3,i,j,k,c) =  tmp2 * fjac(4,3,j+1)  &
+     &              - tmp1 * njac(4,3,j+1)
+               lhsc(4,4,i,j,k,c) =  tmp2 * fjac(4,4,j+1)  &
+     &              - tmp1 * njac(4,4,j+1)  &
+     &              - tmp1 * dy4
+               lhsc(4,5,i,j,k,c) =  tmp2 * fjac(4,5,j+1)  &
+     &              - tmp1 * njac(4,5,j+1)
+
+               lhsc(5,1,i,j,k,c) =  tmp2 * fjac(5,1,j+1)  &
+     &              - tmp1 * njac(5,1,j+1)
+               lhsc(5,2,i,j,k,c) =  tmp2 * fjac(5,2,j+1)  &
+     &              - tmp1 * njac(5,2,j+1)
+               lhsc(5,3,i,j,k,c) =  tmp2 * fjac(5,3,j+1)  &
+     &              - tmp1 * njac(5,3,j+1)
+               lhsc(5,4,i,j,k,c) =  tmp2 * fjac(5,4,j+1)  &
+     &              - tmp1 * njac(5,4,j+1)
+               lhsc(5,5,i,j,k,c) =  tmp2 * fjac(5,5,j+1)  &
+     &              - tmp1 * njac(5,5,j+1)  &
+     &              - tmp1 * dy5
+
+            enddo
+
+
+!---------------------------------------------------------------------
+!     outermost do loops - sweeping in i direction
+!---------------------------------------------------------------------
+            if (first .eq. 1) then 
+
+!---------------------------------------------------------------------
+!     multiply c(i,jstart,k) by b_inverse and copy back to c
+!     multiply rhs(jstart) by b_inverse(jstart) and copy to rhs
+!---------------------------------------------------------------------
+               call binvcrhs( lhsb(1,1,jstart),  &
+     &                        lhsc(1,1,i,jstart,k,c),  &
+     &                        rhs(1,i,jstart,k,c) )
+
+            endif
+
+!---------------------------------------------------------------------
+!     begin innermost do loop
+!     do all the elements of the cell unless last 
+!---------------------------------------------------------------------
+            do j=jstart+first,jsize-last
+
+!---------------------------------------------------------------------
+!     subtract A*lhs_vector(j-1) from lhs_vector(j)
+!     
+!     rhs(j) = rhs(j) - A*rhs(j-1)
+!---------------------------------------------------------------------
+               call matvec_sub(lhsa(1,1,j),  &
+     &                         rhs(1,i,j-1,k,c),rhs(1,i,j,k,c))
+
+!---------------------------------------------------------------------
+!     B(j) = B(j) - C(j-1)*A(j)
+!---------------------------------------------------------------------
+               call matmul_sub(lhsa(1,1,j),  &
+     &                         lhsc(1,1,i,j-1,k,c),  &
+     &                         lhsb(1,1,j))
+
+!---------------------------------------------------------------------
+!     multiply c(i,j,k) by b_inverse and copy back to c
+!     multiply rhs(i,1,k) by b_inverse(i,1,k) and copy to rhs
+!---------------------------------------------------------------------
+               call binvcrhs( lhsb(1,1,j),  &
+     &                        lhsc(1,1,i,j,k,c),  &
+     &                        rhs(1,i,j,k,c) )
+
+            enddo
+
+!---------------------------------------------------------------------
+!     Now finish up special cases for last cell
+!---------------------------------------------------------------------
+            if (last .eq. 1) then
+
+!---------------------------------------------------------------------
+!     rhs(jsize) = rhs(jsize) - A*rhs(jsize-1)
+!---------------------------------------------------------------------
+               call matvec_sub(lhsa(1,1,jsize),  &
+     &                         rhs(1,i,jsize-1,k,c),rhs(1,i,jsize,k,c))
+
+!---------------------------------------------------------------------
+!     B(jsize) = B(jsize) - C(jsize-1)*A(jsize)
+!     call matmul_sub(aa,i,jsize,k,c,
+!     $              cc,i,jsize-1,k,c,bb,i,jsize,k,c)
+!---------------------------------------------------------------------
+               call matmul_sub(lhsa(1,1,jsize),  &
+     &                         lhsc(1,1,i,jsize-1,k,c),  &
+     &                         lhsb(1,1,jsize))
+
+!---------------------------------------------------------------------
+!     multiply rhs(jsize) by b_inverse(jsize) and copy to rhs
+!---------------------------------------------------------------------
+               call binvrhs( lhsb(1,1,jsize),  &
+     &                       rhs(1,i,jsize,k,c) )
+
+            endif
+         enddo
+      enddo
+
+
+      return
+      end
+      
+
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/y_solve_vec.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/y_solve_vec.f90
new file mode 100644
index 000000000..201064511
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/y_solve_vec.f90
@@ -0,0 +1,812 @@
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine y_solve
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     Performs line solves in Y direction by first factoring
+!     the block-tridiagonal matrix into an upper triangular matrix, 
+!     and then performing back substitution to solve for the unknown
+!     vectors of each line.  
+!     
+!     Make sure we treat elements zero to cell_size in the direction
+!     of the sweep.
+!---------------------------------------------------------------------
+
+      use bt_data
+      use mpinpb
+
+      implicit none
+
+      integer  &
+     &     c, jstart, stage,  &
+     &     first, last, recv_id, error, r_status(MPI_STATUS_SIZE),  &
+     &     isize,jsize,ksize,send_id
+
+      jstart = 0
+
+      if (timeron) call timer_start(t_ysolve)
+!---------------------------------------------------------------------
+!     in our terminology stage is the number of the cell in the y-direction
+!     i.e. stage = 1 means the start of the line stage=ncells means end
+!---------------------------------------------------------------------
+      do stage = 1,ncells
+         c = slice(2,stage)
+         isize = cell_size(1,c) - 1
+         jsize = cell_size(2,c) - 1
+         ksize = cell_size(3,c) - 1
+
+!---------------------------------------------------------------------
+!     set last-cell flag
+!---------------------------------------------------------------------
+         if (stage .eq. ncells) then
+            last = 1
+         else
+            last = 0
+         endif
+
+         if (stage .eq. 1) then
+!---------------------------------------------------------------------
+!     This is the first cell, so solve without receiving data
+!---------------------------------------------------------------------
+            first = 1
+!            call lhsy(c)
+            call y_solve_cell(first,last,c)
+         else
+!---------------------------------------------------------------------
+!     Not the first cell of this line, so receive info from
+!     processor working on preceding cell
+!---------------------------------------------------------------------
+            first = 0
+            if (timeron) call timer_start(t_ycomm)
+            call y_receive_solve_info(recv_id,c)
+!---------------------------------------------------------------------
+!     overlap computations and communications
+!---------------------------------------------------------------------
+!            call lhsy(c)
+!---------------------------------------------------------------------
+!     wait for completion
+!---------------------------------------------------------------------
+            call mpi_wait(send_id,r_status,error)
+            call mpi_wait(recv_id,r_status,error)
+            if (timeron) call timer_stop(t_ycomm)
+!---------------------------------------------------------------------
+!     install C'(jstart+1) and rhs'(jstart+1) to be used in this cell
+!---------------------------------------------------------------------
+            call y_unpack_solve_info(c)
+            call y_solve_cell(first,last,c)
+         endif
+
+         if (last .eq. 0) call y_send_solve_info(send_id,c)
+      enddo
+
+!---------------------------------------------------------------------
+!     now perform backsubstitution in reverse direction
+!---------------------------------------------------------------------
+      do stage = ncells, 1, -1
+         c = slice(2,stage)
+         first = 0
+         last = 0
+         if (stage .eq. 1) first = 1
+         if (stage .eq. ncells) then
+            last = 1
+!---------------------------------------------------------------------
+!     last cell, so perform back substitute without waiting
+!---------------------------------------------------------------------
+            call y_backsubstitute(first, last,c)
+         else
+            if (timeron) call timer_start(t_ycomm)
+            call y_receive_backsub_info(recv_id,c)
+            call mpi_wait(send_id,r_status,error)
+            call mpi_wait(recv_id,r_status,error)
+            if (timeron) call timer_stop(t_ycomm)
+            call y_unpack_backsub_info(c)
+            call y_backsubstitute(first,last,c)
+         endif
+         if (first .eq. 0) call y_send_backsub_info(send_id,c)
+      enddo
+
+      if (timeron) call timer_stop(t_ysolve)
+
+      return
+      end
+      
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+      
+      subroutine y_unpack_solve_info(c)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     unpack C'(-1) and rhs'(-1) for
+!     all i and k
+!---------------------------------------------------------------------
+
+      use bt_data
+      implicit none
+
+      integer i,k,m,n,ptr,c,jstart 
+
+      jstart = 0
+      ptr = 0
+      do k=0,KMAX-1
+         do i=0,IMAX-1
+            do m=1,BLOCK_SIZE
+               do n=1,BLOCK_SIZE
+                  lhsc(m,n,i,jstart-1,k,c) = out_buffer(ptr+n)
+               enddo
+               ptr = ptr+BLOCK_SIZE
+            enddo
+            do n=1,BLOCK_SIZE
+               rhs(n,i,jstart-1,k,c) = out_buffer(ptr+n)
+            enddo
+            ptr = ptr+BLOCK_SIZE
+         enddo
+      enddo
+
+      return
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+      
+      subroutine y_send_solve_info(send_id,c)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     pack up and send C'(jend) and rhs'(jend) for
+!     all i and k
+!---------------------------------------------------------------------
+
+      use bt_data
+      use mpinpb
+
+      implicit none
+
+      integer i,k,m,n,jsize,ptr,c,ip,kp
+      integer error,send_id,buffer_size 
+
+      jsize = cell_size(2,c)-1
+      ip = cell_coord(1,c) - 1
+      kp = cell_coord(3,c) - 1
+      buffer_size=MAX_CELL_DIM*MAX_CELL_DIM*  &
+     &     (BLOCK_SIZE*BLOCK_SIZE + BLOCK_SIZE)
+
+!---------------------------------------------------------------------
+!     pack up buffer
+!---------------------------------------------------------------------
+      ptr = 0
+      do k=0,KMAX-1
+         do i=0,IMAX-1
+            do m=1,BLOCK_SIZE
+               do n=1,BLOCK_SIZE
+                  in_buffer(ptr+n) = lhsc(m,n,i,jsize,k,c)
+               enddo
+               ptr = ptr+BLOCK_SIZE
+            enddo
+            do n=1,BLOCK_SIZE
+               in_buffer(ptr+n) = rhs(n,i,jsize,k,c)
+            enddo
+            ptr = ptr+BLOCK_SIZE
+         enddo
+      enddo
+
+!---------------------------------------------------------------------
+!     send buffer 
+!---------------------------------------------------------------------
+      if (timeron) call timer_start(t_ycomm)
+      call mpi_isend(in_buffer, buffer_size,  &
+     &     dp_type, successor(2),  &
+     &     SOUTH+ip+kp*NCELLS, comm_solve,  &
+     &     send_id,error)
+      if (timeron) call timer_stop(t_ycomm)
+
+      return
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine y_send_backsub_info(send_id,c)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     pack up and send U(jstart) for all i and k
+!---------------------------------------------------------------------
+
+      use bt_data
+      use mpinpb
+
+      implicit none
+
+      integer i,k,n,ptr,c,jstart,ip,kp
+      integer error,send_id,buffer_size
+
+!---------------------------------------------------------------------
+!     Send element 0 to previous processor
+!---------------------------------------------------------------------
+      jstart = 0
+      ip = cell_coord(1,c)-1
+      kp = cell_coord(3,c)-1
+      buffer_size=MAX_CELL_DIM*MAX_CELL_DIM*BLOCK_SIZE
+      ptr = 0
+      do k=0,KMAX-1
+         do i=0,IMAX-1
+            do n=1,BLOCK_SIZE
+               in_buffer(ptr+n) = rhs(n,i,jstart,k,c)
+            enddo
+            ptr = ptr+BLOCK_SIZE
+         enddo
+      enddo
+      if (timeron) call timer_start(t_ycomm)
+      call mpi_isend(in_buffer, buffer_size,  &
+     &     dp_type, predecessor(2),  &
+     &     NORTH+ip+kp*NCELLS, comm_solve,  &
+     &     send_id,error)
+      if (timeron) call timer_stop(t_ycomm)
+
+      return
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine y_unpack_backsub_info(c)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     unpack U(jsize) for all i and k
+!---------------------------------------------------------------------
+
+      use bt_data
+      implicit none
+
+      integer i,k,n,ptr,c 
+
+      ptr = 0
+      do k=0,KMAX-1
+         do i=0,IMAX-1
+            do n=1,BLOCK_SIZE
+               backsub_info(n,i,k,c) = out_buffer(ptr+n)
+            enddo
+            ptr = ptr+BLOCK_SIZE
+         enddo
+      enddo
+
+      return
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine y_receive_backsub_info(recv_id,c)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     post mpi receives
+!---------------------------------------------------------------------
+
+      use bt_data
+      use mpinpb
+
+      implicit none
+
+      integer error,recv_id,ip,kp,c,buffer_size
+      ip = cell_coord(1,c) - 1
+      kp = cell_coord(3,c) - 1
+      buffer_size=MAX_CELL_DIM*MAX_CELL_DIM*BLOCK_SIZE
+      call mpi_irecv(out_buffer, buffer_size,  &
+     &     dp_type, successor(2),  &
+     &     NORTH+ip+kp*NCELLS, comm_solve,  &
+     &     recv_id, error)
+      return
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine y_receive_solve_info(recv_id,c)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     post mpi receives 
+!---------------------------------------------------------------------
+
+      use bt_data
+      use mpinpb
+
+      implicit none
+
+      integer ip,kp,recv_id,error,c,buffer_size
+      ip = cell_coord(1,c) - 1
+      kp = cell_coord(3,c) - 1
+      buffer_size=MAX_CELL_DIM*MAX_CELL_DIM*  &
+     &     (BLOCK_SIZE*BLOCK_SIZE + BLOCK_SIZE)
+      call mpi_irecv(out_buffer, buffer_size,  &
+     &     dp_type, predecessor(2),  &
+     &     SOUTH+ip+kp*NCELLS,  comm_solve,  &
+     &     recv_id, error)
+
+      return
+      end
+      
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine y_backsubstitute(first, last, c)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     back solve: if last cell, then generate U(jsize)=rhs(jsize)
+!     else assume U(jsize) is loaded by y_unpack_backsub_info
+!     so just use it
+!     after call u(jstart) will be sent to next cell
+!---------------------------------------------------------------------
+
+      use bt_data
+      implicit none
+
+      integer first, last, c, i, k
+      integer m,n,j,jsize,isize,ksize,jstart
+      
+      jstart = 0
+      isize = cell_size(1,c)-end(1,c)-1      
+      jsize = cell_size(2,c)-1
+      ksize = cell_size(3,c)-end(3,c)-1
+      if (last .eq. 0) then
+         do k=start(3,c),ksize
+            do i=start(1,c),isize
+!---------------------------------------------------------------------
+!     U(jsize) uses info from previous cell if not last cell
+!---------------------------------------------------------------------
+               do m=1,BLOCK_SIZE
+                  do n=1,BLOCK_SIZE
+                     rhs(m,i,jsize,k,c) = rhs(m,i,jsize,k,c)  &
+     &                    - lhsc(m,n,i,jsize,k,c)*  &
+     &                    backsub_info(n,i,k,c)
+                  enddo
+               enddo
+            enddo
+         enddo
+      endif
+      do k=start(3,c),ksize
+         do j=jsize-1,jstart,-1
+            do i=start(1,c),isize
+               do m=1,BLOCK_SIZE
+                  do n=1,BLOCK_SIZE
+                     rhs(m,i,j,k,c) = rhs(m,i,j,k,c)  &
+     &                    - lhsc(m,n,i,j,k,c)*rhs(n,i,j+1,k,c)
+                  enddo
+               enddo
+            enddo
+         enddo
+      enddo
+
+      return
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine y_solve_cell(first,last,c)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     performs Gaussian elimination on this cell.
+!     
+!     assumes that unpacking routines for non-first cells 
+!     preload C' and rhs' from previous cell.
+!     
+!     assumed send happens outside this routine, but that
+!     c'(JMAX) and rhs'(JMAX) will be sent to next cell
+!---------------------------------------------------------------------
+
+      use bt_data
+      implicit none
+
+      double precision tmp1, tmp2, tmp3
+      integer first,last,c
+      integer i,j,k,m,n,isize,ksize,jsize,jstart
+
+      jstart = 0
+      isize = cell_size(1,c)-end(1,c)-1
+      jsize = cell_size(2,c)-1
+      ksize = cell_size(3,c)-end(3,c)-1
+
+!---------------------------------------------------------------------
+!     zero the left hand side for starters
+!     set diagonal values to 1. This is overkill, but convenient
+!---------------------------------------------------------------------
+      do i = 0, isize
+         do m = 1, 5
+            do n = 1, 5
+               lhsa(m,n,i,0) = 0.0d0
+               lhsb(m,n,i,0) = 0.0d0
+               lhsa(m,n,i,jsize) = 0.0d0
+               lhsb(m,n,i,jsize) = 0.0d0
+            enddo
+            lhsb(m,m,i,0) = 1.0d0
+            lhsb(m,m,i,jsize) = 1.0d0
+         enddo
+      enddo
+
+      do k=start(3,c),ksize 
+
+!---------------------------------------------------------------------
+!     This function computes the left hand side for the three y-factors 
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     Compute the indices for storing the tri-diagonal matrix;
+!     determine a (labeled f) and n jacobians for cell c
+!---------------------------------------------------------------------
+
+         do j = start(2,c)-1, cell_size(2,c)-end(2,c)
+            do i=start(1,c),isize
+
+               tmp1 = 1.0d0 / u(1,i,j,k,c)
+               tmp2 = tmp1 * tmp1
+               tmp3 = tmp1 * tmp2
+
+               fjac(1,1,i,j) = 0.0d+00
+               fjac(1,2,i,j) = 0.0d+00
+               fjac(1,3,i,j) = 1.0d+00
+               fjac(1,4,i,j) = 0.0d+00
+               fjac(1,5,i,j) = 0.0d+00
+
+               fjac(2,1,i,j) = - ( u(2,i,j,k,c)*u(3,i,j,k,c) )  &
+     &              * tmp2
+               fjac(2,2,i,j) = u(3,i,j,k,c) * tmp1
+               fjac(2,3,i,j) = u(2,i,j,k,c) * tmp1
+               fjac(2,4,i,j) = 0.0d+00
+               fjac(2,5,i,j) = 0.0d+00
+
+               fjac(3,1,i,j) = - ( u(3,i,j,k,c)*u(3,i,j,k,c)*tmp2)  &
+     &              + c2 * qs(i,j,k,c)
+               fjac(3,2,i,j) = - c2 *  u(2,i,j,k,c) * tmp1
+               fjac(3,3,i,j) = ( 2.0d+00 - c2 )  &
+     &              *  u(3,i,j,k,c) * tmp1 
+               fjac(3,4,i,j) = - c2 * u(4,i,j,k,c) * tmp1 
+               fjac(3,5,i,j) = c2
+
+               fjac(4,1,i,j) = - ( u(3,i,j,k,c)*u(4,i,j,k,c) )  &
+     &              * tmp2
+               fjac(4,2,i,j) = 0.0d+00
+               fjac(4,3,i,j) = u(4,i,j,k,c) * tmp1
+               fjac(4,4,i,j) = u(3,i,j,k,c) * tmp1
+               fjac(4,5,i,j) = 0.0d+00
+
+               fjac(5,1,i,j) = ( c2 * 2.0d0 * qs(i,j,k,c)  &
+     &              - c1 * u(5,i,j,k,c) * tmp1 )  &
+     &              * u(3,i,j,k,c) * tmp1 
+               fjac(5,2,i,j) = - c2 * u(2,i,j,k,c)*u(3,i,j,k,c)  &
+     &              * tmp2
+               fjac(5,3,i,j) = c1 * u(5,i,j,k,c) * tmp1  &
+     &              - c2 * ( qs(i,j,k,c)  &
+     &              + u(3,i,j,k,c)*u(3,i,j,k,c) * tmp2 )
+               fjac(5,4,i,j) = - c2 * ( u(3,i,j,k,c)*u(4,i,j,k,c) )  &
+     &              * tmp2
+               fjac(5,5,i,j) = c1 * u(3,i,j,k,c) * tmp1 
+
+               njac(1,1,i,j) = 0.0d+00
+               njac(1,2,i,j) = 0.0d+00
+               njac(1,3,i,j) = 0.0d+00
+               njac(1,4,i,j) = 0.0d+00
+               njac(1,5,i,j) = 0.0d+00
+
+               njac(2,1,i,j) = - c3c4 * tmp2 * u(2,i,j,k,c)
+               njac(2,2,i,j) =   c3c4 * tmp1
+               njac(2,3,i,j) =   0.0d+00
+               njac(2,4,i,j) =   0.0d+00
+               njac(2,5,i,j) =   0.0d+00
+
+               njac(3,1,i,j) = - con43 * c3c4 * tmp2 * u(3,i,j,k,c)
+               njac(3,2,i,j) =   0.0d+00
+               njac(3,3,i,j) =   con43 * c3c4 * tmp1
+               njac(3,4,i,j) =   0.0d+00
+               njac(3,5,i,j) =   0.0d+00
+
+               njac(4,1,i,j) = - c3c4 * tmp2 * u(4,i,j,k,c)
+               njac(4,2,i,j) =   0.0d+00
+               njac(4,3,i,j) =   0.0d+00
+               njac(4,4,i,j) =   c3c4 * tmp1
+               njac(4,5,i,j) =   0.0d+00
+
+               njac(5,1,i,j) = - (  c3c4  &
+     &              - c1345 ) * tmp3 * (u(2,i,j,k,c)**2)  &
+     &              - ( con43 * c3c4  &
+     &              - c1345 ) * tmp3 * (u(3,i,j,k,c)**2)  &
+     &              - ( c3c4 - c1345 ) * tmp3 * (u(4,i,j,k,c)**2)  &
+     &              - c1345 * tmp2 * u(5,i,j,k,c)
+
+               njac(5,2,i,j) = (  c3c4 - c1345 ) * tmp2 * u(2,i,j,k,c)
+               njac(5,3,i,j) = ( con43 * c3c4  &
+     &              - c1345 ) * tmp2 * u(3,i,j,k,c)
+               njac(5,4,i,j) = ( c3c4 - c1345 ) * tmp2 * u(4,i,j,k,c)
+               njac(5,5,i,j) = ( c1345 ) * tmp1
+
+            enddo
+         enddo
+
+!---------------------------------------------------------------------
+!     now Jacobians are set, so form left hand side in y direction
+!---------------------------------------------------------------------
+         do j = start(2,c), jsize-end(2,c)
+            do i=start(1,c),isize
+
+               tmp1 = dt * ty1
+               tmp2 = dt * ty2
+
+               lhsa(1,1,i,j) = - tmp2 * fjac(1,1,i,j-1)  &
+     &              - tmp1 * njac(1,1,i,j-1)  &
+     &              - tmp1 * dy1 
+               lhsa(1,2,i,j) = - tmp2 * fjac(1,2,i,j-1)  &
+     &              - tmp1 * njac(1,2,i,j-1)
+               lhsa(1,3,i,j) = - tmp2 * fjac(1,3,i,j-1)  &
+     &              - tmp1 * njac(1,3,i,j-1)
+               lhsa(1,4,i,j) = - tmp2 * fjac(1,4,i,j-1)  &
+     &              - tmp1 * njac(1,4,i,j-1)
+               lhsa(1,5,i,j) = - tmp2 * fjac(1,5,i,j-1)  &
+     &              - tmp1 * njac(1,5,i,j-1)
+
+               lhsa(2,1,i,j) = - tmp2 * fjac(2,1,i,j-1)  &
+     &              - tmp1 * njac(2,1,i,j-1)
+               lhsa(2,2,i,j) = - tmp2 * fjac(2,2,i,j-1)  &
+     &              - tmp1 * njac(2,2,i,j-1)  &
+     &              - tmp1 * dy2
+               lhsa(2,3,i,j) = - tmp2 * fjac(2,3,i,j-1)  &
+     &              - tmp1 * njac(2,3,i,j-1)
+               lhsa(2,4,i,j) = - tmp2 * fjac(2,4,i,j-1)  &
+     &              - tmp1 * njac(2,4,i,j-1)
+               lhsa(2,5,i,j) = - tmp2 * fjac(2,5,i,j-1)  &
+     &              - tmp1 * njac(2,5,i,j-1)
+
+               lhsa(3,1,i,j) = - tmp2 * fjac(3,1,i,j-1)  &
+     &              - tmp1 * njac(3,1,i,j-1)
+               lhsa(3,2,i,j) = - tmp2 * fjac(3,2,i,j-1)  &
+     &              - tmp1 * njac(3,2,i,j-1)
+               lhsa(3,3,i,j) = - tmp2 * fjac(3,3,i,j-1)  &
+     &              - tmp1 * njac(3,3,i,j-1)  &
+     &              - tmp1 * dy3 
+               lhsa(3,4,i,j) = - tmp2 * fjac(3,4,i,j-1)  &
+     &              - tmp1 * njac(3,4,i,j-1)
+               lhsa(3,5,i,j) = - tmp2 * fjac(3,5,i,j-1)  &
+     &              - tmp1 * njac(3,5,i,j-1)
+
+               lhsa(4,1,i,j) = - tmp2 * fjac(4,1,i,j-1)  &
+     &              - tmp1 * njac(4,1,i,j-1)
+               lhsa(4,2,i,j) = - tmp2 * fjac(4,2,i,j-1)  &
+     &              - tmp1 * njac(4,2,i,j-1)
+               lhsa(4,3,i,j) = - tmp2 * fjac(4,3,i,j-1)  &
+     &              - tmp1 * njac(4,3,i,j-1)
+               lhsa(4,4,i,j) = - tmp2 * fjac(4,4,i,j-1)  &
+     &              - tmp1 * njac(4,4,i,j-1)  &
+     &              - tmp1 * dy4
+               lhsa(4,5,i,j) = - tmp2 * fjac(4,5,i,j-1)  &
+     &              - tmp1 * njac(4,5,i,j-1)
+
+               lhsa(5,1,i,j) = - tmp2 * fjac(5,1,i,j-1)  &
+     &              - tmp1 * njac(5,1,i,j-1)
+               lhsa(5,2,i,j) = - tmp2 * fjac(5,2,i,j-1)  &
+     &              - tmp1 * njac(5,2,i,j-1)
+               lhsa(5,3,i,j) = - tmp2 * fjac(5,3,i,j-1)  &
+     &              - tmp1 * njac(5,3,i,j-1)
+               lhsa(5,4,i,j) = - tmp2 * fjac(5,4,i,j-1)  &
+     &              - tmp1 * njac(5,4,i,j-1)
+               lhsa(5,5,i,j) = - tmp2 * fjac(5,5,i,j-1)  &
+     &              - tmp1 * njac(5,5,i,j-1)  &
+     &              - tmp1 * dy5
+
+               lhsb(1,1,i,j) = 1.0d+00  &
+     &              + tmp1 * 2.0d+00 * njac(1,1,i,j)  &
+     &              + tmp1 * 2.0d+00 * dy1
+               lhsb(1,2,i,j) = tmp1 * 2.0d+00 * njac(1,2,i,j)
+               lhsb(1,3,i,j) = tmp1 * 2.0d+00 * njac(1,3,i,j)
+               lhsb(1,4,i,j) = tmp1 * 2.0d+00 * njac(1,4,i,j)
+               lhsb(1,5,i,j) = tmp1 * 2.0d+00 * njac(1,5,i,j)
+
+               lhsb(2,1,i,j) = tmp1 * 2.0d+00 * njac(2,1,i,j)
+               lhsb(2,2,i,j) = 1.0d+00  &
+     &              + tmp1 * 2.0d+00 * njac(2,2,i,j)  &
+     &              + tmp1 * 2.0d+00 * dy2
+               lhsb(2,3,i,j) = tmp1 * 2.0d+00 * njac(2,3,i,j)
+               lhsb(2,4,i,j) = tmp1 * 2.0d+00 * njac(2,4,i,j)
+               lhsb(2,5,i,j) = tmp1 * 2.0d+00 * njac(2,5,i,j)
+
+               lhsb(3,1,i,j) = tmp1 * 2.0d+00 * njac(3,1,i,j)
+               lhsb(3,2,i,j) = tmp1 * 2.0d+00 * njac(3,2,i,j)
+               lhsb(3,3,i,j) = 1.0d+00  &
+     &              + tmp1 * 2.0d+00 * njac(3,3,i,j)  &
+     &              + tmp1 * 2.0d+00 * dy3
+               lhsb(3,4,i,j) = tmp1 * 2.0d+00 * njac(3,4,i,j)
+               lhsb(3,5,i,j) = tmp1 * 2.0d+00 * njac(3,5,i,j)
+
+               lhsb(4,1,i,j) = tmp1 * 2.0d+00 * njac(4,1,i,j)
+               lhsb(4,2,i,j) = tmp1 * 2.0d+00 * njac(4,2,i,j)
+               lhsb(4,3,i,j) = tmp1 * 2.0d+00 * njac(4,3,i,j)
+               lhsb(4,4,i,j) = 1.0d+00  &
+     &              + tmp1 * 2.0d+00 * njac(4,4,i,j)  &
+     &              + tmp1 * 2.0d+00 * dy4
+               lhsb(4,5,i,j) = tmp1 * 2.0d+00 * njac(4,5,i,j)
+
+               lhsb(5,1,i,j) = tmp1 * 2.0d+00 * njac(5,1,i,j)
+               lhsb(5,2,i,j) = tmp1 * 2.0d+00 * njac(5,2,i,j)
+               lhsb(5,3,i,j) = tmp1 * 2.0d+00 * njac(5,3,i,j)
+               lhsb(5,4,i,j) = tmp1 * 2.0d+00 * njac(5,4,i,j)
+               lhsb(5,5,i,j) = 1.0d+00  &
+     &              + tmp1 * 2.0d+00 * njac(5,5,i,j)  &
+     &              + tmp1 * 2.0d+00 * dy5
+
+               lhsc(1,1,i,j,k,c) =  tmp2 * fjac(1,1,i,j+1)  &
+     &              - tmp1 * njac(1,1,i,j+1)  &
+     &              - tmp1 * dy1
+               lhsc(1,2,i,j,k,c) =  tmp2 * fjac(1,2,i,j+1)  &
+     &              - tmp1 * njac(1,2,i,j+1)
+               lhsc(1,3,i,j,k,c) =  tmp2 * fjac(1,3,i,j+1)  &
+     &              - tmp1 * njac(1,3,i,j+1)
+               lhsc(1,4,i,j,k,c) =  tmp2 * fjac(1,4,i,j+1)  &
+     &              - tmp1 * njac(1,4,i,j+1)
+               lhsc(1,5,i,j,k,c) =  tmp2 * fjac(1,5,i,j+1)  &
+     &              - tmp1 * njac(1,5,i,j+1)
+
+               lhsc(2,1,i,j,k,c) =  tmp2 * fjac(2,1,i,j+1)  &
+     &              - tmp1 * njac(2,1,i,j+1)
+               lhsc(2,2,i,j,k,c) =  tmp2 * fjac(2,2,i,j+1)  &
+     &              - tmp1 * njac(2,2,i,j+1)  &
+     &              - tmp1 * dy2
+               lhsc(2,3,i,j,k,c) =  tmp2 * fjac(2,3,i,j+1)  &
+     &              - tmp1 * njac(2,3,i,j+1)
+               lhsc(2,4,i,j,k,c) =  tmp2 * fjac(2,4,i,j+1)  &
+     &              - tmp1 * njac(2,4,i,j+1)
+               lhsc(2,5,i,j,k,c) =  tmp2 * fjac(2,5,i,j+1)  &
+     &              - tmp1 * njac(2,5,i,j+1)
+
+               lhsc(3,1,i,j,k,c) =  tmp2 * fjac(3,1,i,j+1)  &
+     &              - tmp1 * njac(3,1,i,j+1)
+               lhsc(3,2,i,j,k,c) =  tmp2 * fjac(3,2,i,j+1)  &
+     &              - tmp1 * njac(3,2,i,j+1)
+               lhsc(3,3,i,j,k,c) =  tmp2 * fjac(3,3,i,j+1)  &
+     &              - tmp1 * njac(3,3,i,j+1)  &
+     &              - tmp1 * dy3
+               lhsc(3,4,i,j,k,c) =  tmp2 * fjac(3,4,i,j+1)  &
+     &              - tmp1 * njac(3,4,i,j+1)
+               lhsc(3,5,i,j,k,c) =  tmp2 * fjac(3,5,i,j+1)  &
+     &              - tmp1 * njac(3,5,i,j+1)
+
+               lhsc(4,1,i,j,k,c) =  tmp2 * fjac(4,1,i,j+1)  &
+     &              - tmp1 * njac(4,1,i,j+1)
+               lhsc(4,2,i,j,k,c) =  tmp2 * fjac(4,2,i,j+1)  &
+     &              - tmp1 * njac(4,2,i,j+1)
+               lhsc(4,3,i,j,k,c) =  tmp2 * fjac(4,3,i,j+1)  &
+     &              - tmp1 * njac(4,3,i,j+1)
+               lhsc(4,4,i,j,k,c) =  tmp2 * fjac(4,4,i,j+1)  &
+     &              - tmp1 * njac(4,4,i,j+1)  &
+     &              - tmp1 * dy4
+               lhsc(4,5,i,j,k,c) =  tmp2 * fjac(4,5,i,j+1)  &
+     &              - tmp1 * njac(4,5,i,j+1)
+
+               lhsc(5,1,i,j,k,c) =  tmp2 * fjac(5,1,i,j+1)  &
+     &              - tmp1 * njac(5,1,i,j+1)
+               lhsc(5,2,i,j,k,c) =  tmp2 * fjac(5,2,i,j+1)  &
+     &              - tmp1 * njac(5,2,i,j+1)
+               lhsc(5,3,i,j,k,c) =  tmp2 * fjac(5,3,i,j+1)  &
+     &              - tmp1 * njac(5,3,i,j+1)
+               lhsc(5,4,i,j,k,c) =  tmp2 * fjac(5,4,i,j+1)  &
+     &              - tmp1 * njac(5,4,i,j+1)
+               lhsc(5,5,i,j,k,c) =  tmp2 * fjac(5,5,i,j+1)  &
+     &              - tmp1 * njac(5,5,i,j+1)  &
+     &              - tmp1 * dy5
+
+            enddo
+         enddo
+
+
+!---------------------------------------------------------------------
+!     outer most do loops - sweeping in i direction
+!---------------------------------------------------------------------
+         if (first .eq. 1) then 
+
+!---------------------------------------------------------------------
+!     multiply c(i,jstart,k) by b_inverse and copy back to c
+!     multiply rhs(jstart) by b_inverse(jstart) and copy to rhs
+!---------------------------------------------------------------------
+!dir$ ivdep
+            do i=start(1,c),isize
+               call binvcrhs( lhsb(1,1,i,jstart),  &
+     &                        lhsc(1,1,i,jstart,k,c),  &
+     &                        rhs(1,i,jstart,k,c) )
+            enddo
+
+         endif
+
+!---------------------------------------------------------------------
+!     begin inner most do loop
+!     do all the elements of the cell unless last 
+!---------------------------------------------------------------------
+         do j=jstart+first,jsize-last
+!dir$ ivdep
+            do i=start(1,c),isize
+
+!---------------------------------------------------------------------
+!     subtract A*lhs_vector(j-1) from lhs_vector(j)
+!     
+!     rhs(j) = rhs(j) - A*rhs(j-1)
+!---------------------------------------------------------------------
+               call matvec_sub(lhsa(1,1,i,j),  &
+     &                         rhs(1,i,j-1,k,c),rhs(1,i,j,k,c))
+
+!---------------------------------------------------------------------
+!     B(j) = B(j) - C(j-1)*A(j)
+!---------------------------------------------------------------------
+               call matmul_sub(lhsa(1,1,i,j),  &
+     &                         lhsc(1,1,i,j-1,k,c),  &
+     &                         lhsb(1,1,i,j))
+
+!---------------------------------------------------------------------
+!     multiply c(i,j,k) by b_inverse and copy back to c
+!     multiply rhs(i,1,k) by b_inverse(i,1,k) and copy to rhs
+!---------------------------------------------------------------------
+               call binvcrhs( lhsb(1,1,i,j),  &
+     &                        lhsc(1,1,i,j,k,c),  &
+     &                        rhs(1,i,j,k,c) )
+
+            enddo
+         enddo
+
+!---------------------------------------------------------------------
+!     Now finish up special cases for last cell
+!---------------------------------------------------------------------
+         if (last .eq. 1) then
+
+!dir$ ivdep
+            do i=start(1,c),isize
+!---------------------------------------------------------------------
+!     rhs(jsize) = rhs(jsize) - A*rhs(jsize-1)
+!---------------------------------------------------------------------
+               call matvec_sub(lhsa(1,1,i,jsize),  &
+     &                         rhs(1,i,jsize-1,k,c),rhs(1,i,jsize,k,c))
+
+!---------------------------------------------------------------------
+!     B(jsize) = B(jsize) - C(jsize-1)*A(jsize)
+!     call matmul_sub(aa,i,jsize,k,c,
+!     $              cc,i,jsize-1,k,c,bb,i,jsize,k,c)
+!---------------------------------------------------------------------
+               call matmul_sub(lhsa(1,1,i,jsize),  &
+     &                         lhsc(1,1,i,jsize-1,k,c),  &
+     &                         lhsb(1,1,i,jsize))
+
+!---------------------------------------------------------------------
+!     multiply rhs(jsize) by b_inverse(jsize) and copy to rhs
+!---------------------------------------------------------------------
+               call binvrhs( lhsb(1,1,i,jsize),  &
+     &                       rhs(1,i,jsize,k,c) )
+            enddo
+
+         endif
+      enddo
+
+
+      return
+      end
+      
+
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/z_solve.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/z_solve.f90
new file mode 100644
index 000000000..ccbba0147
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/z_solve.f90
@@ -0,0 +1,802 @@
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine z_solve
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     Performs line solves in Z direction by first factoring
+!     the block-tridiagonal matrix into an upper triangular matrix, 
+!     and then performing back substitution to solve for the unknown
+!     vectors of each line.  
+!     
+!     Make sure we treat elements zero to cell_size in the direction
+!     of the sweep.
+!---------------------------------------------------------------------
+
+      use bt_data
+      use mpinpb
+
+      implicit none
+
+      integer c, kstart, stage,  &
+     &     first, last, recv_id, error, r_status(MPI_STATUS_SIZE),  &
+     &     isize,jsize,ksize,send_id
+
+      kstart = 0
+
+      if (timeron) call timer_start(t_zsolve)
+!---------------------------------------------------------------------
+!     in our terminology stage is the number of the cell in the z-direction
+!     i.e. stage = 1 means the start of the line; stage = ncells means the end
+!---------------------------------------------------------------------
+      do stage = 1,ncells
+         c = slice(3,stage)
+         isize = cell_size(1,c) - 1
+         jsize = cell_size(2,c) - 1
+         ksize = cell_size(3,c) - 1
+!---------------------------------------------------------------------
+!     set last-cell flag
+!---------------------------------------------------------------------
+         if (stage .eq. ncells) then
+            last = 1
+         else
+            last = 0
+         endif
+
+         if (stage .eq. 1) then
+!---------------------------------------------------------------------
+!     This is the first cell, so solve without receiving data
+!---------------------------------------------------------------------
+            first = 1
+!            call lhsz(c)
+            call z_solve_cell(first,last,c)
+         else
+!---------------------------------------------------------------------
+!     Not the first cell of this line, so receive info from
+!     processor working on preceeding cell
+!---------------------------------------------------------------------
+            first = 0
+            if (timeron) call timer_start(t_zcomm)
+            call z_receive_solve_info(recv_id,c)
+!---------------------------------------------------------------------
+!     overlap computations and communications
+!---------------------------------------------------------------------
+!            call lhsz(c)
+!---------------------------------------------------------------------
+!     wait for completion
+!---------------------------------------------------------------------
+            call mpi_wait(send_id,r_status,error)
+            call mpi_wait(recv_id,r_status,error)
+            if (timeron) call timer_stop(t_zcomm)
+!---------------------------------------------------------------------
+!     install C'(kstart-1) and rhs'(kstart-1) to be used in this cell
+!---------------------------------------------------------------------
+            call z_unpack_solve_info(c)
+            call z_solve_cell(first,last,c)
+         endif
+
+         if (last .eq. 0) call z_send_solve_info(send_id,c)
+      enddo
+
+!---------------------------------------------------------------------
+!     now perform backsubstitution in reverse direction
+!---------------------------------------------------------------------
+      do stage = ncells, 1, -1
+         c = slice(3,stage)
+         first = 0
+         last = 0
+         if (stage .eq. 1) first = 1
+         if (stage .eq. ncells) then
+            last = 1
+!---------------------------------------------------------------------
+!     last cell, so perform back substitution without waiting
+!---------------------------------------------------------------------
+            call z_backsubstitute(first, last,c)
+         else
+            if (timeron) call timer_start(t_zcomm)
+            call z_receive_backsub_info(recv_id,c)
+            call mpi_wait(send_id,r_status,error)
+            call mpi_wait(recv_id,r_status,error)
+            if (timeron) call timer_stop(t_zcomm)
+            call z_unpack_backsub_info(c)
+            call z_backsubstitute(first,last,c)
+         endif
+         if (first .eq. 0) call z_send_backsub_info(send_id,c)
+      enddo
+
+      if (timeron) call timer_stop(t_zsolve)
+
+      return
+      end
+      
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+      
+      subroutine z_unpack_solve_info(c)
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     unpack C'(-1) and rhs'(-1) for
+!     all i and j
+!---------------------------------------------------------------------
+
+      use bt_data
+      implicit none
+
+      integer i,j,m,n,ptr,c,kstart 
+
+      kstart = 0
+      ptr = 0
+      do j=0,JMAX-1
+         do i=0,IMAX-1
+            do m=1,BLOCK_SIZE
+               do n=1,BLOCK_SIZE
+                  lhsc(m,n,i,j,kstart-1,c) = out_buffer(ptr+n)
+               enddo
+               ptr = ptr+BLOCK_SIZE
+            enddo
+            do n=1,BLOCK_SIZE
+               rhs(n,i,j,kstart-1,c) = out_buffer(ptr+n)
+            enddo
+            ptr = ptr+BLOCK_SIZE
+         enddo
+      enddo
+
+      return
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+      
+      subroutine z_send_solve_info(send_id,c)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     pack up and send C'(kend) and rhs'(kend) for
+!     all i and j
+!---------------------------------------------------------------------
+
+      use bt_data
+      use mpinpb
+
+      implicit none
+
+      integer i,j,m,n,ksize,ptr,c,ip,jp
+      integer error,send_id,buffer_size
+
+      ksize = cell_size(3,c)-1
+      ip = cell_coord(1,c) - 1
+      jp = cell_coord(2,c) - 1
+      buffer_size=MAX_CELL_DIM*MAX_CELL_DIM*  &
+     &     (BLOCK_SIZE*BLOCK_SIZE + BLOCK_SIZE)
+
+!---------------------------------------------------------------------
+!     pack up buffer
+!---------------------------------------------------------------------
+      ptr = 0
+      do j=0,JMAX-1
+         do i=0,IMAX-1
+            do m=1,BLOCK_SIZE
+               do n=1,BLOCK_SIZE
+                  in_buffer(ptr+n) = lhsc(m,n,i,j,ksize,c)
+               enddo
+               ptr = ptr+BLOCK_SIZE
+            enddo
+            do n=1,BLOCK_SIZE
+               in_buffer(ptr+n) = rhs(n,i,j,ksize,c)
+            enddo
+            ptr = ptr+BLOCK_SIZE
+         enddo
+      enddo
+
+!---------------------------------------------------------------------
+!     send buffer 
+!---------------------------------------------------------------------
+      if (timeron) call timer_start(t_zcomm)
+      call mpi_isend(in_buffer, buffer_size,  &
+     &     dp_type, successor(3),  &
+     &     BOTTOM+ip+jp*NCELLS, comm_solve,  &
+     &     send_id,error)
+      if (timeron) call timer_stop(t_zcomm)
+
+      return
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine z_send_backsub_info(send_id,c)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     pack up and send U(kstart) for all i and j
+!---------------------------------------------------------------------
+
+      use bt_data
+      use mpinpb
+
+      implicit none
+
+      integer i,j,n,ptr,c,kstart,ip,jp
+      integer error,send_id,buffer_size
+
+!---------------------------------------------------------------------
+!     Send element 0 to previous processor
+!---------------------------------------------------------------------
+      kstart = 0
+      ip = cell_coord(1,c)-1
+      jp = cell_coord(2,c)-1
+      buffer_size=MAX_CELL_DIM*MAX_CELL_DIM*BLOCK_SIZE
+      ptr = 0
+      do j=0,JMAX-1
+         do i=0,IMAX-1
+            do n=1,BLOCK_SIZE
+               in_buffer(ptr+n) = rhs(n,i,j,kstart,c)
+            enddo
+            ptr = ptr+BLOCK_SIZE
+         enddo
+      enddo
+
+      if (timeron) call timer_start(t_zcomm)
+      call mpi_isend(in_buffer, buffer_size,  &
+     &     dp_type, predecessor(3),  &
+     &     TOP+ip+jp*NCELLS, comm_solve,  &
+     &     send_id,error)
+      if (timeron) call timer_stop(t_zcomm)
+
+      return
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine z_unpack_backsub_info(c)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     unpack U(ksize) for all i and j
+!---------------------------------------------------------------------
+
+      use bt_data
+      implicit none
+
+      integer i,j,n,ptr,c
+
+      ptr = 0
+      do j=0,JMAX-1
+         do i=0,IMAX-1
+            do n=1,BLOCK_SIZE
+               backsub_info(n,i,j,c) = out_buffer(ptr+n)
+            enddo
+            ptr = ptr+BLOCK_SIZE
+         enddo
+      enddo
+
+      return
+      end
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine z_receive_backsub_info(recv_id,c)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     post mpi receives
+!---------------------------------------------------------------------
+
+      use bt_data
+      use mpinpb
+
+      implicit none
+
+      integer error,recv_id,ip,jp,c,buffer_size
+
+      ip = cell_coord(1,c) - 1
+      jp = cell_coord(2,c) - 1
+      buffer_size=MAX_CELL_DIM*MAX_CELL_DIM*BLOCK_SIZE
+      call mpi_irecv(out_buffer, buffer_size,  &
+     &     dp_type, successor(3),  &
+     &     TOP+ip+jp*NCELLS, comm_solve,  &
+     &     recv_id, error)
+
+      return
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine z_receive_solve_info(recv_id,c)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     post mpi receives 
+!---------------------------------------------------------------------
+
+      use bt_data
+      use mpinpb
+
+      implicit none
+
+      integer ip,jp,recv_id,error,c,buffer_size
+
+      ip = cell_coord(1,c) - 1
+      jp = cell_coord(2,c) - 1
+      buffer_size=MAX_CELL_DIM*MAX_CELL_DIM*  &
+     &     (BLOCK_SIZE*BLOCK_SIZE + BLOCK_SIZE)
+      call mpi_irecv(out_buffer, buffer_size,  &
+     &     dp_type, predecessor(3),  &
+     &     BOTTOM+ip+jp*NCELLS, comm_solve,  &
+     &     recv_id, error)
+
+      return
+      end
+      
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine z_backsubstitute(first, last, c)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     back solve: if last cell, then generate U(ksize)=rhs(ksize)
+!     else assume U(ksize) was loaded by z_unpack_backsub_info,
+!     so just use it
+!     after the call, U(kstart) will be sent to the next cell
+!---------------------------------------------------------------------
+
+      use bt_data
+      implicit none
+
+      integer first, last, c, i, k
+      integer m,n,j,jsize,isize,ksize,kstart
+      
+      kstart = 0
+      isize = cell_size(1,c)-end(1,c)-1      
+      jsize = cell_size(2,c)-end(2,c)-1
+      ksize = cell_size(3,c)-1
+      if (last .eq. 0) then
+         do j=start(2,c),jsize
+            do i=start(1,c),isize
+!---------------------------------------------------------------------
+!     U(ksize) uses info from previous cell if not last cell
+!---------------------------------------------------------------------
+               do m=1,BLOCK_SIZE
+                  do n=1,BLOCK_SIZE
+                     rhs(m,i,j,ksize,c) = rhs(m,i,j,ksize,c)  &
+     &                    - lhsc(m,n,i,j,ksize,c)*  &
+     &                    backsub_info(n,i,j,c)
+                  enddo
+               enddo
+            enddo
+         enddo
+      endif
+      do k=ksize-1,kstart,-1
+         do j=start(2,c),jsize
+            do i=start(1,c),isize
+               do m=1,BLOCK_SIZE
+                  do n=1,BLOCK_SIZE
+                     rhs(m,i,j,k,c) = rhs(m,i,j,k,c)  &
+     &                    - lhsc(m,n,i,j,k,c)*rhs(n,i,j,k+1,c)
+                  enddo
+               enddo
+            enddo
+         enddo
+      enddo
+
+      return
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine z_solve_cell(first,last,c)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     performs Gaussian elimination on this cell.
+!     
+!     assumes that unpacking routines for non-first cells 
+!     preload C' and rhs' from previous cell.
+!     
+!     assumes the send happens outside this routine, but that
+!     C'(KMAX) and rhs'(KMAX) will be sent to the next cell.
+!---------------------------------------------------------------------
+
+      use bt_data
+      implicit none
+
+      double precision tmp1, tmp2, tmp3
+      integer first,last,c
+      integer i,j,k,isize,ksize,jsize,kstart
+      double precision utmp(6,-2:KMAX+1)
+
+      kstart = 0
+      isize = cell_size(1,c)-end(1,c)-1
+      jsize = cell_size(2,c)-end(2,c)-1
+      ksize = cell_size(3,c)-1
+
+      call lhsabinit(lhsa, lhsb, ksize)
+
+      do j=start(2,c),jsize 
+         do i=start(1,c),isize
+
+!---------------------------------------------------------------------
+!     This function computes the left hand side for the three z-factors   
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     Compute the indices for storing the block-diagonal matrix;
+!     determine c (labeled f) and s jacobians for cell c
+!---------------------------------------------------------------------
+            do k = start(3,c)-1, cell_size(3,c)-end(3,c)
+               utmp(1,k) = 1.0d0 / u(1,i,j,k,c)
+               utmp(2,k) = u(2,i,j,k,c)
+               utmp(3,k) = u(3,i,j,k,c)
+               utmp(4,k) = u(4,i,j,k,c)
+               utmp(5,k) = u(5,i,j,k,c)
+               utmp(6,k) = qs(i,j,k,c)
+            end do
+
+            do k = start(3,c)-1, cell_size(3,c)-end(3,c)
+
+               tmp1 = utmp(1,k)
+               tmp2 = tmp1 * tmp1
+               tmp3 = tmp1 * tmp2
+
+               fjac(1,1,k) = 0.0d+00
+               fjac(1,2,k) = 0.0d+00
+               fjac(1,3,k) = 0.0d+00
+               fjac(1,4,k) = 1.0d+00
+               fjac(1,5,k) = 0.0d+00
+
+               fjac(2,1,k) = - ( utmp(2,k)*utmp(4,k) )  &
+     &              * tmp2 
+               fjac(2,2,k) = utmp(4,k) * tmp1
+               fjac(2,3,k) = 0.0d+00
+               fjac(2,4,k) = utmp(2,k) * tmp1
+               fjac(2,5,k) = 0.0d+00
+
+               fjac(3,1,k) = - ( utmp(3,k)*utmp(4,k) )  &
+     &              * tmp2 
+               fjac(3,2,k) = 0.0d+00
+               fjac(3,3,k) = utmp(4,k) * tmp1
+               fjac(3,4,k) = utmp(3,k) * tmp1
+               fjac(3,5,k) = 0.0d+00
+
+               fjac(4,1,k) = - (utmp(4,k)*utmp(4,k) * tmp2 )  &
+     &              + c2 * utmp(6,k)
+               fjac(4,2,k) = - c2 *  utmp(2,k) * tmp1 
+               fjac(4,3,k) = - c2 *  utmp(3,k) * tmp1
+               fjac(4,4,k) = ( 2.0d+00 - c2 )  &
+     &              *  utmp(4,k) * tmp1 
+               fjac(4,5,k) = c2
+
+               fjac(5,1,k) = ( c2 * 2.0d0 * utmp(6,k)  &
+     &              - c1 * ( utmp(5,k) * tmp1 ) )  &
+     &              * ( utmp(4,k) * tmp1 )
+               fjac(5,2,k) = - c2 * ( utmp(2,k)*utmp(4,k) )  &
+     &              * tmp2 
+               fjac(5,3,k) = - c2 * ( utmp(3,k)*utmp(4,k) )  &
+     &              * tmp2
+               fjac(5,4,k) = c1 * ( utmp(5,k) * tmp1 )  &
+     &              - c2 * ( utmp(6,k)  &
+     &              + utmp(4,k)*utmp(4,k) * tmp2 )
+               fjac(5,5,k) = c1 * utmp(4,k) * tmp1
+
+               njac(1,1,k) = 0.0d+00
+               njac(1,2,k) = 0.0d+00
+               njac(1,3,k) = 0.0d+00
+               njac(1,4,k) = 0.0d+00
+               njac(1,5,k) = 0.0d+00
+
+               njac(2,1,k) = - c3c4 * tmp2 * utmp(2,k)
+               njac(2,2,k) =   c3c4 * tmp1
+               njac(2,3,k) =   0.0d+00
+               njac(2,4,k) =   0.0d+00
+               njac(2,5,k) =   0.0d+00
+
+               njac(3,1,k) = - c3c4 * tmp2 * utmp(3,k)
+               njac(3,2,k) =   0.0d+00
+               njac(3,3,k) =   c3c4 * tmp1
+               njac(3,4,k) =   0.0d+00
+               njac(3,5,k) =   0.0d+00
+
+               njac(4,1,k) = - con43 * c3c4 * tmp2 * utmp(4,k)
+               njac(4,2,k) =   0.0d+00
+               njac(4,3,k) =   0.0d+00
+               njac(4,4,k) =   con43 * c3 * c4 * tmp1
+               njac(4,5,k) =   0.0d+00
+
+               njac(5,1,k) = - (  c3c4  &
+     &              - c1345 ) * tmp3 * (utmp(2,k)**2)  &
+     &              - ( c3c4 - c1345 ) * tmp3 * (utmp(3,k)**2)  &
+     &              - ( con43 * c3c4  &
+     &              - c1345 ) * tmp3 * (utmp(4,k)**2)  &
+     &              - c1345 * tmp2 * utmp(5,k)
+
+               njac(5,2,k) = (  c3c4 - c1345 ) * tmp2 * utmp(2,k)
+               njac(5,3,k) = (  c3c4 - c1345 ) * tmp2 * utmp(3,k)
+               njac(5,4,k) = ( con43 * c3c4  &
+     &              - c1345 ) * tmp2 * utmp(4,k)
+               njac(5,5,k) = ( c1345 )* tmp1
+
+
+            enddo
+
+!---------------------------------------------------------------------
+!     now jacobians are set, so form left hand side in z direction
+!---------------------------------------------------------------------
+            do k = start(3,c), ksize-end(3,c)
+
+               tmp1 = dt * tz1
+               tmp2 = dt * tz2
+
+               lhsa(1,1,k) = - tmp2 * fjac(1,1,k-1)  &
+     &              - tmp1 * njac(1,1,k-1)  &
+     &              - tmp1 * dz1 
+               lhsa(1,2,k) = - tmp2 * fjac(1,2,k-1)  &
+     &              - tmp1 * njac(1,2,k-1)
+               lhsa(1,3,k) = - tmp2 * fjac(1,3,k-1)  &
+     &              - tmp1 * njac(1,3,k-1)
+               lhsa(1,4,k) = - tmp2 * fjac(1,4,k-1)  &
+     &              - tmp1 * njac(1,4,k-1)
+               lhsa(1,5,k) = - tmp2 * fjac(1,5,k-1)  &
+     &              - tmp1 * njac(1,5,k-1)
+
+               lhsa(2,1,k) = - tmp2 * fjac(2,1,k-1)  &
+     &              - tmp1 * njac(2,1,k-1)
+               lhsa(2,2,k) = - tmp2 * fjac(2,2,k-1)  &
+     &              - tmp1 * njac(2,2,k-1)  &
+     &              - tmp1 * dz2
+               lhsa(2,3,k) = - tmp2 * fjac(2,3,k-1)  &
+     &              - tmp1 * njac(2,3,k-1)
+               lhsa(2,4,k) = - tmp2 * fjac(2,4,k-1)  &
+     &              - tmp1 * njac(2,4,k-1)
+               lhsa(2,5,k) = - tmp2 * fjac(2,5,k-1)  &
+     &              - tmp1 * njac(2,5,k-1)
+
+               lhsa(3,1,k) = - tmp2 * fjac(3,1,k-1)  &
+     &              - tmp1 * njac(3,1,k-1)
+               lhsa(3,2,k) = - tmp2 * fjac(3,2,k-1)  &
+     &              - tmp1 * njac(3,2,k-1)
+               lhsa(3,3,k) = - tmp2 * fjac(3,3,k-1)  &
+     &              - tmp1 * njac(3,3,k-1)  &
+     &              - tmp1 * dz3 
+               lhsa(3,4,k) = - tmp2 * fjac(3,4,k-1)  &
+     &              - tmp1 * njac(3,4,k-1)
+               lhsa(3,5,k) = - tmp2 * fjac(3,5,k-1)  &
+     &              - tmp1 * njac(3,5,k-1)
+
+               lhsa(4,1,k) = - tmp2 * fjac(4,1,k-1)  &
+     &              - tmp1 * njac(4,1,k-1)
+               lhsa(4,2,k) = - tmp2 * fjac(4,2,k-1)  &
+     &              - tmp1 * njac(4,2,k-1)
+               lhsa(4,3,k) = - tmp2 * fjac(4,3,k-1)  &
+     &              - tmp1 * njac(4,3,k-1)
+               lhsa(4,4,k) = - tmp2 * fjac(4,4,k-1)  &
+     &              - tmp1 * njac(4,4,k-1)  &
+     &              - tmp1 * dz4
+               lhsa(4,5,k) = - tmp2 * fjac(4,5,k-1)  &
+     &              - tmp1 * njac(4,5,k-1)
+
+               lhsa(5,1,k) = - tmp2 * fjac(5,1,k-1)  &
+     &              - tmp1 * njac(5,1,k-1)
+               lhsa(5,2,k) = - tmp2 * fjac(5,2,k-1)  &
+     &              - tmp1 * njac(5,2,k-1)
+               lhsa(5,3,k) = - tmp2 * fjac(5,3,k-1)  &
+     &              - tmp1 * njac(5,3,k-1)
+               lhsa(5,4,k) = - tmp2 * fjac(5,4,k-1)  &
+     &              - tmp1 * njac(5,4,k-1)
+               lhsa(5,5,k) = - tmp2 * fjac(5,5,k-1)  &
+     &              - tmp1 * njac(5,5,k-1)  &
+     &              - tmp1 * dz5
+
+               lhsb(1,1,k) = 1.0d+00  &
+     &              + tmp1 * 2.0d+00 * njac(1,1,k)  &
+     &              + tmp1 * 2.0d+00 * dz1
+               lhsb(1,2,k) = tmp1 * 2.0d+00 * njac(1,2,k)
+               lhsb(1,3,k) = tmp1 * 2.0d+00 * njac(1,3,k)
+               lhsb(1,4,k) = tmp1 * 2.0d+00 * njac(1,4,k)
+               lhsb(1,5,k) = tmp1 * 2.0d+00 * njac(1,5,k)
+
+               lhsb(2,1,k) = tmp1 * 2.0d+00 * njac(2,1,k)
+               lhsb(2,2,k) = 1.0d+00  &
+     &              + tmp1 * 2.0d+00 * njac(2,2,k)  &
+     &              + tmp1 * 2.0d+00 * dz2
+               lhsb(2,3,k) = tmp1 * 2.0d+00 * njac(2,3,k)
+               lhsb(2,4,k) = tmp1 * 2.0d+00 * njac(2,4,k)
+               lhsb(2,5,k) = tmp1 * 2.0d+00 * njac(2,5,k)
+
+               lhsb(3,1,k) = tmp1 * 2.0d+00 * njac(3,1,k)
+               lhsb(3,2,k) = tmp1 * 2.0d+00 * njac(3,2,k)
+               lhsb(3,3,k) = 1.0d+00  &
+     &              + tmp1 * 2.0d+00 * njac(3,3,k)  &
+     &              + tmp1 * 2.0d+00 * dz3
+               lhsb(3,4,k) = tmp1 * 2.0d+00 * njac(3,4,k)
+               lhsb(3,5,k) = tmp1 * 2.0d+00 * njac(3,5,k)
+
+               lhsb(4,1,k) = tmp1 * 2.0d+00 * njac(4,1,k)
+               lhsb(4,2,k) = tmp1 * 2.0d+00 * njac(4,2,k)
+               lhsb(4,3,k) = tmp1 * 2.0d+00 * njac(4,3,k)
+               lhsb(4,4,k) = 1.0d+00  &
+     &              + tmp1 * 2.0d+00 * njac(4,4,k)  &
+     &              + tmp1 * 2.0d+00 * dz4
+               lhsb(4,5,k) = tmp1 * 2.0d+00 * njac(4,5,k)
+
+               lhsb(5,1,k) = tmp1 * 2.0d+00 * njac(5,1,k)
+               lhsb(5,2,k) = tmp1 * 2.0d+00 * njac(5,2,k)
+               lhsb(5,3,k) = tmp1 * 2.0d+00 * njac(5,3,k)
+               lhsb(5,4,k) = tmp1 * 2.0d+00 * njac(5,4,k)
+               lhsb(5,5,k) = 1.0d+00  &
+     &              + tmp1 * 2.0d+00 * njac(5,5,k)  &
+     &              + tmp1 * 2.0d+00 * dz5
+
+               lhsc(1,1,i,j,k,c) =  tmp2 * fjac(1,1,k+1)  &
+     &              - tmp1 * njac(1,1,k+1)  &
+     &              - tmp1 * dz1
+               lhsc(1,2,i,j,k,c) =  tmp2 * fjac(1,2,k+1)  &
+     &              - tmp1 * njac(1,2,k+1)
+               lhsc(1,3,i,j,k,c) =  tmp2 * fjac(1,3,k+1)  &
+     &              - tmp1 * njac(1,3,k+1)
+               lhsc(1,4,i,j,k,c) =  tmp2 * fjac(1,4,k+1)  &
+     &              - tmp1 * njac(1,4,k+1)
+               lhsc(1,5,i,j,k,c) =  tmp2 * fjac(1,5,k+1)  &
+     &              - tmp1 * njac(1,5,k+1)
+
+               lhsc(2,1,i,j,k,c) =  tmp2 * fjac(2,1,k+1)  &
+     &              - tmp1 * njac(2,1,k+1)
+               lhsc(2,2,i,j,k,c) =  tmp2 * fjac(2,2,k+1)  &
+     &              - tmp1 * njac(2,2,k+1)  &
+     &              - tmp1 * dz2
+               lhsc(2,3,i,j,k,c) =  tmp2 * fjac(2,3,k+1)  &
+     &              - tmp1 * njac(2,3,k+1)
+               lhsc(2,4,i,j,k,c) =  tmp2 * fjac(2,4,k+1)  &
+     &              - tmp1 * njac(2,4,k+1)
+               lhsc(2,5,i,j,k,c) =  tmp2 * fjac(2,5,k+1)  &
+     &              - tmp1 * njac(2,5,k+1)
+
+               lhsc(3,1,i,j,k,c) =  tmp2 * fjac(3,1,k+1)  &
+     &              - tmp1 * njac(3,1,k+1)
+               lhsc(3,2,i,j,k,c) =  tmp2 * fjac(3,2,k+1)  &
+     &              - tmp1 * njac(3,2,k+1)
+               lhsc(3,3,i,j,k,c) =  tmp2 * fjac(3,3,k+1)  &
+     &              - tmp1 * njac(3,3,k+1)  &
+     &              - tmp1 * dz3
+               lhsc(3,4,i,j,k,c) =  tmp2 * fjac(3,4,k+1)  &
+     &              - tmp1 * njac(3,4,k+1)
+               lhsc(3,5,i,j,k,c) =  tmp2 * fjac(3,5,k+1)  &
+     &              - tmp1 * njac(3,5,k+1)
+
+               lhsc(4,1,i,j,k,c) =  tmp2 * fjac(4,1,k+1)  &
+     &              - tmp1 * njac(4,1,k+1)
+               lhsc(4,2,i,j,k,c) =  tmp2 * fjac(4,2,k+1)  &
+     &              - tmp1 * njac(4,2,k+1)
+               lhsc(4,3,i,j,k,c) =  tmp2 * fjac(4,3,k+1)  &
+     &              - tmp1 * njac(4,3,k+1)
+               lhsc(4,4,i,j,k,c) =  tmp2 * fjac(4,4,k+1)  &
+     &              - tmp1 * njac(4,4,k+1)  &
+     &              - tmp1 * dz4
+               lhsc(4,5,i,j,k,c) =  tmp2 * fjac(4,5,k+1)  &
+     &              - tmp1 * njac(4,5,k+1)
+
+               lhsc(5,1,i,j,k,c) =  tmp2 * fjac(5,1,k+1)  &
+     &              - tmp1 * njac(5,1,k+1)
+               lhsc(5,2,i,j,k,c) =  tmp2 * fjac(5,2,k+1)  &
+     &              - tmp1 * njac(5,2,k+1)
+               lhsc(5,3,i,j,k,c) =  tmp2 * fjac(5,3,k+1)  &
+     &              - tmp1 * njac(5,3,k+1)
+               lhsc(5,4,i,j,k,c) =  tmp2 * fjac(5,4,k+1)  &
+     &              - tmp1 * njac(5,4,k+1)
+               lhsc(5,5,i,j,k,c) =  tmp2 * fjac(5,5,k+1)  &
+     &              - tmp1 * njac(5,5,k+1)  &
+     &              - tmp1 * dz5
+
+            enddo
+
+
+!---------------------------------------------------------------------
+!     outermost do loops - sweeping in i direction
+!---------------------------------------------------------------------
+            if (first .eq. 1) then 
+
+!---------------------------------------------------------------------
+!     multiply c(i,j,kstart) by b_inverse and copy back to c
+!     multiply rhs(kstart) by b_inverse(kstart) and copy to rhs
+!---------------------------------------------------------------------
+               call binvcrhs( lhsb(1,1,kstart),  &
+     &                        lhsc(1,1,i,j,kstart,c),  &
+     &                        rhs(1,i,j,kstart,c) )
+
+            endif
+
+!---------------------------------------------------------------------
+!     begin innermost do loop
+!     do all the elements of the cell unless last 
+!---------------------------------------------------------------------
+            do k=kstart+first,ksize-last
+
+!---------------------------------------------------------------------
+!     subtract A*lhs_vector(k-1) from lhs_vector(k)
+!     
+!     rhs(k) = rhs(k) - A*rhs(k-1)
+!---------------------------------------------------------------------
+               call matvec_sub(lhsa(1,1,k),  &
+     &                         rhs(1,i,j,k-1,c),rhs(1,i,j,k,c))
+
+!---------------------------------------------------------------------
+!     B(k) = B(k) - C(k-1)*A(k)
+!     call matmul_sub(aa,i,j,k,c,cc,i,j,k-1,c,bb,i,j,k,c)
+!---------------------------------------------------------------------
+               call matmul_sub(lhsa(1,1,k),  &
+     &                         lhsc(1,1,i,j,k-1,c),  &
+     &                         lhsb(1,1,k))
+
+!---------------------------------------------------------------------
+!     multiply c(i,j,k) by b_inverse and copy back to c
+!     multiply rhs(i,j,k) by b_inverse(i,j,k) and copy to rhs
+!---------------------------------------------------------------------
+               call binvcrhs( lhsb(1,1,k),  &
+     &                        lhsc(1,1,i,j,k,c),  &
+     &                        rhs(1,i,j,k,c) )
+
+            enddo
+
+!---------------------------------------------------------------------
+!     Now finish up special cases for last cell
+!---------------------------------------------------------------------
+            if (last .eq. 1) then
+
+!---------------------------------------------------------------------
+!     rhs(ksize) = rhs(ksize) - A*rhs(ksize-1)
+!---------------------------------------------------------------------
+               call matvec_sub(lhsa(1,1,ksize),  &
+     &                         rhs(1,i,j,ksize-1,c),rhs(1,i,j,ksize,c))
+
+!---------------------------------------------------------------------
+!     B(ksize) = B(ksize) - C(ksize-1)*A(ksize)
+!     call matmul_sub(aa,i,j,ksize,c,
+!     $              cc,i,j,ksize-1,c,bb,i,j,ksize,c)
+!---------------------------------------------------------------------
+               call matmul_sub(lhsa(1,1,ksize),  &
+     &                         lhsc(1,1,i,j,ksize-1,c),  &
+     &                         lhsb(1,1,ksize))
+
+!---------------------------------------------------------------------
+!     multiply rhs(ksize) by b_inverse(ksize) and copy to rhs
+!---------------------------------------------------------------------
+               call binvrhs( lhsb(1,1,ksize),  &
+     &                       rhs(1,i,j,ksize,c) )
+
+            endif
+         enddo
+      enddo
+
+
+      return
+      end
+      
+
+
+
+
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/z_solve_vec.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/z_solve_vec.f90
new file mode 100644
index 000000000..2491969a5
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/z_solve_vec.f90
@@ -0,0 +1,817 @@
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine z_solve
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     Performs line solves in Z direction by first factoring
+!     the block-tridiagonal matrix into an upper triangular matrix, 
+!     and then performing back substitution to solve for the unknown
+!     vectors of each line.  
+!     
+!     Make sure we treat elements zero to cell_size in the direction
+!     of the sweep.
+!---------------------------------------------------------------------
+
+      use bt_data
+      use mpinpb
+
+      implicit none
+
+      integer c, kstart, stage,  &
+     &     first, last, recv_id, error, r_status(MPI_STATUS_SIZE),  &
+     &     isize,jsize,ksize,send_id
+
+      kstart = 0
+
+      if (timeron) call timer_start(t_zsolve)
+!---------------------------------------------------------------------
+!     in our terminology stage is the number of the cell in the z-direction
+!     i.e. stage = 1 means the start of the line, stage = ncells means the end
+!---------------------------------------------------------------------
+      do stage = 1,ncells
+         c = slice(3,stage)
+         isize = cell_size(1,c) - 1
+         jsize = cell_size(2,c) - 1
+         ksize = cell_size(3,c) - 1
+!---------------------------------------------------------------------
+!     set last-cell flag
+!---------------------------------------------------------------------
+         if (stage .eq. ncells) then
+            last = 1
+         else
+            last = 0
+         endif
+
+         if (stage .eq. 1) then
+!---------------------------------------------------------------------
+!     This is the first cell, so solve without receiving data
+!---------------------------------------------------------------------
+            first = 1
+!            call lhsz(c)
+            call z_solve_cell(first,last,c)
+         else
+!---------------------------------------------------------------------
+!     Not the first cell of this line, so receive info from
+!     processor working on preceding cell
+!---------------------------------------------------------------------
+            first = 0
+            if (timeron) call timer_start(t_zcomm)
+            call z_receive_solve_info(recv_id,c)
+!---------------------------------------------------------------------
+!     overlap computations and communications
+!---------------------------------------------------------------------
+!            call lhsz(c)
+!---------------------------------------------------------------------
+!     wait for completion
+!---------------------------------------------------------------------
+            call mpi_wait(send_id,r_status,error)
+            call mpi_wait(recv_id,r_status,error)
+            if (timeron) call timer_stop(t_zcomm)
+!---------------------------------------------------------------------
+!     install C'(kstart-1) and rhs'(kstart-1) to be used in this cell
+!---------------------------------------------------------------------
+            call z_unpack_solve_info(c)
+            call z_solve_cell(first,last,c)
+         endif
+
+         if (last .eq. 0) call z_send_solve_info(send_id,c)
+      enddo
+
+!---------------------------------------------------------------------
+!     now perform backsubstitution in reverse direction
+!---------------------------------------------------------------------
+      do stage = ncells, 1, -1
+         c = slice(3,stage)
+         first = 0
+         last = 0
+         if (stage .eq. 1) first = 1
+         if (stage .eq. ncells) then
+            last = 1
+!---------------------------------------------------------------------
+!     last cell, so perform back substitution without waiting
+!---------------------------------------------------------------------
+            call z_backsubstitute(first, last,c)
+         else
+            if (timeron) call timer_start(t_zcomm)
+            call z_receive_backsub_info(recv_id,c)
+            call mpi_wait(send_id,r_status,error)
+            call mpi_wait(recv_id,r_status,error)
+            if (timeron) call timer_stop(t_zcomm)
+            call z_unpack_backsub_info(c)
+            call z_backsubstitute(first,last,c)
+         endif
+         if (first .eq. 0) call z_send_backsub_info(send_id,c)
+      enddo
+
+      if (timeron) call timer_stop(t_zsolve)
+
+      return
+      end
+      
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+      
+      subroutine z_unpack_solve_info(c)
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     unpack C'(-1) and rhs'(-1) for
+!     all i and j
+!---------------------------------------------------------------------
+
+      use bt_data
+      implicit none
+
+      integer i,j,m,n,ptr,c,kstart 
+
+      kstart = 0
+      ptr = 0
+      do j=0,JMAX-1
+         do i=0,IMAX-1
+            do m=1,BLOCK_SIZE
+               do n=1,BLOCK_SIZE
+                  lhsc(m,n,i,j,kstart-1,c) = out_buffer(ptr+n)
+               enddo
+               ptr = ptr+BLOCK_SIZE
+            enddo
+            do n=1,BLOCK_SIZE
+               rhs(n,i,j,kstart-1,c) = out_buffer(ptr+n)
+            enddo
+            ptr = ptr+BLOCK_SIZE
+         enddo
+      enddo
+
+      return
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+      
+      subroutine z_send_solve_info(send_id,c)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     pack up and send C'(kend) and rhs'(kend) for
+!     all i and j
+!---------------------------------------------------------------------
+
+      use bt_data
+      use mpinpb
+
+      implicit none
+
+      integer i,j,m,n,ksize,ptr,c,ip,jp
+      integer error,send_id,buffer_size
+
+      ksize = cell_size(3,c)-1
+      ip = cell_coord(1,c) - 1
+      jp = cell_coord(2,c) - 1
+      buffer_size=MAX_CELL_DIM*MAX_CELL_DIM*  &
+     &     (BLOCK_SIZE*BLOCK_SIZE + BLOCK_SIZE)
+
+!---------------------------------------------------------------------
+!     pack up buffer
+!---------------------------------------------------------------------
+      ptr = 0
+      do j=0,JMAX-1
+         do i=0,IMAX-1
+            do m=1,BLOCK_SIZE
+               do n=1,BLOCK_SIZE
+                  in_buffer(ptr+n) = lhsc(m,n,i,j,ksize,c)
+               enddo
+               ptr = ptr+BLOCK_SIZE
+            enddo
+            do n=1,BLOCK_SIZE
+               in_buffer(ptr+n) = rhs(n,i,j,ksize,c)
+            enddo
+            ptr = ptr+BLOCK_SIZE
+         enddo
+      enddo
+
+!---------------------------------------------------------------------
+!     send buffer 
+!---------------------------------------------------------------------
+      if (timeron) call timer_start(t_zcomm)
+      call mpi_isend(in_buffer, buffer_size,  &
+     &     dp_type, successor(3),  &
+     &     BOTTOM+ip+jp*NCELLS, comm_solve,  &
+     &     send_id,error)
+      if (timeron) call timer_stop(t_zcomm)
+
+      return
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine z_send_backsub_info(send_id,c)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     pack up and send U(kstart) for all i and j
+!---------------------------------------------------------------------
+
+      use bt_data
+      use mpinpb
+
+      implicit none
+
+      integer i,j,n,ptr,c,kstart,ip,jp
+      integer error,send_id,buffer_size
+
+!---------------------------------------------------------------------
+!     Send element 0 to previous processor
+!---------------------------------------------------------------------
+      kstart = 0
+      ip = cell_coord(1,c)-1
+      jp = cell_coord(2,c)-1
+      buffer_size=MAX_CELL_DIM*MAX_CELL_DIM*BLOCK_SIZE
+      ptr = 0
+      do j=0,JMAX-1
+         do i=0,IMAX-1
+            do n=1,BLOCK_SIZE
+               in_buffer(ptr+n) = rhs(n,i,j,kstart,c)
+            enddo
+            ptr = ptr+BLOCK_SIZE
+         enddo
+      enddo
+
+      if (timeron) call timer_start(t_zcomm)
+      call mpi_isend(in_buffer, buffer_size,  &
+     &     dp_type, predecessor(3),  &
+     &     TOP+ip+jp*NCELLS, comm_solve,  &
+     &     send_id,error)
+      if (timeron) call timer_stop(t_zcomm)
+
+      return
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine z_unpack_backsub_info(c)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     unpack U(ksize) for all i and j
+!---------------------------------------------------------------------
+
+      use bt_data
+      implicit none
+
+      integer i,j,n,ptr,c
+
+      ptr = 0
+      do j=0,JMAX-1
+         do i=0,IMAX-1
+            do n=1,BLOCK_SIZE
+               backsub_info(n,i,j,c) = out_buffer(ptr+n)
+            enddo
+            ptr = ptr+BLOCK_SIZE
+         enddo
+      enddo
+
+      return
+      end
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine z_receive_backsub_info(recv_id,c)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     post mpi receives
+!---------------------------------------------------------------------
+
+      use bt_data
+      use mpinpb
+
+      implicit none
+
+      integer error,recv_id,ip,jp,c,buffer_size
+      ip = cell_coord(1,c) - 1
+      jp = cell_coord(2,c) - 1
+      buffer_size=MAX_CELL_DIM*MAX_CELL_DIM*BLOCK_SIZE
+      call mpi_irecv(out_buffer, buffer_size,  &
+     &     dp_type, successor(3),  &
+     &     TOP+ip+jp*NCELLS, comm_solve,  &
+     &     recv_id, error)
+
+      return
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine z_receive_solve_info(recv_id,c)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     post mpi receives 
+!---------------------------------------------------------------------
+
+      use bt_data
+      use mpinpb
+
+      implicit none
+
+      integer ip,jp,recv_id,error,c,buffer_size
+      ip = cell_coord(1,c) - 1
+      jp = cell_coord(2,c) - 1
+      buffer_size=MAX_CELL_DIM*MAX_CELL_DIM*  &
+     &     (BLOCK_SIZE*BLOCK_SIZE + BLOCK_SIZE)
+      call mpi_irecv(out_buffer, buffer_size,  &
+     &     dp_type, predecessor(3),  &
+     &     BOTTOM+ip+jp*NCELLS, comm_solve,  &
+     &     recv_id, error)
+
+      return
+      end
+      
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine z_backsubstitute(first, last, c)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     back solve: if last cell, then generate U(ksize)=rhs(ksize)
+!     else assume U(ksize) was loaded by z_unpack_backsub_info
+!     so just use it
+!     after call u(kstart) will be sent to next cell
+!---------------------------------------------------------------------
+
+      use bt_data
+      implicit none
+
+      integer first, last, c, i, k
+      integer m,n,j,jsize,isize,ksize,kstart
+      
+      kstart = 0
+      isize = cell_size(1,c)-end(1,c)-1      
+      jsize = cell_size(2,c)-end(2,c)-1
+      ksize = cell_size(3,c)-1
+      if (last .eq. 0) then
+         do j=start(2,c),jsize
+            do i=start(1,c),isize
+!---------------------------------------------------------------------
+!     U(ksize) uses info from previous cell if not last cell
+!---------------------------------------------------------------------
+               do m=1,BLOCK_SIZE
+                  do n=1,BLOCK_SIZE
+                     rhs(m,i,j,ksize,c) = rhs(m,i,j,ksize,c)  &
+     &                    - lhsc(m,n,i,j,ksize,c)*  &
+     &                    backsub_info(n,i,j,c)
+                  enddo
+               enddo
+            enddo
+         enddo
+      endif
+      do k=ksize-1,kstart,-1
+         do j=start(2,c),jsize
+            do i=start(1,c),isize
+               do m=1,BLOCK_SIZE
+                  do n=1,BLOCK_SIZE
+                     rhs(m,i,j,k,c) = rhs(m,i,j,k,c)  &
+     &                    - lhsc(m,n,i,j,k,c)*rhs(n,i,j,k+1,c)
+                  enddo
+               enddo
+            enddo
+         enddo
+      enddo
+
+      return
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine z_solve_cell(first,last,c)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     performs Gaussian elimination on this cell.
+!     
+!     assumes that unpacking routines for non-first cells 
+!     preload C' and rhs' from previous cell.
+!     
+!     assumed send happens outside this routine, but that
+!     c'(KMAX) and rhs'(KMAX) will be sent to next cell.
+!---------------------------------------------------------------------
+
+      use bt_data
+      implicit none
+
+      double precision tmp1, tmp2, tmp3
+      integer first,last,c
+      integer i,j,k,m,n,isize,ksize,jsize,kstart
+
+      kstart = 0
+      isize = cell_size(1,c)-end(1,c)-1
+      jsize = cell_size(2,c)-end(2,c)-1
+      ksize = cell_size(3,c)-1
+
+!---------------------------------------------------------------------
+!     zero the left hand side for starters
+!     set diagonal values to 1. This is overkill, but convenient
+!---------------------------------------------------------------------
+      do i = 0, isize
+         do m = 1, 5
+            do n = 1, 5
+               lhsa(m,n,i,0) = 0.0d0
+               lhsb(m,n,i,0) = 0.0d0
+               lhsa(m,n,i,ksize) = 0.0d0
+               lhsb(m,n,i,ksize) = 0.0d0
+            enddo
+            lhsb(m,m,i,0) = 1.0d0
+            lhsb(m,m,i,ksize) = 1.0d0
+         enddo
+      enddo
+
+      do j=start(2,c),jsize 
+
+!---------------------------------------------------------------------
+!     This function computes the left hand side for the three z-factors 
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     Compute the indices for storing the block-diagonal matrix;
+!     determine c (labeled f) and s jacobians for cell c
+!---------------------------------------------------------------------
+
+         do k = start(3,c)-1, cell_size(3,c)-end(3,c)
+            do i=start(1,c),isize
+
+               tmp1 = 1.0d0 / u(1,i,j,k,c)
+               tmp2 = tmp1 * tmp1
+               tmp3 = tmp1 * tmp2
+
+               fjac(1,1,i,k) = 0.0d+00
+               fjac(1,2,i,k) = 0.0d+00
+               fjac(1,3,i,k) = 0.0d+00
+               fjac(1,4,i,k) = 1.0d+00
+               fjac(1,5,i,k) = 0.0d+00
+
+               fjac(2,1,i,k) = - ( u(2,i,j,k,c)*u(4,i,j,k,c) )  &
+     &              * tmp2 
+               fjac(2,2,i,k) = u(4,i,j,k,c) * tmp1
+               fjac(2,3,i,k) = 0.0d+00
+               fjac(2,4,i,k) = u(2,i,j,k,c) * tmp1
+               fjac(2,5,i,k) = 0.0d+00
+
+               fjac(3,1,i,k) = - ( u(3,i,j,k,c)*u(4,i,j,k,c) )  &
+     &              * tmp2 
+               fjac(3,2,i,k) = 0.0d+00
+               fjac(3,3,i,k) = u(4,i,j,k,c) * tmp1
+               fjac(3,4,i,k) = u(3,i,j,k,c) * tmp1
+               fjac(3,5,i,k) = 0.0d+00
+
+               fjac(4,1,i,k) = - (u(4,i,j,k,c)*u(4,i,j,k,c) * tmp2 )  &
+     &              + c2 * qs(i,j,k,c)
+               fjac(4,2,i,k) = - c2 *  u(2,i,j,k,c) * tmp1 
+               fjac(4,3,i,k) = - c2 *  u(3,i,j,k,c) * tmp1
+               fjac(4,4,i,k) = ( 2.0d+00 - c2 )  &
+     &              *  u(4,i,j,k,c) * tmp1 
+               fjac(4,5,i,k) = c2
+
+               fjac(5,1,i,k) = ( c2 * 2.0d0 * qs(i,j,k,c)  &
+     &              - c1 * ( u(5,i,j,k,c) * tmp1 ) )  &
+     &              * ( u(4,i,j,k,c) * tmp1 )
+               fjac(5,2,i,k) = - c2 * ( u(2,i,j,k,c)*u(4,i,j,k,c) )  &
+     &              * tmp2 
+               fjac(5,3,i,k) = - c2 * ( u(3,i,j,k,c)*u(4,i,j,k,c) )  &
+     &              * tmp2
+               fjac(5,4,i,k) = c1 * ( u(5,i,j,k,c) * tmp1 )  &
+     &              - c2 * ( qs(i,j,k,c)  &
+     &              + u(4,i,j,k,c)*u(4,i,j,k,c) * tmp2 )
+               fjac(5,5,i,k) = c1 * u(4,i,j,k,c) * tmp1
+
+               njac(1,1,i,k) = 0.0d+00
+               njac(1,2,i,k) = 0.0d+00
+               njac(1,3,i,k) = 0.0d+00
+               njac(1,4,i,k) = 0.0d+00
+               njac(1,5,i,k) = 0.0d+00
+
+               njac(2,1,i,k) = - c3c4 * tmp2 * u(2,i,j,k,c)
+               njac(2,2,i,k) =   c3c4 * tmp1
+               njac(2,3,i,k) =   0.0d+00
+               njac(2,4,i,k) =   0.0d+00
+               njac(2,5,i,k) =   0.0d+00
+
+               njac(3,1,i,k) = - c3c4 * tmp2 * u(3,i,j,k,c)
+               njac(3,2,i,k) =   0.0d+00
+               njac(3,3,i,k) =   c3c4 * tmp1
+               njac(3,4,i,k) =   0.0d+00
+               njac(3,5,i,k) =   0.0d+00
+
+               njac(4,1,i,k) = - con43 * c3c4 * tmp2 * u(4,i,j,k,c)
+               njac(4,2,i,k) =   0.0d+00
+               njac(4,3,i,k) =   0.0d+00
+               njac(4,4,i,k) =   con43 * c3 * c4 * tmp1
+               njac(4,5,i,k) =   0.0d+00
+
+               njac(5,1,i,k) = - (  c3c4  &
+     &              - c1345 ) * tmp3 * (u(2,i,j,k,c)**2)  &
+     &              - ( c3c4 - c1345 ) * tmp3 * (u(3,i,j,k,c)**2)  &
+     &              - ( con43 * c3c4  &
+     &              - c1345 ) * tmp3 * (u(4,i,j,k,c)**2)  &
+     &              - c1345 * tmp2 * u(5,i,j,k,c)
+
+               njac(5,2,i,k) = (  c3c4 - c1345 ) * tmp2 * u(2,i,j,k,c)
+               njac(5,3,i,k) = (  c3c4 - c1345 ) * tmp2 * u(3,i,j,k,c)
+               njac(5,4,i,k) = ( con43 * c3c4  &
+     &              - c1345 ) * tmp2 * u(4,i,j,k,c)
+               njac(5,5,i,k) = ( c1345 )* tmp1
+
+
+            enddo
+         enddo
+
+!---------------------------------------------------------------------
+!     now jacobians set, so form left hand side in z direction
+!---------------------------------------------------------------------
+         do k = start(3,c), ksize-end(3,c)
+            do i=start(1,c),isize
+
+               tmp1 = dt * tz1
+               tmp2 = dt * tz2
+
+               lhsa(1,1,i,k) = - tmp2 * fjac(1,1,i,k-1)  &
+     &              - tmp1 * njac(1,1,i,k-1)  &
+     &              - tmp1 * dz1 
+               lhsa(1,2,i,k) = - tmp2 * fjac(1,2,i,k-1)  &
+     &              - tmp1 * njac(1,2,i,k-1)
+               lhsa(1,3,i,k) = - tmp2 * fjac(1,3,i,k-1)  &
+     &              - tmp1 * njac(1,3,i,k-1)
+               lhsa(1,4,i,k) = - tmp2 * fjac(1,4,i,k-1)  &
+     &              - tmp1 * njac(1,4,i,k-1)
+               lhsa(1,5,i,k) = - tmp2 * fjac(1,5,i,k-1)  &
+     &              - tmp1 * njac(1,5,i,k-1)
+
+               lhsa(2,1,i,k) = - tmp2 * fjac(2,1,i,k-1)  &
+     &              - tmp1 * njac(2,1,i,k-1)
+               lhsa(2,2,i,k) = - tmp2 * fjac(2,2,i,k-1)  &
+     &              - tmp1 * njac(2,2,i,k-1)  &
+     &              - tmp1 * dz2
+               lhsa(2,3,i,k) = - tmp2 * fjac(2,3,i,k-1)  &
+     &              - tmp1 * njac(2,3,i,k-1)
+               lhsa(2,4,i,k) = - tmp2 * fjac(2,4,i,k-1)  &
+     &              - tmp1 * njac(2,4,i,k-1)
+               lhsa(2,5,i,k) = - tmp2 * fjac(2,5,i,k-1)  &
+     &              - tmp1 * njac(2,5,i,k-1)
+
+               lhsa(3,1,i,k) = - tmp2 * fjac(3,1,i,k-1)  &
+     &              - tmp1 * njac(3,1,i,k-1)
+               lhsa(3,2,i,k) = - tmp2 * fjac(3,2,i,k-1)  &
+     &              - tmp1 * njac(3,2,i,k-1)
+               lhsa(3,3,i,k) = - tmp2 * fjac(3,3,i,k-1)  &
+     &              - tmp1 * njac(3,3,i,k-1)  &
+     &              - tmp1 * dz3 
+               lhsa(3,4,i,k) = - tmp2 * fjac(3,4,i,k-1)  &
+     &              - tmp1 * njac(3,4,i,k-1)
+               lhsa(3,5,i,k) = - tmp2 * fjac(3,5,i,k-1)  &
+     &              - tmp1 * njac(3,5,i,k-1)
+
+               lhsa(4,1,i,k) = - tmp2 * fjac(4,1,i,k-1)  &
+     &              - tmp1 * njac(4,1,i,k-1)
+               lhsa(4,2,i,k) = - tmp2 * fjac(4,2,i,k-1)  &
+     &              - tmp1 * njac(4,2,i,k-1)
+               lhsa(4,3,i,k) = - tmp2 * fjac(4,3,i,k-1)  &
+     &              - tmp1 * njac(4,3,i,k-1)
+               lhsa(4,4,i,k) = - tmp2 * fjac(4,4,i,k-1)  &
+     &              - tmp1 * njac(4,4,i,k-1)  &
+     &              - tmp1 * dz4
+               lhsa(4,5,i,k) = - tmp2 * fjac(4,5,i,k-1)  &
+     &              - tmp1 * njac(4,5,i,k-1)
+
+               lhsa(5,1,i,k) = - tmp2 * fjac(5,1,i,k-1)  &
+     &              - tmp1 * njac(5,1,i,k-1)
+               lhsa(5,2,i,k) = - tmp2 * fjac(5,2,i,k-1)  &
+     &              - tmp1 * njac(5,2,i,k-1)
+               lhsa(5,3,i,k) = - tmp2 * fjac(5,3,i,k-1)  &
+     &              - tmp1 * njac(5,3,i,k-1)
+               lhsa(5,4,i,k) = - tmp2 * fjac(5,4,i,k-1)  &
+     &              - tmp1 * njac(5,4,i,k-1)
+               lhsa(5,5,i,k) = - tmp2 * fjac(5,5,i,k-1)  &
+     &              - tmp1 * njac(5,5,i,k-1)  &
+     &              - tmp1 * dz5
+
+               lhsb(1,1,i,k) = 1.0d+00  &
+     &              + tmp1 * 2.0d+00 * njac(1,1,i,k)  &
+     &              + tmp1 * 2.0d+00 * dz1
+               lhsb(1,2,i,k) = tmp1 * 2.0d+00 * njac(1,2,i,k)
+               lhsb(1,3,i,k) = tmp1 * 2.0d+00 * njac(1,3,i,k)
+               lhsb(1,4,i,k) = tmp1 * 2.0d+00 * njac(1,4,i,k)
+               lhsb(1,5,i,k) = tmp1 * 2.0d+00 * njac(1,5,i,k)
+
+               lhsb(2,1,i,k) = tmp1 * 2.0d+00 * njac(2,1,i,k)
+               lhsb(2,2,i,k) = 1.0d+00  &
+     &              + tmp1 * 2.0d+00 * njac(2,2,i,k)  &
+     &              + tmp1 * 2.0d+00 * dz2
+               lhsb(2,3,i,k) = tmp1 * 2.0d+00 * njac(2,3,i,k)
+               lhsb(2,4,i,k) = tmp1 * 2.0d+00 * njac(2,4,i,k)
+               lhsb(2,5,i,k) = tmp1 * 2.0d+00 * njac(2,5,i,k)
+
+               lhsb(3,1,i,k) = tmp1 * 2.0d+00 * njac(3,1,i,k)
+               lhsb(3,2,i,k) = tmp1 * 2.0d+00 * njac(3,2,i,k)
+               lhsb(3,3,i,k) = 1.0d+00  &
+     &              + tmp1 * 2.0d+00 * njac(3,3,i,k)  &
+     &              + tmp1 * 2.0d+00 * dz3
+               lhsb(3,4,i,k) = tmp1 * 2.0d+00 * njac(3,4,i,k)
+               lhsb(3,5,i,k) = tmp1 * 2.0d+00 * njac(3,5,i,k)
+
+               lhsb(4,1,i,k) = tmp1 * 2.0d+00 * njac(4,1,i,k)
+               lhsb(4,2,i,k) = tmp1 * 2.0d+00 * njac(4,2,i,k)
+               lhsb(4,3,i,k) = tmp1 * 2.0d+00 * njac(4,3,i,k)
+               lhsb(4,4,i,k) = 1.0d+00  &
+     &              + tmp1 * 2.0d+00 * njac(4,4,i,k)  &
+     &              + tmp1 * 2.0d+00 * dz4
+               lhsb(4,5,i,k) = tmp1 * 2.0d+00 * njac(4,5,i,k)
+
+               lhsb(5,1,i,k) = tmp1 * 2.0d+00 * njac(5,1,i,k)
+               lhsb(5,2,i,k) = tmp1 * 2.0d+00 * njac(5,2,i,k)
+               lhsb(5,3,i,k) = tmp1 * 2.0d+00 * njac(5,3,i,k)
+               lhsb(5,4,i,k) = tmp1 * 2.0d+00 * njac(5,4,i,k)
+               lhsb(5,5,i,k) = 1.0d+00  &
+     &              + tmp1 * 2.0d+00 * njac(5,5,i,k)  &
+     &              + tmp1 * 2.0d+00 * dz5
+
+               lhsc(1,1,i,j,k,c) =  tmp2 * fjac(1,1,i,k+1)  &
+     &              - tmp1 * njac(1,1,i,k+1)  &
+     &              - tmp1 * dz1
+               lhsc(1,2,i,j,k,c) =  tmp2 * fjac(1,2,i,k+1)  &
+     &              - tmp1 * njac(1,2,i,k+1)
+               lhsc(1,3,i,j,k,c) =  tmp2 * fjac(1,3,i,k+1)  &
+     &              - tmp1 * njac(1,3,i,k+1)
+               lhsc(1,4,i,j,k,c) =  tmp2 * fjac(1,4,i,k+1)  &
+     &              - tmp1 * njac(1,4,i,k+1)
+               lhsc(1,5,i,j,k,c) =  tmp2 * fjac(1,5,i,k+1)  &
+     &              - tmp1 * njac(1,5,i,k+1)
+
+               lhsc(2,1,i,j,k,c) =  tmp2 * fjac(2,1,i,k+1)  &
+     &              - tmp1 * njac(2,1,i,k+1)
+               lhsc(2,2,i,j,k,c) =  tmp2 * fjac(2,2,i,k+1)  &
+     &              - tmp1 * njac(2,2,i,k+1)  &
+     &              - tmp1 * dz2
+               lhsc(2,3,i,j,k,c) =  tmp2 * fjac(2,3,i,k+1)  &
+     &              - tmp1 * njac(2,3,i,k+1)
+               lhsc(2,4,i,j,k,c) =  tmp2 * fjac(2,4,i,k+1)  &
+     &              - tmp1 * njac(2,4,i,k+1)
+               lhsc(2,5,i,j,k,c) =  tmp2 * fjac(2,5,i,k+1)  &
+     &              - tmp1 * njac(2,5,i,k+1)
+
+               lhsc(3,1,i,j,k,c) =  tmp2 * fjac(3,1,i,k+1)  &
+     &              - tmp1 * njac(3,1,i,k+1)
+               lhsc(3,2,i,j,k,c) =  tmp2 * fjac(3,2,i,k+1)  &
+     &              - tmp1 * njac(3,2,i,k+1)
+               lhsc(3,3,i,j,k,c) =  tmp2 * fjac(3,3,i,k+1)  &
+     &              - tmp1 * njac(3,3,i,k+1)  &
+     &              - tmp1 * dz3
+               lhsc(3,4,i,j,k,c) =  tmp2 * fjac(3,4,i,k+1)  &
+     &              - tmp1 * njac(3,4,i,k+1)
+               lhsc(3,5,i,j,k,c) =  tmp2 * fjac(3,5,i,k+1)  &
+     &              - tmp1 * njac(3,5,i,k+1)
+
+               lhsc(4,1,i,j,k,c) =  tmp2 * fjac(4,1,i,k+1)  &
+     &              - tmp1 * njac(4,1,i,k+1)
+               lhsc(4,2,i,j,k,c) =  tmp2 * fjac(4,2,i,k+1)  &
+     &              - tmp1 * njac(4,2,i,k+1)
+               lhsc(4,3,i,j,k,c) =  tmp2 * fjac(4,3,i,k+1)  &
+     &              - tmp1 * njac(4,3,i,k+1)
+               lhsc(4,4,i,j,k,c) =  tmp2 * fjac(4,4,i,k+1)  &
+     &              - tmp1 * njac(4,4,i,k+1)  &
+     &              - tmp1 * dz4
+               lhsc(4,5,i,j,k,c) =  tmp2 * fjac(4,5,i,k+1)  &
+     &              - tmp1 * njac(4,5,i,k+1)
+
+               lhsc(5,1,i,j,k,c) =  tmp2 * fjac(5,1,i,k+1)  &
+     &              - tmp1 * njac(5,1,i,k+1)
+               lhsc(5,2,i,j,k,c) =  tmp2 * fjac(5,2,i,k+1)  &
+     &              - tmp1 * njac(5,2,i,k+1)
+               lhsc(5,3,i,j,k,c) =  tmp2 * fjac(5,3,i,k+1)  &
+     &              - tmp1 * njac(5,3,i,k+1)
+               lhsc(5,4,i,j,k,c) =  tmp2 * fjac(5,4,i,k+1)  &
+     &              - tmp1 * njac(5,4,i,k+1)
+               lhsc(5,5,i,j,k,c) =  tmp2 * fjac(5,5,i,k+1)  &
+     &              - tmp1 * njac(5,5,i,k+1)  &
+     &              - tmp1 * dz5
+
+            enddo
+         enddo
+
+
+!---------------------------------------------------------------------
+!     outer most do loops - sweeping in i direction
+!---------------------------------------------------------------------
+         if (first .eq. 1) then 
+
+!---------------------------------------------------------------------
+!     multiply c(i,j,kstart) by b_inverse and copy back to c
+!     multiply rhs(kstart) by b_inverse(kstart) and copy to rhs
+!---------------------------------------------------------------------
+!dir$ ivdep
+            do i=start(1,c),isize
+               call binvcrhs( lhsb(1,1,i,kstart),  &
+     &                        lhsc(1,1,i,j,kstart,c),  &
+     &                        rhs(1,i,j,kstart,c) )
+            enddo
+
+         endif
+
+!---------------------------------------------------------------------
+!     begin inner most do loop
+!     do all the elements of the cell unless last 
+!---------------------------------------------------------------------
+         do k=kstart+first,ksize-last
+!dir$ ivdep
+            do i=start(1,c),isize
+
+!---------------------------------------------------------------------
+!     subtract A*lhs_vector(k-1) from lhs_vector(k)
+!     
+!     rhs(k) = rhs(k) - A*rhs(k-1)
+!---------------------------------------------------------------------
+               call matvec_sub(lhsa(1,1,i,k),  &
+     &                         rhs(1,i,j,k-1,c),rhs(1,i,j,k,c))
+
+!---------------------------------------------------------------------
+!     B(k) = B(k) - C(k-1)*A(k)
+!     call matmul_sub(aa,i,j,k,c,cc,i,j,k-1,c,bb,i,j,k,c)
+!---------------------------------------------------------------------
+               call matmul_sub(lhsa(1,1,i,k),  &
+     &                         lhsc(1,1,i,j,k-1,c),  &
+     &                         lhsb(1,1,i,k))
+
+!---------------------------------------------------------------------
+!     multiply c(i,j,k) by b_inverse and copy back to c
+!     multiply rhs(i,j,1) by b_inverse(i,j,1) and copy to rhs
+!---------------------------------------------------------------------
+               call binvcrhs( lhsb(1,1,i,k),  &
+     &                        lhsc(1,1,i,j,k,c),  &
+     &                        rhs(1,i,j,k,c) )
+
+            enddo
+         enddo
+
+!---------------------------------------------------------------------
+!     Now finish up special cases for last cell
+!---------------------------------------------------------------------
+         if (last .eq. 1) then
+
+!dir$ ivdep
+            do i=start(1,c),isize
+!---------------------------------------------------------------------
+!     rhs(ksize) = rhs(ksize) - A*rhs(ksize-1)
+!---------------------------------------------------------------------
+               call matvec_sub(lhsa(1,1,i,ksize),  &
+     &                         rhs(1,i,j,ksize-1,c),rhs(1,i,j,ksize,c))
+
+!---------------------------------------------------------------------
+!     B(ksize) = B(ksize) - C(ksize-1)*A(ksize)
+!     call matmul_sub(aa,i,j,ksize,c,
+!     $              cc,i,j,ksize-1,c,bb,i,j,ksize,c)
+!---------------------------------------------------------------------
+               call matmul_sub(lhsa(1,1,i,ksize),  &
+     &                         lhsc(1,1,i,j,ksize-1,c),  &
+     &                         lhsb(1,1,i,ksize))
+
+!---------------------------------------------------------------------
+!     multiply rhs(ksize) by b_inverse(ksize) and copy to rhs
+!---------------------------------------------------------------------
+               call binvrhs( lhsb(1,1,i,ksize),  &
+     &                       rhs(1,i,j,ksize,c) )
+            enddo
+
+         endif
+      enddo
+
+
+      return
+      end
+      
+
+
+
+
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/CG/Makefile b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/CG/Makefile
new file mode 100644
index 000000000..78daa13ee
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/CG/Makefile
@@ -0,0 +1,28 @@
+SHELL=/bin/sh
+BENCHMARK=cg
+BENCHMARKU=CG
+
+include ../config/make.def
+
+OBJS = cg.o cg_data.o mpinpb.o ${COMMON}/print_results.o  \
+       ${COMMON}/get_active_nprocs.o \
+       ${COMMON}/${RAND}.o ${COMMON}/timers.o
+
+include ../sys/make.common
+
+${PROGRAM}: config ${OBJS}
+	${FLINK} ${FLINKFLAGS} -o ${PROGRAM} ${OBJS} ${FMPI_LIB}
+
+.f90.o:
+	${FCOMPILE} $<
+
+cg.o:		cg.f90  cg_data.o mpinpb.o
+cg_data.o:	cg_data.f90 mpinpb.o npbparams.h
+mpinpb.o:	mpinpb.f90
+
+clean:
+	- rm -f *.o *.mod *~ 
+	- rm -f npbparams.h core
+
+
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/CG/cg.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/CG/cg.f90
new file mode 100644
index 000000000..ff94f5f80
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/CG/cg.f90
@@ -0,0 +1,1541 @@
+!-------------------------------------------------------------------------!
+!                                                                         !
+!        N  A  S     P A R A L L E L     B E N C H M A R K S  3.4         !
+!                                                                         !
+!                                   C G                                   !
+!                                                                         !
+!-------------------------------------------------------------------------!
+!                                                                         !
+!    This benchmark is part of the NAS Parallel Benchmark 3.4 suite.      !
+!    It is described in NAS Technical Reports 95-020 and 02-007           !
+!                                                                         !
+!    Permission to use, copy, distribute and modify this software         !
+!    for any purpose with or without fee is hereby granted.  We           !
+!    request, however, that all derived work reference the NAS            !
+!    Parallel Benchmarks 3.4. This software is provided "as is"           !
+!    without express or implied warranty.                                 !
+!                                                                         !
+!    Information on NPB 3.4, including the technical report, the          !
+!    original specifications, source code, results and information        !
+!    on how to submit new results, is available at:                       !
+!                                                                         !
+!           http://www.nas.nasa.gov/Software/NPB/                         !
+!                                                                         !
+!    Send comments or suggestions to  npb@nas.nasa.gov                    !
+!                                                                         !
+!          NAS Parallel Benchmarks Group                                  !
+!          NASA Ames Research Center                                      !
+!          Mail Stop: T27A-1                                              !
+!          Moffett Field, CA   94035-1000                                 !
+!                                                                         !
+!          E-mail:  npb@nas.nasa.gov                                      !
+!          Fax:     (650) 604-3957                                        !
+!                                                                         !
+!-------------------------------------------------------------------------!
+
+
+!---------------------------------------------------------------------
+!
+! Authors: M. Yarrow
+!          C. Kuszmaul
+!          R. F. Van der Wijngaart
+!          H. Jin
+!
+!---------------------------------------------------------------------
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+      program cg
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      use, intrinsic :: ieee_arithmetic, only : ieee_is_nan
+
+      use cg_data
+      use mpinpb
+      use timing
+
+      implicit none
+
+      integer status(MPI_STATUS_SIZE), request, ierr
+
+      integer            i, j, k, it
+
+      double precision   zeta, randlc
+      external           randlc
+      double precision   rnorm
+      double precision   norm_temp1(2), norm_temp2(2)
+
+      double precision   t, tmax, mflops
+      external           timer_read
+      double precision   timer_read
+      character          class
+      logical            verified
+      double precision   zeta_verify_value, epsilon, err
+
+      double precision tsum(t_last+2), t1(t_last+2),  &
+     &                 tming(t_last+2), tmaxg(t_last+2)
+      character        t_recs(t_last+2)*8
+
+      data t_recs/'total', 'conjg', 'rcomm', 'ncomm',  &
+     &            ' totcomp', ' totcomm'/
+
+
+!---------------------------------------------------------------------
+!  Set up mpi initialization and number of proc testing
+!---------------------------------------------------------------------
+      call initialize_mpi
+      if (.not. active) goto 999
+
+!---------------------------------------------------------------------
+!  Set up processor info, such as whether the number of procs is square, etc.
+!---------------------------------------------------------------------
+      call setup_proc_info( )
+
+!---------------------------------------------------------------------
+!  Allocate space for work arrays
+!---------------------------------------------------------------------
+      call alloc_space( )
+
+
+      if( na .eq. 1400 .and.  &
+     &    nonzer .eq. 7 .and.  &
+     &    niter .eq. 15 .and.  &
+     &    shift .eq. 10.d0 ) then
+         class = 'S'
+         zeta_verify_value = 8.5971775078648d0
+      else if( na .eq. 7000 .and.  &
+     &         nonzer .eq. 8 .and.  &
+     &         niter .eq. 15 .and.  &
+     &         shift .eq. 12.d0 ) then
+         class = 'W'
+         zeta_verify_value = 10.362595087124d0
+      else if( na .eq. 14000 .and.  &
+     &         nonzer .eq. 11 .and.  &
+     &         niter .eq. 15 .and.  &
+     &         shift .eq. 20.d0 ) then
+         class = 'A'
+         zeta_verify_value = 17.130235054029d0
+      else if( na .eq. 75000 .and.  &
+     &         nonzer .eq. 13 .and.  &
+     &         niter .eq. 75 .and.  &
+     &         shift .eq. 60.d0 ) then
+         class = 'B'
+         zeta_verify_value = 22.712745482631d0
+      else if( na .eq. 150000 .and.  &
+     &         nonzer .eq. 15 .and.  &
+     &         niter .eq. 75 .and.  &
+     &         shift .eq. 110.d0 ) then
+         class = 'C'
+         zeta_verify_value = 28.973605592845d0
+      else if( na .eq. 1500000 .and.  &
+     &         nonzer .eq. 21 .and.  &
+     &         niter .eq. 100 .and.  &
+     &         shift .eq. 500.d0 ) then
+         class = 'D'
+         zeta_verify_value = 52.514532105794d0
+      else if( na .eq. 9000000 .and.  &
+     &         nonzer .eq. 26 .and.  &
+     &         niter .eq. 100 .and.  &
+     &         shift .eq. 1.5d3 ) then
+         class = 'E'
+         zeta_verify_value = 77.522164599383d0
+      else if( na .eq. 54000000 .and.  &
+     &         nonzer .eq. 31 .and.  &
+     &         niter .eq. 100 .and.  &
+     &         shift .eq. 5.0d3 ) then
+         class = 'F'
+         zeta_verify_value = 107.3070826433d0
+      else
+         class = 'U'
+      endif
+
+      if( me .eq. root )then
+         write( *,1000 ) 
+         write( *,1001 ) na, class
+         write( *,1002 ) niter
+         write( *,1003 ) nonzer
+         write( *,1004 ) shift
+         write( *,1005 ) total_nodes
+         if (total_nodes .ne. nprocs) write (*, 1006) nprocs
+
+ 1000 format(//,' NAS Parallel Benchmarks 3.4 -- CG Benchmark', /)
+ 1001 format(' Size: ', i10, '  (class ', a, ')' )
+ 1002 format(' Iterations: ', i5 )
+ 1003 format(' Number of nonzeroes per row: ', i8)
+ 1004 format(' Eigenvalue shift: ', f9.3)
+ 1005 format(' Total number of processes: ', i6)
+ 1006 format(' WARNING: Number of processes is not power of two (',  &
+     &       i0, ' active)')
+      endif
+
+
+!---------------------------------------------------------------------
+!  Set up partition's submatrix info: firstcol, lastcol, firstrow, lastrow
+!---------------------------------------------------------------------
+      call setup_submatrix_info( )
+
+
+      do i = 1, t_last
+         call timer_clear(i)
+      end do
+
+!---------------------------------------------------------------------
+!  Initialize random number generator
+!---------------------------------------------------------------------
+      tran    = 314159265.0D0
+      amult   = 1220703125.0D0
+      zeta    = randlc( tran, amult )
+
+!---------------------------------------------------------------------
+!  Set up partition's sparse random matrix for given class size
+!---------------------------------------------------------------------
+      call makea(na, nz, a, colidx, rowstr, nonzer,  &
+     &           firstrow, lastrow, firstcol, lastcol,  &
+     &           rcond, arow, acol, aelt, v, iv, shift)
+
+
+
+!---------------------------------------------------------------------
+!  Note: as a result of the above call to makea:
+!        values of j used in indexing rowstr go from 1 --> lastrow-firstrow+1
+!        values of colidx which are col indexes go from firstcol --> lastcol
+!        So:
+!        Shift the col index vals from actual (firstcol --> lastcol ) 
+!        to local, i.e., (1 --> lastcol-firstcol+1)
+!---------------------------------------------------------------------
+      do j=1,lastrow-firstrow+1
+         do k=rowstr(j),rowstr(j+1)-1
+            colidx(k) = colidx(k) - firstcol + 1
+         enddo
+      enddo
+
+!---------------------------------------------------------------------
+!  set starting vector to (1, 1, .... 1)
+!---------------------------------------------------------------------
+      do i = 1, naa+1
+         x(i) = 1.0D0
+      enddo
+
+      zeta  = 0.0d0
+
+!---------------------------------------------------------------------
+!---->
+!  Do one iteration untimed to init all code and data page tables
+!---->                    (then reinit, start timing, to niter its)
+!---------------------------------------------------------------------
+      do it = 1, 1
+
+!---------------------------------------------------------------------
+!  The call to the conjugate gradient routine:
+!---------------------------------------------------------------------
+         call conj_grad ( rnorm )
+
+!---------------------------------------------------------------------
+!  zeta = shift + 1/(x.z)
+!  So, first: (x.z)
+!  Also, find norm of z
+!  So, first: (z.z)
+!---------------------------------------------------------------------
+         norm_temp1(1) = 0.0d0
+         norm_temp1(2) = 0.0d0
+         do j=1, lastcol-firstcol+1
+            norm_temp1(1) = norm_temp1(1) + x(j)*z(j)
+            norm_temp1(2) = norm_temp1(2) + z(j)*z(j)
+         enddo
+
+         if (timeron) call timer_start(t_ncomm)
+         do i = 1, l2npcols
+            call mpi_irecv( norm_temp2,  &
+     &                      2,  &
+     &                      dp_type,  &
+     &                      reduce_exch_proc(i),  &
+     &                      i,  &
+     &                      comm_solve,  &
+     &                      request,  &
+     &                      ierr )
+            call mpi_send(  norm_temp1,  &
+     &                      2,  &
+     &                      dp_type,  &
+     &                      reduce_exch_proc(i),  &
+     &                      i,  &
+     &                      comm_solve,  &
+     &                      ierr )
+            call mpi_wait( request, status, ierr )
+
+            norm_temp1(1) = norm_temp1(1) + norm_temp2(1)
+            norm_temp1(2) = norm_temp1(2) + norm_temp2(2)
+         enddo
+         if (timeron) call timer_stop(t_ncomm)
+
+         norm_temp1(2) = 1.0d0 / sqrt( norm_temp1(2) )
+
+
+!---------------------------------------------------------------------
+!  Normalize z to obtain x
+!---------------------------------------------------------------------
+         do j=1, lastcol-firstcol+1      
+            x(j) = norm_temp1(2)*z(j)    
+         enddo                           
+
+
+      enddo                              ! end of do one iteration untimed
+
+
+!---------------------------------------------------------------------
+!  set starting vector to (1, 1, .... 1)
+!---------------------------------------------------------------------
+!
+!  NOTE: a questionable limit on size:  should this be na/num_proc_cols+1 ?
+!
+      do i = 1, naa+1
+         x(i) = 1.0d0
+      enddo
+
+      zeta  = 0.0d0
+
+!---------------------------------------------------------------------
+!  Synchronize and start timing
+!---------------------------------------------------------------------
+      do i = 1, t_last
+         call timer_clear(i)
+      end do
+      call mpi_barrier( comm_solve, ierr )
+
+      call timer_clear( 1 )
+      call timer_start( 1 )
+
+!---------------------------------------------------------------------
+!---->
+!  Main Iteration for inverse power method
+!---->
+!---------------------------------------------------------------------
+      do it = 1, niter
+
+!---------------------------------------------------------------------
+!  The call to the conjugate gradient routine:
+!---------------------------------------------------------------------
+         call conj_grad ( rnorm )
+
+
+!---------------------------------------------------------------------
+!  zeta = shift + 1/(x.z)
+!  So, first: (x.z)
+!  Also, find norm of z
+!  So, first: (z.z)
+!---------------------------------------------------------------------
+         norm_temp1(1) = 0.0d0
+         norm_temp1(2) = 0.0d0
+         do j=1, lastcol-firstcol+1
+            norm_temp1(1) = norm_temp1(1) + x(j)*z(j)
+            norm_temp1(2) = norm_temp1(2) + z(j)*z(j)
+         enddo
+
+         if (timeron) call timer_start(t_ncomm)
+         do i = 1, l2npcols
+            call mpi_irecv( norm_temp2,  &
+     &                      2,  &
+     &                      dp_type,  &
+     &                      reduce_exch_proc(i),  &
+     &                      i,  &
+     &                      comm_solve,  &
+     &                      request,  &
+     &                      ierr )
+            call mpi_send(  norm_temp1,  &
+     &                      2,  &
+     &                      dp_type,  &
+     &                      reduce_exch_proc(i),  &
+     &                      i,  &
+     &                      comm_solve,  &
+     &                      ierr )
+            call mpi_wait( request, status, ierr )
+
+            norm_temp1(1) = norm_temp1(1) + norm_temp2(1)
+            norm_temp1(2) = norm_temp1(2) + norm_temp2(2)
+         enddo
+         if (timeron) call timer_stop(t_ncomm)
+
+         norm_temp1(2) = 1.0d0 / sqrt( norm_temp1(2) )
+
+
+         if( me .eq. root )then
+            zeta = shift + 1.0d0 / norm_temp1(1)
+            if( it .eq. 1 ) write( *,9000 )
+            write( *,9001 ) it, rnorm, zeta
+         endif
+ 9000 format( /,'   iteration           ||r||                 zeta' )
+ 9001 format( 4x, i5, 6x, e21.14, f20.13 )
+
+!---------------------------------------------------------------------
+!  Normalize z to obtain x
+!---------------------------------------------------------------------
+         do j=1, lastcol-firstcol+1      
+            x(j) = norm_temp1(2)*z(j)    
+         enddo                           
+
+
+      enddo                              ! end of main iter inv pow meth
+
+      call timer_stop( 1 )
+
+!---------------------------------------------------------------------
+!  End of timed section
+!---------------------------------------------------------------------
+
+      t = timer_read( 1 )
+
+      call mpi_reduce( t,  &
+     &                 tmax,  &
+     &                 1,  &
+     &                 dp_type,  &
+     &                 MPI_MAX,  &
+     &                 root,  &
+     &                 comm_solve,  &
+     &                 ierr )
+
+      if( me .eq. root )then
+         write(*,100)
+ 100     format(' Benchmark completed ')
+
+         epsilon = 1.d-10
+         if (class .ne. 'U') then
+
+            err = abs( zeta - zeta_verify_value )/zeta_verify_value
+            if( (.not.ieee_is_nan(err)) .and. (err .le. epsilon) ) then
+               verified = .TRUE.
+               write(*, 200)
+               write(*, 201) zeta
+               write(*, 202) err
+ 200           format(' VERIFICATION SUCCESSFUL ')
+ 201           format(' Zeta is    ', E20.13)
+ 202           format(' Error is   ', E20.13)
+            else
+               verified = .FALSE.
+               write(*, 300) 
+               write(*, 301) zeta
+               write(*, 302) zeta_verify_value
+ 300           format(' VERIFICATION FAILED')
+ 301           format(' Zeta                ', E20.13)
+ 302           format(' The correct zeta is ', E20.13)
+            endif
+         else
+            verified = .FALSE.
+            write (*, 400)
+            write (*, 401)
+            write (*, 201) zeta
+ 400        format(' Problem size unknown')
+ 401        format(' NO VERIFICATION PERFORMED')
+         endif
+
+
+         if( tmax .ne. 0. ) then
+            mflops = 1.0d-6 * 2*niter*dble( na )  &
+     &                  * ( 3.+nonzer*dble(nonzer+1)  &
+     &                    + 25.*(5.+nonzer*dble(nonzer+1))  &
+     &                    + 3. ) / tmax
+         else
+            mflops = 0.d0
+         endif
+
+         call print_results('CG', class, na, 0, 0,  &
+     &                      niter, nprocs, total_nodes, tmax,  &
+     &                      mflops, '          floating point',  &
+     &                      verified, npbversion, compiletime,  &
+     &                      cs1, cs2, cs3, cs4, cs5, cs6, cs7)
+
+
+      endif
+
+
+      if (.not.timeron) goto 999
+
+      do i = 1, t_last
+         t1(i) = timer_read(i)
+      end do
+      t1(t_conjg) = t1(t_conjg) - t1(t_rcomm)
+      t1(t_last+2) = t1(t_rcomm) + t1(t_ncomm)
+      t1(t_last+1) = t1(t_total) - t1(t_last+2)
+
+      call MPI_Reduce(t1, tsum,  t_last+2, dp_type, MPI_SUM,  &
+     &                0, comm_solve, ierr)
+      call MPI_Reduce(t1, tming, t_last+2, dp_type, MPI_MIN,  &
+     &                0, comm_solve, ierr)
+      call MPI_Reduce(t1, tmaxg, t_last+2, dp_type, MPI_MAX,  &
+     &                0, comm_solve, ierr)
+
+      if (me .eq. 0) then
+         write(*, 800) nprocs
+         do i = 1, t_last+2
+            tsum(i) = tsum(i) / nprocs
+            write(*, 810) i, t_recs(i), tming(i), tmaxg(i), tsum(i)
+         end do
+      endif
+ 800  format(' nprocs =', i6, 11x, 'minimum', 5x, 'maximum',  &
+     &       5x, 'average')
+ 810  format(' timer ', i2, '(', A8, ') :', 3(2x,f10.4))
+
+ 999  continue
+      call mpi_finalize(ierr)
+
+
+
+      end                              ! end main
+
+
+
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+      subroutine initialize_mpi
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      use cg_data
+      use mpinpb
+      use timing
+
+      implicit none
+
+      integer   ierr
+
+
+      call mpi_init( ierr )
+
+!---------------------------------------------------------------------
+!     get a process grid that requires a pwr-2 number of procs.
+!     excess ranks are marked as inactive.
+!---------------------------------------------------------------------
+      call get_active_nprocs(3, num_proc_cols, num_proc_rows, nprocs,  &
+     &                       total_nodes, me, comm_solve, active)
+
+      if (.not. active) return
+
+      if (.not. convertdouble) then
+         dp_type = MPI_DOUBLE_PRECISION
+      else
+         dp_type = MPI_REAL
+      endif
+
+      root = 0
+
+      if (me .eq. root) then
+         call check_timer_flag( timeron )
+      endif
+
+      call mpi_bcast(timeron, 1, MPI_LOGICAL, 0, comm_solve, ierr)
+
+      return
+      end
+
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+      subroutine setup_proc_info( )
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      use cg_data
+      use mpinpb
+
+      implicit none
+
+      integer   i, ierr
+
+
+!---------------------------------------------------------------------
+!  set up dimension parameters after partition
+!  num_proc_rows & num_proc_cols are set by get_active_nprocs
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!  num_procs must be a power of 2, and num_procs=num_proc_cols*num_proc_rows.
+!  num_proc_rows and num_proc_cols are to be found in npbparams.h.
+!  When num_procs is not square, then num_proc_cols must be = 2*num_proc_rows.
+!---------------------------------------------------------------------
+      num_procs = num_proc_cols * num_proc_rows
+
+!---------------------------------------------------------------------
+!  num_procs must be a power of 2, and num_procs=num_proc_cols*num_proc_rows
+!  When num_procs is not square, then num_proc_cols = 2*num_proc_rows
+!---------------------------------------------------------------------
+!  First, number of procs must be power of two. 
+!---------------------------------------------------------------------
+      if( nprocs .ne. num_procs )then
+          if( me .eq. root ) write( *,9000 ) nprocs, num_procs
+ 9000     format( /,'ERROR: Number of processes (',  &
+     &             i0, ') is not a power of two (', i0, '?)'/ )
+          call mpi_abort(mpi_comm_world, mpi_err_other, ierr)
+          stop
+      endif
+
+      
+      npcols = num_proc_cols
+      nprows = num_proc_rows
+
+
+      return
+      end
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+      subroutine setup_submatrix_info( )
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      use cg_data
+      use mpinpb
+
+      implicit none
+
+      integer   col_size, row_size
+      integer   i, j
+      integer   div_factor
+
+
+      proc_row = me / npcols
+      proc_col = me - proc_row*npcols
+
+
+!---------------------------------------------------------------------
+!  If na evenly divisible by npcols, then it is evenly divisible 
+!  by nprows 
+!---------------------------------------------------------------------
+
+      if( na/npcols*npcols .eq. na )then
+          col_size = na/npcols
+          firstcol = proc_col*col_size + 1
+          lastcol  = firstcol - 1 + col_size
+          row_size = na/nprows
+          firstrow = proc_row*row_size + 1
+          lastrow  = firstrow - 1 + row_size
+!---------------------------------------------------------------------
+!  If na not evenly divisible by npcols, then first subdivide for nprows
+!  and then, if npcols not equal to nprows (i.e., not a sq number of procs), 
+!  get col subdivisions by dividing by 2 each row subdivision.
+!---------------------------------------------------------------------
+      else
+          if( proc_row .lt. na - na/nprows*nprows)then
+              row_size = na/nprows+ 1
+              firstrow = proc_row*row_size + 1
+              lastrow  = firstrow - 1 + row_size
+          else
+              row_size = na/nprows
+              firstrow = (na - na/nprows*nprows)*(row_size+1)  &
+     &                 + (proc_row-(na-na/nprows*nprows))  &
+     &                     *row_size + 1
+              lastrow  = firstrow - 1 + row_size
+          endif
+          if( npcols .eq. nprows )then
+              if( proc_col .lt. na - na/npcols*npcols )then
+                  col_size = na/npcols+ 1
+                  firstcol = proc_col*col_size + 1
+                  lastcol  = firstcol - 1 + col_size
+              else
+                  col_size = na/npcols
+                  firstcol = (na - na/npcols*npcols)*(col_size+1)  &
+     &                     + (proc_col-(na-na/npcols*npcols))  &
+     &                         *col_size + 1
+                  lastcol  = firstcol - 1 + col_size
+              endif
+          else
+              if( (proc_col/2) .lt.  &
+     &                           na - na/(npcols/2)*(npcols/2) )then
+                  col_size = na/(npcols/2) + 1
+                  firstcol = (proc_col/2)*col_size + 1
+                  lastcol  = firstcol - 1 + col_size
+              else
+                  col_size = na/(npcols/2)
+                  firstcol = (na - na/(npcols/2)*(npcols/2))  &
+     &                                                 *(col_size+1)  &
+     &               + ((proc_col/2)-(na-na/(npcols/2)*(npcols/2)))  &
+     &                         *col_size + 1
+                  lastcol  = firstcol - 1 + col_size
+              endif
+!C               write( *,* ) col_size,firstcol,lastcol
+              if( mod( me,2 ) .eq. 0 )then
+                  lastcol  = firstcol - 1 + (col_size-1)/2 + 1
+              else
+                  firstcol = firstcol + (col_size-1)/2 + 1
+                  lastcol  = firstcol - 1 + col_size/2
+!C                   write( *,* ) firstcol,lastcol
+              endif
+          endif
+      endif
+
+
+
+      if( npcols .eq. nprows )then
+          send_start = 1
+          send_len   = lastrow - firstrow + 1
+      else
+          if( mod( me,2 ) .eq. 0 )then
+              send_start = 1
+              send_len   = (1 + lastrow-firstrow+1)/2
+          else
+              send_start = (1 + lastrow-firstrow+1)/2 + 1
+              send_len   = (lastrow-firstrow+1)/2
+          endif
+      endif
+          
+
+
+
+!---------------------------------------------------------------------
+!  Transpose exchange processor
+!---------------------------------------------------------------------
+
+      if( npcols .eq. nprows )then
+          exch_proc = mod( me,nprows )*nprows + me/nprows
+      else
+          exch_proc = 2*(mod( me/2,nprows )*nprows + me/2/nprows)  &
+     &                 + mod( me,2 )
+      endif
+
+
+
+      i = npcols / 2
+      l2npcols = 0
+      do while( i .gt. 0 )
+         l2npcols = l2npcols + 1
+         i = i / 2
+      enddo
+
+
+!---------------------------------------------------------------------
+!  Set up the reduce phase schedules...
+!---------------------------------------------------------------------
+
+      div_factor = npcols
+      do i = 1, l2npcols
+
+         j = mod( proc_col+div_factor/2, div_factor )  &
+     &     + proc_col / div_factor * div_factor
+         reduce_exch_proc(i) = proc_row*npcols + j
+
+         div_factor = div_factor / 2
+
+      enddo
+
+
+      do i = l2npcols, 1, -1
+
+            if( nprows .eq. npcols )then
+               reduce_send_starts(i)  = send_start
+               reduce_send_lengths(i) = send_len
+               reduce_recv_lengths(i) = lastrow - firstrow + 1
+            else
+               reduce_recv_lengths(i) = send_len
+               if( i .eq. l2npcols )then
+                  reduce_send_lengths(i) = lastrow-firstrow+1 - send_len
+                  if( me/2*2 .eq. me )then
+                     reduce_send_starts(i) = send_start + send_len
+                  else
+                     reduce_send_starts(i) = 1
+                  endif
+               else
+                  reduce_send_lengths(i) = send_len
+                  reduce_send_starts(i)  = send_start
+               endif
+            endif
+            reduce_recv_starts(i) = send_start
+
+      enddo
+
+
+      exch_recv_length = lastcol - firstcol + 1
+
+
+      return
+      end
+
+
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+      subroutine conj_grad ( rnorm )
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!  Floating point arrays here are named as in NPB1 spec discussion of 
+!  CG algorithm
+!---------------------------------------------------------------------
+ 
+      use cg_data
+      use mpinpb
+      use timing
+
+      implicit none
+
+      double precision rnorm
+
+      integer status(MPI_STATUS_SIZE ), request
+
+      integer   i, j, k, ierr
+      integer   cgit, cgitmax
+
+      double precision d, sum, rho, rho0, alpha, beta
+
+      external         timer_read
+      double precision timer_read
+
+      data      cgitmax / 25 /
+
+
+      if (timeron) call timer_start(t_conjg)
+!---------------------------------------------------------------------
+!  Initialize the CG algorithm:
+!---------------------------------------------------------------------
+      do j=1,naa+1
+         q(j) = 0.0d0
+         z(j) = 0.0d0
+         r(j) = x(j)
+         p(j) = r(j)
+         w(j) = 0.0d0                 
+      enddo
+
+
+!---------------------------------------------------------------------
+!  rho = r.r
+!  Now, obtain the norm of r: First, sum squares of r elements locally...
+!---------------------------------------------------------------------
+      sum = 0.0d0
+      do j=1, lastcol-firstcol+1
+         sum = sum + r(j)*r(j)
+      enddo
+
+!---------------------------------------------------------------------
+!  Exchange and sum with procs identified in reduce_exch_proc
+!  (This is equivalent to mpi_allreduce.)
+!  Sum the partial sums of rho, leaving rho on all processors
+!---------------------------------------------------------------------
+      if (timeron) call timer_start(t_rcomm)
+      do i = 1, l2npcols
+         call mpi_irecv( rho,  &
+     &                   1,  &
+     &                   dp_type,  &
+     &                   reduce_exch_proc(i),  &
+     &                   i,  &
+     &                   comm_solve,  &
+     &                   request,  &
+     &                   ierr )
+         call mpi_send(  sum,  &
+     &                   1,  &
+     &                   dp_type,  &
+     &                   reduce_exch_proc(i),  &
+     &                   i,  &
+     &                   comm_solve,  &
+     &                   ierr )
+         call mpi_wait( request, status, ierr )
+
+         sum = sum + rho
+      enddo
+      if (timeron) call timer_stop(t_rcomm)
+      rho = sum
+
+
+
+!---------------------------------------------------------------------
+!---->
+!  The conj grad iteration loop
+!---->
+!---------------------------------------------------------------------
+      do cgit = 1, cgitmax
+
+
+!---------------------------------------------------------------------
+!  q = A.p
+!  The partition submatrix-vector multiply: use workspace w
+!---------------------------------------------------------------------
+         do j=1,lastrow-firstrow+1
+            sum = 0.d0
+            do k=rowstr(j),rowstr(j+1)-1
+               sum = sum + a(k)*p(colidx(k))
+            enddo
+            w(j) = sum
+         enddo
+
+!---------------------------------------------------------------------
+!  Sum the partition submatrix-vec A.p's across rows
+!  Exchange and sum piece of w with procs identified in reduce_exch_proc
+!---------------------------------------------------------------------
+         if (timeron) call timer_start(t_rcomm)
+         do i = l2npcols, 1, -1
+            call mpi_irecv( q(reduce_recv_starts(i)),  &
+     &                      reduce_recv_lengths(i),  &
+     &                      dp_type,  &
+     &                      reduce_exch_proc(i),  &
+     &                      i,  &
+     &                      comm_solve,  &
+     &                      request,  &
+     &                      ierr )
+            call mpi_send(  w(reduce_send_starts(i)),  &
+     &                      reduce_send_lengths(i),  &
+     &                      dp_type,  &
+     &                      reduce_exch_proc(i),  &
+     &                      i,  &
+     &                      comm_solve,  &
+     &                      ierr )
+            call mpi_wait( request, status, ierr )
+            do j=send_start,send_start + reduce_recv_lengths(i) - 1
+               w(j) = w(j) + q(j)
+            enddo
+         enddo
+         if (timeron) call timer_stop(t_rcomm)
+
+
+!---------------------------------------------------------------------
+!  Exchange piece of q with transpose processor:
+!---------------------------------------------------------------------
+         if( l2npcols .ne. 0 )then
+            if (timeron) call timer_start(t_rcomm)
+            call mpi_irecv( q,               &
+     &                      exch_recv_length,  &
+     &                      dp_type,  &
+     &                      exch_proc,  &
+     &                      1,  &
+     &                      comm_solve,  &
+     &                      request,  &
+     &                      ierr )
+
+            call mpi_send(  w(send_start),   &
+     &                      send_len,  &
+     &                      dp_type,  &
+     &                      exch_proc,  &
+     &                      1,  &
+     &                      comm_solve,  &
+     &                      ierr )
+            call mpi_wait( request, status, ierr )
+            if (timeron) call timer_stop(t_rcomm)
+         else
+            do j=1,exch_recv_length
+               q(j) = w(j)
+            enddo
+         endif
+
+
+!---------------------------------------------------------------------
+!  Clear w for reuse...
+!---------------------------------------------------------------------
+         do j=1, max( lastrow-firstrow+1, lastcol-firstcol+1 )
+            w(j) = 0.0d0
+         enddo
+         
+
+!---------------------------------------------------------------------
+!  Obtain p.q
+!---------------------------------------------------------------------
+         sum = 0.0d0
+         do j=1, lastcol-firstcol+1
+            sum = sum + p(j)*q(j)
+         enddo
+
+!---------------------------------------------------------------------
+!  Obtain d with a sum-reduce
+!---------------------------------------------------------------------
+         if (timeron) call timer_start(t_rcomm)
+         do i = 1, l2npcols
+            call mpi_irecv( d,  &
+     &                      1,  &
+     &                      dp_type,  &
+     &                      reduce_exch_proc(i),  &
+     &                      i,  &
+     &                      comm_solve,  &
+     &                      request,  &
+     &                      ierr )
+            call mpi_send(  sum,  &
+     &                      1,  &
+     &                      dp_type,  &
+     &                      reduce_exch_proc(i),  &
+     &                      i,  &
+     &                      comm_solve,  &
+     &                      ierr )
+
+            call mpi_wait( request, status, ierr )
+
+            sum = sum + d
+         enddo
+         if (timeron) call timer_stop(t_rcomm)
+         d = sum
+
+
+!---------------------------------------------------------------------
+!  Obtain alpha = rho / (p.q)
+!---------------------------------------------------------------------
+         alpha = rho / d
+
+!---------------------------------------------------------------------
+!  Save a temporary of rho
+!---------------------------------------------------------------------
+         rho0 = rho
+
+!---------------------------------------------------------------------
+!  Obtain z = z + alpha*p
+!  and    r = r - alpha*q
+!---------------------------------------------------------------------
+         do j=1, lastcol-firstcol+1
+            z(j) = z(j) + alpha*p(j)
+            r(j) = r(j) - alpha*q(j)
+         enddo
+            
+!---------------------------------------------------------------------
+!  rho = r.r
+!  Now, obtain the norm of r: First, sum squares of r elements locally...
+!---------------------------------------------------------------------
+         sum = 0.0d0
+         do j=1, lastcol-firstcol+1
+            sum = sum + r(j)*r(j)
+         enddo
+
+!---------------------------------------------------------------------
+!  Obtain rho with a sum-reduce
+!---------------------------------------------------------------------
+         if (timeron) call timer_start(t_rcomm)
+         do i = 1, l2npcols
+            call mpi_irecv( rho,  &
+     &                      1,  &
+     &                      dp_type,  &
+     &                      reduce_exch_proc(i),  &
+     &                      i,  &
+     &                      comm_solve,  &
+     &                      request,  &
+     &                      ierr )
+            call mpi_send(  sum,  &
+     &                      1,  &
+     &                      dp_type,  &
+     &                      reduce_exch_proc(i),  &
+     &                      i,  &
+     &                      comm_solve,  &
+     &                      ierr )
+            call mpi_wait( request, status, ierr )
+
+            sum = sum + rho
+         enddo
+         if (timeron) call timer_stop(t_rcomm)
+         rho = sum
+
+!---------------------------------------------------------------------
+!  Obtain beta:
+!---------------------------------------------------------------------
+         beta = rho / rho0
+
+!---------------------------------------------------------------------
+!  p = r + beta*p
+!---------------------------------------------------------------------
+         do j=1, lastcol-firstcol+1
+            p(j) = r(j) + beta*p(j)
+         enddo
+
+
+
+      enddo                             ! end of do cgit=1,cgitmax
+
+
+
+!---------------------------------------------------------------------
+!  Compute residual norm explicitly:  ||r|| = ||x - A.z||
+!  First, form A.z
+!  The partition submatrix-vector multiply
+!---------------------------------------------------------------------
+      do j=1,lastrow-firstrow+1
+         sum = 0.d0
+         do k=rowstr(j),rowstr(j+1)-1
+            sum = sum + a(k)*z(colidx(k))
+         enddo
+         w(j) = sum
+      enddo
+
+
+
+!---------------------------------------------------------------------
+!  Sum the partition submatrix-vec A.z's across rows
+!---------------------------------------------------------------------
+      if (timeron) call timer_start(t_rcomm)
+      do i = l2npcols, 1, -1
+         call mpi_irecv( r(reduce_recv_starts(i)),  &
+     &                   reduce_recv_lengths(i),  &
+     &                   dp_type,  &
+     &                   reduce_exch_proc(i),  &
+     &                   i,  &
+     &                   comm_solve,  &
+     &                   request,  &
+     &                   ierr )
+         call mpi_send(  w(reduce_send_starts(i)),  &
+     &                   reduce_send_lengths(i),  &
+     &                   dp_type,  &
+     &                   reduce_exch_proc(i),  &
+     &                   i,  &
+     &                   comm_solve,  &
+     &                   ierr )
+         call mpi_wait( request, status, ierr )
+
+         do j=send_start,send_start + reduce_recv_lengths(i) - 1
+            w(j) = w(j) + r(j)
+         enddo
+      enddo
+      if (timeron) call timer_stop(t_rcomm)
+      
+
+!---------------------------------------------------------------------
+!  Exchange piece of q with transpose processor:
+!---------------------------------------------------------------------
+      if( l2npcols .ne. 0 )then
+         if (timeron) call timer_start(t_rcomm)
+         call mpi_irecv( r,               &
+     &                   exch_recv_length,  &
+     &                   dp_type,  &
+     &                   exch_proc,  &
+     &                   1,  &
+     &                   comm_solve,  &
+     &                   request,  &
+     &                   ierr )
+   
+         call mpi_send(  w(send_start),   &
+     &                   send_len,  &
+     &                   dp_type,  &
+     &                   exch_proc,  &
+     &                   1,  &
+     &                   comm_solve,  &
+     &                   ierr )
+         call mpi_wait( request, status, ierr )
+         if (timeron) call timer_stop(t_rcomm)
+      else
+         do j=1,exch_recv_length
+            r(j) = w(j)
+         enddo
+      endif
+
+
+!---------------------------------------------------------------------
+!  At this point, r contains A.z
+!---------------------------------------------------------------------
+         sum = 0.0d0
+         do j=1, lastcol-firstcol+1
+            d   = x(j) - r(j)         
+            sum = sum + d*d
+         enddo
+         
+!---------------------------------------------------------------------
+!  Obtain d with a sum-reduce
+!---------------------------------------------------------------------
+      if (timeron) call timer_start(t_rcomm)
+      do i = 1, l2npcols
+         call mpi_irecv( d,  &
+     &                   1,  &
+     &                   dp_type,  &
+     &                   reduce_exch_proc(i),  &
+     &                   i,  &
+     &                   comm_solve,  &
+     &                   request,  &
+     &                   ierr )
+         call mpi_send(  sum,  &
+     &                   1,  &
+     &                   dp_type,  &
+     &                   reduce_exch_proc(i),  &
+     &                   i,  &
+     &                   comm_solve,  &
+     &                   ierr )
+         call mpi_wait( request, status, ierr )
+
+         sum = sum + d
+      enddo
+      if (timeron) call timer_stop(t_rcomm)
+      d = sum
+
+
+      if( me .eq. root ) rnorm = sqrt( d )
+
+      if (timeron) call timer_stop(t_conjg)
+
+
+      return
+      end                               ! end of routine conj_grad
+
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+      subroutine makea( n, nz, a, colidx, rowstr, nonzer,  &
+     &                  firstrow, lastrow, firstcol, lastcol,  &
+     &                  rcond, arow, acol, aelt, v, iv, shift )
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      implicit            none
+      integer             n, nz, nonzer
+      integer             firstrow, lastrow, firstcol, lastcol
+      integer             colidx(nz), rowstr(n+1)
+      integer             iv(2*n+1), arow(nz), acol(nz)
+      double precision    v(n+1), aelt(nz)
+      double precision    rcond, a(nz), shift
+
+!---------------------------------------------------------------------
+!       generate the test problem for benchmark 6
+!       makea generates a sparse matrix with a
+!       prescribed sparsity distribution
+!
+!       parameter    type        usage
+!
+!       input
+!
+!       n            i           number of cols/rows of matrix
+!       nz           i           nonzeros as declared array size
+!       rcond        r*8         condition number
+!       shift        r*8         main diagonal shift
+!
+!       output
+!
+!       a            r*8         array for nonzeros
+!       colidx       i           col indices
+!       rowstr       i           row pointers
+!
+!       workspace
+!
+!       iv, arow, acol i
+!       v, aelt        r*8
+!---------------------------------------------------------------------
+
+      integer i, nnza, iouter, ivelt, ivelt1, irow, nzv, jcol
+
+!---------------------------------------------------------------------
+!      nonzer is approximately  (int(sqrt(nnza /n)));
+!---------------------------------------------------------------------
+
+      double precision  size, ratio, scale
+      external          sparse, sprnvc, vecset
+
+      size = 1.0D0
+      ratio = rcond ** (1.0D0 / dfloat(n))
+      nnza = 0
+
+!---------------------------------------------------------------------
+!  Initialize iv(n+1 .. 2n) to zero.
+!  Used by sprnvc to mark nonzero positions
+!---------------------------------------------------------------------
+
+      do i = 1, n
+           iv(n+i) = 0
+      enddo
+      do iouter = 1, n
+         nzv = nonzer
+         call sprnvc( n, nzv, v, colidx, iv(1), iv(n+1) )
+         call vecset( n, v, colidx, nzv, iouter, .5D0 )
+         do ivelt = 1, nzv
+              jcol = colidx(ivelt)
+              if (jcol.ge.firstcol .and. jcol.le.lastcol) then
+                 scale = size * v(ivelt)
+                 do ivelt1 = 1, nzv
+                    irow = colidx(ivelt1)
+                    if (irow.ge.firstrow .and. irow.le.lastrow) then
+                       nnza = nnza + 1
+                       if (nnza .gt. nz) goto 9999
+                       acol(nnza) = jcol
+                       arow(nnza) = irow
+                       aelt(nnza) = v(ivelt1) * scale
+                    endif
+                 enddo
+              endif
+         enddo
+         size = size * ratio
+      enddo
+
+
+!---------------------------------------------------------------------
+!       ... add the identity * (rcond - shift) to the generated matrix;
+!           before the shift, this bounds the smallest eigenvalue below by rcond
+!---------------------------------------------------------------------
+        do i = firstrow, lastrow
+           if (i.ge.firstcol .and. i.le.lastcol) then
+              iouter = n + i
+              nnza = nnza + 1
+              if (nnza .gt. nz) goto 9999
+              acol(nnza) = i
+              arow(nnza) = i
+              aelt(nnza) = rcond - shift
+           endif
+        enddo
+
+
+!---------------------------------------------------------------------
+!       ... make the sparse matrix from list of elements with duplicates
+!           (v and iv are used as  workspace)
+!---------------------------------------------------------------------
+      call sparse( a, colidx, rowstr, n, arow, acol, aelt,  &
+     &             firstrow, lastrow,  &
+     &             v, iv(1), iv(n+1), nnza )
+      return
+
+ 9999 continue
+      write(*,*) 'Space for matrix elements exceeded in makea'
+      write(*,*) 'nnza, nzmax = ',nnza, nz
+      write(*,*) ' iouter = ',iouter
+
+      stop
+      end
+!-------end   of makea------------------------------
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+      subroutine sparse( a, colidx, rowstr, n, arow, acol, aelt,  &
+     &                   firstrow, lastrow,  &
+     &                   x, mark, nzloc, nnza )
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      implicit           none
+      integer            colidx(*), rowstr(*)
+      integer            firstrow, lastrow
+      integer            n, arow(*), acol(*), nnza
+      double precision   a(*), aelt(*)
+
+!---------------------------------------------------------------------
+!       rows range from firstrow to lastrow
+!       the rowstr pointers are defined for nrows = lastrow-firstrow+1 values
+!---------------------------------------------------------------------
+      integer            nzloc(n), nrows
+      double precision   x(n)
+      integer            mark(n)
+
+!---------------------------------------------------
+!       generate a sparse matrix from a list of
+!       [col, row, element] triples
+!---------------------------------------------------
+
+      integer            i, j, jajp1, nza, k, nzrow
+      double precision   xi
+
+!---------------------------------------------------------------------
+!    how many rows of result
+!---------------------------------------------------------------------
+      nrows = lastrow - firstrow + 1
+
+!---------------------------------------------------------------------
+!     ...count the number of triples in each row
+!---------------------------------------------------------------------
+      do j = 1, n
+         rowstr(j) = 0
+         mark(j) = 0
+      enddo
+      rowstr(n+1) = 0
+
+      do nza = 1, nnza
+         j = (arow(nza) - firstrow + 1) + 1
+         rowstr(j) = rowstr(j) + 1
+      enddo
+
+      rowstr(1) = 1
+      do j = 2, nrows+1
+         rowstr(j) = rowstr(j) + rowstr(j-1)
+      enddo
+
+
+!---------------------------------------------------------------------
+!     ... rowstr(j) now is the location of the first nonzero
+!           of row j of a
+!---------------------------------------------------------------------
+
+
+!---------------------------------------------------------------------
+!     ... do a bucket sort of the triples on the row index
+!---------------------------------------------------------------------
+      do nza = 1, nnza
+         j = arow(nza) - firstrow + 1
+         k = rowstr(j)
+         a(k) = aelt(nza)
+         colidx(k) = acol(nza)
+         rowstr(j) = rowstr(j) + 1
+      enddo
+
+
+!---------------------------------------------------------------------
+!       ... rowstr(j) now points to the first element of row j+1
+!---------------------------------------------------------------------
+      do j = nrows, 1, -1
+          rowstr(j+1) = rowstr(j)
+      enddo
+      rowstr(1) = 1
+
+
+!---------------------------------------------------------------------
+!       ... generate the actual output rows by adding elements
+!---------------------------------------------------------------------
+      nza = 0
+      do i = 1, n
+          x(i)    = 0.0D0
+          mark(i) = 0
+      enddo
+
+      jajp1 = rowstr(1)
+      do j = 1, nrows
+         nzrow = 0
+
+!---------------------------------------------------------------------
+!          ...loop over the jth row of a
+!---------------------------------------------------------------------
+         do k = jajp1 , rowstr(j+1)-1
+            i = colidx(k)
+            x(i) = x(i) + a(k)
+            if ( (mark(i) .eq. 0) .and. (x(i) .ne. 0.D0)) then
+             mark(i) = 1
+             nzrow = nzrow + 1
+             nzloc(nzrow) = i
+            endif
+         enddo
+
+!---------------------------------------------------------------------
+!          ... extract the nonzeros of this row
+!---------------------------------------------------------------------
+         do k = 1, nzrow
+            i = nzloc(k)
+            mark(i) = 0
+            xi = x(i)
+            x(i) = 0.D0
+            if (xi .ne. 0.D0) then
+             nza = nza + 1
+             a(nza) = xi
+             colidx(nza) = i
+            endif
+         enddo
+         jajp1 = rowstr(j+1)
+         rowstr(j+1) = nza + rowstr(1)
+      enddo
+!C       write (*, 11000) nza
+      return
+11000   format ( //,'final nonzero count in sparse ',  &
+     &            /,'number of nonzeros       = ', i16 )
+      end
+!-------end   of sparse-----------------------------
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+      subroutine sprnvc( n, nz, v, iv, nzloc, mark )
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      use cg_data, only : amult, tran
+      implicit           none
+
+      double precision   v(*)
+      integer            n, nz, iv(*), nzloc(n), nn1
+      integer            mark(n)
+
+
+!---------------------------------------------------------------------
+!       generate a sparse n-vector (v, iv)
+!       having nz nonzeros
+!
+!       mark(i) is set to 1 if position i is nonzero.
+!       mark is all zero on entry and is reset to all zero before exit
+!       this corrects a performance bug found by John G. Lewis, caused by
+!       reinitialization of mark on every one of the n calls to sprnvc
+!---------------------------------------------------------------------
+
+        integer            nzrow, nzv, ii, i, icnvrt
+
+        external           randlc, icnvrt
+        double precision   randlc, vecelt, vecloc
+
+
+        nzv = 0
+        nzrow = 0
+        nn1 = 1
+ 50     continue
+          nn1 = 2 * nn1
+          if (nn1 .lt. n) goto 50
+
+!---------------------------------------------------------------------
+!    nn1 is the smallest power of two not less than n
+!---------------------------------------------------------------------
+
+100     continue
+        if (nzv .ge. nz) goto 110
+         vecelt = randlc( tran, amult )
+
+!---------------------------------------------------------------------
+!   generate an integer between 1 and n in a portable manner
+!---------------------------------------------------------------------
+         vecloc = randlc(tran, amult)
+         i = icnvrt(vecloc, nn1) + 1
+         if (i .gt. n) goto 100
+
+!---------------------------------------------------------------------
+!  was this integer generated already?
+!---------------------------------------------------------------------
+         if (mark(i) .eq. 0) then
+            mark(i) = 1
+            nzrow = nzrow + 1
+            nzloc(nzrow) = i
+            nzv = nzv + 1
+            v(nzv) = vecelt
+            iv(nzv) = i
+         endif
+         goto 100
+110      continue
+      do ii = 1, nzrow
+         i = nzloc(ii)
+         mark(i) = 0
+      enddo
+      return
+      end
+!-------end   of sprnvc-----------------------------
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+      function icnvrt(x, ipwr2)
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      implicit           none
+      double precision   x
+      integer            ipwr2, icnvrt
+
+!---------------------------------------------------------------------
+!    scale a double precision number x in (0,1) by a power of 2 and chop it
+!---------------------------------------------------------------------
+      icnvrt = int(ipwr2 * x)
+
+      return
+      end
+!-------end   of icnvrt-----------------------------
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+      subroutine vecset(n, v, iv, nzv, i, val)
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      implicit           none
+      integer            n, iv(*), nzv, i, k
+      double precision   v(*), val
+
+!---------------------------------------------------------------------
+!       set ith element of sparse vector (v, iv) with
+!       nzv nonzeros to val
+!---------------------------------------------------------------------
+
+      logical set
+
+      set = .false.
+      do k = 1, nzv
+         if (iv(k) .eq. i) then
+            v(k) = val
+            set  = .true.
+         endif
+      enddo
+      if (.not. set) then
+         nzv     = nzv + 1
+         v(nzv)  = val
+         iv(nzv) = i
+      endif
+      return
+      end
+!-------end   of vecset-----------------------------
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/CG/cg_data.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/CG/cg_data.f90
new file mode 100644
index 000000000..5d193f547
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/CG/cg_data.f90
@@ -0,0 +1,161 @@
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+!
+!  cg_data module
+!
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      module cg_data
+
+
+!---------------------------------------------------------------------
+!  Class specific parameters are defined in the npbparams.h file,
+!  which is written by the sys/setparams.c program.
+!---------------------------------------------------------------------
+
+      include 'npbparams.h'
+
+
+      ! main_int_mem
+      integer, allocatable :: colidx(:), rowstr(:),  &
+     &                        iv(:), arow(:), acol(:)
+
+      ! main_flt_mem
+      double precision, allocatable ::  &
+     &                        v(:), aelt(:), a(:),  &
+     &                        x(:),  &
+     &                        z(:),  &
+     &                        p(:),  &
+     &                        q(:),  &
+     &                        r(:),  &
+     &                        w(:)
+
+      ! urando
+      double precision   amult, tran
+
+
+      ! process grid
+      integer num_procs, num_proc_rows, num_proc_cols
+
+      ! number of nonzeros after partition
+      integer nz
+
+      ! partit_size
+      integer naa, nzz,  &
+     &        npcols, nprows,  &
+     &        proc_col, proc_row,  &
+     &        firstrow,  &
+     &        lastrow,  &
+     &        firstcol,  &
+     &        lastcol,  &
+     &        exch_proc,  &
+     &        exch_recv_length,  &
+     &        send_start,  &
+     &        send_len
+
+      ! work arrays for reduction
+      integer l2npcols
+      integer, allocatable ::  &
+     &        reduce_exch_proc(:),  &
+     &        reduce_send_starts(:),  &
+     &        reduce_send_lengths(:),  &
+     &        reduce_recv_starts(:),  &
+     &        reduce_recv_lengths(:)
+
+
+      end module cg_data
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+!
+!  timing module
+!
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      module timing
+
+      integer t_total, t_conjg, t_rcomm, t_ncomm, t_last
+      parameter (t_total=1, t_conjg=2, t_rcomm=3, t_ncomm=4, t_last=4)
+      logical timeron
+
+      end module timing
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine alloc_space
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+! allocate space dynamically for data arrays
+!---------------------------------------------------------------------
+
+      use cg_data
+      use mpinpb
+
+      implicit none
+
+      integer(8) naz
+      integer ios, ierr
+
+
+!---------------------------------------------------------------------
+! set up dimension parameters after partition
+!---------------------------------------------------------------------
+
+      naz = na			! to avoid integer overflow
+      naz = naz*(nonzer+1)/num_procs*(nonzer+1)+nonzer  &
+     &     + naz*(nonzer+2+num_procs/256)/num_proc_cols
+      nz = naz
+      if (nz .ne. naz) then
+         write(*,*) 'Error: integer overflow', nz, naz
+         call MPI_Abort(MPI_COMM_WORLD, MPI_ERR_OTHER, ierr)
+      endif
+
+      naa = na / num_proc_rows
+      nzz = nz
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      allocate (  &
+     &          colidx(nz),  &
+     &          rowstr(na+1),  &
+     &          iv(2*na+1),  &
+     &          arow(nz),  &
+     &          acol(nz),  &
+     &          stat = ios)
+
+      if (ios .eq. 0) allocate (  &
+     &          v(na+1), aelt(nz), a(nz),  &
+     &          x(naa+2),  &
+     &          z(naa+2),  &
+     &          p(naa+2),  &
+     &          q(naa+2),  &
+     &          r(naa+2),  &
+     &          w(naa+2),  &
+     &          stat = ios)
+
+      if (ios .eq. 0) allocate (  &
+     &          reduce_exch_proc(num_proc_cols),  &
+     &          reduce_send_starts(num_proc_cols),  &
+     &          reduce_send_lengths(num_proc_cols),  &
+     &          reduce_recv_starts(num_proc_cols),  &
+     &          reduce_recv_lengths(num_proc_cols),  &
+     &          stat = ios)
+
+      if (ios .ne. 0) then
+         write(*,*) 'Error encountered in allocating space'
+         call MPI_Abort(MPI_COMM_WORLD, MPI_ERR_OTHER, ierr)
+         stop
+      endif
+
+      return
+      end
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/CG/mpinpb.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/CG/mpinpb.f90
new file mode 100644
index 000000000..1a34c2924
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/CG/mpinpb.f90
@@ -0,0 +1,17 @@
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+!
+!  mpinpb module
+!
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      module mpinpb
+
+      include 'mpif.h'
+
+      integer  me, nprocs, total_nodes, root, comm_solve, dp_type
+      logical  active
+
+      end module mpinpb
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/DT/DGraph.c b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/DT/DGraph.c
new file mode 100644
index 000000000..3c7a57819
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/DT/DGraph.c
@@ -0,0 +1,184 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+
+#include "DGraph.h"
+
+DGArc *newArc(DGNode *tl,DGNode *hd){
+  DGArc *ar=(DGArc *)malloc(sizeof(DGArc));
+  ar->tail=tl;
+  ar->head=hd;
+  return ar;
+}
+void arcShow(DGArc *ar){
+  DGNode *tl=(DGNode *)ar->tail,
+         *hd=(DGNode *)ar->head;
+  fprintf(stderr,"%d. |%s ->%s\n",ar->id,tl->name,hd->name);
+}
+
+DGNode *newNode(char *nm){
+  DGNode *nd=(DGNode *)malloc(sizeof(DGNode));
+  nd->attribute=0;
+  nd->color=0;
+  nd->inDegree=0;
+  nd->outDegree=0;
+  nd->maxInDegree=SMALL_BLOCK_SIZE;
+  nd->maxOutDegree=SMALL_BLOCK_SIZE;
+  nd->inArc=(DGArc **)malloc(nd->maxInDegree*sizeof(DGArc*));
+  nd->outArc=(DGArc **)malloc(nd->maxOutDegree*sizeof(DGArc*));
+  nd->name=strdup(nm);
+  nd->feat=NULL;
+  return nd;
+}
+void nodeShow(DGNode* nd){
+  fprintf( stderr,"%3d.%s: (%d,%d)\n",
+	           nd->id,nd->name,nd->inDegree,nd->outDegree);
+/*
+  if(nd->verified==1) fprintf(stderr,"%ld.%s\t: usable.",nd->id,nd->name);
+  else if(nd->verified==0)  fprintf(stderr,"%ld.%s\t: unusable.",nd->id,nd->name);
+  else  fprintf(stderr,"%ld.%s\t: notverified.",nd->id,nd->name);   
+*/
+}
+
+DGraph* newDGraph(char* nm){
+  DGraph *dg=(DGraph *)malloc(sizeof(DGraph));
+  dg->numNodes=0;
+  dg->numArcs=0;
+  dg->maxNodes=BLOCK_SIZE;
+  dg->maxArcs=BLOCK_SIZE;
+  dg->node=(DGNode **)malloc(dg->maxNodes*sizeof(DGNode*));
+  dg->arc=(DGArc **)malloc(dg->maxArcs*sizeof(DGArc*));
+  dg->name=strdup(nm);
+  return dg;
+}
+int AttachNode(DGraph* dg, DGNode* nd) {
+  int i=0,j,len=0;
+  DGNode **nds =NULL, *tmpnd=NULL;
+  DGArc **ar=NULL;
+
+	if (dg->numNodes == dg->maxNodes-1 ) {
+	  dg->maxNodes += BLOCK_SIZE;
+          nds =(DGNode **) calloc(dg->maxNodes,sizeof(DGNode*));
+	  memcpy(nds,dg->node,(dg->maxNodes-BLOCK_SIZE)*sizeof(DGNode*));
+	  free(dg->node);
+	  dg->node=nds;
+	}
+
+        len = strlen( nd->name);
+	for (i = 0; i < dg->numNodes; i++) {
+	  tmpnd =dg->node[ i];
+	  ar=NULL;
+	  if ( strlen( tmpnd->name) != len ) continue;
+	  if ( strncmp( nd->name, tmpnd->name, len) ) continue;
+	  if ( nd->inDegree > 0 ) {
+	    tmpnd->maxInDegree += nd->maxInDegree;
+            ar =(DGArc **) calloc(tmpnd->maxInDegree,sizeof(DGArc*));
+	    memcpy(ar,tmpnd->inArc,(tmpnd->inDegree)*sizeof(DGArc*));
+	    free(tmpnd->inArc);
+	    tmpnd->inArc=ar;
+	    for (j = 0; j < nd->inDegree; j++ ) {
+	      nd->inArc[ j]->head = tmpnd;
+	    }
+	    memcpy( &(tmpnd->inArc[ tmpnd->inDegree]), nd->inArc, nd->inDegree*sizeof( DGArc *));
+	    tmpnd->inDegree += nd->inDegree;
+	  } 	
+	  if ( nd->outDegree > 0 ) {
+	    tmpnd->maxOutDegree += nd->maxOutDegree;
+            ar =(DGArc **) calloc(tmpnd->maxOutDegree,sizeof(DGArc*));
+	    memcpy(ar,tmpnd->outArc,(tmpnd->outDegree)*sizeof(DGArc*));
+	    free(tmpnd->outArc);
+	    tmpnd->outArc=ar;
+	    for (j = 0; j < nd->outDegree; j++ ) {
+	      nd->outArc[ j]->tail = tmpnd;
+	    }			
+	    memcpy( &(tmpnd->outArc[tmpnd->outDegree]),nd->outArc,nd->outDegree*sizeof( DGArc *));
+	    tmpnd->outDegree += nd->outDegree;
+	  } 
+	  free(nd); 
+	  return i;
+	}
+	nd->id = dg->numNodes;
+	dg->node[dg->numNodes] = nd;
+	dg->numNodes++;
+return nd->id;
+}
+int AttachArc(DGraph *dg,DGArc* nar){
+int	arcId = -1;
+int i=0,newNumber=0;
+DGNode	*head = nar->head,
+	*tail = nar->tail; 
+DGArc **ars=NULL,*probe=NULL;
+/*fprintf(stderr,"AttachArc %ld\n",dg->numArcs); */
+	if ( !tail || !head ) return arcId;
+	if ( dg->numArcs == dg->maxArcs-1 ) {
+	  dg->maxArcs += BLOCK_SIZE;
+          ars =(DGArc **) calloc(dg->maxArcs,sizeof(DGArc*));
+	  memcpy(ars,dg->arc,(dg->maxArcs-BLOCK_SIZE)*sizeof(DGArc*));
+	  free(dg->arc);
+	  dg->arc=ars;
+	}
+	for(i = 0; i < tail->outDegree; i++ ) { /* parallel arc */
+	  probe = tail->outArc[ i];
+	  if(probe->head == head
+	     &&
+	     probe->length == nar->length
+            ){
+            free(nar);
+	    return probe->id;   
+	  }
+	}
+	
+	nar->id = dg->numArcs;
+	arcId=dg->numArcs;
+	dg->arc[dg->numArcs] = nar;
+	dg->numArcs++;
+	
+	head->inArc[ head->inDegree] = nar;
+	head->inDegree++;
+	if ( head->inDegree >= head->maxInDegree ) {
+	  newNumber = head->maxInDegree + SMALL_BLOCK_SIZE;
+          ars =(DGArc **) calloc(newNumber,sizeof(DGArc*));
+	  memcpy(ars,head->inArc,(head->inDegree)*sizeof(DGArc*));
+	  free(head->inArc);
+	  head->inArc=ars;
+	  head->maxInDegree = newNumber;
+	}
+	tail->outArc[ tail->outDegree] = nar;
+	tail->outDegree++;
+	if(tail->outDegree >= tail->maxOutDegree ) {
+	  newNumber = tail->maxOutDegree + SMALL_BLOCK_SIZE;
+          ars =(DGArc **) calloc(newNumber,sizeof(DGArc*));
+	  memcpy(ars,tail->outArc,(tail->outDegree)*sizeof(DGArc*));
+	  free(tail->outArc);
+	  tail->outArc=ars;
+	  tail->maxOutDegree = newNumber;
+	}
+/*fprintf(stderr,"AttachArc: head->in=%d tail->out=%ld\n",head->inDegree,tail->outDegree);*/
+return arcId;
+}
+void graphShow(DGraph *dg,int DetailsLevel){
+  int i=0,j=0;
+  fprintf(stderr," %d.%s: (%d,%d)\n",dg->id,dg->name,dg->numNodes,dg->numArcs);
+  if ( DetailsLevel < 1) return;
+  for (i = 0; i < dg->numNodes; i++ ) {
+    DGNode *focusNode = dg->node[ i];
+    if(DetailsLevel >= 2) {
+      for (j = 0; j < focusNode->inDegree; j++ ) {
+	fprintf(stderr,"\t ");
+	nodeShow(focusNode->inArc[ j]->tail);
+      }
+    }
+    nodeShow(focusNode);
+    if ( DetailsLevel < 2) continue;
+    for (j = 0; j < focusNode->outDegree; j++ ) {
+      fprintf(stderr, "\t ");
+      nodeShow(focusNode->outArc[ j]->head);
+    }	
+    fprintf(stderr, "---\n");
+  }
+  fprintf(stderr,"----------------------------------------\n");
+  if ( DetailsLevel < 3) return;
+}
+
+
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/DT/DGraph.h b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/DT/DGraph.h
new file mode 100644
index 000000000..f38f898b2
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/DT/DGraph.h
@@ -0,0 +1,43 @@
+#ifndef _DGRAPH
+#define _DGRAPH
+
+#define BLOCK_SIZE  128
+#define SMALL_BLOCK_SIZE 32
+
+typedef struct{
+  int id;
+  void *tail,*head;
+  int length,width,attribute,maxWidth;
+}DGArc;
+
+typedef struct{
+  int maxInDegree,maxOutDegree;
+  int inDegree,outDegree;
+  int id;
+  char *name;
+  DGArc **inArc,**outArc;
+  int depth,height,width;
+  int color,attribute,address,verified;
+  void *feat;
+}DGNode;
+
+typedef struct{
+  int maxNodes,maxArcs;
+  int id;
+  char *name;
+  int numNodes,numArcs;
+  DGNode **node;
+  DGArc **arc;
+} DGraph;
+
+DGArc *newArc(DGNode *tl,DGNode *hd);
+void arcShow(DGArc *ar);
+DGNode *newNode(char *nm);
+void nodeShow(DGNode* nd);
+
+DGraph* newDGraph(char *nm);
+int AttachNode(DGraph *dg,DGNode *nd);
+int AttachArc(DGraph *dg,DGArc* nar);
+void graphShow(DGraph *dg,int DetailsLevel);
+
+#endif
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/DT/Makefile b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/DT/Makefile
new file mode 100644
index 000000000..687ac3324
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/DT/Makefile
@@ -0,0 +1,26 @@
+SHELL=/bin/sh
+BENCHMARK=dt
+BENCHMARKU=DT
+
+include ../config/make.def
+
+include ../sys/make.common
+#Override PROGRAM
+DTPROGRAM  = $(BINDIR)/$(BENCHMARK).$(CLASS).x
+
+OBJS = dt.o DGraph.o \
+	${COMMON}/c_print_results.o ${COMMON}/c_timers.o ${COMMON}/c_randdp.o
+
+
+${PROGRAM}: config ${OBJS}
+	${CLINK} ${CLINKFLAGS} -o ${DTPROGRAM} ${OBJS} ${CMPI_LIB}
+
+.c.o:
+	${CCOMPILE} $<
+
+dt.o:             dt.c  npbparams.h
+DGraph.o:	DGraph.c DGraph.h
+
+clean:
+	- rm -f *.o *~ mputil*
+	- rm -f dt npbparams.h core
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/DT/README b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/DT/README
new file mode 100644
index 000000000..873e3ae6f
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/DT/README
@@ -0,0 +1,22 @@
+Data Traffic benchmark DT is new in the NPB suite 
+(released as part of NPB3.x-MPI package).
+----------------------------------------------------
+
+DT is written in C, and the same executable can run on any number of processors,
+provided this number is not less than the number of nodes in the communication
+graph.  DT benchmark takes one argument: BH, WH, or SH. This argument 
+specifies the communication graph Black Hole, White Hole, or SHuffle 
+respectively. The current release contains verification numbers for 
+CLASSES S, W, A, and B only.  Classes C and D are defined, but verification 
+numbers are not provided in this release.
+
+The following table summarizes the number of nodes in the communication
+graph based on CLASS and graph TYPE.
+
+CLASS  N_Source N_Nodes(BH,WH) N_Nodes(SH)
+ S      4        5              12
+ W      8        11             32
+ A      16       21             80
+ B      32       43             192
+ C      64       85             448
+ D      128      171            1024
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/DT/dt.c b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/DT/dt.c
new file mode 100644
index 000000000..281979d31
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/DT/dt.c
@@ -0,0 +1,755 @@
+/*************************************************************************
+ *                                                                       * 
+ *        N  A  S     P A R A L L E L     B E N C H M A R K S  3.4       *
+ *                                                                       * 
+ *                                  D T					 * 
+ *                                                                       * 
+ ************************************************************************* 
+ *                                                                       * 
+ *   This benchmark is part of the NAS Parallel Benchmark 3.4 suite.     *
+ *                                                                       * 
+ *   Permission to use, copy, distribute and modify this software        * 
+ *   for any purpose with or without fee is hereby granted.  We          * 
+ *   request, however, that all derived work reference the NAS           * 
+ *   Parallel Benchmarks 3.4. This software is provided "as is"          *
+ *   without express or implied warranty.                                * 
+ *                                                                       * 
+ *   Information on NPB 3.4, including the technical report, the         *
+ *   original specifications, source code, results and information       * 
+ *   on how to submit new results, is available at:                      * 
+ *                                                                       * 
+ *          http://www.nas.nasa.gov/Software/NPB                         * 
+ *                                                                       * 
+ *   Send comments or suggestions to  npb@nas.nasa.gov                   * 
+ *   Send bug reports to              npb-bugs@nas.nasa.gov              * 
+ *                                                                       * 
+ *         NAS Parallel Benchmarks Group                                 * 
+ *         NASA Ames Research Center                                     * 
+ *         Mail Stop: T27A-1                                             * 
+ *         Moffett Field, CA   94035-1000                                * 
+ *                                                                       * 
+ *         E-mail:  npb@nas.nasa.gov                                     * 
+ *         Fax:     (650) 604-3957                                       * 
+ *                                                                       * 
+ ************************************************************************* 
+ *                                                                       * 
+ *   Author: M. Frumkin                                                  * 
+ *                                                                       * 
+ *************************************************************************/
+
+#include <stdlib.h>
+#include <stdio.h>
+#include <string.h>
+
+#include "mpi.h"
+#include "npbparams.h"
+
+#ifndef CLASS
+#define CLASS 'S'
+#endif
+
+int      passed_verification;
+extern double randlc( double *X, double *A );
+extern
+void c_print_results( char   *name,
+                      char   class,
+                      int    n1, 
+                      int    n2,
+                      int    n3,
+                      int    niter,
+                      int    nprocs_compiled,
+                      int    nprocs_total,
+                      double t,
+                      double mops,
+		      char   *optype,
+                      int    passed_verification,
+                      char   *npbversion,
+                      char   *compiletime,
+                      char   *mpicc,
+                      char   *clink,
+                      char   *cmpi_lib,
+                      char   *cmpi_inc,
+                      char   *cflags,
+                      char   *clinkflags );
+		      
+#include "../common/c_timers.h"
+int timer_on=0,timers_tot=64;
+
+int verify(char *bmname,double rnm2){
+    double verify_value=0.0;
+    double epsilon=1.0E-8;
+    char cls=CLASS;
+    int verified=-1;
+    if (cls != 'U') {
+       if(cls=='S') {
+         if(strstr(bmname,"BH")){
+           verify_value=30892725.0;
+         }else if(strstr(bmname,"WH")){
+           verify_value=67349758.0;
+         }else if(strstr(bmname,"SH")){
+           verify_value=58875767.0;
+         }else{
+           fprintf(stderr,"No such benchmark as %s.\n",bmname);
+         }
+         verified = 0;
+       }else if(cls=='W') {
+         if(strstr(bmname,"BH")){
+  	   verify_value = 4102461.0;
+         }else if(strstr(bmname,"WH")){
+  	   verify_value = 204280762.0;
+         }else if(strstr(bmname,"SH")){
+  	   verify_value = 186944764.0;
+         }else{
+           fprintf(stderr,"No such benchmark as %s.\n",bmname);
+         }
+         verified = 0;
+       }else if(cls=='A') {
+         if(strstr(bmname,"BH")){
+  	   verify_value = 17809491.0;
+         }else if(strstr(bmname,"WH")){
+  	   verify_value = 1289925229.0;
+         }else if(strstr(bmname,"SH")){
+  	   verify_value = 610856482.0;
+         }else{
+           fprintf(stderr,"No such benchmark as %s.\n",bmname);
+         }
+  	 verified = 0;
+       }else if(cls=='B') {
+         if(strstr(bmname,"BH")){
+  	   verify_value = 4317114.0;
+         }else if(strstr(bmname,"WH")){
+  	   verify_value = 7877279917.0;
+         }else if(strstr(bmname,"SH")){
+  	   verify_value = 1836863082.0;
+         }else{
+           fprintf(stderr,"No such benchmark as %s.\n",bmname);
+         }
+         verified = 0;
+       }else if(cls=='C') {
+         if(strstr(bmname,"BH")){
+  	   verify_value = 0.0;
+         }else if(strstr(bmname,"WH")){
+  	   verify_value = 0.0;
+         }else if(strstr(bmname,"SH")){
+  	   verify_value = 0.0;
+         }else{
+           fprintf(stderr,"No such benchmark as %s.\n",bmname);
+  	   verified = -1;
+         }
+       }else if(cls=='D') {
+         if(strstr(bmname,"BH")){
+  	   verify_value = 0.0;
+         }else if(strstr(bmname,"WH")){
+  	   verify_value = 0.0;
+         }else if(strstr(bmname,"SH")){
+  	   verify_value = 0.0;
+         }else{
+           fprintf(stderr,"No such benchmark as %s.\n",bmname);
+         }
+         verified = -1;
+       }else{
+         fprintf(stderr,"No such class as %c.\n",cls);
+       }
+       fprintf(stderr," %s L2 Norm = %f\n",bmname,rnm2);
+       if(verified==-1){
+  	 fprintf(stderr," No verification was performed.\n");
+       }else if( rnm2 - verify_value < epsilon &&
+                 rnm2 - verify_value > -epsilon) {  /* abs here does not work on ALTIX */
+  	  verified = 1;
+  	  fprintf(stderr," Deviation = %f\n",(rnm2 - verify_value));
+       }else{
+  	 verified = 0;
+  	 fprintf(stderr," The correct verification value = %f\n",verify_value);
+  	 fprintf(stderr," Got value = %f\n",rnm2);
+       }
+    }else{
+       verified = -1;
+    }
+    return  verified;  
+  }
+
+int ipowMod(int a,long long int n,int md){ 
+  int seed=1,q=a,r=1;
+  if(n<0){
+    fprintf(stderr,"ipowMod: exponent must be nonnegative exp=%lld\n",n);
+    n=-n; /* temp fix */
+/*    return 1; */
+  }
+  if(md<=0){
+    fprintf(stderr,"ipowMod: modulus must be positive mod=%d\n",md);
+    return 1;
+  }
+  if(n==0) return 1;
+  while(n>1){
+    long long int n2 = n/2;
+    if (n2*2==n){
+       seed = (q*q)%md;
+       q=seed;
+       n = n2;
+    }else{
+       seed = (r*q)%md;
+       r=seed;
+       n = n-1;
+    }
+  }
+  seed = (r*q)%md;
+  return seed;
+}
+
+#include "DGraph.h"
+DGraph *buildSH(char cls){
+/*
+  Nodes of the graph must be topologically sorted
+  to avoid MPI deadlock.
+*/
+  DGraph *dg;
+  int numSources=NUM_SOURCES; /* must be power of 2 */
+  int numOfLayers=0,tmpS=numSources>>1;
+  int firstLayerNode=0;
+  DGArc *ar=NULL;
+  DGNode *nd=NULL;
+  int mask=0x0,ndid=0,ndoff=0;
+  int i=0,j=0;
+  char nm[BLOCK_SIZE];
+  
+  sprintf(nm,"DT_SH.%c",cls);
+  dg=newDGraph(nm);
+
+  while(tmpS>1){
+    numOfLayers++;
+    tmpS>>=1;
+  }
+  for(i=0;i<numSources;i++){
+    sprintf(nm,"Source.%d",i);
+    nd=newNode(nm);
+    AttachNode(dg,nd);
+  }
+  for(j=0;j<numOfLayers;j++){
+    mask=0x00000001<<j;
+    for(i=0;i<numSources;i++){
+      sprintf(nm,"Comparator.%d",(i+j*firstLayerNode));
+      nd=newNode(nm);
+      AttachNode(dg,nd);
+      ndoff=i&(~mask);
+      ndid=firstLayerNode+ndoff;
+      ar=newArc(dg->node[ndid],nd);     
+      AttachArc(dg,ar);
+      ndoff+=mask;
+      ndid=firstLayerNode+ndoff;
+      ar=newArc(dg->node[ndid],nd);     
+      AttachArc(dg,ar);
+    }
+    firstLayerNode+=numSources;
+  }
+  mask=0x00000001<<numOfLayers;
+  for(i=0;i<numSources;i++){
+    sprintf(nm,"Sink.%d",i);
+    nd=newNode(nm);
+    AttachNode(dg,nd);
+    ndoff=i&(~mask);
+    ndid=firstLayerNode+ndoff;
+    ar=newArc(dg->node[ndid],nd);     
+    AttachArc(dg,ar);
+    ndoff+=mask;
+    ndid=firstLayerNode+ndoff;
+    ar=newArc(dg->node[ndid],nd);     
+    AttachArc(dg,ar);
+  }
+return dg;
+}
+DGraph *buildWH(char cls){
+/*
+  Nodes of the graph must be topologically sorted
+  to avoid MPI deadlock.
+*/
+  int i=0,j=0;
+  int numSources=NUM_SOURCES,maxInDeg=4;
+  int numLayerNodes=numSources,firstLayerNode=0;
+  int totComparators=0;
+  int numPrevLayerNodes=numLayerNodes;
+  int id=0,sid=0;
+  DGraph *dg;
+  DGNode *nd=NULL,*source=NULL,*tmp=NULL,*snd=NULL;
+  DGArc *ar=NULL;
+  char nm[BLOCK_SIZE];
+
+  sprintf(nm,"DT_WH.%c",cls);
+  dg=newDGraph(nm);
+
+  for(i=0;i<numSources;i++){
+    sprintf(nm,"Sink.%d",i);
+    nd=newNode(nm);
+    AttachNode(dg,nd);
+  }
+  totComparators=0;
+  numPrevLayerNodes=numLayerNodes;
+  while(numLayerNodes>maxInDeg){
+    numLayerNodes=numLayerNodes/maxInDeg;
+    if(numLayerNodes*maxInDeg<numPrevLayerNodes)numLayerNodes++;
+    for(i=0;i<numLayerNodes;i++){
+      sprintf(nm,"Comparator.%d",totComparators);
+      totComparators++;
+      nd=newNode(nm);
+      id=AttachNode(dg,nd);
+      for(j=0;j<maxInDeg;j++){
+        sid=i*maxInDeg+j;
+	if(sid>=numPrevLayerNodes) break;
+        snd=dg->node[firstLayerNode+sid];
+        ar=newArc(dg->node[id],snd);
+        AttachArc(dg,ar);
+      }
+    }
+    firstLayerNode+=numPrevLayerNodes;
+    numPrevLayerNodes=numLayerNodes;
+  }
+  source=newNode("Source");
+  AttachNode(dg,source);   
+  for(i=0;i<numPrevLayerNodes;i++){
+    nd=dg->node[firstLayerNode+i];
+    ar=newArc(source,nd);
+    AttachArc(dg,ar);
+  }
+
+  for(i=0;i<dg->numNodes/2;i++){  /* Topological sorting */
+    tmp=dg->node[i];
+    dg->node[i]=dg->node[dg->numNodes-1-i];
+    dg->node[i]->id=i;
+    dg->node[dg->numNodes-1-i]=tmp;
+    dg->node[dg->numNodes-1-i]->id=dg->numNodes-1-i;
+  }
+return dg;
+}
+DGraph *buildBH(char cls){
+/*
+  Nodes of the graph must be topologically sorted
+  to avoid MPI deadlock.
+*/
+  int i=0,j=0;
+  int numSources=NUM_SOURCES,maxInDeg=4;
+  int numLayerNodes=numSources,firstLayerNode=0;
+  DGraph *dg;
+  DGNode *nd=NULL, *snd=NULL, *sink=NULL;
+  DGArc *ar=NULL;
+  int totComparators=0;
+  int numPrevLayerNodes=numLayerNodes;
+  int id=0, sid=0;
+  char nm[BLOCK_SIZE];
+
+  sprintf(nm,"DT_BH.%c",cls);
+  dg=newDGraph(nm);
+
+  for(i=0;i<numSources;i++){
+    sprintf(nm,"Source.%d",i);
+    nd=newNode(nm);
+    AttachNode(dg,nd);
+  }
+  while(numLayerNodes>maxInDeg){
+    numLayerNodes=numLayerNodes/maxInDeg;
+    if(numLayerNodes*maxInDeg<numPrevLayerNodes)numLayerNodes++;
+    for(i=0;i<numLayerNodes;i++){
+      sprintf(nm,"Comparator.%d",totComparators);
+      totComparators++;
+      nd=newNode(nm);
+      id=AttachNode(dg,nd);
+      for(j=0;j<maxInDeg;j++){
+        sid=i*maxInDeg+j;
+	if(sid>=numPrevLayerNodes) break;
+        snd=dg->node[firstLayerNode+sid];
+        ar=newArc(snd,dg->node[id]);
+        AttachArc(dg,ar);
+      }
+    }
+    firstLayerNode+=numPrevLayerNodes;
+    numPrevLayerNodes=numLayerNodes;
+  }
+  sink=newNode("Sink");
+  AttachNode(dg,sink);   
+  for(i=0;i<numPrevLayerNodes;i++){
+    nd=dg->node[firstLayerNode+i];
+    ar=newArc(nd,sink);
+    AttachArc(dg,ar);
+  }
+return dg;
+}
+
+typedef struct{
+  int len;
+  double* val;
+} Arr;
+Arr *newArr(int len){
+  Arr *arr=(Arr *)malloc(sizeof(Arr));
+  arr->len=len;
+  arr->val=(double *)malloc(len*sizeof(double));
+  return arr;
+}
+void arrShow(Arr* a){
+  if(!a) fprintf(stderr,"-- NULL array\n");
+  else{
+    fprintf(stderr,"-- length=%d\n",a->len);
+  }
+}
+double CheckVal(Arr *feat){
+  double csum=0.0;
+  int i=0;
+  for(i=0;i<feat->len;i++){
+    csum+=feat->val[i]*feat->val[i]/feat->len; /* The truncation does not work since 
+                                                  result will be 0 for large len  */
+  }
+   return csum;
+}
+int GetFNumDPar(int* mean, int* stdev){
+  *mean=NUM_SAMPLES;
+  *stdev=STD_DEVIATION;
+  return 0;
+}
+int GetFeatureNum(char *mbname,int id){
+  double tran=314159265.0;
+  double A=2*id+1;
+  double denom=randlc(&tran,&A);
+  char cval='S';
+  int mean=NUM_SAMPLES,stdev=128;
+  int rtfs=0,len=0;
+  GetFNumDPar(&mean,&stdev);
+  rtfs=ipowMod((int)(1/denom)*(int)cval,(long long int) (2*id+1),2*stdev);
+  if(rtfs<0) rtfs=-rtfs;
+  len=mean-stdev+rtfs;
+  return len;
+}
+Arr* RandomFeatures(char *bmname,int fdim,int id){
+  int len=GetFeatureNum(bmname,id)*fdim;
+  Arr* feat=newArr(len);
+  int nxg=2,nyg=2,nzg=2,nfg=5;
+  int nx=421,ny=419,nz=1427,nf=3527;
+  long long int expon=(len*(id+1))%3141592;
+  int seedx=ipowMod(nxg,expon,nx),
+      seedy=ipowMod(nyg,expon,ny),
+      seedz=ipowMod(nzg,expon,nz),
+      seedf=ipowMod(nfg,expon,nf);
+  int i=0;
+  if(timer_on){
+    timer_clear(id+1);
+    timer_start(id+1);
+  }
+  for(i=0;i<len;i+=fdim){
+    seedx=(seedx*nxg)%nx;
+    seedy=(seedy*nyg)%ny;
+    seedz=(seedz*nzg)%nz;
+    seedf=(seedf*nfg)%nf;
+    feat->val[i]=seedx;
+    feat->val[i+1]=seedy;
+    feat->val[i+2]=seedz;
+    feat->val[i+3]=seedf;
+  }
+  if(timer_on){
+    timer_stop(id+1);
+    fprintf(stderr,"** RandomFeatures time in node %d = %f\n",id,timer_read(id+1));
+  }
+  return feat;   
+}
+void Resample(Arr *a,int blen){
+    long long int i=0,j=0,jlo=0,jhi=0;
+    double avval=0.0;
+    double *nval=(double *)malloc(blen*sizeof(double));
+    Arr *tmp=newArr(10);
+    for(i=0;i<blen;i++) nval[i]=0.0;
+    for(i=1;i<a->len-1;i++){
+      jlo=(int)(0.5*(2*i-1)*(blen/a->len)); 
+      jhi=(int)(0.5*(2*i+1)*(blen/a->len));
+
+      avval=a->val[i]/(jhi-jlo+1);    
+      for(j=jlo;j<=jhi;j++){
+        nval[j]+=avval;
+      }
+    }
+    nval[0]=a->val[0];
+    nval[blen-1]=a->val[a->len-1];
+    free(a->val);
+    a->val=nval;
+    a->len=blen;
+}
+#define fielddim 4
+Arr* WindowFilter(Arr *a, Arr* b,int w){
+  int i=0,j=0,k=0;
+  double rms0=0.0,rms1=0.0,rmsm1=0.0;
+  double weight=((double) (w+1))/(w+2);
+ 
+  w+=1;
+  if(timer_on){
+    timer_clear(w);
+    timer_start(w);
+  }
+  if(a->len<b->len) Resample(a,b->len);
+  if(a->len>b->len) Resample(b,a->len);
+  for(i=fielddim;i<a->len-fielddim;i+=fielddim){
+    rms0=(a->val[i]-b->val[i])*(a->val[i]-b->val[i])
+	+(a->val[i+1]-b->val[i+1])*(a->val[i+1]-b->val[i+1])
+	+(a->val[i+2]-b->val[i+2])*(a->val[i+2]-b->val[i+2])
+	+(a->val[i+3]-b->val[i+3])*(a->val[i+3]-b->val[i+3]);
+    j=i+fielddim;
+    rms1=(a->val[j]-b->val[j])*(a->val[j]-b->val[j])
+    	+(a->val[j+1]-b->val[j+1])*(a->val[j+1]-b->val[j+1])
+    	+(a->val[j+2]-b->val[j+2])*(a->val[j+2]-b->val[j+2])
+    	+(a->val[j+3]-b->val[j+3])*(a->val[j+3]-b->val[j+3]);
+    j=i-fielddim;
+    rmsm1=(a->val[j]-b->val[j])*(a->val[j]-b->val[j])
+	 +(a->val[j+1]-b->val[j+1])*(a->val[j+1]-b->val[j+1])
+	 +(a->val[j+2]-b->val[j+2])*(a->val[j+2]-b->val[j+2])
+	 +(a->val[j+3]-b->val[j+3])*(a->val[j+3]-b->val[j+3]);
+    k=0;
+    if(rms1<rms0){
+      k=1;
+      rms0=rms1;
+    }
+    if(rmsm1<rms0) k=-1;
+    if(k==0){
+      j=i+fielddim;
+      a->val[i]=weight*b->val[i];
+      a->val[i+1]=weight*b->val[i+1];
+      a->val[i+2]=weight*b->val[i+2];
+      a->val[i+3]=weight*b->val[i+3];  
+    }else if(k==1){
+      j=i+fielddim;
+      a->val[i]=weight*b->val[j];
+      a->val[i+1]=weight*b->val[j+1];
+      a->val[i+2]=weight*b->val[j+2];
+      a->val[i+3]=weight*b->val[j+3];  
+    }else { /*if(k==-1)*/
+      j=i-fielddim;
+      a->val[i]=weight*b->val[j];
+      a->val[i+1]=weight*b->val[j+1];
+      a->val[i+2]=weight*b->val[j+2];
+      a->val[i+3]=weight*b->val[j+3];  
+    }	   
+  }
+  if(timer_on){
+    timer_stop(w);
+    fprintf(stderr,"** WindowFilter time in node %d = %f\n",(w-1),timer_read(w));
+  }
+  return a;
+}
+
+int SendResults(DGraph *dg,DGNode *nd,Arr *feat){
+  int i=0,tag=0;
+  DGArc *ar=NULL;
+  DGNode *head=NULL;
+  if(!feat) return 0;
+  for(i=0;i<nd->outDegree;i++){
+    ar=nd->outArc[i];
+    if(ar->tail!=nd) continue;
+    head=ar->head;
+    tag=ar->id;
+    if(head->address!=nd->address){
+      MPI_Send(&feat->len,1,MPI_INT,head->address,tag,MPI_COMM_WORLD);
+      MPI_Send(feat->val,feat->len,MPI_DOUBLE,head->address,tag,MPI_COMM_WORLD);
+    }
+  }
+  return 1;
+}
+Arr* CombineStreams(DGraph *dg,DGNode *nd){
+  Arr *resfeat=newArr(NUM_SAMPLES*fielddim);
+  int i=0,len=0,tag=0;
+  DGArc *ar=NULL;
+  DGNode *tail=NULL;
+  MPI_Status status;
+  Arr *feat=NULL,*featp=NULL;
+
+  if(nd->inDegree==0) return NULL;
+  for(i=0;i<nd->inDegree;i++){
+    ar=nd->inArc[i];
+    if(ar->head!=nd) continue;
+    tail=ar->tail;
+    if(tail->address!=nd->address){
+      len=0;
+      tag=ar->id;
+      MPI_Recv(&len,1,MPI_INT,tail->address,tag,MPI_COMM_WORLD,&status);
+      feat=newArr(len);
+      MPI_Recv(feat->val,feat->len,MPI_DOUBLE,tail->address,tag,MPI_COMM_WORLD,&status);
+      resfeat=WindowFilter(resfeat,feat,nd->id);
+      free(feat);
+    }else{
+      featp=(Arr *)tail->feat;
+      feat=newArr(featp->len);
+      memcpy(feat->val,featp->val,featp->len*sizeof(double));
+      resfeat=WindowFilter(resfeat,feat,nd->id);  
+      free(feat);
+    }
+  }
+  for(i=0;i<resfeat->len;i++) resfeat->val[i]=((int)resfeat->val[i])/nd->inDegree;
+  nd->feat=resfeat;
+  return nd->feat;
+}
+double Reduce(Arr *a,int w){
+  double retv=0.0;
+  if(timer_on){
+    timer_clear(w);
+    timer_start(w);
+  }
+  retv=(int)(w*CheckVal(a));/* The casting needed for node  
+                               and array dependent verification */
+  if(timer_on){
+    timer_stop(w);
+    fprintf(stderr,"** Reduce time in node %d = %f\n",(w-1),timer_read(w));
+  }
+  return retv;
+}
+
+double ReduceStreams(DGraph *dg,DGNode *nd){
+  double csum=0.0;
+  int i=0,len=0,tag=0;
+  DGArc *ar=NULL;
+  DGNode *tail=NULL;
+  Arr *feat=NULL;
+  double retv=0.0;
+
+  for(i=0;i<nd->inDegree;i++){
+    ar=nd->inArc[i];
+    if(ar->head!=nd) continue;
+    tail=ar->tail;
+    if(tail->address!=nd->address){
+      MPI_Status status;
+      len=0;
+      tag=ar->id;
+      MPI_Recv(&len,1,MPI_INT,tail->address,tag,MPI_COMM_WORLD,&status);
+      feat=newArr(len);
+      MPI_Recv(feat->val,feat->len,MPI_DOUBLE,tail->address,tag,MPI_COMM_WORLD,&status);
+      csum+=Reduce(feat,(nd->id+1));  
+      free(feat);
+    }else{
+      csum+=Reduce(tail->feat,(nd->id+1));  
+    }
+  }
+  if(nd->inDegree>0)csum=(((long long int)csum)/nd->inDegree);
+  retv=(nd->id+1)*csum;
+  return retv;
+}
+
+int ProcessNodes(DGraph *dg,int me){
+  double chksum=0.0;
+  Arr *feat=NULL;
+  int i=0,verified=0,tag;
+  DGNode *nd=NULL;
+  double rchksum=0.0;
+  MPI_Status status;
+
+  for(i=0;i<dg->numNodes;i++){
+    nd=dg->node[i];
+    if(nd->address!=me) continue;
+    if(strstr(nd->name,"Source")){
+      nd->feat=RandomFeatures(dg->name,fielddim,nd->id); 
+      SendResults(dg,nd,nd->feat);
+    }else if(strstr(nd->name,"Sink")){
+      chksum=ReduceStreams(dg,nd);
+      tag=dg->numArcs+nd->id; /* make these to avoid clash with arc tags */
+      MPI_Send(&chksum,1,MPI_DOUBLE,0,tag,MPI_COMM_WORLD);
+    }else{
+      feat=CombineStreams(dg,nd);
+      SendResults(dg,nd,feat);
+    }
+  }
+  if(me==0){ /* Report node */
+    rchksum=0.0;
+    chksum=0.0;
+    for(i=0;i<dg->numNodes;i++){
+      nd=dg->node[i];
+      if(!strstr(nd->name,"Sink")) continue;
+       tag=dg->numArcs+nd->id; /* make these to avoid clash with arc tags */
+      MPI_Recv(&rchksum,1,MPI_DOUBLE,nd->address,tag,MPI_COMM_WORLD,&status);
+      chksum+=rchksum;
+    }
+    verified=verify(dg->name,chksum);
+  }
+return verified;
+}
+
+int main(int argc,char **argv ){
+  int my_rank,comm_size;
+  int i;
+  DGraph *dg=NULL;
+  int verified=0, featnum=0;
+  double bytes_sent=2.0,tot_time=0.0;
+
+    MPI_Init( &argc, &argv );
+    MPI_Comm_rank( MPI_COMM_WORLD, &my_rank );
+    MPI_Comm_size( MPI_COMM_WORLD, &comm_size );
+
+     if(argc!=2||
+                (  strncmp(argv[1],"BH",2)!=0
+                 &&strncmp(argv[1],"WH",2)!=0
+                 &&strncmp(argv[1],"SH",2)!=0
+                )
+      ){
+      if(my_rank==0){
+        fprintf(stderr,"** Usage: mpirun -np N ../bin/dt.S GraphName\n");
+        fprintf(stderr,"** Where \n   - N is integer number of MPI processes\n");
+        fprintf(stderr,"   - S is the class S, W, or A \n");
+        fprintf(stderr,"   - GraphName is the communication graph name BH, WH, or SH.\n");
+        fprintf(stderr,"   - the number of MPI processes N should not be less than \n");
+        fprintf(stderr,"     the number of nodes in the graph\n");
+      }
+      MPI_Finalize();
+      exit(1);
+    } 
+   if(strncmp(argv[1],"BH",2)==0){
+      dg=buildBH(CLASS);
+    }else if(strncmp(argv[1],"WH",2)==0){
+      dg=buildWH(CLASS);
+    }else if(strncmp(argv[1],"SH",2)==0){
+      dg=buildSH(CLASS);
+    }
+
+    if(timer_on&&dg->numNodes+1>timers_tot){
+      timer_on=0;
+      if(my_rank==0)
+        fprintf(stderr,"Not enough timers. Node timing is off. \n");
+    }
+    if(dg->numNodes>comm_size){
+      if(my_rank==0){
+        fprintf(stderr,"**  The number of MPI processes should not be less than \n");
+        fprintf(stderr,"**  the number of nodes in the graph\n");
+        fprintf(stderr,"**  Number of MPI processes = %d\n",comm_size);
+        fprintf(stderr,"**  Number of nodes in the graph = %d\n",dg->numNodes);
+      }
+      MPI_Finalize();
+      exit(1);
+    }
+    for(i=0;i<dg->numNodes;i++){ 
+      dg->node[i]->address=i;
+    }
+    if( my_rank == 0 ){
+      printf( "\n\n NAS Parallel Benchmarks 3.4 -- DT Benchmark\n\n" );
+      graphShow(dg,0);
+      timer_clear(0);
+      timer_start(0);
+    }
+    verified=ProcessNodes(dg,my_rank);
+    
+    featnum=NUM_SAMPLES*fielddim;
+    bytes_sent=featnum*dg->numArcs;
+    bytes_sent/=1048576;
+    if(my_rank==0){
+      timer_stop(0);
+      tot_time=timer_read(0);
+      c_print_results( dg->name,
+        	       CLASS,
+        	       featnum,
+        	       0,
+        	       0,
+        	       dg->numNodes,
+        	       0,
+        	       comm_size,
+        	       tot_time,
+        	       bytes_sent/tot_time,
+        	       "bytes transmitted", 
+        	       verified,
+        	       NPBVERSION,
+        	       COMPILETIME,
+        	       MPICC,
+        	       CLINK,
+        	       CMPI_LIB,
+        	       CMPI_INC,
+        	       CFLAGS,
+        	       CLINKFLAGS );
+    }          
+    MPI_Finalize();
+  return 0;
+}
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/EP/Makefile b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/EP/Makefile
new file mode 100644
index 000000000..b77a4e80d
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/EP/Makefile
@@ -0,0 +1,28 @@
+SHELL=/bin/sh
+BENCHMARK=ep
+BENCHMARKU=EP
+
+include ../config/make.def
+
+OBJS = ep.o ep_data.o verify.o mpinpb.o \
+	${COMMON}/print_results.o ${COMMON}/${RAND}.o ${COMMON}/timers.o
+
+include ../sys/make.common
+
+${PROGRAM}: config ${OBJS}
+	${FLINK} ${FLINKFLAGS} -o ${PROGRAM} ${OBJS} ${FMPI_LIB}
+
+.f90.o:
+	${FCOMPILE} $<
+
+ep.o:		ep.f90 ep_data.o mpinpb.o
+ep_data.o:	ep_data.f90 npbparams.h
+verify.o:	verify.f90
+mpinpb.o:	mpinpb.f90
+
+clean:
+	- rm -f *.o *~ *.mod
+	- rm -f npbparams.h core
+
+
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/EP/README b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/EP/README
new file mode 100644
index 000000000..6eb36571a
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/EP/README
@@ -0,0 +1,6 @@
+This code implements the random-number generator described in the
+NAS Parallel Benchmark document RNR Technical Report RNR-94-007.
+The code is "embarrassingly" parallel in that no communication is
+required for the generation of the random numbers itself. There is
+no special requirement on the number of processors used for running
+the benchmark.
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/EP/ep.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/EP/ep.f90
new file mode 100644
index 000000000..1bdaa9aac
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/EP/ep.f90
@@ -0,0 +1,319 @@
+!-------------------------------------------------------------------------!
+!                                                                         !
+!        N  A  S     P A R A L L E L     B E N C H M A R K S  3.4         !
+!                                                                         !
+!                                   E P                                   !
+!                                                                         !
+!-------------------------------------------------------------------------!
+!                                                                         !
+!    This benchmark is part of the NAS Parallel Benchmark 3.4 suite.      !
+!    It is described in NAS Technical Reports 95-020 and 02-007           !
+!                                                                         !
+!    Permission to use, copy, distribute and modify this software         !
+!    for any purpose with or without fee is hereby granted.  We           !
+!    request, however, that all derived work reference the NAS            !
+!    Parallel Benchmarks 3.4. This software is provided "as is"           !
+!    without express or implied warranty.                                 !
+!                                                                         !
+!    Information on NPB 3.4, including the technical report, the          !
+!    original specifications, source code, results and information        !
+!    on how to submit new results, is available at:                       !
+!                                                                         !
+!           http://www.nas.nasa.gov/Software/NPB/                         !
+!                                                                         !
+!    Send comments or suggestions to  npb@nas.nasa.gov                    !
+!                                                                         !
+!          NAS Parallel Benchmarks Group                                  !
+!          NASA Ames Research Center                                      !
+!          Mail Stop: T27A-1                                              !
+!          Moffett Field, CA   94035-1000                                 !
+!                                                                         !
+!          E-mail:  npb@nas.nasa.gov                                      !
+!          Fax:     (650) 604-3957                                        !
+!                                                                         !
+!-------------------------------------------------------------------------!
+
+
+!---------------------------------------------------------------------
+!
+! Authors: P. O. Frederickson 
+!          D. H. Bailey
+!          A. C. Woo
+!          R. F. Van der Wijngaart
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+      program EMBAR
+!---------------------------------------------------------------------
+
+!   This is the MPI version of the APP Benchmark 1,
+!   the "embarrassingly parallel" benchmark.
+
+      use ep_data
+      use mpinpb
+
+      implicit none
+
+      double precision Mops, t1, t2, t3, t4, x1,  &
+     &                 x2, sx, sy, tm, an, tt, gc, dum(3)
+
+      integer          i, ik, kk, l, k, nit, no_large_nodes,  &
+     &                 np, np_add, k_offset, j
+      integer          ierr, ierrcode
+
+      logical          verified, timers_enabled
+
+      double precision randlc, timer_read
+      external         randlc, timer_read
+
+      character        size*15, classv
+
+      double precision epsilon
+      parameter       (epsilon=1.d-8)
+
+      double precision tsum(t_last+2), t1m(t_last+2),  &
+     &                 tming(t_last+2), tmaxg(t_last+2)
+      character        t_recs(t_last+2)*8
+
+      data             dum /1.d0, 1.d0, 1.d0/
+
+      data t_recs/'total', 'gpairs', 'randn', 'rcomm',  &
+     &            ' totcomp', ' totcomm'/
+
+
+      call mpi_init(ierr)
+      comm_solve = MPI_COMM_WORLD
+      call mpi_comm_rank(comm_solve,node,ierr)
+      call mpi_comm_size(comm_solve,no_nodes,ierr)
+
+      root = 0
+
+      if (.not. convertdouble) then
+         dp_type = MPI_DOUBLE_PRECISION
+      else
+         dp_type = MPI_REAL
+      endif
+
+      if (node.eq.root)  then
+
+!   Because the size of the problem is too large to store in a 32-bit
+!   integer for some classes, we put it into a string (for printing).
+!   Have to strip off the decimal point put in there by the floating
+!   point print statement (internal file)
+
+          write(*, 1000)
+          write(size, '(f15.0)' ) 2.d0**(m+1)
+          j = 15
+          if (size(j:j) .eq. '.') j = j - 1
+          write (*,1001) size(1:j), class
+          write(*, 1003) no_nodes
+
+ 1000 format(/,' NAS Parallel Benchmarks 3.4 -- EP Benchmark',/)
+ 1001     format(' Number of random numbers generated: ', a15,  &
+     &           '  (class ', a, ')' )
+ 1003     format(' Total number of processes:          ', 2x, i13, /)
+
+          call check_timer_flag( timers_enabled )
+      endif
+
+      call mpi_bcast(timers_enabled, 1, MPI_LOGICAL, root,  &
+     &               comm_solve, ierr)
+
+      verified = .false.
+
+!   Compute the number of "batches" of random number pairs generated 
+!   per processor. Adjust if the number of processors does not evenly 
+!   divide the total number.
+
+      np = nn / no_nodes
+      no_large_nodes = mod(nn, no_nodes)
+      if (node .lt. no_large_nodes) then
+         np_add = 1
+      else
+         np_add = 0
+      endif
+      np = np + np_add
+
+      if (np .eq. 0) then
+         write (6, 1) no_nodes, nn
+ 1       format ('Too many nodes:', i0, 1x, i0)
+         ierrcode = 1
+         call mpi_abort(MPI_COMM_WORLD,ierrcode,ierr)
+         stop
+      endif
+
+!   Call the random number generator functions and initialize
+!   the x-array to reduce the effects of paging on the timings.
+!   Also, call all mathematical functions that are used. Make
+!   sure these initializations cannot be eliminated as dead code.
+
+      call vranlc(0, dum(1), dum(2), dum(3))
+      dum(1) = randlc(dum(2), dum(3))
+      do 5    i = 1, 2*nk
+         x(i) = -1.d99
+ 5    continue
+      Mops = log(sqrt(abs(max(1.d0,1.d0))))
+
+!---------------------------------------------------------------------
+!      Synchronize before placing time stamp
+!---------------------------------------------------------------------
+      do i = 1, t_last
+         call timer_clear(i)
+      end do
+      call mpi_barrier(comm_solve, ierr)
+      call timer_start(1)
+
+      t1 = a
+      call vranlc(0, t1, a, x)
+
+!   Compute AN = A ^ (2 * NK) (mod 2^46).
+
+      t1 = a
+
+      do 100 i = 1, mk + 1
+         t2 = randlc(t1, t1)
+ 100  continue
+
+      an = t1
+      tt = s
+      gc = 0.d0
+      sx = 0.d0
+      sy = 0.d0
+
+      do 110 i = 0, nq - 1
+         q(i) = 0.d0
+ 110  continue
+
+!   Each instance of this loop may be performed independently. We compute
+!   the k offsets separately to take into account the fact that some nodes
+!   have more numbers to generate than others
+
+      if (np_add .eq. 1) then
+         k_offset = node * np -1
+      else
+         k_offset = no_large_nodes*(np+1) + (node-no_large_nodes)*np -1
+      endif
+
+      do 150 k = 1, np
+         kk = k_offset + k 
+         t1 = s
+         t2 = an
+
+!        Find starting seed t1 for this kk.
+
+         do 120 i = 1, 100
+            ik = kk / 2
+            if (2 * ik .ne. kk) t3 = randlc(t1, t2)
+            if (ik .eq. 0) goto 130
+            t3 = randlc(t2, t2)
+            kk = ik
+ 120     continue
+
+!        Compute uniform pseudorandom numbers.
+ 130     continue
+
+         if (timers_enabled) call timer_start(t_randn)
+         call vranlc(2 * nk, t1, a, x)
+         if (timers_enabled) call timer_stop(t_randn)
+
+!        Compute Gaussian deviates by acceptance-rejection method and 
+!        tally counts in concentric square annuli.  This loop is not 
+!        vectorizable. 
+
+         if (timers_enabled) call timer_start(t_gpairs)
+
+         do 140 i = 1, nk
+            x1 = 2.d0 * x(2*i-1) - 1.d0
+            x2 = 2.d0 * x(2*i) - 1.d0
+            t1 = x1 ** 2 + x2 ** 2
+            if (t1 .le. 1.d0) then
+               t2   = sqrt(-2.d0 * log(t1) / t1)
+               t3   = abs(x1 * t2)
+               t4   = abs(x2 * t2)
+               l    = max(t3, t4)
+               q(l) = q(l) + 1.d0
+               sx   = sx + t3
+               sy   = sy + t4
+            endif
+ 140     continue
+
+         if (timers_enabled) call timer_stop(t_gpairs)
+
+ 150  continue
+
+      if (timers_enabled) call timer_start(t_rcomm)
+      call mpi_allreduce(sx, x, 1, dp_type,  &
+     &                   MPI_SUM, comm_solve, ierr)
+      sx = x(1)
+      call mpi_allreduce(sy, x, 1, dp_type,  &
+     &                   MPI_SUM, comm_solve, ierr)
+      sy = x(1)
+      call mpi_allreduce(q, x, nq, dp_type,  &
+     &                   MPI_SUM, comm_solve, ierr)
+      if (timers_enabled) call timer_stop(t_rcomm)
+
+      do i = 1, nq
+         q(i-1) = x(i)
+      enddo
+
+      do 160 i = 0, nq - 1
+        gc = gc + q(i)
+ 160  continue
+
+      call timer_stop(1)
+      tm  = timer_read(1)
+
+      call mpi_allreduce(tm, x, 1, dp_type,  &
+     &                   MPI_MAX, comm_solve, ierr)
+      tm = x(1)
+
+      if (node.eq.root) then
+         call verify(m, sx, sy, gc, verified, classv)
+
+         nit = 0
+         Mops = 2.d0**(m+1)/tm/1000000.d0
+
+         write (6,11) tm, m, gc, sx, sy, (i, q(i), i = 0, nq - 1)
+ 11      format ('EP Benchmark Results:'//'CPU Time =',f10.3/'N = 2^',  &
+     &           i5/'No. Gaussian Pairs =',f15.0/'Sums = ',1p,2d25.15/  &
+     &           'Counts:'/(i3,0p,f15.0))
+
+         call print_results('EP', class, m+1, 0, 0, nit, no_nodes,  &
+     &                      no_nodes, tm, Mops,  &
+     &                      'Random numbers generated',  &
+     &                      verified, npbversion, compiletime, cs1,  &
+     &                      cs2, cs3, cs4, cs5, cs6, cs7)
+
+      endif
+
+
+      if (.not.timers_enabled) goto 999
+
+      do i = 1, t_last
+         t1m(i) = timer_read(i)
+      end do
+      t1m(t_last+2) = t1m(t_rcomm)
+      t1m(t_last+1) = t1m(t_total) - t1m(t_last+2)
+
+      call MPI_Reduce(t1m, tsum,  t_last+2, dp_type, MPI_SUM,  &
+     &                0, comm_solve, ierr)
+      call MPI_Reduce(t1m, tming, t_last+2, dp_type, MPI_MIN,  &
+     &                0, comm_solve, ierr)
+      call MPI_Reduce(t1m, tmaxg, t_last+2, dp_type, MPI_MAX,  &
+     &                0, comm_solve, ierr)
+
+      if (node .eq. 0) then
+         write(*, 800) no_nodes
+         do i = 1, t_last+2
+            tsum(i) = tsum(i) / no_nodes
+            write(*, 810) i, t_recs(i), tming(i), tmaxg(i), tsum(i)
+         end do
+      endif
+ 800  format(' nprocs =', i6, 11x, 'minimum', 5x, 'maximum',  &
+     &       5x, 'average')
+ 810  format(' timer ', i2, '(', A8, ') :', 3(2x,f10.4))
+
+ 999  continue
+      call mpi_finalize(ierr)
+
+      end
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/EP/ep_data.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/EP/ep_data.f90
new file mode 100644
index 000000000..93ee961c7
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/EP/ep_data.f90
@@ -0,0 +1,39 @@
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+!
+!  ep_data module
+!
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+ 
+      module ep_data
+
+!---------------------------------------------------------------------
+!  The following include file is generated automatically by the
+!  "setparams" utility, which defines the problem size 'm'
+!---------------------------------------------------------------------
+
+      include 'npbparams.h'
+
+!---------------------------------------------------------------------
+!   M is the Log_2 of the number of complex pairs of uniform (0, 1) random
+!   numbers.  MK is the Log_2 of the size of each batch of uniform random
+!   numbers.  MK can be set for convenience on a given system, since it does
+!   not affect the results.
+!---------------------------------------------------------------------
+      integer    mk, mm, nn, nk, nq
+      parameter (mk = 16, mm = m - mk, nn = 2 ** mm,  &
+     &           nk = 2 ** mk, nq = 10)
+
+      double precision a, s
+      parameter (a = 1220703125.d0, s = 271828183.d0)
+
+! ... storage
+      double precision x(2*nk), q(0:nq-1), qq(10000)
+
+! ... timer constants
+      integer    t_total, t_gpairs, t_randn, t_rcomm, t_last
+      parameter (t_total=1, t_gpairs=2, t_randn=3, t_rcomm=4, t_last=4)
+
+      end module ep_data
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/EP/mpinpb.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/EP/mpinpb.f90
new file mode 100644
index 000000000..865402747
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/EP/mpinpb.f90
@@ -0,0 +1,16 @@
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+!
+!  mpinpb module
+!
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      module mpinpb
+
+      include 'mpif.h'
+
+      integer  node, no_nodes, root, comm_solve, dp_type
+
+      end module mpinpb
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/EP/verify.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/EP/verify.f90
new file mode 100644
index 000000000..65fee595c
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/EP/verify.f90
@@ -0,0 +1,82 @@
+!---------------------------------------------------------------------
+      subroutine verify(m, sx, sy, gc, verified, class)
+!---------------------------------------------------------------------
+
+      use, intrinsic :: ieee_arithmetic, only : ieee_is_nan
+
+      implicit none
+      integer m
+      double precision sx, sy, gc
+      logical verified
+      character class
+
+      double precision sx_verify_value, sy_verify_value
+      double precision gc_verify_value
+      double precision sx_err, sy_err, gc_err
+
+      double precision, parameter :: epsilon = 1.d-8
+
+      verified = .true.
+      if (m.eq.24) then
+         class = 'S'
+         sx_verify_value = 1.051299420395306D+07
+         sy_verify_value = 1.051517131857535D+07
+         gc_verify_value = 13176389.D0
+      elseif (m.eq.25) then
+         class = 'W'
+         sx_verify_value = 2.102505525182392D+07
+         sy_verify_value = 2.103162209578822D+07
+         gc_verify_value = 26354769.D0
+      elseif (m.eq.28) then
+         class = 'A'
+         sx_verify_value = 1.682235632304711D+08
+         sy_verify_value = 1.682195123368299D+08
+         gc_verify_value = 210832767.D0
+      elseif (m.eq.30) then
+         class = 'B'
+         sx_verify_value = 6.728927543423024D+08
+         sy_verify_value = 6.728951822504275D+08
+         gc_verify_value = 843345606.D0
+      elseif (m.eq.32) then
+         class = 'C'
+         sx_verify_value = 2.691444083862931D+09
+         sy_verify_value = 2.691519118724585D+09
+         gc_verify_value = 3373275903.D0
+      elseif (m.eq.36) then
+         class = 'D'
+         sx_verify_value = 4.306350280812112D+10
+         sy_verify_value = 4.306347571859157D+10
+         gc_verify_value = 53972171957.D0
+      elseif (m.eq.40) then
+         class = 'E'
+         sx_verify_value = 6.890169663167274D+11
+         sy_verify_value = 6.890164670688535D+11
+         gc_verify_value = 863554308186.D0
+      elseif (m.eq.44) then
+         class = 'F'
+         sx_verify_value = 1.102426773788175D+13
+         sy_verify_value = 1.102426773787993D+13
+         gc_verify_value = 13816870608324.D0
+      else
+         class = 'U'
+         verified = .false.
+      endif
+      if (verified) then
+         sx_err = abs((sx - sx_verify_value)/sx_verify_value)
+         sy_err = abs((sy - sy_verify_value)/sy_verify_value)
+         if (ieee_is_nan(sx_err) .or. ieee_is_nan(sy_err)) then
+            verified = .false.
+         else
+            verified = ((sx_err.le.epsilon) .and. (sy_err.le.epsilon))
+         endif
+      endif
+      if (verified) then
+         gc_err = abs((gc - gc_verify_value)/gc_verify_value)
+         if (ieee_is_nan(gc_err) .or. gc_err.gt.epsilon) then
+            verified = .false.
+         endif
+      endif
+
+      return
+      end
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/FT/Makefile b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/FT/Makefile
new file mode 100644
index 000000000..28e3e7df1
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/FT/Makefile
@@ -0,0 +1,25 @@
+SHELL=/bin/sh
+BENCHMARK=ft
+BENCHMARKU=FT
+
+include ../config/make.def
+
+include ../sys/make.common
+
+OBJS = ft.o ft_data.o mpinpb.o ${COMMON}/get_active_nprocs.o \
+	${COMMON}/${RAND}.o ${COMMON}/print_results.o ${COMMON}/timers.o
+
+${PROGRAM}: config ${OBJS}
+	${FLINK} ${FLINKFLAGS} -o ${PROGRAM} ${OBJS} ${FMPI_LIB}
+
+
+.f90.o:
+	${FCOMPILE} $<
+
+ft.o:		ft.f90  ft_data.o mpinpb.o
+ft_data.o:	ft_data.f90  mpinpb.o npbparams.h
+mpinpb.o:	mpinpb.f90
+
+clean:
+	- rm -f *.o *.mod *~ mputil*
+	- rm -f ft npbparams.h core
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/FT/README b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/FT/README
new file mode 100644
index 000000000..ab08b363b
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/FT/README
@@ -0,0 +1,5 @@
+This code implements the time integration of a three-dimensional
+partial differential equation using the Fast Fourier Transform.
+Some of the dimension statements are not F77 conforming and will
+not work using the g77 compiler. All dimension statements,
+however, are legal F90.
\ No newline at end of file
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/FT/ft.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/FT/ft.f90
new file mode 100644
index 000000000..ac2df0a58
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/FT/ft.f90
@@ -0,0 +1,2124 @@
+!-------------------------------------------------------------------------!
+!                                                                         !
+!        N  A  S     P A R A L L E L     B E N C H M A R K S  3.4         !
+!                                                                         !
+!                                   F T                                   !
+!                                                                         !
+!-------------------------------------------------------------------------!
+!                                                                         !
+!    This benchmark is part of the NAS Parallel Benchmark 3.4 suite.      !
+!    It is described in NAS Technical Reports 95-020 and 02-007           !
+!                                                                         !
+!    Permission to use, copy, distribute and modify this software         !
+!    for any purpose with or without fee is hereby granted.  We           !
+!    request, however, that all derived work reference the NAS            !
+!    Parallel Benchmarks 3.4. This software is provided "as is"           !
+!    without express or implied warranty.                                 !
+!                                                                         !
+!    Information on NPB 3.4, including the technical report, the          !
+!    original specifications, source code, results and information        !
+!    on how to submit new results, is available at:                       !
+!                                                                         !
+!           http://www.nas.nasa.gov/Software/NPB/                         !
+!                                                                         !
+!    Send comments or suggestions to  npb@nas.nasa.gov                    !
+!                                                                         !
+!          NAS Parallel Benchmarks Group                                  !
+!          NASA Ames Research Center                                      !
+!          Mail Stop: T27A-1                                              !
+!          Moffett Field, CA   94035-1000                                 !
+!                                                                         !
+!          E-mail:  npb@nas.nasa.gov                                      !
+!          Fax:     (650) 604-3957                                        !
+!                                                                         !
+!-------------------------------------------------------------------------!
+
+!TO REDUCE THE AMOUNT OF MEMORY REQUIRED BY THE BENCHMARK WE NO LONGER
+!STORE THE ENTIRE TIME EVOLUTION ARRAY "EX" FOR ALL TIME STEPS, BUT
+!JUST FOR THE FIRST. ALSO, IT IS STORED ONLY FOR THE PART OF THE GRID
+!FOR WHICH THE CALLING PROCESSOR IS RESPONSIBLE, SO THAT THE MEMORY 
+!USAGE BECOMES SCALABLE. THIS NEW ARRAY IS CALLED "TWIDDLE" (SEE
+!NPB3.0-SER)
+
+!TO AVOID PROBLEMS WITH VERY LARGE ARRAY SIZES THAT ARE COMPUTED BY
+!MULTIPLYING GRID DIMENSIONS (CAUSING INTEGER OVERFLOW IN THE VARIABLE
+!NTOTAL) AND SUBSEQUENTLY DIVIDING BY THE NUMBER OF PROCESSORS, WE
+!COMPUTE THE SIZE OF ARRAY PARTITIONS MORE CONSERVATIVELY AS
+!((NX*NY)/NP)*NZ, WHERE NX, NY, AND NZ ARE GRID DIMENSIONS AND NP IS
+!THE NUMBER OF PROCESSORS. THE RESULT IS STORED IN "NTDIVNP". FOR THE
+!PERFORMANCE CALCULATION WE STORE THE TOTAL NUMBER OF GRID POINTS IN A 
+!FLOATING POINT NUMBER "NTOTAL_F" INSTEAD OF AN INTEGER.
+!THIS FIX WILL FAIL IF THE NUMBER OF PROCESSORS IS SMALL.
+
+!UGLY HACK OF SUBROUTINE IPOW46: FOR VERY LARGE GRIDS THE SINGLE EXPONENT
+!FROM NPB2.3 MAY NOT FIT IN A 32-BIT INTEGER. HOWEVER, WE KNOW THAT THE
+!"EXPONENT" ARGUMENT OF THIS ROUTINE CAN ALWAYS BE FACTORED INTO A TERM 
+!DIVISIBLE BY NX (EXP_1) AND ANOTHER TERM (EXP_2). NX IS USUALLY A POWER
+!OF TWO, SO WE CAN KEEP HALVING IT UNTIL THE PRODUCT OF EXP_1
+!AND EXP_2 IS SMALL ENOUGH (NAMELY EXP_2 ITSELF). THIS UPDATED VERSION
+!OF IPOW46, WHICH NOW TAKES THE TWO FACTORS OF "EXPONENT" AS SEPARATE
+!ARGUMENTS, MAY BREAK DOWN IF EXP_1 DOES NOT CONTAIN A LARGE POWER OF TWO.
+
+!---------------------------------------------------------------------
+!
+! Authors: D. Bailey
+!          W. Saphir
+!          R. F. Van der Wijngaart
+!
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+! FT benchmark
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      program ft
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      use ft_data
+      use ft_fields
+      use mpinpb
+
+      implicit none
+
+      integer i, ierr
+
+      integer iter
+      double precision total_time, mflops
+      logical verified
+      character class
+
+
+      call setup(class)
+      if (.not. active) goto 999
+
+!---------------------------------------------------------------------
+! Run the entire problem once to make sure all data is touched. 
+! This reduces variable startup costs, which is important for such a 
+! short benchmark. The other NPB 2 implementations are similar. 
+!---------------------------------------------------------------------
+      do i = 1, t_max
+         call timer_clear(i)
+      end do
+
+      call timer_start(T_init)
+      call compute_indexmap(twiddle, dims(1,3), dims(2,3), dims(3,3))
+      call compute_initial_conditions(u1, dims(1,1), dims(2,1),  &
+     &                                dims(3,1))
+      call fft_init (dims(1,1))
+      call fft(1, u1, u0)
+      call timer_stop(T_init)
+      if (me .eq. 0) then
+         write(*, 1000) timer_read(T_init)
+1000     format(/' Initialization time =', f12.4/)
+      endif
+
+!---------------------------------------------------------------------
+! Start over from the beginning. Note that all operations must
+! be timed, in contrast to other benchmarks. 
+!---------------------------------------------------------------------
+      do i = 1, t_max
+         call timer_clear(i)
+      end do
+      call MPI_Barrier(comm_solve, ierr)
+
+      call timer_start(T_total)
+      if (timers_enabled) call timer_start(T_setup)
+
+      call compute_indexmap(twiddle, dims(1,3), dims(2,3), dims(3,3))
+      call compute_initial_conditions(u1, dims(1,1), dims(2,1),  &
+     &                                dims(3,1))
+      call fft_init (dims(1,1))
+
+!      if (timers_enabled) call synchup()
+      if (timers_enabled) call timer_stop(T_setup)
+
+      if (timers_enabled) call timer_start(T_fft)
+      call fft(1, u1, u0)
+      if (timers_enabled) call timer_stop(T_fft)
+
+      do iter = 1, niter
+         if (timers_enabled) call timer_start(T_evolve)
+         call evolve(u0, u1, twiddle,  &
+     &               dims(1,1), dims(2,1), dims(3,1))
+         if (timers_enabled) call timer_stop(T_evolve)
+         if (timers_enabled) call timer_start(T_fft)
+         call fft(-1, u1, u2)
+         if (timers_enabled) call timer_stop(T_fft)
+!         if (timers_enabled) call synchup()
+         if (timers_enabled) call timer_start(T_checksum)
+         call checksum(iter, u2, dims(1,1), dims(2,1), dims(3,1))
+         if (timers_enabled) call timer_stop(T_checksum)
+      end do
+
+      call verify(niter, verified, class)
+      call timer_stop(t_total)
+!!      if (np .ne. np_min) verified = .false.
+      total_time = timer_read(t_total)
+
+      if( total_time .ne. 0. ) then
+         mflops = 1.0d-6*ntotal_f *  &
+     &             (14.8157+7.19641*log(ntotal_f)  &
+     &          +  (5.23518+7.21113*log(ntotal_f))*niter)  &
+     &                 /total_time
+      else
+         mflops = 0.0
+      endif
+      if (me .eq. 0) then
+         call print_results('FT', class, nx, ny, nz, niter, np_min, np,  &
+     &     total_time, mflops, '          floating point', verified,  &
+     &     npbversion, compiletime, cs1, cs2, cs3, cs4, cs5, cs6, cs7)
+      endif
+      if (timers_enabled) call print_timers()
+
+  999 continue
+      call MPI_Finalize(ierr)
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine evolve(u0, u1, twiddle, d1, d2, d3)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+! evolve u0 -> u1 (t time steps) in Fourier space
+!---------------------------------------------------------------------
+
+      use ft_data
+      implicit none
+
+      integer d1, d2, d3
+      double precision exi
+      double complex u0(d1,d2,d3)
+      double complex u1(d1,d2,d3)
+      double precision twiddle(d1,d2,d3)
+      integer i, j, k
+
+      do k = 1, d3
+         do j = 1, d2
+            do i = 1, d1
+               u0(i,j,k) = u0(i,j,k)*(twiddle(i,j,k))
+               u1(i,j,k) = u0(i,j,k)
+            end do
+         end do
+      end do
+
+      return
+      end
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine compute_initial_conditions(u0, d1, d2, d3)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+! Fill in array u0 with initial conditions from 
+! random number generator 
+!---------------------------------------------------------------------
+
+      use ft_data
+      implicit none
+
+      integer d1, d2, d3
+      double complex u0(d1, d2, d3)
+      integer k
+      double precision x0, start, an, dummy
+      
+!---------------------------------------------------------------------
+! 0-D and 1-D layouts are easy because each processor gets a contiguous
+! chunk of the array, in the Fortran ordering sense. 
+! For a 2-D layout, it's a bit more complicated. We always
+! have entire x-lines (contiguous) on each processor.
+! We can do ny/np1 of them at a time since we have
+! ny/np1 contiguous in y-direction. But then we jump
+! by z-planes (nz/np2 of them, total). 
+! For the 0-D and 1-D layouts we could do larger chunks, but
+! this turns out to have no measurable impact on performance. 
+!---------------------------------------------------------------------
+
+
+      start = seed                                    
+!---------------------------------------------------------------------
+! Jump to the starting element for our first plane.
+!---------------------------------------------------------------------
+      call ipow46(a, 2*nx, (zstart(1)-1)*ny + (ystart(1)-1), an)
+      dummy = randlc(start, an)
+      call ipow46(a, 2*nx, ny, an)
+      
+!---------------------------------------------------------------------
+! Go through by z planes filling in one square at a time.
+!---------------------------------------------------------------------
+      do k = 1, dims(3, 1) ! nz/np2
+         x0 = start
+         call vranlc(2*nx*dims(2, 1), x0, a, u0(1, 1, k))
+         if (k .ne. dims(3, 1)) dummy = randlc(start, an)
+      end do
+
+      return
+      end
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine ipow46(a, exp_1, exp_2, result)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+! compute a^(exp_1*exp_2) mod 2^46
+!---------------------------------------------------------------------
+
+      implicit none
+      double precision a, result, dummy, q, r
+      integer exp_1, exp_2, n, n2, ierr
+      external randlc
+      double precision randlc
+      logical  two_pow
+!---------------------------------------------------------------------
+! Use
+!   a^n = a^(n/2)*a^(n/2) if n even else
+!   a^n = a*a^(n-1)       if n odd
+!---------------------------------------------------------------------
+      result = 1
+      if (exp_2 .eq. 0 .or. exp_1 .eq. 0) return
+      q = a
+      r = 1
+      n = exp_1
+      two_pow = .true.
+
+      do while (two_pow)
+         n2 = n/2
+         if (n2 * 2 .eq. n) then
+            dummy = randlc(q, q)
+            n = n2
+         else
+            n = n * exp_2
+            two_pow = .false.
+         endif
+      end do
+
+      do while (n .gt. 1)
+         n2 = n/2
+         if (n2 * 2 .eq. n) then
+            dummy = randlc(q, q) 
+            n = n2
+         else
+            dummy = randlc(r, q)
+            n = n-1
+         endif
+      end do
+      dummy = randlc(r, q)
+      result = r
+      return
+      end
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine setup(class)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      use ft_data
+      use mpinpb
+
+      implicit none
+
+      character class
+      integer ierr, i, fstatus
+      debug = .FALSE.
+      
+      call MPI_Init(ierr)
+
+!---------------------------------------------------------------------
+!     get a process grid that requires a power-of-two number of procs.
+!     excess ranks are marked as inactive.
+!---------------------------------------------------------------------
+      call get_active_nprocs(3, np1, np2, np_min,  &
+     &                       np, me, comm_solve, active)
+
+      if (.not. active) return
+
+      if (.not. convertdouble) then
+         dc_type = MPI_DOUBLE_COMPLEX
+      else
+         dc_type = MPI_COMPLEX
+      endif
+
+      if (me .eq. 0) then
+         write(*, 1000)
+
+         call check_timer_flag( timers_enabled )
+
+         open (unit=2,file='inputft.data',status='old', iostat=fstatus)
+
+         if (fstatus .eq. 0) then
+            write(*,233) 
+ 233        format(' Reading from input file inputft.data')
+            read (2,*) niter
+            read (2,*) layout_type
+            read (2,*) np1, np2
+            close(2)
+
+!---------------------------------------------------------------------
+! check to make sure input data is consistent
+!---------------------------------------------------------------------
+
+    
+!---------------------------------------------------------------------
+! 1. product of processor grid dims must equal number of processors
+!---------------------------------------------------------------------
+
+            if (np1 * np2 .ne. np_min) then
+               write(*, 238)
+ 238           format(' np1 and np2 given in input file are not valid.')
+               write(*, 239) np1*np2, np_min
+ 239           format(' Product is ', i5, ' and should be ', i5)
+               call MPI_Abort(MPI_COMM_WORLD, 1, ierr)
+            endif
+
+!---------------------------------------------------------------------
+! 2. layout type must be valid
+!---------------------------------------------------------------------
+
+            if (layout_type .ne. layout_0D .and.  &
+     &          layout_type .ne. layout_1D .and.  &
+     &          layout_type .ne. layout_2D) then
+               write(*, 240)
+ 240           format(' Layout type specified in inputft.data is  &
+     &                  invalid ')
+               call MPI_Abort(MPI_COMM_WORLD, 1, ierr)
+            endif
+
+!---------------------------------------------------------------------
+! 3. 0D layout must be 1x1 grid
+!---------------------------------------------------------------------
+
+            if (layout_type .eq. layout_0D .and.  &
+     &            (np1 .ne.1 .or. np2 .ne. 1)) then
+               write(*, 241)
+ 241           format(' For 0D layout, both np1 and np2 must be 1 ')
+               call MPI_Abort(MPI_COMM_WORLD, 1, ierr)
+            endif
+!---------------------------------------------------------------------
+! 4. 1D layout must be 1xN grid
+!---------------------------------------------------------------------
+
+            if (layout_type .eq. layout_1D .and. np1 .ne. 1) then
+               write(*, 242)
+ 242           format(' For 1D layout, np1 must be 1 ')
+               call MPI_Abort(MPI_COMM_WORLD, 1, ierr)
+            endif
+
+         else
+            write(*,234) 
+            niter = niter_default
+            if (np_min .eq. 1) then
+               np1 = 1
+               np2 = 1
+               layout_type = layout_0D
+            else if (np_min .le. nz) then
+               np1 = 1
+               np2 = np_min
+               layout_type = layout_1D
+            else
+               np1 = nz
+               np2 = np_min/nz
+               layout_type = layout_2D
+            endif
+         endif
+
+         call set_class(nx, ny, nz, niter, class)
+
+ 234     format(' No input file inputft.data. Using compiled defaults')
+         write(*, 1001) nx, ny, nz, class
+         write(*, 1002) niter
+         write(*, 1004) np
+         if (np .ne. np_min) write(*, 1006) np_min
+         write(*, 1005) np1, np2
+
+         if (layout_type .eq. layout_0D) then
+            write(*, 1010) '0D'
+         else if (layout_type .eq. layout_1D) then
+            write(*, 1010) '1D'
+         else
+            write(*, 1010) '2D'
+         endif
+
+ 1000 format(//,' NAS Parallel Benchmarks 3.4 -- FT Benchmark',/)
+ 1001    format(' Size                : ', i4, 'x', i4, 'x', i4,  &
+     &          '  (class ', a, ')')
+ 1002    format(' Iterations          : ', 7x, i7)
+ 1004    format(' Number of processes : ', 7x, i7)
+ 1005    format(' Processor array     : ', 5x, i4, 'x', i4)
+ 1006    format(' WARNING: Number of processes is not power of two (',  &
+     &          i0, ' active)')
+ 1010    format(' Layout type         : ', 9x, A5)
+      endif
+
+
+!---------------------------------------------------------------------
+! Broadcast parameters 
+!---------------------------------------------------------------------
+      call MPI_BCAST(np1, 1, MPI_INTEGER, 0, comm_solve, ierr)
+      call MPI_BCAST(np2, 1, MPI_INTEGER, 0, comm_solve, ierr)
+      call MPI_BCAST(layout_type, 1, MPI_INTEGER, 0, comm_solve,  &
+     &               ierr)
+      call MPI_BCAST(niter, 1, MPI_INTEGER, 0, comm_solve, ierr)
+      call MPI_BCAST(timers_enabled, 1, MPI_LOGICAL, 0, comm_solve,  &
+     &               ierr)
+
+      if (np1 .eq. 1 .and. np2 .eq. 1) then
+        layout_type = layout_0D
+      else if (np1 .eq. 1) then
+         layout_type = layout_1D
+      else
+         layout_type = layout_2D
+      endif
+
+      if (layout_type .eq. layout_0D) then
+         do i = 1, 3
+            dims(1, i) = nx
+            dims(2, i) = ny
+            dims(3, i) = nz
+         end do
+      else if (layout_type .eq. layout_1D) then
+         dims(1, 1) = nx
+         dims(2, 1) = ny
+         dims(3, 1) = nz
+
+         dims(1, 2) = nx
+         dims(2, 2) = ny
+         dims(3, 2) = nz
+
+         dims(1, 3) = nz
+         dims(2, 3) = nx
+         dims(3, 3) = ny
+      else if (layout_type .eq. layout_2D) then
+         dims(1, 1) = nx
+         dims(2, 1) = ny
+         dims(3, 1) = nz
+
+         dims(1, 2) = ny
+         dims(2, 2) = nx
+         dims(3, 2) = nz
+
+         dims(1, 3) = nz
+         dims(2, 3) = nx
+         dims(3, 3) = ny
+
+      endif
+      do i = 1, 3
+         dims(2, i) = dims(2, i) / np1
+         dims(3, i) = dims(3, i) / np2
+      end do
+
+!---------------------------------------------------------------------
+! Allocate space
+!---------------------------------------------------------------------
+      call alloc_space
+
+!---------------------------------------------------------------------
+! Determine processor coordinates of this processor
+! Processor grid is np1xnp2. 
+! Arrays are always (n1, n2/np1, n3/np2)
+! Processor coords are zero-based. 
+!---------------------------------------------------------------------
+      me2 = mod(me, np2)  ! goes from 0...np2-1
+      me1 = me/np2        ! goes from 0...np1-1
+!---------------------------------------------------------------------
+! Communicators for rows/columns of processor grid. 
+! commslice1 is communicator of all procs with same me1, ranked as me2
+! commslice2 is communicator of all procs with same me2, ranked as me1
+! mpi_comm_split(comm, color, key, ...)
+!---------------------------------------------------------------------
+      call MPI_Comm_split(comm_solve, me1, me2, commslice1, ierr)
+      call MPI_Comm_split(comm_solve, me2, me1, commslice2, ierr)
+!      if (timers_enabled) call synchup()
+
+      if (debug) print *, 'proc coords: ', me, me1, me2
+
+!---------------------------------------------------------------------
+! Determine which section of the grid is owned by this
+! processor. 
+!---------------------------------------------------------------------
+      if (layout_type .eq. layout_0d) then
+
+         do i = 1, 3
+            xstart(i) = 1
+            xend(i)   = nx
+            ystart(i) = 1
+            yend(i)   = ny
+            zstart(i) = 1
+            zend(i)   = nz
+         end do
+
+      else if (layout_type .eq. layout_1d) then
+
+         xstart(1) = 1
+         xend(1)   = nx
+         ystart(1) = 1
+         yend(1)   = ny
+         zstart(1) = 1 + me2 * nz/np2
+         zend(1)   = (me2+1) * nz/np2
+
+         xstart(2) = 1
+         xend(2)   = nx
+         ystart(2) = 1
+         yend(2)   = ny
+         zstart(2) = 1 + me2 * nz/np2
+         zend(2)   = (me2+1) * nz/np2
+
+         xstart(3) = 1
+         xend(3)   = nx
+         ystart(3) = 1 + me2 * ny/np2
+         yend(3)   = (me2+1) * ny/np2
+         zstart(3) = 1
+         zend(3)   = nz
+
+      else if (layout_type .eq. layout_2d) then
+
+         xstart(1) = 1
+         xend(1)   = nx
+         ystart(1) = 1 + me1 * ny/np1
+         yend(1)   = (me1+1) * ny/np1
+         zstart(1) = 1 + me2 * nz/np2
+         zend(1)   = (me2+1) * nz/np2
+
+         xstart(2) = 1 + me1 * nx/np1
+         xend(2)   = (me1+1)*nx/np1
+         ystart(2) = 1
+         yend(2)   = ny
+         zstart(2) = zstart(1)
+         zend(2)   = zend(1)
+
+         xstart(3) = xstart(2)
+         xend(3)   = xend(2)
+         ystart(3) = 1 + me2 *ny/np2
+         yend(3)   = (me2+1)*ny/np2
+         zstart(3) = 1
+         zend(3)   = nz
+      endif
+
+!---------------------------------------------------------------------
+! Set up info for blocking of ffts and transposes.  This improves
+! performance on cache-based systems. Blocking involves
+! working on a chunk of the problem at a time, taking chunks
+! along the first, second, or third dimension. 
+!
+! - In cffts1 blocking is on 2nd dimension (with fft on 1st dim)
+! - In cffts2/3 blocking is on 1st dimension (with fft on 2nd and 3rd dims)
+!
+! Since 1st dim is always in processor, we'll assume it's long enough 
+! (default blocking factor is 16 so min size for 1st dim is 16)
+! The only case we have to worry about is cffts1 in a 2d decomposition,
+! so the blocking factor should not be larger than the 2nd dimension. 
+!---------------------------------------------------------------------
+
+      fftblock = fftblock_default
+      fftblockpad = fftblockpad_default
+
+      if (layout_type .eq. layout_2d) then
+         if (dims(2, 1) .lt. fftblock) fftblock = dims(2, 1)
+         if (dims(2, 2) .lt. fftblock) fftblock = dims(2, 2)
+         if (dims(2, 3) .lt. fftblock) fftblock = dims(2, 3)
+      endif
+      
+      if (fftblock .ne. fftblock_default) fftblockpad = fftblock+3
+
+      return
+      end
+
+      
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine compute_indexmap(twiddle, d1, d2, d3)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+! compute function from local (i,j,k) to ibar^2+jbar^2+kbar^2 
+! for time evolution exponent. 
+!---------------------------------------------------------------------
+
+      use ft_data
+      use mpinpb
+
+      implicit none
+
+      integer d1, d2, d3
+      integer i, j, k, ii, ii2, jj, ij2, kk
+      double precision ap, twiddle(d1, d2, d3)
+
+!---------------------------------------------------------------------
+! this function is very different depending on whether 
+! we are in the 0d, 1d or 2d layout. Compute separately. 
+! basically we want to convert the fortran indices 
+!   1 2 3 4 5 6 7 8 
+! to 
+!   0 1 2 3 -4 -3 -2 -1
+! The following magic formula does the trick:
+! mod(i-1+n/2, n) - n/2
+!---------------------------------------------------------------------
+
+      ap = - 4.d0 * alpha * pi *pi
+
+      if (layout_type .eq. layout_0d) then ! xyz layout
+         do i = 1, dims(1,3)
+            ii =  mod(i+xstart(3)-2+nx/2, nx) - nx/2
+            ii2 = ii*ii
+            do j = 1, dims(2,3)
+               jj = mod(j+ystart(3)-2+ny/2, ny) - ny/2
+               ij2 = jj*jj+ii2
+               do k = 1, dims(3,3)
+                  kk = mod(k+zstart(3)-2+nz/2, nz) - nz/2
+                  twiddle(i,j,k) = dexp(ap*dfloat(kk*kk+ij2))
+               end do
+            end do
+         end do
+      else if (layout_type .eq. layout_1d) then ! zxy layout 
+         do i = 1,dims(2,3)
+            ii =  mod(i+xstart(3)-2+nx/2, nx) - nx/2
+            ii2 = ii*ii
+            do j = 1,dims(3,3)
+               jj = mod(j+ystart(3)-2+ny/2, ny) - ny/2
+               ij2 = jj*jj+ii2
+               do k = 1,dims(1,3)
+                  kk = mod(k+zstart(3)-2+nz/2, nz) - nz/2
+                  twiddle(k,i,j) = dexp(ap*dfloat(kk*kk+ij2))
+               end do
+            end do
+         end do
+      else if (layout_type .eq. layout_2d) then ! zxy layout
+         do i = 1,dims(2,3)
+            ii =  mod(i+xstart(3)-2+nx/2, nx) - nx/2
+            ii2 = ii*ii
+            do j = 1, dims(3,3)
+               jj = mod(j+ystart(3)-2+ny/2, ny) - ny/2
+               ij2 = jj*jj+ii2
+               do k =1,dims(1,3)
+                  kk = mod(k+zstart(3)-2+nz/2, nz) - nz/2
+                  twiddle(k,i,j) = dexp(ap*dfloat(kk*kk+ij2))
+               end do
+            end do
+         end do
+      else
+         print *, ' Unknown layout type ', layout_type
+         stop
+      endif
+
+      return
+      end
+
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine print_timers()
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      use ft_data
+      use mpinpb
+
+      implicit none
+
+      integer i, ierr
+      character*25 tstrings(T_max+2)
+      double precision t1(T_max+2), tsum(T_max+2),  &
+     &                 tming(T_max+2), tmaxg(T_max+2)
+      data tstrings / '          total ',  &
+     &                '          setup ',  &
+     &                '            fft ',  &
+     &                '         evolve ',  &
+     &                '       checksum ',  &
+     &                '         fftlow ',  &
+     &                '        fftcopy ',  &
+     &                '      transpose ',  &
+     &                ' transpose1_loc ',  &
+     &                ' transpose1_glo ',  &
+     &                ' transpose1_fin ',  &
+     &                ' transpose2_loc ',  &
+     &                ' transpose2_glo ',  &
+     &                ' transpose2_fin ',  &
+     &                '           sync ',  &
+     &                '           init ',  &
+     &                '        totcomp ',  &
+     &                '        totcomm ' /
+
+      do i = 1, t_max
+         t1(i) = timer_read(i)
+      end do
+      t1(t_max+2) = t1(t_transxzglo) + t1(t_transxyglo) + t1(t_synch)
+      t1(t_max+1) = t1(t_total) - t1(t_max+2)
+
+      call MPI_Reduce(t1, tsum,  t_max+2, MPI_DOUBLE_PRECISION,  &
+     &                MPI_SUM, 0, comm_solve, ierr)
+      call MPI_Reduce(t1, tming, t_max+2, MPI_DOUBLE_PRECISION,  &
+     &                MPI_MIN, 0, comm_solve, ierr)
+      call MPI_Reduce(t1, tmaxg, t_max+2, MPI_DOUBLE_PRECISION,  &
+     &                MPI_MAX, 0, comm_solve, ierr)
+
+      if (me .ne. 0) return
+      write(*, 800) np_min
+      do i = 1, t_max+2
+         if (tsum(i) .ne. 0.0d0) then
+            write(*, 810) i, tstrings(i), tming(i), tmaxg(i),  &
+     &                    tsum(i)/np_min
+         endif
+      end do
+ 800  format(' nprocs =', i6, 19x, 'minimum', 5x, 'maximum',  &
+     &       5x, 'average')
+ 810  format(' timer ', i2, '(', A16, ') :', 3(2X,F10.4))
+      return
+      end
+
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine fft(dir, x1, x2)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      use ft_data
+      implicit none
+
+      integer dir
+      double complex x1(ntdivnp), x2(ntdivnp)
+
+      double complex scratch(fftblockpad_default*maxdim*2)
+
+!---------------------------------------------------------------------
+! note: args x1, x2 must be different arrays
+! note: args for cfftsx are (direction, d1, d2, d3, xin, xout, scratch)
+!       xin/xout may be the same and it can be somewhat faster
+!       if they are
+! note: args for transpose are (layout1, layout2, xin, xout)
+!       xin/xout must be different
+!---------------------------------------------------------------------
+
+      if (dir .eq. 1) then
+         if (layout_type .eq. layout_0d) then
+            call cffts1(1, dims(1,1), dims(2,1), dims(3,1),  &
+     &                  x1, x1, scratch)
+            call cffts2(1, dims(1,2), dims(2,2), dims(3,2),  &
+     &                  x1, x1, scratch)
+            call cffts3(1, dims(1,3), dims(2,3), dims(3,3),  &
+     &                  x1, x2, scratch)
+         else if (layout_type .eq. layout_1d) then
+            call cffts1(1, dims(1,1), dims(2,1), dims(3,1),  &
+     &                  x1, x1, scratch)
+            call cffts2(1, dims(1,2), dims(2,2), dims(3,2),  &
+     &                  x1, x1, scratch)
+            if (timers_enabled) call timer_start(T_transpose)
+            call transpose_xy_z(2, 3, x1, x2)
+            if (timers_enabled) call timer_stop(T_transpose)
+            call cffts1(1, dims(1,3), dims(2,3), dims(3,3),  &
+     &                  x2, x2, scratch)
+         else if (layout_type .eq. layout_2d) then
+            call cffts1(1, dims(1,1), dims(2,1), dims(3,1),  &
+     &                  x1, x1, scratch)
+            if (timers_enabled) call timer_start(T_transpose)
+            call transpose_x_y(1, 2, x1, x2)
+            if (timers_enabled) call timer_stop(T_transpose)
+            call cffts1(1, dims(1,2), dims(2,2), dims(3,2),  &
+     &                  x2, x2, scratch)
+            if (timers_enabled) call timer_start(T_transpose)
+            call transpose_x_z(2, 3, x2, x1)
+            if (timers_enabled) call timer_stop(T_transpose)
+            call cffts1(1, dims(1,3), dims(2,3), dims(3,3),  &
+     &                  x1, x2, scratch)
+         endif
+      else
+         if (layout_type .eq. layout_0d) then
+            call cffts3(-1, dims(1,3), dims(2,3), dims(3,3),  &
+     &                  x1, x1, scratch)
+            call cffts2(-1, dims(1,2), dims(2,2), dims(3,2),  &
+     &                  x1, x1, scratch)
+            call cffts1(-1, dims(1,1), dims(2,1), dims(3,1),  &
+     &                  x1, x2, scratch)
+         else if (layout_type .eq. layout_1d) then
+            call cffts1(-1, dims(1,3), dims(2,3), dims(3,3),  &
+     &                  x1, x1, scratch)
+            if (timers_enabled) call timer_start(T_transpose)
+            call transpose_x_yz(3, 2, x1, x2)
+            if (timers_enabled) call timer_stop(T_transpose)
+            call cffts2(-1, dims(1,2), dims(2,2), dims(3,2),  &
+     &                  x2, x2, scratch)
+            call cffts1(-1, dims(1,1), dims(2,1), dims(3,1),  &
+     &                  x2, x2, scratch)
+         else if (layout_type .eq. layout_2d) then
+            call cffts1(-1, dims(1,3), dims(2,3), dims(3,3),  &
+     &                  x1, x1, scratch)
+            if (timers_enabled) call timer_start(T_transpose)
+            call transpose_x_z(3, 2, x1, x2)
+            if (timers_enabled) call timer_stop(T_transpose)
+            call cffts1(-1, dims(1,2), dims(2,2), dims(3,2),  &
+     &                  x2, x2, scratch)
+            if (timers_enabled) call timer_start(T_transpose)
+            call transpose_x_y(2, 1, x2, x1)
+            if (timers_enabled) call timer_stop(T_transpose)
+            call cffts1(-1, dims(1,1), dims(2,1), dims(3,1),  &
+     &                  x1, x2, scratch)
+         endif
+      endif
+      return
+      end
+
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine cffts1(is, d1, d2, d3, x, xout, y)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      use ft_data
+      implicit none
+
+      integer is, d1, d2, d3, logd1
+      double complex x(d1,d2,d3)
+      double complex xout(d1,d2,d3)
+      double complex y(fftblockpad, d1, 2) 
+      integer i, j, k, jj
+
+      logd1 = ilog2(d1)
+
+      do k = 1, d3
+         do jj = 0, d2 - fftblock, fftblock
+            if (timers_enabled) call timer_start(T_fftcopy)
+            do j = 1, fftblock
+               do i = 1, d1
+                  y(j,i,1) = x(i,j+jj,k)
+               enddo
+            enddo
+            if (timers_enabled) call timer_stop(T_fftcopy)
+            
+            if (timers_enabled) call timer_start(T_fftlow)
+            call cfftz (is, logd1, d1, y, y(1,1,2))
+            if (timers_enabled) call timer_stop(T_fftlow)
+
+            if (timers_enabled) call timer_start(T_fftcopy)
+            do j = 1, fftblock
+               do i = 1, d1
+                  xout(i,j+jj,k) = y(j,i,1)
+               enddo
+            enddo
+            if (timers_enabled) call timer_stop(T_fftcopy)
+         enddo
+      enddo
+
+      return
+      end
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine cffts2(is, d1, d2, d3, x, xout, y)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      use ft_data
+      implicit none
+
+      integer is, d1, d2, d3, logd2
+      double complex x(d1,d2,d3)
+      double complex xout(d1,d2,d3)
+      double complex y(fftblockpad, d2, 2) 
+      integer i, j, k, ii
+
+      logd2 = ilog2(d2)
+
+      do k = 1, d3
+        do ii = 0, d1 - fftblock, fftblock
+           if (timers_enabled) call timer_start(T_fftcopy)
+           do j = 1, d2
+              do i = 1, fftblock
+                 y(i,j,1) = x(i+ii,j,k)
+              enddo
+           enddo
+           if (timers_enabled) call timer_stop(T_fftcopy)
+
+           if (timers_enabled) call timer_start(T_fftlow)
+           call cfftz (is, logd2, d2, y, y(1, 1, 2))
+           if (timers_enabled) call timer_stop(T_fftlow)
+
+           if (timers_enabled) call timer_start(T_fftcopy)
+           do j = 1, d2
+              do i = 1, fftblock
+                 xout(i+ii,j,k) = y(i,j,1)
+              enddo
+           enddo
+           if (timers_enabled) call timer_stop(T_fftcopy)
+        enddo
+      enddo
+
+      return
+      end
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine cffts3(is, d1, d2, d3, x, xout, y)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      use ft_data
+      implicit none
+
+      integer is, d1, d2, d3, logd3
+      double complex x(d1,d2,d3)
+      double complex xout(d1,d2,d3)
+      double complex y(fftblockpad, d3, 2) 
+      integer i, j, k, ii
+
+      logd3 = ilog2(d3)
+
+      do j = 1, d2
+        do ii = 0, d1 - fftblock, fftblock
+           if (timers_enabled) call timer_start(T_fftcopy)
+           do k = 1, d3
+              do i = 1, fftblock
+                 y(i,k,1) = x(i+ii,j,k)
+              enddo
+           enddo
+           if (timers_enabled) call timer_stop(T_fftcopy)
+
+           if (timers_enabled) call timer_start(T_fftlow)
+           call cfftz (is, logd3, d3, y, y(1, 1, 2))
+           if (timers_enabled) call timer_stop(T_fftlow)
+
+           if (timers_enabled) call timer_start(T_fftcopy)
+           do k = 1, d3
+              do i = 1, fftblock
+                 xout(i+ii,j,k) = y(i,k,1)
+              enddo
+           enddo
+           if (timers_enabled) call timer_stop(T_fftcopy)
+        enddo
+      enddo
+
+      return
+      end
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine fft_init (n)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+! compute the roots-of-unity array that will be used for subsequent FFTs. 
+!---------------------------------------------------------------------
+
+      use ft_data
+      implicit none
+
+      integer m,n,nu,ku,i,j,ln
+      double precision t, ti
+
+
+!---------------------------------------------------------------------
+!   Initialize the U array with sines and cosines in a manner that permits
+!   stride one access at each FFT iteration.
+!---------------------------------------------------------------------
+      nu = n
+      m = ilog2(n)
+      u(1) = m
+      ku = 2
+      ln = 1
+
+      do j = 1, m
+         t = pi / ln
+         
+         do i = 0, ln - 1
+            ti = i * t
+            u(i+ku) = dcmplx (cos (ti), sin(ti))
+         enddo
+         
+         ku = ku + ln
+         ln = 2 * ln
+      enddo
+      
+      return
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine cfftz (is, m, n, x, y)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!   Computes NY N-point complex-to-complex FFTs of X using an algorithm due
+!   to Swarztrauber.  X is both the input and the output array, while Y is a 
+!   scratch array.  It is assumed that N = 2^M.  Before calling CFFTZ to 
+!   perform FFTs, the array U must be initialized by calling FFT_INIT with
+!   N set to 2^MX, where MX is the maximum value of M for any 
+!   subsequent call.
+!---------------------------------------------------------------------
+
+      use ft_data
+      implicit none
+
+      integer is,m,n,i,j,l,mx
+      double complex x, y
+
+      dimension x(fftblockpad,n), y(fftblockpad,n)
+
+!---------------------------------------------------------------------
+!   Check if input parameters are invalid.
+!---------------------------------------------------------------------
+      mx = u(1)
+      if ((is .ne. 1 .and. is .ne. -1) .or. m .lt. 1 .or. m .gt. mx)    &
+     &  then
+        write (*, 1)  is, m, mx
+ 1      format ('CFFTZ: Either U has not been initialized, or else'/    &
+     &    'one of the input parameters is invalid', 3I5)
+        stop
+      endif
+
+!---------------------------------------------------------------------
+!   Perform one variant of the Stockham FFT.
+!---------------------------------------------------------------------
+      do l = 1, m, 2
+        call fftz2 (is, l, m, n, fftblock, fftblockpad, u, x, y)
+        if (l .eq. m) goto 160
+        call fftz2 (is, l + 1, m, n, fftblock, fftblockpad, u, y, x)
+      enddo
+
+      goto 180
+
+!---------------------------------------------------------------------
+!   Copy Y to X.
+!---------------------------------------------------------------------
+ 160  do j = 1, n
+        do i = 1, fftblock
+          x(i,j) = y(i,j)
+        enddo
+      enddo
+
+ 180  continue
+
+      return
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine fftz2 (is, l, m, n, ny, ny1, u, x, y)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!   Performs the L-th iteration of the second variant of the Stockham FFT.
+!---------------------------------------------------------------------
+
+      implicit none
+
+      integer is,k,l,m,n,ny,ny1,n1,li,lj,lk,ku,i,j,i11,i12,i21,i22
+      double complex u,x,y,u1,x11,x21
+      dimension u(n), x(ny1,n), y(ny1,n)
+
+
+!---------------------------------------------------------------------
+!   Set initial parameters.
+!---------------------------------------------------------------------
+
+      n1 = n / 2
+      lk = 2 ** (l - 1)
+      li = 2 ** (m - l)
+      lj = 2 * lk
+      ku = li + 1
+
+      do i = 0, li - 1
+        i11 = i * lk + 1
+        i12 = i11 + n1
+        i21 = i * lj + 1
+        i22 = i21 + lk
+        if (is .ge. 1) then
+          u1 = u(ku+i)
+        else
+          u1 = dconjg (u(ku+i))
+        endif
+
+!---------------------------------------------------------------------
+!   This loop is vectorizable.
+!---------------------------------------------------------------------
+        do k = 0, lk - 1
+          do j = 1, ny
+            x11 = x(j,i11+k)
+            x21 = x(j,i12+k)
+            y(j,i21+k) = x11 + x21
+            y(j,i22+k) = u1 * (x11 - x21)
+          enddo
+        enddo
+      enddo
+
+      return
+      end
+
+!---------------------------------------------------------------------
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      integer function ilog2(n)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      implicit none
+      integer n, nn, lg
+      if (n .eq. 1) then
+         ilog2=0
+         return
+      endif
+      lg = 1
+      nn = 2
+      do while (nn .lt. n)
+         nn = nn*2
+         lg = lg+1
+      end do
+      ilog2 = lg
+      return
+      end
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine transpose_x_yz(l1, l2, xin, xout)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      use ft_data
+      implicit none
+
+      integer l1, l2
+      double complex xin(ntdivnp), xout(ntdivnp)
+
+      call transpose2_local(dims(1,l1),dims(2, l1)*dims(3, l1),  &
+     &                          xin, xout)
+
+      call transpose2_global(xout, xin)
+
+      call transpose2_finish(dims(1,l1),dims(2, l1)*dims(3, l1),  &
+     &                          xin, xout)
+
+      return
+      end
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine transpose_xy_z(l1, l2, xin, xout)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      use ft_data
+      implicit none
+
+      integer l1, l2
+      double complex xin(ntdivnp), xout(ntdivnp)
+
+      call transpose2_local(dims(1,l1)*dims(2, l1),dims(3, l1),  &
+     &                          xin, xout)
+      call transpose2_global(xout, xin)
+      call transpose2_finish(dims(1,l1)*dims(2, l1),dims(3, l1),  &
+     &                          xin, xout)
+
+      return
+      end
+
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine transpose2_local(n1, n2, xin, xout)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      use ft_data
+      use mpinpb
+
+      implicit none
+
+      integer n1, n2
+      double complex xin(n1, n2), xout(n2, n1)
+      
+      double complex z(transblockpad, transblock)
+
+      integer i, j, ii, jj
+
+      if (timers_enabled) call timer_start(T_transxzloc)
+
+!---------------------------------------------------------------------
+! If possible, block the transpose for cache memory systems. 
+! How much does this help? Example: R8000 Power Challenge (90 MHz)
+! Blocked version decreases time spent in this routine 
+! from 14 seconds to 5.2 seconds on 8 nodes class A.
+!---------------------------------------------------------------------
+
+      if (n1 .lt. transblock .or. n2 .lt. transblock) then
+         if (n1 .ge. n2) then 
+            do j = 1, n2
+               do i = 1, n1
+                  xout(j, i) = xin(i, j)
+               end do
+            end do
+         else
+            do i = 1, n1
+               do j = 1, n2
+                  xout(j, i) = xin(i, j)
+               end do
+            end do
+         endif
+      else
+         do j = 0, n2-1, transblock
+            do i = 0, n1-1, transblock
+               
+!---------------------------------------------------------------------
+! Note: compiler should be able to take j+jj out of inner loop
+!---------------------------------------------------------------------
+               do jj = 1, transblock
+                  do ii = 1, transblock
+                     z(jj,ii) = xin(i+ii, j+jj)
+                  end do
+               end do
+               
+               do ii = 1, transblock
+                  do jj = 1, transblock
+                     xout(j+jj, i+ii) = z(jj,ii)
+                  end do
+               end do
+               
+            end do
+         end do
+      endif
+      if (timers_enabled) call timer_stop(T_transxzloc)
+
+      return
+      end
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine transpose2_global(xin, xout)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      use ft_data
+      use mpinpb
+
+      implicit none
+
+      double complex xin(ntdivnp)
+      double complex xout(ntdivnp) 
+      integer ierr
+
+!      if (timers_enabled) call synchup()
+
+      if (timers_enabled) call timer_start(T_transxzglo)
+      call mpi_alltoall(xin, ntdivnp/np_min, dc_type,  &
+     &                  xout, ntdivnp/np_min, dc_type,  &
+     &                  commslice1, ierr)
+      if (timers_enabled) call timer_stop(T_transxzglo)
+
+      return
+      end
+
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine transpose2_finish(n1, n2, xin, xout)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      use ft_data
+      implicit none
+
+      integer n1, n2, ioff
+      double complex xin(n2, n1/np2, 0:np2-1), xout(n2*np2, n1/np2)
+      
+      integer i, j, p
+
+      if (timers_enabled) call timer_start(T_transxzfin)
+      do p = 0, np2-1
+         ioff = p*n2
+         do j = 1, n1/np2
+            do i = 1, n2
+               xout(i+ioff, j) = xin(i, j, p)
+            end do
+         end do
+      end do
+      if (timers_enabled) call timer_stop(T_transxzfin)
+
+      return
+      end
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine transpose_x_z(l1, l2, xin, xout)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      use ft_data
+      implicit none
+
+      integer l1, l2
+      double complex xin(ntdivnp), xout(ntdivnp)
+
+      call transpose_x_z_local(dims(1,l1),dims(2,l1),dims(3,l1),  &
+     &                         xin, xout)
+      call transpose_x_z_global(dims(1,l1),dims(2,l1),dims(3,l1),  &
+     &                          xout, xin)
+      call transpose_x_z_finish(dims(1,l2),dims(2,l2),dims(3,l2),  &
+     &                          xin, xout)
+      return
+      end
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine transpose_x_z_local(d1, d2, d3, xin, xout)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      use ft_data
+      implicit none
+
+      integer d1, d2, d3
+      double complex xin(d1,d2,d3)
+      double complex xout(d3,d2,d1)
+      integer block1, block3
+      integer i, j, k, kk, ii, i1, k1
+
+      double complex buf(transblockpad, maxdim)
+      if (timers_enabled) call timer_start(T_transxzloc)
+      if (d1 .lt. 32) goto 100
+      block3 = d3
+      if (block3 .eq. 1)  goto 100
+      if (block3 .gt. transblock) block3 = transblock
+      block1 = d1
+      if (block1*block3 .gt. transblock*transblock)  &
+     &          block1 = transblock*transblock/block3
+!---------------------------------------------------------------------
+! blocked transpose
+!---------------------------------------------------------------------
+      do j = 1, d2
+         do kk = 0, d3-block3, block3
+            do ii = 0, d1-block1, block1
+               
+               do k = 1, block3
+                  k1 = k + kk
+                  do i = 1, block1
+                     buf(k, i) = xin(i+ii, j, k1)
+                  end do
+               end do
+
+               do i = 1, block1
+                  i1 = i + ii
+                  do k = 1, block3
+                     xout(k+kk, j, i1) = buf(k, i)
+                  end do
+               end do
+
+            end do
+         end do
+      end do
+      goto 200
+      
+
+!---------------------------------------------------------------------
+! basic transpose
+!---------------------------------------------------------------------
+ 100  continue
+      
+      do j = 1, d2
+         do k = 1, d3
+            do i = 1, d1
+               xout(k, j, i) = xin(i, j, k)
+            end do
+         end do
+      end do
+
+!---------------------------------------------------------------------
+! all done
+!---------------------------------------------------------------------
+ 200  continue
+
+      if (timers_enabled) call timer_stop(T_transxzloc)
+      return 
+      end
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine transpose_x_z_global(d1, d2, d3, xin, xout)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      use ft_data
+      use mpinpb
+
+      implicit none
+
+      integer d1, d2, d3
+      double complex xin(d3,d2,d1)
+      double complex xout(d3,d2,d1) ! not real layout, but right size
+      integer ierr
+
+!      if (timers_enabled) call synchup()
+
+!---------------------------------------------------------------------
+! do transpose among all processes with same 1-coord (me1)
+!---------------------------------------------------------------------
+      if (timers_enabled) call timer_start(T_transxzglo)
+      call mpi_alltoall(xin, d1*d2*d3/np2, dc_type,  &
+     &                  xout, d1*d2*d3/np2, dc_type,  &
+     &                  commslice1, ierr)
+      if (timers_enabled) call timer_stop(T_transxzglo)
+      return
+      end
+      
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine transpose_x_z_finish(d1, d2, d3, xin, xout)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      use ft_data
+      implicit none
+
+      integer d1, d2, d3
+      double complex xin(d1/np2, d2, d3, 0:np2-1)
+      double complex xout(d1,d2,d3)
+      integer i, j, k, p, ioff
+      if (timers_enabled) call timer_start(T_transxzfin)
+!---------------------------------------------------------------------
+! this is the most straightforward way of doing it. the
+! calculation in the inner loop doesn't help. 
+!      do i = 1, d1/np2
+!         do j = 1, d2
+!            do k = 1, d3
+!               do p = 0, np2-1
+!                  ii = i + p*d1/np2
+!                  xout(ii, j, k) = xin(i, j, k, p)
+!               end do
+!            end do
+!         end do
+!      end do
+!---------------------------------------------------------------------
+
+      do p = 0, np2-1
+         ioff = p*d1/np2
+         do k = 1, d3
+            do j = 1, d2
+               do i = 1, d1/np2
+                  xout(i+ioff, j, k) = xin(i, j, k, p)
+               end do
+            end do
+         end do
+      end do
+      if (timers_enabled) call timer_stop(T_transxzfin)
+      return
+      end
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine transpose_x_y(l1, l2, xin, xout)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      use ft_data
+      implicit none
+
+      integer l1, l2
+      double complex xin(ntdivnp), xout(ntdivnp)
+
+!---------------------------------------------------------------------
+! xy transpose is a little tricky, since we don't want
+! to touch 3rd axis. But alltoall must involve 3rd axis (most 
+! slowly varying) to be efficient. So we do
+! (nx, ny/np1, nz/np2) -> (ny/np1, nz/np2, nx) (local)
+! (ny/np1, nz/np2, nx) -> ((ny/np1*nz/np2)*np1, nx/np1) (global)
+! then local finish. 
+!---------------------------------------------------------------------
+
+
+      call transpose_x_y_local(dims(1,l1),dims(2,l1),dims(3,l1),  &
+     &                         xin, xout)
+      call transpose_x_y_global(dims(1,l1),dims(2,l1),dims(3,l1),  &
+     &                          xout, xin)
+      call transpose_x_y_finish(dims(1,l2),dims(2,l2),dims(3,l2),  &
+     &                          xin, xout)
+
+      return
+      end
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine transpose_x_y_local(d1, d2, d3, xin, xout)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      use ft_data
+      implicit none
+
+      integer d1, d2, d3
+      double complex xin(d1, d2, d3)
+      double complex xout(d2, d3, d1)
+      integer i, j, k
+      if (timers_enabled) call timer_start(T_transxyloc)
+
+      do k = 1, d3
+         do i = 1, d1
+            do j = 1, d2
+               xout(j,k,i)=xin(i,j,k)
+            end do
+         end do
+      end do
+      if (timers_enabled) call timer_stop(T_transxyloc)
+      return 
+      end
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine transpose_x_y_global(d1, d2, d3, xin, xout)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      use ft_data
+      use mpinpb
+
+      implicit none
+
+      integer d1, d2, d3
+!---------------------------------------------------------------------
+! array is in form (ny/np1, nz/np2, nx)
+!---------------------------------------------------------------------
+      double complex xin(d2,d3,d1)
+      double complex xout(d2,d3,d1) ! not real layout but right size
+      integer ierr
+
+!      if (timers_enabled) call synchup()
+
+!---------------------------------------------------------------------
+! do transpose among all processes with same 1-coord (me1)
+!---------------------------------------------------------------------
+      if (timers_enabled) call timer_start(T_transxyglo)
+      call mpi_alltoall(xin, d1*d2*d3/np1, dc_type,  &
+     &                  xout, d1*d2*d3/np1, dc_type,  &
+     &                  commslice2, ierr)
+      if (timers_enabled) call timer_stop(T_transxyglo)
+
+      return
+      end
+      
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine transpose_x_y_finish(d1, d2, d3, xin, xout)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      use ft_data
+      implicit none
+
+      integer d1, d2, d3
+      double complex xin(d1/np1, d3, d2, 0:np1-1)
+      double complex xout(d1,d2,d3)
+      integer i, j, k, p, ioff
+      if (timers_enabled) call timer_start(T_transxyfin)
+!---------------------------------------------------------------------
+! this is the most straightforward way of doing it. the
+! calculation in the inner loop doesn't help. 
+!      do i = 1, d1/np1
+!         do j = 1, d2
+!            do k = 1, d3
+!               do p = 0, np1-1
+!                  ii = i + p*d1/np1
+! note the index order is swapped because we have (ny/np1, nz/np2, nx) -> (ny, nx/np1, nz/np2)
+!                  xout(ii, j, k) = xin(i, k, j, p)
+!               end do
+!            end do
+!         end do
+!      end do
+!---------------------------------------------------------------------
+
+      do p = 0, np1-1
+         ioff = p*d1/np1
+         do k = 1, d3
+            do j = 1, d2
+               do i = 1, d1/np1
+                  xout(i+ioff, j, k) = xin(i, k, j, p)
+               end do
+            end do
+         end do
+      end do
+      if (timers_enabled) call timer_stop(T_transxyfin)
+      return
+      end
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine checksum(i, u1, d1, d2, d3)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      use ft_data
+      use mpinpb
+
+      implicit none
+
+      integer i, d1, d2, d3
+      double complex u1(d1, d2, d3)
+      integer j, q,r,s, ierr
+      double complex chk,allchk
+      chk = (0.0,0.0)
+
+      do j=1,1024
+         q = mod(j, nx)+1
+         if (q .ge. xstart(1) .and. q .le. xend(1)) then
+            r = mod(3*j,ny)+1
+            if (r .ge. ystart(1) .and. r .le. yend(1)) then
+               s = mod(5*j,nz)+1
+               if (s .ge. zstart(1) .and. s .le. zend(1)) then
+                  chk=chk+u1(q-xstart(1)+1,r-ystart(1)+1,s-zstart(1)+1)
+               end if
+            end if
+         end if
+      end do
+      chk = chk/ntotal_f
+
+      if (timers_enabled) call timer_start(T_synch)
+      call MPI_Reduce(chk, allchk, 1, dc_type, MPI_SUM,  &
+     &                0, comm_solve, ierr)      
+      if (timers_enabled) call timer_stop(T_synch)
+      if (me .eq. 0) then
+            write (*, 30) i, allchk
+ 30         format (' T =',I5,5X,'Checksum =',1P2D22.12)
+      endif
+
+!      sums(i) = allchk
+!     If we compute the checksum for diagnostic purposes, we let i be
+!     negative, so the result will not be stored in an array
+      if (i .gt. 0) sums(i) = allchk
+
+      return
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine synchup
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      use ft_data
+      use mpinpb
+
+      implicit none
+
+      integer ierr
+      call timer_start(T_synch)
+      call mpi_barrier(comm_solve, ierr)
+      call timer_stop(T_synch)
+      return
+      end
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine set_class (d1, d2, d3, nt, class)
+
+!---------------------------------------------------------------------
+!  set problem class based on problem size
+!---------------------------------------------------------------------
+
+      implicit none
+
+      integer d1, d2, d3, nt
+      character class
+
+
+      class = 'U'
+
+      if (d1 .eq. 64 .and.  &
+     &    d2 .eq. 64 .and.  &
+     &    d3 .eq. 64 .and.  &
+     &    nt .eq. 6) then
+         class = 'S'
+
+      else if (d1 .eq. 128 .and.  &
+     &    d2 .eq. 128 .and.  &
+     &    d3 .eq. 32 .and.  &
+     &    nt .eq. 6) then
+         class = 'W'
+
+      else if (d1 .eq. 256 .and.  &
+     &    d2 .eq. 256 .and.  &
+     &    d3 .eq. 128 .and.  &
+     &    nt .eq. 6) then
+         class = 'A'
+      
+      else if (d1 .eq. 512 .and.  &
+     &    d2 .eq. 256 .and.  &
+     &    d3 .eq. 256 .and.  &
+     &    nt .eq. 20) then
+         class = 'B'
+
+      else if (d1 .eq. 512 .and.  &
+     &    d2 .eq. 512 .and.  &
+     &    d3 .eq. 512 .and.  &
+     &    nt .eq. 20) then
+         class = 'C'
+
+      else if (d1 .eq. 2048 .and.  &
+     &    d2 .eq. 1024 .and.  &
+     &    d3 .eq. 1024 .and.  &
+     &    nt .eq. 25) then
+         class = 'D'
+
+      else if (d1 .eq. 4096 .and.  &
+     &    d2 .eq. 2048 .and.  &
+     &    d3 .eq. 2048 .and.  &
+     &    nt .eq. 25) then
+         class = 'E'
+
+      else if (d1 .eq. 8192 .and.  &
+     &    d2 .eq. 4096 .and.  &
+     &    d3 .eq. 4096 .and.  &
+     &    nt .eq. 25) then
+         class = 'F'
+
+      endif
+
+      return
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine verify (nt, verified, class)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      use, intrinsic :: ieee_arithmetic, only : ieee_is_nan
+
+      use ft_data
+      use mpinpb
+
+      implicit none
+
+      integer nt
+      character class
+      logical verified
+      integer ierr, size, i
+      double precision err, epsilon
+
+!---------------------------------------------------------------------
+!   Reference checksums
+!---------------------------------------------------------------------
+      double complex csum_ref(25)
+
+
+      if (me .ne. 0) return
+
+      epsilon = 1.0d-12
+      verified = .FALSE.
+
+      if ( class .eq. 'S' ) then
+!---------------------------------------------------------------------
+!   Sample size reference checksums
+!---------------------------------------------------------------------
+         csum_ref(1) = dcmplx(5.546087004964D+02, 4.845363331978D+02)
+         csum_ref(2) = dcmplx(5.546385409189D+02, 4.865304269511D+02)
+         csum_ref(3) = dcmplx(5.546148406171D+02, 4.883910722336D+02)
+         csum_ref(4) = dcmplx(5.545423607415D+02, 4.901273169046D+02)
+         csum_ref(5) = dcmplx(5.544255039624D+02, 4.917475857993D+02)
+         csum_ref(6) = dcmplx(5.542683411902D+02, 4.932597244941D+02)
+
+      else if ( class .eq. 'W' ) then
+!---------------------------------------------------------------------
+!   Class W size reference checksums
+!---------------------------------------------------------------------
+         csum_ref(1) = dcmplx(5.673612178944D+02, 5.293246849175D+02)
+         csum_ref(2) = dcmplx(5.631436885271D+02, 5.282149986629D+02)
+         csum_ref(3) = dcmplx(5.594024089970D+02, 5.270996558037D+02)
+         csum_ref(4) = dcmplx(5.560698047020D+02, 5.260027904925D+02)
+         csum_ref(5) = dcmplx(5.530898991250D+02, 5.249400845633D+02)
+         csum_ref(6) = dcmplx(5.504159734538D+02, 5.239212247086D+02)
+
+      else if ( class .eq. 'A' ) then
+!---------------------------------------------------------------------
+!   Class A size reference checksums
+!---------------------------------------------------------------------
+         csum_ref(1) = dcmplx(5.046735008193D+02, 5.114047905510D+02)
+         csum_ref(2) = dcmplx(5.059412319734D+02, 5.098809666433D+02)
+         csum_ref(3) = dcmplx(5.069376896287D+02, 5.098144042213D+02)
+         csum_ref(4) = dcmplx(5.077892868474D+02, 5.101336130759D+02)
+         csum_ref(5) = dcmplx(5.085233095391D+02, 5.104914655194D+02)
+         csum_ref(6) = dcmplx(5.091487099959D+02, 5.107917842803D+02)
+      
+      else if ( class .eq. 'B' ) then
+!---------------------------------------------------------------------
+!   Class B size reference checksums
+!---------------------------------------------------------------------
+         csum_ref(1)  = dcmplx(5.177643571579D+02, 5.077803458597D+02)
+         csum_ref(2)  = dcmplx(5.154521291263D+02, 5.088249431599D+02)
+         csum_ref(3)  = dcmplx(5.146409228649D+02, 5.096208912659D+02)
+         csum_ref(4)  = dcmplx(5.142378756213D+02, 5.101023387619D+02)
+         csum_ref(5)  = dcmplx(5.139626667737D+02, 5.103976610617D+02)
+         csum_ref(6)  = dcmplx(5.137423460082D+02, 5.105948019802D+02)
+         csum_ref(7)  = dcmplx(5.135547056878D+02, 5.107404165783D+02)
+         csum_ref(8)  = dcmplx(5.133910925466D+02, 5.108576573661D+02)
+         csum_ref(9)  = dcmplx(5.132470705390D+02, 5.109577278523D+02)
+         csum_ref(10) = dcmplx(5.131197729984D+02, 5.110460304483D+02)
+         csum_ref(11) = dcmplx(5.130070319283D+02, 5.111252433800D+02)
+         csum_ref(12) = dcmplx(5.129070537032D+02, 5.111968077718D+02)
+         csum_ref(13) = dcmplx(5.128182883502D+02, 5.112616233064D+02)
+         csum_ref(14) = dcmplx(5.127393733383D+02, 5.113203605551D+02)
+         csum_ref(15) = dcmplx(5.126691062020D+02, 5.113735928093D+02)
+         csum_ref(16) = dcmplx(5.126064276004D+02, 5.114218460548D+02)
+         csum_ref(17) = dcmplx(5.125504076570D+02, 5.114656139760D+02)
+         csum_ref(18) = dcmplx(5.125002331720D+02, 5.115053595966D+02)
+         csum_ref(19) = dcmplx(5.124551951846D+02, 5.115415130407D+02)
+         csum_ref(20) = dcmplx(5.124146770029D+02, 5.115744692211D+02)
+
+      else if ( class .eq. 'C' ) then
+!---------------------------------------------------------------------
+!   Class C size reference checksums
+!---------------------------------------------------------------------
+         csum_ref(1)  = dcmplx(5.195078707457D+02, 5.149019699238D+02)
+         csum_ref(2)  = dcmplx(5.155422171134D+02, 5.127578201997D+02)
+         csum_ref(3)  = dcmplx(5.144678022222D+02, 5.122251847514D+02)
+         csum_ref(4)  = dcmplx(5.140150594328D+02, 5.121090289018D+02)
+         csum_ref(5)  = dcmplx(5.137550426810D+02, 5.121143685824D+02)
+         csum_ref(6)  = dcmplx(5.135811056728D+02, 5.121496764568D+02)
+         csum_ref(7)  = dcmplx(5.134569343165D+02, 5.121870921893D+02)
+         csum_ref(8)  = dcmplx(5.133651975661D+02, 5.122193250322D+02)
+         csum_ref(9)  = dcmplx(5.132955192805D+02, 5.122454735794D+02)
+         csum_ref(10) = dcmplx(5.132410471738D+02, 5.122663649603D+02)
+         csum_ref(11) = dcmplx(5.131971141679D+02, 5.122830879827D+02)
+         csum_ref(12) = dcmplx(5.131605205716D+02, 5.122965869718D+02)
+         csum_ref(13) = dcmplx(5.131290734194D+02, 5.123075927445D+02)
+         csum_ref(14) = dcmplx(5.131012720314D+02, 5.123166486553D+02)
+         csum_ref(15) = dcmplx(5.130760908195D+02, 5.123241541685D+02)
+         csum_ref(16) = dcmplx(5.130528295923D+02, 5.123304037599D+02)
+         csum_ref(17) = dcmplx(5.130310107773D+02, 5.123356167976D+02)
+         csum_ref(18) = dcmplx(5.130103090133D+02, 5.123399592211D+02)
+         csum_ref(19) = dcmplx(5.129905029333D+02, 5.123435588985D+02)
+         csum_ref(20) = dcmplx(5.129714421109D+02, 5.123465164008D+02)
+
+      else if ( class .eq. 'D' ) then
+!---------------------------------------------------------------------
+!   Class D size reference checksums
+!---------------------------------------------------------------------
+         csum_ref(1)  = dcmplx(5.122230065252D+02, 5.118534037109D+02)
+         csum_ref(2)  = dcmplx(5.120463975765D+02, 5.117061181082D+02)
+         csum_ref(3)  = dcmplx(5.119865766760D+02, 5.117096364601D+02)
+         csum_ref(4)  = dcmplx(5.119518799488D+02, 5.117373863950D+02)
+         csum_ref(5)  = dcmplx(5.119269088223D+02, 5.117680347632D+02)
+         csum_ref(6)  = dcmplx(5.119082416858D+02, 5.117967875532D+02)
+         csum_ref(7)  = dcmplx(5.118943814638D+02, 5.118225281841D+02)
+         csum_ref(8)  = dcmplx(5.118842385057D+02, 5.118451629348D+02)
+         csum_ref(9)  = dcmplx(5.118769435632D+02, 5.118649119387D+02)
+         csum_ref(10) = dcmplx(5.118718203448D+02, 5.118820803844D+02)
+         csum_ref(11) = dcmplx(5.118683569061D+02, 5.118969781011D+02)
+         csum_ref(12) = dcmplx(5.118661708593D+02, 5.119098918835D+02)
+         csum_ref(13) = dcmplx(5.118649768950D+02, 5.119210777066D+02)
+         csum_ref(14) = dcmplx(5.118645605626D+02, 5.119307604484D+02)
+         csum_ref(15) = dcmplx(5.118647586618D+02, 5.119391362671D+02)
+         csum_ref(16) = dcmplx(5.118654451572D+02, 5.119463757241D+02)
+         csum_ref(17) = dcmplx(5.118665212451D+02, 5.119526269238D+02)
+         csum_ref(18) = dcmplx(5.118679083821D+02, 5.119580184108D+02)
+         csum_ref(19) = dcmplx(5.118695433664D+02, 5.119626617538D+02)
+         csum_ref(20) = dcmplx(5.118713748264D+02, 5.119666538138D+02)
+         csum_ref(21) = dcmplx(5.118733606701D+02, 5.119700787219D+02)
+         csum_ref(22) = dcmplx(5.118754661974D+02, 5.119730095953D+02)
+         csum_ref(23) = dcmplx(5.118776626738D+02, 5.119755100241D+02)
+         csum_ref(24) = dcmplx(5.118799262314D+02, 5.119776353561D+02)
+         csum_ref(25) = dcmplx(5.118822370068D+02, 5.119794338060D+02)
+
+      else if ( class .eq. 'E' ) then
+!---------------------------------------------------------------------
+!   Class E size reference checksums
+!---------------------------------------------------------------------
+         csum_ref(1)  = dcmplx(5.121601045346D+02, 5.117395998266D+02)
+         csum_ref(2)  = dcmplx(5.120905403678D+02, 5.118614716182D+02)
+         csum_ref(3)  = dcmplx(5.120623229306D+02, 5.119074203747D+02)
+         csum_ref(4)  = dcmplx(5.120438418997D+02, 5.119345900733D+02)
+         csum_ref(5)  = dcmplx(5.120311521872D+02, 5.119551325550D+02)
+         csum_ref(6)  = dcmplx(5.120226088809D+02, 5.119720179919D+02)
+         csum_ref(7)  = dcmplx(5.120169296534D+02, 5.119861371665D+02)
+         csum_ref(8)  = dcmplx(5.120131225172D+02, 5.119979364402D+02)
+         csum_ref(9)  = dcmplx(5.120104767108D+02, 5.120077674092D+02)
+         csum_ref(10) = dcmplx(5.120085127969D+02, 5.120159443121D+02)
+         csum_ref(11) = dcmplx(5.120069224127D+02, 5.120227453670D+02)
+         csum_ref(12) = dcmplx(5.120055158164D+02, 5.120284096041D+02)
+         csum_ref(13) = dcmplx(5.120041820159D+02, 5.120331373793D+02)
+         csum_ref(14) = dcmplx(5.120028605402D+02, 5.120370938679D+02)
+         csum_ref(15) = dcmplx(5.120015223011D+02, 5.120404138831D+02)
+         csum_ref(16) = dcmplx(5.120001570022D+02, 5.120432068837D+02)
+         csum_ref(17) = dcmplx(5.119987650555D+02, 5.120455615860D+02)
+         csum_ref(18) = dcmplx(5.119973525091D+02, 5.120475499442D+02)
+         csum_ref(19) = dcmplx(5.119959279472D+02, 5.120492304629D+02)
+         csum_ref(20) = dcmplx(5.119945006558D+02, 5.120506508902D+02)
+         csum_ref(21) = dcmplx(5.119930795911D+02, 5.120518503782D+02)
+         csum_ref(22) = dcmplx(5.119916728462D+02, 5.120528612016D+02)
+         csum_ref(23) = dcmplx(5.119902874185D+02, 5.120537101195D+02)
+         csum_ref(24) = dcmplx(5.119889291565D+02, 5.120544194514D+02)
+         csum_ref(25) = dcmplx(5.119876028049D+02, 5.120550079284D+02)
+
+      else if ( class .eq. 'F' ) then
+!---------------------------------------------------------------------
+!   Class F size reference checksums
+!---------------------------------------------------------------------
+         csum_ref( 1) = dcmplx(5.119892866928D+02, 5.121457822747D+02)
+         csum_ref( 2) = dcmplx(5.119560157487D+02, 5.121009044434D+02)
+         csum_ref( 3) = dcmplx(5.119437960123D+02, 5.120761074285D+02)
+         csum_ref( 4) = dcmplx(5.119395628845D+02, 5.120614320496D+02)
+         csum_ref( 5) = dcmplx(5.119390371879D+02, 5.120514085624D+02)
+         csum_ref( 6) = dcmplx(5.119405091840D+02, 5.120438117102D+02)
+         csum_ref( 7) = dcmplx(5.119430444528D+02, 5.120376348915D+02)
+         csum_ref( 8) = dcmplx(5.119460702242D+02, 5.120323831062D+02)
+         csum_ref( 9) = dcmplx(5.119492377036D+02, 5.120277980818D+02)
+         csum_ref(10) = dcmplx(5.119523446268D+02, 5.120237368268D+02)
+         csum_ref(11) = dcmplx(5.119552825361D+02, 5.120201137845D+02)
+         csum_ref(12) = dcmplx(5.119580008777D+02, 5.120168723492D+02)
+         csum_ref(13) = dcmplx(5.119604834177D+02, 5.120139707209D+02)
+         csum_ref(14) = dcmplx(5.119627332821D+02, 5.120113749334D+02)
+         csum_ref(15) = dcmplx(5.119647637538D+02, 5.120090554887D+02)
+         csum_ref(16) = dcmplx(5.119665927740D+02, 5.120069857863D+02)
+         csum_ref(17) = dcmplx(5.119682397643D+02, 5.120051414260D+02)
+         csum_ref(18) = dcmplx(5.119697238718D+02, 5.120034999132D+02)
+         csum_ref(19) = dcmplx(5.119710630664D+02, 5.120020405355D+02)
+         csum_ref(20) = dcmplx(5.119722737384D+02, 5.120007442976D+02)
+         csum_ref(21) = dcmplx(5.119733705802D+02, 5.119995938652D+02)
+         csum_ref(22) = dcmplx(5.119743666226D+02, 5.119985735001D+02)
+         csum_ref(23) = dcmplx(5.119752733481D+02, 5.119976689792D+02)
+         csum_ref(24) = dcmplx(5.119761008382D+02, 5.119968675026D+02)
+         csum_ref(25) = dcmplx(5.119768579280D+02, 5.119961575929D+02)
+
+      endif
+
+
+      if (class .ne. 'U') then
+
+         do i = 1, nt
+            err = abs( (sums(i) - csum_ref(i)) / csum_ref(i) )
+            if (ieee_is_nan(err) .or. (err .gt. epsilon)) goto 100
+         end do
+         verified = .TRUE.
+ 100     continue
+
+      endif
+
+!      call MPI_COMM_SIZE(comm_solve, size, ierr)
+!      if (size .ne. np) then
+!         write(*, 4010) np
+!         write(*, 4011)
+!         write(*, 4012)
+!---------------------------------------------------------------------
+! multiple statements because some Fortran compilers have
+! problems with long strings. 
+!---------------------------------------------------------------------
+! 4010    format( ' Warning: benchmark was compiled for ', i5, 
+!     >           'processors')
+! 4011    format( ' Must be run on this many processors for official',
+!     >           ' verification')
+! 4012    format( ' so memory access is repeatable')
+!         verified = .false.
+!      endif
+         
+      if (class .ne. 'U') then
+         if (verified) then
+            write(*,2000)
+ 2000       format(' Result verification successful')
+         else
+            write(*,2001)
+ 2001       format(' Result verification failed')
+         endif
+      endif
+
+
+      return
+      end
+
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/FT/ft_data.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/FT/ft_data.f90
new file mode 100644
index 000000000..62eadd6c2
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/FT/ft_data.f90
@@ -0,0 +1,209 @@
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+!
+!  ft_data module
+!
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      module ft_data
+
+      include 'npbparams.h'
+
+! total number of grid points, stored as a floating-point number
+      double precision ntotal_f
+      parameter (ntotal_f=dble(nx)*ny*nz)
+
+! total dimension scaled by the number of processes
+      integer ntdivnp
+
+
+      double precision seed, a, pi, alpha
+      parameter (seed = 314159265.d0, a = 1220703125.d0,  &
+     &  pi = 3.141592653589793238d0, alpha=1.0d-6)
+
+! roots of unity array
+! relies on x being largest dimension?
+      double complex, allocatable :: u(:)
+
+
+! for checksum data
+      double complex sums(0:niter_default)
+
+! number of iterations
+      integer niter
+
+! other stuff
+      logical debug, debugsynch
+
+
+!--------------------------------------------------------------------
+! Cache blocking params. These values are good for most
+! RISC processors.  
+! FFT parameters:
+!  fftblock controls how many ffts are done at a time. 
+!  The default is appropriate for most cache-based machines
+!  On vector machines, the FFT can be vectorized with vector
+!  length equal to the block size, so the block size should
+!  be as large as possible. This is the size of the smallest
+!  dimension of the problem: 128 for class A, 256 for class B and
+!  512 for class C.
+! Transpose parameters:
+!  transblock is the blocking factor for the transposes when there
+!  is a 1-D layout. On vector machines it should probably be
+!  large (largest dimension of the problem).
+!--------------------------------------------------------------------
+
+      integer fftblock_default, fftblockpad_default
+      parameter (fftblock_default=16, fftblockpad_default=18)
+      integer transblock, transblockpad
+      parameter(transblock=32, transblockpad=34)
+      
+      integer fftblock, fftblockpad
+
+
+!--------------------------------------------------------------------
+! 2D processor array -> 2D grid decomposition (by pencils)
+! If processor array is 1xN -> 1D grid decomposition (by planes)
+! If processor array is 1x1 -> 0D grid decomposition
+! For simplicity, do not treat Nx1 (np2 = 1) specially
+!--------------------------------------------------------------------
+      integer np1, np2
+
+! basic decomposition strategy
+      integer layout_type
+      integer layout_0D, layout_1D, layout_2D
+      parameter (layout_0D = 0, layout_1D = 1, layout_2D = 2)
+
+!--------------------------------------------------------------------
+! There are basically three stages
+! 1: x-y-z layout
+! 2: after x-transform (before y)
+! 3: after y-transform (before z)
+! The computation proceeds logically as
+
+! set up initial conditions
+! fftx(1)
+! transpose (1->2)
+! ffty(2)
+! transpose (2->3)
+! fftz(3)
+! time evolution
+! fftz(3)
+! transpose (3->2)
+! ffty(2)
+! transpose (2->1)
+! fftx(1)
+! compute residual(1)
+
+! for the 0D, 1D, 2D strategies, the layouts look like:
+!        
+!            0D        1D        2D
+! 1:        xyz       xyz       xyz
+! 2:        xyz       xyz       yxz
+! 3:        xyz       zyx       zxy
+!--------------------------------------------------------------------
+
+! the array dimensions are stored in dims(coord, phase)
+      integer dims(3, 3)
+      integer xstart(3), ystart(3), zstart(3)
+      integer xend(3), yend(3), zend(3)
+
+!--------------------------------------------------------------------
+! Timing constants
+!--------------------------------------------------------------------
+      integer T_total, T_setup, T_fft, T_evolve, T_checksum,  &
+     &        T_fftlow, T_fftcopy, T_transpose,  &
+     &        T_transxzloc, T_transxzglo, T_transxzfin,  &
+     &        T_transxyloc, T_transxyglo, T_transxyfin,  &
+     &        T_synch, T_init, T_max
+      parameter (T_total = 1, T_setup = 2, T_fft = 3,  &
+     &           T_evolve = 4, T_checksum = 5,  &
+     &           T_fftlow = 6, T_fftcopy = 7, T_transpose = 8,  &
+     &           T_transxzloc = 9, T_transxzglo = 10, T_transxzfin = 11,  &
+     &           T_transxyloc = 12, T_transxyglo = 13,  &
+     &           T_transxyfin = 14,  T_synch = 15, T_init = 16,  &
+     &           T_max = 16)
+
+      logical timers_enabled
+
+!--------------------------------------------------------------------
+! external functions
+!--------------------------------------------------------------------
+      double precision, external :: randlc, timer_read
+      integer, external ::          ilog2
+
+      end module ft_data
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+!
+!  ft_fields module
+!
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      module ft_fields
+
+!---------------------------------------------------------------------
+! u0, u1, u2 are the main arrays in the problem. 
+! Depending on the decomposition, these arrays will have different 
+! dimensions. To accommodate all possibilities, we allocate them as 
+! one-dimensional arrays and pass them to subroutines for different 
+! views
+!  - u0 contains the (transformed) initial condition
+!  - u1 and u2 are working arrays
+!---------------------------------------------------------------------
+      double complex, allocatable ::  &
+     &                 u0(:), u1(:), u2(:)
+      double precision, allocatable ::  &
+     &                 twiddle(:)
+
+      end module ft_fields
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine alloc_space
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+! allocate space dynamically for data arrays
+!---------------------------------------------------------------------
+
+      use ft_data
+      use ft_fields
+      use mpinpb
+
+      implicit none
+
+      integer ios, ierr
+
+
+      ntdivnp = ((nx*ny)/np_min)*nz
+
+!---------------------------------------------------------------------
+! Padding+3 is to avoid accidental cache problems, 
+! since all array sizes are powers of two.
+!---------------------------------------------------------------------
+      allocate (  &
+     &          u0     (ntdivnp+3),  &
+     &          u1     (ntdivnp+3),  &
+     &          u2     (ntdivnp+3),  &
+     &          twiddle(ntdivnp),  &
+     &          u      (maxdim),  &
+     &          stat = ios)
+
+      if (ios .ne. 0) then
+         write(*,*) 'Error encountered in allocating space'
+         call MPI_Abort(MPI_COMM_WORLD, MPI_ERR_OTHER, ierr)
+         stop
+      endif
+
+      return
+      end
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/FT/inputft.data.sample b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/FT/inputft.data.sample
new file mode 100644
index 000000000..448ac42bc
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/FT/inputft.data.sample
@@ -0,0 +1,3 @@
+6   ! number of iterations
+2   ! layout type. 0 = 0d, 1 = 1d, 2 = 2d
+2 4 ! processor layout. 0d must be "1 1"; 1d must be "1 N"
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/FT/mpinpb.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/FT/mpinpb.f90
new file mode 100644
index 000000000..d125ea4ca
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/FT/mpinpb.f90
@@ -0,0 +1,31 @@
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+!
+!  mpinpb module
+!
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      module mpinpb
+
+      include 'mpif.h'
+
+!--------------------------------------------------------------------
+! 'np' number of processors, 'np_min' min number of processors
+!--------------------------------------------------------------------
+      integer np_min, np
+
+! we need a bunch of logic to keep track of how
+! arrays are laid out. 
+! coords of this processor
+      integer me, me1, me2
+
+! need a communicator for row/col in processor grid
+      integer comm_solve, commslice1, commslice2
+      logical active
+
+! mpi data types
+      integer dc_type
+
+      end module mpinpb
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/IS/Makefile b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/IS/Makefile
new file mode 100644
index 000000000..0ac4ae959
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/IS/Makefile
@@ -0,0 +1,23 @@
+SHELL=/bin/sh
+BENCHMARK=is
+BENCHMARKU=IS
+
+include ../config/make.def
+
+include ../sys/make.common
+
+OBJS = is.o ${COMMON}/c_print_results.o ${COMMON}/c_timers.o
+
+
+${PROGRAM}: config ${OBJS}
+	${CLINK} ${CLINKFLAGS} -o ${PROGRAM} ${OBJS} ${CMPI_LIB}
+
+.c.o:
+	${CCOMPILE} $<
+
+is.o:             is.c  npbparams.h
+
+
+clean:
+	- rm -f *.o *~ mputil*
+	- rm -f is npbparams.h core
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/IS/is.c b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/IS/is.c
new file mode 100644
index 000000000..e7227ae6e
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/IS/is.c
@@ -0,0 +1,1219 @@
+/*************************************************************************
+ *                                                                       * 
+ *        N  A  S     P A R A L L E L     B E N C H M A R K S  3.4       *
+ *                                                                       * 
+ *                                  I S                                  * 
+ *                                                                       * 
+ ************************************************************************* 
+ *                                                                       * 
+ *   This benchmark is part of the NAS Parallel Benchmark 3.4 suite.     *
+ *   It is described in NAS Technical Report 95-020.                     * 
+ *                                                                       * 
+ *   Permission to use, copy, distribute and modify this software        * 
+ *   for any purpose with or without fee is hereby granted.  We          * 
+ *   request, however, that all derived work reference the NAS           * 
+ *   Parallel Benchmarks 3.4. This software is provided "as is"          *
+ *   without express or implied warranty.                                * 
+ *                                                                       * 
+ *   Information on NPB 3.4, including the technical report, the         *
+ *   original specifications, source code, results and information       * 
+ *   on how to submit new results, is available at:                      * 
+ *                                                                       * 
+ *          http://www.nas.nasa.gov/Software/NPB                         * 
+ *                                                                       * 
+ *   Send comments or suggestions to  npb@nas.nasa.gov                   * 
+ *   Send bug reports to              npb-bugs@nas.nasa.gov              * 
+ *                                                                       * 
+ *         NAS Parallel Benchmarks Group                                 * 
+ *         NASA Ames Research Center                                     * 
+ *         Mail Stop: T27A-1                                             * 
+ *         Moffett Field, CA   94035-1000                                * 
+ *                                                                       * 
+ *         E-mail:  npb@nas.nasa.gov                                     * 
+ *         Fax:     (650) 604-3957                                       * 
+ *                                                                       * 
+ ************************************************************************* 
+ *                                                                       * 
+ *   Author: M. Yarrow                                                   * 
+ *           H. Jin                                                      * 
+ *                                                                       * 
+ *************************************************************************/
+
+#include "mpi.h"
+#include "npbparams.h"
+#include <stdlib.h>
+#include <stdio.h>
+#include <string.h>
+
+/******************/
+/* default values */
+/******************/
+#ifndef CLASS
+#define CLASS 'S'
+#define NUM_PROCS            1                 
+#endif
+#define MIN_PROCS            1
+#define ONE                  1
+
+
+/*************/
+/*  CLASS S  */
+/*************/
+#if CLASS == 'S'
+#define  TOTAL_KEYS_LOG_2    16
+#define  MAX_KEY_LOG_2       11
+#define  NUM_BUCKETS_LOG_2   9
+#endif
+
+
+/*************/
+/*  CLASS W  */
+/*************/
+#if CLASS == 'W'
+#define  TOTAL_KEYS_LOG_2    20
+#define  MAX_KEY_LOG_2       16
+#define  NUM_BUCKETS_LOG_2   10
+#endif
+
+/*************/
+/*  CLASS A  */
+/*************/
+#if CLASS == 'A'
+#define  TOTAL_KEYS_LOG_2    23
+#define  MAX_KEY_LOG_2       19
+#define  NUM_BUCKETS_LOG_2   10
+#endif
+
+
+/*************/
+/*  CLASS B  */
+/*************/
+#if CLASS == 'B'
+#define  TOTAL_KEYS_LOG_2    25
+#define  MAX_KEY_LOG_2       21
+#define  NUM_BUCKETS_LOG_2   10
+#endif
+
+
+/*************/
+/*  CLASS C  */
+/*************/
+#if CLASS == 'C'
+#define  TOTAL_KEYS_LOG_2    27
+#define  MAX_KEY_LOG_2       23
+#define  NUM_BUCKETS_LOG_2   10
+#endif
+
+
+/*************/
+/*  CLASS D  */
+/*************/
+#if CLASS == 'D'
+#define  TOTAL_KEYS_LOG_2    29     /* 2^31 */
+#define  MAX_KEY_LOG_2       27
+#define  NUM_BUCKETS_LOG_2   10
+#undef   MIN_PROCS
+#define  MIN_PROCS           4
+#endif
+
+
+/*************/
+/*  CLASS E  */
+/*************/
+#if CLASS == 'E'
+#define  TOTAL_KEYS_LOG_2    29     /* 2^35 */
+#define  MAX_KEY_LOG_2       31
+#define  NUM_BUCKETS_LOG_2   10
+#undef   MIN_PROCS
+#define  MIN_PROCS           64
+#undef   ONE
+#define  ONE                 1L
+#endif
+
+
+/*******************************************************************
+ * Defining MIN_PROCS is to avoid integer overflow for large problem 
+ * sizes without using a larger integer type, such as long int.
+ * The actual total keys = TOTAL_KEYS * MIN_PROCS
+ *******************************************************************/
+#define  TOTAL_KEYS          (1 << TOTAL_KEYS_LOG_2)
+
+#define  MAX_KEY             (ONE << MAX_KEY_LOG_2)
+#define  NUM_BUCKETS         (1 << NUM_BUCKETS_LOG_2)
+
+/*****************************************************************/
+/* NOTE: THIS CODE CANNOT BE RUN ON ARBITRARILY LARGE NUMBERS OF */
+/* PROCESSORS. THE LARGEST VERIFIED NUMBER IS 1024. INCREASE     */
+/* MAX_PROCS AT YOUR PERIL                                       */
+/*****************************************************************/
+#if CLASS == 'S'
+#define  MAX_PROCS           128
+#else
+#define  MAX_PROCS           1024
+#endif
+
+#define  MAX_ITERATIONS      10
+#define  TEST_ARRAY_SIZE     5
+
+
+/* Number of keys assigned to each processor
+ * #define  NUM_KEYS            (TOTAL_KEYS/NUM_PROCS*MIN_PROCS)
+ */
+int num_keys;
+
+/*****************************************************************/
+/* On larger number of processors, since the keys are (roughly)  */ 
+/* gaussian distributed, the first and last processor sort keys  */ 
+/* in a large interval, requiring array sizes to be larger. Note */
+/* that for large NUM_PROCS, NUM_KEYS is, however, a small number*/
+/* The required array size also depends on the bucket size used. */
+/* The following values are validated for the 1024-bucket setup. */
+/*****************************************************************/
+/*
+ * #if   NUM_PROCS < 256
+ * #define  SIZE_OF_BUFFERS     3*NUM_KEYS/2
+ * #elif NUM_PROCS < 512
+ * #define  SIZE_OF_BUFFERS     5*NUM_KEYS/2
+ * #elif NUM_PROCS < 1024
+ * #define  SIZE_OF_BUFFERS     4*NUM_KEYS
+ * #else
+ * #define  SIZE_OF_BUFFERS     13*NUM_KEYS/2
+ * #endif
+ */
+int size_of_buffers;
+
+
+/***********************************/
+/* Enable separate communication,  */
+/* computation timing and printout */
+/***********************************/
+#define  TIMING_ENABLED
+#ifdef NO_MTIMERS
+#undef TIMING_ENABLED
+#define TIMER_START( x )
+#define TIMER_STOP( x )
+#else
+#define TIMER_START( x ) if (timeron) timer_start( x )
+#define TIMER_STOP( x ) if (timeron) timer_stop( x )
+#define T_TOTAL  0
+#define T_RANK   1
+#define T_RCOMM  2
+#define T_VERIFY 3
+#define T_LAST   3
+#endif
+int timeron;
+
+
+/*************************************/
+/* Typedef: if necessary, change the */
+/* size of int here by changing the  */
+/* int type to, say, long            */
+/*************************************/
+typedef  int  INT_TYPE;
+#if CLASS == 'D' || CLASS == 'E'
+typedef  long KEY_TYPE;
+#else
+typedef  int  KEY_TYPE;
+#endif
+#define MP_KEY_TYPE MPI_INT
+
+
+
+/********************/
+/* MPI properties:  */
+/********************/
+int      my_rank, np_total,
+         comm_size;
+MPI_Comm comm_work;
+
+
+/********************/
+/* Some global info */
+/********************/
+INT_TYPE *key_buff_ptr_global,         /* used by full_verify to get */
+         total_local_keys,             /* copies of rank info        */
+         total_lesser_keys;
+
+
+int      passed_verification;
+                                 
+
+
+/************************************/
+/* These are the three main arrays. */
+/* See SIZE_OF_BUFFERS def above    */
+/************************************/
+INT_TYPE *key_array,    
+         *key_buff1,    
+         *key_buff2,
+         bucket_size[NUM_BUCKETS+TEST_ARRAY_SIZE],     /* Top 5 elements for */
+         bucket_size_totals[NUM_BUCKETS+TEST_ARRAY_SIZE], /* part. ver. vals */
+         bucket_ptrs[NUM_BUCKETS],
+         process_bucket_distrib_ptr1[NUM_BUCKETS+TEST_ARRAY_SIZE],   
+         process_bucket_distrib_ptr2[NUM_BUCKETS+TEST_ARRAY_SIZE];   
+int      *send_count, *recv_count,
+         *send_displ, *recv_displ;
+
+
+/**********************/
+/* Partial verif info */
+/**********************/
+KEY_TYPE test_index_array[TEST_ARRAY_SIZE],
+         test_rank_array[TEST_ARRAY_SIZE];
+
+int      S_test_index_array[TEST_ARRAY_SIZE] = 
+                             {48427,17148,23627,62548,4431},
+         S_test_rank_array[TEST_ARRAY_SIZE] = 
+                             {0,18,346,64917,65463},
+
+         W_test_index_array[TEST_ARRAY_SIZE] = 
+                             {357773,934767,875723,898999,404505},
+         W_test_rank_array[TEST_ARRAY_SIZE] = 
+                             {1249,11698,1039987,1043896,1048018},
+
+         A_test_index_array[TEST_ARRAY_SIZE] = 
+                             {2112377,662041,5336171,3642833,4250760},
+         A_test_rank_array[TEST_ARRAY_SIZE] = 
+                             {104,17523,123928,8288932,8388264},
+
+         B_test_index_array[TEST_ARRAY_SIZE] = 
+                             {41869,812306,5102857,18232239,26860214},
+         B_test_rank_array[TEST_ARRAY_SIZE] = 
+                             {33422937,10244,59149,33135281,99}, 
+
+         C_test_index_array[TEST_ARRAY_SIZE] = 
+                             {44172927,72999161,74326391,129606274,21736814},
+         C_test_rank_array[TEST_ARRAY_SIZE] = 
+                             {61147,882988,266290,133997595,133525895};
+
+long     D_test_index_array[TEST_ARRAY_SIZE] = 
+                             {1317351170,995930646,1157283250,1503301535,1453734525},
+         D_test_rank_array[TEST_ARRAY_SIZE] = 
+                             {1,36538729,1978098519,2145192618,2147425337},
+
+         E_test_index_array[TEST_ARRAY_SIZE] = 
+                             {21492309536L,24606226181L,12608530949L,4065943607L,3324513396L},
+         E_test_rank_array[TEST_ARRAY_SIZE] = 
+                             {3L,27580354L,3248475153L,30048754302L,31485259697L};
+
+
+
+/***********************/
+/* function prototypes */
+/***********************/
+double	randlc( double *X, double *A );
+
+void full_verify( void );
+
+void c_print_results( char   *name,
+                      char   class,
+                      int    n1, 
+                      int    n2,
+                      int    n3,
+                      int    niter,
+                      int    nprocs_active,
+                      int    nprocs_total,
+                      double t,
+                      double mops,
+		      char   *optype,
+                      int    passed_verification,
+                      char   *npbversion,
+                      char   *compiletime,
+                      char   *mpicc,
+                      char   *clink,
+                      char   *cmpi_lib,
+                      char   *cmpi_inc,
+                      char   *cflags,
+                      char   *clinkflags );
+
+#include "../common/c_timers.h"
+
+
+/*****************************************************************/
+/*     Dynamically allocate space for main arrays                */
+/*****************************************************************/
+void alloc_space(void)
+{
+   /* problem size after partition */
+   num_keys = (TOTAL_KEYS/comm_size) * MIN_PROCS;
+
+   /* buffer size for communication */
+   if ( comm_size < 256 )
+      size_of_buffers = 3*num_keys/2;
+   else if ( comm_size < 512 )
+      size_of_buffers = 5*num_keys/2;
+   else if ( comm_size < 1024 )
+      size_of_buffers = 4*num_keys;
+   else
+      size_of_buffers = 13*num_keys/2;
+
+   /* allocate space */
+   key_array = (INT_TYPE *)malloc(sizeof(INT_TYPE)*size_of_buffers);
+   key_buff1 = (INT_TYPE *)malloc(sizeof(INT_TYPE)*size_of_buffers);
+   key_buff2 = (INT_TYPE *)malloc(sizeof(INT_TYPE)*size_of_buffers);
+
+   send_count = (int *)malloc(sizeof(int)*comm_size);
+   recv_count = (int *)malloc(sizeof(int)*comm_size);
+   send_displ = (int *)malloc(sizeof(int)*comm_size);
+   recv_displ = (int *)malloc(sizeof(int)*comm_size);
+
+   if (!key_array || !key_buff1 || !key_buff2 ||
+       !send_count || !recv_count || !send_displ || !recv_displ) {
+      printf("ERROR: memory allocation failed\n");
+      MPI_Abort(MPI_COMM_WORLD, 1);
+      exit(1);
+   }
+}
+
+
+/*
+ *    FUNCTION RANDLC (X, A)
+ *
+ *  This routine returns a uniform pseudorandom double precision number in the
+ *  range (0, 1) by using the linear congruential generator
+ *
+ *  x_{k+1} = a x_k  (mod 2^46)
+ *
+ *  where 0 < x_k < 2^46 and 0 < a < 2^46.  This scheme generates 2^44 numbers
+ *  before repeating.  The argument A is the same as 'a' in the above formula,
+ *  and X is the same as x_0.  A and X must be odd double precision integers
+ *  in the range (1, 2^46).  The returned value RANDLC is normalized to be
+ *  between 0 and 1, i.e. RANDLC = 2^(-46) * x_1.  X is updated to contain
+ *  the new seed x_1, so that subsequent calls to RANDLC using the same
+ *  arguments will generate a continuous sequence.
+ *
+ *  This routine should produce the same results on any computer with at least
+ *  48 mantissa bits in double precision floating point data.  On Cray systems,
+ *  double precision should be disabled.
+ *
+ *  David H. Bailey     October 26, 1990
+ *
+ *     IMPLICIT DOUBLE PRECISION (A-H, O-Z)
+ *     SAVE KS, R23, R46, T23, T46
+ *     DATA KS/0/
+ *
+ *  If this is the first call to RANDLC, compute R23 = 2 ^ -23, R46 = 2 ^ -46,
+ *  T23 = 2 ^ 23, and T46 = 2 ^ 46.  These are computed in loops, rather than
+ *  by merely using the ** operator, in order to ensure that the results are
+ *  exact on all systems.  This code assumes that 0.5D0 is represented exactly.
+ */
+
+
+/*****************************************************************/
+/*************           R  A  N  D  L  C             ************/
+/*************                                        ************/
+/*************    portable random number generator    ************/
+/*****************************************************************/
+
+double	randlc( double *X, double *A )
+{
+      static int        KS=0;
+      static double	R23, R46, T23, T46;
+      double		T1, T2, T3, T4;
+      double		A1;
+      double		A2;
+      double		X1;
+      double		X2;
+      double		Z;
+      int     		i, j;
+
+      if (KS == 0) 
+      {
+        R23 = 1.0;
+        R46 = 1.0;
+        T23 = 1.0;
+        T46 = 1.0;
+    
+        for (i=1; i<=23; i++)
+        {
+          R23 = 0.50 * R23;
+          T23 = 2.0 * T23;
+        }
+        for (i=1; i<=46; i++)
+        {
+          R46 = 0.50 * R46;
+          T46 = 2.0 * T46;
+        }
+        KS = 1;
+      }
+
+/*  Break A into two parts such that A = 2^23 * A1 + A2 and set X = N.  */
+
+      T1 = R23 * *A;
+      j  = T1;
+      A1 = j;
+      A2 = *A - T23 * A1;
+
+/*  Break X into two parts such that X = 2^23 * X1 + X2, compute
+    Z = A1 * X2 + A2 * X1  (mod 2^23), and then
+    X = 2^23 * Z + A2 * X2  (mod 2^46).                            */
+
+      T1 = R23 * *X;
+      j  = T1;
+      X1 = j;
+      X2 = *X - T23 * X1;
+      T1 = A1 * X2 + A2 * X1;
+      
+      j  = R23 * T1;
+      T2 = j;
+      Z = T1 - T23 * T2;
+      T3 = T23 * Z + A2 * X2;
+      j  = R46 * T3;
+      T4 = j;
+      *X = T3 - T46 * T4;
+      return(R46 * *X);
+} 
+
+
+
+/*****************************************************************/
+/************   F  I  N  D  _  M  Y  _  S  E  E  D    ************/
+/************                                         ************/
+/************ returns parallel random number seq seed ************/
+/*****************************************************************/
+
+/*
+ * Create a random number sequence of total length nn residing
+ * on np number of processors.  Each processor will therefore have a 
+ * subsequence of length nn/np.  This routine returns that random 
+ * number which is the first random number for the subsequence belonging
+ * to processor rank kn, and which is used as seed for proc kn ran # gen.
+ */
+
+double   find_my_seed( int  kn,       /* my processor rank, 0<=kn<num procs  */
+                       int  np,       /* np = num procs                      */
+                       long nn,       /* total num of ran numbers, all procs */
+                       double s,      /* Ran num seed, for ex.: 314159265.00 */
+                       double a )     /* Ran num gen mult, try 1220703125.00 */
+{
+
+  long   i;
+
+  double t1,t2,t3,an;
+  long   mq,nq,kk,ik;
+
+
+
+      nq = nn / np;
+
+      for( mq=0; nq>1; mq++,nq/=2 )
+          ;
+
+      t1 = a;
+
+      for( i=1; i<=mq; i++ )
+        t2 = randlc( &t1, &t1 );
+
+      an = t1;
+
+      kk = kn;
+      t1 = s;
+      t2 = an;
+
+      for( i=1; i<=100; i++ )
+      {
+        ik = kk / 2;
+        if( 2 * ik !=  kk ) 
+            t3 = randlc( &t1, &t2 );
+        if( ik == 0 ) 
+            break;
+        t3 = randlc( &t2, &t2 );
+        kk = ik;
+      }
+
+      return( t1 );
+
+}
+
+
+
+
+/*****************************************************************/
+/*************      C  R  E  A  T  E  _  S  E  Q      ************/
+/*****************************************************************/
+
+void	create_seq( double seed, double a )
+{
+	double x;
+	int    i, k;
+
+        k = MAX_KEY/4;
+
+	for (i=0; i<num_keys; i++)
+	{
+	    x = randlc(&seed, &a);
+	    x += randlc(&seed, &a);
+    	    x += randlc(&seed, &a);
+	    x += randlc(&seed, &a);  
+
+            key_array[i] = k*x;
+	}
+}
+
+
+
+
+/*****************************************************************/
+/*************    F  U  L  L  _  V  E  R  I  F  Y     ************/
+/*****************************************************************/
+
+
+void full_verify( void )
+{
+    MPI_Status  status;
+    MPI_Request request;
+    
+    INT_TYPE    i, j;
+    INT_TYPE    k, last_local_key;
+
+    
+    TIMER_START( T_VERIFY );
+
+/*  Now, finally, sort the keys:  */
+    for( i=0; i<total_local_keys; i++ )
+        key_array[--key_buff_ptr_global[key_buff2[i]]-
+                                 total_lesser_keys] = key_buff2[i];
+    last_local_key = (total_local_keys<1)? 0 : (total_local_keys-1);
+
+/*  Send largest key value to next processor  */
+    if( my_rank > 0 )
+        MPI_Irecv( &k,
+                   1,
+                   MP_KEY_TYPE,
+                   my_rank-1,
+                   1000,
+                   comm_work,
+                   &request );                   
+    if( my_rank < comm_size-1 )
+        MPI_Send( &key_array[last_local_key],
+                  1,
+                  MP_KEY_TYPE,
+                  my_rank+1,
+                  1000,
+                  comm_work );
+    if( my_rank > 0 )
+        MPI_Wait( &request, &status );
+
+/*  Confirm that neighbor's greatest key value 
+    is not greater than my least key value       */              
+    j = 0;
+    if( my_rank > 0 && total_local_keys > 0 )
+        if( k > key_array[0] )
+            j++;
+
+
+/*  Confirm keys correctly sorted: count incorrectly sorted keys, if any */
+    for( i=1; i<total_local_keys; i++ )
+        if( key_array[i-1] > key_array[i] )
+            j++;
+
+
+    if( j != 0 )
+    {
+        printf( "Processor %d:  Full_verify: number of keys out of sort: %d\n",
+                my_rank, j );
+    }
+    else
+        passed_verification++;
+           
+    TIMER_STOP( T_VERIFY );
+
+}
+
+
+
+
+/*****************************************************************/
+/*************             R  A  N  K             ****************/
+/*****************************************************************/
+
+
+void rank( int iteration )
+{
+
+    INT_TYPE    i, k;
+
+    INT_TYPE    shift = MAX_KEY_LOG_2 - NUM_BUCKETS_LOG_2;
+    INT_TYPE    key;
+    KEY_TYPE    bucket_sum_accumulator, j, m;
+    INT_TYPE    local_bucket_sum_accumulator;
+    INT_TYPE    min_key_val, max_key_val;
+    INT_TYPE    *key_buff_ptr;
+
+
+
+    TIMER_START( T_RANK );
+
+/*  Iteration alteration of keys */  
+    if(my_rank == 0 )                    
+    {
+      key_array[iteration] = iteration;
+      key_array[iteration+MAX_ITERATIONS] = MAX_KEY - iteration;
+    }
+
+
+/*  Initialize */
+    for( i=0; i<NUM_BUCKETS+TEST_ARRAY_SIZE; i++ )  
+    {
+        bucket_size[i] = 0;
+        bucket_size_totals[i] = 0;
+        process_bucket_distrib_ptr1[i] = 0;
+        process_bucket_distrib_ptr2[i] = 0;
+    }
+
+
+/*  Determine where the partial verify test keys are, load into  */
+/*  top of array bucket_size                                     */
+    for( i=0; i<TEST_ARRAY_SIZE; i++ )
+        if( (test_index_array[i]/num_keys) == my_rank )
+            bucket_size[NUM_BUCKETS+i] = 
+                          key_array[test_index_array[i] % num_keys];
+
+
+/*  Determine the number of keys in each bucket */
+    for( i=0; i<num_keys; i++ )
+        bucket_size[key_array[i] >> shift]++;
+
+
+/*  Accumulative bucket sizes are the bucket pointers */
+    bucket_ptrs[0] = 0;
+    for( i=1; i< NUM_BUCKETS; i++ )  
+        bucket_ptrs[i] = bucket_ptrs[i-1] + bucket_size[i-1];
+
+
+/*  Sort into appropriate bucket */
+    for( i=0; i<num_keys; i++ )  
+    {
+        key = key_array[i];
+        key_buff1[bucket_ptrs[key >> shift]++] = key;
+    }
+
+    TIMER_STOP( T_RANK );
+    TIMER_START( T_RCOMM );
+
+/*  Get the bucket size totals for the entire problem. These 
+    will be used to determine the redistribution of keys      */
+    MPI_Allreduce( bucket_size, 
+                   bucket_size_totals, 
+                   NUM_BUCKETS+TEST_ARRAY_SIZE, 
+                   MP_KEY_TYPE,
+                   MPI_SUM,
+                   comm_work );
+
+    TIMER_STOP( T_RCOMM );
+    TIMER_START( T_RANK );
+
+/*  Determine redistribution of keys: accumulate the bucket size totals 
+    till this number surpasses NUM_KEYS (which is the average number of keys
+    per processor).  Then all keys in these buckets go to processor 0.
+    Continue accumulating again until surpassing 2*NUM_KEYS. All keys
+    in these buckets go to processor 1, etc.  This algorithm guarantees
+    that all processors have work ranking; no processors are left idle.
+    The optimum number of buckets, however, does not result in as high
+    a degree of load balancing (as even a distribution of keys as is
+    possible) as is obtained from increasing the number of buckets, but
+    more buckets results in more computation per processor so that the
+    optimum number of buckets turns out to be 1024 for machines tested.
+    Note that process_bucket_distrib_ptr1 and ..._ptr2 hold the bucket
+    number of first and last bucket which each processor will have after   
+    the redistribution is done.                                          */
+
+    bucket_sum_accumulator = 0;
+    local_bucket_sum_accumulator = 0;
+    send_displ[0] = 0;
+    process_bucket_distrib_ptr1[0] = 0;
+    for( i=0, j=0; i<NUM_BUCKETS; i++ )  
+    {
+        bucket_sum_accumulator       += bucket_size_totals[i];
+        local_bucket_sum_accumulator += bucket_size[i];
+        if( bucket_sum_accumulator >= (j+1)*num_keys )  
+        {
+            send_count[j] = local_bucket_sum_accumulator;
+            if( j != 0 )
+            {
+                send_displ[j] = send_displ[j-1] + send_count[j-1];
+                process_bucket_distrib_ptr1[j] = 
+                                        process_bucket_distrib_ptr2[j-1]+1;
+            }
+            process_bucket_distrib_ptr2[j++] = i;
+            local_bucket_sum_accumulator = 0;
+        }
+    }
+
+/*  When NUM_PROCS approaches NUM_BUCKETS, it is highly possible
+    that the last few processors don't get any buckets.  So, we
+    need to set counts properly in this case to avoid any fallouts.    */
+    while( j < comm_size )
+    {
+        send_count[j] = 0;
+        process_bucket_distrib_ptr1[j] = 1;
+        j++;
+    }
+
+    TIMER_STOP( T_RANK );
+    TIMER_START( T_RCOMM ); 
+
+/*  This is the redistribution section:  first find out how many keys
+    each processor will send to every other processor:                 */
+    MPI_Alltoall( send_count,
+                  1,
+                  MPI_INT,
+                  recv_count,
+                  1,
+                  MPI_INT,
+                  comm_work );
+
+/*  Determine the receive array displacements for the buckets */    
+    recv_displ[0] = 0;
+    for( i=1; i<comm_size; i++ )
+        recv_displ[i] = recv_displ[i-1] + recv_count[i-1];
+
+
+/*  Now send the keys to respective processors  */    
+    MPI_Alltoallv( key_buff1,
+                   send_count,
+                   send_displ,
+                   MP_KEY_TYPE,
+                   key_buff2,
+                   recv_count,
+                   recv_displ,
+                   MP_KEY_TYPE,
+                   comm_work );
+
+    TIMER_STOP( T_RCOMM ); 
+    TIMER_START( T_RANK );
+
+/*  The starting and ending bucket numbers on each processor are
+    multiplied by the interval size of the buckets to obtain the 
+    smallest possible min and greatest possible max value of any 
+    key on each processor                                          */
+    min_key_val = process_bucket_distrib_ptr1[my_rank] << shift;
+    max_key_val = ((process_bucket_distrib_ptr2[my_rank] + 1) << shift)-1;
+
+/*  Clear the work array */
+    for( i=0; i<max_key_val-min_key_val+1; i++ )
+        key_buff1[i] = 0;
+
+/*  Determine the total number of keys on all other 
+    processors holding keys of lesser value         */
+    m = 0;
+    for( k=0; k<my_rank; k++ )
+        for( i= process_bucket_distrib_ptr1[k];
+             i<=process_bucket_distrib_ptr2[k];
+             i++ )  
+            m += bucket_size_totals[i]; /*  m has total # of lesser keys */
+
+/*  Determine total number of keys on this processor */
+    j = 0;                                 
+    for( i= process_bucket_distrib_ptr1[my_rank];
+         i<=process_bucket_distrib_ptr2[my_rank];
+         i++ )  
+        j += bucket_size_totals[i];     /* j has total # of local keys   */
+
+
+/*  Ranking of all keys occurs in this section:                 */
+/*  shift it backwards so no subtractions are necessary in loop */
+    key_buff_ptr = key_buff1 - min_key_val;
+
+/*  In this section, the keys themselves are used as their 
+    own indexes to determine how many of each there are: their
+    individual population                                       */
+    for( i=0; i<j; i++ )
+        key_buff_ptr[key_buff2[i]]++;  /* Now they have individual key   */
+                                       /* population                     */
+
+/*  To obtain ranks of each key, successively add the individual key
+    population, not forgetting the total of lesser keys, m.
+    NOTE: Since the total of lesser keys would be subtracted later 
+    in verification, it is no longer added to the first key population 
+    here, but still needed during the partial verify test.  This is to 
+    ensure that 32-bit key_buff can still be used for class D.           */
+/*    key_buff_ptr[min_key_val] += m;    */
+    for( i=min_key_val; i<max_key_val; i++ )   
+        key_buff_ptr[i+1] += key_buff_ptr[i];  
+
+
+/* This is the partial verify test section */
+/* Observe that test_rank_array vals are   */
+/* shifted differently for different cases */
+    for( i=0; i<TEST_ARRAY_SIZE; i++ )
+    {                                             
+        k = bucket_size_totals[i+NUM_BUCKETS];    /* Keys were hidden here */
+        if( min_key_val <= k  &&  k <= max_key_val )
+        {
+            /* Add the total of lesser keys, m, here */
+            KEY_TYPE key_rank = key_buff_ptr[k-1] + m;
+            KEY_TYPE test_rank = test_rank_array[i];
+            int failed = 0;
+
+            switch( CLASS )
+            {
+                case 'S':
+                    if( i <= 2 )
+                        test_rank += iteration;
+                    else
+                        test_rank -= iteration;
+                    break;
+                case 'W':
+                    if( i < 2 )
+                        test_rank += iteration - 2;
+                    else
+                        test_rank -= iteration;
+                    break;
+                case 'A':
+                    if( i <= 2 )
+                        test_rank += iteration - 1;
+                    else
+                        test_rank -= iteration - 1;
+                    break;
+                case 'B':
+                    if( i == 1 || i == 2 || i == 4 )
+                        test_rank += iteration;
+                    else
+                        test_rank -= iteration;
+                    break;
+                case 'C':
+                    if( i <= 2 )
+                        test_rank += iteration;
+                    else
+                        test_rank -= iteration;
+                    break;
+                case 'D':
+                    if( i < 2 )
+                        test_rank += iteration;
+                    else
+                        test_rank -= iteration;
+                    break;
+                 case 'E':
+                    if( i < 2 )
+                        test_rank += iteration - 2;
+                    else if( i == 2 )
+                    {
+                        test_rank += iteration - 2;
+                        if (iteration > 4)
+                            test_rank -= 2;
+                        else if (iteration > 2)
+                            test_rank -= 1;
+                    }
+                    else
+                        test_rank -= iteration - 2;
+                    break;
+            }
+            if( key_rank != test_rank )
+                failed = 1;
+            else
+                passed_verification++;
+            if( failed == 1 )
+                printf( "Failed partial verification: "
+                        "iteration %d, processor %d, test key %d, key rank %ld\n", 
+                         iteration, my_rank, (int)i, (long)key_rank );
+        }
+    }
+
+
+    TIMER_STOP( T_RANK ); 
+
+
+/*  Make copies of rank info for use by full_verify: these variables
+    in rank are local; making them global slows down the code, probably
+    because the compiler cannot keep them in registers                    */
+
+    if( iteration == MAX_ITERATIONS ) 
+    {
+        key_buff_ptr_global = key_buff_ptr;
+        total_local_keys    = j;
+        total_lesser_keys   = 0;  /* no longer set to 'm', see note above */
+    }
+
+}      
+
+
+/*****************************************************************/
+/*************             M  A  I  N             ****************/
+/*****************************************************************/
+
+int main( int argc, char **argv )
+{
+
+    int             i, iteration, itemp, active;
+
+    double          timecounter, maxtime;
+
+
+/*  Initialize MPI */
+    MPI_Init( &argc, &argv );
+    MPI_Comm_rank( MPI_COMM_WORLD, &my_rank );
+    MPI_Comm_size( MPI_COMM_WORLD, &np_total );
+
+
+/*  Check to see whether total number of processes is within bounds.
+    This could in principle be checked in setparams.c, but it is more
+    convenient to do it here                                               */
+    if( np_total < MIN_PROCS || np_total > MAX_PROCS)
+    {
+       if( my_rank == 0 )
+           printf( "\n ERROR: number of processes %d not within range %d-%d"
+                   "\n Exiting program!\n\n", np_total, MIN_PROCS, MAX_PROCS);
+       MPI_Finalize();
+       exit( 1 );
+    }
+
+
+/*  comm_size needs to be power of two */
+    for (comm_size = 1; comm_size < np_total; comm_size *= 2);
+    if (comm_size > np_total) comm_size /= 2;
+
+/*  If the actual number of processes doesn't agree with comm_size,
+    check if excess ranks need to be masked */
+    active = 1;
+    if( comm_size != np_total )
+    {
+        /* check if NPB_NPROCS_STRICT is set */
+        if( my_rank == 0 ) {
+            char *ep = getenv("NPB_NPROCS_STRICT");
+            if (ep && *ep) {
+               if (strchr("nNfF-", *ep) || strcmp(ep, "0") == 0)
+                  active = 0;
+               else if (strcmp(ep, "off") == 0 || strcmp(ep, "OFF") == 0)
+                  active = 0;
+            }
+        }
+        MPI_Bcast(&active, 1, MPI_INT, 0, MPI_COMM_WORLD);
+
+        /* abort if a strict NPROCS enforcement is required */
+        if (active) {
+            if( my_rank == 0 )
+               printf( "\n ERROR: Number of processes (%d)"
+                       " is not a power of two (%d?)\n"
+                       " Exiting program!\n\n", np_total, comm_size );
+            MPI_Abort(MPI_COMM_WORLD, MPI_ERR_OTHER);
+            exit( 1 );
+        }
+
+        /* mark excess ranks as inactive */
+        active = ( my_rank >= comm_size )? 0 : 1;
+        MPI_Comm_split(MPI_COMM_WORLD, active, my_rank, &comm_work);
+    }
+    else
+        MPI_Comm_dup(MPI_COMM_WORLD, &comm_work);
+
+    if (!active) {
+        MPI_Finalize();
+        exit( 0 );
+    }
+
+
+/*  Initialize the verification arrays if a valid class */
+    for( i=0; i<TEST_ARRAY_SIZE; i++ )
+        switch( CLASS )
+        {
+            case 'S':
+                test_index_array[i] = S_test_index_array[i];
+                test_rank_array[i]  = S_test_rank_array[i];
+                break;
+            case 'A':
+                test_index_array[i] = A_test_index_array[i];
+                test_rank_array[i]  = A_test_rank_array[i];
+                break;
+            case 'W':
+                test_index_array[i] = W_test_index_array[i];
+                test_rank_array[i]  = W_test_rank_array[i];
+                break;
+            case 'B':
+                test_index_array[i] = B_test_index_array[i];
+                test_rank_array[i]  = B_test_rank_array[i];
+                break;
+            case 'C':
+                test_index_array[i] = C_test_index_array[i];
+                test_rank_array[i]  = C_test_rank_array[i];
+                break;
+            case 'D':
+                test_index_array[i] = D_test_index_array[i];
+                test_rank_array[i]  = D_test_rank_array[i];
+                break;
+            case 'E':
+                test_index_array[i] = E_test_index_array[i];
+                test_rank_array[i]  = E_test_rank_array[i];
+                break;
+        };
+        
+
+/*  Printout initial NPB info */
+    if( my_rank == 0 )
+    {
+        printf( "\n\n NAS Parallel Benchmarks 3.4 -- IS Benchmark\n\n" );
+        printf( " Size:  %ld  (class %c)\n", (long)TOTAL_KEYS*MIN_PROCS, CLASS );
+        printf( " Iterations:   %d\n", MAX_ITERATIONS );
+        printf( " Total number of processes:  %d\n", np_total );
+        if ( comm_size != np_total )
+            printf( " WARNING: Number of processes"
+                    " is not a power of two (%d active)\n", comm_size );
+
+        timeron = check_timer_flag();
+    }
+
+    MPI_Bcast(&timeron, 1, MPI_INT, 0, comm_work);
+
+#ifdef  TIMING_ENABLED 
+    for( i=1; i<=T_LAST; i++ ) timer_clear( i );
+#endif
+
+/*  allocate space for work arrays */
+    alloc_space();
+
+/*  Generate random number sequence and subsequent keys on all procs */
+    create_seq( find_my_seed( my_rank, 
+                              comm_size, 
+                              4*(long)TOTAL_KEYS*MIN_PROCS,
+                              314159265.00,      /* Random number gen seed */
+                              1220703125.00 ),   /* Random number gen mult */
+                1220703125.00 );                 /* Random number gen mult */
+
+
+/*  Do one iteration for free (i.e., untimed) to guarantee initialization of
+    all data and code pages and respective tables */
+    rank( 1 );  
+
+/*  Start verification counter */
+    passed_verification = 0;
+
+    if( my_rank == 0 && CLASS != 'S' ) printf( "\n   iteration\n" );
+
+/*  Initialize timer  */             
+    timer_clear( 0 );
+
+/*  Initialize separate communication, computation timing */
+#ifdef  TIMING_ENABLED 
+    for( i=1; i<=T_LAST; i++ ) timer_clear( i );
+#endif
+
+/*  Start timer  */             
+    timer_start( 0 );
+
+
+/*  This is the main iteration */
+    for( iteration=1; iteration<=MAX_ITERATIONS; iteration++ )
+    {
+        if( my_rank == 0 && CLASS != 'S' ) printf( "        %d\n", iteration );
+        rank( iteration );
+    }
+
+
+/*  Stop timer, obtain time for processors */
+    timer_stop( 0 );
+
+    timecounter = timer_read( 0 );
+
+/*  End of timing, obtain maximum time of all processors */
+    MPI_Reduce( &timecounter,
+                &maxtime,
+                1,
+                MPI_DOUBLE,
+                MPI_MAX,
+                0,
+                comm_work );
+
+
+/*  This tests that keys are in sequence: sorting of last ranked key seq
+    occurs here, but is an untimed operation                             */
+    full_verify();
+
+
+/*  Obtain verification counter sum */
+    itemp = passed_verification;
+    MPI_Reduce( &itemp,
+                &passed_verification,
+                1,
+                MPI_INT,
+                MPI_SUM,
+                0,
+                comm_work );
+
+
+
+/*  The final printout  */
+    if( my_rank == 0 )
+    {
+        if( passed_verification != 5*MAX_ITERATIONS + comm_size )
+            passed_verification = 0;
+        c_print_results( "IS",
+                         CLASS,
+                         (int)(TOTAL_KEYS),
+                         MIN_PROCS,
+                         0,
+                         MAX_ITERATIONS,
+                         comm_size,
+                         np_total,
+                         maxtime,
+                         ((double) (MAX_ITERATIONS)*TOTAL_KEYS*MIN_PROCS)
+                                                      /maxtime/1000000.,
+                         "keys ranked", 
+                         passed_verification,
+                         NPBVERSION,
+                         COMPILETIME,
+                         MPICC,
+                         CLINK,
+                         CMPI_LIB,
+                         CMPI_INC,
+                         CFLAGS,
+                         CLINKFLAGS );
+    }
+                    
+
+#ifdef  TIMING_ENABLED
+    if (timeron)
+    {
+        double    t1[T_LAST+1], tmin[T_LAST+1], tsum[T_LAST+1], tmax[T_LAST+1];
+        char      t_recs[T_LAST+1][9];
+    
+        for( i=0; i<=T_LAST; i++ )
+            t1[i] = timer_read( i );
+
+        MPI_Reduce( t1,
+                    tmin,
+                    T_LAST+1,
+                    MPI_DOUBLE,
+                    MPI_MIN,
+                    0,
+                    comm_work );
+        MPI_Reduce( t1,
+                    tsum,
+                    T_LAST+1,
+                    MPI_DOUBLE,
+                    MPI_SUM,
+                    0,
+                    comm_work );
+        MPI_Reduce( t1,
+                    tmax,
+                    T_LAST+1,
+                    MPI_DOUBLE,
+                    MPI_MAX,
+                    0,
+                    comm_work );
+
+        if( my_rank == 0 )
+        {
+            strcpy( t_recs[T_TOTAL],  "total" );
+            strcpy( t_recs[T_RANK],   "rcomp" );
+            strcpy( t_recs[T_RCOMM],  "rcomm" );
+            strcpy( t_recs[T_VERIFY], "verify");
+            printf( " nprocs = %6d     ", comm_size);
+            printf( "     minimum     maximum     average\n" );
+            for( i=0; i<=T_LAST; i++ )
+            {
+                printf( " timer %2d (%-8s):  %10.4f  %10.4f  %10.4f\n",
+                        i+1, t_recs[i], tmin[i], tmax[i], 
+                        tsum[i]/((double) comm_size) );
+            }
+            printf( "\n" );
+        }
+    }
+#endif
+
+    MPI_Finalize();
+
+
+    return 0;
+         /**************************/
+}        /*  E N D  P R O G R A M  */
+         /**************************/
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/Makefile b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/Makefile
new file mode 100644
index 000000000..082c3fdf8
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/Makefile
@@ -0,0 +1,70 @@
+SHELL=/bin/sh
+BENCHMARK=lu
+BENCHMARKU=LU
+VEC=
+
+include ../config/make.def
+
+OBJS = lu.o lu_data$(VEC).o init_comm.o read_input.o bcast_inputs.o \
+       proc_grid.o neighbors.o nodedim.o subdomain.o setcoeff.o \
+       setbv.o exact.o setiv.o erhs.o ssor$(VEC).o exchange_1.o exchange_3.o \
+       exchange_4.o exchange_5.o exchange_6.o rhs.o l2norm.o \
+       jacld$(VEC).o blts$(VEC).o jacu$(VEC).o buts$(VEC).o mpinpb.o \
+       error.o pintgr.o verify.o ${COMMON}/get_active_nprocs.o \
+       ${COMMON}/print_results.o ${COMMON}/timers.o
+
+include ../sys/make.common
+
+
+# npbparams.h is included by lu_data module (via lu_data.o)
+
+${PROGRAM}: config
+	@if [ x$(VERSION) = xvec ] ; then	\
+		${MAKE} VEC=_vec exec;		\
+	elif [ x$(VERSION) = xVEC ] ; then	\
+		${MAKE} VEC=_vec exec;		\
+	else					\
+		${MAKE} exec;			\
+	fi
+
+exec: $(OBJS)
+	${FLINK} ${FLINKFLAGS} -o ${PROGRAM} ${OBJS} ${FMPI_LIB}
+
+.f90.o :
+	${FCOMPILE} $<
+
+lu.o:		lu.f90 lu_data.o mpinpb.o
+bcast_inputs.o:	bcast_inputs.f90 lu_data.o mpinpb.o
+blts$(VEC).o:	blts$(VEC).f90 lu_data.o
+buts$(VEC).o:	buts$(VEC).f90 lu_data.o
+erhs.o:		erhs.f90 lu_data.o
+error.o:	error.f90 lu_data.o mpinpb.o
+exact.o:	exact.f90 lu_data.o
+exchange_1.o:	exchange_1.f90 lu_data.o mpinpb.o
+exchange_3.o:	exchange_3.f90 lu_data.o mpinpb.o
+exchange_4.o:	exchange_4.f90 lu_data.o mpinpb.o
+exchange_5.o:	exchange_5.f90 lu_data.o mpinpb.o
+exchange_6.o:	exchange_6.f90 lu_data.o mpinpb.o
+init_comm.o:	init_comm.f90 lu_data.o mpinpb.o 
+jacld$(VEC).o:	jacld$(VEC).f90 lu_data.o
+jacu$(VEC).o:	jacu$(VEC).f90 lu_data.o
+l2norm.o:	l2norm.f90 lu_data.o mpinpb.o
+neighbors.o:	neighbors.f90 lu_data.o
+nodedim.o:	nodedim.f90
+pintgr.o:	pintgr.f90 lu_data.o mpinpb.o
+proc_grid.o:	proc_grid.f90 lu_data.o mpinpb.o
+read_input.o:	read_input.f90 lu_data.o mpinpb.o
+rhs.o:		rhs.f90 lu_data.o
+setbv.o:	setbv.f90 lu_data.o
+setiv.o:	setiv.f90 lu_data.o
+setcoeff.o:	setcoeff.f90 lu_data.o
+ssor$(VEC).o:	ssor$(VEC).f90 lu_data.o mpinpb.o
+subdomain.o:	subdomain.f90 lu_data.o mpinpb.o
+verify.o:	verify.f90 lu_data.o
+lu_data.o:      lu_data$(VEC).f90 mpinpb.o npbparams.h
+	${FCOMPILE} -o $@ lu_data$(VEC).f90
+mpinpb.o:       mpinpb.f90
+
+clean:
+	- rm -f npbparams.h
+	- rm -f *.o *.mod *~
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/bcast_inputs.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/bcast_inputs.f90
new file mode 100644
index 000000000..d538d01d9
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/bcast_inputs.f90
@@ -0,0 +1,44 @@
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine bcast_inputs
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      use lu_data
+      use mpinpb
+      use timing
+
+      implicit none
+
+!---------------------------------------------------------------------
+!  local variables
+!---------------------------------------------------------------------
+      integer ierr
+
+!---------------------------------------------------------------------
+!   root broadcasts the data
+!   The data isn't contiguous or of the same type, so it's not
+!   clear how to send it in the "MPI" way. 
+!   We could pack the info into a buffer or we could create
+!   an obscene datatype to handle it all at once. Since we only
+!   broadcast the data once, just use a separate broadcast for
+!   each piece. 
+!---------------------------------------------------------------------
+      call MPI_BCAST(ipr, 1, MPI_INTEGER, root, comm_solve, ierr)
+      call MPI_BCAST(inorm, 1, MPI_INTEGER, root, comm_solve, ierr)
+      call MPI_BCAST(itmax, 1, MPI_INTEGER, root, comm_solve, ierr)
+      call MPI_BCAST(dt, 1, dp_type, root, comm_solve, ierr)
+      call MPI_BCAST(omega, 1, dp_type, root, comm_solve, ierr)
+      call MPI_BCAST(tolrsd, 5, dp_type, root, comm_solve, ierr)
+      call MPI_BCAST(nx0, 1, MPI_INTEGER, root, comm_solve, ierr)
+      call MPI_BCAST(ny0, 1, MPI_INTEGER, root, comm_solve, ierr)
+      call MPI_BCAST(nz0, 1, MPI_INTEGER, root, comm_solve, ierr)
+      call MPI_BCAST(timeron, 1, MPI_LOGICAL, root, comm_solve,  &
+     &               ierr)
+
+      return
+      end
+
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/blts.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/blts.f90
new file mode 100644
index 000000000..aecd009b9
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/blts.f90
@@ -0,0 +1,250 @@
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine blts ( ldmx, ldmy, ldmz,  &
+     &                  nx, ny, nz, j, k,  &
+     &                  omega,  &
+     &                  v,  &
+     &                  ldz, ldy, ldx, d,  &
+     &                  ist, iend, jst, jend,  &
+     &                  nx0, ny0, ipt, jpt)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!
+!   compute the regular-sparse, block lower triangular solution:
+!
+!                     v <-- ( L-inv ) * v
+!
+!---------------------------------------------------------------------
+
+      use timing
+      implicit none
+
+!---------------------------------------------------------------------
+!  input parameters
+!---------------------------------------------------------------------
+      integer ldmx, ldmy, ldmz
+      integer nx, ny, nz
+      integer j, k
+      double precision  omega
+      double precision  v( 5, -1:ldmx+2, -1:ldmy+2, *),  &
+     &        ldz( 5, 5, ldmx ),  &
+     &        ldy( 5, 5, ldmx ),  &
+     &        ldx( 5, 5, ldmx ),  &
+     &        d( 5, 5, ldmx )
+      integer ist, iend
+      integer jst, jend
+      integer nx0, ny0
+      integer ipt, jpt
+
+!---------------------------------------------------------------------
+!  local variables
+!---------------------------------------------------------------------
+      integer i, m
+      double precision  tmp, tmp1
+      double precision  tmat(5,5)
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+         do i = ist, iend
+            do m = 1, 5
+
+                  v( m, i, j, k ) =  v( m, i, j, k )  &
+     &    - omega * (  ldz( m, 1, i ) * v( 1, i, j, k-1 )  &
+     &               + ldz( m, 2, i ) * v( 2, i, j, k-1 )  &
+     &               + ldz( m, 3, i ) * v( 3, i, j, k-1 )  &
+     &               + ldz( m, 4, i ) * v( 4, i, j, k-1 )  &
+     &               + ldz( m, 5, i ) * v( 5, i, j, k-1 )  )
+
+            end do
+         end do
+
+
+         do i = ist, iend
+
+            do m = 1, 5
+
+                  v( m, i, j, k ) =  v( m, i, j, k )  &
+     & - omega * ( ldy( m, 1, i ) * v( 1, i, j-1, k )  &
+     &           + ldx( m, 1, i ) * v( 1, i-1, j, k )  &
+     &           + ldy( m, 2, i ) * v( 2, i, j-1, k )  &
+     &           + ldx( m, 2, i ) * v( 2, i-1, j, k )  &
+     &           + ldy( m, 3, i ) * v( 3, i, j-1, k )  &
+     &           + ldx( m, 3, i ) * v( 3, i-1, j, k )  &
+     &           + ldy( m, 4, i ) * v( 4, i, j-1, k )  &
+     &           + ldx( m, 4, i ) * v( 4, i-1, j, k )  &
+     &           + ldy( m, 5, i ) * v( 5, i, j-1, k )  &
+     &           + ldx( m, 5, i ) * v( 5, i-1, j, k ) )
+
+            end do
+       
+!---------------------------------------------------------------------
+!   diagonal block inversion
+!
+!   forward elimination
+!---------------------------------------------------------------------
+            do m = 1, 5
+               tmat( m, 1 ) = d( m, 1, i )
+               tmat( m, 2 ) = d( m, 2, i )
+               tmat( m, 3 ) = d( m, 3, i )
+               tmat( m, 4 ) = d( m, 4, i )
+               tmat( m, 5 ) = d( m, 5, i )
+            end do
+
+            tmp1 = 1.0d+00 / tmat( 1, 1 )
+            tmp = tmp1 * tmat( 2, 1 )
+            tmat( 2, 2 ) =  tmat( 2, 2 )  &
+     &           - tmp * tmat( 1, 2 )
+            tmat( 2, 3 ) =  tmat( 2, 3 )  &
+     &           - tmp * tmat( 1, 3 )
+            tmat( 2, 4 ) =  tmat( 2, 4 )  &
+     &           - tmp * tmat( 1, 4 )
+            tmat( 2, 5 ) =  tmat( 2, 5 )  &
+     &           - tmp * tmat( 1, 5 )
+            v( 2, i, j, k ) = v( 2, i, j, k )  &
+     &        - v( 1, i, j, k ) * tmp
+
+            tmp = tmp1 * tmat( 3, 1 )
+            tmat( 3, 2 ) =  tmat( 3, 2 )  &
+     &           - tmp * tmat( 1, 2 )
+            tmat( 3, 3 ) =  tmat( 3, 3 )  &
+     &           - tmp * tmat( 1, 3 )
+            tmat( 3, 4 ) =  tmat( 3, 4 )  &
+     &           - tmp * tmat( 1, 4 )
+            tmat( 3, 5 ) =  tmat( 3, 5 )  &
+     &           - tmp * tmat( 1, 5 )
+            v( 3, i, j, k ) = v( 3, i, j, k )  &
+     &        - v( 1, i, j, k ) * tmp
+
+            tmp = tmp1 * tmat( 4, 1 )
+            tmat( 4, 2 ) =  tmat( 4, 2 )  &
+     &           - tmp * tmat( 1, 2 )
+            tmat( 4, 3 ) =  tmat( 4, 3 )  &
+     &           - tmp * tmat( 1, 3 )
+            tmat( 4, 4 ) =  tmat( 4, 4 )  &
+     &           - tmp * tmat( 1, 4 )
+            tmat( 4, 5 ) =  tmat( 4, 5 )  &
+     &           - tmp * tmat( 1, 5 )
+            v( 4, i, j, k ) = v( 4, i, j, k )  &
+     &        - v( 1, i, j, k ) * tmp
+
+            tmp = tmp1 * tmat( 5, 1 )
+            tmat( 5, 2 ) =  tmat( 5, 2 )  &
+     &           - tmp * tmat( 1, 2 )
+            tmat( 5, 3 ) =  tmat( 5, 3 )  &
+     &           - tmp * tmat( 1, 3 )
+            tmat( 5, 4 ) =  tmat( 5, 4 )  &
+     &           - tmp * tmat( 1, 4 )
+            tmat( 5, 5 ) =  tmat( 5, 5 )  &
+     &           - tmp * tmat( 1, 5 )
+            v( 5, i, j, k ) = v( 5, i, j, k )  &
+     &        - v( 1, i, j, k ) * tmp
+
+
+
+            tmp1 = 1.0d+00 / tmat( 2, 2 )
+            tmp = tmp1 * tmat( 3, 2 )
+            tmat( 3, 3 ) =  tmat( 3, 3 )  &
+     &           - tmp * tmat( 2, 3 )
+            tmat( 3, 4 ) =  tmat( 3, 4 )  &
+     &           - tmp * tmat( 2, 4 )
+            tmat( 3, 5 ) =  tmat( 3, 5 )  &
+     &           - tmp * tmat( 2, 5 )
+            v( 3, i, j, k ) = v( 3, i, j, k )  &
+     &        - v( 2, i, j, k ) * tmp
+
+            tmp = tmp1 * tmat( 4, 2 )
+            tmat( 4, 3 ) =  tmat( 4, 3 )  &
+     &           - tmp * tmat( 2, 3 )
+            tmat( 4, 4 ) =  tmat( 4, 4 )  &
+     &           - tmp * tmat( 2, 4 )
+            tmat( 4, 5 ) =  tmat( 4, 5 )  &
+     &           - tmp * tmat( 2, 5 )
+            v( 4, i, j, k ) = v( 4, i, j, k )  &
+     &        - v( 2, i, j, k ) * tmp
+
+            tmp = tmp1 * tmat( 5, 2 )
+            tmat( 5, 3 ) =  tmat( 5, 3 )  &
+     &           - tmp * tmat( 2, 3 )
+            tmat( 5, 4 ) =  tmat( 5, 4 )  &
+     &           - tmp * tmat( 2, 4 )
+            tmat( 5, 5 ) =  tmat( 5, 5 )  &
+     &           - tmp * tmat( 2, 5 )
+            v( 5, i, j, k ) = v( 5, i, j, k )  &
+     &        - v( 2, i, j, k ) * tmp
+
+
+
+            tmp1 = 1.0d+00 / tmat( 3, 3 )
+            tmp = tmp1 * tmat( 4, 3 )
+            tmat( 4, 4 ) =  tmat( 4, 4 )  &
+     &           - tmp * tmat( 3, 4 )
+            tmat( 4, 5 ) =  tmat( 4, 5 )  &
+     &           - tmp * tmat( 3, 5 )
+            v( 4, i, j, k ) = v( 4, i, j, k )  &
+     &        - v( 3, i, j, k ) * tmp
+
+            tmp = tmp1 * tmat( 5, 3 )
+            tmat( 5, 4 ) =  tmat( 5, 4 )  &
+     &           - tmp * tmat( 3, 4 )
+            tmat( 5, 5 ) =  tmat( 5, 5 )  &
+     &           - tmp * tmat( 3, 5 )
+            v( 5, i, j, k ) = v( 5, i, j, k )  &
+     &        - v( 3, i, j, k ) * tmp
+
+
+
+            tmp1 = 1.0d+00 / tmat( 4, 4 )
+            tmp = tmp1 * tmat( 5, 4 )
+            tmat( 5, 5 ) =  tmat( 5, 5 )  &
+     &           - tmp * tmat( 4, 5 )
+            v( 5, i, j, k ) = v( 5, i, j, k )  &
+     &        - v( 4, i, j, k ) * tmp
+
+!---------------------------------------------------------------------
+!   back substitution
+!---------------------------------------------------------------------
+            v( 5, i, j, k ) = v( 5, i, j, k )  &
+     &                      / tmat( 5, 5 )
+
+            v( 4, i, j, k ) = v( 4, i, j, k )  &
+     &           - tmat( 4, 5 ) * v( 5, i, j, k )
+            v( 4, i, j, k ) = v( 4, i, j, k )  &
+     &                      / tmat( 4, 4 )
+
+            v( 3, i, j, k ) = v( 3, i, j, k )  &
+     &           - tmat( 3, 4 ) * v( 4, i, j, k )  &
+     &           - tmat( 3, 5 ) * v( 5, i, j, k )
+            v( 3, i, j, k ) = v( 3, i, j, k )  &
+     &                      / tmat( 3, 3 )
+
+            v( 2, i, j, k ) = v( 2, i, j, k )  &
+     &           - tmat( 2, 3 ) * v( 3, i, j, k )  &
+     &           - tmat( 2, 4 ) * v( 4, i, j, k )  &
+     &           - tmat( 2, 5 ) * v( 5, i, j, k )
+            v( 2, i, j, k ) = v( 2, i, j, k )  &
+     &                      / tmat( 2, 2 )
+
+            v( 1, i, j, k ) = v( 1, i, j, k )  &
+     &           - tmat( 1, 2 ) * v( 2, i, j, k )  &
+     &           - tmat( 1, 3 ) * v( 3, i, j, k )  &
+     &           - tmat( 1, 4 ) * v( 4, i, j, k )  &
+     &           - tmat( 1, 5 ) * v( 5, i, j, k )
+            v( 1, i, j, k ) = v( 1, i, j, k )  &
+     &                      / tmat( 1, 1 )
+
+
+         enddo
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      return
+      end
+
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/blts_vec.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/blts_vec.f90
new file mode 100644
index 000000000..7691fbf61
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/blts_vec.f90
@@ -0,0 +1,342 @@
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine blts ( ldmx, ldmy, ldmz,  &
+     &                  nx, ny, nz, k,  &
+     &                  omega,  &
+     &                  v,  &
+     &                  ldz, ldy, ldx, d,  &
+     &                  ist, iend, jst, jend,  &
+     &                  nx0, ny0, ipt, jpt)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!
+!   compute the regular-sparse, block lower triangular solution:
+!
+!                     v <-- ( L-inv ) * v
+!
+!---------------------------------------------------------------------
+
+      use timing
+      implicit none
+
+!---------------------------------------------------------------------
+!  input parameters
+!---------------------------------------------------------------------
+      integer ldmx, ldmy, ldmz
+      integer nx, ny, nz
+      integer k
+      double precision  omega
+      double precision  v( 5, -1:ldmx+2, -1:ldmy+2, *),  &
+     &        ldz( 5, 5, ldmx, ldmy),  &
+     &        ldy( 5, 5, ldmx, ldmy),  &
+     &        ldx( 5, 5, ldmx, ldmy),  &
+     &        d( 5, 5, ldmx, ldmy)
+      integer ist, iend
+      integer jst, jend
+      integer nx0, ny0
+      integer ipt, jpt
+
+
+!---------------------------------------------------------------------
+!  local variables
+!---------------------------------------------------------------------
+      integer i, j, m, l, istp, iendp
+      integer iex
+      double precision  tmp, tmp1
+      double precision  tmat(5,5)
+
+
+!---------------------------------------------------------------------
+!   receive data from north and west
+!---------------------------------------------------------------------
+      if (timeron) call timer_start(t_lcomm)
+      iex = 0
+      call exchange_1( v,k,iex )
+      if (timeron) call timer_stop(t_lcomm)
+
+
+      if (timeron) call timer_start(t_blts)
+      do j = jst, jend
+         do i = ist, iend
+            do m = 1, 5
+
+                  v( m, i, j, k ) =  v( m, i, j, k )  &
+     &    - omega * (  ldz( m, 1, i, j ) * v( 1, i, j, k-1 )  &
+     &               + ldz( m, 2, i, j ) * v( 2, i, j, k-1 )  &
+     &               + ldz( m, 3, i, j ) * v( 3, i, j, k-1 )  &
+     &               + ldz( m, 4, i, j ) * v( 4, i, j, k-1 )  &
+     &               + ldz( m, 5, i, j ) * v( 5, i, j, k-1 )  )
+
+            end do
+         end do
+      end do
+
+
+      do l = ist+jst, iend+jend
+         istp  = max(l - jend, ist)
+         iendp = min(l - jst, iend)
+
+!dir$ ivdep
+         do i = istp, iendp
+            j = l - i
+
+!!dir$ unroll 5
+!   manually unroll the loop
+!            do m = 1, 5
+
+                  v( 1, i, j, k ) =  v( 1, i, j, k )  &
+     & - omega * ( ldy( 1, 1, i, j ) * v( 1, i, j-1, k )  &
+     &           + ldx( 1, 1, i, j ) * v( 1, i-1, j, k )  &
+     &           + ldy( 1, 2, i, j ) * v( 2, i, j-1, k )  &
+     &           + ldx( 1, 2, i, j ) * v( 2, i-1, j, k )  &
+     &           + ldy( 1, 3, i, j ) * v( 3, i, j-1, k )  &
+     &           + ldx( 1, 3, i, j ) * v( 3, i-1, j, k )  &
+     &           + ldy( 1, 4, i, j ) * v( 4, i, j-1, k )  &
+     &           + ldx( 1, 4, i, j ) * v( 4, i-1, j, k )  &
+     &           + ldy( 1, 5, i, j ) * v( 5, i, j-1, k )  &
+     &           + ldx( 1, 5, i, j ) * v( 5, i-1, j, k ) )
+                  v( 2, i, j, k ) =  v( 2, i, j, k )  &
+     & - omega * ( ldy( 2, 1, i, j ) * v( 1, i, j-1, k )  &
+     &           + ldx( 2, 1, i, j ) * v( 1, i-1, j, k )  &
+     &           + ldy( 2, 2, i, j ) * v( 2, i, j-1, k )  &
+     &           + ldx( 2, 2, i, j ) * v( 2, i-1, j, k )  &
+     &           + ldy( 2, 3, i, j ) * v( 3, i, j-1, k )  &
+     &           + ldx( 2, 3, i, j ) * v( 3, i-1, j, k )  &
+     &           + ldy( 2, 4, i, j ) * v( 4, i, j-1, k )  &
+     &           + ldx( 2, 4, i, j ) * v( 4, i-1, j, k )  &
+     &           + ldy( 2, 5, i, j ) * v( 5, i, j-1, k )  &
+     &           + ldx( 2, 5, i, j ) * v( 5, i-1, j, k ) )
+                  v( 3, i, j, k ) =  v( 3, i, j, k )  &
+     & - omega * ( ldy( 3, 1, i, j ) * v( 1, i, j-1, k )  &
+     &           + ldx( 3, 1, i, j ) * v( 1, i-1, j, k )  &
+     &           + ldy( 3, 2, i, j ) * v( 2, i, j-1, k )  &
+     &           + ldx( 3, 2, i, j ) * v( 2, i-1, j, k )  &
+     &           + ldy( 3, 3, i, j ) * v( 3, i, j-1, k )  &
+     &           + ldx( 3, 3, i, j ) * v( 3, i-1, j, k )  &
+     &           + ldy( 3, 4, i, j ) * v( 4, i, j-1, k )  &
+     &           + ldx( 3, 4, i, j ) * v( 4, i-1, j, k )  &
+     &           + ldy( 3, 5, i, j ) * v( 5, i, j-1, k )  &
+     &           + ldx( 3, 5, i, j ) * v( 5, i-1, j, k ) )
+                  v( 4, i, j, k ) =  v( 4, i, j, k )  &
+     & - omega * ( ldy( 4, 1, i, j ) * v( 1, i, j-1, k )  &
+     &           + ldx( 4, 1, i, j ) * v( 1, i-1, j, k )  &
+     &           + ldy( 4, 2, i, j ) * v( 2, i, j-1, k )  &
+     &           + ldx( 4, 2, i, j ) * v( 2, i-1, j, k )  &
+     &           + ldy( 4, 3, i, j ) * v( 3, i, j-1, k )  &
+     &           + ldx( 4, 3, i, j ) * v( 3, i-1, j, k )  &
+     &           + ldy( 4, 4, i, j ) * v( 4, i, j-1, k )  &
+     &           + ldx( 4, 4, i, j ) * v( 4, i-1, j, k )  &
+     &           + ldy( 4, 5, i, j ) * v( 5, i, j-1, k )  &
+     &           + ldx( 4, 5, i, j ) * v( 5, i-1, j, k ) )
+                  v( 5, i, j, k ) =  v( 5, i, j, k )  &
+     & - omega * ( ldy( 5, 1, i, j ) * v( 1, i, j-1, k )  &
+     &           + ldx( 5, 1, i, j ) * v( 1, i-1, j, k )  &
+     &           + ldy( 5, 2, i, j ) * v( 2, i, j-1, k )  &
+     &           + ldx( 5, 2, i, j ) * v( 2, i-1, j, k )  &
+     &           + ldy( 5, 3, i, j ) * v( 3, i, j-1, k )  &
+     &           + ldx( 5, 3, i, j ) * v( 3, i-1, j, k )  &
+     &           + ldy( 5, 4, i, j ) * v( 4, i, j-1, k )  &
+     &           + ldx( 5, 4, i, j ) * v( 4, i-1, j, k )  &
+     &           + ldy( 5, 5, i, j ) * v( 5, i, j-1, k )  &
+     &           + ldx( 5, 5, i, j ) * v( 5, i-1, j, k ) )
+
+!            end do
+
+!---------------------------------------------------------------------
+!   diagonal block inversion
+!
+!   forward elimination
+!---------------------------------------------------------------------
+!!dir$ unroll 5
+!   manually unroll the loop
+!            do m = 1, 5
+               tmat( 1, 1 ) = d( 1, 1, i, j )
+               tmat( 1, 2 ) = d( 1, 2, i, j )
+               tmat( 1, 3 ) = d( 1, 3, i, j )
+               tmat( 1, 4 ) = d( 1, 4, i, j )
+               tmat( 1, 5 ) = d( 1, 5, i, j )
+               tmat( 2, 1 ) = d( 2, 1, i, j )
+               tmat( 2, 2 ) = d( 2, 2, i, j )
+               tmat( 2, 3 ) = d( 2, 3, i, j )
+               tmat( 2, 4 ) = d( 2, 4, i, j )
+               tmat( 2, 5 ) = d( 2, 5, i, j )
+               tmat( 3, 1 ) = d( 3, 1, i, j )
+               tmat( 3, 2 ) = d( 3, 2, i, j )
+               tmat( 3, 3 ) = d( 3, 3, i, j )
+               tmat( 3, 4 ) = d( 3, 4, i, j )
+               tmat( 3, 5 ) = d( 3, 5, i, j )
+               tmat( 4, 1 ) = d( 4, 1, i, j )
+               tmat( 4, 2 ) = d( 4, 2, i, j )
+               tmat( 4, 3 ) = d( 4, 3, i, j )
+               tmat( 4, 4 ) = d( 4, 4, i, j )
+               tmat( 4, 5 ) = d( 4, 5, i, j )
+               tmat( 5, 1 ) = d( 5, 1, i, j )
+               tmat( 5, 2 ) = d( 5, 2, i, j )
+               tmat( 5, 3 ) = d( 5, 3, i, j )
+               tmat( 5, 4 ) = d( 5, 4, i, j )
+               tmat( 5, 5 ) = d( 5, 5, i, j )
+!            end do
+
+            tmp1 = 1.0d+00 / tmat( 1, 1 )
+            tmp = tmp1 * tmat( 2, 1 )
+            tmat( 2, 2 ) =  tmat( 2, 2 )  &
+     &           - tmp * tmat( 1, 2 )
+            tmat( 2, 3 ) =  tmat( 2, 3 )  &
+     &           - tmp * tmat( 1, 3 )
+            tmat( 2, 4 ) =  tmat( 2, 4 )  &
+     &           - tmp * tmat( 1, 4 )
+            tmat( 2, 5 ) =  tmat( 2, 5 )  &
+     &           - tmp * tmat( 1, 5 )
+            v( 2, i, j, k ) = v( 2, i, j, k )  &
+     &        - v( 1, i, j, k ) * tmp
+
+            tmp = tmp1 * tmat( 3, 1 )
+            tmat( 3, 2 ) =  tmat( 3, 2 )  &
+     &           - tmp * tmat( 1, 2 )
+            tmat( 3, 3 ) =  tmat( 3, 3 )  &
+     &           - tmp * tmat( 1, 3 )
+            tmat( 3, 4 ) =  tmat( 3, 4 )  &
+     &           - tmp * tmat( 1, 4 )
+            tmat( 3, 5 ) =  tmat( 3, 5 )  &
+     &           - tmp * tmat( 1, 5 )
+            v( 3, i, j, k ) = v( 3, i, j, k )  &
+     &        - v( 1, i, j, k ) * tmp
+
+            tmp = tmp1 * tmat( 4, 1 )
+            tmat( 4, 2 ) =  tmat( 4, 2 )  &
+     &           - tmp * tmat( 1, 2 )
+            tmat( 4, 3 ) =  tmat( 4, 3 )  &
+     &           - tmp * tmat( 1, 3 )
+            tmat( 4, 4 ) =  tmat( 4, 4 )  &
+     &           - tmp * tmat( 1, 4 )
+            tmat( 4, 5 ) =  tmat( 4, 5 )  &
+     &           - tmp * tmat( 1, 5 )
+            v( 4, i, j, k ) = v( 4, i, j, k )  &
+     &        - v( 1, i, j, k ) * tmp
+
+            tmp = tmp1 * tmat( 5, 1 )
+            tmat( 5, 2 ) =  tmat( 5, 2 )  &
+     &           - tmp * tmat( 1, 2 )
+            tmat( 5, 3 ) =  tmat( 5, 3 )  &
+     &           - tmp * tmat( 1, 3 )
+            tmat( 5, 4 ) =  tmat( 5, 4 )  &
+     &           - tmp * tmat( 1, 4 )
+            tmat( 5, 5 ) =  tmat( 5, 5 )  &
+     &           - tmp * tmat( 1, 5 )
+            v( 5, i, j, k ) = v( 5, i, j, k )  &
+     &        - v( 1, i, j, k ) * tmp
+
+
+
+            tmp1 = 1.0d+00 / tmat( 2, 2 )
+            tmp = tmp1 * tmat( 3, 2 )
+            tmat( 3, 3 ) =  tmat( 3, 3 )  &
+     &           - tmp * tmat( 2, 3 )
+            tmat( 3, 4 ) =  tmat( 3, 4 )  &
+     &           - tmp * tmat( 2, 4 )
+            tmat( 3, 5 ) =  tmat( 3, 5 )  &
+     &           - tmp * tmat( 2, 5 )
+            v( 3, i, j, k ) = v( 3, i, j, k )  &
+     &        - v( 2, i, j, k ) * tmp
+
+            tmp = tmp1 * tmat( 4, 2 )
+            tmat( 4, 3 ) =  tmat( 4, 3 )  &
+     &           - tmp * tmat( 2, 3 )
+            tmat( 4, 4 ) =  tmat( 4, 4 )  &
+     &           - tmp * tmat( 2, 4 )
+            tmat( 4, 5 ) =  tmat( 4, 5 )  &
+     &           - tmp * tmat( 2, 5 )
+            v( 4, i, j, k ) = v( 4, i, j, k )  &
+     &        - v( 2, i, j, k ) * tmp
+
+            tmp = tmp1 * tmat( 5, 2 )
+            tmat( 5, 3 ) =  tmat( 5, 3 )  &
+     &           - tmp * tmat( 2, 3 )
+            tmat( 5, 4 ) =  tmat( 5, 4 )  &
+     &           - tmp * tmat( 2, 4 )
+            tmat( 5, 5 ) =  tmat( 5, 5 )  &
+     &           - tmp * tmat( 2, 5 )
+            v( 5, i, j, k ) = v( 5, i, j, k )  &
+     &        - v( 2, i, j, k ) * tmp
+
+
+
+            tmp1 = 1.0d+00 / tmat( 3, 3 )
+            tmp = tmp1 * tmat( 4, 3 )
+            tmat( 4, 4 ) =  tmat( 4, 4 )  &
+     &           - tmp * tmat( 3, 4 )
+            tmat( 4, 5 ) =  tmat( 4, 5 )  &
+     &           - tmp * tmat( 3, 5 )
+            v( 4, i, j, k ) = v( 4, i, j, k )  &
+     &        - v( 3, i, j, k ) * tmp
+
+            tmp = tmp1 * tmat( 5, 3 )
+            tmat( 5, 4 ) =  tmat( 5, 4 )  &
+     &           - tmp * tmat( 3, 4 )
+            tmat( 5, 5 ) =  tmat( 5, 5 )  &
+     &           - tmp * tmat( 3, 5 )
+            v( 5, i, j, k ) = v( 5, i, j, k )  &
+     &        - v( 3, i, j, k ) * tmp
+
+
+
+            tmp1 = 1.0d+00 / tmat( 4, 4 )
+            tmp = tmp1 * tmat( 5, 4 )
+            tmat( 5, 5 ) =  tmat( 5, 5 )  &
+     &           - tmp * tmat( 4, 5 )
+            v( 5, i, j, k ) = v( 5, i, j, k )  &
+     &        - v( 4, i, j, k ) * tmp
+
+!---------------------------------------------------------------------
+!   back substitution
+!---------------------------------------------------------------------
+            v( 5, i, j, k ) = v( 5, i, j, k )  &
+     &                      / tmat( 5, 5 )
+
+            v( 4, i, j, k ) = v( 4, i, j, k )  &
+     &           - tmat( 4, 5 ) * v( 5, i, j, k )
+            v( 4, i, j, k ) = v( 4, i, j, k )  &
+     &                      / tmat( 4, 4 )
+
+            v( 3, i, j, k ) = v( 3, i, j, k )  &
+     &           - tmat( 3, 4 ) * v( 4, i, j, k )  &
+     &           - tmat( 3, 5 ) * v( 5, i, j, k )
+            v( 3, i, j, k ) = v( 3, i, j, k )  &
+     &                      / tmat( 3, 3 )
+
+            v( 2, i, j, k ) = v( 2, i, j, k )  &
+     &           - tmat( 2, 3 ) * v( 3, i, j, k )  &
+     &           - tmat( 2, 4 ) * v( 4, i, j, k )  &
+     &           - tmat( 2, 5 ) * v( 5, i, j, k )
+            v( 2, i, j, k ) = v( 2, i, j, k )  &
+     &                      / tmat( 2, 2 )
+
+            v( 1, i, j, k ) = v( 1, i, j, k )  &
+     &           - tmat( 1, 2 ) * v( 2, i, j, k )  &
+     &           - tmat( 1, 3 ) * v( 3, i, j, k )  &
+     &           - tmat( 1, 4 ) * v( 4, i, j, k )  &
+     &           - tmat( 1, 5 ) * v( 5, i, j, k )
+            v( 1, i, j, k ) = v( 1, i, j, k )  &
+     &                      / tmat( 1, 1 )
+
+
+        enddo
+      enddo
+      if (timeron) call timer_stop(t_blts)
+
+!---------------------------------------------------------------------
+!   send data to east and south
+!---------------------------------------------------------------------
+      if (timeron) call timer_start(t_lcomm)
+      iex = 2
+      call exchange_1( v,k,iex )
+      if (timeron) call timer_stop(t_lcomm)
+
+      return
+      end
+
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/buts.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/buts.f90
new file mode 100644
index 000000000..139eac84b
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/buts.f90
@@ -0,0 +1,249 @@
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine buts( ldmx, ldmy, ldmz,  &
+     &                 nx, ny, nz, j, k,  &
+     &                 omega,  &
+     &                 v, tv,  &
+     &                 d, udx, udy, udz,  &
+     &                 ist, iend, jst, jend,  &
+     &                 nx0, ny0, ipt, jpt )
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!
+!   compute the regular-sparse, block upper triangular solution:
+!
+!                     v <-- ( U-inv ) * v
+!
+!---------------------------------------------------------------------
+
+      use timing
+      implicit none
+
+!---------------------------------------------------------------------
+!  input parameters
+!---------------------------------------------------------------------
+      integer ldmx, ldmy, ldmz
+      integer nx, ny, nz
+      integer j, k
+      double precision  omega
+      double precision  v( 5, -1:ldmx+2, -1:ldmy+2, *),  &
+     &        tv( 5, ldmx ),  &
+     &        d( 5, 5, ldmx ),  &
+     &        udx( 5, 5, ldmx ),  &
+     &        udy( 5, 5, ldmx ),  &
+     &        udz( 5, 5, ldmx )
+      integer ist, iend
+      integer jst, jend
+      integer nx0, ny0
+      integer ipt, jpt
+
+!---------------------------------------------------------------------
+!  local variables
+!---------------------------------------------------------------------
+      integer i, m
+      double precision  tmp, tmp1
+      double precision  tmat(5,5)
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+         do i = iend, ist, -1
+            do m = 1, 5
+                  tv( m, i ) =  &
+     &      omega * (  udz( m, 1, i ) * v( 1, i, j, k+1 )  &
+     &               + udz( m, 2, i ) * v( 2, i, j, k+1 )  &
+     &               + udz( m, 3, i ) * v( 3, i, j, k+1 )  &
+     &               + udz( m, 4, i ) * v( 4, i, j, k+1 )  &
+     &               + udz( m, 5, i ) * v( 5, i, j, k+1 ) )
+            end do
+         end do
+
+
+         do i = iend,ist,-1
+
+            do m = 1, 5
+                  tv( m, i ) = tv( m, i )  &
+     & + omega * ( udy( m, 1, i ) * v( 1, i, j+1, k )  &
+     &           + udx( m, 1, i ) * v( 1, i+1, j, k )  &
+     &           + udy( m, 2, i ) * v( 2, i, j+1, k )  &
+     &           + udx( m, 2, i ) * v( 2, i+1, j, k )  &
+     &           + udy( m, 3, i ) * v( 3, i, j+1, k )  &
+     &           + udx( m, 3, i ) * v( 3, i+1, j, k )  &
+     &           + udy( m, 4, i ) * v( 4, i, j+1, k )  &
+     &           + udx( m, 4, i ) * v( 4, i+1, j, k )  &
+     &           + udy( m, 5, i ) * v( 5, i, j+1, k )  &
+     &           + udx( m, 5, i ) * v( 5, i+1, j, k ) )
+            end do
+
+!---------------------------------------------------------------------
+!   diagonal block inversion
+!---------------------------------------------------------------------
+            do m = 1, 5
+               tmat( m, 1 ) = d( m, 1, i )
+               tmat( m, 2 ) = d( m, 2, i )
+               tmat( m, 3 ) = d( m, 3, i )
+               tmat( m, 4 ) = d( m, 4, i )
+               tmat( m, 5 ) = d( m, 5, i )
+            end do
+
+            tmp1 = 1.0d+00 / tmat( 1, 1 )
+            tmp = tmp1 * tmat( 2, 1 )
+            tmat( 2, 2 ) =  tmat( 2, 2 )  &
+     &           - tmp * tmat( 1, 2 )
+            tmat( 2, 3 ) =  tmat( 2, 3 )  &
+     &           - tmp * tmat( 1, 3 )
+            tmat( 2, 4 ) =  tmat( 2, 4 )  &
+     &           - tmp * tmat( 1, 4 )
+            tmat( 2, 5 ) =  tmat( 2, 5 )  &
+     &           - tmp * tmat( 1, 5 )
+            tv( 2, i ) = tv( 2, i )  &
+     &        - tv( 1, i ) * tmp
+
+            tmp = tmp1 * tmat( 3, 1 )
+            tmat( 3, 2 ) =  tmat( 3, 2 )  &
+     &           - tmp * tmat( 1, 2 )
+            tmat( 3, 3 ) =  tmat( 3, 3 )  &
+     &           - tmp * tmat( 1, 3 )
+            tmat( 3, 4 ) =  tmat( 3, 4 )  &
+     &           - tmp * tmat( 1, 4 )
+            tmat( 3, 5 ) =  tmat( 3, 5 )  &
+     &           - tmp * tmat( 1, 5 )
+            tv( 3, i ) = tv( 3, i )  &
+     &        - tv( 1, i ) * tmp
+
+            tmp = tmp1 * tmat( 4, 1 )
+            tmat( 4, 2 ) =  tmat( 4, 2 )  &
+     &           - tmp * tmat( 1, 2 )
+            tmat( 4, 3 ) =  tmat( 4, 3 )  &
+     &           - tmp * tmat( 1, 3 )
+            tmat( 4, 4 ) =  tmat( 4, 4 )  &
+     &           - tmp * tmat( 1, 4 )
+            tmat( 4, 5 ) =  tmat( 4, 5 )  &
+     &           - tmp * tmat( 1, 5 )
+            tv( 4, i ) = tv( 4, i )  &
+     &        - tv( 1, i ) * tmp
+
+            tmp = tmp1 * tmat( 5, 1 )
+            tmat( 5, 2 ) =  tmat( 5, 2 )  &
+     &           - tmp * tmat( 1, 2 )
+            tmat( 5, 3 ) =  tmat( 5, 3 )  &
+     &           - tmp * tmat( 1, 3 )
+            tmat( 5, 4 ) =  tmat( 5, 4 )  &
+     &           - tmp * tmat( 1, 4 )
+            tmat( 5, 5 ) =  tmat( 5, 5 )  &
+     &           - tmp * tmat( 1, 5 )
+            tv( 5, i ) = tv( 5, i )  &
+     &        - tv( 1, i ) * tmp
+
+
+
+            tmp1 = 1.0d+00 / tmat( 2, 2 )
+            tmp = tmp1 * tmat( 3, 2 )
+            tmat( 3, 3 ) =  tmat( 3, 3 )  &
+     &           - tmp * tmat( 2, 3 )
+            tmat( 3, 4 ) =  tmat( 3, 4 )  &
+     &           - tmp * tmat( 2, 4 )
+            tmat( 3, 5 ) =  tmat( 3, 5 )  &
+     &           - tmp * tmat( 2, 5 )
+            tv( 3, i ) = tv( 3, i )  &
+     &        - tv( 2, i ) * tmp
+
+            tmp = tmp1 * tmat( 4, 2 )
+            tmat( 4, 3 ) =  tmat( 4, 3 )  &
+     &           - tmp * tmat( 2, 3 )
+            tmat( 4, 4 ) =  tmat( 4, 4 )  &
+     &           - tmp * tmat( 2, 4 )
+            tmat( 4, 5 ) =  tmat( 4, 5 )  &
+     &           - tmp * tmat( 2, 5 )
+            tv( 4, i ) = tv( 4, i )  &
+     &        - tv( 2, i ) * tmp
+
+            tmp = tmp1 * tmat( 5, 2 )
+            tmat( 5, 3 ) =  tmat( 5, 3 )  &
+     &           - tmp * tmat( 2, 3 )
+            tmat( 5, 4 ) =  tmat( 5, 4 )  &
+     &           - tmp * tmat( 2, 4 )
+            tmat( 5, 5 ) =  tmat( 5, 5 )  &
+     &           - tmp * tmat( 2, 5 )
+            tv( 5, i ) = tv( 5, i )  &
+     &        - tv( 2, i ) * tmp
+
+
+
+            tmp1 = 1.0d+00 / tmat( 3, 3 )
+            tmp = tmp1 * tmat( 4, 3 )
+            tmat( 4, 4 ) =  tmat( 4, 4 )  &
+     &           - tmp * tmat( 3, 4 )
+            tmat( 4, 5 ) =  tmat( 4, 5 )  &
+     &           - tmp * tmat( 3, 5 )
+            tv( 4, i ) = tv( 4, i )  &
+     &        - tv( 3, i ) * tmp
+
+            tmp = tmp1 * tmat( 5, 3 )
+            tmat( 5, 4 ) =  tmat( 5, 4 )  &
+     &           - tmp * tmat( 3, 4 )
+            tmat( 5, 5 ) =  tmat( 5, 5 )  &
+     &           - tmp * tmat( 3, 5 )
+            tv( 5, i ) = tv( 5, i )  &
+     &        - tv( 3, i ) * tmp
+
+
+
+            tmp1 = 1.0d+00 / tmat( 4, 4 )
+            tmp = tmp1 * tmat( 5, 4 )
+            tmat( 5, 5 ) =  tmat( 5, 5 )  &
+     &           - tmp * tmat( 4, 5 )
+            tv( 5, i ) = tv( 5, i )  &
+     &        - tv( 4, i ) * tmp
+
+!---------------------------------------------------------------------
+!   back substitution
+!---------------------------------------------------------------------
+            tv( 5, i ) = tv( 5, i )  &
+     &                      / tmat( 5, 5 )
+
+            tv( 4, i ) = tv( 4, i )  &
+     &           - tmat( 4, 5 ) * tv( 5, i )
+            tv( 4, i ) = tv( 4, i )  &
+     &                      / tmat( 4, 4 )
+
+            tv( 3, i ) = tv( 3, i )  &
+     &           - tmat( 3, 4 ) * tv( 4, i )  &
+     &           - tmat( 3, 5 ) * tv( 5, i )
+            tv( 3, i ) = tv( 3, i )  &
+     &                      / tmat( 3, 3 )
+
+            tv( 2, i ) = tv( 2, i )  &
+     &           - tmat( 2, 3 ) * tv( 3, i )  &
+     &           - tmat( 2, 4 ) * tv( 4, i )  &
+     &           - tmat( 2, 5 ) * tv( 5, i )
+            tv( 2, i ) = tv( 2, i )  &
+     &                      / tmat( 2, 2 )
+
+            tv( 1, i ) = tv( 1, i )  &
+     &           - tmat( 1, 2 ) * tv( 2, i )  &
+     &           - tmat( 1, 3 ) * tv( 3, i )  &
+     &           - tmat( 1, 4 ) * tv( 4, i )  &
+     &           - tmat( 1, 5 ) * tv( 5, i )
+            tv( 1, i ) = tv( 1, i )  &
+     &                      / tmat( 1, 1 )
+
+            v( 1, i, j, k ) = v( 1, i, j, k ) - tv( 1, i )
+            v( 2, i, j, k ) = v( 2, i, j, k ) - tv( 2, i )
+            v( 3, i, j, k ) = v( 3, i, j, k ) - tv( 3, i )
+            v( 4, i, j, k ) = v( 4, i, j, k ) - tv( 4, i )
+            v( 5, i, j, k ) = v( 5, i, j, k ) - tv( 5, i )
+
+
+         enddo
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      return
+      end
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/buts_vec.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/buts_vec.f90
new file mode 100644
index 000000000..7544ce646
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/buts_vec.f90
@@ -0,0 +1,340 @@
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine buts( ldmx, ldmy, ldmz,  &
+     &                 nx, ny, nz, k,  &
+     &                 omega,  &
+     &                 v, tv,  &
+     &                 d, udx, udy, udz,  &
+     &                 ist, iend, jst, jend,  &
+     &                 nx0, ny0, ipt, jpt )
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!
+!   compute the regular-sparse, block upper triangular solution:
+!
+!                     v <-- ( U-inv ) * v
+!
+!---------------------------------------------------------------------
+
+      use timing
+      implicit none
+
+!---------------------------------------------------------------------
+!  input parameters
+!---------------------------------------------------------------------
+      integer ldmx, ldmy, ldmz
+      integer nx, ny, nz
+      integer k
+      double precision  omega
+      double precision  v( 5, -1:ldmx+2, -1:ldmy+2, *),  &
+     &        tv(5, ldmx, ldmy),  &
+     &        d( 5, 5, ldmx, ldmy),  &
+     &        udx( 5, 5, ldmx, ldmy),  &
+     &        udy( 5, 5, ldmx, ldmy),  &
+     &        udz( 5, 5, ldmx, ldmy )
+      integer ist, iend
+      integer jst, jend
+      integer nx0, ny0
+      integer ipt, jpt
+
+
+!---------------------------------------------------------------------
+!  local variables
+!---------------------------------------------------------------------
+      integer i, j, m, l, istp, iendp
+      integer iex
+      double precision  tmp, tmp1
+      double precision  tmat(5,5)
+
+
+!---------------------------------------------------------------------
+!   receive data from south and east
+!---------------------------------------------------------------------
+      if (timeron) call timer_start(t_ucomm)
+      iex = 1
+      call exchange_1( v,k,iex )
+      if (timeron) call timer_stop(t_ucomm)
+
+      if (timeron) call timer_start(t_buts)
+      do j = jend, jst, -1
+         do i = iend, ist, -1
+            do m = 1, 5
+                  tv( m, i, j ) =  &
+     &      omega * (  udz( m, 1, i, j ) * v( 1, i, j, k+1 )  &
+     &               + udz( m, 2, i, j ) * v( 2, i, j, k+1 )  &
+     &               + udz( m, 3, i, j ) * v( 3, i, j, k+1 )  &
+     &               + udz( m, 4, i, j ) * v( 4, i, j, k+1 )  &
+     &               + udz( m, 5, i, j ) * v( 5, i, j, k+1 ) )
+            end do
+         end do
+      end do
+
+
+      do l = iend+jend, ist+jst, -1
+         istp  = max(l - jend, ist)
+         iendp = min(l - jst, iend)
+
+!dir$ ivdep
+         do i = istp, iendp
+            j = l - i
+
+!!dir$ unroll 5
+!   manually unroll the loop
+!            do m = 1, 5
+                  tv( 1, i, j ) = tv( 1, i, j )  &
+     & + omega * ( udy( 1, 1, i, j ) * v( 1, i, j+1, k )  &
+     &           + udx( 1, 1, i, j ) * v( 1, i+1, j, k )  &
+     &           + udy( 1, 2, i, j ) * v( 2, i, j+1, k )  &
+     &           + udx( 1, 2, i, j ) * v( 2, i+1, j, k )  &
+     &           + udy( 1, 3, i, j ) * v( 3, i, j+1, k )  &
+     &           + udx( 1, 3, i, j ) * v( 3, i+1, j, k )  &
+     &           + udy( 1, 4, i, j ) * v( 4, i, j+1, k )  &
+     &           + udx( 1, 4, i, j ) * v( 4, i+1, j, k )  &
+     &           + udy( 1, 5, i, j ) * v( 5, i, j+1, k )  &
+     &           + udx( 1, 5, i, j ) * v( 5, i+1, j, k ) )
+                  tv( 2, i, j ) = tv( 2, i, j )  &
+     & + omega * ( udy( 2, 1, i, j ) * v( 1, i, j+1, k )  &
+     &           + udx( 2, 1, i, j ) * v( 1, i+1, j, k )  &
+     &           + udy( 2, 2, i, j ) * v( 2, i, j+1, k )  &
+     &           + udx( 2, 2, i, j ) * v( 2, i+1, j, k )  &
+     &           + udy( 2, 3, i, j ) * v( 3, i, j+1, k )  &
+     &           + udx( 2, 3, i, j ) * v( 3, i+1, j, k )  &
+     &           + udy( 2, 4, i, j ) * v( 4, i, j+1, k )  &
+     &           + udx( 2, 4, i, j ) * v( 4, i+1, j, k )  &
+     &           + udy( 2, 5, i, j ) * v( 5, i, j+1, k )  &
+     &           + udx( 2, 5, i, j ) * v( 5, i+1, j, k ) )
+                  tv( 3, i, j ) = tv( 3, i, j )  &
+     & + omega * ( udy( 3, 1, i, j ) * v( 1, i, j+1, k )  &
+     &           + udx( 3, 1, i, j ) * v( 1, i+1, j, k )  &
+     &           + udy( 3, 2, i, j ) * v( 2, i, j+1, k )  &
+     &           + udx( 3, 2, i, j ) * v( 2, i+1, j, k )  &
+     &           + udy( 3, 3, i, j ) * v( 3, i, j+1, k )  &
+     &           + udx( 3, 3, i, j ) * v( 3, i+1, j, k )  &
+     &           + udy( 3, 4, i, j ) * v( 4, i, j+1, k )  &
+     &           + udx( 3, 4, i, j ) * v( 4, i+1, j, k )  &
+     &           + udy( 3, 5, i, j ) * v( 5, i, j+1, k )  &
+     &           + udx( 3, 5, i, j ) * v( 5, i+1, j, k ) )
+                  tv( 4, i, j ) = tv( 4, i, j )  &
+     & + omega * ( udy( 4, 1, i, j ) * v( 1, i, j+1, k )  &
+     &           + udx( 4, 1, i, j ) * v( 1, i+1, j, k )  &
+     &           + udy( 4, 2, i, j ) * v( 2, i, j+1, k )  &
+     &           + udx( 4, 2, i, j ) * v( 2, i+1, j, k )  &
+     &           + udy( 4, 3, i, j ) * v( 3, i, j+1, k )  &
+     &           + udx( 4, 3, i, j ) * v( 3, i+1, j, k )  &
+     &           + udy( 4, 4, i, j ) * v( 4, i, j+1, k )  &
+     &           + udx( 4, 4, i, j ) * v( 4, i+1, j, k )  &
+     &           + udy( 4, 5, i, j ) * v( 5, i, j+1, k )  &
+     &           + udx( 4, 5, i, j ) * v( 5, i+1, j, k ) )
+                  tv( 5, i, j ) = tv( 5, i, j )  &
+     & + omega * ( udy( 5, 1, i, j ) * v( 1, i, j+1, k )  &
+     &           + udx( 5, 1, i, j ) * v( 1, i+1, j, k )  &
+     &           + udy( 5, 2, i, j ) * v( 2, i, j+1, k )  &
+     &           + udx( 5, 2, i, j ) * v( 2, i+1, j, k )  &
+     &           + udy( 5, 3, i, j ) * v( 3, i, j+1, k )  &
+     &           + udx( 5, 3, i, j ) * v( 3, i+1, j, k )  &
+     &           + udy( 5, 4, i, j ) * v( 4, i, j+1, k )  &
+     &           + udx( 5, 4, i, j ) * v( 4, i+1, j, k )  &
+     &           + udy( 5, 5, i, j ) * v( 5, i, j+1, k )  &
+     &           + udx( 5, 5, i, j ) * v( 5, i+1, j, k ) )
+!            end do
+
+!---------------------------------------------------------------------
+!   diagonal block inversion
+!---------------------------------------------------------------------
+!!dir$ unroll 5
+!   manually unroll the loop
+!            do m = 1, 5
+               tmat( 1, 1 ) = d( 1, 1, i, j )
+               tmat( 1, 2 ) = d( 1, 2, i, j )
+               tmat( 1, 3 ) = d( 1, 3, i, j )
+               tmat( 1, 4 ) = d( 1, 4, i, j )
+               tmat( 1, 5 ) = d( 1, 5, i, j )
+               tmat( 2, 1 ) = d( 2, 1, i, j )
+               tmat( 2, 2 ) = d( 2, 2, i, j )
+               tmat( 2, 3 ) = d( 2, 3, i, j )
+               tmat( 2, 4 ) = d( 2, 4, i, j )
+               tmat( 2, 5 ) = d( 2, 5, i, j )
+               tmat( 3, 1 ) = d( 3, 1, i, j )
+               tmat( 3, 2 ) = d( 3, 2, i, j )
+               tmat( 3, 3 ) = d( 3, 3, i, j )
+               tmat( 3, 4 ) = d( 3, 4, i, j )
+               tmat( 3, 5 ) = d( 3, 5, i, j )
+               tmat( 4, 1 ) = d( 4, 1, i, j )
+               tmat( 4, 2 ) = d( 4, 2, i, j )
+               tmat( 4, 3 ) = d( 4, 3, i, j )
+               tmat( 4, 4 ) = d( 4, 4, i, j )
+               tmat( 4, 5 ) = d( 4, 5, i, j )
+               tmat( 5, 1 ) = d( 5, 1, i, j )
+               tmat( 5, 2 ) = d( 5, 2, i, j )
+               tmat( 5, 3 ) = d( 5, 3, i, j )
+               tmat( 5, 4 ) = d( 5, 4, i, j )
+               tmat( 5, 5 ) = d( 5, 5, i, j )
+!            end do
+
+            tmp1 = 1.0d+00 / tmat( 1, 1 )
+            tmp = tmp1 * tmat( 2, 1 )
+            tmat( 2, 2 ) =  tmat( 2, 2 )  &
+     &           - tmp * tmat( 1, 2 )
+            tmat( 2, 3 ) =  tmat( 2, 3 )  &
+     &           - tmp * tmat( 1, 3 )
+            tmat( 2, 4 ) =  tmat( 2, 4 )  &
+     &           - tmp * tmat( 1, 4 )
+            tmat( 2, 5 ) =  tmat( 2, 5 )  &
+     &           - tmp * tmat( 1, 5 )
+            tv( 2, i, j ) = tv( 2, i, j )  &
+     &        - tv( 1, i, j ) * tmp
+
+            tmp = tmp1 * tmat( 3, 1 )
+            tmat( 3, 2 ) =  tmat( 3, 2 )  &
+     &           - tmp * tmat( 1, 2 )
+            tmat( 3, 3 ) =  tmat( 3, 3 )  &
+     &           - tmp * tmat( 1, 3 )
+            tmat( 3, 4 ) =  tmat( 3, 4 )  &
+     &           - tmp * tmat( 1, 4 )
+            tmat( 3, 5 ) =  tmat( 3, 5 )  &
+     &           - tmp * tmat( 1, 5 )
+            tv( 3, i, j ) = tv( 3, i, j )  &
+     &        - tv( 1, i, j ) * tmp
+
+            tmp = tmp1 * tmat( 4, 1 )
+            tmat( 4, 2 ) =  tmat( 4, 2 )  &
+     &           - tmp * tmat( 1, 2 )
+            tmat( 4, 3 ) =  tmat( 4, 3 )  &
+     &           - tmp * tmat( 1, 3 )
+            tmat( 4, 4 ) =  tmat( 4, 4 )  &
+     &           - tmp * tmat( 1, 4 )
+            tmat( 4, 5 ) =  tmat( 4, 5 )  &
+     &           - tmp * tmat( 1, 5 )
+            tv( 4, i, j ) = tv( 4, i, j )  &
+     &        - tv( 1, i, j ) * tmp
+
+            tmp = tmp1 * tmat( 5, 1 )
+            tmat( 5, 2 ) =  tmat( 5, 2 )  &
+     &           - tmp * tmat( 1, 2 )
+            tmat( 5, 3 ) =  tmat( 5, 3 )  &
+     &           - tmp * tmat( 1, 3 )
+            tmat( 5, 4 ) =  tmat( 5, 4 )  &
+     &           - tmp * tmat( 1, 4 )
+            tmat( 5, 5 ) =  tmat( 5, 5 )  &
+     &           - tmp * tmat( 1, 5 )
+            tv( 5, i, j ) = tv( 5, i, j )  &
+     &        - tv( 1, i, j ) * tmp
+
+
+
+            tmp1 = 1.0d+00 / tmat( 2, 2 )
+            tmp = tmp1 * tmat( 3, 2 )
+            tmat( 3, 3 ) =  tmat( 3, 3 )  &
+     &           - tmp * tmat( 2, 3 )
+            tmat( 3, 4 ) =  tmat( 3, 4 )  &
+     &           - tmp * tmat( 2, 4 )
+            tmat( 3, 5 ) =  tmat( 3, 5 )  &
+     &           - tmp * tmat( 2, 5 )
+            tv( 3, i, j ) = tv( 3, i, j )  &
+     &        - tv( 2, i, j ) * tmp
+
+            tmp = tmp1 * tmat( 4, 2 )
+            tmat( 4, 3 ) =  tmat( 4, 3 )  &
+     &           - tmp * tmat( 2, 3 )
+            tmat( 4, 4 ) =  tmat( 4, 4 )  &
+     &           - tmp * tmat( 2, 4 )
+            tmat( 4, 5 ) =  tmat( 4, 5 )  &
+     &           - tmp * tmat( 2, 5 )
+            tv( 4, i, j ) = tv( 4, i, j )  &
+     &        - tv( 2, i, j ) * tmp
+
+            tmp = tmp1 * tmat( 5, 2 )
+            tmat( 5, 3 ) =  tmat( 5, 3 )  &
+     &           - tmp * tmat( 2, 3 )
+            tmat( 5, 4 ) =  tmat( 5, 4 )  &
+     &           - tmp * tmat( 2, 4 )
+            tmat( 5, 5 ) =  tmat( 5, 5 )  &
+     &           - tmp * tmat( 2, 5 )
+            tv( 5, i, j ) = tv( 5, i, j )  &
+     &        - tv( 2, i, j ) * tmp
+
+
+
+            tmp1 = 1.0d+00 / tmat( 3, 3 )
+            tmp = tmp1 * tmat( 4, 3 )
+            tmat( 4, 4 ) =  tmat( 4, 4 )  &
+     &           - tmp * tmat( 3, 4 )
+            tmat( 4, 5 ) =  tmat( 4, 5 )  &
+     &           - tmp * tmat( 3, 5 )
+            tv( 4, i, j ) = tv( 4, i, j )  &
+     &        - tv( 3, i, j ) * tmp
+
+            tmp = tmp1 * tmat( 5, 3 )
+            tmat( 5, 4 ) =  tmat( 5, 4 )  &
+     &           - tmp * tmat( 3, 4 )
+            tmat( 5, 5 ) =  tmat( 5, 5 )  &
+     &           - tmp * tmat( 3, 5 )
+            tv( 5, i, j ) = tv( 5, i, j )  &
+     &        - tv( 3, i, j ) * tmp
+
+
+
+            tmp1 = 1.0d+00 / tmat( 4, 4 )
+            tmp = tmp1 * tmat( 5, 4 )
+            tmat( 5, 5 ) =  tmat( 5, 5 )  &
+     &           - tmp * tmat( 4, 5 )
+            tv( 5, i, j ) = tv( 5, i, j )  &
+     &        - tv( 4, i, j ) * tmp
+
+!---------------------------------------------------------------------
+!   back substitution
+!---------------------------------------------------------------------
+            tv( 5, i, j ) = tv( 5, i, j )  &
+     &                      / tmat( 5, 5 )
+
+            tv( 4, i, j ) = tv( 4, i, j )  &
+     &           - tmat( 4, 5 ) * tv( 5, i, j )
+            tv( 4, i, j ) = tv( 4, i, j )  &
+     &                      / tmat( 4, 4 )
+
+            tv( 3, i, j ) = tv( 3, i, j )  &
+     &           - tmat( 3, 4 ) * tv( 4, i, j )  &
+     &           - tmat( 3, 5 ) * tv( 5, i, j )
+            tv( 3, i, j ) = tv( 3, i, j )  &
+     &                      / tmat( 3, 3 )
+
+            tv( 2, i, j ) = tv( 2, i, j )  &
+     &           - tmat( 2, 3 ) * tv( 3, i, j )  &
+     &           - tmat( 2, 4 ) * tv( 4, i, j )  &
+     &           - tmat( 2, 5 ) * tv( 5, i, j )
+            tv( 2, i, j ) = tv( 2, i, j )  &
+     &                      / tmat( 2, 2 )
+
+            tv( 1, i, j ) = tv( 1, i, j )  &
+     &           - tmat( 1, 2 ) * tv( 2, i, j )  &
+     &           - tmat( 1, 3 ) * tv( 3, i, j )  &
+     &           - tmat( 1, 4 ) * tv( 4, i, j )  &
+     &           - tmat( 1, 5 ) * tv( 5, i, j )
+            tv( 1, i, j ) = tv( 1, i, j )  &
+     &                      / tmat( 1, 1 )
+
+            v( 1, i, j, k ) = v( 1, i, j, k ) - tv( 1, i, j )
+            v( 2, i, j, k ) = v( 2, i, j, k ) - tv( 2, i, j )
+            v( 3, i, j, k ) = v( 3, i, j, k ) - tv( 3, i, j )
+            v( 4, i, j, k ) = v( 4, i, j, k ) - tv( 4, i, j )
+            v( 5, i, j, k ) = v( 5, i, j, k ) - tv( 5, i, j )
+
+
+        enddo
+      end do
+      if (timeron) call timer_stop(t_buts)
+
+!---------------------------------------------------------------------
+!   send data to north and west
+!---------------------------------------------------------------------
+      if (timeron) call timer_start(t_ucomm)
+      iex = 3
+      call exchange_1( v,k,iex )
+      if (timeron) call timer_stop(t_ucomm)
+
+      return
+      end
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/erhs.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/erhs.f90
new file mode 100644
index 000000000..932aed340
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/erhs.f90
@@ -0,0 +1,535 @@
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine erhs
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!
+!   compute the right hand side based on exact solution
+!
+!---------------------------------------------------------------------
+
+      use lu_data
+      implicit none
+
+!---------------------------------------------------------------------
+!  local variables
+!---------------------------------------------------------------------
+      integer i, j, k, m
+      integer iglob, jglob
+      integer iex
+      integer L1, L2
+      integer ist1, iend1
+      integer jst1, jend1
+      double precision  dsspm
+      double precision  xi, eta, zeta
+      double precision  q
+      double precision  u21, u31, u41
+      double precision  tmp
+      double precision  u21i, u31i, u41i, u51i
+      double precision  u21j, u31j, u41j, u51j
+      double precision  u21k, u31k, u41k, u51k
+      double precision  u21im1, u31im1, u41im1, u51im1
+      double precision  u21jm1, u31jm1, u41jm1, u51jm1
+      double precision  u21km1, u31km1, u41km1, u51km1
+
+      dsspm = dssp
+
+
+      do k = 1, nz
+         do j = 1, ny
+            do i = 1, nx
+               do m = 1, 5
+                  frct( m, i, j, k ) = 0.0d+00
+               end do
+            end do
+         end do
+      end do
+
+      do k = 1, nz
+         zeta = ( dble(k-1) ) / ( nz - 1 )
+         do j = 1, ny
+            jglob = jpt + j
+            eta = ( dble(jglob-1) ) / ( ny0 - 1 )
+            do i = 1, nx
+               iglob = ipt + i
+               xi = ( dble(iglob-1) ) / ( nx0 - 1 )
+               do m = 1, 5
+                  rsd(m,i,j,k) =  ce(m,1)  &
+     &                 + ce(m,2) * xi  &
+     &                 + ce(m,3) * eta  &
+     &                 + ce(m,4) * zeta  &
+     &                 + ce(m,5) * xi * xi  &
+     &                 + ce(m,6) * eta * eta  &
+     &                 + ce(m,7) * zeta * zeta  &
+     &                 + ce(m,8) * xi * xi * xi  &
+     &                 + ce(m,9) * eta * eta * eta  &
+     &                 + ce(m,10) * zeta * zeta * zeta  &
+     &                 + ce(m,11) * xi * xi * xi * xi  &
+     &                 + ce(m,12) * eta * eta * eta * eta  &
+     &                 + ce(m,13) * zeta * zeta * zeta * zeta
+               end do
+            end do
+         end do
+      end do
+
+!---------------------------------------------------------------------
+!   xi-direction flux differences
+!---------------------------------------------------------------------
+!
+!   iex = flag : iex = 0  north/south communication
+!              : iex = 1  east/west communication
+!
+!---------------------------------------------------------------------
+      iex   = 0
+
+!---------------------------------------------------------------------
+!   communicate and receive/send two rows of data
+!---------------------------------------------------------------------
+      call exchange_3 (rsd,iex)
+
+      L1 = 0
+      if (north.eq.-1) L1 = 1
+      L2 = nx + 1
+      if (south.eq.-1) L2 = nx
+
+      ist1 = 1
+      iend1 = nx
+      if (north.eq.-1) ist1 = 4
+      if (south.eq.-1) iend1 = nx - 3
+
+      do k = 2, nz - 1
+         do j = jst, jend
+            do i = L1, L2
+               flux(1,i,j,k) = rsd(2,i,j,k)
+               u21 = rsd(2,i,j,k) / rsd(1,i,j,k)
+               q = 0.50d+00 * (  rsd(2,i,j,k) * rsd(2,i,j,k)  &
+     &                         + rsd(3,i,j,k) * rsd(3,i,j,k)  &
+     &                         + rsd(4,i,j,k) * rsd(4,i,j,k) )  &
+     &                      / rsd(1,i,j,k)
+               flux(2,i,j,k) = rsd(2,i,j,k) * u21 + c2 *  &
+     &                         ( rsd(5,i,j,k) - q )
+               flux(3,i,j,k) = rsd(3,i,j,k) * u21
+               flux(4,i,j,k) = rsd(4,i,j,k) * u21
+               flux(5,i,j,k) = ( c1 * rsd(5,i,j,k) - c2 * q ) * u21
+            end do
+         end do
+      end do 
+
+      do k = 2, nz - 1
+         do j = jst, jend
+            do i = ist, iend
+               do m = 1, 5
+                  frct(m,i,j,k) =  frct(m,i,j,k)  &
+     &                   - tx2 * ( flux(m,i+1,j,k) - flux(m,i-1,j,k) )
+               end do
+            end do
+            do i = ist, L2
+               tmp = 1.0d+00 / rsd(1,i,j,k)
+
+               u21i = tmp * rsd(2,i,j,k)
+               u31i = tmp * rsd(3,i,j,k)
+               u41i = tmp * rsd(4,i,j,k)
+               u51i = tmp * rsd(5,i,j,k)
+
+               tmp = 1.0d+00 / rsd(1,i-1,j,k)
+
+               u21im1 = tmp * rsd(2,i-1,j,k)
+               u31im1 = tmp * rsd(3,i-1,j,k)
+               u41im1 = tmp * rsd(4,i-1,j,k)
+               u51im1 = tmp * rsd(5,i-1,j,k)
+
+               flux(2,i,j,k) = (4.0d+00/3.0d+00) * tx3 *  &
+     &                        ( u21i - u21im1 )
+               flux(3,i,j,k) = tx3 * ( u31i - u31im1 )
+               flux(4,i,j,k) = tx3 * ( u41i - u41im1 )
+               flux(5,i,j,k) = 0.50d+00 * ( 1.0d+00 - c1*c5 )  &
+     &              * tx3 * ( ( u21i  **2 + u31i  **2 + u41i  **2 )  &
+     &                      - ( u21im1**2 + u31im1**2 + u41im1**2 ) )  &
+     &              + (1.0d+00/6.0d+00)  &
+     &              * tx3 * ( u21i**2 - u21im1**2 )  &
+     &              + c1 * c5 * tx3 * ( u51i - u51im1 )
+            end do
+
+            do i = ist, iend
+               frct(1,i,j,k) = frct(1,i,j,k)  &
+     &              + dx1 * tx1 * (            rsd(1,i-1,j,k)  &
+     &                             - 2.0d+00 * rsd(1,i,j,k)  &
+     &                             +           rsd(1,i+1,j,k) )
+               frct(2,i,j,k) = frct(2,i,j,k)  &
+     &           + tx3 * c3 * c4 * ( flux(2,i+1,j,k) - flux(2,i,j,k) )  &
+     &              + dx2 * tx1 * (            rsd(2,i-1,j,k)  &
+     &                             - 2.0d+00 * rsd(2,i,j,k)  &
+     &                             +           rsd(2,i+1,j,k) )
+               frct(3,i,j,k) = frct(3,i,j,k)  &
+     &           + tx3 * c3 * c4 * ( flux(3,i+1,j,k) - flux(3,i,j,k) )  &
+     &              + dx3 * tx1 * (            rsd(3,i-1,j,k)  &
+     &                             - 2.0d+00 * rsd(3,i,j,k)  &
+     &                             +           rsd(3,i+1,j,k) )
+               frct(4,i,j,k) = frct(4,i,j,k)  &
+     &            + tx3 * c3 * c4 * ( flux(4,i+1,j,k) - flux(4,i,j,k) )  &
+     &              + dx4 * tx1 * (            rsd(4,i-1,j,k)  &
+     &                             - 2.0d+00 * rsd(4,i,j,k)  &
+     &                             +           rsd(4,i+1,j,k) )
+               frct(5,i,j,k) = frct(5,i,j,k)  &
+     &           + tx3 * c3 * c4 * ( flux(5,i+1,j,k) - flux(5,i,j,k) )  &
+     &              + dx5 * tx1 * (            rsd(5,i-1,j,k)  &
+     &                             - 2.0d+00 * rsd(5,i,j,k)  &
+     &                             +           rsd(5,i+1,j,k) )
+            end do
+
+!---------------------------------------------------------------------
+!   fourth-order dissipation
+!---------------------------------------------------------------------
+            IF (north.eq.-1) then
+             do m = 1, 5
+               frct(m,2,j,k) = frct(m,2,j,k)  &
+     &           - dsspm * ( + 5.0d+00 * rsd(m,2,j,k)  &
+     &                       - 4.0d+00 * rsd(m,3,j,k)  &
+     &                       +           rsd(m,4,j,k) )
+               frct(m,3,j,k) = frct(m,3,j,k)  &
+     &           - dsspm * ( - 4.0d+00 * rsd(m,2,j,k)  &
+     &                       + 6.0d+00 * rsd(m,3,j,k)  &
+     &                       - 4.0d+00 * rsd(m,4,j,k)  &
+     &                       +           rsd(m,5,j,k) )
+             end do
+            END IF
+
+            do i = ist1,iend1
+               do m = 1, 5
+                  frct(m,i,j,k) = frct(m,i,j,k)  &
+     &              - dsspm * (            rsd(m,i-2,j,k)  &
+     &                         - 4.0d+00 * rsd(m,i-1,j,k)  &
+     &                         + 6.0d+00 * rsd(m,i,j,k)  &
+     &                         - 4.0d+00 * rsd(m,i+1,j,k)  &
+     &                         +           rsd(m,i+2,j,k) )
+               end do
+            end do
+
+            IF (south.eq.-1) then
+             do m = 1, 5
+               frct(m,nx-2,j,k) = frct(m,nx-2,j,k)  &
+     &           - dsspm * (             rsd(m,nx-4,j,k)  &
+     &                       - 4.0d+00 * rsd(m,nx-3,j,k)  &
+     &                       + 6.0d+00 * rsd(m,nx-2,j,k)  &
+     &                       - 4.0d+00 * rsd(m,nx-1,j,k)  )
+               frct(m,nx-1,j,k) = frct(m,nx-1,j,k)  &
+     &           - dsspm * (             rsd(m,nx-3,j,k)  &
+     &                       - 4.0d+00 * rsd(m,nx-2,j,k)  &
+     &                       + 5.0d+00 * rsd(m,nx-1,j,k) )
+             end do
+            END IF
+
+         end do
+      end do
+
+!---------------------------------------------------------------------
+!   eta-direction flux differences
+!---------------------------------------------------------------------
+!
+!   iex = flag : iex = 0  north/south communication
+!              : iex = 1  east/west communication
+!
+!---------------------------------------------------------------------
+      iex   = 1
+
+!---------------------------------------------------------------------
+!   communicate and receive/send two rows of data
+!---------------------------------------------------------------------
+      call exchange_3 (rsd,iex)
+
+      L1 = 0
+      if (west.eq.-1) L1 = 1
+      L2 = ny + 1
+      if (east.eq.-1) L2 = ny
+
+      jst1 = 1
+      jend1 = ny
+      if (west.eq.-1) jst1 = 4
+      if (east.eq.-1) jend1 = ny - 3
+
+      do k = 2, nz - 1
+         do j = L1, L2
+            do i = ist, iend
+               flux(1,i,j,k) = rsd(3,i,j,k)
+               u31 = rsd(3,i,j,k) / rsd(1,i,j,k)
+               q = 0.50d+00 * (  rsd(2,i,j,k) * rsd(2,i,j,k)  &
+     &                         + rsd(3,i,j,k) * rsd(3,i,j,k)  &
+     &                         + rsd(4,i,j,k) * rsd(4,i,j,k) )  &
+     &                      / rsd(1,i,j,k)
+               flux(2,i,j,k) = rsd(2,i,j,k) * u31 
+               flux(3,i,j,k) = rsd(3,i,j,k) * u31 + c2 *  &
+     &                       ( rsd(5,i,j,k) - q )
+               flux(4,i,j,k) = rsd(4,i,j,k) * u31
+               flux(5,i,j,k) = ( c1 * rsd(5,i,j,k) - c2 * q ) * u31
+            end do
+         end do
+      end do
+
+      do k = 2, nz - 1
+         do i = ist, iend
+            do j = jst, jend
+               do m = 1, 5
+                  frct(m,i,j,k) =  frct(m,i,j,k)  &
+     &                 - ty2 * ( flux(m,i,j+1,k) - flux(m,i,j-1,k) )
+               end do
+            end do
+         end do
+
+         do j = jst, L2
+            do i = ist, iend
+               tmp = 1.0d+00 / rsd(1,i,j,k)
+
+               u21j = tmp * rsd(2,i,j,k)
+               u31j = tmp * rsd(3,i,j,k)
+               u41j = tmp * rsd(4,i,j,k)
+               u51j = tmp * rsd(5,i,j,k)
+
+               tmp = 1.0d+00 / rsd(1,i,j-1,k)
+
+               u21jm1 = tmp * rsd(2,i,j-1,k)
+               u31jm1 = tmp * rsd(3,i,j-1,k)
+               u41jm1 = tmp * rsd(4,i,j-1,k)
+               u51jm1 = tmp * rsd(5,i,j-1,k)
+
+               flux(2,i,j,k) = ty3 * ( u21j - u21jm1 )
+               flux(3,i,j,k) = (4.0d+00/3.0d+00) * ty3 *  &
+     &                       ( u31j - u31jm1 )
+               flux(4,i,j,k) = ty3 * ( u41j - u41jm1 )
+               flux(5,i,j,k) = 0.50d+00 * ( 1.0d+00 - c1*c5 )  &
+     &              * ty3 * ( ( u21j  **2 + u31j  **2 + u41j  **2 )  &
+     &                      - ( u21jm1**2 + u31jm1**2 + u41jm1**2 ) )  &
+     &              + (1.0d+00/6.0d+00)  &
+     &              * ty3 * ( u31j**2 - u31jm1**2 )  &
+     &              + c1 * c5 * ty3 * ( u51j - u51jm1 )
+            end do
+         end do
+
+         do j = jst, jend
+            do i = ist, iend
+               frct(1,i,j,k) = frct(1,i,j,k)  &
+     &              + dy1 * ty1 * (            rsd(1,i,j-1,k)  &
+     &                             - 2.0d+00 * rsd(1,i,j,k)  &
+     &                             +           rsd(1,i,j+1,k) )
+               frct(2,i,j,k) = frct(2,i,j,k)  &
+     &          + ty3 * c3 * c4 * ( flux(2,i,j+1,k) - flux(2,i,j,k) )  &
+     &              + dy2 * ty1 * (            rsd(2,i,j-1,k)  &
+     &                             - 2.0d+00 * rsd(2,i,j,k)  &
+     &                             +           rsd(2,i,j+1,k) )
+               frct(3,i,j,k) = frct(3,i,j,k)  &
+     &          + ty3 * c3 * c4 * ( flux(3,i,j+1,k) - flux(3,i,j,k) )  &
+     &              + dy3 * ty1 * (            rsd(3,i,j-1,k)  &
+     &                             - 2.0d+00 * rsd(3,i,j,k)  &
+     &                             +           rsd(3,i,j+1,k) )
+               frct(4,i,j,k) = frct(4,i,j,k)  &
+     &          + ty3 * c3 * c4 * ( flux(4,i,j+1,k) - flux(4,i,j,k) )  &
+     &              + dy4 * ty1 * (            rsd(4,i,j-1,k)  &
+     &                             - 2.0d+00 * rsd(4,i,j,k)  &
+     &                             +           rsd(4,i,j+1,k) )
+               frct(5,i,j,k) = frct(5,i,j,k)  &
+     &          + ty3 * c3 * c4 * ( flux(5,i,j+1,k) - flux(5,i,j,k) )  &
+     &              + dy5 * ty1 * (            rsd(5,i,j-1,k)  &
+     &                             - 2.0d+00 * rsd(5,i,j,k)  &
+     &                             +           rsd(5,i,j+1,k) )
+            end do
+         end do
+
+!---------------------------------------------------------------------
+!   fourth-order dissipation
+!---------------------------------------------------------------------
+         IF (west.eq.-1) then
+            do i = ist, iend
+             do m = 1, 5
+               frct(m,i,2,k) = frct(m,i,2,k)  &
+     &           - dsspm * ( + 5.0d+00 * rsd(m,i,2,k)  &
+     &                       - 4.0d+00 * rsd(m,i,3,k)  &
+     &                       +           rsd(m,i,4,k) )
+               frct(m,i,3,k) = frct(m,i,3,k)  &
+     &           - dsspm * ( - 4.0d+00 * rsd(m,i,2,k)  &
+     &                       + 6.0d+00 * rsd(m,i,3,k)  &
+     &                       - 4.0d+00 * rsd(m,i,4,k)  &
+     &                       +           rsd(m,i,5,k) )
+             end do
+            end do
+         END IF
+
+         do j = jst1, jend1
+            do i = ist, iend
+               do m = 1, 5
+                  frct(m,i,j,k) = frct(m,i,j,k)  &
+     &              - dsspm * (            rsd(m,i,j-2,k)  &
+     &                        - 4.0d+00 * rsd(m,i,j-1,k)  &
+     &                        + 6.0d+00 * rsd(m,i,j,k)  &
+     &                        - 4.0d+00 * rsd(m,i,j+1,k)  &
+     &                        +           rsd(m,i,j+2,k) )
+               end do
+            end do
+         end do
+
+         IF (east.eq.-1) then
+            do i = ist, iend
+             do m = 1, 5
+               frct(m,i,ny-2,k) = frct(m,i,ny-2,k)  &
+     &           - dsspm * (             rsd(m,i,ny-4,k)  &
+     &                       - 4.0d+00 * rsd(m,i,ny-3,k)  &
+     &                       + 6.0d+00 * rsd(m,i,ny-2,k)  &
+     &                       - 4.0d+00 * rsd(m,i,ny-1,k)  )
+               frct(m,i,ny-1,k) = frct(m,i,ny-1,k)  &
+     &           - dsspm * (             rsd(m,i,ny-3,k)  &
+     &                       - 4.0d+00 * rsd(m,i,ny-2,k)  &
+     &                       + 5.0d+00 * rsd(m,i,ny-1,k)  )
+             end do
+            end do
+         END IF
+
+      end do
+
+!---------------------------------------------------------------------
+!   zeta-direction flux differences
+!---------------------------------------------------------------------
+      do k = 1, nz
+         do j = jst, jend
+            do i = ist, iend
+               flux(1,i,j,k) = rsd(4,i,j,k)
+               u41 = rsd(4,i,j,k) / rsd(1,i,j,k)
+               q = 0.50d+00 * (  rsd(2,i,j,k) * rsd(2,i,j,k)  &
+     &                         + rsd(3,i,j,k) * rsd(3,i,j,k)  &
+     &                         + rsd(4,i,j,k) * rsd(4,i,j,k) )  &
+     &                      / rsd(1,i,j,k)
+               flux(2,i,j,k) = rsd(2,i,j,k) * u41 
+               flux(3,i,j,k) = rsd(3,i,j,k) * u41 
+               flux(4,i,j,k) = rsd(4,i,j,k) * u41 + c2 *  &
+     &                         ( rsd(5,i,j,k) - q )
+               flux(5,i,j,k) = ( c1 * rsd(5,i,j,k) - c2 * q ) * u41
+            end do
+         end do
+      end do
+
+      do k = 2, nz - 1
+         do j = jst, jend
+            do i = ist, iend
+               do m = 1, 5
+                  frct(m,i,j,k) =  frct(m,i,j,k)  &
+     &                  - tz2 * ( flux(m,i,j,k+1) - flux(m,i,j,k-1) )
+               end do
+            end do
+         end do
+      end do
+
+      do k = 2, nz
+         do j = jst, jend
+            do i = ist, iend
+               tmp = 1.0d+00 / rsd(1,i,j,k)
+
+               u21k = tmp * rsd(2,i,j,k)
+               u31k = tmp * rsd(3,i,j,k)
+               u41k = tmp * rsd(4,i,j,k)
+               u51k = tmp * rsd(5,i,j,k)
+
+               tmp = 1.0d+00 / rsd(1,i,j,k-1)
+
+               u21km1 = tmp * rsd(2,i,j,k-1)
+               u31km1 = tmp * rsd(3,i,j,k-1)
+               u41km1 = tmp * rsd(4,i,j,k-1)
+               u51km1 = tmp * rsd(5,i,j,k-1)
+
+               flux(2,i,j,k) = tz3 * ( u21k - u21km1 )
+               flux(3,i,j,k) = tz3 * ( u31k - u31km1 )
+               flux(4,i,j,k) = (4.0d+00/3.0d+00) * tz3 * ( u41k  &
+     &                       - u41km1 )
+               flux(5,i,j,k) = 0.50d+00 * ( 1.0d+00 - c1*c5 )  &
+     &              * tz3 * ( ( u21k  **2 + u31k  **2 + u41k  **2 )  &
+     &                      - ( u21km1**2 + u31km1**2 + u41km1**2 ) )  &
+     &              + (1.0d+00/6.0d+00)  &
+     &              * tz3 * ( u41k**2 - u41km1**2 )  &
+     &              + c1 * c5 * tz3 * ( u51k - u51km1 )
+            end do
+         end do
+      end do
+
+      do k = 2, nz - 1
+         do j = jst, jend
+            do i = ist, iend
+               frct(1,i,j,k) = frct(1,i,j,k)  &
+     &              + dz1 * tz1 * (            rsd(1,i,j,k+1)  &
+     &                             - 2.0d+00 * rsd(1,i,j,k)  &
+     &                             +           rsd(1,i,j,k-1) )
+               frct(2,i,j,k) = frct(2,i,j,k)  &
+     &          + tz3 * c3 * c4 * ( flux(2,i,j,k+1) - flux(2,i,j,k) )  &
+     &              + dz2 * tz1 * (            rsd(2,i,j,k+1)  &
+     &                             - 2.0d+00 * rsd(2,i,j,k)  &
+     &                             +           rsd(2,i,j,k-1) )
+               frct(3,i,j,k) = frct(3,i,j,k)  &
+     &          + tz3 * c3 * c4 * ( flux(3,i,j,k+1) - flux(3,i,j,k) )  &
+     &              + dz3 * tz1 * (            rsd(3,i,j,k+1)  &
+     &                             - 2.0d+00 * rsd(3,i,j,k)  &
+     &                             +           rsd(3,i,j,k-1) )
+               frct(4,i,j,k) = frct(4,i,j,k)  &
+     &          + tz3 * c3 * c4 * ( flux(4,i,j,k+1) - flux(4,i,j,k) )  &
+     &              + dz4 * tz1 * (            rsd(4,i,j,k+1)  &
+     &                             - 2.0d+00 * rsd(4,i,j,k)  &
+     &                             +           rsd(4,i,j,k-1) )
+               frct(5,i,j,k) = frct(5,i,j,k)  &
+     &          + tz3 * c3 * c4 * ( flux(5,i,j,k+1) - flux(5,i,j,k) )  &
+     &              + dz5 * tz1 * (            rsd(5,i,j,k+1)  &
+     &                             - 2.0d+00 * rsd(5,i,j,k)  &
+     &                             +           rsd(5,i,j,k-1) )
+            end do
+         end do
+      end do
+
+!---------------------------------------------------------------------
+!   fourth-order dissipation
+!---------------------------------------------------------------------
+      do j = jst, jend
+         do i = ist, iend
+            do m = 1, 5
+               frct(m,i,j,2) = frct(m,i,j,2)  &
+     &           - dsspm * ( + 5.0d+00 * rsd(m,i,j,2)  &
+     &                       - 4.0d+00 * rsd(m,i,j,3)  &
+     &                       +           rsd(m,i,j,4) )
+               frct(m,i,j,3) = frct(m,i,j,3)  &
+     &           - dsspm * (- 4.0d+00 * rsd(m,i,j,2)  &
+     &                      + 6.0d+00 * rsd(m,i,j,3)  &
+     &                      - 4.0d+00 * rsd(m,i,j,4)  &
+     &                      +           rsd(m,i,j,5) )
+            end do
+         end do
+      end do
+
+      do k = 4, nz - 3
+         do j = jst, jend
+            do i = ist, iend
+               do m = 1, 5
+                  frct(m,i,j,k) = frct(m,i,j,k)  &
+     &              - dsspm * (           rsd(m,i,j,k-2)  &
+     &                        - 4.0d+00 * rsd(m,i,j,k-1)  &
+     &                        + 6.0d+00 * rsd(m,i,j,k)  &
+     &                        - 4.0d+00 * rsd(m,i,j,k+1)  &
+     &                        +           rsd(m,i,j,k+2) )
+               end do
+            end do
+         end do
+      end do
+
+      do j = jst, jend
+         do i = ist, iend
+            do m = 1, 5
+               frct(m,i,j,nz-2) = frct(m,i,j,nz-2)  &
+     &           - dsspm * (            rsd(m,i,j,nz-4)  &
+     &                      - 4.0d+00 * rsd(m,i,j,nz-3)  &
+     &                      + 6.0d+00 * rsd(m,i,j,nz-2)  &
+     &                      - 4.0d+00 * rsd(m,i,j,nz-1)  )
+               frct(m,i,j,nz-1) = frct(m,i,j,nz-1)  &
+     &           - dsspm * (             rsd(m,i,j,nz-3)  &
+     &                       - 4.0d+00 * rsd(m,i,j,nz-2)  &
+     &                       + 5.0d+00 * rsd(m,i,j,nz-1)  )
+            end do
+         end do
+      end do
+
+      return
+      end
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/error.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/error.f90
new file mode 100644
index 000000000..61871f4bf
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/error.f90
@@ -0,0 +1,81 @@
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine error
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!
+!   compute the solution error
+!
+!---------------------------------------------------------------------
+
+      use lu_data
+      use mpinpb
+
+      implicit none
+
+!---------------------------------------------------------------------
+!  local variables
+!---------------------------------------------------------------------
+      integer i, j, k, m
+      integer iglob, jglob
+      double precision  tmp
+      double precision  u000ijk(5), dummy(5)
+
+      integer IERROR
+
+
+      do m = 1, 5
+         errnm(m) = 0.0d+00
+         dummy(m) = 0.0d+00
+      end do
+
+      do k = 2, nz-1
+         do j = jst, jend
+            jglob = jpt + j
+            do i = ist, iend
+               iglob = ipt + i
+               call exact( iglob, jglob, k, u000ijk )
+               do m = 1, 5
+                  tmp = ( u000ijk(m) - u(m,i,j,k) )
+                  dummy(m) = dummy(m) + tmp ** 2
+               end do
+            end do
+         end do
+      end do
+
+!---------------------------------------------------------------------
+!   compute the global sum of the per-process squared-error contributions.
+!---------------------------------------------------------------------
+      call MPI_ALLREDUCE( dummy,  &
+     &                    errnm,  &
+     &                    5,  &
+     &                    dp_type,  &
+     &                    MPI_SUM,  &
+     &                    comm_solve,  &
+     &                    IERROR )
+
+      do m = 1, 5
+         errnm(m) = sqrt ( errnm(m) / ( dble(nx0-2)*(ny0-2)*(nz0-2) ) )
+      end do
+
+!      if (id.eq.0) then
+!        write (*,1002) ( errnm(m), m = 1, 5 )
+!      end if
+
+ 1002 format (1x/1x,'RMS-norm of error in soln. to ',  &
+     & 'first pde  = ',1pe12.5/,  &
+     & 1x,'RMS-norm of error in soln. to ',  &
+     & 'second pde = ',1pe12.5/,  &
+     & 1x,'RMS-norm of error in soln. to ',  &
+     & 'third pde  = ',1pe12.5/,  &
+     & 1x,'RMS-norm of error in soln. to ',  &
+     & 'fourth pde = ',1pe12.5/,  &
+     & 1x,'RMS-norm of error in soln. to ',  &
+     & 'fifth pde  = ',1pe12.5)
+
+      return
+      end
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/exact.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/exact.f90
new file mode 100644
index 000000000..e38642005
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/exact.f90
@@ -0,0 +1,52 @@
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine exact( i, j, k, u000ijk )
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!
+!   compute the exact solution at (i,j,k)
+!
+!---------------------------------------------------------------------
+
+      use lu_data
+      implicit none
+
+!---------------------------------------------------------------------
+!  input parameters
+!---------------------------------------------------------------------
+      integer i, j, k
+      double precision u000ijk(*)
+
+!---------------------------------------------------------------------
+!  local variables
+!---------------------------------------------------------------------
+      integer m
+      double precision xi, eta, zeta
+
+      xi  = ( dble ( i - 1 ) ) / ( nx0 - 1 )
+      eta  = ( dble ( j - 1 ) ) / ( ny0 - 1 )
+      zeta = ( dble ( k - 1 ) ) / ( nz - 1 )
+
+
+      do m = 1, 5
+         u000ijk(m) =  ce(m,1)  &
+     &        + ce(m,2) * xi  &
+     &        + ce(m,3) * eta  &
+     &        + ce(m,4) * zeta  &
+     &        + ce(m,5) * xi * xi  &
+     &        + ce(m,6) * eta * eta  &
+     &        + ce(m,7) * zeta * zeta  &
+     &        + ce(m,8) * xi * xi * xi  &
+     &        + ce(m,9) * eta * eta * eta  &
+     &        + ce(m,10) * zeta * zeta * zeta  &
+     &        + ce(m,11) * xi * xi * xi * xi  &
+     &        + ce(m,12) * eta * eta * eta * eta  &
+     &        + ce(m,13) * zeta * zeta * zeta * zeta
+      end do
+
+      return
+      end
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/exchange_1.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/exchange_1.f90
new file mode 100644
index 000000000..64b8c4e8f
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/exchange_1.f90
@@ -0,0 +1,184 @@
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine exchange_1( g,k,iex )
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      use lu_data
+      use mpinpb
+
+      implicit none
+
+!---------------------------------------------------------------------
+!  input parameters
+!---------------------------------------------------------------------
+      integer k, iex
+      double precision  g(5,-1:isiz1+2,-1:isiz2+2,isiz3)
+
+!---------------------------------------------------------------------
+!  local variables
+!---------------------------------------------------------------------
+      integer i, j
+
+      integer STATUS(MPI_STATUS_SIZE)
+      integer IERROR
+
+
+
+      if( iex .eq. 0 ) then
+
+          if( north .ne. -1 ) then
+              call MPI_RECV( buf1(1,jst),  &
+     &                       5*(jend-jst+1),  &
+     &                       dp_type,  &
+     &                       north,  &
+     &                       from_n,  &
+     &                       comm_solve,  &
+     &                       status,  &
+     &                       IERROR )
+              do j=jst,jend
+                  g(1,0,j,k) = buf1(1,j)
+                  g(2,0,j,k) = buf1(2,j)
+                  g(3,0,j,k) = buf1(3,j)
+                  g(4,0,j,k) = buf1(4,j)
+                  g(5,0,j,k) = buf1(5,j)
+              enddo
+          endif
+
+          if( west .ne. -1 ) then
+              call MPI_RECV( buf1(1,ist),  &
+     &                       5*(iend-ist+1),  &
+     &                       dp_type,  &
+     &                       west,  &
+     &                       from_w,  &
+     &                       comm_solve,  &
+     &                       status,  &
+     &                       IERROR )
+              do i=ist,iend
+                  g(1,i,0,k) = buf1(1,i)
+                  g(2,i,0,k) = buf1(2,i)
+                  g(3,i,0,k) = buf1(3,i)
+                  g(4,i,0,k) = buf1(4,i)
+                  g(5,i,0,k) = buf1(5,i)
+              enddo
+          endif
+
+      else if( iex .eq. 1 ) then
+
+          if( south .ne. -1 ) then
+              call MPI_RECV( buf1(1,jst),  &
+     &                       5*(jend-jst+1),  &
+     &                       dp_type,  &
+     &                       south,  &
+     &                       from_s,  &
+     &                       comm_solve,  &
+     &                       status,  &
+     &                       IERROR )
+              do j=jst,jend
+                  g(1,nx+1,j,k) = buf1(1,j)
+                  g(2,nx+1,j,k) = buf1(2,j)
+                  g(3,nx+1,j,k) = buf1(3,j)
+                  g(4,nx+1,j,k) = buf1(4,j)
+                  g(5,nx+1,j,k) = buf1(5,j)
+              enddo
+          endif
+
+          if( east .ne. -1 ) then
+              call MPI_RECV( buf1(1,ist),  &
+     &                       5*(iend-ist+1),  &
+     &                       dp_type,  &
+     &                       east,  &
+     &                       from_e,  &
+     &                       comm_solve,  &
+     &                       status,  &
+     &                       IERROR )
+              do i=ist,iend
+                  g(1,i,ny+1,k) = buf1(1,i)
+                  g(2,i,ny+1,k) = buf1(2,i)
+                  g(3,i,ny+1,k) = buf1(3,i)
+                  g(4,i,ny+1,k) = buf1(4,i)
+                  g(5,i,ny+1,k) = buf1(5,i)
+              enddo
+          endif
+
+      else if( iex .eq. 2 ) then
+
+          if( south .ne. -1 ) then
+              do j=jst,jend
+                  buf(1,j) = g(1,nx,j,k) 
+                  buf(2,j) = g(2,nx,j,k) 
+                  buf(3,j) = g(3,nx,j,k) 
+                  buf(4,j) = g(4,nx,j,k) 
+                  buf(5,j) = g(5,nx,j,k) 
+              enddo
+              call MPI_SEND( buf(1,jst),  &
+     &                       5*(jend-jst+1),  &
+     &                       dp_type,  &
+     &                       south,  &
+     &                       from_n,  &
+     &                       comm_solve,  &
+     &                       IERROR )
+          endif
+
+          if( east .ne. -1 ) then
+              do i=ist,iend
+                  buf(1,i) = g(1,i,ny,k)
+                  buf(2,i) = g(2,i,ny,k)
+                  buf(3,i) = g(3,i,ny,k)
+                  buf(4,i) = g(4,i,ny,k)
+                  buf(5,i) = g(5,i,ny,k)
+              enddo
+              call MPI_SEND( buf(1,ist),  &
+     &                       5*(iend-ist+1),  &
+     &                       dp_type,  &
+     &                       east,  &
+     &                       from_w,  &
+     &                       comm_solve,  &
+     &                       IERROR )
+          endif
+
+      else
+
+          if( north .ne. -1 ) then
+              do j=jst,jend
+                  buf(1,j) = g(1,1,j,k)
+                  buf(2,j) = g(2,1,j,k)
+                  buf(3,j) = g(3,1,j,k)
+                  buf(4,j) = g(4,1,j,k)
+                  buf(5,j) = g(5,1,j,k)
+              enddo
+              call MPI_SEND( buf(1,jst),  &
+     &                       5*(jend-jst+1),  &
+     &                       dp_type,  &
+     &                       north,  &
+     &                       from_s,  &
+     &                       comm_solve,  &
+     &                       IERROR )
+          endif
+
+          if( west .ne. -1 ) then
+              do i=ist,iend
+                  buf(1,i) = g(1,i,1,k)
+                  buf(2,i) = g(2,i,1,k)
+                  buf(3,i) = g(3,i,1,k)
+                  buf(4,i) = g(4,i,1,k)
+                  buf(5,i) = g(5,i,1,k)
+              enddo
+              call MPI_SEND( buf(1,ist),  &
+     &                       5*(iend-ist+1),  &
+     &                       dp_type,  &
+     &                       west,  &
+     &                       from_e,  &
+     &                       comm_solve,  &
+     &                       IERROR )
+          endif
+
+      endif
+
+      end
+
+
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/exchange_3.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/exchange_3.f90
new file mode 100644
index 000000000..2674135e0
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/exchange_3.f90
@@ -0,0 +1,312 @@
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine exchange_3(g,iex)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!   exchange two layers of boundary data of g with neighboring processes
+!---------------------------------------------------------------------
+
+      use lu_data
+      use mpinpb
+
+      implicit none
+
+!---------------------------------------------------------------------
+!  input parameters
+!---------------------------------------------------------------------
+      integer iex
+      double precision  g(5,-1:isiz1+2,-1:isiz2+2,isiz3)
+
+!---------------------------------------------------------------------
+!  local variables
+!---------------------------------------------------------------------
+      integer i, j, k
+      integer ipos1, ipos2
+
+      integer mid
+      integer STATUS(MPI_STATUS_SIZE)
+      integer IERROR
+
+
+
+      if (iex.eq.0) then
+!---------------------------------------------------------------------
+!   communicate in the south and north directions
+!---------------------------------------------------------------------
+      if (north.ne.-1) then
+          call MPI_IRECV( buf1,  &
+     &                    10*ny*nz,  &
+     &                    dp_type,  &
+     &                    north,  &
+     &                    from_n,  &
+     &                    comm_solve,  &
+     &                    mid,  &
+     &                    IERROR )
+      end if
+
+!---------------------------------------------------------------------
+!   send south
+!---------------------------------------------------------------------
+      if (south.ne.-1) then
+          do k = 1,nz
+            do j = 1,ny
+              ipos1 = (k-1)*ny + j
+              ipos2 = ipos1 + ny*nz
+              buf(1,ipos1) = g(1,nx-1,j,k) 
+              buf(2,ipos1) = g(2,nx-1,j,k) 
+              buf(3,ipos1) = g(3,nx-1,j,k) 
+              buf(4,ipos1) = g(4,nx-1,j,k) 
+              buf(5,ipos1) = g(5,nx-1,j,k) 
+              buf(1,ipos2) = g(1,nx,j,k)
+              buf(2,ipos2) = g(2,nx,j,k)
+              buf(3,ipos2) = g(3,nx,j,k)
+              buf(4,ipos2) = g(4,nx,j,k)
+              buf(5,ipos2) = g(5,nx,j,k)
+            end do
+          end do
+
+          call MPI_SEND( buf,  &
+     &                   10*ny*nz,  &
+     &                   dp_type,  &
+     &                   south,  &
+     &                   from_n,  &
+     &                   comm_solve,  &
+     &                   IERROR )
+        end if
+
+!---------------------------------------------------------------------
+!   receive from north
+!---------------------------------------------------------------------
+        if (north.ne.-1) then
+          call MPI_WAIT( mid, STATUS, IERROR )
+
+          do k = 1,nz
+            do j = 1,ny
+              ipos1 = (k-1)*ny + j
+              ipos2 = ipos1 + ny*nz
+              g(1,-1,j,k) = buf1(1,ipos1)
+              g(2,-1,j,k) = buf1(2,ipos1)
+              g(3,-1,j,k) = buf1(3,ipos1)
+              g(4,-1,j,k) = buf1(4,ipos1)
+              g(5,-1,j,k) = buf1(5,ipos1)
+              g(1,0,j,k) = buf1(1,ipos2)
+              g(2,0,j,k) = buf1(2,ipos2)
+              g(3,0,j,k) = buf1(3,ipos2)
+              g(4,0,j,k) = buf1(4,ipos2)
+              g(5,0,j,k) = buf1(5,ipos2)
+            end do
+          end do
+
+        end if
+
+      if (south.ne.-1) then
+          call MPI_IRECV( buf1,  &
+     &                    10*ny*nz,  &
+     &                    dp_type,  &
+     &                    south,  &
+     &                    from_s,  &
+     &                    comm_solve,  &
+     &                    mid,  &
+     &                    IERROR )
+      end if
+
+!---------------------------------------------------------------------
+!   send north
+!---------------------------------------------------------------------
+        if (north.ne.-1) then
+          do k = 1,nz
+            do j = 1,ny
+              ipos1 = (k-1)*ny + j
+              ipos2 = ipos1 + ny*nz
+              buf(1,ipos1) = g(1,2,j,k)
+              buf(2,ipos1) = g(2,2,j,k)
+              buf(3,ipos1) = g(3,2,j,k)
+              buf(4,ipos1) = g(4,2,j,k)
+              buf(5,ipos1) = g(5,2,j,k)
+              buf(1,ipos2) = g(1,1,j,k)
+              buf(2,ipos2) = g(2,1,j,k)
+              buf(3,ipos2) = g(3,1,j,k)
+              buf(4,ipos2) = g(4,1,j,k)
+              buf(5,ipos2) = g(5,1,j,k)
+            end do
+          end do
+
+          call MPI_SEND( buf,  &
+     &                   10*ny*nz,  &
+     &                   dp_type,  &
+     &                   north,  &
+     &                   from_s,  &
+     &                   comm_solve,  &
+     &                   IERROR )
+        end if
+
+!---------------------------------------------------------------------
+!   receive from south
+!---------------------------------------------------------------------
+        if (south.ne.-1) then
+          call MPI_WAIT( mid, STATUS, IERROR )
+
+          do k = 1,nz
+            do j = 1,ny
+              ipos1 = (k-1)*ny + j
+              ipos2 = ipos1 + ny*nz
+              g(1,nx+2,j,k) = buf1(1,ipos1)
+              g(2,nx+2,j,k) = buf1(2,ipos1)
+              g(3,nx+2,j,k) = buf1(3,ipos1)
+              g(4,nx+2,j,k) = buf1(4,ipos1)
+              g(5,nx+2,j,k) = buf1(5,ipos1)
+              g(1,nx+1,j,k) = buf1(1,ipos2)
+              g(2,nx+1,j,k) = buf1(2,ipos2)
+              g(3,nx+1,j,k) = buf1(3,ipos2)
+              g(4,nx+1,j,k) = buf1(4,ipos2)
+              g(5,nx+1,j,k) = buf1(5,ipos2)
+            end do
+          end do
+        end if
+
+      else
+
+!---------------------------------------------------------------------
+!   communicate in the east and west directions
+!---------------------------------------------------------------------
+      if (west.ne.-1) then
+          call MPI_IRECV( buf1,  &
+     &                    10*nx*nz,  &
+     &                    dp_type,  &
+     &                    west,  &
+     &                    from_w,  &
+     &                    comm_solve,  &
+     &                    mid,  &
+     &                    IERROR )
+      end if
+
+!---------------------------------------------------------------------
+!   send east
+!---------------------------------------------------------------------
+        if (east.ne.-1) then
+          do k = 1,nz
+            do i = 1,nx
+              ipos1 = (k-1)*nx + i
+              ipos2 = ipos1 + nx*nz
+              buf(1,ipos1) = g(1,i,ny-1,k)
+              buf(2,ipos1) = g(2,i,ny-1,k)
+              buf(3,ipos1) = g(3,i,ny-1,k)
+              buf(4,ipos1) = g(4,i,ny-1,k)
+              buf(5,ipos1) = g(5,i,ny-1,k)
+              buf(1,ipos2) = g(1,i,ny,k)
+              buf(2,ipos2) = g(2,i,ny,k)
+              buf(3,ipos2) = g(3,i,ny,k)
+              buf(4,ipos2) = g(4,i,ny,k)
+              buf(5,ipos2) = g(5,i,ny,k)
+            end do
+          end do
+
+          call MPI_SEND( buf,  &
+     &                   10*nx*nz,  &
+     &                   dp_type,  &
+     &                   east,  &
+     &                   from_w,  &
+     &                   comm_solve,  &
+     &                   IERROR )
+        end if
+
+!---------------------------------------------------------------------
+!   receive from west
+!---------------------------------------------------------------------
+        if (west.ne.-1) then
+          call MPI_WAIT( mid, STATUS, IERROR )
+
+          do k = 1,nz
+            do i = 1,nx
+              ipos1 = (k-1)*nx + i
+              ipos2 = ipos1 + nx*nz
+              g(1,i,-1,k) = buf1(1,ipos1)
+              g(2,i,-1,k) = buf1(2,ipos1)
+              g(3,i,-1,k) = buf1(3,ipos1)
+              g(4,i,-1,k) = buf1(4,ipos1)
+              g(5,i,-1,k) = buf1(5,ipos1)
+              g(1,i,0,k) = buf1(1,ipos2)
+              g(2,i,0,k) = buf1(2,ipos2)
+              g(3,i,0,k) = buf1(3,ipos2)
+              g(4,i,0,k) = buf1(4,ipos2)
+              g(5,i,0,k) = buf1(5,ipos2)
+            end do
+          end do
+
+        end if
+
+      if (east.ne.-1) then
+          call MPI_IRECV( buf1,  &
+     &                    10*nx*nz,  &
+     &                    dp_type,  &
+     &                    east,  &
+     &                    from_e,  &
+     &                    comm_solve,  &
+     &                    mid,  &
+     &                    IERROR )
+      end if
+
+!---------------------------------------------------------------------
+!   send west
+!---------------------------------------------------------------------
+      if (west.ne.-1) then
+          do k = 1,nz
+            do i = 1,nx
+              ipos1 = (k-1)*nx + i
+              ipos2 = ipos1 + nx*nz
+              buf(1,ipos1) = g(1,i,2,k)
+              buf(2,ipos1) = g(2,i,2,k)
+              buf(3,ipos1) = g(3,i,2,k)
+              buf(4,ipos1) = g(4,i,2,k)
+              buf(5,ipos1) = g(5,i,2,k)
+              buf(1,ipos2) = g(1,i,1,k)
+              buf(2,ipos2) = g(2,i,1,k)
+              buf(3,ipos2) = g(3,i,1,k)
+              buf(4,ipos2) = g(4,i,1,k)
+              buf(5,ipos2) = g(5,i,1,k)
+            end do
+          end do
+
+          call MPI_SEND( buf,  &
+     &                   10*nx*nz,  &
+     &                   dp_type,  &
+     &                   west,  &
+     &                   from_e,  &
+     &                   comm_solve,  &
+     &                   IERROR )
+        end if
+
+!---------------------------------------------------------------------
+!   receive from east
+!---------------------------------------------------------------------
+        if (east.ne.-1) then
+          call MPI_WAIT( mid, STATUS, IERROR )
+
+          do k = 1,nz
+            do i = 1,nx
+              ipos1 = (k-1)*nx + i
+              ipos2 = ipos1 + nx*nz
+              g(1,i,ny+2,k) = buf1(1,ipos1)
+              g(2,i,ny+2,k) = buf1(2,ipos1)
+              g(3,i,ny+2,k) = buf1(3,ipos1)
+              g(4,i,ny+2,k) = buf1(4,ipos1)
+              g(5,i,ny+2,k) = buf1(5,ipos1)
+              g(1,i,ny+1,k) = buf1(1,ipos2)
+              g(2,i,ny+1,k) = buf1(2,ipos2)
+              g(3,i,ny+1,k) = buf1(3,ipos2)
+              g(4,i,ny+1,k) = buf1(4,ipos2)
+              g(5,i,ny+1,k) = buf1(5,ipos2)
+            end do
+          end do
+
+        end if
+
+      end if
+
+      return
+      end     
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/exchange_4.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/exchange_4.f90
new file mode 100644
index 000000000..48c55b34c
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/exchange_4.f90
@@ -0,0 +1,132 @@
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine exchange_4(g,h,ibeg,ifin1,jbeg,jfin1)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!   pass boundary values of g and h from east to west and south to north
+!---------------------------------------------------------------------
+
+      use lu_data
+      use mpinpb
+
+      implicit none
+
+!---------------------------------------------------------------------
+!  input parameters
+!---------------------------------------------------------------------
+      integer ibeg, ifin1, jbeg, jfin1
+      double precision  g(0:isiz2+1,0:isiz3+1),  &
+     &        h(0:isiz2+1,0:isiz3+1)
+
+!---------------------------------------------------------------------
+!  local variables
+!---------------------------------------------------------------------
+      integer i, j
+      integer ny2
+      double precision  dum(2*isiz02+4)
+
+      integer msgid1, msgid3
+      integer STATUS(MPI_STATUS_SIZE)
+      integer IERROR
+
+
+
+      ny2 = ny + 2
+
+!---------------------------------------------------------------------
+!   communicate in the east and west directions
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!   receive from east
+!---------------------------------------------------------------------
+      if (jfin1.eq.ny) then
+        call MPI_IRECV( dum,  &
+     &                  2*nx,  &
+     &                  dp_type,  &
+     &                  east,  &
+     &                  from_e,  &
+     &                  comm_solve,  &
+     &                  msgid3,  &
+     &                  IERROR )
+
+        call MPI_WAIT( msgid3, STATUS, IERROR )
+
+        do i = 1,nx
+          g(i,ny+1) = dum(i)
+          h(i,ny+1) = dum(i+nx)
+        end do
+
+      end if
+
+!---------------------------------------------------------------------
+!   send west
+!---------------------------------------------------------------------
+      if (jbeg.eq.1) then
+        do i = 1,nx
+          dum(i) = g(i,1)
+          dum(i+nx) = h(i,1)
+        end do
+
+        call MPI_SEND( dum,  &
+     &                 2*nx,  &
+     &                 dp_type,  &
+     &                 west,  &
+     &                 from_e,  &
+     &                 comm_solve,  &
+     &                 IERROR )
+
+      end if
+
+!---------------------------------------------------------------------
+!   communicate in the south and north directions
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!   receive from south
+!---------------------------------------------------------------------
+      if (ifin1.eq.nx) then
+        call MPI_IRECV( dum,  &
+     &                  2*ny2,  &
+     &                  dp_type,  &
+     &                  south,  &
+     &                  from_s,  &
+     &                  comm_solve,  &
+     &                  msgid1,  &
+     &                  IERROR )
+
+        call MPI_WAIT( msgid1, STATUS, IERROR )
+
+        do j = 0,ny+1
+          g(nx+1,j) = dum(j+1)
+          h(nx+1,j) = dum(j+ny2+1)
+        end do
+
+      end if
+
+!---------------------------------------------------------------------
+!   send north
+!---------------------------------------------------------------------
+      if (ibeg.eq.1) then
+        do j = 0,ny+1
+          dum(j+1) = g(1,j)
+          dum(j+ny2+1) = h(1,j)
+        end do
+
+        call MPI_SEND( dum,  &
+     &                 2*ny2,  &
+     &                 dp_type,  &
+     &                 north,  &
+     &                 from_s,  &
+     &                 comm_solve,  &
+     &                 IERROR )
+
+      end if
+
+      return
+      end     
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/exchange_5.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/exchange_5.f90
new file mode 100644
index 000000000..da1aed658
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/exchange_5.f90
@@ -0,0 +1,81 @@
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine exchange_5(g,ibeg,ifin1)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!   pass boundary values of g from south to north
+!---------------------------------------------------------------------
+
+      use lu_data
+      use mpinpb
+
+      implicit none
+
+!---------------------------------------------------------------------
+!  input parameters
+!---------------------------------------------------------------------
+      integer ibeg, ifin1
+      double precision  g(0:isiz2+1,0:isiz3+1)
+
+!---------------------------------------------------------------------
+!  local variables
+!---------------------------------------------------------------------
+      integer k
+      double precision  dum(isiz03)
+
+      integer msgid1
+      integer STATUS(MPI_STATUS_SIZE)
+      integer IERROR
+
+
+
+!---------------------------------------------------------------------
+!   communicate in the south and north directions
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!   receive from south
+!---------------------------------------------------------------------
+      if (ifin1.eq.nx) then
+        call MPI_IRECV( dum,  &
+     &                  nz,  &
+     &                  dp_type,  &
+     &                  south,  &
+     &                  from_s,  &
+     &                  comm_solve,  &
+     &                  msgid1,  &
+     &                  IERROR )
+
+        call MPI_WAIT( msgid1, STATUS, IERROR )
+
+        do k = 1,nz
+          g(nx+1,k) = dum(k)
+        end do
+
+      end if
+
+!---------------------------------------------------------------------
+!   send north
+!---------------------------------------------------------------------
+      if (ibeg.eq.1) then
+        do k = 1,nz
+          dum(k) = g(1,k)
+        end do
+
+        call MPI_SEND( dum,  &
+     &                 nz,  &
+     &                 dp_type,  &
+     &                 north,  &
+     &                 from_s,  &
+     &                 comm_solve,  &
+     &                 IERROR )
+
+      end if
+
+      return
+      end     
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/exchange_6.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/exchange_6.f90
new file mode 100644
index 000000000..9d10ee201
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/exchange_6.f90
@@ -0,0 +1,81 @@
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine exchange_6(g,jbeg,jfin1)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!   pass boundary values of g from east to west
+!---------------------------------------------------------------------
+
+      use lu_data
+      use mpinpb
+
+      implicit none
+
+!---------------------------------------------------------------------
+!  input parameters
+!---------------------------------------------------------------------
+      integer jbeg, jfin1
+      double precision  g(0:isiz2+1,0:isiz3+1)
+
+!---------------------------------------------------------------------
+!  local variables
+!---------------------------------------------------------------------
+      integer k
+      double precision  dum(isiz03)
+
+      integer msgid3
+      integer STATUS(MPI_STATUS_SIZE)
+      integer IERROR
+
+
+
+!---------------------------------------------------------------------
+!   communicate in the east and west directions
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!   receive from east
+!---------------------------------------------------------------------
+      if (jfin1.eq.ny) then
+        call MPI_IRECV( dum,  &
+     &                  nz,  &
+     &                  dp_type,  &
+     &                  east,  &
+     &                  from_e,  &
+     &                  comm_solve,  &
+     &                  msgid3,  &
+     &                  IERROR )
+
+        call MPI_WAIT( msgid3, STATUS, IERROR )
+
+        do k = 1,nz
+          g(ny+1,k) = dum(k)
+        end do
+
+      end if
+
+!---------------------------------------------------------------------
+!   send west
+!---------------------------------------------------------------------
+      if (jbeg.eq.1) then
+        do k = 1,nz
+          dum(k) = g(1,k)
+        end do
+
+        call MPI_SEND( dum,  &
+     &                 nz,  &
+     &                 dp_type,  &
+     &                 west,  &
+     &                 from_e,  &
+     &                 comm_solve,  &
+     &                 IERROR )
+
+      end if
+
+      return
+      end     
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/init_comm.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/init_comm.f90
new file mode 100644
index 000000000..13944887b
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/init_comm.f90
@@ -0,0 +1,59 @@
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine init_comm 
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!
+!   initialize MPI and establish rank and size
+!
+! This is a module in the MPI implementation of LUSSOR
+! pseudo application from the NAS Parallel Benchmarks. 
+!
+!---------------------------------------------------------------------
+
+      use lu_data
+      use mpinpb
+
+      implicit none
+
+      integer nodedim
+      integer IERROR
+
+
+!---------------------------------------------------------------------
+!    initialize MPI communication
+!---------------------------------------------------------------------
+      call MPI_INIT( IERROR )
+
+!---------------------------------------------------------------------
+!     get a process grid that requires (xdim*ydim) procs.
+!     excess ranks are marked as inactive.
+!---------------------------------------------------------------------
+      call get_active_nprocs(2, xdim, ydim, no_nodes,  &
+     &                       total_nodes, node, comm_solve, active)
+
+      if (.not. active) return
+
+!---------------------------------------------------------------------
+!   establish the global rank of this process and the group size
+!---------------------------------------------------------------------
+      id = node
+      num = no_nodes
+      root = 0
+
+      ndim   = nodedim(num)
+
+      if (.not. convertdouble) then
+         dp_type = MPI_DOUBLE_PRECISION
+      else
+         dp_type = MPI_REAL
+      endif
+
+
+      return
+      end
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/inputlu.data.sample b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/inputlu.data.sample
new file mode 100644
index 000000000..9ef5a7be0
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/inputlu.data.sample
@@ -0,0 +1,24 @@
+c
+c***controls printing of the progress of iterations: ipr    inorm
+                                                      1      250
+c
+c***the maximum no. of pseudo-time steps to be performed: nitmax
+                                                             250
+c
+c***magnitude of the time step: dt 
+                               2.0e+00
+c
+c***relaxation factor for SSOR iterations: omega
+                                            1.2
+c
+c***tolerance levels for steady-state residuals: tolnwt(m),m=1,5
+                             1.0e-08   1.0e-08   1.0e-08  1.0e-08  1.0e-08 
+c
+c***number of grid points in xi and eta and zeta directions: nx   ny   nz
+                                                            64  64  64
+c
+
+
+
+
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/jacld.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/jacld.f90
new file mode 100644
index 000000000..d2fa09519
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/jacld.f90
@@ -0,0 +1,381 @@
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine jacld(j,k)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+
+!---------------------------------------------------------------------
+!   compute the lower triangular part of the jacobian matrix
+!---------------------------------------------------------------------
+
+      use lu_data
+      implicit none
+
+!---------------------------------------------------------------------
+!  input parameters
+!---------------------------------------------------------------------
+      integer j, k
+
+!---------------------------------------------------------------------
+!  local variables
+!---------------------------------------------------------------------
+      integer i
+      double precision  r43
+      double precision  c1345
+      double precision  c34
+      double precision  tmp1, tmp2, tmp3
+
+
+      r43 = ( 4.0d+00 / 3.0d+00 )
+      c1345 = c1 * c3 * c4 * c5
+      c34 = c3 * c4
+
+            do i = ist, iend
+
+!---------------------------------------------------------------------
+!   form the block diagonal
+!---------------------------------------------------------------------
+               tmp1 = 1.0d+00 / u(1,i,j,k)
+               tmp2 = tmp1 * tmp1
+               tmp3 = tmp1 * tmp2
+
+               d(1,1,i) =  1.0d+00  &
+     &                       + dt * 2.0d+00 * (   tx1 * dx1  &
+     &                                          + ty1 * dy1  &
+     &                                          + tz1 * dz1 )
+               d(1,2,i) =  0.0d+00
+               d(1,3,i) =  0.0d+00
+               d(1,4,i) =  0.0d+00
+               d(1,5,i) =  0.0d+00
+
+               d(2,1,i) =  dt * 2.0d+00  &
+     &          * (  tx1 * ( - r43 * c34 * tmp2 * u(2,i,j,k) )  &
+     &             + ty1 * ( -       c34 * tmp2 * u(2,i,j,k) )  &
+     &             + tz1 * ( -       c34 * tmp2 * u(2,i,j,k) ) )
+               d(2,2,i) =  1.0d+00  &
+     &          + dt * 2.0d+00  &
+     &          * (  tx1 * r43 * c34 * tmp1  &
+     &             + ty1 *       c34 * tmp1  &
+     &             + tz1 *       c34 * tmp1 )  &
+     &          + dt * 2.0d+00 * (   tx1 * dx2  &
+     &                             + ty1 * dy2  &
+     &                             + tz1 * dz2  )
+               d(2,3,i) = 0.0d+00
+               d(2,4,i) = 0.0d+00
+               d(2,5,i) = 0.0d+00
+
+               d(3,1,i) = dt * 2.0d+00  &
+     &      * (  tx1 * ( -       c34 * tmp2 * u(3,i,j,k) )  &
+     &         + ty1 * ( - r43 * c34 * tmp2 * u(3,i,j,k) )  &
+     &         + tz1 * ( -       c34 * tmp2 * u(3,i,j,k) ) )
+               d(3,2,i) = 0.0d+00
+               d(3,3,i) = 1.0d+00  &
+     &         + dt * 2.0d+00  &
+     &              * (  tx1 *       c34 * tmp1  &
+     &                 + ty1 * r43 * c34 * tmp1  &
+     &                 + tz1 *       c34 * tmp1 )  &
+     &         + dt * 2.0d+00 * (  tx1 * dx3  &
+     &                           + ty1 * dy3  &
+     &                           + tz1 * dz3 )
+               d(3,4,i) = 0.0d+00
+               d(3,5,i) = 0.0d+00
+
+               d(4,1,i) = dt * 2.0d+00  &
+     &      * (  tx1 * ( -       c34 * tmp2 * u(4,i,j,k) )  &
+     &         + ty1 * ( -       c34 * tmp2 * u(4,i,j,k) )  &
+     &         + tz1 * ( - r43 * c34 * tmp2 * u(4,i,j,k) ) )
+               d(4,2,i) = 0.0d+00
+               d(4,3,i) = 0.0d+00
+               d(4,4,i) = 1.0d+00  &
+     &         + dt * 2.0d+00  &
+     &              * (  tx1 *       c34 * tmp1  &
+     &                 + ty1 *       c34 * tmp1  &
+     &                 + tz1 * r43 * c34 * tmp1 )  &
+     &         + dt * 2.0d+00 * (  tx1 * dx4  &
+     &                           + ty1 * dy4  &
+     &                           + tz1 * dz4 )
+               d(4,5,i) = 0.0d+00
+
+               d(5,1,i) = dt * 2.0d+00  &
+     & * ( tx1 * ( - ( r43*c34 - c1345 ) * tmp3 * ( u(2,i,j,k) ** 2 )  &
+     &             - ( c34 - c1345 ) * tmp3 * ( u(3,i,j,k) ** 2 )  &
+     &             - ( c34 - c1345 ) * tmp3 * ( u(4,i,j,k) ** 2 )  &
+     &             - ( c1345 ) * tmp2 * u(5,i,j,k) )  &
+     &   + ty1 * ( - ( c34 - c1345 ) * tmp3 * ( u(2,i,j,k) ** 2 )  &
+     &             - ( r43*c34 - c1345 ) * tmp3 * ( u(3,i,j,k) ** 2 )  &
+     &             - ( c34 - c1345 ) * tmp3 * ( u(4,i,j,k) ** 2 )  &
+     &             - ( c1345 ) * tmp2 * u(5,i,j,k) )  &
+     &   + tz1 * ( - ( c34 - c1345 ) * tmp3 * ( u(2,i,j,k) ** 2 )  &
+     &             - ( c34 - c1345 ) * tmp3 * ( u(3,i,j,k) ** 2 )  &
+     &             - ( r43*c34 - c1345 ) * tmp3 * ( u(4,i,j,k) ** 2 )  &
+     &             - ( c1345 ) * tmp2 * u(5,i,j,k) ) )
+               d(5,2,i) = dt * 2.0d+00  &
+     & * ( tx1 * ( r43*c34 - c1345 ) * tmp2 * u(2,i,j,k)  &
+     &   + ty1 * (     c34 - c1345 ) * tmp2 * u(2,i,j,k)  &
+     &   + tz1 * (     c34 - c1345 ) * tmp2 * u(2,i,j,k) )
+               d(5,3,i) = dt * 2.0d+00  &
+     & * ( tx1 * ( c34 - c1345 ) * tmp2 * u(3,i,j,k)  &
+     &   + ty1 * ( r43*c34 -c1345 ) * tmp2 * u(3,i,j,k)  &
+     &   + tz1 * ( c34 - c1345 ) * tmp2 * u(3,i,j,k) )
+               d(5,4,i) = dt * 2.0d+00  &
+     & * ( tx1 * ( c34 - c1345 ) * tmp2 * u(4,i,j,k)  &
+     &   + ty1 * ( c34 - c1345 ) * tmp2 * u(4,i,j,k)  &
+     &   + tz1 * ( r43*c34 - c1345 ) * tmp2 * u(4,i,j,k) )
+               d(5,5,i) = 1.0d+00  &
+     &   + dt * 2.0d+00 * ( tx1 * c1345 * tmp1  &
+     &                    + ty1 * c1345 * tmp1  &
+     &                    + tz1 * c1345 * tmp1 )  &
+     &   + dt * 2.0d+00 * (  tx1 * dx5  &
+     &                    +  ty1 * dy5  &
+     &                    +  tz1 * dz5 )
+
+!---------------------------------------------------------------------
+!   form the first block sub-diagonal
+!---------------------------------------------------------------------
+               tmp1 = 1.0d+00 / u(1,i,j,k-1)
+               tmp2 = tmp1 * tmp1
+               tmp3 = tmp1 * tmp2
+
+               a(1,1,i) = - dt * tz1 * dz1
+               a(1,2,i) =   0.0d+00
+               a(1,3,i) =   0.0d+00
+               a(1,4,i) = - dt * tz2
+               a(1,5,i) =   0.0d+00
+
+               a(2,1,i) = - dt * tz2  &
+     &           * ( - ( u(2,i,j,k-1)*u(4,i,j,k-1) ) * tmp2 )  &
+     &           - dt * tz1 * ( - c34 * tmp2 * u(2,i,j,k-1) )
+               a(2,2,i) = - dt * tz2 * ( u(4,i,j,k-1) * tmp1 )  &
+     &           - dt * tz1 * c34 * tmp1  &
+     &           - dt * tz1 * dz2 
+               a(2,3,i) = 0.0d+00
+               a(2,4,i) = - dt * tz2 * ( u(2,i,j,k-1) * tmp1 )
+               a(2,5,i) = 0.0d+00
+
+               a(3,1,i) = - dt * tz2  &
+     &           * ( - ( u(3,i,j,k-1)*u(4,i,j,k-1) ) * tmp2 )  &
+     &           - dt * tz1 * ( - c34 * tmp2 * u(3,i,j,k-1) )
+               a(3,2,i) = 0.0d+00
+               a(3,3,i) = - dt * tz2 * ( u(4,i,j,k-1) * tmp1 )  &
+     &           - dt * tz1 * ( c34 * tmp1 )  &
+     &           - dt * tz1 * dz3
+               a(3,4,i) = - dt * tz2 * ( u(3,i,j,k-1) * tmp1 )
+               a(3,5,i) = 0.0d+00
+
+               a(4,1,i) = - dt * tz2  &
+     &        * ( - ( u(4,i,j,k-1) * tmp1 ) ** 2  &
+     &            + 0.50d+00 * c2  &
+     &            * ( ( u(2,i,j,k-1) * u(2,i,j,k-1)  &
+     &                + u(3,i,j,k-1) * u(3,i,j,k-1)  &
+     &                + u(4,i,j,k-1) * u(4,i,j,k-1) ) * tmp2 ) )  &
+     &        - dt * tz1 * ( - r43 * c34 * tmp2 * u(4,i,j,k-1) )
+               a(4,2,i) = - dt * tz2  &
+     &             * ( - c2 * ( u(2,i,j,k-1) * tmp1 ) )
+               a(4,3,i) = - dt * tz2  &
+     &             * ( - c2 * ( u(3,i,j,k-1) * tmp1 ) )
+               a(4,4,i) = - dt * tz2 * ( 2.0d+00 - c2 )  &
+     &             * ( u(4,i,j,k-1) * tmp1 )  &
+     &             - dt * tz1 * ( r43 * c34 * tmp1 )  &
+     &             - dt * tz1 * dz4
+               a(4,5,i) = - dt * tz2 * c2
+
+               a(5,1,i) = - dt * tz2  &
+     &     * ( ( c2 * (  u(2,i,j,k-1) * u(2,i,j,k-1)  &
+     &                 + u(3,i,j,k-1) * u(3,i,j,k-1)  &
+     &                 + u(4,i,j,k-1) * u(4,i,j,k-1) ) * tmp2  &
+     &       - c1 * ( u(5,i,j,k-1) * tmp1 ) )  &
+     &            * ( u(4,i,j,k-1) * tmp1 ) )  &
+     &       - dt * tz1  &
+     &       * ( - ( c34 - c1345 ) * tmp3 * (u(2,i,j,k-1)**2)  &
+     &           - ( c34 - c1345 ) * tmp3 * (u(3,i,j,k-1)**2)  &
+     &           - ( r43*c34 - c1345 )* tmp3 * (u(4,i,j,k-1)**2)  &
+     &          - c1345 * tmp2 * u(5,i,j,k-1) )
+               a(5,2,i) = - dt * tz2  &
+     &       * ( - c2 * ( u(2,i,j,k-1)*u(4,i,j,k-1) ) * tmp2 )  &
+     &       - dt * tz1 * ( c34 - c1345 ) * tmp2 * u(2,i,j,k-1)
+               a(5,3,i) = - dt * tz2  &
+     &       * ( - c2 * ( u(3,i,j,k-1)*u(4,i,j,k-1) ) * tmp2 )  &
+     &       - dt * tz1 * ( c34 - c1345 ) * tmp2 * u(3,i,j,k-1)
+               a(5,4,i) = - dt * tz2  &
+     &       * ( c1 * ( u(5,i,j,k-1) * tmp1 )  &
+     &       - 0.50d+00 * c2  &
+     &       * ( (  u(2,i,j,k-1)*u(2,i,j,k-1)  &
+     &            + u(3,i,j,k-1)*u(3,i,j,k-1)  &
+     &            + 3.0d+00*u(4,i,j,k-1)*u(4,i,j,k-1) ) * tmp2 ) )  &
+     &       - dt * tz1 * ( r43*c34 - c1345 ) * tmp2 * u(4,i,j,k-1)
+               a(5,5,i) = - dt * tz2  &
+     &       * ( c1 * ( u(4,i,j,k-1) * tmp1 ) )  &
+     &       - dt * tz1 * c1345 * tmp1  &
+     &       - dt * tz1 * dz5
+
+!---------------------------------------------------------------------
+!   form the second block sub-diagonal
+!---------------------------------------------------------------------
+               tmp1 = 1.0d+00 / u(1,i,j-1,k)
+               tmp2 = tmp1 * tmp1
+               tmp3 = tmp1 * tmp2
+
+               b(1,1,i) = - dt * ty1 * dy1
+               b(1,2,i) =   0.0d+00
+               b(1,3,i) = - dt * ty2
+               b(1,4,i) =   0.0d+00
+               b(1,5,i) =   0.0d+00
+
+               b(2,1,i) = - dt * ty2  &
+     &           * ( - ( u(2,i,j-1,k)*u(3,i,j-1,k) ) * tmp2 )  &
+     &           - dt * ty1 * ( - c34 * tmp2 * u(2,i,j-1,k) )
+               b(2,2,i) = - dt * ty2 * ( u(3,i,j-1,k) * tmp1 )  &
+     &          - dt * ty1 * ( c34 * tmp1 )  &
+     &          - dt * ty1 * dy2
+               b(2,3,i) = - dt * ty2 * ( u(2,i,j-1,k) * tmp1 )
+               b(2,4,i) = 0.0d+00
+               b(2,5,i) = 0.0d+00
+
+               b(3,1,i) = - dt * ty2  &
+     &           * ( - ( u(3,i,j-1,k) * tmp1 ) ** 2  &
+     &      + 0.50d+00 * c2 * ( (  u(2,i,j-1,k) * u(2,i,j-1,k)  &
+     &                           + u(3,i,j-1,k) * u(3,i,j-1,k)  &
+     &                           + u(4,i,j-1,k) * u(4,i,j-1,k) )  &
+     &                          * tmp2 ) )  &
+     &       - dt * ty1 * ( - r43 * c34 * tmp2 * u(3,i,j-1,k) )
+               b(3,2,i) = - dt * ty2  &
+     &                   * ( - c2 * ( u(2,i,j-1,k) * tmp1 ) )
+               b(3,3,i) = - dt * ty2 * ( ( 2.0d+00 - c2 )  &
+     &                   * ( u(3,i,j-1,k) * tmp1 ) )  &
+     &       - dt * ty1 * ( r43 * c34 * tmp1 )  &
+     &       - dt * ty1 * dy3
+               b(3,4,i) = - dt * ty2  &
+     &                   * ( - c2 * ( u(4,i,j-1,k) * tmp1 ) )
+               b(3,5,i) = - dt * ty2 * c2
+
+               b(4,1,i) = - dt * ty2  &
+     &              * ( - ( u(3,i,j-1,k)*u(4,i,j-1,k) ) * tmp2 )  &
+     &       - dt * ty1 * ( - c34 * tmp2 * u(4,i,j-1,k) )
+               b(4,2,i) = 0.0d+00
+               b(4,3,i) = - dt * ty2 * ( u(4,i,j-1,k) * tmp1 )
+               b(4,4,i) = - dt * ty2 * ( u(3,i,j-1,k) * tmp1 )  &
+     &                        - dt * ty1 * ( c34 * tmp1 )  &
+     &                        - dt * ty1 * dy4
+               b(4,5,i) = 0.0d+00
+
+               b(5,1,i) = - dt * ty2  &
+     &          * ( ( c2 * (  u(2,i,j-1,k) * u(2,i,j-1,k)  &
+     &                      + u(3,i,j-1,k) * u(3,i,j-1,k)  &
+     &                      + u(4,i,j-1,k) * u(4,i,j-1,k) ) * tmp2  &
+     &               - c1 * ( u(5,i,j-1,k) * tmp1 ) )  &
+     &          * ( u(3,i,j-1,k) * tmp1 ) )  &
+     &          - dt * ty1  &
+     &          * ( - (     c34 - c1345 )*tmp3*(u(2,i,j-1,k)**2)  &
+     &              - ( r43*c34 - c1345 )*tmp3*(u(3,i,j-1,k)**2)  &
+     &              - (     c34 - c1345 )*tmp3*(u(4,i,j-1,k)**2)  &
+     &              - c1345*tmp2*u(5,i,j-1,k) )
+               b(5,2,i) = - dt * ty2  &
+     &          * ( - c2 * ( u(2,i,j-1,k)*u(3,i,j-1,k) ) * tmp2 )  &
+     &          - dt * ty1  &
+     &          * ( c34 - c1345 ) * tmp2 * u(2,i,j-1,k)
+               b(5,3,i) = - dt * ty2  &
+     &          * ( c1 * ( u(5,i,j-1,k) * tmp1 )  &
+     &          - 0.50d+00 * c2  &
+     &          * ( (  u(2,i,j-1,k)*u(2,i,j-1,k)  &
+     &               + 3.0d+00 * u(3,i,j-1,k)*u(3,i,j-1,k)  &
+     &               + u(4,i,j-1,k)*u(4,i,j-1,k) ) * tmp2 ) )  &
+     &          - dt * ty1  &
+     &          * ( r43*c34 - c1345 ) * tmp2 * u(3,i,j-1,k)
+               b(5,4,i) = - dt * ty2  &
+     &          * ( - c2 * ( u(3,i,j-1,k)*u(4,i,j-1,k) ) * tmp2 )  &
+     &          - dt * ty1 * ( c34 - c1345 ) * tmp2 * u(4,i,j-1,k)
+               b(5,5,i) = - dt * ty2  &
+     &          * ( c1 * ( u(3,i,j-1,k) * tmp1 ) )  &
+     &          - dt * ty1 * c1345 * tmp1  &
+     &          - dt * ty1 * dy5
+
+!---------------------------------------------------------------------
+!   form the third block sub-diagonal
+!---------------------------------------------------------------------
+               tmp1 = 1.0d+00 / u(1,i-1,j,k)
+               tmp2 = tmp1 * tmp1
+               tmp3 = tmp1 * tmp2
+
+               c(1,1,i) = - dt * tx1 * dx1
+               c(1,2,i) = - dt * tx2
+               c(1,3,i) =   0.0d+00
+               c(1,4,i) =   0.0d+00
+               c(1,5,i) =   0.0d+00
+
+               c(2,1,i) = - dt * tx2  &
+     &          * ( - ( u(2,i-1,j,k) * tmp1 ) ** 2  &
+     &     + c2 * 0.50d+00 * (  u(2,i-1,j,k) * u(2,i-1,j,k)  &
+     &                        + u(3,i-1,j,k) * u(3,i-1,j,k)  &
+     &                        + u(4,i-1,j,k) * u(4,i-1,j,k) ) * tmp2 )  &
+     &          - dt * tx1 * ( - r43 * c34 * tmp2 * u(2,i-1,j,k) )
+               c(2,2,i) = - dt * tx2  &
+     &          * ( ( 2.0d+00 - c2 ) * ( u(2,i-1,j,k) * tmp1 ) )  &
+     &          - dt * tx1 * ( r43 * c34 * tmp1 )  &
+     &          - dt * tx1 * dx2
+               c(2,3,i) = - dt * tx2  &
+     &              * ( - c2 * ( u(3,i-1,j,k) * tmp1 ) )
+               c(2,4,i) = - dt * tx2  &
+     &              * ( - c2 * ( u(4,i-1,j,k) * tmp1 ) )
+               c(2,5,i) = - dt * tx2 * c2 
+
+               c(3,1,i) = - dt * tx2  &
+     &              * ( - ( u(2,i-1,j,k) * u(3,i-1,j,k) ) * tmp2 )  &
+     &         - dt * tx1 * ( - c34 * tmp2 * u(3,i-1,j,k) )
+               c(3,2,i) = - dt * tx2 * ( u(3,i-1,j,k) * tmp1 )
+               c(3,3,i) = - dt * tx2 * ( u(2,i-1,j,k) * tmp1 )  &
+     &          - dt * tx1 * ( c34 * tmp1 )  &
+     &          - dt * tx1 * dx3
+               c(3,4,i) = 0.0d+00
+               c(3,5,i) = 0.0d+00
+
+               c(4,1,i) = - dt * tx2  &
+     &          * ( - ( u(2,i-1,j,k)*u(4,i-1,j,k) ) * tmp2 )  &
+     &          - dt * tx1 * ( - c34 * tmp2 * u(4,i-1,j,k) )
+               c(4,2,i) = - dt * tx2 * ( u(4,i-1,j,k) * tmp1 )
+               c(4,3,i) = 0.0d+00
+               c(4,4,i) = - dt * tx2 * ( u(2,i-1,j,k) * tmp1 )  &
+     &          - dt * tx1 * ( c34 * tmp1 )  &
+     &          - dt * tx1 * dx4
+               c(4,5,i) = 0.0d+00
+
+               c(5,1,i) = - dt * tx2  &
+     &          * ( ( c2 * (  u(2,i-1,j,k) * u(2,i-1,j,k)  &
+     &                      + u(3,i-1,j,k) * u(3,i-1,j,k)  &
+     &                      + u(4,i-1,j,k) * u(4,i-1,j,k) ) * tmp2  &
+     &              - c1 * ( u(5,i-1,j,k) * tmp1 ) )  &
+     &          * ( u(2,i-1,j,k) * tmp1 ) )  &
+     &          - dt * tx1  &
+     &          * ( - ( r43*c34 - c1345 ) * tmp3 * ( u(2,i-1,j,k)**2 )  &
+     &              - (     c34 - c1345 ) * tmp3 * ( u(3,i-1,j,k)**2 )  &
+     &              - (     c34 - c1345 ) * tmp3 * ( u(4,i-1,j,k)**2 )  &
+     &              - c1345 * tmp2 * u(5,i-1,j,k) )
+               c(5,2,i) = - dt * tx2  &
+     &          * ( c1 * ( u(5,i-1,j,k) * tmp1 )  &
+     &             - 0.50d+00 * c2  &
+     &             * ( (  3.0d+00*u(2,i-1,j,k)*u(2,i-1,j,k)  &
+     &                  + u(3,i-1,j,k)*u(3,i-1,j,k)  &
+     &                  + u(4,i-1,j,k)*u(4,i-1,j,k) ) * tmp2 ) )  &
+     &           - dt * tx1  &
+     &           * ( r43*c34 - c1345 ) * tmp2 * u(2,i-1,j,k)
+               c(5,3,i) = - dt * tx2  &
+     &           * ( - c2 * ( u(3,i-1,j,k)*u(2,i-1,j,k) ) * tmp2 )  &
+     &           - dt * tx1  &
+     &           * (  c34 - c1345 ) * tmp2 * u(3,i-1,j,k)
+               c(5,4,i) = - dt * tx2  &
+     &           * ( - c2 * ( u(4,i-1,j,k)*u(2,i-1,j,k) ) * tmp2 )  &
+     &           - dt * tx1  &
+     &           * (  c34 - c1345 ) * tmp2 * u(4,i-1,j,k)
+               c(5,5,i) = - dt * tx2  &
+     &           * ( c1 * ( u(2,i-1,j,k) * tmp1 ) )  &
+     &           - dt * tx1 * c1345 * tmp1  &
+     &           - dt * tx1 * dx5
+
+            end do
+
+
+      return
+      end
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/jacld_vec.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/jacld_vec.f90
new file mode 100644
index 000000000..66747bea4
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/jacld_vec.f90
@@ -0,0 +1,389 @@
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine jacld(k)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+
+!---------------------------------------------------------------------
+!   compute the lower triangular part of the jacobian matrix
+!---------------------------------------------------------------------
+
+      use lu_data
+      use timing
+
+      implicit none
+
+
+!---------------------------------------------------------------------
+!  input parameters
+!---------------------------------------------------------------------
+      integer k
+
+!---------------------------------------------------------------------
+!  local variables
+!---------------------------------------------------------------------
+      integer i, j
+      double precision  r43
+      double precision  c1345
+      double precision  c34
+      double precision  tmp1, tmp2, tmp3
+
+
+      if (timeron) call timer_start(t_jacld)
+
+      r43 = ( 4.0d+00 / 3.0d+00 )
+      c1345 = c1 * c3 * c4 * c5
+      c34 = c3 * c4
+
+         do j = jst, jend
+            do i = ist, iend
+
+!---------------------------------------------------------------------
+!   form the block diagonal
+!---------------------------------------------------------------------
+               tmp1 = 1.0d+00 / u(1,i,j,k)
+               tmp2 = tmp1 * tmp1
+               tmp3 = tmp1 * tmp2
+
+               d(1,1,i,j) =  1.0d+00  &
+     &                       + dt * 2.0d+00 * (   tx1 * dx1  &
+     &                                          + ty1 * dy1  &
+     &                                          + tz1 * dz1 )
+               d(1,2,i,j) =  0.0d+00
+               d(1,3,i,j) =  0.0d+00
+               d(1,4,i,j) =  0.0d+00
+               d(1,5,i,j) =  0.0d+00
+
+               d(2,1,i,j) =  dt * 2.0d+00  &
+     &          * (  tx1 * ( - r43 * c34 * tmp2 * u(2,i,j,k) )  &
+     &             + ty1 * ( -       c34 * tmp2 * u(2,i,j,k) )  &
+     &             + tz1 * ( -       c34 * tmp2 * u(2,i,j,k) ) )
+               d(2,2,i,j) =  1.0d+00  &
+     &          + dt * 2.0d+00  &
+     &          * (  tx1 * r43 * c34 * tmp1  &
+     &             + ty1 *       c34 * tmp1  &
+     &             + tz1 *       c34 * tmp1 )  &
+     &          + dt * 2.0d+00 * (   tx1 * dx2  &
+     &                             + ty1 * dy2  &
+     &                             + tz1 * dz2  )
+               d(2,3,i,j) = 0.0d+00
+               d(2,4,i,j) = 0.0d+00
+               d(2,5,i,j) = 0.0d+00
+
+               d(3,1,i,j) = dt * 2.0d+00  &
+     &      * (  tx1 * ( -       c34 * tmp2 * u(3,i,j,k) )  &
+     &         + ty1 * ( - r43 * c34 * tmp2 * u(3,i,j,k) )  &
+     &         + tz1 * ( -       c34 * tmp2 * u(3,i,j,k) ) )
+               d(3,2,i,j) = 0.0d+00
+               d(3,3,i,j) = 1.0d+00  &
+     &         + dt * 2.0d+00  &
+     &              * (  tx1 *       c34 * tmp1  &
+     &                 + ty1 * r43 * c34 * tmp1  &
+     &                 + tz1 *       c34 * tmp1 )  &
+     &         + dt * 2.0d+00 * (  tx1 * dx3  &
+     &                           + ty1 * dy3  &
+     &                           + tz1 * dz3 )
+               d(3,4,i,j) = 0.0d+00
+               d(3,5,i,j) = 0.0d+00
+
+               d(4,1,i,j) = dt * 2.0d+00  &
+     &      * (  tx1 * ( -       c34 * tmp2 * u(4,i,j,k) )  &
+     &         + ty1 * ( -       c34 * tmp2 * u(4,i,j,k) )  &
+     &         + tz1 * ( - r43 * c34 * tmp2 * u(4,i,j,k) ) )
+               d(4,2,i,j) = 0.0d+00
+               d(4,3,i,j) = 0.0d+00
+               d(4,4,i,j) = 1.0d+00  &
+     &         + dt * 2.0d+00  &
+     &              * (  tx1 *       c34 * tmp1  &
+     &                 + ty1 *       c34 * tmp1  &
+     &                 + tz1 * r43 * c34 * tmp1 )  &
+     &         + dt * 2.0d+00 * (  tx1 * dx4  &
+     &                           + ty1 * dy4  &
+     &                           + tz1 * dz4 )
+               d(4,5,i,j) = 0.0d+00
+
+               d(5,1,i,j) = dt * 2.0d+00  &
+     & * ( tx1 * ( - ( r43*c34 - c1345 ) * tmp3 * ( u(2,i,j,k) ** 2 )  &
+     &             - ( c34 - c1345 ) * tmp3 * ( u(3,i,j,k) ** 2 )  &
+     &             - ( c34 - c1345 ) * tmp3 * ( u(4,i,j,k) ** 2 )  &
+     &             - ( c1345 ) * tmp2 * u(5,i,j,k) )  &
+     &   + ty1 * ( - ( c34 - c1345 ) * tmp3 * ( u(2,i,j,k) ** 2 )  &
+     &             - ( r43*c34 - c1345 ) * tmp3 * ( u(3,i,j,k) ** 2 )  &
+     &             - ( c34 - c1345 ) * tmp3 * ( u(4,i,j,k) ** 2 )  &
+     &             - ( c1345 ) * tmp2 * u(5,i,j,k) )  &
+     &   + tz1 * ( - ( c34 - c1345 ) * tmp3 * ( u(2,i,j,k) ** 2 )  &
+     &             - ( c34 - c1345 ) * tmp3 * ( u(3,i,j,k) ** 2 )  &
+     &             - ( r43*c34 - c1345 ) * tmp3 * ( u(4,i,j,k) ** 2 )  &
+     &             - ( c1345 ) * tmp2 * u(5,i,j,k) ) )
+               d(5,2,i,j) = dt * 2.0d+00  &
+     & * ( tx1 * ( r43*c34 - c1345 ) * tmp2 * u(2,i,j,k)  &
+     &   + ty1 * (     c34 - c1345 ) * tmp2 * u(2,i,j,k)  &
+     &   + tz1 * (     c34 - c1345 ) * tmp2 * u(2,i,j,k) )
+               d(5,3,i,j) = dt * 2.0d+00  &
+     & * ( tx1 * ( c34 - c1345 ) * tmp2 * u(3,i,j,k)  &
+     &   + ty1 * ( r43*c34 -c1345 ) * tmp2 * u(3,i,j,k)  &
+     &   + tz1 * ( c34 - c1345 ) * tmp2 * u(3,i,j,k) )
+               d(5,4,i,j) = dt * 2.0d+00  &
+     & * ( tx1 * ( c34 - c1345 ) * tmp2 * u(4,i,j,k)  &
+     &   + ty1 * ( c34 - c1345 ) * tmp2 * u(4,i,j,k)  &
+     &   + tz1 * ( r43*c34 - c1345 ) * tmp2 * u(4,i,j,k) )
+               d(5,5,i,j) = 1.0d+00  &
+     &   + dt * 2.0d+00 * ( tx1 * c1345 * tmp1  &
+     &                    + ty1 * c1345 * tmp1  &
+     &                    + tz1 * c1345 * tmp1 )  &
+     &   + dt * 2.0d+00 * (  tx1 * dx5  &
+     &                    +  ty1 * dy5  &
+     &                    +  tz1 * dz5 )
+
+!---------------------------------------------------------------------
+!   form the first block sub-diagonal
+!---------------------------------------------------------------------
+               tmp1 = 1.0d+00 / u(1,i,j,k-1)
+               tmp2 = tmp1 * tmp1
+               tmp3 = tmp1 * tmp2
+
+               a(1,1,i,j) = - dt * tz1 * dz1
+               a(1,2,i,j) =   0.0d+00
+               a(1,3,i,j) =   0.0d+00
+               a(1,4,i,j) = - dt * tz2
+               a(1,5,i,j) =   0.0d+00
+
+               a(2,1,i,j) = - dt * tz2  &
+     &           * ( - ( u(2,i,j,k-1)*u(4,i,j,k-1) ) * tmp2 )  &
+     &           - dt * tz1 * ( - c34 * tmp2 * u(2,i,j,k-1) )
+               a(2,2,i,j) = - dt * tz2 * ( u(4,i,j,k-1) * tmp1 )  &
+     &           - dt * tz1 * c34 * tmp1  &
+     &           - dt * tz1 * dz2 
+               a(2,3,i,j) = 0.0d+00
+               a(2,4,i,j) = - dt * tz2 * ( u(2,i,j,k-1) * tmp1 )
+               a(2,5,i,j) = 0.0d+00
+
+               a(3,1,i,j) = - dt * tz2  &
+     &           * ( - ( u(3,i,j,k-1)*u(4,i,j,k-1) ) * tmp2 )  &
+     &           - dt * tz1 * ( - c34 * tmp2 * u(3,i,j,k-1) )
+               a(3,2,i,j) = 0.0d+00
+               a(3,3,i,j) = - dt * tz2 * ( u(4,i,j,k-1) * tmp1 )  &
+     &           - dt * tz1 * ( c34 * tmp1 )  &
+     &           - dt * tz1 * dz3
+               a(3,4,i,j) = - dt * tz2 * ( u(3,i,j,k-1) * tmp1 )
+               a(3,5,i,j) = 0.0d+00
+
+               a(4,1,i,j) = - dt * tz2  &
+     &        * ( - ( u(4,i,j,k-1) * tmp1 ) ** 2  &
+     &            + 0.50d+00 * c2  &
+     &            * ( ( u(2,i,j,k-1) * u(2,i,j,k-1)  &
+     &                + u(3,i,j,k-1) * u(3,i,j,k-1)  &
+     &                + u(4,i,j,k-1) * u(4,i,j,k-1) ) * tmp2 ) )  &
+     &        - dt * tz1 * ( - r43 * c34 * tmp2 * u(4,i,j,k-1) )
+               a(4,2,i,j) = - dt * tz2  &
+     &             * ( - c2 * ( u(2,i,j,k-1) * tmp1 ) )
+               a(4,3,i,j) = - dt * tz2  &
+     &             * ( - c2 * ( u(3,i,j,k-1) * tmp1 ) )
+               a(4,4,i,j) = - dt * tz2 * ( 2.0d+00 - c2 )  &
+     &             * ( u(4,i,j,k-1) * tmp1 )  &
+     &             - dt * tz1 * ( r43 * c34 * tmp1 )  &
+     &             - dt * tz1 * dz4
+               a(4,5,i,j) = - dt * tz2 * c2
+
+               a(5,1,i,j) = - dt * tz2  &
+     &     * ( ( c2 * (  u(2,i,j,k-1) * u(2,i,j,k-1)  &
+     &                 + u(3,i,j,k-1) * u(3,i,j,k-1)  &
+     &                 + u(4,i,j,k-1) * u(4,i,j,k-1) ) * tmp2  &
+     &       - c1 * ( u(5,i,j,k-1) * tmp1 ) )  &
+     &            * ( u(4,i,j,k-1) * tmp1 ) )  &
+     &       - dt * tz1  &
+     &       * ( - ( c34 - c1345 ) * tmp3 * (u(2,i,j,k-1)**2)  &
+     &           - ( c34 - c1345 ) * tmp3 * (u(3,i,j,k-1)**2)  &
+     &           - ( r43*c34 - c1345 )* tmp3 * (u(4,i,j,k-1)**2)  &
+     &          - c1345 * tmp2 * u(5,i,j,k-1) )
+               a(5,2,i,j) = - dt * tz2  &
+     &       * ( - c2 * ( u(2,i,j,k-1)*u(4,i,j,k-1) ) * tmp2 )  &
+     &       - dt * tz1 * ( c34 - c1345 ) * tmp2 * u(2,i,j,k-1)
+               a(5,3,i,j) = - dt * tz2  &
+     &       * ( - c2 * ( u(3,i,j,k-1)*u(4,i,j,k-1) ) * tmp2 )  &
+     &       - dt * tz1 * ( c34 - c1345 ) * tmp2 * u(3,i,j,k-1)
+               a(5,4,i,j) = - dt * tz2  &
+     &       * ( c1 * ( u(5,i,j,k-1) * tmp1 )  &
+     &       - 0.50d+00 * c2  &
+     &       * ( (  u(2,i,j,k-1)*u(2,i,j,k-1)  &
+     &            + u(3,i,j,k-1)*u(3,i,j,k-1)  &
+     &            + 3.0d+00*u(4,i,j,k-1)*u(4,i,j,k-1) ) * tmp2 ) )  &
+     &       - dt * tz1 * ( r43*c34 - c1345 ) * tmp2 * u(4,i,j,k-1)
+               a(5,5,i,j) = - dt * tz2  &
+     &       * ( c1 * ( u(4,i,j,k-1) * tmp1 ) )  &
+     &       - dt * tz1 * c1345 * tmp1  &
+     &       - dt * tz1 * dz5
+
+!---------------------------------------------------------------------
+!   form the second block sub-diagonal
+!---------------------------------------------------------------------
+               tmp1 = 1.0d+00 / u(1,i,j-1,k)
+               tmp2 = tmp1 * tmp1
+               tmp3 = tmp1 * tmp2
+
+               b(1,1,i,j) = - dt * ty1 * dy1
+               b(1,2,i,j) =   0.0d+00
+               b(1,3,i,j) = - dt * ty2
+               b(1,4,i,j) =   0.0d+00
+               b(1,5,i,j) =   0.0d+00
+
+               b(2,1,i,j) = - dt * ty2  &
+     &           * ( - ( u(2,i,j-1,k)*u(3,i,j-1,k) ) * tmp2 )  &
+     &           - dt * ty1 * ( - c34 * tmp2 * u(2,i,j-1,k) )
+               b(2,2,i,j) = - dt * ty2 * ( u(3,i,j-1,k) * tmp1 )  &
+     &          - dt * ty1 * ( c34 * tmp1 )  &
+     &          - dt * ty1 * dy2
+               b(2,3,i,j) = - dt * ty2 * ( u(2,i,j-1,k) * tmp1 )
+               b(2,4,i,j) = 0.0d+00
+               b(2,5,i,j) = 0.0d+00
+
+               b(3,1,i,j) = - dt * ty2  &
+     &           * ( - ( u(3,i,j-1,k) * tmp1 ) ** 2  &
+     &      + 0.50d+00 * c2 * ( (  u(2,i,j-1,k) * u(2,i,j-1,k)  &
+     &                           + u(3,i,j-1,k) * u(3,i,j-1,k)  &
+     &                           + u(4,i,j-1,k) * u(4,i,j-1,k) )  &
+     &                          * tmp2 ) )  &
+     &       - dt * ty1 * ( - r43 * c34 * tmp2 * u(3,i,j-1,k) )
+               b(3,2,i,j) = - dt * ty2  &
+     &                   * ( - c2 * ( u(2,i,j-1,k) * tmp1 ) )
+               b(3,3,i,j) = - dt * ty2 * ( ( 2.0d+00 - c2 )  &
+     &                   * ( u(3,i,j-1,k) * tmp1 ) )  &
+     &       - dt * ty1 * ( r43 * c34 * tmp1 )  &
+     &       - dt * ty1 * dy3
+               b(3,4,i,j) = - dt * ty2  &
+     &                   * ( - c2 * ( u(4,i,j-1,k) * tmp1 ) )
+               b(3,5,i,j) = - dt * ty2 * c2
+
+               b(4,1,i,j) = - dt * ty2  &
+     &              * ( - ( u(3,i,j-1,k)*u(4,i,j-1,k) ) * tmp2 )  &
+     &       - dt * ty1 * ( - c34 * tmp2 * u(4,i,j-1,k) )
+               b(4,2,i,j) = 0.0d+00
+               b(4,3,i,j) = - dt * ty2 * ( u(4,i,j-1,k) * tmp1 )
+               b(4,4,i,j) = - dt * ty2 * ( u(3,i,j-1,k) * tmp1 )  &
+     &                        - dt * ty1 * ( c34 * tmp1 )  &
+     &                        - dt * ty1 * dy4
+               b(4,5,i,j) = 0.0d+00
+
+               b(5,1,i,j) = - dt * ty2  &
+     &          * ( ( c2 * (  u(2,i,j-1,k) * u(2,i,j-1,k)  &
+     &                      + u(3,i,j-1,k) * u(3,i,j-1,k)  &
+     &                      + u(4,i,j-1,k) * u(4,i,j-1,k) ) * tmp2  &
+     &               - c1 * ( u(5,i,j-1,k) * tmp1 ) )  &
+     &          * ( u(3,i,j-1,k) * tmp1 ) )  &
+     &          - dt * ty1  &
+     &          * ( - (     c34 - c1345 )*tmp3*(u(2,i,j-1,k)**2)  &
+     &              - ( r43*c34 - c1345 )*tmp3*(u(3,i,j-1,k)**2)  &
+     &              - (     c34 - c1345 )*tmp3*(u(4,i,j-1,k)**2)  &
+     &              - c1345*tmp2*u(5,i,j-1,k) )
+               b(5,2,i,j) = - dt * ty2  &
+     &          * ( - c2 * ( u(2,i,j-1,k)*u(3,i,j-1,k) ) * tmp2 )  &
+     &          - dt * ty1  &
+     &          * ( c34 - c1345 ) * tmp2 * u(2,i,j-1,k)
+               b(5,3,i,j) = - dt * ty2  &
+     &          * ( c1 * ( u(5,i,j-1,k) * tmp1 )  &
+     &          - 0.50d+00 * c2  &
+     &          * ( (  u(2,i,j-1,k)*u(2,i,j-1,k)  &
+     &               + 3.0d+00 * u(3,i,j-1,k)*u(3,i,j-1,k)  &
+     &               + u(4,i,j-1,k)*u(4,i,j-1,k) ) * tmp2 ) )  &
+     &          - dt * ty1  &
+     &          * ( r43*c34 - c1345 ) * tmp2 * u(3,i,j-1,k)
+               b(5,4,i,j) = - dt * ty2  &
+     &          * ( - c2 * ( u(3,i,j-1,k)*u(4,i,j-1,k) ) * tmp2 )  &
+     &          - dt * ty1 * ( c34 - c1345 ) * tmp2 * u(4,i,j-1,k)
+               b(5,5,i,j) = - dt * ty2  &
+     &          * ( c1 * ( u(3,i,j-1,k) * tmp1 ) )  &
+     &          - dt * ty1 * c1345 * tmp1  &
+     &          - dt * ty1 * dy5
+
+!---------------------------------------------------------------------
+!   form the third block sub-diagonal
+!---------------------------------------------------------------------
+               tmp1 = 1.0d+00 / u(1,i-1,j,k)
+               tmp2 = tmp1 * tmp1
+               tmp3 = tmp1 * tmp2
+
+               c(1,1,i,j) = - dt * tx1 * dx1
+               c(1,2,i,j) = - dt * tx2
+               c(1,3,i,j) =   0.0d+00
+               c(1,4,i,j) =   0.0d+00
+               c(1,5,i,j) =   0.0d+00
+
+               c(2,1,i,j) = - dt * tx2  &
+     &          * ( - ( u(2,i-1,j,k) * tmp1 ) ** 2  &
+     &     + c2 * 0.50d+00 * (  u(2,i-1,j,k) * u(2,i-1,j,k)  &
+     &                        + u(3,i-1,j,k) * u(3,i-1,j,k)  &
+     &                        + u(4,i-1,j,k) * u(4,i-1,j,k) ) * tmp2 )  &
+     &          - dt * tx1 * ( - r43 * c34 * tmp2 * u(2,i-1,j,k) )
+               c(2,2,i,j) = - dt * tx2  &
+     &          * ( ( 2.0d+00 - c2 ) * ( u(2,i-1,j,k) * tmp1 ) )  &
+     &          - dt * tx1 * ( r43 * c34 * tmp1 )  &
+     &          - dt * tx1 * dx2
+               c(2,3,i,j) = - dt * tx2  &
+     &              * ( - c2 * ( u(3,i-1,j,k) * tmp1 ) )
+               c(2,4,i,j) = - dt * tx2  &
+     &              * ( - c2 * ( u(4,i-1,j,k) * tmp1 ) )
+               c(2,5,i,j) = - dt * tx2 * c2 
+
+               c(3,1,i,j) = - dt * tx2  &
+     &              * ( - ( u(2,i-1,j,k) * u(3,i-1,j,k) ) * tmp2 )  &
+     &         - dt * tx1 * ( - c34 * tmp2 * u(3,i-1,j,k) )
+               c(3,2,i,j) = - dt * tx2 * ( u(3,i-1,j,k) * tmp1 )
+               c(3,3,i,j) = - dt * tx2 * ( u(2,i-1,j,k) * tmp1 )  &
+     &          - dt * tx1 * ( c34 * tmp1 )  &
+     &          - dt * tx1 * dx3
+               c(3,4,i,j) = 0.0d+00
+               c(3,5,i,j) = 0.0d+00
+
+               c(4,1,i,j) = - dt * tx2  &
+     &          * ( - ( u(2,i-1,j,k)*u(4,i-1,j,k) ) * tmp2 )  &
+     &          - dt * tx1 * ( - c34 * tmp2 * u(4,i-1,j,k) )
+               c(4,2,i,j) = - dt * tx2 * ( u(4,i-1,j,k) * tmp1 )
+               c(4,3,i,j) = 0.0d+00
+               c(4,4,i,j) = - dt * tx2 * ( u(2,i-1,j,k) * tmp1 )  &
+     &          - dt * tx1 * ( c34 * tmp1 )  &
+     &          - dt * tx1 * dx4
+               c(4,5,i,j) = 0.0d+00
+
+               c(5,1,i,j) = - dt * tx2  &
+     &          * ( ( c2 * (  u(2,i-1,j,k) * u(2,i-1,j,k)  &
+     &                      + u(3,i-1,j,k) * u(3,i-1,j,k)  &
+     &                      + u(4,i-1,j,k) * u(4,i-1,j,k) ) * tmp2  &
+     &              - c1 * ( u(5,i-1,j,k) * tmp1 ) )  &
+     &          * ( u(2,i-1,j,k) * tmp1 ) )  &
+     &          - dt * tx1  &
+     &          * ( - ( r43*c34 - c1345 ) * tmp3 * ( u(2,i-1,j,k)**2 )  &
+     &              - (     c34 - c1345 ) * tmp3 * ( u(3,i-1,j,k)**2 )  &
+     &              - (     c34 - c1345 ) * tmp3 * ( u(4,i-1,j,k)**2 )  &
+     &              - c1345 * tmp2 * u(5,i-1,j,k) )
+               c(5,2,i,j) = - dt * tx2  &
+     &          * ( c1 * ( u(5,i-1,j,k) * tmp1 )  &
+     &             - 0.50d+00 * c2  &
+     &             * ( (  3.0d+00*u(2,i-1,j,k)*u(2,i-1,j,k)  &
+     &                  + u(3,i-1,j,k)*u(3,i-1,j,k)  &
+     &                  + u(4,i-1,j,k)*u(4,i-1,j,k) ) * tmp2 ) )  &
+     &           - dt * tx1  &
+     &           * ( r43*c34 - c1345 ) * tmp2 * u(2,i-1,j,k)
+               c(5,3,i,j) = - dt * tx2  &
+     &           * ( - c2 * ( u(3,i-1,j,k)*u(2,i-1,j,k) ) * tmp2 )  &
+     &           - dt * tx1  &
+     &           * (  c34 - c1345 ) * tmp2 * u(3,i-1,j,k)
+               c(5,4,i,j) = - dt * tx2  &
+     &           * ( - c2 * ( u(4,i-1,j,k)*u(2,i-1,j,k) ) * tmp2 )  &
+     &           - dt * tx1  &
+     &           * (  c34 - c1345 ) * tmp2 * u(4,i-1,j,k)
+               c(5,5,i,j) = - dt * tx2  &
+     &           * ( c1 * ( u(2,i-1,j,k) * tmp1 ) )  &
+     &           - dt * tx1 * c1345 * tmp1  &
+     &           - dt * tx1 * dx5
+
+            end do
+         end do
+
+      if (timeron) call timer_stop(t_jacld)
+
+      return
+      end
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/jacu.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/jacu.f90
new file mode 100644
index 000000000..661523a16
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/jacu.f90
@@ -0,0 +1,380 @@
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine jacu(j,k)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!   compute the upper triangular part of the jacobian matrix
+!---------------------------------------------------------------------
+
+      use lu_data
+      implicit none
+
+!---------------------------------------------------------------------
+!  input parameters
+!---------------------------------------------------------------------
+      integer j, k
+
+!---------------------------------------------------------------------
+!  local variables
+!---------------------------------------------------------------------
+      integer i
+      double precision  r43
+      double precision  c1345
+      double precision  c34
+      double precision  tmp1, tmp2, tmp3
+
+
+      r43 = ( 4.0d+00 / 3.0d+00 )
+      c1345 = c1 * c3 * c4 * c5
+      c34 = c3 * c4
+
+            do i = ist, iend
+
+!---------------------------------------------------------------------
+!   form the block diagonal
+!---------------------------------------------------------------------
+               tmp1 = 1.0d+00 / u(1,i,j,k)
+               tmp2 = tmp1 * tmp1
+               tmp3 = tmp1 * tmp2
+
+               d(1,1,i) =  1.0d+00  &
+     &                       + dt * 2.0d+00 * (   tx1 * dx1  &
+     &                                          + ty1 * dy1  &
+     &                                          + tz1 * dz1 )
+               d(1,2,i) =  0.0d+00
+               d(1,3,i) =  0.0d+00
+               d(1,4,i) =  0.0d+00
+               d(1,5,i) =  0.0d+00
+
+               d(2,1,i) =  dt * 2.0d+00  &
+     &          * (  tx1 * ( - r43 * c34 * tmp2 * u(2,i,j,k) )  &
+     &             + ty1 * ( -       c34 * tmp2 * u(2,i,j,k) )  &
+     &             + tz1 * ( -       c34 * tmp2 * u(2,i,j,k) ) )
+               d(2,2,i) =  1.0d+00  &
+     &          + dt * 2.0d+00  &
+     &          * (  tx1 * r43 * c34 * tmp1  &
+     &             + ty1 *       c34 * tmp1  &
+     &             + tz1 *       c34 * tmp1 )  &
+     &          + dt * 2.0d+00 * (   tx1 * dx2  &
+     &                             + ty1 * dy2  &
+     &                             + tz1 * dz2  )
+               d(2,3,i) = 0.0d+00
+               d(2,4,i) = 0.0d+00
+               d(2,5,i) = 0.0d+00
+
+               d(3,1,i) = dt * 2.0d+00  &
+     &      * (  tx1 * ( -       c34 * tmp2 * u(3,i,j,k) )  &
+     &         + ty1 * ( - r43 * c34 * tmp2 * u(3,i,j,k) )  &
+     &         + tz1 * ( -       c34 * tmp2 * u(3,i,j,k) ) )
+               d(3,2,i) = 0.0d+00
+               d(3,3,i) = 1.0d+00  &
+     &         + dt * 2.0d+00  &
+     &              * (  tx1 *       c34 * tmp1  &
+     &                 + ty1 * r43 * c34 * tmp1  &
+     &                 + tz1 *       c34 * tmp1 )  &
+     &         + dt * 2.0d+00 * (  tx1 * dx3  &
+     &                           + ty1 * dy3  &
+     &                           + tz1 * dz3 )
+               d(3,4,i) = 0.0d+00
+               d(3,5,i) = 0.0d+00
+
+               d(4,1,i) = dt * 2.0d+00  &
+     &      * (  tx1 * ( -       c34 * tmp2 * u(4,i,j,k) )  &
+     &         + ty1 * ( -       c34 * tmp2 * u(4,i,j,k) )  &
+     &         + tz1 * ( - r43 * c34 * tmp2 * u(4,i,j,k) ) )
+               d(4,2,i) = 0.0d+00
+               d(4,3,i) = 0.0d+00
+               d(4,4,i) = 1.0d+00  &
+     &         + dt * 2.0d+00  &
+     &              * (  tx1 *       c34 * tmp1  &
+     &                 + ty1 *       c34 * tmp1  &
+     &                 + tz1 * r43 * c34 * tmp1 )  &
+     &         + dt * 2.0d+00 * (  tx1 * dx4  &
+     &                           + ty1 * dy4  &
+     &                           + tz1 * dz4 )
+               d(4,5,i) = 0.0d+00
+
+               d(5,1,i) = dt * 2.0d+00  &
+     & * ( tx1 * ( - ( r43*c34 - c1345 ) * tmp3 * ( u(2,i,j,k) ** 2 )  &
+     &             - ( c34 - c1345 ) * tmp3 * ( u(3,i,j,k) ** 2 )  &
+     &             - ( c34 - c1345 ) * tmp3 * ( u(4,i,j,k) ** 2 )  &
+     &             - ( c1345 ) * tmp2 * u(5,i,j,k) )  &
+     &   + ty1 * ( - ( c34 - c1345 ) * tmp3 * ( u(2,i,j,k) ** 2 )  &
+     &             - ( r43*c34 - c1345 ) * tmp3 * ( u(3,i,j,k) ** 2 )  &
+     &             - ( c34 - c1345 ) * tmp3 * ( u(4,i,j,k) ** 2 )  &
+     &             - ( c1345 ) * tmp2 * u(5,i,j,k) )  &
+     &   + tz1 * ( - ( c34 - c1345 ) * tmp3 * ( u(2,i,j,k) ** 2 )  &
+     &             - ( c34 - c1345 ) * tmp3 * ( u(3,i,j,k) ** 2 )  &
+     &             - ( r43*c34 - c1345 ) * tmp3 * ( u(4,i,j,k) ** 2 )  &
+     &             - ( c1345 ) * tmp2 * u(5,i,j,k) ) )
+               d(5,2,i) = dt * 2.0d+00  &
+     & * ( tx1 * ( r43*c34 - c1345 ) * tmp2 * u(2,i,j,k)  &
+     &   + ty1 * (     c34 - c1345 ) * tmp2 * u(2,i,j,k)  &
+     &   + tz1 * (     c34 - c1345 ) * tmp2 * u(2,i,j,k) )
+               d(5,3,i) = dt * 2.0d+00  &
+     & * ( tx1 * ( c34 - c1345 ) * tmp2 * u(3,i,j,k)  &
+     &   + ty1 * ( r43*c34 -c1345 ) * tmp2 * u(3,i,j,k)  &
+     &   + tz1 * ( c34 - c1345 ) * tmp2 * u(3,i,j,k) )
+               d(5,4,i) = dt * 2.0d+00  &
+     & * ( tx1 * ( c34 - c1345 ) * tmp2 * u(4,i,j,k)  &
+     &   + ty1 * ( c34 - c1345 ) * tmp2 * u(4,i,j,k)  &
+     &   + tz1 * ( r43*c34 - c1345 ) * tmp2 * u(4,i,j,k) )
+               d(5,5,i) = 1.0d+00  &
+     &   + dt * 2.0d+00 * ( tx1 * c1345 * tmp1  &
+     &                    + ty1 * c1345 * tmp1  &
+     &                    + tz1 * c1345 * tmp1 )  &
+     &   + dt * 2.0d+00 * (  tx1 * dx5  &
+     &                    +  ty1 * dy5  &
+     &                    +  tz1 * dz5 )
+
+!---------------------------------------------------------------------
+!   form the first block super-diagonal
+!---------------------------------------------------------------------
+               tmp1 = 1.0d+00 / u(1,i+1,j,k)
+               tmp2 = tmp1 * tmp1
+               tmp3 = tmp1 * tmp2
+
+               a(1,1,i) = - dt * tx1 * dx1
+               a(1,2,i) =   dt * tx2
+               a(1,3,i) =   0.0d+00
+               a(1,4,i) =   0.0d+00
+               a(1,5,i) =   0.0d+00
+
+               a(2,1,i) =  dt * tx2  &
+     &          * ( - ( u(2,i+1,j,k) * tmp1 ) ** 2  &
+     &     + c2 * 0.50d+00 * (  u(2,i+1,j,k) * u(2,i+1,j,k)  &
+     &                        + u(3,i+1,j,k) * u(3,i+1,j,k)  &
+     &                        + u(4,i+1,j,k) * u(4,i+1,j,k) ) * tmp2 )  &
+     &          - dt * tx1 * ( - r43 * c34 * tmp2 * u(2,i+1,j,k) )
+               a(2,2,i) =  dt * tx2  &
+     &          * ( ( 2.0d+00 - c2 ) * ( u(2,i+1,j,k) * tmp1 ) )  &
+     &          - dt * tx1 * ( r43 * c34 * tmp1 )  &
+     &          - dt * tx1 * dx2
+               a(2,3,i) =  dt * tx2  &
+     &              * ( - c2 * ( u(3,i+1,j,k) * tmp1 ) )
+               a(2,4,i) =  dt * tx2  &
+     &              * ( - c2 * ( u(4,i+1,j,k) * tmp1 ) )
+               a(2,5,i) =  dt * tx2 * c2
+
+               a(3,1,i) =  dt * tx2  &
+     &              * ( - ( u(2,i+1,j,k) * u(3,i+1,j,k) ) * tmp2 )  &
+     &         - dt * tx1 * ( - c34 * tmp2 * u(3,i+1,j,k) )
+               a(3,2,i) =  dt * tx2 * ( u(3,i+1,j,k) * tmp1 )
+               a(3,3,i) =  dt * tx2 * ( u(2,i+1,j,k) * tmp1 )  &
+     &          - dt * tx1 * ( c34 * tmp1 )  &
+     &          - dt * tx1 * dx3
+               a(3,4,i) = 0.0d+00
+               a(3,5,i) = 0.0d+00
+
+               a(4,1,i) = dt * tx2  &
+     &          * ( - ( u(2,i+1,j,k)*u(4,i+1,j,k) ) * tmp2 )  &
+     &          - dt * tx1 * ( - c34 * tmp2 * u(4,i+1,j,k) )
+               a(4,2,i) = dt * tx2 * ( u(4,i+1,j,k) * tmp1 )
+               a(4,3,i) = 0.0d+00
+               a(4,4,i) = dt * tx2 * ( u(2,i+1,j,k) * tmp1 )  &
+     &          - dt * tx1 * ( c34 * tmp1 )  &
+     &          - dt * tx1 * dx4
+               a(4,5,i) = 0.0d+00
+
+               a(5,1,i) = dt * tx2  &
+     &          * ( ( c2 * (  u(2,i+1,j,k) * u(2,i+1,j,k)  &
+     &                      + u(3,i+1,j,k) * u(3,i+1,j,k)  &
+     &                      + u(4,i+1,j,k) * u(4,i+1,j,k) ) * tmp2  &
+     &              - c1 * ( u(5,i+1,j,k) * tmp1 ) )  &
+     &          * ( u(2,i+1,j,k) * tmp1 ) )  &
+     &          - dt * tx1  &
+     &          * ( - ( r43*c34 - c1345 ) * tmp3 * ( u(2,i+1,j,k)**2 )  &
+     &              - (     c34 - c1345 ) * tmp3 * ( u(3,i+1,j,k)**2 )  &
+     &              - (     c34 - c1345 ) * tmp3 * ( u(4,i+1,j,k)**2 )  &
+     &              - c1345 * tmp2 * u(5,i+1,j,k) )
+               a(5,2,i) = dt * tx2  &
+     &          * ( c1 * ( u(5,i+1,j,k) * tmp1 )  &
+     &             - 0.50d+00 * c2  &
+     &             * ( (  3.0d+00*u(2,i+1,j,k)*u(2,i+1,j,k)  &
+     &                  + u(3,i+1,j,k)*u(3,i+1,j,k)  &
+     &                  + u(4,i+1,j,k)*u(4,i+1,j,k) ) * tmp2 ) )  &
+     &           - dt * tx1  &
+     &           * ( r43*c34 - c1345 ) * tmp2 * u(2,i+1,j,k)
+               a(5,3,i) = dt * tx2  &
+     &           * ( - c2 * ( u(3,i+1,j,k)*u(2,i+1,j,k) ) * tmp2 )  &
+     &           - dt * tx1  &
+     &           * (  c34 - c1345 ) * tmp2 * u(3,i+1,j,k)
+               a(5,4,i) = dt * tx2  &
+     &           * ( - c2 * ( u(4,i+1,j,k)*u(2,i+1,j,k) ) * tmp2 )  &
+     &           - dt * tx1  &
+     &           * (  c34 - c1345 ) * tmp2 * u(4,i+1,j,k)
+               a(5,5,i) = dt * tx2  &
+     &           * ( c1 * ( u(2,i+1,j,k) * tmp1 ) )  &
+     &           - dt * tx1 * c1345 * tmp1  &
+     &           - dt * tx1 * dx5
+
+!---------------------------------------------------------------------
+!   form the second block super-diagonal
+!---------------------------------------------------------------------
+               tmp1 = 1.0d+00 / u(1,i,j+1,k)
+               tmp2 = tmp1 * tmp1
+               tmp3 = tmp1 * tmp2
+
+               b(1,1,i) = - dt * ty1 * dy1
+               b(1,2,i) =   0.0d+00
+               b(1,3,i) =  dt * ty2
+               b(1,4,i) =   0.0d+00
+               b(1,5,i) =   0.0d+00
+
+               b(2,1,i) =  dt * ty2  &
+     &           * ( - ( u(2,i,j+1,k)*u(3,i,j+1,k) ) * tmp2 )  &
+     &           - dt * ty1 * ( - c34 * tmp2 * u(2,i,j+1,k) )
+               b(2,2,i) =  dt * ty2 * ( u(3,i,j+1,k) * tmp1 )  &
+     &          - dt * ty1 * ( c34 * tmp1 )  &
+     &          - dt * ty1 * dy2
+               b(2,3,i) =  dt * ty2 * ( u(2,i,j+1,k) * tmp1 )
+               b(2,4,i) = 0.0d+00
+               b(2,5,i) = 0.0d+00
+
+               b(3,1,i) =  dt * ty2  &
+     &           * ( - ( u(3,i,j+1,k) * tmp1 ) ** 2  &
+     &      + 0.50d+00 * c2 * ( (  u(2,i,j+1,k) * u(2,i,j+1,k)  &
+     &                           + u(3,i,j+1,k) * u(3,i,j+1,k)  &
+     &                           + u(4,i,j+1,k) * u(4,i,j+1,k) )  &
+     &                          * tmp2 ) )  &
+     &       - dt * ty1 * ( - r43 * c34 * tmp2 * u(3,i,j+1,k) )
+               b(3,2,i) =  dt * ty2  &
+     &                   * ( - c2 * ( u(2,i,j+1,k) * tmp1 ) )
+               b(3,3,i) =  dt * ty2 * ( ( 2.0d+00 - c2 )  &
+     &                   * ( u(3,i,j+1,k) * tmp1 ) )  &
+     &       - dt * ty1 * ( r43 * c34 * tmp1 )  &
+     &       - dt * ty1 * dy3
+               b(3,4,i) =  dt * ty2  &
+     &                   * ( - c2 * ( u(4,i,j+1,k) * tmp1 ) )
+               b(3,5,i) =  dt * ty2 * c2
+
+               b(4,1,i) =  dt * ty2  &
+     &              * ( - ( u(3,i,j+1,k)*u(4,i,j+1,k) ) * tmp2 )  &
+     &       - dt * ty1 * ( - c34 * tmp2 * u(4,i,j+1,k) )
+               b(4,2,i) = 0.0d+00
+               b(4,3,i) =  dt * ty2 * ( u(4,i,j+1,k) * tmp1 )
+               b(4,4,i) =  dt * ty2 * ( u(3,i,j+1,k) * tmp1 )  &
+     &                        - dt * ty1 * ( c34 * tmp1 )  &
+     &                        - dt * ty1 * dy4
+               b(4,5,i) = 0.0d+00
+
+               b(5,1,i) =  dt * ty2  &
+     &          * ( ( c2 * (  u(2,i,j+1,k) * u(2,i,j+1,k)  &
+     &                      + u(3,i,j+1,k) * u(3,i,j+1,k)  &
+     &                      + u(4,i,j+1,k) * u(4,i,j+1,k) ) * tmp2  &
+     &               - c1 * ( u(5,i,j+1,k) * tmp1 ) )  &
+     &          * ( u(3,i,j+1,k) * tmp1 ) )  &
+     &          - dt * ty1  &
+     &          * ( - (     c34 - c1345 )*tmp3*(u(2,i,j+1,k)**2)  &
+     &              - ( r43*c34 - c1345 )*tmp3*(u(3,i,j+1,k)**2)  &
+     &              - (     c34 - c1345 )*tmp3*(u(4,i,j+1,k)**2)  &
+     &              - c1345*tmp2*u(5,i,j+1,k) )
+               b(5,2,i) =  dt * ty2  &
+     &          * ( - c2 * ( u(2,i,j+1,k)*u(3,i,j+1,k) ) * tmp2 )  &
+     &          - dt * ty1  &
+     &          * ( c34 - c1345 ) * tmp2 * u(2,i,j+1,k)
+               b(5,3,i) =  dt * ty2  &
+     &          * ( c1 * ( u(5,i,j+1,k) * tmp1 )  &
+     &          - 0.50d+00 * c2  &
+     &          * ( (  u(2,i,j+1,k)*u(2,i,j+1,k)  &
+     &               + 3.0d+00 * u(3,i,j+1,k)*u(3,i,j+1,k)  &
+     &               + u(4,i,j+1,k)*u(4,i,j+1,k) ) * tmp2 ) )  &
+     &          - dt * ty1  &
+     &          * ( r43*c34 - c1345 ) * tmp2 * u(3,i,j+1,k)
+               b(5,4,i) =  dt * ty2  &
+     &          * ( - c2 * ( u(3,i,j+1,k)*u(4,i,j+1,k) ) * tmp2 )  &
+     &          - dt * ty1 * ( c34 - c1345 ) * tmp2 * u(4,i,j+1,k)
+               b(5,5,i) =  dt * ty2  &
+     &          * ( c1 * ( u(3,i,j+1,k) * tmp1 ) )  &
+     &          - dt * ty1 * c1345 * tmp1  &
+     &          - dt * ty1 * dy5
+
+!---------------------------------------------------------------------
+!   form the third block super-diagonal
+!---------------------------------------------------------------------
+               tmp1 = 1.0d+00 / u(1,i,j,k+1)
+               tmp2 = tmp1 * tmp1
+               tmp3 = tmp1 * tmp2
+
+               c(1,1,i) = - dt * tz1 * dz1
+               c(1,2,i) =   0.0d+00
+               c(1,3,i) =   0.0d+00
+               c(1,4,i) = dt * tz2
+               c(1,5,i) =   0.0d+00
+
+               c(2,1,i) = dt * tz2  &
+     &           * ( - ( u(2,i,j,k+1)*u(4,i,j,k+1) ) * tmp2 )  &
+     &           - dt * tz1 * ( - c34 * tmp2 * u(2,i,j,k+1) )
+               c(2,2,i) = dt * tz2 * ( u(4,i,j,k+1) * tmp1 )  &
+     &           - dt * tz1 * c34 * tmp1  &
+     &           - dt * tz1 * dz2
+               c(2,3,i) = 0.0d+00
+               c(2,4,i) = dt * tz2 * ( u(2,i,j,k+1) * tmp1 )
+               c(2,5,i) = 0.0d+00
+
+               c(3,1,i) = dt * tz2  &
+     &           * ( - ( u(3,i,j,k+1)*u(4,i,j,k+1) ) * tmp2 )  &
+     &           - dt * tz1 * ( - c34 * tmp2 * u(3,i,j,k+1) )
+               c(3,2,i) = 0.0d+00
+               c(3,3,i) = dt * tz2 * ( u(4,i,j,k+1) * tmp1 )  &
+     &           - dt * tz1 * ( c34 * tmp1 )  &
+     &           - dt * tz1 * dz3
+               c(3,4,i) = dt * tz2 * ( u(3,i,j,k+1) * tmp1 )
+               c(3,5,i) = 0.0d+00
+
+               c(4,1,i) = dt * tz2  &
+     &        * ( - ( u(4,i,j,k+1) * tmp1 ) ** 2  &
+     &            + 0.50d+00 * c2  &
+     &            * ( ( u(2,i,j,k+1) * u(2,i,j,k+1)  &
+     &                + u(3,i,j,k+1) * u(3,i,j,k+1)  &
+     &                + u(4,i,j,k+1) * u(4,i,j,k+1) ) * tmp2 ) )  &
+     &        - dt * tz1 * ( - r43 * c34 * tmp2 * u(4,i,j,k+1) )
+               c(4,2,i) = dt * tz2  &
+     &             * ( - c2 * ( u(2,i,j,k+1) * tmp1 ) )
+               c(4,3,i) = dt * tz2  &
+     &             * ( - c2 * ( u(3,i,j,k+1) * tmp1 ) )
+               c(4,4,i) = dt * tz2 * ( 2.0d+00 - c2 )  &
+     &             * ( u(4,i,j,k+1) * tmp1 )  &
+     &             - dt * tz1 * ( r43 * c34 * tmp1 )  &
+     &             - dt * tz1 * dz4
+               c(4,5,i) = dt * tz2 * c2
+
+               c(5,1,i) = dt * tz2  &
+     &     * ( ( c2 * (  u(2,i,j,k+1) * u(2,i,j,k+1)  &
+     &                 + u(3,i,j,k+1) * u(3,i,j,k+1)  &
+     &                 + u(4,i,j,k+1) * u(4,i,j,k+1) ) * tmp2  &
+     &       - c1 * ( u(5,i,j,k+1) * tmp1 ) )  &
+     &            * ( u(4,i,j,k+1) * tmp1 ) )  &
+     &       - dt * tz1  &
+     &       * ( - ( c34 - c1345 ) * tmp3 * (u(2,i,j,k+1)**2)  &
+     &           - ( c34 - c1345 ) * tmp3 * (u(3,i,j,k+1)**2)  &
+     &           - ( r43*c34 - c1345 )* tmp3 * (u(4,i,j,k+1)**2)  &
+     &          - c1345 * tmp2 * u(5,i,j,k+1) )
+               c(5,2,i) = dt * tz2  &
+     &       * ( - c2 * ( u(2,i,j,k+1)*u(4,i,j,k+1) ) * tmp2 )  &
+     &       - dt * tz1 * ( c34 - c1345 ) * tmp2 * u(2,i,j,k+1)
+               c(5,3,i) = dt * tz2  &
+     &       * ( - c2 * ( u(3,i,j,k+1)*u(4,i,j,k+1) ) * tmp2 )  &
+     &       - dt * tz1 * ( c34 - c1345 ) * tmp2 * u(3,i,j,k+1)
+               c(5,4,i) = dt * tz2  &
+     &       * ( c1 * ( u(5,i,j,k+1) * tmp1 )  &
+     &       - 0.50d+00 * c2  &
+     &       * ( (  u(2,i,j,k+1)*u(2,i,j,k+1)  &
+     &            + u(3,i,j,k+1)*u(3,i,j,k+1)  &
+     &            + 3.0d+00*u(4,i,j,k+1)*u(4,i,j,k+1) ) * tmp2 ) )  &
+     &       - dt * tz1 * ( r43*c34 - c1345 ) * tmp2 * u(4,i,j,k+1)
+               c(5,5,i) = dt * tz2  &
+     &       * ( c1 * ( u(4,i,j,k+1) * tmp1 ) )  &
+     &       - dt * tz1 * c1345 * tmp1  &
+     &       - dt * tz1 * dz5
+
+            end do
+
+
+      return
+      end
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/jacu_vec.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/jacu_vec.f90
new file mode 100644
index 000000000..eb91dd3b5
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/jacu_vec.f90
@@ -0,0 +1,388 @@
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine jacu(k)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!   compute the upper triangular part of the jacobian matrix
+!---------------------------------------------------------------------
+
+      use lu_data
+      use timing
+
+      implicit none
+
+
+!---------------------------------------------------------------------
+!  input parameters
+!---------------------------------------------------------------------
+      integer k
+
+!---------------------------------------------------------------------
+!  local variables
+!---------------------------------------------------------------------
+      integer i, j
+      double precision  r43
+      double precision  c1345
+      double precision  c34
+      double precision  tmp1, tmp2, tmp3
+
+
+      if (timeron) call timer_start(t_jacu)
+
+      r43 = ( 4.0d+00 / 3.0d+00 )
+      c1345 = c1 * c3 * c4 * c5
+      c34 = c3 * c4
+
+         do j = jst, jend
+            do i = ist, iend
+
+!---------------------------------------------------------------------
+!   form the block diagonal
+!---------------------------------------------------------------------
+               tmp1 = 1.0d+00 / u(1,i,j,k)
+               tmp2 = tmp1 * tmp1
+               tmp3 = tmp1 * tmp2
+
+               d(1,1,i,j) =  1.0d+00  &
+     &                       + dt * 2.0d+00 * (   tx1 * dx1  &
+     &                                          + ty1 * dy1  &
+     &                                          + tz1 * dz1 )
+               d(1,2,i,j) =  0.0d+00
+               d(1,3,i,j) =  0.0d+00
+               d(1,4,i,j) =  0.0d+00
+               d(1,5,i,j) =  0.0d+00
+
+               d(2,1,i,j) =  dt * 2.0d+00  &
+     &          * (  tx1 * ( - r43 * c34 * tmp2 * u(2,i,j,k) )  &
+     &             + ty1 * ( -       c34 * tmp2 * u(2,i,j,k) )  &
+     &             + tz1 * ( -       c34 * tmp2 * u(2,i,j,k) ) )
+               d(2,2,i,j) =  1.0d+00  &
+     &          + dt * 2.0d+00  &
+     &          * (  tx1 * r43 * c34 * tmp1  &
+     &             + ty1 *       c34 * tmp1  &
+     &             + tz1 *       c34 * tmp1 )  &
+     &          + dt * 2.0d+00 * (   tx1 * dx2  &
+     &                             + ty1 * dy2  &
+     &                             + tz1 * dz2  )
+               d(2,3,i,j) = 0.0d+00
+               d(2,4,i,j) = 0.0d+00
+               d(2,5,i,j) = 0.0d+00
+
+               d(3,1,i,j) = dt * 2.0d+00  &
+     &      * (  tx1 * ( -       c34 * tmp2 * u(3,i,j,k) )  &
+     &         + ty1 * ( - r43 * c34 * tmp2 * u(3,i,j,k) )  &
+     &         + tz1 * ( -       c34 * tmp2 * u(3,i,j,k) ) )
+               d(3,2,i,j) = 0.0d+00
+               d(3,3,i,j) = 1.0d+00  &
+     &         + dt * 2.0d+00  &
+     &              * (  tx1 *       c34 * tmp1  &
+     &                 + ty1 * r43 * c34 * tmp1  &
+     &                 + tz1 *       c34 * tmp1 )  &
+     &         + dt * 2.0d+00 * (  tx1 * dx3  &
+     &                           + ty1 * dy3  &
+     &                           + tz1 * dz3 )
+               d(3,4,i,j) = 0.0d+00
+               d(3,5,i,j) = 0.0d+00
+
+               d(4,1,i,j) = dt * 2.0d+00  &
+     &      * (  tx1 * ( -       c34 * tmp2 * u(4,i,j,k) )  &
+     &         + ty1 * ( -       c34 * tmp2 * u(4,i,j,k) )  &
+     &         + tz1 * ( - r43 * c34 * tmp2 * u(4,i,j,k) ) )
+               d(4,2,i,j) = 0.0d+00
+               d(4,3,i,j) = 0.0d+00
+               d(4,4,i,j) = 1.0d+00  &
+     &         + dt * 2.0d+00  &
+     &              * (  tx1 *       c34 * tmp1  &
+     &                 + ty1 *       c34 * tmp1  &
+     &                 + tz1 * r43 * c34 * tmp1 )  &
+     &         + dt * 2.0d+00 * (  tx1 * dx4  &
+     &                           + ty1 * dy4  &
+     &                           + tz1 * dz4 )
+               d(4,5,i,j) = 0.0d+00
+
+               d(5,1,i,j) = dt * 2.0d+00  &
+     & * ( tx1 * ( - ( r43*c34 - c1345 ) * tmp3 * ( u(2,i,j,k) ** 2 )  &
+     &             - ( c34 - c1345 ) * tmp3 * ( u(3,i,j,k) ** 2 )  &
+     &             - ( c34 - c1345 ) * tmp3 * ( u(4,i,j,k) ** 2 )  &
+     &             - ( c1345 ) * tmp2 * u(5,i,j,k) )  &
+     &   + ty1 * ( - ( c34 - c1345 ) * tmp3 * ( u(2,i,j,k) ** 2 )  &
+     &             - ( r43*c34 - c1345 ) * tmp3 * ( u(3,i,j,k) ** 2 )  &
+     &             - ( c34 - c1345 ) * tmp3 * ( u(4,i,j,k) ** 2 )  &
+     &             - ( c1345 ) * tmp2 * u(5,i,j,k) )  &
+     &   + tz1 * ( - ( c34 - c1345 ) * tmp3 * ( u(2,i,j,k) ** 2 )  &
+     &             - ( c34 - c1345 ) * tmp3 * ( u(3,i,j,k) ** 2 )  &
+     &             - ( r43*c34 - c1345 ) * tmp3 * ( u(4,i,j,k) ** 2 )  &
+     &             - ( c1345 ) * tmp2 * u(5,i,j,k) ) )
+               d(5,2,i,j) = dt * 2.0d+00  &
+     & * ( tx1 * ( r43*c34 - c1345 ) * tmp2 * u(2,i,j,k)  &
+     &   + ty1 * (     c34 - c1345 ) * tmp2 * u(2,i,j,k)  &
+     &   + tz1 * (     c34 - c1345 ) * tmp2 * u(2,i,j,k) )
+               d(5,3,i,j) = dt * 2.0d+00  &
+     & * ( tx1 * ( c34 - c1345 ) * tmp2 * u(3,i,j,k)  &
+     &   + ty1 * ( r43*c34 -c1345 ) * tmp2 * u(3,i,j,k)  &
+     &   + tz1 * ( c34 - c1345 ) * tmp2 * u(3,i,j,k) )
+               d(5,4,i,j) = dt * 2.0d+00  &
+     & * ( tx1 * ( c34 - c1345 ) * tmp2 * u(4,i,j,k)  &
+     &   + ty1 * ( c34 - c1345 ) * tmp2 * u(4,i,j,k)  &
+     &   + tz1 * ( r43*c34 - c1345 ) * tmp2 * u(4,i,j,k) )
+               d(5,5,i,j) = 1.0d+00  &
+     &   + dt * 2.0d+00 * ( tx1 * c1345 * tmp1  &
+     &                    + ty1 * c1345 * tmp1  &
+     &                    + tz1 * c1345 * tmp1 )  &
+     &   + dt * 2.0d+00 * (  tx1 * dx5  &
+     &                    +  ty1 * dy5  &
+     &                    +  tz1 * dz5 )
+
+!---------------------------------------------------------------------
+!   form the first block super-diagonal
+!---------------------------------------------------------------------
+               tmp1 = 1.0d+00 / u(1,i+1,j,k)
+               tmp2 = tmp1 * tmp1
+               tmp3 = tmp1 * tmp2
+
+               a(1,1,i,j) = - dt * tx1 * dx1
+               a(1,2,i,j) =   dt * tx2
+               a(1,3,i,j) =   0.0d+00
+               a(1,4,i,j) =   0.0d+00
+               a(1,5,i,j) =   0.0d+00
+
+               a(2,1,i,j) =  dt * tx2  &
+     &          * ( - ( u(2,i+1,j,k) * tmp1 ) ** 2  &
+     &     + c2 * 0.50d+00 * (  u(2,i+1,j,k) * u(2,i+1,j,k)  &
+     &                        + u(3,i+1,j,k) * u(3,i+1,j,k)  &
+     &                        + u(4,i+1,j,k) * u(4,i+1,j,k) ) * tmp2 )  &
+     &          - dt * tx1 * ( - r43 * c34 * tmp2 * u(2,i+1,j,k) )
+               a(2,2,i,j) =  dt * tx2  &
+     &          * ( ( 2.0d+00 - c2 ) * ( u(2,i+1,j,k) * tmp1 ) )  &
+     &          - dt * tx1 * ( r43 * c34 * tmp1 )  &
+     &          - dt * tx1 * dx2
+               a(2,3,i,j) =  dt * tx2  &
+     &              * ( - c2 * ( u(3,i+1,j,k) * tmp1 ) )
+               a(2,4,i,j) =  dt * tx2  &
+     &              * ( - c2 * ( u(4,i+1,j,k) * tmp1 ) )
+               a(2,5,i,j) =  dt * tx2 * c2
+
+               a(3,1,i,j) =  dt * tx2  &
+     &              * ( - ( u(2,i+1,j,k) * u(3,i+1,j,k) ) * tmp2 )  &
+     &         - dt * tx1 * ( - c34 * tmp2 * u(3,i+1,j,k) )
+               a(3,2,i,j) =  dt * tx2 * ( u(3,i+1,j,k) * tmp1 )
+               a(3,3,i,j) =  dt * tx2 * ( u(2,i+1,j,k) * tmp1 )  &
+     &          - dt * tx1 * ( c34 * tmp1 )  &
+     &          - dt * tx1 * dx3
+               a(3,4,i,j) = 0.0d+00
+               a(3,5,i,j) = 0.0d+00
+
+               a(4,1,i,j) = dt * tx2  &
+     &          * ( - ( u(2,i+1,j,k)*u(4,i+1,j,k) ) * tmp2 )  &
+     &          - dt * tx1 * ( - c34 * tmp2 * u(4,i+1,j,k) )
+               a(4,2,i,j) = dt * tx2 * ( u(4,i+1,j,k) * tmp1 )
+               a(4,3,i,j) = 0.0d+00
+               a(4,4,i,j) = dt * tx2 * ( u(2,i+1,j,k) * tmp1 )  &
+     &          - dt * tx1 * ( c34 * tmp1 )  &
+     &          - dt * tx1 * dx4
+               a(4,5,i,j) = 0.0d+00
+
+               a(5,1,i,j) = dt * tx2  &
+     &          * ( ( c2 * (  u(2,i+1,j,k) * u(2,i+1,j,k)  &
+     &                      + u(3,i+1,j,k) * u(3,i+1,j,k)  &
+     &                      + u(4,i+1,j,k) * u(4,i+1,j,k) ) * tmp2  &
+     &              - c1 * ( u(5,i+1,j,k) * tmp1 ) )  &
+     &          * ( u(2,i+1,j,k) * tmp1 ) )  &
+     &          - dt * tx1  &
+     &          * ( - ( r43*c34 - c1345 ) * tmp3 * ( u(2,i+1,j,k)**2 )  &
+     &              - (     c34 - c1345 ) * tmp3 * ( u(3,i+1,j,k)**2 )  &
+     &              - (     c34 - c1345 ) * tmp3 * ( u(4,i+1,j,k)**2 )  &
+     &              - c1345 * tmp2 * u(5,i+1,j,k) )
+               a(5,2,i,j) = dt * tx2  &
+     &          * ( c1 * ( u(5,i+1,j,k) * tmp1 )  &
+     &             - 0.50d+00 * c2  &
+     &             * ( (  3.0d+00*u(2,i+1,j,k)*u(2,i+1,j,k)  &
+     &                  + u(3,i+1,j,k)*u(3,i+1,j,k)  &
+     &                  + u(4,i+1,j,k)*u(4,i+1,j,k) ) * tmp2 ) )  &
+     &           - dt * tx1  &
+     &           * ( r43*c34 - c1345 ) * tmp2 * u(2,i+1,j,k)
+               a(5,3,i,j) = dt * tx2  &
+     &           * ( - c2 * ( u(3,i+1,j,k)*u(2,i+1,j,k) ) * tmp2 )  &
+     &           - dt * tx1  &
+     &           * (  c34 - c1345 ) * tmp2 * u(3,i+1,j,k)
+               a(5,4,i,j) = dt * tx2  &
+     &           * ( - c2 * ( u(4,i+1,j,k)*u(2,i+1,j,k) ) * tmp2 )  &
+     &           - dt * tx1  &
+     &           * (  c34 - c1345 ) * tmp2 * u(4,i+1,j,k)
+               a(5,5,i,j) = dt * tx2  &
+     &           * ( c1 * ( u(2,i+1,j,k) * tmp1 ) )  &
+     &           - dt * tx1 * c1345 * tmp1  &
+     &           - dt * tx1 * dx5
+
+!---------------------------------------------------------------------
+!   form the second block super-diagonal
+!---------------------------------------------------------------------
+               tmp1 = 1.0d+00 / u(1,i,j+1,k)
+               tmp2 = tmp1 * tmp1
+               tmp3 = tmp1 * tmp2
+
+               b(1,1,i,j) = - dt * ty1 * dy1
+               b(1,2,i,j) =   0.0d+00
+               b(1,3,i,j) =  dt * ty2
+               b(1,4,i,j) =   0.0d+00
+               b(1,5,i,j) =   0.0d+00
+
+               b(2,1,i,j) =  dt * ty2  &
+     &           * ( - ( u(2,i,j+1,k)*u(3,i,j+1,k) ) * tmp2 )  &
+     &           - dt * ty1 * ( - c34 * tmp2 * u(2,i,j+1,k) )
+               b(2,2,i,j) =  dt * ty2 * ( u(3,i,j+1,k) * tmp1 )  &
+     &          - dt * ty1 * ( c34 * tmp1 )  &
+     &          - dt * ty1 * dy2
+               b(2,3,i,j) =  dt * ty2 * ( u(2,i,j+1,k) * tmp1 )
+               b(2,4,i,j) = 0.0d+00
+               b(2,5,i,j) = 0.0d+00
+
+               b(3,1,i,j) =  dt * ty2  &
+     &           * ( - ( u(3,i,j+1,k) * tmp1 ) ** 2  &
+     &      + 0.50d+00 * c2 * ( (  u(2,i,j+1,k) * u(2,i,j+1,k)  &
+     &                           + u(3,i,j+1,k) * u(3,i,j+1,k)  &
+     &                           + u(4,i,j+1,k) * u(4,i,j+1,k) )  &
+     &                          * tmp2 ) )  &
+     &       - dt * ty1 * ( - r43 * c34 * tmp2 * u(3,i,j+1,k) )
+               b(3,2,i,j) =  dt * ty2  &
+     &                   * ( - c2 * ( u(2,i,j+1,k) * tmp1 ) )
+               b(3,3,i,j) =  dt * ty2 * ( ( 2.0d+00 - c2 )  &
+     &                   * ( u(3,i,j+1,k) * tmp1 ) )  &
+     &       - dt * ty1 * ( r43 * c34 * tmp1 )  &
+     &       - dt * ty1 * dy3
+               b(3,4,i,j) =  dt * ty2  &
+     &                   * ( - c2 * ( u(4,i,j+1,k) * tmp1 ) )
+               b(3,5,i,j) =  dt * ty2 * c2
+
+               b(4,1,i,j) =  dt * ty2  &
+     &              * ( - ( u(3,i,j+1,k)*u(4,i,j+1,k) ) * tmp2 )  &
+     &       - dt * ty1 * ( - c34 * tmp2 * u(4,i,j+1,k) )
+               b(4,2,i,j) = 0.0d+00
+               b(4,3,i,j) =  dt * ty2 * ( u(4,i,j+1,k) * tmp1 )
+               b(4,4,i,j) =  dt * ty2 * ( u(3,i,j+1,k) * tmp1 )  &
+     &                        - dt * ty1 * ( c34 * tmp1 )  &
+     &                        - dt * ty1 * dy4
+               b(4,5,i,j) = 0.0d+00
+
+               b(5,1,i,j) =  dt * ty2  &
+     &          * ( ( c2 * (  u(2,i,j+1,k) * u(2,i,j+1,k)  &
+     &                      + u(3,i,j+1,k) * u(3,i,j+1,k)  &
+     &                      + u(4,i,j+1,k) * u(4,i,j+1,k) ) * tmp2  &
+     &               - c1 * ( u(5,i,j+1,k) * tmp1 ) )  &
+     &          * ( u(3,i,j+1,k) * tmp1 ) )  &
+     &          - dt * ty1  &
+     &          * ( - (     c34 - c1345 )*tmp3*(u(2,i,j+1,k)**2)  &
+     &              - ( r43*c34 - c1345 )*tmp3*(u(3,i,j+1,k)**2)  &
+     &              - (     c34 - c1345 )*tmp3*(u(4,i,j+1,k)**2)  &
+     &              - c1345*tmp2*u(5,i,j+1,k) )
+               b(5,2,i,j) =  dt * ty2  &
+     &          * ( - c2 * ( u(2,i,j+1,k)*u(3,i,j+1,k) ) * tmp2 )  &
+     &          - dt * ty1  &
+     &          * ( c34 - c1345 ) * tmp2 * u(2,i,j+1,k)
+               b(5,3,i,j) =  dt * ty2  &
+     &          * ( c1 * ( u(5,i,j+1,k) * tmp1 )  &
+     &          - 0.50d+00 * c2  &
+     &          * ( (  u(2,i,j+1,k)*u(2,i,j+1,k)  &
+     &               + 3.0d+00 * u(3,i,j+1,k)*u(3,i,j+1,k)  &
+     &               + u(4,i,j+1,k)*u(4,i,j+1,k) ) * tmp2 ) )  &
+     &          - dt * ty1  &
+     &          * ( r43*c34 - c1345 ) * tmp2 * u(3,i,j+1,k)
+               b(5,4,i,j) =  dt * ty2  &
+     &          * ( - c2 * ( u(3,i,j+1,k)*u(4,i,j+1,k) ) * tmp2 )  &
+     &          - dt * ty1 * ( c34 - c1345 ) * tmp2 * u(4,i,j+1,k)
+               b(5,5,i,j) =  dt * ty2  &
+     &          * ( c1 * ( u(3,i,j+1,k) * tmp1 ) )  &
+     &          - dt * ty1 * c1345 * tmp1  &
+     &          - dt * ty1 * dy5
+
+!---------------------------------------------------------------------
+!   form the third block super-diagonal
+!---------------------------------------------------------------------
+               tmp1 = 1.0d+00 / u(1,i,j,k+1)
+               tmp2 = tmp1 * tmp1
+               tmp3 = tmp1 * tmp2
+
+               c(1,1,i,j) = - dt * tz1 * dz1
+               c(1,2,i,j) =   0.0d+00
+               c(1,3,i,j) =   0.0d+00
+               c(1,4,i,j) = dt * tz2
+               c(1,5,i,j) =   0.0d+00
+
+               c(2,1,i,j) = dt * tz2  &
+     &           * ( - ( u(2,i,j,k+1)*u(4,i,j,k+1) ) * tmp2 )  &
+     &           - dt * tz1 * ( - c34 * tmp2 * u(2,i,j,k+1) )
+               c(2,2,i,j) = dt * tz2 * ( u(4,i,j,k+1) * tmp1 )  &
+     &           - dt * tz1 * c34 * tmp1  &
+     &           - dt * tz1 * dz2 
+               c(2,3,i,j) = 0.0d+00
+               c(2,4,i,j) = dt * tz2 * ( u(2,i,j,k+1) * tmp1 )
+               c(2,5,i,j) = 0.0d+00
+
+               c(3,1,i,j) = dt * tz2  &
+     &           * ( - ( u(3,i,j,k+1)*u(4,i,j,k+1) ) * tmp2 )  &
+     &           - dt * tz1 * ( - c34 * tmp2 * u(3,i,j,k+1) )
+               c(3,2,i,j) = 0.0d+00
+               c(3,3,i,j) = dt * tz2 * ( u(4,i,j,k+1) * tmp1 )  &
+     &           - dt * tz1 * ( c34 * tmp1 )  &
+     &           - dt * tz1 * dz3
+               c(3,4,i,j) = dt * tz2 * ( u(3,i,j,k+1) * tmp1 )
+               c(3,5,i,j) = 0.0d+00
+
+               c(4,1,i,j) = dt * tz2  &
+     &        * ( - ( u(4,i,j,k+1) * tmp1 ) ** 2  &
+     &            + 0.50d+00 * c2  &
+     &            * ( ( u(2,i,j,k+1) * u(2,i,j,k+1)  &
+     &                + u(3,i,j,k+1) * u(3,i,j,k+1)  &
+     &                + u(4,i,j,k+1) * u(4,i,j,k+1) ) * tmp2 ) )  &
+     &        - dt * tz1 * ( - r43 * c34 * tmp2 * u(4,i,j,k+1) )
+               c(4,2,i,j) = dt * tz2  &
+     &             * ( - c2 * ( u(2,i,j,k+1) * tmp1 ) )
+               c(4,3,i,j) = dt * tz2  &
+     &             * ( - c2 * ( u(3,i,j,k+1) * tmp1 ) )
+               c(4,4,i,j) = dt * tz2 * ( 2.0d+00 - c2 )  &
+     &             * ( u(4,i,j,k+1) * tmp1 )  &
+     &             - dt * tz1 * ( r43 * c34 * tmp1 )  &
+     &             - dt * tz1 * dz4
+               c(4,5,i,j) = dt * tz2 * c2
+
+               c(5,1,i,j) = dt * tz2  &
+     &     * ( ( c2 * (  u(2,i,j,k+1) * u(2,i,j,k+1)  &
+     &                 + u(3,i,j,k+1) * u(3,i,j,k+1)  &
+     &                 + u(4,i,j,k+1) * u(4,i,j,k+1) ) * tmp2  &
+     &       - c1 * ( u(5,i,j,k+1) * tmp1 ) )  &
+     &            * ( u(4,i,j,k+1) * tmp1 ) )  &
+     &       - dt * tz1  &
+     &       * ( - ( c34 - c1345 ) * tmp3 * (u(2,i,j,k+1)**2)  &
+     &           - ( c34 - c1345 ) * tmp3 * (u(3,i,j,k+1)**2)  &
+     &           - ( r43*c34 - c1345 )* tmp3 * (u(4,i,j,k+1)**2)  &
+     &          - c1345 * tmp2 * u(5,i,j,k+1) )
+               c(5,2,i,j) = dt * tz2  &
+     &       * ( - c2 * ( u(2,i,j,k+1)*u(4,i,j,k+1) ) * tmp2 )  &
+     &       - dt * tz1 * ( c34 - c1345 ) * tmp2 * u(2,i,j,k+1)
+               c(5,3,i,j) = dt * tz2  &
+     &       * ( - c2 * ( u(3,i,j,k+1)*u(4,i,j,k+1) ) * tmp2 )  &
+     &       - dt * tz1 * ( c34 - c1345 ) * tmp2 * u(3,i,j,k+1)
+               c(5,4,i,j) = dt * tz2  &
+     &       * ( c1 * ( u(5,i,j,k+1) * tmp1 )  &
+     &       - 0.50d+00 * c2  &
+     &       * ( (  u(2,i,j,k+1)*u(2,i,j,k+1)  &
+     &            + u(3,i,j,k+1)*u(3,i,j,k+1)  &
+     &            + 3.0d+00*u(4,i,j,k+1)*u(4,i,j,k+1) ) * tmp2 ) )  &
+     &       - dt * tz1 * ( r43*c34 - c1345 ) * tmp2 * u(4,i,j,k+1)
+               c(5,5,i,j) = dt * tz2  &
+     &       * ( c1 * ( u(4,i,j,k+1) * tmp1 ) )  &
+     &       - dt * tz1 * c1345 * tmp1  &
+     &       - dt * tz1 * dz5
+
+            end do
+         end do
+
+      if (timeron) call timer_stop(t_jacu)
+
+      return
+      end
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/l2norm.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/l2norm.f90
new file mode 100644
index 000000000..4f42b097c
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/l2norm.f90
@@ -0,0 +1,71 @@
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+      subroutine l2norm ( ldx, ldy, ldz,  &
+     &                    nx0, ny0, nz0,  &
+     &                    ist, iend,  &
+     &                    jst, jend,  &
+     &                    v, sum )
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!   compute the l2-norm of vector v, normalized by the interior point count.
+!---------------------------------------------------------------------
+
+      use timing
+      use mpinpb
+
+      implicit none
+
+!---------------------------------------------------------------------
+!  input parameters
+!---------------------------------------------------------------------
+      integer ldx, ldy, ldz
+      integer nx0, ny0, nz0
+      integer ist, iend
+      integer jst, jend
+      double precision  v(5,-1:ldx+2,-1:ldy+2,*), sum(5)
+
+!---------------------------------------------------------------------
+!  local variables
+!---------------------------------------------------------------------
+      integer i, j, k, m
+      double precision  dummy(5)
+
+      integer IERROR
+
+
+      do m = 1, 5
+         dummy(m) = 0.0d+00
+      end do
+
+      do k = 2, nz0-1
+         do j = jst, jend
+            do i = ist, iend
+               do m = 1, 5
+                  dummy(m) = dummy(m) + v(m,i,j,k) * v(m,i,j,k)
+               end do
+            end do
+         end do
+      end do
+
+!---------------------------------------------------------------------
+!   compute the global sum of individual contributions to dot product.
+!---------------------------------------------------------------------
+      if (timeron) call timer_start(t_rcomm)
+      call MPI_ALLREDUCE( dummy,  &
+     &                    sum,  &
+     &                    5,  &
+     &                    dp_type,  &
+     &                    MPI_SUM,  &
+     &                    comm_solve,  &
+     &                    IERROR )
+      if (timeron) call timer_stop(t_rcomm)
+
+      do m = 1, 5
+         sum(m) = sqrt ( sum(m) / ( dble(nx0-2)*(ny0-2)*(nz0-2) ) )
+      end do
+
+      return
+      end
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/lu.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/lu.f90
new file mode 100644
index 000000000..65210b404
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/lu.f90
@@ -0,0 +1,209 @@
+!-------------------------------------------------------------------------!
+!                                                                         !
+!        N  A  S     P A R A L L E L     B E N C H M A R K S  3.4         !
+!                                                                         !
+!                                   L U                                   !
+!                                                                         !
+!-------------------------------------------------------------------------!
+!                                                                         !
+!    This benchmark is part of the NAS Parallel Benchmark 3.4 suite.      !
+!    It is described in NAS Technical Reports 95-020 and 02-007           !
+!                                                                         !
+!    Permission to use, copy, distribute and modify this software         !
+!    for any purpose with or without fee is hereby granted.  We           !
+!    request, however, that all derived work reference the NAS            !
+!    Parallel Benchmarks 3.4. This software is provided "as is"           !
+!    without express or implied warranty.                                 !
+!                                                                         !
+!    Information on NPB 3.4, including the technical report, the          !
+!    original specifications, source code, results and information        !
+!    on how to submit new results, is available at:                       !
+!                                                                         !
+!           http://www.nas.nasa.gov/Software/NPB/                         !
+!                                                                         !
+!    Send comments or suggestions to  npb@nas.nasa.gov                    !
+!                                                                         !
+!          NAS Parallel Benchmarks Group                                  !
+!          NASA Ames Research Center                                      !
+!          Mail Stop: T27A-1                                              !
+!          Moffett Field, CA   94035-1000                                 !
+!                                                                         !
+!          E-mail:  npb@nas.nasa.gov                                      !
+!          Fax:     (650) 604-3957                                        !
+!                                                                         !
+!-------------------------------------------------------------------------!
+
+!---------------------------------------------------------------------
+!
+! Authors: S. Weeratunga
+!          V. Venkatakrishnan
+!          E. Barszcz
+!          M. Yarrow
+!
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+      program applu
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!
+!   driver for the performance evaluation of the solver for
+!   five coupled parabolic/elliptic partial differential equations.
+!
+!---------------------------------------------------------------------
+
+      use lu_data
+      use mpinpb
+      use timing
+
+      implicit none
+
+      character class
+      logical verified
+      double precision mflops, timer_read
+      integer i, ierr
+      double precision tsum(t_last+2), t1(t_last+2),  &
+     &                 tming(t_last+2), tmaxg(t_last+2)
+      character        t_recs(t_last+2)*8
+
+      data t_recs/'total', 'rhs', 'blts', 'buts', '#jacld', '#jacu',  &
+     &            'exch', 'lcomm', 'ucomm', 'rcomm',  &
+     &            ' totcomp', ' totcomm'/
+
+!---------------------------------------------------------------------
+!   initialize communications
+!---------------------------------------------------------------------
+      call init_comm()
+      if (.not. active) goto 999
+
+!---------------------------------------------------------------------
+!   read input data
+!---------------------------------------------------------------------
+      call read_input(class)
+
+      do i = 1, t_last
+         call timer_clear(i)
+      end do
+
+!---------------------------------------------------------------------
+!   set up processor grid
+!---------------------------------------------------------------------
+      call proc_grid()
+
+!---------------------------------------------------------------------
+!   allocate space
+!---------------------------------------------------------------------
+      call alloc_space()
+
+!---------------------------------------------------------------------
+!   determine the neighbors
+!---------------------------------------------------------------------
+      call neighbors()
+
+!---------------------------------------------------------------------
+!   set up sub-domain sizes
+!---------------------------------------------------------------------
+      call subdomain()
+
+!---------------------------------------------------------------------
+!   set up coefficients
+!---------------------------------------------------------------------
+      call setcoeff()
+
+!---------------------------------------------------------------------
+!   set the boundary values for dependent variables
+!---------------------------------------------------------------------
+      call setbv()
+
+!---------------------------------------------------------------------
+!   set the initial values for dependent variables
+!---------------------------------------------------------------------
+      call setiv()
+
+!---------------------------------------------------------------------
+!   compute the forcing term based on prescribed exact solution
+!---------------------------------------------------------------------
+      call erhs()
+
+!---------------------------------------------------------------------
+!   perform one SSOR iteration to touch all data and program pages
+!---------------------------------------------------------------------
+      call ssor(1)
+
+!---------------------------------------------------------------------
+!   reset the boundary and initial values
+!---------------------------------------------------------------------
+      call setbv()
+      call setiv()
+
+!---------------------------------------------------------------------
+!   perform the SSOR iterations
+!---------------------------------------------------------------------
+      call ssor(itmax)
+
+!---------------------------------------------------------------------
+!   compute the solution error
+!---------------------------------------------------------------------
+      call error()
+
+!---------------------------------------------------------------------
+!   compute the surface integral
+!---------------------------------------------------------------------
+      call pintgr()
+
+!---------------------------------------------------------------------
+!   verification test
+!---------------------------------------------------------------------
+      IF (id.eq.0) THEN
+         call verify ( rsdnm, errnm, frc, class, verified )
+         mflops = 1.0d-6*dble(itmax)*(1984.77*dble( nx0 )  &
+     &        *dble( ny0 )  &
+     &        *dble( nz0 )  &
+     &        -10923.3*(dble( nx0+ny0+nz0 )/3.)**2  &
+     &        +27770.9* dble( nx0+ny0+nz0 )/3.  &
+     &        -144010.)  &
+     &        / maxtime
+
+         call print_results('LU', class, nx0,  &
+     &     ny0, nz0, itmax, no_nodes, total_nodes,  &
+     &     maxtime, mflops, '          floating point', verified,  &
+     &     npbversion, compiletime, cs1, cs2, cs3, cs4, cs5, cs6,  &
+     &     '(none)')
+
+      END IF
+
+      if (.not.timeron) goto 999
+
+      do i = 1, t_last
+         t1(i) = timer_read(i)
+      end do
+      t1(t_rhs) = t1(t_rhs) - t1(t_exch)
+      t1(t_last+2) = t1(t_lcomm)+t1(t_ucomm)+t1(t_rcomm)+t1(t_exch)
+      t1(t_last+1) = t1(t_total) - t1(t_last+2)
+
+      call MPI_Reduce(t1, tsum,  t_last+2, dp_type, MPI_SUM,  &
+     &                0, comm_solve, ierr)
+      call MPI_Reduce(t1, tming, t_last+2, dp_type, MPI_MIN,  &
+     &                0, comm_solve, ierr)
+      call MPI_Reduce(t1, tmaxg, t_last+2, dp_type, MPI_MAX,  &
+     &                0, comm_solve, ierr)
+
+      if (id .eq. 0) then
+         write(*, 800) no_nodes
+         do i = 1, t_last+2
+            if (t_recs(i)(1:1) .ne. '#') then
+               tsum(i) = tsum(i) / no_nodes
+               write(*, 810) i, t_recs(i), tming(i), tmaxg(i), tsum(i)
+            endif
+         end do
+      endif
+ 800  format(' nprocs =', i6, 11x, 'minimum', 5x, 'maximum',  &
+     &       5x, 'average')
+ 810  format(' timer ', i2, '(', A8, ') :', 3(2x,f10.4))
+
+ 999  continue
+      call mpi_finalize(ierr)
+      end
+
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/lu_data.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/lu_data.f90
new file mode 100644
index 000000000..c5d3db8da
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/lu_data.f90
@@ -0,0 +1,193 @@
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+!---  lu_data module
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      module lu_data
+
+!---------------------------------------------------------------------
+!   npbparams.h defines parameters that depend on the class and 
+!   number of nodes
+!---------------------------------------------------------------------
+
+      include 'npbparams.h'
+
+!---------------------------------------------------------------------
+!   parameters which can be overridden in runtime config file
+!   (in addition to size of problem - isiz01,02,03 give the maximum size)
+!   ipr = 1 to print out verbose information
+!   omega = 1.2 is correct for all classes (see omega_default below)
+!   tolrsd gives the tolerance levels for the steady-state residuals
+!---------------------------------------------------------------------
+      integer ipr_default
+      parameter (ipr_default = 1)
+      double precision omega_default
+      parameter (omega_default = 1.2d0)
+      double precision tolrsd1_def, tolrsd2_def, tolrsd3_def,  &
+     &                 tolrsd4_def, tolrsd5_def
+      parameter (tolrsd1_def=1.0e-08,  &
+     &          tolrsd2_def=1.0e-08, tolrsd3_def=1.0e-08,  &
+     &          tolrsd4_def=1.0e-08, tolrsd5_def=1.0e-08)
+
+      double precision c1, c2, c3, c4, c5
+      parameter( c1 = 1.40d+00, c2 = 0.40d+00,  &
+     &           c3 = 1.00d-01, c4 = 1.00d+00,  &
+     &           c5 = 1.40d+00 )
+
+!---------------------------------------------------------------------
+!   grid
+!---------------------------------------------------------------------
+      integer nx, ny, nz
+      integer nx0, ny0, nz0
+      integer ipt, ist, iend
+      integer jpt, jst, jend
+      integer ii1, ii2
+      integer ji1, ji2
+      integer ki1, ki2
+      double precision  dxi, deta, dzeta
+      double precision  tx1, tx2, tx3
+      double precision  ty1, ty2, ty3
+      double precision  tz1, tz2, tz3
+
+!---------------------------------------------------------------------
+!   dissipation
+!---------------------------------------------------------------------
+      double precision dx1, dx2, dx3, dx4, dx5
+      double precision dy1, dy2, dy3, dy4, dy5
+      double precision dz1, dz2, dz3, dz4, dz5
+      double precision dssp
+
+!---------------------------------------------------------------------
+!   field variables and residuals
+!---------------------------------------------------------------------
+      double precision, allocatable ::  &
+     &       u   (:,:,:,:),  &
+     &       rsd (:,:,:,:),  &
+     &       frct(:,:,:,:),  &
+     &       flux(:,:,:,:)
+
+
+!---------------------------------------------------------------------
+!   output control parameters
+!---------------------------------------------------------------------
+      integer ipr, inorm
+
+!---------------------------------------------------------------------
+!   newton-raphson iteration control parameters
+!---------------------------------------------------------------------
+      integer itmax, invert
+      double precision  dt, omega, tolrsd(5),  &
+     &        rsdnm(5), errnm(5), frc, ttotal
+
+      double precision, allocatable ::  &
+     &       a(:,:,:),  &
+     &       b(:,:,:),  &
+     &       c(:,:,:),  &
+     &       d(:,:,:)
+
+!---------------------------------------------------------------------
+!   coefficients of the exact solution
+!---------------------------------------------------------------------
+      double precision ce(5,13)
+
+!---------------------------------------------------------------------
+!   working arrays for surface integral
+!---------------------------------------------------------------------
+      double precision, allocatable ::  &
+     &       phi1(:,:),  &
+     &       phi2(:,:)
+
+!---------------------------------------------------------------------
+!   multi-processor parameters
+!---------------------------------------------------------------------
+      integer id, ndim, num, xdim, ydim, row, col
+
+      integer north,south,east,west
+
+      integer from_s,from_n,from_e,from_w
+      parameter (from_s=1,from_n=2,from_e=3,from_w=4)
+
+      double precision, allocatable ::  &
+     &       buf (:,:),  &
+     &       buf1(:,:)
+
+! sub-domain array size
+      integer isiz1, isiz2, isiz3, nnodes_xdim
+
+
+      end module lu_data
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+!---  timing module
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      module timing
+
+      integer t_total, t_rhs, t_blts, t_buts, t_jacld, t_jacu,  &
+     &        t_exch, t_lcomm, t_ucomm, t_rcomm, t_last
+      parameter (t_total=1, t_rhs=2, t_blts=3, t_buts=4, t_jacld=5,  &
+     &        t_jacu=6, t_exch=7, t_lcomm=8, t_ucomm=9, t_rcomm=10,  &
+     &        t_last=10)
+
+      double precision maxtime
+      logical timeron
+
+      end module timing
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine alloc_space
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+! allocate space dynamically for data arrays
+!---------------------------------------------------------------------
+
+      use lu_data
+      use mpinpb
+
+      implicit none
+
+      integer ios, ierr
+
+!---------------------------------------------------------------------
+! parameters (isiz1, isiz2, isiz3) are set in proc_grid
+!---------------------------------------------------------------------
+      allocate (  &
+     &       u   (5, -1:isiz1+2, -1:isiz2+2, isiz3),  &
+     &       rsd (5, -1:isiz1+2, -1:isiz2+2, isiz3),  &
+     &       frct(5, -1:isiz1+2, -1:isiz2+2, isiz3),  &
+     &       flux(5,  0:isiz1+1,  0:isiz2+1, isiz3),  &
+     &       stat = ios)
+
+      if (ios .eq. 0) allocate (  &
+     &       a(5, 5, isiz1),  &
+     &       b(5, 5, isiz1),  &
+     &       c(5, 5, isiz1),  &
+     &       d(5, 5, isiz1),  &
+     &       phi1(0:isiz2+1, 0:isiz3+1),  &
+     &       phi2(0:isiz2+1, 0:isiz3+1),  &
+     &       stat = ios)
+
+      if (ios .eq. 0) allocate (  &
+     &       buf (5, 2*isiz2*isiz3),  &
+     &       buf1(5, 2*isiz2*isiz3),  &
+     &       stat = ios)
+
+      if (ios .ne. 0) then
+         write(*,*) 'Error encountered in allocating space'
+         call MPI_Abort(MPI_COMM_WORLD, MPI_ERR_OTHER, ierr)
+         stop
+      endif
+
+      return
+      end
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/lu_data_vec.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/lu_data_vec.f90
new file mode 100644
index 000000000..cc74afe43
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/lu_data_vec.f90
@@ -0,0 +1,193 @@
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+!---  lu_data module
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      module lu_data
+
+!---------------------------------------------------------------------
+!   npbparams.h defines parameters that depend on the class and 
+!   number of nodes
+!---------------------------------------------------------------------
+
+      include 'npbparams.h'
+
+!---------------------------------------------------------------------
+!   parameters which can be overridden in runtime config file
+!   (in addition to size of problem - isiz01,02,03 give the maximum size)
+!   ipr = 1 to print out verbose information
+!   omega = 1.2 is correct for all classes (see omega_default below)
+!   tolrsd gives the tolerance levels for the steady-state residuals
+!---------------------------------------------------------------------
+      integer ipr_default
+      parameter (ipr_default = 1)
+      double precision omega_default
+      parameter (omega_default = 1.2d0)
+      double precision tolrsd1_def, tolrsd2_def, tolrsd3_def,  &
+     &                 tolrsd4_def, tolrsd5_def
+      parameter (tolrsd1_def=1.0e-08,  &
+     &          tolrsd2_def=1.0e-08, tolrsd3_def=1.0e-08,  &
+     &          tolrsd4_def=1.0e-08, tolrsd5_def=1.0e-08)
+
+      double precision c1, c2, c3, c4, c5
+      parameter( c1 = 1.40d+00, c2 = 0.40d+00,  &
+     &           c3 = 1.00d-01, c4 = 1.00d+00,  &
+     &           c5 = 1.40d+00 )
+
+!---------------------------------------------------------------------
+!   grid
+!---------------------------------------------------------------------
+      integer nx, ny, nz
+      integer nx0, ny0, nz0
+      integer ipt, ist, iend
+      integer jpt, jst, jend
+      integer ii1, ii2
+      integer ji1, ji2
+      integer ki1, ki2
+      double precision  dxi, deta, dzeta
+      double precision  tx1, tx2, tx3
+      double precision  ty1, ty2, ty3
+      double precision  tz1, tz2, tz3
+
+!---------------------------------------------------------------------
+!   dissipation
+!---------------------------------------------------------------------
+      double precision dx1, dx2, dx3, dx4, dx5
+      double precision dy1, dy2, dy3, dy4, dy5
+      double precision dz1, dz2, dz3, dz4, dz5
+      double precision dssp
+
+!---------------------------------------------------------------------
+!   field variables and residuals
+!---------------------------------------------------------------------
+      double precision, allocatable ::  &
+     &       u   (:,:,:,:),  &
+     &       rsd (:,:,:,:),  &
+     &       frct(:,:,:,:),  &
+     &       flux(:,:,:,:)
+
+
+!---------------------------------------------------------------------
+!   output control parameters
+!---------------------------------------------------------------------
+      integer ipr, inorm
+
+!---------------------------------------------------------------------
+!   newton-raphson iteration control parameters
+!---------------------------------------------------------------------
+      integer itmax, invert
+      double precision  dt, omega, tolrsd(5),  &
+     &        rsdnm(5), errnm(5), frc, ttotal
+
+      double precision, allocatable ::  &
+     &       a(:,:,:,:),  &
+     &       b(:,:,:,:),  &
+     &       c(:,:,:,:),  &
+     &       d(:,:,:,:)
+
+!---------------------------------------------------------------------
+!   coefficients of the exact solution
+!---------------------------------------------------------------------
+      double precision ce(5,13)
+
+!---------------------------------------------------------------------
+!   working arrays for surface integral
+!---------------------------------------------------------------------
+      double precision, allocatable ::  &
+     &       phi1(:,:),  &
+     &       phi2(:,:)
+
+!---------------------------------------------------------------------
+!   multi-processor parameters
+!---------------------------------------------------------------------
+      integer id, ndim, num, xdim, ydim, row, col
+
+      integer north,south,east,west
+
+      integer from_s,from_n,from_e,from_w
+      parameter (from_s=1,from_n=2,from_e=3,from_w=4)
+
+      double precision, allocatable ::  &
+     &       buf (:,:),  &
+     &       buf1(:,:)
+
+! sub-domain array size
+      integer isiz1, isiz2, isiz3, nnodes_xdim
+
+
+      end module lu_data
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+!---  timing module
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      module timing
+
+      integer t_total, t_rhs, t_blts, t_buts, t_jacld, t_jacu,  &
+     &        t_exch, t_lcomm, t_ucomm, t_rcomm, t_last
+      parameter (t_total=1, t_rhs=2, t_blts=3, t_buts=4, t_jacld=5,  &
+     &        t_jacu=6, t_exch=7, t_lcomm=8, t_ucomm=9, t_rcomm=10,  &
+     &        t_last=10)
+
+      double precision maxtime
+      logical timeron
+
+      end module timing
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine alloc_space
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+! allocate space dynamically for data arrays
+!---------------------------------------------------------------------
+
+      use lu_data
+      use mpinpb
+
+      implicit none
+
+      integer ios, ierr
+
+!---------------------------------------------------------------------
+! parameters (isiz1, isiz2, isiz3) are set in proc_grid
+!---------------------------------------------------------------------
+      allocate (  &
+     &       u   (5, -1:isiz1+2, -1:isiz2+2, isiz3),  &
+     &       rsd (5, -1:isiz1+2, -1:isiz2+2, isiz3),  &
+     &       frct(5, -1:isiz1+2, -1:isiz2+2, isiz3),  &
+     &       flux(5,  0:isiz1+1,  0:isiz2+1, isiz3),  &
+     &       stat = ios)
+
+      if (ios .eq. 0) allocate (  &
+     &       a(5, 5, isiz1, isiz2),  &
+     &       b(5, 5, isiz1, isiz2),  &
+     &       c(5, 5, isiz1, isiz2),  &
+     &       d(5, 5, isiz1, isiz2),  &
+     &       phi1(0:isiz2+1, 0:isiz3+1),  &
+     &       phi2(0:isiz2+1, 0:isiz3+1),  &
+     &       stat = ios)
+
+      if (ios .eq. 0) allocate (  &
+     &       buf (5, 2*isiz2*isiz3),  &
+     &       buf1(5, 2*isiz2*isiz3),  &
+     &       stat = ios)
+
+      if (ios .ne. 0) then
+         write(*,*) 'Error encountered in allocating space'
+         call MPI_Abort(MPI_COMM_WORLD, MPI_ERR_OTHER, ierr)
+         stop
+      endif
+
+      return
+      end
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/mpinpb.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/mpinpb.f90
new file mode 100644
index 000000000..5fbe8abd4
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/mpinpb.f90
@@ -0,0 +1,18 @@
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+!
+!  mpinpb module
+!
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      module mpinpb
+
+      include 'mpif.h'
+
+      integer  node, no_nodes, total_nodes, root, comm_solve,  &
+     &         dp_type
+      logical   active
+
+      end module mpinpb
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/neighbors.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/neighbors.f90
new file mode 100644
index 000000000..0c75bf1a5
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/neighbors.f90
@@ -0,0 +1,47 @@
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine neighbors ()
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      use lu_data
+      implicit none
+
+!---------------------------------------------------------------------
+!     figure out the neighbors and their wrap numbers for each processor
+!---------------------------------------------------------------------
+
+      south = -1
+      east  = -1
+      north = -1
+      west  = -1
+
+      if (row.gt.1) then
+              north = id -1
+      else
+              north = -1
+      end if
+
+      if (row.lt.xdim) then
+              south = id + 1
+      else
+              south = -1
+      end if
+
+      if (col.gt.1) then
+              west = id- xdim
+      else
+              west = -1
+      end if
+
+      if (col.lt.ydim) then
+              east = id + xdim
+      else
+              east = -1
+      end if
+
+      return
+      end
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/nodedim.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/nodedim.f90
new file mode 100644
index 000000000..79f8ddf7e
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/nodedim.f90
@@ -0,0 +1,36 @@
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      integer function nodedim(num)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!
+!  compute the exponent where num = 2**nodedim
+!  NOTE: assumes a power-of-two number of nodes
+!
+!---------------------------------------------------------------------
+
+      implicit none
+
+!---------------------------------------------------------------------
+!  input parameters
+!---------------------------------------------------------------------
+      integer num
+
+!---------------------------------------------------------------------
+!  local variables
+!---------------------------------------------------------------------
+      double precision fnum
+
+
+      fnum = dble(num)
+      nodedim = log(fnum)/log(2.0d+0) + 0.00001
+
+      return
+      end
+
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/pintgr.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/pintgr.f90
new file mode 100644
index 000000000..16c6776d8
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/pintgr.f90
@@ -0,0 +1,286 @@
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine pintgr
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      use lu_data
+      use mpinpb
+
+      implicit none
+
+!---------------------------------------------------------------------
+!  local variables
+!---------------------------------------------------------------------
+      integer i, j, k
+      integer ibeg, ifin, ifin1
+      integer jbeg, jfin, jfin1
+      integer iglob, iglob1, iglob2
+      integer jglob, jglob1, jglob2
+      integer ind1, ind2
+      double precision  frc1, frc2, frc3
+      double precision  dummy
+
+      integer IERROR
+
+
+!---------------------------------------------------------------------
+!   set up the sub-domains for integration in each processor
+!---------------------------------------------------------------------
+      ibeg = nx + 1
+      ifin = 0
+      iglob1 = ipt + 1
+      iglob2 = ipt + nx
+      if (iglob1.ge.ii1.and.iglob2.lt.ii2+nx) ibeg = 1
+      if (iglob1.gt.ii1-nx.and.iglob2.le.ii2) ifin = nx
+      if (ii1.ge.iglob1.and.ii1.le.iglob2) ibeg = ii1 - ipt
+      if (ii2.ge.iglob1.and.ii2.le.iglob2) ifin = ii2 - ipt
+      jbeg = ny + 1
+      jfin = 0
+      jglob1 = jpt + 1
+      jglob2 = jpt + ny
+      if (jglob1.ge.ji1.and.jglob2.lt.ji2+ny) jbeg = 1
+      if (jglob1.gt.ji1-ny.and.jglob2.le.ji2) jfin = ny
+      if (ji1.ge.jglob1.and.ji1.le.jglob2) jbeg = ji1 - jpt
+      if (ji2.ge.jglob1.and.ji2.le.jglob2) jfin = ji2 - jpt
+      ifin1 = ifin
+      jfin1 = jfin
+      if (ipt + ifin1.eq.ii2) ifin1 = ifin -1
+      if (jpt + jfin1.eq.ji2) jfin1 = jfin -1
+
+!---------------------------------------------------------------------
+!   initialize
+!---------------------------------------------------------------------
+      do k = 0,isiz3+1
+        do i = 0,isiz2+1
+          phi1(i,k) = 0.
+          phi2(i,k) = 0.
+        end do
+      end do
+
+      do j = jbeg,jfin
+         jglob = jpt + j
+         do i = ibeg,ifin
+            iglob = ipt + i
+
+            k = ki1
+
+            phi1(i,j) = c2*(  u(5,i,j,k)  &
+     &           - 0.50d+00 * (  u(2,i,j,k) ** 2  &
+     &                         + u(3,i,j,k) ** 2  &
+     &                         + u(4,i,j,k) ** 2 )  &
+     &                        / u(1,i,j,k) )
+
+            k = ki2
+
+            phi2(i,j) = c2*(  u(5,i,j,k)  &
+     &           - 0.50d+00 * (  u(2,i,j,k) ** 2  &
+     &                         + u(3,i,j,k) ** 2  &
+     &                         + u(4,i,j,k) ** 2 )  &
+     &                        / u(1,i,j,k) )
+         end do
+      end do
+
+!---------------------------------------------------------------------
+!  communicate in i and j directions
+!---------------------------------------------------------------------
+      call exchange_4(phi1,phi2,ibeg,ifin1,jbeg,jfin1)
+
+      frc1 = 0.0d+00
+
+      do j = jbeg,jfin1
+         do i = ibeg, ifin1
+            frc1 = frc1 + (  phi1(i,j)  &
+     &                     + phi1(i+1,j)  &
+     &                     + phi1(i,j+1)  &
+     &                     + phi1(i+1,j+1)  &
+     &                     + phi2(i,j)  &
+     &                     + phi2(i+1,j)  &
+     &                     + phi2(i,j+1)  &
+     &                     + phi2(i+1,j+1) )
+         end do
+      end do
+
+!---------------------------------------------------------------------
+!  compute the global sum of individual contributions to frc1
+!---------------------------------------------------------------------
+      dummy = frc1
+      call MPI_ALLREDUCE( dummy,  &
+     &                    frc1,  &
+     &                    1,  &
+     &                    dp_type,  &
+     &                    MPI_SUM,  &
+     &                    comm_solve,  &
+     &                    IERROR )
+
+      frc1 = dxi * deta * frc1
+
+!---------------------------------------------------------------------
+!   initialize
+!---------------------------------------------------------------------
+      do k = 0,isiz3+1
+        do i = 0,isiz2+1
+          phi1(i,k) = 0.
+          phi2(i,k) = 0.
+        end do
+      end do
+      jglob = jpt + jbeg
+      ind1 = 0
+      if (jglob.eq.ji1) then
+        ind1 = 1
+        do k = ki1, ki2
+           do i = ibeg, ifin
+              iglob = ipt + i
+              phi1(i,k) = c2*(  u(5,i,jbeg,k)  &
+     &             - 0.50d+00 * (  u(2,i,jbeg,k) ** 2  &
+     &                           + u(3,i,jbeg,k) ** 2  &
+     &                           + u(4,i,jbeg,k) ** 2 )  &
+     &                          / u(1,i,jbeg,k) )
+           end do
+        end do
+      end if
+
+      jglob = jpt + jfin
+      ind2 = 0
+      if (jglob.eq.ji2) then
+        ind2 = 1
+        do k = ki1, ki2
+           do i = ibeg, ifin
+              iglob = ipt + i
+              phi2(i,k) = c2*(  u(5,i,jfin,k)  &
+     &             - 0.50d+00 * (  u(2,i,jfin,k) ** 2  &
+     &                           + u(3,i,jfin,k) ** 2  &
+     &                           + u(4,i,jfin,k) ** 2 )  &
+     &                          / u(1,i,jfin,k) )
+           end do
+        end do
+      end if
+
+!---------------------------------------------------------------------
+!  communicate in i direction
+!---------------------------------------------------------------------
+      if (ind1.eq.1) then
+        call exchange_5(phi1,ibeg,ifin1)
+      end if
+      if (ind2.eq.1) then
+        call exchange_5 (phi2,ibeg,ifin1)
+      end if
+
+      frc2 = 0.0d+00
+      do k = ki1, ki2-1
+         do i = ibeg, ifin1
+            frc2 = frc2 + (  phi1(i,k)  &
+     &                     + phi1(i+1,k)  &
+     &                     + phi1(i,k+1)  &
+     &                     + phi1(i+1,k+1)  &
+     &                     + phi2(i,k)  &
+     &                     + phi2(i+1,k)  &
+     &                     + phi2(i,k+1)  &
+     &                     + phi2(i+1,k+1) )
+         end do
+      end do
+
+!---------------------------------------------------------------------
+!  compute the global sum of individual contributions to frc2
+!---------------------------------------------------------------------
+      dummy = frc2
+      call MPI_ALLREDUCE( dummy,  &
+     &                    frc2,  &
+     &                    1,  &
+     &                    dp_type,  &
+     &                    MPI_SUM,  &
+     &                    comm_solve,  &
+     &                    IERROR )
+
+      frc2 = dxi * dzeta * frc2
+
+!---------------------------------------------------------------------
+!   initialize
+!---------------------------------------------------------------------
+      do k = 0,isiz3+1
+        do i = 0,isiz2+1
+          phi1(i,k) = 0.
+          phi2(i,k) = 0.
+        end do
+      end do
+      iglob = ipt + ibeg
+      ind1 = 0
+      if (iglob.eq.ii1) then
+        ind1 = 1
+        do k = ki1, ki2
+           do j = jbeg, jfin
+              jglob = jpt + j
+              phi1(j,k) = c2*(  u(5,ibeg,j,k)  &
+     &             - 0.50d+00 * (  u(2,ibeg,j,k) ** 2  &
+     &                           + u(3,ibeg,j,k) ** 2  &
+     &                           + u(4,ibeg,j,k) ** 2 )  &
+     &                          / u(1,ibeg,j,k) )
+           end do
+        end do
+      end if
+
+      iglob = ipt + ifin
+      ind2 = 0
+      if (iglob.eq.ii2) then
+        ind2 = 1
+        do k = ki1, ki2
+           do j = jbeg, jfin
+              jglob = jpt + j
+              phi2(j,k) = c2*(  u(5,ifin,j,k)  &
+     &             - 0.50d+00 * (  u(2,ifin,j,k) ** 2  &
+     &                           + u(3,ifin,j,k) ** 2  &
+     &                           + u(4,ifin,j,k) ** 2 )  &
+     &                          / u(1,ifin,j,k) )
+           end do
+        end do
+      end if
+
+!---------------------------------------------------------------------
+!  communicate in j direction
+!---------------------------------------------------------------------
+      if (ind1.eq.1) then
+        call exchange_6(phi1,jbeg,jfin1)
+      end if
+      if (ind2.eq.1) then
+        call exchange_6(phi2,jbeg,jfin1)
+      end if
+
+      frc3 = 0.0d+00
+
+      do k = ki1, ki2-1
+         do j = jbeg, jfin1
+            frc3 = frc3 + (  phi1(j,k)  &
+     &                     + phi1(j+1,k)  &
+     &                     + phi1(j,k+1)  &
+     &                     + phi1(j+1,k+1)  &
+     &                     + phi2(j,k)  &
+     &                     + phi2(j+1,k)  &
+     &                     + phi2(j,k+1)  &
+     &                     + phi2(j+1,k+1) )
+         end do
+      end do
+
+!---------------------------------------------------------------------
+!  compute the global sum of individual contributions to frc3
+!---------------------------------------------------------------------
+      dummy = frc3
+      call MPI_ALLREDUCE( dummy,  &
+     &                    frc3,  &
+     &                    1,  &
+     &                    dp_type,  &
+     &                    MPI_SUM,  &
+     &                    comm_solve,  &
+     &                    IERROR )
+
+      frc3 = deta * dzeta * frc3
+      frc = 0.25d+00 * ( frc1 + frc2 + frc3 )
+!      if (id.eq.0) write (*,1001) frc
+
+      return
+
+ 1001 format (//5x,'surface integral = ',1pe12.5//)
+
+      end
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/proc_grid.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/proc_grid.f90
new file mode 100644
index 000000000..a4d0cd71b
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/proc_grid.f90
@@ -0,0 +1,109 @@
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine isqrt2(i, xdim)
+
+      implicit none
+      integer i, xdim
+
+      integer ydim, square
+
+      xdim = -1
+      if (i <= 0) return
+
+      square = 0
+      xdim = 1
+      do while (square <= i)
+         square = xdim*xdim
+         if (square == i) return
+         xdim = xdim + 1
+      end do
+
+      xdim = xdim - 1
+      ydim = i / xdim
+      do while (xdim*ydim /= i .and. 2*ydim >= xdim)
+         xdim = xdim + 1
+         ydim = i / xdim
+      end do
+
+      if (xdim*ydim /= i .or. 2*ydim < xdim) xdim = -1
+
+      return
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine proc_grid
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      use lu_data
+      use mpinpb
+
+      implicit none
+
+!---------------------------------------------------------------------
+!  local variables
+!---------------------------------------------------------------------
+      integer xdim0, ydim0, IERROR
+      integer xdiv, ydiv
+
+!---------------------------------------------------------------------
+!  calculate sub-domain array size
+!---------------------------------------------------------------------
+      call isqrt2(no_nodes, xdiv)
+
+      if (xdiv .le. 0) then
+         if (id .eq. 0) write(*,2000) no_nodes
+2000     format(' ERROR: could not determine proper proc_grid',  &
+     &          ' for nprocs=', i6)
+         CALL MPI_ABORT( MPI_COMM_WORLD, MPI_ERR_OTHER, IERROR )
+         stop
+      endif
+
+      ydiv = no_nodes/xdiv
+      isiz1 = isiz01/xdiv
+      if (isiz1*xdiv < isiz01) isiz1 = isiz1 + 1
+      isiz2 = isiz02/ydiv
+      if (isiz2*ydiv < isiz02) isiz2 = isiz2 + 1
+      nnodes_xdim = xdiv
+      isiz3 = isiz03
+
+!---------------------------------------------------------------------
+!
+!   set up a two-d grid for processors: column-major ordering of unknowns
+!
+!---------------------------------------------------------------------
+
+      xdim0  = nnodes_xdim
+      ydim0  = no_nodes/xdim0
+
+      ydim   = dsqrt(dble(num))+0.001d0
+      xdim   = num/ydim
+      do while (ydim .ge. ydim0 .and. xdim*ydim .ne. num)
+         ydim = ydim - 1
+         xdim = num/ydim
+      end do
+
+      if (xdim .lt. xdim0 .or. ydim .lt. ydim0 .or.  &
+     &    xdim*ydim .ne. num) then
+         if (id .eq. 0) write(*,2000) num
+         CALL MPI_ABORT( MPI_COMM_WORLD, MPI_ERR_OTHER, IERROR )
+         stop
+      endif
+
+      if (id .eq. 0 .and. num .ne. 2**ndim)  &
+     &   write(*,2100) num, xdim, ydim
+2100  format(' Proc_grid for nprocs =',i6,':',i5,' x',i5/)
+
+      row    = mod(id,xdim) + 1
+      col    = id/xdim + 1
+
+
+      return
+      end
+
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/read_input.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/read_input.f90
new file mode 100644
index 000000000..2759f874c
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/read_input.f90
@@ -0,0 +1,129 @@
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine read_input(class)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      use lu_data
+      use mpinpb
+      use timing
+
+      implicit none
+
+      character class
+      integer IERROR, fstatus
+
+
+!---------------------------------------------------------------------
+!    only root reads the input file
+!    if input file does not exist, it uses defaults
+!       ipr = 1 for detailed progress output
+!       inorm = how often the norm is printed (once every inorm iterations)
+!       itmax = number of pseudo time steps
+!       dt = time step
+!       omega = over-relaxation factor for SSOR
+!       tolrsd = steady state residual tolerance levels
+!       nx, ny, nz = number of grid points in x, y, z directions
+!---------------------------------------------------------------------
+      if (id .eq. root) then
+
+         write(*, 1000)
+
+         call check_timer_flag( timeron )
+
+         open (unit=3,file='inputlu.data',status='old',  &
+     &         access='sequential',form='formatted', iostat=fstatus)
+         if (fstatus .eq. 0) then
+
+            write(*, *) 'Reading from input file inputlu.data'
+
+            read (3,*)
+            read (3,*)
+            read (3,*) ipr, inorm
+            read (3,*)
+            read (3,*)
+            read (3,*) itmax
+            read (3,*)
+            read (3,*)
+            read (3,*) dt
+            read (3,*)
+            read (3,*)
+            read (3,*) omega
+            read (3,*)
+            read (3,*)
+            read (3,*) tolrsd(1),tolrsd(2),tolrsd(3),tolrsd(4),tolrsd(5)
+            read (3,*)
+            read (3,*)
+            read (3,*) nx0, ny0, nz0
+            close(3)
+         else
+            ipr = ipr_default
+            inorm = inorm_default
+            itmax = itmax_default
+            dt = dt_default
+            omega = omega_default
+            tolrsd(1) = tolrsd1_def
+            tolrsd(2) = tolrsd2_def
+            tolrsd(3) = tolrsd3_def
+            tolrsd(4) = tolrsd4_def
+            tolrsd(5) = tolrsd5_def
+            nx0 = isiz01
+            ny0 = isiz02
+            nz0 = isiz03
+         endif
+
+!---------------------------------------------------------------------
+!   check problem size
+!---------------------------------------------------------------------
+         if ( ( nx0 .lt. 4 ) .or.  &
+     &        ( ny0 .lt. 4 ) .or.  &
+     &        ( nz0 .lt. 4 ) ) then
+
+            write (*,2001)
+ 2001       format (5x,'PROBLEM SIZE IS TOO SMALL - ',  &
+     &           /5x,'SET EACH OF NX, NY AND NZ AT LEAST EQUAL TO 5')
+            CALL MPI_ABORT( MPI_COMM_WORLD, MPI_ERR_OTHER, IERROR )
+
+         end if
+
+         if ( ( nx0 .gt. isiz01 ) .or.  &
+     &        ( ny0 .gt. isiz02 ) .or.  &
+     &        ( nz0 .gt. isiz03 ) ) then
+
+            write (*,2002)
+ 2002       format (5x,'PROBLEM SIZE IS TOO LARGE - ',  &
+     &           /5x,'NX, NY AND NZ SHOULD BE LESS THAN OR EQUAL TO ',  &
+     &           /5x,'ISIZ01, ISIZ02 AND ISIZ03 RESPECTIVELY')
+            CALL MPI_ABORT( MPI_COMM_WORLD, MPI_ERR_OTHER, IERROR )
+
+         end if
+
+         call set_class(class)
+
+         write(*, 1001) nx0, ny0, nz0, class
+         write(*, 1002) itmax
+
+         write(*, 1003) total_nodes
+         if (total_nodes .ne. no_nodes) write (*, 1004) no_nodes
+         write(*, *)
+
+ 1000 format(//, ' NAS Parallel Benchmarks 3.4 -- LU Benchmark',/)
+ 1001    format(' Size: ', i4, 'x', i4, 'x', i4, '  (class ', a, ')')
+ 1002    format(' Iterations: ', i4)
+ 1003    format(' Total number of processes: ', i6)
+ 1004    format(' WARNING: Number of processes is not in a form of',  &
+     &          ' (n1*n2, n1/n2 <= 2).'/  &
+     &          ' Number of active processes: ', i6)
+
+
+      end if
+
+      call bcast_inputs
+
+      return
+      end
+
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/rhs.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/rhs.f90
new file mode 100644
index 000000000..86be208f3
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/rhs.f90
@@ -0,0 +1,512 @@
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine rhs
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!   compute the right hand sides
+!---------------------------------------------------------------------
+
+      use lu_data
+      use timing
+
+      implicit none
+
+!---------------------------------------------------------------------
+!  local variables
+!---------------------------------------------------------------------
+      integer i, j, k, m
+      integer iex
+      integer L1, L2
+      integer ist1, iend1
+      integer jst1, jend1
+      double precision  q
+      double precision  u21, u31, u41
+      double precision  tmp
+      double precision  u21i, u31i, u41i, u51i
+      double precision  u21j, u31j, u41j, u51j
+      double precision  u21k, u31k, u41k, u51k
+      double precision  u21im1, u31im1, u41im1, u51im1
+      double precision  u21jm1, u31jm1, u41jm1, u51jm1
+      double precision  u21km1, u31km1, u41km1, u51km1
+
+
+      if (timeron) call timer_start(t_rhs)
+
+      do k = 1, nz
+         do j = 1, ny
+            do i = 1, nx
+               do m = 1, 5
+                  rsd(m,i,j,k) = - frct(m,i,j,k)
+               end do
+            end do
+         end do
+      end do
+
+!---------------------------------------------------------------------
+!   xi-direction flux differences
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!   iex = flag : iex = 0  north/south communication
+!              : iex = 1  east/west communication
+!---------------------------------------------------------------------
+      iex   = 0
+
+!---------------------------------------------------------------------
+!   communicate and receive/send two rows of data
+!---------------------------------------------------------------------
+      if (timeron) call timer_start(t_exch)
+      call exchange_3(u,iex)
+      if (timeron) call timer_stop(t_exch)
+
+      L1 = 0
+      if (north.eq.-1) L1 = 1
+      L2 = nx + 1
+      if (south.eq.-1) L2 = nx
+
+      ist1 = 1
+      iend1 = nx
+      if (north.eq.-1) ist1 = 4
+      if (south.eq.-1) iend1 = nx - 3
+
+      do k = 2, nz - 1
+         do j = jst, jend
+            do i = L1, L2
+               flux(1,i,j,k) = u(2,i,j,k)
+               u21 = u(2,i,j,k) / u(1,i,j,k)
+
+               q = 0.50d+00 * (  u(2,i,j,k) * u(2,i,j,k)  &
+     &                         + u(3,i,j,k) * u(3,i,j,k)  &
+     &                         + u(4,i,j,k) * u(4,i,j,k) )  &
+     &                      / u(1,i,j,k)
+
+               flux(2,i,j,k) = u(2,i,j,k) * u21 + c2 *  &
+     &                        ( u(5,i,j,k) - q )
+               flux(3,i,j,k) = u(3,i,j,k) * u21
+               flux(4,i,j,k) = u(4,i,j,k) * u21
+               flux(5,i,j,k) = ( c1 * u(5,i,j,k) - c2 * q ) * u21
+            end do
+
+            do i = ist, iend
+               do m = 1, 5
+                  rsd(m,i,j,k) =  rsd(m,i,j,k)  &
+     &                 - tx2 * ( flux(m,i+1,j,k) - flux(m,i-1,j,k) )
+               end do
+            end do
+
+            do i = ist, L2
+               tmp = 1.0d+00 / u(1,i,j,k)
+
+               u21i = tmp * u(2,i,j,k)
+               u31i = tmp * u(3,i,j,k)
+               u41i = tmp * u(4,i,j,k)
+               u51i = tmp * u(5,i,j,k)
+
+               tmp = 1.0d+00 / u(1,i-1,j,k)
+
+               u21im1 = tmp * u(2,i-1,j,k)
+               u31im1 = tmp * u(3,i-1,j,k)
+               u41im1 = tmp * u(4,i-1,j,k)
+               u51im1 = tmp * u(5,i-1,j,k)
+
+               flux(2,i,j,k) = (4.0d+00/3.0d+00) * tx3 * (u21i-u21im1)
+               flux(3,i,j,k) = tx3 * ( u31i - u31im1 )
+               flux(4,i,j,k) = tx3 * ( u41i - u41im1 )
+               flux(5,i,j,k) = 0.50d+00 * ( 1.0d+00 - c1*c5 )  &
+     &              * tx3 * ( ( u21i  **2 + u31i  **2 + u41i  **2 )  &
+     &                      - ( u21im1**2 + u31im1**2 + u41im1**2 ) )  &
+     &              + (1.0d+00/6.0d+00)  &
+     &              * tx3 * ( u21i**2 - u21im1**2 )  &
+     &              + c1 * c5 * tx3 * ( u51i - u51im1 )
+            end do
+
+            do i = ist, iend
+               rsd(1,i,j,k) = rsd(1,i,j,k)  &
+     &              + dx1 * tx1 * (            u(1,i-1,j,k)  &
+     &                             - 2.0d+00 * u(1,i,j,k)  &
+     &                             +           u(1,i+1,j,k) )
+               rsd(2,i,j,k) = rsd(2,i,j,k)  &
+     &          + tx3 * c3 * c4 * ( flux(2,i+1,j,k) - flux(2,i,j,k) )  &
+     &              + dx2 * tx1 * (            u(2,i-1,j,k)  &
+     &                             - 2.0d+00 * u(2,i,j,k)  &
+     &                             +           u(2,i+1,j,k) )
+               rsd(3,i,j,k) = rsd(3,i,j,k)  &
+     &          + tx3 * c3 * c4 * ( flux(3,i+1,j,k) - flux(3,i,j,k) )  &
+     &              + dx3 * tx1 * (            u(3,i-1,j,k)  &
+     &                             - 2.0d+00 * u(3,i,j,k)  &
+     &                             +           u(3,i+1,j,k) )
+               rsd(4,i,j,k) = rsd(4,i,j,k)  &
+     &          + tx3 * c3 * c4 * ( flux(4,i+1,j,k) - flux(4,i,j,k) )  &
+     &              + dx4 * tx1 * (            u(4,i-1,j,k)  &
+     &                             - 2.0d+00 * u(4,i,j,k)  &
+     &                             +           u(4,i+1,j,k) )
+               rsd(5,i,j,k) = rsd(5,i,j,k)  &
+     &          + tx3 * c3 * c4 * ( flux(5,i+1,j,k) - flux(5,i,j,k) )  &
+     &              + dx5 * tx1 * (            u(5,i-1,j,k)  &
+     &                             - 2.0d+00 * u(5,i,j,k)  &
+     &                             +           u(5,i+1,j,k) )
+            end do
+
+!---------------------------------------------------------------------
+!   Fourth-order dissipation
+!---------------------------------------------------------------------
+            IF (north.eq.-1) then
+             do m = 1, 5
+               rsd(m,2,j,k) = rsd(m,2,j,k)  &
+     &           - dssp * ( + 5.0d+00 * u(m,2,j,k)  &
+     &                      - 4.0d+00 * u(m,3,j,k)  &
+     &                      +           u(m,4,j,k) )
+               rsd(m,3,j,k) = rsd(m,3,j,k)  &
+     &           - dssp * ( - 4.0d+00 * u(m,2,j,k)  &
+     &                      + 6.0d+00 * u(m,3,j,k)  &
+     &                      - 4.0d+00 * u(m,4,j,k)  &
+     &                      +           u(m,5,j,k) )
+             end do
+            END IF
+
+            do i = ist1,iend1
+               do m = 1, 5
+                  rsd(m,i,j,k) = rsd(m,i,j,k)  &
+     &              - dssp * (            u(m,i-2,j,k)  &
+     &                        - 4.0d+00 * u(m,i-1,j,k)  &
+     &                        + 6.0d+00 * u(m,i,j,k)  &
+     &                        - 4.0d+00 * u(m,i+1,j,k)  &
+     &                        +           u(m,i+2,j,k) )
+               end do
+            end do
+
+            IF (south.eq.-1) then
+             do m = 1, 5
+               rsd(m,nx-2,j,k) = rsd(m,nx-2,j,k)  &
+     &           - dssp * (             u(m,nx-4,j,k)  &
+     &                      - 4.0d+00 * u(m,nx-3,j,k)  &
+     &                      + 6.0d+00 * u(m,nx-2,j,k)  &
+     &                      - 4.0d+00 * u(m,nx-1,j,k)  )
+               rsd(m,nx-1,j,k) = rsd(m,nx-1,j,k)  &
+     &           - dssp * (             u(m,nx-3,j,k)  &
+     &                      - 4.0d+00 * u(m,nx-2,j,k)  &
+     &                      + 5.0d+00 * u(m,nx-1,j,k) )
+             end do
+            END IF
+
+         end do
+      end do 
+
+!---------------------------------------------------------------------
+!   eta-direction flux differences
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!   iex = flag : iex = 1  east/west communication
+!---------------------------------------------------------------------
+      iex   = 1
+
+!---------------------------------------------------------------------
+!   communicate and receive/send two rows of data
+!---------------------------------------------------------------------
+      if (timeron) call timer_start(t_exch)
+      call exchange_3(u,iex)
+      if (timeron) call timer_stop(t_exch)
+
+      L1 = 0
+      if (west.eq.-1) L1 = 1
+      L2 = ny + 1
+      if (east.eq.-1) L2 = ny
+
+      jst1 = 1
+      jend1 = ny
+      if (west.eq.-1) jst1 = 4
+      if (east.eq.-1) jend1 = ny - 3
+
+      do k = 2, nz - 1
+         do j = L1, L2
+            do i = ist, iend
+               flux(1,i,j,k) = u(3,i,j,k)
+               u31 = u(3,i,j,k) / u(1,i,j,k)
+
+               q = 0.50d+00 * (  u(2,i,j,k) * u(2,i,j,k)  &
+     &                         + u(3,i,j,k) * u(3,i,j,k)  &
+     &                         + u(4,i,j,k) * u(4,i,j,k) )  &
+     &                      / u(1,i,j,k)
+
+               flux(2,i,j,k) = u(2,i,j,k) * u31 
+               flux(3,i,j,k) = u(3,i,j,k) * u31 + c2 * (u(5,i,j,k)-q)
+               flux(4,i,j,k) = u(4,i,j,k) * u31
+               flux(5,i,j,k) = ( c1 * u(5,i,j,k) - c2 * q ) * u31
+            end do
+         end do
+
+         do j = jst, jend
+            do i = ist, iend
+               do m = 1, 5
+                  rsd(m,i,j,k) =  rsd(m,i,j,k)  &
+     &                   - ty2 * ( flux(m,i,j+1,k) - flux(m,i,j-1,k) )
+               end do
+            end do
+         end do
+
+         do j = jst, L2
+            do i = ist, iend
+               tmp = 1.0d+00 / u(1,i,j,k)
+
+               u21j = tmp * u(2,i,j,k)
+               u31j = tmp * u(3,i,j,k)
+               u41j = tmp * u(4,i,j,k)
+               u51j = tmp * u(5,i,j,k)
+
+               tmp = 1.0d+00 / u(1,i,j-1,k)
+               u21jm1 = tmp * u(2,i,j-1,k)
+               u31jm1 = tmp * u(3,i,j-1,k)
+               u41jm1 = tmp * u(4,i,j-1,k)
+               u51jm1 = tmp * u(5,i,j-1,k)
+
+               flux(2,i,j,k) = ty3 * ( u21j - u21jm1 )
+               flux(3,i,j,k) = (4.0d+00/3.0d+00) * ty3 * (u31j-u31jm1)
+               flux(4,i,j,k) = ty3 * ( u41j - u41jm1 )
+               flux(5,i,j,k) = 0.50d+00 * ( 1.0d+00 - c1*c5 )  &
+     &              * ty3 * ( ( u21j  **2 + u31j  **2 + u41j  **2 )  &
+     &                      - ( u21jm1**2 + u31jm1**2 + u41jm1**2 ) )  &
+     &              + (1.0d+00/6.0d+00)  &
+     &              * ty3 * ( u31j**2 - u31jm1**2 )  &
+     &              + c1 * c5 * ty3 * ( u51j - u51jm1 )
+            end do
+         end do
+
+         do j = jst, jend
+            do i = ist, iend
+
+               rsd(1,i,j,k) = rsd(1,i,j,k)  &
+     &              + dy1 * ty1 * (            u(1,i,j-1,k)  &
+     &                             - 2.0d+00 * u(1,i,j,k)  &
+     &                             +           u(1,i,j+1,k) )
+
+               rsd(2,i,j,k) = rsd(2,i,j,k)  &
+     &          + ty3 * c3 * c4 * ( flux(2,i,j+1,k) - flux(2,i,j,k) )  &
+     &              + dy2 * ty1 * (            u(2,i,j-1,k)  &
+     &                             - 2.0d+00 * u(2,i,j,k)  &
+     &                             +           u(2,i,j+1,k) )
+
+               rsd(3,i,j,k) = rsd(3,i,j,k)  &
+     &          + ty3 * c3 * c4 * ( flux(3,i,j+1,k) - flux(3,i,j,k) )  &
+     &              + dy3 * ty1 * (            u(3,i,j-1,k)  &
+     &                             - 2.0d+00 * u(3,i,j,k)  &
+     &                             +           u(3,i,j+1,k) )
+
+               rsd(4,i,j,k) = rsd(4,i,j,k)  &
+     &          + ty3 * c3 * c4 * ( flux(4,i,j+1,k) - flux(4,i,j,k) )  &
+     &              + dy4 * ty1 * (            u(4,i,j-1,k)  &
+     &                             - 2.0d+00 * u(4,i,j,k)  &
+     &                             +           u(4,i,j+1,k) )
+
+               rsd(5,i,j,k) = rsd(5,i,j,k)  &
+     &          + ty3 * c3 * c4 * ( flux(5,i,j+1,k) - flux(5,i,j,k) )  &
+     &              + dy5 * ty1 * (            u(5,i,j-1,k)  &
+     &                             - 2.0d+00 * u(5,i,j,k)  &
+     &                             +           u(5,i,j+1,k) )
+
+            end do
+         end do
+
+!---------------------------------------------------------------------
+!   fourth-order dissipation
+!---------------------------------------------------------------------
+         IF (west.eq.-1) then
+            do i = ist, iend
+             do m = 1, 5
+               rsd(m,i,2,k) = rsd(m,i,2,k)  &
+     &           - dssp * ( + 5.0d+00 * u(m,i,2,k)  &
+     &                      - 4.0d+00 * u(m,i,3,k)  &
+     &                      +           u(m,i,4,k) )
+               rsd(m,i,3,k) = rsd(m,i,3,k)  &
+     &           - dssp * ( - 4.0d+00 * u(m,i,2,k)  &
+     &                      + 6.0d+00 * u(m,i,3,k)  &
+     &                      - 4.0d+00 * u(m,i,4,k)  &
+     &                      +           u(m,i,5,k) )
+             end do
+            end do
+         END IF
+
+         do j = jst1, jend1
+            do i = ist, iend
+               do m = 1, 5
+                  rsd(m,i,j,k) = rsd(m,i,j,k)  &
+     &              - dssp * (            u(m,i,j-2,k)  &
+     &                        - 4.0d+00 * u(m,i,j-1,k)  &
+     &                        + 6.0d+00 * u(m,i,j,k)  &
+     &                        - 4.0d+00 * u(m,i,j+1,k)  &
+     &                        +           u(m,i,j+2,k) )
+               end do
+            end do
+         end do
+
+         IF (east.eq.-1) then
+            do i = ist, iend
+             do m = 1, 5
+               rsd(m,i,ny-2,k) = rsd(m,i,ny-2,k)  &
+     &           - dssp * (             u(m,i,ny-4,k)  &
+     &                      - 4.0d+00 * u(m,i,ny-3,k)  &
+     &                      + 6.0d+00 * u(m,i,ny-2,k)  &
+     &                      - 4.0d+00 * u(m,i,ny-1,k)  )
+               rsd(m,i,ny-1,k) = rsd(m,i,ny-1,k)  &
+     &           - dssp * (             u(m,i,ny-3,k)  &
+     &                      - 4.0d+00 * u(m,i,ny-2,k)  &
+     &                      + 5.0d+00 * u(m,i,ny-1,k) )
+             end do
+            end do
+         END IF
+
+      end do
+
+!---------------------------------------------------------------------
+!   zeta-direction flux differences
+!---------------------------------------------------------------------
+      do k = 1, nz
+         do j = jst, jend
+            do i = ist, iend
+               flux(1,i,j,k) = u(4,i,j,k)
+               u41 = u(4,i,j,k) / u(1,i,j,k)
+
+               q = 0.50d+00 * (  u(2,i,j,k) * u(2,i,j,k)  &
+     &                         + u(3,i,j,k) * u(3,i,j,k)  &
+     &                         + u(4,i,j,k) * u(4,i,j,k) )  &
+     &                      / u(1,i,j,k)
+
+               flux(2,i,j,k) = u(2,i,j,k) * u41 
+               flux(3,i,j,k) = u(3,i,j,k) * u41 
+               flux(4,i,j,k) = u(4,i,j,k) * u41 + c2 * (u(5,i,j,k)-q)
+               flux(5,i,j,k) = ( c1 * u(5,i,j,k) - c2 * q ) * u41
+            end do
+         end do
+      end do
+
+      do k = 2, nz - 1
+         do j = jst, jend
+            do i = ist, iend
+               do m = 1, 5
+                  rsd(m,i,j,k) =  rsd(m,i,j,k)  &
+     &                - tz2 * ( flux(m,i,j,k+1) - flux(m,i,j,k-1) )
+               end do
+            end do
+         end do
+      end do
+
+      do k = 2, nz
+         do j = jst, jend
+            do i = ist, iend
+               tmp = 1.0d+00 / u(1,i,j,k)
+
+               u21k = tmp * u(2,i,j,k)
+               u31k = tmp * u(3,i,j,k)
+               u41k = tmp * u(4,i,j,k)
+               u51k = tmp * u(5,i,j,k)
+
+               tmp = 1.0d+00 / u(1,i,j,k-1)
+
+               u21km1 = tmp * u(2,i,j,k-1)
+               u31km1 = tmp * u(3,i,j,k-1)
+               u41km1 = tmp * u(4,i,j,k-1)
+               u51km1 = tmp * u(5,i,j,k-1)
+
+               flux(2,i,j,k) = tz3 * ( u21k - u21km1 )
+               flux(3,i,j,k) = tz3 * ( u31k - u31km1 )
+               flux(4,i,j,k) = (4.0d+00/3.0d+00) * tz3 * (u41k-u41km1)
+               flux(5,i,j,k) = 0.50d+00 * ( 1.0d+00 - c1*c5 )  &
+     &              * tz3 * ( ( u21k  **2 + u31k  **2 + u41k  **2 )  &
+     &                      - ( u21km1**2 + u31km1**2 + u41km1**2 ) )  &
+     &              + (1.0d+00/6.0d+00)  &
+     &              * tz3 * ( u41k**2 - u41km1**2 )  &
+     &              + c1 * c5 * tz3 * ( u51k - u51km1 )
+            end do
+         end do
+      end do
+
+      do k = 2, nz - 1
+         do j = jst, jend
+            do i = ist, iend
+               rsd(1,i,j,k) = rsd(1,i,j,k)  &
+     &              + dz1 * tz1 * (            u(1,i,j,k-1)  &
+     &                             - 2.0d+00 * u(1,i,j,k)  &
+     &                             +           u(1,i,j,k+1) )
+               rsd(2,i,j,k) = rsd(2,i,j,k)  &
+     &          + tz3 * c3 * c4 * ( flux(2,i,j,k+1) - flux(2,i,j,k) )  &
+     &              + dz2 * tz1 * (            u(2,i,j,k-1)  &
+     &                             - 2.0d+00 * u(2,i,j,k)  &
+     &                             +           u(2,i,j,k+1) )
+               rsd(3,i,j,k) = rsd(3,i,j,k)  &
+     &          + tz3 * c3 * c4 * ( flux(3,i,j,k+1) - flux(3,i,j,k) )  &
+     &              + dz3 * tz1 * (            u(3,i,j,k-1)  &
+     &                             - 2.0d+00 * u(3,i,j,k)  &
+     &                             +           u(3,i,j,k+1) )
+               rsd(4,i,j,k) = rsd(4,i,j,k)  &
+     &          + tz3 * c3 * c4 * ( flux(4,i,j,k+1) - flux(4,i,j,k) )  &
+     &              + dz4 * tz1 * (            u(4,i,j,k-1)  &
+     &                             - 2.0d+00 * u(4,i,j,k)  &
+     &                             +           u(4,i,j,k+1) )
+               rsd(5,i,j,k) = rsd(5,i,j,k)  &
+     &          + tz3 * c3 * c4 * ( flux(5,i,j,k+1) - flux(5,i,j,k) )  &
+     &              + dz5 * tz1 * (            u(5,i,j,k-1)  &
+     &                             - 2.0d+00 * u(5,i,j,k)  &
+     &                             +           u(5,i,j,k+1) )
+            end do
+         end do
+      end do
+
+!---------------------------------------------------------------------
+!   fourth-order dissipation
+!---------------------------------------------------------------------
+      do j = jst, jend
+         do i = ist, iend
+            do m = 1, 5
+               rsd(m,i,j,2) = rsd(m,i,j,2)  &
+     &           - dssp * ( + 5.0d+00 * u(m,i,j,2)  &
+     &                      - 4.0d+00 * u(m,i,j,3)  &
+     &                      +           u(m,i,j,4) )
+               rsd(m,i,j,3) = rsd(m,i,j,3)  &
+     &           - dssp * ( - 4.0d+00 * u(m,i,j,2)  &
+     &                      + 6.0d+00 * u(m,i,j,3)  &
+     &                      - 4.0d+00 * u(m,i,j,4)  &
+     &                      +           u(m,i,j,5) )
+            end do
+         end do
+      end do
+
+      do k = 4, nz - 3
+         do j = jst, jend
+            do i = ist, iend
+               do m = 1, 5
+                  rsd(m,i,j,k) = rsd(m,i,j,k)  &
+     &              - dssp * (            u(m,i,j,k-2)  &
+     &                        - 4.0d+00 * u(m,i,j,k-1)  &
+     &                        + 6.0d+00 * u(m,i,j,k)  &
+     &                        - 4.0d+00 * u(m,i,j,k+1)  &
+     &                        +           u(m,i,j,k+2) )
+               end do
+            end do
+         end do
+      end do
+
+      do j = jst, jend
+         do i = ist, iend
+            do m = 1, 5
+               rsd(m,i,j,nz-2) = rsd(m,i,j,nz-2)  &
+     &           - dssp * (             u(m,i,j,nz-4)  &
+     &                      - 4.0d+00 * u(m,i,j,nz-3)  &
+     &                      + 6.0d+00 * u(m,i,j,nz-2)  &
+     &                      - 4.0d+00 * u(m,i,j,nz-1)  )
+               rsd(m,i,j,nz-1) = rsd(m,i,j,nz-1)  &
+     &           - dssp * (             u(m,i,j,nz-3)  &
+     &                      - 4.0d+00 * u(m,i,j,nz-2)  &
+     &                      + 5.0d+00 * u(m,i,j,nz-1) )
+            end do
+         end do
+      end do
+
+      if (timeron) call timer_stop(t_rhs)
+
+      return
+      end
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/setbv.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/setbv.f90
new file mode 100644
index 000000000..333f75949
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/setbv.f90
@@ -0,0 +1,78 @@
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine setbv
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!   set the boundary values of dependent variables
+!---------------------------------------------------------------------
+
+      use lu_data
+      implicit none
+
+!---------------------------------------------------------------------
+!   local variables
+!---------------------------------------------------------------------
+      integer i, j, k
+      integer iglob, jglob
+
+!---------------------------------------------------------------------
+!   set the dependent variable values along the top and bottom faces
+!---------------------------------------------------------------------
+      do j = 1, ny
+         jglob = jpt + j
+         do i = 1, nx
+            iglob = ipt + i
+            call exact( iglob, jglob, 1, u( 1, i, j, 1 ) )
+            call exact( iglob, jglob, nz, u( 1, i, j, nz ) )
+         end do
+      end do
+
+!---------------------------------------------------------------------
+!   set the dependent variable values along north and south faces
+!---------------------------------------------------------------------
+      IF (west.eq.-1) then
+         do k = 1, nz
+            do i = 1, nx
+               iglob = ipt + i
+               call exact( iglob, 1, k, u( 1, i, 1, k ) )
+            end do
+         end do
+      END IF
+
+      IF (east.eq.-1) then
+         do k = 1, nz
+            do i = 1, nx
+               iglob = ipt + i
+               call exact( iglob, ny0, k, u( 1, i, ny, k ) )
+            end do
+         end do
+      END IF
+
+!---------------------------------------------------------------------
+!   set the dependent variable values along east and west faces
+!---------------------------------------------------------------------
+      IF (north.eq.-1) then
+         do k = 1, nz
+            do j = 1, ny
+               jglob = jpt + j
+               call exact( 1, jglob, k, u( 1, 1, j, k ) )
+            end do
+         end do
+      END IF
+
+      IF (south.eq.-1) then
+         do k = 1, nz
+            do j = 1, ny
+               jglob = jpt + j
+               call exact( nx0, jglob, k, u( 1, nx, j, k ) )
+            end do
+         end do
+      END IF
+
+      return
+      end
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/setcoeff.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/setcoeff.f90
new file mode 100644
index 000000000..5870519e2
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/setcoeff.f90
@@ -0,0 +1,158 @@
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine setcoeff
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      use lu_data
+      implicit none
+
+!---------------------------------------------------------------------
+!  local variables
+!---------------------------------------------------------------------
+
+
+!---------------------------------------------------------------------
+!   set up coefficients
+!---------------------------------------------------------------------
+      dxi = 1.0d+00 / ( nx0 - 1 )
+      deta = 1.0d+00 / ( ny0 - 1 )
+      dzeta = 1.0d+00 / ( nz0 - 1 )
+
+      tx1 = 1.0d+00 / ( dxi * dxi )
+      tx2 = 1.0d+00 / ( 2.0d+00 * dxi )
+      tx3 = 1.0d+00 / dxi
+
+      ty1 = 1.0d+00 / ( deta * deta )
+      ty2 = 1.0d+00 / ( 2.0d+00 * deta )
+      ty3 = 1.0d+00 / deta
+
+      tz1 = 1.0d+00 / ( dzeta * dzeta )
+      tz2 = 1.0d+00 / ( 2.0d+00 * dzeta )
+      tz3 = 1.0d+00 / dzeta
+
+      ii1 = 2
+      ii2 = nx0 - 1
+      ji1 = 2
+      ji2 = ny0 - 2
+      ki1 = 3
+      ki2 = nz0 - 1
+
+!---------------------------------------------------------------------
+!   diffusion coefficients
+!---------------------------------------------------------------------
+      dx1 = 0.75d+00
+      dx2 = dx1
+      dx3 = dx1
+      dx4 = dx1
+      dx5 = dx1
+
+      dy1 = 0.75d+00
+      dy2 = dy1
+      dy3 = dy1
+      dy4 = dy1
+      dy5 = dy1
+
+      dz1 = 1.00d+00
+      dz2 = dz1
+      dz3 = dz1
+      dz4 = dz1
+      dz5 = dz1
+
+!---------------------------------------------------------------------
+!   fourth difference dissipation
+!---------------------------------------------------------------------
+      dssp = ( max (dx1, dy1, dz1 ) ) / 4.0d+00
+
+!---------------------------------------------------------------------
+!   coefficients of the exact solution to the first pde
+!---------------------------------------------------------------------
+      ce(1,1) = 2.0d+00
+      ce(1,2) = 0.0d+00
+      ce(1,3) = 0.0d+00
+      ce(1,4) = 4.0d+00
+      ce(1,5) = 5.0d+00
+      ce(1,6) = 3.0d+00
+      ce(1,7) = 5.0d-01
+      ce(1,8) = 2.0d-02
+      ce(1,9) = 1.0d-02
+      ce(1,10) = 3.0d-02
+      ce(1,11) = 5.0d-01
+      ce(1,12) = 4.0d-01
+      ce(1,13) = 3.0d-01
+
+!---------------------------------------------------------------------
+!   coefficients of the exact solution to the second pde
+!---------------------------------------------------------------------
+      ce(2,1) = 1.0d+00
+      ce(2,2) = 0.0d+00
+      ce(2,3) = 0.0d+00
+      ce(2,4) = 0.0d+00
+      ce(2,5) = 1.0d+00
+      ce(2,6) = 2.0d+00
+      ce(2,7) = 3.0d+00
+      ce(2,8) = 1.0d-02
+      ce(2,9) = 3.0d-02
+      ce(2,10) = 2.0d-02
+      ce(2,11) = 4.0d-01
+      ce(2,12) = 3.0d-01
+      ce(2,13) = 5.0d-01
+
+!---------------------------------------------------------------------
+!   coefficients of the exact solution to the third pde
+!---------------------------------------------------------------------
+      ce(3,1) = 2.0d+00
+      ce(3,2) = 2.0d+00
+      ce(3,3) = 0.0d+00
+      ce(3,4) = 0.0d+00
+      ce(3,5) = 0.0d+00
+      ce(3,6) = 2.0d+00
+      ce(3,7) = 3.0d+00
+      ce(3,8) = 4.0d-02
+      ce(3,9) = 3.0d-02
+      ce(3,10) = 5.0d-02
+      ce(3,11) = 3.0d-01
+      ce(3,12) = 5.0d-01
+      ce(3,13) = 4.0d-01
+
+!---------------------------------------------------------------------
+!   coefficients of the exact solution to the fourth pde
+!---------------------------------------------------------------------
+      ce(4,1) = 2.0d+00
+      ce(4,2) = 2.0d+00
+      ce(4,3) = 0.0d+00
+      ce(4,4) = 0.0d+00
+      ce(4,5) = 0.0d+00
+      ce(4,6) = 2.0d+00
+      ce(4,7) = 3.0d+00
+      ce(4,8) = 3.0d-02
+      ce(4,9) = 5.0d-02
+      ce(4,10) = 4.0d-02
+      ce(4,11) = 2.0d-01
+      ce(4,12) = 1.0d-01
+      ce(4,13) = 3.0d-01
+
+!---------------------------------------------------------------------
+!   coefficients of the exact solution to the fifth pde
+!---------------------------------------------------------------------
+      ce(5,1) = 5.0d+00
+      ce(5,2) = 4.0d+00
+      ce(5,3) = 3.0d+00
+      ce(5,4) = 2.0d+00
+      ce(5,5) = 1.0d-01
+      ce(5,6) = 4.0d-01
+      ce(5,7) = 3.0d-01
+      ce(5,8) = 5.0d-02
+      ce(5,9) = 4.0d-02
+      ce(5,10) = 3.0d-02
+      ce(5,11) = 1.0d-01
+      ce(5,12) = 3.0d-01
+      ce(5,13) = 2.0d-01
+
+      return
+      end
+
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/setiv.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/setiv.f90
new file mode 100644
index 000000000..097eb65d0
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/setiv.f90
@@ -0,0 +1,66 @@
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+      subroutine setiv
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!
+!   set the initial values of dependent variables based on tri-linear
+!   interpolation of boundary values in the computational space.
+!
+!---------------------------------------------------------------------
+
+      use lu_data
+      implicit none
+
+!---------------------------------------------------------------------
+!  local variables
+!---------------------------------------------------------------------
+      integer i, j, k, m
+      integer iglob, jglob
+      double precision  xi, eta, zeta
+      double precision  pxi, peta, pzeta
+      double precision  ue_1jk(5),ue_nx0jk(5),ue_i1k(5),  &
+     &        ue_iny0k(5),ue_ij1(5),ue_ijnz(5)
+
+
+      do k = 2, nz - 1
+         zeta = ( dble (k-1) ) / (nz-1)
+         do j = 1, ny
+          jglob = jpt + j
+          IF (jglob.ne.1.and.jglob.ne.ny0) then
+            eta = ( dble (jglob-1) ) / (ny0-1)
+            do i = 1, nx
+              iglob = ipt + i
+              IF (iglob.ne.1.and.iglob.ne.nx0) then
+               xi = ( dble (iglob-1) ) / (nx0-1)
+               call exact (1,jglob,k,ue_1jk)
+               call exact (nx0,jglob,k,ue_nx0jk)
+               call exact (iglob,1,k,ue_i1k)
+               call exact (iglob,ny0,k,ue_iny0k)
+               call exact (iglob,jglob,1,ue_ij1)
+               call exact (iglob,jglob,nz,ue_ijnz)
+               do m = 1, 5
+                  pxi =   ( 1.0d+00 - xi ) * ue_1jk(m)  &
+     &                              + xi   * ue_nx0jk(m)
+                  peta =  ( 1.0d+00 - eta ) * ue_i1k(m)  &
+     &                              + eta   * ue_iny0k(m)
+                  pzeta = ( 1.0d+00 - zeta ) * ue_ij1(m)  &
+     &                              + zeta   * ue_ijnz(m)
+
+                  u( m, i, j, k ) = pxi + peta + pzeta  &
+     &                 - pxi * peta - peta * pzeta - pzeta * pxi  &
+     &                 + pxi * peta * pzeta
+
+               end do
+              END IF
+            end do
+          END IF
+         end do
+      end do
+
+      return
+      end
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/ssor.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/ssor.f90
new file mode 100644
index 000000000..b9c31a7ad
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/ssor.f90
@@ -0,0 +1,287 @@
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine ssor(niter)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!   to perform pseudo-time stepping SSOR iterations
+!   for five nonlinear pde's.
+!---------------------------------------------------------------------
+
+      use lu_data
+      use mpinpb
+      use timing
+
+      implicit none
+
+      integer  niter
+
+!---------------------------------------------------------------------
+!  local variables
+!---------------------------------------------------------------------
+      integer i, j, k, m
+      integer istep, iex
+      double precision  tmp
+      double precision  delunm(5), tv(5,isiz1)
+
+      external timer_read
+      double precision wtime, timer_read
+
+      integer IERROR
+
+ 
+!---------------------------------------------------------------------
+!   begin pseudo-time stepping iterations
+!---------------------------------------------------------------------
+      tmp = 1.0d+00 / ( omega * ( 2.0d+00 - omega ) ) 
+
+!---------------------------------------------------------------------
+!   initialize a,b,c,d to zero (guarantees that page tables have been
+!   formed, if applicable on given architecture, before timestepping).
+!---------------------------------------------------------------------
+      do i=1,isiz1
+         do m=1,5
+            do k=1,5
+               a(k,m,i) = 0.d0
+               b(k,m,i) = 0.d0
+               c(k,m,i) = 0.d0
+               d(k,m,i) = 0.d0
+            enddo
+         enddo
+      enddo
+
+!---------------------------------------------------------------------
+!   compute the steady-state residuals
+!---------------------------------------------------------------------
+      call rhs
+ 
+!---------------------------------------------------------------------
+!   compute the L2 norms of newton iteration residuals
+!---------------------------------------------------------------------
+      call l2norm( isiz1, isiz2, isiz3, nx0, ny0, nz0,  &
+     &             ist, iend, jst, jend,  &
+     &             rsd, rsdnm )
+  
+      do i = 1, t_last
+         call timer_clear(i)
+      end do
+
+      call MPI_BARRIER( comm_solve, IERROR )
+ 
+      call timer_clear(1)
+      call timer_start(1)
+
+!---------------------------------------------------------------------
+!   the timestep loop
+!---------------------------------------------------------------------
+      do istep = 1, niter
+
+         if (id .eq. 0) then
+            if (mod ( istep, 20) .eq. 0 .or.  &
+     &            istep .eq. itmax .or.  &
+     &            istep .eq. 1) then
+               if (niter .gt. 1) write( *, 200) istep
+ 200           format(' Time step ', i4)
+            endif
+         endif
+ 
+!---------------------------------------------------------------------
+!   perform SSOR iteration
+!---------------------------------------------------------------------
+         do k = 2, nz - 1
+            do j = jst, jend
+               do i = ist, iend
+                  do m = 1, 5
+                     rsd(m,i,j,k) = dt * rsd(m,i,j,k)
+                  end do
+               end do
+            end do
+         end do
+ 
+         do k = 2, nz -1 
+!---------------------------------------------------------------------
+!   receive data from north and west
+!---------------------------------------------------------------------
+            if (timeron) call timer_start(t_lcomm)
+            iex = 0
+            call exchange_1( rsd,k,iex )
+            if (timeron) call timer_stop(t_lcomm)
+
+
+            if (timeron) call timer_start(t_blts)
+            do j = jst, jend
+
+!---------------------------------------------------------------------
+!   form the lower triangular part of the jacobian matrix
+!---------------------------------------------------------------------
+               call jacld(j, k)
+ 
+!---------------------------------------------------------------------
+!   perform the lower triangular solution
+!---------------------------------------------------------------------
+               call blts( isiz1, isiz2, isiz3,  &
+     &                    nx, ny, nz, j, k,  &
+     &                    omega,  &
+     &                    rsd,  &
+     &                    a, b, c, d,  &
+     &                    ist, iend, jst, jend,  &
+     &                    nx0, ny0, ipt, jpt)
+            end do
+            if (timeron) call timer_stop(t_blts)
+
+!---------------------------------------------------------------------
+!   send data to east and south
+!---------------------------------------------------------------------
+            if (timeron) call timer_start(t_lcomm)
+            iex = 2
+            call exchange_1( rsd,k,iex )
+            if (timeron) call timer_stop(t_lcomm)
+         end do
+ 
+         do k = nz - 1, 2, -1
+!---------------------------------------------------------------------
+!   receive data from south and east
+!---------------------------------------------------------------------
+            if (timeron) call timer_start(t_ucomm)
+            iex = 1
+            call exchange_1( rsd,k,iex )
+            if (timeron) call timer_stop(t_ucomm)
+
+            if (timeron) call timer_start(t_buts)
+            do j = jend, jst, -1
+
+!---------------------------------------------------------------------
+!   form the strictly upper triangular part of the jacobian matrix
+!---------------------------------------------------------------------
+               call jacu(j, k)
+
+!---------------------------------------------------------------------
+!   perform the upper triangular solution
+!---------------------------------------------------------------------
+               call buts( isiz1, isiz2, isiz3,  &
+     &                    nx, ny, nz, j, k,  &
+     &                    omega,  &
+     &                    rsd, tv,  &
+     &                    d, a, b, c,  &
+     &                    ist, iend, jst, jend,  &
+     &                    nx0, ny0, ipt, jpt)
+            end do
+            if (timeron) call timer_stop(t_buts)
+
+!---------------------------------------------------------------------
+!   send data to north and west
+!---------------------------------------------------------------------
+            if (timeron) call timer_start(t_ucomm)
+            iex = 3
+            call exchange_1( rsd,k,iex )
+            if (timeron) call timer_stop(t_ucomm)
+         end do
+ 
+!---------------------------------------------------------------------
+!   update the variables
+!---------------------------------------------------------------------
+ 
+         do k = 2, nz-1
+            do j = jst, jend
+               do i = ist, iend
+                  do m = 1, 5
+                     u( m, i, j, k ) = u( m, i, j, k )  &
+     &                    + tmp * rsd( m, i, j, k )
+                  end do
+               end do
+            end do
+         end do
+ 
+!---------------------------------------------------------------------
+!   compute the L2 norms of newton iteration corrections
+!---------------------------------------------------------------------
+         if ( mod ( istep, inorm ) .eq. 0 ) then
+            call l2norm( isiz1, isiz2, isiz3, nx0, ny0, nz0,  &
+     &                   ist, iend, jst, jend,  &
+     &                   rsd, delunm )
+!            if ( ipr .eq. 1 .and. id .eq. 0 ) then
+!                write (*,1006) ( delunm(m), m = 1, 5 )
+!            else if ( ipr .eq. 2 .and. id .eq. 0 ) then
+!                write (*,'(i5,f15.6)') istep,delunm(5)
+!            end if
+         end if
+ 
+!---------------------------------------------------------------------
+!   compute the steady-state residuals
+!---------------------------------------------------------------------
+         call rhs
+ 
+!---------------------------------------------------------------------
+!   compute the L2 norms of newton iteration residuals
+!---------------------------------------------------------------------
+         if ( ( mod ( istep, inorm ) .eq. 0 ) .or.  &
+     &        ( istep .eq. itmax ) ) then
+            call l2norm( isiz1, isiz2, isiz3, nx0, ny0, nz0,  &
+     &                   ist, iend, jst, jend,  &
+     &                   rsd, rsdnm )
+!            if ( ipr .eq. 1.and.id.eq.0 ) then
+!                write (*,1007) ( rsdnm(m), m = 1, 5 )
+!            end if
+         end if
+
+!---------------------------------------------------------------------
+!   check the newton-iteration residuals against the tolerance levels
+!---------------------------------------------------------------------
+         if ( ( rsdnm(1) .lt. tolrsd(1) ) .and.  &
+     &        ( rsdnm(2) .lt. tolrsd(2) ) .and.  &
+     &        ( rsdnm(3) .lt. tolrsd(3) ) .and.  &
+     &        ( rsdnm(4) .lt. tolrsd(4) ) .and.  &
+     &        ( rsdnm(5) .lt. tolrsd(5) ) ) then
+            if (id.eq.0) then
+               write (*,1004) istep
+            end if
+            go to 900
+         end if
+ 
+      end do
+  900 continue
+ 
+      call timer_stop(1)
+      wtime = timer_read(1)
+ 
+
+      call MPI_ALLREDUCE( wtime,  &
+     &                    maxtime,  &
+     &                    1,  &
+     &                    MPI_DOUBLE_PRECISION,  &
+     &                    MPI_MAX,  &
+     &                    comm_solve,  &
+     &                    IERROR )
+ 
+
+
+      return
+      
+ 1001 format (1x/5x,'pseudo-time SSOR iteration no.=',i4/)
+ 1004 format (1x/1x,'convergence was achieved after ',i4,  &
+     &   ' pseudo-time steps' )
+ 1006 format (1x/1x,'RMS-norm of SSOR-iteration correction ',  &
+     & 'for first pde  = ',1pe12.5/,  &
+     & 1x,'RMS-norm of SSOR-iteration correction ',  &
+     & 'for second pde = ',1pe12.5/,  &
+     & 1x,'RMS-norm of SSOR-iteration correction ',  &
+     & 'for third pde  = ',1pe12.5/,  &
+     & 1x,'RMS-norm of SSOR-iteration correction ',  &
+     & 'for fourth pde = ',1pe12.5/,  &
+     & 1x,'RMS-norm of SSOR-iteration correction ',  &
+     & 'for fifth pde  = ',1pe12.5)
+ 1007 format (1x/1x,'RMS-norm of steady-state residual for ',  &
+     & 'first pde  = ',1pe12.5/,  &
+     & 1x,'RMS-norm of steady-state residual for ',  &
+     & 'second pde = ',1pe12.5/,  &
+     & 1x,'RMS-norm of steady-state residual for ',  &
+     & 'third pde  = ',1pe12.5/,  &
+     & 1x,'RMS-norm of steady-state residual for ',  &
+     & 'fourth pde = ',1pe12.5/,  &
+     & 1x,'RMS-norm of steady-state residual for ',  &
+     & 'fifth pde  = ',1pe12.5)
+ 
+      end
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/ssor_vec.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/ssor_vec.f90
new file mode 100644
index 000000000..053d89024
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/ssor_vec.f90
@@ -0,0 +1,246 @@
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine ssor(niter)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!   to perform pseudo-time stepping SSOR iterations
+!   for five nonlinear pde's.
+!---------------------------------------------------------------------
+
+      use lu_data
+      use mpinpb
+      use timing
+
+      implicit none
+      integer  niter
+
+
+!---------------------------------------------------------------------
+!  local variables
+!---------------------------------------------------------------------
+      integer i, j, k, m
+      integer istep
+      double precision  tmp
+      double precision  delunm(5), tv(5,isiz1,isiz2)
+
+      external timer_read
+      double precision wtime, timer_read
+
+      integer IERROR
+
+ 
+!---------------------------------------------------------------------
+!   begin pseudo-time stepping iterations
+!---------------------------------------------------------------------
+      tmp = 1.0d+00 / ( omega * ( 2.0d+00 - omega ) ) 
+
+!---------------------------------------------------------------------
+!   initialize a,b,c,d to zero (guarantees that page tables have been
+!   formed, if applicable on given architecture, before timestepping).
+!---------------------------------------------------------------------
+      do j=1,isiz2
+         do i=1,isiz1
+            do m=1,5
+               do k=1,5
+                  a(k,m,i,j) = 0.d0
+                  b(k,m,i,j) = 0.d0
+                  c(k,m,i,j) = 0.d0
+                  d(k,m,i,j) = 0.d0
+               enddo
+            enddo
+         enddo
+      enddo
+
+!---------------------------------------------------------------------
+!   compute the steady-state residuals
+!---------------------------------------------------------------------
+      call rhs
+ 
+!---------------------------------------------------------------------
+!   compute the L2 norms of newton iteration residuals
+!---------------------------------------------------------------------
+      call l2norm( isiz1, isiz2, isiz3, nx0, ny0, nz0,  &
+     &             ist, iend, jst, jend,  &
+     &             rsd, rsdnm )
+  
+      do i = 1, t_last
+         call timer_clear(i)
+      end do
+
+      call MPI_BARRIER( comm_solve, IERROR )
+ 
+      call timer_clear(1)
+      call timer_start(1)
+
+!---------------------------------------------------------------------
+!   the timestep loop
+!---------------------------------------------------------------------
+      do istep = 1, niter
+
+         if (id .eq. 0) then
+            if (mod ( istep, 20) .eq. 0 .or.  &
+     &            istep .eq. itmax .or.  &
+     &            istep .eq. 1) then
+               if (niter .gt. 1) write( *, 200) istep
+ 200           format(' Time step ', i4)
+            endif
+         endif
+ 
+!---------------------------------------------------------------------
+!   perform SSOR iteration
+!---------------------------------------------------------------------
+         do k = 2, nz - 1
+            do j = jst, jend
+               do i = ist, iend
+                  do m = 1, 5
+                     rsd(m,i,j,k) = dt * rsd(m,i,j,k)
+                  end do
+               end do
+            end do
+         end do
+ 
+         do k = 2, nz -1 
+!---------------------------------------------------------------------
+!   form the lower triangular part of the jacobian matrix
+!---------------------------------------------------------------------
+            call jacld(k)
+ 
+!---------------------------------------------------------------------
+!   perform the lower triangular solution
+!---------------------------------------------------------------------
+            call blts( isiz1, isiz2, isiz3,  &
+     &                 nx, ny, nz, k,  &
+     &                 omega,  &
+     &                 rsd,  &
+     &                 a, b, c, d,  &
+     &                 ist, iend, jst, jend,  &
+     &                 nx0, ny0, ipt, jpt)
+         end do
+ 
+         do k = nz - 1, 2, -1
+!---------------------------------------------------------------------
+!   form the strictly upper triangular part of the jacobian matrix
+!---------------------------------------------------------------------
+            call jacu(k)
+
+!---------------------------------------------------------------------
+!   perform the upper triangular solution
+!---------------------------------------------------------------------
+            call buts( isiz1, isiz2, isiz3,  &
+     &                 nx, ny, nz, k,  &
+     &                 omega,  &
+     &                 rsd, tv,  &
+     &                 d, a, b, c,  &
+     &                 ist, iend, jst, jend,  &
+     &                 nx0, ny0, ipt, jpt)
+         end do
+ 
+!---------------------------------------------------------------------
+!   update the variables
+!---------------------------------------------------------------------
+ 
+         do k = 2, nz-1
+            do j = jst, jend
+               do i = ist, iend
+                  do m = 1, 5
+                     u( m, i, j, k ) = u( m, i, j, k )  &
+     &                    + tmp * rsd( m, i, j, k )
+                  end do
+               end do
+            end do
+         end do
+ 
+!---------------------------------------------------------------------
+!   compute the L2 norms of newton iteration corrections
+!---------------------------------------------------------------------
+         if ( mod ( istep, inorm ) .eq. 0 ) then
+            call l2norm( isiz1, isiz2, isiz3, nx0, ny0, nz0,  &
+     &                   ist, iend, jst, jend,  &
+     &                   rsd, delunm )
+!            if ( ipr .eq. 1 .and. id .eq. 0 ) then
+!                write (*,1006) ( delunm(m), m = 1, 5 )
+!            else if ( ipr .eq. 2 .and. id .eq. 0 ) then
+!                write (*,'(i5,f15.6)') istep,delunm(5)
+!            end if
+         end if
+ 
+!---------------------------------------------------------------------
+!   compute the steady-state residuals
+!---------------------------------------------------------------------
+         call rhs
+ 
+!---------------------------------------------------------------------
+!   compute the L2 norms of newton iteration residuals
+!---------------------------------------------------------------------
+         if ( ( mod ( istep, inorm ) .eq. 0 ) .or.  &
+     &        ( istep .eq. itmax ) ) then
+            call l2norm( isiz1, isiz2, isiz3, nx0, ny0, nz0,  &
+     &                   ist, iend, jst, jend,  &
+     &                   rsd, rsdnm )
+!            if ( ipr .eq. 1.and.id.eq.0 ) then
+!                write (*,1007) ( rsdnm(m), m = 1, 5 )
+!            end if
+         end if
+
+!---------------------------------------------------------------------
+!   check the newton-iteration residuals against the tolerance levels
+!---------------------------------------------------------------------
+         if ( ( rsdnm(1) .lt. tolrsd(1) ) .and.  &
+     &        ( rsdnm(2) .lt. tolrsd(2) ) .and.  &
+     &        ( rsdnm(3) .lt. tolrsd(3) ) .and.  &
+     &        ( rsdnm(4) .lt. tolrsd(4) ) .and.  &
+     &        ( rsdnm(5) .lt. tolrsd(5) ) ) then
+            if (id.eq.0) then
+               write (*,1004) istep
+            end if
+            go to 900
+         end if
+ 
+      end do
+  900 continue
+ 
+      call timer_stop(1)
+      wtime = timer_read(1)
+ 
+
+      call MPI_ALLREDUCE( wtime,  &
+     &                    maxtime,  &
+     &                    1,  &
+     &                    MPI_DOUBLE_PRECISION,  &
+     &                    MPI_MAX,  &
+     &                    comm_solve,  &
+     &                    IERROR )
+ 
+
+
+      return
+      
+ 1001 format (1x/5x,'pseudo-time SSOR iteration no.=',i4/)
+ 1004 format (1x/1x,'convergence was achieved after ',i4,  &
+     &   ' pseudo-time steps' )
+ 1006 format (1x/1x,'RMS-norm of SSOR-iteration correction ',  &
+     & 'for first pde  = ',1pe12.5/,  &
+     & 1x,'RMS-norm of SSOR-iteration correction ',  &
+     & 'for second pde = ',1pe12.5/,  &
+     & 1x,'RMS-norm of SSOR-iteration correction ',  &
+     & 'for third pde  = ',1pe12.5/,  &
+     & 1x,'RMS-norm of SSOR-iteration correction ',  &
+     & 'for fourth pde = ',1pe12.5/,  &
+     & 1x,'RMS-norm of SSOR-iteration correction ',  &
+     & 'for fifth pde  = ',1pe12.5)
+ 1007 format (1x/1x,'RMS-norm of steady-state residual for ',  &
+     & 'first pde  = ',1pe12.5/,  &
+     & 1x,'RMS-norm of steady-state residual for ',  &
+     & 'second pde = ',1pe12.5/,  &
+     & 1x,'RMS-norm of steady-state residual for ',  &
+     & 'third pde  = ',1pe12.5/,  &
+     & 1x,'RMS-norm of steady-state residual for ',  &
+     & 'fourth pde = ',1pe12.5/,  &
+     & 1x,'RMS-norm of steady-state residual for ',  &
+     & 'fifth pde  = ',1pe12.5)
+ 
+      end
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/subdomain.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/subdomain.f90
new file mode 100644
index 000000000..567c18bcc
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/subdomain.f90
@@ -0,0 +1,105 @@
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine subdomain
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      use lu_data
+      use mpinpb
+
+      implicit none
+
+!---------------------------------------------------------------------
+!  local variables
+!---------------------------------------------------------------------
+      integer mm, ierror, errorcode
+
+
+!---------------------------------------------------------------------
+!
+!   set up the sub-domain sizes
+!
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!   x dimension
+!---------------------------------------------------------------------
+      mm   = mod(nx0,xdim)
+      if (row.le.mm) then
+        nx = nx0/xdim + 1
+        ipt = (row-1)*nx
+      else
+        nx = nx0/xdim
+        ipt = (row-1)*nx + mm
+      end if
+
+!---------------------------------------------------------------------
+!   y dimension
+!---------------------------------------------------------------------
+      mm   = mod(ny0,ydim)
+      if (col.le.mm) then
+        ny = ny0/ydim + 1
+        jpt = (col-1)*ny
+      else
+        ny = ny0/ydim
+        jpt = (col-1)*ny + mm
+      end if
+
+!---------------------------------------------------------------------
+!   z dimension
+!---------------------------------------------------------------------
+      nz = nz0
+
+!---------------------------------------------------------------------
+!   check the sub-domain size
+!---------------------------------------------------------------------
+      if ( ( nx .lt. 3 ) .or.  &
+     &     ( ny .lt. 3 ) .or.  &
+     &     ( nz .lt. 3 ) ) then
+         write (*,2001) nx, ny, nz
+ 2001    format (5x,'SUBDOMAIN SIZE IS TOO SMALL - ',  &
+     &        /5x,'ADJUST PROBLEM SIZE OR NUMBER OF PROCESSORS',  &
+     &        /5x,'SO THAT NX, NY AND NZ ARE GREATER THAN OR EQUAL',  &
+     &        /5x,'TO 3.  THEY ARE CURRENTLY', 3I5)
+          ERRORCODE = 1
+          CALL MPI_ABORT( MPI_COMM_WORLD,  &
+     &                    ERRORCODE,  &
+     &                    IERROR )
+      end if
+
+      if ( ( nx .gt. isiz1 ) .or.  &
+     &     ( ny .gt. isiz2 ) .or.  &
+     &     ( nz .gt. isiz3 ) ) then
+         write (*,2002) nx, ny, nz
+ 2002    format (5x,'SUBDOMAIN SIZE IS TOO LARGE - ',  &
+     &        /5x,'ADJUST PROBLEM SIZE OR NUMBER OF PROCESSORS',  &
+     &        /5x,'SO THAT NX, NY AND NZ ARE LESS THAN OR EQUAL TO ',  &
+     &        /5x,'ISIZ1, ISIZ2 AND ISIZ3 RESPECTIVELY.  THEY ARE',  &
+     &        /5x,'CURRENTLY', 3I5)
+          ERRORCODE = 1
+          CALL MPI_ABORT( MPI_COMM_WORLD,  &
+     &                    ERRORCODE,  &
+     &                    IERROR )
+      end if
+
+
+!---------------------------------------------------------------------
+!   set up the start and end in i and j extents for all processors
+!---------------------------------------------------------------------
+      ist = 1
+      iend = nx
+      if (north.eq.-1) ist = 2
+      if (south.eq.-1) iend = nx - 1
+
+      jst = 1
+      jend = ny
+      if (west.eq.-1) jst = 2
+      if (east.eq.-1) jend = ny - 1
+
+      return
+      end
+
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/verify.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/verify.f90
new file mode 100644
index 000000000..13274e1bc
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/LU/verify.f90
@@ -0,0 +1,492 @@
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+        subroutine set_class(class)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!  set problem class based on problem size
+!---------------------------------------------------------------------
+
+        use lu_data
+        implicit none
+
+        character class
+
+
+        if ( (nx0  .eq. 12     ) .and.  &
+     &       (ny0  .eq. 12     ) .and.  &
+     &       (nz0  .eq. 12     ) .and.  &
+     &       (itmax   .eq. 50    ))  then
+
+           class = 'S'
+
+        elseif ( (nx0 .eq. 33) .and.  &
+     &           (ny0 .eq. 33) .and.  &
+     &           (nz0 .eq. 33) .and.  &
+     &           (itmax  .eq. 300) ) then
+
+           class = 'W'   !SPEC95fp size
+
+        elseif ( (nx0 .eq. 64) .and.  &
+     &           (ny0 .eq. 64) .and.  &
+     &           (nz0 .eq. 64) .and.  &
+     &           (itmax  .eq. 250) ) then
+
+           class = 'A'
+
+        elseif ( (nx0 .eq. 102) .and.  &
+     &           (ny0 .eq. 102) .and.  &
+     &           (nz0 .eq. 102) .and.  &
+     &           (itmax  .eq. 250) ) then
+
+           class = 'B'
+
+        elseif ( (nx0 .eq. 162) .and.  &
+     &           (ny0 .eq. 162) .and.  &
+     &           (nz0 .eq. 162) .and.  &
+     &           (itmax  .eq. 250) ) then
+
+           class = 'C'
+
+        elseif ( (nx0 .eq. 408) .and.  &
+     &           (ny0 .eq. 408) .and.  &
+     &           (nz0 .eq. 408) .and.  &
+     &           (itmax  .eq. 300) ) then
+
+           class = 'D'
+
+        elseif ( (nx0 .eq. 1020) .and.  &
+     &           (ny0 .eq. 1020) .and.  &
+     &           (nz0 .eq. 1020) .and.  &
+     &           (itmax  .eq. 300) ) then
+
+           class = 'E'
+
+        elseif ( (nx0 .eq. 2560) .and.  &
+     &           (ny0 .eq. 2560) .and.  &
+     &           (nz0 .eq. 2560) .and.  &
+     &           (itmax  .eq. 300) ) then
+
+           class = 'F'
+
+        else
+
+           class = 'U'
+
+        endif
+
+        return
+        end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+        subroutine verify(xcr, xce, xci, class, verified)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!  verification routine                         
+!---------------------------------------------------------------------
+
+        use, intrinsic :: ieee_arithmetic, only : ieee_is_nan
+
+        use lu_data
+        implicit none
+
+        double precision xcr(5), xce(5), xci
+        double precision xcrref(5),xceref(5),xciref,  &
+     &                   xcrdif(5),xcedif(5),xcidif,  &
+     &                   epsilon, dtref
+        integer m
+        character class
+        logical verified
+
+!---------------------------------------------------------------------
+!   tolerance level
+!---------------------------------------------------------------------
+        epsilon = 1.0d-08
+
+        verified = .true.
+
+        do m = 1,5
+           xcrref(m) = 1.0
+           xceref(m) = 1.0
+        end do
+        xciref = 1.0
+
+        if ( class .eq. 'S' ) then
+
+           dtref = 5.0d-1
+!---------------------------------------------------------------------
+!   Reference values of RMS-norms of residual, for the (12X12X12) grid,
+!   after 50 time steps, with  DT = 5.0d-01
+!---------------------------------------------------------------------
+         xcrref(1) = 1.6196343210976702d-02
+         xcrref(2) = 2.1976745164821318d-03
+         xcrref(3) = 1.5179927653399185d-03
+         xcrref(4) = 1.5029584435994323d-03
+         xcrref(5) = 3.4264073155896461d-02
+
+!---------------------------------------------------------------------
+!   Reference values of RMS-norms of solution error, for the (12X12X12) grid,
+!   after 50 time steps, with  DT = 5.0d-01
+!---------------------------------------------------------------------
+         xceref(1) = 6.4223319957960924d-04
+         xceref(2) = 8.4144342047347926d-05
+         xceref(3) = 5.8588269616485186d-05
+         xceref(4) = 5.8474222595157350d-05
+         xceref(5) = 1.3103347914111294d-03
+
+!---------------------------------------------------------------------
+!   Reference value of surface integral, for the (12X12X12) grid,
+!   after 50 time steps, with DT = 5.0d-01
+!---------------------------------------------------------------------
+         xciref = 7.8418928865937083d+00
+
+
+        elseif ( class .eq. 'W' ) then
+
+           dtref = 1.5d-3
+!---------------------------------------------------------------------
+!   Reference values of RMS-norms of residual, for the (33x33x33) grid,
+!   after 300 time steps, with  DT = 1.5d-3
+!---------------------------------------------------------------------
+           xcrref(1) =   0.1236511638192d+02
+           xcrref(2) =   0.1317228477799d+01
+           xcrref(3) =   0.2550120713095d+01
+           xcrref(4) =   0.2326187750252d+01
+           xcrref(5) =   0.2826799444189d+02
+
+
+!---------------------------------------------------------------------
+!   Reference values of RMS-norms of solution error, for the (33X33X33) grid,
+!---------------------------------------------------------------------
+           xceref(1) =   0.4867877144216d+00
+           xceref(2) =   0.5064652880982d-01
+           xceref(3) =   0.9281818101960d-01
+           xceref(4) =   0.8570126542733d-01
+           xceref(5) =   0.1084277417792d+01
+
+
+!---------------------------------------------------------------------
+!   Reference value of surface integral, for the (33X33X33) grid,
+!   after 300 time steps, with  DT = 1.5d-3
+!---------------------------------------------------------------------
+           xciref    =   0.1161399311023d+02
+
+        elseif ( class .eq. 'A' ) then
+
+           dtref = 2.0d+0
+!---------------------------------------------------------------------
+!   Reference values of RMS-norms of residual, for the (64X64X64) grid,
+!   after 250 time steps, with  DT = 2.0d+00
+!---------------------------------------------------------------------
+         xcrref(1) = 7.7902107606689367d+02
+         xcrref(2) = 6.3402765259692870d+01
+         xcrref(3) = 1.9499249727292479d+02
+         xcrref(4) = 1.7845301160418537d+02
+         xcrref(5) = 1.8384760349464247d+03
+
+!---------------------------------------------------------------------
+!   Reference values of RMS-norms of solution error, for the (64X64X64) grid,
+!   after 250 time steps, with  DT = 2.0d+00
+!---------------------------------------------------------------------
+         xceref(1) = 2.9964085685471943d+01
+         xceref(2) = 2.8194576365003349d+00
+         xceref(3) = 7.3473412698774742d+00
+         xceref(4) = 6.7139225687777051d+00
+         xceref(5) = 7.0715315688392578d+01
+
+!---------------------------------------------------------------------
+!   Reference value of surface integral, for the (64X64X64) grid,
+!   after 250 time steps, with DT = 2.0d+00
+!---------------------------------------------------------------------
+         xciref = 2.6030925604886277d+01
+
+
+        elseif ( class .eq. 'B' ) then
+
+           dtref = 2.0d+0
+
+!---------------------------------------------------------------------
+!   Reference values of RMS-norms of residual, for the (102X102X102) grid,
+!   after 250 time steps, with  DT = 2.0d+00
+!---------------------------------------------------------------------
+         xcrref(1) = 3.5532672969982736d+03
+         xcrref(2) = 2.6214750795310692d+02
+         xcrref(3) = 8.8333721850952190d+02
+         xcrref(4) = 7.7812774739425265d+02
+         xcrref(5) = 7.3087969592545314d+03
+
+!---------------------------------------------------------------------
+!   Reference values of RMS-norms of solution error, for the (102X102X102) 
+!   grid, after 250 time steps, with  DT = 2.0d+00
+!---------------------------------------------------------------------
+         xceref(1) = 1.1401176380212709d+02
+         xceref(2) = 8.1098963655421574d+00
+         xceref(3) = 2.8480597317698308d+01
+         xceref(4) = 2.5905394567832939d+01
+         xceref(5) = 2.6054907504857413d+02
+
+!---------------------------------------------------------------------
+!   Reference value of surface integral, for the (102X102X102) grid,
+!   after 250 time steps, with DT = 2.0d+00
+!---------------------------------------------------------------------
+         xciref = 4.7887162703308227d+01
+
+        elseif ( class .eq. 'C' ) then
+
+           dtref = 2.0d+0
+
+!---------------------------------------------------------------------
+!   Reference values of RMS-norms of residual, for the (162X162X162) grid,
+!   after 250 time steps, with  DT = 2.0d+00
+!---------------------------------------------------------------------
+         xcrref(1) = 1.03766980323537846d+04
+         xcrref(2) = 8.92212458801008552d+02
+         xcrref(3) = 2.56238814582660871d+03
+         xcrref(4) = 2.19194343857831427d+03
+         xcrref(5) = 1.78078057261061185d+04
+
+!---------------------------------------------------------------------
+!   Reference values of RMS-norms of solution error, for the (162X162X162) 
+!   grid, after 250 time steps, with  DT = 2.0d+00
+!---------------------------------------------------------------------
+         xceref(1) = 2.15986399716949279d+02
+         xceref(2) = 1.55789559239863600d+01
+         xceref(3) = 5.41318863077207766d+01
+         xceref(4) = 4.82262643154045421d+01
+         xceref(5) = 4.55902910043250358d+02
+
+!---------------------------------------------------------------------
+!   Reference value of surface integral, for the (162X162X162) grid,
+!   after 250 time steps, with DT = 2.0d+00
+!---------------------------------------------------------------------
+         xciref = 6.66404553572181300d+01
+
+        elseif ( class .eq. 'D' ) then
+
+           dtref = 1.0d+0
+
+!---------------------------------------------------------------------
+!   Reference values of RMS-norms of residual, for the (408X408X408) grid,
+!   after 300 time steps, with  DT = 1.0d+00
+!---------------------------------------------------------------------
+         xcrref(1) = 0.4868417937025d+05
+         xcrref(2) = 0.4696371050071d+04
+         xcrref(3) = 0.1218114549776d+05 
+         xcrref(4) = 0.1033801493461d+05
+         xcrref(5) = 0.7142398413817d+05
+
+!---------------------------------------------------------------------
+!   Reference values of RMS-norms of solution error, for the (408X408X408) 
+!   grid, after 300 time steps, with  DT = 1.0d+00
+!---------------------------------------------------------------------
+         xceref(1) = 0.3752393004482d+03
+         xceref(2) = 0.3084128893659d+02
+         xceref(3) = 0.9434276905469d+02
+         xceref(4) = 0.8230686681928d+02
+         xceref(5) = 0.7002620636210d+03
+
+!---------------------------------------------------------------------
+!   Reference value of surface integral, for the (408X408X408) grid,
+!   after 300 time steps, with DT = 1.0d+00
+!---------------------------------------------------------------------
+         xciref =    0.8334101392503d+02
+
+        elseif ( class .eq. 'E' ) then
+
+           dtref = 0.5d+0
+
+!---------------------------------------------------------------------
+!   Reference values of RMS-norms of residual, for the (1020X1020X1020) grid,
+!   after 300 time steps, with  DT = 0.5d+00
+!---------------------------------------------------------------------
+         xcrref(1) = 0.2099641687874d+06
+         xcrref(2) = 0.2130403143165d+05
+         xcrref(3) = 0.5319228789371d+05 
+         xcrref(4) = 0.4509761639833d+05
+         xcrref(5) = 0.2932360006590d+06
+
+!---------------------------------------------------------------------
+!   Reference values of RMS-norms of solution error, for the (1020X1020X1020) 
+!   grid, after 300 time steps, with  DT = 0.5d+00
+!---------------------------------------------------------------------
+         xceref(1) = 0.4800572578333d+03
+         xceref(2) = 0.4221993400184d+02
+         xceref(3) = 0.1210851906824d+03
+         xceref(4) = 0.1047888986770d+03
+         xceref(5) = 0.8363028257389d+03
+
+!---------------------------------------------------------------------
+!   Reference value of surface integral, for the (1020X1020X1020) grid,
+!   after 300 time steps, with DT = 0.5d+00
+!---------------------------------------------------------------------
+         xciref =    0.9512163272273d+02
+
+        elseif ( class .eq. 'F' ) then
+
+           dtref = 0.2d+0
+
+!---------------------------------------------------------------------
+!   Reference values of RMS-norms of residual, for the (2560X2560X2560) grid,
+!   after 300 time steps, with  DT = 0.2d+00
+!---------------------------------------------------------------------
+         xcrref(1) = 0.8505125358152d+06
+         xcrref(2) = 0.8774655318044d+05
+         xcrref(3) = 0.2167258198851d+06
+         xcrref(4) = 0.1838245257371d+06
+         xcrref(5) = 0.1175556512415d+07
+
+!---------------------------------------------------------------------
+!   Reference values of RMS-norms of solution error, for the (2560X2560X2560)
+!   grid, after 300 time steps, with  DT = 0.2d+00
+!---------------------------------------------------------------------
+         xceref(1) = 0.5293914132486d+03
+         xceref(2) = 0.4784861621068d+02
+         xceref(3) = 0.1337701281659d+03
+         xceref(4) = 0.1154215049655d+03
+         xceref(5) = 0.8956266851467d+03
+
+!---------------------------------------------------------------------
+!   Reference value of surface integral, for the (2560X2560X2560) grid,
+!   after 300 time steps, with DT = 0.2d+00
+!---------------------------------------------------------------------
+         xciref =    0.1002509436546d+03
+
+        else
+
+           verified = .FALSE.
+
+        endif
+
+!---------------------------------------------------------------------
+!    verification test for residuals if gridsize is one of 
+!    the defined grid sizes above (class .ne. 'U')
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!    Compute the difference of solution values and the known reference values.
+!---------------------------------------------------------------------
+        do m = 1, 5
+           
+           xcrdif(m) = dabs((xcr(m)-xcrref(m))/xcrref(m)) 
+           xcedif(m) = dabs((xce(m)-xceref(m))/xceref(m))
+           
+        enddo
+        xcidif = dabs((xci - xciref)/xciref)
+
+
+!---------------------------------------------------------------------
+!    Output the comparison of computed results to known cases.
+!---------------------------------------------------------------------
+
+        if (class .ne. 'U') then
+           write(*, 1990) class
+ 1990      format(/, ' Verification being performed for class ', a)
+           write (*,2000) epsilon
+ 2000      format(' Accuracy setting for epsilon = ', E20.13)
+           verified = (dabs(dt-dtref) .le. epsilon)
+           if (.not.verified) then  
+              class = 'U'
+              write (*,1000) dtref
+ 1000         format(' DT does not match the reference value of ',  &
+     &                 E15.8)
+           endif
+        else 
+           write(*, 1995)
+ 1995      format(' Unknown class')
+        endif
+
+
+        if (class .ne. 'U') then
+           write (*,2001) 
+        else
+           write (*, 2005)
+        endif
+
+ 2001   format(' Comparison of RMS-norms of residual')
+ 2005   format(' RMS-norms of residual')
+        do m = 1, 5
+           if (class .eq. 'U') then
+              write(*, 2015) m, xcr(m)
+           else if ((.not.ieee_is_nan(xcrdif(m))) .and.  &
+     &              xcrdif(m) .le. epsilon) then
+              write (*,2011) m,xcr(m),xcrref(m),xcrdif(m)
+           else 
+              verified = .false.
+              write (*,2010) m,xcr(m),xcrref(m),xcrdif(m)
+           endif
+        enddo
+
+        if (class .ne. 'U') then
+           write (*,2002)
+        else
+           write (*,2006)
+        endif
+ 2002   format(' Comparison of RMS-norms of solution error')
+ 2006   format(' RMS-norms of solution error')
+        
+        do m = 1, 5
+           if (class .eq. 'U') then
+              write(*, 2015) m, xce(m)
+           else if ((.not.ieee_is_nan(xcedif(m))) .and.  &
+     &              xcedif(m) .le. epsilon) then
+              write (*,2011) m,xce(m),xceref(m),xcedif(m)
+           else
+              verified = .false.
+              write (*,2010) m,xce(m),xceref(m),xcedif(m)
+           endif
+        enddo
+        
+ 2010   format(' FAILURE: ', i2, 2x, E20.13, E20.13, E20.13)
+ 2011   format('          ', i2, 2x, E20.13, E20.13, E20.13)
+ 2015   format('          ', i2, 2x, E20.13)
+        
+        if (class .ne. 'U') then
+           write (*,2025)
+        else
+           write (*,2026)
+        endif
+ 2025   format(' Comparison of surface integral')
+ 2026   format(' Surface integral')
+
+
+        if (class .eq. 'U') then
+           write(*, 2030) xci
+        else if (xcidif .le. epsilon) then
+           write(*, 2032) xci, xciref, xcidif
+        else
+           verified = .false.
+           write(*, 2031) xci, xciref, xcidif
+        endif
+
+ 2030   format('          ', 4x, E20.13)
+ 2031   format(' FAILURE: ', 4x, E20.13, E20.13, E20.13)
+ 2032   format('          ', 4x, E20.13, E20.13, E20.13)
+
+
+
+        if (class .eq. 'U') then
+           write(*, 2022)
+           write(*, 2023)
+ 2022      format(' No reference values provided')
+ 2023      format(' No verification performed')
+        else if (verified) then
+           write(*, 2020)
+ 2020      format(' Verification Successful')
+        else
+           write(*, 2021)
+ 2021      format(' Verification failed')
+        endif
+
+        return
+
+
+        end
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/MG/Makefile b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/MG/Makefile
new file mode 100644
index 000000000..927797d75
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/MG/Makefile
@@ -0,0 +1,28 @@
+SHELL=/bin/sh
+BENCHMARK=mg
+BENCHMARKU=MG
+
+include ../config/make.def
+
+OBJS = mg.o mg_data.o mpinpb.o ${COMMON}/print_results.o  \
+       ${COMMON}/get_active_nprocs.o \
+       ${COMMON}/${RAND}.o ${COMMON}/timers.o
+
+include ../sys/make.common
+
+${PROGRAM}: config ${OBJS}
+	${FLINK} ${FLINKFLAGS} -o ${PROGRAM} ${OBJS} ${FMPI_LIB}
+
+.f90.o:
+	${FCOMPILE} $<
+
+mg.o:		mg.f90  mg_data.o mpinpb.o
+mg_data.o:	mg_data.f90 mpinpb.o npbparams.h
+mpinpb.o:	mpinpb.f90
+
+clean:
+	- rm -f *.o *.mod *~ 
+	- rm -f npbparams.h core
+
+
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/MG/README b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/MG/README
new file mode 100644
index 000000000..6c03f7852
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/MG/README
@@ -0,0 +1,138 @@
+Some info about the MG benchmark
+================================
+    
+'mg_demo' demonstrates the capabilities of a very simple multigrid
+solver in computing a three-dimensional potential field.  This is
+a simplified multigrid solver in two important respects:
+
+  (1) it solves only a constant coefficient equation,
+  and that only on a uniform cubical grid,
+    
+  (2) it solves only a single equation, representing
+  a scalar field rather than a vector field.
+
+We chose it for its portability and simplicity, and expect that a
+supercomputer which can run it effectively will also be able to
+run more complex multigrid programs at least as well.
+     
+     Eric Barszcz                         Paul Frederickson
+     RIACS
+     NASA Ames Research Center            NASA Ames Research Center
+
+========================================================================
+Running the program:  (Note: also see parameter lm information in the
+                       two sections immediately below this section)
+
+The program may be run with or without an input deck (called "mg.input"). 
+The following describes a few things about the input deck if you want to 
+use one. 
+
+The four lines below are the "mg.input" file required to run a
+problem of total size 256x256x256, for 4 iterations (Class "A"),
+and presume the use of 8 processors:
+
+   8 = top level
+   256 256 256 = nx ny nz
+   4 = nit
+   0 0 0 0 0 0 0 0 = debug_vec
+
+The first line of input indicates how many levels of multi-grid
+cycle will be applied to a particular subpartition.  Presuming that
+8 processors are solving this problem (recall that the number of 
+processors is specified to MPI as a run parameter, and MPI subsequently
+determines this for the code via an MPI subroutine call), a 2x2x2 
+processor grid is formed, and thus each partition on a processor is 
+of size 128x128x128.  Therefore, a maximum of 8 multi-grid levels may 
+be used.  These are of size 128,64,32,16,8,4,2,1, with the coarsest 
+level being a single point on a given processor.
+
+
+Next, consider the same size problem but running on 1 processor.  The
+following "mg.input" file is appropriate:
+
+    9 = top level
+    256 256 256 = nx ny nz
+    4 = nit
+    0 0 0 0 0 0 0 0 = debug_vec
+
+Since this processor must solve the full 256x256x256 problem, this
+permits 9 multi-grid levels (256,128,64,32,16,8,4,2,1), resulting in 
+a coarsest multi-grid level of a single point on the processor.
+
+
+Next, consider the same size problem but running on 2 processors.  The
+following "mg.input" file is required:
+
+    8 = top level
+    256 256 256 = nx ny nz
+    4 = nit
+    0 0 0 0 0 0 0 0 = debug_vec
+
+The algorithm for partitioning the full grid onto some power of 2 number 
+of processors is to start by splitting the last dimension of the grid
+(z dimension) in 2: the problem is now partitioned onto 2 processors.
+Next the middle dimension (y dimension) is split in 2: the problem is now
+partitioned onto 4 processors.  Next, the first dimension (x dimension) is
+split in 2: the problem is now partitioned onto 8 processors.  Next, the
+last dimension (z dimension) is split again in 2: the problem is now
+partitioned onto 16 processors.  This partitioning is repeated until all 
+of the power of 2 processors have been allocated.
+
+Thus to run the above problem on 2 processors, the grid partitioning 
+algorithm will allocate the two processors across the last dimension, 
+creating two partitions each of size 256x256x128. The coarsest level of 
+multi-grid must be a single point surrounded by a cubic number of grid 
+points.  Therefore, each of the two processor partitions will contain 4 
+coarsest multi-grid level points, each surrounded by a cube of grid points 
+of size 128x128x128, indicated by a top level of 8.
+
+
+Next, consider the same size problem but running on 4 processors.  The
+following "mg.input" file is required:
+
+    8 = top level
+    256 256 256 = nx ny nz
+    4 = nit
+    0 0 0 0 0 0 0 0 = debug_vec
+
+The partitioning algorithm will create 4 partitions, each of size
+256x128x128.  Each partition will contain 2 coarsest multi-grid level
+points, each surrounded by a cube of grid points of size 128x128x128, 
+indicated by a top level of 8.
+
+
+Next, consider the same size problem but running on 16 processors.  The
+following "mg.input" file is required:
+
+    7 = top level
+    256 256 256 = nx ny nz
+    4 = nit
+    0 0 0 0 0 0 0 0 = debug_vec
+
+On each node a partition of size 128x128x64 will be created.  A maximum
+of 7 multi-grid levels (64,32,16,8,4,2,1) may be used, resulting in each 
+partition containing 4 coarsest multi-grid level points, each surrounded 
+by a cube of grid points of size 64x64x64, indicated by a top level of 7.
+
+
+
+
+Note that non-cubic problem sizes may also be considered:
+
+The four lines below are the "mg.input" file appropriate for running a
+problem of total size 256x512x512, for 20 iterations and presumes the 
+use of 32 processors (note: this is NOT a class C problem):
+
+    8 = top level
+    256 512 512 = nx ny nz
+    20 = nit
+    0 0 0 0 0 0 0 0 = debug_vec
+
+The first line of input indicates how many levels of multi-grid
+cycle will be applied to a particular subpartition.  Presuming that
+32 processors are solving this problem, a 2x4x4 processor grid is
+formed, and thus each partition on a processor is of size 128x128x128.
+Therefore, a maximum of 8 multi-grid levels may be used.  These are of
+size 128,64,32,16,8,4,2,1, with the coarsest level being a single 
+point on a given processor.
+
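Note on the MG README above: the "top level" values in its mg.input examples follow from the processor grid and the resulting per-process partition. The sketch below is a hypothetical helper and not part of NPB. The function names `processor_grid`, `plan`, and `mg_input_text` are made up for illustration. It assumes a power-of-two process count and power-of-two grid dimensions, copies the integer arithmetic that `setup` in mg.f90 uses to build the processor grid, and takes the maximum top level to be log2 of the smallest partition dimension plus one, a rule inferred from the README examples.

```python
"""Sketch: derive an "mg.input" file from a problem size and process count."""


def processor_grid(nprocs):
    """Return (px, py, pz), following the integer arithmetic in setup()."""
    if nprocs < 1 or nprocs & (nprocs - 1):
        raise ValueError("nprocs must be a power of two")
    log_p = nprocs.bit_length() - 1      # exact log2 for a power of two
    dx = log_p // 3                      # x is split least often
    dy = (log_p - dx) // 2
    px, py = 2 ** dx, 2 ** dy
    pz = nprocs // (px * py)             # z receives the remaining factor
    return px, py, pz


def plan(nx, ny, nz, nprocs):
    """Return (top_level, partition, coarsest_points) for one process."""
    grid = processor_grid(nprocs)
    partition = tuple(n // p for n, p in zip((nx, ny, nz), grid))
    for n, p, part in zip((nx, ny, nz), grid, partition):
        if n % p or part < 1 or part & (part - 1):
            raise ValueError("each partition dimension must be a power of two")
    smallest = min(partition)
    top_level = smallest.bit_length()    # log2(smallest) + 1
    coarsest_points = 1
    for part in partition:
        coarsest_points *= part // smallest
    return top_level, partition, coarsest_points


def mg_input_text(nx, ny, nz, nit, nprocs):
    """Format the four lines of mg.input with all debug flags off."""
    top_level, _, _ = plan(nx, ny, nz, nprocs)
    return "\n".join([
        f"{top_level} = top level",
        f"{nx} {ny} {nz} = nx ny nz",
        f"{nit} = nit",
        "0 0 0 0 0 0 0 0 = debug_vec",
    ]) + "\n"


if __name__ == "__main__":
    # 256x256x256, 4 iterations, 16 processes: the README's last cubic case.
    print(mg_input_text(256, 256, 256, 4, 16), end="")
```

Run as a script, this prints the same four lines as the 16-processor example in the README, with a top level of 7. For that case `plan(256, 256, 256, 16)` also gives a 128x128x64 partition containing 4 coarsest-level points, which agrees with the README text.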
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/MG/mg.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/MG/mg.f90
new file mode 100644
index 000000000..6185c4edb
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/MG/mg.f90
@@ -0,0 +1,2577 @@
+!-------------------------------------------------------------------------!
+!                                                                         !
+!        N  A  S     P A R A L L E L     B E N C H M A R K S  3.4         !
+!                                                                         !
+!                                   M G                                   !
+!                                                                         !
+!-------------------------------------------------------------------------!
+!                                                                         !
+!    This benchmark is part of the NAS Parallel Benchmark 3.4 suite.      !
+!    It is described in NAS Technical Reports 95-020 and 02-007           !
+!                                                                         !
+!    Permission to use, copy, distribute and modify this software         !
+!    for any purpose with or without fee is hereby granted.  We           !
+!    request, however, that all derived work reference the NAS            !
+!    Parallel Benchmarks 3.4. This software is provided "as is"           !
+!    without express or implied warranty.                                 !
+!                                                                         !
+!    Information on NPB 3.4, including the technical report, the          !
+!    original specifications, source code, results and information        !
+!    on how to submit new results, is available at:                       !
+!                                                                         !
+!           http://www.nas.nasa.gov/Software/NPB/                         !
+!                                                                         !
+!    Send comments or suggestions to  npb@nas.nasa.gov                    !
+!                                                                         !
+!          NAS Parallel Benchmarks Group                                  !
+!          NASA Ames Research Center                                      !
+!          Mail Stop: T27A-1                                              !
+!          Moffett Field, CA   94035-1000                                 !
+!                                                                         !
+!          E-mail:  npb@nas.nasa.gov                                      !
+!          Fax:     (650) 604-3957                                        !
+!                                                                         !
+!-------------------------------------------------------------------------!
+
+
+!---------------------------------------------------------------------
+!
+! Authors: E. Barszcz
+!          P. Frederickson
+!          A. Woo
+!          M. Yarrow
+!          R. F. Van der Wijngaart
+!
+!---------------------------------------------------------------------
+
+
+!---------------------------------------------------------------------
+      program mg_mpi
+!---------------------------------------------------------------------
+
+      use, intrinsic :: ieee_arithmetic, only : ieee_is_nan
+
+      use mg_data
+      use mg_fields
+      use mpinpb
+
+      implicit none
+
+!---------------------------------------------------------------------------c
+! k is the current level. It is passed down through subroutine args
+! and is NOT global. "it" is the current iteration.
+!---------------------------------------------------------------------------c
+
+      integer k, it
+      
+      external timer_read
+      double precision t, t0, tinit, mflops, timer_read
+
+      double precision rnm2, rnmu, old2, oldu, epsilon
+      integer n1, n2, n3, nit, i
+      double precision nn, verify_value, err
+      logical verified
+
+      integer ierr, fstatus
+
+      double precision tsum(t_last+2), t1(t_last+2),  &
+     &                 tming(t_last+2), tmaxg(t_last+2)
+      character        t_recs(t_last+2)*8
+
+      data t_recs/'total', 'init', 'psinv', 'resid', 'rprj3',  &
+     &            'interp', 'norm2u3', 'comm3', 'rcomm',  &
+     &            ' totcomp', ' totcomm'/
+
+
+      call mpi_init(ierr)
+
+!---------------------------------------------------------------------
+! get a process grid that requires a power-of-two number of procs.
+! excess ranks are marked as inactive.
+!---------------------------------------------------------------------
+      call get_active_nprocs(3, n1, n2, nprocs,  &
+     &                       nprocs_total, me, comm_work, active)
+      if (.not. active) goto 999
+
+      root = 0
+      if (.not. convertdouble) then
+         dp_type = MPI_DOUBLE_PRECISION
+      else
+         dp_type = MPI_REAL
+      endif
+
+!---------------------------------------------------------------------
+! Check that the active process count does not exceed maxprocs
+!---------------------------------------------------------------------
+      if (nprocs .gt. maxprocs) then
+         if (me .eq. root) write(*,20) nprocs_total, maxprocs
+ 20      format(' ERROR: requested for ',i8,' processes'//  &
+     &          ' The maximum size allowed for this benchmark is ',i6)
+         call mpi_abort(mpi_comm_world, 1, ierr)
+         stop
+      endif
+
+!---------------------------------------------------------------------
+! Allocate space
+!---------------------------------------------------------------------
+      call alloc_space
+
+
+      do i = 1, t_last
+         call timer_clear(i)
+      end do
+
+      call mpi_barrier(comm_work, ierr)
+
+      call timer_start(T_init)
+      
+
+!---------------------------------------------------------------------
+! Read in and broadcast input data
+!---------------------------------------------------------------------
+
+      if( me .eq. root )then
+         write (*, 1000) 
+
+         call check_timer_flag( timeron )
+
+         open(unit=7,file="mg.input", status="old", iostat=fstatus)
+         if (fstatus .eq. 0) then
+            write(*,50) 
+ 50         format(' Reading from input file mg.input')
+            read(7,*) lt
+            read(7,*) nx(lt), ny(lt), nz(lt)
+            read(7,*) nit
+            read(7,*) (debug_vec(i),i=0,7)
+         else
+            write(*,51) 
+ 51         format(' No input file. Using compiled defaults ')
+            lt = lt_default
+            nit = nit_default
+            nx(lt) = nx_default
+            ny(lt) = ny_default
+            nz(lt) = nz_default
+            do i = 0,7
+               debug_vec(i) = debug_default
+            end do
+         endif
+      endif
+
+      call mpi_bcast(lt, 1, MPI_INTEGER, 0, comm_work, ierr)
+      call mpi_bcast(nit, 1, MPI_INTEGER, 0, comm_work, ierr)
+      call mpi_bcast(nx(lt), 1, MPI_INTEGER, 0, comm_work, ierr)
+      call mpi_bcast(ny(lt), 1, MPI_INTEGER, 0, comm_work, ierr)
+      call mpi_bcast(nz(lt), 1, MPI_INTEGER, 0, comm_work, ierr)
+      call mpi_bcast(debug_vec(0), 8, MPI_INTEGER, 0,  &
+     &               comm_work, ierr)
+      call mpi_bcast(timeron, 1, MPI_LOGICAL, 0, comm_work, ierr)
+
+      if ( (nx(lt) .ne. ny(lt)) .or. (nx(lt) .ne. nz(lt)) ) then
+         Class = 'U' 
+      else if( nx(lt) .eq. 32 .and. nit .eq. 4 ) then
+         Class = 'S'
+      else if( nx(lt) .eq. 128 .and. nit .eq. 4 ) then
+         Class = 'W'
+      else if( nx(lt) .eq. 256 .and. nit .eq. 4 ) then  
+         Class = 'A'
+      else if( nx(lt) .eq. 256 .and. nit .eq. 20 ) then
+         Class = 'B'
+      else if( nx(lt) .eq. 512 .and. nit .eq. 20 ) then  
+         Class = 'C'
+      else if( nx(lt) .eq. 1024 .and. nit .eq. 50 ) then  
+         Class = 'D'
+      else if( nx(lt) .eq. 2048 .and. nit .eq. 50 ) then  
+         Class = 'E'
+      else if( nx(lt) .eq. 4096 .and. nit .eq. 50 ) then  
+         Class = 'F'
+      else
+         Class = 'U'
+      endif
+
+!---------------------------------------------------------------------
+!  Use these for debug info:
+!---------------------------------------------------------------------
+!     debug_vec(0) = 1 !=> report all norms
+!     debug_vec(1) = 1 !=> some setup information
+!     debug_vec(1) = 2 !=> more setup information
+!     debug_vec(2) = k => at level k or below, show result of resid
+!     debug_vec(3) = k => at level k or below, show result of psinv
+!     debug_vec(4) = k => at level k or below, show result of rprj
+!     debug_vec(5) = k => at level k or below, show result of interp
+!     debug_vec(6) = 1 => (unused)
+!     debug_vec(7) = 1 => (unused)
+!---------------------------------------------------------------------
+      a(0) = -8.0D0/3.0D0 
+      a(1) =  0.0D0 
+      a(2) =  1.0D0/6.0D0 
+      a(3) =  1.0D0/12.0D0
+      
+      if(Class .eq. 'A' .or. Class .eq. 'S'.or. Class .eq.'W') then
+!---------------------------------------------------------------------
+!     Coefficients for the S(a) smoother
+!---------------------------------------------------------------------
+         c(0) =  -3.0D0/8.0D0
+         c(1) =  +1.0D0/32.0D0
+         c(2) =  -1.0D0/64.0D0
+         c(3) =   0.0D0
+      else
+!---------------------------------------------------------------------
+!     Coefficients for the S(b) smoother
+!---------------------------------------------------------------------
+         c(0) =  -3.0D0/17.0D0
+         c(1) =  +1.0D0/33.0D0
+         c(2) =  -1.0D0/61.0D0
+         c(3) =   0.0D0
+      endif
+      lb = 1
+      k  = lt
+
+      call setup(n1,n2,n3,k)
+      call zero3(u,n1,n2,n3)
+      call zran3(v,n1,n2,n3,nx(lt),ny(lt),k)
+
+      call norm2u3(v,n1,n2,n3,rnm2,rnmu,nx(lt),ny(lt),nz(lt))
+
+      if( me .eq. root )then
+         write (*, 1001) nx(lt),ny(lt),nz(lt), Class
+         write (*, 1002) nit
+         write (*, 1003) nprocs_total
+         if (nprocs .ne. nprocs_total) write (*, 1004) nprocs
+         write (*, *)
+
+ 1000 format(//,' NAS Parallel Benchmarks 3.4 -- MG Benchmark', /)
+ 1001    format(' Size: ', i4, 'x', i4, 'x', i4, '  (class ', a, ')' )
+ 1002    format(' Iterations: ', i4)
+ 1003    format(' Total number of processes: ', i6)
+ 1004    format(' WARNING: Number of processes is not power of two (',  &
+     &          i0, ' active)')
+      endif
+
+      call resid(u,v,r,n1,n2,n3,a,k)
+      call norm2u3(r,n1,n2,n3,rnm2,rnmu,nx(lt),ny(lt),nz(lt))
+      old2 = rnm2
+      oldu = rnmu
+
+!---------------------------------------------------------------------
+!     One iteration for startup
+!---------------------------------------------------------------------
+      call mg3P(u,v,r,a,c,n1,n2,n3,k)
+      call resid(u,v,r,n1,n2,n3,a,k)
+      call setup(n1,n2,n3,k)
+      call zero3(u,n1,n2,n3)
+      call zran3(v,n1,n2,n3,nx(lt),ny(lt),k)
+
+      call timer_stop(T_init)
+      if( me .eq. root )then
+         tinit = timer_read(T_init)
+         write( *,'(/A,F15.3,A/)' )  &
+     &        ' Initialization time: ',tinit, ' seconds'
+      endif
+
+      do i = 1, t_last
+         if (i .ne. t_init) call timer_clear(i)
+      end do
+      call mpi_barrier(comm_work,ierr)
+
+      call timer_start(T_bench)
+
+      call resid(u,v,r,n1,n2,n3,a,k)
+      call norm2u3(r,n1,n2,n3,rnm2,rnmu,nx(lt),ny(lt),nz(lt))
+      old2 = rnm2
+      oldu = rnmu
+
+      do  it=1,nit
+         if (it.eq.1 .or. it.eq.nit .or. mod(it,5).eq.0) then
+            if (me .eq. root) write(*,80) it
+   80       format('  iter ',i4)
+         endif
+         call mg3P(u,v,r,a,c,n1,n2,n3,k)
+         call resid(u,v,r,n1,n2,n3,a,k)
+      enddo
+
+
+      call norm2u3(r,n1,n2,n3,rnm2,rnmu,nx(lt),ny(lt),nz(lt))
+
+      call timer_stop(T_bench)
+
+      t0 = timer_read(T_bench)
+
+      call mpi_reduce(t0,t,1,dp_type,  &
+     &     mpi_max,root,comm_work,ierr)
+      verified = .FALSE.
+      verify_value = 0.0
+      if( me .eq. root )then
+         write(*,100)
+ 100     format(/' Benchmark completed ')
+
+         epsilon = 1.d-8
+         if (Class .ne. 'U') then
+            if(Class.eq.'S') then
+               verify_value = 0.5307707005734d-04
+            elseif(Class.eq.'W') then
+               verify_value = 0.6467329375339d-05
+            elseif(Class.eq.'A') then
+               verify_value = 0.2433365309069d-05
+            elseif(Class.eq.'B') then
+               verify_value = 0.1800564401355d-05
+            elseif(Class.eq.'C') then
+               verify_value = 0.5706732285740d-06
+            elseif(Class.eq.'D') then
+               verify_value = 0.1583275060440d-09
+            elseif(Class.eq.'E') then
+               verify_value = 0.5630442584711d-10
+            elseif(Class.eq.'F') then
+               verify_value = 0.1889225697989d-10
+            endif
+
+            err = abs( rnm2 - verify_value ) / verify_value
+            if( (.not.ieee_is_nan(err)) .and. (err .le. epsilon) ) then
+               verified = .TRUE.
+               write(*, 200)
+               write(*, 201) rnm2
+               write(*, 202) err
+ 200           format(' VERIFICATION SUCCESSFUL ')
+ 201           format(' L2 Norm is ', E20.13)
+ 202           format(' Error is   ', E20.13)
+            else
+               verified = .FALSE.
+               write(*, 300) 
+               write(*, 301) rnm2
+               write(*, 302) verify_value
+ 300           format(' VERIFICATION FAILED')
+ 301           format(' L2 Norm is             ', E20.13)
+ 302           format(' The correct L2 Norm is ', E20.13)
+            endif
+         else
+            verified = .FALSE.
+            write (*, 400)
+            write (*, 401)
+            write (*, 201) rnm2
+ 400        format(' Problem size unknown')
+ 401        format(' NO VERIFICATION PERFORMED')
+         endif
+
+         nn = 1.0d0*nx(lt)*ny(lt)*nz(lt)
+
+         if( t .ne. 0. ) then
+            mflops = 58.*1.0D-6*nit*nn / t
+         else
+            mflops = 0.0
+         endif
+
+         call print_results('MG', class, nx(lt), ny(lt), nz(lt),  &
+     &                      nit, nprocs, nprocs_total, t,  &
+     &                      mflops, '          floating point',  &
+     &                      verified, npbversion, compiletime,  &
+     &                      cs1, cs2, cs3, cs4, cs5, cs6, cs7)
+
+
+      endif
+
+
+      if (.not.timeron) goto 999
+
+      do i = 1, t_last
+         t1(i) = timer_read(i)
+      end do
+      t1(t_last+2) = t1(t_rcomm) + t1(t_comm3)
+      t1(t_last+1) = t1(t_bench) - t1(t_last+2)
+
+      call MPI_Reduce(t1, tsum,  t_last+2, dp_type, MPI_SUM,  &
+     &                0, comm_work, ierr)
+      call MPI_Reduce(t1, tming, t_last+2, dp_type, MPI_MIN,  &
+     &                0, comm_work, ierr)
+      call MPI_Reduce(t1, tmaxg, t_last+2, dp_type, MPI_MAX,  &
+     &                0, comm_work, ierr)
+
+      if (me .eq. 0) then
+         write(*, 800) nprocs
+         do i = 1, t_last+2
+            tsum(i) = tsum(i) / nprocs
+            write(*, 810) i, t_recs(i), tming(i), tmaxg(i), tsum(i)
+         end do
+      endif
+ 800  format(' nprocs =', i6, 11x, 'minimum', 5x, 'maximum',  &
+     &       5x, 'average')
+ 810  format(' timer ', i2, '(', A8, ') :', 3(2x,f10.4))
+
+ 999  continue
+      call mpi_finalize(ierr)
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine setup(n1,n2,n3,k)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      use mg_data
+      use mpinpb
+
+      implicit none
+
+      integer n1,n2,n3,k
+      integer dx, dy, log_p, d, i, j
+
+      integer ax, next(3),mi(3,maxlevel),mip(3,maxlevel)
+      integer ng(3,maxlevel)
+      integer idi(3), pi(3), idin(3,-1:1)
+      integer s, dir,ierr
+
+      do  j=-1,1,1
+         do  d=1,3
+            msg_type(d,j) = 100*(j+2+10*d)
+         enddo
+      enddo
+
+      ng(1,lt) = nx(lt)
+      ng(2,lt) = ny(lt)
+      ng(3,lt) = nz(lt)
+      do  ax=1,3
+         next(ax) = 1
+         do  k=lt-1,1,-1
+            ng(ax,k) = ng(ax,k+1)/2
+         enddo
+      enddo
+ 61   format(10i4)
+      do  k=lt,1,-1
+         nx(k) = ng(1,k)
+         ny(k) = ng(2,k)
+         nz(k) = ng(3,k)
+      enddo
+
+      log_p  = log(float(nprocs)+0.0001)/log(2.0)
+      dx     = log_p/3
+      pi(1)  = 2**dx
+      idi(1) = mod(me,pi(1))
+
+      dy     = (log_p-dx)/2
+      pi(2)  = 2**dy
+      idi(2) = mod((me/pi(1)),pi(2))
+
+      pi(3)  = nprocs/(pi(1)*pi(2))
+      idi(3) = me/(pi(1)*pi(2))
+
+      do  k = lt,1,-1
+         dead(k) = .false.
+         do  ax = 1,3
+            take_ex(ax,k) = .false.
+            give_ex(ax,k) = .false.
+
+            mi(ax,k) = 2 +  &
+     &           ((idi(ax)+1)*ng(ax,k))/pi(ax) -  &
+     &           ((idi(ax)+0)*ng(ax,k))/pi(ax)
+            mip(ax,k) = 2 +  &
+     &           ((next(ax)+idi(ax)+1)*ng(ax,k))/pi(ax) -  &
+     &           ((next(ax)+idi(ax)+0)*ng(ax,k))/pi(ax) 
+
+            if(mip(ax,k).eq.2.or.mi(ax,k).eq.2)then
+               next(ax) = 2*next(ax)
+            endif
+
+            if( k+1 .le. lt )then
+               if((mip(ax,k).eq.2).and.(mi(ax,k).eq.3))then
+                  give_ex(ax,k+1) = .true.
+               endif
+               if((mip(ax,k).eq.3).and.(mi(ax,k).eq.2))then
+                  take_ex(ax,k+1) = .true.
+               endif
+            endif
+         enddo
+
+         if( mi(1,k).eq.2 .or.  &
+     &        mi(2,k).eq.2 .or.  &
+     &        mi(3,k).eq.2      )then
+            dead(k) = .true.
+         endif
+         m1(k) = mi(1,k)
+         m2(k) = mi(2,k)
+         m3(k) = mi(3,k)
+
+         do  ax=1,3
+            idin(ax,+1) = mod( idi(ax) + next(ax) + pi(ax) , pi(ax) )
+            idin(ax,-1) = mod( idi(ax) - next(ax) + pi(ax) , pi(ax) )
+         enddo
+         do  dir = 1,-1,-2
+            nbr(1,dir,k) = idin(1,dir) + pi(1)  &
+     &           *(idi(2)      + pi(2)  &
+     &           * idi(3))
+            nbr(2,dir,k) = idi(1)      + pi(1)  &
+     &           *(idin(2,dir) + pi(2)  &
+     &           * idi(3))
+            nbr(3,dir,k) = idi(1)      + pi(1)  &
+     &           *(idi(2)      + pi(2)  &
+     &           * idin(3,dir))
+         enddo
+      enddo
+
+      k = lt
+      is1 = 2 + ng(1,k) - ((pi(1)  -idi(1))*ng(1,lt))/pi(1)
+      ie1 = 1 + ng(1,k) - ((pi(1)-1-idi(1))*ng(1,lt))/pi(1)
+      n1 = 3 + ie1 - is1
+      is2 = 2 + ng(2,k) - ((pi(2)  -idi(2))*ng(2,lt))/pi(2)
+      ie2 = 1 + ng(2,k) - ((pi(2)-1-idi(2))*ng(2,lt))/pi(2)
+      n2 = 3 + ie2 - is2
+      is3 = 2 + ng(3,k) - ((pi(3)  -idi(3))*ng(3,lt))/pi(3)
+      ie3 = 1 + ng(3,k) - ((pi(3)-1-idi(3))*ng(3,lt))/pi(3)
+      n3 = 3 + ie3 - is3
+
+
+      ir(lt)=1
+      do  j = lt-1, 1, -1
+         ir(j)=ir(j+1)+m1(j+1)*m2(j+1)*m3(j+1)
+      enddo
+
+
+      if( debug_vec(1) .ge. 1 )then
+         if( me .eq. root )write(*,*)' in setup, '
+         if( me .eq. root )write(*,*)' me   k  lt  nx  ny  nz ',  &
+     &        ' n1  n2  n3 is1 is2 is3 ie1 ie2 ie3'
+         do  i=0,nprocs-1
+            if( me .eq. i )then
+               write(*,9) me,k,lt,ng(1,k),ng(2,k),ng(3,k),  &
+     &              n1,n2,n3,is1,is2,is3,ie1,ie2,ie3
+ 9             format(15i4)
+            endif
+            call mpi_barrier(comm_work,ierr)
+         enddo
+      endif
+      if( debug_vec(1) .ge. 2 )then
+         do  i=0,nprocs-1
+            if( me .eq. i )then
+               write(*,*)' '
+               write(*,*)' processor =',me
+               do  k=lt,1,-1
+                  write(*,7)k,idi(1),idi(2),idi(3),  &
+     &                 ((nbr(d,j,k),j=-1,1,2),d=1,3),  &
+     &                 (mi(d,k),d=1,3)
+               enddo
+ 7             format(i4,'idi=',3i4,'nbr=',3(2i4,'  '),'mi=',3i4,' ')
+               write(*,*)'idi(s) = ',(idi(s),s=1,3)
+               write(*,*)'dead(2), dead(1) = ',dead(2),dead(1)
+               do  ax=1,3
+                  write(*,*)'give_ex(ax,2)= ',give_ex(ax,2)
+                  write(*,*)'take_ex(ax,2)= ',take_ex(ax,2)
+               enddo
+            endif
+            call mpi_barrier(comm_work,ierr)
+         enddo
+      endif
+
+      k = lt
+
+      return
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine mg3P(u,v,r,a,c,n1,n2,n3,k)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     multigrid V-cycle routine
+!---------------------------------------------------------------------
+
+      use mg_data
+      use mpinpb
+
+      implicit none
+
+      integer n1, n2, n3, k
+      double precision u(nr),v(nv),r(nr)
+      double precision a(0:3),c(0:3)
+
+      integer j
+
+!---------------------------------------------------------------------
+!     down cycle.
+!     restrict the residual from the fine grid to the coarse
+!---------------------------------------------------------------------
+
+      do  k= lt, lb+1 , -1
+         j = k-1
+         call rprj3(r(ir(k)),m1(k),m2(k),m3(k),  &
+     &        r(ir(j)),m1(j),m2(j),m3(j),k)
+      enddo
+
+      k = lb
+!---------------------------------------------------------------------
+!     compute an approximate solution on the coarsest grid
+!---------------------------------------------------------------------
+      call zero3(u(ir(k)),m1(k),m2(k),m3(k))
+      call psinv(r(ir(k)),u(ir(k)),m1(k),m2(k),m3(k),c,k)
+
+      do  k = lb+1, lt-1     
+          j = k-1
+!---------------------------------------------------------------------
+!        prolongate from level k-1  to k
+!---------------------------------------------------------------------
+         call zero3(u(ir(k)),m1(k),m2(k),m3(k))
+         call interp(u(ir(j)),m1(j),m2(j),m3(j),  &
+     &               u(ir(k)),m1(k),m2(k),m3(k),k)
+!---------------------------------------------------------------------
+!        compute residual for level k
+!---------------------------------------------------------------------
+         call resid(u(ir(k)),r(ir(k)),r(ir(k)),m1(k),m2(k),m3(k),a,k)
+!---------------------------------------------------------------------
+!        apply smoother
+!---------------------------------------------------------------------
+         call psinv(r(ir(k)),u(ir(k)),m1(k),m2(k),m3(k),c,k)
+      enddo
+ 200  continue
+      j = lt - 1
+      k = lt
+      call interp(u(ir(j)),m1(j),m2(j),m3(j),u,n1,n2,n3,k)
+      call resid(u,v,r,n1,n2,n3,a,k)
+      call psinv(r,u,n1,n2,n3,c,k)
+
+      return
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine psinv( r,u,n1,n2,n3,c,k)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     psinv applies an approximate inverse as smoother:  u = u + Cr
+!
+!     This  implementation costs  15A + 4M per result, where
+!     A and M denote the costs of Addition and Multiplication.  
+!     Presuming coefficient c(3) is zero (the NPB assumes this,
+!     but it is thus not a general case), 2A + 1M may be eliminated,
+!     resulting in 13A + 3M.
+!     Note that this vectorizes, and is also fine for cache 
+!     based machines.  
+!---------------------------------------------------------------------
+
+      use mg_data
+      implicit none
+
+      integer n1,n2,n3,k
+      double precision u(n1,n2,n3),r(n1,n2,n3),c(0:3)
+      integer i3, i2, i1
+
+      double precision r1(m), r2(m)
+      
+      if (timeron) call timer_start(t_psinv)
+      do i3=2,n3-1
+         do i2=2,n2-1
+            do i1=1,n1
+               r1(i1) = r(i1,i2-1,i3) + r(i1,i2+1,i3)  &
+     &                + r(i1,i2,i3-1) + r(i1,i2,i3+1)
+               r2(i1) = r(i1,i2-1,i3-1) + r(i1,i2+1,i3-1)  &
+     &                + r(i1,i2-1,i3+1) + r(i1,i2+1,i3+1)
+            enddo
+            do i1=2,n1-1
+               u(i1,i2,i3) = u(i1,i2,i3)  &
+     &                     + c(0) * r(i1,i2,i3)  &
+     &                     + c(1) * ( r(i1-1,i2,i3) + r(i1+1,i2,i3)  &
+     &                              + r1(i1) )  &
+     &                     + c(2) * ( r2(i1) + r1(i1-1) + r1(i1+1) )
+!---------------------------------------------------------------------
+!  Assume c(3) = 0    (Enable line below if c(3) not= 0)
+!---------------------------------------------------------------------
+!    >                     + c(3) * ( r2(i1-1) + r2(i1+1) )
+!---------------------------------------------------------------------
+            enddo
+         enddo
+      enddo
+      if (timeron) call timer_stop(t_psinv)
+
+!---------------------------------------------------------------------
+!     exchange boundary points
+!---------------------------------------------------------------------
+      call comm3(u,n1,n2,n3,k)
+
+      if( debug_vec(0) .ge. 1 )then
+         call rep_nrm(u,n1,n2,n3,'   psinv',k)
+      endif
+
+      if( debug_vec(3) .ge. k )then
+         call showall(u,n1,n2,n3)
+      endif
+
+      return
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine resid( u,v,r,n1,n2,n3,a,k )
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     resid computes the residual:  r = v - Au
+!
+!     This  implementation costs  15A + 4M per result, where
+!     A and M denote the costs of Addition (or Subtraction) and 
+!     Multiplication, respectively. 
+!     Presuming coefficient a(1) is zero (the NPB assumes this,
+!     but it is thus not a general case), 3A + 1M may be eliminated,
+!     resulting in 12A + 3M.
+!     Note that this vectorizes, and is also fine for cache 
+!     based machines.  
+!---------------------------------------------------------------------
+
+      use mg_data
+      implicit none
+
+      integer n1,n2,n3,k
+      double precision u(n1,n2,n3),v(n1,n2,n3),r(n1,n2,n3),a(0:3)
+      integer i3, i2, i1
+      double precision u1(m), u2(m)
+
+      if (timeron) call timer_start(t_resid)
+      do i3=2,n3-1
+         do i2=2,n2-1
+            do i1=1,n1
+               u1(i1) = u(i1,i2-1,i3) + u(i1,i2+1,i3)  &
+     &                + u(i1,i2,i3-1) + u(i1,i2,i3+1)
+               u2(i1) = u(i1,i2-1,i3-1) + u(i1,i2+1,i3-1)  &
+     &                + u(i1,i2-1,i3+1) + u(i1,i2+1,i3+1)
+            enddo
+            do i1=2,n1-1
+               r(i1,i2,i3) = v(i1,i2,i3)  &
+     &                     - a(0) * u(i1,i2,i3)  &
+!---------------------------------------------------------------------
+!  Assume a(1) = 0      (Enable 2 lines below if a(1) not= 0)
+!---------------------------------------------------------------------
+!    >                     - a(1) * ( u(i1-1,i2,i3) + u(i1+1,i2,i3)
+!    >                              + u1(i1) )
+!---------------------------------------------------------------------
+     &                     - a(2) * ( u2(i1) + u1(i1-1) + u1(i1+1) )  &
+     &                     - a(3) * ( u2(i1-1) + u2(i1+1) )
+            enddo
+         enddo
+      enddo
+      if (timeron) call timer_stop(t_resid)
+
+!---------------------------------------------------------------------
+!     exchange boundary data
+!---------------------------------------------------------------------
+      call comm3(r,n1,n2,n3,k)
+
+      if( debug_vec(0) .ge. 1 )then
+         call rep_nrm(r,n1,n2,n3,'   resid',k)
+      endif
+
+      if( debug_vec(2) .ge. k )then
+         call showall(r,n1,n2,n3)
+      endif
+
+      return
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine rprj3( r,m1k,m2k,m3k,s,m1j,m2j,m3j,k )
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     rprj3 projects onto the next coarser grid,
+!     using a trilinear Finite Element projection:  s = r' = P r
+!
+!     This implementation costs 20A + 4M per result, where
+!     A and M denote the costs of Addition and Multiplication.
+!     Note that this vectorizes, and is also fine for cache-based
+!     machines.
+!---------------------------------------------------------------------
+
+      use mg_data
+      use mpinpb
+
+      implicit none
+
+      integer m1k, m2k, m3k, m1j, m2j, m3j,k
+      double precision r(m1k,m2k,m3k), s(m1j,m2j,m3j)
+      integer j3, j2, j1, i3, i2, i1, d1, d2, d3, j
+
+      double precision x1(m), y1(m), x2,y2
+
+
+      if (timeron) call timer_start(t_rprj3)
+      if(m1k.eq.3)then
+        d1 = 2
+      else
+        d1 = 1
+      endif
+
+      if(m2k.eq.3)then
+        d2 = 2
+      else
+        d2 = 1
+      endif
+
+      if(m3k.eq.3)then
+        d3 = 2
+      else
+        d3 = 1
+      endif
+
+      do  j3=2,m3j-1
+         i3 = 2*j3-d3
+!        i3 = 2*j3-1
+         do  j2=2,m2j-1
+            i2 = 2*j2-d2
+!           i2 = 2*j2-1
+
+            do j1=2,m1j
+              i1 = 2*j1-d1
+!             i1 = 2*j1-1
+              x1(i1-1) = r(i1-1,i2-1,i3  ) + r(i1-1,i2+1,i3  )  &
+     &                 + r(i1-1,i2,  i3-1) + r(i1-1,i2,  i3+1)
+              y1(i1-1) = r(i1-1,i2-1,i3-1) + r(i1-1,i2-1,i3+1)  &
+     &                 + r(i1-1,i2+1,i3-1) + r(i1-1,i2+1,i3+1)
+            enddo
+
+            do  j1=2,m1j-1
+              i1 = 2*j1-d1
+!             i1 = 2*j1-1
+              y2 = r(i1,  i2-1,i3-1) + r(i1,  i2-1,i3+1)  &
+     &           + r(i1,  i2+1,i3-1) + r(i1,  i2+1,i3+1)
+              x2 = r(i1,  i2-1,i3  ) + r(i1,  i2+1,i3  )  &
+     &           + r(i1,  i2,  i3-1) + r(i1,  i2,  i3+1)
+              s(j1,j2,j3) =  &
+     &               0.5D0 * r(i1,i2,i3)  &
+     &             + 0.25D0 * ( r(i1-1,i2,i3) + r(i1+1,i2,i3) + x2)  &
+     &             + 0.125D0 * ( x1(i1-1) + x1(i1+1) + y2)  &
+     &             + 0.0625D0 * ( y1(i1-1) + y1(i1+1) )
+            enddo
+
+         enddo
+      enddo
+      if (timeron) call timer_stop(t_rprj3)
+
+
+      j = k-1
+      call comm3(s,m1j,m2j,m3j,j)
+
+      if( debug_vec(0) .ge. 1 )then
+         call rep_nrm(s,m1j,m2j,m3j,'   rprj3',k-1)
+      endif
+
+      if( debug_vec(4) .ge. k )then
+         call showall(s,m1j,m2j,m3j)
+      endif
+
+      return
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine interp( z,mm1,mm2,mm3,u,n1,n2,n3,k )
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     interp adds the trilinear interpolation of the correction
+!     from the coarser grid to the current approximation:  u = u + Qu'
+!     
+!     Observe that this implementation costs 16A + 4M, where
+!     A and M denote the costs of Addition and Multiplication.
+!     Note that this vectorizes, and is also fine for cache-based
+!     machines.  Vector machines may, however, get slightly better
+!     performance with 8 separate "do i1" loops, rather than 4.
+!---------------------------------------------------------------------
+
+      use mg_data
+      use mpinpb
+
+      implicit none
+
+      integer mm1, mm2, mm3, n1, n2, n3,k
+      double precision z(mm1,mm2,mm3),u(n1,n2,n3)
+      integer i3, i2, i1, d1, d2, d3, t1, t2, t3
+
+! Note that m = 1037 in globals.h, but here it only needs to be
+! 535 to handle up to 1024^3.
+!      integer m
+!      parameter( m=535 )
+      double precision z1(m),z2(m),z3(m)
+
+
+      if (timeron) call timer_start(t_interp)
+      if( n1 .ne. 3 .and. n2 .ne. 3 .and. n3 .ne. 3 ) then
+
+         do  i3=1,mm3-1
+            do  i2=1,mm2-1
+
+               do i1=1,mm1
+                  z1(i1) = z(i1,i2+1,i3) + z(i1,i2,i3)
+                  z2(i1) = z(i1,i2,i3+1) + z(i1,i2,i3)
+                  z3(i1) = z(i1,i2+1,i3+1) + z(i1,i2,i3+1) + z1(i1)
+               enddo
+
+               do  i1=1,mm1-1
+                  u(2*i1-1,2*i2-1,2*i3-1)=u(2*i1-1,2*i2-1,2*i3-1)  &
+     &                 +z(i1,i2,i3)
+                  u(2*i1,2*i2-1,2*i3-1)=u(2*i1,2*i2-1,2*i3-1)  &
+     &                 +0.5d0*(z(i1+1,i2,i3)+z(i1,i2,i3))
+               enddo
+               do i1=1,mm1-1
+                  u(2*i1-1,2*i2,2*i3-1)=u(2*i1-1,2*i2,2*i3-1)  &
+     &                 +0.5d0 * z1(i1)
+                  u(2*i1,2*i2,2*i3-1)=u(2*i1,2*i2,2*i3-1)  &
+     &                 +0.25d0*( z1(i1) + z1(i1+1) )
+               enddo
+               do i1=1,mm1-1
+                  u(2*i1-1,2*i2-1,2*i3)=u(2*i1-1,2*i2-1,2*i3)  &
+     &                 +0.5d0 * z2(i1)
+                  u(2*i1,2*i2-1,2*i3)=u(2*i1,2*i2-1,2*i3)  &
+     &                 +0.25d0*( z2(i1) + z2(i1+1) )
+               enddo
+               do i1=1,mm1-1
+                  u(2*i1-1,2*i2,2*i3)=u(2*i1-1,2*i2,2*i3)  &
+     &                 +0.25d0* z3(i1)
+                  u(2*i1,2*i2,2*i3)=u(2*i1,2*i2,2*i3)  &
+     &                 +0.125d0*( z3(i1) + z3(i1+1) )
+               enddo
+            enddo
+         enddo
+
+      else
+
+         if(n1.eq.3)then
+            d1 = 2
+            t1 = 1
+         else
+            d1 = 1
+            t1 = 0
+         endif
+         
+         if(n2.eq.3)then
+            d2 = 2
+            t2 = 1
+         else
+            d2 = 1
+            t2 = 0
+         endif
+         
+         if(n3.eq.3)then
+            d3 = 2
+            t3 = 1
+         else
+            d3 = 1
+            t3 = 0
+         endif
+         
+         do  i3=d3,mm3-1
+            do  i2=d2,mm2-1
+               do  i1=d1,mm1-1
+                  u(2*i1-d1,2*i2-d2,2*i3-d3)=u(2*i1-d1,2*i2-d2,2*i3-d3)  &
+     &                 +z(i1,i2,i3)
+               enddo
+               do  i1=1,mm1-1
+                  u(2*i1-t1,2*i2-d2,2*i3-d3)=u(2*i1-t1,2*i2-d2,2*i3-d3)  &
+     &                 +0.5D0*(z(i1+1,i2,i3)+z(i1,i2,i3))
+               enddo
+            enddo
+            do  i2=1,mm2-1
+               do  i1=d1,mm1-1
+                  u(2*i1-d1,2*i2-t2,2*i3-d3)=u(2*i1-d1,2*i2-t2,2*i3-d3)  &
+     &                 +0.5D0*(z(i1,i2+1,i3)+z(i1,i2,i3))
+               enddo
+               do  i1=1,mm1-1
+                  u(2*i1-t1,2*i2-t2,2*i3-d3)=u(2*i1-t1,2*i2-t2,2*i3-d3)  &
+     &                 +0.25D0*(z(i1+1,i2+1,i3)+z(i1+1,i2,i3)  &
+     &                 +z(i1,  i2+1,i3)+z(i1,  i2,i3))
+               enddo
+            enddo
+         enddo
+
+         do  i3=1,mm3-1
+            do  i2=d2,mm2-1
+               do  i1=d1,mm1-1
+                  u(2*i1-d1,2*i2-d2,2*i3-t3)=u(2*i1-d1,2*i2-d2,2*i3-t3)  &
+     &                 +0.5D0*(z(i1,i2,i3+1)+z(i1,i2,i3))
+               enddo
+               do  i1=1,mm1-1
+                  u(2*i1-t1,2*i2-d2,2*i3-t3)=u(2*i1-t1,2*i2-d2,2*i3-t3)  &
+     &                 +0.25D0*(z(i1+1,i2,i3+1)+z(i1,i2,i3+1)  &
+     &                 +z(i1+1,i2,i3  )+z(i1,i2,i3  ))
+               enddo
+            enddo
+            do  i2=1,mm2-1
+               do  i1=d1,mm1-1
+                  u(2*i1-d1,2*i2-t2,2*i3-t3)=u(2*i1-d1,2*i2-t2,2*i3-t3)  &
+     &                 +0.25D0*(z(i1,i2+1,i3+1)+z(i1,i2,i3+1)  &
+     &                 +z(i1,i2+1,i3  )+z(i1,i2,i3  ))
+               enddo
+               do  i1=1,mm1-1
+                  u(2*i1-t1,2*i2-t2,2*i3-t3)=u(2*i1-t1,2*i2-t2,2*i3-t3)  &
+     &                 +0.125D0*(z(i1+1,i2+1,i3+1)+z(i1+1,i2,i3+1)  &
+     &                 +z(i1  ,i2+1,i3+1)+z(i1  ,i2,i3+1)  &
+     &                 +z(i1+1,i2+1,i3  )+z(i1+1,i2,i3  )  &
+     &                 +z(i1  ,i2+1,i3  )+z(i1  ,i2,i3  ))
+               enddo
+            enddo
+         enddo
+
+      endif
+      if (timeron) call timer_stop(t_interp)
+
+      call comm3_ex(u,n1,n2,n3,k)
+
+      if( debug_vec(0) .ge. 1 )then
+         call rep_nrm(z,mm1,mm2,mm3,'z: inter',k-1)
+         call rep_nrm(u,n1,n2,n3,'u: inter',k)
+      endif
+
+      if( debug_vec(5) .ge. k )then
+         call showall(z,mm1,mm2,mm3)
+         call showall(u,n1,n2,n3)
+      endif
+
+      return 
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine norm2u3(r,n1,n2,n3,rnm2,rnmu,nx0,ny0,nz0)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     norm2u3 evaluates approximations to the L2 norm and the
+!     uniform (or L-infinity or Chebyshev) norm, under the
+!     assumption that the boundaries are periodic or zero.  Add the
+!     boundaries in with half weight (quarter weight on the edges
+!     and eighth weight at the corners) for inhomogeneous boundaries.
+!---------------------------------------------------------------------
+
+      use mg_data
+      use mpinpb
+
+      implicit none
+
+      integer n1, n2, n3, nx0, ny0, nz0
+      double precision rnm2, rnmu, r(n1,n2,n3)
+      double precision s, a, ss
+      integer i3, i2, i1, ierr
+
+      double precision dn
+
+      if (timeron) call timer_start(t_norm2u3)
+      dn = 1.0d0*nx0*ny0*nz0
+
+      s=0.0D0
+      rnmu = 0.0D0
+      do  i3=2,n3-1
+         do  i2=2,n2-1
+            do  i1=2,n1-1
+               s=s+r(i1,i2,i3)**2
+               a=abs(r(i1,i2,i3))
+               if(a.gt.rnmu)rnmu=a
+            enddo
+         enddo
+      enddo
+      if (timeron) call timer_stop(t_norm2u3)
+
+      if (timeron) call timer_start(t_rcomm)
+      call mpi_allreduce(rnmu,ss,1,dp_type,  &
+     &     mpi_max,comm_work,ierr)
+      rnmu = ss
+      call mpi_allreduce(s, ss, 1, dp_type,  &
+     &     mpi_sum,comm_work,ierr)
+      s = ss
+      if (timeron) call timer_stop(t_rcomm)
+      rnm2=sqrt( s / dn )
+
+      return
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine rep_nrm(u,n1,n2,n3,title,kk)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     report on norm
+!---------------------------------------------------------------------
+
+      use mg_data
+      use mpinpb
+
+      implicit none
+
+      integer n1, n2, n3, kk
+      double precision u(n1,n2,n3)
+      character*8 title
+
+      double precision rnm2, rnmu
+
+
+      call norm2u3(u,n1,n2,n3,rnm2,rnmu,nx(kk),ny(kk),nz(kk))
+      if( me .eq. root )then
+         write(*,7)kk,title,rnm2,rnmu
+ 7       format(' Level',i2,' in ',a8,': norms =',D21.14,D21.14)
+      endif
+
+      return
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine comm3(u,n1,n2,n3,kk)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     comm3 organizes the communication on all borders 
+!---------------------------------------------------------------------
+
+      use mg_data
+      use mpinpb
+
+      implicit none
+
+      integer n1, n2, n3, kk
+      double precision u(n1,n2,n3)
+      integer axis
+
+      if( .not. dead(kk) )then
+         do  axis = 1, 3
+            if( nprocs .ne. 1) then
+   
+               call ready( axis, -1, kk )
+               call ready( axis, +1, kk )
+   
+               call give3( axis, +1, u, n1, n2, n3, kk )
+               call give3( axis, -1, u, n1, n2, n3, kk )
+   
+               call take3( axis, -1, u, n1, n2, n3 )
+               call take3( axis, +1, u, n1, n2, n3 )
+   
+            else
+               call comm1p( axis, u, n1, n2, n3, kk )
+            endif
+         enddo
+      else
+         call zero3(u,n1,n2,n3)
+      endif
+      return
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine comm3_ex(u,n1,n2,n3,kk)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     comm3_ex communicates to expand the number of processors
+!---------------------------------------------------------------------
+
+      use mg_data
+      use mpinpb
+
+      implicit none
+
+      integer n1, n2, n3, kk
+      double precision u(n1,n2,n3)
+      integer axis
+
+      do  axis = 1, 3
+         if( nprocs .ne. 1 ) then
+            if( take_ex( axis, kk ) )then
+               call ready( axis, -1, kk )
+               call ready( axis, +1, kk )
+               call take3_ex( axis, -1, u, n1, n2, n3 )
+               call take3_ex( axis, +1, u, n1, n2, n3 )
+            endif
+   
+            if( give_ex( axis, kk ) )then
+               call give3_ex( axis, +1, u, n1, n2, n3, kk )
+               call give3_ex( axis, -1, u, n1, n2, n3, kk )
+            endif
+         else
+            call comm1p_ex( axis, u, n1, n2, n3, kk )
+         endif
+      enddo
+
+      return
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine ready( axis, dir, k )
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     ready clears a receive buffer and posts a non-blocking receive
+!---------------------------------------------------------------------
+
+      use mg_data
+      use mpinpb
+
+      implicit none
+
+      integer axis, dir, k
+      integer buff_id,buff_len,i,ierr
+
+      buff_id = 3 + dir
+      buff_len = nm2
+
+      do  i=1,nm2
+         buff(i,buff_id) = 0.0D0
+      enddo
+
+
+!---------------------------------------------------------------------
+!     fake message request type
+!---------------------------------------------------------------------
+      if (timeron) call timer_start(t_comm3)
+      msg_id(axis,dir,1) = msg_type(axis,dir) +1000*me
+
+      call mpi_irecv( buff(1,buff_id), buff_len,  &
+     &     dp_type, nbr(axis,-dir,k), msg_type(axis,dir),  &
+     &     comm_work, msg_id(axis,dir,1), ierr)
+      if (timeron) call timer_stop(t_comm3)
+      return
+      end
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine give3( axis, dir, u, n1, n2, n3, k )
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     give3 sends border data out in the requested direction
+!---------------------------------------------------------------------
+
+      use mg_data
+      use mpinpb
+
+      implicit none
+
+      integer axis, dir, n1, n2, n3, k, ierr
+      double precision u( n1, n2, n3 )
+
+      integer i3, i2, i1, buff_len,buff_id
+
+      buff_id = 2 + dir 
+      buff_len = 0
+
+      if( axis .eq.  1 )then
+         if( dir .eq. -1 )then
+
+            do  i3=2,n3-1
+               do  i2=2,n2-1
+                  buff_len = buff_len + 1
+                  buff(buff_len,buff_id ) = u( 2,  i2,i3)
+               enddo
+            enddo
+
+            if (timeron) call timer_start(t_comm3)
+            call mpi_send(  &
+     &           buff(1, buff_id ), buff_len,dp_type,  &
+     &           nbr( axis, dir, k ), msg_type(axis,dir),  &
+     &           comm_work, ierr)
+            if (timeron) call timer_stop(t_comm3)
+
+         else if( dir .eq. +1 ) then
+
+            do  i3=2,n3-1
+               do  i2=2,n2-1
+                  buff_len = buff_len + 1
+                  buff(buff_len, buff_id ) = u( n1-1, i2,i3)
+               enddo
+            enddo
+
+            if (timeron) call timer_start(t_comm3)
+            call mpi_send(  &
+     &           buff(1, buff_id ), buff_len,dp_type,  &
+     &           nbr( axis, dir, k ), msg_type(axis,dir),  &
+     &           comm_work, ierr)
+            if (timeron) call timer_stop(t_comm3)
+
+         endif
+      endif
+
+      if( axis .eq.  2 )then
+         if( dir .eq. -1 )then
+
+            do  i3=2,n3-1
+               do  i1=1,n1
+                  buff_len = buff_len + 1
+                  buff(buff_len, buff_id ) = u( i1,  2,i3)
+               enddo
+            enddo
+
+            if (timeron) call timer_start(t_comm3)
+            call mpi_send(  &
+     &           buff(1, buff_id ), buff_len,dp_type,  &
+     &           nbr( axis, dir, k ), msg_type(axis,dir),  &
+     &           comm_work, ierr)
+            if (timeron) call timer_stop(t_comm3)
+
+         else if( dir .eq. +1 ) then
+
+            do  i3=2,n3-1
+               do  i1=1,n1
+                  buff_len = buff_len + 1
+                  buff(buff_len,  buff_id )= u( i1,n2-1,i3)
+               enddo
+            enddo
+
+            if (timeron) call timer_start(t_comm3)
+            call mpi_send(  &
+     &           buff(1, buff_id ), buff_len,dp_type,  &
+     &           nbr( axis, dir, k ), msg_type(axis,dir),  &
+     &           comm_work, ierr)
+            if (timeron) call timer_stop(t_comm3)
+
+         endif
+      endif
+
+      if( axis .eq.  3 )then
+         if( dir .eq. -1 )then
+
+            do  i2=1,n2
+               do  i1=1,n1
+                  buff_len = buff_len + 1
+                  buff(buff_len, buff_id ) = u( i1,i2,2)
+               enddo
+            enddo
+
+            if (timeron) call timer_start(t_comm3)
+            call mpi_send(  &
+     &           buff(1, buff_id ), buff_len,dp_type,  &
+     &           nbr( axis, dir, k ), msg_type(axis,dir),  &
+     &           comm_work, ierr)
+            if (timeron) call timer_stop(t_comm3)
+
+         else if( dir .eq. +1 ) then
+
+            do  i2=1,n2
+               do  i1=1,n1
+                  buff_len = buff_len + 1
+                  buff(buff_len, buff_id ) = u( i1,i2,n3-1)
+               enddo
+            enddo
+
+            if (timeron) call timer_start(t_comm3)
+            call mpi_send(  &
+     &           buff(1, buff_id ), buff_len,dp_type,  &
+     &           nbr( axis, dir, k ), msg_type(axis,dir),  &
+     &           comm_work, ierr)
+            if (timeron) call timer_stop(t_comm3)
+
+         endif
+      endif
+
+      return
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine take3( axis, dir, u, n1, n2, n3 )
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     take3 copies in border data from the requested direction
+!---------------------------------------------------------------------
+
+      use mg_data
+      use mpinpb
+
+      implicit none
+
+      integer axis, dir, n1, n2, n3
+      double precision u( n1, n2, n3 )
+
+      integer buff_id, indx
+
+      integer status(mpi_status_size), ierr
+
+      integer i3, i2, i1
+
+      if (timeron) call timer_start(t_comm3)
+      call mpi_wait( msg_id( axis, dir, 1 ),status,ierr)
+      if (timeron) call timer_stop(t_comm3)
+      buff_id = 3 + dir
+      indx = 0
+
+      if( axis .eq.  1 )then
+         if( dir .eq. -1 )then
+
+            do  i3=2,n3-1
+               do  i2=2,n2-1
+                  indx = indx + 1
+                  u(n1,i2,i3) = buff(indx, buff_id )
+               enddo
+            enddo
+
+         else if( dir .eq. +1 ) then
+
+            do  i3=2,n3-1
+               do  i2=2,n2-1
+                  indx = indx + 1
+                  u(1,i2,i3) = buff(indx, buff_id )
+               enddo
+            enddo
+
+         endif
+      endif
+
+      if( axis .eq.  2 )then
+         if( dir .eq. -1 )then
+
+            do  i3=2,n3-1
+               do  i1=1,n1
+                  indx = indx + 1
+                  u(i1,n2,i3) = buff(indx, buff_id )
+               enddo
+            enddo
+
+         else if( dir .eq. +1 ) then
+
+            do  i3=2,n3-1
+               do  i1=1,n1
+                  indx = indx + 1
+                  u(i1,1,i3) = buff(indx, buff_id )
+               enddo
+            enddo
+
+         endif
+      endif
+
+      if( axis .eq.  3 )then
+         if( dir .eq. -1 )then
+
+            do  i2=1,n2
+               do  i1=1,n1
+                  indx = indx + 1
+                  u(i1,i2,n3) = buff(indx, buff_id )
+               enddo
+            enddo
+
+         else if( dir .eq. +1 ) then
+
+            do  i2=1,n2
+               do  i1=1,n1
+                  indx = indx + 1
+                  u(i1,i2,1) = buff(indx, buff_id )
+               enddo
+            enddo
+
+         endif
+      endif
+
+      return
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine give3_ex( axis, dir, u, n1, n2, n3, k )
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     give3_ex sends border data out to expand number of processors
+!---------------------------------------------------------------------
+
+      use mg_data
+      use mpinpb
+
+      implicit none
+
+      integer axis, dir, n1, n2, n3, k, ierr
+      double precision u( n1, n2, n3 )
+
+      integer i3, i2, i1, buff_len, buff_id
+
+      buff_id = 2 + dir 
+      buff_len = 0
+
+      if( axis .eq.  1 )then
+         if( dir .eq. -1 )then
+
+            do  i3=1,n3
+               do  i2=1,n2
+                  buff_len = buff_len + 1
+                  buff(buff_len,buff_id ) = u( 2,  i2,i3)
+               enddo
+            enddo
+
+            if (timeron) call timer_start(t_comm3)
+            call mpi_send(  &
+     &           buff(1, buff_id ), buff_len,dp_type,  &
+     &           nbr( axis, dir, k ), msg_type(axis,dir),  &
+     &           comm_work, ierr)
+            if (timeron) call timer_stop(t_comm3)
+
+         else if( dir .eq. +1 ) then
+
+            do  i3=1,n3
+               do  i2=1,n2
+                  do  i1=n1-1,n1
+                     buff_len = buff_len + 1
+                     buff(buff_len,buff_id)= u(i1,i2,i3)
+                  enddo
+               enddo
+            enddo
+
+            if (timeron) call timer_start(t_comm3)
+            call mpi_send(  &
+     &           buff(1, buff_id ), buff_len,dp_type,  &
+     &           nbr( axis, dir, k ), msg_type(axis,dir),  &
+     &           comm_work, ierr)
+            if (timeron) call timer_stop(t_comm3)
+
+         endif
+      endif
+
+      if( axis .eq.  2 )then
+         if( dir .eq. -1 )then
+
+            do  i3=1,n3
+               do  i1=1,n1
+                  buff_len = buff_len + 1
+                  buff(buff_len, buff_id ) = u( i1,  2,i3)
+               enddo
+            enddo
+
+            if (timeron) call timer_start(t_comm3)
+            call mpi_send(  &
+     &           buff(1, buff_id ), buff_len,dp_type,  &
+     &           nbr( axis, dir, k ), msg_type(axis,dir),  &
+     &           comm_work, ierr)
+            if (timeron) call timer_stop(t_comm3)
+
+         else if( dir .eq. +1 ) then
+
+            do  i3=1,n3
+               do  i2=n2-1,n2
+                  do  i1=1,n1
+                     buff_len = buff_len + 1
+                     buff(buff_len,buff_id )= u(i1,i2,i3)
+                  enddo
+               enddo
+            enddo
+
+            if (timeron) call timer_start(t_comm3)
+            call mpi_send(  &
+     &           buff(1, buff_id ), buff_len,dp_type,  &
+     &           nbr( axis, dir, k ), msg_type(axis,dir),  &
+     &           comm_work, ierr)
+            if (timeron) call timer_stop(t_comm3)
+
+         endif
+      endif
+
+      if( axis .eq.  3 )then
+         if( dir .eq. -1 )then
+
+            do  i2=1,n2
+               do  i1=1,n1
+                  buff_len = buff_len + 1
+                  buff(buff_len, buff_id ) = u( i1,i2,2)
+               enddo
+            enddo
+
+            if (timeron) call timer_start(t_comm3)
+            call mpi_send(  &
+     &           buff(1, buff_id ), buff_len,dp_type,  &
+     &           nbr( axis, dir, k ), msg_type(axis,dir),  &
+     &           comm_work, ierr)
+            if (timeron) call timer_stop(t_comm3)
+
+         else if( dir .eq. +1 ) then
+
+            do  i3=n3-1,n3
+               do  i2=1,n2
+                  do  i1=1,n1
+                     buff_len = buff_len + 1
+                     buff(buff_len, buff_id ) = u( i1,i2,i3)
+                  enddo
+               enddo
+            enddo
+
+            if (timeron) call timer_start(t_comm3)
+            call mpi_send(  &
+     &           buff(1, buff_id ), buff_len,dp_type,  &
+     &           nbr( axis, dir, k ), msg_type(axis,dir),  &
+     &           comm_work, ierr)
+            if (timeron) call timer_stop(t_comm3)
+
+         endif
+      endif
+
+      return
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine take3_ex( axis, dir, u, n1, n2, n3 )
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     take3_ex copies in border data to expand number of processors
+!---------------------------------------------------------------------
+
+      use mg_data
+      use mpinpb
+
+      implicit none
+
+      integer axis, dir, n1, n2, n3
+      double precision u( n1, n2, n3 )
+
+      integer buff_id, indx
+
+      integer status(mpi_status_size) , ierr
+
+      integer i3, i2, i1
+
+      if (timeron) call timer_start(t_comm3)
+      call mpi_wait( msg_id( axis, dir, 1 ),status,ierr)
+      if (timeron) call timer_stop(t_comm3)
+      buff_id = 3 + dir
+      indx = 0
+
+      if( axis .eq.  1 )then
+         if( dir .eq. -1 )then
+
+            do  i3=1,n3
+               do  i2=1,n2
+                  indx = indx + 1
+                  u(n1,i2,i3) = buff(indx, buff_id )
+               enddo
+            enddo
+
+         else if( dir .eq. +1 ) then
+
+            do  i3=1,n3
+               do  i2=1,n2
+                  do  i1=1,2
+                     indx = indx + 1
+                     u(i1,i2,i3) = buff(indx,buff_id)
+                  enddo
+               enddo
+            enddo
+
+         endif
+      endif
+
+      if( axis .eq.  2 )then
+         if( dir .eq. -1 )then
+
+            do  i3=1,n3
+               do  i1=1,n1
+                  indx = indx + 1
+                  u(i1,n2,i3) = buff(indx, buff_id )
+               enddo
+            enddo
+
+         else if( dir .eq. +1 ) then
+
+            do  i3=1,n3
+               do  i2=1,2
+                  do  i1=1,n1
+                     indx = indx + 1
+                     u(i1,i2,i3) = buff(indx,buff_id)
+                  enddo
+               enddo
+            enddo
+
+         endif
+      endif
+
+      if( axis .eq.  3 )then
+         if( dir .eq. -1 )then
+
+            do  i2=1,n2
+               do  i1=1,n1
+                  indx = indx + 1
+                  u(i1,i2,n3) = buff(indx, buff_id )
+               enddo
+            enddo
+
+         else if( dir .eq. +1 ) then
+
+            do  i3=1,2
+               do  i2=1,n2
+                  do  i1=1,n1
+                     indx = indx + 1
+                     u(i1,i2,i3) = buff(indx,buff_id)
+                  enddo
+               enddo
+            enddo
+
+         endif
+      endif
+
+      return
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine comm1p( axis, u, n1, n2, n3, kk )
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      use mg_data
+      use mpinpb
+
+      implicit none
+
+      integer axis, dir, n1, n2, n3
+      double precision u( n1, n2, n3 )
+
+      integer i3, i2, i1, buff_len,buff_id
+      integer i, kk, indx
+
+      dir = -1
+
+      buff_id = 3 + dir
+      buff_len = nm2
+
+      do  i=1,nm2
+         buff(i,buff_id) = 0.0D0
+      enddo
+
+
+      dir = +1
+
+      buff_id = 3 + dir
+      buff_len = nm2
+
+      do  i=1,nm2
+         buff(i,buff_id) = 0.0D0
+      enddo
+
+      dir = +1
+
+      buff_id = 2 + dir 
+      buff_len = 0
+
+      if( axis .eq.  1 )then
+         do  i3=2,n3-1
+            do  i2=2,n2-1
+               buff_len = buff_len + 1
+               buff(buff_len, buff_id ) = u( n1-1, i2,i3)
+            enddo
+         enddo
+      endif
+
+      if( axis .eq.  2 )then
+         do  i3=2,n3-1
+            do  i1=1,n1
+               buff_len = buff_len + 1
+               buff(buff_len,  buff_id )= u( i1,n2-1,i3)
+            enddo
+         enddo
+      endif
+
+      if( axis .eq.  3 )then
+         do  i2=1,n2
+            do  i1=1,n1
+               buff_len = buff_len + 1
+               buff(buff_len, buff_id ) = u( i1,i2,n3-1)
+            enddo
+         enddo
+      endif
+
+      dir = -1
+
+      buff_id = 2 + dir 
+      buff_len = 0
+
+      if( axis .eq.  1 )then
+         do  i3=2,n3-1
+            do  i2=2,n2-1
+               buff_len = buff_len + 1
+               buff(buff_len,buff_id ) = u( 2,  i2,i3)
+            enddo
+         enddo
+      endif
+
+      if( axis .eq.  2 )then
+         do  i3=2,n3-1
+            do  i1=1,n1
+               buff_len = buff_len + 1
+               buff(buff_len, buff_id ) = u( i1,  2,i3)
+            enddo
+         enddo
+      endif
+
+      if( axis .eq.  3 )then
+         do  i2=1,n2
+            do  i1=1,n1
+               buff_len = buff_len + 1
+               buff(buff_len, buff_id ) = u( i1,i2,2)
+            enddo
+         enddo
+      endif
+
+      do  i=1,nm2
+         buff(i,4) = buff(i,3)
+         buff(i,2) = buff(i,1)
+      enddo
+
+      dir = -1
+
+      buff_id = 3 + dir
+      indx = 0
+
+      if( axis .eq.  1 )then
+         do  i3=2,n3-1
+            do  i2=2,n2-1
+               indx = indx + 1
+               u(n1,i2,i3) = buff(indx, buff_id )
+            enddo
+         enddo
+      endif
+
+      if( axis .eq.  2 )then
+         do  i3=2,n3-1
+            do  i1=1,n1
+               indx = indx + 1
+               u(i1,n2,i3) = buff(indx, buff_id )
+            enddo
+         enddo
+      endif
+
+      if( axis .eq.  3 )then
+         do  i2=1,n2
+            do  i1=1,n1
+               indx = indx + 1
+               u(i1,i2,n3) = buff(indx, buff_id )
+            enddo
+         enddo
+      endif
+
+
+      dir = +1
+
+      buff_id = 3 + dir
+      indx = 0
+
+      if( axis .eq.  1 )then
+         do  i3=2,n3-1
+            do  i2=2,n2-1
+               indx = indx + 1
+               u(1,i2,i3) = buff(indx, buff_id )
+            enddo
+         enddo
+      endif
+
+      if( axis .eq.  2 )then
+         do  i3=2,n3-1
+            do  i1=1,n1
+               indx = indx + 1
+               u(i1,1,i3) = buff(indx, buff_id )
+            enddo
+         enddo
+      endif
+
+      if( axis .eq.  3 )then
+         do  i2=1,n2
+            do  i1=1,n1
+               indx = indx + 1
+               u(i1,i2,1) = buff(indx, buff_id )
+            enddo
+         enddo
+      endif
+
+      return
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine comm1p_ex( axis, u, n1, n2, n3, kk )
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      use mg_data
+      use mpinpb
+
+      implicit none
+
+      integer axis, dir, n1, n2, n3
+      double precision u( n1, n2, n3 )
+
+      integer i3, i2, i1, buff_len,buff_id
+      integer i, kk, indx
+
+      if( take_ex( axis, kk ) ) then
+
+         dir = -1
+
+         buff_id = 3 + dir
+         buff_len = nm2
+
+         do  i=1,nm2
+            buff(i,buff_id) = 0.0D0
+         enddo
+
+
+         dir = +1
+
+         buff_id = 3 + dir
+         buff_len = nm2
+
+         do  i=1,nm2
+            buff(i,buff_id) = 0.0D0
+         enddo
+
+
+         dir = -1
+
+         buff_id = 3 + dir
+         indx = 0
+
+         if( axis .eq.  1 )then
+            do  i3=1,n3
+               do  i2=1,n2
+                  indx = indx + 1
+                  u(n1,i2,i3) = buff(indx, buff_id )
+               enddo
+            enddo
+         endif
+
+         if( axis .eq.  2 )then
+            do  i3=1,n3
+               do  i1=1,n1
+                  indx = indx + 1
+                  u(i1,n2,i3) = buff(indx, buff_id )
+               enddo
+            enddo
+         endif
+
+         if( axis .eq.  3 )then
+            do  i2=1,n2
+               do  i1=1,n1
+                  indx = indx + 1
+                  u(i1,i2,n3) = buff(indx, buff_id )
+               enddo
+            enddo
+         endif
+
+         dir = +1
+
+         buff_id = 3 + dir
+         indx = 0
+
+         if( axis .eq.  1 )then
+            do  i3=1,n3
+               do  i2=1,n2
+                  do  i1=1,2
+                     indx = indx + 1
+                     u(i1,i2,i3) = buff(indx,buff_id)
+                  enddo
+               enddo
+            enddo
+         endif
+
+         if( axis .eq.  2 )then
+            do  i3=1,n3
+               do  i2=1,2
+                  do  i1=1,n1
+                     indx = indx + 1
+                     u(i1,i2,i3) = buff(indx,buff_id)
+                  enddo
+               enddo
+            enddo
+         endif
+
+         if( axis .eq.  3 )then
+            do  i3=1,2
+               do  i2=1,n2
+                  do  i1=1,n1
+                     indx = indx + 1
+                     u(i1,i2,i3) = buff(indx,buff_id)
+                  enddo
+               enddo
+            enddo
+         endif
+
+      endif
+
+      if( give_ex( axis, kk ) )then
+
+         dir = +1
+
+         buff_id = 2 + dir 
+         buff_len = 0
+
+         if( axis .eq.  1 )then
+            do  i3=1,n3
+               do  i2=1,n2
+                  do  i1=n1-1,n1
+                     buff_len = buff_len + 1
+                     buff(buff_len,buff_id)= u(i1,i2,i3)
+                  enddo
+               enddo
+            enddo
+         endif
+
+         if( axis .eq.  2 )then
+            do  i3=1,n3
+               do  i2=n2-1,n2
+                  do  i1=1,n1
+                     buff_len = buff_len + 1
+                     buff(buff_len,buff_id )= u(i1,i2,i3)
+                  enddo
+               enddo
+            enddo
+         endif
+
+         if( axis .eq.  3 )then
+            do  i3=n3-1,n3
+               do  i2=1,n2
+                  do  i1=1,n1
+                     buff_len = buff_len + 1
+                     buff(buff_len, buff_id ) = u( i1,i2,i3)
+                  enddo
+               enddo
+            enddo
+         endif
+
+         dir = -1
+
+         buff_id = 2 + dir 
+         buff_len = 0
+
+         if( axis .eq.  1 )then
+            do  i3=1,n3
+               do  i2=1,n2
+                  buff_len = buff_len + 1
+                  buff(buff_len,buff_id ) = u( 2,  i2,i3)
+               enddo
+            enddo
+         endif
+
+         if( axis .eq.  2 )then
+            do  i3=1,n3
+               do  i1=1,n1
+                  buff_len = buff_len + 1
+                  buff(buff_len, buff_id ) = u( i1,  2,i3)
+               enddo
+            enddo
+         endif
+
+         if( axis .eq.  3 )then
+            do  i2=1,n2
+               do  i1=1,n1
+                  buff_len = buff_len + 1
+                  buff(buff_len, buff_id ) = u( i1,i2,2)
+               enddo
+            enddo
+         endif
+
+      endif
+
+      do  i=1,nm2
+         buff(i,4) = buff(i,3)
+         buff(i,2) = buff(i,1)
+      enddo
+
+      return
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine zran3(z,n1,n2,n3,nx,ny,k)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     zran3  loads +1 at ten randomly chosen points,
+!     loads -1 at a different ten random points,
+!     and zero elsewhere.
+!---------------------------------------------------------------------
+
+      use mg_data, only : is1, is2, is3, ie1, ie2, ie3
+      use mpinpb
+
+      implicit none
+
+      integer n1, n2, n3, k, nx, ny, ierr, i0, m0, m1
+      double precision z(n1,n2,n3)
+
+      integer mm, i1, i2, i3, d1, e1, e2, e3
+      double precision x, a
+      double precision xx, x0, x1, a1, a2, ai, power
+      parameter( mm = 10,  a = 5.D0 ** 13, x = 314159265.D0)
+      double precision ten( mm, 0:1 ), temp, best
+      integer i, j1( mm, 0:1 ), j2( mm, 0:1 ), j3( mm, 0:1 )
+      integer jg( 0:3, mm, 0:1 ), jg_temp(4)
+
+      external randlc
+      double precision randlc, rdummy
+
+      a1 = power( a, nx, 1, 0 )
+      a2 = power( a, nx, ny, 0 )
+
+      call zero3(z,n1,n2,n3)
+
+!      i = is1-2+nx*(is2-2+ny*(is3-2))
+
+      ai = power( a, nx, is2-2+ny*(is3-2), is1-2 )
+      d1 = ie1 - is1 + 1
+      e1 = ie1 - is1 + 2
+      e2 = ie2 - is2 + 2
+      e3 = ie3 - is3 + 2
+      x0 = x
+      rdummy = randlc( x0, ai )
+      do  i3 = 2, e3
+         x1 = x0
+         do  i2 = 2, e2
+            xx = x1
+            call vranlc( d1, xx, a, z( 2, i2, i3 ))
+            rdummy = randlc( x1, a1 )
+         enddo
+         rdummy = randlc( x0, a2 )
+      enddo
+
+!---------------------------------------------------------------------
+!       call comm3(z,n1,n2,n3)
+!       call showall(z,n1,n2,n3)
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     each processor looks for twenty candidates
+!---------------------------------------------------------------------
+      do  i=1,mm
+         ten( i, 1 ) = 0.0D0
+         j1( i, 1 ) = 0
+         j2( i, 1 ) = 0
+         j3( i, 1 ) = 0
+         ten( i, 0 ) = 1.0D0
+         j1( i, 0 ) = 0
+         j2( i, 0 ) = 0
+         j3( i, 0 ) = 0
+      enddo
+
+      do  i3=2,n3-1
+         do  i2=2,n2-1
+            do  i1=2,n1-1
+               if( z(i1,i2,i3) .gt. ten( 1, 1 ) )then
+                  ten(1,1) = z(i1,i2,i3) 
+                  j1(1,1) = i1
+                  j2(1,1) = i2
+                  j3(1,1) = i3
+                  call bubble( ten, j1, j2, j3, mm, 1 )
+               endif
+               if( z(i1,i2,i3) .lt. ten( 1, 0 ) )then
+                  ten(1,0) = z(i1,i2,i3) 
+                  j1(1,0) = i1
+                  j2(1,0) = i2
+                  j3(1,0) = i3
+                  call bubble( ten, j1, j2, j3, mm, 0 )
+               endif
+            enddo
+         enddo
+      enddo
+
+      call mpi_barrier(comm_work,ierr)
+
+!---------------------------------------------------------------------
+!     Now which of these are globally best?
+!---------------------------------------------------------------------
+      i1 = mm
+      i0 = mm
+      do  i=mm,1,-1
+
+         best = z( j1(i1,1), j2(i1,1), j3(i1,1) )
+         call mpi_allreduce(best,temp,1,dp_type,  &
+     &        mpi_max,comm_work,ierr)
+         best = temp
+         if(best.eq.z(j1(i1,1),j2(i1,1),j3(i1,1)))then
+            jg( 0, i, 1) = me
+            jg( 1, i, 1) = is1 - 2 + j1( i1, 1 ) 
+            jg( 2, i, 1) = is2 - 2 + j2( i1, 1 ) 
+            jg( 3, i, 1) = is3 - 2 + j3( i1, 1 ) 
+            i1 = i1-1
+         else
+            jg( 0, i, 1) = 0
+            jg( 1, i, 1) = 0
+            jg( 2, i, 1) = 0
+            jg( 3, i, 1) = 0
+         endif
+         ten( i, 1 ) = best
+         call mpi_allreduce(jg(0,i,1), jg_temp,4,MPI_INTEGER,  &
+     &        mpi_max,comm_work,ierr)
+         jg( 0, i, 1) =  jg_temp(1)
+         jg( 1, i, 1) =  jg_temp(2)
+         jg( 2, i, 1) =  jg_temp(3)
+         jg( 3, i, 1) =  jg_temp(4)
+
+         best = z( j1(i0,0), j2(i0,0), j3(i0,0) )
+         call mpi_allreduce(best,temp,1,dp_type,  &
+     &        mpi_min,comm_work,ierr)
+         best = temp
+         if(best.eq.z(j1(i0,0),j2(i0,0),j3(i0,0)))then
+            jg( 0, i, 0) = me
+            jg( 1, i, 0) = is1 - 2 + j1( i0, 0 ) 
+            jg( 2, i, 0) = is2 - 2 + j2( i0, 0 ) 
+            jg( 3, i, 0) = is3 - 2 + j3( i0, 0 ) 
+            i0 = i0-1
+         else
+            jg( 0, i, 0) = 0
+            jg( 1, i, 0) = 0
+            jg( 2, i, 0) = 0
+            jg( 3, i, 0) = 0
+         endif
+         ten( i, 0 ) = best
+         call mpi_allreduce(jg(0,i,0), jg_temp,4,MPI_INTEGER,  &
+     &        mpi_max,comm_work,ierr)
+         jg( 0, i, 0) =  jg_temp(1)
+         jg( 1, i, 0) =  jg_temp(2)
+         jg( 2, i, 0) =  jg_temp(3)
+         jg( 3, i, 0) =  jg_temp(4)
+
+      enddo
+      m1 = i1+1
+      m0 = i0+1
+
+!      if( me .eq. root) then
+!         write(*,*)' '
+!         write(*,*)' negative charges at'
+!         write(*,9)(jg(1,i,0),jg(2,i,0),jg(3,i,0),i=1,mm)
+!         write(*,*)' positive charges at'
+!         write(*,9)(jg(1,i,1),jg(2,i,1),jg(3,i,1),i=1,mm)
+!         write(*,*)' small random numbers were'
+!         write(*,8)(ten( i,0),i=mm,1,-1)
+!         write(*,*)' and they were found on processor number'
+!         write(*,7)(jg(0,i,0),i=mm,1,-1)
+!         write(*,*)' large random numbers were'
+!         write(*,8)(ten( i,1),i=mm,1,-1)
+!         write(*,*)' and they were found on processor number'
+!         write(*,7)(jg(0,i,1),i=mm,1,-1)
+!      endif
+! 9    format(5(' (',i3,2(',',i3),')'))
+! 8    format(5D15.8)
+! 7    format(10i4)
+      call mpi_barrier(comm_work,ierr)
+      do  i3=1,n3
+         do  i2=1,n2
+            do  i1=1,n1
+               z(i1,i2,i3) = 0.0D0
+            enddo
+         enddo
+      enddo
+      do  i=mm,m0,-1
+         z( j1(i,0), j2(i,0), j3(i,0) ) = -1.0D0
+      enddo
+      do  i=mm,m1,-1
+         z( j1(i,1), j2(i,1), j3(i,1) ) = +1.0D0
+      enddo
+      call comm3(z,n1,n2,n3,k)
+
+!---------------------------------------------------------------------
+!          call showall(z,n1,n2,n3)
+!---------------------------------------------------------------------
+
+      return 
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine show_l(z,n1,n2,n3)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      use mpinpb
+      implicit none
+
+      integer n1,n2,n3,i1,i2,i3,ierr
+      double precision z(n1,n2,n3)
+      integer m1, m2, m3,i
+
+      m1 = min(n1,18)
+      m2 = min(n2,14)
+      m3 = min(n3,18)
+
+      write(*,*)'  '
+      do  i=0,nprocs-1
+         if( me .eq. i )then
+            write(*,*)' id = ', me
+            do  i3=1,m3
+               do  i1=1,m1
+                  write(*,6)(z(i1,i2,i3),i2=1,m2)
+               enddo
+               write(*,*)' - - - - - - - '
+            enddo
+            write(*,*)'  '
+ 6          format(6f15.11)
+         endif
+         call mpi_barrier(comm_work,ierr)
+      enddo
+
+      return 
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine showall(z,n1,n2,n3)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      use mpinpb
+      implicit none
+
+      integer n1,n2,n3,i1,i2,i3,i,ierr
+      double precision z(n1,n2,n3)
+      integer m1, m2, m3
+
+      m1 = min(n1,18)
+      m2 = min(n2,14)
+      m3 = min(n3,18)
+
+      write(*,*)'  '
+      do  i=0,nprocs-1
+         if( me .eq. i )then
+            write(*,*)' id = ', me
+            do  i3=1,m3
+               do  i1=1,m1
+                  write(*,6)(z(i1,i2,i3),i2=1,m2)
+               enddo
+               write(*,*)' - - - - - - - '
+            enddo
+            write(*,*)'  '
+ 6          format(15f6.3)
+         endif
+         call mpi_barrier(comm_work,ierr)
+      enddo
+
+      return 
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine show(z,n1,n2,n3)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      use mpinpb
+      implicit none
+
+      integer n1,n2,n3,i1,i2,i3,ierr,i
+      double precision z(n1,n2,n3)
+
+      write(*,*)'  '
+      do  i=0,nprocs-1
+         if( me .eq. i )then
+            write(*,*)' id = ', me
+            do  i3=2,n3-1
+               do  i1=2,n1-1
+                  write(*,6)(z(i1,i2,i3),i2=2,n2-1)
+               enddo
+               write(*,*)' - - - - - - - '
+            enddo
+            write(*,*)'  '
+ 6          format(8D10.3)
+         endif
+         call mpi_barrier(comm_work,ierr)
+      enddo
+
+!     call comm3(z,n1,n2,n3)
+
+      return 
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      double precision function power( a, n1, n2, n3 )
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     power  raises an integer, disguised as a double
+!     precision real, to an integer power.
+!     This version tries to avoid integer overflow by treating
+!     it as expressed in a form of "n1*n2+n3".
+!---------------------------------------------------------------------
+      implicit none
+
+      double precision a, aj
+      integer n1, n2, n3
+
+      integer n1j, n2j, nj
+      external randlc
+      double precision randlc, rdummy
+
+      power = 1.0d0
+      aj = a
+      nj = n3
+      n1j = n1
+      n2j = n2
+ 100  continue
+
+      if( n2j .gt. 0 ) then
+         if( mod(n2j,2) .eq. 1 ) nj = nj + n1j
+         n2j = n2j/2
+      else if( nj .eq. 0 ) then
+         go to 200
+      endif
+      if( mod(nj,2) .eq. 1 ) rdummy =  randlc( power, aj )
+      rdummy = randlc( aj, aj )
+      nj = nj/2
+      go to 100
+
+ 200  continue
+      return
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine bubble( ten, j1, j2, j3, m, ind )
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     bubble        does a bubble sort in direction ind (1 = ascending, 0 = descending)
+!---------------------------------------------------------------------
+
+      use mpinpb
+      implicit none
+
+      integer m, ind, j1( m, 0:1 ), j2( m, 0:1 ), j3( m, 0:1 )
+      double precision ten( m, 0:1 )
+      double precision temp
+      integer i, j_temp
+
+      if( ind .eq. 1 )then
+
+         do  i=1,m-1
+            if( ten(i,ind) .gt. ten(i+1,ind) )then
+
+               temp = ten( i+1, ind )
+               ten( i+1, ind ) = ten( i, ind )
+               ten( i, ind ) = temp
+
+               j_temp           = j1( i+1, ind )
+               j1( i+1, ind ) = j1( i,   ind )
+               j1( i,   ind ) = j_temp
+
+               j_temp           = j2( i+1, ind )
+               j2( i+1, ind ) = j2( i,   ind )
+               j2( i,   ind ) = j_temp
+
+               j_temp           = j3( i+1, ind )
+               j3( i+1, ind ) = j3( i,   ind )
+               j3( i,   ind ) = j_temp
+
+            else 
+               return
+            endif
+         enddo
+
+      else
+
+         do  i=1,m-1
+            if( ten(i,ind) .lt. ten(i+1,ind) )then
+
+               temp = ten( i+1, ind )
+               ten( i+1, ind ) = ten( i, ind )
+               ten( i, ind ) = temp
+
+               j_temp           = j1( i+1, ind )
+               j1( i+1, ind ) = j1( i,   ind )
+               j1( i,   ind ) = j_temp
+
+               j_temp           = j2( i+1, ind )
+               j2( i+1, ind ) = j2( i,   ind )
+               j2( i,   ind ) = j_temp
+
+               j_temp           = j3( i+1, ind )
+               j3( i+1, ind ) = j3( i,   ind )
+               j3( i,   ind ) = j_temp
+
+            else 
+               return
+            endif
+         enddo
+
+      endif
+
+      return
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine zero3(z,n1,n2,n3)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      use mpinpb
+      implicit none
+
+      integer n1, n2, n3
+      double precision z(n1,n2,n3)
+      integer i1, i2, i3
+
+      do  i3=1,n3
+         do  i2=1,n2
+            do  i1=1,n1
+               z(i1,i2,i3)=0.0D0
+            enddo
+         enddo
+      enddo
+
+      return
+      end
+
+
+!----- end of program ------------------------------------------------
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/MG/mg.input.sample b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/MG/mg.input.sample
new file mode 100644
index 000000000..a4dcf8127
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/MG/mg.input.sample
@@ -0,0 +1,4 @@
+ 8 = top level
+ 256 256 256 = nx ny nz
+ 20 = nit
+ 0 0 0 0 0 0 0 0 = debug_vec
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/MG/mg_data.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/MG/mg_data.f90
new file mode 100644
index 000000000..53fc271b4
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/MG/mg_data.f90
@@ -0,0 +1,161 @@
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+!
+!  mg_data module
+!
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      module mg_data
+
+!---------------------------------------------------------------------
+!  Parameter lm is the log-base2 of the edge size max for
+!  the partition on a given node, so must be changed either
+!  to save space (if running a small case) or made bigger for larger 
+!  cases, for example, 512^3. Thus lm=7 means that the largest dimension 
+!  of a partition that can be solved on a node is 2^7 = 128. lm is set 
+!  automatically in npbparams.h
+!  Parameters ndim1, ndim2, ndim3 are the local problem dimensions. 
+!---------------------------------------------------------------------
+
+      include 'npbparams.h'
+
+      ! partitioned size in each dimension
+      integer ndim1, ndim2, ndim3
+
+      ! log of maximum dimension on a node
+      integer lm
+
+      integer nm  &    ! actual dimension including ghost cells for communications
+     &      , nv  &    ! size of rhs array
+     &      , nr  &    ! size of residual array
+     &      , nm2  &   ! size of communication buffer
+     &      , maxlevel! maximum number of levels
+      parameter (maxlevel = lt_default+1)
+
+
+      integer maxprocs
+      parameter( maxprocs = 131072 )  ! this is the upper proc limit that 
+                                      ! the current "nr" parameter can handle
+!---------------------------------------------------------------------
+      integer nbr(3,-1:1,maxlevel), msg_type(3,-1:1)
+      integer msg_id(3,-1:1,2),nx(maxlevel),ny(maxlevel),nz(maxlevel)
+
+      character class
+
+      integer debug_vec(0:7)
+
+      integer ir(maxlevel), m1(maxlevel), m2(maxlevel), m3(maxlevel)
+      integer lt, lb
+
+      logical dead(maxlevel), give_ex(3,maxlevel), take_ex(3,maxlevel)
+
+! ... grid
+      integer  is1, is2, is3, ie1, ie2, ie3
+
+!---------------------------------------------------------------------
+!  Set at m=1024, can handle cases up to 1024^3
+!---------------------------------------------------------------------
+      integer m
+!      parameter( m=1037 )
+
+      double precision, allocatable ::  &
+     &        buff(:,:)
+
+!---------------------------------------------------------------------
+!  Timing constants
+!---------------------------------------------------------------------
+      integer t_bench, t_init, t_psinv, t_resid, t_rprj3, t_interp,  &
+     &        t_norm2u3, t_comm3, t_rcomm, t_last
+      parameter (t_bench=1, t_init=2, t_psinv=3, t_resid=4, t_rprj3=5,  &
+     &        t_interp=6, t_norm2u3=7, t_comm3=8,  &
+     &        t_rcomm=9, t_last=9)
+
+      logical timeron
+
+
+      end module mg_data
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+!
+!  mg_fields module
+!
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      module mg_fields
+
+!---------------------------------------------------------------------------c
+! These are major data arrays and can be quite large.
+! They are always passed as subroutine args.
+!---------------------------------------------------------------------------c
+      double precision, allocatable :: u(:), v(:), r(:)
+
+      double precision  a(0:3),c(0:3)
+
+      end module mg_fields
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine alloc_space
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+! allocate space dynamically for data arrays
+!---------------------------------------------------------------------
+
+      use mg_data
+      use mg_fields
+      use mpinpb
+
+      implicit none
+
+      integer ios, ierr
+      integer log2_size, log_p
+
+
+!---------------------------------------------------------------------
+! set up dimension parameters after partition
+!---------------------------------------------------------------------
+      log_p  = log(float(nprocs)+0.0001)/log(2.0)
+
+      ! lt is log of largest total dimension
+      log2_size = lt_default
+
+      ! log of maximum dimension on a node
+      lm = log2_size - log_p/3
+      ndim1 = lm
+      ndim3 = log2_size - (log_p+2)/3
+      ndim2 = log2_size - (log_p+1)/3
+
+      ! array size parameters
+      nm = 2+2**lm
+      nv = (2+2**ndim1)*(2+2**ndim2)*(2+2**ndim3)
+      nm2= 2*nm*nm
+      nr = (8*(nv+nm**2+5*nm+14*lt_default-7*lm))/7
+      m  = nm + 1
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+      allocate (  &
+     &          u(nr),  &
+     &          v(nv),  &
+     &          r(nr),  &
+     &          buff(nm2,4),  &
+     &          stat = ios)
+
+      if (ios .ne. 0) then
+         write(*,*) 'Error encountered in allocating space'
+         call MPI_Abort(MPI_COMM_WORLD, MPI_ERR_OTHER, ierr)
+         stop
+      endif
+
+      return
+      end
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/MG/mpinpb.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/MG/mpinpb.f90
new file mode 100644
index 000000000..1702df9de
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/MG/mpinpb.f90
@@ -0,0 +1,17 @@
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+!
+!  mpinpb module
+!
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      module mpinpb
+
+      include 'mpif.h'
+
+      integer me, nprocs, nprocs_total, root, dp_type, comm_work
+      logical active
+
+      end module mpinpb
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/MPI_dummy/Makefile b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/MPI_dummy/Makefile
new file mode 100644
index 000000000..b1c5f1b4e
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/MPI_dummy/Makefile
@@ -0,0 +1,38 @@
+# Makefile for MPI dummy library. 
+# Must be edited for a specific machine. Does NOT read in 
+# the make.def file of NPB 3.4
+FC = f90
+CC = cc
+AR = ar
+
+# Enable if either Cray or IBM: (no such flag for most machines: see wtime.h)
+# MACHINE	=	-DCRAY
+# MACHINE	=	-DIBM
+
+libmpi.a: mpi_dummy.o mpi_dummy_c.o wtime.o
+	$(AR) r libmpi.a mpi_dummy.o mpi_dummy_c.o wtime.o
+
+mpi_dummy.o: mpi_dummy.f90 mpif.h
+	$(FC) -c mpi_dummy.f90
+# For a Cray C90, try:
+#	cf90 -dp -c mpi_dummy.f90
+# For an IBM 590, try:
+#	xlf90 -c mpi_dummy.f90
+
+mpi_dummy_c.o: mpi_dummy.c mpi.h
+	$(CC) -c ${MACHINE} -o mpi_dummy_c.o mpi_dummy.c
+
+wtime.o: wtime.c
+# For most machines or CRAY or IBM
+	$(CC) -c ${MACHINE} wtime.c
+# For a precise timer on an SGI Power Challenge, try:
+#	$(CC) -o wtime.o -c wtime_sgi64.c
+
+test: test.f90
+	$(FC) -o test -I. test.f90 -L. -lmpi
+
+
+
+clean: 
+	- rm -f *~ *.o
+	- rm -f test libmpi.a
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/MPI_dummy/README b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/MPI_dummy/README
new file mode 100644
index 000000000..c89d4ba2d
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/MPI_dummy/README
@@ -0,0 +1,52 @@
+###########################################
+# NAS Parallel Benchmarks 2&3             #
+# MPI/Fortran/C                           #
+# Revision 3.4                            #
+# NASA Ames Research Center               #
+# npb@nas.nasa.gov                        #
+# http://www.nas.nasa.gov/Software/NPB/   #
+###########################################
+
+MPI Dummy Library
+
+
+The MPI dummy library is supplied as a convenience for people who do
+not have an MPI library but would like to try running on one processor
+anyway. The NPB 2.x/3.x benchmarks are designed so that they do not
+actually try to do any message passing when run on one node. The MPI
+dummy library is just that - a set of dummy MPI routines which don't
+do anything, but allow you to link the benchmarks. Actually they do a
+few things, but nothing important. Note that the dummy library is 
+sufficient only for the NPB 2.x/3.x benchmarks. It probably won't be
+useful for anything else because it implements only a handful of
+functions. 
+
+Because the dummy library is just an extra goody, and since we don't
+have an infinite amount of time, it may be a bit trickier to configure
+than the rest of the benchmarks. You need to:
+
+1. Find out how C and Fortran interact on your machine. On most machines, 
+the Fortran function foo(x) is declared in C as foo_(xp) where xp is 
+a pointer, not a value. On IBMs, it's just foo(xp). On Cray C90s, it's
+FOO(xp). You can define CRAY or IBM to get these, or you need to
+edit wtime.c if you've got something else. 
+
+2. Edit the Makefile to compile mpi_dummy.f90 and wtime.c correctly
+for your machine (including -DCRAY or -DIBM if necessary). 
+
+3. The substitute MPI timer gives wall clock time, not CPU time. 
+If you're running on a timeshared machine, you may want to 
+use a CPU timer. Edit the function mpi_wtime() in mpi_dummy.f90
+to change this timer. (NOTE: for official benchmark results, 
+ONLY wall clock times are valid. Using a CPU timer is ok 
+if you want to get things running, but don't report any results
+measured with a CPU timer. )
+
+TROUBLESHOOTING
+
+o Compiling or linking of the benchmark aborts because the dummy MPI
+  header file or the dummy MPI library cannot be found.
+  - the file make.dummy in subdirectory config relies on the use
+    of the -I"path" and -L"path" -l"library" constructs to pass
+    information to the compilers and linkers. Edit this file to conform
+    to your system.
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/MPI_dummy/mpi.h b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/MPI_dummy/mpi.h
new file mode 100644
index 000000000..af0b97e6a
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/MPI_dummy/mpi.h
@@ -0,0 +1,132 @@
+#define MPI_DOUBLE          1
+#define MPI_INT             2
+#define MPI_BYTE            3
+#define MPI_FLOAT           4
+#define MPI_LONG            5
+
+#define MPI_COMM_WORLD      0
+
+#define MPI_MAX             1
+#define MPI_SUM             2
+#define MPI_MIN             3
+
+#define MPI_SUCCESS         0
+#define MPI_ANY_SOURCE     -1
+#define MPI_ERR_OTHER      -1
+#define MPI_STATUS_SIZE     3
+
+
+/* 
+   Status object.  It is the only user-visible MPI data-structure 
+   The "count" field is PRIVATE; use MPI_Get_count to access it. 
+ */
+typedef struct { 
+    int count;
+    int MPI_SOURCE;
+    int MPI_TAG;
+    int MPI_ERROR;
+} MPI_Status;
+
+
+/* MPI request objects */
+typedef int MPI_Request;
+
+/* MPI datatype */
+typedef int MPI_Datatype;
+
+/* MPI comm */
+typedef int MPI_Comm;
+
+/* MPI operation */
+typedef int MPI_Op;
+
+
+
+/* Prototypes: */
+void  mpi_error( void );
+
+int MPI_Abort( MPI_Comm comm, int ecode );
+
+int   MPI_Irecv( void         *buf,
+                 int          count,
+                 MPI_Datatype datatype,
+                 int          source,
+                 int          tag,
+                 MPI_Comm     comm,
+                 MPI_Request  *request );
+
+int   MPI_Send( void         *buf,
+                int          count,
+                MPI_Datatype datatype,
+                int          dest,
+                int          tag,
+                MPI_Comm     comm );
+
+int   MPI_Recv( void         *buf,
+                int          count,
+                MPI_Datatype datatype,
+                int          source,
+                int          tag,
+                MPI_Comm     comm,
+                MPI_Status   *status );
+
+int   MPI_Wait( MPI_Request *request,
+                MPI_Status  *status );
+
+int   MPI_Init( int  *argc,
+                char ***argv );
+
+int   MPI_Comm_rank( MPI_Comm comm, 
+                     int      *rank );
+
+int   MPI_Comm_size( MPI_Comm comm, 
+                     int      *size );
+
+int   MPI_Comm_split( MPI_Comm comm, int color, int key, MPI_Comm *newcomm );
+
+int   MPI_Comm_dup( MPI_Comm comm, MPI_Comm *newcomm );
+
+double MPI_Wtime( void );
+
+int  MPI_Barrier( MPI_Comm comm );
+
+int  MPI_Bcast( void         *buf,
+                int          count,
+                MPI_Datatype datatype,
+                int          root,
+                MPI_Comm     comm );
+
+int  MPI_Finalize( void );
+
+int  MPI_Allreduce( void         *sendbuf,
+                    void         *recvbuf,
+                    int          nitems,
+                    MPI_Datatype type,
+                    MPI_Op       op,
+                    MPI_Comm     comm );
+
+int  MPI_Reduce( void         *sendbuf,
+                 void         *recvbuf,
+                 int          nitems,
+                 MPI_Datatype type,
+                 MPI_Op       op,
+                 int          root,
+                 MPI_Comm     comm );
+
+int  MPI_Alltoall( void         *sendbuf,
+                   int          sendcount,
+                   MPI_Datatype sendtype,
+                   void         *recvbuf,
+                   int          recvcount,
+                   MPI_Datatype recvtype,
+                   MPI_Comm     comm );
+
+int  MPI_Alltoallv( void         *sendbuf,
+                    int          *sendcounts,
+                    int          *senddispl,
+                    MPI_Datatype sendtype,
+                    void         *recvbuf,
+                    int          *recvcounts,
+                    int          *recvdispl,
+                    MPI_Datatype recvtype,
+                    MPI_Comm     comm );
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/MPI_dummy/mpi_dummy.c b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/MPI_dummy/mpi_dummy.c
new file mode 100644
index 000000000..19c1b2981
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/MPI_dummy/mpi_dummy.c
@@ -0,0 +1,321 @@
+#include "mpi.h"
+#include "wtime.h"
+#include <stdio.h>
+#include <stdlib.h>
+
+
+
+void  mpi_error( void )
+{
+    printf( "mpi_error called\n" );
+    abort();
+}
+
+
+int MPI_Abort( MPI_Comm comm, int ecode )
+{
+    mpi_error();
+    return( MPI_ERR_OTHER );
+}
+
+
+int   MPI_Irecv( void         *buf,
+                 int          count,
+                 MPI_Datatype datatype,
+                 int          source,
+                 int          tag,
+                 MPI_Comm     comm,
+                 MPI_Request  *request )
+{
+    mpi_error();
+    return( MPI_ERR_OTHER );
+}
+
+
+
+
+int   MPI_Recv( void         *buf,
+                int          count,
+                MPI_Datatype datatype,
+                int          source,
+                int          tag,
+                MPI_Comm     comm,
+                MPI_Status   *status )
+{
+    mpi_error();
+    return( MPI_ERR_OTHER );
+}
+
+
+
+
+int   MPI_Send( void         *buf,
+                int          count,
+                MPI_Datatype datatype,
+                int          dest,
+                int          tag,
+                MPI_Comm     comm )
+{
+    mpi_error();
+    return( MPI_ERR_OTHER );
+}
+
+
+
+
+int   MPI_Wait( MPI_Request *request,
+                MPI_Status  *status )
+{
+    mpi_error();
+    return( MPI_ERR_OTHER );
+}
+
+
+
+
+int   MPI_Init( int  *argc,
+                char ***argv )
+{
+    return( MPI_SUCCESS );
+}
+
+
+
+
+int   MPI_Comm_rank( MPI_Comm comm, 
+                     int      *rank )
+{
+    *rank = 0;
+    return( MPI_SUCCESS );
+}
+
+
+
+
+int   MPI_Comm_size( MPI_Comm comm, 
+                     int      *size )
+{
+    *size = 1;
+    return( MPI_SUCCESS );
+}
+
+
+
+int   MPI_Comm_split( MPI_Comm comm, int color, int key, MPI_Comm *newcomm )
+{
+    *newcomm = comm;
+    return( MPI_SUCCESS );
+}
+
+
+
+int   MPI_Comm_dup( MPI_Comm comm, MPI_Comm *newcomm )
+{
+    *newcomm = comm;
+    return( MPI_SUCCESS );
+}
+
+
+
+
+double MPI_Wtime( void )
+{
+    void wtime();
+
+    double t;
+    wtime( &t );
+    return( t );
+}
+
+
+
+
+int  MPI_Barrier( MPI_Comm comm )
+{
+    return( MPI_SUCCESS );
+}
+
+
+
+
+int  MPI_Bcast( void         *buf,
+                int          count,
+                MPI_Datatype datatype,
+                int          root,
+                MPI_Comm     comm )
+{
+    return( MPI_SUCCESS );
+}
+
+
+
+
+int  MPI_Finalize( void )
+{
+    return( MPI_SUCCESS );
+}
+
+
+
+
+int  MPI_Allreduce( void         *sendbuf,
+                    void         *recvbuf,
+                    int          nitems,
+                    MPI_Datatype type,
+                    MPI_Op       op,
+                    MPI_Comm     comm )
+{
+    int i;
+    if( type == MPI_INT )
+    {
+        int *pd_sendbuf, *pd_recvbuf;
+        pd_sendbuf = (int *) sendbuf;    
+        pd_recvbuf = (int *) recvbuf;    
+        for( i=0; i<nitems; i++ )
+            *(pd_recvbuf+i) = *(pd_sendbuf+i);
+    }
+    else if( type == MPI_LONG )
+    {
+        long *pd_sendbuf, *pd_recvbuf;
+        pd_sendbuf = (long *) sendbuf;    
+        pd_recvbuf = (long *) recvbuf;    
+        for( i=0; i<nitems; i++ )
+            *(pd_recvbuf+i) = *(pd_sendbuf+i);
+    }
+    else if( type == MPI_DOUBLE )
+    {
+        double *pd_sendbuf, *pd_recvbuf;
+        pd_sendbuf = (double *) sendbuf;    
+        pd_recvbuf = (double *) recvbuf;    
+        for( i=0; i<nitems; i++ )
+            *(pd_recvbuf+i) = *(pd_sendbuf+i);
+    }
+    else
+    {
+        printf("MPI_Allreduce: bad type %d\n", type);
+        return( MPI_ERR_OTHER );
+    }
+    return( MPI_SUCCESS );
+}
+  
+
+
+
+int  MPI_Reduce( void         *sendbuf,
+                 void         *recvbuf,
+                 int          nitems,
+                 MPI_Datatype type,
+                 MPI_Op       op,
+                 int          root,
+                 MPI_Comm     comm )
+{
+    int i;
+    if( type == MPI_INT )
+    {
+        int *pi_sendbuf, *pi_recvbuf;
+        pi_sendbuf = (int *) sendbuf;    
+        pi_recvbuf = (int *) recvbuf;    
+        for( i=0; i<nitems; i++ )
+            *(pi_recvbuf+i) = *(pi_sendbuf+i);
+    }
+    else if( type == MPI_LONG )
+    {
+        long *pi_sendbuf, *pi_recvbuf;
+        pi_sendbuf = (long *) sendbuf;    
+        pi_recvbuf = (long *) recvbuf;    
+        for( i=0; i<nitems; i++ )
+            *(pi_recvbuf+i) = *(pi_sendbuf+i);
+    }
+    else if( type == MPI_DOUBLE )
+    {
+        double *pd_sendbuf, *pd_recvbuf;
+        pd_sendbuf = (double *) sendbuf;    
+        pd_recvbuf = (double *) recvbuf;    
+        for( i=0; i<nitems; i++ )
+            *(pd_recvbuf+i) = *(pd_sendbuf+i);
+    }
+    else
+    {
+        printf("MPI_Reduce: bad type %d\n", type);
+        return( MPI_ERR_OTHER );
+    }
+    return( MPI_SUCCESS );
+}
+  
+
+
+
+int  MPI_Alltoall( void         *sendbuf,
+                   int          sendcount,
+                   MPI_Datatype sendtype,
+                   void         *recvbuf,
+                   int          recvcount,
+                   MPI_Datatype recvtype,
+                   MPI_Comm     comm )
+{
+    int i;
+    if( recvtype == MPI_INT )
+    {
+        int *pd_sendbuf, *pd_recvbuf;
+        pd_sendbuf = (int *) sendbuf;    
+        pd_recvbuf = (int *) recvbuf;    
+        for( i=0; i<sendcount; i++ )
+            *(pd_recvbuf+i) = *(pd_sendbuf+i);
+    }
+    else if( recvtype == MPI_LONG )
+    {
+        long *pd_sendbuf, *pd_recvbuf;
+        pd_sendbuf = (long *) sendbuf;    
+        pd_recvbuf = (long *) recvbuf;    
+        for( i=0; i<sendcount; i++ )
+            *(pd_recvbuf+i) = *(pd_sendbuf+i);
+    }
+    else
+    {
+        printf("MPI_Alltoall: bad type %d\n", recvtype);
+        return( MPI_ERR_OTHER );
+    }
+    return( MPI_SUCCESS );
+}
+  
+
+
+
+int  MPI_Alltoallv( void         *sendbuf,
+                    int          *sendcounts,
+                    int          *senddispl,
+                    MPI_Datatype sendtype,
+                    void         *recvbuf,
+                    int          *recvcounts,
+                    int          *recvdispl,
+                    MPI_Datatype recvtype,
+                    MPI_Comm     comm )
+{
+    int i;
+    if( recvtype == MPI_INT )
+    {
+        int *pd_sendbuf, *pd_recvbuf;
+        pd_sendbuf = (int *) sendbuf;    
+        pd_recvbuf = (int *) recvbuf;    
+        for( i=0; i<sendcounts[0]; i++ )
+            *(pd_recvbuf+i+recvdispl[0]) = *(pd_sendbuf+i+senddispl[0]);
+    }
+    else if( recvtype == MPI_LONG )
+    {
+        long *pd_sendbuf, *pd_recvbuf;
+        pd_sendbuf = (long *) sendbuf;    
+        pd_recvbuf = (long *) recvbuf;    
+        for( i=0; i<sendcounts[0]; i++ )
+            *(pd_recvbuf+i+recvdispl[0]) = *(pd_sendbuf+i+senddispl[0]);
+    }
+    else
+    {
+        printf("MPI_Alltoallv: bad type %d\n", recvtype);
+        return( MPI_ERR_OTHER );
+    }
+    return( MPI_SUCCESS );
+}
+  
+
+
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/MPI_dummy/mpi_dummy.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/MPI_dummy/mpi_dummy.f90
new file mode 100644
index 000000000..699874d62
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/MPI_dummy/mpi_dummy.f90
@@ -0,0 +1,247 @@
+      subroutine mpi_isend(buf,count,datatype,source,  &
+     &                     tag,comm,request,ierror)
+      implicit none
+      integer buf(*), count,datatype,source,tag,comm,request,ierror
+      call mpi_error()
+      return
+      end  
+
+      subroutine mpi_irecv(buf,count,datatype,source,  &
+     &                     tag,comm,request,ierror)
+      implicit none
+      integer buf(*), count,datatype,source,tag,comm,request,ierror
+      call mpi_error()
+      return
+      end
+
+      subroutine mpi_send(buf,count,datatype,dest,tag,comm,ierror)
+      implicit none
+      integer buf(*), count,datatype,dest,tag,comm,ierror
+      call mpi_error()
+      return
+      end
+      
+      subroutine mpi_recv(buf,count,datatype,source,  &
+     &                    tag,comm,status,ierror)
+      implicit none
+      integer buf(*), count,datatype,source,tag,comm,status(*),ierror
+      call mpi_error()
+      return
+      end
+
+      subroutine mpi_comm_split(comm,color,key,newcomm,ierror)
+      implicit none
+      integer comm,color,key,newcomm,ierror
+      newcomm = comm
+      return
+      end
+
+      subroutine mpi_comm_rank(comm, rank,ierr)
+      implicit none
+      integer comm, rank,ierr
+      rank = 0
+      return
+      end
+
+      subroutine mpi_comm_size(comm, size, ierr)
+      implicit none
+      integer comm, size, ierr
+      size = 1
+      return
+      end
+
+      double precision function mpi_wtime()
+      implicit none
+      double precision t
+! This function must measure wall clock time, not CPU time. 
+! Since there is no portable timer in Fortran (77)
+! we call a routine compiled in C (though the C source may have
+! to be tweaked). 
+      call wtime(t)
+! The following is not ok for "official" results because it reports
+! CPU time not wall clock time. It may be useful for developing/testing
+! on timeshared Crays, though. 
+!     call second(t)
+
+      mpi_wtime = t
+
+      return
+      end
+
+
+! may be valid to call this in single processor case
+      subroutine mpi_barrier(comm,ierror)
+      implicit none
+      integer comm,ierror
+      return
+      end
+
+! may be valid to call this in single processor case
+      subroutine mpi_bcast(buf, nitems, dtype, root, comm, ierr)
+      implicit none
+      integer buf(*), nitems, dtype, root, comm, ierr
+      return
+      end
+
+      subroutine mpi_comm_dup(oldcomm, newcomm,ierror)
+      implicit none
+      integer oldcomm, newcomm,ierror
+      newcomm= oldcomm
+      return
+      end
+
+      subroutine mpi_error()
+      implicit none
+      print *, 'mpi_error called'
+      stop
+      end 
+
+      subroutine mpi_abort(comm, errcode, ierr)
+      implicit none
+      integer comm, errcode, ierr
+      print *, 'mpi_abort called'
+      stop
+      end
+
+      subroutine mpi_finalize(ierr)
+      implicit none
+      integer ierr
+      return
+      end
+
+      subroutine mpi_init(ierr)
+      implicit none
+      integer ierr
+      return
+      end
+
+
+! assume double precision, which is all SP uses 
+      subroutine mpi_reduce(inbuf, outbuf, nitems,  &
+     &                      dtype, op, root, comm, ierr)
+      implicit none
+      include 'mpif.h'
+      integer nitems, dtype, op, root, comm, ierr
+      double precision inbuf(*), outbuf(*)
+
+      if (dtype .eq. mpi_double_precision) then
+         call dmpi_copy_dp(inbuf, outbuf, nitems)
+      else if (dtype .eq. mpi_double_complex) then
+         call dmpi_copy_dc(inbuf, outbuf, nitems)
+      else if (dtype .eq. mpi_complex) then
+         call dmpi_copy_complex(inbuf, outbuf, nitems)
+      else if (dtype .eq. mpi_real) then
+         call dmpi_copy_real(inbuf, outbuf, nitems)
+      else if (dtype .eq. mpi_integer) then
+         call dmpi_copy_int(inbuf, outbuf, nitems)
+      else
+         print *, 'mpi_reduce: unknown type ', dtype
+      end if
+      return
+      end
+
+
+      subroutine dmpi_copy_real(inbuf, outbuf, nitems)
+      implicit none
+      integer nitems, i
+      real inbuf(*), outbuf(*)
+      do i = 1, nitems
+         outbuf(i) = inbuf(i)
+      end do
+      
+      return
+      end
+
+      subroutine dmpi_copy_dp(inbuf, outbuf, nitems)
+      implicit none
+      integer nitems, i
+      double precision inbuf(*), outbuf(*)
+      do i = 1, nitems
+         outbuf(i) = inbuf(i)
+      end do
+      
+      return
+      end
+
+      subroutine dmpi_copy_dc(inbuf, outbuf, nitems)
+      implicit none
+      integer nitems, i
+      double complex inbuf(*), outbuf(*)
+      do i = 1, nitems
+         outbuf(i) = inbuf(i)
+      end do
+      
+      return
+      end
+
+
+      subroutine dmpi_copy_complex(inbuf, outbuf, nitems)
+      implicit none
+      integer nitems, i
+      complex inbuf(*), outbuf(*)
+      do i = 1, nitems
+         outbuf(i) = inbuf(i)
+      end do
+      
+      return
+      end
+
+      subroutine dmpi_copy_int(inbuf, outbuf, nitems)
+      implicit none
+      integer nitems, i
+      integer inbuf(*), outbuf(*)
+      do i = 1, nitems
+         outbuf(i) = inbuf(i)
+      end do
+      
+      return
+      end
+
+      subroutine mpi_allreduce(inbuf, outbuf, nitems,  &
+     &                      dtype, op, comm, ierr)
+      implicit none
+      integer nitems, dtype, op, comm, ierr
+      double precision inbuf(*), outbuf(*)
+
+      call mpi_reduce(inbuf, outbuf, nitems,  &
+     &                      dtype, op, 0, comm, ierr)
+      return
+      end
+
+      subroutine mpi_alltoall(inbuf, nitems_in, dtype_in,  &
+     &                        outbuf, nitems, dtype, comm, ierr)
+      implicit none
+      include 'mpif.h'
+      integer nitems_in, dtype_in, comm, ierr, nitems, dtype
+      double precision inbuf(*), outbuf(*)
+
+      if (dtype .eq. mpi_double_precision) then
+         call dmpi_copy_dp(inbuf, outbuf, nitems)
+      else if (dtype .eq. mpi_double_complex) then
+         call dmpi_copy_dc(inbuf, outbuf, nitems)
+      else if (dtype .eq. mpi_complex) then
+         call dmpi_copy_complex(inbuf, outbuf, nitems)
+      else if (dtype .eq. mpi_real) then
+         call dmpi_copy_real(inbuf, outbuf, nitems)
+      else if (dtype .eq. mpi_integer) then
+         call dmpi_copy_int(inbuf, outbuf, nitems)
+      else
+         print *, 'mpi_alltoall: unknown type ', dtype
+      end if
+      return
+      end
+
+      subroutine mpi_wait(request,status,ierror)
+      implicit none
+      integer request,status,ierror
+      call mpi_error()
+      return
+      end
+
+      subroutine mpi_waitall(count,requests,status,ierror)
+      implicit none
+      integer count,requests(*),status(*),ierror
+      call mpi_error()
+      return
+      end
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/MPI_dummy/mpif.h b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/MPI_dummy/mpif.h
new file mode 100644
index 000000000..091e7f300
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/MPI_dummy/mpif.h
@@ -0,0 +1,28 @@
+      integer mpi_comm_world
+      parameter (mpi_comm_world = 0)
+
+      integer mpi_max, mpi_min, mpi_sum
+      parameter (mpi_max = 1, mpi_sum = 2, mpi_min = 3)
+
+      integer mpi_byte, mpi_integer, mpi_real, mpi_logical,  &
+     &                  mpi_double_precision,  mpi_complex,  &
+     &                  mpi_double_complex
+      parameter (mpi_double_precision = 1,  &
+     &           mpi_integer = 2,  &
+     &           mpi_byte = 3,  &
+     &           mpi_real= 4,  &
+     &           mpi_logical = 5,  &
+     &           mpi_complex = 6,  &
+     &           mpi_double_complex = 7)
+
+      integer mpi_any_source
+      parameter (mpi_any_source = -1)
+
+      integer mpi_err_other
+      parameter (mpi_err_other = -1)
+
+      double precision mpi_wtime
+      external mpi_wtime
+
+      integer mpi_status_size
+      parameter (mpi_status_size=3)
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/MPI_dummy/test.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/MPI_dummy/test.f90
new file mode 100644
index 000000000..081c73c72
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/MPI_dummy/test.f90
@@ -0,0 +1,10 @@
+      program test
+      implicit none
+      double precision t, mpi_wtime
+      external mpi_wtime
+      t = 0.0
+      t = mpi_wtime()
+      print *, t
+      t = mpi_wtime()
+      print *, t
+      end
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/MPI_dummy/wtime.c b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/MPI_dummy/wtime.c
new file mode 100644
index 000000000..221d2225a
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/MPI_dummy/wtime.c
@@ -0,0 +1,13 @@
+#include "wtime.h"
+#include <sys/time.h>
+
+void wtime(double *t)
+{
+  static int sec = -1;
+  struct timeval tv;
+  gettimeofday(&tv, (void *)0);
+  if (sec < 0) sec = tv.tv_sec;
+  *t = (tv.tv_sec - sec) + 1.0e-6*tv.tv_usec;
+}
+
+    
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/MPI_dummy/wtime.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/MPI_dummy/wtime.f90
new file mode 100644
index 000000000..a1cfde9aa
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/MPI_dummy/wtime.f90
@@ -0,0 +1,12 @@
+      subroutine wtime(tim)
+      real*8 tim
+      dimension tarray(2)
+      call etime(tarray)
+      tim = tarray(1)
+      return
+      end
+
+
+
+
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/MPI_dummy/wtime.h b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/MPI_dummy/wtime.h
new file mode 100644
index 000000000..12eb0cb0e
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/MPI_dummy/wtime.h
@@ -0,0 +1,12 @@
+/* C/Fortran interface is different on different machines. 
+ * You may need to tweak this.
+ */
+
+
+#if defined(IBM)
+#define wtime wtime
+#elif defined(CRAY)
+#define wtime WTIME
+#else
+#define wtime wtime_
+#endif
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/MPI_dummy/wtime_sgi64.c b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/MPI_dummy/wtime_sgi64.c
new file mode 100644
index 000000000..d08d50cd3
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/MPI_dummy/wtime_sgi64.c
@@ -0,0 +1,74 @@
+#include <sys/types.h>
+#include <fcntl.h>
+#include <sys/mman.h>
+#include <sys/syssgi.h>
+#include <sys/immu.h>
+#include <errno.h>
+#include <stdio.h>
+
+/* The following works on SGI Power Challenge systems */
+
+typedef unsigned long iotimer_t;
+
+unsigned int cycleval;
+volatile iotimer_t *iotimer_addr, base_counter;
+double resolution;
+
+/* address_t is an integer type big enough to hold an address */
+typedef unsigned long address_t;
+
+
+
+void timer_init() 
+{
+  
+  int fd;
+  char *virt_addr;
+  address_t phys_addr, page_offset, pagemask, pagebase_addr;
+  
+  pagemask = getpagesize() - 1;
+  errno = 0;
+  phys_addr = syssgi(SGI_QUERY_CYCLECNTR, &cycleval);
+  if (errno != 0) {
+    perror("SGI_QUERY_CYCLECNTR");
+    exit(1);
+  }
+  /* page_offset = offset of the physical address within its page */
+  page_offset = phys_addr & pagemask;
+  pagebase_addr = phys_addr - page_offset;
+  fd = open("/dev/mmem", O_RDONLY);
+
+  virt_addr = mmap(0, pagemask, PROT_READ, MAP_PRIVATE, fd, pagebase_addr);
+  virt_addr = virt_addr + page_offset;
+  iotimer_addr = (iotimer_t *)virt_addr;
+  /* cycleval is in picoseconds; scaling by 1.0e-12 gives resolution in seconds */
+  resolution = 1.0e-12*cycleval; 
+  base_counter = *iotimer_addr;
+}
+
+void wtime_(double *time) 
+{
+  static int initialized = 0;
+  volatile iotimer_t counter_value;
+  if (!initialized) { 
+    timer_init();
+    initialized = 1;
+  }
+  counter_value = *iotimer_addr - base_counter;
+  *time = (double)counter_value * resolution;
+}
+
+
+void wtime(double *time) 
+{
+  static int initialized = 0;
+  volatile iotimer_t counter_value;
+  if (!initialized) { 
+    timer_init();
+    initialized = 1;
+  }
+  counter_value = *iotimer_addr - base_counter;
+  *time = (double)counter_value * resolution;
+}
+
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/Makefile b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/Makefile
new file mode 100644
index 000000000..1b6374ee9
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/Makefile
@@ -0,0 +1,68 @@
+SHELL=/bin/sh
+CLASS=U
+SUBTYPE=
+VERSION=
+SFILE=config/suite.def
+
+default: header
+	@ sys/print_instructions
+
+BT: bt
+bt: header
+	cd BT; $(MAKE) CLASS=$(CLASS) SUBTYPE=$(SUBTYPE) VERSION=$(VERSION)
+
+SP: sp
+sp: header
+	cd SP; $(MAKE) CLASS=$(CLASS)
+
+LU: lu
+lu: header
+	cd LU; $(MAKE) CLASS=$(CLASS) VERSION=$(VERSION)
+
+MG: mg
+mg: header
+	cd MG; $(MAKE) CLASS=$(CLASS)
+
+FT: ft
+ft: header
+	cd FT; $(MAKE) CLASS=$(CLASS)
+
+IS: is
+is: header
+	cd IS; $(MAKE) CLASS=$(CLASS)
+
+CG: cg
+cg: header
+	cd CG; $(MAKE) CLASS=$(CLASS)
+
+EP: ep
+ep: header
+	cd EP; $(MAKE) CLASS=$(CLASS)
+
+DT: dt
+dt: header
+	cd DT; $(MAKE) CLASS=$(CLASS)
+
+# Awk script courtesy cmg@cray.com, modified by Haoqiang Jin
+suite:
+	@ awk -f sys/suite.awk SMAKE=$(MAKE) $(SFILE) | $(SHELL)
+
+
+# It would be nice to make clean in each subdirectory (the targets
+# are defined) but on a really clean system this won't work
+# because those makefiles need config/make.def
+clean:
+	- rm -f core *~ */core */*~
+	- rm -f */*.o */*.mod */*.obj */*.exe */npbparams.h
+	- rm -f MPI_dummy/test MPI_dummy/libmpi.a
+	- rm -f sys/setparams sys/makesuite sys/setparams.h
+	- rm -f btio.*.out*
+
+veryclean: clean
+	- rm -f config/make.def config/suite.def 
+	- rm -f bin/sp.* bin/lu.* bin/mg.* bin/ft.* bin/bt.* bin/is.* 
+	- rm -f bin/ep.* bin/cg.* bin/dt.*
+
+header:
+	@ sys/print_header
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/README b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/README
new file mode 100644
index 000000000..0c62a4ecb
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/README
@@ -0,0 +1,72 @@
+The MPI implementation of NPB 3.4.2 (NPB3.4-MPI)
+--------------------------------------------------
+
+For problem reports and suggestions on the implementation, 
+please contact:
+
+   NAS Parallel Benchmark Team
+   npb@nas.nasa.gov
+
+   http://www.nas.nasa.gov/Software/NPB
+
+
+This directory contains the MPI implementation of the NAS
+Parallel Benchmarks, Version 3.4.2 (NPB3.4-MPI).  A brief
+summary of the new features introduced in this version is
+given below.
+
+For changes from different versions, see the Changes.log file
+included in the upper directory of this distribution.
+
+For explanation of compilation and running of the benchmarks,
+please refer to README.install.  For a special note on DT, please
+see the README file in the DT subdirectory.
+
+
+New features in NPB3.4-MPI of NPB 3.4.2:
+  * New verification scheme for EP
+
+  * Add back the VEC versions of BT and LU, accessible by "VERSION=VEC"
+
+  * Fixed a bug in the BT-IO benchmark that can cause integer overflow
+    in CLASS=D or larger problems.  Setting FORTRAN_REC_SIZE in make.def
+    is no longer required.
+
+
+New features in NPB3.4-MPI of NPB 3.4.1:
+  * Changed Fortran sources from fixed form to free form
+
+  * Fix inconsistency in enforcing process count requirements.
+    The enforcement of process count can be turned off by setting 
+    the environment variable NPB_NPROCS_STRICT to (0, off, no, false).
+
+  * Changed the reference of "INTEGER*8" to "INTEGER(8)" in randi8.f
+
+
+New features in NPB3.4-MPI:
+  * NPB3.4-MPI added the class E problem size for IS, and the class F
+    problem size for BT, LU, SP, CG, EP, FT, and MG.
+
+  * Version 3.4 uses the dynamic memory allocation feature in
+    Fortran 90 so that separate compilations for different process
+    counts are no longer necessary.  The number of processes is solely
+    determined and checked at runtime.
+
+  * The version uses Fortran modules to define global data (to replace 
+    common blocks) and Fortran 2003 IEEE arithmetic function to catch
+    the NaN condition during verification.
+
+    The version requires a compiler that supports features available
+    in Fortran 90 and 2003. Because of these changes, the MPIF77 flag 
+    in make.def is renamed to MPIFC.
+
+  * The environment variable NPB_TIMER_FLAG is now used to enable 
+    additional timers.
+
+  * The vector codes for the BT and LU benchmarks have been removed
+    due to the fact that these implementations were not portable and
+    successful vectorization highly depends on the compiler used.
+
+  * Potential performance improvement of the LU benchmark as a result of 
+    reduced memory usage for working arrays in the solver.
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/README.install b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/README.install
new file mode 100644
index 000000000..99a268007
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/README.install
@@ -0,0 +1,201 @@
+Some explanations on the MPI implementation of NPB 3.4.2 (NPB3.4-MPI)
+----------------------------------------------------------------------
+
+NPB-MPI is a sample MPI implementation based on NPB2.4 and NPB3.0-SER.
+This implementation contains all eight original benchmarks:
+Seven in Fortran: BT, SP, LU, FT, CG, MG, and EP; one in C: IS,
+as well as the DT benchmark, written in C, introduced in NPB3.2-MPI.
+
+For changes from different versions, see the Changes.log file
+included in the upper directory of this distribution.
+
+This version has been tested, among others, on an SGI Origin3000 and
+an SGI Altix.  For problem reports and suggestions on the implementation, 
+please contact
+
+   NAS Parallel Benchmark Team
+   npb@nas.nasa.gov
+
+
+CAUTION *********************************
+When running the I/O benchmark, one or more data files will be written
+in the directory from which the executable is invoked. They are not
+deleted at the end of the program. A new run will overwrite the old
+file(s). If not enough space is available in the user partition, the
+program will fail. For classes C and D the disk space required is
+3 GB and 135 GB, respectively.
+*****************************************
+
+
+1. Compilation
+
+   NPB3-MPI uses the same directory tree as NPB3-SER (and NPB2.x) does.
+   Before compilation, one needs to check the configuration file
+   'make.def' in the config directory and modify the file if necessary.  
+   If it does not (yet) exist, copy 'make.def.template' or one of the
+   sample files in the NAS.samples subdirectory to 'make.def' and
+   edit the content for site- and machine-specific data.  Some of the
+   flags to be specified in make.def are:
+
+      MPIFC  - MPI Fortran compiler
+      FFLAGS - Fortran compilation flags
+      FLINK  - Fortran linker, usually the same as MPIFC
+      MPICC  - MPI C compiler
+      CFLAGS - C compilation flags
+      CLINK  - C linker, usually the same as MPICC
+
+   Then
+
+       make <benchmark-name> CLASS=<class> [SUBTYPE=<type>] [VERSION=VEC]
+
+   where <benchmark-name>  is "bt", "cg", "dt", "ep", "ft", "is", 
+                              "lu", "mg", or "sp"
+         <class>           is "S", "W", "A", "B", "C", "D", "E", or "F"
+
+   Class F is not defined for IS.
+   Class E or F is not defined for DT.
+
+   The "VERSION=VEC" option is used for selecting the vectorized 
+   versions of BT and LU.
+
+   Only when making the I/O benchmark:
+         <benchmark-name>  is "bt"
+         <class>           as above
+         <type>            is "full", "simple", "fortran", or "epio"
+
+   Three parameters not used in the original BT benchmark are present in
+   the I/O benchmark. Two are set by default in the file BT/bt.f90. 
+   Changing them is optional.
+   One is set in make.def. It must be specified.
+
+   bt.f90: collbuf_nodes: number of processes used to buffer data before
+                        writing to file in the collective buffering mode
+                        (<type> is "full").
+         collbuf_size:  size of buffer (in bytes) per process used in
+                        collective buffering
+
+   make.def: -DFORTRAN_REC_SIZE: Fortran I/O record length in bytes. This
+                        is a system-specific value. It is part of the
+                        definition string of variable CONVERTFLAG. Syntax:
+                        "CONVERTFLAG = -DFORTRAN_REC_SIZE=n", where n is
+                        the record length unit.
+         In 3.4.2, setting FORTRAN_REC_SIZE is no longer needed (<n>=0 as
+         the default signifies auto setting).  However, this variable can 
+         still be used to override the auto setting.
+
+   When <type> is "full" or "simple", the code must be linked with an
+   MPI library that contains the subset of IO routines defined in MPI 2.
+
+
+   Class D or E for IS (Integer Sort) requires a compiler/system that 
+   supports the "long" type in C to be 64-bit.  As examples, the SGI 
+   MIPS compiler for the SGI Origin using the "-64" compilation flag and
+   the Intel compiler for IA64 are known to work.
+
+
+   The above procedure allows you to build one benchmark
+   at a time. To build a whole suite, you can type "make suite".
+   Make will look in file "config/suite.def" for a list of 
+   executables to build. The file contains one line per specification, 
+   with comments preceded by "#". Each line contains the name
+   of a benchmark, the class, and the number of processors, separated
+   by spaces or tabs. config/suite.def.template contains an example
+   of such a file.
+
+
+   The benchmarks have been designed so that they can be run
+   on a single processor without an MPI library. A few "dummy" 
+   MPI routines are still required for linking. For convenience
+   such a library is supplied in the "MPI_dummy" subdirectory of
+   the distribution. It contains the mpif.h and mpi.h include files
+   which must be used as well. The dummy library is built and
+   linked automatically and paths to the include files are defined
+   by inserting the line "include ../config/make.dummy" into the
+   make.def file (see example in make.def.template). Make sure to 
+   read the warnings in the README file in "MPI_dummy". The use of
+   the library is fragile and can produce unexpected errors.
+
+
+   ================================
+   
+   The "RAND" variable in make.def
+   --------------------------------
+   
+   Most of the NPBs use a random number generator. In two of the NPBs (FT
+   and EP) the computation of random numbers is included in the timed
+   part of the calculation, and it is important that the random number
+   generator be efficient.  The default random number generator package
+   provided is called "randi8" and should be used where possible. It has 
+   the following requirements:
+   
+   randi8:
+     1. Uses integer(8) arithmetic. Compiler must support integer(8)
+     2. Uses the Fortran 90 IAND intrinsic. Compiler must support IAND.
+     3. Assumes overflow bits are discarded by the hardware. In particular, 
+        that the lowest 46 bits of a*b are always correct, even if the 
+        result a*b is larger than 2^64. 
+   
+   Since randi8 may not work on all machines, we supply the following
+   alternatives:
+   
+   randi8_safe
+     1. Uses integer(8) arithmetic
+     2. Uses the Fortran 90 IBITS intrinsic. 
+     3. Does not make any assumptions about overflow. Should always
+        work correctly if compiler supports integer(8) and IBITS. 
+   
+   randdp
+     1. Uses double precision arithmetic (to simulate integer(8) operations). 
+        Should work with any system with support for 64-bit floating
+        point arithmetic.      
+   
+   randdpvec
+     1. Similar to randdp but written to be easier to vectorize. 
+   
+   
+2. Execution
+
+   The executable is named <benchmark-name>.<class>.x[.<suffix>],
+   where <suffix> is "fortran_io", "mpi_io_simple", "ep_io", or
+   "mpi_io_full".
+   The executable is placed in the bin subdirectory (or in the directory 
+   BINDIR specified in make.def, if you've defined it). The method for 
+   running the MPI program depends on your local system. As an example of
+   running the BT benchmark Class C, the command might be:
+
+      % mpiexec -np 16 bin/bt.C.x
+
+   Different benchmarks have different requirements for process count,
+   as listed below:
+
+      BT, SP         - a square number of processes (1, 4, 9, ...)
+      LU             - 2D (n1 * n2) process grid where n1/2 <= n2 <= n1
+      CG, FT, IS, MG - a power-of-two number of processes (1, 2, 4, ...)
+      EP, DT         - no special requirement
+
+   The required process count is checked at runtime. By default, a run
+   will abort if the process count requirement is not met.  However,
+   if the environment variable NPB_NPROCS_STRICT is set to one of:
+
+      0, off, no, false
+
+   the run will continue using the largest possible process count that
+   does not exceed the requested process count.  Any excess ranks
+   will be marked as inactive with a warning message.
+
+   For IS and DT, there is a minimal process count for a given class
+   of problem size.
+
+   When any of the I/O benchmarks is run (non-empty subtype), one or 
+   more output files are created, and placed in the directory from which
+   the program was started. These are not removed automatically, and 
+   will be overwritten the next time an IO benchmark is run.
+
+   To enable additional timers in several benchmarks at runtime, set
+   the environment variable NPB_TIMER_FLAG to one of:
+
+      1, on, yes, true
+
+   before executing a benchmark.  The previous method of creating a dummy 
+   file "timer.flag" in the working directory to enable timers is still 
+   supported, but not recommended.
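Reviewer note on the process-count rules in the README above: the sketch below shows one way a launch script could pre-check a requested process count before starting a simulation. It is an illustration only, not part of NPB. The function names (`nprocs_ok`, `largest_allowed`) are hypothetical, the LU rule is my reading of "n1/2 <= n2 <= n1" as "some factorization n1 * n2 satisfies the inequality", and the per-class minimum process counts for IS and DT are not modelled.

```python
import math


def is_square(n):
    """True if n is a positive perfect square (BT, SP)."""
    if n < 1:
        return False
    root = math.isqrt(n)
    return root * root == n


def is_power_of_two(n):
    """True if n is a positive power of two (CG, FT, IS, MG)."""
    return n >= 1 and (n & (n - 1)) == 0


def has_lu_grid(n):
    """True if n = n1 * n2 for some grid with n1/2 <= n2 <= n1 (LU)."""
    if n < 1:
        return False
    for n2 in range(1, n + 1):
        if n % n2 == 0:
            n1 = n // n2
            if n1 <= 2 * n2 and n2 <= n1:
                return True
    return False


def nprocs_ok(benchmark, nprocs):
    """Check a process count against the rule listed for the benchmark."""
    name = benchmark.upper()
    if name in ("BT", "SP"):
        return is_square(nprocs)
    if name in ("CG", "FT", "IS", "MG"):
        return is_power_of_two(nprocs)
    if name == "LU":
        return has_lu_grid(nprocs)
    if name in ("EP", "DT"):
        return nprocs >= 1
    raise ValueError("unknown benchmark: " + benchmark)


def largest_allowed(benchmark, requested):
    """Largest valid count not exceeding the requested count.

    This mirrors the behaviour described for NPB_NPROCS_STRICT=off,
    as an assumption about how the fallback count is chosen.
    """
    if requested < 1:
        raise ValueError("requested process count must be positive")
    for n in range(requested, 0, -1):
        if nprocs_ok(benchmark, n):
            return n
    raise ValueError("no valid process count found")
```

For example, `largest_allowed("SP", 10)` falls back to 9 and `largest_allowed("CG", 10)` to 8, so the remaining ranks would be the ones marked inactive.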
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/SP/Makefile b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/SP/Makefile
new file mode 100644
index 000000000..8a9004532
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/SP/Makefile
@@ -0,0 +1,55 @@
+SHELL=/bin/sh
+BENCHMARK=sp
+BENCHMARKU=SP
+
+include ../config/make.def
+
+
+OBJS = sp.o sp_data.o make_set.o initialize.o exact_solution.o \
+       exact_rhs.o set_constants.o adi.o define.o copy_faces.o \
+       rhs.o lhsx.o lhsy.o lhsz.o x_solve.o ninvr.o y_solve.o pinvr.o \
+       z_solve.o tzetar.o add.o txinvr.o error.o verify.o setup_mpi.o \
+       mpinpb.o ${COMMON}/get_active_nprocs.o \
+       ${COMMON}/print_results.o ${COMMON}/timers.o
+
+include ../sys/make.common
+
+# npbparams.h is included by sp_data module (via sp_data.o)
+
+${PROGRAM}: config ${OBJS}
+	${FLINK} ${FLINKFLAGS} -o ${PROGRAM} ${OBJS} ${FMPI_LIB}
+
+.f90.o:
+	${FCOMPILE} $<
+
+sp.o:             sp.f90  sp_data.o mpinpb.o
+make_set.o:       make_set.f90  sp_data.o mpinpb.o
+initialize.o:     initialize.f90  sp_data.o
+exact_solution.o: exact_solution.f90  sp_data.o
+exact_rhs.o:      exact_rhs.f90  sp_data.o
+set_constants.o:  set_constants.f90  sp_data.o
+adi.o:            adi.f90  sp_data.o
+define.o:         define.f90  sp_data.o
+copy_faces.o:     copy_faces.f90  sp_data.o mpinpb.o
+rhs.o:            rhs.f90  sp_data.o
+lhsx.o:           lhsx.f90  sp_data.o
+lhsy.o:           lhsy.f90  sp_data.o
+lhsz.o:           lhsz.f90  sp_data.o
+x_solve.o:        x_solve.f90  sp_data.o mpinpb.o
+ninvr.o:          ninvr.f90  sp_data.o
+y_solve.o:        y_solve.f90  sp_data.o mpinpb.o
+pinvr.o:          pinvr.f90  sp_data.o
+z_solve.o:        z_solve.f90  sp_data.o mpinpb.o
+tzetar.o:         tzetar.f90  sp_data.o
+add.o:            add.f90  sp_data.o
+txinvr.o:         txinvr.f90  sp_data.o
+error.o:          error.f90  sp_data.o mpinpb.o
+verify.o:         verify.f90  sp_data.o mpinpb.o
+setup_mpi.o:      setup_mpi.f90  sp_data.o mpinpb.o
+sp_data.o:        sp_data.f90  mpinpb.o npbparams.h
+mpinpb.o:         mpinpb.f90
+
+
+clean:
+	- rm -f *.o *.mod *~ mputil*
+	- rm -f npbparams.h core
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/SP/README b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/SP/README
new file mode 100644
index 000000000..fe423db43
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/SP/README
@@ -0,0 +1,17 @@
+
+This code implements a 3D Multi-partition algorithm for the solution 
+of the uncoupled systems of linear equations resulting from 
+Beam-Warming approximate factorization.  Consequently, the program 
+must be run on a square number of processors.  The included file 
+"npbparams.h" contains a parameter statement which sets "maxcells" 
+and "problem_size".  The parameter maxcells must be set to the 
+square root of the number of processors.  For example, if running 
+on 25 processors, then set maxcells=5.  The standard problem sizes 
+are problem_size=64 for class A, 102 for class B, and 162 for class C.
+
+The number of time steps and the time step size dt are set in the 
+npbparams.h but may be overridden in the input deck "inputsp.data".  
+The number of time steps is 400 for all three 
+standard problems, and the appropriate time step sizes "dt" are 
+0.0015d0 for class A, 0.001d0 for class B, and 0.00067d0 for class C.  
+
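Reviewer note on the SP README above: the relation between the process count and "maxcells", together with the stated class parameters, can be written down as a small lookup. This is only a sketch of what the README says; `sp_maxcells` and `SP_CLASSES` are hypothetical names and do not exist in NPB.

```python
import math

# (problem_size, dt) per class, as listed in the SP README.
# The number of time steps is 400 for all three classes.
SP_CLASSES = {
    "A": (64, 0.0015),
    "B": (102, 0.001),
    "C": (162, 0.00067),
}
SP_TIME_STEPS = 400


def sp_maxcells(nprocs):
    """Return maxcells = sqrt(nprocs); SP needs a square process count."""
    if nprocs < 1:
        raise ValueError("process count must be positive")
    root = math.isqrt(nprocs)
    if root * root != nprocs:
        raise ValueError("SP requires a square number of processes")
    return root
```

With 25 processes, `sp_maxcells(25)` gives 5, matching the example in the README.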
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/SP/add.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/SP/add.f90
new file mode 100644
index 000000000..6c36d2681
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/SP/add.f90
@@ -0,0 +1,32 @@
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+       subroutine  add
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+! addition of update to the vector u
+!---------------------------------------------------------------------
+
+       use sp_data
+       implicit none
+
+       integer  c, i, j, k, m
+
+       do  c = 1, ncells
+          do m = 1, 5
+             do  k = start(3,c), cell_size(3,c)-end(3,c)-1
+                do  j = start(2,c), cell_size(2,c)-end(2,c)-1
+                   do  i = start(1,c), cell_size(1,c)-end(1,c)-1
+                      u(i,j,k,m,c) = u(i,j,k,m,c) + rhs(i,j,k,m,c)
+                   end do
+                end do
+             end do
+          end do
+       end do
+
+       return
+       end
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/SP/adi.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/SP/adi.f90
new file mode 100644
index 000000000..4f01b2ff9
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/SP/adi.f90
@@ -0,0 +1,24 @@
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+       subroutine  adi
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+       call copy_faces
+
+       call txinvr
+
+       call x_solve
+
+       call y_solve
+
+       call z_solve
+
+       call add
+
+       return
+       end
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/SP/copy_faces.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/SP/copy_faces.f90
new file mode 100644
index 000000000..d0800865f
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/SP/copy_faces.f90
@@ -0,0 +1,314 @@
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+       subroutine copy_faces
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+! this function copies the face values of a variable defined on a set 
+! of cells to the overlap locations of the adjacent sets of cells. 
+! Because a set of cells interfaces in each direction with exactly one 
+! other set, we only need to fill six different buffers. We could try to 
+! overlap communication with computation, by computing
+! some internal values while communicating boundary values, but this
+! adds so much overhead that it's not clearly useful. 
+!---------------------------------------------------------------------
+
+       use sp_data
+       use mpinpb
+
+       implicit none
+
+       integer i, j, k, c, m, requests(0:11), p0, p1,  &
+     &         p2, p3, p4, p5, b_size(0:5), ss(0:5),  &
+     &         sr(0:5), error, statuses(MPI_STATUS_SIZE, 0:11)
+
+!---------------------------------------------------------------------
+!      exit immediately if there are no faces to be copied           
+!---------------------------------------------------------------------
+       if (no_nodes .eq. 1) then
+          call compute_rhs
+          return
+       endif
+
+
+       ss(0) = start_send_east
+       ss(1) = start_send_west
+       ss(2) = start_send_north
+       ss(3) = start_send_south
+       ss(4) = start_send_top
+       ss(5) = start_send_bottom
+
+       sr(0) = start_recv_east
+       sr(1) = start_recv_west
+       sr(2) = start_recv_north
+       sr(3) = start_recv_south
+       sr(4) = start_recv_top
+       sr(5) = start_recv_bottom
+
+       b_size(0) = east_size   
+       b_size(1) = west_size   
+       b_size(2) = north_size  
+       b_size(3) = south_size  
+       b_size(4) = top_size    
+       b_size(5) = bottom_size 
+
+!---------------------------------------------------------------------
+! because the difference stencil for the diagonalized scheme is 
+! orthogonal, we do not have to perform the staged copying of faces, 
+! but can send all face information simultaneously to the neighboring 
+! cells in all directions          
+!---------------------------------------------------------------------
+       if (timeron) call timer_start(t_bpack)
+       p0 = 0
+       p1 = 0
+       p2 = 0
+       p3 = 0
+       p4 = 0
+       p5 = 0
+
+       do  c = 1, ncells
+          do   m = 1, 5
+
+!---------------------------------------------------------------------
+!            fill the buffer to be sent to eastern neighbors (i-dir)
+!---------------------------------------------------------------------
+             if (cell_coord(1,c) .ne. ncells) then
+                do   k = 0, cell_size(3,c)-1
+                   do   j = 0, cell_size(2,c)-1
+                      do   i = cell_size(1,c)-2, cell_size(1,c)-1
+                         out_buffer(ss(0)+p0) = u(i,j,k,m,c)
+                         p0 = p0 + 1
+                      end do
+                   end do
+                end do
+             endif
+
+!---------------------------------------------------------------------
+!            fill the buffer to be sent to western neighbors 
+!---------------------------------------------------------------------
+             if (cell_coord(1,c) .ne. 1) then
+                do   k = 0, cell_size(3,c)-1
+                   do   j = 0, cell_size(2,c)-1
+                      do   i = 0, 1
+                         out_buffer(ss(1)+p1) = u(i,j,k,m,c)
+                         p1 = p1 + 1
+                      end do
+                   end do
+                end do
+
+
+             endif
+
+!---------------------------------------------------------------------
+!            fill the buffer to be sent to northern neighbors (j_dir)
+!---------------------------------------------------------------------
+             if (cell_coord(2,c) .ne. ncells) then
+                do   k = 0, cell_size(3,c)-1
+                   do   j = cell_size(2,c)-2, cell_size(2,c)-1
+                      do   i = 0, cell_size(1,c)-1
+                         out_buffer(ss(2)+p2) = u(i,j,k,m,c)
+                         p2 = p2 + 1
+                      end do
+                   end do
+                end do
+             endif
+
+!---------------------------------------------------------------------
+!            fill the buffer to be sent to southern neighbors 
+!---------------------------------------------------------------------
+             if (cell_coord(2,c).ne. 1) then
+                do   k = 0, cell_size(3,c)-1
+                   do   j = 0, 1
+                      do   i = 0, cell_size(1,c)-1   
+                         out_buffer(ss(3)+p3) = u(i,j,k,m,c)
+                         p3 = p3 + 1
+                      end do
+                   end do
+                end do
+             endif
+
+!---------------------------------------------------------------------
+!            fill the buffer to be sent to top neighbors (k-dir)
+!---------------------------------------------------------------------
+             if (cell_coord(3,c) .ne. ncells) then
+                do   k = cell_size(3,c)-2, cell_size(3,c)-1
+                   do   j = 0, cell_size(2,c)-1
+                      do   i = 0, cell_size(1,c)-1
+                         out_buffer(ss(4)+p4) = u(i,j,k,m,c)
+                         p4 = p4 + 1
+                      end do
+                   end do
+                end do
+             endif
+
+!---------------------------------------------------------------------
+!            fill the buffer to be sent to bottom neighbors
+!---------------------------------------------------------------------
+             if (cell_coord(3,c).ne. 1) then
+                 do    k=0, 1
+                    do   j = 0, cell_size(2,c)-1
+                       do   i = 0, cell_size(1,c)-1
+                          out_buffer(ss(5)+p5) = u(i,j,k,m,c)
+                          p5 = p5 + 1
+                       end do
+                    end do
+                 end do
+              endif
+
+!---------------------------------------------------------------------
+!          m loop
+!---------------------------------------------------------------------
+           end do
+
+!---------------------------------------------------------------------
+!       cell loop
+!---------------------------------------------------------------------
+        end do
+       if (timeron) call timer_stop(t_bpack)
+
+       if (timeron) call timer_start(t_exch)
+       call mpi_irecv(in_buffer(sr(0)), b_size(0),  &
+     &                dp_type, successor(1), WEST,  &
+     &                comm_rhs, requests(0), error)
+       call mpi_irecv(in_buffer(sr(1)), b_size(1),  &
+     &                dp_type, predecessor(1), EAST,  &
+     &                comm_rhs, requests(1), error)
+       call mpi_irecv(in_buffer(sr(2)), b_size(2),  &
+     &                dp_type, successor(2), SOUTH,  &
+     &                comm_rhs, requests(2), error)
+       call mpi_irecv(in_buffer(sr(3)), b_size(3),  &
+     &                dp_type, predecessor(2), NORTH,  &
+     &                comm_rhs, requests(3), error)
+       call mpi_irecv(in_buffer(sr(4)), b_size(4),  &
+     &                dp_type, successor(3), BOTTOM,  &
+     &                comm_rhs, requests(4), error)
+       call mpi_irecv(in_buffer(sr(5)), b_size(5),  &
+     &                dp_type, predecessor(3), TOP,   &
+     &                comm_rhs, requests(5), error)
+
+       call mpi_isend(out_buffer(ss(0)), b_size(0),  &
+     &                dp_type, successor(1),   EAST,  &
+     &                comm_rhs, requests(6), error)
+       call mpi_isend(out_buffer(ss(1)), b_size(1),  &
+     &                dp_type, predecessor(1), WEST,  &
+     &                comm_rhs, requests(7), error)
+       call mpi_isend(out_buffer(ss(2)), b_size(2),  &
+     &                dp_type,successor(2),   NORTH,  &
+     &                comm_rhs, requests(8), error)
+       call mpi_isend(out_buffer(ss(3)), b_size(3),  &
+     &                dp_type,predecessor(2), SOUTH,  &
+     &                comm_rhs, requests(9), error)
+       call mpi_isend(out_buffer(ss(4)), b_size(4),  &
+     &                dp_type,successor(3),   TOP,  &
+     &                comm_rhs,   requests(10), error)
+       call mpi_isend(out_buffer(ss(5)), b_size(5),  &
+     &                dp_type,predecessor(3), BOTTOM,  &
+     &                comm_rhs,requests(11), error)
+
+
+       call mpi_waitall(12, requests, statuses, error)
+       if (timeron) call timer_stop(t_exch)
+
+!---------------------------------------------------------------------
+! unpack the data that has just been received;             
+!---------------------------------------------------------------------
+       if (timeron) call timer_start(t_bpack)
+       p0 = 0
+       p1 = 0
+       p2 = 0
+       p3 = 0
+       p4 = 0
+       p5 = 0
+
+       do   c = 1, ncells
+          do    m = 1, 5
+
+             if (cell_coord(1,c) .ne. 1) then
+                do   k = 0, cell_size(3,c)-1
+                   do   j = 0, cell_size(2,c)-1
+                      do   i = -2, -1
+                         u(i,j,k,m,c) = in_buffer(sr(1)+p0)
+                         p0 = p0 + 1
+                      end do
+                   end do
+                end do
+             endif
+
+             if (cell_coord(1,c) .ne. ncells) then
+                do  k = 0, cell_size(3,c)-1
+                   do  j = 0, cell_size(2,c)-1
+                      do  i = cell_size(1,c), cell_size(1,c)+1
+                         u(i,j,k,m,c) = in_buffer(sr(0)+p1)
+                         p1 = p1 + 1
+                      end do
+                   end do
+                end do
+             end if
+ 
+             if (cell_coord(2,c) .ne. 1) then
+                do  k = 0, cell_size(3,c)-1
+                   do   j = -2, -1
+                      do  i = 0, cell_size(1,c)-1
+                         u(i,j,k,m,c) = in_buffer(sr(3)+p2)
+                         p2 = p2 + 1
+                      end do
+                   end do
+                end do
+
+             endif
+ 
+             if (cell_coord(2,c) .ne. ncells) then
+                do  k = 0, cell_size(3,c)-1
+                   do   j = cell_size(2,c), cell_size(2,c)+1
+                      do  i = 0, cell_size(1,c)-1
+                         u(i,j,k,m,c) = in_buffer(sr(2)+p3)
+                         p3 = p3 + 1
+                      end do
+                   end do
+                end do
+             endif
+
+             if (cell_coord(3,c) .ne. 1) then
+                do  k = -2, -1
+                   do  j = 0, cell_size(2,c)-1
+                      do  i = 0, cell_size(1,c)-1
+                         u(i,j,k,m,c) = in_buffer(sr(5)+p4)
+                         p4 = p4 + 1
+                      end do
+                   end do
+                end do
+             endif
+
+             if (cell_coord(3,c) .ne. ncells) then
+                do  k = cell_size(3,c), cell_size(3,c)+1
+                   do  j = 0, cell_size(2,c)-1
+                      do  i = 0, cell_size(1,c)-1
+                         u(i,j,k,m,c) = in_buffer(sr(4)+p5)
+                         p5 = p5 + 1
+                      end do
+                   end do
+                end do
+             endif
+
+!---------------------------------------------------------------------
+!         m loop            
+!---------------------------------------------------------------------
+          end do
+
+!---------------------------------------------------------------------
+!      cells loop
+!---------------------------------------------------------------------
+       end do
+       if (timeron) call timer_stop(t_bpack)
+
+!---------------------------------------------------------------------
+! now that we have all the data, compute the rhs
+!---------------------------------------------------------------------
+       call compute_rhs
+
+       return
+       end
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/SP/define.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/SP/define.f90
new file mode 100644
index 000000000..1319b084a
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/SP/define.f90
@@ -0,0 +1,67 @@
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+       subroutine compute_buffer_size(dim)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+       use sp_data
+       implicit none
+
+       integer  c, dim, face_size
+
+       if (ncells .eq. 1) return
+
+!---------------------------------------------------------------------
+!      compute the actual sizes of the buffers; note that there is 
+!      always one cell face that doesn't need buffer space, because it 
+!      is at the boundary of the grid
+!---------------------------------------------------------------------
+
+       west_size = 0
+       east_size = 0
+
+       do   c = 1, ncells
+          face_size = cell_size(2,c) * cell_size(3,c) * dim * 2
+          if (cell_coord(1,c).ne.1) west_size = west_size + face_size
+          if (cell_coord(1,c).ne.ncells) east_size = east_size +  &
+     &                                                 face_size 
+       end do
+
+       north_size = 0
+       south_size = 0
+       do   c = 1, ncells
+          face_size = cell_size(1,c)*cell_size(3,c) * dim * 2
+          if (cell_coord(2,c).ne.1) south_size = south_size + face_size
+          if (cell_coord(2,c).ne.ncells) north_size = north_size +  &
+     &                                                  face_size 
+       end do
+
+       top_size = 0
+       bottom_size = 0
+       do   c = 1, ncells
+          face_size = cell_size(1,c) * cell_size(2,c) * dim * 2
+          if (cell_coord(3,c).ne.1) bottom_size = bottom_size +  &
+     &                                            face_size
+          if (cell_coord(3,c).ne.ncells) top_size = top_size +  &
+     &                                                face_size     
+       end do
+
+       start_send_west   = 1
+       start_send_east   = start_send_west   + west_size
+       start_send_south  = start_send_east   + east_size
+       start_send_north  = start_send_south  + south_size
+       start_send_bottom = start_send_north  + north_size
+       start_send_top    = start_send_bottom + bottom_size
+       start_recv_west   = 1
+       start_recv_east   = start_recv_west   + west_size
+       start_recv_south  = start_recv_east   + east_size
+       start_recv_north  = start_recv_south  + south_size
+       start_recv_bottom = start_recv_north  + north_size
+       start_recv_top    = start_recv_bottom + bottom_size
+
+       return
+       end
+
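Reviewer note on compute_buffer_size above: the routine sums, per direction, the face sizes of all cells that are not on the grid boundary, then lays the six buffers out back to back starting at index 1. The sketch below restates that arithmetic so the offsets can be checked by hand. It is a paraphrase under assumptions: `buffer_layout` is a hypothetical name, cells are passed as plain tuples, and the factor of 2 is taken to be the two layers of face values that copy_faces packs per face.

```python
def buffer_layout(cell_size, cell_coord, ncells, dim):
    """Return (sizes, starts) for the six face buffers.

    cell_size[c]  -- (nx, ny, nz) of cell c
    cell_coord[c] -- 1-based (x, y, z) coordinate of cell c
    dim           -- number of variables exchanged per grid point
    """
    # (low, high) buffer names per direction, in the Fortran layout order.
    names = (("west", "east"), ("south", "north"), ("bottom", "top"))
    sizes = {n: 0 for pair in names for n in pair}
    starts = {}

    # The Fortran routine returns early here without touching anything.
    if ncells == 1:
        return sizes, starts

    for c in range(ncells):
        for d, (low, high) in enumerate(names):
            # A face normal to direction d spans the other two directions.
            a, b = (cell_size[c][x] for x in range(3) if x != d)
            face_size = a * b * dim * 2
            if cell_coord[c][d] != 1:
                sizes[low] += face_size
            if cell_coord[c][d] != ncells:
                sizes[high] += face_size

    # Buffers are packed contiguously: west, east, south, north, bottom, top.
    offset = 1
    for pair in names:
        for n in pair:
            starts[n] = offset
            offset += sizes[n]
    return sizes, starts
```

The same offsets are used for the send and receive buffers, which is why the Fortran code assigns start_send_* and start_recv_* from identical expressions.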
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/SP/error.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/SP/error.f90
new file mode 100644
index 000000000..f9879d661
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/SP/error.f90
@@ -0,0 +1,109 @@
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+       subroutine error_norm(rms)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+! this function computes the norm of the difference between the
+! computed solution and the exact solution
+!---------------------------------------------------------------------
+
+       use sp_data
+       use mpinpb
+
+       implicit none
+
+       integer c, i, j, k, m, ii, jj, kk, d, error
+       double precision xi, eta, zeta, u_exact(5), rms(5), rms_work(5),  &
+     &                  add
+
+       do   m = 1, 5 
+          rms_work(m) = 0.0d0
+       end do
+
+       do   c = 1, ncells
+          kk = 0
+          do   k = cell_low(3,c), cell_high(3,c)
+             zeta = dble(k) * dnzm1
+             jj = 0
+             do   j = cell_low(2,c), cell_high(2,c)
+                eta = dble(j) * dnym1
+                ii = 0
+                do   i = cell_low(1,c), cell_high(1,c)
+                   xi = dble(i) * dnxm1
+                   call exact_solution(xi, eta, zeta, u_exact)
+
+                   do   m = 1, 5
+                      add = u(ii,jj,kk,m,c)-u_exact(m)
+                      rms_work(m) = rms_work(m) + add*add
+                   end do
+                   ii = ii + 1
+                end do
+                jj = jj + 1
+             end do
+             kk = kk + 1
+          end do
+       end do
+
+       call mpi_allreduce(rms_work, rms, 5, dp_type,  &
+     &                 MPI_SUM, comm_setup, error)
+
+       do    m = 1, 5
+          do    d = 1, 3
+             rms(m) = rms(m) / dble(grid_points(d)-2)
+          end do
+          rms(m) = dsqrt(rms(m))
+       end do
+
+       return
+       end
+
+
+
+       subroutine rhs_norm(rms)
+
+       use sp_data
+       use mpinpb
+
+       implicit none
+
+       integer c, i, j, k, d, m, error
+       double precision rms(5), rms_work(5), add
+
+       do    m = 1, 5
+          rms_work(m) = 0.0d0
+       end do
+
+       do   c = 1, ncells
+          do   k = start(3,c), cell_size(3,c)-end(3,c)-1
+             do   j = start(2,c), cell_size(2,c)-end(2,c)-1
+                do   i = start(1,c), cell_size(1,c)-end(1,c)-1
+                   do   m = 1, 5
+                      add = rhs(i,j,k,m,c)
+                      rms_work(m) = rms_work(m) + add*add
+                   end do
+                end do
+             end do
+          end do
+       end do
+
+
+
+       call mpi_allreduce(rms_work, rms, 5, dp_type,  &
+     &                 MPI_SUM, comm_setup, error)
+
+       do   m = 1, 5
+          do   d = 1, 3
+             rms(m) = rms(m) / dble(grid_points(d)-2)
+          end do
+          rms(m) = dsqrt(rms(m))
+       end do
+
+       return
+       end
+
+
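Reviewer note on error_norm and rhs_norm above: after the per-process sums of squares are combined with mpi_allreduce, both routines divide each component by (grid_points(d) - 2) for d = 1..3 and take the square root. The sketch below reproduces only that final normalization on already-reduced sums. `rms_norm` is a hypothetical name and the MPI reduction is deliberately left out.

```python
import math


def rms_norm(sum_sq, grid_points):
    """Normalize globally summed squares into RMS values.

    sum_sq      -- one sum of squares per solution component
    grid_points -- (nx, ny, nz) of the global grid
    """
    result = []
    for total in sum_sq:
        # Divide by the point count (grid_points(d) - 2) in each direction.
        for g in grid_points:
            total = total / float(g - 2)
        result.append(math.sqrt(total))
    return result
```

For a 4 x 4 x 4 grid the divisor is 2 * 2 * 2 = 8, so a summed square of 8.0 normalizes to an RMS of 1.0.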
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/SP/exact_rhs.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/SP/exact_rhs.f90
new file mode 100644
index 000000000..84c213a58
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/SP/exact_rhs.f90
@@ -0,0 +1,364 @@
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+       subroutine exact_rhs
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+! compute the right hand side based on exact solution
+!---------------------------------------------------------------------
+
+       use sp_data
+       implicit none
+
+       double precision dtemp(5), xi, eta, zeta, dtpp
+       integer          c, m, i, j, k, ip1, im1, jp1,  &
+     &                  jm1, km1, kp1
+
+!---------------------------------------------------------------------
+! loop over all cells owned by this node                   
+!---------------------------------------------------------------------
+       do   c = 1, ncells
+
+!---------------------------------------------------------------------
+!         initialize                                  
+!---------------------------------------------------------------------
+          do   m = 1, 5
+             do   k= 0, cell_size(3,c)-1
+                do   j = 0, cell_size(2,c)-1
+                   do   i = 0, cell_size(1,c)-1
+                      forcing(i,j,k,m,c) = 0.0d0
+                   end do
+                end do
+             end do
+          end do
+
+!---------------------------------------------------------------------
+! xi-direction flux differences                      
+!---------------------------------------------------------------------
+          do   k = start(3,c), cell_size(3,c)-end(3,c)-1
+             zeta = dble(k+cell_low(3,c)) * dnzm1
+             do   j = start(2,c), cell_size(2,c)-end(2,c)-1
+                eta = dble(j+cell_low(2,c)) * dnym1
+
+                do  i=-2*(1-start(1,c)), cell_size(1,c)+1-2*end(1,c)
+                   xi = dble(i+cell_low(1,c)) * dnxm1
+
+                   call exact_solution(xi, eta, zeta, dtemp)
+                   do  m = 1, 5
+                      ue(i,m) = dtemp(m)
+                   end do
+
+                   dtpp = 1.0d0 / dtemp(1)
+
+                   do  m = 2, 5
+                      buf(i,m) = dtpp * dtemp(m)
+                   end do
+
+                   cuf(i)   = buf(i,2) * buf(i,2)
+                   buf(i,1) = cuf(i) + buf(i,3) * buf(i,3) +  &
+     &                        buf(i,4) * buf(i,4) 
+                   q(i) = 0.5d0*(buf(i,2)*ue(i,2) + buf(i,3)*ue(i,3) +  &
+     &                           buf(i,4)*ue(i,4))
+
+                end do
+ 
+                do  i = start(1,c), cell_size(1,c)-end(1,c)-1
+                   im1 = i-1
+                   ip1 = i+1
+
+                   forcing(i,j,k,1,c) = forcing(i,j,k,1,c) -  &
+     &                 tx2*( ue(ip1,2)-ue(im1,2) )+  &
+     &                 dx1tx1*(ue(ip1,1)-2.0d0*ue(i,1)+ue(im1,1))
+
+                   forcing(i,j,k,2,c) = forcing(i,j,k,2,c) - tx2 * (  &
+     &                (ue(ip1,2)*buf(ip1,2)+c2*(ue(ip1,5)-q(ip1)))-  &
+     &                (ue(im1,2)*buf(im1,2)+c2*(ue(im1,5)-q(im1))))+  &
+     &                 xxcon1*(buf(ip1,2)-2.0d0*buf(i,2)+buf(im1,2))+  &
+     &                 dx2tx1*( ue(ip1,2)-2.0d0* ue(i,2)+ue(im1,2))
+
+                   forcing(i,j,k,3,c) = forcing(i,j,k,3,c) - tx2 * (  &
+     &                 ue(ip1,3)*buf(ip1,2)-ue(im1,3)*buf(im1,2))+  &
+     &                 xxcon2*(buf(ip1,3)-2.0d0*buf(i,3)+buf(im1,3))+  &
+     &                 dx3tx1*( ue(ip1,3)-2.0d0*ue(i,3) +ue(im1,3))
+                  
+                   forcing(i,j,k,4,c) = forcing(i,j,k,4,c) - tx2*(  &
+     &                 ue(ip1,4)*buf(ip1,2)-ue(im1,4)*buf(im1,2))+  &
+     &                 xxcon2*(buf(ip1,4)-2.0d0*buf(i,4)+buf(im1,4))+  &
+     &                 dx4tx1*( ue(ip1,4)-2.0d0* ue(i,4)+ ue(im1,4))
+
+                   forcing(i,j,k,5,c) = forcing(i,j,k,5,c) - tx2*(  &
+     &                 buf(ip1,2)*(c1*ue(ip1,5)-c2*q(ip1))-  &
+     &                 buf(im1,2)*(c1*ue(im1,5)-c2*q(im1)))+  &
+     &                 0.5d0*xxcon3*(buf(ip1,1)-2.0d0*buf(i,1)+  &
+     &                               buf(im1,1))+  &
+     &                 xxcon4*(cuf(ip1)-2.0d0*cuf(i)+cuf(im1))+  &
+     &                 xxcon5*(buf(ip1,5)-2.0d0*buf(i,5)+buf(im1,5))+  &
+     &                 dx5tx1*( ue(ip1,5)-2.0d0* ue(i,5)+ ue(im1,5))
+                end do
+
+!---------------------------------------------------------------------
+! Fourth-order dissipation                         
+!---------------------------------------------------------------------
+                if (start(1,c) .gt. 0) then
+                   do   m = 1, 5
+                      i = 1
+                      forcing(i,j,k,m,c) = forcing(i,j,k,m,c) - dssp *  &
+     &                    (5.0d0*ue(i,m) - 4.0d0*ue(i+1,m) +ue(i+2,m))
+                      i = 2
+                      forcing(i,j,k,m,c) = forcing(i,j,k,m,c) - dssp *  &
+     &                   (-4.0d0*ue(i-1,m) + 6.0d0*ue(i,m) -  &
+     &                     4.0d0*ue(i+1,m) +       ue(i+2,m))
+                   end do
+                endif
+
+                do   m = 1, 5
+                   do  i = start(1,c)*3, cell_size(1,c)-3*end(1,c)-1
+                      forcing(i,j,k,m,c) = forcing(i,j,k,m,c) - dssp*  &
+     &                   (ue(i-2,m) - 4.0d0*ue(i-1,m) +  &
+     &                    6.0d0*ue(i,m) - 4.0d0*ue(i+1,m) + ue(i+2,m))
+                   end do
+                end do
+
+                if (end(1,c) .gt. 0) then
+                   do   m = 1, 5
+                      i = cell_size(1,c)-3
+                      forcing(i,j,k,m,c) = forcing(i,j,k,m,c) - dssp *  &
+     &                   (ue(i-2,m) - 4.0d0*ue(i-1,m) +  &
+     &                    6.0d0*ue(i,m) - 4.0d0*ue(i+1,m))
+                      i = cell_size(1,c)-2
+                      forcing(i,j,k,m,c) = forcing(i,j,k,m,c) - dssp *  &
+     &                   (ue(i-2,m) - 4.0d0*ue(i-1,m) + 5.0d0*ue(i,m))
+                   end do
+                endif
+
+             end do
+          end do
+!---------------------------------------------------------------------
+!  eta-direction flux differences             
+!---------------------------------------------------------------------
+          do   k = start(3,c), cell_size(3,c)-end(3,c)-1          
+             zeta = dble(k+cell_low(3,c)) * dnzm1
+             do   i=start(1,c), cell_size(1,c)-end(1,c)-1
+                xi = dble(i+cell_low(1,c)) * dnxm1
+
+                do  j=-2*(1-start(2,c)), cell_size(2,c)+1-2*end(2,c)
+                   eta = dble(j+cell_low(2,c)) * dnym1
+
+                   call exact_solution(xi, eta, zeta, dtemp)
+                   do   m = 1, 5 
+                      ue(j,m) = dtemp(m)
+                   end do
+                   dtpp = 1.0d0/dtemp(1)
+
+                   do  m = 2, 5
+                      buf(j,m) = dtpp * dtemp(m)
+                   end do
+
+                   cuf(j)   = buf(j,3) * buf(j,3)
+                   buf(j,1) = cuf(j) + buf(j,2) * buf(j,2) +  &
+     &                        buf(j,4) * buf(j,4)
+                   q(j) = 0.5d0*(buf(j,2)*ue(j,2) + buf(j,3)*ue(j,3) +  &
+     &                           buf(j,4)*ue(j,4))
+                end do
+
+                do  j = start(2,c), cell_size(2,c)-end(2,c)-1
+                   jm1 = j-1
+                   jp1 = j+1
+                  
+                   forcing(i,j,k,1,c) = forcing(i,j,k,1,c) -  &
+     &                ty2*( ue(jp1,3)-ue(jm1,3) )+  &
+     &                dy1ty1*(ue(jp1,1)-2.0d0*ue(j,1)+ue(jm1,1))
+
+                   forcing(i,j,k,2,c) = forcing(i,j,k,2,c) - ty2*(  &
+     &                ue(jp1,2)*buf(jp1,3)-ue(jm1,2)*buf(jm1,3))+  &
+     &                yycon2*(buf(jp1,2)-2.0d0*buf(j,2)+buf(jm1,2))+  &
+     &                dy2ty1*( ue(jp1,2)-2.0d0* ue(j,2)+ ue(jm1,2))
+
+                   forcing(i,j,k,3,c) = forcing(i,j,k,3,c) - ty2*(  &
+     &                (ue(jp1,3)*buf(jp1,3)+c2*(ue(jp1,5)-q(jp1)))-  &
+     &                (ue(jm1,3)*buf(jm1,3)+c2*(ue(jm1,5)-q(jm1))))+  &
+     &                yycon1*(buf(jp1,3)-2.0d0*buf(j,3)+buf(jm1,3))+  &
+     &                dy3ty1*( ue(jp1,3)-2.0d0*ue(j,3) +ue(jm1,3))
+
+                   forcing(i,j,k,4,c) = forcing(i,j,k,4,c) - ty2*(  &
+     &                ue(jp1,4)*buf(jp1,3)-ue(jm1,4)*buf(jm1,3))+  &
+     &                yycon2*(buf(jp1,4)-2.0d0*buf(j,4)+buf(jm1,4))+  &
+     &                dy4ty1*( ue(jp1,4)-2.0d0*ue(j,4)+ ue(jm1,4))
+
+                   forcing(i,j,k,5,c) = forcing(i,j,k,5,c) - ty2*(  &
+     &                buf(jp1,3)*(c1*ue(jp1,5)-c2*q(jp1))-  &
+     &                buf(jm1,3)*(c1*ue(jm1,5)-c2*q(jm1)))+  &
+     &                0.5d0*yycon3*(buf(jp1,1)-2.0d0*buf(j,1)+  &
+     &                              buf(jm1,1))+  &
+     &                yycon4*(cuf(jp1)-2.0d0*cuf(j)+cuf(jm1))+  &
+     &                yycon5*(buf(jp1,5)-2.0d0*buf(j,5)+buf(jm1,5))+  &
+     &                dy5ty1*(ue(jp1,5)-2.0d0*ue(j,5)+ue(jm1,5))
+                end do
+
+!---------------------------------------------------------------------
+! Fourth-order dissipation                      
+!---------------------------------------------------------------------
+                if (start(2,c) .gt. 0) then
+                   do   m = 1, 5
+                      j = 1
+                      forcing(i,j,k,m,c) = forcing(i,j,k,m,c) - dssp *  &
+     &                    (5.0d0*ue(j,m) - 4.0d0*ue(j+1,m) +ue(j+2,m))
+                      j = 2
+                      forcing(i,j,k,m,c) = forcing(i,j,k,m,c) - dssp *  &
+     &                   (-4.0d0*ue(j-1,m) + 6.0d0*ue(j,m) -  &
+     &                     4.0d0*ue(j+1,m) +       ue(j+2,m))
+                   end do
+                endif
+
+                do   m = 1, 5
+                   do  j = start(2,c)*3, cell_size(2,c)-3*end(2,c)-1
+                      forcing(i,j,k,m,c) = forcing(i,j,k,m,c) - dssp*  &
+     &                   (ue(j-2,m) - 4.0d0*ue(j-1,m) +  &
+     &                    6.0d0*ue(j,m) - 4.0d0*ue(j+1,m) + ue(j+2,m))
+                   end do
+                end do
+                if (end(2,c) .gt. 0) then
+                   do   m = 1, 5
+                      j = cell_size(2,c)-3
+                      forcing(i,j,k,m,c) = forcing(i,j,k,m,c) - dssp *  &
+     &                   (ue(j-2,m) - 4.0d0*ue(j-1,m) +  &
+     &                    6.0d0*ue(j,m) - 4.0d0*ue(j+1,m))
+                      j = cell_size(2,c)-2
+                      forcing(i,j,k,m,c) = forcing(i,j,k,m,c) - dssp *  &
+     &                   (ue(j-2,m) - 4.0d0*ue(j-1,m) + 5.0d0*ue(j,m))
+
+                   end do
+                endif
+
+             end do
+          end do
+
+!---------------------------------------------------------------------
+! zeta-direction flux differences                      
+!---------------------------------------------------------------------
+          do  j=start(2,c), cell_size(2,c)-end(2,c)-1
+             eta = dble(j+cell_low(2,c)) * dnym1
+             do   i = start(1,c), cell_size(1,c)-end(1,c)-1
+                xi = dble(i+cell_low(1,c)) * dnxm1
+
+                do k=-2*(1-start(3,c)), cell_size(3,c)+1-2*end(3,c)
+                   zeta = dble(k+cell_low(3,c)) * dnzm1
+
+                   call exact_solution(xi, eta, zeta, dtemp)
+                   do   m = 1, 5
+                      ue(k,m) = dtemp(m)
+                   end do
+
+                   dtpp = 1.0d0/dtemp(1)
+
+                   do   m = 2, 5
+                      buf(k,m) = dtpp * dtemp(m)
+                   end do
+
+                   cuf(k)   = buf(k,4) * buf(k,4)
+                   buf(k,1) = cuf(k) + buf(k,2) * buf(k,2) +  &
+     &                        buf(k,3) * buf(k,3)
+                   q(k) = 0.5d0*(buf(k,2)*ue(k,2) + buf(k,3)*ue(k,3) +  &
+     &                           buf(k,4)*ue(k,4))
+                end do
+
+                do    k=start(3,c), cell_size(3,c)-end(3,c)-1
+                   km1 = k-1
+                   kp1 = k+1
+                  
+                   forcing(i,j,k,1,c) = forcing(i,j,k,1,c) -  &
+     &                 tz2*( ue(kp1,4)-ue(km1,4) )+  &
+     &                 dz1tz1*(ue(kp1,1)-2.0d0*ue(k,1)+ue(km1,1))
+
+                   forcing(i,j,k,2,c) = forcing(i,j,k,2,c) - tz2 * (  &
+     &                 ue(kp1,2)*buf(kp1,4)-ue(km1,2)*buf(km1,4))+  &
+     &                 zzcon2*(buf(kp1,2)-2.0d0*buf(k,2)+buf(km1,2))+  &
+     &                 dz2tz1*( ue(kp1,2)-2.0d0* ue(k,2)+ ue(km1,2))
+
+                   forcing(i,j,k,3,c) = forcing(i,j,k,3,c) - tz2 * (  &
+     &                 ue(kp1,3)*buf(kp1,4)-ue(km1,3)*buf(km1,4))+  &
+     &                 zzcon2*(buf(kp1,3)-2.0d0*buf(k,3)+buf(km1,3))+  &
+     &                 dz3tz1*(ue(kp1,3)-2.0d0*ue(k,3)+ue(km1,3))
+
+                   forcing(i,j,k,4,c) = forcing(i,j,k,4,c) - tz2 * (  &
+     &                (ue(kp1,4)*buf(kp1,4)+c2*(ue(kp1,5)-q(kp1)))-  &
+     &                (ue(km1,4)*buf(km1,4)+c2*(ue(km1,5)-q(km1))))+  &
+     &                zzcon1*(buf(kp1,4)-2.0d0*buf(k,4)+buf(km1,4))+  &
+     &                dz4tz1*( ue(kp1,4)-2.0d0*ue(k,4) +ue(km1,4))
+
+                   forcing(i,j,k,5,c) = forcing(i,j,k,5,c) - tz2 * (  &
+     &                 buf(kp1,4)*(c1*ue(kp1,5)-c2*q(kp1))-  &
+     &                 buf(km1,4)*(c1*ue(km1,5)-c2*q(km1)))+  &
+     &                 0.5d0*zzcon3*(buf(kp1,1)-2.0d0*buf(k,1)  &
+     &                              +buf(km1,1))+  &
+     &                 zzcon4*(cuf(kp1)-2.0d0*cuf(k)+cuf(km1))+  &
+     &                 zzcon5*(buf(kp1,5)-2.0d0*buf(k,5)+buf(km1,5))+  &
+     &                 dz5tz1*( ue(kp1,5)-2.0d0*ue(k,5)+ ue(km1,5))
+                end do
+
+!---------------------------------------------------------------------
+! Fourth-order dissipation                        
+!---------------------------------------------------------------------
+                if (start(3,c) .gt. 0) then
+                   do   m = 1, 5
+                      k = 1
+                      forcing(i,j,k,m,c) = forcing(i,j,k,m,c) - dssp *  &
+     &                    (5.0d0*ue(k,m) - 4.0d0*ue(k+1,m) +ue(k+2,m))
+                      k = 2
+                      forcing(i,j,k,m,c) = forcing(i,j,k,m,c) - dssp *  &
+     &                   (-4.0d0*ue(k-1,m) + 6.0d0*ue(k,m) -  &
+     &                     4.0d0*ue(k+1,m) +       ue(k+2,m))
+                   end do
+                endif
+
+                do   m = 1, 5
+                   do  k = start(3,c)*3, cell_size(3,c)-3*end(3,c)-1
+                      forcing(i,j,k,m,c) = forcing(i,j,k,m,c) - dssp*  &
+     &                   (ue(k-2,m) - 4.0d0*ue(k-1,m) +  &
+     &                    6.0d0*ue(k,m) - 4.0d0*ue(k+1,m) + ue(k+2,m))
+                   end do
+                end do
+
+                if (end(3,c) .gt. 0) then
+                   do    m = 1, 5
+                      k = cell_size(3,c)-3
+                      forcing(i,j,k,m,c) = forcing(i,j,k,m,c) - dssp *  &
+     &                   (ue(k-2,m) - 4.0d0*ue(k-1,m) +  &
+     &                    6.0d0*ue(k,m) - 4.0d0*ue(k+1,m))
+                      k = cell_size(3,c)-2
+                      forcing(i,j,k,m,c) = forcing(i,j,k,m,c) - dssp *  &
+     &                   (ue(k-2,m) - 4.0d0*ue(k-1,m) + 5.0d0*ue(k,m))
+                   end do
+                endif
+
+             end do
+          end do
+!---------------------------------------------------------------------
+! now change the sign of the forcing function
+!---------------------------------------------------------------------
+          do   m = 1, 5
+             do   k = start(3,c), cell_size(3,c)-end(3,c)-1
+                do   j = start(2,c), cell_size(2,c)-end(2,c)-1
+                   do   i = start(1,c), cell_size(1,c)-end(1,c)-1
+                      forcing(i,j,k,m,c) = -1.d0 * forcing(i,j,k,m,c)
+                   end do
+                end do
+             end do
+          end do
+
+!---------------------------------------------------------------------
+!      cell loop
+!---------------------------------------------------------------------
+       end do
+
+       return
+       end
+
+
+
+
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/SP/exact_solution.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/SP/exact_solution.f90
new file mode 100644
index 000000000..117b3be63
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/SP/exact_solution.f90
@@ -0,0 +1,31 @@
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+       subroutine exact_solution(xi,eta,zeta,dtemp)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+! this function returns the exact solution at point xi, eta, zeta  
+!---------------------------------------------------------------------
+
+       use sp_data
+       implicit none
+
+       double precision  xi, eta, zeta, dtemp(5)
+       integer m
+
+       do  m = 1, 5
+          dtemp(m) =  ce(m,1) +  &
+     &    xi*(ce(m,2) + xi*(ce(m,5) + xi*(ce(m,8) + xi*ce(m,11)))) +  &
+     &    eta*(ce(m,3) + eta*(ce(m,6) + eta*(ce(m,9) + eta*ce(m,12))))+  &
+     &    zeta*(ce(m,4) + zeta*(ce(m,7) + zeta*(ce(m,10) +  &
+     &    zeta*ce(m,13))))
+       end do
+
+       return
+       end
+
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/SP/initialize.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/SP/initialize.f90
new file mode 100644
index 000000000..2df24a4a4
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/SP/initialize.f90
@@ -0,0 +1,288 @@
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+       subroutine  initialize
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+! This subroutine initializes the field variable u using 
+! tri-linear transfinite interpolation of the boundary values     
+!---------------------------------------------------------------------
+
+       use sp_data
+       implicit none
+  
+       integer c, i, j, k, m, ii, jj, kk, ix, iy, iz
+       double precision  xi, eta, zeta, Pface(5,3,2), Pxi, Peta,  &
+     &                   Pzeta, temp(5)
+
+
+!---------------------------------------------------------------------
+!  Later (in compute_rhs) we compute 1/u for every element. A few of 
+!  the corner elements are not used, but it is convenient (and faster)
+!  to compute the whole thing with a simple loop. Make sure those 
+!  values are nonzero by initializing the whole thing here. 
+!---------------------------------------------------------------------
+      do c = 1, ncells
+         do kk = -1, IMAX
+            do jj = -1, IMAX
+               do ii = -1, IMAX
+                  u(ii, jj, kk, 1, c) = 1.0d0
+                  u(ii, jj, kk, 2, c) = 0.0d0
+                  u(ii, jj, kk, 3, c) = 0.0d0
+                  u(ii, jj, kk, 4, c) = 0.0d0
+                  u(ii, jj, kk, 5, c) = 1.0d0
+               end do
+            end do
+         end do
+      end do
+
+!---------------------------------------------------------------------
+! first store the "interpolated" values everywhere on the grid    
+!---------------------------------------------------------------------
+       do  c=1, ncells
+          kk = 0
+          do  k = cell_low(3,c), cell_high(3,c)
+             zeta = dble(k) * dnzm1
+             jj = 0
+             do  j = cell_low(2,c), cell_high(2,c)
+                eta = dble(j) * dnym1
+                ii = 0
+                do   i = cell_low(1,c), cell_high(1,c)
+                   xi = dble(i) * dnxm1
+                  
+                   do ix = 1, 2
+                      call exact_solution(dble(ix-1), eta, zeta,  &
+     &                                    Pface(1,1,ix))
+                   end do
+
+                   do    iy = 1, 2
+                      call exact_solution(xi, dble(iy-1) , zeta,  &
+     &                                    Pface(1,2,iy))
+                   end do
+
+                   do    iz = 1, 2
+                      call exact_solution(xi, eta, dble(iz-1),   &
+     &                                    Pface(1,3,iz))
+                   end do
+
+                   do   m = 1, 5
+                      Pxi   = xi   * Pface(m,1,2) +  &
+     &                        (1.0d0-xi)   * Pface(m,1,1)
+                      Peta  = eta  * Pface(m,2,2) +  &
+     &                        (1.0d0-eta)  * Pface(m,2,1)
+                      Pzeta = zeta * Pface(m,3,2) +  &
+     &                        (1.0d0-zeta) * Pface(m,3,1)
+ 
+                      u(ii,jj,kk,m,c) = Pxi + Peta + Pzeta -  &
+     &                          Pxi*Peta - Pxi*Pzeta - Peta*Pzeta +  &
+     &                          Pxi*Peta*Pzeta
+
+                   end do
+                   ii = ii + 1
+                end do
+                jj = jj + 1
+             end do
+             kk = kk+1
+          end do
+       end do
+
+!---------------------------------------------------------------------
+! now store the exact values on the boundaries        
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+! west face                                                  
+!---------------------------------------------------------------------
+       c = slice(1,1)
+       ii = 0
+       xi = 0.0d0
+       kk = 0
+       do  k = cell_low(3,c), cell_high(3,c)
+          zeta = dble(k) * dnzm1
+          jj = 0
+          do   j = cell_low(2,c), cell_high(2,c)
+             eta = dble(j) * dnym1
+             call exact_solution(xi, eta, zeta, temp)
+             do   m = 1, 5
+                u(ii,jj,kk,m,c) = temp(m)
+             end do
+             jj = jj + 1
+          end do
+          kk = kk + 1
+       end do
+
+!---------------------------------------------------------------------
+! east face                                                      
+!---------------------------------------------------------------------
+       c  = slice(1,ncells)
+       ii = cell_size(1,c)-1
+       xi = 1.0d0
+       kk = 0
+       do   k = cell_low(3,c), cell_high(3,c)
+          zeta = dble(k) * dnzm1
+          jj = 0
+          do   j = cell_low(2,c), cell_high(2,c)
+             eta = dble(j) * dnym1
+             call exact_solution(xi, eta, zeta, temp)
+             do   m = 1, 5
+                u(ii,jj,kk,m,c) = temp(m)
+             end do
+             jj = jj + 1
+          end do
+          kk = kk + 1
+       end do
+
+!---------------------------------------------------------------------
+! south face                                                 
+!---------------------------------------------------------------------
+       c = slice(2,1)
+       jj = 0
+       eta = 0.0d0
+       kk = 0
+       do  k = cell_low(3,c), cell_high(3,c)
+          zeta = dble(k) * dnzm1
+          ii = 0
+          do   i = cell_low(1,c), cell_high(1,c)
+             xi = dble(i) * dnxm1
+             call exact_solution(xi, eta, zeta, temp)
+             do   m = 1, 5
+                u(ii,jj,kk,m,c) = temp(m)
+             end do
+             ii = ii + 1
+          end do
+          kk = kk + 1
+       end do
+
+
+!---------------------------------------------------------------------
+! north face                                    
+!---------------------------------------------------------------------
+       c = slice(2,ncells)
+       jj = cell_size(2,c)-1
+       eta = 1.0d0
+       kk = 0
+       do   k = cell_low(3,c), cell_high(3,c)
+          zeta = dble(k) * dnzm1
+          ii = 0
+          do   i = cell_low(1,c), cell_high(1,c)
+             xi = dble(i) * dnxm1
+             call exact_solution(xi, eta, zeta, temp)
+             do   m = 1, 5
+                u(ii,jj,kk,m,c) = temp(m)
+             end do
+             ii = ii + 1
+          end do
+          kk = kk + 1
+       end do
+
+!---------------------------------------------------------------------
+! bottom face                                       
+!---------------------------------------------------------------------
+       c = slice(3,1)
+       kk = 0
+       zeta = 0.0d0
+       jj = 0
+       do   j = cell_low(2,c), cell_high(2,c)
+          eta = dble(j) * dnym1
+          ii = 0
+          do   i =cell_low(1,c), cell_high(1,c)
+             xi = dble(i) *dnxm1
+             call exact_solution(xi, eta, zeta, temp)
+             do   m = 1, 5
+                u(ii,jj,kk,m,c) = temp(m)
+             end do
+             ii = ii + 1
+          end do
+          jj = jj + 1
+       end do
+
+!---------------------------------------------------------------------
+! top face     
+!---------------------------------------------------------------------
+       c = slice(3,ncells)
+       kk = cell_size(3,c)-1
+       zeta = 1.0d0
+       jj = 0
+       do   j = cell_low(2,c), cell_high(2,c)
+          eta = dble(j) * dnym1
+          ii = 0
+          do   i =cell_low(1,c), cell_high(1,c)
+             xi = dble(i) * dnxm1
+             call exact_solution(xi, eta, zeta, temp)
+             do   m = 1, 5
+                u(ii,jj,kk,m,c) = temp(m)
+             end do
+             ii = ii + 1
+          end do
+          jj = jj + 1
+       end do
+
+       return
+       end
+
+
+       subroutine lhsinit
+
+       use sp_data
+       implicit none
+       
+       integer i, j, k, d, c, n
+
+!---------------------------------------------------------------------
+! loop over all cells                                       
+!---------------------------------------------------------------------
+       do  c = 1, ncells
+
+!---------------------------------------------------------------------
+!         first, initialize the start and end arrays
+!---------------------------------------------------------------------
+          do  d = 1, 3
+             if (cell_coord(d,c) .eq. 1) then
+                start(d,c) = 1
+             else 
+                start(d,c) = 0
+             endif
+             if (cell_coord(d,c) .eq. ncells) then
+                end(d,c) = 1
+             else
+                end(d,c) = 0
+             endif
+          end do
+
+!---------------------------------------------------------------------
+!     zap the whole left hand side for starters
+!---------------------------------------------------------------------
+          do  n = 1, 15
+             do  k = 0, cell_size(3,c)-1
+                do  j = 0, cell_size(2,c)-1
+                   do  i = 0, cell_size(1,c)-1
+                      lhs(i,j,k,n,c) = 0.0d0
+                   end do
+                end do
+             end do
+          end do
+
+!---------------------------------------------------------------------
+! next, set all diagonal values to 1. This is overkill, but convenient
+!---------------------------------------------------------------------
+          do   n = 1, 3
+             do   k = 0, cell_size(3,c)-1
+                do   j = 0, cell_size(2,c)-1
+                   do   i = 0, cell_size(1,c)-1
+                      lhs(i,j,k,5*n-2,c) = 1.0d0
+                   end do
+                end do
+             end do
+          end do
+
+       end do
+
+      return
+      end
+
+
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/SP/inputsp.data.sample b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/SP/inputsp.data.sample
new file mode 100644
index 000000000..ae3801fdb
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/SP/inputsp.data.sample
@@ -0,0 +1,3 @@
+400       number of time steps
+0.0015d0  dt for class A = 0.0015d0. class B = 0.001d0  class C = 0.00067d0
+64 64 64
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/SP/lhsx.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/SP/lhsx.f90
new file mode 100644
index 000000000..fbda8a8b3
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/SP/lhsx.f90
@@ -0,0 +1,125 @@
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+       subroutine lhsx(c)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+! This function computes the left hand side for the three x-factors  
+!---------------------------------------------------------------------
+
+       use sp_data
+       implicit none
+
+       double precision ru1
+       integer          i, j, k, c
+
+
+!---------------------------------------------------------------------
+!      treat only cell c             
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!      first fill the lhs for the u-eigenvalue                   
+!---------------------------------------------------------------------
+       do  k = start(3,c), cell_size(3,c)-end(3,c)-1
+          do  j = start(2,c), cell_size(2,c)-end(2,c)-1
+             do  i = start(1,c)-1, cell_size(1,c)-end(1,c)
+                ru1 = c3c4*rho_i(i,j,k,c)
+                cv(i) = us(i,j,k,c)
+                rhon(i) = dmax1(dx2+con43*ru1,  &
+     &                          dx5+c1c5*ru1,  &
+     &                          dxmax+ru1,  &
+     &                          dx1)
+             end do
+
+             do  i = start(1,c), cell_size(1,c)-end(1,c)-1
+                lhs(i,j,k,1,c) =   0.0d0
+                lhs(i,j,k,2,c) = - dttx2 * cv(i-1) - dttx1 * rhon(i-1)
+                lhs(i,j,k,3,c) =   1.0d0 + c2dttx1 * rhon(i)
+                lhs(i,j,k,4,c) =   dttx2 * cv(i+1) - dttx1 * rhon(i+1)
+                lhs(i,j,k,5,c) =   0.0d0
+             end do
+          end do
+       end do
+
+!---------------------------------------------------------------------
+!      add fourth order dissipation                             
+!---------------------------------------------------------------------
+       if (start(1,c) .gt. 0) then
+          i = 1
+          do   k = start(3,c), cell_size(3,c)-end(3,c)-1
+             do   j = start(2,c), cell_size(2,c)-end(2,c)-1
+                lhs(i,j,k,3,c) = lhs(i,j,k,3,c) + comz5
+                lhs(i,j,k,4,c) = lhs(i,j,k,4,c) - comz4
+                lhs(i,j,k,5,c) = lhs(i,j,k,5,c) + comz1
+  
+                lhs(i+1,j,k,2,c) = lhs(i+1,j,k,2,c) - comz4
+                lhs(i+1,j,k,3,c) = lhs(i+1,j,k,3,c) + comz6
+                lhs(i+1,j,k,4,c) = lhs(i+1,j,k,4,c) - comz4
+                lhs(i+1,j,k,5,c) = lhs(i+1,j,k,5,c) + comz1
+             end do
+          end do
+       endif
+
+       do   k = start(3,c), cell_size(3,c)-end(3,c)-1
+          do   j = start(2,c), cell_size(2,c)-end(2,c)-1
+             do   i=3*start(1,c), cell_size(1,c)-3*end(1,c)-1
+                lhs(i,j,k,1,c) = lhs(i,j,k,1,c) + comz1
+                lhs(i,j,k,2,c) = lhs(i,j,k,2,c) - comz4
+                lhs(i,j,k,3,c) = lhs(i,j,k,3,c) + comz6
+                lhs(i,j,k,4,c) = lhs(i,j,k,4,c) - comz4
+                lhs(i,j,k,5,c) = lhs(i,j,k,5,c) + comz1
+             end do
+          end do
+       end do
+
+       if (end(1,c) .gt. 0) then
+          i = cell_size(1,c)-3
+          do   k = start(3,c), cell_size(3,c)-end(3,c)-1
+             do   j = start(2,c), cell_size(2,c)-end(2,c)-1
+                lhs(i,j,k,1,c) = lhs(i,j,k,1,c) + comz1
+                lhs(i,j,k,2,c) = lhs(i,j,k,2,c) - comz4
+                lhs(i,j,k,3,c) = lhs(i,j,k,3,c) + comz6
+                lhs(i,j,k,4,c) = lhs(i,j,k,4,c) - comz4
+
+                lhs(i+1,j,k,1,c) = lhs(i+1,j,k,1,c) + comz1
+                lhs(i+1,j,k,2,c) = lhs(i+1,j,k,2,c) - comz4
+                lhs(i+1,j,k,3,c) = lhs(i+1,j,k,3,c) + comz5
+             end do
+          end do
+       endif
+
+!---------------------------------------------------------------------
+!      subsequently, fill the other factors (u+c), (u-c) by adding to
+!      the first
+!---------------------------------------------------------------------
+       do   k = start(3,c), cell_size(3,c)-end(3,c)-1
+          do   j = start(2,c), cell_size(2,c)-end(2,c)-1
+             do   i = start(1,c), cell_size(1,c)-end(1,c)-1
+                lhs(i,j,k,1+5,c)  = lhs(i,j,k,1,c)
+                lhs(i,j,k,2+5,c)  = lhs(i,j,k,2,c) -  &
+     &                            dttx2 * speed(i-1,j,k,c)
+                lhs(i,j,k,3+5,c)  = lhs(i,j,k,3,c)
+                lhs(i,j,k,4+5,c)  = lhs(i,j,k,4,c) +  &
+     &                            dttx2 * speed(i+1,j,k,c)
+                lhs(i,j,k,5+5,c) = lhs(i,j,k,5,c)
+                lhs(i,j,k,1+10,c) = lhs(i,j,k,1,c)
+                lhs(i,j,k,2+10,c) = lhs(i,j,k,2,c) +  &
+     &                            dttx2 * speed(i-1,j,k,c)
+                lhs(i,j,k,3+10,c) = lhs(i,j,k,3,c)
+                lhs(i,j,k,4+10,c) = lhs(i,j,k,4,c) -  &
+     &                            dttx2 * speed(i+1,j,k,c)
+                lhs(i,j,k,5+10,c) = lhs(i,j,k,5,c)
+             end do
+          end do
+       end do
+
+       return
+       end
+
+
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/SP/lhsy.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/SP/lhsy.f90
new file mode 100644
index 000000000..8ae5dd92b
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/SP/lhsy.f90
@@ -0,0 +1,126 @@
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+       subroutine lhsy(c)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+! This function computes the left hand side for the three y-factors   
+!---------------------------------------------------------------------
+
+       use sp_data
+       implicit none
+
+       double precision ru1
+       integer          i, j, k, c
+
+!---------------------------------------------------------------------
+!      treat only cell c
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!      first fill the lhs for the u-eigenvalue         
+!---------------------------------------------------------------------
+       do  k = start(3,c), cell_size(3,c)-end(3,c)-1
+          do  i = start(1,c), cell_size(1,c)-end(1,c)-1
+
+             do  j = start(2,c)-1, cell_size(2,c)-end(2,c)
+                ru1 = c3c4*rho_i(i,j,k,c)
+                cv(j) = vs(i,j,k,c)
+                rhoq(j) = dmax1( dy3 + con43 * ru1,  &
+     &                           dy5 + c1c5*ru1,  &
+     &                           dymax + ru1,  &
+     &                           dy1)
+             end do
+            
+             do  j = start(2,c), cell_size(2,c)-end(2,c)-1
+                lhs(i,j,k,1,c) =  0.0d0
+                lhs(i,j,k,2,c) = -dtty2 * cv(j-1) - dtty1 * rhoq(j-1)
+                lhs(i,j,k,3,c) =  1.0d0 + c2dtty1 * rhoq(j)
+                lhs(i,j,k,4,c) =  dtty2 * cv(j+1) - dtty1 * rhoq(j+1)
+                lhs(i,j,k,5,c) =  0.0d0
+             end do
+          end do
+       end do
+
+!---------------------------------------------------------------------
+!      add fourth order dissipation                             
+!---------------------------------------------------------------------
+       if (start(2,c) .gt. 0) then
+          j = 1
+          do   k = start(3,c), cell_size(3,c)-end(3,c)-1
+             do   i = start(1,c), cell_size(1,c)-end(1,c)-1
+
+                lhs(i,j,k,3,c) = lhs(i,j,k,3,c) + comz5
+                lhs(i,j,k,4,c) = lhs(i,j,k,4,c) - comz4
+                lhs(i,j,k,5,c) = lhs(i,j,k,5,c) + comz1
+       
+                lhs(i,j+1,k,2,c) = lhs(i,j+1,k,2,c) - comz4
+                lhs(i,j+1,k,3,c) = lhs(i,j+1,k,3,c) + comz6
+                lhs(i,j+1,k,4,c) = lhs(i,j+1,k,4,c) - comz4
+                lhs(i,j+1,k,5,c) = lhs(i,j+1,k,5,c) + comz1
+             end do
+          end do
+       endif
+
+       do   k = start(3,c), cell_size(3,c)-end(3,c)-1
+          do   j=3*start(2,c), cell_size(2,c)-3*end(2,c)-1
+             do   i = start(1,c), cell_size(1,c)-end(1,c)-1
+
+                lhs(i,j,k,1,c) = lhs(i,j,k,1,c) + comz1
+                lhs(i,j,k,2,c) = lhs(i,j,k,2,c) - comz4
+                lhs(i,j,k,3,c) = lhs(i,j,k,3,c) + comz6
+                lhs(i,j,k,4,c) = lhs(i,j,k,4,c) - comz4
+                lhs(i,j,k,5,c) = lhs(i,j,k,5,c) + comz1
+             end do
+          end do
+       end do
+
+       if (end(2,c) .gt. 0) then
+          j = cell_size(2,c)-3
+          do   k = start(3,c), cell_size(3,c)-end(3,c)-1
+             do   i = start(1,c), cell_size(1,c)-end(1,c)-1
+                lhs(i,j,k,1,c) = lhs(i,j,k,1,c) + comz1
+                lhs(i,j,k,2,c) = lhs(i,j,k,2,c) - comz4
+                lhs(i,j,k,3,c) = lhs(i,j,k,3,c) + comz6
+                lhs(i,j,k,4,c) = lhs(i,j,k,4,c) - comz4
+
+                lhs(i,j+1,k,1,c) = lhs(i,j+1,k,1,c) + comz1
+                lhs(i,j+1,k,2,c) = lhs(i,j+1,k,2,c) - comz4
+                lhs(i,j+1,k,3,c) = lhs(i,j+1,k,3,c) + comz5
+             end do
+          end do
+       endif
+
+!---------------------------------------------------------------------
+!      subsequently, do the other two factors                    
+!---------------------------------------------------------------------
+       do    k = start(3,c), cell_size(3,c)-end(3,c)-1
+          do    j = start(2,c), cell_size(2,c)-end(2,c)-1
+             do    i = start(1,c), cell_size(1,c)-end(1,c)-1
+                lhs(i,j,k,1+5,c)  = lhs(i,j,k,1,c)
+                lhs(i,j,k,2+5,c)  = lhs(i,j,k,2,c) -  &
+     &                            dtty2 * speed(i,j-1,k,c)
+                lhs(i,j,k,3+5,c)  = lhs(i,j,k,3,c)
+                lhs(i,j,k,4+5,c)  = lhs(i,j,k,4,c) +  &
+     &                            dtty2 * speed(i,j+1,k,c)
+                lhs(i,j,k,5+5,c) = lhs(i,j,k,5,c)
+                lhs(i,j,k,1+10,c) = lhs(i,j,k,1,c)
+                lhs(i,j,k,2+10,c) = lhs(i,j,k,2,c) +  &
+     &                            dtty2 * speed(i,j-1,k,c)
+                lhs(i,j,k,3+10,c) = lhs(i,j,k,3,c)
+                lhs(i,j,k,4+10,c) = lhs(i,j,k,4,c) -  &
+     &                            dtty2 * speed(i,j+1,k,c)
+                lhs(i,j,k,5+10,c) = lhs(i,j,k,5,c)
+             end do
+          end do
+       end do
+
+       return
+       end
+
+
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/SP/lhsz.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/SP/lhsz.f90
new file mode 100644
index 000000000..a843e0280
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/SP/lhsz.f90
@@ -0,0 +1,124 @@
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+       subroutine lhsz(c)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+! This function computes the left hand side for the three z-factors   
+!---------------------------------------------------------------------
+
+       use sp_data
+       implicit none
+
+       double precision ru1
+       integer i, j, k, c
+
+!---------------------------------------------------------------------
+!      treat only cell c                                         
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+! first fill the lhs for the u-eigenvalue                          
+!---------------------------------------------------------------------
+       do   j = start(2,c), cell_size(2,c)-end(2,c)-1
+          do   i = start(1,c), cell_size(1,c)-end(1,c)-1
+
+             do   k = start(3,c)-1, cell_size(3,c)-end(3,c)
+                ru1 = c3c4*rho_i(i,j,k,c)
+                cv(k) = ws(i,j,k,c)
+                rhos(k) = dmax1(dz4 + con43 * ru1,  &
+     &                          dz5 + c1c5 * ru1,  &
+     &                          dzmax + ru1,  &
+     &                          dz1)
+             end do
+
+             do   k =  start(3,c), cell_size(3,c)-end(3,c)-1
+                lhs(i,j,k,1,c) =  0.0d0
+                lhs(i,j,k,2,c) = -dttz2 * cv(k-1) - dttz1 * rhos(k-1)
+                lhs(i,j,k,3,c) =  1.0d0 + c2dttz1 * rhos(k)
+                lhs(i,j,k,4,c) =  dttz2 * cv(k+1) - dttz1 * rhos(k+1)
+                lhs(i,j,k,5,c) =  0.0d0
+             end do
+          end do
+       end do
+
+!---------------------------------------------------------------------
+!      add fourth order dissipation                                  
+!---------------------------------------------------------------------
+       if (start(3,c) .gt. 0) then
+          k = 1
+          do    j = start(2,c), cell_size(2,c)-end(2,c)-1
+             do    i = start(1,c), cell_size(1,c)-end(1,c)-1
+                lhs(i,j,k,3,c) = lhs(i,j,k,3,c) + comz5
+                lhs(i,j,k,4,c) = lhs(i,j,k,4,c) - comz4
+                lhs(i,j,k,5,c) = lhs(i,j,k,5,c) + comz1
+
+                lhs(i,j,k+1,2,c) = lhs(i,j,k+1,2,c) - comz4
+                lhs(i,j,k+1,3,c) = lhs(i,j,k+1,3,c) + comz6
+                lhs(i,j,k+1,4,c) = lhs(i,j,k+1,4,c) - comz4
+                lhs(i,j,k+1,5,c) = lhs(i,j,k+1,5,c) + comz1
+             end do
+          end do
+       endif
+
+       do    k = 3*start(3,c), cell_size(3,c)-3*end(3,c)-1
+          do    j = start(2,c), cell_size(2,c)-end(2,c)-1
+             do    i = start(1,c), cell_size(1,c)-end(1,c)-1
+                lhs(i,j,k,1,c) = lhs(i,j,k,1,c) + comz1
+                lhs(i,j,k,2,c) = lhs(i,j,k,2,c) - comz4
+                lhs(i,j,k,3,c) = lhs(i,j,k,3,c) + comz6
+                lhs(i,j,k,4,c) = lhs(i,j,k,4,c) - comz4
+                lhs(i,j,k,5,c) = lhs(i,j,k,5,c) + comz1
+             end do
+          end do
+       end do
+
+       if (end(3,c) .gt. 0) then
+          k = cell_size(3,c)-3 
+          do    j = start(2,c), cell_size(2,c)-end(2,c)-1
+             do    i = start(1,c), cell_size(1,c)-end(1,c)-1
+                lhs(i,j,k,1,c) = lhs(i,j,k,1,c) + comz1
+                lhs(i,j,k,2,c) = lhs(i,j,k,2,c) - comz4
+                lhs(i,j,k,3,c) = lhs(i,j,k,3,c) + comz6
+                lhs(i,j,k,4,c) = lhs(i,j,k,4,c) - comz4
+
+                lhs(i,j,k+1,1,c) = lhs(i,j,k+1,1,c) + comz1
+                lhs(i,j,k+1,2,c) = lhs(i,j,k+1,2,c) - comz4
+                lhs(i,j,k+1,3,c) = lhs(i,j,k+1,3,c) + comz5
+             end do
+          end do
+       endif
+
+
+!---------------------------------------------------------------------
+!      subsequently, fill the other factors (u+c), (u-c) 
+!---------------------------------------------------------------------
+       do    k = start(3,c), cell_size(3,c)-end(3,c)-1
+          do    j = start(2,c), cell_size(2,c)-end(2,c)-1
+             do    i = start(1,c), cell_size(1,c)-end(1,c)-1
+                lhs(i,j,k,1+5,c)  = lhs(i,j,k,1,c)
+                lhs(i,j,k,2+5,c)  = lhs(i,j,k,2,c) -  &
+     &                            dttz2 * speed(i,j,k-1,c)
+                lhs(i,j,k,3+5,c)  = lhs(i,j,k,3,c)
+                lhs(i,j,k,4+5,c)  = lhs(i,j,k,4,c) +  &
+     &                            dttz2 * speed(i,j,k+1,c)
+                lhs(i,j,k,5+5,c) = lhs(i,j,k,5,c)
+                lhs(i,j,k,1+10,c) = lhs(i,j,k,1,c)
+                lhs(i,j,k,2+10,c) = lhs(i,j,k,2,c) +  &
+     &                            dttz2 * speed(i,j,k-1,c)
+                lhs(i,j,k,3+10,c) = lhs(i,j,k,3,c)
+                lhs(i,j,k,4+10,c) = lhs(i,j,k,4,c) -  &
+     &                            dttz2 * speed(i,j,k+1,c)
+                lhs(i,j,k,5+10,c) = lhs(i,j,k,5,c)
+             end do
+          end do
+       end do
+
+       return
+       end
+
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/SP/make_set.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/SP/make_set.f90
new file mode 100644
index 000000000..888d300b1
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/SP/make_set.f90
@@ -0,0 +1,123 @@
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+       subroutine make_set
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+! This function allocates space for a set of cells and fills the set     
+! such that communication between cells on different nodes is only
+! nearest neighbor                                                   
+!---------------------------------------------------------------------
+
+       use sp_data
+       use mpinpb
+
+       implicit none
+
+       integer p, i, j, c, dir, size, excess, ierr,ierrcode
+
+!---------------------------------------------------------------------
+!     compute square root; add small number to allow for roundoff
+!     (note: this is computed in setup_mpi.f also, but prefer to do
+!     it twice because of some include file problems).
+!---------------------------------------------------------------------
+      ncells = dint(dsqrt(dble(no_nodes) + 0.00001d0))
+
+!---------------------------------------------------------------------
+!      this makes coding easier
+!---------------------------------------------------------------------
+       p = ncells
+   
+!---------------------------------------------------------------------
+!      determine the location of the cell at the bottom of the 3D 
+!      array of cells
+!---------------------------------------------------------------------
+       cell_coord(1,1) = mod(node,p) 
+       cell_coord(2,1) = node/p 
+       cell_coord(3,1) = 0
+
+!---------------------------------------------------------------------
+!      set the cell_coords for cells in the rest of the z-layers; 
+!      this comes down to a simple linear numbering in the
+!      z-direction, and to the doubly-cyclic numbering in the other dirs
+!---------------------------------------------------------------------
+       do    c=2, p
+          cell_coord(1,c) = mod(cell_coord(1,c-1)+1,p) 
+          cell_coord(2,c) = mod(cell_coord(2,c-1)-1+p,p) 
+          cell_coord(3,c) = c-1
+       end do
+
+!---------------------------------------------------------------------
+!      offset all the coordinates by 1 to adjust for Fortran arrays
+!---------------------------------------------------------------------
+       do    dir = 1, 3
+          do    c = 1, p
+             cell_coord(dir,c) = cell_coord(dir,c) + 1
+          end do
+       end do
+   
+!---------------------------------------------------------------------
+!      slice(dir,n) contains the sequence number of the cell that is in
+!      coordinate plane n in the dir direction
+!---------------------------------------------------------------------
+       do   dir = 1, 3
+          do   c = 1, p
+             slice(dir,cell_coord(dir,c)) = c
+          end do
+       end do
+
+
+!---------------------------------------------------------------------
+!      fill the predecessor and successor entries, using the indices 
+!      of the bottom cells (they are the same at each level of k 
+!      anyway) acting as if full periodicity pertains; note that p is
+!      added to those arguments of mod that could otherwise be
+!      negative, because mod would then return a negative result
+!---------------------------------------------------------------------
+       i = cell_coord(1,1)-1
+       j = cell_coord(2,1)-1
+
+       predecessor(1) = mod(i-1+p,p) + p*j
+       predecessor(2) = i + p*mod(j-1+p,p)
+       predecessor(3) = mod(i+1,p) + p*mod(j-1+p,p)
+       successor(1)   = mod(i+1,p) + p*j
+       successor(2)   = i + p*mod(j+1,p)
+       successor(3)   = mod(i-1+p,p) + p*mod(j+1,p)
+
+!---------------------------------------------------------------------
+! now compute the sizes of the cells                                    
+!---------------------------------------------------------------------
+       do    dir= 1, 3
+!---------------------------------------------------------------------
+!         set cell_coord range for each direction                            
+!---------------------------------------------------------------------
+          size   = grid_points(dir)/p
+          excess = mod(grid_points(dir),p)
+          do    c=1, ncells
+             if (cell_coord(dir,c) .le. excess) then
+                cell_size(dir,c) = size+1
+                cell_low(dir,c) = (cell_coord(dir,c)-1)*(size+1)
+                cell_high(dir,c) = cell_low(dir,c)+size
+             else 
+                cell_size(dir,c) = size
+                cell_low(dir,c)  = excess*(size+1)+  &
+     &                   (cell_coord(dir,c)-excess-1)*size
+                cell_high(dir,c) = cell_low(dir,c)+size-1
+             endif
+             if (cell_size(dir, c) .le. 2) then
+                write(*,50)
+ 50             format(' Error: Cell size too small. Min size is 3')
+                ierrcode = 1
+                call MPI_Abort(mpi_comm_world,ierrcode,ierr)
+                stop
+             endif
+          end do
+       end do
+
+       return
+       end
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/SP/mpinpb.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/SP/mpinpb.f90
new file mode 100644
index 000000000..9b353778d
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/SP/mpinpb.f90
@@ -0,0 +1,20 @@
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+!
+!  mpinpb module
+!
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      module mpinpb
+
+      include 'mpif.h'
+
+      integer   node, no_nodes, total_nodes, root, comm_setup,  &
+     &          comm_solve, comm_rhs, dp_type
+      logical   active
+
+      integer   DEFAULT_TAG
+      parameter (DEFAULT_TAG = 0)
+
+      end module mpinpb
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/SP/ninvr.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/SP/ninvr.f90
new file mode 100644
index 000000000..e7d2636ef
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/SP/ninvr.f90
@@ -0,0 +1,46 @@
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+       subroutine  ninvr(c)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!   block-diagonal matrix-vector multiplication              
+!---------------------------------------------------------------------
+
+       use sp_data
+       implicit none
+
+       integer  c,  i, j, k
+       double precision r1, r2, r3, r4, r5, t1, t2
+
+!---------------------------------------------------------------------
+!      treat only one cell                           
+!---------------------------------------------------------------------
+       do k = start(3,c), cell_size(3,c)-end(3,c)-1
+          do j = start(2,c), cell_size(2,c)-end(2,c)-1
+             do i = start(1,c), cell_size(1,c)-end(1,c)-1
+
+                r1 = rhs(i,j,k,1,c)
+                r2 = rhs(i,j,k,2,c)
+                r3 = rhs(i,j,k,3,c)
+                r4 = rhs(i,j,k,4,c)
+                r5 = rhs(i,j,k,5,c)
+               
+                t1 = bt * r3
+                t2 = 0.5d0 * ( r4 + r5 )
+
+                rhs(i,j,k,1,c) = -r2
+                rhs(i,j,k,2,c) =  r1
+                rhs(i,j,k,3,c) = bt * ( r4 - r5 )
+                rhs(i,j,k,4,c) = -t1 + t2
+                rhs(i,j,k,5,c) =  t1 + t2
+             enddo    
+          enddo
+       enddo
+
+       return
+       end
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/SP/pinvr.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/SP/pinvr.f90
new file mode 100644
index 000000000..e247fb0a5
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/SP/pinvr.f90
@@ -0,0 +1,49 @@
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+       subroutine pinvr(c)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!   block-diagonal matrix-vector multiplication                       
+!---------------------------------------------------------------------
+
+       use sp_data
+       implicit none
+
+       integer i, j, k, c
+       double precision r1, r2, r3, r4, r5, t1, t2
+
+!---------------------------------------------------------------------
+!      treat only one cell                                   
+!---------------------------------------------------------------------
+       do   k = start(3,c), cell_size(3,c)-end(3,c)-1
+          do   j = start(2,c), cell_size(2,c)-end(2,c)-1
+             do   i = start(1,c), cell_size(1,c)-end(1,c)-1
+
+                r1 = rhs(i,j,k,1,c)
+                r2 = rhs(i,j,k,2,c)
+                r3 = rhs(i,j,k,3,c)
+                r4 = rhs(i,j,k,4,c)
+                r5 = rhs(i,j,k,5,c)
+
+                t1 = bt * r1
+                t2 = 0.5d0 * ( r4 + r5 )
+
+                rhs(i,j,k,1,c) =  bt * ( r4 - r5 )
+                rhs(i,j,k,2,c) = -r3
+                rhs(i,j,k,3,c) =  r2
+                rhs(i,j,k,4,c) = -t1 + t2
+                rhs(i,j,k,5,c) =  t1 + t2
+             end do
+          end do
+       end do
+
+       return
+       end
+
+
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/SP/rhs.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/SP/rhs.f90
new file mode 100644
index 000000000..305b252b1
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/SP/rhs.f90
@@ -0,0 +1,450 @@
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+       subroutine compute_rhs
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+       use sp_data
+       implicit none
+
+       integer c, i, j, k, m
+       double precision aux, rho_inv, uijk, up1, um1, vijk, vp1, vm1,  &
+     &                  wijk, wp1, wm1
+
+
+       if (timeron) call timer_start(t_rhs)
+!---------------------------------------------------------------------
+! loop over all cells owned by this node                           
+!---------------------------------------------------------------------
+       do    c = 1, ncells
+
+!---------------------------------------------------------------------
+!         compute the reciprocal of density, and the kinetic energy, 
+!         and the speed of sound. 
+!---------------------------------------------------------------------
+
+          do    k = -1, cell_size(3,c)
+             do    j = -1, cell_size(2,c)
+                do    i = -1, cell_size(1,c)
+                   rho_inv = 1.0d0/u(i,j,k,1,c)
+                   rho_i(i,j,k,c) = rho_inv
+                   us(i,j,k,c) = u(i,j,k,2,c) * rho_inv
+                   vs(i,j,k,c) = u(i,j,k,3,c) * rho_inv
+                   ws(i,j,k,c) = u(i,j,k,4,c) * rho_inv
+                   square(i,j,k,c)     = 0.5d0* (  &
+     &                        u(i,j,k,2,c)*u(i,j,k,2,c) +  &
+     &                        u(i,j,k,3,c)*u(i,j,k,3,c) +  &
+     &                        u(i,j,k,4,c)*u(i,j,k,4,c) ) * rho_inv
+                   qs(i,j,k,c) = square(i,j,k,c) * rho_inv
+!---------------------------------------------------------------------
+!                  (don't need speed and ainv until the lhs computation)
+!---------------------------------------------------------------------
+                   aux = c1c2*rho_inv* (u(i,j,k,5,c) - square(i,j,k,c))
+                   aux = dsqrt(aux)
+                   speed(i,j,k,c) = aux
+                   ainv(i,j,k,c)  = 1.0d0/aux
+                end do
+             end do
+          end do
+
+!---------------------------------------------------------------------
+! copy the exact forcing term to the right hand side;  because 
+! this forcing term is known, we can store it on the whole of every 
+! cell,  including the boundary                   
+!---------------------------------------------------------------------
+
+          do   m = 1, 5
+             do   k = 0, cell_size(3,c)-1
+                do   j = 0, cell_size(2,c)-1
+                   do   i = 0, cell_size(1,c)-1
+                      rhs(i,j,k,m,c) = forcing(i,j,k,m,c)
+                   end do
+                end do
+             end do
+          end do
+
+
+!---------------------------------------------------------------------
+!         compute xi-direction fluxes 
+!---------------------------------------------------------------------
+          do    k = start(3,c), cell_size(3,c)-end(3,c)-1
+             do    j = start(2,c), cell_size(2,c)-end(2,c)-1
+                do    i = start(1,c), cell_size(1,c)-end(1,c)-1
+                   uijk = us(i,j,k,c)
+                   up1  = us(i+1,j,k,c)
+                   um1  = us(i-1,j,k,c)
+
+                   rhs(i,j,k,1,c) = rhs(i,j,k,1,c) + dx1tx1 *  &
+     &                    (u(i+1,j,k,1,c) - 2.0d0*u(i,j,k,1,c) +  &
+     &                     u(i-1,j,k,1,c)) -  &
+     &                    tx2 * (u(i+1,j,k,2,c) - u(i-1,j,k,2,c))
+
+                   rhs(i,j,k,2,c) = rhs(i,j,k,2,c) + dx2tx1 *  &
+     &                    (u(i+1,j,k,2,c) - 2.0d0*u(i,j,k,2,c) +  &
+     &                     u(i-1,j,k,2,c)) +  &
+     &                    xxcon2*con43 * (up1 - 2.0d0*uijk + um1) -  &
+     &                    tx2 * (u(i+1,j,k,2,c)*up1 -  &
+     &                           u(i-1,j,k,2,c)*um1 +  &
+     &                           (u(i+1,j,k,5,c)- square(i+1,j,k,c)-  &
+     &                            u(i-1,j,k,5,c)+ square(i-1,j,k,c))*  &
+     &                            c2)
+
+                   rhs(i,j,k,3,c) = rhs(i,j,k,3,c) + dx3tx1 *  &
+     &                    (u(i+1,j,k,3,c) - 2.0d0*u(i,j,k,3,c) +  &
+     &                     u(i-1,j,k,3,c)) +  &
+     &                    xxcon2 * (vs(i+1,j,k,c) - 2.0d0*vs(i,j,k,c) +  &
+     &                              vs(i-1,j,k,c)) -  &
+     &                    tx2 * (u(i+1,j,k,3,c)*up1 -  &
+     &                           u(i-1,j,k,3,c)*um1)
+
+                   rhs(i,j,k,4,c) = rhs(i,j,k,4,c) + dx4tx1 *  &
+     &                    (u(i+1,j,k,4,c) - 2.0d0*u(i,j,k,4,c) +  &
+     &                     u(i-1,j,k,4,c)) +  &
+     &                    xxcon2 * (ws(i+1,j,k,c) - 2.0d0*ws(i,j,k,c) +  &
+     &                              ws(i-1,j,k,c)) -  &
+     &                    tx2 * (u(i+1,j,k,4,c)*up1 -  &
+     &                           u(i-1,j,k,4,c)*um1)
+
+                   rhs(i,j,k,5,c) = rhs(i,j,k,5,c) + dx5tx1 *  &
+     &                    (u(i+1,j,k,5,c) - 2.0d0*u(i,j,k,5,c) +  &
+     &                     u(i-1,j,k,5,c)) +  &
+     &                    xxcon3 * (qs(i+1,j,k,c) - 2.0d0*qs(i,j,k,c) +  &
+     &                              qs(i-1,j,k,c)) +  &
+     &                    xxcon4 * (up1*up1 -       2.0d0*uijk*uijk +  &
+     &                              um1*um1) +  &
+     &                    xxcon5 * (u(i+1,j,k,5,c)*rho_i(i+1,j,k,c) -  &
+     &                              2.0d0*u(i,j,k,5,c)*rho_i(i,j,k,c) +  &
+     &                              u(i-1,j,k,5,c)*rho_i(i-1,j,k,c)) -  &
+     &                    tx2 * ( (c1*u(i+1,j,k,5,c) -  &
+     &                             c2*square(i+1,j,k,c))*up1 -  &
+     &                            (c1*u(i-1,j,k,5,c) -  &
+     &                             c2*square(i-1,j,k,c))*um1 )
+                end do
+             end do
+          end do
+
+!---------------------------------------------------------------------
+!         add fourth order xi-direction dissipation               
+!---------------------------------------------------------------------
+          if (start(1,c) .gt. 0) then
+             i = 1
+             do    m = 1, 5
+                do    k = start(3,c), cell_size(3,c)-end(3,c)-1
+                   do    j = start(2,c), cell_size(2,c)-end(2,c)-1
+                      rhs(i,j,k,m,c) = rhs(i,j,k,m,c)- dssp *  &
+     &                    ( 5.0d0*u(i,j,k,m,c) - 4.0d0*u(i+1,j,k,m,c) +  &
+     &                            u(i+2,j,k,m,c))
+                   end do
+                end do
+             end do
+
+             i = 2
+             do    m = 1, 5
+                do    k = start(3,c), cell_size(3,c)-end(3,c)-1
+                   do    j = start(2,c), cell_size(2,c)-end(2,c)-1
+                      rhs(i,j,k,m,c) = rhs(i,j,k,m,c) - dssp *  &
+     &                    (-4.0d0*u(i-1,j,k,m,c) + 6.0d0*u(i,j,k,m,c) -  &
+     &                      4.0d0*u(i+1,j,k,m,c) + u(i+2,j,k,m,c))
+                   end do
+                end do
+             end do
+          endif
+
+          do     m = 1, 5
+             do     k = start(3,c), cell_size(3,c)-end(3,c)-1
+                do     j = start(2,c), cell_size(2,c)-end(2,c)-1
+                   do  i = 3*start(1,c),cell_size(1,c)-3*end(1,c)-1
+                      rhs(i,j,k,m,c) = rhs(i,j,k,m,c) - dssp *  &
+     &                    (  u(i-2,j,k,m,c) - 4.0d0*u(i-1,j,k,m,c) +  &
+     &                     6.0d0*u(i,j,k,m,c) - 4.0d0*u(i+1,j,k,m,c) +  &
+     &                         u(i+2,j,k,m,c) )
+                   end do
+                end do
+             end do
+          end do
+ 
+
+          if (end(1,c) .gt. 0) then
+             i = cell_size(1,c)-3
+             do     m = 1, 5
+                do     k = start(3,c), cell_size(3,c)-end(3,c)-1
+                   do     j = start(2,c), cell_size(2,c)-end(2,c)-1
+                      rhs(i,j,k,m,c) = rhs(i,j,k,m,c) - dssp *  &
+     &                    ( u(i-2,j,k,m,c) - 4.0d0*u(i-1,j,k,m,c) +  &
+     &                      6.0d0*u(i,j,k,m,c) - 4.0d0*u(i+1,j,k,m,c) )
+                   end do
+                end do
+             end do
+
+             i = cell_size(1,c)-2
+             do     m = 1, 5
+                do     k = start(3,c), cell_size(3,c)-end(3,c)-1
+                   do     j = start(2,c), cell_size(2,c)-end(2,c)-1
+                      rhs(i,j,k,m,c) = rhs(i,j,k,m,c) - dssp *  &
+     &                    ( u(i-2,j,k,m,c) - 4.d0*u(i-1,j,k,m,c) +  &
+     &                      5.d0*u(i,j,k,m,c) )
+                   end do
+                end do
+             end do
+          endif
+
+!---------------------------------------------------------------------
+!         compute eta-direction fluxes 
+!---------------------------------------------------------------------
+          do     k = start(3,c), cell_size(3,c)-end(3,c)-1
+             do     j = start(2,c), cell_size(2,c)-end(2,c)-1
+                do     i = start(1,c), cell_size(1,c)-end(1,c)-1
+                   vijk = vs(i,j,k,c)
+                   vp1  = vs(i,j+1,k,c)
+                   vm1  = vs(i,j-1,k,c)
+                   rhs(i,j,k,1,c) = rhs(i,j,k,1,c) + dy1ty1 *  &
+     &                   (u(i,j+1,k,1,c) - 2.0d0*u(i,j,k,1,c) +  &
+     &                    u(i,j-1,k,1,c)) -  &
+     &                   ty2 * (u(i,j+1,k,3,c) - u(i,j-1,k,3,c))
+                   rhs(i,j,k,2,c) = rhs(i,j,k,2,c) + dy2ty1 *  &
+     &                   (u(i,j+1,k,2,c) - 2.0d0*u(i,j,k,2,c) +  &
+     &                    u(i,j-1,k,2,c)) +  &
+     &                   yycon2 * (us(i,j+1,k,c) - 2.0d0*us(i,j,k,c) +  &
+     &                             us(i,j-1,k,c)) -  &
+     &                   ty2 * (u(i,j+1,k,2,c)*vp1 -  &
+     &                          u(i,j-1,k,2,c)*vm1)
+                   rhs(i,j,k,3,c) = rhs(i,j,k,3,c) + dy3ty1 *  &
+     &                   (u(i,j+1,k,3,c) - 2.0d0*u(i,j,k,3,c) +  &
+     &                    u(i,j-1,k,3,c)) +  &
+     &                   yycon2*con43 * (vp1 - 2.0d0*vijk + vm1) -  &
+     &                   ty2 * (u(i,j+1,k,3,c)*vp1 -  &
+     &                          u(i,j-1,k,3,c)*vm1 +  &
+     &                          (u(i,j+1,k,5,c) - square(i,j+1,k,c) -  &
+     &                           u(i,j-1,k,5,c) + square(i,j-1,k,c))  &
+     &                          *c2)
+                   rhs(i,j,k,4,c) = rhs(i,j,k,4,c) + dy4ty1 *  &
+     &                   (u(i,j+1,k,4,c) - 2.0d0*u(i,j,k,4,c) +  &
+     &                    u(i,j-1,k,4,c)) +  &
+     &                   yycon2 * (ws(i,j+1,k,c) - 2.0d0*ws(i,j,k,c) +  &
+     &                             ws(i,j-1,k,c)) -  &
+     &                   ty2 * (u(i,j+1,k,4,c)*vp1 -  &
+     &                          u(i,j-1,k,4,c)*vm1)
+                   rhs(i,j,k,5,c) = rhs(i,j,k,5,c) + dy5ty1 *  &
+     &                   (u(i,j+1,k,5,c) - 2.0d0*u(i,j,k,5,c) +  &
+     &                    u(i,j-1,k,5,c)) +  &
+     &                   yycon3 * (qs(i,j+1,k,c) - 2.0d0*qs(i,j,k,c) +  &
+     &                             qs(i,j-1,k,c)) +  &
+     &                   yycon4 * (vp1*vp1       - 2.0d0*vijk*vijk +  &
+     &                             vm1*vm1) +  &
+     &                   yycon5 * (u(i,j+1,k,5,c)*rho_i(i,j+1,k,c) -  &
+     &                             2.0d0*u(i,j,k,5,c)*rho_i(i,j,k,c) +  &
+     &                             u(i,j-1,k,5,c)*rho_i(i,j-1,k,c)) -  &
+     &                   ty2 * ((c1*u(i,j+1,k,5,c) -  &
+     &                           c2*square(i,j+1,k,c)) * vp1 -  &
+     &                          (c1*u(i,j-1,k,5,c) -  &
+     &                           c2*square(i,j-1,k,c)) * vm1)
+                end do
+             end do
+          end do
+
+!---------------------------------------------------------------------
+!         add fourth order eta-direction dissipation         
+!---------------------------------------------------------------------
+          if (start(2,c) .gt. 0) then
+             j = 1
+             do     m = 1, 5
+                do     k = start(3,c), cell_size(3,c)-end(3,c)-1
+                   do     i = start(1,c), cell_size(1,c)-end(1,c)-1
+                      rhs(i,j,k,m,c) = rhs(i,j,k,m,c)- dssp *  &
+     &                    ( 5.0d0*u(i,j,k,m,c) - 4.0d0*u(i,j+1,k,m,c) +  &
+     &                            u(i,j+2,k,m,c))
+                   end do
+                end do
+             end do
+
+             j = 2
+             do     m = 1, 5
+                do     k = start(3,c), cell_size(3,c)-end(3,c)-1
+                   do     i = start(1,c), cell_size(1,c)-end(1,c)-1
+                      rhs(i,j,k,m,c) = rhs(i,j,k,m,c) - dssp *  &
+     &                    (-4.0d0*u(i,j-1,k,m,c) + 6.0d0*u(i,j,k,m,c) -  &
+     &                      4.0d0*u(i,j+1,k,m,c) + u(i,j+2,k,m,c))
+                   end do
+                end do
+             end do
+          endif
+
+          do     m = 1, 5
+             do     k = start(3,c), cell_size(3,c)-end(3,c)-1
+                do    j = 3*start(2,c), cell_size(2,c)-3*end(2,c)-1
+                   do  i = start(1,c),cell_size(1,c)-end(1,c)-1
+                      rhs(i,j,k,m,c) = rhs(i,j,k,m,c) - dssp *  &
+     &                    (  u(i,j-2,k,m,c) - 4.0d0*u(i,j-1,k,m,c) +  &
+     &                   6.0d0*u(i,j,k,m,c) - 4.0d0*u(i,j+1,k,m,c) +  &
+     &                         u(i,j+2,k,m,c) )
+                   end do
+                end do
+             end do
+          end do
+ 
+          if (end(2,c) .gt. 0) then
+             j = cell_size(2,c)-3
+             do     m = 1, 5
+                do     k = start(3,c), cell_size(3,c)-end(3,c)-1
+                   do     i = start(1,c), cell_size(1,c)-end(1,c)-1
+                      rhs(i,j,k,m,c) = rhs(i,j,k,m,c) - dssp *  &
+     &                    ( u(i,j-2,k,m,c) - 4.0d0*u(i,j-1,k,m,c) +  &
+     &                      6.0d0*u(i,j,k,m,c) - 4.0d0*u(i,j+1,k,m,c) )
+                   end do
+                end do
+             end do
+
+             j = cell_size(2,c)-2
+             do     m = 1, 5
+                do     k = start(3,c), cell_size(3,c)-end(3,c)-1
+                   do     i = start(1,c), cell_size(1,c)-end(1,c)-1
+                      rhs(i,j,k,m,c) = rhs(i,j,k,m,c) - dssp *  &
+     &                    ( u(i,j-2,k,m,c) - 4.d0*u(i,j-1,k,m,c) +  &
+     &                      5.d0*u(i,j,k,m,c) )
+                   end do
+                end do
+             end do
+          endif
+
+
+!---------------------------------------------------------------------
+!         compute zeta-direction fluxes 
+!---------------------------------------------------------------------
+          do    k = start(3,c), cell_size(3,c)-end(3,c)-1
+             do     j = start(2,c), cell_size(2,c)-end(2,c)-1
+                do     i = start(1,c), cell_size(1,c)-end(1,c)-1
+                   wijk = ws(i,j,k,c)
+                   wp1  = ws(i,j,k+1,c)
+                   wm1  = ws(i,j,k-1,c)
+
+                   rhs(i,j,k,1,c) = rhs(i,j,k,1,c) + dz1tz1 *  &
+     &                   (u(i,j,k+1,1,c) - 2.0d0*u(i,j,k,1,c) +  &
+     &                    u(i,j,k-1,1,c)) -  &
+     &                   tz2 * (u(i,j,k+1,4,c) - u(i,j,k-1,4,c))
+                   rhs(i,j,k,2,c) = rhs(i,j,k,2,c) + dz2tz1 *  &
+     &                   (u(i,j,k+1,2,c) - 2.0d0*u(i,j,k,2,c) +  &
+     &                    u(i,j,k-1,2,c)) +  &
+     &                   zzcon2 * (us(i,j,k+1,c) - 2.0d0*us(i,j,k,c) +  &
+     &                             us(i,j,k-1,c)) -  &
+     &                   tz2 * (u(i,j,k+1,2,c)*wp1 -  &
+     &                          u(i,j,k-1,2,c)*wm1)
+                   rhs(i,j,k,3,c) = rhs(i,j,k,3,c) + dz3tz1 *  &
+     &                   (u(i,j,k+1,3,c) - 2.0d0*u(i,j,k,3,c) +  &
+     &                    u(i,j,k-1,3,c)) +  &
+     &                   zzcon2 * (vs(i,j,k+1,c) - 2.0d0*vs(i,j,k,c) +  &
+     &                             vs(i,j,k-1,c)) -  &
+     &                   tz2 * (u(i,j,k+1,3,c)*wp1 -  &
+     &                          u(i,j,k-1,3,c)*wm1)
+                   rhs(i,j,k,4,c) = rhs(i,j,k,4,c) + dz4tz1 *  &
+     &                   (u(i,j,k+1,4,c) - 2.0d0*u(i,j,k,4,c) +  &
+     &                    u(i,j,k-1,4,c)) +  &
+     &                   zzcon2*con43 * (wp1 - 2.0d0*wijk + wm1) -  &
+     &                   tz2 * (u(i,j,k+1,4,c)*wp1 -  &
+     &                          u(i,j,k-1,4,c)*wm1 +  &
+     &                          (u(i,j,k+1,5,c) - square(i,j,k+1,c) -  &
+     &                           u(i,j,k-1,5,c) + square(i,j,k-1,c))  &
+     &                          *c2)
+                   rhs(i,j,k,5,c) = rhs(i,j,k,5,c) + dz5tz1 *  &
+     &                   (u(i,j,k+1,5,c) - 2.0d0*u(i,j,k,5,c) +  &
+     &                    u(i,j,k-1,5,c)) +  &
+     &                   zzcon3 * (qs(i,j,k+1,c) - 2.0d0*qs(i,j,k,c) +  &
+     &                             qs(i,j,k-1,c)) +  &
+     &                   zzcon4 * (wp1*wp1 - 2.0d0*wijk*wijk +  &
+     &                             wm1*wm1) +  &
+     &                   zzcon5 * (u(i,j,k+1,5,c)*rho_i(i,j,k+1,c) -  &
+     &                             2.0d0*u(i,j,k,5,c)*rho_i(i,j,k,c) +  &
+     &                             u(i,j,k-1,5,c)*rho_i(i,j,k-1,c)) -  &
+     &                   tz2 * ( (c1*u(i,j,k+1,5,c) -  &
+     &                            c2*square(i,j,k+1,c))*wp1 -  &
+     &                           (c1*u(i,j,k-1,5,c) -  &
+     &                            c2*square(i,j,k-1,c))*wm1)
+                end do
+             end do
+          end do
+
+!---------------------------------------------------------------------
+!         add fourth order zeta-direction dissipation                
+!---------------------------------------------------------------------
+          if (start(3,c) .gt. 0) then
+             k = 1
+             do     m = 1, 5
+                do     j = start(2,c), cell_size(2,c)-end(2,c)-1
+                   do     i = start(1,c), cell_size(1,c)-end(1,c)-1
+                      rhs(i,j,k,m,c) = rhs(i,j,k,m,c)- dssp *  &
+     &                    ( 5.0d0*u(i,j,k,m,c) - 4.0d0*u(i,j,k+1,m,c) +  &
+     &                            u(i,j,k+2,m,c))
+                   end do
+                end do
+             end do
+
+             k = 2
+             do     m = 1, 5
+                do     j = start(2,c), cell_size(2,c)-end(2,c)-1
+                   do     i = start(1,c), cell_size(1,c)-end(1,c)-1
+                      rhs(i,j,k,m,c) = rhs(i,j,k,m,c) - dssp *  &
+     &                    (-4.0d0*u(i,j,k-1,m,c) + 6.0d0*u(i,j,k,m,c) -  &
+     &                      4.0d0*u(i,j,k+1,m,c) + u(i,j,k+2,m,c))
+                   end do
+                end do
+             end do
+          endif
+
+          do     m = 1, 5
+             do     k = 3*start(3,c), cell_size(3,c)-3*end(3,c)-1
+                do     j = start(2,c), cell_size(2,c)-end(2,c)-1
+                   do     i = start(1,c),cell_size(1,c)-end(1,c)-1
+                      rhs(i,j,k,m,c) = rhs(i,j,k,m,c) - dssp *  &
+     &                    (  u(i,j,k-2,m,c) - 4.0d0*u(i,j,k-1,m,c) +  &
+     &                   6.0d0*u(i,j,k,m,c) - 4.0d0*u(i,j,k+1,m,c) +  &
+     &                         u(i,j,k+2,m,c) )
+                   end do
+                end do
+             end do
+          end do
+ 
+          if (end(3,c) .gt. 0) then
+             k = cell_size(3,c)-3
+             do     m = 1, 5
+                do     j = start(2,c), cell_size(2,c)-end(2,c)-1
+                   do     i = start(1,c), cell_size(1,c)-end(1,c)-1
+                      rhs(i,j,k,m,c) = rhs(i,j,k,m,c) - dssp *  &
+     &                    ( u(i,j,k-2,m,c) - 4.0d0*u(i,j,k-1,m,c) +  &
+     &                      6.0d0*u(i,j,k,m,c) - 4.0d0*u(i,j,k+1,m,c) )
+                   end do
+                end do
+             end do
+
+             k = cell_size(3,c)-2
+             do     m = 1, 5
+                do     j = start(2,c), cell_size(2,c)-end(2,c)-1
+                   do     i = start(1,c), cell_size(1,c)-end(1,c)-1
+                      rhs(i,j,k,m,c) = rhs(i,j,k,m,c) - dssp *  &
+     &                    ( u(i,j,k-2,m,c) - 4.d0*u(i,j,k-1,m,c) +  &
+     &                      5.d0*u(i,j,k,m,c) )
+                   end do
+                end do
+             end do
+          endif
+
+          do     m = 1, 5
+             do     k = start(3,c), cell_size(3,c)-end(3,c)-1
+                do     j = start(2,c), cell_size(2,c)-end(2,c)-1
+                   do    i = start(1,c), cell_size(1,c)-end(1,c)-1
+                      rhs(i,j,k,m,c) = rhs(i,j,k,m,c) * dt
+                   end do
+                end do
+             end do
+          end do
+
+       end do
+    
+       if (timeron) call timer_stop(t_rhs)
+
+       return
+       end
+
+
+
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/SP/set_constants.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/SP/set_constants.f90
new file mode 100644
index 000000000..81820d475
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/SP/set_constants.f90
@@ -0,0 +1,204 @@
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+       subroutine  set_constants
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+       use sp_data
+       implicit none
+  
+       ce(1,1)  = 2.0d0
+       ce(1,2)  = 0.0d0
+       ce(1,3)  = 0.0d0
+       ce(1,4)  = 4.0d0
+       ce(1,5)  = 5.0d0
+       ce(1,6)  = 3.0d0
+       ce(1,7)  = 0.5d0
+       ce(1,8)  = 0.02d0
+       ce(1,9)  = 0.01d0
+       ce(1,10) = 0.03d0
+       ce(1,11) = 0.5d0
+       ce(1,12) = 0.4d0
+       ce(1,13) = 0.3d0
+ 
+       ce(2,1)  = 1.0d0
+       ce(2,2)  = 0.0d0
+       ce(2,3)  = 0.0d0
+       ce(2,4)  = 0.0d0
+       ce(2,5)  = 1.0d0
+       ce(2,6)  = 2.0d0
+       ce(2,7)  = 3.0d0
+       ce(2,8)  = 0.01d0
+       ce(2,9)  = 0.03d0
+       ce(2,10) = 0.02d0
+       ce(2,11) = 0.4d0
+       ce(2,12) = 0.3d0
+       ce(2,13) = 0.5d0
+
+       ce(3,1)  = 2.0d0
+       ce(3,2)  = 2.0d0
+       ce(3,3)  = 0.0d0
+       ce(3,4)  = 0.0d0
+       ce(3,5)  = 0.0d0
+       ce(3,6)  = 2.0d0
+       ce(3,7)  = 3.0d0
+       ce(3,8)  = 0.04d0
+       ce(3,9)  = 0.03d0
+       ce(3,10) = 0.05d0
+       ce(3,11) = 0.3d0
+       ce(3,12) = 0.5d0
+       ce(3,13) = 0.4d0
+
+       ce(4,1)  = 2.0d0
+       ce(4,2)  = 2.0d0
+       ce(4,3)  = 0.0d0
+       ce(4,4)  = 0.0d0
+       ce(4,5)  = 0.0d0
+       ce(4,6)  = 2.0d0
+       ce(4,7)  = 3.0d0
+       ce(4,8)  = 0.03d0
+       ce(4,9)  = 0.05d0
+       ce(4,10) = 0.04d0
+       ce(4,11) = 0.2d0
+       ce(4,12) = 0.1d0
+       ce(4,13) = 0.3d0
+
+       ce(5,1)  = 5.0d0
+       ce(5,2)  = 4.0d0
+       ce(5,3)  = 3.0d0
+       ce(5,4)  = 2.0d0
+       ce(5,5)  = 0.1d0
+       ce(5,6)  = 0.4d0
+       ce(5,7)  = 0.3d0
+       ce(5,8)  = 0.05d0
+       ce(5,9)  = 0.04d0
+       ce(5,10) = 0.03d0
+       ce(5,11) = 0.1d0
+       ce(5,12) = 0.3d0
+       ce(5,13) = 0.2d0
+
+       c1 = 1.4d0
+       c2 = 0.4d0
+       c3 = 0.1d0
+       c4 = 1.0d0
+       c5 = 1.4d0
+
+       bt = dsqrt(0.5d0)
+
+       dnxm1 = 1.0d0 / dble(grid_points(1)-1)
+       dnym1 = 1.0d0 / dble(grid_points(2)-1)
+       dnzm1 = 1.0d0 / dble(grid_points(3)-1)
+
+       c1c2 = c1 * c2
+       c1c5 = c1 * c5
+       c3c4 = c3 * c4
+       c1345 = c1c5 * c3c4
+
+       conz1 = (1.0d0-c1c5)
+
+       tx1 = 1.0d0 / (dnxm1 * dnxm1)
+       tx2 = 1.0d0 / (2.0d0 * dnxm1)
+       tx3 = 1.0d0 / dnxm1
+
+       ty1 = 1.0d0 / (dnym1 * dnym1)
+       ty2 = 1.0d0 / (2.0d0 * dnym1)
+       ty3 = 1.0d0 / dnym1
+ 
+       tz1 = 1.0d0 / (dnzm1 * dnzm1)
+       tz2 = 1.0d0 / (2.0d0 * dnzm1)
+       tz3 = 1.0d0 / dnzm1
+
+       dx1 = 0.75d0
+       dx2 = 0.75d0
+       dx3 = 0.75d0
+       dx4 = 0.75d0
+       dx5 = 0.75d0
+
+       dy1 = 0.75d0
+       dy2 = 0.75d0
+       dy3 = 0.75d0
+       dy4 = 0.75d0
+       dy5 = 0.75d0
+
+       dz1 = 1.0d0
+       dz2 = 1.0d0
+       dz3 = 1.0d0
+       dz4 = 1.0d0
+       dz5 = 1.0d0
+
+       dxmax = dmax1(dx3, dx4)
+       dymax = dmax1(dy2, dy4)
+       dzmax = dmax1(dz2, dz3)
+
+       dssp = 0.25d0 * dmax1(dx1, dmax1(dy1, dz1) )
+
+       c4dssp = 4.0d0 * dssp
+       c5dssp = 5.0d0 * dssp
+
+       dttx1 = dt*tx1
+       dttx2 = dt*tx2
+       dtty1 = dt*ty1
+       dtty2 = dt*ty2
+       dttz1 = dt*tz1
+       dttz2 = dt*tz2
+
+       c2dttx1 = 2.0d0*dttx1
+       c2dtty1 = 2.0d0*dtty1
+       c2dttz1 = 2.0d0*dttz1
+
+       dtdssp = dt*dssp
+
+       comz1  = dtdssp
+       comz4  = 4.0d0*dtdssp
+       comz5  = 5.0d0*dtdssp
+       comz6  = 6.0d0*dtdssp
+
+       c3c4tx3 = c3c4*tx3
+       c3c4ty3 = c3c4*ty3
+       c3c4tz3 = c3c4*tz3
+
+       dx1tx1 = dx1*tx1
+       dx2tx1 = dx2*tx1
+       dx3tx1 = dx3*tx1
+       dx4tx1 = dx4*tx1
+       dx5tx1 = dx5*tx1
+        
+       dy1ty1 = dy1*ty1
+       dy2ty1 = dy2*ty1
+       dy3ty1 = dy3*ty1
+       dy4ty1 = dy4*ty1
+       dy5ty1 = dy5*ty1
+        
+       dz1tz1 = dz1*tz1
+       dz2tz1 = dz2*tz1
+       dz3tz1 = dz3*tz1
+       dz4tz1 = dz4*tz1
+       dz5tz1 = dz5*tz1
+
+       c2iv  = 2.5d0
+       con43 = 4.0d0/3.0d0
+       con16 = 1.0d0/6.0d0
+        
+       xxcon1 = c3c4tx3*con43*tx3
+       xxcon2 = c3c4tx3*tx3
+       xxcon3 = c3c4tx3*conz1*tx3
+       xxcon4 = c3c4tx3*con16*tx3
+       xxcon5 = c3c4tx3*c1c5*tx3
+
+       yycon1 = c3c4ty3*con43*ty3
+       yycon2 = c3c4ty3*ty3
+       yycon3 = c3c4ty3*conz1*ty3
+       yycon4 = c3c4ty3*con16*ty3
+       yycon5 = c3c4ty3*c1c5*ty3
+
+       zzcon1 = c3c4tz3*con43*tz3
+       zzcon2 = c3c4tz3*tz3
+       zzcon3 = c3c4tz3*conz1*tz3
+       zzcon4 = c3c4tz3*con16*tz3
+       zzcon5 = c3c4tz3*c1c5*tz3
+
+       return
+       end
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/SP/setup_mpi.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/SP/setup_mpi.f90
new file mode 100644
index 000000000..90d91f367
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/SP/setup_mpi.f90
@@ -0,0 +1,48 @@
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine setup_mpi
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+! set up MPI stuff
+!---------------------------------------------------------------------
+
+      use sp_data
+      use mpinpb
+
+      implicit none
+
+      integer error, nc, color
+
+      call mpi_init(error)
+
+      if (.not. convertdouble) then
+         dp_type = MPI_DOUBLE_PRECISION
+      else
+         dp_type = MPI_REAL
+      endif
+
+!---------------------------------------------------------------------
+!     get a process grid that requires a square number of procs.
+!     excess ranks are marked as inactive.
+!---------------------------------------------------------------------
+      call get_active_nprocs(1, nc, maxcells, no_nodes,  &
+     &                       total_nodes, node, comm_setup, active)
+
+      if (.not. active) return
+
+      call mpi_comm_dup(comm_setup, comm_solve, error)
+      call mpi_comm_dup(comm_setup, comm_rhs, error)
+
+!---------------------------------------------------------------------
+!     let node 0 be the root for the group (there is only one)
+!---------------------------------------------------------------------
+      root = 0
+
+      return
+      end
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/SP/sp.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/SP/sp.f90
new file mode 100644
index 000000000..afb00317d
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/SP/sp.f90
@@ -0,0 +1,248 @@
+!-------------------------------------------------------------------------!
+!                                                                         !
+!        N  A  S     P A R A L L E L     B E N C H M A R K S  3.4         !
+!                                                                         !
+!                                   S P                                   !
+!                                                                         !
+!-------------------------------------------------------------------------!
+!                                                                         !
+!    This benchmark is part of the NAS Parallel Benchmark 3.4 suite.      !
+!    It is described in NAS Technical Reports 95-020 and 02-007           !
+!                                                                         !
+!    Permission to use, copy, distribute and modify this software         !
+!    for any purpose with or without fee is hereby granted.  We           !
+!    request, however, that all derived work reference the NAS            !
+!    Parallel Benchmarks 3.4. This software is provided "as is"           !
+!    without express or implied warranty.                                 !
+!                                                                         !
+!    Information on NPB 3.4, including the technical report, the          !
+!    original specifications, source code, results and information        !
+!    on how to submit new results, is available at:                       !
+!                                                                         !
+!           http://www.nas.nasa.gov/Software/NPB/                         !
+!                                                                         !
+!    Send comments or suggestions to  npb@nas.nasa.gov                    !
+!                                                                         !
+!          NAS Parallel Benchmarks Group                                  !
+!          NASA Ames Research Center                                      !
+!          Mail Stop: T27A-1                                              !
+!          Moffett Field, CA   94035-1000                                 !
+!                                                                         !
+!          E-mail:  npb@nas.nasa.gov                                      !
+!          Fax:     (650) 604-3957                                        !
+!                                                                         !
+!-------------------------------------------------------------------------!
+
+
+!---------------------------------------------------------------------
+!
+! Authors: R. F. Van der Wijngaart
+!          W. Saphir
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+       program MPSP
+!---------------------------------------------------------------------
+
+       use sp_data
+       use mpinpb
+
+       implicit none
+      
+       integer          i, niter, step, c, error, fstatus
+       external timer_read
+       double precision mflops, n3, t, tmax, timer_read
+       logical          verified
+       character        class
+       double precision tsum(t_last+2), t1(t_last+2),  &
+     &                  tming(t_last+2), tmaxg(t_last+2)
+       character        t_recs(t_last+2)*8
+
+       data t_recs/'total', 'rhs', 'xsolve', 'ysolve', 'zsolve',  &
+     &             'bpack', 'exch', 'xcomm', 'ycomm', 'zcomm',  &
+     &             ' totcomp', ' totcomm'/
+
+       call setup_mpi
+       if (.not. active) goto 999
+
+!---------------------------------------------------------------------
+!      Root node reads input file (if it exists) else takes
+!      defaults from parameters
+!---------------------------------------------------------------------
+       if (node .eq. root) then
+          
+          write(*, 1000)
+
+          call check_timer_flag( timeron )
+
+          open (unit=2,file='inputsp.data',status='old', iostat=fstatus)
+!
+          if (fstatus .eq. 0) then
+            write(*,233) 
+ 233        format(' Reading from input file inputsp.data')
+            read (2,*) niter
+            read (2,*) dt
+            read (2,*) grid_points(1), grid_points(2), grid_points(3)
+            close(2)
+          else
+            write(*,234) 
+            niter = niter_default
+            dt    = dt_default
+            grid_points(1) = problem_size
+            grid_points(2) = problem_size
+            grid_points(3) = problem_size
+          endif
+ 234      format(' No input file inputsp.data. Using compiled defaults')
+
+          call set_class(niter, class)
+
+          write(*, 1001) grid_points(1), grid_points(2), grid_points(3),  &
+     &                   class
+          write(*, 1002) niter, dt
+          write(*, 1003) total_nodes
+          if (no_nodes .ne. total_nodes) write(*, 1004) no_nodes
+          write(*, *)
+
+ 1000 format(//,' NAS Parallel Benchmarks 3.4 -- SP Benchmark',/)
+ 1001     format(' Size: ', i4, 'x', i4, 'x', i4, '  (class ', a, ')')
+ 1002     format(' Iterations: ', i4, '    dt: ', F11.7)
+ 1003     format(' Total number of processes: ', i6)
+ 1004     format(' WARNING: Number of processes is not a square number',  &
+     &           ' (', i0, ' active)')
+
+       endif
+
+       call mpi_bcast(niter, 1, MPI_INTEGER,  &
+     &                root, comm_setup, error)
+
+       call mpi_bcast(dt, 1, dp_type,  &
+     &                root, comm_setup, error)
+
+       call mpi_bcast(grid_points(1), 3, MPI_INTEGER,  &
+     &                root, comm_setup, error)
+
+       call mpi_bcast(timeron, 1, MPI_LOGICAL,  &
+     &                root, comm_setup, error)
+
+
+       call alloc_space
+
+       call make_set
+
+       do  c = 1, ncells
+          if ( (cell_size(1,c) .gt. IMAX) .or.  &
+     &         (cell_size(2,c) .gt. JMAX) .or.  &
+     &         (cell_size(3,c) .gt. KMAX) ) then
+             print *,node, c, (cell_size(i,c),i=1,3)
+             print *,' Problem size too big for compiled array sizes'
+             goto 999
+          endif
+       end do
+
+       do  i = 1, t_last
+          call timer_clear(i)
+       end do
+
+       call set_constants
+
+       call initialize
+
+       call lhsinit
+
+       call exact_rhs
+
+       call compute_buffer_size(5)
+
+!---------------------------------------------------------------------
+!      do one time step to touch all code, and reinitialize
+!---------------------------------------------------------------------
+       call adi
+       call initialize
+
+!---------------------------------------------------------------------
+!      Synchronize before placing time stamp
+!---------------------------------------------------------------------
+       do  i = 1, t_last
+          call timer_clear(i)
+       end do
+       call mpi_barrier(comm_setup, error)
+
+       call timer_clear(1)
+       call timer_start(1)
+
+       do  step = 1, niter
+
+          if (node .eq. root) then
+             if (mod(step, 20) .eq. 0 .or.  &
+     &           step .eq. 1) then
+                write(*, 200) step
+ 200            format(' Time step ', i4)
+              endif
+          endif
+
+          call adi
+
+       end do
+
+       call timer_stop(1)
+       t = timer_read(1)
+       
+       call verify(class, verified)
+
+       call mpi_reduce(t, tmax, 1,  &
+     &                 dp_type, MPI_MAX,  &
+     &                 root, comm_setup, error)
+
+       if( node .eq. root ) then
+          if( tmax .ne. 0. ) then
+             n3 = dble(grid_points(1))*grid_points(2)*grid_points(3)
+             t = (grid_points(1)+grid_points(2)+grid_points(3))/3.d0
+             mflops = 1.0d-6*dble( niter )*(881.174*n3  &
+     &                -4683.91* t**2  &
+     &                +11484.5* t  &
+     &                -19272.4) / tmax
+          else
+             mflops = 0.d0
+          endif
+
+         call print_results('SP', class, grid_points(1),  &
+     &     grid_points(2), grid_points(3), niter, no_nodes,  &
+     &     total_nodes, tmax, mflops, '          floating point',  &
+     &     verified, npbversion,compiletime, cs1, cs2, cs3, cs4, cs5,  &
+     &     cs6, '(none)')
+       endif
+
+       if (.not.timeron) goto 999
+
+       do i = 1, t_last
+          t1(i) = timer_read(i)
+       end do
+       t1(t_xsolve) = t1(t_xsolve) - t1(t_xcomm)
+       t1(t_ysolve) = t1(t_ysolve) - t1(t_ycomm)
+       t1(t_zsolve) = t1(t_zsolve) - t1(t_zcomm)
+       t1(t_last+2) = t1(t_xcomm)+t1(t_ycomm)+t1(t_zcomm)+t1(t_exch)
+       t1(t_last+1) = t1(t_total)  - t1(t_last+2)
+
+       call MPI_Reduce(t1, tsum,  t_last+2, dp_type, MPI_SUM,  &
+     &                 0, comm_setup, error)
+       call MPI_Reduce(t1, tming, t_last+2, dp_type, MPI_MIN,  &
+     &                 0, comm_setup, error)
+       call MPI_Reduce(t1, tmaxg, t_last+2, dp_type, MPI_MAX,  &
+     &                 0, comm_setup, error)
+
+       if (node .eq. 0) then
+          write(*, 800) no_nodes
+          do i = 1, t_last+2
+             tsum(i) = tsum(i) / no_nodes
+             write(*, 810) i, t_recs(i), tming(i), tmaxg(i), tsum(i)
+          end do
+       endif
+ 800   format(' nprocs =', i6, 11x, 'minimum', 5x, 'maximum',  &
+     &        5x, 'average')
+ 810   format(' timer ', i2, '(', A8, ') :', 3(2x,f10.4))
+
+ 999   continue
+       call mpi_barrier(MPI_COMM_WORLD, error)
+       call mpi_finalize(error)
+
+       end
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/SP/sp_data.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/SP/sp_data.f90
new file mode 100644
index 000000000..f8ac84489
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/SP/sp_data.f90
@@ -0,0 +1,168 @@
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+!
+!  sp_data module
+!
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      module sp_data
+
+!---------------------------------------------------------------------
+! The following include file is generated automatically by the
+! "setparams" utility. It defines 
+!      maxcells:      the square root of the maximum number of processors
+!      problem_size:  12, 64, 102, 162 (for class S, A, B, C)
+!      dt_default:    default time step for this problem size if no
+!                     config file
+!      niter_default: default number of iterations for this problem size
+!---------------------------------------------------------------------
+
+      include 'npbparams.h'
+
+      integer           ncells, grid_points(3)
+
+      double precision  tx1, tx2, tx3, ty1, ty2, ty3, tz1, tz2, tz3,  &
+     &                  dx1, dx2, dx3, dx4, dx5, dy1, dy2, dy3, dy4,  &
+     &                  dy5, dz1, dz2, dz3, dz4, dz5, dssp, dt,  &
+     &                  ce(5,13), dxmax, dymax, dzmax, xxcon1, xxcon2,  &
+     &                  xxcon3, xxcon4, xxcon5, dx1tx1, dx2tx1, dx3tx1,  &
+     &                  dx4tx1, dx5tx1, yycon1, yycon2, yycon3, yycon4,  &
+     &                  yycon5, dy1ty1, dy2ty1, dy3ty1, dy4ty1, dy5ty1,  &
+     &                  zzcon1, zzcon2, zzcon3, zzcon4, zzcon5, dz1tz1,  &
+     &                  dz2tz1, dz3tz1, dz4tz1, dz5tz1, dnxm1, dnym1,  &
+     &                  dnzm1, c1c2, c1c5, c3c4, c1345, conz1, c1, c2,  &
+     &                  c3, c4, c5, c4dssp, c5dssp, dtdssp, dttx1, bt,  &
+     &                  dttx2, dtty1, dtty2, dttz1, dttz2, c2dttx1,  &
+     &                  c2dtty1, c2dttz1, comz1, comz4, comz5, comz6,  &
+     &                  c3c4tx3, c3c4ty3, c3c4tz3, c2iv, con43, con16
+
+      integer           EAST, WEST, NORTH, SOUTH,  &
+     &                  BOTTOM, TOP
+
+      parameter (EAST=2000, WEST=3000,      NORTH=4000, SOUTH=5000,  &
+     &           BOTTOM=6000, TOP=7000)
+
+      integer maxcells, IMAX, JMAX, KMAX, MAX_CELL_DIM,  &
+     &        BUF_SIZE, IMAXP, JMAXP
+
+      integer predecessor(3), successor(3), grid_size(3)
+      integer, pointer ::  &
+     &        cell_coord (:,:), cell_low (:,:),  &
+     &        cell_high  (:,:), cell_size(:,:),  &
+     &        start      (:,:), end      (:,:),  &
+     &        slice      (:,:)
+
+      double precision, allocatable ::  &
+     &        u       (:,:,:,:,:),  &
+     &        us      (:,:,:,  :),  &
+     &        vs      (:,:,:,  :),  &
+     &        ws      (:,:,:,  :),  &
+     &        qs      (:,:,:,  :),  &
+     &        ainv    (:,:,:,  :),  &
+     &        rho_i   (:,:,:,  :),  &
+     &        speed   (:,:,:,  :),  &
+     &        square  (:,:,:,  :),  &
+     &        rhs     (:,:,:,:,:),  &
+     &        forcing (:,:,:,:,:),  &
+     &        lhs     (:,:,:,:,:),  &
+     &        in_buffer(:), out_buffer(:)
+
+      double precision, allocatable ::  &
+     &        cv  (:), rhon(:),  &
+     &        rhos(:), rhoq(:),  &
+     &        cuf (:), q   (:),  &
+     &        ue(:,:), buf (:,:)
+
+      integer west_size, east_size, bottom_size, top_size,  &
+     &        north_size, south_size, start_send_west,  &
+     &        start_send_east, start_send_south, start_send_north,  &
+     &        start_send_bottom, start_send_top, start_recv_west,  &
+     &        start_recv_east, start_recv_south, start_recv_north,  &
+     &        start_recv_bottom, start_recv_top
+
+!---------------------------------------------------------------------
+!     Timer constants
+!---------------------------------------------------------------------
+      integer t_total, t_rhs, t_xsolve, t_ysolve, t_zsolve, t_bpack,  &
+     &        t_exch, t_xcomm, t_ycomm, t_zcomm, t_last
+      parameter (t_total=1, t_rhs=2, t_xsolve=3, t_ysolve=4,  &
+     &        t_zsolve=5, t_bpack=6, t_exch=7, t_xcomm=8,  &
+     &        t_ycomm=9, t_zcomm=10, t_last=10)
+      logical timeron
+
+      end module sp_data
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine alloc_space
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+! allocate space dynamically for data arrays
+!---------------------------------------------------------------------
+
+      use sp_data
+      use mpinpb
+
+      implicit none
+
+      integer ios, ierr
+
+      MAX_CELL_DIM = (problem_size/maxcells)+1
+
+      IMAX = MAX_CELL_DIM
+      JMAX = MAX_CELL_DIM
+      KMAX = MAX_CELL_DIM
+
+      IMAXP = IMAX/2*2+1
+      JMAXP = JMAX/2*2+1
+
+!---------------------------------------------------------------------
+! +1 at end to avoid zero length arrays for 1 node
+!---------------------------------------------------------------------
+      BUF_SIZE = MAX_CELL_DIM*MAX_CELL_DIM*(maxcells-1)*60*2+1
+
+      allocate (  &
+     &         cell_coord (3,maxcells), cell_low (3,maxcells),  &
+     &         cell_high  (3,maxcells), cell_size(3,maxcells),  &
+     &         start      (3,maxcells), end      (3,maxcells),  &
+     &         slice      (3,maxcells),  &
+     &         stat = ios)
+
+      if (ios .eq. 0) allocate (  &
+     &   u       (-2:IMAXP+1,-2:JMAXP+1,-2:KMAX+1, 5,maxcells),  &
+     &   us      (-1:IMAX,   -1:JMAX,   -1:KMAX,     maxcells),  &
+     &   vs      (-1:IMAX,   -1:JMAX,   -1:KMAX,     maxcells),  &
+     &   ws      (-1:IMAX,   -1:JMAX,   -1:KMAX,     maxcells),  &
+     &   qs      (-1:IMAX,   -1:JMAX,   -1:KMAX,     maxcells),  &
+     &   ainv    (-1:IMAX,   -1:JMAX,   -1:KMAX,     maxcells),  &
+     &   rho_i   (-1:IMAX,   -1:JMAX,   -1:KMAX,     maxcells),  &
+     &   speed   (-1:IMAX,   -1:JMAX,   -1:KMAX,     maxcells),  &
+     &   square  (-1:IMAX,   -1:JMAX,   -1:KMAX,     maxcells),  &
+     &   rhs     ( 0:IMAXP-1, 0:JMAXP-1, 0:KMAX-1, 5,maxcells),  &
+     &   forcing ( 0:IMAXP-1, 0:JMAXP-1, 0:KMAX-1, 5,maxcells),  &
+     &   lhs     ( 0:IMAXP-1, 0:JMAXP-1, 0:KMAX-1,15,maxcells),  &
+     &   in_buffer(BUF_SIZE), out_buffer(BUF_SIZE),  &
+     &         stat = ios)
+
+      if (ios .eq. 0) allocate (  &
+     &         cv  (-2:MAX_CELL_DIM+1), rhon(-2:MAX_CELL_DIM+1),  &
+     &         rhos(-2:MAX_CELL_DIM+1), rhoq(-2:MAX_CELL_DIM+1),  &
+     &         cuf (-2:MAX_CELL_DIM+1),    q(-2:MAX_CELL_DIM+1),  &
+     &         ue  (-2:MAX_CELL_DIM+1,5),buf(-2:MAX_CELL_DIM+1,5),  &
+     &         stat = ios)
+
+      if (ios .ne. 0) then
+         write(*,*) 'Error encountered in allocating space'
+         call MPI_Abort(MPI_COMM_WORLD, MPI_ERR_OTHER, ierr)
+         stop
+      endif
+
+      return
+      end
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/SP/txinvr.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/SP/txinvr.f90
new file mode 100644
index 000000000..eeb0f8a5d
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/SP/txinvr.f90
@@ -0,0 +1,60 @@
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+       subroutine  txinvr
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+! block-diagonal matrix-vector multiplication                  
+!---------------------------------------------------------------------
+
+       use sp_data
+       implicit none
+
+       integer c, i, j, k
+       double precision t1, t2, t3, ac, ru1, uu, vv, ww, r1, r2, r3,  &
+     &                  r4, r5, ac2inv
+
+!---------------------------------------------------------------------
+!      loop over all cells owned by this node          
+!---------------------------------------------------------------------
+       do   c = 1, ncells
+          do    k = start(3,c), cell_size(3,c)-end(3,c)-1
+             do    j = start(2,c), cell_size(2,c)-end(2,c)-1
+                do    i = start(1,c), cell_size(1,c)-end(1,c)-1
+
+                   ru1 = rho_i(i,j,k,c)
+                   uu = us(i,j,k,c)
+                   vv = vs(i,j,k,c)
+                   ww = ws(i,j,k,c)
+                   ac = speed(i,j,k,c)
+                   ac2inv = ainv(i,j,k,c)*ainv(i,j,k,c)
+
+                   r1 = rhs(i,j,k,1,c)
+                   r2 = rhs(i,j,k,2,c)
+                   r3 = rhs(i,j,k,3,c)
+                   r4 = rhs(i,j,k,4,c)
+                   r5 = rhs(i,j,k,5,c)
+
+                   t1 = c2 * ac2inv * ( qs(i,j,k,c)*r1 - uu*r2  -  &
+     &                  vv*r3 - ww*r4 + r5 )
+                   t2 = bt * ru1 * ( uu * r1 - r2 )
+                   t3 = ( bt * ru1 * ac ) * t1
+
+                   rhs(i,j,k,1,c) = r1 - t1
+                   rhs(i,j,k,2,c) = - ru1 * ( ww*r1 - r4 )
+                   rhs(i,j,k,3,c) =   ru1 * ( vv*r1 - r3 )
+                   rhs(i,j,k,4,c) = - t2 + t3
+                   rhs(i,j,k,5,c) =   t2 + t3
+                end do
+             end do
+          end do
+       end do
+
+       return
+       end
+
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/SP/tzetar.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/SP/tzetar.f90
new file mode 100644
index 000000000..0f5e924fe
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/SP/tzetar.f90
@@ -0,0 +1,61 @@
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+       subroutine  tzetar(c)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!   block-diagonal matrix-vector multiplication                       
+!---------------------------------------------------------------------
+
+       use sp_data
+       implicit none
+
+       integer i, j, k, c
+       double precision  t1, t2, t3, ac, xvel, yvel, zvel, r1, r2, r3,  &
+     &                   r4, r5, btuz, acinv, ac2u, uzik1
+
+!---------------------------------------------------------------------
+!      treat only one cell                                             
+!---------------------------------------------------------------------
+       do    k = start(3,c), cell_size(3,c)-end(3,c)-1
+          do    j = start(2,c), cell_size(2,c)-end(2,c)-1
+             do    i = start(1,c), cell_size(1,c)-end(1,c)-1
+
+                xvel = us(i,j,k,c)
+                yvel = vs(i,j,k,c)
+                zvel = ws(i,j,k,c)
+                ac   = speed(i,j,k,c)
+                acinv = ainv(i,j,k,c)
+
+                ac2u = ac*ac
+
+                r1 = rhs(i,j,k,1,c)
+                r2 = rhs(i,j,k,2,c)
+                r3 = rhs(i,j,k,3,c)
+                r4 = rhs(i,j,k,4,c)
+                r5 = rhs(i,j,k,5,c)      
+
+                uzik1 = u(i,j,k,1,c)
+                btuz  = bt * uzik1
+
+                t1 = btuz*acinv * (r4 + r5)
+                t2 = r3 + t1
+                t3 = btuz * (r4 - r5)
+
+                rhs(i,j,k,1,c) = t2
+                rhs(i,j,k,2,c) = -uzik1*r2 + xvel*t2
+                rhs(i,j,k,3,c) =  uzik1*r1 + yvel*t2
+                rhs(i,j,k,4,c) =  zvel*t2  + t3
+                rhs(i,j,k,5,c) =  uzik1*(-xvel*r2 + yvel*r1) +  &
+     &                    qs(i,j,k,c)*t2 + c2iv*ac2u*t1 + zvel*t3
+
+             end do
+          end do
+       end do
+
+       return
+       end
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/SP/verify.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/SP/verify.f90
new file mode 100644
index 000000000..9107dcd4b
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/SP/verify.f90
@@ -0,0 +1,446 @@
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+        subroutine set_class(no_time_steps, class)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!  set problem class based on problem size
+!---------------------------------------------------------------------
+
+        use sp_data
+        implicit none
+
+        integer no_time_steps
+        character class
+
+
+        if ( (grid_points(1)  .eq. 12     ) .and.  &
+     &       (grid_points(2)  .eq. 12     ) .and.  &
+     &       (grid_points(3)  .eq. 12     ) .and.  &
+     &       (no_time_steps   .eq. 100    ))  then
+
+           class = 'S'
+
+        elseif ( (grid_points(1) .eq. 36) .and.  &
+     &           (grid_points(2) .eq. 36) .and.  &
+     &           (grid_points(3) .eq. 36) .and.  &
+     &           (no_time_steps  .eq. 400) ) then
+
+           class = 'W'
+
+        elseif ( (grid_points(1) .eq. 64) .and.  &
+     &           (grid_points(2) .eq. 64) .and.  &
+     &           (grid_points(3) .eq. 64) .and.  &
+     &           (no_time_steps  .eq. 400) ) then
+
+           class = 'A'
+
+        elseif ( (grid_points(1) .eq. 102) .and.  &
+     &           (grid_points(2) .eq. 102) .and.  &
+     &           (grid_points(3) .eq. 102) .and.  &
+     &           (no_time_steps  .eq. 400) ) then
+
+           class = 'B'
+
+        elseif ( (grid_points(1) .eq. 162) .and.  &
+     &           (grid_points(2) .eq. 162) .and.  &
+     &           (grid_points(3) .eq. 162) .and.  &
+     &           (no_time_steps  .eq. 400) ) then
+
+           class = 'C'
+
+        elseif ( (grid_points(1) .eq. 408) .and.  &
+     &           (grid_points(2) .eq. 408) .and.  &
+     &           (grid_points(3) .eq. 408) .and.  &
+     &           (no_time_steps  .eq. 500) ) then
+
+           class = 'D'
+
+        elseif ( (grid_points(1) .eq. 1020) .and.  &
+     &           (grid_points(2) .eq. 1020) .and.  &
+     &           (grid_points(3) .eq. 1020) .and.  &
+     &           (no_time_steps  .eq. 500) ) then
+
+           class = 'E'
+
+        elseif ( (grid_points(1) .eq. 2560) .and.  &
+     &           (grid_points(2) .eq. 2560) .and.  &
+     &           (grid_points(3) .eq. 2560) .and.  &
+     &           (no_time_steps  .eq. 500) ) then
+
+           class = 'F'
+
+        else
+
+           class = 'U'
+
+        endif
+
+        return
+        end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+        subroutine verify(class, verified)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!  verification routine                         
+!---------------------------------------------------------------------
+
+        use, intrinsic :: ieee_arithmetic, only : ieee_is_nan
+
+        use sp_data
+        use mpinpb
+
+        implicit none
+
+        double precision xcrref(5),xceref(5),xcrdif(5),xcedif(5),  &
+     &                   epsilon, xce(5), xcr(5), dtref
+        integer m
+        character class
+        logical verified
+
+!---------------------------------------------------------------------
+!   tolerance level
+!---------------------------------------------------------------------
+        epsilon = 1.0d-08
+
+
+!---------------------------------------------------------------------
+!   compute the error norm and the residual norm, and exit if not printing
+!---------------------------------------------------------------------
+        call error_norm(xce)
+        call copy_faces
+
+        call rhs_norm(xcr)
+
+        do m = 1, 5
+           xcr(m) = xcr(m) / dt
+        enddo
+
+        if (node .ne. 0) return
+
+        verified = .true.
+
+        do m = 1,5
+           xcrref(m) = 1.0
+           xceref(m) = 1.0
+        end do
+
+!---------------------------------------------------------------------
+!    reference data for 12X12X12 grids after 100 time steps, with DT = 1.50d-02
+!---------------------------------------------------------------------
+        if ( class .eq. 'S' ) then
+
+           dtref = 1.5d-2
+
+!---------------------------------------------------------------------
+!    Reference values of RMS-norms of residual.
+!---------------------------------------------------------------------
+           xcrref(1) = 2.7470315451339479d-02
+           xcrref(2) = 1.0360746705285417d-02
+           xcrref(3) = 1.6235745065095532d-02
+           xcrref(4) = 1.5840557224455615d-02
+           xcrref(5) = 3.4849040609362460d-02
+
+!---------------------------------------------------------------------
+!    Reference values of RMS-norms of solution error.
+!---------------------------------------------------------------------
+           xceref(1) = 2.7289258557377227d-05
+           xceref(2) = 1.0364446640837285d-05
+           xceref(3) = 1.6154798287166471d-05
+           xceref(4) = 1.5750704994480102d-05
+           xceref(5) = 3.4177666183390531d-05
+
+!---------------------------------------------------------------------
+!    reference data for 36X36X36 grids after 400 time steps, with DT = 1.5d-03
+!---------------------------------------------------------------------
+        elseif ( class .eq. 'W' ) then
+
+           dtref = 1.5d-3
+
+!---------------------------------------------------------------------
+!    Reference values of RMS-norms of residual.
+!---------------------------------------------------------------------
+           xcrref(1) = 0.1893253733584d-02
+           xcrref(2) = 0.1717075447775d-03
+           xcrref(3) = 0.2778153350936d-03
+           xcrref(4) = 0.2887475409984d-03
+           xcrref(5) = 0.3143611161242d-02
+
+!---------------------------------------------------------------------
+!    Reference values of RMS-norms of solution error.
+!---------------------------------------------------------------------
+           xceref(1) = 0.7542088599534d-04
+           xceref(2) = 0.6512852253086d-05
+           xceref(3) = 0.1049092285688d-04
+           xceref(4) = 0.1128838671535d-04
+           xceref(5) = 0.1212845639773d-03
+
+!---------------------------------------------------------------------
+!    reference data for 64X64X64 grids after 400 time steps, with DT = 1.5d-03
+!---------------------------------------------------------------------
+        elseif ( class .eq. 'A' ) then
+
+           dtref = 1.5d-3
+
+!---------------------------------------------------------------------
+!    Reference values of RMS-norms of residual.
+!---------------------------------------------------------------------
+           xcrref(1) = 2.4799822399300195d0
+           xcrref(2) = 1.1276337964368832d0
+           xcrref(3) = 1.5028977888770491d0
+           xcrref(4) = 1.4217816211695179d0
+           xcrref(5) = 2.1292113035138280d0
+
+!---------------------------------------------------------------------
+!    Reference values of RMS-norms of solution error.
+!---------------------------------------------------------------------
+           xceref(1) = 1.0900140297820550d-04
+           xceref(2) = 3.7343951769282091d-05
+           xceref(3) = 5.0092785406541633d-05
+           xceref(4) = 4.7671093939528255d-05
+           xceref(5) = 1.3621613399213001d-04
+
+!---------------------------------------------------------------------
+!    reference data for 102X102X102 grids after 400 time steps,
+!    with DT = 1.0d-03
+!---------------------------------------------------------------------
+        elseif ( class .eq. 'B' ) then
+
+           dtref = 1.0d-3
+
+!---------------------------------------------------------------------
+!    Reference values of RMS-norms of residual.
+!---------------------------------------------------------------------
+           xcrref(1) = 0.6903293579998d+02
+           xcrref(2) = 0.3095134488084d+02
+           xcrref(3) = 0.4103336647017d+02
+           xcrref(4) = 0.3864769009604d+02
+           xcrref(5) = 0.5643482272596d+02
+
+!---------------------------------------------------------------------
+!    Reference values of RMS-norms of solution error.
+!---------------------------------------------------------------------
+           xceref(1) = 0.9810006190188d-02
+           xceref(2) = 0.1022827905670d-02
+           xceref(3) = 0.1720597911692d-02
+           xceref(4) = 0.1694479428231d-02
+           xceref(5) = 0.1847456263981d-01
+
+!---------------------------------------------------------------------
+!    reference data for 162X162X162 grids after 400 time steps,
+!    with DT = 0.67d-03
+!---------------------------------------------------------------------
+        elseif ( class .eq. 'C' ) then
+
+           dtref = 0.67d-3
+
+!---------------------------------------------------------------------
+!    Reference values of RMS-norms of residual.
+!---------------------------------------------------------------------
+           xcrref(1) = 0.5881691581829d+03
+           xcrref(2) = 0.2454417603569d+03
+           xcrref(3) = 0.3293829191851d+03
+           xcrref(4) = 0.3081924971891d+03
+           xcrref(5) = 0.4597223799176d+03
+
+!---------------------------------------------------------------------
+!    Reference values of RMS-norms of solution error.
+!---------------------------------------------------------------------
+           xceref(1) = 0.2598120500183d+00
+           xceref(2) = 0.2590888922315d-01
+           xceref(3) = 0.5132886416320d-01
+           xceref(4) = 0.4806073419454d-01
+           xceref(5) = 0.5483377491301d+00
+
+!---------------------------------------------------------------------
+!    reference data for 408X408X408 grids after 500 time steps,
+!    with DT = 0.3d-03
+!---------------------------------------------------------------------
+        elseif ( class .eq. 'D' ) then
+
+           dtref = 0.30d-3
+
+!---------------------------------------------------------------------
+!    Reference values of RMS-norms of residual.
+!---------------------------------------------------------------------
+           xcrref(1) = 0.1044696216887d+05
+           xcrref(2) = 0.3204427762578d+04
+           xcrref(3) = 0.4648680733032d+04
+           xcrref(4) = 0.4238923283697d+04
+           xcrref(5) = 0.7588412036136d+04
+
+!---------------------------------------------------------------------
+!    Reference values of RMS-norms of solution error.
+!---------------------------------------------------------------------
+           xceref(1) = 0.5089471423669d+01
+           xceref(2) = 0.5323514855894d+00
+           xceref(3) = 0.1187051008971d+01
+           xceref(4) = 0.1083734951938d+01
+           xceref(5) = 0.1164108338568d+02
+
+!---------------------------------------------------------------------
+!    reference data for 1020X1020X1020 grids after 500 time steps,
+!    with DT = 0.1d-03
+!---------------------------------------------------------------------
+        elseif ( class .eq. 'E' ) then
+
+           dtref = 0.10d-3
+
+!---------------------------------------------------------------------
+!    Reference values of RMS-norms of residual.
+!---------------------------------------------------------------------
+           xcrref(1) = 0.6255387422609d+05
+           xcrref(2) = 0.1495317020012d+05
+           xcrref(3) = 0.2347595750586d+05
+           xcrref(4) = 0.2091099783534d+05
+           xcrref(5) = 0.4770412841218d+05
+
+!---------------------------------------------------------------------
+!    Reference values of RMS-norms of solution error.
+!---------------------------------------------------------------------
+           xceref(1) = 0.6742735164909d+02
+           xceref(2) = 0.5390656036938d+01
+           xceref(3) = 0.1680647196477d+02
+           xceref(4) = 0.1536963126457d+02
+           xceref(5) = 0.1575330146156d+03
+
+!---------------------------------------------------------------------
+!    reference data for 2560X2560X2560 grids after 500 time steps,
+!    with DT = 0.15d-04
+!---------------------------------------------------------------------
+        elseif ( class .eq. 'F' ) then
+
+           dtref = 0.15d-4
+
+!---------------------------------------------------------------------
+!    Reference values of RMS-norms of residual.
+!---------------------------------------------------------------------
+           xcrref(1) = 0.9281628449462d+05
+           xcrref(2) = 0.2230152287675d+05
+           xcrref(3) = 0.3493102358632d+05
+           xcrref(4) = 0.3114096186689d+05
+           xcrref(5) = 0.7424426448298d+05
+
+!---------------------------------------------------------------------
+!    Reference values of RMS-norms of solution error.
+!---------------------------------------------------------------------
+           xceref(1) = 0.2683717702444d+03
+           xceref(2) = 0.2030647554028d+02
+           xceref(3) = 0.6734864248234d+02
+           xceref(4) = 0.5947451301640d+02
+           xceref(5) = 0.5417636652565d+03
+
+        else
+
+           verified = .false.
+
+        endif
+
+!---------------------------------------------------------------------
+!    verification test for residuals if gridsize is one of 
+!    the defined grid sizes above (class .ne. 'U')
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!    Compute the relative difference between computed and reference values.
+!---------------------------------------------------------------------
+        do m = 1, 5
+           
+           xcrdif(m) = dabs((xcr(m)-xcrref(m))/xcrref(m)) 
+           xcedif(m) = dabs((xce(m)-xceref(m))/xceref(m))
+           
+        enddo
+
+!---------------------------------------------------------------------
+!    Output the comparison of computed results to known cases.
+!---------------------------------------------------------------------
+
+        if (class .ne. 'U') then
+           write(*, 1990) class
+ 1990      format(' Verification being performed for class ', a)
+           write (*,2000) epsilon
+ 2000      format(' accuracy setting for epsilon = ', E20.13)
+           verified = (dabs(dt-dtref) .le. epsilon)
+           if (.not.verified) then  
+              class = 'U'
+              write (*,1000) dtref
+ 1000         format(' DT does not match the reference value of ',  &
+     &                 E15.8)
+           endif
+        else 
+           write(*, 1995)
+ 1995      format(' Unknown class')
+        endif
+
+
+        if (class .ne. 'U') then
+           write (*,2001) 
+        else
+           write (*, 2005)
+        endif
+
+ 2001   format(' Comparison of RMS-norms of residual')
+ 2005   format(' RMS-norms of residual')
+        do m = 1, 5
+           if (class .eq. 'U') then
+              write(*, 2015) m, xcr(m)
+           else if ((.not.ieee_is_nan(xcrdif(m))) .and.  &
+     &              xcrdif(m) .le. epsilon) then
+              write (*,2011) m,xcr(m),xcrref(m),xcrdif(m)
+           else 
+              verified = .false.
+              write (*,2010) m,xcr(m),xcrref(m),xcrdif(m)
+           endif
+        enddo
+
+        if (class .ne. 'U') then
+           write (*,2002)
+        else
+           write (*,2006)
+        endif
+ 2002   format(' Comparison of RMS-norms of solution error')
+ 2006   format(' RMS-norms of solution error')
+        
+        do m = 1, 5
+           if (class .eq. 'U') then
+              write(*, 2015) m, xce(m)
+           else if ((.not.ieee_is_nan(xcedif(m))) .and.  &
+     &              xcedif(m) .le. epsilon) then
+              write (*,2011) m,xce(m),xceref(m),xcedif(m)
+           else
+              verified = .false.
+              write (*,2010) m,xce(m),xceref(m),xcedif(m)
+           endif
+        enddo
+        
+ 2010   format(' FAILURE: ', i2, E20.13, E20.13, E20.13)
+ 2011   format('          ', i2, E20.13, E20.13, E20.13)
+ 2015   format('          ', i2, E20.13)
+        
+        if (class .eq. 'U') then
+           write(*, 2022)
+           write(*, 2023)
+ 2022      format(' No reference values provided')
+ 2023      format(' No verification performed')
+        else if (verified) then
+           write(*, 2020)
+ 2020      format(' Verification Successful')
+        else
+           write(*, 2021)
+ 2021      format(' Verification failed')
+        endif
+
+        return
+
+
+        end
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/SP/x_solve.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/SP/x_solve.f90
new file mode 100644
index 000000000..e56ed3803
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/SP/x_solve.f90
@@ -0,0 +1,562 @@
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+       subroutine x_solve
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+! this function performs the solution of the approximate factorization
+! step in the x-direction for all five matrix components
+! simultaneously. The Thomas algorithm is employed to solve the
+! systems for the x-lines. Boundary conditions are non-periodic
+!---------------------------------------------------------------------
+
+       use sp_data
+       use mpinpb
+
+       implicit none
+
+
+       integer i, j, k, jp, kp, n, iend, jsize, ksize, i1, i2,  &
+     &         buffer_size, c, m, p, istart, stage, error,  &
+     &         requests(2), statuses(MPI_STATUS_SIZE, 2)
+       double precision  r1, r2, d, e, s(5), sm1, sm2,  &
+     &                   fac1, fac2
+
+
+
+!---------------------------------------------------------------------
+!      OK, now we know that there are multiple processors
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+! now do a sweep on a layer-by-layer basis, i.e. sweeping through cells
+! on this node in the direction of increasing i for the forward sweep,
+! and after that reversing the direction for the backsubstitution.
+!---------------------------------------------------------------------
+
+       if (timeron) call timer_start(t_xsolve)
+!---------------------------------------------------------------------
+!                          FORWARD ELIMINATION  
+!---------------------------------------------------------------------
+       do    stage = 1, ncells
+          c         = slice(1,stage)
+
+          istart = 0
+          iend   = cell_size(1,c)-1
+
+          jsize     = cell_size(2,c)
+          ksize     = cell_size(3,c)
+          jp        = cell_coord(2,c)-1
+          kp        = cell_coord(3,c)-1
+
+          buffer_size = (jsize-start(2,c)-end(2,c)) *  &
+     &                  (ksize-start(3,c)-end(3,c))
+
+          if ( stage .ne. 1) then
+
+!---------------------------------------------------------------------
+!            if this is not the first processor in this row of cells, 
+!            receive data from predecessor containing the right hand
+!            sides and the upper diagonal elements of the previous two rows
+!---------------------------------------------------------------------
+             if (timeron) call timer_start(t_xcomm)
+             call mpi_irecv(in_buffer, 22*buffer_size,  &
+     &                      dp_type, predecessor(1),  &
+     &                      DEFAULT_TAG,  comm_solve,  &
+     &                      requests(1), error)
+             if (timeron) call timer_stop(t_xcomm)
+
+
+!---------------------------------------------------------------------
+!            communication has already been started. 
+!            compute the left hand side while waiting for the msg
+!---------------------------------------------------------------------
+             call lhsx(c)
+
+!---------------------------------------------------------------------
+!            wait for pending communication to complete
+!            This waits on the current receive and on the send
+!            from the previous stage. They always come in pairs. 
+!---------------------------------------------------------------------
+
+             if (timeron) call timer_start(t_xcomm)
+             call mpi_waitall(2, requests, statuses, error)
+             if (timeron) call timer_stop(t_xcomm)
+
+!---------------------------------------------------------------------
+!            unpack the buffer                                 
+!---------------------------------------------------------------------
+             i  = istart
+             i1 = istart + 1
+             n = 0
+
+!---------------------------------------------------------------------
+!            create a running pointer
+!---------------------------------------------------------------------
+             p = 0
+             do    k = start(3,c), ksize-end(3,c)-1
+                do    j = start(2,c), jsize-end(2,c)-1
+                   lhs(i,j,k,n+2,c) = lhs(i,j,k,n+2,c) -  &
+     &                       in_buffer(p+1) * lhs(i,j,k,n+1,c)
+                   lhs(i,j,k,n+3,c) = lhs(i,j,k,n+3,c) -  &
+     &                       in_buffer(p+2) * lhs(i,j,k,n+1,c)
+                   do    m = 1, 3
+                      rhs(i,j,k,m,c) = rhs(i,j,k,m,c) -  &
+     &                       in_buffer(p+2+m) * lhs(i,j,k,n+1,c)
+                   end do
+                   d            = in_buffer(p+6)
+                   e            = in_buffer(p+7)
+                   do    m = 1, 3
+                      s(m) = in_buffer(p+7+m)
+                   end do
+                   r1 = lhs(i,j,k,n+2,c)
+                   lhs(i,j,k,n+3,c) = lhs(i,j,k,n+3,c) - d * r1
+                   lhs(i,j,k,n+4,c) = lhs(i,j,k,n+4,c) - e * r1
+                   do    m = 1, 3
+                      rhs(i,j,k,m,c) = rhs(i,j,k,m,c) - s(m) * r1
+                   end do
+                   r2 = lhs(i1,j,k,n+1,c)
+                   lhs(i1,j,k,n+2,c) = lhs(i1,j,k,n+2,c) - d * r2
+                   lhs(i1,j,k,n+3,c) = lhs(i1,j,k,n+3,c) - e * r2
+                   do    m = 1, 3
+                      rhs(i1,j,k,m,c) = rhs(i1,j,k,m,c) - s(m) * r2
+                   end do
+                   p = p + 10
+                end do
+             end do
+
+             do    m = 4, 5
+                n = (m-3)*5
+                do    k = start(3,c), ksize-end(3,c)-1
+                   do    j = start(2,c), jsize-end(2,c)-1
+                      lhs(i,j,k,n+2,c) = lhs(i,j,k,n+2,c) -  &
+     &                          in_buffer(p+1) * lhs(i,j,k,n+1,c)
+                      lhs(i,j,k,n+3,c) = lhs(i,j,k,n+3,c) -  &
+     &                          in_buffer(p+2) * lhs(i,j,k,n+1,c)
+                      rhs(i,j,k,m,c)   = rhs(i,j,k,m,c) -  &
+     &                          in_buffer(p+3) * lhs(i,j,k,n+1,c)
+                      d                = in_buffer(p+4)
+                      e                = in_buffer(p+5)
+                      s(m)             = in_buffer(p+6)
+                      r1 = lhs(i,j,k,n+2,c)
+                      lhs(i,j,k,n+3,c) = lhs(i,j,k,n+3,c) - d * r1
+                      lhs(i,j,k,n+4,c) = lhs(i,j,k,n+4,c) - e * r1
+                      rhs(i,j,k,m,c)   = rhs(i,j,k,m,c) - s(m) * r1
+                      r2 = lhs(i1,j,k,n+1,c)
+                      lhs(i1,j,k,n+2,c) = lhs(i1,j,k,n+2,c) - d * r2
+                      lhs(i1,j,k,n+3,c) = lhs(i1,j,k,n+3,c) - e * r2
+                      rhs(i1,j,k,m,c)   = rhs(i1,j,k,m,c) - s(m) * r2
+                      p = p + 6
+                   end do
+                end do
+             end do
+
+          else            
+
+!---------------------------------------------------------------------
+!            if this IS the first cell, we still compute the lhs
+!---------------------------------------------------------------------
+             call lhsx(c)
+          endif
+
+!---------------------------------------------------------------------
+!         perform the Thomas algorithm; first, FORWARD ELIMINATION     
+!---------------------------------------------------------------------
+          n = 0
+
+          do    k = start(3,c), ksize-end(3,c)-1
+             do    j = start(2,c), jsize-end(2,c)-1
+                do    i = istart, iend-2
+                   i1 = i  + 1
+                   i2 = i  + 2
+                   fac1               = 1.d0/lhs(i,j,k,n+3,c)
+                   lhs(i,j,k,n+4,c)   = fac1*lhs(i,j,k,n+4,c)
+                   lhs(i,j,k,n+5,c)   = fac1*lhs(i,j,k,n+5,c)
+                   do    m = 1, 3
+                      rhs(i,j,k,m,c) = fac1*rhs(i,j,k,m,c)
+                   end do
+                   lhs(i1,j,k,n+3,c) = lhs(i1,j,k,n+3,c) -  &
+     &                         lhs(i1,j,k,n+2,c)*lhs(i,j,k,n+4,c)
+                   lhs(i1,j,k,n+4,c) = lhs(i1,j,k,n+4,c) -  &
+     &                         lhs(i1,j,k,n+2,c)*lhs(i,j,k,n+5,c)
+                   do    m = 1, 3
+                      rhs(i1,j,k,m,c) = rhs(i1,j,k,m,c) -  &
+     &                         lhs(i1,j,k,n+2,c)*rhs(i,j,k,m,c)
+                   end do
+                   lhs(i2,j,k,n+2,c) = lhs(i2,j,k,n+2,c) -  &
+     &                         lhs(i2,j,k,n+1,c)*lhs(i,j,k,n+4,c)
+                   lhs(i2,j,k,n+3,c) = lhs(i2,j,k,n+3,c) -  &
+     &                         lhs(i2,j,k,n+1,c)*lhs(i,j,k,n+5,c)
+                   do    m = 1, 3
+                      rhs(i2,j,k,m,c) = rhs(i2,j,k,m,c) -  &
+     &                         lhs(i2,j,k,n+1,c)*rhs(i,j,k,m,c)
+                   end do
+                end do
+             end do
+          end do
+
+!---------------------------------------------------------------------
+!         The last two rows in this grid block are a bit different, 
+!         since they do not have two more rows available for the
+!         elimination of off-diagonal entries
+!---------------------------------------------------------------------
+
+          i  = iend - 1
+          i1 = iend
+          do    k = start(3,c), ksize-end(3,c)-1
+             do    j = start(2,c), jsize-end(2,c)-1
+                fac1               = 1.d0/lhs(i,j,k,n+3,c)
+                lhs(i,j,k,n+4,c)   = fac1*lhs(i,j,k,n+4,c)
+                lhs(i,j,k,n+5,c)   = fac1*lhs(i,j,k,n+5,c)
+                do    m = 1, 3
+                   rhs(i,j,k,m,c) = fac1*rhs(i,j,k,m,c)
+                end do
+                lhs(i1,j,k,n+3,c) = lhs(i1,j,k,n+3,c) -  &
+     &                      lhs(i1,j,k,n+2,c)*lhs(i,j,k,n+4,c)
+                lhs(i1,j,k,n+4,c) = lhs(i1,j,k,n+4,c) -  &
+     &                      lhs(i1,j,k,n+2,c)*lhs(i,j,k,n+5,c)
+                do    m = 1, 3
+                   rhs(i1,j,k,m,c) = rhs(i1,j,k,m,c) -  &
+     &                      lhs(i1,j,k,n+2,c)*rhs(i,j,k,m,c)
+                end do
+!---------------------------------------------------------------------
+!               scale the last row immediately (some of this is
+!               overkill in case this is the last cell)
+!---------------------------------------------------------------------
+                fac2               = 1.d0/lhs(i1,j,k,n+3,c)
+                lhs(i1,j,k,n+4,c) = fac2*lhs(i1,j,k,n+4,c)
+                lhs(i1,j,k,n+5,c) = fac2*lhs(i1,j,k,n+5,c)  
+                do    m = 1, 3
+                   rhs(i1,j,k,m,c) = fac2*rhs(i1,j,k,m,c)
+                end do
+             end do
+          end do
+
+!---------------------------------------------------------------------
+!         do the u+c and the u-c factors                 
+!---------------------------------------------------------------------
+
+          do    m = 4, 5
+             n = (m-3)*5
+             do    k = start(3,c), ksize-end(3,c)-1
+                do    j = start(2,c), jsize-end(2,c)-1
+                   do    i = istart, iend-2
+                   i1 = i  + 1
+                   i2 = i  + 2
+                   fac1               = 1.d0/lhs(i,j,k,n+3,c)
+                   lhs(i,j,k,n+4,c)   = fac1*lhs(i,j,k,n+4,c)
+                   lhs(i,j,k,n+5,c)   = fac1*lhs(i,j,k,n+5,c)
+                   rhs(i,j,k,m,c) = fac1*rhs(i,j,k,m,c)
+                   lhs(i1,j,k,n+3,c) = lhs(i1,j,k,n+3,c) -  &
+     &                         lhs(i1,j,k,n+2,c)*lhs(i,j,k,n+4,c)
+                   lhs(i1,j,k,n+4,c) = lhs(i1,j,k,n+4,c) -  &
+     &                         lhs(i1,j,k,n+2,c)*lhs(i,j,k,n+5,c)
+                   rhs(i1,j,k,m,c) = rhs(i1,j,k,m,c) -  &
+     &                         lhs(i1,j,k,n+2,c)*rhs(i,j,k,m,c)
+                   lhs(i2,j,k,n+2,c) = lhs(i2,j,k,n+2,c) -  &
+     &                         lhs(i2,j,k,n+1,c)*lhs(i,j,k,n+4,c)
+                   lhs(i2,j,k,n+3,c) = lhs(i2,j,k,n+3,c) -  &
+     &                         lhs(i2,j,k,n+1,c)*lhs(i,j,k,n+5,c)
+                   rhs(i2,j,k,m,c) = rhs(i2,j,k,m,c) -  &
+     &                         lhs(i2,j,k,n+1,c)*rhs(i,j,k,m,c)
+                   end do
+                end do
+             end do
+
+!---------------------------------------------------------------------
+!            And again the last two rows separately
+!---------------------------------------------------------------------
+             i  = iend - 1
+             i1 = iend
+             do    k = start(3,c), ksize-end(3,c)-1
+                do    j = start(2,c), jsize-end(2,c)-1
+                fac1               = 1.d0/lhs(i,j,k,n+3,c)
+                lhs(i,j,k,n+4,c)   = fac1*lhs(i,j,k,n+4,c)
+                lhs(i,j,k,n+5,c)   = fac1*lhs(i,j,k,n+5,c)
+                rhs(i,j,k,m,c)     = fac1*rhs(i,j,k,m,c)
+                lhs(i1,j,k,n+3,c) = lhs(i1,j,k,n+3,c) -  &
+     &                      lhs(i1,j,k,n+2,c)*lhs(i,j,k,n+4,c)
+                lhs(i1,j,k,n+4,c) = lhs(i1,j,k,n+4,c) -  &
+     &                      lhs(i1,j,k,n+2,c)*lhs(i,j,k,n+5,c)
+                rhs(i1,j,k,m,c)   = rhs(i1,j,k,m,c) -  &
+     &                      lhs(i1,j,k,n+2,c)*rhs(i,j,k,m,c)
+!---------------------------------------------------------------------
+!               Scale the last row immediately (some of this is overkill
+!               if this is the last cell)
+!---------------------------------------------------------------------
+                fac2               = 1.d0/lhs(i1,j,k,n+3,c)
+                lhs(i1,j,k,n+4,c) = fac2*lhs(i1,j,k,n+4,c)
+                lhs(i1,j,k,n+5,c) = fac2*lhs(i1,j,k,n+5,c)
+                rhs(i1,j,k,m,c)   = fac2*rhs(i1,j,k,m,c)
+
+                end do
+             end do
+          end do
+
+!---------------------------------------------------------------------
+!         send information to the next processor, except when this
+!         is the last grid block
+!---------------------------------------------------------------------
+          if (stage .ne. ncells) then
+
+!---------------------------------------------------------------------
+!            create a running pointer for the send buffer  
+!---------------------------------------------------------------------
+             p = 0
+             n = 0
+             do    k = start(3,c), ksize-end(3,c)-1
+                do    j = start(2,c), jsize-end(2,c)-1
+                   do    i = iend-1, iend
+                      out_buffer(p+1) = lhs(i,j,k,n+4,c)
+                      out_buffer(p+2) = lhs(i,j,k,n+5,c)
+                      do    m = 1, 3
+                         out_buffer(p+2+m) = rhs(i,j,k,m,c)
+                      end do
+                      p = p+5
+                   end do
+                end do
+             end do
+
+             do    m = 4, 5
+                n = (m-3)*5
+                do    k = start(3,c), ksize-end(3,c)-1
+                   do    j = start(2,c), jsize-end(2,c)-1
+                      do    i = iend-1, iend
+                         out_buffer(p+1) = lhs(i,j,k,n+4,c)
+                         out_buffer(p+2) = lhs(i,j,k,n+5,c)
+                         out_buffer(p+3) = rhs(i,j,k,m,c)
+                         p = p + 3
+                      end do
+                   end do
+                end do
+             end do
+
+!---------------------------------------------------------------------
+! send data to next phase
+! can't receive data yet because buffer size will be wrong 
+!---------------------------------------------------------------------
+             if (timeron) call timer_start(t_xcomm)
+             call mpi_isend(out_buffer, 22*buffer_size,  &
+     &                     dp_type, successor(1),  &
+     &                     DEFAULT_TAG, comm_solve,  &
+     &                     requests(2), error)
+             if (timeron) call timer_stop(t_xcomm)
+
+          endif
+       end do
+
+!---------------------------------------------------------------------
+!      now go in the reverse direction                      
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!                         BACKSUBSTITUTION 
+!---------------------------------------------------------------------
+       do    stage = ncells, 1, -1
+          c = slice(1,stage)
+
+          istart = 0
+          iend   = cell_size(1,c)-1
+
+          jsize = cell_size(2,c)
+          ksize = cell_size(3,c)
+          jp    = cell_coord(2,c)-1
+          kp    = cell_coord(3,c)-1
+
+          buffer_size = (jsize-start(2,c)-end(2,c)) *  &
+     &                  (ksize-start(3,c)-end(3,c))
+
+          if (stage .ne. ncells) then
+
+!---------------------------------------------------------------------
+!            if this is not the starting cell in this row of cells, 
+!            wait for a message to be received, containing the 
+!            solution of the previous two stations     
+!---------------------------------------------------------------------
+             if (timeron) call timer_start(t_xcomm)
+             call mpi_irecv(in_buffer, 10*buffer_size,  &
+     &                      dp_type, successor(1),  &
+     &                      DEFAULT_TAG, comm_solve,  &
+     &                      requests(1), error)
+             if (timeron) call timer_stop(t_xcomm)
+
+
+!---------------------------------------------------------------------
+!            communication has already been started
+!            while waiting, do the block-diagonal inversion for the 
+!            cell that was just finished                
+!---------------------------------------------------------------------
+
+             call ninvr(slice(1,stage+1))
+
+!---------------------------------------------------------------------
+!            wait for pending communication to complete
+!---------------------------------------------------------------------
+             if (timeron) call timer_start(t_xcomm)
+             call mpi_waitall(2, requests, statuses, error)
+             if (timeron) call timer_stop(t_xcomm)
+
+!---------------------------------------------------------------------
+!            unpack the buffer for the first three factors         
+!---------------------------------------------------------------------
+             n = 0
+             p = 0
+             i  = iend
+             i1 = i - 1
+             do    m = 1, 3
+                do   k = start(3,c), ksize-end(3,c)-1
+                   do   j = start(2,c), jsize-end(2,c)-1
+                      sm1 = in_buffer(p+1)
+                      sm2 = in_buffer(p+2)
+                      rhs(i,j,k,m,c) = rhs(i,j,k,m,c) -  &
+     &                        lhs(i,j,k,n+4,c)*sm1 -  &
+     &                        lhs(i,j,k,n+5,c)*sm2
+                      rhs(i1,j,k,m,c) = rhs(i1,j,k,m,c) -  &
+     &                        lhs(i1,j,k,n+4,c) * rhs(i,j,k,m,c) -  &
+     &                        lhs(i1,j,k,n+5,c) * sm1
+                      p = p + 2
+                   end do
+                end do
+             end do
+
+!---------------------------------------------------------------------
+!            now unpack the buffer for the remaining two factors
+!---------------------------------------------------------------------
+             do    m = 4, 5
+                n = (m-3)*5
+                do   k = start(3,c), ksize-end(3,c)-1
+                   do   j = start(2,c), jsize-end(2,c)-1
+                      sm1 = in_buffer(p+1)
+                      sm2 = in_buffer(p+2)
+                      rhs(i,j,k,m,c) = rhs(i,j,k,m,c) -  &
+     &                        lhs(i,j,k,n+4,c)*sm1 -  &
+     &                        lhs(i,j,k,n+5,c)*sm2
+                      rhs(i1,j,k,m,c) = rhs(i1,j,k,m,c) -  &
+     &                        lhs(i1,j,k,n+4,c) * rhs(i,j,k,m,c) -  &
+     &                        lhs(i1,j,k,n+5,c) * sm1
+                      p = p + 2
+                   end do
+                end do
+             end do
+
+          else
+
+!---------------------------------------------------------------------
+!            now we know this is the first grid block on the back sweep,
+!            so we don't need a message to start the substitution. 
+!---------------------------------------------------------------------
+             i  = iend-1
+             i1 = iend
+             n = 0
+             do   m = 1, 3
+                do   k = start(3,c), ksize-end(3,c)-1
+                   do   j = start(2,c), jsize-end(2,c)-1
+                      rhs(i,j,k,m,c) = rhs(i,j,k,m,c) -  &
+     &                             lhs(i,j,k,n+4,c)*rhs(i1,j,k,m,c)
+                   end do
+                end do
+             end do
+
+             do    m = 4, 5
+                n = (m-3)*5
+                do   k = start(3,c), ksize-end(3,c)-1
+                   do   j = start(2,c), jsize-end(2,c)-1
+                      rhs(i,j,k,m,c) = rhs(i,j,k,m,c) -  &
+     &                             lhs(i,j,k,n+4,c)*rhs(i1,j,k,m,c)
+                   end do
+                end do
+             end do
+          endif
+
+!---------------------------------------------------------------------
+!         Whether or not this is the last processor, we always have
+!         to complete the back-substitution 
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!         The first three factors
+!---------------------------------------------------------------------
+          n = 0
+          do   m = 1, 3
+             do   k = start(3,c), ksize-end(3,c)-1
+                do   j = start(2,c), jsize-end(2,c)-1
+                   do    i = iend-2, istart, -1
+                      i1 = i  + 1
+                      i2 = i  + 2
+                      rhs(i,j,k,m,c) = rhs(i,j,k,m,c) -  &
+     &                          lhs(i,j,k,n+4,c)*rhs(i1,j,k,m,c) -  &
+     &                          lhs(i,j,k,n+5,c)*rhs(i2,j,k,m,c)
+                   end do
+                end do
+             end do
+          end do
+
+!---------------------------------------------------------------------
+!         And the remaining two
+!---------------------------------------------------------------------
+          do    m = 4, 5
+             n = (m-3)*5
+             do   k = start(3,c), ksize-end(3,c)-1
+                do   j = start(2,c), jsize-end(2,c)-1
+                   do    i = iend-2, istart, -1
+                      i1 = i  + 1
+                      i2 = i  + 2
+                      rhs(i,j,k,m,c) = rhs(i,j,k,m,c) -  &
+     &                          lhs(i,j,k,n+4,c)*rhs(i1,j,k,m,c) -  &
+     &                          lhs(i,j,k,n+5,c)*rhs(i2,j,k,m,c)
+                   end do
+                end do
+             end do
+          end do
+
+!---------------------------------------------------------------------
+!         send on information to the previous processor, if needed
+!---------------------------------------------------------------------
+          if (stage .ne.  1) then
+             i  = istart
+             i1 = istart+1
+             p = 0
+             do    m = 1, 5
+                do    k = start(3,c), ksize-end(3,c)-1
+                   do    j = start(2,c), jsize-end(2,c)-1
+                      out_buffer(p+1) = rhs(i,j,k,m,c)
+                      out_buffer(p+2) = rhs(i1,j,k,m,c)
+                      p = p + 2
+                   end do
+                end do
+             end do
+
+!---------------------------------------------------------------------
+!            pack and send the buffer
+!---------------------------------------------------------------------
+             if (timeron) call timer_start(t_xcomm)
+             call mpi_isend(out_buffer, 10*buffer_size,  &
+     &                     dp_type, predecessor(1),  &
+     &                     DEFAULT_TAG, comm_solve,  &
+     &                     requests(2), error)
+             if (timeron) call timer_stop(t_xcomm)
+
+          endif
+
+!---------------------------------------------------------------------
+!         If this was the last stage, do the block-diagonal inversion          
+!---------------------------------------------------------------------
+          if (stage .eq. 1) call ninvr(c)
+
+       end do
+
+       if (timeron) call timer_stop(t_xsolve)
+
+       return
+       end
+    
+
+
+
+
+
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/SP/y_solve.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/SP/y_solve.f90
new file mode 100644
index 000000000..c4effd375
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/SP/y_solve.f90
@@ -0,0 +1,555 @@
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+       subroutine y_solve
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+! this subroutine performs the solution of the approximate factorization
+! step in the y-direction for all five matrix components
+! simultaneously. The Thomas algorithm is employed to solve the
+! systems for the y-lines. Boundary conditions are non-periodic
+!---------------------------------------------------------------------
+
+       use sp_data
+       use mpinpb
+
+       implicit none
+
+       integer i, j, k, stage, ip, kp, n, isize, jend, ksize, j1, j2,  &
+     &         buffer_size, c, m, p, jstart, error,  &
+     &         requests(2), statuses(MPI_STATUS_SIZE, 2)
+       double precision  r1, r2, d, e, s(5), sm1, sm2,  &
+     &                   fac1, fac2
+
+
+!---------------------------------------------------------------------
+! now do a sweep on a layer-by-layer basis, i.e. sweeping through cells
+! on this node in the direction of increasing j for the forward sweep,
+! and after that reversing the direction for the backsubstitution  
+!---------------------------------------------------------------------
+
+       if (timeron) call timer_start(t_ysolve)
+!---------------------------------------------------------------------
+!                          FORWARD ELIMINATION  
+!---------------------------------------------------------------------
+       do    stage = 1, ncells
+          c      = slice(2,stage)
+
+          jstart = 0
+          jend   = cell_size(2,c)-1
+
+          isize     = cell_size(1,c)
+          ksize     = cell_size(3,c)
+          ip        = cell_coord(1,c)-1
+          kp        = cell_coord(3,c)-1
+
+          buffer_size = (isize-start(1,c)-end(1,c)) *  &
+     &                  (ksize-start(3,c)-end(3,c))
+
+          if ( stage .ne. 1) then
+
+!---------------------------------------------------------------------
+!            if this is not the first processor in this row of cells, 
+!            receive data from predecessor containing the right hand
+!            sides and the upper diagonal elements of the previous two rows
+!---------------------------------------------------------------------
+
+             if (timeron) call timer_start(t_ycomm)
+             call mpi_irecv(in_buffer, 22*buffer_size,  &
+     &                      dp_type, predecessor(2),  &
+     &                      DEFAULT_TAG, comm_solve,  &
+     &                      requests(1), error)
+             if (timeron) call timer_stop(t_ycomm)
+
+!---------------------------------------------------------------------
+!            communication has already been started. 
+!            compute the left hand side while waiting for the msg
+!---------------------------------------------------------------------
+             call lhsy(c)
+
+!---------------------------------------------------------------------
+!            wait for pending communication to complete
+!            This waits on the current receive and on the send
+!            from the previous stage. They always come in pairs. 
+!---------------------------------------------------------------------
+             if (timeron) call timer_start(t_ycomm)
+             call mpi_waitall(2, requests, statuses, error)
+             if (timeron) call timer_stop(t_ycomm)
+
+!---------------------------------------------------------------------
+!            unpack the buffer                                 
+!---------------------------------------------------------------------
+             j  = jstart
+             j1 = jstart + 1
+             n = 0
+!---------------------------------------------------------------------
+!            create a running pointer
+!---------------------------------------------------------------------
+             p = 0
+             do    k = start(3,c), ksize-end(3,c)-1
+                do    i = start(1,c), isize-end(1,c)-1
+                   lhs(i,j,k,n+2,c) = lhs(i,j,k,n+2,c) -  &
+     &                       in_buffer(p+1) * lhs(i,j,k,n+1,c)
+                   lhs(i,j,k,n+3,c) = lhs(i,j,k,n+3,c) -  &
+     &                       in_buffer(p+2) * lhs(i,j,k,n+1,c)
+                   do    m = 1, 3
+                      rhs(i,j,k,m,c) = rhs(i,j,k,m,c) -  &
+     &                       in_buffer(p+2+m) * lhs(i,j,k,n+1,c)
+                   end do
+                   d            = in_buffer(p+6)
+                   e            = in_buffer(p+7)
+                   do    m = 1, 3
+                      s(m) = in_buffer(p+7+m)
+                   end do
+                   r1 = lhs(i,j,k,n+2,c)
+                   lhs(i,j,k,n+3,c) = lhs(i,j,k,n+3,c) - d * r1
+                   lhs(i,j,k,n+4,c) = lhs(i,j,k,n+4,c) - e * r1
+                   do    m = 1, 3
+                      rhs(i,j,k,m,c) = rhs(i,j,k,m,c) - s(m) * r1
+                   end do
+                   r2 = lhs(i,j1,k,n+1,c)
+                   lhs(i,j1,k,n+2,c) = lhs(i,j1,k,n+2,c) - d * r2
+                   lhs(i,j1,k,n+3,c) = lhs(i,j1,k,n+3,c) - e * r2
+                   do    m = 1, 3
+                      rhs(i,j1,k,m,c) = rhs(i,j1,k,m,c) - s(m) * r2
+                   end do
+                   p = p + 10
+                end do
+             end do
+
+             do    m = 4, 5
+                n = (m-3)*5
+                do    k = start(3,c), ksize-end(3,c)-1
+                   do    i = start(1,c), isize-end(1,c)-1
+                      lhs(i,j,k,n+2,c) = lhs(i,j,k,n+2,c) -  &
+     &                          in_buffer(p+1) * lhs(i,j,k,n+1,c)
+                      lhs(i,j,k,n+3,c) = lhs(i,j,k,n+3,c) -  &
+     &                          in_buffer(p+2) * lhs(i,j,k,n+1,c)
+                      rhs(i,j,k,m,c)   = rhs(i,j,k,m,c) -  &
+     &                          in_buffer(p+3) * lhs(i,j,k,n+1,c)
+                      d                = in_buffer(p+4)
+                      e                = in_buffer(p+5)
+                      s(m)             = in_buffer(p+6)
+                      r1 = lhs(i,j,k,n+2,c)
+                      lhs(i,j,k,n+3,c) = lhs(i,j,k,n+3,c) - d * r1
+                      lhs(i,j,k,n+4,c) = lhs(i,j,k,n+4,c) - e * r1
+                      rhs(i,j,k,m,c)   = rhs(i,j,k,m,c) - s(m) * r1
+                      r2 = lhs(i,j1,k,n+1,c)
+                      lhs(i,j1,k,n+2,c) = lhs(i,j1,k,n+2,c) - d * r2
+                      lhs(i,j1,k,n+3,c) = lhs(i,j1,k,n+3,c) - e * r2
+                      rhs(i,j1,k,m,c)   = rhs(i,j1,k,m,c) - s(m) * r2
+                      p = p + 6
+                   end do
+                end do
+             end do
+
+          else            
+
+!---------------------------------------------------------------------
+!            if this IS the first cell, we still compute the lhs
+!---------------------------------------------------------------------
+             call lhsy(c)
+          endif
+
+!---------------------------------------------------------------------
+!         perform the Thomas algorithm; first, FORWARD ELIMINATION     
+!---------------------------------------------------------------------
+          n = 0
+
+          do    k = start(3,c), ksize-end(3,c)-1
+             do    j = jstart, jend-2
+                do    i = start(1,c), isize-end(1,c)-1
+                   j1 = j  + 1
+                   j2 = j  + 2
+                   fac1               = 1.d0/lhs(i,j,k,n+3,c)
+                   lhs(i,j,k,n+4,c)   = fac1*lhs(i,j,k,n+4,c)
+                   lhs(i,j,k,n+5,c)   = fac1*lhs(i,j,k,n+5,c)
+                   do    m = 1, 3
+                      rhs(i,j,k,m,c) = fac1*rhs(i,j,k,m,c)
+                   end do
+                   lhs(i,j1,k,n+3,c) = lhs(i,j1,k,n+3,c) -  &
+     &                         lhs(i,j1,k,n+2,c)*lhs(i,j,k,n+4,c)
+                   lhs(i,j1,k,n+4,c) = lhs(i,j1,k,n+4,c) -  &
+     &                         lhs(i,j1,k,n+2,c)*lhs(i,j,k,n+5,c)
+                   do    m = 1, 3
+                      rhs(i,j1,k,m,c) = rhs(i,j1,k,m,c) -  &
+     &                         lhs(i,j1,k,n+2,c)*rhs(i,j,k,m,c)
+                   end do
+                   lhs(i,j2,k,n+2,c) = lhs(i,j2,k,n+2,c) -  &
+     &                         lhs(i,j2,k,n+1,c)*lhs(i,j,k,n+4,c)
+                   lhs(i,j2,k,n+3,c) = lhs(i,j2,k,n+3,c) -  &
+     &                         lhs(i,j2,k,n+1,c)*lhs(i,j,k,n+5,c)
+                   do    m = 1, 3
+                      rhs(i,j2,k,m,c) = rhs(i,j2,k,m,c) -  &
+     &                         lhs(i,j2,k,n+1,c)*rhs(i,j,k,m,c)
+                   end do
+                end do
+             end do
+          end do
+
+!---------------------------------------------------------------------
+!         The last two rows in this grid block are a bit different, 
+!         since they do not have two more rows available for the
+!         elimination of off-diagonal entries
+!---------------------------------------------------------------------
+
+          j  = jend - 1
+          j1 = jend
+          do    k = start(3,c), ksize-end(3,c)-1
+             do    i = start(1,c), isize-end(1,c)-1
+                fac1               = 1.d0/lhs(i,j,k,n+3,c)
+                lhs(i,j,k,n+4,c)   = fac1*lhs(i,j,k,n+4,c)
+                lhs(i,j,k,n+5,c)   = fac1*lhs(i,j,k,n+5,c)
+                do    m = 1, 3
+                   rhs(i,j,k,m,c) = fac1*rhs(i,j,k,m,c)
+                end do
+                lhs(i,j1,k,n+3,c) = lhs(i,j1,k,n+3,c) -  &
+     &                      lhs(i,j1,k,n+2,c)*lhs(i,j,k,n+4,c)
+                lhs(i,j1,k,n+4,c) = lhs(i,j1,k,n+4,c) -  &
+     &                      lhs(i,j1,k,n+2,c)*lhs(i,j,k,n+5,c)
+                do    m = 1, 3
+                   rhs(i,j1,k,m,c) = rhs(i,j1,k,m,c) -  &
+     &                      lhs(i,j1,k,n+2,c)*rhs(i,j,k,m,c)
+                end do
+!---------------------------------------------------------------------
+!               scale the last row immediately (some of this is
+!               overkill in case this is the last cell)
+!---------------------------------------------------------------------
+                fac2               = 1.d0/lhs(i,j1,k,n+3,c)
+                lhs(i,j1,k,n+4,c) = fac2*lhs(i,j1,k,n+4,c)
+                lhs(i,j1,k,n+5,c) = fac2*lhs(i,j1,k,n+5,c)  
+                do    m = 1, 3
+                   rhs(i,j1,k,m,c) = fac2*rhs(i,j1,k,m,c)
+                end do
+             end do
+          end do
+
+!---------------------------------------------------------------------
+!         do the u+c and the u-c factors                 
+!---------------------------------------------------------------------
+          do    m = 4, 5
+             n = (m-3)*5
+             do    k = start(3,c), ksize-end(3,c)-1
+                do    j = jstart, jend-2
+                   do    i = start(1,c), isize-end(1,c)-1
+                   j1 = j  + 1
+                   j2 = j  + 2
+                   fac1               = 1.d0/lhs(i,j,k,n+3,c)
+                   lhs(i,j,k,n+4,c)   = fac1*lhs(i,j,k,n+4,c)
+                   lhs(i,j,k,n+5,c)   = fac1*lhs(i,j,k,n+5,c)
+                   rhs(i,j,k,m,c) = fac1*rhs(i,j,k,m,c)
+                   lhs(i,j1,k,n+3,c) = lhs(i,j1,k,n+3,c) -  &
+     &                         lhs(i,j1,k,n+2,c)*lhs(i,j,k,n+4,c)
+                   lhs(i,j1,k,n+4,c) = lhs(i,j1,k,n+4,c) -  &
+     &                         lhs(i,j1,k,n+2,c)*lhs(i,j,k,n+5,c)
+                   rhs(i,j1,k,m,c) = rhs(i,j1,k,m,c) -  &
+     &                         lhs(i,j1,k,n+2,c)*rhs(i,j,k,m,c)
+                   lhs(i,j2,k,n+2,c) = lhs(i,j2,k,n+2,c) -  &
+     &                         lhs(i,j2,k,n+1,c)*lhs(i,j,k,n+4,c)
+                   lhs(i,j2,k,n+3,c) = lhs(i,j2,k,n+3,c) -  &
+     &                         lhs(i,j2,k,n+1,c)*lhs(i,j,k,n+5,c)
+                   rhs(i,j2,k,m,c) = rhs(i,j2,k,m,c) -  &
+     &                         lhs(i,j2,k,n+1,c)*rhs(i,j,k,m,c)
+                   end do
+                end do
+             end do
+
+!---------------------------------------------------------------------
+!            And again the last two rows separately
+!---------------------------------------------------------------------
+             j  = jend - 1
+             j1 = jend
+             do    k = start(3,c), ksize-end(3,c)-1
+                do    i = start(1,c), isize-end(1,c)-1
+                fac1               = 1.d0/lhs(i,j,k,n+3,c)
+                lhs(i,j,k,n+4,c)   = fac1*lhs(i,j,k,n+4,c)
+                lhs(i,j,k,n+5,c)   = fac1*lhs(i,j,k,n+5,c)
+                rhs(i,j,k,m,c)     = fac1*rhs(i,j,k,m,c)
+                lhs(i,j1,k,n+3,c) = lhs(i,j1,k,n+3,c) -  &
+     &                      lhs(i,j1,k,n+2,c)*lhs(i,j,k,n+4,c)
+                lhs(i,j1,k,n+4,c) = lhs(i,j1,k,n+4,c) -  &
+     &                      lhs(i,j1,k,n+2,c)*lhs(i,j,k,n+5,c)
+                rhs(i,j1,k,m,c)   = rhs(i,j1,k,m,c) -  &
+     &                      lhs(i,j1,k,n+2,c)*rhs(i,j,k,m,c)
+!---------------------------------------------------------------------
+!               Scale the last row immediately (some of this is overkill
+!               if this is the last cell)
+!---------------------------------------------------------------------
+                fac2               = 1.d0/lhs(i,j1,k,n+3,c)
+                lhs(i,j1,k,n+4,c) = fac2*lhs(i,j1,k,n+4,c)
+                lhs(i,j1,k,n+5,c) = fac2*lhs(i,j1,k,n+5,c)
+                rhs(i,j1,k,m,c)   = fac2*rhs(i,j1,k,m,c)
+
+                end do
+             end do
+          end do
+
+!---------------------------------------------------------------------
+!         send information to the next processor, except when this
+!         is the last grid block
+!---------------------------------------------------------------------
+
+          if (stage .ne. ncells) then
+
+!---------------------------------------------------------------------
+!            create a running pointer for the send buffer  
+!---------------------------------------------------------------------
+             p = 0
+             n = 0
+             do    k = start(3,c), ksize-end(3,c)-1
+                do    i = start(1,c), isize-end(1,c)-1
+                   do    j = jend-1, jend
+                      out_buffer(p+1) = lhs(i,j,k,n+4,c)
+                      out_buffer(p+2) = lhs(i,j,k,n+5,c)
+                      do    m = 1, 3
+                         out_buffer(p+2+m) = rhs(i,j,k,m,c)
+                      end do
+                      p = p+5
+                   end do
+                end do
+             end do
+
+             do    m = 4, 5
+                n = (m-3)*5
+                do    k = start(3,c), ksize-end(3,c)-1
+                   do    i = start(1,c), isize-end(1,c)-1
+                      do    j = jend-1, jend
+                         out_buffer(p+1) = lhs(i,j,k,n+4,c)
+                         out_buffer(p+2) = lhs(i,j,k,n+5,c)
+                         out_buffer(p+3) = rhs(i,j,k,m,c)
+                         p = p + 3
+                      end do
+                   end do
+                end do
+             end do
+
+!---------------------------------------------------------------------
+!            pack and send the buffer
+!---------------------------------------------------------------------
+             if (timeron) call timer_start(t_ycomm)
+             call mpi_isend(out_buffer, 22*buffer_size,  &
+     &                     dp_type, successor(2),  &
+     &                     DEFAULT_TAG, comm_solve,  &
+     &                     requests(2), error)
+             if (timeron) call timer_stop(t_ycomm)
+
+          endif
+       end do
+
+!---------------------------------------------------------------------
+!      now go in the reverse direction                      
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!                         BACKSUBSTITUTION 
+!---------------------------------------------------------------------
+       do    stage = ncells, 1, -1
+          c = slice(2,stage)
+
+          jstart = 0
+          jend   = cell_size(2,c)-1
+
+          isize = cell_size(1,c)
+          ksize = cell_size(3,c)
+          ip    = cell_coord(1,c)-1
+          kp    = cell_coord(3,c)-1
+
+          buffer_size = (isize-start(1,c)-end(1,c)) *  &
+     &                  (ksize-start(3,c)-end(3,c))
+
+          if (stage .ne. ncells) then
+
+!---------------------------------------------------------------------
+!            if this is not the starting cell in this row of cells, 
+!            wait for a message to be received, containing the 
+!            solution of the previous two stations     
+!---------------------------------------------------------------------
+
+             if (timeron) call timer_start(t_ycomm)
+             call mpi_irecv(in_buffer, 10*buffer_size,  &
+     &                      dp_type, successor(2),  &
+     &                      DEFAULT_TAG, comm_solve,  &
+     &                      requests(1), error)
+             if (timeron) call timer_stop(t_ycomm)
+
+
+!---------------------------------------------------------------------
+!            communication has already been started
+!            while waiting, do the block-diagonal inversion for the 
+!            cell that was just finished                
+!---------------------------------------------------------------------
+
+             call pinvr(slice(2,stage+1))
+
+!---------------------------------------------------------------------
+!            wait for pending communication to complete
+!---------------------------------------------------------------------
+             if (timeron) call timer_start(t_ycomm)
+             call mpi_waitall(2, requests, statuses, error)
+             if (timeron) call timer_stop(t_ycomm)
+
+!---------------------------------------------------------------------
+!            unpack the buffer for the first three factors         
+!---------------------------------------------------------------------
+             n = 0
+             p = 0
+             j  = jend
+             j1 = j - 1
+             do    m = 1, 3
+                do   k = start(3,c), ksize-end(3,c)-1
+                   do   i = start(1,c), isize-end(1,c)-1
+                      sm1 = in_buffer(p+1)
+                      sm2 = in_buffer(p+2)
+                      rhs(i,j,k,m,c) = rhs(i,j,k,m,c) -  &
+     &                        lhs(i,j,k,n+4,c)*sm1 -  &
+     &                        lhs(i,j,k,n+5,c)*sm2
+                      rhs(i,j1,k,m,c) = rhs(i,j1,k,m,c) -  &
+     &                        lhs(i,j1,k,n+4,c) * rhs(i,j,k,m,c) -  &
+     &                        lhs(i,j1,k,n+5,c) * sm1
+                      p = p + 2
+                   end do
+                end do
+             end do
+
+!---------------------------------------------------------------------
+!            now unpack the buffer for the remaining two factors
+!---------------------------------------------------------------------
+             do    m = 4, 5
+                n = (m-3)*5
+                do   k = start(3,c), ksize-end(3,c)-1
+                   do   i = start(1,c), isize-end(1,c)-1
+                      sm1 = in_buffer(p+1)
+                      sm2 = in_buffer(p+2)
+                      rhs(i,j,k,m,c) = rhs(i,j,k,m,c) -  &
+     &                        lhs(i,j,k,n+4,c)*sm1 -  &
+     &                        lhs(i,j,k,n+5,c)*sm2
+                      rhs(i,j1,k,m,c) = rhs(i,j1,k,m,c) -  &
+     &                        lhs(i,j1,k,n+4,c) * rhs(i,j,k,m,c) -  &
+     &                        lhs(i,j1,k,n+5,c) * sm1
+                      p = p + 2
+                   end do
+                end do
+             end do
+
+          else
+!---------------------------------------------------------------------
+!            now we know this is the first grid block on the back sweep,
+!            so we don't need a message to start the substitution. 
+!---------------------------------------------------------------------
+
+             j  = jend - 1
+             j1 = jend
+             n = 0
+             do   m = 1, 3
+                do   k = start(3,c), ksize-end(3,c)-1
+                   do   i = start(1,c), isize-end(1,c)-1
+                      rhs(i,j,k,m,c) = rhs(i,j,k,m,c) -  &
+     &                             lhs(i,j,k,n+4,c)*rhs(i,j1,k,m,c)
+                   end do
+                end do
+             end do
+
+             do    m = 4, 5
+                n = (m-3)*5
+                do   k = start(3,c), ksize-end(3,c)-1
+                   do   i = start(1,c), isize-end(1,c)-1
+                      rhs(i,j,k,m,c) = rhs(i,j,k,m,c) -  &
+     &                             lhs(i,j,k,n+4,c)*rhs(i,j1,k,m,c)
+                   end do
+                end do
+             end do
+          endif
+
+!---------------------------------------------------------------------
+!         Whether or not this is the last processor, we always have
+!         to complete the back-substitution 
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!         The first three factors
+!---------------------------------------------------------------------
+          n = 0
+          do   m = 1, 3
+             do   k = start(3,c), ksize-end(3,c)-1
+                do   j = jend-2, jstart, -1
+                   do    i = start(1,c), isize-end(1,c)-1
+                      j1 = j  + 1
+                      j2 = j  + 2
+                      rhs(i,j,k,m,c) = rhs(i,j,k,m,c) -  &
+     &                          lhs(i,j,k,n+4,c)*rhs(i,j1,k,m,c) -  &
+     &                          lhs(i,j,k,n+5,c)*rhs(i,j2,k,m,c)
+                   end do
+                end do
+             end do
+          end do
+
+!---------------------------------------------------------------------
+!         And the remaining two
+!---------------------------------------------------------------------
+          do    m = 4, 5
+             n = (m-3)*5
+             do   k = start(3,c), ksize-end(3,c)-1
+                do   j = jend-2, jstart, -1
+                   do    i = start(1,c), isize-end(1,c)-1
+                      j1 = j  + 1
+                      j2 = j1 + 1
+                      rhs(i,j,k,m,c) = rhs(i,j,k,m,c) -  &
+     &                          lhs(i,j,k,n+4,c)*rhs(i,j1,k,m,c) -  &
+     &                          lhs(i,j,k,n+5,c)*rhs(i,j2,k,m,c)
+                   end do
+                end do
+             end do
+          end do
+
+!---------------------------------------------------------------------
+!         send on information to the previous processor, if needed
+!---------------------------------------------------------------------
+          if (stage .ne.  1) then
+             j  = jstart
+             j1 = jstart + 1
+             p = 0
+             do    m = 1, 5
+                do    k = start(3,c), ksize-end(3,c)-1
+                   do    i = start(1,c), isize-end(1,c)-1
+                      out_buffer(p+1) = rhs(i,j,k,m,c)
+                      out_buffer(p+2) = rhs(i,j1,k,m,c)
+                      p = p + 2
+                   end do
+                end do
+             end do
+
+!---------------------------------------------------------------------
+!            pack and send the buffer
+!---------------------------------------------------------------------
+
+             if (timeron) call timer_start(t_ycomm)
+             call mpi_isend(out_buffer, 10*buffer_size,  &
+     &                     dp_type, predecessor(2),  &
+     &                     DEFAULT_TAG, comm_solve,  &
+     &                     requests(2), error)
+             if (timeron) call timer_stop(t_ycomm)
+
+          endif
+
+!---------------------------------------------------------------------
+!         If this was the last stage, do the block-diagonal inversion          
+!---------------------------------------------------------------------
+          if (stage .eq. 1) call pinvr(c)
+
+       end do
+
+       if (timeron) call timer_stop(t_ysolve)
+
+       return
+       end
+    
+
+
+
+
+
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/SP/z_solve.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/SP/z_solve.f90
new file mode 100644
index 000000000..8fb121a29
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/SP/z_solve.f90
@@ -0,0 +1,549 @@
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+       subroutine z_solve
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+! this function performs the solution of the approximate factorization
+! step in the z-direction for all five matrix components
+! simultaneously. The Thomas algorithm is employed to solve the
+! systems for the z-lines. Boundary conditions are non-periodic
+!---------------------------------------------------------------------
+
+       use sp_data
+       use mpinpb
+
+       implicit none
+
+       integer i, j, k, stage, ip, jp, n, isize, jsize, kend, k1, k2,  &
+     &         buffer_size, c, m, p, kstart, error,  &
+     &         requests(2), statuses(MPI_STATUS_SIZE, 2)
+       double precision  r1, r2, d, e, s(5), sm1, sm2,  &
+     &                   fac1, fac2
+
+!---------------------------------------------------------------------
+! now do a sweep on a layer-by-layer basis, i.e. sweeping through cells
+! on this node in the direction of increasing k for the forward sweep,
+! and after that reversing the direction for the backsubstitution  
+!---------------------------------------------------------------------
+
+       if (timeron) call timer_start(t_zsolve)
+!---------------------------------------------------------------------
+!                          FORWARD ELIMINATION  
+!---------------------------------------------------------------------
+       do    stage = 1, ncells
+          c         = slice(3,stage)
+
+          kstart = 0
+          kend   = cell_size(3,c)-1
+
+          isize     = cell_size(1,c)
+          jsize     = cell_size(2,c)
+          ip        = cell_coord(1,c)-1
+          jp        = cell_coord(2,c)-1
+
+          buffer_size = (isize-start(1,c)-end(1,c)) *  &
+     &                  (jsize-start(2,c)-end(2,c))
+
+          if (stage .ne. 1) then
+
+
+!---------------------------------------------------------------------
+!            if this is not the first processor in this row of cells, 
+!            receive data from predecessor containing the right hand
+!            sides and the upper diagonal elements of the previous two rows
+!---------------------------------------------------------------------
+
+             if (timeron) call timer_start(t_zcomm)
+             call mpi_irecv(in_buffer, 22*buffer_size,  &
+     &                      dp_type, predecessor(3),  &
+     &                      DEFAULT_TAG, comm_solve,  &
+     &                      requests(1), error)
+             if (timeron) call timer_stop(t_zcomm)
+
+
+!---------------------------------------------------------------------
+!            communication has already been started. 
+!            compute the left hand side while waiting for the msg
+!---------------------------------------------------------------------
+             call lhsz(c)
+
+!---------------------------------------------------------------------
+!            wait for pending communication to complete
+!---------------------------------------------------------------------
+             if (timeron) call timer_start(t_zcomm)
+             call mpi_waitall(2, requests, statuses, error)
+             if (timeron) call timer_stop(t_zcomm)
+            
+!---------------------------------------------------------------------
+!            unpack the buffer                                 
+!---------------------------------------------------------------------
+             k  = kstart
+             k1 = kstart + 1
+             n = 0
+
+!---------------------------------------------------------------------
+!            create a running pointer
+!---------------------------------------------------------------------
+             p = 0
+             do    j = start(2,c), jsize-end(2,c)-1
+                do    i = start(1,c), isize-end(1,c)-1
+                   lhs(i,j,k,n+2,c) = lhs(i,j,k,n+2,c) -  &
+     &                       in_buffer(p+1) * lhs(i,j,k,n+1,c)
+                   lhs(i,j,k,n+3,c) = lhs(i,j,k,n+3,c) -  &
+     &                       in_buffer(p+2) * lhs(i,j,k,n+1,c)
+                   do    m = 1, 3
+                      rhs(i,j,k,m,c) = rhs(i,j,k,m,c) -  &
+     &                       in_buffer(p+2+m) * lhs(i,j,k,n+1,c)
+                   end do
+                   d            = in_buffer(p+6)
+                   e            = in_buffer(p+7)
+                   do    m = 1, 3
+                      s(m) = in_buffer(p+7+m)
+                   end do
+                   r1 = lhs(i,j,k,n+2,c)
+                   lhs(i,j,k,n+3,c) = lhs(i,j,k,n+3,c) - d * r1
+                   lhs(i,j,k,n+4,c) = lhs(i,j,k,n+4,c) - e * r1
+                   do    m = 1, 3
+                      rhs(i,j,k,m,c) = rhs(i,j,k,m,c) - s(m) * r1
+                   end do
+                   r2 = lhs(i,j,k1,n+1,c)
+                   lhs(i,j,k1,n+2,c) = lhs(i,j,k1,n+2,c) - d * r2
+                   lhs(i,j,k1,n+3,c) = lhs(i,j,k1,n+3,c) - e * r2
+                   do    m = 1, 3
+                      rhs(i,j,k1,m,c) = rhs(i,j,k1,m,c) - s(m) * r2
+                   end do
+                   p = p + 10
+                end do
+             end do
+
+             do    m = 4, 5
+                n = (m-3)*5
+                do    j = start(2,c), jsize-end(2,c)-1
+                   do    i = start(1,c), isize-end(1,c)-1
+                      lhs(i,j,k,n+2,c) = lhs(i,j,k,n+2,c) -  &
+     &                          in_buffer(p+1) * lhs(i,j,k,n+1,c)
+                      lhs(i,j,k,n+3,c) = lhs(i,j,k,n+3,c) -  &
+     &                          in_buffer(p+2) * lhs(i,j,k,n+1,c)
+                      rhs(i,j,k,m,c)   = rhs(i,j,k,m,c) -  &
+     &                          in_buffer(p+3) * lhs(i,j,k,n+1,c)
+                      d                = in_buffer(p+4)
+                      e                = in_buffer(p+5)
+                      s(m)             = in_buffer(p+6)
+                      r1 = lhs(i,j,k,n+2,c)
+                      lhs(i,j,k,n+3,c) = lhs(i,j,k,n+3,c) - d * r1
+                      lhs(i,j,k,n+4,c) = lhs(i,j,k,n+4,c) - e * r1
+                      rhs(i,j,k,m,c)   = rhs(i,j,k,m,c) - s(m) * r1
+                      r2 = lhs(i,j,k1,n+1,c)
+                      lhs(i,j,k1,n+2,c) = lhs(i,j,k1,n+2,c) - d * r2
+                      lhs(i,j,k1,n+3,c) = lhs(i,j,k1,n+3,c) - e * r2
+                      rhs(i,j,k1,m,c)   = rhs(i,j,k1,m,c) - s(m) * r2
+                      p = p + 6
+                   end do
+                end do
+             end do
+
+          else            
+
+!---------------------------------------------------------------------
+!            if this IS the first cell, we still compute the lhs
+!---------------------------------------------------------------------
+             call lhsz(c)
+          endif
+
+!---------------------------------------------------------------------
+!         perform the Thomas algorithm; first, FORWARD ELIMINATION     
+!---------------------------------------------------------------------
+          n = 0
+
+          do    k = kstart, kend-2
+             do    j = start(2,c), jsize-end(2,c)-1
+                do    i = start(1,c), isize-end(1,c)-1
+                   k1 = k  + 1
+                   k2 = k  + 2
+                   fac1               = 1.d0/lhs(i,j,k,n+3,c)
+                   lhs(i,j,k,n+4,c)   = fac1*lhs(i,j,k,n+4,c)
+                   lhs(i,j,k,n+5,c)   = fac1*lhs(i,j,k,n+5,c)
+                   do    m = 1, 3
+                      rhs(i,j,k,m,c) = fac1*rhs(i,j,k,m,c)
+                   end do
+                   lhs(i,j,k1,n+3,c) = lhs(i,j,k1,n+3,c) -  &
+     &                         lhs(i,j,k1,n+2,c)*lhs(i,j,k,n+4,c)
+                   lhs(i,j,k1,n+4,c) = lhs(i,j,k1,n+4,c) -  &
+     &                         lhs(i,j,k1,n+2,c)*lhs(i,j,k,n+5,c)
+                   do    m = 1, 3
+                      rhs(i,j,k1,m,c) = rhs(i,j,k1,m,c) -  &
+     &                         lhs(i,j,k1,n+2,c)*rhs(i,j,k,m,c)
+                   end do
+                   lhs(i,j,k2,n+2,c) = lhs(i,j,k2,n+2,c) -  &
+     &                         lhs(i,j,k2,n+1,c)*lhs(i,j,k,n+4,c)
+                   lhs(i,j,k2,n+3,c) = lhs(i,j,k2,n+3,c) -  &
+     &                         lhs(i,j,k2,n+1,c)*lhs(i,j,k,n+5,c)
+                   do    m = 1, 3
+                      rhs(i,j,k2,m,c) = rhs(i,j,k2,m,c) -  &
+     &                         lhs(i,j,k2,n+1,c)*rhs(i,j,k,m,c)
+                   end do
+                end do
+             end do
+          end do
+
+!---------------------------------------------------------------------
+!         The last two rows in this grid block are a bit different, 
+!         since they do not have two more rows available for the
+!         elimination of off-diagonal entries
+!---------------------------------------------------------------------
+          k  = kend - 1
+          k1 = kend
+          do    j = start(2,c), jsize-end(2,c)-1
+             do    i = start(1,c), isize-end(1,c)-1
+                fac1               = 1.d0/lhs(i,j,k,n+3,c)
+                lhs(i,j,k,n+4,c)   = fac1*lhs(i,j,k,n+4,c)
+                lhs(i,j,k,n+5,c)   = fac1*lhs(i,j,k,n+5,c)
+                do    m = 1, 3
+                   rhs(i,j,k,m,c) = fac1*rhs(i,j,k,m,c)
+                end do
+                lhs(i,j,k1,n+3,c) = lhs(i,j,k1,n+3,c) -  &
+     &                      lhs(i,j,k1,n+2,c)*lhs(i,j,k,n+4,c)
+                lhs(i,j,k1,n+4,c) = lhs(i,j,k1,n+4,c) -  &
+     &                      lhs(i,j,k1,n+2,c)*lhs(i,j,k,n+5,c)
+                do    m = 1, 3
+                   rhs(i,j,k1,m,c) = rhs(i,j,k1,m,c) -  &
+     &                      lhs(i,j,k1,n+2,c)*rhs(i,j,k,m,c)
+                end do
+!---------------------------------------------------------------------
+!               scale the last row immediately (some of this is
+!               overkill in case this is the last cell)
+!---------------------------------------------------------------------
+                fac2               = 1.d0/lhs(i,j,k1,n+3,c)
+                lhs(i,j,k1,n+4,c) = fac2*lhs(i,j,k1,n+4,c)
+                lhs(i,j,k1,n+5,c) = fac2*lhs(i,j,k1,n+5,c)  
+                do    m = 1, 3
+                   rhs(i,j,k1,m,c) = fac2*rhs(i,j,k1,m,c)
+                end do
+             end do
+          end do
+
+!---------------------------------------------------------------------
+!         do the u+c and the u-c factors               
+!---------------------------------------------------------------------
+          do   m = 4, 5
+             n = (m-3)*5
+             do    k = kstart, kend-2
+                do    j = start(2,c), jsize-end(2,c)-1
+                   do    i = start(1,c), isize-end(1,c)-1
+                      k1 = k  + 1
+                      k2 = k  + 2
+                      fac1               = 1.d0/lhs(i,j,k,n+3,c)
+                      lhs(i,j,k,n+4,c)   = fac1*lhs(i,j,k,n+4,c)
+                      lhs(i,j,k,n+5,c)   = fac1*lhs(i,j,k,n+5,c)
+                      rhs(i,j,k,m,c) = fac1*rhs(i,j,k,m,c)
+                      lhs(i,j,k1,n+3,c) = lhs(i,j,k1,n+3,c) -  &
+     &                            lhs(i,j,k1,n+2,c)*lhs(i,j,k,n+4,c)
+                      lhs(i,j,k1,n+4,c) = lhs(i,j,k1,n+4,c) -  &
+     &                            lhs(i,j,k1,n+2,c)*lhs(i,j,k,n+5,c)
+                      rhs(i,j,k1,m,c) = rhs(i,j,k1,m,c) -  &
+     &                            lhs(i,j,k1,n+2,c)*rhs(i,j,k,m,c)
+                      lhs(i,j,k2,n+2,c) = lhs(i,j,k2,n+2,c) -  &
+     &                            lhs(i,j,k2,n+1,c)*lhs(i,j,k,n+4,c)
+                      lhs(i,j,k2,n+3,c) = lhs(i,j,k2,n+3,c) -  &
+     &                            lhs(i,j,k2,n+1,c)*lhs(i,j,k,n+5,c)
+                      rhs(i,j,k2,m,c) = rhs(i,j,k2,m,c) -  &
+     &                            lhs(i,j,k2,n+1,c)*rhs(i,j,k,m,c)
+                   end do
+                end do
+             end do
+
+!---------------------------------------------------------------------
+!            And again the last two rows separately
+!---------------------------------------------------------------------
+             k  = kend - 1
+             k1 = kend
+             do    j = start(2,c), jsize-end(2,c)-1
+                do    i = start(1,c), isize-end(1,c)-1
+                   fac1               = 1.d0/lhs(i,j,k,n+3,c)
+                   lhs(i,j,k,n+4,c)   = fac1*lhs(i,j,k,n+4,c)
+                   lhs(i,j,k,n+5,c)   = fac1*lhs(i,j,k,n+5,c)
+                   rhs(i,j,k,m,c)     = fac1*rhs(i,j,k,m,c)
+                   lhs(i,j,k1,n+3,c) = lhs(i,j,k1,n+3,c) -  &
+     &                         lhs(i,j,k1,n+2,c)*lhs(i,j,k,n+4,c)
+                   lhs(i,j,k1,n+4,c) = lhs(i,j,k1,n+4,c) -  &
+     &                         lhs(i,j,k1,n+2,c)*lhs(i,j,k,n+5,c)
+                   rhs(i,j,k1,m,c)   = rhs(i,j,k1,m,c) -  &
+     &                         lhs(i,j,k1,n+2,c)*rhs(i,j,k,m,c)
+!---------------------------------------------------------------------
+!               Scale the last row immediately (some of this is overkill
+!               if this is the last cell)
+!---------------------------------------------------------------------
+                   fac2               = 1.d0/lhs(i,j,k1,n+3,c)
+                   lhs(i,j,k1,n+4,c) = fac2*lhs(i,j,k1,n+4,c)
+                   lhs(i,j,k1,n+5,c) = fac2*lhs(i,j,k1,n+5,c)
+                   rhs(i,j,k1,m,c)   = fac2*rhs(i,j,k1,m,c)
+
+                end do
+             end do
+          end do
+
+!---------------------------------------------------------------------
+!         send information to the next processor, except when this
+!         is the last grid block
+!---------------------------------------------------------------------
+
+          if (stage .ne. ncells) then
+
+!---------------------------------------------------------------------
+!            create a running pointer for the send buffer  
+!---------------------------------------------------------------------
+             p = 0
+             n = 0
+             do    j = start(2,c), jsize-end(2,c)-1
+                do    i = start(1,c), isize-end(1,c)-1
+                   do    k = kend-1, kend
+                      out_buffer(p+1) = lhs(i,j,k,n+4,c)
+                      out_buffer(p+2) = lhs(i,j,k,n+5,c)
+                      do    m = 1, 3
+                         out_buffer(p+2+m) = rhs(i,j,k,m,c)
+                      end do
+                      p = p+5
+                   end do
+                end do
+             end do
+
+             do    m = 4, 5
+                n = (m-3)*5
+                do    j = start(2,c), jsize-end(2,c)-1
+                   do    i = start(1,c), isize-end(1,c)-1
+                      do    k = kend-1, kend
+                         out_buffer(p+1) = lhs(i,j,k,n+4,c)
+                         out_buffer(p+2) = lhs(i,j,k,n+5,c)
+                         out_buffer(p+3) = rhs(i,j,k,m,c)
+                         p = p + 3
+                      end do
+                   end do
+                end do
+             end do
+
+
+             if (timeron) call timer_start(t_zcomm)
+             call mpi_isend(out_buffer, 22*buffer_size,  &
+     &                     dp_type, successor(3),  &
+     &                     DEFAULT_TAG, comm_solve,  &
+     &                     requests(2), error)
+             if (timeron) call timer_stop(t_zcomm)
+
+          endif
+       end do
+
+!---------------------------------------------------------------------
+!      now go in the reverse direction                      
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!                         BACKSUBSTITUTION 
+!---------------------------------------------------------------------
+       do    stage = ncells, 1, -1
+          c = slice(3,stage)
+
+          kstart = 0
+          kend   = cell_size(3,c)-1
+
+          isize     = cell_size(1,c)
+          jsize     = cell_size(2,c)
+          ip        = cell_coord(1,c)-1
+          jp        = cell_coord(2,c)-1
+
+          buffer_size = (isize-start(1,c)-end(1,c)) *  &
+     &                  (jsize-start(2,c)-end(2,c))
+
+          if (stage .ne. ncells) then
+
+!---------------------------------------------------------------------
+!            if this is not the starting cell in this row of cells, 
+!            wait for a message to be received, containing the 
+!            solution of the previous two stations     
+!---------------------------------------------------------------------
+
+             if (timeron) call timer_start(t_zcomm)
+             call mpi_irecv(in_buffer, 10*buffer_size,  &
+     &                      dp_type, successor(3),  &
+     &                      DEFAULT_TAG, comm_solve,  &
+     &                      requests(1), error)
+             if (timeron) call timer_stop(t_zcomm)
+
+
+!---------------------------------------------------------------------
+!            communication has already been started
+!            while waiting, do the  block-diagonal inversion for the 
+!            cell that was just finished                
+!---------------------------------------------------------------------
+
+             call tzetar(slice(3,stage+1))
+
+!---------------------------------------------------------------------
+!            wait for pending communication to complete
+!---------------------------------------------------------------------
+             if (timeron) call timer_start(t_zcomm)
+             call mpi_waitall(2, requests, statuses, error)
+             if (timeron) call timer_stop(t_zcomm)
+
+!---------------------------------------------------------------------
+!            unpack the buffer for the first three factors         
+!---------------------------------------------------------------------
+             n = 0
+             p = 0
+             k  = kend
+             k1 = k - 1
+             do    m = 1, 3
+                do   j = start(2,c), jsize-end(2,c)-1
+                   do   i = start(1,c), isize-end(1,c)-1
+                      sm1 = in_buffer(p+1)
+                      sm2 = in_buffer(p+2)
+                      rhs(i,j,k,m,c) = rhs(i,j,k,m,c) -  &
+     &                        lhs(i,j,k,n+4,c)*sm1 -  &
+     &                        lhs(i,j,k,n+5,c)*sm2
+                      rhs(i,j,k1,m,c) = rhs(i,j,k1,m,c) -  &
+     &                        lhs(i,j,k1,n+4,c) * rhs(i,j,k,m,c) -  &
+     &                        lhs(i,j,k1,n+5,c) * sm1
+                      p = p + 2
+                   end do
+                end do
+             end do
+
+!---------------------------------------------------------------------
+!            now unpack the buffer for the remaining two factors
+!---------------------------------------------------------------------
+             do    m = 4, 5
+                n = (m-3)*5
+                do   j = start(2,c), jsize-end(2,c)-1
+                   do   i = start(1,c), isize-end(1,c)-1
+                      sm1 = in_buffer(p+1)
+                      sm2 = in_buffer(p+2)
+                      rhs(i,j,k,m,c) = rhs(i,j,k,m,c) -  &
+     &                        lhs(i,j,k,n+4,c)*sm1 -  &
+     &                        lhs(i,j,k,n+5,c)*sm2
+                      rhs(i,j,k1,m,c) = rhs(i,j,k1,m,c) -  &
+     &                        lhs(i,j,k1,n+4,c) * rhs(i,j,k,m,c) -  &
+     &                        lhs(i,j,k1,n+5,c) * sm1
+                      p = p + 2
+                   end do
+                end do
+             end do
+
+          else
+
+!---------------------------------------------------------------------
+!            now we know this is the first grid block on the back sweep,
+!            so we don't need a message to start the substitution. 
+!---------------------------------------------------------------------
+
+             k  = kend - 1
+             k1 = kend
+             n = 0
+             do   m = 1, 3
+                do   j = start(2,c), jsize-end(2,c)-1
+                   do   i = start(1,c), isize-end(1,c)-1
+                      rhs(i,j,k,m,c) = rhs(i,j,k,m,c) -  &
+     &                             lhs(i,j,k,n+4,c)*rhs(i,j,k1,m,c)
+                   end do
+                end do
+             end do
+
+             do    m = 4, 5
+                n = (m-3)*5
+                do   j = start(2,c), jsize-end(2,c)-1
+                   do   i = start(1,c), isize-end(1,c)-1
+                      rhs(i,j,k,m,c) = rhs(i,j,k,m,c) -  &
+     &                             lhs(i,j,k,n+4,c)*rhs(i,j,k1,m,c)
+                   end do
+                end do
+             end do
+          endif
+
+!---------------------------------------------------------------------
+!         Whether or not this is the last processor, we always have
+!         to complete the back-substitution 
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!         The first three factors
+!---------------------------------------------------------------------
+          n = 0
+          do   m = 1, 3
+             do   k = kend-2, kstart, -1
+                do   j = start(2,c), jsize-end(2,c)-1
+                   do    i = start(1,c), isize-end(1,c)-1
+                      k1 = k  + 1
+                      k2 = k  + 2
+                      rhs(i,j,k,m,c) = rhs(i,j,k,m,c) -  &
+     &                          lhs(i,j,k,n+4,c)*rhs(i,j,k1,m,c) -  &
+     &                          lhs(i,j,k,n+5,c)*rhs(i,j,k2,m,c)
+                   end do
+                end do
+             end do
+          end do
+
+!---------------------------------------------------------------------
+!         And the remaining two
+!---------------------------------------------------------------------
+          do    m = 4, 5
+             n = (m-3)*5
+             do   k = kend-2, kstart, -1
+                do   j = start(2,c), jsize-end(2,c)-1
+                   do    i = start(1,c), isize-end(1,c)-1
+                      k1 = k  + 1
+                      k2 = k  + 2
+                      rhs(i,j,k,m,c) = rhs(i,j,k,m,c) -  &
+     &                          lhs(i,j,k,n+4,c)*rhs(i,j,k1,m,c) -  &
+     &                          lhs(i,j,k,n+5,c)*rhs(i,j,k2,m,c)
+                   end do
+                end do
+             end do
+          end do
+
+!---------------------------------------------------------------------
+!         send on information to the previous processor, if needed
+!---------------------------------------------------------------------
+          if (stage .ne.  1) then
+             k  = kstart
+             k1 = kstart + 1
+             p = 0
+             do    m = 1, 5
+                do    j = start(2,c), jsize-end(2,c)-1
+                   do    i = start(1,c), isize-end(1,c)-1
+                      out_buffer(p+1) = rhs(i,j,k,m,c)
+                      out_buffer(p+2) = rhs(i,j,k1,m,c)
+                      p = p + 2
+                   end do
+                end do
+             end do
+
+             if (timeron) call timer_start(t_zcomm)
+             call mpi_isend(out_buffer, 10*buffer_size,  &
+     &                     dp_type, predecessor(3),  &
+     &                     DEFAULT_TAG, comm_solve,  &
+     &                     requests(2), error)
+             if (timeron) call timer_stop(t_zcomm)
+
+          endif
+
+!---------------------------------------------------------------------
+!         If this was the last stage, do the block-diagonal inversion
+!---------------------------------------------------------------------
+          if (stage .eq. 1) call tzetar(c)
+
+       end do
+
+       if (timeron) call timer_stop(t_zsolve)
+
+       return
+       end
+    
+
+
+
+
+
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/common/c_print_results.c b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/common/c_print_results.c
new file mode 100644
index 000000000..f42336c98
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/common/c_print_results.c
@@ -0,0 +1,97 @@
+/*****************************************************************/
+/******     C  _  P  R  I  N  T  _  R  E  S  U  L  T  S     ******/
+/*****************************************************************/
+#include <stdlib.h>
+#include <stdio.h>
+
+void c_print_results( char   *name,
+                      char   class,
+                      int    n1, 
+                      int    n2,
+                      int    n3,
+                      int    niter,
+                      int    nprocs_active,
+                      int    nprocs_total,
+                      double t,
+                      double mops,
+		      char   *optype,
+                      int    passed_verification,
+                      char   *npbversion,
+                      char   *compiletime,
+                      char   *mpicc,
+                      char   *clink,
+                      char   *cmpi_lib,
+                      char   *cmpi_inc,
+                      char   *cflags,
+                      char   *clinkflags )
+{
+    char *evalue="1000";
+
+    printf( "\n\n %s Benchmark Completed\n", name ); 
+
+    printf( " Class           =                        %c\n", class );
+
+    if( n3 == 0 ) {
+        long nn = n1;
+        if ( n2 != 0 ) nn *= n2;
+        printf( " Size            =             %12ld\n", nn );   /* as in IS */
+    }
+    else
+        printf( " Size            =              %3dx %3dx %3d\n", n1,n2,n3 );
+
+    printf( " Iterations      =             %12d\n", niter );
+ 
+    printf( " Time in seconds =             %12.2f\n", t );
+
+    printf( " Total processes =             %12d\n", nprocs_total );
+
+    if ( nprocs_active != 0 )
+        printf( " Active processes=             %12d\n", nprocs_active );
+
+    printf( " Mop/s total     =             %12.2f\n", mops );
+
+    printf( " Mop/s/process   =             %12.2f\n", mops/((float) nprocs_total) );
+
+    printf( " Operation type  = %24s\n", optype);
+
+    if( passed_verification )
+        printf( " Verification    =               SUCCESSFUL\n" );
+    else
+        printf( " Verification    =             UNSUCCESSFUL\n" );
+
+    printf( " Version         =             %12s\n", npbversion );
+
+    printf( " Compile date    =             %12s\n", compiletime );
+
+    printf( "\n Compile options:\n" );
+
+    printf( "    MPICC        = %s\n", mpicc );
+
+    printf( "    CLINK        = %s\n", clink );
+
+    printf( "    CMPI_LIB     = %s\n", cmpi_lib );
+
+    printf( "    CMPI_INC     = %s\n", cmpi_inc );
+
+    printf( "    CFLAGS       = %s\n", cflags );
+
+    printf( "    CLINKFLAGS   = %s\n", clinkflags );
+#ifdef SMP
+    evalue = getenv("MP_SET_NUMTHREADS");
+    printf( "   MULTICPUS = %s\n", evalue );
+#endif
+
+    printf( "\n\n" );
+    printf( " Please send feedbacks and/or the results of this run to:\n\n" );
+    printf( " NPB Development Team\n" );
+    printf( " npb@nas.nasa.gov\n\n\n" );
+/*    printf( " Please send the results of this run to:\n\n" );
+    printf( " NPB Development Team\n" );
+    printf( " Internet: npb@nas.nasa.gov\n \n" );
+    printf( " If email is not available, send this to:\n\n" );
+    printf( " MS T27A-1\n" );
+    printf( " NASA Ames Research Center\n" );
+    printf( " Moffett Field, CA  94035-1000\n\n" );
+    printf( " Fax: 650-604-3957\n\n" );*/
+}
+ 
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/common/c_timers.c b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/common/c_timers.c
new file mode 100644
index 000000000..0e0bcc12a
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/common/c_timers.c
@@ -0,0 +1,77 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include "mpi.h"
+
+static double start[64], elapsed[64];
+
+/*****************************************************************/
+/******            T  I  M  E  R  _  C  L  E  A  R          ******/
+/*****************************************************************/
+void timer_clear( int n )
+{
+    elapsed[n] = 0.0;
+}
+
+
+/*****************************************************************/
+/******            T  I  M  E  R  _  S  T  A  R  T          ******/
+/*****************************************************************/
+void timer_start( int n )
+{
+    start[n] = MPI_Wtime();
+}
+
+
+/*****************************************************************/
+/******            T  I  M  E  R  _  S  T  O  P             ******/
+/*****************************************************************/
+void timer_stop( int n )
+{
+    double t, now;
+
+    now = MPI_Wtime();
+    t = now - start[n];
+    elapsed[n] += t;
+
+}
+
+
+/*****************************************************************/
+/******            T  I  M  E  R  _  R  E  A  D             ******/
+/*****************************************************************/
+double timer_read( int n )
+{
+    return( elapsed[n] );
+}
+
+
+/*****************************************************************/
+/******            C H E C K _ T I M E R _ F L A G          ******/
+/*****************************************************************/
+int check_timer_flag( void )
+{
+    int timer_on = 0;
+    char *ev = getenv("NPB_TIMER_FLAG");
+
+    if (ev) {
+        if (*ev == '\0')
+            timer_on = 1;
+        else if (*ev >= '1' && *ev <= '9')
+            timer_on = 1;
+        else if (strcmp(ev, "on") == 0 || strcmp(ev, "ON") == 0 ||
+                 strcmp(ev, "yes") == 0 || strcmp(ev, "YES") == 0 ||
+                 strcmp(ev, "true") == 0 || strcmp(ev, "TRUE") == 0)
+            timer_on = 1;
+    }
+    else {
+        FILE *fp = fopen("timer.flag", "r");
+        if (fp != NULL) {
+            fclose(fp);
+            timer_on = 1;
+        }
+    }
+
+    return timer_on;
+}
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/common/c_timers.h b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/common/c_timers.h
new file mode 100644
index 000000000..ea3a2ceb0
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/common/c_timers.h
@@ -0,0 +1,11 @@
+#ifndef __C_TIMERS_H
+#define __C_TIMERS_H
+
+extern void   timer_clear( int n );
+extern void   timer_start( int n );
+extern void   timer_stop( int n );
+extern double timer_read( int n );
+extern int    check_timer_flag( void );
+
+#endif
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/common/get_active_nprocs.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/common/get_active_nprocs.f90
new file mode 100644
index 000000000..2b016a508
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/common/get_active_nprocs.f90
@@ -0,0 +1,117 @@
+!>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
+! return the largest np1 and np2 such that np1 * np2 <= nprocs
+! pkind = 1, np1 = np2                 (square number)
+!         2, np1/2 <= np2 <= np1
+!         3, np1 = np2 or np1 = np2*2  (power of 2)
+! other outputs:
+!     npa = np1 * np2 (active number of processes)
+!     nprocs   - total number of processes
+!     rank     - rank of this process
+!     comm_out - MPI communicator
+!     active   - .true. if this process is active; .false. otherwise
+!>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
+      subroutine get_active_nprocs(pkind, np1, np2, npa,  &
+     &                             nprocs, rank, comm_out, active)
+      implicit none
+      integer pkind, np1, np2, npa
+      integer nprocs, rank, comm_out
+      logical active
+
+      include 'mpif.h'
+
+      integer comm_in, n, np1w, np2w, npaw
+      integer ic, ios
+      character(40) val
+
+! nprocs and rank in COMM_WORLD
+      comm_in = MPI_COMM_WORLD
+      call mpi_comm_size(comm_in, nprocs, ios)
+      call mpi_comm_rank(comm_in, rank, ios)
+
+      if (pkind <= 1) then
+! square number of processes (add small number to allow for roundoff)
+         np2 = int(sqrt(dble(nprocs) + 1.0d-3))
+         np1 = np2
+      else
+! power-of-two processes
+         np1 = int(log(dble(nprocs) + 1.0d-3) / log(2.0d0))
+         np2 = np1 / 2
+         np1 = np1 - np2
+         np1 = 2**np1
+         np2 = 2**np2
+      endif
+      npa = np1 * np2
+
+! for option 2, go further to get the best (np1 * np2) proc grid
+      if (pkind == 2 .and. npa < nprocs) then
+         np1w = int(sqrt(dble(nprocs) + 1.0d-3))
+         np2w = int(sqrt(dble(nprocs*2) + 1.0d-3))
+         do n = np1w, np2w
+            npaw = nprocs / n * n
+            if (n == 1 .and. nprocs == 3) npaw = 2
+            if (npaw > npa) then
+               npa = npaw
+               np1 = npa / n
+               if (np1 < n) then
+                  np2 = np1
+                  np1 = n
+               else
+                  np2 = n
+               endif
+            endif
+         end do
+      endif
+
+! all good if calculated is the same as requested
+      comm_out = comm_in
+      active = .true.
+      if (nprocs == npa) return
+
+! npa < nprocs, need to check if a strict NPROCS enforcement is required
+      if (rank == 0) then
+         call get_environment_variable('NPB_NPROCS_STRICT',  &
+     &                                 val, ic, ios)
+         if (ios == 0 .and. ic > 0) then
+            if (val == '0' .or. val(1:1) == '-') then
+               active = .false.
+            else if (val == 'off' .or. val == 'OFF' .or.  &
+     &            val(1:1) == 'n' .or. val(1:1) == 'N' .or.  &
+     &            val(1:1) == 'f' .or. val(1:1) == 'F') then
+               active = .false.
+            endif
+         endif
+      endif
+      call mpi_bcast(active, 1, MPI_LOGICAL, 0, comm_in, ios)
+
+! abort if a strict NPROCS enforcement is required
+      if (active) then
+         if (rank == 0) then
+            print 100, nprocs
+  100       format(' *** ERROR determining processor topology for ',  &
+     &             i0,' processes')
+            if (pkind <= 1) then
+               print 110, 'square', npa
+            else if (pkind == 2) then
+               print 110, 'grid (nx*ny, nx/2<=ny<=nx)', npa
+            else
+               print 110, 'power-of-two', npa
+            endif
+  110       format('     Expecting a ', a, ' number of processes',  &
+     &             ' (such as ', i0, ')')
+         endif
+         call mpi_abort(comm_in, MPI_ERR_OTHER, ios)
+         stop
+      endif
+
+! mark excess ranks as inactive
+! split communicator based on rank value
+      if (rank >= npa) then
+         active = .false.
+         ic = 1
+      else
+         active = .true.
+         ic = 0
+      endif
+      call mpi_comm_split(comm_in, ic, rank, comm_out, ios)
+
+      end subroutine
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/common/print_results.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/common/print_results.f90
new file mode 100644
index 000000000..e7fb485ec
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/common/print_results.f90
@@ -0,0 +1,119 @@
+
+      subroutine print_results(name, class, n1, n2, n3, niter,  &
+     &               nprocs_active, nprocs_total,  &
+     &               t, mops, optype, verified, npbversion,  &
+     &               compiletime, cs1, cs2, cs3, cs4, cs5, cs6, cs7)
+      
+      implicit none
+      character name*2
+      character class
+      integer n1, n2, n3, niter, nprocs_active, nprocs_total, j
+      double precision t, mops
+      character optype*24, size*15
+      logical verified
+      character(len=*) npbversion, compiletime,  &
+     &              cs1, cs2, cs3, cs4, cs5, cs6, cs7
+
+         write (*, 2) name 
+ 2       format(//, ' ', A2, ' Benchmark Completed.')
+
+         write (*, 3) Class
+ 3       format(' Class           = ', 12x, a12)
+
+!   If this is not a grid-based problem (EP, FT, CG), then
+!   we only print n1, which contains some measure of the
+!   problem size. In that case, n2 and n3 are both zero.
+!   Otherwise, we print the grid size n1xn2xn3
+
+         if ((n2 .eq. 0) .and. (n3 .eq. 0)) then
+            if (name(1:2) .eq. 'EP') then
+               write(size, '(f15.0)' ) 2.d0**n1
+               j = 15
+               if (size(j:j) .eq. '.') j = j - 1
+               write (*,42) size(1:j)
+ 42            format(' Size            = ',9x, a15)
+            else
+               write (*,44) n1
+ 44            format(' Size            = ',12x, i12)
+            endif
+         else
+            write (*, 4) n1,n2,n3
+ 4          format(' Size            =  ',9x, i4,'x',i4,'x',i4)
+         endif
+
+         write (*, 5) niter
+ 5       format(' Iterations      = ', 12x, i12)
+         
+         write (*, 6) t
+ 6       format(' Time in seconds = ', 12x, f12.2)
+         
+         write (*,7) nprocs_total
+ 7       format(' Total processes = ', 12x, i12)
+         
+         if (nprocs_active .ne. 0) write (*,8) nprocs_active
+ 8       format(' Active processes= ', 12x, i12)
+
+         write (*,9) mops
+ 9       format(' Mop/s total     = ', 12x, f12.2)
+
+         write (*,10) mops/dble( nprocs_total )
+ 10      format(' Mop/s/process   = ', 12x, f12.2)        
+         
+         write(*, 11) optype
+ 11      format(' Operation type  = ', a24)
+
+         if (verified) then 
+            write(*,12) '  SUCCESSFUL'
+         else
+            write(*,12) 'UNSUCCESSFUL'
+         endif
+ 12      format(' Verification    = ', 12x, a)
+
+         write(*,13) npbversion
+ 13      format(' Version         = ', 12x, a12)
+
+         write(*,14) compiletime
+ 14      format(' Compile date    = ', 12x, a12)
+
+
+         write (*,121) cs1
+ 121     format(/, ' Compile options:', /,  &
+     &          '    MPIFC        = ', A)
+
+         write (*,122) cs2
+ 122     format('    FLINK        = ', A)
+
+         write (*,123) cs3
+ 123     format('    FMPI_LIB     = ', A)
+
+         write (*,124) cs4
+ 124     format('    FMPI_INC     = ', A)
+
+         write (*,125) cs5
+ 125     format('    FFLAGS       = ', A)
+
+         write (*,126) cs6
+ 126     format('    FLINKFLAGS   = ', A)
+
+         write(*, 127) cs7
+ 127     format('    RAND         = ', A)
+        
+         write (*,130)
+ 130     format(//' Please send feedbacks and/or',  &
+     &            ' the results of this run to:'//  &
+     &            ' NPB Development Team '/  &
+     &            ' Internet: npb@nas.nasa.gov'//)
+! 130     format(//' Please send the results of this run to:'//
+!     >            ' NPB Development Team '/
+!     >            ' Internet: npb@nas.nasa.gov'/
+!     >            ' '/
+!     >            ' If email is not available, send this to:'//
+!     >            ' MS T27A-1'/
+!     >            ' NASA Ames Research Center'/
+!     >            ' Moffett Field, CA  94035-1000'//
+!     >            ' Fax: 650-604-3957'//)
+
+
+         return
+         end
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/common/randdp.c b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/common/randdp.c
new file mode 100644
index 000000000..676624795
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/common/randdp.c
@@ -0,0 +1,64 @@
+//---------------------------------------------------------------------
+//   This function is the C version of the random number generator randdp.f
+//---------------------------------------------------------------------
+
+double	randlc(X, A)
+double *X;
+double *A;
+{
+      static int        KS=0;
+      static double	R23, R46, T23, T46;
+      double		T1, T2, T3, T4;
+      double		A1;
+      double		A2;
+      double		X1;
+      double		X2;
+      double		Z;
+      int     		i, j;
+
+      if (KS == 0) 
+      {
+        R23 = 1.0;
+        R46 = 1.0;
+        T23 = 1.0;
+        T46 = 1.0;
+    
+        for (i=1; i<=23; i++)
+        {
+          R23 = 0.50 * R23;
+          T23 = 2.0 * T23;
+        }
+        for (i=1; i<=46; i++)
+        {
+          R46 = 0.50 * R46;
+          T46 = 2.0 * T46;
+        }
+        KS = 1;
+      }
+
+/*  Break A into two parts such that A = 2^23 * A1 + A2 and set X = N.  */
+
+      T1 = R23 * *A;
+      j  = T1;
+      A1 = j;
+      A2 = *A - T23 * A1;
+
+/*  Break X into two parts such that X = 2^23 * X1 + X2, compute
+    Z = A1 * X2 + A2 * X1  (mod 2^23), and then
+    X = 2^23 * Z + A2 * X2  (mod 2^46).                            */
+
+      T1 = R23 * *X;
+      j  = T1;
+      X1 = j;
+      X2 = *X - T23 * X1;
+      T1 = A1 * X2 + A2 * X1;
+      
+      j  = R23 * T1;
+      T2 = j;
+      Z = T1 - T23 * T2;
+      T3 = T23 * Z + A2 * X2;
+      j  = R46 * T3;
+      T4 = j;
+      *X = T3 - T46 * T4;
+      return(R46 * *X);
+} 
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/common/randdp.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/common/randdp.f90
new file mode 100644
index 000000000..27fdf95e0
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/common/randdp.f90
@@ -0,0 +1,137 @@
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      double precision function randlc (x, a)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!
+!   This routine returns a uniform pseudorandom double precision number in the
+!   range (0, 1) by using the linear congruential generator
+!
+!   x_{k+1} = a x_k  (mod 2^46)
+!
+!   where 0 < x_k < 2^46 and 0 < a < 2^46.  This scheme generates 2^44 numbers
+!   before repeating.  The argument A is the same as 'a' in the above formula,
+!   and X is the same as x_0.  A and X must be odd double precision integers
+!   in the range (1, 2^46).  The returned value RANDLC is normalized to be
+!   between 0 and 1, i.e. RANDLC = 2^(-46) * x_1.  X is updated to contain
+!   the new seed x_1, so that subsequent calls to RANDLC using the same
+!   arguments will generate a continuous sequence.
+!
+!   This routine should produce the same results on any computer with at least
+!   48 mantissa bits in double precision floating point data.  On 64 bit
+!   systems, double precision should be disabled.
+!
+!   David H. Bailey     October 26, 1990
+!
+!---------------------------------------------------------------------
+
+      implicit none
+
+      double precision r23,r46,t23,t46,a,x,t1,t2,t3,t4,a1,a2,x1,x2,z
+      parameter (r23 = 0.5d0 ** 23, r46 = r23 ** 2, t23 = 2.d0 ** 23,  &
+     &  t46 = t23 ** 2)
+
+!---------------------------------------------------------------------
+!   Break A into two parts such that A = 2^23 * A1 + A2.
+!---------------------------------------------------------------------
+      t1 = r23 * a
+      a1 = int (t1)
+      a2 = a - t23 * a1
+
+!---------------------------------------------------------------------
+!   Break X into two parts such that X = 2^23 * X1 + X2, compute
+!   Z = A1 * X2 + A2 * X1  (mod 2^23), and then
+!   X = 2^23 * Z + A2 * X2  (mod 2^46).
+!---------------------------------------------------------------------
+      t1 = r23 * x
+      x1 = int (t1)
+      x2 = x - t23 * x1
+      t1 = a1 * x2 + a2 * x1
+      t2 = int (r23 * t1)
+      z = t1 - t23 * t2
+      t3 = t23 * z + a2 * x2
+      t4 = int (r46 * t3)
+      x = t3 - t46 * t4
+      randlc = r46 * x
+
+      return
+      end
+
+
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine vranlc (n, x, a, y)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!
+!   This routine generates N uniform pseudorandom double precision numbers in
+!   the range (0, 1) by using the linear congruential generator
+!
+!   x_{k+1} = a x_k  (mod 2^46)
+!
+!   where 0 < x_k < 2^46 and 0 < a < 2^46.  This scheme generates 2^44 numbers
+!   before repeating.  The argument A is the same as 'a' in the above formula,
+!   and X is the same as x_0.  A and X must be odd double precision integers
+!   in the range (1, 2^46).  The N results are placed in Y and are normalized
+!   to be between 0 and 1.  X is updated to contain the new seed, so that
+!   subsequent calls to VRANLC using the same arguments will generate a
+!   continuous sequence.  If N is zero, only initialization is performed, and
+!   the variables X, A and Y are ignored.
+!
+!   This routine is the standard version designed for scalar or RISC systems.
+!   However, it should produce the same results on any single processor
+!   computer with at least 48 mantissa bits in double precision floating point
+!   data.  On 64 bit systems, double precision should be disabled.
+!
+!---------------------------------------------------------------------
+
+      implicit none
+
+      integer i,n
+      double precision y,r23,r46,t23,t46,a,x,t1,t2,t3,t4,a1,a2,x1,x2,z
+      dimension y(*)
+      parameter (r23 = 0.5d0 ** 23, r46 = r23 ** 2, t23 = 2.d0 ** 23,  &
+     &  t46 = t23 ** 2)
+
+
+!---------------------------------------------------------------------
+!   Break A into two parts such that A = 2^23 * A1 + A2.
+!---------------------------------------------------------------------
+      t1 = r23 * a
+      a1 = int (t1)
+      a2 = a - t23 * a1
+
+!---------------------------------------------------------------------
+!   Generate N results.   This loop is not vectorizable.
+!---------------------------------------------------------------------
+      do i = 1, n
+
+!---------------------------------------------------------------------
+!   Break X into two parts such that X = 2^23 * X1 + X2, compute
+!   Z = A1 * X2 + A2 * X1  (mod 2^23), and then
+!   X = 2^23 * Z + A2 * X2  (mod 2^46).
+!---------------------------------------------------------------------
+        t1 = r23 * x
+        x1 = int (t1)
+        x2 = x - t23 * x1
+        t1 = a1 * x2 + a2 * x1
+        t2 = int (r23 * t1)
+        z = t1 - t23 * t2
+        t3 = t23 * z + a2 * x2
+        t4 = int (r46 * t3)
+        x = t3 - t46 * t4
+        y(i) = r46 * x
+      enddo
+
+      return
+      end
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/common/randdpvec.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/common/randdpvec.f90
new file mode 100644
index 000000000..069e8cbe8
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/common/randdpvec.f90
@@ -0,0 +1,186 @@
+!---------------------------------------------------------------------
+      double precision function randlc (x, a)
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!
+!   This routine returns a uniform pseudorandom double precision number in the
+!   range (0, 1) by using the linear congruential generator
+!
+!   x_{k+1} = a x_k  (mod 2^46)
+!
+!   where 0 < x_k < 2^46 and 0 < a < 2^46.  This scheme generates 2^44 numbers
+!   before repeating.  The argument A is the same as 'a' in the above formula,
+!   and X is the same as x_0.  A and X must be odd double precision integers
+!   in the range (1, 2^46).  The returned value RANDLC is normalized to be
+!   between 0 and 1, i.e. RANDLC = 2^(-46) * x_1.  X is updated to contain
+!   the new seed x_1, so that subsequent calls to RANDLC using the same
+!   arguments will generate a continuous sequence.
+!
+!   This routine should produce the same results on any computer with at least
+!   48 mantissa bits in double precision floating point data.  On 64 bit
+!   systems, double precision should be disabled.
+!
+!   David H. Bailey     October 26, 1990
+!
+!---------------------------------------------------------------------
+
+      implicit none
+
+      double precision r23,r46,t23,t46,a,x,t1,t2,t3,t4,a1,a2,x1,x2,z
+      parameter (r23 = 0.5d0 ** 23, r46 = r23 ** 2, t23 = 2.d0 ** 23,  &
+     &  t46 = t23 ** 2)
+
+!---------------------------------------------------------------------
+!   Break A into two parts such that A = 2^23 * A1 + A2.
+!---------------------------------------------------------------------
+      t1 = r23 * a
+      a1 = int (t1)
+      a2 = a - t23 * a1
+
+!---------------------------------------------------------------------
+!   Break X into two parts such that X = 2^23 * X1 + X2, compute
+!   Z = A1 * X2 + A2 * X1  (mod 2^23), and then
+!   X = 2^23 * Z + A2 * X2  (mod 2^46).
+!---------------------------------------------------------------------
+      t1 = r23 * x
+      x1 = int (t1)
+      x2 = x - t23 * x1
+
+
+      t1 = a1 * x2 + a2 * x1
+      t2 = int (r23 * t1)
+      z = t1 - t23 * t2
+      t3 = t23 * z + a2 * x2
+      t4 = int (r46 * t3)
+      x = t3 - t46 * t4
+      randlc = r46 * x
+      return
+      end
+
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine vranlc (n, x, a, y)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!   This routine generates N uniform pseudorandom double precision numbers in
+!   the range (0, 1) by using the linear congruential generator
+!   
+!   x_{k+1} = a x_k  (mod 2^46)
+!   
+!   where 0 < x_k < 2^46 and 0 < a < 2^46.  This scheme generates 2^44 numbers
+!   before repeating.  The argument A is the same as 'a' in the above formula,
+!   and X is the same as x_0.  A and X must be odd double precision integers
+!   in the range (1, 2^46).  The N results are placed in Y and are normalized
+!   to be between 0 and 1.  X is updated to contain the new seed, so that
+!   subsequent calls to VRANLC using the same arguments will generate a
+!   continuous sequence.
+!   
+!   This routine generates the output sequence in batches of length NV, for
+!   convenience on vector computers.  This routine should produce the same
+!   results on any computer with at least 48 mantissa bits in double precision
+!   floating point data.  On Cray systems, double precision should be disabled.
+!   
+!   David H. Bailey    August 30, 1990
+!---------------------------------------------------------------------
+
+      integer n
+      double precision x, a, y(*)
+      
+      double precision r23, r46, t23, t46
+      integer nv
+      parameter (r23 = 2.d0 ** (-23), r46 = r23 * r23, t23 = 2.d0 ** 23,  &
+     &     t46 = t23 * t23, nv = 64)
+      double precision  xv(nv), t1, t2, t3, t4, an, a1, a2, x1, x2, yy
+      integer n1, i, j
+      external randlc
+      double precision randlc
+
+!---------------------------------------------------------------------
+!     Compute the first NV elements of the sequence using RANDLC.
+!---------------------------------------------------------------------
+      t1 = x
+      n1 = min (n, nv)
+
+      do  i = 1, n1
+         xv(i) = t46 * randlc (t1, a)
+      enddo
+
+!---------------------------------------------------------------------
+!     It is not necessary to compute AN, A1 or A2 unless N is greater than NV.
+!---------------------------------------------------------------------
+      if (n .gt. nv) then
+
+!---------------------------------------------------------------------
+!     Compute AN = A ^ NV (mod 2^46) using successive calls to RANDLC.
+!---------------------------------------------------------------------
+         t1 = a
+         t2 = r46 * a
+
+         do  i = 1, nv - 1
+            t2 = randlc (t1, a)
+         enddo
+
+         an = t46 * t2
+
+!---------------------------------------------------------------------
+!     Break AN into two parts such that AN = 2^23 * A1 + A2.
+!---------------------------------------------------------------------
+         t1 = r23 * an
+         a1 = aint (t1)
+         a2 = an - t23 * a1
+      endif
+
+!---------------------------------------------------------------------
+!     Compute N pseudorandom results in batches of size NV.
+!---------------------------------------------------------------------
+      do  j = 0, n - 1, nv
+         n1 = min (nv, n - j)
+
+!---------------------------------------------------------------------
+!     Compute up to NV results based on the current seed vector XV.
+!---------------------------------------------------------------------
+         do  i = 1, n1
+            y(i+j) = r46 * xv(i)
+         enddo
+
+!---------------------------------------------------------------------
+!     If this is the last pass through the batch loop over J, it is not
+!     necessary to update the XV vector.
+!---------------------------------------------------------------------
+         if (j + n1 .eq. n) goto 150
+
+!---------------------------------------------------------------------
+!     Update the XV vector by multiplying each element by AN (mod 2^46).
+!---------------------------------------------------------------------
+         do  i = 1, nv
+            t1 = r23 * xv(i)
+            x1 = aint (t1)
+            x2 = xv(i) - t23 * x1
+            t1 = a1 * x2 + a2 * x1
+            t2 = aint (r23 * t1)
+            yy = t1 - t23 * t2
+            t3 = t23 * yy + a2 * x2
+            t4 = aint (r46 * t3)
+            xv(i) = t3 - t46 * t4
+         enddo
+
+      enddo
+
+!---------------------------------------------------------------------
+!     Save the last seed in X so that subsequent calls to VRANLC will generate
+!     a continuous sequence.
+!---------------------------------------------------------------------
+ 150  x = xv(n1)
+
+      return
+      end
+
+!----- end of program ------------------------------------------------
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/common/randi8.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/common/randi8.f90
new file mode 100644
index 000000000..f8932edaf
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/common/randi8.f90
@@ -0,0 +1,67 @@
+      double precision function randlc(x, a)
+
+!---------------------------------------------------------------------
+!
+!   This routine returns a uniform pseudorandom double precision number in the
+!   range (0, 1) by using the linear congruential generator
+!
+!   x_{k+1} = a x_k  (mod 2^46)
+!
+!   where 0 < x_k < 2^46 and 0 < a < 2^46.  This scheme generates 2^44 numbers
+!   before repeating.  The argument A is the same as 'a' in the above formula,
+!   and X is the same as x_0.  A and X must be odd double precision integers
+!   in the range (1, 2^46).  The returned value RANDLC is normalized to be
+!   between 0 and 1, i.e. RANDLC = 2^(-46) * x_1.  X is updated to contain
+!   the new seed x_1, so that subsequent calls to RANDLC using the same
+!   arguments will generate a continuous sequence.
+
+      implicit none
+      double precision x, a
+      integer(kind=8) i246m1, Lx, La
+      double precision d2m46
+
+      parameter(d2m46=0.5d0**46)
+
+      parameter(i246m1=INT(Z'00003FFFFFFFFFFF',8))
+
+      Lx = X
+      La = A
+
+      Lx   = iand(Lx*La,i246m1)
+      randlc = d2m46*dble(Lx)
+      x    = dble(Lx)
+      return
+      end
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+
+      SUBROUTINE VRANLC (N, X, A, Y)
+      implicit none
+      integer n, i
+      double precision x, a, y(*)
+      integer(kind=8) i246m1, Lx, La
+      double precision d2m46
+
+! The commented-out definition below does not work, because the compiler
+! evaluates the right-hand side in 32-bit arithmetic and overflows. There is
+! no standard way (without f90 features) to request 64-bit arithmetic here.
+!      parameter(i246m1=2**46-1)
+
+      parameter(d2m46=0.5d0**46)
+
+      parameter(i246m1=INT(Z'00003FFFFFFFFFFF',8))
+
+      Lx = X
+      La = A
+      do i = 1, N
+         Lx   = iand(Lx*La,i246m1)
+         y(i) = d2m46*dble(Lx)
+      end do
+      x    = dble(Lx)
+
+      return
+      end
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/common/randi8_safe.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/common/randi8_safe.f90
new file mode 100644
index 000000000..ac63a1884
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/common/randi8_safe.f90
@@ -0,0 +1,64 @@
+      double precision function randlc(x, a)
+
+!---------------------------------------------------------------------
+!
+!   This routine returns a uniform pseudorandom double precision number in the
+!   range (0, 1) by using the linear congruential generator
+!
+!   x_{k+1} = a x_k  (mod 2^46)
+!
+!   where 0 < x_k < 2^46 and 0 < a < 2^46.  This scheme generates 2^44 numbers
+!   before repeating.  The argument A is the same as 'a' in the above formula,
+!   and X is the same as x_0.  A and X must be odd double precision integers
+!   in the range (1, 2^46).  The returned value RANDLC is normalized to be
+!   between 0 and 1, i.e. RANDLC = 2^(-46) * x_1.  X is updated to contain
+!   the new seed x_1, so that subsequent calls to RANDLC using the same
+!   arguments will generate a continuous sequence.
+
+      implicit none
+      double precision x, a
+      integer(kind=8) Lx, La, a1, a2, x1, x2, xa
+      double precision d2m46
+      parameter(d2m46=0.5d0**46)
+
+      Lx = x
+      La = A
+      a1 = ibits(La, 23, 23)
+      a2 = ibits(La, 0, 23)
+      x1 = ibits(Lx, 23, 23)
+      x2 = ibits(Lx, 0, 23)
+      xa = ishft(ibits(a1*x2+a2*x1, 0, 23), 23) + a2*x2
+      Lx   = ibits(xa,0, 46)
+      x    = dble(Lx)
+      randlc = d2m46*x
+      return
+      end
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+
+      SUBROUTINE VRANLC (N, X, A, Y)
+      implicit none
+      integer n, i
+      double precision x, a, y(*)
+      integer(kind=8) Lx, La, a1, a2, x1, x2, xa
+      double precision d2m46
+      parameter(d2m46=0.5d0**46)
+
+      Lx = X
+      La = A
+      a1 = ibits(La, 23, 23)
+      a2 = ibits(La, 0, 23)
+      do i = 1, N
+         x1 = ibits(Lx, 23, 23)
+         x2 = ibits(Lx, 0, 23)
+         xa = ishft(ibits(a1*x2+a2*x1, 0, 23), 23) + a2*x2
+         Lx   = ibits(xa,0, 46)
+         y(i) = d2m46*dble(Lx)
+      end do
+      x = dble(Lx)
+      return
+      end
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/common/timers.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/common/timers.f90
new file mode 100644
index 000000000..c89a3d81c
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/common/timers.f90
@@ -0,0 +1,135 @@
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+      
+      module timers
+
+      double precision start(64), elapsed(64)
+
+      end module timers
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+      
+      subroutine timer_clear(n)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      use timers
+      implicit none
+
+      integer n
+
+      elapsed(n) = 0.0
+      return
+      end
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine timer_start(n)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      use timers
+      implicit none
+
+      integer n
+
+      include 'mpif.h'
+
+      start(n) = MPI_Wtime()
+
+      return
+      end
+      
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine timer_stop(n)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      use timers
+      implicit none
+
+      integer n
+
+      include 'mpif.h'
+
+      double precision t, now
+
+      now = MPI_Wtime()
+      t = now - start(n)
+      elapsed(n) = elapsed(n) + t
+
+      return
+      end
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      double precision function timer_read(n)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      use timers
+      implicit none
+
+      integer n
+      
+      timer_read = elapsed(n)
+
+      return
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine check_timer_flag( timeron )
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      implicit none
+      logical timeron
+
+      integer nc, ios
+      character(len=20) val
+
+      timeron = .false.
+
+! ... Check environment variable "NPB_TIMER_FLAG"
+      call get_environment_variable('NPB_TIMER_FLAG', val, nc, ios)
+      if (ios .eq. 0) then
+         if (nc .le. 0) then
+            timeron = .true.
+         else if (val(1:1) .ge. '1' .and. val(1:1) .le. '9') then
+            timeron = .true.
+         else if (val .eq. 'on' .or. val .eq. 'ON' .or.  &
+     &            val .eq. 'yes' .or. val .eq. 'YES' .or.  &
+     &            val .eq. 'true' .or. val .eq. 'TRUE') then
+            timeron = .true.
+         endif
+
+      else
+
+! ... Check if the "timer.flag" file exists
+         open (unit=2, file='timer.flag', status='old', iostat=ios)
+         if (ios .eq. 0) then
+            close(2)
+            timeron = .true.
+         endif
+
+      endif
+
+      return
+      end
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/config/NAS.samples/README b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/config/NAS.samples/README
new file mode 100644
index 000000000..ae535e95c
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/config/NAS.samples/README
@@ -0,0 +1,7 @@
+This directory contains examples of make.def files that were used 
+by the NPB team in testing the benchmarks on different platforms. 
+They can be used as starting points for make.def files for your 
+own platform, but you may need to tailor them for best performance
+on your installation. A clean template can be found in directory 
+`config'.
+Some examples of suite.def files are also provided.
\ No newline at end of file
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/config/NAS.samples/make.def.gcc_mpich b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/config/NAS.samples/make.def.gcc_mpich
new file mode 100644
index 000000000..fb41e1059
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/config/NAS.samples/make.def.gcc_mpich
@@ -0,0 +1,162 @@
+#---------------------------------------------------------------------------
+#
+#                SITE- AND/OR PLATFORM-SPECIFIC DEFINITIONS. 
+#
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# Items in this file will need to be changed for each platform.
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# Parallel Fortran:
+#
+# For CG, EP, FT, MG, LU, SP and BT, which are in Fortran, the following must 
+# be defined:
+#
+# MPIFC      - Fortran compiler
+# FFLAGS     - Fortran compilation arguments
+# FMPI_INC   - any -I arguments required for compiling MPI/Fortran 
+# FLINK      - Fortran linker
+# FLINKFLAGS - Fortran linker arguments
+# FMPI_LIB   - any -L and -l arguments required for linking MPI/Fortran 
+# 
+# compilations are done with $(MPIFC) $(FMPI_INC) $(FFLAGS) or
+#                            $(MPIFC) $(FFLAGS)
+# linking is done with       $(FLINK) $(FMPI_LIB) $(FLINKFLAGS)
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# This is the fortran compiler used for MPI programs
+#---------------------------------------------------------------------------
+MPIFC = mpif90
+# This links MPI fortran programs; usually the same as ${MPIFC}
+FLINK	= $(MPIFC)
+
+#---------------------------------------------------------------------------
+# These macros are passed to the linker to help link with MPI correctly
+#---------------------------------------------------------------------------
+FMPI_LIB  = 
+
+#---------------------------------------------------------------------------
+# These macros are passed to the compiler to help find 'mpif.h'
+#---------------------------------------------------------------------------
+FMPI_INC = 
+
+#---------------------------------------------------------------------------
+# Global *compile time* flags for Fortran programs
+#---------------------------------------------------------------------------
+FFLAGS	= -O3
+
+#---------------------------------------------------------------------------
+# Global *link time* flags. Flags for increasing maximum executable 
+# size usually go here. 
+#---------------------------------------------------------------------------
+FLINKFLAGS = $(FFLAGS)
+
+
+#---------------------------------------------------------------------------
+# Parallel C:
+#
+# For IS, which is in C, the following must be defined:
+#
+# MPICC      - C compiler 
+# CFLAGS     - C compilation arguments
+# CMPI_INC   - any -I arguments required for compiling MPI/C 
+# CLINK      - C linker
+# CLINKFLAGS - C linker flags
+# CMPI_LIB   - any -L and -l arguments required for linking MPI/C 
+#
+# compilations are done with $(MPICC) $(CMPI_INC) $(CFLAGS) or
+#                            $(MPICC) $(CFLAGS)
+# linking is done with       $(CLINK) $(CMPI_LIB) $(CLINKFLAGS)
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# This is the C compiler used for MPI programs
+#---------------------------------------------------------------------------
+MPICC = mpicc
+# This links MPI C programs; usually the same as ${MPICC}
+CLINK	= $(MPICC)
+
+#---------------------------------------------------------------------------
+# These macros are passed to the linker to help link with MPI correctly
+#---------------------------------------------------------------------------
+CMPI_LIB  = 
+
+#---------------------------------------------------------------------------
+# These macros are passed to the compiler to help find 'mpi.h'
+#---------------------------------------------------------------------------
+CMPI_INC = 
+
+#---------------------------------------------------------------------------
+# Global *compile time* flags for C programs
+#---------------------------------------------------------------------------
+CFLAGS	= -O3
+
+#---------------------------------------------------------------------------
+# Global *link time* flags. Flags for increasing maximum executable 
+# size usually go here. 
+#---------------------------------------------------------------------------
+CLINKFLAGS = $(CFLAGS)
+
+
+#---------------------------------------------------------------------------
+# MPI dummy library:
+#
+# Uncomment if you want to use the MPI dummy library supplied by NAS instead 
+# of the true message-passing library. The include file redefines several of
+# the above macros. It also invokes make in subdirectory MPI_dummy. Make 
+# sure that no spaces or tabs precede include.
+#---------------------------------------------------------------------------
+# include ../config/make.dummy
+
+
+#---------------------------------------------------------------------------
+# Utilities C:
+#
+# This is the C compiler used to compile C utilities.  Flags required by 
+# this compiler go here also; typically there are few flags required; hence 
+# there are no separate macros provided for such flags.
+#---------------------------------------------------------------------------
+CC	= gcc -g
+
+
+#---------------------------------------------------------------------------
+# Destination of executables, relative to subdirs of the main directory.
+#---------------------------------------------------------------------------
+BINDIR	= ../bin
+
+
+#---------------------------------------------------------------------------
+# Some machines (e.g. Crays) have 128-bit DOUBLE PRECISION numbers, which
+# is twice the precision required for the NPB suite. A compiler flag 
+# (e.g. -dp) can usually be used to change DOUBLE PRECISION variables to
+# 64 bits, but the MPI library may continue to send 128 bits. Short of
+# recompiling MPI, the solution is to use MPI_REAL to send these 64-bit
+# numbers, and MPI_COMPLEX to send their complex counterparts. Uncomment
+# the following line to enable this substitution. 
+# 
+# NOTE: IF THE I/O BENCHMARK IS BEING BUILT, WE USE CONVERTFLAG TO
+#       SPECIFY THE FORTRAN RECORD LENGTH UNIT. IT IS A SYSTEM-SPECIFIC
+#       VALUE (USUALLY 1 OR 4). UNCOMMENT THE SECOND LINE AND SUBSTITUTE
+#       THE CORRECT VALUE FOR "length".
+#       IF BOTH 128-BIT DOUBLE PRECISION NUMBERS AND I/O ARE TO BE ENABLED,
+#       UNCOMMENT THE THIRD LINE AND SUBSTITUTE THE CORRECT VALUE FOR
+#       "length"
+#---------------------------------------------------------------------------
+# CONVERTFLAG	= -DCONVERTDOUBLE
+CONVERTFLAG	= -DFORTRAN_REC_SIZE=1
+# CONVERTFLAG	= -DCONVERTDOUBLE -DFORTRAN_REC_SIZE=length
+
+
+#---------------------------------------------------------------------------
+# The variable RAND controls which random number generator 
+# is used. It is described in detail in README.install. 
+# Use "randi8" unless there is a reason to use another one. 
+# Other allowed values are "randi8_safe", "randdp" and "randdpvec"
+#---------------------------------------------------------------------------
+RAND   = randi8
+# The following is highly reliable but may be slow:
+# RAND   = randdp
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/config/NAS.samples/make.def.gcc_mpich_m b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/config/NAS.samples/make.def.gcc_mpich_m
new file mode 100644
index 000000000..70ea874d2
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/config/NAS.samples/make.def.gcc_mpich_m
@@ -0,0 +1,162 @@
+#---------------------------------------------------------------------------
+#
+#                SITE- AND/OR PLATFORM-SPECIFIC DEFINITIONS. 
+#
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# Items in this file will need to be changed for each platform.
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# Parallel Fortran:
+#
+# For CG, EP, FT, MG, LU, SP and BT, which are in Fortran, the following must 
+# be defined:
+#
+# MPIFC      - Fortran compiler
+# FFLAGS     - Fortran compilation arguments
+# FMPI_INC   - any -I arguments required for compiling MPI/Fortran 
+# FLINK      - Fortran linker
+# FLINKFLAGS - Fortran linker arguments
+# FMPI_LIB   - any -L and -l arguments required for linking MPI/Fortran 
+# 
+# compilations are done with $(MPIFC) $(FMPI_INC) $(FFLAGS) or
+#                            $(MPIFC) $(FFLAGS)
+# linking is done with       $(FLINK) $(FMPI_LIB) $(FLINKFLAGS)
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# This is the fortran compiler used for MPI programs
+#---------------------------------------------------------------------------
+MPIFC = mpif90
+# This links MPI fortran programs; usually the same as ${MPIFC}
+FLINK	= $(MPIFC)
+
+#---------------------------------------------------------------------------
+# These macros are passed to the linker to help link with MPI correctly
+#---------------------------------------------------------------------------
+FMPI_LIB  = 
+
+#---------------------------------------------------------------------------
+# These macros are passed to the compiler to help find 'mpif.h'
+#---------------------------------------------------------------------------
+FMPI_INC = 
+
+#---------------------------------------------------------------------------
+# Global *compile time* flags for Fortran programs
+#---------------------------------------------------------------------------
+FFLAGS	= -O3 -mcmodel=medium
+
+#---------------------------------------------------------------------------
+# Global *link time* flags. Flags for increasing maximum executable 
+# size usually go here. 
+#---------------------------------------------------------------------------
+FLINKFLAGS = $(FFLAGS)
+
+
+#---------------------------------------------------------------------------
+# Parallel C:
+#
+# For IS, which is in C, the following must be defined:
+#
+# MPICC      - C compiler 
+# CFLAGS     - C compilation arguments
+# CMPI_INC   - any -I arguments required for compiling MPI/C 
+# CLINK      - C linker
+# CLINKFLAGS - C linker flags
+# CMPI_LIB   - any -L and -l arguments required for linking MPI/C 
+#
+# compilations are done with $(MPICC) $(CMPI_INC) $(CFLAGS) or
+#                            $(MPICC) $(CFLAGS)
+# linking is done with       $(CLINK) $(CMPI_LIB) $(CLINKFLAGS)
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# This is the C compiler used for MPI programs
+#---------------------------------------------------------------------------
+MPICC = mpicc
+# This links MPI C programs; usually the same as ${MPICC}
+CLINK	= $(MPICC)
+
+#---------------------------------------------------------------------------
+# These macros are passed to the linker to help link with MPI correctly
+#---------------------------------------------------------------------------
+CMPI_LIB  = 
+
+#---------------------------------------------------------------------------
+# These macros are passed to the compiler to help find 'mpi.h'
+#---------------------------------------------------------------------------
+CMPI_INC = 
+
+#---------------------------------------------------------------------------
+# Global *compile time* flags for C programs
+#---------------------------------------------------------------------------
+CFLAGS	= -O3 -mcmodel=medium
+
+#---------------------------------------------------------------------------
+# Global *link time* flags. Flags for increasing maximum executable 
+# size usually go here. 
+#---------------------------------------------------------------------------
+CLINKFLAGS = $(CFLAGS)
+
+
+#---------------------------------------------------------------------------
+# MPI dummy library:
+#
+# Uncomment if you want to use the MPI dummy library supplied by NAS instead 
+# of the true message-passing library. The include file redefines several of
+# the above macros. It also invokes make in subdirectory MPI_dummy. Make 
+# sure that no spaces or tabs precede include.
+#---------------------------------------------------------------------------
+# include ../config/make.dummy
+
+
+#---------------------------------------------------------------------------
+# Utilities C:
+#
+# This is the C compiler used to compile C utilities.  Flags required by 
+# this compiler go here also; typically there are few flags required; hence 
+# there are no separate macros provided for such flags.
+#---------------------------------------------------------------------------
+CC	= gcc -g
+
+
+#---------------------------------------------------------------------------
+# Destination of executables, relative to subdirs of the main directory.
+#---------------------------------------------------------------------------
+BINDIR	= ../bin
+
+
+#---------------------------------------------------------------------------
+# Some machines (e.g. Crays) have 128-bit DOUBLE PRECISION numbers, which
+# is twice the precision required for the NPB suite. A compiler flag 
+# (e.g. -dp) can usually be used to change DOUBLE PRECISION variables to
+# 64 bits, but the MPI library may continue to send 128 bits. Short of
+# recompiling MPI, the solution is to use MPI_REAL to send these 64-bit
+# numbers, and MPI_COMPLEX to send their complex counterparts. Uncomment
+# the following line to enable this substitution. 
+# 
+# NOTE: IF THE I/O BENCHMARK IS BEING BUILT, WE USE CONVERTFLAG TO
+#       SPECIFY THE FORTRAN RECORD LENGTH UNIT. IT IS A SYSTEM-SPECIFIC
+#       VALUE (USUALLY 1 OR 4). UNCOMMENT THE SECOND LINE AND SUBSTITUTE
+#       THE CORRECT VALUE FOR "length".
+#       IF BOTH 128-BIT DOUBLE PRECISION NUMBERS AND I/O ARE TO BE ENABLED,
+#       UNCOMMENT THE THIRD LINE AND SUBSTITUTE THE CORRECT VALUE FOR
+#       "length"
+#---------------------------------------------------------------------------
+# CONVERTFLAG	= -DCONVERTDOUBLE
+CONVERTFLAG	= -DFORTRAN_REC_SIZE=1
+# CONVERTFLAG	= -DCONVERTDOUBLE -DFORTRAN_REC_SIZE=length
+
+
+#---------------------------------------------------------------------------
+# The variable RAND controls which random number generator 
+# is used. It is described in detail in README.install. 
+# Use "randi8" unless there is a reason to use another one. 
+# Other allowed values are "randi8_safe", "randdp" and "randdpvec"
+#---------------------------------------------------------------------------
+RAND   = randi8
+# The following is highly reliable but may be slow:
+# RAND   = randdp
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/config/NAS.samples/make.def.ibm_aix64 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/config/NAS.samples/make.def.ibm_aix64
new file mode 100644
index 000000000..704e75c10
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/config/NAS.samples/make.def.ibm_aix64
@@ -0,0 +1,162 @@
+#---------------------------------------------------------------------------
+#
+#                SITE- AND/OR PLATFORM-SPECIFIC DEFINITIONS. 
+#
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# Items in this file will need to be changed for each platform.
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# Parallel Fortran:
+#
+# For CG, EP, FT, MG, LU, SP and BT, which are in Fortran, the following must 
+# be defined:
+#
+# MPIFC      - Fortran compiler
+# FFLAGS     - Fortran compilation arguments
+# FMPI_INC   - any -I arguments required for compiling MPI/Fortran 
+# FLINK      - Fortran linker
+# FLINKFLAGS - Fortran linker arguments
+# FMPI_LIB   - any -L and -l arguments required for linking MPI/Fortran 
+# 
+# compilations are done with $(MPIFC) $(FMPI_INC) $(FFLAGS) or
+#                            $(MPIFC) $(FFLAGS)
+# linking is done with       $(FLINK) $(FMPI_LIB) $(FLINKFLAGS)
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# This is the fortran compiler used for MPI programs
+#---------------------------------------------------------------------------
+MPIFC	= mpxlf -q64
+# This links MPI fortran programs; usually the same as ${MPIFC}
+FLINK	= $(MPIFC)
+
+#---------------------------------------------------------------------------
+# These macros are passed to the linker to help link with MPI correctly
+#---------------------------------------------------------------------------
+FMPI_LIB  =
+
+#---------------------------------------------------------------------------
+# These macros are passed to the compiler to help find 'mpif.h'
+#---------------------------------------------------------------------------
+FMPI_INC =
+
+#---------------------------------------------------------------------------
+# Global *compile time* flags for Fortran programs
+#---------------------------------------------------------------------------
+FFLAGS	= -O3 -qarch=auto -qtune=auto -qhot -qnosave
+
+#---------------------------------------------------------------------------
+# Global *link time* flags. Flags for increasing maximum executable 
+# size usually go here. 
+#---------------------------------------------------------------------------
+FLINKFLAGS = -O3 -qarch=auto
+
+
+#---------------------------------------------------------------------------
+# Parallel C:
+#
+# For IS, which is in C, the following must be defined:
+#
+# MPICC      - C compiler 
+# CFLAGS     - C compilation arguments
+# CMPI_INC   - any -I arguments required for compiling MPI/C 
+# CLINK      - C linker
+# CLINKFLAGS - C linker flags
+# CMPI_LIB   - any -L and -l arguments required for linking MPI/C 
+#
+# compilations are done with $(MPICC) $(CMPI_INC) $(CFLAGS) or
+#                            $(MPICC) $(CFLAGS)
+# linking is done with       $(CLINK) $(CMPI_LIB) $(CLINKFLAGS)
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# This is the C compiler used for MPI programs
+#---------------------------------------------------------------------------
+MPICC = mpcc -q64
+# This links MPI C programs; usually the same as ${MPICC}
+CLINK	= $(MPICC)
+
+#---------------------------------------------------------------------------
+# These macros are passed to the linker to help link with MPI correctly
+#---------------------------------------------------------------------------
+CMPI_LIB  =
+
+#---------------------------------------------------------------------------
+# These macros are passed to the compiler to help find 'mpi.h'
+#---------------------------------------------------------------------------
+CMPI_INC =
+
+#---------------------------------------------------------------------------
+# Global *compile time* flags for C programs
+#---------------------------------------------------------------------------
+CFLAGS	= -O3 -qarch=auto -qtune=auto -qhot
+
+#---------------------------------------------------------------------------
+# Global *link time* flags. Flags for increasing maximum executable 
+# size usually go here. 
+#---------------------------------------------------------------------------
+CLINKFLAGS = -O3 -qarch=auto
+
+
+#---------------------------------------------------------------------------
+# MPI dummy library:
+#
+# Uncomment if you want to use the MPI dummy library supplied by NAS instead 
+# of the true message-passing library. The include file redefines several of
+# the above macros. It also invokes make in subdirectory MPI_dummy. Make 
+# sure that no spaces or tabs precede include.
+#---------------------------------------------------------------------------
+# include ../config/make.dummy
+
+
+#---------------------------------------------------------------------------
+# Utilities C:
+#
+# This is the C compiler used to compile C utilities.  Flags required by 
+# this compiler go here also; typically there are few flags required; hence 
+# there are no separate macros provided for such flags.
+#---------------------------------------------------------------------------
+CC	= cc -g
+
+
+#---------------------------------------------------------------------------
+# Destination of executables, relative to subdirs of the main directory.
+#---------------------------------------------------------------------------
+BINDIR	= ../bin
+
+
+#---------------------------------------------------------------------------
+# Some machines (e.g. Crays) have 128-bit DOUBLE PRECISION numbers, which
+# is twice the precision required for the NPB suite. A compiler flag 
+# (e.g. -dp) can usually be used to change DOUBLE PRECISION variables to
+# 64 bits, but the MPI library may continue to send 128 bits. Short of
+# recompiling MPI, the solution is to use MPI_REAL to send these 64-bit
+# numbers, and MPI_COMPLEX to send their complex counterparts. Uncomment
+# the following line to enable this substitution. 
+# 
+# NOTE: IF THE I/O BENCHMARK IS BEING BUILT, WE USE CONVERTFLAG TO
+#       SPECIFY THE FORTRAN RECORD LENGTH UNIT. IT IS A SYSTEM-SPECIFIC
+#       VALUE (USUALLY 1 OR 4). UNCOMMENT THE SECOND LINE AND SUBSTITUTE
+#       THE CORRECT VALUE FOR "length".
+#       IF BOTH 128-BIT DOUBLE PRECISION NUMBERS AND I/O ARE TO BE ENABLED,
+#       UNCOMMENT THE THIRD LINE AND SUBSTITUTE THE CORRECT VALUE FOR
+#       "length"
+#---------------------------------------------------------------------------
+# CONVERTFLAG	= -DCONVERTDOUBLE
+CONVERTFLAG	= -DFORTRAN_REC_SIZE=1
+# CONVERTFLAG	= -DCONVERTDOUBLE -DFORTRAN_REC_SIZE=length
+
+
+#---------------------------------------------------------------------------
+# The variable RAND controls which random number generator 
+# is used. It is described in detail in README.install. 
+# Use "randi8" unless there is a reason to use another one. 
+# Other allowed values are "randi8_safe", "randdp" and "randdpvec"
+#---------------------------------------------------------------------------
+RAND   = randi8
+# The following is highly reliable but may be slow:
+# RAND   = randdp
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/config/NAS.samples/make.def.itc_mpt b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/config/NAS.samples/make.def.itc_mpt
new file mode 100644
index 000000000..03417ef1d
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/config/NAS.samples/make.def.itc_mpt
@@ -0,0 +1,162 @@
+#---------------------------------------------------------------------------
+#
+#                SITE- AND/OR PLATFORM-SPECIFIC DEFINITIONS. 
+#
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# Items in this file will need to be changed for each platform.
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# Parallel Fortran:
+#
+# For CG, EP, FT, MG, LU, SP and BT, which are in Fortran, the following must 
+# be defined:
+#
+# MPIFC      - Fortran compiler
+# FFLAGS     - Fortran compilation arguments
+# FMPI_INC   - any -I arguments required for compiling MPI/Fortran 
+# FLINK      - Fortran linker
+# FLINKFLAGS - Fortran linker arguments
+# FMPI_LIB   - any -L and -l arguments required for linking MPI/Fortran 
+# 
+# compilations are done with $(MPIFC) $(FMPI_INC) $(FFLAGS) or
+#                            $(MPIFC) $(FFLAGS)
+# linking is done with       $(FLINK) $(FMPI_LIB) $(FLINKFLAGS)
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# This is the fortran compiler used for MPI programs
+#---------------------------------------------------------------------------
+MPIFC = ifort
+# This links MPI fortran programs; usually the same as ${MPIFC}
+FLINK	= $(MPIFC)
+
+#---------------------------------------------------------------------------
+# These macros are passed to the linker to help link with MPI correctly
+#---------------------------------------------------------------------------
+FMPI_LIB  = -lmpi
+
+#---------------------------------------------------------------------------
+# These macros are passed to the compiler to help find 'mpif.h'
+#---------------------------------------------------------------------------
+FMPI_INC = 
+
+#---------------------------------------------------------------------------
+# Global *compile time* flags for Fortran programs
+#---------------------------------------------------------------------------
+FFLAGS	= -O3
+
+#---------------------------------------------------------------------------
+# Global *link time* flags. Flags for increasing maximum executable 
+# size usually go here. 
+#---------------------------------------------------------------------------
+FLINKFLAGS = $(FFLAGS)
+
+
+#---------------------------------------------------------------------------
+# Parallel C:
+#
+# For IS, which is in C, the following must be defined:
+#
+# MPICC      - C compiler 
+# CFLAGS     - C compilation arguments
+# CMPI_INC   - any -I arguments required for compiling MPI/C 
+# CLINK      - C linker
+# CLINKFLAGS - C linker flags
+# CMPI_LIB   - any -L and -l arguments required for linking MPI/C 
+#
+# compilations are done with $(MPICC) $(CMPI_INC) $(CFLAGS) or
+#                            $(MPICC) $(CFLAGS)
+# linking is done with       $(CLINK) $(CMPI_LIB) $(CLINKFLAGS)
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# This is the C compiler used for MPI programs
+#---------------------------------------------------------------------------
+MPICC = icc
+# This links MPI C programs; usually the same as ${MPICC}
+CLINK	= $(MPICC)
+
+#---------------------------------------------------------------------------
+# These macros are passed to the linker to help link with MPI correctly
+#---------------------------------------------------------------------------
+CMPI_LIB  = -lmpi
+
+#---------------------------------------------------------------------------
+# These macros are passed to the compiler to help find 'mpi.h'
+#---------------------------------------------------------------------------
+CMPI_INC = 
+
+#---------------------------------------------------------------------------
+# Global *compile time* flags for C programs
+#---------------------------------------------------------------------------
+CFLAGS	= -O3
+
+#---------------------------------------------------------------------------
+# Global *link time* flags. Flags for increasing maximum executable 
+# size usually go here. 
+#---------------------------------------------------------------------------
+CLINKFLAGS = $(CFLAGS)
+
+
+#---------------------------------------------------------------------------
+# MPI dummy library:
+#
+# Uncomment if you want to use the MPI dummy library supplied by NAS instead 
+# of the true message-passing library. The include file redefines several of
+# the above macros. It also invokes make in subdirectory MPI_dummy. Make 
+# sure that no spaces or tabs precede include.
+#---------------------------------------------------------------------------
+# include ../config/make.dummy
+
+
+#---------------------------------------------------------------------------
+# Utilities C:
+#
+# This is the C compiler used to compile C utilities.  Flags required by 
+# this compiler go here also; typically there are few flags required; hence 
+# there are no separate macros provided for such flags.
+#---------------------------------------------------------------------------
+CC	= icc -g
+
+
+#---------------------------------------------------------------------------
+# Destination of executables, relative to subdirs of the main directory.
+#---------------------------------------------------------------------------
+BINDIR	= ../bin
+
+
+#---------------------------------------------------------------------------
+# Some machines (e.g. Crays) have 128-bit DOUBLE PRECISION numbers, which
+# is twice the precision required for the NPB suite. A compiler flag 
+# (e.g. -dp) can usually be used to change DOUBLE PRECISION variables to
+# 64 bits, but the MPI library may continue to send 128 bits. Short of
+# recompiling MPI, the solution is to use MPI_REAL to send these 64-bit
+# numbers, and MPI_COMPLEX to send their complex counterparts. Uncomment
+# the following line to enable this substitution. 
+# 
+# NOTE: IF THE I/O BENCHMARK IS BEING BUILT, WE USE CONVERTFLAG TO
+#       SPECIFY THE FORTRAN RECORD LENGTH UNIT. IT IS A SYSTEM-SPECIFIC
+#       VALUE (USUALLY 1 OR 4). UNCOMMENT THE SECOND LINE AND SUBSTITUTE
+#       THE CORRECT VALUE FOR "length".
+#       IF BOTH 128-BIT DOUBLE PRECISION NUMBERS AND I/O ARE TO BE ENABLED,
+#       UNCOMMENT THE THIRD LINE AND SUBSTITUTE THE CORRECT VALUE FOR
+#       "length"
+#---------------------------------------------------------------------------
+# CONVERTFLAG	= -DCONVERTDOUBLE
+CONVERTFLAG	= -DFORTRAN_REC_SIZE=4
+# CONVERTFLAG	= -DCONVERTDOUBLE -DFORTRAN_REC_SIZE=length
+
+
+#---------------------------------------------------------------------------
+# The variable RAND controls which random number generator 
+# is used. It is described in detail in README.install. 
+# Use "randi8" unless there is a reason to use another one. 
+# Other allowed values are "randi8_safe", "randdp" and "randdpvec"
+#---------------------------------------------------------------------------
+RAND   = randi8
+# The following is highly reliable but may be slow:
+# RAND   = randdp
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/config/NAS.samples/make.def.pgi_mpich b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/config/NAS.samples/make.def.pgi_mpich
new file mode 100644
index 000000000..82a63ac1c
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/config/NAS.samples/make.def.pgi_mpich
@@ -0,0 +1,162 @@
+#---------------------------------------------------------------------------
+#
+#                SITE- AND/OR PLATFORM-SPECIFIC DEFINITIONS. 
+#
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# Items in this file will need to be changed for each platform.
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# Parallel Fortran:
+#
+# For CG, EP, FT, MG, LU, SP and BT, which are in Fortran, the following must 
+# be defined:
+#
+# MPIFC      - Fortran compiler
+# FFLAGS     - Fortran compilation arguments
+# FMPI_INC   - any -I arguments required for compiling MPI/Fortran 
+# FLINK      - Fortran linker
+# FLINKFLAGS - Fortran linker arguments
+# FMPI_LIB   - any -L and -l arguments required for linking MPI/Fortran 
+# 
+# compilations are done with $(MPIFC) $(FMPI_INC) $(FFLAGS) or
+#                            $(MPIFC) $(FFLAGS)
+# linking is done with       $(FLINK) $(FMPI_LIB) $(FLINKFLAGS)
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# This is the fortran compiler used for MPI programs
+#---------------------------------------------------------------------------
+MPIFC = mpif90
+# This links MPI fortran programs; usually the same as ${MPIFC}
+FLINK	= $(MPIFC)
+
+#---------------------------------------------------------------------------
+# These macros are passed to the linker to help link with MPI correctly
+#---------------------------------------------------------------------------
+FMPI_LIB  = 
+
+#---------------------------------------------------------------------------
+# These macros are passed to the compiler to help find 'mpif.h'
+#---------------------------------------------------------------------------
+FMPI_INC = 
+
+#---------------------------------------------------------------------------
+# Global *compile time* flags for Fortran programs
+#---------------------------------------------------------------------------
+FFLAGS	= -fastsse
+
+#---------------------------------------------------------------------------
+# Global *link time* flags. Flags for increasing maximum executable 
+# size usually go here. 
+#---------------------------------------------------------------------------
+FLINKFLAGS = $(FFLAGS)
+
+
+#---------------------------------------------------------------------------
+# Parallel C:
+#
+# For IS, which is in C, the following must be defined:
+#
+# MPICC      - C compiler 
+# CFLAGS     - C compilation arguments
+# CMPI_INC   - any -I arguments required for compiling MPI/C 
+# CLINK      - C linker
+# CLINKFLAGS - C linker flags
+# CMPI_LIB   - any -L and -l arguments required for linking MPI/C 
+#
+# compilations are done with $(MPICC) $(CMPI_INC) $(CFLAGS) or
+#                            $(MPICC) $(CFLAGS)
+# linking is done with       $(CLINK) $(CMPI_LIB) $(CLINKFLAGS)
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# This is the C compiler used for MPI programs
+#---------------------------------------------------------------------------
+MPICC = mpicc
+# This links MPI C programs; usually the same as ${MPICC}
+CLINK	= $(MPICC)
+
+#---------------------------------------------------------------------------
+# These macros are passed to the linker to help link with MPI correctly
+#---------------------------------------------------------------------------
+CMPI_LIB  = 
+
+#---------------------------------------------------------------------------
+# These macros are passed to the compiler to help find 'mpi.h'
+#---------------------------------------------------------------------------
+CMPI_INC = 
+
+#---------------------------------------------------------------------------
+# Global *compile time* flags for C programs
+#---------------------------------------------------------------------------
+CFLAGS	= -fastsse
+
+#---------------------------------------------------------------------------
+# Global *link time* flags. Flags for increasing maximum executable 
+# size usually go here. 
+#---------------------------------------------------------------------------
+CLINKFLAGS = $(CFLAGS)
+
+
+#---------------------------------------------------------------------------
+# MPI dummy library:
+#
+# Uncomment if you want to use the MPI dummy library supplied by NAS instead 
+# of the true message-passing library. The include file redefines several of
+# the above macros. It also invokes make in subdirectory MPI_dummy. Make 
+# sure that no spaces or tabs precede include.
+#---------------------------------------------------------------------------
+# include ../config/make.dummy
+
+
+#---------------------------------------------------------------------------
+# Utilities C:
+#
+# This is the C compiler used to compile C utilities.  Flags required by 
+# this compiler go here also; typically there are few flags required; hence 
+# there are no separate macros provided for such flags.
+#---------------------------------------------------------------------------
+CC	= pgcc -g
+
+
+#---------------------------------------------------------------------------
+# Destination of executables, relative to subdirs of the main directory.
+#---------------------------------------------------------------------------
+BINDIR	= ../bin
+
+
+#---------------------------------------------------------------------------
+# Some machines (e.g. Crays) have 128-bit DOUBLE PRECISION numbers, which
+# is twice the precision required for the NPB suite. A compiler flag 
+# (e.g. -dp) can usually be used to change DOUBLE PRECISION variables to
+# 64 bits, but the MPI library may continue to send 128 bits. Short of
+# recompiling MPI, the solution is to use MPI_REAL to send these 64-bit
+# numbers, and MPI_COMPLEX to send their complex counterparts. Uncomment
+# the following line to enable this substitution. 
+# 
+# NOTE: IF THE I/O BENCHMARK IS BEING BUILT, WE USE CONVERTFLAG TO
+#       SPECIFY THE FORTRAN RECORD LENGTH UNIT. IT IS A SYSTEM-SPECIFIC
+#       VALUE (USUALLY 1 OR 4). UNCOMMENT THE SECOND LINE AND SUBSTITUTE
+#       THE CORRECT VALUE FOR "length".
+#       IF BOTH 128-BIT DOUBLE PRECISION NUMBERS AND I/O ARE TO BE ENABLED,
+#       UNCOMMENT THE THIRD LINE AND SUBSTITUTE THE CORRECT VALUE FOR
+#       "length"
+#---------------------------------------------------------------------------
+# CONVERTFLAG	= -DCONVERTDOUBLE
+CONVERTFLAG	= -DFORTRAN_REC_SIZE=1
+# CONVERTFLAG	= -DCONVERTDOUBLE -DFORTRAN_REC_SIZE=length
+
+
+#---------------------------------------------------------------------------
+# The variable RAND controls which random number generator 
+# is used. It is described in detail in README.install. 
+# Use "randi8" unless there is a reason to use another one. 
+# Other allowed values are "randi8_safe", "randdp" and "randdpvec"
+#---------------------------------------------------------------------------
+RAND   = randi8
+# The following is highly reliable but may be slow:
+# RAND   = randdp
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/config/NAS.samples/suite.def.bt b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/config/NAS.samples/suite.def.bt
new file mode 100644
index 000000000..6fdde972e
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/config/NAS.samples/suite.def.bt
@@ -0,0 +1,5 @@
+bt	S
+bt	A
+bt	B
+bt	C
+bt	D
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/config/NAS.samples/suite.def.cg b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/config/NAS.samples/suite.def.cg
new file mode 100644
index 000000000..12a9573de
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/config/NAS.samples/suite.def.cg
@@ -0,0 +1,5 @@
+cg	S
+cg	A
+cg	B
+cg	C
+cg	D
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/config/NAS.samples/suite.def.ep b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/config/NAS.samples/suite.def.ep
new file mode 100644
index 000000000..4994a5538
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/config/NAS.samples/suite.def.ep
@@ -0,0 +1,5 @@
+ep	S
+ep	A
+ep	B
+ep	C
+ep	D
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/config/NAS.samples/suite.def.ft b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/config/NAS.samples/suite.def.ft
new file mode 100644
index 000000000..1f2069ed4
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/config/NAS.samples/suite.def.ft
@@ -0,0 +1,5 @@
+ft	S
+ft	A
+ft	B
+ft	C
+ft	D
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/config/NAS.samples/suite.def.is b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/config/NAS.samples/suite.def.is
new file mode 100644
index 000000000..2ef84ed4d
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/config/NAS.samples/suite.def.is
@@ -0,0 +1,5 @@
+is	S
+is	A
+is	B
+is	C
+is	D
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/config/NAS.samples/suite.def.lu b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/config/NAS.samples/suite.def.lu
new file mode 100644
index 000000000..faa7c249e
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/config/NAS.samples/suite.def.lu
@@ -0,0 +1,5 @@
+lu	S
+lu	A
+lu	B
+lu	C
+lu	D
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/config/NAS.samples/suite.def.mg b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/config/NAS.samples/suite.def.mg
new file mode 100644
index 000000000..20c4a1ef7
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/config/NAS.samples/suite.def.mg
@@ -0,0 +1,5 @@
+mg	S
+mg	A
+mg	B
+mg	C
+mg	D
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/config/NAS.samples/suite.def.small b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/config/NAS.samples/suite.def.small
new file mode 100644
index 000000000..d3d52f018
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/config/NAS.samples/suite.def.small
@@ -0,0 +1,8 @@
+bt	S
+cg	S
+ep	S
+ft	S
+is	S
+lu	S
+mg	S
+sp	S
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/config/NAS.samples/suite.def.sp b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/config/NAS.samples/suite.def.sp
new file mode 100644
index 000000000..22ec0776a
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/config/NAS.samples/suite.def.sp
@@ -0,0 +1,5 @@
+sp	S
+sp	A
+sp	B
+sp	C
+sp	D
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/config/make.def.template b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/config/make.def.template
new file mode 100644
index 000000000..33e7bd938
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/config/make.def.template
@@ -0,0 +1,163 @@
+#---------------------------------------------------------------------------
+#
+#                SITE- AND/OR PLATFORM-SPECIFIC DEFINITIONS. 
+#
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# Items in this file will need to be changed for each platform.
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# Parallel Fortran:
+#
+# For CG, EP, FT, MG, LU, SP and BT, which are in Fortran, the following must 
+# be defined:
+#
+# MPIFC      - Fortran compiler
+# FFLAGS     - Fortran compilation arguments
+# FMPI_INC   - any -I arguments required for compiling MPI/Fortran 
+# FLINK      - Fortran linker
+# FLINKFLAGS - Fortran linker arguments
+# FMPI_LIB   - any -L and -l arguments required for linking MPI/Fortran 
+# 
+# compilations are done with $(MPIFC) $(FMPI_INC) $(FFLAGS) or
+#                            $(MPIFC) $(FFLAGS)
+# linking is done with       $(FLINK) $(FMPI_LIB) $(FLINKFLAGS)
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# This is the fortran compiler used for MPI programs
+#---------------------------------------------------------------------------
+MPIFC = mpif90
+# This links MPI fortran programs; usually the same as ${MPIFC}
+FLINK	= $(MPIFC)
+
+#---------------------------------------------------------------------------
+# These macros are passed to the linker to help link with MPI correctly
+#---------------------------------------------------------------------------
+FMPI_LIB =
+
+#---------------------------------------------------------------------------
+# These macros are passed to the compiler to help find 'mpif.h'
+#---------------------------------------------------------------------------
+FMPI_INC =
+
+#---------------------------------------------------------------------------
+# Global *compile time* flags for Fortran programs
+#---------------------------------------------------------------------------
+FFLAGS	= -O3
+
+#---------------------------------------------------------------------------
+# Global *link time* flags. Flags for increasing maximum executable 
+# size usually go here. 
+#---------------------------------------------------------------------------
+FLINKFLAGS = $(FFLAGS)
+
+
+#---------------------------------------------------------------------------
+# Parallel C:
+#
+# For IS, which is in C, the following must be defined:
+#
+# MPICC      - C compiler 
+# CFLAGS     - C compilation arguments
+# CMPI_INC   - any -I arguments required for compiling MPI/C 
+# CLINK      - C linker
+# CLINKFLAGS - C linker flags
+# CMPI_LIB   - any -L and -l arguments required for linking MPI/C 
+#
+# compilations are done with $(MPICC) $(CMPI_INC) $(CFLAGS) or
+#                            $(MPICC) $(CFLAGS)
+# linking is done with       $(CLINK) $(CMPI_LIB) $(CLINKFLAGS)
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# This is the C compiler used for MPI programs
+#---------------------------------------------------------------------------
+MPICC = mpicc
+# This links MPI C programs; usually the same as ${MPICC}
+CLINK	= $(MPICC)
+
+#---------------------------------------------------------------------------
+# These macros are passed to the linker to help link with MPI correctly
+#---------------------------------------------------------------------------
+CMPI_LIB =
+
+#---------------------------------------------------------------------------
+# These macros are passed to the compiler to help find 'mpi.h'
+#---------------------------------------------------------------------------
+CMPI_INC =
+
+#---------------------------------------------------------------------------
+# Global *compile time* flags for C programs
+#---------------------------------------------------------------------------
+CFLAGS	= -O3
+
+#---------------------------------------------------------------------------
+# Global *link time* flags. Flags for increasing maximum executable 
+# size usually go here. 
+#---------------------------------------------------------------------------
+CLINKFLAGS = $(CFLAGS)
+
+
+#---------------------------------------------------------------------------
+# MPI dummy library:
+#
+# Uncomment if you want to use the MPI dummy library supplied by NAS instead 
+# of the true message-passing library. The include file redefines several of
+# the above macros. It also invokes make in subdirectory MPI_dummy. Make 
+# sure that no spaces or tabs precede include.
+#---------------------------------------------------------------------------
+# include ../config/make.dummy
+
+
+#---------------------------------------------------------------------------
+# Utilities C:
+#
+# This is the C compiler used to compile C utilities.  Flags required by 
+# this compiler go here also; typically there are few flags required; hence 
+# there are no separate macros provided for such flags.
+#---------------------------------------------------------------------------
+CC	= gcc -g
+
+
+#---------------------------------------------------------------------------
+# Destination of executables, relative to subdirs of the main directory.
+#---------------------------------------------------------------------------
+BINDIR	= ../bin
+
+
+#---------------------------------------------------------------------------
+# Some machines (e.g. Crays) have 128-bit DOUBLE PRECISION numbers, which
+# is twice the precision required for the NPB suite. A compiler flag 
+# (e.g. -dp) can usually be used to change DOUBLE PRECISION variables to
+# 64 bits, but the MPI library may continue to send 128 bits. Short of
+# recompiling MPI, the solution is to use MPI_REAL to send these 64-bit
+# numbers, and MPI_COMPLEX to send their complex counterparts. Uncomment
+# the following line to enable this substitution. 
+# 
+# NOTE: IF THE I/O BENCHMARK IS BEING BUILT, WE USE CONVERTFLAG TO
+#       SPECIFY THE FORTRAN RECORD LENGTH UNIT. IT IS A SYSTEM-SPECIFIC
+#       VALUE (USUALLY 1 OR 4). UNCOMMENT THE SECOND LINE AND SUBSTITUTE
+#       THE CORRECT VALUE FOR "length".
+#       IF BOTH 128-BIT DOUBLE PRECISION NUMBERS AND I/O ARE TO BE ENABLED,
+#       UNCOMMENT THE THIRD LINE AND SUBSTITUTE THE CORRECT VALUE FOR
+#       "length"
+#---------------------------------------------------------------------------
+# CONVERTFLAG	= -DCONVERTDOUBLE
+# CONVERTFLAG	= -DFORTRAN_REC_SIZE=length
+# CONVERTFLAG	= -DCONVERTDOUBLE -DFORTRAN_REC_SIZE=length
+CONVERTFLAG	= -DFORTRAN_REC_SIZE=0
+
+
+#---------------------------------------------------------------------------
+# The variable RAND controls which random number generator 
+# is used. It is described in detail in README.install. 
+# Use "randi8" unless there is a reason to use another one. 
+# Other allowed values are "randi8_safe", "randdp" and "randdpvec"
+#---------------------------------------------------------------------------
+RAND   = randi8
+# The following is highly reliable but may be slow:
+# RAND   = randdp
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/config/make.dummy b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/config/make.dummy
new file mode 100644
index 000000000..34f754e86
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/config/make.dummy
@@ -0,0 +1,7 @@
+FMPI_LIB  = -L../MPI_dummy -lmpi
+FMPI_INC  = -I../MPI_dummy
+CMPI_LIB  = -L../MPI_dummy -lmpi
+CMPI_INC  = -I../MPI_dummy
+default:: ${PROGRAM} libmpi.a
+libmpi.a: 
+	cd ../MPI_dummy; $(MAKE) FC=$(MPIFC) CC=$(MPICC)
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/config/suite.def.template b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/config/suite.def.template
new file mode 100644
index 000000000..d018a1b78
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/config/suite.def.template
@@ -0,0 +1,24 @@
+# config/suite.def
+# This file is used to build several benchmarks with a single command. 
+# Typing "make suite" in the main directory will build all the benchmarks
+# specified in this file. 
+# Each line of this file contains a benchmark name, class, and number
+# of nodes. The name is one of "cg", "is", "ep", "mg", "ft", "sp", "bt", 
+# "lu", and "dt". 
+# The class is one of "S", "W", "A", "B", "C", "D", and "E"
+# (except that no classes C, D and E for DT, and no class E for IS).
+# The number of nodes must be a legal number for a particular
+# benchmark. The utility which parses this file is primitive, so
+# formatting is inflexible. Separate name/class by tabs. 
+# Comments start with "#" as the first character on a line. 
+# No blank lines. 
+# The following example builds sample sizes of all benchmarks. 
+ft	S
+mg	S
+sp	S
+lu	S
+bt	S
+is	S
+ep	S
+cg	S
+dt	S
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/sys/Makefile b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/sys/Makefile
new file mode 100644
index 000000000..d2137def6
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/sys/Makefile
@@ -0,0 +1,22 @@
+include ../config/make.def
+
+# Note that COMPILE is also defined in make.common and should
+# be the same. We can't include make.common because it has a lot
+# of other garbage. LINK is not defined in make.common because
+# ${MPI_LIB} needs to go at the end of the line. 
+FCOMPILE = $(MPIFC) -c $(FMPI_INC) $(FFLAGS)
+
+all: setparams 
+
+# setparams creates an npbparams.h file for each benchmark 
+# configuration. npbparams.h also contains info about how a benchmark
+# was compiled and linked
+
+setparams: setparams.c ../config/make.def
+	$(CC) ${CONVERTFLAG} -o setparams setparams.c
+
+
+clean: 
+	-rm -f setparams setparams.h npbparams.h
+	-rm -f *~ *.o
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/sys/README b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/sys/README
new file mode 100644
index 000000000..3c97c524c
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/sys/README
@@ -0,0 +1,39 @@
+This directory contains utilities and files used by the 
+build process. You should not need to change anything
+in this directory. 
+
+Original Files
+--------------
+setparams.c:
+        Source for the setparams program. This program is used internally
+        in the build process to create the file "npbparams.h" for each 
+        benchmark. npbparams.h contains Fortran or C parameters to build a 
+        benchmark for a specific class and number of nodes. The setparams 
+        program is never run directly by a user. Its invocation syntax is 
+        "setparams benchmark-name class [subtype]". 
+        It examines the file "npbparams.h" in the current directory. If 
+        the specified parameters are the same as those in the npbparams.h 
+        file, nothing is changed. If the file does not exist or corresponds 
+        to a different class/number of nodes, it is (re)built. 
+        One of the more complicated things in npbparams.h is that it 
+        contains, in a Fortran string, the compiler flags used to build a 
+        benchmark, so that a benchmark can print out how it was compiled. 
+
+make.common
+        A makefile segment that is included in each individual benchmark
+        program makefile. It sets up some standard macros (COMPILE, etc) 
+        and makes sure everything is configured correctly (npbparams.h)
+
+Makefile
+        Builds  setparams
+
+README
+        This file. 
+
+
+Created files
+-------------
+
+setparams
+	See descriptions above
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/sys/make.common b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/sys/make.common
new file mode 100644
index 000000000..dbd65235a
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/sys/make.common
@@ -0,0 +1,58 @@
+PROGRAM  = $(BINDIR)/$(BENCHMARK).$(CLASS).x
+FCOMPILE = $(MPIFC) -c $(FMPI_INC) $(FFLAGS)
+CCOMPILE = $(MPICC) -c $(CMPI_INC) $(CFLAGS)
+
+# Class "U" is used internally by the setparams program to mean
+# "unknown". This means that if you don't specify CLASS=
+# on the command line, you'll get an error. It would be nice
+# to be able to avoid this, but we'd have to get information
+# from the setparams back to the make program, which isn't easy. 
+CLASS=U
+
+default:: ${PROGRAM}
+
+# This makes sure the configuration utility setparams 
+# is up to date. 
+# Note that this must be run every time, which is why the
+# target does not exist and is not created. 
+# If you create a file called "config" you will break things. 
+config:
+	@cd ../sys; ${MAKE} all
+	../sys/setparams ${BENCHMARK} ${CLASS} ${SUBTYPE}
+
+COMMON=../common
+${COMMON}/${RAND}.o: ${COMMON}/${RAND}.f90
+	cd ${COMMON}; ${FCOMPILE} ${RAND}.f90
+${COMMON}/c_randdp.o: ${COMMON}/randdp.c
+	cd ${COMMON}; ${CCOMPILE} -o c_randdp.o randdp.c
+
+${COMMON}/get_active_nprocs.o: ${COMMON}/get_active_nprocs.f90
+	cd ${COMMON}; ${FCOMPILE} get_active_nprocs.f90
+
+${COMMON}/print_results.o: ${COMMON}/print_results.f90
+	cd ${COMMON}; ${FCOMPILE} print_results.f90
+
+${COMMON}/c_print_results.o: ${COMMON}/c_print_results.c
+	cd ${COMMON}; ${CCOMPILE} c_print_results.c
+
+${COMMON}/timers.o: ${COMMON}/timers.f90
+	cd ${COMMON}; ${FCOMPILE} timers.f90
+
+${COMMON}/c_timers.o: ${COMMON}/c_timers.c
+	cd ${COMMON}; ${CCOMPILE} c_timers.c
+
+# Normally setparams updates npbparams.h only if the settings (CLASS)
+# have changed. However, we also want to update if the compile options
+# may have changed (set in ../config/make.def). 
+npbparams.h: ../config/make.def
+	@ echo make.def modified. Rebuilding npbparams.h just in case
+	rm -f npbparams.h
+	../sys/setparams ${BENCHMARK} ${CLASS} ${SUBTYPE}
+
+# So that "make benchmark-name" works
+${BENCHMARK}:  default
+${BENCHMARKU}: default
+
+.SUFFIXES:
+.SUFFIXES: .c .h .f90 .f .o
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/sys/print_header b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/sys/print_header
new file mode 100755
index 000000000..f6ac896bc
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/sys/print_header
@@ -0,0 +1,5 @@
+echo '   ========================================='
+echo '   =      NAS Parallel Benchmarks 3.4      ='
+echo '   =      MPI/Fortran/C                    ='
+echo '   ========================================='
+echo ''
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/sys/print_instructions b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/sys/print_instructions
new file mode 100755
index 000000000..69dfc336d
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/sys/print_instructions
@@ -0,0 +1,25 @@
+echo ''
+echo '   To make a NAS benchmark type '
+echo ''
+echo '         make <benchmark-name> CLASS=<class> [SUBTYPE=<type>]'
+echo ''
+echo '   where <benchmark-name>  is "bt", "cg", "dt", "ep", "ft", "is",'
+echo '                              "lu", "mg", or "sp"'
+echo '         <class>           is "S", "W", "A", "B", "C", "D", "E", or "F"'
+echo ''
+echo '   Only when making the I/O benchmark:'
+echo ''
+echo '         <benchmark-name>  is "bt"'
+echo '         <class> as above'
+echo '         <type>            is "full", "simple", "fortran", or "epio"'
+echo ''
+echo '   To make a set of benchmarks, create the file config/suite.def'
+echo '   according to the instructions in config/suite.def.template and type'
+echo ''
+echo '         make suite'
+echo ''
+echo ' ***************************************************************'
+echo ' * Remember to edit the file config/make.def for site specific *'
+echo ' * information as described in the README file                 *'
+echo ' ***************************************************************'
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/sys/setparams.c b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/sys/setparams.c
new file mode 100644
index 000000000..597a10055
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/sys/setparams.c
@@ -0,0 +1,1120 @@
+/* 
+ * This utility configures an NPB to be built for a specific number
+ * of nodes and a specific class. It creates a file "npbparams.h" 
+ * in the source directory. This file keeps state information about 
+ * which size of benchmark is currently being built (so that nothing
+ * is unnecessarily rebuilt) and defines (through PARAMETER statements)
+ * the number of nodes and class for which a benchmark is being built. 
+ *
+ * The utility takes 2 arguments, plus an optional third for BT:
+ *       setparams benchmark-name class [subtype]
+ *    benchmark-name is "sp", "bt", etc
+ *    class is the size of the benchmark; subtype is the BT I/O type
+ * These parameters are checked for the current benchmark. If they
+ * are invalid, this program prints a message and aborts. 
+ * If the parameters are ok, the current npbparams.h (actually just
+ * the first line) is read in. If the new parameters are the same as 
+ * the old, nothing is done, but an exit code is returned to force the
+ * user to specify (otherwise the make procedure succeeds but builds a
+ * binary of the wrong name).  Otherwise the file is rewritten. 
+ * Errors write a message (to stdout) and abort. 
+ * 
+ * This program makes use of two extra benchmark "classes"
+ * class "X" means an invalid specification. It is returned if
+ * there is an error parsing the config file. 
+ * class "U" is an external specification meaning "unknown class"
+ * 
+ * Unfortunately everything has to be case sensitive. This is
+ * because we can always convert lower to upper or v.v. but
+ * can't feed this information back to the makefile, so typing
+ * make CLASS=a and make CLASS=A will produce different binaries.
+ *
+ * 
+ */
+
+#include <sys/types.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <ctype.h>
+#include <string.h>
+#include <time.h>
+
+/*
+ * This is the master version number for this set of 
+ * NPB benchmarks. It is in an obscure place so people
+ * won't accidentally change it. 
+ */
+
+#define VERSION "3.4.2"
+
+/* controls verbose output from setparams */
+/* #define VERBOSE */
+
+#define FILENAME "npbparams.h"
+#define DESC_LINE "! CLASS = %c\n"
+#define BT_DESC_LINE "! CLASS = %c SUBTYPE = %s\n"
+#define DEF_CLASS_LINE     "#define CLASS '%c'\n"
+#define FINDENT  "        "
+#define CONTINUE "     & "
+
+#ifdef FORTRAN_REC_SIZE
+int fortran_rec_size = FORTRAN_REC_SIZE;
+#else
+int fortran_rec_size = 0;
+#endif
+
+void get_info(int argc, char *argv[], int *typep, char *classp,
+	      int* subtypep);
+void check_info(int type, char class);
+void read_info(int type, char *classp, int *subtypep);
+void write_info(int type, char class, int subtype);
+void write_sp_info(FILE *fp, char class);
+void write_bt_info(FILE *fp, char class, int io);
+void write_lu_info(FILE *fp, char class);
+void write_mg_info(FILE *fp, char class);
+void write_cg_info(FILE *fp, char class);
+void write_ft_info(FILE *fp, char class);
+void write_ep_info(FILE *fp, char class);
+void write_is_info(FILE *fp, char class);
+void write_dt_info(FILE *fp, char class);
+void write_compiler_info(int type, FILE *fp);
+void write_convertdouble_info(int type, FILE *fp);
+void check_line(char *line, char *label, char *val);
+int  check_include_line(char *line, char *filename);
+void put_string(FILE *fp, char *name, char *val);
+void put_def_string(FILE *fp, char *name, char *val);
+void put_def_variable(FILE *fp, char *name, char *val);
+int isqrt(int i);
+int ilog2(int i);
+int ipow2(int i);
+int isqrt2(int i);
+
+enum benchmark_types {SP, BT, LU, MG, FT, IS, DT, EP, CG};
+enum iotypes { NONE = 0, FULL, SIMPLE, EPIO, FORTRAN};
+
+int main(int argc, char *argv[])
+{
+  int type;
+  char class, class_old;
+  int subtype = -1, old_subtype = -1;
+  
+  /* Get command line arguments. Make sure they're ok. */
+  get_info(argc, argv, &type, &class, &subtype);
+  if (class != 'U') {
+#ifdef VERBOSE
+    printf("setparams: For benchmark %s: class = %c\n", 
+	   argv[1], class); 
+#endif
+    check_info(type, class);
+  }
+
+  /* Get old information. */
+  read_info(type, &class_old, &old_subtype);
+  if (class != 'U') {
+    if (class_old != 'X') {
+#ifdef VERBOSE
+      printf("setparams:     old settings: class = %c\n", 
+	     class_old); 
+#endif
+    }
+  } else {
+    printf("setparams:\n\
+  ************************************************************\n\
+  * You must specify CLASS to build this benchmark           *\n\
+  * For example, to build a class A benchmark, type          *\n\
+  *       make {benchmark-name} CLASS=A                      *\n\
+  ************************************************************\n\n"); 
+
+    if (class_old != 'X') {
+#ifdef VERBOSE
+      printf("setparams: Previous settings were CLASS=%c\n", 
+	     class_old); 
+#endif
+    }
+    exit(1); /* exit on class==U */
+  }
+
+  /* Write out new information if it's different. */
+  if (class != class_old || subtype != old_subtype) {
+#ifdef VERBOSE
+    printf("setparams: Writing %s\n", FILENAME); 
+#endif
+    write_info(type, class, subtype);
+  } else {
+#ifdef VERBOSE
+    printf("setparams: Settings unchanged. %s unmodified\n", FILENAME); 
+#endif
+  }
+
+  return 0;
+}
+
+
+/*
+ *  get_info(): Get parameters from command line 
+ */
+
+void get_info(int argc, char *argv[], int *typep, char *classp,
+	      int *subtypep) 
+{
+
+  if (argc < 3) {
+    printf("Usage: %s (%d) benchmark-name class\n", argv[0], argc);
+    exit(1);
+  }
+
+  *classp = *argv[2];
+
+  if      (!strcmp(argv[1], "sp") || !strcmp(argv[1], "SP")) *typep = SP;
+  else if (!strcmp(argv[1], "ft") || !strcmp(argv[1], "FT")) *typep = FT;
+  else if (!strcmp(argv[1], "lu") || !strcmp(argv[1], "LU")) *typep = LU;
+  else if (!strcmp(argv[1], "mg") || !strcmp(argv[1], "MG")) *typep = MG;
+  else if (!strcmp(argv[1], "is") || !strcmp(argv[1], "IS")) *typep = IS;
+  else if (!strcmp(argv[1], "dt") || !strcmp(argv[1], "DT")) *typep = DT;
+  else if (!strcmp(argv[1], "ep") || !strcmp(argv[1], "EP")) *typep = EP;
+  else if (!strcmp(argv[1], "cg") || !strcmp(argv[1], "CG")) *typep = CG;
+  else if (!strcmp(argv[1], "bt") || !strcmp(argv[1], "BT")) {
+    *typep = BT;
+    if (argc != 4) {
+      /* printf("Usage: %s (%d) benchmark-name class\n", argv[0], argc); */
+      /* exit(1); */
+      *subtypep = NONE;
+    } else {
+      char *sstr = argv[3];
+      if (!strcmp(sstr, "full") || !strcmp(sstr, "FULL")) {
+        *subtypep = FULL;
+      } else if (!strcmp(sstr, "simple") || !strcmp(sstr, "SIMPLE")) {
+        *subtypep = SIMPLE;
+      } else if (!strcmp(sstr, "epio") || !strcmp(sstr, "EPIO")) {
+        *subtypep = EPIO;
+      } else if (!strcmp(sstr, "fortran") || !strcmp(sstr, "FORTRAN")) {
+        *subtypep = FORTRAN;
+      } else if (!strcmp(sstr, "none") || !strcmp(sstr, "NONE")) {
+        *subtypep = NONE;
+      } else {
+        printf("setparams: Error: unknown btio type %s\n", sstr);
+        printf("valid types - full, simple, epio, fortran, none\n");
+        exit(1);
+      }
+      if (*classp == 'F') {
+        printf("setparams: Error: btio type %s not defined for class %c\n", 
+               sstr, *classp);
+        exit(1);
+      }
+    }
+  } else {
+    printf("setparams: Error: unknown benchmark type %s\n", argv[1]);
+    exit(1);
+  }
+}
+
+/*
+ *  check_info(): Make sure command line data is ok for this benchmark 
+ */
+
+void check_info(int type, char class) 
+{
+  /* check class */
+  if (class != 'S' && 
+      class != 'W' && 
+      class != 'A' && 
+      class != 'B' && 
+      class != 'C' && 
+      class != 'D' && 
+      class != 'E' && 
+      class != 'F') {
+    printf("setparams: Unknown benchmark class %c\n", class); 
+    printf("setparams: Allowed classes are \"S\", \"W\", and \"A\" through \"F\"\n");
+    exit(1);
+  }
+
+  if ((class == 'E' && type == DT) ||
+      (class == 'F' && (type == IS || type == DT))) {
+    printf("setparams: Benchmark class %c not defined for %s\n", 
+           class, (type == IS)? "IS" : "DT");
+    exit(1);
+  }
+}
+
+
+/* 
+ * read_info(): Read previous information from file. 
+ *              Not an error if file doesn't exist, because this
+ *              may be the first time we're running. 
+ *              Assumes the first line of the file is in a special
+ *              format that we understand (since we wrote it). 
+ */
+
+void read_info(int type, char *classp, int *subtypep)
+{
+  int nread = 0;
+  FILE *fp;
+  fp = fopen(FILENAME, "r");
+  if (fp == NULL) {
+#ifdef VERBOSE
+    printf("setparams: INFO: configuration file %s does not exist (yet)\n", FILENAME); 
+#endif
+    goto abort;
+  }
+  
+  /* first line of file contains info (fortran), first two lines (C) */
+
+  switch(type) {
+      case BT: {
+	  char subtype_str[100];
+          nread = fscanf(fp, BT_DESC_LINE, classp, subtype_str);
+          if (nread != 2) {
+            if (nread != 1) {
+              printf("setparams: Error parsing config file %s. Ignoring previous settings\n", FILENAME);
+              goto abort;
+	    }
+	    *subtypep = 0;
+	    break;
+          }
+          if (!strcmp(subtype_str, "full") || !strcmp(subtype_str, "FULL")) {
+		*subtypep = FULL;
+          } else if (!strcmp(subtype_str, "simple") ||
+		     !strcmp(subtype_str, "SIMPLE")) {
+		*subtypep = SIMPLE;
+          } else if (!strcmp(subtype_str, "epio") || !strcmp(subtype_str, "EPIO")) {
+		*subtypep = EPIO;
+          } else if (!strcmp(subtype_str, "fortran") ||
+		     !strcmp(subtype_str, "FORTRAN")) {
+		*subtypep = FORTRAN;
+          } else {
+		*subtypep = -1;
+	  }
+          break;
+      }
+
+      case SP:
+      case FT:
+      case MG:
+      case LU:
+      case EP:
+      case CG:
+          nread = fscanf(fp, DESC_LINE, classp);
+          if (nread != 1) {
+            printf("setparams: Error parsing config file %s. Ignoring previous settings\n", FILENAME);
+            goto abort;
+          }
+          break;
+      case IS:
+      case DT:
+          nread = fscanf(fp, DEF_CLASS_LINE, classp);
+          if (nread != 1) {
+            printf("setparams: Error parsing config file %s. Ignoring previous settings\n", FILENAME);
+            goto abort;
+          }
+          break;
+      default:
+        /* never should have gotten this far with a bad name */
+        printf("setparams: (Internal Error) Benchmark type %d unknown to this program\n", type); 
+        exit(1);
+  }
+
+  fclose(fp);
+
+
+  return;
+
+ abort:
+  *classp = 'X';
+  *subtypep = -1;
+  if (fp != NULL) fclose(fp);  /* do not leak the handle on parse errors */
+}
+
+
+/* 
+ * write_info(): Write new information to config file. 
+ *               First line is in a special format so we can read
+ *               it in again. Then comes a warning. The rest is all
+ *               specific to a particular benchmark. 
+ */
+
+void write_info(int type, char class, int subtype) 
+{
+  FILE *fp;
+  char *BT_TYPES[] = {"NONE", "FULL", "SIMPLE", "EPIO", "FORTRAN"};
+
+  fp = fopen(FILENAME, "w");
+  if (fp == NULL) {
+    printf("setparams: Can't open file %s for writing\n", FILENAME);
+    exit(1);
+  }
+
+  switch(type) {
+      case BT:
+          /* Write out the header */
+	  if (subtype == -1 || subtype == 0) {
+            fprintf(fp, DESC_LINE, class);
+	  } else {
+            fprintf(fp, BT_DESC_LINE, class, BT_TYPES[subtype]);
+	  }
+          /* Print out a warning so bozos don't mess with the file */
+          fprintf(fp, "\
+!  \n\
+!  \n\
+!  This file is generated automatically by the setparams utility.\n\
+!  It sets the number of processors and the class of the NPB\n\
+!  in this directory. Do not modify it by hand.\n\
+!  \n");
+
+          break;
+	
+      case SP:
+      case FT:
+      case MG:
+      case LU:
+      case EP:
+      case CG:
+          /* Write out the header */
+          fprintf(fp, DESC_LINE, class);
+          /* Print out a warning so bozos don't mess with the file */
+          fprintf(fp, "\
+!  \n\
+!  \n\
+!  This file is generated automatically by the setparams utility.\n\
+!  It sets the number of processors and the class of the NPB\n\
+!  in this directory. Do not modify it by hand.\n\
+!  \n");
+
+          break;
+      case IS:
+      case DT:
+          fprintf(fp, DEF_CLASS_LINE, class);
+          fprintf(fp, "\
+/*\n\
+   This file is generated automatically by the setparams utility.\n\
+   It sets the number of processors and the class of the NPB\n\
+   in this directory. Do not modify it by hand.   */\n\
+   \n");
+          break;
+      default:
+          printf("setparams: (Internal error): Unknown benchmark type %d\n", 
+                                                                         type);
+          exit(1);
+  }
+
+  /* Now do benchmark-specific stuff */
+  switch(type) {
+  case SP:
+    write_sp_info(fp, class);
+    break;
+  case LU:
+    write_lu_info(fp, class);
+    break;
+  case MG:
+    write_mg_info(fp, class);
+    break;
+  case IS:
+    write_is_info(fp, class);  
+    break;
+  case DT:
+    write_dt_info(fp, class);  
+    break;
+  case FT:
+    write_ft_info(fp, class);
+    break;
+  case EP:
+    write_ep_info(fp, class);
+    break;
+  case CG:
+    write_cg_info(fp, class);
+    break;
+  case BT:
+    write_bt_info(fp, class, subtype);
+    break;
+  default:
+    printf("setparams: (Internal error): Unknown benchmark type %d\n", type);
+    exit(1);
+  }
+  write_convertdouble_info(type, fp);
+  write_compiler_info(type, fp);
+  fclose(fp);
+  return;
+}
+
+
+/* 
+ * write_sp_info(): Write SP specific info to config file
+ */
+
+void write_sp_info(FILE *fp, char class) 
+{
+  int problem_size, niter;
+  char *dt;
+
+  if      (class == 'S') { problem_size = 12;  dt = "0.015d0";   niter = 100; }
+  else if (class == 'W') { problem_size = 36;  dt = "0.0015d0";  niter = 400; }
+  else if (class == 'A') { problem_size = 64;  dt = "0.0015d0";  niter = 400; }
+  else if (class == 'B') { problem_size = 102; dt = "0.001d0";   niter = 400; }
+  else if (class == 'C') { problem_size = 162; dt = "0.00067d0"; niter = 400; }
+  else if (class == 'D') { problem_size = 408; dt = "0.00030d0"; niter = 500; }
+  else if (class == 'E') { problem_size = 1020; dt = "0.0001d0"; niter = 500; }
+  else if (class == 'F') { problem_size = 2560; dt = "0.15d-4";  niter = 500; }
+  else {
+    printf("setparams: Internal error: invalid class %c\n", class);
+    exit(1);
+  }
+  fprintf(fp, "%sinteger problem_size, niter_default\n", FINDENT);
+  fprintf(fp, "%sparameter (problem_size=%d, niter_default=%d)\n", 
+	       FINDENT, problem_size, niter);
+  fprintf(fp, "%sdouble precision dt_default\n", FINDENT);
+  fprintf(fp, "%sparameter (dt_default = %s)\n", FINDENT, dt);
+}
+  
+/* 
+ * write_bt_info(): Write BT specific info to config file
+ */
+
+void write_bt_info(FILE *fp, char class, int io) 
+{
+  int problem_size, niter, wr_interval;
+  char *dt;
+
+  if      (class == 'S') { problem_size = 12;  dt = "0.010d0";    niter = 60;  }
+  else if (class == 'W') { problem_size = 24;  dt = "0.0008d0";   niter = 200; }
+  else if (class == 'A') { problem_size = 64;  dt = "0.0008d0";   niter = 200; }
+  else if (class == 'B') { problem_size = 102; dt = "0.0003d0";   niter = 200; }
+  else if (class == 'C') { problem_size = 162; dt = "0.0001d0";   niter = 200; }
+  else if (class == 'D') { problem_size = 408; dt = "0.00002d0";  niter = 250; }
+  else if (class == 'E') { problem_size = 1020; dt = "0.4d-5";    niter = 250; }
+  else if (class == 'F') { problem_size = 2560; dt = "0.6d-6";    niter = 250; }
+  else {
+    printf("setparams: Internal error: invalid class %c\n", class);
+    exit(1);
+  }
+  wr_interval = 5;
+  fprintf(fp, "%sinteger problem_size, niter_default\n", FINDENT);
+  fprintf(fp, "%sparameter (problem_size=%d, niter_default=%d)\n", 
+	       FINDENT, problem_size, niter);
+  fprintf(fp, "%sdouble precision dt_default\n", FINDENT);
+  fprintf(fp, "%sparameter (dt_default = %s)\n", FINDENT, dt);
+  fprintf(fp, "%sinteger wr_default\n", FINDENT);
+  fprintf(fp, "%sparameter (wr_default = %d)\n", FINDENT, wr_interval);
+  fprintf(fp, "%sinteger iotype\n", FINDENT);
+  fprintf(fp, "%sparameter (iotype = %d)\n", FINDENT, io);
+  if (io) {
+    fprintf(fp, "%scharacter*(*) filenm\n", FINDENT);
+    switch (io) {
+	case FULL:
+	    fprintf(fp, "%sparameter (filenm = 'btio.full.out')\n", FINDENT);
+	    break;
+	case SIMPLE:
+	    fprintf(fp, "%sparameter (filenm = 'btio.simple.out')\n", FINDENT);
+	    break;
+	case EPIO:
+	    fprintf(fp, "%sparameter (filenm = 'btio.epio.out')\n", FINDENT);
+	    break;
+	case FORTRAN:
+	    fprintf(fp, "%sparameter (filenm = 'btio.fortran.out')\n", FINDENT);
+	    fprintf(fp, "%sinteger fortran_rec_sz\n", FINDENT);
+	    fprintf(fp, "%sparameter (fortran_rec_sz = %d)\n",
+		    FINDENT, fortran_rec_size);
+	    break;
+	default:
+	    break;
+    }
+  }
+}
+  
+
+
+/* 
+ * write_lu_info(): Write LU specific info to config file
+ */
+
+void write_lu_info(FILE *fp, char class) 
+{
+  int itmax, inorm, problem_size;
+  char *dt_default;
+
+  if      (class == 'S') { problem_size = 12;  dt_default = "0.5d0";  itmax = 50; }
+  else if (class == 'W') { problem_size = 33;  dt_default = "1.5d-3"; itmax = 300; }
+  else if (class == 'A') { problem_size = 64;  dt_default = "2.0d0";  itmax = 250; }
+  else if (class == 'B') { problem_size = 102; dt_default = "2.0d0";  itmax = 250; }
+  else if (class == 'C') { problem_size = 162; dt_default = "2.0d0";  itmax = 250; }
+  else if (class == 'D') { problem_size = 408; dt_default = "1.0d0";  itmax = 300; }
+  else if (class == 'E') { problem_size = 1020; dt_default = "0.5d0"; itmax = 300; }
+  else if (class == 'F') { problem_size = 2560; dt_default = "0.2d0"; itmax = 300; }
+  else {
+    printf("setparams: Internal error: invalid class %c\n", class);
+    exit(1);
+  }
+  inorm = itmax;
+
+  fprintf(fp, "\n! full problem size\n");
+  fprintf(fp, "%sinteger isiz01, isiz02, isiz03\n", FINDENT);
+  fprintf(fp, "%sparameter (isiz01=%d, isiz02=%d, isiz03=%d)\n", 
+	  FINDENT, problem_size, problem_size, problem_size);
+
+  fprintf(fp, "\n! number of iterations and how often to print the norm\n");
+  fprintf(fp, "%sinteger itmax_default, inorm_default\n", FINDENT);
+  fprintf(fp, "%sparameter (itmax_default=%d, inorm_default=%d)\n", 
+	  FINDENT, itmax, inorm);
+
+  fprintf(fp, "%sdouble precision dt_default\n", FINDENT);
+  fprintf(fp, "%sparameter (dt_default = %s)\n", FINDENT, dt_default);
+  
+}
+
+/* 
+ * write_mg_info(): Write MG specific info to config file
+ */
+
+void write_mg_info(FILE *fp, char class) 
+{
+  int problem_size, nit, log2_size, lt_default;
+
+  if      (class == 'S') { problem_size = 32;   nit = 4; }
+  else if (class == 'W') { problem_size = 128;  nit = 4; }
+  else if (class == 'A') { problem_size = 256;  nit = 4; }
+  else if (class == 'B') { problem_size = 256;  nit = 20; }
+  else if (class == 'C') { problem_size = 512;  nit = 20; }
+  else if (class == 'D') { problem_size = 1024; nit = 50; }
+  else if (class == 'E') { problem_size = 2048; nit = 50; }
+  else if (class == 'F') { problem_size = 4096; nit = 50; }
+  else {
+    printf("setparams: Internal error: invalid class type %c\n", class);
+    exit(1);
+  }
+  log2_size = ilog2(problem_size);
+  lt_default = log2_size;
+
+  fprintf(fp, "%sinteger nx_default, ny_default, nz_default\n", FINDENT);
+  fprintf(fp, "%sparameter (nx_default=%d, ny_default=%d, nz_default=%d)\n", 
+	  FINDENT, problem_size, problem_size, problem_size);
+  fprintf(fp, "%sinteger nit_default, lt_default\n", FINDENT);
+  fprintf(fp, "%sparameter (nit_default=%d, lt_default=%d)\n", 
+	  FINDENT, nit, lt_default);
+  fprintf(fp, "%sinteger debug_default\n", FINDENT);
+  fprintf(fp, "%sparameter (debug_default=%d)\n", FINDENT, 0);
+}
+
+
+/* 
+ * write_dt_info(): Write DT specific info to config file
+ */
+
+void write_dt_info(FILE *fp, char class) 
+{
+  int num_samples,deviation,num_sources;
+  if      (class == 'S') { num_samples=1728; deviation=128; num_sources=4; }
+  else if (class == 'W') { num_samples=1728*8; deviation=128*2; num_sources=4*2; }
+  else if (class == 'A') { num_samples=1728*64; deviation=128*4; num_sources=4*4; }
+  else if (class == 'B') { num_samples=1728*512; deviation=128*8; num_sources=4*8; }
+  else if (class == 'C') { num_samples=1728*4096; deviation=128*16; num_sources=4*16; }
+  else if (class == 'D') { num_samples=1728*4096*8; deviation=128*32; num_sources=4*32; }
+  else {
+    printf("setparams: Internal error: invalid class type %c\n", class);
+    exit(1);
+  }
+  fprintf(fp, "#define NUM_SAMPLES %d\n", num_samples);
+  fprintf(fp, "#define STD_DEVIATION %d\n", deviation);
+  fprintf(fp, "#define NUM_SOURCES %d\n", num_sources);
+}
+
+/* 
+ * write_is_info(): Write IS specific info to config file
+ */
+
+void write_is_info(FILE *fp, char class) 
+{
+  if( class != 'S' &&
+      class != 'W' &&
+      class != 'A' &&
+      class != 'B' &&
+      class != 'C' &&
+      class != 'D' &&
+      class != 'E' )
+  {
+    printf("setparams: Internal error: invalid class type %c\n", class);
+    exit(1);
+  }
+}
+
+/* 
+ * write_cg_info(): Write CG specific info to config file
+ */
+
+void write_cg_info(FILE *fp, char class) 
+{
+  int na,nonzer,niter;
+  char *shift,*rcond="1.0d-1";
+
+  if( class == 'S' )
+  { na=1400;    nonzer=7;  niter=15;  shift="10."; }
+  else if( class == 'W' )
+  { na=7000;    nonzer=8;  niter=15;  shift="12."; }
+  else if( class == 'A' )
+  { na=14000;   nonzer=11; niter=15;  shift="20."; }
+  else if( class == 'B' )
+  { na=75000;   nonzer=13; niter=75;  shift="60."; }
+  else if( class == 'C' )
+  { na=150000;  nonzer=15; niter=75;  shift="110."; }
+  else if( class == 'D' )
+  { na=1500000; nonzer=21; niter=100; shift="500."; }
+  else if( class == 'E' )
+  { na=9000000; nonzer=26; niter=100; shift="1.5d3"; }
+  else if( class == 'F' )
+  { na=54000000; nonzer=31; niter=100; shift="5.0d3"; }
+  else
+  {
+    printf("setparams: Internal error: invalid class type %c\n", class);
+    exit(1);
+  }
+  fprintf( fp, "%sinteger            na, nonzer, niter\n", FINDENT );
+  fprintf( fp, "%sdouble precision   shift, rcond\n", FINDENT );
+  fprintf( fp, "%sparameter(  na=%d, &\n", FINDENT, na );
+  fprintf( fp, "%s             nonzer=%d, &\n", CONTINUE, nonzer );
+  fprintf( fp, "%s             niter=%d, &\n", CONTINUE, niter );
+  fprintf( fp, "%s             shift=%s, &\n", CONTINUE, shift );
+  fprintf( fp, "%s             rcond=%s )\n", CONTINUE, rcond );
+}
+
+
+/* 
+ * write_ft_info(): Write FT specific info to config file
+ */
+
+void write_ft_info(FILE *fp, char class) 
+{
+  /* The grid dimensions nx, ny, nz are specified directly for
+   * each class; maxdim is the largest of the three and niter
+   * is the default number of iterations.
+   */
+  int nx, ny, nz, maxdim, niter;
+  if      (class == 'S') { nx = 64;   ny = 64;   nz = 64;   niter = 6;}
+  else if (class == 'W') { nx = 128;  ny = 128;  nz = 32;   niter = 6;}
+  else if (class == 'A') { nx = 256;  ny = 256;  nz = 128;  niter = 6;}
+  else if (class == 'B') { nx = 512;  ny = 256;  nz = 256;  niter =20;}
+  else if (class == 'C') { nx = 512;  ny = 512;  nz = 512;  niter =20;}
+  else if (class == 'D') { nx = 2048; ny = 1024; nz = 1024; niter =25;}
+  else if (class == 'E') { nx = 4096; ny = 2048; nz = 2048; niter =25;}
+  else if (class == 'F') { nx = 8192; ny = 4096; nz = 4096; niter =25;}
+  else {
+    printf("setparams: Internal error: invalid class type %c\n", class);
+    exit(1);
+  }
+  maxdim = nx;
+  if (ny > maxdim) maxdim = ny;
+  if (nz > maxdim) maxdim = nz;
+  fprintf(fp, "%sinteger nx, ny, nz, maxdim, niter_default\n", FINDENT);
+  fprintf(fp, "%sparameter (nx=%d, ny=%d, nz=%d, maxdim=%d)\n", 
+          FINDENT, nx, ny, nz, maxdim);
+  fprintf(fp, "%sparameter (niter_default=%d)\n", FINDENT, niter);
+}
+
+/*
+ * write_ep_info(): Write EP specific info to config file
+ */
+
+void write_ep_info(FILE *fp, char class)
+{
+  /* m is the base-2 logarithm of the number of random-number
+   * pairs generated by the benchmark; it is the only
+   * problem-size parameter set here.
+   */
+  int m;
+  if      (class == 'S') { m = 24; }
+  else if (class == 'W') { m = 25; }
+  else if (class == 'A') { m = 28; }
+  else if (class == 'B') { m = 30; }
+  else if (class == 'C') { m = 32; }
+  else if (class == 'D') { m = 36; }
+  else if (class == 'E') { m = 40; }
+  else if (class == 'F') { m = 44; }
+  else {
+    printf("setparams: Internal error: invalid class type %c\n", class);
+    exit(1);
+  }
+  /* number of processors given by "npm" */
+
+
+  fprintf(fp, "%scharacter class\n",FINDENT);
+  fprintf(fp, "%sparameter (class =\'%c\')\n",
+                  FINDENT, class);
+  fprintf(fp, "%sinteger m\n", FINDENT);
+  fprintf(fp, "%sparameter (m=%d)\n",
+          FINDENT, m);
+}
+
+
+/* 
+ * This is a gross hack to allow the benchmarks to 
+ * print out how they were compiled. Various other ways
+ * of doing this have been tried and they all fail on
+ * some machine - due to a broken "make" program, or
+ * Fortran limitations, or whatever. Hopefully this will
+ * always work because it uses very portable C. Unfortunately
+ * it relies on parsing the make.def file - YUK. 
+ * If your machine doesn't have <string.h> or <ctype.h>, happy hacking!
+ * 
+ */
+
+#define VERBOSE
+#define LL 400
+#include <stdio.h>
+#define DEFFILE "../config/make.def"
+#define DEFAULT_MESSAGE "(none)"
+FILE *deffile;
+void write_compiler_info(int type, FILE *fp)
+{
+  char line[LL];
+  char mpifc[LL], flink[LL], fmpi_lib[LL], fmpi_inc[LL], fflags[LL], flinkflags[LL];
+  char compiletime[LL], randfile[LL];
+  char mpicc[LL], cflags[LL], clink[LL], clinkflags[LL],
+       cmpi_lib[LL], cmpi_inc[LL];
+  struct tm *tmp;
+  time_t t;
+  deffile = fopen(DEFFILE, "r");
+  if (deffile == NULL) {
+    printf("\n\
+setparams: File %s doesn't exist. To build the NAS benchmarks\n\
+           you need to create it according to the instructions\n\
+           in the README in the main directory and comments in \n\
+           the file config/make.def.template\n", DEFFILE);
+    exit(1);
+  }
+  strcpy(mpifc, DEFAULT_MESSAGE);
+  strcpy(flink, DEFAULT_MESSAGE);
+  strcpy(fmpi_lib, DEFAULT_MESSAGE);
+  strcpy(fmpi_inc, DEFAULT_MESSAGE);
+  strcpy(fflags, DEFAULT_MESSAGE);
+  strcpy(flinkflags, DEFAULT_MESSAGE);
+  strcpy(randfile, DEFAULT_MESSAGE);
+  strcpy(mpicc, DEFAULT_MESSAGE);
+  strcpy(cflags, DEFAULT_MESSAGE);
+  strcpy(clink, DEFAULT_MESSAGE);
+  strcpy(clinkflags, DEFAULT_MESSAGE);
+  strcpy(cmpi_lib, DEFAULT_MESSAGE);
+  strcpy(cmpi_inc, DEFAULT_MESSAGE);
+
+  while (fgets(line, LL, deffile) != NULL) {
+    if (*line == '#') continue;
+    /* yes, this is inefficient. but it's simple! */
+    check_line(line, "MPIFC", mpifc);
+    check_line(line, "FLINK", flink);
+    check_line(line, "FMPI_LIB", fmpi_lib);
+    check_line(line, "FMPI_INC", fmpi_inc);
+    check_line(line, "FFLAGS", fflags);
+    check_line(line, "FLINKFLAGS", flinkflags);
+    check_line(line, "RAND", randfile);
+    check_line(line, "MPICC", mpicc);
+    check_line(line, "CFLAGS", cflags);
+    check_line(line, "CLINK", clink);
+    check_line(line, "CLINKFLAGS", clinkflags);
+    check_line(line, "CMPI_LIB", cmpi_lib);
+    check_line(line, "CMPI_INC", cmpi_inc);
+    /* if the dummy library is used by including make.dummy, we set the
+       Fortran and C paths to libraries and headers accordingly     */
+    if(check_include_line(line, "../config/make.dummy")) {
+       strcpy(fmpi_lib, "-L../MPI_dummy -lmpi");
+       strcpy(fmpi_inc, "-I../MPI_dummy");
+       strcpy(cmpi_lib, "-L../MPI_dummy -lmpi");
+       strcpy(cmpi_inc, "-I../MPI_dummy");
+    }
+  }
+
+  
+  (void) time(&t);
+  tmp = localtime(&t);
+  (void) strftime(compiletime, (size_t)LL, "%d %b %Y", tmp);
+
+
+  switch(type) {
+      case FT:
+      case SP:
+      case BT:
+      case MG:
+      case LU:
+      case EP:
+      case CG:
+          put_string(fp, "compiletime", compiletime);
+          put_string(fp, "npbversion", VERSION);
+          put_string(fp, "cs1", mpifc);
+          put_string(fp, "cs2", flink);
+          put_string(fp, "cs3", fmpi_lib);
+          put_string(fp, "cs4", fmpi_inc);
+          put_string(fp, "cs5", fflags);
+          put_string(fp, "cs6", flinkflags);
+	  put_string(fp, "cs7", randfile);
+          break;
+      case IS:
+      case DT:
+          put_def_string(fp, "COMPILETIME", compiletime);
+          put_def_string(fp, "NPBVERSION", VERSION);
+          put_def_string(fp, "MPICC", mpicc);
+          put_def_string(fp, "CFLAGS", cflags);
+          put_def_string(fp, "CLINK", clink);
+          put_def_string(fp, "CLINKFLAGS", clinkflags);
+          put_def_string(fp, "CMPI_LIB", cmpi_lib);
+          put_def_string(fp, "CMPI_INC", cmpi_inc);
+          break;
+      default:
+          printf("setparams: (Internal error): Unknown benchmark type %d\n", 
+                                                                         type);
+          exit(1);
+  }
+
+}
+
+void check_line(char *line, char *label, char *val)
+{
+  char *original_line;
+  int n;
+  original_line = line;
+  /* compare beginning of line and label */
+  while (*label != '\0' && *line == *label) {
+    line++; label++; 
+  }
+  /* if *label is not EOS, we must have had a mismatch */
+  if (*label != '\0') return;
+  /* if *line is not a space, actual label is longer than test label */
+  if (!isspace(*line) && *line != '=') return ; 
+  /* skip over white space */
+  while (isspace(*line)) line++;
+  /* next char should be '=' */
+  if (*line != '=') return;
+  /* skip over white space */
+  while (isspace(*++line));
+  /* if EOS, nothing was specified */
+  if (*line == '\0') return;
+  /* finally we've come to the value */
+  strcpy(val, line);
+  /* chop off the newline at the end */
+  n = strlen(val)-1;
+  if (n >= 0 && val[n] == '\n')
+    val[n--] = '\0';
+  if (n >= 0 && val[n] == '\r')
+    val[n--] = '\0';
+  /* treat continuation */
+  while (val[n] == '\\' && fgets(original_line, LL, deffile)) {
+     line = original_line;
+     while (isspace(*line)) line++;
+     if (isspace(*original_line)) val[n++] = ' ';
+     while (*line && *line != '\n' && *line != '\r' && n < LL-1)
+       val[n++] = *line++;
+     val[n] = '\0';
+     n--;
+  }
+/*  if (val[strlen(val) - 1] == '\\') {
+    printf("\n\
+setparams: Error in file make.def. Because of the way in which\n\
+           command line arguments are incorporated into the\n\
+           executable benchmark, you can't have any continued\n\
+           lines in the file make.def, that is, lines ending\n\
+           with the character \"\\\". Although it may be ugly, \n\
+           you should be able to reformat without continuation\n\
+           lines. The offending line is\n\
+  %s\n", original_line);
+    exit(1);
+  } */
+}
+
+int check_include_line(char *line, char *filename)
+{
+  char *include_string = "include";
+  /* compare beginning of line and "include" */
+  while (*include_string != '\0' && *line == *include_string) {
+    line++; include_string++; 
+  }
+  /* if *include_string is not EOS, we must have had a mismatch */
+  if (*include_string != '\0') return(0);
+  /* if *line is not a space, first word is not "include" */
+  if (!isspace(*line)) return(0); 
+  /* skip over white space */
+  while (isspace(*++line));
+  /* if EOS, nothing was specified */
+  if (*line == '\0') return(0);
+  /* next keyword should be name of include file in *filename */
+  while (*filename != '\0' && *line == *filename) {
+    line++; filename++; 
+  }  
+  if (*filename != '\0' || 
+      (*line != ' ' && *line != '\0' && *line !='\n')) return(0);
+  else return(1);
+}
+
+
+#define MAXL 46
+void put_string(FILE *fp, char *name, char *val)
+{
+  int len;
+  len = strlen(val);
+  if (len > MAXL) {
+    val[MAXL] = '\0';
+    val[MAXL-1] = '.';
+    val[MAXL-2] = '.';
+    val[MAXL-3] = '.';
+    len = MAXL;
+  }
+  fprintf(fp, "%scharacter*%d %s\n", FINDENT, len, name);
+  fprintf(fp, "%sparameter (%s=\'%s\')\n", FINDENT, name, val);
+}
+
+/* need to escape quote (") in val */
+int fix_string_quote(char *val, char *newval, int maxl)
+{
+  int len;
+  int i, j;
+  len = strlen(val);
+  i = j = 0;
+  while (i < len && j < maxl) {
+    if (val[i] == '"')
+      newval[j++] = '\\';
+    if (j < maxl)
+      newval[j++] = val[i++];
+  }
+  newval[j] = '\0';
+  return j;
+}
+
+/* NOTE: is the ... stuff necessary in C? */
+void put_def_string(FILE *fp, char *name, char *val0)
+{
+  int len;
+  char val[MAXL+3];
+  len = fix_string_quote(val0, val, MAXL+2);
+  if (len > MAXL) {
+    val[MAXL] = '\0';
+    val[MAXL-1] = '.';
+    val[MAXL-2] = '.';
+    val[MAXL-3] = '.';
+    len = MAXL;
+  }
+  fprintf(fp, "#define %s \"%s\"\n", name, val);
+}
+
+void put_def_variable(FILE *fp, char *name, char *val)
+{
+  int len;
+  len = strlen(val);
+  if (len > MAXL) {
+    val[MAXL] = '\0';
+    val[MAXL-1] = '.';
+    val[MAXL-2] = '.';
+    val[MAXL-3] = '.';
+    len = MAXL;
+  }
+  fprintf(fp, "#define %s %s\n", name, val);
+}
+
+
+
+#if 0
+
+/* this version allows arbitrarily long lines but 
+ * some compilers don't like that and they're rarely
+ * useful 
+ */
+
+#define LINELEN 65
+void put_string(FILE *fp, char *name, char *val)
+{
+  int len, nlines, pos, i;
+  char line[100];
+  len = strlen(val);
+  nlines = len/LINELEN;
+  if (nlines*LINELEN < len) nlines++;
+  fprintf(fp, "%scharacter*%d %s\n", FINDENT, nlines*LINELEN, name);
+  fprintf(fp, "%sparameter (%s = &\n", FINDENT, name);
+  for (i = 0; i < nlines; i++) {
+    pos = i*LINELEN;
+    if (i == 0) fprintf(fp, "%s\'", CONTINUE);
+    else        fprintf(fp, "%s", CONTINUE);
+    /* number should be same as LINELEN */
+    fprintf(fp, "%.65s", val+pos);
+    if (i == nlines-1) fprintf(fp, "\')\n");
+    else             fprintf(fp, " &\n");
+  }
+}
+
+#endif
+
+
+/* integer square root. Return error if argument isn't
+ * a perfect square or is less than or equal to zero 
+ */
+
+int isqrt(int i)
+{
+  int root, square;
+  if (i <= 0) return(-1);
+  square = 0;
+  for (root = 1; square <= i; root++) {
+    square = root*root;
+    if (square == i) return(root);
+  }
+  return(-1);
+}
+
+int isqrt2(int i)
+{
+  int xdim, ydim, square;
+  if (i <= 0) return(-1);
+  square = 0;
+  for (xdim = 1; square <= i; xdim++) {
+    square = xdim*xdim;
+    if (square == i) return(xdim);
+  }
+  ydim = i / (--xdim);
+  while (xdim*ydim != i && 2*ydim >= xdim) {
+    xdim++;
+    ydim = i / xdim;
+  }
+  if (xdim*ydim == i && 2*ydim >= xdim)
+    return(xdim);
+  return(-1);
+}
+  
+
+/* integer log base two. Return error if argument isn't
+ * a power of two or is less than or equal to zero 
+ */
+
+int ilog2(int i)
+{
+  int log2;
+  int exp2 = 1;
+  if (i <= 0) return(-1);
+
+  for (log2 = 0; log2 < 30; log2++) {
+    if (exp2 == i) return(log2);
+    if (exp2 > i) break;
+    exp2 *= 2;
+  }
+  return(-1);
+}
+
+int ipow2(int i)
+{
+  int pow2 = 1;
+  if (i < 0) return(-1);
+  if (i == 0) return(1);
+  while(i--) pow2 *= 2;
+  return(pow2);
+}
+ 
+
+
+void write_convertdouble_info(int type, FILE *fp)
+{
+  switch(type) {
+  case SP:
+  case BT:
+  case LU:
+  case FT:
+  case MG:
+  case EP:
+  case CG:
+    fprintf(fp, "%slogical  convertdouble\n", FINDENT);
+#ifdef CONVERTDOUBLE
+    fprintf(fp, "%sparameter (convertdouble = .true.)\n", FINDENT);
+#else
+    fprintf(fp, "%sparameter (convertdouble = .false.)\n", FINDENT);
+#endif
+    break;
+  }
+}
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/sys/suite.awk b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/sys/suite.awk
new file mode 100644
index 000000000..ad29a112f
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/sys/suite.awk
@@ -0,0 +1,20 @@
+BEGIN { SMAKE = "make" } {
+  if ($1 !~ /^#/ &&  NF > 1) {
+    printf "cd `echo %s|tr '[a-z]' '[A-Z]'`; %s clean;", $1, SMAKE;
+    printf "%s CLASS=%s", SMAKE, $2;
+    if ( NF > 2 ) {
+      if ( $3 ~ /^blk/ ||  $3 ~ /^BLK/ ) {
+        printf " VERSION=%s", $3;
+        if ( NF > 3 ) {
+          printf " SUBTYPE=%s", $4;
+        }
+      } else {
+        printf " SUBTYPE=%s", $3;
+        if ( NF > 3 ) {
+          printf " VERSION=%s", $4;
+        }
+      }
+    }
+    printf "; cd ..\n";
+  }
+}
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/test_scripts/comp b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/test_scripts/comp
new file mode 100755
index 000000000..437bbb4b0
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/test_scripts/comp
@@ -0,0 +1,51 @@
+#!/bin/csh
+
+module purge
+module load comp/intel-12.0.4
+#module load comp/gcc-5.3
+module load comp/gcc-8.2
+module load mpi-gcc/mpich-3.1
+
+set logfile=npb-make.log
+touch $logfile
+set outf=npb-make.out
+touch $outf
+
+echo "Date: `date`" >> $logfile
+echo "Host: `hostname`" >> $logfile
+module list >>& $logfile
+echo "" >> $logfile
+
+set cnt=0
+set cntf=0
+
+foreach cf (gcc_mpich)
+
+set bindir=bin/bin_$cf
+if ( ! -d $bindir) mkdir -p $bindir
+\cp -f config/NAS.samples/make.def.$cf config/make.def
+make clean >>& $outf
+
+foreach c (A)
+foreach ap (bt cg ep ft is lu mg sp)
+   make $ap CLASS=$c >>& $outf
+   set pgm=${ap}.${c}.x
+   set pgmx=bin/$pgm
+   @ cnt++
+   if ( -e $pgmx ) then
+      \mv $pgmx $bindir
+      echo ">>> make $cf/$pgm - successful" | tee -a $logfile
+   else
+      echo "*** make $cf/$pgm - FAILED" | tee -a $logfile
+      @ cntf++
+   endif
+end
+end
+
+end
+
+echo "" >> $logfile
+echo "Date: `date`" >> $logfile
+echo "Total number of cases: $cnt" | tee -a $logfile
+echo "Total number of FAILED cases: $cntf" | tee -a $logfile
+echo "" >> $logfile
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/test_scripts/comp_pld b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/test_scripts/comp_pld
new file mode 100755
index 000000000..77fab597d
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/test_scripts/comp_pld
@@ -0,0 +1,55 @@
+#!/bin/csh
+
+#PBS -j oe
+#PBS -l select=1:ncpus=28:model=bro
+#PBS -l walltime=0:30:00
+
+if ( $?PBS_O_WORKDIR ) cd $PBS_O_WORKDIR
+
+module purge
+module load comp-intel/2018.3.222
+module load mpi-hpe/mpt.2.17r13
+
+set logfile=npb-make.log
+touch $logfile
+set outf=npb-make.out
+touch $outf
+
+echo "Date: `date`" >> $logfile
+echo "Host: `hostname`" >> $logfile
+module list >>& $logfile
+echo "" >> $logfile
+
+set cnt=0
+set cntf=0
+
+foreach cf (itc_mpt)
+
+set bindir=bin/bin_$cf
+if ( ! -d $bindir) mkdir -p $bindir
+\cp -f config/NAS.samples/make.def.$cf config/make.def
+make clean >>& $outf
+
+foreach c (C)
+foreach ap (bt cg ep ft is lu mg sp)
+   make $ap CLASS=$c >>& $outf
+   set pgm=${ap}.${c}.x
+   set pgmx=bin/$pgm
+   @ cnt++
+   if ( -e $pgmx ) then
+      \mv $pgmx $bindir
+      echo ">>> make $cf/$pgm - successful" | tee -a $logfile
+   else
+      echo "*** make $cf/$pgm - FAILED" | tee -a $logfile
+      @ cntf++
+   endif
+end
+end
+
+end
+
+echo "" >> $logfile
+echo "Date: `date`" >> $logfile
+echo "Total number of cases: $cnt" | tee -a $logfile
+echo "Total number of FAILED cases: $cntf" | tee -a $logfile
+echo "" >> $logfile
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/test_scripts/run_test b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/test_scripts/run_test
new file mode 100755
index 000000000..ddb410b2e
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/test_scripts/run_test
@@ -0,0 +1,13 @@
+#!/bin/csh
+
+set sdir=$0:h
+set wdir=$sdir/..
+
+cd $wdir
+echo "Testing ... $sdir/comp"
+$sdir/comp
+
+cd bin
+echo "Testing ... $sdir/runit"
+../$sdir/runit
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/test_scripts/run_test_pld b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/test_scripts/run_test_pld
new file mode 100755
index 000000000..c8c58b51a
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/test_scripts/run_test_pld
@@ -0,0 +1,22 @@
+#!/bin/csh
+
+#PBS -j oe
+#PBS -l select=5:ncpus=28:model=bro
+#PBS -l walltime=2:00:00
+
+set m=bro
+set sdir=test_scripts
+if ( $?PBS_O_WORKDIR ) then
+   set wdir=$PBS_O_WORKDIR
+else
+   set wdir=.
+endif
+
+cd $wdir
+echo "Testing ... $sdir/comp_pld"
+env PBS_O_WORKDIR=$wdir $sdir/comp_pld
+
+cd bin
+echo "Testing ... $sdir/runit_pld"
+env PBS_O_WORKDIR=$wdir/bin ../$sdir/runit_pld $m
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/test_scripts/runit b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/test_scripts/runit
new file mode 100755
index 000000000..ce70deb82
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/test_scripts/runit
@@ -0,0 +1,75 @@
+#!/bin/csh
+
+module purge
+module load local
+module load comp/intel-12.0.4
+#module load comp/gcc-5.3
+module load comp/gcc-8.2
+module load mpi-gcc/mpich-3.1
+
+set logfile=npb-run.log
+touch $logfile
+set tmpf=npb.tmp.$$
+
+echo "Date: `date`" >> $logfile
+echo "Host: `hostname`" >> $logfile
+module list >>& $logfile
+echo "" >> $logfile
+
+set cnt=0
+set cntf=0
+set cntp=0
+set cntw=0
+
+setenv NPB_TIMER_FLAG 1
+
+foreach cf (gcc_mpich)
+
+set bindir=bin_$cf
+set outdir=outs_$cf
+if ( ! -e $outdir ) mkdir -p $outdir
+
+foreach np (4 2)
+foreach c (A)
+foreach ap (bt lu ft mg ep cg is sp)
+   set pgm=${ap}.${c}.x
+   set pgmx=$bindir/$pgm
+   set case="run $cf/$pgm np=$np"
+   @ cnt++
+   if ( -e $pgmx ) then
+      set outf=$outdir/${ap}.${c}.out.$np
+      touch $outf
+      mpiexec -np $np mbind.x $pgmx >&! $tmpf
+      grep -i ' successful' $tmpf >& /dev/null
+      if ( $status == 0 ) then
+         grep -i warning $tmpf >& /dev/null
+         if ( $status == 0 ) then
+            echo ">*> $case - successful+warning" | tee -a $logfile
+            @ cntw++
+         else
+            echo ">>> $case - successful" | tee -a $logfile
+         endif
+      else
+         echo "*** $case - FAILED" | tee -a $logfile
+         @ cntf++
+      endif
+      cat $tmpf >> $outf
+      \rm $tmpf
+   else
+      echo "... $case - not present" | tee -a $logfile
+      @ cntp++
+   endif
+end
+end
+end
+
+end
+
+echo "" >> $logfile
+echo "Date: `date`" >> $logfile
+echo "Total number of cases: $cnt" | tee -a $logfile
+echo "Total number of warned cases: $cntw" | tee -a $logfile
+echo "Total number of FAILED cases: $cntf" | tee -a $logfile
+echo "Total number of not present cases: $cntp" | tee -a $logfile
+echo "" >> $logfile
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/test_scripts/runit_pld b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/test_scripts/runit_pld
new file mode 100755
index 000000000..43b3f9693
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/test_scripts/runit_pld
@@ -0,0 +1,85 @@
+#!/bin/csh
+
+#PBS -j oe
+#PBS -l select=5:ncpus=28:model=bro
+#PBS -l walltime=2:00:00
+
+if ( $?PBS_O_WORKDIR ) cd $PBS_O_WORKDIR
+
+module purge
+module load my/local
+module load comp-intel/2018.3.222
+module load mpi-hpe/mpt.2.17r13
+
+if ( "$1" == "" ) then
+   set m=bro
+else
+   set m=$1
+endif
+set logfile=npb-run.log
+touch $logfile
+set tmpf=npb.tmp.$$
+
+echo "Date: `date`" >> $logfile
+echo "Host: `hostname`" >> $logfile
+module list >>& $logfile
+echo "" >> $logfile
+
+set cnt=0
+set cntf=0
+set cntp=0
+set cntw=0
+
+setenv NPB_TIMER_FLAG 1
+
+foreach cf (icc_mpt)
+
+set bindir=bin_$cf
+set outdir=${m}_outs_$cf
+if ( ! -e $outdir ) mkdir -p $outdir
+
+foreach c (C)
+foreach np (32 64 128)
+foreach ap (bt cg ep ft is lu mg sp)
+   set pgm=${ap}.${c}.x
+   set pgmx=$bindir/$pgm
+   set case="run $cf/$pgm np=$np"
+   @ cnt++
+   if ( -e $pgmx ) then
+      set outf=$outdir/${ap}.${c}.out.$np
+      touch $outf
+      mpiexec -np $np mbind.x $pgmx >&! $tmpf
+      grep -i ' successful' $tmpf >& /dev/null
+      if ( $status == 0 ) then
+         grep -i warning $tmpf >& /dev/null
+         if ( $status == 0 ) then
+            echo ">*> $case - successful+warning" | tee -a $logfile
+            @ cntw++
+         else
+            echo ">>> $case - successful" | tee -a $logfile
+         endif
+      else
+         echo "*** $case - FAILED" | tee -a $logfile
+         @ cntf++
+      endif
+      cat $tmpf >> $outf
+      \rm $tmpf
+   else
+      echo "... $case - not present" | tee -a $logfile
+      @ cntp++
+   endif
+end
+end
+end
+
+end
+
+echo "" >> $logfile
+echo "Date: `date`" >> $logfile
+echo "Total number of cases: $cnt" | tee -a $logfile
+echo "Total number of warned cases: $cntw" | tee -a $logfile
+echo "Total number of FAILED cases: $cntf" | tee -a $logfile
+echo "Total number of not present cases: $cntp" | tee -a $logfile
+echo "" >> $logfile
+
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/BT/Makefile b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/BT/Makefile
new file mode 100644
index 000000000..3a0ac781a
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/BT/Makefile
@@ -0,0 +1,65 @@
+SHELL=/bin/sh
+BENCHMARK=bt
+BENCHMARKU=BT
+BLK=
+BLKFAC=0
+
+include ../config/make.def
+
+
+OBJS = bt.o bt_data.o initialize.o exact_solution.o \
+       exact_rhs.o set_constants.o adi.o  rhs.o      \
+       x_solve$(BLK).o y_solve$(BLK).o solve_subs$(BLK).o  \
+       z_solve$(BLK).o add.o error.o verify.o work_lhs$(BLK).o \
+       ${COMMON}/print_results.o ${COMMON}/timers.o ${COMMON}/wtime.o
+
+ifeq (${M5_ANNOTATION}, 1)
+	OBJS += ${COMMON}/hooks.o
+endif
+
+include ../sys/make.common
+
+# npbparams.h is included by bt_data module (via bt_data.o)
+
+${PROGRAM}: config
+	@ver=$(VERSION); bfac=`echo $$ver|sed -e 's/^blk//' -e 's/^BLK//'`; \
+	if [ x$$ver != x$$bfac ] ; then		\
+		${MAKE} BLK=_blk BLKFAC=$${bfac:-8} exec;	\
+	else					\
+		${MAKE} exec;			\
+	fi
+
+exec: $(OBJS)
+	${FLINK} ${FLINKFLAGS} -o ${PROGRAM} ${OBJS} ${F_LIB}
+
+.f90.o:
+	${FCOMPILE} $<
+
+
+blk_par.h: FORCE
+	sed -e 's/= 0/= $(BLKFAC)/' blk_par0.h > blk_par.h_wk
+	@ if ! `diff blk_par.h_wk blk_par.h > /dev/null 2>&1`; then \
+	mv -f blk_par.h_wk blk_par.h; else rm -f blk_par.h_wk; fi
+FORCE:
+
+bt.o:             bt.f90  bt_data.o blk_par.h
+initialize.o:     initialize.f90  bt_data.o
+exact_solution.o: exact_solution.f90  bt_data.o
+exact_rhs.o:      exact_rhs.f90  bt_data.o
+set_constants.o:  set_constants.f90  bt_data.o
+adi.o:            adi.f90  bt_data.o
+rhs.o:            rhs.f90  bt_data.o
+x_solve$(BLK).o:  x_solve$(BLK).f90  bt_data.o work_lhs$(BLK).o
+y_solve$(BLK).o:  y_solve$(BLK).f90  bt_data.o work_lhs$(BLK).o
+z_solve$(BLK).o:  z_solve$(BLK).f90  bt_data.o work_lhs$(BLK).o
+solve_subs$(BLK).o: solve_subs$(BLK).f90  work_lhs$(BLK).o
+work_lhs$(BLK).o: work_lhs$(BLK).f90  bt_data.o blk_par.h
+add.o:            add.f90  bt_data.o
+error.o:          error.f90  bt_data.o
+verify.o:         verify.f90  bt_data.o
+bt_data.o:        bt_data.f90  npbparams.h
+
+clean:
+	- rm -f *.o *~ *.mod mputil*
+	- rm -f npbparams.h core blk_par.h
+	- if [ -d rii_files ]; then rm -r rii_files; fi
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/BT/README b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/BT/README
new file mode 100644
index 000000000..b8769912b
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/BT/README
@@ -0,0 +1,13 @@
+This directory contains two versions of the BT implementation:
+
+- the standard version that has better cache utilization
+- the "blocking" version that contains code for better vectorization
+
+For most platforms, the standard version gives reasonable performance.
+To build the blocking version, pass the VERSION=BLK flag to make, for example:
+   make CLASS=A VERSION=BLK
+
+Since there is no standard way of performing vectorization, the benefit
+you get from the blocking version depends very much on the compiler.  Often
+additional compiler directives (or flags) may be necessary for optimal
+results.  The current version is intended to serve only as a baseline.
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/BT/add.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/BT/add.f90
new file mode 100644
index 000000000..e9dd687d4
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/BT/add.f90
@@ -0,0 +1,32 @@
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine  add
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     addition of update to the vector u
+!---------------------------------------------------------------------
+
+      use bt_data
+      implicit none
+
+      integer i, j, k, m
+
+      if (timeron) call timer_start(t_add)
+!$omp parallel do default(shared) private(i,j,k,m) collapse(2)
+      do     k = 1, grid_points(3)-2
+         do     j = 1, grid_points(2)-2
+            do     i = 1, grid_points(1)-2
+               do    m = 1, 5
+                  u(m,i,j,k) = u(m,i,j,k) + rhs(m,i,j,k)
+               enddo
+            enddo
+         enddo
+      enddo
+      if (timeron) call timer_stop(t_add)
+
+      return
+      end
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/BT/adi.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/BT/adi.f90
new file mode 100644
index 000000000..b55ed6d19
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/BT/adi.f90
@@ -0,0 +1,21 @@
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine  adi
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      call compute_rhs
+
+      call x_solve
+
+      call y_solve
+
+      call z_solve
+
+      call add
+
+      return
+      end
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/BT/blk_par0.h b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/BT/blk_par0.h
new file mode 100644
index 000000000..eec3a0783
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/BT/blk_par0.h
@@ -0,0 +1,10 @@
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+!
+!  blocking factor
+!
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      integer bsize, blkdim
+      parameter (bsize = 0, blkdim = bsize)
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/BT/bt.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/BT/bt.f90
new file mode 100644
index 000000000..f801785c2
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/BT/bt.f90
@@ -0,0 +1,222 @@
+!-------------------------------------------------------------------------!
+!                                                                         !
+!        N  A  S     P A R A L L E L     B E N C H M A R K S  3.4         !
+!                                                                         !
+!                       O p e n M P     V E R S I O N                     !
+!                                                                         !
+!                                   B T                                   !
+!                                                                         !
+!-------------------------------------------------------------------------!
+!                                                                         !
+!    This benchmark is an OpenMP version of the NPB BT code.              !
+!    It is described in NAS Technical Report 99-011.                      !
+!                                                                         !
+!    Permission to use, copy, distribute and modify this software         !
+!    for any purpose with or without fee is hereby granted.  We           !
+!    request, however, that all derived work reference the NAS            !
+!    Parallel Benchmarks 3.4. This software is provided "as is"           !
+!    without express or implied warranty.                                 !
+!                                                                         !
+!    Information on NPB 3.4, including the technical report, the          !
+!    original specifications, source code, results and information        !
+!    on how to submit new results, is available at:                       !
+!                                                                         !
+!           http://www.nas.nasa.gov/Software/NPB/                         !
+!                                                                         !
+!    Send comments or suggestions to  npb@nas.nasa.gov                    !
+!                                                                         !
+!          NAS Parallel Benchmarks Group                                  !
+!          NASA Ames Research Center                                      !
+!          Mail Stop: T27A-1                                              !
+!          Moffett Field, CA   94035-1000                                 !
+!                                                                         !
+!          E-mail:  npb@nas.nasa.gov                                      !
+!          Fax:     (650) 604-3957                                        !
+!                                                                         !
+!-------------------------------------------------------------------------!
+
+!---------------------------------------------------------------------
+!
+! Authors: R. Van der Wijngaart
+!          T. Harris
+!          M. Yarrow
+!          H. Jin
+!
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+       program BT
+!---------------------------------------------------------------------
+
+       use bt_data
+       implicit none
+
+       include 'blk_par.h'
+
+       integer i, niter, step, fstatus
+       double precision navg, mflops, n3
+
+       external timer_read
+       double precision tmax, timer_read, t, trecs(t_last)
+       logical verified
+       character class
+       character t_names(t_last)*8
+!$     integer  omp_get_max_threads
+!$     external omp_get_max_threads
+
+!---------------------------------------------------------------------
+!      Read the input file (if it exists), else take
+!      defaults from parameters
+!---------------------------------------------------------------------
+
+       call check_timer_flag( timeron )
+       if (timeron) then
+         t_names(t_total) = 'total'
+         t_names(t_rhsx) = 'rhsx'
+         t_names(t_rhsy) = 'rhsy'
+         t_names(t_rhsz) = 'rhsz'
+         t_names(t_rhs) = 'rhs'
+         t_names(t_xsolve) = 'xsolve'
+         t_names(t_ysolve) = 'ysolve'
+         t_names(t_zsolve) = 'zsolve'
+         t_names(t_rdis1) = 'redist1'
+         t_names(t_solsub) = 'solsubs'
+         t_names(t_add) = 'add'
+       endif
+
+       write(*, 1000)
+       open (unit=2,file='inputbt.data',status='old', iostat=fstatus)
+
+       if (fstatus .eq. 0) then
+         write(*,233)
+ 233     format(' Reading from input file inputbt.data')
+         read (2,*) niter
+         read (2,*) dt
+         read (2,*) grid_points(1), grid_points(2), grid_points(3)
+         close(2)
+       else
+         write(*,234)
+         niter = niter_default
+         dt    = dt_default
+         grid_points(1) = problem_size
+         grid_points(2) = problem_size
+         grid_points(3) = problem_size
+       endif
+ 234   format(' No input file inputbt.data. Using compiled defaults')
+
+       write(*, 1001) grid_points(1), grid_points(2), grid_points(3)
+       write(*, 1002) niter, dt
+       if (blkdim .gt. 0) write(*, 1004) blkdim
+!$     write(*, 1003) omp_get_max_threads()
+       write(*, *)
+
+ 1000  format(//, ' NAS Parallel Benchmarks (NPB3.4-OMP)', &
+     &            ' - BT Benchmark', /)
+ 1001  format(' Size: ', i4, 'x', i4, 'x', i4)
+ 1002  format(' Iterations: ', i4, '       dt: ', f11.7)
+ 1003  format(' Number of available threads: ', i5)
+ 1004  format(' Dimension blocking size: ', i5)
+
+       if ( (grid_points(1) .gt. IMAX) .or.  &
+     &      (grid_points(2) .gt. JMAX) .or.  &
+     &      (grid_points(3) .gt. KMAX) ) then
+             print *, (grid_points(i),i=1,3)
+             print *,' Problem size too big for compiled array sizes'
+             goto 999
+       endif
+
+       do i = 1, t_last
+          call timer_clear(i)
+       end do
+
+       call alloc_space
+
+       call set_constants
+
+       call initialize
+
+       call exact_rhs
+
+!---------------------------------------------------------------------
+!      do one time step to touch all code, and reinitialize
+!---------------------------------------------------------------------
+       call adi
+       call initialize
+
+       do i = 1, t_last
+          call timer_clear(i)
+       end do
+
+#ifdef M5_ANNOTATION
+       call m5_work_begin_interface
+#endif
+
+       call timer_start(1)
+
+       do  step = 1, niter
+
+          if (mod(step, 20) .eq. 0 .or.  &
+     &        step .eq. 1) then
+             write(*, 200) step
+ 200         format(' Time step ', i4)
+          endif
+
+          call adi
+
+       end do
+
+       call timer_stop(1)
+
+#ifdef M5_ANNOTATION
+       call m5_work_end_interface
+#endif
+
+       tmax = timer_read(1)
+       call verify(niter, class, verified)
+
+       n3 = dble(grid_points(1))*grid_points(2)*grid_points(3)
+       navg = (grid_points(1)+grid_points(2)+grid_points(3))/3.d0
+       if( tmax .ne. 0. ) then
+          mflops = 1.0d-6*dble(niter)*  &
+     &            (3478.8*n3-17655.7*navg**2+28023.7*navg)  &
+     &            / tmax
+       else
+          mflops = 0.d0
+       endif
+       call print_results('BT', class, grid_points(1),  &
+     &  grid_points(2), grid_points(3), niter,          &
+     &  tmax, mflops, '          floating point',       &
+     &  verified, npbversion,compiletime, cs1, cs2, cs3, cs4, cs5, &
+     &  cs6, '(none)')
+
+!---------------------------------------------------------------------
+!      More timers
+!---------------------------------------------------------------------
+       if (.not.timeron) goto 999
+
+       do i=1, t_last
+          trecs(i) = timer_read(i)
+       end do
+       if (tmax .eq. 0.0) tmax = 1.0
+
+       write(*,800)
+ 800   format('  SECTION   Time (secs)')
+       do i=1, t_last
+          write(*,810) t_names(i), trecs(i), trecs(i)*100./tmax
+          if (i.eq.t_rhs) then
+             t = trecs(t_rhsx) + trecs(t_rhsy) + trecs(t_rhsz)
+             write(*,820) 'sub-rhs', t, t*100./tmax
+             t = trecs(t_rhs) - t
+             write(*,820) 'rest-rhs', t, t*100./tmax
+          elseif (i.eq.t_solsub) then
+             t = trecs(t_xsolve) + trecs(t_ysolve) + trecs(t_zsolve) &
+     &           - trecs(t_rdis1) - trecs(t_solsub)
+             write(*,820) 'rest-sol', t, t*100./tmax
+          endif
+ 810      format(2x,a8,':',f9.3,'  (',f6.2,'%)')
+ 820      format('    --> ',a8,':',f9.3,'  (',f6.2,'%)')
+       end do
+
+ 999   continue
+
+       end
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/BT/bt_data.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/BT/bt_data.f90
new file mode 100644
index 000000000..a1169d746
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/BT/bt_data.f90
@@ -0,0 +1,133 @@
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+!
+!  bt_data module
+!
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+ 
+      module bt_data
+
+!---------------------------------------------------------------------
+! The following include file is generated automatically by the
+! "setparams" utility. It defines 
+!      maxcells:      the square root of the maximum number of processors
+!      problem_size:  12, 64, 102, 162 (for class S, A, B, C)
+!      dt_default:    default time step for this problem size if no
+!                     config file
+!      niter_default: default number of iterations for this problem size
+!---------------------------------------------------------------------
+
+      include 'npbparams.h'
+
+      integer           aa, bb, cc, BLOCK_SIZE
+      parameter        (aa=1, bb=2, cc=3, BLOCK_SIZE=5)
+
+      integer           grid_points(3)
+      double precision  elapsed_time
+
+      double precision  tx1, tx2, tx3, ty1, ty2, ty3, tz1, tz2, tz3,    &
+     &                  dx1, dx2, dx3, dx4, dx5, dy1, dy2, dy3, dy4,    &
+     &                  dy5, dz1, dz2, dz3, dz4, dz5, dssp, dt,         &
+     &                  ce(5,13), dxmax, dymax, dzmax, xxcon1, xxcon2,  &
+     &                  xxcon3, xxcon4, xxcon5, dx1tx1, dx2tx1, dx3tx1, &
+     &                  dx4tx1, dx5tx1, yycon1, yycon2, yycon3, yycon4, &
+     &                  yycon5, dy1ty1, dy2ty1, dy3ty1, dy4ty1, dy5ty1, &
+     &                  zzcon1, zzcon2, zzcon3, zzcon4, zzcon5, dz1tz1, &
+     &                  dz2tz1, dz3tz1, dz4tz1, dz5tz1, dnxm1, dnym1,   &
+     &                  dnzm1, c1c2, c1c5, c3c4, c1345, conz1, c1, c2,  &
+     &                  c3, c4, c5, c4dssp, c5dssp, dtdssp, dttx1,      &
+     &                  dttx2, dtty1, dtty2, dttz1, dttz2, c2dttx1,     &
+     &                  c2dtty1, c2dttz1, comz1, comz4, comz5, comz6,   &
+     &                  c3c4tx3, c3c4ty3, c3c4tz3, c2iv, con43, con16
+
+      integer IMAX, JMAX, KMAX
+      parameter (IMAX=problem_size,JMAX=problem_size,KMAX=problem_size)
+
+!
+!   field arrays 
+!
+      double precision, allocatable ::  & 
+     &   us      (   :, :, :),  &
+     &   vs      (   :, :, :),  &
+     &   ws      (   :, :, :),  &
+     &   qs      (   :, :, :),  &
+     &   rho_i   (   :, :, :),  &
+     &   square  (   :, :, :),  &
+     &   forcing (:, :, :, :),  &
+     &   u       (:, :, :, :),  &
+     &   rhs     (:, :, :, :)
+
+      double precision cuf(0:problem_size),   q  (0:problem_size),  &
+     &                 ue (0:problem_size,5), buf(0:problem_size,5)
+!$omp threadprivate (cuf, q, ue, buf)
+
+!
+!-----------------------------------------------------------------------
+!   Timer constants
+!-----------------------------------------------------------------------
+      integer t_rhsx, t_rhsy, t_rhsz, t_xsolve, t_ysolve, t_zsolve, &
+     &        t_rdis1, t_solsub, t_add, t_rhs, t_last, t_total
+      parameter (t_total = 1)
+      parameter (t_rhsx = 2)
+      parameter (t_rhsy = 3)
+      parameter (t_rhsz = 4)
+      parameter (t_rhs = 5)
+      parameter (t_xsolve = 6)
+      parameter (t_ysolve = 7)
+      parameter (t_zsolve = 8)
+      parameter (t_rdis1 = 9)
+      parameter (t_solsub = 10)
+      parameter (t_add = 11)
+      parameter (t_last = 11)
+
+      logical timeron
+
+      end module bt_data
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine alloc_space
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+! allocate space dynamically for data arrays
+!---------------------------------------------------------------------
+
+      use bt_data
+      implicit none
+
+      integer ios
+
+      integer IMAXP, JMAXP
+      parameter (IMAXP=IMAX/2*2,JMAXP=JMAX/2*2)
+
+!
+!   to improve cache performance, grid dimensions padded by 1 
+!   for even number sizes only.
+!
+      allocate (   &
+     &   us      (   0:IMAXP, 0:JMAXP, 0:KMAX-1),  &
+     &   vs      (   0:IMAXP, 0:JMAXP, 0:KMAX-1),  &
+     &   ws      (   0:IMAXP, 0:JMAXP, 0:KMAX-1),  &
+     &   qs      (   0:IMAXP, 0:JMAXP, 0:KMAX-1),  &
+     &   rho_i   (   0:IMAXP, 0:JMAXP, 0:KMAX-1),  &
+     &   square  (   0:IMAXP, 0:JMAXP, 0:KMAX-1),  &
+     &   forcing (5, 0:IMAXP, 0:JMAXP, 0:KMAX-1),  &
+     &   u       (5, 0:IMAXP, 0:JMAXP, 0:KMAX-1),  &
+     &   rhs     (5, 0:IMAXP, 0:JMAXP, 0:KMAX-1),  &
+     &         stat = ios)
+
+      if (ios .ne. 0) then
+         write(*,*) 'Error encountered in allocating space'
+         stop
+      endif
+
+      return
+      end
+
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/BT/error.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/BT/error.f90
new file mode 100644
index 000000000..f8da14f15
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/BT/error.f90
@@ -0,0 +1,95 @@
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine error_norm(rms)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     this subroutine computes the norm of the difference between the
+!     computed solution and the exact solution
+!---------------------------------------------------------------------
+
+      use bt_data
+      implicit none
+
+      integer i, j, k, m, d
+      double precision xi, eta, zeta, u_exact(5), rms(5), add
+
+      do m = 1, 5
+         rms(m) = 0.0d0
+      enddo
+
+!$omp parallel do schedule(static) collapse(2) default(shared)  &
+!$omp& private(i,j,k,m,zeta,eta,xi,add,u_exact) reduction(+: rms)
+      do k = 0, grid_points(3)-1
+         do j = 0, grid_points(2)-1
+            zeta = dble(k) * dnzm1
+            eta = dble(j) * dnym1
+            do i = 0, grid_points(1)-1
+               xi = dble(i) * dnxm1
+               call exact_solution(xi, eta, zeta, u_exact)
+
+               do m = 1, 5
+                  add = u(m,i,j,k)-u_exact(m)
+                  rms(m) = rms(m) + add*add
+               enddo
+            enddo
+         enddo
+      enddo
+!$omp end parallel do
+
+      do m = 1, 5
+         do d = 1, 3
+            rms(m) = rms(m) / dble(grid_points(d)-2)
+         enddo
+         rms(m) = dsqrt(rms(m))
+      enddo
+
+      return
+      end
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine rhs_norm(rms)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      use bt_data
+      implicit none
+
+      integer i, j, k, d, m
+      double precision rms(5), add
+
+      do m = 1, 5
+         rms(m) = 0.0d0
+      enddo
+
+!$omp parallel do schedule(static) collapse(2)  &
+!$omp&  default(shared) private(i,j,k,m,add) reduction(+: rms)
+      do k = 1, grid_points(3)-2
+         do j = 1, grid_points(2)-2
+            do i = 1, grid_points(1)-2
+               do m = 1, 5
+                  add = rhs(m,i,j,k)
+                  rms(m) = rms(m) + add*add
+               enddo 
+            enddo 
+         enddo 
+      enddo 
+!$omp end parallel do
+
+      do m = 1, 5
+         do d = 1, 3
+            rms(m) = rms(m) / dble(grid_points(d)-2)
+         enddo 
+         rms(m) = dsqrt(rms(m))
+      enddo 
+
+      return
+      end
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/BT/exact_rhs.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/BT/exact_rhs.f90
new file mode 100644
index 000000000..c7c614adb
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/BT/exact_rhs.f90
@@ -0,0 +1,353 @@
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine exact_rhs
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     compute the right hand side based on exact solution
+!---------------------------------------------------------------------
+
+      use bt_data
+      implicit none
+
+      double precision dtemp(5), xi, eta, zeta, dtpp
+      integer m, i, j, k, ip1, im1, jp1, jm1, km1, kp1
+
+!$omp parallel default(shared) private(i,j,k,m,zeta,eta,xi,  &
+!$omp&  dtpp,im1,ip1,jm1,jp1,km1,kp1,dtemp)
+!---------------------------------------------------------------------
+!     initialize                                  
+!---------------------------------------------------------------------
+!$omp do schedule(static) collapse(2)
+      do k= 0, grid_points(3)-1
+         do j = 0, grid_points(2)-1
+            do i = 0, grid_points(1)-1
+               do m = 1, 5
+                  forcing(m,i,j,k) = 0.0d0
+               enddo
+            enddo
+         enddo
+      enddo
+
+!---------------------------------------------------------------------
+!     xi-direction flux differences                      
+!---------------------------------------------------------------------
+!$omp do schedule(static) collapse(2)
+      do k = 1, grid_points(3)-2
+         do j = 1, grid_points(2)-2
+            zeta = dble(k) * dnzm1
+            eta = dble(j) * dnym1
+
+            do i=0, grid_points(1)-1
+               xi = dble(i) * dnxm1
+
+               call exact_solution(xi, eta, zeta, dtemp)
+               do m = 1, 5
+                  ue(i,m) = dtemp(m)
+               enddo
+
+               dtpp = 1.0d0 / dtemp(1)
+
+               do m = 2, 5
+                  buf(i,m) = dtpp * dtemp(m)
+               enddo
+
+               cuf(i)   = buf(i,2) * buf(i,2)
+               buf(i,1) = cuf(i) + buf(i,3) * buf(i,3) +   &
+     &                 buf(i,4) * buf(i,4) 
+               q(i) = 0.5d0*(buf(i,2)*ue(i,2) + buf(i,3)*ue(i,3) +  &
+     &                 buf(i,4)*ue(i,4))
+
+            enddo
+               
+            do i = 1, grid_points(1)-2
+               im1 = i-1
+               ip1 = i+1
+
+               forcing(1,i,j,k) = forcing(1,i,j,k) -  &
+     &                 tx2*( ue(ip1,2)-ue(im1,2) )+  &
+     &                 dx1tx1*(ue(ip1,1)-2.0d0*ue(i,1)+ue(im1,1))
+
+               forcing(2,i,j,k) = forcing(2,i,j,k) - tx2 * (  &
+     &                 (ue(ip1,2)*buf(ip1,2)+c2*(ue(ip1,5)-q(ip1)))-  &
+     &                 (ue(im1,2)*buf(im1,2)+c2*(ue(im1,5)-q(im1))))+  &
+     &                 xxcon1*(buf(ip1,2)-2.0d0*buf(i,2)+buf(im1,2))+  &
+     &                 dx2tx1*( ue(ip1,2)-2.0d0* ue(i,2)+ue(im1,2))
+
+               forcing(3,i,j,k) = forcing(3,i,j,k) - tx2 * (  &
+     &                 ue(ip1,3)*buf(ip1,2)-ue(im1,3)*buf(im1,2))+  &
+     &                 xxcon2*(buf(ip1,3)-2.0d0*buf(i,3)+buf(im1,3))+  &
+     &                 dx3tx1*( ue(ip1,3)-2.0d0*ue(i,3) +ue(im1,3))
+                  
+               forcing(4,i,j,k) = forcing(4,i,j,k) - tx2*(  &
+     &                 ue(ip1,4)*buf(ip1,2)-ue(im1,4)*buf(im1,2))+  &
+     &                 xxcon2*(buf(ip1,4)-2.0d0*buf(i,4)+buf(im1,4))+  &
+     &                 dx4tx1*( ue(ip1,4)-2.0d0* ue(i,4)+ ue(im1,4))
+
+               forcing(5,i,j,k) = forcing(5,i,j,k) - tx2*(  &
+     &                 buf(ip1,2)*(c1*ue(ip1,5)-c2*q(ip1))-  &
+     &                 buf(im1,2)*(c1*ue(im1,5)-c2*q(im1)))+  &
+     &                 0.5d0*xxcon3*(buf(ip1,1)-2.0d0*buf(i,1)+  &
+     &                 buf(im1,1))+  &
+     &                 xxcon4*(cuf(ip1)-2.0d0*cuf(i)+cuf(im1))+  &
+     &                 xxcon5*(buf(ip1,5)-2.0d0*buf(i,5)+buf(im1,5))+  &
+     &                 dx5tx1*( ue(ip1,5)-2.0d0* ue(i,5)+ ue(im1,5))
+            enddo
+
+!---------------------------------------------------------------------
+!     Fourth-order dissipation                         
+!---------------------------------------------------------------------
+
+            do m = 1, 5
+               i = 1
+               forcing(m,i,j,k) = forcing(m,i,j,k) - dssp *  &
+     &                    (5.0d0*ue(i,m) - 4.0d0*ue(i+1,m) +ue(i+2,m))
+               i = 2
+               forcing(m,i,j,k) = forcing(m,i,j,k) - dssp *  &
+     &                    (-4.0d0*ue(i-1,m) + 6.0d0*ue(i,m) -  &
+     &                    4.0d0*ue(i+1,m) +       ue(i+2,m))
+            enddo
+
+            do i = 3, grid_points(1)-4
+               do m = 1, 5
+                  forcing(m,i,j,k) = forcing(m,i,j,k) - dssp*  &
+     &                    (ue(i-2,m) - 4.0d0*ue(i-1,m) +  &
+     &                    6.0d0*ue(i,m) - 4.0d0*ue(i+1,m) + ue(i+2,m))
+               enddo
+            enddo
+
+            do m = 1, 5
+               i = grid_points(1)-3
+               forcing(m,i,j,k) = forcing(m,i,j,k) - dssp *  &
+     &                    (ue(i-2,m) - 4.0d0*ue(i-1,m) +  &
+     &                    6.0d0*ue(i,m) - 4.0d0*ue(i+1,m))
+               i = grid_points(1)-2
+               forcing(m,i,j,k) = forcing(m,i,j,k) - dssp *  &
+     &                    (ue(i-2,m) - 4.0d0*ue(i-1,m) + 5.0d0*ue(i,m))
+            enddo
+
+         enddo
+      enddo
+!$omp end do
+
+!---------------------------------------------------------------------
+!     eta-direction flux differences             
+!---------------------------------------------------------------------
+!$omp do schedule(static) collapse(2)
+      do k = 1, grid_points(3)-2          
+         do i=1, grid_points(1)-2
+            zeta = dble(k) * dnzm1
+            xi = dble(i) * dnxm1
+
+            do j=0, grid_points(2)-1
+               eta = dble(j) * dnym1
+
+               call exact_solution(xi, eta, zeta, dtemp)
+               do m = 1, 5 
+                  ue(j,m) = dtemp(m)
+               enddo
+                  
+               dtpp = 1.0d0/dtemp(1)
+
+               do m = 2, 5
+                  buf(j,m) = dtpp * dtemp(m)
+               enddo
+
+               cuf(j)   = buf(j,3) * buf(j,3)
+               buf(j,1) = cuf(j) + buf(j,2) * buf(j,2) +   &
+     &                 buf(j,4) * buf(j,4)
+               q(j) = 0.5d0*(buf(j,2)*ue(j,2) + buf(j,3)*ue(j,3) +  &
+     &                 buf(j,4)*ue(j,4))
+            enddo
+
+            do j = 1, grid_points(2)-2
+               jm1 = j-1
+               jp1 = j+1
+                  
+               forcing(1,i,j,k) = forcing(1,i,j,k) -  &
+     &                 ty2*( ue(jp1,3)-ue(jm1,3) )+  &
+     &                 dy1ty1*(ue(jp1,1)-2.0d0*ue(j,1)+ue(jm1,1))
+
+               forcing(2,i,j,k) = forcing(2,i,j,k) - ty2*(  &
+     &                 ue(jp1,2)*buf(jp1,3)-ue(jm1,2)*buf(jm1,3))+  &
+     &                 yycon2*(buf(jp1,2)-2.0d0*buf(j,2)+buf(jm1,2))+  &
+     &                 dy2ty1*( ue(jp1,2)-2.0d0* ue(j,2)+ ue(jm1,2))
+
+               forcing(3,i,j,k) = forcing(3,i,j,k) - ty2*(  &
+     &                 (ue(jp1,3)*buf(jp1,3)+c2*(ue(jp1,5)-q(jp1)))-  &
+     &                 (ue(jm1,3)*buf(jm1,3)+c2*(ue(jm1,5)-q(jm1))))+  &
+     &                 yycon1*(buf(jp1,3)-2.0d0*buf(j,3)+buf(jm1,3))+  &
+     &                 dy3ty1*( ue(jp1,3)-2.0d0*ue(j,3) +ue(jm1,3))
+
+               forcing(4,i,j,k) = forcing(4,i,j,k) - ty2*(  &
+     &                 ue(jp1,4)*buf(jp1,3)-ue(jm1,4)*buf(jm1,3))+  &
+     &                 yycon2*(buf(jp1,4)-2.0d0*buf(j,4)+buf(jm1,4))+  &
+     &                 dy4ty1*( ue(jp1,4)-2.0d0*ue(j,4)+ ue(jm1,4))
+
+               forcing(5,i,j,k) = forcing(5,i,j,k) - ty2*(  &
+     &                 buf(jp1,3)*(c1*ue(jp1,5)-c2*q(jp1))-  &
+     &                 buf(jm1,3)*(c1*ue(jm1,5)-c2*q(jm1)))+  &
+     &                 0.5d0*yycon3*(buf(jp1,1)-2.0d0*buf(j,1)+  &
+     &                 buf(jm1,1))+  &
+     &                 yycon4*(cuf(jp1)-2.0d0*cuf(j)+cuf(jm1))+  &
+     &                 yycon5*(buf(jp1,5)-2.0d0*buf(j,5)+buf(jm1,5))+  &
+     &                 dy5ty1*(ue(jp1,5)-2.0d0*ue(j,5)+ue(jm1,5))
+            enddo
+
+!---------------------------------------------------------------------
+!     Fourth-order dissipation                      
+!---------------------------------------------------------------------
+            do m = 1, 5
+               j = 1
+               forcing(m,i,j,k) = forcing(m,i,j,k) - dssp *  &
+     &                    (5.0d0*ue(j,m) - 4.0d0*ue(j+1,m) +ue(j+2,m))
+               j = 2
+               forcing(m,i,j,k) = forcing(m,i,j,k) - dssp *  &
+     &                    (-4.0d0*ue(j-1,m) + 6.0d0*ue(j,m) -  &
+     &                    4.0d0*ue(j+1,m) +       ue(j+2,m))
+            enddo
+
+            do j = 3, grid_points(2)-4
+               do m = 1, 5
+                  forcing(m,i,j,k) = forcing(m,i,j,k) - dssp*  &
+     &                    (ue(j-2,m) - 4.0d0*ue(j-1,m) +  &
+     &                    6.0d0*ue(j,m) - 4.0d0*ue(j+1,m) + ue(j+2,m))
+               enddo
+            enddo
+
+            do m = 1, 5
+               j = grid_points(2)-3
+               forcing(m,i,j,k) = forcing(m,i,j,k) - dssp *  &
+     &                    (ue(j-2,m) - 4.0d0*ue(j-1,m) +  &
+     &                    6.0d0*ue(j,m) - 4.0d0*ue(j+1,m))
+               j = grid_points(2)-2
+               forcing(m,i,j,k) = forcing(m,i,j,k) - dssp *  &
+     &                    (ue(j-2,m) - 4.0d0*ue(j-1,m) + 5.0d0*ue(j,m))
+
+            enddo
+
+         enddo
+      enddo
+!$omp end do
+
+!---------------------------------------------------------------------
+!     zeta-direction flux differences                      
+!---------------------------------------------------------------------
+!$omp do schedule(static) collapse(2)
+      do j=1, grid_points(2)-2
+         do i = 1, grid_points(1)-2
+            eta = dble(j) * dnym1
+            xi = dble(i) * dnxm1
+
+            do k=0, grid_points(3)-1
+               zeta = dble(k) * dnzm1
+
+               call exact_solution(xi, eta, zeta, dtemp)
+               do m = 1, 5
+                  ue(k,m) = dtemp(m)
+               enddo
+
+               dtpp = 1.0d0/dtemp(1)
+
+               do m = 2, 5
+                  buf(k,m) = dtpp * dtemp(m)
+               enddo
+
+               cuf(k)   = buf(k,4) * buf(k,4)
+               buf(k,1) = cuf(k) + buf(k,2) * buf(k,2) +   &
+     &                 buf(k,3) * buf(k,3)
+               q(k) = 0.5d0*(buf(k,2)*ue(k,2) + buf(k,3)*ue(k,3) +  &
+     &                 buf(k,4)*ue(k,4))
+            enddo
+
+            do k=1, grid_points(3)-2
+               km1 = k-1
+               kp1 = k+1
+                  
+               forcing(1,i,j,k) = forcing(1,i,j,k) -  &
+     &                 tz2*( ue(kp1,4)-ue(km1,4) )+  &
+     &                 dz1tz1*(ue(kp1,1)-2.0d0*ue(k,1)+ue(km1,1))
+
+               forcing(2,i,j,k) = forcing(2,i,j,k) - tz2 * (  &
+     &                 ue(kp1,2)*buf(kp1,4)-ue(km1,2)*buf(km1,4))+  &
+     &                 zzcon2*(buf(kp1,2)-2.0d0*buf(k,2)+buf(km1,2))+  &
+     &                 dz2tz1*( ue(kp1,2)-2.0d0* ue(k,2)+ ue(km1,2))
+
+               forcing(3,i,j,k) = forcing(3,i,j,k) - tz2 * (  &
+     &                 ue(kp1,3)*buf(kp1,4)-ue(km1,3)*buf(km1,4))+  &
+     &                 zzcon2*(buf(kp1,3)-2.0d0*buf(k,3)+buf(km1,3))+  &
+     &                 dz3tz1*(ue(kp1,3)-2.0d0*ue(k,3)+ue(km1,3))
+
+               forcing(4,i,j,k) = forcing(4,i,j,k) - tz2 * (  &
+     &                 (ue(kp1,4)*buf(kp1,4)+c2*(ue(kp1,5)-q(kp1)))-  &
+     &                 (ue(km1,4)*buf(km1,4)+c2*(ue(km1,5)-q(km1))))+  &
+     &                 zzcon1*(buf(kp1,4)-2.0d0*buf(k,4)+buf(km1,4))+  &
+     &                 dz4tz1*( ue(kp1,4)-2.0d0*ue(k,4) +ue(km1,4))
+
+               forcing(5,i,j,k) = forcing(5,i,j,k) - tz2 * (  &
+     &                 buf(kp1,4)*(c1*ue(kp1,5)-c2*q(kp1))-  &
+     &                 buf(km1,4)*(c1*ue(km1,5)-c2*q(km1)))+  &
+     &                 0.5d0*zzcon3*(buf(kp1,1)-2.0d0*buf(k,1)  &
+     &                 +buf(km1,1))+  &
+     &                 zzcon4*(cuf(kp1)-2.0d0*cuf(k)+cuf(km1))+  &
+     &                 zzcon5*(buf(kp1,5)-2.0d0*buf(k,5)+buf(km1,5))+  &
+     &                 dz5tz1*( ue(kp1,5)-2.0d0*ue(k,5)+ ue(km1,5))
+            enddo
+
+!---------------------------------------------------------------------
+!     Fourth-order dissipation                        
+!---------------------------------------------------------------------
+            do m = 1, 5
+               k = 1
+               forcing(m,i,j,k) = forcing(m,i,j,k) - dssp *  &
+     &                    (5.0d0*ue(k,m) - 4.0d0*ue(k+1,m) +ue(k+2,m))
+               k = 2
+               forcing(m,i,j,k) = forcing(m,i,j,k) - dssp *  &
+     &                    (-4.0d0*ue(k-1,m) + 6.0d0*ue(k,m) -  &
+     &                    4.0d0*ue(k+1,m) +       ue(k+2,m))
+            enddo
+
+            do k = 3, grid_points(3)-4
+               do m = 1, 5
+                  forcing(m,i,j,k) = forcing(m,i,j,k) - dssp*  &
+     &                    (ue(k-2,m) - 4.0d0*ue(k-1,m) +  &
+     &                    6.0d0*ue(k,m) - 4.0d0*ue(k+1,m) + ue(k+2,m))
+               enddo
+            enddo
+
+            do m = 1, 5
+               k = grid_points(3)-3
+               forcing(m,i,j,k) = forcing(m,i,j,k) - dssp *  &
+     &                    (ue(k-2,m) - 4.0d0*ue(k-1,m) +  &
+     &                    6.0d0*ue(k,m) - 4.0d0*ue(k+1,m))
+               k = grid_points(3)-2
+               forcing(m,i,j,k) = forcing(m,i,j,k) - dssp *  &
+     &                    (ue(k-2,m) - 4.0d0*ue(k-1,m) + 5.0d0*ue(k,m))
+            enddo
+
+         enddo
+      enddo
+!$omp end do
+
+!---------------------------------------------------------------------
+!     now change the sign of the forcing function
+!---------------------------------------------------------------------
+!$omp do schedule(static) collapse(2)
+      do k = 1, grid_points(3)-2
+         do j = 1, grid_points(2)-2
+            do i = 1, grid_points(1)-2
+               do m = 1, 5
+                  forcing(m,i,j,k) = -1.d0 * forcing(m,i,j,k)
+               enddo
+            enddo
+         enddo
+      enddo
+!$omp end do nowait
+!$omp end parallel
+
+      return
+      end
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/BT/exact_solution.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/BT/exact_solution.f90
new file mode 100644
index 000000000..d6ec265f8
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/BT/exact_solution.f90
@@ -0,0 +1,30 @@
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine exact_solution(xi,eta,zeta,dtemp)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     this subroutine returns the exact solution at point xi, eta, zeta
+!---------------------------------------------------------------------
+
+      use bt_data
+      implicit none
+
+      double precision  xi, eta, zeta, dtemp(5)
+      integer m
+
+      do m = 1, 5
+         dtemp(m) =  ce(m,1) +  &
+     &     xi*(ce(m,2) + xi*(ce(m,5) + xi*(ce(m,8) + xi*ce(m,11)))) +  &
+     &     eta*(ce(m,3) + eta*(ce(m,6) + eta*(ce(m,9) + eta*ce(m,12))))+  &
+     &     zeta*(ce(m,4) + zeta*(ce(m,7) + zeta*(ce(m,10) +   &
+     &     zeta*ce(m,13))))
+      enddo
+
+      return
+      end
+
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/BT/initialize.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/BT/initialize.f90
new file mode 100644
index 000000000..d94d7c494
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/BT/initialize.f90
@@ -0,0 +1,207 @@
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine  initialize
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     This subroutine initializes the field variable u using 
+!     tri-linear transfinite interpolation of the boundary values     
+!---------------------------------------------------------------------
+
+      use bt_data
+      implicit none
+      
+      integer i, j, k, m, ix, iy, iz
+      double precision  xi, eta, zeta, Pface(5,3,2), Pxi, Peta,   &
+     &     Pzeta, temp(5)
+
+
+!$omp parallel default(shared)  &
+!$omp& private(i,j,k,m,zeta,eta,xi,ix,iy,iz,Pface,Pxi,Peta,Pzeta,temp)
+!---------------------------------------------------------------------
+!  Later (in compute_rhs) we compute 1/u for every element. A few of 
+!  the corner elements are not used, but it is convenient (and faster)
+!  to compute the whole thing with a simple loop. Make sure those 
+!  values are nonzero by initializing the whole thing here. 
+!---------------------------------------------------------------------
+!$omp do schedule(static) collapse(2)
+      do k = 0, grid_points(3)-1
+         do j = 0, grid_points(2)-1
+            do i = 0, grid_points(1)-1
+               do m = 1, 5
+                  u(m,i,j,k) = 1.0d0
+               end do
+            end do
+         end do
+      end do
+!$omp end do nowait
+!---------------------------------------------------------------------
+
+
+!---------------------------------------------------------------------
+!     first store the "interpolated" values everywhere on the grid    
+!---------------------------------------------------------------------
+
+!$omp do schedule(static) collapse(2)
+      do k = 0, grid_points(3)-1
+         do j = 0, grid_points(2)-1
+            zeta = dble(k) * dnzm1
+            eta = dble(j) * dnym1
+            do i = 0, grid_points(1)-1
+               xi = dble(i) * dnxm1
+                  
+               do ix = 1, 2
+                  call exact_solution(dble(ix-1), eta, zeta,   &
+     &                    Pface(1,1,ix))
+               enddo
+
+               do iy = 1, 2
+                  call exact_solution(xi, dble(iy-1) , zeta,   &
+     &                    Pface(1,2,iy))
+               enddo
+
+               do iz = 1, 2
+                  call exact_solution(xi, eta, dble(iz-1),     &
+     &                    Pface(1,3,iz))
+               enddo
+
+               do m = 1, 5
+                  Pxi   = xi   * Pface(m,1,2) +   &
+     &                    (1.0d0-xi)   * Pface(m,1,1)
+                  Peta  = eta  * Pface(m,2,2) +   &
+     &                    (1.0d0-eta)  * Pface(m,2,1)
+                  Pzeta = zeta * Pface(m,3,2) +   &
+     &                    (1.0d0-zeta) * Pface(m,3,1)
+                     
+                  u(m,i,j,k) = Pxi + Peta + Pzeta -   &
+     &                    Pxi*Peta - Pxi*Pzeta - Peta*Pzeta +   &
+     &                    Pxi*Peta*Pzeta
+
+               enddo
+            enddo
+         enddo
+      enddo
+!$omp end do nowait
+
+!---------------------------------------------------------------------
+!     now store the exact values on the boundaries        
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     west face                                                  
+!---------------------------------------------------------------------
+      i = 0
+      xi = 0.0d0
+!$omp do schedule(static) collapse(2)
+      do k = 0, grid_points(3)-1
+         do j = 0, grid_points(2)-1
+            zeta = dble(k) * dnzm1
+            eta = dble(j) * dnym1
+            call exact_solution(xi, eta, zeta, temp)
+            do m = 1, 5
+               u(m,i,j,k) = temp(m)
+            enddo
+         enddo
+      enddo
+!$omp end do nowait
+
+!---------------------------------------------------------------------
+!     east face                                                      
+!---------------------------------------------------------------------
+
+      i = grid_points(1)-1
+      xi = 1.0d0
+!$omp do schedule(static) collapse(2)
+      do k = 0, grid_points(3)-1
+         do j = 0, grid_points(2)-1
+            zeta = dble(k) * dnzm1
+            eta = dble(j) * dnym1
+            call exact_solution(xi, eta, zeta, temp)
+            do m = 1, 5
+               u(m,i,j,k) = temp(m)
+            enddo
+         enddo
+      enddo
+!$omp end do
+
+!---------------------------------------------------------------------
+!     south face                                                 
+!---------------------------------------------------------------------
+      j = 0
+      eta = 0.0d0
+!$omp do schedule(static) collapse(2)
+      do k = 0, grid_points(3)-1
+         do i = 0, grid_points(1)-1
+            zeta = dble(k) * dnzm1
+            xi = dble(i) * dnxm1
+            call exact_solution(xi, eta, zeta, temp)
+            do m = 1, 5
+               u(m,i,j,k) = temp(m)
+            enddo
+         enddo
+      enddo
+!$omp end do nowait
+
+
+!---------------------------------------------------------------------
+!     north face                                    
+!---------------------------------------------------------------------
+      j = grid_points(2)-1
+      eta = 1.0d0
+!$omp do schedule(static) collapse(2)
+      do k = 0, grid_points(3)-1
+         do i = 0, grid_points(1)-1
+            zeta = dble(k) * dnzm1
+            xi = dble(i) * dnxm1
+            call exact_solution(xi, eta, zeta, temp)
+            do m = 1, 5
+               u(m,i,j,k) = temp(m)
+            enddo
+         enddo
+      enddo
+!$omp end do
+
+!---------------------------------------------------------------------
+!     bottom face                                       
+!---------------------------------------------------------------------
+      k = 0
+      zeta = 0.0d0
+!$omp do schedule(static) collapse(2)
+      do j = 0, grid_points(2)-1
+         do i =0, grid_points(1)-1
+            eta = dble(j) * dnym1
+            xi = dble(i) *dnxm1
+            call exact_solution(xi, eta, zeta, temp)
+            do m = 1, 5
+               u(m,i,j,k) = temp(m)
+            enddo
+         enddo
+      enddo
+!$omp end do nowait
+
+!---------------------------------------------------------------------
+!     top face     
+!---------------------------------------------------------------------
+      k = grid_points(3)-1
+      zeta = 1.0d0
+!$omp do schedule(static) collapse(2)
+      do j = 0, grid_points(2)-1
+         do i =0, grid_points(1)-1
+            eta = dble(j) * dnym1
+            xi = dble(i) * dnxm1
+            call exact_solution(xi, eta, zeta, temp)
+            do m = 1, 5
+               u(m,i,j,k) = temp(m)
+            enddo
+         enddo
+      enddo
+!$omp end do nowait
+!$omp end parallel
+
+      return
+      end
+
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/BT/inputbt.data.sample b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/BT/inputbt.data.sample
new file mode 100644
index 000000000..d47ca916d
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/BT/inputbt.data.sample
@@ -0,0 +1,3 @@
+60       number of time steps
+0.01d0   dt for class A = 0.0008d0. class B = 0.0003d0  class C = 0.0001d0
+12 12 12
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/BT/rhs.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/BT/rhs.f90
new file mode 100644
index 000000000..12e31cc23
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/BT/rhs.f90
@@ -0,0 +1,415 @@
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine compute_rhs
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      use bt_data
+      implicit none
+
+      integer i, j, k, m
+      double precision rho_inv, uijk, up1, um1, vijk, vp1, vm1,  &
+     &     wijk, wp1, wm1
+
+
+      if (timeron) call timer_start(t_rhs)
+
+!$omp parallel default(shared) private(i,j,k,m,rho_inv,uijk,up1,um1,  &
+!$omp&   vijk,vp1,vm1,wijk,wp1,wm1)
+
+!---------------------------------------------------------------------
+!     compute the reciprocal of density, the velocity components,
+!     and the kinetic energy.
+!---------------------------------------------------------------------
+!$omp do schedule(static) collapse(2)
+      do k = 0, grid_points(3)-1
+         do j = 0, grid_points(2)-1
+            do i = 0, grid_points(1)-1
+               rho_inv = 1.0d0/u(1,i,j,k)
+               rho_i(i,j,k) = rho_inv
+               us(i,j,k) = u(2,i,j,k) * rho_inv
+               vs(i,j,k) = u(3,i,j,k) * rho_inv
+               ws(i,j,k) = u(4,i,j,k) * rho_inv
+               square(i,j,k)     = 0.5d0* (  &
+     &                 u(2,i,j,k)*u(2,i,j,k) +   &
+     &                 u(3,i,j,k)*u(3,i,j,k) +  &
+     &                 u(4,i,j,k)*u(4,i,j,k) ) * rho_inv
+               qs(i,j,k) = square(i,j,k) * rho_inv
+            enddo
+         enddo
+      enddo
+!$omp end do nowait
+
+!---------------------------------------------------------------------
+! copy the exact forcing term to the right hand side;  because 
+! this forcing term is known, we can store it on the whole grid
+! including the boundary                   
+!---------------------------------------------------------------------
+
+!$omp do schedule(static) collapse(2)
+      do k = 0, grid_points(3)-1
+         do j = 0, grid_points(2)-1
+            do i = 0, grid_points(1)-1
+               do m = 1, 5
+                  rhs(m,i,j,k) = forcing(m,i,j,k)
+               enddo
+            enddo
+         enddo
+      enddo
+!$omp end do
+
+
+!$omp master
+      if (timeron) call timer_start(t_rhsx)
+!$omp end master
+!---------------------------------------------------------------------
+!     compute xi-direction fluxes 
+!---------------------------------------------------------------------
+!$omp do schedule(static) collapse(2)
+      do k = 1, grid_points(3)-2
+         do j = 1, grid_points(2)-2
+            do i = 1, grid_points(1)-2
+               uijk = us(i,j,k)
+               up1  = us(i+1,j,k)
+               um1  = us(i-1,j,k)
+
+               rhs(1,i,j,k) = rhs(1,i,j,k) + dx1tx1 *   &
+     &                 (u(1,i+1,j,k) - 2.0d0*u(1,i,j,k) +   &
+     &                 u(1,i-1,j,k)) -  &
+     &                 tx2 * (u(2,i+1,j,k) - u(2,i-1,j,k))
+
+               rhs(2,i,j,k) = rhs(2,i,j,k) + dx2tx1 *   &
+     &                 (u(2,i+1,j,k) - 2.0d0*u(2,i,j,k) +   &
+     &                 u(2,i-1,j,k)) +  &
+     &                 xxcon2*con43 * (up1 - 2.0d0*uijk + um1) -  &
+     &                 tx2 * (u(2,i+1,j,k)*up1 -   &
+     &                 u(2,i-1,j,k)*um1 +  &
+     &                 (u(5,i+1,j,k)- square(i+1,j,k)-  &
+     &                 u(5,i-1,j,k)+ square(i-1,j,k))*  &
+     &                 c2)
+
+               rhs(3,i,j,k) = rhs(3,i,j,k) + dx3tx1 *   &
+     &                 (u(3,i+1,j,k) - 2.0d0*u(3,i,j,k) +  &
+     &                 u(3,i-1,j,k)) +  &
+     &                 xxcon2 * (vs(i+1,j,k) - 2.0d0*vs(i,j,k) +  &
+     &                 vs(i-1,j,k)) -  &
+     &                 tx2 * (u(3,i+1,j,k)*up1 -   &
+     &                 u(3,i-1,j,k)*um1)
+
+               rhs(4,i,j,k) = rhs(4,i,j,k) + dx4tx1 *   &
+     &                 (u(4,i+1,j,k) - 2.0d0*u(4,i,j,k) +  &
+     &                 u(4,i-1,j,k)) +  &
+     &                 xxcon2 * (ws(i+1,j,k) - 2.0d0*ws(i,j,k) +  &
+     &                 ws(i-1,j,k)) -  &
+     &                 tx2 * (u(4,i+1,j,k)*up1 -   &
+     &                 u(4,i-1,j,k)*um1)
+
+               rhs(5,i,j,k) = rhs(5,i,j,k) + dx5tx1 *   &
+     &                 (u(5,i+1,j,k) - 2.0d0*u(5,i,j,k) +  &
+     &                 u(5,i-1,j,k)) +  &
+     &                 xxcon3 * (qs(i+1,j,k) - 2.0d0*qs(i,j,k) +  &
+     &                 qs(i-1,j,k)) +  &
+     &                 xxcon4 * (up1*up1 -       2.0d0*uijk*uijk +   &
+     &                 um1*um1) +  &
+     &                 xxcon5 * (u(5,i+1,j,k)*rho_i(i+1,j,k) -   &
+     &                 2.0d0*u(5,i,j,k)*rho_i(i,j,k) +  &
+     &                 u(5,i-1,j,k)*rho_i(i-1,j,k)) -  &
+     &                 tx2 * ( (c1*u(5,i+1,j,k) -   &
+     &                 c2*square(i+1,j,k))*up1 -  &
+     &                 (c1*u(5,i-1,j,k) -   &
+     &                 c2*square(i-1,j,k))*um1 )
+            enddo
+
+!---------------------------------------------------------------------
+!     add fourth order xi-direction dissipation               
+!---------------------------------------------------------------------
+            i = 1
+            do m = 1, 5
+               rhs(m,i,j,k) = rhs(m,i,j,k)- dssp *   &
+     &                    ( 5.0d0*u(m,i,j,k) - 4.0d0*u(m,i+1,j,k) +  &
+     &                    u(m,i+2,j,k))
+            enddo
+
+            i = 2
+            do m = 1, 5
+               rhs(m,i,j,k) = rhs(m,i,j,k) - dssp *   &
+     &                    (-4.0d0*u(m,i-1,j,k) + 6.0d0*u(m,i,j,k) -  &
+     &                    4.0d0*u(m,i+1,j,k) + u(m,i+2,j,k))
+            enddo
+
+            do i = 3,grid_points(1)-4
+               do m = 1, 5
+                  rhs(m,i,j,k) = rhs(m,i,j,k) - dssp *   &
+     &                    (  u(m,i-2,j,k) - 4.0d0*u(m,i-1,j,k) +   &
+     &                    6.0d0*u(m,i,j,k) - 4.0d0*u(m,i+1,j,k) +   &
+     &                    u(m,i+2,j,k) )
+               enddo
+            enddo
+         
+            i = grid_points(1)-3
+            do m = 1, 5
+               rhs(m,i,j,k) = rhs(m,i,j,k) - dssp *  &
+     &                    ( u(m,i-2,j,k) - 4.0d0*u(m,i-1,j,k) +   &
+     &                    6.0d0*u(m,i,j,k) - 4.0d0*u(m,i+1,j,k) )
+            enddo
+
+            i = grid_points(1)-2
+            do m = 1, 5
+               rhs(m,i,j,k) = rhs(m,i,j,k) - dssp *  &
+     &                    ( u(m,i-2,j,k) - 4.d0*u(m,i-1,j,k) +  &
+     &                    5.d0*u(m,i,j,k) )
+            enddo
+         enddo
+      enddo
+!$omp end do nowait
+!$omp master
+      if (timeron) call timer_stop(t_rhsx)
+
+      if (timeron) call timer_start(t_rhsy)
+!$omp end master
+!---------------------------------------------------------------------
+!     compute eta-direction fluxes 
+!---------------------------------------------------------------------
+!$omp do schedule(static) collapse(2)
+      do k = 1, grid_points(3)-2
+         do j = 1, grid_points(2)-2
+            do i = 1, grid_points(1)-2
+               vijk = vs(i,j,k)
+               vp1  = vs(i,j+1,k)
+               vm1  = vs(i,j-1,k)
+               rhs(1,i,j,k) = rhs(1,i,j,k) + dy1ty1 *   &
+     &                 (u(1,i,j+1,k) - 2.0d0*u(1,i,j,k) +   &
+     &                 u(1,i,j-1,k)) -  &
+     &                 ty2 * (u(3,i,j+1,k) - u(3,i,j-1,k))
+               rhs(2,i,j,k) = rhs(2,i,j,k) + dy2ty1 *   &
+     &                 (u(2,i,j+1,k) - 2.0d0*u(2,i,j,k) +   &
+     &                 u(2,i,j-1,k)) +  &
+     &                 yycon2 * (us(i,j+1,k) - 2.0d0*us(i,j,k) +   &
+     &                 us(i,j-1,k)) -  &
+     &                 ty2 * (u(2,i,j+1,k)*vp1 -   &
+     &                 u(2,i,j-1,k)*vm1)
+               rhs(3,i,j,k) = rhs(3,i,j,k) + dy3ty1 *   &
+     &                 (u(3,i,j+1,k) - 2.0d0*u(3,i,j,k) +   &
+     &                 u(3,i,j-1,k)) +  &
+     &                 yycon2*con43 * (vp1 - 2.0d0*vijk + vm1) -  &
+     &                 ty2 * (u(3,i,j+1,k)*vp1 -   &
+     &                 u(3,i,j-1,k)*vm1 +  &
+     &                 (u(5,i,j+1,k) - square(i,j+1,k) -   &
+     &                 u(5,i,j-1,k) + square(i,j-1,k))  &
+     &                 *c2)
+               rhs(4,i,j,k) = rhs(4,i,j,k) + dy4ty1 *   &
+     &                 (u(4,i,j+1,k) - 2.0d0*u(4,i,j,k) +   &
+     &                 u(4,i,j-1,k)) +  &
+     &                 yycon2 * (ws(i,j+1,k) - 2.0d0*ws(i,j,k) +   &
+     &                 ws(i,j-1,k)) -  &
+     &                 ty2 * (u(4,i,j+1,k)*vp1 -   &
+     &                 u(4,i,j-1,k)*vm1)
+               rhs(5,i,j,k) = rhs(5,i,j,k) + dy5ty1 *   &
+     &                 (u(5,i,j+1,k) - 2.0d0*u(5,i,j,k) +   &
+     &                 u(5,i,j-1,k)) +  &
+     &                 yycon3 * (qs(i,j+1,k) - 2.0d0*qs(i,j,k) +   &
+     &                 qs(i,j-1,k)) +  &
+     &                 yycon4 * (vp1*vp1       - 2.0d0*vijk*vijk +   &
+     &                 vm1*vm1) +  &
+     &                 yycon5 * (u(5,i,j+1,k)*rho_i(i,j+1,k) -   &
+     &                 2.0d0*u(5,i,j,k)*rho_i(i,j,k) +  &
+     &                 u(5,i,j-1,k)*rho_i(i,j-1,k)) -  &
+     &                 ty2 * ((c1*u(5,i,j+1,k) -   &
+     &                 c2*square(i,j+1,k)) * vp1 -  &
+     &                 (c1*u(5,i,j-1,k) -   &
+     &                 c2*square(i,j-1,k)) * vm1)
+            enddo
+
+!---------------------------------------------------------------------
+!     add fourth order eta-direction dissipation         
+!---------------------------------------------------------------------
+            if (j .eq. 1) then
+               do i = 1, grid_points(1)-2
+               do m = 1, 5
+                  rhs(m,i,j,k) = rhs(m,i,j,k)- dssp *   &
+     &                    ( 5.0d0*u(m,i,j,k) - 4.0d0*u(m,i,j+1,k) +  &
+     &                    u(m,i,j+2,k))
+               enddo
+               enddo
+
+            else if (j .eq. 2) then
+               do i = 1, grid_points(1)-2
+               do m = 1, 5
+                  rhs(m,i,j,k) = rhs(m,i,j,k) - dssp *   &
+     &                    (-4.0d0*u(m,i,j-1,k) + 6.0d0*u(m,i,j,k) -  &
+     &                    4.0d0*u(m,i,j+1,k) + u(m,i,j+2,k))
+               enddo
+               enddo
+         
+            else if (j .eq. grid_points(2)-3) then
+               do i = 1, grid_points(1)-2
+               do m = 1, 5
+                  rhs(m,i,j,k) = rhs(m,i,j,k) - dssp *  &
+     &                    ( u(m,i,j-2,k) - 4.0d0*u(m,i,j-1,k) +   &
+     &                    6.0d0*u(m,i,j,k) - 4.0d0*u(m,i,j+1,k) )
+               enddo
+               enddo
+
+            else if (j .eq. grid_points(2)-2) then
+               do i = 1, grid_points(1)-2
+               do m = 1, 5
+                  rhs(m,i,j,k) = rhs(m,i,j,k) - dssp *  &
+     &                    ( u(m,i,j-2,k) - 4.d0*u(m,i,j-1,k) +  &
+     &                    5.d0*u(m,i,j,k) )
+               enddo
+               enddo
+
+            else
+               do i = 1,grid_points(1)-2
+               do m = 1, 5
+                  rhs(m,i,j,k) = rhs(m,i,j,k) - dssp *   &
+     &                    (  u(m,i,j-2,k) - 4.0d0*u(m,i,j-1,k) +   &
+     &                    6.0d0*u(m,i,j,k) - 4.0d0*u(m,i,j+1,k) +   &
+     &                    u(m,i,j+2,k) )
+               enddo
+               enddo
+            endif
+         enddo
+      enddo
+!$omp end do nowait
+!$omp master
+      if (timeron) call timer_stop(t_rhsy)
+
+      if (timeron) call timer_start(t_rhsz)
+!$omp end master
+!---------------------------------------------------------------------
+!     compute zeta-direction fluxes 
+!---------------------------------------------------------------------
+!$omp do schedule(static) collapse(2)
+      do k = 1, grid_points(3)-2
+         do j = 1, grid_points(2)-2
+            do i = 1, grid_points(1)-2
+               wijk = ws(i,j,k)
+               wp1  = ws(i,j,k+1)
+               wm1  = ws(i,j,k-1)
+
+               rhs(1,i,j,k) = rhs(1,i,j,k) + dz1tz1 *   &
+     &                 (u(1,i,j,k+1) - 2.0d0*u(1,i,j,k) +   &
+     &                 u(1,i,j,k-1)) -  &
+     &                 tz2 * (u(4,i,j,k+1) - u(4,i,j,k-1))
+               rhs(2,i,j,k) = rhs(2,i,j,k) + dz2tz1 *   &
+     &                 (u(2,i,j,k+1) - 2.0d0*u(2,i,j,k) +   &
+     &                 u(2,i,j,k-1)) +  &
+     &                 zzcon2 * (us(i,j,k+1) - 2.0d0*us(i,j,k) +   &
+     &                 us(i,j,k-1)) -  &
+     &                 tz2 * (u(2,i,j,k+1)*wp1 -   &
+     &                 u(2,i,j,k-1)*wm1)
+               rhs(3,i,j,k) = rhs(3,i,j,k) + dz3tz1 *   &
+     &                 (u(3,i,j,k+1) - 2.0d0*u(3,i,j,k) +   &
+     &                 u(3,i,j,k-1)) +  &
+     &                 zzcon2 * (vs(i,j,k+1) - 2.0d0*vs(i,j,k) +   &
+     &                 vs(i,j,k-1)) -  &
+     &                 tz2 * (u(3,i,j,k+1)*wp1 -   &
+     &                 u(3,i,j,k-1)*wm1)
+               rhs(4,i,j,k) = rhs(4,i,j,k) + dz4tz1 *   &
+     &                 (u(4,i,j,k+1) - 2.0d0*u(4,i,j,k) +   &
+     &                 u(4,i,j,k-1)) +  &
+     &                 zzcon2*con43 * (wp1 - 2.0d0*wijk + wm1) -  &
+     &                 tz2 * (u(4,i,j,k+1)*wp1 -   &
+     &                 u(4,i,j,k-1)*wm1 +  &
+     &                 (u(5,i,j,k+1) - square(i,j,k+1) -   &
+     &                 u(5,i,j,k-1) + square(i,j,k-1))  &
+     &                 *c2)
+               rhs(5,i,j,k) = rhs(5,i,j,k) + dz5tz1 *   &
+     &                 (u(5,i,j,k+1) - 2.0d0*u(5,i,j,k) +   &
+     &                 u(5,i,j,k-1)) +  &
+     &                 zzcon3 * (qs(i,j,k+1) - 2.0d0*qs(i,j,k) +   &
+     &                 qs(i,j,k-1)) +  &
+     &                 zzcon4 * (wp1*wp1 - 2.0d0*wijk*wijk +   &
+     &                 wm1*wm1) +  &
+     &                 zzcon5 * (u(5,i,j,k+1)*rho_i(i,j,k+1) -   &
+     &                 2.0d0*u(5,i,j,k)*rho_i(i,j,k) +  &
+     &                 u(5,i,j,k-1)*rho_i(i,j,k-1)) -  &
+     &                 tz2 * ( (c1*u(5,i,j,k+1) -   &
+     &                 c2*square(i,j,k+1))*wp1 -  &
+     &                 (c1*u(5,i,j,k-1) -   &
+     &                 c2*square(i,j,k-1))*wm1)
+            enddo
+
+!---------------------------------------------------------------------
+!     add fourth order zeta-direction dissipation                
+!---------------------------------------------------------------------
+            if (k.eq.1) then
+               do i = 1, grid_points(1)-2
+               do m = 1, 5
+                  rhs(m,i,j,k) = rhs(m,i,j,k)- dssp *   &
+     &                    ( 5.0d0*u(m,i,j,k) - 4.0d0*u(m,i,j,k+1) +  &
+     &                    u(m,i,j,k+2))
+               enddo
+               enddo
+
+            else if (k.eq.2) then
+               do i = 1, grid_points(1)-2
+               do m = 1, 5
+                  rhs(m,i,j,k) = rhs(m,i,j,k) - dssp *   &
+     &                    (-4.0d0*u(m,i,j,k-1) + 6.0d0*u(m,i,j,k) -  &
+     &                    4.0d0*u(m,i,j,k+1) + u(m,i,j,k+2))
+               enddo
+               enddo
+
+            else if (k.eq.grid_points(3)-3) then
+               do i = 1, grid_points(1)-2
+               do m = 1, 5
+                  rhs(m,i,j,k) = rhs(m,i,j,k) - dssp *  &
+     &                    ( u(m,i,j,k-2) - 4.0d0*u(m,i,j,k-1) +   &
+     &                    6.0d0*u(m,i,j,k) - 4.0d0*u(m,i,j,k+1) )
+               enddo
+               enddo
+
+            else if (k.eq.grid_points(3)-2) then
+               do i = 1, grid_points(1)-2
+               do m = 1, 5
+                  rhs(m,i,j,k) = rhs(m,i,j,k) - dssp *  &
+     &                    ( u(m,i,j,k-2) - 4.d0*u(m,i,j,k-1) +  &
+     &                    5.d0*u(m,i,j,k) )
+               enddo
+               enddo
+
+            else
+               do i = 1,grid_points(1)-2
+               do m = 1, 5
+                  rhs(m,i,j,k) = rhs(m,i,j,k) - dssp *   &
+     &                    (  u(m,i,j,k-2) - 4.0d0*u(m,i,j,k-1) +   &
+     &                    6.0d0*u(m,i,j,k) - 4.0d0*u(m,i,j,k+1) +   &
+     &                    u(m,i,j,k+2) )
+               enddo
+               enddo
+            endif
+         enddo
+      enddo
+!$omp end do nowait
+!$omp master
+      if (timeron) call timer_stop(t_rhsz)
+!$omp end master
+
+!$omp do schedule(static) collapse(2)
+      do k = 1, grid_points(3)-2
+         do j = 1, grid_points(2)-2
+            do i = 1, grid_points(1)-2
+               rhs(1,i,j,k) = rhs(1,i,j,k) * dt
+               rhs(2,i,j,k) = rhs(2,i,j,k) * dt
+               rhs(3,i,j,k) = rhs(3,i,j,k) * dt
+               rhs(4,i,j,k) = rhs(4,i,j,k) * dt
+               rhs(5,i,j,k) = rhs(5,i,j,k) * dt
+            enddo
+         enddo
+      enddo
+!$omp end do nowait
+
+!$omp end parallel
+
+      if (timeron) call timer_stop(t_rhs)
+
+      return
+      end
+
+
+
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/BT/set_constants.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/BT/set_constants.f90
new file mode 100644
index 000000000..ad83166d9
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/BT/set_constants.f90
@@ -0,0 +1,201 @@
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine  set_constants
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      use bt_data
+      implicit none
+      
+      ce(1,1)  = 2.0d0
+      ce(1,2)  = 0.0d0
+      ce(1,3)  = 0.0d0
+      ce(1,4)  = 4.0d0
+      ce(1,5)  = 5.0d0
+      ce(1,6)  = 3.0d0
+      ce(1,7)  = 0.5d0
+      ce(1,8)  = 0.02d0
+      ce(1,9)  = 0.01d0
+      ce(1,10) = 0.03d0
+      ce(1,11) = 0.5d0
+      ce(1,12) = 0.4d0
+      ce(1,13) = 0.3d0
+      
+      ce(2,1)  = 1.0d0
+      ce(2,2)  = 0.0d0
+      ce(2,3)  = 0.0d0
+      ce(2,4)  = 0.0d0
+      ce(2,5)  = 1.0d0
+      ce(2,6)  = 2.0d0
+      ce(2,7)  = 3.0d0
+      ce(2,8)  = 0.01d0
+      ce(2,9)  = 0.03d0
+      ce(2,10) = 0.02d0
+      ce(2,11) = 0.4d0
+      ce(2,12) = 0.3d0
+      ce(2,13) = 0.5d0
+
+      ce(3,1)  = 2.0d0
+      ce(3,2)  = 2.0d0
+      ce(3,3)  = 0.0d0
+      ce(3,4)  = 0.0d0
+      ce(3,5)  = 0.0d0
+      ce(3,6)  = 2.0d0
+      ce(3,7)  = 3.0d0
+      ce(3,8)  = 0.04d0
+      ce(3,9)  = 0.03d0
+      ce(3,10) = 0.05d0
+      ce(3,11) = 0.3d0
+      ce(3,12) = 0.5d0
+      ce(3,13) = 0.4d0
+
+      ce(4,1)  = 2.0d0
+      ce(4,2)  = 2.0d0
+      ce(4,3)  = 0.0d0
+      ce(4,4)  = 0.0d0
+      ce(4,5)  = 0.0d0
+      ce(4,6)  = 2.0d0
+      ce(4,7)  = 3.0d0
+      ce(4,8)  = 0.03d0
+      ce(4,9)  = 0.05d0
+      ce(4,10) = 0.04d0
+      ce(4,11) = 0.2d0
+      ce(4,12) = 0.1d0
+      ce(4,13) = 0.3d0
+
+      ce(5,1)  = 5.0d0
+      ce(5,2)  = 4.0d0
+      ce(5,3)  = 3.0d0
+      ce(5,4)  = 2.0d0
+      ce(5,5)  = 0.1d0
+      ce(5,6)  = 0.4d0
+      ce(5,7)  = 0.3d0
+      ce(5,8)  = 0.05d0
+      ce(5,9)  = 0.04d0
+      ce(5,10) = 0.03d0
+      ce(5,11) = 0.1d0
+      ce(5,12) = 0.3d0
+      ce(5,13) = 0.2d0
+
+      c1 = 1.4d0
+      c2 = 0.4d0
+      c3 = 0.1d0
+      c4 = 1.0d0
+      c5 = 1.4d0
+
+      dnxm1 = 1.0d0 / dble(grid_points(1)-1)
+      dnym1 = 1.0d0 / dble(grid_points(2)-1)
+      dnzm1 = 1.0d0 / dble(grid_points(3)-1)
+
+      c1c2 = c1 * c2
+      c1c5 = c1 * c5
+      c3c4 = c3 * c4
+      c1345 = c1c5 * c3c4
+
+      conz1 = (1.0d0-c1c5)
+
+      tx1 = 1.0d0 / (dnxm1 * dnxm1)
+      tx2 = 1.0d0 / (2.0d0 * dnxm1)
+      tx3 = 1.0d0 / dnxm1
+
+      ty1 = 1.0d0 / (dnym1 * dnym1)
+      ty2 = 1.0d0 / (2.0d0 * dnym1)
+      ty3 = 1.0d0 / dnym1
+      
+      tz1 = 1.0d0 / (dnzm1 * dnzm1)
+      tz2 = 1.0d0 / (2.0d0 * dnzm1)
+      tz3 = 1.0d0 / dnzm1
+
+      dx1 = 0.75d0
+      dx2 = 0.75d0
+      dx3 = 0.75d0
+      dx4 = 0.75d0
+      dx5 = 0.75d0
+
+      dy1 = 0.75d0
+      dy2 = 0.75d0
+      dy3 = 0.75d0
+      dy4 = 0.75d0
+      dy5 = 0.75d0
+
+      dz1 = 1.0d0
+      dz2 = 1.0d0
+      dz3 = 1.0d0
+      dz4 = 1.0d0
+      dz5 = 1.0d0
+
+      dxmax = dmax1(dx3, dx4)
+      dymax = dmax1(dy2, dy4)
+      dzmax = dmax1(dz2, dz3)
+
+      dssp = 0.25d0 * dmax1(dx1, dmax1(dy1, dz1) )
+
+      c4dssp = 4.0d0 * dssp
+      c5dssp = 5.0d0 * dssp
+
+      dttx1 = dt*tx1
+      dttx2 = dt*tx2
+      dtty1 = dt*ty1
+      dtty2 = dt*ty2
+      dttz1 = dt*tz1
+      dttz2 = dt*tz2
+
+      c2dttx1 = 2.0d0*dttx1
+      c2dtty1 = 2.0d0*dtty1
+      c2dttz1 = 2.0d0*dttz1
+
+      dtdssp = dt*dssp
+
+      comz1  = dtdssp
+      comz4  = 4.0d0*dtdssp
+      comz5  = 5.0d0*dtdssp
+      comz6  = 6.0d0*dtdssp
+
+      c3c4tx3 = c3c4*tx3
+      c3c4ty3 = c3c4*ty3
+      c3c4tz3 = c3c4*tz3
+
+      dx1tx1 = dx1*tx1
+      dx2tx1 = dx2*tx1
+      dx3tx1 = dx3*tx1
+      dx4tx1 = dx4*tx1
+      dx5tx1 = dx5*tx1
+      
+      dy1ty1 = dy1*ty1
+      dy2ty1 = dy2*ty1
+      dy3ty1 = dy3*ty1
+      dy4ty1 = dy4*ty1
+      dy5ty1 = dy5*ty1
+      
+      dz1tz1 = dz1*tz1
+      dz2tz1 = dz2*tz1
+      dz3tz1 = dz3*tz1
+      dz4tz1 = dz4*tz1
+      dz5tz1 = dz5*tz1
+
+      c2iv  = 2.5d0
+      con43 = 4.0d0/3.0d0
+      con16 = 1.0d0/6.0d0
+      
+      xxcon1 = c3c4tx3*con43*tx3
+      xxcon2 = c3c4tx3*tx3
+      xxcon3 = c3c4tx3*conz1*tx3
+      xxcon4 = c3c4tx3*con16*tx3
+      xxcon5 = c3c4tx3*c1c5*tx3
+
+      yycon1 = c3c4ty3*con43*ty3
+      yycon2 = c3c4ty3*ty3
+      yycon3 = c3c4ty3*conz1*ty3
+      yycon4 = c3c4ty3*con16*ty3
+      yycon5 = c3c4ty3*c1c5*ty3
+
+      zzcon1 = c3c4tz3*con43*tz3
+      zzcon2 = c3c4tz3*tz3
+      zzcon3 = c3c4tz3*conz1*tz3
+      zzcon4 = c3c4tz3*con16*tz3
+      zzcon5 = c3c4tz3*c1c5*tz3
+
+      return
+      end
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/BT/solve_subs.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/BT/solve_subs.f90
new file mode 100644
index 000000000..036415ec4
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/BT/solve_subs.f90
@@ -0,0 +1,642 @@
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine matvec_sub(ablock,avec,bvec)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     computes bvec = bvec - ablock*avec (5x5 block times 5-vector)
+!---------------------------------------------------------------------
+
+      implicit none
+
+      double precision ablock,avec,bvec
+      dimension ablock(5,5),avec(5),bvec(5)
+
+!---------------------------------------------------------------------
+!            rhs(i,ic,jc,kc) = rhs(i,ic,jc,kc) 
+!     $           - lhs(i,1,ablock,ia)*
+!---------------------------------------------------------------------
+         bvec(1) = bvec(1) - ablock(1,1)*avec(1)  &
+     &                     - ablock(1,2)*avec(2)  &
+     &                     - ablock(1,3)*avec(3)  &
+     &                     - ablock(1,4)*avec(4)  &
+     &                     - ablock(1,5)*avec(5)
+         bvec(2) = bvec(2) - ablock(2,1)*avec(1)  &
+     &                     - ablock(2,2)*avec(2)  &
+     &                     - ablock(2,3)*avec(3)  &
+     &                     - ablock(2,4)*avec(4)  &
+     &                     - ablock(2,5)*avec(5)
+         bvec(3) = bvec(3) - ablock(3,1)*avec(1)  &
+     &                     - ablock(3,2)*avec(2)  &
+     &                     - ablock(3,3)*avec(3)  &
+     &                     - ablock(3,4)*avec(4)  &
+     &                     - ablock(3,5)*avec(5)
+         bvec(4) = bvec(4) - ablock(4,1)*avec(1)  &
+     &                     - ablock(4,2)*avec(2)  &
+     &                     - ablock(4,3)*avec(3)  &
+     &                     - ablock(4,4)*avec(4)  &
+     &                     - ablock(4,5)*avec(5)
+         bvec(5) = bvec(5) - ablock(5,1)*avec(1)  &
+     &                     - ablock(5,2)*avec(2)  &
+     &                     - ablock(5,3)*avec(3)  &
+     &                     - ablock(5,4)*avec(4)  &
+     &                     - ablock(5,5)*avec(5)
+
+
+      return
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine matmul_sub(ablock, bblock, cblock)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     computes cblock = cblock - ablock*bblock (5x5 matrix product)
+!---------------------------------------------------------------------
+
+      implicit none
+
+      double precision ablock, bblock, cblock
+      dimension ablock(5,5), bblock(5,5), cblock(5,5)
+
+
+         cblock(1,1) = cblock(1,1) - ablock(1,1)*bblock(1,1)  &
+     &                             - ablock(1,2)*bblock(2,1)  &
+     &                             - ablock(1,3)*bblock(3,1)  &
+     &                             - ablock(1,4)*bblock(4,1)  &
+     &                             - ablock(1,5)*bblock(5,1)
+         cblock(2,1) = cblock(2,1) - ablock(2,1)*bblock(1,1)  &
+     &                             - ablock(2,2)*bblock(2,1)  &
+     &                             - ablock(2,3)*bblock(3,1)  &
+     &                             - ablock(2,4)*bblock(4,1)  &
+     &                             - ablock(2,5)*bblock(5,1)
+         cblock(3,1) = cblock(3,1) - ablock(3,1)*bblock(1,1)  &
+     &                             - ablock(3,2)*bblock(2,1)  &
+     &                             - ablock(3,3)*bblock(3,1)  &
+     &                             - ablock(3,4)*bblock(4,1)  &
+     &                             - ablock(3,5)*bblock(5,1)
+         cblock(4,1) = cblock(4,1) - ablock(4,1)*bblock(1,1)  &
+     &                             - ablock(4,2)*bblock(2,1)  &
+     &                             - ablock(4,3)*bblock(3,1)  &
+     &                             - ablock(4,4)*bblock(4,1)  &
+     &                             - ablock(4,5)*bblock(5,1)
+         cblock(5,1) = cblock(5,1) - ablock(5,1)*bblock(1,1)  &
+     &                             - ablock(5,2)*bblock(2,1)  &
+     &                             - ablock(5,3)*bblock(3,1)  &
+     &                             - ablock(5,4)*bblock(4,1)  &
+     &                             - ablock(5,5)*bblock(5,1)
+         cblock(1,2) = cblock(1,2) - ablock(1,1)*bblock(1,2)  &
+     &                             - ablock(1,2)*bblock(2,2)  &
+     &                             - ablock(1,3)*bblock(3,2)  &
+     &                             - ablock(1,4)*bblock(4,2)  &
+     &                             - ablock(1,5)*bblock(5,2)
+         cblock(2,2) = cblock(2,2) - ablock(2,1)*bblock(1,2)  &
+     &                             - ablock(2,2)*bblock(2,2)  &
+     &                             - ablock(2,3)*bblock(3,2)  &
+     &                             - ablock(2,4)*bblock(4,2)  &
+     &                             - ablock(2,5)*bblock(5,2)
+         cblock(3,2) = cblock(3,2) - ablock(3,1)*bblock(1,2)  &
+     &                             - ablock(3,2)*bblock(2,2)  &
+     &                             - ablock(3,3)*bblock(3,2)  &
+     &                             - ablock(3,4)*bblock(4,2)  &
+     &                             - ablock(3,5)*bblock(5,2)
+         cblock(4,2) = cblock(4,2) - ablock(4,1)*bblock(1,2)  &
+     &                             - ablock(4,2)*bblock(2,2)  &
+     &                             - ablock(4,3)*bblock(3,2)  &
+     &                             - ablock(4,4)*bblock(4,2)  &
+     &                             - ablock(4,5)*bblock(5,2)
+         cblock(5,2) = cblock(5,2) - ablock(5,1)*bblock(1,2)  &
+     &                             - ablock(5,2)*bblock(2,2)  &
+     &                             - ablock(5,3)*bblock(3,2)  &
+     &                             - ablock(5,4)*bblock(4,2)  &
+     &                             - ablock(5,5)*bblock(5,2)
+         cblock(1,3) = cblock(1,3) - ablock(1,1)*bblock(1,3)  &
+     &                             - ablock(1,2)*bblock(2,3)  &
+     &                             - ablock(1,3)*bblock(3,3)  &
+     &                             - ablock(1,4)*bblock(4,3)  &
+     &                             - ablock(1,5)*bblock(5,3)
+         cblock(2,3) = cblock(2,3) - ablock(2,1)*bblock(1,3)  &
+     &                             - ablock(2,2)*bblock(2,3)  &
+     &                             - ablock(2,3)*bblock(3,3)  &
+     &                             - ablock(2,4)*bblock(4,3)  &
+     &                             - ablock(2,5)*bblock(5,3)
+         cblock(3,3) = cblock(3,3) - ablock(3,1)*bblock(1,3)  &
+     &                             - ablock(3,2)*bblock(2,3)  &
+     &                             - ablock(3,3)*bblock(3,3)  &
+     &                             - ablock(3,4)*bblock(4,3)  &
+     &                             - ablock(3,5)*bblock(5,3)
+         cblock(4,3) = cblock(4,3) - ablock(4,1)*bblock(1,3)  &
+     &                             - ablock(4,2)*bblock(2,3)  &
+     &                             - ablock(4,3)*bblock(3,3)  &
+     &                             - ablock(4,4)*bblock(4,3)  &
+     &                             - ablock(4,5)*bblock(5,3)
+         cblock(5,3) = cblock(5,3) - ablock(5,1)*bblock(1,3)  &
+     &                             - ablock(5,2)*bblock(2,3)  &
+     &                             - ablock(5,3)*bblock(3,3)  &
+     &                             - ablock(5,4)*bblock(4,3)  &
+     &                             - ablock(5,5)*bblock(5,3)
+         cblock(1,4) = cblock(1,4) - ablock(1,1)*bblock(1,4)  &
+     &                             - ablock(1,2)*bblock(2,4)  &
+     &                             - ablock(1,3)*bblock(3,4)  &
+     &                             - ablock(1,4)*bblock(4,4)  &
+     &                             - ablock(1,5)*bblock(5,4)
+         cblock(2,4) = cblock(2,4) - ablock(2,1)*bblock(1,4)  &
+     &                             - ablock(2,2)*bblock(2,4)  &
+     &                             - ablock(2,3)*bblock(3,4)  &
+     &                             - ablock(2,4)*bblock(4,4)  &
+     &                             - ablock(2,5)*bblock(5,4)
+         cblock(3,4) = cblock(3,4) - ablock(3,1)*bblock(1,4)  &
+     &                             - ablock(3,2)*bblock(2,4)  &
+     &                             - ablock(3,3)*bblock(3,4)  &
+     &                             - ablock(3,4)*bblock(4,4)  &
+     &                             - ablock(3,5)*bblock(5,4)
+         cblock(4,4) = cblock(4,4) - ablock(4,1)*bblock(1,4)  &
+     &                             - ablock(4,2)*bblock(2,4)  &
+     &                             - ablock(4,3)*bblock(3,4)  &
+     &                             - ablock(4,4)*bblock(4,4)  &
+     &                             - ablock(4,5)*bblock(5,4)
+         cblock(5,4) = cblock(5,4) - ablock(5,1)*bblock(1,4)  &
+     &                             - ablock(5,2)*bblock(2,4)  &
+     &                             - ablock(5,3)*bblock(3,4)  &
+     &                             - ablock(5,4)*bblock(4,4)  &
+     &                             - ablock(5,5)*bblock(5,4)
+         cblock(1,5) = cblock(1,5) - ablock(1,1)*bblock(1,5)  &
+     &                             - ablock(1,2)*bblock(2,5)  &
+     &                             - ablock(1,3)*bblock(3,5)  &
+     &                             - ablock(1,4)*bblock(4,5)  &
+     &                             - ablock(1,5)*bblock(5,5)
+         cblock(2,5) = cblock(2,5) - ablock(2,1)*bblock(1,5)  &
+     &                             - ablock(2,2)*bblock(2,5)  &
+     &                             - ablock(2,3)*bblock(3,5)  &
+     &                             - ablock(2,4)*bblock(4,5)  &
+     &                             - ablock(2,5)*bblock(5,5)
+         cblock(3,5) = cblock(3,5) - ablock(3,1)*bblock(1,5)  &
+     &                             - ablock(3,2)*bblock(2,5)  &
+     &                             - ablock(3,3)*bblock(3,5)  &
+     &                             - ablock(3,4)*bblock(4,5)  &
+     &                             - ablock(3,5)*bblock(5,5)
+         cblock(4,5) = cblock(4,5) - ablock(4,1)*bblock(1,5)  &
+     &                             - ablock(4,2)*bblock(2,5)  &
+     &                             - ablock(4,3)*bblock(3,5)  &
+     &                             - ablock(4,4)*bblock(4,5)  &
+     &                             - ablock(4,5)*bblock(5,5)
+         cblock(5,5) = cblock(5,5) - ablock(5,1)*bblock(1,5)  &
+     &                             - ablock(5,2)*bblock(2,5)  &
+     &                             - ablock(5,3)*bblock(3,5)  &
+     &                             - ablock(5,4)*bblock(4,5)  &
+     &                             - ablock(5,5)*bblock(5,5)
+
+              
+      return
+      end
+
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine binvcrhs( lhs,c,r )
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     eliminates lhs by Gauss-Jordan (no pivoting), updating c and r
+!---------------------------------------------------------------------
+
+      implicit none
+
+      double precision pivot, coeff, lhs
+      dimension lhs(5,5)
+      double precision c(5,5), r(5)
+
+!---------------------------------------------------------------------
+!     scale row 1 by 1/lhs(1,1), then clear column 1 in rows 2-5
+!---------------------------------------------------------------------
+
+      pivot = 1.00d0/lhs(1,1)
+      lhs(1,2) = lhs(1,2)*pivot
+      lhs(1,3) = lhs(1,3)*pivot
+      lhs(1,4) = lhs(1,4)*pivot
+      lhs(1,5) = lhs(1,5)*pivot
+      c(1,1) = c(1,1)*pivot
+      c(1,2) = c(1,2)*pivot
+      c(1,3) = c(1,3)*pivot
+      c(1,4) = c(1,4)*pivot
+      c(1,5) = c(1,5)*pivot
+      r(1)   = r(1)  *pivot
+
+      coeff = lhs(2,1)
+      lhs(2,2)= lhs(2,2) - coeff*lhs(1,2)
+      lhs(2,3)= lhs(2,3) - coeff*lhs(1,3)
+      lhs(2,4)= lhs(2,4) - coeff*lhs(1,4)
+      lhs(2,5)= lhs(2,5) - coeff*lhs(1,5)
+      c(2,1) = c(2,1) - coeff*c(1,1)
+      c(2,2) = c(2,2) - coeff*c(1,2)
+      c(2,3) = c(2,3) - coeff*c(1,3)
+      c(2,4) = c(2,4) - coeff*c(1,4)
+      c(2,5) = c(2,5) - coeff*c(1,5)
+      r(2)   = r(2)   - coeff*r(1)
+
+      coeff = lhs(3,1)
+      lhs(3,2)= lhs(3,2) - coeff*lhs(1,2)
+      lhs(3,3)= lhs(3,3) - coeff*lhs(1,3)
+      lhs(3,4)= lhs(3,4) - coeff*lhs(1,4)
+      lhs(3,5)= lhs(3,5) - coeff*lhs(1,5)
+      c(3,1) = c(3,1) - coeff*c(1,1)
+      c(3,2) = c(3,2) - coeff*c(1,2)
+      c(3,3) = c(3,3) - coeff*c(1,3)
+      c(3,4) = c(3,4) - coeff*c(1,4)
+      c(3,5) = c(3,5) - coeff*c(1,5)
+      r(3)   = r(3)   - coeff*r(1)
+
+      coeff = lhs(4,1)
+      lhs(4,2)= lhs(4,2) - coeff*lhs(1,2)
+      lhs(4,3)= lhs(4,3) - coeff*lhs(1,3)
+      lhs(4,4)= lhs(4,4) - coeff*lhs(1,4)
+      lhs(4,5)= lhs(4,5) - coeff*lhs(1,5)
+      c(4,1) = c(4,1) - coeff*c(1,1)
+      c(4,2) = c(4,2) - coeff*c(1,2)
+      c(4,3) = c(4,3) - coeff*c(1,3)
+      c(4,4) = c(4,4) - coeff*c(1,4)
+      c(4,5) = c(4,5) - coeff*c(1,5)
+      r(4)   = r(4)   - coeff*r(1)
+
+      coeff = lhs(5,1)
+      lhs(5,2)= lhs(5,2) - coeff*lhs(1,2)
+      lhs(5,3)= lhs(5,3) - coeff*lhs(1,3)
+      lhs(5,4)= lhs(5,4) - coeff*lhs(1,4)
+      lhs(5,5)= lhs(5,5) - coeff*lhs(1,5)
+      c(5,1) = c(5,1) - coeff*c(1,1)
+      c(5,2) = c(5,2) - coeff*c(1,2)
+      c(5,3) = c(5,3) - coeff*c(1,3)
+      c(5,4) = c(5,4) - coeff*c(1,4)
+      c(5,5) = c(5,5) - coeff*c(1,5)
+      r(5)   = r(5)   - coeff*r(1)
+
+
+      pivot = 1.00d0/lhs(2,2)
+      lhs(2,3) = lhs(2,3)*pivot
+      lhs(2,4) = lhs(2,4)*pivot
+      lhs(2,5) = lhs(2,5)*pivot
+      c(2,1) = c(2,1)*pivot
+      c(2,2) = c(2,2)*pivot
+      c(2,3) = c(2,3)*pivot
+      c(2,4) = c(2,4)*pivot
+      c(2,5) = c(2,5)*pivot
+      r(2)   = r(2)  *pivot
+
+      coeff = lhs(1,2)
+      lhs(1,3)= lhs(1,3) - coeff*lhs(2,3)
+      lhs(1,4)= lhs(1,4) - coeff*lhs(2,4)
+      lhs(1,5)= lhs(1,5) - coeff*lhs(2,5)
+      c(1,1) = c(1,1) - coeff*c(2,1)
+      c(1,2) = c(1,2) - coeff*c(2,2)
+      c(1,3) = c(1,3) - coeff*c(2,3)
+      c(1,4) = c(1,4) - coeff*c(2,4)
+      c(1,5) = c(1,5) - coeff*c(2,5)
+      r(1)   = r(1)   - coeff*r(2)
+
+      coeff = lhs(3,2)
+      lhs(3,3)= lhs(3,3) - coeff*lhs(2,3)
+      lhs(3,4)= lhs(3,4) - coeff*lhs(2,4)
+      lhs(3,5)= lhs(3,5) - coeff*lhs(2,5)
+      c(3,1) = c(3,1) - coeff*c(2,1)
+      c(3,2) = c(3,2) - coeff*c(2,2)
+      c(3,3) = c(3,3) - coeff*c(2,3)
+      c(3,4) = c(3,4) - coeff*c(2,4)
+      c(3,5) = c(3,5) - coeff*c(2,5)
+      r(3)   = r(3)   - coeff*r(2)
+
+      coeff = lhs(4,2)
+      lhs(4,3)= lhs(4,3) - coeff*lhs(2,3)
+      lhs(4,4)= lhs(4,4) - coeff*lhs(2,4)
+      lhs(4,5)= lhs(4,5) - coeff*lhs(2,5)
+      c(4,1) = c(4,1) - coeff*c(2,1)
+      c(4,2) = c(4,2) - coeff*c(2,2)
+      c(4,3) = c(4,3) - coeff*c(2,3)
+      c(4,4) = c(4,4) - coeff*c(2,4)
+      c(4,5) = c(4,5) - coeff*c(2,5)
+      r(4)   = r(4)   - coeff*r(2)
+
+      coeff = lhs(5,2)
+      lhs(5,3)= lhs(5,3) - coeff*lhs(2,3)
+      lhs(5,4)= lhs(5,4) - coeff*lhs(2,4)
+      lhs(5,5)= lhs(5,5) - coeff*lhs(2,5)
+      c(5,1) = c(5,1) - coeff*c(2,1)
+      c(5,2) = c(5,2) - coeff*c(2,2)
+      c(5,3) = c(5,3) - coeff*c(2,3)
+      c(5,4) = c(5,4) - coeff*c(2,4)
+      c(5,5) = c(5,5) - coeff*c(2,5)
+      r(5)   = r(5)   - coeff*r(2)
+
+
+      pivot = 1.00d0/lhs(3,3)
+      lhs(3,4) = lhs(3,4)*pivot
+      lhs(3,5) = lhs(3,5)*pivot
+      c(3,1) = c(3,1)*pivot
+      c(3,2) = c(3,2)*pivot
+      c(3,3) = c(3,3)*pivot
+      c(3,4) = c(3,4)*pivot
+      c(3,5) = c(3,5)*pivot
+      r(3)   = r(3)  *pivot
+
+      coeff = lhs(1,3)
+      lhs(1,4)= lhs(1,4) - coeff*lhs(3,4)
+      lhs(1,5)= lhs(1,5) - coeff*lhs(3,5)
+      c(1,1) = c(1,1) - coeff*c(3,1)
+      c(1,2) = c(1,2) - coeff*c(3,2)
+      c(1,3) = c(1,3) - coeff*c(3,3)
+      c(1,4) = c(1,4) - coeff*c(3,4)
+      c(1,5) = c(1,5) - coeff*c(3,5)
+      r(1)   = r(1)   - coeff*r(3)
+
+      coeff = lhs(2,3)
+      lhs(2,4)= lhs(2,4) - coeff*lhs(3,4)
+      lhs(2,5)= lhs(2,5) - coeff*lhs(3,5)
+      c(2,1) = c(2,1) - coeff*c(3,1)
+      c(2,2) = c(2,2) - coeff*c(3,2)
+      c(2,3) = c(2,3) - coeff*c(3,3)
+      c(2,4) = c(2,4) - coeff*c(3,4)
+      c(2,5) = c(2,5) - coeff*c(3,5)
+      r(2)   = r(2)   - coeff*r(3)
+
+      coeff = lhs(4,3)
+      lhs(4,4)= lhs(4,4) - coeff*lhs(3,4)
+      lhs(4,5)= lhs(4,5) - coeff*lhs(3,5)
+      c(4,1) = c(4,1) - coeff*c(3,1)
+      c(4,2) = c(4,2) - coeff*c(3,2)
+      c(4,3) = c(4,3) - coeff*c(3,3)
+      c(4,4) = c(4,4) - coeff*c(3,4)
+      c(4,5) = c(4,5) - coeff*c(3,5)
+      r(4)   = r(4)   - coeff*r(3)
+
+      coeff = lhs(5,3)
+      lhs(5,4)= lhs(5,4) - coeff*lhs(3,4)
+      lhs(5,5)= lhs(5,5) - coeff*lhs(3,5)
+      c(5,1) = c(5,1) - coeff*c(3,1)
+      c(5,2) = c(5,2) - coeff*c(3,2)
+      c(5,3) = c(5,3) - coeff*c(3,3)
+      c(5,4) = c(5,4) - coeff*c(3,4)
+      c(5,5) = c(5,5) - coeff*c(3,5)
+      r(5)   = r(5)   - coeff*r(3)
+
+
+      pivot = 1.00d0/lhs(4,4)
+      lhs(4,5) = lhs(4,5)*pivot
+      c(4,1) = c(4,1)*pivot
+      c(4,2) = c(4,2)*pivot
+      c(4,3) = c(4,3)*pivot
+      c(4,4) = c(4,4)*pivot
+      c(4,5) = c(4,5)*pivot
+      r(4)   = r(4)  *pivot
+
+      coeff = lhs(1,4)
+      lhs(1,5)= lhs(1,5) - coeff*lhs(4,5)
+      c(1,1) = c(1,1) - coeff*c(4,1)
+      c(1,2) = c(1,2) - coeff*c(4,2)
+      c(1,3) = c(1,3) - coeff*c(4,3)
+      c(1,4) = c(1,4) - coeff*c(4,4)
+      c(1,5) = c(1,5) - coeff*c(4,5)
+      r(1)   = r(1)   - coeff*r(4)
+
+      coeff = lhs(2,4)
+      lhs(2,5)= lhs(2,5) - coeff*lhs(4,5)
+      c(2,1) = c(2,1) - coeff*c(4,1)
+      c(2,2) = c(2,2) - coeff*c(4,2)
+      c(2,3) = c(2,3) - coeff*c(4,3)
+      c(2,4) = c(2,4) - coeff*c(4,4)
+      c(2,5) = c(2,5) - coeff*c(4,5)
+      r(2)   = r(2)   - coeff*r(4)
+
+      coeff = lhs(3,4)
+      lhs(3,5)= lhs(3,5) - coeff*lhs(4,5)
+      c(3,1) = c(3,1) - coeff*c(4,1)
+      c(3,2) = c(3,2) - coeff*c(4,2)
+      c(3,3) = c(3,3) - coeff*c(4,3)
+      c(3,4) = c(3,4) - coeff*c(4,4)
+      c(3,5) = c(3,5) - coeff*c(4,5)
+      r(3)   = r(3)   - coeff*r(4)
+
+      coeff = lhs(5,4)
+      lhs(5,5)= lhs(5,5) - coeff*lhs(4,5)
+      c(5,1) = c(5,1) - coeff*c(4,1)
+      c(5,2) = c(5,2) - coeff*c(4,2)
+      c(5,3) = c(5,3) - coeff*c(4,3)
+      c(5,4) = c(5,4) - coeff*c(4,4)
+      c(5,5) = c(5,5) - coeff*c(4,5)
+      r(5)   = r(5)   - coeff*r(4)
+
+
+      pivot = 1.00d0/lhs(5,5)
+      c(5,1) = c(5,1)*pivot
+      c(5,2) = c(5,2)*pivot
+      c(5,3) = c(5,3)*pivot
+      c(5,4) = c(5,4)*pivot
+      c(5,5) = c(5,5)*pivot
+      r(5)   = r(5)  *pivot
+
+      coeff = lhs(1,5)
+      c(1,1) = c(1,1) - coeff*c(5,1)
+      c(1,2) = c(1,2) - coeff*c(5,2)
+      c(1,3) = c(1,3) - coeff*c(5,3)
+      c(1,4) = c(1,4) - coeff*c(5,4)
+      c(1,5) = c(1,5) - coeff*c(5,5)
+      r(1)   = r(1)   - coeff*r(5)
+
+      coeff = lhs(2,5)
+      c(2,1) = c(2,1) - coeff*c(5,1)
+      c(2,2) = c(2,2) - coeff*c(5,2)
+      c(2,3) = c(2,3) - coeff*c(5,3)
+      c(2,4) = c(2,4) - coeff*c(5,4)
+      c(2,5) = c(2,5) - coeff*c(5,5)
+      r(2)   = r(2)   - coeff*r(5)
+
+      coeff = lhs(3,5)
+      c(3,1) = c(3,1) - coeff*c(5,1)
+      c(3,2) = c(3,2) - coeff*c(5,2)
+      c(3,3) = c(3,3) - coeff*c(5,3)
+      c(3,4) = c(3,4) - coeff*c(5,4)
+      c(3,5) = c(3,5) - coeff*c(5,5)
+      r(3)   = r(3)   - coeff*r(5)
+
+      coeff = lhs(4,5)
+      c(4,1) = c(4,1) - coeff*c(5,1)
+      c(4,2) = c(4,2) - coeff*c(5,2)
+      c(4,3) = c(4,3) - coeff*c(5,3)
+      c(4,4) = c(4,4) - coeff*c(5,4)
+      c(4,5) = c(4,5) - coeff*c(5,5)
+      r(4)   = r(4)   - coeff*r(5)
+
+
+      return
+      end
+
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine binvrhs( lhs,r )
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     overwrites r with inverse(lhs)*r (gauss-jordan, no pivoting)
+!---------------------------------------------------------------------
+
+      implicit none
+
+      double precision pivot, coeff, lhs
+      dimension lhs(5,5)
+      double precision r(5)
+
+!---------------------------------------------------------------------
+!     
+!---------------------------------------------------------------------
+
+
+      pivot = 1.00d0/lhs(1,1)
+      lhs(1,2) = lhs(1,2)*pivot
+      lhs(1,3) = lhs(1,3)*pivot
+      lhs(1,4) = lhs(1,4)*pivot
+      lhs(1,5) = lhs(1,5)*pivot
+      r(1)   = r(1)  *pivot
+
+      coeff = lhs(2,1)
+      lhs(2,2)= lhs(2,2) - coeff*lhs(1,2)
+      lhs(2,3)= lhs(2,3) - coeff*lhs(1,3)
+      lhs(2,4)= lhs(2,4) - coeff*lhs(1,4)
+      lhs(2,5)= lhs(2,5) - coeff*lhs(1,5)
+      r(2)   = r(2)   - coeff*r(1)
+
+      coeff = lhs(3,1)
+      lhs(3,2)= lhs(3,2) - coeff*lhs(1,2)
+      lhs(3,3)= lhs(3,3) - coeff*lhs(1,3)
+      lhs(3,4)= lhs(3,4) - coeff*lhs(1,4)
+      lhs(3,5)= lhs(3,5) - coeff*lhs(1,5)
+      r(3)   = r(3)   - coeff*r(1)
+
+      coeff = lhs(4,1)
+      lhs(4,2)= lhs(4,2) - coeff*lhs(1,2)
+      lhs(4,3)= lhs(4,3) - coeff*lhs(1,3)
+      lhs(4,4)= lhs(4,4) - coeff*lhs(1,4)
+      lhs(4,5)= lhs(4,5) - coeff*lhs(1,5)
+      r(4)   = r(4)   - coeff*r(1)
+
+      coeff = lhs(5,1)
+      lhs(5,2)= lhs(5,2) - coeff*lhs(1,2)
+      lhs(5,3)= lhs(5,3) - coeff*lhs(1,3)
+      lhs(5,4)= lhs(5,4) - coeff*lhs(1,4)
+      lhs(5,5)= lhs(5,5) - coeff*lhs(1,5)
+      r(5)   = r(5)   - coeff*r(1)
+
+
+      pivot = 1.00d0/lhs(2,2)
+      lhs(2,3) = lhs(2,3)*pivot
+      lhs(2,4) = lhs(2,4)*pivot
+      lhs(2,5) = lhs(2,5)*pivot
+      r(2)   = r(2)  *pivot
+
+      coeff = lhs(1,2)
+      lhs(1,3)= lhs(1,3) - coeff*lhs(2,3)
+      lhs(1,4)= lhs(1,4) - coeff*lhs(2,4)
+      lhs(1,5)= lhs(1,5) - coeff*lhs(2,5)
+      r(1)   = r(1)   - coeff*r(2)
+
+      coeff = lhs(3,2)
+      lhs(3,3)= lhs(3,3) - coeff*lhs(2,3)
+      lhs(3,4)= lhs(3,4) - coeff*lhs(2,4)
+      lhs(3,5)= lhs(3,5) - coeff*lhs(2,5)
+      r(3)   = r(3)   - coeff*r(2)
+
+      coeff = lhs(4,2)
+      lhs(4,3)= lhs(4,3) - coeff*lhs(2,3)
+      lhs(4,4)= lhs(4,4) - coeff*lhs(2,4)
+      lhs(4,5)= lhs(4,5) - coeff*lhs(2,5)
+      r(4)   = r(4)   - coeff*r(2)
+
+      coeff = lhs(5,2)
+      lhs(5,3)= lhs(5,3) - coeff*lhs(2,3)
+      lhs(5,4)= lhs(5,4) - coeff*lhs(2,4)
+      lhs(5,5)= lhs(5,5) - coeff*lhs(2,5)
+      r(5)   = r(5)   - coeff*r(2)
+
+
+      pivot = 1.00d0/lhs(3,3)
+      lhs(3,4) = lhs(3,4)*pivot
+      lhs(3,5) = lhs(3,5)*pivot
+      r(3)   = r(3)  *pivot
+
+      coeff = lhs(1,3)
+      lhs(1,4)= lhs(1,4) - coeff*lhs(3,4)
+      lhs(1,5)= lhs(1,5) - coeff*lhs(3,5)
+      r(1)   = r(1)   - coeff*r(3)
+
+      coeff = lhs(2,3)
+      lhs(2,4)= lhs(2,4) - coeff*lhs(3,4)
+      lhs(2,5)= lhs(2,5) - coeff*lhs(3,5)
+      r(2)   = r(2)   - coeff*r(3)
+
+      coeff = lhs(4,3)
+      lhs(4,4)= lhs(4,4) - coeff*lhs(3,4)
+      lhs(4,5)= lhs(4,5) - coeff*lhs(3,5)
+      r(4)   = r(4)   - coeff*r(3)
+
+      coeff = lhs(5,3)
+      lhs(5,4)= lhs(5,4) - coeff*lhs(3,4)
+      lhs(5,5)= lhs(5,5) - coeff*lhs(3,5)
+      r(5)   = r(5)   - coeff*r(3)
+
+
+      pivot = 1.00d0/lhs(4,4)
+      lhs(4,5) = lhs(4,5)*pivot
+      r(4)   = r(4)  *pivot
+
+      coeff = lhs(1,4)
+      lhs(1,5)= lhs(1,5) - coeff*lhs(4,5)
+      r(1)   = r(1)   - coeff*r(4)
+
+      coeff = lhs(2,4)
+      lhs(2,5)= lhs(2,5) - coeff*lhs(4,5)
+      r(2)   = r(2)   - coeff*r(4)
+
+      coeff = lhs(3,4)
+      lhs(3,5)= lhs(3,5) - coeff*lhs(4,5)
+      r(3)   = r(3)   - coeff*r(4)
+
+      coeff = lhs(5,4)
+      lhs(5,5)= lhs(5,5) - coeff*lhs(4,5)
+      r(5)   = r(5)   - coeff*r(4)
+
+
+      pivot = 1.00d0/lhs(5,5)
+      r(5)   = r(5)  *pivot
+
+      coeff = lhs(1,5)
+      r(1)   = r(1)   - coeff*r(5)
+
+      coeff = lhs(2,5)
+      r(2)   = r(2)   - coeff*r(5)
+
+      coeff = lhs(3,5)
+      r(3)   = r(3)   - coeff*r(5)
+
+      coeff = lhs(4,5)
+      r(4)   = r(4)   - coeff*r(5)
+
+
+      return
+      end
+
+
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/BT/solve_subs_blk.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/BT/solve_subs_blk.f90
new file mode 100644
index 000000000..8dc3d801b
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/BT/solve_subs_blk.f90
@@ -0,0 +1,665 @@
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine matvec_sub(ablock,avec,bvec)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     subtracts bvec=bvec - ablock*avec
+!---------------------------------------------------------------------
+
+      implicit none
+      include 'blk_par.h'
+
+      double precision ablock,avec,bvec
+      dimension ablock(blkdim,5,5),avec(blkdim,5),bvec(blkdim,5)
+
+      integer i
+
+!---------------------------------------------------------------------
+!            for each block i = 1..bsize:
+!               bvec(i,1:5) = bvec(i,1:5) - ablock(i,1:5,1:5)*avec(i,1:5)
+!---------------------------------------------------------------------
+!dir$ vector always
+      do i = 1, bsize
+         bvec(i,1) = bvec(i,1) - ablock(i,1,1)*avec(i,1)  &
+     &                     - ablock(i,1,2)*avec(i,2)  &
+     &                     - ablock(i,1,3)*avec(i,3)  &
+     &                     - ablock(i,1,4)*avec(i,4)  &
+     &                     - ablock(i,1,5)*avec(i,5)
+         bvec(i,2) = bvec(i,2) - ablock(i,2,1)*avec(i,1)  &
+     &                     - ablock(i,2,2)*avec(i,2)  &
+     &                     - ablock(i,2,3)*avec(i,3)  &
+     &                     - ablock(i,2,4)*avec(i,4)  &
+     &                     - ablock(i,2,5)*avec(i,5)
+         bvec(i,3) = bvec(i,3) - ablock(i,3,1)*avec(i,1)  &
+     &                     - ablock(i,3,2)*avec(i,2)  &
+     &                     - ablock(i,3,3)*avec(i,3)  &
+     &                     - ablock(i,3,4)*avec(i,4)  &
+     &                     - ablock(i,3,5)*avec(i,5)
+         bvec(i,4) = bvec(i,4) - ablock(i,4,1)*avec(i,1)  &
+     &                     - ablock(i,4,2)*avec(i,2)  &
+     &                     - ablock(i,4,3)*avec(i,3)  &
+     &                     - ablock(i,4,4)*avec(i,4)  &
+     &                     - ablock(i,4,5)*avec(i,5)
+         bvec(i,5) = bvec(i,5) - ablock(i,5,1)*avec(i,1)  &
+     &                     - ablock(i,5,2)*avec(i,2)  &
+     &                     - ablock(i,5,3)*avec(i,3)  &
+     &                     - ablock(i,5,4)*avec(i,4)  &
+     &                     - ablock(i,5,5)*avec(i,5)
+       end do
+
+
+      return
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine matmul_sub(ablock, bblock, cblock)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     for each block i, subtracts ablock(i,:,:) X bblock(i,:,:) from cblock(i,:,:)
+!---------------------------------------------------------------------
+
+      implicit none
+      include 'blk_par.h'
+
+      double precision ablock, bblock, cblock
+      dimension ablock(blkdim,5,5), bblock(blkdim,5,5),   &
+     &          cblock(blkdim,5,5)
+
+      integer i
+
+!dir$ vector always
+      do i = 1, bsize
+         cblock(i,1,1) = cblock(i,1,1) - ablock(i,1,1)*bblock(i,1,1)  &
+     &                             - ablock(i,1,2)*bblock(i,2,1)  &
+     &                             - ablock(i,1,3)*bblock(i,3,1)  &
+     &                             - ablock(i,1,4)*bblock(i,4,1)  &
+     &                             - ablock(i,1,5)*bblock(i,5,1)
+         cblock(i,2,1) = cblock(i,2,1) - ablock(i,2,1)*bblock(i,1,1)  &
+     &                             - ablock(i,2,2)*bblock(i,2,1)  &
+     &                             - ablock(i,2,3)*bblock(i,3,1)  &
+     &                             - ablock(i,2,4)*bblock(i,4,1)  &
+     &                             - ablock(i,2,5)*bblock(i,5,1)
+         cblock(i,3,1) = cblock(i,3,1) - ablock(i,3,1)*bblock(i,1,1)  &
+     &                             - ablock(i,3,2)*bblock(i,2,1)  &
+     &                             - ablock(i,3,3)*bblock(i,3,1)  &
+     &                             - ablock(i,3,4)*bblock(i,4,1)  &
+     &                             - ablock(i,3,5)*bblock(i,5,1)
+         cblock(i,4,1) = cblock(i,4,1) - ablock(i,4,1)*bblock(i,1,1)  &
+     &                             - ablock(i,4,2)*bblock(i,2,1)  &
+     &                             - ablock(i,4,3)*bblock(i,3,1)  &
+     &                             - ablock(i,4,4)*bblock(i,4,1)  &
+     &                             - ablock(i,4,5)*bblock(i,5,1)
+         cblock(i,5,1) = cblock(i,5,1) - ablock(i,5,1)*bblock(i,1,1)  &
+     &                             - ablock(i,5,2)*bblock(i,2,1)  &
+     &                             - ablock(i,5,3)*bblock(i,3,1)  &
+     &                             - ablock(i,5,4)*bblock(i,4,1)  &
+     &                             - ablock(i,5,5)*bblock(i,5,1)
+         cblock(i,1,2) = cblock(i,1,2) - ablock(i,1,1)*bblock(i,1,2)  &
+     &                             - ablock(i,1,2)*bblock(i,2,2)  &
+     &                             - ablock(i,1,3)*bblock(i,3,2)  &
+     &                             - ablock(i,1,4)*bblock(i,4,2)  &
+     &                             - ablock(i,1,5)*bblock(i,5,2)
+         cblock(i,2,2) = cblock(i,2,2) - ablock(i,2,1)*bblock(i,1,2)  &
+     &                             - ablock(i,2,2)*bblock(i,2,2)  &
+     &                             - ablock(i,2,3)*bblock(i,3,2)  &
+     &                             - ablock(i,2,4)*bblock(i,4,2)  &
+     &                             - ablock(i,2,5)*bblock(i,5,2)
+         cblock(i,3,2) = cblock(i,3,2) - ablock(i,3,1)*bblock(i,1,2)  &
+     &                             - ablock(i,3,2)*bblock(i,2,2)  &
+     &                             - ablock(i,3,3)*bblock(i,3,2)  &
+     &                             - ablock(i,3,4)*bblock(i,4,2)  &
+     &                             - ablock(i,3,5)*bblock(i,5,2)
+         cblock(i,4,2) = cblock(i,4,2) - ablock(i,4,1)*bblock(i,1,2)  &
+     &                             - ablock(i,4,2)*bblock(i,2,2)  &
+     &                             - ablock(i,4,3)*bblock(i,3,2)  &
+     &                             - ablock(i,4,4)*bblock(i,4,2)  &
+     &                             - ablock(i,4,5)*bblock(i,5,2)
+         cblock(i,5,2) = cblock(i,5,2) - ablock(i,5,1)*bblock(i,1,2)  &
+     &                             - ablock(i,5,2)*bblock(i,2,2)  &
+     &                             - ablock(i,5,3)*bblock(i,3,2)  &
+     &                             - ablock(i,5,4)*bblock(i,4,2)  &
+     &                             - ablock(i,5,5)*bblock(i,5,2)
+         cblock(i,1,3) = cblock(i,1,3) - ablock(i,1,1)*bblock(i,1,3)  &
+     &                             - ablock(i,1,2)*bblock(i,2,3)  &
+     &                             - ablock(i,1,3)*bblock(i,3,3)  &
+     &                             - ablock(i,1,4)*bblock(i,4,3)  &
+     &                             - ablock(i,1,5)*bblock(i,5,3)
+         cblock(i,2,3) = cblock(i,2,3) - ablock(i,2,1)*bblock(i,1,3)  &
+     &                             - ablock(i,2,2)*bblock(i,2,3)  &
+     &                             - ablock(i,2,3)*bblock(i,3,3)  &
+     &                             - ablock(i,2,4)*bblock(i,4,3)  &
+     &                             - ablock(i,2,5)*bblock(i,5,3)
+         cblock(i,3,3) = cblock(i,3,3) - ablock(i,3,1)*bblock(i,1,3)  &
+     &                             - ablock(i,3,2)*bblock(i,2,3)  &
+     &                             - ablock(i,3,3)*bblock(i,3,3)  &
+     &                             - ablock(i,3,4)*bblock(i,4,3)  &
+     &                             - ablock(i,3,5)*bblock(i,5,3)
+         cblock(i,4,3) = cblock(i,4,3) - ablock(i,4,1)*bblock(i,1,3)  &
+     &                             - ablock(i,4,2)*bblock(i,2,3)  &
+     &                             - ablock(i,4,3)*bblock(i,3,3)  &
+     &                             - ablock(i,4,4)*bblock(i,4,3)  &
+     &                             - ablock(i,4,5)*bblock(i,5,3)
+         cblock(i,5,3) = cblock(i,5,3) - ablock(i,5,1)*bblock(i,1,3)  &
+     &                             - ablock(i,5,2)*bblock(i,2,3)  &
+     &                             - ablock(i,5,3)*bblock(i,3,3)  &
+     &                             - ablock(i,5,4)*bblock(i,4,3)  &
+     &                             - ablock(i,5,5)*bblock(i,5,3)
+         cblock(i,1,4) = cblock(i,1,4) - ablock(i,1,1)*bblock(i,1,4)  &
+     &                             - ablock(i,1,2)*bblock(i,2,4)  &
+     &                             - ablock(i,1,3)*bblock(i,3,4)  &
+     &                             - ablock(i,1,4)*bblock(i,4,4)  &
+     &                             - ablock(i,1,5)*bblock(i,5,4)
+         cblock(i,2,4) = cblock(i,2,4) - ablock(i,2,1)*bblock(i,1,4)  &
+     &                             - ablock(i,2,2)*bblock(i,2,4)  &
+     &                             - ablock(i,2,3)*bblock(i,3,4)  &
+     &                             - ablock(i,2,4)*bblock(i,4,4)  &
+     &                             - ablock(i,2,5)*bblock(i,5,4)
+         cblock(i,3,4) = cblock(i,3,4) - ablock(i,3,1)*bblock(i,1,4)  &
+     &                             - ablock(i,3,2)*bblock(i,2,4)  &
+     &                             - ablock(i,3,3)*bblock(i,3,4)  &
+     &                             - ablock(i,3,4)*bblock(i,4,4)  &
+     &                             - ablock(i,3,5)*bblock(i,5,4)
+         cblock(i,4,4) = cblock(i,4,4) - ablock(i,4,1)*bblock(i,1,4)  &
+     &                             - ablock(i,4,2)*bblock(i,2,4)  &
+     &                             - ablock(i,4,3)*bblock(i,3,4)  &
+     &                             - ablock(i,4,4)*bblock(i,4,4)  &
+     &                             - ablock(i,4,5)*bblock(i,5,4)
+         cblock(i,5,4) = cblock(i,5,4) - ablock(i,5,1)*bblock(i,1,4)  &
+     &                             - ablock(i,5,2)*bblock(i,2,4)  &
+     &                             - ablock(i,5,3)*bblock(i,3,4)  &
+     &                             - ablock(i,5,4)*bblock(i,4,4)  &
+     &                             - ablock(i,5,5)*bblock(i,5,4)
+         cblock(i,1,5) = cblock(i,1,5) - ablock(i,1,1)*bblock(i,1,5)  &
+     &                             - ablock(i,1,2)*bblock(i,2,5)  &
+     &                             - ablock(i,1,3)*bblock(i,3,5)  &
+     &                             - ablock(i,1,4)*bblock(i,4,5)  &
+     &                             - ablock(i,1,5)*bblock(i,5,5)
+         cblock(i,2,5) = cblock(i,2,5) - ablock(i,2,1)*bblock(i,1,5)  &
+     &                             - ablock(i,2,2)*bblock(i,2,5)  &
+     &                             - ablock(i,2,3)*bblock(i,3,5)  &
+     &                             - ablock(i,2,4)*bblock(i,4,5)  &
+     &                             - ablock(i,2,5)*bblock(i,5,5)
+         cblock(i,3,5) = cblock(i,3,5) - ablock(i,3,1)*bblock(i,1,5)  &
+     &                             - ablock(i,3,2)*bblock(i,2,5)  &
+     &                             - ablock(i,3,3)*bblock(i,3,5)  &
+     &                             - ablock(i,3,4)*bblock(i,4,5)  &
+     &                             - ablock(i,3,5)*bblock(i,5,5)
+         cblock(i,4,5) = cblock(i,4,5) - ablock(i,4,1)*bblock(i,1,5)  &
+     &                             - ablock(i,4,2)*bblock(i,2,5)  &
+     &                             - ablock(i,4,3)*bblock(i,3,5)  &
+     &                             - ablock(i,4,4)*bblock(i,4,5)  &
+     &                             - ablock(i,4,5)*bblock(i,5,5)
+         cblock(i,5,5) = cblock(i,5,5) - ablock(i,5,1)*bblock(i,1,5)  &
+     &                             - ablock(i,5,2)*bblock(i,2,5)  &
+     &                             - ablock(i,5,3)*bblock(i,3,5)  &
+     &                             - ablock(i,5,4)*bblock(i,4,5)  &
+     &                             - ablock(i,5,5)*bblock(i,5,5)
+      end do
+              
+      return
+      end
+
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine binvcrhs( lhs,c,r )
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     per block i: c(i,:,:) <- inverse(lhs(i,:,:))*c(i,:,:), likewise r(i,:)
+!---------------------------------------------------------------------
+
+      implicit none
+      include 'blk_par.h'
+
+      double precision pivot, coeff, lhs
+      dimension lhs(blkdim,5,5)
+      double precision c(blkdim,5,5), r(blkdim,5)
+
+      integer i
+
+!---------------------------------------------------------------------
+!     
+!---------------------------------------------------------------------
+
+!dir$ vector always
+      do i = 1, bsize
+      pivot = 1.00d0/lhs(i,1,1)
+      lhs(i,1,2) = lhs(i,1,2)*pivot
+      lhs(i,1,3) = lhs(i,1,3)*pivot
+      lhs(i,1,4) = lhs(i,1,4)*pivot
+      lhs(i,1,5) = lhs(i,1,5)*pivot
+      c(i,1,1) = c(i,1,1)*pivot
+      c(i,1,2) = c(i,1,2)*pivot
+      c(i,1,3) = c(i,1,3)*pivot
+      c(i,1,4) = c(i,1,4)*pivot
+      c(i,1,5) = c(i,1,5)*pivot
+      r(i,1)   = r(i,1)  *pivot
+
+      coeff = lhs(i,2,1)
+      lhs(i,2,2)= lhs(i,2,2) - coeff*lhs(i,1,2)
+      lhs(i,2,3)= lhs(i,2,3) - coeff*lhs(i,1,3)
+      lhs(i,2,4)= lhs(i,2,4) - coeff*lhs(i,1,4)
+      lhs(i,2,5)= lhs(i,2,5) - coeff*lhs(i,1,5)
+      c(i,2,1) = c(i,2,1) - coeff*c(i,1,1)
+      c(i,2,2) = c(i,2,2) - coeff*c(i,1,2)
+      c(i,2,3) = c(i,2,3) - coeff*c(i,1,3)
+      c(i,2,4) = c(i,2,4) - coeff*c(i,1,4)
+      c(i,2,5) = c(i,2,5) - coeff*c(i,1,5)
+      r(i,2)   = r(i,2)   - coeff*r(i,1)
+
+      coeff = lhs(i,3,1)
+      lhs(i,3,2)= lhs(i,3,2) - coeff*lhs(i,1,2)
+      lhs(i,3,3)= lhs(i,3,3) - coeff*lhs(i,1,3)
+      lhs(i,3,4)= lhs(i,3,4) - coeff*lhs(i,1,4)
+      lhs(i,3,5)= lhs(i,3,5) - coeff*lhs(i,1,5)
+      c(i,3,1) = c(i,3,1) - coeff*c(i,1,1)
+      c(i,3,2) = c(i,3,2) - coeff*c(i,1,2)
+      c(i,3,3) = c(i,3,3) - coeff*c(i,1,3)
+      c(i,3,4) = c(i,3,4) - coeff*c(i,1,4)
+      c(i,3,5) = c(i,3,5) - coeff*c(i,1,5)
+      r(i,3)   = r(i,3)   - coeff*r(i,1)
+
+      coeff = lhs(i,4,1)
+      lhs(i,4,2)= lhs(i,4,2) - coeff*lhs(i,1,2)
+      lhs(i,4,3)= lhs(i,4,3) - coeff*lhs(i,1,3)
+      lhs(i,4,4)= lhs(i,4,4) - coeff*lhs(i,1,4)
+      lhs(i,4,5)= lhs(i,4,5) - coeff*lhs(i,1,5)
+      c(i,4,1) = c(i,4,1) - coeff*c(i,1,1)
+      c(i,4,2) = c(i,4,2) - coeff*c(i,1,2)
+      c(i,4,3) = c(i,4,3) - coeff*c(i,1,3)
+      c(i,4,4) = c(i,4,4) - coeff*c(i,1,4)
+      c(i,4,5) = c(i,4,5) - coeff*c(i,1,5)
+      r(i,4)   = r(i,4)   - coeff*r(i,1)
+
+      coeff = lhs(i,5,1)
+      lhs(i,5,2)= lhs(i,5,2) - coeff*lhs(i,1,2)
+      lhs(i,5,3)= lhs(i,5,3) - coeff*lhs(i,1,3)
+      lhs(i,5,4)= lhs(i,5,4) - coeff*lhs(i,1,4)
+      lhs(i,5,5)= lhs(i,5,5) - coeff*lhs(i,1,5)
+      c(i,5,1) = c(i,5,1) - coeff*c(i,1,1)
+      c(i,5,2) = c(i,5,2) - coeff*c(i,1,2)
+      c(i,5,3) = c(i,5,3) - coeff*c(i,1,3)
+      c(i,5,4) = c(i,5,4) - coeff*c(i,1,4)
+      c(i,5,5) = c(i,5,5) - coeff*c(i,1,5)
+      r(i,5)   = r(i,5)   - coeff*r(i,1)
+
+
+      pivot = 1.00d0/lhs(i,2,2)
+      lhs(i,2,3) = lhs(i,2,3)*pivot
+      lhs(i,2,4) = lhs(i,2,4)*pivot
+      lhs(i,2,5) = lhs(i,2,5)*pivot
+      c(i,2,1) = c(i,2,1)*pivot
+      c(i,2,2) = c(i,2,2)*pivot
+      c(i,2,3) = c(i,2,3)*pivot
+      c(i,2,4) = c(i,2,4)*pivot
+      c(i,2,5) = c(i,2,5)*pivot
+      r(i,2)   = r(i,2)  *pivot
+
+      coeff = lhs(i,1,2)
+      lhs(i,1,3)= lhs(i,1,3) - coeff*lhs(i,2,3)
+      lhs(i,1,4)= lhs(i,1,4) - coeff*lhs(i,2,4)
+      lhs(i,1,5)= lhs(i,1,5) - coeff*lhs(i,2,5)
+      c(i,1,1) = c(i,1,1) - coeff*c(i,2,1)
+      c(i,1,2) = c(i,1,2) - coeff*c(i,2,2)
+      c(i,1,3) = c(i,1,3) - coeff*c(i,2,3)
+      c(i,1,4) = c(i,1,4) - coeff*c(i,2,4)
+      c(i,1,5) = c(i,1,5) - coeff*c(i,2,5)
+      r(i,1)   = r(i,1)   - coeff*r(i,2)
+
+      coeff = lhs(i,3,2)
+      lhs(i,3,3)= lhs(i,3,3) - coeff*lhs(i,2,3)
+      lhs(i,3,4)= lhs(i,3,4) - coeff*lhs(i,2,4)
+      lhs(i,3,5)= lhs(i,3,5) - coeff*lhs(i,2,5)
+      c(i,3,1) = c(i,3,1) - coeff*c(i,2,1)
+      c(i,3,2) = c(i,3,2) - coeff*c(i,2,2)
+      c(i,3,3) = c(i,3,3) - coeff*c(i,2,3)
+      c(i,3,4) = c(i,3,4) - coeff*c(i,2,4)
+      c(i,3,5) = c(i,3,5) - coeff*c(i,2,5)
+      r(i,3)   = r(i,3)   - coeff*r(i,2)
+
+      coeff = lhs(i,4,2)
+      lhs(i,4,3)= lhs(i,4,3) - coeff*lhs(i,2,3)
+      lhs(i,4,4)= lhs(i,4,4) - coeff*lhs(i,2,4)
+      lhs(i,4,5)= lhs(i,4,5) - coeff*lhs(i,2,5)
+      c(i,4,1) = c(i,4,1) - coeff*c(i,2,1)
+      c(i,4,2) = c(i,4,2) - coeff*c(i,2,2)
+      c(i,4,3) = c(i,4,3) - coeff*c(i,2,3)
+      c(i,4,4) = c(i,4,4) - coeff*c(i,2,4)
+      c(i,4,5) = c(i,4,5) - coeff*c(i,2,5)
+      r(i,4)   = r(i,4)   - coeff*r(i,2)
+
+      coeff = lhs(i,5,2)
+      lhs(i,5,3)= lhs(i,5,3) - coeff*lhs(i,2,3)
+      lhs(i,5,4)= lhs(i,5,4) - coeff*lhs(i,2,4)
+      lhs(i,5,5)= lhs(i,5,5) - coeff*lhs(i,2,5)
+      c(i,5,1) = c(i,5,1) - coeff*c(i,2,1)
+      c(i,5,2) = c(i,5,2) - coeff*c(i,2,2)
+      c(i,5,3) = c(i,5,3) - coeff*c(i,2,3)
+      c(i,5,4) = c(i,5,4) - coeff*c(i,2,4)
+      c(i,5,5) = c(i,5,5) - coeff*c(i,2,5)
+      r(i,5)   = r(i,5)   - coeff*r(i,2)
+
+
+      pivot = 1.00d0/lhs(i,3,3)
+      lhs(i,3,4) = lhs(i,3,4)*pivot
+      lhs(i,3,5) = lhs(i,3,5)*pivot
+      c(i,3,1) = c(i,3,1)*pivot
+      c(i,3,2) = c(i,3,2)*pivot
+      c(i,3,3) = c(i,3,3)*pivot
+      c(i,3,4) = c(i,3,4)*pivot
+      c(i,3,5) = c(i,3,5)*pivot
+      r(i,3)   = r(i,3)  *pivot
+
+      coeff = lhs(i,1,3)
+      lhs(i,1,4)= lhs(i,1,4) - coeff*lhs(i,3,4)
+      lhs(i,1,5)= lhs(i,1,5) - coeff*lhs(i,3,5)
+      c(i,1,1) = c(i,1,1) - coeff*c(i,3,1)
+      c(i,1,2) = c(i,1,2) - coeff*c(i,3,2)
+      c(i,1,3) = c(i,1,3) - coeff*c(i,3,3)
+      c(i,1,4) = c(i,1,4) - coeff*c(i,3,4)
+      c(i,1,5) = c(i,1,5) - coeff*c(i,3,5)
+      r(i,1)   = r(i,1)   - coeff*r(i,3)
+
+      coeff = lhs(i,2,3)
+      lhs(i,2,4)= lhs(i,2,4) - coeff*lhs(i,3,4)
+      lhs(i,2,5)= lhs(i,2,5) - coeff*lhs(i,3,5)
+      c(i,2,1) = c(i,2,1) - coeff*c(i,3,1)
+      c(i,2,2) = c(i,2,2) - coeff*c(i,3,2)
+      c(i,2,3) = c(i,2,3) - coeff*c(i,3,3)
+      c(i,2,4) = c(i,2,4) - coeff*c(i,3,4)
+      c(i,2,5) = c(i,2,5) - coeff*c(i,3,5)
+      r(i,2)   = r(i,2)   - coeff*r(i,3)
+
+      coeff = lhs(i,4,3)
+      lhs(i,4,4)= lhs(i,4,4) - coeff*lhs(i,3,4)
+      lhs(i,4,5)= lhs(i,4,5) - coeff*lhs(i,3,5)
+      c(i,4,1) = c(i,4,1) - coeff*c(i,3,1)
+      c(i,4,2) = c(i,4,2) - coeff*c(i,3,2)
+      c(i,4,3) = c(i,4,3) - coeff*c(i,3,3)
+      c(i,4,4) = c(i,4,4) - coeff*c(i,3,4)
+      c(i,4,5) = c(i,4,5) - coeff*c(i,3,5)
+      r(i,4)   = r(i,4)   - coeff*r(i,3)
+
+      coeff = lhs(i,5,3)
+      lhs(i,5,4)= lhs(i,5,4) - coeff*lhs(i,3,4)
+      lhs(i,5,5)= lhs(i,5,5) - coeff*lhs(i,3,5)
+      c(i,5,1) = c(i,5,1) - coeff*c(i,3,1)
+      c(i,5,2) = c(i,5,2) - coeff*c(i,3,2)
+      c(i,5,3) = c(i,5,3) - coeff*c(i,3,3)
+      c(i,5,4) = c(i,5,4) - coeff*c(i,3,4)
+      c(i,5,5) = c(i,5,5) - coeff*c(i,3,5)
+      r(i,5)   = r(i,5)   - coeff*r(i,3)
+
+
+      pivot = 1.00d0/lhs(i,4,4)
+      lhs(i,4,5) = lhs(i,4,5)*pivot
+      c(i,4,1) = c(i,4,1)*pivot
+      c(i,4,2) = c(i,4,2)*pivot
+      c(i,4,3) = c(i,4,3)*pivot
+      c(i,4,4) = c(i,4,4)*pivot
+      c(i,4,5) = c(i,4,5)*pivot
+      r(i,4)   = r(i,4)  *pivot
+
+      coeff = lhs(i,1,4)
+      lhs(i,1,5)= lhs(i,1,5) - coeff*lhs(i,4,5)
+      c(i,1,1) = c(i,1,1) - coeff*c(i,4,1)
+      c(i,1,2) = c(i,1,2) - coeff*c(i,4,2)
+      c(i,1,3) = c(i,1,3) - coeff*c(i,4,3)
+      c(i,1,4) = c(i,1,4) - coeff*c(i,4,4)
+      c(i,1,5) = c(i,1,5) - coeff*c(i,4,5)
+      r(i,1)   = r(i,1)   - coeff*r(i,4)
+
+      coeff = lhs(i,2,4)
+      lhs(i,2,5)= lhs(i,2,5) - coeff*lhs(i,4,5)
+      c(i,2,1) = c(i,2,1) - coeff*c(i,4,1)
+      c(i,2,2) = c(i,2,2) - coeff*c(i,4,2)
+      c(i,2,3) = c(i,2,3) - coeff*c(i,4,3)
+      c(i,2,4) = c(i,2,4) - coeff*c(i,4,4)
+      c(i,2,5) = c(i,2,5) - coeff*c(i,4,5)
+      r(i,2)   = r(i,2)   - coeff*r(i,4)
+
+      coeff = lhs(i,3,4)
+      lhs(i,3,5)= lhs(i,3,5) - coeff*lhs(i,4,5)
+      c(i,3,1) = c(i,3,1) - coeff*c(i,4,1)
+      c(i,3,2) = c(i,3,2) - coeff*c(i,4,2)
+      c(i,3,3) = c(i,3,3) - coeff*c(i,4,3)
+      c(i,3,4) = c(i,3,4) - coeff*c(i,4,4)
+      c(i,3,5) = c(i,3,5) - coeff*c(i,4,5)
+      r(i,3)   = r(i,3)   - coeff*r(i,4)
+
+      coeff = lhs(i,5,4)
+      lhs(i,5,5)= lhs(i,5,5) - coeff*lhs(i,4,5)
+      c(i,5,1) = c(i,5,1) - coeff*c(i,4,1)
+      c(i,5,2) = c(i,5,2) - coeff*c(i,4,2)
+      c(i,5,3) = c(i,5,3) - coeff*c(i,4,3)
+      c(i,5,4) = c(i,5,4) - coeff*c(i,4,4)
+      c(i,5,5) = c(i,5,5) - coeff*c(i,4,5)
+      r(i,5)   = r(i,5)   - coeff*r(i,4)
+
+
+      pivot = 1.00d0/lhs(i,5,5)
+      c(i,5,1) = c(i,5,1)*pivot
+      c(i,5,2) = c(i,5,2)*pivot
+      c(i,5,3) = c(i,5,3)*pivot
+      c(i,5,4) = c(i,5,4)*pivot
+      c(i,5,5) = c(i,5,5)*pivot
+      r(i,5)   = r(i,5)  *pivot
+
+      coeff = lhs(i,1,5)
+      c(i,1,1) = c(i,1,1) - coeff*c(i,5,1)
+      c(i,1,2) = c(i,1,2) - coeff*c(i,5,2)
+      c(i,1,3) = c(i,1,3) - coeff*c(i,5,3)
+      c(i,1,4) = c(i,1,4) - coeff*c(i,5,4)
+      c(i,1,5) = c(i,1,5) - coeff*c(i,5,5)
+      r(i,1)   = r(i,1)   - coeff*r(i,5)
+
+      coeff = lhs(i,2,5)
+      c(i,2,1) = c(i,2,1) - coeff*c(i,5,1)
+      c(i,2,2) = c(i,2,2) - coeff*c(i,5,2)
+      c(i,2,3) = c(i,2,3) - coeff*c(i,5,3)
+      c(i,2,4) = c(i,2,4) - coeff*c(i,5,4)
+      c(i,2,5) = c(i,2,5) - coeff*c(i,5,5)
+      r(i,2)   = r(i,2)   - coeff*r(i,5)
+
+      coeff = lhs(i,3,5)
+      c(i,3,1) = c(i,3,1) - coeff*c(i,5,1)
+      c(i,3,2) = c(i,3,2) - coeff*c(i,5,2)
+      c(i,3,3) = c(i,3,3) - coeff*c(i,5,3)
+      c(i,3,4) = c(i,3,4) - coeff*c(i,5,4)
+      c(i,3,5) = c(i,3,5) - coeff*c(i,5,5)
+      r(i,3)   = r(i,3)   - coeff*r(i,5)
+
+      coeff = lhs(i,4,5)
+      c(i,4,1) = c(i,4,1) - coeff*c(i,5,1)
+      c(i,4,2) = c(i,4,2) - coeff*c(i,5,2)
+      c(i,4,3) = c(i,4,3) - coeff*c(i,5,3)
+      c(i,4,4) = c(i,4,4) - coeff*c(i,5,4)
+      c(i,4,5) = c(i,4,5) - coeff*c(i,5,5)
+      r(i,4)   = r(i,4)   - coeff*r(i,5)
+      end do
+
+
+      return
+      end
+
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine binvrhs( lhs,r )
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     Gauss-Jordan elimination (no pivoting) per 5x5 block: r := lhs**(-1) * r
+!---------------------------------------------------------------------
+
+      implicit none
+      include 'blk_par.h'
+
+      double precision pivot, coeff, lhs
+      dimension lhs(blkdim,5,5)
+      double precision r(blkdim,5)
+
+      integer i
+
+!---------------------------------------------------------------------
+!     
+!---------------------------------------------------------------------
+
+!dir$ vector always
+      do i = 1, bsize
+
+      pivot = 1.00d0/lhs(i,1,1)
+      lhs(i,1,2) = lhs(i,1,2)*pivot
+      lhs(i,1,3) = lhs(i,1,3)*pivot
+      lhs(i,1,4) = lhs(i,1,4)*pivot
+      lhs(i,1,5) = lhs(i,1,5)*pivot
+      r(i,1)   = r(i,1)  *pivot
+
+      coeff = lhs(i,2,1)
+      lhs(i,2,2)= lhs(i,2,2) - coeff*lhs(i,1,2)
+      lhs(i,2,3)= lhs(i,2,3) - coeff*lhs(i,1,3)
+      lhs(i,2,4)= lhs(i,2,4) - coeff*lhs(i,1,4)
+      lhs(i,2,5)= lhs(i,2,5) - coeff*lhs(i,1,5)
+      r(i,2)   = r(i,2)   - coeff*r(i,1)
+
+      coeff = lhs(i,3,1)
+      lhs(i,3,2)= lhs(i,3,2) - coeff*lhs(i,1,2)
+      lhs(i,3,3)= lhs(i,3,3) - coeff*lhs(i,1,3)
+      lhs(i,3,4)= lhs(i,3,4) - coeff*lhs(i,1,4)
+      lhs(i,3,5)= lhs(i,3,5) - coeff*lhs(i,1,5)
+      r(i,3)   = r(i,3)   - coeff*r(i,1)
+
+      coeff = lhs(i,4,1)
+      lhs(i,4,2)= lhs(i,4,2) - coeff*lhs(i,1,2)
+      lhs(i,4,3)= lhs(i,4,3) - coeff*lhs(i,1,3)
+      lhs(i,4,4)= lhs(i,4,4) - coeff*lhs(i,1,4)
+      lhs(i,4,5)= lhs(i,4,5) - coeff*lhs(i,1,5)
+      r(i,4)   = r(i,4)   - coeff*r(i,1)
+
+      coeff = lhs(i,5,1)
+      lhs(i,5,2)= lhs(i,5,2) - coeff*lhs(i,1,2)
+      lhs(i,5,3)= lhs(i,5,3) - coeff*lhs(i,1,3)
+      lhs(i,5,4)= lhs(i,5,4) - coeff*lhs(i,1,4)
+      lhs(i,5,5)= lhs(i,5,5) - coeff*lhs(i,1,5)
+      r(i,5)   = r(i,5)   - coeff*r(i,1)
+
+
+      pivot = 1.00d0/lhs(i,2,2)
+      lhs(i,2,3) = lhs(i,2,3)*pivot
+      lhs(i,2,4) = lhs(i,2,4)*pivot
+      lhs(i,2,5) = lhs(i,2,5)*pivot
+      r(i,2)   = r(i,2)  *pivot
+
+      coeff = lhs(i,1,2)
+      lhs(i,1,3)= lhs(i,1,3) - coeff*lhs(i,2,3)
+      lhs(i,1,4)= lhs(i,1,4) - coeff*lhs(i,2,4)
+      lhs(i,1,5)= lhs(i,1,5) - coeff*lhs(i,2,5)
+      r(i,1)   = r(i,1)   - coeff*r(i,2)
+
+      coeff = lhs(i,3,2)
+      lhs(i,3,3)= lhs(i,3,3) - coeff*lhs(i,2,3)
+      lhs(i,3,4)= lhs(i,3,4) - coeff*lhs(i,2,4)
+      lhs(i,3,5)= lhs(i,3,5) - coeff*lhs(i,2,5)
+      r(i,3)   = r(i,3)   - coeff*r(i,2)
+
+      coeff = lhs(i,4,2)
+      lhs(i,4,3)= lhs(i,4,3) - coeff*lhs(i,2,3)
+      lhs(i,4,4)= lhs(i,4,4) - coeff*lhs(i,2,4)
+      lhs(i,4,5)= lhs(i,4,5) - coeff*lhs(i,2,5)
+      r(i,4)   = r(i,4)   - coeff*r(i,2)
+
+      coeff = lhs(i,5,2)
+      lhs(i,5,3)= lhs(i,5,3) - coeff*lhs(i,2,3)
+      lhs(i,5,4)= lhs(i,5,4) - coeff*lhs(i,2,4)
+      lhs(i,5,5)= lhs(i,5,5) - coeff*lhs(i,2,5)
+      r(i,5)   = r(i,5)   - coeff*r(i,2)
+
+
+      pivot = 1.00d0/lhs(i,3,3)
+      lhs(i,3,4) = lhs(i,3,4)*pivot
+      lhs(i,3,5) = lhs(i,3,5)*pivot
+      r(i,3)   = r(i,3)  *pivot
+
+      coeff = lhs(i,1,3)
+      lhs(i,1,4)= lhs(i,1,4) - coeff*lhs(i,3,4)
+      lhs(i,1,5)= lhs(i,1,5) - coeff*lhs(i,3,5)
+      r(i,1)   = r(i,1)   - coeff*r(i,3)
+
+      coeff = lhs(i,2,3)
+      lhs(i,2,4)= lhs(i,2,4) - coeff*lhs(i,3,4)
+      lhs(i,2,5)= lhs(i,2,5) - coeff*lhs(i,3,5)
+      r(i,2)   = r(i,2)   - coeff*r(i,3)
+
+      coeff = lhs(i,4,3)
+      lhs(i,4,4)= lhs(i,4,4) - coeff*lhs(i,3,4)
+      lhs(i,4,5)= lhs(i,4,5) - coeff*lhs(i,3,5)
+      r(i,4)   = r(i,4)   - coeff*r(i,3)
+
+      coeff = lhs(i,5,3)
+      lhs(i,5,4)= lhs(i,5,4) - coeff*lhs(i,3,4)
+      lhs(i,5,5)= lhs(i,5,5) - coeff*lhs(i,3,5)
+      r(i,5)   = r(i,5)   - coeff*r(i,3)
+
+
+      pivot = 1.00d0/lhs(i,4,4)
+      lhs(i,4,5) = lhs(i,4,5)*pivot
+      r(i,4)   = r(i,4)  *pivot
+
+      coeff = lhs(i,1,4)
+      lhs(i,1,5)= lhs(i,1,5) - coeff*lhs(i,4,5)
+      r(i,1)   = r(i,1)   - coeff*r(i,4)
+
+      coeff = lhs(i,2,4)
+      lhs(i,2,5)= lhs(i,2,5) - coeff*lhs(i,4,5)
+      r(i,2)   = r(i,2)   - coeff*r(i,4)
+
+      coeff = lhs(i,3,4)
+      lhs(i,3,5)= lhs(i,3,5) - coeff*lhs(i,4,5)
+      r(i,3)   = r(i,3)   - coeff*r(i,4)
+
+      coeff = lhs(i,5,4)
+      lhs(i,5,5)= lhs(i,5,5) - coeff*lhs(i,4,5)
+      r(i,5)   = r(i,5)   - coeff*r(i,4)
+
+
+      pivot = 1.00d0/lhs(i,5,5)
+      r(i,5)   = r(i,5)  *pivot
+
+      coeff = lhs(i,1,5)
+      r(i,1)   = r(i,1)   - coeff*r(i,5)
+
+      coeff = lhs(i,2,5)
+      r(i,2)   = r(i,2)   - coeff*r(i,5)
+
+      coeff = lhs(i,3,5)
+      r(i,3)   = r(i,3)   - coeff*r(i,5)
+
+      coeff = lhs(i,4,5)
+      r(i,4)   = r(i,4)   - coeff*r(i,5)
+      end do
+
+
+      return
+      end
+
+
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/BT/verify.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/BT/verify.f90
new file mode 100644
index 000000000..106c0c9a4
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/BT/verify.f90
@@ -0,0 +1,393 @@
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+        subroutine verify(no_time_steps, class, verified)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!  verification routine                         
+!---------------------------------------------------------------------
+
+        use, intrinsic :: ieee_arithmetic, only : ieee_is_nan
+
+        use bt_data
+
+        implicit none
+
+        double precision xcrref(5),xceref(5),xcrdif(5),xcedif(5),   &
+     &                   epsilon, xce(5), xcr(5), dtref
+        integer m, no_time_steps
+        character class
+        logical verified
+
+!---------------------------------------------------------------------
+!   tolerance level
+!---------------------------------------------------------------------
+        epsilon = 1.0d-08
+
+
+!---------------------------------------------------------------------
+!   compute the error norm and the residual norm
+!---------------------------------------------------------------------
+        call error_norm(xce)
+        call compute_rhs
+
+        call rhs_norm(xcr)
+
+        do m = 1, 5
+           xcr(m) = xcr(m) / dt
+        enddo
+
+
+        class = 'U'
+        verified = .true.
+
+        do m = 1,5
+           xcrref(m) = 1.0
+           xceref(m) = 1.0
+        end do
+
+!---------------------------------------------------------------------
+!    reference data for 12X12X12 grids after 60 time steps, with DT = 1.0d-02
+!---------------------------------------------------------------------
+        if ( (grid_points(1)  .eq. 12     ) .and.   &
+     &       (grid_points(2)  .eq. 12     ) .and.  &
+     &       (grid_points(3)  .eq. 12     ) .and.  &
+     &       (no_time_steps   .eq. 60    ))  then
+
+           class = 'S'
+           dtref = 1.0d-2
+
+!---------------------------------------------------------------------
+!  Reference values of RMS-norms of residual.
+!---------------------------------------------------------------------
+         xcrref(1) = 1.7034283709541311d-01
+         xcrref(2) = 1.2975252070034097d-02
+         xcrref(3) = 3.2527926989486055d-02
+         xcrref(4) = 2.6436421275166801d-02
+         xcrref(5) = 1.9211784131744430d-01
+
+!---------------------------------------------------------------------
+!  Reference values of RMS-norms of solution error.
+!---------------------------------------------------------------------
+         xceref(1) = 4.9976913345811579d-04
+         xceref(2) = 4.5195666782961927d-05
+         xceref(3) = 7.3973765172921357d-05
+         xceref(4) = 7.3821238632439731d-05
+         xceref(5) = 8.9269630987491446d-04
+
+!---------------------------------------------------------------------
+!    reference data for 24X24X24 grids after 200 time steps, with DT = 0.8d-3
+!---------------------------------------------------------------------
+        elseif ( (grid_points(1) .eq. 24) .and.   &
+     &           (grid_points(2) .eq. 24) .and.  &
+     &           (grid_points(3) .eq. 24) .and.  &
+     &           (no_time_steps  .eq. 200) ) then
+
+           class = 'W'
+           dtref = 0.8d-3
+!---------------------------------------------------------------------
+!  Reference values of RMS-norms of residual.
+!---------------------------------------------------------------------
+           xcrref(1) = 0.1125590409344d+03
+           xcrref(2) = 0.1180007595731d+02
+           xcrref(3) = 0.2710329767846d+02
+           xcrref(4) = 0.2469174937669d+02
+           xcrref(5) = 0.2638427874317d+03
+
+!---------------------------------------------------------------------
+!  Reference values of RMS-norms of solution error.
+!---------------------------------------------------------------------
+           xceref(1) = 0.4419655736008d+01
+           xceref(2) = 0.4638531260002d+00
+           xceref(3) = 0.1011551749967d+01
+           xceref(4) = 0.9235878729944d+00
+           xceref(5) = 0.1018045837718d+02
+
+
+!---------------------------------------------------------------------
+!    reference data for 64X64X64 grids after 200 time steps, with DT = 0.8d-3
+!---------------------------------------------------------------------
+        elseif ( (grid_points(1) .eq. 64) .and.   &
+     &           (grid_points(2) .eq. 64) .and.  &
+     &           (grid_points(3) .eq. 64) .and.  &
+     &           (no_time_steps  .eq. 200) ) then
+
+           class = 'A'
+           dtref = 0.8d-3
+!---------------------------------------------------------------------
+!  Reference values of RMS-norms of residual.
+!---------------------------------------------------------------------
+         xcrref(1) = 1.0806346714637264d+02
+         xcrref(2) = 1.1319730901220813d+01
+         xcrref(3) = 2.5974354511582465d+01
+         xcrref(4) = 2.3665622544678910d+01
+         xcrref(5) = 2.5278963211748344d+02
+
+!---------------------------------------------------------------------
+!  Reference values of RMS-norms of solution error.
+!---------------------------------------------------------------------
+         xceref(1) = 4.2348416040525025d+00
+         xceref(2) = 4.4390282496995698d-01
+         xceref(3) = 9.6692480136345650d-01
+         xceref(4) = 8.8302063039765474d-01
+         xceref(5) = 9.7379901770829278d+00
+
+!---------------------------------------------------------------------
+!    reference data for 102X102X102 grids after 200 time steps,
+!    with DT = 3.0d-04
+!---------------------------------------------------------------------
+        elseif ( (grid_points(1) .eq. 102) .and.   &
+     &           (grid_points(2) .eq. 102) .and.  &
+     &           (grid_points(3) .eq. 102) .and.  &
+     &           (no_time_steps  .eq. 200) ) then
+
+           class = 'B'
+           dtref = 3.0d-4
+
+!---------------------------------------------------------------------
+!  Reference values of RMS-norms of residual.
+!---------------------------------------------------------------------
+         xcrref(1) = 1.4233597229287254d+03
+         xcrref(2) = 9.9330522590150238d+01
+         xcrref(3) = 3.5646025644535285d+02
+         xcrref(4) = 3.2485447959084092d+02
+         xcrref(5) = 3.2707541254659363d+03
+
+!---------------------------------------------------------------------
+!  Reference values of RMS-norms of solution error.
+!---------------------------------------------------------------------
+         xceref(1) = 5.2969847140936856d+01
+         xceref(2) = 4.4632896115670668d+00
+         xceref(3) = 1.3122573342210174d+01
+         xceref(4) = 1.2006925323559144d+01
+         xceref(5) = 1.2459576151035986d+02
+
+!---------------------------------------------------------------------
+!    reference data for 162X162X162 grids after 200 time steps,
+!    with DT = 1.0d-04
+!---------------------------------------------------------------------
+        elseif ( (grid_points(1) .eq. 162) .and.   &
+     &           (grid_points(2) .eq. 162) .and.  &
+     &           (grid_points(3) .eq. 162) .and.  &
+     &           (no_time_steps  .eq. 200) ) then
+
+           class = 'C'
+           dtref = 1.0d-4
+
+!---------------------------------------------------------------------
+!  Reference values of RMS-norms of residual.
+!---------------------------------------------------------------------
+         xcrref(1) = 0.62398116551764615d+04
+         xcrref(2) = 0.50793239190423964d+03
+         xcrref(3) = 0.15423530093013596d+04
+         xcrref(4) = 0.13302387929291190d+04
+         xcrref(5) = 0.11604087428436455d+05
+
+!---------------------------------------------------------------------
+!  Reference values of RMS-norms of solution error.
+!---------------------------------------------------------------------
+         xceref(1) = 0.16462008369091265d+03
+         xceref(2) = 0.11497107903824313d+02
+         xceref(3) = 0.41207446207461508d+02
+         xceref(4) = 0.37087651059694167d+02
+         xceref(5) = 0.36211053051841265d+03
+
+!---------------------------------------------------------------------
+!    reference data for 408x408x408 grids after 250 time steps,
+!    with DT = 0.2d-04
+!---------------------------------------------------------------------
+        elseif ( (grid_points(1) .eq. 408) .and.   &
+     &           (grid_points(2) .eq. 408) .and.  &
+     &           (grid_points(3) .eq. 408) .and.  &
+     &           (no_time_steps  .eq. 250) ) then
+
+           class = 'D'
+           dtref = 0.2d-4
+
+!---------------------------------------------------------------------
+!  Reference values of RMS-norms of residual.
+!---------------------------------------------------------------------
+         xcrref(1) = 0.2533188551738d+05
+         xcrref(2) = 0.2346393716980d+04
+         xcrref(3) = 0.6294554366904d+04
+         xcrref(4) = 0.5352565376030d+04
+         xcrref(5) = 0.3905864038618d+05
+
+!---------------------------------------------------------------------
+!  Reference values of RMS-norms of solution error.
+!---------------------------------------------------------------------
+
+         xceref(1) = 0.3100009377557d+03
+         xceref(2) = 0.2424086324913d+02
+         xceref(3) = 0.7782212022645d+02
+         xceref(4) = 0.6835623860116d+02
+         xceref(5) = 0.6065737200368d+03
+
+!---------------------------------------------------------------------
+!    reference data for 1020x1020x1020 grids after 250 time steps,
+!    with DT = 0.4d-05
+!---------------------------------------------------------------------
+        elseif ( (grid_points(1) .eq. 1020) .and.   &
+     &           (grid_points(2) .eq. 1020) .and.  &
+     &           (grid_points(3) .eq. 1020) .and.  &
+     &           (no_time_steps  .eq. 250) ) then
+
+           class = 'E'
+           dtref = 0.4d-5
+
+!---------------------------------------------------------------------
+!  Reference values of RMS-norms of residual.
+!---------------------------------------------------------------------
+         xcrref(1) = 0.9795372484517d+05
+         xcrref(2) = 0.9739814511521d+04
+         xcrref(3) = 0.2467606342965d+05
+         xcrref(4) = 0.2092419572860d+05
+         xcrref(5) = 0.1392138856939d+06
+
+!---------------------------------------------------------------------
+!  Reference values of RMS-norms of solution error.
+!---------------------------------------------------------------------
+
+         xceref(1) = 0.4327562208414d+03
+         xceref(2) = 0.3699051964887d+02
+         xceref(3) = 0.1089845040954d+03
+         xceref(4) = 0.9462517622043d+02
+         xceref(5) = 0.7765512765309d+03
+
+!---------------------------------------------------------------------
+!    reference data for 2560x2560x2560 grids after 250 time steps,
+!    with DT = 0.6d-06
+!---------------------------------------------------------------------
+        elseif ( (grid_points(1) .eq. 2560) .and.   &
+     &           (grid_points(2) .eq. 2560) .and.  &
+     &           (grid_points(3) .eq. 2560) .and.  &
+     &           (no_time_steps  .eq. 250) ) then
+
+           class = 'F'
+           dtref = 0.6d-6
+
+!---------------------------------------------------------------------
+!  Reference values of RMS-norms of residual.
+!---------------------------------------------------------------------
+         xcrref(1) = 0.4240735175585d+06
+         xcrref(2) = 0.4348701133212d+05
+         xcrref(3) = 0.1078114688845d+06
+         xcrref(4) = 0.9142160938556d+05
+         xcrref(5) = 0.5879842143431d+06
+
+!---------------------------------------------------------------------
+!  Reference values of RMS-norms of solution error.
+!---------------------------------------------------------------------
+
+         xceref(1) = 0.5095577042351d+03
+         xceref(2) = 0.4557065541652d+02
+         xceref(3) = 0.1286632140581d+03
+         xceref(4) = 0.1111419378722d+03
+         xceref(5) = 0.8720011709356d+03
+
+        else
+           verified = .false.
+        endif
+
+!---------------------------------------------------------------------
+!    verification test for residuals if gridsize is one of 
+!    the defined grid sizes above (class .ne. 'U')
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!    Compute the difference of solution values and the known reference values.
+!---------------------------------------------------------------------
+        do m = 1, 5
+           
+           xcrdif(m) = dabs((xcr(m)-xcrref(m))/xcrref(m)) 
+           xcedif(m) = dabs((xce(m)-xceref(m))/xceref(m))
+           
+        enddo
+
+!---------------------------------------------------------------------
+!    Output the comparison of computed results to known cases.
+!---------------------------------------------------------------------
+
+        if (class .ne. 'U') then
+           write(*, 1990) class
+ 1990      format(' Verification being performed for class ', a)
+           write (*,2000) epsilon
+ 2000      format(' accuracy setting for epsilon = ', E20.13)
+           verified = (dabs(dt-dtref) .le. epsilon)
+           if (.not.verified) then  
+              class = 'U'
+              write (*,1000) dtref
+ 1000         format(' DT does not match the reference value of ',   &
+     &                 E15.8)
+           endif
+        else 
+           write(*, 1995)
+ 1995      format(' Unknown class')
+        endif
+
+
+        if (class .ne. 'U') then
+           write (*, 2001) 
+        else
+           write (*, 2005)
+        endif
+
+ 2001   format(' Comparison of RMS-norms of residual')
+ 2005   format(' RMS-norms of residual')
+        do m = 1, 5
+           if (class .eq. 'U') then
+              write(*, 2015) m, xcr(m)
+           else if ((.not.ieee_is_nan(xcrdif(m))) .and.  &
+     &              xcrdif(m) .le. epsilon) then
+              write (*,2011) m,xcr(m),xcrref(m),xcrdif(m)
+           else 
+              verified = .false.
+              write (*,2010) m,xcr(m),xcrref(m),xcrdif(m)
+           endif
+        enddo
+
+        if (class .ne. 'U') then
+           write (*,2002)
+        else
+           write (*,2006)
+        endif
+ 2002   format(' Comparison of RMS-norms of solution error')
+ 2006   format(' RMS-norms of solution error')
+        
+        do m = 1, 5
+           if (class .eq. 'U') then
+              write(*, 2015) m, xce(m)
+           else if ((.not.ieee_is_nan(xcedif(m))) .and.  &
+     &              xcedif(m) .le. epsilon) then
+              write (*,2011) m,xce(m),xceref(m),xcedif(m)
+           else
+              verified = .false.
+              write (*,2010) m,xce(m),xceref(m),xcedif(m)
+           endif
+        enddo
+        
+ 2010   format(' FAILURE: ', i2, E20.13, E20.13, E20.13)
+ 2011   format('          ', i2, E20.13, E20.13, E20.13)
+ 2015   format('          ', i2, E20.13)
+        
+        if (class .eq. 'U') then
+           write(*, 2022)
+           write(*, 2023)
+ 2022      format(' No reference values provided')
+ 2023      format(' No verification performed')
+        else if (verified) then
+           write(*, 2020)
+ 2020      format(' Verification Successful')
+        else
+           write(*, 2021)
+ 2021      format(' Verification failed')
+        endif
+
+        return
+
+
+        end
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/BT/work_lhs.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/BT/work_lhs.f90
new file mode 100644
index 000000000..02b8a0700
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/BT/work_lhs.f90
@@ -0,0 +1,57 @@
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+!
+!  work_lhs module
+!
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      module work_lhs
+
+      use bt_data, only : problem_size
+
+      double precision fjac(5, 5,    0:problem_size),  &
+     &                 njac(5, 5,    0:problem_size),  &
+     &                 lhs (5, 5, 3, 0:problem_size)
+!$omp threadprivate( fjac, njac, lhs )
+
+      end module work_lhs
+      
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine lhsinit(lhs, ni)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+      
+      integer i, m, n, ni
+      double precision lhs(5,5,3,0:ni)
+
+!---------------------------------------------------------------------
+!     zero the whole left hand side for starters
+!     set all diagonal values to 1. This is overkill, but convenient
+!---------------------------------------------------------------------
+      i = 0
+      do m = 1, 5
+         do n = 1, 5
+            lhs(m,n,1,i) = 0.0d0
+            lhs(m,n,2,i) = 0.0d0
+            lhs(m,n,3,i) = 0.0d0
+         end do
+         lhs(m,m,2,i) = 1.0d0
+      end do
+      i = ni
+      do m = 1, 5
+         do n = 1, 5
+            lhs(m,n,1,i) = 0.0d0
+            lhs(m,n,2,i) = 0.0d0
+            lhs(m,n,3,i) = 0.0d0
+         end do
+         lhs(m,m,2,i) = 1.0d0
+      end do
+
+      return
+      end
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/BT/work_lhs_blk.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/BT/work_lhs_blk.f90
new file mode 100644
index 000000000..bb437e665
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/BT/work_lhs_blk.f90
@@ -0,0 +1,85 @@
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+!
+!  work_lhs module
+!
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      module work_lhs
+
+      use bt_data, only : problem_size
+
+      include 'blk_par.h'
+
+      double precision fjac(blkdim, 5, 5, 0:2),  &
+     &                 njac(blkdim, 5, 5, 0:2),  &
+     &                 lhsa(blkdim, 5, 5, 0:2),  &
+     &                 lhsb(blkdim, 5, 5, 0:2),  &
+     &                 lhsc(blkdim, 5, 5, 0:problem_size-1),  &
+     &                 rhsx(blkdim, 5, 0:problem_size-1)
+!$omp threadprivate( fjac, njac, lhsa, lhsb, lhsc, rhsx )
+
+      end module work_lhs
+      
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine lhsinit(ni)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+      use work_lhs
+      implicit none
+
+      integer ni
+
+      integer i, m, jb
+
+!---------------------------------------------------------------------
+!     zero the whole left hand side for starters
+!     set all diagonal values to 1. This is overkill, but convenient
+!---------------------------------------------------------------------
+      if (ni .gt. 0) goto 20
+
+      do i = 0, 2, 2
+!dir$ vector always
+         do jb = 1, bsize
+!dir$ unroll
+            do m = 1, 5
+               lhsa(jb,m,1,i) = 0.0d0
+               lhsa(jb,m,2,i) = 0.0d0
+               lhsa(jb,m,3,i) = 0.0d0
+               lhsa(jb,m,4,i) = 0.0d0
+               lhsa(jb,m,5,i) = 0.0d0
+               lhsb(jb,m,1,i) = 0.0d0
+               lhsb(jb,m,2,i) = 0.0d0
+               lhsb(jb,m,3,i) = 0.0d0
+               lhsb(jb,m,4,i) = 0.0d0
+               lhsb(jb,m,5,i) = 0.0d0
+               lhsb(jb,m,m,i) = 1.0d0
+            end do
+         end do
+      end do
+
+      return
+
+  20  continue
+      do i = 0, ni, ni
+!dir$ vector always
+         do jb = 1, bsize
+!dir$ unroll
+            do m = 1, 5
+               lhsc(jb,m,1,i) = 0.0d0
+               lhsc(jb,m,2,i) = 0.0d0
+               lhsc(jb,m,3,i) = 0.0d0
+               lhsc(jb,m,4,i) = 0.0d0
+               lhsc(jb,m,5,i) = 0.0d0
+            end do
+         end do
+      end do
+
+      return
+      end
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/BT/x_solve.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/BT/x_solve.f90
new file mode 100644
index 000000000..4e7c38c8e
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/BT/x_solve.f90
@@ -0,0 +1,409 @@
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine x_solve
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     
+!     Performs line solves in X direction by first factoring
+!     the block-tridiagonal matrix into an upper triangular matrix, 
+!     and then performing back substitution to solve for the unknown
+!     vectors of each line.  
+!     
+!     Make sure we treat elements zero to cell_size in the direction
+!     of the sweep.
+!     
+!---------------------------------------------------------------------
+
+      use bt_data
+      use work_lhs
+
+      implicit none
+
+      integer i,j,k,m,n,isize
+      double precision tmp1, tmp2, tmp3
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      if (timeron) call timer_start(t_xsolve)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     This function computes the left hand side in the xi-direction
+!---------------------------------------------------------------------
+
+      isize = grid_points(1)-1
+
+!---------------------------------------------------------------------
+!     determine a (labeled f) and n jacobians
+!---------------------------------------------------------------------
+!$omp parallel do default(shared) shared(isize) collapse(2)  &
+!$omp& private(i,j,k,m,n,tmp1,tmp2,tmp3)
+      do k = 1, grid_points(3)-2
+         do j = 1, grid_points(2)-2
+            do i = 0, isize
+
+               tmp1 = rho_i(i,j,k)
+               tmp2 = tmp1 * tmp1
+               tmp3 = tmp1 * tmp2
+!---------------------------------------------------------------------
+!     
+!---------------------------------------------------------------------
+               fjac(1,1,i) = 0.0d+00
+               fjac(1,2,i) = 1.0d+00
+               fjac(1,3,i) = 0.0d+00
+               fjac(1,4,i) = 0.0d+00
+               fjac(1,5,i) = 0.0d+00
+
+               fjac(2,1,i) = -(u(2,i,j,k) * tmp2 *   &
+     &              u(2,i,j,k))  &
+     &              + c2 * qs(i,j,k)
+               fjac(2,2,i) = ( 2.0d+00 - c2 )  &
+     &              * ( u(2,i,j,k) / u(1,i,j,k) )
+               fjac(2,3,i) = - c2 * ( u(3,i,j,k) * tmp1 )
+               fjac(2,4,i) = - c2 * ( u(4,i,j,k) * tmp1 )
+               fjac(2,5,i) = c2
+
+               fjac(3,1,i) = - ( u(2,i,j,k)*u(3,i,j,k) ) * tmp2
+               fjac(3,2,i) = u(3,i,j,k) * tmp1
+               fjac(3,3,i) = u(2,i,j,k) * tmp1
+               fjac(3,4,i) = 0.0d+00
+               fjac(3,5,i) = 0.0d+00
+
+               fjac(4,1,i) = - ( u(2,i,j,k)*u(4,i,j,k) ) * tmp2
+               fjac(4,2,i) = u(4,i,j,k) * tmp1
+               fjac(4,3,i) = 0.0d+00
+               fjac(4,4,i) = u(2,i,j,k) * tmp1
+               fjac(4,5,i) = 0.0d+00
+
+               fjac(5,1,i) = ( c2 * 2.0d0 * square(i,j,k)  &
+     &              - c1 * u(5,i,j,k) )  &
+     &              * ( u(2,i,j,k) * tmp2 )
+               fjac(5,2,i) = c1 *  u(5,i,j,k) * tmp1   &
+     &              - c2  &
+     &              * ( u(2,i,j,k)*u(2,i,j,k) * tmp2  &
+     &              + qs(i,j,k) )
+               fjac(5,3,i) = - c2 * ( u(3,i,j,k)*u(2,i,j,k) )  &
+     &              * tmp2
+               fjac(5,4,i) = - c2 * ( u(4,i,j,k)*u(2,i,j,k) )  &
+     &              * tmp2
+               fjac(5,5,i) = c1 * ( u(2,i,j,k) * tmp1 )
+
+               njac(1,1,i) = 0.0d+00
+               njac(1,2,i) = 0.0d+00
+               njac(1,3,i) = 0.0d+00
+               njac(1,4,i) = 0.0d+00
+               njac(1,5,i) = 0.0d+00
+
+               njac(2,1,i) = - con43 * c3c4 * tmp2 * u(2,i,j,k)
+               njac(2,2,i) =   con43 * c3c4 * tmp1
+               njac(2,3,i) =   0.0d+00
+               njac(2,4,i) =   0.0d+00
+               njac(2,5,i) =   0.0d+00
+
+               njac(3,1,i) = - c3c4 * tmp2 * u(3,i,j,k)
+               njac(3,2,i) =   0.0d+00
+               njac(3,3,i) =   c3c4 * tmp1
+               njac(3,4,i) =   0.0d+00
+               njac(3,5,i) =   0.0d+00
+
+               njac(4,1,i) = - c3c4 * tmp2 * u(4,i,j,k)
+               njac(4,2,i) =   0.0d+00 
+               njac(4,3,i) =   0.0d+00
+               njac(4,4,i) =   c3c4 * tmp1
+               njac(4,5,i) =   0.0d+00
+
+               njac(5,1,i) = - ( con43 * c3c4  &
+     &              - c1345 ) * tmp3 * (u(2,i,j,k)**2)  &
+     &              - ( c3c4 - c1345 ) * tmp3 * (u(3,i,j,k)**2)  &
+     &              - ( c3c4 - c1345 ) * tmp3 * (u(4,i,j,k)**2)  &
+     &              - c1345 * tmp2 * u(5,i,j,k)
+
+               njac(5,2,i) = ( con43 * c3c4  &
+     &              - c1345 ) * tmp2 * u(2,i,j,k)
+               njac(5,3,i) = ( c3c4 - c1345 ) * tmp2 * u(3,i,j,k)
+               njac(5,4,i) = ( c3c4 - c1345 ) * tmp2 * u(4,i,j,k)
+               njac(5,5,i) = ( c1345 ) * tmp1
+
+            enddo
+
+!---------------------------------------------------------------------
+!     now jacobians set, so form left hand side in x direction
+!---------------------------------------------------------------------
+            call lhsinit(lhs, isize)
+            do i = 1, isize-1
+
+               tmp1 = dt * tx1
+               tmp2 = dt * tx2
+
+               lhs(1,1,aa,i) = - tmp2 * fjac(1,1,i-1)  &
+     &              - tmp1 * njac(1,1,i-1)  &
+     &              - tmp1 * dx1 
+               lhs(1,2,aa,i) = - tmp2 * fjac(1,2,i-1)  &
+     &              - tmp1 * njac(1,2,i-1)
+               lhs(1,3,aa,i) = - tmp2 * fjac(1,3,i-1)  &
+     &              - tmp1 * njac(1,3,i-1)
+               lhs(1,4,aa,i) = - tmp2 * fjac(1,4,i-1)  &
+     &              - tmp1 * njac(1,4,i-1)
+               lhs(1,5,aa,i) = - tmp2 * fjac(1,5,i-1)  &
+     &              - tmp1 * njac(1,5,i-1)
+
+               lhs(2,1,aa,i) = - tmp2 * fjac(2,1,i-1)  &
+     &              - tmp1 * njac(2,1,i-1)
+               lhs(2,2,aa,i) = - tmp2 * fjac(2,2,i-1)  &
+     &              - tmp1 * njac(2,2,i-1)  &
+     &              - tmp1 * dx2
+               lhs(2,3,aa,i) = - tmp2 * fjac(2,3,i-1)  &
+     &              - tmp1 * njac(2,3,i-1)
+               lhs(2,4,aa,i) = - tmp2 * fjac(2,4,i-1)  &
+     &              - tmp1 * njac(2,4,i-1)
+               lhs(2,5,aa,i) = - tmp2 * fjac(2,5,i-1)  &
+     &              - tmp1 * njac(2,5,i-1)
+
+               lhs(3,1,aa,i) = - tmp2 * fjac(3,1,i-1)  &
+     &              - tmp1 * njac(3,1,i-1)
+               lhs(3,2,aa,i) = - tmp2 * fjac(3,2,i-1)  &
+     &              - tmp1 * njac(3,2,i-1)
+               lhs(3,3,aa,i) = - tmp2 * fjac(3,3,i-1)  &
+     &              - tmp1 * njac(3,3,i-1)  &
+     &              - tmp1 * dx3 
+               lhs(3,4,aa,i) = - tmp2 * fjac(3,4,i-1)  &
+     &              - tmp1 * njac(3,4,i-1)
+               lhs(3,5,aa,i) = - tmp2 * fjac(3,5,i-1)  &
+     &              - tmp1 * njac(3,5,i-1)
+
+               lhs(4,1,aa,i) = - tmp2 * fjac(4,1,i-1)  &
+     &              - tmp1 * njac(4,1,i-1)
+               lhs(4,2,aa,i) = - tmp2 * fjac(4,2,i-1)  &
+     &              - tmp1 * njac(4,2,i-1)
+               lhs(4,3,aa,i) = - tmp2 * fjac(4,3,i-1)  &
+     &              - tmp1 * njac(4,3,i-1)
+               lhs(4,4,aa,i) = - tmp2 * fjac(4,4,i-1)  &
+     &              - tmp1 * njac(4,4,i-1)  &
+     &              - tmp1 * dx4
+               lhs(4,5,aa,i) = - tmp2 * fjac(4,5,i-1)  &
+     &              - tmp1 * njac(4,5,i-1)
+
+               lhs(5,1,aa,i) = - tmp2 * fjac(5,1,i-1)  &
+     &              - tmp1 * njac(5,1,i-1)
+               lhs(5,2,aa,i) = - tmp2 * fjac(5,2,i-1)  &
+     &              - tmp1 * njac(5,2,i-1)
+               lhs(5,3,aa,i) = - tmp2 * fjac(5,3,i-1)  &
+     &              - tmp1 * njac(5,3,i-1)
+               lhs(5,4,aa,i) = - tmp2 * fjac(5,4,i-1)  &
+     &              - tmp1 * njac(5,4,i-1)
+               lhs(5,5,aa,i) = - tmp2 * fjac(5,5,i-1)  &
+     &              - tmp1 * njac(5,5,i-1)  &
+     &              - tmp1 * dx5
+
+               lhs(1,1,bb,i) = 1.0d+00  &
+     &              + tmp1 * 2.0d+00 * njac(1,1,i)  &
+     &              + tmp1 * 2.0d+00 * dx1
+               lhs(1,2,bb,i) = tmp1 * 2.0d+00 * njac(1,2,i)
+               lhs(1,3,bb,i) = tmp1 * 2.0d+00 * njac(1,3,i)
+               lhs(1,4,bb,i) = tmp1 * 2.0d+00 * njac(1,4,i)
+               lhs(1,5,bb,i) = tmp1 * 2.0d+00 * njac(1,5,i)
+
+               lhs(2,1,bb,i) = tmp1 * 2.0d+00 * njac(2,1,i)
+               lhs(2,2,bb,i) = 1.0d+00  &
+     &              + tmp1 * 2.0d+00 * njac(2,2,i)  &
+     &              + tmp1 * 2.0d+00 * dx2
+               lhs(2,3,bb,i) = tmp1 * 2.0d+00 * njac(2,3,i)
+               lhs(2,4,bb,i) = tmp1 * 2.0d+00 * njac(2,4,i)
+               lhs(2,5,bb,i) = tmp1 * 2.0d+00 * njac(2,5,i)
+
+               lhs(3,1,bb,i) = tmp1 * 2.0d+00 * njac(3,1,i)
+               lhs(3,2,bb,i) = tmp1 * 2.0d+00 * njac(3,2,i)
+               lhs(3,3,bb,i) = 1.0d+00  &
+     &              + tmp1 * 2.0d+00 * njac(3,3,i)  &
+     &              + tmp1 * 2.0d+00 * dx3
+               lhs(3,4,bb,i) = tmp1 * 2.0d+00 * njac(3,4,i)
+               lhs(3,5,bb,i) = tmp1 * 2.0d+00 * njac(3,5,i)
+
+               lhs(4,1,bb,i) = tmp1 * 2.0d+00 * njac(4,1,i)
+               lhs(4,2,bb,i) = tmp1 * 2.0d+00 * njac(4,2,i)
+               lhs(4,3,bb,i) = tmp1 * 2.0d+00 * njac(4,3,i)
+               lhs(4,4,bb,i) = 1.0d+00  &
+     &              + tmp1 * 2.0d+00 * njac(4,4,i)  &
+     &              + tmp1 * 2.0d+00 * dx4
+               lhs(4,5,bb,i) = tmp1 * 2.0d+00 * njac(4,5,i)
+
+               lhs(5,1,bb,i) = tmp1 * 2.0d+00 * njac(5,1,i)
+               lhs(5,2,bb,i) = tmp1 * 2.0d+00 * njac(5,2,i)
+               lhs(5,3,bb,i) = tmp1 * 2.0d+00 * njac(5,3,i)
+               lhs(5,4,bb,i) = tmp1 * 2.0d+00 * njac(5,4,i)
+               lhs(5,5,bb,i) = 1.0d+00  &
+     &              + tmp1 * 2.0d+00 * njac(5,5,i)  &
+     &              + tmp1 * 2.0d+00 * dx5
+
+               lhs(1,1,cc,i) =  tmp2 * fjac(1,1,i+1)  &
+     &              - tmp1 * njac(1,1,i+1)  &
+     &              - tmp1 * dx1
+               lhs(1,2,cc,i) =  tmp2 * fjac(1,2,i+1)  &
+     &              - tmp1 * njac(1,2,i+1)
+               lhs(1,3,cc,i) =  tmp2 * fjac(1,3,i+1)  &
+     &              - tmp1 * njac(1,3,i+1)
+               lhs(1,4,cc,i) =  tmp2 * fjac(1,4,i+1)  &
+     &              - tmp1 * njac(1,4,i+1)
+               lhs(1,5,cc,i) =  tmp2 * fjac(1,5,i+1)  &
+     &              - tmp1 * njac(1,5,i+1)
+
+               lhs(2,1,cc,i) =  tmp2 * fjac(2,1,i+1)  &
+     &              - tmp1 * njac(2,1,i+1)
+               lhs(2,2,cc,i) =  tmp2 * fjac(2,2,i+1)  &
+     &              - tmp1 * njac(2,2,i+1)  &
+     &              - tmp1 * dx2
+               lhs(2,3,cc,i) =  tmp2 * fjac(2,3,i+1)  &
+     &              - tmp1 * njac(2,3,i+1)
+               lhs(2,4,cc,i) =  tmp2 * fjac(2,4,i+1)  &
+     &              - tmp1 * njac(2,4,i+1)
+               lhs(2,5,cc,i) =  tmp2 * fjac(2,5,i+1)  &
+     &              - tmp1 * njac(2,5,i+1)
+
+               lhs(3,1,cc,i) =  tmp2 * fjac(3,1,i+1)  &
+     &              - tmp1 * njac(3,1,i+1)
+               lhs(3,2,cc,i) =  tmp2 * fjac(3,2,i+1)  &
+     &              - tmp1 * njac(3,2,i+1)
+               lhs(3,3,cc,i) =  tmp2 * fjac(3,3,i+1)  &
+     &              - tmp1 * njac(3,3,i+1)  &
+     &              - tmp1 * dx3
+               lhs(3,4,cc,i) =  tmp2 * fjac(3,4,i+1)  &
+     &              - tmp1 * njac(3,4,i+1)
+               lhs(3,5,cc,i) =  tmp2 * fjac(3,5,i+1)  &
+     &              - tmp1 * njac(3,5,i+1)
+
+               lhs(4,1,cc,i) =  tmp2 * fjac(4,1,i+1)  &
+     &              - tmp1 * njac(4,1,i+1)
+               lhs(4,2,cc,i) =  tmp2 * fjac(4,2,i+1)  &
+     &              - tmp1 * njac(4,2,i+1)
+               lhs(4,3,cc,i) =  tmp2 * fjac(4,3,i+1)  &
+     &              - tmp1 * njac(4,3,i+1)
+               lhs(4,4,cc,i) =  tmp2 * fjac(4,4,i+1)  &
+     &              - tmp1 * njac(4,4,i+1)  &
+     &              - tmp1 * dx4
+               lhs(4,5,cc,i) =  tmp2 * fjac(4,5,i+1)  &
+     &              - tmp1 * njac(4,5,i+1)
+
+               lhs(5,1,cc,i) =  tmp2 * fjac(5,1,i+1)  &
+     &              - tmp1 * njac(5,1,i+1)
+               lhs(5,2,cc,i) =  tmp2 * fjac(5,2,i+1)  &
+     &              - tmp1 * njac(5,2,i+1)
+               lhs(5,3,cc,i) =  tmp2 * fjac(5,3,i+1)  &
+     &              - tmp1 * njac(5,3,i+1)
+               lhs(5,4,cc,i) =  tmp2 * fjac(5,4,i+1)  &
+     &              - tmp1 * njac(5,4,i+1)
+               lhs(5,5,cc,i) =  tmp2 * fjac(5,5,i+1)  &
+     &              - tmp1 * njac(5,5,i+1)  &
+     &              - tmp1 * dx5
+
+            enddo
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     performs gaussian elimination on this cell.
+!     
+!     assumes that unpacking routines for non-first cells 
+!     preload C' and rhs' from previous cell.
+!     
+!     assumed send happens outside this routine, but that
+!     c'(IMAX) and rhs'(IMAX) will be sent to next cell
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     outermost do loops - sweeping in i direction
+!---------------------------------------------------------------------
+
+            if (timeron) call timer_start(t_solsub)
+!---------------------------------------------------------------------
+!     multiply c(0,j,k) by b_inverse and copy back to c
+!     multiply rhs(0) by b_inverse(0) and copy to rhs
+!---------------------------------------------------------------------
+            call binvcrhs( lhs(1,1,bb,0),  &
+     &                        lhs(1,1,cc,0),  &
+     &                        rhs(1,0,j,k) )
+
+!---------------------------------------------------------------------
+!     begin innermost do loop
+!     do all the elements of the cell unless last 
+!---------------------------------------------------------------------
+            do i=1,isize-1
+
+!---------------------------------------------------------------------
+!     rhs(i) = rhs(i) - A*rhs(i-1)
+!---------------------------------------------------------------------
+               call matvec_sub(lhs(1,1,aa,i),  &
+     &                         rhs(1,i-1,j,k),rhs(1,i,j,k))
+
+!---------------------------------------------------------------------
+!     B(i) = B(i) - C(i-1)*A(i)
+!---------------------------------------------------------------------
+               call matmul_sub(lhs(1,1,aa,i),  &
+     &                         lhs(1,1,cc,i-1),  &
+     &                         lhs(1,1,bb,i))
+
+
+!---------------------------------------------------------------------
+!     multiply c(i,j,k) by b_inverse and copy back to c
+!     multiply rhs(i,j,k) by b_inverse(i,j,k) and copy to rhs
+!---------------------------------------------------------------------
+               call binvcrhs( lhs(1,1,bb,i),  &
+     &                        lhs(1,1,cc,i),  &
+     &                        rhs(1,i,j,k) )
+
+            enddo
+
+!---------------------------------------------------------------------
+!     rhs(isize) = rhs(isize) - A*rhs(isize-1)
+!---------------------------------------------------------------------
+            call matvec_sub(lhs(1,1,aa,isize),  &
+     &                         rhs(1,isize-1,j,k),rhs(1,isize,j,k))
+
+!---------------------------------------------------------------------
+!     B(isize) = B(isize) - C(isize-1)*A(isize)
+!---------------------------------------------------------------------
+            call matmul_sub(lhs(1,1,aa,isize),  &
+     &                         lhs(1,1,cc,isize-1),  &
+     &                         lhs(1,1,bb,isize))
+
+!---------------------------------------------------------------------
+!     multiply rhs() by b_inverse() and copy to rhs
+!---------------------------------------------------------------------
+            call binvrhs( lhs(1,1,bb,isize),  &
+     &                       rhs(1,isize,j,k) )
+            if (timeron) call timer_stop(t_solsub)
+
+
+!---------------------------------------------------------------------
+!     back solve: if last cell, then generate U(isize)=rhs(isize)
+!     else assume U(isize) is loaded in unpack backsub_info
+!     so just use it
+!     after call u(istart) will be sent to next cell
+!---------------------------------------------------------------------
+
+            do i=isize-1,0,-1
+               do m=1,BLOCK_SIZE
+                  do n=1,BLOCK_SIZE
+                     rhs(m,i,j,k) = rhs(m,i,j,k)   &
+     &                    - lhs(m,n,cc,i)*rhs(n,i+1,j,k)
+                  enddo
+               enddo
+            enddo
+
+         enddo
+      enddo
+      if (timeron) call timer_stop(t_xsolve)
+
+      return
+      end
+
+
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/BT/x_solve_blk.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/BT/x_solve_blk.f90
new file mode 100644
index 000000000..a2fb0cacf
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/BT/x_solve_blk.f90
@@ -0,0 +1,467 @@
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine x_solve
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     
+!     Performs line solves in X direction by first factoring
+!     the block-tridiagonal matrix into an upper triangular matrix, 
+!     and then performing back substitution to solve for the unknown
+!     vectors of each line.  
+!     
+!     Make sure we treat elements zero to cell_size in the direction
+!     of the sweep.
+!     
+!---------------------------------------------------------------------
+
+      use bt_data
+      use work_lhs
+
+      implicit none
+
+      integer i,j,k,m,isize
+      integer ii,im,ip,ib,jj,jb
+      double precision tmp1, tmp2, tmp3
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      if (timeron) call timer_start(t_xsolve)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     This function computes the left hand side in the xi-direction
+!---------------------------------------------------------------------
+
+      isize = grid_points(1)-1
+
+!---------------------------------------------------------------------
+!     determine a (labeled f) and n jacobians
+!---------------------------------------------------------------------
+!$omp parallel default(shared) shared(isize)  &
+!$omp& private(i,j,k,m,ii,im,ip,ib,jj,jb,tmp1,tmp2,tmp3)
+
+      call lhsinit(isize)
+
+!$omp do collapse(2)
+      do k = 1, grid_points(3)-2
+         do jj = 1, grid_points(2)-2, bsize
+
+            if (timeron) call timer_start(t_rdis1)
+            do i=0,isize
+            do jb = 1, bsize
+               j = min(jj+jb-1, grid_points(2)-2)
+               rhsx(jb,1,i) = rhs(1,i,j,k)
+               rhsx(jb,2,i) = rhs(2,i,j,k)
+               rhsx(jb,3,i) = rhs(3,i,j,k)
+               rhsx(jb,4,i) = rhs(4,i,j,k)
+               rhsx(jb,5,i) = rhs(5,i,j,k)
+            end do
+            end do
+            if (timeron) call timer_stop(t_rdis1)
+
+            call lhsinit(0)
+
+            ib = 0
+            do ii = 1, isize-1
+            ib = mod(ib + 1, 3)
+            im = min(2*ii - 3, 1)     ! -1 or 1
+            ip = mod(ib + im, 3) - 1
+
+            do i = ii+im, ii+1
+            ip = ip + 1
+            do jb = 1, bsize
+               j = min(jj+jb-1, grid_points(2)-2)
+
+               tmp1 = rho_i(i,j,k)
+               tmp2 = tmp1 * tmp1
+               tmp3 = tmp1 * tmp2
+!---------------------------------------------------------------------
+!     
+!---------------------------------------------------------------------
+               fjac(jb,1,1,ip) = 0.0d+00
+               fjac(jb,1,2,ip) = 1.0d+00
+               fjac(jb,1,3,ip) = 0.0d+00
+               fjac(jb,1,4,ip) = 0.0d+00
+               fjac(jb,1,5,ip) = 0.0d+00
+
+               fjac(jb,2,1,ip) = -(u(2,i,j,k) * tmp2 *   &
+     &              u(2,i,j,k))  &
+     &              + c2 * qs(i,j,k)
+               fjac(jb,2,2,ip) = ( 2.0d+00 - c2 )  &
+     &              * ( u(2,i,j,k) / u(1,i,j,k) )
+               fjac(jb,2,3,ip) = - c2 * ( u(3,i,j,k) * tmp1 )
+               fjac(jb,2,4,ip) = - c2 * ( u(4,i,j,k) * tmp1 )
+               fjac(jb,2,5,ip) = c2
+
+               fjac(jb,3,1,ip) = - ( u(2,i,j,k)*u(3,i,j,k) ) * tmp2
+               fjac(jb,3,2,ip) = u(3,i,j,k) * tmp1
+               fjac(jb,3,3,ip) = u(2,i,j,k) * tmp1
+               fjac(jb,3,4,ip) = 0.0d+00
+               fjac(jb,3,5,ip) = 0.0d+00
+
+               fjac(jb,4,1,ip) = - ( u(2,i,j,k)*u(4,i,j,k) ) * tmp2
+               fjac(jb,4,2,ip) = u(4,i,j,k) * tmp1
+               fjac(jb,4,3,ip) = 0.0d+00
+               fjac(jb,4,4,ip) = u(2,i,j,k) * tmp1
+               fjac(jb,4,5,ip) = 0.0d+00
+
+               fjac(jb,5,1,ip) = ( c2 * 2.0d0 * square(i,j,k)  &
+     &              - c1 * u(5,i,j,k) )  &
+     &              * ( u(2,i,j,k) * tmp2 )
+               fjac(jb,5,2,ip) = c1 *  u(5,i,j,k) * tmp1   &
+     &              - c2  &
+     &              * ( u(2,i,j,k)*u(2,i,j,k) * tmp2  &
+     &              + qs(i,j,k) )
+               fjac(jb,5,3,ip) = - c2 * ( u(3,i,j,k)*u(2,i,j,k) )  &
+     &              * tmp2
+               fjac(jb,5,4,ip) = - c2 * ( u(4,i,j,k)*u(2,i,j,k) )  &
+     &              * tmp2
+               fjac(jb,5,5,ip) = c1 * ( u(2,i,j,k) * tmp1 )
+
+               njac(jb,1,1,ip) = 0.0d+00
+               njac(jb,1,2,ip) = 0.0d+00
+               njac(jb,1,3,ip) = 0.0d+00
+               njac(jb,1,4,ip) = 0.0d+00
+               njac(jb,1,5,ip) = 0.0d+00
+
+               njac(jb,2,1,ip) = - con43 * c3c4 * tmp2 * u(2,i,j,k)
+               njac(jb,2,2,ip) =   con43 * c3c4 * tmp1
+               njac(jb,2,3,ip) =   0.0d+00
+               njac(jb,2,4,ip) =   0.0d+00
+               njac(jb,2,5,ip) =   0.0d+00
+
+               njac(jb,3,1,ip) = - c3c4 * tmp2 * u(3,i,j,k)
+               njac(jb,3,2,ip) =   0.0d+00
+               njac(jb,3,3,ip) =   c3c4 * tmp1
+               njac(jb,3,4,ip) =   0.0d+00
+               njac(jb,3,5,ip) =   0.0d+00
+
+               njac(jb,4,1,ip) = - c3c4 * tmp2 * u(4,i,j,k)
+               njac(jb,4,2,ip) =   0.0d+00 
+               njac(jb,4,3,ip) =   0.0d+00
+               njac(jb,4,4,ip) =   c3c4 * tmp1
+               njac(jb,4,5,ip) =   0.0d+00
+
+               njac(jb,5,1,ip) = - ( con43 * c3c4  &
+     &              - c1345 ) * tmp3 * (u(2,i,j,k)**2)  &
+     &              - ( c3c4 - c1345 ) * tmp3 * (u(3,i,j,k)**2)  &
+     &              - ( c3c4 - c1345 ) * tmp3 * (u(4,i,j,k)**2)  &
+     &              - c1345 * tmp2 * u(5,i,j,k)
+
+               njac(jb,5,2,ip) = ( con43 * c3c4  &
+     &              - c1345 ) * tmp2 * u(2,i,j,k)
+               njac(jb,5,3,ip) = ( c3c4 - c1345 ) * tmp2 * u(3,i,j,k)
+               njac(jb,5,4,ip) = ( c3c4 - c1345 ) * tmp2 * u(4,i,j,k)
+               njac(jb,5,5,ip) = ( c1345 ) * tmp1
+
+            enddo
+            enddo
+
+!---------------------------------------------------------------------
+!     now jacobians set, so form left hand side in x direction
+!---------------------------------------------------------------------
+            im = mod(ib + 2, 3)
+            i = ii
+!dir$ vector always
+            do jb = 1, bsize
+
+               tmp1 = dt * tx1
+               tmp2 = dt * tx2
+
+               lhsa(jb,1,1,1) = - tmp2 * fjac(jb,1,1,im)  &
+     &              - tmp1 * njac(jb,1,1,im)  &
+     &              - tmp1 * dx1 
+               lhsa(jb,1,2,1) = - tmp2 * fjac(jb,1,2,im)  &
+     &              - tmp1 * njac(jb,1,2,im)
+               lhsa(jb,1,3,1) = - tmp2 * fjac(jb,1,3,im)  &
+     &              - tmp1 * njac(jb,1,3,im)
+               lhsa(jb,1,4,1) = - tmp2 * fjac(jb,1,4,im)  &
+     &              - tmp1 * njac(jb,1,4,im)
+               lhsa(jb,1,5,1) = - tmp2 * fjac(jb,1,5,im)  &
+     &              - tmp1 * njac(jb,1,5,im)
+
+               lhsa(jb,2,1,1) = - tmp2 * fjac(jb,2,1,im)  &
+     &              - tmp1 * njac(jb,2,1,im)
+               lhsa(jb,2,2,1) = - tmp2 * fjac(jb,2,2,im)  &
+     &              - tmp1 * njac(jb,2,2,im)  &
+     &              - tmp1 * dx2
+               lhsa(jb,2,3,1) = - tmp2 * fjac(jb,2,3,im)  &
+     &              - tmp1 * njac(jb,2,3,im)
+               lhsa(jb,2,4,1) = - tmp2 * fjac(jb,2,4,im)  &
+     &              - tmp1 * njac(jb,2,4,im)
+               lhsa(jb,2,5,1) = - tmp2 * fjac(jb,2,5,im)  &
+     &              - tmp1 * njac(jb,2,5,im)
+
+               lhsa(jb,3,1,1) = - tmp2 * fjac(jb,3,1,im)  &
+     &              - tmp1 * njac(jb,3,1,im)
+               lhsa(jb,3,2,1) = - tmp2 * fjac(jb,3,2,im)  &
+     &              - tmp1 * njac(jb,3,2,im)
+               lhsa(jb,3,3,1) = - tmp2 * fjac(jb,3,3,im)  &
+     &              - tmp1 * njac(jb,3,3,im)  &
+     &              - tmp1 * dx3 
+               lhsa(jb,3,4,1) = - tmp2 * fjac(jb,3,4,im)  &
+     &              - tmp1 * njac(jb,3,4,im)
+               lhsa(jb,3,5,1) = - tmp2 * fjac(jb,3,5,im)  &
+     &              - tmp1 * njac(jb,3,5,im)
+
+               lhsa(jb,4,1,1) = - tmp2 * fjac(jb,4,1,im)  &
+     &              - tmp1 * njac(jb,4,1,im)
+               lhsa(jb,4,2,1) = - tmp2 * fjac(jb,4,2,im)  &
+     &              - tmp1 * njac(jb,4,2,im)
+               lhsa(jb,4,3,1) = - tmp2 * fjac(jb,4,3,im)  &
+     &              - tmp1 * njac(jb,4,3,im)
+               lhsa(jb,4,4,1) = - tmp2 * fjac(jb,4,4,im)  &
+     &              - tmp1 * njac(jb,4,4,im)  &
+     &              - tmp1 * dx4
+               lhsa(jb,4,5,1) = - tmp2 * fjac(jb,4,5,im)  &
+     &              - tmp1 * njac(jb,4,5,im)
+
+               lhsa(jb,5,1,1) = - tmp2 * fjac(jb,5,1,im)  &
+     &              - tmp1 * njac(jb,5,1,im)
+               lhsa(jb,5,2,1) = - tmp2 * fjac(jb,5,2,im)  &
+     &              - tmp1 * njac(jb,5,2,im)
+               lhsa(jb,5,3,1) = - tmp2 * fjac(jb,5,3,im)  &
+     &              - tmp1 * njac(jb,5,3,im)
+               lhsa(jb,5,4,1) = - tmp2 * fjac(jb,5,4,im)  &
+     &              - tmp1 * njac(jb,5,4,im)
+               lhsa(jb,5,5,1) = - tmp2 * fjac(jb,5,5,im)  &
+     &              - tmp1 * njac(jb,5,5,im)  &
+     &              - tmp1 * dx5
+
+               lhsb(jb,1,1,1) = 1.0d+00  &
+     &              + tmp1 * 2.0d+00 * njac(jb,1,1,ib)  &
+     &              + tmp1 * 2.0d+00 * dx1
+               lhsb(jb,1,2,1) = tmp1 * 2.0d+00 * njac(jb,1,2,ib)
+               lhsb(jb,1,3,1) = tmp1 * 2.0d+00 * njac(jb,1,3,ib)
+               lhsb(jb,1,4,1) = tmp1 * 2.0d+00 * njac(jb,1,4,ib)
+               lhsb(jb,1,5,1) = tmp1 * 2.0d+00 * njac(jb,1,5,ib)
+
+               lhsb(jb,2,1,1) = tmp1 * 2.0d+00 * njac(jb,2,1,ib)
+               lhsb(jb,2,2,1) = 1.0d+00  &
+     &              + tmp1 * 2.0d+00 * njac(jb,2,2,ib)  &
+     &              + tmp1 * 2.0d+00 * dx2
+               lhsb(jb,2,3,1) = tmp1 * 2.0d+00 * njac(jb,2,3,ib)
+               lhsb(jb,2,4,1) = tmp1 * 2.0d+00 * njac(jb,2,4,ib)
+               lhsb(jb,2,5,1) = tmp1 * 2.0d+00 * njac(jb,2,5,ib)
+
+               lhsb(jb,3,1,1) = tmp1 * 2.0d+00 * njac(jb,3,1,ib)
+               lhsb(jb,3,2,1) = tmp1 * 2.0d+00 * njac(jb,3,2,ib)
+               lhsb(jb,3,3,1) = 1.0d+00  &
+     &              + tmp1 * 2.0d+00 * njac(jb,3,3,ib)  &
+     &              + tmp1 * 2.0d+00 * dx3
+               lhsb(jb,3,4,1) = tmp1 * 2.0d+00 * njac(jb,3,4,ib)
+               lhsb(jb,3,5,1) = tmp1 * 2.0d+00 * njac(jb,3,5,ib)
+
+               lhsb(jb,4,1,1) = tmp1 * 2.0d+00 * njac(jb,4,1,ib)
+               lhsb(jb,4,2,1) = tmp1 * 2.0d+00 * njac(jb,4,2,ib)
+               lhsb(jb,4,3,1) = tmp1 * 2.0d+00 * njac(jb,4,3,ib)
+               lhsb(jb,4,4,1) = 1.0d+00  &
+     &              + tmp1 * 2.0d+00 * njac(jb,4,4,ib)  &
+     &              + tmp1 * 2.0d+00 * dx4
+               lhsb(jb,4,5,1) = tmp1 * 2.0d+00 * njac(jb,4,5,ib)
+
+               lhsb(jb,5,1,1) = tmp1 * 2.0d+00 * njac(jb,5,1,ib)
+               lhsb(jb,5,2,1) = tmp1 * 2.0d+00 * njac(jb,5,2,ib)
+               lhsb(jb,5,3,1) = tmp1 * 2.0d+00 * njac(jb,5,3,ib)
+               lhsb(jb,5,4,1) = tmp1 * 2.0d+00 * njac(jb,5,4,ib)
+               lhsb(jb,5,5,1) = 1.0d+00  &
+     &              + tmp1 * 2.0d+00 * njac(jb,5,5,ib)  &
+     &              + tmp1 * 2.0d+00 * dx5
+
+               lhsc(jb,1,1,i) =  tmp2 * fjac(jb,1,1,ip)  &
+     &              - tmp1 * njac(jb,1,1,ip)  &
+     &              - tmp1 * dx1
+               lhsc(jb,1,2,i) =  tmp2 * fjac(jb,1,2,ip)  &
+     &              - tmp1 * njac(jb,1,2,ip)
+               lhsc(jb,1,3,i) =  tmp2 * fjac(jb,1,3,ip)  &
+     &              - tmp1 * njac(jb,1,3,ip)
+               lhsc(jb,1,4,i) =  tmp2 * fjac(jb,1,4,ip)  &
+     &              - tmp1 * njac(jb,1,4,ip)
+               lhsc(jb,1,5,i) =  tmp2 * fjac(jb,1,5,ip)  &
+     &              - tmp1 * njac(jb,1,5,ip)
+
+               lhsc(jb,2,1,i) =  tmp2 * fjac(jb,2,1,ip)  &
+     &              - tmp1 * njac(jb,2,1,ip)
+               lhsc(jb,2,2,i) =  tmp2 * fjac(jb,2,2,ip)  &
+     &              - tmp1 * njac(jb,2,2,ip)  &
+     &              - tmp1 * dx2
+               lhsc(jb,2,3,i) =  tmp2 * fjac(jb,2,3,ip)  &
+     &              - tmp1 * njac(jb,2,3,ip)
+               lhsc(jb,2,4,i) =  tmp2 * fjac(jb,2,4,ip)  &
+     &              - tmp1 * njac(jb,2,4,ip)
+               lhsc(jb,2,5,i) =  tmp2 * fjac(jb,2,5,ip)  &
+     &              - tmp1 * njac(jb,2,5,ip)
+
+               lhsc(jb,3,1,i) =  tmp2 * fjac(jb,3,1,ip)  &
+     &              - tmp1 * njac(jb,3,1,ip)
+               lhsc(jb,3,2,i) =  tmp2 * fjac(jb,3,2,ip)  &
+     &              - tmp1 * njac(jb,3,2,ip)
+               lhsc(jb,3,3,i) =  tmp2 * fjac(jb,3,3,ip)  &
+     &              - tmp1 * njac(jb,3,3,ip)  &
+     &              - tmp1 * dx3
+               lhsc(jb,3,4,i) =  tmp2 * fjac(jb,3,4,ip)  &
+     &              - tmp1 * njac(jb,3,4,ip)
+               lhsc(jb,3,5,i) =  tmp2 * fjac(jb,3,5,ip)  &
+     &              - tmp1 * njac(jb,3,5,ip)
+
+               lhsc(jb,4,1,i) =  tmp2 * fjac(jb,4,1,ip)  &
+     &              - tmp1 * njac(jb,4,1,ip)
+               lhsc(jb,4,2,i) =  tmp2 * fjac(jb,4,2,ip)  &
+     &              - tmp1 * njac(jb,4,2,ip)
+               lhsc(jb,4,3,i) =  tmp2 * fjac(jb,4,3,ip)  &
+     &              - tmp1 * njac(jb,4,3,ip)
+               lhsc(jb,4,4,i) =  tmp2 * fjac(jb,4,4,ip)  &
+     &              - tmp1 * njac(jb,4,4,ip)  &
+     &              - tmp1 * dx4
+               lhsc(jb,4,5,i) =  tmp2 * fjac(jb,4,5,ip)  &
+     &              - tmp1 * njac(jb,4,5,ip)
+
+               lhsc(jb,5,1,i) =  tmp2 * fjac(jb,5,1,ip)  &
+     &              - tmp1 * njac(jb,5,1,ip)
+               lhsc(jb,5,2,i) =  tmp2 * fjac(jb,5,2,ip)  &
+     &              - tmp1 * njac(jb,5,2,ip)
+               lhsc(jb,5,3,i) =  tmp2 * fjac(jb,5,3,ip)  &
+     &              - tmp1 * njac(jb,5,3,ip)
+               lhsc(jb,5,4,i) =  tmp2 * fjac(jb,5,4,ip)  &
+     &              - tmp1 * njac(jb,5,4,ip)
+               lhsc(jb,5,5,i) =  tmp2 * fjac(jb,5,5,ip)  &
+     &              - tmp1 * njac(jb,5,5,ip)  &
+     &              - tmp1 * dx5
+
+            enddo
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     performs gaussian elimination on this cell.
+!     
+!     assumes that unpacking routines for non-first cells 
+!     preload C' and rhs' from previous cell.
+!     
+!     assumed send happens outside this routine, but that
+!     c'(IMAX) and rhs'(IMAX) will be sent to next cell
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     outermost do loops - sweeping in i direction
+!---------------------------------------------------------------------
+
+            if (timeron) call timer_start(t_solsub)
+!---------------------------------------------------------------------
+!     multiply c(0,j,k) by b_inverse and copy back to c
+!     multiply rhs(0) by b_inverse(0) and copy to rhs
+!---------------------------------------------------------------------
+            if (ii .eq. 1) then
+               call binvcrhs( lhsb(1,1,1,0),  &
+     &                        lhsc(1,1,1,0),  &
+     &                        rhsx(1,1,0) )
+            endif
+
+!---------------------------------------------------------------------
+!     begin innermost do loop
+!     do all the elements of the cell unless last 
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     rhs(i) = rhs(i) - A*rhs(i-1)
+!---------------------------------------------------------------------
+               call matvec_sub(lhsa(1,1,1,1),  &
+     &                         rhsx(1,1,i-1),rhsx(1,1,i))
+
+!---------------------------------------------------------------------
+!     B(i) = B(i) - C(i-1)*A(i)
+!---------------------------------------------------------------------
+               call matmul_sub(lhsa(1,1,1,1),  &
+     &                         lhsc(1,1,1,i-1),  &
+     &                         lhsb(1,1,1,1))
+
+
+!---------------------------------------------------------------------
+!     multiply c(i,j,k) by b_inverse and copy back to c
+!     multiply rhs(i,j,k) by b_inverse(i,j,k) and copy to rhs
+!---------------------------------------------------------------------
+               call binvcrhs( lhsb(1,1,1,1),  &
+     &                        lhsc(1,1,1,i),  &
+     &                        rhsx(1,1,i) )
+
+
+            if (ii .eq. isize-1) then
+!---------------------------------------------------------------------
+!     rhs(isize) = rhs(isize) - A*rhs(isize-1)
+!---------------------------------------------------------------------
+               call matvec_sub(lhsa(1,1,1,2),  &
+     &                         rhsx(1,1,isize-1),rhsx(1,1,isize))
+
+!---------------------------------------------------------------------
+!     B(isize) = B(isize) - C(isize-1)*A(isize)
+!---------------------------------------------------------------------
+               call matmul_sub(lhsa(1,1,1,2),  &
+     &                         lhsc(1,1,1,isize-1),  &
+     &                         lhsb(1,1,1,2))
+
+!---------------------------------------------------------------------
+!     multiply rhs() by b_inverse() and copy to rhs
+!---------------------------------------------------------------------
+               call binvrhs( lhsb(1,1,1,2),  &
+     &                       rhsx(1,1,isize) )
+            endif
+            if (timeron) call timer_stop(t_solsub)
+
+            enddo
+
+!---------------------------------------------------------------------
+!     back solve: if last cell, then generate U(isize)=rhs(isize)
+!     else assume U(isize) is loaded in unpack backsub_info
+!     so just use it
+!     after call u(istart) will be sent to next cell
+!---------------------------------------------------------------------
+
+            do i=isize-1,0,-1
+!dir$ vector always
+            do jb=1,bsize
+!dir$ unroll
+               do m=1,BLOCK_SIZE
+                  rhsx(jb,m,i) = rhsx(jb,m,i)   &
+     &                 - lhsc(jb,m,1,i)*rhsx(jb,1,i+1)  &
+     &                 - lhsc(jb,m,2,i)*rhsx(jb,2,i+1)  &
+     &                 - lhsc(jb,m,3,i)*rhsx(jb,3,i+1)  &
+     &                 - lhsc(jb,m,4,i)*rhsx(jb,4,i+1)  &
+     &                 - lhsc(jb,m,5,i)*rhsx(jb,5,i+1)
+               enddo
+            enddo
+            enddo
+
+            if (timeron) call timer_start(t_rdis1)
+            do jb = 1, bsize
+               j = jj+jb-1
+               if (j .lt. grid_points(2)-1) then
+               do i=0,isize
+                  rhs(1,i,j,k) = rhsx(jb,1,i)
+                  rhs(2,i,j,k) = rhsx(jb,2,i)
+                  rhs(3,i,j,k) = rhsx(jb,3,i)
+                  rhs(4,i,j,k) = rhsx(jb,4,i)
+                  rhs(5,i,j,k) = rhsx(jb,5,i)
+               end do
+               endif
+            end do
+            if (timeron) call timer_stop(t_rdis1)
+
+         enddo
+      enddo
+!$omp end parallel
+      if (timeron) call timer_stop(t_xsolve)
+
+      return
+      end
+
+
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/BT/y_solve.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/BT/y_solve.f90
new file mode 100644
index 000000000..6957f2d1e
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/BT/y_solve.f90
@@ -0,0 +1,406 @@
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine y_solve
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     Performs line solves in Y direction by first factoring
+!     the block-tridiagonal matrix into an upper triangular matrix, 
+!     and then performing back substitution to solve for the unknown
+!     vectors of each line.  
+!     
+!     Make sure we treat elements zero to cell_size in the direction
+!     of the sweep.
+!---------------------------------------------------------------------
+
+      use bt_data
+      use work_lhs
+
+      implicit none
+
+      integer i, j, k, m, n, jsize
+      double precision tmp1, tmp2, tmp3
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      if (timeron) call timer_start(t_ysolve)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     This function computes the left hand side for the three y-factors   
+!---------------------------------------------------------------------
+
+      jsize = grid_points(2)-1
+
+!---------------------------------------------------------------------
+!     Compute the indices for storing the tri-diagonal matrix;
+!     determine a (labeled f) and n jacobians for cell c
+!---------------------------------------------------------------------
+!$omp parallel do default(shared) shared(jsize) collapse(2)  &
+!$omp& private(i,j,k,m,n,tmp1,tmp2,tmp3)
+      do k = 1, grid_points(3)-2
+         do i = 1, grid_points(1)-2
+            do j = 0, jsize
+
+               tmp1 = rho_i(i,j,k)
+               tmp2 = tmp1 * tmp1
+               tmp3 = tmp1 * tmp2
+
+               fjac(1,1,j) = 0.0d+00
+               fjac(1,2,j) = 0.0d+00
+               fjac(1,3,j) = 1.0d+00
+               fjac(1,4,j) = 0.0d+00
+               fjac(1,5,j) = 0.0d+00
+
+               fjac(2,1,j) = - ( u(2,i,j,k)*u(3,i,j,k) )  &
+     &              * tmp2
+               fjac(2,2,j) = u(3,i,j,k) * tmp1
+               fjac(2,3,j) = u(2,i,j,k) * tmp1
+               fjac(2,4,j) = 0.0d+00
+               fjac(2,5,j) = 0.0d+00
+
+               fjac(3,1,j) = - ( u(3,i,j,k)*u(3,i,j,k)*tmp2)  &
+     &              + c2 * qs(i,j,k)
+               fjac(3,2,j) = - c2 *  u(2,i,j,k) * tmp1
+               fjac(3,3,j) = ( 2.0d+00 - c2 )  &
+     &              *  u(3,i,j,k) * tmp1 
+               fjac(3,4,j) = - c2 * u(4,i,j,k) * tmp1 
+               fjac(3,5,j) = c2
+
+               fjac(4,1,j) = - ( u(3,i,j,k)*u(4,i,j,k) )  &
+     &              * tmp2
+               fjac(4,2,j) = 0.0d+00
+               fjac(4,3,j) = u(4,i,j,k) * tmp1
+               fjac(4,4,j) = u(3,i,j,k) * tmp1
+               fjac(4,5,j) = 0.0d+00
+
+               fjac(5,1,j) = ( c2 * 2.0d0 * square(i,j,k)  &
+     &              - c1 * u(5,i,j,k) )  &
+     &              * u(3,i,j,k) * tmp2
+               fjac(5,2,j) = - c2 * u(2,i,j,k)*u(3,i,j,k)   &
+     &              * tmp2
+               fjac(5,3,j) = c1 * u(5,i,j,k) * tmp1   &
+     &              - c2   &
+     &              * ( qs(i,j,k)  &
+     &              + u(3,i,j,k)*u(3,i,j,k) * tmp2 )
+               fjac(5,4,j) = - c2 * ( u(3,i,j,k)*u(4,i,j,k) )  &
+     &              * tmp2
+               fjac(5,5,j) = c1 * u(3,i,j,k) * tmp1 
+
+               njac(1,1,j) = 0.0d+00
+               njac(1,2,j) = 0.0d+00
+               njac(1,3,j) = 0.0d+00
+               njac(1,4,j) = 0.0d+00
+               njac(1,5,j) = 0.0d+00
+
+               njac(2,1,j) = - c3c4 * tmp2 * u(2,i,j,k)
+               njac(2,2,j) =   c3c4 * tmp1
+               njac(2,3,j) =   0.0d+00
+               njac(2,4,j) =   0.0d+00
+               njac(2,5,j) =   0.0d+00
+
+               njac(3,1,j) = - con43 * c3c4 * tmp2 * u(3,i,j,k)
+               njac(3,2,j) =   0.0d+00
+               njac(3,3,j) =   con43 * c3c4 * tmp1
+               njac(3,4,j) =   0.0d+00
+               njac(3,5,j) =   0.0d+00
+
+               njac(4,1,j) = - c3c4 * tmp2 * u(4,i,j,k)
+               njac(4,2,j) =   0.0d+00
+               njac(4,3,j) =   0.0d+00
+               njac(4,4,j) =   c3c4 * tmp1
+               njac(4,5,j) =   0.0d+00
+
+               njac(5,1,j) = - (  c3c4  &
+     &              - c1345 ) * tmp3 * (u(2,i,j,k)**2)  &
+     &              - ( con43 * c3c4  &
+     &              - c1345 ) * tmp3 * (u(3,i,j,k)**2)  &
+     &              - ( c3c4 - c1345 ) * tmp3 * (u(4,i,j,k)**2)  &
+     &              - c1345 * tmp2 * u(5,i,j,k)
+
+               njac(5,2,j) = (  c3c4 - c1345 ) * tmp2 * u(2,i,j,k)
+               njac(5,3,j) = ( con43 * c3c4  &
+     &              - c1345 ) * tmp2 * u(3,i,j,k)
+               njac(5,4,j) = ( c3c4 - c1345 ) * tmp2 * u(4,i,j,k)
+               njac(5,5,j) = ( c1345 ) * tmp1
+
+            enddo
+
+!---------------------------------------------------------------------
+!     now jacobians set, so form left hand side in y direction
+!---------------------------------------------------------------------
+            call lhsinit(lhs, jsize)
+            do j = 1, jsize-1
+
+               tmp1 = dt * ty1
+               tmp2 = dt * ty2
+
+               lhs(1,1,aa,j) = - tmp2 * fjac(1,1,j-1)  &
+     &              - tmp1 * njac(1,1,j-1)  &
+     &              - tmp1 * dy1 
+               lhs(1,2,aa,j) = - tmp2 * fjac(1,2,j-1)  &
+     &              - tmp1 * njac(1,2,j-1)
+               lhs(1,3,aa,j) = - tmp2 * fjac(1,3,j-1)  &
+     &              - tmp1 * njac(1,3,j-1)
+               lhs(1,4,aa,j) = - tmp2 * fjac(1,4,j-1)  &
+     &              - tmp1 * njac(1,4,j-1)
+               lhs(1,5,aa,j) = - tmp2 * fjac(1,5,j-1)  &
+     &              - tmp1 * njac(1,5,j-1)
+
+               lhs(2,1,aa,j) = - tmp2 * fjac(2,1,j-1)  &
+     &              - tmp1 * njac(2,1,j-1)
+               lhs(2,2,aa,j) = - tmp2 * fjac(2,2,j-1)  &
+     &              - tmp1 * njac(2,2,j-1)  &
+     &              - tmp1 * dy2
+               lhs(2,3,aa,j) = - tmp2 * fjac(2,3,j-1)  &
+     &              - tmp1 * njac(2,3,j-1)
+               lhs(2,4,aa,j) = - tmp2 * fjac(2,4,j-1)  &
+     &              - tmp1 * njac(2,4,j-1)
+               lhs(2,5,aa,j) = - tmp2 * fjac(2,5,j-1)  &
+     &              - tmp1 * njac(2,5,j-1)
+
+               lhs(3,1,aa,j) = - tmp2 * fjac(3,1,j-1)  &
+     &              - tmp1 * njac(3,1,j-1)
+               lhs(3,2,aa,j) = - tmp2 * fjac(3,2,j-1)  &
+     &              - tmp1 * njac(3,2,j-1)
+               lhs(3,3,aa,j) = - tmp2 * fjac(3,3,j-1)  &
+     &              - tmp1 * njac(3,3,j-1)  &
+     &              - tmp1 * dy3 
+               lhs(3,4,aa,j) = - tmp2 * fjac(3,4,j-1)  &
+     &              - tmp1 * njac(3,4,j-1)
+               lhs(3,5,aa,j) = - tmp2 * fjac(3,5,j-1)  &
+     &              - tmp1 * njac(3,5,j-1)
+
+               lhs(4,1,aa,j) = - tmp2 * fjac(4,1,j-1)  &
+     &              - tmp1 * njac(4,1,j-1)
+               lhs(4,2,aa,j) = - tmp2 * fjac(4,2,j-1)  &
+     &              - tmp1 * njac(4,2,j-1)
+               lhs(4,3,aa,j) = - tmp2 * fjac(4,3,j-1)  &
+     &              - tmp1 * njac(4,3,j-1)
+               lhs(4,4,aa,j) = - tmp2 * fjac(4,4,j-1)  &
+     &              - tmp1 * njac(4,4,j-1)  &
+     &              - tmp1 * dy4
+               lhs(4,5,aa,j) = - tmp2 * fjac(4,5,j-1)  &
+     &              - tmp1 * njac(4,5,j-1)
+
+               lhs(5,1,aa,j) = - tmp2 * fjac(5,1,j-1)  &
+     &              - tmp1 * njac(5,1,j-1)
+               lhs(5,2,aa,j) = - tmp2 * fjac(5,2,j-1)  &
+     &              - tmp1 * njac(5,2,j-1)
+               lhs(5,3,aa,j) = - tmp2 * fjac(5,3,j-1)  &
+     &              - tmp1 * njac(5,3,j-1)
+               lhs(5,4,aa,j) = - tmp2 * fjac(5,4,j-1)  &
+     &              - tmp1 * njac(5,4,j-1)
+               lhs(5,5,aa,j) = - tmp2 * fjac(5,5,j-1)  &
+     &              - tmp1 * njac(5,5,j-1)  &
+     &              - tmp1 * dy5
+
+               lhs(1,1,bb,j) = 1.0d+00  &
+     &              + tmp1 * 2.0d+00 * njac(1,1,j)  &
+     &              + tmp1 * 2.0d+00 * dy1
+               lhs(1,2,bb,j) = tmp1 * 2.0d+00 * njac(1,2,j)
+               lhs(1,3,bb,j) = tmp1 * 2.0d+00 * njac(1,3,j)
+               lhs(1,4,bb,j) = tmp1 * 2.0d+00 * njac(1,4,j)
+               lhs(1,5,bb,j) = tmp1 * 2.0d+00 * njac(1,5,j)
+
+               lhs(2,1,bb,j) = tmp1 * 2.0d+00 * njac(2,1,j)
+               lhs(2,2,bb,j) = 1.0d+00  &
+     &              + tmp1 * 2.0d+00 * njac(2,2,j)  &
+     &              + tmp1 * 2.0d+00 * dy2
+               lhs(2,3,bb,j) = tmp1 * 2.0d+00 * njac(2,3,j)
+               lhs(2,4,bb,j) = tmp1 * 2.0d+00 * njac(2,4,j)
+               lhs(2,5,bb,j) = tmp1 * 2.0d+00 * njac(2,5,j)
+
+               lhs(3,1,bb,j) = tmp1 * 2.0d+00 * njac(3,1,j)
+               lhs(3,2,bb,j) = tmp1 * 2.0d+00 * njac(3,2,j)
+               lhs(3,3,bb,j) = 1.0d+00  &
+     &              + tmp1 * 2.0d+00 * njac(3,3,j)  &
+     &              + tmp1 * 2.0d+00 * dy3
+               lhs(3,4,bb,j) = tmp1 * 2.0d+00 * njac(3,4,j)
+               lhs(3,5,bb,j) = tmp1 * 2.0d+00 * njac(3,5,j)
+
+               lhs(4,1,bb,j) = tmp1 * 2.0d+00 * njac(4,1,j)
+               lhs(4,2,bb,j) = tmp1 * 2.0d+00 * njac(4,2,j)
+               lhs(4,3,bb,j) = tmp1 * 2.0d+00 * njac(4,3,j)
+               lhs(4,4,bb,j) = 1.0d+00  &
+     &              + tmp1 * 2.0d+00 * njac(4,4,j)  &
+     &              + tmp1 * 2.0d+00 * dy4
+               lhs(4,5,bb,j) = tmp1 * 2.0d+00 * njac(4,5,j)
+
+               lhs(5,1,bb,j) = tmp1 * 2.0d+00 * njac(5,1,j)
+               lhs(5,2,bb,j) = tmp1 * 2.0d+00 * njac(5,2,j)
+               lhs(5,3,bb,j) = tmp1 * 2.0d+00 * njac(5,3,j)
+               lhs(5,4,bb,j) = tmp1 * 2.0d+00 * njac(5,4,j)
+               lhs(5,5,bb,j) = 1.0d+00  &
+     &              + tmp1 * 2.0d+00 * njac(5,5,j)   &
+     &              + tmp1 * 2.0d+00 * dy5
+
+               lhs(1,1,cc,j) =  tmp2 * fjac(1,1,j+1)  &
+     &              - tmp1 * njac(1,1,j+1)  &
+     &              - tmp1 * dy1
+               lhs(1,2,cc,j) =  tmp2 * fjac(1,2,j+1)  &
+     &              - tmp1 * njac(1,2,j+1)
+               lhs(1,3,cc,j) =  tmp2 * fjac(1,3,j+1)  &
+     &              - tmp1 * njac(1,3,j+1)
+               lhs(1,4,cc,j) =  tmp2 * fjac(1,4,j+1)  &
+     &              - tmp1 * njac(1,4,j+1)
+               lhs(1,5,cc,j) =  tmp2 * fjac(1,5,j+1)  &
+     &              - tmp1 * njac(1,5,j+1)
+
+               lhs(2,1,cc,j) =  tmp2 * fjac(2,1,j+1)  &
+     &              - tmp1 * njac(2,1,j+1)
+               lhs(2,2,cc,j) =  tmp2 * fjac(2,2,j+1)  &
+     &              - tmp1 * njac(2,2,j+1)  &
+     &              - tmp1 * dy2
+               lhs(2,3,cc,j) =  tmp2 * fjac(2,3,j+1)  &
+     &              - tmp1 * njac(2,3,j+1)
+               lhs(2,4,cc,j) =  tmp2 * fjac(2,4,j+1)  &
+     &              - tmp1 * njac(2,4,j+1)
+               lhs(2,5,cc,j) =  tmp2 * fjac(2,5,j+1)  &
+     &              - tmp1 * njac(2,5,j+1)
+
+               lhs(3,1,cc,j) =  tmp2 * fjac(3,1,j+1)  &
+     &              - tmp1 * njac(3,1,j+1)
+               lhs(3,2,cc,j) =  tmp2 * fjac(3,2,j+1)  &
+     &              - tmp1 * njac(3,2,j+1)
+               lhs(3,3,cc,j) =  tmp2 * fjac(3,3,j+1)  &
+     &              - tmp1 * njac(3,3,j+1)  &
+     &              - tmp1 * dy3
+               lhs(3,4,cc,j) =  tmp2 * fjac(3,4,j+1)  &
+     &              - tmp1 * njac(3,4,j+1)
+               lhs(3,5,cc,j) =  tmp2 * fjac(3,5,j+1)  &
+     &              - tmp1 * njac(3,5,j+1)
+
+               lhs(4,1,cc,j) =  tmp2 * fjac(4,1,j+1)  &
+     &              - tmp1 * njac(4,1,j+1)
+               lhs(4,2,cc,j) =  tmp2 * fjac(4,2,j+1)  &
+     &              - tmp1 * njac(4,2,j+1)
+               lhs(4,3,cc,j) =  tmp2 * fjac(4,3,j+1)  &
+     &              - tmp1 * njac(4,3,j+1)
+               lhs(4,4,cc,j) =  tmp2 * fjac(4,4,j+1)  &
+     &              - tmp1 * njac(4,4,j+1)  &
+     &              - tmp1 * dy4
+               lhs(4,5,cc,j) =  tmp2 * fjac(4,5,j+1)  &
+     &              - tmp1 * njac(4,5,j+1)
+
+               lhs(5,1,cc,j) =  tmp2 * fjac(5,1,j+1)  &
+     &              - tmp1 * njac(5,1,j+1)
+               lhs(5,2,cc,j) =  tmp2 * fjac(5,2,j+1)  &
+     &              - tmp1 * njac(5,2,j+1)
+               lhs(5,3,cc,j) =  tmp2 * fjac(5,3,j+1)  &
+     &              - tmp1 * njac(5,3,j+1)
+               lhs(5,4,cc,j) =  tmp2 * fjac(5,4,j+1)  &
+     &              - tmp1 * njac(5,4,j+1)
+               lhs(5,5,cc,j) =  tmp2 * fjac(5,5,j+1)  &
+     &              - tmp1 * njac(5,5,j+1)  &
+     &              - tmp1 * dy5
+
+            enddo
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     performs Gaussian elimination on this cell.
+!     
+!     assumes that unpacking routines for non-first cells 
+!     preload C' and rhs' from previous cell.
+!     
+!     assumed send happens outside this routine, but that
+!     c'(JMAX) and rhs'(JMAX) will be sent to next cell
+!---------------------------------------------------------------------
+
+            if (timeron) call timer_start(t_solsub)
+!---------------------------------------------------------------------
+!     multiply c(i,0,k) by b_inverse and copy back to c
+!     multiply rhs(0) by b_inverse(0) and copy to rhs
+!---------------------------------------------------------------------
+            call binvcrhs( lhs(1,1,bb,0),  &
+     &                        lhs(1,1,cc,0),  &
+     &                        rhs(1,i,0,k) )
+
+!---------------------------------------------------------------------
+!     begin inner most do loop
+!     do all the elements of the cell unless last 
+!---------------------------------------------------------------------
+            do j=1,jsize-1
+
+!---------------------------------------------------------------------
+!     subtract A*lhs_vector(j-1) from lhs_vector(j)
+!     
+!     rhs(j) = rhs(j) - A*rhs(j-1)
+!---------------------------------------------------------------------
+               call matvec_sub(lhs(1,1,aa,j),  &
+     &                         rhs(1,i,j-1,k),rhs(1,i,j,k))
+
+!---------------------------------------------------------------------
+!     B(j) = B(j) - C(j-1)*A(j)
+!---------------------------------------------------------------------
+               call matmul_sub(lhs(1,1,aa,j),  &
+     &                         lhs(1,1,cc,j-1),  &
+     &                         lhs(1,1,bb,j))
+
+!---------------------------------------------------------------------
+!     multiply c(i,j,k) by b_inverse and copy back to c
+!     multiply rhs(i,1,k) by b_inverse(i,1,k) and copy to rhs
+!---------------------------------------------------------------------
+               call binvcrhs( lhs(1,1,bb,j),  &
+     &                        lhs(1,1,cc,j),  &
+     &                        rhs(1,i,j,k) )
+
+            enddo
+
+
+!---------------------------------------------------------------------
+!     rhs(jsize) = rhs(jsize) - A*rhs(jsize-1)
+!---------------------------------------------------------------------
+            call matvec_sub(lhs(1,1,aa,jsize),  &
+     &                         rhs(1,i,jsize-1,k),rhs(1,i,jsize,k))
+
+!---------------------------------------------------------------------
+!     B(jsize) = B(jsize) - C(jsize-1)*A(jsize)
+!     call matmul_sub(aa,i,jsize,k,c,
+!     $              cc,i,jsize-1,k,c,bb,i,jsize,k)
+!---------------------------------------------------------------------
+            call matmul_sub(lhs(1,1,aa,jsize),  &
+     &                         lhs(1,1,cc,jsize-1),  &
+     &                         lhs(1,1,bb,jsize))
+
+!---------------------------------------------------------------------
+!     multiply rhs(jsize) by b_inverse(jsize) and copy to rhs
+!---------------------------------------------------------------------
+            call binvrhs( lhs(1,1,bb,jsize),  &
+     &                       rhs(1,i,jsize,k) )
+            if (timeron) call timer_stop(t_solsub)
+
+
+!---------------------------------------------------------------------
+!     back solve: if last cell, then generate U(jsize)=rhs(jsize)
+!     else assume U(jsize) is loaded in unpack backsub_info
+!     so just use it
+!     after call u(jstart) will be sent to next cell
+!---------------------------------------------------------------------
+      
+            do j=jsize-1,0,-1
+               do m=1,BLOCK_SIZE
+                  do n=1,BLOCK_SIZE
+                     rhs(m,i,j,k) = rhs(m,i,j,k)   &
+     &                    - lhs(m,n,cc,j)*rhs(n,i,j+1,k)
+                  enddo
+               enddo
+            enddo
+
+         enddo
+      enddo
+      if (timeron) call timer_stop(t_ysolve)
+
+      return
+      end
+
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/BT/y_solve_blk.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/BT/y_solve_blk.f90
new file mode 100644
index 000000000..2db171385
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/BT/y_solve_blk.f90
@@ -0,0 +1,464 @@
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine y_solve
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     Performs line solves in Y direction by first factoring
+!     the block-tridiagonal matrix into an upper triangular matrix, 
+!     and then performing back substitution to solve for the unknown
+!     vectors of each line.  
+!     
+!     Make sure we treat elements zero to cell_size in the direction
+!     of the sweep.
+!---------------------------------------------------------------------
+
+      use bt_data
+      use work_lhs
+
+      implicit none
+
+      integer i, j, k, m, jsize
+      integer ii,ib,jj,jm,jp,jb
+      double precision tmp1, tmp2, tmp3
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      if (timeron) call timer_start(t_ysolve)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     This function computes the left hand side for the three y-factors   
+!---------------------------------------------------------------------
+
+      jsize = grid_points(2)-1
+
+!---------------------------------------------------------------------
+!     Compute the indices for storing the tri-diagonal matrix;
+!     determine a (labeled f) and n jacobians for cell c
+!---------------------------------------------------------------------
+!$omp parallel default(shared) shared(jsize)  &
+!$omp& private(i,j,k,m,ii,ib,jj,jm,jp,jb,tmp1,tmp2,tmp3)
+
+      call lhsinit(jsize)
+
+!$omp do collapse(2)
+      do k = 1, grid_points(3)-2
+         do ii = 1, grid_points(1)-2, bsize
+
+            if (timeron) call timer_start(t_rdis1)
+            do j=0,jsize
+            do ib = 1, bsize
+               i = min(ii+ib-1, grid_points(1)-2)
+               rhsx(ib,1,j) = rhs(1,i,j,k)
+               rhsx(ib,2,j) = rhs(2,i,j,k)
+               rhsx(ib,3,j) = rhs(3,i,j,k)
+               rhsx(ib,4,j) = rhs(4,i,j,k)
+               rhsx(ib,5,j) = rhs(5,i,j,k)
+            end do
+            end do
+            if (timeron) call timer_stop(t_rdis1)
+
+            call lhsinit(0)
+
+            jb = 0
+            do jj = 1, jsize-1
+            jb = mod(jb + 1, 3)
+            jm = min(2*jj - 3, 1)     ! -1 or 1
+            jp = mod(jb + jm, 3) - 1
+
+            do j = jj+jm, jj+1
+            jp = jp + 1
+            do ib = 1, bsize
+               i = min(ii+ib-1, grid_points(1)-2)
+
+               tmp1 = rho_i(i,j,k)
+               tmp2 = tmp1 * tmp1
+               tmp3 = tmp1 * tmp2
+
+               fjac(ib,1,1,jp) = 0.0d+00
+               fjac(ib,1,2,jp) = 0.0d+00
+               fjac(ib,1,3,jp) = 1.0d+00
+               fjac(ib,1,4,jp) = 0.0d+00
+               fjac(ib,1,5,jp) = 0.0d+00
+
+               fjac(ib,2,1,jp) = - ( u(2,i,j,k)*u(3,i,j,k) )  &
+     &              * tmp2
+               fjac(ib,2,2,jp) = u(3,i,j,k) * tmp1
+               fjac(ib,2,3,jp) = u(2,i,j,k) * tmp1
+               fjac(ib,2,4,jp) = 0.0d+00
+               fjac(ib,2,5,jp) = 0.0d+00
+
+               fjac(ib,3,1,jp) = - ( u(3,i,j,k)*u(3,i,j,k)*tmp2)  &
+     &              + c2 * qs(i,j,k)
+               fjac(ib,3,2,jp) = - c2 *  u(2,i,j,k) * tmp1
+               fjac(ib,3,3,jp) = ( 2.0d+00 - c2 )  &
+     &              *  u(3,i,j,k) * tmp1 
+               fjac(ib,3,4,jp) = - c2 * u(4,i,j,k) * tmp1 
+               fjac(ib,3,5,jp) = c2
+
+               fjac(ib,4,1,jp) = - ( u(3,i,j,k)*u(4,i,j,k) )  &
+     &              * tmp2
+               fjac(ib,4,2,jp) = 0.0d+00
+               fjac(ib,4,3,jp) = u(4,i,j,k) * tmp1
+               fjac(ib,4,4,jp) = u(3,i,j,k) * tmp1
+               fjac(ib,4,5,jp) = 0.0d+00
+
+               fjac(ib,5,1,jp) = ( c2 * 2.0d0 * square(i,j,k)  &
+     &              - c1 * u(5,i,j,k) )  &
+     &              * u(3,i,j,k) * tmp2
+               fjac(ib,5,2,jp) = - c2 * u(2,i,j,k)*u(3,i,j,k)   &
+     &              * tmp2
+               fjac(ib,5,3,jp) = c1 * u(5,i,j,k) * tmp1   &
+     &              - c2   &
+     &              * ( qs(i,j,k)  &
+     &              + u(3,i,j,k)*u(3,i,j,k) * tmp2 )
+               fjac(ib,5,4,jp) = - c2 * ( u(3,i,j,k)*u(4,i,j,k) )  &
+     &              * tmp2
+               fjac(ib,5,5,jp) = c1 * u(3,i,j,k) * tmp1 
+
+               njac(ib,1,1,jp) = 0.0d+00
+               njac(ib,1,2,jp) = 0.0d+00
+               njac(ib,1,3,jp) = 0.0d+00
+               njac(ib,1,4,jp) = 0.0d+00
+               njac(ib,1,5,jp) = 0.0d+00
+
+               njac(ib,2,1,jp) = - c3c4 * tmp2 * u(2,i,j,k)
+               njac(ib,2,2,jp) =   c3c4 * tmp1
+               njac(ib,2,3,jp) =   0.0d+00
+               njac(ib,2,4,jp) =   0.0d+00
+               njac(ib,2,5,jp) =   0.0d+00
+
+               njac(ib,3,1,jp) = - con43 * c3c4 * tmp2 * u(3,i,j,k)
+               njac(ib,3,2,jp) =   0.0d+00
+               njac(ib,3,3,jp) =   con43 * c3c4 * tmp1
+               njac(ib,3,4,jp) =   0.0d+00
+               njac(ib,3,5,jp) =   0.0d+00
+
+               njac(ib,4,1,jp) = - c3c4 * tmp2 * u(4,i,j,k)
+               njac(ib,4,2,jp) =   0.0d+00
+               njac(ib,4,3,jp) =   0.0d+00
+               njac(ib,4,4,jp) =   c3c4 * tmp1
+               njac(ib,4,5,jp) =   0.0d+00
+
+               njac(ib,5,1,jp) = - (  c3c4  &
+     &              - c1345 ) * tmp3 * (u(2,i,j,k)**2)  &
+     &              - ( con43 * c3c4  &
+     &              - c1345 ) * tmp3 * (u(3,i,j,k)**2)  &
+     &              - ( c3c4 - c1345 ) * tmp3 * (u(4,i,j,k)**2)  &
+     &              - c1345 * tmp2 * u(5,i,j,k)
+
+               njac(ib,5,2,jp) = (  c3c4 - c1345 ) * tmp2 * u(2,i,j,k)
+               njac(ib,5,3,jp) = ( con43 * c3c4  &
+     &              - c1345 ) * tmp2 * u(3,i,j,k)
+               njac(ib,5,4,jp) = ( c3c4 - c1345 ) * tmp2 * u(4,i,j,k)
+               njac(ib,5,5,jp) = ( c1345 ) * tmp1
+
+            enddo
+            enddo
+
+!---------------------------------------------------------------------
+!     now jacobians set, so form left hand side in y direction
+!---------------------------------------------------------------------
+            jm = mod(jb + 2, 3)
+            j = jj
+!dir$ vector always
+            do ib = 1, bsize
+
+               tmp1 = dt * ty1
+               tmp2 = dt * ty2
+
+               lhsa(ib,1,1,1) = - tmp2 * fjac(ib,1,1,jm)  &
+     &              - tmp1 * njac(ib,1,1,jm)  &
+     &              - tmp1 * dy1 
+               lhsa(ib,1,2,1) = - tmp2 * fjac(ib,1,2,jm)  &
+     &              - tmp1 * njac(ib,1,2,jm)
+               lhsa(ib,1,3,1) = - tmp2 * fjac(ib,1,3,jm)  &
+     &              - tmp1 * njac(ib,1,3,jm)
+               lhsa(ib,1,4,1) = - tmp2 * fjac(ib,1,4,jm)  &
+     &              - tmp1 * njac(ib,1,4,jm)
+               lhsa(ib,1,5,1) = - tmp2 * fjac(ib,1,5,jm)  &
+     &              - tmp1 * njac(ib,1,5,jm)
+
+               lhsa(ib,2,1,1) = - tmp2 * fjac(ib,2,1,jm)  &
+     &              - tmp1 * njac(ib,2,1,jm)
+               lhsa(ib,2,2,1) = - tmp2 * fjac(ib,2,2,jm)  &
+     &              - tmp1 * njac(ib,2,2,jm)  &
+     &              - tmp1 * dy2
+               lhsa(ib,2,3,1) = - tmp2 * fjac(ib,2,3,jm)  &
+     &              - tmp1 * njac(ib,2,3,jm)
+               lhsa(ib,2,4,1) = - tmp2 * fjac(ib,2,4,jm)  &
+     &              - tmp1 * njac(ib,2,4,jm)
+               lhsa(ib,2,5,1) = - tmp2 * fjac(ib,2,5,jm)  &
+     &              - tmp1 * njac(ib,2,5,jm)
+
+               lhsa(ib,3,1,1) = - tmp2 * fjac(ib,3,1,jm)  &
+     &              - tmp1 * njac(ib,3,1,jm)
+               lhsa(ib,3,2,1) = - tmp2 * fjac(ib,3,2,jm)  &
+     &              - tmp1 * njac(ib,3,2,jm)
+               lhsa(ib,3,3,1) = - tmp2 * fjac(ib,3,3,jm)  &
+     &              - tmp1 * njac(ib,3,3,jm)  &
+     &              - tmp1 * dy3 
+               lhsa(ib,3,4,1) = - tmp2 * fjac(ib,3,4,jm)  &
+     &              - tmp1 * njac(ib,3,4,jm)
+               lhsa(ib,3,5,1) = - tmp2 * fjac(ib,3,5,jm)  &
+     &              - tmp1 * njac(ib,3,5,jm)
+
+               lhsa(ib,4,1,1) = - tmp2 * fjac(ib,4,1,jm)  &
+     &              - tmp1 * njac(ib,4,1,jm)
+               lhsa(ib,4,2,1) = - tmp2 * fjac(ib,4,2,jm)  &
+     &              - tmp1 * njac(ib,4,2,jm)
+               lhsa(ib,4,3,1) = - tmp2 * fjac(ib,4,3,jm)  &
+     &              - tmp1 * njac(ib,4,3,jm)
+               lhsa(ib,4,4,1) = - tmp2 * fjac(ib,4,4,jm)  &
+     &              - tmp1 * njac(ib,4,4,jm)  &
+     &              - tmp1 * dy4
+               lhsa(ib,4,5,1) = - tmp2 * fjac(ib,4,5,jm)  &
+     &              - tmp1 * njac(ib,4,5,jm)
+
+               lhsa(ib,5,1,1) = - tmp2 * fjac(ib,5,1,jm)  &
+     &              - tmp1 * njac(ib,5,1,jm)
+               lhsa(ib,5,2,1) = - tmp2 * fjac(ib,5,2,jm)  &
+     &              - tmp1 * njac(ib,5,2,jm)
+               lhsa(ib,5,3,1) = - tmp2 * fjac(ib,5,3,jm)  &
+     &              - tmp1 * njac(ib,5,3,jm)
+               lhsa(ib,5,4,1) = - tmp2 * fjac(ib,5,4,jm)  &
+     &              - tmp1 * njac(ib,5,4,jm)
+               lhsa(ib,5,5,1) = - tmp2 * fjac(ib,5,5,jm)  &
+     &              - tmp1 * njac(ib,5,5,jm)  &
+     &              - tmp1 * dy5
+
+               lhsb(ib,1,1,1) = 1.0d+00  &
+     &              + tmp1 * 2.0d+00 * njac(ib,1,1,jb)  &
+     &              + tmp1 * 2.0d+00 * dy1
+               lhsb(ib,1,2,1) = tmp1 * 2.0d+00 * njac(ib,1,2,jb)
+               lhsb(ib,1,3,1) = tmp1 * 2.0d+00 * njac(ib,1,3,jb)
+               lhsb(ib,1,4,1) = tmp1 * 2.0d+00 * njac(ib,1,4,jb)
+               lhsb(ib,1,5,1) = tmp1 * 2.0d+00 * njac(ib,1,5,jb)
+
+               lhsb(ib,2,1,1) = tmp1 * 2.0d+00 * njac(ib,2,1,jb)
+               lhsb(ib,2,2,1) = 1.0d+00  &
+     &              + tmp1 * 2.0d+00 * njac(ib,2,2,jb)  &
+     &              + tmp1 * 2.0d+00 * dy2
+               lhsb(ib,2,3,1) = tmp1 * 2.0d+00 * njac(ib,2,3,jb)
+               lhsb(ib,2,4,1) = tmp1 * 2.0d+00 * njac(ib,2,4,jb)
+               lhsb(ib,2,5,1) = tmp1 * 2.0d+00 * njac(ib,2,5,jb)
+
+               lhsb(ib,3,1,1) = tmp1 * 2.0d+00 * njac(ib,3,1,jb)
+               lhsb(ib,3,2,1) = tmp1 * 2.0d+00 * njac(ib,3,2,jb)
+               lhsb(ib,3,3,1) = 1.0d+00  &
+     &              + tmp1 * 2.0d+00 * njac(ib,3,3,jb)  &
+     &              + tmp1 * 2.0d+00 * dy3
+               lhsb(ib,3,4,1) = tmp1 * 2.0d+00 * njac(ib,3,4,jb)
+               lhsb(ib,3,5,1) = tmp1 * 2.0d+00 * njac(ib,3,5,jb)
+
+               lhsb(ib,4,1,1) = tmp1 * 2.0d+00 * njac(ib,4,1,jb)
+               lhsb(ib,4,2,1) = tmp1 * 2.0d+00 * njac(ib,4,2,jb)
+               lhsb(ib,4,3,1) = tmp1 * 2.0d+00 * njac(ib,4,3,jb)
+               lhsb(ib,4,4,1) = 1.0d+00  &
+     &              + tmp1 * 2.0d+00 * njac(ib,4,4,jb)  &
+     &              + tmp1 * 2.0d+00 * dy4
+               lhsb(ib,4,5,1) = tmp1 * 2.0d+00 * njac(ib,4,5,jb)
+
+               lhsb(ib,5,1,1) = tmp1 * 2.0d+00 * njac(ib,5,1,jb)
+               lhsb(ib,5,2,1) = tmp1 * 2.0d+00 * njac(ib,5,2,jb)
+               lhsb(ib,5,3,1) = tmp1 * 2.0d+00 * njac(ib,5,3,jb)
+               lhsb(ib,5,4,1) = tmp1 * 2.0d+00 * njac(ib,5,4,jb)
+               lhsb(ib,5,5,1) = 1.0d+00  &
+     &              + tmp1 * 2.0d+00 * njac(ib,5,5,jb)   &
+     &              + tmp1 * 2.0d+00 * dy5
+
+               lhsc(ib,1,1,j) =  tmp2 * fjac(ib,1,1,jp)  &
+     &              - tmp1 * njac(ib,1,1,jp)  &
+     &              - tmp1 * dy1
+               lhsc(ib,1,2,j) =  tmp2 * fjac(ib,1,2,jp)  &
+     &              - tmp1 * njac(ib,1,2,jp)
+               lhsc(ib,1,3,j) =  tmp2 * fjac(ib,1,3,jp)  &
+     &              - tmp1 * njac(ib,1,3,jp)
+               lhsc(ib,1,4,j) =  tmp2 * fjac(ib,1,4,jp)  &
+     &              - tmp1 * njac(ib,1,4,jp)
+               lhsc(ib,1,5,j) =  tmp2 * fjac(ib,1,5,jp)  &
+     &              - tmp1 * njac(ib,1,5,jp)
+
+               lhsc(ib,2,1,j) =  tmp2 * fjac(ib,2,1,jp)  &
+     &              - tmp1 * njac(ib,2,1,jp)
+               lhsc(ib,2,2,j) =  tmp2 * fjac(ib,2,2,jp)  &
+     &              - tmp1 * njac(ib,2,2,jp)  &
+     &              - tmp1 * dy2
+               lhsc(ib,2,3,j) =  tmp2 * fjac(ib,2,3,jp)  &
+     &              - tmp1 * njac(ib,2,3,jp)
+               lhsc(ib,2,4,j) =  tmp2 * fjac(ib,2,4,jp)  &
+     &              - tmp1 * njac(ib,2,4,jp)
+               lhsc(ib,2,5,j) =  tmp2 * fjac(ib,2,5,jp)  &
+     &              - tmp1 * njac(ib,2,5,jp)
+
+               lhsc(ib,3,1,j) =  tmp2 * fjac(ib,3,1,jp)  &
+     &              - tmp1 * njac(ib,3,1,jp)
+               lhsc(ib,3,2,j) =  tmp2 * fjac(ib,3,2,jp)  &
+     &              - tmp1 * njac(ib,3,2,jp)
+               lhsc(ib,3,3,j) =  tmp2 * fjac(ib,3,3,jp)  &
+     &              - tmp1 * njac(ib,3,3,jp)  &
+     &              - tmp1 * dy3
+               lhsc(ib,3,4,j) =  tmp2 * fjac(ib,3,4,jp)  &
+     &              - tmp1 * njac(ib,3,4,jp)
+               lhsc(ib,3,5,j) =  tmp2 * fjac(ib,3,5,jp)  &
+     &              - tmp1 * njac(ib,3,5,jp)
+
+               lhsc(ib,4,1,j) =  tmp2 * fjac(ib,4,1,jp)  &
+     &              - tmp1 * njac(ib,4,1,jp)
+               lhsc(ib,4,2,j) =  tmp2 * fjac(ib,4,2,jp)  &
+     &              - tmp1 * njac(ib,4,2,jp)
+               lhsc(ib,4,3,j) =  tmp2 * fjac(ib,4,3,jp)  &
+     &              - tmp1 * njac(ib,4,3,jp)
+               lhsc(ib,4,4,j) =  tmp2 * fjac(ib,4,4,jp)  &
+     &              - tmp1 * njac(ib,4,4,jp)  &
+     &              - tmp1 * dy4
+               lhsc(ib,4,5,j) =  tmp2 * fjac(ib,4,5,jp)  &
+     &              - tmp1 * njac(ib,4,5,jp)
+
+               lhsc(ib,5,1,j) =  tmp2 * fjac(ib,5,1,jp)  &
+     &              - tmp1 * njac(ib,5,1,jp)
+               lhsc(ib,5,2,j) =  tmp2 * fjac(ib,5,2,jp)  &
+     &              - tmp1 * njac(ib,5,2,jp)
+               lhsc(ib,5,3,j) =  tmp2 * fjac(ib,5,3,jp)  &
+     &              - tmp1 * njac(ib,5,3,jp)
+               lhsc(ib,5,4,j) =  tmp2 * fjac(ib,5,4,jp)  &
+     &              - tmp1 * njac(ib,5,4,jp)
+               lhsc(ib,5,5,j) =  tmp2 * fjac(ib,5,5,jp)  &
+     &              - tmp1 * njac(ib,5,5,jp)  &
+     &              - tmp1 * dy5
+
+            enddo
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     performs Gaussian elimination on this cell.
+!     
+!     assumes that unpacking routines for non-first cells 
+!     preload C' and rhs' from previous cell.
+!     
+!     assumed send happens outside this routine, but that
+!     c'(JMAX) and rhs'(JMAX) will be sent to next cell
+!---------------------------------------------------------------------
+
+            if (timeron) call timer_start(t_solsub)
+!---------------------------------------------------------------------
+!     multiply c(i,0,k) by b_inverse and copy back to c
+!     multiply rhs(0) by b_inverse(0) and copy to rhs
+!---------------------------------------------------------------------
+            if (jj .eq. 1) then
+            call binvcrhs( lhsb(1,1,1,0),  &
+     &                        lhsc(1,1,1,0),  &
+     &                        rhsx(1,1,0) )
+            endif
+
+!---------------------------------------------------------------------
+!     begin inner most do loop
+!     do all the elements of the cell unless last 
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     subtract A*lhs_vector(j-1) from lhs_vector(j)
+!     
+!     rhs(j) = rhs(j) - A*rhs(j-1)
+!---------------------------------------------------------------------
+               call matvec_sub(lhsa(1,1,1,1),  &
+     &                         rhsx(1,1,j-1),rhsx(1,1,j))
+
+!---------------------------------------------------------------------
+!     B(j) = B(j) - C(j-1)*A(j)
+!---------------------------------------------------------------------
+               call matmul_sub(lhsa(1,1,1,1),  &
+     &                         lhsc(1,1,1,j-1),  &
+     &                         lhsb(1,1,1,1))
+
+!---------------------------------------------------------------------
+!     multiply c(i,j,k) by b_inverse and copy back to c
+!     multiply rhs(i,j,k) by b_inverse(i,j,k) and copy to rhs
+!---------------------------------------------------------------------
+               call binvcrhs( lhsb(1,1,1,1),  &
+     &                        lhsc(1,1,1,j),  &
+     &                        rhsx(1,1,j) )
+
+
+            if (jj .eq. jsize-1) then
+!---------------------------------------------------------------------
+!     rhs(jsize) = rhs(jsize) - A*rhs(jsize-1)
+!---------------------------------------------------------------------
+            call matvec_sub(lhsa(1,1,1,2),  &
+     &                         rhsx(1,1,jsize-1),rhsx(1,1,jsize))
+
+!---------------------------------------------------------------------
+!     B(jsize) = B(jsize) - C(jsize-1)*A(jsize)
+!     call matmul_sub(aa,i,jsize,k,c,
+!     $              cc,i,jsize-1,k,c,bb,i,jsize,k)
+!---------------------------------------------------------------------
+            call matmul_sub(lhsa(1,1,1,2),  &
+     &                         lhsc(1,1,1,jsize-1),  &
+     &                         lhsb(1,1,1,2))
+
+!---------------------------------------------------------------------
+!     multiply rhs(jsize) by b_inverse(jsize) and copy to rhs
+!---------------------------------------------------------------------
+            call binvrhs( lhsb(1,1,1,2),  &
+     &                       rhsx(1,1,jsize) )
+            endif
+
+            if (timeron) call timer_stop(t_solsub)
+
+            enddo
+
+!---------------------------------------------------------------------
+!     back solve: if last cell, then generate U(jsize)=rhs(jsize)
+!     else assume U(jsize) is loaded in unpack backsub_info
+!     so just use it
+!     after call u(jstart) will be sent to next cell
+!---------------------------------------------------------------------
+      
+            do j=jsize-1,0,-1
+!dir$ vector always
+            do ib=1,bsize
+!dir$ unroll
+               do m=1,BLOCK_SIZE
+                  rhsx(ib,m,j) = rhsx(ib,m,j)   &
+     &                 - lhsc(ib,m,1,j)*rhsx(ib,1,j+1)  &
+     &                 - lhsc(ib,m,2,j)*rhsx(ib,2,j+1)  &
+     &                 - lhsc(ib,m,3,j)*rhsx(ib,3,j+1)  &
+     &                 - lhsc(ib,m,4,j)*rhsx(ib,4,j+1)  &
+     &                 - lhsc(ib,m,5,j)*rhsx(ib,5,j+1)
+               enddo
+            enddo
+            enddo
+
+            if (timeron) call timer_start(t_rdis1)
+            do ib = 1, bsize
+               i = ii+ib-1
+               if (i .lt. grid_points(1)-1) then
+               do j=0,jsize
+                  rhs(1,i,j,k) = rhsx(ib,1,j)
+                  rhs(2,i,j,k) = rhsx(ib,2,j)
+                  rhs(3,i,j,k) = rhsx(ib,3,j)
+                  rhs(4,i,j,k) = rhsx(ib,4,j)
+                  rhs(5,i,j,k) = rhsx(ib,5,j)
+               end do
+               endif
+            end do
+            if (timeron) call timer_stop(t_rdis1)
+
+         enddo
+      enddo
+!$omp end parallel
+      if (timeron) call timer_stop(t_ysolve)
+
+      return
+      end
+
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/BT/z_solve.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/BT/z_solve.f90
new file mode 100644
index 000000000..1613676d4
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/BT/z_solve.f90
@@ -0,0 +1,416 @@
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine z_solve
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     Performs line solves in Z direction by first factoring
+!     the block-tridiagonal matrix into an upper triangular matrix, 
+!     and then performing back substitution to solve for the unknown
+!     vectors of each line.  
+!     
+!     Make sure we treat elements zero to cell_size in the direction
+!     of the sweep.
+!---------------------------------------------------------------------
+
+      use bt_data
+      use work_lhs
+
+      implicit none
+
+      integer i, j, k, m, n, ksize
+      double precision tmp1, tmp2, tmp3
+      
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      if (timeron) call timer_start(t_zsolve)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     This function computes the left hand side for the three z-factors   
+!---------------------------------------------------------------------
+
+      ksize = grid_points(3)-1
+
+!---------------------------------------------------------------------
+!     Compute the indices for storing the block-diagonal matrix;
+!     determine c (labeled f) and s jacobians
+!---------------------------------------------------------------------
+!$omp parallel do default(shared) shared(ksize) collapse(2)  &
+!$omp& private(i,j,k,m,n,tmp1,tmp2,tmp3)
+      do j = 1, grid_points(2)-2
+         do i = 1, grid_points(1)-2
+            do k = 0, ksize
+
+               tmp1 = 1.0d+00 / u(1,i,j,k)
+               tmp2 = tmp1 * tmp1
+               tmp3 = tmp1 * tmp2
+
+               fjac(1,1,k) = 0.0d+00
+               fjac(1,2,k) = 0.0d+00
+               fjac(1,3,k) = 0.0d+00
+               fjac(1,4,k) = 1.0d+00
+               fjac(1,5,k) = 0.0d+00
+
+               fjac(2,1,k) = - ( u(2,i,j,k)*u(4,i,j,k) )   &
+     &              * tmp2 
+               fjac(2,2,k) = u(4,i,j,k) * tmp1
+               fjac(2,3,k) = 0.0d+00
+               fjac(2,4,k) = u(2,i,j,k) * tmp1
+               fjac(2,5,k) = 0.0d+00
+
+               fjac(3,1,k) = - ( u(3,i,j,k)*u(4,i,j,k) )  &
+     &              * tmp2 
+               fjac(3,2,k) = 0.0d+00
+               fjac(3,3,k) = u(4,i,j,k) * tmp1
+               fjac(3,4,k) = u(3,i,j,k) * tmp1
+               fjac(3,5,k) = 0.0d+00
+
+               fjac(4,1,k) = - (u(4,i,j,k)*u(4,i,j,k) * tmp2 )   &
+     &              + c2 * qs(i,j,k)
+               fjac(4,2,k) = - c2 *  u(2,i,j,k) * tmp1 
+               fjac(4,3,k) = - c2 *  u(3,i,j,k) * tmp1
+               fjac(4,4,k) = ( 2.0d+00 - c2 )  &
+     &              *  u(4,i,j,k) * tmp1 
+               fjac(4,5,k) = c2
+
+               fjac(5,1,k) = ( c2 * 2.0d0 * square(i,j,k)   &
+     &              - c1 * u(5,i,j,k) )  &
+     &              * u(4,i,j,k) * tmp2
+               fjac(5,2,k) = - c2 * ( u(2,i,j,k)*u(4,i,j,k) )  &
+     &              * tmp2 
+               fjac(5,3,k) = - c2 * ( u(3,i,j,k)*u(4,i,j,k) )  &
+     &              * tmp2
+               fjac(5,4,k) = c1 * ( u(5,i,j,k) * tmp1 )  &
+     &              - c2  &
+     &              * ( qs(i,j,k)  &
+     &              + u(4,i,j,k)*u(4,i,j,k) * tmp2 )
+               fjac(5,5,k) = c1 * u(4,i,j,k) * tmp1
+
+               njac(1,1,k) = 0.0d+00
+               njac(1,2,k) = 0.0d+00
+               njac(1,3,k) = 0.0d+00
+               njac(1,4,k) = 0.0d+00
+               njac(1,5,k) = 0.0d+00
+
+               njac(2,1,k) = - c3c4 * tmp2 * u(2,i,j,k)
+               njac(2,2,k) =   c3c4 * tmp1
+               njac(2,3,k) =   0.0d+00
+               njac(2,4,k) =   0.0d+00
+               njac(2,5,k) =   0.0d+00
+
+               njac(3,1,k) = - c3c4 * tmp2 * u(3,i,j,k)
+               njac(3,2,k) =   0.0d+00
+               njac(3,3,k) =   c3c4 * tmp1
+               njac(3,4,k) =   0.0d+00
+               njac(3,5,k) =   0.0d+00
+
+               njac(4,1,k) = - con43 * c3c4 * tmp2 * u(4,i,j,k)
+               njac(4,2,k) =   0.0d+00
+               njac(4,3,k) =   0.0d+00
+               njac(4,4,k) =   con43 * c3 * c4 * tmp1
+               njac(4,5,k) =   0.0d+00
+
+               njac(5,1,k) = - (  c3c4  &
+     &              - c1345 ) * tmp3 * (u(2,i,j,k)**2)  &
+     &              - ( c3c4 - c1345 ) * tmp3 * (u(3,i,j,k)**2)  &
+     &              - ( con43 * c3c4  &
+     &              - c1345 ) * tmp3 * (u(4,i,j,k)**2)  &
+     &              - c1345 * tmp2 * u(5,i,j,k)
+
+               njac(5,2,k) = (  c3c4 - c1345 ) * tmp2 * u(2,i,j,k)
+               njac(5,3,k) = (  c3c4 - c1345 ) * tmp2 * u(3,i,j,k)
+               njac(5,4,k) = ( con43 * c3c4  &
+     &              - c1345 ) * tmp2 * u(4,i,j,k)
+               njac(5,5,k) = ( c1345 )* tmp1
+
+            enddo
+
+!---------------------------------------------------------------------
+!     now jacobians set, so form left hand side in z direction
+!---------------------------------------------------------------------
+            call lhsinit(lhs, ksize)
+            do k = 1, ksize-1
+
+               tmp1 = dt * tz1
+               tmp2 = dt * tz2
+
+               lhs(1,1,aa,k) = - tmp2 * fjac(1,1,k-1)  &
+     &              - tmp1 * njac(1,1,k-1)  &
+     &              - tmp1 * dz1 
+               lhs(1,2,aa,k) = - tmp2 * fjac(1,2,k-1)  &
+     &              - tmp1 * njac(1,2,k-1)
+               lhs(1,3,aa,k) = - tmp2 * fjac(1,3,k-1)  &
+     &              - tmp1 * njac(1,3,k-1)
+               lhs(1,4,aa,k) = - tmp2 * fjac(1,4,k-1)  &
+     &              - tmp1 * njac(1,4,k-1)
+               lhs(1,5,aa,k) = - tmp2 * fjac(1,5,k-1)  &
+     &              - tmp1 * njac(1,5,k-1)
+
+               lhs(2,1,aa,k) = - tmp2 * fjac(2,1,k-1)  &
+     &              - tmp1 * njac(2,1,k-1)
+               lhs(2,2,aa,k) = - tmp2 * fjac(2,2,k-1)  &
+     &              - tmp1 * njac(2,2,k-1)  &
+     &              - tmp1 * dz2
+               lhs(2,3,aa,k) = - tmp2 * fjac(2,3,k-1)  &
+     &              - tmp1 * njac(2,3,k-1)
+               lhs(2,4,aa,k) = - tmp2 * fjac(2,4,k-1)  &
+     &              - tmp1 * njac(2,4,k-1)
+               lhs(2,5,aa,k) = - tmp2 * fjac(2,5,k-1)  &
+     &              - tmp1 * njac(2,5,k-1)
+
+               lhs(3,1,aa,k) = - tmp2 * fjac(3,1,k-1)  &
+     &              - tmp1 * njac(3,1,k-1)
+               lhs(3,2,aa,k) = - tmp2 * fjac(3,2,k-1)  &
+     &              - tmp1 * njac(3,2,k-1)
+               lhs(3,3,aa,k) = - tmp2 * fjac(3,3,k-1)  &
+     &              - tmp1 * njac(3,3,k-1)  &
+     &              - tmp1 * dz3 
+               lhs(3,4,aa,k) = - tmp2 * fjac(3,4,k-1)  &
+     &              - tmp1 * njac(3,4,k-1)
+               lhs(3,5,aa,k) = - tmp2 * fjac(3,5,k-1)  &
+     &              - tmp1 * njac(3,5,k-1)
+
+               lhs(4,1,aa,k) = - tmp2 * fjac(4,1,k-1)  &
+     &              - tmp1 * njac(4,1,k-1)
+               lhs(4,2,aa,k) = - tmp2 * fjac(4,2,k-1)  &
+     &              - tmp1 * njac(4,2,k-1)
+               lhs(4,3,aa,k) = - tmp2 * fjac(4,3,k-1)  &
+     &              - tmp1 * njac(4,3,k-1)
+               lhs(4,4,aa,k) = - tmp2 * fjac(4,4,k-1)  &
+     &              - tmp1 * njac(4,4,k-1)  &
+     &              - tmp1 * dz4
+               lhs(4,5,aa,k) = - tmp2 * fjac(4,5,k-1)  &
+     &              - tmp1 * njac(4,5,k-1)
+
+               lhs(5,1,aa,k) = - tmp2 * fjac(5,1,k-1)  &
+     &              - tmp1 * njac(5,1,k-1)
+               lhs(5,2,aa,k) = - tmp2 * fjac(5,2,k-1)  &
+     &              - tmp1 * njac(5,2,k-1)
+               lhs(5,3,aa,k) = - tmp2 * fjac(5,3,k-1)  &
+     &              - tmp1 * njac(5,3,k-1)
+               lhs(5,4,aa,k) = - tmp2 * fjac(5,4,k-1)  &
+     &              - tmp1 * njac(5,4,k-1)
+               lhs(5,5,aa,k) = - tmp2 * fjac(5,5,k-1)  &
+     &              - tmp1 * njac(5,5,k-1)  &
+     &              - tmp1 * dz5
+
+               lhs(1,1,bb,k) = 1.0d+00  &
+     &              + tmp1 * 2.0d+00 * njac(1,1,k)  &
+     &              + tmp1 * 2.0d+00 * dz1
+               lhs(1,2,bb,k) = tmp1 * 2.0d+00 * njac(1,2,k)
+               lhs(1,3,bb,k) = tmp1 * 2.0d+00 * njac(1,3,k)
+               lhs(1,4,bb,k) = tmp1 * 2.0d+00 * njac(1,4,k)
+               lhs(1,5,bb,k) = tmp1 * 2.0d+00 * njac(1,5,k)
+
+               lhs(2,1,bb,k) = tmp1 * 2.0d+00 * njac(2,1,k)
+               lhs(2,2,bb,k) = 1.0d+00  &
+     &              + tmp1 * 2.0d+00 * njac(2,2,k)  &
+     &              + tmp1 * 2.0d+00 * dz2
+               lhs(2,3,bb,k) = tmp1 * 2.0d+00 * njac(2,3,k)
+               lhs(2,4,bb,k) = tmp1 * 2.0d+00 * njac(2,4,k)
+               lhs(2,5,bb,k) = tmp1 * 2.0d+00 * njac(2,5,k)
+
+               lhs(3,1,bb,k) = tmp1 * 2.0d+00 * njac(3,1,k)
+               lhs(3,2,bb,k) = tmp1 * 2.0d+00 * njac(3,2,k)
+               lhs(3,3,bb,k) = 1.0d+00  &
+     &              + tmp1 * 2.0d+00 * njac(3,3,k)  &
+     &              + tmp1 * 2.0d+00 * dz3
+               lhs(3,4,bb,k) = tmp1 * 2.0d+00 * njac(3,4,k)
+               lhs(3,5,bb,k) = tmp1 * 2.0d+00 * njac(3,5,k)
+
+               lhs(4,1,bb,k) = tmp1 * 2.0d+00 * njac(4,1,k)
+               lhs(4,2,bb,k) = tmp1 * 2.0d+00 * njac(4,2,k)
+               lhs(4,3,bb,k) = tmp1 * 2.0d+00 * njac(4,3,k)
+               lhs(4,4,bb,k) = 1.0d+00  &
+     &              + tmp1 * 2.0d+00 * njac(4,4,k)  &
+     &              + tmp1 * 2.0d+00 * dz4
+               lhs(4,5,bb,k) = tmp1 * 2.0d+00 * njac(4,5,k)
+
+               lhs(5,1,bb,k) = tmp1 * 2.0d+00 * njac(5,1,k)
+               lhs(5,2,bb,k) = tmp1 * 2.0d+00 * njac(5,2,k)
+               lhs(5,3,bb,k) = tmp1 * 2.0d+00 * njac(5,3,k)
+               lhs(5,4,bb,k) = tmp1 * 2.0d+00 * njac(5,4,k)
+               lhs(5,5,bb,k) = 1.0d+00  &
+     &              + tmp1 * 2.0d+00 * njac(5,5,k)   &
+     &              + tmp1 * 2.0d+00 * dz5
+
+               lhs(1,1,cc,k) =  tmp2 * fjac(1,1,k+1)  &
+     &              - tmp1 * njac(1,1,k+1)  &
+     &              - tmp1 * dz1
+               lhs(1,2,cc,k) =  tmp2 * fjac(1,2,k+1)  &
+     &              - tmp1 * njac(1,2,k+1)
+               lhs(1,3,cc,k) =  tmp2 * fjac(1,3,k+1)  &
+     &              - tmp1 * njac(1,3,k+1)
+               lhs(1,4,cc,k) =  tmp2 * fjac(1,4,k+1)  &
+     &              - tmp1 * njac(1,4,k+1)
+               lhs(1,5,cc,k) =  tmp2 * fjac(1,5,k+1)  &
+     &              - tmp1 * njac(1,5,k+1)
+
+               lhs(2,1,cc,k) =  tmp2 * fjac(2,1,k+1)  &
+     &              - tmp1 * njac(2,1,k+1)
+               lhs(2,2,cc,k) =  tmp2 * fjac(2,2,k+1)  &
+     &              - tmp1 * njac(2,2,k+1)  &
+     &              - tmp1 * dz2
+               lhs(2,3,cc,k) =  tmp2 * fjac(2,3,k+1)  &
+     &              - tmp1 * njac(2,3,k+1)
+               lhs(2,4,cc,k) =  tmp2 * fjac(2,4,k+1)  &
+     &              - tmp1 * njac(2,4,k+1)
+               lhs(2,5,cc,k) =  tmp2 * fjac(2,5,k+1)  &
+     &              - tmp1 * njac(2,5,k+1)
+
+               lhs(3,1,cc,k) =  tmp2 * fjac(3,1,k+1)  &
+     &              - tmp1 * njac(3,1,k+1)
+               lhs(3,2,cc,k) =  tmp2 * fjac(3,2,k+1)  &
+     &              - tmp1 * njac(3,2,k+1)
+               lhs(3,3,cc,k) =  tmp2 * fjac(3,3,k+1)  &
+     &              - tmp1 * njac(3,3,k+1)  &
+     &              - tmp1 * dz3
+               lhs(3,4,cc,k) =  tmp2 * fjac(3,4,k+1)  &
+     &              - tmp1 * njac(3,4,k+1)
+               lhs(3,5,cc,k) =  tmp2 * fjac(3,5,k+1)  &
+     &              - tmp1 * njac(3,5,k+1)
+
+               lhs(4,1,cc,k) =  tmp2 * fjac(4,1,k+1)  &
+     &              - tmp1 * njac(4,1,k+1)
+               lhs(4,2,cc,k) =  tmp2 * fjac(4,2,k+1)  &
+     &              - tmp1 * njac(4,2,k+1)
+               lhs(4,3,cc,k) =  tmp2 * fjac(4,3,k+1)  &
+     &              - tmp1 * njac(4,3,k+1)
+               lhs(4,4,cc,k) =  tmp2 * fjac(4,4,k+1)  &
+     &              - tmp1 * njac(4,4,k+1)  &
+     &              - tmp1 * dz4
+               lhs(4,5,cc,k) =  tmp2 * fjac(4,5,k+1)  &
+     &              - tmp1 * njac(4,5,k+1)
+
+               lhs(5,1,cc,k) =  tmp2 * fjac(5,1,k+1)  &
+     &              - tmp1 * njac(5,1,k+1)
+               lhs(5,2,cc,k) =  tmp2 * fjac(5,2,k+1)  &
+     &              - tmp1 * njac(5,2,k+1)
+               lhs(5,3,cc,k) =  tmp2 * fjac(5,3,k+1)  &
+     &              - tmp1 * njac(5,3,k+1)
+               lhs(5,4,cc,k) =  tmp2 * fjac(5,4,k+1)  &
+     &              - tmp1 * njac(5,4,k+1)
+               lhs(5,5,cc,k) =  tmp2 * fjac(5,5,k+1)  &
+     &              - tmp1 * njac(5,5,k+1)  &
+     &              - tmp1 * dz5
+
+            enddo
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     performs Gaussian elimination on this cell.
+!     
+!     assumes that unpacking routines for non-first cells 
+!     preload C' and rhs' from previous cell.
+!     
+!     assumed send happens outside this routine, but that
+!     c'(KMAX) and rhs'(KMAX) will be sent to next cell.
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     outer most do loops - sweeping in i direction
+!---------------------------------------------------------------------
+
+            if (timeron) call timer_start(t_solsub)
+!---------------------------------------------------------------------
+!     multiply c(i,j,0) by b_inverse and copy back to c
+!     multiply rhs(0) by b_inverse(0) and copy to rhs
+!---------------------------------------------------------------------
+            call binvcrhs( lhs(1,1,bb,0),  &
+     &                        lhs(1,1,cc,0),  &
+     &                        rhs(1,i,j,0) )
+
+
+!---------------------------------------------------------------------
+!     begin inner most do loop
+!     do all the elements of the cell unless last 
+!---------------------------------------------------------------------
+            do k=1,ksize-1
+
+!---------------------------------------------------------------------
+!     subtract A*lhs_vector(k-1) from lhs_vector(k)
+!     
+!     rhs(k) = rhs(k) - A*rhs(k-1)
+!---------------------------------------------------------------------
+               call matvec_sub(lhs(1,1,aa,k),  &
+     &                         rhs(1,i,j,k-1),rhs(1,i,j,k))
+
+!---------------------------------------------------------------------
+!     B(k) = B(k) - C(k-1)*A(k)
+!     call matmul_sub(aa,i,j,k,c,cc,i,j,k-1,c,bb,i,j,k)
+!---------------------------------------------------------------------
+               call matmul_sub(lhs(1,1,aa,k),  &
+     &                         lhs(1,1,cc,k-1),  &
+     &                         lhs(1,1,bb,k))
+
+!---------------------------------------------------------------------
+!     multiply c(i,j,k) by b_inverse and copy back to c
+!     multiply rhs(i,j,k) by b_inverse(i,j,k) and copy to rhs
+!---------------------------------------------------------------------
+               call binvcrhs( lhs(1,1,bb,k),  &
+     &                        lhs(1,1,cc,k),  &
+     &                        rhs(1,i,j,k) )
+
+            enddo
+
+!---------------------------------------------------------------------
+!     Now finish up special cases for last cell
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     rhs(ksize) = rhs(ksize) - A*rhs(ksize-1)
+!---------------------------------------------------------------------
+            call matvec_sub(lhs(1,1,aa,ksize),  &
+     &                         rhs(1,i,j,ksize-1),rhs(1,i,j,ksize))
+
+!---------------------------------------------------------------------
+!     B(ksize) = B(ksize) - C(ksize-1)*A(ksize)
+!     call matmul_sub(aa,i,j,ksize,c,
+!     $              cc,i,j,ksize-1,c,bb,i,j,ksize)
+!---------------------------------------------------------------------
+            call matmul_sub(lhs(1,1,aa,ksize),  &
+     &                         lhs(1,1,cc,ksize-1),  &
+     &                         lhs(1,1,bb,ksize))
+
+!---------------------------------------------------------------------
+!     multiply rhs(ksize) by b_inverse(ksize) and copy to rhs
+!---------------------------------------------------------------------
+            call binvrhs( lhs(1,1,bb,ksize),  &
+     &                       rhs(1,i,j,ksize) )
+            if (timeron) call timer_stop(t_solsub)
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     back solve: if last cell, then generate U(ksize)=rhs(ksize)
+!     else assume U(ksize) is loaded in unpack backsub_info
+!     so just use it
+!     after call u(kstart) will be sent to next cell
+!---------------------------------------------------------------------
+
+            do k=ksize-1,0,-1
+               do m=1,BLOCK_SIZE
+                  do n=1,BLOCK_SIZE
+                     rhs(m,i,j,k) = rhs(m,i,j,k)   &
+     &                    - lhs(m,n,cc,k)*rhs(n,i,j,k+1)
+                  enddo
+               enddo
+            enddo
+
+         enddo
+      enddo
+      if (timeron) call timer_stop(t_zsolve)
+
+      return
+      end
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/BT/z_solve_blk.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/BT/z_solve_blk.f90
new file mode 100644
index 000000000..460bed562
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/BT/z_solve_blk.f90
@@ -0,0 +1,475 @@
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine z_solve
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     Performs line solves in Z direction by first factoring
+!     the block-tridiagonal matrix into an upper triangular matrix, 
+!     and then performing back substitution to solve for the unknown
+!     vectors of each line.  
+!     
+!     Make sure we treat elements zero to cell_size in the direction
+!     of the sweep.
+!---------------------------------------------------------------------
+
+      use bt_data
+      use work_lhs
+
+      implicit none
+
+      integer i, j, k, m, ksize
+      integer ii, ib,kk,km,kp,kb
+      double precision tmp1, tmp2, tmp3
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      if (timeron) call timer_start(t_zsolve)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     This function computes the left hand side for the three z-factors   
+!---------------------------------------------------------------------
+
+      ksize = grid_points(3)-1
+
+!---------------------------------------------------------------------
+!     Compute the indices for storing the block-diagonal matrix;
+!     determine c (labeled f) and s jacobians
+!---------------------------------------------------------------------
+!$omp parallel default(shared) shared(ksize)  &
+!$omp& private(i,j,k,m,ii,ib,kk,km,kp,kb,tmp1,tmp2,tmp3)
+
+      call lhsinit(ksize)
+
+!$omp do collapse(2)
+      do j = 1, grid_points(2)-2
+         do ii = 1, grid_points(1)-2, bsize
+
+            if (timeron) call timer_start(t_rdis1)
+            do k=0,ksize
+            do ib = 1, bsize
+               i = min(ii+ib-1, grid_points(1)-2)
+               rhsx(ib,1,k) = rhs(1,i,j,k)
+               rhsx(ib,2,k) = rhs(2,i,j,k)
+               rhsx(ib,3,k) = rhs(3,i,j,k)
+               rhsx(ib,4,k) = rhs(4,i,j,k)
+               rhsx(ib,5,k) = rhs(5,i,j,k)
+            end do
+            end do
+            if (timeron) call timer_stop(t_rdis1)
+
+            call lhsinit(0)
+
+            kb = 0
+            do kk = 1, ksize-1
+            kb = mod(kb + 1, 3)
+            km = min(2*kk - 3, 1)     ! -1 or 1
+            kp = mod(kb + km, 3) - 1
+
+            do k = kk+km, kk+1
+            kp = kp + 1
+            do ib = 1, bsize
+               i = min(ii+ib-1, grid_points(1)-2)
+
+               tmp1 = 1.0d+00 / u(1,i,j,k)
+               tmp2 = tmp1 * tmp1
+               tmp3 = tmp1 * tmp2
+
+               fjac(ib,1,1,kp) = 0.0d+00
+               fjac(ib,1,2,kp) = 0.0d+00
+               fjac(ib,1,3,kp) = 0.0d+00
+               fjac(ib,1,4,kp) = 1.0d+00
+               fjac(ib,1,5,kp) = 0.0d+00
+
+               fjac(ib,2,1,kp) = - ( u(2,i,j,k)*u(4,i,j,k) )   &
+     &              * tmp2 
+               fjac(ib,2,2,kp) = u(4,i,j,k) * tmp1
+               fjac(ib,2,3,kp) = 0.0d+00
+               fjac(ib,2,4,kp) = u(2,i,j,k) * tmp1
+               fjac(ib,2,5,kp) = 0.0d+00
+
+               fjac(ib,3,1,kp) = - ( u(3,i,j,k)*u(4,i,j,k) )  &
+     &              * tmp2 
+               fjac(ib,3,2,kp) = 0.0d+00
+               fjac(ib,3,3,kp) = u(4,i,j,k) * tmp1
+               fjac(ib,3,4,kp) = u(3,i,j,k) * tmp1
+               fjac(ib,3,5,kp) = 0.0d+00
+
+               fjac(ib,4,1,kp) = - (u(4,i,j,k)*u(4,i,j,k) * tmp2 )   &
+     &              + c2 * qs(i,j,k)
+               fjac(ib,4,2,kp) = - c2 *  u(2,i,j,k) * tmp1 
+               fjac(ib,4,3,kp) = - c2 *  u(3,i,j,k) * tmp1
+               fjac(ib,4,4,kp) = ( 2.0d+00 - c2 )  &
+     &              *  u(4,i,j,k) * tmp1 
+               fjac(ib,4,5,kp) = c2
+
+               fjac(ib,5,1,kp) = ( c2 * 2.0d0 * square(i,j,k)   &
+     &              - c1 * u(5,i,j,k) )  &
+     &              * u(4,i,j,k) * tmp2
+               fjac(ib,5,2,kp) = - c2 * ( u(2,i,j,k)*u(4,i,j,k) )  &
+     &              * tmp2 
+               fjac(ib,5,3,kp) = - c2 * ( u(3,i,j,k)*u(4,i,j,k) )  &
+     &              * tmp2
+               fjac(ib,5,4,kp) = c1 * ( u(5,i,j,k) * tmp1 )  &
+     &              - c2  &
+     &              * ( qs(i,j,k)  &
+     &              + u(4,i,j,k)*u(4,i,j,k) * tmp2 )
+               fjac(ib,5,5,kp) = c1 * u(4,i,j,k) * tmp1
+
+               njac(ib,1,1,kp) = 0.0d+00
+               njac(ib,1,2,kp) = 0.0d+00
+               njac(ib,1,3,kp) = 0.0d+00
+               njac(ib,1,4,kp) = 0.0d+00
+               njac(ib,1,5,kp) = 0.0d+00
+
+               njac(ib,2,1,kp) = - c3c4 * tmp2 * u(2,i,j,k)
+               njac(ib,2,2,kp) =   c3c4 * tmp1
+               njac(ib,2,3,kp) =   0.0d+00
+               njac(ib,2,4,kp) =   0.0d+00
+               njac(ib,2,5,kp) =   0.0d+00
+
+               njac(ib,3,1,kp) = - c3c4 * tmp2 * u(3,i,j,k)
+               njac(ib,3,2,kp) =   0.0d+00
+               njac(ib,3,3,kp) =   c3c4 * tmp1
+               njac(ib,3,4,kp) =   0.0d+00
+               njac(ib,3,5,kp) =   0.0d+00
+
+               njac(ib,4,1,kp) = - con43 * c3c4 * tmp2 * u(4,i,j,k)
+               njac(ib,4,2,kp) =   0.0d+00
+               njac(ib,4,3,kp) =   0.0d+00
+               njac(ib,4,4,kp) =   con43 * c3 * c4 * tmp1
+               njac(ib,4,5,kp) =   0.0d+00
+
+               njac(ib,5,1,kp) = - (  c3c4  &
+     &              - c1345 ) * tmp3 * (u(2,i,j,k)**2)  &
+     &              - ( c3c4 - c1345 ) * tmp3 * (u(3,i,j,k)**2)  &
+     &              - ( con43 * c3c4  &
+     &              - c1345 ) * tmp3 * (u(4,i,j,k)**2)  &
+     &              - c1345 * tmp2 * u(5,i,j,k)
+
+               njac(ib,5,2,kp) = (  c3c4 - c1345 ) * tmp2 * u(2,i,j,k)
+               njac(ib,5,3,kp) = (  c3c4 - c1345 ) * tmp2 * u(3,i,j,k)
+               njac(ib,5,4,kp) = ( con43 * c3c4  &
+     &              - c1345 ) * tmp2 * u(4,i,j,k)
+               njac(ib,5,5,kp) = ( c1345 )* tmp1
+
+            enddo
+            enddo
+
+!---------------------------------------------------------------------
+!     now jacobians set, so form left hand side in z direction
+!---------------------------------------------------------------------
+            km = mod(kb + 2, 3)
+            k = kk
+!dir$ vector always
+            do ib = 1, bsize
+
+               tmp1 = dt * tz1
+               tmp2 = dt * tz2
+
+               lhsa(ib,1,1,1) = - tmp2 * fjac(ib,1,1,km)  &
+     &              - tmp1 * njac(ib,1,1,km)  &
+     &              - tmp1 * dz1 
+               lhsa(ib,1,2,1) = - tmp2 * fjac(ib,1,2,km)  &
+     &              - tmp1 * njac(ib,1,2,km)
+               lhsa(ib,1,3,1) = - tmp2 * fjac(ib,1,3,km)  &
+     &              - tmp1 * njac(ib,1,3,km)
+               lhsa(ib,1,4,1) = - tmp2 * fjac(ib,1,4,km)  &
+     &              - tmp1 * njac(ib,1,4,km)
+               lhsa(ib,1,5,1) = - tmp2 * fjac(ib,1,5,km)  &
+     &              - tmp1 * njac(ib,1,5,km)
+
+               lhsa(ib,2,1,1) = - tmp2 * fjac(ib,2,1,km)  &
+     &              - tmp1 * njac(ib,2,1,km)
+               lhsa(ib,2,2,1) = - tmp2 * fjac(ib,2,2,km)  &
+     &              - tmp1 * njac(ib,2,2,km)  &
+     &              - tmp1 * dz2
+               lhsa(ib,2,3,1) = - tmp2 * fjac(ib,2,3,km)  &
+     &              - tmp1 * njac(ib,2,3,km)
+               lhsa(ib,2,4,1) = - tmp2 * fjac(ib,2,4,km)  &
+     &              - tmp1 * njac(ib,2,4,km)
+               lhsa(ib,2,5,1) = - tmp2 * fjac(ib,2,5,km)  &
+     &              - tmp1 * njac(ib,2,5,km)
+
+               lhsa(ib,3,1,1) = - tmp2 * fjac(ib,3,1,km)  &
+     &              - tmp1 * njac(ib,3,1,km)
+               lhsa(ib,3,2,1) = - tmp2 * fjac(ib,3,2,km)  &
+     &              - tmp1 * njac(ib,3,2,km)
+               lhsa(ib,3,3,1) = - tmp2 * fjac(ib,3,3,km)  &
+     &              - tmp1 * njac(ib,3,3,km)  &
+     &              - tmp1 * dz3 
+               lhsa(ib,3,4,1) = - tmp2 * fjac(ib,3,4,km)  &
+     &              - tmp1 * njac(ib,3,4,km)
+               lhsa(ib,3,5,1) = - tmp2 * fjac(ib,3,5,km)  &
+     &              - tmp1 * njac(ib,3,5,km)
+
+               lhsa(ib,4,1,1) = - tmp2 * fjac(ib,4,1,km)  &
+     &              - tmp1 * njac(ib,4,1,km)
+               lhsa(ib,4,2,1) = - tmp2 * fjac(ib,4,2,km)  &
+     &              - tmp1 * njac(ib,4,2,km)
+               lhsa(ib,4,3,1) = - tmp2 * fjac(ib,4,3,km)  &
+     &              - tmp1 * njac(ib,4,3,km)
+               lhsa(ib,4,4,1) = - tmp2 * fjac(ib,4,4,km)  &
+     &              - tmp1 * njac(ib,4,4,km)  &
+     &              - tmp1 * dz4
+               lhsa(ib,4,5,1) = - tmp2 * fjac(ib,4,5,km)  &
+     &              - tmp1 * njac(ib,4,5,km)
+
+               lhsa(ib,5,1,1) = - tmp2 * fjac(ib,5,1,km)  &
+     &              - tmp1 * njac(ib,5,1,km)
+               lhsa(ib,5,2,1) = - tmp2 * fjac(ib,5,2,km)  &
+     &              - tmp1 * njac(ib,5,2,km)
+               lhsa(ib,5,3,1) = - tmp2 * fjac(ib,5,3,km)  &
+     &              - tmp1 * njac(ib,5,3,km)
+               lhsa(ib,5,4,1) = - tmp2 * fjac(ib,5,4,km)  &
+     &              - tmp1 * njac(ib,5,4,km)
+               lhsa(ib,5,5,1) = - tmp2 * fjac(ib,5,5,km)  &
+     &              - tmp1 * njac(ib,5,5,km)  &
+     &              - tmp1 * dz5
+
+               lhsb(ib,1,1,1) = 1.0d+00  &
+     &              + tmp1 * 2.0d+00 * njac(ib,1,1,kb)  &
+     &              + tmp1 * 2.0d+00 * dz1
+               lhsb(ib,1,2,1) = tmp1 * 2.0d+00 * njac(ib,1,2,kb)
+               lhsb(ib,1,3,1) = tmp1 * 2.0d+00 * njac(ib,1,3,kb)
+               lhsb(ib,1,4,1) = tmp1 * 2.0d+00 * njac(ib,1,4,kb)
+               lhsb(ib,1,5,1) = tmp1 * 2.0d+00 * njac(ib,1,5,kb)
+
+               lhsb(ib,2,1,1) = tmp1 * 2.0d+00 * njac(ib,2,1,kb)
+               lhsb(ib,2,2,1) = 1.0d+00  &
+     &              + tmp1 * 2.0d+00 * njac(ib,2,2,kb)  &
+     &              + tmp1 * 2.0d+00 * dz2
+               lhsb(ib,2,3,1) = tmp1 * 2.0d+00 * njac(ib,2,3,kb)
+               lhsb(ib,2,4,1) = tmp1 * 2.0d+00 * njac(ib,2,4,kb)
+               lhsb(ib,2,5,1) = tmp1 * 2.0d+00 * njac(ib,2,5,kb)
+
+               lhsb(ib,3,1,1) = tmp1 * 2.0d+00 * njac(ib,3,1,kb)
+               lhsb(ib,3,2,1) = tmp1 * 2.0d+00 * njac(ib,3,2,kb)
+               lhsb(ib,3,3,1) = 1.0d+00  &
+     &              + tmp1 * 2.0d+00 * njac(ib,3,3,kb)  &
+     &              + tmp1 * 2.0d+00 * dz3
+               lhsb(ib,3,4,1) = tmp1 * 2.0d+00 * njac(ib,3,4,kb)
+               lhsb(ib,3,5,1) = tmp1 * 2.0d+00 * njac(ib,3,5,kb)
+
+               lhsb(ib,4,1,1) = tmp1 * 2.0d+00 * njac(ib,4,1,kb)
+               lhsb(ib,4,2,1) = tmp1 * 2.0d+00 * njac(ib,4,2,kb)
+               lhsb(ib,4,3,1) = tmp1 * 2.0d+00 * njac(ib,4,3,kb)
+               lhsb(ib,4,4,1) = 1.0d+00  &
+     &              + tmp1 * 2.0d+00 * njac(ib,4,4,kb)  &
+     &              + tmp1 * 2.0d+00 * dz4
+               lhsb(ib,4,5,1) = tmp1 * 2.0d+00 * njac(ib,4,5,kb)
+
+               lhsb(ib,5,1,1) = tmp1 * 2.0d+00 * njac(ib,5,1,kb)
+               lhsb(ib,5,2,1) = tmp1 * 2.0d+00 * njac(ib,5,2,kb)
+               lhsb(ib,5,3,1) = tmp1 * 2.0d+00 * njac(ib,5,3,kb)
+               lhsb(ib,5,4,1) = tmp1 * 2.0d+00 * njac(ib,5,4,kb)
+               lhsb(ib,5,5,1) = 1.0d+00  &
+     &              + tmp1 * 2.0d+00 * njac(ib,5,5,kb)   &
+     &              + tmp1 * 2.0d+00 * dz5
+
+               lhsc(ib,1,1,k) =  tmp2 * fjac(ib,1,1,kp)  &
+     &              - tmp1 * njac(ib,1,1,kp)  &
+     &              - tmp1 * dz1
+               lhsc(ib,1,2,k) =  tmp2 * fjac(ib,1,2,kp)  &
+     &              - tmp1 * njac(ib,1,2,kp)
+               lhsc(ib,1,3,k) =  tmp2 * fjac(ib,1,3,kp)  &
+     &              - tmp1 * njac(ib,1,3,kp)
+               lhsc(ib,1,4,k) =  tmp2 * fjac(ib,1,4,kp)  &
+     &              - tmp1 * njac(ib,1,4,kp)
+               lhsc(ib,1,5,k) =  tmp2 * fjac(ib,1,5,kp)  &
+     &              - tmp1 * njac(ib,1,5,kp)
+
+               lhsc(ib,2,1,k) =  tmp2 * fjac(ib,2,1,kp)  &
+     &              - tmp1 * njac(ib,2,1,kp)
+               lhsc(ib,2,2,k) =  tmp2 * fjac(ib,2,2,kp)  &
+     &              - tmp1 * njac(ib,2,2,kp)  &
+     &              - tmp1 * dz2
+               lhsc(ib,2,3,k) =  tmp2 * fjac(ib,2,3,kp)  &
+     &              - tmp1 * njac(ib,2,3,kp)
+               lhsc(ib,2,4,k) =  tmp2 * fjac(ib,2,4,kp)  &
+     &              - tmp1 * njac(ib,2,4,kp)
+               lhsc(ib,2,5,k) =  tmp2 * fjac(ib,2,5,kp)  &
+     &              - tmp1 * njac(ib,2,5,kp)
+
+               lhsc(ib,3,1,k) =  tmp2 * fjac(ib,3,1,kp)  &
+     &              - tmp1 * njac(ib,3,1,kp)
+               lhsc(ib,3,2,k) =  tmp2 * fjac(ib,3,2,kp)  &
+     &              - tmp1 * njac(ib,3,2,kp)
+               lhsc(ib,3,3,k) =  tmp2 * fjac(ib,3,3,kp)  &
+     &              - tmp1 * njac(ib,3,3,kp)  &
+     &              - tmp1 * dz3
+               lhsc(ib,3,4,k) =  tmp2 * fjac(ib,3,4,kp)  &
+     &              - tmp1 * njac(ib,3,4,kp)
+               lhsc(ib,3,5,k) =  tmp2 * fjac(ib,3,5,kp)  &
+     &              - tmp1 * njac(ib,3,5,kp)
+
+               lhsc(ib,4,1,k) =  tmp2 * fjac(ib,4,1,kp)  &
+     &              - tmp1 * njac(ib,4,1,kp)
+               lhsc(ib,4,2,k) =  tmp2 * fjac(ib,4,2,kp)  &
+     &              - tmp1 * njac(ib,4,2,kp)
+               lhsc(ib,4,3,k) =  tmp2 * fjac(ib,4,3,kp)  &
+     &              - tmp1 * njac(ib,4,3,kp)
+               lhsc(ib,4,4,k) =  tmp2 * fjac(ib,4,4,kp)  &
+     &              - tmp1 * njac(ib,4,4,kp)  &
+     &              - tmp1 * dz4
+               lhsc(ib,4,5,k) =  tmp2 * fjac(ib,4,5,kp)  &
+     &              - tmp1 * njac(ib,4,5,kp)
+
+               lhsc(ib,5,1,k) =  tmp2 * fjac(ib,5,1,kp)  &
+     &              - tmp1 * njac(ib,5,1,kp)
+               lhsc(ib,5,2,k) =  tmp2 * fjac(ib,5,2,kp)  &
+     &              - tmp1 * njac(ib,5,2,kp)
+               lhsc(ib,5,3,k) =  tmp2 * fjac(ib,5,3,kp)  &
+     &              - tmp1 * njac(ib,5,3,kp)
+               lhsc(ib,5,4,k) =  tmp2 * fjac(ib,5,4,kp)  &
+     &              - tmp1 * njac(ib,5,4,kp)
+               lhsc(ib,5,5,k) =  tmp2 * fjac(ib,5,5,kp)  &
+     &              - tmp1 * njac(ib,5,5,kp)  &
+     &              - tmp1 * dz5
+
+            enddo
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     performs Gaussian elimination on this cell.
+!     
+!     assumes that unpacking routines for non-first cells 
+!     preload C' and rhs' from previous cell.
+!     
+!     assumed send happens outside this routine, but that
+!     c'(KMAX) and rhs'(KMAX) will be sent to next cell.
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     outer most do loops - sweeping in i direction
+!---------------------------------------------------------------------
+
+            if (timeron) call timer_start(t_solsub)
+!---------------------------------------------------------------------
+!     multiply c(i,j,0) by b_inverse and copy back to c
+!     multiply rhs(0) by b_inverse(0) and copy to rhs
+!---------------------------------------------------------------------
+            if (kk .eq. 1) then
+            call binvcrhs( lhsb(1,1,1,0),  &
+     &                        lhsc(1,1,1,0),  &
+     &                        rhsx(1,1,0) )
+            endif
+
+
+!---------------------------------------------------------------------
+!     begin inner most do loop
+!     do all the elements of the cell unless last 
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     subtract A*lhs_vector(k-1) from lhs_vector(k)
+!     
+!     rhs(k) = rhs(k) - A*rhs(k-1)
+!---------------------------------------------------------------------
+               call matvec_sub(lhsa(1,1,1,1),  &
+     &                         rhsx(1,1,k-1),rhsx(1,1,k))
+
+!---------------------------------------------------------------------
+!     B(k) = B(k) - C(k-1)*A(k)
+!     call matmul_sub(aa,i,j,k,c,cc,i,j,k-1,c,bb,i,j,k)
+!---------------------------------------------------------------------
+               call matmul_sub(lhsa(1,1,1,1),  &
+     &                         lhsc(1,1,1,k-1),  &
+     &                         lhsb(1,1,1,1))
+
+!---------------------------------------------------------------------
+!     multiply c(i,j,k) by b_inverse and copy back to c
+!     multiply rhs(i,j,1) by b_inverse(i,j,1) and copy to rhs
+!---------------------------------------------------------------------
+               call binvcrhs( lhsb(1,1,1,1),  &
+     &                        lhsc(1,1,1,k),  &
+     &                        rhsx(1,1,k) )
+
+
+!---------------------------------------------------------------------
+!     Now finish up special cases for last cell
+!---------------------------------------------------------------------
+
+            if (kk .eq. ksize-1) then
+!---------------------------------------------------------------------
+!     rhs(ksize) = rhs(ksize) - A*rhs(ksize-1)
+!---------------------------------------------------------------------
+            call matvec_sub(lhsa(1,1,1,2),  &
+     &                         rhsx(1,1,ksize-1),rhsx(1,1,ksize))
+
+!---------------------------------------------------------------------
+!     B(ksize) = B(ksize) - C(ksize-1)*A(ksize)
+!     call matmul_sub(aa,i,j,ksize,c,
+!     $              cc,i,j,ksize-1,c,bb,i,j,ksize)
+!---------------------------------------------------------------------
+            call matmul_sub(lhsa(1,1,1,2),  &
+     &                         lhsc(1,1,1,ksize-1),  &
+     &                         lhsb(1,1,1,2))
+
+!---------------------------------------------------------------------
+!     multiply rhs(ksize) by b_inverse(ksize) and copy to rhs
+!---------------------------------------------------------------------
+            call binvrhs( lhsb(1,1,1,2),  &
+     &                       rhsx(1,1,ksize) )
+            endif
+
+            if (timeron) call timer_stop(t_solsub)
+
+            enddo
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     back solve: if last cell, then generate U(ksize)=rhs(ksize)
+!     else assume U(ksize) is loaded in un pack backsub_info
+!     so just use it
+!     after call u(kstart) will be sent to next cell
+!---------------------------------------------------------------------
+
+            do k=ksize-1,0,-1
+!dir$ vector always
+            do ib = 1, bsize
+!dir$ unroll
+               do m=1,BLOCK_SIZE
+                  rhsx(ib,m,k) = rhsx(ib,m,k)   &
+     &                 - lhsc(ib,m,1,k)*rhsx(ib,1,k+1)  &
+     &                 - lhsc(ib,m,2,k)*rhsx(ib,2,k+1)  &
+     &                 - lhsc(ib,m,3,k)*rhsx(ib,3,k+1)  &
+     &                 - lhsc(ib,m,4,k)*rhsx(ib,4,k+1)  &
+     &                 - lhsc(ib,m,5,k)*rhsx(ib,5,k+1)
+               enddo
+            enddo
+            enddo
+
+            if (timeron) call timer_start(t_rdis1)
+            do ib = 1, bsize
+               i = ii+ib-1
+               if (i < grid_points(1)-1) then
+               do k=0,ksize
+                  rhs(1,i,j,k) = rhsx(ib,1,k)
+                  rhs(2,i,j,k) = rhsx(ib,2,k)
+                  rhs(3,i,j,k) = rhsx(ib,3,k)
+                  rhs(4,i,j,k) = rhsx(ib,4,k)
+                  rhs(5,i,j,k) = rhsx(ib,5,k)
+               end do
+               endif
+            end do
+            if (timeron) call timer_stop(t_rdis1)
+
+         enddo
+      enddo
+!$omp end parallel
+      if (timeron) call timer_stop(t_zsolve)
+
+      return
+      end
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/CG/Makefile b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/CG/Makefile
new file mode 100644
index 000000000..69778182a
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/CG/Makefile
@@ -0,0 +1,29 @@
+SHELL=/bin/sh
+BENCHMARK=cg
+BENCHMARKU=CG
+
+include ../config/make.def
+
+OBJS = cg.o cg_data.o ${COMMON}/print_results.o  \
+       ${COMMON}/${RAND}.o ${COMMON}/timers.o ${COMMON}/wtime.o
+
+ifeq (${M5_ANNOTATION}, 1)
+	OBJS += ${COMMON}/hooks.o
+endif
+
+include ../sys/make.common
+.SUFFIXES: .c .f .f90 .h .o
+
+${PROGRAM}: config ${OBJS}
+	${FLINK} ${FLINKFLAGS} -o ${PROGRAM} ${OBJS} ${F_LIB}
+
+.f90.o:
+	${FCOMPILE} $<
+
+cg.o:		cg.f90  cg_data.o
+cg_data.o:	cg_data.f90 npbparams.h
+
+clean:
+	- rm -f *.o *~ *.mod
+	- rm -f npbparams.h core
+	- if [ -d rii_files ]; then rm -r rii_files; fi
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/CG/README.carefully b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/CG/README.carefully
new file mode 100644
index 000000000..cdcc3667d
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/CG/README.carefully
@@ -0,0 +1,16 @@
+Note: please observe that in the routine conj_grad three 
+implementations of the sparse matrix-vector multiply have
+been supplied.  The default matrix-vector multiply is not
+loop unrolled.  The alternate implementations are unrolled
+to a depth of 2 and unrolled to a depth of 8.  Please
+experiment with these to find the fastest for your particular
+architecture.  If reporting timing results, any of these three may
+be used without penalty.
+
+Performance examples:
+The non-unrolled version of the multiply is actually (slightly: 
+maybe 5%) faster on the sp2-66MHz-WN on 16 nodes than is the 
+unrolled-by-2 version below.   On the Cray t3d, the reverse is true, 
+i.e., the unrolled-by-two version is some 10% faster.  
+The unrolled-by-8 version below is significantly faster
+on the Cray t3d - overall speed of code is 1.5 times faster.
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/CG/cg.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/CG/cg.f90
new file mode 100644
index 000000000..c76ecc736
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/CG/cg.f90
@@ -0,0 +1,1110 @@
+!-------------------------------------------------------------------------!
+!                                                                         !
+!        N  A  S     P A R A L L E L     B E N C H M A R K S  3.4         !
+!                                                                         !
+!                       O p e n M P     V E R S I O N                     !
+!                                                                         !
+!                                   C G                                   !
+!                                                                         !
+!-------------------------------------------------------------------------!
+!                                                                         !
+!    This benchmark is an OpenMP version of the NPB CG code.              !
+!    It is described in NAS Technical Report 99-011.                      !
+!                                                                         !
+!    Permission to use, copy, distribute and modify this software         !
+!    for any purpose with or without fee is hereby granted.  We           !
+!    request, however, that all derived work reference the NAS            !
+!    Parallel Benchmarks 3.4. This software is provided "as is"           !
+!    without express or implied warranty.                                 !
+!                                                                         !
+!    Information on NPB 3.4, including the technical report, the          !
+!    original specifications, source code, results and information        !
+!    on how to submit new results, is available at:                       !
+!                                                                         !
+!           http://www.nas.nasa.gov/Software/NPB/                         !
+!                                                                         !
+!    Send comments or suggestions to  npb@nas.nasa.gov                    !
+!                                                                         !
+!          NAS Parallel Benchmarks Group                                  !
+!          NASA Ames Research Center                                      !
+!          Mail Stop: T27A-1                                              !
+!          Moffett Field, CA   94035-1000                                 !
+!                                                                         !
+!          E-mail:  npb@nas.nasa.gov                                      !
+!          Fax:     (650) 604-3957                                        !
+!                                                                         !
+!-------------------------------------------------------------------------!
+
+
+!---------------------------------------------------------------------
+!
+! Authors: M. Yarrow
+!          C. Kuszmaul
+!          H. Jin
+!
+!---------------------------------------------------------------------
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+      program cg
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      use, intrinsic :: ieee_arithmetic, only : ieee_is_nan
+
+      use cg_data
+
+      implicit none
+
+
+      integer            i, j, it
+      integer(kz)        k
+
+      double precision   zeta, randlc
+      external           randlc
+      double precision   rnorm
+      double precision   norm_temp1,norm_temp2,norm_temp3
+
+      double precision   t, mflops, tmax
+      character          class
+      logical            verified
+      double precision   zeta_verify_value, epsilon, err
+
+      character t_names(t_last)*8
+!$    integer   omp_get_max_threads
+!$    external  omp_get_max_threads
+
+      do i = 1, T_last
+         call timer_clear( i )
+      end do
+
+      call check_timer_flag( timeron )
+      if (timeron) then
+         t_names(t_init) = 'init'
+         t_names(t_bench) = 'benchmk'
+         t_names(t_conj_grad) = 'conjgd'
+      endif
+
+      call timer_start( T_init )
+
+      firstrow = 1
+      lastrow  = na
+      firstcol = 1
+      lastcol  = na
+
+
+      if( na .eq. 1400 .and.  &
+     &    nonzer .eq. 7 .and.  &
+     &    niter .eq. 15 .and.  &
+     &    shift .eq. 10.d0 ) then
+         class = 'S'
+         zeta_verify_value = 8.5971775078648d0
+      else if( na .eq. 7000 .and.  &
+     &         nonzer .eq. 8 .and.  &
+     &         niter .eq. 15 .and.  &
+     &         shift .eq. 12.d0 ) then
+         class = 'W'
+         zeta_verify_value = 10.362595087124d0
+      else if( na .eq. 14000 .and.  &
+     &         nonzer .eq. 11 .and.  &
+     &         niter .eq. 15 .and.  &
+     &         shift .eq. 20.d0 ) then
+         class = 'A'
+         zeta_verify_value = 17.130235054029d0
+      else if( na .eq. 75000 .and.  &
+     &         nonzer .eq. 13 .and.  &
+     &         niter .eq. 75 .and.  &
+     &         shift .eq. 60.d0 ) then
+         class = 'B'
+         zeta_verify_value = 22.712745482631d0
+      else if( na .eq. 150000 .and.  &
+     &         nonzer .eq. 15 .and.  &
+     &         niter .eq. 75 .and.  &
+     &         shift .eq. 110.d0 ) then
+         class = 'C'
+         zeta_verify_value = 28.973605592845d0
+      else if( na .eq. 1500000 .and.  &
+     &         nonzer .eq. 21 .and.  &
+     &         niter .eq. 100 .and.  &
+     &         shift .eq. 500.d0 ) then
+         class = 'D'
+         zeta_verify_value = 52.514532105794d0
+      else if( na .eq. 9000000 .and.  &
+     &         nonzer .eq. 26 .and.  &
+     &         niter .eq. 100 .and.  &
+     &         shift .eq. 1.5d3 ) then
+         class = 'E'
+         zeta_verify_value = 77.522164599383d0
+      else if( na .eq. 54000000 .and.  &
+     &         nonzer .eq. 31 .and.  &
+     &         niter .eq. 100 .and.  &
+     &         shift .eq. 5.0d3 ) then
+         class = 'F'
+         zeta_verify_value = 107.3070826433d0
+      else
+         class = 'U'
+      endif
+
+      write( *,1000 )
+      write( *,1001 ) na
+      write( *,1002 ) niter
+!$    write( *,1003 ) omp_get_max_threads()
+      write( *,* )
+ 1000 format(//,' NAS Parallel Benchmarks (NPB3.4-OMP)',  &
+     &          ' - CG Benchmark', /)
+ 1001 format(' Size: ', i11 )
+ 1002 format(' Iterations:                  ', i5 )
+ 1003 format(' Number of available threads: ', i5)
+
+      naa = na
+      nzz = nz
+
+      call alloc_space
+
+!---------------------------------------------------------------------
+!  Initialize random number generator
+!---------------------------------------------------------------------
+!$omp parallel default(shared) private(i,j,k,zeta)
+      tran    = 314159265.0D0
+      amult   = 1220703125.0D0
+      zeta    = randlc( tran, amult )
+
+!---------------------------------------------------------------------
+!
+!---------------------------------------------------------------------
+      call makea(naa, nzz, a, colidx, rowstr,  &
+     &           firstrow, lastrow, firstcol, lastcol,  &
+     &           arow, acol, aelt, v, iv)
+!$omp barrier
+
+
+!---------------------------------------------------------------------
+!  Note: as a result of the above call to makea:
+!        values of j used in indexing rowstr go from 1 --> lastrow-firstrow+1
+!        values of colidx which are col indexes go from firstcol --> lastcol
+!        So:
+!        Shift the col index vals from actual (firstcol --> lastcol )
+!        to local, i.e., (1 --> lastcol-firstcol+1)
+!---------------------------------------------------------------------
+!$omp do
+      do j=1,lastrow-firstrow+1
+         do k=rowstr(j),rowstr(j+1)-1
+            colidx(k) = colidx(k) - firstcol + 1
+         enddo
+      enddo
+!$omp end do nowait
+
+!---------------------------------------------------------------------
+!  set starting vector to (1, 1, .... 1)
+!---------------------------------------------------------------------
+!$omp do
+      do i = 1, na+1
+         x(i) = 1.0D0
+      enddo
+!$omp end do nowait
+!$omp do
+      do j=1, lastcol-firstcol+1
+         q(j) = 0.0d0
+         z(j) = 0.0d0
+         r(j) = 0.0d0
+         p(j) = 0.0d0
+      enddo
+!$omp end do nowait
+!$omp end parallel
+
+      zeta  = 0.0d0
+
+!---------------------------------------------------------------------
+!---->
+!  Do one iteration untimed to init all code and data page tables
+!---->                    (then reinit, start timing, to niter its)
+!---------------------------------------------------------------------
+      do it = 1, 1
+
+!---------------------------------------------------------------------
+!  The call to the conjugate gradient routine:
+!---------------------------------------------------------------------
+         call conj_grad ( rnorm )
+
+!---------------------------------------------------------------------
+!  zeta = shift + 1/(x.z)
+!  So, first: (x.z)
+!  Also, find norm of z
+!  So, first: (z.z)
+!---------------------------------------------------------------------
+         norm_temp1 = 0.0d0
+         norm_temp2 = 0.0d0
+!$omp parallel default(shared) private(j,norm_temp3)
+!$omp do reduction(+:norm_temp1,norm_temp2)
+         do j=1, lastcol-firstcol+1
+            norm_temp1 = norm_temp1 + x(j)*z(j)
+            norm_temp2 = norm_temp2 + z(j)*z(j)
+         enddo
+!$omp end do
+
+         norm_temp3 = 1.0d0 / sqrt( norm_temp2 )
+
+
+!---------------------------------------------------------------------
+!  Normalize z to obtain x
+!---------------------------------------------------------------------
+!$omp do
+         do j=1, lastcol-firstcol+1
+            x(j) = norm_temp3*z(j)
+         enddo
+!$omp end do nowait
+!$omp end parallel
+
+
+      enddo                              ! end of do one iteration untimed
+
+
+!---------------------------------------------------------------------
+!  set starting vector to (1, 1, .... 1)
+!---------------------------------------------------------------------
+!
+!
+!
+!$omp parallel do default(shared) private(i)
+      do i = 1, na+1
+         x(i) = 1.0D0
+      enddo
+!$omp end parallel do
+
+      zeta  = 0.0d0
+
+      call timer_stop( T_init )
+
+      write (*, 2000) timer_read(T_init)
+ 2000 format(' Initialization time = ',f15.3,' seconds')
+
+#ifdef M5_ANNOTATION
+      call m5_work_begin_interface
+#endif
+      call timer_start( T_bench )
+
+!---------------------------------------------------------------------
+!---->
+!  Main Iteration for inverse power method
+!---->
+!---------------------------------------------------------------------
+      do it = 1, niter
+
+!---------------------------------------------------------------------
+!  The call to the conjugate gradient routine:
+!---------------------------------------------------------------------
+         if ( timeron ) call timer_start( T_conj_grad )
+         call conj_grad ( rnorm )
+         if ( timeron ) call timer_stop( T_conj_grad )
+
+
+!---------------------------------------------------------------------
+!  zeta = shift + 1/(x.z)
+!  So, first: (x.z)
+!  Also, find norm of z
+!  So, first: (z.z)
+!---------------------------------------------------------------------
+         norm_temp1 = 0.0d0
+         norm_temp2 = 0.0d0
+!$omp parallel default(shared) private(j,norm_temp3)
+!$omp do reduction(+:norm_temp1,norm_temp2)
+         do j=1, lastcol-firstcol+1
+            norm_temp1 = norm_temp1 + x(j)*z(j)
+            norm_temp2 = norm_temp2 + z(j)*z(j)
+         enddo
+!$omp end do
+
+
+         norm_temp3 = 1.0d0 / sqrt( norm_temp2 )
+
+
+!$omp master
+         zeta = shift + 1.0d0 / norm_temp1
+         if( it .eq. 1 ) write( *,9000 )
+         write( *,9001 ) it, rnorm, zeta
+!$omp end master
+
+ 9000    format( /,'   iteration           ||r||                 zeta' )
+ 9001    format( 4x, i5, 6x, e21.14, f20.13 )
+
+!---------------------------------------------------------------------
+!  Normalize z to obtain x
+!---------------------------------------------------------------------
+!$omp do
+         do j=1, lastcol-firstcol+1
+            x(j) = norm_temp3*z(j)
+         enddo
+!$omp end do nowait
+!$omp end parallel
+
+
+      enddo                              ! end of main iter inv pow meth
+
+      call timer_stop( T_bench )
+
+#ifdef M5_ANNOTATION
+      call m5_work_end_interface
+#endif
+
+!---------------------------------------------------------------------
+!  End of timed section
+!---------------------------------------------------------------------
+
+      t = timer_read( T_bench )
+
+
+      write(*,100)
+ 100  format(' Benchmark completed ')
+
+      epsilon = 1.d-10
+      if (class .ne. 'U') then
+
+!         err = abs( zeta - zeta_verify_value)
+         err = abs( zeta - zeta_verify_value )/zeta_verify_value
+         if( (.not.ieee_is_nan(err)) .and. (err .le. epsilon) ) then
+            verified = .TRUE.
+            write(*, 200)
+            write(*, 201) zeta
+            write(*, 202) err
+ 200        format(' VERIFICATION SUCCESSFUL ')
+ 201        format(' Zeta is    ', E20.13)
+ 202        format(' Error is   ', E20.13)
+         else
+            verified = .FALSE.
+            write(*, 300)
+            write(*, 301) zeta
+            write(*, 302) zeta_verify_value
+ 300        format(' VERIFICATION FAILED')
+ 301        format(' Zeta                ', E20.13)
+ 302        format(' The correct zeta is ', E20.13)
+         endif
+      else
+         verified = .FALSE.
+         write (*, 400)
+         write (*, 401)
+         write (*, 201) zeta
+ 400     format(' Problem size unknown')
+ 401     format(' NO VERIFICATION PERFORMED')
+      endif
+
+
+      if( t .ne. 0. ) then
+         mflops = 1.0d-6 * 2*niter*dble( na )  &
+     &               * ( 3.+nonzer*dble(nonzer+1)  &
+     &                 + 25.*(5.+nonzer*dble(nonzer+1))  &
+     &                 + 3. ) / t
+      else
+         mflops = 0.d0
+      endif
+
+
+         call print_results('CG', class, na, 0, 0,  &
+     &                      niter, t,  &
+     &                      mflops, '          floating point',  &
+     &                      verified, npbversion, compiletime,  &
+     &                      cs1, cs2, cs3, cs4, cs5, cs6, cs7)
+
+
+
+ 600  format( i4, 2e19.12)
+
+
+!---------------------------------------------------------------------
+!      More timers
+!---------------------------------------------------------------------
+      if (.not.timeron) goto 999
+
+      tmax = timer_read(T_bench)
+      if (tmax .eq. 0.0) tmax = 1.0
+
+      write(*,800)
+ 800  format('  SECTION   Time (secs)')
+      do i=1, t_last
+         t = timer_read(i)
+         if (i.eq.t_init) then
+            write(*,810) t_names(i), t
+         else
+            write(*,810) t_names(i), t, t*100./tmax
+            if (i.eq.t_conj_grad) then
+               t = tmax - t
+               write(*,820) 'rest', t, t*100./tmax
+            endif
+         endif
+ 810     format(2x,a8,':',f9.3:'  (',f6.2,'%)')
+ 820     format('    --> ',a8,':',f9.3,'  (',f6.2,'%)')
+      end do
+
+ 999  continue
+
+
+      end                              ! end main
+
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+      subroutine conj_grad ( rnorm )
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!  Floating point arrays here are named as in NPB1 spec discussion of
+!  CG algorithm
+!---------------------------------------------------------------------
+
+      use cg_data
+      implicit none
+
+      integer   j
+      integer   cgit, cgitmax
+      integer(kz) k
+!C    integer(kz) i, iresidue
+
+      double precision   d, sum, rho, rho0, alpha, beta, rnorm, suml
+
+      data      cgitmax / 25 /
+
+
+      rho = 0.0d0
+      sum = 0.0d0
+
+!$omp parallel default(shared) private(j,k,cgit,suml,alpha,beta)  &
+!$omp&  shared(d,rho0,rho,sum)
+
+!---------------------------------------------------------------------
+!  Initialize the CG algorithm:
+!---------------------------------------------------------------------
+!$omp do
+      do j=1,naa+1
+         q(j) = 0.0d0
+         z(j) = 0.0d0
+         r(j) = x(j)
+         p(j) = r(j)
+      enddo
+!$omp end do
+
+
+!---------------------------------------------------------------------
+!  rho = r.r
+!  Now, obtain the norm of r: First, sum squares of r elements locally...
+!---------------------------------------------------------------------
+!$omp do reduction(+:rho)
+      do j=1, lastcol-firstcol+1
+         rho = rho + r(j)*r(j)
+      enddo
+!$omp end do
+
+!---------------------------------------------------------------------
+!---->
+!  The conj grad iteration loop
+!---->
+!---------------------------------------------------------------------
+      do cgit = 1, cgitmax
+
+!$omp master
+!---------------------------------------------------------------------
+!  Save a temporary of rho and initialize reduction variables
+!---------------------------------------------------------------------
+         rho0 = rho
+         d = 0.d0
+         rho = 0.d0
+!$omp end master
+!$omp barrier
+
+!---------------------------------------------------------------------
+!  q = A.p
+!  The partition submatrix-vector multiply: use workspace w
+!---------------------------------------------------------------------
+!
+!  NOTE: this version of the multiply is actually (slightly: maybe 5%)
+!        faster on the sp2 on 16 nodes than is the unrolled-by-2 version
+!        below.   On the Cray t3d, the reverse is true, i.e., the
+!        unrolled-by-two version is some 10% faster.
+!        The unrolled-by-8 version below is significantly faster
+!        on the Cray t3d - overall speed of code is 1.5 times faster.
+!
+!$omp do
+         do j=1,lastrow-firstrow+1
+            suml = 0.d0
+            do k=rowstr(j),rowstr(j+1)-1
+               suml = suml + a(k)*p(colidx(k))
+            enddo
+            q(j) = suml
+         enddo
+!$omp end do
+
+!C          do j=1,lastrow-firstrow+1
+!C             i = rowstr(j)
+!C             iresidue = mod( rowstr(j+1)-i, 2 )
+!C             suml = 0.d0
+!C             if( iresidue .eq. 1 )  &
+!C      &          suml = suml + a(i)*p(colidx(i))
+!C             do k=i+iresidue, rowstr(j+1)-2, 2
+!C                suml = suml + a(k)  *p(colidx(k))  &
+!C      &                     + a(k+1)*p(colidx(k+1))
+!C             enddo
+!C             q(j) = suml
+!C          enddo
+
+!C          do j=1,lastrow-firstrow+1
+!C             i = rowstr(j)
+!C             iresidue = mod( rowstr(j+1)-i, 8 )
+!C             suml = 0.d0
+!C             do k=i,i+iresidue-1
+!C                suml = suml +  a(k)*p(colidx(k))
+!C             enddo
+!C             do k=i+iresidue, rowstr(j+1)-8, 8
+!C                suml = suml + a(k  )*p(colidx(k  ))  &
+!C      &                   + a(k+1)*p(colidx(k+1))  &
+!C      &                   + a(k+2)*p(colidx(k+2))  &
+!C      &                   + a(k+3)*p(colidx(k+3))  &
+!C      &                   + a(k+4)*p(colidx(k+4))  &
+!C      &                   + a(k+5)*p(colidx(k+5))  &
+!C      &                   + a(k+6)*p(colidx(k+6))  &
+!C      &                   + a(k+7)*p(colidx(k+7))
+!C             enddo
+!C             q(j) = suml
+!C          enddo
+
+
+!---------------------------------------------------------------------
+!  Obtain p.q
+!---------------------------------------------------------------------
+!$omp do reduction(+:d)
+         do j=1, lastcol-firstcol+1
+            d = d + p(j)*q(j)
+         enddo
+!$omp end do
+
+
+!---------------------------------------------------------------------
+!  Obtain alpha = rho / (p.q)
+!---------------------------------------------------------------------
+         alpha = rho0 / d
+
+!---------------------------------------------------------------------
+!  Obtain z = z + alpha*p
+!  and    r = r - alpha*q
+!---------------------------------------------------------------------
+!$omp do reduction(+:rho)
+         do j=1, lastcol-firstcol+1
+            z(j) = z(j) + alpha*p(j)
+            r(j) = r(j) - alpha*q(j)
+!         enddo
+
+!---------------------------------------------------------------------
+!  rho = r.r
+!  Now, obtain the norm of r: First, sum squares of r elements locally...
+!---------------------------------------------------------------------
+!         do j=1, lastcol-firstcol+1
+            rho = rho + r(j)*r(j)
+         enddo
+!$omp end do
+
+!---------------------------------------------------------------------
+!  Obtain beta:
+!---------------------------------------------------------------------
+         beta = rho / rho0
+
+!---------------------------------------------------------------------
+!  p = r + beta*p
+!---------------------------------------------------------------------
+!$omp do
+         do j=1, lastcol-firstcol+1
+            p(j) = r(j) + beta*p(j)
+         enddo
+!$omp end do
+
+
+      enddo                             ! end of do cgit=1,cgitmax
+
+
+!---------------------------------------------------------------------
+!  Compute residual norm explicitly:  ||r|| = ||x - A.z||
+!  First, form A.z
+!  The partition submatrix-vector multiply
+!---------------------------------------------------------------------
+!$omp do
+      do j=1,lastrow-firstrow+1
+         suml = 0.d0
+         do k=rowstr(j),rowstr(j+1)-1
+            suml = suml + a(k)*z(colidx(k))
+         enddo
+         r(j) = suml
+      enddo
+!$omp end do
+
+
+!---------------------------------------------------------------------
+!  At this point, r contains A.z
+!---------------------------------------------------------------------
+!$omp do reduction(+:sum)
+      do j=1, lastcol-firstcol+1
+         suml = x(j) - r(j)
+         sum  = sum + suml*suml
+      enddo
+!$omp end do nowait
+!$omp end parallel
+
+      rnorm = sqrt( sum )
+
+
+
+      return
+      end                               ! end of routine conj_grad
+
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+      subroutine makea( n, nz, a, colidx, rowstr,  &
+     &                  firstrow, lastrow, firstcol, lastcol,  &
+     &                  arow, acol, aelt, v, iv )
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      use tinfo
+      use cg_data, only : nonzer, rcond, shift
+
+      implicit none
+
+      integer             n
+      integer(kz)         nz, rowstr(n+1)
+      integer             firstrow, lastrow, firstcol, lastcol
+      integer             colidx(nz)
+      integer             iv(n+nz), arow(n), acol(nonzer+1,n)
+      double precision    aelt(nonzer+1,n), v(nz)
+      double precision    a(nz)
+
+!---------------------------------------------------------------------
+!       generate the test problem for benchmark 6
+!       makea generates a sparse matrix with a
+!       prescribed sparsity distribution
+!
+!       parameter    type        usage
+!
+!       input
+!
+!       n            i           number of cols/rows of matrix
+!       nz           i           nonzeros as declared array size
+!       rcond        r*8         condition number
+!       shift        r*8         main diagonal shift
+!
+!       output
+!
+!       a            r*8         array for nonzeros
+!       colidx       i           col indices
+!       rowstr       i           row pointers
+!
+!       workspace
+!
+!       iv, arow, acol i
+!       v, aelt        r*8
+!---------------------------------------------------------------------
+
+      integer          i, iouter, ivelt, nzv, nn1
+      integer          ivc(nonzer+1)
+      double precision vc(nonzer+1)
+
+!---------------------------------------------------------------------
+!      nonzer is approximately  (int(sqrt(nnza /n)));
+!---------------------------------------------------------------------
+
+      external          sparse, sprnvc, vecset
+!$    integer           omp_get_num_threads, omp_get_thread_num
+!$    external          omp_get_num_threads, omp_get_thread_num
+      integer           work
+
+
+!---------------------------------------------------------------------
+!    nn1 is the smallest power of two not less than n
+!---------------------------------------------------------------------
+
+      nn1 = 1
+ 50   continue
+        nn1 = 2 * nn1
+        if (nn1 .lt. n) goto 50
+
+!---------------------------------------------------------------------
+!  Generate nonzero positions and save for the use in sparse.
+!---------------------------------------------------------------------
+      num_threads = 1
+!$    num_threads = omp_get_num_threads()
+      myid = 0
+!$    myid  = omp_get_thread_num()
+      if (num_threads .gt. max_threads) then
+         if (myid .eq. 0) write(*,100) num_threads, max_threads
+100      format(' Warning: num_threads',i6,  &
+     &          ' exceeded an internal limit',i6)
+         num_threads = max_threads
+      endif
+      work  = (n + num_threads - 1)/num_threads
+      ilow  = work * myid + 1
+      ihigh = ilow + work - 1
+      if (ihigh .gt. n) ihigh = n
+
+      do iouter = 1, ihigh
+         nzv = nonzer
+         call sprnvc( n, nzv, nn1, vc, ivc )
+         if ( iouter .ge. ilow ) then
+            call vecset( n, vc, ivc, nzv, iouter, .5D0 )
+            arow(iouter) = nzv
+            do ivelt = 1, nzv
+               acol(ivelt, iouter) = ivc(ivelt)
+               aelt(ivelt, iouter) = vc(ivelt)
+            enddo
+         endif
+      enddo
+!$omp barrier
+
+!---------------------------------------------------------------------
+!       ... make the sparse matrix from list of elements with duplicates
+!           (v and iv are used as  workspace)
+!---------------------------------------------------------------------
+      call sparse( a, colidx, rowstr, n, nz, nonzer, arow, acol,  &
+     &             aelt, firstrow, lastrow,  &
+     &             v, iv(1), iv(nz+1), rcond, shift )
+      return
+
+      end
+!-------end   of makea------------------------------
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+      subroutine sparse( a, colidx, rowstr, n, nz, nonzer, arow, acol,  &
+     &                   aelt, firstrow, lastrow,  &
+     &                   v, iv, nzloc, rcond, shift )
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      use tinfo
+
+      implicit           none
+
+      integer            colidx(*), iv(*)
+      integer            firstrow, lastrow
+      integer            n, nonzer, arow(*), acol(nonzer+1,*)
+      integer(kz)        nz, rowstr(*)
+      double precision   a(*), aelt(nonzer+1,*), v(*), rcond, shift
+
+!---------------------------------------------------------------------
+!       rows range from firstrow to lastrow
+!       the rowstr pointers are defined for nrows = lastrow-firstrow+1 values
+!---------------------------------------------------------------------
+      integer            nzloc(n), nrows
+
+!---------------------------------------------------
+!       generate a sparse matrix from a list of
+!       [col, row, element] triples
+!---------------------------------------------------
+
+      integer            i, j, jcol
+      integer(kz)        j1, j2, nza, k, kk, nzrow
+      double precision   xi, size, scale, ratio, va
+
+!---------------------------------------------------------------------
+!    how many rows of result
+!---------------------------------------------------------------------
+      nrows = lastrow - firstrow + 1
+      j1 = ilow + 1
+      j2 = ihigh + 1
+
+!---------------------------------------------------------------------
+!     ...count the number of triples in each row
+!---------------------------------------------------------------------
+      do j = j1, j2
+         rowstr(j) = 0
+      enddo
+
+      do i = 1, n
+         do nza = 1, arow(i)
+            j = acol(nza, i)
+            if (j.ge.ilow .and. j.le.ihigh) then
+               j = j + 1
+               rowstr(j) = rowstr(j) + arow(i)
+            endif
+         end do
+      end do
+
+      if (myid .eq. 0) then
+         rowstr(1) = 1
+         j1 = 1
+      endif
+      do j = j1+1, j2
+         rowstr(j) = rowstr(j) + rowstr(j-1)
+      enddo
+      if (myid .lt. num_threads) last_n(myid) = rowstr(j2)
+!$omp barrier
+
+      nzrow = 0
+      if (myid .lt. num_threads) then
+         do i = 0, myid-1
+            nzrow = nzrow + last_n(i)
+         end do
+      endif
+      if (nzrow .gt. 0) then
+         do j = j1, j2
+            rowstr(j) = rowstr(j) + nzrow
+         enddo
+      endif
+!$omp barrier
+      nza = rowstr(nrows+1) - 1
+
+!---------------------------------------------------------------------
+!     ... rowstr(j) now is the location of the first nonzero
+!           of row j of a
+!---------------------------------------------------------------------
+
+      if (nza .gt. nz) then
+!$omp master
+         write(*,*) 'Space for matrix elements exceeded in sparse'
+         write(*,*) 'nza, nzmax = ',nza, nz
+!$omp end master
+         stop
+      endif
+
+
+!---------------------------------------------------------------------
+!     ... preload data pages
+!---------------------------------------------------------------------
+      do j = ilow, ihigh
+         do k = rowstr(j), rowstr(j+1)-1
+             v(k) = 0.d0
+             iv(k) = 0
+         enddo
+         nzloc(j) = 0
+      enddo
+
+!---------------------------------------------------------------------
+!     ... generate actual values by summing duplicates
+!---------------------------------------------------------------------
+
+      size = 1.0D0
+      ratio = rcond ** (1.0D0 / dfloat(n))
+
+      do i = 1, n
+         do nza = 1, arow(i)
+            j = acol(nza, i)
+
+            if (j .lt. ilow .or. j .gt. ihigh) goto 60
+
+            scale = size * aelt(nza, i)
+            do nzrow = 1, arow(i)
+               jcol = acol(nzrow, i)
+               va = aelt(nzrow, i) * scale
+
+!---------------------------------------------------------------------
+!       ... add the identity * rcond to the generated matrix to bound
+!           the smallest eigenvalue from below by rcond
+!---------------------------------------------------------------------
+               if (jcol .eq. j .and. j .eq. i) then
+                  va = va + rcond - shift
+               endif
+
+               do k = rowstr(j), rowstr(j+1)-1
+                  if (iv(k) .gt. jcol) then
+!---------------------------------------------------------------------
+!       ... insert colidx here orderly
+!---------------------------------------------------------------------
+                     do kk = rowstr(j+1)-2, k, -1
+                        if (iv(kk) .gt. 0) then
+                           v(kk+1)  = v(kk)
+                           iv(kk+1) = iv(kk)
+                        endif
+                     enddo
+                     iv(k) = jcol
+                     v(k)  = 0.d0
+                     goto 40
+                  else if (iv(k) .eq. 0) then
+                     iv(k) = jcol
+                     goto 40
+                  else if (iv(k) .eq. jcol) then
+!---------------------------------------------------------------------
+!       ... mark the duplicated entry
+!---------------------------------------------------------------------
+                     nzloc(j) = nzloc(j) + 1
+                     goto 40
+                  endif
+               enddo
+               print *,'internal error in sparse: i=',i
+               stop
+   40          continue
+               v(k) = v(k) + va
+            enddo
+   60       continue
+         enddo
+         size = size * ratio
+      enddo
+!$omp barrier
+
+
+!---------------------------------------------------------------------
+!       ... remove empty entries and generate final results
+!---------------------------------------------------------------------
+      do j = ilow+1, ihigh
+         nzloc(j) = nzloc(j) + nzloc(j-1)
+      enddo
+      if (myid .lt. num_threads) last_n(myid) = nzloc(ihigh)
+!$omp barrier
+
+      nzrow = 0
+      if (myid .lt. num_threads) then
+         do i = 0, myid-1
+            nzrow = nzrow + last_n(i)
+         end do
+      endif
+      if (nzrow .gt. 0) then
+         do j = ilow, ihigh
+            nzloc(j) = nzloc(j) + nzrow
+         enddo
+      endif
+!$omp barrier
+
+!$omp do
+      do j = 1, nrows
+         if (j .gt. 1) then
+            j1 = rowstr(j) - nzloc(j-1)
+         else
+            j1 = 1
+         endif
+         j2 = rowstr(j+1) - nzloc(j) - 1
+         nza = rowstr(j)
+         do k = j1, j2
+            a(k) = v(nza)
+            colidx(k) = iv(nza)
+            nza = nza + 1
+         enddo
+      enddo
+!$omp end do
+!$omp do
+      do j = 2, nrows+1
+         rowstr(j) = rowstr(j) - nzloc(j-1)
+      enddo
+!$omp end do
+      nza = rowstr(nrows+1) - 1
+
+
+!C       write (*, 11000) nza
+      return
+11000   format ( //,'final nonzero count in sparse ',  &
+     &            /,'number of nonzeros       = ', i16 )
+      end
+!-------end   of sparse-----------------------------
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+      subroutine sprnvc( n, nz, nn1, v, iv )
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      use cg_data, only : amult, tran
+
+      implicit           none
+
+      double precision   v(*)
+      integer            n, nz, nn1, iv(*)
+
+
+!---------------------------------------------------------------------
+!       generate a sparse n-vector (v, iv)
+!       having nzv nonzeros
+!
+!       mark(i) is set to 1 if position i is nonzero.
+!       mark is all zero on entry and is reset to all zero before exit
+!       this corrects a performance bug found by John G. Lewis, caused by
+!       reinitialization of mark on every one of the n calls to sprnvc
+!---------------------------------------------------------------------
+
+        integer            nzv, ii, i, icnvrt
+
+        external           randlc, icnvrt
+        double precision   randlc, vecelt, vecloc
+
+
+        nzv = 0
+
+100     continue
+        if (nzv .ge. nz) goto 110
+
+         vecelt = randlc( tran, amult )
+
+!---------------------------------------------------------------------
+!   generate an integer between 1 and n in a portable manner
+!---------------------------------------------------------------------
+         vecloc = randlc(tran, amult)
+         i = icnvrt(vecloc, nn1) + 1
+         if (i .gt. n) goto 100
+
+!---------------------------------------------------------------------
+!  was this integer generated already?
+!---------------------------------------------------------------------
+         do ii = 1, nzv
+            if (iv(ii) .eq. i) goto 100
+         enddo
+         nzv = nzv + 1
+         v(nzv) = vecelt
+         iv(nzv) = i
+         goto 100
+110     continue
+
+      return
+      end
+!-------end   of sprnvc-----------------------------
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+      function icnvrt(x, ipwr2)
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      implicit           none
+
+      double precision   x
+      integer            ipwr2, icnvrt
+
+!---------------------------------------------------------------------
+!    scale a double precision number x in (0,1) by a power of 2 and chop it
+!---------------------------------------------------------------------
+      icnvrt = int(ipwr2 * x)
+
+      return
+      end
+!-------end   of icnvrt-----------------------------
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+      subroutine vecset(n, v, iv, nzv, i, val)
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      implicit           none
+
+      integer            n, iv(*), nzv, i, k
+      double precision   v(*), val
+
+!---------------------------------------------------------------------
+!       set ith element of sparse vector (v, iv) with
+!       nzv nonzeros to val
+!---------------------------------------------------------------------
+
+      logical set
+
+      set = .false.
+      do k = 1, nzv
+         if (iv(k) .eq. i) then
+            v(k) = val
+            set  = .true.
+         endif
+      enddo
+      if (.not. set) then
+         nzv     = nzv + 1
+         v(nzv)  = val
+         iv(nzv) = i
+      endif
+      return
+      end
+!-------end   of vecset-----------------------------
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/CG/cg_data.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/CG/cg_data.f90
new file mode 100644
index 000000000..8a935b913
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/CG/cg_data.f90
@@ -0,0 +1,118 @@
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+!
+!  cg_data module
+!
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      module cg_data
+
+      include 'npbparams.h'
+
+!---------------------------------------------------------------------
+!  Class specific parameters are defined in the npbparams.h
+!  include file, which is written by the sys/setparams.c program.
+!---------------------------------------------------------------------
+
+
+! ... dimension parameters
+      integer(kz) nz, naz
+      parameter( nz = int(na,kz)*(nonzer+1)*(nonzer+1) )
+      parameter( naz = int(na,kz)*(nonzer+1) )
+
+! ... main_int_mem
+      integer, allocatable ::  colidx(:),  &
+     &                         iv(:),  arow(:), acol(:)
+      integer(kz), allocatable ::  rowstr(:)
+
+! ... main_flt_mem
+      double precision, allocatable ::  &
+     &                         v(:), aelt(:), a(:),  &
+     &                         x(:),  &
+     &                         z(:),  &
+     &                         p(:),  &
+     &                         q(:),  &
+     &                         r(:)
+
+! ... partition size
+      integer                  naa,  &
+     &                         firstrow,  &
+     &                         lastrow,  &
+     &                         firstcol,  &
+     &                         lastcol
+      integer(kz)              nzz
+
+      double precision         amult, tran
+!$omp threadprivate (amult, tran)
+
+      external         timer_read
+      double precision timer_read
+
+      integer T_init, T_bench, T_conj_grad, T_last
+      parameter (T_init=1, T_bench=2, T_conj_grad=3, T_last=3)
+
+      logical timeron
+
+      end module cg_data
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+!
+!  tinfo module
+!
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      module tinfo
+
+      use cg_data, only : kz
+      integer        max_threads
+      parameter      (max_threads=1024)
+
+      integer(kz)    last_n(0:max_threads)
+
+      integer        myid, num_threads, ilow, ihigh
+!$omp threadprivate (myid, num_threads, ilow, ihigh)
+
+      end module tinfo
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine alloc_space
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+! allocate space dynamically for data arrays
+!---------------------------------------------------------------------
+
+      use cg_data
+      implicit none
+
+      integer ios
+
+
+      allocate (  &
+     &          colidx(nz), rowstr(na+1),  &
+     &          iv(nz+na),  arow(na), acol(naz),  &
+     &          v(nz), aelt(naz), a(nz),  &
+     &          x(na+2),  &
+     &          z(na+2),  &
+     &          p(na+2),  &
+     &          q(na+2),  &
+     &          r(na+2),  &
+     &          stat = ios)
+
+      if (ios .ne. 0) then
+         write(*,*) 'Error encountered in allocating space'
+         stop
+      endif
+
+      return
+      end
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/DC/ADC.par b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/DC/ADC.par
new file mode 100644
index 000000000..05f9ce770
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/DC/ADC.par
@@ -0,0 +1,5 @@
+attrNum=12
+measuresNum=1
+tuplesNum=100
+INVERSE_ENDIAN=0
+fileName=ADC
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/DC/Makefile b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/DC/Makefile
new file mode 100644
index 000000000..11f6e4e54
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/DC/Makefile
@@ -0,0 +1,31 @@
+SHELL=/bin/sh
+BENCHMARK=dc
+BENCHMARKU=DC
+
+include ../config/make.def
+
+include ../sys/make.common
+
+OBJS = adc.o dc.o extbuild.o rbt.o jobcntl.o \
+	${COMMON}/c_print_results.o  \
+	${COMMON}/c_timers.o ${COMMON}/c_wtime.o
+
+# npbparams.h is provided for backward compatibility with NPB compilation
+# header.h: npbparams.h
+
+${PROGRAM}: config ${OBJS} 
+	${CLINK} ${CLINKFLAGS} -o ${PROGRAM} ${OBJS} ${C_LIB}
+
+.c.o:
+	${CCOMPILE} $<
+
+adc.o:      adc.c npbparams.h
+dc.o:       dc.c adcc.h adc.h macrodef.h npbparams.h
+extbuild.o: extbuild.c adcc.h adc.h macrodef.h npbparams.h
+rbt.o:      rbt.c adcc.h adc.h rbt.h macrodef.h npbparams.h
+jobcntl.o:  jobcntl.c adcc.h adc.h macrodef.h npbparams.h
+
+clean:
+	- rm -f *.o 
+	- rm -f npbparams.h core
+	- rm -f {../,}ADC.{logf,view,dat,viewsz,groupby,chunks}.*
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/DC/README b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/DC/README
new file mode 100644
index 000000000..0c895fc83
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/DC/README
@@ -0,0 +1,83 @@
+1. Compilation
+   The DC benchmark uses the same directory tree as NPB3.0 (and NPB2.x).
+   Before compilation, check the configuration file
+   'make.def' in the config directory and modify it if necessary
+   (an example of make.def is provided in the DC directory).
+   Then run:
+      make dc CLASS=S
+
+   If a compiler complains about type 'int64' already defined, add
+   "-DHAS_INT64" to the CFLAGS list in make.def.
+
+2. The OpenMP environment needs to be set before the program can be executed.
+   First, set the number of threads:
+   setenv OMP_NUM_THREADS 4
+   Then, to fix OpenMP implementations on some machines:
+   limit stacksize unlimited
+   If running on Altix 
+   setenv KMP_MONITOR_STACKSIZE 50m
+
+3. Run
+   A text file ADC.par is used to set DC parameters when the class 
+   is undefined (U). 
+   The file has 5 lines. The lines with 'key' words attrNum, measuresNum, 
+   and tuplesNum define the number of dimensions, measures,
+   and input tuples, respectively. A special parameter, INVERSE_ENDIAN,
+   allows us to create data in non-native endian format (INVERSE_ENDIAN=1).
+   The last parameter (fileName) specifies a DC file set name, including
+   (optionally) a full path to a directory which will contain all
+   DC related files.
+
+   An example of the DC parameter file is as follows:
+
+   attrNum=9
+   measuresNum=1
+   tuplesNum=125000
+   class=U
+   INVERSE_ENDIAN=0
+   fileName=ADC
+   
+   After the parameters are set, run the benchmark:
+   bin/dc.S 100000000 DC/ADC.par 
+   where 100000000 is the memory size allowed to be allocated for 
+   the in-core data.
+   
+4. DC processing modes
+   The DC benchmark can be run in two modes (in-core and out-of-core).
+   The desired mode should be set before compilation in the file adc.h.
+   If the flag IN_CORE is on, the benchmark will calculate all views in main
+   memory. In this case we can use an additional flag VIEW_FILE_OUTPUT to
+   allow writing all views into disk files.
+                
+   If the flag IN_CORE is off, the DC benchmark will run in a regular mode
+   using disks to store interim and result data which may not fit in main
+   memory.
+
+   _FILE_OFFSET_BITS=64 and _LARGEFILE64_SOURCE are standard compiler flags
+   which allow DC to work with files larger than 2GB. 
+
+   OPTIMIZATION turns on some nonstandard DC optimizations such as obtaining
+   a view by scanning existing views. These optimizations do not always 
+   guarantee reduction in the computing time.
+
+5. Tested architectures:
+   SUN Ultrasparc 60
+   SUNFire 880
+   Origin 2000, 3000, 3800
+   MAC G4 
+   Xeon + Mandrake Linux
+   SGI Altix
+
+6. The setparams utility generates the npbparams.h file only
+   for compatibility with the existing make facility of NPB. For the same
+   reason, CLASS is appended to the DC executable name. It does not limit
+   the problem sizes the executable can run. The class is an input value
+   specified in the ADC.par file. Providing ADC.par overrides the compiled
+   defaults in the npbparams.h file.
+
+7. Known issues
+   If the benchmark runs out of disk space, a message like
+   "Write error from WriteToFile()" may not be printed. Instead,
+   the benchmark returns with UNSUCCESSFUL verification. In this case 
+   users are advised to check whether the file system is full before 
+   reporting a problem with the benchmark.
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/DC/adc.c b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/DC/adc.c
new file mode 100644
index 000000000..26f88c478
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/DC/adc.c
@@ -0,0 +1,636 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+
+#include "adc.h"
+
+#define BlockSize 1024
+
+void swap4(void * num){
+  char t, *p;
+  p = (char *) num;
+  t = *p; *p = *(p + 3); *(p + 3) = t;
+  t = *(p + 1); *(p + 1) = *(p + 2); *(p + 2) = t;
+}
+void swap8(void * num){
+  char t, *p;
+  p = (char *) num;	  
+  t = *p; *p = *(p + 7); *(p + 7) = t;
+  t = *(p + 1); *(p + 1) = *(p + 6); *(p + 6) = t;
+  t = *(p + 2); *(p + 2) = *(p + 5); *(p + 5) = t;
+  t = *(p + 3); *(p + 3) = *(p + 4); *(p + 4) = t;
+}
+void initADCpar(ADC_PAR *par){
+  par->ndid=0;
+  par->dim=5;
+  par->mnum=1;
+  par->tuplenum=100;
+/*  par->isascii=1; */
+  par->inverse_endian=0;
+  par->filename="ADC";
+  par->clss='U';
+}
+int ParseParFile(char* parfname,ADC_PAR *par);
+int GenerateADC(ADC_PAR *par);
+
+typedef struct Factorization{
+  long int *mlt;
+  long int *exp;
+  long int dim;
+} Factorization;
+
+void ShowFactorization(Factorization *nmbfct){
+  int i=0;
+  for(i=0;i<nmbfct->dim;i++){
+    if(nmbfct->mlt[i]==1){
+      if(i==0) fprintf(stdout,"prime.");
+      break;
+    }
+    if(i>0) fprintf(stdout,"*");
+    if(nmbfct->exp[i]==1)
+      fprintf(stdout,"%ld",nmbfct->mlt[i]);    
+    else 
+      fprintf(stdout,"%ld^%ld",nmbfct->mlt[i],
+                               nmbfct->exp[i]);
+  }
+  fprintf(stdout,"\n");
+}
+
+long int adcprime[]={
+  421,601,631,701,883,
+  419,443,647,21737,31769,
+  1427,18353,22817,34337,98717,
+  3527,8693,9677,11093,18233};
+  
+long int ListFirstPrimes(long int mpr,long int *prlist){
+/*
+  fprintf(stdout,"ListFirstPrimes: listing primes less than %ld...\n",
+                 mpr);
+*/
+  long int prnum=0;
+  int composed=0;
+  long int nmb=0,j=0;
+  prlist[prnum++]=2;
+  prlist[prnum++]=3;
+  prlist[prnum++]=5;
+  prlist[prnum++]=7;
+  for(nmb=8;nmb<mpr;nmb++){
+    composed=0;
+    for(j=0;prlist[j]*prlist[j]<=nmb;j++){
+      if(nmb-prlist[j]*((long int)(nmb/prlist[j]))==0){
+        composed=1;
+	break;
+      }
+    }
+    if(composed==0) prlist[prnum++]=nmb;
+  }
+/*  fprintf(stdout,"ListFirstPrimes: Done.\n"); */
+  return prnum;
+}
+
+long long int LARGE_NUM=0x4FFFFFFFFFFFFFFFLL;
+long long int maxprmfctr=59;
+
+long long int GetLCM(long long int mask,
+                     Factorization **fctlist,
+		     long int *adcexpons){
+  int i=0,j=0,k=0;
+  int* expons=(int*) calloc(maxprmfctr+1,sizeof(int));
+  long long int LCM=1;
+  long int pr=2;
+  int genexp=1,lexp=1,fct=2;
+
+  for(i=0;i<maxprmfctr+1;i++)expons[i]=0;
+  i=0;
+  while(mask>0){
+    if(mask==2*(mask/2)){
+      mask=mask>>1;
+      i++;  
+      continue;
+    }
+    pr=adcprime[i];
+    genexp=adcexpons[i];
+/*
+  fprintf(stdout,"[%ld,%ld]\n",pr,genexp);
+  ShowFactorization(fctlist[genexp]);
+*/
+    for(j=0;j<fctlist[pr-1]->dim;j++){
+      fct=fctlist[pr-1]->mlt[j];
+      lexp=fctlist[pr-1]->exp[j];
+
+      for(k=0;k<fctlist[genexp]->dim;k++){
+        if(fctlist[genexp]->mlt[k]==1) break;
+        if(fct!=fctlist[genexp]->mlt[k]) continue;
+        lexp-=fctlist[genexp]->exp[k];
+	break;
+      }
+      if(expons[fct]<lexp)expons[fct]=lexp;
+    }
+    mask=mask>>1;
+    i++;
+  }
+/*
+for(i=0;i<maxprmfctr;i++){
+  if(expons[i]>0) fprintf(stdout,"*%ld^%ld",i,expons[i]);
+}
+fprintf(stdout,"\n");
+*/
+  for(i=0;i<=maxprmfctr;i++){
+    while(expons[i]>0){
+      LCM*=i;
+      if(LCM>LARGE_NUM/maxprmfctr){ free(expons); return LCM; }
+      expons[i]--;
+    }
+  }
+/*  fprintf(stdout,"==== %lld\n",LCM); */
+  free(expons);
+  return LCM;
+}
+void ExtendFactors(long int nmb,long int firstdiv,
+                   Factorization *nmbfct,Factorization **fctlist){
+  Factorization *divfct=fctlist[nmb/firstdiv];
+  int fdivused=0;
+  int multnum=0;
+  int i=0;
+/*  fprintf(stdout,"==== %lld %ld %ld\n",divfct->dim,nmb,firstdiv); */
+   for(i=0;i<divfct->dim;i++){
+    if(divfct->mlt[i]==1){
+      if(fdivused==0){
+        nmbfct->mlt[multnum]=firstdiv;
+        nmbfct->exp[multnum]=1;   
+      }
+      break;
+    }
+    if(divfct->mlt[i]<firstdiv){
+      nmbfct->mlt[i]=divfct->mlt[i];
+      nmbfct->exp[i]=divfct->exp[i];
+      multnum++;
+    }else if(divfct->mlt[i]==firstdiv){
+      nmbfct->mlt[i]=divfct->mlt[i];
+      nmbfct->exp[i]=divfct->exp[i]+1;   
+      fdivused=1;
+    }else{
+      int j=i;
+      if(fdivused==0) j=i+1;
+      nmbfct->mlt[j]=divfct->mlt[i];
+      nmbfct->exp[j]=divfct->exp[i];    
+    }
+  }
+}
+void GetFactorization(long int prnum,long int *prlist,
+                            Factorization **fctlist){
+/*fprintf(stdout,"GetFactorization: factorizing first %ld numbers.\n",
+                prnum);*/
+  long int i=0,j=0;
+  Factorization *fct=(Factorization*)malloc(2*sizeof(Factorization)); 
+  long int len=0,isft=0,div=1,firstdiv=1;
+
+  fct->dim=2;
+  fct->mlt=(long int*)malloc(2*sizeof(long int));
+  fct->exp=(long int*)malloc(2*sizeof(long int));
+  for(i=0;i<fct->dim;i++){
+    fct->mlt[i]=1;
+    fct->exp[i]=0;
+  }
+  fct->mlt[0]=2;
+  fct->exp[0]=1;
+  fctlist[2]=fct;
+
+  fct=(Factorization*)malloc(2*sizeof(Factorization));
+  fct->dim=2;
+  fct->mlt=(long int*)malloc(2*sizeof(long int));
+  fct->exp=(long int*)malloc(2*sizeof(long int));
+  for(i=0;i<fct->dim;i++){
+    fct->mlt[i]=1;
+    fct->exp[i]=0;
+  }
+  fct->mlt[0]=3;
+  fct->exp[0]=1;
+  fctlist[3]=fct;
+ 
+  for(i=0;i<prlist[prnum-1];i++){
+    len=0;
+    isft=i;
+    while(isft>0){
+      len++;
+      isft=isft>>1;
+    }
+    fct=(Factorization*)malloc(2*sizeof(Factorization));
+    fct->dim=len;
+    if (len==0) len=1;
+    fct->mlt=(long int*)malloc(len*sizeof(long int));
+    fct->exp=(long int*)malloc(len*sizeof(long int));
+    for(j=0;j<fct->dim;j++){
+      fct->mlt[j]=1;
+      fct->exp[j]=0;
+    }
+    div=1;
+    for(j=0;prlist[j]*prlist[j]<=i;j++){
+      firstdiv=prlist[j];
+      if(i-firstdiv*((long int)i/firstdiv)==0){
+        div=firstdiv;
+        if(firstdiv*firstdiv==i){
+          fct->mlt[0]=firstdiv;
+          fct->exp[0]=2;	  
+	}else{
+	  ExtendFactors(i,firstdiv,fct,fctlist);
+        }
+	break;
+      }
+    }
+    if(div==1){
+      fct->mlt[0]=i;
+      fct->exp[0]=1;   
+    }
+    fctlist[i]=fct;
+/*
+     ShowFactorization(fct);
+*/
+  }
+/*  fprintf(stdout,"GetFactorization: Done.\n"); */
+}
+
+long int adcexp[]={
+  11,13,17,19,23,
+  23,29,31,37,41,	     	  
+  41,43,47,53,59,	     	  
+  3,5,7,11,13};
+long int adcexpS[]={
+  11,13,17,19,23};
+long int adcexpW[]={  
+  2*2,2*2*2*5,2*3,2*2*5,2*3*7,
+  23,29,31,2*2,2*2*19};
+long int adcexpA[]={  
+  2*2,2*2*2*5,2*3,2*2*5,2*3*7,
+  2*19,2*13,2*19,2*2*2*13*19,2*2*2*19*19,                    
+  2*23,2*2*2*2,2*2*2*2*2*23,2*2*2*2*2,2*2*23};
+long int adcexpB[]={  
+  2*2*7,2*2*2*5,2*3*7,2*2*5*7,2*3*7*7,
+  2*19,2*13,2*19,2*2*2*13*19,2*2*2*19*19,                      
+  2*31,2*2*2*2*31,2*2*2*2*2*31,2*2*2*2*2*29,2*2*29,
+  2*43,2*2,2*2,2*2*47,2*2*2*43};  
+long int UpPrimeLim=100000;
+
+typedef struct dc_view{
+  long long int vsize;
+  long int vidx;
+} DC_view;
+
+int CompareSizesByValue( const void* sz0, const void* sz1) {
+long long int *size0=(long long int*)sz0,
+              *size1=(long long int*)sz1;
+  int res=0;
+  if(*size0-*size1>0) res=1;
+  else if(*size0-*size1<0) res=-1;
+  return res;
+}
+int CompareViewsBySize( const void* vw0, const void* vw1) {
+DC_view *lvw0=(DC_view *)vw0, *lvw1=(DC_view *)vw1;
+  int res=0;
+  if(lvw0->vsize>lvw1->vsize) res=1;
+  else if(lvw0->vsize<lvw1->vsize) res=-1;
+  else if(lvw0->vidx>lvw1->vidx) res=1;
+  else if(lvw0->vidx<lvw1->vidx) res=-1;
+  return res;
+}
+
+int CalculateVeiwSizes(ADC_PAR *par){
+  unsigned long long totalInBytes = 0;
+  unsigned long long nViewDims, nCubeTuples = 0;
+ 
+  const char *adcfname=par->filename;
+  int NDID=par->ndid;
+  char clss=par->clss;
+  int dcdim=par->dim;
+  long long int tnum=par->tuplenum;
+  long long int i=0,j=0;
+  Factorization  
+    **fctlist=(Factorization **) calloc(UpPrimeLim,sizeof(Factorization *));
+  long int *prlist=(long int *) calloc(UpPrimeLim,sizeof(long int));
+  int prnum=ListFirstPrimes(UpPrimeLim,prlist);
+  DC_view *dcview=(DC_view *)calloc((1<<dcdim),sizeof(DC_view));
+  const char* vszefname0;
+  char *vszefname=NULL;
+  FILE* view=NULL;
+  int minvn=1, maxvn=(1<<dcdim), vinc=1;
+  long idx=0;
+
+  GetFactorization(prnum,prlist,fctlist); 
+  for(i=1;i<(1<<dcdim);i++){   
+    long long int LCM=1;
+    switch(clss){
+      case 'U':
+        LCM=GetLCM(i,fctlist,adcexp);
+      break;
+      case 'S':
+        LCM=GetLCM(i,fctlist,adcexpS);
+      break;
+      case 'W':
+        LCM=GetLCM(i,fctlist,adcexpW);
+      break;
+      case 'A':
+        LCM=GetLCM(i,fctlist,adcexpA);
+      break;
+      case 'B':
+        LCM=GetLCM(i,fctlist,adcexpB);
+      break;
+    }
+    if(LCM>tnum) LCM=tnum;
+    dcview[i].vsize=LCM;
+    dcview[i].vidx=i;
+  }
+  for(i=0;i<UpPrimeLim;i++){
+    if(!fctlist[i]) continue;
+    if(fctlist[i]->mlt) free(fctlist[i]->mlt); 
+    if(fctlist[i]->exp) free(fctlist[i]->exp); 
+    free(fctlist[i]);
+  }
+  free(fctlist);
+  free(prlist);
+   
+  vszefname0="view.sz";
+  vszefname=(char*)calloc(BlockSize,sizeof(char));
+  sprintf(vszefname,"%s.%s.%d",adcfname,vszefname0,NDID);
+  if(!(view = fopen(vszefname, "w+")) ) {
+    fprintf(stderr,"CalculateVeiwSizes: Can't open file: %s\n",vszefname);
+    free(dcview); free(vszefname); return 0;
+  }
+  qsort( dcview, (1<<dcdim), sizeof(DC_view),CompareViewsBySize);	
+
+  switch(clss){
+    case 'U':
+      vinc=1<<3;
+    break;
+    case 'S':
+    break;
+    case 'W':
+    break;
+    case 'A':
+      vinc=1<<6;
+    break;
+    case 'B':
+      vinc=1<<14;
+    break;
+  }
+   for(i=minvn;i<maxvn;i+=vinc){   
+    nViewDims = 0;
+    fprintf(view,"Selection:");
+    idx=dcview[i].vidx;
+    for(j=0;j<dcdim;j++) 
+      if(((idx>>j)&0x1)==1) { fprintf(view," %lld",j+1); nViewDims++;}
+    fprintf(view,"\nView Size: %lld\n",dcview[i].vsize);
+
+    totalInBytes += (8+4*nViewDims)*dcview[i].vsize;
+    nCubeTuples += dcview[i].vsize;
+
+  }
+  fprintf(view,"\nTotal in bytes: %llu  Number of tuples: %llu\n", 
+          totalInBytes, nCubeTuples);
+  
+  fclose(view);
+  free(dcview);
+  fprintf(stdout,"View sizes are written into %s\n",vszefname);
+  free(vszefname);
+  return 1;
+}
+
+int ParseParFile(char* parfname,ADC_PAR *par){
+  char line[BlockSize];
+  FILE* parfile=NULL;
+  char* pos=strchr(parfname,'.');
+  int linenum=0,i=0;
+  const char *kwd;
+
+  if(!(parfile = fopen(parfname, "r")) ) {
+    fprintf(stderr,"ParseParFile: Can't open file: %s\n",parfname);
+    return 0;
+  }
+  if(pos) pos=strchr(pos+1,'.');
+  if(pos) sscanf(pos+1,"%d",&(par->ndid));
+  linenum=0;
+  while(fgets(&line[0],BlockSize,parfile)){
+    i=0;
+    kwd=adcKeyword[i];
+    while(kwd){
+      if(strstr(line,"#")) {
+        ;/*comment line, do nothing*/
+      }else if(strstr(line,kwd)){
+        char *pos=line+strlen(kwd)+1;
+        switch(i){
+          case 0:
+            sscanf(pos,"%d",&(par->dim));
+          break;
+          case 1:
+            sscanf(pos,"%d",&(par->mnum));
+          break;
+          case 2:
+            sscanf(pos,"%lld",&(par->tuplenum));
+          break;
+          case 3:
+            /* "INVERSE_ENDIAN": indices follow adcKeyword[] in adc.h, */
+            /* which no longer contains the former isASCII entry.      */
+            sscanf(pos,"%d",&(par->inverse_endian));
+          break;
+          case 4:
+            /* +1 leaves room for the terminating NUL written by %s */
+            par->filename=(char*) malloc((strlen(pos)+1)*sizeof(char));
+            sscanf(pos,"%s",(char*)par->filename);
+          break;
+          case 5:
+            sscanf(pos,"%c",&(par->clss));
+          break;
+        }
+        break;        
+      }
+      i++;
+      kwd=adcKeyword[i];
+    }
+    linenum++;
+  }
+  fclose(parfile);
+  switch(par->clss){/* overwriting parameters according the class */
+    case 'S':
+      par->dim=5;
+      par->mnum=1;
+      par->tuplenum=1000;
+    break;
+    case 'W':
+      par->dim=10;
+      par->mnum=1;
+      par->tuplenum=100000;
+    break;
+    case 'A':
+      par->dim=15;
+      par->mnum=1;
+      par->tuplenum=1000000;
+    break;
+    case 'B':
+      par->dim=20;
+      par->mnum=1;
+      par->tuplenum=10000000;
+    break;
+  }  
+  return 1;
+}
+int WriteADCPar(ADC_PAR *par,char* fname){
+  char *lname=(char*) calloc(BlockSize,sizeof(char));
+  FILE *parfile=NULL;
+
+  sprintf(lname,"%s",fname);
+  parfile=fopen(lname,"w");
+  if(!parfile){
+    fprintf(stderr,"WriteADCPar: can't open file %s\n",lname);
+    return 0;
+  }
+  fprintf(parfile,"attrNum=%d\n",par->dim);
+  fprintf(parfile,"measuresNum=%d\n",par->mnum);
+  fprintf(parfile,"tuplesNum=%lld\n",par->tuplenum);
+  fprintf(parfile,"class=%c\n",par->clss);
+/*  fprintf(parfile,"isASCII=%d\n",par->isascii); */
+  fprintf(parfile,"INVERSE_ENDIAN=%d\n",par->inverse_endian);
+  fprintf(parfile,"fileName=%s\n",par->filename);
+  fclose(parfile);
+  return 1;
+}
+void ShowADCPar(ADC_PAR *par){
+  fprintf(stdout,"********************* ADC parameters\n");
+  fprintf(stdout," id		%d\n",par->ndid);
+  fprintf(stdout," attributes 	%d\n",par->dim);
+  fprintf(stdout," measures   	%d\n",par->mnum);
+  fprintf(stdout," tuples     	%lld\n",par->tuplenum);
+  fprintf(stdout," class	\t%c\n",par->clss);
+  fprintf(stdout," filename       %s\n",par->filename);
+  fprintf(stdout,"***********************************\n");
+}
+
+long int adcgen[]={
+  2,7,3,2,2,
+  2,2,5,31,7,
+  2,3,3,3,2,
+  5,2,2,2,3};
+  
+int GetNextTuple(int dcdim, int measnum,
+                 long long int* attr,long long int* meas,
+		 char clss){
+  static int tuplenum=0;
+  static const int maxdim=20;
+  static int measbound=31415;
+  int i=0,j=0;
+  int maxattr=0;
+  static long int seed[20];
+  long int *locexp=NULL;
+
+  if(dcdim>maxdim){
+    fprintf(stderr,"GetNextTuple: number of dimensions is too large: %d\n",
+                    dcdim);
+    return 0;
+  }
+  if(measnum>measbound){
+    fprintf(stderr,"GetNextTuple: number of measures is too large: %d\n",
+                    measnum);
+    return 0;
+  }
+  locexp=adcexp;
+  switch(clss){
+    case 'S':
+    locexp=adcexpS;
+    break;
+    case 'W':
+    locexp=adcexpW;
+    break;
+    case 'A':
+    locexp=adcexpA;
+    break;
+    case 'B':
+    locexp=adcexpB;
+    break;
+  }  
+  if(tuplenum==0){
+    for(i=0;i<dcdim;i++){
+      int tmpgen=adcgen[i];
+      for(j=0;j<locexp[i]-1;j++){
+        tmpgen*=adcgen[i];
+	tmpgen=tmpgen%adcprime[i];
+      }
+      adcgen[i]=tmpgen;
+    }
+    fprintf(stdout,"Prime \tGenerator \tSeed\n");
+    for(i=0;i<dcdim;i++){
+      seed[i]=(adcprime[i]+1)/2;
+      fprintf(stdout," %ld\t %ld\t\t %ld\n",adcprime[i],adcgen[i],seed[i]);
+     }
+  }
+  tuplenum++;
+  maxattr=0;
+  for(i=0;i<dcdim;i++){
+    attr[i]=seed[i]*adcgen[i];
+    attr[i]-=adcprime[i]*((long long int)attr[i]/adcprime[i]); 
+    seed[i]=attr[i];
+    if(seed[i]>maxattr) maxattr=seed[i];
+  }		     	  
+  for(i=0;i<measnum;i++){
+    meas[i]=(long long int)(seed[i]*maxattr);
+    meas[i]-=measbound*(meas[i]/measbound);
+  }		     	  
+  return 1;
+}
+
+int GenerateADC(ADC_PAR *par){
+  int dcdim=par->dim,
+      mesnum=par->mnum,
+      tplnum=par->tuplenum;
+  char *adcfname=(char*)calloc(BlockSize,sizeof(char));
+  
+  FILE *adc;
+  int i=0,j=0;
+  long long int* attr=NULL,*mes=NULL; 
+/*
+   if(par->isascii==1){
+    sprintf(adcfname,"%s.tpl.%d",par->filename,par->ndid);
+    if(!(adc = fopen(adcfname, "w+"))) {
+      fprintf(stderr,"GenerateADC: Can't open file: %s\n",adcfname);
+      return 0;
+    }
+  }else{
+*/
+  sprintf(adcfname,"%s.dat.%d",par->filename,par->ndid);
+    if(!(adc = fopen(adcfname, "wb+"))){
+      fprintf(stderr,"GenerateADC: Can't open file: %s\n",adcfname);
+       return 0;
+    }
+/*  } */
+  attr=(long long int *)malloc(dcdim*sizeof(long long int));
+  mes=(long long int *)malloc(mesnum*sizeof(long long int));
+
+  fprintf(stdout,"\nGenerateADC: writing %d tuples of %d attributes and %d measures to %s\n",
+		  tplnum,dcdim,mesnum,adcfname);
+   for(i=0;i<tplnum;i++){
+    if(!GetNextTuple(dcdim,mesnum,attr,mes,par->clss)) return 0;
+/*
+     if(par->isascii==1){
+      for(int j=0;j<dcdim;j++)fprintf(adc,"%lld ",attr[j]);
+      for(int j=0;j<mesnum;j++)fprintf(adc,"%lld ",mes[j]);
+      fprintf(adc,"\n");
+    }else{
+*/
+      for(j=0;j<mesnum;j++){ 
+    	long long mv =  mes[j];
+	    if(par->inverse_endian==1) swap8(&mv);
+	    fwrite(&mv, 8, 1, adc); 
+      }
+      for(j=0;j<dcdim;j++){ 
+    	int av = attr[j]; 
+	if(par->inverse_endian==1) swap4(&av);
+	fwrite(&av, 4, 1, adc); 
+      }
+    }
+/*  } */
+  fclose(adc);
+  fprintf(stdout,"Binary ADC file %s ",adcfname);
+  fprintf(stdout,"has been generated.\n");
+  free(attr);
+  free(mes);
+  free(adcfname);
+  CalculateVeiwSizes(par);
+  return 1;
+}
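Reviewer note on `GetNextTuple`: each attribute column is a multiplicative sequence modulo a prime. On the first call the per-dimension generator is raised to the class exponent (`adcgen[i]^locexp[i] mod adcprime[i]`), the seed starts at `(adcprime[i]+1)/2`, and every following tuple multiplies the seed by that generator modulo the prime. The sketch below isolates that arithmetic for a single dimension. The function names `gen_power` and `next_attr` are hypothetical; the constants in the usage note are the first entries of `adcprime`, `adcgen` and `adcexp`.

```c
#include <assert.h>

/* Hypothetical helper: base^exp mod prime by repeated multiplication,
   mirroring the tmpgen loop (exp-1 multiplications). */
long long gen_power(long long base, long long exp, long long prime) {
  long long g = base % prime;
  long long k;
  assert(prime > 1 && exp >= 1);
  for (k = 1; k < exp; k++) g = (g * base) % prime;
  return g;
}

/* Hypothetical helper: one step of the sequence,
   next attribute = seed * gen mod prime. */
long long next_attr(long long seed, long long gen, long long prime) {
  assert(prime > 1);
  return (seed * gen) % prime;
}
```

For the first dimension (prime 421, generator 2, exponent 11) the effective generator is `gen_power(2, 11, 421)`, which is 364, and the first attribute value is `next_attr(211, 364, 421)`, which is 182.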
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/DC/adc.h b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/DC/adc.h
new file mode 100644
index 000000000..e11f2439b
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/DC/adc.h
@@ -0,0 +1,167 @@
+#ifndef adc_h
+#define adc_h 1
+
+/* For checking of L2-cache performance influence */ 
+/*#define IN_CORE*/
+/*#define VIEW_FILE_OUTPUT*/ /* it can be used with IN_CORE only */
+
+/* Optimizations: prefixed views and share-sorted views */
+/*#define OPTIMIZATION*/
+
+#ifdef WINNT
+#ifndef HAS_INT64
+typedef __int64             int64;
+typedef int                 int32;
+#endif
+typedef unsigned __int64   uint64;
+typedef unsigned int       uint32;
+#else
+#ifndef HAS_INT64
+typedef long long           int64;
+typedef int                 int32;
+#endif
+typedef unsigned long long uint64;
+typedef unsigned int       uint32;
+#endif
+
+#include "adcc.h"
+#include "rbt.h"
+
+static int measbound=31415;   /* upper limit on a view measure bound */
+
+enum { smallestParent, prefixedParent, sharedSortParent, noneParent };
+
+static const char* adcKeyword[]={
+  "attrNum",
+  "measuresNum",
+  "tuplesNum",
+  "INVERSE_ENDIAN",
+  "fileName",
+  "class",
+  NULL
+};
+
+typedef struct ADCpar{
+  int ndid;
+  int dim;
+  int mnum;
+  long long int tuplenum;
+  int inverse_endian;
+  const char *filename;
+  char clss;
+} ADC_PAR;
+
+typedef struct {
+    int32 ndid;
+   char   clss;
+   char          adcName[MAX_FILE_FULL_PATH_SIZE];
+   char   adcInpFileName[MAX_FILE_FULL_PATH_SIZE];
+   uint32 nd; 
+   uint32 nm;
+   uint32 nInputRecs;
+   uint32 memoryLimit;
+   uint32 nTasks;
+   /*  FILE *statf; */
+} ADC_VIEW_PARS;
+
+typedef struct job_pool{ 
+   uint32 grpb; 
+   uint32 nv;
+   uint32 nRows; 
+    int64 viewOffset; 
+} JOB_POOL;
+
+typedef struct layer{
+   uint32 layerIndex;
+   uint32 layerQuantityLimit;
+   uint32 layerCurrentPopulation;
+} LAYER;
+
+typedef struct chunks{
+   uint32 curChunkNum;
+    int64 chunkOffset;
+   uint32 posSubChunk;
+   uint32 curSubChunk;
+} CHUNKS;
+
+typedef struct tuplevsize {
+    uint64 viewsize;
+    uint64 tuple;
+} TUPLE_VIEWSIZE;
+
+typedef struct tupleones {
+    uint32 nOnes;
+    uint64 tuple;
+} TUPLE_ONES;
+
+typedef struct {
+   char adcName[MAX_FILE_FULL_PATH_SIZE];
+   uint32 retCode;
+   uint32 verificationFailed;
+   uint32 swapIt;
+   uint32 nTasks;
+   uint32 taskNumber;
+    int32 ndid;
+
+   uint32 nTopDims; /* given number of dimension attributes */
+   uint32 nm;       /* number of measures */ 
+   uint32 nd;       /* number of parent's dimensions */
+   uint32 nv;       /* number of child's dimensions */
+
+   uint32 nInputRecs;
+   uint32 nViewRows; 
+   uint32 totalOfViewRows;
+   uint32 nParentViewRows;
+
+    int64 viewOffset;
+    int64 accViewFileOffset;
+
+   uint32 inpRecSize;
+   uint32 outRecSize;
+
+   uint32 memoryLimit;
+ unsigned char * memPool;
+   uint32 * inpDataBuffer;
+
+   RBTree *tree;
+
+   uint32 numberOfChunks;
+   CHUNKS *chunksParams;
+
+     char       adcLogFileName[MAX_FILE_FULL_PATH_SIZE];
+     char          inpFileName[MAX_FILE_FULL_PATH_SIZE];
+     char         viewFileName[MAX_FILE_FULL_PATH_SIZE];
+     char       chunksFileName[MAX_FILE_FULL_PATH_SIZE];
+     char      groupbyFileName[MAX_FILE_FULL_PATH_SIZE];
+     char adcViewSizesFileName[MAX_FILE_FULL_PATH_SIZE];
+     char    viewSizesFileName[MAX_FILE_FULL_PATH_SIZE];
+
+     FILE *logf;
+     FILE *inpf;
+     FILE *viewFile;   
+     FILE *fileOfChunks;
+     FILE *groupbyFile;
+     FILE *adcViewSizesFile;
+     FILE *viewSizesFile;
+   
+    int64     mSums[MAX_NUM_OF_MEAS];
+   uint32 selection[MAX_NUM_OF_DIMS];
+    int64 checksums[MAX_NUM_OF_MEAS]; /* view checksums */
+    int64 totchs[MAX_NUM_OF_MEAS];    /* checksums of a group of views */
+
+ JOB_POOL *jpp;
+    LAYER *lpp;
+   uint32 nViewLimit;
+   uint32 groupby;
+   uint32 smallestParentLevel;
+   uint32 parBinRepTuple;
+   uint32 nRowsToRead;
+   uint32 fromParent;
+
+   uint64 totalViewFileSize; /* in bytes */
+   uint32 numberOfMadeViews;
+   uint32 numberOfViewsMadeFromInput;
+   uint32 numberOfPrefixedGroupbys;
+   uint32 numberOfSharedSortGroupbys;
+} ADC_VIEW_CNTL;
+#endif /* adc_h */
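Reviewer note on `adcKeyword`: the position of a keyword in this table is what `ParseParFile` switches on, so the table order and the `case` labels have to stay in step. The following is a minimal sketch of a stricter lookup, assuming parameter lines have the form `keyword=value` as written by `WriteADCPar`. It matches the keyword only at the start of the line and treats only lines beginning with `#` as comments, which is a deliberate departure from the `strstr`-based matching in adc.c. The name `find_keyword` and the local table `kwds` are hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Local copy of the keyword order from adcKeyword[] so the sketch is
   self-contained. */
static const char *kwds[] = {
  "attrNum", "measuresNum", "tuplesNum",
  "INVERSE_ENDIAN", "fileName", "class", NULL
};

/* Hypothetical helper: returns the index of the keyword that starts
   `line` and is immediately followed by '=', or -1 for comments and
   unknown keys. If `value` is not NULL it receives a pointer to the
   text after '='. */
int find_keyword(const char *line, const char **value) {
  int i;
  assert(line != NULL);
  if (line[0] == '#') return -1;           /* comment line */
  for (i = 0; kwds[i]; i++) {
    size_t n = strlen(kwds[i]);
    if (strncmp(line, kwds[i], n) == 0 && line[n] == '=') {
      if (value) *value = line + n + 1;
      return i;
    }
  }
  return -1;
}
```

With this table, `find_keyword("INVERSE_ENDIAN=0", NULL)` yields 3, `fileName` yields 4 and `class` yields 5.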
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/DC/adcc.h b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/DC/adcc.h
new file mode 100644
index 000000000..fe5271861
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/DC/adcc.h
@@ -0,0 +1,82 @@
+/*
+!-------------------------------------------------------------------------!
+!				                                    	                  !
+!		           N A S   G R I D   B E N C H M A R K S                  !
+!									                                      !
+!		                	C + +	V E R S I O N		                  !
+!									                                      !
+!			                       A D C C . H 		                      !
+!									                                      !
+!-------------------------------------------------------------------------!
+!									                                      !
+!    This file contains constant definitions used for                     !
+!    building views.                                                      !
+!									                                      !
+!    Permission to use, copy, distribute and modify this software	      !
+!    for any purpose with or without fee is hereby granted.		          !
+!    We request, however, that all derived work reference the		      !
+!    NAS Grid Benchmarks 3.0 or GridNPB3.0. This software is provided	  !
+!    "as is" without expressed or implied warranty.			              !
+!									                                      !
+!    Information on GridNPB3.0, including the concept of		          !
+!    the NAS Grid Benchmarks, the specifications, source code,  	      !
+!    results and information on how to submit new results,		          !
+!    is available at:							                          !
+!									                                      !
+!	  http://www.nas.nasa.gov/Software/NPB  			                  !
+!									                                      !
+!    Send comments or suggestions to  ngb@nas.nasa.gov  		          !
+!    Send bug reports to	      ngb@nas.nasa.gov  		              !
+!									                                      !
+!	   E-mail:  ngb@nas.nasa.gov					                      !
+!	   Fax:     (650) 604-3957					                          !
+!									                                      !
+!-------------------------------------------------------------------------!
+! GridNPB3.0 C++ version						                          !
+!	  Michael Frumkin, Leonid Shabanov				                      !
+!-------------------------------------------------------------------------!
+*/
+#ifndef _ADCC_CONST_DEFS_H_
+#define _ADCC_CONST_DEFS_H_
+
+/*#define WINNT*/
+#define UNIX
+
+#define ADC_OK                        0
+#define ADC_WRITE_FAILED              1
+#define ADC_INTERNAL_ERROR            2
+#define ADC_TREE_DESTROY_FAILURE      3
+#define ADC_FILE_OPEN_FAILURE         4
+#define ADC_MEMORY_ALLOCATION_FAILURE 5
+#define ADC_FILE_DELETE_FAILURE       6
+#define ADC_VERIFICATION_FAILED       7
+#define ADC_SHMEMORY_FAILURE          8
+
+#define SSA_BUFFER_SIZE     (1024*1024)
+#define MAX_NUMBER_OF_TASKS         256
+
+#define MAX_PAR_FILE_LINE_SIZE      512
+#define MAX_FILE_FULL_PATH_SIZE     512
+#define MAX_ADC_NAME_SIZE            32
+
+#define DIM_FSZ                       4
+#define MSR_FSZ                       8
+
+#define MAX_NUM_OF_DIMS              20
+#define MAX_NUM_OF_MEAS               4
+
+#define MAX_NUM_OF_CHUNKS          1024      
+#define MAX_PARAM_LINE_SIZE        1024
+
+#define OUTPUT_BUFFER_SIZE (MAX_NUM_OF_DIMS + (MSR_FSZ/4)*MAX_NUM_OF_MEAS)
+#define MAX_VIEW_REC_SIZE ((DIM_FSZ*MAX_NUM_OF_DIMS)+(MSR_FSZ*MAX_NUM_OF_MEAS))     
+#define MAX_VIEW_ROW_SIZE_IN_INTS (MAX_NUM_OF_DIMS + 2*MAX_NUM_OF_MEAS)
+#define MLB32  0x80000000
+
+#ifdef WINNT
+#define MLB    0x8000000000000000
+#else
+#define MLB 0x8000000000000000LL
+#endif
+
+#endif /*  _ADCC_CONST_DEFS_H_ */
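Reviewer note on the size constants: `DIM_FSZ` and `MSR_FSZ` are the on-disk widths of one dimension attribute (4 bytes) and one measure (8 bytes), which is the same layout `GenerateADC` writes and the same `(8+4*nViewDims)` factor that adc.c uses for a single-measure view. The sketch below shows how a record size and a view size follow from those constants. The function names `view_rec_size` and `view_bytes` are hypothetical, and the macros are repeated locally only so the example compiles on its own.

```c
#include <assert.h>

/* Values copied from adcc.h so the sketch is self-contained. */
#define DIM_FSZ          4
#define MSR_FSZ          8
#define MAX_NUM_OF_DIMS 20
#define MAX_NUM_OF_MEAS  4

/* Hypothetical helper: bytes in one view record with nd dimension
   attributes and nm measures. */
unsigned view_rec_size(unsigned nd, unsigned nm) {
  assert(nd <= MAX_NUM_OF_DIMS && nm <= MAX_NUM_OF_MEAS);
  return DIM_FSZ * nd + MSR_FSZ * nm;
}

/* Hypothetical helper: bytes in a view of `rows` records. */
unsigned long long view_bytes(unsigned nd, unsigned nm,
                              unsigned long long rows) {
  return (unsigned long long)view_rec_size(nd, nm) * rows;
}
```

At the limits, `view_rec_size(MAX_NUM_OF_DIMS, MAX_NUM_OF_MEAS)` gives 112, which matches `MAX_VIEW_REC_SIZE`.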
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/DC/dc.c b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/DC/dc.c
new file mode 100644
index 000000000..d6f5a5b59
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/DC/dc.c
@@ -0,0 +1,324 @@
+/*
+!-------------------------------------------------------------------------!
+!                                                                         !
+!        N  A  S     P A R A L L E L     B E N C H M A R K S  3.4         !
+!                                                                         !
+!                      O p e n M P     V E R S I O N                      !
+!                                                                         !
+!                                   D C                                   !
+!                                                                         !
+!-------------------------------------------------------------------------!
+!                                                                         !
+!    DC creates all specified data-cube views in parallel.                !
+!    Refer to NAS Technical Report 03-005 for details.                    !
+!    It calculates all groupbys in a top down manner using well known     !
+!    heuristics and optimizations.                                        !
+!                                                                         !
+!    Permission to use, copy, distribute and modify this software         !
+!    for any purpose with or without fee is hereby granted.  We           !
+!    request, however, that all derived work reference the NAS            !
+!    Parallel Benchmarks 3.4. This software is provided "as is"           !
+!    without express or implied warranty.                                 !
+!                                                                         !
+!    Information on NPB 3.4, including the technical report, the          !
+!    original specifications, source code, results and information        !
+!    on how to submit new results, is available at:                       !
+!                                                                         !
+!           http://www.nas.nasa.gov/Software/NPB/                         !
+!                                                                         !
+!    Send comments or suggestions to  npb@nas.nasa.gov                    !
+!                                                                         !
+!          NAS Parallel Benchmarks Group                                  !
+!          NASA Ames Research Center                                      !
+!          Mail Stop: T27A-1                                              !
+!          Moffett Field, CA   94035-1000                                 !
+!                                                                         !
+!          E-mail:  npb@nas.nasa.gov                                      !
+!          Fax:     (650) 604-3957                                        !
+!                                                                         !
+!-------------------------------------------------------------------------!
+! Author: Michael Frumkin                                                 !
+!         Leonid Shabanov                                                 !
+!-------------------------------------------------------------------------!
+*/
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <errno.h>
+#include <ctype.h>
+#include <math.h>
+
+#ifdef _OPENMP
+#include <omp.h>
+#endif
+
+
+#include "adc.h"
+#include "macrodef.h"
+#include "npbparams.h"
+
+#ifdef UNIX
+#include <sys/types.h>
+#include <unistd.h>
+
+#define MAX_TIMERS 64  /* NPB maximum timers */
+#include "../common/c_timers.h"
+#endif
+
+void c_print_results( char   *name,
+                      char   clss,
+                      int    n1,
+                      int    n2,
+                      int    n3,
+                      int    niter,
+                      double t,
+                      double mops,
+		      char   *optype,
+                      int    passed_verification,
+                      char   *npbversion,
+                      char   *compiletime,
+                      char   *cc,
+                      char   *clink,
+                      char   *c_lib,
+                      char   *c_inc,
+                      char   *cflags,
+                      char   *clinkflags );
+
+void initADCpar(ADC_PAR *par);
+int ParseParFile(char* parfname, ADC_PAR *par);
+int GenerateADC(ADC_PAR *par);
+void ShowADCPar(ADC_PAR *par);
+int32 DC(ADC_VIEW_PARS *adcpp);
+int Verify(long long int checksum,ADC_VIEW_PARS *adcpp);
+
+#define BlockSize 1024
+
+int main ( int argc, char * argv[] )
+{
+  ADC_PAR *parp;
+  ADC_VIEW_PARS *adcpp;
+  int32 retCode;
+
+  fprintf(stdout,"\n\n NAS Parallel Benchmarks (NPB3.4-OMP) - DC Benchmark\n\n" );
+  if(argc!=3){
+    fprintf(stdout," No parameter file. Using compiled defaults\n");
+  }
+  if(argc>3 || (argc>1 && !isdigit(argv[1][0]))){
+    fprintf(stderr,"Usage: <program name> <amount of memory>\n");
+    fprintf(stderr,"       <file of parameters>\n");
+    fprintf(stderr,"Example: bin/dc.S 1000000 DC/ADC.par\n");
+    fprintf(stderr,"The last argument (a parameter file) can be omitted\n");
+    exit(1);
+  }
+
+  if(  !(parp = (ADC_PAR*) malloc(sizeof(ADC_PAR)))
+     ||!(adcpp = (ADC_VIEW_PARS*) malloc(sizeof(ADC_VIEW_PARS)))){
+     PutErrMsg("main: malloc failed")
+     exit(1);
+  }
+  initADCpar(parp);
+  parp->clss=CLASS;
+  if(argc!=3){
+    parp->dim=attrnum;
+    parp->tuplenum=input_tuples;
+  }else if( (argc==3)&&(!ParseParFile(argv[2], parp))) {
+    PutErrMsg("main.ParseParFile failed")
+    exit(1);
+  }
+  ShowADCPar(parp);
+  if(!GenerateADC(parp)) {
+     PutErrMsg("main.GenerateAdc failed")
+     exit(1);
+  }
+
+  adcpp->ndid = parp->ndid;
+  adcpp->clss = parp->clss;
+  adcpp->nd = parp->dim;
+  adcpp->nm = parp->mnum;
+  adcpp->nTasks = 1;
+
+  if(argc>=2)
+    adcpp->memoryLimit = atoi(argv[1]);
+  else
+    adcpp->memoryLimit = 0;
+  if(adcpp->memoryLimit <= 0){
+    /* size of rb-tree with tuplenum nodes */
+    adcpp->memoryLimit = parp->tuplenum*(50+5*parp->dim);
+    fprintf(stdout,"Estimated rb-tree size = %u \n", adcpp->memoryLimit);
+  }
+  adcpp->nInputRecs = parp->tuplenum;
+  strcpy(adcpp->adcName, parp->filename);
+  strcpy(adcpp->adcInpFileName, parp->filename);
+
+  if((retCode=DC(adcpp))) {
+     PutErrMsg("main.DC failed")
+     fprintf(stderr, "main.DC failed: retcode = %d\n", retCode);
+     exit(1);
+  }
+
+  if(parp)  { free(parp);   parp = 0; }
+  if(adcpp) { free(adcpp); adcpp = 0; }
+  return 0;
+}
+
+int32		 CloseAdcView(ADC_VIEW_CNTL *adccntl);
+int32		 PartitionCube(ADC_VIEW_CNTL *avp);
+ADC_VIEW_CNTL *NewAdcViewCntl(ADC_VIEW_PARS *adcpp, uint32 pnum);
+int32		 ComputeGivenGroupbys(ADC_VIEW_CNTL *adccntl);
+
+int32 DC(ADC_VIEW_PARS *adcpp) {
+   int32 itsk=0;
+   double t_total=0.0;
+   int verified;
+
+   typedef struct {
+      int    verificationFailed;
+      uint32 totalViewTuples;
+      uint64 totalViewSizesInBytes;
+      uint32 totalNumberOfMadeViews;
+      uint64 checksum;
+      double tm_max;
+   } PAR_VIEW_ST;
+
+   PAR_VIEW_ST *pvstp;
+
+   pvstp = (PAR_VIEW_ST*) malloc(sizeof(PAR_VIEW_ST));
+   pvstp->verificationFailed = 0;
+   pvstp->totalViewTuples = 0;
+   pvstp->totalViewSizesInBytes = 0;
+   pvstp->totalNumberOfMadeViews = 0;
+   pvstp->checksum = 0; pvstp->tm_max = 0.0; /* tm_max is read in the critical section below */
+
+#ifdef _OPENMP
+   adcpp->nTasks=omp_get_max_threads();
+   fprintf(stdout,"\nNumber of available threads:  %d\n", adcpp->nTasks);
+   if (adcpp->nTasks > MAX_NUMBER_OF_TASKS) {
+      adcpp->nTasks = MAX_NUMBER_OF_TASKS;
+      fprintf(stdout,"Warning: Maximum number of tasks reached: %d\n",
+              adcpp->nTasks);
+   }
+
+
+#pragma omp parallel shared(pvstp) private(itsk)
+#endif
+  {
+   double tm0=0;
+   int itimer=0;
+   ADC_VIEW_CNTL *adccntlp;
+#ifdef _OPENMP
+   itsk=omp_get_thread_num();
+#endif
+   adccntlp = NewAdcViewCntl(adcpp, itsk);
+
+   if (!adccntlp) {
+      PutErrMsg("ParRun.NewAdcViewCntl: returned NULL")
+      adccntlp->verificationFailed=1;
+   }else{
+     adccntlp->verificationFailed = 0;
+     if (adccntlp->retCode!=0) {
+   	fprintf(stderr,
+   		 "DC.NewAdcViewCntl: return code = %d\n",
+   						adccntlp->retCode);
+     }
+   }
+
+   if (!adccntlp->verificationFailed) {
+     if( PartitionCube(adccntlp) ) {
+        PutErrMsg("DC.PartitionCube failed");
+     }
+     timer_clear(itimer);
+     timer_start(itimer);
+     if( ComputeGivenGroupbys(adccntlp) ) {
+        PutErrMsg("DC.ComputeGivenGroupbys failed");
+     }
+     timer_stop(itimer);
+     tm0 = timer_read(itimer);
+   }
+#ifdef _OPENMP
+#pragma omp critical
+#endif
+   {
+     if(pvstp->tm_max<tm0) pvstp->tm_max=tm0;
+     pvstp->verificationFailed += adccntlp->verificationFailed;
+     if (!adccntlp->verificationFailed) {
+       pvstp->totalNumberOfMadeViews += adccntlp->numberOfMadeViews;
+       pvstp->totalViewSizesInBytes += adccntlp->totalViewFileSize;
+       pvstp->totalViewTuples += adccntlp->totalOfViewRows;
+       pvstp->checksum += adccntlp->totchs[0];
+     }
+   }
+   if(CloseAdcView(adccntlp)) {
+     PutErrMsg("ParRun.CloseAdcView failed");
+     adccntlp->verificationFailed = 1;
+   }
+ } /* omp parallel */
+
+   t_total=pvstp->tm_max;
+
+   pvstp->verificationFailed=Verify(pvstp->checksum,adcpp);
+   verified = (pvstp->verificationFailed == -1)? -1 :
+              (pvstp->verificationFailed ==  0)?  1 : 0;
+
+   fprintf(stdout,"\n*** DC Benchmark Results:\n");
+   fprintf(stdout," Benchmark Time   = %20.3f\n", t_total);
+   fprintf(stdout," Input Tuples     =         %12d\n", (int) adcpp->nInputRecs);
+   fprintf(stdout," Number of Views  =         %12d\n",
+           (int) pvstp->totalNumberOfMadeViews);
+   fprintf(stdout," Number of Tasks  =         %12d\n", (int) adcpp->nTasks);
+   fprintf(stdout," Tuples Generated = %20.0f\n",
+           (double) pvstp->totalViewTuples);
+   fprintf(stdout," Tuples/s         = %20.2f\n",
+           (double) pvstp->totalViewTuples / t_total);
+   fprintf(stdout," Checksum         = %20.12e\n", (double) pvstp->checksum);
+   if (pvstp->verificationFailed)
+      fprintf(stdout, " Verification failed\n");
+
+   c_print_results("DC",
+  		   adcpp->clss,
+  		   (int)adcpp->nInputRecs,
+                   0,
+                   0,
+                   1,
+  		   t_total,
+  		   (double) pvstp->totalViewTuples * 1.e-6 / t_total,
+  		   "Tuples generated",
+  		   verified,
+  		   NPBVERSION,
+  		   COMPILETIME,
+  		   CC,
+  		   CLINK,
+  		   C_LIB,
+  		   C_INC,
+  		   CFLAGS,
+  		   CLINKFLAGS);
+   return ADC_OK;
+}
+
+long long checksumS=464620213;
+long long checksumWlo=434318;
+long long checksumWhi=1401796;
+long long checksumAlo=178042;
+long long checksumAhi=7141688;
+long long checksumBlo=700453;
+long long checksumBhi=9348365;
+
+int Verify(long long int checksum,ADC_VIEW_PARS *adcpp){
+  switch(adcpp->clss){
+    case 'S':
+      if(checksum==checksumS) return 0;
+      break;
+    case 'W':
+      if(checksum==checksumWlo+1000000*checksumWhi) return 0;
+      break;
+    case 'A':
+      if(checksum==checksumAlo+1000000*checksumAhi) return 0;
+      break;
+    case 'B':
+      if(checksum==checksumBlo+1000000*checksumBhi) return 0;
+      break;
+    default:
+      return -1; /* CLASS U */
+  }
+  return 1;
+}
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/DC/extbuild.c b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/DC/extbuild.c
new file mode 100644
index 000000000..3550537c1
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/DC/extbuild.c
@@ -0,0 +1,988 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+
+#include "adc.h"
+#include "macrodef.h"
+#include "protots.h"
+
+#ifdef UNIX
+#include <errno.h>
+#endif
+
+extern int32 computeChecksum(ADC_VIEW_CNTL *avp,treeNode *t,uint64 *ordern);
+extern int32 WriteViewToDiskCS(ADC_VIEW_CNTL *avp,treeNode *t,uint64 *ordern);
+
+int32 ReadWholeInputData(ADC_VIEW_CNTL *avp, FILE *inpf){
+  uint32 iRec = 0;
+  uint32 inpBufferLineSize, inpBufferPace, inpRecSize, ib = 0;
+
+  FSEEK(inpf, 0L, SEEK_SET);
+  inpRecSize = 8*avp->nm+4*avp->nTopDims;
+  inpBufferLineSize = inpRecSize;
+  if (inpBufferLineSize%8) inpBufferLineSize += 4;
+  inpBufferPace = inpBufferLineSize/4;
+
+  while(fread(&avp->inpDataBuffer[ib], inpRecSize, 1, inpf)){
+     iRec++;
+     ib += inpBufferPace;      
+  }
+  avp->nRowsToRead = iRec;
+  FSEEK(inpf, 0L, SEEK_SET);
+  
+  if(avp->nInputRecs != iRec){
+     fprintf(stderr, " ReadWholeInputData(): wrong input data reading.\n");
+     return ADC_INTERNAL_ERROR;
+  }  
+  return ADC_OK;
+}
+int32 ComputeMemoryFittedView (ADC_VIEW_CNTL *avp){
+  uint32 iRec = 0;
+  uint32 viewBuf[MAX_VIEW_ROW_SIZE_IN_INTS];
+  uint32 inpBufferLineSize, inpBufferPace, inpRecSize,ib;
+  uint64 ordern=0;
+#ifdef VIEW_FILE_OUTPUT
+  uint32 retCode;
+#endif
+
+  FSEEK(avp->viewFile, 0L, SEEK_END);
+  inpRecSize = 8*avp->nm+4*avp->nTopDims;
+  inpBufferLineSize = inpRecSize;
+  if (inpBufferLineSize%8) inpBufferLineSize += 4;
+  inpBufferPace = inpBufferLineSize/4;
+
+  InitializeTree(avp->tree, avp->nv, avp->nm);
+
+  ib=0;
+  for ( iRec = 1; iRec <= avp->nRowsToRead; iRec++ ){ 
+      SelectToView( &avp->inpDataBuffer[ib], avp->selection, viewBuf, 
+  		             avp->nd, avp->nm, avp->nv );
+      ib += inpBufferPace;
+      TreeInsert(avp->tree, viewBuf);
+      if(avp->tree->memoryIsFull){
+  	fprintf(stderr, "ComputeMemoryFittedView(): Not enough memory.\n");
+  	return 1; 
+      }
+  }
+
+#ifdef VIEW_FILE_OUTPUT
+  if( retCode = WriteViewToDiskCS(avp, avp->tree->root.left,&ordern) ){ 
+    fprintf(stderr, "ComputeMemoryFittedView(): Write error occurred.\n");
+    return retCode;
+  }
+#else
+  computeChecksum(avp,avp->tree->root.left,&ordern);
+#endif
+ 
+  avp->nViewRows = avp->tree->count;
+  avp->totalOfViewRows += avp->nViewRows; 			      
+  InitializeTree(avp->tree, avp->nv, avp->nm);
+  return ADC_OK;
+}
+
+int32 SharedSortAggregate(ADC_VIEW_CNTL *avp){
+   int32 retCode;
+  uint32 iRec = 0;
+  uint32   attrs[MAX_VIEW_ROW_SIZE_IN_INTS]; 
+  uint32 currBuf[MAX_VIEW_ROW_SIZE_IN_INTS]; 
+   int64 chunkOffset = 0;
+   int64 inpfOffset;
+  uint32 nPart = 0;
+  uint32 prevV;
+  uint32 currV;
+  uint32 total = 0;
+  unsigned char *ib;
+  uint32 ibsize = SSA_BUFFER_SIZE;
+  uint32 nib;
+  uint32 iib;
+  uint32 nreg;
+  uint32 nlst;
+  uint32 nsgs;
+  uint32 ncur;
+  uint32 ibOffset = 0;
+  uint64 ordern=0;
+   
+  ib = (unsigned char*) malloc(ibsize); 
+  if (!ib){ 
+    fprintf(stderr,"SharedSortAggregate: memory allocation failed\n"); 
+    return ADC_MEMORY_ALLOCATION_FAILURE; 
+  }
+  
+  nib = ibsize/avp->inpRecSize;
+  nsgs = avp->nRowsToRead/nib;
+  
+  if (nsgs == 0){
+      nreg = avp->nRowsToRead; 
+      nlst = nreg; 
+      nsgs = 1; 
+  }else{
+     nreg = nib;
+     if (avp->nRowsToRead%nib) {
+       nsgs++; 
+       nlst = avp->nRowsToRead%nib;
+     }else{
+       nlst = nreg;			   
+     }
+  }
+  
+  avp->nViewRows = 0; 
+  for( iib = 1; iib <= nsgs; iib++ ){ 
+    if(iib > 1) FSEEK(avp->viewFile, inpfOffset, SEEK_SET);
+    if( iib == nsgs ) ncur = nlst; else ncur = nreg;
+    	  
+    fread(ib, ncur*avp->inpRecSize, 1, avp->viewFile);
+    inpfOffset = ftell(avp->viewFile);
+
+    for( ibOffset = 0, iRec = 1; iRec <= ncur; iRec++ ){
+      memcpy(attrs, &ib[ibOffset], avp->inpRecSize);
+      ibOffset += avp->inpRecSize;
+      SelectToView(attrs, avp->selection, currBuf, avp->nd, avp->nm, avp->nv); 
+      currV = currBuf[2*avp->nm];
+
+      if(iib == 1 && iRec == 1){ 
+        prevV = currV; 
+        nPart = 1;
+        InitializeTree(avp->tree, avp->nv, avp->nm);
+        TreeInsert(avp->tree, currBuf);
+      }else{
+         if (currV == prevV){
+            nPart++;
+	    TreeInsert (avp->tree, currBuf);
+            if (avp->tree->memoryIsFull){
+	      avp->chunksParams[avp->numberOfChunks].curChunkNum =
+	                                             avp->tree->count;
+	      avp->chunksParams[avp->numberOfChunks].chunkOffset = chunkOffset;
+              (avp->numberOfChunks)++;
+	      if(avp->numberOfChunks >= MAX_NUM_OF_CHUNKS){
+                fprintf(stderr,"Too many chunks were created.\n"); 
+		exit(1);
+              }
+              chunkOffset += (uint64)(avp->tree->count*avp->outRecSize);
+              retCode=WriteChunkToDisk(avp->outRecSize, avp->fileOfChunks,
+	                               avp->tree->root.left, avp->logf);                                       
+              if(retCode!=ADC_OK){
+		fprintf(stderr,"SharedSortAggregate: Write error occurred.\n");
+		return retCode;
+	      }
+              InitializeTree(avp->tree, avp->nv, avp->nm);
+	    } /* memoryIsFull */
+         }else{
+	   if(avp->numberOfChunks && avp->tree->count!=0){ 
+	     avp->chunksParams[avp->numberOfChunks].curChunkNum =
+	        				     avp->tree->count;
+	     avp->chunksParams[avp->numberOfChunks].chunkOffset = chunkOffset;
+             (avp->numberOfChunks)++;
+             chunkOffset += 
+	    	      (uint64)(avp->tree->count*(4*avp->nv + 8*avp->nm));
+	     retCode=WriteChunkToDisk( avp->outRecSize, avp->fileOfChunks,
+	   				 avp->tree->root.left, avp->logf);
+             if(retCode!=ADC_OK){
+	       fprintf(stderr,"SharedSortAggregate: Write error occurred.\n");
+	       return retCode;    
+	      }
+	    }
+            FSEEK(avp->viewFile, 0L, SEEK_END);
+            if(!avp->numberOfChunks){
+               avp->nViewRows += avp->tree->count;
+	       retCode = WriteViewToDiskCS(avp, avp->tree->root.left,&ordern);
+	       if(retCode!=ADC_OK){ 
+	          fprintf(stderr, 
+	        	 "SharedSortAggregate: Write error occurred.\n");
+	          return retCode;
+	       }
+ 	     }else{
+	       retCode=MultiWayMerge(avp);
+	       if(retCode!=ADC_OK) {
+	         fprintf(stderr,"SharedSortAggregate.MultiWayMerge: failed.\n");
+	         return retCode;
+	       } 
+	     }
+             InitializeTree(avp->tree, avp->nv, avp->nm);
+             TreeInsert(avp->tree, currBuf);
+             total += nPart;
+             nPart = 1;
+          }
+       }
+       prevV = currV;
+    } /* iRec */
+  } /* iib */
+
+  if(avp->numberOfChunks && avp->tree->count!=0) { 
+    avp->chunksParams[avp->numberOfChunks].curChunkNum = avp->tree->count;
+    avp->chunksParams[avp->numberOfChunks].chunkOffset = chunkOffset;
+    (avp->numberOfChunks)++;
+    chunkOffset += (uint64)(avp->tree->count*(4*avp->nv + 8*avp->nm));
+    retCode=WriteChunkToDisk(avp->outRecSize, avp->fileOfChunks,
+    			     avp->tree->root.left, avp->logf);
+    if(retCode!=ADC_OK){
+      fprintf(stderr,"SharedSortAggregate: Write error occurred.\n");
+      return retCode;	 
+    }
+  }
+  FSEEK(avp->viewFile, 0L, SEEK_END);
+  if(!avp->numberOfChunks){
+    avp->nViewRows += avp->tree->count;
+    if( retCode = WriteViewToDiskCS(avp, avp->tree->root.left,&ordern)){ 
+      fprintf(stderr, "SharedSortAggregate: Write error occurred.\n");
+      return retCode;
+    }	 
+  }else{
+     retCode=MultiWayMerge(avp);
+     if(retCode!=ADC_OK) {
+       fprintf(stderr,"SharedSortAggregate.MultiWayMerge failed.\n");
+       return retCode;
+     } 
+  }
+  FSEEK(avp->fileOfChunks, 0L, SEEK_SET);
+  
+  total += nPart;
+  avp->totalOfViewRows += avp->nViewRows;
+  if(ib) free(ib);
+  return  ADC_OK;
+}
+int32 PrefixedAggregate(ADC_VIEW_CNTL *avp, FILE *iof){
+   uint32 i;
+   uint32 iRec = 0;
+   uint32   attrs[MAX_VIEW_ROW_SIZE_IN_INTS]; 
+   uint32 aggrBuf[MAX_VIEW_ROW_SIZE_IN_INTS];
+   uint32 currBuf[MAX_VIEW_ROW_SIZE_IN_INTS]; 
+   uint32 prevBuf[MAX_VIEW_ROW_SIZE_IN_INTS];
+    int64 *aggrmp;
+    int64 *currmp;
+    int32 compRes;
+   uint32 nOut = 0; 
+   uint32 mpOffset = 0;
+   uint32 nOutBufRecs;
+   uint32 nViewRows = 0;
+    int64 inpfOffset;
+
+    aggrmp = (int64*) &aggrBuf[0];
+    currmp = (int64*) &currBuf[0];
+    
+    for(i = 0; i < 2*avp->nm+avp->nv; i++){prevBuf[i] = 0; aggrBuf[i] = 0;}
+    nOutBufRecs = avp->memoryLimit/avp->outRecSize;
+
+    for(iRec = 1; iRec <= avp->nRowsToRead; iRec++ ){ 
+      fread(attrs, avp->inpRecSize, 1, iof);
+      SelectToView(attrs, avp->selection, currBuf, avp->nd, avp->nm, avp->nv);
+      if (iRec == 1) memcpy(aggrBuf, currBuf, avp->outRecSize);
+      else{
+       compRes = KeyComp( &currBuf[2*avp->nm], &prevBuf[2*avp->nm], avp->nv);
+
+       switch(compRes){
+	  case  1: 
+	    memcpy(&avp->memPool[mpOffset], aggrBuf, avp->outRecSize);
+	    mpOffset += avp->outRecSize;
+	    nOut++;
+	    for ( i = 0; i < avp->nm; i++ ){
+	      avp->mSums[i] += aggrmp[i];
+	      avp->checksums[i] += nOut*aggrmp[i]%measbound;
+	    }    
+	    memcpy(aggrBuf, currBuf, avp->outRecSize);
+	    break;
+	  case  0: 
+	    for ( i = 0; i < avp->nm; i++ ) aggrmp[i] += currmp[i];
+	    break;
+	  case -1: 
+	    fprintf(stderr,"PrefixedAggregate: wrong parent view order.\n"); 
+	    exit(1);
+	    break; 
+	  default: 
+	    fprintf(stderr,"PrefixedAggregate: wrong KeyComp() result.\n"); 
+	    exit(1);
+	    break;
+       }     
+    
+       if (nOut == nOutBufRecs){
+	     inpfOffset = ftell(iof);
+	     FSEEK(iof, 0L, SEEK_END);
+	     WriteToFile(avp->memPool, nOut*avp->outRecSize, 1, iof, stderr);
+	     FSEEK(iof, inpfOffset, SEEK_SET);
+	     mpOffset = 0;
+	     nViewRows += nOut;
+	     nOut = 0; 
+       }
+     }
+     memcpy(prevBuf, currBuf, avp->outRecSize);
+   }
+   memcpy(&avp->memPool[mpOffset], aggrBuf, avp->outRecSize);
+   nOut++;
+   for ( i = 0; i < avp->nm; i++ ){
+     avp->mSums[i] += aggrmp[i];
+     avp->checksums[i] += nOut*aggrmp[i]%measbound;
+   }
+   FSEEK(iof, 0L, SEEK_END);
+   WriteToFile(avp->memPool, nOut*avp->outRecSize, 1, iof, stderr);
+   avp->nViewRows	 = nViewRows+nOut;
+   avp->totalOfViewRows += avp->nViewRows;
+   return ADC_OK;
+}
+int32 RunFormation (ADC_VIEW_CNTL *avp, FILE *inpf){
+   uint32 iRec = 0;
+   uint32 viewBuf[MAX_VIEW_ROW_SIZE_IN_INTS];
+   uint32   attrs[MAX_VIEW_ROW_SIZE_IN_INTS]; 
+    int64 chunkOffset = 0;
+
+   InitializeTree(avp->tree, avp->nv, avp->nm);
+
+   for(iRec = 1; iRec <= avp->nRowsToRead; iRec++ ){ 
+     fread(attrs, avp->inpRecSize, 1, inpf);
+     SelectToView(attrs, avp->selection, viewBuf, avp->nd, avp->nm, avp->nv); 
+     TreeInsert(avp->tree, viewBuf);
+
+     if(avp->tree->memoryIsFull) {
+        avp->chunksParams[avp->numberOfChunks].curChunkNum = avp->tree->count;
+	    avp->chunksParams[avp->numberOfChunks].chunkOffset  = chunkOffset;		 
+        (avp->numberOfChunks)++;
+	    if (avp->numberOfChunks >= MAX_NUM_OF_CHUNKS) {
+          fprintf(stderr, "RunFormation: Too many chunks were created.\n"); 
+          return ADC_INTERNAL_ERROR;
+        }
+        chunkOffset += (uint64)(avp->tree->count*avp->outRecSize);
+        if(WriteChunkToDisk( avp->outRecSize, avp->fileOfChunks,
+	                         avp->tree->root.left, avp->logf )){
+	       fprintf(stderr, 
+	         "RunFormation.WriteChunkToDisk: Write error occurred.\n");
+	       return ADC_WRITE_FAILED;
+	    }
+        InitializeTree(avp->tree, avp->nv, avp->nm);
+       }
+   } /* Insertion ... */
+   if(avp->numberOfChunks && avp->tree->count!=0) { 
+     avp->chunksParams[avp->numberOfChunks].curChunkNum = avp->tree->count;
+     avp->chunksParams[avp->numberOfChunks].chunkOffset  = chunkOffset;
+     (avp->numberOfChunks)++;
+     chunkOffset += (uint64)(avp->tree->count*(4*avp->nv + 8*avp->nm));
+     if(WriteChunkToDisk(avp->outRecSize, avp->fileOfChunks,
+                         avp->tree->root.left, avp->logf)){
+       fprintf(stderr, 
+            "RunFormation.WriteChunkToDisk: Write error occurred.\n");
+       return ADC_WRITE_FAILED;  
+     }
+   }
+   FSEEK(avp->viewFile, 0L, SEEK_END);
+   return ADC_OK;
+}
+void SeekAndReadNextSubChunk( uint32 multiChunkBuffer[], 
+                              uint32 k,
+                              FILE *inFile,
+		              uint32 chunkRecSize, 
+		              uint64 inFileOffs,
+		              uint32 subChunkNum){
+   int64 ret;
+  
+   ret = FSEEK(inFile, inFileOffs, SEEK_SET);
+   if (ret < 0){
+      fprintf(stderr,"SeekAndReadNextSubChunk.fseek() < 0\n");
+      exit(1); 
+   }
+   fread(&multiChunkBuffer[k], chunkRecSize*subChunkNum, 1, inFile);
+}
+void ReadSubChunk(
+            uint32 chunkRecSize,
+            uint32 *multiChunkBuffer,
+            uint32 mwBufRecSizeInInt,
+            uint32 iChunk,
+            uint32 regSubChunkSize,
+            CHUNKS *chunks,  
+              FILE *fileOfChunks
+            ){
+   if (chunks[iChunk].curChunkNum > 0){
+      if(chunks[iChunk].curChunkNum < regSubChunkSize){
+	SeekAndReadNextSubChunk(multiChunkBuffer,
+	   			(iChunk*regSubChunkSize +
+	   			(regSubChunkSize-chunks[iChunk].curChunkNum))*
+	   			mwBufRecSizeInInt,
+	   			fileOfChunks,
+	   			chunkRecSize,
+	   			chunks[iChunk].chunkOffset,
+	   			chunks[iChunk].curChunkNum);
+	chunks[iChunk].posSubChunk=regSubChunkSize-chunks[iChunk].curChunkNum;
+	chunks[iChunk].curSubChunk=chunks[iChunk].curChunkNum;
+	chunks[iChunk].curChunkNum=0;
+	chunks[iChunk].chunkOffset=-1;
+      }else{
+	SeekAndReadNextSubChunk(multiChunkBuffer,
+	   			iChunk*regSubChunkSize*mwBufRecSizeInInt,
+	   			fileOfChunks,
+	   			chunkRecSize,
+	   			chunks[iChunk].chunkOffset,
+	   			regSubChunkSize);
+	chunks[iChunk].posSubChunk = 0;
+	chunks[iChunk].curSubChunk = regSubChunkSize;
+	chunks[iChunk].curChunkNum -= regSubChunkSize;
+	chunks[iChunk].chunkOffset += regSubChunkSize * chunkRecSize;
+      }
+   }
+}
+int32 MultiWayMerge(ADC_VIEW_CNTL *avp){
+   uint32 outputBuffer[OUTPUT_BUFFER_SIZE];
+   uint32 r_buf       [OUTPUT_BUFFER_SIZE];
+   uint32 min_r_buf   [OUTPUT_BUFFER_SIZE];
+   uint32 first_one;
+   uint32 i;
+   uint32 iChunk;
+   uint32 min_r_chunk;
+   uint32 sPos;
+   uint32 iPos;
+   uint32 numEmptyBufs;
+   uint32 numEmptyRuns;
+   uint32 mwBufRecSizeInInt;
+   uint32 chunkRecSize;
+   uint32 *multiChunkBuffer;
+   uint32   regSubChunkSize;
+    int32 compRes;
+    int64 *m_min_r_buf;
+    int64 *m_outputBuffer;
+
+   FSEEK(avp->fileOfChunks, 0L, SEEK_SET);
+
+   multiChunkBuffer = (uint32*) &avp->memPool[0];
+   first_one = 1;
+   avp->nViewRows  = 0; 
+
+   chunkRecSize = avp->outRecSize;
+   mwBufRecSizeInInt = chunkRecSize/4;
+   m_min_r_buf = (int64*)&min_r_buf[0];
+   m_outputBuffer = (int64*)&outputBuffer[0];
+
+   mwBufRecSizeInInt = chunkRecSize/4;
+   regSubChunkSize = (avp->memoryLimit/avp->numberOfChunks)/chunkRecSize;
+	 
+   if (regSubChunkSize==0) {
+     fprintf(stderr,
+             "MultiWayMerge: Not enough memory to run the external sort\n");
+     return ADC_INTERNAL_ERROR;
+   }
+   multiChunkBuffer = (uint32*) &avp->memPool[0];
+
+   for(i = 0; i < avp->numberOfChunks; i++ ){
+      ReadSubChunk( 
+                   chunkRecSize,
+                   multiChunkBuffer,
+                   mwBufRecSizeInInt,
+                   i,
+                   regSubChunkSize,
+                   avp->chunksParams,  
+                   avp->fileOfChunks
+      );
+   }
+   while(1){
+     for(iChunk = 0;iChunk<avp->numberOfChunks;iChunk++){
+       if (avp->chunksParams[iChunk].curSubChunk > 0){
+     	sPos = iChunk*regSubChunkSize*mwBufRecSizeInInt;
+    	iPos = sPos+mwBufRecSizeInInt*avp->chunksParams[iChunk].posSubChunk;
+     	memcpy(&min_r_buf[0], &multiChunkBuffer[iPos], avp->outRecSize);
+	    min_r_chunk = iChunk;
+     	break;
+       }
+     }
+     for ( iChunk = min_r_chunk; iChunk < avp->numberOfChunks; iChunk++ ){
+       uint32 iPos;
+
+       if (avp->chunksParams[iChunk].curSubChunk > 0){
+          iPos = mwBufRecSizeInInt*(iChunk*regSubChunkSize+
+                                   avp->chunksParams[iChunk].posSubChunk);
+          memcpy(&r_buf[0],&multiChunkBuffer[iPos],avp->outRecSize);
+
+          compRes=KeyComp(&r_buf[2*avp->nm],&min_r_buf[2*avp->nm],avp->nv);	
+          if(compRes < 0) {
+     	      memcpy(&min_r_buf[0], &r_buf[0], avp->outRecSize);
+	          min_r_chunk = iChunk;
+          }
+       }
+     }
+     /* Step forward */
+     if(avp->chunksParams[min_r_chunk].curSubChunk != 0){
+       avp->chunksParams[min_r_chunk].curSubChunk--;
+       avp->chunksParams[min_r_chunk].posSubChunk++;
+     }
+
+       /* Aggregation if a duplicate is encountered */
+       if(first_one){
+         memcpy( &outputBuffer[0], &min_r_buf[0], avp->outRecSize);
+         first_one = 0;
+       }else{
+         compRes = KeyComp( &outputBuffer[2*avp->nm], 
+        		    &min_r_buf[2*avp->nm], avp->nv );
+         if(!compRes){
+           for(i = 0; i < avp->nm; i++ ){ 
+             m_outputBuffer[i] += m_min_r_buf[i]; 
+           }
+         }else{
+           WriteToFile(outputBuffer,avp->outRecSize,1,avp->viewFile,stderr);
+           avp->nViewRows++;
+           for(i=0;i<avp->nm;i++){
+	     avp->mSums[i]+=m_outputBuffer[i];
+	     avp->checksums[i] += avp->nViewRows*m_outputBuffer[i]%measbound;
+	   }
+           memcpy( &outputBuffer[0], &min_r_buf[0], avp->outRecSize );
+        }
+      }
+
+      for(numEmptyBufs = 0, 
+          numEmptyRuns = 0, i = 0; i < avp->numberOfChunks; i++ ){
+	     if (avp->chunksParams[i].curSubChunk == 0) numEmptyBufs++;
+         if (avp->chunksParams[i].curChunkNum == 0) numEmptyRuns++;
+      }
+      if(   numEmptyBufs == avp->numberOfChunks 
+          &&numEmptyRuns == avp->numberOfChunks) break;
+
+      if(avp->chunksParams[min_r_chunk].curSubChunk == 0) {
+        ReadSubChunk( 
+        	 chunkRecSize,
+        	 multiChunkBuffer,
+        	 mwBufRecSizeInInt,
+        	 min_r_chunk,
+        	 regSubChunkSize,
+        	 avp->chunksParams,
+        	 avp->fileOfChunks);
+      }
+   } /* while(1) */
+
+   WriteToFile( outputBuffer, avp->outRecSize, 1, avp->viewFile, stderr);	  
+   avp->nViewRows++;
+   for(i = 0; i < avp->nm; i++ ){ 
+     avp->mSums[i] += m_outputBuffer[i]; 
+     avp->checksums[i] += avp->nViewRows*m_outputBuffer[i]%measbound;
+   }
+
+   avp->totalOfViewRows += avp->nViewRows;
+   return ADC_OK;
+}
+void SelectToView( uint32 * ib, uint32 *ix, uint32 *viewBuf, 
+                   uint32 nd, uint32 nm, uint32 nv ){
+   uint32 i, j;
+   for ( j = 0, i = 0; i < nv; i++ ) viewBuf[2*nm+j++] = ib[2*nm+ix[i]-1];
+   memcpy(&viewBuf[0], &ib[0], MSR_FSZ*nm);
+}
+FILE * AdcFileOpen(const char *fileName, const char *mode){
+   FILE *fr;
+   if ((fr = (FILE*) fopen(fileName, mode))==NULL)
+      fprintf(stderr, "AdcFileOpen: Cannot open the file %s errno = %d\n",  
+                       fileName, errno);
+   return fr;
+}
+void AdcFileName(char *adcFileName, const char *adcName, 
+		 const char *fileName, uint32 taskNumber){
+  sprintf(adcFileName, "%s.%s.%d",adcName,fileName,taskNumber);
+}
+ADC_VIEW_CNTL * NewAdcViewCntl(ADC_VIEW_PARS *adcpp, uint32 pnum){
+   ADC_VIEW_CNTL *adccntl;
+   uint32 i, j, k;
+#ifdef IN_CORE
+   uint32 ux;
+#endif
+   char id[8+1];
+   
+   adccntl = (ADC_VIEW_CNTL *) malloc(sizeof(ADC_VIEW_CNTL));
+   if (adccntl==NULL) return NULL;
+   
+   adccntl->ndid = adcpp->ndid;
+   adccntl->taskNumber = pnum;
+   adccntl->retCode = 0;
+   adccntl->swapIt = 0;
+   strcpy(adccntl->adcName, adcpp->adcName);
+   adccntl->nTopDims = adcpp->nd;
+   adccntl->nd = adcpp->nd;
+   adccntl->nm = adcpp->nm;
+   adccntl->nInputRecs = adcpp->nInputRecs;
+   adccntl->inpRecSize = GetRecSize(adccntl->nd,adccntl->nm);
+   adccntl->outRecSize = GetRecSize(adccntl->nv,adccntl->nm);
+   adccntl->accViewFileOffset = 0;
+   adccntl->totalViewFileSize = 0;
+   adccntl->numberOfMadeViews = 0;
+   adccntl->numberOfViewsMadeFromInput = 0;
+   adccntl->numberOfPrefixedGroupbys = 0;
+   adccntl->numberOfSharedSortGroupbys = 0;
+   adccntl->totalOfViewRows = 0;
+   adccntl->memoryLimit = adcpp->memoryLimit;
+   adccntl->nTasks = adcpp->nTasks;
+   strcpy(adccntl->inpFileName, adcpp->adcInpFileName);
+   sprintf(id, ".%d", adcpp->ndid);
+   
+   AdcFileName(adccntl->adcLogFileName, 
+               adccntl->adcName, "logf", adccntl->taskNumber);
+   strcat(adccntl->adcLogFileName, id);            
+   adccntl->logf = AdcFileOpen(adccntl->adcLogFileName, "w");
+
+   AdcFileName(adccntl->inpFileName, adccntl->adcName, "dat", adcpp->ndid);
+   adccntl->inpf = AdcFileOpen(adccntl->inpFileName, "rb");
+   if(!adccntl->inpf){ 
+     adccntl->retCode = ADC_FILE_OPEN_FAILURE; 
+     return(adccntl);
+   } 
+
+   AdcFileName(adccntl->viewFileName, adccntl->adcName, 
+               "view.dat", adccntl->taskNumber);
+   strcat(adccntl->viewFileName, id);            
+   adccntl->viewFile = AdcFileOpen(adccntl->viewFileName, "wb+");
+
+   AdcFileName(adccntl->chunksFileName, adccntl->adcName, 
+               "chunks.dat", adccntl->taskNumber);
+   strcat(adccntl->chunksFileName, id);            
+   adccntl->fileOfChunks = AdcFileOpen(adccntl->chunksFileName,"wb+");
+
+   AdcFileName(adccntl->groupbyFileName, adccntl->adcName, 
+               "groupby.dat", adccntl->taskNumber);
+   strcat(adccntl->groupbyFileName, id);
+   adccntl->groupbyFile = AdcFileOpen(adccntl->groupbyFileName,"wb+");
+
+   AdcFileName(adccntl->adcViewSizesFileName, adccntl->adcName, 
+               "view.sz", adcpp->ndid);
+   adccntl->adcViewSizesFile = AdcFileOpen(adccntl->adcViewSizesFileName,"r");
+   if(!adccntl->adcViewSizesFile){
+     adccntl->retCode = ADC_FILE_OPEN_FAILURE;
+     return(adccntl);
+   }
+
+   AdcFileName(adccntl->viewSizesFileName, adccntl->adcName, 
+               "viewsz.dat", adccntl->taskNumber);
+   strcat(adccntl->viewSizesFileName, id);            
+   adccntl->viewSizesFile = AdcFileOpen(adccntl->viewSizesFileName, "wb+");
+   
+   adccntl->chunksParams = (CHUNKS*) malloc(MAX_NUM_OF_CHUNKS*sizeof(CHUNKS));
+   if(adccntl->chunksParams==NULL){ 
+     fprintf(adccntl->logf,"NewAdcViewCntl: Cannot allocate 'chunksParams'\n");
+     adccntl->retCode = ADC_MEMORY_ALLOCATION_FAILURE;
+     return(adccntl);
+   }
+   adccntl->memPool = (unsigned char*) malloc(adccntl->memoryLimit);
+   if(adccntl->memPool == NULL ){
+      fprintf(adccntl->logf, 
+              "NewAdcViewCntl: Cannot allocate 'main memory pool'\n"); 
+      adccntl->retCode = ADC_MEMORY_ALLOCATION_FAILURE;
+      return(adccntl);
+   }
+   
+#ifdef IN_CORE   
+   /* add a condition to allocate this memory buffer, THIS is IMPORTANT */
+   ux = 4*adccntl->nTopDims + 8*adccntl->nm;
+   if (adccntl->nTopDims%8) ux += 4;
+   adccntl->inpDataBuffer = (uint32*) malloc(adccntl->nInputRecs*ux);
+   if(adccntl->inpDataBuffer == NULL ){
+      fprintf(adccntl->logf,
+              "NewAdcViewCntl: Cannot allocate 'input data buffer'\n"); 
+      adccntl->retCode = ADC_MEMORY_ALLOCATION_FAILURE;
+      return(adccntl);
+   }
+#endif
+   adccntl->numberOfChunks = 0;
+
+   for ( i = 0; i < adccntl->nm; i++ ){
+     adccntl->mSums[i] = 0;
+     adccntl->checksums[i] = 0;
+     adccntl->totchs[i] = 0;
+  }
+   adccntl->tree = CreateEmptyTree(adccntl->nd, adccntl->nm, 
+                                   adccntl->memoryLimit, adccntl->memPool);
+   if(!adccntl->tree){
+      fprintf(adccntl->logf,"\nNewAdcViewCntl.CreateEmptyTree failed.\n");
+      adccntl->retCode = ADC_MEMORY_ALLOCATION_FAILURE;
+      return(adccntl);
+   }
+
+   adccntl->nv = adcpp->nd; /* default */
+   for ( i = 0; i < adccntl->nv; i++ ) adccntl->selection[i]=i+1;
+   
+   adccntl->nViewLimit = (1<<adcpp->nd)-1;
+   adccntl->jpp=(JOB_POOL *) malloc((adccntl->nViewLimit+1)*sizeof(JOB_POOL));
+   if ( adccntl->jpp == NULL){
+      fprintf(adccntl->logf,
+        "\n Not enough space to allocate %ld bytes for a job pool.", 
+        (long)((adccntl->nViewLimit+1)*sizeof(JOB_POOL)));
+      adccntl->retCode = ADC_MEMORY_ALLOCATION_FAILURE; 
+      return(adccntl);
+   }
+   adccntl->lpp = (LAYER * ) malloc( (adcpp->nd+1)*sizeof(LAYER));
+   if ( adccntl->lpp == NULL){
+      fprintf(adccntl->logf,
+        "\n Not enough space to allocate %ld bytes for a layer reference array.", 
+        (long)((adcpp->nd+1)*sizeof(LAYER)));
+      adccntl->retCode = ADC_MEMORY_ALLOCATION_FAILURE;
+      return(adccntl);
+   }
+
+   for ( j = 1, i = 1; i <= adcpp->nd; i++ ) {
+      k =  NumOfCombsFromNbyK ( adcpp->nd, i );
+      adccntl->lpp[i].layerIndex = j;
+      j += k;
+      adccntl->lpp[i].layerQuantityLimit = k;
+      adccntl->lpp[i].layerCurrentPopulation = 0;
+   }    
+      
+   JobPoolInit ( adccntl->jpp, (adccntl->nViewLimit+1), adcpp->nd );
+
+   fprintf(adccntl->logf,"\nMeaning of the log file columns is as follows:\n");
+   fprintf(adccntl->logf,
+     "Row Number | Groupby | View Size | Measure Sums | Number of Chunks\n");
+
+   adccntl->verificationFailed = 1;
+   return adccntl;
+}
+void InitAdcViewCntl(ADC_VIEW_CNTL *adccntl, 
+		     uint32 nSelectedDims, 
+		     uint32 *selection, 
+		     uint32 fromParent ){
+   uint32 i;
+   
+   adccntl->nv = nSelectedDims;
+   
+   for (i = 0; i < adccntl->nm; i++ ) adccntl->mSums[i] = 0;
+   for (i = 0; i < adccntl->nv; i++ ) adccntl->selection[i] = selection[i];
+
+   adccntl->outRecSize = GetRecSize(adccntl->nv,adccntl->nm);
+   adccntl->numberOfChunks = 0;
+   adccntl->fromParent = fromParent;
+   adccntl->nViewRows = 0;
+
+   if(fromParent){
+     adccntl->nd = adccntl->smallestParentLevel;
+     FSEEK(adccntl->viewFile, adccntl->viewOffset, SEEK_SET);
+     adccntl->nRowsToRead = adccntl->nParentViewRows;
+   }else{
+     adccntl->nd = adccntl->nTopDims;
+     adccntl->nRowsToRead = adccntl->nInputRecs;
+   }
+   adccntl->inpRecSize = GetRecSize(adccntl->nd,adccntl->nm);
+   adccntl->outRecSize = GetRecSize(adccntl->nv,adccntl->nm);
+}
+int32 CloseAdcView(ADC_VIEW_CNTL *adccntl){
+   if (adccntl->inpf) fclose(adccntl->inpf);
+   if (adccntl->viewFile) fclose(adccntl->viewFile);
+   if (adccntl->fileOfChunks) fclose(adccntl->fileOfChunks);
+   if (adccntl->groupbyFile) fclose(adccntl->groupbyFile);
+   if (adccntl->adcViewSizesFile) fclose(adccntl->adcViewSizesFile);
+   if (adccntl->viewSizesFile) fclose(adccntl->viewSizesFile);
+   
+   if (DeleteOneFile(adccntl->chunksFileName))       
+      return ADC_FILE_DELETE_FAILURE;
+   if (DeleteOneFile(adccntl->viewSizesFileName))    
+      return ADC_FILE_DELETE_FAILURE;
+
+   if (DeleteOneFile(adccntl->groupbyFileName))      
+      return ADC_FILE_DELETE_FAILURE;
+
+   if (adccntl->chunksParams){ 
+     free(adccntl->chunksParams); 
+     adccntl->chunksParams=NULL; 
+   }  
+   if (adccntl->memPool){ free(adccntl->memPool); adccntl->memPool=NULL;} 
+   if (adccntl->jpp){ free(adccntl->jpp); adccntl->jpp=NULL; } 
+   if (adccntl->lpp){ free(adccntl->lpp); adccntl->lpp=NULL; } 
+
+   if (adccntl->logf) fclose(adccntl->logf);
+   free(adccntl);
+   return ADC_OK;
+}
+void AdcCntlLog(ADC_VIEW_CNTL *adccntlp){
+  fprintf(adccntlp->logf,"    memoryLimit = %20d\n",
+    adccntlp->memoryLimit);
+  fprintf(adccntlp->logf,"    treeNodeSize = %20d\n",
+    adccntlp->tree->treeNodeSize);
+  fprintf(adccntlp->logf," treeMemoryLimit = %20d\n",
+    adccntlp->tree->memoryLimit);
+  fprintf(adccntlp->logf,"    nNodesLimit = %20d\n",
+    adccntlp->tree->nNodesLimit);
+  fprintf(adccntlp->logf,"freeNodeCounter = %20d\n",
+    adccntlp->tree->freeNodeCounter);
+  fprintf(adccntlp->logf,"	nViewRows = %20d\n",
+    adccntlp->nViewRows);
+}
+int32 ViewSizesVerification(ADC_VIEW_CNTL *adccntlp){
+     char inps[MAX_PARAM_LINE_SIZE];
+     char msg[64];
+     uint32 *viewCounts;
+     uint32 selection_viewSize[2];
+     uint32 sz;
+     uint32 sel[64];
+     uint32 i;
+     uint32 k;
+     uint64 tx;
+     uint32 iTx; 
+   
+     viewCounts = (uint32 *) &adccntlp->memPool[0];
+     for ( i = 0; i <= adccntlp->nViewLimit; i++) viewCounts[i] = 0;
+     
+     FSEEK(adccntlp->viewSizesFile, 0L, SEEK_SET);
+     FSEEK(adccntlp->adcViewSizesFile, 0L, SEEK_SET);     
+
+     while(fread(selection_viewSize, 8, 1, adccntlp->viewSizesFile)){
+        viewCounts[selection_viewSize[0]] = selection_viewSize[1];
+     }
+     k = 0;
+     while ( fscanf(adccntlp->adcViewSizesFile, "%s", inps) != EOF ){
+        if ( strcmp(inps, "Selection:") == 0 ) {
+           while ( fscanf(adccntlp->adcViewSizesFile, "%s", inps) == 1 ) {
+             if ( strcmp(inps, "View") == 0 ) break; 
+             sel[k++] = atoi(inps);	  
+           }
+        }
+        
+        if ( strcmp(inps, "Size:") == 0 ) {
+           fscanf(adccntlp->adcViewSizesFile, "%s", inps);
+           sz = atoi(inps);
+           CreateBinTuple(&tx, sel, k);
+           iTx = (int32)(tx>>(64-adccntlp->nTopDims)); 
+           adccntlp->verificationFailed = 0;
+           if (!adccntlp->numberOfMadeViews) adccntlp->verificationFailed = 1;
+
+           if ( viewCounts[iTx] != 0){
+              if (viewCounts[iTx] != sz) {
+                 if (viewCounts[iTx] != adccntlp->nInputRecs){
+                   fprintf(adccntlp->logf, 
+                           "A view size is wrong: genSz=%d calcSz=%d\n",
+                   	                               sz, viewCounts[iTx]);
+                   adccntlp->verificationFailed = 1;
+                   return ADC_VERIFICATION_FAILED;
+                 }
+              }               
+           }
+           k = 0;
+        }  
+     } /* of while() */
+
+     fprintf(adccntlp->logf,
+       "\n\nMeaning of the log file columns is as follows:\n");
+     fprintf(adccntlp->logf, 
+       "Row Number | Groupby | View Size | Measure Sums | Number of Chunks\n");
+
+     if (!adccntlp->verificationFailed) 
+          strcpy(msg, "Verification=passed");
+     else strcpy(msg, "Verification=failed");
+     FSEEK(adccntlp->logf, 0L, SEEK_SET);
+     fprintf(adccntlp->logf, "%s", msg);
+     FSEEK(adccntlp->logf, 0L, SEEK_END);
+     FSEEK(adccntlp->viewSizesFile, 0L, SEEK_SET);
+     return ADC_OK;
+}
+int32 ComputeGivenGroupbys(ADC_VIEW_CNTL *adccntlp){
+    int32 retCode;
+   uint32 i;
+   uint64 binRepTuple;
+   uint32 ut32;
+   uint32 nViews = 0;
+   uint32 nSelectedDims;
+   uint32 smp;
+#ifdef IN_CORE
+   uint32 firstView = 1;
+#endif
+   uint32 selection_viewsize[2];
+   char ttout[16];
+
+   while (fread(&binRepTuple, 8, 1, adccntlp->groupbyFile )){
+     for(i = 0; i < adccntlp->nm; i++) adccntlp->checksums[i]=0;
+     nViews++;
+     swap8(&binRepTuple);
+
+     GetRegTupleFromBin64(binRepTuple, adccntlp->selection,
+                          adccntlp->nTopDims, &nSelectedDims);
+     ut32 = (uint32)(binRepTuple>>(64-adccntlp->nTopDims));
+     selection_viewsize[0] = ut32;
+     ut32 <<= (32-adccntlp->nTopDims);
+     adccntlp->groupby = ut32;
+#ifndef IN_CORE
+     smp = GetParent(adccntlp, ut32);
+#endif
+#ifdef IN_CORE
+     if (firstView) {
+       firstView = 0;
+       if(ReadWholeInputData(adccntlp, adccntlp->inpf)) {
+          fprintf(stderr, "ReadWholeInputData failed.\n");
+          return ADC_INTERNAL_ERROR;   
+       }
+     }
+     smp = noneParent;
+#endif
+
+     if (smp != noneParent)
+     GetRegTupleFromParent(binRepTuple, 
+                           adccntlp->parBinRepTuple, 
+                           adccntlp->selection,
+                           adccntlp->nTopDims);
+     InitAdcViewCntl(adccntlp, nSelectedDims, 
+                     adccntlp->selection, (smp == noneParent)?0:1);
+#ifdef IN_CORE
+      if((retCode = ComputeMemoryFittedView(adccntlp))) {
+         fprintf(stderr, "ComputeMemoryFittedView failed.\n");
+         return retCode;
+      }
+#else
+#ifdef OPTIMIZATION
+     if (smp == prefixedParent){
+        if ((retCode = PrefixedAggregate(adccntlp, adccntlp->viewFile))) {
+           fprintf(stderr, 
+	     "ComputeGivenGroupbys.PrefixedAggregate failed.\n");
+           return retCode;
+        }
+        adccntlp->numberOfPrefixedGroupbys++;
+     }else if (smp == sharedSortParent) {
+        if ((retCode = SharedSortAggregate(adccntlp))) {
+           fprintf(stderr, 
+	     "ComputeGivenGroupbys.SharedSortAggregate failed.\n");
+           return retCode;
+        }
+        adccntlp->numberOfSharedSortGroupbys++;
+     }else
+#endif /* OPTIMIZATION */     
+     { 
+        if( smp != noneParent ) {
+	  retCode = RunFormation(adccntlp, adccntlp->viewFile);
+          if(retCode!=ADC_OK){
+              fprintf(stderr, 
+	  	  "ComputeGivenGroupbys.RunFormation failed.\n");
+              return retCode; 
+            }
+	  }else{
+	    if ((retCode=RunFormation (adccntlp, adccntlp->inpf)) != ADC_OK){
+              fprintf(stderr, 
+	  	  "ComputeGivenGroupbys.RunFormation failed.\n");
+              return retCode;
+            }
+	    adccntlp->numberOfViewsMadeFromInput++;
+	  }
+        if(!adccntlp->numberOfChunks){
+          uint64 ordern=0;
+          adccntlp->nViewRows        = adccntlp->tree->count;
+          adccntlp->totalOfViewRows += adccntlp->nViewRows;
+	  retCode=WriteViewToDiskCS(adccntlp,adccntlp->tree->root.left,&ordern);
+	  if(retCode!=ADC_OK){
+            fprintf(stderr,
+	            "ComputeGivenGroupbys.WriteViewToDisk: Write error.\n");
+	    return ADC_WRITE_FAILED;
+	  }
+        }else { 
+          retCode=MultiWayMerge(adccntlp);
+          if(retCode!=ADC_OK) {
+	     fprintf(stderr,"ComputeGivenGroupbys.MultiWayMerge failed.\n");
+	     return retCode;
+	  } 
+        } 
+      }
+     
+     JobPoolUpdate(adccntlp);
+
+     adccntlp->accViewFileOffset += 
+       (int64)(adccntlp->nViewRows*adccntlp->outRecSize);
+     FSEEK(adccntlp->fileOfChunks, 0L, SEEK_SET);
+     FSEEK(adccntlp->inpf, 0L, SEEK_SET);
+#endif /* IN_CORE */
+     for( i = 0; i < adccntlp->nm; i++) 
+       adccntlp->totchs[i]+=adccntlp->checksums[i];
+     selection_viewsize[1] = adccntlp->nViewRows;
+     fwrite(selection_viewsize, 8, 1, adccntlp->viewSizesFile);
+     adccntlp->totalViewFileSize += 
+                            adccntlp->outRecSize*adccntlp->nViewRows;
+     sprintf(ttout, "%7d ", nViews);
+     WriteOne32Tuple(ttout, adccntlp->groupby, 
+                     adccntlp->nTopDims, adccntlp->logf);
+     fprintf(adccntlp->logf, " |  %15d | ", adccntlp->nViewRows); 
+     for ( i = 0; i < adccntlp->nm; i++ ){ 
+        fprintf(adccntlp->logf, " %20lld", adccntlp->checksums[i]);
+     }
+     fprintf(adccntlp->logf, " | %5d", adccntlp->numberOfChunks);
+   }
+   adccntlp->numberOfMadeViews = nViews;  
+   if(ViewSizesVerification(adccntlp)) return ADC_VERIFICATION_FAILED;
+   return ADC_OK;
+}
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/DC/jobcntl.c b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/DC/jobcntl.c
new file mode 100644
index 000000000..8d2e276fe
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/DC/jobcntl.c
@@ -0,0 +1,562 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+
+#include "adc.h"
+#include "macrodef.h"
+
+#ifdef UNIX
+#include <fcntl.h>
+#include <sys/file.h>
+#include <unistd.h>
+#endif
+
+uint32 NumberOfOnes(uint64 s);
+void swap8(void *a);
+void SetOneBit(uint64 *s, int32 pos){ uint64 ob = MLB; ob >>= pos; *s |= ob;}
+void SetOneBit32(uint32 *s, uint32 pos){ 
+   uint32 ob = 0x80000000;
+   ob >>= pos; 
+   *s |= ob;
+}
+uint32 Mlo32(uint32 x){
+   uint32 om = 0x80000000;
+   uint32 i;
+   uint32 k;
+              
+   for ( k = 0, i = 0; i < 32; i++ ) {
+       if (om&x) break;
+       om >>= 1;
+       k++;
+   } 
+   return(k);   
+}
+int32 mro32(uint32 x){
+   uint32 om = 0x00000001;
+   uint32 i;
+   uint32 k;
+              
+   for ( k = 32, i = 0; i < 32; i++ ) {
+       if (om&x) break;
+       om <<= 1;
+       k--;
+   } 
+   return(k);   
+}
+uint32 setLeadingOnes32(uint32 n){
+   uint32 om = 0x80000000;
+   uint32 x;
+   uint32 i;
+         
+   for ( x = 0, i = 0; i < n; i++ ) {
+         x |= om;
+         om >>= 1;
+   } 
+   return (x);
+}
+int32 DeleteOneFile(const char * file_name) {
+#  ifdef WINNT
+      return(remove(file_name));
+#  else
+      return(unlink(file_name));
+#  endif
+}
+void WriteOne32Tuple(char * t, uint32 s, uint32 l, FILE * logf) {
+  uint64 ob = MLB32;
+  uint32 i;
+            
+  fprintf(logf, "\n %s", t);
+  for ( i = 0; i < l; i++ ) {
+    if (s&ob) fprintf(logf, "1"); else fprintf(logf, "0");
+    ob >>= 1;
+  }
+}
+uint32 NumOfCombsFromNbyK( uint32 n, uint32 k ){
+  uint32 l, combsNbyK;
+  if ( k > n ) return 0;
+  for(combsNbyK=1, l=1;l<=k;l++)combsNbyK = combsNbyK*(n-l+1)/l;
+  return  combsNbyK;
+}
+void JobPoolUpdate(ADC_VIEW_CNTL *avp){
+   uint32 l = avp->nv;
+   uint32 k;
+  
+   k = avp->lpp[l].layerIndex + avp->lpp[l].layerCurrentPopulation;
+   avp->jpp[k].grpb = avp->groupby;
+   avp->jpp[k].nv = l;
+   avp->jpp[k].nRows = avp->nViewRows;
+   avp->jpp[k].viewOffset = avp->accViewFileOffset;
+   avp->lpp[l].layerCurrentPopulation++;
+} 
+int32 GetParent(ADC_VIEW_CNTL *avp, uint32 binRepTuple){
+   uint32 level, levelPop, i;
+   uint32 ig;
+   uint32 igOfSmallestParent;
+   uint32 igOfPrefixedParent;
+   uint32 igOfSharedSortParent;
+   uint32 spMinNumOfRows;
+   uint32 pfMinNumOfRows;
+   uint32 ssMinNumOfRows;
+   uint32 tgrpb;
+   uint32 pg;
+   uint32 pfm;
+   uint32 mlo = 0;
+   uint32 lom;
+   uint32 l = NumberOfOnes(binRepTuple);
+   uint32 spFound;
+   uint32 pfFound;
+   uint32 ssFound;
+   uint32 found;
+   uint32 spFt;
+   uint32 pfFt;   
+   uint32 ssFt;
+
+   found = noneParent;
+   pfm = setLeadingOnes32(mro32(avp->groupby));
+   SetOneBit32(&mlo, Mlo32(avp->groupby));
+   lom = setLeadingOnes32(Mlo32(avp->groupby)); 
+
+   for(spFound=pfFound=ssFound=0, level=l;level<=avp->nTopDims;level++){
+      levelPop = avp->lpp[level].layerCurrentPopulation;
+      
+      if(levelPop != 0)
+      {
+           for ( spFt = pfFt = ssFt = 1, ig = avp->lpp[level].layerIndex,
+                 i = 0; i < levelPop; i++ )
+           {
+               tgrpb = avp->jpp[ig].grpb;
+               if ( (avp->groupby & tgrpb) == avp->groupby ) { 
+                  spFound = 1;
+                  if (spFt) { spMinNumOfRows = avp->jpp[ig].nRows; 
+                              igOfSmallestParent = ig; spFt = 0; }
+                  else   if ( spMinNumOfRows > avp->jpp[ig].nRows ) 
+                            { spMinNumOfRows = avp->jpp[ig].nRows; 
+                              igOfSmallestParent = ig; }
+
+				  pg = tgrpb & pfm;
+				  if (pg == binRepTuple) {
+                     pfFound = 1;
+                     if (pfFt) { pfMinNumOfRows = avp->jpp[ig].nRows; 
+                                 igOfPrefixedParent = ig; pfFt = 0; }
+                     else   if ( pfMinNumOfRows > avp->jpp[ig].nRows) 
+                               { pfMinNumOfRows = avp->jpp[ig].nRows; 
+                                 igOfPrefixedParent = ig; }
+				  }
+
+				  if ( (tgrpb & mlo) && !(tgrpb & lom)) {
+                     ssFound = 1;
+                     if (ssFt) { ssMinNumOfRows = avp->jpp[ig].nRows; 
+                                 igOfSharedSortParent = ig; ssFt = 0; }
+                     else   if ( ssMinNumOfRows > avp->jpp[ig].nRows) 
+                               { ssMinNumOfRows = avp->jpp[ig].nRows; 
+                                 igOfSharedSortParent = ig; }
+				  }
+               }
+               ig++;
+           }
+      }
+      if (pfFound) found = prefixedParent;
+      else if (ssFound) found = sharedSortParent;
+           else if (spFound) found = smallestParent;
+
+      switch(found){
+         case prefixedParent:
+           avp->smallestParentLevel = level;
+           avp->viewOffset      = avp->jpp[igOfPrefixedParent].viewOffset;
+           avp->nParentViewRows = avp->jpp[igOfPrefixedParent].nRows;
+           avp->parBinRepTuple  = avp->jpp[igOfPrefixedParent].grpb;
+           break;
+         case sharedSortParent:
+           avp->smallestParentLevel = level;
+           avp->viewOffset	    = avp->jpp[igOfSharedSortParent].viewOffset;
+           avp->nParentViewRows = avp->jpp[igOfSharedSortParent].nRows;
+           avp->parBinRepTuple  = avp->jpp[igOfSharedSortParent].grpb;
+           break;
+         case smallestParent:
+           avp->smallestParentLevel = level;
+           avp->viewOffset	    = avp->jpp[igOfSmallestParent].viewOffset;
+           avp->nParentViewRows = avp->jpp[igOfSmallestParent].nRows;
+           avp->parBinRepTuple  = avp->jpp[igOfSmallestParent].grpb;
+           break;
+         default: break;
+      }
+      if(   found == prefixedParent 
+         || found == sharedSortParent 
+	 || found == smallestParent) break;
+   }
+  return found;
+} 
+uint32 GetSmallestParent(ADC_VIEW_CNTL *avp, uint32 binRepTuple){
+   uint32 found, level, levelPop, i, ig, igOfSmallestParent;
+   uint32 minNumOfRows;
+   uint32 tgrpb;
+   uint32 ft;
+   uint32 l = NumberOfOnes(binRepTuple);
+  
+   for(found=0, level=l; level<=avp->nTopDims;level++){
+      levelPop = avp->lpp[level].layerCurrentPopulation;
+      if(levelPop){
+        for(ft=1, ig=avp->lpp[level].layerIndex, i=0;i<levelPop;i++){
+          tgrpb = avp->jpp[ig].grpb;
+          if ( (avp->groupby & tgrpb) == avp->groupby ) { 
+            found = 1;
+            if(ft){
+	      minNumOfRows=avp->jpp[ig].nRows;
+	      igOfSmallestParent = ig; 
+	      ft = 0;
+	    }else if(minNumOfRows > avp->jpp[ig].nRows){ 
+	      minNumOfRows = avp->jpp[ig].nRows;
+	      igOfSmallestParent = ig;
+	    }
+          }
+          ig++;
+        }
+      }
+      if( found ){      
+         avp->smallestParentLevel = level;
+         avp->viewOffset = avp->jpp[igOfSmallestParent].viewOffset;
+         avp->nParentViewRows = avp->jpp[igOfSmallestParent].nRows;
+         avp->parBinRepTuple = avp->jpp[igOfSmallestParent].grpb;
+         break;
+      }
+   }
+   return found;
+} 
+int32 GetPrefixedParent(ADC_VIEW_CNTL *avp, uint32 binRepTuple){
+   uint32 found, level, levelPop, i, ig, igOfSmallestParent;
+   uint32 minNumOfRows;
+   uint32 tgrpb;
+   uint32 ft;
+   uint32 pg, tm;
+   uint32 l = NumberOfOnes(binRepTuple);
+   
+   tm = setLeadingOnes32(mro32(avp->groupby));
+
+   for(found=0, level=l; level<=avp->nTopDims; level++){
+      levelPop = avp->lpp[level].layerCurrentPopulation;
+  
+      if (levelPop != 0)
+      {
+           for(ft = 1, ig = avp->lpp[level].layerIndex, 
+                i = 0; i < levelPop; i++ ) {
+               tgrpb = avp->jpp[ig].grpb;
+               if ( (avp->groupby & tgrpb) == avp->groupby ) { 
+				  pg = tgrpb & tm;
+				  if (pg == binRepTuple) {
+                     found = 1;
+                     if (ft) { minNumOfRows = avp->jpp[ig].nRows; 
+                               igOfSmallestParent = ig; ft = 0; }
+                     else if ( minNumOfRows > avp->jpp[ig].nRows) 
+                             { minNumOfRows = avp->jpp[ig].nRows; 
+                               igOfSmallestParent = ig; }
+				  }
+               }
+               ig++;
+           }
+      }
+      if ( found ) {      
+         avp->smallestParentLevel = level;
+         avp->viewOffset = avp->jpp[igOfSmallestParent].viewOffset;
+         avp->nParentViewRows = avp->jpp[igOfSmallestParent].nRows;
+         avp->parBinRepTuple = avp->jpp[igOfSmallestParent].grpb;
+         break;
+      }
+   }
+  return found;
+} 
+void JobPoolInit(JOB_POOL *jpp, uint32 n, uint32 nd){
+  uint32 i;
+
+  for ( i = 0; i < n; i++ ) {
+      jpp[i].grpb = 0;
+	  jpp[i].nv = 0;  
+      jpp[i].nRows = 0;
+      jpp[i].viewOffset = 0;
+  }    
+}
+void WriteOne64Tuple(char * t, uint64 s, uint32 l, FILE * logf){
+   uint64 ob = MLB;
+   uint32 i;
+            
+   fprintf(logf, "\n %s", t);
+   for ( i = 0; i < l; i++ ) {
+      if (s&ob) fprintf(logf, "1"); else fprintf(logf, "0");
+      ob >>= 1;
+   }
+}
+uint32 NumberOfOnes(uint64 s){
+   uint64 ob = MLB;
+   uint32 i;
+   uint32 nOnes;
+
+   for ( nOnes = 0, i = 0; i < 64; i++ ) {
+      if (s&ob) nOnes++;
+      ob >>= 1;
+   }
+   return nOnes;
+}
+void GetRegTupleFromBin64(
+           uint64 binRepTuple, 
+	       uint32 *selTuple,
+	       uint32 numDims, 
+	       uint32 *numOfUnits){
+   uint64 oc = MLB;
+   uint32 i;
+   uint32 j;
+  
+   *numOfUnits = 0;  
+   for( j = 0, i = 0; i < numDims; i++ ) {
+     if (binRepTuple & oc) { selTuple[j++] = i+1; (*numOfUnits)++;}  
+     oc >>= 1;
+   }    
+}
+void getRegTupleFromBin32(
+           uint32 binRepTuple, 
+	       uint32 *selTuple,
+	       uint32 numDims, 
+	       uint32 *numOfUnits){
+   uint32 oc = MLB32;
+   uint32 i;
+   uint32 j;
+  
+   *numOfUnits = 0;
+   for( j = 0, i = 0; i < numDims; i++ ) {
+     if (binRepTuple & oc) { selTuple[j++] = i+1; (*numOfUnits)++;}  
+     oc >>= 1;
+   }    
+}
+void GetRegTupleFromParent(
+               uint64 bin64RepTuple,
+               uint32 bin32RepTuple, 
+	       uint32 *selTuple,
+	       uint32 nd){
+   uint32 oc = MLB32;
+   uint32 i, j, k;
+   uint32 ut32; 
+  
+   ut32 = (uint32)(bin64RepTuple>>(64-nd)); 
+   ut32 <<= (32-nd);
+   
+   for ( j = 0, k = 0, i = 0; i < nd; i++ ) {
+     if (bin32RepTuple & oc) k++;
+     if (bin32RepTuple & oc && ut32 & oc) selTuple[j++] = k; 
+     oc >>= 1;
+   }    
+}
+void CreateBinTuple(uint64 *binRepTuple, uint32 *selTuple, uint32 numDims){
+   uint32 i;
+
+   *binRepTuple = 0;
+   for(i = 0; i < numDims; i++ ){
+     SetOneBit( binRepTuple, selTuple[i]-1 );
+   }    
+}
+void d32v( char * t, uint32 *v, uint32 n){
+   uint32 i;
+   
+   fprintf(stderr,"\n%s ", t);
+   for ( i = 0; i < n; i++ ) fprintf(stderr," %d", v[i]);
+}
+void WriteOne64Tuple(char * t, uint64 s, uint32 l, FILE * logf);
+int32 Comp8gbuf(const void *a, const void *b){
+   if ( *(const uint64*)a < *(const uint64*)b ) return -1;
+   else if ( *(const uint64*)a > *(const uint64*)b ) return 1;
+   else return 0;
+}
+void restore(TUPLE_VIEWSIZE x[], uint32 f, uint32 l ){ 
+   uint32 j, m, tj, mm1, jm1, hl;
+   uint64 iW;
+   uint64 iW64;
+
+   j = f;
+   hl = l>>1;
+   while( j <= hl ) {
+      tj = j*2;
+      if (tj < l && x[tj-1].viewsize < x[tj].viewsize) m = tj+1;
+      else m = tj;
+      mm1 = m - 1;
+      jm1 = j - 1;
+      if ( x[mm1].viewsize > x[jm1].viewsize ) {
+         iW = x[mm1].viewsize; 
+	 x[mm1].viewsize = x[jm1].viewsize; 
+	 x[jm1].viewsize = iW;  
+         iW64 = x[mm1].tuple; 
+	 x[mm1].tuple = x[jm1].tuple; 
+	 x[jm1].tuple = iW64;  
+         j = m;
+      }else j = l;
+   }
+}
+void vszsort( TUPLE_VIEWSIZE x[], uint32 n){
+  int32 i, im1;
+  uint64 iW;
+  uint64 iW64;
+  
+  for ( i = n>>1; i >= 1; i-- ) restore( x, i, n );
+  for ( i = n; i >= 2; i-- ) {
+     im1 = i - 1;
+     iW = x[0].viewsize; x[0].viewsize = x[im1].viewsize; x[im1].viewsize = iW;  
+     iW64 = x[0].tuple; x[0].tuple = x[im1].tuple; x[im1].tuple = iW64;  
+     restore( x, 1, im1);
+  }
+}
+uint32 countTupleOnes(uint64 binRepTuple, uint32 numDims){
+  uint32 i, cnt = 0;
+  uint64 ob = 0x0000000000000001; 
+
+  for(i = 0; i < numDims; i++ ){
+    if ( binRepTuple&ob) cnt++;
+    ob <<= 1;
+  }    
+  return cnt;
+}
+void restoreo( TUPLE_ONES x[], uint32 f, uint32 l ){ 
+   uint32 j, m, tj, mm1, jm1, hl;
+   uint32 iW;
+   uint64 iW64;
+
+   j = f;
+   hl = l>>1;
+   while( j <= hl ) {
+      tj = j*2;
+      if (tj < l && x[tj-1].nOnes < x[tj].nOnes) m = tj+1;
+      else m = tj;
+      mm1 = m - 1; jm1 = j - 1;
+      if ( x[mm1].nOnes > x[jm1].nOnes ){
+         iW = x[mm1].nOnes;
+	     x[mm1].nOnes = x[jm1].nOnes; 
+	     x[jm1].nOnes = iW;  
+         iW64 = x[mm1].tuple; 
+	     x[mm1].tuple = x[jm1].tuple; 
+	     x[jm1].tuple = iW64;  
+         j = m;
+      }else j = l;
+   }
+}
+void onessort( TUPLE_ONES x[], uint32 n){
+   int32 i, im1;
+  uint32 iW;
+  uint64 iW64;
+  
+  for ( i = n>>1; i >= 1; i-- ) restoreo( x, i, n );
+  for ( i = n; i >= 2; i-- ) {
+     im1 = i - 1;
+     iW = x[0].nOnes; 
+     x[0].nOnes = x[im1].nOnes; 
+     x[im1].nOnes = iW;  
+     iW64 = x[0].tuple; 
+     x[0].tuple = x[im1].tuple; 
+     x[im1].tuple = iW64;  
+     restoreo( x, 1, im1);
+  }
+}
+uint32 MultiFileProcJobs( TUPLE_VIEWSIZE *tuplesAndSizes, 
+		                          uint32 nViews, 
+                           ADC_VIEW_CNTL *avp ){
+   uint32 i;
+    int32 ii; /* it should be int */
+   uint32 j;
+   uint32 pn;
+   uint32 direction = 0;
+   uint32 dChange = 0;
+   uint32 gbi;
+   uint32 maxn;
+   uint64 *gbuf;
+   uint64      vszs[MAX_NUMBER_OF_TASKS];
+   uint32 nGroupbys[MAX_NUMBER_OF_TASKS];
+   TUPLE_ONES *toptr;
+
+   gbuf = (uint64*) &avp->memPool[0];
+
+   for(i = 0; i < avp->nTasks; i++ ){ nGroupbys[i] = 0; vszs[i] = 0; }
+
+   for(pn = 0, gbi = 0, ii = nViews-1; ii >= 0; ii-- ){
+     if(pn == avp->taskNumber) gbuf[gbi++]=tuplesAndSizes[ii].tuple;
+     nGroupbys[pn]++;
+     vszs[pn] += tuplesAndSizes[ii].viewsize; 
+     if(direction == 0 && pn == avp->nTasks-1 ) { 
+       direction = 1; 
+       dChange = 1; 
+     }
+     if(direction == 1 && pn == 0 ){ 
+       direction = 0; 
+       dChange = 1; 
+     }
+     if (!dChange){ if (direction) pn--; else pn++;}
+     dChange = 0;
+   }
+   for(maxn = 0, i = 0; i < avp->nTasks; i++) 
+     if (nGroupbys[i] > maxn) maxn = nGroupbys[i];
+
+   toptr = (TUPLE_ONES*) malloc(sizeof(TUPLE_ONES)*maxn);
+   if(!toptr) return 1; 
+
+   for(i = 0; i < avp->nTasks; i++ ){
+     if(i == avp->taskNumber){
+       for(j = 0; j < nGroupbys[i]; j++ ){
+         toptr[j].tuple = gbuf[j];
+         toptr[j].nOnes  = countTupleOnes(gbuf[j], avp->nTopDims);
+       }
+       qsort((void*)gbuf,  nGroupbys[i], 8, Comp8gbuf );
+       onessort(toptr, nGroupbys[i]);
+
+       for(j = 0; j < nGroupbys[i]; j++){
+         toptr[nGroupbys[i]-1-j].tuple <<= (64-avp->nTopDims);
+         swap8(&toptr[nGroupbys[i]-1-j].tuple);
+         fwrite(&toptr[nGroupbys[i]-1-j].tuple, 8, 1, avp->groupbyFile);
+       }
+     }
+   }
+   FSEEK(avp->groupbyFile, 0L, SEEK_SET);
+   if (toptr) free(toptr);
+   return 0;
+}
+int32 PartitionCube(ADC_VIEW_CNTL *avp){
+    TUPLE_VIEWSIZE *tuplesAndSizes;
+    uint32 it = 0;
+    uint64 sz;
+    uint32 sel[64];
+    uint32 k;
+    uint64 tx;
+    uint32 i;
+      char inps[256];
+      
+    tuplesAndSizes = 
+       (TUPLE_VIEWSIZE*) malloc(avp->nViewLimit*sizeof(TUPLE_VIEWSIZE));
+    if(tuplesAndSizes == NULL){
+       fprintf(stderr," PartitionCube(): memory allocation failure\n");
+       return ADC_MEMORY_ALLOCATION_FAILURE;
+    }
+    k = 0;
+    while( fscanf(avp->adcViewSizesFile, "%255s", inps) != EOF ){
+       if( strcmp(inps, "Selection:") == 0 ) {
+         while ( fscanf(avp->adcViewSizesFile, "%255s", inps) == 1 ) {
+           if ( strcmp(inps, "View") == 0 ) break; 
+           sel[k++] = atoi(inps);
+         }
+       }
+       if( strcmp(inps, "Size:") == 0 ){
+         fscanf(avp->adcViewSizesFile, "%255s", inps);
+         sz = atoi(inps);
+         CreateBinTuple(&tx, sel, k);
+         if (sz > avp->nInputRecs) sz = avp->nInputRecs;
+         tuplesAndSizes[it].viewsize = sz;
+         tuplesAndSizes[it].tuple = tx; 
+         it++;
+         k = 0;
+       }  
+    }
+    vszsort(tuplesAndSizes, it);
+    for( i = 0; i < it; i++){
+        tuplesAndSizes[i].tuple >>= (64-avp->nTopDims);
+    }
+    if(MultiFileProcJobs( tuplesAndSizes, it, avp )){
+       fprintf(stderr, "MultiFileProcJobs() failed.\n");
+       fprintf(avp->logf, "MultiFileProcJobs() failed.\n");
+       fflush(avp->logf);
+       free(tuplesAndSizes); return 1;
+    }
+    FSEEK(avp->adcViewSizesFile, 0L, SEEK_SET);
+    free(tuplesAndSizes);
+    return 0;
+}
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/DC/macrodef.h b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/DC/macrodef.h
new file mode 100644
index 000000000..ce67695ea
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/DC/macrodef.h
@@ -0,0 +1,14 @@
+#define PutErrMsg(msg) {fprintf(stderr," %s, errno = %d\n", msg, errno);}
+
+#define WriteToFile(ptr,size,nitems,stream,logf) if( fwrite(ptr,size,nitems,stream) != nitems )\
+       {\
+        fprintf(stderr,"\n Write error from WriteToFile()\n"); return ADC_WRITE_FAILED; \
+       }
+
+#ifdef WINNT
+#define FSEEK(stream,offset,whence)  fseek(stream, (long)offset,whence);
+#else
+#define FSEEK(stream,offset,whence)  fseek(stream,offset,whence); 
+#endif
+
+#define GetRecSize(nd,nm) (DIM_FSZ*(nd)+MSR_FSZ*(nm))
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/DC/protots.h b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/DC/protots.h
new file mode 100644
index 000000000..6ff92a731
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/DC/protots.h
@@ -0,0 +1,100 @@
+ int32 ReadWholeInputData(ADC_VIEW_CNTL *avp, FILE *inpf);
+ 
+ int32 ComputeMemoryFittedView (ADC_VIEW_CNTL *avp);
+
+ int32 MultiWayMerge(ADC_VIEW_CNTL *avp);
+
+ int32 GetPrefixedParent(ADC_VIEW_CNTL *avp, uint32 binRepTuple);
+
+ int32 WriteChunkToDisk(
+       uint32     recordSize, 
+       FILE      *fileOfChunks, 
+       treeNode  *t, 
+       FILE      *logFile);
+
+ int32 DeleteOneFile(const char * file_name);
+
+  void WriteOne64Tuple(char * t, uint64 s, uint32 l, FILE * logf);
+
+ int32 ViewSizesVerification(ADC_VIEW_CNTL *adccntlp);
+
+  void CreateBinTuple(
+       uint64  *binRepTuple, 
+       uint32  *selTuple, 
+       uint32   numDims);
+
+  void AdcCntlLog(ADC_VIEW_CNTL *adccntlp);
+
+  void swap8(void *a);
+
+  void WriteOne32Tuple(char * t, uint32 s, uint32 l, FILE * logf);
+
+  void JobPoolUpdate(ADC_VIEW_CNTL *avp);
+
+ int32 WriteViewToDisk(ADC_VIEW_CNTL *avp, treeNode *t);
+
+uint32 GetSmallestParent(ADC_VIEW_CNTL *avp, uint32 binRepTuple);
+
+ int32 GetParent(ADC_VIEW_CNTL *avp, uint32 binRepTuple);
+
+  void GetRegTupleFromBin64(
+       uint64   binRepTuple, 
+       uint32  *selTuple, 
+       uint32   numDims, 
+       uint32  *numOfUnits); 
+
+  void GetRegTupleFromParent(
+       uint64   bin64RepTuple,
+       uint32   bin32RepTuple,
+       uint32  *selTuple,
+       uint32   nd);
+
+  void JobPoolInit(JOB_POOL *jpp, uint32 n, uint32 nd);
+
+uint32 NumOfCombsFromNbyK (uint32 n, uint32 k);
+
+  void InitializeTree(RBTree *tree, uint32 nd, uint32 nm);
+
+ int32 CheckTree(
+       treeNode  *t , 
+       uint32    *px, 
+       uint32     nv, 
+       uint32     nm, 
+       FILE      *logFile);
+
+ int32 KeyComp(const uint32 *a, const uint32 *b, uint32 n);
+
+ int32 TreeInsert(RBTree *tree, uint32 *attrs);
+
+  void InitializeTree(RBTree *tree, uint32 nd, uint32 nm);
+
+ int32 WriteChunkToDisk(
+       uint32     recordSize, 
+       FILE      *fileOfChunks, 
+       treeNode  *t, 
+       FILE      *logFile);
+
+  void SelectToView(
+       uint32  *ib, 
+       uint32  *ix, 
+       uint32  *viewBuf, 
+       uint32   nd, 
+       uint32   nm, 
+       uint32   nv);
+
+ int32 MultiWayBufferSnap(
+       uint32   nv, 
+       uint32   nm,  
+       uint32  *multiChunkBuffer, 
+       uint32	numberOfChunks, 
+       uint32	regSubChunkSize, 
+       uint32	nRecords);
+
+ RBTree *CreateEmptyTree(
+       uint32          nd, 
+       uint32          nm, 
+       uint32          memoryLimit, 
+       unsigned char  *memPool);
+
+int32 PrefixedAggregate(ADC_VIEW_CNTL *avp, FILE *iof);
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/DC/rbt.c b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/DC/rbt.c
new file mode 100644
index 000000000..ae96e45e4
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/DC/rbt.c
@@ -0,0 +1,240 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include "adc.h"
+#include "macrodef.h"
+
+int32 KeyComp( const uint32 *a, const uint32 *b, uint32 n ) {
+  uint32 i;
+  for ( i = 0; i < n; i++ ) {
+    if (a[i] < b[i]) return(-1);
+    else if (a[i] > b[i]) return(1);
+  }
+  return(0);
+}
+int32 TreeInsert(RBTree *tree, uint32 *attrs){
+   uint32  sl = 1;			    	
+   uint32 *attrsP;
+    int32  cmpres;
+ treeNode *xNd, *yNd, *tmp;
+
+  tmp = &tree->root;
+  xNd = tmp->left;
+
+  if (xNd == NULL){
+    tree->count++;
+    NEW_TREE_NODE(tree->mp,tree->memPool,
+        	      tree->memaddr,tree->treeNodeSize,
+        	      tree->freeNodeCounter,tree->memoryIsFull)
+    xNd = tmp->left = tree->mp;
+    memcpy(&(xNd->nodeMemPool[0]), &attrs[0], tree->nodeDataSize);
+    xNd->left = xNd->right = NULL;
+    xNd->clr = BLACK;
+    return 0;
+  }
+
+  tree->drcts[0] = 0;
+  tree->nodes[0] = &tree->root;
+
+  while(1){
+    attrsP = (uint32*) &(xNd->nodeMemPool[tree->nm]);
+    cmpres = KeyComp( &attrs[tree->nm<<1], attrsP, tree->nd );
+
+    if (cmpres < 0){
+      tree->nodes[sl] = xNd;
+      tree->drcts[sl++] = 0;
+      yNd = xNd->left;
+
+      if(yNd == NULL){
+	    NEW_TREE_NODE(tree->mp,tree->memPool,
+	  	              tree->memaddr,tree->treeNodeSize,
+	  	              tree->freeNodeCounter,tree->memoryIsFull)
+        xNd = xNd->left = tree->mp;
+        break;
+      }
+    }else if (cmpres > 0){
+      tree->nodes[sl] = xNd;
+      tree->drcts[sl++] = 1;
+      yNd = xNd->right;
+      if(yNd == NULL){
+        NEW_TREE_NODE(tree->mp,tree->memPool,
+		              tree->memaddr,tree->treeNodeSize,
+		              tree->freeNodeCounter,tree->memoryIsFull)
+        xNd = xNd->right = tree->mp; 
+        break;
+      }
+    }else{  
+      uint64 ii; 
+      int64 *mx;
+      mx = (int64*) &attrs[0];
+      for ( ii = 0; ii < tree->nm; ii++ ) xNd->nodeMemPool[ii] += mx[ii];
+      return 0; 
+    }
+    xNd = yNd;
+  }
+  tree->count++;
+  memcpy(&(xNd->nodeMemPool[0]), &attrs[0], tree->nodeDataSize);
+  xNd->left = xNd->right = NULL;
+  xNd->clr  = RED;
+
+  while(1){
+    if ( tree->nodes[sl-1]->clr != RED || sl<3 ) break;
+      
+    if (tree->drcts[sl-2] == 0){
+      yNd = tree->nodes[sl-2]->right;
+      if (yNd != NULL && yNd->clr == RED){
+        tree->nodes[sl-1]->clr = BLACK;
+        yNd->clr = BLACK;
+        tree->nodes[sl-2]->clr = RED;
+        sl -= 2;
+      }else{
+        if (tree->drcts[sl-1] == 1){
+	      xNd = tree->nodes[sl-1];
+	      yNd = xNd->right;
+	      xNd->right = yNd->left;
+	      yNd->left  = xNd;
+	      tree->nodes[sl-2]->left = yNd;
+        }else
+          yNd = tree->nodes[sl-1];
+	  
+        xNd = tree->nodes[sl-2];
+        xNd->clr = RED;
+        yNd->clr = BLACK;
+
+        xNd->left  = yNd->right;
+        yNd->right = xNd;
+
+        if(tree->drcts[sl-3])
+          tree->nodes[sl-3]->right = yNd;
+	    else  
+          tree->nodes[sl-3]->left = yNd;
+        break;
+      }
+    }else{
+      yNd = tree->nodes[sl-2]->left;
+      if (yNd != NULL && yNd->clr == RED){
+         tree->nodes[sl-1]->clr = BLACK;
+         yNd->clr = BLACK;
+         tree->nodes[sl-2]->clr = RED;
+         sl -= 2;
+      }else{
+    	if(tree->drcts[sl-1] == 0){
+          xNd = tree->nodes[sl-1];
+          yNd = xNd->left;
+          xNd->left  = yNd->right;
+          yNd->right = xNd;
+          tree->nodes[sl-2]->right = yNd;
+   	    }else
+          yNd = tree->nodes[sl-1];
+
+   	    xNd = tree->nodes[sl-2];
+     	xNd->clr = RED;
+    	yNd->clr = BLACK;
+
+    	xNd->right = yNd->left;
+    	yNd->left  = xNd;
+
+   	    if (tree->drcts[sl-3])
+   	      tree->nodes[sl-3]->right = yNd;
+     	else  
+   	      tree->nodes[sl-3]->left  = yNd;
+   	    break;
+      }
+    }
+  }
+  tree->root.left->clr = BLACK;
+  return 0;
+}
+int32 WriteViewToDisk(ADC_VIEW_CNTL *avp, treeNode *t){
+  uint32 i;
+  if(!t) return ADC_OK;
+  if(WriteViewToDisk( avp, t->left)) return ADC_WRITE_FAILED;
+  for(i=0;i<avp->nm;i++){
+    avp->mSums[i] += t->nodeMemPool[i];  
+  }	   
+  WriteToFile(t->nodeMemPool,avp->outRecSize,1,avp->viewFile,avp->logf);
+  if(WriteViewToDisk( avp, t->right)) return ADC_WRITE_FAILED;
+  return ADC_OK;
+}
+int32 WriteViewToDiskCS(ADC_VIEW_CNTL *avp, treeNode *t,uint64 *ordern){
+  uint32 i;
+  if(!t) return ADC_OK;
+  if(WriteViewToDiskCS( avp, t->left,ordern)) return ADC_WRITE_FAILED;
+  for(i=0;i<avp->nm;i++){
+    avp->mSums[i] += t->nodeMemPool[i];  
+    avp->checksums[i] += (++(*ordern))*t->nodeMemPool[i]%measbound;
+  }	   
+  WriteToFile(t->nodeMemPool,avp->outRecSize,1,avp->viewFile,avp->logf);
+  if(WriteViewToDiskCS( avp, t->right,ordern)) return ADC_WRITE_FAILED;
+  return ADC_OK;
+}
+int32 computeChecksum(ADC_VIEW_CNTL *avp, treeNode *t,uint64 *ordern){
+  uint32 i;
+  if(!t) return ADC_OK;
+  if(computeChecksum(avp,t->left,ordern)) return ADC_WRITE_FAILED;
+  for(i=0;i<avp->nm;i++){
+    avp->checksums[i] += (++(*ordern))*t->nodeMemPool[i]%measbound;
+  }	   
+  if(computeChecksum(avp,t->right,ordern)) return ADC_WRITE_FAILED;
+  return ADC_OK;
+}
+int32 WriteChunkToDisk(uint32 recordSize,FILE *fileOfChunks,
+		       treeNode *t, FILE *logFile){   
+  if(!t) return ADC_OK;
+  if(WriteChunkToDisk( recordSize, fileOfChunks, t->left, logFile)) 
+    return ADC_WRITE_FAILED; 
+  WriteToFile( t->nodeMemPool, recordSize, 1, fileOfChunks, logFile);
+  if(WriteChunkToDisk( recordSize, fileOfChunks, t->right, logFile)) 
+    return ADC_WRITE_FAILED;
+  return ADC_OK;
+}
+RBTree * CreateEmptyTree(uint32 nd, uint32 nm, 
+                         uint32 memoryLimit, unsigned char * memPool){
+  RBTree *tree = (RBTree*)  malloc(sizeof(RBTree));
+  if (!tree) return NULL;
+
+  tree->root.left = NULL;    
+  tree->root.right = NULL;     
+  tree->count = 0;
+  tree->memaddr = 0;
+  tree->treeNodeSize = sizeof(struct treeNode) + DIM_FSZ*(nd-1)+MSR_FSZ*nm;
+  if (tree->treeNodeSize%8 != 0) tree->treeNodeSize += 4;
+  tree->memoryLimit = memoryLimit;
+  tree->memoryIsFull = 0;
+  tree->nodeDataSize = DIM_FSZ*nd + MSR_FSZ*nm;
+  tree->mp = NULL;
+  tree->nNodesLimit = tree->memoryLimit/tree->treeNodeSize;
+  tree->freeNodeCounter = tree->nNodesLimit;
+  tree->nd = nd;
+  tree->nm = nm;
+  tree->memPool = memPool;
+  tree->nodes = (treeNode**) malloc(sizeof(treeNode*)*MAX_TREE_HEIGHT);
+  if (!(tree->nodes)) { free(tree); return NULL; }
+  tree->drcts = (uint32*) malloc( sizeof(uint32)*MAX_TREE_HEIGHT);
+  if (!(tree->drcts)) { free(tree->nodes); free(tree); return NULL; }
+  return tree;
+}
+void InitializeTree(RBTree *tree, uint32 nd, uint32 nm){
+  tree->root.left = NULL;    
+  tree->root.right = NULL;     
+  tree->count = 0;
+  tree->memaddr = 0;
+  tree->treeNodeSize = sizeof(struct treeNode) + DIM_FSZ*(nd-1)+MSR_FSZ*nm;
+  if (tree->treeNodeSize%8 != 0) tree->treeNodeSize += 4;
+  tree->memoryIsFull = 0;
+  tree->nodeDataSize = DIM_FSZ*nd + MSR_FSZ*nm;
+  tree->mp = NULL;
+  tree->nNodesLimit = tree->memoryLimit/tree->treeNodeSize;
+  tree->freeNodeCounter = tree->nNodesLimit;
+  tree->nd = nd;
+  tree->nm = nm;
+}
+int32 DestroyTree(RBTree *tree) {
+  if (tree==NULL) return ADC_TREE_DESTROY_FAILURE;
+  if (tree->memPool!=NULL) free(tree->memPool);
+  if (tree->nodes) free(tree->nodes);
+  if (tree->drcts) free(tree->drcts);
+  free(tree);
+  return ADC_OK;
+}
+
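Reviewer note on rbt.c: computeChecksum and WriteViewToDiskCS both walk the tree in order and weight each visited value by a running position counter before reducing it modulo measbound. Because `*` and `%` share precedence and associate left to right, the expression evaluates as (position * value) % measbound. The C sketch below illustrates that pattern on a simplified node. It is not the benchmark's code: the `Node` type, the `checksum_inorder` name, the single measure per node, and the `bound` parameter (standing in for measbound, which is defined outside this hunk) are all assumptions, and values are assumed non-negative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for treeNode, carrying one measure per node. */
typedef struct Node {
    struct Node *left;
    struct Node *right;
    int64_t value;
} Node;

/* In-order walk: the k-th visited node contributes (k * value) % bound.
   The counter is pre-incremented, so positions start at 1. */
static void checksum_inorder(const Node *t, uint64_t *order,
                             int64_t bound, int64_t *sum) {
    assert(bound > 0);
    if (t == NULL) return;
    checksum_inorder(t->left, order, bound, sum);
    *sum += (int64_t)(++(*order)) * t->value % bound;
    checksum_inorder(t->right, order, bound, sum);
}
```

In the original, the counter advances once per measure rather than once per node, so with `nm` measures the weights for a node are consecutive integers; the one-measure sketch hides that detail.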
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/DC/rbt.h b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/DC/rbt.h
new file mode 100644
index 000000000..de4f99735
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/DC/rbt.h
@@ -0,0 +1,43 @@
+#ifndef _ADC_PARVIEW_TREE_DEF_H_
+#define _ADC_PARVIEW_TREE_DEF_H_
+
+#define MAX_TREE_HEIGHT	64
+enum{BLACK,RED};
+
+typedef struct treeNode{
+  struct treeNode *left;
+  struct treeNode *right;
+  uint32 clr;
+  int64 nodeMemPool[1];
+} treeNode;
+
+typedef struct RBTree{
+  treeNode root;	
+  treeNode * mp;
+  uint32 count;       
+  uint32 treeNodeSize;
+  uint32 nodeDataSize;
+  uint32 memoryLimit; 
+  uint32 memaddr;
+  uint32 memoryIsFull;
+  uint32 freeNodeCounter;
+  uint32 nNodesLimit;
+  uint32 nd;
+  uint32 nm;
+  uint32   *drcts;
+  treeNode **nodes;
+  unsigned char * memPool;
+} RBTree;
+
+#define NEW_TREE_NODE(node_ptr,memPool,memaddr,treeNodeSize, \
+ freeNodeCounter,memoryIsFull) \
+ node_ptr=(struct treeNode*)(memPool+memaddr); \
+ memaddr+=treeNodeSize; \
+ (freeNodeCounter)--; \
+ if( freeNodeCounter == 0 ) { \
+     memoryIsFull = 1; \
+ }
+
+int32 TreeInsert(RBTree *tree, uint32 *attrs);
+
+#endif /* _ADC_PARVIEW_TREE_DEF_H_ */
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/EP/Makefile b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/EP/Makefile
new file mode 100644
index 000000000..8177ea07c
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/EP/Makefile
@@ -0,0 +1,30 @@
+SHELL=/bin/sh
+BENCHMARK=ep
+BENCHMARKU=EP
+
+include ../config/make.def
+
+OBJS = ep.o ep_data.o verify.o \
+       ${COMMON}/print_results.o ${COMMON}/${RAND}.o \
+       ${COMMON}/timers.o ${COMMON}/wtime.o
+
+ifeq (${M5_ANNOTATION}, 1)
+	OBJS += ${COMMON}/hooks.o
+endif
+
+include ../sys/make.common
+
+${PROGRAM}: config ${OBJS}
+	${FLINK} ${FLINKFLAGS} -o ${PROGRAM} ${OBJS} ${F_LIB}
+
+.f90.o:
+	${FCOMPILE} $<
+
+ep.o:		ep.f90 ep_data.o
+ep_data.o:	ep_data.f90 npbparams.h
+verify.o:	verify.f90
+
+clean:
+	- rm -f *.o *~ *.mod
+	- rm -f npbparams.h core
+	- if [ -d rii_files ]; then rm -r rii_files; fi
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/EP/README b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/EP/README
new file mode 100644
index 000000000..0ca487cfe
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/EP/README
@@ -0,0 +1,4 @@
+This code implements the random-number generator described in the
+NAS Parallel Benchmark document RNR Technical Report RNR-94-007.
+The code is "embarrassingly" parallel in that no communication is
+required for the generation of the random numbers itself. 
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/EP/ep.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/EP/ep.f90
new file mode 100644
index 000000000..c41585dd5
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/EP/ep.f90
@@ -0,0 +1,257 @@
+!-------------------------------------------------------------------------!
+!                                                                         !
+!        N  A  S     P A R A L L E L     B E N C H M A R K S  3.4         !
+!                                                                         !
+!                       O p e n M P     V E R S I O N                     !
+!                                                                         !
+!                                   E P                                   !
+!                                                                         !
+!-------------------------------------------------------------------------!
+!                                                                         !
+!    This benchmark is an OpenMP version of the NPB EP code.              !
+!    It is described in NAS Technical Report 99-011.                      !
+!                                                                         !
+!    Permission to use, copy, distribute and modify this software         !
+!    for any purpose with or without fee is hereby granted.  We           !
+!    request, however, that all derived work reference the NAS            !
+!    Parallel Benchmarks 3.4. This software is provided "as is"           !
+!    without express or implied warranty.                                 !
+!                                                                         !
+!    Information on NPB 3.4, including the technical report, the          !
+!    original specifications, source code, results and information        !
+!    on how to submit new results, is available at:                       !
+!                                                                         !
+!           http://www.nas.nasa.gov/Software/NPB/                         !
+!                                                                         !
+!    Send comments or suggestions to  npb@nas.nasa.gov                    !
+!                                                                         !
+!          NAS Parallel Benchmarks Group                                  !
+!          NASA Ames Research Center                                      !
+!          Mail Stop: T27A-1                                              !
+!          Moffett Field, CA   94035-1000                                 !
+!                                                                         !
+!          E-mail:  npb@nas.nasa.gov                                      !
+!          Fax:     (650) 604-3957                                        !
+!                                                                         !
+!-------------------------------------------------------------------------!
+
+
+!---------------------------------------------------------------------
+!
+! Author: P. O. Frederickson
+!         D. H. Bailey
+!         A. C. Woo
+!         H. Jin
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+      program EMBAR
+!---------------------------------------------------------------------
+
+!   This is the OpenMP version of the APP Benchmark 1,
+!   the "embarrassingly parallel" benchmark.
+
+      use ep_data
+
+      implicit none
+
+      double precision Mops, t1, t2, t3, t4, x1, x2,  &
+     &                 sx, sy, tm, an, tt, gc, dum(3)
+
+      integer          i, ik, kk, l, k, nit,  &
+     &                 np, k_offset, j
+
+      logical          verified, timers_enabled
+
+      external         randlc, timer_read
+      double precision randlc, timer_read
+
+      character        size*15, classv
+
+!$    integer  omp_get_max_threads
+!$    external omp_get_max_threads
+
+      data             dum /1.d0, 1.d0, 1.d0/
+
+
+      call check_timer_flag( timers_enabled )
+
+!   Because the size of the problem is too large to store in a 32-bit
+!   integer for some classes, we put it into a string (for printing).
+!   Have to strip off the decimal point put in there by the floating
+!   point print statement (internal file)
+
+      write(*, 1000)
+      write(size, '(f15.0)' ) 2.d0**(m+1)
+      j = 15
+      if (size(j:j) .eq. '.') j = j - 1
+      write (*,1001) size(1:j)
+!$    write (*,1003) omp_get_max_threads()
+      write (*,*)
+
+ 1000 format(//,' NAS Parallel Benchmarks (NPB3.4-OMP)',  &
+     &          ' - EP Benchmark', /)
+ 1001 format(' Number of random numbers generated: ', a15)
+ 1003 format(' Number of available threads:        ', 2x,i13)
+
+
+!   Compute the number of "batches" of random number pairs generated
+!   per processor. Adjust if the number of processors does not evenly
+!   divide the total number
+
+      np = nn
+
+
+!   Call the random number generator functions and initialize
+!   the x-array to reduce the effects of paging on the timings.
+!   Also, call all mathematical functions that are used. Make
+!   sure these initializations cannot be eliminated as dead code.
+
+      call vranlc(0, dum(1), dum(2), dum(3))
+      dum(1) = randlc(dum(2), dum(3))
+      Mops = log(sqrt(abs(max(1.d0,1.d0))))
+
+!$omp parallel default(shared) private(i)
+      do 5    i = 1, 2*nk
+         x(i) = -1.d99
+ 5    continue
+
+      call timer_clear(1)
+      if (timers_enabled) call timer_clear(2)
+      if (timers_enabled) call timer_clear(3)
+!$omp end parallel
+
+#ifdef M5_ANNOTATION
+      call m5_work_begin_interface
+#endif
+      call timer_start(1)
+
+      t1 = a
+      call vranlc(0, t1, a, x)
+
+!   Compute AN = A ^ (2 * NK) (mod 2^46).
+
+      t1 = a
+
+      do 100 i = 1, mk + 1
+         t2 = randlc(t1, t1)
+ 100  continue
+
+      an = t1
+      tt = s
+      gc = 0.d0
+      sx = 0.d0
+      sy = 0.d0
+
+      do 110 i = 0, nq - 1
+         q(i) = 0.d0
+ 110  continue
+
+!   Each instance of this loop may be performed independently. We compute
+!   the k offsets separately to take into account the fact that some nodes
+!   have more numbers to generate than others
+
+      k_offset = -1
+
+!$omp parallel default(shared) reduction(+:sx,sy)  &
+!$omp&  private(k,kk,t1,t2,t3,t4,i,ik,x1,x2,l)
+      do 115 i = 0, nq - 1
+         qq(i) = 0.d0
+ 115  continue
+
+!$omp do schedule(static)
+      do 150 k = 1, np
+         kk = k_offset + k
+         t1 = s
+         t2 = an
+
+!        Find starting seed t1 for this kk.
+
+         if (timers_enabled) call timer_start(3)
+         do 120 i = 1, 100
+            ik = kk / 2
+            if (2 * ik .ne. kk) t3 = randlc(t1, t2)
+            if (ik .eq. 0) goto 130
+            t3 = randlc(t2, t2)
+            kk = ik
+ 120     continue
+
+!        Compute uniform pseudorandom numbers.
+ 130     continue
+
+         call vranlc(2 * nk, t1, a, x)
+         if (timers_enabled) call timer_stop(3)
+
+!        Compute Gaussian deviates by acceptance-rejection method and
+!        tally counts in concentric square annuli.  This loop is not
+!        vectorizable.
+
+         if (timers_enabled) call timer_start(2)
+
+         do 140 i = 1, nk
+            x1 = 2.d0 * x(2*i-1) - 1.d0
+            x2 = 2.d0 * x(2*i) - 1.d0
+            t1 = x1 ** 2 + x2 ** 2
+            if (t1 .le. 1.d0) then
+               t2   = sqrt(-2.d0 * log(t1) / t1)
+               t3   = abs(x1 * t2)
+               t4   = abs(x2 * t2)
+               l    = max(t3, t4)
+               qq(l) = qq(l) + 1.d0
+               sx   = sx + t3
+               sy   = sy + t4
+            endif
+ 140     continue
+
+         if (timers_enabled) call timer_stop(2)
+
+ 150  continue
+!$omp end do nowait
+
+      do 155 i = 0, nq - 1
+!$omp atomic
+         q(i) = q(i) + qq(i)
+ 155  continue
+!$omp end parallel
+
+      do 160 i = 0, nq - 1
+         gc = gc + q(i)
+ 160  continue
+
+      call timer_stop(1)
+
+#ifdef M5_ANNOTATION
+      call m5_work_end_interface
+#endif
+
+      tm  = timer_read(1)
+      call verify(m, sx, sy, gc, verified, classv)
+
+      nit=0
+      Mops = 2.d0**(m+1)/tm/1000000.d0
+
+      write (6,11) tm, m, gc, sx, sy, (i, q(i), i = 0, nq - 1)
+ 11   format ('EP Benchmark Results:'//'CPU Time =',f10.3/'N = 2^',  &
+     &        i5/'No. Gaussian Pairs =',f15.0/'Sums = ',1p,2d25.15/  &
+     &        'Counts:'/(i3,0p,f15.0))
+
+      call print_results('EP', class, m+1, 0, 0, nit,  &
+     &                   tm, Mops,  &
+     &                   'Random numbers generated',  &
+     &                   verified, npbversion, compiletime, cs1,  &
+     &                   cs2, cs3, cs4, cs5, cs6, cs7)
+
+
+      if (timers_enabled) then
+         if (tm .le. 0.d0) tm = 1.0
+         tt = timer_read(1)
+         print 810, 'Total time:    ', tt, tt*100./tm
+         tt = timer_read(2)
+         print 810, 'Gaussian pairs:', tt, tt*100./tm
+         tt = timer_read(3)
+         print 810, 'Random numbers:', tt, tt*100./tm
+810      format(1x,a,f9.3,' (',f6.2,'%)')
+      endif
+
+
+      end
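Reviewer note on ep.f90: the "do 100" loop squares t1 a total of mk + 1 times to obtain AN = A^(2*NK) mod 2^46, which lets each batch jump directly to its own starting seed. The C sketch below illustrates the same idea with integer arithmetic. It is an illustration only: the benchmark's randlc and vranlc work in double precision, and the names `lcg_mul`, `lcg_pow2k`, and `lcg_step_n` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Modulus 2^46 expressed as a bit mask. */
#define LCG_MASK ((UINT64_C(1) << 46) - 1)

/* x * a mod 2^46. Unsigned 64-bit multiplication wraps modulo 2^64,
   and 2^46 divides 2^64, so masking the wrapped product still yields
   the exact residue. */
static uint64_t lcg_mul(uint64_t x, uint64_t a) {
    return (x * a) & LCG_MASK;
}

/* a^(2^k) mod 2^46 by k successive squarings. With k = mk + 1 this
   corresponds to the AN computation in the "do 100" loop. */
static uint64_t lcg_pow2k(uint64_t a, unsigned k) {
    assert(k < 64);
    uint64_t r = a & LCG_MASK;
    for (unsigned i = 0; i < k; i++) {
        r = lcg_mul(r, r);
    }
    return r;
}

/* Advance a seed by n single steps; slow, used only as a cross-check. */
static uint64_t lcg_step_n(uint64_t seed, uint64_t a, uint64_t n) {
    for (uint64_t i = 0; i < n; i++) {
        seed = lcg_mul(seed, a);
    }
    return seed;
}
```

With a = 1220703125 and s = 271828183 from ep_data.f90, multiplying the seed by `lcg_pow2k(a, k)` should land on the same state as 2^k calls to the single-step update, which is the property the per-batch seed search relies on.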
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/EP/ep_data.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/EP/ep_data.f90
new file mode 100644
index 000000000..772c72156
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/EP/ep_data.f90
@@ -0,0 +1,40 @@
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+!
+!  ep_data module
+!
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+ 
+      module ep_data
+
+!---------------------------------------------------------------------
+!  The following include file is generated automatically by the
+!  "setparams" utility, which defines the problem size 'm'
+!---------------------------------------------------------------------
+
+      include 'npbparams.h'
+
+!---------------------------------------------------------------------
+!   M is the Log_2 of the number of complex pairs of uniform (0, 1) random
+!   numbers.  MK is the Log_2 of the size of each batch of uniform random
+!   numbers.  MK can be set for convenience on a given system, since it does
+!   not affect the results.
+!---------------------------------------------------------------------
+      integer    mk, mm, nn, nk, nq
+      parameter (mk = 16, mm = m - mk, nn = 2 ** mm,  &
+     &           nk = 2 ** mk, nq = 10)
+
+      double precision a, s
+      parameter (a = 1220703125.d0, s = 271828183.d0)
+
+! ... storage
+      double precision x(2*nk), qq(0:nq-1), q(0:nq-1)
+!$omp threadprivate( x, qq )
+
+! ... timer constants
+      integer    t_total, t_gpairs, t_randn, t_rcomm, t_last
+      parameter (t_total=1, t_gpairs=2, t_randn=3, t_rcomm=4, t_last=4)
+
+      end module ep_data
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/EP/verify.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/EP/verify.f90
new file mode 100644
index 000000000..65fee595c
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/EP/verify.f90
@@ -0,0 +1,82 @@
+!---------------------------------------------------------------------
+      subroutine verify(m, sx, sy, gc, verified, class)
+!---------------------------------------------------------------------
+
+      use, intrinsic :: ieee_arithmetic, only : ieee_is_nan
+
+      implicit none
+      integer m
+      double precision sx, sy, gc
+      logical verified
+      character class
+
+      double precision sx_verify_value, sy_verify_value
+      double precision gc_verify_value
+      double precision sx_err, sy_err, gc_err
+
+      double precision, parameter :: epsilon = 1.d-8
+
+      verified = .true.
+      if (m.eq.24) then
+         class = 'S'
+         sx_verify_value = 1.051299420395306D+07
+         sy_verify_value = 1.051517131857535D+07
+         gc_verify_value = 13176389.D0
+      elseif (m.eq.25) then
+         class = 'W'
+         sx_verify_value = 2.102505525182392D+07
+         sy_verify_value = 2.103162209578822D+07
+         gc_verify_value = 26354769.D0
+      elseif (m.eq.28) then
+         class = 'A'
+         sx_verify_value = 1.682235632304711D+08
+         sy_verify_value = 1.682195123368299D+08
+         gc_verify_value = 210832767.D0
+      elseif (m.eq.30) then
+         class = 'B'
+         sx_verify_value = 6.728927543423024D+08
+         sy_verify_value = 6.728951822504275D+08
+         gc_verify_value = 843345606.D0
+      elseif (m.eq.32) then
+         class = 'C'
+         sx_verify_value = 2.691444083862931D+09
+         sy_verify_value = 2.691519118724585D+09
+         gc_verify_value = 3373275903.D0
+      elseif (m.eq.36) then
+         class = 'D'
+         sx_verify_value = 4.306350280812112D+10
+         sy_verify_value = 4.306347571859157D+10
+         gc_verify_value = 53972171957.D0
+      elseif (m.eq.40) then
+         class = 'E'
+         sx_verify_value = 6.890169663167274D+11
+         sy_verify_value = 6.890164670688535D+11
+         gc_verify_value = 863554308186.D0
+      elseif (m.eq.44) then
+         class = 'F'
+         sx_verify_value = 1.102426773788175D+13
+         sy_verify_value = 1.102426773787993D+13
+         gc_verify_value = 13816870608324.D0
+      else
+         class = 'U'
+         verified = .false.
+      endif
+      if (verified) then
+         sx_err = abs((sx - sx_verify_value)/sx_verify_value)
+         sy_err = abs((sy - sy_verify_value)/sy_verify_value)
+         if (ieee_is_nan(sx_err) .or. ieee_is_nan(sy_err)) then
+            verified = .false.
+         else
+            verified = ((sx_err.le.epsilon) .and. (sy_err.le.epsilon))
+         endif
+      endif
+      if (verified) then
+         gc_err = abs((gc - gc_verify_value)/gc_verify_value)
+         if (ieee_is_nan(gc_err) .or. gc_err.gt.epsilon) then
+            verified = .false.
+         endif
+      endif
+
+      return
+      end
+
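Reviewer note on verify.f90: each of sx, sy, and gc is accepted only if its relative error against the class reference value is not NaN and does not exceed epsilon = 1.d-8. A minimal C sketch of that single-value test follows; the function name `within_rel_tol` is hypothetical, and the reference value is assumed to be non-zero, as it is for every class listed.

```c
#include <assert.h>
#include <math.h>
#include <stdbool.h>

/* Returns true when |got - ref| / |ref| is a number and is <= eps.
   A NaN error is rejected explicitly, mirroring the ieee_is_nan test. */
static bool within_rel_tol(double got, double ref, double eps) {
    assert(ref != 0.0);
    double err = (got - ref) / ref;
    if (err < 0.0) {
        err = -err;
    }
    if (isnan(err)) {
        return false;
    }
    return err <= eps;
}
```

The full check would apply this to all three quantities and report success only if every one passes.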
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/FT/Makefile b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/FT/Makefile
new file mode 100644
index 000000000..b7e3ed600
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/FT/Makefile
@@ -0,0 +1,44 @@
+SHELL=/bin/sh
+BENCHMARK=ft
+BENCHMARKU=FT
+BLKFAC=32
+
+include ../config/make.def
+
+include ../sys/make.common
+
+OBJS = ft.o ft_data.o ${COMMON}/${RAND}.o ${COMMON}/print_results.o \
+       ${COMMON}/timers.o ${COMMON}/wtime.o
+
+ifeq (${M5_ANNOTATION}, 1)
+	OBJS += ${COMMON}/hooks.o
+endif
+
+${PROGRAM}: config
+	@ver=$(VERSION); bfac=`echo $$ver|sed -e 's/^blk//' -e 's/^BLK//'`; \
+	if [ x$$ver != x$$bfac ] ; then		\
+		${MAKE} BLKFAC=$${bfac:-32} exec;	\
+	else					\
+		${MAKE} exec;			\
+	fi
+
+exec: $(OBJS)
+	${FLINK} ${FLINKFLAGS} -o ${PROGRAM} ${OBJS} ${F_LIB}
+
+
+.f90.o:
+	${FCOMPILE} $<
+
+blk_par.h: FORCE
+	sed -e 's/=0/=$(BLKFAC)/' blk_par0.h > blk_par.h_wk
+	@ if ! `diff blk_par.h_wk blk_par.h > /dev/null 2>&1`; then \
+	mv -f blk_par.h_wk blk_par.h; else rm -f blk_par.h_wk; fi
+FORCE:
+
+ft.o:		ft.f90  ft_data.o
+ft_data.o:	ft_data.f90  npbparams.h blk_par.h
+
+clean:
+	- rm -f *.o *~ mputil* *.mod
+	- rm -f ft npbparams.h core blk_par.h
+	- if [ -d rii_files ]; then rm -r rii_files; fi
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/FT/README b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/FT/README
new file mode 100644
index 000000000..fd6d3b3ca
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/FT/README
@@ -0,0 +1,5 @@
+This code implements the time integration of a three-dimensional
+partial differential equation using the Fast Fourier Transform.
+
+The code uses Fortran 90 modules to specify data fields, so a compiler
+that supports Fortran 90 is required.
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/FT/blk_par0.h b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/FT/blk_par0.h
new file mode 100644
index 000000000..d593f6ea0
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/FT/blk_par0.h
@@ -0,0 +1,4 @@
+      integer fftblock_default, fftblockpad_default
+      parameter (fftblock_default=0,  &
+     &           fftblockpad_default=fftblock_default+2)
+      
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/FT/ft.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/FT/ft.f90
new file mode 100644
index 000000000..601ac3cc6
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/FT/ft.f90
@@ -0,0 +1,1138 @@
+!-------------------------------------------------------------------------!
+!                                                                         !
+!        N  A  S     P A R A L L E L     B E N C H M A R K S  3.4         !
+!                                                                         !
+!                       O p e n M P     V E R S I O N                     !
+!                                                                         !
+!                                   F T                                   !
+!                                                                         !
+!-------------------------------------------------------------------------!
+!                                                                         !
+!    This benchmark is an OpenMP version of the NPB FT code.              !
+!    It is described in NAS Technical Report 99-011.                      !
+!                                                                         !
+!    Permission to use, copy, distribute and modify this software         !
+!    for any purpose with or without fee is hereby granted.  We           !
+!    request, however, that all derived work reference the NAS            !
+!    Parallel Benchmarks 3.4. This software is provided "as is"           !
+!    without express or implied warranty.                                 !
+!                                                                         !
+!    Information on NPB 3.4, including the technical report, the          !
+!    original specifications, source code, results and information        !
+!    on how to submit new results, is available at:                       !
+!                                                                         !
+!           http://www.nas.nasa.gov/Software/NPB/                         !
+!                                                                         !
+!    Send comments or suggestions to  npb@nas.nasa.gov                    !
+!                                                                         !
+!          NAS Parallel Benchmarks Group                                  !
+!          NASA Ames Research Center                                      !
+!          Mail Stop: T27A-1                                              !
+!          Moffett Field, CA   94035-1000                                 !
+!                                                                         !
+!          E-mail:  npb@nas.nasa.gov                                      !
+!          Fax:     (650) 604-3957                                        !
+!                                                                         !
+!-------------------------------------------------------------------------!
+
+!---------------------------------------------------------------------
+!
+! Authors: D. Bailey
+!          W. Saphir
+!          H. Jin
+!
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+! FT benchmark
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      program ft
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+! Module ft_fields defines main arrays (u0, u1, u2) in the problem
+!---------------------------------------------------------------------
+
+      use ft_data
+      use ft_fields
+
+      implicit none
+
+      integer i
+
+      integer iter
+      double precision total_time, mflops
+      logical verified
+      character class
+
+
+!---------------------------------------------------------------------
+! Run the entire problem once to make sure all data is touched.
+! This reduces variable startup costs, which is important for such a
+! short benchmark. The other NPB 2 implementations are similar.
+!---------------------------------------------------------------------
+      do i = 1, t_max
+         call timer_clear(i)
+      end do
+
+      call alloc_space
+
+      call setup()
+      call init_ui(u0, u1, twiddle, dims(1), dims(2), dims(3))
+      call compute_indexmap(twiddle, dims(1), dims(2), dims(3))
+      call compute_initial_conditions(u1, dims(1), dims(2), dims(3))
+      call fft_init (dims(1))
+      call fft(1, u1, u0)
+
+!---------------------------------------------------------------------
+! Start over from the beginning. Note that all operations must
+! be timed, in contrast to other benchmarks.
+!---------------------------------------------------------------------
+      do i = 1, t_max
+         call timer_clear(i)
+      end do
+#ifdef M5_ANNOTATION
+      call m5_work_begin_interface
+#endif
+
+      call timer_start(T_total)
+      if (timers_enabled) call timer_start(T_setup)
+
+      call compute_indexmap(twiddle, dims(1), dims(2), dims(3))
+
+      call compute_initial_conditions(u1, dims(1), dims(2), dims(3))
+
+      call fft_init (dims(1))
+
+      if (timers_enabled) call timer_stop(T_setup)
+      if (timers_enabled) call timer_start(T_fft)
+      call fft(1, u1, u0)
+      if (timers_enabled) call timer_stop(T_fft)
+
+      do iter = 1, niter
+         if (timers_enabled) call timer_start(T_evolve)
+         call evolve(u0, u1, twiddle, dims(1), dims(2), dims(3))
+         if (timers_enabled) call timer_stop(T_evolve)
+         if (timers_enabled) call timer_start(T_fft)
+!         call fft(-1, u1, u2)
+         call fft(-1, u1, u1)
+         if (timers_enabled) call timer_stop(T_fft)
+         if (timers_enabled) call timer_start(T_checksum)
+!         call checksum(iter, u2, dims(1), dims(2), dims(3))
+         call checksum(iter, u1, dims(1), dims(2), dims(3))
+         if (timers_enabled) call timer_stop(T_checksum)
+      end do
+
+      call verify(nx, ny, nz, niter, verified, class)
+
+      call timer_stop(t_total)
+
+#ifdef M5_ANNOTATION
+      call m5_work_end_interface
+#endif
+
+      total_time = timer_read(t_total)
+      if( total_time .ne. 0. ) then
+         mflops = 1.0d-6*ntotal_f *  &
+     &             (14.8157+7.19641*log(ntotal_f)  &
+     &          +  (5.23518+7.21113*log(ntotal_f))*niter)  &
+     &                 /total_time
+      else
+         mflops = 0.0
+      endif
+      call print_results('FT', class, nx, ny, nz, niter,  &
+     &  total_time, mflops, '          floating point', verified,  &
+     &  npbversion, compiletime, cs1, cs2, cs3, cs4, cs5, cs6, cs7)
+      if (timers_enabled) call print_timers()
+
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine init_ui(u0, u1, twiddle, d1, d2, d3)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+! touch all the big data
+!---------------------------------------------------------------------
+
+      implicit none
+      integer d1, d2, d3
+      double complex   u0(d1+1,d2,d3)
+      double complex   u1(d1+1,d2,d3)
+      double precision twiddle(d1+1,d2,d3)
+      integer i, j, k
+
+!$omp parallel do default(shared) private(i,j,k) collapse(2)
+      do k = 1, d3
+         do j = 1, d2
+            do i = 1, d1
+               u0(i,j,k) = 0.d0
+               u1(i,j,k) = 0.d0
+               twiddle(i,j,k) = 0.d0
+            end do
+         end do
+      end do
+
+      return
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine evolve(u0, u1, twiddle, d1, d2, d3)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+! evolve u0 -> u1 (t time steps) in Fourier space
+!---------------------------------------------------------------------
+
+      use ft_data
+      implicit none
+
+      integer d1, d2, d3
+      double complex   u0(d1+1,d2,d3)
+      double complex   u1(d1+1,d2,d3)
+      double precision twiddle(d1+1,d2,d3)
+      integer i, j, k
+
+!$omp parallel do default(shared) private(i,j,k) collapse(2)
+      do k = 1, d3
+         do j = 1, d2
+            do i = 1, d1
+               u0(i,j,k) = u0(i,j,k) * twiddle(i,j,k)
+               u1(i,j,k) = u0(i,j,k)
+            end do
+         end do
+      end do
+
+      return
+      end
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine compute_initial_conditions(u0, d1, d2, d3)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+! Fill in array u0 with initial conditions from
+! random number generator
+!---------------------------------------------------------------------
+
+      use ft_data
+      implicit none
+
+      integer d1, d2, d3
+      double complex u0(d1+1, d2, d3)
+      integer k, j
+      double precision x0, start, an, dummy, starts(nz)
+
+
+      start = seed
+!---------------------------------------------------------------------
+! Jump to the starting element for our first plane.
+!---------------------------------------------------------------------
+      call ipow46(a, 0, an)
+      dummy = randlc(start, an)
+      call ipow46(a, 2*nx*ny, an)
+
+      starts(1) = start
+      do k = 2, dims(3)
+         dummy = randlc(start, an)
+         starts(k) = start
+      end do
+
+!---------------------------------------------------------------------
+! Go through by z planes filling in one square at a time.
+!---------------------------------------------------------------------
+!$omp parallel do default(shared) private(k,j,x0)
+      do k = 1, dims(3)
+         x0 = starts(k)
+         do j = 1, dims(2)
+            call vranlc(2*nx, x0, a, u0(1, j, k))
+         end do
+      end do
+
+      return
+      end
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine ipow46(a, exponent, result)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+! compute a^exponent mod 2^46
+!---------------------------------------------------------------------
+
+      implicit none
+      double precision a, result, dummy, q, r
+      integer exponent, n, n2
+      external randlc
+      double precision randlc
+!---------------------------------------------------------------------
+! Use
+!   a^n = a^(n/2)*a^(n/2) if n even else
+!   a^n = a*a^(n-1)       if n odd
+!---------------------------------------------------------------------
+      result = 1
+      if (exponent .eq. 0) return
+      q = a
+      r = 1
+      n = exponent
+
+
+      do while (n .gt. 1)
+         n2 = n/2
+         if (n2 * 2 .eq. n) then
+            dummy = randlc(q, q)
+            n = n2
+         else
+            dummy = randlc(r, q)
+            n = n-1
+         endif
+      end do
+      dummy = randlc(r, q)
+      result = r
+      return
+      end
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine setup
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      use ft_data
+      implicit none
+
+!$    integer  omp_get_max_threads
+!$    external omp_get_max_threads
+      debug = .FALSE.
+
+      call check_timer_flag( timers_enabled )
+
+      write(*, 1000)
+
+      niter = niter_default
+
+      write(*, 1001) nx, ny, nz
+      write(*, 1002) niter
+!$    write(*, 1003) omp_get_max_threads()
+      write(*, *)
+
+
+ 1000 format(//,' NAS Parallel Benchmarks (NPB3.4-OMP)',  &
+     &          ' - FT Benchmark', /)
+ 1001 format(' Size                : ', i4, 'x', i4, 'x', i4)
+ 1002 format(' Iterations                  :', i7)
+ 1003 format(' Number of available threads :', i7)
+
+      dims(1) = nx
+      dims(2) = ny
+      dims(3) = nz
+
+
+!---------------------------------------------------------------------
+! Set up info for blocking of ffts and transposes.  This improves
+! performance on cache-based systems. Blocking involves
+! working on a chunk of the problem at a time, taking chunks
+! along the first, second, or third dimension.
+!
+! - In cffts1 blocking is on 2nd dimension (with fft on 1st dim)
+! - In cffts2/3 blocking is on 1st dimension (with fft on 2nd and 3rd dims)
+!
+! Since 1st dim is always in processor, we'll assume it's long enough
+! (default blocking factor is 16 so min size for 1st dim is 16)
+! The only case we have to worry about is cffts1 in a 2d decomposition.
+! so the blocking factor should not be larger than the 2nd dimension.
+!---------------------------------------------------------------------
+
+      fftblock = fftblock_default
+      fftblockpad = fftblockpad_default
+
+      if (fftblock .ne. fftblock_default) fftblockpad = fftblock+3
+
+      return
+      end
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine compute_indexmap(twiddle, d1, d2, d3)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+! compute function from local (i,j,k) to ibar^2+jbar^2+kbar^2
+! for time evolution exponent.
+!---------------------------------------------------------------------
+
+      use ft_data
+      implicit none
+
+      integer d1, d2, d3
+      double precision twiddle(d1+1, d2, d3)
+      integer i, j, k, kk, kk2, jj, kj2, ii
+      double precision ap
+
+!---------------------------------------------------------------------
+! basically we want to convert the Fortran indices (shown here for n = 8)
+!   1 2 3 4 5 6 7 8
+! to
+!   0 1 2 3 -4 -3 -2 -1
+! The following magic formula does the trick:
+! mod(i-1+n/2, n) - n/2
+!---------------------------------------------------------------------
+
+      ap = - 4.d0 * alpha * pi *pi
+
+!$omp parallel do default(shared) private(i,j,k,kk,kk2,jj,kj2,ii)  &
+!$omp&  collapse(2)
+      do k = 1, dims(3)
+         do j = 1, dims(2)
+            kk =  mod(k-1+nz/2, nz) - nz/2
+            kk2 = kk*kk
+            jj = mod(j-1+ny/2, ny) - ny/2
+            kj2 = jj*jj+kk2
+            do i = 1, dims(1)
+               ii = mod(i-1+nx/2, nx) - nx/2
+               twiddle(i,j,k) = dexp(ap*dble(ii*ii+kj2))
+            end do
+         end do
+      end do
+
+      return
+      end
+
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine print_timers()
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      use ft_data
+      implicit none
+
+      integer i
+      double precision t, t_m
+      character*25 tstrings(T_max)
+      data tstrings / '          total ',  &
+     &                '          setup ',  &
+     &                '            fft ',  &
+     &                '         evolve ',  &
+     &                '       checksum ',  &
+     &                '           fftx ',  &
+     &                '           ffty ',  &
+     &                '           fftz ' /
+
+      t_m = timer_read(T_total)
+      if (t_m .le. 0.0d0) t_m = 1.0d0
+      do i = 1, t_max
+         t = timer_read(i)
+         write(*, 100) i, tstrings(i), t, t*100.0/t_m
+      end do
+ 100  format(' timer ', i2, '(', A16,  ') :', F9.4, ' (',F6.2,'%)')
+      return
+      end
+
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine fft(dir, x1, x2)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      use ft_data
+      implicit none
+
+      integer dir
+      double complex x1(ntotalp), x2(ntotalp)
+
+      double complex y1(fftblockpad_default*maxdim),  &
+     &               y2(fftblockpad_default*maxdim)
+
+!---------------------------------------------------------------------
+! note: args x1, x2 must be different arrays
+! note: args for cfftsN are (direction, d1, d2, d3, xin, xout, y1, y2),
+!       where y1 and y2 are scratch arrays; xin/xout may be the same
+!       array, which can be somewhat faster
+!---------------------------------------------------------------------
+
+      if (dir .eq. 1) then
+         call cffts1(1, dims(1), dims(2), dims(3), x1, x1, y1, y2)
+         call cffts2(1, dims(1), dims(2), dims(3), x1, x1, y1, y2)
+         call cffts3(1, dims(1), dims(2), dims(3), x1, x2, y1, y2)
+      else
+         call cffts3(-1, dims(1), dims(2), dims(3), x1, x1, y1, y2)
+         call cffts2(-1, dims(1), dims(2), dims(3), x1, x1, y1, y2)
+         call cffts1(-1, dims(1), dims(2), dims(3), x1, x2, y1, y2)
+      endif
+      return
+      end
+
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine cffts1(is, d1, d2, d3, x, xout, y1, y2)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      use ft_data
+      implicit none
+
+      integer is, d1, d2, d3, logd1
+      double complex x(d1+1,d2,d3)
+      double complex xout(d1+1,d2,d3)
+      double complex y1(fftblockpad, d1), y2(fftblockpad, d1)
+      integer i, j, k, jj, jn
+
+      logd1 = ilog2(d1)
+
+      if (timers_enabled) call timer_start(T_fftx)
+!$omp parallel do default(shared) private(i,j,k,jj,y1,y2,jn)  &
+!$omp&  shared(is,logd1,d1) collapse(2)
+      do k = 1, d3
+         do jn = 0, d2/fftblock - 1
+!         do jj = 0, d2 - fftblock, fftblock
+            jj = jn*fftblock
+            do j = 1, fftblock
+               do i = 1, d1
+                  y1(j,i) = x(i,j+jj,k)
+               enddo
+            enddo
+
+            call cfftz (is, logd1, d1, y1, y2)
+
+
+            do j = 1, fftblock
+               do i = 1, d1
+                  xout(i,j+jj,k) = y1(j,i)
+               enddo
+            enddo
+         enddo
+      enddo
+      if (timers_enabled) call timer_stop(T_fftx)
+
+      return
+      end
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine cffts2(is, d1, d2, d3, x, xout, y1, y2)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      use ft_data
+      implicit none
+
+      integer is, d1, d2, d3, logd2
+      double complex x(d1+1,d2,d3)
+      double complex xout(d1+1,d2,d3)
+      double complex y1(fftblockpad, d2), y2(fftblockpad, d2)
+      integer i, j, k, ii, in
+
+      logd2 = ilog2(d2)
+
+      if (timers_enabled) call timer_start(T_ffty)
+!$omp parallel do default(shared) private(i,j,k,ii,y1,y2,in)  &
+!$omp&  shared(is,logd2,d2) collapse(2)
+      do k = 1, d3
+        do in = 0, d1/fftblock - 1
+!        do ii = 0, d1 - fftblock, fftblock
+           ii = in*fftblock
+           do j = 1, d2
+              do i = 1, fftblock
+                 y1(i,j) = x(i+ii,j,k)
+              enddo
+           enddo
+
+           call cfftz (is, logd2, d2, y1, y2)
+
+           do j = 1, d2
+              do i = 1, fftblock
+                 xout(i+ii,j,k) = y1(i,j)
+              enddo
+           enddo
+        enddo
+      enddo
+      if (timers_enabled) call timer_stop(T_ffty)
+
+      return
+      end
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine cffts3(is, d1, d2, d3, x, xout, y1, y2)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      use ft_data
+      implicit none
+
+      integer is, d1, d2, d3, logd3
+      double complex x(d1+1,d2,d3)
+      double complex xout(d1+1,d2,d3)
+      double complex y1(fftblockpad, d3), y2(fftblockpad, d3)
+      integer i, j, k, ii, in
+
+      logd3 = ilog2(d3)
+
+      if (timers_enabled) call timer_start(T_fftz)
+!$omp parallel do default(shared) private(i,j,k,ii,y1,y2,in)  &
+!$omp&  shared(is) collapse(2)
+      do j = 1, d2
+        do in = 0, d1/fftblock - 1
+!        do ii = 0, d1 - fftblock, fftblock
+           ii = in*fftblock
+           do k = 1, d3
+              do i = 1, fftblock
+                 y1(i,k) = x(i+ii,j,k)
+              enddo
+           enddo
+
+           call cfftz (is, logd3, d3, y1, y2)
+
+           do k = 1, d3
+              do i = 1, fftblock
+                 xout(i+ii,j,k) = y1(i,k)
+              enddo
+           enddo
+        enddo
+      enddo
+      if (timers_enabled) call timer_stop(T_fftz)
+
+      return
+      end
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine fft_init (n)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+! compute the roots-of-unity array that will be used for subsequent FFTs.
+!---------------------------------------------------------------------
+
+      use ft_data
+      implicit none
+
+      integer m,n,nu,ku,i,j,ln
+      double precision t, ti
+
+
+!---------------------------------------------------------------------
+!   Initialize the U array with sines and cosines in a manner that permits
+!   stride one access at each FFT iteration.
+!---------------------------------------------------------------------
+      nu = n
+      m = ilog2(n)
+      u(1) = m
+      ku = 2
+      ln = 1
+
+      do j = 1, m
+         t = pi / ln
+
+         do i = 0, ln - 1
+            ti = i * t
+            u(i+ku) = dcmplx (cos (ti), sin(ti))
+         enddo
+
+         ku = ku + ln
+         ln = 2 * ln
+      enddo
+
+      return
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine cfftz (is, m, n, x, y)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!   Computes FFTBLOCK N-point complex-to-complex FFTs of X using an algorithm
+!   due to Swarztrauber.  X is both the input and the output array, while Y is
+!   a scratch array.  It is assumed that N = 2^M.  Before calling CFFTZ to
+!   perform FFTs, the array U must be initialized by calling FFT_INIT with N
+!   set to the largest transform size, so that U(1) = MX, the maximum value of
+!   M for any subsequent call.
+!---------------------------------------------------------------------
+
+      use ft_data
+      implicit none
+
+      integer is,m,n,i,j,l,mx
+      double complex x, y
+
+      dimension x(fftblockpad,n), y(fftblockpad,n)
+
+!---------------------------------------------------------------------
+!   Check if input parameters are invalid.
+!---------------------------------------------------------------------
+      mx = u(1)
+      if ((is .ne. 1 .and. is .ne. -1) .or. m .lt. 1 .or. m .gt. mx)    &
+     &  then
+        write (*, 1)  is, m, mx
+ 1      format ('CFFTZ: Either U has not been initialized, or else'/    &
+     &    'one of the input parameters is invalid', 3I5)
+        stop
+      endif
+
+!---------------------------------------------------------------------
+!   Perform one variant of the Stockham FFT.
+!---------------------------------------------------------------------
+      do l = 1, m, 2
+        call fftz2 (is, l, m, n, fftblock, fftblockpad, u, x, y)
+        if (l .eq. m) goto 160
+        call fftz2 (is, l + 1, m, n, fftblock, fftblockpad, u, y, x)
+      enddo
+
+      goto 180
+
+!---------------------------------------------------------------------
+!   Copy Y to X.
+!---------------------------------------------------------------------
+ 160  do j = 1, n
+        do i = 1, fftblock
+          x(i,j) = y(i,j)
+        enddo
+      enddo
+
+ 180  continue
+
+      return
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine fftz2 (is, l, m, n, ny, ny1, u, x, y)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!   Performs the L-th iteration of the second variant of the Stockham FFT.
+!---------------------------------------------------------------------
+
+      implicit none
+
+      integer is,k,l,m,n,ny,ny1,n1,li,lj,lk,ku,i,j,i11,i12,i21,i22
+      double complex u,x,y,u1,x11,x21
+      dimension u(n), x(ny1,n), y(ny1,n)
+
+
+!---------------------------------------------------------------------
+!   Set initial parameters.
+!---------------------------------------------------------------------
+
+      n1 = n / 2
+      lk = 2 ** (l - 1)
+      li = 2 ** (m - l)
+      lj = 2 * lk
+      ku = li + 1
+
+      do i = 0, li - 1
+        i11 = i * lk + 1
+        i12 = i11 + n1
+        i21 = i * lj + 1
+        i22 = i21 + lk
+        if (is .ge. 1) then
+          u1 = u(ku+i)
+        else
+          u1 = dconjg (u(ku+i))
+        endif
+
+!---------------------------------------------------------------------
+!   This loop is vectorizable.
+!---------------------------------------------------------------------
+        do k = 0, lk - 1
+          do j = 1, ny
+            x11 = x(j,i11+k)
+            x21 = x(j,i12+k)
+            y(j,i21+k) = x11 + x21
+            y(j,i22+k) = u1 * (x11 - x21)
+          enddo
+        enddo
+      enddo
+
+      return
+      end
+
+!---------------------------------------------------------------------
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      integer function ilog2(n)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      implicit none
+      integer n, nn, lg
+      if (n .eq. 1) then
+         ilog2=0
+         return
+      endif
+      lg = 1
+      nn = 2
+      do while (nn .lt. n)
+         nn = nn*2
+         lg = lg+1
+      end do
+      ilog2 = lg
+      return
+      end
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine checksum(i, u1, d1, d2, d3)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      use ft_data
+      implicit none
+
+      integer i, d1, d2, d3
+      double complex u1(d1+1,d2,d3)
+      integer j, q,r,s
+      double complex chk
+      chk = (0.0,0.0)
+
+!$omp parallel do default(shared) private(j,q,r,s) reduction(+:chk)
+      do j=1,1024
+         q = mod(j, nx)+1
+         r = mod(3*j,ny)+1
+         s = mod(5*j,nz)+1
+         chk=chk+u1(q,r,s)
+      end do
+
+      chk = chk/ntotal_f
+
+      write (*, 30) i, chk
+ 30   format (' T =',I5,5X,'Checksum =',1P2D22.12)
+      sums(i) = chk
+      return
+      end
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine verify (d1, d2, d3, nt, verified, class)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      use, intrinsic :: ieee_arithmetic, only : ieee_is_nan
+
+      use ft_data
+
+      implicit none
+
+      integer d1, d2, d3, nt
+      character class
+      logical verified
+      integer i
+      double precision err, epsilon
+
+!---------------------------------------------------------------------
+!   Reference checksums
+!---------------------------------------------------------------------
+      double complex csum_ref(25)
+
+
+      class = 'U'
+
+      epsilon = 1.0d-12
+      verified = .FALSE.
+
+      if (d1 .eq. 64 .and.  &
+     &    d2 .eq. 64 .and.  &
+     &    d3 .eq. 64 .and.  &
+     &    nt .eq. 6) then
+!---------------------------------------------------------------------
+!   Sample size reference checksums
+!---------------------------------------------------------------------
+         class = 'S'
+         csum_ref(1) = dcmplx(5.546087004964D+02, 4.845363331978D+02)
+         csum_ref(2) = dcmplx(5.546385409189D+02, 4.865304269511D+02)
+         csum_ref(3) = dcmplx(5.546148406171D+02, 4.883910722336D+02)
+         csum_ref(4) = dcmplx(5.545423607415D+02, 4.901273169046D+02)
+         csum_ref(5) = dcmplx(5.544255039624D+02, 4.917475857993D+02)
+         csum_ref(6) = dcmplx(5.542683411902D+02, 4.932597244941D+02)
+
+      else if (d1 .eq. 128 .and.  &
+     &    d2 .eq. 128 .and.  &
+     &    d3 .eq. 32 .and.  &
+     &    nt .eq. 6) then
+!---------------------------------------------------------------------
+!   Class W size reference checksums
+!---------------------------------------------------------------------
+         class = 'W'
+         csum_ref(1) = dcmplx(5.673612178944D+02, 5.293246849175D+02)
+         csum_ref(2) = dcmplx(5.631436885271D+02, 5.282149986629D+02)
+         csum_ref(3) = dcmplx(5.594024089970D+02, 5.270996558037D+02)
+         csum_ref(4) = dcmplx(5.560698047020D+02, 5.260027904925D+02)
+         csum_ref(5) = dcmplx(5.530898991250D+02, 5.249400845633D+02)
+         csum_ref(6) = dcmplx(5.504159734538D+02, 5.239212247086D+02)
+
+      else if (d1 .eq. 256 .and.  &
+     &    d2 .eq. 256 .and.  &
+     &    d3 .eq. 128 .and.  &
+     &    nt .eq. 6) then
+!---------------------------------------------------------------------
+!   Class A size reference checksums
+!---------------------------------------------------------------------
+         class = 'A'
+         csum_ref(1) = dcmplx(5.046735008193D+02, 5.114047905510D+02)
+         csum_ref(2) = dcmplx(5.059412319734D+02, 5.098809666433D+02)
+         csum_ref(3) = dcmplx(5.069376896287D+02, 5.098144042213D+02)
+         csum_ref(4) = dcmplx(5.077892868474D+02, 5.101336130759D+02)
+         csum_ref(5) = dcmplx(5.085233095391D+02, 5.104914655194D+02)
+         csum_ref(6) = dcmplx(5.091487099959D+02, 5.107917842803D+02)
+
+      else if (d1 .eq. 512 .and.  &
+     &    d2 .eq. 256 .and.  &
+     &    d3 .eq. 256 .and.  &
+     &    nt .eq. 20) then
+!---------------------------------------------------------------------
+!   Class B size reference checksums
+!---------------------------------------------------------------------
+         class = 'B'
+         csum_ref(1)  = dcmplx(5.177643571579D+02, 5.077803458597D+02)
+         csum_ref(2)  = dcmplx(5.154521291263D+02, 5.088249431599D+02)
+         csum_ref(3)  = dcmplx(5.146409228649D+02, 5.096208912659D+02)
+         csum_ref(4)  = dcmplx(5.142378756213D+02, 5.101023387619D+02)
+         csum_ref(5)  = dcmplx(5.139626667737D+02, 5.103976610617D+02)
+         csum_ref(6)  = dcmplx(5.137423460082D+02, 5.105948019802D+02)
+         csum_ref(7)  = dcmplx(5.135547056878D+02, 5.107404165783D+02)
+         csum_ref(8)  = dcmplx(5.133910925466D+02, 5.108576573661D+02)
+         csum_ref(9)  = dcmplx(5.132470705390D+02, 5.109577278523D+02)
+         csum_ref(10) = dcmplx(5.131197729984D+02, 5.110460304483D+02)
+         csum_ref(11) = dcmplx(5.130070319283D+02, 5.111252433800D+02)
+         csum_ref(12) = dcmplx(5.129070537032D+02, 5.111968077718D+02)
+         csum_ref(13) = dcmplx(5.128182883502D+02, 5.112616233064D+02)
+         csum_ref(14) = dcmplx(5.127393733383D+02, 5.113203605551D+02)
+         csum_ref(15) = dcmplx(5.126691062020D+02, 5.113735928093D+02)
+         csum_ref(16) = dcmplx(5.126064276004D+02, 5.114218460548D+02)
+         csum_ref(17) = dcmplx(5.125504076570D+02, 5.114656139760D+02)
+         csum_ref(18) = dcmplx(5.125002331720D+02, 5.115053595966D+02)
+         csum_ref(19) = dcmplx(5.124551951846D+02, 5.115415130407D+02)
+         csum_ref(20) = dcmplx(5.124146770029D+02, 5.115744692211D+02)
+
+      else if (d1 .eq. 512 .and.  &
+     &    d2 .eq. 512 .and.  &
+     &    d3 .eq. 512 .and.  &
+     &    nt .eq. 20) then
+!---------------------------------------------------------------------
+!   Class C size reference checksums
+!---------------------------------------------------------------------
+         class = 'C'
+         csum_ref(1)  = dcmplx(5.195078707457D+02, 5.149019699238D+02)
+         csum_ref(2)  = dcmplx(5.155422171134D+02, 5.127578201997D+02)
+         csum_ref(3)  = dcmplx(5.144678022222D+02, 5.122251847514D+02)
+         csum_ref(4)  = dcmplx(5.140150594328D+02, 5.121090289018D+02)
+         csum_ref(5)  = dcmplx(5.137550426810D+02, 5.121143685824D+02)
+         csum_ref(6)  = dcmplx(5.135811056728D+02, 5.121496764568D+02)
+         csum_ref(7)  = dcmplx(5.134569343165D+02, 5.121870921893D+02)
+         csum_ref(8)  = dcmplx(5.133651975661D+02, 5.122193250322D+02)
+         csum_ref(9)  = dcmplx(5.132955192805D+02, 5.122454735794D+02)
+         csum_ref(10) = dcmplx(5.132410471738D+02, 5.122663649603D+02)
+         csum_ref(11) = dcmplx(5.131971141679D+02, 5.122830879827D+02)
+         csum_ref(12) = dcmplx(5.131605205716D+02, 5.122965869718D+02)
+         csum_ref(13) = dcmplx(5.131290734194D+02, 5.123075927445D+02)
+         csum_ref(14) = dcmplx(5.131012720314D+02, 5.123166486553D+02)
+         csum_ref(15) = dcmplx(5.130760908195D+02, 5.123241541685D+02)
+         csum_ref(16) = dcmplx(5.130528295923D+02, 5.123304037599D+02)
+         csum_ref(17) = dcmplx(5.130310107773D+02, 5.123356167976D+02)
+         csum_ref(18) = dcmplx(5.130103090133D+02, 5.123399592211D+02)
+         csum_ref(19) = dcmplx(5.129905029333D+02, 5.123435588985D+02)
+         csum_ref(20) = dcmplx(5.129714421109D+02, 5.123465164008D+02)
+
+      else if (d1 .eq. 2048 .and.  &
+     &    d2 .eq. 1024 .and.  &
+     &    d3 .eq. 1024 .and.  &
+     &    nt .eq. 25) then
+!---------------------------------------------------------------------
+!   Class D size reference checksums
+!---------------------------------------------------------------------
+         class = 'D'
+         csum_ref(1)  = dcmplx(5.122230065252D+02, 5.118534037109D+02)
+         csum_ref(2)  = dcmplx(5.120463975765D+02, 5.117061181082D+02)
+         csum_ref(3)  = dcmplx(5.119865766760D+02, 5.117096364601D+02)
+         csum_ref(4)  = dcmplx(5.119518799488D+02, 5.117373863950D+02)
+         csum_ref(5)  = dcmplx(5.119269088223D+02, 5.117680347632D+02)
+         csum_ref(6)  = dcmplx(5.119082416858D+02, 5.117967875532D+02)
+         csum_ref(7)  = dcmplx(5.118943814638D+02, 5.118225281841D+02)
+         csum_ref(8)  = dcmplx(5.118842385057D+02, 5.118451629348D+02)
+         csum_ref(9)  = dcmplx(5.118769435632D+02, 5.118649119387D+02)
+         csum_ref(10) = dcmplx(5.118718203448D+02, 5.118820803844D+02)
+         csum_ref(11) = dcmplx(5.118683569061D+02, 5.118969781011D+02)
+         csum_ref(12) = dcmplx(5.118661708593D+02, 5.119098918835D+02)
+         csum_ref(13) = dcmplx(5.118649768950D+02, 5.119210777066D+02)
+         csum_ref(14) = dcmplx(5.118645605626D+02, 5.119307604484D+02)
+         csum_ref(15) = dcmplx(5.118647586618D+02, 5.119391362671D+02)
+         csum_ref(16) = dcmplx(5.118654451572D+02, 5.119463757241D+02)
+         csum_ref(17) = dcmplx(5.118665212451D+02, 5.119526269238D+02)
+         csum_ref(18) = dcmplx(5.118679083821D+02, 5.119580184108D+02)
+         csum_ref(19) = dcmplx(5.118695433664D+02, 5.119626617538D+02)
+         csum_ref(20) = dcmplx(5.118713748264D+02, 5.119666538138D+02)
+         csum_ref(21) = dcmplx(5.118733606701D+02, 5.119700787219D+02)
+         csum_ref(22) = dcmplx(5.118754661974D+02, 5.119730095953D+02)
+         csum_ref(23) = dcmplx(5.118776626738D+02, 5.119755100241D+02)
+         csum_ref(24) = dcmplx(5.118799262314D+02, 5.119776353561D+02)
+         csum_ref(25) = dcmplx(5.118822370068D+02, 5.119794338060D+02)
+
+      else if (d1 .eq. 4096 .and.  &
+     &    d2 .eq. 2048 .and.  &
+     &    d3 .eq. 2048 .and.  &
+     &    nt .eq. 25) then
+!---------------------------------------------------------------------
+!   Class E size reference checksums
+!---------------------------------------------------------------------
+         class = 'E'
+         csum_ref(1)  = dcmplx(5.121601045346D+02, 5.117395998266D+02)
+         csum_ref(2)  = dcmplx(5.120905403678D+02, 5.118614716182D+02)
+         csum_ref(3)  = dcmplx(5.120623229306D+02, 5.119074203747D+02)
+         csum_ref(4)  = dcmplx(5.120438418997D+02, 5.119345900733D+02)
+         csum_ref(5)  = dcmplx(5.120311521872D+02, 5.119551325550D+02)
+         csum_ref(6)  = dcmplx(5.120226088809D+02, 5.119720179919D+02)
+         csum_ref(7)  = dcmplx(5.120169296534D+02, 5.119861371665D+02)
+         csum_ref(8)  = dcmplx(5.120131225172D+02, 5.119979364402D+02)
+         csum_ref(9)  = dcmplx(5.120104767108D+02, 5.120077674092D+02)
+         csum_ref(10) = dcmplx(5.120085127969D+02, 5.120159443121D+02)
+         csum_ref(11) = dcmplx(5.120069224127D+02, 5.120227453670D+02)
+         csum_ref(12) = dcmplx(5.120055158164D+02, 5.120284096041D+02)
+         csum_ref(13) = dcmplx(5.120041820159D+02, 5.120331373793D+02)
+         csum_ref(14) = dcmplx(5.120028605402D+02, 5.120370938679D+02)
+         csum_ref(15) = dcmplx(5.120015223011D+02, 5.120404138831D+02)
+         csum_ref(16) = dcmplx(5.120001570022D+02, 5.120432068837D+02)
+         csum_ref(17) = dcmplx(5.119987650555D+02, 5.120455615860D+02)
+         csum_ref(18) = dcmplx(5.119973525091D+02, 5.120475499442D+02)
+         csum_ref(19) = dcmplx(5.119959279472D+02, 5.120492304629D+02)
+         csum_ref(20) = dcmplx(5.119945006558D+02, 5.120506508902D+02)
+         csum_ref(21) = dcmplx(5.119930795911D+02, 5.120518503782D+02)
+         csum_ref(22) = dcmplx(5.119916728462D+02, 5.120528612016D+02)
+         csum_ref(23) = dcmplx(5.119902874185D+02, 5.120537101195D+02)
+         csum_ref(24) = dcmplx(5.119889291565D+02, 5.120544194514D+02)
+         csum_ref(25) = dcmplx(5.119876028049D+02, 5.120550079284D+02)
+
+      else if (d1 .eq. 8192 .and.  &
+     &    d2 .eq. 4096 .and.  &
+     &    d3 .eq. 4096 .and.  &
+     &    nt .eq. 25) then
+!---------------------------------------------------------------------
+!   Class F size reference checksums
+!---------------------------------------------------------------------
+         class = 'F'
+         csum_ref( 1) = dcmplx(5.119892866928D+02, 5.121457822747D+02)
+         csum_ref( 2) = dcmplx(5.119560157487D+02, 5.121009044434D+02)
+         csum_ref( 3) = dcmplx(5.119437960123D+02, 5.120761074285D+02)
+         csum_ref( 4) = dcmplx(5.119395628845D+02, 5.120614320496D+02)
+         csum_ref( 5) = dcmplx(5.119390371879D+02, 5.120514085624D+02)
+         csum_ref( 6) = dcmplx(5.119405091840D+02, 5.120438117102D+02)
+         csum_ref( 7) = dcmplx(5.119430444528D+02, 5.120376348915D+02)
+         csum_ref( 8) = dcmplx(5.119460702242D+02, 5.120323831062D+02)
+         csum_ref( 9) = dcmplx(5.119492377036D+02, 5.120277980818D+02)
+         csum_ref(10) = dcmplx(5.119523446268D+02, 5.120237368268D+02)
+         csum_ref(11) = dcmplx(5.119552825361D+02, 5.120201137845D+02)
+         csum_ref(12) = dcmplx(5.119580008777D+02, 5.120168723492D+02)
+         csum_ref(13) = dcmplx(5.119604834177D+02, 5.120139707209D+02)
+         csum_ref(14) = dcmplx(5.119627332821D+02, 5.120113749334D+02)
+         csum_ref(15) = dcmplx(5.119647637538D+02, 5.120090554887D+02)
+         csum_ref(16) = dcmplx(5.119665927740D+02, 5.120069857863D+02)
+         csum_ref(17) = dcmplx(5.119682397643D+02, 5.120051414260D+02)
+         csum_ref(18) = dcmplx(5.119697238718D+02, 5.120034999132D+02)
+         csum_ref(19) = dcmplx(5.119710630664D+02, 5.120020405355D+02)
+         csum_ref(20) = dcmplx(5.119722737384D+02, 5.120007442976D+02)
+         csum_ref(21) = dcmplx(5.119733705802D+02, 5.119995938652D+02)
+         csum_ref(22) = dcmplx(5.119743666226D+02, 5.119985735001D+02)
+         csum_ref(23) = dcmplx(5.119752733481D+02, 5.119976689792D+02)
+         csum_ref(24) = dcmplx(5.119761008382D+02, 5.119968675026D+02)
+         csum_ref(25) = dcmplx(5.119768579280D+02, 5.119961575929D+02)
+
+      endif
+
+
+      if (class .ne. 'U') then
+
+         do i = 1, nt
+            err = abs( (sums(i) - csum_ref(i)) / csum_ref(i) )
+            if (ieee_is_nan(err) .or. (err .gt. epsilon)) goto 100
+         end do
+         verified = .TRUE.
+ 100     continue
+
+      endif
+
+
+      if (class .ne. 'U') then
+         if (verified) then
+            write(*,2000)
+ 2000       format(' Result verification successful')
+         else
+            write(*,2001)
+ 2001       format(' Result verification failed')
+         endif
+      endif
+      print *, 'class = ', class
+
+      return
+      end
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/FT/ft_data.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/FT/ft_data.f90
new file mode 100644
index 000000000..9f17d49da
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/FT/ft_data.f90
@@ -0,0 +1,189 @@
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+!
+!  ft_data module
+!
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      module ft_data
+
+      include 'npbparams.h'
+
+! total number of grid points with padding
+      integer(kind2) nxp, ntotalp
+      parameter (nxp=nx+1)
+      parameter (ntotalp=nxp*ny*nz)
+      double precision ntotal_f
+      parameter (ntotal_f=dble(nx)*ny*nz)
+
+
+! If processor array is 1x1 -> 0D grid decomposition
+
+
+! Cache blocking params. These values are good for most
+! RISC processors.  
+! FFT parameters:
+!  fftblock controls how many ffts are done at a time. 
+!  The default is appropriate for most cache-based machines
+!  On vector machines, the FFT can be vectorized with vector
+!  length equal to the block size, so the block size should
+!  be as large as possible. This is the size of the smallest
+!  dimension of the problem: 128 for class A, 256 for class B and
+!  512 for class C.
+
+      include 'blk_par.h'
+!      integer fftblock_default, fftblockpad_default
+!      parameter (fftblock_default=32, fftblockpad_default=34)
+      
+      integer fftblock, fftblockpad
+
+! we need a bunch of logic to keep track of how
+! arrays are laid out. 
+
+
+! Note: this serial version is derived from the parallel 0D case
+! of the ft NPB.
+! The computation proceeds logically as
+
+! set up initial conditions
+! fftx(1)
+! transpose (1->2)
+! ffty(2)
+! transpose (2->3)
+! fftz(3)
+! time evolution
+! fftz(3)
+! transpose (3->2)
+! ffty(2)
+! transpose (2->1)
+! fftx(1)
+! compute residual(1)
+
+! for the 0D, 1D, 2D strategies, the layouts look like xxx
+!        
+!            0D        1D        2D
+! 1:        xyz       xyz       xyz
+
+! the array dimensions are stored in dims(coord, phase)
+      integer dims(3)
+
+      integer T_total, T_setup, T_fft, T_evolve, T_checksum,  &
+     &        T_fftx, T_ffty,  &
+     &        T_fftz, T_max
+      parameter (T_total = 1, T_setup = 2, T_fft = 3,  &
+     &           T_evolve = 4, T_checksum = 5,  &
+     &           T_fftx = 6,  &
+     &           T_ffty = 7,  &
+     &           T_fftz = 8, T_max = 8)
+
+
+
+      logical timers_enabled
+
+
+      external timer_read
+      double precision timer_read
+      external ilog2
+      integer ilog2
+
+      external randlc
+      double precision randlc
+
+
+! other stuff
+      logical debug, debugsynch
+
+      double precision seed, a, pi, alpha
+      parameter (seed = 314159265.d0, a = 1220703125.d0,  &
+     &  pi = 3.141592653589793238d0, alpha=1.0d-6)
+
+
+! roots of unity array
+! relies on x being largest dimension?
+      double complex u(nxp)
+
+
+! for checksum data
+      double complex sums(0:niter_default)
+
+! number of iterations
+      integer niter
+
+
+      end module ft_data
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+!
+!  ft_fields module
+!
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      module ft_fields
+
+!---------------------------------------------------------------------
+! u0, u1, u2 are the main arrays in the problem. 
+! Depending on the decomposition, these arrays will have different 
+! dimensions. To accommodate all possibilities, we allocate them as 
+! one-dimensional arrays and pass them to subroutines for different 
+! views
+!  - u0 contains the (transformed) initial condition
+!  - u1 and u2 are working arrays
+!  - twiddle contains exponents for the time evolution operator. 
+!---------------------------------------------------------------------
+
+      double complex, allocatable ::  &
+     &                 u0(:), pad1(:),  &
+     &                 u1(:), pad2(:)
+!     >                 u2(:)
+      double precision, allocatable :: twiddle(:)
+!---------------------------------------------------------------------
+! Large arrays are in module so that they are allocated on the
+! heap rather than the stack. This module is not
+! referenced directly anywhere else. Padding is to avoid accidental 
+! cache problems, since all array sizes are powers of two.
+!---------------------------------------------------------------------
+
+
+      end module ft_fields
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine alloc_space
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+! allocate space dynamically for data arrays
+!---------------------------------------------------------------------
+
+      use ft_data, only : ntotalp
+      use ft_fields
+
+      implicit none
+
+      integer ios
+
+
+      allocate (  &
+     &          u0(ntotalp), pad1(3),  &
+     &          u1(ntotalp), pad2(3),  &
+!     >          u2(ntotalp),
+     &          twiddle(ntotalp),  &
+     &          stat = ios)
+
+      if (ios .ne. 0) then
+         write(*,*) 'Error encountered in allocating space'
+         stop
+      endif
+
+      return
+      end
+
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/FT/inputft.data.sample b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/FT/inputft.data.sample
new file mode 100644
index 000000000..448ac42bc
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/FT/inputft.data.sample
@@ -0,0 +1,3 @@
+6   ! number of iterations
+2   ! layout type. 0 = 0d, 1 = 1d, 2 = 2d
+2 4 ! processor layout. 0d must be "1 1"; 1d must be "1 N"
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/IS/Makefile b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/IS/Makefile
new file mode 100644
index 000000000..2bee7ee38
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/IS/Makefile
@@ -0,0 +1,30 @@
+SHELL=/bin/sh
+BENCHMARK=is
+BENCHMARKU=IS
+
+include ../config/make.def
+
+include ../sys/make.common
+
+OBJS = is.o \
+       ${COMMON}/c_print_results.o \
+       ${COMMON}/c_timers.o \
+       ${COMMON}/c_wtime.o
+
+ifeq (${M5_ANNOTATION}, 1)
+	OBJS += ${COMMON}/hooks.o
+endif
+
+${PROGRAM}: config ${OBJS}
+	${CLINK} ${CLINKFLAGS} -o ${PROGRAM} ${OBJS} ${C_LIB}
+
+.c.o:
+	${CCOMPILE} $<
+
+is.o:             is.c  npbparams.h
+
+
+clean:
+	- rm -f *.o *~ mputil*
+	- rm -f npbparams.h core
+	- if [ -d rii_files ]; then rm -r rii_files; fi
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/IS/README.carefully b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/IS/README.carefully
new file mode 100644
index 000000000..f7dc8f2a7
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/IS/README.carefully
@@ -0,0 +1,49 @@
+Please note:  The IS code in this directory known as is.c is derived
+from the serial version of the NPB2.3 parallel IS.  Although for
+the serial version it is completely unnecessary to have any notion 
+of buckets at all in order to correctly solve the specified NPB1 IS 
+benchmark problem, the buckets seem to be very beneficial in
+parallel versions, including the OpenMP version.
+
+Default setting is
+
+    #define USE_BUCKETS
+
+i.e., buckets turned on!  To switch it off, simply comment out
+the line.
+
+The OpenMP version uses the "dynamic" schedule to improve load
+balance during key sorting.  Sometimes, the use of the "static,1"
+(or cyclic) schedule may yield better performance.  Both options
+are acceptable.  The default setting is "dynamic".  To choose
+the cyclic option, uncomment the following line in is.c:
+
+    #define SCHED_CYCLIC
+
+
+Here are some notes inherited from NPB2.3-serial:
+Nevertheless, it is possible to turn on bucketing via #ifdef'ed code.
+Then, the sort first rearranges the keys into buckets by range (the
+bucket's ranges evenly subdivide the total key range), and then
+ranks the contents of each bucket.  This results in key transfers
+first into contiguous elements of buckets.  This is relatively
+cache efficient, since there are a relatively small number of buckets.
+Then the key counting that occurs accesses contiguous array elements.
+Once again, accesses reuse cache lines efficiently.  Finally, the 
+accumulation of key multiplicities (the key count) which gives the key
+ranks also reuses cache lines efficiently.
+
+But using the buckets more than doubles the amount of computational
+work that must be performed.  On machines with very large caches, the 
+aforementioned benefits may not exist, and the extra processing looks
+expensive. These examples apply to both CLASS A and B problems:
+
+    SP2-66MhzWN:  50% speedup with buckets                          
+    SGI Indy5000: 50% slowdown with buckets             
+    SGI O2000:   400% slowdown with buckets (Wow!)                
+
+It is a conjecture that cache access is the underlying mechanism 
+causing these variations.
+
+Note: If reporting timing results, either of these modes may be used 
+      without penalty.
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/IS/is.c b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/IS/is.c
new file mode 100644
index 000000000..0e84a0b5e
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/IS/is.c
@@ -0,0 +1,1038 @@
+/*************************************************************************
+ *                                                                       *
+ *       N  A  S     P A R A L L E L     B E N C H M A R K S  3.4        *
+ *                                                                       *
+ *                      O p e n M P     V E R S I O N                    *
+ *                                                                       *
+ *                                  I S                                  *
+ *                                                                       *
+ *************************************************************************
+ *                                                                       *
+ *   This benchmark is an OpenMP version of the NPB IS code.             *
+ *   It is described in NAS Technical Report 99-011.                     *
+ *                                                                       *
+ *   Permission to use, copy, distribute and modify this software        *
+ *   for any purpose with or without fee is hereby granted.  We          *
+ *   request, however, that all derived work reference the NAS           *
+ *   Parallel Benchmarks 3.4. This software is provided "as is"          *
+ *   without express or implied warranty.                                *
+ *                                                                       *
+ *   Information on NPB 3.4, including the technical report, the         *
+ *   original specifications, source code, results and information       *
+ *   on how to submit new results, is available at:                      *
+ *                                                                       *
+ *          http://www.nas.nasa.gov/Software/NPB/                        *
+ *                                                                       *
+ *   Send comments or suggestions to  npb@nas.nasa.gov                   *
+ *                                                                       *
+ *         NAS Parallel Benchmarks Group                                 *
+ *         NASA Ames Research Center                                     *
+ *         Mail Stop: T27A-1                                             *
+ *         Moffett Field, CA   94035-1000                                *
+ *                                                                       *
+ *         E-mail:  npb@nas.nasa.gov                                     *
+ *         Fax:     (650) 604-3957                                       *
+ *                                                                       *
+ *************************************************************************
+ *                                                                       *
+ *   Author: M. Yarrow                                                   *
+ *           H. Jin                                                      *
+ *                                                                       *
+ *************************************************************************/
+
+#include "npbparams.h"
+#include <stdlib.h>
+#include <stdio.h>
+#ifdef _OPENMP
+#include <omp.h>
+#endif
+
+
+/*****************************************************************/
+/* For serial IS, buckets are not really req'd to solve NPB1 IS  */
+/* spec, but their use on some machines improves performance, on */
+/* other machines the use of buckets compromises performance,    */
+/* probably because it is extra computation which is not req'd.  */
+/* (Note: Mechanism not understood, probably cache related)      */
+/* Example:  SP2-66MhzWN:  50% speedup with buckets              */
+/* Example:  SGI Indy5000: 50% slowdown with buckets             */
+/* Example:  SGI O2000:   400% slowdown with buckets (Wow!)      */
+/*****************************************************************/
+/* To disable the use of buckets, comment out the following line */
+#define USE_BUCKETS
+
+/* Uncomment below for cyclic schedule */
+/*#define SCHED_CYCLIC*/
+
+#ifdef M5_ANNOTATION
+void m5_work_begin_interface_();
+void m5_work_end_interface_();
+#endif
+
+/******************/
+/* default values */
+/******************/
+#ifndef CLASS
+#define CLASS 'S'
+#endif
+
+
+/*************/
+/*  CLASS S  */
+/*************/
+#if CLASS == 'S'
+#define  TOTAL_KEYS_LOG_2    16
+#define  MAX_KEY_LOG_2       11
+#define  NUM_BUCKETS_LOG_2   9
+#endif
+
+
+/*************/
+/*  CLASS W  */
+/*************/
+#if CLASS == 'W'
+#define  TOTAL_KEYS_LOG_2    20
+#define  MAX_KEY_LOG_2       16
+#define  NUM_BUCKETS_LOG_2   10
+#endif
+
+/*************/
+/*  CLASS A  */
+/*************/
+#if CLASS == 'A'
+#define  TOTAL_KEYS_LOG_2    23
+#define  MAX_KEY_LOG_2       19
+#define  NUM_BUCKETS_LOG_2   10
+#endif
+
+
+/*************/
+/*  CLASS B  */
+/*************/
+#if CLASS == 'B'
+#define  TOTAL_KEYS_LOG_2    25
+#define  MAX_KEY_LOG_2       21
+#define  NUM_BUCKETS_LOG_2   10
+#endif
+
+
+/*************/
+/*  CLASS C  */
+/*************/
+#if CLASS == 'C'
+#define  TOTAL_KEYS_LOG_2    27
+#define  MAX_KEY_LOG_2       23
+#define  NUM_BUCKETS_LOG_2   10
+#endif
+
+
+/*************/
+/*  CLASS D  */
+/*************/
+#if CLASS == 'D'
+#define  TOTAL_KEYS_LOG_2    31
+#define  MAX_KEY_LOG_2       27
+#define  NUM_BUCKETS_LOG_2   10
+#endif
+
+
+/*************/
+/*  CLASS E  */
+/*************/
+#if CLASS == 'E'
+#define  TOTAL_KEYS_LOG_2    35
+#define  MAX_KEY_LOG_2       31
+#define  NUM_BUCKETS_LOG_2   10
+#endif
+
+
+#if (CLASS == 'D' || CLASS == 'E')
+#define  TOTAL_KEYS          (1L << TOTAL_KEYS_LOG_2)
+#define  TOTAL_KS1           (1 << (TOTAL_KEYS_LOG_2-8))
+#define  TOTAL_KS2           (1 << 8)
+#define  MAX_KEY             (1L << MAX_KEY_LOG_2)
+#else
+#define  TOTAL_KEYS          (1 << TOTAL_KEYS_LOG_2)
+#define  TOTAL_KS1           TOTAL_KEYS
+#define  TOTAL_KS2           1
+#define  MAX_KEY             (1 << MAX_KEY_LOG_2)
+#endif
+#define  NUM_BUCKETS         (1 << NUM_BUCKETS_LOG_2)
+#define  NUM_KEYS            TOTAL_KEYS
+#define  SIZE_OF_BUFFERS     NUM_KEYS
+
+
+#define  MAX_ITERATIONS      10
+#define  TEST_ARRAY_SIZE     5
+
+
+/*************************************/
+/* Typedef: if necessary, change the */
+/* size of int here by changing the  */
+/* int type to, say, long            */
+/*************************************/
+#if (CLASS == 'D' || CLASS == 'E')
+typedef  long INT_TYPE;
+#else
+typedef  int  INT_TYPE;
+#endif
+
+
+/********************/
+/* Some global info */
+/********************/
+INT_TYPE *key_buff_ptr_global;         /* used by full_verify to get */
+                                       /* copies of rank info        */
+
+int      passed_verification;
+
+
+/************************************/
+/* These are the three main arrays. */
+/* See SIZE_OF_BUFFERS def above    */
+/************************************/
+INT_TYPE key_array[SIZE_OF_BUFFERS],
+         key_buff1[MAX_KEY],
+         key_buff2[SIZE_OF_BUFFERS],
+         partial_verify_vals[TEST_ARRAY_SIZE],
+         **key_buff1_aptr = NULL;
+
+#ifdef USE_BUCKETS
+INT_TYPE **bucket_size,
+         bucket_ptrs[NUM_BUCKETS];
+#pragma omp threadprivate(bucket_ptrs)
+#endif
+
+
+/**********************/
+/* Partial verif info */
+/**********************/
+INT_TYPE test_index_array[TEST_ARRAY_SIZE],
+         test_rank_array[TEST_ARRAY_SIZE];
+
+int      S_test_index_array[TEST_ARRAY_SIZE] =
+                             {48427,17148,23627,62548,4431},
+         S_test_rank_array[TEST_ARRAY_SIZE] =
+                             {0,18,346,64917,65463},
+
+         W_test_index_array[TEST_ARRAY_SIZE] =
+                             {357773,934767,875723,898999,404505},
+         W_test_rank_array[TEST_ARRAY_SIZE] =
+                             {1249,11698,1039987,1043896,1048018},
+
+         A_test_index_array[TEST_ARRAY_SIZE] =
+                             {2112377,662041,5336171,3642833,4250760},
+         A_test_rank_array[TEST_ARRAY_SIZE] =
+                             {104,17523,123928,8288932,8388264},
+
+         B_test_index_array[TEST_ARRAY_SIZE] =
+                             {41869,812306,5102857,18232239,26860214},
+         B_test_rank_array[TEST_ARRAY_SIZE] =
+                             {33422937,10244,59149,33135281,99},
+
+         C_test_index_array[TEST_ARRAY_SIZE] =
+                             {44172927,72999161,74326391,129606274,21736814},
+         C_test_rank_array[TEST_ARRAY_SIZE] =
+                             {61147,882988,266290,133997595,133525895};
+
+long     D_test_index_array[TEST_ARRAY_SIZE] =
+                             {1317351170,995930646,1157283250,1503301535,1453734525},
+         D_test_rank_array[TEST_ARRAY_SIZE] =
+                             {1,36538729,1978098519,2145192618,2147425337},
+
+         E_test_index_array[TEST_ARRAY_SIZE] =
+                             {21492309536L,24606226181L,12608530949L,4065943607L,3324513396L},
+         E_test_rank_array[TEST_ARRAY_SIZE] =
+                             {3L,27580354L,3248475153L,30048754302L,31485259697L};
+
+
+/***********************/
+/* function prototypes */
+/***********************/
+double	randlc( double *X, double *A );
+
+void full_verify( void );
+
+void c_print_results( char   *name,
+                      char   class,
+                      int    n1,
+                      int    n2,
+                      int    n3,
+                      int    niter,
+                      double t,
+                      double mops,
+		      char   *optype,
+                      int    passed_verification,
+                      char   *npbversion,
+                      char   *compiletime,
+                      char   *cc,
+                      char   *clink,
+                      char   *c_lib,
+                      char   *c_inc,
+                      char   *cflags,
+                      char   *clinkflags );
+
+#include "../common/c_timers.h"
+
+
+/*
+ *    FUNCTION RANDLC (X, A)
+ *
+ *  This routine returns a uniform pseudorandom double precision number in the
+ *  range (0, 1) by using the linear congruential generator
+ *
+ *  x_{k+1} = a x_k  (mod 2^46)
+ *
+ *  where 0 < x_k < 2^46 and 0 < a < 2^46.  This scheme generates 2^44 numbers
+ *  before repeating.  The argument A is the same as 'a' in the above formula,
+ *  and X is the same as x_0.  A and X must be odd double precision integers
+ *  in the range (1, 2^46).  The returned value RANDLC is normalized to be
+ *  between 0 and 1, i.e. RANDLC = 2^(-46) * x_1.  X is updated to contain
+ *  the new seed x_1, so that subsequent calls to RANDLC using the same
+ *  arguments will generate a continuous sequence.
+ *
+ *  This routine should produce the same results on any computer with at least
+ *  48 mantissa bits in double precision floating point data.  On Cray systems,
+ *  double precision should be disabled.
+ *
+ *  David H. Bailey     October 26, 1990
+ *
+ *     IMPLICIT DOUBLE PRECISION (A-H, O-Z)
+ *     SAVE KS, R23, R46, T23, T46
+ *     DATA KS/0/
+ *
+ *  If this is the first call to RANDLC, compute R23 = 2 ^ -23, R46 = 2 ^ -46,
+ *  T23 = 2 ^ 23, and T46 = 2 ^ 46.  These are computed in loops, rather than
+ *  by merely using the ** operator, in order to ensure that the results are
+ *  exact on all systems.  This code assumes that 0.5D0 is represented exactly.
+ */
+
+/*****************************************************************/
+/*************           R  A  N  D  L  C             ************/
+/*************                                        ************/
+/*************    portable random number generator    ************/
+/*****************************************************************/
+
+static int      KS=0;
+static double	R23, R46, T23, T46;
+#pragma omp threadprivate(KS, R23, R46, T23, T46)
+
+double	randlc( double *X, double *A )
+{
+      double		T1, T2, T3, T4;
+      double		A1;
+      double		A2;
+      double		X1;
+      double		X2;
+      double		Z;
+      int     		i, j;
+
+      if (KS == 0)
+      {
+        R23 = 1.0;
+        R46 = 1.0;
+        T23 = 1.0;
+        T46 = 1.0;
+
+        for (i=1; i<=23; i++)
+        {
+          R23 = 0.50 * R23;
+          T23 = 2.0 * T23;
+        }
+        for (i=1; i<=46; i++)
+        {
+          R46 = 0.50 * R46;
+          T46 = 2.0 * T46;
+        }
+        KS = 1;
+      }
+
+/*  Break A into two parts such that A = 2^23 * A1 + A2 and set X = N.  */
+
+      T1 = R23 * *A;
+      j  = T1;
+      A1 = j;
+      A2 = *A - T23 * A1;
+
+/*  Break X into two parts such that X = 2^23 * X1 + X2, compute
+    Z = A1 * X2 + A2 * X1  (mod 2^23), and then
+    X = 2^23 * Z + A2 * X2  (mod 2^46).                            */
+
+      T1 = R23 * *X;
+      j  = T1;
+      X1 = j;
+      X2 = *X - T23 * X1;
+      T1 = A1 * X2 + A2 * X1;
+
+      j  = R23 * T1;
+      T2 = j;
+      Z = T1 - T23 * T2;
+      T3 = T23 * Z + A2 * X2;
+      j  = R46 * T3;
+      T4 = j;
+      *X = T3 - T46 * T4;
+      return(R46 * *X);
+}
+
+
+
+
+/*****************************************************************/
+/************   F  I  N  D  _  M  Y  _  S  E  E  D    ************/
+/************                                         ************/
+/************ returns parallel random number seq seed ************/
+/*****************************************************************/
+
+/*
+ * Create a random number sequence of total length nn residing
+ * on np number of processors.  Each processor will therefore have a
+ * subsequence of length nn/np.  This routine returns that random
+ * number which is the first random number for the subsequence belonging
+ * to processor rank kn, and which is used as seed for proc kn ran # gen.
+ */
+
+double   find_my_seed( int kn,        /* my processor rank, 0<=kn<num procs  */
+                       int np,        /* np = num procs                      */
+                       long nn,       /* total num of ran numbers, all procs */
+                       double s,      /* Ran num seed, for ex.: 314159265.00 */
+                       double a )     /* Ran num gen mult, try 1220703125.00 */
+{
+
+      double t1,t2;
+      long   mq,nq,kk,ik;
+
+      if ( kn == 0 ) return s;
+
+      mq = (nn/4 + np - 1) / np;
+      nq = mq * 4 * kn;               /* number of rans to be skipped */
+
+      t1 = s;
+      t2 = a;
+      kk = nq;
+      while ( kk > 1 ) {
+      	 ik = kk / 2;
+         if( 2 * ik ==  kk ) {
+            (void)randlc( &t2, &t2 );
+	    kk = ik;
+	 }
+	 else {
+            (void)randlc( &t1, &t2 );
+	    kk = kk - 1;
+	 }
+      }
+      (void)randlc( &t1, &t2 );
+
+      return( t1 );
+
+}
+
+
+
+/*****************************************************************/
+/*************      C  R  E  A  T  E  _  S  E  Q      ************/
+/*****************************************************************/
+
+void	create_seq( double seed, double a )
+{
+	double x, s;
+	INT_TYPE i, k;
+
+#pragma omp parallel private(x,s,i,k)
+    {
+	INT_TYPE k1, k2;
+	double an = a;
+	int myid = 0, num_threads = 1;
+        INT_TYPE mq;
+
+#ifdef _OPENMP
+	myid = omp_get_thread_num();
+	num_threads = omp_get_num_threads();
+#endif
+
+	mq = (NUM_KEYS + num_threads - 1) / num_threads;
+	k1 = mq * myid;
+	k2 = k1 + mq;
+	if ( k2 > NUM_KEYS ) k2 = NUM_KEYS;
+
+	KS = 0;
+	s = find_my_seed( myid, num_threads,
+			  (long)4*NUM_KEYS, seed, an );
+
+        k = MAX_KEY/4;
+
+	for (i=k1; i<k2; i++)
+	{
+	    x = randlc(&s, &an);
+	    x += randlc(&s, &an);
+    	    x += randlc(&s, &an);
+	    x += randlc(&s, &an);
+
+            key_array[i] = k*x;
+	}
+    } /*omp parallel*/
+}
+
+
+
+/*****************************************************************/
+/*****************    Allocate Working Buffer     ****************/
+/*****************************************************************/
+void *alloc_mem( size_t size )
+{
+    void *p;
+
+    p = (void *)malloc(size);
+    if (!p) {
+        perror("Memory allocation error");
+        exit(1);
+    }
+    return p;
+}
+
+void alloc_key_buff( void )
+{
+    INT_TYPE i;
+    int      num_threads = 1;
+
+
+#ifdef _OPENMP
+    num_threads = omp_get_max_threads();
+#endif
+
+#ifdef USE_BUCKETS
+    bucket_size = (INT_TYPE **)alloc_mem(sizeof(INT_TYPE *) * num_threads);
+
+    for (i = 0; i < num_threads; i++) {
+        bucket_size[i] = (INT_TYPE *)alloc_mem(sizeof(INT_TYPE) * NUM_BUCKETS);
+    }
+
+    #pragma omp parallel for
+    for( i=0; i<NUM_KEYS; i++ )
+        key_buff2[i] = 0;
+
+#else /*USE_BUCKETS*/
+
+    key_buff1_aptr = (INT_TYPE **)alloc_mem(sizeof(INT_TYPE *) * num_threads);
+
+    key_buff1_aptr[0] = key_buff1;
+    for (i = 1; i < num_threads; i++) {
+        key_buff1_aptr[i] = (INT_TYPE *)alloc_mem(sizeof(INT_TYPE) * MAX_KEY);
+    }
+
+#endif /*USE_BUCKETS*/
+}
+
+
+
+/*****************************************************************/
+/*************    F  U  L  L  _  V  E  R  I  F  Y     ************/
+/*****************************************************************/
+
+
+void full_verify( void )
+{
+    INT_TYPE   i, j;
+    INT_TYPE   k, k1, k2;
+
+
+/*  Now, finally, sort the keys:  */
+
+/*  Copy keys into work array; keys in key_array will be reassigned. */
+
+#ifdef USE_BUCKETS
+
+    /* Buckets are already sorted.  Sorting keys within each bucket */
+#ifdef SCHED_CYCLIC
+    #pragma omp parallel for private(i,j,k,k1) schedule(static,1)
+#else
+    #pragma omp parallel for private(i,j,k,k1) schedule(dynamic)
+#endif
+    for( j=0; j< NUM_BUCKETS; j++ ) {
+
+        k1 = (j > 0)? bucket_ptrs[j-1] : 0;
+        for ( i = k1; i < bucket_ptrs[j]; i++ ) {
+            k = --key_buff_ptr_global[key_buff2[i]];
+            key_array[k] = key_buff2[i];
+        }
+    }
+
+#else
+
+#pragma omp parallel private(i,j,k,k1,k2)
+  {
+    #pragma omp for
+    for( i=0; i<NUM_KEYS; i++ )
+        key_buff2[i] = key_array[i];
+
+    /* This is actual sorting. Each thread is responsible for
+       a subset of key values */
+#ifdef _OPENMP
+    j = omp_get_num_threads();
+    j = (MAX_KEY + j - 1) / j;
+    k1 = j * omp_get_thread_num();
+#else
+    j = MAX_KEY;
+    k1 = 0;
+#endif
+    k2 = k1 + j;
+    if (k2 > MAX_KEY) k2 = MAX_KEY;
+
+    for( i=0; i<NUM_KEYS; i++ ) {
+        if (key_buff2[i] >= k1 && key_buff2[i] < k2) {
+            k = --key_buff_ptr_global[key_buff2[i]];
+            key_array[k] = key_buff2[i];
+        }
+    }
+  } /*omp parallel*/
+
+#endif
+
+
+/*  Confirm keys correctly sorted: count incorrectly sorted keys, if any */
+
+    j = 0;
+    #pragma omp parallel for reduction(+:j)
+    for( i=1; i<NUM_KEYS; i++ )
+        if( key_array[i-1] > key_array[i] )
+            j++;
+
+    if( j != 0 )
+        printf( "Full_verify: number of keys out of sort: %ld\n", (long)j );
+    else
+        passed_verification++;
+
+}
+
+
+
+
+/*****************************************************************/
+/*************             R  A  N  K             ****************/
+/*****************************************************************/
+
+
+void rank( int iteration )
+{
+
+    INT_TYPE    i, k;
+    INT_TYPE    *key_buff_ptr, *key_buff_ptr2;
+
+#ifdef USE_BUCKETS
+    int shift = MAX_KEY_LOG_2 - NUM_BUCKETS_LOG_2;
+    INT_TYPE num_bucket_keys = (1L << shift);
+#endif
+
+
+    key_array[iteration] = iteration;
+    key_array[iteration+MAX_ITERATIONS] = MAX_KEY - iteration;
+
+
+/*  Determine where the partial verify test keys are, load into  */
+/*  array partial_verify_vals                                    */
+    for( i=0; i<TEST_ARRAY_SIZE; i++ )
+        partial_verify_vals[i] = key_array[test_index_array[i]];
+
+
+/*  Setup pointers to key buffers  */
+#ifdef USE_BUCKETS
+    key_buff_ptr2 = key_buff2;
+#else
+    key_buff_ptr2 = key_array;
+#endif
+    key_buff_ptr = key_buff1;
+
+
+#pragma omp parallel private(i, k)
+  {
+    INT_TYPE *work_buff, m, k1, k2;
+    int myid = 0, num_threads = 1;
+
+#ifdef _OPENMP
+    myid = omp_get_thread_num();
+    num_threads = omp_get_num_threads();
+#endif
+
+
+/*  Bucket sort is known to improve cache performance on some   */
+/*  cache based systems.  But the actual performance may depend */
+/*  on cache size, problem size. */
+#ifdef USE_BUCKETS
+
+    work_buff = bucket_size[myid];
+
+/*  Initialize */
+    for( i=0; i<NUM_BUCKETS; i++ )
+        work_buff[i] = 0;
+
+/*  Determine the number of keys in each bucket */
+    #pragma omp for schedule(static)
+    for( i=0; i<NUM_KEYS; i++ )
+        work_buff[key_array[i] >> shift]++;
+
+/*  Cumulative bucket sizes are the bucket pointers.
+    These are global sizes accumulated up to each bucket */
+    bucket_ptrs[0] = 0;
+    for( k=0; k< myid; k++ )
+        bucket_ptrs[0] += bucket_size[k][0];
+
+    for( i=1; i< NUM_BUCKETS; i++ ) {
+        bucket_ptrs[i] = bucket_ptrs[i-1];
+        for( k=0; k< myid; k++ )
+            bucket_ptrs[i] += bucket_size[k][i];
+        for( k=myid; k< num_threads; k++ )
+            bucket_ptrs[i] += bucket_size[k][i-1];
+    }
+
+
+/*  Sort into appropriate bucket */
+    #pragma omp for schedule(static)
+    for( i=0; i<NUM_KEYS; i++ )
+    {
+        k = key_array[i];
+        key_buff2[bucket_ptrs[k >> shift]++] = k;
+    }
+
+/*  The bucket pointers now point to the final accumulated sizes */
+    if (myid < num_threads-1) {
+        for( i=0; i< NUM_BUCKETS; i++ )
+            for( k=myid+1; k< num_threads; k++ )
+                bucket_ptrs[i] += bucket_size[k][i];
+    }
+
+
+/*  Now, buckets are sorted.  We only need to sort keys inside
+    each bucket, which can be done in parallel.  Because the distribution
+    of the number of keys in the buckets is Gaussian, the use of
+    a dynamic schedule should improve load balance, thus, performance     */
+
+#ifdef SCHED_CYCLIC
+    #pragma omp for schedule(static,1)
+#else
+    #pragma omp for schedule(dynamic)
+#endif
+    for( i=0; i< NUM_BUCKETS; i++ ) {
+
+/*  Clear the work array section associated with each bucket */
+        k1 = i * num_bucket_keys;
+        k2 = k1 + num_bucket_keys;
+        for ( k = k1; k < k2; k++ )
+            key_buff_ptr[k] = 0;
+
+/*  Ranking of all keys occurs in this section:                 */
+
+/*  In this section, the keys themselves are used as their
+    own indexes to determine how many of each there are: their
+    individual population                                       */
+        m = (i > 0)? bucket_ptrs[i-1] : 0;
+        for ( k = m; k < bucket_ptrs[i]; k++ )
+            key_buff_ptr[key_buff_ptr2[k]]++;  /* Now they have individual key   */
+                                       /* population                     */
+
+/*  To obtain ranks of each key, successively add the individual key
+    population, not forgetting to add m, the total of lesser keys,
+    to the first key population                                          */
+        key_buff_ptr[k1] += m;
+        for ( k = k1+1; k < k2; k++ )
+            key_buff_ptr[k] += key_buff_ptr[k-1];
+
+    }
+
+#else /*USE_BUCKETS*/
+
+
+    work_buff = key_buff1_aptr[myid];
+
+
+/*  Clear the work array */
+    for( i=0; i<MAX_KEY; i++ )
+        work_buff[i] = 0;
+
+
+/*  Ranking of all keys occurs in this section:                 */
+
+/*  In this section, the keys themselves are used as their
+    own indexes to determine how many of each there are: their
+    individual population                                       */
+
+    #pragma omp for nowait schedule(static)
+    for( i=0; i<NUM_KEYS; i++ )
+        work_buff[key_buff_ptr2[i]]++;  /* Now they have individual key   */
+                                       /* population                     */
+
+/*  To obtain ranks of each key, successively add the individual key
+    population                                          */
+
+    for( i=0; i<MAX_KEY-1; i++ )
+        work_buff[i+1] += work_buff[i];
+
+    #pragma omp barrier
+
+/*  Accumulate the global key population */
+    for( k=1; k<num_threads; k++ ) {
+        #pragma omp for nowait schedule(static)
+        for( i=0; i<MAX_KEY; i++ )
+            key_buff_ptr[i] += key_buff1_aptr[k][i];
+    }
+
+#endif /*USE_BUCKETS*/
+
+  } /*omp parallel*/
+
+/* This is the partial verify test section */
+/* Observe that test_rank_array vals are   */
+/* shifted differently for different cases */
+    for( i=0; i<TEST_ARRAY_SIZE; i++ )
+    {
+        k = partial_verify_vals[i];          /* test vals were put here */
+        if( 0 < k  &&  k <= NUM_KEYS-1 )
+        {
+            INT_TYPE key_rank = key_buff_ptr[k-1];
+            INT_TYPE test_rank = test_rank_array[i];
+            int failed = 0;
+
+            switch( CLASS )
+            {
+                case 'S':
+                    if( i <= 2 )
+                        test_rank += iteration;
+                    else
+                        test_rank -= iteration;
+                    break;
+                case 'W':
+                    if( i < 2 )
+                        test_rank += iteration - 2;
+                    else
+                        test_rank -= iteration;
+                    break;
+                case 'A':
+                    if( i <= 2 )
+                        test_rank += iteration - 1;
+                    else
+                        test_rank -= iteration - 1;
+                    break;
+                case 'B':
+                    if( i == 1 || i == 2 || i == 4 )
+                        test_rank += iteration;
+                    else
+                        test_rank -= iteration;
+                    break;
+                case 'C':
+                    if( i <= 2 )
+                        test_rank += iteration;
+                    else
+                        test_rank -= iteration;
+                    break;
+                case 'D':
+                    if( i < 2 )
+                        test_rank += iteration;
+                    else
+                        test_rank -= iteration;
+                    break;
+                case 'E':
+                    if( i < 2 )
+                        test_rank += iteration - 2;
+                    else if( i == 2 )
+                    {
+                        test_rank += iteration - 2;
+                        if (iteration > 4)
+                            test_rank -= 2;
+                        else if (iteration > 2)
+                            test_rank -= 1;
+                    }
+                    else
+                        test_rank -= iteration - 2;
+                    break;
+            }
+            if( key_rank != test_rank )
+                failed = 1;
+            else
+                passed_verification++;
+            if( failed == 1 )
+                printf( "Failed partial verification: "
+                        "iteration %d, test key %d\n",
+                         iteration, (int)i );
+        }
+    }
+
+
+
+
+/*  Make copies of rank info for use by full_verify: these variables
+    in rank are local; making them global slows down the code, probably
+    since the compiler cannot keep them in registers                      */
+
+    if( iteration == MAX_ITERATIONS )
+        key_buff_ptr_global = key_buff_ptr;
+
+}
+
+
+/*****************************************************************/
+/*************             M  A  I  N             ****************/
+/*****************************************************************/
+
+int main( int argc, char **argv )
+{
+
+    int             i, iteration, timer_on;
+
+    double          timecounter;
+
+
+/*  Initialize timers  */
+    timer_on = check_timer_flag();
+
+    timer_clear( 0 );
+    if (timer_on) {
+        timer_clear( 1 );
+        timer_clear( 2 );
+        timer_clear( 3 );
+    }
+
+    if (timer_on) timer_start( 3 );
+
+
+/*  Initialize the verification arrays if a valid class */
+    for( i=0; i<TEST_ARRAY_SIZE; i++ )
+        switch( CLASS )
+        {
+            case 'S':
+                test_index_array[i] = S_test_index_array[i];
+                test_rank_array[i]  = S_test_rank_array[i];
+                break;
+            case 'A':
+                test_index_array[i] = A_test_index_array[i];
+                test_rank_array[i]  = A_test_rank_array[i];
+                break;
+            case 'W':
+                test_index_array[i] = W_test_index_array[i];
+                test_rank_array[i]  = W_test_rank_array[i];
+                break;
+            case 'B':
+                test_index_array[i] = B_test_index_array[i];
+                test_rank_array[i]  = B_test_rank_array[i];
+                break;
+            case 'C':
+                test_index_array[i] = C_test_index_array[i];
+                test_rank_array[i]  = C_test_rank_array[i];
+                break;
+            case 'D':
+                test_index_array[i] = D_test_index_array[i];
+                test_rank_array[i]  = D_test_rank_array[i];
+                break;
+            case 'E':
+                test_index_array[i] = E_test_index_array[i];
+                test_rank_array[i]  = E_test_rank_array[i];
+                break;
+        };
+
+
+
+/*  Printout initial NPB info */
+    printf
+      ( "\n\n NAS Parallel Benchmarks (NPB3.4-OMP) - IS Benchmark\n\n" );
+    printf( " Size:  %ld  (class %c)\n", (long)TOTAL_KEYS, CLASS );
+    printf( " Iterations:  %d\n", MAX_ITERATIONS );
+#ifdef _OPENMP
+    printf( " Number of available threads:  %d\n", omp_get_max_threads() );
+#endif
+    printf( "\n" );
+
+    if (timer_on) timer_start( 1 );
+
+/*  Generate random number sequence and subsequent keys on all procs */
+    create_seq( 314159265.00,                    /* Random number gen seed */
+                1220703125.00 );                 /* Random number gen mult */
+
+    alloc_key_buff();
+    if (timer_on) timer_stop( 1 );
+
+
+/*  Do one iteration for free (i.e., untimed) to guarantee initialization of
+    all data and code pages and respective tables */
+    rank( 1 );
+
+/*  Start verification counter */
+    passed_verification = 0;
+
+    if( CLASS != 'S' ) printf( "\n   iteration\n" );
+
+#ifdef M5_ANNOTATION
+    m5_work_begin_interface_();
+#endif
+/*  Start timer  */
+    timer_start( 0 );
+
+
+/*  This is the main iteration */
+    for( iteration=1; iteration<=MAX_ITERATIONS; iteration++ )
+    {
+        if( CLASS != 'S' ) printf( "        %d\n", iteration );
+        rank( iteration );
+    }
+
+
+/*  End of timing, obtain maximum time of all processors */
+    timer_stop( 0 );
+
+#ifdef M5_ANNOTATION
+    m5_work_end_interface_();
+#endif
+
+    timecounter = timer_read( 0 );
+/*  This tests that keys are in sequence: sorting of last ranked key seq
+    occurs here, but is an untimed operation                             */
+    if (timer_on) timer_start( 2 );
+    full_verify();
+    if (timer_on) timer_stop( 2 );
+
+    if (timer_on) timer_stop( 3 );
+
+
+/*  The final printout  */
+    if( passed_verification != 5*MAX_ITERATIONS + 1 )
+        passed_verification = 0;
+    c_print_results( "IS",
+                     CLASS,
+                     TOTAL_KS1,
+                     TOTAL_KS2,
+                     0,
+                     MAX_ITERATIONS,
+                     timecounter,
+                     1.0e-6*(double)(TOTAL_KEYS)*MAX_ITERATIONS
+                                                  /timecounter,
+                     "keys ranked",
+                     passed_verification,
+                     NPBVERSION,
+                     COMPILETIME,
+                     CC,
+                     CLINK,
+                     C_LIB,
+                     C_INC,
+                     CFLAGS,
+                     CLINKFLAGS );
+
+
+/*  Print additional timers  */
+    if (timer_on) {
+       double t_total, t_percent;
+
+       t_total = timer_read( 3 );
+       printf("\nAdditional timers -\n");
+       printf(" Total execution: %8.3f\n", t_total);
+       if (t_total == 0.0) t_total = 1.0;
+       timecounter = timer_read(1);
+       t_percent = timecounter/t_total * 100.;
+       printf(" Initialization : %8.3f (%5.2f%%)\n", timecounter, t_percent);
+       timecounter = timer_read(0);
+       t_percent = timecounter/t_total * 100.;
+       printf(" Benchmarking   : %8.3f (%5.2f%%)\n", timecounter, t_percent);
+       timecounter = timer_read(2);
+       t_percent = timecounter/t_total * 100.;
+       printf(" Sorting        : %8.3f (%5.2f%%)\n", timecounter, t_percent);
+    }
+
+    return 0;
+         /**************************/
+}        /*  E N D  P R O G R A M  */
+         /**************************/
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/LU/Makefile b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/LU/Makefile
new file mode 100644
index 000000000..1430793fe
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/LU/Makefile
@@ -0,0 +1,64 @@
+SHELL=/bin/sh
+BENCHMARK=lu
+BENCHMARKU=LU
+SVER=
+
+include ../config/make.def
+
+OBJS = lu.o lu_data.o read_input.o \
+       domain.o setcoeff.o setbv.o exact.o setiv.o \
+       erhs.o ssor$(SVER).o rhs.o l2norm.o \
+       jacld.o blts.o jacu.o buts.o error.o \
+       pintgr.o verify.o ${COMMON}/print_results.o \
+       ${COMMON}/timers.o ${COMMON}/wtime.o
+
+ifeq (${M5_ANNOTATION}, 1)
+	OBJS += ${COMMON}/hooks.o
+endif
+
+include ../sys/make.common
+
+# npbparams.h is included by lu_data module (via lu_data.o)
+
+${PROGRAM}: config
+	@if [ x$(VERSION) = xdoac -o x$(VERSION) = xDOAC ] ; then	\
+		${MAKE} SVER=_doac exec;			\
+	elif [ x$(VERSION) = xhp -o x$(VERSION) = xHP ] ; then	\
+		${MAKE} SVER=_hp exec;				\
+	else							\
+		${MAKE} exec-sync;				\
+	fi
+
+exec-sync: $(OBJS) syncs.o
+	${FLINK} ${FLINKFLAGS} -o ${PROGRAM} ${OBJS} syncs.o ${F_LIB}
+
+exec: $(OBJS)
+	${FLINK} ${FLINKFLAGS} -o ${PROGRAM} ${OBJS} ${F_LIB}
+
+.f90.o :
+	${FCOMPILE} $<
+
+lu.o:		lu.f90 lu_data.o
+blts.o:		blts.f90
+buts.o:		buts.f90
+erhs.o:		erhs.f90 lu_data.o
+error.o:	error.f90 lu_data.o
+exact.o:	exact.f90 lu_data.o
+jacld.o:	jacld.f90 lu_data.o
+jacu.o:		jacu.f90 lu_data.o
+l2norm.o:	l2norm.f90
+pintgr.o:	pintgr.f90 lu_data.o
+read_input.o:	read_input.f90 lu_data.o
+rhs.o:		rhs.f90 lu_data.o
+setbv.o:	setbv.f90 lu_data.o
+setiv.o:	setiv.f90 lu_data.o
+setcoeff.o:	setcoeff.f90 lu_data.o
+ssor$(SVER).o:	ssor$(SVER).f90 lu_data.o
+domain.o:	domain.f90 lu_data.o
+verify.o:	verify.f90 lu_data.o
+syncs.o:	syncs.f90 lu_data.o
+lu_data.o:	lu_data.f90 npbparams.h
+
+clean:
+	- rm -f *.o *~ *.mod npbparams.h
+	- if [ -d rii_files ]; then rm -r rii_files; fi
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/LU/README b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/LU/README
new file mode 100644
index 000000000..05e9883df
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/LU/README
@@ -0,0 +1,20 @@
+This directory contains the OpenMP version of the LU benchmark.
+The benchmark solves a lower-upper tridiagonal system of equations using
+the SSOR scheme. The directory includes three approaches to parallelizing
+the SSOR scheme, selectable via the make option VERSION:
+
+   VERSION=          the pipelined version (the default)
+   VERSION=hp        the hyperplane version
+   VERSION=doac      the hyperplane version using OpenMP DOACROSS
+
+Special note on feature requirements:
+   the pipelined version - ATOMIC read/write from OpenMP 3.0
+   the "doac" version - DOACROSS feature from OpenMP 4.0.
+
+To select different approaches, use the option VERSION for make:
+
+   % make CLASS=<class>               # build the pipelined version
+   % make CLASS=<class> VERSION=hp    # build the hyperplane version
+   % make CLASS=<class> VERSION=doac  # build the doacross version
+
+where <class> is one of [S,W,A,B,C,D,E,F].
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/LU/blts.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/LU/blts.f90
new file mode 100644
index 000000000..24528fbf4
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/LU/blts.f90
@@ -0,0 +1,235 @@
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine blts ( ldmx, ldmy, ldmz,  &
+     &                  nx, ny, nz,  &
+     &                  omega,  &
+     &                  v,  &
+     &                  ldz, ldy, ldx, d,  &
+     &                  ist, iend, j, k )
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!
+!   compute the regular-sparse, block lower triangular solution:
+!
+!                     v <-- ( L-inv ) * v
+!
+!---------------------------------------------------------------------
+
+      implicit none
+
+!---------------------------------------------------------------------
+!  input parameters
+!---------------------------------------------------------------------
+      integer ldmx, ldmy, ldmz
+      integer nx, ny, nz
+      double precision  omega
+!---------------------------------------------------------------------
+!   To improve cache performance, second two dimensions padded by 1 
+!   for even number sizes only.  Only needed in v.
+!---------------------------------------------------------------------
+      double precision  v( 5, ldmx/2*2+1, ldmy/2*2+1, ldmz),  &
+     &        ldz( 5, 5, ldmx ),  &
+     &        ldy( 5, 5, ldmx ),  &
+     &        ldx( 5, 5, ldmx ),  &
+     &        d  ( 5, 5, ldmx )
+      integer ist, iend, j, k
+
+!---------------------------------------------------------------------
+!  local variables
+!---------------------------------------------------------------------
+      integer i, m
+      double precision  tmp, tmp1
+      double precision  tmat(5,5), tv(5)
+
+
+         do i = ist, iend
+            do m = 1, 5
+
+                  tv( m ) =  v( m, i, j, k )  &
+     &    - omega * (  ldz( m, 1, i ) * v( 1, i, j, k-1 )  &
+     &               + ldz( m, 2, i ) * v( 2, i, j, k-1 )  &
+     &               + ldz( m, 3, i ) * v( 3, i, j, k-1 )  &
+     &               + ldz( m, 4, i ) * v( 4, i, j, k-1 )  &
+     &               + ldz( m, 5, i ) * v( 5, i, j, k-1 )  )
+
+                  tv( m ) =  tv( m )  &
+     & - omega * ( ldy( m, 1, i ) * v( 1, i, j-1, k )  &
+     &           + ldx( m, 1, i ) * v( 1, i-1, j, k )  &
+     &           + ldy( m, 2, i ) * v( 2, i, j-1, k )  &
+     &           + ldx( m, 2, i ) * v( 2, i-1, j, k )  &
+     &           + ldy( m, 3, i ) * v( 3, i, j-1, k )  &
+     &           + ldx( m, 3, i ) * v( 3, i-1, j, k )  &
+     &           + ldy( m, 4, i ) * v( 4, i, j-1, k )  &
+     &           + ldx( m, 4, i ) * v( 4, i-1, j, k )  &
+     &           + ldy( m, 5, i ) * v( 5, i, j-1, k )  &
+     &           + ldx( m, 5, i ) * v( 5, i-1, j, k ) )
+
+            end do
+       
+!---------------------------------------------------------------------
+!   diagonal block inversion
+!
+!   forward elimination
+!---------------------------------------------------------------------
+            do m = 1, 5
+               tmat( m, 1 ) = d( m, 1, i )
+               tmat( m, 2 ) = d( m, 2, i )
+               tmat( m, 3 ) = d( m, 3, i )
+               tmat( m, 4 ) = d( m, 4, i )
+               tmat( m, 5 ) = d( m, 5, i )
+            end do
+
+            tmp1 = 1.0d+00 / tmat( 1, 1 )
+            tmp = tmp1 * tmat( 2, 1 )
+            tmat( 2, 2 ) =  tmat( 2, 2 )  &
+     &           - tmp * tmat( 1, 2 )
+            tmat( 2, 3 ) =  tmat( 2, 3 )  &
+     &           - tmp * tmat( 1, 3 )
+            tmat( 2, 4 ) =  tmat( 2, 4 )  &
+     &           - tmp * tmat( 1, 4 )
+            tmat( 2, 5 ) =  tmat( 2, 5 )  &
+     &           - tmp * tmat( 1, 5 )
+            tv( 2 ) = tv( 2 )  &
+     &        - tv( 1 ) * tmp
+
+            tmp = tmp1 * tmat( 3, 1 )
+            tmat( 3, 2 ) =  tmat( 3, 2 )  &
+     &           - tmp * tmat( 1, 2 )
+            tmat( 3, 3 ) =  tmat( 3, 3 )  &
+     &           - tmp * tmat( 1, 3 )
+            tmat( 3, 4 ) =  tmat( 3, 4 )  &
+     &           - tmp * tmat( 1, 4 )
+            tmat( 3, 5 ) =  tmat( 3, 5 )  &
+     &           - tmp * tmat( 1, 5 )
+            tv( 3 ) = tv( 3 )  &
+     &        - tv( 1 ) * tmp
+
+            tmp = tmp1 * tmat( 4, 1 )
+            tmat( 4, 2 ) =  tmat( 4, 2 )  &
+     &           - tmp * tmat( 1, 2 )
+            tmat( 4, 3 ) =  tmat( 4, 3 )  &
+     &           - tmp * tmat( 1, 3 )
+            tmat( 4, 4 ) =  tmat( 4, 4 )  &
+     &           - tmp * tmat( 1, 4 )
+            tmat( 4, 5 ) =  tmat( 4, 5 )  &
+     &           - tmp * tmat( 1, 5 )
+            tv( 4 ) = tv( 4 )  &
+     &        - tv( 1 ) * tmp
+
+            tmp = tmp1 * tmat( 5, 1 )
+            tmat( 5, 2 ) =  tmat( 5, 2 )  &
+     &           - tmp * tmat( 1, 2 )
+            tmat( 5, 3 ) =  tmat( 5, 3 )  &
+     &           - tmp * tmat( 1, 3 )
+            tmat( 5, 4 ) =  tmat( 5, 4 )  &
+     &           - tmp * tmat( 1, 4 )
+            tmat( 5, 5 ) =  tmat( 5, 5 )  &
+     &           - tmp * tmat( 1, 5 )
+            tv( 5 ) = tv( 5 )  &
+     &        - tv( 1 ) * tmp
+
+
+
+            tmp1 = 1.0d+00 / tmat( 2, 2 )
+            tmp = tmp1 * tmat( 3, 2 )
+            tmat( 3, 3 ) =  tmat( 3, 3 )  &
+     &           - tmp * tmat( 2, 3 )
+            tmat( 3, 4 ) =  tmat( 3, 4 )  &
+     &           - tmp * tmat( 2, 4 )
+            tmat( 3, 5 ) =  tmat( 3, 5 )  &
+     &           - tmp * tmat( 2, 5 )
+            tv( 3 ) = tv( 3 )  &
+     &        - tv( 2 ) * tmp
+
+            tmp = tmp1 * tmat( 4, 2 )
+            tmat( 4, 3 ) =  tmat( 4, 3 )  &
+     &           - tmp * tmat( 2, 3 )
+            tmat( 4, 4 ) =  tmat( 4, 4 )  &
+     &           - tmp * tmat( 2, 4 )
+            tmat( 4, 5 ) =  tmat( 4, 5 )  &
+     &           - tmp * tmat( 2, 5 )
+            tv( 4 ) = tv( 4 )  &
+     &        - tv( 2 ) * tmp
+
+            tmp = tmp1 * tmat( 5, 2 )
+            tmat( 5, 3 ) =  tmat( 5, 3 )  &
+     &           - tmp * tmat( 2, 3 )
+            tmat( 5, 4 ) =  tmat( 5, 4 )  &
+     &           - tmp * tmat( 2, 4 )
+            tmat( 5, 5 ) =  tmat( 5, 5 )  &
+     &           - tmp * tmat( 2, 5 )
+            tv( 5 ) = tv( 5 )  &
+     &        - tv( 2 ) * tmp
+
+
+
+            tmp1 = 1.0d+00 / tmat( 3, 3 )
+            tmp = tmp1 * tmat( 4, 3 )
+            tmat( 4, 4 ) =  tmat( 4, 4 )  &
+     &           - tmp * tmat( 3, 4 )
+            tmat( 4, 5 ) =  tmat( 4, 5 )  &
+     &           - tmp * tmat( 3, 5 )
+            tv( 4 ) = tv( 4 )  &
+     &        - tv( 3 ) * tmp
+
+            tmp = tmp1 * tmat( 5, 3 )
+            tmat( 5, 4 ) =  tmat( 5, 4 )  &
+     &           - tmp * tmat( 3, 4 )
+            tmat( 5, 5 ) =  tmat( 5, 5 )  &
+     &           - tmp * tmat( 3, 5 )
+            tv( 5 ) = tv( 5 )  &
+     &        - tv( 3 ) * tmp
+
+
+
+            tmp1 = 1.0d+00 / tmat( 4, 4 )
+            tmp = tmp1 * tmat( 5, 4 )
+            tmat( 5, 5 ) =  tmat( 5, 5 )  &
+     &           - tmp * tmat( 4, 5 )
+            tv( 5 ) = tv( 5 )  &
+     &        - tv( 4 ) * tmp
+
+!---------------------------------------------------------------------
+!   back substitution
+!---------------------------------------------------------------------
+            v( 5, i, j, k ) = tv( 5 )  &
+     &                      / tmat( 5, 5 )
+
+            tv( 4 ) = tv( 4 )  &
+     &           - tmat( 4, 5 ) * v( 5, i, j, k )
+            v( 4, i, j, k ) = tv( 4 )  &
+     &                      / tmat( 4, 4 )
+
+            tv( 3 ) = tv( 3 )  &
+     &           - tmat( 3, 4 ) * v( 4, i, j, k )  &
+     &           - tmat( 3, 5 ) * v( 5, i, j, k )
+            v( 3, i, j, k ) = tv( 3 )  &
+     &                      / tmat( 3, 3 )
+
+            tv( 2 ) = tv( 2 )  &
+     &           - tmat( 2, 3 ) * v( 3, i, j, k )  &
+     &           - tmat( 2, 4 ) * v( 4, i, j, k )  &
+     &           - tmat( 2, 5 ) * v( 5, i, j, k )
+            v( 2, i, j, k ) = tv( 2 )  &
+     &                      / tmat( 2, 2 )
+
+            tv( 1 ) = tv( 1 )  &
+     &           - tmat( 1, 2 ) * v( 2, i, j, k )  &
+     &           - tmat( 1, 3 ) * v( 3, i, j, k )  &
+     &           - tmat( 1, 4 ) * v( 4, i, j, k )  &
+     &           - tmat( 1, 5 ) * v( 5, i, j, k )
+            v( 1, i, j, k ) = tv( 1 )  &
+     &                      / tmat( 1, 1 )
+
+
+        enddo
+
+
+      return
+      end
+
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/LU/buts.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/LU/buts.f90
new file mode 100644
index 000000000..e214256c3
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/LU/buts.f90
@@ -0,0 +1,234 @@
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine buts( ldmx, ldmy, ldmz,  &
+     &                 nx, ny, nz,  &
+     &                 omega,  &
+     &                 v,  &
+     &                 d, udx, udy, udz,  &
+     &                 ist, iend, j, k )
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!
+!   compute the regular-sparse, block upper triangular solution:
+!
+!                     v <-- ( U-inv ) * v
+!
+!---------------------------------------------------------------------
+
+      implicit none
+
+!---------------------------------------------------------------------
+!  input parameters
+!---------------------------------------------------------------------
+      integer ldmx, ldmy, ldmz
+      integer nx, ny, nz
+      double precision  omega
+!---------------------------------------------------------------------
+!   To improve cache performance, second two dimensions padded by 1 
+!   for even number sizes only.  Only needed in v.
+!---------------------------------------------------------------------
+      double precision  v( 5,ldmx/2*2+1, ldmy/2*2+1, ldmz),  &
+     &        d  ( 5, 5, ldmx ),  &
+     &        udx( 5, 5, ldmx ),  &
+     &        udy( 5, 5, ldmx ),  &
+     &        udz( 5, 5, ldmx )
+      integer ist, iend, j, k
+
+!---------------------------------------------------------------------
+!  local variables
+!---------------------------------------------------------------------
+      integer i, m
+      double precision  tmp, tmp1
+      double precision  tmat(5,5), tv(5)
+
+
+         do i = iend, ist, -1
+            do m = 1, 5
+                  tv( m ) =  &
+     &      omega * (  udz( m, 1, i ) * v( 1, i, j, k+1 )  &
+     &               + udz( m, 2, i ) * v( 2, i, j, k+1 )  &
+     &               + udz( m, 3, i ) * v( 3, i, j, k+1 )  &
+     &               + udz( m, 4, i ) * v( 4, i, j, k+1 )  &
+     &               + udz( m, 5, i ) * v( 5, i, j, k+1 ) )
+
+                  tv( m ) = tv( m )  &
+     & + omega * ( udy( m, 1, i ) * v( 1, i, j+1, k )  &
+     &           + udx( m, 1, i ) * v( 1, i+1, j, k )  &
+     &           + udy( m, 2, i ) * v( 2, i, j+1, k )  &
+     &           + udx( m, 2, i ) * v( 2, i+1, j, k )  &
+     &           + udy( m, 3, i ) * v( 3, i, j+1, k )  &
+     &           + udx( m, 3, i ) * v( 3, i+1, j, k )  &
+     &           + udy( m, 4, i ) * v( 4, i, j+1, k )  &
+     &           + udx( m, 4, i ) * v( 4, i+1, j, k )  &
+     &           + udy( m, 5, i ) * v( 5, i, j+1, k )  &
+     &           + udx( m, 5, i ) * v( 5, i+1, j, k ) )
+            end do
+
+!---------------------------------------------------------------------
+!   diagonal block inversion
+!---------------------------------------------------------------------
+            do m = 1, 5
+               tmat( m, 1 ) = d( m, 1, i )
+               tmat( m, 2 ) = d( m, 2, i )
+               tmat( m, 3 ) = d( m, 3, i )
+               tmat( m, 4 ) = d( m, 4, i )
+               tmat( m, 5 ) = d( m, 5, i )
+            end do
+
+            tmp1 = 1.0d+00 / tmat( 1, 1 )
+            tmp = tmp1 * tmat( 2, 1 )
+            tmat( 2, 2 ) =  tmat( 2, 2 )  &
+     &           - tmp * tmat( 1, 2 )
+            tmat( 2, 3 ) =  tmat( 2, 3 )  &
+     &           - tmp * tmat( 1, 3 )
+            tmat( 2, 4 ) =  tmat( 2, 4 )  &
+     &           - tmp * tmat( 1, 4 )
+            tmat( 2, 5 ) =  tmat( 2, 5 )  &
+     &           - tmp * tmat( 1, 5 )
+            tv( 2 ) = tv( 2 )  &
+     &        - tv( 1 ) * tmp
+
+            tmp = tmp1 * tmat( 3, 1 )
+            tmat( 3, 2 ) =  tmat( 3, 2 )  &
+     &           - tmp * tmat( 1, 2 )
+            tmat( 3, 3 ) =  tmat( 3, 3 )  &
+     &           - tmp * tmat( 1, 3 )
+            tmat( 3, 4 ) =  tmat( 3, 4 )  &
+     &           - tmp * tmat( 1, 4 )
+            tmat( 3, 5 ) =  tmat( 3, 5 )  &
+     &           - tmp * tmat( 1, 5 )
+            tv( 3 ) = tv( 3 )  &
+     &        - tv( 1 ) * tmp
+
+            tmp = tmp1 * tmat( 4, 1 )
+            tmat( 4, 2 ) =  tmat( 4, 2 )  &
+     &           - tmp * tmat( 1, 2 )
+            tmat( 4, 3 ) =  tmat( 4, 3 )  &
+     &           - tmp * tmat( 1, 3 )
+            tmat( 4, 4 ) =  tmat( 4, 4 )  &
+     &           - tmp * tmat( 1, 4 )
+            tmat( 4, 5 ) =  tmat( 4, 5 )  &
+     &           - tmp * tmat( 1, 5 )
+            tv( 4 ) = tv( 4 )  &
+     &        - tv( 1 ) * tmp
+
+            tmp = tmp1 * tmat( 5, 1 )
+            tmat( 5, 2 ) =  tmat( 5, 2 )  &
+     &           - tmp * tmat( 1, 2 )
+            tmat( 5, 3 ) =  tmat( 5, 3 )  &
+     &           - tmp * tmat( 1, 3 )
+            tmat( 5, 4 ) =  tmat( 5, 4 )  &
+     &           - tmp * tmat( 1, 4 )
+            tmat( 5, 5 ) =  tmat( 5, 5 )  &
+     &           - tmp * tmat( 1, 5 )
+            tv( 5 ) = tv( 5 )  &
+     &        - tv( 1 ) * tmp
+
+
+
+            tmp1 = 1.0d+00 / tmat( 2, 2 )
+            tmp = tmp1 * tmat( 3, 2 )
+            tmat( 3, 3 ) =  tmat( 3, 3 )  &
+     &           - tmp * tmat( 2, 3 )
+            tmat( 3, 4 ) =  tmat( 3, 4 )  &
+     &           - tmp * tmat( 2, 4 )
+            tmat( 3, 5 ) =  tmat( 3, 5 )  &
+     &           - tmp * tmat( 2, 5 )
+            tv( 3 ) = tv( 3 )  &
+     &        - tv( 2 ) * tmp
+
+            tmp = tmp1 * tmat( 4, 2 )
+            tmat( 4, 3 ) =  tmat( 4, 3 )  &
+     &           - tmp * tmat( 2, 3 )
+            tmat( 4, 4 ) =  tmat( 4, 4 )  &
+     &           - tmp * tmat( 2, 4 )
+            tmat( 4, 5 ) =  tmat( 4, 5 )  &
+     &           - tmp * tmat( 2, 5 )
+            tv( 4 ) = tv( 4 )  &
+     &        - tv( 2 ) * tmp
+
+            tmp = tmp1 * tmat( 5, 2 )
+            tmat( 5, 3 ) =  tmat( 5, 3 )  &
+     &           - tmp * tmat( 2, 3 )
+            tmat( 5, 4 ) =  tmat( 5, 4 )  &
+     &           - tmp * tmat( 2, 4 )
+            tmat( 5, 5 ) =  tmat( 5, 5 )  &
+     &           - tmp * tmat( 2, 5 )
+            tv( 5 ) = tv( 5 )  &
+     &        - tv( 2 ) * tmp
+
+
+
+            tmp1 = 1.0d+00 / tmat( 3, 3 )
+            tmp = tmp1 * tmat( 4, 3 )
+            tmat( 4, 4 ) =  tmat( 4, 4 )  &
+     &           - tmp * tmat( 3, 4 )
+            tmat( 4, 5 ) =  tmat( 4, 5 )  &
+     &           - tmp * tmat( 3, 5 )
+            tv( 4 ) = tv( 4 )  &
+     &        - tv( 3 ) * tmp
+
+            tmp = tmp1 * tmat( 5, 3 )
+            tmat( 5, 4 ) =  tmat( 5, 4 )  &
+     &           - tmp * tmat( 3, 4 )
+            tmat( 5, 5 ) =  tmat( 5, 5 )  &
+     &           - tmp * tmat( 3, 5 )
+            tv( 5 ) = tv( 5 )  &
+     &        - tv( 3 ) * tmp
+
+
+
+            tmp1 = 1.0d+00 / tmat( 4, 4 )
+            tmp = tmp1 * tmat( 5, 4 )
+            tmat( 5, 5 ) =  tmat( 5, 5 )  &
+     &           - tmp * tmat( 4, 5 )
+            tv( 5 ) = tv( 5 )  &
+     &        - tv( 4 ) * tmp
+
+!---------------------------------------------------------------------
+!   back substitution
+!---------------------------------------------------------------------
+            tv( 5 ) = tv( 5 )  &
+     &                      / tmat( 5, 5 )
+
+            tv( 4 ) = tv( 4 )  &
+     &           - tmat( 4, 5 ) * tv( 5 )
+            tv( 4 ) = tv( 4 )  &
+     &                      / tmat( 4, 4 )
+
+            tv( 3 ) = tv( 3 )  &
+     &           - tmat( 3, 4 ) * tv( 4 )  &
+     &           - tmat( 3, 5 ) * tv( 5 )
+            tv( 3 ) = tv( 3 )  &
+     &                      / tmat( 3, 3 )
+
+            tv( 2 ) = tv( 2 )  &
+     &           - tmat( 2, 3 ) * tv( 3 )  &
+     &           - tmat( 2, 4 ) * tv( 4 )  &
+     &           - tmat( 2, 5 ) * tv( 5 )
+            tv( 2 ) = tv( 2 )  &
+     &                      / tmat( 2, 2 )
+
+            tv( 1 ) = tv( 1 )  &
+     &           - tmat( 1, 2 ) * tv( 2 )  &
+     &           - tmat( 1, 3 ) * tv( 3 )  &
+     &           - tmat( 1, 4 ) * tv( 4 )  &
+     &           - tmat( 1, 5 ) * tv( 5 )
+            tv( 1 ) = tv( 1 )  &
+     &                      / tmat( 1, 1 )
+
+            v( 1, i, j, k ) = v( 1, i, j, k ) - tv( 1 )
+            v( 2, i, j, k ) = v( 2, i, j, k ) - tv( 2 )
+            v( 3, i, j, k ) = v( 3, i, j, k ) - tv( 3 )
+            v( 4, i, j, k ) = v( 4, i, j, k ) - tv( 4 )
+            v( 5, i, j, k ) = v( 5, i, j, k ) - tv( 5 )
+
+        enddo
+
+ 
+      return
+      end
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/LU/domain.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/LU/domain.f90
new file mode 100644
index 000000000..6ff3ba62a
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/LU/domain.f90
@@ -0,0 +1,67 @@
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine domain
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      use lu_data
+      implicit none
+
+!---------------------------------------------------------------------
+!  local variables
+!---------------------------------------------------------------------
+
+
+      nx = nx0
+      ny = ny0
+      nz = nz0
+
+!---------------------------------------------------------------------
+!   check the sub-domain size
+!---------------------------------------------------------------------
+      if ( ( nx .lt. 4 ) .or.  &
+     &     ( ny .lt. 4 ) .or.  &
+     &     ( nz .lt. 4 ) ) then
+         write (*,2001) nx, ny, nz
+ 2001    format (5x,'SUBDOMAIN SIZE IS TOO SMALL - ',  &
+     &        /5x,'ADJUST PROBLEM SIZE OR NUMBER OF PROCESSORS',  &
+     &        /5x,'SO THAT NX, NY AND NZ ARE GREATER THAN OR EQUAL',  &
+     &        /5x,'TO 4 THEY ARE CURRENTLY', 3I3)
+         stop
+      end if
+
+      if ( ( nx .gt. isiz1 ) .or.  &
+     &     ( ny .gt. isiz2 ) .or.  &
+     &     ( nz .gt. isiz3 ) ) then
+         write (*,2002) nx, ny, nz
+ 2002    format (5x,'SUBDOMAIN SIZE IS TOO LARGE - ',  &
+     &        /5x,'ADJUST PROBLEM SIZE OR NUMBER OF PROCESSORS',  &
+     &        /5x,'SO THAT NX, NY AND NZ ARE LESS THAN OR EQUAL TO ',  &
+     &        /5x,'ISIZ1, ISIZ2 AND ISIZ3 RESPECTIVELY.  THEY ARE',  &
+     &        /5x,'CURRENTLY', 3I4)
+         stop
+      end if
+
+!---------------------------------------------------------------------
+!   set up the start and end in i and j extents for all processors
+!---------------------------------------------------------------------
+      ist = 2
+      iend = nx - 1
+
+      jst = 2
+      jend = ny - 1
+
+      ii1 = 2
+      ii2 = nx0 - 1
+      ji1 = 2
+      ji2 = ny0 - 2
+      ki1 = 3
+      ki2 = nz0 - 1
+
+      return
+      end
+
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/LU/erhs.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/LU/erhs.f90
new file mode 100644
index 000000000..04bc36716
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/LU/erhs.f90
@@ -0,0 +1,448 @@
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine erhs
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!
+!   compute the right hand side based on exact solution
+!
+!---------------------------------------------------------------------
+
+      use lu_data
+      implicit none
+
+!---------------------------------------------------------------------
+!  local variables
+!---------------------------------------------------------------------
+      integer i, j, k, m
+      double precision  xi, eta, zeta
+      double precision  q
+      double precision  u21, u31, u41
+      double precision  tmp
+      double precision  u21i, u31i, u41i, u51i
+      double precision  u21j, u31j, u41j, u51j
+      double precision  u21k, u31k, u41k, u51k
+      double precision  u21im1, u31im1, u41im1, u51im1
+      double precision  u21jm1, u31jm1, u41jm1, u51jm1
+      double precision  u21km1, u31km1, u41km1, u51km1
+
+
+!$omp parallel default(shared) private(i,j,k,m,xi,eta,zeta,tmp,q,  &
+!$omp&   u51im1,u41im1,u31im1,u21im1,u51i,u41i,u31i,u21i,u21,  &
+!$omp&   u51jm1,u41jm1,u31jm1,u21jm1,u51j,u41j,u31j,u21j,u31,  &
+!$omp&   u51km1,u41km1,u31km1,u21km1,u51k,u41k,u31k,u21k,u41)
+!$omp do schedule(static) collapse(2)
+      do k = 1, nz
+         do j = 1, ny
+            do i = 1, nx
+               do m = 1, 5
+                  frct( m, i, j, k ) = 0.0d+00
+               end do
+            end do
+         end do
+      end do
+!$omp end do nowait
+
+!$omp do schedule(static) collapse(2)
+      do k = 1, nz
+         do j = 1, ny
+            zeta = ( dble(k-1) ) / ( nz - 1 )
+            eta = ( dble(j-1) ) / ( ny0 - 1 )
+            do i = 1, nx
+               xi = ( dble(i-1) ) / ( nx0 - 1 )
+               do m = 1, 5
+                  rsd(m,i,j,k) =  ce(m,1)  &
+     &                 + (ce(m,2)  &
+     &                 + (ce(m,5)  &
+     &                 + (ce(m,8)  &
+     &                 +  ce(m,11) * xi) * xi) * xi) * xi  &
+     &                 + (ce(m,3)  &
+     &                 + (ce(m,6)  &
+     &                 + (ce(m,9)  &
+     &                 +  ce(m,12) * eta) * eta) * eta) * eta  &
+     &                 + (ce(m,4)  &
+     &                 + (ce(m,7)  &
+     &                 + (ce(m,10)  &
+     &                 +  ce(m,13) * zeta) * zeta) * zeta) * zeta
+               end do
+            end do
+         end do
+      end do
+!$omp end do
+
+!---------------------------------------------------------------------
+!   xi-direction flux differences
+!---------------------------------------------------------------------
+
+!$omp do schedule(static) collapse(2)
+      do k = 2, nz - 1
+         do j = jst, jend
+            do i = 1, nx
+               flux(1,i) = rsd(2,i,j,k)
+               u21 = rsd(2,i,j,k) / rsd(1,i,j,k)
+               q = 0.50d+00 * (  rsd(2,i,j,k) * rsd(2,i,j,k)  &
+     &                         + rsd(3,i,j,k) * rsd(3,i,j,k)  &
+     &                         + rsd(4,i,j,k) * rsd(4,i,j,k) )  &
+     &                      / rsd(1,i,j,k)
+               flux(2,i) = rsd(2,i,j,k) * u21 + c2 *  &
+     &                         ( rsd(5,i,j,k) - q )
+               flux(3,i) = rsd(3,i,j,k) * u21
+               flux(4,i) = rsd(4,i,j,k) * u21
+               flux(5,i) = ( c1 * rsd(5,i,j,k) - c2 * q ) * u21
+            end do
+
+            do i = ist, iend
+               do m = 1, 5
+                  frct(m,i,j,k) =  frct(m,i,j,k)  &
+     &                   - tx2 * ( flux(m,i+1) - flux(m,i-1) )
+               end do
+            end do
+            do i = ist, nx
+               tmp = 1.0d+00 / rsd(1,i,j,k)
+
+               u21i = tmp * rsd(2,i,j,k)
+               u31i = tmp * rsd(3,i,j,k)
+               u41i = tmp * rsd(4,i,j,k)
+               u51i = tmp * rsd(5,i,j,k)
+
+               tmp = 1.0d+00 / rsd(1,i-1,j,k)
+
+               u21im1 = tmp * rsd(2,i-1,j,k)
+               u31im1 = tmp * rsd(3,i-1,j,k)
+               u41im1 = tmp * rsd(4,i-1,j,k)
+               u51im1 = tmp * rsd(5,i-1,j,k)
+
+               flux(2,i) = (4.0d+00/3.0d+00) * tx3 *  &
+     &                        ( u21i - u21im1 )
+               flux(3,i) = tx3 * ( u31i - u31im1 )
+               flux(4,i) = tx3 * ( u41i - u41im1 )
+               flux(5,i) = 0.50d+00 * ( 1.0d+00 - c1*c5 )  &
+     &              * tx3 * ( ( u21i  **2 + u31i  **2 + u41i  **2 )  &
+     &                      - ( u21im1**2 + u31im1**2 + u41im1**2 ) )  &
+     &              + (1.0d+00/6.0d+00)  &
+     &              * tx3 * ( u21i**2 - u21im1**2 )  &
+     &              + c1 * c5 * tx3 * ( u51i - u51im1 )
+            end do
+
+            do i = ist, iend
+               frct(1,i,j,k) = frct(1,i,j,k)  &
+     &              + dx1 * tx1 * (            rsd(1,i-1,j,k)  &
+     &                             - 2.0d+00 * rsd(1,i,j,k)  &
+     &                             +           rsd(1,i+1,j,k) )
+               frct(2,i,j,k) = frct(2,i,j,k)  &
+     &           + tx3 * c3 * c4 * ( flux(2,i+1) - flux(2,i) )  &
+     &              + dx2 * tx1 * (            rsd(2,i-1,j,k)  &
+     &                             - 2.0d+00 * rsd(2,i,j,k)  &
+     &                             +           rsd(2,i+1,j,k) )
+               frct(3,i,j,k) = frct(3,i,j,k)  &
+     &           + tx3 * c3 * c4 * ( flux(3,i+1) - flux(3,i) )  &
+     &              + dx3 * tx1 * (            rsd(3,i-1,j,k)  &
+     &                             - 2.0d+00 * rsd(3,i,j,k)  &
+     &                             +           rsd(3,i+1,j,k) )
+               frct(4,i,j,k) = frct(4,i,j,k)  &
+     &            + tx3 * c3 * c4 * ( flux(4,i+1) - flux(4,i) )  &
+     &              + dx4 * tx1 * (            rsd(4,i-1,j,k)  &
+     &                             - 2.0d+00 * rsd(4,i,j,k)  &
+     &                             +           rsd(4,i+1,j,k) )
+               frct(5,i,j,k) = frct(5,i,j,k)  &
+     &           + tx3 * c3 * c4 * ( flux(5,i+1) - flux(5,i) )  &
+     &              + dx5 * tx1 * (            rsd(5,i-1,j,k)  &
+     &                             - 2.0d+00 * rsd(5,i,j,k)  &
+     &                             +           rsd(5,i+1,j,k) )
+            end do
+
+!---------------------------------------------------------------------
+!   Fourth-order dissipation
+!---------------------------------------------------------------------
+            do m = 1, 5
+               frct(m,2,j,k) = frct(m,2,j,k)  &
+     &           - dssp * ( + 5.0d+00 * rsd(m,2,j,k)  &
+     &                       - 4.0d+00 * rsd(m,3,j,k)  &
+     &                       +           rsd(m,4,j,k) )
+               frct(m,3,j,k) = frct(m,3,j,k)  &
+     &           - dssp * ( - 4.0d+00 * rsd(m,2,j,k)  &
+     &                       + 6.0d+00 * rsd(m,3,j,k)  &
+     &                       - 4.0d+00 * rsd(m,4,j,k)  &
+     &                       +           rsd(m,5,j,k) )
+            end do
+
+            do i = 4, nx - 3
+               do m = 1, 5
+                  frct(m,i,j,k) = frct(m,i,j,k)  &
+     &              - dssp * (            rsd(m,i-2,j,k)  &
+     &                         - 4.0d+00 * rsd(m,i-1,j,k)  &
+     &                         + 6.0d+00 * rsd(m,i,j,k)  &
+     &                         - 4.0d+00 * rsd(m,i+1,j,k)  &
+     &                         +           rsd(m,i+2,j,k) )
+               end do
+            end do
+
+            do m = 1, 5
+               frct(m,nx-2,j,k) = frct(m,nx-2,j,k)  &
+     &           - dssp * (             rsd(m,nx-4,j,k)  &
+     &                       - 4.0d+00 * rsd(m,nx-3,j,k)  &
+     &                       + 6.0d+00 * rsd(m,nx-2,j,k)  &
+     &                       - 4.0d+00 * rsd(m,nx-1,j,k)  )
+               frct(m,nx-1,j,k) = frct(m,nx-1,j,k)  &
+     &           - dssp * (             rsd(m,nx-3,j,k)  &
+     &                       - 4.0d+00 * rsd(m,nx-2,j,k)  &
+     &                       + 5.0d+00 * rsd(m,nx-1,j,k) )
+            end do
+
+         end do
+      end do
+!$omp end do
+
+!---------------------------------------------------------------------
+!   eta-direction flux differences
+!---------------------------------------------------------------------
+
+!$omp do schedule(static) collapse(2)
+      do k = 2, nz - 1
+         do i = ist, iend
+            do j = 1, ny
+               flux(1,j) = rsd(3,i,j,k)
+               u31 = rsd(3,i,j,k) / rsd(1,i,j,k)
+               q = 0.50d+00 * (  rsd(2,i,j,k) * rsd(2,i,j,k)  &
+     &                         + rsd(3,i,j,k) * rsd(3,i,j,k)  &
+     &                         + rsd(4,i,j,k) * rsd(4,i,j,k) )  &
+     &                      / rsd(1,i,j,k)
+               flux(2,j) = rsd(2,i,j,k) * u31 
+               flux(3,j) = rsd(3,i,j,k) * u31 + c2 *  &
+     &                       ( rsd(5,i,j,k) - q )
+               flux(4,j) = rsd(4,i,j,k) * u31
+               flux(5,j) = ( c1 * rsd(5,i,j,k) - c2 * q ) * u31
+            end do
+
+            do j = jst, jend
+               do m = 1, 5
+                  frct(m,i,j,k) =  frct(m,i,j,k)  &
+     &                 - ty2 * ( flux(m,j+1) - flux(m,j-1) )
+               end do
+            end do
+
+            do j = jst, ny
+               tmp = 1.0d+00 / rsd(1,i,j,k)
+
+               u21j = tmp * rsd(2,i,j,k)
+               u31j = tmp * rsd(3,i,j,k)
+               u41j = tmp * rsd(4,i,j,k)
+               u51j = tmp * rsd(5,i,j,k)
+
+               tmp = 1.0d+00 / rsd(1,i,j-1,k)
+
+               u21jm1 = tmp * rsd(2,i,j-1,k)
+               u31jm1 = tmp * rsd(3,i,j-1,k)
+               u41jm1 = tmp * rsd(4,i,j-1,k)
+               u51jm1 = tmp * rsd(5,i,j-1,k)
+
+               flux(2,j) = ty3 * ( u21j - u21jm1 )
+               flux(3,j) = (4.0d+00/3.0d+00) * ty3 *  &
+     &                       ( u31j - u31jm1 )
+               flux(4,j) = ty3 * ( u41j - u41jm1 )
+               flux(5,j) = 0.50d+00 * ( 1.0d+00 - c1*c5 )  &
+     &              * ty3 * ( ( u21j  **2 + u31j  **2 + u41j  **2 )  &
+     &                      - ( u21jm1**2 + u31jm1**2 + u41jm1**2 ) )  &
+     &              + (1.0d+00/6.0d+00)  &
+     &              * ty3 * ( u31j**2 - u31jm1**2 )  &
+     &              + c1 * c5 * ty3 * ( u51j - u51jm1 )
+            end do
+
+            do j = jst, jend
+               frct(1,i,j,k) = frct(1,i,j,k)  &
+     &              + dy1 * ty1 * (            rsd(1,i,j-1,k)  &
+     &                             - 2.0d+00 * rsd(1,i,j,k)  &
+     &                             +           rsd(1,i,j+1,k) )
+               frct(2,i,j,k) = frct(2,i,j,k)  &
+     &          + ty3 * c3 * c4 * ( flux(2,j+1) - flux(2,j) )  &
+     &              + dy2 * ty1 * (            rsd(2,i,j-1,k)  &
+     &                             - 2.0d+00 * rsd(2,i,j,k)  &
+     &                             +           rsd(2,i,j+1,k) )
+               frct(3,i,j,k) = frct(3,i,j,k)  &
+     &          + ty3 * c3 * c4 * ( flux(3,j+1) - flux(3,j) )  &
+     &              + dy3 * ty1 * (            rsd(3,i,j-1,k)  &
+     &                             - 2.0d+00 * rsd(3,i,j,k)  &
+     &                             +           rsd(3,i,j+1,k) )
+               frct(4,i,j,k) = frct(4,i,j,k)  &
+     &          + ty3 * c3 * c4 * ( flux(4,j+1) - flux(4,j) )  &
+     &              + dy4 * ty1 * (            rsd(4,i,j-1,k)  &
+     &                             - 2.0d+00 * rsd(4,i,j,k)  &
+     &                             +           rsd(4,i,j+1,k) )
+               frct(5,i,j,k) = frct(5,i,j,k)  &
+     &          + ty3 * c3 * c4 * ( flux(5,j+1) - flux(5,j) )  &
+     &              + dy5 * ty1 * (            rsd(5,i,j-1,k)  &
+     &                             - 2.0d+00 * rsd(5,i,j,k)  &
+     &                             +           rsd(5,i,j+1,k) )
+            end do
+
+!---------------------------------------------------------------------
+!   fourth-order dissipation
+!---------------------------------------------------------------------
+            do m = 1, 5
+               frct(m,i,2,k) = frct(m,i,2,k)  &
+     &           - dssp * ( + 5.0d+00 * rsd(m,i,2,k)  &
+     &                       - 4.0d+00 * rsd(m,i,3,k)  &
+     &                       +           rsd(m,i,4,k) )
+               frct(m,i,3,k) = frct(m,i,3,k)  &
+     &           - dssp * ( - 4.0d+00 * rsd(m,i,2,k)  &
+     &                       + 6.0d+00 * rsd(m,i,3,k)  &
+     &                       - 4.0d+00 * rsd(m,i,4,k)  &
+     &                       +           rsd(m,i,5,k) )
+            end do
+
+            do j = 4, ny - 3
+               do m = 1, 5
+                  frct(m,i,j,k) = frct(m,i,j,k)  &
+     &              - dssp * (            rsd(m,i,j-2,k)  &
+     &                        - 4.0d+00 * rsd(m,i,j-1,k)  &
+     &                        + 6.0d+00 * rsd(m,i,j,k)  &
+     &                        - 4.0d+00 * rsd(m,i,j+1,k)  &
+     &                        +           rsd(m,i,j+2,k) )
+               end do
+            end do
+
+            do m = 1, 5
+               frct(m,i,ny-2,k) = frct(m,i,ny-2,k)  &
+     &           - dssp * (             rsd(m,i,ny-4,k)  &
+     &                       - 4.0d+00 * rsd(m,i,ny-3,k)  &
+     &                       + 6.0d+00 * rsd(m,i,ny-2,k)  &
+     &                       - 4.0d+00 * rsd(m,i,ny-1,k)  )
+               frct(m,i,ny-1,k) = frct(m,i,ny-1,k)  &
+     &           - dssp * (             rsd(m,i,ny-3,k)  &
+     &                       - 4.0d+00 * rsd(m,i,ny-2,k)  &
+     &                       + 5.0d+00 * rsd(m,i,ny-1,k)  )
+            end do
+
+         end do
+      end do
+!$omp end do
+
+!---------------------------------------------------------------------
+!   zeta-direction flux differences
+!---------------------------------------------------------------------
+!$omp do schedule(static) collapse(2)
+      do j = jst, jend
+         do i = ist, iend
+            do k = 1, nz
+               flux(1,k) = rsd(4,i,j,k)
+               u41 = rsd(4,i,j,k) / rsd(1,i,j,k)
+               q = 0.50d+00 * (  rsd(2,i,j,k) * rsd(2,i,j,k)  &
+     &                         + rsd(3,i,j,k) * rsd(3,i,j,k)  &
+     &                         + rsd(4,i,j,k) * rsd(4,i,j,k) )  &
+     &                      / rsd(1,i,j,k)
+               flux(2,k) = rsd(2,i,j,k) * u41 
+               flux(3,k) = rsd(3,i,j,k) * u41 
+               flux(4,k) = rsd(4,i,j,k) * u41 + c2 *  &
+     &                         ( rsd(5,i,j,k) - q )
+               flux(5,k) = ( c1 * rsd(5,i,j,k) - c2 * q ) * u41
+            end do
+
+            do k = 2, nz - 1
+               do m = 1, 5
+                  frct(m,i,j,k) =  frct(m,i,j,k)  &
+     &                  - tz2 * ( flux(m,k+1) - flux(m,k-1) )
+               end do
+            end do
+
+            do k = 2, nz
+               tmp = 1.0d+00 / rsd(1,i,j,k)
+
+               u21k = tmp * rsd(2,i,j,k)
+               u31k = tmp * rsd(3,i,j,k)
+               u41k = tmp * rsd(4,i,j,k)
+               u51k = tmp * rsd(5,i,j,k)
+
+               tmp = 1.0d+00 / rsd(1,i,j,k-1)
+
+               u21km1 = tmp * rsd(2,i,j,k-1)
+               u31km1 = tmp * rsd(3,i,j,k-1)
+               u41km1 = tmp * rsd(4,i,j,k-1)
+               u51km1 = tmp * rsd(5,i,j,k-1)
+
+               flux(2,k) = tz3 * ( u21k - u21km1 )
+               flux(3,k) = tz3 * ( u31k - u31km1 )
+               flux(4,k) = (4.0d+00/3.0d+00) * tz3 * ( u41k  &
+     &                       - u41km1 )
+               flux(5,k) = 0.50d+00 * ( 1.0d+00 - c1*c5 )  &
+     &              * tz3 * ( ( u21k  **2 + u31k  **2 + u41k  **2 )  &
+     &                      - ( u21km1**2 + u31km1**2 + u41km1**2 ) )  &
+     &              + (1.0d+00/6.0d+00)  &
+     &              * tz3 * ( u41k**2 - u41km1**2 )  &
+     &              + c1 * c5 * tz3 * ( u51k - u51km1 )
+            end do
+
+            do k = 2, nz - 1
+               frct(1,i,j,k) = frct(1,i,j,k)  &
+     &              + dz1 * tz1 * (            rsd(1,i,j,k+1)  &
+     &                             - 2.0d+00 * rsd(1,i,j,k)  &
+     &                             +           rsd(1,i,j,k-1) )
+               frct(2,i,j,k) = frct(2,i,j,k)  &
+     &          + tz3 * c3 * c4 * ( flux(2,k+1) - flux(2,k) )  &
+     &              + dz2 * tz1 * (            rsd(2,i,j,k+1)  &
+     &                             - 2.0d+00 * rsd(2,i,j,k)  &
+     &                             +           rsd(2,i,j,k-1) )
+               frct(3,i,j,k) = frct(3,i,j,k)  &
+     &          + tz3 * c3 * c4 * ( flux(3,k+1) - flux(3,k) )  &
+     &              + dz3 * tz1 * (            rsd(3,i,j,k+1)  &
+     &                             - 2.0d+00 * rsd(3,i,j,k)  &
+     &                             +           rsd(3,i,j,k-1) )
+               frct(4,i,j,k) = frct(4,i,j,k)  &
+     &          + tz3 * c3 * c4 * ( flux(4,k+1) - flux(4,k) )  &
+     &              + dz4 * tz1 * (            rsd(4,i,j,k+1)  &
+     &                             - 2.0d+00 * rsd(4,i,j,k)  &
+     &                             +           rsd(4,i,j,k-1) )
+               frct(5,i,j,k) = frct(5,i,j,k)  &
+     &          + tz3 * c3 * c4 * ( flux(5,k+1) - flux(5,k) )  &
+     &              + dz5 * tz1 * (            rsd(5,i,j,k+1)  &
+     &                             - 2.0d+00 * rsd(5,i,j,k)  &
+     &                             +           rsd(5,i,j,k-1) )
+            end do
+
+!---------------------------------------------------------------------
+!   fourth-order dissipation
+!---------------------------------------------------------------------
+            do m = 1, 5
+               frct(m,i,j,2) = frct(m,i,j,2)  &
+     &           - dssp * ( + 5.0d+00 * rsd(m,i,j,2)  &
+     &                       - 4.0d+00 * rsd(m,i,j,3)  &
+     &                       +           rsd(m,i,j,4) )
+               frct(m,i,j,3) = frct(m,i,j,3)  &
+     &           - dssp * (- 4.0d+00 * rsd(m,i,j,2)  &
+     &                      + 6.0d+00 * rsd(m,i,j,3)  &
+     &                      - 4.0d+00 * rsd(m,i,j,4)  &
+     &                      +           rsd(m,i,j,5) )
+            end do
+
+            do k = 4, nz - 3
+               do m = 1, 5
+                  frct(m,i,j,k) = frct(m,i,j,k)  &
+     &              - dssp * (           rsd(m,i,j,k-2)  &
+     &                        - 4.0d+00 * rsd(m,i,j,k-1)  &
+     &                        + 6.0d+00 * rsd(m,i,j,k)  &
+     &                        - 4.0d+00 * rsd(m,i,j,k+1)  &
+     &                        +           rsd(m,i,j,k+2) )
+               end do
+            end do
+
+            do m = 1, 5
+               frct(m,i,j,nz-2) = frct(m,i,j,nz-2)  &
+     &           - dssp * (            rsd(m,i,j,nz-4)  &
+     &                      - 4.0d+00 * rsd(m,i,j,nz-3)  &
+     &                      + 6.0d+00 * rsd(m,i,j,nz-2)  &
+     &                      - 4.0d+00 * rsd(m,i,j,nz-1)  )
+               frct(m,i,j,nz-1) = frct(m,i,j,nz-1)  &
+     &           - dssp * (             rsd(m,i,j,nz-3)  &
+     &                       - 4.0d+00 * rsd(m,i,j,nz-2)  &
+     &                       + 5.0d+00 * rsd(m,i,j,nz-1)  )
+            end do
+         end do
+      end do
+!$omp end do nowait
+!$omp end parallel
+
+      return
+      end
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/LU/error.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/LU/error.f90
new file mode 100644
index 000000000..134a2700c
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/LU/error.f90
@@ -0,0 +1,63 @@
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine error
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!
+!   compute the solution error
+!
+!---------------------------------------------------------------------
+
+      use lu_data
+      implicit none
+
+!---------------------------------------------------------------------
+!  local variables
+!---------------------------------------------------------------------
+      integer i, j, k, m
+      double precision  tmp
+      double precision  u000ijk(5)
+
+
+      do m = 1, 5
+         errnm(m) = 0.0d+00
+      end do
+
+!$omp parallel do schedule(static) collapse(2) default(shared)  &
+!$omp&  private(i,j,k,m,tmp,u000ijk) reduction(+: errnm)
+      do k = 2, nz-1
+         do j = jst, jend
+            do i = ist, iend
+               call exact( i, j, k, u000ijk )
+               do m = 1, 5
+                  tmp = ( u000ijk(m) - u(m,i,j,k) )
+                  errnm(m) = errnm(m) + tmp * tmp
+               end do
+            end do
+         end do
+      end do
+!$omp end parallel do
+
+      do m = 1, 5
+         errnm(m) = sqrt ( errnm(m) / ( dble(nx0-2)*(ny0-2)*(nz0-2) ) )
+      end do
+
+!        write (*,1002) ( errnm(m), m = 1, 5 )
+
+ 1002 format (1x/1x,'RMS-norm of error in soln. to ',  &
+     & 'first pde  = ',1pe12.5/,  &
+     & 1x,'RMS-norm of error in soln. to ',  &
+     & 'second pde = ',1pe12.5/,  &
+     & 1x,'RMS-norm of error in soln. to ',  &
+     & 'third pde  = ',1pe12.5/,  &
+     & 1x,'RMS-norm of error in soln. to ',  &
+     & 'fourth pde = ',1pe12.5/,  &
+     & 1x,'RMS-norm of error in soln. to ',  &
+     & 'fifth pde  = ',1pe12.5)
+
+      return
+      end
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/LU/exact.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/LU/exact.f90
new file mode 100644
index 000000000..55ef573e6
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/LU/exact.f90
@@ -0,0 +1,52 @@
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine exact( i, j, k, u000ijk )
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!
+!   compute the exact solution at (i,j,k)
+!
+!---------------------------------------------------------------------
+
+      use lu_data
+      implicit none
+
+!---------------------------------------------------------------------
+!  input parameters
+!---------------------------------------------------------------------
+      integer i, j, k
+      double precision u000ijk(*)
+
+!---------------------------------------------------------------------
+!  local variables
+!---------------------------------------------------------------------
+      integer m
+      double precision xi, eta, zeta
+
+      xi  = ( dble ( i - 1 ) ) / ( nx0 - 1 )
+      eta  = ( dble ( j - 1 ) ) / ( ny0 - 1 )
+      zeta = ( dble ( k - 1 ) ) / ( nz - 1 )
+
+
+      do m = 1, 5
+         u000ijk(m) =  ce(m,1)  &
+     &        + (ce(m,2)  &
+     &        + (ce(m,5)  &
+     &        + (ce(m,8)  &
+     &        +  ce(m,11) * xi) * xi) * xi) * xi  &
+     &        + (ce(m,3)  &
+     &        + (ce(m,6)  &
+     &        + (ce(m,9)  &
+     &        +  ce(m,12) * eta) * eta) * eta) * eta  &
+     &        + (ce(m,4)  &
+     &        + (ce(m,7)  &
+     &        + (ce(m,10)  &
+     &        +  ce(m,13) * zeta) * zeta) * zeta) * zeta
+      end do
+
+      return
+      end
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/LU/inputlu.data.sample b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/LU/inputlu.data.sample
new file mode 100644
index 000000000..9ef5a7be0
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/LU/inputlu.data.sample
@@ -0,0 +1,24 @@
+c
+c***controls printing of the progress of iterations: ipr    inorm
+                                                      1      250
+c
+c***the maximum no. of pseudo-time steps to be performed: nitmax
+                                                             250
+c
+c***magnitude of the time step: dt 
+                               2.0e+00
+c
+c***relaxation factor for SSOR iterations: omega
+                                            1.2
+c
+c***tolerance levels for steady-state residuals: tolnwt(m),m=1,5
+                             1.0e-08   1.0e-08   1.0e-08  1.0e-08  1.0e-08 
+c
+c***number of grid points in xi and eta and zeta directions: nx   ny   nz
+                                                            64  64  64
+c
+
+
+
+
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/LU/jacld.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/LU/jacld.f90
new file mode 100644
index 000000000..5abc800a3
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/LU/jacld.f90
@@ -0,0 +1,354 @@
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine jacld(j, k)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+
+!---------------------------------------------------------------------
+!   compute the lower triangular part of the jacobian matrix
+!---------------------------------------------------------------------
+
+      use lu_data
+      implicit none
+
+!---------------------------------------------------------------------
+!  input parameters
+!---------------------------------------------------------------------
+      integer j, k
+
+!---------------------------------------------------------------------
+!  local variables
+!---------------------------------------------------------------------
+      integer i
+      double precision  r43
+      double precision  c1345
+      double precision  c34
+      double precision  tmp1, tmp2, tmp3
+
+
+
+      r43 = ( 4.0d+00 / 3.0d+00 )
+      c1345 = c1 * c3 * c4 * c5
+      c34 = c3 * c4
+
+            do i = ist, iend
+
+!---------------------------------------------------------------------
+!   form the block diagonal
+!---------------------------------------------------------------------
+               tmp1 = rho_i(i,j,k)
+               tmp2 = tmp1 * tmp1
+               tmp3 = tmp1 * tmp2
+
+               d(1,1,i) =  1.0d+00  &
+     &                       + dt * 2.0d+00 * (   tx1 * dx1  &
+     &                                          + ty1 * dy1  &
+     &                                          + tz1 * dz1 )
+               d(1,2,i) =  0.0d+00
+               d(1,3,i) =  0.0d+00
+               d(1,4,i) =  0.0d+00
+               d(1,5,i) =  0.0d+00
+
+               d(2,1,i) = -dt * 2.0d+00  &
+     &          * (  tx1 * r43 + ty1 + tz1  )  &
+     &          * c34 * tmp2 * u(2,i,j,k)
+               d(2,2,i) =  1.0d+00  &
+     &          + dt * 2.0d+00 * c34 * tmp1  &
+     &          * (  tx1 * r43 + ty1 + tz1 )  &
+     &          + dt * 2.0d+00 * (   tx1 * dx2  &
+     &                             + ty1 * dy2  &
+     &                             + tz1 * dz2  )
+               d(2,3,i) = 0.0d+00
+               d(2,4,i) = 0.0d+00
+               d(2,5,i) = 0.0d+00
+
+               d(3,1,i) = -dt * 2.0d+00  &
+     &           * (  tx1 + ty1 * r43 + tz1  )  &
+     &           * c34 * tmp2 * u(3,i,j,k)
+               d(3,2,i) = 0.0d+00
+               d(3,3,i) = 1.0d+00  &
+     &         + dt * 2.0d+00 * c34 * tmp1  &
+     &              * (  tx1 + ty1 * r43 + tz1 )  &
+     &         + dt * 2.0d+00 * (  tx1 * dx3  &
+     &                           + ty1 * dy3  &
+     &                           + tz1 * dz3 )
+               d(3,4,i) = 0.0d+00
+               d(3,5,i) = 0.0d+00
+
+               d(4,1,i) = -dt * 2.0d+00  &
+     &           * (  tx1 + ty1 + tz1 * r43  )  &
+     &           * c34 * tmp2 * u(4,i,j,k)
+               d(4,2,i) = 0.0d+00
+               d(4,3,i) = 0.0d+00
+               d(4,4,i) = 1.0d+00  &
+     &         + dt * 2.0d+00 * c34 * tmp1  &
+     &              * (  tx1 + ty1 + tz1 * r43 )  &
+     &         + dt * 2.0d+00 * (  tx1 * dx4  &
+     &                           + ty1 * dy4  &
+     &                           + tz1 * dz4 )
+               d(4,5,i) = 0.0d+00
+
+               d(5,1,i) = -dt * 2.0d+00  &
+     &  * ( ( ( tx1 * ( r43*c34 - c1345 )  &
+     &     + ty1 * ( c34 - c1345 )  &
+     &     + tz1 * ( c34 - c1345 ) ) * ( u(2,i,j,k) ** 2 )  &
+     &   + ( tx1 * ( c34 - c1345 )  &
+     &     + ty1 * ( r43*c34 - c1345 )  &
+     &     + tz1 * ( c34 - c1345 ) ) * ( u(3,i,j,k) ** 2 )  &
+     &   + ( tx1 * ( c34 - c1345 )  &
+     &     + ty1 * ( c34 - c1345 )  &
+     &     + tz1 * ( r43*c34 - c1345 ) ) * ( u(4,i,j,k) ** 2 )  &
+     &      ) * tmp3  &
+     &   + ( tx1 + ty1 + tz1 ) * c1345 * tmp2 * u(5,i,j,k) )
+
+               d(5,2,i) = dt * 2.0d+00 * tmp2 * u(2,i,j,k)  &
+     & * ( tx1 * ( r43*c34 - c1345 )  &
+     &   + ty1 * (     c34 - c1345 )  &
+     &   + tz1 * (     c34 - c1345 ) )
+               d(5,3,i) = dt * 2.0d+00 * tmp2 * u(3,i,j,k)  &
+     & * ( tx1 * ( c34 - c1345 )  &
+     &   + ty1 * ( r43*c34 -c1345 )  &
+     &   + tz1 * ( c34 - c1345 ) )
+               d(5,4,i) = dt * 2.0d+00 * tmp2 * u(4,i,j,k)  &
+     & * ( tx1 * ( c34 - c1345 )  &
+     &   + ty1 * ( c34 - c1345 )  &
+     &   + tz1 * ( r43*c34 - c1345 ) )
+               d(5,5,i) = 1.0d+00  &
+     &   + dt * 2.0d+00 * ( tx1  + ty1 + tz1 ) * c1345 * tmp1  &
+     &   + dt * 2.0d+00 * (  tx1 * dx5  &
+     &                    +  ty1 * dy5  &
+     &                    +  tz1 * dz5 )
+
+!---------------------------------------------------------------------
+!   form the first block sub-diagonal
+!---------------------------------------------------------------------
+               tmp1 = rho_i(i,j,k-1)
+               tmp2 = tmp1 * tmp1
+               tmp3 = tmp1 * tmp2
+
+               a(1,1,i) = - dt * tz1 * dz1
+               a(1,2,i) =   0.0d+00
+               a(1,3,i) =   0.0d+00
+               a(1,4,i) = - dt * tz2
+               a(1,5,i) =   0.0d+00
+
+               a(2,1,i) = - dt * tz2  &
+     &           * ( - ( u(2,i,j,k-1)*u(4,i,j,k-1) ) * tmp2 )  &
+     &           - dt * tz1 * ( - c34 * tmp2 * u(2,i,j,k-1) )
+               a(2,2,i) = - dt * tz2 * ( u(4,i,j,k-1) * tmp1 )  &
+     &           - dt * tz1 * c34 * tmp1  &
+     &           - dt * tz1 * dz2 
+               a(2,3,i) = 0.0d+00
+               a(2,4,i) = - dt * tz2 * ( u(2,i,j,k-1) * tmp1 )
+               a(2,5,i) = 0.0d+00
+
+               a(3,1,i) = - dt * tz2  &
+     &           * ( - ( u(3,i,j,k-1)*u(4,i,j,k-1) ) * tmp2 )  &
+     &           - dt * tz1 * ( - c34 * tmp2 * u(3,i,j,k-1) )
+               a(3,2,i) = 0.0d+00
+               a(3,3,i) = - dt * tz2 * ( u(4,i,j,k-1) * tmp1 )  &
+     &           - dt * tz1 * ( c34 * tmp1 )  &
+     &           - dt * tz1 * dz3
+               a(3,4,i) = - dt * tz2 * ( u(3,i,j,k-1) * tmp1 )
+               a(3,5,i) = 0.0d+00
+
+               a(4,1,i) = - dt * tz2  &
+     &        * ( - ( u(4,i,j,k-1) * tmp1 ) ** 2  &
+     &            + c2 * qs(i,j,k-1) * tmp1 )  &
+     &        - dt * tz1 * ( - r43 * c34 * tmp2 * u(4,i,j,k-1) )
+               a(4,2,i) = - dt * tz2  &
+     &             * ( - c2 * ( u(2,i,j,k-1) * tmp1 ) )
+               a(4,3,i) = - dt * tz2  &
+     &             * ( - c2 * ( u(3,i,j,k-1) * tmp1 ) )
+               a(4,4,i) = - dt * tz2 * ( 2.0d+00 - c2 )  &
+     &             * ( u(4,i,j,k-1) * tmp1 )  &
+     &             - dt * tz1 * ( r43 * c34 * tmp1 )  &
+     &             - dt * tz1 * dz4
+               a(4,5,i) = - dt * tz2 * c2
+
+               a(5,1,i) = - dt * tz2  &
+     &       * ( ( c2 * 2.0d0 * qs(i,j,k-1)  &
+     &       - c1 * u(5,i,j,k-1) )  &
+     &            * u(4,i,j,k-1) * tmp2 )  &
+     &       - dt * tz1  &
+     &       * ( - ( c34 - c1345 ) * tmp3 * (u(2,i,j,k-1)**2)  &
+     &           - ( c34 - c1345 ) * tmp3 * (u(3,i,j,k-1)**2)  &
+     &           - ( r43*c34 - c1345 )* tmp3 * (u(4,i,j,k-1)**2)  &
+     &          - c1345 * tmp2 * u(5,i,j,k-1) )
+               a(5,2,i) = - dt * tz2  &
+     &       * ( - c2 * ( u(2,i,j,k-1)*u(4,i,j,k-1) ) * tmp2 )  &
+     &       - dt * tz1 * ( c34 - c1345 ) * tmp2 * u(2,i,j,k-1)
+               a(5,3,i) = - dt * tz2  &
+     &       * ( - c2 * ( u(3,i,j,k-1)*u(4,i,j,k-1) ) * tmp2 )  &
+     &       - dt * tz1 * ( c34 - c1345 ) * tmp2 * u(3,i,j,k-1)
+               a(5,4,i) = - dt * tz2  &
+     &       * ( c1 * ( u(5,i,j,k-1) * tmp1 )  &
+     &       - c2  &
+     &       * ( qs(i,j,k-1) * tmp1  &
+     &            + u(4,i,j,k-1)*u(4,i,j,k-1) * tmp2 ) )  &
+     &       - dt * tz1 * ( r43*c34 - c1345 ) * tmp2 * u(4,i,j,k-1)
+               a(5,5,i) = - dt * tz2  &
+     &       * ( c1 * ( u(4,i,j,k-1) * tmp1 ) )  &
+     &       - dt * tz1 * c1345 * tmp1  &
+     &       - dt * tz1 * dz5
+
+!---------------------------------------------------------------------
+!   form the second block sub-diagonal
+!---------------------------------------------------------------------
+               tmp1 = rho_i(i,j-1,k)
+               tmp2 = tmp1 * tmp1
+               tmp3 = tmp1 * tmp2
+
+               b(1,1,i) = - dt * ty1 * dy1
+               b(1,2,i) =   0.0d+00
+               b(1,3,i) = - dt * ty2
+               b(1,4,i) =   0.0d+00
+               b(1,5,i) =   0.0d+00
+
+               b(2,1,i) = - dt * ty2  &
+     &           * ( - ( u(2,i,j-1,k)*u(3,i,j-1,k) ) * tmp2 )  &
+     &           - dt * ty1 * ( - c34 * tmp2 * u(2,i,j-1,k) )
+               b(2,2,i) = - dt * ty2 * ( u(3,i,j-1,k) * tmp1 )  &
+     &          - dt * ty1 * ( c34 * tmp1 )  &
+     &          - dt * ty1 * dy2
+               b(2,3,i) = - dt * ty2 * ( u(2,i,j-1,k) * tmp1 )
+               b(2,4,i) = 0.0d+00
+               b(2,5,i) = 0.0d+00
+
+               b(3,1,i) = - dt * ty2  &
+     &           * ( - ( u(3,i,j-1,k) * tmp1 ) ** 2  &
+     &       + c2 * ( qs(i,j-1,k) * tmp1 ) )  &
+     &       - dt * ty1 * ( - r43 * c34 * tmp2 * u(3,i,j-1,k) )
+               b(3,2,i) = - dt * ty2  &
+     &                   * ( - c2 * ( u(2,i,j-1,k) * tmp1 ) )
+               b(3,3,i) = - dt * ty2 * ( ( 2.0d+00 - c2 )  &
+     &                   * ( u(3,i,j-1,k) * tmp1 ) )  &
+     &       - dt * ty1 * ( r43 * c34 * tmp1 )  &
+     &       - dt * ty1 * dy3
+               b(3,4,i) = - dt * ty2  &
+     &                   * ( - c2 * ( u(4,i,j-1,k) * tmp1 ) )
+               b(3,5,i) = - dt * ty2 * c2
+
+               b(4,1,i) = - dt * ty2  &
+     &              * ( - ( u(3,i,j-1,k)*u(4,i,j-1,k) ) * tmp2 )  &
+     &       - dt * ty1 * ( - c34 * tmp2 * u(4,i,j-1,k) )
+               b(4,2,i) = 0.0d+00
+               b(4,3,i) = - dt * ty2 * ( u(4,i,j-1,k) * tmp1 )
+               b(4,4,i) = - dt * ty2 * ( u(3,i,j-1,k) * tmp1 )  &
+     &                        - dt * ty1 * ( c34 * tmp1 )  &
+     &                        - dt * ty1 * dy4
+               b(4,5,i) = 0.0d+00
+
+               b(5,1,i) = - dt * ty2  &
+     &          * ( ( c2 * 2.0d0 * qs(i,j-1,k)  &
+     &               - c1 * u(5,i,j-1,k) )  &
+     &          * ( u(3,i,j-1,k) * tmp2 ) )  &
+     &          - dt * ty1  &
+     &          * ( - (     c34 - c1345 )*tmp3*(u(2,i,j-1,k)**2)  &
+     &              - ( r43*c34 - c1345 )*tmp3*(u(3,i,j-1,k)**2)  &
+     &              - (     c34 - c1345 )*tmp3*(u(4,i,j-1,k)**2)  &
+     &              - c1345*tmp2*u(5,i,j-1,k) )
+               b(5,2,i) = - dt * ty2  &
+     &          * ( - c2 * ( u(2,i,j-1,k)*u(3,i,j-1,k) ) * tmp2 )  &
+     &          - dt * ty1  &
+     &          * ( c34 - c1345 ) * tmp2 * u(2,i,j-1,k)
+               b(5,3,i) = - dt * ty2  &
+     &          * ( c1 * ( u(5,i,j-1,k) * tmp1 )  &
+     &          - c2  &
+     &          * ( qs(i,j-1,k) * tmp1  &
+     &               + u(3,i,j-1,k)*u(3,i,j-1,k) * tmp2 ) )  &
+     &          - dt * ty1  &
+     &          * ( r43*c34 - c1345 ) * tmp2 * u(3,i,j-1,k)
+               b(5,4,i) = - dt * ty2  &
+     &          * ( - c2 * ( u(3,i,j-1,k)*u(4,i,j-1,k) ) * tmp2 )  &
+     &          - dt * ty1 * ( c34 - c1345 ) * tmp2 * u(4,i,j-1,k)
+               b(5,5,i) = - dt * ty2  &
+     &          * ( c1 * ( u(3,i,j-1,k) * tmp1 ) )  &
+     &          - dt * ty1 * c1345 * tmp1  &
+     &          - dt * ty1 * dy5
+
+!---------------------------------------------------------------------
+!   form the third block sub-diagonal
+!---------------------------------------------------------------------
+               tmp1 = rho_i(i-1,j,k)
+               tmp2 = tmp1 * tmp1
+               tmp3 = tmp1 * tmp2
+
+               c(1,1,i) = - dt * tx1 * dx1
+               c(1,2,i) = - dt * tx2
+               c(1,3,i) =   0.0d+00
+               c(1,4,i) =   0.0d+00
+               c(1,5,i) =   0.0d+00
+
+               c(2,1,i) = - dt * tx2  &
+     &          * ( - ( u(2,i-1,j,k) * tmp1 ) ** 2  &
+     &       + c2 * qs(i-1,j,k) * tmp1 )  &
+     &          - dt * tx1 * ( - r43 * c34 * tmp2 * u(2,i-1,j,k) )
+               c(2,2,i) = - dt * tx2  &
+     &          * ( ( 2.0d+00 - c2 ) * ( u(2,i-1,j,k) * tmp1 ) )  &
+     &          - dt * tx1 * ( r43 * c34 * tmp1 )  &
+     &          - dt * tx1 * dx2
+               c(2,3,i) = - dt * tx2  &
+     &              * ( - c2 * ( u(3,i-1,j,k) * tmp1 ) )
+               c(2,4,i) = - dt * tx2  &
+     &              * ( - c2 * ( u(4,i-1,j,k) * tmp1 ) )
+               c(2,5,i) = - dt * tx2 * c2 
+
+               c(3,1,i) = - dt * tx2  &
+     &              * ( - ( u(2,i-1,j,k) * u(3,i-1,j,k) ) * tmp2 )  &
+     &         - dt * tx1 * ( - c34 * tmp2 * u(3,i-1,j,k) )
+               c(3,2,i) = - dt * tx2 * ( u(3,i-1,j,k) * tmp1 )
+               c(3,3,i) = - dt * tx2 * ( u(2,i-1,j,k) * tmp1 )  &
+     &          - dt * tx1 * ( c34 * tmp1 )  &
+     &          - dt * tx1 * dx3
+               c(3,4,i) = 0.0d+00
+               c(3,5,i) = 0.0d+00
+
+               c(4,1,i) = - dt * tx2  &
+     &          * ( - ( u(2,i-1,j,k)*u(4,i-1,j,k) ) * tmp2 )  &
+     &          - dt * tx1 * ( - c34 * tmp2 * u(4,i-1,j,k) )
+               c(4,2,i) = - dt * tx2 * ( u(4,i-1,j,k) * tmp1 )
+               c(4,3,i) = 0.0d+00
+               c(4,4,i) = - dt * tx2 * ( u(2,i-1,j,k) * tmp1 )  &
+     &          - dt * tx1 * ( c34 * tmp1 )  &
+     &          - dt * tx1 * dx4
+               c(4,5,i) = 0.0d+00
+
+               c(5,1,i) = - dt * tx2  &
+     &          * ( ( c2 * 2.0d0 * qs(i-1,j,k)  &
+     &              - c1 * u(5,i-1,j,k) )  &
+     &          * u(2,i-1,j,k) * tmp2 )  &
+     &          - dt * tx1  &
+     &          * ( - ( r43*c34 - c1345 ) * tmp3 * ( u(2,i-1,j,k)**2 )  &
+     &              - (     c34 - c1345 ) * tmp3 * ( u(3,i-1,j,k)**2 )  &
+     &              - (     c34 - c1345 ) * tmp3 * ( u(4,i-1,j,k)**2 )  &
+     &              - c1345 * tmp2 * u(5,i-1,j,k) )
+               c(5,2,i) = - dt * tx2  &
+     &          * ( c1 * ( u(5,i-1,j,k) * tmp1 )  &
+     &             - c2  &
+     &             * ( u(2,i-1,j,k)*u(2,i-1,j,k) * tmp2  &
+     &                  + qs(i-1,j,k) * tmp1 ) )  &
+     &           - dt * tx1  &
+     &           * ( r43*c34 - c1345 ) * tmp2 * u(2,i-1,j,k)
+               c(5,3,i) = - dt * tx2  &
+     &           * ( - c2 * ( u(3,i-1,j,k)*u(2,i-1,j,k) ) * tmp2 )  &
+     &           - dt * tx1  &
+     &           * (  c34 - c1345 ) * tmp2 * u(3,i-1,j,k)
+               c(5,4,i) = - dt * tx2  &
+     &           * ( - c2 * ( u(4,i-1,j,k)*u(2,i-1,j,k) ) * tmp2 )  &
+     &           - dt * tx1  &
+     &           * (  c34 - c1345 ) * tmp2 * u(4,i-1,j,k)
+               c(5,5,i) = - dt * tx2  &
+     &           * ( c1 * ( u(2,i-1,j,k) * tmp1 ) )  &
+     &           - dt * tx1 * c1345 * tmp1  &
+     &           - dt * tx1 * dx5
+
+            end do
+
+
+      return
+      end
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/LU/jacu.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/LU/jacu.f90
new file mode 100644
index 000000000..89f681afd
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/LU/jacu.f90
@@ -0,0 +1,353 @@
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine jacu(j, k)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!   compute the upper triangular part of the jacobian matrix
+!---------------------------------------------------------------------
+
+      use lu_data
+      implicit none
+
+!---------------------------------------------------------------------
+!  input parameters
+!---------------------------------------------------------------------
+      integer j, k
+
+!---------------------------------------------------------------------
+!  local variables
+!---------------------------------------------------------------------
+      integer i
+      double precision  r43
+      double precision  c1345
+      double precision  c34
+      double precision  tmp1, tmp2, tmp3
+
+
+
+      r43 = ( 4.0d+00 / 3.0d+00 )
+      c1345 = c1 * c3 * c4 * c5
+      c34 = c3 * c4
+
+            do i = iend, ist, -1
+
+!---------------------------------------------------------------------
+!   form the block diagonal
+!---------------------------------------------------------------------
+               tmp1 = rho_i(i,j,k)
+               tmp2 = tmp1 * tmp1
+               tmp3 = tmp1 * tmp2
+
+               d(1,1,i) =  1.0d+00  &
+     &                       + dt * 2.0d+00 * (   tx1 * dx1  &
+     &                                          + ty1 * dy1  &
+     &                                          + tz1 * dz1 )
+               d(1,2,i) =  0.0d+00
+               d(1,3,i) =  0.0d+00
+               d(1,4,i) =  0.0d+00
+               d(1,5,i) =  0.0d+00
+
+               d(2,1,i) =  dt * 2.0d+00  &
+     &           * ( - tx1 * r43 - ty1 - tz1 )  &
+     &           * ( c34 * tmp2 * u(2,i,j,k) )
+               d(2,2,i) =  1.0d+00  &
+     &          + dt * 2.0d+00 * c34 * tmp1  &
+     &          * (  tx1 * r43 + ty1 + tz1 )  &
+     &          + dt * 2.0d+00 * (   tx1 * dx2  &
+     &                             + ty1 * dy2  &
+     &                             + tz1 * dz2  )
+               d(2,3,i) = 0.0d+00
+               d(2,4,i) = 0.0d+00
+               d(2,5,i) = 0.0d+00
+
+               d(3,1,i) = dt * 2.0d+00  &
+     &           * ( - tx1 - ty1 * r43 - tz1 )  &
+     &           * ( c34 * tmp2 * u(3,i,j,k) )
+               d(3,2,i) = 0.0d+00
+               d(3,3,i) = 1.0d+00  &
+     &         + dt * 2.0d+00 * c34 * tmp1  &
+     &              * (  tx1 + ty1 * r43 + tz1 )  &
+     &         + dt * 2.0d+00 * (  tx1 * dx3  &
+     &                           + ty1 * dy3  &
+     &                           + tz1 * dz3 )
+               d(3,4,i) = 0.0d+00
+               d(3,5,i) = 0.0d+00
+
+               d(4,1,i) = dt * 2.0d+00  &
+     &           * ( - tx1 - ty1 - tz1 * r43 )  &
+     &           * ( c34 * tmp2 * u(4,i,j,k) )
+               d(4,2,i) = 0.0d+00
+               d(4,3,i) = 0.0d+00
+               d(4,4,i) = 1.0d+00  &
+     &         + dt * 2.0d+00 * c34 * tmp1  &
+     &              * (  tx1 + ty1 + tz1 * r43 )  &
+     &         + dt * 2.0d+00 * (  tx1 * dx4  &
+     &                           + ty1 * dy4  &
+     &                           + tz1 * dz4 )
+               d(4,5,i) = 0.0d+00
+
+               d(5,1,i) = -dt * 2.0d+00  &
+     &  * ( ( ( tx1 * ( r43*c34 - c1345 )  &
+     &     + ty1 * ( c34 - c1345 )  &
+     &     + tz1 * ( c34 - c1345 ) ) * ( u(2,i,j,k) ** 2 )  &
+     &   + ( tx1 * ( c34 - c1345 )  &
+     &     + ty1 * ( r43*c34 - c1345 )  &
+     &     + tz1 * ( c34 - c1345 ) ) * ( u(3,i,j,k) ** 2 )  &
+     &   + ( tx1 * ( c34 - c1345 )  &
+     &     + ty1 * ( c34 - c1345 )  &
+     &     + tz1 * ( r43*c34 - c1345 ) ) * ( u(4,i,j,k) ** 2 )  &
+     &      ) * tmp3  &
+     &   + ( tx1 + ty1 + tz1 ) * c1345 * tmp2 * u(5,i,j,k) )
+
+               d(5,2,i) = dt * 2.0d+00  &
+     & * ( tx1 * ( r43*c34 - c1345 )  &
+     &   + ty1 * (     c34 - c1345 )  &
+     &   + tz1 * (     c34 - c1345 ) ) * tmp2 * u(2,i,j,k)
+               d(5,3,i) = dt * 2.0d+00  &
+     & * ( tx1 * ( c34 - c1345 )  &
+     &   + ty1 * ( r43*c34 -c1345 )  &
+     &   + tz1 * ( c34 - c1345 ) ) * tmp2 * u(3,i,j,k)
+               d(5,4,i) = dt * 2.0d+00  &
+     & * ( tx1 * ( c34 - c1345 )  &
+     &   + ty1 * ( c34 - c1345 )  &
+     &   + tz1 * ( r43*c34 - c1345 ) ) * tmp2 * u(4,i,j,k)
+               d(5,5,i) = 1.0d+00  &
+     &   + dt * 2.0d+00 * ( tx1 + ty1 + tz1 ) * c1345 * tmp1  &
+     &   + dt * 2.0d+00 * (  tx1 * dx5  &
+     &                    +  ty1 * dy5  &
+     &                    +  tz1 * dz5 )
+
+!---------------------------------------------------------------------
+!   form the first block super-diagonal
+!---------------------------------------------------------------------
+               tmp1 = rho_i(i+1,j,k)
+               tmp2 = tmp1 * tmp1
+               tmp3 = tmp1 * tmp2
+
+               a(1,1,i) = - dt * tx1 * dx1
+               a(1,2,i) =   dt * tx2
+               a(1,3,i) =   0.0d+00
+               a(1,4,i) =   0.0d+00
+               a(1,5,i) =   0.0d+00
+
+               a(2,1,i) =  dt * tx2  &
+     &          * ( - ( u(2,i+1,j,k) * tmp1 ) ** 2  &
+     &     + c2 * qs(i+1,j,k) * tmp1 )  &
+     &          - dt * tx1 * ( - r43 * c34 * tmp2 * u(2,i+1,j,k) )
+               a(2,2,i) =  dt * tx2  &
+     &          * ( ( 2.0d+00 - c2 ) * ( u(2,i+1,j,k) * tmp1 ) )  &
+     &          - dt * tx1 * ( r43 * c34 * tmp1 )  &
+     &          - dt * tx1 * dx2
+               a(2,3,i) =  dt * tx2  &
+     &              * ( - c2 * ( u(3,i+1,j,k) * tmp1 ) )
+               a(2,4,i) =  dt * tx2  &
+     &              * ( - c2 * ( u(4,i+1,j,k) * tmp1 ) )
+               a(2,5,i) =  dt * tx2 * c2 
+
+               a(3,1,i) =  dt * tx2  &
+     &              * ( - ( u(2,i+1,j,k) * u(3,i+1,j,k) ) * tmp2 )  &
+     &         - dt * tx1 * ( - c34 * tmp2 * u(3,i+1,j,k) )
+               a(3,2,i) =  dt * tx2 * ( u(3,i+1,j,k) * tmp1 )
+               a(3,3,i) =  dt * tx2 * ( u(2,i+1,j,k) * tmp1 )  &
+     &          - dt * tx1 * ( c34 * tmp1 )  &
+     &          - dt * tx1 * dx3
+               a(3,4,i) = 0.0d+00
+               a(3,5,i) = 0.0d+00
+
+               a(4,1,i) = dt * tx2  &
+     &          * ( - ( u(2,i+1,j,k)*u(4,i+1,j,k) ) * tmp2 )  &
+     &          - dt * tx1 * ( - c34 * tmp2 * u(4,i+1,j,k) )
+               a(4,2,i) = dt * tx2 * ( u(4,i+1,j,k) * tmp1 )
+               a(4,3,i) = 0.0d+00
+               a(4,4,i) = dt * tx2 * ( u(2,i+1,j,k) * tmp1 )  &
+     &          - dt * tx1 * ( c34 * tmp1 )  &
+     &          - dt * tx1 * dx4
+               a(4,5,i) = 0.0d+00
+
+               a(5,1,i) = dt * tx2  &
+     &          * ( ( c2 * 2.0d0 * qs(i+1,j,k)  &
+     &              - c1 * u(5,i+1,j,k) )  &
+     &          * ( u(2,i+1,j,k) * tmp2 ) )  &
+     &          - dt * tx1  &
+     &          * ( - ( r43*c34 - c1345 ) * tmp3 * ( u(2,i+1,j,k)**2 )  &
+     &              - (     c34 - c1345 ) * tmp3 * ( u(3,i+1,j,k)**2 )  &
+     &              - (     c34 - c1345 ) * tmp3 * ( u(4,i+1,j,k)**2 )  &
+     &              - c1345 * tmp2 * u(5,i+1,j,k) )
+               a(5,2,i) = dt * tx2  &
+     &          * ( c1 * ( u(5,i+1,j,k) * tmp1 )  &
+     &             - c2  &
+     &             * (  u(2,i+1,j,k)*u(2,i+1,j,k) * tmp2  &
+     &                  + qs(i+1,j,k) * tmp1 ) )  &
+     &           - dt * tx1  &
+     &           * ( r43*c34 - c1345 ) * tmp2 * u(2,i+1,j,k)
+               a(5,3,i) = dt * tx2  &
+     &           * ( - c2 * ( u(3,i+1,j,k)*u(2,i+1,j,k) ) * tmp2 )  &
+     &           - dt * tx1  &
+     &           * (  c34 - c1345 ) * tmp2 * u(3,i+1,j,k)
+               a(5,4,i) = dt * tx2  &
+     &           * ( - c2 * ( u(4,i+1,j,k)*u(2,i+1,j,k) ) * tmp2 )  &
+     &           - dt * tx1  &
+     &           * (  c34 - c1345 ) * tmp2 * u(4,i+1,j,k)
+               a(5,5,i) = dt * tx2  &
+     &           * ( c1 * ( u(2,i+1,j,k) * tmp1 ) )  &
+     &           - dt * tx1 * c1345 * tmp1  &
+     &           - dt * tx1 * dx5
+
+!---------------------------------------------------------------------
+!   form the second block super-diagonal
+!---------------------------------------------------------------------
+               tmp1 = rho_i(i,j+1,k)
+               tmp2 = tmp1 * tmp1
+               tmp3 = tmp1 * tmp2
+
+               b(1,1,i) = - dt * ty1 * dy1
+               b(1,2,i) =   0.0d+00
+               b(1,3,i) =  dt * ty2
+               b(1,4,i) =   0.0d+00
+               b(1,5,i) =   0.0d+00
+
+               b(2,1,i) =  dt * ty2  &
+     &           * ( - ( u(2,i,j+1,k)*u(3,i,j+1,k) ) * tmp2 )  &
+     &           - dt * ty1 * ( - c34 * tmp2 * u(2,i,j+1,k) )
+               b(2,2,i) =  dt * ty2 * ( u(3,i,j+1,k) * tmp1 )  &
+     &          - dt * ty1 * ( c34 * tmp1 )  &
+     &          - dt * ty1 * dy2
+               b(2,3,i) =  dt * ty2 * ( u(2,i,j+1,k) * tmp1 )
+               b(2,4,i) = 0.0d+00
+               b(2,5,i) = 0.0d+00
+
+               b(3,1,i) =  dt * ty2  &
+     &           * ( - ( u(3,i,j+1,k) * tmp1 ) ** 2  &
+     &      + c2 * ( qs(i,j+1,k) * tmp1 ) )  &
+     &       - dt * ty1 * ( - r43 * c34 * tmp2 * u(3,i,j+1,k) )
+               b(3,2,i) =  dt * ty2  &
+     &                   * ( - c2 * ( u(2,i,j+1,k) * tmp1 ) )
+               b(3,3,i) =  dt * ty2 * ( ( 2.0d+00 - c2 )  &
+     &                   * ( u(3,i,j+1,k) * tmp1 ) )  &
+     &       - dt * ty1 * ( r43 * c34 * tmp1 )  &
+     &       - dt * ty1 * dy3
+               b(3,4,i) =  dt * ty2  &
+     &                   * ( - c2 * ( u(4,i,j+1,k) * tmp1 ) )
+               b(3,5,i) =  dt * ty2 * c2
+
+               b(4,1,i) =  dt * ty2  &
+     &              * ( - ( u(3,i,j+1,k)*u(4,i,j+1,k) ) * tmp2 )  &
+     &       - dt * ty1 * ( - c34 * tmp2 * u(4,i,j+1,k) )
+               b(4,2,i) = 0.0d+00
+               b(4,3,i) =  dt * ty2 * ( u(4,i,j+1,k) * tmp1 )
+               b(4,4,i) =  dt * ty2 * ( u(3,i,j+1,k) * tmp1 )  &
+     &                        - dt * ty1 * ( c34 * tmp1 )  &
+     &                        - dt * ty1 * dy4
+               b(4,5,i) = 0.0d+00
+
+               b(5,1,i) =  dt * ty2  &
+     &          * ( ( c2 * 2.0d0 * qs(i,j+1,k)  &
+     &               - c1 * u(5,i,j+1,k) )  &
+     &          * ( u(3,i,j+1,k) * tmp2 ) )  &
+     &          - dt * ty1  &
+     &          * ( - (     c34 - c1345 )*tmp3*(u(2,i,j+1,k)**2)  &
+     &              - ( r43*c34 - c1345 )*tmp3*(u(3,i,j+1,k)**2)  &
+     &              - (     c34 - c1345 )*tmp3*(u(4,i,j+1,k)**2)  &
+     &              - c1345*tmp2*u(5,i,j+1,k) )
+               b(5,2,i) =  dt * ty2  &
+     &          * ( - c2 * ( u(2,i,j+1,k)*u(3,i,j+1,k) ) * tmp2 )  &
+     &          - dt * ty1  &
+     &          * ( c34 - c1345 ) * tmp2 * u(2,i,j+1,k)
+               b(5,3,i) =  dt * ty2  &
+     &          * ( c1 * ( u(5,i,j+1,k) * tmp1 )  &
+     &          - c2  &
+     &          * ( qs(i,j+1,k) * tmp1  &
+     &               + u(3,i,j+1,k)*u(3,i,j+1,k) * tmp2 ) )  &
+     &          - dt * ty1  &
+     &          * ( r43*c34 - c1345 ) * tmp2 * u(3,i,j+1,k)
+               b(5,4,i) =  dt * ty2  &
+     &          * ( - c2 * ( u(3,i,j+1,k)*u(4,i,j+1,k) ) * tmp2 )  &
+     &          - dt * ty1 * ( c34 - c1345 ) * tmp2 * u(4,i,j+1,k)
+               b(5,5,i) =  dt * ty2  &
+     &          * ( c1 * ( u(3,i,j+1,k) * tmp1 ) )  &
+     &          - dt * ty1 * c1345 * tmp1  &
+     &          - dt * ty1 * dy5
+
+!---------------------------------------------------------------------
+!   form the third block super-diagonal
+!---------------------------------------------------------------------
+               tmp1 = rho_i(i,j,k+1)
+               tmp2 = tmp1 * tmp1
+               tmp3 = tmp1 * tmp2
+
+               c(1,1,i) = - dt * tz1 * dz1
+               c(1,2,i) =   0.0d+00
+               c(1,3,i) =   0.0d+00
+               c(1,4,i) = dt * tz2
+               c(1,5,i) =   0.0d+00
+
+               c(2,1,i) = dt * tz2  &
+     &           * ( - ( u(2,i,j,k+1)*u(4,i,j,k+1) ) * tmp2 )  &
+     &           - dt * tz1 * ( - c34 * tmp2 * u(2,i,j,k+1) )
+               c(2,2,i) = dt * tz2 * ( u(4,i,j,k+1) * tmp1 )  &
+     &           - dt * tz1 * c34 * tmp1  &
+     &           - dt * tz1 * dz2 
+               c(2,3,i) = 0.0d+00
+               c(2,4,i) = dt * tz2 * ( u(2,i,j,k+1) * tmp1 )
+               c(2,5,i) = 0.0d+00
+
+               c(3,1,i) = dt * tz2  &
+     &           * ( - ( u(3,i,j,k+1)*u(4,i,j,k+1) ) * tmp2 )  &
+     &           - dt * tz1 * ( - c34 * tmp2 * u(3,i,j,k+1) )
+               c(3,2,i) = 0.0d+00
+               c(3,3,i) = dt * tz2 * ( u(4,i,j,k+1) * tmp1 )  &
+     &           - dt * tz1 * ( c34 * tmp1 )  &
+     &           - dt * tz1 * dz3
+               c(3,4,i) = dt * tz2 * ( u(3,i,j,k+1) * tmp1 )
+               c(3,5,i) = 0.0d+00
+
+               c(4,1,i) = dt * tz2  &
+     &        * ( - ( u(4,i,j,k+1) * tmp1 ) ** 2  &
+     &            + c2 * ( qs(i,j,k+1) * tmp1 ) )  &
+     &        - dt * tz1 * ( - r43 * c34 * tmp2 * u(4,i,j,k+1) )
+               c(4,2,i) = dt * tz2  &
+     &             * ( - c2 * ( u(2,i,j,k+1) * tmp1 ) )
+               c(4,3,i) = dt * tz2  &
+     &             * ( - c2 * ( u(3,i,j,k+1) * tmp1 ) )
+               c(4,4,i) = dt * tz2 * ( 2.0d+00 - c2 )  &
+     &             * ( u(4,i,j,k+1) * tmp1 )  &
+     &             - dt * tz1 * ( r43 * c34 * tmp1 )  &
+     &             - dt * tz1 * dz4
+               c(4,5,i) = dt * tz2 * c2
+
+               c(5,1,i) = dt * tz2  &
+     &     * ( ( c2 * 2.0d0 * qs(i,j,k+1)  &
+     &       - c1 * u(5,i,j,k+1) )  &
+     &            * ( u(4,i,j,k+1) * tmp2 ) )  &
+     &       - dt * tz1  &
+     &       * ( - ( c34 - c1345 ) * tmp3 * (u(2,i,j,k+1)**2)  &
+     &           - ( c34 - c1345 ) * tmp3 * (u(3,i,j,k+1)**2)  &
+     &           - ( r43*c34 - c1345 )* tmp3 * (u(4,i,j,k+1)**2)  &
+     &          - c1345 * tmp2 * u(5,i,j,k+1) )
+               c(5,2,i) = dt * tz2  &
+     &       * ( - c2 * ( u(2,i,j,k+1)*u(4,i,j,k+1) ) * tmp2 )  &
+     &       - dt * tz1 * ( c34 - c1345 ) * tmp2 * u(2,i,j,k+1)
+               c(5,3,i) = dt * tz2  &
+     &       * ( - c2 * ( u(3,i,j,k+1)*u(4,i,j,k+1) ) * tmp2 )  &
+     &       - dt * tz1 * ( c34 - c1345 ) * tmp2 * u(3,i,j,k+1)
+               c(5,4,i) = dt * tz2  &
+     &       * ( c1 * ( u(5,i,j,k+1) * tmp1 )  &
+     &       - c2  &
+     &       * ( qs(i,j,k+1) * tmp1  &
+     &            + u(4,i,j,k+1)*u(4,i,j,k+1) * tmp2 ) )  &
+     &       - dt * tz1 * ( r43*c34 - c1345 ) * tmp2 * u(4,i,j,k+1)
+               c(5,5,i) = dt * tz2  &
+     &       * ( c1 * ( u(4,i,j,k+1) * tmp1 ) )  &
+     &       - dt * tz1 * c1345 * tmp1  &
+     &       - dt * tz1 * dz5
+
+            end do
+
+
+      return
+      end
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/LU/l2norm.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/LU/l2norm.f90
new file mode 100644
index 000000000..df6fb2eb3
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/LU/l2norm.f90
@@ -0,0 +1,59 @@
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+      subroutine l2norm ( ldx, ldy, ldz,  &
+     &                    nx0, ny0, nz0,  &
+     &                    ist, iend,  &
+     &                    jst, jend,  &
+     &                    v, sum )
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!   to compute the l2-norm of vector v.
+!---------------------------------------------------------------------
+
+      implicit none
+
+!---------------------------------------------------------------------
+!  input parameters
+!---------------------------------------------------------------------
+      integer ldx, ldy, ldz
+      integer nx0, ny0, nz0
+      integer ist, iend
+      integer jst, jend
+!---------------------------------------------------------------------
+!   To improve cache performance, the second and third dimensions
+!   are padded by 1 when their sizes are even.  Only needed in v.
+!---------------------------------------------------------------------
+      double precision  v(5,ldx/2*2+1,ldy/2*2+1,*), sum(5)
+
+!---------------------------------------------------------------------
+!  local variables
+!---------------------------------------------------------------------
+      integer i, j, k, m
+
+
+      do m = 1, 5
+         sum(m) = 0.0d+00
+      end do
+
+!$omp parallel do schedule(static) collapse(2) default(shared)  &
+!$omp&  private(i,j,k,m) reduction(+: sum)
+      do k = 2, nz0-1
+         do j = jst, jend
+            do i = ist, iend
+               do m = 1, 5
+                  sum(m) = sum(m) + v(m,i,j,k)*v(m,i,j,k)
+               end do
+            end do
+         end do
+      end do
+!$omp end parallel do
+
+      do m = 1, 5
+         sum(m) = sqrt ( sum(m) / ( dble(nx0-2)*(ny0-2)*(nz0-2) ) )
+      end do
+
+      return
+      end
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/LU/lu.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/LU/lu.f90
new file mode 100644
index 000000000..a3a08bf83
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/LU/lu.f90
@@ -0,0 +1,195 @@
+!-------------------------------------------------------------------------!
+!                                                                         !
+!        N  A  S     P A R A L L E L     B E N C H M A R K S  3.4         !
+!                                                                         !
+!                       O p e n M P     V E R S I O N                     !
+!                                                                         !
+!                                   L U                                   !
+!                                                                         !
+!-------------------------------------------------------------------------!
+!                                                                         !
+!    This benchmark is an OpenMP version of the NPB LU code.              !
+!    It is described in NAS Technical Report 99-011.                      !
+!                                                                         !
+!    Permission to use, copy, distribute and modify this software         !
+!    for any purpose with or without fee is hereby granted.  We           !
+!    request, however, that all derived work reference the NAS            !
+!    Parallel Benchmarks 3.4. This software is provided "as is"           !
+!    without express or implied warranty.                                 !
+!                                                                         !
+!    Information on NPB 3.4, including the technical report, the          !
+!    original specifications, source code, results and information        !
+!    on how to submit new results, is available at:                       !
+!                                                                         !
+!           http://www.nas.nasa.gov/Software/NPB/                         !
+!                                                                         !
+!    Send comments or suggestions to  npb@nas.nasa.gov                    !
+!                                                                         !
+!          NAS Parallel Benchmarks Group                                  !
+!          NASA Ames Research Center                                      !
+!          Mail Stop: T27A-1                                              !
+!          Moffett Field, CA   94035-1000                                 !
+!                                                                         !
+!          E-mail:  npb@nas.nasa.gov                                      !
+!          Fax:     (650) 604-3957                                        !
+!                                                                         !
+!-------------------------------------------------------------------------!
+
+!---------------------------------------------------------------------
+!
+! Authors: S. Weeratunga
+!          V. Venkatakrishnan
+!          E. Barszcz
+!          M. Yarrow
+!          H. Jin
+!
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+      program applu
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!
+!   driver for the performance evaluation of the solver for
+!   five coupled parabolic/elliptic partial differential equations.
+!
+!---------------------------------------------------------------------
+
+      use lu_data
+      implicit none
+
+      character class
+      logical verified
+      double precision mflops
+
+      double precision t, tmax, timer_read, trecs(t_last)
+      external timer_read
+      integer i
+      character t_names(t_last)*8
+
+!---------------------------------------------------------------------
+!     Setup info for timers
+!---------------------------------------------------------------------
+
+      call check_timer_flag( timeron )
+      if (timeron) then
+         t_names(t_total) = 'total'
+         t_names(t_rhsx) = 'rhsx'
+         t_names(t_rhsy) = 'rhsy'
+         t_names(t_rhsz) = 'rhsz'
+         t_names(t_rhs) = 'rhs'
+         t_names(t_jacld) = 'jacld'
+         t_names(t_blts) = 'blts'
+         t_names(t_jacu) = 'jacu'
+         t_names(t_buts) = 'buts'
+         t_names(t_add) = 'add'
+         t_names(t_l2norm) = 'l2norm'
+      endif
+
+!---------------------------------------------------------------------
+!   read input data
+!---------------------------------------------------------------------
+      call read_input()
+
+
+!---------------------------------------------------------------------
+!   set up domain sizes
+!---------------------------------------------------------------------
+      call domain()
+
+      call alloc_space
+
+!---------------------------------------------------------------------
+!   set up coefficients
+!---------------------------------------------------------------------
+      call setcoeff()
+
+!---------------------------------------------------------------------
+!   set the boundary values for dependent variables
+!---------------------------------------------------------------------
+      call setbv()
+
+!---------------------------------------------------------------------
+!   set the initial values for dependent variables
+!---------------------------------------------------------------------
+      call setiv()
+
+!---------------------------------------------------------------------
+!   compute the forcing term based on prescribed exact solution
+!---------------------------------------------------------------------
+      call erhs()
+
+!---------------------------------------------------------------------
+!   perform one SSOR iteration to touch all data pages
+!---------------------------------------------------------------------
+      call ssor(1)
+
+!---------------------------------------------------------------------
+!   reset the boundary and initial values
+!---------------------------------------------------------------------
+      call setbv()
+      call setiv()
+
+!---------------------------------------------------------------------
+!   perform the SSOR iterations
+!---------------------------------------------------------------------
+      call ssor(itmax)
+
+!---------------------------------------------------------------------
+!   compute the solution error
+!---------------------------------------------------------------------
+      call error()
+
+!---------------------------------------------------------------------
+!   compute the surface integral
+!---------------------------------------------------------------------
+      call pintgr()
+
+!---------------------------------------------------------------------
+!   verification test
+!---------------------------------------------------------------------
+      call verify ( rsdnm, errnm, frc, class, verified )
+      mflops = 1.0d-6*dble(itmax)*(1984.77*dble( nx0 )  &
+     &     *dble( ny0 )  &
+     &     *dble( nz0 )  &
+     &     -10923.3*(dble( nx0+ny0+nz0 )/3.)**2  &
+     &     +27770.9* dble( nx0+ny0+nz0 )/3.  &
+     &     -144010.)  &
+     &     / maxtime
+
+      call print_results('LU', class, nx0,  &
+     &  ny0, nz0, itmax,  &
+     &  maxtime, mflops, '          floating point', verified,  &
+     &  npbversion, compiletime, cs1, cs2, cs3, cs4, cs5, cs6,  &
+     &  '(none)')
+
+!---------------------------------------------------------------------
+!      More timers
+!---------------------------------------------------------------------
+      if (.not.timeron) goto 999
+
+      do i=1, t_last
+         trecs(i) = timer_read(i)
+      end do
+      tmax = maxtime
+      if ( tmax .eq. 0. ) tmax = 1.0
+
+      write(*,800)
+ 800  format('  SECTION     Time (secs)')
+      do i=1, t_last
+         if (i.ne.t_jacld .and. i.ne.t_jacu) then
+            write(*,810) t_names(i), trecs(i), trecs(i)*100./tmax
+         endif
+         if (i.eq.t_rhs) then
+            t = trecs(t_rhsx) + trecs(t_rhsy) + trecs(t_rhsz)
+            write(*,820) 'sub-rhs', t, t*100./tmax
+            t = trecs(i) - t
+            write(*,820) 'rest-rhs', t, t*100./tmax
+         endif
+ 810     format(2x,a8,':',f9.3,'  (',f6.2,'%)')
+ 820     format(5x,'--> ',a8,':',f9.3,'  (',f6.2,'%)')
+      end do
+
+ 999  continue
+      end
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/LU/lu_data.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/LU/lu_data.f90
new file mode 100644
index 000000000..744943eee
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/LU/lu_data.f90
@@ -0,0 +1,169 @@
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+!---  lu_data module
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      module lu_data
+
+!---------------------------------------------------------------------
+!   npbparams.h defines parameters that depend on the class and 
+!   number of nodes
+!---------------------------------------------------------------------
+
+      include 'npbparams.h'
+
+!---------------------------------------------------------------------
+!   parameters which can be overridden in runtime config file
+!   isiz1,isiz2,isiz3 give the maximum size
+!   ipr = 1 to print out verbose information
+!   omega = 1.2 is the default relaxation factor for all classes
+!   tolrsd gives the tolerance levels for steady state residuals
+!---------------------------------------------------------------------
+      integer ipr_default
+      parameter (ipr_default = 1)
+      double precision omega_default
+      parameter (omega_default = 1.2d0)
+      double precision tolrsd1_def, tolrsd2_def, tolrsd3_def,  &
+     &                 tolrsd4_def, tolrsd5_def
+      parameter (tolrsd1_def=1.0e-08,  &
+     &          tolrsd2_def=1.0e-08, tolrsd3_def=1.0e-08,  &
+     &          tolrsd4_def=1.0e-08, tolrsd5_def=1.0e-08)
+
+      double precision c1, c2, c3, c4, c5
+      parameter( c1 = 1.40d+00, c2 = 0.40d+00,  &
+     &           c3 = 1.00d-01, c4 = 1.00d+00,  &
+     &           c5 = 1.40d+00 )
+
+!---------------------------------------------------------------------
+!   grid
+!---------------------------------------------------------------------
+      integer nx, ny, nz
+      integer nx0, ny0, nz0
+      integer ist, iend
+      integer jst, jend
+      integer ii1, ii2
+      integer ji1, ji2
+      integer ki1, ki2
+      double precision  dxi, deta, dzeta
+      double precision  tx1, tx2, tx3
+      double precision  ty1, ty2, ty3
+      double precision  tz1, tz2, tz3
+
+!---------------------------------------------------------------------
+!   dissipation
+!---------------------------------------------------------------------
+      double precision dx1, dx2, dx3, dx4, dx5
+      double precision dy1, dy2, dy3, dy4, dy5
+      double precision dz1, dz2, dz3, dz4, dz5
+      double precision dssp
+
+!---------------------------------------------------------------------
+!   field variables and residuals
+!---------------------------------------------------------------------
+      double precision, allocatable ::  &
+     &                 u   (:,:,:,:),  &
+     &                 rsd (:,:,:,:),  &
+     &                 frct(:,:,:,:),  &
+     &                 qs    (:,:,:),  &
+     &                 rho_i (:,:,:)
+
+      double precision flux(5,isiz1)
+!$omp threadprivate( flux )
+
+
+!---------------------------------------------------------------------
+!   output control parameters
+!---------------------------------------------------------------------
+      integer ipr, inorm
+
+!---------------------------------------------------------------------
+!   newton-raphson iteration control parameters
+!---------------------------------------------------------------------
+      integer itmax, invert
+      double precision  dt, omega, tolrsd(5),  &
+     &        rsdnm(5), errnm(5), frc, ttotal
+
+      double precision a(5,5,isiz1),  &
+     &                 b(5,5,isiz1),  &
+     &                 c(5,5,isiz1),  &
+     &                 d(5,5,isiz1)
+!$omp threadprivate( a, b, c, d )
+
+!---------------------------------------------------------------------
+!   coefficients of the exact solution
+!---------------------------------------------------------------------
+      double precision ce(5,13)
+
+!---------------------------------------------------------------------
+!   working arrays for surface integral
+!---------------------------------------------------------------------
+      double precision, allocatable ::  &
+     &                 phi1(:,:),  &
+     &                 phi2(:,:)
+
+!---------------------------------------------------------------------
+!   timers
+!---------------------------------------------------------------------
+      integer t_rhsx,t_rhsy,t_rhsz,t_rhs,t_jacld,t_blts,  &
+     &        t_jacu,t_buts,t_add,t_l2norm,t_last,t_total
+      parameter (t_total = 1)
+      parameter (t_rhsx = 2)
+      parameter (t_rhsy = 3)
+      parameter (t_rhsz = 4)
+      parameter (t_rhs = 5)
+      parameter (t_jacld = 6)
+      parameter (t_blts = 7)
+      parameter (t_jacu = 8)
+      parameter (t_buts = 9)
+      parameter (t_add = 10)
+      parameter (t_l2norm = 11)
+      parameter (t_last = 11)
+
+      logical timeron
+      double precision maxtime
+
+      end module lu_data
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine alloc_space
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+! allocate space dynamically for data arrays
+!---------------------------------------------------------------------
+
+      use lu_data
+      implicit none
+
+      integer ios
+
+!---------------------------------------------------------------------
+!   to improve cache performance, the i and j dimensions are padded
+!   by 1 when their size is even.
+!   Note: the corresponding array (called "v") in routines blts, buts,
+!   and l2norm is similarly padded
+!---------------------------------------------------------------------
+
+      allocate (  &
+     &          u   (5,isiz1/2*2+1,isiz2/2*2+1,isiz3),  &
+     &          rsd (5,isiz1/2*2+1,isiz2/2*2+1,isiz3),  &
+     &          frct(5,isiz1/2*2+1,isiz2/2*2+1,isiz3),  &
+     &          qs    (isiz1/2*2+1,isiz2/2*2+1,isiz3),  &
+     &          rho_i (isiz1/2*2+1,isiz2/2*2+1,isiz3),  &
+     &          phi1  (0:isiz2+1,0:isiz3+1),  &
+     &          phi2  (0:isiz2+1,0:isiz3+1),  &
+     &          stat = ios)
+
+      if (ios .ne. 0) then
+         write(*,*) 'Error encountered in allocating space'
+         stop
+      endif
+
+      return
+      end
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/LU/pintgr.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/LU/pintgr.f90
new file mode 100644
index 000000000..6ebcee240
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/LU/pintgr.f90
@@ -0,0 +1,192 @@
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine pintgr
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      use lu_data
+      implicit none
+
+!---------------------------------------------------------------------
+!  local variables
+!---------------------------------------------------------------------
+      integer i, j, k
+      integer ibeg, ifin, ifin1
+      integer jbeg, jfin, jfin1
+      double precision frc1, frc2, frc3
+
+
+
+!---------------------------------------------------------------------
+!   set up the sub-domains for integration in each processor
+!---------------------------------------------------------------------
+      ibeg = ii1
+      ifin = ii2
+      jbeg = ji1
+      jfin = ji2
+      ifin1 = ifin - 1
+      jfin1 = jfin - 1
+
+!$omp parallel default(shared) private(i,j,k)  &
+!$omp&  shared(ki1,ki2,ifin,ibeg,jfin,jbeg,ifin1,jfin1)
+
+!$omp do schedule(static) collapse(2)
+      do j = jbeg,jfin
+         do i = ibeg,ifin
+
+            k = ki1
+
+            phi1(i,j) = c2*(  u(5,i,j,k)  &
+     &           - 0.50d+00 * (  u(2,i,j,k) ** 2  &
+     &                         + u(3,i,j,k) ** 2  &
+     &                         + u(4,i,j,k) ** 2 )  &
+     &                        / u(1,i,j,k) )
+
+            k = ki2
+
+            phi2(i,j) = c2*(  u(5,i,j,k)  &
+     &           - 0.50d+00 * (  u(2,i,j,k) ** 2  &
+     &                         + u(3,i,j,k) ** 2  &
+     &                         + u(4,i,j,k) ** 2 )  &
+     &                        / u(1,i,j,k) )
+         end do
+      end do
+!$omp end do nowait
+
+
+!$omp single
+      frc1 = 0.0d+00
+!$omp end single
+
+!$omp do schedule(static) reduction(+:frc1) collapse(2)
+      do j = jbeg,jfin1
+         do i = ibeg, ifin1
+            frc1 = frc1 + (  phi1(i,j)  &
+     &                     + phi1(i+1,j)  &
+     &                     + phi1(i,j+1)  &
+     &                     + phi1(i+1,j+1)  &
+     &                     + phi2(i,j)  &
+     &                     + phi2(i+1,j)  &
+     &                     + phi2(i,j+1)  &
+     &                     + phi2(i+1,j+1) )
+         end do
+      end do
+!$omp end do
+
+
+!$omp master
+      frc1 = dxi * deta * frc1
+!$omp end master
+
+
+!$omp do schedule(static) collapse(2)
+      do k = ki1, ki2
+         do i = ibeg, ifin
+            phi1(i,k) = c2*(  u(5,i,jbeg,k)  &
+     &           - 0.50d+00 * (  u(2,i,jbeg,k) ** 2  &
+     &                         + u(3,i,jbeg,k) ** 2  &
+     &                         + u(4,i,jbeg,k) ** 2 )  &
+     &                        / u(1,i,jbeg,k) )
+         end do
+      end do
+!$omp end do nowait
+
+!$omp do schedule(static) collapse(2)
+      do k = ki1, ki2
+         do i = ibeg, ifin
+            phi2(i,k) = c2*(  u(5,i,jfin,k)  &
+     &           - 0.50d+00 * (  u(2,i,jfin,k) ** 2  &
+     &                         + u(3,i,jfin,k) ** 2  &
+     &                         + u(4,i,jfin,k) ** 2 )  &
+     &                        / u(1,i,jfin,k) )
+         end do
+      end do
+!$omp end do nowait
+
+
+!$omp single
+      frc2 = 0.0d+00
+!$omp end single
+
+!$omp do schedule(static) reduction(+:frc2) collapse(2)
+      do k = ki1, ki2-1
+         do i = ibeg, ifin1
+            frc2 = frc2 + (  phi1(i,k)  &
+     &                     + phi1(i+1,k)  &
+     &                     + phi1(i,k+1)  &
+     &                     + phi1(i+1,k+1)  &
+     &                     + phi2(i,k)  &
+     &                     + phi2(i+1,k)  &
+     &                     + phi2(i,k+1)  &
+     &                     + phi2(i+1,k+1) )
+         end do
+      end do
+!$omp end do
+
+
+!$omp master
+      frc2 = dxi * dzeta * frc2
+!$omp end master
+
+
+!$omp do schedule(static) collapse(2)
+      do k = ki1, ki2
+         do j = jbeg, jfin
+            phi1(j,k) = c2*(  u(5,ibeg,j,k)  &
+     &           - 0.50d+00 * (  u(2,ibeg,j,k) ** 2  &
+     &                         + u(3,ibeg,j,k) ** 2  &
+     &                         + u(4,ibeg,j,k) ** 2 )  &
+     &                        / u(1,ibeg,j,k) )
+         end do
+      end do
+!$omp end do nowait
+
+!$omp do schedule(static) collapse(2)
+      do k = ki1, ki2
+         do j = jbeg, jfin
+            phi2(j,k) = c2*(  u(5,ifin,j,k)  &
+     &           - 0.50d+00 * (  u(2,ifin,j,k) ** 2  &
+     &                         + u(3,ifin,j,k) ** 2  &
+     &                         + u(4,ifin,j,k) ** 2 )  &
+     &                        / u(1,ifin,j,k) )
+         end do
+      end do
+!$omp end do nowait
+
+
+!$omp single
+      frc3 = 0.0d+00
+!$omp end single
+
+!$omp do schedule(static) reduction(+:frc3) collapse(2)
+      do k = ki1, ki2-1
+         do j = jbeg, jfin1
+            frc3 = frc3 + (  phi1(j,k)  &
+     &                     + phi1(j+1,k)  &
+     &                     + phi1(j,k+1)  &
+     &                     + phi1(j+1,k+1)  &
+     &                     + phi2(j,k)  &
+     &                     + phi2(j+1,k)  &
+     &                     + phi2(j,k+1)  &
+     &                     + phi2(j+1,k+1) )
+         end do
+      end do
+!$omp end do
+
+
+!$omp master
+      frc3 = deta * dzeta * frc3
+!$omp end master
+!$omp end parallel
+
+      frc = 0.25d+00 * ( frc1 + frc2 + frc3 )
+!      write (*,1001) frc
+
+      return
+
+! 1001 format (//5x,'surface integral = ',1pe12.5//)
+
+      end
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/LU/read_input.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/LU/read_input.f90
new file mode 100644
index 000000000..b9f5f326b
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/LU/read_input.f90
@@ -0,0 +1,117 @@
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine read_input
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      use lu_data
+      implicit none
+
+      integer  fstatus
+!$    integer  omp_get_max_threads
+!$    external omp_get_max_threads
+
+
+!---------------------------------------------------------------------
+!    if input file does not exist, it uses defaults
+!       ipr = 1 for detailed progress output
+!       inorm = how often the norm is printed (once every inorm iterations)
+!       itmax = number of pseudo time steps
+!       dt = time step
+!       omega = over-relaxation factor for SSOR
+!       tolrsd = steady state residual tolerance levels
+!       nx, ny, nz = number of grid points in x, y, z directions
+!---------------------------------------------------------------------
+
+         write(*, 1000)
+
+         open (unit=3,file='inputlu.data',status='old',  &
+     &         access='sequential',form='formatted', iostat=fstatus)
+         if (fstatus .eq. 0) then
+
+            write(*, *) 'Reading from input file inputlu.data'
+
+            read (3,*)
+            read (3,*)
+            read (3,*) ipr, inorm
+            read (3,*)
+            read (3,*)
+            read (3,*) itmax
+            read (3,*)
+            read (3,*)
+            read (3,*) dt
+            read (3,*)
+            read (3,*)
+            read (3,*) omega
+            read (3,*)
+            read (3,*)
+            read (3,*) tolrsd(1),tolrsd(2),tolrsd(3),tolrsd(4),tolrsd(5)
+            read (3,*)
+            read (3,*)
+            read (3,*) nx0, ny0, nz0
+            close(3)
+         else
+            ipr = ipr_default
+            inorm = inorm_default
+            itmax = itmax_default
+            dt = dt_default
+            omega = omega_default
+            tolrsd(1) = tolrsd1_def
+            tolrsd(2) = tolrsd2_def
+            tolrsd(3) = tolrsd3_def
+            tolrsd(4) = tolrsd4_def
+            tolrsd(5) = tolrsd5_def
+            nx0 = isiz1
+            ny0 = isiz2
+            nz0 = isiz3
+         endif
+
+!---------------------------------------------------------------------
+!   check problem size
+!---------------------------------------------------------------------
+
+         if ( ( nx0 .lt. 5 ) .or.  &
+     &        ( ny0 .lt. 5 ) .or.  &
+     &        ( nz0 .lt. 5 ) ) then
+
+            write (*,2001)
+ 2001       format (5x,'PROBLEM SIZE IS TOO SMALL - ',  &
+     &           /5x,'SET EACH OF NX, NY AND NZ AT LEAST EQUAL TO 5')
+            stop
+
+         end if
+
+         if ( ( nx0 .gt. isiz1 ) .or.  &
+     &        ( ny0 .gt. isiz2 ) .or.  &
+     &        ( nz0 .gt. isiz3 ) ) then
+
+            write (*,2002)
+ 2002       format (5x,'PROBLEM SIZE IS TOO LARGE - ',  &
+     &           /5x,'NX, NY AND NZ SHOULD NOT EXCEED ',  &
+     &           /5x,'ISIZ1, ISIZ2 AND ISIZ3 RESPECTIVELY')
+            stop
+
+         end if
+
+
+         write(*, 1001) nx0, ny0, nz0
+         write(*, 1002) itmax
+!$       write(*, 1003) omp_get_max_threads()
+         write(*, *)
+
+
+ 1000 format(//,' NAS Parallel Benchmarks (NPB3.4-OMP)',  &
+     &          ' - LU Benchmark', /)
+ 1001    format(' Size: ', i4, 'x', i4, 'x', i4)
+ 1002    format(' Iterations:                  ', i5)
+ 1003    format(' Number of available threads: ', i5)
+         
+
+
+      return
+      end
+
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/LU/rhs.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/LU/rhs.f90
new file mode 100644
index 000000000..5fc6af432
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/LU/rhs.f90
@@ -0,0 +1,476 @@
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine rhs
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!   compute the right hand sides
+!---------------------------------------------------------------------
+
+      use lu_data
+      implicit none
+
+!---------------------------------------------------------------------
+!  local variables
+!---------------------------------------------------------------------
+      integer i, j, k, m
+      double precision  q
+      double precision  tmp, utmp(6,isiz3), rtmp(5,isiz3)
+      double precision  u21, u31, u41
+      double precision  u21i, u31i, u41i, u51i
+      double precision  u21j, u31j, u41j, u51j
+      double precision  u21k, u31k, u41k, u51k
+      double precision  u21im1, u31im1, u41im1, u51im1
+      double precision  u21jm1, u31jm1, u41jm1, u51jm1
+      double precision  u21km1, u31km1, u41km1, u51km1
+
+
+      if (timeron) call timer_start(t_rhs)
+
+!$omp parallel default(shared) private(i,j,k,m,q,tmp,utmp,rtmp,  &
+!$omp& u51im1,u41im1,u31im1,u21im1,u51i,u41i,u31i,u21i,u21,  &
+!$omp& u51jm1,u41jm1,u31jm1,u21jm1,u51j,u41j,u31j,u21j,u31,  &
+!$omp& u51km1,u41km1,u31km1,u21km1,u51k,u41k,u31k,u21k,u41)
+!$omp do schedule(static) collapse(2)
+      do k = 1, nz
+         do j = 1, ny
+            do i = 1, nx
+               do m = 1, 5
+                  rsd(m,i,j,k) = - frct(m,i,j,k)
+               end do
+               tmp = 1.0d+00 / u(1,i,j,k)
+               rho_i(i,j,k) = tmp
+               qs(i,j,k) = 0.50d+00 * (  u(2,i,j,k) * u(2,i,j,k)  &
+     &                         + u(3,i,j,k) * u(3,i,j,k)  &
+     &                         + u(4,i,j,k) * u(4,i,j,k) )  &
+     &                      * tmp
+            end do
+         end do
+      end do
+!$omp end do
+
+!$omp master
+      if (timeron) call timer_start(t_rhsx)
+!$omp end master
+!---------------------------------------------------------------------
+!   xi-direction flux differences
+!---------------------------------------------------------------------
+
+!$omp do schedule(static) collapse(2)
+      do k = 2, nz - 1
+         do j = jst, jend
+            do i = 1, nx
+               flux(1,i) = u(2,i,j,k)
+               u21 = u(2,i,j,k) * rho_i(i,j,k)
+
+               q = qs(i,j,k)
+
+               flux(2,i) = u(2,i,j,k) * u21 + c2 *  &
+     &                        ( u(5,i,j,k) - q )
+               flux(3,i) = u(3,i,j,k) * u21
+               flux(4,i) = u(4,i,j,k) * u21
+               flux(5,i) = ( c1 * u(5,i,j,k) - c2 * q ) * u21
+            end do
+
+            do i = ist, iend
+               do m = 1, 5
+                  rsd(m,i,j,k) =  rsd(m,i,j,k)  &
+     &                 - tx2 * ( flux(m,i+1) - flux(m,i-1) )
+               end do
+            end do
+
+            do i = ist, nx
+               tmp = rho_i(i,j,k)
+
+               u21i = tmp * u(2,i,j,k)
+               u31i = tmp * u(3,i,j,k)
+               u41i = tmp * u(4,i,j,k)
+               u51i = tmp * u(5,i,j,k)
+
+               tmp = rho_i(i-1,j,k)
+
+               u21im1 = tmp * u(2,i-1,j,k)
+               u31im1 = tmp * u(3,i-1,j,k)
+               u41im1 = tmp * u(4,i-1,j,k)
+               u51im1 = tmp * u(5,i-1,j,k)
+
+               flux(2,i) = (4.0d+00/3.0d+00) * tx3 * (u21i-u21im1)
+               flux(3,i) = tx3 * ( u31i - u31im1 )
+               flux(4,i) = tx3 * ( u41i - u41im1 )
+               flux(5,i) = 0.50d+00 * ( 1.0d+00 - c1*c5 )  &
+     &              * tx3 * ( ( u21i  **2 + u31i  **2 + u41i  **2 )  &
+     &                      - ( u21im1**2 + u31im1**2 + u41im1**2 ) )  &
+     &              + (1.0d+00/6.0d+00)  &
+     &              * tx3 * ( u21i**2 - u21im1**2 )  &
+     &              + c1 * c5 * tx3 * ( u51i - u51im1 )
+            end do
+
+            do i = ist, iend
+               rsd(1,i,j,k) = rsd(1,i,j,k)  &
+     &              + dx1 * tx1 * (            u(1,i-1,j,k)  &
+     &                             - 2.0d+00 * u(1,i,j,k)  &
+     &                             +           u(1,i+1,j,k) )
+               rsd(2,i,j,k) = rsd(2,i,j,k)  &
+     &          + tx3 * c3 * c4 * ( flux(2,i+1) - flux(2,i) )  &
+     &              + dx2 * tx1 * (            u(2,i-1,j,k)  &
+     &                             - 2.0d+00 * u(2,i,j,k)  &
+     &                             +           u(2,i+1,j,k) )
+               rsd(3,i,j,k) = rsd(3,i,j,k)  &
+     &          + tx3 * c3 * c4 * ( flux(3,i+1) - flux(3,i) )  &
+     &              + dx3 * tx1 * (            u(3,i-1,j,k)  &
+     &                             - 2.0d+00 * u(3,i,j,k)  &
+     &                             +           u(3,i+1,j,k) )
+               rsd(4,i,j,k) = rsd(4,i,j,k)  &
+     &          + tx3 * c3 * c4 * ( flux(4,i+1) - flux(4,i) )  &
+     &              + dx4 * tx1 * (            u(4,i-1,j,k)  &
+     &                             - 2.0d+00 * u(4,i,j,k)  &
+     &                             +           u(4,i+1,j,k) )
+               rsd(5,i,j,k) = rsd(5,i,j,k)  &
+     &          + tx3 * c3 * c4 * ( flux(5,i+1) - flux(5,i) )  &
+     &              + dx5 * tx1 * (            u(5,i-1,j,k)  &
+     &                             - 2.0d+00 * u(5,i,j,k)  &
+     &                             +           u(5,i+1,j,k) )
+            end do
+
+!---------------------------------------------------------------------
+!   Fourth-order dissipation
+!---------------------------------------------------------------------
+            do m = 1, 5
+               rsd(m,2,j,k) = rsd(m,2,j,k)  &
+     &           - dssp * ( + 5.0d+00 * u(m,2,j,k)  &
+     &                      - 4.0d+00 * u(m,3,j,k)  &
+     &                      +           u(m,4,j,k) )
+               rsd(m,3,j,k) = rsd(m,3,j,k)  &
+     &           - dssp * ( - 4.0d+00 * u(m,2,j,k)  &
+     &                      + 6.0d+00 * u(m,3,j,k)  &
+     &                      - 4.0d+00 * u(m,4,j,k)  &
+     &                      +           u(m,5,j,k) )
+            end do
+
+            do i = 4, nx - 3
+               do m = 1, 5
+                  rsd(m,i,j,k) = rsd(m,i,j,k)  &
+     &              - dssp * (            u(m,i-2,j,k)  &
+     &                        - 4.0d+00 * u(m,i-1,j,k)  &
+     &                        + 6.0d+00 * u(m,i,j,k)  &
+     &                        - 4.0d+00 * u(m,i+1,j,k)  &
+     &                        +           u(m,i+2,j,k) )
+               end do
+            end do
+
+
+            do m = 1, 5
+               rsd(m,nx-2,j,k) = rsd(m,nx-2,j,k)  &
+     &           - dssp * (             u(m,nx-4,j,k)  &
+     &                      - 4.0d+00 * u(m,nx-3,j,k)  &
+     &                      + 6.0d+00 * u(m,nx-2,j,k)  &
+     &                      - 4.0d+00 * u(m,nx-1,j,k)  )
+               rsd(m,nx-1,j,k) = rsd(m,nx-1,j,k)  &
+     &           - dssp * (             u(m,nx-3,j,k)  &
+     &                      - 4.0d+00 * u(m,nx-2,j,k)  &
+     &                      + 5.0d+00 * u(m,nx-1,j,k) )
+            end do
+
+         end do
+      end do
+!$omp end do
+!$omp master
+      if (timeron) call timer_stop(t_rhsx)
+
+      if (timeron) call timer_start(t_rhsy)
+!$omp end master
+!---------------------------------------------------------------------
+!   eta-direction flux differences
+!---------------------------------------------------------------------
+!$omp do schedule(static) collapse(2)
+      do k = 2, nz - 1
+         do i = ist, iend
+            do j = 1, ny
+               flux(1,j) = u(3,i,j,k)
+               u31 = u(3,i,j,k) * rho_i(i,j,k)
+
+               q = qs(i,j,k)
+
+               flux(2,j) = u(2,i,j,k) * u31 
+               flux(3,j) = u(3,i,j,k) * u31 + c2 * (u(5,i,j,k)-q)
+               flux(4,j) = u(4,i,j,k) * u31
+               flux(5,j) = ( c1 * u(5,i,j,k) - c2 * q ) * u31
+            end do
+
+            do j = jst, jend
+               do m = 1, 5
+                  rsd(m,i,j,k) =  rsd(m,i,j,k)  &
+     &                   - ty2 * ( flux(m,j+1) - flux(m,j-1) )
+               end do
+            end do
+
+            do j = jst, ny
+               tmp = rho_i(i,j,k)
+
+               u21j = tmp * u(2,i,j,k)
+               u31j = tmp * u(3,i,j,k)
+               u41j = tmp * u(4,i,j,k)
+               u51j = tmp * u(5,i,j,k)
+
+               tmp = rho_i(i,j-1,k)
+               u21jm1 = tmp * u(2,i,j-1,k)
+               u31jm1 = tmp * u(3,i,j-1,k)
+               u41jm1 = tmp * u(4,i,j-1,k)
+               u51jm1 = tmp * u(5,i,j-1,k)
+
+               flux(2,j) = ty3 * ( u21j - u21jm1 )
+               flux(3,j) = (4.0d+00/3.0d+00) * ty3 * (u31j-u31jm1)
+               flux(4,j) = ty3 * ( u41j - u41jm1 )
+               flux(5,j) = 0.50d+00 * ( 1.0d+00 - c1*c5 )  &
+     &              * ty3 * ( ( u21j  **2 + u31j  **2 + u41j  **2 )  &
+     &                      - ( u21jm1**2 + u31jm1**2 + u41jm1**2 ) )  &
+     &              + (1.0d+00/6.0d+00)  &
+     &              * ty3 * ( u31j**2 - u31jm1**2 )  &
+     &              + c1 * c5 * ty3 * ( u51j - u51jm1 )
+            end do
+
+            do j = jst, jend
+
+               rsd(1,i,j,k) = rsd(1,i,j,k)  &
+     &              + dy1 * ty1 * (            u(1,i,j-1,k)  &
+     &                             - 2.0d+00 * u(1,i,j,k)  &
+     &                             +           u(1,i,j+1,k) )
+
+               rsd(2,i,j,k) = rsd(2,i,j,k)  &
+     &          + ty3 * c3 * c4 * ( flux(2,j+1) - flux(2,j) )  &
+     &              + dy2 * ty1 * (            u(2,i,j-1,k)  &
+     &                             - 2.0d+00 * u(2,i,j,k)  &
+     &                             +           u(2,i,j+1,k) )
+
+               rsd(3,i,j,k) = rsd(3,i,j,k)  &
+     &          + ty3 * c3 * c4 * ( flux(3,j+1) - flux(3,j) )  &
+     &              + dy3 * ty1 * (            u(3,i,j-1,k)  &
+     &                             - 2.0d+00 * u(3,i,j,k)  &
+     &                             +           u(3,i,j+1,k) )
+
+               rsd(4,i,j,k) = rsd(4,i,j,k)  &
+     &          + ty3 * c3 * c4 * ( flux(4,j+1) - flux(4,j) )  &
+     &              + dy4 * ty1 * (            u(4,i,j-1,k)  &
+     &                             - 2.0d+00 * u(4,i,j,k)  &
+     &                             +           u(4,i,j+1,k) )
+
+               rsd(5,i,j,k) = rsd(5,i,j,k)  &
+     &          + ty3 * c3 * c4 * ( flux(5,j+1) - flux(5,j) )  &
+     &              + dy5 * ty1 * (            u(5,i,j-1,k)  &
+     &                             - 2.0d+00 * u(5,i,j,k)  &
+     &                             +           u(5,i,j+1,k) )
+
+            end do
+
+         end do
+      end do
+!$omp end do
+
+!---------------------------------------------------------------------
+!   fourth-order dissipation
+!---------------------------------------------------------------------
+!$omp do schedule(static) collapse(2)
+      do k = 2, nz - 1
+         do j = jst, jend
+            if (j .eq. 2) then
+               do i = ist, iend
+               do m = 1, 5
+                  rsd(m,i,2,k) = rsd(m,i,2,k)  &
+     &              - dssp * ( + 5.0d+00 * u(m,i,2,k)  &
+     &                      - 4.0d+00 * u(m,i,3,k)  &
+     &                      +           u(m,i,4,k) )
+               end do
+               end do
+
+            else if (j .eq. 3) then
+               do i = ist, iend
+               do m = 1, 5
+                  rsd(m,i,3,k) = rsd(m,i,3,k)  &
+     &              - dssp * ( - 4.0d+00 * u(m,i,2,k)  &
+     &                      + 6.0d+00 * u(m,i,3,k)  &
+     &                      - 4.0d+00 * u(m,i,4,k)  &
+     &                      +           u(m,i,5,k) )
+               end do
+               end do
+
+            else if (j .eq. ny-2) then
+               do i = ist, iend
+               do m = 1, 5
+                  rsd(m,i,ny-2,k) = rsd(m,i,ny-2,k)  &
+     &              - dssp * (          u(m,i,ny-4,k)  &
+     &                      - 4.0d+00 * u(m,i,ny-3,k)  &
+     &                      + 6.0d+00 * u(m,i,ny-2,k)  &
+     &                      - 4.0d+00 * u(m,i,ny-1,k)  )
+               end do
+               end do
+
+            else if (j .eq. ny-1) then
+               do i = ist, iend
+               do m = 1, 5
+                  rsd(m,i,ny-1,k) = rsd(m,i,ny-1,k)  &
+     &              - dssp * (          u(m,i,ny-3,k)  &
+     &                      - 4.0d+00 * u(m,i,ny-2,k)  &
+     &                      + 5.0d+00 * u(m,i,ny-1,k) )
+               end do
+               end do
+
+            else
+               do i = ist, iend
+               do m = 1, 5
+                  rsd(m,i,j,k) = rsd(m,i,j,k)  &
+     &              - dssp * (            u(m,i,j-2,k)  &
+     &                        - 4.0d+00 * u(m,i,j-1,k)  &
+     &                        + 6.0d+00 * u(m,i,j,k)  &
+     &                        - 4.0d+00 * u(m,i,j+1,k)  &
+     &                        +           u(m,i,j+2,k) )
+               end do
+               end do
+            endif
+
+         end do
+      end do
+!$omp end do
+!$omp master
+      if (timeron) call timer_stop(t_rhsy)
+
+      if (timeron) call timer_start(t_rhsz)
+!$omp end master
+!---------------------------------------------------------------------
+!   zeta-direction flux differences
+!---------------------------------------------------------------------
+!$omp do schedule(static) collapse(2)
+      do j = jst, jend
+         do i = ist, iend
+            do k = 1, nz
+               utmp(1,k) = u(1,i,j,k)
+               utmp(2,k) = u(2,i,j,k)
+               utmp(3,k) = u(3,i,j,k)
+               utmp(4,k) = u(4,i,j,k)
+               utmp(5,k) = u(5,i,j,k)
+               utmp(6,k) = rho_i(i,j,k)
+            end do
+            do k = 1, nz
+               flux(1,k) = utmp(4,k)
+               u41 = utmp(4,k) * utmp(6,k)
+
+               q = qs(i,j,k)
+
+               flux(2,k) = utmp(2,k) * u41 
+               flux(3,k) = utmp(3,k) * u41 
+               flux(4,k) = utmp(4,k) * u41 + c2 * (utmp(5,k)-q)
+               flux(5,k) = ( c1 * utmp(5,k) - c2 * q ) * u41
+            end do
+
+            do k = 2, nz - 1
+               do m = 1, 5
+                  rtmp(m,k) =  rsd(m,i,j,k)  &
+     &                - tz2 * ( flux(m,k+1) - flux(m,k-1) )
+               end do
+            end do
+
+            do k = 2, nz
+               tmp = utmp(6,k)
+
+               u21k = tmp * utmp(2,k)
+               u31k = tmp * utmp(3,k)
+               u41k = tmp * utmp(4,k)
+               u51k = tmp * utmp(5,k)
+
+               tmp = utmp(6,k-1)
+
+               u21km1 = tmp * utmp(2,k-1)
+               u31km1 = tmp * utmp(3,k-1)
+               u41km1 = tmp * utmp(4,k-1)
+               u51km1 = tmp * utmp(5,k-1)
+
+               flux(2,k) = tz3 * ( u21k - u21km1 )
+               flux(3,k) = tz3 * ( u31k - u31km1 )
+               flux(4,k) = (4.0d+00/3.0d+00) * tz3 * (u41k-u41km1)
+               flux(5,k) = 0.50d+00 * ( 1.0d+00 - c1*c5 )  &
+     &              * tz3 * ( ( u21k  **2 + u31k  **2 + u41k  **2 )  &
+     &                      - ( u21km1**2 + u31km1**2 + u41km1**2 ) )  &
+     &              + (1.0d+00/6.0d+00)  &
+     &              * tz3 * ( u41k**2 - u41km1**2 )  &
+     &              + c1 * c5 * tz3 * ( u51k - u51km1 )
+            end do
+
+            do k = 2, nz - 1
+               rtmp(1,k) = rtmp(1,k)  &
+     &              + dz1 * tz1 * (            utmp(1,k-1)  &
+     &                             - 2.0d+00 * utmp(1,k)  &
+     &                             +           utmp(1,k+1) )
+               rtmp(2,k) = rtmp(2,k)  &
+     &          + tz3 * c3 * c4 * ( flux(2,k+1) - flux(2,k) )  &
+     &              + dz2 * tz1 * (            utmp(2,k-1)  &
+     &                             - 2.0d+00 * utmp(2,k)  &
+     &                             +           utmp(2,k+1) )
+               rtmp(3,k) = rtmp(3,k)  &
+     &          + tz3 * c3 * c4 * ( flux(3,k+1) - flux(3,k) )  &
+     &              + dz3 * tz1 * (            utmp(3,k-1)  &
+     &                             - 2.0d+00 * utmp(3,k)  &
+     &                             +           utmp(3,k+1) )
+               rtmp(4,k) = rtmp(4,k)  &
+     &          + tz3 * c3 * c4 * ( flux(4,k+1) - flux(4,k) )  &
+     &              + dz4 * tz1 * (            utmp(4,k-1)  &
+     &                             - 2.0d+00 * utmp(4,k)  &
+     &                             +           utmp(4,k+1) )
+               rtmp(5,k) = rtmp(5,k)  &
+     &          + tz3 * c3 * c4 * ( flux(5,k+1) - flux(5,k) )  &
+     &              + dz5 * tz1 * (            utmp(5,k-1)  &
+     &                             - 2.0d+00 * utmp(5,k)  &
+     &                             +           utmp(5,k+1) )
+            end do
+
+!---------------------------------------------------------------------
+!   fourth-order dissipation
+!---------------------------------------------------------------------
+            do m = 1, 5
+               rsd(m,i,j,2) = rtmp(m,2)  &
+     &           - dssp * ( + 5.0d+00 * utmp(m,2)  &
+     &                      - 4.0d+00 * utmp(m,3)  &
+     &                      +           utmp(m,4) )
+               rsd(m,i,j,3) = rtmp(m,3)  &
+     &           - dssp * ( - 4.0d+00 * utmp(m,2)  &
+     &                      + 6.0d+00 * utmp(m,3)  &
+     &                      - 4.0d+00 * utmp(m,4)  &
+     &                      +           utmp(m,5) )
+            end do
+
+            do k = 4, nz - 3
+               do m = 1, 5
+                  rsd(m,i,j,k) = rtmp(m,k)  &
+     &              - dssp * (            utmp(m,k-2)  &
+     &                        - 4.0d+00 * utmp(m,k-1)  &
+     &                        + 6.0d+00 * utmp(m,k)  &
+     &                        - 4.0d+00 * utmp(m,k+1)  &
+     &                        +           utmp(m,k+2) )
+               end do
+            end do
+
+            do m = 1, 5
+               rsd(m,i,j,nz-2) = rtmp(m,nz-2)  &
+     &           - dssp * (             utmp(m,nz-4)  &
+     &                      - 4.0d+00 * utmp(m,nz-3)  &
+     &                      + 6.0d+00 * utmp(m,nz-2)  &
+     &                      - 4.0d+00 * utmp(m,nz-1)  )
+               rsd(m,i,j,nz-1) = rtmp(m,nz-1)  &
+     &           - dssp * (             utmp(m,nz-3)  &
+     &                      - 4.0d+00 * utmp(m,nz-2)  &
+     &                      + 5.0d+00 * utmp(m,nz-1) )
+            end do
+         end do
+      end do
+!$omp end do nowait
+!$omp master
+      if (timeron) call timer_stop(t_rhsz)
+!$omp end master
+!$omp end parallel
+
+      if (timeron) call timer_stop(t_rhs)
+
+      return
+      end
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/LU/setbv.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/LU/setbv.f90
new file mode 100644
index 000000000..6ca968252
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/LU/setbv.f90
@@ -0,0 +1,75 @@
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine setbv
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!   set the boundary values of dependent variables
+!---------------------------------------------------------------------
+
+      use lu_data
+      implicit none
+
+!---------------------------------------------------------------------
+!   local variables
+!---------------------------------------------------------------------
+      integer i, j, k, m
+      double precision temp1(5), temp2(5)
+
+!---------------------------------------------------------------------
+!   set the dependent variable values along the top and bottom faces
+!---------------------------------------------------------------------
+!$omp parallel default(shared) private(i,j,k,m,temp1,temp2)  &
+!$omp& shared(nx,ny,nz)
+!$omp do schedule(static) collapse(2)
+      do j = 1, ny
+         do i = 1, nx
+            call exact( i, j, 1, temp1 )
+            call exact( i, j, nz, temp2 )
+            do m = 1, 5
+               u( m, i, j, 1 ) = temp1(m)
+               u( m, i, j, nz ) = temp2(m)
+            end do
+         end do
+      end do
+!$omp end do
+
+!---------------------------------------------------------------------
+!   set the dependent variable values along north and south faces
+!---------------------------------------------------------------------
+!$omp do schedule(static) collapse(2)
+      do k = 1, nz
+         do i = 1, nx
+            call exact( i, 1, k, temp1 )
+            call exact( i, ny, k, temp2 )
+            do m = 1, 5
+               u( m, i, 1, k ) = temp1(m)
+               u( m, i, ny, k ) = temp2(m)
+            end do
+         end do
+      end do
+!$omp end do
+
+!---------------------------------------------------------------------
+!   set the dependent variable values along east and west faces
+!---------------------------------------------------------------------
+!$omp do schedule(static) collapse(2)
+      do k = 1, nz
+         do j = 1, ny
+            call exact( 1, j, k, temp1 )
+            call exact( nx, j, k, temp2 )
+            do m = 1, 5
+               u( m, 1, j, k ) = temp1(m)
+               u( m, nx, j, k ) = temp2(m)
+            end do
+         end do
+      end do
+!$omp end do nowait
+!$omp end parallel
+
+      return
+      end
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/LU/setcoeff.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/LU/setcoeff.f90
new file mode 100644
index 000000000..2175a4275
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/LU/setcoeff.f90
@@ -0,0 +1,151 @@
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine setcoeff
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      use lu_data
+      implicit none
+
+!---------------------------------------------------------------------
+!  local variables
+!---------------------------------------------------------------------
+
+
+!---------------------------------------------------------------------
+!   set up coefficients
+!---------------------------------------------------------------------
+      dxi = 1.0d+00 / ( nx0 - 1 )
+      deta = 1.0d+00 / ( ny0 - 1 )
+      dzeta = 1.0d+00 / ( nz0 - 1 )
+
+      tx1 = 1.0d+00 / ( dxi * dxi )
+      tx2 = 1.0d+00 / ( 2.0d+00 * dxi )
+      tx3 = 1.0d+00 / dxi
+
+      ty1 = 1.0d+00 / ( deta * deta )
+      ty2 = 1.0d+00 / ( 2.0d+00 * deta )
+      ty3 = 1.0d+00 / deta
+
+      tz1 = 1.0d+00 / ( dzeta * dzeta )
+      tz2 = 1.0d+00 / ( 2.0d+00 * dzeta )
+      tz3 = 1.0d+00 / dzeta
+
+!---------------------------------------------------------------------
+!   diffusion coefficients
+!---------------------------------------------------------------------
+      dx1 = 0.75d+00
+      dx2 = dx1
+      dx3 = dx1
+      dx4 = dx1
+      dx5 = dx1
+
+      dy1 = 0.75d+00
+      dy2 = dy1
+      dy3 = dy1
+      dy4 = dy1
+      dy5 = dy1
+
+      dz1 = 1.00d+00
+      dz2 = dz1
+      dz3 = dz1
+      dz4 = dz1
+      dz5 = dz1
+
+!---------------------------------------------------------------------
+!   fourth difference dissipation
+!---------------------------------------------------------------------
+      dssp = ( max (dx1, dy1, dz1 ) ) / 4.0d+00
+
+!---------------------------------------------------------------------
+!   coefficients of the exact solution to the first pde
+!---------------------------------------------------------------------
+      ce(1,1) = 2.0d+00
+      ce(1,2) = 0.0d+00
+      ce(1,3) = 0.0d+00
+      ce(1,4) = 4.0d+00
+      ce(1,5) = 5.0d+00
+      ce(1,6) = 3.0d+00
+      ce(1,7) = 5.0d-01
+      ce(1,8) = 2.0d-02
+      ce(1,9) = 1.0d-02
+      ce(1,10) = 3.0d-02
+      ce(1,11) = 5.0d-01
+      ce(1,12) = 4.0d-01
+      ce(1,13) = 3.0d-01
+
+!---------------------------------------------------------------------
+!   coefficients of the exact solution to the second pde
+!---------------------------------------------------------------------
+      ce(2,1) = 1.0d+00
+      ce(2,2) = 0.0d+00
+      ce(2,3) = 0.0d+00
+      ce(2,4) = 0.0d+00
+      ce(2,5) = 1.0d+00
+      ce(2,6) = 2.0d+00
+      ce(2,7) = 3.0d+00
+      ce(2,8) = 1.0d-02
+      ce(2,9) = 3.0d-02
+      ce(2,10) = 2.0d-02
+      ce(2,11) = 4.0d-01
+      ce(2,12) = 3.0d-01
+      ce(2,13) = 5.0d-01
+
+!---------------------------------------------------------------------
+!   coefficients of the exact solution to the third pde
+!---------------------------------------------------------------------
+      ce(3,1) = 2.0d+00
+      ce(3,2) = 2.0d+00
+      ce(3,3) = 0.0d+00
+      ce(3,4) = 0.0d+00
+      ce(3,5) = 0.0d+00
+      ce(3,6) = 2.0d+00
+      ce(3,7) = 3.0d+00
+      ce(3,8) = 4.0d-02
+      ce(3,9) = 3.0d-02
+      ce(3,10) = 5.0d-02
+      ce(3,11) = 3.0d-01
+      ce(3,12) = 5.0d-01
+      ce(3,13) = 4.0d-01
+
+!---------------------------------------------------------------------
+!   coefficients of the exact solution to the fourth pde
+!---------------------------------------------------------------------
+      ce(4,1) = 2.0d+00
+      ce(4,2) = 2.0d+00
+      ce(4,3) = 0.0d+00
+      ce(4,4) = 0.0d+00
+      ce(4,5) = 0.0d+00
+      ce(4,6) = 2.0d+00
+      ce(4,7) = 3.0d+00
+      ce(4,8) = 3.0d-02
+      ce(4,9) = 5.0d-02
+      ce(4,10) = 4.0d-02
+      ce(4,11) = 2.0d-01
+      ce(4,12) = 1.0d-01
+      ce(4,13) = 3.0d-01
+
+!---------------------------------------------------------------------
+!   coefficients of the exact solution to the fifth pde
+!---------------------------------------------------------------------
+      ce(5,1) = 5.0d+00
+      ce(5,2) = 4.0d+00
+      ce(5,3) = 3.0d+00
+      ce(5,4) = 2.0d+00
+      ce(5,5) = 1.0d-01
+      ce(5,6) = 4.0d-01
+      ce(5,7) = 3.0d-01
+      ce(5,8) = 5.0d-02
+      ce(5,9) = 4.0d-02
+      ce(5,10) = 3.0d-02
+      ce(5,11) = 1.0d-01
+      ce(5,12) = 3.0d-01
+      ce(5,13) = 2.0d-01
+
+      return
+      end
+
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/LU/setiv.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/LU/setiv.f90
new file mode 100644
index 000000000..0a3117e6b
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/LU/setiv.f90
@@ -0,0 +1,65 @@
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+      subroutine setiv
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!
+!   set the initial values of dependent variables based on tri-linear
+!   interpolation of boundary values in the computational space.
+!
+!---------------------------------------------------------------------
+
+      use lu_data
+      implicit none
+
+!---------------------------------------------------------------------
+!  local variables
+!---------------------------------------------------------------------
+      integer i, j, k, m
+      double precision  xi, eta, zeta
+      double precision  pxi, peta, pzeta
+      double precision  ue_1jk(5),ue_nx0jk(5),ue_i1k(5),  &
+     &        ue_iny0k(5),ue_ij1(5),ue_ijnz(5)
+
+
+!$omp parallel default(shared) private(i,j,k,m,pxi,peta,pzeta,  &
+!$omp& xi,eta,zeta,ue_ijnz,ue_ij1,ue_iny0k,ue_i1k,ue_nx0jk,ue_1jk)  &
+!$omp& shared(nx0,ny0,nz)
+!$omp do schedule(static) collapse(2)
+      do k = 2, nz - 1
+         do j = 2, ny - 1
+            zeta = ( dble (k-1) ) / (nz-1)
+            eta = ( dble (j-1) ) / (ny0-1)
+            do i = 2, nx - 1
+               xi = ( dble (i-1) ) / (nx0-1)
+               call exact (1,j,k,ue_1jk)
+               call exact (nx0,j,k,ue_nx0jk)
+               call exact (i,1,k,ue_i1k)
+               call exact (i,ny0,k,ue_iny0k)
+               call exact (i,j,1,ue_ij1)
+               call exact (i,j,nz,ue_ijnz)
+               do m = 1, 5
+                  pxi =   ( 1.0d+00 - xi ) * ue_1jk(m)  &
+     &                              + xi   * ue_nx0jk(m)
+                  peta =  ( 1.0d+00 - eta ) * ue_i1k(m)  &
+     &                              + eta   * ue_iny0k(m)
+                  pzeta = ( 1.0d+00 - zeta ) * ue_ij1(m)  &
+     &                              + zeta   * ue_ijnz(m)
+
+                  u( m, i, j, k ) = pxi + peta + pzeta  &
+     &                 - pxi * peta - peta * pzeta - pzeta * pxi  &
+     &                 + pxi * peta * pzeta
+
+               end do
+            end do
+         end do
+      end do
+!$omp end do nowait
+!$omp end parallel
+
+      return
+      end
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/LU/ssor.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/LU/ssor.f90
new file mode 100644
index 000000000..92cadb6a0
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/LU/ssor.f90
@@ -0,0 +1,285 @@
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine ssor(niter)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!   to perform pseudo-time stepping SSOR iterations
+!   for five nonlinear pde's.
+!---------------------------------------------------------------------
+
+      use lu_data
+      implicit none
+
+      integer niter
+
+!---------------------------------------------------------------------
+!  local variables
+!---------------------------------------------------------------------
+      integer i, j, k, m, n
+      integer istep
+      double precision  tmp, tmp2
+      double precision  delunm(5)
+
+      external timer_read
+      double precision timer_read
+
+ 
+!---------------------------------------------------------------------
+!   begin pseudo-time stepping iterations
+!---------------------------------------------------------------------
+      tmp = 1.0d+00 / ( omega * ( 2.0d+00 - omega ) ) 
+
+      do i = 1, t_last
+         call timer_clear(i)
+      end do
+
+!---------------------------------------------------------------------
+!   compute the steady-state residuals
+!---------------------------------------------------------------------
+      call rhs
+ 
+!---------------------------------------------------------------------
+!   compute the L2 norms of newton iteration residuals
+!---------------------------------------------------------------------
+      call l2norm( isiz1, isiz2, isiz3, nx0, ny0, nz0,  &
+     &             ist, iend, jst, jend,  &
+     &             rsd, rsdnm )
+
+ 
+      do i = 1, t_last
+         call timer_clear(i)
+      end do
+
+      if (niter > 1) then
+#ifdef M5_ANNOTATION
+         call m5_work_begin_interface
+#endif
+      endif
+
+      call timer_start(1)
+ 
+!---------------------------------------------------------------------
+!   the timestep loop
+!---------------------------------------------------------------------
+      do istep = 1, niter
+
+         if (mod ( istep, 20) .eq. 0 .or.  &
+     &         istep .eq. itmax .or.  &
+     &         istep .eq. 1) then
+            if (niter .gt. 1) write( *, 200) istep
+ 200        format(' Time step ', i4)
+         endif
+ 
+!---------------------------------------------------------------------
+!   perform SSOR iteration
+!---------------------------------------------------------------------
+!$omp parallel default(shared) private(i,j,k,m,tmp2)  &
+!$omp&  shared(ist,iend,jst,jend,nx,ny,nz,nx0,ny0,omega)
+
+!$omp master
+         if (timeron) call timer_start(t_rhs)
+!$omp end master
+         tmp2 = dt
+!$omp do schedule(static) collapse(2)
+         do k = 2, nz - 1
+            do j = jst, jend
+               do i = ist, iend
+                  do m = 1, 5
+                     rsd(m,i,j,k) = tmp2 * rsd(m,i,j,k)
+                  end do
+               end do
+            end do
+         end do
+!$omp end do nowait
+!$omp master
+         if (timeron) call timer_stop(t_rhs)
+
+         if (timeron) call timer_start(t_blts)
+!$omp end master
+
+         call sync_init( jend-jst )
+!$omp barrier
+
+         do k = 2, nz -1 
+
+            call sync_left( isiz1, isiz2, isiz3, rsd )
+!$omp do schedule(static)
+            do j = jst, jend
+
+!---------------------------------------------------------------------
+!   form the lower triangular part of the jacobian matrix
+!---------------------------------------------------------------------
+               call jacld(j, k)
+ 
+!---------------------------------------------------------------------
+!   perform the lower triangular solution
+!---------------------------------------------------------------------
+               call blts( isiz1, isiz2, isiz3,  &
+     &                    nx, ny, nz,  &
+     &                    omega,  &
+     &                    rsd,  &
+     &                    a, b, c, d,  &
+     &                    ist, iend, j, k )
+
+            end do
+!$omp end do nowait
+            call sync_right( isiz1, isiz2, isiz3, rsd )
+
+         end do
+!$omp barrier
+!$omp master
+         if (timeron) call timer_stop(t_blts)
+
+         if (timeron) call timer_start(t_buts)
+!$omp end master
+         do k = nz - 1, 2, -1
+
+            call sync_left( isiz1, isiz2, isiz3, rsd )
+!$omp do schedule(static)
+            do j = jend, jst, -1
+
+!---------------------------------------------------------------------
+!   form the strictly upper triangular part of the jacobian matrix
+!---------------------------------------------------------------------
+               call jacu(j, k)
+
+!---------------------------------------------------------------------
+!   perform the upper triangular solution
+!---------------------------------------------------------------------
+               call buts( isiz1, isiz2, isiz3,  &
+     &                    nx, ny, nz,  &
+     &                    omega,  &
+     &                    rsd,  &
+     &                    d, a, b, c,  &
+     &                    ist, iend, j, k )
+
+            end do
+!$omp end do nowait
+            call sync_right( isiz1, isiz2, isiz3, rsd )
+
+         end do
+!$omp barrier
+!$omp master
+         if (timeron) call timer_stop(t_buts)
+!$omp end master
+
+!---------------------------------------------------------------------
+!   update the variables
+!---------------------------------------------------------------------
+
+!$omp master
+         if (timeron) call timer_start(t_add)
+!$omp end master
+         tmp2 = tmp
+!$omp do schedule(static) collapse(2)
+         do k = 2, nz-1
+            do j = jst, jend
+               do i = ist, iend
+                  do m = 1, 5
+                     u( m, i, j, k ) = u( m, i, j, k )  &
+     &                    + tmp2 * rsd( m, i, j, k )
+                  end do
+               end do
+            end do
+         end do
+!$omp end do nowait
+!$omp master
+         if (timeron) call timer_stop(t_add)
+!$omp end master
+!$omp end parallel
+ 
+!---------------------------------------------------------------------
+!   compute the L2 norms of newton iteration corrections
+!---------------------------------------------------------------------
+         if ( mod ( istep, inorm ) .eq. 0 ) then
+            if (timeron) call timer_start(t_l2norm)
+            call l2norm( isiz1, isiz2, isiz3, nx0, ny0, nz0,  &
+     &                   ist, iend, jst, jend,  &
+     &                   rsd, delunm )
+            if (timeron) call timer_stop(t_l2norm)
+!            if ( ipr .eq. 1 ) then
+!                write (*,1006) ( delunm(m), m = 1, 5 )
+!            else if ( ipr .eq. 2 ) then
+!                write (*,'(i5,f15.6)') istep,delunm(5)
+!            end if
+         end if
+ 
+!---------------------------------------------------------------------
+!   compute the steady-state residuals
+!---------------------------------------------------------------------
+         call rhs
+ 
+!---------------------------------------------------------------------
+!   compute the L2 norms of newton iteration residuals
+!---------------------------------------------------------------------
+         if ( ( mod ( istep, inorm ) .eq. 0 ) .or.  &
+     &        ( istep .eq. itmax ) ) then
+            if (timeron) call timer_start(t_l2norm)
+            call l2norm( isiz1, isiz2, isiz3, nx0, ny0, nz0,  &
+     &                   ist, iend, jst, jend,  &
+     &                   rsd, rsdnm )
+            if (timeron) call timer_stop(t_l2norm)
+!            if ( ipr .eq. 1 ) then
+!                write (*,1007) ( rsdnm(m), m = 1, 5 )
+!            end if
+         end if
+
+!---------------------------------------------------------------------
+!   check the newton-iteration residuals against the tolerance levels
+!---------------------------------------------------------------------
+         if ( ( rsdnm(1) .lt. tolrsd(1) ) .and.  &
+     &        ( rsdnm(2) .lt. tolrsd(2) ) .and.  &
+     &        ( rsdnm(3) .lt. tolrsd(3) ) .and.  &
+     &        ( rsdnm(4) .lt. tolrsd(4) ) .and.  &
+     &        ( rsdnm(5) .lt. tolrsd(5) ) ) then
+!            if (ipr .eq. 1 ) then
+               write (*,1004) istep
+!            end if
+            go to 900
+         end if
+ 
+      end do
+  900 continue
+ 
+      call timer_stop(1)
+
+      if (niter > 1) then
+#ifdef M5_ANNOTATION
+         call m5_work_end_interface
+#endif
+      endif
+      maxtime= timer_read(1)
+ 
+
+
+      return
+      
+ 1001 format (1x/5x,'pseudo-time SSOR iteration no.=',i4/)
+ 1004 format (1x/1x,'convergence was achieved after ',i4,  &
+     &   ' pseudo-time steps' )
+ 1006 format (1x/1x,'RMS-norm of SSOR-iteration correction ',  &
+     & 'for first pde  = ',1pe12.5/,  &
+     & 1x,'RMS-norm of SSOR-iteration correction ',  &
+     & 'for second pde = ',1pe12.5/,  &
+     & 1x,'RMS-norm of SSOR-iteration correction ',  &
+     & 'for third pde  = ',1pe12.5/,  &
+     & 1x,'RMS-norm of SSOR-iteration correction ',  &
+     & 'for fourth pde = ',1pe12.5/,  &
+     & 1x,'RMS-norm of SSOR-iteration correction ',  &
+     & 'for fifth pde  = ',1pe12.5)
+ 1007 format (1x/1x,'RMS-norm of steady-state residual for ',  &
+     & 'first pde  = ',1pe12.5/,  &
+     & 1x,'RMS-norm of steady-state residual for ',  &
+     & 'second pde = ',1pe12.5/,  &
+     & 1x,'RMS-norm of steady-state residual for ',  &
+     & 'third pde  = ',1pe12.5/,  &
+     & 1x,'RMS-norm of steady-state residual for ',  &
+     & 'fourth pde = ',1pe12.5/,  &
+     & 1x,'RMS-norm of steady-state residual for ',  &
+     & 'fifth pde  = ',1pe12.5)
+ 
+      end
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/LU/ssor_doac.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/LU/ssor_doac.f90
new file mode 100644
index 000000000..a5924cb25
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/LU/ssor_doac.f90
@@ -0,0 +1,271 @@
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine ssor(niter)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!   to perform pseudo-time stepping SSOR iterations
+!   for five nonlinear pde's.
+!---------------------------------------------------------------------
+
+      use lu_data
+      implicit none
+
+      integer niter
+
+!---------------------------------------------------------------------
+!  local variables
+!---------------------------------------------------------------------
+      integer i, j, k, m, n
+      integer istep
+      double precision  tmp, tmp2
+      double precision  delunm(5)
+
+      external timer_read
+      double precision timer_read
+
+ 
+!---------------------------------------------------------------------
+!   begin pseudo-time stepping iterations
+!---------------------------------------------------------------------
+      tmp = 1.0d+00 / ( omega * ( 2.0d+00 - omega ) ) 
+
+      do i = 1, t_last
+         call timer_clear(i)
+      end do
+
+!---------------------------------------------------------------------
+!   compute the steady-state residuals
+!---------------------------------------------------------------------
+      call rhs
+ 
+!---------------------------------------------------------------------
+!   compute the L2 norms of newton iteration residuals
+!---------------------------------------------------------------------
+      call l2norm( isiz1, isiz2, isiz3, nx0, ny0, nz0,  &
+     &             ist, iend, jst, jend,  &
+     &             rsd, rsdnm )
+
+ 
+      do i = 1, t_last
+         call timer_clear(i)
+      end do
+
+#ifdef M5_ANNOTATION
+      call m5_work_begin_interface
+#endif
+
+      call timer_start(1)
+ 
+!---------------------------------------------------------------------
+!   the timestep loop
+!---------------------------------------------------------------------
+      do istep = 1, niter
+
+         if (mod ( istep, 20) .eq. 0 .or.  &
+     &         istep .eq. itmax .or.  &
+     &         istep .eq. 1) then
+            if (niter .gt. 1) write( *, 200) istep
+ 200        format(' Time step ', i4)
+         endif
+ 
+!---------------------------------------------------------------------
+!   perform SSOR iteration
+!---------------------------------------------------------------------
+!$omp parallel default(shared) private(i,j,k,m,tmp2)  &
+!$omp&  shared(ist,iend,jst,jend,nx,ny,nz,nx0,ny0,omega)
+!$omp master
+         if (timeron) call timer_start(t_rhs)
+!$omp end master
+         tmp2 = dt
+!$omp do schedule(static) collapse(2)
+         do k = 2, nz - 1
+            do j = jst, jend
+               do i = ist, iend
+                  do m = 1, 5
+                     rsd(m,i,j,k) = tmp2 * rsd(m,i,j,k)
+                  end do
+               end do
+            end do
+         end do
+!$omp end do
+!$omp master
+         if (timeron) call timer_stop(t_rhs)
+
+         if (timeron) call timer_start(t_blts)
+!$omp end master
+!$omp do schedule(static,1) ordered(2)
+         do k = 2, nz -1 
+            do j = jst, jend
+
+!$omp ordered depend(sink: k-1,j) depend(sink: k,j-1)
+!---------------------------------------------------------------------
+!   form the lower triangular part of the jacobian matrix
+!---------------------------------------------------------------------
+               call jacld(j, k)
+ 
+!---------------------------------------------------------------------
+!   perform the lower triangular solution
+!---------------------------------------------------------------------
+               call blts( isiz1, isiz2, isiz3,  &
+     &                    nx, ny, nz,  &
+     &                    omega,  &
+     &                    rsd,  &
+     &                    a, b, c, d,  &
+     &                    ist, iend, j, k )
+!$omp ordered depend(source)
+
+            end do
+         end do
+!$omp end do
+!$omp master
+         if (timeron) call timer_stop(t_blts)
+
+         if (timeron) call timer_start(t_buts)
+!$omp end master
+!$omp do schedule(static,1) ordered(2)
+         do k = nz - 1, 2, -1
+            do j = jend, jst, -1
+
+!$omp ordered depend(sink: k+1,j) depend(sink: k,j+1)
+!---------------------------------------------------------------------
+!   form the strictly upper triangular part of the jacobian matrix
+!---------------------------------------------------------------------
+               call jacu(j, k)
+
+!---------------------------------------------------------------------
+!   perform the upper triangular solution
+!---------------------------------------------------------------------
+               call buts( isiz1, isiz2, isiz3,  &
+     &                    nx, ny, nz,  &
+     &                    omega,  &
+     &                    rsd,  &
+     &                    d, a, b, c,  &
+     &                    ist, iend, j, k )
+!$omp ordered depend(source)
+
+            end do
+         end do
+!$omp end do
+!$omp master
+         if (timeron) call timer_stop(t_buts)
+!$omp end master
+
+!---------------------------------------------------------------------
+!   update the variables
+!---------------------------------------------------------------------
+
+!$omp master
+         if (timeron) call timer_start(t_add)
+!$omp end master
+         tmp2 = tmp
+!$omp do schedule(static) collapse(2)
+         do k = 2, nz-1
+            do j = jst, jend
+               do i = ist, iend
+                  do m = 1, 5
+                     u( m, i, j, k ) = u( m, i, j, k )  &
+     &                    + tmp2 * rsd( m, i, j, k )
+                  end do
+               end do
+            end do
+         end do
+!$omp end do nowait
+!$omp master
+         if (timeron) call timer_stop(t_add)
+!$omp end master
+!$omp end parallel
+ 
+!---------------------------------------------------------------------
+!   compute the L2 norms of newton iteration corrections
+!---------------------------------------------------------------------
+         if ( mod ( istep, inorm ) .eq. 0 ) then
+            if (timeron) call timer_start(t_l2norm)
+            call l2norm( isiz1, isiz2, isiz3, nx0, ny0, nz0,  &
+     &                   ist, iend, jst, jend,  &
+     &                   rsd, delunm )
+            if (timeron) call timer_stop(t_l2norm)
+!            if ( ipr .eq. 1 ) then
+!                write (*,1006) ( delunm(m), m = 1, 5 )
+!            else if ( ipr .eq. 2 ) then
+!                write (*,'(i5,f15.6)') istep,delunm(5)
+!            end if
+         end if
+ 
+!---------------------------------------------------------------------
+!   compute the steady-state residuals
+!---------------------------------------------------------------------
+         call rhs
+ 
+!---------------------------------------------------------------------
+!   compute the L2 norms of newton iteration residuals
+!---------------------------------------------------------------------
+         if ( ( mod ( istep, inorm ) .eq. 0 ) .or.  &
+     &        ( istep .eq. itmax ) ) then
+            if (timeron) call timer_start(t_l2norm)
+            call l2norm( isiz1, isiz2, isiz3, nx0, ny0, nz0,  &
+     &                   ist, iend, jst, jend,  &
+     &                   rsd, rsdnm )
+            if (timeron) call timer_stop(t_l2norm)
+!            if ( ipr .eq. 1 ) then
+!                write (*,1007) ( rsdnm(m), m = 1, 5 )
+!            end if
+         end if
+
+!---------------------------------------------------------------------
+!   check the newton-iteration residuals against the tolerance levels
+!---------------------------------------------------------------------
+         if ( ( rsdnm(1) .lt. tolrsd(1) ) .and.  &
+     &        ( rsdnm(2) .lt. tolrsd(2) ) .and.  &
+     &        ( rsdnm(3) .lt. tolrsd(3) ) .and.  &
+     &        ( rsdnm(4) .lt. tolrsd(4) ) .and.  &
+     &        ( rsdnm(5) .lt. tolrsd(5) ) ) then
+!            if (ipr .eq. 1 ) then
+               write (*,1004) istep
+!            end if
+            go to 900
+         end if
+ 
+      end do
+  900 continue
+ 
+      call timer_stop(1)
+
+#ifdef M5_ANNOTATION
+      call m5_work_end_interface
+#endif
+
+      maxtime= timer_read(1)
+ 
+
+
+      return
+      
+ 1001 format (1x/5x,'pseudo-time SSOR iteration no.=',i4/)
+ 1004 format (1x/1x,'convergence was achieved after ',i4,  &
+     &   ' pseudo-time steps' )
+ 1006 format (1x/1x,'RMS-norm of SSOR-iteration correction ',  &
+     & 'for first pde  = ',1pe12.5/,  &
+     & 1x,'RMS-norm of SSOR-iteration correction ',  &
+     & 'for second pde = ',1pe12.5/,  &
+     & 1x,'RMS-norm of SSOR-iteration correction ',  &
+     & 'for third pde  = ',1pe12.5/,  &
+     & 1x,'RMS-norm of SSOR-iteration correction ',  &
+     & 'for fourth pde = ',1pe12.5/,  &
+     & 1x,'RMS-norm of SSOR-iteration correction ',  &
+     & 'for fifth pde  = ',1pe12.5)
+ 1007 format (1x/1x,'RMS-norm of steady-state residual for ',  &
+     & 'first pde  = ',1pe12.5/,  &
+     & 1x,'RMS-norm of steady-state residual for ',  &
+     & 'second pde = ',1pe12.5/,  &
+     & 1x,'RMS-norm of steady-state residual for ',  &
+     & 'third pde  = ',1pe12.5/,  &
+     & 1x,'RMS-norm of steady-state residual for ',  &
+     & 'fourth pde = ',1pe12.5/,  &
+     & 1x,'RMS-norm of steady-state residual for ',  &
+     & 'fifth pde  = ',1pe12.5)
+ 
+      end
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/LU/ssor_hp.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/LU/ssor_hp.f90
new file mode 100644
index 000000000..77f71ffd8
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/LU/ssor_hp.f90
@@ -0,0 +1,271 @@
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine ssor(niter)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!   to perform pseudo-time stepping SSOR iterations
+!   for five nonlinear pde's.
+!---------------------------------------------------------------------
+
+      use lu_data
+      implicit none
+
+      integer niter
+
+!---------------------------------------------------------------------
+!  local variables
+!---------------------------------------------------------------------
+      integer i, j, k, m, n, l
+      integer istep, lst, lend
+      double precision  tmp, tmp2
+      double precision  delunm(5)
+
+      external timer_read
+      double precision timer_read
+
+ 
+!---------------------------------------------------------------------
+!   begin pseudo-time stepping iterations
+!---------------------------------------------------------------------
+      tmp = 1.0d+00 / ( omega * ( 2.0d+00 - omega ) ) 
+
+      do i = 1, t_last
+         call timer_clear(i)
+      end do
+
+!---------------------------------------------------------------------
+!   compute the steady-state residuals
+!---------------------------------------------------------------------
+      call rhs
+ 
+!---------------------------------------------------------------------
+!   compute the L2 norms of newton iteration residuals
+!---------------------------------------------------------------------
+      call l2norm( isiz1, isiz2, isiz3, nx0, ny0, nz0,  &
+     &             ist, iend, jst, jend,  &
+     &             rsd, rsdnm )
+
+ 
+      do i = 1, t_last
+         call timer_clear(i)
+      end do
+
+#ifdef M5_ANNOTATION
+      call m5_work_begin_interface
+#endif
+
+      call timer_start(1)
+ 
+!---------------------------------------------------------------------
+!   the timestep loop
+!---------------------------------------------------------------------
+      lst = 2 + jst
+      lend = nz - 1 + jend
+      do istep = 1, niter
+
+         if (mod ( istep, 20) .eq. 0 .or.  &
+     &         istep .eq. itmax .or.  &
+     &         istep .eq. 1) then
+            if (niter .gt. 1) write( *, 200) istep
+ 200        format(' Time step ', i4)
+         endif
+ 
+!---------------------------------------------------------------------
+!   perform SSOR iteration
+!---------------------------------------------------------------------
+!$omp parallel default(shared) private(i,j,k,m,l,tmp2)  &
+!$omp&  shared(ist,iend,jst,jend,nx,ny,nz,nx0,ny0,omega,lst,lend)
+!$omp master
+         if (timeron) call timer_start(t_rhs)
+!$omp end master
+         tmp2 = dt
+!$omp do schedule(static) collapse(2)
+         do k = 2, nz - 1
+            do j = jst, jend
+               do i = ist, iend
+                  do m = 1, 5
+                     rsd(m,i,j,k) = tmp2 * rsd(m,i,j,k)
+                  end do
+               end do
+            end do
+         end do
+!$omp end do
+!$omp master
+         if (timeron) call timer_stop(t_rhs)
+
+         if (timeron) call timer_start(t_blts)
+!$omp end master
+         do l = lst, lend
+!$omp do schedule(static)
+            do j = max(l-jend,jst), min(l-2,jend)
+               k = l - j
+
+!---------------------------------------------------------------------
+!   form the lower triangular part of the jacobian matrix
+!---------------------------------------------------------------------
+               call jacld(j, k)
+ 
+!---------------------------------------------------------------------
+!   perform the lower triangular solution
+!---------------------------------------------------------------------
+               call blts( isiz1, isiz2, isiz3,  &
+     &                    nx, ny, nz,  &
+     &                    omega,  &
+     &                    rsd,  &
+     &                    a, b, c, d,  &
+     &                    ist, iend, j, k )
+
+            end do
+!$omp end do
+         end do
+!$omp master
+         if (timeron) call timer_stop(t_blts)
+
+         if (timeron) call timer_start(t_buts)
+!$omp end master
+         do l = lend, lst, -1
+!$omp do schedule(static)
+            do j = min(l-2,jend), max(l-jend,jst), -1
+               k = l - j
+
+!---------------------------------------------------------------------
+!   form the strictly upper triangular part of the jacobian matrix
+!---------------------------------------------------------------------
+               call jacu(j, k)
+
+!---------------------------------------------------------------------
+!   perform the upper triangular solution
+!---------------------------------------------------------------------
+               call buts( isiz1, isiz2, isiz3,  &
+     &                    nx, ny, nz,  &
+     &                    omega,  &
+     &                    rsd,  &
+     &                    d, a, b, c,  &
+     &                    ist, iend, j, k )
+
+            end do
+!$omp end do
+         end do
+!$omp master
+         if (timeron) call timer_stop(t_buts)
+!$omp end master
+
+!---------------------------------------------------------------------
+!   update the variables
+!---------------------------------------------------------------------
+
+!$omp master
+         if (timeron) call timer_start(t_add)
+!$omp end master
+         tmp2 = tmp
+!$omp do schedule(static) collapse(2)
+         do k = 2, nz-1
+            do j = jst, jend
+               do i = ist, iend
+                  do m = 1, 5
+                     u( m, i, j, k ) = u( m, i, j, k )  &
+     &                    + tmp2 * rsd( m, i, j, k )
+                  end do
+               end do
+            end do
+         end do
+!$omp end do nowait
+!$omp master
+         if (timeron) call timer_stop(t_add)
+!$omp end master
+!$omp end parallel
+ 
+!---------------------------------------------------------------------
+!   compute the L2 norms of newton iteration corrections
+!---------------------------------------------------------------------
+         if ( mod ( istep, inorm ) .eq. 0 ) then
+            if (timeron) call timer_start(t_l2norm)
+            call l2norm( isiz1, isiz2, isiz3, nx0, ny0, nz0,  &
+     &                   ist, iend, jst, jend,  &
+     &                   rsd, delunm )
+            if (timeron) call timer_stop(t_l2norm)
+!            if ( ipr .eq. 1 ) then
+!                write (*,1006) ( delunm(m), m = 1, 5 )
+!            else if ( ipr .eq. 2 ) then
+!                write (*,'(i5,f15.6)') istep,delunm(5)
+!            end if
+         end if
+ 
+!---------------------------------------------------------------------
+!   compute the steady-state residuals
+!---------------------------------------------------------------------
+         call rhs
+ 
+!---------------------------------------------------------------------
+!   compute the L2 norms of newton iteration residuals
+!---------------------------------------------------------------------
+         if ( ( mod ( istep, inorm ) .eq. 0 ) .or.  &
+     &        ( istep .eq. itmax ) ) then
+            if (timeron) call timer_start(t_l2norm)
+            call l2norm( isiz1, isiz2, isiz3, nx0, ny0, nz0,  &
+     &                   ist, iend, jst, jend,  &
+     &                   rsd, rsdnm )
+            if (timeron) call timer_stop(t_l2norm)
+!            if ( ipr .eq. 1 ) then
+!                write (*,1007) ( rsdnm(m), m = 1, 5 )
+!            end if
+         end if
+
+!---------------------------------------------------------------------
+!   check the newton-iteration residuals against the tolerance levels
+!---------------------------------------------------------------------
+         if ( ( rsdnm(1) .lt. tolrsd(1) ) .and.  &
+     &        ( rsdnm(2) .lt. tolrsd(2) ) .and.  &
+     &        ( rsdnm(3) .lt. tolrsd(3) ) .and.  &
+     &        ( rsdnm(4) .lt. tolrsd(4) ) .and.  &
+     &        ( rsdnm(5) .lt. tolrsd(5) ) ) then
+!            if (ipr .eq. 1 ) then
+               write (*,1004) istep
+!            end if
+            go to 900
+         end if
+ 
+      end do
+  900 continue
+ 
+      call timer_stop(1)
+
+#ifdef M5_ANNOTATION
+      call m5_work_end_interface
+#endif
+
+      maxtime= timer_read(1)
+ 
+
+
+      return
+      
+ 1001 format (1x/5x,'pseudo-time SSOR iteration no.=',i4/)
+ 1004 format (1x/1x,'convergence was achieved after ',i4,  &
+     &   ' pseudo-time steps' )
+ 1006 format (1x/1x,'RMS-norm of SSOR-iteration correction ',  &
+     & 'for first pde  = ',1pe12.5/,  &
+     & 1x,'RMS-norm of SSOR-iteration correction ',  &
+     & 'for second pde = ',1pe12.5/,  &
+     & 1x,'RMS-norm of SSOR-iteration correction ',  &
+     & 'for third pde  = ',1pe12.5/,  &
+     & 1x,'RMS-norm of SSOR-iteration correction ',  &
+     & 'for fourth pde = ',1pe12.5/,  &
+     & 1x,'RMS-norm of SSOR-iteration correction ',  &
+     & 'for fifth pde  = ',1pe12.5)
+ 1007 format (1x/1x,'RMS-norm of steady-state residual for ',  &
+     & 'first pde  = ',1pe12.5/,  &
+     & 1x,'RMS-norm of steady-state residual for ',  &
+     & 'second pde = ',1pe12.5/,  &
+     & 1x,'RMS-norm of steady-state residual for ',  &
+     & 'third pde  = ',1pe12.5/,  &
+     & 1x,'RMS-norm of steady-state residual for ',  &
+     & 'fourth pde = ',1pe12.5/,  &
+     & 1x,'RMS-norm of steady-state residual for ',  &
+     & 'fifth pde  = ',1pe12.5)
+ 
+      end
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/LU/syncs.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/LU/syncs.f90
new file mode 100644
index 000000000..ef74c21b8
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/LU/syncs.f90
@@ -0,0 +1,127 @@
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+!
+!  syncs module
+!
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      module syncs
+
+      use lu_data, only : isiz2
+
+!---------------------------------------------------------------------
+!  Flags used for thread synchronization for pipeline operation
+!---------------------------------------------------------------------
+
+      integer padim
+      parameter (padim=16)
+      integer isync(padim,0:isiz2), mthreadnum, iam
+!$omp threadprivate( mthreadnum, iam )
+
+      end module syncs
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine sync_init( jdim )
+
+!---------------------------------------------------------------------
+!   Initialize sync-related variables
+!---------------------------------------------------------------------
+
+      use syncs
+      implicit none
+
+      integer jdim
+
+!$    integer, external :: omp_get_num_threads, omp_get_thread_num
+
+      mthreadnum = 0
+!$    mthreadnum = omp_get_num_threads() - 1
+      if (mthreadnum .gt. jdim) mthreadnum = jdim
+      iam = 0
+!$    iam = omp_get_thread_num()
+      if (iam .le. mthreadnum) isync(1,iam) = 0
+
+      return
+      end
+
+!---------------------------------------------------------------------
+!   Thread synchronization for pipeline operation
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine sync_left( ldmx, ldmy, ldmz, v )
+
+!---------------------------------------------------------------------
+!   Thread synchronization for pipeline operation
+!---------------------------------------------------------------------
+
+      use syncs
+      implicit none
+
+      integer ldmx, ldmy, ldmz
+      double precision  v( 5, ldmx/2*2+1, ldmy/2*2+1, ldmz)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      integer neigh, iv
+
+
+      if (iam .gt. 0 .and. iam .le. mthreadnum) then
+         neigh = iam - 1
+!$omp atomic read
+         iv = isync(1,neigh)
+         do while (iv .eq. 0)
+!$omp atomic read
+            iv = isync(1,neigh)
+         end do
+!$omp atomic write
+         isync(1,neigh) = 0
+      endif
+!$omp flush(isync,v)
+
+
+      return
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine sync_right( ldmx, ldmy, ldmz, v )
+
+!---------------------------------------------------------------------
+!   Thread synchronization for pipeline operation
+!---------------------------------------------------------------------
+
+      use syncs
+      implicit none
+
+      integer ldmx, ldmy, ldmz
+      double precision  v( 5, ldmx/2*2+1, ldmy/2*2+1, ldmz)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      integer iv
+
+
+!$omp flush(isync,v)
+      if (iam .lt. mthreadnum) then
+!$omp atomic read
+         iv = isync(1,iam)
+         do while (iv .eq. 1)
+!$omp atomic read
+            iv = isync(1,iam)
+         end do
+!$omp atomic write
+         isync(1,iam) = 1
+      endif
+
+
+      return
+      end
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/LU/verify.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/LU/verify.f90
new file mode 100644
index 000000000..f2d976b07
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/LU/verify.f90
@@ -0,0 +1,448 @@
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+        subroutine verify(xcr, xce, xci, class, verified)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!  verification routine                         
+!---------------------------------------------------------------------
+
+        use, intrinsic :: ieee_arithmetic, only : ieee_is_nan
+
+        use lu_data
+
+        implicit none
+
+        double precision xcr(5), xce(5), xci
+        double precision xcrref(5),xceref(5),xciref,  &
+     &                   xcrdif(5),xcedif(5),xcidif,  &
+     &                   epsilon, dtref
+        integer m
+        character class
+        logical verified
+
+!---------------------------------------------------------------------
+!   tolerance level
+!---------------------------------------------------------------------
+        epsilon = 1.0d-08
+
+        class = 'U'
+        verified = .true.
+
+        do m = 1,5
+           xcrref(m) = 1.0d+00
+           xceref(m) = 1.0d+00
+        end do
+        xciref = 1.0d+00
+
+        if ( (nx0  .eq. 12     ) .and.  &
+     &       (ny0  .eq. 12     ) .and.  &
+     &       (nz0  .eq. 12     ) .and.  &
+     &       (itmax   .eq. 50    ))  then
+
+           class = 'S'
+           dtref = 5.0d-1
+!---------------------------------------------------------------------
+!   Reference values of RMS-norms of residual, for the (12X12X12) grid,
+!   after 50 time steps, with  DT = 5.0d-01
+!---------------------------------------------------------------------
+         xcrref(1) = 1.6196343210976702d-02
+         xcrref(2) = 2.1976745164821318d-03
+         xcrref(3) = 1.5179927653399185d-03
+         xcrref(4) = 1.5029584435994323d-03
+         xcrref(5) = 3.4264073155896461d-02
+
+!---------------------------------------------------------------------
+!   Reference values of RMS-norms of solution error, for the (12X12X12) grid,
+!   after 50 time steps, with  DT = 5.0d-01
+!---------------------------------------------------------------------
+         xceref(1) = 6.4223319957960924d-04
+         xceref(2) = 8.4144342047347926d-05
+         xceref(3) = 5.8588269616485186d-05
+         xceref(4) = 5.8474222595157350d-05
+         xceref(5) = 1.3103347914111294d-03
+
+!---------------------------------------------------------------------
+!   Reference value of surface integral, for the (12X12X12) grid,
+!   after 50 time steps, with DT = 5.0d-01
+!---------------------------------------------------------------------
+         xciref = 7.8418928865937083d+00
+
+
+        elseif ( (nx0 .eq. 33) .and.  &
+     &           (ny0 .eq. 33) .and.  &
+     &           (nz0 .eq. 33) .and.  &
+     &           (itmax  .eq. 300) ) then
+
+           class = 'W'   !SPEC95fp size
+           dtref = 1.5d-3
+!---------------------------------------------------------------------
+!   Reference values of RMS-norms of residual, for the (33X33X33) grid,
+!   after 300 time steps, with  DT = 1.5d-3
+!---------------------------------------------------------------------
+           xcrref(1) =   0.1236511638192d+02
+           xcrref(2) =   0.1317228477799d+01
+           xcrref(3) =   0.2550120713095d+01
+           xcrref(4) =   0.2326187750252d+01
+           xcrref(5) =   0.2826799444189d+02
+
+
+!---------------------------------------------------------------------
+!   Reference values of RMS-norms of solution error, for the (33X33X33) grid,
+!---------------------------------------------------------------------
+           xceref(1) =   0.4867877144216d+00
+           xceref(2) =   0.5064652880982d-01
+           xceref(3) =   0.9281818101960d-01
+           xceref(4) =   0.8570126542733d-01
+           xceref(5) =   0.1084277417792d+01
+
+
+!---------------------------------------------------------------------
+!   Reference value of surface integral, for the (33X33X33) grid,
+!   after 300 time steps, with  DT = 1.5d-3
+!---------------------------------------------------------------------
+           xciref    =   0.1161399311023d+02
+
+        elseif ( (nx0 .eq. 64) .and.  &
+     &           (ny0 .eq. 64) .and.  &
+     &           (nz0 .eq. 64) .and.  &
+     &           (itmax  .eq. 250) ) then
+
+           class = 'A'
+           dtref = 2.0d+0
+!---------------------------------------------------------------------
+!   Reference values of RMS-norms of residual, for the (64X64X64) grid,
+!   after 250 time steps, with  DT = 2.0d+00
+!---------------------------------------------------------------------
+         xcrref(1) = 7.7902107606689367d+02
+         xcrref(2) = 6.3402765259692870d+01
+         xcrref(3) = 1.9499249727292479d+02
+         xcrref(4) = 1.7845301160418537d+02
+         xcrref(5) = 1.8384760349464247d+03
+
+!---------------------------------------------------------------------
+!   Reference values of RMS-norms of solution error, for the (64X64X64) grid,
+!   after 250 time steps, with  DT = 2.0d+00
+!---------------------------------------------------------------------
+         xceref(1) = 2.9964085685471943d+01
+         xceref(2) = 2.8194576365003349d+00
+         xceref(3) = 7.3473412698774742d+00
+         xceref(4) = 6.7139225687777051d+00
+         xceref(5) = 7.0715315688392578d+01
+
+!---------------------------------------------------------------------
+!   Reference value of surface integral, for the (64X64X64) grid,
+!   after 250 time steps, with DT = 2.0d+00
+!---------------------------------------------------------------------
+         xciref = 2.6030925604886277d+01
+
+
+        elseif ( (nx0 .eq. 102) .and.  &
+     &           (ny0 .eq. 102) .and.  &
+     &           (nz0 .eq. 102) .and.  &
+     &           (itmax  .eq. 250) ) then
+
+           class = 'B'
+           dtref = 2.0d+0
+
+!---------------------------------------------------------------------
+!   Reference values of RMS-norms of residual, for the (102X102X102) grid,
+!   after 250 time steps, with  DT = 2.0d+00
+!---------------------------------------------------------------------
+         xcrref(1) = 3.5532672969982736d+03
+         xcrref(2) = 2.6214750795310692d+02
+         xcrref(3) = 8.8333721850952190d+02
+         xcrref(4) = 7.7812774739425265d+02
+         xcrref(5) = 7.3087969592545314d+03
+
+!---------------------------------------------------------------------
+!   Reference values of RMS-norms of solution error, for the (102X102X102) 
+!   grid, after 250 time steps, with  DT = 2.0d+00
+!---------------------------------------------------------------------
+         xceref(1) = 1.1401176380212709d+02
+         xceref(2) = 8.1098963655421574d+00
+         xceref(3) = 2.8480597317698308d+01
+         xceref(4) = 2.5905394567832939d+01
+         xceref(5) = 2.6054907504857413d+02
+
+!---------------------------------------------------------------------
+!   Reference value of surface integral, for the (102X102X102) grid,
+!   after 250 time steps, with DT = 2.0d+00
+!---------------------------------------------------------------------
+         xciref = 4.7887162703308227d+01
+
+        elseif ( (nx0 .eq. 162) .and.  &
+     &           (ny0 .eq. 162) .and.  &
+     &           (nz0 .eq. 162) .and.  &
+     &           (itmax  .eq. 250) ) then
+
+           class = 'C'
+           dtref = 2.0d+0
+
+!---------------------------------------------------------------------
+!   Reference values of RMS-norms of residual, for the (162X162X162) grid,
+!   after 250 time steps, with  DT = 2.0d+00
+!---------------------------------------------------------------------
+         xcrref(1) = 1.03766980323537846d+04
+         xcrref(2) = 8.92212458801008552d+02
+         xcrref(3) = 2.56238814582660871d+03
+         xcrref(4) = 2.19194343857831427d+03
+         xcrref(5) = 1.78078057261061185d+04
+
+!---------------------------------------------------------------------
+!   Reference values of RMS-norms of solution error, for the (162X162X162) 
+!   grid, after 250 time steps, with  DT = 2.0d+00
+!---------------------------------------------------------------------
+         xceref(1) = 2.15986399716949279d+02
+         xceref(2) = 1.55789559239863600d+01
+         xceref(3) = 5.41318863077207766d+01
+         xceref(4) = 4.82262643154045421d+01
+         xceref(5) = 4.55902910043250358d+02
+
+!---------------------------------------------------------------------
+!   Reference value of surface integral, for the (162X162X162) grid,
+!   after 250 time steps, with DT = 2.0d+00
+!---------------------------------------------------------------------
+         xciref = 6.66404553572181300d+01
+
+!---------------------------------------------------------------------
+!   Reference value of surface integral, for the (162X162X162) grid,
+!   after 250 time steps, with DT = 2.0d+00
+!---------------------------------------------------------------------
+         xciref = 6.66404553572181300d+01
+
+        elseif ( (nx0 .eq. 408) .and.  &
+     &           (ny0 .eq. 408) .and.  &
+     &           (nz0 .eq. 408) .and.  &
+     &           (itmax  .eq. 300) ) then
+
+           class = 'D'
+           dtref = 1.0d+0
+
+!---------------------------------------------------------------------
+!   Reference values of RMS-norms of residual, for the (408X408X408) grid,
+!   after 300 time steps, with  DT = 1.0d+00
+!---------------------------------------------------------------------
+         xcrref(1) = 0.4868417937025d+05
+         xcrref(2) = 0.4696371050071d+04
+         xcrref(3) = 0.1218114549776d+05 
+         xcrref(4) = 0.1033801493461d+05
+         xcrref(5) = 0.7142398413817d+05
+
+!---------------------------------------------------------------------
+!   Reference values of RMS-norms of solution error, for the (408X408X408) 
+!   grid, after 300 time steps, with  DT = 1.0d+00
+!---------------------------------------------------------------------
+         xceref(1) = 0.3752393004482d+03
+         xceref(2) = 0.3084128893659d+02
+         xceref(3) = 0.9434276905469d+02
+         xceref(4) = 0.8230686681928d+02
+         xceref(5) = 0.7002620636210d+03
+
+!---------------------------------------------------------------------
+!   Reference value of surface integral, for the (408X408X408) grid,
+!   after 300 time steps, with DT = 1.0d+00
+!---------------------------------------------------------------------
+         xciref =    0.8334101392503d+02
+
+        elseif ( (nx0 .eq. 1020) .and.  &
+     &           (ny0 .eq. 1020) .and.  &
+     &           (nz0 .eq. 1020) .and.  &
+     &           (itmax  .eq. 300) ) then
+
+           class = 'E'
+           dtref = 0.5d+0
+
+!---------------------------------------------------------------------
+!   Reference values of RMS-norms of residual, for the (1020X1020X1020) grid,
+!   after 300 time steps, with  DT = 0.5d+00
+!---------------------------------------------------------------------
+         xcrref(1) = 0.2099641687874d+06
+         xcrref(2) = 0.2130403143165d+05
+         xcrref(3) = 0.5319228789371d+05 
+         xcrref(4) = 0.4509761639833d+05
+         xcrref(5) = 0.2932360006590d+06
+
+!---------------------------------------------------------------------
+!   Reference values of RMS-norms of solution error, for the (1020X1020X1020) 
+!   grid, after 300 time steps, with  DT = 0.5d+00
+!---------------------------------------------------------------------
+         xceref(1) = 0.4800572578333d+03
+         xceref(2) = 0.4221993400184d+02
+         xceref(3) = 0.1210851906824d+03
+         xceref(4) = 0.1047888986770d+03
+         xceref(5) = 0.8363028257389d+03
+
+!---------------------------------------------------------------------
+!   Reference value of surface integral, for the (1020X1020X1020) grid,
+!   after 300 time steps, with DT = 0.5d+00
+!---------------------------------------------------------------------
+         xciref =    0.9512163272273d+02
+
+        elseif ( (nx0 .eq. 2560) .and.  &
+     &           (ny0 .eq. 2560) .and.  &
+     &           (nz0 .eq. 2560) .and.  &
+     &           (itmax  .eq. 300) ) then
+
+           class = 'F'
+           dtref = 0.2d+0
+
+!---------------------------------------------------------------------
+!   Reference values of RMS-norms of residual, for the (2560X2560X2560) grid,
+!   after 300 time steps, with  DT = 0.2d+00
+!---------------------------------------------------------------------
+         xcrref(1) = 0.8505125358152d+06
+         xcrref(2) = 0.8774655318044d+05
+         xcrref(3) = 0.2167258198851d+06
+         xcrref(4) = 0.1838245257371d+06
+         xcrref(5) = 0.1175556512415d+07
+
+!---------------------------------------------------------------------
+!   Reference values of RMS-norms of solution error, for the (2560X2560X2560)
+!   grid, after 300 time steps, with  DT = 0.2d+00
+!---------------------------------------------------------------------
+         xceref(1) = 0.5293914132486d+03
+         xceref(2) = 0.4784861621068d+02
+         xceref(3) = 0.1337701281659d+03
+         xceref(4) = 0.1154215049655d+03
+         xceref(5) = 0.8956266851467d+03
+
+!---------------------------------------------------------------------
+!   Reference value of surface integral, for the (2560X2560X2560) grid,
+!   after 300 time steps, with DT = 0.2d+00
+!---------------------------------------------------------------------
+         xciref =    0.1002509436546d+03
+
+        else
+           verified = .FALSE.
+        endif
+
+!---------------------------------------------------------------------
+!    verification test for residuals if gridsize is one of 
+!    the defined grid sizes above (class .ne. 'U')
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!    Compute the difference of solution values and the known reference values.
+!---------------------------------------------------------------------
+        do m = 1, 5
+           
+           xcrdif(m) = dabs((xcr(m)-xcrref(m))/xcrref(m)) 
+           xcedif(m) = dabs((xce(m)-xceref(m))/xceref(m))
+           
+        enddo
+        xcidif = dabs((xci - xciref)/xciref)
+
+
+!---------------------------------------------------------------------
+!    Output the comparison of computed results to known cases.
+!---------------------------------------------------------------------
+
+        if (class .ne. 'U') then
+           write(*, 1990) class
+ 1990      format(/, ' Verification being performed for class ', a)
+           write (*,2000) epsilon
+ 2000      format(' Accuracy setting for epsilon = ', E20.13)
+           verified = (dabs(dt-dtref) .le. epsilon)
+           if (.not.verified) then
+              class = 'U'
+              write (*,1000) dtref
+ 1000         format(' DT does not match the reference value of ',  &
+     &                 E15.8)
+           endif
+        else 
+           write(*, 1995)
+ 1995      format(' Unknown class')
+        endif
+
+
+        if (class .ne. 'U') then
+           write (*, 2001) 
+        else
+           write (*, 2005)
+        endif
+
+ 2001   format(' Comparison of RMS-norms of residual')
+ 2005   format(' RMS-norms of residual')
+        do m = 1, 5
+           if (class .eq. 'U') then
+              write(*, 2015) m, xcr(m)
+           else if ((.not.ieee_is_nan(xcrdif(m))) .and.  &
+     &              xcrdif(m) .le. epsilon) then
+              write (*,2011) m,xcr(m),xcrref(m),xcrdif(m)
+           else 
+              verified = .false.
+              write (*,2010) m,xcr(m),xcrref(m),xcrdif(m)
+           endif
+        enddo
+
+        if (class .ne. 'U') then
+           write (*,2002)
+        else
+           write (*,2006)
+        endif
+ 2002   format(' Comparison of RMS-norms of solution error')
+ 2006   format(' RMS-norms of solution error')
+        
+        do m = 1, 5
+           if (class .eq. 'U') then
+              write(*, 2015) m, xce(m)
+           else if ((.not.ieee_is_nan(xcedif(m))) .and.  &
+     &              xcedif(m) .le. epsilon) then
+              write (*,2011) m,xce(m),xceref(m),xcedif(m)
+           else
+              verified = .false.
+              write (*,2010) m,xce(m),xceref(m),xcedif(m)
+           endif
+        enddo
+        
+ 2010   format(' FAILURE: ', i2, 2x, E20.13, E20.13, E20.13)
+ 2011   format('          ', i2, 2x, E20.13, E20.13, E20.13)
+ 2015   format('          ', i2, 2x, E20.13)
+        
+        if (class .ne. 'U') then
+           write (*,2025)
+        else
+           write (*,2026)
+        endif
+ 2025   format(' Comparison of surface integral')
+ 2026   format(' Surface integral')
+
+
+        if (class .eq. 'U') then
+           write(*, 2030) xci
+        else if ((.not.ieee_is_nan(xcidif)) .and.  &
+     &           xcidif .le. epsilon) then
+           write(*, 2032) xci, xciref, xcidif
+        else
+           verified = .false.
+           write(*, 2031) xci, xciref, xcidif
+        endif
+
+ 2030   format('          ', 4x, E20.13)
+ 2031   format(' FAILURE: ', 4x, E20.13, E20.13, E20.13)
+ 2032   format('          ', 4x, E20.13, E20.13, E20.13)
+
+
+
+        if (class .eq. 'U') then
+           write(*, 2022)
+           write(*, 2023)
+ 2022      format(' No reference values provided')
+ 2023      format(' No verification performed')
+        else if (verified) then
+           write(*, 2020)
+ 2020      format(' Verification Successful')
+        else
+           write(*, 2021)
+ 2021      format(' Verification failed')
+        endif
+
+        return
+
+
+        end
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/MG/Makefile b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/MG/Makefile
new file mode 100644
index 000000000..f1738e85d
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/MG/Makefile
@@ -0,0 +1,28 @@
+SHELL=/bin/sh
+BENCHMARK=mg
+BENCHMARKU=MG
+
+include ../config/make.def
+
+OBJS = mg.o mg_data.o ${COMMON}/print_results.o  \
+       ${COMMON}/${RAND}.o ${COMMON}/timers.o ${COMMON}/wtime.o
+
+ifeq (${M5_ANNOTATION}, 1)
+	OBJS += ${COMMON}/hooks.o
+endif
+
+include ../sys/make.common
+
+${PROGRAM}: config ${OBJS}
+	${FLINK} ${FLINKFLAGS} -o ${PROGRAM} ${OBJS} ${F_LIB}
+
+.f90.o:
+	${FCOMPILE} $<
+
+mg.o:		mg.f90 mg_data.o
+mg_data.o:      mg_data.f90 npbparams.h
+
+clean:
+	- rm -f *.o *~ *.mod
+	- rm -f npbparams.h core
+	- if [ -d rii_files ]; then rm -r rii_files; fi
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/MG/README b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/MG/README
new file mode 100644
index 000000000..566d71d49
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/MG/README
@@ -0,0 +1,141 @@
+Some info about the MG benchmark
+(Note: this info applies to the parallel version and mostly concerns
+the processor decomposition.  Info not concerning the decomposition
+still applies to the serial version.)
+================================
+    
+'mg_demo' demonstrates the capabilities of a very simple multigrid
+solver in computing a three dimensional potential field.  This is
+a simplified multigrid solver in two important respects:
+
+  (1) it solves only a constant coefficient equation,
+  and that only on a uniform cubical grid,
+    
+  (2) it solves only a single equation, representing
+  a scalar field rather than a vector field.
+
+We chose it for its portability and simplicity, and expect that a
+supercomputer which can run it effectively will also be able to
+run more complex multigrid programs at least as well.
+     
+     Eric Barszcz                         Paul Frederickson
+                                          RIACS
+     NASA Ames Research Center            NASA Ames Research Center
+
+========================================================================
+Running the program:  (Note: also see parameter lm information in the
+                       two sections immediately below this section)
+
+The program may be run with or without an input deck (called "mg.input"). 
+The following describes a few things about the input deck if you want to 
+use one. 
+
+The four lines below are the "mg.input" file required to run a
+problem of total size 256x256x256, for 4 iterations (Class "A"),
+and presumes the use of 8 processors:
+
+   8 = top level
+   256 256 256 = nx ny nz
+   4 = nit
+   0 0 0 0 0 0 0 0 = debug_vec
+
+The first line of input indicates how many levels of multi-grid
+cycle will be applied to a particular subpartition.  Presuming that
+8 processors are solving this problem (recall that the number of 
+processors is specified to MPI as a run parameter, and MPI subsequently
+determines this for the code via an MPI subroutine call), a 2x2x2 
+processor grid is  formed, and thus each partition on a processor is 
+of size 128x128x128.  Therefore, a maximum of 8 multi-grid levels may 
+be used.  These are of size 128,64,32,16,8,4,2,1, with the coarsest 
+level being a single point on a given processor.
+
+
+Next, consider the same size problem but running on 1 processor.  The
+following "mg.input" file is appropriate:
+
+    9 = top level
+    256 256 256 = nx ny nz
+    4 = nit
+    0 0 0 0 0 0 0 0 = debug_vec
+
+Since this processor must solve the full 256x256x256 problem, this
+permits 9 multi-grid levels (256,128,64,32,16,8,4,2,1), resulting in 
+a coarsest multi-grid level of a single point on the processor.
+
+
+Next, consider the same size problem but running on 2 processors.  The
+following "mg.input" file is required:
+
+    8 = top level
+    256 256 256 = nx ny nz
+    4 = nit
+    0 0 0 0 0 0 0 0 = debug_vec
+
+The algorithm for partitioning the full grid onto some power of 2 number 
+of processors is to start by splitting the last dimension of the grid
+(z dimension) in 2: the problem is now partitioned onto 2 processors.
+Next the middle dimension (y dimension) is split in 2: the problem is now
+partitioned onto 4 processors.  Next, the first dimension (x dimension) is
+split in 2: the problem is now partitioned onto 8 processors.  Next, the
+last dimension (z dimension) is split again in 2: the problem is now
+partitioned onto 16 processors.  This partitioning is repeated until all 
+of the power of 2 processors have been allocated.
+
+Thus to run the above problem on 2 processors, the grid partitioning 
+algorithm will allocate the two processors across the last dimension, 
+creating two partitions each of size 256x256x128. The coarsest level of 
+multi-grid must be a single point surrounded by a cubic number of grid 
+points.  Therefore, each of the two processor partitions will contain 4 
+coarsest multi-grid level points, each surrounded by a cube of grid points 
+of size 128x128x128, indicated by a top level of 8.
+
+
+Next, consider the same size problem but running on 4 processors.  The
+following "mg.input" file is required:
+
+    8 = top level
+    256 256 256 = nx ny nz
+    4 = nit
+    0 0 0 0 0 0 0 0 = debug_vec
+
+The partitioning algorithm will create 4 partitions, each of size
+256x128x128.  Each partition will contain 2 coarsest multi-grid level
+points each surrounded by a cube of grid points of size 128x128x128, 
+indicated by a top level of 8.
+
+
+Next, consider the same size problem but running on 16 processors.  The
+following "mg.input" file is required:
+
+    7 = top level
+    256 256 256 = nx ny nz
+    4 = nit
+    0 0 0 0 0 0 0 0 = debug_vec
+
+On each node a partition of size 128x128x64 will be created.  A maximum
+of 7 multi-grid levels (64,32,16,8,4,2,1) may be used, resulting in each 
+partition containing 4 coarsest multi-grid level points, each surrounded 
+by a cube of grid points of size 64x64x64, indicated by a top level of 7.
+
+
+
+
+Note that non-cubic problem sizes may also be considered:
+
+The four lines below are the "mg.input" file appropriate for running a
+problem of total size 256x512x512, for 20 iterations and presumes the 
+use of 32 processors (note: this is NOT a class C problem):
+
+    8 = top level
+    256 512 512 = nx ny nz
+    20 = nit
+    0 0 0 0 0 0 0 0 = debug_vec
+
+The first line of input indicates how many levels of multi-grid
+cycle will be applied to a particular subpartition.  Presuming that
+32 processors are solving this problem, a 2x4x4 processor grid is
+formed, and thus each partition on a processor is of size 128x128x128.
+Therefore, a maximum of 8 multi-grid levels may be used.  These are of
+size 128,64,32,16,8,4,2,1, with the coarsest level being a single 
+point on a given processor.
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/MG/mg.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/MG/mg.f90
new file mode 100644
index 000000000..1c2bb0cd7
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/MG/mg.f90
@@ -0,0 +1,1444 @@
+!-------------------------------------------------------------------------!
+!                                                                         !
+!        N  A  S     P A R A L L E L     B E N C H M A R K S  3.4         !
+!                                                                         !
+!                       O p e n M P     V E R S I O N                     !
+!                                                                         !
+!                                   M G                                   !
+!                                                                         !
+!-------------------------------------------------------------------------!
+!                                                                         !
+!    This benchmark is an OpenMP version of the NPB MG code.              !
+!    It is described in NAS Technical Report 99-011.                      !
+!                                                                         !
+!    Permission to use, copy, distribute and modify this software         !
+!    for any purpose with or without fee is hereby granted.  We           !
+!    request, however, that all derived work reference the NAS            !
+!    Parallel Benchmarks 3.4. This software is provided "as is"           !
+!    without express or implied warranty.                                 !
+!                                                                         !
+!    Information on NPB 3.4, including the technical report, the          !
+!    original specifications, source code, results and information        !
+!    on how to submit new results, is available at:                       !
+!                                                                         !
+!           http://www.nas.nasa.gov/Software/NPB/                         !
+!                                                                         !
+!    Send comments or suggestions to  npb@nas.nasa.gov                    !
+!                                                                         !
+!          NAS Parallel Benchmarks Group                                  !
+!          NASA Ames Research Center                                      !
+!          Mail Stop: T27A-1                                              !
+!          Moffett Field, CA   94035-1000                                 !
+!                                                                         !
+!          E-mail:  npb@nas.nasa.gov                                      !
+!          Fax:     (650) 604-3957                                        !
+!                                                                         !
+!-------------------------------------------------------------------------!
+
+
+!---------------------------------------------------------------------
+!
+! Authors: E. Barszcz
+!          P. Frederickson
+!          A. Woo
+!          M. Yarrow
+!          H. Jin
+!
+!---------------------------------------------------------------------
+
+
+!---------------------------------------------------------------------
+      program mg
+!---------------------------------------------------------------------
+
+      use, intrinsic :: ieee_arithmetic, only : ieee_is_nan
+
+      use mg_data
+      use mg_fields
+
+      implicit none
+
+!---------------------------------------------------------------------------c
+! k is the current level. It is passed down through subroutine args
+! and is NOT global. The variable "it" is the current iteration.
+!---------------------------------------------------------------------------c
+
+      integer k, it
+
+      external timer_read
+      double precision t, tinit, mflops, timer_read
+
+      double precision rnm2, rnmu, old2, oldu, epsilon
+      integer n1, n2, n3, nit
+      double precision nn, verify_value, err
+      logical verified
+
+      integer i, fstatus
+      character t_names(t_last)*8
+      double precision tmax
+!$    integer  omp_get_max_threads
+!$    external omp_get_max_threads
+
+
+      do i = T_init, T_last
+         call timer_clear(i)
+      end do
+
+      call timer_start(T_init)
+
+!---------------------------------------------------------------------
+! Read in input data
+!---------------------------------------------------------------------
+
+      call check_timer_flag( timeron )
+      if (timeron) then
+         t_names(t_init) = 'init'
+         t_names(t_bench) = 'benchmk'
+         t_names(t_mg3P) = 'mg3P'
+         t_names(t_psinv) = 'psinv'
+         t_names(t_resid) = 'resid'
+         t_names(t_rprj3) = 'rprj3'
+         t_names(t_interp) = 'interp'
+         t_names(t_norm2) = 'norm2'
+         t_names(t_comm3) = 'comm3'
+      endif
+
+      write (*, 1000)
+
+      open(unit=7,file="mg.input", status="old", iostat=fstatus)
+      if (fstatus .eq. 0) then
+         write(*,50)
+ 50      format(' Reading from input file mg.input')
+         read(7,*) lt
+         read(7,*) nx(lt), ny(lt), nz(lt)
+         read(7,*) nit
+         read(7,*) (debug_vec(i),i=0,7)
+      else
+         write(*,51)
+ 51      format(' No input file. Using compiled defaults ')
+         lt = lt_default
+         nit = nit_default
+         nx(lt) = nx_default
+         ny(lt) = ny_default
+         nz(lt) = nz_default
+         do i = 0,7
+            debug_vec(i) = debug_default
+         end do
+      endif
+
+
+      if ( (nx(lt) .ne. ny(lt)) .or. (nx(lt) .ne. nz(lt)) ) then
+         Class = 'U'
+      else if( nx(lt) .eq. 32 .and. nit .eq. 4 ) then
+         Class = 'S'
+      else if( nx(lt) .eq. 128 .and. nit .eq. 4 ) then
+         Class = 'W'
+      else if( nx(lt) .eq. 256 .and. nit .eq. 4 ) then
+         Class = 'A'
+      else if( nx(lt) .eq. 256 .and. nit .eq. 20 ) then
+         Class = 'B'
+      else if( nx(lt) .eq. 512 .and. nit .eq. 20 ) then
+         Class = 'C'
+      else if( nx(lt) .eq. 1024 .and. nit .eq. 50 ) then
+         Class = 'D'
+      else if( nx(lt) .eq. 2048 .and. nit .eq. 50 ) then
+         Class = 'E'
+      else if( nx(lt) .eq. 4096 .and. nit .eq. 50 ) then
+         Class = 'F'
+      else
+         Class = 'U'
+      endif
+
+!---------------------------------------------------------------------
+!  Use these for debug info:
+!---------------------------------------------------------------------
+!     debug_vec(0) = 1 !=> report all norms
+!     debug_vec(1) = 1 !=> some setup information
+!     debug_vec(1) = 2 !=> more setup information
+!     debug_vec(2) = k => at level k or below, show result of resid
+!     debug_vec(3) = k => at level k or below, show result of psinv
+!     debug_vec(4) = k => at level k or below, show result of rprj
+!     debug_vec(5) = k => at level k or below, show result of interp
+!     debug_vec(6) = 1 => (unused)
+!     debug_vec(7) = 1 => (unused)
+!---------------------------------------------------------------------
+      a(0) = -8.0D0/3.0D0
+      a(1) =  0.0D0
+      a(2) =  1.0D0/6.0D0
+      a(3) =  1.0D0/12.0D0
+
+      if(Class .eq. 'A' .or. Class .eq. 'S'.or. Class .eq.'W') then
+!---------------------------------------------------------------------
+!     Coefficients for the S(a) smoother
+!---------------------------------------------------------------------
+         c(0) =  -3.0D0/8.0D0
+         c(1) =  +1.0D0/32.0D0
+         c(2) =  -1.0D0/64.0D0
+         c(3) =   0.0D0
+      else
+!---------------------------------------------------------------------
+!     Coefficients for the S(b) smoother
+!---------------------------------------------------------------------
+         c(0) =  -3.0D0/17.0D0
+         c(1) =  +1.0D0/33.0D0
+         c(2) =  -1.0D0/61.0D0
+         c(3) =   0.0D0
+      endif
+      lb = 1
+      k  = lt
+
+      call alloc_space
+
+      call setup(n1,n2,n3,k)
+      call zero3(u,n1,n2,n3)
+      call zran3(v,n1,n2,n3,nx(lt),ny(lt),k)
+
+      call norm2u3(v,n1,n2,n3,rnm2,rnmu,nx(lt),ny(lt),nz(lt))
+!     write(*,*)
+!     write(*,*)' norms of random v are'
+!     write(*,600) 0, rnm2, rnmu
+!     write(*,*)' about to evaluate resid, k=',k
+
+      write (*, 1001) nx(lt),ny(lt),nz(lt), Class
+      write (*, 1002) nit
+!$    write (*, 1003) omp_get_max_threads()
+      write (*, *)
+
+ 1000 format(//,' NAS Parallel Benchmarks (NPB3.4-OMP)',  &
+     &          ' - MG Benchmark', /)
+ 1001 format(' Size: ', i4, 'x', i4, 'x', i4, '  (class ', A, ')' )
+ 1002 format(' Iterations:                  ', i5)
+ 1003 format(' Number of available threads: ', i5)
+
+
+      call resid(u,v,r,n1,n2,n3,a,k)
+      call norm2u3(r,n1,n2,n3,rnm2,rnmu,nx(lt),ny(lt),nz(lt))
+      old2 = rnm2
+      oldu = rnmu
+
+!---------------------------------------------------------------------
+!     One iteration for startup
+!---------------------------------------------------------------------
+      call mg3P(u,v,r,a,c,n1,n2,n3,k)
+      call resid(u,v,r,n1,n2,n3,a,k)
+      call setup(n1,n2,n3,k)
+      call zero3(u,n1,n2,n3)
+      call zran3(v,n1,n2,n3,nx(lt),ny(lt),k)
+
+      call timer_stop(T_init)
+      tinit = timer_read(T_init)
+
+      write( *,'(A,F15.3,A/)' )  &
+     &     ' Initialization time: ',tinit, ' seconds'
+
+      do i = T_bench, T_last
+         call timer_clear(i)
+      end do
+
+#ifdef M5_ANNOTATION
+      call m5_work_begin_interface
+#endif
+
+      call timer_start(T_bench)
+
+      if (timeron) call timer_start(T_resid2)
+      call resid(u,v,r,n1,n2,n3,a,k)
+      if (timeron) call timer_stop(T_resid2)
+      call norm2u3(r,n1,n2,n3,rnm2,rnmu,nx(lt),ny(lt),nz(lt))
+      old2 = rnm2
+      oldu = rnmu
+
+      do  it=1,nit
+         if (it.eq.1 .or. it.eq.nit .or. mod(it,5).eq.0) then
+            write(*,80) it
+   80       format('  iter ',i3)
+         endif
+         if (timeron) call timer_start(T_mg3P)
+         call mg3P(u,v,r,a,c,n1,n2,n3,k)
+         if (timeron) call timer_stop(T_mg3P)
+         if (timeron) call timer_start(T_resid2)
+         call resid(u,v,r,n1,n2,n3,a,k)
+         if (timeron) call timer_stop(T_resid2)
+      enddo
+
+
+      call norm2u3(r,n1,n2,n3,rnm2,rnmu,nx(lt),ny(lt),nz(lt))
+
+      call timer_stop(T_bench)
+
+#ifdef M5_ANNOTATION
+      call m5_work_end_interface
+#endif
+
+      t = timer_read(T_bench)
+
+      verified = .FALSE.
+      verify_value = 0.0
+
+      write(*,100)
+ 100  format(/' Benchmark completed ')
+
+      epsilon = 1.d-8
+      if (Class .ne. 'U') then
+         if(Class.eq.'S') then
+            verify_value = 0.5307707005734d-04
+         elseif(Class.eq.'W') then
+            verify_value = 0.6467329375339d-05
+         elseif(Class.eq.'A') then
+            verify_value = 0.2433365309069d-05
+         elseif(Class.eq.'B') then
+            verify_value = 0.1800564401355d-05
+         elseif(Class.eq.'C') then
+            verify_value = 0.5706732285740d-06
+         elseif(Class.eq.'D') then
+            verify_value = 0.1583275060440d-09
+         elseif(Class.eq.'E') then
+            verify_value = 0.5630442584711d-10
+         elseif(Class.eq.'F') then
+            verify_value = 0.1889225697989d-10
+         endif
+
+         err = abs( rnm2 - verify_value ) / verify_value
+         if( (.not.ieee_is_nan(err)) .and. (err .le. epsilon) ) then
+            verified = .TRUE.
+            write(*, 200)
+            write(*, 201) rnm2
+            write(*, 202) err
+ 200        format(' VERIFICATION SUCCESSFUL ')
+ 201        format(' L2 Norm is ', E20.13)
+ 202        format(' Error is   ', E20.13)
+         else
+            verified = .FALSE.
+            write(*, 300)
+            write(*, 301) rnm2
+            write(*, 302) verify_value
+ 300        format(' VERIFICATION FAILED')
+ 301        format(' L2 Norm is             ', E20.13)
+ 302        format(' The correct L2 Norm is ', E20.13)
+         endif
+      else
+         verified = .FALSE.
+         write (*, 400)
+         write (*, 401)
+         write (*, 201) rnm2
+ 400     format(' Problem size unknown')
+ 401     format(' NO VERIFICATION PERFORMED')
+      endif
+
+      nn = 1.0d0*nx(lt)*ny(lt)*nz(lt)
+
+      if( t .ne. 0. ) then
+         mflops = 58.*nit*nn*1.0D-6 /t
+      else
+         mflops = 0.0
+      endif
+
+      call print_results('MG', class, nx(lt), ny(lt), nz(lt),  &
+     &                   nit, t,  &
+     &                   mflops, '          floating point',  &
+     &                   verified, npbversion, compiletime,  &
+     &                   cs1, cs2, cs3, cs4, cs5, cs6, cs7)
+
+
+ 600  format( i4, 2e19.12)
+
+!---------------------------------------------------------------------
+!      More timers
+!---------------------------------------------------------------------
+      if (.not.timeron) goto 999
+
+      tmax = timer_read(t_bench)
+      if (tmax .eq. 0.0) tmax = 1.0
+
+      write(*,800)
+ 800  format('  SECTION   Time (secs)')
+      do i=t_bench, t_last
+         t = timer_read(i)
+         if (i.eq.t_resid2) then
+            t = timer_read(T_resid) - t
+            write(*,820) 'mg-resid', t, t*100./tmax
+         else
+            write(*,810) t_names(i), t, t*100./tmax
+         endif
+ 810     format(2x,a8,':',f9.3,'  (',f6.2,'%)')
+ 820     format('    --> ',a8,':',f9.3,'  (',f6.2,'%)')
+      end do
+
+ 999  continue
+
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine setup(n1,n2,n3,k)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      use mg_data
+      implicit none
+
+      integer n1,n2,n3,k
+      integer j
+
+      integer ax, mi(3,maxlevel)
+      integer ng(3,maxlevel)
+
+
+      ng(1,lt) = nx(lt)
+      ng(2,lt) = ny(lt)
+      ng(3,lt) = nz(lt)
+      do  ax=1,3
+         do  k=lt-1,1,-1
+            ng(ax,k) = ng(ax,k+1)/2
+         enddo
+      enddo
+ 61   format(10i4)
+      do  k=lt,1,-1
+         nx(k) = ng(1,k)
+         ny(k) = ng(2,k)
+         nz(k) = ng(3,k)
+      enddo
+
+      do  k = lt,1,-1
+         do  ax = 1,3
+            mi(ax,k) = 2 + ng(ax,k)
+         enddo
+
+         m1(k) = mi(1,k)
+         m2(k) = mi(2,k)
+         m3(k) = mi(3,k)
+
+      enddo
+
+      k = lt
+      is1 = 2 + ng(1,k) - ng(1,lt)
+      ie1 = 1 + ng(1,k)
+      n1 = 3 + ie1 - is1
+      is2 = 2 + ng(2,k) - ng(2,lt)
+      ie2 = 1 + ng(2,k)
+      n2 = 3 + ie2 - is2
+      is3 = 2 + ng(3,k) - ng(3,lt)
+      ie3 = 1 + ng(3,k)
+      n3 = 3 + ie3 - is3
+
+
+      ir(lt)=1
+      do  j = lt-1, 1, -1
+         ir(j)=ir(j+1)+one*m1(j+1)*m2(j+1)*m3(j+1)
+      enddo
+
+
+      if( debug_vec(1) .ge. 1 )then
+         write(*,*)' in setup, '
+         write(*,*)' k  lt  nx  ny  nz ',  &
+     &        ' n1  n2  n3 is1 is2 is3 ie1 ie2 ie3'
+         write(*,9) k,lt,ng(1,k),ng(2,k),ng(3,k),  &
+     &              n1,n2,n3,is1,is2,is3,ie1,ie2,ie3
+ 9       format(15i4)
+      endif
+
+      k = lt
+
+      return
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine mg3P(u,v,r,a,c,n1,n2,n3,k)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     multigrid V-cycle routine
+!---------------------------------------------------------------------
+
+      use mg_data
+      implicit none
+
+      integer n1, n2, n3, k
+      double precision u(nr),v(nv),r(nr)
+      double precision a(0:3),c(0:3)
+
+      integer j
+
+!---------------------------------------------------------------------
+!     down cycle.
+!     restrict the residual from the fine grid to the coarse
+!---------------------------------------------------------------------
+
+      do  k= lt, lb+1 , -1
+         j = k-1
+         call rprj3(r(ir(k)),m1(k),m2(k),m3(k),  &
+     &        r(ir(j)),m1(j),m2(j),m3(j),k)
+      enddo
+
+      k = lb
+!---------------------------------------------------------------------
+!     compute an approximate solution on the coarsest grid
+!---------------------------------------------------------------------
+      call zero3(u(ir(k)),m1(k),m2(k),m3(k))
+      call psinv(r(ir(k)),u(ir(k)),m1(k),m2(k),m3(k),c,k)
+
+      do  k = lb+1, lt-1
+         j = k-1
+!---------------------------------------------------------------------
+!        prolongate from level k-1  to k
+!---------------------------------------------------------------------
+         call zero3(u(ir(k)),m1(k),m2(k),m3(k))
+         call interp(u(ir(j)),m1(j),m2(j),m3(j),  &
+     &               u(ir(k)),m1(k),m2(k),m3(k),k)
+!---------------------------------------------------------------------
+!        compute residual for level k
+!---------------------------------------------------------------------
+         call resid(u(ir(k)),r(ir(k)),r(ir(k)),m1(k),m2(k),m3(k),a,k)
+!---------------------------------------------------------------------
+!        apply smoother
+!---------------------------------------------------------------------
+         call psinv(r(ir(k)),u(ir(k)),m1(k),m2(k),m3(k),c,k)
+      enddo
+ 200  continue
+      j = lt - 1
+      k = lt
+      call interp(u(ir(j)),m1(j),m2(j),m3(j),u,n1,n2,n3,k)
+      call resid(u,v,r,n1,n2,n3,a,k)
+      call psinv(r,u,n1,n2,n3,c,k)
+
+      return
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine psinv( r,u,n1,n2,n3,c,k)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     psinv applies an approximate inverse as smoother:  u = u + Cr
+!
+!     This  implementation costs  15A + 4M per result, where
+!     A and M denote the costs of Addition and Multiplication.
+!     Presuming coefficient c(3) is zero (the NPB assumes this,
+!     but it is thus not a general case), 2A + 1M may be eliminated,
+!     resulting in 13A + 3M.
+!     Note that this vectorizes, and is also fine for cache
+!     based machines.
+!---------------------------------------------------------------------
+
+      use mg_data
+      implicit none
+
+      integer n1,n2,n3,k
+      double precision u(n1,n2,n3),r(n1,n2,n3),c(0:3)
+      integer i3, i2, i1
+
+      double precision r1(m), r2(m)
+
+      if (timeron) call timer_start(T_psinv)
+!$omp parallel do default(shared) private(i1,i2,i3,r1,r2) collapse(2)
+      do i3=2,n3-1
+         do i2=2,n2-1
+            do i1=1,n1
+               r1(i1) = r(i1,i2-1,i3) + r(i1,i2+1,i3)  &
+     &                + r(i1,i2,i3-1) + r(i1,i2,i3+1)
+               r2(i1) = r(i1,i2-1,i3-1) + r(i1,i2+1,i3-1)  &
+     &                + r(i1,i2-1,i3+1) + r(i1,i2+1,i3+1)
+            enddo
+            do i1=2,n1-1
+               u(i1,i2,i3) = u(i1,i2,i3)  &
+     &                     + c(0) * r(i1,i2,i3)  &
+     &                     + c(1) * ( r(i1-1,i2,i3) + r(i1+1,i2,i3)  &
+     &                              + r1(i1) )  &
+     &                     + c(2) * ( r2(i1) + r1(i1-1) + r1(i1+1) )
+!---------------------------------------------------------------------
+!  Assume c(3) = 0    (Enable line below if c(3) not= 0)
+!---------------------------------------------------------------------
+!    >                     + c(3) * ( r2(i1-1) + r2(i1+1) )
+!---------------------------------------------------------------------
+            enddo
+         enddo
+      enddo
+      if (timeron) call timer_stop(T_psinv)
+
+!---------------------------------------------------------------------
+!     exchange boundary points
+!---------------------------------------------------------------------
+      call comm3(u,n1,n2,n3,k)
+
+      if( debug_vec(0) .ge. 1 )then
+         call rep_nrm(u,n1,n2,n3,'   psinv',k)
+      endif
+
+      if( debug_vec(3) .ge. k )then
+         call showall(u,n1,n2,n3)
+      endif
+
+      return
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine resid( u,v,r,n1,n2,n3,a,k )
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     resid computes the residual:  r = v - Au
+!
+!     This  implementation costs  15A + 4M per result, where
+!     A and M denote the costs of Addition (or Subtraction) and
+!     Multiplication, respectively.
+!     Presuming coefficient a(1) is zero (the NPB assumes this,
+!     but it is thus not a general case), 3A + 1M may be eliminated,
+!     resulting in 12A + 3M.
+!     Note that this vectorizes, and is also fine for cache
+!     based machines.
+!---------------------------------------------------------------------
+
+      use mg_data
+      implicit none
+
+      integer n1,n2,n3,k
+      double precision u(n1,n2,n3),v(n1,n2,n3),r(n1,n2,n3),a(0:3)
+      integer i3, i2, i1
+      double precision u1(m), u2(m)
+
+      if (timeron) call timer_start(T_resid)
+!$omp parallel do default(shared) private(i1,i2,i3,u1,u2) collapse(2)
+      do i3=2,n3-1
+         do i2=2,n2-1
+            do i1=1,n1
+               u1(i1) = u(i1,i2-1,i3) + u(i1,i2+1,i3)  &
+     &                + u(i1,i2,i3-1) + u(i1,i2,i3+1)
+               u2(i1) = u(i1,i2-1,i3-1) + u(i1,i2+1,i3-1)  &
+     &                + u(i1,i2-1,i3+1) + u(i1,i2+1,i3+1)
+            enddo
+            do i1=2,n1-1
+               r(i1,i2,i3) = v(i1,i2,i3)  &
+     &                     - a(0) * u(i1,i2,i3)  &
+!---------------------------------------------------------------------
+!  Assume a(1) = 0      (Enable 2 lines below if a(1) not= 0)
+!---------------------------------------------------------------------
+!    >                     - a(1) * ( u(i1-1,i2,i3) + u(i1+1,i2,i3)
+!    >                              + u1(i1) )
+!---------------------------------------------------------------------
+     &                     - a(2) * ( u2(i1) + u1(i1-1) + u1(i1+1) )  &
+     &                     - a(3) * ( u2(i1-1) + u2(i1+1) )
+            enddo
+         enddo
+      enddo
+      if (timeron) call timer_stop(T_resid)
+
+!---------------------------------------------------------------------
+!     exchange boundary data
+!---------------------------------------------------------------------
+      call comm3(r,n1,n2,n3,k)
+
+      if( debug_vec(0) .ge. 1 )then
+         call rep_nrm(r,n1,n2,n3,'   resid',k)
+      endif
+
+      if( debug_vec(2) .ge. k )then
+         call showall(r,n1,n2,n3)
+      endif
+
+      return
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine rprj3( r,m1k,m2k,m3k,s,m1j,m2j,m3j,k )
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     rprj3 projects onto the next coarser grid,
+!     using a trilinear Finite Element projection:  s = r' = P r
+!
+!     This  implementation costs  20A + 4M per result, where
+!     A and M denote the costs of Addition and Multiplication.
+!     Note that this vectorizes, and is also fine for cache
+!     based machines.
+!---------------------------------------------------------------------
+
+      use mg_data
+      implicit none
+
+      integer m1k, m2k, m3k, m1j, m2j, m3j,k
+      double precision r(m1k,m2k,m3k), s(m1j,m2j,m3j)
+      integer j3, j2, j1, i3, i2, i1, d1, d2, d3, j
+
+      double precision x1(m), y1(m), x2,y2
+
+      if (timeron) call timer_start(T_rprj3)
+      if(m1k.eq.3)then
+        d1 = 2
+      else
+        d1 = 1
+      endif
+
+      if(m2k.eq.3)then
+        d2 = 2
+      else
+        d2 = 1
+      endif
+
+      if(m3k.eq.3)then
+        d3 = 2
+      else
+        d3 = 1
+      endif
+
+!$omp parallel do default(shared) collapse(2)  &
+!$omp& private(j1,j2,j3,i1,i2,i3,x1,y1,x2,y2)
+      do  j3=2,m3j-1
+         do  j2=2,m2j-1
+            i3 = 2*j3-d3
+            i2 = 2*j2-d2
+
+            do j1=2,m1j
+              i1 = 2*j1-d1
+              x1(i1-1) = r(i1-1,i2-1,i3  ) + r(i1-1,i2+1,i3  )  &
+     &                 + r(i1-1,i2,  i3-1) + r(i1-1,i2,  i3+1)
+              y1(i1-1) = r(i1-1,i2-1,i3-1) + r(i1-1,i2-1,i3+1)  &
+     &                 + r(i1-1,i2+1,i3-1) + r(i1-1,i2+1,i3+1)
+            enddo
+
+            do  j1=2,m1j-1
+              i1 = 2*j1-d1
+              y2 = r(i1,  i2-1,i3-1) + r(i1,  i2-1,i3+1)  &
+     &           + r(i1,  i2+1,i3-1) + r(i1,  i2+1,i3+1)
+              x2 = r(i1,  i2-1,i3  ) + r(i1,  i2+1,i3  )  &
+     &           + r(i1,  i2,  i3-1) + r(i1,  i2,  i3+1)
+              s(j1,j2,j3) =  &
+     &               0.5D0 * r(i1,i2,i3)  &
+     &             + 0.25D0 * ( r(i1-1,i2,i3) + r(i1+1,i2,i3) + x2)  &
+     &             + 0.125D0 * ( x1(i1-1) + x1(i1+1) + y2)  &
+     &             + 0.0625D0 * ( y1(i1-1) + y1(i1+1) )
+            enddo
+
+         enddo
+      enddo
+      if (timeron) call timer_stop(T_rprj3)
+
+
+      j = k-1
+      call comm3(s,m1j,m2j,m3j,j)
+
+      if( debug_vec(0) .ge. 1 )then
+         call rep_nrm(s,m1j,m2j,m3j,'   rprj3',k-1)
+      endif
+
+      if( debug_vec(4) .ge. k )then
+         call showall(s,m1j,m2j,m3j)
+      endif
+
+      return
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine interp( z,mm1,mm2,mm3,u,n1,n2,n3,k )
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     interp adds the trilinear interpolation of the correction
+!     from the coarser grid to the current approximation:  u = u + Qu'
+!
+!     Observe that this  implementation costs  16A + 4M, where
+!     A and M denote the costs of Addition and Multiplication.
+!     Note that this vectorizes, and is also fine for cache
+!     based machines.  Vector machines may get slightly better
+!     performance however, with 8 separate "do i1" loops, rather than 4.
+!---------------------------------------------------------------------
+
+      use mg_data
+      implicit none
+
+      integer mm1, mm2, mm3, n1, n2, n3,k
+      double precision z(mm1,mm2,mm3),u(n1,n2,n3)
+      integer i3, i2, i1, d1, d2, d3, t1, t2, t3
+
+! note that m = 1037 in globals.h, but here it only needs to be
+! 535 to handle up to 1024^3
+!      integer m
+!      parameter( m=535 )
+      double precision z1(m),z2(m),z3(m)
+
+      if (timeron) call timer_start(T_interp)
+      if( n1 .ne. 3 .and. n2 .ne. 3 .and. n3 .ne. 3 ) then
+
+!$omp parallel do default(shared) private(i1,i2,i3,z1,z2,z3) collapse(2)
+         do  i3=1,mm3-1
+            do  i2=1,mm2-1
+
+               do i1=1,mm1
+                  z1(i1) = z(i1,i2+1,i3) + z(i1,i2,i3)
+                  z2(i1) = z(i1,i2,i3+1) + z(i1,i2,i3)
+                  z3(i1) = z(i1,i2+1,i3+1) + z(i1,i2,i3+1) + z1(i1)
+               enddo
+
+               do  i1=1,mm1-1
+                  u(2*i1-1,2*i2-1,2*i3-1)=u(2*i1-1,2*i2-1,2*i3-1)  &
+     &                 +z(i1,i2,i3)
+                  u(2*i1,2*i2-1,2*i3-1)=u(2*i1,2*i2-1,2*i3-1)  &
+     &                 +0.5d0*(z(i1+1,i2,i3)+z(i1,i2,i3))
+               enddo
+               do i1=1,mm1-1
+                  u(2*i1-1,2*i2,2*i3-1)=u(2*i1-1,2*i2,2*i3-1)  &
+     &                 +0.5d0 * z1(i1)
+                  u(2*i1,2*i2,2*i3-1)=u(2*i1,2*i2,2*i3-1)  &
+     &                 +0.25d0*( z1(i1) + z1(i1+1) )
+               enddo
+               do i1=1,mm1-1
+                  u(2*i1-1,2*i2-1,2*i3)=u(2*i1-1,2*i2-1,2*i3)  &
+     &                 +0.5d0 * z2(i1)
+                  u(2*i1,2*i2-1,2*i3)=u(2*i1,2*i2-1,2*i3)  &
+     &                 +0.25d0*( z2(i1) + z2(i1+1) )
+               enddo
+               do i1=1,mm1-1
+                  u(2*i1-1,2*i2,2*i3)=u(2*i1-1,2*i2,2*i3)  &
+     &                 +0.25d0* z3(i1)
+                  u(2*i1,2*i2,2*i3)=u(2*i1,2*i2,2*i3)  &
+     &                 +0.125d0*( z3(i1) + z3(i1+1) )
+               enddo
+            enddo
+         enddo
+
+      else
+
+         if(n1.eq.3)then
+            d1 = 2
+            t1 = 1
+         else
+            d1 = 1
+            t1 = 0
+         endif
+
+         if(n2.eq.3)then
+            d2 = 2
+            t2 = 1
+         else
+            d2 = 1
+            t2 = 0
+         endif
+
+         if(n3.eq.3)then
+            d3 = 2
+            t3 = 1
+         else
+            d3 = 1
+            t3 = 0
+         endif
+
+!$omp parallel default(shared) private(i1,i2,i3)
+!$omp do collapse(2)
+         do  i3=d3,mm3-1
+            do  i2=d2,mm2-1
+               do  i1=d1,mm1-1
+                  u(2*i1-d1,2*i2-d2,2*i3-d3)=u(2*i1-d1,2*i2-d2,2*i3-d3)  &
+     &                 +z(i1,i2,i3)
+               enddo
+               do  i1=1,mm1-1
+                  u(2*i1-t1,2*i2-d2,2*i3-d3)=u(2*i1-t1,2*i2-d2,2*i3-d3)  &
+     &                 +0.5D0*(z(i1+1,i2,i3)+z(i1,i2,i3))
+               enddo
+            enddo
+         enddo
+!$omp do collapse(2)
+         do  i3=d3,mm3-1
+            do  i2=1,mm2-1
+               do  i1=d1,mm1-1
+                  u(2*i1-d1,2*i2-t2,2*i3-d3)=u(2*i1-d1,2*i2-t2,2*i3-d3)  &
+     &                 +0.5D0*(z(i1,i2+1,i3)+z(i1,i2,i3))
+               enddo
+               do  i1=1,mm1-1
+                  u(2*i1-t1,2*i2-t2,2*i3-d3)=u(2*i1-t1,2*i2-t2,2*i3-d3)  &
+     &                 +0.25D0*(z(i1+1,i2+1,i3)+z(i1+1,i2,i3)  &
+     &                 +z(i1,  i2+1,i3)+z(i1,  i2,i3))
+               enddo
+            enddo
+         enddo
+
+!$omp do collapse(2)
+         do  i3=1,mm3-1
+            do  i2=d2,mm2-1
+               do  i1=d1,mm1-1
+                  u(2*i1-d1,2*i2-d2,2*i3-t3)=u(2*i1-d1,2*i2-d2,2*i3-t3)  &
+     &                 +0.5D0*(z(i1,i2,i3+1)+z(i1,i2,i3))
+               enddo
+               do  i1=1,mm1-1
+                  u(2*i1-t1,2*i2-d2,2*i3-t3)=u(2*i1-t1,2*i2-d2,2*i3-t3)  &
+     &                 +0.25D0*(z(i1+1,i2,i3+1)+z(i1,i2,i3+1)  &
+     &                 +z(i1+1,i2,i3  )+z(i1,i2,i3  ))
+               enddo
+            enddo
+         enddo
+!$omp do collapse(2)
+         do  i3=1,mm3-1
+            do  i2=1,mm2-1
+               do  i1=d1,mm1-1
+                  u(2*i1-d1,2*i2-t2,2*i3-t3)=u(2*i1-d1,2*i2-t2,2*i3-t3)  &
+     &                 +0.25D0*(z(i1,i2+1,i3+1)+z(i1,i2,i3+1)  &
+     &                 +z(i1,i2+1,i3  )+z(i1,i2,i3  ))
+               enddo
+               do  i1=1,mm1-1
+                  u(2*i1-t1,2*i2-t2,2*i3-t3)=u(2*i1-t1,2*i2-t2,2*i3-t3)  &
+     &                 +0.125D0*(z(i1+1,i2+1,i3+1)+z(i1+1,i2,i3+1)  &
+     &                 +z(i1  ,i2+1,i3+1)+z(i1  ,i2,i3+1)  &
+     &                 +z(i1+1,i2+1,i3  )+z(i1+1,i2,i3  )  &
+     &                 +z(i1  ,i2+1,i3  )+z(i1  ,i2,i3  ))
+               enddo
+            enddo
+         enddo
+!$omp end do nowait
+!$omp end parallel
+
+      endif
+      if (timeron) call timer_stop(T_interp)
+
+      if( debug_vec(0) .ge. 1 )then
+         call rep_nrm(z,mm1,mm2,mm3,'z: inter',k-1)
+         call rep_nrm(u,n1,n2,n3,'u: inter',k)
+      endif
+
+      if( debug_vec(5) .ge. k )then
+         call showall(z,mm1,mm2,mm3)
+         call showall(u,n1,n2,n3)
+      endif
+
+      return
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine norm2u3(r,n1,n2,n3,rnm2,rnmu,nx,ny,nz)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     norm2u3 evaluates approximations to the L2 norm and the
+!     uniform (or L-infinity or Chebyshev) norm, under the
+!     assumption that the boundaries are periodic or zero.  Add the
+!     boundaries in with half weight (quarter weight on the edges
+!     and eighth weight at the corners) for inhomogeneous boundaries.
+!---------------------------------------------------------------------
+      use mg_data, only : timeron
+
+      implicit none
+
+      integer n1, n2, n3, nx, ny, nz
+      double precision rnm2, rnmu, r(n1,n2,n3)
+      double precision s, a
+      integer i3, i2, i1
+
+      double precision dn
+
+      integer T_norm2
+      parameter (T_norm2=9)
+
+      if (timeron) call timer_start(T_norm2)
+      dn = 1.0d0*nx*ny*nz
+
+      s=0.0D0
+      rnmu = 0.0D0
+!$omp parallel do default(shared) private(i1,i2,i3,a) collapse(2)  &
+!$omp& reduction(+:s) reduction(max:rnmu)
+      do  i3=2,n3-1
+         do  i2=2,n2-1
+            do  i1=2,n1-1
+               s=s+r(i1,i2,i3)**2
+               a=abs(r(i1,i2,i3))
+               rnmu=dmax1(rnmu,a)
+            enddo
+         enddo
+      enddo
+
+      rnm2=sqrt( s / dn )
+      if (timeron) call timer_stop(T_norm2)
+
+      return
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine rep_nrm(u,n1,n2,n3,title,kk)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     report on norm
+!---------------------------------------------------------------------
+
+      use mg_data
+      implicit none
+
+      integer n1, n2, n3, kk
+      double precision u(n1,n2,n3)
+      character*8 title
+
+      double precision rnm2, rnmu
+
+
+      call norm2u3(u,n1,n2,n3,rnm2,rnmu,nx(kk),ny(kk),nz(kk))
+      write(*,7)kk,title,rnm2,rnmu
+ 7    format(' Level',i2,' in ',a8,': norms =',D21.14,D21.14)
+
+      return
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine comm3(u,n1,n2,n3,kk)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     comm3 organizes the communication on all borders
+!---------------------------------------------------------------------
+
+      use mg_data
+      implicit none
+
+      integer n1, n2, n3, kk
+      double precision u(n1,n2,n3)
+      integer i1, i2, i3
+
+      if (timeron) call timer_start(T_comm3)
+!$omp parallel default(shared) private(i1,i2,i3)
+!$omp do
+      do  i3=2,n3-1
+         do  i2=2,n2-1
+            u( 1,i2,i3) = u(n1-1,i2,i3)
+            u(n1,i2,i3) = u(   2,i2,i3)
+         enddo
+!      enddo
+
+!      do  i3=2,n3-1
+         do  i1=1,n1
+            u(i1, 1,i3) = u(i1,n2-1,i3)
+            u(i1,n2,i3) = u(i1,   2,i3)
+         enddo
+      enddo
+
+!$omp do
+      do  i2=1,n2
+         do  i1=1,n1
+            u(i1,i2, 1) = u(i1,i2,n3-1)
+            u(i1,i2,n3) = u(i1,i2,   2)
+         enddo
+      enddo
+!$omp end do nowait
+!$omp end parallel
+      if (timeron) call timer_stop(T_comm3)
+
+      return
+      end
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine zran3(z,n1,n2,n3,nx1,ny1,k)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     zran3  loads +1 at ten randomly chosen points,
+!     loads -1 at a different ten random points,
+!     and zero elsewhere.
+!---------------------------------------------------------------------
+
+      use mg_data
+      implicit none
+
+      integer n1, n2, n3, k, nx1, ny1, i0, mm0, mm1
+      double precision z(n1,n2,n3)
+
+      integer mm, i1, i2, i3, d1, e1, e2, e3
+      double precision x, a
+      double precision xx, x0, x1, a1, a2, ai, power
+      parameter( mm = 10,  a = 5.D0 ** 13, x = 314159265.D0)
+      double precision ten( mm, 0:1 ), best0, best1
+      integer i, j1( mm, 0:1 ), j2( mm, 0:1 ), j3( mm, 0:1 )
+      integer jg( 0:3, mm, 0:1 )
+
+      external randlc
+      double precision randlc, rdummy
+!$    integer  omp_get_thread_num, omp_get_num_threads
+!$    external omp_get_thread_num, omp_get_num_threads
+      integer myid, num_threads
+
+      a1 = power( a, nx1 )
+      a2 = power( a, nx1*ny1 )
+
+      call zero3(z,n1,n2,n3)
+
+      i = is1-2+nx1*(is2-2+ny1*(is3-2))
+
+      ai = power( a, i )
+      d1 = ie1 - is1 + 1
+      e1 = ie1 - is1 + 2
+      e2 = ie2 - is2 + 2
+      e3 = ie3 - is3 + 2
+      x0 = x
+      rdummy = randlc( x0, ai )
+
+!---------------------------------------------------------------------
+!     save the starting seeds for the following loop
+!---------------------------------------------------------------------
+      do  i3 = 2, e3
+         starts(i3) = x0
+         rdummy = randlc( x0, a2 )
+      end do
+
+!---------------------------------------------------------------------
+!     fill array
+!---------------------------------------------------------------------
+!$omp parallel do default(shared) private(i2,i3,x1,xx,rdummy)  &
+!$omp&  shared(e2,e3,d1,a1)
+      do  i3 = 2, e3
+         x1 = starts(i3)
+         do  i2 = 2, e2
+            xx = x1
+            call vranlc( d1, xx, a, z( 2, i2, i3 ))
+            rdummy = randlc( x1, a1 )
+         enddo
+      enddo
+!$omp end parallel do
+
+!---------------------------------------------------------------------
+!       call comm3(z,n1,n2,n3)
+!       call showall(z,n1,n2,n3)
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     each thread looks for twenty candidates
+!---------------------------------------------------------------------
+!$omp parallel default(shared) private(i,i0,i1,i2,i3,j1,j2,j3,ten,  &
+!$omp&  myid,num_threads) shared(best0,best1,n1,n2,n3)
+      do  i=1,mm
+         ten( i, 1 ) = 0.0D0
+         j1( i, 1 ) = 0
+         j2( i, 1 ) = 0
+         j3( i, 1 ) = 0
+         ten( i, 0 ) = 1.0D0
+         j1( i, 0 ) = 0
+         j2( i, 0 ) = 0
+         j3( i, 0 ) = 0
+      enddo
+
+!$omp do collapse(2)
+      do  i3=2,n3-1
+         do  i2=2,n2-1
+            do  i1=2,n1-1
+               if( z(i1,i2,i3) .gt. ten( 1, 1 ) )then
+                  ten(1,1) = z(i1,i2,i3)
+                  j1(1,1) = i1
+                  j2(1,1) = i2
+                  j3(1,1) = i3
+                  call bubble( ten, j1, j2, j3, mm, 1 )
+               endif
+               if( z(i1,i2,i3) .lt. ten( 1, 0 ) )then
+                  ten(1,0) = z(i1,i2,i3)
+                  j1(1,0) = i1
+                  j2(1,0) = i2
+                  j3(1,0) = i3
+                  call bubble( ten, j1, j2, j3, mm, 0 )
+               endif
+            enddo
+         enddo
+      enddo
+!$omp end do
+
+
+!---------------------------------------------------------------------
+!     Now which of these are globally best?
+!---------------------------------------------------------------------
+      i1 = mm
+      i0 = mm
+      myid = 0
+!$    myid = omp_get_thread_num()
+!$    num_threads = omp_get_num_threads()
+      do  i=mm,1,-1
+
+! ... ORDERED access is required here for sequential consistency
+! ... in case two values are identical.
+! ... Since an "ORDERED" section is only defined in OpenMP 2,
+! ... we use a dummy loop to emulate ordered access in OpenMP 1.x.
+!$omp master
+         best1 = 0.0D0
+         best0 = 1.0D0
+!$omp end master
+
+!$omp do ordered schedule(static)
+!$       do i2=1,num_threads
+!$omp ordered
+         if (ten(i1,1) .gt. best1) then
+            best1 = ten(i1,1)
+            jg( 0, i, 1 ) = myid
+         endif
+         if (ten(i0,0) .lt. best0) then
+            best0 = ten(i0,0)
+            jg( 0, i, 0 ) = myid
+         endif
+!$omp end ordered
+!$       end do
+
+         if (myid .eq. jg( 0, i, 1 )) then
+            jg( 1, i, 1 ) = j1( i1, 1 )
+            jg( 2, i, 1 ) = j2( i1, 1 )
+            jg( 3, i, 1 ) = j3( i1, 1 )
+            i1 = i1-1
+         endif
+
+         if (myid .eq. jg( 0, i, 0 )) then
+            jg( 1, i, 0 ) = j1( i0, 0 )
+            jg( 2, i, 0 ) = j2( i0, 0 )
+            jg( 3, i, 0 ) = j3( i0, 0 )
+            i0 = i0-1
+         endif
+
+      enddo
+!$omp end parallel
+
+!      mm1 = i1+1
+!      mm0 = i0+1
+      mm1 = 1
+      mm0 = 1
+
+!     write(*,*)' '
+!     write(*,*)' negative charges at'
+!     write(*,9)(jg(1,i,0),jg(2,i,0),jg(3,i,0),i=1,mm)
+!     write(*,*)' positive charges at'
+!     write(*,9)(jg(1,i,1),jg(2,i,1),jg(3,i,1),i=1,mm)
+!     write(*,*)' small random numbers were'
+!     write(*,8)(ten( i,0),i=mm,1,-1)
+!     write(*,*)' and they were found on processor number'
+!     write(*,7)(jg(0,i,0),i=mm,1,-1)
+!     write(*,*)' large random numbers were'
+!     write(*,8)(ten( i,1),i=mm,1,-1)
+!     write(*,*)' and they were found on processor number'
+!     write(*,7)(jg(0,i,1),i=mm,1,-1)
+! 9    format(5(' (',i3,2(',',i3),')'))
+! 8    format(5D15.8)
+! 7    format(10i4)
+
+!$omp parallel do default(shared) private(i1,i2,i3) collapse(2)
+      do  i3=1,n3
+         do  i2=1,n2
+            do  i1=1,n1
+               z(i1,i2,i3) = 0.0D0
+            enddo
+         enddo
+      enddo
+!$omp end parallel do
+
+      do  i=mm,mm0,-1
+         z( jg(1,i,0), jg(2,i,0), jg(3,i,0) ) = -1.0D0
+      enddo
+      do  i=mm,mm1,-1
+         z( jg(1,i,1), jg(2,i,1), jg(3,i,1) ) = +1.0D0
+      enddo
+
+      call comm3(z,n1,n2,n3,k)
+
+!---------------------------------------------------------------------
+!          call showall(z,n1,n2,n3)
+!---------------------------------------------------------------------
+
+      return
+      end
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine showall(z,n1,n2,n3)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      implicit none
+
+
+      integer n1,n2,n3,i1,i2,i3
+      double precision z(n1,n2,n3)
+      integer m1, m2, m3
+
+      m1 = min(n1,18)
+      m2 = min(n2,14)
+      m3 = min(n3,18)
+
+      write(*,*)'  '
+      do  i3=1,m3
+         do  i1=1,m1
+            write(*,6)(z(i1,i2,i3),i2=1,m2)
+         enddo
+         write(*,*)' - - - - - - - '
+      enddo
+      write(*,*)'  '
+ 6    format(15f6.3)
+
+      return
+      end
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      double precision function power( a, n )
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     power  raises an integer, disguised as a double
+!     precision real, to an integer power
+!---------------------------------------------------------------------
+      implicit none
+
+      double precision a, aj
+      integer n, nj
+      external randlc
+      double precision randlc, rdummy
+
+      power = 1.0D0
+      nj = n
+      aj = a
+ 100  continue
+
+      if( nj .eq. 0 ) goto 200
+      if( mod(nj,2) .eq. 1 ) rdummy =  randlc( power, aj )
+      rdummy = randlc( aj, aj )
+      nj = nj/2
+      go to 100
+
+ 200  continue
+      return
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine bubble( ten, j1, j2, j3, m, ind )
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!     bubble does one bubble pass in direction ind (1 = up, 0 = down)
+!---------------------------------------------------------------------
+      implicit none
+
+
+      integer m, ind, j1( m, 0:1 ), j2( m, 0:1 ), j3( m, 0:1 )
+      double precision ten( m, 0:1 )
+      double precision temp
+      integer i, j_temp
+
+      if( ind .eq. 1 )then
+
+         do  i=1,m-1
+            if( ten(i,ind) .gt. ten(i+1,ind) )then
+
+               temp = ten( i+1, ind )
+               ten( i+1, ind ) = ten( i, ind )
+               ten( i, ind ) = temp
+
+               j_temp           = j1( i+1, ind )
+               j1( i+1, ind ) = j1( i,   ind )
+               j1( i,   ind ) = j_temp
+
+               j_temp           = j2( i+1, ind )
+               j2( i+1, ind ) = j2( i,   ind )
+               j2( i,   ind ) = j_temp
+
+               j_temp           = j3( i+1, ind )
+               j3( i+1, ind ) = j3( i,   ind )
+               j3( i,   ind ) = j_temp
+
+            else
+               return
+            endif
+         enddo
+
+      else
+
+         do  i=1,m-1
+            if( ten(i,ind) .lt. ten(i+1,ind) )then
+
+               temp = ten( i+1, ind )
+               ten( i+1, ind ) = ten( i, ind )
+               ten( i, ind ) = temp
+
+               j_temp           = j1( i+1, ind )
+               j1( i+1, ind ) = j1( i,   ind )
+               j1( i,   ind ) = j_temp
+
+               j_temp           = j2( i+1, ind )
+               j2( i+1, ind ) = j2( i,   ind )
+               j2( i,   ind ) = j_temp
+
+               j_temp           = j3( i+1, ind )
+               j3( i+1, ind ) = j3( i,   ind )
+               j3( i,   ind ) = j_temp
+
+            else
+               return
+            endif
+         enddo
+
+      endif
+
+      return
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine zero3(z,n1,n2,n3)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      implicit none
+
+
+      integer n1, n2, n3
+      double precision z(n1,n2,n3)
+      integer i1, i2, i3
+
+!$omp parallel do default(shared) private(i1,i2,i3) collapse(2)
+      do  i3=1,n3
+         do  i2=1,n2
+            do  i1=1,n1
+               z(i1,i2,i3)=0.0D0
+            enddo
+         enddo
+      enddo
+
+      return
+      end
+
+
+!----- end of program ------------------------------------------------
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/MG/mg.input.sample b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/MG/mg.input.sample
new file mode 100644
index 000000000..a4dcf8127
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/MG/mg.input.sample
@@ -0,0 +1,4 @@
+ 8 = top level
+ 256 256 256 = nx ny nz
+ 20 = nit
+ 0 0 0 0 0 0 0 0 = debug_vec
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/MG/mg_data.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/MG/mg_data.f90
new file mode 100644
index 000000000..40dfc2fc7
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/MG/mg_data.f90
@@ -0,0 +1,122 @@
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+!
+!  mg_data module
+!
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      module mg_data
+
+!---------------------------------------------------------------------
+!  Parameter lm (declared and set in "npbparams.h") is the log-base-2 of
+!  the maximum edge size of the partition on a given node, so it must be
+!  reduced to save space (when running a small case) or increased for
+!  larger cases, for example 512^3. Thus lm=7 means that the largest
+!  dimension of a partition that can be solved on a node is 2^7 = 128.
+!  lm is set automatically in npbparams.h.
+!  Parameters ndim1, ndim2, ndim3 are the local problem dimensions.
+!---------------------------------------------------------------------
+
+      include 'npbparams.h'
+
+      integer nm  &   ! actual dimension including ghost cells for communications
+     &      , maxlevel! maximum number of levels
+! ... kind2 is defined in npbparams.h
+      integer(kind2) one  &
+     &      , nv  &   ! size of rhs array
+     &      , nr      ! size of residual array
+
+      parameter( one=1 )
+      parameter( nm=2+2**lm, maxlevel=(lt_default+1) )
+      parameter( nv=one*(2+2**ndim1)*(2+2**ndim2)*(2+2**ndim3) )
+      parameter( nr = ((nv+nm**2+5*nm+7*lm+6)/7)*8 )
+
+!---------------------------------------------------------------------
+      integer  nx(maxlevel),ny(maxlevel),nz(maxlevel)
+
+      character class
+
+      integer debug_vec(0:7)
+
+      integer m1(maxlevel), m2(maxlevel), m3(maxlevel)
+      integer lt, lb
+      integer(kind2) ir(maxlevel)
+
+!---------------------------------------------------------------------
+!  Grid starts and ends
+!---------------------------------------------------------------------
+      integer  is1, is2, is3, ie1, ie2, ie3
+
+! ... rans_save
+      double precision starts(nm)
+
+!---------------------------------------------------------------------
+!  Set at m=1024, can handle cases up to 1024^3
+!---------------------------------------------------------------------
+      integer m
+!      parameter( m=1037 )
+      parameter( m=nm+1 )
+
+      logical timeron
+      integer T_init, T_bench, T_psinv, T_resid, T_rprj3, T_interp,  &
+     &        T_norm2, T_mg3P, T_resid2, T_comm3, T_last
+      parameter (T_init=1, T_bench=2, T_mg3P=3,  &
+     &        T_psinv=4, T_resid=5, T_resid2=6, T_rprj3=7,  &
+     &        T_interp=8, T_norm2=9, T_comm3=10, T_last=10)
+
+
+      end module mg_data
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+!
+!  mg_fields module
+!
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      module mg_fields
+
+!---------------------------------------------------------------------------c
+! These arrays are in module because they are quite large
+! and probably shouldn't be allocated on the stack.
+!---------------------------------------------------------------------------c
+
+      double precision, allocatable :: u(:),v(:),r(:)
+
+      double precision a(0:3),c(0:3)
+
+      end module mg_fields
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine alloc_space
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+! allocate space dynamically for field arrays
+!---------------------------------------------------------------------
+
+      use mg_data, only : nr, nv
+      use mg_fields
+
+      implicit none
+
+      integer ios
+
+      allocate( u(nr), v(nv), r(nr),  &
+     &          stat = ios )
+
+      if (ios .ne. 0) then
+         write(*,*) 'Error encountered in allocating space'
+         stop
+      endif
+
+      return
+      end
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/Makefile b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/Makefile
new file mode 100644
index 000000000..6e2e0ab42
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/Makefile
@@ -0,0 +1,72 @@
+SHELL=/bin/sh
+CLASS=W
+VERSION=
+SFILE=config/suite.def
+
+default: header
+	@ sys/print_instructions
+
+BT: bt
+bt: header
+	cd BT; $(MAKE) CLASS=$(CLASS) VERSION=$(VERSION)
+
+SP: sp		       
+sp: header	       
+	cd SP; $(MAKE) CLASS=$(CLASS)
+
+LU: lu		       
+lu: header	       
+	cd LU; $(MAKE) CLASS=$(CLASS) VERSION=$(VERSION)
+
+MG: mg		       
+mg: header	       
+	cd MG; $(MAKE) CLASS=$(CLASS)
+
+FT: ft		       
+ft: header	       
+	cd FT; $(MAKE) CLASS=$(CLASS)
+
+IS: is		       
+is: header	       
+	cd IS; $(MAKE) CLASS=$(CLASS)
+
+CG: cg		       
+cg: header	       
+	cd CG; $(MAKE) CLASS=$(CLASS)
+
+EP: ep		       
+ep: header	       
+	cd EP; $(MAKE) CLASS=$(CLASS)
+
+UA: ua
+ua: header	       
+	cd UA; $(MAKE) CLASS=$(CLASS)
+
+DC: dc
+dc: header	       
+	cd DC; $(MAKE) CLASS=$(CLASS)
+
+# Awk script courtesy cmg@cray.com, modified by Haoqiang Jin
+suite:
+	@ awk -f sys/suite.awk SMAKE=$(MAKE) $(SFILE) | $(SHELL)
+
+
+# It would be nice to make clean in each subdirectory (the targets
+# are defined) but on a really clean system this won't work
+# because those makefiles need config/make.def
+clean:
+	- rm -f core *~ */core */*~
+	- rm -f */*.o */*.obj */*.exe */*.mod */npbparams.h */blk_par.h
+	- rm -f sys/setparams sys/makesuite sys/setparams.h
+	- rm -rf */rii_files
+
+veryclean: clean
+	- rm -f config/make.def config/suite.def 
+	- rm -f bin/sp.* bin/lu.* bin/mg.* bin/ft.* bin/bt.* bin/is.*
+	- rm -f bin/ep.* bin/cg.* bin/ua.* bin/dc.* bin/ADC.*
+
+header:
+	@ sys/print_header
+
+
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/README b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/README
new file mode 100644
index 000000000..3025b3b04
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/README
@@ -0,0 +1,87 @@
+The OpenMP implementation of NPB 3.4.2 (NPB3.4-OMP)
+----------------------------------------------------
+
+For problem reports and suggestions on the implementation, 
+please contact:
+
+   NAS Parallel Benchmark Team
+   npb@nas.nasa.gov
+
+
+This directory contains the OpenMP implementation of the NAS
+Parallel Benchmarks, Version 3.4.2 (NPB3.4-OMP).  A brief
+summary of the new features introduced in this version is
+given below.
+
+For changes from different versions, see the Changes.log file
+included in the upper directory of this distribution.
+
+For explanation of compilation and running of the benchmarks,
+please refer to README.install.
+
+
+New features in NPB3.4-OMP of NPB 3.4.2:
+  * New verification scheme for EP
+
+
+New features in NPB3.4-OMP of NPB 3.4.1:
+  * Changed Fortran sources from fixed form to free form
+
+  * The blocking factor for FT can now be set via make option
+    "VERSION=blk<n>"
+
+  * Minor bug fix in reporting Fortran compiler used (F77->FC)
+
+  * Changed the reference of "INTEGER*8" to "INTEGER(8)" in randi8.f
+
+  * The proper data type for CG is set via setparams based on problem size,
+    so the [-i8] flag for building Class E or F is no longer required.
+
+
+New features in NPB3.4-OMP:
+  * Added the class E problem size for IS, and the class F problem 
+    size for BT, LU, SP, CG, EP, FT, and MG.
+
+  * Improves loop-level parallelism with the use of the OpenMP
+    COLLAPSE clause available since OpenMP 3.0.  This version 
+    requires an OpenMP compiler that supports this feature.
+
+  * Re-introduced the hyperplane implementation of LU in the 
+    distribution, which is accessible via the VERSION=HP make
+    option during compilation. Included a third version of LU
+    that uses the DOACROSS feature in OpenMP 4.0.  This version
+    requires an OpenMP compiler that supports this feature.
+
+  * Included versions of BT and SP with blocking factor to improve 
+    cache performance, and selectable via the VERSION=BLK make
+    option during compilation. These versions supersede the "vector"
+    version introduced in version 3.3.
+
+  * Included a version of UA that uses array reduction for atomic
+    updates.  This version is selectable via the VERSION=rd make
+    option during compilation.
+
+  * The version uses Fortran modules and allocatable arrays to define 
+    and manage global data (to replace common blocks) and Fortran 2003 
+    IEEE arithmetic function to catch the NaN condition during verification.
+
+    The version requires a compiler that supports features available
+    in Fortran 90 and 2003. Because of these changes, the F77 flag 
+    in make.def is renamed to FC.
+
+  * The environment variable NPB_TIMER_FLAG is now used to enable 
+    additional timers.
+
+
+New features in NPB3.3-OMP:
+   * NPB3.3-OMP introduces a new problem size (class E) to seven of 
+     the benchmarks (BT, SP, LU, CG, MG, FT, and EP). The version 
+     also includes a new problem size (class D) for the IS benchmark, 
+     which was not present in the previous releases.
+
+   * The release is merged with the vector codes for the BT and LU 
+     benchmarks, which can be selected with the VERSION=VEC option 
+     during compilation.  However, successful vectorization highly 
+     depends on the compiler used.  Some changes to compiler directives 
+     for vectorization in the current codes (see *_vec.f files)
+     may be required.
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/README.install b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/README.install
new file mode 100644
index 000000000..745ff810b
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/README.install
@@ -0,0 +1,190 @@
+Some explanations on the OpenMP implementation of NPB 3.4.2 (NPB3.4-OMP)
+------------------------------------------------------------------------
+
+NPB-OMP is a sample OpenMP implementation based on NPB3-SER,
+the sequential implementation of the NAS Parallel Benchmarks.
+This implementation (NPB3.4-OMP) contains all ten benchmarks:
+eight in Fortran: BT, SP, LU, FT, CG, MG, EP, and UA; 
+two in C: IS and DC.
+
+For changes from different versions, see the Changes.log file
+included in the upper directory of this distribution.
+
+This version has been tested, among others, on SGI Origin3000 and
+SGI Altix.  For problem reports and suggestions on the implementation, 
+please contact
+
+   NAS Parallel Benchmark Team
+   npb@nas.nasa.gov
+
+
+1. Compilation
+
+   NPB3.x-OMP uses the same directory tree as NPB3.x-SER (and NPB2.x) does.
+   Before compilation, one needs to check the configuration file
+   'make.def' in the config directory and modify the file if necessary. 
+   If it does not yet exist, copy 'make.def.template' or one of the
+   sample files in the NAS.samples subdirectory to 'make.def' and
+   edit the content for site- and machine-specific data.  Some of the
+   flags to be specified in make.def are:
+
+      FC     - Fortran compiler
+      FFLAGS - Fortran compilation flags
+      FLINK  - Fortran linker, usually the same as FC
+      CC     - C compiler
+      CFLAGS - C compilation flags
+      CLINK  - C linker, usually the same as CC
+
+   Then
+
+      make <benchmark> CLASS=<class> [VERSION=<opt>]
+
+   <benchmark> is one of (BT, SP, LU, FT, CG, MG, EP, UA, IS, DC) 
+   and <class> is one of (S, W, A, B, C, D, E, F), except for the following:
+      class F not defined for IS,
+      classes E and F not defined for UA,
+      classes C, D, E, and F not defined for DC.
+
+   The "VERSION=blk" option is used to set the blocking factor for
+   the blocking version of BT, SP and FT.  Without this option
+   the non-blocking version of BT and SP is selected.  The default
+   blocking factor is 8 for BT and SP, and 32 for FT.  Use option 
+   "VERSION=blk<n>" to select a different blocking factor <n>.
+
+   The "VERSION=hp" or "VERSION=doac" option is for selecting alternative
+   versions of LU. See LU/README for details of different options.
+
+   The "VERSION=au" or "VERSION=rd" option is for selecting different
+   methods of performing atomic updates in UA.  See UA/README for more
+   details.
+
+   Class D or E for IS (Integer Sort) requires a compiler/system on which
+   the C "long" type is 64-bit.  As examples, the SGI MIPS compiler for the
+   SGI Origin using the "-64" compilation flag and the Intel compiler for
+   IA64 are known to work.
+
+   To build a suite of benchmarks, one can create the file 
+   "config/suite.def", which contains a list of executables to build.
+   Each line in the file contains the name of a benchmark and the class,
+   separated by spaces or tabs (see suite.def.template for an example).
+   Then
+
+      make suite
+
+
+   ================================
+   
+   The "RAND" variable in make.def
+   --------------------------------
+   
+   Most of the NPBs use a random number generator. In two of the NPBs (FT
+   and EP) the computation of random numbers is included in the timed
+   part of the calculation, and it is important that the random number
+   generator be efficient.  The default random number generator package
+   provided is called "randi8" and should be used where possible. It has 
+   the following requirements:
+   
+   randi8:
+     1. Uses integer(8) arithmetic. Compiler must support integer(8)
+     2. Uses the Fortran 90 IAND intrinsic. Compiler must support IAND.
+     3. Assumes overflow bits are discarded by the hardware. In particular, 
+        that the lowest 46 bits of a*b are always correct, even if the 
+        result a*b is larger than 2^64. 
+   
+   Since randi8 may not work on all machines, we supply the following
+   alternatives:
+   
+   randi8_safe
+     1. Uses integer(8) arithmetic
+     2. Uses the Fortran 90 IBITS intrinsic. 
+     3. Does not make any assumptions about overflow. Should always
+        work correctly if compiler supports integer(8) and IBITS. 
+   
+   randdp
+     1. Uses double precision arithmetic (to simulate integer(8) operations). 
+        Should work with any system with support for 64-bit floating
+        point arithmetic.      
+   
+   randdpvec
+     1. Similar to randdp but written to be easier to vectorize. 
+   
+
+2. Execution
+
+   The executable is named <benchmark-name>.<class>.x and is placed
+   in the bin subdirectory (or in the directory BINDIR specified in
+   make.def, if you've defined it).  The following is an example of running
+   a benchmark in csh:
+
+      setenv OMP_NUM_THREADS 4
+      bin/bt.A.x > BT.A_out.4
+
+   It runs BT Class A problem on 4 threads and the output is stored
+   in BT.A_out.4.
+
+   Each benchmark includes a set of additional timers for profiling purposes
+   (reporting timing for selected code blocks).  By default, these timers
+   are disabled.  To enable the timers, set the environment variable
+   NPB_TIMER_FLAG to one of:
+
+      1, on, yes, true
+
+   before running a benchmark.  The previous method of creating a dummy 
+   file 'timer.flag' in the current working directory is still supported,
+   but not recommended.
+
+   The printed number of threads is the number of threads active during the
+   run, which may differ from the number requested.
+
+3. Known issues
+
+   Many of the 3.4 versions of the benchmarks use the OpenMP "COLLAPSE" 
+   clause for better parallelism.  This feature has been available since
+   OpenMP 3.0.  Thus, the 3.4 version requires a compiler that supports
+   OpenMP 3.0.
+
+   NPB-OMP assumes 'deterministic' static scheduling at run-time to 
+   ensure the correctness of the results.  Verification in some
+   benchmarks might fail if this condition is not met. 
+
+   For larger problem sizes, the default stack size for slave threads
+   may need to be increased on certain platforms.  For OpenMP 3.0-compliant
+   compilers, use the runtime environment variable:
+      setenv OMP_STACKSIZE 50m  (for 50MB)
+
+   In order to build the class E version of CG, the integer type
+   needs to be promoted to 64-bit, which is usually done through a
+   compilation flag (such as "-i8" for FFLAGS in config/make.def).
+
+4. Notes on the implementation
+
+   - Based on NPB3.0-SER, except that FT was kept closer to
+     the original version in NPB2.3-serial.
+
+   - OpenMP directives were added to the outer-most parallel loops. 
+     No nested parallelism was considered.  The 3.4 version includes
+     the use of the COLLAPSE clause for two loop nests.
+
+   - Extra loops were added in the beginning of most of the benchmarks
+     to touch data pages.  This is to set up a data layout based on the
+     'first touch' policy.
+
+   - Since there is no standard way of performing vectorization, the 
+     mileage you get from the vector version depends very much on 
+     the compiler used.  Often additional compiler directives (or flags) 
+     may be necessary for optimal results.  The "blocking" versions 
+     of BT and SP have shown better performance on some systems.
+     The proper blocking factor is system-dependent.
+
+   - For LU, the pipelined algorithm outperforms the hyperplane algorithm
+     consistently on many cache-based platforms.  So, the pipelined
+     implementation is compiled as the default.  See LU/README for
+     additional information.
+
+   - The IS OpenMP benchmark enables bucket sort by default.  To disable
+     bucket sort, comment out the line in IS/is.c:
+     #define USE_BUCKETS
+     See IS/README.carefully for additional information.
+
+   - For Unstructured Adaptive (UA) and DC benchmarks, please see 
+     UA/README or DC/README for additional instructions.
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/SP/Makefile b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/SP/Makefile
new file mode 100644
index 000000000..a786edfc7
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/SP/Makefile
@@ -0,0 +1,67 @@
+SHELL=/bin/sh
+BENCHMARK=sp
+BENCHMARKU=SP
+BLK=
+BLKFAC=0
+
+include ../config/make.def
+
+
+OBJS = sp.o sp_data.o initialize.o exact_solution.o exact_rhs.o \
+       set_constants.o adi.o rhs.o work_lhs$(BLK).o     \
+       x_solve$(BLK).o ninvr.o y_solve$(BLK).o pinvr.o    \
+       z_solve$(BLK).o tzetar.o add.o txinvr.o error.o verify.o  \
+       ${COMMON}/print_results.o ${COMMON}/timers.o ${COMMON}/wtime.o
+
+ifeq (${M5_ANNOTATION}, 1)
+	OBJS += ${COMMON}/hooks.o
+endif
+
+include ../sys/make.common
+
+# npbparams.h is included by sp_data module (via sp_data.o)
+
+${PROGRAM}: config
+	@ver=$(VERSION); bfac=`echo $$ver|sed -e 's/^blk//' -e 's/^BLK//'`; \
+	if [ x$$ver != x$$bfac ] ; then		\
+		${MAKE} BLK=_blk BLKFAC=$${bfac:-8} exec;		\
+	else					\
+		${MAKE} exec;			\
+	fi
+
+exec: $(OBJS)
+	${FLINK} ${FLINKFLAGS} -o ${PROGRAM} ${OBJS} ${F_LIB}
+
+.f90.o:
+	${FCOMPILE} $<
+
+blk_par.h: FORCE
+	sed -e 's/= 0/= $(BLKFAC)/' blk_par0.h > blk_par.h_wk
+	@ if ! `diff blk_par.h_wk blk_par.h > /dev/null 2>&1`; then \
+	mv -f blk_par.h_wk blk_par.h; else rm -f blk_par.h_wk; fi
+FORCE:
+
+sp.o:             sp.f90 sp_data.o blk_par.h
+initialize.o:     initialize.f90 sp_data.o
+exact_solution.o: exact_solution.f90 sp_data.o
+exact_rhs.o:      exact_rhs.f90 sp_data.o
+set_constants.o:  set_constants.f90 sp_data.o
+adi.o:            adi.f90 sp_data.o
+rhs.o:            rhs.f90 sp_data.o
+x_solve$(BLK).o:  x_solve$(BLK).f90 sp_data.o work_lhs$(BLK).o
+ninvr.o:          ninvr.f90 sp_data.o
+y_solve$(BLK).o:  y_solve$(BLK).f90 sp_data.o work_lhs$(BLK).o
+pinvr.o:          pinvr.f90 sp_data.o
+z_solve$(BLK).o:  z_solve$(BLK).f90 sp_data.o work_lhs$(BLK).o
+tzetar.o:         tzetar.f90 sp_data.o
+add.o:            add.f90 sp_data.o
+txinvr.o:         txinvr.f90 sp_data.o
+error.o:          error.f90 sp_data.o
+verify.o:         verify.f90 sp_data.o
+work_lhs$(BLK).o: work_lhs$(BLK).f90 sp_data.o blk_par.h
+sp_data.o:        sp_data.f90 npbparams.h
+
+clean:
+	- rm -f *.o *~ *.mod mputil*
+	- rm -f npbparams.h core blk_par.h
+	- if [ -d rii_files ]; then rm -r rii_files; fi
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/SP/README b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/SP/README
new file mode 100644
index 000000000..afa11c5ef
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/SP/README
@@ -0,0 +1,13 @@
+This directory contains two versions of the SP implementation:
+
+- the standard version that has better cache utilization
+- the "blocking" version that contains codes for better vectorization
+
+For most platforms, the standard version gives reasonable performance. 
+To access the blocking version, use the VERSION=BLK make flag, such as,
+   make CLASS=A VERSION=BLK
+
+Since there is no standard way of performing vectorization, the mileage
+you get from the vector version depends very much on compilers.  Often
+additional compiler directives (or flags) may be necessary for optimal
+results.  The current version is intended to only serve as a baseline.
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/SP/add.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/SP/add.f90
new file mode 100644
index 000000000..1e7ec092b
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/SP/add.f90
@@ -0,0 +1,34 @@
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+       subroutine  add
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+! addition of update to the vector u
+!---------------------------------------------------------------------
+
+       use sp_data
+       implicit none
+
+       integer i,j,k,m
+
+       if (timeron) call timer_start(t_add)
+!$omp parallel do default(shared) private(i,j,k,m) collapse(2)
+       do k = 1, nz2
+          do j = 1, ny2
+             do i = 1, nx2
+                do m = 1, 5
+                   u(m,i,j,k) = u(m,i,j,k) + rhs(m,i,j,k)
+                end do
+             end do
+          end do
+       end do
+       if (timeron) call timer_stop(t_add)
+
+       return
+       end
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/SP/adi.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/SP/adi.f90
new file mode 100644
index 000000000..6db71a526
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/SP/adi.f90
@@ -0,0 +1,24 @@
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+       subroutine  adi
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+       call compute_rhs
+
+       call txinvr
+
+       call x_solve
+
+       call y_solve
+
+       call z_solve
+
+       call add
+
+       return
+       end
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/SP/blk_par0.h b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/SP/blk_par0.h
new file mode 100644
index 000000000..eec3a0783
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/SP/blk_par0.h
@@ -0,0 +1,10 @@
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+!
+!  blocking factor
+!
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      integer bsize, blkdim
+      parameter (bsize = 0, blkdim = bsize)
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/SP/error.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/SP/error.f90
new file mode 100644
index 000000000..389c31fed
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/SP/error.f90
@@ -0,0 +1,95 @@
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+       subroutine error_norm(rms)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+! this function computes the norm of the difference between the
+! computed solution and the exact solution
+!---------------------------------------------------------------------
+
+       use sp_data
+       implicit none
+
+       integer i, j, k, m, d
+       double precision xi, eta, zeta, u_exact(5), rms(5), add
+
+       do m = 1, 5
+          rms(m) = 0.0d0
+       enddo
+
+!$omp parallel do default(shared)  &
+!$omp&   private(i,j,k,m,zeta,eta,xi,add,u_exact)  &
+!$omp&   reduction(+:rms)  &
+!$omp&   schedule(static) collapse(2)
+       do   k = 0, grid_points(3)-1
+          do   j = 0, grid_points(2)-1
+             zeta = dble(k) * dnzm1
+             eta = dble(j) * dnym1
+             do   i = 0, grid_points(1)-1
+                xi = dble(i) * dnxm1
+                call exact_solution(xi, eta, zeta, u_exact)
+
+                do   m = 1, 5
+                   add = u(m,i,j,k)-u_exact(m)
+                   rms(m) = rms(m) + add*add
+                end do
+             end do
+          end do
+       end do
+!$omp end parallel do
+
+       do    m = 1, 5
+          do    d = 1, 3
+             rms(m) = rms(m) / dble(grid_points(d)-2)
+          end do
+          rms(m) = dsqrt(rms(m))
+       end do
+
+       return
+       end
+
+
+
+       subroutine rhs_norm(rms)
+
+       use sp_data
+       implicit none
+
+       integer i, j, k, d, m
+       double precision rms(5), add
+
+       do m = 1, 5
+          rms(m) = 0.0d0
+       enddo
+
+!$omp parallel do default(shared) private(i,j,k,m,add)  &
+!$omp&   reduction(+:rms)  &
+!$omp&   schedule(static) collapse(2)
+       do k = 1, nz2
+          do j = 1, ny2
+             do i = 1, nx2
+                do m = 1, 5
+                   add = rhs(m,i,j,k)
+                   rms(m) = rms(m) + add*add
+                end do 
+             end do 
+          end do 
+       end do 
+!$omp end parallel do
+
+       do   m = 1, 5
+          do   d = 1, 3
+             rms(m) = rms(m) / dble(grid_points(d)-2)
+          end do
+          rms(m) = dsqrt(rms(m))
+       end do
+
+       return
+       end
+
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/SP/exact_rhs.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/SP/exact_rhs.f90
new file mode 100644
index 000000000..d714eb603
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/SP/exact_rhs.f90
@@ -0,0 +1,356 @@
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+       subroutine exact_rhs
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+! compute the right hand side based on exact solution
+!---------------------------------------------------------------------
+
+       use sp_data
+       implicit none
+
+       double precision dtemp(5), xi, eta, zeta, dtpp
+       integer          m, i, j, k, ip1, im1, jp1,  &
+     &                  jm1, km1, kp1
+
+!$omp parallel default(shared)  &
+!$omp& private(i,j,k,m,zeta,eta,xi,dtpp,im1,ip1,  &
+!$omp&         jm1,jp1,km1,kp1,dtemp)
+!---------------------------------------------------------------------
+!      initialize                                  
+!---------------------------------------------------------------------
+!$omp do schedule(static) collapse(2)
+       do   k= 0, grid_points(3)-1
+          do   j = 0, grid_points(2)-1
+             do   i = 0, grid_points(1)-1
+                do   m = 1, 5
+                   forcing(m,i,j,k) = 0.0d0
+                end do
+             end do
+          end do
+       end do
+
+!---------------------------------------------------------------------
+!      xi-direction flux differences                      
+!---------------------------------------------------------------------
+!$omp do schedule(static) collapse(2)
+       do   k = 1, grid_points(3)-2
+          do   j = 1, grid_points(2)-2
+             zeta = dble(k) * dnzm1
+             eta = dble(j) * dnym1
+
+             do  i=0, grid_points(1)-1
+                xi = dble(i) * dnxm1
+
+                call exact_solution(xi, eta, zeta, dtemp)
+                do  m = 1, 5
+                   ue(i,m) = dtemp(m)
+                end do
+
+                dtpp = 1.0d0 / dtemp(1)
+
+                do  m = 2, 5
+                   buf(i,m) = dtpp * dtemp(m)
+                end do
+
+                cuf(i)   = buf(i,2) * buf(i,2)
+                buf(i,1) = cuf(i) + buf(i,3) * buf(i,3) +  &
+     &                     buf(i,4) * buf(i,4) 
+                q(i) = 0.5d0*(buf(i,2)*ue(i,2) + buf(i,3)*ue(i,3) +  &
+     &                        buf(i,4)*ue(i,4))
+
+             end do
+ 
+             do  i = 1, grid_points(1)-2
+                im1 = i-1
+                ip1 = i+1
+
+                forcing(1,i,j,k) = forcing(1,i,j,k) -  &
+     &                 tx2*( ue(ip1,2)-ue(im1,2) )+  &
+     &                 dx1tx1*(ue(ip1,1)-2.0d0*ue(i,1)+ue(im1,1))
+
+                forcing(2,i,j,k) = forcing(2,i,j,k) - tx2 * (  &
+     &                (ue(ip1,2)*buf(ip1,2)+c2*(ue(ip1,5)-q(ip1)))-  &
+     &                (ue(im1,2)*buf(im1,2)+c2*(ue(im1,5)-q(im1))))+  &
+     &                 xxcon1*(buf(ip1,2)-2.0d0*buf(i,2)+buf(im1,2))+  &
+     &                 dx2tx1*( ue(ip1,2)-2.0d0* ue(i,2)+ue(im1,2))
+
+                forcing(3,i,j,k) = forcing(3,i,j,k) - tx2 * (  &
+     &                 ue(ip1,3)*buf(ip1,2)-ue(im1,3)*buf(im1,2))+  &
+     &                 xxcon2*(buf(ip1,3)-2.0d0*buf(i,3)+buf(im1,3))+  &
+     &                 dx3tx1*( ue(ip1,3)-2.0d0*ue(i,3) +ue(im1,3))
+                  
+                forcing(4,i,j,k) = forcing(4,i,j,k) - tx2*(  &
+     &                 ue(ip1,4)*buf(ip1,2)-ue(im1,4)*buf(im1,2))+  &
+     &                 xxcon2*(buf(ip1,4)-2.0d0*buf(i,4)+buf(im1,4))+  &
+     &                 dx4tx1*( ue(ip1,4)-2.0d0* ue(i,4)+ ue(im1,4))
+
+                forcing(5,i,j,k) = forcing(5,i,j,k) - tx2*(  &
+     &                 buf(ip1,2)*(c1*ue(ip1,5)-c2*q(ip1))-  &
+     &                 buf(im1,2)*(c1*ue(im1,5)-c2*q(im1)))+  &
+     &                 0.5d0*xxcon3*(buf(ip1,1)-2.0d0*buf(i,1)+  &
+     &                               buf(im1,1))+  &
+     &                 xxcon4*(cuf(ip1)-2.0d0*cuf(i)+cuf(im1))+  &
+     &                 xxcon5*(buf(ip1,5)-2.0d0*buf(i,5)+buf(im1,5))+  &
+     &                 dx5tx1*( ue(ip1,5)-2.0d0* ue(i,5)+ ue(im1,5))
+             end do
+
+!---------------------------------------------------------------------
+!            Fourth-order dissipation                         
+!---------------------------------------------------------------------
+             do   m = 1, 5
+                i = 1
+                forcing(m,i,j,k) = forcing(m,i,j,k) - dssp *  &
+     &                    (5.0d0*ue(i,m) - 4.0d0*ue(i+1,m) +ue(i+2,m))
+                i = 2
+                forcing(m,i,j,k) = forcing(m,i,j,k) - dssp *  &
+     &                   (-4.0d0*ue(i-1,m) + 6.0d0*ue(i,m) -  &
+     &                     4.0d0*ue(i+1,m) +       ue(i+2,m))
+             end do
+
+             do   m = 1, 5
+                do  i = 3, grid_points(1)-4
+                   forcing(m,i,j,k) = forcing(m,i,j,k) - dssp*  &
+     &                   (ue(i-2,m) - 4.0d0*ue(i-1,m) +  &
+     &                    6.0d0*ue(i,m) - 4.0d0*ue(i+1,m) + ue(i+2,m))
+                end do
+             end do
+
+             do   m = 1, 5
+                i = grid_points(1)-3
+                forcing(m,i,j,k) = forcing(m,i,j,k) - dssp *  &
+     &                   (ue(i-2,m) - 4.0d0*ue(i-1,m) +  &
+     &                    6.0d0*ue(i,m) - 4.0d0*ue(i+1,m))
+                i = grid_points(1)-2
+                forcing(m,i,j,k) = forcing(m,i,j,k) - dssp *  &
+     &                   (ue(i-2,m) - 4.0d0*ue(i-1,m) + 5.0d0*ue(i,m))
+             end do
+
+          end do
+       end do
+!$omp end do
+
+!---------------------------------------------------------------------
+!  eta-direction flux differences             
+!---------------------------------------------------------------------
+!$omp do schedule(static) collapse(2)
+       do   k = 1, grid_points(3)-2          
+          do   i=1, grid_points(1)-2
+             zeta = dble(k) * dnzm1
+             xi = dble(i) * dnxm1
+
+             do  j=0, grid_points(2)-1
+                eta = dble(j) * dnym1
+
+                call exact_solution(xi, eta, zeta, dtemp)
+                do   m = 1, 5 
+                   ue(j,m) = dtemp(m)
+                end do
+                dtpp = 1.0d0/dtemp(1)
+
+                do  m = 2, 5
+                   buf(j,m) = dtpp * dtemp(m)
+                end do
+
+                cuf(j)   = buf(j,3) * buf(j,3)
+                buf(j,1) = cuf(j) + buf(j,2) * buf(j,2) +  &
+     &                     buf(j,4) * buf(j,4)
+                q(j) = 0.5d0*(buf(j,2)*ue(j,2) + buf(j,3)*ue(j,3) +  &
+     &                        buf(j,4)*ue(j,4))
+             end do
+
+             do  j = 1, grid_points(2)-2
+                jm1 = j-1
+                jp1 = j+1
+                  
+                forcing(1,i,j,k) = forcing(1,i,j,k) -  &
+     &                ty2*( ue(jp1,3)-ue(jm1,3) )+  &
+     &                dy1ty1*(ue(jp1,1)-2.0d0*ue(j,1)+ue(jm1,1))
+
+                forcing(2,i,j,k) = forcing(2,i,j,k) - ty2*(  &
+     &                ue(jp1,2)*buf(jp1,3)-ue(jm1,2)*buf(jm1,3))+  &
+     &                yycon2*(buf(jp1,2)-2.0d0*buf(j,2)+buf(jm1,2))+  &
+     &                dy2ty1*( ue(jp1,2)-2.0d0* ue(j,2)+ ue(jm1,2))
+
+                forcing(3,i,j,k) = forcing(3,i,j,k) - ty2*(  &
+     &                (ue(jp1,3)*buf(jp1,3)+c2*(ue(jp1,5)-q(jp1)))-  &
+     &                (ue(jm1,3)*buf(jm1,3)+c2*(ue(jm1,5)-q(jm1))))+  &
+     &                yycon1*(buf(jp1,3)-2.0d0*buf(j,3)+buf(jm1,3))+  &
+     &                dy3ty1*( ue(jp1,3)-2.0d0*ue(j,3) +ue(jm1,3))
+
+                forcing(4,i,j,k) = forcing(4,i,j,k) - ty2*(  &
+     &                ue(jp1,4)*buf(jp1,3)-ue(jm1,4)*buf(jm1,3))+  &
+     &                yycon2*(buf(jp1,4)-2.0d0*buf(j,4)+buf(jm1,4))+  &
+     &                dy4ty1*( ue(jp1,4)-2.0d0*ue(j,4)+ ue(jm1,4))
+
+                forcing(5,i,j,k) = forcing(5,i,j,k) - ty2*(  &
+     &                buf(jp1,3)*(c1*ue(jp1,5)-c2*q(jp1))-  &
+     &                buf(jm1,3)*(c1*ue(jm1,5)-c2*q(jm1)))+  &
+     &                0.5d0*yycon3*(buf(jp1,1)-2.0d0*buf(j,1)+  &
+     &                              buf(jm1,1))+  &
+     &                yycon4*(cuf(jp1)-2.0d0*cuf(j)+cuf(jm1))+  &
+     &                yycon5*(buf(jp1,5)-2.0d0*buf(j,5)+buf(jm1,5))+  &
+     &                dy5ty1*(ue(jp1,5)-2.0d0*ue(j,5)+ue(jm1,5))
+             end do
+
+!---------------------------------------------------------------------
+!            Fourth-order dissipation                      
+!---------------------------------------------------------------------
+             do   m = 1, 5
+                j = 1
+                forcing(m,i,j,k) = forcing(m,i,j,k) - dssp *  &
+     &                    (5.0d0*ue(j,m) - 4.0d0*ue(j+1,m) +ue(j+2,m))
+                j = 2
+                forcing(m,i,j,k) = forcing(m,i,j,k) - dssp *  &
+     &                   (-4.0d0*ue(j-1,m) + 6.0d0*ue(j,m) -  &
+     &                     4.0d0*ue(j+1,m) +       ue(j+2,m))
+             end do
+
+             do   m = 1, 5
+                do  j = 3, grid_points(2)-4
+                   forcing(m,i,j,k) = forcing(m,i,j,k) - dssp*  &
+     &                   (ue(j-2,m) - 4.0d0*ue(j-1,m) +  &
+     &                    6.0d0*ue(j,m) - 4.0d0*ue(j+1,m) + ue(j+2,m))
+                end do
+             end do
+
+             do   m = 1, 5
+                j = grid_points(2)-3
+                forcing(m,i,j,k) = forcing(m,i,j,k) - dssp *  &
+     &                   (ue(j-2,m) - 4.0d0*ue(j-1,m) +  &
+     &                    6.0d0*ue(j,m) - 4.0d0*ue(j+1,m))
+                j = grid_points(2)-2
+                forcing(m,i,j,k) = forcing(m,i,j,k) - dssp *  &
+     &                   (ue(j-2,m) - 4.0d0*ue(j-1,m) + 5.0d0*ue(j,m))
+
+             end do
+
+          end do
+       end do
+
+!---------------------------------------------------------------------
+!      zeta-direction flux differences                      
+!---------------------------------------------------------------------
+!$omp do schedule(static) collapse(2)
+       do  j=1, grid_points(2)-2
+          do   i = 1, grid_points(1)-2
+          eta = dble(j) * dnym1
+             xi = dble(i) * dnxm1
+
+             do k=0, grid_points(3)-1
+                zeta = dble(k) * dnzm1
+
+                call exact_solution(xi, eta, zeta, dtemp)
+                do   m = 1, 5
+                   ue(k,m) = dtemp(m)
+                end do
+
+                dtpp = 1.0d0/dtemp(1)
+
+                do   m = 2, 5
+                   buf(k,m) = dtpp * dtemp(m)
+                end do
+
+                cuf(k)   = buf(k,4) * buf(k,4)
+                buf(k,1) = cuf(k) + buf(k,2) * buf(k,2) +  &
+     &                     buf(k,3) * buf(k,3)
+                q(k) = 0.5d0*(buf(k,2)*ue(k,2) + buf(k,3)*ue(k,3) +  &
+     &                        buf(k,4)*ue(k,4))
+             end do
+
+             do    k=1, grid_points(3)-2
+                km1 = k-1
+                kp1 = k+1
+
+                forcing(1,i,j,k) = forcing(1,i,j,k) -  &
+     &                 tz2*( ue(kp1,4)-ue(km1,4) )+  &
+     &                 dz1tz1*(ue(kp1,1)-2.0d0*ue(k,1)+ue(km1,1))
+
+                forcing(2,i,j,k) = forcing(2,i,j,k) - tz2 * (  &
+     &                 ue(kp1,2)*buf(kp1,4)-ue(km1,2)*buf(km1,4))+  &
+     &                 zzcon2*(buf(kp1,2)-2.0d0*buf(k,2)+buf(km1,2))+  &
+     &                 dz2tz1*( ue(kp1,2)-2.0d0* ue(k,2)+ ue(km1,2))
+
+                forcing(3,i,j,k) = forcing(3,i,j,k) - tz2 * (  &
+     &                 ue(kp1,3)*buf(kp1,4)-ue(km1,3)*buf(km1,4))+  &
+     &                 zzcon2*(buf(kp1,3)-2.0d0*buf(k,3)+buf(km1,3))+  &
+     &                 dz3tz1*(ue(kp1,3)-2.0d0*ue(k,3)+ue(km1,3))
+
+                forcing(4,i,j,k) = forcing(4,i,j,k) - tz2 * (  &
+     &                (ue(kp1,4)*buf(kp1,4)+c2*(ue(kp1,5)-q(kp1)))-  &
+     &                (ue(km1,4)*buf(km1,4)+c2*(ue(km1,5)-q(km1))))+  &
+     &                zzcon1*(buf(kp1,4)-2.0d0*buf(k,4)+buf(km1,4))+  &
+     &                dz4tz1*( ue(kp1,4)-2.0d0*ue(k,4) +ue(km1,4))
+
+                forcing(5,i,j,k) = forcing(5,i,j,k) - tz2 * (  &
+     &                 buf(kp1,4)*(c1*ue(kp1,5)-c2*q(kp1))-  &
+     &                 buf(km1,4)*(c1*ue(km1,5)-c2*q(km1)))+  &
+     &                 0.5d0*zzcon3*(buf(kp1,1)-2.0d0*buf(k,1)  &
+     &                              +buf(km1,1))+  &
+     &                 zzcon4*(cuf(kp1)-2.0d0*cuf(k)+cuf(km1))+  &
+     &                 zzcon5*(buf(kp1,5)-2.0d0*buf(k,5)+buf(km1,5))+  &
+     &                 dz5tz1*( ue(kp1,5)-2.0d0*ue(k,5)+ ue(km1,5))
+             end do
+
+!---------------------------------------------------------------------
+!            Fourth-order dissipation
+!---------------------------------------------------------------------
+             do   m = 1, 5
+                k = 1
+                forcing(m,i,j,k) = forcing(m,i,j,k) - dssp *  &
+     &                    (5.0d0*ue(k,m) - 4.0d0*ue(k+1,m) +ue(k+2,m))
+                k = 2
+                forcing(m,i,j,k) = forcing(m,i,j,k) - dssp *  &
+     &                   (-4.0d0*ue(k-1,m) + 6.0d0*ue(k,m) -  &
+     &                     4.0d0*ue(k+1,m) +       ue(k+2,m))
+             end do
+
+             do   m = 1, 5
+                do  k = 3, grid_points(3)-4
+                   forcing(m,i,j,k) = forcing(m,i,j,k) - dssp*  &
+     &                   (ue(k-2,m) - 4.0d0*ue(k-1,m) +  &
+     &                    6.0d0*ue(k,m) - 4.0d0*ue(k+1,m) + ue(k+2,m))
+                end do
+             end do
+
+             do    m = 1, 5
+                k = grid_points(3)-3
+                forcing(m,i,j,k) = forcing(m,i,j,k) - dssp *  &
+     &                   (ue(k-2,m) - 4.0d0*ue(k-1,m) +  &
+     &                    6.0d0*ue(k,m) - 4.0d0*ue(k+1,m))
+                k = grid_points(3)-2
+                forcing(m,i,j,k) = forcing(m,i,j,k) - dssp *  &
+     &                   (ue(k-2,m) - 4.0d0*ue(k-1,m) + 5.0d0*ue(k,m))
+             end do
+
+          end do
+       end do
+
+!---------------------------------------------------------------------
+! now change the sign of the forcing function
+!---------------------------------------------------------------------
+!$omp do schedule(static) collapse(2)
+       do   k = 1, grid_points(3)-2
+          do   j = 1, grid_points(2)-2
+             do   i = 1, grid_points(1)-2
+                do   m = 1, 5
+                   forcing(m,i,j,k) = -1.d0 * forcing(m,i,j,k)
+                end do
+             end do
+          end do
+       end do
+!$omp end do nowait
+!$omp end parallel
+
+       return
+       end
+
+
+
+
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/SP/exact_solution.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/SP/exact_solution.f90
new file mode 100644
index 000000000..a9669ebe2
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/SP/exact_solution.f90
@@ -0,0 +1,32 @@
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine exact_solution(xi,eta,zeta,dtemp)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+! this subroutine returns the exact solution at point (xi, eta, zeta)
+!---------------------------------------------------------------------
+
+       use sp_data
+       implicit none
+
+       double precision  xi, eta, zeta
+       double precision  dtemp(5)
+       integer m
+
+       do  m = 1, 5
+          dtemp(m) =  ce(m,1) +  &
+     &    xi*(ce(m,2) + xi*(ce(m,5) + xi*(ce(m,8) + xi*ce(m,11)))) +  &
+     &    eta*(ce(m,3) + eta*(ce(m,6) + eta*(ce(m,9) + eta*ce(m,12))))+  &
+     &    zeta*(ce(m,4) + zeta*(ce(m,7) + zeta*(ce(m,10) +  &
+     &    zeta*ce(m,13))))
+       end do
+
+       return
+       end
+
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/SP/initialize.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/SP/initialize.f90
new file mode 100644
index 000000000..2808d39ae
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/SP/initialize.f90
@@ -0,0 +1,215 @@
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+       subroutine  initialize
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+! This subroutine initializes the field variable u using 
+! tri-linear transfinite interpolation of the boundary values     
+!---------------------------------------------------------------------
+
+       use sp_data
+       implicit none
+  
+       integer i, j, k, m, ix, iy, iz
+       double precision  xi, eta, zeta, Pface(5,3,2), Pxi, Peta,  &
+     &                   Pzeta, temp(5)
+    
+
+!$omp parallel default(shared)  &
+!$omp& private(i,j,k,m,zeta,eta,xi,ix,iy,iz,Pxi,Peta,Pzeta,Pface,temp)
+!---------------------------------------------------------------------
+!  Later (in compute_rhs) we compute 1/u for every element. A few of 
+!  the corner elements are not used, but it is convenient (and faster)
+!  to compute the whole thing with a simple loop. Make sure those 
+!  values are nonzero by initializing the whole thing here. 
+!---------------------------------------------------------------------
+!$omp do schedule(static) collapse(2)
+      do k = 0, grid_points(3)-1
+         do j = 0, grid_points(2)-1
+            do i = 0, grid_points(1)-1
+               u(1,i,j,k) = 1.0
+               u(2,i,j,k) = 0.0
+               u(3,i,j,k) = 0.0
+               u(4,i,j,k) = 0.0
+               u(5,i,j,k) = 1.0
+            end do
+         end do
+      end do
+!$omp end do
+
+!---------------------------------------------------------------------
+! first store the "interpolated" values everywhere on the grid    
+!---------------------------------------------------------------------
+!$omp do schedule(static) collapse(2)
+          do  k = 0, grid_points(3)-1
+             do  j = 0, grid_points(2)-1
+             zeta = dble(k) * dnzm1
+                eta = dble(j) * dnym1
+                do   i = 0, grid_points(1)-1
+                   xi = dble(i) * dnxm1
+                  
+                   do ix = 1, 2
+                      Pxi = dble(ix-1)
+                      call exact_solution(Pxi, eta, zeta,  &
+     &                                    Pface(1,1,ix))
+                   end do
+
+                   do    iy = 1, 2
+                      Peta = dble(iy-1)
+                      call exact_solution(xi, Peta, zeta,  &
+     &                                    Pface(1,2,iy))
+                   end do
+
+                   do    iz = 1, 2
+                      Pzeta = dble(iz-1)
+                      call exact_solution(xi, eta, Pzeta,   &
+     &                                    Pface(1,3,iz))
+                   end do
+
+                   do   m = 1, 5
+                      Pxi   = xi   * Pface(m,1,2) +  &
+     &                        (1.0d0-xi)   * Pface(m,1,1)
+                      Peta  = eta  * Pface(m,2,2) +  &
+     &                        (1.0d0-eta)  * Pface(m,2,1)
+                      Pzeta = zeta * Pface(m,3,2) +  &
+     &                        (1.0d0-zeta) * Pface(m,3,1)
+ 
+                      u(m,i,j,k) = Pxi + Peta + Pzeta -  &
+     &                          Pxi*Peta - Pxi*Pzeta - Peta*Pzeta +  &
+     &                          Pxi*Peta*Pzeta
+
+                   end do
+                end do
+             end do
+          end do
+!$omp end do nowait
+
+!---------------------------------------------------------------------
+! now store the exact values on the boundaries        
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+! west face                                                  
+!---------------------------------------------------------------------
+
+       xi = 0.0d0
+       i  = 0
+!$omp do schedule(static) collapse(2)
+       do  k = 0, grid_points(3)-1
+          do   j = 0, grid_points(2)-1
+          zeta = dble(k) * dnzm1
+             eta = dble(j) * dnym1
+             call exact_solution(xi, eta, zeta, temp)
+             do   m = 1, 5
+                u(m,i,j,k) = temp(m)
+             end do
+          end do
+       end do
+!$omp end do nowait
+
+!---------------------------------------------------------------------
+! east face                                                      
+!---------------------------------------------------------------------
+
+       xi = 1.0d0
+       i  = grid_points(1)-1
+!$omp do schedule(static) collapse(2)
+       do   k = 0, grid_points(3)-1
+          do   j = 0, grid_points(2)-1
+          zeta = dble(k) * dnzm1
+             eta = dble(j) * dnym1
+             call exact_solution(xi, eta, zeta, temp)
+             do   m = 1, 5
+                u(m,i,j,k) = temp(m)
+             end do
+          end do
+       end do
+!$omp end do
+
+!---------------------------------------------------------------------
+! south face                                                 
+!---------------------------------------------------------------------
+
+       eta = 0.0d0
+       j   = 0
+!$omp do schedule(static) collapse(2)
+       do  k = 0, grid_points(3)-1
+          do   i = 0, grid_points(1)-1
+          zeta = dble(k) * dnzm1
+             xi = dble(i) * dnxm1
+             call exact_solution(xi, eta, zeta, temp)
+             do   m = 1, 5
+                u(m,i,j,k) = temp(m)
+             end do
+          end do
+       end do
+!$omp end do nowait
+
+
+!---------------------------------------------------------------------
+! north face                                    
+!---------------------------------------------------------------------
+
+       eta = 1.0d0
+       j   = grid_points(2)-1
+!$omp do schedule(static) collapse(2)
+       do   k = 0, grid_points(3)-1
+          do   i = 0, grid_points(1)-1
+          zeta = dble(k) * dnzm1
+             xi = dble(i) * dnxm1
+             call exact_solution(xi, eta, zeta, temp)
+             do   m = 1, 5
+                u(m,i,j,k) = temp(m)
+             end do
+          end do
+       end do
+!$omp end do
+
+!---------------------------------------------------------------------
+! bottom face                                       
+!---------------------------------------------------------------------
+
+       zeta = 0.0d0
+       k    = 0
+!$omp do schedule(static) collapse(2)
+       do   j = 0, grid_points(2)-1
+          do   i =0, grid_points(1)-1
+          eta = dble(j) * dnym1
+             xi = dble(i) *dnxm1
+             call exact_solution(xi, eta, zeta, temp)
+             do   m = 1, 5
+                u(m,i,j,k) = temp(m)
+             end do
+          end do
+       end do
+!$omp end do nowait
+
+!---------------------------------------------------------------------
+! top face     
+!---------------------------------------------------------------------
+
+       zeta = 1.0d0
+       k    = grid_points(3)-1
+!$omp do schedule(static) collapse(2)
+       do   j = 0, grid_points(2)-1
+          do   i =0, grid_points(1)-1
+          eta = dble(j) * dnym1
+             xi = dble(i) * dnxm1
+             call exact_solution(xi, eta, zeta, temp)
+             do   m = 1, 5
+                u(m,i,j,k) = temp(m)
+             end do
+          end do
+       end do
+!$omp end do nowait
+!$omp end parallel
+
+       return
+       end
+
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/SP/inputsp.data.sample b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/SP/inputsp.data.sample
new file mode 100644
index 000000000..ae3801fdb
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/SP/inputsp.data.sample
@@ -0,0 +1,3 @@
+400       number of time steps
+0.0015d0  dt for class A = 0.0015d0. class B = 0.001d0  class C = 0.00067d0
+64 64 64
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/SP/ninvr.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/SP/ninvr.f90
new file mode 100644
index 000000000..706434b05
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/SP/ninvr.f90
@@ -0,0 +1,47 @@
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+       subroutine  ninvr
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!   block-diagonal matrix-vector multiplication              
+!---------------------------------------------------------------------
+
+       use sp_data
+       implicit none
+
+       integer  i, j, k
+       double precision r1, r2, r3, r4, r5, t1, t2
+
+       if (timeron) call timer_start(t_ninvr)
+!$omp parallel do default(shared) private(i,j,k,r1,r2,r3,r4,r5,t1,t2)  &
+!$omp&  collapse(2)
+       do k = 1, nz2
+          do j = 1, ny2
+             do i = 1, nx2
+
+                r1 = rhs(1,i,j,k)
+                r2 = rhs(2,i,j,k)
+                r3 = rhs(3,i,j,k)
+                r4 = rhs(4,i,j,k)
+                r5 = rhs(5,i,j,k)
+               
+                t1 = bt * r3
+                t2 = 0.5d0 * ( r4 + r5 )
+
+                rhs(1,i,j,k) = -r2
+                rhs(2,i,j,k) =  r1
+                rhs(3,i,j,k) = bt * ( r4 - r5 )
+                rhs(4,i,j,k) = -t1 + t2
+                rhs(5,i,j,k) =  t1 + t2
+             enddo    
+          enddo
+       enddo
+       if (timeron) call timer_stop(t_ninvr)
+
+       return
+       end
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/SP/pinvr.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/SP/pinvr.f90
new file mode 100644
index 000000000..254411767
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/SP/pinvr.f90
@@ -0,0 +1,50 @@
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+       subroutine pinvr
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!   block-diagonal matrix-vector multiplication                       
+!---------------------------------------------------------------------
+
+       use sp_data
+       implicit none
+
+       integer i, j, k
+       double precision r1, r2, r3, r4, r5, t1, t2
+
+       if (timeron) call timer_start(t_pinvr)
+!$omp parallel do default(shared) private(i,j,k,r1,r2,r3,r4,r5,t1,t2)  &
+!$omp&  collapse(2)
+       do   k = 1, nz2
+          do   j = 1, ny2
+             do   i = 1, nx2
+
+                r1 = rhs(1,i,j,k)
+                r2 = rhs(2,i,j,k)
+                r3 = rhs(3,i,j,k)
+                r4 = rhs(4,i,j,k)
+                r5 = rhs(5,i,j,k)
+
+                t1 = bt * r1
+                t2 = 0.5d0 * ( r4 + r5 )
+
+                rhs(1,i,j,k) =  bt * ( r4 - r5 )
+                rhs(2,i,j,k) = -r3
+                rhs(3,i,j,k) =  r2
+                rhs(4,i,j,k) = -t1 + t2
+                rhs(5,i,j,k) =  t1 + t2
+             end do
+          end do
+       end do
+       if (timeron) call timer_stop(t_pinvr)
+
+       return
+       end
+
+
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/SP/rhs.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/SP/rhs.f90
new file mode 100644
index 000000000..4ba519c66
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/SP/rhs.f90
@@ -0,0 +1,417 @@
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+       subroutine compute_rhs
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+       use sp_data
+       implicit none
+
+       integer i, j, k, m
+       double precision aux, rho_inv, uijk, up1, um1, vijk, vp1, vm1,  &
+     &                  wijk, wp1, wm1
+
+
+       if (timeron) call timer_start(t_rhs)
+!$omp parallel default(shared) private(i,j,k,m,rho_inv,aux,uijk,up1,um1,  &
+!$omp&   vijk,vp1,vm1,wijk,wp1,wm1)
+!---------------------------------------------------------------------
+!      compute the reciprocal of density, and the kinetic energy, 
+!      and the speed of sound. 
+!---------------------------------------------------------------------
+
+!$omp do schedule(static) collapse(2)
+       do    k = 0, grid_points(3)-1
+          do    j = 0, grid_points(2)-1
+             do    i = 0, grid_points(1)-1
+                rho_inv = 1.0d0/u(1,i,j,k)
+                rho_i(i,j,k) = rho_inv
+                us(i,j,k) = u(2,i,j,k) * rho_inv
+                vs(i,j,k) = u(3,i,j,k) * rho_inv
+                ws(i,j,k) = u(4,i,j,k) * rho_inv
+                square(i,j,k)     = 0.5d0* (  &
+     &                        u(2,i,j,k)*u(2,i,j,k) +  &
+     &                        u(3,i,j,k)*u(3,i,j,k) +  &
+     &                        u(4,i,j,k)*u(4,i,j,k) ) * rho_inv
+                qs(i,j,k) = square(i,j,k) * rho_inv
+!---------------------------------------------------------------------
+!               (don't need speed and ainx until the lhs computation)
+!---------------------------------------------------------------------
+                aux = c1c2*rho_inv* (u(5,i,j,k) - square(i,j,k))
+                speed(i,j,k) = dsqrt(aux)
+             end do
+          end do
+       end do
+!$omp end do nowait
+
+!---------------------------------------------------------------------
+! copy the exact forcing term to the right hand side;  because 
+! this forcing term is known, we can store it on the whole grid
+! including the boundary                   
+!---------------------------------------------------------------------
+
+!$omp do schedule(static) collapse(2)
+       do    k = 0, nz2+1
+          do    j = 0, ny2+1
+             do    i = 0, nx2+1
+                do    m = 1, 5
+                   rhs(m,i,j,k) = forcing(m,i,j,k)
+                end do
+             end do
+          end do
+       end do
+!$omp end do
+
+!---------------------------------------------------------------------
+!      compute xi-direction fluxes 
+!---------------------------------------------------------------------
+!$omp master
+       if (timeron) call timer_start(t_rhsx)
+!$omp end master
+!$omp do schedule(static) collapse(2)
+       do    k = 1, nz2
+          do    j = 1, ny2
+             do    i = 1, nx2
+                uijk = us(i,j,k)
+                up1  = us(i+1,j,k)
+                um1  = us(i-1,j,k)
+
+                rhs(1,i,j,k) = rhs(1,i,j,k) + dx1tx1 *  &
+     &                    (u(1,i+1,j,k) - 2.0d0*u(1,i,j,k) +  &
+     &                     u(1,i-1,j,k)) -  &
+     &                    tx2 * (u(2,i+1,j,k) - u(2,i-1,j,k))
+
+                rhs(2,i,j,k) = rhs(2,i,j,k) + dx2tx1 *  &
+     &                    (u(2,i+1,j,k) - 2.0d0*u(2,i,j,k) +  &
+     &                     u(2,i-1,j,k)) +  &
+     &                    xxcon2*con43 * (up1 - 2.0d0*uijk + um1) -  &
+     &                    tx2 * (u(2,i+1,j,k)*up1 -  &
+     &                           u(2,i-1,j,k)*um1 +  &
+     &                           (u(5,i+1,j,k)- square(i+1,j,k)-  &
+     &                            u(5,i-1,j,k)+ square(i-1,j,k))*  &
+     &                            c2)
+
+                rhs(3,i,j,k) = rhs(3,i,j,k) + dx3tx1 *  &
+     &                    (u(3,i+1,j,k) - 2.0d0*u(3,i,j,k) +  &
+     &                     u(3,i-1,j,k)) +  &
+     &                    xxcon2 * (vs(i+1,j,k) - 2.0d0*vs(i,j,k) +  &
+     &                              vs(i-1,j,k)) -  &
+     &                    tx2 * (u(3,i+1,j,k)*up1 -  &
+     &                           u(3,i-1,j,k)*um1)
+
+                rhs(4,i,j,k) = rhs(4,i,j,k) + dx4tx1 *  &
+     &                    (u(4,i+1,j,k) - 2.0d0*u(4,i,j,k) +  &
+     &                     u(4,i-1,j,k)) +  &
+     &                    xxcon2 * (ws(i+1,j,k) - 2.0d0*ws(i,j,k) +  &
+     &                              ws(i-1,j,k)) -  &
+     &                    tx2 * (u(4,i+1,j,k)*up1 -  &
+     &                           u(4,i-1,j,k)*um1)
+
+                rhs(5,i,j,k) = rhs(5,i,j,k) + dx5tx1 *  &
+     &                    (u(5,i+1,j,k) - 2.0d0*u(5,i,j,k) +  &
+     &                     u(5,i-1,j,k)) +  &
+     &                    xxcon3 * (qs(i+1,j,k) - 2.0d0*qs(i,j,k) +  &
+     &                              qs(i-1,j,k)) +  &
+     &                    xxcon4 * (up1*up1 -       2.0d0*uijk*uijk +  &
+     &                              um1*um1) +  &
+     &                    xxcon5 * (u(5,i+1,j,k)*rho_i(i+1,j,k) -  &
+     &                              2.0d0*u(5,i,j,k)*rho_i(i,j,k) +  &
+     &                              u(5,i-1,j,k)*rho_i(i-1,j,k)) -  &
+     &                    tx2 * ( (c1*u(5,i+1,j,k) -  &
+     &                             c2*square(i+1,j,k))*up1 -  &
+     &                            (c1*u(5,i-1,j,k) -  &
+     &                             c2*square(i-1,j,k))*um1 )
+             end do
+
+!---------------------------------------------------------------------
+!      add fourth order xi-direction dissipation               
+!---------------------------------------------------------------------
+             i = 1
+             do    m = 1, 5
+                rhs(m,i,j,k) = rhs(m,i,j,k)- dssp *  &
+     &                    ( 5.0d0*u(m,i,j,k) - 4.0d0*u(m,i+1,j,k) +  &
+     &                            u(m,i+2,j,k))
+             end do
+
+             i = 2
+             do    m = 1, 5
+                rhs(m,i,j,k) = rhs(m,i,j,k) - dssp *  &
+     &                    (-4.0d0*u(m,i-1,j,k) + 6.0d0*u(m,i,j,k) -  &
+     &                      4.0d0*u(m,i+1,j,k) + u(m,i+2,j,k))
+             end do
+
+             do  i = 3, nx2-2
+                do     m = 1, 5
+                   rhs(m,i,j,k) = rhs(m,i,j,k) - dssp *  &
+     &                    (  u(m,i-2,j,k) - 4.0d0*u(m,i-1,j,k) +  &
+     &                   6.0d0*u(m,i,j,k) - 4.0d0*u(m,i+1,j,k) +  &
+     &                         u(m,i+2,j,k) )
+                end do
+             end do
+
+             i = nx2-1
+             do     m = 1, 5
+                rhs(m,i,j,k) = rhs(m,i,j,k) - dssp *  &
+     &                    ( u(m,i-2,j,k) - 4.0d0*u(m,i-1,j,k) +  &
+     &                      6.0d0*u(m,i,j,k) - 4.0d0*u(m,i+1,j,k) )
+             end do
+
+             i = nx2
+             do     m = 1, 5
+                rhs(m,i,j,k) = rhs(m,i,j,k) - dssp *  &
+     &                    ( u(m,i-2,j,k) - 4.d0*u(m,i-1,j,k) +  &
+     &                      5.d0*u(m,i,j,k) )
+             end do
+          end do
+       end do
+!$omp end do nowait
+!$omp master
+       if (timeron) call timer_stop(t_rhsx)
+
+!---------------------------------------------------------------------
+!      compute eta-direction fluxes 
+!---------------------------------------------------------------------
+       if (timeron) call timer_start(t_rhsy)
+!$omp end master
+!$omp do schedule(static) collapse(2)
+       do     k = 1, nz2
+          do     j = 1, ny2
+             do     i = 1, nx2
+                vijk = vs(i,j,k)
+                vp1  = vs(i,j+1,k)
+                vm1  = vs(i,j-1,k)
+                rhs(1,i,j,k) = rhs(1,i,j,k) + dy1ty1 *  &
+     &                   (u(1,i,j+1,k) - 2.0d0*u(1,i,j,k) +  &
+     &                    u(1,i,j-1,k)) -  &
+     &                   ty2 * (u(3,i,j+1,k) - u(3,i,j-1,k))
+                rhs(2,i,j,k) = rhs(2,i,j,k) + dy2ty1 *  &
+     &                   (u(2,i,j+1,k) - 2.0d0*u(2,i,j,k) +  &
+     &                    u(2,i,j-1,k)) +  &
+     &                   yycon2 * (us(i,j+1,k) - 2.0d0*us(i,j,k) +  &
+     &                             us(i,j-1,k)) -  &
+     &                   ty2 * (u(2,i,j+1,k)*vp1 -  &
+     &                          u(2,i,j-1,k)*vm1)
+                rhs(3,i,j,k) = rhs(3,i,j,k) + dy3ty1 *  &
+     &                   (u(3,i,j+1,k) - 2.0d0*u(3,i,j,k) +  &
+     &                    u(3,i,j-1,k)) +  &
+     &                   yycon2*con43 * (vp1 - 2.0d0*vijk + vm1) -  &
+     &                   ty2 * (u(3,i,j+1,k)*vp1 -  &
+     &                          u(3,i,j-1,k)*vm1 +  &
+     &                          (u(5,i,j+1,k) - square(i,j+1,k) -  &
+     &                           u(5,i,j-1,k) + square(i,j-1,k))  &
+     &                          *c2)
+                rhs(4,i,j,k) = rhs(4,i,j,k) + dy4ty1 *  &
+     &                   (u(4,i,j+1,k) - 2.0d0*u(4,i,j,k) +  &
+     &                    u(4,i,j-1,k)) +  &
+     &                   yycon2 * (ws(i,j+1,k) - 2.0d0*ws(i,j,k) +  &
+     &                             ws(i,j-1,k)) -  &
+     &                   ty2 * (u(4,i,j+1,k)*vp1 -  &
+     &                          u(4,i,j-1,k)*vm1)
+                rhs(5,i,j,k) = rhs(5,i,j,k) + dy5ty1 *  &
+     &                   (u(5,i,j+1,k) - 2.0d0*u(5,i,j,k) +  &
+     &                    u(5,i,j-1,k)) +  &
+     &                   yycon3 * (qs(i,j+1,k) - 2.0d0*qs(i,j,k) +  &
+     &                             qs(i,j-1,k)) +  &
+     &                   yycon4 * (vp1*vp1       - 2.0d0*vijk*vijk +  &
+     &                             vm1*vm1) +  &
+     &                   yycon5 * (u(5,i,j+1,k)*rho_i(i,j+1,k) -  &
+     &                             2.0d0*u(5,i,j,k)*rho_i(i,j,k) +  &
+     &                             u(5,i,j-1,k)*rho_i(i,j-1,k)) -  &
+     &                   ty2 * ((c1*u(5,i,j+1,k) -  &
+     &                           c2*square(i,j+1,k)) * vp1 -  &
+     &                          (c1*u(5,i,j-1,k) -  &
+     &                           c2*square(i,j-1,k)) * vm1)
+             end do
+
+
+!---------------------------------------------------------------------
+!      add fourth order eta-direction dissipation         
+!---------------------------------------------------------------------
+
+          if (j .eq. 1) then
+          do     i = 1, nx2
+             do     m = 1, 5
+                rhs(m,i,j,k) = rhs(m,i,j,k)- dssp *  &
+     &                    ( 5.0d0*u(m,i,j,k) - 4.0d0*u(m,i,j+1,k) +  &
+     &                            u(m,i,j+2,k))
+             end do
+          end do
+
+          else if (j .eq. 2) then
+          do     i = 1, nx2
+             do     m = 1, 5
+                rhs(m,i,j,k) = rhs(m,i,j,k) - dssp *  &
+     &                    (-4.0d0*u(m,i,j-1,k) + 6.0d0*u(m,i,j,k) -  &
+     &                      4.0d0*u(m,i,j+1,k) + u(m,i,j+2,k))
+             end do
+          end do
+ 
+          else if (j .eq. ny2-1) then
+          do     i = 1, nx2
+             do     m = 1, 5
+                rhs(m,i,j,k) = rhs(m,i,j,k) - dssp *  &
+     &                    ( u(m,i,j-2,k) - 4.0d0*u(m,i,j-1,k) +  &
+     &                      6.0d0*u(m,i,j,k) - 4.0d0*u(m,i,j+1,k) )
+             end do
+          end do
+
+          else if (j .eq. ny2) then
+          do     i = 1, nx2
+             do     m = 1, 5
+                rhs(m,i,j,k) = rhs(m,i,j,k) - dssp *  &
+     &                    ( u(m,i,j-2,k) - 4.d0*u(m,i,j-1,k) +  &
+     &                      5.d0*u(m,i,j,k) )
+             end do
+          end do
+
+          else  !do    j = 3, ny2-2
+             do  i = 1,nx2
+                do     m = 1, 5
+                   rhs(m,i,j,k) = rhs(m,i,j,k) - dssp *  &
+     &                    (  u(m,i,j-2,k) - 4.0d0*u(m,i,j-1,k) +  &
+     &                   6.0d0*u(m,i,j,k) - 4.0d0*u(m,i,j+1,k) +  &
+     &                         u(m,i,j+2,k) )
+                end do
+             end do
+          endif
+          end do
+       end do
+!$omp end do nowait
+!$omp master
+       if (timeron) call timer_stop(t_rhsy)
+
+!---------------------------------------------------------------------
+!      compute zeta-direction fluxes 
+!---------------------------------------------------------------------
+       if (timeron) call timer_start(t_rhsz)
+!$omp end master
+!$omp do schedule(static) collapse(2)
+       do    k = 1, grid_points(3)-2
+          do     j = 1, grid_points(2)-2
+             do     i = 1, grid_points(1)-2
+                wijk = ws(i,j,k)
+                wp1  = ws(i,j,k+1)
+                wm1  = ws(i,j,k-1)
+
+                rhs(1,i,j,k) = rhs(1,i,j,k) + dz1tz1 *  &
+     &                   (u(1,i,j,k+1) - 2.0d0*u(1,i,j,k) +  &
+     &                    u(1,i,j,k-1)) -  &
+     &                   tz2 * (u(4,i,j,k+1) - u(4,i,j,k-1))
+                rhs(2,i,j,k) = rhs(2,i,j,k) + dz2tz1 *  &
+     &                   (u(2,i,j,k+1) - 2.0d0*u(2,i,j,k) +  &
+     &                    u(2,i,j,k-1)) +  &
+     &                   zzcon2 * (us(i,j,k+1) - 2.0d0*us(i,j,k) +  &
+     &                             us(i,j,k-1)) -  &
+     &                   tz2 * (u(2,i,j,k+1)*wp1 -  &
+     &                          u(2,i,j,k-1)*wm1)
+                rhs(3,i,j,k) = rhs(3,i,j,k) + dz3tz1 *  &
+     &                   (u(3,i,j,k+1) - 2.0d0*u(3,i,j,k) +  &
+     &                    u(3,i,j,k-1)) +  &
+     &                   zzcon2 * (vs(i,j,k+1) - 2.0d0*vs(i,j,k) +  &
+     &                             vs(i,j,k-1)) -  &
+     &                   tz2 * (u(3,i,j,k+1)*wp1 -  &
+     &                          u(3,i,j,k-1)*wm1)
+                rhs(4,i,j,k) = rhs(4,i,j,k) + dz4tz1 *  &
+     &                   (u(4,i,j,k+1) - 2.0d0*u(4,i,j,k) +  &
+     &                    u(4,i,j,k-1)) +  &
+     &                   zzcon2*con43 * (wp1 - 2.0d0*wijk + wm1) -  &
+     &                   tz2 * (u(4,i,j,k+1)*wp1 -  &
+     &                          u(4,i,j,k-1)*wm1 +  &
+     &                          (u(5,i,j,k+1) - square(i,j,k+1) -  &
+     &                           u(5,i,j,k-1) + square(i,j,k-1))  &
+     &                          *c2)
+                rhs(5,i,j,k) = rhs(5,i,j,k) + dz5tz1 *  &
+     &                   (u(5,i,j,k+1) - 2.0d0*u(5,i,j,k) +  &
+     &                    u(5,i,j,k-1)) +  &
+     &                   zzcon3 * (qs(i,j,k+1) - 2.0d0*qs(i,j,k) +  &
+     &                             qs(i,j,k-1)) +  &
+     &                   zzcon4 * (wp1*wp1 - 2.0d0*wijk*wijk +  &
+     &                             wm1*wm1) +  &
+     &                   zzcon5 * (u(5,i,j,k+1)*rho_i(i,j,k+1) -  &
+     &                             2.0d0*u(5,i,j,k)*rho_i(i,j,k) +  &
+     &                             u(5,i,j,k-1)*rho_i(i,j,k-1)) -  &
+     &                   tz2 * ( (c1*u(5,i,j,k+1) -  &
+     &                            c2*square(i,j,k+1))*wp1 -  &
+     &                           (c1*u(5,i,j,k-1) -  &
+     &                            c2*square(i,j,k-1))*wm1)
+             end do
+
+!---------------------------------------------------------------------
+!      add fourth order zeta-direction dissipation                
+!---------------------------------------------------------------------
+
+          if (k .eq. 1) then
+          do     i = 1, grid_points(1)-2
+             do     m = 1, 5
+                rhs(m,i,j,k) = rhs(m,i,j,k)- dssp *  &
+     &                    ( 5.0d0*u(m,i,j,k) - 4.0d0*u(m,i,j,k+1) +  &
+     &                            u(m,i,j,k+2))
+             end do
+          end do
+
+          else if (k .eq. 2) then
+          do     i = 1, grid_points(1)-2
+             do     m = 1, 5
+                rhs(m,i,j,k) = rhs(m,i,j,k) - dssp *  &
+     &                    (-4.0d0*u(m,i,j,k-1) + 6.0d0*u(m,i,j,k) -  &
+     &                      4.0d0*u(m,i,j,k+1) + u(m,i,j,k+2))
+             end do
+          end do
+ 
+          else if (k .eq. grid_points(3)-3) then
+          do     i = 1, grid_points(1)-2
+             do     m = 1, 5
+                rhs(m,i,j,k) = rhs(m,i,j,k) - dssp *  &
+     &                    ( u(m,i,j,k-2) - 4.0d0*u(m,i,j,k-1) +  &
+     &                      6.0d0*u(m,i,j,k) - 4.0d0*u(m,i,j,k+1) )
+             end do
+          end do
+
+          else if (k .eq. grid_points(3)-2) then
+          do     i = 1, grid_points(1)-2
+             do     m = 1, 5
+                rhs(m,i,j,k) = rhs(m,i,j,k) - dssp *  &
+     &                    ( u(m,i,j,k-2) - 4.d0*u(m,i,j,k-1) +  &
+     &                      5.d0*u(m,i,j,k) )
+             end do
+          end do
+
+          else !do     k = 3, grid_points(3)-4
+             do     i = 1,grid_points(1)-2
+                do     m = 1, 5
+                   rhs(m,i,j,k) = rhs(m,i,j,k) - dssp *  &
+     &                    (  u(m,i,j,k-2) - 4.0d0*u(m,i,j,k-1) +  &
+     &                   6.0d0*u(m,i,j,k) - 4.0d0*u(m,i,j,k+1) +  &
+     &                         u(m,i,j,k+2) )
+                end do
+             end do
+          endif
+          end do
+       end do
+!$omp end do nowait
+!$omp master
+       if (timeron) call timer_stop(t_rhsz)
+!$omp end master
+
+!$omp do schedule(static) collapse(2)
+       do    k = 1, nz2
+          do    j = 1, ny2
+             do    i = 1, nx2
+                do    m = 1, 5
+                   rhs(m,i,j,k) = rhs(m,i,j,k) * dt
+                end do
+             end do
+          end do
+       end do
+!$omp end do nowait
+!$omp end parallel
+        if (timeron) call timer_stop(t_rhs)
+   
+       return
+       end
+
+
+
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/SP/set_constants.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/SP/set_constants.f90
new file mode 100644
index 000000000..81820d475
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/SP/set_constants.f90
@@ -0,0 +1,204 @@
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+       subroutine  set_constants
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+       use sp_data
+       implicit none
+  
+       ce(1,1)  = 2.0d0
+       ce(1,2)  = 0.0d0
+       ce(1,3)  = 0.0d0
+       ce(1,4)  = 4.0d0
+       ce(1,5)  = 5.0d0
+       ce(1,6)  = 3.0d0
+       ce(1,7)  = 0.5d0
+       ce(1,8)  = 0.02d0
+       ce(1,9)  = 0.01d0
+       ce(1,10) = 0.03d0
+       ce(1,11) = 0.5d0
+       ce(1,12) = 0.4d0
+       ce(1,13) = 0.3d0
+ 
+       ce(2,1)  = 1.0d0
+       ce(2,2)  = 0.0d0
+       ce(2,3)  = 0.0d0
+       ce(2,4)  = 0.0d0
+       ce(2,5)  = 1.0d0
+       ce(2,6)  = 2.0d0
+       ce(2,7)  = 3.0d0
+       ce(2,8)  = 0.01d0
+       ce(2,9)  = 0.03d0
+       ce(2,10) = 0.02d0
+       ce(2,11) = 0.4d0
+       ce(2,12) = 0.3d0
+       ce(2,13) = 0.5d0
+
+       ce(3,1)  = 2.0d0
+       ce(3,2)  = 2.0d0
+       ce(3,3)  = 0.0d0
+       ce(3,4)  = 0.0d0
+       ce(3,5)  = 0.0d0
+       ce(3,6)  = 2.0d0
+       ce(3,7)  = 3.0d0
+       ce(3,8)  = 0.04d0
+       ce(3,9)  = 0.03d0
+       ce(3,10) = 0.05d0
+       ce(3,11) = 0.3d0
+       ce(3,12) = 0.5d0
+       ce(3,13) = 0.4d0
+
+       ce(4,1)  = 2.0d0
+       ce(4,2)  = 2.0d0
+       ce(4,3)  = 0.0d0
+       ce(4,4)  = 0.0d0
+       ce(4,5)  = 0.0d0
+       ce(4,6)  = 2.0d0
+       ce(4,7)  = 3.0d0
+       ce(4,8)  = 0.03d0
+       ce(4,9)  = 0.05d0
+       ce(4,10) = 0.04d0
+       ce(4,11) = 0.2d0
+       ce(4,12) = 0.1d0
+       ce(4,13) = 0.3d0
+
+       ce(5,1)  = 5.0d0
+       ce(5,2)  = 4.0d0
+       ce(5,3)  = 3.0d0
+       ce(5,4)  = 2.0d0
+       ce(5,5)  = 0.1d0
+       ce(5,6)  = 0.4d0
+       ce(5,7)  = 0.3d0
+       ce(5,8)  = 0.05d0
+       ce(5,9)  = 0.04d0
+       ce(5,10) = 0.03d0
+       ce(5,11) = 0.1d0
+       ce(5,12) = 0.3d0
+       ce(5,13) = 0.2d0
+
+       c1 = 1.4d0
+       c2 = 0.4d0
+       c3 = 0.1d0
+       c4 = 1.0d0
+       c5 = 1.4d0
+
+       bt = dsqrt(0.5d0)
+
+       dnxm1 = 1.0d0 / dble(grid_points(1)-1)
+       dnym1 = 1.0d0 / dble(grid_points(2)-1)
+       dnzm1 = 1.0d0 / dble(grid_points(3)-1)
+
+       c1c2 = c1 * c2
+       c1c5 = c1 * c5
+       c3c4 = c3 * c4
+       c1345 = c1c5 * c3c4
+
+       conz1 = (1.0d0-c1c5)
+
+       tx1 = 1.0d0 / (dnxm1 * dnxm1)
+       tx2 = 1.0d0 / (2.0d0 * dnxm1)
+       tx3 = 1.0d0 / dnxm1
+
+       ty1 = 1.0d0 / (dnym1 * dnym1)
+       ty2 = 1.0d0 / (2.0d0 * dnym1)
+       ty3 = 1.0d0 / dnym1
+ 
+       tz1 = 1.0d0 / (dnzm1 * dnzm1)
+       tz2 = 1.0d0 / (2.0d0 * dnzm1)
+       tz3 = 1.0d0 / dnzm1
+
+       dx1 = 0.75d0
+       dx2 = 0.75d0
+       dx3 = 0.75d0
+       dx4 = 0.75d0
+       dx5 = 0.75d0
+
+       dy1 = 0.75d0
+       dy2 = 0.75d0
+       dy3 = 0.75d0
+       dy4 = 0.75d0
+       dy5 = 0.75d0
+
+       dz1 = 1.0d0
+       dz2 = 1.0d0
+       dz3 = 1.0d0
+       dz4 = 1.0d0
+       dz5 = 1.0d0
+
+       dxmax = dmax1(dx3, dx4)
+       dymax = dmax1(dy2, dy4)
+       dzmax = dmax1(dz2, dz3)
+
+       dssp = 0.25d0 * dmax1(dx1, dmax1(dy1, dz1) )
+
+       c4dssp = 4.0d0 * dssp
+       c5dssp = 5.0d0 * dssp
+
+       dttx1 = dt*tx1
+       dttx2 = dt*tx2
+       dtty1 = dt*ty1
+       dtty2 = dt*ty2
+       dttz1 = dt*tz1
+       dttz2 = dt*tz2
+
+       c2dttx1 = 2.0d0*dttx1
+       c2dtty1 = 2.0d0*dtty1
+       c2dttz1 = 2.0d0*dttz1
+
+       dtdssp = dt*dssp
+
+       comz1  = dtdssp
+       comz4  = 4.0d0*dtdssp
+       comz5  = 5.0d0*dtdssp
+       comz6  = 6.0d0*dtdssp
+
+       c3c4tx3 = c3c4*tx3
+       c3c4ty3 = c3c4*ty3
+       c3c4tz3 = c3c4*tz3
+
+       dx1tx1 = dx1*tx1
+       dx2tx1 = dx2*tx1
+       dx3tx1 = dx3*tx1
+       dx4tx1 = dx4*tx1
+       dx5tx1 = dx5*tx1
+        
+       dy1ty1 = dy1*ty1
+       dy2ty1 = dy2*ty1
+       dy3ty1 = dy3*ty1
+       dy4ty1 = dy4*ty1
+       dy5ty1 = dy5*ty1
+        
+       dz1tz1 = dz1*tz1
+       dz2tz1 = dz2*tz1
+       dz3tz1 = dz3*tz1
+       dz4tz1 = dz4*tz1
+       dz5tz1 = dz5*tz1
+
+       c2iv  = 2.5d0
+       con43 = 4.0d0/3.0d0
+       con16 = 1.0d0/6.0d0
+        
+       xxcon1 = c3c4tx3*con43*tx3
+       xxcon2 = c3c4tx3*tx3
+       xxcon3 = c3c4tx3*conz1*tx3
+       xxcon4 = c3c4tx3*con16*tx3
+       xxcon5 = c3c4tx3*c1c5*tx3
+
+       yycon1 = c3c4ty3*con43*ty3
+       yycon2 = c3c4ty3*ty3
+       yycon3 = c3c4ty3*conz1*ty3
+       yycon4 = c3c4ty3*con16*ty3
+       yycon5 = c3c4ty3*c1c5*ty3
+
+       zzcon1 = c3c4tz3*con43*tz3
+       zzcon2 = c3c4tz3*tz3
+       zzcon3 = c3c4tz3*conz1*tz3
+       zzcon4 = c3c4tz3*con16*tz3
+       zzcon5 = c3c4tz3*c1c5*tz3
+
+       return
+       end
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/SP/sp.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/SP/sp.f90
new file mode 100644
index 000000000..faeb74de1
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/SP/sp.f90
@@ -0,0 +1,229 @@
+!-------------------------------------------------------------------------!
+!                                                                         !
+!        N  A  S     P A R A L L E L     B E N C H M A R K S  3.4         !
+!                                                                         !
+!                       O p e n M P     V E R S I O N                     !
+!                                                                         !
+!                                   S P                                   !
+!                                                                         !
+!-------------------------------------------------------------------------!
+!                                                                         !
+!    This benchmark is an OpenMP version of the NPB SP code.              !
+!    It is described in NAS Technical Report 99-011.                      !
+!                                                                         !
+!    Permission to use, copy, distribute and modify this software         !
+!    for any purpose with or without fee is hereby granted.  We           !
+!    request, however, that all derived work reference the NAS            !
+!    Parallel Benchmarks 3.4. This software is provided "as is"           !
+!    without express or implied warranty.                                 !
+!                                                                         !
+!    Information on NPB 3.4, including the technical report, the          !
+!    original specifications, source code, results and information        !
+!    on how to submit new results, is available at:                       !
+!                                                                         !
+!           http://www.nas.nasa.gov/Software/NPB/                         !
+!                                                                         !
+!    Send comments or suggestions to  npb@nas.nasa.gov                    !
+!                                                                         !
+!          NAS Parallel Benchmarks Group                                  !
+!          NASA Ames Research Center                                      !
+!          Mail Stop: T27A-1                                              !
+!          Moffett Field, CA   94035-1000                                 !
+!                                                                         !
+!          E-mail:  npb@nas.nasa.gov                                      !
+!          Fax:     (650) 604-3957                                        !
+!                                                                         !
+!-------------------------------------------------------------------------!
+
+!---------------------------------------------------------------------
+!
+! Authors: R. Van der Wijngaart
+!          W. Saphir
+!          H. Jin
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+       program SP
+!---------------------------------------------------------------------
+
+       use sp_data
+       implicit none
+
+       include 'blk_par.h'
+
+       integer          i, niter, step, fstatus
+       external         timer_read
+       double precision mflops, n3, t, tmax, timer_read, trecs(t_last)
+       logical          verified
+       character        class
+       character        t_names(t_last)*8
+!$     integer  omp_get_max_threads
+!$     external omp_get_max_threads
+
+!---------------------------------------------------------------------
+!      Read input file (if it exists), else take
+!      defaults from parameters
+!---------------------------------------------------------------------
+
+       call check_timer_flag( timeron )
+       if (timeron) then
+         t_names(t_total) = 'total'
+         t_names(t_rhsx) = 'rhsx'
+         t_names(t_rhsy) = 'rhsy'
+         t_names(t_rhsz) = 'rhsz'
+         t_names(t_rhs) = 'rhs'
+         t_names(t_xsolve) = 'xsolve'
+         t_names(t_ysolve) = 'ysolve'
+         t_names(t_zsolve) = 'zsolve'
+         t_names(t_rdis1) = 'redist1'
+         t_names(t_rdis2) = 'redist2'
+         t_names(t_tzetar) = 'tzetar'
+         t_names(t_ninvr) = 'ninvr'
+         t_names(t_pinvr) = 'pinvr'
+         t_names(t_txinvr) = 'txinvr'
+         t_names(t_add) = 'add'
+       endif
+
+       write(*, 1000)
+       open (unit=2,file='inputsp.data',status='old', iostat=fstatus)
+
+       if (fstatus .eq. 0) then
+         write(*,233)
+ 233     format(' Reading from input file inputsp.data')
+         read (2,*) niter
+         read (2,*) dt
+         read (2,*) grid_points(1), grid_points(2), grid_points(3)
+         close(2)
+       else
+         write(*,234)
+         niter = niter_default
+         dt    = dt_default
+         grid_points(1) = problem_size
+         grid_points(2) = problem_size
+         grid_points(3) = problem_size
+       endif
+ 234   format(' No input file inputsp.data. Using compiled defaults')
+
+       write(*, 1001) grid_points(1), grid_points(2), grid_points(3)
+       write(*, 1002) niter, dt
+       if (blkdim .gt. 0) write(*, 1004) blkdim
+!$     write(*, 1003) omp_get_max_threads()
+       write(*, *)
+
+ 1000  format(//, ' NAS Parallel Benchmarks (NPB3.4-OMP)',  &
+     &            ' - SP Benchmark', /)
+ 1001  format(' Size: ', i4, 'x', i4, 'x', i4)
+ 1002  format(' Iterations: ', i4, '    dt:  ', f11.7)
+ 1003  format(' Number of available threads: ', i5)
+ 1004  format(' Dimension blocking size: ', i5)
+
+       if ( (grid_points(1) .gt. IMAX) .or.  &
+     &      (grid_points(2) .gt. JMAX) .or.  &
+     &      (grid_points(3) .gt. KMAX) ) then
+             print *, (grid_points(i),i=1,3)
+             print *,' Problem size too big for compiled array sizes'
+             goto 999
+       endif
+       nx2 = grid_points(1) - 2
+       ny2 = grid_points(2) - 2
+       nz2 = grid_points(3) - 2
+
+       do i = 1, t_last
+          call timer_clear(i)
+       end do
+
+       call alloc_space
+
+       call set_constants
+
+       call exact_rhs
+
+       call initialize
+
+!---------------------------------------------------------------------
+!      do one time step to touch all code, and reinitialize
+!---------------------------------------------------------------------
+       call adi
+       call initialize
+
+       do i = 1, t_last
+          call timer_clear(i)
+       end do
+
+#ifdef M5_ANNOTATION
+       call m5_work_begin_interface
+#endif
+
+       call timer_start(1)
+
+       do  step = 1, niter
+
+          if (mod(step, 20) .eq. 0 .or. step .eq. 1) then
+             write(*, 200) step
+ 200         format(' Time step ', i4)
+          endif
+
+          call adi
+
+       end do
+
+       call timer_stop(1)
+
+#ifdef M5_ANNOTATION
+       call m5_work_end_interface
+#endif
+
+       tmax = timer_read(1)
+       call verify(niter, class, verified)
+
+       if( tmax .ne. 0. ) then
+          n3 = dble(grid_points(1))*grid_points(2)*grid_points(3)
+          t = (grid_points(1)+grid_points(2)+grid_points(3))/3.d0
+          mflops = 1.0d-6*dble( niter )*(881.174 * n3  &
+     &             -4683.91 * t**2  &
+     &             +11484.5 * t  &
+     &             -19272.4) / tmax
+       else
+          mflops = 0.d0
+       endif
+
+      call print_results('SP', class, grid_points(1),  &
+     &     grid_points(2), grid_points(3), niter,  &
+     &     tmax, mflops, '          floating point',  &
+     &     verified, npbversion,compiletime, cs1, cs2, cs3, cs4, cs5,  &
+     &     cs6, '(none)')
+
+!---------------------------------------------------------------------
+!      More timers
+!---------------------------------------------------------------------
+       if (.not.timeron) goto 999
+
+       do i=1, t_last
+          trecs(i) = timer_read(i)
+       end do
+       if (tmax .eq. 0.0) tmax = 1.0
+
+       write(*,800)
+ 800   format('  SECTION   Time (secs)')
+
+       do i=1, t_last
+          write(*,810) t_names(i), trecs(i), trecs(i)*100./tmax
+          if (i.eq.t_rhs) then
+             t = trecs(t_rhsx) + trecs(t_rhsy) + trecs(t_rhsz)
+             write(*,820) 'sub-rhs', t, t*100./tmax
+             t = trecs(t_rhs) - t
+             write(*,820) 'rest-rhs', t, t*100./tmax
+          elseif (i.eq.t_zsolve) then
+             t = trecs(t_zsolve) - trecs(t_rdis1) - trecs(t_rdis2)
+             write(*,820) 'sub-zsol', t, t*100./tmax
+          elseif (i.eq.t_rdis2) then
+             t = trecs(t_rdis1) + trecs(t_rdis2)
+             write(*,820) 'redist', t, t*100./tmax
+          endif
+ 810      format(2x,a8,':',f9.3,'  (',f6.2,'%)')
+ 820      format('    --> ',a8,':',f9.3,'  (',f6.2,'%)')
+       end do
+
+ 999   continue
+
+       end
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/SP/sp_data.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/SP/sp_data.f90
new file mode 100644
index 000000000..8edcf2cec
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/SP/sp_data.f90
@@ -0,0 +1,134 @@
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+!
+!  sp_data module
+!
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      module sp_data
+
+!---------------------------------------------------------------------
+! The following include file is generated automatically by the
+! "setparams" utility. It defines 
+!      problem_size:  12, 64, 102, 162 (for class S, A, B, C)
+!      dt_default:    default time step for this problem size if no
+!                     config file
+!      niter_default: default number of iterations for this problem size
+!---------------------------------------------------------------------
+
+      include 'npbparams.h'
+
+      integer           grid_points(3), nx2, ny2, nz2
+
+      double precision  tx1, tx2, tx3, ty1, ty2, ty3, tz1, tz2, tz3,  &
+     &                  dx1, dx2, dx3, dx4, dx5, dy1, dy2, dy3, dy4,  &
+     &                  dy5, dz1, dz2, dz3, dz4, dz5, dssp, dt,  &
+     &                  ce(5,13), dxmax, dymax, dzmax, xxcon1, xxcon2,  &
+     &                  xxcon3, xxcon4, xxcon5, dx1tx1, dx2tx1, dx3tx1,  &
+     &                  dx4tx1, dx5tx1, yycon1, yycon2, yycon3, yycon4,  &
+     &                  yycon5, dy1ty1, dy2ty1, dy3ty1, dy4ty1, dy5ty1,  &
+     &                  zzcon1, zzcon2, zzcon3, zzcon4, zzcon5, dz1tz1,  &
+     &                  dz2tz1, dz3tz1, dz4tz1, dz5tz1, dnxm1, dnym1,  &
+     &                  dnzm1, c1c2, c1c5, c3c4, c1345, conz1, c1, c2,  &
+     &                  c3, c4, c5, c4dssp, c5dssp, dtdssp, dttx1, bt,  &
+     &                  dttx2, dtty1, dtty2, dttz1, dttz2, c2dttx1,  &
+     &                  c2dtty1, c2dttz1, comz1, comz4, comz5, comz6,  &
+     &                  c3c4tx3, c3c4ty3, c3c4tz3, c2iv, con43, con16
+
+      integer IMAX, JMAX, KMAX
+
+      parameter (IMAX=problem_size,JMAX=problem_size,KMAX=problem_size)
+
+!---------------------------------------------------------------------
+!   Field arrays
+!---------------------------------------------------------------------
+      double precision, allocatable ::  &
+     &   u       (:, :, :, :),  &
+     &   us      (   :, :, :),  &
+     &   vs      (   :, :, :),  &
+     &   ws      (   :, :, :),  &
+     &   qs      (   :, :, :),  &
+     &   rho_i   (   :, :, :),  &
+     &   speed   (   :, :, :),  &
+     &   square  (   :, :, :),  &
+     &   rhs     (:, :, :, :),  &
+     &   forcing (:, :, :, :)
+
+      double precision cuf(0:problem_size-1),  q(0:problem_size-1),  &
+     &                 ue(0:problem_size-1,5), buf(0:problem_size-1,5)
+!$omp threadprivate(cuf, q, ue, buf)
+
+!-----------------------------------------------------------------------
+!   Timer constants
+!-----------------------------------------------------------------------
+      integer t_rhsx, t_rhsy, t_rhsz, t_xsolve, t_ysolve, t_zsolve,  &
+     &        t_rdis1, t_rdis2, t_tzetar, t_ninvr, t_pinvr, t_add,  &
+     &        t_rhs, t_txinvr, t_last, t_total
+      parameter (t_total = 1)
+      parameter (t_rhsx = 2)
+      parameter (t_rhsy = 3)
+      parameter (t_rhsz = 4)
+      parameter (t_rhs = 5)
+      parameter (t_xsolve = 6)
+      parameter (t_ysolve = 7)
+      parameter (t_zsolve = 8)
+      parameter (t_rdis1 = 9)
+      parameter (t_rdis2 = 10)
+      parameter (t_txinvr = 11)
+      parameter (t_pinvr = 12)
+      parameter (t_ninvr = 13)
+      parameter (t_tzetar = 14)
+      parameter (t_add = 15)
+      parameter (t_last = 15)
+
+      logical timeron
+
+      end module sp_data
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine alloc_space
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+! allocate space dynamically for data arrays
+!---------------------------------------------------------------------
+
+      use sp_data
+      implicit none
+
+      integer ios
+
+      integer IMAXP, JMAXP
+      parameter (IMAXP=IMAX/2*2,JMAXP=JMAX/2*2)
+
+!
+!   To improve cache performance, the first two dimensions are padded
+!   by 1, but only when the problem size is even
+!
+      allocate (  &
+     &   u       (5, 0:IMAXP, 0:JMAXP, 0:KMAX-1),  &
+     &   us      (   0:IMAXP, 0:JMAXP, 0:KMAX-1),  &
+     &   vs      (   0:IMAXP, 0:JMAXP, 0:KMAX-1),  &
+     &   ws      (   0:IMAXP, 0:JMAXP, 0:KMAX-1),  &
+     &   qs      (   0:IMAXP, 0:JMAXP, 0:KMAX-1),  &
+     &   rho_i   (   0:IMAXP, 0:JMAXP, 0:KMAX-1),  &
+     &   speed   (   0:IMAXP, 0:JMAXP, 0:KMAX-1),  &
+     &   square  (   0:IMAXP, 0:JMAXP, 0:KMAX-1),  &
+     &   rhs     (5, 0:IMAXP, 0:JMAXP, 0:KMAX-1),  &
+     &   forcing (5, 0:IMAXP, 0:JMAXP, 0:KMAX-1),  &
+     &         stat = ios)
+
+      if (ios .ne. 0) then
+         write(*,*) 'Error encountered in allocating space'
+         stop
+      endif
+
+      return
+      end
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/SP/txinvr.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/SP/txinvr.f90
new file mode 100644
index 000000000..5ef8a8268
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/SP/txinvr.f90
@@ -0,0 +1,62 @@
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+       subroutine  txinvr
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+! block-diagonal matrix-vector multiplication                  
+!---------------------------------------------------------------------
+
+       use sp_data
+       implicit none
+
+       integer i, j, k
+       double precision t1, t2, t3, ac, ru1, uu, vv, ww, r1, r2, r3,  &
+     &                  r4, r5, ac2inv
+
+
+       if (timeron) call timer_start(t_txinvr)
+!$omp parallel do default(shared)  &
+!$omp& private(i,j,k,t1,t2,t3,ac,ru1,uu,vv,ww,r1,r2,r3,r4,r5,ac2inv)  &
+!$omp&  collapse(2)
+       do    k = 1, nz2
+          do    j = 1, ny2
+             do    i = 1, nx2
+
+                ru1 = rho_i(i,j,k)
+                uu = us(i,j,k)
+                vv = vs(i,j,k)
+                ww = ws(i,j,k)
+                ac = speed(i,j,k)
+                ac2inv = ac*ac   ! holds ac**2 despite the name; used as a divisor below
+
+                r1 = rhs(1,i,j,k)
+                r2 = rhs(2,i,j,k)
+                r3 = rhs(3,i,j,k)
+                r4 = rhs(4,i,j,k)
+                r5 = rhs(5,i,j,k)
+
+                t1 = c2 / ac2inv * ( qs(i,j,k)*r1 - uu*r2  -  &
+     &                  vv*r3 - ww*r4 + r5 )
+                t2 = bt * ru1 * ( uu * r1 - r2 )
+                t3 = ( bt * ru1 * ac ) * t1
+
+                rhs(1,i,j,k) = r1 - t1
+                rhs(2,i,j,k) = - ru1 * ( ww*r1 - r4 )
+                rhs(3,i,j,k) =   ru1 * ( vv*r1 - r3 )
+                rhs(4,i,j,k) = - t2 + t3
+                rhs(5,i,j,k) =   t2 + t3
+
+             end do
+          end do
+       end do
+       if (timeron) call timer_stop(t_txinvr)
+
+       return
+       end
+
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/SP/tzetar.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/SP/tzetar.f90
new file mode 100644
index 000000000..012ede77f
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/SP/tzetar.f90
@@ -0,0 +1,64 @@
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+       subroutine  tzetar
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!   block-diagonal matrix-vector multiplication                       
+!---------------------------------------------------------------------
+
+       use sp_data
+       implicit none
+
+       integer i, j, k
+       double precision  t1, t2, t3, ac, xvel, yvel, zvel, r1, r2, r3,  &
+     &                   r4, r5, btuz, ac2u, uzik1
+
+
+       if (timeron) call timer_start(t_tzetar)
+!$omp parallel do default(shared)  &
+!$omp& private(i,j,k,t1,t2,t3,ac,xvel,yvel,zvel,r1,r2,r3,  &
+!$omp&              r4,r5,btuz,ac2u,uzik1)  &
+!$omp&  collapse(2)
+       do    k = 1, nz2
+          do    j = 1, ny2
+             do    i = 1, nx2
+
+                xvel = us(i,j,k)
+                yvel = vs(i,j,k)
+                zvel = ws(i,j,k)
+                ac   = speed(i,j,k)
+
+                ac2u = ac*ac
+
+                r1 = rhs(1,i,j,k)
+                r2 = rhs(2,i,j,k)
+                r3 = rhs(3,i,j,k)
+                r4 = rhs(4,i,j,k)
+                r5 = rhs(5,i,j,k)      
+
+                uzik1 = u(1,i,j,k)
+                btuz  = bt * uzik1
+
+                t1 = btuz/ac * (r4 + r5)
+                t2 = r3 + t1
+                t3 = btuz * (r4 - r5)
+
+                rhs(1,i,j,k) = t2
+                rhs(2,i,j,k) = -uzik1*r2 + xvel*t2
+                rhs(3,i,j,k) =  uzik1*r1 + yvel*t2
+                rhs(4,i,j,k) =  zvel*t2  + t3
+                rhs(5,i,j,k) =  uzik1*(-xvel*r2 + yvel*r1) +  &
+     &                    qs(i,j,k)*t2 + c2iv*ac2u*t1 + zvel*t3
+
+             end do
+          end do
+       end do
+       if (timeron) call timer_stop(t_tzetar)
+
+       return
+       end
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/SP/verify.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/SP/verify.f90
new file mode 100644
index 000000000..1cc491b97
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/SP/verify.f90
@@ -0,0 +1,392 @@
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+        subroutine verify(no_time_steps, class, verified)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!  verification routine                         
+!---------------------------------------------------------------------
+
+        use, intrinsic :: ieee_arithmetic, only : ieee_is_nan
+
+        use sp_data
+
+        implicit none
+
+        double precision xcrref(5),xceref(5),xcrdif(5),xcedif(5),  &
+     &                   epsilon, xce(5), xcr(5), dtref
+        integer m, no_time_steps
+        character class
+        logical verified
+
+!---------------------------------------------------------------------
+!   tolerance level
+!---------------------------------------------------------------------
+        epsilon = 1.0d-08
+
+
+!---------------------------------------------------------------------
+!   compute the error norm and the residual norm (the latter scaled by 1/dt)
+!---------------------------------------------------------------------
+        call error_norm(xce)
+        call compute_rhs
+
+        call rhs_norm(xcr)
+
+        do m = 1, 5
+           xcr(m) = xcr(m) / dt
+        enddo
+
+        class = 'U'
+        verified = .true.
+
+        do m = 1,5
+           xcrref(m) = 1.0
+           xceref(m) = 1.0
+        end do
+
+!---------------------------------------------------------------------
+!    reference data for 12X12X12 grids after 100 time steps, with DT = 1.50d-02
+!---------------------------------------------------------------------
+        if ( (grid_points(1)  .eq. 12     ) .and.  &
+     &       (grid_points(2)  .eq. 12     ) .and.  &
+     &       (grid_points(3)  .eq. 12     ) .and.  &
+     &       (no_time_steps   .eq. 100    ))  then
+
+           class = 'S'
+           dtref = 1.5d-2
+
+!---------------------------------------------------------------------
+!    Reference values of RMS-norms of residual.
+!---------------------------------------------------------------------
+           xcrref(1) = 2.7470315451339479d-02
+           xcrref(2) = 1.0360746705285417d-02
+           xcrref(3) = 1.6235745065095532d-02
+           xcrref(4) = 1.5840557224455615d-02
+           xcrref(5) = 3.4849040609362460d-02
+
+!---------------------------------------------------------------------
+!    Reference values of RMS-norms of solution error.
+!---------------------------------------------------------------------
+           xceref(1) = 2.7289258557377227d-05
+           xceref(2) = 1.0364446640837285d-05
+           xceref(3) = 1.6154798287166471d-05
+           xceref(4) = 1.5750704994480102d-05
+           xceref(5) = 3.4177666183390531d-05
+
+
+!---------------------------------------------------------------------
+!    reference data for 36X36X36 grids after 400 time steps, with DT = 1.5d-03
+!---------------------------------------------------------------------
+        elseif ( (grid_points(1) .eq. 36) .and.  &
+     &           (grid_points(2) .eq. 36) .and.  &
+     &           (grid_points(3) .eq. 36) .and.  &
+     &           (no_time_steps  .eq. 400) ) then
+
+           class = 'W'
+           dtref = 1.5d-3
+
+!---------------------------------------------------------------------
+!    Reference values of RMS-norms of residual.
+!---------------------------------------------------------------------
+           xcrref(1) = 0.1893253733584d-02
+           xcrref(2) = 0.1717075447775d-03
+           xcrref(3) = 0.2778153350936d-03
+           xcrref(4) = 0.2887475409984d-03
+           xcrref(5) = 0.3143611161242d-02
+
+!---------------------------------------------------------------------
+!    Reference values of RMS-norms of solution error.
+!---------------------------------------------------------------------
+           xceref(1) = 0.7542088599534d-04
+           xceref(2) = 0.6512852253086d-05
+           xceref(3) = 0.1049092285688d-04
+           xceref(4) = 0.1128838671535d-04
+           xceref(5) = 0.1212845639773d-03
+
+!---------------------------------------------------------------------
+!    reference data for 64X64X64 grids after 400 time steps, with DT = 1.5d-03
+!---------------------------------------------------------------------
+        elseif ( (grid_points(1) .eq. 64) .and.  &
+     &           (grid_points(2) .eq. 64) .and.  &
+     &           (grid_points(3) .eq. 64) .and.  &
+     &           (no_time_steps  .eq. 400) ) then
+
+           class = 'A'
+           dtref = 1.5d-3
+
+!---------------------------------------------------------------------
+!    Reference values of RMS-norms of residual.
+!---------------------------------------------------------------------
+           xcrref(1) = 2.4799822399300195d0
+           xcrref(2) = 1.1276337964368832d0
+           xcrref(3) = 1.5028977888770491d0
+           xcrref(4) = 1.4217816211695179d0
+           xcrref(5) = 2.1292113035138280d0
+
+!---------------------------------------------------------------------
+!    Reference values of RMS-norms of solution error.
+!---------------------------------------------------------------------
+           xceref(1) = 1.0900140297820550d-04
+           xceref(2) = 3.7343951769282091d-05
+           xceref(3) = 5.0092785406541633d-05
+           xceref(4) = 4.7671093939528255d-05
+           xceref(5) = 1.3621613399213001d-04
+
+!---------------------------------------------------------------------
+!    reference data for 102X102X102 grids after 400 time steps,
+!    with DT = 1.0d-03
+!---------------------------------------------------------------------
+        elseif ( (grid_points(1) .eq. 102) .and.  &
+     &           (grid_points(2) .eq. 102) .and.  &
+     &           (grid_points(3) .eq. 102) .and.  &
+     &           (no_time_steps  .eq. 400) ) then
+
+           class = 'B'
+           dtref = 1.0d-3
+
+!---------------------------------------------------------------------
+!    Reference values of RMS-norms of residual.
+!---------------------------------------------------------------------
+           xcrref(1) = 0.6903293579998d+02
+           xcrref(2) = 0.3095134488084d+02
+           xcrref(3) = 0.4103336647017d+02
+           xcrref(4) = 0.3864769009604d+02
+           xcrref(5) = 0.5643482272596d+02
+
+!---------------------------------------------------------------------
+!    Reference values of RMS-norms of solution error.
+!---------------------------------------------------------------------
+           xceref(1) = 0.9810006190188d-02
+           xceref(2) = 0.1022827905670d-02
+           xceref(3) = 0.1720597911692d-02
+           xceref(4) = 0.1694479428231d-02
+           xceref(5) = 0.1847456263981d-01
+
+!---------------------------------------------------------------------
+!    reference data for 162X162X162 grids after 400 time steps,
+!    with DT = 0.67d-03
+!---------------------------------------------------------------------
+        elseif ( (grid_points(1) .eq. 162) .and.  &
+     &           (grid_points(2) .eq. 162) .and.  &
+     &           (grid_points(3) .eq. 162) .and.  &
+     &           (no_time_steps  .eq. 400) ) then
+
+           class = 'C'
+           dtref = 0.67d-3
+
+!---------------------------------------------------------------------
+!    Reference values of RMS-norms of residual.
+!---------------------------------------------------------------------
+           xcrref(1) = 0.5881691581829d+03
+           xcrref(2) = 0.2454417603569d+03
+           xcrref(3) = 0.3293829191851d+03
+           xcrref(4) = 0.3081924971891d+03
+           xcrref(5) = 0.4597223799176d+03
+
+!---------------------------------------------------------------------
+!    Reference values of RMS-norms of solution error.
+!---------------------------------------------------------------------
+           xceref(1) = 0.2598120500183d+00
+           xceref(2) = 0.2590888922315d-01
+           xceref(3) = 0.5132886416320d-01
+           xceref(4) = 0.4806073419454d-01
+           xceref(5) = 0.5483377491301d+00
+
+!---------------------------------------------------------------------
+!    reference data for 408X408X408 grids after 500 time steps,
+!    with DT = 0.3d-03
+!---------------------------------------------------------------------
+        elseif ( (grid_points(1) .eq. 408) .and.  &
+     &           (grid_points(2) .eq. 408) .and.  &
+     &           (grid_points(3) .eq. 408) .and.  &
+     &           (no_time_steps  .eq. 500) ) then
+
+           class = 'D'
+           dtref = 0.30d-3
+
+!---------------------------------------------------------------------
+!    Reference values of RMS-norms of residual.
+!---------------------------------------------------------------------
+           xcrref(1) = 0.1044696216887d+05
+           xcrref(2) = 0.3204427762578d+04
+           xcrref(3) = 0.4648680733032d+04
+           xcrref(4) = 0.4238923283697d+04
+           xcrref(5) = 0.7588412036136d+04
+
+!---------------------------------------------------------------------
+!    Reference values of RMS-norms of solution error.
+!---------------------------------------------------------------------
+           xceref(1) = 0.5089471423669d+01
+           xceref(2) = 0.5323514855894d+00
+           xceref(3) = 0.1187051008971d+01
+           xceref(4) = 0.1083734951938d+01
+           xceref(5) = 0.1164108338568d+02
+
+!---------------------------------------------------------------------
+!    reference data for 1020X1020X1020 grids after 500 time steps,
+!    with DT = 0.1d-03
+!---------------------------------------------------------------------
+        elseif ( (grid_points(1) .eq. 1020) .and.  &
+     &           (grid_points(2) .eq. 1020) .and.  &
+     &           (grid_points(3) .eq. 1020) .and.  &
+     &           (no_time_steps  .eq. 500) ) then
+
+           class = 'E'
+           dtref = 0.10d-3
+
+!---------------------------------------------------------------------
+!    Reference values of RMS-norms of residual.
+!---------------------------------------------------------------------
+           xcrref(1) = 0.6255387422609d+05
+           xcrref(2) = 0.1495317020012d+05
+           xcrref(3) = 0.2347595750586d+05
+           xcrref(4) = 0.2091099783534d+05
+           xcrref(5) = 0.4770412841218d+05
+
+!---------------------------------------------------------------------
+!    Reference values of RMS-norms of solution error.
+!---------------------------------------------------------------------
+           xceref(1) = 0.6742735164909d+02
+           xceref(2) = 0.5390656036938d+01
+           xceref(3) = 0.1680647196477d+02
+           xceref(4) = 0.1536963126457d+02
+           xceref(5) = 0.1575330146156d+03
+
+!---------------------------------------------------------------------
+!    reference data for 2560X2560X2560 grids after 500 time steps,
+!    with DT = 0.15d-04
+!---------------------------------------------------------------------
+        elseif ( (grid_points(1) .eq. 2560) .and.  &
+     &           (grid_points(2) .eq. 2560) .and.  &
+     &           (grid_points(3) .eq. 2560) .and.  &
+     &           (no_time_steps  .eq. 500) ) then
+
+           class = 'F'
+           dtref = 0.15d-4
+
+!---------------------------------------------------------------------
+!    Reference values of RMS-norms of residual.
+!---------------------------------------------------------------------
+           xcrref(1) = 0.9281628449462d+05
+           xcrref(2) = 0.2230152287675d+05
+           xcrref(3) = 0.3493102358632d+05
+           xcrref(4) = 0.3114096186689d+05
+           xcrref(5) = 0.7424426448298d+05
+
+!---------------------------------------------------------------------
+!    Reference values of RMS-norms of solution error.
+!---------------------------------------------------------------------
+           xceref(1) = 0.2683717702444d+03
+           xceref(2) = 0.2030647554028d+02
+           xceref(3) = 0.6734864248234d+02
+           xceref(4) = 0.5947451301640d+02
+           xceref(5) = 0.5417636652565d+03
+
+
+        else
+           verified = .false.
+        endif
+
+!---------------------------------------------------------------------
+!    verification test for residuals if gridsize is one of 
+!    the defined grid sizes above (class .ne. 'U')
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!    Compute the relative difference between the computed norms and the known reference values.
+!---------------------------------------------------------------------
+        do m = 1, 5
+           
+           xcrdif(m) = dabs((xcr(m)-xcrref(m))/xcrref(m)) 
+           xcedif(m) = dabs((xce(m)-xceref(m))/xceref(m))
+           
+        enddo
+
+!---------------------------------------------------------------------
+!    Output the comparison of computed results to known cases.
+!---------------------------------------------------------------------
+
+        if (class .ne. 'U') then
+           write(*, 1990) class
+ 1990      format(' Verification being performed for class ', a)
+           write (*,2000) epsilon
+ 2000      format(' accuracy setting for epsilon = ', E20.13)
+           verified = (dabs(dt-dtref) .le. epsilon)
+           if (.not.verified) then  
+              class = 'U'
+              write (*,1000) dtref
+ 1000         format(' DT does not match the reference value of ',  &
+     &                 E15.8)
+           endif
+        else 
+           write(*, 1995)
+ 1995      format(' Unknown class')
+        endif
+
+
+        if (class .ne. 'U') then
+           write (*, 2001) 
+        else
+           write (*, 2005)
+        endif
+
+ 2001   format(' Comparison of RMS-norms of residual')
+ 2005   format(' RMS-norms of residual')
+        do m = 1, 5
+           if (class .eq. 'U') then
+              write(*, 2015) m, xcr(m)
+           else if ((.not.ieee_is_nan(xcrdif(m))) .and.  &
+     &              xcrdif(m) .le. epsilon) then
+              write (*,2011) m,xcr(m),xcrref(m),xcrdif(m)
+           else 
+              verified = .false.
+              write (*,2010) m,xcr(m),xcrref(m),xcrdif(m)
+           endif
+        enddo
+
+        if (class .ne. 'U') then
+           write (*,2002)
+        else
+           write (*,2006)
+        endif
+ 2002   format(' Comparison of RMS-norms of solution error')
+ 2006   format(' RMS-norms of solution error')
+        
+        do m = 1, 5
+           if (class .eq. 'U') then
+              write(*, 2015) m, xce(m)
+           else if ((.not.ieee_is_nan(xcedif(m))) .and.  &
+     &              xcedif(m) .le. epsilon) then
+              write (*,2011) m,xce(m),xceref(m),xcedif(m)
+           else
+              verified = .false.
+              write (*,2010) m,xce(m),xceref(m),xcedif(m)
+           endif
+        enddo
+        
+ 2010   format(' FAILURE: ', i2, E20.13, E20.13, E20.13)
+ 2011   format('          ', i2, E20.13, E20.13, E20.13)
+ 2015   format('          ', i2, E20.13)
+        
+        if (class .eq. 'U') then
+           write(*, 2022)
+           write(*, 2023)
+ 2022      format(' No reference values provided')
+ 2023      format(' No verification performed')
+        else if (verified) then
+           write(*, 2020)
+ 2020      format(' Verification Successful')
+        else
+           write(*, 2021)
+ 2021      format(' Verification failed')
+        endif
+
+        return
+
+
+        end
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/SP/work_lhs.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/SP/work_lhs.f90
new file mode 100644
index 000000000..ea4214b40
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/SP/work_lhs.f90
@@ -0,0 +1,61 @@
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+!
+!  work_lhs module
+!
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      module work_lhs
+
+      use sp_data, only : problem_size
+
+!-----------------------------------------------------------------------
+!   Working array for LHS
+!-----------------------------------------------------------------------
+
+      integer, parameter :: IMAXP=problem_size/2*2
+      double precision  &
+     &      lhs (5,0:IMAXP),  &
+     &      lhsp(5,0:IMAXP),  &
+     &      lhsm(5,0:IMAXP),  &
+     &      cv  (0:problem_size-1),  &
+     &      rhov(0:problem_size-1)
+!$omp threadprivate(lhs, lhsp, lhsm, cv, rhov)
+
+      end module work_lhs
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+       subroutine lhsinit(ni, lhs, lhsp, lhsm)
+
+       implicit none
+
+       integer ni
+       double precision lhs(5,0:*), lhsp(5,0:*), lhsm(5,0:*)
+
+       integer m
+
+!---------------------------------------------------------------------
+!     initialize the two boundary rows (0 and ni) of each factor:
+!     zero all five bands, then set the diagonal (band 3) to 1
+!---------------------------------------------------------------------
+       do   m = 1, 5
+          lhs (m,0) = 0.0d0
+          lhsp(m,0) = 0.0d0
+          lhsm(m,0) = 0.0d0
+          lhs (m,ni) = 0.0d0
+          lhsp(m,ni) = 0.0d0
+          lhsm(m,ni) = 0.0d0
+       end do
+       lhs (3,0) = 1.0d0
+       lhsp(3,0) = 1.0d0
+       lhsm(3,0) = 1.0d0
+       lhs (3,ni) = 1.0d0
+       lhsp(3,ni) = 1.0d0
+       lhsm(3,ni) = 1.0d0
+ 
+       return
+       end
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/SP/work_lhs_blk.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/SP/work_lhs_blk.f90
new file mode 100644
index 000000000..b991d4e66
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/SP/work_lhs_blk.f90
@@ -0,0 +1,70 @@
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+!
+!  work_lhs module (blocked version)
+!
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      module work_lhs
+
+      use sp_data, only : problem_size
+
+!-----------------------------------------------------------------------
+!   Working array for LHS
+!-----------------------------------------------------------------------
+
+      include 'blk_par.h'
+
+      integer, parameter :: IMAXP=problem_size/2*2
+      double precision  &
+     &      lhs (blkdim,5,0:IMAXP),  &
+     &      lhsp(blkdim,5,0:IMAXP),  &
+     &      lhsm(blkdim,5,0:IMAXP),  &
+     &      rhsx(blkdim,5,0:IMAXP),  &
+     &      cv  (blkdim,0:problem_size-1),  &
+     &      rhov(blkdim,0:problem_size-1)
+!$omp threadprivate(lhs, lhsp, lhsm, rhsx, cv, rhov)
+
+      end module work_lhs
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+       subroutine lhsinit(ni)
+
+       use work_lhs
+       implicit none
+
+       integer ni
+
+       integer i, j
+
+!---------------------------------------------------------------------
+!     initialize the two boundary rows (0 and ni) of each factor:
+!     zero all five bands, then set the diagonal (band 3) to 1
+!---------------------------------------------------------------------
+       do i = 0, ni, ni
+          do j = 1, bsize
+             lhs (j,1,i) = 0.0d0
+             lhs (j,2,i) = 0.0d0
+             lhs (j,3,i) = 1.0d0
+             lhs (j,4,i) = 0.0d0
+             lhs (j,5,i) = 0.0d0
+             lhsp(j,1,i) = 0.0d0
+             lhsp(j,2,i) = 0.0d0
+             lhsp(j,3,i) = 1.0d0
+             lhsp(j,4,i) = 0.0d0
+             lhsp(j,5,i) = 0.0d0
+             lhsm(j,1,i) = 0.0d0
+             lhsm(j,2,i) = 0.0d0
+             lhsm(j,3,i) = 1.0d0
+             lhsm(j,4,i) = 0.0d0
+             lhsm(j,5,i) = 0.0d0
+          end do
+       end do
+ 
+       return
+       end
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/SP/x_solve.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/SP/x_solve.f90
new file mode 100644
index 000000000..a5791accf
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/SP/x_solve.f90
@@ -0,0 +1,308 @@
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+       subroutine x_solve
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+! this subroutine performs the solution of the approximate factorization
+! step in the x-direction for all five matrix components
+! simultaneously. The Thomas algorithm is employed to solve the
+! systems for the x-lines. Boundary conditions are non-periodic
+!---------------------------------------------------------------------
+
+       use sp_data
+       use work_lhs
+
+       implicit none
+
+       integer i, j, k, i1, i2, m
+       double precision  ru1, fac1, fac2
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+       if (timeron) call timer_start(t_xsolve)
+!$omp parallel do default(shared) private(i,j,k,i1,i2,m,  &
+!$omp&    ru1,fac1,fac2) collapse(2)
+       do  k = 1, nz2
+          do  j = 1, ny2
+
+            call lhsinit(nx2+1, lhs, lhsp, lhsm)
+
+!---------------------------------------------------------------------
+! Computes the left hand side for the three x-factors  
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!      first fill the lhs for the u-eigenvalue                   
+!---------------------------------------------------------------------
+             do  i = 0, grid_points(1)-1
+                ru1 = c3c4*rho_i(i,j,k)
+                cv(i) = us(i,j,k)
+                rhov(i) = dmax1(dx2+con43*ru1,  &
+     &                          dx5+c1c5*ru1,  &
+     &                          dxmax+ru1,  &
+     &                          dx1)
+             end do
+
+             do  i = 1, nx2
+                lhs(1,i) =  0.0d0
+                lhs(2,i) = -dttx2 * cv(i-1) - dttx1 * rhov(i-1)
+                lhs(3,i) =  1.0d0 + c2dttx1 * rhov(i)
+                lhs(4,i) =  dttx2 * cv(i+1) - dttx1 * rhov(i+1)
+                lhs(5,i) =  0.0d0
+             end do
+
+!---------------------------------------------------------------------
+!      add fourth order dissipation                             
+!---------------------------------------------------------------------
+
+             i = 1
+             lhs(3,i) = lhs(3,i) + comz5
+             lhs(4,i) = lhs(4,i) - comz4
+             lhs(5,i) = lhs(5,i) + comz1
+  
+             lhs(2,i+1) = lhs(2,i+1) - comz4
+             lhs(3,i+1) = lhs(3,i+1) + comz6
+             lhs(4,i+1) = lhs(4,i+1) - comz4
+             lhs(5,i+1) = lhs(5,i+1) + comz1
+
+             do   i=3, grid_points(1)-4
+                lhs(1,i) = lhs(1,i) + comz1
+                lhs(2,i) = lhs(2,i) - comz4
+                lhs(3,i) = lhs(3,i) + comz6
+                lhs(4,i) = lhs(4,i) - comz4
+                lhs(5,i) = lhs(5,i) + comz1
+             end do
+
+             i = grid_points(1)-3
+             lhs(1,i) = lhs(1,i) + comz1
+             lhs(2,i) = lhs(2,i) - comz4
+             lhs(3,i) = lhs(3,i) + comz6
+             lhs(4,i) = lhs(4,i) - comz4
+
+             lhs(1,i+1) = lhs(1,i+1) + comz1
+             lhs(2,i+1) = lhs(2,i+1) - comz4
+             lhs(3,i+1) = lhs(3,i+1) + comz5
+
+!---------------------------------------------------------------------
+!      subsequently, fill the other factors (u+c), (u-c) by adding to 
+!      the first  
+!---------------------------------------------------------------------
+             do   i = 1, nx2
+                lhsp(1,i) = lhs(1,i)
+                lhsp(2,i) = lhs(2,i) -  &
+     &                            dttx2 * speed(i-1,j,k)
+                lhsp(3,i) = lhs(3,i)
+                lhsp(4,i) = lhs(4,i) +  &
+     &                            dttx2 * speed(i+1,j,k)
+                lhsp(5,i) = lhs(5,i)
+                lhsm(1,i) = lhs(1,i)
+                lhsm(2,i) = lhs(2,i) +  &
+     &                            dttx2 * speed(i-1,j,k)
+                lhsm(3,i) = lhs(3,i)
+                lhsm(4,i) = lhs(4,i) -  &
+     &                            dttx2 * speed(i+1,j,k)
+                lhsm(5,i) = lhs(5,i)
+             end do
+
+!---------------------------------------------------------------------
+!                          FORWARD ELIMINATION  
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!      perform the Thomas algorithm; first, FORWARD ELIMINATION     
+!---------------------------------------------------------------------
+
+             do    i = 0, grid_points(1)-3
+                i1 = i  + 1
+                i2 = i  + 2
+                fac1      = 1.d0/lhs(3,i)
+                lhs(4,i)  = fac1*lhs(4,i)
+                lhs(5,i)  = fac1*lhs(5,i)
+                do    m = 1, 3
+                   rhs(m,i,j,k) = fac1*rhs(m,i,j,k)
+                end do
+                lhs(3,i1) = lhs(3,i1) -  &
+     &                         lhs(2,i1)*lhs(4,i)
+                lhs(4,i1) = lhs(4,i1) -  &
+     &                         lhs(2,i1)*lhs(5,i)
+                do    m = 1, 3
+                   rhs(m,i1,j,k) = rhs(m,i1,j,k) -  &
+     &                         lhs(2,i1)*rhs(m,i,j,k)
+                end do
+                lhs(2,i2) = lhs(2,i2) -  &
+     &                         lhs(1,i2)*lhs(4,i)
+                lhs(3,i2) = lhs(3,i2) -  &
+     &                         lhs(1,i2)*lhs(5,i)
+                do    m = 1, 3
+                   rhs(m,i2,j,k) = rhs(m,i2,j,k) -  &
+     &                         lhs(1,i2)*rhs(m,i,j,k)
+                end do
+             end do
+
+!---------------------------------------------------------------------
+!      The last two rows in this grid block are a bit different, 
+!      since they do not have two more rows available for the
+!      elimination of off-diagonal entries
+!---------------------------------------------------------------------
+
+             i  = grid_points(1)-2
+             i1 = grid_points(1)-1
+             fac1      = 1.d0/lhs(3,i)
+             lhs(4,i)  = fac1*lhs(4,i)
+             lhs(5,i)  = fac1*lhs(5,i)
+             do    m = 1, 3
+                rhs(m,i,j,k) = fac1*rhs(m,i,j,k)
+             end do
+             lhs(3,i1) = lhs(3,i1) -  &
+     &                      lhs(2,i1)*lhs(4,i)
+             lhs(4,i1) = lhs(4,i1) -  &
+     &                      lhs(2,i1)*lhs(5,i)
+             do    m = 1, 3
+                rhs(m,i1,j,k) = rhs(m,i1,j,k) -  &
+     &                      lhs(2,i1)*rhs(m,i,j,k)
+             end do
+!---------------------------------------------------------------------
+!            scale the last row immediately 
+!---------------------------------------------------------------------
+             fac2             = 1.d0/lhs(3,i1)
+             do    m = 1, 3
+                rhs(m,i1,j,k) = fac2*rhs(m,i1,j,k)
+             end do
+
+!---------------------------------------------------------------------
+!      do the u+c and the u-c factors                 
+!---------------------------------------------------------------------
+
+             do    i = 0, grid_points(1)-3
+                i1 = i  + 1
+                i2 = i  + 2
+                m = 4
+                fac1       = 1.d0/lhsp(3,i)
+                lhsp(4,i)  = fac1*lhsp(4,i)
+                lhsp(5,i)  = fac1*lhsp(5,i)
+                rhs(m,i,j,k) = fac1*rhs(m,i,j,k)
+                lhsp(3,i1) = lhsp(3,i1) -  &
+     &                        lhsp(2,i1)*lhsp(4,i)
+                lhsp(4,i1) = lhsp(4,i1) -  &
+     &                        lhsp(2,i1)*lhsp(5,i)
+                rhs(m,i1,j,k) = rhs(m,i1,j,k) -  &
+     &                        lhsp(2,i1)*rhs(m,i,j,k)
+                lhsp(2,i2) = lhsp(2,i2) -  &
+     &                        lhsp(1,i2)*lhsp(4,i)
+                lhsp(3,i2) = lhsp(3,i2) -  &
+     &                        lhsp(1,i2)*lhsp(5,i)
+                rhs(m,i2,j,k) = rhs(m,i2,j,k) -  &
+     &                        lhsp(1,i2)*rhs(m,i,j,k)
+                m = 5
+                fac1       = 1.d0/lhsm(3,i)
+                lhsm(4,i)  = fac1*lhsm(4,i)
+                lhsm(5,i)  = fac1*lhsm(5,i)
+                rhs(m,i,j,k) = fac1*rhs(m,i,j,k)
+                lhsm(3,i1) = lhsm(3,i1) -  &
+     &                        lhsm(2,i1)*lhsm(4,i)
+                lhsm(4,i1) = lhsm(4,i1) -  &
+     &                        lhsm(2,i1)*lhsm(5,i)
+                rhs(m,i1,j,k) = rhs(m,i1,j,k) -  &
+     &                        lhsm(2,i1)*rhs(m,i,j,k)
+                lhsm(2,i2) = lhsm(2,i2) -  &
+     &                        lhsm(1,i2)*lhsm(4,i)
+                lhsm(3,i2) = lhsm(3,i2) -  &
+     &                        lhsm(1,i2)*lhsm(5,i)
+                rhs(m,i2,j,k) = rhs(m,i2,j,k) -  &
+     &                        lhsm(1,i2)*rhs(m,i,j,k)
+             end do
+
+!---------------------------------------------------------------------
+!         And again the last two rows separately
+!---------------------------------------------------------------------
+             i  = grid_points(1)-2
+             i1 = grid_points(1)-1
+             m = 4
+             fac1       = 1.d0/lhsp(3,i)
+             lhsp(4,i)  = fac1*lhsp(4,i)
+             lhsp(5,i)  = fac1*lhsp(5,i)
+             rhs(m,i,j,k) = fac1*rhs(m,i,j,k)
+             lhsp(3,i1) = lhsp(3,i1) -  &
+     &                      lhsp(2,i1)*lhsp(4,i)
+             lhsp(4,i1) = lhsp(4,i1) -  &
+     &                      lhsp(2,i1)*lhsp(5,i)
+             rhs(m,i1,j,k) = rhs(m,i1,j,k) -  &
+     &                      lhsp(2,i1)*rhs(m,i,j,k)
+             m = 5
+             fac1       = 1.d0/lhsm(3,i)
+             lhsm(4,i)  = fac1*lhsm(4,i)
+             lhsm(5,i)  = fac1*lhsm(5,i)
+             rhs(m,i,j,k) = fac1*rhs(m,i,j,k)
+             lhsm(3,i1) = lhsm(3,i1) -  &
+     &                      lhsm(2,i1)*lhsm(4,i)
+             lhsm(4,i1) = lhsm(4,i1) -  &
+     &                      lhsm(2,i1)*lhsm(5,i)
+             rhs(m,i1,j,k) = rhs(m,i1,j,k) -  &
+     &                      lhsm(2,i1)*rhs(m,i,j,k)
+!---------------------------------------------------------------------
+!               Scale the last row immediately
+!---------------------------------------------------------------------
+             rhs(4,i1,j,k) = rhs(4,i1,j,k)/lhsp(3,i1)
+             rhs(5,i1,j,k) = rhs(5,i1,j,k)/lhsm(3,i1)
+
+
+!---------------------------------------------------------------------
+!                         BACKSUBSTITUTION 
+!---------------------------------------------------------------------
+
+
+             i  = grid_points(1)-2
+             i1 = grid_points(1)-1
+             do   m = 1, 3
+                rhs(m,i,j,k) = rhs(m,i,j,k) -  &
+     &                             lhs(4,i)*rhs(m,i1,j,k)
+             end do
+
+             rhs(4,i,j,k) = rhs(4,i,j,k) -  &
+     &                          lhsp(4,i)*rhs(4,i1,j,k)
+             rhs(5,i,j,k) = rhs(5,i,j,k) -  &
+     &                          lhsm(4,i)*rhs(5,i1,j,k)
+
+!---------------------------------------------------------------------
+!      The first three factors
+!---------------------------------------------------------------------
+             do    i = grid_points(1)-3, 0, -1
+                i1 = i  + 1
+                i2 = i  + 2
+                do   m = 1, 3
+                   rhs(m,i,j,k) = rhs(m,i,j,k) -  &
+     &                          lhs(4,i)*rhs(m,i1,j,k) -  &
+     &                          lhs(5,i)*rhs(m,i2,j,k)
+                end do
+
+!---------------------------------------------------------------------
+!      And the remaining two
+!---------------------------------------------------------------------
+                rhs(4,i,j,k) = rhs(4,i,j,k) -  &
+     &                          lhsp(4,i)*rhs(4,i1,j,k) -  &
+     &                          lhsp(5,i)*rhs(4,i2,j,k)
+                rhs(5,i,j,k) = rhs(5,i,j,k) -  &
+     &                          lhsm(4,i)*rhs(5,i1,j,k) -  &
+     &                          lhsm(5,i)*rhs(5,i2,j,k)
+             end do
+          end do
+
+       end do
+       if (timeron) call timer_stop(t_xsolve)
+
+!---------------------------------------------------------------------
+!      Do the block-diagonal inversion          
+!---------------------------------------------------------------------
+       call ninvr
+
+       return
+       end
+
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/SP/x_solve_blk.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/SP/x_solve_blk.f90
new file mode 100644
index 000000000..52a6dc23f
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/SP/x_solve_blk.f90
@@ -0,0 +1,374 @@
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+       subroutine x_solve
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+! This subroutine solves the approximate factorization step in the
+! x-direction for all five matrix components simultaneously. The
+! Thomas algorithm is used to solve the systems along the x-lines.
+! Boundary conditions are non-periodic.
+!---------------------------------------------------------------------
+
+       use sp_data
+       use work_lhs
+
+       implicit none
+
+       integer i, j, k, i1, i2, jj, jb, jm
+       double precision  ru1, fac1, fac2
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+       if (timeron) call timer_start(t_xsolve)
+!$omp parallel default(shared) private(i,j,k,i1,i2,jj,jb,jm,  &
+!$omp&    ru1,fac1,fac2)
+
+       call lhsinit(nx2+1)
+
+!$omp do collapse(2)
+       do  k = 1, nz2
+       do  jj = 1, ny2, bsize
+          jm = min(bsize, ny2 - jj + 1)
+
+!---------------------------------------------------------------------
+! To improve cache utilization, copy a slab of rhs to the temp array rhsx
+!---------------------------------------------------------------------
+          do  i = 0, grid_points(1)-1
+             do  jb = 1, bsize
+                j = min(jb,jm) + jj - 1
+                rhsx(jb,1,i) = rhs(1,i,j,k)
+                rhsx(jb,2,i) = rhs(2,i,j,k)
+                rhsx(jb,3,i) = rhs(3,i,j,k)
+                rhsx(jb,4,i) = rhs(4,i,j,k)
+                rhsx(jb,5,i) = rhs(5,i,j,k)
+             end do
+          end do
+
+!---------------------------------------------------------------------
+! Computes the left hand side for the three x-factors  
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!      first fill the lhs for the u-eigenvalue                   
+!---------------------------------------------------------------------
+          do  i = 0, grid_points(1)-1
+             do  jb = 1, bsize
+                j = min(jb,jm) + jj - 1
+                ru1 = c3c4*rho_i(i,j,k)
+                cv(jb,i) = us(i,j,k)
+                rhov(jb,i) = dmax1(dx2+con43*ru1,  &
+     &                          dx5+c1c5*ru1,  &
+     &                          dxmax+ru1,  &
+     &                          dx1)
+             end do
+          end do
+
+          do  i = 1, nx2
+             do  jb = 1, bsize
+                lhs(jb,1,i) =  0.0d0
+                lhs(jb,2,i) = -dttx2 * cv(jb,i-1) - dttx1 * rhov(jb,i-1)
+                lhs(jb,3,i) =  1.0d0 + c2dttx1 * rhov(jb,i)
+                lhs(jb,4,i) =  dttx2 * cv(jb,i+1) - dttx1 * rhov(jb,i+1)
+                lhs(jb,5,i) =  0.0d0
+             end do
+          end do
+
+!---------------------------------------------------------------------
+!      add fourth order dissipation                             
+!---------------------------------------------------------------------
+
+          do  jb = 1, bsize
+             i = 1
+             lhs(jb,3,i) = lhs(jb,3,i) + comz5
+             lhs(jb,4,i) = lhs(jb,4,i) - comz4
+             lhs(jb,5,i) = lhs(jb,5,i) + comz1
+  
+             i = 2
+             lhs(jb,2,i) = lhs(jb,2,i) - comz4
+             lhs(jb,3,i) = lhs(jb,3,i) + comz6
+             lhs(jb,4,i) = lhs(jb,4,i) - comz4
+             lhs(jb,5,i) = lhs(jb,5,i) + comz1
+          end do
+
+          do   i=3, grid_points(1)-4
+             do  jb = 1, bsize
+                lhs(jb,1,i) = lhs(jb,1,i) + comz1
+                lhs(jb,2,i) = lhs(jb,2,i) - comz4
+                lhs(jb,3,i) = lhs(jb,3,i) + comz6
+                lhs(jb,4,i) = lhs(jb,4,i) - comz4
+                lhs(jb,5,i) = lhs(jb,5,i) + comz1
+             end do
+          end do
+
+          do  jb = 1, bsize
+             i = grid_points(1)-3
+             lhs(jb,1,i) = lhs(jb,1,i) + comz1
+             lhs(jb,2,i) = lhs(jb,2,i) - comz4
+             lhs(jb,3,i) = lhs(jb,3,i) + comz6
+             lhs(jb,4,i) = lhs(jb,4,i) - comz4
+
+             i = grid_points(1)-2
+             lhs(jb,1,i) = lhs(jb,1,i) + comz1
+             lhs(jb,2,i) = lhs(jb,2,i) - comz4
+             lhs(jb,3,i) = lhs(jb,3,i) + comz5
+          end do
+
+!---------------------------------------------------------------------
+!      subsequently, fill the other factors (u+c), (u-c) by adding to 
+!      the first  
+!---------------------------------------------------------------------
+          do   i = 1, nx2
+             do  jb = 1, bsize
+                j = min(jb,jm) + jj - 1
+                lhsp(jb,1,i) = lhs(jb,1,i)
+                lhsp(jb,2,i) = lhs(jb,2,i) -  &
+     &                            dttx2 * speed(i-1,j,k)
+                lhsp(jb,3,i) = lhs(jb,3,i)
+                lhsp(jb,4,i) = lhs(jb,4,i) +  &
+     &                            dttx2 * speed(i+1,j,k)
+                lhsp(jb,5,i) = lhs(jb,5,i)
+                lhsm(jb,1,i) = lhs(jb,1,i)
+                lhsm(jb,2,i) = lhs(jb,2,i) +  &
+     &                            dttx2 * speed(i-1,j,k)
+                lhsm(jb,3,i) = lhs(jb,3,i)
+                lhsm(jb,4,i) = lhs(jb,4,i) -  &
+     &                            dttx2 * speed(i+1,j,k)
+                lhsm(jb,5,i) = lhs(jb,5,i)
+             end do
+          end do
+
+!---------------------------------------------------------------------
+!                          FORWARD ELIMINATION  
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!      perform the Thomas algorithm; first, FORWARD ELIMINATION     
+!---------------------------------------------------------------------
+
+          do    i = 0, grid_points(1)-3
+             i1 = i  + 1
+             i2 = i  + 2
+             do  jb = 1, bsize
+                fac1      = 1.d0/lhs(jb,3,i)
+                lhs(jb,4,i)  = fac1*lhs(jb,4,i)
+                lhs(jb,5,i)  = fac1*lhs(jb,5,i)
+                rhsx(jb,1,i) = fac1*rhsx(jb,1,i)
+                rhsx(jb,2,i) = fac1*rhsx(jb,2,i)
+                rhsx(jb,3,i) = fac1*rhsx(jb,3,i)
+                lhs(jb,3,i1) = lhs(jb,3,i1) -  &
+     &                         lhs(jb,2,i1)*lhs(jb,4,i)
+                lhs(jb,4,i1) = lhs(jb,4,i1) -  &
+     &                         lhs(jb,2,i1)*lhs(jb,5,i)
+                rhsx(jb,1,i1) = rhsx(jb,1,i1) -  &
+     &                         lhs(jb,2,i1)*rhsx(jb,1,i)
+                rhsx(jb,2,i1) = rhsx(jb,2,i1) -  &
+     &                         lhs(jb,2,i1)*rhsx(jb,2,i)
+                rhsx(jb,3,i1) = rhsx(jb,3,i1) -  &
+     &                         lhs(jb,2,i1)*rhsx(jb,3,i)
+                lhs(jb,2,i2) = lhs(jb,2,i2) -  &
+     &                         lhs(jb,1,i2)*lhs(jb,4,i)
+                lhs(jb,3,i2) = lhs(jb,3,i2) -  &
+     &                         lhs(jb,1,i2)*lhs(jb,5,i)
+                rhsx(jb,1,i2) = rhsx(jb,1,i2) -  &
+     &                         lhs(jb,1,i2)*rhsx(jb,1,i)
+                rhsx(jb,2,i2) = rhsx(jb,2,i2) -  &
+     &                         lhs(jb,1,i2)*rhsx(jb,2,i)
+                rhsx(jb,3,i2) = rhsx(jb,3,i2) -  &
+     &                         lhs(jb,1,i2)*rhsx(jb,3,i)
+             end do
+          end do
+
+!---------------------------------------------------------------------
+!      The last two rows in this grid block are a bit different, 
+!      since they do not have two more rows available for the
+!      elimination of off-diagonal entries
+!---------------------------------------------------------------------
+
+          i  = grid_points(1)-2
+          i1 = grid_points(1)-1
+          do  jb = 1, bsize
+             fac1      = 1.d0/lhs(jb,3,i)
+             lhs(jb,4,i)  = fac1*lhs(jb,4,i)
+             lhs(jb,5,i)  = fac1*lhs(jb,5,i)
+             rhsx(jb,1,i) = fac1*rhsx(jb,1,i)
+             rhsx(jb,2,i) = fac1*rhsx(jb,2,i)
+             rhsx(jb,3,i) = fac1*rhsx(jb,3,i)
+             lhs(jb,3,i1) = lhs(jb,3,i1) -  &
+     &                      lhs(jb,2,i1)*lhs(jb,4,i)
+             lhs(jb,4,i1) = lhs(jb,4,i1) -  &
+     &                      lhs(jb,2,i1)*lhs(jb,5,i)
+             rhsx(jb,1,i1) = rhsx(jb,1,i1) -  &
+     &                      lhs(jb,2,i1)*rhsx(jb,1,i)
+             rhsx(jb,2,i1) = rhsx(jb,2,i1) -  &
+     &                      lhs(jb,2,i1)*rhsx(jb,2,i)
+             rhsx(jb,3,i1) = rhsx(jb,3,i1) -  &
+     &                      lhs(jb,2,i1)*rhsx(jb,3,i)
+!---------------------------------------------------------------------
+!            scale the last row immediately 
+!---------------------------------------------------------------------
+             fac2             = 1.d0/lhs(jb,3,i1)
+             rhsx(jb,1,i1) = fac2*rhsx(jb,1,i1)
+             rhsx(jb,2,i1) = fac2*rhsx(jb,2,i1)
+             rhsx(jb,3,i1) = fac2*rhsx(jb,3,i1)
+          end do
+
+!---------------------------------------------------------------------
+!      do the u+c and the u-c factors                 
+!---------------------------------------------------------------------
+
+          do    i = 0, grid_points(1)-3
+             i1 = i  + 1
+             i2 = i  + 2
+             do  jb = 1, bsize
+                fac1       = 1.d0/lhsp(jb,3,i)
+                lhsp(jb,4,i)  = fac1*lhsp(jb,4,i)
+                lhsp(jb,5,i)  = fac1*lhsp(jb,5,i)
+                rhsx(jb,4,i)  = fac1*rhsx(jb,4,i)
+                lhsp(jb,3,i1) = lhsp(jb,3,i1) -  &
+     &                        lhsp(jb,2,i1)*lhsp(jb,4,i)
+                lhsp(jb,4,i1) = lhsp(jb,4,i1) -  &
+     &                        lhsp(jb,2,i1)*lhsp(jb,5,i)
+                rhsx(jb,4,i1) = rhsx(jb,4,i1) -  &
+     &                        lhsp(jb,2,i1)*rhsx(jb,4,i)
+                lhsp(jb,2,i2) = lhsp(jb,2,i2) -  &
+     &                        lhsp(jb,1,i2)*lhsp(jb,4,i)
+                lhsp(jb,3,i2) = lhsp(jb,3,i2) -  &
+     &                        lhsp(jb,1,i2)*lhsp(jb,5,i)
+                rhsx(jb,4,i2) = rhsx(jb,4,i2) -  &
+     &                        lhsp(jb,1,i2)*rhsx(jb,4,i)
+                fac1       = 1.d0/lhsm(jb,3,i)
+                lhsm(jb,4,i)  = fac1*lhsm(jb,4,i)
+                lhsm(jb,5,i)  = fac1*lhsm(jb,5,i)
+                rhsx(jb,5,i)  = fac1*rhsx(jb,5,i)
+                lhsm(jb,3,i1) = lhsm(jb,3,i1) -  &
+     &                        lhsm(jb,2,i1)*lhsm(jb,4,i)
+                lhsm(jb,4,i1) = lhsm(jb,4,i1) -  &
+     &                        lhsm(jb,2,i1)*lhsm(jb,5,i)
+                rhsx(jb,5,i1) = rhsx(jb,5,i1) -  &
+     &                        lhsm(jb,2,i1)*rhsx(jb,5,i)
+                lhsm(jb,2,i2) = lhsm(jb,2,i2) -  &
+     &                        lhsm(jb,1,i2)*lhsm(jb,4,i)
+                lhsm(jb,3,i2) = lhsm(jb,3,i2) -  &
+     &                        lhsm(jb,1,i2)*lhsm(jb,5,i)
+                rhsx(jb,5,i2) = rhsx(jb,5,i2) -  &
+     &                        lhsm(jb,1,i2)*rhsx(jb,5,i)
+             end do
+          end do
+
+!---------------------------------------------------------------------
+!         And again the last two rows separately
+!---------------------------------------------------------------------
+          i  = grid_points(1)-2
+          i1 = grid_points(1)-1
+          do  jb = 1, bsize
+             fac1       = 1.d0/lhsp(jb,3,i)
+             lhsp(jb,4,i)  = fac1*lhsp(jb,4,i)
+             lhsp(jb,5,i)  = fac1*lhsp(jb,5,i)
+             rhsx(jb,4,i)  = fac1*rhsx(jb,4,i)
+             lhsp(jb,3,i1) = lhsp(jb,3,i1) -  &
+     &                      lhsp(jb,2,i1)*lhsp(jb,4,i)
+             lhsp(jb,4,i1) = lhsp(jb,4,i1) -  &
+     &                      lhsp(jb,2,i1)*lhsp(jb,5,i)
+             rhsx(jb,4,i1) = rhsx(jb,4,i1) -  &
+     &                      lhsp(jb,2,i1)*rhsx(jb,4,i)
+             fac1       = 1.d0/lhsm(jb,3,i)
+             lhsm(jb,4,i)  = fac1*lhsm(jb,4,i)
+             lhsm(jb,5,i)  = fac1*lhsm(jb,5,i)
+             rhsx(jb,5,i)  = fac1*rhsx(jb,5,i)
+             lhsm(jb,3,i1) = lhsm(jb,3,i1) -  &
+     &                      lhsm(jb,2,i1)*lhsm(jb,4,i)
+             lhsm(jb,4,i1) = lhsm(jb,4,i1) -  &
+     &                      lhsm(jb,2,i1)*lhsm(jb,5,i)
+             rhsx(jb,5,i1) = rhsx(jb,5,i1) -  &
+     &                      lhsm(jb,2,i1)*rhsx(jb,5,i)
+!---------------------------------------------------------------------
+!               Scale the last row immediately
+!---------------------------------------------------------------------
+             rhsx(jb,4,i1) = rhsx(jb,4,i1)/lhsp(jb,3,i1)
+             rhsx(jb,5,i1) = rhsx(jb,5,i1)/lhsm(jb,3,i1)
+          end do
+
+
+!---------------------------------------------------------------------
+!                         BACKSUBSTITUTION 
+!---------------------------------------------------------------------
+
+
+          i  = grid_points(1)-2
+          i1 = grid_points(1)-1
+          do  jb = 1, bsize
+             rhsx(jb,1,i) = rhsx(jb,1,i) -  &
+     &                             lhs(jb,4,i)*rhsx(jb,1,i1)
+             rhsx(jb,2,i) = rhsx(jb,2,i) -  &
+     &                             lhs(jb,4,i)*rhsx(jb,2,i1)
+             rhsx(jb,3,i) = rhsx(jb,3,i) -  &
+     &                             lhs(jb,4,i)*rhsx(jb,3,i1)
+
+             rhsx(jb,4,i) = rhsx(jb,4,i) -  &
+     &                          lhsp(jb,4,i)*rhsx(jb,4,i1)
+             rhsx(jb,5,i) = rhsx(jb,5,i) -  &
+     &                          lhsm(jb,4,i)*rhsx(jb,5,i1)
+          end do
+
+!---------------------------------------------------------------------
+!      The first three factors
+!---------------------------------------------------------------------
+          do    i = grid_points(1)-3, 0, -1
+             i1 = i  + 1
+             i2 = i  + 2
+             do  jb = 1, bsize
+                rhsx(jb,1,i) = rhsx(jb,1,i) -  &
+     &                          lhs(jb,4,i)*rhsx(jb,1,i1) -  &
+     &                          lhs(jb,5,i)*rhsx(jb,1,i2)
+                rhsx(jb,2,i) = rhsx(jb,2,i) -  &
+     &                          lhs(jb,4,i)*rhsx(jb,2,i1) -  &
+     &                          lhs(jb,5,i)*rhsx(jb,2,i2)
+                rhsx(jb,3,i) = rhsx(jb,3,i) -  &
+     &                          lhs(jb,4,i)*rhsx(jb,3,i1) -  &
+     &                          lhs(jb,5,i)*rhsx(jb,3,i2)
+
+!---------------------------------------------------------------------
+!      And the remaining two
+!---------------------------------------------------------------------
+                rhsx(jb,4,i) = rhsx(jb,4,i) -  &
+     &                          lhsp(jb,4,i)*rhsx(jb,4,i1) -  &
+     &                          lhsp(jb,5,i)*rhsx(jb,4,i2)
+                rhsx(jb,5,i) = rhsx(jb,5,i) -  &
+     &                          lhsm(jb,4,i)*rhsx(jb,5,i1) -  &
+     &                          lhsm(jb,5,i)*rhsx(jb,5,i2)
+             end do
+          end do
+
+          do  jb = 1, jm
+             j = jb + jj - 1
+             do  i = 0, grid_points(1)-1
+                rhs(1,i,j,k) = rhsx(jb,1,i)
+                rhs(2,i,j,k) = rhsx(jb,2,i)
+                rhs(3,i,j,k) = rhsx(jb,3,i)
+                rhs(4,i,j,k) = rhsx(jb,4,i)
+                rhs(5,i,j,k) = rhsx(jb,5,i)
+             end do
+          end do
+
+       end do
+       end do
+!$omp end do nowait
+!$omp end parallel
+       if (timeron) call timer_stop(t_xsolve)
+
+!---------------------------------------------------------------------
+!      Do the block-diagonal inversion          
+!---------------------------------------------------------------------
+       call ninvr
+
+       return
+       end
+
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/SP/y_solve.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/SP/y_solve.f90
new file mode 100644
index 000000000..acc7274a0
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/SP/y_solve.f90
@@ -0,0 +1,301 @@
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+       subroutine y_solve
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+! This subroutine solves the approximate factorization step in the
+! y-direction for all five matrix components simultaneously. The
+! Thomas algorithm is used to solve the systems along the y-lines.
+! Boundary conditions are non-periodic.
+!---------------------------------------------------------------------
+
+       use sp_data
+       use work_lhs
+
+       implicit none
+
+       integer i, j, k, j1, j2, m
+       double precision ru1, fac1, fac2
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+       if (timeron) call timer_start(t_ysolve)
+!$omp parallel do default(shared) private(i,j,k,j1,j2,m,  &
+!$omp&    ru1,fac1,fac2) collapse(2)
+       do  k = 1, nz2
+          do  i = 1, grid_points(1)-2
+
+            call lhsinit(ny2+1, lhs, lhsp, lhsm)
+
+!---------------------------------------------------------------------
+! Computes the left hand side for the three y-factors   
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!      first fill the lhs for the u-eigenvalue         
+!---------------------------------------------------------------------
+
+             do  j = 0, grid_points(2)-1
+                ru1 = c3c4*rho_i(i,j,k)
+                cv(j) = vs(i,j,k)
+                rhov(j) = dmax1( dy3 + con43 * ru1,  &
+     &                           dy5 + c1c5*ru1,  &
+     &                           dymax + ru1,  &
+     &                           dy1)
+             end do
+            
+             do  j = 1, grid_points(2)-2
+                lhs(1,j) =  0.0d0
+                lhs(2,j) = -dtty2 * cv(j-1) - dtty1 * rhov(j-1)
+                lhs(3,j) =  1.0d0 + c2dtty1 * rhov(j)
+                lhs(4,j) =  dtty2 * cv(j+1) - dtty1 * rhov(j+1)
+                lhs(5,j) =  0.0d0
+             end do
+
+!---------------------------------------------------------------------
+!      add fourth order dissipation                             
+!---------------------------------------------------------------------
+
+             j = 1
+             lhs(3,j) = lhs(3,j) + comz5
+             lhs(4,j) = lhs(4,j) - comz4
+             lhs(5,j) = lhs(5,j) + comz1
+       
+             lhs(2,j+1) = lhs(2,j+1) - comz4
+             lhs(3,j+1) = lhs(3,j+1) + comz6
+             lhs(4,j+1) = lhs(4,j+1) - comz4
+             lhs(5,j+1) = lhs(5,j+1) + comz1
+
+             do   j=3, grid_points(2)-4
+                lhs(1,j) = lhs(1,j) + comz1
+                lhs(2,j) = lhs(2,j) - comz4
+                lhs(3,j) = lhs(3,j) + comz6
+                lhs(4,j) = lhs(4,j) - comz4
+                lhs(5,j) = lhs(5,j) + comz1
+             end do
+
+             j = grid_points(2)-3
+             lhs(1,j) = lhs(1,j) + comz1
+             lhs(2,j) = lhs(2,j) - comz4
+             lhs(3,j) = lhs(3,j) + comz6
+             lhs(4,j) = lhs(4,j) - comz4
+
+             lhs(1,j+1) = lhs(1,j+1) + comz1
+             lhs(2,j+1) = lhs(2,j+1) - comz4
+             lhs(3,j+1) = lhs(3,j+1) + comz5
+
+!---------------------------------------------------------------------
+!      subsequently, do the other two factors                    
+!---------------------------------------------------------------------
+             do    j = 1, grid_points(2)-2
+                lhsp(1,j) = lhs(1,j)
+                lhsp(2,j) = lhs(2,j) -  &
+     &                            dtty2 * speed(i,j-1,k)
+                lhsp(3,j) = lhs(3,j)
+                lhsp(4,j) = lhs(4,j) +  &
+     &                            dtty2 * speed(i,j+1,k)
+                lhsp(5,j) = lhs(5,j)
+                lhsm(1,j) = lhs(1,j)
+                lhsm(2,j) = lhs(2,j) +  &
+     &                            dtty2 * speed(i,j-1,k)
+                lhsm(3,j) = lhs(3,j)
+                lhsm(4,j) = lhs(4,j) -  &
+     &                            dtty2 * speed(i,j+1,k)
+                lhsm(5,j) = lhs(5,j)
+             end do
+
+
+!---------------------------------------------------------------------
+!                          FORWARD ELIMINATION  
+!---------------------------------------------------------------------
+
+             do    j = 0, grid_points(2)-3
+                j1 = j  + 1
+                j2 = j  + 2
+                fac1      = 1.d0/lhs(3,j)
+                lhs(4,j)  = fac1*lhs(4,j)
+                lhs(5,j)  = fac1*lhs(5,j)
+                do    m = 1, 3
+                   rhs(m,i,j,k) = fac1*rhs(m,i,j,k)
+                end do
+                lhs(3,j1) = lhs(3,j1) -  &
+     &                         lhs(2,j1)*lhs(4,j)
+                lhs(4,j1) = lhs(4,j1) -  &
+     &                         lhs(2,j1)*lhs(5,j)
+                do    m = 1, 3
+                   rhs(m,i,j1,k) = rhs(m,i,j1,k) -  &
+     &                         lhs(2,j1)*rhs(m,i,j,k)
+                end do
+                lhs(2,j2) = lhs(2,j2) -  &
+     &                         lhs(1,j2)*lhs(4,j)
+                lhs(3,j2) = lhs(3,j2) -  &
+     &                         lhs(1,j2)*lhs(5,j)
+                do    m = 1, 3
+                   rhs(m,i,j2,k) = rhs(m,i,j2,k) -  &
+     &                         lhs(1,j2)*rhs(m,i,j,k)
+                end do
+             end do
+
+!---------------------------------------------------------------------
+!      The last two rows in this grid block are a bit different, 
+!      since they do not have two more rows available for the
+!      elimination of off-diagonal entries
+!---------------------------------------------------------------------
+
+             j  = grid_points(2)-2
+             j1 = grid_points(2)-1
+             fac1      = 1.d0/lhs(3,j)
+             lhs(4,j)  = fac1*lhs(4,j)
+             lhs(5,j)  = fac1*lhs(5,j)
+             do    m = 1, 3
+                rhs(m,i,j,k) = fac1*rhs(m,i,j,k)
+             end do
+             lhs(3,j1) = lhs(3,j1) -  &
+     &                      lhs(2,j1)*lhs(4,j)
+             lhs(4,j1) = lhs(4,j1) -  &
+     &                      lhs(2,j1)*lhs(5,j)
+             do    m = 1, 3
+                rhs(m,i,j1,k) = rhs(m,i,j1,k) -  &
+     &                      lhs(2,j1)*rhs(m,i,j,k)
+             end do
+!---------------------------------------------------------------------
+!            scale the last row immediately 
+!---------------------------------------------------------------------
+             fac2      = 1.d0/lhs(3,j1)
+             do    m = 1, 3
+                rhs(m,i,j1,k) = fac2*rhs(m,i,j1,k)
+             end do
+
+!---------------------------------------------------------------------
+!      do the u+c and the u-c factors                 
+!---------------------------------------------------------------------
+             do    j = 0, grid_points(2)-3
+                j1 = j  + 1
+                j2 = j  + 2
+                m = 4
+                fac1       = 1.d0/lhsp(3,j)
+                lhsp(4,j)  = fac1*lhsp(4,j)
+                lhsp(5,j)  = fac1*lhsp(5,j)
+                rhs(m,i,j,k) = fac1*rhs(m,i,j,k)
+                lhsp(3,j1) = lhsp(3,j1) -  &
+     &                       lhsp(2,j1)*lhsp(4,j)
+                lhsp(4,j1) = lhsp(4,j1) -  &
+     &                       lhsp(2,j1)*lhsp(5,j)
+                rhs(m,i,j1,k) = rhs(m,i,j1,k) -  &
+     &                       lhsp(2,j1)*rhs(m,i,j,k)
+                lhsp(2,j2) = lhsp(2,j2) -  &
+     &                       lhsp(1,j2)*lhsp(4,j)
+                lhsp(3,j2) = lhsp(3,j2) -  &
+     &                       lhsp(1,j2)*lhsp(5,j)
+                rhs(m,i,j2,k) = rhs(m,i,j2,k) -  &
+     &                       lhsp(1,j2)*rhs(m,i,j,k)
+                m = 5
+                fac1       = 1.d0/lhsm(3,j)
+                lhsm(4,j)  = fac1*lhsm(4,j)
+                lhsm(5,j)  = fac1*lhsm(5,j)
+                rhs(m,i,j,k) = fac1*rhs(m,i,j,k)
+                lhsm(3,j1) = lhsm(3,j1) -  &
+     &                       lhsm(2,j1)*lhsm(4,j)
+                lhsm(4,j1) = lhsm(4,j1) -  &
+     &                       lhsm(2,j1)*lhsm(5,j)
+                rhs(m,i,j1,k) = rhs(m,i,j1,k) -  &
+     &                       lhsm(2,j1)*rhs(m,i,j,k)
+                lhsm(2,j2) = lhsm(2,j2) -  &
+     &                       lhsm(1,j2)*lhsm(4,j)
+                lhsm(3,j2) = lhsm(3,j2) -  &
+     &                       lhsm(1,j2)*lhsm(5,j)
+                rhs(m,i,j2,k) = rhs(m,i,j2,k) -  &
+     &                       lhsm(1,j2)*rhs(m,i,j,k)
+             end do
+
+!---------------------------------------------------------------------
+!         And again the last two rows separately
+!---------------------------------------------------------------------
+             j  = grid_points(2)-2
+             j1 = grid_points(2)-1
+             m = 4
+             fac1       = 1.d0/lhsp(3,j)
+             lhsp(4,j)  = fac1*lhsp(4,j)
+             lhsp(5,j)  = fac1*lhsp(5,j)
+             rhs(m,i,j,k) = fac1*rhs(m,i,j,k)
+             lhsp(3,j1) = lhsp(3,j1) -  &
+     &                    lhsp(2,j1)*lhsp(4,j)
+             lhsp(4,j1) = lhsp(4,j1) -  &
+     &                    lhsp(2,j1)*lhsp(5,j)
+             rhs(m,i,j1,k)   = rhs(m,i,j1,k) -  &
+     &                    lhsp(2,j1)*rhs(m,i,j,k)
+             m = 5
+             fac1       = 1.d0/lhsm(3,j)
+             lhsm(4,j)  = fac1*lhsm(4,j)
+             lhsm(5,j)  = fac1*lhsm(5,j)
+             rhs(m,i,j,k) = fac1*rhs(m,i,j,k)
+             lhsm(3,j1) = lhsm(3,j1) -  &
+     &                    lhsm(2,j1)*lhsm(4,j)
+             lhsm(4,j1) = lhsm(4,j1) -  &
+     &                    lhsm(2,j1)*lhsm(5,j)
+             rhs(m,i,j1,k)   = rhs(m,i,j1,k) -  &
+     &                    lhsm(2,j1)*rhs(m,i,j,k)
+!---------------------------------------------------------------------
+!               Scale the last row immediately 
+!---------------------------------------------------------------------
+             rhs(4,i,j1,k)   = rhs(4,i,j1,k)/lhsp(3,j1)
+             rhs(5,i,j1,k)   = rhs(5,i,j1,k)/lhsm(3,j1)
+
+
+!---------------------------------------------------------------------
+!                         BACKSUBSTITUTION 
+!---------------------------------------------------------------------
+
+             j  = grid_points(2)-2
+             j1 = grid_points(2)-1
+             do   m = 1, 3
+                rhs(m,i,j,k) = rhs(m,i,j,k) -  &
+     &                           lhs(4,j)*rhs(m,i,j1,k)
+             end do
+
+             rhs(4,i,j,k) = rhs(4,i,j,k) -  &
+     &                           lhsp(4,j)*rhs(4,i,j1,k)
+             rhs(5,i,j,k) = rhs(5,i,j,k) -  &
+     &                           lhsm(4,j)*rhs(5,i,j1,k)
+
+!---------------------------------------------------------------------
+!      The first three factors
+!---------------------------------------------------------------------
+             do   j = grid_points(2)-3, 0, -1
+                j1 = j  + 1
+                j2 = j  + 2
+                do   m = 1, 3
+                   rhs(m,i,j,k) = rhs(m,i,j,k) -  &
+     &                          lhs(4,j)*rhs(m,i,j1,k) -  &
+     &                          lhs(5,j)*rhs(m,i,j2,k)
+                end do
+
+!---------------------------------------------------------------------
+!      And the remaining two
+!---------------------------------------------------------------------
+                rhs(4,i,j,k) = rhs(4,i,j,k) -  &
+     &                          lhsp(4,j)*rhs(4,i,j1,k) -  &
+     &                          lhsp(5,j)*rhs(4,i,j2,k)
+                rhs(5,i,j,k) = rhs(5,i,j,k) -  &
+     &                          lhsm(4,j)*rhs(5,i,j1,k) -  &
+     &                          lhsm(5,j)*rhs(5,i,j2,k)
+             end do
+
+          end do
+       end do
+       if (timeron) call timer_stop(t_ysolve)
+
+
+       call pinvr
+
+       return
+       end
+
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/SP/y_solve_blk.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/SP/y_solve_blk.f90
new file mode 100644
index 000000000..71659d119
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/SP/y_solve_blk.f90
@@ -0,0 +1,364 @@
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+       subroutine y_solve
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+! This subroutine solves the approximate factorization step in the
+! y-direction for all five matrix components simultaneously. The
+! Thomas algorithm is used to solve the systems along the y-lines.
+! Boundary conditions are non-periodic.
+!---------------------------------------------------------------------
+
+       use sp_data
+       use work_lhs
+
+       implicit none
+
+       integer i, j, k, j1, j2, ii, ib, im
+       double precision ru1, fac1, fac2
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+       if (timeron) call timer_start(t_ysolve)
+!$omp parallel default(shared) private(i,j,k,j1,j2,ii,ib,im,  &
+!$omp&    ru1,fac1,fac2)
+
+       call lhsinit(ny2+1)
+
+!$omp do collapse(2)
+       do  k = 1, nz2
+       do  ii = 1, nx2, bsize
+          im = min(bsize, nx2 - ii + 1)
+
+          do  j = 0, grid_points(2)-1
+             do  ib = 1, bsize
+                i = min(ib,im) + ii - 1
+                rhsx(ib,1,j) = rhs(1,i,j,k)
+                rhsx(ib,2,j) = rhs(2,i,j,k)
+                rhsx(ib,3,j) = rhs(3,i,j,k)
+                rhsx(ib,4,j) = rhs(4,i,j,k)
+                rhsx(ib,5,j) = rhs(5,i,j,k)
+             end do
+          end do
+
+!---------------------------------------------------------------------
+! Computes the left hand side for the three y-factors   
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!      first fill the lhs for the u-eigenvalue         
+!---------------------------------------------------------------------
+
+          do  j = 0, grid_points(2)-1
+             do  ib = 1, bsize
+                i = min(ib,im) + ii - 1
+                ru1 = c3c4*rho_i(i,j,k)
+                cv(ib,j) = vs(i,j,k)
+                rhov(ib,j) = dmax1( dy3 + con43 * ru1,  &
+     &                           dy5 + c1c5*ru1,  &
+     &                           dymax + ru1,  &
+     &                           dy1)
+             end do
+          end do
+
+          do  j = 1, grid_points(2)-2
+             do  ib = 1, bsize
+                lhs(ib,1,j) =  0.0d0
+                lhs(ib,2,j) = -dtty2 * cv(ib,j-1) - dtty1 * rhov(ib,j-1)
+                lhs(ib,3,j) =  1.0d0 + c2dtty1 * rhov(ib,j)
+                lhs(ib,4,j) =  dtty2 * cv(ib,j+1) - dtty1 * rhov(ib,j+1)
+                lhs(ib,5,j) =  0.0d0
+             end do
+          end do
+
+!---------------------------------------------------------------------
+!      add fourth order dissipation                             
+!---------------------------------------------------------------------
+
+          do  ib = 1, bsize
+             j = 1
+             lhs(ib,3,j) = lhs(ib,3,j) + comz5
+             lhs(ib,4,j) = lhs(ib,4,j) - comz4
+             lhs(ib,5,j) = lhs(ib,5,j) + comz1
+
+             j = 2
+             lhs(ib,2,j) = lhs(ib,2,j) - comz4
+             lhs(ib,3,j) = lhs(ib,3,j) + comz6
+             lhs(ib,4,j) = lhs(ib,4,j) - comz4
+             lhs(ib,5,j) = lhs(ib,5,j) + comz1
+          end do
+
+          do   j=3, grid_points(2)-4
+             do  ib = 1, bsize
+                lhs(ib,1,j) = lhs(ib,1,j) + comz1
+                lhs(ib,2,j) = lhs(ib,2,j) - comz4
+                lhs(ib,3,j) = lhs(ib,3,j) + comz6
+                lhs(ib,4,j) = lhs(ib,4,j) - comz4
+                lhs(ib,5,j) = lhs(ib,5,j) + comz1
+             end do
+          end do
+
+          do  ib = 1, bsize
+             j = grid_points(2)-3
+             lhs(ib,1,j) = lhs(ib,1,j) + comz1
+             lhs(ib,2,j) = lhs(ib,2,j) - comz4
+             lhs(ib,3,j) = lhs(ib,3,j) + comz6
+             lhs(ib,4,j) = lhs(ib,4,j) - comz4
+
+             j = grid_points(2)-2
+             lhs(ib,1,j) = lhs(ib,1,j) + comz1
+             lhs(ib,2,j) = lhs(ib,2,j) - comz4
+             lhs(ib,3,j) = lhs(ib,3,j) + comz5
+          end do
+
+!---------------------------------------------------------------------
+!      subsequently, do the other two factors                    
+!---------------------------------------------------------------------
+          do    j = 1, grid_points(2)-2
+             do  ib = 1, bsize
+                i = min(ib,im) + ii - 1
+                lhsp(ib,1,j) = lhs(ib,1,j)
+                lhsp(ib,2,j) = lhs(ib,2,j) -  &
+     &                            dtty2 * speed(i,j-1,k)
+                lhsp(ib,3,j) = lhs(ib,3,j)
+                lhsp(ib,4,j) = lhs(ib,4,j) +  &
+     &                            dtty2 * speed(i,j+1,k)
+                lhsp(ib,5,j) = lhs(ib,5,j)
+                lhsm(ib,1,j) = lhs(ib,1,j)
+                lhsm(ib,2,j) = lhs(ib,2,j) +  &
+     &                            dtty2 * speed(i,j-1,k)
+                lhsm(ib,3,j) = lhs(ib,3,j)
+                lhsm(ib,4,j) = lhs(ib,4,j) -  &
+     &                            dtty2 * speed(i,j+1,k)
+                lhsm(ib,5,j) = lhs(ib,5,j)
+             end do
+          end do
+
+
+!---------------------------------------------------------------------
+!                          FORWARD ELIMINATION  
+!---------------------------------------------------------------------
+
+          do    j = 0, grid_points(2)-3
+             j1 = j  + 1
+             j2 = j  + 2
+             do  ib = 1, bsize
+                fac1      = 1.d0/lhs(ib,3,j)
+                lhs(ib,4,j)  = fac1*lhs(ib,4,j)
+                lhs(ib,5,j)  = fac1*lhs(ib,5,j)
+                rhsx(ib,1,j) = fac1*rhsx(ib,1,j)
+                rhsx(ib,2,j) = fac1*rhsx(ib,2,j)
+                rhsx(ib,3,j) = fac1*rhsx(ib,3,j)
+                lhs(ib,3,j1) = lhs(ib,3,j1) -  &
+     &                         lhs(ib,2,j1)*lhs(ib,4,j)
+                lhs(ib,4,j1) = lhs(ib,4,j1) -  &
+     &                         lhs(ib,2,j1)*lhs(ib,5,j)
+                rhsx(ib,1,j1) = rhsx(ib,1,j1) -  &
+     &                         lhs(ib,2,j1)*rhsx(ib,1,j)
+                rhsx(ib,2,j1) = rhsx(ib,2,j1) -  &
+     &                         lhs(ib,2,j1)*rhsx(ib,2,j)
+                rhsx(ib,3,j1) = rhsx(ib,3,j1) -  &
+     &                         lhs(ib,2,j1)*rhsx(ib,3,j)
+                lhs(ib,2,j2) = lhs(ib,2,j2) -  &
+     &                         lhs(ib,1,j2)*lhs(ib,4,j)
+                lhs(ib,3,j2) = lhs(ib,3,j2) -  &
+     &                         lhs(ib,1,j2)*lhs(ib,5,j)
+                rhsx(ib,1,j2) = rhsx(ib,1,j2) -  &
+     &                         lhs(ib,1,j2)*rhsx(ib,1,j)
+                rhsx(ib,2,j2) = rhsx(ib,2,j2) -  &
+     &                         lhs(ib,1,j2)*rhsx(ib,2,j)
+                rhsx(ib,3,j2) = rhsx(ib,3,j2) -  &
+     &                         lhs(ib,1,j2)*rhsx(ib,3,j)
+             end do
+          end do
+
+!---------------------------------------------------------------------
+!      The last two rows in this grid block are a bit different, 
+!      since they do not have two more rows available for the
+!      elimination of off-diagonal entries
+!---------------------------------------------------------------------
+
+          j  = grid_points(2)-2
+          j1 = grid_points(2)-1
+          do  ib = 1, bsize
+             fac1      = 1.d0/lhs(ib,3,j)
+             lhs(ib,4,j)  = fac1*lhs(ib,4,j)
+             lhs(ib,5,j)  = fac1*lhs(ib,5,j)
+             rhsx(ib,1,j) = fac1*rhsx(ib,1,j)
+             rhsx(ib,2,j) = fac1*rhsx(ib,2,j)
+             rhsx(ib,3,j) = fac1*rhsx(ib,3,j)
+             lhs(ib,3,j1) = lhs(ib,3,j1) -  &
+     &                      lhs(ib,2,j1)*lhs(ib,4,j)
+             lhs(ib,4,j1) = lhs(ib,4,j1) -  &
+     &                      lhs(ib,2,j1)*lhs(ib,5,j)
+             rhsx(ib,1,j1) = rhsx(ib,1,j1) -  &
+     &                      lhs(ib,2,j1)*rhsx(ib,1,j)
+             rhsx(ib,2,j1) = rhsx(ib,2,j1) -  &
+     &                      lhs(ib,2,j1)*rhsx(ib,2,j)
+             rhsx(ib,3,j1) = rhsx(ib,3,j1) -  &
+     &                      lhs(ib,2,j1)*rhsx(ib,3,j)
+!---------------------------------------------------------------------
+!            scale the last row immediately 
+!---------------------------------------------------------------------
+             fac2      = 1.d0/lhs(ib,3,j1)
+             rhsx(ib,1,j1) = fac2*rhsx(ib,1,j1)
+             rhsx(ib,2,j1) = fac2*rhsx(ib,2,j1)
+             rhsx(ib,3,j1) = fac2*rhsx(ib,3,j1)
+          end do
+
+!---------------------------------------------------------------------
+!      do the u+c and the u-c factors                 
+!---------------------------------------------------------------------
+          do    j = 0, grid_points(2)-3
+             j1 = j  + 1
+             j2 = j  + 2
+             do  ib = 1, bsize
+                fac1       = 1.d0/lhsp(ib,3,j)
+                lhsp(ib,4,j)  = fac1*lhsp(ib,4,j)
+                lhsp(ib,5,j)  = fac1*lhsp(ib,5,j)
+                rhsx(ib,4,j)  = fac1*rhsx(ib,4,j)
+                lhsp(ib,3,j1) = lhsp(ib,3,j1) -  &
+     &                       lhsp(ib,2,j1)*lhsp(ib,4,j)
+                lhsp(ib,4,j1) = lhsp(ib,4,j1) -  &
+     &                       lhsp(ib,2,j1)*lhsp(ib,5,j)
+                rhsx(ib,4,j1) = rhsx(ib,4,j1) -  &
+     &                       lhsp(ib,2,j1)*rhsx(ib,4,j)
+                lhsp(ib,2,j2) = lhsp(ib,2,j2) -  &
+     &                       lhsp(ib,1,j2)*lhsp(ib,4,j)
+                lhsp(ib,3,j2) = lhsp(ib,3,j2) -  &
+     &                       lhsp(ib,1,j2)*lhsp(ib,5,j)
+                rhsx(ib,4,j2) = rhsx(ib,4,j2) -  &
+     &                       lhsp(ib,1,j2)*rhsx(ib,4,j)
+                fac1       = 1.d0/lhsm(ib,3,j)
+                lhsm(ib,4,j)  = fac1*lhsm(ib,4,j)
+                lhsm(ib,5,j)  = fac1*lhsm(ib,5,j)
+                rhsx(ib,5,j)  = fac1*rhsx(ib,5,j)
+                lhsm(ib,3,j1) = lhsm(ib,3,j1) -  &
+     &                       lhsm(ib,2,j1)*lhsm(ib,4,j)
+                lhsm(ib,4,j1) = lhsm(ib,4,j1) -  &
+     &                       lhsm(ib,2,j1)*lhsm(ib,5,j)
+                rhsx(ib,5,j1) = rhsx(ib,5,j1) -  &
+     &                       lhsm(ib,2,j1)*rhsx(ib,5,j)
+                lhsm(ib,2,j2) = lhsm(ib,2,j2) -  &
+     &                       lhsm(ib,1,j2)*lhsm(ib,4,j)
+                lhsm(ib,3,j2) = lhsm(ib,3,j2) -  &
+     &                       lhsm(ib,1,j2)*lhsm(ib,5,j)
+                rhsx(ib,5,j2) = rhsx(ib,5,j2) -  &
+     &                       lhsm(ib,1,j2)*rhsx(ib,5,j)
+             end do
+          end do
+
+!---------------------------------------------------------------------
+!         And again the last two rows separately
+!---------------------------------------------------------------------
+          j  = grid_points(2)-2
+          j1 = grid_points(2)-1
+          do  ib = 1, bsize
+             fac1       = 1.d0/lhsp(ib,3,j)
+             lhsp(ib,4,j)  = fac1*lhsp(ib,4,j)
+             lhsp(ib,5,j)  = fac1*lhsp(ib,5,j)
+             rhsx(ib,4,j)  = fac1*rhsx(ib,4,j)
+             lhsp(ib,3,j1) = lhsp(ib,3,j1) -  &
+     &                    lhsp(ib,2,j1)*lhsp(ib,4,j)
+             lhsp(ib,4,j1) = lhsp(ib,4,j1) -  &
+     &                    lhsp(ib,2,j1)*lhsp(ib,5,j)
+             rhsx(ib,4,j1)   = rhsx(ib,4,j1) -  &
+     &                    lhsp(ib,2,j1)*rhsx(ib,4,j)
+             fac1       = 1.d0/lhsm(ib,3,j)
+             lhsm(ib,4,j)  = fac1*lhsm(ib,4,j)
+             lhsm(ib,5,j)  = fac1*lhsm(ib,5,j)
+             rhsx(ib,5,j)  = fac1*rhsx(ib,5,j)
+             lhsm(ib,3,j1) = lhsm(ib,3,j1) -  &
+     &                    lhsm(ib,2,j1)*lhsm(ib,4,j)
+             lhsm(ib,4,j1) = lhsm(ib,4,j1) -  &
+     &                    lhsm(ib,2,j1)*lhsm(ib,5,j)
+             rhsx(ib,5,j1)   = rhsx(ib,5,j1) -  &
+     &                    lhsm(ib,2,j1)*rhsx(ib,5,j)
+!---------------------------------------------------------------------
+!               Scale the last row immediately 
+!---------------------------------------------------------------------
+             rhsx(ib,4,j1)   = rhsx(ib,4,j1)/lhsp(ib,3,j1)
+             rhsx(ib,5,j1)   = rhsx(ib,5,j1)/lhsm(ib,3,j1)
+          end do
+
+
+!---------------------------------------------------------------------
+!                         BACKSUBSTITUTION 
+!---------------------------------------------------------------------
+
+          j  = grid_points(2)-2
+          j1 = grid_points(2)-1
+          do  ib = 1, bsize
+             rhsx(ib,1,j) = rhsx(ib,1,j) -  &
+     &                           lhs(ib,4,j)*rhsx(ib,1,j1)
+             rhsx(ib,2,j) = rhsx(ib,2,j) -  &
+     &                           lhs(ib,4,j)*rhsx(ib,2,j1)
+             rhsx(ib,3,j) = rhsx(ib,3,j) -  &
+     &                           lhs(ib,4,j)*rhsx(ib,3,j1)
+
+             rhsx(ib,4,j) = rhsx(ib,4,j) -  &
+     &                           lhsp(ib,4,j)*rhsx(ib,4,j1)
+             rhsx(ib,5,j) = rhsx(ib,5,j) -  &
+     &                           lhsm(ib,4,j)*rhsx(ib,5,j1)
+          end do
+
+!---------------------------------------------------------------------
+!      The first three factors
+!---------------------------------------------------------------------
+          do   j = grid_points(2)-3, 0, -1
+             j1 = j  + 1
+             j2 = j  + 2
+             do  ib = 1, bsize
+                rhsx(ib,1,j) = rhsx(ib,1,j) -  &
+     &                          lhs(ib,4,j)*rhsx(ib,1,j1) -  &
+     &                          lhs(ib,5,j)*rhsx(ib,1,j2)
+                rhsx(ib,2,j) = rhsx(ib,2,j) -  &
+     &                          lhs(ib,4,j)*rhsx(ib,2,j1) -  &
+     &                          lhs(ib,5,j)*rhsx(ib,2,j2)
+                rhsx(ib,3,j) = rhsx(ib,3,j) -  &
+     &                          lhs(ib,4,j)*rhsx(ib,3,j1) -  &
+     &                          lhs(ib,5,j)*rhsx(ib,3,j2)
+
+!---------------------------------------------------------------------
+!      And the remaining two
+!---------------------------------------------------------------------
+                rhsx(ib,4,j) = rhsx(ib,4,j) -  &
+     &                          lhsp(ib,4,j)*rhsx(ib,4,j1) -  &
+     &                          lhsp(ib,5,j)*rhsx(ib,4,j2)
+                rhsx(ib,5,j) = rhsx(ib,5,j) -  &
+     &                          lhsm(ib,4,j)*rhsx(ib,5,j1) -  &
+     &                          lhsm(ib,5,j)*rhsx(ib,5,j2)
+             end do
+          end do
+
+          do  j = 0, grid_points(2)-1
+             do  ib = 1, im
+                i = ib + ii - 1
+                rhs(1,i,j,k) = rhsx(ib,1,j)
+                rhs(2,i,j,k) = rhsx(ib,2,j)
+                rhs(3,i,j,k) = rhsx(ib,3,j)
+                rhs(4,i,j,k) = rhsx(ib,4,j)
+                rhs(5,i,j,k) = rhsx(ib,5,j)
+             end do
+          end do
+
+       end do
+       end do
+!$omp end do nowait
+!$omp end parallel
+       if (timeron) call timer_stop(t_ysolve)
+
+
+       call pinvr
+
+       return
+       end
+
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/SP/z_solve.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/SP/z_solve.f90
new file mode 100644
index 000000000..d74985483
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/SP/z_solve.f90
@@ -0,0 +1,313 @@
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+       subroutine z_solve
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+! This subroutine solves the approximate factorization step in the
+! z-direction for all five matrix components simultaneously. The
+! Thomas algorithm is employed to solve the systems for the
+! z-lines. Boundary conditions are non-periodic.
+!---------------------------------------------------------------------
+
+       use sp_data
+       use work_lhs
+
+       implicit none
+
+       integer i, j, k, k1, k2, m
+       double precision ru1, fac1, fac2
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+! Prepare for z-solve, array redistribution   
+!---------------------------------------------------------------------
+
+       if (timeron) call timer_start(t_zsolve)
+!$omp parallel do default(shared) private(i,j,k,k1,k2,m,  &
+!$omp&    ru1,fac1,fac2) collapse(2)
+       do   j = 1, ny2
+          do   i = 1, nx2
+
+            call lhsinit(nz2+1, lhs, lhsp, lhsm)
+
+!---------------------------------------------------------------------
+! Computes the left hand side for the three z-factors   
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+! first fill the lhs for the u-eigenvalue                          
+!---------------------------------------------------------------------
+
+             do   k = 0, nz2 + 1
+                ru1 = c3c4*rho_i(i,j,k)
+                cv(k) = ws(i,j,k)
+                rhov(k) = dmax1(dz4 + con43 * ru1,  &
+     &                          dz5 + c1c5 * ru1,  &
+     &                          dzmax + ru1,  &
+     &                          dz1)
+             end do
+
+             do   k =  1, nz2
+                lhs(1,k) =  0.0d0
+                lhs(2,k) = -dttz2 * cv(k-1) - dttz1 * rhov(k-1)
+                lhs(3,k) =  1.0d0 + c2dttz1 * rhov(k)
+                lhs(4,k) =  dttz2 * cv(k+1) - dttz1 * rhov(k+1)
+                lhs(5,k) =  0.0d0
+             end do
+
+!---------------------------------------------------------------------
+!      add fourth order dissipation                                  
+!---------------------------------------------------------------------
+
+             k = 1
+             lhs(3,k) = lhs(3,k) + comz5
+             lhs(4,k) = lhs(4,k) - comz4
+             lhs(5,k) = lhs(5,k) + comz1
+
+             k = 2
+             lhs(2,k) = lhs(2,k) - comz4
+             lhs(3,k) = lhs(3,k) + comz6
+             lhs(4,k) = lhs(4,k) - comz4
+             lhs(5,k) = lhs(5,k) + comz1
+
+             do    k = 3, nz2-2
+                lhs(1,k) = lhs(1,k) + comz1
+                lhs(2,k) = lhs(2,k) - comz4
+                lhs(3,k) = lhs(3,k) + comz6
+                lhs(4,k) = lhs(4,k) - comz4
+                lhs(5,k) = lhs(5,k) + comz1
+             end do
+
+             k = nz2-1
+             lhs(1,k) = lhs(1,k) + comz1
+             lhs(2,k) = lhs(2,k) - comz4
+             lhs(3,k) = lhs(3,k) + comz6
+             lhs(4,k) = lhs(4,k) - comz4
+
+             k = nz2
+             lhs(1,k) = lhs(1,k) + comz1
+             lhs(2,k) = lhs(2,k) - comz4
+             lhs(3,k) = lhs(3,k) + comz5
+
+
+!---------------------------------------------------------------------
+!      subsequently, fill the other factors (u+c), (u-c) 
+!---------------------------------------------------------------------
+             do    k = 1, nz2
+                lhsp(1,k) = lhs(1,k)
+                lhsp(2,k) = lhs(2,k) -  &
+     &                            dttz2 * speed(i,j,k-1)
+                lhsp(3,k) = lhs(3,k)
+                lhsp(4,k) = lhs(4,k) +  &
+     &                            dttz2 * speed(i,j,k+1)
+                lhsp(5,k) = lhs(5,k)
+                lhsm(1,k) = lhs(1,k)
+                lhsm(2,k) = lhs(2,k) +  &
+     &                            dttz2 * speed(i,j,k-1)
+                lhsm(3,k) = lhs(3,k)
+                lhsm(4,k) = lhs(4,k) -  &
+     &                            dttz2 * speed(i,j,k+1)
+                lhsm(5,k) = lhs(5,k)
+             end do
+
+
+!---------------------------------------------------------------------
+!                          FORWARD ELIMINATION  
+!---------------------------------------------------------------------
+
+             do    k = 0, grid_points(3)-3
+                k1 = k  + 1
+                k2 = k  + 2
+                fac1      = 1.d0/lhs(3,k)
+                lhs(4,k)  = fac1*lhs(4,k)
+                lhs(5,k)  = fac1*lhs(5,k)
+                do    m = 1, 3
+                   rhs(m,i,j,k) = fac1*rhs(m,i,j,k)
+                end do
+                lhs(3,k1) = lhs(3,k1) -  &
+     &                         lhs(2,k1)*lhs(4,k)
+                lhs(4,k1) = lhs(4,k1) -  &
+     &                         lhs(2,k1)*lhs(5,k)
+                do    m = 1, 3
+                   rhs(m,i,j,k1) = rhs(m,i,j,k1) -  &
+     &                         lhs(2,k1)*rhs(m,i,j,k)
+                end do
+                lhs(2,k2) = lhs(2,k2) -  &
+     &                         lhs(1,k2)*lhs(4,k)
+                lhs(3,k2) = lhs(3,k2) -  &
+     &                         lhs(1,k2)*lhs(5,k)
+                do    m = 1, 3
+                   rhs(m,i,j,k2) = rhs(m,i,j,k2) -  &
+     &                         lhs(1,k2)*rhs(m,i,j,k)
+                end do
+             end do
+
+!---------------------------------------------------------------------
+!      The last two rows in this grid block are a bit different, 
+!      since they do not have two more rows available for the
+!      elimination of off-diagonal entries
+!---------------------------------------------------------------------
+             k  = grid_points(3)-2
+             k1 = grid_points(3)-1
+             fac1      = 1.d0/lhs(3,k)
+             lhs(4,k)  = fac1*lhs(4,k)
+             lhs(5,k)  = fac1*lhs(5,k)
+             do    m = 1, 3
+                rhs(m,i,j,k) = fac1*rhs(m,i,j,k)
+             end do
+             lhs(3,k1) = lhs(3,k1) -  &
+     &                      lhs(2,k1)*lhs(4,k)
+             lhs(4,k1) = lhs(4,k1) -  &
+     &                      lhs(2,k1)*lhs(5,k)
+             do    m = 1, 3
+                rhs(m,i,j,k1) = rhs(m,i,j,k1) -  &
+     &                      lhs(2,k1)*rhs(m,i,j,k)
+             end do
+!---------------------------------------------------------------------
+!               scale the last row immediately
+!---------------------------------------------------------------------
+             fac2      = 1.d0/lhs(3,k1)
+             do    m = 1, 3
+                rhs(m,i,j,k1) = fac2*rhs(m,i,j,k1)
+             end do
+
+!---------------------------------------------------------------------
+!      do the u+c and the u-c factors               
+!---------------------------------------------------------------------
+             do    k = 0, grid_points(3)-3
+                k1 = k  + 1
+                k2 = k  + 2
+                m = 4
+                fac1       = 1.d0/lhsp(3,k)
+                lhsp(4,k)  = fac1*lhsp(4,k)
+                lhsp(5,k)  = fac1*lhsp(5,k)
+                rhs(m,i,j,k)  = fac1*rhs(m,i,j,k)
+                lhsp(3,k1) = lhsp(3,k1) -  &
+     &                       lhsp(2,k1)*lhsp(4,k)
+                lhsp(4,k1) = lhsp(4,k1) -  &
+     &                       lhsp(2,k1)*lhsp(5,k)
+                rhs(m,i,j,k1) = rhs(m,i,j,k1) -  &
+     &                       lhsp(2,k1)*rhs(m,i,j,k)
+                lhsp(2,k2) = lhsp(2,k2) -  &
+     &                       lhsp(1,k2)*lhsp(4,k)
+                lhsp(3,k2) = lhsp(3,k2) -  &
+     &                       lhsp(1,k2)*lhsp(5,k)
+                rhs(m,i,j,k2) = rhs(m,i,j,k2) -  &
+     &                       lhsp(1,k2)*rhs(m,i,j,k)
+                m = 5
+                fac1       = 1.d0/lhsm(3,k)
+                lhsm(4,k)  = fac1*lhsm(4,k)
+                lhsm(5,k)  = fac1*lhsm(5,k)
+                rhs(m,i,j,k)  = fac1*rhs(m,i,j,k)
+                lhsm(3,k1) = lhsm(3,k1) -  &
+     &                       lhsm(2,k1)*lhsm(4,k)
+                lhsm(4,k1) = lhsm(4,k1) -  &
+     &                       lhsm(2,k1)*lhsm(5,k)
+                rhs(m,i,j,k1) = rhs(m,i,j,k1) -  &
+     &                       lhsm(2,k1)*rhs(m,i,j,k)
+                lhsm(2,k2) = lhsm(2,k2) -  &
+     &                       lhsm(1,k2)*lhsm(4,k)
+                lhsm(3,k2) = lhsm(3,k2) -  &
+     &                       lhsm(1,k2)*lhsm(5,k)
+                rhs(m,i,j,k2) = rhs(m,i,j,k2) -  &
+     &                       lhsm(1,k2)*rhs(m,i,j,k)
+             end do
+
+!---------------------------------------------------------------------
+!         And again the last two rows separately
+!---------------------------------------------------------------------
+             k  = grid_points(3)-2
+             k1 = grid_points(3)-1
+             m = 4
+             fac1       = 1.d0/lhsp(3,k)
+             lhsp(4,k)  = fac1*lhsp(4,k)
+             lhsp(5,k)  = fac1*lhsp(5,k)
+             rhs(m,i,j,k)  = fac1*rhs(m,i,j,k)
+             lhsp(3,k1) = lhsp(3,k1) -  &
+     &                    lhsp(2,k1)*lhsp(4,k)
+             lhsp(4,k1) = lhsp(4,k1) -  &
+     &                    lhsp(2,k1)*lhsp(5,k)
+             rhs(m,i,j,k1) = rhs(m,i,j,k1) -  &
+     &                    lhsp(2,k1)*rhs(m,i,j,k)
+             m = 5
+             fac1       = 1.d0/lhsm(3,k)
+             lhsm(4,k)  = fac1*lhsm(4,k)
+             lhsm(5,k)  = fac1*lhsm(5,k)
+             rhs(m,i,j,k)  = fac1*rhs(m,i,j,k)
+             lhsm(3,k1) = lhsm(3,k1) -  &
+     &                    lhsm(2,k1)*lhsm(4,k)
+             lhsm(4,k1) = lhsm(4,k1) -  &
+     &                    lhsm(2,k1)*lhsm(5,k)
+             rhs(m,i,j,k1) = rhs(m,i,j,k1) -  &
+     &                    lhsm(2,k1)*rhs(m,i,j,k)
+!---------------------------------------------------------------------
+!               Scale the last row immediately (some of this is overkill
+!               if this is the last cell)
+!---------------------------------------------------------------------
+             rhs(4,i,j,k1) = rhs(4,i,j,k1)/lhsp(3,k1)
+             rhs(5,i,j,k1) = rhs(5,i,j,k1)/lhsm(3,k1)
+
+
+!---------------------------------------------------------------------
+!                         BACKSUBSTITUTION 
+!---------------------------------------------------------------------
+
+             k  = grid_points(3)-2
+             k1 = grid_points(3)-1
+             do   m = 1, 3
+                rhs(m,i,j,k) = rhs(m,i,j,k) -  &
+     &                             lhs(4,k)*rhs(m,i,j,k1)
+             end do
+
+             rhs(4,i,j,k) = rhs(4,i,j,k) -  &
+     &                             lhsp(4,k)*rhs(4,i,j,k1)
+             rhs(5,i,j,k) = rhs(5,i,j,k) -  &
+     &                             lhsm(4,k)*rhs(5,i,j,k1)
+
+!---------------------------------------------------------------------
+!      The back-substitution is always completed for the whole
+!      z-line
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!      The first three factors
+!---------------------------------------------------------------------
+             do   k = grid_points(3)-3, 0, -1
+                k1 = k  + 1
+                k2 = k  + 2
+                do   m = 1, 3
+                   rhs(m,i,j,k) = rhs(m,i,j,k) -  &
+     &                          lhs(4,k)*rhs(m,i,j,k1) -  &
+     &                          lhs(5,k)*rhs(m,i,j,k2)
+                end do
+
+!---------------------------------------------------------------------
+!      And the remaining two
+!---------------------------------------------------------------------
+                rhs(4,i,j,k) = rhs(4,i,j,k) -  &
+     &                          lhsp(4,k)*rhs(4,i,j,k1) -  &
+     &                          lhsp(5,k)*rhs(4,i,j,k2)
+                rhs(5,i,j,k) = rhs(5,i,j,k) -  &
+     &                          lhsm(4,k)*rhs(5,i,j,k1) -  &
+     &                          lhsm(5,k)*rhs(5,i,j,k2)
+             end do
+
+          end do
+       end do
+       if (timeron) call timer_stop(t_zsolve)
+
+       call tzetar
+
+       return
+       end
+
+
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/SP/z_solve_blk.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/SP/z_solve_blk.f90
new file mode 100644
index 000000000..e787a3992
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/SP/z_solve_blk.f90
@@ -0,0 +1,374 @@
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+       subroutine z_solve
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+! This subroutine solves the approximate factorization step in the
+! z-direction for all five matrix components simultaneously. The
+! Thomas algorithm is employed to solve the systems for the
+! z-lines. Boundary conditions are non-periodic.
+!---------------------------------------------------------------------
+
+       use sp_data
+       use work_lhs
+
+       implicit none
+
+       integer i, j, k, k1, k2, ii, ib, im
+       double precision ru1, fac1, fac2
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+! Prepare for z-solve, array redistribution   
+!---------------------------------------------------------------------
+
+       if (timeron) call timer_start(t_zsolve)
+!$omp parallel default(shared) private(i,j,k,k1,k2,ii,ib,im,  &
+!$omp&    ru1,fac1,fac2)
+
+       call lhsinit(nz2+1)
+
+!$omp do collapse(2)
+       do   j = 1, ny2
+       do  ii = 1, nx2, bsize
+          im = min(bsize, nx2 - ii + 1)
+
+          do  k = 0, grid_points(3)-1
+             do  ib = 1, bsize
+                i = min(ib,im) + ii - 1
+                rhsx(ib,1,k) = rhs(1,i,j,k)
+                rhsx(ib,2,k) = rhs(2,i,j,k)
+                rhsx(ib,3,k) = rhs(3,i,j,k)
+                rhsx(ib,4,k) = rhs(4,i,j,k)
+                rhsx(ib,5,k) = rhs(5,i,j,k)
+             end do
+          end do
+
+!---------------------------------------------------------------------
+! Computes the left hand side for the three z-factors   
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+! first fill the lhs for the u-eigenvalue                          
+!---------------------------------------------------------------------
+
+          do   k = 0, nz2 + 1
+             do  ib = 1, bsize
+                i = min(ib,im) + ii - 1
+                ru1 = c3c4*rho_i(i,j,k)
+                cv(ib,k) = ws(i,j,k)
+                rhov(ib,k) = dmax1(dz4 + con43 * ru1,  &
+     &                          dz5 + c1c5 * ru1,  &
+     &                          dzmax + ru1,  &
+     &                          dz1)
+             end do
+          end do
+
+          do   k =  1, nz2
+             do  ib = 1, bsize
+                lhs(ib,1,k) =  0.0d0
+                lhs(ib,2,k) = -dttz2 * cv(ib,k-1) - dttz1 * rhov(ib,k-1)
+                lhs(ib,3,k) =  1.0d0 + c2dttz1 * rhov(ib,k)
+                lhs(ib,4,k) =  dttz2 * cv(ib,k+1) - dttz1 * rhov(ib,k+1)
+                lhs(ib,5,k) =  0.0d0
+             end do
+          end do
+
+!---------------------------------------------------------------------
+!      add fourth order dissipation                                  
+!---------------------------------------------------------------------
+
+          do  ib = 1, bsize
+             k = 1
+             lhs(ib,3,k) = lhs(ib,3,k) + comz5
+             lhs(ib,4,k) = lhs(ib,4,k) - comz4
+             lhs(ib,5,k) = lhs(ib,5,k) + comz1
+
+             k = 2
+             lhs(ib,2,k) = lhs(ib,2,k) - comz4
+             lhs(ib,3,k) = lhs(ib,3,k) + comz6
+             lhs(ib,4,k) = lhs(ib,4,k) - comz4
+             lhs(ib,5,k) = lhs(ib,5,k) + comz1
+          end do
+
+          do    k = 3, nz2-2
+             do  ib = 1, bsize
+                lhs(ib,1,k) = lhs(ib,1,k) + comz1
+                lhs(ib,2,k) = lhs(ib,2,k) - comz4
+                lhs(ib,3,k) = lhs(ib,3,k) + comz6
+                lhs(ib,4,k) = lhs(ib,4,k) - comz4
+                lhs(ib,5,k) = lhs(ib,5,k) + comz1
+             end do
+          end do
+
+          do  ib = 1, bsize
+             k = nz2-1
+             lhs(ib,1,k) = lhs(ib,1,k) + comz1
+             lhs(ib,2,k) = lhs(ib,2,k) - comz4
+             lhs(ib,3,k) = lhs(ib,3,k) + comz6
+             lhs(ib,4,k) = lhs(ib,4,k) - comz4
+
+             k = nz2
+             lhs(ib,1,k) = lhs(ib,1,k) + comz1
+             lhs(ib,2,k) = lhs(ib,2,k) - comz4
+             lhs(ib,3,k) = lhs(ib,3,k) + comz5
+          end do
+
+
+!---------------------------------------------------------------------
+!      subsequently, fill the other factors (u+c), (u-c) 
+!---------------------------------------------------------------------
+          do    k = 1, nz2
+             do  ib = 1, bsize
+                i = min(ib,im) + ii - 1
+                lhsp(ib,1,k) = lhs(ib,1,k)
+                lhsp(ib,2,k) = lhs(ib,2,k) -  &
+     &                            dttz2 * speed(i,j,k-1)
+                lhsp(ib,3,k) = lhs(ib,3,k)
+                lhsp(ib,4,k) = lhs(ib,4,k) +  &
+     &                            dttz2 * speed(i,j,k+1)
+                lhsp(ib,5,k) = lhs(ib,5,k)
+                lhsm(ib,1,k) = lhs(ib,1,k)
+                lhsm(ib,2,k) = lhs(ib,2,k) +  &
+     &                            dttz2 * speed(i,j,k-1)
+                lhsm(ib,3,k) = lhs(ib,3,k)
+                lhsm(ib,4,k) = lhs(ib,4,k) -  &
+     &                            dttz2 * speed(i,j,k+1)
+                lhsm(ib,5,k) = lhs(ib,5,k)
+             end do
+          end do
+
+
+!---------------------------------------------------------------------
+!                          FORWARD ELIMINATION  
+!---------------------------------------------------------------------
+
+          do    k = 0, grid_points(3)-3
+             k1 = k  + 1
+             k2 = k  + 2
+             do  ib = 1, bsize
+                fac1      = 1.d0/lhs(ib,3,k)
+                lhs(ib,4,k)  = fac1*lhs(ib,4,k)
+                lhs(ib,5,k)  = fac1*lhs(ib,5,k)
+                rhsx(ib,1,k) = fac1*rhsx(ib,1,k)
+                rhsx(ib,2,k) = fac1*rhsx(ib,2,k)
+                rhsx(ib,3,k) = fac1*rhsx(ib,3,k)
+                lhs(ib,3,k1) = lhs(ib,3,k1) -  &
+     &                         lhs(ib,2,k1)*lhs(ib,4,k)
+                lhs(ib,4,k1) = lhs(ib,4,k1) -  &
+     &                         lhs(ib,2,k1)*lhs(ib,5,k)
+                rhsx(ib,1,k1) = rhsx(ib,1,k1) -  &
+     &                         lhs(ib,2,k1)*rhsx(ib,1,k)
+                rhsx(ib,2,k1) = rhsx(ib,2,k1) -  &
+     &                         lhs(ib,2,k1)*rhsx(ib,2,k)
+                rhsx(ib,3,k1) = rhsx(ib,3,k1) -  &
+     &                         lhs(ib,2,k1)*rhsx(ib,3,k)
+                lhs(ib,2,k2) = lhs(ib,2,k2) -  &
+     &                         lhs(ib,1,k2)*lhs(ib,4,k)
+                lhs(ib,3,k2) = lhs(ib,3,k2) -  &
+     &                         lhs(ib,1,k2)*lhs(ib,5,k)
+                rhsx(ib,1,k2) = rhsx(ib,1,k2) -  &
+     &                         lhs(ib,1,k2)*rhsx(ib,1,k)
+                rhsx(ib,2,k2) = rhsx(ib,2,k2) -  &
+     &                         lhs(ib,1,k2)*rhsx(ib,2,k)
+                rhsx(ib,3,k2) = rhsx(ib,3,k2) -  &
+     &                         lhs(ib,1,k2)*rhsx(ib,3,k)
+             end do
+          end do
+
+!---------------------------------------------------------------------
+!      The last two rows in this grid block are a bit different, 
+!      since they do not have two more rows available for the
+!      elimination of off-diagonal entries
+!---------------------------------------------------------------------
+          k  = grid_points(3)-2
+          k1 = grid_points(3)-1
+          do  ib = 1, bsize
+             fac1      = 1.d0/lhs(ib,3,k)
+             lhs(ib,4,k)  = fac1*lhs(ib,4,k)
+             lhs(ib,5,k)  = fac1*lhs(ib,5,k)
+             rhsx(ib,1,k) = fac1*rhsx(ib,1,k)
+             rhsx(ib,2,k) = fac1*rhsx(ib,2,k)
+             rhsx(ib,3,k) = fac1*rhsx(ib,3,k)
+             lhs(ib,3,k1) = lhs(ib,3,k1) -  &
+     &                      lhs(ib,2,k1)*lhs(ib,4,k)
+             lhs(ib,4,k1) = lhs(ib,4,k1) -  &
+     &                      lhs(ib,2,k1)*lhs(ib,5,k)
+             rhsx(ib,1,k1) = rhsx(ib,1,k1) -  &
+     &                      lhs(ib,2,k1)*rhsx(ib,1,k)
+             rhsx(ib,2,k1) = rhsx(ib,2,k1) -  &
+     &                      lhs(ib,2,k1)*rhsx(ib,2,k)
+             rhsx(ib,3,k1) = rhsx(ib,3,k1) -  &
+     &                      lhs(ib,2,k1)*rhsx(ib,3,k)
+!---------------------------------------------------------------------
+!               scale the last row immediately
+!---------------------------------------------------------------------
+             fac2      = 1.d0/lhs(ib,3,k1)
+             rhsx(ib,1,k1) = fac2*rhsx(ib,1,k1)
+             rhsx(ib,2,k1) = fac2*rhsx(ib,2,k1)
+             rhsx(ib,3,k1) = fac2*rhsx(ib,3,k1)
+          end do
+
+!---------------------------------------------------------------------
+!      do the u+c and the u-c factors               
+!---------------------------------------------------------------------
+          do    k = 0, grid_points(3)-3
+             k1 = k  + 1
+             k2 = k  + 2
+             do  ib = 1, bsize
+                fac1       = 1.d0/lhsp(ib,3,k)
+                lhsp(ib,4,k)  = fac1*lhsp(ib,4,k)
+                lhsp(ib,5,k)  = fac1*lhsp(ib,5,k)
+                rhsx(ib,4,k)  = fac1*rhsx(ib,4,k)
+                lhsp(ib,3,k1) = lhsp(ib,3,k1) -  &
+     &                       lhsp(ib,2,k1)*lhsp(ib,4,k)
+                lhsp(ib,4,k1) = lhsp(ib,4,k1) -  &
+     &                       lhsp(ib,2,k1)*lhsp(ib,5,k)
+                rhsx(ib,4,k1) = rhsx(ib,4,k1) -  &
+     &                       lhsp(ib,2,k1)*rhsx(ib,4,k)
+                lhsp(ib,2,k2) = lhsp(ib,2,k2) -  &
+     &                       lhsp(ib,1,k2)*lhsp(ib,4,k)
+                lhsp(ib,3,k2) = lhsp(ib,3,k2) -  &
+     &                       lhsp(ib,1,k2)*lhsp(ib,5,k)
+                rhsx(ib,4,k2) = rhsx(ib,4,k2) -  &
+     &                       lhsp(ib,1,k2)*rhsx(ib,4,k)
+                fac1       = 1.d0/lhsm(ib,3,k)
+                lhsm(ib,4,k)  = fac1*lhsm(ib,4,k)
+                lhsm(ib,5,k)  = fac1*lhsm(ib,5,k)
+                rhsx(ib,5,k)  = fac1*rhsx(ib,5,k)
+                lhsm(ib,3,k1) = lhsm(ib,3,k1) -  &
+     &                       lhsm(ib,2,k1)*lhsm(ib,4,k)
+                lhsm(ib,4,k1) = lhsm(ib,4,k1) -  &
+     &                       lhsm(ib,2,k1)*lhsm(ib,5,k)
+                rhsx(ib,5,k1) = rhsx(ib,5,k1) -  &
+     &                       lhsm(ib,2,k1)*rhsx(ib,5,k)
+                lhsm(ib,2,k2) = lhsm(ib,2,k2) -  &
+     &                       lhsm(ib,1,k2)*lhsm(ib,4,k)
+                lhsm(ib,3,k2) = lhsm(ib,3,k2) -  &
+     &                       lhsm(ib,1,k2)*lhsm(ib,5,k)
+                rhsx(ib,5,k2) = rhsx(ib,5,k2) -  &
+     &                       lhsm(ib,1,k2)*rhsx(ib,5,k)
+             end do
+          end do
+
+!---------------------------------------------------------------------
+!         And again the last two rows separately
+!---------------------------------------------------------------------
+          k  = grid_points(3)-2
+          k1 = grid_points(3)-1
+          do  ib = 1, bsize
+             fac1       = 1.d0/lhsp(ib,3,k)
+             lhsp(ib,4,k)  = fac1*lhsp(ib,4,k)
+             lhsp(ib,5,k)  = fac1*lhsp(ib,5,k)
+             rhsx(ib,4,k)  = fac1*rhsx(ib,4,k)
+             lhsp(ib,3,k1) = lhsp(ib,3,k1) -  &
+     &                    lhsp(ib,2,k1)*lhsp(ib,4,k)
+             lhsp(ib,4,k1) = lhsp(ib,4,k1) -  &
+     &                    lhsp(ib,2,k1)*lhsp(ib,5,k)
+             rhsx(ib,4,k1) = rhsx(ib,4,k1) -  &
+     &                    lhsp(ib,2,k1)*rhsx(ib,4,k)
+             fac1       = 1.d0/lhsm(ib,3,k)
+             lhsm(ib,4,k)  = fac1*lhsm(ib,4,k)
+             lhsm(ib,5,k)  = fac1*lhsm(ib,5,k)
+             rhsx(ib,5,k)  = fac1*rhsx(ib,5,k)
+             lhsm(ib,3,k1) = lhsm(ib,3,k1) -  &
+     &                    lhsm(ib,2,k1)*lhsm(ib,4,k)
+             lhsm(ib,4,k1) = lhsm(ib,4,k1) -  &
+     &                    lhsm(ib,2,k1)*lhsm(ib,5,k)
+             rhsx(ib,5,k1) = rhsx(ib,5,k1) -  &
+     &                    lhsm(ib,2,k1)*rhsx(ib,5,k)
+!---------------------------------------------------------------------
+!               Scale the last row immediately (some of this is overkill
+!               if this is the last cell)
+!---------------------------------------------------------------------
+             rhsx(ib,4,k1) = rhsx(ib,4,k1)/lhsp(ib,3,k1)
+             rhsx(ib,5,k1) = rhsx(ib,5,k1)/lhsm(ib,3,k1)
+          end do
+
+
+!---------------------------------------------------------------------
+!                         BACKSUBSTITUTION 
+!---------------------------------------------------------------------
+
+          k  = grid_points(3)-2
+          k1 = grid_points(3)-1
+          do  ib = 1, bsize
+             rhsx(ib,1,k) = rhsx(ib,1,k) -  &
+     &                             lhs(ib,4,k)*rhsx(ib,1,k1)
+             rhsx(ib,2,k) = rhsx(ib,2,k) -  &
+     &                             lhs(ib,4,k)*rhsx(ib,2,k1)
+             rhsx(ib,3,k) = rhsx(ib,3,k) -  &
+     &                             lhs(ib,4,k)*rhsx(ib,3,k1)
+
+             rhsx(ib,4,k) = rhsx(ib,4,k) -  &
+     &                             lhsp(ib,4,k)*rhsx(ib,4,k1)
+             rhsx(ib,5,k) = rhsx(ib,5,k) -  &
+     &                             lhsm(ib,4,k)*rhsx(ib,5,k1)
+          end do
+
+!---------------------------------------------------------------------
+!      Whether or not this is the last processor, we always have
+!      to complete the back-substitution 
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!      The first three factors
+!---------------------------------------------------------------------
+          do   k = grid_points(3)-3, 0, -1
+             k1 = k  + 1
+             k2 = k  + 2
+             do  ib = 1, bsize
+                rhsx(ib,1,k) = rhsx(ib,1,k) -  &
+     &                          lhs(ib,4,k)*rhsx(ib,1,k1) -  &
+     &                          lhs(ib,5,k)*rhsx(ib,1,k2)
+                rhsx(ib,2,k) = rhsx(ib,2,k) -  &
+     &                          lhs(ib,4,k)*rhsx(ib,2,k1) -  &
+     &                          lhs(ib,5,k)*rhsx(ib,2,k2)
+                rhsx(ib,3,k) = rhsx(ib,3,k) -  &
+     &                          lhs(ib,4,k)*rhsx(ib,3,k1) -  &
+     &                          lhs(ib,5,k)*rhsx(ib,3,k2)
+
+!---------------------------------------------------------------------
+!      And the remaining two
+!---------------------------------------------------------------------
+                rhsx(ib,4,k) = rhsx(ib,4,k) -  &
+     &                          lhsp(ib,4,k)*rhsx(ib,4,k1) -  &
+     &                          lhsp(ib,5,k)*rhsx(ib,4,k2)
+                rhsx(ib,5,k) = rhsx(ib,5,k) -  &
+     &                          lhsm(ib,4,k)*rhsx(ib,5,k1) -  &
+     &                          lhsm(ib,5,k)*rhsx(ib,5,k2)
+             end do
+          end do
+
+          do  k = 0, grid_points(3)-1
+             do  ib = 1, im
+                i = ib + ii - 1
+                rhs(1,i,j,k) = rhsx(ib,1,k)
+                rhs(2,i,j,k) = rhsx(ib,2,k)
+                rhs(3,i,j,k) = rhsx(ib,3,k)
+                rhs(4,i,j,k) = rhsx(ib,4,k)
+                rhs(5,i,j,k) = rhsx(ib,5,k)
+             end do
+          end do
+
+       end do
+       end do
+!$omp end do nowait
+!$omp end parallel
+       if (timeron) call timer_stop(t_zsolve)
+
+       call tzetar
+
+       return
+       end
+
+
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/UA/Makefile b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/UA/Makefile
new file mode 100644
index 000000000..d77721cc9
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/UA/Makefile
@@ -0,0 +1,58 @@
+SHELL=/bin/sh
+BENCHMARK=ua
+BENCHMARKU=UA
+UPDATE=$(VERSION)
+UXT=
+
+include ../config/make.def
+
+
+OBJS = ua.o ua_data.o convect.o diffuse.o adapt.o move.o mason.o \
+       precond.o utils.o verify.o setup.o transfer$(UXT).o \
+       ${COMMON}/print_results.o ${COMMON}/timers.o ${COMMON}/wtime.o
+
+ifeq (${M5_ANNOTATION}, 1)
+	OBJS += ${COMMON}/hooks.o
+endif
+
+include ../sys/make.common
+
+# npbparams.h is included by ua_data module (via ua_data.o)
+
+${PROGRAM}: config ${OBJS}
+	@if [ x$(UPDATE) = xau -o x$(UPDATE) = xAU ] ; then	\
+		${MAKE} UXT=_au ua-def;	\
+	elif [ x$(UPDATE) = xrd -o x$(UPDATE) = xRD ] ; then	\
+		${MAKE} UXT=_rd ua-rd;	\
+	else				\
+		${MAKE} ua-def;		\
+	fi
+
+ua-def: ${OBJS}
+	${FLINK} ${FLINKFLAGS} -o ${PROGRAM} ${OBJS} ${F_LIB}
+
+ua-rd: ${OBJS} tmorwork.o
+	${FLINK} ${FLINKFLAGS} -o ${PROGRAM} ${OBJS} tmorwork.o ${F_LIB}
+
+.f90.o:
+	${FCOMPILE} $<
+
+ua.o:        ua.f90       ua_data.o
+setup.o:     setup.f90    ua_data.o
+convect.o:   convect.f90  ua_data.o
+adapt.o:     adapt.f90    ua_data.o
+move.o:      move.f90     ua_data.o
+diffuse.o:   diffuse.f90  ua_data.o
+mason.o:     mason.f90    ua_data.o
+precond.o:   precond.f90  ua_data.o
+transfer.o:  transfer.f90 ua_data.o
+transfer_au.o:  transfer_au.f90 ua_data.o
+transfer_rd.o:  transfer_rd.f90 ua_data.o tmorwork.o
+utils.o:     utils.f90    ua_data.o
+verify.o:    verify.f90   ua_data.o
+ua_data.o:   ua_data.f90  npbparams.h
+tmorwork.o:  tmorwork.f90
+
+clean:
+	- rm -f *.o *~ *.mod mputil*
+	- rm -f npbparams.h core
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/UA/README b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/UA/README
new file mode 100644
index 000000000..77b3d16e8
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/UA/README
@@ -0,0 +1,33 @@
+Note on the parallelization of transfer.f90
+-------------------------------------------
+
+The file contains three major loops that update sparsely overlapped
+mortar points.  Parallelization of these loops requires atomic update
+of memory references by mortar points.  The first implementation
+uses the ATOMIC directive to perform the job.  However, in some systems
+where atomic update of memory references is not available, the ATOMIC
+directive could be implemented as a critical section, which would be 
+very expensive.
+
+The second approach is to use thread-local arrays to perform local updates,
+followed by a global reduction. This implementation uses array reduction 
+to achieve the goal. The disadvantage of this version is the increased use 
+of memory, which could have a negative impact on performance.
+
+The third approach is to use locks to guard atomic updates.  This 
+implementation scales reasonably well.  However, the overhead associated 
+with calling lock routines deep inside loop nests could be large.
+
+Three implementations:
+   - transfer_au.f90: use ATOMIC directive for atomic updates
+   - transfer_rd.f90: use array reduction for atomic updates
+   - transfer.f90: use locks for atomic updates, as the default
+
+To use different approaches, one can use the suboption "VERSION"
+for make:
+
+   % make CLASS=<class>                # default version
+   % make CLASS=<class> VERSION=au     # ATOMIC directive
+   % make CLASS=<class> VERSION=rd     # array reduction
+
+where <class> is one of [S,W,A,B,C,D].
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/UA/adapt.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/UA/adapt.f90
new file mode 100644
index 000000000..47473acbe
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/UA/adapt.f90
@@ -0,0 +1,1229 @@
+!-----------------------------------------------------------
+      subroutine adaptation (ifmortar,step)
+!-----------------------------------------------------------
+!     For 3-D mesh adaptation (refinement + coarsening)
+!-----------------------------------------------------------
+
+      use ua_data
+      implicit none
+      
+      logical if_coarsen,if_refine,ifmortar,ifrepeat
+      integer iel,miel,irefine,icoarsen,neltold,step
+
+      if (timeron) call timer_start(t_adaptation)
+      ifmortar=.false.
+!.....compute heat source center(x0,y0,z0)
+      x0=x00+velx*time
+      y0=y00+vely*time
+      z0=z00+velz*time
+
+!.....Search elements to be refined. Check with restrictions. Perform
+!     refinement repeatedly until all desired refinements are done.
+
+!.....ich(iel)=0 no grid change on element iel
+!.....ich(iel)=2 iel is marked to be coarsened
+!.....ich(iel)=4 iel is marked to be refined
+
+!.....irefine records how many elements got refined
+      irefine=0
+
+!.....check whether elements need to be refined because they have overlap
+!     with the heat source
+4     call find_refine(if_refine)
+
+      if(if_refine) then
+        ifrepeat=.true.
+2       if(ifrepeat) then
+!.........Check with restriction, unmark elements that cannot be refined.
+!         Elements preventing desired refinement will be marked to be refined.
+          call check_refine(ifrepeat) 
+          go to 2
+        end if
+!.......perform refinement
+        call do_refine(ifmortar,irefine)
+        goto 4
+      endif
+
+!.....Search for elements to be coarsened. Check with restrictions.
+!     Perform coarsening repeatedly until all possible coarsening
+!     is done.
+
+!.....icoarsen records how many elements got coarsened 
+      icoarsen=0
+
+!.....skip(iel)=.true. indicates an element no longer exists (because it
+!     got merged)
+!$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(iel)
+       do iel=1,nelt
+        skip(iel)=.false.
+      end do
+!$OMP END PARALLEL DO
+
+      neltold=nelt
+
+!.....Check whether elements need to be coarsened because they don't have
+!     overlap with the heat source. Only elements that don't have a larger 
+!     size neighbor can be marked to be coarsened
+
+5     call find_coarsen(if_coarsen,neltold)
+
+      if(if_coarsen) then
+!.......Perform coarsening, however subject to restriction. Only possible 
+!       coarsening will be performed. if_coarsen=.true. indicates that
+!       actual coarsening happened
+        call do_coarsen(if_coarsen,icoarsen,neltold)
+        if(if_coarsen) then
+!.........ifmortar=.true. indicates the grid changed, i.e. the mortar points 
+!         indices need to be regenerated on the new grid.
+          ifmortar=.true.
+          go to 5
+        end if 
+      end if
+
+      write(*,1000) step, irefine, icoarsen, nelt
+ 1000 format('Step ',i4, ': elements refined, merged, total: ',  &
+     &       i7, 1X , i7, 1X, i7)
+
+!.....mt_to_id(miel) takes as argument the morton index  and returns the actual 
+!                    element index
+!.....id_to_mt(iel)  takes as argument the actual element index and returns the 
+!                    morton index
+!$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(miel,iel)
+      do miel=1,nelt
+        iel=mt_to_id(miel)
+        id_to_mt(iel)=miel
+      end do 
+!$OMP END PARALLEL DO
+
+!.....Reorder the elements in the order of the morton curve. After the move 
+!     subroutine the element indices are  the same as the morton indices
+      call move
+
+!.....if the grid changed, regenerate mortar indices and update variables
+!     associated with the grid.
+      if (ifmortar) then
+        call mortar
+        call prepwork
+      endif
+      if (timeron) call timer_stop(t_adaptation)
+
+      return
+      end 
+
+
+!-----------------------------------------------------------
+      subroutine do_coarsen(if_coarsen,icoarsen,neltold)
+!---------------------------------------------------------------
+!     Coarsening procedure: 
+!     1) check with restrictions
+!     2) perform coarsening
+!---------------------------------------------------------------
+
+      use ua_data
+      implicit none
+
+      logical if_coarsen, icheck,test,test1,test2,test3
+      integer iel, ntp(8), ntempmin, ic, parent, mielnew, miel,  &
+     &        icoarsen, i, index, num_coarsen, ntemp, ii, ntemp1,  &
+     &        neltold
+      
+      if_coarsen=.false.
+
+!.....If an element has been merged, it will be skipped afterwards
+!     skip(iel)=.true. for elements that will be skipped.
+!     ifcoa_id(iel)=.true. indicates that element iel will be coarsened
+!     ifcoa(miel)=.true. indicates that element miel (morton index) will
+!                        be coarsened
+
+!$OMP PARALLEL DEFAULT(SHARED) PRIVATE(iel)
+!$OMP DO 
+      do iel=1,nelt
+        mt_to_id_old(iel)=mt_to_id(iel)
+        mt_to_id(iel)=0
+      end do
+!$OMP END DO nowait
+!$OMP DO 
+      do iel=1,neltold 
+        ifcoa_id(iel)=.false.
+      end do
+!$OMP END DO nowait
+!$OMP END PARALLEL
+
+!.....Check whether the potential coarsening would make a neighbor,
+!     or a neighbor's neighbor, etc., break the grid restriction
+
+!$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(miel,iel,ic,  &
+!$OMP& ntp,parent,test,test1,i,test2,test3)  &
+!$OMP& SHARED(if_coarsen)
+      do miel=1,nelt
+        ifcoa(miel)=.false.
+        front(miel)=0
+        iel=mt_to_id_old(miel)
+!.......if an element is marked to be coarsened
+        if(ich(iel).eq.2) then
+
+!.........If the current  element is the "first" child (front-left-
+!         bottom) of its parent (tree(iel) mod 8 equals 0), then 
+!         find all its neighbors. Check whether they are from the same 
+!         parent.
+
+          ic=tree(iel)
+          if(.not.btest(ic,0).and..not.btest(ic,1).and.  &
+     &       .not.btest(ic,2)) then
+            ntp(1)=iel
+            ntp(2)=sje(1,1,1,iel)
+            ntp(3)=sje(1,1,3,iel)
+            ntp(4)=sje(1,1,1,ntp(3))
+            ntp(5)=sje(1,1,5,iel)
+            ntp(6)=sje(1,1,1,ntp(5))
+            ntp(7)=sje(1,1,3,ntp(5))
+            ntp(8)=sje(1,1,1,ntp(7))
+ 
+            parent=ishft(tree(iel),-3)
+            test=.false.
+
+            test1=.true.
+            do i=1,8
+              if(ishft(tree(ntp(i)),-3).ne.parent)test1=.false.
+            end do
+
+!...........check whether all child elements are marked to be coarsened
+            if(test1)then
+              test2=.true.
+              do i=1,8
+                if(ich(ntp(i)).ne.2)test2=.false.
+              end do
+
+!.............check whether all child elements can be coarsened or not.
+              if(test2)then
+                test3=.true.
+                do i=1,8
+                  if(.not.icheck(ntp(i),i))test3=.false.
+                end do
+                if(test3)test=.true.
+              end if
+            end if
+!...........if the eight child elements are eligible to be coarsened
+!           mark the first child ifcoa(miel)=.true.
+!           mark them all ifcoa_id()=.true.
+!           front(miel) will be used to calculate (potentially in parallel)
+!                       how many elements with sequence numbers less than
+!                       miel will be coarsened.
+!           skip()      marks that an element will no longer exist after merge.
+
+            if(test)then
+
+              ifcoa(miel)=.true.
+              do i=1,8
+                ifcoa_id(ntp(i))=.true.
+              end do
+              front(miel)=1
+              do i=1,7
+                 skip(ntp(i+1))=.true.
+              end do
+              if(.not.if_coarsen) if_coarsen=.true.
+            end if
+          end if 
+        end if 
+      end do 
+!$OMP END PARALLEL DO
+
+!.....compute front(iel), how many elements will be coarsened before iel
+!     (including iel)
+      call parallel_add(front)
+
+!.....num_coarsen is the total number of elements that will be coarsened
+      num_coarsen=front(nelt)
+
+!.....action(i) records the morton index of the i'th element (if it is an
+!     element's front-left-bottom-child) to be coarsened.
+
+!.....create array mt_to_id to convert morton index to actual element index
+!$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(miel,iel,mielnew)
+      do miel=1,nelt
+        iel=mt_to_id_old(miel)
+        if(.not.skip(iel))then
+          if(ifcoa(miel))then
+            action(front(miel))=miel
+            mielnew=miel-(front(miel)-1)*7
+          else 
+            mielnew=miel-front(miel)*7
+          end if
+          mt_to_id(mielnew)=iel
+        end if
+      end do
+!$OMP END PARALLEL DO
+
+!.....perform the coarsening procedure (potentially in parallel)
+!$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(index,miel,iel,ntp)
+      do index=1,num_coarsen
+        miel=action(index)
+        iel=mt_to_id_old(miel)
+!.......find eight child elements to be coarsened
+        ntp(1)=iel
+        ntp(2)=sje(1,1,1,iel)
+        ntp(3)=sje(1,1,3,iel)
+        ntp(4)=sje(1,1,1,ntp(3))
+        ntp(5)=sje(1,1,5,iel)
+        ntp(6)=sje(1,1,1,ntp(5))
+        ntp(7)=sje(1,1,3,ntp(5))
+        ntp(8)=sje(1,1,1,ntp(7))
+!.......merge them to be the parent
+        call merging(ntp)
+      end do
+!$OMP END PARALLEL DO
+      nelt=nelt-num_coarsen*7
+      icoarsen=icoarsen+num_coarsen*8
+
+      return
+      end
+
+!-------------------------------------------------------
+      subroutine do_refine(ifmortar,irefine)
+!-------------------------------------------------------
+!     Refinement procedure
+!--------------------------------------------------------
+
+      use ua_data
+      implicit none
+
+      logical ifmortar
+      double precision xctemp(8), yctemp(8), zctemp(8), xleft, xright,  &
+     &       yleft, yright, zleft, zright, ta1temp(lx1,lx1,lx1),  &
+     &       xhalf, yhalf, zhalf
+      integer iel, i, ii, jj, j, jface,  &
+     &        ntemp, ndir, facedir, k, le(4), ne(4), mielnew,  &
+     &        miel, irefine,ntemp1, num_refine, index, treetemp,  &
+     &        sjetemp(2,2,6), n1, n2, nelttemp,  &
+     &        cb, cbctemp(6)
+
+!.....initialize
+
+!$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(miel)
+      do miel=1,nelt
+        mt_to_id_old(miel)=mt_to_id(miel)
+        mt_to_id(miel)=0
+        action(miel)=0
+        if(ich(mt_to_id_old(miel)).ne.4)then
+          front(miel)=0
+        else
+          front(miel)=1
+        end if
+      end do
+!$OMP END PARALLEL DO
+
+!.....front(iel) records how many elements with sequence numbers less than
+!     or equal to iel will be refined
+      call parallel_add(front)
+
+!.....num_refine is the total number of elements that will be refined
+      num_refine=front(nelt)
+
+!.....action(i) records the morton index of the  i'th element to be refined
+!$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(miel,iel)
+      do miel=1,nelt
+        iel=mt_to_id_old(miel)
+        if(ich(iel).eq.4)then
+          action(front(miel))=miel
+        end if
+      end do
+!$OMP END PARALLEL DO
+
+!.....Compute array mt_to_id to convert the morton index to element index.
+!     ref_front_id(iel) records how many elements with index less than
+!     iel (actual element index, not morton index), will be refined.
+!$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(miel,iel,ntemp,mielnew)
+      do miel=1,nelt
+        iel=mt_to_id_old(miel)
+        if(ich(iel).eq.4)then
+          ntemp=(front(miel)-1)*7
+          mielnew=miel+ntemp
+        else
+          ntemp=front(miel)*7
+          mielnew=miel+ntemp
+        end if
+
+        mt_to_id(mielnew)=iel
+        ref_front_id(iel)=nelt+ntemp
+      end do
+!$OMP END PARALLEL DO
+
+
+!.....Perform refinement (potentially in parallel): 
+!       - Cut an element into eight children.
+!       - Assign them element index  as iel, nelt+1,...., nelt+7.
+!       - Update neighboring information.
+
+      nelttemp=nelt
+
+      if (num_refine .gt. 0) then
+        ifmortar=.true.
+      endif
+!$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(index,miel,mielnew,iel,nelt,  &
+!$OMP& treetemp,xctemp,yctemp,zctemp,cbctemp,sjetemp,ta1temp,  &
+!$OMP& ii,jj,ntemp,xleft,xright,xhalf,yleft,yright,yhalf,zleft,zright,  &
+!$OMP& zhalf,ndir,facedir,jface,cb,le,ne,n1,n2,i,j,k)
+      do index=1, num_refine  
+!.......miel is old morton index and mielnew is new morton index after refinement.
+        miel=action(index)
+        mielnew=miel+(front(miel)-1)*7
+        iel=mt_to_id_old(miel) 
+        nelt=nelttemp+(front(miel)-1)*7 
+!.......save iel's information in a temporary array
+        treetemp=tree(iel)
+        do i=1,8
+          xctemp(i)=xc(i,iel)
+          yctemp(i)=yc(i,iel)
+          zctemp(i)=zc(i,iel)
+        end do
+        do i=1,6
+          cbctemp(i)=cbc(i,iel)
+          do jj=1,2
+            do ii=1,2
+              sjetemp(ii,jj,i)=sje(ii,jj,i,iel)
+            end do
+          end do
+        end do
+        call copy(ta1temp,ta1(1,1,1,iel),nxyz)
+
+
+!.......zero out iel here
+        tree(iel)=0
+        call nr_init(cbc(1,iel),6,0)
+        call nr_init(sje(1,1,1,iel),24,0)
+        call nr_init(ijel(1,1,iel),12,0)
+        call r_init(ta1(1,1,1,iel),nxyz,0.d0)
+
+
+!.......initialize new child elements:iel and nelt+1~nelt+7
+        do j=1,7 
+          mt_to_id(mielnew+j)=nelt+j
+          tree(nelt+j)=0
+          call nr_init(cbc(1,nelt+j),6,0)
+          call nr_init(sje(1,1,1,nelt+j),24,0)
+          call nr_init(ijel(1,1,nelt+j),12,0)
+          call r_init(ta1(1,1,1,nelt+j),nxyz,0.d0)
+        end do
+          
+!.......update the tree()
+        ntemp=ishft(treetemp,3)
+        tree(iel)=ntemp
+        do i=1,7
+          tree(nelt+i)=ntemp+mod(i,8)
+        end do   
+!.......update the children's vertices' coordinates
+        xhalf=xctemp(1)+(xctemp(2)-xctemp(1))/2.d0
+        xleft=xctemp(1)
+        xright=xctemp(2)
+        yhalf=yctemp(1)+(yctemp(3)-yctemp(1))/2.d0
+        yleft=yctemp(1)
+        yright=yctemp(3)
+        zhalf=zctemp(1)+(zctemp(5)-zctemp(1))/2.d0
+        zleft=zctemp(1)
+        zright=zctemp(5)
+       
+        do j=1,7,2
+          do i=1,7,2
+            xc(i,nelt+j)     = xhalf
+            xc(i+1,nelt+j)   = xright 
+          end do
+        end do
+
+        do j=2,6,2
+          do i=1,7,2
+            xc(i,nelt+j)   = xleft
+            xc(i+1,nelt+j) = xhalf
+          end do
+        end do
+         
+        do i=1,7,2
+          xc(i,iel)=xleft
+          xc(i+1,iel)=xhalf
+        end do
+
+        do i=1,2
+          yc(i,nelt+1)=yleft
+          yc(i,nelt+4)=yleft
+          yc(i,nelt+5)=yleft
+          yc(i+4,nelt+1)=yleft
+          yc(i+4,nelt+4)=yleft
+          yc(i+4,nelt+5)=yleft
+        enddo
+        do i=3,4
+          yc(i,nelt+1)=yhalf
+          yc(i,nelt+4)=yhalf
+          yc(i,nelt+5)=yhalf
+          yc(i+4,nelt+1)=yhalf
+          yc(i+4,nelt+4)=yhalf
+          yc(i+4,nelt+5)=yhalf
+        end do
+        do j=2,3
+          do i=1,2
+            yc(i,nelt+j)=yhalf
+            yc(i,nelt+j+4)=yhalf
+            yc(i+4,nelt+j)=yhalf
+            yc(i+4,nelt+j+4)=yhalf
+          end do
+          do i=3,4
+            yc(i,nelt+j)=yright
+            yc(i,nelt+j+4)=yright
+            yc(i+4,nelt+j)=yright
+            yc(i+4,nelt+j+4)=yright
+          end do
+        end do
+          
+        do i=1,2
+          yc(i,iel)=yleft
+          yc(i+4,iel)=yleft
+        end do
+        do i=3,4
+          yc(i,iel)=yhalf
+          yc(i+4,iel)=yhalf
+        end do
+
+        do j=1,3
+          do i=1,4
+            zc(i,nelt+j)=zleft
+            zc(i+4,nelt+j)=zhalf
+          end do
+        end do
+        do j=4,7
+          do i=1,4
+            zc(i,nelt+j)=zhalf
+            zc(i+4,nelt+j)=zright
+          end do
+        end do
+        do i=1,4
+          zc(i,iel)=zleft
+          zc(i+4,iel)=zhalf
+        end do
+
+!.......update the children's neighbor information
+
+!.......ndir refers to the x,y,z directions, respectively.
+!       facedir refers to the orientation of the face in each direction, 
+!       e.g. ndir=1, facedir=0 refers to face 1,
+!       and ndir =1, facedir=1 refers to face 2.
+
+        do ndir = 1, 3
+          do facedir = 0, 1
+            i=2*ndir-1+facedir
+            jface=jjface(i)
+            cb=cbctemp(i)
+
+!...........find the new element indices of the four children on each
+!           face of the parent element
+            do k = 1, 4
+              le(k) = le_arr(k,facedir,ndir)+nelt
+              ne(k) = le_arr(k,1-facedir,ndir)+nelt
+            end do
+            if(facedir.eq.0)then
+              le(1)=iel
+            else
+              ne(1)=iel
+            end if
+!...........update neighbor information of the four child elements on each 
+!           face of the parent element
+            do k=1,4
+              cbc(i,le(k))=2
+              sje(1,1,i,le(k))=ne(k)
+              ijel(1,i,le(k))=1
+              ijel(2,i,le(k))=1
+            end do
+
+!...........if the face type of the parent element is type 2
+            if(cb.eq.2) then
+              ntemp=sjetemp(1,1,i)
+
+!.............if the neighbor ntemp is not marked to be refined
+              if(ich(ntemp).ne.4)then
+                cbc(jface,ntemp)=3
+                ijel(1,jface,ntemp)=1
+                ijel(2,jface,ntemp)=1
+  
+                do k=1,4
+                  cbc(i,ne(k))=1
+                  sje(1,1,i,ne(k))=ntemp
+                  if(k.eq.1) then
+                    ijel(1,i,ne(k))=1
+                    ijel(2,i,ne(k))=1
+                    sje(1,1,jface,ntemp)=ne(k)
+                  elseif(k.eq.2) then
+                    ijel(1,i,ne(k))=1
+                    ijel(2,i,ne(k))=2
+                    sje(1,2,jface,ntemp)=ne(k)
+                  elseif(k.eq.3) then
+                    ijel(1,i,ne(k))=2
+                    ijel(2,i,ne(k))=1
+                    sje(2,1,jface,ntemp)=ne(k)
+                  elseif(k.eq.4) then
+                    ijel(1,i,ne(k))=2
+                    ijel(2,i,ne(k))=2
+                    sje(2,2,jface,ntemp)=ne(k)
+                  end if
+                end do
+
+!.............if the neighbor ntemp is also marked to be refined
+              else
+                n1=ref_front_id(ntemp)
+                 
+                do k=1,4
+                  cbc(i,ne(k))=2
+                  n2=n1+le_arr(k,facedir,ndir)
+                  if(n2.eq.n1+8)n2=ntemp
+                  sje(1,1,i,ne(k))=n2
+                  ijel(1,i,ne(k))=1
+                end do
+
+              endif
+!...........if the face type of the parent element is type 3
+            elseif(cb.eq.3) then
+              do k=1,4
+                cbc(i,ne(k))=2
+                if(k.eq.1) then
+                  ntemp=sjetemp(1,1,i)
+                elseif (k.eq.2) then
+                  ntemp=sjetemp(1,2,i)
+                elseif(k.eq.3) then
+                  ntemp=sjetemp(2,1,i)
+                elseif(k.eq.4) then
+                  ntemp=sjetemp(2,2,i)
+                end if
+                ijel(1,i,ne(k))=1
+                ijel(2,i,ne(k))=1
+                sje(1,1,i,ne(k))=ntemp
+                cbc(jface,ntemp)=2
+                sje(1,1,jface,ntemp)=ne(k)
+                ijel(1,jface,ntemp)=1
+                ijel(2,jface,ntemp)=1
+              end do
+
+!...........if the face type of the parent element is type 0
+            elseif(cb.eq.0) then
+              do k=1,4
+                cbc(i,ne(k))=cb
+              end do
+            end if
+
+          end do 
+        end do 
+
+!.......map solution from parent element to children
+        call remap(ta1(1,1,1,iel),ta1(1,1,1,ref_front_id(iel)+1),  &
+     &             ta1temp(1,1,1))
+      end do
+!$OMP END PARALLEL DO
+
+      nelt=nelttemp+num_refine*7
+      irefine=irefine+num_refine
+      ntot=nelt*lx1*lx1*lx1
+      return
+      end
+
+!-----------------------------------------------------------
+       logical function ifcor(n1,n2,i,iface)
+!-----------------------------------------------------------
+!      returns whether element n1's face i and element n2's 
+!      jjface(iface) have intersections, i.e. whether n1 and 
+!      n2 are neighbored by an edge.
+!-----------------------------------------------------------
+
+       use ua_data
+       implicit none
+
+       integer n1,n2,i,iface
+       logical ifsame
+
+       ifcor=.false.
+
+       if(ifsame(n1,e1v1(iface,i),n2,e2v1(iface,i)).or.  &
+     &    ifsame(n1,e1v2(iface,i),n2,e2v2(iface,i))) then
+          ifcor=.true.
+       end if
+
+       return
+       end
+
+!-----------------------------------------------------------
+      logical function icheck(ie,n)
+!-----------------------------------------------------------
+!     Check whether element ie's three faces (sharing vertex n)
+!     are nonconforming. This will prevent it from being coarsened.
+!     Also check ie's neighbors on those three faces, whether ie's
+!     neighbors by only an edge have a size smaller than ie's,
+!     which also prevents ie from being coarsened.
+!-----------------------------------------------------------
+
+      use ua_data
+      implicit none
+
+      integer ie, n, iside, ntemp1, ntemp2, ntemp3, n1, n2, n3,  &
+     &cb2_1,cb3_1,cb1_2,cb3_2,cb1_3,cb2_3
+
+      icheck=.true.
+      cb2_1=0
+      cb3_1=0
+      cb1_2=0
+      cb3_2=0
+      cb1_3=0
+      cb2_3=0
+
+      n1=f_c(1,n)
+      n2=f_c(2,n)
+      n3=f_c(3,n)
+      if((cbc(n1,ie).eq.3) .or. (cbc(n2,ie).eq.3) .or.  &
+     &   (cbc(n3,ie).eq.3)) then
+         icheck=.false.
+      else
+        ntemp1=sje(1,1,n1,ie)
+        ntemp2=sje(1,1,n2,ie)
+        ntemp3=sje(1,1,n3,ie)
+        if(ntemp1.ne.0)then
+           cb2_1=cbc(n2,ntemp1)
+           cb3_1=cbc(n3,ntemp1)
+        end if
+        if(ntemp2.ne.0)then
+           cb3_2=cbc(n3,ntemp2)
+           cb1_2=cbc(n1,ntemp2)
+        end if
+        if(ntemp3.ne.0)then
+           cb1_3=cbc(n1,ntemp3)
+           cb2_3=cbc(n2,ntemp3)
+        end if
+        if((cbc(n1,ie).eq.2.and.(cb2_1.eq.3.or.  &
+     &                               cb3_1.eq.3)).or.  &
+     &     (cbc(n2,ie).eq.2.and.(cb3_2.eq.3.or.  &
+     &                               cb1_2.eq.3)).or.  &
+     &     (cbc(n3,ie).eq.2.and.(cb1_3.eq.3.or.  &
+     &                              cb2_3.eq.3)))then
+          icheck=.false.
+        end if
+      end if
+
+      return
+      end 
+
+!-----------------------------------------------------------
+      subroutine find_coarsen(if_coarsen,neltold)
+!-----------------------------------------------------------
+!     Search elements to be coarsened. Check with restrictions.
+!     This subroutine only checks the element itself, not its
+!     neighbors.
+!-----------------------------------------------------------
+
+      use ua_data
+      implicit none
+
+      logical if_coarsen, iftemp, iftouch
+      integer iel,i,neltold
+
+      if_coarsen=.false.
+
+!$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(iel,i,iftemp)  &
+!$OMP& SHARED(if_coarsen)
+      do iel=1,neltold
+        if(.not.skip(iel))then
+          ich(iel)=0
+          if(.not.iftouch(iel)) then
+            iftemp=.false.
+            do i=1,nsides
+!.............if iel has a larger size than its face neighbors, it
+!             can not be coarsened
+              if(cbc(i,iel).eq.3) then
+                iftemp=.true.
+              endif
+            enddo
+            if(.not.iftemp) then
+              if(.not.if_coarsen) if_coarsen=.true.
+              ich(iel)=2
+            end if
+          end if
+        endif
+      enddo
+!$OMP END PARALLEL DO
+
+      return
+      end
+
+!-----------------------------------------------------------
+      subroutine find_refine(if_refine)
+!-----------------------------------------------------------
+!     search elements to be refined based on whether they
+!     have overlap with the heat source
+!-----------------------------------------------------------
+
+      use ua_data
+      implicit none
+
+      logical if_refine, iftouch
+      integer iel
+
+      if_refine=.false.
+
+!$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(iel)  &
+!$OMP& SHARED(if_refine)
+      do iel=1,nelt
+        ich(iel)=0
+        if(iftouch(iel)) then
+          if((xc(2,iel)-xc(1,iel)).gt.dlmin) then
+            if(.not.if_refine) if_refine=.true.
+            ich(iel)=4
+          end if
+        end if
+      enddo
+!$OMP END PARALLEL DO
+
+      return
+      end
+
+!-----------------------------------------------------------------
+      subroutine check_refine(ifrepeat)
+!-----------------------------------------------------------------
+!     Check whether the potential refinement will violate the
+!     restriction. If so, mark the neighbor and unmark the
+!     original element, and set ifrepeat to true, i.e. this procedure
+!     needs to be repeated until no further check is needed.
+!-----------------------------------------------------------------
+
+      use ua_data
+      implicit none
+ 
+      logical ifrepeat,ifcor
+      integer iel,iface,ntemp,nntemp,i,jface
+
+      ifrepeat=.false.
+
+!$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(iel,i,jface,ntemp,  &
+!$OMP& iface,nntemp) SHARED(ifrepeat)
+      do iel=1,nelt
+!.......if iel is marked to be refined
+        if(ich(iel).eq.4) then
+!.........check its six faces
+          do i=1,nsides
+            jface=jjface(i)
+            ntemp=sje(1,1,i,iel)
+!...........if one face neighbor is larger in size than iel
+            if(cbc(i,iel).eq.1) then
+!.............unmark iel
+              ich(iel)=0
+!.............the large size neighbor ntemp is marked to be refined
+              if(ich(ntemp).ne.4) then
+                if(.not.ifrepeat) ifrepeat=.true.
+                ich(ntemp)=4
+              end if
+!.............check iel's neighbor, neighbored by an edge on face i, which
+!             must be a face neighbor of ntemp
+              do iface=1,nsides
+                if(iface.ne.i.and.iface.ne.jface) then
+!................if edge neighbors are larger than iel, mark them to be refined
+                  if(cbc(iface,ntemp).eq.2) then
+                    nntemp=sje(1,1,iface,ntemp)
+!..................ifcor is to make sure the edge neighbor exists
+                    if(ich(nntemp).ne.4.and.  &
+     &                 ifcor(iel,nntemp,i,iface))then
+                      ich(nntemp)=4
+                    end if
+                  end if
+                end if
+              end do
+!...........if the face neighbor is the same size as iel, check edge neighbors
+            elseif(cbc(i,iel).eq.2)then
+              do iface=1,nsides
+                if(iface.ne.i.and.iface.ne.jface) then
+                  if(cbc(iface,ntemp).eq.1)then
+                    nntemp=sje(1,1,iface,ntemp)
+                    ich(nntemp)=4
+                    ich(iel)=0
+                    if(.not.ifrepeat) ifrepeat=.true.
+                  end if
+                end if
+              end do
+            end if
+          enddo
+        end if
+      end do
+!$OMP END PARALLEL DO
+
+      return
+      end
+
+!-----------------------------------------------------------------
+      logical function iftouch(iel)
+!-----------------------------------------------------------------
+!     check whether element iel has overlap with the heat source
+!-----------------------------------------------------------------
+
+      use ua_data
+      implicit none
+
+      double precision dis, dis1, dis2, dis3, alpha2
+      integer iel
+
+      alpha2 = alpha*alpha
+
+      if     (x0 .lt. xc(1,iel)) then
+        dis1 = xc(1,iel) - x0
+      elseif (x0 .gt. xc(2,iel)) then
+        dis1 = x0 - xc(2,iel)
+      else
+        dis1 = 0.d0
+      endif
+
+      if     (y0 .lt. yc(1,iel)) then
+        dis2 = yc(1,iel) - y0
+      elseif (y0 .gt. yc(3,iel)) then
+        dis2 = y0 - yc(3,iel)
+      else
+        dis2 = 0.d0
+      endif
+
+      if     (z0 .lt. zc(1,iel)) then
+        dis3 = zc(1,iel) - z0
+      elseif (z0 .gt. zc(5,iel)) then
+        dis3 = z0 - zc(5,iel)
+      else
+       dis3 = 0.d0
+      endif
+
+      dis = dis1**2+dis2**2+dis3**2
+
+      if (dis .lt. alpha2) then
+       iftouch=.true.
+      else
+       iftouch=.false.
+      end if
+
+      return
+      end
+
+
+!-----------------------------------------------------------------
+      subroutine remap (y,y1,x) 
+!-----------------------------------------------------------------
+!     After a refinement, map the solution from the parent (x) to
+!     the eight children. y is the solution on the first child
+!     (front-bottom-left) and y1 is the solution on the next 7 
+!     children.
+!-----------------------------------------------------------------
+
+      use ua_data
+      implicit none
+
+      double precision x(lx1,lx1,lx1),y(lx1,lx1,lx1),y1(lx1,lx1,lx1,7),  &
+     &       yone(lx1,lx1,lx1,2), ytwo(lx1,lx1,lx1,4)
+      integer i, iz, ii, jj, kk
+
+      call r_init(y,lx1*lx1*lx1,0.d0)
+      call r_init(y1,lx1*lx1*lx1*7,0.d0)
+      call r_init(yone,lx1*lx1*lx1*2,0.d0)
+      call r_init(ytwo,lx1*lx1*lx1*4,0.d0)
+
+      do  i=1,lx1
+        do kk = 1, lx1
+          do jj = 1, lx1
+            do ii = 1, lx1
+              yone(ii,jj,i,1) = yone(ii,jj,i,1) +ixmc1(ii,kk)*x(kk,jj,i)
+              yone(ii,jj,i,2) = yone(ii,jj,i,2) +ixmc2(ii,kk)*x(kk,jj,i)
+            end do
+          end do
+        end do
+
+        do kk = 1, lx1
+          do jj = 1, lx1
+            do ii = 1, lx1
+              ytwo(ii,i,jj,1) = ytwo(ii,i,jj,1) +  &
+     &                          yone(ii,kk,i,1)*ixtmc1(kk,jj)
+              ytwo(ii,i,jj,2) = ytwo(ii,i,jj,2) +  &
+     &                          yone(ii,kk,i,1)*ixtmc2(kk,jj)
+              ytwo(ii,i,jj,3) = ytwo(ii,i,jj,3) +  &
+     &                          yone(ii,kk,i,2)*ixtmc1(kk,jj)
+              ytwo(ii,i,jj,4) = ytwo(ii,i,jj,4) +  &
+     &                          yone(ii,kk,i,2)*ixtmc2(kk,jj)
+            end do
+          end do
+        end do
+      end do
+
+      do  iz=1,lx1
+        do kk = 1, lx1
+          do jj = 1, lx1
+            do ii = 1, lx1
+              y(ii,iz,jj) = y(ii,iz,jj) +  &
+     &                        ytwo(ii,kk,iz,1)*ixtmc1(kk,jj)
+              y1(ii,iz,jj,1) = y1(ii,iz,jj,1) +  &
+     &                        ytwo(ii,kk,iz,3)*ixtmc1(kk,jj)
+              y1(ii,iz,jj,2) = y1(ii,iz,jj,2) +  &
+     &                        ytwo(ii,kk,iz,2)*ixtmc1(kk,jj)
+              y1(ii,iz,jj,3) = y1(ii,iz,jj,3) +  &
+     &                        ytwo(ii,kk,iz,4)*ixtmc1(kk,jj)
+              y1(ii,iz,jj,4) = y1(ii,iz,jj,4) +  &
+     &                        ytwo(ii,kk,iz,1)*ixtmc2(kk,jj)
+              y1(ii,iz,jj,5) = y1(ii,iz,jj,5) +  &
+     &                        ytwo(ii,kk,iz,3)*ixtmc2(kk,jj)
+              y1(ii,iz,jj,6) = y1(ii,iz,jj,6) +  &
+     &                        ytwo(ii,kk,iz,2)*ixtmc2(kk,jj)
+              y1(ii,iz,jj,7) = y1(ii,iz,jj,7) +  &
+     &                        ytwo(ii,kk,iz,4)*ixtmc2(kk,jj)            
+            end do
+          end do
+        end do
+      end do
+
+      return
+      end
+
+
+!=======================================================================
+      subroutine merging(iela)
+!-----------------------------------------------------------------------
+!     This subroutine merges the eight child elements and maps
+!     the solution from the eight children to the merged element.
+!     iela array records the eight elements to be merged.
+!-----------------------------------------------------------------------
+
+      use ua_data
+      implicit none
+
+      double precision x1,x2,y1,y2,z1,z2
+      integer ielnew,i,ntemp,jface,ii,cb,ntempa(4),iela(8),ielold,  &
+     &        ntema(4)
+
+      ielnew=iela(1)
+
+      tree(ielnew)=ishft(tree(ielnew),-3)   
+
+!.....element vertices 
+      x1=xc(1,iela(1))
+      x2=xc(2,iela(2))
+      y1=yc(1,iela(1))
+      y2=yc(3,iela(3))
+      z1=zc(1,iela(1))
+      z2=zc(5,iela(5))
+
+      do i=1,7,2
+        xc(i,ielnew)=x1
+      end do
+      do i=2,8,2
+        xc(i,ielnew)=x2
+      end do
+      do i=1,2
+        yc(i,ielnew)=y1
+        yc(i+4,ielnew)=y1
+      end do
+      do i=3,4
+        yc(i,ielnew)=y2
+        yc(i+4,ielnew)=y2
+      end do
+      do i=1,4
+        zc(i,ielnew)=z1
+      end do
+      do i=5,8
+        zc(i,ielnew)=z2
+      end do
+
+!.....update neighboring information
+      do i=1,nsides
+        jface=jjface(i)
+        ielold=iela(children(1,i))
+        do ii=1,4
+          ntempa(ii)=iela(children(ii,i))
+        end do
+
+        cb=cbc(i,ielold)
+       
+        if (cb.eq.2) then
+!.........if the neighbor elements also will be coarsened
+          if(ifcoa_id(sje(1,1,i,ielold)))then
+            if (i.eq.2 .or. i.eq. 4 .or. i.eq.6) then
+              ntemp=sje(1,1,i,sje(1,1,i,ntempa(1)))
+            else
+              ntemp=sje(1,1,i,ntempa(1))
+            end if 
+            sje(1,1,i,ielnew)=ntemp
+            ijel(1,i,ielnew)=1
+            ijel(2,i,ielnew)=1
+            cbc(i,ielnew)=2
+
+!.........if the neighbor elements will not be coarsened
+          else
+            do ii=1,4
+              ntema(ii)=sje(1,1,i,ntempa(ii)) 
+              cbc(jface,ntema(ii))=1
+              sje(1,1,jface,ntema(ii))=ielnew
+              ijel(1,jface,ntema(ii))=iijj(1,ii)
+              ijel(2,jface,ntema(ii))=iijj(2,ii)
+              sje(iijj(1,ii),iijj(2,ii),i,ielnew)=ntema(ii)
+              ijel(1,i,ielnew)=1
+              ijel(2,i,ielnew)=1
+            end do
+            cbc(i,ielnew)=3
+          end if       
+
+        else if(cb.eq.1)then
+
+          ntemp=sje(1,1,i,ielold)
+          cbc(jface,ntemp)=2
+          ijel(1,jface,ntemp)=1
+          ijel(2,jface,ntemp)=1
+          sje(1,1,jface,ntemp)=ielnew
+          sje(1,2,jface,ntemp)=0
+          sje(2,1,jface,ntemp)=0
+          sje(2,2,jface,ntemp)=0
+           
+          cbc(i,ielnew)=2
+          ijel(1,i,ielnew)=1
+          ijel(2,i,ielnew)=1
+          sje(1,1,i,ielnew)=ntemp
+         
+        else if(cb.eq.0)then
+          cbc(i,ielnew)=0
+          sje(1,1,i,ielnew)=0
+          sje(1,2,i,ielnew)=0
+          sje(2,1,i,ielnew)=0
+          sje(2,2,i,ielnew)=0
+        endif
+
+      end do
+
+!.....map solution from children to the merged element
+      call remap2(iela, ielnew)
+      
+      return
+      end
+
+!-----------------------------------------------------------------
+      subroutine remap2(iela, ielnew)
+!-----------------------------------------------------------------
+!     Map the solution from the children to the parent.
+!     iela array records the eight elements to be merged.
+!     ielnew is the element index of the merged element.
+!-----------------------------------------------------------------
+
+      use ua_data
+      implicit none
+
+      integer iela(8), ielnew
+
+      double precision temp1(lx1,lx1,lx1),  &
+     &       temp2(lx1,lx1,lx1),temp3(lx1,lx1,lx1),temp4(lx1,lx1,lx1),  &
+     &       temp5(lx1,lx1,lx1),temp6(lx1,lx1,lx1)
+
+      call remapx(ta1(1,1,1,iela(1)),ta1(1,1,1,iela(2)),temp1)
+      call remapx(ta1(1,1,1,iela(3)),ta1(1,1,1,iela(4)),temp2)
+      call remapx(ta1(1,1,1,iela(5)),ta1(1,1,1,iela(6)),temp3)
+      call remapx(ta1(1,1,1,iela(7)),ta1(1,1,1,iela(8)),temp4)
+      call remapy(temp1,temp2,temp5)
+      call remapy(temp3,temp4,temp6)
+      call remapz(temp5,temp6,ta1(1,1,1,ielnew))
+
+      return
+      end       
+
+!-----------------------------------------------------------------
+      subroutine remapz(x1,x2,y)
+!-----------------------------------------------------------------
+!     z direction mapping after the merge.
+!     Map solution from x1 & x2 to y.
+!-----------------------------------------------------------------
+
+      use ua_data
+      implicit none
+
+      double precision x1(lx1,lx1,lx1),x2(lx1,lx1,lx1),y(lx1,lx1,lx1)
+      integer ix, iy, ip
+
+      do iy=1,lx1
+        do ix=1,lx1
+          y(ix,iy,1)=x1(ix,iy,1)
+
+          y(ix,iy,2)=0.d0
+          do ip=1,lx1
+            y(ix,iy,2)=y(ix,iy,2)+map2(ip)*x1(ix,iy,ip)
+          end do
+
+          y(ix,iy,3)=x1(ix,iy,lx1)
+
+          y(ix,iy,4)=0.d0
+          do ip=1,lx1
+            y(ix,iy,4)=y(ix,iy,4)+map4(ip)*x2(ix,iy,ip)
+          end do
+
+          y(ix,iy,lx1)=x2(ix,iy,lx1)
+        end do
+      end do
+
+      return
+      end      
+
+!-----------------------------------------------------------------
+      subroutine remapy(x1,x2,y)
+!-----------------------------------------------------------------
+!     y direction mapping after the merge.
+!     Map solution from x1 & x2 to y.
+!-----------------------------------------------------------------
+
+      use ua_data
+      implicit none
+
+      double precision x1(lx1,lx1,lx1),x2(lx1,lx1,lx1),y(lx1,lx1,lx1)
+      integer ix, iz, ip
+
+      do iz=1,lx1
+        do ix=1,lx1
+          y(ix,1,iz)=x1(ix,1,iz)
+
+          y(ix,2,iz)=0.d0
+          do ip=1,lx1
+            y(ix,2,iz)=y(ix,2,iz)+map2(ip)*x1(ix,ip,iz)
+          end do
+
+          y(ix,3,iz)=x1(ix,lx1,iz)
+
+          y(ix,4,iz)=0.d0
+          do ip=1,lx1
+            y(ix,4,iz)=y(ix,4,iz)+map4(ip)*x2(ix,ip,iz)
+          end do
+
+          y(ix,lx1,iz)=x2(ix,lx1,iz)
+        end do
+      end do
+
+      return
+      end      
+
+!-----------------------------------------------------------------
+      subroutine remapx(x1,x2,y)
+!-----------------------------------------------------------------
+!     x direction mapping after the merge.
+!     Map solution from x1 & x2 to y.
+!-----------------------------------------------------------------
+
+      use ua_data
+      implicit none
+
+      double precision x1(lx1,lx1,lx1),x2(lx1,lx1,lx1),y(lx1,lx1,lx1)
+      integer iy, iz, ip
+
+      do iz=1,lx1
+        do iy=1,lx1
+          y(1,iy,iz)=x1(1,iy,iz)
+
+          y(2,iy,iz)=0.d0
+          do ip=1,lx1
+            y(2,iy,iz)=y(2,iy,iz)+map2(ip)*x1(ip,iy,iz)
+          end do
+
+          y(3,iy,iz)=x1(lx1,iy,iz)
+
+          y(4,iy,iz)=0.d0
+          do ip=1,lx1
+            y(4,iy,iz)=y(4,iy,iz)+map4(ip)*x2(ip,iy,iz)
+          end do
+
+          y(lx1,iy,iz)=x2(lx1,iy,iz)
+        end do
+      end do
+
+      return
+      end      
+       
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/UA/convect.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/UA/convect.f90
new file mode 100644
index 000000000..1bd498c52
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/UA/convect.f90
@@ -0,0 +1,227 @@
+!---------------------------------------------------------
+      subroutine convect(ifmortar)  
+!---------------------------------------------------------
+!     Advance the convection term using 4th order RK
+!     1.ta1 is solution from last time step 
+!     2.the heat source is considered part of d/dx
+!     3.trhs is right hand side for the diffusion equation
+!     4.tmort is solution on mortar points, which will be used
+!       as the initial guess when advancing the diffusion term 
+!---------------------------------------------------------
+
+      use ua_data
+      implicit none
+
+      double precision alpha2, tempa(lx1,lx1,lx1),  &
+     &       rdtime, pidivalpha, sixth,  &
+     &       dtx1, dtx2, dtx3, src, rk1(lx1,lx1,lx1), rk2(lx1,lx1,lx1),  &
+     &       rk3(lx1,lx1,lx1), rk4(lx1,lx1,lx1), temp(lx1,lx1,lx1),  &
+     &       subtime(3), xx0(3), yy0(3), zz0(3), dtime2, r2, sum,  &
+     &       xloc(lx1), yloc(lx1), zloc(lx1)
+      integer k,iel,i,j,iside,isize, substep, ip
+      logical ifmortar
+      parameter (sixth=1.d0/6.d0)
+
+      if (timeron) call timer_start(t_convect)
+      pidivalpha = dacos(-1.d0)/alpha
+      alpha2     = alpha*alpha
+      dtime2     = dtime/2.d0 
+      rdtime     = 1.d0/dtime
+      subtime(1) = time
+      subtime(2) = time+dtime2
+      subtime(3) = time+dtime
+      do substep = 1, 3
+        xx0(substep) = x00+velx*subtime(substep)
+        yy0(substep) = y00+vely*subtime(substep)
+        zz0(substep) = z00+velz*subtime(substep)
+      end do
+
+!$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(rk4,rk3,rk2,temp,rk1,dtx3,  &
+!$OMP& dtx2,dtx1,iside,ip,sum,src,r2,i,j,k,isize,iel,tempa,  &
+!$OMP& xloc,yloc,zloc)
+
+      do iel = 1, nelt
+        isize=size_e(iel)
+!.......xloc(i) is the location of the i'th collocation point in x direction in an element.
+!       yloc(j) is the location of the j'th collocation point in y direction in an element.
+!       zloc(k) is the location of the k'th collocation point in z direction in an element.
+        do i = 1, lx1
+          xloc(i) = xfrac(i)*(xc(2,iel)-xc(1,iel))+xc(1,iel)
+        end do
+        do j = 1, lx1
+          yloc(j) = xfrac(j)*(yc(4,iel)-yc(1,iel))+yc(1,iel)
+        end do
+        do k = 1, lx1
+          zloc(k) = xfrac(k)*(zc(5,iel)-zc(1,iel))+zc(1,iel)
+        end do
+
+        do k = 1, lx1
+          do j = 1, lx1
+            do i = 1, lx1
+              r2 = (xloc(i)-xx0(1))**2+(yloc(j)-yy0(1))**2+  &
+     &             (zloc(k)-zz0(1))**2
+              if (r2.le.alpha2) then
+                src = dcos(dsqrt(r2)*pidivalpha)+1.d0
+              else
+                src = 0.d0
+              endif
+              sum = 0.d0
+              do ip = 1, lx1
+                sum = sum + dxm1(i,ip) * ta1(ip,j,k,iel)
+              end do
+              dtx1 = -velx*sum*xrm1_s(i,j,k,isize)
+              sum  = 0.d0
+              do ip = 1, lx1
+                sum = sum + dxm1(j,ip) * ta1(i,ip,k,iel)
+              end do
+              dtx2=-vely*sum*xrm1_s(i,j,k,isize)
+              sum = 0.d0
+              do ip = 1, lx1
+                sum = sum + dxm1(k,ip) * ta1(i,j,ip,iel)
+              end do
+              dtx3=-velz*sum*xrm1_s(i,j,k,isize)
+
+              rk1(i,j,k)= dtx1 + dtx2 + dtx3 + src
+              temp(i,j,k)=ta1(i,j,k,iel)+dtime2*rk1(i,j,k)
+
+            end do
+          end do
+        end do        
+
+        do k = 1, lx1
+          do j = 1, lx1
+            do i = 1, lx1
+              r2 = (xloc(i)-xx0(2))**2 + (yloc(j)-yy0(2))**2 +  &
+     &             (zloc(k)-zz0(2))**2
+              if (r2.le.alpha2) then
+                src = dcos(dsqrt(r2)*pidivalpha)+1.d0
+              else
+                src = 0.d0
+              endif
+              sum = 0.d0
+              do ip = 1, lx1
+                sum = sum + dxm1(i,ip) * temp(ip,j,k)
+              end do
+              dtx1 =-velx*sum*xrm1_s(i,j,k,isize)
+              sum = 0.d0
+              do ip = 1, lx1
+                sum = sum + dxm1(j,ip) * temp(i,ip,k)
+              end do
+              dtx2 =-vely*sum*xrm1_s(i,j,k,isize)
+              sum = 0.d0
+              do ip = 1, lx1
+                sum = sum + dxm1(k,ip) * temp(i,j,ip)
+              end do
+              dtx3 =-velz*sum*xrm1_s(i,j,k,isize)
+
+              rk2(i,j,k)= dtx1 + dtx2 + dtx3 + src
+              tempa(i,j,k)=ta1(i,j,k,iel)+dtime2*rk2(i,j,k)
+            end do
+          end do
+        end do        
+
+        do k = 1, lx1
+          do j = 1, lx1
+            do i = 1, lx1
+              r2 = (xloc(i)-xx0(2))**2 + (yloc(j)-yy0(2))**2 +  &
+     &             (zloc(k)-zz0(2))**2
+              if (r2.le.alpha2) then
+                src = dcos(dsqrt(r2)*pidivalpha)+1.d0
+              else
+                src = 0.d0
+              endif
+              sum = 0.d0
+              do ip = 1, lx1
+                sum = sum + dxm1(i,ip) * tempa(ip,j,k)
+              end do
+              dtx1 =-velx*sum*xrm1_s(i,j,k,isize)
+              sum = 0.d0
+              do ip = 1, lx1
+                sum = sum + dxm1(j,ip) * tempa(i,ip,k)
+              end do
+              dtx2 =-vely*sum*xrm1_s(i,j,k,isize)
+              sum = 0.d0
+              do ip = 1, lx1
+                sum = sum + dxm1(k,ip) * tempa(i,j,ip)
+              end do
+              dtx3 =-velz*sum*xrm1_s(i,j,k,isize)
+
+              rk3(i,j,k)= dtx1 + dtx2 + dtx3 + src
+              temp(i,j,k)=ta1(i,j,k,iel)+dtime*rk3(i,j,k)
+            end do
+          end do
+        end do        
+
+        do k = 1, lx1
+          do j = 1, lx1
+            do i = 1, lx1
+              r2 = (xloc(i)-xx0(3))**2 + (yloc(j)-yy0(3))**2 +  &
+     &             (zloc(k)-zz0(3))**2
+              if (r2.le.alpha2) then
+                src = dcos(dsqrt(r2)*pidivalpha)+1.d0
+              else
+                src = 0.d0
+              endif
+              sum = 0.d0
+              do ip = 1, lx1
+                sum = sum + dxm1(i,ip) * temp(ip,j,k)
+              end do
+              dtx1 =-velx*sum*xrm1_s(i,j,k,isize)
+              sum = 0.d0
+              do ip = 1, lx1
+                sum = sum + dxm1(j,ip) * temp(i,ip,k)
+              end do
+              dtx2 =-vely*sum*xrm1_s(i,j,k,isize)
+              sum = 0.d0
+              do ip = 1, lx1
+                sum = sum + dxm1(k,ip) * temp(i,j,ip)
+              end do
+              dtx3 =-velz*sum*xrm1_s(i,j,k,isize)
+
+              rk4(i,j,k)= dtx1 + dtx2 + dtx3 + src
+              tempa(i,j,k)=sixth*(rk1(i,j,k)+2.d0*  &
+     &                   rk2(i,j,k)+2.d0*rk3(i,j,k)+rk4(i,j,k))
+            end do
+          end do
+        end do        
+
+!.......apply boundary condition
+        do iside=1,nsides
+          if(cbc(iside,iel).eq.0)then
+            call facev(tempa,iside,0.0d0)
+          end if
+        end do
+          
+        do k=1,lx1
+          do j=1,lx1
+            do i=1,lx1
+              trhs(i,j,k,iel)=bm1_s(i,j,k,isize)*(ta1(i,j,k,iel)*rdtime+  &
+     &                        tempa(i,j,k))
+              ta1(i,j,k,iel)=ta1(i,j,k,iel)+tempa(i,j,k)*dtime
+            end do
+          end do
+        end do
+
+      end do 
+!$OMP END PARALLEL DO
+
+!.....get mortar for initial guess for CG
+
+      if (timeron) call timer_start(t_transfb_c)
+      if(ifmortar)then
+        call transfb_c_2(ta1)
+      else
+        call transfb_c(ta1)
+      end if
+      if (timeron) call timer_stop(t_transfb_c)
+
+!$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(i)
+      do i=1,nmor
+       tmort(i)=tmort(i)/mormult(i)
+      end do
+!$OMP END PARALLEL DO
+      if (timeron) call timer_stop(t_convect)
+
+      return
+      end
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/UA/diffuse.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/UA/diffuse.f90
new file mode 100644
index 000000000..5beda2829
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/UA/diffuse.f90
@@ -0,0 +1,246 @@
+!---------------------------------------------------------------------
+      subroutine diffusion(ifmortar)      
+!---------------------------------------------------------------------
+!     advance the diffusion term using CG iterations
+!---------------------------------------------------------------------
+
+      use ua_data
+      implicit none
+
+      double precision  rho_aux, rho1, rho2, beta, cona
+      logical ifmortar
+      integer iter,ie, im,iside,i,j,k
+
+      if (timeron) call timer_start(t_diffusion)
+!.....set up diagonal preconditioner
+      if (ifmortar) then
+        call setuppc
+        call setpcmo
+      end if
+
+!.....arrays t and umor are accumulators of (am pm) in the CG algorithm
+!     (see the specification)
+
+      call r_init_omp(t,ntot,0.d0)
+!$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(i)
+      do i=1,nmor
+        umor(i)=0.d0
+      end do
+!$OMP END PARALLEL DO
+
+!.....calculate initial am (see specification) in CG algorithm
+
+!.....trhs and rmor are combined to generate r0 in CG algorithm.
+!     pdiff and pmorx are combined to generate q0 in the CG algorithm.
+!     rho1 is  (qm,rm) in the CG algorithm.
+
+      rho1 = 0.d0
+!$OMP PARALLEL DEFAULT(SHARED) PRIVATE(im,ie,i,j,k) REDUCTION(+:rho1)
+!$OMP DO
+       do ie=1,nelt
+         do k=1,lx1
+           do j=1,lx1
+             do i=1,lx1
+               pdiff(i,j,k,ie) = dpcelm(i,j,k,ie)*trhs(i,j,k,ie)
+               rho1            = rho1 + trhs(i,j,k,ie)*pdiff(i,j,k,ie)*  &
+     &                                          tmult(i,j,k,ie)
+             end do
+           end do
+         end do
+       end do
+!$OMP END DO nowait
+
+!$OMP DO
+      do im = 1, nmor
+        pmorx(im) = dpcmor(im)*rmor(im)
+        rho1      = rho1 + rmor(im)*pmorx(im)
+      end do
+!$OMP END DO nowait
+!$OMP END PARALLEL
+
+!.................................................................
+!     commence conjugate gradient iteration
+!.................................................................
+
+      do iter=1, nmxh
+        if(iter.gt.1) then 
+          rho_aux = 0.d0
+!$OMP PARALLEL DEFAULT(SHARED) PRIVATE(im,ie,i,j,k) REDUCTION(+:rho_aux)
+!$OMP DO
+!.........pdiffp and ppmor are combined to generate q_m+1 in the specification
+!         rho_aux is (q_m+1,r_m+1)
+          do ie = 1, nelt
+            do k=1,lx1
+              do j=1,lx1
+                do i=1,lx1
+                  pdiffp(i,j,k,ie) = dpcelm(i,j,k,ie)*trhs(i,j,k,ie)
+                  rho_aux =rho_aux+trhs(i,j,k,ie)*pdiffp(i,j,k,ie)*  &
+     &                                            tmult(i,j,k,ie)
+                end do
+              end do
+            end do
+          end do
+!$OMP END DO nowait
+!$OMP DO
+          do im = 1, nmor
+            ppmor(im) = dpcmor(im)*rmor(im)
+            rho_aux = rho_aux + rmor(im)*ppmor(im)
+          end do
+!$OMP END DO nowait
+!$OMP END PARALLEL
+
+!.........compute bm (beta) in the specification
+          rho2 = rho1
+          rho1 = rho_aux
+          beta = rho1/rho2
+!.........update p_m+1 in the specification
+          call adds1m1(pdiff, pdiffp, beta,ntot)
+          call adds1m1(pmorx, ppmor,  beta, nmor)  
+        end if
+ 
+!.......compute matrix vector product: (theta pm) in the specification
+
+        if (timeron) call timer_start(t_transf)
+        call transf(pmorx,pdiff) 
+        if (timeron) call timer_stop(t_transf)
+
+!.......compute pdiffp which is (A theta pm) in the specification
+!$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(ie) 
+        do ie=1, nelt
+          call laplacian(pdiffp(1,1,1,ie),pdiff(1,1,1,ie),size_e(ie))
+        end do
+!$OMP END PARALLEL DO
+
+!.......compute ppmor which will be used to compute (thetaT A theta pm) 
+!       in the specification
+        if (timeron) call timer_start(t_transfb)
+        call transfb(ppmor,pdiffp) 
+        if (timeron) call timer_stop(t_transfb)
+ 
+!.......apply boundary condition
+!$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(ie,iside)
+        do ie=1,nelt
+          do iside=1,nsides
+            if(cbc(iside,ie).eq.0)then
+              call facev(pdiffp(1,1,1,ie),iside,0.d0)
+            end if
+          end do
+        end do
+!$OMP END PARALLEL DO
+
+!.......compute cona which is (pm,theta T A theta pm)
+        cona = 0.d0
+!$OMP PARALLEL DEFAULT(SHARED) PRIVATE(im,ie,i,j,k) REDUCTION(+:cona)
+!$OMP DO
+        do ie = 1, nelt
+          do k=1,lx1
+            do j=1,lx1
+              do i=1,lx1
+                cona = cona +  &
+     &          pdiff(i,j,k,ie)*pdiffp(i,j,k,ie)*tmult(i,j,k,ie)
+              end do 
+             end do 
+          end do 
+        end do 
+!$OMP END DO nowait
+!$OMP DO
+        do im = 1, nmor
+          ppmor(im) = ppmor(im)*tmmor(im)
+          cona = cona + pmorx(im)*ppmor(im)
+        end do
+!$OMP END DO nowait
+!$OMP END PARALLEL
+
+!.......compute am
+        cona = rho1/cona
+!.......compute (am pm)
+        call adds2m1(t,    pdiff,   cona, ntot)
+        call adds2m1(umor, pmorx,   cona, nmor) 
+!.......compute r_m+1
+        call adds2m1(trhs, pdiffp, -cona, ntot)
+        call adds2m1(rmor, ppmor,  -cona, nmor) 
+ 
+      end do
+
+      if (timeron) call timer_start(t_transf)
+      call transf(umor,t)  
+      if (timeron) call timer_stop(t_transf)
+      if (timeron) call timer_stop(t_diffusion)
+
+      return
+      end
+
+
+!------------------------------------------------------------------
+      subroutine laplacian(r,u,sizei)
+!------------------------------------------------------------------
+!     compute  r = visc*[A]x +[B]x on a given element.
+!------------------------------------------------------------------
+
+      use ua_data
+      implicit none
+
+      double precision r(lx1,lx1,lx1), u(lx1,lx1,lx1), rdtime
+      integer i,j,k, ix,iz, sizei
+
+      double precision tm1(lx1,lx1,lx1),tm2(lx1,lx1,lx1)                     
+
+      rdtime = 1.d0/dtime
+
+      call r_init(tm1,nxyz,0.d0)
+      do iz=1,lx1                     
+        do k = 1, lx1
+          do j = 1, lx1
+            do i = 1, lx1
+              tm1(i,j,iz) = tm1(i,j,iz)+wdtdr(i,k)*u(k,j,iz)
+            end do
+          end do
+        end do                           
+      end do
+              
+      call r_init(tm2,nxyz,0.d0)                                                   
+      do iz=1,lx1                                            
+        do k = 1, lx1
+          do j = 1, lx1
+            do i = 1, lx1
+              tm2(i,j,iz) = tm2(i,j,iz)+u(i,k,iz)*wdtdr(k,j)
+            end do
+          end do
+        end do
+      end do
+                                                            
+      call r_init(r,nxyz,0.d0)   
+      do k = 1, lx1
+        do iz=1, lx1    
+          do j = 1, lx1
+            do i = 1, lx1
+              r(i,j,iz) = r(i,j,iz)+u(i,j,k)*wdtdr(k,iz)
+            end do
+          end do
+        end do
+      end do
+
+!.....collocate with remaining weights and sum to complete factorization.                   
+                                                      
+!      do ix=1,nxyz                                            
+!         r(ix,1,1)=visc*(tm1(ix,1,1)*g4m1_s(ix,1,1,sizei)+
+!     &                   tm2(ix,1,1)*g5m1_s(ix,1,1,sizei)+
+!     &                     r(ix,1,1)*g6m1_s(ix,1,1,sizei))+
+!     &               bm1_s(ix,1,1,sizei)*rdtime*u(ix,1,1)             
+!      end do
+      do k=1,lx1
+        do j=1,lx1
+          do i=1,lx1
+            r(i,j,k)=visc*(tm1(i,j,k)*g4m1_s(i,j,k,sizei)+  &
+     &                   tm2(i,j,k)*g5m1_s(i,j,k,sizei)+  &
+     &                    r(i,j,k)*g6m1_s(i,j,k,sizei))+  &
+     &               bm1_s(i,j,k,sizei)*rdtime*u(i,j,k)             
+          end do
+        end do
+      end do
+
+      return                                                  
+      end                                                    
+
+
+ 
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/UA/mason.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/UA/mason.f90
new file mode 100644
index 000000000..d03bc47d1
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/UA/mason.f90
@@ -0,0 +1,2281 @@
+!-----------------------------------------------------------------
+      subroutine mortar
+!-----------------------------------------------------------------
+!     generate mortar point index number 
+!-----------------------------------------------------------------
+
+      use ua_data
+      implicit none
+
+      integer count, iel, jface, ntemp, i, ii, jj, ntemp1,  &
+     &        iii, jjj, face2, ne, ie, edge_g, ie2,  &
+     &        mor_v(3), cb, cb1, cb2, cb3, cb4, cb5, cb6,  &
+     &        space, sumcb, ij1, ij2, n1, n2, n3, n4, n5
+
+      n1=lx1*lx1*6*4*nelt
+      n2=8*nelt
+      n3=2*64*nelt
+      n4=12*nelt
+      n5=2*12*nelt
+
+      call nr_init_omp(idmo,n1,0)
+      call nr_init_omp(nemo,n2,0)
+      call nr_init_omp(vassign,n2,0)
+      call nr_init_omp(emo,n3,0)
+      call  l_init_omp(if_1_edge,n4,.false.)
+      call nr_init_omp(diagn,n5,0)
+!.....Mortar point indices are generated in two steps: first generate 
+!     them for all element vertices (corner points), then for conforming 
+!     edge and conforming face interiors. Each time a new mortar index 
+!     is generated for a mortar point, it is broadcast to all elements 
+!     sharing this mortar point. 
+
+!.....VERTICES
+      count=0
+
+!.....assign mortar point indices to element vertices
+!$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(iel,sumcb,ij1,ij2,  &
+!$OMP& cb,cb1,cb2,ntemp,ntemp1)
+
+      do iel=1,nelt
+
+!.......first calculate how many new mortar indices will be generated for 
+!       each element.
+
+!.......For each element, at least one vertex (vertex 8) will be a new mortar
+!       point. All possible new mortar points will be on face 2,4 or 6. By
+!       checking the type of these three faces, we are able to tell
+!       how many new mortar vertex points will be generated in each element.
+
+        cb=cbc(6,iel)
+        cb1=cbc(4,iel)
+        cb2=cbc(2,iel)
+
+!.......For different combinations of the type of these three faces,
+!       we group them into 27 configurations.
+!       For different face types we assign the following integers:
+!              1 for type 2 or 3
+!              2 for type 0
+!              5 for type 1
+!       By summing these integers for faces 2,4 and 6, sumcb will have 
+!       10 different numbers indicating 10 different combinations. 
+
+        sumcb=0
+        if(cb.eq.2.or.cb.eq.3)then
+          sumcb=sumcb+1
+        elseif(cb.eq.0)then
+          sumcb=sumcb+2
+        elseif(cb.eq.1)then
+          sumcb=sumcb+5
+        end if
+        if(cb1.eq.2.or.cb1.eq.3)then
+          sumcb=sumcb+1
+        elseif(cb1.eq.0)then
+          sumcb=sumcb+2
+        elseif(cb1.eq.1)then
+          sumcb=sumcb+5
+        end if
+        if(cb2.eq.2.or.cb2.eq.3)then
+          sumcb=sumcb+1
+        elseif(cb2.eq.0)then
+          sumcb=sumcb+2
+        elseif(cb2.eq.1)then
+          sumcb=sumcb+5
+        end if
+
+!.......compute newc(iel)
+!       newc(iel) records how many new mortar indices will be generated
+!                 for element iel
+!       vassign(i,iel) records the element vertex of the i'th new mortar 
+!                 vertex point for element iel. e.g. vassign(2,iel)=8 means
+!                 the 2nd new mortar vertex point generated on element
+!                 iel is iel's 8th vertex.
+ 
+        if(sumcb.eq.3)then
+!.......the three face types for face 2,4, and 6 are 2 2 2
+          newc(iel)=1
+          vassign(1,iel)=8
+          
+        elseif(sumcb.eq.4)then
+!.......the three face types for face 2,4 and 6 are 2 2 0 (not 
+!       necessarily in this order)
+          newc(iel)=2
+          if(cb.eq.0)then
+            vassign(1,iel)=4
+          elseif(cb1.eq.0)then
+            vassign(1,iel)=6
+          elseif(cb2.eq.0)then
+            vassign(1,iel)=7
+          end if
+          vassign(2,iel)=8
+
+        elseif(sumcb.eq.7)then
+!.......the three face types for face 2,4 and 6 are 2 2 1 (not 
+!       necessarily in this order)
+          if(cb.eq.1)then
+            ij1=ijel(1,6,iel)
+            ij2=ijel(2,6,iel)
+            if(ij1.eq.1.and.ij2.eq.1)then
+              newc(iel)=2
+              vassign(1,iel)=4
+              vassign(2,iel)=8
+            elseif(ij1.eq.1.and.ij2.eq.2)then
+              ntemp=sje(1,1,6,iel)
+              if(cbc(1,ntemp).eq.3.and.sje(1,1,1,ntemp).lt.iel)then
+                newc(iel)=1
+                vassign(1,iel)=8
+              else
+                newc(iel)=2
+                vassign(1,iel)=4
+                vassign(2,iel)=8
+              end if
+            elseif(ij1.eq.2.and.ij2.eq.1)then
+              ntemp=sje(1,1,6,iel)
+              if(cbc(3,ntemp).eq.3.and.sje(1,1,3,ntemp).lt.iel)then
+                newc(iel)=1
+                vassign(1,iel)=8
+              else
+                newc(iel)=2
+                vassign(1,iel)=4
+                vassign(2,iel)=8
+              endif
+            else
+              newc(iel)=1
+              vassign(1,iel)=8
+            end if
+          elseif(cb1.eq.1)then
+            ij1=ijel(1,4,iel)
+            ij2=ijel(2,4,iel)
+            if(ij1.eq.1.and.ij2.eq.1)then
+              newc(iel)=2
+              vassign(1,iel)=6
+              vassign(2,iel)=8
+            elseif(ij1.eq.1.and.ij2.eq.2)then
+              ntemp=sje(1,1,4,iel)
+              if(cbc(1,ntemp).eq.3.and.sje(1,1,1,ntemp).lt.iel)then
+                newc(iel)=1
+                vassign(1,iel)=8
+              else
+                newc(iel)=2
+                vassign(1,iel)=6
+                vassign(2,iel)=8
+              endif
+            elseif(ij1.eq.2.and.ij2.eq.1)then
+              ntemp=sje(1,1,4,iel)
+              if(cbc(5,ntemp).eq.3.and.sje(1,1,5,ntemp).lt.iel)then
+                newc(iel)=1
+                vassign(1,iel)=8
+              else
+                newc(iel)=2
+                vassign(1,iel)=6
+                vassign(2,iel)=8
+              endif
+            else
+              newc(iel)=1
+              vassign(1,iel)=8
+            end if
+
+          elseif(cb2.eq.1)then
+            ij1=ijel(1,2,iel)
+            ij2=ijel(2,2,iel)
+            if(ij1.eq.1.and.ij2.eq.1)then
+              newc(iel)=2
+              vassign(1,iel)=7
+              vassign(2,iel)=8
+            elseif(ij1.eq.1.and.ij2.eq.2)then
+              ntemp=sje(1,1,2,iel)
+              if(cbc(3,ntemp).eq.3.and.sje(1,1,3,ntemp).lt.iel)then
+                newc(iel)=1
+                vassign(1,iel)=8
+              else
+                newc(iel)=2
+                vassign(1,iel)=7
+                vassign(2,iel)=8
+              end if
+
+            elseif(ij1.eq.2.and.ij2.eq.1)then
+              ntemp=sje(1,1,2,iel)
+              if(cbc(5,ntemp).eq.3.and.sje(1,1,5,ntemp).lt.iel)then
+                newc(iel)=1
+                vassign(1,iel)=8
+              else
+                newc(iel)=2
+                vassign(1,iel)=7
+                vassign(2,iel)=8
+              end if
+            else
+              newc(iel)=1
+              vassign(1,iel)=8
+            end if
+          end if
+
+        elseif(sumcb.eq.5)then
+!.......the three face types for face 2,4 and 6 are 2/3 0 0 (not 
+!       necessarily in this order)
+          newc(iel)=4
+          if(cb.eq.2.or.cb.eq.3)then
+            vassign(1,iel)=5
+            vassign(2,iel)=6
+            vassign(3,iel)=7
+            vassign(4,iel)=8
+          elseif(cb1.eq.2.or.cb1.eq.3)then
+            vassign(1,iel)=3
+            vassign(2,iel)=4
+            vassign(3,iel)=7
+            vassign(4,iel)=8
+          elseif(cb2.eq.2.or.cb2.eq.3)then
+            vassign(1,iel)=2
+            vassign(2,iel)=4
+            vassign(3,iel)=6
+            vassign(4,iel)=8
+          end if
+
+        elseif(sumcb.eq.8)then
+!.......the three face types for face 2,4 and 6 are 2 0 1 (not 
+!       necessarily in this order)
+
+!.........if face 6 of type 1
+          if(cb.eq.1)then
+            if(cb1.eq.2.or.cb1.eq.3)then
+              ij1=ijel(1,6,iel)
+              if(ij1.eq.1)then
+                newc(iel)=4
+                vassign(1,iel)=3
+                vassign(2,iel)=4
+                vassign(3,iel)=7
+                vassign(4,iel)=8
+              else 
+                ntemp=sje(1,1,6,iel)
+                if(cbc(3,ntemp).eq.3.and.sje(1,1,3,ntemp).lt.iel)then
+                  newc(iel)=2
+                  vassign(1,iel)=7
+                  vassign(2,iel)=8
+                else
+                  newc(iel)=3
+                  vassign(1,iel)=4
+                  vassign(2,iel)=7
+                  vassign(3,iel)=8
+                end if
+              end if
+
+            elseif(cb2.eq.2.or.cb2.eq.3)then
+              if(ijel(2,6,iel).eq.1)then
+                newc(iel)=4
+                vassign(1,iel)=2
+                vassign(2,iel)=4
+                vassign(3,iel)=6
+                vassign(4,iel)=8
+              else
+                ntemp=sje(1,1,6,iel)
+                if(cbc(1,ntemp).eq.3.and.sje(1,1,1,ntemp).lt.iel)then
+                  newc(iel)=2
+                  vassign(1,iel)=6
+                  vassign(2,iel)=8
+                else
+                  newc(iel)=3
+                  vassign(1,iel)=4
+                  vassign(2,iel)=6
+                  vassign(3,iel)=8
+                end if
+              end if
+            end if
+
+!.........if face 4 of type 1
+          elseif(cb1.eq.1)then
+            if(cb.eq.2.or.cb.eq.3)then
+              ij1=ijel(1,4,iel)
+              ij2=ijel(2,4,iel)
+
+              if(ij1.eq.1.and.ij2.eq.1)then
+                ntemp=sje(1,1,4,iel)
+                if(cbc(2,ntemp).eq.3.and.sje(1,1,2,ntemp).lt.iel)then
+                  newc(iel)=3
+                  vassign(1,iel)=6
+                  vassign(2,iel)=7
+                  vassign(3,iel)=8
+                else
+                  newc(iel)=4
+                  vassign(1,iel)=5
+                  vassign(2,iel)=6
+                  vassign(3,iel)=7
+                  vassign(4,iel)=8
+                end if
+              elseif(ij1.eq.1.and.ij2.eq.2)then
+                ntemp=sje(1,1,4,iel)
+                if(cbc(1,ntemp).eq.3.and.sje(1,1,1,ntemp).lt.iel)then
+                  newc(iel)=3
+                  vassign(1,iel)=5
+                  vassign(2,iel)=7
+                  vassign(3,iel)=8
+                else
+                  newc(iel)=4
+                  vassign(1,iel)=5
+                  vassign(2,iel)=6
+                  vassign(3,iel)=7
+                  vassign(4,iel)=8
+                end if
+              elseif(ij1.eq.2.and.ij2.eq.1)then
+                ntemp=sje(1,1,4,iel)
+                if(cbc(5,ntemp).eq.3.and.sje(1,1,5,ntemp).lt.iel)then
+                  newc(iel)=2
+                  vassign(1,iel)=7
+                  vassign(2,iel)=8
+                else
+                  newc(iel)=3
+                  vassign(1,iel)=6
+                  vassign(2,iel)=7
+                  vassign(3,iel)=8
+                end if
+              elseif(ij1.eq.2.and.ij2.eq.2)then
+                ntemp=sje(1,1,4,iel)
+                if(cbc(5,ntemp).eq.3.and.sje(1,1,5,ntemp).lt.iel)then
+                  newc(iel)=2
+                  vassign(1,iel)=7
+                  vassign(2,iel)=8
+                else
+                  newc(iel)=3
+                  vassign(1,iel)=5
+                  vassign(2,iel)=7
+                  vassign(3,iel)=8
+                end if
+              end if
+            else 
+              if(ijel(2,4,iel).eq.1)then
+                newc(iel)=4
+                vassign(1,iel)=2
+                vassign(2,iel)=4
+                vassign(3,iel)=6
+                vassign(4,iel)=8
+              else
+                ntemp=sje(1,1,4,iel)
+                if(cbc(1,ntemp).eq.3.and.sje(1,1,1,ntemp).lt.iel)then
+                  newc(iel)=2
+                  vassign(1,iel)=4
+                  vassign(2,iel)=8
+                else
+                  newc(iel)=3
+                  vassign(1,iel)=4
+                  vassign(2,iel)=6
+                  vassign(3,iel)=8
+                end if
+              end if
+            endif
+!.........if face 2 of type 1
+          elseif(cb2.eq.1)then
+            if(cb.eq.2.or.cb.eq.3)then
+              if(ijel(1,2,iel).eq.1)then
+                newc(iel)=4
+                vassign(1,iel)=5
+                vassign(2,iel)=6
+                vassign(3,iel)=7
+                vassign(4,iel)=8
+              else
+                ntemp=sje(1,1,2,iel)
+                if(cbc(5,ntemp).eq.3.and.sje(1,1,5,ntemp).lt.iel)then
+                  newc(iel)=2
+                  vassign(1,iel)=6
+                  vassign(2,iel)=8
+                else
+                  newc(iel)=3
+                  vassign(1,iel)=6
+                  vassign(2,iel)=7
+                  vassign(3,iel)=8
+                end if
+              end if
+            else 
+              if(ijel(2,2,iel).eq.1)then
+                newc(iel)=4
+                vassign(1,iel)=3
+                vassign(2,iel)=4
+                vassign(3,iel)=7
+                vassign(4,iel)=8
+              else
+                ntemp=sje(1,1,2,iel)
+                if(cbc(3,ntemp).eq.3.and.sje(1,1,3,ntemp).lt.iel)then
+                  newc(iel)=2
+                  vassign(1,iel)=4
+                  vassign(2,iel)=8
+                else
+                  newc(iel)=3
+                  vassign(1,iel)=4
+                  vassign(2,iel)=7
+                  vassign(3,iel)=8
+                end if
+              end if
+            end if
+          end if
+
+        elseif(sumcb.eq.11)then
+!.......the three face types for face 2,4 and 6 are 2 1 1 (not 
+!       necessarily in this order)
+          if(cb.eq.2.or.cb.eq.3)then
+            if(ijel(1,4,iel).eq.1)then
+              ntemp=sje(1,1,4,iel)
+              if(cbc(2,ntemp).eq.3.and.sje(1,1,2,ntemp).lt.iel)then
+                newc(iel)=3
+                vassign(1,iel)=6
+                vassign(2,iel)=7
+                vassign(3,iel)=8
+              else
+                newc(iel)=4
+                vassign(1,iel)=5
+                vassign(2,iel)=6
+                vassign(3,iel)=7
+                vassign(4,iel)=8
+              end if
+
+!...........if ijel(1,4,iel)=2
+            else
+              ntemp=sje(1,1,2,iel)
+              if(cbc(5,ntemp).eq.3.and.sje(1,1,5,ntemp).lt.iel)then
+                ntemp1=sje(1,1,4,iel)
+                if(cbc(5,ntemp1).eq.3.and.  &
+     &             sje(1,1,5,ntemp1).lt.iel)then
+                  newc(iel)=1
+                  vassign(1,iel)=8
+                else
+                  newc(iel)=2
+                  vassign(1,iel)=6
+                  vassign(2,iel)=8
+                end if
+              else
+                ntemp1=sje(1,1,4,iel)
+                if(cbc(5,ntemp1).eq.3.and.  &
+     &             sje(1,1,5,ntemp1).lt.iel)then
+                  newc(iel)=2
+                  vassign(1,iel)=7
+                  vassign(2,iel)=8
+                else
+                  newc(iel)=3
+                  vassign(1,iel)=6
+                  vassign(2,iel)=7
+                  vassign(3,iel)=8
+                end if
+              end if
+            end if
+          elseif(cb1.eq.2.or.cb1.eq.3)then
+            if(ijel(2,2,iel).eq.1)then
+              ntemp=sje(1,1,2,iel)
+              if(cbc(6,ntemp).eq.3.and.sje(1,1,6,ntemp).lt.iel)then
+                newc(iel)=3
+                vassign(1,iel)=4
+                vassign(2,iel)=7
+                vassign(3,iel)=8
+              else
+                newc(iel)=4
+                vassign(1,iel)=3
+                vassign(2,iel)=4
+                vassign(3,iel)=7
+                vassign(4,iel)=8
+              end if
+!...........if ijel(2,2,iel)=2
+            else
+              ntemp=sje(1,1,2,iel)
+              if(cbc(3,ntemp).eq.3.and.sje(1,1,3,ntemp).lt.iel)then
+                ntemp1=sje(1,1,6,iel)
+                if(cbc(3,ntemp1).eq.3.and.  &
+     &            sje(1,1,3,ntemp1).lt.iel)then
+                  newc(iel)=1
+                  vassign(1,iel)=8
+                else
+                  newc(iel)=2
+                  vassign(1,iel)=4
+                  vassign(2,iel)=8
+                end if
+              else
+                ntemp1=sje(1,1,6,iel)
+                if(cbc(3,ntemp1).eq.3.and.  &
+     &            sje(1,1,3,ntemp1).lt.iel)then
+                  newc(iel)=2
+                  vassign(1,iel)=7
+                  vassign(2,iel)=8
+                else
+                  newc(iel)=3
+                  vassign(1,iel)=4
+                  vassign(2,iel)=7
+                  vassign(3,iel)=8
+                end if
+              end if
+            end if
+          elseif(cb2.eq.2.or.cb2.eq.3)then
+            if(ijel(2,6,iel).eq.1)then
+              ntemp=sje(1,1,4,iel)
+              if(cbc(6,ntemp).eq.3.and.sje(1,1,6,ntemp).lt.iel)then
+                newc(iel)=3
+                vassign(1,iel)=4
+                vassign(2,iel)=6
+                vassign(3,iel)=8
+              else
+                newc(iel)=4
+                vassign(1,iel)=2
+                vassign(2,iel)=4
+                vassign(3,iel)=6
+                vassign(4,iel)=8
+              end if
+!...........if ijel(2,6,iel)=2
+            else
+              ntemp=sje(1,1,4,iel)
+              if(cbc(1,ntemp).eq.3.and.sje(1,1,1,ntemp).lt.iel)then
+                ntemp1=sje(1,1,6,iel)
+                if(cbc(1,ntemp1).eq.3.and.  &
+     &            sje(1,1,1,ntemp1).lt.iel)then
+                  newc(iel)=1
+                  vassign(1,iel)=8
+                else
+                  newc(iel)=2
+                  vassign(1,iel)=4
+                  vassign(2,iel)=8
+                end if
+              else
+                ntemp1=sje(1,1,6,iel)
+                if(cbc(1,ntemp1).eq.3.and.  &
+     &              sje(1,1,1,ntemp1).lt.iel)then
+                  newc(iel)=2
+                  vassign(1,iel)=6
+                  vassign(2,iel)=8
+                else
+                  newc(iel)=3
+                  vassign(1,iel)=4
+                  vassign(2,iel)=6
+                  vassign(3,iel)=8
+                end if
+              end if
+            end if
+
+          end if
+          
+        elseif(sumcb.eq.6)then
+!.......the three face types for face 2,4 and 6 are 0 0 0 (not 
+!       necessarily in this order)
+          newc(iel)=8
+          vassign(1,iel)=1
+          vassign(2,iel)=2
+          vassign(3,iel)=3
+          vassign(4,iel)=4
+          vassign(5,iel)=5
+          vassign(6,iel)=6
+          vassign(7,iel)=7
+          vassign(8,iel)=8
+
+        elseif(sumcb.eq.9)then
+!.......the three face types for face 2,4 and 6 are 0 0 1 (not 
+!       necessarily in this order)
+          newc(iel)=7
+          vassign(1,iel)=2
+          vassign(2,iel)=3
+          vassign(3,iel)=4
+          vassign(4,iel)=5
+          vassign(5,iel)=6
+          vassign(6,iel)=7
+          vassign(7,iel)=8
+
+        elseif(sumcb.eq.12)then
+!.......the three face types for face 2,4 and 6 are 0 1 1 (not 
+!       necessarily in this order)
+          if(cb.eq.0)then
+            ntemp=sje(1,1,2,iel)
+            if(cbc(4,ntemp).eq.3.and.sje(1,1,4,ntemp).lt.iel)then
+              newc(iel)=6
+              vassign(1,iel)=2
+              vassign(2,iel)=3
+              vassign(3,iel)=4
+              vassign(4,iel)=6
+              vassign(5,iel)=7
+              vassign(6,iel)=8
+            else
+              newc(iel)=7
+              vassign(1,iel)=2
+              vassign(2,iel)=3
+              vassign(3,iel)=4
+              vassign(4,iel)=5
+              vassign(5,iel)=6
+              vassign(6,iel)=7
+              vassign(7,iel)=8
+            end if
+          elseif(cb1.eq.0)then
+            newc(iel)=7
+            vassign(1,iel)=2
+            vassign(2,iel)=3
+            vassign(3,iel)=4
+            vassign(4,iel)=5
+            vassign(5,iel)=6
+            vassign(6,iel)=7
+            vassign(7,iel)=8
+          elseif(cb2.eq.0)then
+            ntemp=sje(1,1,4,iel)
+            if(cbc(6,ntemp).eq.3.and.sje(1,1,6,ntemp).lt.iel)then
+              newc(iel)=6
+              vassign(1,iel)=3
+              vassign(2,iel)=4
+              vassign(3,iel)=5
+              vassign(4,iel)=6
+              vassign(5,iel)=7
+              vassign(6,iel)=8
+            else
+              newc(iel)=7
+              vassign(1,iel)=2
+              vassign(2,iel)=3
+              vassign(3,iel)=4
+              vassign(4,iel)=5
+              vassign(5,iel)=6
+              vassign(6,iel)=7
+              vassign(7,iel)=8
+            end if
+          end if
+        
+        elseif(sumcb.eq.15)then
+!.......the three face types for face 2,4 and 6 are 1 1 1 (not 
+!       necessarily in this order)
+          ntemp=sje(1,1,4,iel)
+          ntemp1=sje(1,1,2,iel)
+          if(cbc(6,ntemp).eq.3.and.sje(1,1,6,ntemp).lt.iel)then
+            if(cbc(2,ntemp).eq.3.and.sje(1,1,2,ntemp).lt.iel)then
+              if(cbc(6,ntemp1).eq.3.and.sje(1,1,6,ntemp1).lt.iel)then
+                newc(iel)=4
+                vassign(1,iel)=4
+                vassign(2,iel)=6
+                vassign(3,iel)=7
+                vassign(4,iel)=8
+              else
+                newc(iel)=5
+                vassign(1,iel)=3
+                vassign(2,iel)=4
+                vassign(3,iel)=6
+                vassign(4,iel)=7
+                vassign(5,iel)=8
+              end if
+            else
+              if(cbc(6,ntemp1).eq.3.and.sje(1,1,6,ntemp1).lt.iel)then
+                newc(iel)=5
+                vassign(1,iel)=4
+                vassign(2,iel)=5
+                vassign(3,iel)=6
+                vassign(4,iel)=7
+                vassign(5,iel)=8
+              else
+                newc(iel)=6
+                vassign(1,iel)=3
+                vassign(2,iel)=4
+                vassign(3,iel)=5
+                vassign(4,iel)=6
+                vassign(5,iel)=7
+                vassign(6,iel)=8
+              end if
+            end if
+          else
+            if(cbc(2,ntemp).eq.3.and.sje(1,1,2,ntemp).lt.iel)then
+              if(cbc(6,ntemp1).eq.3.and.sje(1,1,6,ntemp1).lt.iel)then
+                newc(iel)=5
+                vassign(1,iel)=2
+                vassign(2,iel)=4
+                vassign(3,iel)=6
+                vassign(4,iel)=7
+                vassign(5,iel)=8
+              else
+                newc(iel)=6
+                vassign(1,iel)=2
+                vassign(2,iel)=3
+                vassign(3,iel)=4
+                vassign(4,iel)=6
+                vassign(5,iel)=7
+                vassign(6,iel)=8
+              end if
+            else
+              if(cbc(6,ntemp1).eq.3.and.sje(1,1,6,ntemp1).lt.iel)then
+                newc(iel)=6
+                vassign(1,iel)=2
+                vassign(2,iel)=4
+                vassign(3,iel)=5
+                vassign(4,iel)=6
+                vassign(5,iel)=7
+                vassign(6,iel)=8
+
+              else
+                newc(iel)=7
+                vassign(1,iel)=2 
+                vassign(2,iel)=3 
+                vassign(3,iel)=4 
+                vassign(4,iel)=5
+                vassign(5,iel)=6
+                vassign(6,iel)=7
+                vassign(7,iel)=8
+              end if
+            end if
+          end if
+        end if
+      end do
+!$OMP END PARALLEL DO
+!.....end computing how many new mortar vertex points will be generated
+!     on each element.
+
+!.....Compute (potentially in parallel) front(iel), which records how many 
+!     new mortar point indices are to be generated from element 1 to iel.
+!     front(iel)=newc(1)+newc(2)+...+newc(iel)
+
+!$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(iel)
+      do iel=1,nelt
+        front(iel)=newc(iel)
+      end do
+!$OMP END PARALLEL DO
+
+      call parallel_add(front)
+
+!.....On each element, generate new mortar point indices and assign them
+!     to all elements sharing this mortar point. Note, if a mortar point 
+!     is shared by several elements, the mortar point index of it will only
+!     be generated on the element with the lowest element index. 
+
+!$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(iel,i,count)
+      do iel=1,nelt
+
+!.......compute the starting vertex mortar point index in element iel
+        front(iel)=front(iel)-newc(iel)
+
+        do i=1,newc(iel)
+!.........count is the new mortar index number, which will be assigned
+!         to a vertex of iel and broadcast to all other elements sharing
+!         this vertex point.
+          count=front(iel)+i
+          call mortar_vertex(vassign(i,iel),iel,count) 
+        end do
+      end do
+!$OMP END PARALLEL DO
+
+!.....nvertex records how many mortar indices are for element vertices.
+!     It is used in the computation of the preconditioner.
+      count=front(nelt)+newc(nelt)
+      nvertex=count
+
+!.....CONFORMING EDGE AND FACE INTERIOR
+
+!.....find out how many new mortar point indices will be assigned to all
+!.....conforming edges and all conforming face interiors on each element
+
+      n1=12*nelt
+      n2=6*nelt
+
+!.....eassign(i,iel)=.true.   indicates that the i'th edge on iel will 
+!                             generate new mortar points. 
+!     ncon_edge(i,iel)=.true. indicates that the i'th edge on iel is 
+!                             nonconforming
+      call l_init_omp(ncon_edge,n1,.false.)
+      call l_init_omp(eassign,n1,.false.)
+!.....fassign(i,iel)=.true. indicates that the i'th face of iel will 
+!                           generate new mortar points
+      call l_init_omp(fassign,n2,.false.)
+
+!.....newe records how many new edges are to be assigned
+!     diagn(1,n,iel) records the element index of neighbor element of iel,
+!                    that shares edge n of iel
+!     diagn(2,n,iel) records which part of edge n of iel is shared with the
+!                    neighbor element diagn(1,n,iel). diagn(2,n,iel)=1 refers
+!                    to the left or bottom half of edge n, diagn(2,n,iel)=2
+!                    refers to the right or top half of edge n.
+!     if_1_edge(n,iel)=.true. indicates that iel is smaller than its neighbor
+!                    element that is connected to it by edge n only
+
+!$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(iel,cb1,cb2,cb3,cb4,cb5  &
+!$OMP& ,cb6,ntemp)
+
+      do iel=1,nelt
+        newc(iel)=0
+        newe(iel)=0
+        newi(iel)=0
+        cb1=cbc(1,iel)
+        cb2=cbc(2,iel)
+        cb3=cbc(3,iel)
+        cb4=cbc(4,iel)
+        cb5=cbc(5,iel)
+        cb6=cbc(6,iel)
+
+!.......on face 6
+
+        if(cb6.eq.0)then
+          if(cb4.eq.0.or.cb4.eq.1)then
+!...........if face 6 is of type 0 and face 4 is of type 0 or type 1, the edge
+!           shared by face 4 and 6 (edge 11) will generate new mortar point
+!           indices.
+            newe(iel)=newe(iel)+1
+            eassign(11,iel)=.true.
+          end if
+          if(cb1.ne.3)then
+!...........if face 1 is not of type 3, the edge shared by face 6 and 1 (edge 1)
+!           will generate new mortar point indices.
+            newe(iel)=newe(iel)+1
+            eassign(1,iel)=.true.
+          end if
+          if(cb3.ne.3)then
+            newe(iel)=newe(iel)+1
+            eassign(9,iel)=.true.
+          end if
+          if(cb2.eq.0.or.cb2.eq.1)then
+            newe(iel)=newe(iel)+1
+            eassign(5,iel)=.true.
+          end if
+        elseif(cb6.eq.1)then
+          if(cb4.eq.0)then
+            newe(iel)=newe(iel)+1
+            eassign(11,iel)=.true.
+          elseif(cb4.eq.1)then
+
+!...........If face 6 and face 4 both are of type 1, ntemp is the neighbor
+!           element on face 4.
+            ntemp=sje(1,1,4,iel)
+
+!...........if ntemp's face 6 is not nonconforming or the neighbor element
+!           of ntemp on face 6 has an element index larger than iel, the 
+!           edge shared by face 6 and 4 (edge 11) will generate new mortar
+!           point indices.
+            if(cbc(6,ntemp).ne.3.or.sje(1,1,6,ntemp).gt.iel)then
+
+              newe(iel)=newe(iel)+1
+              eassign(11,iel)=.true.
+!.............if the face 6 of ntemp is of type 2
+              if(cbc(6,ntemp).eq.2)then
+!...............The neighbor element of iel, neighbored by edge 11, is 
+!               sje(1,1,6,ntemp) (the neighbor element of ntemp on ntemp's
+!               face 6).
+                diagn(1,11,iel)=sje(1,1,6,ntemp)
+!...............The neighbor element of iel, neighbored by edge 11 shares
+!               the ijel(2,6,iel) part of edge 11 of iel
+                diagn(2,11,iel)=ijel(2,6,iel)
+!...............edge 10 of element sje(1,1,6,ntemp) (the neighbor element of 
+!               ntemp on ntemp's face 6) is a nonconforming edge
+                ncon_edge(10,sje(1,1,6,ntemp))=.true.
+!...............if_1_edge(n,iel)=.true. indicates that iel is of a smaller
+!               size than its neighbor element, neighbored by edge n of iel only.
+                if_1_edge(11,iel)=.true.
+              endif
+              if(cbc(6,ntemp).eq.3.and.  &
+     &          sje(1,1,6,ntemp).gt.iel)then
+                diagn(1,11,iel)=sje(2,ijel(2,6,iel),6,ntemp)
+              endif
+            end if
+          endif
+
+          if(cb1.eq.0)then
+            newe(iel)=newe(iel)+1
+            eassign(1,iel)=.true.
+          elseif(cb1.eq.1)then
+            ntemp=sje(1,1,1,iel)
+            if(cbc(6,ntemp).ne.3.or.sje(1,1,6,ntemp).gt.iel)then
+              newe(iel)=newe(iel)+1
+              eassign(1,iel)=.true.
+              if(cbc(6,ntemp).eq.2)then
+                diagn(1,1,iel)=sje(1,1,6,ntemp)
+                diagn(2,1,iel)=ijel(1,6,iel)
+                ncon_edge(7,sje(1,1,6,ntemp))=.true.
+                if_1_edge(1,iel)=.true.
+              endif
+              if(cbc(6,ntemp).eq.3.and.  &
+     &          sje(1,1,6,ntemp).gt.iel)then
+                diagn(1,1,iel)=sje(ijel(1,6,iel),1,6,ntemp)
+              endif
+            end if
+          elseif(cb1.eq.2)then
+            if(ijel(2,6,iel).eq.2)then
+              ntemp=sje(1,1,1,iel)
+              if(cbc(6,ntemp).eq.1)then
+                newe(iel)=newe(iel)+1
+                eassign(1,iel)=.true.
+!.............if cbc(6,ntemp)=2
+              else
+                if(sje(1,1,6,ntemp).gt.iel)then
+                  newe(iel)=newe(iel)+1
+                  eassign(1,iel)=.true.
+                  diagn(1,1,iel)=sje(1,1,6,ntemp)
+                end if
+              end if
+            else
+              newe(iel)=newe(iel)+1
+              eassign(1,iel)=.true.
+            end if
+          end if
+
+          if(cb3.eq.0)then
+            newe(iel)=newe(iel)+1
+            eassign(9,iel)=.true.
+          elseif(cb3.eq.1)then
+            ntemp=sje(1,1,3,iel)
+            if(cbc(6,ntemp).ne.3.or.sje(1,1,6,ntemp).gt.iel)then
+              newe(iel)=newe(iel)+1
+              eassign(9,iel)=.true.
+              if(cbc(6,ntemp).eq.2)then
+                diagn(1,9,iel)=sje(1,1,6,ntemp)
+                diagn(2,9,iel)=ijel(2,6,iel)
+                ncon_edge(12,sje(1,1,6,ntemp))=.true.
+                if_1_edge(9,iel)=.true.
+              endif
+              if(cbc(6,ntemp).eq.3.and.  &
+     &           sje(1,1,6,ntemp).gt.iel)then
+                diagn(1,9,iel)=sje(2,ijel(2,6,iel),6,ntemp)
+              endif
+            end if
+          elseif(cb3.eq.2)then
+            if(ijel(1,6,iel).eq.2)then
+              ntemp=sje(1,1,3,iel)
+              if(cbc(6,ntemp).eq.1)then
+                newe(iel)=newe(iel)+1
+                eassign(9,iel)=.true.
+!.............if cbc(6,ntemp)=2
+              else
+                if(sje(1,1,6,ntemp).gt.iel)then
+                  newe(iel)=newe(iel)+1
+                  eassign(9,iel)=.true.
+                  diagn(1,9,iel)=sje(1,1,6,ntemp)
+                end if
+              end if
+            else
+              newe(iel)=newe(iel)+1
+              eassign(9,iel)=.true.
+            end if
+          end if
+
+          if(cb2.eq.0)then
+            newe(iel)=newe(iel)+1
+            eassign(5,iel)=.true.
+          elseif(cb2.eq.1)then
+            ntemp=sje(1,1,2,iel)
+            if(cbc(6,ntemp).ne.3.or.sje(1,1,6,ntemp).gt.iel)then
+              newe(iel)=newe(iel)+1
+              eassign(5,iel)=.true.
+              if(cbc(6,ntemp).eq.2)then
+                diagn(1,5,iel)=sje(1,1,6,ntemp)
+                diagn(2,5,iel)=ijel(1,6,iel)
+                ncon_edge(3,sje(1,1,6,ntemp))=.true.
+                if_1_edge(5,iel)=.true.
+              endif
+              if(cbc(6,ntemp).eq.3.and.  &
+     &          sje(1,1,6,ntemp).gt.iel)then
+                diagn(1,9,iel)=sje(2,ijel(2,6,iel),6,ntemp)
+              endif
+            endif
+          end if
+        end if
+
+!.......on face 4
+        if(cb4.eq.0)then
+          if(cb1.ne.3)then
+            newe(iel)=newe(iel)+1
+            eassign(4,iel)=.true.
+          endif
+          if(cb5.ne.3)then
+            newe(iel)=newe(iel)+1
+            eassign(12,iel)=.true.
+          endif
+          if(cb2.eq.0.or.cb2.eq.1)then
+            newe(iel)=newe(iel)+1
+            eassign(8,iel)=.true.
+          end if 
+           
+        elseif(cb4.eq.1)then
+          if(cb1.eq.2)then
+            if(ijel(2,4,iel).eq.1)then
+              newe(iel)=newe(iel)+1
+              eassign(4,iel)=.true.
+            else
+              ntemp=sje(1,1,4,iel)
+              if(cbc(1,ntemp).ne.3.or.sje(1,1,1,ntemp).gt.iel)then
+                newe(iel)=newe(iel)+1
+                eassign(4,iel)=.true.
+                if(cbc(1,ntemp).eq.3.and.  &
+     &            sje(1,1,1,ntemp).gt.iel)then
+                  diagn(1,4,iel)=sje(ijel(1,4,iel),2,1,ntemp) 
+                endif
+              endif
+            end if
+          elseif(cb1.eq.0)then
+            newe(iel)=newe(iel)+1
+            eassign(4,iel)=.true.
+          elseif(cb1.eq.1)then
+            ntemp=sje(1,1,4,iel)
+            if(cbc(1,ntemp).ne.3.or.sje(1,1,1,ntemp).gt.iel)then
+              newe(iel)=newe(iel)+1
+              eassign(4,iel)=.true.
+              if(cbc(1,ntemp).eq.2)then
+                diagn(1,4,iel)=sje(1,1,1,ntemp)
+                diagn(2,4,iel)=ijel(1,4,iel)
+                ncon_edge(6,sje(1,1,1,ntemp))=.true.
+                if_1_edge(4,iel)=.true.
+              endif
+              if(cbc(1,ntemp).eq.3.and.  &
+     &          sje(1,1,1,ntemp).gt.iel)then
+                diagn(1,4,iel)=sje(ijel(1,4,iel),2,1,ntemp)
+              endif
+            end if
+          end if
+          if(cb5.eq.2)then
+            if(ijel(1,4,iel).eq.1)then
+              newe(iel)=newe(iel)+1
+              eassign(12,iel)=.true.
+            else
+              ntemp=sje(1,1,4,iel)
+              if(cbc(5,ntemp).ne.3.or.sje(1,1,5,ntemp).gt.iel)then
+                newe(iel)=newe(iel)+1
+                eassign(12,iel)=.true.
+                if(cbc(5,ntemp).eq.3.and.  &
+     &            sje(1,1,5,ntemp).gt.iel)then
+                  diagn(1,12,iel)=sje(2,ijel(2,4,iel),5,ntemp)
+                endif
+              endif
+            end if
+          elseif(cb5.eq.0)then
+            newe(iel)=newe(iel)+1
+            eassign(12,iel)=.true.
+          elseif(cb5.eq.1)then
+            ntemp=sje(1,1,4,iel)
+            if(cbc(5,ntemp).ne.3.or.sje(1,1,5,ntemp).gt.iel)then
+              newe(iel)=newe(iel)+1
+              eassign(12,iel)=.true.
+              if(cbc(5,ntemp).eq.2)then
+                diagn(1,12,iel)=sje(1,1,5,ntemp)
+                diagn(2,12,iel)=ijel(2,4,iel)
+                ncon_edge(9,sje(1,1,5,ntemp))=.true.
+                if_1_edge(12,iel)=.true.
+              endif
+              if(cbc(5,ntemp).eq.3.and.  &
+     &          sje(1,1,5,ntemp).gt.iel)then
+                diagn(1,12,iel)=sje(2,ijel(2,4,iel),5,ntemp)
+              endif
+            end if
+          end if
+          if(cb2.eq.0)then
+            newe(iel)=newe(iel)+1
+            eassign(8,iel)=.true.
+          elseif(cb2.eq.1)then
+            ntemp=sje(1,1,4,iel)
+            if(cbc(2,ntemp).ne.3.or.sje(1,1,2,ntemp).gt.iel)then
+              newe(iel)=newe(iel)+1
+              eassign(8,iel)=.true.
+              if(cbc(2,ntemp).eq.2)then
+                diagn(1,8,iel)=sje(1,1,2,ntemp)
+                diagn(2,8,iel)=ijel(1,4,iel)
+                ncon_edge(2,sje(1,1,2,ntemp))=.true.
+                if_1_edge(8,iel)=.true.
+              endif
+              if(cbc(2,ntemp).eq.3.and.  &
+     &          sje(1,1,2,ntemp).gt.iel)then
+                diagn(1,8,iel)=sje(ijel(1,4,iel),2,3,ntemp)
+              endif
+            endif
+          end if
+        end if
+
+!.......on face 2
+        if(cb2.eq.0)then
+          if(cb3.ne.3)then
+            newe(iel)=newe(iel)+1
+            eassign(6,iel)=.true.
+          endif
+          if(cb5.ne.3)then
+            newe(iel)=newe(iel)+1
+            eassign(7,iel)=.true.
+          endif
+        elseif(cb2.eq.1)then
+          if(cb3.eq.2)then
+            if(ijel(2,2,iel).eq.1)then
+              newe(iel)=newe(iel)+1
+              eassign(6,iel)=.true.
+            else
+              ntemp=sje(1,1,2,iel)
+              if(cbc(3,ntemp).ne.3.or.  &
+     &          sje(1,1,3,ntemp).gt.iel)then
+                newe(iel)=newe(iel)+1
+                eassign(6,iel)=.true.
+                if(cbc(3,ntemp).eq.3.and.  &
+     &            sje(1,1,3,ntemp).gt.iel)then
+                  diagn(1,6,iel)=sje(ijel(1,2,iel),2,3,ntemp)
+                endif
+              endif
+            endif
+          elseif(cb3.eq.0)then
+            newe(iel)=newe(iel)+1
+            eassign(6,iel)=.true.
+          elseif(cb3.eq.1)then
+            ntemp=sje(1,1,2,iel)
+            if(cbc(3,ntemp).ne.3.or.sje(1,1,3,ntemp).gt.iel)then
+              newe(iel)=newe(iel)+1
+              eassign(6,iel)=.true.
+              if(cbc(3,ntemp).eq.2)then
+                diagn(1,6,iel)=sje(1,1,3,ntemp)
+                diagn(2,6,iel)=ijel(1,2,iel)
+                ncon_edge(4,sje(1,1,3,ntemp))=.true.
+                if_1_edge(6,iel)=.true.
+              endif
+              if(cbc(3,ntemp).eq.3.and.  &
+     &          sje(1,1,3,ntemp).gt.iel)then
+                diagn(1,6,iel)=sje(ijel(1,4,iel),2,3,ntemp)
+              endif
+            endif
+          endif
+          if(cb5.eq.2)then
+            if(ijel(1,2,iel).eq.1)then
+              newe(iel)=newe(iel)+1
+              eassign(7,iel)=.true.
+            else
+              ntemp=sje(1,1,2,iel)
+              if(cbc(5,ntemp).ne.3.or.sje(1,1,5,ntemp).gt.iel)then
+                newe(iel)=newe(iel)+1
+                eassign(7,iel)=.true.
+                if(cbc(5,ntemp).eq.3.and.  &
+     &            sje(1,1,5,ntemp).gt.iel)then
+                  diagn(1,7,iel)=sje(ijel(2,2,iel),2,5,ntemp)
+                endif
+              endif
+            endif
+          elseif(cb5.eq.0)then
+            newe(iel)=newe(iel)+1
+            eassign(7,iel)=.true.
+          elseif(cb5.eq.1)then
+            ntemp=sje(1,1,2,iel)
+            if(cbc(5,ntemp).ne.3.or.sje(1,1,5,ntemp).gt.iel)then
+              newe(iel)=newe(iel)+1
+              eassign(7,iel)=.true.
+              if(cbc(5,ntemp).eq.2)then
+                diagn(1,7,iel)=sje(1,1,5,ntemp)
+                diagn(2,7,iel)=ijel(2,2,iel)
+                ncon_edge(1,sje(1,1,5,ntemp))=.true.
+                if_1_edge(7,iel)=.true.
+              endif
+              if(cbc(5,ntemp).eq.3.and.  &
+     &          sje(1,1,5,ntemp).gt.iel)then
+                diagn(1,7,iel)=sje(2,ijel(2,4,iel),5,ntemp)
+              endif
+            endif
+          endif
+        end if
+
+!.......on face 1
+        if(cb1.eq.1)then
+          newe(iel)=newe(iel)+2
+          eassign(2,iel)=.true.
+          if(cb3.eq.1)then
+            ntemp=sje(1,1,1,iel)
+            if(cbc(3,ntemp).eq.2)then
+              diagn(1,2,iel)=sje(1,1,3,ntemp)
+              diagn(2,2,iel)=ijel(1,1,iel)
+              ncon_edge(8,sje(1,1,3,ntemp))=.true.
+              if_1_edge(2,iel)=.true.
+            elseif(cbc(3,ntemp).eq.3)then
+              diagn(1,2,iel)=sje(ijel(1,1,iel),1,3,ntemp)
+            endif
+          elseif(cb3.eq.2)then
+            ntemp=sje(1,1,3,iel)
+            if(ijel(2,1,iel).eq.2)then
+              if(cbc(1,ntemp).eq.2)then
+                diagn(1,2,iel)=sje(1,1,1,ntemp)
+              end if
+            endif
+          end if
+
+          eassign(3,iel)=.true.
+          if(cb5.eq.1)then
+            ntemp=sje(1,1,1,iel)
+            if(cbc(5,ntemp).eq.2)then
+              diagn(1,3,iel)=sje(1,1,5,ntemp)
+              diagn(2,3,iel)=ijel(2,1,iel)
+              ncon_edge(5,sje(1,1,5,ntemp))=.true.
+              if_1_edge(3,iel)=.true.
+            elseif(cbc(5,ntemp).eq.3)then
+              diagn(1,3,iel)=sje(ijel(2,1,iel),1,5,ntemp)
+            endif
+          elseif(cb5.eq.2)then
+            ntemp=sje(1,1,5,iel)
+            if(ijel(1,1,iel).eq.2)then
+              if(cbc(1,ntemp).eq.2)then
+                diagn(1,3,iel)=sje(1,1,1,ntemp)
+              end if
+            endif
+            
+          end if
+        elseif(cb1.eq.2)then
+          if(cb3.eq.2)then
+            ntemp=sje(1,1,1,iel)
+            if(cbc(3,ntemp).ne.3)then
+              newe(iel)=newe(iel)+1
+              eassign(2,iel)=.true.
+              if(cbc(3,ntemp).eq.2)then
+                diagn(1,2,iel)=sje(1,1,3,ntemp)
+              endif 
+            endif
+          elseif(cb3.eq.0.or.cb3.eq.1)then
+            newe(iel)=newe(iel)+1
+            eassign(2,iel)=.true.
+            if(cb3.eq.1)then
+              ntemp=sje(1,1,1,iel)
+              if(cbc(3,ntemp).eq.2)then
+                diagn(1,2,iel)=sje(1,1,3,ntemp)
+              endif
+            endif
+          end if
+          if(cb5.eq.2)then
+            ntemp=sje(1,1,1,iel)
+            if(cbc(5,ntemp).ne.3)then
+              newe(iel)=newe(iel)+1
+              eassign(3,iel)=.true.
+              if(cbc(5,ntemp).eq.2)then
+                diagn(1,3,iel)=sje(1,1,5,ntemp)
+              endif
+            endif
+          elseif(cb5.eq.0.or.cb5.eq.1)then
+            newe(iel)=newe(iel)+1
+            eassign(3,iel)=.true.
+            if(cb5.eq.1)then
+              ntemp=sje(1,1,1,iel)
+              if(cbc(5,ntemp).eq.2)then
+                diagn(1,3,iel)=sje(1,1,5,ntemp)
+              endif
+            endif
+          end if
+        elseif(cb1.eq.0)then
+          if(cb3.ne.3)then
+            newe(iel)=newe(iel)+1
+            eassign(2,iel)=.true.
+          endif
+          if(cb5.ne.3)then
+            newe(iel)=newe(iel)+1
+            eassign(3,iel)=.true.
+          endif
+        endif
+
+!.......on face 3
+        if(cb3.eq.1)then
+          newe(iel)=newe(iel)+1
+          eassign(10,iel)=.true.
+          if(cb5.eq.1)then
+            ntemp=sje(1,1,3,iel)
+            if(cbc(5,ntemp).eq.2)then
+              diagn(1,10,iel)=sje(1,1,5,ntemp)
+              diagn(2,10,iel)=ijel(2,3,iel)
+              ncon_edge(11,sje(1,1,5,ntemp))=.true.
+              if_1_edge(10,iel)=.true.
+            endif
+          endif
+          if(ijel(1,3,iel).eq.2)then
+            ntemp=sje(1,1,3,iel)
+            if(cbc(5,ntemp).eq.3)then
+              diagn(1,10,iel)=sje(1,ijel(2,3,iel),5,ntemp)
+            endif
+          endif
+        elseif(cb3.eq.2)then
+          if(cb5.eq.2)then
+            ntemp=sje(1,1,3,iel)
+            if(cbc(5,ntemp).ne.3)then
+              newe(iel)=newe(iel)+1
+              eassign(10,iel)=.true.
+              if(cbc(5,ntemp).eq.2)then
+                diagn(1,10,iel)=sje(1,1,5,ntemp)
+              endif
+            endif
+          elseif(cb5.eq.0.or.cb5.eq.1)then
+            newe(iel)=newe(iel)+1
+            eassign(10,iel)=.true.
+            if(cb5.eq.1)then
+              ntemp=sje(1,1,3,iel)
+              if(cbc(5,ntemp).eq.2)then
+                diagn(1,10,iel)=sje(1,1,5,ntemp)
+              endif 
+            endif
+          end if
+        elseif(cb3.eq.0)then
+          if(cb5.ne.3)then
+            newe(iel)=newe(iel)+1
+            eassign(10,iel)=.true.
+          endif
+        endif
+
+!       CONFORMING FACE INTERIOR
+
+!.......find how many new mortar point indices will be assigned
+!       to face interiors on all faces on each element
+
+!.......newi records how many new face interior points will be assigned
+
+!.......on face 6
+        if(cb6.eq.1.or.cb6.eq.0)then
+          newi(iel)=newi(iel)+9
+          fassign(6,iel)=.true.
+        end if
+!.......on face 4
+        if(cb4.eq.1.or.cb4.eq.0)then
+          newi(iel)=newi(iel)+9
+          fassign(4,iel)=.true.
+        end if
+!.......on face 2
+        if(cb2.eq.1.or.cb2.eq.0)then
+          newi(iel)=newi(iel)+9
+          fassign(2,iel)=.true.
+        end if
+!.......on face 1
+        if(cb1.ne.3)then
+          newi(iel)=newi(iel)+9
+          fassign(1,iel)=.true.
+        end if
+!.......on face 3
+        if(cb3.ne.3)then
+          newi(iel)=newi(iel)+9
+          fassign(3,iel)=.true.
+        endif
+!.......on face 5
+        if(cb5.ne.3)then
+          newi(iel)=newi(iel)+9
+          fassign(5,iel)=.true.
+        endif
+
+!.......newc is the total number of new mortar point indices
+!       to be assigned to each element.
+        newc(iel)=newe(iel)*3+newi(iel)
+      end do
+!$OMP END PARALLEL DO
+
+!.....Compute (potentially in parallel) front(iel), which records how 
+!     many new mortar point indices are to be assigned (to conforming 
+!     edges and conforming face interiors) from element 1 to iel.
+!     front(iel)=newc(1)+newc(2)+...+newc(iel)
+
+!$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(iel)
+      do iel=1,nelt
+        front(iel)=newc(iel)
+      end do
+!$OMP END PARALLEL DO
+
+      call parallel_add(front)
+
+!.....nmor is the total number of mortar points
+      nmor=nvertex+front(nelt)
+
+!.....Generate (potentially in parallel) new mortar point indices on 
+!     each conforming element face. On each face, first visit all 
+!     conforming edges, and then the face interior.
+
+!$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(iel,count,i,cb1,ne,  &
+!$OMP& space,ie,edge_g,face2,ie2,ntemp,ii,jj,jface,cb,mor_v)
+      do iel=1,nelt
+        front(iel)=front(iel)-newc(iel)
+        count=nvertex+front(iel)
+        do i=1,6
+          cb1=cbc(i,iel)
+          if (i.le.2) then
+            ne=4
+            space=1
+          elseif (i.le.4)then
+            ne=3
+            space=2
+
+!.........i loops over faces. Only 4 faces need to be examined for edge visit.
+!         On face 1, edge 1,2,3 and 4 will be visited. On face 2, edge 5,6,7
+!         and 8 will be visited. On face 3, edge 9 and 10 will be visited and
+!         on face 4, edge 11 and 12 will be visited. The 12 edges can be
+!         covered by four faces, so there is no need to visit edges on face
+!         5 and 6, and ne is set to 0 for them.
+!         However, i still needs to loop over 5 and 6, since the interiors
+!         of face 5 and 6 still need to be visited.
+
+          else
+            ne=0
+            space=1
+          end if
+
+          do ie=1,ne,space
+            edge_g=edgenumber(ie,i)
+            if(eassign(edge_g,iel))then
+!.............generate the new mortar point indices, mor_v
+              call mor_assign(mor_v,count)
+!.............assign mor_v to local edge ie of face i on element iel
+              call mor_edge(ie,i,iel,mor_v)
+
+!.............Since this edge is shared by another face of element 
+!             iel, assign mor_v to the corresponding edge on the other 
+!             face also.
+
+!.............find the other face
+              face2=f_e_ef(ie,i)
+!.............find the local edge index of this edge on the other face
+              ie2=localedgenumber(face2,edge_g)
+!.............assign mor_v to local edge ie2 of face face2 on element iel
+              call mor_edge(ie2,face2,iel,mor_v)
+
+!.............There are some neighbor elements also sharing this edge. Assign
+!             mor_v to neighbor element, neighbored by face i.
+              if (cbc(i,iel).eq.2)then
+                ntemp=sje(1,1,i,iel)
+                call mor_edge(ie,jjface(i),ntemp,mor_v)
+                call mor_edge(op(ie2),face2,ntemp,mor_v)
+              end if
+
+!.............assign mor_v  to neighbor element neighbored by face face2
+              if (cbc(face2,iel).eq.2)then
+                ntemp=sje(1,1,face2,iel)
+                call mor_edge(ie2,jjface(face2),ntemp,mor_v)
+                call mor_edge(op(ie),i,ntemp,mor_v)
+              end if
+
+!.............assign mor_v to neighbor element sharing this edge
+
+!.............if the neighbor is of the same size as iel
+              if(.not.if_1_edge(edgenumber(ie,i),iel))then
+                if(diagn(1,edgenumber(ie,i),iel).ne.0)then
+                  ntemp=diagn(1,edgenumber(ie,i),iel)
+                  call mor_edge(op(ie2),jjface(face2),ntemp,mor_v)
+                  call mor_edge(op(ie),jjface(i),ntemp,mor_v)
+                endif
+
+!.............if the neighbor has a size larger than iel's
+              else
+                if(diagn(1,edgenumber(ie,i),iel).ne.0)then
+                  ntemp=diagn(1,edgenumber(ie,i),iel)
+                  call mor_ne(mor_v,diagn(2,edgenumber(ie,i),iel),  &
+     &            ie,i,ie2,face2,iel,ntemp)
+                end if
+              endif
+ 
+            endif
+          end do 
+
+          if(fassign(i,iel))then
+!...........generate new mortar point indices in face interior.
+!           if face i is of type 1 or iel doesn't have a neighbor element,
+!           assign new mortar point indices to interior mortar points
+!           of face i of iel.
+            cb=cbc(i,iel)
+            if (cb.eq.1.or.cb.eq.0) then
+              do jj =2,lx1-1
+                do ii=2,lx1-1
+                  count=count+1
+                  idmo(ii,jj,1,1,i,iel)=count
+                end do
+              end do
+
+!...........if face i is of type 2, assign new mortar point indices
+!           to iel as well as to the neighboring element on face i
+            elseif (cb.eq.2) then
+              if (idmo(2,2,1,1,i,iel).eq.0) then
+                ntemp=sje(1,1,i,iel)
+                jface = jjface(i)
+                do jj =2,lx1-1
+                  do ii=2,lx1-1
+                    count=count+1
+                    idmo(ii,jj,1,1,i,iel)=count
+                    idmo(ii,jj,1,1,jface,ntemp)=count
+                  end do
+                end do
+              end if 
+            end if
+          end if
+        end do
+      end do 
+!$OMP END  PARALLEL DO
+
+ 
+!.....for edges on nonconforming faces, copy the mortar point indices
+!     from neighbors.
+!$OMP PARALLEL DO DEFAULT(SHARED)  &
+!$OMP& PRIVATE(iel,i,cb,jface,iii,jjj,ntemp,ii,jj)
+      do iel=1,nelt
+        do i=1,6
+          cb=cbc(i,iel)
+          if (cb.eq.3) then
+!...........edges 
+            call edgecopy_s(i,iel)
+          end if 
+
+!.........face interior 
+
+          jface = jjface(i)
+          if (cb.eq.3) then
+            do iii=1,2
+              do jjj=1,2
+                ntemp=sje(iii,jjj,i,iel) 
+                do jj =1,lx1
+                  do ii=1,lx1
+                    idmo(ii,jj,iii,jjj,i,iel)=  &
+     &                         idmo(ii,jj,1,1,jface,ntemp)
+                  end do
+                end do
+                idmo(1,1,iii,jjj,i,iel)=idmo(1,1,1,1,jface,ntemp)
+                idmo(lx1,1,iii,jjj,i,iel)=idmo(lx1,1,1,2,jface,ntemp)
+                idmo(1,lx1,iii,jjj,i,iel)=idmo(1,lx1,2,1,jface,ntemp)
+                idmo(lx1,lx1,iii,jjj,i,iel)=  &
+     &                         idmo(lx1,lx1,2,2,jface,ntemp)
+              end do
+            end do
+          end if
+        end do
+      end do
+!$OMP END PARALLEL DO
+      return
+      end
+       
+!-----------------------------------------------------------------
+       subroutine get_emo(ie,n,ng)
+!-----------------------------------------------------------------
+!      This subroutine fills array emo.
+!      emo  records all elements sharing the same mortar point
+!                 (only applies to element vertices).
+!      emo(1,i,n) gives the element ID of the i'th element sharing
+!                 mortar point n. (emo(1,i,n)=ie), ie is element
+!                 index.
+!      emo(2,i,n) gives the vertex index of mortar point n on this
+!                 element (emo(2,i,n)=ng), ng is the vertex index.
+!      nemo(n) records the total number of elements sharing mortar 
+!                 point n.
+!-----------------------------------------------------------------
+ 
+       use ua_data
+       implicit none
+
+       integer ie, n, ntemp, i,ng
+
+       do i=1,nemo(n)
+         if (emo(1,i,n).eq.ie) return
+       end do
+
+!$     call omp_set_lock(tlock(n))
+       ntemp=nemo(n)+1
+       nemo(n)=ntemp
+       emo(1,ntemp,n)=ie
+       emo(2,ntemp,n)=ng
+!$     call omp_unset_lock(tlock(n))
+
+       return
+       end 
+
+!-----------------------------------------------------------------
+      logical function ifsame(ntemp,j,iel,i)
+!-----------------------------------------------------------------
+!     Check whether the i'th vertex of element iel is at the same
+!     location as the j'th vertex of element ntemp.
+!-----------------------------------------------------------------
+
+      use ua_data
+      implicit none
+
+      integer iel, i, ntemp, j
+
+      ifsame=.false.
+      if (ntemp.eq.0 .or. iel.eq.0) return
+      if (xc(i,iel).eq.xc(j,ntemp).and.  &
+     &    yc(i,iel).eq.yc(j,ntemp).and.  &
+     &    zc(i,iel).eq.zc(j,ntemp)) then
+        ifsame=.true.
+      end if
+
+      return
+      end
+
+!-----------------------------------------------------------------
+      subroutine mor_assign(mor_v,count)
+!-----------------------------------------------------------------
+!     Assign three consecutive numbers for mor_v, which will
+!     be assigned to the three interior points of an edge as the 
+!     mortar point indices.
+!-----------------------------------------------------------------
+      
+      implicit none
+      integer mor_v(3),count,i
+   
+      do i=1,3 
+        count=count+1
+        mor_v(i)=count
+      end do
+
+      return
+      end  
+     
+!-----------------------------------------------------------------
+      subroutine mor_edge(ie,face,iel,mor_v)
+!-----------------------------------------------------------------
+!     Copy the mortar point indices from mor_v to local
+!     edge ie of the face'th face on element iel.
+!     The edge is conforming.
+!-----------------------------------------------------------------
+
+      use ua_data
+      implicit none
+
+      integer ie,i,iel,mor_v(3),j,nn,face
+
+      if (ie.eq.1) then
+        j=1
+        do nn=2,lx1-1
+          idmo(nn,j,1,1,face,iel)=mor_v(nn-1)
+        end do
+      elseif (ie.eq.2) then 
+        i=lx1
+        do nn=2,lx1-1
+          idmo(i,nn,1,1,face,iel)=mor_v(nn-1)
+        end do
+      elseif (ie.eq.3) then 
+        j=lx1
+        do nn=2,lx1-1
+          idmo(nn,j,1,1,face,iel)=mor_v(nn-1)
+        end do
+      elseif (ie.eq.4) then 
+        i=1
+        do nn=2,lx1-1
+          idmo(i,nn,1,1,face,iel)=mor_v(nn-1)
+        end do
+      end if
+
+      return
+      end 
+
+!------------------------------------------------------------
+      subroutine edgecopy_s(face,iel)
+!------------------------------------------------------------
+!     Copy mortar point indices on edges from neighbor elements
+!     to an element face of the 3rd type.
+!------------------------------------------------------------
+
+       use ua_data
+       implicit none
+
+       integer face, iel, ntemp1, ntemp2, ntemp3, ntemp4,  &
+     &         edge_g, edge_l, face2, mor_s_v(4,2), i
+
+!......find four neighbors on this face (3rd type)
+       ntemp1=sje(1,1,face,iel)
+       ntemp2=sje(1,2,face,iel)
+       ntemp3=sje(2,1,face,iel)
+       ntemp4=sje(2,2,face,iel)
+
+!......local edge 1
+
+!......mor_s_v is the array of mortar indices to be copied.
+       call nrzero(mor_s_v,4*2)
+       do i=2,lx1-1
+          mor_s_v(i-1,1)=idmo(i,1,1,1,jjface(face),ntemp1)
+       end do
+       mor_s_v(lx1-1,1)=idmo(lx1,1,1,2,jjface(face),ntemp1)
+       do i=1,lx1-1
+          mor_s_v(i,2)=idmo(i,1,1,1,jjface(face),ntemp2)
+       end do
+
+!......copy mor_s_v to local edge 1 on this face
+       call mor_s_e(1,face,iel,mor_s_v)
+
+!......copy mor_s_v to the corresponding edge on the other face sharing
+!      local edge 1
+       face2=f_e_ef(1,face)
+       edge_g=edgenumber(1,face)
+       edge_l=localedgenumber(face2,edge_g)
+       call mor_s_e(edge_l,face2,iel,mor_s_v)
+
+!......local edge 2
+       do i=2,lx1-1
+          mor_s_v(i-1,1)=idmo(lx1,i,1,1,jjface(face),ntemp2)
+       end do
+       mor_s_v(lx1-1,1)=idmo(lx1,lx1,2,2,jjface(face),ntemp2)
+
+       mor_s_v(1,2)=idmo(lx1,1,1,2,jjface(face),ntemp4)
+       do i=2,lx1-1
+          mor_s_v(i,2)=idmo(lx1,i,1,1,jjface(face),ntemp4)
+       end do
+
+       call mor_s_e(2,face,iel,mor_s_v)
+       face2=f_e_ef(2,face)
+       edge_g=edgenumber(2,face)
+       edge_l=localedgenumber(face2,edge_g)
+       call mor_s_e(edge_l,face2,iel,mor_s_v)
+
+!......local edge 3
+       do i=2,lx1-1
+          mor_s_v(i-1,1)=idmo(i,lx1,1,1,jjface(face),ntemp3)
+       end do
+       mor_s_v(lx1-1,1)=idmo(lx1,lx1,2,2,jjface(face),ntemp3)
+
+       mor_s_v(1,2)=idmo(1,lx1,2,1,jjface(face),ntemp4)
+       do i=2,lx1-1
+          mor_s_v(i,2)=idmo(i,lx1,1,1,jjface(face),ntemp4)
+       end do
+
+       call mor_s_e(3,face,iel,mor_s_v)
+       face2=f_e_ef(3,face)
+       edge_g=edgenumber(3,face)
+       edge_l=localedgenumber(face2,edge_g)
+       call mor_s_e(edge_l,face2,iel,mor_s_v)
+
+!......local edge 4
+       do i=2,lx1-1
+          mor_s_v(i-1,1)=idmo(1,i,1,1,jjface(face),ntemp1)
+       end do
+       mor_s_v(lx1-1,1)=idmo(1,lx1,2,1,jjface(face),ntemp1)
+
+       do i=1,lx1-1
+          mor_s_v(i,2)=idmo(1,i,1,1,jjface(face),ntemp3)
+       end do
+
+       call mor_s_e(4,face,iel,mor_s_v)
+       face2=f_e_ef(4,face)
+       edge_g=edgenumber(4,face)
+       edge_l=localedgenumber(face2,edge_g)
+       call mor_s_e(edge_l,face2,iel,mor_s_v)
+
+       return
+       end
+
+!------------------------------------------------------------
+       subroutine mor_s_e(n,face,iel,mor_s_v)
+!------------------------------------------------------------
+!      Copy mortar point indices from mor_s_v to local edge n
+!      on face "face" of element iel. The edge is nonconforming. 
+!------------------------------------------------------------
+
+       use ua_data
+       implicit none
+
+       integer n,face,iel,mor_s_v(4,2), i
+
+       if (n.eq.1) then
+         do i=2,lx1
+           idmo(i,1,1,1,face,iel)=mor_s_v(i-1,1)
+         end do
+         do i=1,lx1-1
+           idmo(i,1,1,2,face,iel)=mor_s_v(i,2)
+         end do
+       else if (n.eq.2) then
+         do i=2,lx1
+          idmo(lx1,i,1,2,face,iel)=mor_s_v(i-1,1)
+         end do
+         do i=1,lx1-1
+          idmo(lx1,i,2,2,face,iel)=mor_s_v(i,2)
+         end do
+       else if (n.eq.3) then
+         do i=2,lx1
+           idmo(i,lx1,2,1,face,iel)=mor_s_v(i-1,1)
+         end do
+         do i=1,lx1-1
+           idmo(i,lx1,2,2,face,iel)=mor_s_v(i,2)
+         end do
+       else if (n.eq.4) then
+         do i=2,lx1
+           idmo(1,i,1,1,face,iel)=mor_s_v(i-1,1)
+         end do
+         do i=1,lx1-1
+           idmo(1,i,2,1,face,iel)=mor_s_v(i,2)
+         end do
+       end if
+       return
+       end
+
+!------------------------------------------------------------
+       subroutine mor_s_e_nn(n,face,iel,mor_s_v,nn)
+!------------------------------------------------------------
+!      Copy mortar point indices from mor_s_v to local edge n
+!      on face "face" of element iel. nn is the edge mortar index,
+!      which indicates that mor_s_v  corresponds to left/bottom or 
+!      right/top part of the edge.
+!------------------------------------------------------------
+
+       use ua_data
+       implicit none
+
+       integer n,face,iel,mor_s_v(4), i,nn
+
+       if (n.eq.1) then
+         if(nn.eq.1)then
+            do i=2,lx1
+              idmo(i,1,1,1,face,iel)=mor_s_v(i-1)
+            end do
+         else
+           do i=1,lx1-1
+             idmo(i,1,1,2,face,iel)=mor_s_v(i)
+           end do
+         endif
+       else if (n.eq.2) then
+         if(nn.eq.1)then
+           do i=2,lx1
+            idmo(lx1,i,1,2,face,iel)=mor_s_v(i-1)
+           end do
+         else
+           do i=1,lx1-1
+            idmo(lx1,i,2,2,face,iel)=mor_s_v(i)
+           end do
+         endif
+       else if (n.eq.3) then
+         if(nn.eq.1)then
+           do i=2,lx1
+             idmo(i,lx1,2,1,face,iel)=mor_s_v(i-1)
+           end do
+         else
+           do i=1,lx1-1
+            idmo(i,lx1,2,2,face,iel)=mor_s_v(i)
+           end do
+         endif
+       else if (n.eq.4) then
+         if(nn.eq.1)then
+           do i=2,lx1
+            idmo(1,i,1,1,face,iel)=mor_s_v(i-1)
+           end do
+         else
+           do i=1,lx1-1
+            idmo(1,i,2,1,face,iel)=mor_s_v(i)
+           end do
+         endif
+       end if
+       return
+       end
+
+
+!---------------------------------------------------------------
+      subroutine mortar_vertex(i,iel,count)
+!---------------------------------------------------------------
+!     Assign mortar point index "count" to iel's i'th vertex
+!     and also to all elements sharing this vertex.
+!---------------------------------------------------------------
+
+      use ua_data
+      implicit none
+
+      integer i,iel,count,ntempx(8),ifntempx(8),lc_a(3),nnb(3),  &
+     &        face_a(3),itemp,ntemp,ii, jj,j(3),  &
+     &        iintempx(3),l,nbe, lc, temp
+      logical ifsame,if_temp
+
+      do l= 1,8
+        ntempx(l)=0
+        ifntempx(l)=0
+      end do
+
+!.....face_a records the three faces sharing this vertex on iel.
+!     lc_a gives the local corner number of this vertex on each 
+!     face in face_a.
+
+      do l=1,3
+        face_a(l)=f_c(l,i)
+        lc_a(l)=local_corner(i,face_a(l))
+      end do
+
+!.....each vertex is shared by at most 8 elements. 
+!     ntempx(j) gives the element index of a POSSIBLE element whose
+!               j'th vertex is iel's i'th vertex
+!     ifntempx(i)=ntempx(i) means ntempx(i) exists
+!     ifntempx(i)=0 means ntempx(i) does not exist.
+
+      ntempx(9-i)=iel
+      ifntempx(9-i)=iel
+
+!.....first find all elements sharing this vertex, ifntempx
+
+!.....find the three possible neighbors of iel, neighbored by faces 
+!     listed in array face_a
+
+      do itemp= 1, 3
+
+!.......j(itemp) is the local corner number of this vertex on the 
+!       neighbor element on the corresponding face.
+        j(itemp)=c_f(lc_a(itemp),jjface(face_a(itemp)))
+
+!.......iintempx(itemp) records the vertex index of i on the
+!       neighbor element, neighbored by face_a(itemp)
+        iintempx(itemp)=cal_intempx(lc_a(itemp),face_a(itemp))
+
+!.......ntemp refers to the neighbor element
+        ntemp=0
+
+!.......if the face is nonconforming, find out in which piece of the 
+!       mortar the vertex is located
+        ii=cal_iijj(1,lc_a(itemp))
+        jj=cal_iijj(2,lc_a(itemp))
+        ntemp=sje(ii,jj,face_a(itemp),iel)
+
+!.......if the face is conforming
+        if(ntemp.eq.0)then
+          ntemp=sje(1,1,face_a(itemp),iel)
+!.........find the possible neighbor        
+          ntempx(iintempx(itemp))=ntemp
+!.........check whether this possible neighbor is a real neighbor or not
+          if(ntemp.ne.0)then
+            if(ifsame(ntemp,j(itemp),iel,i))then
+              ifntempx(iintempx(itemp))=ntemp
+            end if
+          end if
+
+!.......if the face is nonconforming
+        else
+          if(ntemp.ne.0)then
+            if(ifsame(ntemp,j(itemp),iel,i))then
+              ifntempx(iintempx(itemp))=ntemp
+              ntempx(iintempx(itemp))=ntemp
+            end if
+          end if
+        end if 
+      end do 
+
+!.....find the possible three neighbors, neighbored by an edge only
+      do l=1,3
+
+!.....find first existing neighbor of any of the faces in array face_a
+        if_temp=.false.
+        if(l.eq.1)then
+          if_temp=.true.
+        elseif(l.eq.2)then
+          if(ifntempx(iintempx(l-1)).eq.0)then
+            if_temp=.true.
+          end if
+        elseif(l.eq.3)then
+          if(ifntempx(iintempx(l-1)).eq.0  &
+     &       .and.ifntempx(iintempx(l-2)).eq.0) then
+            if_temp=.true.
+          end if
+        end if
+           
+        if(if_temp)then
+          if (ifntempx(iintempx(l)).ne.0) then
+            nbe=ifntempx(iintempx(l))
+!...........if 1st neighbor exists, check the neighbor's two neighbors in
+!           the other two directions. 
+!           e.g. if l=1, check directions 2 and 3,i.e. itemp=2,3,1
+!           if l=2, itemp=3,1,-2
+!           if l=3, itemp=1,2,1
+!
+            do itemp=face_l1(l),face_l2(l),face_ld(l)
+!.............lc is the local corner number of this vertex on face face_a(itemp)
+!             on the neighbor element of iel, neighbored by a face face_a(l)
+              lc=local_corner(j(l),face_a(itemp))
+!.............temp is the vertex index of this vertex on the neighbor element
+!             neighbored by an edge
+              temp=cal_intempx(lc,face_a(itemp))
+              ii=cal_iijj(1,lc)
+              jj=cal_iijj(2,lc)
+              ntemp=sje(ii,jj,face_a(itemp),nbe)
+
+!.............if the face face_a(itemp) is conforming
+              if(ntemp.eq.0)then
+                ntemp=sje(1,1,face_a(itemp),nbe)
+                if(ntemp.ne.0)then
+                  if(ifsame(ntemp,c_f(lc,jjface(face_a(itemp))),  &
+     &            nbe,j(l)))then
+                    ntempx(temp)=ntemp
+                    ifntempx(temp)=ntemp
+!...................nnb(itemp) records the neighbor element neighbored by an
+!                   edge only
+                    nnb(itemp)=ntemp
+                  end if
+                end if
+
+!.............if the face face_a(itemp) is nonconforming
+              else
+                if(ntemp.ne.0)then
+                  if(ifsame(ntemp,c_f(lc,jjface(face_a(itemp))),  &
+     &              nbe,j(l)))then
+                    ntempx(temp)=ntemp
+                    ifntempx(temp)=ntemp
+                    nnb(itemp)=ntemp
+                  end if
+                end if
+              end if
+            end do
+
+!...........check the last neighbor element, neighbored by an edge
+
+!...........ifntempx(iintempx(l)) has been visited in the above, now 
+!           check another neighbor element(nbe) neighbored by a face 
+
+!...........if the neighbor element neighbored by face
+!           face_a(face_l1(l)) exists
+            if(ifntempx(iintempx(face_l1(l))).ne.0)then
+              nbe=ifntempx(iintempx(face_l1(l)))
+!.............itemp is the last direction other than l and face_l1(l)
+              itemp=face_l2(l)
+              lc=local_corner(j(face_l1(l)),face_a(itemp))
+              temp=cal_intempx(lc,face_a(itemp))
+              ii=cal_iijj(1,lc)
+              jj=cal_iijj(2,lc)
+
+!.............ntemp records the last neighbor element neighbored by an edge
+!             with element iel
+              ntemp=sje(ii,jj,face_a(itemp),nbe)
+!.............if conforming
+              if(ntemp.eq.0)then
+                ntemp=sje(1,1,face_a(itemp),nbe)
+                if(ntemp.ne.0)then
+                  if(ifsame(ntemp,c_f(lc,jjface(face_a(itemp))),  &
+     &              nbe,j(face_l1(l))))then
+                    ntempx(temp)=ntemp
+                    ifntempx(temp)=ntemp
+                    nnb(l)=ntemp
+                  end if
+                end if
+!.............if nonconforming
+              else
+                if(ntemp.ne.0)then
+                  if(ifsame(ntemp,c_f(lc,jjface(face_a(itemp))),  &
+     &            nbe,j(face_l1(l))))then
+                    ntempx(temp)=ntemp
+                    ifntempx(temp)=ntemp
+                    nnb(l)=ntemp
+                  end if
+                end if
+              end if
+
+!...........otherwise, if the neighbor element neighbored by face
+!           face_a(face_l2(l)) exists
+            elseif(ifntempx(iintempx(face_l2(l))).ne.0)then
+              nbe=ifntempx(iintempx(face_l2(l)))
+              itemp=face_l1(l)
+              lc=local_corner(j(face_l2(l)),face_a(itemp))
+              temp=cal_intempx(lc,face_a(itemp))
+              ii=cal_iijj(1,lc)
+              jj=cal_iijj(2,lc)
+              ntemp=sje(ii,jj,face_a(itemp),nbe)
+              if(ntemp.eq.0)then
+                ntemp=sje(1,1,face_a(itemp),nbe)
+                if(ntemp.ne.0)then
+                  if(ifsame(ntemp,c_f(lc,jjface(face_a(itemp))),  &
+     &            nbe,j(face_l2(l))))then
+                    ntempx(temp)=ntemp
+                    ifntempx(temp)=ntemp
+                    nnb(l)=ntemp
+                  end if
+                end if
+              else
+                if(ntemp.ne.0)then
+                  if(ifsame(ntemp,c_f(lc,jjface(face_a(itemp))),  &
+     &            nbe,j(face_l2(l))))then
+                    ntempx(temp)=ntemp
+                    ifntempx(temp)=ntemp
+                    nnb(l)=ntemp
+                  end if
+                end if
+              end if
+            endif
+          endif
+        end if
+      end do
+
+!.....check the neighbor element, neighbored by a vertex only
+
+!.....nnb are the three possible neighbor elements neighbored by an edge
+
+      nnb(1)=ifntempx(cal_nnb(1,i))
+      nnb(2)=ifntempx(cal_nnb(2,i))
+      nnb(3)=ifntempx(cal_nnb(3,i))
+      ntemp=0
+
+!.....the neighbor element neighbored by a vertex must be a neighbor of
+!     a valid(nonzero) nnb(i), neighbored by a face 
+
+      if(nnb(1).ne.0)then
+        lc=oplc(local_corner(i,face_a(3)))
+        ii=cal_iijj(1,lc)
+        jj=cal_iijj(2,lc)
+!.......ntemp records the neighbor of iel, neighbored by vertex i 
+        ntemp=sje(ii,jj,face_a(3),nnb(1))
+!.......temp is the vertex index of i on ntemp
+        temp=cal_intempx(lc,face_a(3))
+        if(ntemp.eq.0)then
+          ntemp=sje(1,1,face_a(3),nnb(1))
+          if(ntemp.ne.0)then
+            if(ifsame(ntemp,c_f(lc,jjface(face_a(3))),  &
+     &         iel,i))then
+              ntempx(temp)=ntemp
+              ifntempx(temp)=ntemp
+            end if
+          end if
+        else
+          if(ntemp.ne.0)then
+            if(ifsame(ntemp,c_f(lc,jjface(face_a(3))),  &
+     &         iel,i))then
+              ntempx(temp)=ntemp
+              ifntempx(temp)=ntemp
+            end if
+          end if
+        end if
+      elseif(nnb(2).ne.0)then
+        lc=oplc(local_corner(i,face_a(1)))
+        ii=cal_iijj(1,lc)
+        jj=cal_iijj(2,lc)
+        ntemp=sje(ii,jj,face_a(1),nnb(2))
+        temp=cal_intempx(lc,face_a(1))
+        if(ntemp.eq.0)then
+          ntemp=sje(1,1,face_a(1),nnb(2))
+          if(ntemp.ne.0)then
+            if(ifsame(ntemp,  &
+     &        c_f(lc,jjface(face_a(1))),iel,i))then
+              ntempx(temp)=ntemp
+              ifntempx(temp)=ntemp
+            end if
+          end if
+        else
+          if(ntemp.ne.0)then
+            if(ifsame(ntemp,  &
+     &      c_f(lc,jjface(face_a(1))),iel,i))then
+              ntempx(temp)=ntemp
+              ifntempx(temp)=ntemp
+            end if
+          end if
+        end if
+      elseif(nnb(3).ne.0)then
+        lc=oplc(local_corner(i,face_a(2)))
+        ii=cal_iijj(1,lc)
+        jj=cal_iijj(2,lc)
+        ntemp=sje(ii,jj,face_a(2),nnb(3))
+        temp=cal_intempx(lc, face_a(2))
+        if(ntemp.eq.0)then
+          ntemp=sje(1,1,face_a(2),nnb(3))
+          if(ntemp.ne.0)then
+            if(ifsame(ntemp,  &
+     &         c_f(lc,jjface(face_a(2))),iel,i))then
+              ifntempx(temp)=ntemp
+              ntempx(temp)=ntemp
+            end if
+          end if
+        else
+          if(ntemp.ne.0)then
+            if(ifsame(ntemp,  &
+     &        c_f(lc,jjface(face_a(2))),iel,i))then
+              ifntempx(temp)=ntemp
+              ntempx(temp)=ntemp
+            end if
+          end if
+        end if
+      end if
+
+!.....ifntempx records all elements sharing this vertex, assign count
+!     to all these elements.
+
+      if (ifntempx(1).ne.0) then
+        idmo(lx1,lx1,2,2,1,ntempx(1))=count
+        idmo(lx1,lx1,2,2,3,ntempx(1))=count
+        idmo(lx1,lx1,2,2,5,ntempx(1))=count
+        call get_emo(ntempx(1),count,8)
+      end if
+
+      if (ifntempx(2).ne.0) then
+        idmo(lx1,lx1,2,2,2,ntempx(2))=count
+        idmo(1,lx1,2,1,3,ntempx(2))=count
+        idmo(1,lx1,2,1,5,ntempx(2))=count
+        call get_emo(ntempx(2),count,7)
+      end if
+
+      if (ifntempx(3).ne.0) then
+        idmo(1,lx1,2,1,1,ntempx(3))=count
+        idmo(lx1,lx1,2,2,4,ntempx(3))=count
+        idmo(lx1,1,1,2,5,ntempx(3))=count
+        call get_emo(ntempx(3),count,6)
+      end if
+      if (ifntempx(4).ne.0) then
+        idmo(1,lx1,2,1,2,ntempx(4))=count
+        idmo(1,lx1,2,1,4,ntempx(4))=count
+        idmo(1,1,1,1,5,ntempx(4))=count
+        call get_emo(ntempx(4),count,5)
+      end if
+
+      if (ifntempx(5).ne.0) then
+        idmo(lx1,1,1,2,1,ntempx(5))=count
+        idmo(lx1,1,1,2,3,ntempx(5))=count
+        idmo(lx1,lx1,2,2,6,ntempx(5))=count
+        call get_emo(ntempx(5),count,4)
+      end if
+
+
+      if (ifntempx(6).ne.0) then
+        idmo(lx1,1,1,2,2,ntempx(6))=count
+        idmo(1,1,1,1,3,ntempx(6))=count
+        idmo(1,lx1,2,1,6,ntempx(6))=count
+        call get_emo(ntempx(6),count,3)
+      end if
+
+      if (ifntempx(7).ne.0) then
+        idmo(1,1,1,1,1,ntempx(7))=count
+        idmo(lx1,1,1,2,4,ntempx(7))=count
+        idmo(lx1,1,1,2,6,ntempx(7))=count
+        call get_emo(ntempx(7),count,2)
+      end if
+
+      if (ifntempx(8).ne.0) then
+        idmo(1,1,1,1,2,ntempx(8))=count
+        idmo(1,1,1,1,4,ntempx(8))=count
+        idmo(1,1,1,1,6,ntempx(8))=count
+        call get_emo(ntempx(8),count,1)
+      end if
+
+      return
+      end
+
+     
+!---------------------------------------------------------------
+      subroutine mor_ne(mor_v,nn,edge,face,edge2,face2,ntemp,iel)
+!---------------------------------------------------------------
+!     Copy the mortar point indices (mor_v + vertex mortar point) from
+!     edge'th local edge on face'th face on element ntemp to iel.
+!     ntemp is iel's neighbor, neighbored by this edge only. 
+!     This subroutine is for the situation that iel is of larger
+!     size than ntemp.  
+!     face, face2 are face indices
+!     edge and edge2 are local edge numbers of this edge on face and face2
+!     nn is the edge mortar index, which indicates whether this edge
+!     corresponds to the left/bottom or right/top part of the edge
+!     on iel.
+!---------------------------------------------------------------
+
+      use ua_data
+      implicit none
+
+      integer mor_v(3),nn,edge,face,edge2,face2,ntemp,iel, i,  &
+     &mor_s_v(4)
+
+!.....get mor_s_v which is the mor_v + vertex mortar
+      if (edge.eq.3) then
+        if(nn.eq.1)then
+          do i=2,lx1-1
+            mor_s_v(i-1)=mor_v(i-1)
+          end do
+          mor_s_v(4)=idmo(lx1,lx1,2,2,face,ntemp)
+        else
+          mor_s_v(1)=idmo(1,lx1,2,1,face,ntemp)
+          do i=2,lx1-1
+            mor_s_v(i)=mor_v(i-1)
+          end do
+        endif
+      
+      elseif (edge.eq.4) then
+        if(nn.eq.1)then
+          do i=2,lx1-1
+            mor_s_v(i-1)=mor_v(i-1)
+          end do
+          mor_s_v(4)=idmo(1,lx1,2,1,face,ntemp)
+        else
+          mor_s_v(1)=idmo(1,1,1,1,face,ntemp)
+          do i=2,lx1-1
+            mor_s_v(i)=mor_v(i-1)
+          end do
+        endif
+
+      elseif (edge.eq.1) then
+        if(nn.eq.1)then
+          do i=2,lx1-1
+            mor_s_v(i-1)=mor_v(i-1)
+          end do
+          mor_s_v(4)=idmo(lx1,1,1,2,face,ntemp)
+        else
+          mor_s_v(1)=idmo(1,1,1,1,face,ntemp)
+          do i=2,lx1-1
+            mor_s_v(i)=mor_v(i-1)
+          end do
+         endif
+
+      else if (edge.eq.2) then
+        if(nn.eq.1)then
+          do i=2,lx1-1
+             mor_s_v(i-1)=mor_v(i-1)
+          end do
+          mor_s_v(4)=idmo(lx1,lx1,2,2,face,ntemp)
+        else
+          mor_s_v(1)=idmo(lx1,1,1,2,face,ntemp)
+          do i=2,lx1-1
+             mor_s_v(i)=mor_v(i-1)
+          end do
+        endif
+      end if
+
+!.....copy mor_s_v to iel's local edge(op(edge)), on face jjface(face)
+      call mor_s_e_nn(op(edge),jjface(face),iel,mor_s_v,nn)
+!.....copy mor_s_v to iel's local edge(op(edge2)), on face jjface(face2)
+!     since this edge is shared by two faces on iel
+      call mor_s_e_nn(op(edge2),jjface(face2),iel,mor_s_v,nn)
+
+      return
+      end
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/UA/move.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/UA/move.f90
new file mode 100644
index 000000000..3edd2dec2
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/UA/move.f90
@@ -0,0 +1,89 @@
+!---------------------------------------------------------------
+      subroutine move
+!---------------------------------------------------------------
+!     Move elements to their proper location on the Morton space-filling curve
+!---------------------------------------------------------------
+
+      use ua_data
+      implicit none
+
+      integer i,iside,jface,iel,ntemp,ii1,ii2,n1,n2,cb
+
+      n2=2*6*nelt
+      n1=n2*2
+
+      call nr_init_omp(sje_new,n1,0)
+      call nr_init_omp(ijel_new,n2,0)
+
+!$OMP PARALLEL DEFAULT(SHARED) PRIVATE(iel,i,iside,jface,cb,ntemp,  &
+!$OMP& ii1,ii2) 
+!$OMP DO
+      do iel=1,nelt
+        i=mt_to_id(iel)
+        treenew(iel)=tree(i)
+        call copy(xc_new(1,iel),xc(1,i),8)
+        call copy(yc_new(1,iel),yc(1,i),8)
+        call copy(zc_new(1,iel),zc(1,i),8)
+
+        do iside=1,nsides
+          jface = jjface(iside)
+          cb=cbc(iside,i)
+          xc_new(iside,iel)=xc(iside,i)
+          yc_new(iside,iel)=yc(iside,i)
+          zc_new(iside,iel)=zc(iside,i)
+          cbc_new(iside,iel)=cb
+
+          if(cb.eq.2)then
+            ntemp=sje(1,1,iside,i)
+            ijel_new(1,iside,iel)=1
+            ijel_new(2,iside,iel)=1
+            sje_new(1,1,iside,iel)=id_to_mt(ntemp)
+
+          else if(cb.eq.1) then
+            ntemp=sje(1,1,iside,i)
+            ijel_new(1,iside,iel)=ijel(1,iside,i)
+            ijel_new(2,iside,iel)=ijel(2,iside,i)
+            sje_new(1,1,iside,iel)=id_to_mt(ntemp)
+         
+          else if(cb.eq.3) then
+            do ii2=1,2
+              do ii1=1,2
+                ntemp=sje(ii1,ii2,iside,i)
+                ijel_new(1,iside,iel)=1
+                ijel_new(2,iside,iel)=1
+                sje_new(ii1,ii2,iside,iel)=id_to_mt(ntemp)
+              end do
+            end do
+
+          else if(cb.eq.0)then
+            sje_new(1,1,iside,iel)=0
+            sje_new(1,2,iside,iel)=0
+            sje_new(2,1,iside,iel)=0
+            sje_new(2,2,iside,iel)=0       
+          end if 
+
+        end do
+
+        call copy(ta2(1,1,1,iel),ta1(1,1,1,i),nxyz)
+
+      end do
+!$OMP ENDDO
+
+!$OMP DO
+      do iel=1,nelt
+        call copy(xc(1,iel),xc_new(1,iel),8)
+        call copy(yc(1,iel),yc_new(1,iel),8)
+        call copy(zc(1,iel),zc_new(1,iel),8)
+        call copy(ta1(1,1,1,iel),ta2(1,1,1,iel),nxyz)
+        call ncopy(sje(1,1,1,iel),sje_new(1,1,1,iel),4*6)
+        call ncopy(ijel(1,1,iel),ijel_new(1,1,iel),2*6)
+        call ncopy(cbc(1,iel),cbc_new(1,iel),6)
+        mt_to_id(iel)=iel
+        id_to_mt(iel)=iel
+        tree(iel)=treenew(iel)
+      end do
+!$OMP ENDDO 
+!$OMP END PARALLEL
+
+      return
+      end 
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/UA/precond.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/UA/precond.f90
new file mode 100644
index 000000000..40b67ab77
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/UA/precond.f90
@@ -0,0 +1,797 @@
+!------------------------------------------------------------------
+      subroutine setuppc
+!------------------------------------------------------------------
+!     Generate diagonal preconditioner for CG.
+!     The preconditioner computed in this subroutine is correct only
+!     for collocation points in the element interior, on conforming
+!     face interiors, and on conforming edges.
+!------------------------------------------------------------------
+
+      use ua_data
+      implicit none
+
+      double precision dxtm1_2(lx1,lx1), rdtime
+      integer ie,k,i,j,q,isize
+
+      do j=1,lx1
+        do i=1,lx1
+          dxtm1_2(i,j)=dxtm1(i,j)**2
+        end do
+      end do
+
+      rdtime=1.d0/dtime
+
+!$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(ie,isize,i,j,k,q) 
+      do ie = 1, nelt
+        call r_init(dpcelm(1,1,1,ie),nxyz,0.d0)
+        isize=size_e(ie)
+        do k = 1, lx1
+          do j = 1, lx1
+            do i = 1, lx1
+              do q = 1, lx1
+                dpcelm(i,j,k,ie) = dpcelm(i,j,k,ie) +  &
+     &                        g1m1_s(q,j,k,isize) * dxtm1_2(i,q) +  &
+     &                        g1m1_s(i,q,k,isize) * dxtm1_2(j,q) +  &
+     &                        g1m1_s(i,j,q,isize) * dxtm1_2(k,q)
+              end do
+              dpcelm(i,j,k,ie)=visc*dpcelm(i,j,k,ie)+  &
+     &                      rdtime*bm1_s(i,j,k,isize)
+            end do
+          end do
+        end do
+      end do
+!$OMP END PARALLEL DO
+
+!.....do the stiffness summation
+      call dssum
+
+!.....take inverse.
+
+      call reciprocal(dpcelm,ntot)
+
+!.....compute preconditioner on mortar points. NOTE:  dpcmor for 
+!     nonconforming cases will be corrected in subroutine setpcmo 
+!$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(I)
+      do i=1,nmor
+        dpcmor(i)=1.d0/dpcmor(i)
+      end do
+!$OMP END PARALLEL DO
+
+      return
+      end
+
+
+!--------------------------------------------------------------
+      subroutine setpcmo_pre
+!--------------------------------------------------------------
+!     pre-compute elemental contribution to preconditioner  
+!     for all situations
+!--------------------------------------------------------------
+      
+      use ua_data
+      implicit none
+
+      integer element_size, i, j, ii, jj, col
+      double precision  &
+     &       p(lx1,lx1,lx1), p0(lx1,lx1,lx1), mtemp(lx1,lx1),  &
+     &       temp(lx1,lx1,lx1), temp1(lx1,lx1), tmp(lx1,lx1),tig(lx1)
+
+!.....corners on face of type 3 
+
+      call r_init(tcpre,lx1*lx1,0.d0)
+      call r_init(tmp,lx1*lx1,0.d0)
+      call r_init(tig,lx1,0.d0)
+      tig(1)   =1.d0
+      tmp(1,1) =1.d0 
+
+!.....tcpre results from mapping a unit spike field (unity at 
+!     collocation point (1,1), zero elsewhere) on an entire element
+!     face to the (1,1) segment of a nonconforming face
+      do i=2,lx1-1
+        do j=1,lx1
+          tmp(i,1) = tmp(i,1)+ qbnew(i-1,j,1)*tig(j)
+        end do
+      end do
+ 
+      do col=1,lx1
+        tcpre(col,1)=tmp(col,1)
+
+        do j=2,lx1-1
+          do i=1,lx1
+            tcpre(col,j) = tcpre(col,j) + qbnew(j-1,i,1)*  &
+     &                                     tmp(col,i)
+          end do
+        end do
+      end do
+
+!$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(element_size,i,j,p,temp,  &
+!$OMP& mtemp,temp1,p0,ii,jj)
+      do element_size=1,refine_max
+
+!.......for conforming cases
+
+!.......pcmor_c (i,j,element_size) records the intermediate value 
+!       (preconditioner=1/pcmor_c) of the preconditioner on collocation
+!       point (i,j) on a conforming face of an element of size 
+!       element_size.
+
+        do j=1,lx1/2+1
+          do i=j,lx1/2+1
+            call r_init(p,nxyz,0.d0)
+            p(i,j,1)=1.d0
+            call laplacian(temp,p,element_size)
+            pcmor_c(i,j,element_size)=temp(i,j,1)
+            pcmor_c(lx1+1-i,j,element_size)=temp(i,j,1)
+            pcmor_c(j,i,element_size)=temp(i,j,1)
+            pcmor_c(lx1+1-j,i,element_size)=temp(i,j,1)
+            pcmor_c(j,lx1+1-i,element_size)=temp(i,j,1)
+            pcmor_c(lx1+1-j,lx1+1-i,element_size)=temp(i,j,1)
+            pcmor_c(i,lx1+1-j,element_size)=temp(i,j,1)
+            pcmor_c(lx1+1-i,lx1+1-j,element_size)=temp(i,j,1)
+          end do
+        end do
+
+!.......for nonconforming cases 
+
+!.......nonconforming face interior
+
+!.......pcmor_nc1(i,j,ii,jj,element_size) records the intermediate 
+!       preconditioner value on collocation point (i,j) on mortar 
+!       (ii,jj) on a nonconforming face of an element of size
+!       element_size
+        do j=2,lx1
+          do i=j,lx1
+            call r_init(mtemp,lx1*lx1,0.d0)
+            call r_init(p,nxyz,0.d0)
+            mtemp(i,j)=1.d0
+!...........when i, j=lx1, mortar points are duplicated, so mtemp needs
+!           to be doubled.
+            if(i.eq.lx1)mtemp(i,j)=mtemp(i,j)*2.d0
+            if(j.eq.lx1)mtemp(i,j)=mtemp(i,j)*2.d0
+            call transf_nc(mtemp,p)
+            call laplacian(temp,p,element_size)
+            call transfb_nc1(temp1,temp)
+
+!...........values at points (i,j) and (j,i) are the same
+            pcmor_nc1(i,j,1,1,element_size)=temp1(i,j)
+            pcmor_nc1(j,i,1,1,element_size)=temp1(i,j)
+          end do
+
+!.........when i, j=lx1, mortar points are duplicated. so pcmor_nc1 needs
+!         to be doubled on those points
+          pcmor_nc1(lx1,j,1,1,element_size)=  &
+     &          pcmor_nc1(lx1,j,1,1,element_size)*2.d0
+          pcmor_nc1(j,lx1,1,1,element_size)=  &
+     &          pcmor_nc1(lx1,j,1,1,element_size)
+
+        end do
+        pcmor_nc1(lx1,lx1,1,1,element_size)=  &
+     &      pcmor_nc1(lx1,lx1,1,1,element_size)*2.d0
+
+!.......nonconforming edges
+        j=1
+        do i=2,lx1
+          call r_init(mtemp,lx1*lx1,0.d0)
+          call r_init(p,nxyz,0.d0)
+          call r_init(p0,nxyz,0.d0)
+          mtemp(i,j)=1.d0
+          if(i.eq.lx1)mtemp(i,j)=2.d0
+          call transf_nc(mtemp,p)
+          call laplacian(temp,p,element_size)                          
+          call transfb_nc1(temp1,temp)                   
+          pcmor_nc1(i,j,1,1,element_size)=temp1(i,j)      
+          pcmor_nc1(j,i,1,1,element_size)=temp1(i,j)                              
+          do ii=1,lx1
+!...........p0 is for the case that a nonconforming edge is shared by
+!           two conforming faces
+            p0(ii,1,1)=p(ii,1,1)
+            do jj=1,lx1 
+!.............now p is for the case that a nonconforming edge is shared
+!             by nonconforming faces
+              p(ii,1,jj)=p(ii,jj,1)
+            end do
+          end do
+
+          call laplacian(temp,p,element_size)
+          call transfb_nc2(temp1,temp)                
+
+!.........pcmor_nc2(i,j,ii,jj,element_size) gives the intermediate
+!         preconditioner value on collocation point (i,j) on a 
+!         nonconforming face of an element with size element_size
+
+          pcmor_nc2(i,j,1,1,element_size)=temp1(i,j)*2.d0 
+          pcmor_nc2(j,i,1,1,element_size)=  &
+     &          pcmor_nc2(i,j,1,1,element_size)
+
+          call laplacian(temp,p0,element_size) 
+          call transfb_nc0(temp1,temp)                  
+
+!.........pcmor_nc0(i,j,ii,jj,element_size) gives the intermediate
+!         preconditioner value on collocation point (i,j) on a 
+!         conforming face of an element, which shares a nonconforming 
+!         edge with another conforming face
+          pcmor_nc0(i,j,1,1,element_size)=temp1(i,j)
+          pcmor_nc0(j,i,1,1,element_size)=temp1(i,j)
+        end do
+        pcmor_nc1(lx1,j,1,1,element_size)=  &
+     &        pcmor_nc1(lx1,j,1,1,element_size)*2.d0
+        pcmor_nc1(j,lx1,1,1,element_size)=  &
+     &        pcmor_nc1(lx1,j,1,1,element_size)
+        pcmor_nc2(lx1,j,1,1,element_size)=  &
+     &        pcmor_nc2(lx1,j,1,1,element_size)*2.d0
+        pcmor_nc2(j,lx1,1,1,element_size)=  &
+     &        pcmor_nc2(lx1,j,1,1,element_size)
+        pcmor_nc0(lx1,j,1,1,element_size)=  &
+     &        pcmor_nc0(lx1,j,1,1,element_size)*2.d0
+        pcmor_nc0(j,lx1,1,1,element_size)=  &
+     &        pcmor_nc0(lx1,j,1,1,element_size)
+
+!.......symmetrical copy
+        do i=1,lx1-1
+          pcmor_nc1(i,j,1,2,element_size)=  &
+     &          pcmor_nc1(lx1+1-i,j,1,1,element_size)
+          pcmor_nc0(i,j,1,2,element_size)=                                           &
+     &          pcmor_nc0(lx1+1-i,j,1,1,element_size)                                      
+          pcmor_nc2(i,j,1,2,element_size)=                                           &
+     &          pcmor_nc2(lx1+1-i,j,1,1,element_size)                                      
+        end do
+
+        do j=2,lx1                                            
+          do i=1,lx1-1
+            pcmor_nc1(i,j,1,2,element_size)=  &
+     &            pcmor_nc1(lx1+1-i,j,1,1,element_size)
+          end do
+          i=lx1
+          pcmor_nc1(i,j,1,2,element_size)=  &
+     &          pcmor_nc1(lx1+1-i,j,1,1,element_size)
+          pcmor_nc0(i,j,1,2,element_size)=                                           &
+     &          pcmor_nc0(lx1+1-i,j,1,1,element_size)                                      
+          pcmor_nc2(i,j,1,2,element_size)=                                           &
+     &          pcmor_nc2(lx1+1-i,j,1,1,element_size)                                      
+        end do                                                
+
+        j=1
+        i=1
+        pcmor_nc1(i,j,2,1,element_size)=  &
+     &        pcmor_nc1(i,lx1+1-j,1,1,element_size)
+        pcmor_nc0(i,j,2,1,element_size)=  &
+     &        pcmor_nc0(i,lx1+1-j,1,1,element_size)
+        pcmor_nc2(i,j,2,1,element_size)=  &
+     &        pcmor_nc2(i,lx1+1-j,1,1,element_size)
+        do j=2,lx1-1
+          i=1
+          pcmor_nc1(i,j,2,1,element_size)=  &
+     &          pcmor_nc1(i,lx1+1-j,1,1,element_size)
+          pcmor_nc0(i,j,2,1,element_size)=  &
+     &          pcmor_nc0(i,lx1+1-j,1,1,element_size)
+          pcmor_nc2(i,j,2,1,element_size)=  &
+     &          pcmor_nc2(i,lx1+1-j,1,1,element_size)
+          do i=2,lx1
+            pcmor_nc1(i,j,2,1,element_size)=  &
+     &            pcmor_nc1(i,lx1+1-j,1,1,element_size)
+          end do
+        end do
+
+        j=lx1
+        do i=2,lx1
+          pcmor_nc1(i,j,2,1,element_size)=  &
+     &          pcmor_nc1(i,lx1+1-j,1,1,element_size)
+          pcmor_nc0(i,j,2,1,element_size)=  &
+     &          pcmor_nc0(i,lx1+1-j,1,1,element_size)
+          pcmor_nc2(i,j,2,1,element_size)=  &
+     &          pcmor_nc2(i,lx1+1-j,1,1,element_size)
+        end do
+
+        j=1
+        i=lx1
+        pcmor_nc1(i,j,2,2,element_size)=  &
+     &        pcmor_nc1(lx1+1-i,lx1+1-j,1,1,element_size)
+        pcmor_nc0(i,j,2,2,element_size)=  &
+     &        pcmor_nc0(lx1+1-i,lx1+1-j,1,1,element_size)
+        pcmor_nc2(i,j,2,2,element_size)=  &
+     &        pcmor_nc2(lx1+1-i,lx1+1-j,1,1,element_size)
+          
+        do j=2,lx1-1                                            
+          do i=2,lx1-1
+            pcmor_nc1(i,j,2,2,element_size)=  &
+     &            pcmor_nc1(lx1+1-i,lx1+1-j,1,1,element_size)
+          end do
+          i=lx1
+          pcmor_nc1(i,j,2,2,element_size)=                                       &
+     &          pcmor_nc1(lx1+1-i,lx1+1-j,1,1,element_size)                               
+          pcmor_nc0(i,j,2,2,element_size)=                                       &
+     &          pcmor_nc0(lx1+1-i,lx1+1-j,1,1,element_size)   
+          pcmor_nc2(i,j,2,2,element_size)=                                       &
+     &          pcmor_nc2(lx1+1-i,lx1+1-j,1,1,element_size)                     
+        end do                                                
+        j=lx1
+        do i=2,lx1-1
+          pcmor_nc1(i,j,2,2,element_size)=                                       &
+     &          pcmor_nc1(lx1+1-i,lx1+1-j,1,1,element_size)          
+          pcmor_nc0(i,j,2,2,element_size)=  &
+     &          pcmor_nc0(lx1+1-i,lx1+1-j,1,1,element_size)          
+          pcmor_nc2(i,j,2,2,element_size)=                                       &
+     &          pcmor_nc2(lx1+1-i,lx1+1-j,1,1,element_size)    
+        end do
+
+
+!.......vertices shared by at least one nonconforming face or edge
+
+!.......Among three edges and three faces sharing a vertex on an element
+!       situation 1: only one edge is nonconforming
+!       situation 2: two edges are nonconforming
+!       situation 3: three edges are nonconforming
+!       situation 4: one face is nonconforming 
+!       situation 5: one face and one edge are nonconforming 
+!       situation 6: two faces are nonconforming
+!       situation 7: three faces are nonconforming
+
+        call r_init(p0,nxyz,0.d0)
+        p0(1,1,1)=1.d0
+        call laplacian(temp,p0,element_size)
+        pcmor_cor(8,element_size)=temp(1,1,1)
+
+!.......situation 1
+        call r_init(p0,nxyz,0.d0)
+        do i=1,lx1
+           p0(i,1,1)=tcpre(i,1)
+        end do
+        call laplacian(temp,p0,element_size) 
+        call transfb_cor_e(1,pcmor_cor(1,element_size),temp)                  
+
+!.......situation 2
+        call r_init(p0,nxyz,0.d0)
+        do i=1,lx1
+           p0(i,1,1)=tcpre(i,1)
+           p0(1,i,1)=tcpre(i,1)
+        end do
+        call laplacian(temp,p0,element_size)
+        call transfb_cor_e(2,pcmor_cor(2,element_size),temp)                  
+
+!.......situation 3
+        call r_init(p0,nxyz,0.d0)
+        do i=1,lx1
+           p0(i,1,1)=tcpre(i,1)
+           p0(1,i,1)=tcpre(i,1)
+           p0(1,1,i)=tcpre(i,1)
+        end do
+        call laplacian(temp,p0,element_size)
+        call transfb_cor_e(3,pcmor_cor(3,element_size),temp)                  
+
+!.......situation 4
+        call r_init(p0,nxyz,0.d0)
+        do j=1,lx1
+          do i=1,lx1
+            p0(i,j,1)=tcpre(i,j)
+          end do
+        end do
+        call laplacian(temp,p0,element_size)
+        call transfb_cor_f(4,pcmor_cor(4,element_size),temp)
+
+!.......situation 5
+        call r_init(p0,nxyz,0.d0)
+        do j=1,lx1
+          do i=1,lx1
+            p0(i,j,1)=tcpre(i,j)
+          end do
+        end do
+        do i=1,lx1
+           p0(1,1,i)=tcpre(i,1)
+        end do
+        call laplacian(temp,p0,element_size)
+        call transfb_cor_f(5,pcmor_cor(5,element_size),temp)
+ 
+!.......situation 6
+        call r_init(p0,nxyz,0.d0)
+        do j=1,lx1
+          do i=1,lx1
+            p0(i,j,1)=tcpre(i,j)
+            p0(i,1,j)=tcpre(i,j)
+          end do
+        end do
+        call laplacian(temp,p0,element_size)
+        call transfb_cor_f(6,pcmor_cor(6,element_size),temp)
+
+!.......situation 7
+        do j=1,lx1
+          do i=1,lx1
+            p0(i,j,1)=tcpre(i,j)
+            p0(i,1,j)=tcpre(i,j)
+            p0(1,i,j)=tcpre(i,j)
+          end do
+        end do
+        call laplacian(temp,p0,element_size)
+        call transfb_cor_f(7,pcmor_cor(7,element_size),temp)
+
+      end do    
+!$OMP END PARALLEL DO     
+      return
+      end 
+
+
+!------------------------------------------------------------------------
+      subroutine setpcmo
+!------------------------------------------------------------------------
+!     compute the preconditioner by identifying its geometry configuration
+!     and sum the values from the precomputed elemental contributions
+!------------------------------------------------------------------------
+      
+      use ua_data
+      implicit none
+
+      integer face2, nb1, nb2, sizei, imor, enum, i,j,  &
+     &        iel, iside, nn1, nn2
+
+!$OMP PARALLEL DEFAULT(SHARED) PRIVATE(IMOR,IEL,ISIDE,I) 
+!$OMP DO
+      do imor=1,nvertex
+       ifpcmor(imor)=.false.
+      end do
+!$OMP END DO nowait
+   
+!$OMP DO 
+      do iel=1,nelt
+        do iside=1,nsides
+          do i=1,4
+            edgevis(i,iside,iel)=.false.
+          end do 
+        end do 
+      end do 
+!$OMP END DO 
+!$OMP END PARALLEL
+
+!$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(IEL,iside,sizei,  &
+!$OMP& imor,enum,face2,nb1,nb2,i,j,nn1,nn2) 
+
+      do iel=1,nelt
+        do iside=1,nsides
+!.........for nonconforming faces
+          if(cbc(iside,iel).eq.3)then
+            sizei=size_e(iel)
+
+!...........vertices
+
+!...........ifpcmor(imor)=.true. indicates that mortar point imor has 
+!           been visited
+            imor=idmo(1,1,1,1,iside,iel)
+            if(.not.ifpcmor(imor))then
+!.............compute the preconditioner on mortar point imor
+              call pc_corner(imor)
+              ifpcmor(imor)=.true.
+            end if
+
+            imor=idmo(lx1,1,1,2,iside,iel)
+            if(.not.ifpcmor(imor))then
+              call pc_corner(imor)
+              ifpcmor(imor)=.true.
+            end if
+
+            imor=idmo(1,lx1,2,1,iside,iel)
+            if(.not.ifpcmor(imor))then
+              call pc_corner(imor)
+              ifpcmor(imor)=.true.
+            end if
+
+            imor=idmo(lx1,lx1,2,2,iside,iel)
+            if(.not.ifpcmor(imor))then
+              call pc_corner(imor)
+              ifpcmor(imor)=.true.
+            end if
+
+!...........edges on nonconforming faces, enum is local edge number
+            do enum=1,4
+
+!.............edgevis(enum,iside,iel)=.true. indicates that local edge 
+!             enum of face iside of iel has been visited
+              if(.not.edgevis(enum,iside,iel))then
+                edgevis(enum,iside,iel)=.true.
+
+!...............Examining neighbor element information,
+!               calculating the preconditioner value.
+                face2= f_e_ef(enum,iside)
+                if(cbc(face2,iel).eq.2)then
+                  nb1=sje(1,1,face2,iel)
+                  if(cbc(iside,nb1).eq.2)then
+
+!...................Compute the preconditioner on local edge enum on face
+!                   iside of element iel. The argument 1 is the neighborhood
+!                   configuration obtained by examining neighbor nb1; for the
+!                   meaning of each configuration, see subroutine com_dpc.
+
+                    call com_dpc(iside,iel,enum,1,sizei)
+                    nb2=sje(1,1,iside,nb1)
+                    edgevis(op(e_face2(enum,iside)),  &
+     &                      jjface(face2),nb2)=.true.
+
+                  elseif(cbc(iside,nb1).eq.3)then
+                    call com_dpc(iside,iel,enum,2,sizei)
+                    edgevis(op(enum),iside,nb1)=.true.
+                  end if
+
+                elseif(cbc(face2,iel).eq.3)then
+                  edgevis(e_face2(enum,iside),face2,iel)=.true.
+                  nb1=sje(1,2,face2,iel)
+                  if(cbc(iside,nb1).eq.1)then
+                    call com_dpc(iside,iel,enum,3,sizei)
+                    nb2=sje(1,1,iside,nb1)
+                    edgevis(op(enum),jjface(iside),nb2)=.true.
+                    edgevis(op(e_face2(enum,iside)),  &
+     &                      jjface(face2),nb2)=.true.
+                  elseif(cbc(iside,nb1).eq.2)then
+                    call com_dpc(iside,iel,enum,4,sizei)
+                  end if
+                else if (cbc(face2,iel).eq.0)then
+                  call com_dpc(iside,iel,enum,0,sizei)
+                end if
+              end if
+            end do
+
+!...........mortar element interior (not edge of mortar) 
+
+            do nn1=1,2
+              do nn2=1,2
+                do j=2,lx1-1
+                  do i=2,lx1-1
+                    imor=idmo(i,j,nn1,nn2,iside,iel) 
+                    dpcmor(imor) = 1.d0/(pcmor_nc1(i,j,nn1,nn2,sizei)+  &
+     &                                pcmor_c(i,j,sizei+1))
+                  end do
+                end do
+              end do
+            end do
+
+!...........for i,j=lx1 there are duplicated mortar points, so 
+!           pcmor_c needs to be doubled or quadrupled
+            i=lx1
+            do j=2,lx1-1
+              imor=idmo(i,j,1,1,iside,iel)            
+              dpcmor(imor) = 1.d0/(pcmor_nc1(i,j,1,1,sizei)+  &
+     &                          pcmor_c(i,j,sizei+1)*2.d0)
+              imor=idmo(i,j,2,1,iside,iel)                
+              dpcmor(imor) = 1.d0/(pcmor_nc1(i,j,2,1,sizei)+  &
+     &                          pcmor_c(i,j,sizei+1)*2.d0)
+            end do      
+
+            j=lx1
+            imor=idmo(i,j,1,1,iside,iel)                                         
+            dpcmor(imor) = 1.d0/(pcmor_nc1(i,j,1,1,sizei)+  &
+     &                        pcmor_c(i,j,sizei+1)*4.d0)
+            do i=2,lx1-1
+              imor=idmo(i,j,1,1,iside,iel)  
+              dpcmor(imor) = 1.d0/(pcmor_nc1(i,j,1,1,sizei)+  &
+     &                          pcmor_c(i,j,sizei+1)*2.d0)
+              imor=idmo(i,j,1,2,iside,iel) 
+              dpcmor(imor) = 1.d0/(pcmor_nc1(i,j,1,2,sizei)+  &
+     &                          pcmor_c(i,j,sizei+1)*2.d0)
+            end do
+
+          end if 
+        end do
+      end do
+!$OMP END PARALLEL DO
+
+      return
+      end
+
+!--------------------------------------------------------------------------
+      subroutine pc_corner(imor)
+!------------------------------------------------------------------------
+!     calculate preconditioner value for vertex with mortar index imor
+!------------------------------------------------------------------------
+
+      use ua_data
+      implicit none
+
+      double precision tmortemp
+      integer imor, inemo,ie, sizei,cornernumber,  &
+     &        sface,sedge,iiface,iface,iiedge,iedge,n
+
+      tmortemp=0.d0
+!.....loop over all elements sharing this vertex
+      do inemo=1,nemo(imor)
+        ie=emo(1,inemo,imor)
+        sizei=size_e(ie)
+        cornernumber=emo(2,inemo,imor)
+        sface=0
+        sedge=0
+        do iiface=1,3
+          iface=f_c(iiface,cornernumber)
+!.........sface sums the number of nonconforming faces sharing this vertex on
+!         one element
+          if(cbc(iface,ie).eq.3)then
+            sface=sface+1
+          end if
+        end do
+!.......sedge sums the number of nonconforming edges sharing this vertex on
+!       one element
+        do iiedge=1,3
+          iedge=e_c(iiedge,cornernumber)
+          if(ncon_edge(iedge,ie))sedge=sedge+1
+        end do
+
+!.......n identifies the configuration: how many nonconforming faces and
+!       nonconforming edges share this vertex on an element
+
+        if(sface.eq.0)then
+          if(sedge.eq.0)then
+             n=8
+          elseif(sedge.eq.1)then
+             n=1
+          elseif(sedge.eq.2)then
+             n=2
+          elseif(sedge.eq.3)then
+             n=3
+          end if 
+        elseif (sface.eq.1)then
+          if (sedge.eq.1)then
+           n=5
+          else
+           n=4
+          end if
+        else if (sface.eq.2)then
+           n=6
+        else if(sface.eq.3)then
+           n=7
+        end if
+          
+!.......sum the intermediate pre-computed preconditioner values for 
+!       all elements
+        tmortemp=tmortemp+pcmor_cor(n,sizei)
+
+      end do
+
+!.....dpcmor(imor) is the value of the preconditioner on mortar point imor
+      dpcmor(imor)=1.d0/tmortemp
+
+      return
+      end 
+
+!------------------------------------------------------------------------
+      subroutine com_dpc(iside,iel,enumber,n,isize)
+!------------------------------------------------------------------------
+!     Compute preconditioner for local edge enumber of face iside 
+!     on element iel.
+!     isize is element size,
+!     n is one of five different configurations
+!     anc1, ac, anc2, anc0 are coefficients for different edges. 
+!     nc0 refers to nonconforming edge shared by two conforming faces
+!     nc1 refers to nonconforming edge shared by one nonconforming face
+!     nc2 refers to nonconforming edge shared by two nonconforming faces
+!     c refers to conforming edge
+!------------------------------------------------------------------------
+
+      use ua_data
+      implicit none
+
+      integer n, isize,iside,iel, enumber, nn1start, nn1end, nn2start,  &
+     &        nn2end, jstart, jend, istart, iend, i, j, nn1, nn2, imor
+      double precision anc1,ac,anc2,anc0,temp
+
+!.....different local edges have different loop ranges 
+      if(enumber.eq.1)then
+        nn1start=1
+        nn1end=1
+        nn2start=1
+        nn2end=2
+        jstart=1
+        jend=1
+        istart=2
+        iend=lx1-1
+      elseif (enumber.eq.2) then
+        nn1start=1
+        nn1end=2
+        nn2start=2
+        nn2end=2
+        jstart=2
+        jend=lx1-1
+        istart=lx1
+        iend=lx1
+      elseif (enumber.eq.3) then
+        nn1start=2
+        nn1end=2
+        nn2start=1
+        nn2end=2
+        jstart=lx1
+        jend=lx1
+        istart=2
+        iend=lx1-1
+      elseif (enumber.eq.4) then
+        nn1start=1
+        nn1end=2
+        nn2start=1
+        nn2end=1
+        jstart=2
+        jend=lx1-1
+        istart=1
+        iend=1
+      end if
+
+!.....among the four elements sharing this edge
+
+!.....one has a smaller size
+      if(n.eq.1)then
+        anc1=2.d0
+        ac=1.d0
+        anc0=1.d0
+        anc2=0.d0
+
+!.....two (neighbored by a face) are of smaller size
+      else if (n.eq.2)then
+        anc1=2.d0
+        ac=2.d0
+        anc0=0.d0
+        anc2=0.d0
+
+!.....two (neighbored by an edge) are of smaller size
+      else if (n.eq.3)then
+        anc2=2.d0
+        ac=2.d0
+        anc1=0.d0
+        anc0=0.d0
+
+!.....three are of smaller size
+      else if (n.eq.4)then
+        anc1=0.d0
+        ac=3.d0
+        anc2=1.d0
+        anc0=0.d0
+
+!.....on the boundary
+      else if (n.eq.0)then
+        anc1=1.d0
+        ac=1.d0
+        anc2=0.d0
+        anc0=0.d0
+      end if
+
+!.....edge interior
+      do nn2=nn2start,nn2end
+        do nn1=nn1start,nn1end
+          do j=jstart,jend
+            do i=istart,iend
+              imor=idmo(i,j,nn1,nn2,iside,iel)
+              temp=anc1* pcmor_nc1(i,j,nn1,nn2,isize) +  &
+     &             ac*  pcmor_c(i,j,isize+1)+  &
+     &             anc0*  pcmor_nc0(i,j,nn1,nn2,isize)+  &
+     &             anc2*pcmor_nc2(i,j,nn1,nn2,isize)
+                dpcmor(imor)=1.d0/temp
+              end do
+            end do
+          end do
+        end do
+
+!.......local edge 1
+        if (enumber.eq.1) then
+          imor=idmo(lx1,1,1,1,iside,iel)
+          temp=anc1* pcmor_nc1(lx1,1,1,1,isize) +  &
+     &         ac*  pcmor_c(lx1,1,isize+1)*2.d0+  &
+     &         anc0*  pcmor_nc0(lx1,1,1,1,isize)+  &
+     &         anc2*pcmor_nc2(lx1,1,1,1,isize)
+!.......local edge 2
+        elseif (enumber.eq.2) then
+          imor=idmo(lx1,lx1,1,2,iside,iel)
+          temp=anc1* pcmor_nc1(lx1,lx1,1,2,isize) +  &
+     &         ac*  pcmor_c(lx1,lx1,isize+1)*2.d0+  &
+     &         anc0*  pcmor_nc0(lx1,lx1,1,2,isize)+  &
+     &         anc2*pcmor_nc2(lx1,lx1,1,2,isize)
+!.......local edge 3
+        elseif (enumber.eq.3) then
+          imor=idmo(lx1,lx1,2,1,iside,iel)
+          temp=anc1* pcmor_nc1(lx1,lx1,2,1,isize) +  &
+     &         ac*  pcmor_c(lx1,lx1,isize+1)*2.d0+  &
+     &         anc0*  pcmor_nc0(lx1,lx1,2,1,isize)+  &
+     &         anc2*pcmor_nc2(lx1,lx1,2,1,isize)
+!.......local edge 4
+        elseif (enumber.eq.4) then
+          imor=idmo(1,lx1,1,1,iside,iel)
+          temp=anc1* pcmor_nc1(1,lx1,1,1,isize) +  &
+     &         ac*  pcmor_c(1,lx1,isize+1)*2.d0+  &
+     &         anc0*  pcmor_nc0(1,lx1,1,1,isize)+  &
+     &         anc2*pcmor_nc2(1,lx1,1,1,isize)
+        end if
+
+        dpcmor(imor)=1.d0/temp
+
+      return
+      end 
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/UA/setup.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/UA/setup.f90
new file mode 100644
index 000000000..9b5f25923
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/UA/setup.f90
@@ -0,0 +1,413 @@
+!-----------------------------------------------------------------
+      subroutine create_initial_grid        
+!------------------------------------------------------------------
+    
+      use ua_data
+      implicit none
+
+      integer i
+
+      nelt=1
+      ntot=nelt*lx1*lx1*lx1 
+      tree(1)=1
+      mt_to_id(1)=1
+      do i=1,7,2
+        xc(i,1)=0.d0
+        xc(i+1,1)=1.d0
+      end do
+
+      do i=1,2
+        yc(i,1)=0.d0
+        yc(2+i,1)=1.d0
+        yc(4+i,1)=0.d0
+        yc(6+i,1)=1.d0
+      end do
+     
+      do i=1,4
+        zc(i,1)=0.d0
+        zc(4+i,1)=1.d0
+      end do
+  
+      do i=1,6
+        cbc(i,1)=0
+      end do
+
+      return
+
+      end
+
+!-----------------------------------------------------------------
+      subroutine coef
+!-----------------------------------------------------------------
+!
+!     generate 
+!
+!            - collocation points
+!            - weights
+!            - derivative matrices 
+!            - projection matrices
+!            - interpolation matrices 
+!
+!     associated with the 
+!
+!            - gauss-legendre lobatto mesh (suffix m1)
+!
+!----------------------------------------------------------------
+
+      use ua_data
+      implicit none
+
+      integer i,j,k
+
+!.....for gauss-legendre lobatto mesh (suffix m1)
+!.....generate collocation points and weights 
+
+      zgm1(1)=-1.d0
+      zgm1(2)=-0.6546536707079771d0
+      zgm1(3)=0.d0
+      zgm1(4)= 0.6546536707079771d0
+      zgm1(5)=1.d0
+      wxm1(1)=0.1d0
+      wxm1(2)=49.d0/90.d0
+      wxm1(3)=32.d0/45.d0
+      wxm1(4)=wxm1(2)
+      wxm1(5)=0.1d0 
+
+      do k=1,lx1
+        do j=1,lx1
+          do i=1,lx1
+            w3m1(i,j,k)=wxm1(i)*wxm1(j)*wxm1(k)
+          end do
+        end do
+      end do
+
+!.....generate derivative matrices
+
+      dxm1(1,1)=-5.0d0
+      dxm1(2,1)=-1.240990253030982d0
+      dxm1(3,1)= 0.375d0
+      dxm1(4,1)=-0.2590097469690172d0
+      dxm1(5,1)= 0.5d0
+      dxm1(1,2)= 6.756502488724238d0
+      dxm1(2,2)= 0.d0
+      dxm1(3,2)=-1.336584577695453d0
+      dxm1(4,2)= 0.7637626158259734d0
+      dxm1(5,2)=-1.410164177942427d0
+      dxm1(1,3)=-2.666666666666667d0
+      dxm1(2,3)= 1.745743121887939d0
+      dxm1(3,3)= 0.d0
+      dxm1(4,3)=-dxm1(2,3)
+      dxm1(5,3)=-dxm1(1,3)
+      do j=4,lx1
+        do i=1,lx1
+          dxm1(i,j)=-dxm1(lx1+1-i,lx1+1-j)
+        end do
+      end do
+      do j=1,lx1
+        do i=1,lx1
+          dxtm1(i,j)=dxm1(j,i)
+        end do
+      end do
+
+!.....generate projection (mapping) matrices
+
+      qbnew(1,1,1)=-0.1772843218615690d0
+      qbnew(2,1,1)=9.375d-02
+      qbnew(3,1,1)=-3.700139242414530d-02
+      qbnew(1,2,1)= 0.7152146412463197d0
+      qbnew(2,2,1)=-0.2285757930375471d0
+      qbnew(3,2,1)= 8.333333333333333d-02
+      qbnew(1,3,1)= 0.4398680650316104d0
+      qbnew(2,3,1)= 0.2083333333333333d0
+      qbnew(3,3,1)=-5.891568407922938d-02
+      qbnew(1,4,1)= 8.333333333333333d-02
+      qbnew(2,4,1)= 0.3561799597042137d0
+      qbnew(3,4,1)=-4.854797457965334d-02
+      qbnew(1,5,1)= 0.d0
+      qbnew(2,5,1)=7.03125d-02
+      qbnew(3,5,1)=0.d0
+      
+      do j=1,lx1
+        do i=1,3
+          qbnew(i,j,2)=qbnew(4-i,lx1+1-j,1)
+        end do
+      end do 
+
+!.....generate interpolation matrices for mesh refinement
+
+      ixtmc1(1,1)=1.d0
+      ixtmc1(2,1)=0.d0
+      ixtmc1(3,1)=0.d0
+      ixtmc1(4,1)=0.d0
+      ixtmc1(5,1)=0.d0 
+      ixtmc1(1,2)= 0.3385078435248143d0
+      ixtmc1(2,2)= 0.7898516348912331d0
+      ixtmc1(3,2)=-0.1884018684471238d0
+      ixtmc1(4,2)= 9.202967302175333d-02
+      ixtmc1(5,2)=-3.198728299067715d-02
+      ixtmc1(1,3)=-0.1171875d0
+      ixtmc1(2,3)= 0.8840317166357952d0
+      ixtmc1(3,3)= 0.3125d0    
+      ixtmc1(4,3)=-0.118406716635795d0 
+      ixtmc1(5,3)= 0.0390625d0   
+      ixtmc1(1,4)=-7.065070066767144d-02
+      ixtmc1(2,4)= 0.2829703269782467d0 
+      ixtmc1(3,4)= 0.902687582732838d0
+      ixtmc1(4,4)=-0.1648516348912333d0 
+      ixtmc1(5,4)= 4.984442584781999d-02
+      ixtmc1(1,5)=0.d0
+      ixtmc1(2,5)=0.d0
+      ixtmc1(3,5)=1.d0 
+      ixtmc1(4,5)=0.d0
+      ixtmc1(5,5)=0.d0  
+      do j=1,lx1
+        do i=1,lx1
+          ixmc1(i,j)=ixtmc1(j,i)
+        end do
+      end do
+
+      do j=1,lx1
+        do i=1,lx1
+          ixtmc2(i,j)=ixtmc1(lx1+1-i,lx1+1-j)
+        end do
+      end do
+
+      do j=1,lx1
+        do i=1,lx1
+          ixmc2(i,j)=ixtmc2(j,i)
+        end do
+      end do
+
+!.....solution interpolation matrix for mesh coarsening
+
+      map2(1)=-0.1179652785083428d0
+      map2(2)= 0.5505046330389332d0
+      map2(3)= 0.7024534364259963d0
+      map2(4)=-0.1972224518285866d0
+      map2(5)= 6.222966087199998d-02
+
+      do i=1,lx1
+        map4(i)=map2(lx1+1-i)
+      end do
+
+      return
+      end
+
+!-------------------------------------------------------------------
+      subroutine geom1
+!-------------------------------------------------------------------
+!
+!     routine to generate elemental geometry information on mesh m1,
+!     (gauss-legendre lobatto mesh).
+!
+!         xrm1_s   -   dx/dr, dy/dr, dz/dr
+!         rxm1_s   -   dr/dx, dr/dy, dr/dz
+!         g1m1_s  geometric factors used in preconditioner computation
+!         g4m1_s  g5m1_s  g6m1_s :
+!         geometric factors used in laplacian operator
+!         jacm1    -   jacobian
+!         bm1      -   mass matrix
+!         xfrac    -   will be used in prepwork for calculating collocation
+!                      coordinates
+!         idel     -   collocation points index on element boundaries 
+!------------------------------------------------------------------
+
+      use ua_data
+      implicit none
+
+      double precision temp,temp1,temp2,dtemp
+      integer isize,i,j,k,ntemp,iel
+ 
+      do i=1,lx1
+        xfrac(i)=zgm1(i)*0.5d0 + 0.5d0
+      end do
+
+!$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(ISIZE,TEMP,TEMP1,TEMP2,  &
+!$OMP&  K,J,I,dtemp)
+      do isize=1,refine_max
+        temp=2.d0**(-isize-1)
+        dtemp=1.d0/temp
+        temp1=temp**3
+        temp2=temp**2
+        do k=1,lx1
+          do j=1,lx1
+            do i=1,lx1
+              xrm1_s(i,j,k,isize)=dtemp
+              jacm1_s(i,j,k,isize)=temp1
+              rxm1_s(i,j,k,isize)=temp2
+              g1m1_s(i,j,k,isize)=w3m1(i,j,k)*temp
+              bm1_s(i,j,k,isize)=w3m1(i,j,k)*temp1
+              g4m1_s(i,j,k,isize)=g1m1_s(i,j,k,isize)/wxm1(i)
+              g5m1_s(i,j,k,isize)=g1m1_s(i,j,k,isize)/wxm1(j)
+              g6m1_s(i,j,k,isize)=g1m1_s(i,j,k,isize)/wxm1(k)
+            end do
+          end do
+        end do
+      end do
+!$OMP END PARALLEL DO
+
+!$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(ntemp,i,j,iel)
+      do iel = 1, lelt
+        ntemp=lx1*lx1*lx1*(iel-1)
+        do j = 1, lx1
+          do i = 1, lx1
+            idel(i,j,1,iel)=ntemp+(i-1)*lx1 + (j-1)*lx1*lx1+lx1
+            idel(i,j,2,iel)=ntemp+(i-1)*lx1 + (j-1)*lx1*lx1+1
+            idel(i,j,3,iel)=ntemp+(i-1)*1 + (j-1)*lx1*lx1+lx1*(lx1-1)+1
+            idel(i,j,4,iel)=ntemp+(i-1)*1 + (j-1)*lx1*lx1+1
+            idel(i,j,5,iel)=ntemp+(i-1)*1 + (j-1)*lx1+lx1*lx1*(lx1-1)+1
+            idel(i,j,6,iel)=ntemp+(i-1)*1 + (j-1)*lx1+1
+          end do
+        end do
+      end do
+!$OMP END PARALLEL DO
+
+      return
+      end
+
+!------------------------------------------------------------------
+      subroutine setdef
+!------------------------------------------------------------------
+!     compute the discrete laplacian operators
+!------------------------------------------------------------------
+
+      use ua_data
+      implicit none
+
+      integer i,j,ip
+ 
+      call r_init(wdtdr(1,1),lx1*lx1,0.d0)
+
+      do i=1,lx1
+        do j=1,lx1
+          do ip=1,lx1
+            wdtdr(i,j) = wdtdr(i,j) + wxm1(ip)*dxm1(ip,i)*dxm1(ip,j)
+          end do
+        end do
+      end do
+
+      return 
+      end
+
+
+!------------------------------------------------------------------
+      subroutine prepwork
+!------------------------------------------------------------------
+!     mesh information preparations: calculate refinement levels of
+!     each element, mask matrix for domain boundary and element 
+!     boundaries
+!------------------------------------------------------------------
+
+      use ua_data
+      implicit none
+
+      integer i, j, iel, iface, cb
+      double precision rdlog2
+
+      ntot = nelt*nxyz
+      rdlog2 = 1.d0/dlog(2.d0)
+
+!$OMP PARALLEL DEFAULT(SHARED) PRIVATE(I,J,IEL,IFACE,CB)
+
+!.....calculate the refinement levels of each element
+
+!$OMP DO 
+      do iel = 1, nelt
+        size_e(iel)=-dlog(xc(2,iel)-xc(1,iel))*rdlog2+1.d-8
+      end do
+!$OMP END DO nowait
+
+!.....mask matrix for element boundary
+
+!$OMP DO
+      do iel = 1, nelt
+        call r_init(tmult(1,1,1,iel),nxyz,1.d0)   
+        do iface=1,nsides
+          call facev(tmult(1,1,1,iel),iface,0.0d0)
+        end do
+      end do
+!$OMP END DO nowait
+
+!.....masks for domain boundary at mortar 
+
+!$OMP DO
+      do iel=1,nmor
+        tmmor(iel)=1.d0
+      end do
+!$OMP END DO
+
+!$OMP DO
+      do iel = 1, nelt
+        do iface = 1,nsides
+          cb=cbc(iface,iel)
+          if(cb.eq.0) then
+            do j=2,lx1-1
+              do i=2,lx1-1
+               tmmor(idmo(i,j,1,1,iface,iel))=0.d0
+              end do
+            end do
+
+            j=1
+            do i = 1, lx1-1
+               tmmor(idmo(i,j,1,1,iface,iel))=0.d0
+            end do
+
+            if(idmo(lx1,1,1,1,iface,iel).eq.0)then
+              tmmor(idmo(lx1,1,1,2,iface,iel))=0.d0
+            else
+              tmmor(idmo(lx1,1,1,1,iface,iel))=0.d0
+              do i=1,lx1
+                tmmor(idmo(i,j,1,2,iface,iel))=0.d0
+              end do
+            end if
+
+            i=lx1
+            if(idmo(lx1,2,1,2,iface,iel).eq.0)then
+              do j=2,lx1-1
+                tmmor(idmo(i,j,1,1,iface,iel))=0.d0
+              end do
+              tmmor(idmo(lx1,lx1,2,2,iface,iel))=0.d0
+            else
+              do j=2,lx1
+                tmmor(idmo(i,j,1,2,iface,iel))=0.d0
+              end do
+              do j=1,lx1
+                tmmor(idmo(i,j,2,2,iface,iel))=0.d0
+              end do
+            end if
+            
+            j=lx1
+            tmmor(idmo(1,lx1,2,1,iface,iel))=0.d0
+            if(idmo(2,lx1,2,1,iface,iel).eq.0)then
+              do i=2,lx1-1
+                tmmor(idmo(i,j,1,1,iface,iel))=0.d0
+              end do
+            else
+              do i=2,lx1
+                tmmor(idmo(i,j,2,1,iface,iel))=0.d0
+              end do
+              do i=1,lx1-1
+                tmmor(idmo(i,j,2,2,iface,iel))=0.d0
+              end do
+            end if
+
+            i=1
+            do j=2,lx1-1
+             tmmor(idmo(i,j,1,1,iface,iel))=0.d0
+            end do
+            if(idmo(1,lx1,1,1,iface,iel).ne.0)then
+              tmmor(idmo(i,lx1,1,1,iface,iel))=0.d0
+              do j=1,lx1-1
+               tmmor(idmo(i,j,2,1,iface,iel))=0.d0
+              end do
+            end if
+
+          endif
+        end do
+       end do
+!$OMP END DO nowait
+            
+!$OMP END PARALLEL
+      return
+      end 
+    
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/UA/tmorwork.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/UA/tmorwork.f90
new file mode 100644
index 000000000..34f42d339
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/UA/tmorwork.f90
@@ -0,0 +1,17 @@
+!------------------------------------------------------------------
+!------------------------------------------------------------------
+!     module for thread-local working arrays
+!------------------------------------------------------------------
+!------------------------------------------------------------------
+      module tmorwork
+
+      double precision, pointer ::  &
+     &                   tmorwk(:,:), mormulwk(:,:)
+
+      double precision, pointer ::  &
+     &                   tmorl(:), mormull(:)
+      integer :: myid, nwthreads
+!$omp threadprivate( tmorl, mormull, myid, nwthreads )
+
+      end module tmorwork
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/UA/transfer.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/UA/transfer.f90
new file mode 100644
index 000000000..97a779a7c
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/UA/transfer.f90
@@ -0,0 +1,1114 @@
+!------------------------------------------------------------------
+      subroutine init_locks
+!------------------------------------------------------------------
+!     Initialize locks to be used for atomic updates
+!------------------------------------------------------------------
+
+      use ua_data
+      implicit none
+
+      integer i
+
+!.....initialize locks in parallel
+!$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(i)
+!$    do i=1,lmor
+!$      call omp_init_lock(tlock(i))
+!$    end do
+
+      return
+      end
+
+
+!------------------------------------------------------------------
+      subroutine transf(tmor,tx)
+!------------------------------------------------------------------
+!     Map values from mortar(tmor) to element(tx)
+!------------------------------------------------------------------
+
+      use ua_data
+      implicit none
+
+      double precision tmor(*),tx(*), tmp(lx1,lx1,2)
+      integer ig1,ig2,ig3,ig4,ie,iface,il1,il2,il3,il4,  &
+     &        nnje,ije1,ije2,col,i,j,ig,il
+
+
+!.....zero out tx on element boundaries
+      call col2(tx,tmult,ntot)     
+
+!$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(il,j,ig,i,col,ije2,ije1,ig4,  &
+!$OMP& ig3,ig2,ig1,nnje,il4,il3,il2,il1,iface,ie,tmp)
+      do ie=1,nelt
+        do iface=1,nsides
+
+!.........get the collocation point index of the four local corners on the
+!         face iface of element ie
+          il1=idel(1,1,iface,ie)
+          il2=idel(lx1,1,iface,ie)
+          il3=idel(1,lx1,iface,ie)
+          il4=idel(lx1,lx1,iface,ie)
+
+!.........get the mortar indices of the four local corners
+          ig1= idmo(1,  1  ,1,1,iface,ie)
+          ig2= idmo(lx1,1  ,1,2,iface,ie)
+          ig3= idmo(1,  lx1,2,1,iface,ie)
+          ig4= idmo(lx1,lx1,2,2,iface,ie)
+  
+!.........copy the value from tmor to tx for these four local corners
+          tx(il1) = tmor(ig1)
+          tx(il2) = tmor(ig2)
+          tx(il3) = tmor(ig3)
+          tx(il4) = tmor(ig4)
+ 
+!.........nnje=1 for conforming faces, nnje=2 for nonconforming faces
+          if(cbc(iface,ie).eq.3) then
+            nnje=2
+          else
+            nnje=1 
+          end if
+
+!.........for nonconforming faces
+          if(nnje.eq.2) then
+
+!...........nonconforming faces have four pieces of mortar, first map them to
+!           two intermediate mortars, stored in tmp
+            call r_init(tmp,lx1*lx1*2,0.d0)
+   
+            do ije1=1,nnje
+              do ije2=1,nnje
+                do col=1,lx1
+
+!.................in each row col, when column i=1 or lx1, the value
+!                 in tmor is copied to tmp
+                  i = v_end(ije2)
+                  ig=idmo(i,col,ije1,ije2,iface,ie)
+                  tmp(i,col,ije1)=tmor(ig)
+
+!.................in each row col, the values at the three interior collocation
+!                 points are computed by applying mapping matrix qbnew to tmor
+                  do i=2,lx1-1
+                    il= idel(i,col,iface,ie)
+                    do j=1,lx1
+                      ig=idmo(j,col,ije1,ije2,iface,ie)
+                      tmp(i,col,ije1) = tmp(i,col,ije1) +  &
+     &                qbnew(i-1,j,ije2)*tmor(ig)
+                    end do
+                  end do
+
+                end do
+              end do
+            end do
+      
+!...........mapping from two pieces of intermediate mortar tmp to element 
+!           face tx
+
+            do ije1=1, nnje
+
+!.............the first column, col=1, is an edge of face iface.
+!             the value on the three interior collocation points, tx, is 
+!             computed by applying mapping matrices qbnew to tmp.
+!             the mapping result is divided by 2, because there will be 
+!             duplicated contribution from another face sharing this edge.
+              col=1
+              do i=2,lx1-1
+                il= idel(col,i,iface,ie)
+                do j=1,lx1
+                    tx(il) = tx(il) + qbnew(i-1,j,ije1)*  &
+     &                       tmp(col,j,ije1)*0.5d0
+                end do 
+              end do 
+
+!.............for columns 2 ~ lx1-1
+              do col=2,lx1-1
+
+!...............when i=1 or lx1, the collocation points are also on an edge of
+!               the face, so the mapping result also needs to be divided by 2
+                i = v_end(ije1)
+                il= idel(col,i,iface,ie)
+                tx(il)=tx(il)+tmp(col,i,ije1)*0.5d0
+
+!...............compute the value at interior collocation points in
+!               columns 2 ~ lx1-1
+                do i=2,lx1-1
+                  il= idel(col,i,iface,ie)
+                  do j=1,lx1
+                    tx(il) = tx(il) + qbnew(i-1,j,ije1)* tmp(col,j,ije1)
+                  end do 
+                end do
+              end do
+
+!.............same as col=1
+              col=lx1
+              do  i=2,lx1-1
+                il= idel(col,i,iface,ie)
+                do j=1,lx1
+                  tx(il) = tx(il) + qbnew(i-1,j,ije1)*  &
+     &                     tmp(col,j,ije1)*0.5d0
+                end do 
+              end do
+            end do
+
+!.........for conforming faces
+          else
+
+!.........face interior
+            do col=2,lx1-1
+              do i=2,lx1-1  
+                il= idel(i,col,iface,ie)
+                ig= idmo(i,col,1,1,iface,ie)
+                tx(il)=tmor(ig)
+              end do
+            end do
+
+        
+!...........edges of conforming faces
+
+!...........if local edge 1 is a nonconforming edge
+            if(idmo(lx1,1,1,1,iface,ie).ne.0)then
+              do i=2,lx1-1               
+                il= idel(i,1,iface,ie)
+                do ije1=1,2
+                  do j=1,lx1
+                    ig=idmo(j,1,1,ije1,iface,ie)
+                    tx(il) = tx(il) + qbnew(i-1,j,ije1)*tmor(ig)*0.5d0
+                  end do
+                end do
+              end do
+
+!...........if local edge 1 is a conforming edge
+            else
+              do i=2,lx1-1
+                il= idel(i,1,iface,ie)
+                ig= idmo(i,1,1,1,iface,ie)
+                tx(il)=tmor(ig)
+              end do
+            end if 
+
+!...........if local edge 2 is a nonconforming edge
+            if(idmo(lx1,2,1,2,iface,ie).ne.0)then
+              do i=2,lx1-1               
+                il= idel(lx1,i,iface,ie)
+                do ije1=1,2
+                  do j=1,lx1
+                    ig=idmo(lx1,j,ije1,2,iface,ie)
+                    tx(il) = tx(il) + qbnew(i-1,j,ije1)*tmor(ig)*0.5d0
+                  end do
+                end do
+              end do
+
+!...........if local edge 2 is a conforming edge
+            else
+              do i=2,lx1-1
+                il= idel(lx1,i,iface,ie)
+                ig= idmo(lx1,i,1,1,iface,ie)
+                tx(il)=tmor(ig)
+              end do
+            end if 
+
+!...........if local edge 3 is a nonconforming edge
+            if(idmo(2,lx1,2,1,iface,ie).ne.0)then
+              do  i=2,lx1-1               
+                il= idel(i,lx1,iface,ie)
+                do ije1=1,2
+                  do j=1,lx1
+                    ig=idmo(j,lx1,2,ije1,iface,ie)
+                    tx(il) = tx(il) + qbnew(i-1,j,ije1)*tmor(ig)*0.5d0
+                  end do
+                end do
+              end do
+
+!...........if local edge 3 is a conforming edge
+            else
+              do i=2,lx1-1
+                il= idel(i,lx1,iface,ie)
+                ig= idmo(i,lx1,1,1,iface,ie)
+                tx(il)=tmor(ig)
+              end do
+            end if 
+
+!...........if local edge 4 is a nonconforming edge
+            if(idmo(1,lx1,1,1,iface,ie).ne.0)then
+              do i=2,lx1-1               
+                il= idel(1,i,iface,ie)
+                do ije1=1,2
+                  do j=1,lx1
+                    ig=idmo(1,j,ije1,1,iface,ie)
+                    tx(il) = tx(il) + qbnew(i-1,j,ije1)*tmor(ig)*0.5d0
+                  end do
+                end do
+              end do
+!...........if local edge 4 is a conforming edge
+            else
+              do i=2,lx1-1
+                il= idel(1,i,iface,ie)
+                ig= idmo(1,i,1,1,iface,ie)
+                tx(il)=tmor(ig)
+              end do
+            end if 
+          end if
+          
+        end do
+      end do
+!$OMP END PARALLEL DO
+
+      return
+      end
+
+
+!------------------------------------------------------------------
+      subroutine transfb(tmor,tx)
+!------------------------------------------------------------------
+!     Map from element(tx) to mortar(tmor).
+!     tmor sums contributions from all elements.
+!------------------------------------------------------------------
+
+      use ua_data
+      implicit none
+
+      double precision third
+      parameter (third=1.d0/3.d0)
+      integer shift
+
+      double precision tmp,tmp1,tx(*),tmor(*),temp(lx1,lx1,2),  &
+     &                 top(lx1,2)
+      integer il1,il2,il3,il4,ig1,ig2,ig3,ig4,ie,iface,nnje,  &
+     &        ije1,ije2,col,i,j,ije,ig,il
+
+!$OMP PARALLEL DEFAULT(SHARED) PRIVATE(il,j,ig,i,col,ije2,ije1,ig4,  &
+!$OMP& ig3,ig2,ig1,nnje,il4,il3,il2,il1,iface,ie,ije,  &
+!$OMP& tmp,shift,temp,top,tmp1)
+
+!$OMP DO
+      do ie=1,nmor
+        tmor(ie)=0.d0
+      end do
+!$OMP END DO
+
+!$OMP DO
+      do ie=1,nelt
+        do iface=1,nsides
+!.........nnje=1 for conforming faces, nnje=2 for nonconforming faces
+          if(cbc(iface,ie).eq.3) then
+            nnje=2
+          else
+            nnje=1 
+          end if
+
+!.........get collocation point index of four local corners on the face
+          il1 = idel(1,  1,  iface,ie)
+          il2 = idel(lx1,1,  iface,ie)
+          il3 = idel(1,  lx1,iface,ie)
+          il4 = idel(lx1,lx1,iface,ie)
+
+!.........get the mortar indices of the four local corners
+          ig1 = idmo(1,  1,  1,1,iface,ie)
+          ig2 = idmo(lx1,1,  1,2,iface,ie)
+          ig3 = idmo(1,  lx1,2,1,iface,ie )
+          ig4 = idmo(lx1,lx1,2,2,iface,ie)
+
+!.........sum the values from tx to tmor for these four local corners
+!         only 1/3 of the value is summed, since there will be two duplicated
+!         contributions from the other two faces sharing this vertex 
+!
+!$        call omp_set_lock(tlock(ig1))
+          tmor(ig1) = tmor(ig1)+tx(il1)*third
+!$        call omp_unset_lock(tlock(ig1))
+!
+!$        call omp_set_lock(tlock(ig2))
+          tmor(ig2) = tmor(ig2)+tx(il2)*third
+!$        call omp_unset_lock(tlock(ig2))
+!
+!$        call omp_set_lock(tlock(ig3))
+          tmor(ig3) = tmor(ig3)+tx(il3)*third
+!$        call omp_unset_lock(tlock(ig3))
+!
+!$        call omp_set_lock(tlock(ig4))
+          tmor(ig4) = tmor(ig4)+tx(il4)*third
+!$        call omp_unset_lock(tlock(ig4))
+
+!.........for nonconforming faces
+          if(nnje.eq.2) then       
+            call r_init(temp,lx1*lx1*2,0.d0)
+
+!...........nonconforming faces have four pieces of mortar, first map tx to
+!           two intermediate mortars stored in temp
+
+            do ije2 = 1, nnje
+              shift = ije2-1
+              do col=1,lx1
+!...............For mortar points on face edge (top and bottom), copy the 
+!               value from tx to temp
+                il=idel(col,v_end(ije2),iface,ie)
+                temp(col,v_end(ije2),ije2)=tx(il)
+
+!...............For mortar points on face edge (top and bottom), calculate 
+!               the interior points' contribution to them, i.e. top()
+                j = v_end(ije2)
+                tmp=0.d0
+                do i=2,lx1-1 
+                  il=idel(col,i,iface,ie)
+                  tmp = tmp + qbnew(i-1,j,ije2)*tx(il)
+                end do
+
+                top(col,ije2)=tmp
+
+!...............Use mapping matrices qbnew to map the value from tx to temp
+!               for mortar points not on the top or bottom face edge.
+                do j=2-shift,lx1-shift
+                  tmp=0.d0
+                  do i=2,lx1-1 
+                    il=idel(col,i,iface,ie)
+                    tmp = tmp + qbnew(i-1,j,ije2)*tx(il)
+                  end do
+                  temp(col,j,ije2) = tmp + temp(col,j,ije2)
+                end do
+              end do
+            end do
+
+!...........mapping from temp to tmor
+
+            do ije1=1, nnje
+              shift = ije1-1
+              do ije2=1,nnje
+
+!...............for each column of collocation points on a piece of mortar
+                do col=2-shift,lx1-shift
+
+!.................For the end point, which is on an edge (local edge 2,4), 
+!                 the contribution is halved since there will be duplicated 
+!                 contribution from another face sharing this edge.
+
+                  ig=idmo(v_end(ije2),col,ije1,ije2,iface,ie)
+!
+!$                call omp_set_lock(tlock(ig))
+                  tmor(ig)=tmor(ig)+temp(v_end(ije2),col,ije1)*0.5d0
+!$                call omp_unset_lock(tlock(ig))
+
+!.................In each row of collocation points on a piece of mortar, 
+!                 sum the contributions from interior collocation points 
+!                 (i=2,lx1-1)
+
+                  do  j=1,lx1
+                    tmp=0.d0
+                    do i=2,lx1-1
+                      tmp = tmp + qbnew(i-1,j,ije2) * temp(i,col,ije1)
+                    end do
+                    ig=idmo(j,col,ije1,ije2,iface,ie)
+!
+!$                  call omp_set_lock(tlock(ig))
+                    tmor(ig)=tmor(ig)+tmp
+!$                  call omp_unset_lock(tlock(ig))
+                  end do
+                end do
+
+!...............For tmor on local edge 1 and 3, tmp is the contribution from
+!               an edge, so it is halved because of duplicated contribution
+!               from another face sharing this edge. tmp1 is contribution 
+!               from face interior. 
+
+                col = v_end(ije1)
+                ig=idmo(v_end(ije2),col,ije1,ije2,iface,ie)
+!
+!$              call omp_set_lock(tlock(ig))
+                tmor(ig)=tmor(ig)+top(v_end(ije2),ije1)*0.5d0
+!$              call omp_unset_lock(tlock(ig))
+                do  j=1,lx1
+                  tmp=0.d0
+                  tmp1=0.d0
+                  do i=2,lx1-1
+                    tmp  = tmp  + qbnew(i-1,j,ije2) * temp(i,col,ije1)
+                    tmp1 = tmp1 + qbnew(i-1,j,ije2) * top(i,ije1)
+                  end do
+                  ig=idmo(j,col,ije1,ije2,iface,ie)
+!
+!$                call omp_set_lock(tlock(ig))
+                  tmor(ig)=tmor(ig)+tmp*0.5d0+tmp1 
+!$                call omp_unset_lock(tlock(ig))
+                end do
+              end do
+            end do
+
+!.........for conforming faces
+          else
+
+!.........face interior
+            do col=2,lx1-1
+              do j=2,lx1-1
+                il=idel(j,col,iface,ie)
+                ig=idmo(j,col,1,1,iface,ie)
+!
+!$              call omp_set_lock(tlock(ig))
+                tmor(ig)=tmor(ig)+tx(il)
+!$              call omp_unset_lock(tlock(ig))
+              end do
+            end do
+
+!...........edges of conforming faces
+
+!...........if local edge 1 is a nonconforming edge
+            if(idmo(lx1,1,1,1,iface,ie).ne.0)then
+              do ije=1,2
+                do j=1,lx1
+                  tmp=0.d0
+                  do i=2,lx1-1
+                    il=idel(i,1,iface,ie)
+                    tmp= tmp + qbnew(i-1,j,ije)*tx(il)
+                  end do
+                  ig=idmo(j,1,1,ije,iface,ie)
+!
+!$                call omp_set_lock(tlock(ig))
+                  tmor(ig)=tmor(ig)+tmp*0.5d0
+!$                call omp_unset_lock(tlock(ig))
+                end do
+              end do
+
+!...........if local edge 1 is a conforming edge
+            else
+              do j=2,lx1-1
+                il=idel(j,1,iface,ie)
+                ig=idmo(j,1,1,1,iface,ie)
+!
+!$              call omp_set_lock(tlock(ig))
+                tmor(ig)=tmor(ig)+tx(il)*0.5d0
+!$              call omp_unset_lock(tlock(ig))
+              end do
+            end if 
+
+!...........if local edge 2 is a nonconforming edge
+            if(idmo(lx1,2,1,2,iface,ie).ne.0)then
+              do ije=1,2
+                do j=1,lx1
+                  tmp=0.d0
+                  do i=2,lx1-1
+                    il=idel(lx1,i,iface,ie)
+                    tmp = tmp + qbnew(i-1,j,ije)*tx(il)
+                  end do
+                  ig=idmo(lx1,j,ije,2,iface,ie)
+!
+!$                call omp_set_lock(tlock(ig))
+                  tmor(ig)=tmor(ig)+tmp*0.5d0
+!$                call omp_unset_lock(tlock(ig))
+                end do
+              end do
+
+!...........if local edge 2 is a conforming edge
+            else
+              do j=2,lx1-1
+                il=idel(lx1,j,iface,ie)
+                ig=idmo(lx1,j,1,1,iface,ie)
+!
+!$              call omp_set_lock(tlock(ig))
+                tmor(ig)=tmor(ig)+tx(il)*0.5d0
+!$              call omp_unset_lock(tlock(ig))
+              end do
+            end if 
+
+!...........if local edge 3 is a nonconforming edge
+            if(idmo(2,lx1,2,1,iface,ie).ne.0)then
+              do ije=1,2
+                do j=1,lx1
+                  tmp=0.d0
+                  do i=2,lx1-1
+                    il=idel(i,lx1,iface,ie)
+                    tmp = tmp + qbnew(i-1,j,ije)*tx(il)
+                  end do
+                  ig=idmo(j,lx1,2,ije,iface,ie)
+!
+!$                call omp_set_lock(tlock(ig))
+                  tmor(ig)=tmor(ig)+tmp*0.5d0
+!$                call omp_unset_lock(tlock(ig))
+                end do
+              end do
+
+!...........if local edge 3 is a conforming edge
+            else
+              do j=2,lx1-1
+                il=idel(j,lx1,iface,ie)
+                ig=idmo(j,lx1,1,1,iface,ie)
+!
+!$              call omp_set_lock(tlock(ig))
+                tmor(ig)=tmor(ig)+tx(il)*0.5d0
+!$              call omp_unset_lock(tlock(ig))
+              end do
+            end if 
+
+!...........if local edge 4 is a nonconforming edge
+            if(idmo(1,lx1,1,1,iface,ie).ne.0)then
+              do ije=1,2
+                do j=1,lx1
+                  tmp=0.d0
+                  do i=2,lx1-1
+                    il=idel(1,i,iface,ie)
+                    tmp = tmp + qbnew(i-1,j,ije)*tx(il)
+                  end do
+                  ig=idmo(1,j,ije,1,iface,ie)
+!
+!$                call omp_set_lock(tlock(ig))
+                  tmor(ig)=tmor(ig)+tmp*0.5d0
+!$                call omp_unset_lock(tlock(ig))
+                end do
+              end do
+
+!...........if local edge 4 is a conforming edge
+            else
+              do j=2,lx1-1
+                il=idel(1,j,iface,ie)
+                ig=idmo(1,j,1,1,iface,ie)
+!
+!$              call omp_set_lock(tlock(ig))
+                tmor(ig)=tmor(ig)+tx(il)*0.5d0
+!$              call omp_unset_lock(tlock(ig))
+              end do
+            end if 
+          end if
+        end do
+      end do
+!$OMP END DO NOWAIT
+!$OMP END PARALLEL 
+
+      return
+      end
+
+
+!--------------------------------------------------------------
+      subroutine transfb_cor_e(n,tmor,tx)
+!--------------------------------------------------------------
+!     This subroutine performs the edge to mortar mapping and
+!     calculates the mapping result on the mortar point at a vertex
+!     under situation 1,2, or 3.
+!     n refers to the configuration of three edges sharing a vertex, 
+!     n = 1: only one edge is nonconforming
+!     n = 2: two edges are nonconforming 
+!     n = 3: three edges are nonconforming 
+!-------------------------------------------------------------------
+
+      use ua_data
+      implicit none
+
+      double precision tmor,tx(lx1,lx1,lx1),tmp
+      integer i,n
+
+      tmor=tx(1,1,1)
+
+      do i=2,lx1-1
+        tmor= tmor + qbnew(i-1,1,1)*tx(i,1,1)
+      end do
+
+      if(n.gt.1)then
+        do i=2,lx1-1
+          tmor= tmor + qbnew(i-1,1,1)*tx(1,i,1)
+        end do
+      end if
+
+      if(n.eq.3)then
+        do i=2,lx1-1
+          tmor= tmor + qbnew(i-1,1,1)*tx(1,1,i)
+        end do
+      end if
+
+      return
+      end
+
+!--------------------------------------------------------------
+      subroutine transfb_cor_f(n,tmor,tx)
+!--------------------------------------------------------------
+!     This subroutine performs the mapping from face to mortar.
+!     Output tmor is the mapping result on a mortar vertex for one of
+!     these situations of three edges and three faces sharing a vertex:
+!     n=4: only one face is nonconforming 
+!     n=5: one face and one edge are nonconforming
+!     n=6: two faces are nonconforming 
+!     n=7: three faces are nonconforming 
+!--------------------------------------------------------------
+
+      use ua_data
+      implicit none
+
+      double precision tx(lx1,lx1,lx1),tmor,temp(lx1)
+      integer col,i,n
+
+      call r_init(temp,lx1,0.d0)
+
+      do col=1,lx1
+        temp(col)=tx(col,1,1)
+        do i=2,lx1-1
+          temp(col) = temp(col) + qbnew(i-1,1,1)*tx(col,i,1)
+        end do
+      end do
+      tmor=temp(1)
+
+      do i=2,lx1-1
+        tmor = tmor + qbnew(i-1,1,1) *temp(i)
+      end do
+
+      if(n.eq.5)then
+        do i=2,lx1-1
+          tmor = tmor + qbnew(i-1,1,1) *tx(1,1,i)
+        end do
+      end if
+ 
+      if(n.ge.6)then
+        call r_init(temp,lx1,0.d0)
+        do col=1,lx1
+          do i=2,lx1-1
+            temp(col) = temp(col) + qbnew(i-1,1,1)*tx(col,1,i)
+          end do
+        end do
+        tmor=tmor+temp(1)
+        do i=2,lx1-1
+          tmor = tmor +qbnew(i-1,1,1) *temp(i)
+        end do
+      end if
+        
+      if(n.eq.7)then
+        call r_init(temp,lx1,0.d0)
+        do col=2,lx1-1
+          do i=2,lx1-1
+            temp(col) = temp(col) + qbnew(i-1,1,1)*tx(1,col,i)
+          end do
+        end do
+        do i=2,lx1-1
+          tmor = tmor + qbnew(i-1,1,1) *temp(i)
+        end do
+      end if
+
+      return
+      end
+
+
+!-------------------------------------------------------------------------
+      subroutine transf_nc(tmor,tx)
+!------------------------------------------------------------------------
+!     Perform mortar to element mapping on a nonconforming face. 
+!     This subroutine is used when all entries in tmor are zero except
+!     one tmor(i,j)=1. So this routine is simplified. Only one piece of 
+!     mortar  (tmor only has two indices) and one piece of intermediate 
+!     mortar (tmp) are involved.
+!------------------------------------------------------------------------
+
+      use ua_data
+      implicit none
+
+      double precision tmor(lx1,lx1), tx(lx1,lx1), tmp(lx1,lx1)
+      integer col,i,j
+
+      call r_init(tmp,lx1*lx1,0.d0)
+      do col=1,lx1
+        i = 1
+        tmp(i,col)=tmor(i,col)                           
+        do i=2,lx1-1
+          do j=1,lx1
+            tmp(i,col) = tmp(i,col) + qbnew(i-1,j,1)*tmor(j,col)
+          end do
+        end do
+      end do
+
+      do col=1,lx1
+        i = 1
+        tx(col,i)   = tx(col,i)   + tmp(col,i)
+        do i=2,lx1-1
+          do j=1,lx1
+            tx(col,i) = tx(col,i) + qbnew(i-1,j,1)*tmp(col,j)
+          end do
+        end do
+      end do
+
+      return                                                  
+      end                                                     
+
+!------------------------------------------------------------------------
+      subroutine transfb_nc0(tmor,tx)
+!------------------------------------------------------------------------
+!     Performs mapping from element to mortar when the nonconforming 
+!     edges are shared by two conforming faces of an element.
+!------------------------------------------------------------------------
+
+      use ua_data
+      implicit none
+
+      double precision tmor(lx1,lx1),tx(lx1,lx1,lx1)
+      integer i,j
+
+      call r_init(tmor,lx1*lx1,0.d0)
+      do j=1,lx1
+        do i=2,lx1-1
+          tmor(j,1)= tmor(j,1) + qbnew(i-1,j  ,1)*tx(i,1,1)
+        end do
+      end do
+
+      return
+      end 
+
+!------------------------------------------------------------------------
+      subroutine transfb_nc2(tmor,tx)
+!------------------------------------------------------------------------
+!     Maps values from element to mortar when the nonconforming edges are
+!     shared by two nonconforming faces of an element.
+!     Although each face shall have four pieces of mortar, only the value in
+!     one piece (location (1,1)) is used in the calling routine so only
+!     the value in the first mortar is calculated in this subroutine.
+!------------------------------------------------------------------------
+
+      use ua_data
+      implicit none
+
+      double precision tx(lx1,lx1),tmor(lx1,lx1),bottom(lx1),  &
+     &                 temp(lx1,lx1)
+      integer col,j,i
+
+      call r_init(tmor,lx1*lx1,0.d0)
+      call r_init(temp,lx1*lx1,0.d0)
+      tmor(1,1)=tx(1,1)
+
+!.....mapping from tx to intermediate mortar temp + bottom
+      do col=1,lx1
+        temp(col,1)=tx(col,1)
+        j=1
+        bottom(col)= 0.d0
+        do i=2,lx1-1 
+          bottom(col) = bottom(col) + qbnew(i-1,j,1)*tx(col,i)
+        end do
+
+        do j=2,lx1
+          do i=2,lx1-1 
+            temp(col,j) = temp(col,j) + qbnew(i-1,j,1)*tx(col,i)
+          end do
+        end do
+      end do
+
+!.....from intermediate mortar to mortar
+
+!.....On the nonconforming edge, temp is divided by 2 as there will be
+!     a duplicate contribution from another face sharing this edge
+      col=1
+      do j=1,lx1
+        do i=2,lx1-1
+          tmor(j,col)=tmor(j,col)+ qbnew(i-1,j,1) * bottom(i) +  &
+     &                             qbnew(i-1,j,1) * temp(i,col) * 0.5d0 
+        end do
+      end do
+
+      do col=2,lx1
+        tmor(1,col)=tmor(1,col)+temp(1,col)
+        do j=1,lx1
+          do i=2,lx1-1
+            tmor(j,col) = tmor(j,col) + qbnew(i-1,j,1) *temp(i,col)
+          end do
+        end do
+      end do
+
+      return
+      end 
+
+
+!------------------------------------------------------------------------
+      subroutine transfb_nc1(tmor,tx)
+!------------------------------------------------------------------------
+!     Maps values from element to mortar when the nonconforming edges are
+!     shared by a nonconforming face and a conforming face of an element
+!------------------------------------------------------------------------
+
+      use ua_data
+      implicit none
+
+      double precision tx(lx1,lx1),tmor(lx1,lx1),bottom(lx1),  &
+     &                 temp(lx1,lx1)
+      integer col,j,i
+
+      call r_init(tmor,lx1*lx1,0.d0)
+      call r_init(temp,lx1*lx1,0.d0)
+
+      tmor(1,1)=tx(1,1)
+!.....Contribution from the nonconforming faces
+!     Since the calling subroutine is only interested in the value on the
+!     mortar (location (1,1)), only this piece of mortar is calculated.
+
+      do col=1,lx1
+        temp(col,1)=tx(col,1)
+        j = 1
+        bottom(col)= 0.d0
+        do i=2,lx1-1 
+          bottom(col)=bottom(col) + qbnew(i-1,j,1)*tx(col,i)
+        end do
+
+        do j=2,lx1
+          do i=2,lx1-1 
+            temp(col,j) = temp(col,j) + qbnew(i-1,j,1)*tx(col,i)
+          end do
+
+        end do
+      end do
+
+      col=1
+      tmor(1,col)=tmor(1,col)+bottom(1)
+      do j=1,lx1
+        do i=2,lx1-1
+
+!.........temp is not divided by 2 here. It includes the contribution
+!         from the other conforming face.
+
+          tmor(j,col)=tmor(j,col) + qbnew(i-1,j,1) *bottom(i) +  &
+     &                              qbnew(i-1,j,1) *temp(i,col) 
+        end do
+      end do
+
+      do col=2,lx1
+        tmor(1,col)=tmor(1,col)+temp(1,col)
+        do j=1,lx1
+          do i=2,lx1-1
+            tmor(j,col) = tmor(j,col) + qbnew(i-1,j,1) *temp(i,col)
+          end do
+        end do
+      end do
+
+      return
+      end
+
+
+!-------------------------------------------------------------------
+      subroutine transfb_c(tx)
+!-------------------------------------------------------------------
+!     Prepare initial guess for cg. All values from conforming 
+!     boundary are copied and summed on tmor.
+!-------------------------------------------------------------------
+
+      use ua_data
+      implicit none
+
+      double precision third
+      parameter (third = 1.d0/3.d0)
+      double precision tx(*)
+      integer il1,il2,il3,il4,ig1,ig2,ig3,ig4,ie,iface,col,j,ig,il
+
+!$OMP PARALLEL DEFAULT(SHARED) PRIVATE(IE,IFACE,IL1,IL2,  &
+!$OMP& IL3,IL4,IG1,IG2,IG3,IG4,COL,J,IG,IL) 
+
+!$OMP DO
+      do j=1,nmor
+        tmort(j)=0.d0
+      end do
+!$OMP END DO
+
+!$OMP DO
+      do ie=1,nelt
+        do iface=1,nsides
+          if(cbc(iface,ie).ne.3)then
+            il1 = idel(1,1,iface,ie)
+            il2 = idel(lx1,1,iface,ie)
+            il3 = idel(1,lx1,iface,ie)
+            il4 = idel(lx1,lx1,iface,ie)
+            ig1 = idmo(1,  1,  1,1,iface,ie)
+            ig2 = idmo(lx1,1,  1,2,iface,ie)
+            ig3 = idmo(1,  lx1,2,1,iface,ie)
+            ig4 = idmo(lx1,lx1,2,2,iface,ie)
+!
+!$          call omp_set_lock(tlock(ig1))
+            tmort(ig1) = tmort(ig1)+tx(il1)*third
+!$          call omp_unset_lock(tlock(ig1))
+!
+!$          call omp_set_lock(tlock(ig2))
+            tmort(ig2) = tmort(ig2)+tx(il2)*third
+!$          call omp_unset_lock(tlock(ig2))
+!
+!$          call omp_set_lock(tlock(ig3))
+            tmort(ig3) = tmort(ig3)+tx(il3)*third
+!$          call omp_unset_lock(tlock(ig3))
+!
+!$          call omp_set_lock(tlock(ig4))
+            tmort(ig4) = tmort(ig4)+tx(il4)*third
+!$          call omp_unset_lock(tlock(ig4))
+
+            do  col=2,lx1-1
+              do j=2,lx1-1
+                il=idel(j,col,iface,ie)
+                ig=idmo(j,col,1,1,iface,ie)
+!
+!$              call omp_set_lock(tlock(ig))
+                tmort(ig)=tmort(ig)+tx(il)
+!$              call omp_unset_lock(tlock(ig))
+              end do
+            end do
+
+            if(idmo(lx1,1,1,1,iface,ie).eq.0)then
+              do j=2,lx1-1
+                il=idel(j,1,iface,ie)
+                ig=idmo(j,1,1,1,iface,ie)
+!
+!$              call omp_set_lock(tlock(ig))
+                tmort(ig)=tmort(ig)+tx(il)*0.5d0
+!$              call omp_unset_lock(tlock(ig))
+              end do
+            end if
+
+            if(idmo(lx1,2,1,2,iface,ie).eq.0)then
+              do j=2,lx1-1
+                il=idel(lx1,j,iface,ie)
+                ig=idmo(lx1,j,1,1,iface,ie)
+!
+!$              call omp_set_lock(tlock(ig))
+                tmort(ig)=tmort(ig)+tx(il)*0.5d0
+!$              call omp_unset_lock(tlock(ig))
+              end do
+            end if
+
+            if(idmo(2,lx1,2,1,iface,ie).eq.0)then
+              do j=2,lx1-1
+                il=idel(j,lx1,iface,ie)
+                ig=idmo(j,lx1,1,1,iface,ie)
+!
+!$              call omp_set_lock(tlock(ig))
+                tmort(ig)=tmort(ig)+tx(il)*0.5d0
+!$              call omp_unset_lock(tlock(ig))
+              end do
+            end if
+
+            if(idmo(1,lx1,1,1,iface,ie).eq.0)then
+              do j=2,lx1-1
+                il=idel(1,j,iface,ie)
+                ig=idmo(1,j,1,1,iface,ie)
+!
+!$              call omp_set_lock(tlock(ig))
+                tmort(ig)=tmort(ig)+tx(il)*0.5d0
+!$              call omp_unset_lock(tlock(ig))
+              end do
+            end if
+          end if!
+        end do
+      end do
+!$OMP END DO NOWAIT
+!$OMP END PARALLEL
+      return
+      end
+
+!-------------------------------------------------------------------
+      subroutine transfb_c_2(tx)
+!-------------------------------------------------------------------
+!     Prepare initial guess for CG. All values from conforming 
+!     boundary are copied and summed in tmort. 
+!     mormult is multiplicity, which is used to average tmort.
+!-------------------------------------------------------------------
+
+      use ua_data
+      implicit none
+
+      double precision third
+      parameter (third = 1.d0/3.d0)
+      double precision tx(*)
+      integer il1,il2,il3,il4,ig1,ig2,ig3,ig4,ie,iface,col,j,ig,il
+
+!$OMP PARALLEL DEFAULT(SHARED) PRIVATE(IE,IFACE,IL1,IL2,  &
+!$OMP& IL3,IL4,IG1,IG2,IG3,IG4,COL,J,IG,IL)
+
+!$OMP DO     
+      do j=1,nmor
+        tmort(j)=0.d0
+      end do
+!$OMP END DO nowait
+!$OMP DO
+      do j=1,nmor
+        mormult(j)=0.d0
+      end do
+!$OMP END DO
+
+!$OMP DO 
+      do ie=1,nelt
+        do iface=1,nsides
+          
+          if(cbc(iface,ie).ne.3)then
+            il1 = idel(1,  1,  iface,ie)
+            il2 = idel(lx1,1,  iface,ie)
+            il3 = idel(1,  lx1,iface,ie)
+            il4 = idel(lx1,lx1,iface,ie)
+            ig1 = idmo(1,  1,  1,1,iface,ie)
+            ig2 = idmo(lx1,1,  1,2,iface,ie)
+            ig3 = idmo(1,  lx1,2,1,iface,ie)
+            ig4 = idmo(lx1,lx1,2,2,iface,ie)
+!
+!$          call omp_set_lock(tlock(ig1))
+            tmort(ig1) = tmort(ig1)+tx(il1)*third
+            mormult(ig1) = mormult(ig1)+third
+!$          call omp_unset_lock(tlock(ig1))
+!
+!$          call omp_set_lock(tlock(ig2))
+            tmort(ig2) = tmort(ig2)+tx(il2)*third
+            mormult(ig2) = mormult(ig2)+third
+!$          call omp_unset_lock(tlock(ig2))
+!
+!$          call omp_set_lock(tlock(ig3))
+            tmort(ig3) = tmort(ig3)+tx(il3)*third
+            mormult(ig3) = mormult(ig3)+third
+!$          call omp_unset_lock(tlock(ig3))
+!
+!$          call omp_set_lock(tlock(ig4))
+            tmort(ig4) = tmort(ig4)+tx(il4)*third
+            mormult(ig4) = mormult(ig4)+third
+!$          call omp_unset_lock(tlock(ig4))
+
+            do  col=2,lx1-1
+              do j=2,lx1-1
+                il=idel(j,col,iface,ie)
+                ig=idmo(j,col,1,1,iface,ie)
+!
+!$              call omp_set_lock(tlock(ig))
+                tmort(ig)=tmort(ig)+tx(il)
+                mormult(ig)=mormult(ig)+1.d0
+!$              call omp_unset_lock(tlock(ig))
+              end do
+            end do
+
+            if(idmo(lx1,1,1,1,iface,ie).eq.0)then
+              do j=2,lx1-1
+                il=idel(j,1,iface,ie)
+                ig=idmo(j,1,1,1,iface,ie)
+!
+!$              call omp_set_lock(tlock(ig))
+                tmort(ig)=tmort(ig)+tx(il)*0.5d0
+                mormult(ig)=mormult(ig)+0.5d0
+!$              call omp_unset_lock(tlock(ig))
+              end do
+            end if
+
+            if(idmo(lx1,2,1,2,iface,ie).eq.0)then
+              do j=2,lx1-1
+                il=idel(lx1,j,iface,ie)
+                ig=idmo(lx1,j,1,1,iface,ie)
+!
+!$              call omp_set_lock(tlock(ig))
+                tmort(ig)=tmort(ig)+tx(il)*0.5d0
+                mormult(ig)=mormult(ig)+0.5d0
+!$              call omp_unset_lock(tlock(ig))
+              end do
+            end if
+
+            if(idmo(2,lx1,2,1,iface,ie).eq.0)then
+              do j=2,lx1-1
+                il=idel(j,lx1,iface,ie)
+                ig=idmo(j,lx1,1,1,iface,ie)
+!
+!$              call omp_set_lock(tlock(ig))
+                tmort(ig)=tmort(ig)+tx(il)*0.5d0
+                mormult(ig)=mormult(ig)+0.5d0
+!$              call omp_unset_lock(tlock(ig))
+               end do
+            end if
+
+            if(idmo(1,lx1,1,1,iface,ie).eq.0)then
+              do j=2,lx1-1
+                il=idel(1,j,iface,ie)
+                ig=idmo(1,j,1,1,iface,ie)
+!
+!$              call omp_set_lock(tlock(ig))
+                tmort(ig)=tmort(ig)+tx(il)*0.5d0
+                mormult(ig)=mormult(ig)+0.5d0
+!$              call omp_unset_lock(tlock(ig))
+              end do
+            end if
+          end if!nnje=1
+        end do
+      end do
+!$OMP END DO NOWAIT
+!$OMP END PARALLEL
+
+      return
+      end
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/UA/transfer_au.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/UA/transfer_au.f90
new file mode 100644
index 000000000..a012da15a
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/UA/transfer_au.f90
@@ -0,0 +1,1056 @@
+!------------------------------------------------------------------
+      subroutine init_locks
+!------------------------------------------------------------------
+!     This version uses ATOMIC for atomic updates, 
+!     but locks are still used in get_emo (mason.f).
+!------------------------------------------------------------------
+
+      use ua_data
+      implicit none
+
+      integer i
+
+!.....initialize locks in parallel
+!$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(i)
+!$    do i=1,8*lelt
+!$      call omp_init_lock(tlock(i))
+!$    end do
+
+      return
+      end
+
+
+!------------------------------------------------------------------
+      subroutine transf(tmor,tx)
+!------------------------------------------------------------------
+!     Map values from mortar(tmor) to element(tx)
+!------------------------------------------------------------------
+
+      use ua_data
+      implicit none
+
+      double precision tmor(*),tx(*), tmp(lx1,lx1,2)
+      integer ig1,ig2,ig3,ig4,ie,iface,il1,il2,il3,il4,  &
+     &        nnje,ije1,ije2,col,i,j,ig,il
+
+
+!.....zero out tx on element boundaries
+      call col2(tx,tmult,ntot)     
+
+!$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(il,j,ig,i,col,ije2,ije1,ig4,  &
+!$OMP& ig3,ig2,ig1,nnje,il4,il3,il2,il1,iface,ie,tmp)
+      do ie=1,nelt
+        do iface=1,nsides
+
+!.........get the collocation point index of the four local corners on the
+!         face iface of element ie
+          il1=idel(1,1,iface,ie)
+          il2=idel(lx1,1,iface,ie)
+          il3=idel(1,lx1,iface,ie)
+          il4=idel(lx1,lx1,iface,ie)
+
+!.........get the mortar indices of the four local corners
+          ig1= idmo(1,  1  ,1,1,iface,ie)
+          ig2= idmo(lx1,1  ,1,2,iface,ie)
+          ig3= idmo(1,  lx1,2,1,iface,ie)
+          ig4= idmo(lx1,lx1,2,2,iface,ie)
+  
+!.........copy the value from tmor to tx for these four local corners
+          tx(il1) = tmor(ig1)
+          tx(il2) = tmor(ig2)
+          tx(il3) = tmor(ig3)
+          tx(il4) = tmor(ig4)
+ 
+!.........nnje=1 for conforming faces, nnje=2 for nonconforming faces
+          if(cbc(iface,ie).eq.3) then
+            nnje=2
+          else
+            nnje=1 
+          end if
+
+!.........for nonconforming faces
+          if(nnje.eq.2) then
+
+!...........nonconforming faces have four pieces of mortar, first map them to
+!           two intermediate mortars, stored in tmp
+            call r_init(tmp,lx1*lx1*2,0.d0)
+   
+            do ije1=1,nnje
+              do ije2=1,nnje
+                do col=1,lx1
+
+!.................in each row col, when column i=1 or lx1, the value
+!                 in tmor is copied to tmp
+                  i = v_end(ije2)
+                  ig=idmo(i,col,ije1,ije2,iface,ie)
+                  tmp(i,col,ije1)=tmor(ig)
+
+!.................in each row col, value in the interior three collocation
+!                 points is computed by applying mapping matrix qbnew to tmor
+                  do i=2,lx1-1
+                    il= idel(i,col,iface,ie)
+                    do j=1,lx1
+                      ig=idmo(j,col,ije1,ije2,iface,ie)
+                      tmp(i,col,ije1) = tmp(i,col,ije1) +  &
+     &                qbnew(i-1,j,ije2)*tmor(ig)
+                    end do
+                  end do
+
+                end do
+              end do
+            end do
+      
+!...........mapping from two pieces of intermediate mortar tmp to element 
+!           face tx
+
+            do ije1=1, nnje
+
+!.............the first column, col=1, is an edge of face iface.
+!             the value on the three interior collocation points, tx, is 
+!             computed by applying mapping matrices qbnew to tmp.
+!             the mapping result is divided by 2, because there will be 
+!             duplicated contribution from another face sharing this edge.
+              col=1
+              do i=2,lx1-1
+                il= idel(col,i,iface,ie)
+                do j=1,lx1
+                    tx(il) = tx(il) + qbnew(i-1,j,ije1)*  &
+     &                       tmp(col,j,ije1)*0.5d0
+                end do 
+              end do 
+
+!.............for columns 2 ~ lx1-1
+              do col=2,lx1-1
+
+!...............when i=1 or lx1, the collocation points are also on an edge of
+!               the face, so the mapping result also needs to be divided by 2
+                i = v_end(ije1)
+                il= idel(col,i,iface,ie)
+                tx(il)=tx(il)+tmp(col,i,ije1)*0.5d0
+
+!...............compute the value at interior collocation points in 
+!               columns 2 ~ lx1-1
+                do i=2,lx1-1
+                  il= idel(col,i,iface,ie)
+                  do j=1,lx1
+                    tx(il) = tx(il) + qbnew(i-1,j,ije1)* tmp(col,j,ije1)
+                  end do 
+                end do
+              end do
+
+!.............same as col=1
+              col=lx1
+              do  i=2,lx1-1
+                il= idel(col,i,iface,ie)
+                do j=1,lx1
+                  tx(il) = tx(il) + qbnew(i-1,j,ije1)*  &
+     &                     tmp(col,j,ije1)*0.5d0
+                end do 
+              end do
+            end do
+
+!.........for conforming faces
+          else
+
+!.........face interior
+            do col=2,lx1-1
+              do i=2,lx1-1  
+                il= idel(i,col,iface,ie)
+                ig= idmo(i,col,1,1,iface,ie)
+                tx(il)=tmor(ig)
+              end do
+            end do
+
+        
+!...........edges of conforming faces
+
+!...........if local edge 1 is a nonconforming edge
+            if(idmo(lx1,1,1,1,iface,ie).ne.0)then
+              do i=2,lx1-1               
+                il= idel(i,1,iface,ie)
+                do ije1=1,2
+                  do j=1,lx1
+                    ig=idmo(j,1,1,ije1,iface,ie)
+                    tx(il) = tx(il) + qbnew(i-1,j,ije1)*tmor(ig)*0.5d0
+                  end do
+                end do
+              end do
+
+!...........if local edge 1 is a conforming edge
+            else
+              do i=2,lx1-1
+                il= idel(i,1,iface,ie)
+                ig= idmo(i,1,1,1,iface,ie)
+                tx(il)=tmor(ig)
+              end do
+            end if 
+
+!...........if local edge 2 is a nonconforming edge
+            if(idmo(lx1,2,1,2,iface,ie).ne.0)then
+              do i=2,lx1-1               
+                il= idel(lx1,i,iface,ie)
+                do ije1=1,2
+                  do j=1,lx1
+                    ig=idmo(lx1,j,ije1,2,iface,ie)
+                    tx(il) = tx(il) + qbnew(i-1,j,ije1)*tmor(ig)*0.5d0
+                  end do
+                end do
+              end do
+
+!...........if local edge 2 is a conforming edge
+            else
+              do i=2,lx1-1
+                il= idel(lx1,i,iface,ie)
+                ig= idmo(lx1,i,1,1,iface,ie)
+                tx(il)=tmor(ig)
+              end do
+            end if 
+
+!...........if local edge 3 is a nonconforming edge
+            if(idmo(2,lx1,2,1,iface,ie).ne.0)then
+              do  i=2,lx1-1               
+                il= idel(i,lx1,iface,ie)
+                do ije1=1,2
+                  do j=1,lx1
+                    ig=idmo(j,lx1,2,ije1,iface,ie)
+                    tx(il) = tx(il) + qbnew(i-1,j,ije1)*tmor(ig)*0.5d0
+                  end do
+                end do
+              end do
+
+!...........if local edge 3 is a conforming edge
+            else
+              do i=2,lx1-1
+                il= idel(i,lx1,iface,ie)
+                ig= idmo(i,lx1,1,1,iface,ie)
+                tx(il)=tmor(ig)
+              end do
+            end if 
+
+!...........if local edge 4 is a nonconforming edge
+            if(idmo(1,lx1,1,1,iface,ie).ne.0)then
+              do i=2,lx1-1               
+                il= idel(1,i,iface,ie)
+                do ije1=1,2
+                  do j=1,lx1
+                    ig=idmo(1,j,ije1,1,iface,ie)
+                    tx(il) = tx(il) + qbnew(i-1,j,ije1)*tmor(ig)*0.5d0
+                  end do
+                end do
+              end do
+!...........if local edge 4 is a conforming edge
+            else
+              do i=2,lx1-1
+                il= idel(1,i,iface,ie)
+                ig= idmo(1,i,1,1,iface,ie)
+                tx(il)=tmor(ig)
+              end do
+            end if 
+          end if
+          
+        end do
+      end do
+!$OMP END PARALLEL DO
+
+      return
+      end
+
+
+!------------------------------------------------------------------
+      subroutine transfb(tmor,tx)
+!------------------------------------------------------------------
+!     Map from element(tx) to mortar(tmor).
+!     tmor sums contributions from all elements.
+!------------------------------------------------------------------
+
+      use ua_data
+      implicit none
+
+      double precision third
+      parameter (third=1.d0/3.d0)
+      integer shift
+
+      double precision tmp,tmp1,tx(*),tmor(*),temp(lx1,lx1,2),  &
+     &                 top(lx1,2)
+      integer il1,il2,il3,il4,ig1,ig2,ig3,ig4,ie,iface,nnje,  &
+     &        ije1,ije2,col,i,j,ije,ig,il
+
+!$OMP PARALLEL DEFAULT(SHARED) PRIVATE(il,j,ig,i,col,ije2,ije1,ig4,  &
+!$OMP& ig3,ig2,ig1,nnje,il4,il3,il2,il1,iface,ie,ije,  &
+!$OMP& tmp,shift,temp,top,tmp1)
+
+!$OMP DO
+      do ie=1,nmor
+        tmor(ie)=0.d0
+      end do
+!$OMP END DO
+
+!$OMP DO
+      do ie=1,nelt
+        do iface=1,nsides
+!.........nnje=1 for conforming faces, nnje=2 for nonconforming faces
+          if(cbc(iface,ie).eq.3) then
+            nnje=2
+          else
+            nnje=1 
+          end if
+
+!.........get collocation point index of four local corners on the face
+          il1 = idel(1,  1,  iface,ie)
+          il2 = idel(lx1,1,  iface,ie)
+          il3 = idel(1,  lx1,iface,ie)
+          il4 = idel(lx1,lx1,iface,ie)
+
+!.........get the mortar indices of the four local corners
+          ig1 = idmo(1,  1,  1,1,iface,ie)
+          ig2 = idmo(lx1,1,  1,2,iface,ie)
+          ig3 = idmo(1,  lx1,2,1,iface,ie )
+          ig4 = idmo(lx1,lx1,2,2,iface,ie)
+
+!.........sum the values from tx to tmor for these four local corners
+!         only 1/3 of the value is summed, since there will be two duplicated
+!         contributions from the other two faces sharing this vertex 
+!$OMP ATOMIC
+          tmor(ig1) = tmor(ig1)+tx(il1)*third
+!$OMP ATOMIC
+          tmor(ig2) = tmor(ig2)+tx(il2)*third
+!$OMP ATOMIC
+          tmor(ig3) = tmor(ig3)+tx(il3)*third
+!$OMP ATOMIC
+          tmor(ig4) = tmor(ig4)+tx(il4)*third
+
+!.........for nonconforming faces
+          if(nnje.eq.2) then       
+            call r_init(temp,lx1*lx1*2,0.d0)
+
+!...........nonconforming faces have four pieces of mortar, first map tx to
+!           two intermediate mortars stored in temp
+
+            do ije2 = 1, nnje
+              shift = ije2-1
+              do col=1,lx1
+!...............For mortar points on face edge (top and bottom), copy the 
+!               value from tx to temp
+                il=idel(col,v_end(ije2),iface,ie)
+                temp(col,v_end(ije2),ije2)=tx(il)
+
+!...............For mortar points on face edge (top and bottom), calculate 
+!               the interior points' contribution to them, i.e. top()
+                j = v_end(ije2)
+                tmp=0.d0
+                do i=2,lx1-1 
+                  il=idel(col,i,iface,ie)
+                  tmp = tmp + qbnew(i-1,j,ije2)*tx(il)
+                end do
+
+                top(col,ije2)=tmp
+
+!...............Use mapping matrices qbnew to map the value from tx to temp 
+!               for mortar points not on the top or bottom face edge.
+                do j=2-shift,lx1-shift
+                  tmp=0.d0
+                  do i=2,lx1-1 
+                    il=idel(col,i,iface,ie)
+                    tmp = tmp + qbnew(i-1,j,ije2)*tx(il)
+                  end do
+                  temp(col,j,ije2) = tmp + temp(col,j,ije2)
+                end do
+              end do
+            end do
+
+!...........mapping from temp to tmor
+
+            do ije1=1, nnje
+              shift = ije1-1
+              do ije2=1,nnje
+
+!...............for each column of collocation points on a piece of mortar
+                do col=2-shift,lx1-shift
+
+!.................For the end point, which is on an edge (local edge 2,4), 
+!                 the contribution is halved since there will be duplicated 
+!                 contribution from another face sharing this edge.
+
+                  ig=idmo(v_end(ije2),col,ije1,ije2,iface,ie)
+!$OMP ATOMIC
+                  tmor(ig)=tmor(ig)+temp(v_end(ije2),col,ije1)*0.5d0
+
+!.................In each row of collocation points on a piece of mortar, 
+!                 sum the contributions from interior collocation points 
+!                 (i=2,lx1-1)
+
+                  do  j=1,lx1
+                    tmp=0.d0
+                    do i=2,lx1-1
+                      tmp = tmp + qbnew(i-1,j,ije2) * temp(i,col,ije1)
+                    end do
+                    ig=idmo(j,col,ije1,ije2,iface,ie)
+!$OMP ATOMIC
+                    tmor(ig)=tmor(ig)+tmp
+                  end do
+                end do
+
+!...............For tmor on local edge 1 and 3, tmp is the contribution from
+!               an edge, so it is halved because of duplicated contribution
+!               from another face sharing this edge. tmp1 is contribution 
+!               from face interior. 
+
+                col = v_end(ije1)
+                ig=idmo(v_end(ije2),col,ije1,ije2,iface,ie)
+!$OMP ATOMIC
+                tmor(ig)=tmor(ig)+top(v_end(ije2),ije1)*0.5d0
+                do  j=1,lx1
+                  tmp=0.d0
+                  tmp1=0.d0
+                  do i=2,lx1-1
+                    tmp  = tmp  + qbnew(i-1,j,ije2) * temp(i,col,ije1)
+                    tmp1 = tmp1 + qbnew(i-1,j,ije2) * top(i,ije1)
+                  end do
+                  ig=idmo(j,col,ije1,ije2,iface,ie)
+!$OMP ATOMIC
+                  tmor(ig)=tmor(ig)+tmp*0.5d0+tmp1 
+                end do
+              end do
+            end do
+
+!.........for conforming faces
+          else
+
+!.........face interior
+            do col=2,lx1-1
+              do j=2,lx1-1
+                il=idel(j,col,iface,ie)
+                ig=idmo(j,col,1,1,iface,ie)
+!$OMP ATOMIC
+                tmor(ig)=tmor(ig)+tx(il)
+              end do
+            end do
+
+!...........edges of conforming faces
+
+!...........if local edge 1 is a nonconforming edge
+            if(idmo(lx1,1,1,1,iface,ie).ne.0)then
+              do ije=1,2
+                do j=1,lx1
+                  tmp=0.d0
+                  do i=2,lx1-1
+                    il=idel(i,1,iface,ie)
+                    tmp= tmp + qbnew(i-1,j,ije)*tx(il)
+                  end do
+                  ig=idmo(j,1,1,ije,iface,ie)
+!$OMP ATOMIC
+                  tmor(ig)=tmor(ig)+tmp*0.5d0
+                end do
+              end do
+
+!...........if local edge 1 is a conforming edge
+            else
+              do j=2,lx1-1
+                il=idel(j,1,iface,ie)
+                ig=idmo(j,1,1,1,iface,ie)
+!$OMP ATOMIC
+                tmor(ig)=tmor(ig)+tx(il)*0.5d0
+              end do
+            end if 
+
+!...........if local edge 2 is a nonconforming edge
+            if(idmo(lx1,2,1,2,iface,ie).ne.0)then
+              do ije=1,2
+                do j=1,lx1
+                  tmp=0.d0
+                  do i=2,lx1-1
+                    il=idel(lx1,i,iface,ie)
+                    tmp = tmp + qbnew(i-1,j,ije)*tx(il)
+                  end do
+                  ig=idmo(lx1,j,ije,2,iface,ie)
+!$OMP ATOMIC
+                  tmor(ig)=tmor(ig)+tmp*0.5d0
+                end do
+              end do
+
+!...........if local edge 2 is a conforming edge
+            else
+              do j=2,lx1-1
+                il=idel(lx1,j,iface,ie)
+                ig=idmo(lx1,j,1,1,iface,ie)
+!$OMP ATOMIC
+                tmor(ig)=tmor(ig)+tx(il)*0.5d0
+              end do
+            end if 
+
+!...........if local edge 3 is a nonconforming edge
+            if(idmo(2,lx1,2,1,iface,ie).ne.0)then
+              do ije=1,2
+                do j=1,lx1
+                  tmp=0.d0
+                  do i=2,lx1-1
+                    il=idel(i,lx1,iface,ie)
+                    tmp = tmp + qbnew(i-1,j,ije)*tx(il)
+                  end do
+                  ig=idmo(j,lx1,2,ije,iface,ie)
+!$OMP ATOMIC
+                  tmor(ig)=tmor(ig)+tmp*0.5d0
+                end do
+              end do
+
+!...........if local edge 3 is a conforming edge
+            else
+              do j=2,lx1-1
+                il=idel(j,lx1,iface,ie)
+                ig=idmo(j,lx1,1,1,iface,ie)
+!$OMP ATOMIC
+                tmor(ig)=tmor(ig)+tx(il)*0.5d0
+              end do
+            end if 
+
+!...........if local edge 4 is a nonconforming edge
+            if(idmo(1,lx1,1,1,iface,ie).ne.0)then
+              do ije=1,2
+                do j=1,lx1
+                  tmp=0.d0
+                  do i=2,lx1-1
+                    il=idel(1,i,iface,ie)
+                    tmp = tmp + qbnew(i-1,j,ije)*tx(il)
+                  end do
+                  ig=idmo(1,j,ije,1,iface,ie)
+!$OMP ATOMIC
+                  tmor(ig)=tmor(ig)+tmp*0.5d0
+                end do
+              end do
+
+!...........if local edge 4 is a conforming edge
+            else
+              do j=2,lx1-1
+                il=idel(1,j,iface,ie)
+                ig=idmo(1,j,1,1,iface,ie)
+!$OMP ATOMIC
+                tmor(ig)=tmor(ig)+tx(il)*0.5d0
+              end do
+            end if 
+          end if
+        end do
+      end do
+!$OMP END DO NOWAIT
+!$OMP END PARALLEL 
+
+      return
+      end
+
+
+!--------------------------------------------------------------
+      subroutine transfb_cor_e(n,tmor,tx)
+!--------------------------------------------------------------
+!     This subroutine performs the edge to mortar mapping and
+!     calculates the mapping result on the mortar point at a vertex
+!     under situation 1,2, or 3.
+!     n refers to the configuration of three edges sharing a vertex, 
+!     n = 1: only one edge is nonconforming
+!     n = 2: two edges are nonconforming 
+!     n = 3: three edges are nonconforming 
+!-------------------------------------------------------------------
+
+      use ua_data
+      implicit none
+
+      double precision tmor,tx(lx1,lx1,lx1),tmp
+      integer i,n
+
+      tmor=tx(1,1,1)
+
+      do i=2,lx1-1
+        tmor= tmor + qbnew(i-1,1,1)*tx(i,1,1)
+      end do
+
+      if(n.gt.1)then
+        do i=2,lx1-1
+          tmor= tmor + qbnew(i-1,1,1)*tx(1,i,1)
+        end do
+      end if
+
+      if(n.eq.3)then
+        do i=2,lx1-1
+          tmor= tmor + qbnew(i-1,1,1)*tx(1,1,i)
+        end do
+      end if
+
+      return
+      end
+
+!--------------------------------------------------------------
+      subroutine transfb_cor_f(n,tmor,tx)
+!--------------------------------------------------------------
+!     This subroutine performs the mapping from face to mortar.
+!     Output tmor is the mapping result on a mortar vertex
+!     for these configurations of three edges and three faces sharing a vertex:
+!     n=4: only one face is nonconforming 
+!     n=5: one face and one edge are nonconforming
+!     n=6: two faces are nonconforming 
+!     n=7: three faces are nonconforming 
+!--------------------------------------------------------------
+
+      use ua_data
+      implicit none
+
+      double precision tx(lx1,lx1,lx1),tmor,temp(lx1)
+      integer col,i,n
+
+      call r_init(temp,lx1,0.d0)
+
+      do col=1,lx1
+        temp(col)=tx(col,1,1)
+        do i=2,lx1-1
+          temp(col) = temp(col) + qbnew(i-1,1,1)*tx(col,i,1)
+        end do
+      end do
+      tmor=temp(1)
+
+      do i=2,lx1-1
+        tmor = tmor + qbnew(i-1,1,1) *temp(i)
+      end do
+
+      if(n.eq.5)then
+        do i=2,lx1-1
+          tmor = tmor + qbnew(i-1,1,1) *tx(1,1,i)
+        end do
+      end if
+ 
+      if(n.ge.6)then
+        call r_init(temp,lx1,0.d0)
+        do col=1,lx1
+          do i=2,lx1-1
+            temp(col) = temp(col) + qbnew(i-1,1,1)*tx(col,1,i)
+          end do
+        end do
+        tmor=tmor+temp(1)
+        do i=2,lx1-1
+          tmor = tmor +qbnew(i-1,1,1) *temp(i)
+        end do
+      end if
+        
+      if(n.eq.7)then
+        call r_init(temp,lx1,0.d0)
+        do col=2,lx1-1
+          do i=2,lx1-1
+            temp(col) = temp(col) + qbnew(i-1,1,1)*tx(1,col,i)
+          end do
+        end do
+        do i=2,lx1-1
+          tmor = tmor + qbnew(i-1,1,1) *temp(i)
+        end do
+      end if
+
+      return
+      end
+
+
+!-------------------------------------------------------------------------
+      subroutine transf_nc(tmor,tx)
+!------------------------------------------------------------------------
+!     Perform mortar to element mapping on a nonconforming face. 
+!     This subroutine is used when all entries in tmor are zero except
+!     one tmor(i,j)=1. So this routine is simplified. Only one piece of 
+!     mortar  (tmor only has two indices) and one piece of intermediate 
+!     mortar (tmp) are involved.
+!------------------------------------------------------------------------
+
+      use ua_data
+      implicit none
+
+      double precision tmor(lx1,lx1), tx(lx1,lx1), tmp(lx1,lx1)
+      integer col,i,j
+
+      call r_init(tmp,lx1*lx1,0.d0)
+      do col=1,lx1
+        i = 1
+        tmp(i,col)=tmor(i,col)                           
+        do i=2,lx1-1
+          do j=1,lx1
+            tmp(i,col) = tmp(i,col) + qbnew(i-1,j,1)*tmor(j,col)
+          end do
+        end do
+      end do
+
+      do col=1,lx1
+        i = 1
+        tx(col,i)   = tx(col,i)   + tmp(col,i)
+        do i=2,lx1-1
+          do j=1,lx1
+            tx(col,i) = tx(col,i) + qbnew(i-1,j,1)*tmp(col,j)
+          end do
+        end do
+      end do
+
+      return                                                  
+      end                                                     
+
+!------------------------------------------------------------------------
+      subroutine transfb_nc0(tmor,tx)
+!------------------------------------------------------------------------
+!     Performs mapping from element to mortar when the nonconforming 
+!     edges are shared by two conforming faces of an element.
+!------------------------------------------------------------------------
+
+      use ua_data
+      implicit none
+
+      double precision tmor(lx1,lx1),tx(lx1,lx1,lx1)
+      integer i,j
+
+      call r_init(tmor,lx1*lx1,0.d0)
+      do j=1,lx1
+        do i=2,lx1-1
+          tmor(j,1)= tmor(j,1) + qbnew(i-1,j  ,1)*tx(i,1,1)
+        end do
+      end do
+
+      return
+      end 
+
+!------------------------------------------------------------------------
+      subroutine transfb_nc2(tmor,tx)
+!------------------------------------------------------------------------
+!     Maps values from element to mortar when the nonconforming edges are
+!     shared by two nonconforming faces of an element.
+!     Although each face shall have four pieces of mortar, only value in
+!     one piece (location (1,1)) is used in the calling routine so only
+!     the value in the first mortar is calculated in this subroutine.
+!------------------------------------------------------------------------
+
+      use ua_data
+      implicit none
+
+      double precision tx(lx1,lx1),tmor(lx1,lx1),bottom(lx1),  &
+     &                 temp(lx1,lx1)
+      integer col,j,i
+
+      call r_init(tmor,lx1*lx1,0.d0)
+      call r_init(temp,lx1*lx1,0.d0)
+      tmor(1,1)=tx(1,1)
+
+!.....mapping from tx to intermediate mortar temp + bottom
+      do col=1,lx1
+        temp(col,1)=tx(col,1)
+        j=1
+        bottom(col)= 0.d0
+        do i=2,lx1-1 
+          bottom(col) = bottom(col) + qbnew(i-1,j,1)*tx(col,i)
+        end do
+
+        do j=2,lx1
+          do i=2,lx1-1 
+            temp(col,j) = temp(col,j) + qbnew(i-1,j,1)*tx(col,i)
+          end do
+        end do
+      end do
+
+!.....from intermediate mortar to mortar
+
+!.....On the nonconforming edge, temp is divided by 2 as there will be
+!     a duplicate contribution from another face sharing this edge
+      col=1
+      do j=1,lx1
+        do i=2,lx1-1
+          tmor(j,col)=tmor(j,col)+ qbnew(i-1,j,1) * bottom(i) +  &
+     &                             qbnew(i-1,j,1) * temp(i,col) * 0.5d0 
+        end do
+      end do
+
+      do col=2,lx1
+        tmor(1,col)=tmor(1,col)+temp(1,col)
+        do j=1,lx1
+          do i=2,lx1-1
+            tmor(j,col) = tmor(j,col) + qbnew(i-1,j,1) *temp(i,col)
+          end do
+        end do
+      end do
+
+      return
+      end 
+
+
+!------------------------------------------------------------------------
+      subroutine transfb_nc1(tmor,tx)
+!------------------------------------------------------------------------
+!     Maps values from element to mortar when the nonconforming edges are
+!     shared by a nonconforming face and a conforming face of an element
+!------------------------------------------------------------------------
+
+      use ua_data
+      implicit none
+
+      double precision tx(lx1,lx1),tmor(lx1,lx1),bottom(lx1),  &
+     &                 temp(lx1,lx1)
+      integer col,j,i
+
+      call r_init(tmor,lx1*lx1,0.d0)
+      call r_init(temp,lx1*lx1,0.d0)
+
+      tmor(1,1)=tx(1,1)
+!.....Contribution from the nonconforming faces
+!     Since the calling subroutine is only interested in the value on the
+!     mortar (location (1,1)), only this piece of mortar is calculated.
+
+      do col=1,lx1
+        temp(col,1)=tx(col,1)
+        j = 1
+        bottom(col)= 0.d0
+        do i=2,lx1-1 
+          bottom(col)=bottom(col) + qbnew(i-1,j,1)*tx(col,i)
+        end do
+
+        do j=2,lx1
+          do i=2,lx1-1 
+            temp(col,j) = temp(col,j) + qbnew(i-1,j,1)*tx(col,i)
+          end do
+
+        end do
+      end do
+
+      col=1
+      tmor(1,col)=tmor(1,col)+bottom(1)
+      do j=1,lx1
+        do i=2,lx1-1
+
+!.........temp is not divided by 2 here. It includes the contribution
+!         from the other conforming face.
+
+          tmor(j,col)=tmor(j,col) + qbnew(i-1,j,1) *bottom(i) +  &
+     &                              qbnew(i-1,j,1) *temp(i,col) 
+        end do
+      end do
+
+      do col=2,lx1
+        tmor(1,col)=tmor(1,col)+temp(1,col)
+        do j=1,lx1
+          do i=2,lx1-1
+            tmor(j,col) = tmor(j,col) + qbnew(i-1,j,1) *temp(i,col)
+          end do
+        end do
+      end do
+
+      return
+      end
+
+
+!-------------------------------------------------------------------
+      subroutine transfb_c(tx)
+!-------------------------------------------------------------------
+!     Prepare initial guess for cg. All values from conforming 
+!     boundary are copied and summed on tmor.
+!-------------------------------------------------------------------
+
+      use ua_data
+      implicit none
+
+      double precision third
+      parameter (third = 1.d0/3.d0)
+      double precision tx(*)
+      integer il1,il2,il3,il4,ig1,ig2,ig3,ig4,ie,iface,col,j,ig,il
+
+!$OMP PARALLEL DEFAULT(SHARED) PRIVATE(IE,IFACE,IL1,IL2,  &
+!$OMP& IL3,IL4,IG1,IG2,IG3,IG4,COL,J,IG,IL) 
+
+!$OMP DO
+      do j=1,nmor
+        tmort(j)=0.d0
+      end do
+!$OMP END DO
+
+!$OMP DO
+      do ie=1,nelt
+        do iface=1,nsides
+          if(cbc(iface,ie).ne.3)then
+            il1 = idel(1,1,iface,ie)
+            il2 = idel(lx1,1,iface,ie)
+            il3 = idel(1,lx1,iface,ie)
+            il4 = idel(lx1,lx1,iface,ie)
+            ig1 = idmo(1,  1,  1,1,iface,ie)
+            ig2 = idmo(lx1,1,  1,2,iface,ie)
+            ig3 = idmo(1,  lx1,2,1,iface,ie)
+            ig4 = idmo(lx1,lx1,2,2,iface,ie)
+
+!$OMP ATOMIC
+            tmort(ig1) = tmort(ig1)+tx(il1)*third
+!$OMP ATOMIC
+            tmort(ig2) = tmort(ig2)+tx(il2)*third
+!$OMP ATOMIC
+            tmort(ig3) = tmort(ig3)+tx(il3)*third
+!$OMP ATOMIC
+            tmort(ig4) = tmort(ig4)+tx(il4)*third
+
+            do  col=2,lx1-1
+              do j=2,lx1-1
+                il=idel(j,col,iface,ie)
+                ig=idmo(j,col,1,1,iface,ie)
+!$OMP ATOMIC
+                tmort(ig)=tmort(ig)+tx(il)
+              end do
+            end do
+
+            if(idmo(lx1,1,1,1,iface,ie).eq.0)then
+              do j=2,lx1-1
+                il=idel(j,1,iface,ie)
+                ig=idmo(j,1,1,1,iface,ie)
+!$OMP ATOMIC
+                tmort(ig)=tmort(ig)+tx(il)*0.5d0
+              end do
+            end if
+
+            if(idmo(lx1,2,1,2,iface,ie).eq.0)then
+              do j=2,lx1-1
+                il=idel(lx1,j,iface,ie)
+                ig=idmo(lx1,j,1,1,iface,ie)
+!$OMP ATOMIC
+                tmort(ig)=tmort(ig)+tx(il)*0.5d0
+              end do
+            end if
+
+            if(idmo(2,lx1,2,1,iface,ie).eq.0)then
+              do j=2,lx1-1
+                il=idel(j,lx1,iface,ie)
+                ig=idmo(j,lx1,1,1,iface,ie)
+!$OMP ATOMIC
+                tmort(ig)=tmort(ig)+tx(il)*0.5d0
+              end do
+            end if
+
+            if(idmo(1,lx1,1,1,iface,ie).eq.0)then
+              do j=2,lx1-1
+                il=idel(1,j,iface,ie)
+                ig=idmo(1,j,1,1,iface,ie)
+!$OMP ATOMIC
+                tmort(ig)=tmort(ig)+tx(il)*0.5d0
+              end do
+            end if
+          end if
+        end do
+      end do
+!$OMP END DO NOWAIT
+!$OMP END PARALLEL
+      return
+      end
+
+!-------------------------------------------------------------------
+      subroutine transfb_c_2(tx)
+!-------------------------------------------------------------------
+!     Prepare initial guess for CG. All values from conforming 
+!     boundary are copied and summed in tmort. 
+!     mormult is multiplicity, which is used to average tmort.
+!-------------------------------------------------------------------
+
+      use ua_data
+      implicit none
+
+      double precision third
+      parameter (third = 1.d0/3.d0)
+      double precision tx(*)
+      integer il1,il2,il3,il4,ig1,ig2,ig3,ig4,ie,iface,col,j,ig,il
+
+!$OMP PARALLEL DEFAULT(SHARED) PRIVATE(IE,IFACE,IL1,IL2,  &
+!$OMP& IL3,IL4,IG1,IG2,IG3,IG4,COL,J,IG,IL)
+
+!$OMP DO     
+      do j=1,nmor
+        tmort(j)=0.d0
+      end do
+!$OMP END DO nowait
+!$OMP DO
+      do j=1,nmor
+        mormult(j)=0.d0
+      end do
+!$OMP END DO
+
+!$OMP DO 
+      do ie=1,nelt
+        do iface=1,nsides
+          
+          if(cbc(iface,ie).ne.3)then
+            il1 = idel(1,  1,  iface,ie)
+            il2 = idel(lx1,1,  iface,ie)
+            il3 = idel(1,  lx1,iface,ie)
+            il4 = idel(lx1,lx1,iface,ie)
+            ig1 = idmo(1,  1,  1,1,iface,ie)
+            ig2 = idmo(lx1,1,  1,2,iface,ie)
+            ig3 = idmo(1,  lx1,2,1,iface,ie)
+            ig4 = idmo(lx1,lx1,2,2,iface,ie)
+
+!$OMP ATOMIC
+            tmort(ig1) = tmort(ig1)+tx(il1)*third
+!$OMP ATOMIC
+            tmort(ig2) = tmort(ig2)+tx(il2)*third
+!$OMP ATOMIC
+            tmort(ig3) = tmort(ig3)+tx(il3)*third
+!$OMP ATOMIC
+            tmort(ig4) = tmort(ig4)+tx(il4)*third
+!$OMP ATOMIC
+            mormult(ig1) = mormult(ig1)+third
+!$OMP ATOMIC
+            mormult(ig2) = mormult(ig2)+third
+!$OMP ATOMIC
+            mormult(ig3) = mormult(ig3)+third
+!$OMP ATOMIC
+            mormult(ig4) = mormult(ig4)+third
+
+            do  col=2,lx1-1
+              do j=2,lx1-1
+                il=idel(j,col,iface,ie)
+                ig=idmo(j,col,1,1,iface,ie)
+!$OMP ATOMIC
+                tmort(ig)=tmort(ig)+tx(il)
+!$OMP ATOMIC
+                mormult(ig)=mormult(ig)+1.d0
+              end do
+            end do
+
+            if(idmo(lx1,1,1,1,iface,ie).eq.0)then
+              do j=2,lx1-1
+                il=idel(j,1,iface,ie)
+                ig=idmo(j,1,1,1,iface,ie)
+!$OMP ATOMIC
+                tmort(ig)=tmort(ig)+tx(il)*0.5d0
+!$OMP ATOMIC
+                mormult(ig)=mormult(ig)+0.5d0
+              end do
+            end if
+
+            if(idmo(lx1,2,1,2,iface,ie).eq.0)then
+              do j=2,lx1-1
+                il=idel(lx1,j,iface,ie)
+                ig=idmo(lx1,j,1,1,iface,ie)
+!$OMP ATOMIC
+                tmort(ig)=tmort(ig)+tx(il)*0.5d0
+!$OMP ATOMIC
+                mormult(ig)=mormult(ig)+0.5d0
+              end do
+            end if
+
+            if(idmo(2,lx1,2,1,iface,ie).eq.0)then
+              do j=2,lx1-1
+                il=idel(j,lx1,iface,ie)
+                ig=idmo(j,lx1,1,1,iface,ie)
+!$OMP ATOMIC
+                tmort(ig)=tmort(ig)+tx(il)*0.5d0
+!$OMP ATOMIC
+                mormult(ig)=mormult(ig)+0.5d0
+              end do
+            end if
+
+            if(idmo(1,lx1,1,1,iface,ie).eq.0)then
+              do j=2,lx1-1
+                il=idel(1,j,iface,ie)
+                ig=idmo(1,j,1,1,iface,ie)
+!$OMP ATOMIC
+                tmort(ig)=tmort(ig)+tx(il)*0.5d0
+!$OMP ATOMIC
+                mormult(ig)=mormult(ig)+0.5d0
+              end do
+            end if
+          end if!nnje=1
+        end do
+      end do
+!$OMP END DO NOWAIT
+!$OMP END PARALLEL
+
+      return
+      end
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/UA/transfer_rd.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/UA/transfer_rd.f90
new file mode 100644
index 000000000..02967fd91
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/UA/transfer_rd.f90
@@ -0,0 +1,1122 @@
+!------------------------------------------------------------------
+      subroutine init_locks
+!------------------------------------------------------------------
+!     This version uses array reduction for atomic updates, 
+!     but locks are still used in get_emo (mason.f).
+!------------------------------------------------------------------
+
+      use ua_data
+      use tmorwork
+
+      implicit none
+
+      integer i
+!$    integer, external :: omp_get_thread_num, omp_get_num_threads
+
+!.....initialize locks in parallel
+!$OMP PARALLEL DEFAULT(SHARED) PRIVATE(i)
+!$OMP DO
+!$    do i=1,8*lelt
+!$      call omp_init_lock(tlock(i))
+!$    end do
+
+      myid = 0
+      nwthreads = 0
+!$    myid = omp_get_thread_num()
+!$    nwthreads = omp_get_num_threads() - 1
+!$OMP END PARALLEL
+
+!.....allocate space for array-reduction work arrays
+      if (nwthreads .gt. 0) then
+         allocate(tmorwk(lmor,nwthreads), mormulwk(lmor,nwthreads),  &
+     &            stat = i)
+         if (i .ne. 0) then
+            write(*,*) 'error in allocating space'
+            stop
+         endif
+      endif
+
+      return
+      end
+
+
+!------------------------------------------------------------------
+      subroutine transf(tmor,tx)
+!------------------------------------------------------------------
+!     Map values from mortar(tmor) to element(tx)
+!------------------------------------------------------------------
+
+      use ua_data
+      implicit none
+
+      double precision tmor(*),tx(*), tmp(lx1,lx1,2)
+      integer ig1,ig2,ig3,ig4,ie,iface,il1,il2,il3,il4,  &
+     &        nnje,ije1,ije2,col,i,j,ig,il
+
+
+!.....zero out tx on element boundaries
+      call col2(tx,tmult,ntot)     
+
+!$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(il,j,ig,i,col,ije2,ije1,ig4,  &
+!$OMP& ig3,ig2,ig1,nnje,il4,il3,il2,il1,iface,ie,tmp)
+      do ie=1,nelt
+        do iface=1,nsides
+
+!.........get the collocation point index of the four local corners on the
+!         face iface of element ie
+          il1=idel(1,1,iface,ie)
+          il2=idel(lx1,1,iface,ie)
+          il3=idel(1,lx1,iface,ie)
+          il4=idel(lx1,lx1,iface,ie)
+
+!.........get the mortar indices of the four local corners
+          ig1= idmo(1,  1  ,1,1,iface,ie)
+          ig2= idmo(lx1,1  ,1,2,iface,ie)
+          ig3= idmo(1,  lx1,2,1,iface,ie)
+          ig4= idmo(lx1,lx1,2,2,iface,ie)
+  
+!.........copy the value from tmor to tx for these four local corners
+          tx(il1) = tmor(ig1)
+          tx(il2) = tmor(ig2)
+          tx(il3) = tmor(ig3)
+          tx(il4) = tmor(ig4)
+ 
+!.........nnje=1 for conforming faces, nnje=2 for nonconforming faces
+          if(cbc(iface,ie).eq.3) then
+            nnje=2
+          else
+            nnje=1 
+          end if
+
+!.........for nonconforming faces
+          if(nnje.eq.2) then
+
+!...........nonconforming faces have four pieces of mortar, first map them to
+!           two intermediate mortars, stored in tmp
+            call r_init(tmp,lx1*lx1*2,0.d0)
+   
+            do ije1=1,nnje
+              do ije2=1,nnje
+                do col=1,lx1
+
+!.................in each row col, when column i=1 or lx1, the value
+!                 in tmor is copied to tmp
+                  i = v_end(ije2)
+                  ig=idmo(i,col,ije1,ije2,iface,ie)
+                  tmp(i,col,ije1)=tmor(ig)
+
+!.................in each row col, value in the interior three collocation
+!                 points is computed by applying mapping matrix qbnew to tmor
+                  do i=2,lx1-1
+                    il= idel(i,col,iface,ie)
+                    do j=1,lx1
+                      ig=idmo(j,col,ije1,ije2,iface,ie)
+                      tmp(i,col,ije1) = tmp(i,col,ije1) +  &
+     &                qbnew(i-1,j,ije2)*tmor(ig)
+                    end do
+                  end do
+
+                end do
+              end do
+            end do
+      
+!...........mapping from two pieces of intermediate mortar tmp to element 
+!           face tx
+
+            do ije1=1, nnje
+
+!.............the first column, col=1, is an edge of face iface.
+!             the value on the three interior collocation points, tx, is 
+!             computed by applying mapping matrices qbnew to tmp.
+!             the mapping result is divided by 2, because there will be 
+!             duplicated contribution from another face sharing this edge.
+              col=1
+              do i=2,lx1-1
+                il= idel(col,i,iface,ie)
+                do j=1,lx1
+                    tx(il) = tx(il) + qbnew(i-1,j,ije1)*  &
+     &                       tmp(col,j,ije1)*0.5d0
+                end do 
+              end do 
+
+!.............for columns 2 ~ lx1-1
+              do col=2,lx1-1
+
+!...............when i=1 or lx1, the collocation points are also on an edge of
+!               the face, so the mapping result also needs to be divided by 2
+                i = v_end(ije1)
+                il= idel(col,i,iface,ie)
+                tx(il)=tx(il)+tmp(col,i,ije1)*0.5d0
+
+!...............compute the value at interior collocation points in 
+!               columns 2 ~ lx1-1
+                do i=2,lx1-1
+                  il= idel(col,i,iface,ie)
+                  do j=1,lx1
+                    tx(il) = tx(il) + qbnew(i-1,j,ije1)* tmp(col,j,ije1)
+                  end do 
+                end do
+              end do
+
+!.............same as col=1
+              col=lx1
+              do  i=2,lx1-1
+                il= idel(col,i,iface,ie)
+                do j=1,lx1
+                  tx(il) = tx(il) + qbnew(i-1,j,ije1)*  &
+     &                     tmp(col,j,ije1)*0.5d0
+                end do 
+              end do
+            end do
+
+!.........for conforming faces
+          else
+
+!.........face interior
+            do col=2,lx1-1
+              do i=2,lx1-1  
+                il= idel(i,col,iface,ie)
+                ig= idmo(i,col,1,1,iface,ie)
+                tx(il)=tmor(ig)
+              end do
+            end do
+
+        
+!...........edges of conforming faces
+
+!...........if local edge 1 is a nonconforming edge
+            if(idmo(lx1,1,1,1,iface,ie).ne.0)then
+              do i=2,lx1-1               
+                il= idel(i,1,iface,ie)
+                do ije1=1,2
+                  do j=1,lx1
+                    ig=idmo(j,1,1,ije1,iface,ie)
+                    tx(il) = tx(il) + qbnew(i-1,j,ije1)*tmor(ig)*0.5d0
+                  end do
+                end do
+              end do
+
+!...........if local edge 1 is a conforming edge
+            else
+              do i=2,lx1-1
+                il= idel(i,1,iface,ie)
+                ig= idmo(i,1,1,1,iface,ie)
+                tx(il)=tmor(ig)
+              end do
+            end if 
+
+!...........if local edge 2 is a nonconforming edge
+            if(idmo(lx1,2,1,2,iface,ie).ne.0)then
+              do i=2,lx1-1               
+                il= idel(lx1,i,iface,ie)
+                do ije1=1,2
+                  do j=1,lx1
+                    ig=idmo(lx1,j,ije1,2,iface,ie)
+                    tx(il) = tx(il) + qbnew(i-1,j,ije1)*tmor(ig)*0.5d0
+                  end do
+                end do
+              end do
+
+!...........if local edge 2 is a conforming edge
+            else
+              do i=2,lx1-1
+                il= idel(lx1,i,iface,ie)
+                ig= idmo(lx1,i,1,1,iface,ie)
+                tx(il)=tmor(ig)
+              end do
+            end if 
+
+!...........if local edge 3 is a nonconforming edge
+            if(idmo(2,lx1,2,1,iface,ie).ne.0)then
+              do  i=2,lx1-1               
+                il= idel(i,lx1,iface,ie)
+                do ije1=1,2
+                  do j=1,lx1
+                    ig=idmo(j,lx1,2,ije1,iface,ie)
+                    tx(il) = tx(il) + qbnew(i-1,j,ije1)*tmor(ig)*0.5d0
+                  end do
+                end do
+              end do
+
+!...........if local edge 3 is a conforming edge
+            else
+              do i=2,lx1-1
+                il= idel(i,lx1,iface,ie)
+                ig= idmo(i,lx1,1,1,iface,ie)
+                tx(il)=tmor(ig)
+              end do
+            end if 
+
+!...........if local edge 4 is a nonconforming edge
+            if(idmo(1,lx1,1,1,iface,ie).ne.0)then
+              do i=2,lx1-1               
+                il= idel(1,i,iface,ie)
+                do ije1=1,2
+                  do j=1,lx1
+                    ig=idmo(1,j,ije1,1,iface,ie)
+                    tx(il) = tx(il) + qbnew(i-1,j,ije1)*tmor(ig)*0.5d0
+                  end do
+                end do
+              end do
+!...........if local edge 4 is a conforming edge
+            else
+              do i=2,lx1-1
+                il= idel(1,i,iface,ie)
+                ig= idmo(1,i,1,1,iface,ie)
+                tx(il)=tmor(ig)
+              end do
+            end if 
+          end if
+          
+        end do
+      end do
+!$OMP END PARALLEL DO
+
+      return
+      end
+
+
+!------------------------------------------------------------------
+      subroutine transfb(tmor,tx)
+!------------------------------------------------------------------
+!     Map from element(tx) to mortar(tmor).
+!     tmor sums contributions from all elements.
+!------------------------------------------------------------------
+
+      use ua_data
+      use tmorwork
+
+      implicit none
+
+      double precision :: tx(*)
+      double precision, target :: tmor(*)
+
+      double precision third
+      parameter (third=1.d0/3.d0)
+      integer shift
+
+      double precision tmp,tmp1,temp(lx1,lx1,2),  &
+     &                 top(lx1,2)
+      integer il1,il2,il3,il4,ig1,ig2,ig3,ig4,ie,iface,nnje,  &
+     &        ije1,ije2,col,i,j,ije,ig,il
+
+
+!$OMP PARALLEL DEFAULT(SHARED) PRIVATE(il,j,ig,i,col,ije2,ije1,ig4,  &
+!$OMP& ig3,ig2,ig1,nnje,il4,il3,il2,il1,iface,ie,ije,  &
+!$OMP& tmp,shift,temp,top,tmp1)
+
+      if (myid .eq. 0) then
+         tmorl => tmor(1:nmor)
+      else
+         tmorl => tmorwk(:,myid)
+      endif
+
+      do ie=1,nmor
+        tmorl(ie)=0.d0
+      end do
+
+!$OMP DO
+      do ie=1,nelt
+        do iface=1,nsides
+!.........nnje=1 for conforming faces, nnje=2 for nonconforming faces
+          if(cbc(iface,ie).eq.3) then
+            nnje=2
+          else
+            nnje=1 
+          end if
+
+!.........get collocation point index of four local corners on the face
+          il1 = idel(1,  1,  iface,ie)
+          il2 = idel(lx1,1,  iface,ie)
+          il3 = idel(1,  lx1,iface,ie)
+          il4 = idel(lx1,lx1,iface,ie)
+
+!.........get the mortar indices of the four local corners
+          ig1 = idmo(1,  1,  1,1,iface,ie)
+          ig2 = idmo(lx1,1,  1,2,iface,ie)
+          ig3 = idmo(1,  lx1,2,1,iface,ie )
+          ig4 = idmo(lx1,lx1,2,2,iface,ie)
+
+!.........sum the values from tx to tmor for these four local corners
+!         only 1/3 of the value is summed, since there will be two duplicated
+!         contributions from the other two faces sharing this vertex 
+!
+          tmorl(ig1) = tmorl(ig1)+tx(il1)*third
+!
+          tmorl(ig2) = tmorl(ig2)+tx(il2)*third
+!
+          tmorl(ig3) = tmorl(ig3)+tx(il3)*third
+!
+          tmorl(ig4) = tmorl(ig4)+tx(il4)*third
+
+!.........for nonconforming faces
+          if(nnje.eq.2) then       
+            call r_init(temp,lx1*lx1*2,0.d0)
+
+!...........nonconforming faces have four pieces of mortar, first map tx to
+!           two intermediate mortars stored in temp
+
+            do ije2 = 1, nnje
+              shift = ije2-1
+              do col=1,lx1
+!...............For mortar points on face edge (top and bottom), copy the 
+!               value from tx to temp
+                il=idel(col,v_end(ije2),iface,ie)
+                temp(col,v_end(ije2),ije2)=tx(il)
+
+!...............For mortar points on face edge (top and bottom), calculate 
+!               the interior points' contribution to them, i.e. top()
+                j = v_end(ije2)
+                tmp=0.d0
+                do i=2,lx1-1 
+                  il=idel(col,i,iface,ie)
+                  tmp = tmp + qbnew(i-1,j,ije2)*tx(il)
+                end do
+
+                top(col,ije2)=tmp
+
+!...............Use mapping matrices qbnew to map the value from tx to temp 
+!               for mortar points not on the top or bottom face edge.
+                do j=2-shift,lx1-shift
+                  tmp=0.d0
+                  do i=2,lx1-1 
+                    il=idel(col,i,iface,ie)
+                    tmp = tmp + qbnew(i-1,j,ije2)*tx(il)
+                  end do
+                  temp(col,j,ije2) = tmp + temp(col,j,ije2)
+                end do
+              end do
+            end do
+
+!...........mapping from temp to tmor
+
+            do ije1=1, nnje
+              shift = ije1-1
+              do ije2=1,nnje
+
+!...............for each column of collocation points on a piece of mortar
+                do col=2-shift,lx1-shift
+
+!.................For the end point, which is on an edge (local edge 2,4), 
+!                 the contribution is halved since there will be duplicated 
+!                 contribution from another face sharing this edge.
+
+                  ig=idmo(v_end(ije2),col,ije1,ije2,iface,ie)
+!
+                  tmorl(ig)=tmorl(ig)+temp(v_end(ije2),col,ije1)*0.5d0
+
+!.................In each row of collocation points on a piece of mortar, 
+!                 sum the contributions from interior collocation points 
+!                 (i=2,lx1-1)
+
+                  do  j=1,lx1
+                    tmp=0.d0
+                    do i=2,lx1-1
+                      tmp = tmp + qbnew(i-1,j,ije2) * temp(i,col,ije1)
+                    end do
+                    ig=idmo(j,col,ije1,ije2,iface,ie)
+!
+                    tmorl(ig)=tmorl(ig)+tmp
+                  end do
+                end do
+
+!...............For tmor on local edge 1 and 3, tmp is the contribution from
+!               an edge, so it is halved because of duplicated contribution
+!               from another face sharing this edge. tmp1 is contribution 
+!               from face interior. 
+
+                col = v_end(ije1)
+                ig=idmo(v_end(ije2),col,ije1,ije2,iface,ie)
+!
+                tmorl(ig)=tmorl(ig)+top(v_end(ije2),ije1)*0.5d0
+                do  j=1,lx1
+                  tmp=0.d0
+                  tmp1=0.d0
+                  do i=2,lx1-1
+                    tmp  = tmp  + qbnew(i-1,j,ije2) * temp(i,col,ije1)
+                    tmp1 = tmp1 + qbnew(i-1,j,ije2) * top(i,ije1)
+                  end do
+                  ig=idmo(j,col,ije1,ije2,iface,ie)
+!
+                  tmorl(ig)=tmorl(ig)+tmp*0.5d0+tmp1 
+                end do
+              end do
+            end do
+
+!.........for conforming faces
+          else
+
+!.........face interior
+            do col=2,lx1-1
+              do j=2,lx1-1
+                il=idel(j,col,iface,ie)
+                ig=idmo(j,col,1,1,iface,ie)
+!
+                tmorl(ig)=tmorl(ig)+tx(il)
+              end do
+            end do
+
+!...........edges of conforming faces
+
+!...........if local edge 1 is a nonconforming edge
+            if(idmo(lx1,1,1,1,iface,ie).ne.0)then
+              do ije=1,2
+                do j=1,lx1
+                  tmp=0.d0
+                  do i=2,lx1-1
+                    il=idel(i,1,iface,ie)
+                    tmp= tmp + qbnew(i-1,j,ije)*tx(il)
+                  end do
+                  ig=idmo(j,1,1,ije,iface,ie)
+!
+                  tmorl(ig)=tmorl(ig)+tmp*0.5d0
+                end do
+              end do
+
+!...........if local edge 1 is a conforming edge
+            else
+              do j=2,lx1-1
+                il=idel(j,1,iface,ie)
+                ig=idmo(j,1,1,1,iface,ie)
+!
+                tmorl(ig)=tmorl(ig)+tx(il)*0.5d0
+              end do
+            end if 
+
+!...........if local edge 2 is a nonconforming edge
+            if(idmo(lx1,2,1,2,iface,ie).ne.0)then
+              do ije=1,2
+                do j=1,lx1
+                  tmp=0.d0
+                  do i=2,lx1-1
+                    il=idel(lx1,i,iface,ie)
+                    tmp = tmp + qbnew(i-1,j,ije)*tx(il)
+                  end do
+                  ig=idmo(lx1,j,ije,2,iface,ie)
+!
+                  tmorl(ig)=tmorl(ig)+tmp*0.5d0
+                end do
+              end do
+
+!...........if local edge 2 is a conforming edge
+            else
+              do j=2,lx1-1
+                il=idel(lx1,j,iface,ie)
+                ig=idmo(lx1,j,1,1,iface,ie)
+!
+                tmorl(ig)=tmorl(ig)+tx(il)*0.5d0
+              end do
+            end if 
+
+!...........if local edge 3 is a nonconforming edge
+            if(idmo(2,lx1,2,1,iface,ie).ne.0)then
+              do ije=1,2
+                do j=1,lx1
+                  tmp=0.d0
+                  do i=2,lx1-1
+                    il=idel(i,lx1,iface,ie)
+                    tmp = tmp + qbnew(i-1,j,ije)*tx(il)
+                  end do
+                  ig=idmo(j,lx1,2,ije,iface,ie)
+!
+                  tmorl(ig)=tmorl(ig)+tmp*0.5d0
+                end do
+              end do
+
+!...........if local edge 3 is a conforming edge
+            else
+              do j=2,lx1-1
+                il=idel(j,lx1,iface,ie)
+                ig=idmo(j,lx1,1,1,iface,ie)
+!
+                tmorl(ig)=tmorl(ig)+tx(il)*0.5d0
+              end do
+            end if 
+
+!...........if local edge 4 is a nonconforming edge
+            if(idmo(1,lx1,1,1,iface,ie).ne.0)then
+              do ije=1,2
+                do j=1,lx1
+                  tmp=0.d0
+                  do i=2,lx1-1
+                    il=idel(1,i,iface,ie)
+                    tmp = tmp + qbnew(i-1,j,ije)*tx(il)
+                  end do
+                  ig=idmo(1,j,ije,1,iface,ie)
+!
+                  tmorl(ig)=tmorl(ig)+tmp*0.5d0
+                end do
+              end do
+
+!...........if local edge 4 is a conforming edge
+            else
+              do j=2,lx1-1
+                il=idel(1,j,iface,ie)
+                ig=idmo(1,j,1,1,iface,ie)
+!
+                tmorl(ig)=tmorl(ig)+tx(il)*0.5d0
+              end do
+            end if 
+          end if
+        end do
+      end do
+!$OMP END DO
+
+      call update_tmor(tmor,tmorwk,nmor,lmor)
+!$OMP END PARALLEL 
+
+      return
+      end
+
+!--------------------------------------------------------------
+      subroutine transfb_cor_e(n,tmor,tx)
+!--------------------------------------------------------------
+!     This subroutine performs the edge to mortar mapping and
+!     calculates the mapping result on the mortar point at a vertex
+!     under situation 1, 2, or 3.
+!     n refers to the configuration of three edges sharing a vertex, 
+!     n = 1: only one edge is nonconforming
+!     n = 2: two edges are nonconforming 
+!     n = 3: three edges are nonconforming 
+!-------------------------------------------------------------------
+
+      use ua_data
+      implicit none
+
+      double precision tmor,tx(lx1,lx1,lx1),tmp
+      integer i,n
+
+      tmor=tx(1,1,1)
+
+      do i=2,lx1-1
+        tmor= tmor + qbnew(i-1,1,1)*tx(i,1,1)
+      end do
+
+      if(n.gt.1)then
+        do i=2,lx1-1
+          tmor= tmor + qbnew(i-1,1,1)*tx(1,i,1)
+        end do
+      end if
+
+      if(n.eq.3)then
+        do i=2,lx1-1
+          tmor= tmor + qbnew(i-1,1,1)*tx(1,1,i)
+        end do
+      end if
+
+      return
+      end
+
+!--------------------------------------------------------------
+      subroutine transfb_cor_f(n,tmor,tx)
+!--------------------------------------------------------------
+!     This subroutine performs the mapping from face to mortar.
+!     Output tmor is the mapping result on a mortar vertex for these
+!     situations of three edges and three faces sharing a vertex:
+!     n=4: only one face is nonconforming 
+!     n=5: one face and one edge are nonconforming
+!     n=6: two faces are nonconforming 
+!     n=7: three faces are nonconforming 
+!--------------------------------------------------------------
+
+      use ua_data
+      implicit none
+
+      double precision tx(lx1,lx1,lx1),tmor,temp(lx1)
+      integer col,i,n
+
+      call r_init(temp,lx1,0.d0)
+
+      do col=1,lx1
+        temp(col)=tx(col,1,1)
+        do i=2,lx1-1
+          temp(col) = temp(col) + qbnew(i-1,1,1)*tx(col,i,1)
+        end do
+      end do
+      tmor=temp(1)
+
+      do i=2,lx1-1
+        tmor = tmor + qbnew(i-1,1,1) *temp(i)
+      end do
+
+      if(n.eq.5)then
+        do i=2,lx1-1
+          tmor = tmor + qbnew(i-1,1,1) *tx(1,1,i)
+        end do
+      end if
+ 
+      if(n.ge.6)then
+        call r_init(temp,lx1,0.d0)
+        do col=1,lx1
+          do i=2,lx1-1
+            temp(col) = temp(col) + qbnew(i-1,1,1)*tx(col,1,i)
+          end do
+        end do
+        tmor=tmor+temp(1)
+        do i=2,lx1-1
+          tmor = tmor +qbnew(i-1,1,1) *temp(i)
+        end do
+      end if
+        
+      if(n.eq.7)then
+        call r_init(temp,lx1,0.d0)
+        do col=2,lx1-1
+          do i=2,lx1-1
+            temp(col) = temp(col) + qbnew(i-1,1,1)*tx(1,col,i)
+          end do
+        end do
+        do i=2,lx1-1
+          tmor = tmor + qbnew(i-1,1,1) *temp(i)
+        end do
+      end if
+
+      return
+      end
+
+
+!-------------------------------------------------------------------------
+      subroutine transf_nc(tmor,tx)
+!------------------------------------------------------------------------
+!     Perform mortar to element mapping on a nonconforming face. 
+!     This subroutine is used when all entries in tmor are zero except
+!     one tmor(i,j)=1. So this routine is simplified. Only one piece of 
+!     mortar (tmor only has two indices) and one piece of intermediate 
+!     mortar (tmp) are involved.
+!------------------------------------------------------------------------
+
+      use ua_data
+      implicit none
+
+      double precision tmor(lx1,lx1), tx(lx1,lx1), tmp(lx1,lx1)
+      integer col,i,j
+
+      call r_init(tmp,lx1*lx1,0.d0)
+      do col=1,lx1
+        i = 1
+        tmp(i,col)=tmor(i,col)                           
+        do i=2,lx1-1
+          do j=1,lx1
+            tmp(i,col) = tmp(i,col) + qbnew(i-1,j,1)*tmor(j,col)
+          end do
+        end do
+      end do
+
+      do col=1,lx1
+        i = 1
+        tx(col,i)   = tx(col,i)   + tmp(col,i)
+        do i=2,lx1-1
+          do j=1,lx1
+            tx(col,i) = tx(col,i) + qbnew(i-1,j,1)*tmp(col,j)
+          end do
+        end do
+      end do
+
+      return                                                  
+      end                                                     
+
+!------------------------------------------------------------------------
+      subroutine transfb_nc0(tmor,tx)
+!------------------------------------------------------------------------
+!     Performs mapping from element to mortar when the nonconforming 
+!     edges are shared by two conforming faces of an element.
+!------------------------------------------------------------------------
+
+      use ua_data
+      implicit none
+
+      double precision tmor(lx1,lx1),tx(lx1,lx1,lx1)
+      integer i,j
+
+      call r_init(tmor,lx1*lx1,0.d0)
+      do j=1,lx1
+        do i=2,lx1-1
+          tmor(j,1)= tmor(j,1) + qbnew(i-1,j  ,1)*tx(i,1,1)
+        end do
+      end do
+
+      return
+      end 
+
+!------------------------------------------------------------------------
+      subroutine transfb_nc2(tmor,tx)
+!------------------------------------------------------------------------
+!     Maps values from element to mortar when the nonconforming edges are
+!     shared by two nonconforming faces of an element.
+!     Although each face has four pieces of mortar, only the value in
+!     one piece (location (1,1)) is used in the calling routine, so only
+!     the value in the first mortar is calculated in this subroutine.
+!------------------------------------------------------------------------
+
+      use ua_data
+      implicit none
+
+      double precision tx(lx1,lx1),tmor(lx1,lx1),bottom(lx1),  &
+     &                 temp(lx1,lx1)
+      integer col,j,i
+
+      call r_init(tmor,lx1*lx1,0.d0)
+      call r_init(temp,lx1*lx1,0.d0)
+      tmor(1,1)=tx(1,1)
+
+!.....mapping from tx to intermediate mortar temp + bottom
+      do col=1,lx1
+        temp(col,1)=tx(col,1)
+        j=1
+        bottom(col)= 0.d0
+        do i=2,lx1-1 
+          bottom(col) = bottom(col) + qbnew(i-1,j,1)*tx(col,i)
+        end do
+
+        do j=2,lx1
+          do i=2,lx1-1 
+            temp(col,j) = temp(col,j) + qbnew(i-1,j,1)*tx(col,i)
+          end do
+        end do
+      end do
+
+!.....from intermediate mortar to mortar
+
+!.....On the nonconforming edge, temp is divided by 2 as there will be
+!     a duplicate contribution from another face sharing this edge
+      col=1
+      do j=1,lx1
+        do i=2,lx1-1
+          tmor(j,col)=tmor(j,col)+ qbnew(i-1,j,1) * bottom(i) +  &
+     &                             qbnew(i-1,j,1) * temp(i,col) * 0.5d0 
+        end do
+      end do
+
+      do col=2,lx1
+        tmor(1,col)=tmor(1,col)+temp(1,col)
+        do j=1,lx1
+          do i=2,lx1-1
+            tmor(j,col) = tmor(j,col) + qbnew(i-1,j,1) *temp(i,col)
+          end do
+        end do
+      end do
+
+      return
+      end 
+
+
+!------------------------------------------------------------------------
+      subroutine transfb_nc1(tmor,tx)
+!------------------------------------------------------------------------
+!     Maps values from element to mortar when the nonconforming edges are
+!     shared by a nonconforming face and a conforming face of an element
+!------------------------------------------------------------------------
+
+      use ua_data
+      implicit none
+
+      double precision tx(lx1,lx1),tmor(lx1,lx1),bottom(lx1),  &
+     &                 temp(lx1,lx1)
+      integer col,j,i
+
+      call r_init(tmor,lx1*lx1,0.d0)
+      call r_init(temp,lx1*lx1,0.d0)
+
+      tmor(1,1)=tx(1,1)
+!.....Contribution from the nonconforming faces
+!     Since the calling subroutine is only interested in the value on the
+!     mortar (location (1,1)), only this piece of mortar is calculated.
+
+      do col=1,lx1
+        temp(col,1)=tx(col,1)
+        j = 1
+        bottom(col)= 0.d0
+        do i=2,lx1-1 
+          bottom(col)=bottom(col) + qbnew(i-1,j,1)*tx(col,i)
+        end do
+
+        do j=2,lx1
+          do i=2,lx1-1 
+            temp(col,j) = temp(col,j) + qbnew(i-1,j,1)*tx(col,i)
+          end do
+
+        end do
+      end do
+
+      col=1
+      tmor(1,col)=tmor(1,col)+bottom(1)
+      do j=1,lx1
+        do i=2,lx1-1
+
+!.........temp is not divided by 2 here. It includes the contribution
+!         from the other conforming face.
+
+          tmor(j,col)=tmor(j,col) + qbnew(i-1,j,1) *bottom(i) +  &
+     &                              qbnew(i-1,j,1) *temp(i,col) 
+        end do
+      end do
+
+      do col=2,lx1
+        tmor(1,col)=tmor(1,col)+temp(1,col)
+        do j=1,lx1
+          do i=2,lx1-1
+            tmor(j,col) = tmor(j,col) + qbnew(i-1,j,1) *temp(i,col)
+          end do
+        end do
+      end do
+
+      return
+      end
+
+
+!-------------------------------------------------------------------
+      subroutine transfb_c(tx)
+!-------------------------------------------------------------------
+!     Prepare initial guess for CG. All values from conforming 
+!     boundary are copied and summed on tmor.
+!-------------------------------------------------------------------
+
+      use ua_data
+      use tmorwork
+
+      implicit none
+
+      double precision third
+      parameter (third = 1.d0/3.d0)
+      double precision tx(*)
+      integer il1,il2,il3,il4,ig1,ig2,ig3,ig4,ie,iface,col,j,ig,il
+
+!$OMP PARALLEL DEFAULT(SHARED) PRIVATE(IE,IFACE,IL1,IL2,  &
+!$OMP& IL3,IL4,IG1,IG2,IG3,IG4,COL,J,IG,IL) 
+
+      if (myid .eq. 0) then
+         tmorl => tmort(:)
+      else
+         tmorl => tmorwk(:,myid)
+      endif
+
+      do j=1,nmor
+        tmorl(j)=0.d0
+      end do
+
+!$OMP DO
+      do ie=1,nelt
+        do iface=1,nsides
+          if(cbc(iface,ie).ne.3)then
+            il1 = idel(1,1,iface,ie)
+            il2 = idel(lx1,1,iface,ie)
+            il3 = idel(1,lx1,iface,ie)
+            il4 = idel(lx1,lx1,iface,ie)
+            ig1 = idmo(1,  1,  1,1,iface,ie)
+            ig2 = idmo(lx1,1,  1,2,iface,ie)
+            ig3 = idmo(1,  lx1,2,1,iface,ie)
+            ig4 = idmo(lx1,lx1,2,2,iface,ie)
+!
+            tmorl(ig1) = tmorl(ig1)+tx(il1)*third
+!
+            tmorl(ig2) = tmorl(ig2)+tx(il2)*third
+!
+            tmorl(ig3) = tmorl(ig3)+tx(il3)*third
+!
+            tmorl(ig4) = tmorl(ig4)+tx(il4)*third
+
+            do  col=2,lx1-1
+              do j=2,lx1-1
+                il=idel(j,col,iface,ie)
+                ig=idmo(j,col,1,1,iface,ie)
+!
+                tmorl(ig)=tmorl(ig)+tx(il)
+              end do
+            end do
+
+            if(idmo(lx1,1,1,1,iface,ie).eq.0)then
+              do j=2,lx1-1
+                il=idel(j,1,iface,ie)
+                ig=idmo(j,1,1,1,iface,ie)
+!
+                tmorl(ig)=tmorl(ig)+tx(il)*0.5d0
+              end do
+            end if
+
+            if(idmo(lx1,2,1,2,iface,ie).eq.0)then
+              do j=2,lx1-1
+                il=idel(lx1,j,iface,ie)
+                ig=idmo(lx1,j,1,1,iface,ie)
+!
+                tmorl(ig)=tmorl(ig)+tx(il)*0.5d0
+              end do
+            end if
+
+            if(idmo(2,lx1,2,1,iface,ie).eq.0)then
+              do j=2,lx1-1
+                il=idel(j,lx1,iface,ie)
+                ig=idmo(j,lx1,1,1,iface,ie)
+!
+                tmorl(ig)=tmorl(ig)+tx(il)*0.5d0
+              end do
+            end if
+
+            if(idmo(1,lx1,1,1,iface,ie).eq.0)then
+              do j=2,lx1-1
+                il=idel(1,j,iface,ie)
+                ig=idmo(1,j,1,1,iface,ie)
+!
+                tmorl(ig)=tmorl(ig)+tx(il)*0.5d0
+              end do
+            end if
+          end if!
+        end do
+      end do
+!$OMP END DO
+
+      call update_tmor(tmort,tmorwk,nmor,lmor)
+!$OMP END PARALLEL
+      return
+      end
+
+!-------------------------------------------------------------------
+      subroutine transfb_c_2(tx)
+!-------------------------------------------------------------------
+!     Prepare initial guess for CG. All values from conforming 
+!     boundary are copied and summed in tmort. 
+!     mormult is multiplicity, which is used to average tmort.
+!-------------------------------------------------------------------
+
+      use ua_data
+      use tmorwork
+
+      implicit none
+
+      double precision third
+      parameter (third = 1.d0/3.d0)
+      double precision tx(*)
+      integer il1,il2,il3,il4,ig1,ig2,ig3,ig4,ie,iface,col,j,ig,il
+
+!$OMP PARALLEL DEFAULT(SHARED) PRIVATE(IE,IFACE,IL1,IL2,  &
+!$OMP& IL3,IL4,IG1,IG2,IG3,IG4,COL,J,IG,IL)
+
+      if (myid .eq. 0) then
+         tmorl => tmort(:)
+         mormull => mormult(:)
+      else
+         tmorl => tmorwk(:,myid)
+         mormull => mormulwk(:,myid)
+      endif
+
+      do j=1,nmor
+        tmorl(j)=0.d0
+      end do
+      do j=1,nmor
+        mormull(j)=0.d0
+      end do
+
+!$OMP DO 
+      do ie=1,nelt
+        do iface=1,nsides
+          
+          if(cbc(iface,ie).ne.3)then
+            il1 = idel(1,  1,  iface,ie)
+            il2 = idel(lx1,1,  iface,ie)
+            il3 = idel(1,  lx1,iface,ie)
+            il4 = idel(lx1,lx1,iface,ie)
+            ig1 = idmo(1,  1,  1,1,iface,ie)
+            ig2 = idmo(lx1,1,  1,2,iface,ie)
+            ig3 = idmo(1,  lx1,2,1,iface,ie)
+            ig4 = idmo(lx1,lx1,2,2,iface,ie)
+!
+            tmorl(ig1) = tmorl(ig1)+tx(il1)*third
+            mormull(ig1) = mormull(ig1)+third
+!
+            tmorl(ig2) = tmorl(ig2)+tx(il2)*third
+            mormull(ig2) = mormull(ig2)+third
+!
+            tmorl(ig3) = tmorl(ig3)+tx(il3)*third
+            mormull(ig3) = mormull(ig3)+third
+!
+            tmorl(ig4) = tmorl(ig4)+tx(il4)*third
+            mormull(ig4) = mormull(ig4)+third
+
+            do  col=2,lx1-1
+              do j=2,lx1-1
+                il=idel(j,col,iface,ie)
+                ig=idmo(j,col,1,1,iface,ie)
+!
+                tmorl(ig)=tmorl(ig)+tx(il)
+                mormull(ig)=mormull(ig)+1.d0
+              end do
+            end do
+
+            if(idmo(lx1,1,1,1,iface,ie).eq.0)then
+              do j=2,lx1-1
+                il=idel(j,1,iface,ie)
+                ig=idmo(j,1,1,1,iface,ie)
+!
+                tmorl(ig)=tmorl(ig)+tx(il)*0.5d0
+                mormull(ig)=mormull(ig)+0.5d0
+              end do
+            end if
+
+            if(idmo(lx1,2,1,2,iface,ie).eq.0)then
+              do j=2,lx1-1
+                il=idel(lx1,j,iface,ie)
+                ig=idmo(lx1,j,1,1,iface,ie)
+!
+                tmorl(ig)=tmorl(ig)+tx(il)*0.5d0
+                mormull(ig)=mormull(ig)+0.5d0
+              end do
+            end if
+
+            if(idmo(2,lx1,2,1,iface,ie).eq.0)then
+              do j=2,lx1-1
+                il=idel(j,lx1,iface,ie)
+                ig=idmo(j,lx1,1,1,iface,ie)
+!
+                tmorl(ig)=tmorl(ig)+tx(il)*0.5d0
+                mormull(ig)=mormull(ig)+0.5d0
+               end do
+            end if
+
+            if(idmo(1,lx1,1,1,iface,ie).eq.0)then
+              do j=2,lx1-1
+                il=idel(1,j,iface,ie)
+                ig=idmo(1,j,1,1,iface,ie)
+!
+                tmorl(ig)=tmorl(ig)+tx(il)*0.5d0
+                mormull(ig)=mormull(ig)+0.5d0
+              end do
+            end if
+          end if!nnje=1
+        end do
+      end do
+!$OMP END DO
+
+      call update_tmor(tmort,tmorwk,nmor,lmor)
+      call update_tmor(mormult,mormulwk,nmor,lmor)
+!$OMP END PARALLEL
+
+      return
+      end
+
+!--------------------------------------------------------------
+      subroutine update_tmor(tmor,tmorg,nmor,lmor)
+!--------------------------------------------------------------
+!--------------------------------------------------------------
+
+      use tmorwork
+      implicit none
+
+      integer nmor,lmor
+      double precision tmor(*), tmorg(lmor,*)
+!
+      integer i, ii, iim, n
+
+      if (nwthreads .lt. 1) return
+
+!$omp do
+      do i = 1, nmor, 16
+         iim = i + min(15,nmor-i)
+         do n = 1, nwthreads
+            do ii = i, iim
+               tmor(ii) = tmor(ii) + tmorg(ii,n)
+            end do
+         end do
+      end do
+!$omp end do nowait
+!
+      return
+      end
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/UA/ua.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/UA/ua.f90
new file mode 100644
index 000000000..3e4a41590
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/UA/ua.f90
@@ -0,0 +1,289 @@
+!-------------------------------------------------------------------------c
+!                                                                         c
+!        N  A  S     P A R A L L E L     B E N C H M A R K S  3.4         c
+!                                                                         c
+!                      O p e n M P     V E R S I O N                      c
+!                                                                         c
+!                                   U A                                   c
+!                                                                         c
+!-------------------------------------------------------------------------c
+!                                                                         c
+!    This benchmark is the OpenMP version of the NPB UA code.             c
+!    Refer to NAS Technical Report NAS-04-006 for details                 c
+!                                                                         c
+!    Permission to use, copy, distribute and modify this software         c
+!    for any purpose with or without fee is hereby granted.  We           c
+!    request, however, that all derived work reference the NAS            c
+!    Parallel Benchmarks 3.4. This software is provided "as is"           c
+!    without express or implied warranty.                                 c
+!                                                                         c
+!    Information on NPB 3.4, including the technical report, the          c
+!    original specifications, source code, results and information        c
+!    on how to submit new results, is available at:                       c
+!                                                                         c
+!           http://www.nas.nasa.gov/Software/NPB/                         c
+!                                                                         c
+!    Send comments or suggestions to  npb@nas.nasa.gov                    c
+!                                                                         c
+!          NAS Parallel Benchmarks Group                                  c
+!          NASA Ames Research Center                                      c
+!          Mail Stop: T27A-1                                              c
+!          Moffett Field, CA   94035-1000                                 c
+!                                                                         c
+!          E-mail:  npb@nas.nasa.gov                                      c
+!          Fax:     (650) 604-3957                                        c
+!                                                                         c
+!-------------------------------------------------------------------------c
+
+!---------------------------------------------------------------------
+!
+! Author: H. Feng
+!         R. Van der Wijngaart
+!---------------------------------------------------------------------
+
+      program ua
+
+      use ua_data
+      implicit none
+
+      integer          step, ie,iside,i,j, fstatus,k
+      external         timer_read
+      double precision timer_read, mflops, tmax, nelt_tot
+      character        class
+      logical          ifmortar, verified
+!$    integer          omp_get_max_threads
+!$    external         omp_get_max_threads
+
+      double precision t2, trecs(t_last)
+      character t_names(t_last)*10
+
+!---------------------------------------------------------------------
+!     Read input file (if it exists), else take
+!     defaults from parameters
+!---------------------------------------------------------------------
+
+      call check_timer_flag( timeron )
+      if (timeron) then
+         t_names(t_total) = 'total'
+         t_names(t_init) = 'init'
+         t_names(t_convect) = 'convect'
+         t_names(t_transfb_c) = 'transfb_c'
+         t_names(t_diffusion) = 'diffusion'
+         t_names(t_transf) = 'transf'
+         t_names(t_transfb) = 'transfb'
+         t_names(t_adaptation) = 'adaptation'
+         t_names(t_transf2) = 'transf+b'
+         t_names(t_add2) = 'add2'
+      endif
+
+      write (*,1000)
+      open (unit=2,file='inputua.data',status='old', iostat=fstatus)
+
+      if (fstatus .eq. 0) then
+        write(*,233)
+ 233    format(' Reading from input file inputua.data')
+        read (2,*) fre
+        read (2,*) niter
+        read (2,*) nmxh
+        read (2,*) alpha
+        class = 'U'
+        close(2)
+      else
+        write(*,234)
+        fre        = fre_default
+        niter      = niter_default
+        nmxh       = nmxh_default
+        alpha      = alpha_default
+        class      = class_default
+      endif
+ 234  format(' No input file inputua.data. Using compiled defaults')
+
+      dlmin = 0.5d0**refine_max
+      dtime = 0.04d0*dlmin
+
+      write (*,1001) refine_max
+      write (*,1002) fre
+      write (*,1003) niter, dtime
+      write (*,1004) nmxh
+      write (*,1005) alpha
+!$    write (*,1006) omp_get_max_threads()
+      write (*,*)
+
+
+ 1000 format(//,' NAS Parallel Benchmarks (NPB3.4-OMP)',  &
+     &          ' - UA Benchmark', /)
+ 1001 format(' Levels of refinement:        ', i8)
+ 1002 format(' Adaptation frequency:        ', i8)
+ 1003 format(' Time steps:                  ', i8, '    dt: ', g15.6)
+ 1004 format(' CG iterations:               ', i8)
+ 1005 format(' Heat source radius:          ', f8.4)
+ 1006 format(' Number of available threads: ', i8)
+
+      do i = 1, t_last
+         call timer_clear(i)
+      end do
+      if (timeron) call timer_start(t_init)
+
+      call alloc_space
+
+!.....set up initial mesh (single element) and solution (all zero)
+      call create_initial_grid
+
+      call r_init_omp(ta1,ntot,0.d0)
+      call nr_init_omp(sje,4*6*nelt,0)
+
+      call init_locks
+
+!.....compute tables of coefficients and weights
+      call coef
+      call geom1
+
+!.....compute the discrete laplacian operators
+      call setdef
+
+!.....prepare for the preconditioner
+      call setpcmo_pre
+
+!.....refine initial mesh and do some preliminary work
+      time = 0.d0
+      call mortar
+      call prepwork
+      call adaptation(ifmortar,0)
+      if (timeron) call timer_stop(t_init)
+
+      call timer_clear(1)
+
+      time = 0.d0
+      do step= 0, niter
+
+        if (step .eq. 1) then
+!.........reset the solution and start the timer, keep track of total no elms
+
+          call r_init(ta1,ntot,0.d0)
+
+          time = 0.d0
+          nelt_tot = 0.d0
+          do i = 1, t_last
+             if (i.ne.t_init) call timer_clear(i)
+          end do
+#ifdef M5_ANNOTATION
+          call m5_work_begin_interface
+#endif
+          call timer_start(1)
+        endif
+
+!.......advance the convection step
+        call convect(ifmortar)
+
+        if (timeron) call timer_start(t_transf2)
+!.......prepare the initial guess for cg
+        call transf(tmort,ta1)
+
+!.......compute residual for diffusion term based on initial guess
+
+!.......compute the left hand side of equation, laplacian t
+!$OMP PARALLEL DEFAULT(SHARED) PRIVATE(ie,k,j,i)
+!$OMP DO
+        do ie = 1,nelt
+          call laplacian(ta2(1,1,1,ie),ta1(1,1,1,ie),size_e(ie))
+        end do
+!$OMP END DO
+!.......compute the residual
+!$OMP DO
+        do ie = 1, nelt
+          do k=1,lx1
+            do j=1,lx1
+              do i=1,lx1
+                trhs(i,j,k,ie) = trhs(i,j,k,ie) - ta2(i,j,k,ie)
+              end do
+            end do
+          end do
+        end do
+!$OMP END DO
+!$OMP END PARALLEL
+!.......get the residual on mortar
+        call transfb(rmor,trhs)
+        if (timeron) call timer_stop(t_transf2)
+
+!.......apply boundary condition: zero out the residual on domain boundaries
+
+!.......apply boundary condition to trhs
+!$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(ie,iside)
+        do ie=1,nelt
+          do iside=1,nsides
+            if (cbc(iside,ie).eq.0) then
+              call facev(trhs(1,1,1,ie),iside,0.d0)
+            end if
+          end do
+        end do
+!$OMP END PARALLEL DO
+!.......apply boundary condition to rmor
+        call col2(rmor,tmmor,nmor)
+
+!.......call the conjugate gradient iterative solver
+        call diffusion(ifmortar)
+
+!.......add convection and diffusion
+        if (timeron) call timer_start(t_add2)
+        call add2(ta1,t,ntot)
+        if (timeron) call timer_stop(t_add2)
+
+
+!.......perform mesh adaptation
+        time=time+dtime
+        if ((step.ne.0).and.(step/fre*fre .eq. step)) then
+           if (step .ne. niter) then
+             call adaptation(ifmortar,step)
+           end if
+        else
+          ifmortar = .false.
+        end if
+        nelt_tot = nelt_tot + dble(nelt)
+      end do
+
+      call timer_stop(1)
+
+#ifdef M5_ANNOTATION
+      call m5_work_end_interface
+#endif
+
+      tmax = timer_read(1)
+      call verify(class, verified)
+
+!.....compute millions of collocation points advanced per second.
+!.....diffusion: nmxh advancements, convection: 1 advancement
+      mflops = nelt_tot*dble(lx1*lx1*lx1*(nmxh+1))/(tmax*1.d6)
+
+      call print_results('UA', class, refine_max, 0, 0, niter,  &
+     &     tmax, mflops, '    coll. point advanced',  &
+     &     verified, npbversion,compiletime, cs1, cs2, cs3, cs4, cs5,  &
+     &     cs6, '(none)')
+
+!---------------------------------------------------------------------
+!      More timers
+!---------------------------------------------------------------------
+      if (.not.timeron) goto 999
+
+      do i=1, t_last
+         trecs(i) = timer_read(i)
+      end do
+      if (tmax .eq. 0.0) tmax = 1.0
+
+      write(*,800)
+ 800  format('  SECTION     Time (secs)')
+      do i=1, t_last
+         write(*,810) t_names(i), trecs(i), trecs(i)*100./tmax
+         if (i.eq.t_transfb_c) then
+            t2 = trecs(t_convect) - trecs(t_transfb_c)
+            write(*,820) 'sub-convect', t2, t2*100./tmax
+         else if (i.eq.t_transfb) then
+            t2 = trecs(t_diffusion) - trecs(t_transf) - trecs(t_transfb)
+            write(*,820) 'sub-diffuse', t2, t2*100./tmax
+         endif
+ 810     format(2x,a10,':',f9.3,'  (',f6.2,'%)')
+ 820     format('    --> ',a11,':',f9.3,'  (',f6.2,'%)')
+      end do
+
+ 999  continue
+
+      end
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/UA/ua_data.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/UA/ua_data.f90
new file mode 100644
index 000000000..58d5d7b49
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/UA/ua_data.f90
@@ -0,0 +1,367 @@
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+!
+!  ua_data module
+!
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      module ua_data
+
+      include 'npbparams.h'
+
+!.....Array dimensions     
+      integer lx1, lnje, nsides, nxyz
+      parameter(lx1=5, lnje=2, nsides=6,  nxyz=lx1*lx1*lx1)
+
+      integer fre, niter, nmxh
+      double precision alpha, dlmin, dtime
+
+      integer nelt, ntot, nmor, nvertex
+
+      double precision x0, y0, z0, time
+
+      double precision velx, vely, velz, visc, x00, y00, z00
+      parameter(velx=3.d0, vely=3.d0, velz=3.d0)
+      parameter(visc=0.005d0)
+      parameter(x00=3.d0/7.d0, y00=2.d0/7.d0, z00=2.d0/7.d0)
+
+!.....double precision arrays associated with collocation points
+      double precision, allocatable ::  &
+     &       ta1  (:,:,:,:), ta2   (:,:,:,:),  &
+     &       trhs (:,:,:,:), t     (:,:,:,:),  &
+     &       tmult(:,:,:,:), dpcelm(:,:,:,:),  &
+     &       pdiff(:,:,:,:), pdiffp(:,:,:,:)
+
+!.....double precision arrays associated with mortar points
+      double precision, allocatable ::  &
+     &       umor(:), tmmor (:),  &
+     &       rmor(:), dpcmor(:), pmorx(:), ppmor(:) 
+      double precision, allocatable, target ::  &
+     &       mormult(:), tmort(:)
+
+!.... integer arrays associated with element faces
+      integer, allocatable ::  &
+     &        idmo    (:,:,:,:,:,:),  &
+     &        idel    (:,:,    :,:),  &
+     &        sje     (:,:,    :,:),  &
+     &        sje_new (:,:,    :,:),  &
+     &        ijel    (:,      :,:),  &
+     &        ijel_new(:,      :,:),  &
+     &        cbc     (        :,:),  &
+     &        cbc_new (        :,:) 
+
+!.....integer array associated with vertices
+      integer, allocatable :: vassign (:,:), emo(:,:,:),   &
+     &        nemo (:)
+
+!.....integer array associated with element edges
+      integer, allocatable :: diagn  (:,:,:) 
+
+!.... integer arrays associated with elements
+      integer, allocatable ::  &
+     &        tree (:), mt_to_id    (:),                   &
+     &        newc (:), mt_to_id_old(:),  &
+     &        newi (:), id_to_mt    (:),  &
+     &        newe (:), ref_front_id(:),  &
+     &        front(:), action      (:),  &
+     &        ich  (:), size_e      (:),  &
+     &        treenew(:)
+
+!.....logical arrays associated with vertices
+      logical, allocatable :: ifpcmor (:)
+
+!.....logical arrays associated with edge
+      logical, allocatable ::  &
+     &        eassign  (:,:), if_1_edge(:,:),  &
+     &        ncon_edge(:,:)
+
+!.....logical arrays associated with elements
+      logical, allocatable :: skip(:), ifcoa(:), ifcoa_id(:)
+
+!.....logical arrays associated with element faces
+      logical, allocatable :: fassign(:,:), edgevis(:,:,:)      
+
+!.....small arrays
+      double precision qbnew(lx1-2,lx1,2), bqnew(lx1-2,lx1-2,2)
+
+      double precision  &
+     &       pcmor_nc1(lx1,lx1,2,2,refine_max),  &
+     &       pcmor_nc2(lx1,lx1,2,2,refine_max),  &
+     &       pcmor_nc0(lx1,lx1,2,2,refine_max),  &
+     &       pcmor_c(lx1,lx1,refine_max), tcpre(lx1,lx1),  &
+     &       pcmor_cor(8,refine_max)
+
+!.....Gauss-Lobatto and Gauss points
+      double precision zgm1(lx1)
+
+!.....weights
+      double precision wxm1(lx1),w3m1(lx1,lx1,lx1)
+
+!.....coordinate of element vertices
+      double precision, allocatable ::  &
+     &       xc(:,:),    yc(:,:),    zc(:,:),  &
+     &       xc_new(:,:),yc_new(:,:),zc_new(:,:)
+
+!.....dr/dx, dx/dr  and Jacobian
+      double precision jacm1_s(lx1,lx1,lx1,refine_max),  &
+     &       rxm1_s(lx1,lx1,lx1,refine_max),  &
+     &       xrm1_s(lx1,lx1,lx1,refine_max)
+
+!.....mass matrices (diagonal)
+      double precision bm1_s(lx1,lx1,lx1,refine_max)
+
+!.....derivative matrices d/dr
+      double precision dxm1(lx1,lx1), dxtm1(lx1,lx1), wdtdr(lx1,lx1)
+
+!.....interpolation operators
+      double precision  &
+     &       ixm31(lx1,lx1*2-1), ixtm31(lx1*2-1,lx1), ixmc1(lx1,lx1),  &
+     &       ixtmc1(lx1,lx1), ixmc2(lx1,lx1),  ixtmc2(lx1,lx1),  &
+     &       map2(lx1),map4(lx1)
+
+!.....collocation location within an element
+      double precision xfrac(lx1)
+
+!.....used in laplacian operator
+      double precision g1m1_s(lx1,lx1,lx1,refine_max),  &
+     &       g4m1_s(lx1,lx1,lx1,refine_max),  &
+     &       g5m1_s(lx1,lx1,lx1,refine_max),  &
+     &       g6m1_s(lx1,lx1,lx1,refine_max)
+      
+!.....We store some tables of useful topological constants
+!     These constants are initialized in the data statements below
+      integer f_e_ef(4,6)
+      integer e_c(3,8)
+      integer local_corner(8,6)
+      integer cal_nnb(3,8)
+      integer oplc(4)
+      integer cal_iijj(2,4)
+      integer cal_intempx(4,6)
+      integer c_f(4,6)
+      integer le_arr(4,0:1,3)
+      integer jjface(6)
+      integer e_face2(4,6)
+      integer op(4)
+      integer localedgenumber(6,12)
+      integer edgenumber(4,6)
+      integer f_c(3,8)
+      integer e1v1(6,6),e2v1(6,6),e1v2(6,6),e2v2(6,6)
+      integer children(4,6)
+      integer iijj(2,4)
+      integer v_end(2)
+      integer face_l1(3),face_l2(3),face_ld(3)
+
+! ... Timer parameters
+      integer t_total,t_init,t_convect,t_transfb_c,  &
+     &        t_diffusion,t_transf,t_transfb,t_adaptation,  &
+     &        t_transf2,t_add2,t_last
+      parameter (t_total=1,t_init=2,t_convect=3,t_transfb_c=4,  &
+     &        t_diffusion=5,t_transf=6,t_transfb=7,t_adaptation=8,  &
+     &        t_transf2=9,t_add2=10,t_last=10)
+      logical timeron
+
+!.....Locks used for atomic updates
+!c    integer (kind=omp_lock_kind) tlock(lmor)
+!$    integer(8) tlock(lmor)
+
+
+!------------------------------------------------------------------
+!.....We store some tables of useful topological constants
+!------------------------------------------------------------------
+
+!     f_e_ef(e,f) returns the other face sharing the e'th local edge of face f.
+      data f_e_ef/6,3,5,4, 6,3,5,4, 6,1,5,2, 6,1,5,2, 4,1,3,2, 4,1,3,2/
+
+!.....e_c(n,j) returns n'th edge sharing the vertex j of an element
+      data e_c /5,8,11, 1,4,11,  5,6,9, 1,2,9,  &
+     &          7,8,12, 3,4,12, 6,7,10, 2,3,10/
+
+!.....local_corner(n,i) returns the local corner index of vertex n on face i
+      data local_corner /0,1,0,2,0,3,0,4, 1,0,2,0,3,0,4,0,  &
+     &                   0,0,1,2,0,0,3,4, 1,2,0,0,3,4,0,0,  &
+     &                   0,0,0,0,1,2,3,4, 1,2,3,4,0,0,0,0/
+
+!.....cal_nnb(n,i) returns the neighbor elements neighbored by n'th edge
+!     among the three edges sharing vertex i
+!     the elements are the eight children elements ordered as 1 to 8.
+      data cal_nnb/5,2,3, 6,1,4, 7,4,1, 8,3,2,  &
+     &             1,6,7, 2,5,8, 3,8,5, 4,7,6/
+
+!.....returns the opposite local corner index: 1-4,2-3
+      data oplc /4,3,2,1/
+
+!.....cal_iijj(i,n) returns the location of local corner number n on a face 
+!     i =1  to get ii, i=2 to get jj
+!     (ii,jj) is defined the same as in mortar location (ii,jj)
+      data cal_iijj /1,1, 1,2, 2,1, 2,2/
+
+!.....returns the adjacent(neighbored by a face) element's children,
+!     assuming a vertex is shared by eight child elements 1-8. 
+!     index n is local corner number on the face which is being 
+!     assigned the mortar index number
+      data cal_intempx /8,6,4,2, 7,5,3,1, 8,7,4,3,  &
+     &                  6,5,2,1, 8,7,6,5, 4,3,2,1/
+
+!.....c_f(i,f) returns the vertex number of i'th local corner on face f
+      data c_f /2,4,6,8, 1,3,5,7, 3,4,7,8, 1,2,5,6, 5,6,7,8, 1,2,3,4/
+
+!.....on each face of the parent element, there are four children elements.
+!     le_arr(i,j,n) returns the i'th element among the four children elements 
+!     n refers to the direction: 1 for x, 2 for y and 3 for z direction. 
+!     j refers to positive(0) or negative(1) direction on x, y or z direction.
+!     n=1,j=0 refers to face 1 and n=1, j=1 refers to face 2, n=2,j=0 refers to
+!     face 3.... 
+!     The current eight children are ordered as 8,1,2,3,4,5,6,7 
+      data    le_arr/8,2,4,6, 1,3,5,7,  &
+     &               8,1,4,5, 2,3,6,7,  &
+     &               8,1,2,3, 4,5,6,7/
+
+!.....jjface(n) returns the face opposite to face n
+      data jjface /2,1,4,3,6,5/
+
+!c.....edgeface(n,f) returns OTHER face which shares local edge n on face f
+!      integer edgeface(4,6)
+!      data edgeface /6,3,5,4, 6,3,5,4, 6,1,5,2, 
+!     $               6,1,5,2, 4,1,3,2, 4,1,3,2/
+
+!.....e_face2(n,f) returns the local edge number of edge n on the
+!     other face sharing local edge n on face f
+      data e_face2 /2,2,2,2, 4,4,4,4, 3,2,3,2,  &
+     &              1,4,1,4, 3,3,3,3, 1,1,1,1/
+
+!.....op(n) returns the local edge number of the edge which 
+!     is opposite to local edge n on the same face
+      data op /3,4,1,2/
+
+!.....localedgenumber(f,e) returns the local edge number for edge e
+!     on face f. A zero result value signifies illegal input
+      data localedgenumber /1,0,0,0,0,2, 2,0,2,0,0,0, 3,0,0,0,2,0,  &
+     &                      4,0,0,2,0,0, 0,1,0,0,0,4, 0,2,4,0,0,0,  &
+     &                      0,3,0,0,4,0, 0,4,0,4,0,0, 0,0,1,0,0,3,  &
+     &                      0,0,3,0,3,0, 0,0,0,1,0,1, 0,0,0,3,1,0/
+
+!.....edgenumber(e,f) returns the edge index of local edge e on face f
+      data edgenumber / 1,2, 3,4,  5,6, 7,8,  9,2,10,6,  &
+     &                 11,4,12,8, 12,3,10,7, 11,1, 9,5/
+
+!.....f_c(c,n) returns the face index of c'th face sharing vertex n 
+      data f_c /2,4,6, 1,4,6, 2,3,6, 1,3,6,  &
+     &          2,4,5, 1,4,5, 2,3,5, 1,3,5/
+
+!.....if two elements are neighbors by one edge, 
+!     e1v1(f1,f2) returns the smaller index of the two vertices on this 
+!     edge on one element
+!     e1v2 returns the larger index of the two vertices of this edge on 
+!     one element
+!     e2v1 returns the smaller index of the two vertices on this edge on 
+!     another element
+!     e2v2 returns the larger index of the two vertices on this edge on
+!     another element
+      data e1v1/0,0,4,2,6,2, 0,0,3,1,5,1, 4,3,0,0,7,3,  &
+     &          2,1,0,0,5,1, 6,5,7,5,0,0, 2,1,3,1,0,0/
+      data e2v1/0,0,1,3,1,5, 0,0,2,4,2,6, 1,2,0,0,1,5,  &
+     &          3,4,0,0,3,7, 1,2,1,3,0,0, 5,6,5,7,0,0/
+      data e1v2/0,0,8,6,8,4, 0,0,7,5,7,3, 8,7,0,0,8,4,  &
+     &          6,5,0,0,6,2, 8,7,8,6,0,0, 4,3,4,2,0,0/
+      data e2v2/0,0,5,7,3,7, 0,0,6,8,4,8, 5,6,0,0,2,6,  &
+     &          7,8,0,0,4,8, 3,4,2,4,0,0, 7,8,6,8,0,0/
+
+!.....children(n1,n)returns the four elements among the eight children 
+!     elements to be merged on face n of the parent element
+!     the IDs for the eight children are 1,2,3,4,5,6,7,8
+      data children/2,4,6,8, 1,3,5,7, 3,4,7,8,  &
+     &              1,2,5,6, 5,6,7,8, 1,2,3,4/
+
+!.....iijj(n1,n) returns the location of n'th mortar on an element face
+!     n1=1 refers to x direction location and n1=2 refers to y direction
+      data iijj/1,1,1,2,2,1,2,2/
+
+!.....v_end(n) returns the index of collocation points at two ends of each
+!     direction
+      data v_end /1,lx1/
+
+!.....face_l1,face_l2,face_ld return the start,end,stride for a loop over faces
+!     used in subroutine mortar_vertex
+      data face_l1 /2,3,1/, face_l2 /3,1,2/, face_ld /1,-2,1/
+
+
+      end module ua_data
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine alloc_space
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+! allocate space dynamically for data arrays
+!---------------------------------------------------------------------
+
+      use ua_data
+      implicit none
+
+      integer ios
+
+
+      allocate (  &
+     &        ta1  (lx1,lx1,lx1,lelt), ta2   (lx1,lx1,lx1,lelt),  &
+     &        trhs (lx1,lx1,lx1,lelt), t     (lx1,lx1,lx1,lelt),  &
+     &        tmult(lx1,lx1,lx1,lelt), dpcelm(lx1,lx1,lx1,lelt),  &
+     &        pdiff(lx1,lx1,lx1,lelt), pdiffp(lx1,lx1,lx1,lelt),  &
+     &        stat = ios)
+
+      if (ios .eq. 0) allocate (  &
+     &        umor(lmor), tmmor(lmor),  &
+     &        rmor(lmor), dpcmor (lmor), pmorx(lmor), ppmor(lmor),  &
+     &        mormult(lmor), tmort(lmor),  &
+     &        stat = ios)
+
+      if (ios .eq. 0) allocate (  &
+     &        idmo    (lx1,lx1,lnje,lnje,nsides,lelt),  &
+     &        idel    (lx1,lx1,          nsides,lelt),  &
+     &        sje     (2,2,              nsides,lelt),  &
+     &        sje_new (2,2,              nsides,lelt),  &
+     &        ijel    (2,                nsides,lelt),  &
+     &        ijel_new(2,                nsides,lelt),  &
+     &        cbc     (                  nsides,lelt),  &
+     &        cbc_new (                  nsides,lelt),  &
+     &        vassign (8,lelt),       emo(2,8,8*lelt),   &
+     &        nemo    (8*lelt),  &
+     &        diagn   (2,12,lelt),  &
+     &        stat = ios)
+
+      if (ios .eq. 0) allocate (  &
+     &        tree   (lelt), mt_to_id    (lelt),                   &
+     &        newc   (lelt), mt_to_id_old(lelt),  &
+     &        newi   (lelt), id_to_mt    (lelt),  &
+     &        newe   (lelt), ref_front_id(lelt),  &
+     &        front  (lelt), action      (lelt),  &
+     &        ich    (lelt), size_e      (lelt),  &
+     &        treenew(lelt),  &
+     &        stat = ios)
+
+      if (ios .eq. 0) allocate (  &
+     &        ifpcmor  (8* lelt),  &
+     &        eassign  (12,lelt),  if_1_edge(12,lelt),  &
+     &        ncon_edge(12,lelt),  &
+     &        skip (lelt), ifcoa (lelt), ifcoa_id(lelt),  &
+     &        fassign(nsides,lelt), edgevis(4,nsides,lelt),    &
+     &        stat = ios)
+
+!.....coordinate of element vertices
+      if (ios .eq. 0) allocate (  &
+     &        xc    (8,lelt),yc    (8,lelt),zc    (8,lelt),  &
+     &        xc_new(8,lelt),yc_new(8,lelt),zc_new(8,lelt),  &
+     &        stat = ios)
+
+      if (ios .ne. 0) then
+         write(*,*) 'Error encountered in allocating space'
+         stop
+      endif
+
+      return
+      end
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/UA/utils.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/UA/utils.f90
new file mode 100644
index 000000000..4dd274d55
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/UA/utils.f90
@@ -0,0 +1,373 @@
+!------------------------------------------------------------------
+      subroutine reciprocal (a, n)
+!------------------------------------------------------------------
+!     replace each of the n elements of array a with its reciprocal
+!------------------------------------------------------------------
+
+      implicit none
+
+      integer n, i
+      double precision a(n)
+
+!$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(I)
+      do i = 1, n
+        a(i) = 1.d0/a(i)
+      end do
+!$OMP END PARALLEL DO
+
+      return
+      end
+!------------------------------------------------------------------
+      subroutine r_init_omp (a, n, const)
+!------------------------------------------------------------------
+!     initialize double precision array a with length of n
+!------------------------------------------------------------------
+
+      implicit none
+
+      integer n, i
+      double precision a(n), const
+
+!$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(I)
+      do i = 1, n
+        a(i) = const
+      end do
+!$OMP END PARALLEL DO
+
+      return
+      end
+!------------------------------------------------------------------
+      subroutine r_init (a, n, const)
+!------------------------------------------------------------------
+!     initialize double precision array a with length of n
+!------------------------------------------------------------------
+
+      implicit none
+
+      integer n, i
+      double precision a(n), const
+
+      do i = 1, n
+        a(i) = const
+      end do
+
+      return
+      end
+!------------------------------------------------------------------
+      subroutine nr_init_omp (a, n, const)
+!------------------------------------------------------------------
+!     initialize integer array a with length of n
+!------------------------------------------------------------------
+
+      implicit none
+
+      integer n, i, a(n), const
+
+!$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(I)
+      do i = 1, n
+        a(i) = const
+      end do
+!$OMP END PARALLEL DO
+
+      return
+      end
+
+!------------------------------------------------------------------
+      subroutine nr_init (a, n, const)
+!------------------------------------------------------------------
+!     initialize integer array a with length of n
+!------------------------------------------------------------------
+
+      implicit none
+
+      integer n, i, a(n), const
+
+      do i = 1, n
+        a(i) = const
+      end do
+
+      return
+      end
+!------------------------------------------------------------------
+      subroutine l_init_omp (a, n, const)
+!------------------------------------------------------------------
+!     initialize logical array a with length of n
+!------------------------------------------------------------------
+
+      implicit none
+      integer n, i
+      logical a(n), const
+
+!$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(I)
+      do i = 1, n
+        a(i) = const
+      end do
+!$OMP END PARALLEL DO
+
+      return
+      end
+
+!-----------------------------------------------------------------
+      subroutine ncopy (a,b,n)
+!------------------------------------------------------------------
+!     copy array of integers b to a, the length of array is n
+!------------------------------------------------------------------
+
+      implicit none
+
+      integer n,i
+      integer a(n), b(n)
+
+      do i = 1, n
+        a(i) = b(i)
+      end do
+
+      return
+      end
+
+!-----------------------------------------------------------------
+      subroutine copy (a,b,n)
+!------------------------------------------------------------------
+!     copy double precision array b to a, the length of array is n
+!------------------------------------------------------------------
+
+      implicit none
+
+      integer n,i
+      double precision a(n), b(n)
+
+      do i = 1, n
+         a(i) = b(i)
+      end do
+
+      return
+      end
+
+!-----------------------------------------------------------------
+      subroutine adds2m1(a,b,c1,n)
+!-----------------------------------------------------------------
+!     a=a+c1*b
+!-----------------------------------------------------------------
+      implicit none
+
+      integer n,i
+      double precision a(n),b(n),c1
+!$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(i)
+      do i=1,n
+        a(i)=a(i)+c1*b(i)
+      end do
+!$OMP END PARALLEL DO
+
+      return
+      end
+
+!-----------------------------------------------------------------
+      subroutine adds1m1(a,b,c1,n )
+!-----------------------------------------------------------------
+!     a=c1*a+b
+!-----------------------------------------------------------------
+
+      implicit none
+
+      integer n,i
+      double precision a(n),b(n),c1
+!$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(i)
+      do i=1,n
+        a(i)=c1*a(i)+b(i)
+      end do
+!$OMP END PARALLEL DO
+
+      return
+      end
+
+!-----------------------------------------------------------------
+      subroutine col2(a,b,n)
+!------------------------------------------------------------------
+!     a=a*b
+!------------------------------------------------------------------
+
+      implicit none
+
+      integer n,i
+      double precision a(n),b(n)
+
+!$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(i)
+      do i=1,n
+        a(i)=a(i)*b(i)
+      end do
+!$OMP END PARALLEL DO
+
+      return
+      end
+
+!-----------------------------------------------------------------
+      subroutine nrzero (na,n)
+!------------------------------------------------------------------
+!     zero out array of integers 
+!------------------------------------------------------------------
+
+      implicit none
+
+      integer n,i,na(n)
+
+      do i = 1, n
+        na(i ) = 0
+      end do
+
+      return
+      end
+
+!-----------------------------------------------------------------
+      subroutine add2(a,b,n)
+!------------------------------------------------------------------
+!     a=a+b
+!------------------------------------------------------------------
+
+      implicit none
+
+      integer n,i
+      double precision  a(n),b(n)
+!$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(i)
+      do i=1,n
+        a(i)=a(i)+b(i)
+      end do
+!$OMP END PARALLEL DO
+
+      return
+      end
+
+!-----------------------------------------------------------------
+      double precision function calc_norm()
+!------------------------------------------------------------------
+!     calculate the integral of ta1 over the whole domain
+!------------------------------------------------------------------
+
+      use ua_data
+      implicit none
+
+      double precision total,ieltotal
+      integer iel,k,j,i,isize
+
+      total=0.d0
+!$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(i,j,k,isize,ieltotal,iel)  &
+!$OMP& REDUCTION(+:total)
+
+      do iel=1,nelt
+        ieltotal=0.d0
+        isize=size_e(iel)
+        do k=1,lx1
+          do j=1,lx1
+            do i=1,lx1
+              ieltotal=ieltotal+ta1(i,j,k,iel)*w3m1(i,j,k)  &
+     &                               *jacm1_s(i,j,k,isize)
+            end do
+          end do
+        end do
+      total=total+ieltotal
+      end do
+!$OMP END PARALLEL DO
+
+      calc_norm = total
+
+      return
+      end
+!-----------------------------------------------------------------
+      subroutine parallel_add(frontier)
+!-----------------------------------------------------------------
+!     input array frontier, perform (potentially) parallel add so that
+!     the output frontier(i) has sum of frontier(1)+frontier(2)+...+frontier(i)
+!-----------------------------------------------------------------
+
+      use ua_data
+      implicit none
+
+      integer nellog,i,ahead,ii,ntemp,n1,ntemp1,frontier(lelt),iel
+
+      nellog=0
+      iel=1
+   10 iel=iel*2
+      nellog=nellog+1
+      if (iel.lt.nelt) goto 10
+
+      ntemp=1
+      do i=1,nellog
+        n1=ntemp*2
+!$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(ahead,ii,iel)
+        do iel=n1, nelt,n1
+          ahead=frontier(iel-ntemp)
+          do ii=ntemp-1,0,-1
+            frontier(iel-ii)=frontier(iel-ii)+ahead
+          end do
+        end do
+!$OMP END PARALLEL DO
+
+        iel=(nelt/n1+1)*n1
+        ntemp1=iel-nelt
+        if(ntemp1.lt.ntemp)then
+          ahead=frontier(iel-ntemp)
+!$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(ii)
+          do ii=ntemp-1,ntemp1,-1
+            frontier(iel-ii)=frontier(iel-ii)+ahead
+          end do
+!$OMP END PARALLEL DO
+        end if
+
+        ntemp=n1
+      end do
+
+      return
+      end 
+
+!------------------------------------------------------------------
+      subroutine dssum
+
+!------------------------------------------------------------------
+!     Perform stiffness summation: element-mortar-element mapping
+!------------------------------------------------------------------
+
+      use ua_data
+      implicit none
+
+      call transfb(dpcmor,dpcelm)
+      call transf (dpcmor,dpcelm)
+
+      return
+      end
+
+!------------------------------------------------------------------
+      subroutine facev(a,iface,val)
+!------------------------------------------------------------------
+!     assign the value val to face(iface,iel) of array a.
+!------------------------------------------------------------------
+
+      use ua_data
+      implicit none
+
+      double precision a(lx1,lx1,lx1), val
+      integer iface, kx1, kx2, ky1, ky2, kz1, kz2, ix, iy, iz
+
+      kx1=1
+      ky1=1
+      kz1=1
+      kx2=lx1
+      ky2=lx1
+      kz2=lx1
+      if (iface.eq.1) kx1=lx1
+      if (iface.eq.2) kx2=1
+      if (iface.eq.3) ky1=lx1
+      if (iface.eq.4) ky2=1
+      if (iface.eq.5) kz1=lx1
+      if (iface.eq.6) kz2=1
+
+      do ix = kx1, kx2
+        do iy = ky1, ky2
+          do iz = kz1, kz2
+            a(ix,iy,iz)=val
+          end do
+        end do
+      end do
+
+      return
+      end
+
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/UA/verify.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/UA/verify.f90
new file mode 100644
index 000000000..dc28c4e2e
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/UA/verify.f90
@@ -0,0 +1,93 @@
+      subroutine verify(class, verified)
+
+      use, intrinsic :: ieee_arithmetic, only : ieee_is_nan
+
+      use ua_data
+
+      implicit none
+
+      double precision norm, calc_norm, epsilon, norm_dif, norm_ref
+      external         calc_norm
+      character        class
+      logical          verified
+       
+!.....tolerance level
+      epsilon = 1.0d-08
+
+!.....compute the temperature integral over the whole domain
+      norm = calc_norm()
+
+      verified = .true.
+      if     ( class .eq. 'S' ) then
+        norm_ref = 0.1890013110962D-02
+      elseif ( class .eq. 'W' ) then
+        norm_ref = 0.2569794837076D-04
+      elseif ( class .eq. 'A' ) then
+        norm_ref = 0.8939996281443D-04
+      elseif ( class .eq. 'B' ) then
+        norm_ref = 0.4507561922901D-04
+      elseif ( class .eq. 'C' ) then
+        norm_ref = 0.1544736587100D-04
+      elseif ( class .eq. 'D' ) then
+        norm_ref = 0.1577586272355D-05
+      else
+        class = 'U'
+        norm_ref = 1.d0
+        verified = .false.
+      endif         
+
+      norm_dif = dabs((norm - norm_ref)/norm_ref)
+
+!---------------------------------------------------------------------
+!    Output the comparison of computed results to known cases.
+!---------------------------------------------------------------------
+
+      print *
+
+      if (class .ne. 'U') then
+         write(*, 1990) class
+ 1990    format(' Verification being performed for class ', a)
+         write (*,2000) epsilon
+ 2000    format(' accuracy setting for epsilon = ', E20.13)
+      else 
+         write(*, 1995)
+ 1995    format(' Unknown class')
+      endif
+
+      if (class .ne. 'U') then
+         write (*,2001) 
+      else
+         write (*, 2005)
+      endif
+
+ 2001 format(' Comparison of temperature integrals')
+ 2005 format(' Temperature integral')
+      if (class .eq. 'U') then
+         write(*, 2015) norm
+      else if ((.not.ieee_is_nan(norm_dif)) .and.  &
+     &         norm_dif .le. epsilon) then
+         write (*,2011) norm, norm_ref, norm_dif
+      else 
+         verified = .false.
+         write (*,2010) norm, norm_ref, norm_dif
+      endif
+
+ 2010 format(' FAILURE: ', E20.13, E20.13, E20.13)
+ 2011 format('          ', E20.13, E20.13, E20.13)
+ 2015 format('          ', E20.13)
+        
+      if (class .eq. 'U') then
+        write(*, 2022)
+        write(*, 2023)
+ 2022   format(' No reference values provided')
+ 2023   format(' No verification performed')
+      else if (verified) then
+        write(*, 2020)
+ 2020   format(' Verification Successful')
+      else
+        write(*, 2021)
+ 2021   format(' Verification failed')
+      endif
+
+      return
+      end
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/common/c_print_results.c b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/common/c_print_results.c
new file mode 100644
index 000000000..dbd07c372
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/common/c_print_results.c
@@ -0,0 +1,117 @@
+/*****************************************************************/
+/******     C  _  P  R  I  N  T  _  R  E  S  U  L  T  S     ******/
+/*****************************************************************/
+#include <stdlib.h>
+#include <stdio.h>
+#ifdef _OPENMP
+#include <omp.h>
+#endif
+
+void c_print_results( char   *name,
+                      char   class,
+                      int    n1, 
+                      int    n2,
+                      int    n3,
+                      int    niter,
+                      double t,
+                      double mops,
+		      char   *optype,
+                      int    passed_verification,
+                      char   *npbversion,
+                      char   *compiletime,
+                      char   *cc,
+                      char   *clink,
+                      char   *c_lib,
+                      char   *c_inc,
+                      char   *cflags,
+                      char   *clinkflags )
+{
+    int num_threads, max_threads;
+
+
+    max_threads = 0;
+    num_threads = 0;
+
+/*   figure out number of threads used */
+#ifdef _OPENMP
+    max_threads = omp_get_max_threads();
+#pragma omp parallel shared(num_threads)
+{
+    #pragma omp master
+    num_threads = omp_get_num_threads();
+}
+#endif
+
+
+    printf( "\n\n %s Benchmark Completed\n", name ); 
+
+    printf( " Class           =                        %c\n", class );
+
+    if( n3 == 0 ) {
+        long nn = n1;
+        if ( n2 != 0 ) nn *= n2;
+        printf( " Size            =             %12ld\n", nn );   /* as in IS */
+    }
+    else
+        printf( " Size            =             %4dx%4dx%4d\n", n1,n2,n3 );
+
+    printf( " Iterations      =             %12d\n", niter );
+ 
+    printf( " Time in seconds =             %12.2f\n", t );
+
+    if (num_threads > 0)
+        printf( " Total threads   =             %12d\n", num_threads);
+
+    if (max_threads > 0)
+        printf( " Avail threads   =             %12d\n", max_threads);
+
+    if (num_threads != max_threads) 
+        printf( " Warning: Threads used differ from threads available\n");
+
+    printf( " Mop/s total     =             %12.2f\n", mops );
+
+    if (num_threads > 0)
+        printf( " Mop/s/thread    =             %12.2f\n",
+               mops/(double)num_threads );
+
+    printf( " Operation type  = %24s\n", optype);
+
+    if( passed_verification < 0 )
+        printf( " Verification    =            NOT PERFORMED\n" );
+    else if( passed_verification )
+        printf( " Verification    =               SUCCESSFUL\n" );
+    else
+        printf( " Verification    =             UNSUCCESSFUL\n" );
+
+    printf( " Version         =             %12s\n", npbversion );
+
+    printf( " Compile date    =             %12s\n", compiletime );
+
+    printf( "\n Compile options:\n" );
+
+    printf( "    CC           = %s\n", cc );
+
+    printf( "    CLINK        = %s\n", clink );
+
+    printf( "    C_LIB        = %s\n", c_lib );
+
+    printf( "    C_INC        = %s\n", c_inc );
+
+    printf( "    CFLAGS       = %s\n", cflags );
+
+    printf( "    CLINKFLAGS   = %s\n", clinkflags );
+
+    printf( "\n\n" );
+    printf( " Please send all errors/feedbacks to:\n\n" );
+    printf( " NPB Development Team\n" );
+    printf( " npb@nas.nasa.gov\n\n\n" );
+/*    printf( " Please send the results of this run to:\n\n" );
+    printf( " NPB Development Team\n" );
+    printf( " Internet: npb@nas.nasa.gov\n \n" );
+    printf( " If email is not available, send this to:\n\n" );
+    printf( " MS T27A-1\n" );
+    printf( " NASA Ames Research Center\n" );
+    printf( " Moffett Field, CA  94035-1000\n\n" );
+    printf( " Fax: 650-604-3957\n\n" ); */
+}
+ 
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/common/c_timers.c b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/common/c_timers.c
new file mode 100644
index 000000000..c6a2aafab
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/common/c_timers.c
@@ -0,0 +1,104 @@
+#include "wtime.h"
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#ifdef _OPENMP
+#include <omp.h>
+#endif
+
+/*  Prototype  */
+void wtime( double * );
+
+
+/*****************************************************************/
+/******         E  L  A  P  S  E  D  _  T  I  M  E          ******/
+/*****************************************************************/
+double elapsed_time( void )
+{
+    double t;
+
+#if defined(_OPENMP) && (_OPENMP > 200010)
+/*  Use the OpenMP timer if we can */
+    t = omp_get_wtime();
+#else
+    wtime( &t );
+#endif
+    return( t );
+}
+
+
+static double start[64], elapsed[64];
+#ifdef _OPENMP
+#pragma omp threadprivate(start, elapsed)
+#endif
+
+/*****************************************************************/
+/******            T  I  M  E  R  _  C  L  E  A  R          ******/
+/*****************************************************************/
+void timer_clear( int n )
+{
+    elapsed[n] = 0.0;
+}
+
+
+/*****************************************************************/
+/******            T  I  M  E  R  _  S  T  A  R  T          ******/
+/*****************************************************************/
+void timer_start( int n )
+{
+    start[n] = elapsed_time();
+}
+
+
+/*****************************************************************/
+/******            T  I  M  E  R  _  S  T  O  P             ******/
+/*****************************************************************/
+void timer_stop( int n )
+{
+    double t, now;
+
+    now = elapsed_time();
+    t = now - start[n];
+    elapsed[n] += t;
+
+}
+
+
+/*****************************************************************/
+/******            T  I  M  E  R  _  R  E  A  D             ******/
+/*****************************************************************/
+double timer_read( int n )
+{
+    return( elapsed[n] );
+}
+
+
+/*****************************************************************/
+/******            C H E C K _ T I M E R _ F L A G          ******/
+/*****************************************************************/
+int check_timer_flag( void )
+{
+    int timer_on = 0;
+    char *ev = getenv("NPB_TIMER_FLAG");
+
+    if (ev) {
+        if (*ev == '\0')
+            timer_on = 1;
+        else if (*ev >= '1' && *ev <= '9')
+            timer_on = 1;
+        else if (strcmp(ev, "on") == 0 || strcmp(ev, "ON") == 0 ||
+                 strcmp(ev, "yes") == 0 || strcmp(ev, "YES") == 0 ||
+                 strcmp(ev, "true") == 0 || strcmp(ev, "TRUE") == 0)
+            timer_on = 1;
+    }
+    else {
+        FILE *fp = fopen("timer.flag", "r");
+        if (fp != NULL) {
+            fclose(fp);
+            timer_on = 1;
+        }
+    }
+
+    return timer_on;
+}
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/common/c_timers.h b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/common/c_timers.h
new file mode 100644
index 000000000..ea3a2ceb0
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/common/c_timers.h
@@ -0,0 +1,11 @@
+#ifndef __C_TIMERS_H
+#define __C_TIMERS_H
+
+extern void   timer_clear( int n );
+extern void   timer_start( int n );
+extern void   timer_stop( int n );
+extern double timer_read( int n );
+extern int    check_timer_flag( void );
+
+#endif
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/common/hooks.c b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/common/hooks.c
new file mode 100644
index 000000000..1322e12e9
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/common/hooks.c
@@ -0,0 +1,61 @@
+/*
+Copyright (c) 2024 The Regents of the University of California
+All rights reserved.
+
+Redistribution and use in source and binary forms, with or without
+modification, are permitted provided that the following conditions are
+met: redistributions of source code must retain the above copyright
+notice, this list of conditions and the following disclaimer;
+redistributions in binary form must reproduce the above copyright
+notice, this list of conditions and the following disclaimer in the
+documentation and/or other materials provided with the distribution;
+neither the name of the copyright holders nor the names of its
+contributors may be used to endorse or promote products derived from
+this software without specific prior written permission.
+
+THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+*/
+#include <stdio.h>
+#include <gem5/m5ops.h>
+
+
+void init_() __attribute__((constructor));
+void map_m5_mem();
+
+void init_() {
+
+    // The constructor attribute on the declaration above makes this
+    // function run before main(); map_m5_mem() needs to mmap /dev/mem.
+    printf(" --------------------- M5 INIT --------------------- \n");
+    map_m5_mem();
+}
+
+void m5_exit_interface_()
+{
+    printf(" --------------------- M5 EXIT --------------------- \n");
+    // this function calls m5_exit
+    m5_exit_addr(0);
+}
+
+void m5_work_begin_interface_()
+{
+
+    printf(" -------------------- ROI BEGIN -------------------- \n");
+    m5_work_begin_addr(0,0);
+}
+
+void m5_work_end_interface_()
+{
+    m5_work_end_addr(0,0);
+    printf(" -------------------- ROI END -------------------- \n");
+}
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/common/print_results.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/common/print_results.f90
new file mode 100644
index 000000000..f6be545ca
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/common/print_results.f90
@@ -0,0 +1,136 @@
+
+      subroutine print_results(name, class, n1, n2, n3, niter,  &
+     &               t, mops, optype, verified, npbversion,  &
+     &               compiletime, cs1, cs2, cs3, cs4, cs5, cs6, cs7)
+      
+      implicit none
+      character(len=*) name
+      character class
+      integer   n1, n2, n3, niter, j
+      double precision t, mops
+      character optype*24, size*15
+      logical   verified
+      character(len=*) npbversion, compiletime,  &
+     &              cs1, cs2, cs3, cs4, cs5, cs6, cs7
+      integer   num_threads, max_threads
+!$    integer omp_get_num_threads, omp_get_max_threads
+!$    external omp_get_num_threads, omp_get_max_threads
+
+
+      max_threads = 0
+!$    max_threads = omp_get_max_threads()
+
+!     figure out number of threads used
+      num_threads = 0
+!$omp parallel shared(num_threads)
+!$omp master
+!$    num_threads = omp_get_num_threads()
+!$omp end master
+!$omp end parallel
+
+
+         write (*, 2) name
+ 2       format(//, ' ', A, ' Benchmark Completed.')
+
+         write (*, 3) Class
+ 3       format(' Class           = ', 12x, a12)
+
+!   If this is not a grid-based problem (EP, FT, CG), then
+!   we only print n1, which contains some measure of the
+!   problem size. In that case, n2 and n3 are both zero.
+!   Otherwise, we print the grid size n1xn2xn3
+
+         if ((n2 .eq. 0) .and. (n3 .eq. 0)) then
+            if (name(1:2) .eq. 'EP') then
+               write(size, '(f15.0)' ) 2.d0**n1
+               j = 15
+               if (size(j:j) .eq. '.') j = j - 1
+               write (*,42) size(1:j)
+ 42            format(' Size            = ',9x, a15)
+            else
+               write (*,44) n1
+ 44            format(' Size            = ',12x, i12)
+            endif
+         else
+            write (*, 4) n1,n2,n3
+ 4          format(' Size            =  ',9x, i4,'x',i4,'x',i4)
+         endif
+
+         write (*, 5) niter
+ 5       format(' Iterations      = ', 12x, i12)
+         
+         write (*, 6) t
+ 6       format(' Time in seconds = ',12x, f12.2)
+
+         if (num_threads .gt. 0) write (*,7) num_threads
+ 7       format(' Total threads   = ', 12x, i12)
+         
+         if (max_threads .gt. 0) write (*,8) max_threads
+ 8       format(' Avail threads   = ', 12x, i12)
+
+         if (num_threads .ne. max_threads) write (*,88) 
+ 88      format(' Warning: Threads used differ from threads available')
+
+         write (*,9) mops
+ 9       format(' Mop/s total     = ',12x, f12.2)
+
+         if (num_threads .gt. 0) write (*,10) mops/float( num_threads )
+ 10      format(' Mop/s/thread    = ', 12x, f12.2)        
+
+         write(*, 11) optype
+ 11      format(' Operation type  = ', a24)
+
+         if (verified) then 
+            write(*,12) '  SUCCESSFUL'
+         else
+            write(*,12) 'UNSUCCESSFUL'
+         endif
+ 12      format(' Verification    = ', 12x, a)
+
+         write(*,13) npbversion
+ 13      format(' Version         = ', 12x, a12)
+
+         write(*,14) compiletime
+ 14      format(' Compile date    = ', 12x, a12)
+
+
+         write (*,121) cs1
+ 121     format(/, ' Compile options:', /,  &
+     &          '    FC           = ', A)
+
+         write (*,122) cs2
+ 122     format('    FLINK        = ', A)
+
+         write (*,123) cs3
+ 123     format('    F_LIB        = ', A)
+
+         write (*,124) cs4
+ 124     format('    F_INC        = ', A)
+
+         write (*,125) cs5
+ 125     format('    FFLAGS       = ', A)
+
+         write (*,126) cs6
+ 126     format('    FLINKFLAGS   = ', A)
+
+         write(*, 127) cs7
+ 127     format('    RAND         = ', A)
+        
+         write (*,130)
+ 130     format(//' Please send all errors/feedbacks to:'//  &
+     &            ' NPB Development Team'/  &
+     &            ' npb@nas.nasa.gov'//)
+! 130     format(//' Please send the results of this run to:'//
+!     >            ' NPB Development Team '/
+!     >            ' Internet: npb@nas.nasa.gov'/
+!     >            ' '/
+!     >            ' If email is not available, send this to:'//
+!     >            ' MS T27A-1'/
+!     >            ' NASA Ames Research Center'/
+!     >            ' Moffett Field, CA  94035-1000'//
+!     >            ' Fax: 650-604-3957'//)
+
+
+         return
+         end
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/common/randdp.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/common/randdp.f90
new file mode 100644
index 000000000..27fdf95e0
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/common/randdp.f90
@@ -0,0 +1,137 @@
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      double precision function randlc (x, a)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!
+!   This routine returns a uniform pseudorandom double precision number in the
+!   range (0, 1) by using the linear congruential generator
+!
+!   x_{k+1} = a x_k  (mod 2^46)
+!
+!   where 0 < x_k < 2^46 and 0 < a < 2^46.  This scheme generates 2^44 numbers
+!   before repeating.  The argument A is the same as 'a' in the above formula,
+!   and X is the same as x_0.  A and X must be odd double precision integers
+!   in the range (1, 2^46).  The returned value RANDLC is normalized to be
+!   between 0 and 1, i.e. RANDLC = 2^(-46) * x_1.  X is updated to contain
+!   the new seed x_1, so that subsequent calls to RANDLC using the same
+!   arguments will generate a continuous sequence.
+!
+!   This routine should produce the same results on any computer with at least
+!   48 mantissa bits in double precision floating point data.  On 64 bit
+!   systems, double precision should be disabled.
+!
+!   David H. Bailey     October 26, 1990
+!
+!---------------------------------------------------------------------
+
+      implicit none
+
+      double precision r23,r46,t23,t46,a,x,t1,t2,t3,t4,a1,a2,x1,x2,z
+      parameter (r23 = 0.5d0 ** 23, r46 = r23 ** 2, t23 = 2.d0 ** 23,  &
+     &  t46 = t23 ** 2)
+
+!---------------------------------------------------------------------
+!   Break A into two parts such that A = 2^23 * A1 + A2.
+!---------------------------------------------------------------------
+      t1 = r23 * a
+      a1 = int (t1)
+      a2 = a - t23 * a1
+
+!---------------------------------------------------------------------
+!   Break X into two parts such that X = 2^23 * X1 + X2, compute
+!   Z = A1 * X2 + A2 * X1  (mod 2^23), and then
+!   X = 2^23 * Z + A2 * X2  (mod 2^46).
+!---------------------------------------------------------------------
+      t1 = r23 * x
+      x1 = int (t1)
+      x2 = x - t23 * x1
+      t1 = a1 * x2 + a2 * x1
+      t2 = int (r23 * t1)
+      z = t1 - t23 * t2
+      t3 = t23 * z + a2 * x2
+      t4 = int (r46 * t3)
+      x = t3 - t46 * t4
+      randlc = r46 * x
+
+      return
+      end
+
+
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine vranlc (n, x, a, y)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!
+!   This routine generates N uniform pseudorandom double precision numbers in
+!   the range (0, 1) by using the linear congruential generator
+!
+!   x_{k+1} = a x_k  (mod 2^46)
+!
+!   where 0 < x_k < 2^46 and 0 < a < 2^46.  This scheme generates 2^44 numbers
+!   before repeating.  The argument A is the same as 'a' in the above formula,
+!   and X is the same as x_0.  A and X must be odd double precision integers
+!   in the range (1, 2^46).  The N results are placed in Y and are normalized
+!   to be between 0 and 1.  X is updated to contain the new seed, so that
+!   subsequent calls to VRANLC using the same arguments will generate a
+!   continuous sequence.  If N is zero, only initialization is performed, and
+!   the variables X, A and Y are ignored.
+!
+!   This routine is the standard version designed for scalar or RISC systems.
+!   However, it should produce the same results on any single processor
+!   computer with at least 48 mantissa bits in double precision floating point
+!   data.  On 64 bit systems, double precision should be disabled.
+!
+!---------------------------------------------------------------------
+
+      implicit none
+
+      integer i,n
+      double precision y,r23,r46,t23,t46,a,x,t1,t2,t3,t4,a1,a2,x1,x2,z
+      dimension y(*)
+      parameter (r23 = 0.5d0 ** 23, r46 = r23 ** 2, t23 = 2.d0 ** 23,  &
+     &  t46 = t23 ** 2)
+
+
+!---------------------------------------------------------------------
+!   Break A into two parts such that A = 2^23 * A1 + A2.
+!---------------------------------------------------------------------
+      t1 = r23 * a
+      a1 = int (t1)
+      a2 = a - t23 * a1
+
+!---------------------------------------------------------------------
+!   Generate N results.   This loop is not vectorizable.
+!---------------------------------------------------------------------
+      do i = 1, n
+
+!---------------------------------------------------------------------
+!   Break X into two parts such that X = 2^23 * X1 + X2, compute
+!   Z = A1 * X2 + A2 * X1  (mod 2^23), and then
+!   X = 2^23 * Z + A2 * X2  (mod 2^46).
+!---------------------------------------------------------------------
+        t1 = r23 * x
+        x1 = int (t1)
+        x2 = x - t23 * x1
+        t1 = a1 * x2 + a2 * x1
+        t2 = int (r23 * t1)
+        z = t1 - t23 * t2
+        t3 = t23 * z + a2 * x2
+        t4 = int (r46 * t3)
+        x = t3 - t46 * t4
+        y(i) = r46 * x
+      enddo
+
+      return
+      end
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/common/randdpvec.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/common/randdpvec.f90
new file mode 100644
index 000000000..069e8cbe8
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/common/randdpvec.f90
@@ -0,0 +1,186 @@
+!---------------------------------------------------------------------
+      double precision function randlc (x, a)
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!
+!   This routine returns a uniform pseudorandom double precision number in the
+!   range (0, 1) by using the linear congruential generator
+!
+!   x_{k+1} = a x_k  (mod 2^46)
+!
+!   where 0 < x_k < 2^46 and 0 < a < 2^46.  This scheme generates 2^44 numbers
+!   before repeating.  The argument A is the same as 'a' in the above formula,
+!   and X is the same as x_0.  A and X must be odd double precision integers
+!   in the range (1, 2^46).  The returned value RANDLC is normalized to be
+!   between 0 and 1, i.e. RANDLC = 2^(-46) * x_1.  X is updated to contain
+!   the new seed x_1, so that subsequent calls to RANDLC using the same
+!   arguments will generate a continuous sequence.
+!
+!   This routine should produce the same results on any computer with at least
+!   48 mantissa bits in double precision floating point data.  On 64 bit
+!   systems, double precision should be disabled.
+!
+!   David H. Bailey     October 26, 1990
+!
+!---------------------------------------------------------------------
+
+      implicit none
+
+      double precision r23,r46,t23,t46,a,x,t1,t2,t3,t4,a1,a2,x1,x2,z
+      parameter (r23 = 0.5d0 ** 23, r46 = r23 ** 2, t23 = 2.d0 ** 23,  &
+     &  t46 = t23 ** 2)
+
+!---------------------------------------------------------------------
+!   Break A into two parts such that A = 2^23 * A1 + A2.
+!---------------------------------------------------------------------
+      t1 = r23 * a
+      a1 = int (t1)
+      a2 = a - t23 * a1
+
+!---------------------------------------------------------------------
+!   Break X into two parts such that X = 2^23 * X1 + X2, compute
+!   Z = A1 * X2 + A2 * X1  (mod 2^23), and then
+!   X = 2^23 * Z + A2 * X2  (mod 2^46).
+!---------------------------------------------------------------------
+      t1 = r23 * x
+      x1 = int (t1)
+      x2 = x - t23 * x1
+
+
+      t1 = a1 * x2 + a2 * x1
+      t2 = int (r23 * t1)
+      z = t1 - t23 * t2
+      t3 = t23 * z + a2 * x2
+      t4 = int (r46 * t3)
+      x = t3 - t46 * t4
+      randlc = r46 * x
+      return
+      end
+
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine vranlc (n, x, a, y)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+!---------------------------------------------------------------------
+!   This routine generates N uniform pseudorandom double precision numbers in
+!   the range (0, 1) by using the linear congruential generator
+!   
+!   x_{k+1} = a x_k  (mod 2^46)
+!   
+!   where 0 < x_k < 2^46 and 0 < a < 2^46.  This scheme generates 2^44 numbers
+!   before repeating.  The argument A is the same as 'a' in the above formula,
+!   and X is the same as x_0.  A and X must be odd double precision integers
+!   in the range (1, 2^46).  The N results are placed in Y and are normalized
+!   to be between 0 and 1.  X is updated to contain the new seed, so that
+!   subsequent calls to VRANLC using the same arguments will generate a
+!   continuous sequence.
+!   
+!   This routine generates the output sequence in batches of length NV, for
+!   convenience on vector computers.  This routine should produce the same
+!   results on any computer with at least 48 mantissa bits in double precision
+!   floating point data.  On Cray systems, double precision should be disabled.
+!   
+!   David H. Bailey    August 30, 1990
+!---------------------------------------------------------------------
+
+      integer n
+      double precision x, a, y(*)
+      
+      double precision r23, r46, t23, t46
+      integer nv
+      parameter (r23 = 2.d0 ** (-23), r46 = r23 * r23, t23 = 2.d0 ** 23,  &
+     &     t46 = t23 * t23, nv = 64)
+      double precision  xv(nv), t1, t2, t3, t4, an, a1, a2, x1, x2, yy
+      integer n1, i, j
+      external randlc
+      double precision randlc
+
+!---------------------------------------------------------------------
+!     Compute the first NV elements of the sequence using RANDLC.
+!---------------------------------------------------------------------
+      t1 = x
+      n1 = min (n, nv)
+
+      do  i = 1, n1
+         xv(i) = t46 * randlc (t1, a)
+      enddo
+
+!---------------------------------------------------------------------
+!     It is not necessary to compute AN, A1 or A2 unless N is greater than NV.
+!---------------------------------------------------------------------
+      if (n .gt. nv) then
+
+!---------------------------------------------------------------------
+!     Compute AN = A ^ NV (mod 2^46) using successive calls to RANDLC.
+!---------------------------------------------------------------------
+         t1 = a
+         t2 = r46 * a
+
+         do  i = 1, nv - 1
+            t2 = randlc (t1, a)
+         enddo
+
+         an = t46 * t2
+
+!---------------------------------------------------------------------
+!     Break AN into two parts such that AN = 2^23 * A1 + A2.
+!---------------------------------------------------------------------
+         t1 = r23 * an
+         a1 = aint (t1)
+         a2 = an - t23 * a1
+      endif
+
+!---------------------------------------------------------------------
+!     Compute N pseudorandom results in batches of size NV.
+!---------------------------------------------------------------------
+      do  j = 0, n - 1, nv
+         n1 = min (nv, n - j)
+
+!---------------------------------------------------------------------
+!     Compute up to NV results based on the current seed vector XV.
+!---------------------------------------------------------------------
+         do  i = 1, n1
+            y(i+j) = r46 * xv(i)
+         enddo
+
+!---------------------------------------------------------------------
+!     If this is the last pass through the outer (J) loop, it is not
+!     necessary to update the XV vector.
+!---------------------------------------------------------------------
+         if (j + n1 .eq. n) goto 150
+
+!---------------------------------------------------------------------
+!     Update the XV vector by multiplying each element by AN (mod 2^46).
+!---------------------------------------------------------------------
+         do  i = 1, nv
+            t1 = r23 * xv(i)
+            x1 = aint (t1)
+            x2 = xv(i) - t23 * x1
+            t1 = a1 * x2 + a2 * x1
+            t2 = aint (r23 * t1)
+            yy = t1 - t23 * t2
+            t3 = t23 * yy + a2 * x2
+            t4 = aint (r46 * t3)
+            xv(i) = t3 - t46 * t4
+         enddo
+
+      enddo
+
+!---------------------------------------------------------------------
+!     Save the last seed in X so that subsequent calls to VRANLC will generate
+!     a continuous sequence.
+!---------------------------------------------------------------------
+ 150  x = xv(n1)
+
+      return
+      end
+
+!----- end of program ------------------------------------------------
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/common/randi8.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/common/randi8.f90
new file mode 100644
index 000000000..f8932edaf
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/common/randi8.f90
@@ -0,0 +1,67 @@
+      double precision function randlc(x, a)
+
+!---------------------------------------------------------------------
+!
+!   This routine returns a uniform pseudorandom double precision number in the
+!   range (0, 1) by using the linear congruential generator
+!
+!   x_{k+1} = a x_k  (mod 2^46)
+!
+!   where 0 < x_k < 2^46 and 0 < a < 2^46.  This scheme generates 2^44 numbers
+!   before repeating.  The argument A is the same as 'a' in the above formula,
+!   and X is the same as x_0.  A and X must be odd double precision integers
+!   in the range (1, 2^46).  The returned value RANDLC is normalized to be
+!   between 0 and 1, i.e. RANDLC = 2^(-46) * x_1.  X is updated to contain
+!   the new seed x_1, so that subsequent calls to RANDLC using the same
+!   arguments will generate a continuous sequence.
+
+      implicit none
+      double precision x, a
+      integer(kind=8) i246m1, Lx, La
+      double precision d2m46
+
+      parameter(d2m46=0.5d0**46)
+
+      parameter(i246m1=INT(Z'00003FFFFFFFFFFF',8))
+
+      Lx = X
+      La = A
+
+      Lx   = iand(Lx*La,i246m1)
+      randlc = d2m46*dble(Lx)
+      x    = dble(Lx)
+      return
+      end
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+
+      SUBROUTINE VRANLC (N, X, A, Y)
+      implicit none
+      integer n, i
+      double precision x, a, y(*)
+      integer(kind=8) i246m1, Lx, La
+      double precision d2m46
+
+! This doesn't work, because the compiler does the calculation in 32
+! bits and overflows. No standard way (without f90 stuff) to specify
+! that the rhs should be done in 64 bit arithmetic. 
+!      parameter(i246m1=2**46-1)
+
+      parameter(d2m46=0.5d0**46)
+
+      parameter(i246m1=INT(Z'00003FFFFFFFFFFF',8))
+
+      Lx = X
+      La = A
+      do i = 1, N
+         Lx   = iand(Lx*La,i246m1)
+         y(i) = d2m46*dble(Lx)
+      end do
+      x    = dble(Lx)
+
+      return
+      end
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/common/randi8_safe.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/common/randi8_safe.f90
new file mode 100644
index 000000000..ac63a1884
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/common/randi8_safe.f90
@@ -0,0 +1,64 @@
+      double precision function randlc(x, a)
+
+!---------------------------------------------------------------------
+!
+!   This routine returns a uniform pseudorandom double precision number in the
+!   range (0, 1) by using the linear congruential generator
+!
+!   x_{k+1} = a x_k  (mod 2^46)
+!
+!   where 0 < x_k < 2^46 and 0 < a < 2^46.  This scheme generates 2^44 numbers
+!   before repeating.  The argument A is the same as 'a' in the above formula,
+!   and X is the same as x_0.  A and X must be odd double precision integers
+!   in the range (1, 2^46).  The returned value RANDLC is normalized to be
+!   between 0 and 1, i.e. RANDLC = 2^(-46) * x_1.  X is updated to contain
+!   the new seed x_1, so that subsequent calls to RANDLC using the same
+!   arguments will generate a continuous sequence.
+
+      implicit none
+      double precision x, a
+      integer(kind=8) Lx, La, a1, a2, x1, x2, xa
+      double precision d2m46
+      parameter(d2m46=0.5d0**46)
+
+      Lx = x
+      La = A
+      a1 = ibits(La, 23, 23)
+      a2 = ibits(La, 0, 23)
+      x1 = ibits(Lx, 23, 23)
+      x2 = ibits(Lx, 0, 23)
+      xa = ishft(ibits(a1*x2+a2*x1, 0, 23), 23) + a2*x2
+      Lx   = ibits(xa,0, 46)
+      x    = dble(Lx)
+      randlc = d2m46*x
+      return
+      end
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+
+      SUBROUTINE VRANLC (N, X, A, Y)
+      implicit none
+      integer n, i
+      double precision x, a, y(*)
+      integer(kind=8) Lx, La, a1, a2, x1, x2, xa
+      double precision d2m46
+      parameter(d2m46=0.5d0**46)
+
+      Lx = X
+      La = A
+      a1 = ibits(La, 23, 23)
+      a2 = ibits(La, 0, 23)
+      do i = 1, N
+         x1 = ibits(Lx, 23, 23)
+         x2 = ibits(Lx, 0, 23)
+         xa = ishft(ibits(a1*x2+a2*x1, 0, 23), 23) + a2*x2
+         Lx   = ibits(xa,0, 46)
+         y(i) = d2m46*dble(Lx)
+      end do
+      x = dble(Lx)
+      return
+      end
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/common/timers.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/common/timers.f90
new file mode 100644
index 000000000..3a50de942
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/common/timers.f90
@@ -0,0 +1,171 @@
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+      
+      module timers
+
+      double precision start(64), elapsed(64)
+!$omp threadprivate(start, elapsed)
+
+      double precision, external :: elapsed_time
+
+      end module timers
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+      
+      subroutine timer_clear(n)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      use timers
+      implicit none
+
+      integer n
+
+      elapsed(n) = 0.0
+      return
+      end
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine timer_start(n)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      use timers
+      implicit none
+
+      integer n
+
+      start(n) = elapsed_time()
+
+      return
+      end
+      
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine timer_stop(n)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      use timers
+      implicit none
+
+      integer n
+
+      double precision t, now
+
+      now = elapsed_time()
+      t = now - start(n)
+      elapsed(n) = elapsed(n) + t
+
+      return
+      end
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      double precision function timer_read(n)
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      use timers
+      implicit none
+
+      integer n
+      
+      timer_read = elapsed(n)
+
+      return
+      end
+
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      double precision function elapsed_time()
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      implicit none
+!$    external         omp_get_wtime
+!$    double precision omp_get_wtime
+
+      double precision t
+      logical          mp
+
+! ... Use the OpenMP timer if we can (via !$ conditional compilation)
+      mp = .false.
+!$    mp = .true.
+!$    t = omp_get_wtime()
+
+      if (.not.mp) then
+! This function must measure wall clock time, not CPU time. 
+! Since there is no portable timer in Fortran (77)
+! we call a routine compiled in C (though the C source may have
+! to be tweaked). 
+         call wtime(t)
+! The following is not ok for "official" results because it reports
+! CPU time not wall clock time. It may be useful for developing/testing
+! on timeshared Crays, though. 
+!        call second(t)
+      endif
+
+      elapsed_time = t
+
+      return
+      end
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      subroutine check_timer_flag( timeron )
+
+!---------------------------------------------------------------------
+!---------------------------------------------------------------------
+
+      implicit none
+      logical timeron
+
+      integer nc, ios
+      character(len=20) val
+
+      timeron = .false.
+
+! ... Check environment variable "NPB_TIMER_FLAG"
+      call get_environment_variable('NPB_TIMER_FLAG', val, nc, ios)
+      if (ios .eq. 0) then
+         if (nc .le. 0) then
+            timeron = .true.
+         else if (val(1:1) .ge. '1' .and. val(1:1) .le. '9') then
+            timeron = .true.
+         else if (val .eq. 'on' .or. val .eq. 'ON' .or.  &
+     &            val .eq. 'yes' .or. val .eq. 'YES' .or.  &
+     &            val .eq. 'true' .or. val .eq. 'TRUE') then
+            timeron = .true.
+         endif
+
+      else
+
+! ... Check if the "timer.flag" file exists
+         open (unit=2, file='timer.flag', status='old', iostat=ios)
+         if (ios .eq. 0) then
+            close(2)
+            timeron = .true.
+         endif
+
+      endif
+
+      return
+      end
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/common/wtime.c b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/common/wtime.c
new file mode 100644
index 000000000..b5dcdaad5
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/common/wtime.c
@@ -0,0 +1,16 @@
+#include "wtime.h"
+#include <time.h>
+#ifndef DOS
+#include <sys/time.h>
+#endif
+
+void wtime(double *t)
+{
+   /* a generic timer */
+   static int sec = -1;
+   struct timeval tv;
+   gettimeofday(&tv, (void *)0);
+   if (sec < 0) sec = tv.tv_sec;
+   *t = (tv.tv_sec - sec) + 1.0e-6*tv.tv_usec;
+}
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/common/wtime.h b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/common/wtime.h
new file mode 100644
index 000000000..12eb0cb0e
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/common/wtime.h
@@ -0,0 +1,12 @@
+/* C/Fortran interface is different on different machines. 
+ * You may need to tweak this.
+ */
+
+
+#if defined(IBM)
+#define wtime wtime
+#elif defined(CRAY)
+#define wtime WTIME
+#else
+#define wtime wtime_
+#endif
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/common/wtime_sgi64.c b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/common/wtime_sgi64.c
new file mode 100644
index 000000000..d08d50cd3
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/common/wtime_sgi64.c
@@ -0,0 +1,74 @@
+#include <sys/types.h>
+#include <fcntl.h>
+#include <sys/mman.h>
+#include <sys/syssgi.h>
+#include <sys/immu.h>
+#include <errno.h>
+#include <stdio.h>
+
+/* The following works on SGI Power Challenge systems */
+
+typedef unsigned long iotimer_t;
+
+unsigned int cycleval;
+volatile iotimer_t *iotimer_addr, base_counter;
+double resolution;
+
+/* address_t is an integer type big enough to hold an address */
+typedef unsigned long address_t;
+
+
+
+void timer_init() 
+{
+  
+  int fd;
+  char *virt_addr;
+  address_t phys_addr, page_offset, pagemask, pagebase_addr;
+  
+  pagemask = getpagesize() - 1;
+  errno = 0;
+  phys_addr = syssgi(SGI_QUERY_CYCLECNTR, &cycleval);
+  if (errno != 0) {
+    perror("SGI_QUERY_CYCLECNTR");
+    exit(1);
+  }
+  /* page_offset = offset of the physical address within its page */
+  page_offset = phys_addr & pagemask;
+  pagebase_addr = phys_addr - page_offset;
+  fd = open("/dev/mmem", O_RDONLY);
+
+  virt_addr = mmap(0, pagemask, PROT_READ, MAP_PRIVATE, fd, pagebase_addr);
+  virt_addr = virt_addr + page_offset;
+  iotimer_addr = (iotimer_t *)virt_addr;
+  /* cycleval is in picoseconds; scaling by 1.0e-12 gives the resolution in seconds */
+  resolution = 1.0e-12*cycleval; 
+  base_counter = *iotimer_addr;
+}
+
+void wtime_(double *time) 
+{
+  static int initialized = 0;
+  volatile iotimer_t counter_value;
+  if (!initialized) { 
+    timer_init();
+    initialized = 1;
+  }
+  counter_value = *iotimer_addr - base_counter;
+  *time = (double)counter_value * resolution;
+}
+
+
+void wtime(double *time) 
+{
+  static int initialized = 0;
+  volatile iotimer_t counter_value;
+  if (!initialized) { 
+    timer_init();
+    initialized = 1;
+  }
+  counter_value = *iotimer_addr - base_counter;
+  *time = (double)counter_value * resolution;
+}
+
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/config/NAS.samples/README b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/config/NAS.samples/README
new file mode 100644
index 000000000..ae535e95c
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/config/NAS.samples/README
@@ -0,0 +1,7 @@
+This directory contains examples of make.def files that were used 
+by the NPB team in testing the benchmarks on different platforms. 
+They can be used as starting points for make.def files for your 
+own platform, but you may need to tailor them for best performance 
+on your installation. A clean template can be found in directory 
+`config'.
+Some examples of suite.def files are also provided.
\ No newline at end of file
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/config/NAS.samples/make.def_gcc b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/config/NAS.samples/make.def_gcc
new file mode 100644
index 000000000..c75becfa0
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/config/NAS.samples/make.def_gcc
@@ -0,0 +1,161 @@
+#---------------------------------------------------------------------------
+#
+#                SITE- AND/OR PLATFORM-SPECIFIC DEFINITIONS. 
+#
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# Items in this file will need to be changed for each platform.
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# Parallel Fortran:
+#
+# For CG, EP, FT, MG, LU, SP, BT and UA, which are in Fortran, the following 
+# must be defined:
+#
+# FC         - Fortran compiler
+# FFLAGS     - Fortran compilation arguments
+# F_INC      - any -I arguments required for compiling Fortran 
+# FLINK      - Fortran linker
+# FLINKFLAGS - Fortran linker arguments
+# F_LIB      - any -L and -l arguments required for linking Fortran 
+# 
+# compilations are done with $(FC) $(F_INC) $(FFLAGS) or
+#                            $(FC) $(FFLAGS)
+# linking is done with       $(FLINK) $(F_LIB) $(FLINKFLAGS)
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# This is the fortran compiler used for Fortran programs
+#---------------------------------------------------------------------------
+FC = gfortran
+# This links fortran programs; usually the same as ${FC}
+FLINK	= $(FC)
+
+#---------------------------------------------------------------------------
+# These macros are passed to the linker 
+#---------------------------------------------------------------------------
+F_LIB  =
+
+#---------------------------------------------------------------------------
+# These macros are passed to the compiler 
+#---------------------------------------------------------------------------
+F_INC =
+
+#---------------------------------------------------------------------------
+# Global *compile time* flags for Fortran programs
+#---------------------------------------------------------------------------
+FFLAGS	= -O3 -fopenmp
+
+#---------------------------------------------------------------------------
+# Global *link time* flags. Flags for increasing maximum executable 
+# size usually go here. 
+#---------------------------------------------------------------------------
+FLINKFLAGS = $(FFLAGS)
+
+
+#---------------------------------------------------------------------------
+# Parallel C:
+#
+# For IS and DC, which are in C, the following must be defined:
+#
+# CC         - C compiler 
+# CFLAGS     - C compilation arguments
+# C_INC      - any -I arguments required for compiling C 
+# CLINK      - C linker
+# CLINKFLAGS - C linker flags
+# C_LIB      - any -L and -l arguments required for linking C 
+#
+# compilations are done with $(CC) $(C_INC) $(CFLAGS) or
+#                            $(CC) $(CFLAGS)
+# linking is done with       $(CLINK) $(C_LIB) $(CLINKFLAGS)
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# This is the C compiler used for C programs
+#---------------------------------------------------------------------------
+CC = gcc
+# This links C programs; usually the same as ${CC}
+CLINK	= $(CC)
+
+#---------------------------------------------------------------------------
+# These macros are passed to the linker 
+#---------------------------------------------------------------------------
+C_LIB  = -lm
+
+#---------------------------------------------------------------------------
+# These macros are passed to the compiler 
+#---------------------------------------------------------------------------
+C_INC =
+
+#---------------------------------------------------------------------------
+# Global *compile time* flags for C programs
+# DC inspects the following flags (preceded by "-D"):
+#
+# IN_CORE - computes all views and checksums in main memory (if there is 
+# enough memory)
+#
+# VIEW_FILE_OUTPUT - forces DC to write the generated views to disk
+#
+# OPTIMIZATION - turns on some nonstandard DC optimizations
+#
+# _FILE_OFFSET_BITS=64 
+# _LARGEFILE64_SOURCE - are standard compiler flags that allow working with
+# files larger than 2GB.
+#---------------------------------------------------------------------------
+CFLAGS	= -O3 -fopenmp
+
+#---------------------------------------------------------------------------
+# Global *link time* flags. Flags for increasing maximum executable 
+# size usually go here. 
+#---------------------------------------------------------------------------
+CLINKFLAGS = $(CFLAGS)
+
+
+#---------------------------------------------------------------------------
+# Utilities C:
+#
+# This is the C compiler used to compile C utilities.  Flags required by 
+# this compiler go here also; typically there are few flags required; hence 
+# there are no separate macros provided for such flags.
+#---------------------------------------------------------------------------
+UCC	= gcc
+
+
+#---------------------------------------------------------------------------
+# Destination of executables, relative to subdirs of the main directory.
+#---------------------------------------------------------------------------
+BINDIR	= ../bin
+
+
+#---------------------------------------------------------------------------
+# The variable RAND controls which random number generator 
+# is used. It is described in detail in README.install. 
+# Use "randi8" unless there is a reason to use another one. 
+# Other allowed values are "randi8_safe", "randdp" and "randdpvec"
+#---------------------------------------------------------------------------
+RAND   = randi8
+# The following is highly reliable but may be slow:
+# RAND   = randdp
+
+
+#---------------------------------------------------------------------------
+# The variable WTIME is the name of the wtime source code module in the
+# common directory.  
+# For most machines,       use wtime.c
+# For SGI power challenge: use wtime_sgi64.c
+#---------------------------------------------------------------------------
+WTIME  = wtime.c
+
+
+#---------------------------------------------------------------------------
+# Enable if either Cray (not Cray-X1) or IBM: 
+# (no such flag for most machines: see common/wtime.h)
+# This is used by the C compiler to pass the machine name to common/wtime.h,
+# where the C/Fortran binding interface format is determined
+#---------------------------------------------------------------------------
+# MACHINE	=	-DCRAY
+# MACHINE	=	-DIBM
+
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/config/NAS.samples/make.def_gcc_m b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/config/NAS.samples/make.def_gcc_m
new file mode 100644
index 000000000..de81c2eb9
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/config/NAS.samples/make.def_gcc_m
@@ -0,0 +1,161 @@
+#---------------------------------------------------------------------------
+#
+#                SITE- AND/OR PLATFORM-SPECIFIC DEFINITIONS. 
+#
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# Items in this file will need to be changed for each platform.
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# Parallel Fortran:
+#
+# For CG, EP, FT, MG, LU, SP, BT and UA, which are in Fortran, the following 
+# must be defined:
+#
+# FC         - Fortran compiler
+# FFLAGS     - Fortran compilation arguments
+# F_INC      - any -I arguments required for compiling Fortran 
+# FLINK      - Fortran linker
+# FLINKFLAGS - Fortran linker arguments
+# F_LIB      - any -L and -l arguments required for linking Fortran 
+# 
+# compilations are done with $(FC) $(F_INC) $(FFLAGS) or
+#                            $(FC) $(FFLAGS)
+# linking is done with       $(FLINK) $(F_LIB) $(FLINKFLAGS)
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# This is the fortran compiler used for Fortran programs
+#---------------------------------------------------------------------------
+FC = gfortran
+# This links fortran programs; usually the same as ${FC}
+FLINK	= $(FC)
+
+#---------------------------------------------------------------------------
+# These macros are passed to the linker 
+#---------------------------------------------------------------------------
+F_LIB  =
+
+#---------------------------------------------------------------------------
+# These macros are passed to the compiler 
+#---------------------------------------------------------------------------
+F_INC =
+
+#---------------------------------------------------------------------------
+# Global *compile time* flags for Fortran programs
+#---------------------------------------------------------------------------
+FFLAGS	= -O3 -fopenmp -mcmodel=medium
+
+#---------------------------------------------------------------------------
+# Global *link time* flags. Flags for increasing maximum executable 
+# size usually go here. 
+#---------------------------------------------------------------------------
+FLINKFLAGS = $(FFLAGS)
+
+
+#---------------------------------------------------------------------------
+# Parallel C:
+#
+# For IS and DC, which are in C, the following must be defined:
+#
+# CC         - C compiler 
+# CFLAGS     - C compilation arguments
+# C_INC      - any -I arguments required for compiling C 
+# CLINK      - C linker
+# CLINKFLAGS - C linker flags
+# C_LIB      - any -L and -l arguments required for linking C 
+#
+# compilations are done with $(CC) $(C_INC) $(CFLAGS) or
+#                            $(CC) $(CFLAGS)
+# linking is done with       $(CLINK) $(C_LIB) $(CLINKFLAGS)
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# This is the C compiler used for C programs
+#---------------------------------------------------------------------------
+CC = gcc
+# This links C programs; usually the same as ${CC}
+CLINK	= $(CC)
+
+#---------------------------------------------------------------------------
+# These macros are passed to the linker 
+#---------------------------------------------------------------------------
+C_LIB  = -lm
+
+#---------------------------------------------------------------------------
+# These macros are passed to the compiler 
+#---------------------------------------------------------------------------
+C_INC =
+
+#---------------------------------------------------------------------------
+# Global *compile time* flags for C programs
+# DC inspects the following flags (preceded by "-D"):
+#
+# IN_CORE - computes all views and checksums in main memory (if there is 
+# enough memory)
+#
+# VIEW_FILE_OUTPUT - forces DC to write the generated views to disk
+#
+# OPTIMIZATION - turns on some nonstandard DC optimizations
+#
+# _FILE_OFFSET_BITS=64 
+# _LARGEFILE64_SOURCE - are standard compiler flags that allow working with
+# files larger than 2GB.
+#---------------------------------------------------------------------------
+CFLAGS	= -O3 -fopenmp -mcmodel=medium
+
+#---------------------------------------------------------------------------
+# Global *link time* flags. Flags for increasing maximum executable 
+# size usually go here. 
+#---------------------------------------------------------------------------
+CLINKFLAGS = $(CFLAGS)
+
+
+#---------------------------------------------------------------------------
+# Utilities C:
+#
+# This is the C compiler used to compile C utilities.  Flags required by 
+# this compiler go here also; typically there are few flags required; hence 
+# there are no separate macros provided for such flags.
+#---------------------------------------------------------------------------
+UCC	= gcc
+
+
+#---------------------------------------------------------------------------
+# Destination of executables, relative to subdirs of the main directory.
+#---------------------------------------------------------------------------
+BINDIR	= ../bin
+
+
+#---------------------------------------------------------------------------
+# The variable RAND controls which random number generator 
+# is used. It is described in detail in README.install. 
+# Use "randi8" unless there is a reason to use another one. 
+# Other allowed values are "randi8_safe", "randdp" and "randdpvec"
+#---------------------------------------------------------------------------
+RAND   = randi8
+# The following is highly reliable but may be slow:
+# RAND   = randdp
+
+
+#---------------------------------------------------------------------------
+# The variable WTIME is the name of the wtime source code module in the
+# common directory.  
+# For most machines,       use wtime.c
+# For SGI power challenge: use wtime_sgi64.c
+#---------------------------------------------------------------------------
+WTIME  = wtime.c
+
+
+#---------------------------------------------------------------------------
+# Enable if either Cray (not Cray-X1) or IBM: 
+# (no such flag for most machines: see common/wtime.h)
+# This is used by the C compiler to pass the machine name to common/wtime.h,
+# where the C/Fortran binding interface format is determined
+#---------------------------------------------------------------------------
+# MACHINE	=	-DCRAY
+# MACHINE	=	-DIBM
+
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/config/NAS.samples/make.def_itc b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/config/NAS.samples/make.def_itc
new file mode 100644
index 000000000..70b78072b
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/config/NAS.samples/make.def_itc
@@ -0,0 +1,161 @@
+#---------------------------------------------------------------------------
+#
+#                SITE- AND/OR PLATFORM-SPECIFIC DEFINITIONS. 
+#
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# Items in this file will need to be changed for each platform.
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# Parallel Fortran:
+#
+# For CG, EP, FT, MG, LU, SP, BT and UA, which are in Fortran, the following 
+# must be defined:
+#
+# FC         - Fortran compiler
+# FFLAGS     - Fortran compilation arguments
+# F_INC      - any -I arguments required for compiling Fortran 
+# FLINK      - Fortran linker
+# FLINKFLAGS - Fortran linker arguments
+# F_LIB      - any -L and -l arguments required for linking Fortran 
+# 
+# compilations are done with $(FC) $(F_INC) $(FFLAGS) or
+#                            $(FC) $(FFLAGS)
+# linking is done with       $(FLINK) $(F_LIB) $(FLINKFLAGS)
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# This is the Fortran compiler used for Fortran programs
+#---------------------------------------------------------------------------
+FC = ifort
+# This links Fortran programs; usually the same as ${FC}
+FLINK	= $(FC)
+
+#---------------------------------------------------------------------------
+# These macros are passed to the linker 
+#---------------------------------------------------------------------------
+F_LIB  =
+
+#---------------------------------------------------------------------------
+# These macros are passed to the compiler 
+#---------------------------------------------------------------------------
+F_INC =
+
+#---------------------------------------------------------------------------
+# Global *compile time* flags for Fortran programs
+#---------------------------------------------------------------------------
+FFLAGS	= -O3 -qopenmp
+
+#---------------------------------------------------------------------------
+# Global *link time* flags. Flags for increasing maximum executable 
+# size usually go here. 
+#---------------------------------------------------------------------------
+FLINKFLAGS = $(FFLAGS)
+
+
+#---------------------------------------------------------------------------
+# Parallel C:
+#
+# For IS and DC, which are in C, the following must be defined:
+#
+# CC         - C compiler 
+# CFLAGS     - C compilation arguments
+# C_INC      - any -I arguments required for compiling C 
+# CLINK      - C linker
+# CLINKFLAGS - C linker flags
+# C_LIB      - any -L and -l arguments required for linking C 
+#
+# compilations are done with $(CC) $(C_INC) $(CFLAGS) or
+#                            $(CC) $(CFLAGS)
+# linking is done with       $(CLINK) $(C_LIB) $(CLINKFLAGS)
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# This is the C compiler used for C programs
+#---------------------------------------------------------------------------
+CC = icc
+# This links C programs; usually the same as ${CC}
+CLINK	= $(CC)
+
+#---------------------------------------------------------------------------
+# These macros are passed to the linker 
+#---------------------------------------------------------------------------
+C_LIB  = -lm
+
+#---------------------------------------------------------------------------
+# These macros are passed to the compiler 
+#---------------------------------------------------------------------------
+C_INC =
+
+#---------------------------------------------------------------------------
+# Global *compile time* flags for C programs
+# DC inspects the following flags (preceded by "-D"):
+#
+# IN_CORE - computes all views and checksums in main memory (if there is 
+# enough memory)
+#
+# VIEW_FILE_OUTPUT - forces DC to write the generated views to disk
+#
+# OPTIMIZATION - turns on some nonstandard DC optimizations
+#
+# _FILE_OFFSET_BITS=64 
+# _LARGEFILE64_SOURCE - are standard compiler flags that allow working with
+# files larger than 2GB.
+#---------------------------------------------------------------------------
+CFLAGS	= -O3 -qopenmp
+
+#---------------------------------------------------------------------------
+# Global *link time* flags. Flags for increasing maximum executable 
+# size usually go here. 
+#---------------------------------------------------------------------------
+CLINKFLAGS = $(CFLAGS)
+
+
+#---------------------------------------------------------------------------
+# Utilities C:
+#
+# This is the C compiler used to compile C utilities.  Flags required by 
+# this compiler go here also; typically there are few flags required; hence 
+# there are no separate macros provided for such flags.
+#---------------------------------------------------------------------------
+UCC	= gcc
+
+
+#---------------------------------------------------------------------------
+# Destination of executables, relative to subdirs of the main directory.
+#---------------------------------------------------------------------------
+BINDIR	= ../bin
+
+
+#---------------------------------------------------------------------------
+# The variable RAND controls which random number generator 
+# is used. It is described in detail in README.install. 
+# Use "randi8" unless there is a reason to use another one. 
+# Other allowed values are "randi8_safe", "randdp" and "randdpvec"
+#---------------------------------------------------------------------------
+RAND   = randi8
+# The following is highly reliable but may be slow:
+# RAND   = randdp
+
+
+#---------------------------------------------------------------------------
+# The variable WTIME is the name of the wtime source code module in the
+# common directory.  
+# For most machines,       use wtime.c
+# For SGI power challenge: use wtime_sgi64.c
+#---------------------------------------------------------------------------
+WTIME  = wtime.c
+
+
+#---------------------------------------------------------------------------
+# Enable if either Cray (not Cray-X1) or IBM: 
+# (no such flag for most machines: see common/wtime.h)
+# This is used by the C compiler to pass the machine name to common/wtime.h,
+# where the C/Fortran binding interface format is determined
+#---------------------------------------------------------------------------
+# MACHINE	=	-DCRAY
+# MACHINE	=	-DIBM
+
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/config/NAS.samples/make.def_itc_p b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/config/NAS.samples/make.def_itc_p
new file mode 100644
index 000000000..e86b2ba6d
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/config/NAS.samples/make.def_itc_p
@@ -0,0 +1,161 @@
+#---------------------------------------------------------------------------
+#
+#                SITE- AND/OR PLATFORM-SPECIFIC DEFINITIONS. 
+#
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# Items in this file will need to be changed for each platform.
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# Parallel Fortran:
+#
+# For CG, EP, FT, MG, LU, SP, BT and UA, which are in Fortran, the following 
+# must be defined:
+#
+# FC         - Fortran compiler
+# FFLAGS     - Fortran compilation arguments
+# F_INC      - any -I arguments required for compiling Fortran 
+# FLINK      - Fortran linker
+# FLINKFLAGS - Fortran linker arguments
+# F_LIB      - any -L and -l arguments required for linking Fortran 
+# 
+# compilations are done with $(FC) $(F_INC) $(FFLAGS) or
+#                            $(FC) $(FFLAGS)
+# linking is done with       $(FLINK) $(F_LIB) $(FLINKFLAGS)
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# This is the Fortran compiler used for Fortran programs
+#---------------------------------------------------------------------------
+FC = ifort
+# This links Fortran programs; usually the same as ${FC}
+FLINK	= $(FC)
+
+#---------------------------------------------------------------------------
+# These macros are passed to the linker 
+#---------------------------------------------------------------------------
+F_LIB  =
+
+#---------------------------------------------------------------------------
+# These macros are passed to the compiler 
+#---------------------------------------------------------------------------
+F_INC =
+
+#---------------------------------------------------------------------------
+# Global *compile time* flags for Fortran programs
+#---------------------------------------------------------------------------
+FFLAGS	= -O3 -openmp
+
+#---------------------------------------------------------------------------
+# Global *link time* flags. Flags for increasing maximum executable 
+# size usually go here. 
+#---------------------------------------------------------------------------
+FLINKFLAGS = $(FFLAGS)
+
+
+#---------------------------------------------------------------------------
+# Parallel C:
+#
+# For IS and DC, which are in C, the following must be defined:
+#
+# CC         - C compiler 
+# CFLAGS     - C compilation arguments
+# C_INC      - any -I arguments required for compiling C 
+# CLINK      - C linker
+# CLINKFLAGS - C linker flags
+# C_LIB      - any -L and -l arguments required for linking C 
+#
+# compilations are done with $(CC) $(C_INC) $(CFLAGS) or
+#                            $(CC) $(CFLAGS)
+# linking is done with       $(CLINK) $(C_LIB) $(CLINKFLAGS)
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# This is the C compiler used for C programs
+#---------------------------------------------------------------------------
+CC = icc
+# This links C programs; usually the same as ${CC}
+CLINK	= $(CC)
+
+#---------------------------------------------------------------------------
+# These macros are passed to the linker 
+#---------------------------------------------------------------------------
+C_LIB  = -lm
+
+#---------------------------------------------------------------------------
+# These macros are passed to the compiler 
+#---------------------------------------------------------------------------
+C_INC =
+
+#---------------------------------------------------------------------------
+# Global *compile time* flags for C programs
+# DC inspects the following flags (preceded by "-D"):
+#
+# IN_CORE - computes all views and checksums in main memory (if there is 
+# enough memory)
+#
+# VIEW_FILE_OUTPUT - forces DC to write the generated views to disk
+#
+# OPTIMIZATION - turns on some nonstandard DC optimizations
+#
+# _FILE_OFFSET_BITS=64 
+# _LARGEFILE64_SOURCE - are standard compiler flags that allow working with
+# files larger than 2GB.
+#---------------------------------------------------------------------------
+CFLAGS	= -O3 -openmp
+
+#---------------------------------------------------------------------------
+# Global *link time* flags. Flags for increasing maximum executable 
+# size usually go here. 
+#---------------------------------------------------------------------------
+CLINKFLAGS = $(CFLAGS)
+
+
+#---------------------------------------------------------------------------
+# Utilities C:
+#
+# This is the C compiler used to compile C utilities.  Flags required by 
+# this compiler go here also; typically there are few flags required; hence 
+# there are no separate macros provided for such flags.
+#---------------------------------------------------------------------------
+UCC	= gcc
+
+
+#---------------------------------------------------------------------------
+# Destination of executables, relative to subdirs of the main directory.
+#---------------------------------------------------------------------------
+BINDIR	= ../bin
+
+
+#---------------------------------------------------------------------------
+# The variable RAND controls which random number generator 
+# is used. It is described in detail in README.install. 
+# Use "randi8" unless there is a reason to use another one. 
+# Other allowed values are "randi8_safe", "randdp" and "randdpvec"
+#---------------------------------------------------------------------------
+RAND   = randi8
+# The following is highly reliable but may be slow:
+# RAND   = randdp
+
+
+#---------------------------------------------------------------------------
+# The variable WTIME is the name of the wtime source code module in the
+# common directory.  
+# For most machines,       use wtime.c
+# For SGI power challenge: use wtime_sgi64.c
+#---------------------------------------------------------------------------
+WTIME  = wtime.c
+
+
+#---------------------------------------------------------------------------
+# Enable if either Cray (not Cray-X1) or IBM: 
+# (no such flag for most machines: see common/wtime.h)
+# This is used by the C compiler to pass the machine name to common/wtime.h,
+# where the C/Fortran binding interface format is determined
+#---------------------------------------------------------------------------
+# MACHINE	=	-DCRAY
+# MACHINE	=	-DIBM
+
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/config/NAS.samples/make.def_pgi b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/config/NAS.samples/make.def_pgi
new file mode 100644
index 000000000..0408fed7a
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/config/NAS.samples/make.def_pgi
@@ -0,0 +1,161 @@
+#---------------------------------------------------------------------------
+#
+#                SITE- AND/OR PLATFORM-SPECIFIC DEFINITIONS. 
+#
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# Items in this file will need to be changed for each platform.
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# Parallel Fortran:
+#
+# For CG, EP, FT, MG, LU, SP, BT and UA, which are in Fortran, the following 
+# must be defined:
+#
+# FC         - Fortran compiler
+# FFLAGS     - Fortran compilation arguments
+# F_INC      - any -I arguments required for compiling Fortran 
+# FLINK      - Fortran linker
+# FLINKFLAGS - Fortran linker arguments
+# F_LIB      - any -L and -l arguments required for linking Fortran 
+# 
+# compilations are done with $(FC) $(F_INC) $(FFLAGS) or
+#                            $(FC) $(FFLAGS)
+# linking is done with       $(FLINK) $(F_LIB) $(FLINKFLAGS)
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# This is the Fortran compiler used for Fortran programs
+#---------------------------------------------------------------------------
+FC = pgf90
+# This links Fortran programs; usually the same as ${FC}
+FLINK	= $(FC)
+
+#---------------------------------------------------------------------------
+# These macros are passed to the linker 
+#---------------------------------------------------------------------------
+F_LIB  =
+
+#---------------------------------------------------------------------------
+# These macros are passed to the compiler 
+#---------------------------------------------------------------------------
+F_INC =
+
+#---------------------------------------------------------------------------
+# Global *compile time* flags for Fortran programs
+#---------------------------------------------------------------------------
+FFLAGS	= -fastsse -mp
+
+#---------------------------------------------------------------------------
+# Global *link time* flags. Flags for increasing maximum executable 
+# size usually go here. 
+#---------------------------------------------------------------------------
+FLINKFLAGS = $(FFLAGS)
+
+
+#---------------------------------------------------------------------------
+# Parallel C:
+#
+# For IS and DC, which are in C, the following must be defined:
+#
+# CC         - C compiler 
+# CFLAGS     - C compilation arguments
+# C_INC      - any -I arguments required for compiling C 
+# CLINK      - C linker
+# CLINKFLAGS - C linker flags
+# C_LIB      - any -L and -l arguments required for linking C 
+#
+# compilations are done with $(CC) $(C_INC) $(CFLAGS) or
+#                            $(CC) $(CFLAGS)
+# linking is done with       $(CLINK) $(C_LIB) $(CLINKFLAGS)
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# This is the C compiler used for C programs
+#---------------------------------------------------------------------------
+CC = pgcc
+# This links C programs; usually the same as ${CC}
+CLINK	= $(CC)
+
+#---------------------------------------------------------------------------
+# These macros are passed to the linker 
+#---------------------------------------------------------------------------
+C_LIB  = -lm
+
+#---------------------------------------------------------------------------
+# These macros are passed to the compiler 
+#---------------------------------------------------------------------------
+C_INC =
+
+#---------------------------------------------------------------------------
+# Global *compile time* flags for C programs
+# DC inspects the following flags (preceded by "-D"):
+#
+# IN_CORE - computes all views and checksums in main memory (if there is 
+# enough memory)
+#
+# VIEW_FILE_OUTPUT - forces DC to write the generated views to disk
+#
+# OPTIMIZATION - turns on some nonstandard DC optimizations
+#
+# _FILE_OFFSET_BITS=64 
+# _LARGEFILE64_SOURCE - are standard compiler flags that allow working with
+# files larger than 2GB.
+#---------------------------------------------------------------------------
+CFLAGS	= -fastsse -mp
+
+#---------------------------------------------------------------------------
+# Global *link time* flags. Flags for increasing maximum executable 
+# size usually go here. 
+#---------------------------------------------------------------------------
+CLINKFLAGS = $(CFLAGS)
+
+
+#---------------------------------------------------------------------------
+# Utilities C:
+#
+# This is the C compiler used to compile C utilities.  Flags required by 
+# this compiler go here also; typically there are few flags required; hence 
+# there are no separate macros provided for such flags.
+#---------------------------------------------------------------------------
+UCC	= gcc
+
+
+#---------------------------------------------------------------------------
+# Destination of executables, relative to subdirs of the main directory.
+#---------------------------------------------------------------------------
+BINDIR	= ../bin
+
+
+#---------------------------------------------------------------------------
+# The variable RAND controls which random number generator 
+# is used. It is described in detail in README.install. 
+# Use "randi8" unless there is a reason to use another one. 
+# Other allowed values are "randi8_safe", "randdp" and "randdpvec"
+#---------------------------------------------------------------------------
+RAND   = randi8
+# The following is highly reliable but may be slow:
+# RAND   = randdp
+
+
+#---------------------------------------------------------------------------
+# The variable WTIME is the name of the wtime source code module in the
+# common directory.  
+# For most machines,       use wtime.c
+# For SGI power challenge: use wtime_sgi64.c
+#---------------------------------------------------------------------------
+WTIME  = wtime.c
+
+
+#---------------------------------------------------------------------------
+# Enable if either Cray (not Cray-X1) or IBM: 
+# (no such flag for most machines: see common/wtime.h)
+# This is used by the C compiler to pass the machine name to common/wtime.h,
+# where the C/Fortran binding interface format is determined
+#---------------------------------------------------------------------------
+# MACHINE	=	-DCRAY
+# MACHINE	=	-DIBM
+
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/config/NAS.samples/make.def_sun b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/config/NAS.samples/make.def_sun
new file mode 100644
index 000000000..76a8af0e7
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/config/NAS.samples/make.def_sun
@@ -0,0 +1,161 @@
+#---------------------------------------------------------------------------
+#
+#                SITE- AND/OR PLATFORM-SPECIFIC DEFINITIONS. 
+#
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# Items in this file will need to be changed for each platform.
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# Parallel Fortran:
+#
+# For CG, EP, FT, MG, LU, SP, BT and UA, which are in Fortran, the following 
+# must be defined:
+#
+# FC         - Fortran compiler
+# FFLAGS     - Fortran compilation arguments
+# F_INC      - any -I arguments required for compiling Fortran 
+# FLINK      - Fortran linker
+# FLINKFLAGS - Fortran linker arguments
+# F_LIB      - any -L and -l arguments required for linking Fortran 
+# 
+# compilations are done with $(FC) $(F_INC) $(FFLAGS) or
+#                            $(FC) $(FFLAGS)
+# linking is done with       $(FLINK) $(F_LIB) $(FLINKFLAGS)
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# This is the Fortran compiler used for Fortran programs
+#---------------------------------------------------------------------------
+FC = f90 -xarch=sse4_2 -m64
+# This links Fortran programs; usually the same as ${FC}
+FLINK	= $(FC)
+
+#---------------------------------------------------------------------------
+# These macros are passed to the linker 
+#---------------------------------------------------------------------------
+F_LIB  =
+
+#---------------------------------------------------------------------------
+# These macros are passed to the compiler 
+#---------------------------------------------------------------------------
+F_INC =
+
+#---------------------------------------------------------------------------
+# Global *compile time* flags for Fortran programs
+#---------------------------------------------------------------------------
+FFLAGS	= -fast -xopenmp 
+
+#---------------------------------------------------------------------------
+# Global *link time* flags. Flags for increasing maximum executable 
+# size usually go here. 
+#---------------------------------------------------------------------------
+FLINKFLAGS = $(FFLAGS)
+
+
+#---------------------------------------------------------------------------
+# Parallel C:
+#
+# For IS and DC, which are in C, the following must be defined:
+#
+# CC         - C compiler 
+# CFLAGS     - C compilation arguments
+# C_INC      - any -I arguments required for compiling C 
+# CLINK      - C linker
+# CLINKFLAGS - C linker flags
+# C_LIB      - any -L and -l arguments required for linking C 
+#
+# compilations are done with $(CC) $(C_INC) $(CFLAGS) or
+#                            $(CC) $(CFLAGS)
+# linking is done with       $(CLINK) $(C_LIB) $(CLINKFLAGS)
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# This is the C compiler used for C programs
+#---------------------------------------------------------------------------
+CC = cc -xarch=sse4_2 -m64
+# This links C programs; usually the same as ${CC}
+CLINK	= $(CC)
+
+#---------------------------------------------------------------------------
+# These macros are passed to the linker 
+#---------------------------------------------------------------------------
+C_LIB  = -lm
+
+#---------------------------------------------------------------------------
+# These macros are passed to the compiler 
+#---------------------------------------------------------------------------
+C_INC =
+
+#---------------------------------------------------------------------------
+# Global *compile time* flags for C programs
+# DC inspects the following flags (preceded by "-D"):
+#
+# IN_CORE - computes all views and checksums in main memory (if there is 
+# enough memory)
+#
+# VIEW_FILE_OUTPUT - forces DC to write the generated views to disk
+#
+# OPTIMIZATION - turns on some nonstandard DC optimizations
+#
+# _FILE_OFFSET_BITS=64 
+# _LARGEFILE64_SOURCE - are standard compiler flags that allow working with
+# files larger than 2GB.
+#---------------------------------------------------------------------------
+CFLAGS	= -fast -xopenmp
+
+#---------------------------------------------------------------------------
+# Global *link time* flags. Flags for increasing maximum executable 
+# size usually go here. 
+#---------------------------------------------------------------------------
+CLINKFLAGS = $(CFLAGS)
+
+
+#---------------------------------------------------------------------------
+# Utilities C:
+#
+# This is the C compiler used to compile C utilities.  Flags required by 
+# this compiler go here also; typically there are few flags required; hence 
+# there are no separate macros provided for such flags.
+#---------------------------------------------------------------------------
+UCC	= gcc
+
+
+#---------------------------------------------------------------------------
+# Destination of executables, relative to subdirs of the main directory.
+#---------------------------------------------------------------------------
+BINDIR	= ../bin
+
+
+#---------------------------------------------------------------------------
+# The variable RAND controls which random number generator 
+# is used. It is described in detail in README.install. 
+# Use "randi8" unless there is a reason to use another one. 
+# Other allowed values are "randi8_safe", "randdp" and "randdpvec"
+#---------------------------------------------------------------------------
+RAND   = randi8
+# The following is highly reliable but may be slow:
+# RAND   = randdp
+
+
+#---------------------------------------------------------------------------
+# The variable WTIME is the name of the wtime source code module in the
+# NPB2.x/common directory.  
+# For most machines,       use wtime.c
+# For SGI power challenge: use wtime_sgi64.c
+#---------------------------------------------------------------------------
+WTIME  = wtime.c
+
+
+#---------------------------------------------------------------------------
+# Enable if either Cray (not Cray-X1) or IBM: 
+# (no such flag for most machines: see common/wtime.h)
+# This is used by the C compiler to pass the machine name to common/wtime.h,
+# where the C/Fortran binding interface format is determined
+#---------------------------------------------------------------------------
+# MACHINE	=	-DCRAY
+# MACHINE	=	-DIBM
+
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/config/NAS.samples/suite.def.bt b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/config/NAS.samples/suite.def.bt
new file mode 100644
index 000000000..66d59b01b
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/config/NAS.samples/suite.def.bt
@@ -0,0 +1,6 @@
+bt	S
+bt	W
+bt	A
+bt	B
+bt	C
+bt	D
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/config/NAS.samples/suite.def.cg b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/config/NAS.samples/suite.def.cg
new file mode 100644
index 000000000..c96081769
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/config/NAS.samples/suite.def.cg
@@ -0,0 +1,6 @@
+cg	S
+cg	W
+cg	A
+cg	B
+cg	C
+cg	D
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/config/NAS.samples/suite.def.ep b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/config/NAS.samples/suite.def.ep
new file mode 100644
index 000000000..a0491d38c
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/config/NAS.samples/suite.def.ep
@@ -0,0 +1,6 @@
+ep	S
+ep	W
+ep	A
+ep	B
+ep	C
+ep	D
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/config/NAS.samples/suite.def.ft b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/config/NAS.samples/suite.def.ft
new file mode 100644
index 000000000..100ae4f9f
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/config/NAS.samples/suite.def.ft
@@ -0,0 +1,6 @@
+ft	S
+ft	W
+ft	A
+ft	B
+ft	C
+ft	D
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/config/NAS.samples/suite.def.is b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/config/NAS.samples/suite.def.is
new file mode 100644
index 000000000..3a0b05d9e
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/config/NAS.samples/suite.def.is
@@ -0,0 +1,5 @@
+is	S
+is	W
+is	A
+is	B
+is	C
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/config/NAS.samples/suite.def.lu b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/config/NAS.samples/suite.def.lu
new file mode 100644
index 000000000..583de7ee0
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/config/NAS.samples/suite.def.lu
@@ -0,0 +1,6 @@
+lu	S
+lu	W
+lu	A
+lu	B
+lu	C
+lu	D
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/config/NAS.samples/suite.def.mg b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/config/NAS.samples/suite.def.mg
new file mode 100644
index 000000000..1df86a902
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/config/NAS.samples/suite.def.mg
@@ -0,0 +1,6 @@
+mg	S
+mg	W
+mg	A
+mg	B
+mg	C
+mg	D
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/config/NAS.samples/suite.def.sp b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/config/NAS.samples/suite.def.sp
new file mode 100644
index 000000000..8b5a9ba66
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/config/NAS.samples/suite.def.sp
@@ -0,0 +1,6 @@
+sp	S
+sp	W
+sp	A
+sp	B
+sp	C
+sp	D
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/config/make.def b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/config/make.def
new file mode 100644
index 000000000..57cfc7e50
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/config/make.def
@@ -0,0 +1,168 @@
+#---------------------------------------------------------------------------
+#
+#                SITE- AND/OR PLATFORM-SPECIFIC DEFINITIONS.
+#
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+#                This config is specific for gem5.
+#---------------------------------------------------------------------------
+
+
+#---------------------------------------------------------------------------
+# Items in this file will need to be changed for each platform.
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# Parallel Fortran:
+#
+# For CG, EP, FT, MG, LU, SP, BT and UA, which are in Fortran, the following
+# must be defined:
+#
+# FC         - Fortran compiler
+# FFLAGS     - Fortran compilation arguments
+# F_INC      - any -I arguments required for compiling Fortran
+# FLINK      - Fortran linker
+# FLINKFLAGS - Fortran linker arguments
+# F_LIB      - any -L and -l arguments required for linking Fortran
+#
+# compilations are done with $(FC) $(F_INC) $(FFLAGS) or
+#                            $(FC) $(FFLAGS)
+# linking is done with       $(FLINK) $(F_LIB) $(FLINKFLAGS)
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# This is the fortran compiler used for Fortran programs
+#---------------------------------------------------------------------------
+FC = gfortran
+# This links fortran programs; usually the same as ${FC}
+FLINK	= $(FC)
+
+#---------------------------------------------------------------------------
+# These macros are passed to the linker
+#---------------------------------------------------------------------------
+F_LIB  =  -lm5 -lm
+
+#---------------------------------------------------------------------------
+# These macros are passed to the compiler
+#---------------------------------------------------------------------------
+F_INC =
+
+#---------------------------------------------------------------------------
+# Global *compile time* flags for Fortran programs
+#---------------------------------------------------------------------------
+FFLAGS	= -O3 -fopenmp -cpp -no-pie -DM5_ANNOTATION
+
+#---------------------------------------------------------------------------
+# Global *link time* flags. Flags for increasing maximum executable
+# size usually go here.
+#---------------------------------------------------------------------------
+# Using no-pie here as m5 is compiled with the no-pie flag and cannot be built
+# as a position-independent executable.
+FLINKFLAGS = -O3 -fopenmp -no-pie
+
+
+#---------------------------------------------------------------------------
+# Parallel C:
+#
+# For IS and DC, which are in C, the following must be defined:
+#
+# CC         - C compiler
+# CFLAGS     - C compilation arguments
+# C_INC      - any -I arguments required for compiling C
+# CLINK      - C linker
+# CLINKFLAGS - C linker flags
+# C_LIB      - any -L and -l arguments required for linking C
+#
+# compilations are done with $(CC) $(C_INC) $(CFLAGS) or
+#                            $(CC) $(CFLAGS)
+# linking is done with       $(CLINK) $(C_LIB) $(CLINKFLAGS)
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# This is the C compiler used for C programs
+#---------------------------------------------------------------------------
+CC = gcc
+# This links C programs; usually the same as ${CC}
+CLINK	= $(CC)
+
+#---------------------------------------------------------------------------
+# These macros are passed to the linker
+#---------------------------------------------------------------------------
+C_LIB  = -lm5 -lm
+
+#---------------------------------------------------------------------------
+# These macros are passed to the compiler
+#---------------------------------------------------------------------------
+C_INC =
+
+#---------------------------------------------------------------------------
+# Global *compile time* flags for C programs
+# DC inspects the following flags (preceded by "-D"):
+#
+# IN_CORE - computes all views and checksums in main memory (if there is
+# enough memory)
+#
+# VIEW_FILE_OUTPUT - forces DC to write the generated views to disk
+#
+# OPTIMIZATION - turns on some nonstandard DC optimizations
+#
+# _FILE_OFFSET_BITS=64
+# _LARGEFILE64_SOURCE - are standard compiler flags that allow working with
+# files larger than 2GB.
+#---------------------------------------------------------------------------
+CFLAGS	= -O3 -fopenmp -no-pie -DM5_ANNOTATION
+
+#---------------------------------------------------------------------------
+# Global *link time* flags. Flags for increasing maximum executable
+# size usually go here.
+#---------------------------------------------------------------------------
+# Using no-pie here as m5 is compiled with the no-pie flag and cannot be built
+# as a position-independent executable.
+CLINKFLAGS = -O3 -fopenmp -no-pie
+
+
+#---------------------------------------------------------------------------
+# Utilities C:
+#
+# This is the C compiler used to compile C utilities.  Flags required by
+# this compiler go here also; typically there are few flags required; hence
+# there are no separate macros provided for such flags.
+#---------------------------------------------------------------------------
+UCC	= gcc
+
+
+#---------------------------------------------------------------------------
+# Destination of executables, relative to subdirs of the main directory.
+#---------------------------------------------------------------------------
+BINDIR	= ../bin
+
+
+#---------------------------------------------------------------------------
+# The variable RAND controls which random number generator
+# is used. It is described in detail in README.install.
+# Use "randi8" unless there is a reason to use another one.
+# Other allowed values are "randi8_safe", "randdp" and "randdpvec"
+#---------------------------------------------------------------------------
+RAND   = randi8
+# The following is highly reliable but may be slow:
+# RAND   = randdp
+
+
+#---------------------------------------------------------------------------
+# The variable WTIME is the name of the wtime source code module in the
+# common directory.
+# For most machines,       use wtime.c
+# For SGI power challenge: use wtime_sgi64.c
+#---------------------------------------------------------------------------
+WTIME  = wtime.c
+
+
+#---------------------------------------------------------------------------
+# Enable if either Cray (not Cray-X1) or IBM:
+# (no such flag for most machines: see common/wtime.h)
+# This is used by the C compiler to pass the machine name to common/wtime.h,
+# where the C/Fortran binding interface format is determined
+#---------------------------------------------------------------------------
+# MACHINE	=	-DCRAY
+# MACHINE	=	-DIBM
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/config/make.def.template b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/config/make.def.template
new file mode 100644
index 000000000..c75becfa0
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/config/make.def.template
@@ -0,0 +1,161 @@
+#---------------------------------------------------------------------------
+#
+#                SITE- AND/OR PLATFORM-SPECIFIC DEFINITIONS. 
+#
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# Items in this file will need to be changed for each platform.
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# Parallel Fortran:
+#
+# For CG, EP, FT, MG, LU, SP, BT and UA, which are in Fortran, the following 
+# must be defined:
+#
+# FC         - Fortran compiler
+# FFLAGS     - Fortran compilation arguments
+# F_INC      - any -I arguments required for compiling Fortran 
+# FLINK      - Fortran linker
+# FLINKFLAGS - Fortran linker arguments
+# F_LIB      - any -L and -l arguments required for linking Fortran 
+# 
+# compilations are done with $(FC) $(F_INC) $(FFLAGS) or
+#                            $(FC) $(FFLAGS)
+# linking is done with       $(FLINK) $(F_LIB) $(FLINKFLAGS)
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# This is the fortran compiler used for Fortran programs
+#---------------------------------------------------------------------------
+FC = gfortran
+# This links fortran programs; usually the same as ${FC}
+FLINK	= $(FC)
+
+#---------------------------------------------------------------------------
+# These macros are passed to the linker 
+#---------------------------------------------------------------------------
+F_LIB  =
+
+#---------------------------------------------------------------------------
+# These macros are passed to the compiler 
+#---------------------------------------------------------------------------
+F_INC =
+
+#---------------------------------------------------------------------------
+# Global *compile time* flags for Fortran programs
+#---------------------------------------------------------------------------
+FFLAGS	= -O3 -fopenmp
+
+#---------------------------------------------------------------------------
+# Global *link time* flags. Flags for increasing maximum executable 
+# size usually go here. 
+#---------------------------------------------------------------------------
+FLINKFLAGS = $(FFLAGS)
+
+
+#---------------------------------------------------------------------------
+# Parallel C:
+#
+# For IS and DC, which are in C, the following must be defined:
+#
+# CC         - C compiler 
+# CFLAGS     - C compilation arguments
+# C_INC      - any -I arguments required for compiling C 
+# CLINK      - C linker
+# CLINKFLAGS - C linker flags
+# C_LIB      - any -L and -l arguments required for linking C 
+#
+# compilations are done with $(CC) $(C_INC) $(CFLAGS) or
+#                            $(CC) $(CFLAGS)
+# linking is done with       $(CLINK) $(C_LIB) $(CLINKFLAGS)
+#---------------------------------------------------------------------------
+
+#---------------------------------------------------------------------------
+# This is the C compiler used for C programs
+#---------------------------------------------------------------------------
+CC = gcc
+# This links C programs; usually the same as ${CC}
+CLINK	= $(CC)
+
+#---------------------------------------------------------------------------
+# These macros are passed to the linker 
+#---------------------------------------------------------------------------
+C_LIB  = -lm
+
+#---------------------------------------------------------------------------
+# These macros are passed to the compiler 
+#---------------------------------------------------------------------------
+C_INC =
+
+#---------------------------------------------------------------------------
+# Global *compile time* flags for C programs
+# DC inspects the following flags (preceded by "-D"):
+#
+# IN_CORE - computes all views and checksums in main memory (if there is 
+# enough memory)
+#
+# VIEW_FILE_OUTPUT - forces DC to write the generated views to disk
+#
+# OPTIMIZATION - turns on some nonstandard DC optimizations
+#
+# _FILE_OFFSET_BITS=64 
+# _LARGEFILE64_SOURCE - are standard compiler flags that allow working with
+# files larger than 2GB.
+#---------------------------------------------------------------------------
+CFLAGS	= -O3 -fopenmp
+
+#---------------------------------------------------------------------------
+# Global *link time* flags. Flags for increasing maximum executable 
+# size usually go here. 
+#---------------------------------------------------------------------------
+CLINKFLAGS = $(CFLAGS)
+
+
+#---------------------------------------------------------------------------
+# Utilities C:
+#
+# This is the C compiler used to compile C utilities.  Flags required by 
+# this compiler go here also; typically there are few flags required; hence 
+# there are no separate macros provided for such flags.
+#---------------------------------------------------------------------------
+UCC	= gcc
+
+
+#---------------------------------------------------------------------------
+# Destination of executables, relative to subdirs of the main directory.
+#---------------------------------------------------------------------------
+BINDIR	= ../bin
+
+
+#---------------------------------------------------------------------------
+# The variable RAND controls which random number generator 
+# is used. It is described in detail in README.install. 
+# Use "randi8" unless there is a reason to use another one. 
+# Other allowed values are "randi8_safe", "randdp" and "randdpvec"
+#---------------------------------------------------------------------------
+RAND   = randi8
+# The following is highly reliable but may be slow:
+# RAND   = randdp
+
+
+#---------------------------------------------------------------------------
+# The variable WTIME is the name of the wtime source code module in the
+# common directory.  
+# For most machines,       use wtime.c
+# For SGI power challenge: use wtime_sgi64.c
+#---------------------------------------------------------------------------
+WTIME  = wtime.c
+
+
+#---------------------------------------------------------------------------
+# Enable if either Cray (not Cray-X1) or IBM: 
+# (no such flag for most machines: see common/wtime.h)
+# This is used by the C compiler to pass the machine name to common/wtime.h,
+# where the C/Fortran binding interface format is determined
+#---------------------------------------------------------------------------
+# MACHINE	=	-DCRAY
+# MACHINE	=	-DIBM
+
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/config/suite.def b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/config/suite.def
new file mode 100644
index 000000000..c4cfebe80
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/config/suite.def
@@ -0,0 +1,60 @@
+# config/suite.def
+# This file is used to build several benchmarks with a single command.
+# Typing "make suite" in the main directory will build all the benchmarks
+# specified in this file.
+# Each line of this file contains a benchmark name and the class.
+# The name is one of "cg", "is", "ep", "mg", "ft", "sp",
+#  "bt", "lu", and "ua".
+# The class is one of "S", "W", "A" through "E"
+# (except that no class E for IS and UA).
+# No blank lines.
+# The following example builds sample sizes of all benchmarks.
+ft	S
+mg	S
+sp	S
+lu	S
+bt	S
+is	S
+ep	S
+cg	S
+ua	S
+
+ft  A
+mg  A
+sp  A
+lu  A
+bt  A
+is  A
+ep  A
+cg  A
+ua  A
+
+ft  B
+mg  B
+sp  B
+lu  B
+bt  B
+is  B
+ep  B
+cg  B
+ua  B
+
+ft  C
+mg  C
+sp  C
+lu  C
+bt  C
+is  C
+ep  C
+cg  C
+ua  C
+
+ft  D
+mg  D
+sp  D
+lu  D
+bt  D
+is  D
+ep  D
+cg  D
+ua  D
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/config/suite.def.template b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/config/suite.def.template
new file mode 100644
index 000000000..2037b89be
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/config/suite.def.template
@@ -0,0 +1,20 @@
+# config/suite.def
+# This file is used to build several benchmarks with a single command. 
+# Typing "make suite" in the main directory will build all the benchmarks
+# specified in this file. 
+# Each line of this file contains a benchmark name and the class.
+# The name is one of "cg", "is", "ep", "mg", "ft", "sp",
+#  "bt", "lu", and "ua". 
+# The class is one of "S", "W", "A" through "E" 
+# (except that no class E for IS and UA).
+# No blank lines. 
+# The following example builds sample sizes of all benchmarks. 
+ft	S
+mg	S
+sp	S
+lu	S
+bt	S
+is	S
+ep	S
+cg	S
+ua	S
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/sys/Makefile b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/sys/Makefile
new file mode 100644
index 000000000..cf0f508ab
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/sys/Makefile
@@ -0,0 +1,22 @@
+UCC = cc
+include ../config/make.def
+
+# Note that COMPILE is also defined in make.common and should
+# be the same. We can't include make.common because it has a lot
+# of other garbage. 
+FCOMPILE = $(FC) -c $(F_INC) $(FFLAGS)
+
+all: setparams 
+
+# setparams creates an npbparams.h file for each benchmark
+# configuration. npbparams.h also contains info about how a benchmark
+# was compiled and linked
+
+setparams: setparams.c ../config/make.def
+	$(UCC) ${CONVERTFLAG} -o setparams setparams.c
+
+
+clean: 
+	-rm -f setparams setparams.h npbparams.h
+	-rm -f *~ *.o
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/sys/README b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/sys/README
new file mode 100644
index 000000000..ede69b579
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/sys/README
@@ -0,0 +1,41 @@
+This directory contains utilities and files used by the 
+build process. You should not need to change anything
+in this directory. 
+
+Original Files
+--------------
+setparams.c:
+        Source for the setparams program. This program is used internally
+        in the build process to create the file "npbparams.h" for each 
+        benchmark. npbparams.h contains Fortran or C parameters to build a 
+        benchmark for a specific class. The setparams program is never run 
+        directly by a user. Its invocation syntax is 
+
+            "setparams benchmark-name class". 
+
+        It examines the file "npbparams.h" in the current directory. If 
+        the specified parameters are the same as those in the npbparams.h 
+        file, nothing is changed. If the file does not exist or corresponds
+        to a different class/number of nodes, it is (re)built.
+        One of the more complicated things in npbparams.h is that it
+        contains, in a Fortran string, the compiler flags used to build a 
+        benchmark, so that a benchmark can print out how it was compiled. 
+
+make.common
+        A makefile segment that is included in each individual benchmark
+        program makefile. It sets up some standard macros (COMPILE, etc) 
+        and makes sure everything is configured correctly (npbparams.h)
+
+Makefile
+        Builds  setparams
+
+README
+        This file. 
+
+
+Created files
+-------------
+
+setparams
+	See descriptions above
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/sys/make.common b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/sys/make.common
new file mode 100644
index 000000000..36590b398
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/sys/make.common
@@ -0,0 +1,66 @@
+PROGRAM  = $(BINDIR)/$(BENCHMARK)$(VEXT).$(CLASS).x
+FCOMPILE = $(FC) -c $(F_INC) $(FFLAGS)
+CCOMPILE = $(CC) -c $(C_INC) $(CFLAGS)
+
+# Class "U" is used internally by the setparams program to mean
+# "unknown". This means that if you don't specify CLASS=
+# on the command line, you'll get an error. It would be nice
+# to be able to avoid this, but we'd have to get information
+# from the setparams back to the make program, which isn't easy. 
+CLASS=U
+
+default:: ${PROGRAM}
+
+# This makes sure the configuration utility setparams 
+# is up to date. 
+# Note that this must be run every time, which is why the
+# target does not exist and is not created. 
+# If you create a file called "config" you will break things. 
+config:
+	@cd ../sys; ${MAKE} all
+	../sys/setparams ${BENCHMARK} ${CLASS}
+
+COMMON=../common
+${COMMON}/${RAND}.o: ${COMMON}/${RAND}.f90 ../config/make.def
+	cd ${COMMON}; ${FCOMPILE} ${RAND}.f90
+
+${COMMON}/print_results.o: ${COMMON}/print_results.f90 ../config/make.def
+	cd ${COMMON}; ${FCOMPILE} print_results.f90
+
+${COMMON}/c_print_results.o: ${COMMON}/c_print_results.c ../config/make.def
+	cd ${COMMON}; ${CCOMPILE} c_print_results.c
+
+${COMMON}/timers.o: ${COMMON}/timers.f90 ../config/make.def
+	cd ${COMMON}; ${FCOMPILE} timers.f90
+
+${COMMON}/c_timers.o: ${COMMON}/c_timers.c ../config/make.def
+	cd ${COMMON}; ${CCOMPILE} c_timers.c
+
+${COMMON}/wtime.o: ${COMMON}/${WTIME} ../config/make.def
+	cd ${COMMON}; ${CCOMPILE} ${MACHINE} -o wtime.o ${COMMON}/${WTIME}
+
+${COMMON}/hooks.o: ${COMMON}/hooks.c ../config/make.def
+	cd ${COMMON}; ${CCOMPILE} hooks.c
+# For most machines or CRAY or IBM
+#	cd ${COMMON}; ${CCOMPILE} ${MACHINE} ${COMMON}/wtime.c
+# For a precise timer on an SGI Power Challenge, try:
+#	cd ${COMMON}; ${CCOMPILE} -o wtime.o ${COMMON}/wtime_sgi64.c
+
+${COMMON}/c_wtime.o: ${COMMON}/${WTIME} ../config/make.def
+	cd ${COMMON}; ${CCOMPILE} -o c_wtime.o ${COMMON}/${WTIME}
+
+
+# Normally setparams updates npbparams.h only if the settings (CLASS)
+# have changed. However, we also want to update if the compile options
+# may have changed (set in ../config/make.def). 
+npbparams.h: ../config/make.def
+	@ echo make.def modified. Rebuilding npbparams.h just in case
+	rm -f npbparams.h
+	../sys/setparams ${BENCHMARK} ${CLASS}
+
+# So that "make benchmark-name" works
+${BENCHMARK}:  default
+${BENCHMARKU}: default
+
+.SUFFIXES:
+.SUFFIXES: .c .h .f90 .f .o
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/sys/print_header b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/sys/print_header
new file mode 100755
index 000000000..f16383a41
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/sys/print_header
@@ -0,0 +1,6 @@
+echo '   ============================================'
+echo '   =      NAS PARALLEL BENCHMARKS 3.4         ='
+echo '   =      OpenMP Versions                     ='
+echo '   =      Fortran/C                           ='
+echo '   ============================================'
+echo ''
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/sys/print_instructions b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/sys/print_instructions
new file mode 100755
index 000000000..89f591623
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/sys/print_instructions
@@ -0,0 +1,19 @@
+echo ''
+echo '   To make a NAS benchmark type '
+echo ''
+echo '         make <benchmark-name> CLASS=<class>'
+echo ''
+echo '   where <benchmark-name> is "bt", "cg", "ep", "ft", "is", "lu",'
+echo '                             "mg", "sp", "ua", or "dc"'
+echo '         <class>          is "S", "W", "A", "B", "C", "D", "E", or "F"'
+echo ''
+echo '   To make a set of benchmarks, create the file config/suite.def'
+echo '   according to the instructions in config/suite.def.template and type'
+echo ''
+echo '         make suite'
+echo ''
+echo ' ***************************************************************'
+echo ' * Remember to edit the file config/make.def for site specific *'
+echo ' * information as described in the README file                 *'
+echo ' ***************************************************************'
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/sys/setparams.c b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/sys/setparams.c
new file mode 100644
index 000000000..564076a4e
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/sys/setparams.c
@@ -0,0 +1,1058 @@
+/* 
+ * This utility configures a NPB to be built for a specific class. 
+ * It creates a file "npbparams.h" 
+ * in the source directory. This file keeps state information about 
+ * which size of benchmark is currently being built (so that nothing
+ * is unnecessarily rebuilt) and defines (through PARAMETER statements)
+ * the number of nodes and class for which a benchmark is being built. 
+
+ * The utility takes 2 arguments:
+ *       setparams benchmark-name class
+ *    benchmark-name is "sp", "bt", etc
+ *    class is the size of the benchmark
+ * These parameters are checked for the current benchmark. If they
+ * are invalid, this program prints a message and aborts. 
+ * If the parameters are ok, the current npbparams.h (actually just
+ * the first line) is read in. If the new parameters are the same as 
+ * the old, nothing is done, but an exit code is returned to force the
+ * user to specify (otherwise the make procedure succeeds but builds a
+ * binary of the wrong name).  Otherwise the file is rewritten. 
+ * Errors write a message (to stdout) and abort. 
+ * 
+ * This program makes use of two extra benchmark "classes"
+ * class "X" means an invalid specification. It is returned if
+ * there is an error parsing the config file. 
+ * class "U" is an external specification meaning "unknown class"
+ * 
+ * Unfortunately everything has to be case sensitive. This is
+ * because we can always convert lower to upper or v.v. but
+ * can't feed this information back to the makefile, so typing
+ * make CLASS=a and make CLASS=A will produce different binaries.
+ *
+ * 
+ */
+
+#include <sys/types.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <ctype.h>
+#include <string.h>
+#include <time.h>
+
+/*
+ * This is the master version number for this set of 
+ * NPB benchmarks. It is in an obscure place so people
+ * won't accidentally change it. 
+ */
+
+#define VERSION "3.4.2"
+
+/* controls verbose output from setparams */
+/* #define VERBOSE */
+
+#define FILENAME "npbparams.h"
+#define DESC_LINE "! CLASS = %c\n"
+#define DEF_CLASS_LINE     "#define CLASS '%c'\n"
+#define FINDENT  "        "
+#define CONTINUE "     & "
+
+void get_info(char *argv[], int *typep, char *classp);
+void check_info(int type, char class);
+void read_info(int type, char *classp);
+void write_info(int type, char class);
+void write_sp_info(FILE *fp, char class);
+void write_bt_info(FILE *fp, char class);
+void write_dc_info(FILE *fp, char class);
+void write_lu_info(FILE *fp, char class);
+void write_mg_info(FILE *fp, char class);
+void write_cg_info(FILE *fp, char class);
+void write_ft_info(FILE *fp, char class);
+void write_ep_info(FILE *fp, char class);
+void write_is_info(FILE *fp, char class);
+void write_ua_info(FILE *fp, char class);
+void write_compiler_info(int type, FILE *fp);
+void write_convertdouble_info(int type, FILE *fp);
+void check_line(char *line, char *label, char *val);
+int  check_include_line(char *line, char *filename);
+void put_string(FILE *fp, char *name, char *val);
+void put_def_string(FILE *fp, char *name, char *val);
+void put_def_variable(FILE *fp, char *name, char *val);
+int ilog2(int i);
+double power(double base, int i);
+
+enum benchmark_types {SP, BT, LU, MG, FT, IS, EP, CG, UA, DC};
+
+int main(int argc, char *argv[])
+{
+  int type;
+  char class, class_old;
+  
+  if (argc != 3) {
+    printf("Usage: %s benchmark-name class\n", argv[0]);
+    exit(1);
+  }
+
+  /* Get command line arguments. Make sure they're ok. */
+  get_info(argv, &type, &class);
+  if (class != 'U') {
+#ifdef VERBOSE
+    printf("setparams: For benchmark %s: class = %c\n", 
+	   argv[1], class); 
+#endif
+    check_info(type, class);
+  }
+
+  /* Get old information. */
+  read_info(type, &class_old);
+  if (class != 'U') {
+    if (class_old != 'X') {
+#ifdef VERBOSE
+      printf("setparams:     old settings: class = %c\n", 
+	     class_old); 
+#endif
+    }
+  } else {
+    printf("setparams:\n\
+  *********************************************************************\n\
+  * You must specify CLASS to build this benchmark                    *\n\
+  * For example, to build a class A benchmark, type                   *\n\
+  *       make {benchmark-name} CLASS=A                               *\n\
+  *********************************************************************\n\n"); 
+
+    if (class_old != 'X') {
+#ifdef VERBOSE
+      printf("setparams: Previous settings were CLASS=%c \n", class_old); 
+#endif
+    }
+    exit(1); /* exit on class==U */
+  }
+
+  /* Write out new information if it's different. */
+  if (class != class_old) {
+#ifdef VERBOSE
+    printf("setparams: Writing %s\n", FILENAME); 
+#endif
+    write_info(type, class);
+  } else {
+#ifdef VERBOSE
+    printf("setparams: Settings unchanged. %s unmodified\n", FILENAME); 
+#endif
+  }
+
+  return 0;
+}
+
+
+/*
+ *  get_info(): Get parameters from command line 
+ */
+
+void get_info(char *argv[], int *typep, char *classp) 
+{
+
+  *classp = *argv[2];
+
+  if      (!strcmp(argv[1], "sp") || !strcmp(argv[1], "SP")) *typep = SP;
+  else if (!strcmp(argv[1], "bt") || !strcmp(argv[1], "BT")) *typep = BT;
+  else if (!strcmp(argv[1], "ft") || !strcmp(argv[1], "FT")) *typep = FT;
+  else if (!strcmp(argv[1], "lu") || !strcmp(argv[1], "LU")) *typep = LU;
+  else if (!strcmp(argv[1], "mg") || !strcmp(argv[1], "MG")) *typep = MG;
+  else if (!strcmp(argv[1], "is") || !strcmp(argv[1], "IS")) *typep = IS;
+  else if (!strcmp(argv[1], "ep") || !strcmp(argv[1], "EP")) *typep = EP;
+  else if (!strcmp(argv[1], "cg") || !strcmp(argv[1], "CG")) *typep = CG;
+  else if (!strcmp(argv[1], "ua") || !strcmp(argv[1], "UA")) *typep = UA;
+  else if (!strcmp(argv[1], "dc") || !strcmp(argv[1], "DC")) *typep = DC;
+  else {
+    printf("setparams: Error: unknown benchmark type %s\n", argv[1]);
+    exit(1);
+  }
+}
+
+/*
+ *  check_info(): Make sure command line data is ok for this benchmark 
+ */
+
+void check_info(int type, char class) 
+{
+
+  /* check class */
+  if (class != 'S' && 
+      class != 'W' && 
+      class != 'A' && 
+      class != 'B' && 
+      class != 'C' && 
+      class != 'D' && 
+      class != 'E' && 
+      class != 'F') {
+    printf("setparams: Unknown benchmark class %c\n", class); 
+    printf("setparams: Allowed classes are \"S\", \"W\", and \"A\" through \"F\"\n");
+    exit(1);
+  }
+
+  if ((class == 'E' && (type == UA || type == DC)) ||
+      (class == 'F' && (type == IS || type == UA || type == DC)) ||
+      ((class == 'C' || class == 'D') && type == DC)) {
+    printf("setparams: Benchmark class %c not defined for %s\n",
+           class, (type == IS)? "IS" : (type == UA)? "UA" : "DC");
+    exit(1);
+  }
+}
+
+
+/* 
+ * read_info(): Read previous information from file. 
+ *              Not an error if file doesn't exist, because this
+ *              may be the first time we're running. 
+ *              Assumes the first line of the file is in a special
+ *              format that we understand (since we wrote it). 
+ */
+
+void read_info(int type, char *classp)
+{
+  int nread;
+  FILE *fp;
+  fp = fopen(FILENAME, "r");
+  if (fp == NULL) {
+#ifdef VERBOSE
+    printf("setparams: INFO: configuration file %s does not exist (yet)\n", FILENAME); 
+#endif
+    goto abort;
+  }
+  
+  /* first line of file contains info (fortran), first two lines (C) */
+
+  switch(type) {
+      case SP:
+      case BT:
+      case FT:
+      case MG:
+      case LU:
+      case EP:
+      case CG:
+      case UA:
+          nread = fscanf(fp, DESC_LINE, classp);
+          if (nread != 1) {
+            printf("setparams: Error parsing config file %s. Ignoring previous settings\n", FILENAME);
+            goto abort;
+          }
+          break;
+      case IS:
+      case DC:
+          nread = fscanf(fp, DEF_CLASS_LINE, classp);
+          if (nread != 1) {
+            printf("setparams: Error parsing config file %s. Ignoring previous settings\n", FILENAME);
+            goto abort;
+          }
+          break;
+      default:
+        /* never should have gotten this far with a bad name */
+        printf("setparams: (Internal Error) Benchmark type %d unknown to this program\n", type); 
+        exit(1);
+  }
+
+  fclose(fp);
+
+
+  return;
+
+ abort:
+  *classp = 'X';
+  return;
+}
+
+
+/* 
+ * write_info(): Write new information to config file. 
+ *               First line is in a special format so we can read
+ *               it in again. Then comes a warning. The rest is all
+ *               specific to a particular benchmark. 
+ */
+
+void write_info(int type, char class) 
+{
+  FILE *fp;
+  fp = fopen(FILENAME, "w");
+  if (fp == NULL) {
+    printf("setparams: Can't open file %s for writing\n", FILENAME);
+    exit(1);
+  }
+
+  switch(type) {
+      case SP:
+      case BT:
+      case FT:
+      case MG:
+      case LU:
+      case EP:
+      case CG:
+      case UA:
+          /* Write out the header */
+          fprintf(fp, DESC_LINE, class);
+          /* Print out a warning so bozos don't mess with the file */
+          fprintf(fp, "\
+!  \n\
+!  \n\
+!  This file is generated automatically by the setparams utility.\n\
+!  It sets the number of processors and the class of the NPB\n\
+!  in this directory. Do not modify it by hand.\n\
+!  \n");
+
+          break;
+      case IS:
+          fprintf(fp, DEF_CLASS_LINE, class);
+          fprintf(fp, "\
+/*\n\
+   This file is generated automatically by the setparams utility.\n\
+   It sets the number of processors and the class of the NPB\n\
+   in this directory. Do not modify it by hand.   */\n\
+   \n");
+          break;
+      case DC:
+          fprintf(fp, DEF_CLASS_LINE, class);
+          fprintf(fp, "\
+/*\n\
+   This file is generated automatically by the setparams utility.\n\
+   It sets the number of processors and the class of the NPB\n\
+   in this directory. Do not modify it by hand.\n\
+   This file provided for backward compatibility.\n\
+   It is not used in DC benchmark.   */\n\
+   \n");
+          break;
+      default:
+          printf("setparams: (Internal error): Unknown benchmark type %d\n", 
+                                                                         type);
+          exit(1);
+  }
+
+  /* Now do benchmark-specific stuff */
+  switch(type) {
+  case SP:
+    write_sp_info(fp, class);
+    break;	      
+  case BT:	      
+    write_bt_info(fp, class);
+    break;
+  case DC:
+    write_dc_info(fp, class);
+    break;	      
+  case LU:	      
+    write_lu_info(fp, class);
+    break;	      
+  case MG:	      
+    write_mg_info(fp, class);
+    break;	      
+  case IS:	      
+    write_is_info(fp, class);  
+    break;	      
+  case FT:	      
+    write_ft_info(fp, class);
+    break;	      
+  case EP:	      
+    write_ep_info(fp, class);
+    break;	      
+  case CG:	      
+    write_cg_info(fp, class);
+    break;
+  case UA:	      
+    write_ua_info(fp, class);
+    break;
+  default:
+    printf("setparams: (Internal error): Unknown benchmark type %d\n", type);
+    exit(1);
+  }
+  write_convertdouble_info(type, fp);
+  write_compiler_info(type, fp);
+  fclose(fp);
+  return;
+}
+
+
+/* 
+ * write_sp_info(): Write SP specific info to config file
+ */
+
+void write_sp_info(FILE *fp, char class) 
+{
+  int problem_size, niter;
+  char *dt;
+  if      (class == 'S') { problem_size = 12;  dt = "0.015d0";   niter = 100; }
+  else if (class == 'W') { problem_size = 36;  dt = "0.0015d0";  niter = 400; }
+  else if (class == 'A') { problem_size = 64;  dt = "0.0015d0";  niter = 400; }
+  else if (class == 'B') { problem_size = 102; dt = "0.001d0";   niter = 400; }
+  else if (class == 'C') { problem_size = 162; dt = "0.00067d0"; niter = 400; }
+  else if (class == 'D') { problem_size = 408; dt = "0.00030d0"; niter = 500; }
+  else if (class == 'E') { problem_size = 1020; dt = "0.0001d0"; niter = 500; }
+  else if (class == 'F') { problem_size = 2560; dt = "0.15d-4";  niter = 500; }
+  else {
+    printf("setparams: Internal error: invalid class %c\n", class);
+    exit(1);
+  }
+  fprintf(fp, "%sinteger problem_size, niter_default\n", FINDENT);
+  fprintf(fp, "%sparameter (problem_size=%d, niter_default=%d)\n", 
+	       FINDENT, problem_size, niter);
+  fprintf(fp, "%sdouble precision dt_default\n", FINDENT);
+  fprintf(fp, "%sparameter (dt_default = %s)\n", FINDENT, dt);
+}
+  
+/* 
+ * write_bt_info(): Write BT specific info to config file
+ */
+
+void write_bt_info(FILE *fp, char class) 
+{
+  int problem_size, niter;
+  char *dt;
+  if      (class == 'S') { problem_size = 12;  dt = "0.010d0";   niter = 60; }
+  else if (class == 'W') { problem_size = 24;  dt = "0.0008d0";  niter = 200; }
+  else if (class == 'A') { problem_size = 64;  dt = "0.0008d0";  niter = 200; }
+  else if (class == 'B') { problem_size = 102; dt = "0.0003d0";  niter = 200; }
+  else if (class == 'C') { problem_size = 162; dt = "0.0001d0";  niter = 200; }
+  else if (class == 'D') { problem_size = 408; dt = "0.00002d0";  niter = 250; }
+  else if (class == 'E') { problem_size = 1020; dt = "0.4d-5";    niter = 250; }
+  else if (class == 'F') { problem_size = 2560; dt = "0.6d-6";    niter = 250; }
+  else {
+    printf("setparams: Internal error: invalid class %c\n", class);
+    exit(1);
+  }
+  fprintf(fp, "%sinteger problem_size, niter_default\n", FINDENT);
+  fprintf(fp, "%sparameter (problem_size=%d, niter_default=%d)\n", 
+	       FINDENT, problem_size, niter);
+  fprintf(fp, "%sdouble precision dt_default\n", FINDENT);
+  fprintf(fp, "%sparameter (dt_default = %s)\n", FINDENT, dt);
+}
+  
+/* 
+ * write_dc_info(): Write DC specific info to config file
+ */
+
+
+void write_dc_info(FILE *fp, char class)
+{
+  long int input_tuples, attrnum;
+  if      (class == 'S') { input_tuples = 1000;     attrnum = 5; }
+  else if (class == 'W') { input_tuples = 100000;   attrnum = 10; }
+  else if (class == 'A') { input_tuples = 1000000;  attrnum = 15; }
+  else if (class == 'B') { input_tuples = 10000000; attrnum = 20; }
+  else {
+    printf("setparams: Internal error: invalid class %c\n", class);
+    exit(1);
+  }
+  fprintf(fp, "long long int input_tuples=%ld, attrnum=%ld;\n",
+              input_tuples, attrnum);
+}
+
+
+/* 
+ * write_lu_info(): Write LU specific info to config file
+ */
+
+void write_lu_info(FILE *fp, char class) 
+{
+  int isiz1, isiz2, itmax, inorm, problem_size;
+  char *dt_default;
+
+  if      (class == 'S') { problem_size = 12;  dt_default = "0.5d0"; itmax = 50; }
+  else if (class == 'W') { problem_size = 33;  dt_default = "1.5d-3"; itmax = 300; }
+  else if (class == 'A') { problem_size = 64;  dt_default = "2.0d0"; itmax = 250; }
+  else if (class == 'B') { problem_size = 102; dt_default = "2.0d0"; itmax = 250; }
+  else if (class == 'C') { problem_size = 162; dt_default = "2.0d0"; itmax = 250; }
+  else if (class == 'D') { problem_size = 408; dt_default = "1.0d0"; itmax = 300; }
+  else if (class == 'E') { problem_size = 1020; dt_default = "0.5d0"; itmax = 300; }
+  else if (class == 'F') { problem_size = 2560; dt_default = "0.2d0"; itmax = 300; }
+  else {
+    printf("setparams: Internal error: invalid class %c\n", class);
+    exit(1);
+  }
+  inorm = itmax;
+  isiz1 = problem_size;
+  isiz2 = problem_size;
+  
+
+  fprintf(fp, "\n! full problem size\n");
+  fprintf(fp, "%sinteger isiz1, isiz2, isiz3\n", FINDENT);
+  fprintf(fp, "%sparameter (isiz1=%d, isiz2=%d, isiz3=%d)\n", 
+	       FINDENT, isiz1, isiz2, problem_size );
+
+  fprintf(fp, "\n! number of iterations and how often to print the norm\n");
+  fprintf(fp, "%sinteger itmax_default, inorm_default\n", FINDENT);
+  fprintf(fp, "%sparameter (itmax_default=%d, inorm_default=%d)\n", 
+	  FINDENT, itmax, inorm);
+
+  fprintf(fp, "%sdouble precision dt_default\n", FINDENT);
+  fprintf(fp, "%sparameter (dt_default = %s)\n", FINDENT, dt_default);
+  
+}
+
+/* 
+ * write_mg_info(): Write MG specific info to config file
+ */
+
+void write_mg_info(FILE *fp, char class) 
+{
+  int problem_size, nit, log2_size, lt_default, lm;
+  int ndim1, ndim2, ndim3;
+  if      (class == 'S') { problem_size = 32; nit = 4; }
+/*  else if (class == 'W') { problem_size = 64; nit = 40; }*/
+  else if (class == 'W') { problem_size = 128; nit = 4; }
+  else if (class == 'A') { problem_size = 256; nit = 4; }
+  else if (class == 'B') { problem_size = 256; nit = 20; }
+  else if (class == 'C') { problem_size = 512; nit = 20; }
+  else if (class == 'D') { problem_size = 1024; nit = 50; }
+  else if (class == 'E') { problem_size = 2048; nit = 50; }
+  else if (class == 'F') { problem_size = 4096; nit = 50; }
+  else {
+    printf("setparams: Internal error: invalid class type %c\n", class);
+    exit(1);
+  }
+  log2_size = ilog2(problem_size);
+  /* lt is log of largest total dimension */
+  lt_default = log2_size;
+  /* log of maximum dimension on a node */
+  lm = log2_size;
+  ndim1 = lm;
+  ndim3 = log2_size;
+  ndim2 = log2_size;
+
+  fprintf(fp, "%sinteger nx_default, ny_default, nz_default\n", FINDENT);
+  fprintf(fp, "%sparameter (nx_default=%d, ny_default=%d, nz_default=%d)\n", 
+	  FINDENT, problem_size, problem_size, problem_size);
+  fprintf(fp, "%sinteger nit_default, lm, lt_default\n", FINDENT);
+  fprintf(fp, "%sparameter (nit_default=%d, lm = %d, lt_default=%d)\n", 
+	  FINDENT, nit, lm, lt_default);
+  fprintf(fp, "%sinteger debug_default\n", FINDENT);
+  fprintf(fp, "%sparameter (debug_default=%d)\n", FINDENT, 0);
+  fprintf(fp, "%sinteger ndim1, ndim2, ndim3\n", FINDENT);
+  fprintf(fp, "%sparameter (ndim1 = %d, ndim2 = %d, ndim3 = %d)\n", 
+	  FINDENT, ndim1, ndim2, ndim3);
+  fprintf(fp, "%sinteger kind2\n", FINDENT);
+  fprintf(fp, "%sparameter (kind2=%s)\n",
+          FINDENT, (problem_size > 1024)? "8" : "4");
+}
+
+
+/* 
+ * write_is_info(): Write IS specific info to config file
+ */
+
+void write_is_info(FILE *fp, char class) 
+{
+  if( class != 'S' &&
+      class != 'W' &&
+      class != 'A' &&
+      class != 'B' &&
+      class != 'C' &&
+      class != 'D' &&
+      class != 'E')
+  {
+    printf("setparams: Internal error: invalid class type %c\n", class);
+    exit(1);
+  }
+}
+
+
+/* 
+ * write_cg_info(): Write CG specific info to config file
+ */
+
+void write_cg_info(FILE *fp, char class) 
+{
+  int na,nonzer,niter,kz;
+  char *shift,*rcond="1.0d-1";
+
+  if( class == 'S' )
+  { na=1400; nonzer=7; niter=15; shift="10."; }
+  else if( class == 'W' )
+  { na=7000; nonzer=8; niter=15; shift="12."; }
+  else if( class == 'A' )
+  { na=14000; nonzer=11; niter=15; shift="20."; }
+  else if( class == 'B' )
+  { na=75000; nonzer=13; niter=75; shift="60."; }
+  else if( class == 'C' )
+  { na=150000; nonzer=15; niter=75; shift="110."; }
+  else if( class == 'D' )
+  { na=1500000; nonzer=21; niter=100; shift="500."; }
+  else if( class == 'E' )
+  { na=9000000; nonzer=26; niter=100; shift="1.5d3"; }
+  else if( class == 'F' )
+  { na=54000000; nonzer=31; niter=100; shift="5.0d3"; }
+  else
+  {
+    printf("setparams: Internal error: invalid class type %c\n", class);
+    exit(1);
+  }
+  kz = (na >= 9000000)? 8 : 4;
+  fprintf( fp, "%sinteger            na, nonzer, niter\n", FINDENT );
+  fprintf( fp, "%sdouble precision   shift, rcond\n", FINDENT );
+  fprintf( fp, "%sparameter(  na=%d, &\n", FINDENT, na );
+  fprintf( fp, "%s             nonzer=%d, &\n", CONTINUE, nonzer );
+  fprintf( fp, "%s             niter=%d, &\n", CONTINUE, niter );
+  fprintf( fp, "%s             shift=%s, &\n", CONTINUE, shift );
+  fprintf( fp, "%s             rcond=%s )\n", CONTINUE, rcond );
+  fprintf( fp, "%sinteger, parameter :: kz=%d\n", FINDENT, kz );
+  
+}
+
+
+
+/* 
+ * write_ua_info(): Write UA specific info to config file
+ */
+
+void write_ua_info(FILE *fp, char class) 
+{
+  int lelt, lmor,refine_max, niter, nmxh, fre;
+  char *alpha;
+
+  fre = 5;
+  if( class == 'S' )
+  { lelt=250;lmor=11600;       refine_max=4;  niter=50;  nmxh=10; alpha="0.040d0"; }
+  else if( class == 'W' )
+  { lelt=700;lmor=26700;       refine_max=5;  niter=100; nmxh=10; alpha="0.060d0"; }
+  else if( class == 'A' )
+  { lelt=2400;lmor=92700;      refine_max=6;  niter=200; nmxh=10; alpha="0.076d0"; }
+  else if( class == 'B' )
+  { lelt=8800;  lmor=334600;   refine_max=7;  niter=200; nmxh=10; alpha="0.076d0"; }
+  else if( class == 'C' )
+  { lelt=33500; lmor=1262100;  refine_max=8;  niter=200; nmxh=10; alpha="0.067d0"; }
+  else if( class == 'D' )
+  { lelt=514400;lmor=19134400; refine_max=10; niter=250; nmxh=10; alpha="0.046d0"; }
+  else if( class == 'E' )
+  { lelt=7844800;lmor=291302900; refine_max=12; niter=250; nmxh=10; alpha="0.0294d0"; }
+  else
+  {
+    printf("setparams: Internal error: invalid class type %c\n", class);
+    exit(1);
+  }
+  
+  fprintf( fp, "%sinteger          lelt, lmor, refine_max, fre_default\n", FINDENT );
+  fprintf( fp, "%sinteger          niter_default, nmxh_default\n", FINDENT );
+  fprintf( fp, "%scharacter        class_default\n", FINDENT );
+  fprintf( fp, "%sdouble precision alpha_default\n", FINDENT );
+  fprintf( fp, "%sparameter(  lelt=%d, &\n", FINDENT, lelt );
+  fprintf( fp, "%s            lmor=%d, &\n", CONTINUE, lmor );
+  fprintf( fp, "%s             refine_max=%d, &\n", CONTINUE, refine_max );
+  fprintf( fp, "%s             fre_default=%d, &\n", CONTINUE, fre );
+  fprintf( fp, "%s             niter_default=%d, &\n", CONTINUE, niter );
+  fprintf( fp, "%s             nmxh_default=%d, &\n", CONTINUE, nmxh );
+  fprintf( fp, "%s             class_default=\"%c\", &\n", CONTINUE, class );
+  fprintf( fp, "%s             alpha_default=%s )\n", CONTINUE, alpha );
+  
+}
+
+
+/* 
+ * write_ft_info(): Write FT specific info to config file
+ */
+
+void write_ft_info(FILE *fp, char class) 
+{
+  /* Specify the number of grid points in each direction
+   * (nx, ny, nz) directly; maxdim is the largest of the three
+   * and niter is the number of iterations.
+   */
+  int nx, ny, nz, maxdim, niter;
+  if      (class == 'S') { nx = 64; ny = 64; nz = 64; niter = 6;}
+  else if (class == 'W') { nx = 128; ny = 128; nz = 32; niter = 6;}
+  else if (class == 'A') { nx = 256; ny = 256; nz = 128; niter = 6;}
+  else if (class == 'B') { nx = 512; ny = 256; nz = 256; niter =20;}
+  else if (class == 'C') { nx = 512; ny = 512; nz = 512; niter =20;}
+  else if (class == 'D') { nx = 2048; ny = 1024; nz = 1024; niter =25;}
+  else if (class == 'E') { nx = 4096; ny = 2048; nz = 2048; niter =25;}
+  else if (class == 'F') { nx = 8192; ny = 4096; nz = 4096; niter =25;}
+  else {
+    printf("setparams: Internal error: invalid class type %c\n", class);
+    exit(1);
+  }
+  maxdim = nx;
+  if (ny > maxdim) maxdim = ny;
+  if (nz > maxdim) maxdim = nz;
+  fprintf(fp, "%sinteger nx, ny, nz, maxdim, niter_default\n", FINDENT);
+  fprintf(fp, "%sparameter (nx=%d, ny=%d, nz=%d, maxdim=%d)\n", 
+          FINDENT, nx, ny, nz, maxdim);
+  fprintf(fp, "%sparameter (niter_default=%d)\n", FINDENT, niter);
+  fprintf(fp, "%sinteger kind2\n", FINDENT);
+  fprintf(fp, "%sparameter (kind2=%s)\n", 
+          FINDENT, (maxdim > 1024)? "8" : "4");
+
+}
+
+/*
+ * write_ep_info(): Write EP specific info to config file
+ */
+
+void write_ep_info(FILE *fp, char class)
+{
+  /* EP has no grid: m is the log base 2 of the number of
+   * random-number pairs to generate. The class character is
+   * written to the config file alongside it.
+   */
+  int m;
+  if      (class == 'S') { m = 24; }
+  else if (class == 'W') { m = 25; }
+  else if (class == 'A') { m = 28; }
+  else if (class == 'B') { m = 30; }
+  else if (class == 'C') { m = 32; }
+  else if (class == 'D') { m = 36; }
+  else if (class == 'E') { m = 40; }
+  else if (class == 'F') { m = 44; }
+  else {
+    printf("setparams: Internal error: invalid class type %c\n", class);
+    exit(1);
+  }
+
+  fprintf(fp, "%scharacter class\n",FINDENT);
+  fprintf(fp, "%sparameter (class =\'%c\')\n",
+                  FINDENT, class);
+  fprintf(fp, "%sinteger m\n", FINDENT);
+  fprintf(fp, "%sparameter (m=%d)\n", FINDENT, m);
+}
+
+
+/* 
+ * This is a gross hack to allow the benchmarks to 
+ * print out how they were compiled. Various other ways
+ * of doing this have been tried and they all fail on
+ * some machine - due to a broken "make" program, or
+ * FC limitations, or whatever. Hopefully this will
+ * always work because it uses very portable C. Unfortunately
+ * it relies on parsing the make.def file - YUK. 
+ * If your machine doesn't have <string.h> or <ctype.h>, happy hacking!
+ * 
+ */
+
+#define VERBOSE
+#define LL 400
+#define DEFFILE "../config/make.def"
+#define DEFAULT_MESSAGE "(none)"
+FILE *deffile;
+void write_compiler_info(int type, FILE *fp)
+{
+  char line[LL];
+  char fc[LL], flink[LL], f_lib[LL], f_inc[LL], fflags[LL], flinkflags[LL];
+  char compiletime[LL], randfile[LL];
+  char cc[LL], cflags[LL], clink[LL], clinkflags[LL],
+       c_lib[LL], c_inc[LL];
+  struct tm *tmp;
+  time_t t;
+  deffile = fopen(DEFFILE, "r");
+  if (deffile == NULL) {
+    printf("\n\
+setparams: File %s doesn't exist. To build the NAS benchmarks\n\
+           you need to create it according to the instructions\n\
+           in the README in the main directory and comments in \n\
+           the file config/make.def.template\n", DEFFILE);
+    exit(1);
+  }
+  strcpy(fc, DEFAULT_MESSAGE);
+  strcpy(flink, DEFAULT_MESSAGE);
+  strcpy(f_lib, DEFAULT_MESSAGE);
+  strcpy(f_inc, DEFAULT_MESSAGE);
+  strcpy(fflags, DEFAULT_MESSAGE);
+  strcpy(flinkflags, DEFAULT_MESSAGE);
+  strcpy(randfile, DEFAULT_MESSAGE);
+  strcpy(cc, DEFAULT_MESSAGE);
+  strcpy(cflags, DEFAULT_MESSAGE);
+  strcpy(clink, DEFAULT_MESSAGE);
+  strcpy(clinkflags, DEFAULT_MESSAGE);
+  strcpy(c_lib, DEFAULT_MESSAGE);
+  strcpy(c_inc, DEFAULT_MESSAGE);
+
+  while (fgets(line, LL, deffile) != NULL) {
+    if (*line == '#') continue;
+    /* yes, this is inefficient. but it's simple! */
+    check_line(line, "FC", fc);
+    check_line(line, "FLINK", flink);
+    check_line(line, "F_LIB", f_lib);
+    check_line(line, "F_INC", f_inc);
+    check_line(line, "FFLAGS", fflags);
+    check_line(line, "FLINKFLAGS", flinkflags);
+    check_line(line, "RAND", randfile);
+    check_line(line, "CC", cc);
+    check_line(line, "CFLAGS", cflags);
+    check_line(line, "CLINK", clink);
+    check_line(line, "CLINKFLAGS", clinkflags);
+    check_line(line, "C_LIB", c_lib);
+    check_line(line, "C_INC", c_inc);
+  }
+
+  
+  (void) time(&t);
+  tmp = localtime(&t);
+  (void) strftime(compiletime, (size_t)LL, "%d %b %Y", tmp);
+
+
+  switch(type) {
+      case FT:
+      case SP:
+      case BT:
+      case MG:
+      case LU:
+      case EP:
+      case CG:
+      case UA:
+          put_string(fp, "compiletime", compiletime);
+          put_string(fp, "npbversion", VERSION);
+          put_string(fp, "cs1", fc);
+          put_string(fp, "cs2", flink);
+          put_string(fp, "cs3", f_lib);
+          put_string(fp, "cs4", f_inc);
+          put_string(fp, "cs5", fflags);
+          put_string(fp, "cs6", flinkflags);
+	  put_string(fp, "cs7", randfile);
+          break;
+      case IS:
+      case DC:
+          put_def_string(fp, "COMPILETIME", compiletime);
+          put_def_string(fp, "NPBVERSION", VERSION);
+          put_def_string(fp, "CC", cc);
+          put_def_string(fp, "CFLAGS", cflags);
+          put_def_string(fp, "CLINK", clink);
+          put_def_string(fp, "CLINKFLAGS", clinkflags);
+          put_def_string(fp, "C_LIB", c_lib);
+          put_def_string(fp, "C_INC", c_inc);
+          break;
+      default:
+          printf("setparams: (Internal error): Unknown benchmark type %d\n", 
+                                                                         type);
+          exit(1);
+  }
+
+}
+
+void check_line(char *line, char *label, char *val)
+{
+  char *original_line;
+  int n;
+  original_line = line;
+  /* compare beginning of line and label */
+  while (*label != '\0' && *line == *label) {
+    line++; label++; 
+  }
+  /* if *label is not EOS, we must have had a mismatch */
+  if (*label != '\0') return;
+  /* if *line is neither whitespace nor '=', actual label is longer than test label */
+  if (!isspace(*line) && *line != '=') return ; 
+  /* skip over white space */
+  while (isspace(*line)) line++;
+  /* next char should be '=' */
+  if (*line != '=') return;
+  /* skip over white space */
+  while (isspace(*++line));
+  /* if EOS, nothing was specified */
+  if (*line == '\0') return;
+  /* finally we've come to the value */
+  strcpy(val, line);
+  /* chop off the newline at the end */
+  n = strlen(val)-1;
+  if (n >= 0 && val[n] == '\n')
+    val[n--] = '\0';
+  if (n >= 0 && val[n] == '\r')
+    val[n--] = '\0';
+  /* treat continuation */
+  while (val[n] == '\\' && fgets(original_line, LL, deffile)) {
+     line = original_line;
+     while (isspace(*line)) line++;
+     if (isspace(*original_line)) val[n++] = ' ';
+     while (*line && *line != '\n' && *line != '\r' && n < LL-1)
+       val[n++] = *line++;
+     val[n] = '\0';
+     n--;
+  }
+/*  if (val[n] == '\\') {
+    printf("\n\
+setparams: Error in file make.def. Because of the way in which\n\
+           command line arguments are incorporated into the\n\
+           executable benchmark, you can't have any continued\n\
+           lines in the file make.def, that is, lines ending\n\
+           with the character \"\\\". Although it may be ugly, \n\
+           you should be able to reformat without continuation\n\
+           lines. The offending line is\n\
+  %s\n", original_line);
+    exit(1);
+  } */
+}
+
+int check_include_line(char *line, char *filename)
+{
+  char *include_string = "include";
+  /* compare beginning of line and "include" */
+  while (*include_string != '\0' && *line == *include_string) {
+    line++; include_string++; 
+  }
+  /* if *include_string is not EOS, we must have had a mismatch */
+  if (*include_string != '\0') return(0);
+  /* if *line is not a space, first word is not "include" */
+  if (!isspace(*line)) return(0); 
+  /* skip over white space */
+  while (isspace(*++line));
+  /* if EOS, nothing was specified */
+  if (*line == '\0') return(0);
+  /* next keyword should be name of include file in *filename */
+  while (*filename != '\0' && *line == *filename) {
+    line++; filename++; 
+  }  
+  if (*filename != '\0' || 
+      (*line != ' ' && *line != '\0' && *line !='\n')) return(0);
+  else return(1);
+}
+
+
+#define MAXL 46
+void put_string(FILE *fp, char *name, char *val)
+{
+  int len;
+  len = strlen(val);
+  if (len > MAXL) {
+    val[MAXL] = '\0';
+    val[MAXL-1] = '.';
+    val[MAXL-2] = '.';
+    val[MAXL-3] = '.';
+    len = MAXL;
+  }
+  fprintf(fp, "%scharacter %s*%d\n", FINDENT, name, len);
+  fprintf(fp, "%sparameter (%s=\'%s\')\n", FINDENT, name, val);
+}
+
+/* need to escape quote (") in val */
+int fix_string_quote(char *val, char *newval, int maxl)
+{
+  int len;
+  int i, j;
+  len = strlen(val);
+  i = j = 0;
+  while (i < len && j < maxl) {
+    if (val[i] == '"')
+      newval[j++] = '\\';
+    if (j < maxl)
+      newval[j++] = val[i++];
+  }
+  newval[j] = '\0';
+  return j;
+}
+
+/* NOTE: is the ... stuff necessary in C? */
+void put_def_string(FILE *fp, char *name, char *val0)
+{
+  int len;
+  char val[MAXL+3];
+  len = fix_string_quote(val0, val, MAXL+2);
+  if (len > MAXL) {
+    val[MAXL] = '\0';
+    val[MAXL-1] = '.';
+    val[MAXL-2] = '.';
+    val[MAXL-3] = '.';
+    len = MAXL;
+  }
+  fprintf(fp, "#define %s \"%s\"\n", name, val);
+}
+
+void put_def_variable(FILE *fp, char *name, char *val)
+{
+  int len;
+  len = strlen(val);
+  if (len > MAXL) {
+    val[MAXL] = '\0';
+    val[MAXL-1] = '.';
+    val[MAXL-2] = '.';
+    val[MAXL-3] = '.';
+    len = MAXL;
+  }
+  fprintf(fp, "#define %s %s\n", name, val);
+}
+
+
+
+#if 0
+
+/* this version allows arbitrarily long lines but 
+ * some compilers don't like that and they're rarely
+ * useful 
+ */
+
+#define LINELEN 65
+void put_string(FILE *fp, char *name, char *val)
+{
+  int len, nlines, pos, i;
+  char line[100];
+  len = strlen(val);
+  nlines = len/LINELEN;
+  if (nlines*LINELEN < len) nlines++;
+  fprintf(fp, "%scharacter*%d %s\n", FINDENT, nlines*LINELEN, name);
+  fprintf(fp, "%sparameter (%s = &\n", FINDENT, name);
+  for (i = 0; i < nlines; i++) {
+    pos = i*LINELEN;
+    if (i == 0) fprintf(fp, "%s\'", CONTINUE);
+    else        fprintf(fp, "%s", CONTINUE);
+    /* number should be same as LINELEN */
+    fprintf(fp, "%.65s", val+pos);
+    if (i == nlines-1) fprintf(fp, "\')\n");
+    else             fprintf(fp, " &\n");
+  }
+}
+
+#endif
+
+
+/* integer log base two. Returns -1 if the argument isn't
+ * a power of two or is less than or equal to zero 
+ */
+
+int ilog2(int i)
+{
+  int log2;
+  int exp2 = 1;
+  if (i <= 0) return(-1);
+
+  for (log2 = 0; log2 < 30; log2++) {
+    if (exp2 == i) return(log2);
+    if (exp2 > i) break;
+    exp2 *= 2;
+  }
+  return(-1);
+}
+
+
+/* Power function. We could use pow from the math library, but then
+ * we would have to insist on always linking with the math library, just
+ * for this function. Since we only need pow with integer exponents,
+ * we'll code it ourselves here.
+ */
+
+double power(double base, int i)
+{
+  double x;
+
+  if (i==0) return (1.0);
+  else if (i<0) {
+    base = 1.0/base;
+    i = -i;
+  }
+  x = 1.0;
+  while (i>0) {
+    x *=base;
+    i--;
+  }
+  return (x);
+}
+    
+
+void write_convertdouble_info(int type, FILE *fp)
+{
+  switch(type) {
+  case SP:
+  case BT:
+  case LU:
+  case FT:
+  case MG:
+  case EP:
+  case CG:
+  case UA:
+    fprintf(fp, "%slogical  convertdouble\n", FINDENT);
+#ifdef CONVERTDOUBLE
+    fprintf(fp, "%sparameter (convertdouble = .true.)\n", FINDENT);
+#else
+    fprintf(fp, "%sparameter (convertdouble = .false.)\n", FINDENT);
+#endif
+    break;
+  }
+}
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/sys/suite.awk b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/sys/suite.awk
new file mode 100644
index 000000000..461adab1f
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/sys/suite.awk
@@ -0,0 +1,10 @@
+BEGIN { SMAKE = "make" } {
+  if ($1 !~ /^#/ &&  NF > 1) {
+    printf "cd `echo %s|tr '[a-z]' '[A-Z]'`; %s clean;", $1, SMAKE;
+    printf "%s CLASS=%s", SMAKE, $2;
+    if (NF > 2) {
+      printf " VERSION=%s", $3;
+    }
+    printf "; cd ..\n";
+  }
+}
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/test_scripts/comp b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/test_scripts/comp
new file mode 100755
index 000000000..db9c7dec8
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/test_scripts/comp
@@ -0,0 +1,71 @@
+#!/bin/csh
+
+module purge
+module load comp/intel-12.0.4
+#module load comp/gcc-5.3
+module load comp/gcc-8.2
+
+set logfile=npb-make.log
+touch $logfile
+set outf=npb-make.out
+touch $outf
+
+echo "Date: `date`" >> $logfile
+echo "Host: `hostname`" >> $logfile
+module list >>& $logfile
+echo "" >> $logfile
+
+set cnt=0
+set cntf=0
+
+set aps=(bt sp lu lu ua ua)
+set spv=(blk blk hp doac au rd)
+set c="A"
+
+foreach cf (gcc itc_p pgi)
+
+set bindir=bin/bin_$cf
+if ( ! -d $bindir) mkdir -p $bindir
+\cp -f config/NAS.samples/make.def_$cf config/make.def
+make clean >>& $outf
+
+foreach ap (bt cg ep ft is lu mg sp ua)
+   make $ap CLASS=$c >>& $outf
+   set pgm=${ap}.${c}.x
+   set pgmx=bin/$pgm
+   @ cnt++
+   if ( -e $pgmx ) then
+      \mv $pgmx $bindir
+      echo ">>> make $cf/$pgm - successful" | tee -a $logfile
+   else
+      echo "*** make $cf/$pgm - FAILED" | tee -a $logfile
+      @ cntf++
+   endif
+end
+
+set n=1
+while ( $n <= $#aps )
+   set ap=$aps[$n]
+   set ver=$spv[$n]
+   make $ap CLASS=$c VERSION=$ver VEXT=-$ver >>& $outf
+   set pgm=${ap}-${ver}.${c}.x
+   set pgmx=bin/$pgm
+   @ cnt++
+   if ( -e $pgmx ) then
+      \mv $pgmx $bindir
+      echo ">>> make $cf/$pgm - successful" | tee -a $logfile
+   else
+      echo "*** make $cf/$pgm - FAILED" | tee -a $logfile
+      @ cntf++
+   endif
+   @ n++
+end
+
+end
+
+echo "" >> $logfile
+echo "Date: `date`" >> $logfile
+echo "Total number of cases: $cnt" | tee -a $logfile
+echo "Total number of FAILED cases: $cntf" | tee -a $logfile
+echo "" >> $logfile
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/test_scripts/run_test b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/test_scripts/run_test
new file mode 100755
index 000000000..ddb410b2e
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/test_scripts/run_test
@@ -0,0 +1,13 @@
+#!/bin/csh
+
+set sdir=$0:h
+set wdir=$sdir/..
+
+cd $wdir
+echo "Testing ... $sdir/comp"
+$sdir/comp
+
+cd bin
+echo "Testing ... $sdir/runit"
+../$sdir/runit
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/test_scripts/runit b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/test_scripts/runit
new file mode 100755
index 000000000..e960fdf8f
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-OMP/test_scripts/runit
@@ -0,0 +1,96 @@
+#!/bin/csh
+
+module purge
+module load local
+module load comp/intel-12.0.4
+#module load comp/gcc-5.3
+module load comp/gcc-8.2
+
+set logfile=npb-run.log
+touch $logfile
+set tmpf=npb.tmp.$$
+
+echo "Date: `date`" >> $logfile
+echo "Host: `hostname`" >> $logfile
+module list >>& $logfile
+echo "" >> $logfile
+
+set cnt=0
+set cntf=0
+set cntp=0
+
+set aps=(bt sp lu lu ua ua)
+set spv=(blk blk hp doac au rd)
+set c="A"
+setenv NPB_TIMER_FLAG 1
+
+foreach nt (4)
+foreach cf (gcc itc_p pgi)
+
+set bindir=bin_$cf
+set outdir=out_$cf
+if ( ! -d $outdir) mkdir -p $outdir
+
+foreach ap (bt cg ep ft is lu mg sp ua)
+   set pgm=${ap}.${c}.x
+   set pgmx=$bindir/$pgm
+   set case="run $cf/$pgm nt=$nt"
+   @ cnt++
+   if ( -e $pgmx ) then
+      set outf=$outdir/${ap}.${c}.out.$nt
+      touch $outf
+      mbind.x -t$nt -cs-1 $pgmx >&! $tmpf
+      grep -i ' successful' $tmpf >& /dev/null
+      if ( $status == 0 ) then
+         echo ">>> $case - successful" | tee -a $logfile
+      else
+         echo "*** $case - FAILED" | tee -a $logfile
+         @ cntf++
+      endif
+      cat $tmpf >> $outf
+      \rm $tmpf
+   else
+      echo "... $case - not present" | tee -a $logfile
+      @ cntp++
+   endif
+end
+
+set n=1
+while ( $n <= $#aps )
+   set ap=$aps[$n]
+   set ver=$spv[$n]
+   set pgm=${ap}-${ver}.${c}.x
+   set pgmx=$bindir/$pgm
+   set case="run $cf/$pgm nt=$nt"
+   @ cnt++
+   if ( -e $pgmx ) then
+      set outf=$outdir/${ap}-${ver}.${c}.out.$nt
+      touch $outf
+      mbind.x -t$nt -cs-1 $pgmx >&! $tmpf
+      grep -i ' successful' $tmpf >& /dev/null
+      if ( $status == 0 ) then
+         echo ">>> $case - successful" | tee -a $logfile
+      else
+         echo "*** $case - FAILED" | tee -a $logfile
+         @ cntf++
+      endif
+      cat $tmpf >> $outf
+      \rm $tmpf
+   else
+      echo "... $case - not present" | tee -a $logfile
+      @ cntp++
+   endif
+   @ n++
+end
+
+end
+end
+
+echo "" >> $logfile
+echo "Date: `date`" >> $logfile
+echo "Total number of cases: $cnt" | tee -a $logfile
+echo "Total number of FAILED cases: $cntf" | tee -a $logfile
+echo "Total number of not present cases: $cntp" | tee -a $logfile
+echo "" >> $logfile
+
+
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-SER.README b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-SER.README
new file mode 100644
index 000000000..23f06a1af
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-SER.README
@@ -0,0 +1,5 @@
+The SER version of NPB is not included in this distribution.
+Please use the OMP version instead or download a previous version
+from NPB3.3.1.
+
+http://www.nas.nasa.gov/Software/NPB
diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/README b/src/npb-24.04-imgs/npb-with-roi/NPB/README
new file mode 100644
index 000000000..55dfcae78
--- /dev/null
+++ b/src/npb-24.04-imgs/npb-with-roi/NPB/README
@@ -0,0 +1,173 @@
+NAS Parallel Benchmarks Version 3.4.2 (NPB3.4.2)
+--------------------------------------------------
+
+  NAS Parallel Benchmarks Team
+  NASA Ames Research Center
+  Moffett Field, CA   94035-1000
+
+  E-mail:  npb@nas.nasa.gov                                      
+  Fax:     (650) 604-3957                                        
+  http://www.nas.nasa.gov/Software/NPB/
+
+
+================================================
+INSTALLATION
+
+  For documentation on installing and running the NAS Parallel
+  Benchmarks, refer to subdirectory README files.
+
+
+================================================
+BACKGROUND
+
+  Information on NPB 3.4.2, including the technical reports, the          
+  original specifications, source code, results and information        
+  on how to submit new results, is available at:                       
+
+     http://www.nas.nasa.gov/Software/NPB/                              
+
+
+================================================
+Summary of New Features and Improvements
+ (Details are given in Changes.log.)
+
+
+ in NPB3.4.2 from NPB3.4.1:
+
+  - new verification scheme for EP
+
+  - MPI version
+    * add back the VEC versions of BT and LU
+
+    * fixed a bug in the BT-IO benchmark that can cause integer overflow
+      in CLASS=D or larger problems.  Setting FORTRAN_REC_SIZE in make.def
+      is no longer required.
+
+
+ in NPB3.4.1 from NPB3.4:
+
+  - changed Fortran sources from fixed form to free form
+
+  - MPI version
+    * fixed an inconsistency in enforcing process count requirement
+
+  - OMP version
+    * fixed the report of Fortran compiler flag
+
+    * The blocking factor for FT can now be set via make option
+
+
+ in NPB3.4 from NPB3.3.1:
+
+  - General changes applied to both MPI and OMP versions
+      * Added the class E problem size for IS, and the class F problem 
+        size for BT, LU, SP, CG, EP, FT, and MG.
+
+      * Use Fortran modules and allocatable arrays to define and
+        manage global data (to replace common blocks) and Fortran 2003 
+        IEEE arithmetic function to catch the NaN condition during 
+        verification.
+
+      * The environment variable NPB_TIMER_FLAG is now used to enable 
+        additional timers.
+
+      * Make flag change: from MPIF77 or F77 to MPIFC or FC.
+
+  - MPI version improvement
+      * MPI codes use Fortran 90 dynamic memory allocation for space 
+        allocation to simplify compilation process.  The number of 
+        processes is solely determined and checked at runtime.
+
+      * Performance improvement of the LU benchmark.
+
+  - OMP version improvement
+      * Improved loop-level parallelism with the use of the COLLAPSE
+        clause
+
+      * Included the "blocking" version for the BT and SP benchmarks
+
+      * Included the "doacross" version for the LU benchmark
+
+  - Removed the serial version - use the OpenMP version instead
+
+
+ in NPB3.3.1 from NPB3.3:
+
+  - Bug fixes for:
+      MPI/FT - non-portable way of broadcasting input parameters
+      {OMP,SER}/DC - access to out-of-bound array elements
+      {OMP,SER}/UA - use of uninitialized array
+
+  - Code clean up in MPI/LU: avoid using MPI_ANY_SOURCE and delete
+      unused codes
+
+  - Additional timers are included in the MPI version
+
+  - Executables produced for OMP and SER now use ".x" as an extension
+
+
+ in NPB3.3 from NPB3.2.1:
+
+  - Introduction of the Class E problem in seven of the benchmarks
+    (BT, SP, LU, CG, MG, FT, and EP) to stress larger size parallel 
+    computers.
+
+  - Class D added to the IS benchmark in all three implementations.
+
+  - Enable the Bucket sort option for OMP/IS.
+
+  - Introduction of the "twiddle" array in the OpenMP FT benchmark
+    to improve performance
+
+  - Array padding in MPI/SP was adjusted to improve performance
+
+  - Merge the vector codes for the BT and LU benchmarks into this
+    release.
+
+  - The hyperplane version of LU (LU-HP) is no longer included 
+    in the distribution.  Download NPB3.2.1 if needed.
+
+
+ in NPB3.2.1 from NPB3.2:
+
+  - A number of bug fixes for the MPI versions of {FT, LU, MG, BT} and 
+    the OpenMP version of LU
+
+  - Improvements on the OpenMP versions of {EP, IS, UA}
+    (see *OMP/UA/README for a special note on UA)
+
+
+ in NPB3.2 from NPB3.1:
+
+  - Serial DC was converted to C from C++ (only classes S, W, A and B
+    are available)
+
+  - OpenMP version of DC was added (only classes S, W, A and B
+    are available)
+
+  - Inclusion of the new DT benchmark (MPI)
+
+
+ in NPB3.1 from NPB3.0 & NPB2.4:
+
+  - MPI, OpenMP, and Serial versions are now merged into one package
+
+  - Inclusion of the Class D problem in both serial and OpenMP versions
+
+  - Inclusion of the new UA benchmark (Serial & OpenMP)
+
+  - Inclusion of "LU-HP" in the OpenMP version
+
+  - Inclusion of the new DC benchmark (Serial)
+
+  - Use of relative errors for verification in both CG and MG
+
+  - Change in problem parameters for MG Class W
+
+
+The NPB IO benchmark is part of NPB3.3-MPI.  Check the README file
+in that subdirectory for additional information.
+
+The Java and HPF implementations are not included in this distribution.
+Please use the NPB3.0 distribution.
+
diff --git a/src/npb-24.04-imgs/scripts/post-installation.sh b/src/npb-24.04-imgs/scripts/post-installation.sh
new file mode 100755
index 000000000..b6518f16d
--- /dev/null
+++ b/src/npb-24.04-imgs/scripts/post-installation.sh
@@ -0,0 +1,24 @@
+#!/bin/sh
+
+# Copyright (c) 2020 The Regents of the University of California.
+# SPDX-License-Identifier: BSD-3-Clause
+
+# Install gfortran, which the Fortran benchmarks in NPB need to compile
+
+apt-get install -y gfortran
+
+# Compile NPB
+
+cd /home/gem5/NPB3.4-OMP/
+
+mkdir -p bin
+make clean
+make suite M5_ANNOTATION=1
+echo "Disabling network by default"
+echo "See README.md for instructions on how to enable network"
+mv /etc/netplan/50-cloud-init.yaml /etc/netplan/50-cloud-init.yaml.bak
+# Disable systemd service that waits for network to be online
+systemctl disable systemd-networkd-wait-online.service
+systemctl mask systemd-networkd-wait-online.service
+
+netplan apply
\ No newline at end of file
diff --git a/src/npb-24.04-imgs/x86-npb.pkr.hcl b/src/npb-24.04-imgs/x86-npb.pkr.hcl
new file mode 100644
index 000000000..c14e3404d
--- /dev/null
+++ b/src/npb-24.04-imgs/x86-npb.pkr.hcl
@@ -0,0 +1,65 @@
+packer {
+  required_plugins {
+    qemu = {
+      source  = "github.com/hashicorp/qemu"
+      version = "~> 1"
+    }
+  }
+}
+
+variable "image_name" {
+  type    = string
+  default = "x86-ubuntu-npb"
+}
+
+variable "ssh_password" {
+  type    = string
+  default = "12345"
+}
+
+variable "ssh_username" {
+  type    = string
+  default = "gem5"
+}
+
+source "qemu" "initialize" {
+  accelerator      = "kvm"
+  boot_command     = ["<wait120>",
+                      "gem5<enter><wait>",
+                      "12345<enter><wait>",
+                      "sudo mv /etc/netplan/50-cloud-init.yaml.bak /etc/netplan/50-cloud-init.yaml<enter><wait>",
+                      "12345<enter><wait>",
+                      "sudo netplan apply<enter><wait>",
+                      "<wait>"]
+  cpus             = "4"
+  disk_size        = "5000"
+  format           = "raw"
+  headless         = "true"
+  disk_image       = "true"
+  iso_checksum     = "sha256:6cedf26ebf281b823b24722341d3a2ab1f1ba26b10b536916d3f23cf92a8f4b5"
+  iso_urls         = ["./x86-ubuntu-24-04-v2"]
+  memory           = "8192"
+  output_directory = "disk-image-x86-npb"
+  qemu_binary      = "/usr/bin/qemu-system-x86_64"
+  qemuargs         = [["-cpu", "host"], ["-display", "none"]]
+  shutdown_command = "echo '${var.ssh_password}'|sudo -S shutdown -P now"
+  ssh_password     = "${var.ssh_password}"
+  ssh_username     = "${var.ssh_username}"
+  ssh_wait_timeout = "60m"
+  vm_name          = "${var.image_name}"
+  ssh_handshake_attempts = "1000"
+}
+
+build {
+  sources = ["source.qemu.initialize"]
+
+  provisioner "file" {
+    source      = "npb-with-roi/NPB/NPB3.4-OMP"
+    destination = "/home/gem5/"
+  }
+  provisioner "shell" {
+    execute_command = "echo '${var.ssh_password}' | {{ .Vars }} sudo -E -S bash '{{ .Path }}'"
+    scripts         = ["scripts/post-installation.sh"]
+  }
+
+}
diff --git a/src/npb/.gitignore b/src/npb/.gitignore
deleted file mode 100644
index 5856da1c0..000000000
--- a/src/npb/.gitignore
+++ /dev/null
@@ -1 +0,0 @@
-disk-image/npb/npb-image/npb
diff --git a/src/npb/README.md b/src/npb/README.md
deleted file mode 100644
index f485c94c7..000000000
--- a/src/npb/README.md
+++ /dev/null
@@ -1,119 +0,0 @@
----
-title: NAS Parallel Benchmarks (NPB) Tests
-tags:
-    - x86
-    - fullsystem
-permalink: resources/npb
-shortdoc: >
-    Disk image and a gem5 configuration script to run the [NAS parallel benchmarks](https://www.nas.nasa.gov/).
-author: ["Ayaz Akram"]
-license: BSD-3-Clause
----
-
-This document provides instructions to create a disk image needed to run the NPB tests with gem5 and points to an example gem5 configuration script needed to run these tests. The example script uses a pre-built disk-image.
-
-The NAS parallel benchmarks ([NPB](https://www.nas.nasa.gov/)) are high performance computing (HPC) workloads consisting of different kernels and pseudo applications:
-
-Kernels:
-- **IS:** Integer Sort, random memory access
-- **EP:** Embarrassingly Parallel
-- **CG:** Conjugate Gradient, irregular memory access and communication
-- **MG:** Multi-Grid on a sequence of meshes, long- and short-distance communication, memory intensive
-- **FT:** discrete 3D fast Fourier Transform, all-to-all communication
-
-Pseudo Applications:
-- **BT:** Block Tri-diagonal solver
-- **SP:** Scalar Penta-diagonal solver
-- **LU:** Lower-Upper Gauss-Seidel solver
-
-There are different classes (A,B,C,D,E and F) of each workload based on the input data size. Detailed discussion of the data sizes is available [here](https://www.nas.nasa.gov/publications/npb_problem_sizes.html).
-
-We make use of a modified source of the NPB suite for these tests, which can be found in `disk-images/npb/npb-hooks`.
-We have added ROI (region of interest) annotations for each benchmark which is used by gem5 to separate simulation statistics between different regions of each benchmark. gem5 magic instructions are used before and after each ROI to exit the guest and transfer control to gem5 the gem5 configuration script. This can then dump and reset stats, or switch to cpus of interest.
-
-We assume the following directory structure while following the instructions in this README file:
-
-```
-npb/
-  |___ gem5/                               # gem5 source code
-  |
-  |___ disk-image/
-  |      |___ build.sh                     # The script downloading packer binary and building the disk image
-  |      |___ shared/                      # Auxiliary files needed for disk creation
-  |      |___ npb/
-  |            |___ npb-image/             # Will be created once the disk is generated
-  |            |      |___ npb             # The generated disk image
-  |            |___ npb.json               # The Packer script to build the disk image
-  |            |___ runscript.sh           # Executes a user provided script in simulated guest
-  |            |___ post-installation.sh   # Moves runscript.sh to guest's .bashrc
-  |            |___ npb-install.sh         # Compiles NPB inside the generated disk image
-  |            |___ npb-hooks              # The NPB source (modified to function better with gem5).
-  |
-  |___ linux                               # Linux source and binary will live here
-  |
-  |___ README.md                           # This README file
-```
-
-## Disk Image
-
-Assuming that you are in the `src/npb/` directory (the directory containing this README), first build `m5` (which is needed to create the disk image):
-
-```sh
-git clone https://gem5.googlesource.com/public/gem5
-cd gem5/util/m5
-scons build/x86/out/m5
-```
-
-Next,
-
-```sh
-cd disk-image
-./build.sh          # the script downloading packer binary and building the disk image
-```
-
-Once this process succeeds, the created disk image can be found on `npb/npb-image/npb`.
-A disk image already created following the above instructions can be found, gzipped, [here](http://dist.gem5.org/dist/v22-1/images/x86/ubuntu-18-04/npb.img.gz).
-
-## Simulating NPB using an example script
-
-An example script with a pre-configured system is available in the following directory within the gem5 repository:
-
-```
-gem5/configs/example/gem5_library/x86-npb-benchmarks.py
-```
-
-The example script specifies a system with the following parameters:
-
-* A `SimpleSwitchableProcessor` (`KVM` for startup and `TIMING` for ROI execution). There are 2 CPU cores, each clocked at 3 GHz.
-* 2 Level `MESI_Two_Level` cache with 32 kB L1I and L1D size, and, 256 kB L2 size. The L1 cache(s) has associativity of 8, and, the L2 cache has associativity 16. There are 2 L2 cache banks.
-* The system has 3 GB `SingleChannelDDR4_2400` memory.
-* The script uses `x86-linux-kernel-4.19.83` and `x86-npb`, the disk image created from following the instructions in this `README.md`.
-
-The example script must be run with the `X86_MESI_Two_Level` binary. To build:
-
-```sh
-git clone https://gem5.googlesource.com/public/gem5
-cd gem5
-scons build/X86/gem5.opt -j<proc>
-```
-Once compiled, you may use the example config file to run the NPB benchmark programs. You would need to specify the benchmark program (`bt`, `cg`, `ep`, `ft`, `is`, `lu`, `mg`, `sp`) and the class (`A`, `B`, `C`) separately, using the following command:
-
-```sh
-# In the gem5 directory
-build/X86/gem5.opt \
-configs/example/gem5_library/x86-npb-benchmarks.py \
---benchmark <benchmark_program> \
---size <class_of_the_benchmark>
-```
-
-Description of the two arguments, provided in the above command are:
-* **--benchmark**, which refers to one of 8 benchmark programs, provided in the NAS parallel benchmark suite. These include `bt`, `cg`, `ep`, `ft`, `is`, `lu`, `mg` and `sp`. For more information on the workloads can be found at <https://www.nas.nasa.gov/>.
-* **--size**, which refers to the workload class to simulate. The classes present in the pre-built disk-image are `A`, `B`, `C` and `D`. More information regarding these classes are written in the following paragraphs.
-
-A few important notes to keep in mind while simulating NPB using the disk-image from gem5 resources:
-
-* The pre-built disk image has NPB executables for classes `A`, `B`, `C` and `D`.
-* Classes `D` and `F` requires main memory sizes of more than 3 GB. Therefore, most of the benchmark programs for class `D` will fail to be executed properly, as our system only has 3 GB of main memory. The `X86Board` from `gem5 stdlib` is currently limited to 3 GB of memory.
-* Only benchmark `ep` with class `D` works in the aforemented configuration.
-* The configuration file `x86-npb-benchmarks.py` takes class input of `A`, `B` or `C`.
-* More information on memory footprint for NPB is available in the paper by [Akram et al.](https://arxiv.org/abs/2010.13216)
diff --git a/src/npb/disk-image/build.sh b/src/npb/disk-image/build.sh
deleted file mode 100755
index 4ba51d434..000000000
--- a/src/npb/disk-image/build.sh
+++ /dev/null
@@ -1,10 +0,0 @@
-PACKER_VERSION="1.7.8"
-
-if [ ! -f ./packer ]; then
-    wget https://releases.hashicorp.com/packer/${PACKER_VERSION}/packer_${PACKER_VERSION}_linux_amd64.zip;
-    unzip packer_${PACKER_VERSION}_linux_amd64.zip;
-    rm packer_${PACKER_VERSION}_linux_amd64.zip;
-fi
-
-./packer validate npb/npb.json
-./packer build npb/npb.json
diff --git a/src/npb/disk-image/npb/npb-hooks/README.md b/src/npb/disk-image/npb/npb-hooks/README.md
deleted file mode 100644
index bc81f9b32..000000000
--- a/src/npb/disk-image/npb/npb-hooks/README.md
+++ /dev/null
@@ -1,26 +0,0 @@
-# npb-hooks
-Annotating the region of interest for npb.
-
-IMPORTANT NOTE:  This repo is not supposed to be the canonical source for the benchmarks and serves only as an example for annotating the ROI. The source code can be obtained from [NAS Parallel Benchmarks](https://www.nas.nasa.gov/publications/npb.html)
-
-This repo adds ROI hooks for NAS Parallel Benchmark (OMP version for now). In this particular implementation, the hooks are coupled with gem5 specific instructions (m5_dumpreststats) to collect the stats for the ROI. But the hooks can be used for any other tool with minimal effort.
-
-To enable hooks, make with HOOKS=1
-
-## Summary of the steps taken:
-
-### For the suite:
-hooks.c defines the functions called by each benchmark, and the actions to be taken at the start/end of the ROI.
-
-Adding gem5 instructions to the hooks:
-In make.common we should add proper compilation options to create object files.
-
-In make.def we should define the path to gem5 directory. Also, -cpp should be added to the fortran compiler (FF) options to enable support for C pre-processors.
-
-### For each benchmark in the suite:
-The source file (i.e. BENCH.f or BENCH.c) should be modified to call roi_begin and roi_end functions. In here, we follow a the methodology used by the developers and the function calls are place right before and after the timing procedures.
-We use pre-processor for conditional compilation of added function calls (HOOKS).
-
-The make files should be modified to add the object files created (hooks.o and any other possible dependencies - in our case m5op_x86.o).
-Also, if hooks are enabled, proper flag should be set (-DHOOKS) in the final step of the compilation process (creating the executable).
-These are both done "conditionally" under HOOKS flag (ifeq ($HOOKS, 1)).
diff --git a/src/npb/disk-image/npb/npb-install.sh b/src/npb/disk-image/npb/npb-install.sh
deleted file mode 100755
index 3a885068a..000000000
--- a/src/npb/disk-image/npb/npb-install.sh
+++ /dev/null
@@ -1,16 +0,0 @@
-#!/bin/sh
-
-# Copyright (c) 2020 The Regents of the University of California.
-# SPDX-License-Identifier: BSD 3-Clause
-
-# install build-essential (gcc and g++ included) and gfortran
-
-#Compile NPB
-
-echo "12345" | sudo apt-get install build-essential gfortran
-
-cd /home/gem5/NPB3.3-OMP/
-
-mkdir bin
-
-make suite HOOKS=1
diff --git a/src/npb/disk-image/npb/npb.json b/src/npb/disk-image/npb/npb.json
deleted file mode 100755
index 73e62d6e4..000000000
--- a/src/npb/disk-image/npb/npb.json
+++ /dev/null
@@ -1,105 +0,0 @@
-{
-    "_author": "Hoa Nguyen <hoanguyen@ucdavis.edu>, Ayaz Akram <yazakram@ucdavis.edu>",
-    "_license": "Copyright (c) 2020 The Regents of the University of California. SPDX-License-Identifier: BSD 3-Clause",
-    "builders":
-    [
-        {
-            "type": "qemu",
-            "format": "raw",
-            "accelerator": "kvm",
-            "boot_command":
-            [
-                "{{ user `boot_command_prefix` }}",
-                "debian-installer={{ user `locale` }} auto locale={{ user `locale` }} kbd-chooser/method=us ",
-                "file=/floppy/{{ user `preseed` }} ",
-                "fb=false debconf/frontend=noninteractive ",
-                "hostname={{ user `hostname` }} ",
-                "/install/vmlinuz noapic ",
-                "initrd=/install/initrd.gz ",
-                "keyboard-configuration/modelcode=SKIP keyboard-configuration/layout=USA ",
-                "keyboard-configuration/variant=USA console-setup/ask_detect=false ",
-                "passwd/user-fullname={{ user `ssh_fullname` }} ",
-                "passwd/user-password={{ user `ssh_password` }} ",
-                "passwd/user-password-again={{ user `ssh_password` }} ",
-                "passwd/username={{ user `ssh_username` }} ",
-                "-- <enter>"
-            ],
-            "cpus": "{{ user `vm_cpus`}}",
-            "disk_size": "{{ user `image_size` }}",
-            "floppy_files":
-            [
-                "shared/{{ user `preseed` }}"
-            ],
-            "headless": "{{ user `headless` }}",
-            "http_directory": "shared/",
-            "iso_checksum": "{{ user `iso_checksum_type` }}:{{ user `iso_checksum` }}",
-            "iso_urls": [ "{{ user `iso_url` }}" ],
-            "memory": "{{ user `vm_memory`}}",
-            "output_directory": "npb/{{ user `image_name` }}-image",
-            "qemuargs":
-            [
-                [ "-cpu", "host" ],
-                [ "-display", "none" ]
-            ],
-            "qemu_binary":"/usr/bin/qemu-system-x86_64",
-            "shutdown_command": "echo '{{ user `ssh_password` }}'|sudo -S shutdown -P now",
-            "ssh_password": "{{ user `ssh_password` }}",
-            "ssh_username": "{{ user `ssh_username` }}",
-            "ssh_wait_timeout": "60m",
-            "vm_name": "{{ user `image_name` }}"
-        }
-    ],
-    "provisioners":
-    [
-        {
-            "type": "file",
-            "source": "../gem5/util/m5/build/x86/out/m5",
-            "destination": "/home/gem5/"
-        },
-        {
-            "type": "file",
-            "source": "shared/serial-getty@.service",
-            "destination": "/home/gem5/"
-        },
-        {
-            "type": "file",
-            "source": "npb/runscript.sh",
-            "destination": "/home/gem5/"
-        },
-        {
-            "type": "file",
-            "source": "npb/npb-hooks/NPB3.3.1/NPB3.3-OMP",
-            "destination": "/home/gem5/"
-        },
-        {
-            "type": "shell",
-            "execute_command": "echo '{{ user `ssh_password` }}' | {{.Vars}} sudo -E -S bash '{{.Path}}'",
-            "scripts":
-            [
-                "npb/post-installation.sh",
-                "npb/npb-install.sh"
-            ]
-        }
-    ],
-    "variables":
-    {
-        "boot_command_prefix": "<enter><wait><f6><esc><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs><bs>",
-        "desktop": "false",
-        "image_size": "12000",
-        "headless": "true",
-        "iso_checksum": "34416ff83179728d54583bf3f18d42d2",
-        "iso_checksum_type": "md5",
-        "iso_name": "ubuntu-18.04.2-server-amd64.iso",
-        "iso_url": "http://old-releases.ubuntu.com/releases/18.04.2/ubuntu-18.04.2-server-amd64.iso",
-        "locale": "en_US",
-        "preseed" : "preseed.cfg",
-        "hostname": "gem5",
-        "ssh_fullname": "gem5",
-        "ssh_password": "12345",
-        "ssh_username": "gem5",
-        "vm_cpus": "4",
-        "vm_memory": "8192",
-        "image_name": "npb"
-  }
-
-}
diff --git a/src/npb/disk-image/npb/post-installation.sh b/src/npb/disk-image/npb/post-installation.sh
deleted file mode 100755
index 0ecb8068b..000000000
--- a/src/npb/disk-image/npb/post-installation.sh
+++ /dev/null
@@ -1,16 +0,0 @@
-#!/bin/bash
-
-# Copyright (c) 2020 The Regents of the University of California.
-# SPDX-License-Identifier: BSD 3-Clause
-
-echo 'Post Installation Started'
-
-mv /home/gem5/serial-getty@.service /lib/systemd/system/
-
-mv /home/gem5/m5 /sbin
-ln -s /sbin/m5 /sbin/gem5
-
-# copy and run outside (host) script after booting
-cat /home/gem5/runscript.sh >> /root/.bashrc
-
-echo 'Post Installation Done'
diff --git a/src/npb/disk-image/npb/runscript.sh b/src/npb/disk-image/npb/runscript.sh
deleted file mode 100755
index 15e4377f1..000000000
--- a/src/npb/disk-image/npb/runscript.sh
+++ /dev/null
@@ -1,13 +0,0 @@
-#!/bin/sh
-
-# Copyright (c) 2020 The Regents of the University of California.
-# SPDX-License-Identifier: BSD 3-Clause
-
-m5 readfile > script.sh
-if [ -s script.sh ]; then
-    # if the file is not empty, execute it
-    chmod +x script.sh
-    ./script.sh
-    m5 exit
-fi
-# otherwise, drop to the terminal
diff --git a/src/npb/disk-image/shared/preseed.cfg b/src/npb/disk-image/shared/preseed.cfg
deleted file mode 100755
index 1fa22859b..000000000
--- a/src/npb/disk-image/shared/preseed.cfg
+++ /dev/null
@@ -1,106 +0,0 @@
-# Copyright (c) 2020 The Regents of the University of California.
-# SPDX-License-Identifier: BSD 3-Clause
-
-# Choosing keyboard layout
-d-i debian-installer/locale string en_US
-d-i console-setup/ask_detect boolean false
-d-i keyboard-configuration/xkb-keymap select us
-
-# Choosing network interface
-d-i netcfg/choose_interface select auto
-
-# Assigning hostname and domain
-d-i netcfg/get_hostname string gem5-host
-d-i netcfg/get_domain string gem5-domain
-
-d-i netcfg/wireless_wep string
-
-# https://unix.stackexchange.com/q/216348
-# The above link says there's no way to not to set a mirror
-# Should choose a local minor
-d-i mirror/country string manual
-d-i mirror/http/hostname string archive.ubuntu.com
-d-i mirror/http/directory string /ubuntu
-d-i mirror/http/proxy string
-
-# Setting up `root` password
-d-i passwd/root-login boolean false
-
-# Creating a normal user account. This account has sudo permission.
-d-i passwd/user-fullname string gem5
-d-i passwd/username string gem5
-d-i passwd/user-password password 12345
-d-i passwd/user-password-again password 12345
-d-i user-setup/allow-password-weak boolean true
-
-# No home folder encryption
-d-i user-setup/encrypt-home boolean false
-
-# Choosing the clock timezone
-d-i clock-setup/utc boolean true
-d-i time/zone string US/Eastern
-d-i clock-setup/ntp boolean true
-
-# Choosing partition scheme
-# This setting should result in MBR
-# gem5 doesn't work with logical volumes
-d-i partman-auto/disk string /dev/vda
-d-i partman-auto/method string regular
-d-i partman-lvm/device_remove_lvm boolean true
-d-i partman-md/device_remove_md boolean true
-d-i partman-lvm/confirm boolean true
-d-i partman-lvm/confirm_nooverwrite boolean true
-
-# Ignoring an option to set the home folder in another partition
-#d-i partman-auto/choose_recipe select atomic
-
-d-i partman-auto/expert_recipe string                         \
-      bootable-root ::                                        \
-              500 10000 1000000000 ext4                       \
-                      method{ format }                        \
-                      format{ }                               \
-                      use_filesystem{ } filesystem{ ext4 }    \
-                      mountpoint{ / }                         \
-              .
-
-
-d-i partman-auto/choose_recipe select bootable-root
-
-# Finishing disk partition settings
-d-i partman-md/confirm boolean true
-d-i partman-partitioning/confirm_write_new_label boolean true
-d-i partman/choose_partition select finish
-d-i partman/confirm boolean true
-d-i partman/confirm_nooverwrite boolean true
-
-# Installing standard packages and ubuntu-server packages
-# More details about ubuntu standard packages:
-# https://packages.ubuntu.com/bionic/ubuntu-standard
-# More details about ubuntu-server packages:
-# https://packages.ubuntu.com/bionic/ubuntu-server
-tasksel tasksel/first multiselect standard, ubuntu-server
-
-# openssh-server is required for communicating with Packer
-# build-essential has standard compiling tools, could be removed
-d-i pkgsel/include string openssh-server build-essential
-# No package upgrade
-d-i pkgsel/upgrade select none
-
-# Updating packages automatically is unnecessary
-d-i pkgsel/update-policy select none
-
-# Choosing not to report installed software to some servers
-popularity-contest popularity-contest/participate boolean false
-
-# Installing grub
-d-i grub-installer/only_debian boolean true
-
-# Install to the above partition
-d-i grub-installer/bootdev  string default
-
-# Answering the prompt saying the installation is finished
-d-i finish-install/reboot_in_progress note
-
-# Answering the prompt saying no bootloader is installed
-# This will appear if grub is not installed
-nobootloader nobootloader/confirmation_common note
diff --git a/src/npb/disk-image/shared/serial-getty@.service b/src/npb/disk-image/shared/serial-getty@.service
deleted file mode 100644
index b0424f0e6..000000000
--- a/src/npb/disk-image/shared/serial-getty@.service
+++ /dev/null
@@ -1,46 +0,0 @@
-#  SPDX-License-Identifier: LGPL-2.1+
-#
-#  This file is part of systemd.
-#
-#  systemd is free software; you can redistribute it and/or modify it
-#  under the terms of the GNU Lesser General Public License as published by
-#  the Free Software Foundation; either version 2.1 of the License, or
-#  (at your option) any later version.
-
-[Unit]
-Description=Serial Getty on %I
-Documentation=man:agetty(8) man:systemd-getty-generator(8)
-Documentation=http://0pointer.de/blog/projects/serial-console.html
-BindsTo=dev-%i.device
-After=dev-%i.device systemd-user-sessions.service plymouth-quit-wait.service getty-pre.target
-After=rc-local.service
-
-# If additional gettys are spawned during boot then we should make
-# sure that this is synchronized before getty.target, even though
-# getty.target didn't actually pull it in.
-Before=getty.target
-IgnoreOnIsolate=yes
-
-# IgnoreOnIsolate causes issues with sulogin, if someone isolates
-# rescue.target or starts rescue.service from multi-user.target or
-# graphical.target.
-Conflicts=rescue.service
-Before=rescue.service
-
-[Service]
-# The '-o' option value tells agetty to replace 'login' arguments with an
-# option to preserve environment (-p), followed by '--' for safety, and then
-# the entered username.
-ExecStart=-/sbin/agetty --autologin root --keep-baud 115200,38400,9600 %I $TERM
-Type=idle
-Restart=always
-UtmpIdentifier=%I
-TTYPath=/dev/%I
-TTYReset=yes
-TTYVHangup=yes
-KillMode=process
-IgnoreSIGPIPE=no
-SendSIGHUP=yes
-
-[Install]
-WantedBy=getty.target