diff --git a/.gitignore b/.gitignore index 435cadd12..ec8514e6e 100644 --- a/.gitignore +++ b/.gitignore @@ -2,7 +2,6 @@ __pycache__/ *.pyc m5out/ packer - # For jekyll _site .jekyll-cache diff --git a/src/npb-24.04-imgs/.gitignore b/src/npb-24.04-imgs/.gitignore new file mode 100644 index 000000000..4a57e240e --- /dev/null +++ b/src/npb-24.04-imgs/.gitignore @@ -0,0 +1,4 @@ +disk-image*/ +arm-ubuntu-* +disk-image-arm-npb +disk-image-x86-npb \ No newline at end of file diff --git a/src/npb-24.04-imgs/README.md b/src/npb-24.04-imgs/README.md new file mode 100644 index 000000000..21579a78b --- /dev/null +++ b/src/npb-24.04-imgs/README.md @@ -0,0 +1,186 @@ +--- +title: NPB ubuntu 24.04 disk images +tags: + - x86 + - arm + - fullsystem +permalink: resources/npb-24.04-imgs +shortdoc: > + This resource implementes the NPB benchmark . +author: ["Harshil Patel"] +license: BSD-3-Clause +--- + +This document provides instructions to create a NPB ubuntu 24.04 disk image, which, along with an example script, may be used to run NPB within gem5 simulations. The example script uses a pre-built disk-image. + +A pre-built disk image, for X86, can be found, gzipped, here: [x86-ubuntu-24.04-npb-img](https://resources.gem5.org/resources/x86-ubuntu-24.04-npb-img?version=2.0.0) + +A pre-built disk image, for arm, can be found, gzipped, here: +[arm-ubuntu-24.04-npb-img](https://resources.gem5.org/resources/arm-ubuntu-24.04-npb-img?version=2.0.0) + +## What's on the disk? + +- username: gem5 +- password: 12345 + +- The `gem5-bridge`(m5) utility is installed in `/usr/local/bin/gem5-bridge`. +- `libm5` is installed in `/usr/local/lib/`. +- The headers for `libm5` are installed in `/usr/local/include/gem5-bridge`. +- `npb` benchmark sutie with ROI annotations + +Thus, you should be able to build packages on the disk and easily link to the gem5-bridge library. + +The disk has network disabled by default to improve boot time in gem5. + +If you want to enable networking, you need to modify the disk image and move the file `/etc/netplan/50-cloud-init.yaml.bak` to `/etc/netplan/50-cloud-init.yaml`. + +## Building the Disk Image + +### Arm specific file requirement + +To get the `flash0.img` run the following commands in the `files` directory. + +```bash +dd if=/dev/zero of=flash0.img bs=1M count=64 +dd if=/usr/share/qemu-efi-aarch64/QEMU_EFI.fd of=flash0.img conv=notrunc +``` + +**Note**: The `build-arm.sh` will make this file for you. + +Assuming that you are in the `src/npb-24.04-imgs/` directory, run + +```sh +./build-x86.sh # the script downloading packer binary and building +``` + +to build the x86 disk image or + +```sh +./build-arm.sh +``` + +to run the arm disk image. +After this process succeeds, the disk image can be found on the `npb-24.04-imgs/disk-image-x86-npb/disk-image-x86-npb` or `npb-24.04-imgs/disk-image-arm-npb/disk-image-arm-npb` repectively. + +This npb image uses the prebuilt ubuntu 24.04 image as a base image. The npb image also throws the same exit events as the base image. + +Each benchmark also has its regions of intrests annotated and they throw a `gem5-bridge workbegin` and `gem5-bridge workend` exit event. + +## Init Process and Exit Events + +This section outlines the disk image's boot process variations and the impact of specific boot parameters on its behavior. +By default, the disk image boots with systemd in a non-interactive mode. +Users can adjust this behavior through kernel arguments at boot time, influencing the init system and session interactivity. + +### Boot Parameters + +The disk image supports two main kernel arguments to adjust the boot process: + +- `no_systemd=true`: Disables systemd as the init system, allowing the system to boot without systemd's management. +- `interactive=true`: Enables interactive mode, presenting a shell prompt to the user for interactive session management. + +Combining these parameters yields four possible boot configurations: + +1. **Default (Systemd, Non-Interactive)**: The system uses systemd for initialization and runs non-interactively. +2. **Systemd and Interactive**: Systemd initializes the system, and the boot process enters an interactive mode, providing a user shell. +3. **Without Systemd and Non-Interactive**: The system boots without systemd and proceeds non-interactively, executing predefined scripts. +4. **Without Systemd and Interactive**: Boots without systemd and provides a shell for interactive use. + +### Note on Print Statements and Exit Events + +- The bold points in the sequence descriptions are `printf` statements in the code, indicating key moments in the boot process. +- The `**` symbols mark gem5 exit events, essential for simulation purposes, dictating system shutdown or reboot actions based on the configured scenario. + +### Boot Sequences + +#### Default Boot Sequence (Systemd, Non-Interactive) + +- Kernel output +- **Kernel Booted print message** ** +- Running systemd print message +- Systemd output +- autologin +- **Running after_boot script** ** +- Print indicating **non-interactive** mode +- **Reading run script file** +- Script output +- Exit ** + +#### With Systemd and Interactive + +- Kernel output +- **Kernel Booted print message** ** +- Running systemd print message +- Systemd output +- autologin +- **Running after_boot script** ** +- Shell + +#### Without Systemd and Non-Interactive + +- Kernel output +- **Kernel Booted print message** ** +- autologin +- **Running after_boot script** ** +- Print indicating **non-interactive** mode +- **Reading run script file** +- Script output +- Exit ** + +#### Without Systemd and Interactive + +- Kernel output +- **Kernel Booted print message** ** +- autologin +- **Running after_boot script** ** +- Shell + +This detailed overview provides a foundational understanding of how different boot configurations affect the system's initialization and mode of operation. +By selecting the appropriate parameters, users can customize the boot process for diverse environments, ranging from automated setups to hands-on interactive sessions. + +## Handling Exit Events in gem5 + +The disk image triggers five exit events in total: + +- 3 `gem5-bridge exit` events +- 1 `gem5-bridge workbegin` event +- 1 `gem5-bridge workend` event + +To manage these events in gem5, you need to create three exit event handlers. Below is a code snippet showing how these handlers could be implemented and added to the `simulator` object in gem5: + +```python +def handle_workbegin(): + print("Done booting Linux") + print("Resetting stats at the start of ROI!") + m5.stats.reset() + processor.switch() + yield False + +# We expect that the ROI ends with `workend` or `simulate() limit reached`. +def handle_workend(): + print("Dumping stats at the end of the ROI!") + m5.stats.dump() + yield True + +def exit_event_handler(): + print("First exit: Kernel booted") + yield False # gem5 is now executing systemd startup + print("Second exit: Started `after_boot.sh` script") + # The after_boot.sh script is executed after the kernel and systemd have booted. + yield False # gem5 is now executing the `after_boot.sh` script + print("Third exit: Finished `after_boot.sh` script") + # The after_boot.sh script will run a script if passed via m5 readfile. + # This is the last exit event before the simulation exits. + yield True + +simulator = Simulator( + board=board, + on_exit_event={ + ExitEvent.WORKBEGIN: handle_workbegin(), + ExitEvent.WORKEND: handle_workend(), + ExitEvent.EXIT: exit_event_handler(), + }, +) +``` + +This script defines three handlers for different exit events (`WORKBEGIN`, `WORKEND`, and `EXIT`). diff --git a/src/npb-24.04-imgs/arm-npb.pkr.hcl b/src/npb-24.04-imgs/arm-npb.pkr.hcl new file mode 100644 index 000000000..4c3463633 --- /dev/null +++ b/src/npb-24.04-imgs/arm-npb.pkr.hcl @@ -0,0 +1,75 @@ +packer { + required_plugins { + qemu = { + source = "github.com/hashicorp/qemu" + version = "~> 1" + } + } +} + +variable "image_name" { + type = string + default = "arm-ubuntu" +} + +variable "ssh_password" { + type = string + default = "12345" +} + +variable "ssh_username" { + type = string + default = "gem5" +} + +source "qemu" "initialize" { + boot_command = ["", + "gem5", + "12345", + "sudo mv /etc/netplan/50-cloud-init.yaml.bak /etc/netplan/50-cloud-init.yaml", + "12345", + "sudo netplan apply", + ""] + cpus = "4" + disk_size = "4600" + format = "raw" + headless = "true" + disk_image = "true" + iso_checksum = "sha256:eb94422a3908c6c5183c03666b278b6e8bcfbde04da3d7c3bb5374bc82e0ef48" + iso_urls = ["./arm-ubuntu-24.04-20240823"] + memory = "8192" + output_directory = "disk-image-arm-npb" + qemu_binary = "/usr/bin/qemu-system-aarch64" + qemuargs = [ ["-boot", "order=dc"], + ["-bios", "./files/flash0.img"], + ["-cpu", "host"], + ["-enable-kvm"], + ["-machine", "virt"], + ["-machine", "gic-version=3"], + ["-device","virtio-gpu-pci"], + ["-device", "qemu-xhci"], + ["-device","usb-kbd"], + + ] + shutdown_command = "echo '${var.ssh_password}'|sudo -S shutdown -P now" + ssh_password = "${var.ssh_password}" + ssh_username = "${var.ssh_username}" + ssh_wait_timeout = "60m" + vm_name = "${var.image_name}" + ssh_handshake_attempts = "1000" +} + +build { + sources = ["source.qemu.initialize"] + + provisioner "file" { + source = "npb-with-roi/NPB/NPB3.4-OMP" + destination = "/home/gem5/" + } + + provisioner "shell" { + execute_command = "echo '${var.ssh_password}' | {{ .Vars }} sudo -E -S bash '{{ .Path }}'" + scripts = ["scripts/post-installation.sh"] + } + +} diff --git a/src/npb-24.04-imgs/build-arm.sh b/src/npb-24.04-imgs/build-arm.sh new file mode 100755 index 000000000..33f627e42 --- /dev/null +++ b/src/npb-24.04-imgs/build-arm.sh @@ -0,0 +1,24 @@ +PACKER_VERSION="1.10.0" + +if [ ! -f ./packer ]; then + wget https://releases.hashicorp.com/packer/${PACKER_VERSION}/packer_${PACKER_VERSION}_linux_arm64.zip; + unzip packer_${PACKER_VERSION}_linux_arm64.zip; + rm packer_${PACKER_VERSION}_linux_arm64.zip; +fi + +# make the flash0.sh file +mkdir files +cd ./files +dd if=/dev/zero of=flash0.img bs=1M count=64 +dd if=/usr/share/qemu-efi-aarch64/QEMU_EFI.fd of=flash0.img conv=notrunc +cd .. + +# get the base image from gem5 resoruces +wget https://storage.googleapis.com/dist.gem5.org/dist/develop/images/arm/ubuntu-24-04/arm-ubuntu-24.04-20240823.gz +gunzip arm-ubuntu-24.04-20240823.gz + +# Install the needed plugins +./packer init arm-npb.pkr.hcl + +# Build the image +./packer build arm-npb.pkr.hcl diff --git a/src/npb-24.04-imgs/build-x86.sh b/src/npb-24.04-imgs/build-x86.sh new file mode 100755 index 000000000..50ec0e8ac --- /dev/null +++ b/src/npb-24.04-imgs/build-x86.sh @@ -0,0 +1,21 @@ +#!/bin/bash + +# Copyright (c) 2024 The Regents of the University of California. +# SPDX-License-Identifier: BSD 3-Clause + +PACKER_VERSION="1.10.0" + +if [ ! -f ./packer ]; then + wget https://releases.hashicorp.com/packer/${PACKER_VERSION}/packer_${PACKER_VERSION}_linux_amd64.zip; + unzip packer_${PACKER_VERSION}_linux_amd64.zip; + rm packer_${PACKER_VERSION}_linux_amd64.zip; +fi + +wget https://storage.googleapis.com/dist.gem5.org/dist/develop/images/x86/ubuntu-24-04/x86-ubuntu-24-04-v2.gz +gunzip x86-ubuntu-24-04-v2.gz + +# Install the needed plugins +./packer init x86-npb.pkr.hcl + +# Build the image +./packer build x86-npb.pkr.hcl \ No newline at end of file diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/Changes.log b/src/npb-24.04-imgs/npb-with-roi/NPB/Changes.log new file mode 100644 index 000000000..c91cf488c --- /dev/null +++ b/src/npb-24.04-imgs/npb-with-roi/NPB/Changes.log @@ -0,0 +1,564 @@ +########################################### +# Modification History of NPB3.x # +# ------------------------------ # +# NPB development team # +# NASA Ames Research Center # +# npb@nas.nasa.gov # +# http://www.nas.nasa.gov/Software/NPB/ # +########################################### + + +------------------------------------------------------ +Changes in NPB3.4.2 + ( NPB3.4-MPI, NPB3.4-OMP ) +------------------------------------------------------ +[20-Jul-20] + +This is a bug-fix release with following changes. + + o Verification change for the EP benchmark (NPB-MPI and NPB-OMP) + - No. of Gaussian pairs is now part of the verification to broaden + the coverage + + - Due to numeric sensitivity in subtracting two close numbers, + the verification of (SX,SY) could fail for the Class F problem + for the given threthold (1.d-8). A new scheme has been implemented + to use the absolute values of (X,Y) in calculating (SX,SY) to mitigate + the sensitivity. There is no change in the number of operations, but + the verification values have been regenerated. + + o Change in NPB-MPI + - add back the VEC versions of BT and LU and make them available via + make option "VERSION=vec" + + - minor format fix in common/print_results.f90 + + - fixed a bug in the BT-IO benchmark that can cause integer overflow + in CLASS=D or larger problems. Setting FORTRAN_REC_SIZE in make.def + is no longer required. + + +------------------------------------------------------ +Changes in NPB3.4.1 + ( NPB3.4-MPI, NPB3.4-OMP ) +------------------------------------------------------ +[15-Feb-20] + +This is a minor release with following changes. + + o Changed Fortran sources from fixed form to free form + + o Change in the NPB-MPI version + - Fixed an inconsistency in enforcing process count requirement in + different benchmarks. The environment variable NPB_NPROCS_STRICT + can be used to turn off the enforcement. + + o Change in the NPB-OMP version + - Fixed the report of Fortran compiler flag (F77 -> FC). + + - The blocking factor for FT can now be set via make option + "VERSION=blk" + + +------------------------------------------------------ +Changes in NPB3.4 + ( NPB3.4-MPI, NPB3.4-OMP ) +------------------------------------------------------ +[13-May-18] + +1. General + + - The serial version of NPBs (NPB-SER) is no longer included in + the distribution. The same functionality can be achieved by + the OpenMP version compiled with OpenMP disabled. + + - Version 3.4 uses Fortran modules and allocatable arrays to define + and manage global data (to replace common blocks), and Fortran 2003 + IEEE arithmetic function to catch the NaN condition during verification. . + + So, the version requires a compiler that supports these features. + Examples of a few compilers that are known to work: + Intel compiler v12+, GCC v5+, PGI v10+. + + - The environment variable NPB_TIMER_FLAG is now used to enable + additional timers. This method supersedes the use of the file + "timer.flag" in the working directory. + + - The MPIF77 or F77 flag in make.def is renamed to MPIFC or FC to match + with the fact that a Fortran 90 or newer compiler is required. + +2. MPI version + + - NPB3.4-MPI added the class E problem size for IS, and the class F + problem size for BT, LU, SP, CG, EP, FT, and MG. + + - Version 3.4 uses the dynamic memory allocation feature + in Fortran 90 so that separate compilations for different + process counts are no longer necessary. The number of processes + is solely determined and checked at runtime. + + - The LU benchmark improvement: + * Reduced memory usage for working arrays (a,b,c,d) in the solver. + This could improve performance in some cases. + + * Relaxed the number of processes allowed. For example, the square + number of processes (3x3=9) is now allowed. + + - The vector codes for the BT and LU benchmarks have been removed + due to the fact that these implementations were not portable and + successful vectorization highly depends on the compiler used. + +3. OMP version + + - Added the class E problem size for IS, and the class F problem + size for BT, LU, SP, CG, EP, FT, and MG. + + - Improved loop-level parallelism with the use of the OpenMP + COLLAPSE clause available since OpenMP 3.0. This version + requires an OpenMP compiler that supports this feature. + + - Changes specific to LU: + * The thread synchronization in the pipelined version of LU was + changed to use ATOMIC read/write available from OpenMP 3.0. + + * Re-introduced the hyperplane implementation of LU in the + distribution, which is accessible via the VERSION=HP make + option during compilation. + + * Included a third version of LU that uses the DOACROSS feature + of OpenMP 4.0. This version requires an OpenMP compiler that + supports this feature. + + - Changes specific to BT and SP: + * Data access in RHS has been improved for better performance. + + * Included a version with blocking factor in the solver to + improve cache performance. This version can be selected via + the VERSION=BLK make option during compilation and supersedes + the "vector" version that was introduced in version 3.3. + + - Changes specific to UA: + * Included a version that uses array reduction for atomic updates. + This version is selectable via the VERSION=rd make option + during compilation. + + +------------------------------------------------------ +Changes in NPB3.3.1 + (NPB3.3-SER, NPB3.3-OMP, NPB3.3-MPI ) +------------------------------------------------------ +[17-Feb-09] + +This is a bug fixing release of NPB3.3. + +1. All versions + + - sys/setparams.c: fixed a problem in dealing with quoted (") flags + from make.def when producing npbparams.h for C. + + - CG: ensure 'implicit none' used in all subroutines. + +2. MPI version + + - Additional timers can be used for profiling purpose, similar + to those already included in the OMP and SER versions. + + - LU: + * code clean up (suggested by Rob Van der Wijngaart) + > avoid using MPI_ANY_SOURCE in exchange_*.f, which might + alter performance in some cases. + > delete references to sethyper and 'icomm*', which are + no longer used since NPB2.2. + * change the low-bound limit on the sub-domain size in subdomain.f + from 4 to 3 in order to increase allowable process counts. + * allow number of processes other than power of two. + + - FT: fix a non-portable way of broadcasting input parameters + (pointed out by Art Lazanoff) + + - BT: include 'btio_cleanup' as part of the I/O timing + +3. OMP and SER versions + + - DC: fix access to out-of-bound array elements in adc.c + Reported by Per Larsen of Denmark + + - UA: fix the use of uninitialized array 'sje' in mortar_vertex() by + adding "call nr_init[_omp](sje,4*6*nelt,0)" in the main program. + + - MG, UA: include additional timers for profiling purpose. + + - Executables now use ".x" as a name extension + + +------------------------------------------------------ +Changes in NPB3.3 + (NPB3.3-SER, NPB3.3-OMP, NPB3.3-MPI ) +------------------------------------------------------ +[02-Aug-07] + +1. New and improvements + + - The Class E problem has been introduced in seven of the benchmarks + (BT, SP, LU, CG, MG, FT, and EP) in all three implementations. + + - The Class D problem has been added to the IS benchmark in all + three implementations. It requires the compiler support of + 64-bit "long" type in C. The MPI version of IS now allows runs + up to 1024 processes. + + - The Bucket Sort option (USE_BUCKETS) has been added to + the OpenMP version of IS and made as the default. + + - Introduced the "twiddle" array in the OpenMP FT benchmark, + which has been used in the MPI and SER versions and seems + to improve performance for larger problem sizes. + + - Merged vector codes for the BT and LU benchmarks into + the release. + + - Updates to BTIO (MPI/BT with IO subtypes): + * added I/O stats (I/O timing, data size written, I/O data rate) + * added an option for interleaving reads between writes through + the inputbt.data file. Although the data file size would be + smaller as a result, the total amount of data written is still + the same. + + - Made documents more consistent throughout different versions + (README and README.install). + +2. Bug fixes + + - MPI/FT: fixed a verification failure for cases where NX/=NY + and the 2D decomposition are used. The bug occurred at least + for (Class D, NPROCS=2048) and (Class B, NPROCS=512). + + fixed an output printing format problem occurred when + the number of processes >= 1000. + + - MPI/SP: fixed a performance regression due to improper + padding of array dimensions. + + - MPI/IS: minor fix to support large processor counts (>=512). + + - OMP/UA: fixed a race condition in mason.f, avoided the use + of the LASTPRIVATE directive. + + - OMP/LU: minor fix in data flushing for pipelining. + + - DC: There are a number of fixes - + * fixed segmentation fault in both OMP and SER versions + caused by accessing zero-length array elements. + Reported by Jeff Odom . + + * fixed a race in reporting benchmark timing in the OMP version + + * fixed the use of timer in the OMP version, which limited + the number of threads to 64. The number of threads is now + lifted to a maximum of MAX_NUMBER_OF_TASKS (=256). + + * made the benchmark output consistent with other NPBs. + + - fixed a use of uninitialized variable in MPI/sys/setparams.c. + setparams in all three versions was updated to deal with + make.def that contains carriage-return character ('\r'). + + - SER/FT: added 'implicit none' to all missing places. + + - SER/IS: fixed missing variable declarations for the Bucket + Sort option (when USE_BUCKETS is defined). + +3. Others + + - The default value for collbuf_nodes in the BT I/O benchmark + is now set to 0, indicating no file hints will be used. + The setting can be changed by using the "inputbt.data" file. + + - The hyperplane version of LU (LU-HP) is no longer included + in the distribution. + + +------------------------------------------------------ +Changes in NPB3.2.1 + (NPB3.2-SER, NPB3.2-OMP, NPB3.2-MPI ) +------------------------------------------------------ +[27-Jul-05] + +This is a bug fixing release of NPB3.2. + +1. MPI version + - sys/setparams.c: removed a duplicated statement for writing + FT parameters and made invalid SUBTYPE as an error condition. + The 'duplicated statement' problem was fixed in NPB3.2 (See + the note below). However, during the final updating process, + the fix was left out, even though the log file was updated. + + - BT: included SUBTYPE=EPIO in the I/O verification. + + - LU: bcast_inputs.f: fixed wrong data type (dp_type) used for + communicating integers (nx0,ny0,nz0) with the correct type + MPI_INTEGER. + + - MG: fixed a mis-calculation of parameter "nr" in globals.h + that caused run-time failure for NPROCS >= 512 + (reported by Donald Ferry of Cray). Expanded to limit to + 131072 processes and added an error checking code. + + The use of MPI_ANY_SOURCE for MPI_Irecv inside subroutine + ready() could cause MPI_Wait return a message meant for + the wrong k. The problem is fixed with nbr(axis,-dir,k) + in place of MPI_ANY_SOURCE in the call to MPI_Irecv + (reported and suggested by Hideo Saito). + +2. OpenMP version + - EP: use THREADPRIVATE for working array storage. It should not + change performance but made some compiler happier. + + - LU: add variable "v" to FLUSH to ensure solution data properly + flushed for pipeline. This change is needed according to + the OpenMP 2.5 standard. + + - IS: reorganized working buffers so that the count for key + population could be more naturally performed. This version + uses much less stack space. + + - UA: implemented atomic updates with locks in order to achieve + better scaling on those systems that have an inefficient + (or even buggy) ATOMIC implementation. + + +------------------------------------------------------ +Changes in NPB3.2 + (NPB3.2-SER, NPB3.2-OMP, NPB3.2-MPI ) +------------------------------------------------------ +[07-Jan-05] + +1. DC version in NPB3.2-SER was converted to C from C++ + (CLASSES S, W, A, B). + sys/setparams.c file was changed appropriately. + +2. OpenMP version of DC was added to NPB3.2-OMP. + +3. Data Traffic benchmark DT was added to NPB3.2-MPI. + +[24-May-04] + +All versions: + - use assumed shape "(*)" declaration in CG + - fixed the use of an uninitialized variable in EP + - avoid using integer array for assumed shape dimensions in FT + - fix in UA: + * fix the reference to file "inputua.data" + * avoid overindexing + * avoid reference to out-of-bound array elements + * change declaration "real*8" to "double precision" + +OMP version: + - explicitly added "SCHEDULE(STATIC)" to the OMP version + - use the "omp_get_wtime()" function for timer if available + - removed the call to "getenv" for portability + - change in UA: + * implemented an alternative approach for atomic update + +MPI version: + - removed a duplicated declaration in FT (from setparams.c) + - removed a duplicated declaration in BT/full_mpiio.f + - fixed a missing "NPROCS=" in sys/suite.awk + + +------------------------------------------------------ +Changes in NPB3.1 + (NPB3.1-MPI, NPB3.1-SER, NPB3.1-OMP) +------------------------------------------------------ +[22-Apr-04] NPB3.1-MPI + +Merged the NPB2.4-MPI branch into NPB3.1 with the following changes. + + - Optimized the BT memory usage. The new version is about 1/3 of + the memory used in NPB2.x. + - Fixed a bug in CG for running on a large number of processes + - Redefined the Class W size in MG so that the verification value + will not be too small. (see below for SER & OMP versions) + - Use the relative errors for verification in both CG and MG + - Fixed a race in 'make suite' + +[08-Apr-04] NPB3.1-SER and NPB3.1-OMP + +The following changes are made in both NPB3.1-SER and NPB3.1-OMP. + +1. Added the Class D problem + - verification values taken from NPB2.4-MPI + - modified variables to fit in large problem + +2. Improvements for LU and LU-HP: + - reduced the memory usage for the 'tv' variable in LU and LU-HP + - a more efficient memory access for variables "a,b,c,d" in LU-HP + - a dummy iteration added before the time step loop for consistency + with other benchmarks + +3. Improvement and fix in MG: + - verification in MG now uses the relative error + (instead of the absolute error). This will avoid incorrect + verification for small reference values. + - redefined the class size for Class W so that the verification + value will not be too small. + In version 3.0 and earlier: 64x64x64, 40 iters + New size in version 3.1 : 128x128x128, 4 iters + - fixed incorrect verification values for Classes A and C. + +4. CG: + - use relative error for verification + - clean up codes for matrix initialization (makea). + The new code uses about 1/2 memory of the previous version. + +5. Fixed makefile related issues + - fixed dependence on make.def for files in common. + - fixed a race in 'make suite' + - added 'LU-HP' as a valid benchmark option in makefiles + +The following changes are made in NPB3.1-OMP. + +1. Included a hyper-plane version of the LU benchmark: LU-HP + - based on the serial version + +2. The dummy 'omp_lib_dum' library is no longer used for compilation + without an OpenMP compiler. Conditional compilation is now used. + +3. Parallelization of the initialization part of MG. + It improves the turn-around time quite a bit for the larger + classes, such as class D. + +4. Parallelize codes for matrix initialization (makea) in CG. + The new code uses about 2/3 memory of the version in NPB3.0-OMP. + +5. Code clean up in SP so that the structure is more consistent + with the serial version. + + + +------------------------------------------------------ +Changes in NPB2.x MPI version +------------------------------------------------------ + +Changes in 2.4.1 +- fixed error in BT/Makefile (replaced "==" with "=") +- added stub function accumulate_norms in BT/btio.f +- changed type of Class B verification constants in BT/verify.f from + single to double precision + +Changes in 2.4 +- Added I/O benchmark (subtype of BT). +- Added Class D for all benchmarks except IS. +- Reduced size of tabulated exponentials in FT. +- Made minor changes to FT to prevent integer overflow for class D on + systems with 32-bit integers. FT class D will not run on small + numbers of processors anymore. + + +------------------------------------------------------ +Changes in non-MPI versions of NPB (previously PBN3.0) + (NPB3.0-SER, NPB3.0-HPF, NPB3.0-OMP, NPB3.0-JAV) +------------------------------------------------------ + +[01-Mar-99] Initial Beta Release. + +[06-Apr-99] Based on report from Charles Grassl and Ramesh Menon (SGI). + + 1. NPB-SER, FT: file auxfnct.f - + lines 74 and 75 were interchanged: + + double complex u0(d1+1,d2,d3), tmp(maxdim) + integer d1,d2,d3 + + 2. NPB-OMP: The OpenMP standards requires reduction variable be scalars, + thus, changes made to remove the use of array variable for reduction. + Relevant modifications in EP, CG, LU, SP, and BT + + 3. NPB-OMP: Remove compiler warnings of "Referenced scalar variables + use defaults" by declaring explicitly as shared. + Relevant modifications in FT, LU, and BT + + 4. NPB-OMP, README.openmp: Explicitly spell out the requirement of + the static scheduling (setenv OMP_SCHEDULE "static"). + + +[05-Oct-99] NPB3.0-non-MPI Beta Release (02) + +General change to all (NPB-SER, NPB-HPF, NPB-OMP) - + 1. Update header information for all benchmarks. + + 2. Allow continuation lines in 'make.def' (modification done + in sys/setparams.c). + +Change made in NPB-OMP - + 1. 'print_results' now prints Number-Of-Threads and Mflops/s/thread. + The printed number is the activated threads during the run, which + may not be the same as what's requested. + + 2. A initial data touch loop for array A is added in CG. + + 3. 'CRITICAL' section is used for reduction with array. + Relevant changes in EP, CG, LU, SP, and BT. + + 4. Reconfigure 'make.def' such that 'omp_lib_dum' can be activated + from the file for no directive compilation. + + 5. The "!$OMP END DO" seems needed before "!$OMP MASTER" in rhs.f + for both BT and SP for some f90 compilers. + + 6. "SCHEDULE(STATIC)" are used for the pipeline in LU to ensure + compliance with the OMP standard. + +Change made in NPB-HPF - + 1. 'print_results' now prints Number-Of-Processes and Mflops/s/process. + + 2. Use more consistent output format (via print_results). + + 3. More consistent makefiles (via config/make.def). + + +[04-Apr-00] NPB3.0-non-MPI Beta Release (03) + +Change made in NPB-OMP - + 1. The OpenMP-C version of IS has been added, including more timers. + + 2. 'cprint_results' includes Number-Of-Threads and Mflops/s/thread. + +Change made in NPB-SER - + 1. More timers included in IS. + +NPB-JAV has been included in NPB3.0-non-MPI. + + +[31-May-01] NPB3.0-non-MPI Beta Release (04) + +Change made in NPB-OMP - + 1. NPB-OMP/LU: Failure in verification for number of threads greater + than the problem size is now fixed. + + 2. If OMP_NUM_THREADS is unset, the printout will report as "unset" + instead of "1" + + 3. NPB-OMP/IS: Allocating work_buff on the stack seems to cause problem + for large problem size (CLASS C). "work_buff" is now allocated + by "malloc" on the heap for CLASS C. + + 4. NPB-OMP/IS: Reported by - potential + synchronization problem could arise due to the use of "static" + variables inside "randlc()". Declaration of these static variables + are moved out of randlc() and put in the threadprivate directive. + +General change to all (NPB-SER, NPB-HPF, NPB-OMP) - + 1. Cleanup in makefiles + + +[28-Aug-02] The Official NPB3.0 Release + +Change made in all - + 1. Fixed a bogus verification for "NaN". + + 2. Name change from "PBN3.0" to "NPB3.0". Updated all the banners. + + 3. NPB-SER/FT: use a derived version from NPB2.3-serial. + + 4. NPB-HPF/FT: use a consistent printing format. diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-HPF.README b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-HPF.README new file mode 100644 index 000000000..ff1e508d2 --- /dev/null +++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-HPF.README @@ -0,0 +1,4 @@ +The HPF version of NPB is not included in this distribution. +Please download it from NPB3.0 instead. + +http://www.nas.nasa.gov/Software/NPB diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-JAV.README b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-JAV.README new file mode 100644 index 000000000..b36e68676 --- /dev/null +++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-JAV.README @@ -0,0 +1,4 @@ +The Java version of NPB is not included in this distribution. +Please download it from NPB3.0 instead. + +http://www.nas.nasa.gov/Software/NPB diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/Makefile b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/Makefile new file mode 100644 index 000000000..e8439220a --- /dev/null +++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/Makefile @@ -0,0 +1,94 @@ +SHELL=/bin/sh +BENCHMARK=bt +BENCHMARKU=BT +VEC= + +include ../config/make.def + + +OBJS = bt.o bt_data.o make_set.o initialize.o exact_solution.o \ + exact_rhs.o set_constants.o adi.o define.o copy_faces.o \ + rhs.o solve_subs.o x_solve$(VEC).o y_solve$(VEC).o z_solve$(VEC).o \ + add.o error.o verify.o setup_mpi.o mpinpb.o \ + ${COMMON}/get_active_nprocs.o \ + ${COMMON}/print_results.o ${COMMON}/timers.o + +include ../sys/make.common + +# npbparams.h is included by bt_data module (via bt_data.o) + +${PROGRAM}: config + @if [ x$(VERSION) = xvec ] ; then \ + ${MAKE} VEC=_vec exec; \ + elif [ x$(VERSION) = xVEC ] ; then \ + ${MAKE} VEC=_vec exec; \ + else \ + ${MAKE} exec; \ + fi + +exec: $(OBJS) + @if [ x$(SUBTYPE) = xfull -o x$(SUBTYPE) = xFULL ] ; then \ + ${MAKE} bt-full; \ + elif [ x$(SUBTYPE) = xsimple -o x$(SUBTYPE) = xSIMPLE ] ; then \ + ${MAKE} bt-simple; \ + elif [ x$(SUBTYPE) = xfortran -o x$(SUBTYPE) = xFORTRAN ] ; then \ + ${MAKE} bt-fortran; \ + elif [ x$(SUBTYPE) = xepio -o x$(SUBTYPE) = xEPIO ] ; then \ + ${MAKE} bt-epio; \ + else \ + ${MAKE} bt-bt; \ + fi + +bt-bt: ${OBJS} btio.o + ${FLINK} ${FLINKFLAGS} -o ${PROGRAM} ${OBJS} btio.o ${FMPI_LIB} + +bt-full: ${OBJS} full_mpiio.o btio_common.o + ${FLINK} ${FLINKFLAGS} -o ${PROGRAM}.mpi_io_full ${OBJS} btio_common.o full_mpiio.o ${FMPI_LIB} + +bt-simple: ${OBJS} simple_mpiio.o btio_common.o + ${FLINK} ${FLINKFLAGS} -o ${PROGRAM}.mpi_io_simple ${OBJS} btio_common.o simple_mpiio.o ${FMPI_LIB} + +bt-fortran: ${OBJS} fortran_io.o btio_common.o + ${FLINK} ${FLINKFLAGS} -o ${PROGRAM}.fortran_io ${OBJS} btio_common.o fortran_io.o ${FMPI_LIB} + +bt-epio: ${OBJS} epio.o btio_common.o + ${FLINK} ${FLINKFLAGS} -o ${PROGRAM}.ep_io ${OBJS} btio_common.o epio.o ${FMPI_LIB} + +.f90.o: + ${FCOMPILE} $< + +.c.o: + ${CCOMPILE} $< + + +bt.o: bt.f90 bt_data.o mpinpb.o +make_set.o: make_set.f90 bt_data.o mpinpb.o +initialize.o: initialize.f90 bt_data.o +exact_solution.o: exact_solution.f90 bt_data.o +exact_rhs.o: exact_rhs.f90 bt_data.o +set_constants.o: set_constants.f90 bt_data.o +adi.o: adi.f90 bt_data.o +define.o: define.f90 bt_data.o +copy_faces.o: copy_faces.f90 bt_data.o mpinpb.o +rhs.o: rhs.f90 bt_data.o +x_solve$(VEC).o: x_solve$(VEC).f90 bt_data.o mpinpb.o +y_solve$(VEC).o: y_solve$(VEC).f90 bt_data.o mpinpb.o +z_solve$(VEC).o: z_solve$(VEC).f90 bt_data.o mpinpb.o +solve_subs.o: solve_subs.f90 +add.o: add.f90 bt_data.o +error.o: error.f90 bt_data.o mpinpb.o +verify.o: verify.f90 bt_data.o mpinpb.o +setup_mpi.o: setup_mpi.f90 bt_data.o mpinpb.o +btio.o: btio.f90 bt_data.o +btio_common.o: btio_common.f90 bt_data.o mpinpb.o +fortran_io.o: fortran_io.f90 bt_data.o mpinpb.o +simple_mpiio.o: simple_mpiio.f90 bt_data.o mpinpb.o +full_mpiio.o: full_mpiio.f90 bt_data.o mpinpb.o +epio.o: epio.f90 bt_data.o mpinpb.o +bt_data.o: bt_data$(VEC).f90 mpinpb.o npbparams.h + ${FCOMPILE} -o $@ bt_data$(VEC).f90 +mpinpb.o: mpinpb.f90 + +clean: + - rm -f *.o *.mod *~ mputil* + - rm -f npbparams.h core diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/add.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/add.f90 new file mode 100644 index 000000000..f8dd37913 --- /dev/null +++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/add.f90 @@ -0,0 +1,31 @@ +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine add + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + +!--------------------------------------------------------------------- +! addition of update to the vector u +!--------------------------------------------------------------------- + + use bt_data + implicit none + + integer c, i, j, k, m + + do c = 1, ncells + do k = start(3,c), cell_size(3,c)-end(3,c)-1 + do j = start(2,c), cell_size(2,c)-end(2,c)-1 + do i = start(1,c), cell_size(1,c)-end(1,c)-1 + do m = 1, 5 + u(m,i,j,k,c) = u(m,i,j,k,c) + rhs(m,i,j,k,c) + enddo + enddo + enddo + enddo + enddo + + return + end diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/adi.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/adi.f90 new file mode 100644 index 000000000..78025ad8e --- /dev/null +++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/adi.f90 @@ -0,0 +1,21 @@ +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine adi + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + call copy_faces + + call x_solve + + call y_solve + + call z_solve + + call add + + return + end + diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/bt.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/bt.f90 new file mode 100644 index 000000000..f2f1ea9f9 --- /dev/null +++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/bt.f90 @@ -0,0 +1,349 @@ +!-------------------------------------------------------------------------! +! ! +! N A S P A R A L L E L B E N C H M A R K S 3.4 ! +! ! +! B T ! +! ! +!-------------------------------------------------------------------------! +! ! +! This benchmark is part of the NAS Parallel Benchmark 3.4 suite. ! +! It is described in NAS Technical Reports 95-020 and 02-007. ! +! ! +! Permission to use, copy, distribute and modify this software ! +! for any purpose with or without fee is hereby granted. We ! +! request, however, that all derived work reference the NAS ! +! Parallel Benchmarks 3.4. This software is provided "as is" ! +! without express or implied warranty. ! +! ! +! Information on NPB 3.4, including the technical report, the ! +! original specifications, source code, results and information ! +! on how to submit new results, is available at: ! +! ! +! http://www.nas.nasa.gov/Software/NPB/ ! +! ! +! Send comments or suggestions to npb@nas.nasa.gov ! +! ! +! NAS Parallel Benchmarks Group ! +! NASA Ames Research Center ! +! Mail Stop: T27A-1 ! +! Moffett Field, CA 94035-1000 ! +! ! +! E-mail: npb@nas.nasa.gov ! +! Fax: (650) 604-3957 ! +! ! +!-------------------------------------------------------------------------! + +!--------------------------------------------------------------------- +! +! Authors: R. F. Van der Wijngaart +! T. Harris +! M. Yarrow +! +!--------------------------------------------------------------------- + +!--------------------------------------------------------------------- + program MPBT +!--------------------------------------------------------------------- + + use bt_data + use mpinpb + + implicit none + + integer i, niter, step, c, error, fstatus + double precision navg, mflops, mbytes, n3 + + external timer_read + double precision t, tmax, iorate(2), tpc, timer_read + logical verified + character class, cbuff*40 + double precision t1(t_last), tsum(t_last), & + & tming(t_last), tmaxg(t_last) + character t_recs(t_last)*8 + + integer wr_interval + + data t_recs/'total', 'i/o', 'rhs', 'xsolve', 'ysolve', 'zsolve', & + & 'bpack', 'exch', 'xcomm', 'ycomm', 'zcomm', & + & ' totcomp', ' totcomm'/ + + call setup_mpi + if (.not. active) goto 999 + +!--------------------------------------------------------------------- +! Root node reads input file (if it exists) else takes +! defaults from parameters +!--------------------------------------------------------------------- + if (node .eq. root) then + + write(*, 1000) + + call check_timer_flag( timeron ) + + open (unit=2,file='inputbt.data',status='old', iostat=fstatus) +! + rd_interval = 0 + if (fstatus .eq. 0) then + write(*,233) + 233 format(' Reading from input file inputbt.data') + read (2,*) niter + read (2,*) dt + read (2,*) grid_points(1), grid_points(2), grid_points(3) + if (iotype .ne. 0) then + read (2,'(A)') cbuff + read (cbuff,*,iostat=i) wr_interval, rd_interval + if (i .ne. 0) rd_interval = 0 + if (wr_interval .le. 0) wr_interval = wr_default + endif + if (iotype .eq. 1) then + read (2,*) collbuf_nodes, collbuf_size + write(*,*) 'collbuf_nodes ', collbuf_nodes + write(*,*) 'collbuf_size ', collbuf_size + endif + close(2) + else + write(*,234) + niter = niter_default + dt = dt_default + grid_points(1) = problem_size + grid_points(2) = problem_size + grid_points(3) = problem_size + wr_interval = wr_default + if (iotype .eq. 1) then +! set number of nodes involved in collective buffering to 4, +! unless total number of nodes is smaller than that. +! set buffer size for collective buffering to 1MB per node +! collbuf_nodes = min(4,no_nodes) +! set default to No-File-Hints with a value of 0 + collbuf_nodes = 0 + collbuf_size = 1000000 + endif + endif + 234 format(' No input file inputbt.data. Using compiled defaults') + + call set_class(niter, class) + + write(*, 1001) grid_points(1), grid_points(2), grid_points(3), & + & class + write(*, 1002) niter, dt + write(*, 1003) total_nodes + if (no_nodes .ne. total_nodes) write(*, 1004) no_nodes + write(*, *) + + if (iotype .eq. 1) write(*, 1006) 'FULL MPI-IO', wr_interval + if (iotype .eq. 2) write(*, 1006) 'SIMPLE MPI-IO', wr_interval + if (iotype .eq. 3) write(*, 1006) 'EPIO', wr_interval + if (iotype .eq. 4) write(*, 1006) 'FORTRAN IO', wr_interval + + 1000 format(//, ' NAS Parallel Benchmarks 3.4 -- BT Benchmark',/) + 1001 format(' Size: ', i4, 'x', i4, 'x', i4, ' (class ', a, ')' ) + 1002 format(' Iterations: ', i4, ' dt: ', F11.7) + 1003 format(' Total number of processes: ', i6) + 1004 format(' WARNING: Number of processes is not a square number', & + & ' (', i0, ' active)') + 1006 format(' BTIO -- ', A, ' write interval: ', i3 /) + + endif + + call mpi_bcast(niter, 1, MPI_INTEGER, & + & root, comm_setup, error) + + call mpi_bcast(dt, 1, dp_type, & + & root, comm_setup, error) + + call mpi_bcast(grid_points(1), 3, MPI_INTEGER, & + & root, comm_setup, error) + + call mpi_bcast(wr_interval, 1, MPI_INTEGER, & + & root, comm_setup, error) + + call mpi_bcast(rd_interval, 1, MPI_INTEGER, & + & root, comm_setup, error) + + call mpi_bcast(timeron, 1, MPI_LOGICAL, & + & root, comm_setup, error) + + call alloc_space + + call make_set + + do c = 1, maxcells + if ( (cell_size(1,c) .gt. IMAX) .or. & + & (cell_size(2,c) .gt. JMAX) .or. & + & (cell_size(3,c) .gt. KMAX) ) then + print *,node, c, (cell_size(i,c),i=1,3) + print *,' Problem size too big for compiled array sizes' + goto 999 + endif + end do + + do i = 1, t_last + call timer_clear(i) + end do + + call set_constants + + call initialize + + call setup_btio + idump = 0 + + call lhsinit + + call exact_rhs + + call compute_buffer_size(5) + +!--------------------------------------------------------------------- +! do one time step to touch all code, and reinitialize +!--------------------------------------------------------------------- + call adi + call initialize + +!--------------------------------------------------------------------- +! Synchronize before placing time stamp +!--------------------------------------------------------------------- + do i = 1, t_last + call timer_clear(i) + end do + call mpi_barrier(comm_setup, error) + + call timer_start(1) + + do step = 1, niter + + if (node .eq. root) then + if (mod(step, 20) .eq. 0 .or. step .eq. niter .or. & + & step .eq. 1) then + write(*, 200) step + 200 format(' Time step ', i4) + endif + endif + + call adi + + if (iotype .ne. 0) then + if (mod(step, wr_interval).eq.0 .or. step .eq. niter) then + if (node .eq. root) then + print *, 'Writing data set, time step', step + endif + if (step .eq. niter .and. rd_interval .gt. 1) then + rd_interval = 1 + endif + call timer_start(2) + call output_timestep + call timer_stop(2) + idump = idump + 1 + endif + endif + end do + + call timer_start(2) + call btio_cleanup + call timer_stop(2) + + call timer_stop(1) + t = timer_read(1) + t1(1) = timer_read(t_enorm) + + call timer_clear(t_enorm) + call verify(class, verified) + + call mpi_reduce(t, tmax, 1, & + & dp_type, MPI_MAX, & + & root, comm_setup, error) + + if (iotype .ne. 0) then + n3 = 0.d0 + do c = 1,ncells + n3 = n3 + dble(cell_size(1,c)) * cell_size(2,c) * cell_size(3,c) + end do + mbytes = n3 * 40.0 * idump * 1.0d-6 + t1(2) = timer_read(t_enorm) + do i = 1, 2 + if (i .eq. 1) then + t = timer_read(t_io) + else + t = timer_read(t_iov) + endif + t = t - t1(i) ! remove enorm time + if (t .ne. 0.d0) t = mbytes / t ! rate MB/s + t1(i) = t + end do + if (rd_interval .gt. 0) t1(1) = t1(1) * 2 + call mpi_reduce(t1, iorate, 2, & + & dp_type, MPI_SUM, & + & root, comm_setup, error) + endif + + if( node .eq. root ) then + n3 = dble(grid_points(1))*grid_points(2)*grid_points(3) + navg = (grid_points(1)+grid_points(2)+grid_points(3))/3.d0 + if( tmax .ne. 0. ) then + mflops = 1.0d-6*dble(niter)* & + & (3478.8*n3-17655.7*navg**2+28023.7*navg) & + & / tmax + else + mflops = 0.d0 + endif + + if (iotype .ne. 0) then + mbytes = n3 * 40.0 * idump * 1.0d-6 + do i = 1, 2 + t1(i) = 0.0 + if (iorate(i) .ne. 0.d0) t1(i) = mbytes / iorate(i) + end do + if (rd_interval .gt. 0) t1(1) = t1(1) * 2 + tpc = 0.0 + if (tmax .ne. 0.) tpc = t1(1) * 100.0 / tmax + write(*,1100) t1(1), tpc, t1(2), mbytes, iorate(1) + 1100 format(/' BTIO -- statistics:'/ & + & ' I/O timing in seconds : ', f14.2/ & + & ' I/O timing percentage : ', f14.2/ & + & ' I/O timing in verify : ', f14.2/ & + & ' Total data written (MB) : ', f14.2/ & + & ' I/O data rate (MB/sec) : ', f14.2) + endif + + call print_results('BT', class, grid_points(1), & + & grid_points(2), grid_points(3), niter, no_nodes, & + & total_nodes, tmax, mflops, ' floating point', & + & verified, npbversion,compiletime, cs1, cs2, cs3, cs4, cs5, & + & cs6, '(none)') + endif + + if (.not.timeron) goto 999 + + do i = 1, t_zcomm + t1(i) = timer_read(i) + end do + t1(t_xsolve) = t1(t_xsolve) - t1(t_xcomm) + t1(t_ysolve) = t1(t_ysolve) - t1(t_ycomm) + t1(t_zsolve) = t1(t_zsolve) - t1(t_zcomm) + t1(t_comm) = t1(t_xcomm)+t1(t_ycomm)+t1(t_zcomm)+t1(t_exch) + t1(t_comp) = t1(t_total) - t1(t_comm) + + call MPI_Reduce(t1, tsum, t_last, dp_type, MPI_SUM, & + & 0, comm_setup, error) + call MPI_Reduce(t1, tming, t_last, dp_type, MPI_MIN, & + & 0, comm_setup, error) + call MPI_Reduce(t1, tmaxg, t_last, dp_type, MPI_MAX, & + & 0, comm_setup, error) + + if (node .eq. 0) then + write(*, 800) no_nodes + do i = 1, t_last + tsum(i) = tsum(i) / no_nodes + write(*, 810) i, t_recs(i), tming(i), tmaxg(i), tsum(i) + end do + endif + 800 format(' nprocs =', i6, 11x, 'minimum', 5x, 'maximum', & + & 5x, 'average') + 810 format(' timer ', i2, '(', A8, ') :', 3(2x,f10.4)) + + 999 continue + call mpi_barrier(MPI_COMM_WORLD, error) + call mpi_finalize(error) + + end + diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/bt_data.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/bt_data.f90 new file mode 100644 index 000000000..750ed8fb7 --- /dev/null +++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/bt_data.f90 @@ -0,0 +1,193 @@ +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- +! +! bt_data module +! +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + module bt_data + +!--------------------------------------------------------------------- +! The following include file is generated automatically by the +! "setparams" utility. It defines +! maxcells: the square root of the maximum number of processors +! problem_size: 12, 64, 102, 162 (for class S, A, B, C) +! dt_default: default time step for this problem size if no +! config file +! niter_default: default number of iterations for this problem size +!--------------------------------------------------------------------- + + include 'npbparams.h' + + integer aa, bb, cc, BLOCK_SIZE + parameter (aa=1, bb=2, cc=3, BLOCK_SIZE=5) + + integer ncells, grid_points(3) + double precision elapsed_time + + double precision tx1, tx2, tx3, ty1, ty2, ty3, tz1, tz2, tz3, & + & dx1, dx2, dx3, dx4, dx5, dy1, dy2, dy3, dy4, & + & dy5, dz1, dz2, dz3, dz4, dz5, dssp, dt, & + & ce(5,13), dxmax, dymax, dzmax, xxcon1, xxcon2, & + & xxcon3, xxcon4, xxcon5, dx1tx1, dx2tx1, dx3tx1, & + & dx4tx1, dx5tx1, yycon1, yycon2, yycon3, yycon4, & + & yycon5, dy1ty1, dy2ty1, dy3ty1, dy4ty1, dy5ty1, & + & zzcon1, zzcon2, zzcon3, zzcon4, zzcon5, dz1tz1, & + & dz2tz1, dz3tz1, dz4tz1, dz5tz1, dnxm1, dnym1, & + & dnzm1, c1c2, c1c5, c3c4, c1345, conz1, c1, c2, & + & c3, c4, c5, c4dssp, c5dssp, dtdssp, dttx1, bt, & + & dttx2, dtty1, dtty2, dttz1, dttz2, c2dttx1, & + & c2dtty1, c2dttz1, comz1, comz4, comz5, comz6, & + & c3c4tx3, c3c4ty3, c3c4tz3, c2iv, con43, con16 + + integer EAST, WEST, NORTH, SOUTH, & + & BOTTOM, TOP + + parameter (EAST=2000, WEST=3000, NORTH=4000, SOUTH=5000, & + & BOTTOM=6000, TOP=7000) + + integer maxcells, IMAX, JMAX, KMAX, MAX_CELL_DIM, BUF_SIZE + + integer predecessor(3), successor(3), grid_size(3) + integer, allocatable :: & + & cell_coord (:,:), cell_low (:,:), & + & cell_high (:,:), cell_size(:,:), & + & start (:,:), end (:,:), & + & slice (:,:) + + double precision, allocatable :: & + & us ( :,:,:,:), & + & vs ( :,:,:,:), & + & ws ( :,:,:,:), & + & qs ( :,:,:,:), & + & rho_i ( :,:,:,:), & + & square ( :,:,:,:), & + & forcing ( :,:,:,:,:), & + & u ( :,:,:,:,:), & + & rhs ( :,:,:,:,:), & + & lhsc (:,:,:,:,:,:), & + & backsub_info(:,:,:,:), & + & in_buffer(:), out_buffer(:) + + double precision, allocatable :: & + & cv (:), rhon(:), & + & rhos(:), rhoq(:), & + & cuf (:), q (:), & + & ue(:,:), buf (:,:) + + double precision, allocatable :: & + & fjac(:, :, :), & + & njac(:, :, :), & + & lhsa(:, :, :), & + & lhsb(:, :, :) + + integer west_size, east_size, bottom_size, top_size, & + & north_size, south_size, start_send_west, & + & start_send_east, start_send_south, start_send_north, & + & start_send_bottom, start_send_top, start_recv_west, & + & start_recv_east, start_recv_south, start_recv_north, & + & start_recv_bottom, start_recv_top + + double precision tmp_block(5,5), b_inverse(5,5), tmp_vec(5) + +!--------------------------------------------------------------------- +! These are used by btio +!--------------------------------------------------------------------- + integer collbuf_nodes, collbuf_size, iosize, eltext, & + & combined_btype, fp, idump, record_length, element, & + & combined_ftype, idump_sub, rd_interval + double precision sum(niter_default), xce_sub(5) + integer(kind=8) :: iseek + + +!--------------------------------------------------------------------- +! Timer constants +!--------------------------------------------------------------------- + integer t_total, t_io, t_rhs, t_xsolve, t_ysolve, t_zsolve, & + & t_bpack, t_exch, t_xcomm, t_ycomm, t_zcomm, & + & t_comp, t_comm, t_enorm, t_iov, t_last + parameter (t_total=1, t_io=2, t_rhs=3, t_xsolve=4, t_ysolve=5, & + & t_zsolve=6, t_bpack=7, t_exch=8, t_xcomm=9, & + & t_ycomm=10, t_zcomm=11, t_comp=12, t_comm=13, & + & t_enorm=12, t_iov=13, t_last=13) + logical timeron + + end module bt_data + + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine alloc_space + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + +!--------------------------------------------------------------------- +! allocate space dynamically for data arrays +!--------------------------------------------------------------------- + + use bt_data + use mpinpb + + implicit none + + integer ios, ierr + + MAX_CELL_DIM = (problem_size/maxcells)+1 + + IMAX = MAX_CELL_DIM + JMAX = MAX_CELL_DIM + KMAX = MAX_CELL_DIM + + BUF_SIZE = MAX_CELL_DIM*MAX_CELL_DIM*(maxcells-1)*60+1 + + allocate ( & + & cell_coord (3,maxcells), cell_low (3,maxcells), & + & cell_high (3,maxcells), cell_size(3,maxcells), & + & start (3,maxcells), end (3,maxcells), & + & slice (3,maxcells), & + & stat = ios) + + if (ios .eq. 0) allocate ( & + & forcing (5, 0:IMAX-1, 0:JMAX-1, 0:KMAX-1, maxcells), & + & u (5, -2:IMAX+1,-2:JMAX+1,-2:KMAX+1, maxcells), & + & rhs (5, -1:IMAX-1,-1:JMAX-1,-1:KMAX-1, maxcells), & + & lhsc (5,5,-1:IMAX-1,-1:JMAX-1,-1:KMAX-1, maxcells), & + & backsub_info (5, 0:MAX_CELL_DIM, 0:MAX_CELL_DIM, maxcells), & + & in_buffer(BUF_SIZE), out_buffer(BUF_SIZE), & + & stat = ios) + + if (ios .eq. 0) allocate ( & + & cv (-2:MAX_CELL_DIM+1), rhon(-2:MAX_CELL_DIM+1), & + & rhos(-2:MAX_CELL_DIM+1), rhoq(-2:MAX_CELL_DIM+1), & + & cuf (-2:MAX_CELL_DIM+1), q(-2:MAX_CELL_DIM+1), & + & ue (-2:MAX_CELL_DIM+1,5), buf(-2:MAX_CELL_DIM+1,5), & + & stat = ios) + + if (ios .eq. 0) allocate ( & + & fjac(5, 5, -2:MAX_CELL_DIM+1), & + & njac(5, 5, -2:MAX_CELL_DIM+1), & + & lhsa(5, 5, -1:MAX_CELL_DIM), & + & lhsb(5, 5, -1:MAX_CELL_DIM), & + & stat = ios) + + if (ios .eq. 0) allocate ( & + & us (-1:IMAX, -1:JMAX, -1:KMAX, maxcells), & + & vs (-1:IMAX, -1:JMAX, -1:KMAX, maxcells), & + & ws (-1:IMAX, -1:JMAX, -1:KMAX, maxcells), & + & qs (-1:IMAX, -1:JMAX, -1:KMAX, maxcells), & + & rho_i (-1:IMAX, -1:JMAX, -1:KMAX, maxcells), & + & square(-1:IMAX, -1:JMAX, -1:KMAX, maxcells), & + & stat = ios) + + if (ios .ne. 0) then + write(*,*) 'Error encountered in allocating space' + call MPI_Abort(MPI_COMM_WORLD, MPI_ERR_OTHER, ierr) + stop + endif + + return + end + diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/bt_data_vec.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/bt_data_vec.f90 new file mode 100644 index 000000000..6f248dd04 --- /dev/null +++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/bt_data_vec.f90 @@ -0,0 +1,193 @@ +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- +! +! bt_data module +! +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + module bt_data + +!--------------------------------------------------------------------- +! The following include file is generated automatically by the +! "setparams" utility. It defines +! maxcells: the square root of the maximum number of processors +! problem_size: 12, 64, 102, 162 (for class S, A, B, C) +! dt_default: default time step for this problem size if no +! config file +! niter_default: default number of iterations for this problem size +!--------------------------------------------------------------------- + + include 'npbparams.h' + + integer aa, bb, cc, BLOCK_SIZE + parameter (aa=1, bb=2, cc=3, BLOCK_SIZE=5) + + integer ncells, grid_points(3) + double precision elapsed_time + + double precision tx1, tx2, tx3, ty1, ty2, ty3, tz1, tz2, tz3, & + & dx1, dx2, dx3, dx4, dx5, dy1, dy2, dy3, dy4, & + & dy5, dz1, dz2, dz3, dz4, dz5, dssp, dt, & + & ce(5,13), dxmax, dymax, dzmax, xxcon1, xxcon2, & + & xxcon3, xxcon4, xxcon5, dx1tx1, dx2tx1, dx3tx1, & + & dx4tx1, dx5tx1, yycon1, yycon2, yycon3, yycon4, & + & yycon5, dy1ty1, dy2ty1, dy3ty1, dy4ty1, dy5ty1, & + & zzcon1, zzcon2, zzcon3, zzcon4, zzcon5, dz1tz1, & + & dz2tz1, dz3tz1, dz4tz1, dz5tz1, dnxm1, dnym1, & + & dnzm1, c1c2, c1c5, c3c4, c1345, conz1, c1, c2, & + & c3, c4, c5, c4dssp, c5dssp, dtdssp, dttx1, bt, & + & dttx2, dtty1, dtty2, dttz1, dttz2, c2dttx1, & + & c2dtty1, c2dttz1, comz1, comz4, comz5, comz6, & + & c3c4tx3, c3c4ty3, c3c4tz3, c2iv, con43, con16 + + integer EAST, WEST, NORTH, SOUTH, & + & BOTTOM, TOP + + parameter (EAST=2000, WEST=3000, NORTH=4000, SOUTH=5000, & + & BOTTOM=6000, TOP=7000) + + integer maxcells, IMAX, JMAX, KMAX, MAX_CELL_DIM, BUF_SIZE + + integer predecessor(3), successor(3), grid_size(3) + integer, allocatable :: & + & cell_coord (:,:), cell_low (:,:), & + & cell_high (:,:), cell_size(:,:), & + & start (:,:), end (:,:), & + & slice (:,:) + + double precision, allocatable :: & + & us ( :,:,:,:), & + & vs ( :,:,:,:), & + & ws ( :,:,:,:), & + & qs ( :,:,:,:), & + & rho_i ( :,:,:,:), & + & square ( :,:,:,:), & + & forcing ( :,:,:,:,:), & + & u ( :,:,:,:,:), & + & rhs ( :,:,:,:,:), & + & lhsc (:,:,:,:,:,:), & + & backsub_info(:,:,:,:), & + & in_buffer(:), out_buffer(:) + + double precision, allocatable :: & + & cv (:), rhon(:), & + & rhos(:), rhoq(:), & + & cuf (:), q (:), & + & ue(:,:), buf (:,:) + + double precision, allocatable :: & + & fjac(:, :, :, :), & + & njac(:, :, :, :), & + & lhsa(:, :, :, :), & + & lhsb(:, :, :, :) + + integer west_size, east_size, bottom_size, top_size, & + & north_size, south_size, start_send_west, & + & start_send_east, start_send_south, start_send_north, & + & start_send_bottom, start_send_top, start_recv_west, & + & start_recv_east, start_recv_south, start_recv_north, & + & start_recv_bottom, start_recv_top + + double precision tmp_block(5,5), b_inverse(5,5), tmp_vec(5) + +!--------------------------------------------------------------------- +! These are used by btio +!--------------------------------------------------------------------- + integer collbuf_nodes, collbuf_size, iosize, eltext, & + & combined_btype, fp, idump, record_length, element, & + & combined_ftype, idump_sub, rd_interval + double precision sum(niter_default), xce_sub(5) + integer(kind=8) :: iseek + + +!--------------------------------------------------------------------- +! Timer constants +!--------------------------------------------------------------------- + integer t_total, t_io, t_rhs, t_xsolve, t_ysolve, t_zsolve, & + & t_bpack, t_exch, t_xcomm, t_ycomm, t_zcomm, & + & t_comp, t_comm, t_enorm, t_iov, t_last + parameter (t_total=1, t_io=2, t_rhs=3, t_xsolve=4, t_ysolve=5, & + & t_zsolve=6, t_bpack=7, t_exch=8, t_xcomm=9, & + & t_ycomm=10, t_zcomm=11, t_comp=12, t_comm=13, & + & t_enorm=12, t_iov=13, t_last=13) + logical timeron + + end module bt_data + + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine alloc_space + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + +!--------------------------------------------------------------------- +! allocate space dynamically for data arrays +!--------------------------------------------------------------------- + + use bt_data + use mpinpb + + implicit none + + integer ios, ierr + + MAX_CELL_DIM = (problem_size/maxcells)+1 + + IMAX = MAX_CELL_DIM + JMAX = MAX_CELL_DIM + KMAX = MAX_CELL_DIM + + BUF_SIZE = MAX_CELL_DIM*MAX_CELL_DIM*(maxcells-1)*60+1 + + allocate ( & + & cell_coord (3,maxcells), cell_low (3,maxcells), & + & cell_high (3,maxcells), cell_size(3,maxcells), & + & start (3,maxcells), end (3,maxcells), & + & slice (3,maxcells), & + & stat = ios) + + if (ios .eq. 0) allocate ( & + & forcing (5, 0:IMAX-1, 0:JMAX-1, 0:KMAX-1, maxcells), & + & u (5, -2:IMAX+1,-2:JMAX+1,-2:KMAX+1, maxcells), & + & rhs (5, -1:IMAX-1,-1:JMAX-1,-1:KMAX-1, maxcells), & + & lhsc (5,5,-1:IMAX-1,-1:JMAX-1,-1:KMAX-1, maxcells), & + & backsub_info (5, 0:MAX_CELL_DIM, 0:MAX_CELL_DIM, maxcells), & + & in_buffer(BUF_SIZE), out_buffer(BUF_SIZE), & + & stat = ios) + + if (ios .eq. 0) allocate ( & + & cv (-2:MAX_CELL_DIM+1), rhon(-2:MAX_CELL_DIM+1), & + & rhos(-2:MAX_CELL_DIM+1), rhoq(-2:MAX_CELL_DIM+1), & + & cuf (-2:MAX_CELL_DIM+1), q(-2:MAX_CELL_DIM+1), & + & ue (-2:MAX_CELL_DIM+1,5), buf(-2:MAX_CELL_DIM+1,5), & + & stat = ios) + + if (ios .eq. 0) allocate ( & + & fjac(5, 5, -2:MAX_CELL_DIM+1, -2:MAX_CELL_DIM+1), & + & njac(5, 5, -2:MAX_CELL_DIM+1, -2:MAX_CELL_DIM+1), & + & lhsa(5, 5, -1:MAX_CELL_DIM, -1:MAX_CELL_DIM), & + & lhsb(5, 5, -1:MAX_CELL_DIM, -1:MAX_CELL_DIM), & + & stat = ios) + + if (ios .eq. 0) allocate ( & + & us (-1:IMAX, -1:JMAX, -1:KMAX, maxcells), & + & vs (-1:IMAX, -1:JMAX, -1:KMAX, maxcells), & + & ws (-1:IMAX, -1:JMAX, -1:KMAX, maxcells), & + & qs (-1:IMAX, -1:JMAX, -1:KMAX, maxcells), & + & rho_i (-1:IMAX, -1:JMAX, -1:KMAX, maxcells), & + & square(-1:IMAX, -1:JMAX, -1:KMAX, maxcells), & + & stat = ios) + + if (ios .ne. 0) then + write(*,*) 'Error encountered in allocating space' + call MPI_Abort(MPI_COMM_WORLD, MPI_ERR_OTHER, ierr) + stop + endif + + return + end + diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/btio.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/btio.f90 new file mode 100644 index 000000000..3b36cca87 --- /dev/null +++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/btio.f90 @@ -0,0 +1,72 @@ + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine setup_btio + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + return + end + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine output_timestep + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + return + end + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine btio_cleanup + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + return + end + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine btio_verify(verified) + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + logical verified + + verified = .true. + + return + end + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine accumulate_norms(xce_acc) + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + double precision xce_acc(5) + + return + end + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine checksum_timestep + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + return + end diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/btio_common.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/btio_common.f90 new file mode 100644 index 000000000..f06bc1cd3 --- /dev/null +++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/btio_common.f90 @@ -0,0 +1,30 @@ +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine clear_timestep + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + use bt_data + implicit none + + integer cio, kio, jio, ix + + do cio=1,ncells + do kio=0, cell_size(3,cio)-1 + do jio=0, cell_size(2,cio)-1 + do ix=0,cell_size(1,cio)-1 + u(1,ix, jio,kio,cio) = 0 + u(2,ix, jio,kio,cio) = 0 + u(3,ix, jio,kio,cio) = 0 + u(4,ix, jio,kio,cio) = 0 + u(5,ix, jio,kio,cio) = 0 + enddo + enddo + enddo + enddo + + return + end + diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/copy_faces.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/copy_faces.f90 new file mode 100644 index 000000000..ff9ac3a35 --- /dev/null +++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/copy_faces.f90 @@ -0,0 +1,324 @@ +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine copy_faces + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + +!--------------------------------------------------------------------- +! +! This function copies the face values of a variable defined on a set +! of cells to the overlap locations of the adjacent sets of cells. +! Because a set of cells interfaces in each direction with exactly one +! other set, we only need to fill six different buffers. We could try to +! overlap communication with computation, by computing +! some internal values while communicating boundary values, but this +! adds so much overhead that it's not clearly useful. +!--------------------------------------------------------------------- + + use bt_data + use mpinpb + + implicit none + + integer i, j, k, c, m, requests(0:11), p0, p1, & + & p2, p3, p4, p5, b_size(0:5), ss(0:5), & + & sr(0:5), error, statuses(MPI_STATUS_SIZE, 0:11) + +!--------------------------------------------------------------------- +! exit immediately if there are no faces to be copied +!--------------------------------------------------------------------- + if (no_nodes .eq. 1) then + call compute_rhs + return + endif + + ss(0) = start_send_east + ss(1) = start_send_west + ss(2) = start_send_north + ss(3) = start_send_south + ss(4) = start_send_top + ss(5) = start_send_bottom + + sr(0) = start_recv_east + sr(1) = start_recv_west + sr(2) = start_recv_north + sr(3) = start_recv_south + sr(4) = start_recv_top + sr(5) = start_recv_bottom + + b_size(0) = east_size + b_size(1) = west_size + b_size(2) = north_size + b_size(3) = south_size + b_size(4) = top_size + b_size(5) = bottom_size + +!--------------------------------------------------------------------- +! because the difference stencil for the diagonalized scheme is +! orthogonal, we do not have to perform the staged copying of faces, +! but can send all face information simultaneously to the neighboring +! cells in all directions +!--------------------------------------------------------------------- + if (timeron) call timer_start(t_bpack) + p0 = 0 + p1 = 0 + p2 = 0 + p3 = 0 + p4 = 0 + p5 = 0 + + do c = 1, ncells + +!--------------------------------------------------------------------- +! fill the buffer to be sent to eastern neighbors (i-dir) +!--------------------------------------------------------------------- + if (cell_coord(1,c) .ne. ncells) then + do k = 0, cell_size(3,c)-1 + do j = 0, cell_size(2,c)-1 + do i = cell_size(1,c)-2, cell_size(1,c)-1 + do m = 1, 5 + out_buffer(ss(0)+p0) = u(m,i,j,k,c) + p0 = p0 + 1 + end do + end do + end do + end do + endif + +!--------------------------------------------------------------------- +! fill the buffer to be sent to western neighbors +!--------------------------------------------------------------------- + if (cell_coord(1,c) .ne. 1) then + do k = 0, cell_size(3,c)-1 + do j = 0, cell_size(2,c)-1 + do i = 0, 1 + do m = 1, 5 + out_buffer(ss(1)+p1) = u(m,i,j,k,c) + p1 = p1 + 1 + end do + end do + end do + end do + + endif + +!--------------------------------------------------------------------- +! fill the buffer to be sent to northern neighbors (j_dir) +!--------------------------------------------------------------------- + if (cell_coord(2,c) .ne. ncells) then + do k = 0, cell_size(3,c)-1 + do j = cell_size(2,c)-2, cell_size(2,c)-1 + do i = 0, cell_size(1,c)-1 + do m = 1, 5 + out_buffer(ss(2)+p2) = u(m,i,j,k,c) + p2 = p2 + 1 + end do + end do + end do + end do + endif + +!--------------------------------------------------------------------- +! fill the buffer to be sent to southern neighbors +!--------------------------------------------------------------------- + if (cell_coord(2,c).ne. 1) then + do k = 0, cell_size(3,c)-1 + do j = 0, 1 + do i = 0, cell_size(1,c)-1 + do m = 1, 5 + out_buffer(ss(3)+p3) = u(m,i,j,k,c) + p3 = p3 + 1 + end do + end do + end do + end do + endif + +!--------------------------------------------------------------------- +! fill the buffer to be sent to top neighbors (k-dir) +!--------------------------------------------------------------------- + if (cell_coord(3,c) .ne. ncells) then + do k = cell_size(3,c)-2, cell_size(3,c)-1 + do j = 0, cell_size(2,c)-1 + do i = 0, cell_size(1,c)-1 + do m = 1, 5 + out_buffer(ss(4)+p4) = u(m,i,j,k,c) + p4 = p4 + 1 + end do + end do + end do + end do + endif + +!--------------------------------------------------------------------- +! fill the buffer to be sent to bottom neighbors +!--------------------------------------------------------------------- + if (cell_coord(3,c).ne. 1) then + do k=0, 1 + do j = 0, cell_size(2,c)-1 + do i = 0, cell_size(1,c)-1 + do m = 1, 5 + out_buffer(ss(5)+p5) = u(m,i,j,k,c) + p5 = p5 + 1 + end do + end do + end do + end do + endif + +!--------------------------------------------------------------------- +! cell loop +!--------------------------------------------------------------------- + end do + if (timeron) call timer_stop(t_bpack) + + if (timeron) call timer_start(t_exch) + call mpi_irecv(in_buffer(sr(0)), b_size(0), & + & dp_type, successor(1), WEST, & + & comm_rhs, requests(0), error) + call mpi_irecv(in_buffer(sr(1)), b_size(1), & + & dp_type, predecessor(1), EAST, & + & comm_rhs, requests(1), error) + call mpi_irecv(in_buffer(sr(2)), b_size(2), & + & dp_type, successor(2), SOUTH, & + & comm_rhs, requests(2), error) + call mpi_irecv(in_buffer(sr(3)), b_size(3), & + & dp_type, predecessor(2), NORTH, & + & comm_rhs, requests(3), error) + call mpi_irecv(in_buffer(sr(4)), b_size(4), & + & dp_type, successor(3), BOTTOM, & + & comm_rhs, requests(4), error) + call mpi_irecv(in_buffer(sr(5)), b_size(5), & + & dp_type, predecessor(3), TOP, & + & comm_rhs, requests(5), error) + + call mpi_isend(out_buffer(ss(0)), b_size(0), & + & dp_type, successor(1), EAST, & + & comm_rhs, requests(6), error) + call mpi_isend(out_buffer(ss(1)), b_size(1), & + & dp_type, predecessor(1), WEST, & + & comm_rhs, requests(7), error) + call mpi_isend(out_buffer(ss(2)), b_size(2), & + & dp_type,successor(2), NORTH, & + & comm_rhs, requests(8), error) + call mpi_isend(out_buffer(ss(3)), b_size(3), & + & dp_type,predecessor(2), SOUTH, & + & comm_rhs, requests(9), error) + call mpi_isend(out_buffer(ss(4)), b_size(4), & + & dp_type,successor(3), TOP, & + & comm_rhs, requests(10), error) + call mpi_isend(out_buffer(ss(5)), b_size(5), & + & dp_type,predecessor(3), BOTTOM, & + & comm_rhs,requests(11), error) + + + call mpi_waitall(12, requests, statuses, error) + if (timeron) call timer_stop(t_exch) + +!--------------------------------------------------------------------- +! unpack the data that has just been received; +!--------------------------------------------------------------------- + if (timeron) call timer_start(t_bpack) + p0 = 0 + p1 = 0 + p2 = 0 + p3 = 0 + p4 = 0 + p5 = 0 + + do c = 1, ncells + + if (cell_coord(1,c) .ne. 1) then + do k = 0, cell_size(3,c)-1 + do j = 0, cell_size(2,c)-1 + do i = -2, -1 + do m = 1, 5 + u(m,i,j,k,c) = in_buffer(sr(1)+p0) + p0 = p0 + 1 + end do + end do + end do + end do + endif + + if (cell_coord(1,c) .ne. ncells) then + do k = 0, cell_size(3,c)-1 + do j = 0, cell_size(2,c)-1 + do i = cell_size(1,c), cell_size(1,c)+1 + do m = 1, 5 + u(m,i,j,k,c) = in_buffer(sr(0)+p1) + p1 = p1 + 1 + end do + end do + end do + end do + end if + + if (cell_coord(2,c) .ne. 1) then + do k = 0, cell_size(3,c)-1 + do j = -2, -1 + do i = 0, cell_size(1,c)-1 + do m = 1, 5 + u(m,i,j,k,c) = in_buffer(sr(3)+p2) + p2 = p2 + 1 + end do + end do + end do + end do + + endif + + if (cell_coord(2,c) .ne. ncells) then + do k = 0, cell_size(3,c)-1 + do j = cell_size(2,c), cell_size(2,c)+1 + do i = 0, cell_size(1,c)-1 + do m = 1, 5 + u(m,i,j,k,c) = in_buffer(sr(2)+p3) + p3 = p3 + 1 + end do + end do + end do + end do + endif + + if (cell_coord(3,c) .ne. 1) then + do k = -2, -1 + do j = 0, cell_size(2,c)-1 + do i = 0, cell_size(1,c)-1 + do m = 1, 5 + u(m,i,j,k,c) = in_buffer(sr(5)+p4) + p4 = p4 + 1 + end do + end do + end do + end do + endif + + if (cell_coord(3,c) .ne. ncells) then + do k = cell_size(3,c), cell_size(3,c)+1 + do j = 0, cell_size(2,c)-1 + do i = 0, cell_size(1,c)-1 + do m = 1, 5 + u(m,i,j,k,c) = in_buffer(sr(4)+p5) + p5 = p5 + 1 + end do + end do + end do + end do + endif + +!--------------------------------------------------------------------- +! cells loop +!--------------------------------------------------------------------- + end do + if (timeron) call timer_stop(t_bpack) + +!--------------------------------------------------------------------- +! do the rest of the rhs that uses the copied face values +!--------------------------------------------------------------------- + call compute_rhs + + return + end diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/define.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/define.f90 new file mode 100644 index 000000000..f42fcc4e0 --- /dev/null +++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/define.f90 @@ -0,0 +1,65 @@ +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine compute_buffer_size(dim) + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + use bt_data + implicit none + + integer c, dim, face_size + + if (ncells .eq. 1) return + +!--------------------------------------------------------------------- +! compute the actual sizes of the buffers; note that there is +! always one cell face that doesn't need buffer space, because it +! is at the boundary of the grid +!--------------------------------------------------------------------- + west_size = 0 + east_size = 0 + + do c = 1, ncells + face_size = cell_size(2,c) * cell_size(3,c) * dim * 2 + if (cell_coord(1,c).ne.1) west_size = west_size + face_size + if (cell_coord(1,c).ne.ncells) east_size = east_size + & + & face_size + end do + + north_size = 0 + south_size = 0 + do c = 1, ncells + face_size = cell_size(1,c)*cell_size(3,c) * dim * 2 + if (cell_coord(2,c).ne.1) south_size = south_size + face_size + if (cell_coord(2,c).ne.ncells) north_size = north_size + & + & face_size + end do + + top_size = 0 + bottom_size = 0 + do c = 1, ncells + face_size = cell_size(1,c) * cell_size(2,c) * dim * 2 + if (cell_coord(3,c).ne.1) bottom_size = bottom_size + & + & face_size + if (cell_coord(3,c).ne.ncells) top_size = top_size + & + & face_size + end do + + start_send_west = 1 + start_send_east = start_send_west + west_size + start_send_south = start_send_east + east_size + start_send_north = start_send_south + south_size + start_send_bottom = start_send_north + north_size + start_send_top = start_send_bottom + bottom_size + start_recv_west = 1 + start_recv_east = start_recv_west + west_size + start_recv_south = start_recv_east + east_size + start_recv_north = start_recv_south + south_size + start_recv_bottom = start_recv_north + north_size + start_recv_top = start_recv_bottom + bottom_size + + return + end + diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/epio.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/epio.f90 new file mode 100644 index 000000000..7c209656a --- /dev/null +++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/epio.f90 @@ -0,0 +1,174 @@ + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine setup_btio + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + use bt_data + use mpinpb + + implicit none + + character(128) newfilenm + integer m + + if (node .lt. 10000) then + write (newfilenm, 996) filenm,node + else + print *, 'error generating file names (> 10000 nodes)' + stop + endif + +996 format (a,'.',i4.4) + + open (unit=99, file=newfilenm, form='unformatted', & + & status='unknown') + + do m = 1, 5 + xce_sub(m) = 0.d0 + end do + + idump_sub = 0 + + return + end + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine output_timestep + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + use bt_data + implicit none + + integer ix, iio, jio, kio, cio, aio + + do cio=1,ncells + write(99) & + & ((((u(aio,ix, jio,kio,cio),aio=1,5), & + & ix=0, cell_size(1,cio)-1), & + & jio=0, cell_size(2,cio)-1), & + & kio=0, cell_size(3,cio)-1) + enddo + + idump_sub = idump_sub + 1 + if (rd_interval .gt. 0) then + if (idump_sub .ge. rd_interval) then + + rewind(99) + call acc_sub_norms(idump+1) + + rewind(99) + idump_sub = 0 + endif + endif + + return + end + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine acc_sub_norms(idump_cur) + + use bt_data + use mpinpb + + implicit none + + integer idump_cur + + integer ix, jio, kio, cio, ii, m, ichunk + double precision xce_single(5) + + ichunk = idump_cur - idump_sub + 1 + do ii=0, idump_sub-1 + do cio=1,ncells + read(99) & + & ((((u(m,ix, jio,kio,cio),m=1,5), & + & ix=0, cell_size(1,cio)-1), & + & jio=0, cell_size(2,cio)-1), & + & kio=0, cell_size(3,cio)-1) + enddo + + if (node .eq. root) print *, 'Reading data set ', ii+ichunk + + call error_norm(xce_single) + do m = 1, 5 + xce_sub(m) = xce_sub(m) + xce_single(m) + end do + enddo + + return + end + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine btio_cleanup + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + implicit none + + close(unit=99) + + return + end + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine accumulate_norms(xce_acc) + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + use bt_data + use mpinpb + + implicit none + + double precision xce_acc(5) + + character(128) newfilenm + integer m + + if (rd_interval .gt. 0) goto 20 + + if (node .lt. 10000) then + write (newfilenm, 996) filenm,node + else + print *, 'error generating file names (> 10000 nodes)' + stop + endif + +996 format (a,'.',i4.4) + + open (unit=99, file=newfilenm, & + & form='unformatted', action='read') + +! clear the last time step + + call clear_timestep + +! read back the time steps and accumulate norms + + call acc_sub_norms(idump) + + close(unit=99) + + 20 continue + do m = 1, 5 + xce_acc(m) = xce_sub(m) / dble(idump) + end do + + return + end diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/error.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/error.f90 new file mode 100644 index 000000000..b09998125 --- /dev/null +++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/error.f90 @@ -0,0 +1,114 @@ +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine error_norm(rms) + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + +!--------------------------------------------------------------------- +! this function computes the norm of the difference between the +! computed solution and the exact solution +!--------------------------------------------------------------------- + + use bt_data + use mpinpb + + implicit none + + integer c, i, j, k, m, ii, jj, kk, d, error + double precision xi, eta, zeta, u_exact(5), rms(5), rms_work(5), & + & add + + call timer_start(t_enorm) + + do m = 1, 5 + rms_work(m) = 0.0d0 + enddo + + do c = 1, ncells + kk = 0 + do k = cell_low(3,c), cell_high(3,c) + zeta = dble(k) * dnzm1 + jj = 0 + do j = cell_low(2,c), cell_high(2,c) + eta = dble(j) * dnym1 + ii = 0 + do i = cell_low(1,c), cell_high(1,c) + xi = dble(i) * dnxm1 + call exact_solution(xi, eta, zeta, u_exact) + + do m = 1, 5 + add = u(m,ii,jj,kk,c)-u_exact(m) + rms_work(m) = rms_work(m) + add*add + enddo + ii = ii + 1 + enddo + jj = jj + 1 + enddo + kk = kk + 1 + enddo + enddo + + call mpi_allreduce(rms_work, rms, 5, dp_type, & + & MPI_SUM, comm_setup, error) + + do m = 1, 5 + do d = 1, 3 + rms(m) = rms(m) / dble(grid_points(d)-2) + enddo + rms(m) = dsqrt(rms(m)) + enddo + + call timer_stop(t_enorm) + + return + end + + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine rhs_norm(rms) + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + use bt_data + use mpinpb + + implicit none + + integer c, i, j, k, d, m, error + double precision rms(5), rms_work(5), add + + do m = 1, 5 + rms_work(m) = 0.0d0 + enddo + + do c = 1, ncells + do k = start(3,c), cell_size(3,c)-end(3,c)-1 + do j = start(2,c), cell_size(2,c)-end(2,c)-1 + do i = start(1,c), cell_size(1,c)-end(1,c)-1 + do m = 1, 5 + add = rhs(m,i,j,k,c) + rms_work(m) = rms_work(m) + add*add + enddo + enddo + enddo + enddo + enddo + + call mpi_allreduce(rms_work, rms, 5, dp_type, & + & MPI_SUM, comm_setup, error) + + do m = 1, 5 + do d = 1, 3 + rms(m) = rms(m) / dble(grid_points(d)-2) + enddo + rms(m) = dsqrt(rms(m)) + enddo + + return + end + diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/exact_rhs.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/exact_rhs.f90 new file mode 100644 index 000000000..ed27eeaf9 --- /dev/null +++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/exact_rhs.f90 @@ -0,0 +1,361 @@ + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine exact_rhs + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + +!--------------------------------------------------------------------- +! compute the right hand side based on exact solution +!--------------------------------------------------------------------- + + use bt_data + implicit none + + double precision dtemp(5), xi, eta, zeta, dtpp + integer c, m, i, j, k, ip1, im1, jp1, & + & jm1, km1, kp1 + + +!--------------------------------------------------------------------- +! loop over all cells owned by this node +!--------------------------------------------------------------------- + do c = 1, ncells + +!--------------------------------------------------------------------- +! initialize +!--------------------------------------------------------------------- + do k= 0, cell_size(3,c)-1 + do j = 0, cell_size(2,c)-1 + do i = 0, cell_size(1,c)-1 + do m = 1, 5 + forcing(m,i,j,k,c) = 0.0d0 + enddo + enddo + enddo + enddo + +!--------------------------------------------------------------------- +! xi-direction flux differences +!--------------------------------------------------------------------- + do k = start(3,c), cell_size(3,c)-end(3,c)-1 + zeta = dble(k+cell_low(3,c)) * dnzm1 + do j = start(2,c), cell_size(2,c)-end(2,c)-1 + eta = dble(j+cell_low(2,c)) * dnym1 + + do i=-2*(1-start(1,c)), cell_size(1,c)+1-2*end(1,c) + xi = dble(i+cell_low(1,c)) * dnxm1 + + call exact_solution(xi, eta, zeta, dtemp) + do m = 1, 5 + ue(i,m) = dtemp(m) + enddo + + dtpp = 1.0d0 / dtemp(1) + + do m = 2, 5 + buf(i,m) = dtpp * dtemp(m) + enddo + + cuf(i) = buf(i,2) * buf(i,2) + buf(i,1) = cuf(i) + buf(i,3) * buf(i,3) + & + & buf(i,4) * buf(i,4) + q(i) = 0.5d0*(buf(i,2)*ue(i,2) + buf(i,3)*ue(i,3) + & + & buf(i,4)*ue(i,4)) + + enddo + + do i = start(1,c), cell_size(1,c)-end(1,c)-1 + im1 = i-1 + ip1 = i+1 + + forcing(1,i,j,k,c) = forcing(1,i,j,k,c) - & + & tx2*( ue(ip1,2)-ue(im1,2) )+ & + & dx1tx1*(ue(ip1,1)-2.0d0*ue(i,1)+ue(im1,1)) + + forcing(2,i,j,k,c) = forcing(2,i,j,k,c) - tx2 * ( & + & (ue(ip1,2)*buf(ip1,2)+c2*(ue(ip1,5)-q(ip1)))- & + & (ue(im1,2)*buf(im1,2)+c2*(ue(im1,5)-q(im1))))+ & + & xxcon1*(buf(ip1,2)-2.0d0*buf(i,2)+buf(im1,2))+ & + & dx2tx1*( ue(ip1,2)-2.0d0* ue(i,2)+ue(im1,2)) + + forcing(3,i,j,k,c) = forcing(3,i,j,k,c) - tx2 * ( & + & ue(ip1,3)*buf(ip1,2)-ue(im1,3)*buf(im1,2))+ & + & xxcon2*(buf(ip1,3)-2.0d0*buf(i,3)+buf(im1,3))+ & + & dx3tx1*( ue(ip1,3)-2.0d0*ue(i,3) +ue(im1,3)) + + forcing(4,i,j,k,c) = forcing(4,i,j,k,c) - tx2*( & + & ue(ip1,4)*buf(ip1,2)-ue(im1,4)*buf(im1,2))+ & + & xxcon2*(buf(ip1,4)-2.0d0*buf(i,4)+buf(im1,4))+ & + & dx4tx1*( ue(ip1,4)-2.0d0* ue(i,4)+ ue(im1,4)) + + forcing(5,i,j,k,c) = forcing(5,i,j,k,c) - tx2*( & + & buf(ip1,2)*(c1*ue(ip1,5)-c2*q(ip1))- & + & buf(im1,2)*(c1*ue(im1,5)-c2*q(im1)))+ & + & 0.5d0*xxcon3*(buf(ip1,1)-2.0d0*buf(i,1)+ & + & buf(im1,1))+ & + & xxcon4*(cuf(ip1)-2.0d0*cuf(i)+cuf(im1))+ & + & xxcon5*(buf(ip1,5)-2.0d0*buf(i,5)+buf(im1,5))+ & + & dx5tx1*( ue(ip1,5)-2.0d0* ue(i,5)+ ue(im1,5)) + enddo + +!--------------------------------------------------------------------- +! Fourth-order dissipation +!--------------------------------------------------------------------- + if (start(1,c) .gt. 0) then + do m = 1, 5 + i = 1 + forcing(m,i,j,k,c) = forcing(m,i,j,k,c) - dssp * & + & (5.0d0*ue(i,m) - 4.0d0*ue(i+1,m) +ue(i+2,m)) + i = 2 + forcing(m,i,j,k,c) = forcing(m,i,j,k,c) - dssp * & + & (-4.0d0*ue(i-1,m) + 6.0d0*ue(i,m) - & + & 4.0d0*ue(i+1,m) + ue(i+2,m)) + enddo + endif + + do i = start(1,c)*3, cell_size(1,c)-3*end(1,c)-1 + do m = 1, 5 + forcing(m,i,j,k,c) = forcing(m,i,j,k,c) - dssp* & + & (ue(i-2,m) - 4.0d0*ue(i-1,m) + & + & 6.0d0*ue(i,m) - 4.0d0*ue(i+1,m) + ue(i+2,m)) + enddo + enddo + + if (end(1,c) .gt. 0) then + do m = 1, 5 + i = cell_size(1,c)-3 + forcing(m,i,j,k,c) = forcing(m,i,j,k,c) - dssp * & + & (ue(i-2,m) - 4.0d0*ue(i-1,m) + & + & 6.0d0*ue(i,m) - 4.0d0*ue(i+1,m)) + i = cell_size(1,c)-2 + forcing(m,i,j,k,c) = forcing(m,i,j,k,c) - dssp * & + & (ue(i-2,m) - 4.0d0*ue(i-1,m) + 5.0d0*ue(i,m)) + enddo + endif + + enddo + enddo + +!--------------------------------------------------------------------- +! eta-direction flux differences +!--------------------------------------------------------------------- + do k = start(3,c), cell_size(3,c)-end(3,c)-1 + zeta = dble(k+cell_low(3,c)) * dnzm1 + do i=start(1,c), cell_size(1,c)-end(1,c)-1 + xi = dble(i+cell_low(1,c)) * dnxm1 + + do j=-2*(1-start(2,c)), cell_size(2,c)+1-2*end(2,c) + eta = dble(j+cell_low(2,c)) * dnym1 + + call exact_solution(xi, eta, zeta, dtemp) + do m = 1, 5 + ue(j,m) = dtemp(m) + enddo + + dtpp = 1.0d0/dtemp(1) + + do m = 2, 5 + buf(j,m) = dtpp * dtemp(m) + enddo + + cuf(j) = buf(j,3) * buf(j,3) + buf(j,1) = cuf(j) + buf(j,2) * buf(j,2) + & + & buf(j,4) * buf(j,4) + q(j) = 0.5d0*(buf(j,2)*ue(j,2) + buf(j,3)*ue(j,3) + & + & buf(j,4)*ue(j,4)) + enddo + + do j = start(2,c), cell_size(2,c)-end(2,c)-1 + jm1 = j-1 + jp1 = j+1 + + forcing(1,i,j,k,c) = forcing(1,i,j,k,c) - & + & ty2*( ue(jp1,3)-ue(jm1,3) )+ & + & dy1ty1*(ue(jp1,1)-2.0d0*ue(j,1)+ue(jm1,1)) + + forcing(2,i,j,k,c) = forcing(2,i,j,k,c) - ty2*( & + & ue(jp1,2)*buf(jp1,3)-ue(jm1,2)*buf(jm1,3))+ & + & yycon2*(buf(jp1,2)-2.0d0*buf(j,2)+buf(jm1,2))+ & + & dy2ty1*( ue(jp1,2)-2.0* ue(j,2)+ ue(jm1,2)) + + forcing(3,i,j,k,c) = forcing(3,i,j,k,c) - ty2*( & + & (ue(jp1,3)*buf(jp1,3)+c2*(ue(jp1,5)-q(jp1)))- & + & (ue(jm1,3)*buf(jm1,3)+c2*(ue(jm1,5)-q(jm1))))+ & + & yycon1*(buf(jp1,3)-2.0d0*buf(j,3)+buf(jm1,3))+ & + & dy3ty1*( ue(jp1,3)-2.0d0*ue(j,3) +ue(jm1,3)) + + forcing(4,i,j,k,c) = forcing(4,i,j,k,c) - ty2*( & + & ue(jp1,4)*buf(jp1,3)-ue(jm1,4)*buf(jm1,3))+ & + & yycon2*(buf(jp1,4)-2.0d0*buf(j,4)+buf(jm1,4))+ & + & dy4ty1*( ue(jp1,4)-2.0d0*ue(j,4)+ ue(jm1,4)) + + forcing(5,i,j,k,c) = forcing(5,i,j,k,c) - ty2*( & + & buf(jp1,3)*(c1*ue(jp1,5)-c2*q(jp1))- & + & buf(jm1,3)*(c1*ue(jm1,5)-c2*q(jm1)))+ & + & 0.5d0*yycon3*(buf(jp1,1)-2.0d0*buf(j,1)+ & + & buf(jm1,1))+ & + & yycon4*(cuf(jp1)-2.0d0*cuf(j)+cuf(jm1))+ & + & yycon5*(buf(jp1,5)-2.0d0*buf(j,5)+buf(jm1,5))+ & + & dy5ty1*(ue(jp1,5)-2.0d0*ue(j,5)+ue(jm1,5)) + enddo + +!--------------------------------------------------------------------- +! Fourth-order dissipation +!--------------------------------------------------------------------- + if (start(2,c) .gt. 0) then + do m = 1, 5 + j = 1 + forcing(m,i,j,k,c) = forcing(m,i,j,k,c) - dssp * & + & (5.0d0*ue(j,m) - 4.0d0*ue(j+1,m) +ue(j+2,m)) + j = 2 + forcing(m,i,j,k,c) = forcing(m,i,j,k,c) - dssp * & + & (-4.0d0*ue(j-1,m) + 6.0d0*ue(j,m) - & + & 4.0d0*ue(j+1,m) + ue(j+2,m)) + enddo + endif + + do j = start(2,c)*3, cell_size(2,c)-3*end(2,c)-1 + do m = 1, 5 + forcing(m,i,j,k,c) = forcing(m,i,j,k,c) - dssp* & + & (ue(j-2,m) - 4.0d0*ue(j-1,m) + & + & 6.0d0*ue(j,m) - 4.0d0*ue(j+1,m) + ue(j+2,m)) + enddo + enddo + + if (end(2,c) .gt. 0) then + do m = 1, 5 + j = cell_size(2,c)-3 + forcing(m,i,j,k,c) = forcing(m,i,j,k,c) - dssp * & + & (ue(j-2,m) - 4.0d0*ue(j-1,m) + & + & 6.0d0*ue(j,m) - 4.0d0*ue(j+1,m)) + j = cell_size(2,c)-2 + forcing(m,i,j,k,c) = forcing(m,i,j,k,c) - dssp * & + & (ue(j-2,m) - 4.0d0*ue(j-1,m) + 5.0d0*ue(j,m)) + + enddo + endif + + enddo + enddo + +!--------------------------------------------------------------------- +! zeta-direction flux differences +!--------------------------------------------------------------------- + do j=start(2,c), cell_size(2,c)-end(2,c)-1 + eta = dble(j+cell_low(2,c)) * dnym1 + do i = start(1,c), cell_size(1,c)-end(1,c)-1 + xi = dble(i+cell_low(1,c)) * dnxm1 + + do k=-2*(1-start(3,c)), cell_size(3,c)+1-2*end(3,c) + zeta = dble(k+cell_low(3,c)) * dnzm1 + + call exact_solution(xi, eta, zeta, dtemp) + do m = 1, 5 + ue(k,m) = dtemp(m) + enddo + + dtpp = 1.0d0/dtemp(1) + + do m = 2, 5 + buf(k,m) = dtpp * dtemp(m) + enddo + + cuf(k) = buf(k,4) * buf(k,4) + buf(k,1) = cuf(k) + buf(k,2) * buf(k,2) + & + & buf(k,3) * buf(k,3) + q(k) = 0.5d0*(buf(k,2)*ue(k,2) + buf(k,3)*ue(k,3) + & + & buf(k,4)*ue(k,4)) + enddo + + do k=start(3,c), cell_size(3,c)-end(3,c)-1 + km1 = k-1 + kp1 = k+1 + + forcing(1,i,j,k,c) = forcing(1,i,j,k,c) - & + & tz2*( ue(kp1,4)-ue(km1,4) )+ & + & dz1tz1*(ue(kp1,1)-2.0d0*ue(k,1)+ue(km1,1)) + + forcing(2,i,j,k,c) = forcing(2,i,j,k,c) - tz2 * ( & + & ue(kp1,2)*buf(kp1,4)-ue(km1,2)*buf(km1,4))+ & + & zzcon2*(buf(kp1,2)-2.0d0*buf(k,2)+buf(km1,2))+ & + & dz2tz1*( ue(kp1,2)-2.0d0* ue(k,2)+ ue(km1,2)) + + forcing(3,i,j,k,c) = forcing(3,i,j,k,c) - tz2 * ( & + & ue(kp1,3)*buf(kp1,4)-ue(km1,3)*buf(km1,4))+ & + & zzcon2*(buf(kp1,3)-2.0d0*buf(k,3)+buf(km1,3))+ & + & dz3tz1*(ue(kp1,3)-2.0d0*ue(k,3)+ue(km1,3)) + + forcing(4,i,j,k,c) = forcing(4,i,j,k,c) - tz2 * ( & + & (ue(kp1,4)*buf(kp1,4)+c2*(ue(kp1,5)-q(kp1)))- & + & (ue(km1,4)*buf(km1,4)+c2*(ue(km1,5)-q(km1))))+ & + & zzcon1*(buf(kp1,4)-2.0d0*buf(k,4)+buf(km1,4))+ & + & dz4tz1*( ue(kp1,4)-2.0d0*ue(k,4) +ue(km1,4)) + + forcing(5,i,j,k,c) = forcing(5,i,j,k,c) - tz2 * ( & + & buf(kp1,4)*(c1*ue(kp1,5)-c2*q(kp1))- & + & buf(km1,4)*(c1*ue(km1,5)-c2*q(km1)))+ & + & 0.5d0*zzcon3*(buf(kp1,1)-2.0d0*buf(k,1) & + & +buf(km1,1))+ & + & zzcon4*(cuf(kp1)-2.0d0*cuf(k)+cuf(km1))+ & + & zzcon5*(buf(kp1,5)-2.0d0*buf(k,5)+buf(km1,5))+ & + & dz5tz1*( ue(kp1,5)-2.0d0*ue(k,5)+ ue(km1,5)) + enddo + +!--------------------------------------------------------------------- +! Fourth-order dissipation +!--------------------------------------------------------------------- + if (start(3,c) .gt. 0) then + do m = 1, 5 + k = 1 + forcing(m,i,j,k,c) = forcing(m,i,j,k,c) - dssp * & + & (5.0d0*ue(k,m) - 4.0d0*ue(k+1,m) +ue(k+2,m)) + k = 2 + forcing(m,i,j,k,c) = forcing(m,i,j,k,c) - dssp * & + & (-4.0d0*ue(k-1,m) + 6.0d0*ue(k,m) - & + & 4.0d0*ue(k+1,m) + ue(k+2,m)) + enddo + endif + + do k = start(3,c)*3, cell_size(3,c)-3*end(3,c)-1 + do m = 1, 5 + forcing(m,i,j,k,c) = forcing(m,i,j,k,c) - dssp* & + & (ue(k-2,m) - 4.0d0*ue(k-1,m) + & + & 6.0d0*ue(k,m) - 4.0d0*ue(k+1,m) + ue(k+2,m)) + enddo + enddo + + if (end(3,c) .gt. 0) then + do m = 1, 5 + k = cell_size(3,c)-3 + forcing(m,i,j,k,c) = forcing(m,i,j,k,c) - dssp * & + & (ue(k-2,m) - 4.0d0*ue(k-1,m) + & + & 6.0d0*ue(k,m) - 4.0d0*ue(k+1,m)) + k = cell_size(3,c)-2 + forcing(m,i,j,k,c) = forcing(m,i,j,k,c) - dssp * & + & (ue(k-2,m) - 4.0d0*ue(k-1,m) + 5.0d0*ue(k,m)) + enddo + endif + + enddo + enddo + +!--------------------------------------------------------------------- +! now change the sign of the forcing function, +!--------------------------------------------------------------------- + do k = start(3,c), cell_size(3,c)-end(3,c)-1 + do j = start(2,c), cell_size(2,c)-end(2,c)-1 + do i = start(1,c), cell_size(1,c)-end(1,c)-1 + do m = 1, 5 + forcing(m,i,j,k,c) = -1.d0 * forcing(m,i,j,k,c) + enddo + enddo + enddo + enddo + + enddo + + return + end diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/exact_solution.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/exact_solution.f90 new file mode 100644 index 000000000..2ada9387a --- /dev/null +++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/exact_solution.f90 @@ -0,0 +1,30 @@ +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine exact_solution(xi,eta,zeta,dtemp) + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + +!--------------------------------------------------------------------- +! this function returns the exact solution at point xi, eta, zeta +!--------------------------------------------------------------------- + + use bt_data + implicit none + + double precision xi, eta, zeta, dtemp(5) + integer m + + do m = 1, 5 + dtemp(m) = ce(m,1) + & + & xi*(ce(m,2) + xi*(ce(m,5) + xi*(ce(m,8) + xi*ce(m,11)))) + & + & eta*(ce(m,3) + eta*(ce(m,6) + eta*(ce(m,9) + eta*ce(m,12))))+ & + & zeta*(ce(m,4) + zeta*(ce(m,7) + zeta*(ce(m,10) + & + & zeta*ce(m,13)))) + enddo + + return + end + + diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/fortran_io.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/fortran_io.f90 new file mode 100644 index 000000000..d35781fbf --- /dev/null +++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/fortran_io.f90 @@ -0,0 +1,198 @@ + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine setup_btio + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + use bt_data + use mpinpb + + implicit none + + double precision d5(5) + integer frec_sz, m, ierr + +! determine a proper record_length to use + if (node.eq.root) then + frec_sz = fortran_rec_sz + if (frec_sz > 0) then + ! use the compiled value + record_length = 40/frec_sz + else + ! query directly + inquire(iolength=record_length) d5 + endif + if (record_length < 1) record_length = 40 + endif + + call mpi_bcast(record_length, 1, MPI_INTEGER, & + & root, comm_setup, ierr) + + open (unit=99, file=filenm, & + & form='unformatted', access='direct', & + & recl=record_length) + + do m = 1, 5 + xce_sub(m) = 0.d0 + end do + + idump_sub = 0 + + return + end + + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine output_timestep + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + use bt_data + implicit none + + integer ix, jio, kio, cio + + do cio=1,ncells + do kio=0, cell_size(3,cio)-1 + do jio=0, cell_size(2,cio)-1 + iseek=(cell_low(3,cio)+kio) + & + & PROBLEM_SIZE*idump_sub + iseek=(cell_low(2,cio)+jio) + & + & PROBLEM_SIZE*iseek + iseek=(cell_low(1,cio)) + & + & PROBLEM_SIZE*iseek + + do ix=0,cell_size(1,cio)-1 + write(99, rec=iseek+ix+1) & + & u(1,ix, jio,kio,cio), & + & u(2,ix, jio,kio,cio), & + & u(3,ix, jio,kio,cio), & + & u(4,ix, jio,kio,cio), & + & u(5,ix, jio,kio,cio) + enddo + enddo + enddo + enddo + + idump_sub = idump_sub + 1 + if (rd_interval .gt. 0) then + if (idump_sub .ge. rd_interval) then + + call acc_sub_norms(idump+1) + + idump_sub = 0 + endif + endif + + return + end + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine acc_sub_norms(idump_cur) + + use bt_data + use mpinpb + + implicit none + + integer idump_cur + + integer ix, jio, kio, cio, ii, m, ichunk + double precision xce_single(5) + + ichunk = idump_cur - idump_sub + 1 + do ii=0, idump_sub-1 + do cio=1,ncells + do kio=0, cell_size(3,cio)-1 + do jio=0, cell_size(2,cio)-1 + iseek=(cell_low(3,cio)+kio) + & + & PROBLEM_SIZE*ii + iseek=(cell_low(2,cio)+jio) + & + & PROBLEM_SIZE*iseek + iseek=(cell_low(1,cio)) + & + & PROBLEM_SIZE*iseek + + + do ix=0,cell_size(1,cio)-1 + read(99, rec=iseek+ix+1) & + & u(1,ix, jio,kio,cio), & + & u(2,ix, jio,kio,cio), & + & u(3,ix, jio,kio,cio), & + & u(4,ix, jio,kio,cio), & + & u(5,ix, jio,kio,cio) + enddo + enddo + enddo + enddo + + if (node .eq. root) print *, 'Reading data set ', ii+ichunk + + call error_norm(xce_single) + do m = 1, 5 + xce_sub(m) = xce_sub(m) + xce_single(m) + end do + enddo + + return + end + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine btio_cleanup + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + implicit none + + close(unit=99) + + return + end + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine accumulate_norms(xce_acc) + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + use bt_data + implicit none + + double precision xce_acc(5) + integer m + + if (rd_interval .gt. 0) goto 20 + + open (unit=99, file=filenm, & + & form='unformatted', access='direct', & + & recl=record_length, action='read') + +! clear the last time step + + call clear_timestep + +! read back the time steps and accumulate norms + + call acc_sub_norms(idump) + + close(unit=99) + + 20 continue + do m = 1, 5 + xce_acc(m) = xce_sub(m) / dble(idump) + end do + + return + end diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/full_mpiio.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/full_mpiio.f90 new file mode 100644 index 000000000..b14acd832 --- /dev/null +++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/full_mpiio.f90 @@ -0,0 +1,319 @@ + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine setup_btio + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + use bt_data + use mpinpb + + implicit none + + integer ierr + integer mstatus(MPI_STATUS_SIZE) + integer sizes(4), starts(4), subsizes(4) + integer cell_btype(maxcells), cell_ftype(maxcells) + integer cell_blength(maxcells) + integer info + character*20 cb_nodes, cb_size + integer c, m + integer cell_disp(maxcells) + + call mpi_bcast(collbuf_nodes, 1, MPI_INTEGER, & + & root, comm_setup, ierr) + + call mpi_bcast(collbuf_size, 1, MPI_INTEGER, & + & root, comm_setup, ierr) + + if (collbuf_nodes .eq. 0) then + info = MPI_INFO_NULL + else + write (cb_nodes,*) collbuf_nodes + write (cb_size,*) collbuf_size + call MPI_Info_create(info, ierr) + call MPI_Info_set(info, 'cb_nodes', cb_nodes, ierr) + call MPI_Info_set(info, 'cb_buffer_size', cb_size, ierr) + call MPI_Info_set(info, 'collective_buffering', 'true', ierr) + endif + + call MPI_Type_contiguous(5, MPI_DOUBLE_PRECISION, & + & element, ierr) + call MPI_Type_commit(element, ierr) + call MPI_Type_extent(element, eltext, ierr) + + do c = 1, ncells +! +! Outer array dimensions ar same for every cell +! + sizes(1) = IMAX+4 + sizes(2) = JMAX+4 + sizes(3) = KMAX+4 +! +! 4th dimension is cell number, total of maxcells cells +! + sizes(4) = maxcells +! +! Internal dimensions of cells can differ slightly between cells +! + subsizes(1) = cell_size(1, c) + subsizes(2) = cell_size(2, c) + subsizes(3) = cell_size(3, c) +! +! Cell is 4th dimension, 1 cell per cell type to handle varying +! cell sub-array sizes +! + subsizes(4) = 1 + +! +! type constructors use 0-based start addresses +! + starts(1) = 2 + starts(2) = 2 + starts(3) = 2 + starts(4) = c-1 + +! +! Create buftype for a cell +! + call MPI_Type_create_subarray(4, sizes, subsizes, & + & starts, MPI_ORDER_FORTRAN, element, & + & cell_btype(c), ierr) +! +! block length and displacement for joining cells - +! 1 cell buftype per block, cell buftypes have own displacment +! generated from cell number (4th array dimension) +! + cell_blength(c) = 1 + cell_disp(c) = 0 + + enddo +! +! Create combined buftype for all cells +! + call MPI_Type_struct(ncells, cell_blength, cell_disp, & + & cell_btype, combined_btype, ierr) + call MPI_Type_commit(combined_btype, ierr) + + do c = 1, ncells +! +! Entire array size +! + sizes(1) = PROBLEM_SIZE + sizes(2) = PROBLEM_SIZE + sizes(3) = PROBLEM_SIZE + +! +! Size of c'th cell +! + subsizes(1) = cell_size(1, c) + subsizes(2) = cell_size(2, c) + subsizes(3) = cell_size(3, c) + +! +! Starting point in full array of c'th cell +! + starts(1) = cell_low(1,c) + starts(2) = cell_low(2,c) + starts(3) = cell_low(3,c) + + call MPI_Type_create_subarray(3, sizes, subsizes, & + & starts, MPI_ORDER_FORTRAN, & + & element, cell_ftype(c), ierr) + cell_blength(c) = 1 + cell_disp(c) = 0 + enddo + + call MPI_Type_struct(ncells, cell_blength, cell_disp, & + & cell_ftype, combined_ftype, ierr) + call MPI_Type_commit(combined_ftype, ierr) + + iseek=0 + if (node .eq. root) then + call MPI_File_delete(filenm, MPI_INFO_NULL, ierr) + endif + + + call MPI_Barrier(comm_solve, ierr) + + call MPI_File_open(comm_solve, & + & filenm, & + & MPI_MODE_RDWR+MPI_MODE_CREATE, & + & MPI_INFO_NULL, fp, ierr) + + if (ierr .ne. MPI_SUCCESS) then + print *, 'Error opening file' + stop + endif + + call MPI_File_set_view(fp, iseek, element, & + & combined_ftype, 'native', info, ierr) + + if (ierr .ne. MPI_SUCCESS) then + print *, 'Error setting file view' + stop + endif + + do m = 1, 5 + xce_sub(m) = 0.d0 + end do + + idump_sub = 0 + + + return + end + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine output_timestep + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + use bt_data + use mpinpb + + implicit none + + integer mstatus(MPI_STATUS_SIZE) + integer ierr + + call MPI_File_write_at_all(fp, iseek, u, & + & 1, combined_btype, mstatus, ierr) + if (ierr .ne. MPI_SUCCESS) then + print *, 'Error writing to file' + stop + endif + + call MPI_Type_size(combined_btype, iosize, ierr) + iseek = iseek + iosize/eltext + + idump_sub = idump_sub + 1 + if (rd_interval .gt. 0) then + if (idump_sub .ge. rd_interval) then + + iseek = 0 + call acc_sub_norms(idump+1) + + iseek = 0 + idump_sub = 0 + endif + endif + + return + end + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine acc_sub_norms(idump_cur) + + use bt_data + use mpinpb + + implicit none + + integer idump_cur + + integer ii, m, ichunk + integer ierr + integer mstatus(MPI_STATUS_SIZE) + double precision xce_single(5) + + ichunk = idump_cur - idump_sub + 1 + do ii=0, idump_sub-1 + + call MPI_File_read_at_all(fp, iseek, u, & + & 1, combined_btype, mstatus, ierr) + if (ierr .ne. MPI_SUCCESS) then + print *, 'Error reading back file' + call MPI_File_close(fp, ierr) + stop + endif + + call MPI_Type_size(combined_btype, iosize, ierr) + iseek = iseek + iosize/eltext + + if (node .eq. root) print *, 'Reading data set ', ii+ichunk + + call error_norm(xce_single) + do m = 1, 5 + xce_sub(m) = xce_sub(m) + xce_single(m) + end do + enddo + + return + end + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine btio_cleanup + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + use bt_data + use mpinpb + + implicit none + + integer ierr + + call MPI_File_close(fp, ierr) + + return + end + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + + subroutine accumulate_norms(xce_acc) + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + use bt_data + use mpinpb + + implicit none + + double precision xce_acc(5) + integer m, ierr + + if (rd_interval .gt. 0) goto 20 + + call MPI_File_open(comm_solve, & + & filenm, & + & MPI_MODE_RDONLY, & + & MPI_INFO_NULL, & + & fp, & + & ierr) + + iseek = 0 + call MPI_File_set_view(fp, iseek, element, combined_ftype, & + & 'native', MPI_INFO_NULL, ierr) + +! clear the last time step + + call clear_timestep + +! read back the time steps and accumulate norms + + call acc_sub_norms(idump) + + call MPI_File_close(fp, ierr) + + 20 continue + do m = 1, 5 + xce_acc(m) = xce_sub(m) / dble(idump) + end do + + return + end + diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/initialize.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/initialize.f90 new file mode 100644 index 000000000..aeca10abc --- /dev/null +++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/initialize.f90 @@ -0,0 +1,310 @@ +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine initialize + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + +!--------------------------------------------------------------------- +! This subroutine initializes the field variable u using +! tri-linear transfinite interpolation of the boundary values +!--------------------------------------------------------------------- + + use bt_data + implicit none + + integer c, i, j, k, m, ii, jj, kk, ix, iy, iz + double precision xi, eta, zeta, Pface(5,3,2), Pxi, Peta, & + & Pzeta, temp(5) + +!--------------------------------------------------------------------- +! Later (in compute_rhs) we compute 1/u for every element. A few of +! the corner elements are not used, but it convenient (and faster) +! to compute the whole thing with a simple loop. Make sure those +! values are nonzero by initializing the whole thing here. +!--------------------------------------------------------------------- + do c = 1, ncells + do kk = -1, KMAX + do jj = -1, JMAX + do ii = -1, IMAX + do m = 1, 5 + u(m, ii, jj, kk, c) = 1.0 + end do + end do + end do + end do + end do +!--------------------------------------------------------------------- + + + +!--------------------------------------------------------------------- +! first store the "interpolated" values everywhere on the grid +!--------------------------------------------------------------------- + do c=1, ncells + kk = 0 + do k = cell_low(3,c), cell_high(3,c) + zeta = dble(k) * dnzm1 + jj = 0 + do j = cell_low(2,c), cell_high(2,c) + eta = dble(j) * dnym1 + ii = 0 + do i = cell_low(1,c), cell_high(1,c) + xi = dble(i) * dnxm1 + + do ix = 1, 2 + call exact_solution(dble(ix-1), eta, zeta, & + & Pface(1,1,ix)) + enddo + + do iy = 1, 2 + call exact_solution(xi, dble(iy-1) , zeta, & + & Pface(1,2,iy)) + enddo + + do iz = 1, 2 + call exact_solution(xi, eta, dble(iz-1), & + & Pface(1,3,iz)) + enddo + + do m = 1, 5 + Pxi = xi * Pface(m,1,2) + & + & (1.0d0-xi) * Pface(m,1,1) + Peta = eta * Pface(m,2,2) + & + & (1.0d0-eta) * Pface(m,2,1) + Pzeta = zeta * Pface(m,3,2) + & + & (1.0d0-zeta) * Pface(m,3,1) + + u(m,ii,jj,kk,c) = Pxi + Peta + Pzeta - & + & Pxi*Peta - Pxi*Pzeta - Peta*Pzeta + & + & Pxi*Peta*Pzeta + + enddo + ii = ii + 1 + enddo + jj = jj + 1 + enddo + kk = kk+1 + enddo + enddo + +!--------------------------------------------------------------------- +! now store the exact values on the boundaries +!--------------------------------------------------------------------- + +!--------------------------------------------------------------------- +! west face +!--------------------------------------------------------------------- + c = slice(1,1) + ii = 0 + xi = 0.0d0 + kk = 0 + do k = cell_low(3,c), cell_high(3,c) + zeta = dble(k) * dnzm1 + jj = 0 + do j = cell_low(2,c), cell_high(2,c) + eta = dble(j) * dnym1 + call exact_solution(xi, eta, zeta, temp) + do m = 1, 5 + u(m,ii,jj,kk,c) = temp(m) + enddo + jj = jj + 1 + enddo + kk = kk + 1 + enddo + +!--------------------------------------------------------------------- +! east face +!--------------------------------------------------------------------- + c = slice(1,ncells) + ii = cell_size(1,c)-1 + xi = 1.0d0 + kk = 0 + do k = cell_low(3,c), cell_high(3,c) + zeta = dble(k) * dnzm1 + jj = 0 + do j = cell_low(2,c), cell_high(2,c) + eta = dble(j) * dnym1 + call exact_solution(xi, eta, zeta, temp) + do m = 1, 5 + u(m,ii,jj,kk,c) = temp(m) + enddo + jj = jj + 1 + enddo + kk = kk + 1 + enddo + +!--------------------------------------------------------------------- +! south face +!--------------------------------------------------------------------- + c = slice(2,1) + jj = 0 + eta = 0.0d0 + kk = 0 + do k = cell_low(3,c), cell_high(3,c) + zeta = dble(k) * dnzm1 + ii = 0 + do i = cell_low(1,c), cell_high(1,c) + xi = dble(i) * dnxm1 + call exact_solution(xi, eta, zeta, temp) + do m = 1, 5 + u(m,ii,jj,kk,c) = temp(m) + enddo + ii = ii + 1 + enddo + kk = kk + 1 + enddo + + +!--------------------------------------------------------------------- +! north face +!--------------------------------------------------------------------- + c = slice(2,ncells) + jj = cell_size(2,c)-1 + eta = 1.0d0 + kk = 0 + do k = cell_low(3,c), cell_high(3,c) + zeta = dble(k) * dnzm1 + ii = 0 + do i = cell_low(1,c), cell_high(1,c) + xi = dble(i) * dnxm1 + call exact_solution(xi, eta, zeta, temp) + do m = 1, 5 + u(m,ii,jj,kk,c) = temp(m) + enddo + ii = ii + 1 + enddo + kk = kk + 1 + enddo + +!--------------------------------------------------------------------- +! bottom face +!--------------------------------------------------------------------- + c = slice(3,1) + kk = 0 + zeta = 0.0d0 + jj = 0 + do j = cell_low(2,c), cell_high(2,c) + eta = dble(j) * dnym1 + ii = 0 + do i =cell_low(1,c), cell_high(1,c) + xi = dble(i) *dnxm1 + call exact_solution(xi, eta, zeta, temp) + do m = 1, 5 + u(m,ii,jj,kk,c) = temp(m) + enddo + ii = ii + 1 + enddo + jj = jj + 1 + enddo + +!--------------------------------------------------------------------- +! top face +!--------------------------------------------------------------------- + c = slice(3,ncells) + kk = cell_size(3,c)-1 + zeta = 1.0d0 + jj = 0 + do j = cell_low(2,c), cell_high(2,c) + eta = dble(j) * dnym1 + ii = 0 + do i =cell_low(1,c), cell_high(1,c) + xi = dble(i) * dnxm1 + call exact_solution(xi, eta, zeta, temp) + do m = 1, 5 + u(m,ii,jj,kk,c) = temp(m) + enddo + ii = ii + 1 + enddo + jj = jj + 1 + enddo + + return + end + + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine lhsinit + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + use bt_data + implicit none + + integer i, j, k, d, c, m, n + +!--------------------------------------------------------------------- +! loop over all cells +!--------------------------------------------------------------------- + do c = 1, ncells + +!--------------------------------------------------------------------- +! first, initialize the start and end arrays +!--------------------------------------------------------------------- + do d = 1, 3 + if (cell_coord(d,c) .eq. 1) then + start(d,c) = 1 + else + start(d,c) = 0 + endif + if (cell_coord(d,c) .eq. ncells) then + end(d,c) = 1 + else + end(d,c) = 0 + endif + enddo + +!--------------------------------------------------------------------- +! zero the whole left hand side for starters +!--------------------------------------------------------------------- + do k = 0, cell_size(3,c)-1 + do j = 0, cell_size(2,c)-1 + do i = 0, cell_size(1,c)-1 + do m = 1,5 + do n = 1, 5 + lhsc(m,n,i,j,k,c) = 0.0d0 + enddo + enddo + enddo + enddo + enddo + + enddo + + return + end + + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine lhsabinit(lhsa, lhsb, size) + implicit none + + integer size + double precision lhsa(5, 5, -1:size), lhsb(5, 5, -1:size) + + integer i, m, n + +!--------------------------------------------------------------------- +! next, set all diagonal values to 1. This is overkill, but convenient +!--------------------------------------------------------------------- + do i = 0, size + do m = 1, 5 + do n = 1, 5 + lhsa(m,n,i) = 0.0d0 + lhsb(m,n,i) = 0.0d0 + enddo + lhsb(m,m,i) = 1.0d0 + enddo + enddo + + return + end + + + diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/inputbt.data.sample b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/inputbt.data.sample new file mode 100644 index 000000000..776654e8d --- /dev/null +++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/inputbt.data.sample @@ -0,0 +1,5 @@ +200 number of time steps +0.0008d0 dt for class A = 0.0008d0. class B = 0.0003d0 class C = 0.0001d0 +64 64 64 +5 0 write interval (optional read interval) for BTIO +0 1000000 number of nodes in collective buffering and buffer size for BTIO diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/make_set.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/make_set.f90 new file mode 100644 index 000000000..b24575109 --- /dev/null +++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/make_set.f90 @@ -0,0 +1,126 @@ +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine make_set + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + +!--------------------------------------------------------------------- +! This function allocates space for a set of cells and fills the set +! such that communication between cells on different nodes is only +! nearest neighbor +!--------------------------------------------------------------------- + + use bt_data + use mpinpb + + implicit none + + integer p, i, j, c, dir, size, excess, ierr,ierrcode + +!--------------------------------------------------------------------- +! compute square root; add small number to allow for roundoff +! (note: this is computed in setup_mpi.f also, but prefer to do +! it twice because of some include file problems). +!--------------------------------------------------------------------- + ncells = dint(dsqrt(dble(no_nodes) + 0.00001d0)) + +!--------------------------------------------------------------------- +! this makes coding easier +!--------------------------------------------------------------------- + p = ncells + +!--------------------------------------------------------------------- +! determine the location of the cell at the bottom of the 3D +! array of cells +!--------------------------------------------------------------------- + cell_coord(1,1) = mod(node,p) + cell_coord(2,1) = node/p + cell_coord(3,1) = 0 + +!--------------------------------------------------------------------- +! set the cell_coords for cells in the rest of the z-layers; +! this comes down to a simple linear numbering in the z-direct- +! ion, and to the doubly-cyclic numbering in the other dirs +!--------------------------------------------------------------------- + do c=2, p + cell_coord(1,c) = mod(cell_coord(1,c-1)+1,p) + cell_coord(2,c) = mod(cell_coord(2,c-1)-1+p,p) + cell_coord(3,c) = c-1 + end do + +!--------------------------------------------------------------------- +! offset all the coordinates by 1 to adjust for Fortran arrays +!--------------------------------------------------------------------- + do dir = 1, 3 + do c = 1, p + cell_coord(dir,c) = cell_coord(dir,c) + 1 + end do + end do + +!--------------------------------------------------------------------- +! slice(dir,n) contains the sequence number of the cell that is in +! coordinate plane n in the dir direction +!--------------------------------------------------------------------- + do dir = 1, 3 + do c = 1, p + slice(dir,cell_coord(dir,c)) = c + end do + end do + + +!--------------------------------------------------------------------- +! fill the predecessor and successor entries, using the indices +! of the bottom cells (they are the same at each level of k +! anyway) acting as if full periodicity pertains; note that p is +! added to those arguments to the mod functions that might +! otherwise return wrong values when using the modulo function +!--------------------------------------------------------------------- + i = cell_coord(1,1)-1 + j = cell_coord(2,1)-1 + + predecessor(1) = mod(i-1+p,p) + p*j + predecessor(2) = i + p*mod(j-1+p,p) + predecessor(3) = mod(i+1,p) + p*mod(j-1+p,p) + successor(1) = mod(i+1,p) + p*j + successor(2) = i + p*mod(j+1,p) + successor(3) = mod(i-1+p,p) + p*mod(j+1,p) + +!--------------------------------------------------------------------- +! now compute the sizes of the cells +!--------------------------------------------------------------------- + do dir= 1, 3 +!--------------------------------------------------------------------- +! set cell_coord range for each direction +!--------------------------------------------------------------------- + size = grid_points(dir)/p + excess = mod(grid_points(dir),p) + do c=1, ncells + if (cell_coord(dir,c) .le. excess) then + cell_size(dir,c) = size+1 + cell_low(dir,c) = (cell_coord(dir,c)-1)*(size+1) + cell_high(dir,c) = cell_low(dir,c)+size + else + cell_size(dir,c) = size + cell_low(dir,c) = excess*(size+1)+ & + & (cell_coord(dir,c)-excess-1)*size + cell_high(dir,c) = cell_low(dir,c)+size-1 + endif + if (cell_size(dir, c) .le. 2) then + write(*,50) + 50 format(' Error: Cell size too small. Min size is 3') + ierrcode = 1 + call MPI_Abort(mpi_comm_world,ierrcode,ierr) + stop + endif + end do + end do + + return + end + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/mpinpb.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/mpinpb.f90 new file mode 100644 index 000000000..6fd83ac63 --- /dev/null +++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/mpinpb.f90 @@ -0,0 +1,18 @@ +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- +! +! mpinpb module +! +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + module mpinpb + + include 'mpif.h' + + integer node, no_nodes, total_nodes, root, comm_setup, & + & comm_solve, comm_rhs, dp_type + logical active + + end module mpinpb + diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/rhs.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/rhs.f90 new file mode 100644 index 000000000..b47cdf6c1 --- /dev/null +++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/rhs.f90 @@ -0,0 +1,429 @@ +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine compute_rhs + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + use bt_data + implicit none + + integer c, i, j, k, m + double precision rho_inv, uijk, up1, um1, vijk, vp1, vm1, & + & wijk, wp1, wm1 + + + if (timeron) call timer_start(t_rhs) +!--------------------------------------------------------------------- +! loop over all cells owned by this node +!--------------------------------------------------------------------- + do c = 1, ncells + +!--------------------------------------------------------------------- +! compute the reciprocal of density, and the kinetic energy, +! and the speed of sound. +!--------------------------------------------------------------------- + do k = -1, cell_size(3,c) + do j = -1, cell_size(2,c) + do i = -1, cell_size(1,c) + rho_inv = 1.0d0/u(1,i,j,k,c) + rho_i(i,j,k,c) = rho_inv + us(i,j,k,c) = u(2,i,j,k,c) * rho_inv + vs(i,j,k,c) = u(3,i,j,k,c) * rho_inv + ws(i,j,k,c) = u(4,i,j,k,c) * rho_inv + square(i,j,k,c) = 0.5d0* ( & + & u(2,i,j,k,c)*u(2,i,j,k,c) + & + & u(3,i,j,k,c)*u(3,i,j,k,c) + & + & u(4,i,j,k,c)*u(4,i,j,k,c) ) * rho_inv + qs(i,j,k,c) = square(i,j,k,c) * rho_inv + enddo + enddo + enddo + +!--------------------------------------------------------------------- +! copy the exact forcing term to the right hand side; because +! this forcing term is known, we can store it on the whole of every +! cell, including the boundary +!--------------------------------------------------------------------- + + do k = 0, cell_size(3,c)-1 + do j = 0, cell_size(2,c)-1 + do i = 0, cell_size(1,c)-1 + do m = 1, 5 + rhs(m,i,j,k,c) = forcing(m,i,j,k,c) + enddo + enddo + enddo + enddo + + +!--------------------------------------------------------------------- +! compute xi-direction fluxes +!--------------------------------------------------------------------- + do k = start(3,c), cell_size(3,c)-end(3,c)-1 + do j = start(2,c), cell_size(2,c)-end(2,c)-1 + do i = start(1,c), cell_size(1,c)-end(1,c)-1 + uijk = us(i,j,k,c) + up1 = us(i+1,j,k,c) + um1 = us(i-1,j,k,c) + + rhs(1,i,j,k,c) = rhs(1,i,j,k,c) + dx1tx1 * & + & (u(1,i+1,j,k,c) - 2.0d0*u(1,i,j,k,c) + & + & u(1,i-1,j,k,c)) - & + & tx2 * (u(2,i+1,j,k,c) - u(2,i-1,j,k,c)) + + rhs(2,i,j,k,c) = rhs(2,i,j,k,c) + dx2tx1 * & + & (u(2,i+1,j,k,c) - 2.0d0*u(2,i,j,k,c) + & + & u(2,i-1,j,k,c)) + & + & xxcon2*con43 * (up1 - 2.0d0*uijk + um1) - & + & tx2 * (u(2,i+1,j,k,c)*up1 - & + & u(2,i-1,j,k,c)*um1 + & + & (u(5,i+1,j,k,c)- square(i+1,j,k,c)- & + & u(5,i-1,j,k,c)+ square(i-1,j,k,c))* & + & c2) + + rhs(3,i,j,k,c) = rhs(3,i,j,k,c) + dx3tx1 * & + & (u(3,i+1,j,k,c) - 2.0d0*u(3,i,j,k,c) + & + & u(3,i-1,j,k,c)) + & + & xxcon2 * (vs(i+1,j,k,c) - 2.0d0*vs(i,j,k,c) + & + & vs(i-1,j,k,c)) - & + & tx2 * (u(3,i+1,j,k,c)*up1 - & + & u(3,i-1,j,k,c)*um1) + + rhs(4,i,j,k,c) = rhs(4,i,j,k,c) + dx4tx1 * & + & (u(4,i+1,j,k,c) - 2.0d0*u(4,i,j,k,c) + & + & u(4,i-1,j,k,c)) + & + & xxcon2 * (ws(i+1,j,k,c) - 2.0d0*ws(i,j,k,c) + & + & ws(i-1,j,k,c)) - & + & tx2 * (u(4,i+1,j,k,c)*up1 - & + & u(4,i-1,j,k,c)*um1) + + rhs(5,i,j,k,c) = rhs(5,i,j,k,c) + dx5tx1 * & + & (u(5,i+1,j,k,c) - 2.0d0*u(5,i,j,k,c) + & + & u(5,i-1,j,k,c)) + & + & xxcon3 * (qs(i+1,j,k,c) - 2.0d0*qs(i,j,k,c) + & + & qs(i-1,j,k,c)) + & + & xxcon4 * (up1*up1 - 2.0d0*uijk*uijk + & + & um1*um1) + & + & xxcon5 * (u(5,i+1,j,k,c)*rho_i(i+1,j,k,c) - & + & 2.0d0*u(5,i,j,k,c)*rho_i(i,j,k,c) + & + & u(5,i-1,j,k,c)*rho_i(i-1,j,k,c)) - & + & tx2 * ( (c1*u(5,i+1,j,k,c) - & + & c2*square(i+1,j,k,c))*up1 - & + & (c1*u(5,i-1,j,k,c) - & + & c2*square(i-1,j,k,c))*um1 ) + enddo + enddo + enddo + +!--------------------------------------------------------------------- +! add fourth order xi-direction dissipation +!--------------------------------------------------------------------- + if (start(1,c) .gt. 0) then + do k = start(3,c), cell_size(3,c)-end(3,c)-1 + do j = start(2,c), cell_size(2,c)-end(2,c)-1 + i = 1 + do m = 1, 5 + rhs(m,i,j,k,c) = rhs(m,i,j,k,c)- dssp * & + & ( 5.0d0*u(m,i,j,k,c) - 4.0d0*u(m,i+1,j,k,c) + & + & u(m,i+2,j,k,c)) + enddo + + i = 2 + do m = 1, 5 + rhs(m,i,j,k,c) = rhs(m,i,j,k,c) - dssp * & + & (-4.0d0*u(m,i-1,j,k,c) + 6.0d0*u(m,i,j,k,c) - & + & 4.0d0*u(m,i+1,j,k,c) + u(m,i+2,j,k,c)) + enddo + enddo + enddo + endif + + do k = start(3,c), cell_size(3,c)-end(3,c)-1 + do j = start(2,c), cell_size(2,c)-end(2,c)-1 + do i = 3*start(1,c),cell_size(1,c)-3*end(1,c)-1 + do m = 1, 5 + rhs(m,i,j,k,c) = rhs(m,i,j,k,c) - dssp * & + & ( u(m,i-2,j,k,c) - 4.0d0*u(m,i-1,j,k,c) + & + & 6.0*u(m,i,j,k,c) - 4.0d0*u(m,i+1,j,k,c) + & + & u(m,i+2,j,k,c) ) + enddo + enddo + enddo + enddo + + + if (end(1,c) .gt. 0) then + do k = start(3,c), cell_size(3,c)-end(3,c)-1 + do j = start(2,c), cell_size(2,c)-end(2,c)-1 + i = cell_size(1,c)-3 + do m = 1, 5 + rhs(m,i,j,k,c) = rhs(m,i,j,k,c) - dssp * & + & ( u(m,i-2,j,k,c) - 4.0d0*u(m,i-1,j,k,c) + & + & 6.0d0*u(m,i,j,k,c) - 4.0d0*u(m,i+1,j,k,c) ) + enddo + + i = cell_size(1,c)-2 + do m = 1, 5 + rhs(m,i,j,k,c) = rhs(m,i,j,k,c) - dssp * & + & ( u(m,i-2,j,k,c) - 4.d0*u(m,i-1,j,k,c) + & + & 5.d0*u(m,i,j,k,c) ) + enddo + enddo + enddo + endif + +!--------------------------------------------------------------------- +! compute eta-direction fluxes +!--------------------------------------------------------------------- + do k = start(3,c), cell_size(3,c)-end(3,c)-1 + do j = start(2,c), cell_size(2,c)-end(2,c)-1 + do i = start(1,c), cell_size(1,c)-end(1,c)-1 + vijk = vs(i,j,k,c) + vp1 = vs(i,j+1,k,c) + vm1 = vs(i,j-1,k,c) + rhs(1,i,j,k,c) = rhs(1,i,j,k,c) + dy1ty1 * & + & (u(1,i,j+1,k,c) - 2.0d0*u(1,i,j,k,c) + & + & u(1,i,j-1,k,c)) - & + & ty2 * (u(3,i,j+1,k,c) - u(3,i,j-1,k,c)) + rhs(2,i,j,k,c) = rhs(2,i,j,k,c) + dy2ty1 * & + & (u(2,i,j+1,k,c) - 2.0d0*u(2,i,j,k,c) + & + & u(2,i,j-1,k,c)) + & + & yycon2 * (us(i,j+1,k,c) - 2.0d0*us(i,j,k,c) + & + & us(i,j-1,k,c)) - & + & ty2 * (u(2,i,j+1,k,c)*vp1 - & + & u(2,i,j-1,k,c)*vm1) + rhs(3,i,j,k,c) = rhs(3,i,j,k,c) + dy3ty1 * & + & (u(3,i,j+1,k,c) - 2.0d0*u(3,i,j,k,c) + & + & u(3,i,j-1,k,c)) + & + & yycon2*con43 * (vp1 - 2.0d0*vijk + vm1) - & + & ty2 * (u(3,i,j+1,k,c)*vp1 - & + & u(3,i,j-1,k,c)*vm1 + & + & (u(5,i,j+1,k,c) - square(i,j+1,k,c) - & + & u(5,i,j-1,k,c) + square(i,j-1,k,c)) & + & *c2) + rhs(4,i,j,k,c) = rhs(4,i,j,k,c) + dy4ty1 * & + & (u(4,i,j+1,k,c) - 2.0d0*u(4,i,j,k,c) + & + & u(4,i,j-1,k,c)) + & + & yycon2 * (ws(i,j+1,k,c) - 2.0d0*ws(i,j,k,c) + & + & ws(i,j-1,k,c)) - & + & ty2 * (u(4,i,j+1,k,c)*vp1 - & + & u(4,i,j-1,k,c)*vm1) + rhs(5,i,j,k,c) = rhs(5,i,j,k,c) + dy5ty1 * & + & (u(5,i,j+1,k,c) - 2.0d0*u(5,i,j,k,c) + & + & u(5,i,j-1,k,c)) + & + & yycon3 * (qs(i,j+1,k,c) - 2.0d0*qs(i,j,k,c) + & + & qs(i,j-1,k,c)) + & + & yycon4 * (vp1*vp1 - 2.0d0*vijk*vijk + & + & vm1*vm1) + & + & yycon5 * (u(5,i,j+1,k,c)*rho_i(i,j+1,k,c) - & + & 2.0d0*u(5,i,j,k,c)*rho_i(i,j,k,c) + & + & u(5,i,j-1,k,c)*rho_i(i,j-1,k,c)) - & + & ty2 * ((c1*u(5,i,j+1,k,c) - & + & c2*square(i,j+1,k,c)) * vp1 - & + & (c1*u(5,i,j-1,k,c) - & + & c2*square(i,j-1,k,c)) * vm1) + enddo + enddo + enddo + +!--------------------------------------------------------------------- +! add fourth order eta-direction dissipation +!--------------------------------------------------------------------- + if (start(2,c) .gt. 0) then + do k = start(3,c), cell_size(3,c)-end(3,c)-1 + j = 1 + do i = start(1,c), cell_size(1,c)-end(1,c)-1 + do m = 1, 5 + rhs(m,i,j,k,c) = rhs(m,i,j,k,c)- dssp * & + & ( 5.0d0*u(m,i,j,k,c) - 4.0d0*u(m,i,j+1,k,c) + & + & u(m,i,j+2,k,c)) + enddo + enddo + + j = 2 + do i = start(1,c), cell_size(1,c)-end(1,c)-1 + do m = 1, 5 + rhs(m,i,j,k,c) = rhs(m,i,j,k,c) - dssp * & + & (-4.0d0*u(m,i,j-1,k,c) + 6.0d0*u(m,i,j,k,c) - & + & 4.0d0*u(m,i,j+1,k,c) + u(m,i,j+2,k,c)) + enddo + enddo + enddo + endif + + do k = start(3,c), cell_size(3,c)-end(3,c)-1 + do j = 3*start(2,c), cell_size(2,c)-3*end(2,c)-1 + do i = start(1,c),cell_size(1,c)-end(1,c)-1 + do m = 1, 5 + rhs(m,i,j,k,c) = rhs(m,i,j,k,c) - dssp * & + & ( u(m,i,j-2,k,c) - 4.0d0*u(m,i,j-1,k,c) + & + & 6.0*u(m,i,j,k,c) - 4.0d0*u(m,i,j+1,k,c) + & + & u(m,i,j+2,k,c) ) + enddo + enddo + enddo + enddo + + if (end(2,c) .gt. 0) then + do k = start(3,c), cell_size(3,c)-end(3,c)-1 + j = cell_size(2,c)-3 + do i = start(1,c), cell_size(1,c)-end(1,c)-1 + do m = 1, 5 + rhs(m,i,j,k,c) = rhs(m,i,j,k,c) - dssp * & + & ( u(m,i,j-2,k,c) - 4.0d0*u(m,i,j-1,k,c) + & + & 6.0d0*u(m,i,j,k,c) - 4.0d0*u(m,i,j+1,k,c) ) + enddo + enddo + + j = cell_size(2,c)-2 + do i = start(1,c), cell_size(1,c)-end(1,c)-1 + do m = 1, 5 + rhs(m,i,j,k,c) = rhs(m,i,j,k,c) - dssp * & + & ( u(m,i,j-2,k,c) - 4.d0*u(m,i,j-1,k,c) + & + & 5.d0*u(m,i,j,k,c) ) + enddo + enddo + enddo + endif + +!--------------------------------------------------------------------- +! compute zeta-direction fluxes +!--------------------------------------------------------------------- + do k = start(3,c), cell_size(3,c)-end(3,c)-1 + do j = start(2,c), cell_size(2,c)-end(2,c)-1 + do i = start(1,c), cell_size(1,c)-end(1,c)-1 + wijk = ws(i,j,k,c) + wp1 = ws(i,j,k+1,c) + wm1 = ws(i,j,k-1,c) + + rhs(1,i,j,k,c) = rhs(1,i,j,k,c) + dz1tz1 * & + & (u(1,i,j,k+1,c) - 2.0d0*u(1,i,j,k,c) + & + & u(1,i,j,k-1,c)) - & + & tz2 * (u(4,i,j,k+1,c) - u(4,i,j,k-1,c)) + rhs(2,i,j,k,c) = rhs(2,i,j,k,c) + dz2tz1 * & + & (u(2,i,j,k+1,c) - 2.0d0*u(2,i,j,k,c) + & + & u(2,i,j,k-1,c)) + & + & zzcon2 * (us(i,j,k+1,c) - 2.0d0*us(i,j,k,c) + & + & us(i,j,k-1,c)) - & + & tz2 * (u(2,i,j,k+1,c)*wp1 - & + & u(2,i,j,k-1,c)*wm1) + rhs(3,i,j,k,c) = rhs(3,i,j,k,c) + dz3tz1 * & + & (u(3,i,j,k+1,c) - 2.0d0*u(3,i,j,k,c) + & + & u(3,i,j,k-1,c)) + & + & zzcon2 * (vs(i,j,k+1,c) - 2.0d0*vs(i,j,k,c) + & + & vs(i,j,k-1,c)) - & + & tz2 * (u(3,i,j,k+1,c)*wp1 - & + & u(3,i,j,k-1,c)*wm1) + rhs(4,i,j,k,c) = rhs(4,i,j,k,c) + dz4tz1 * & + & (u(4,i,j,k+1,c) - 2.0d0*u(4,i,j,k,c) + & + & u(4,i,j,k-1,c)) + & + & zzcon2*con43 * (wp1 - 2.0d0*wijk + wm1) - & + & tz2 * (u(4,i,j,k+1,c)*wp1 - & + & u(4,i,j,k-1,c)*wm1 + & + & (u(5,i,j,k+1,c) - square(i,j,k+1,c) - & + & u(5,i,j,k-1,c) + square(i,j,k-1,c)) & + & *c2) + rhs(5,i,j,k,c) = rhs(5,i,j,k,c) + dz5tz1 * & + & (u(5,i,j,k+1,c) - 2.0d0*u(5,i,j,k,c) + & + & u(5,i,j,k-1,c)) + & + & zzcon3 * (qs(i,j,k+1,c) - 2.0d0*qs(i,j,k,c) + & + & qs(i,j,k-1,c)) + & + & zzcon4 * (wp1*wp1 - 2.0d0*wijk*wijk + & + & wm1*wm1) + & + & zzcon5 * (u(5,i,j,k+1,c)*rho_i(i,j,k+1,c) - & + & 2.0d0*u(5,i,j,k,c)*rho_i(i,j,k,c) + & + & u(5,i,j,k-1,c)*rho_i(i,j,k-1,c)) - & + & tz2 * ( (c1*u(5,i,j,k+1,c) - & + & c2*square(i,j,k+1,c))*wp1 - & + & (c1*u(5,i,j,k-1,c) - & + & c2*square(i,j,k-1,c))*wm1) + enddo + enddo + enddo + +!--------------------------------------------------------------------- +! add fourth order zeta-direction dissipation +!--------------------------------------------------------------------- + if (start(3,c) .gt. 0) then + k = 1 + do j = start(2,c), cell_size(2,c)-end(2,c)-1 + do i = start(1,c), cell_size(1,c)-end(1,c)-1 + do m = 1, 5 + rhs(m,i,j,k,c) = rhs(m,i,j,k,c)- dssp * & + & ( 5.0d0*u(m,i,j,k,c) - 4.0d0*u(m,i,j,k+1,c) + & + & u(m,i,j,k+2,c)) + enddo + enddo + enddo + + k = 2 + do j = start(2,c), cell_size(2,c)-end(2,c)-1 + do i = start(1,c), cell_size(1,c)-end(1,c)-1 + do m = 1, 5 + rhs(m,i,j,k,c) = rhs(m,i,j,k,c) - dssp * & + & (-4.0d0*u(m,i,j,k-1,c) + 6.0d0*u(m,i,j,k,c) - & + & 4.0d0*u(m,i,j,k+1,c) + u(m,i,j,k+2,c)) + enddo + enddo + enddo + endif + + do k = 3*start(3,c), cell_size(3,c)-3*end(3,c)-1 + do j = start(2,c), cell_size(2,c)-end(2,c)-1 + do i = start(1,c),cell_size(1,c)-end(1,c)-1 + do m = 1, 5 + rhs(m,i,j,k,c) = rhs(m,i,j,k,c) - dssp * & + & ( u(m,i,j,k-2,c) - 4.0d0*u(m,i,j,k-1,c) + & + & 6.0*u(m,i,j,k,c) - 4.0d0*u(m,i,j,k+1,c) + & + & u(m,i,j,k+2,c) ) + enddo + enddo + enddo + enddo + + if (end(3,c) .gt. 0) then + k = cell_size(3,c)-3 + do j = start(2,c), cell_size(2,c)-end(2,c)-1 + do i = start(1,c), cell_size(1,c)-end(1,c)-1 + do m = 1, 5 + rhs(m,i,j,k,c) = rhs(m,i,j,k,c) - dssp * & + & ( u(m,i,j,k-2,c) - 4.0d0*u(m,i,j,k-1,c) + & + & 6.0d0*u(m,i,j,k,c) - 4.0d0*u(m,i,j,k+1,c) ) + enddo + enddo + enddo + + k = cell_size(3,c)-2 + do j = start(2,c), cell_size(2,c)-end(2,c)-1 + do i = start(1,c), cell_size(1,c)-end(1,c)-1 + do m = 1, 5 + rhs(m,i,j,k,c) = rhs(m,i,j,k,c) - dssp * & + & ( u(m,i,j,k-2,c) - 4.d0*u(m,i,j,k-1,c) + & + & 5.d0*u(m,i,j,k,c) ) + enddo + enddo + enddo + endif + + do k = start(3,c), cell_size(3,c)-end(3,c)-1 + do j = start(2,c), cell_size(2,c)-end(2,c)-1 + do i = start(1,c), cell_size(1,c)-end(1,c)-1 + do m = 1, 5 + rhs(m,i,j,k,c) = rhs(m,i,j,k,c) * dt + enddo + enddo + enddo + enddo + + enddo + + if (timeron) call timer_stop(t_rhs) + + return + end + + + + diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/set_constants.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/set_constants.f90 new file mode 100644 index 000000000..1519e2cb7 --- /dev/null +++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/set_constants.f90 @@ -0,0 +1,203 @@ +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine set_constants + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + use bt_data + implicit none + + ce(1,1) = 2.0d0 + ce(1,2) = 0.0d0 + ce(1,3) = 0.0d0 + ce(1,4) = 4.0d0 + ce(1,5) = 5.0d0 + ce(1,6) = 3.0d0 + ce(1,7) = 0.5d0 + ce(1,8) = 0.02d0 + ce(1,9) = 0.01d0 + ce(1,10) = 0.03d0 + ce(1,11) = 0.5d0 + ce(1,12) = 0.4d0 + ce(1,13) = 0.3d0 + + ce(2,1) = 1.0d0 + ce(2,2) = 0.0d0 + ce(2,3) = 0.0d0 + ce(2,4) = 0.0d0 + ce(2,5) = 1.0d0 + ce(2,6) = 2.0d0 + ce(2,7) = 3.0d0 + ce(2,8) = 0.01d0 + ce(2,9) = 0.03d0 + ce(2,10) = 0.02d0 + ce(2,11) = 0.4d0 + ce(2,12) = 0.3d0 + ce(2,13) = 0.5d0 + + ce(3,1) = 2.0d0 + ce(3,2) = 2.0d0 + ce(3,3) = 0.0d0 + ce(3,4) = 0.0d0 + ce(3,5) = 0.0d0 + ce(3,6) = 2.0d0 + ce(3,7) = 3.0d0 + ce(3,8) = 0.04d0 + ce(3,9) = 0.03d0 + ce(3,10) = 0.05d0 + ce(3,11) = 0.3d0 + ce(3,12) = 0.5d0 + ce(3,13) = 0.4d0 + + ce(4,1) = 2.0d0 + ce(4,2) = 2.0d0 + ce(4,3) = 0.0d0 + ce(4,4) = 0.0d0 + ce(4,5) = 0.0d0 + ce(4,6) = 2.0d0 + ce(4,7) = 3.0d0 + ce(4,8) = 0.03d0 + ce(4,9) = 0.05d0 + ce(4,10) = 0.04d0 + ce(4,11) = 0.2d0 + ce(4,12) = 0.1d0 + ce(4,13) = 0.3d0 + + ce(5,1) = 5.0d0 + ce(5,2) = 4.0d0 + ce(5,3) = 3.0d0 + ce(5,4) = 2.0d0 + ce(5,5) = 0.1d0 + ce(5,6) = 0.4d0 + ce(5,7) = 0.3d0 + ce(5,8) = 0.05d0 + ce(5,9) = 0.04d0 + ce(5,10) = 0.03d0 + ce(5,11) = 0.1d0 + ce(5,12) = 0.3d0 + ce(5,13) = 0.2d0 + + c1 = 1.4d0 + c2 = 0.4d0 + c3 = 0.1d0 + c4 = 1.0d0 + c5 = 1.4d0 + + bt = dsqrt(0.5d0) + + dnxm1 = 1.0d0 / dble(grid_points(1)-1) + dnym1 = 1.0d0 / dble(grid_points(2)-1) + dnzm1 = 1.0d0 / dble(grid_points(3)-1) + + c1c2 = c1 * c2 + c1c5 = c1 * c5 + c3c4 = c3 * c4 + c1345 = c1c5 * c3c4 + + conz1 = (1.0d0-c1c5) + + tx1 = 1.0d0 / (dnxm1 * dnxm1) + tx2 = 1.0d0 / (2.0d0 * dnxm1) + tx3 = 1.0d0 / dnxm1 + + ty1 = 1.0d0 / (dnym1 * dnym1) + ty2 = 1.0d0 / (2.0d0 * dnym1) + ty3 = 1.0d0 / dnym1 + + tz1 = 1.0d0 / (dnzm1 * dnzm1) + tz2 = 1.0d0 / (2.0d0 * dnzm1) + tz3 = 1.0d0 / dnzm1 + + dx1 = 0.75d0 + dx2 = 0.75d0 + dx3 = 0.75d0 + dx4 = 0.75d0 + dx5 = 0.75d0 + + dy1 = 0.75d0 + dy2 = 0.75d0 + dy3 = 0.75d0 + dy4 = 0.75d0 + dy5 = 0.75d0 + + dz1 = 1.0d0 + dz2 = 1.0d0 + dz3 = 1.0d0 + dz4 = 1.0d0 + dz5 = 1.0d0 + + dxmax = dmax1(dx3, dx4) + dymax = dmax1(dy2, dy4) + dzmax = dmax1(dz2, dz3) + + dssp = 0.25d0 * dmax1(dx1, dmax1(dy1, dz1) ) + + c4dssp = 4.0d0 * dssp + c5dssp = 5.0d0 * dssp + + dttx1 = dt*tx1 + dttx2 = dt*tx2 + dtty1 = dt*ty1 + dtty2 = dt*ty2 + dttz1 = dt*tz1 + dttz2 = dt*tz2 + + c2dttx1 = 2.0d0*dttx1 + c2dtty1 = 2.0d0*dtty1 + c2dttz1 = 2.0d0*dttz1 + + dtdssp = dt*dssp + + comz1 = dtdssp + comz4 = 4.0d0*dtdssp + comz5 = 5.0d0*dtdssp + comz6 = 6.0d0*dtdssp + + c3c4tx3 = c3c4*tx3 + c3c4ty3 = c3c4*ty3 + c3c4tz3 = c3c4*tz3 + + dx1tx1 = dx1*tx1 + dx2tx1 = dx2*tx1 + dx3tx1 = dx3*tx1 + dx4tx1 = dx4*tx1 + dx5tx1 = dx5*tx1 + + dy1ty1 = dy1*ty1 + dy2ty1 = dy2*ty1 + dy3ty1 = dy3*ty1 + dy4ty1 = dy4*ty1 + dy5ty1 = dy5*ty1 + + dz1tz1 = dz1*tz1 + dz2tz1 = dz2*tz1 + dz3tz1 = dz3*tz1 + dz4tz1 = dz4*tz1 + dz5tz1 = dz5*tz1 + + c2iv = 2.5d0 + con43 = 4.0d0/3.0d0 + con16 = 1.0d0/6.0d0 + + xxcon1 = c3c4tx3*con43*tx3 + xxcon2 = c3c4tx3*tx3 + xxcon3 = c3c4tx3*conz1*tx3 + xxcon4 = c3c4tx3*con16*tx3 + xxcon5 = c3c4tx3*c1c5*tx3 + + yycon1 = c3c4ty3*con43*ty3 + yycon2 = c3c4ty3*ty3 + yycon3 = c3c4ty3*conz1*ty3 + yycon4 = c3c4ty3*con16*ty3 + yycon5 = c3c4ty3*c1c5*ty3 + + zzcon1 = c3c4tz3*con43*tz3 + zzcon2 = c3c4tz3*tz3 + zzcon3 = c3c4tz3*conz1*tz3 + zzcon4 = c3c4tz3*con16*tz3 + zzcon5 = c3c4tz3*c1c5*tz3 + + return + end diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/setup_mpi.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/setup_mpi.f90 new file mode 100644 index 000000000..9d7939b7c --- /dev/null +++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/setup_mpi.f90 @@ -0,0 +1,48 @@ + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine setup_mpi + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + +!--------------------------------------------------------------------- +! set up MPI stuff +!--------------------------------------------------------------------- + + use bt_data + use mpinpb + + implicit none + + integer error, nc + + call mpi_init(error) + + if (.not. convertdouble) then + dp_type = MPI_DOUBLE_PRECISION + else + dp_type = MPI_REAL + endif + +!--------------------------------------------------------------------- +! get a process grid that requires a square number of procs. +! excess ranks are marked as inactive. +!--------------------------------------------------------------------- + call get_active_nprocs(1, nc, maxcells, no_nodes, & + & total_nodes, node, comm_setup, active) + + if (.not. active) return + + call mpi_comm_dup(comm_setup, comm_solve, error) + call mpi_comm_dup(comm_setup, comm_rhs, error) + +!--------------------------------------------------------------------- +! let node 0 be the root for the group (there is only one) +!--------------------------------------------------------------------- + root = 0 + + return + end + diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/simple_mpiio.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/simple_mpiio.f90 new file mode 100644 index 000000000..e47da2943 --- /dev/null +++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/simple_mpiio.f90 @@ -0,0 +1,228 @@ + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine setup_btio + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + use bt_data + use mpinpb + + implicit none + + integer m, ierr + + iseek=0 + + if (node .eq. root) then + call MPI_File_delete(filenm, MPI_INFO_NULL, ierr) + endif + + call MPI_Barrier(comm_solve, ierr) + + call MPI_File_open(comm_solve, & + & filenm, & + & MPI_MODE_RDWR + MPI_MODE_CREATE, & + & MPI_INFO_NULL, & + & fp, & + & ierr) + + call MPI_File_set_view(fp, & + & iseek, MPI_DOUBLE_PRECISION, MPI_DOUBLE_PRECISION, & + & 'native', MPI_INFO_NULL, ierr) + + if (ierr .ne. MPI_SUCCESS) then + print *, 'Error opening file' + stop + endif + + do m = 1, 5 + xce_sub(m) = 0.d0 + end do + + idump_sub = 0 + + return + end + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine output_timestep + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + use bt_data + use mpinpb + + implicit none + + integer count, jio, kio, cio, aio + integer ierr + integer mstatus(MPI_STATUS_SIZE) + + do cio=1,ncells + do kio=0, cell_size(3,cio)-1 + do jio=0, cell_size(2,cio)-1 + iseek=(cell_low(3,cio)+kio) + & + & PROBLEM_SIZE*idump_sub + iseek=(cell_low(2,cio)+jio) + & + & PROBLEM_SIZE*iseek + iseek=5*(cell_low(1,cio) + & + & PROBLEM_SIZE*iseek) + + count=5*cell_size(1,cio) + + call MPI_File_write_at(fp, iseek, & + & u(1,0,jio,kio,cio), & + & count, MPI_DOUBLE_PRECISION, & + & mstatus, ierr) + + if (ierr .ne. MPI_SUCCESS) then + print *, 'Error writing to file' + stop + endif + enddo + enddo + enddo + + idump_sub = idump_sub + 1 + if (rd_interval .gt. 0) then + if (idump_sub .ge. rd_interval) then + + call acc_sub_norms(idump+1) + + idump_sub = 0 + endif + endif + + return + end + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine acc_sub_norms(idump_cur) + + use bt_data + use mpinpb + + implicit none + + integer idump_cur + + integer count, jio, kio, cio, ii, m, ichunk + integer ierr + integer mstatus(MPI_STATUS_SIZE) + double precision xce_single(5) + + ichunk = idump_cur - idump_sub + 1 + do ii=0, idump_sub-1 + do cio=1,ncells + do kio=0, cell_size(3,cio)-1 + do jio=0, cell_size(2,cio)-1 + iseek=(cell_low(3,cio)+kio) + & + & PROBLEM_SIZE*ii + iseek=(cell_low(2,cio)+jio) + & + & PROBLEM_SIZE*iseek + iseek=5*(cell_low(1,cio) + & + & PROBLEM_SIZE*iseek) + + count=5*cell_size(1,cio) + + call MPI_File_read_at(fp, iseek, & + & u(1,0,jio,kio,cio), & + & count, MPI_DOUBLE_PRECISION, & + & mstatus, ierr) + + if (ierr .ne. MPI_SUCCESS) then + print *, 'Error reading back file' + call MPI_File_close(fp, ierr) + stop + endif + enddo + enddo + enddo + + if (node .eq. root) print *, 'Reading data set ', ii+ichunk + + call error_norm(xce_single) + do m = 1, 5 + xce_sub(m) = xce_sub(m) + xce_single(m) + end do + enddo + + return + end + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine btio_cleanup + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + use bt_data + use mpinpb + + implicit none + + integer ierr + + call MPI_File_close(fp, ierr) + + return + end + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine accumulate_norms(xce_acc) + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + use bt_data + use mpinpb + + implicit none + + double precision xce_acc(5) + integer m, ierr + + if (rd_interval .gt. 0) goto 20 + + call MPI_File_open(comm_solve, & + & filenm, & + & MPI_MODE_RDONLY, & + & MPI_INFO_NULL, & + & fp, & + & ierr) + + iseek = 0 + call MPI_File_set_view(fp, & + & iseek, MPI_DOUBLE_PRECISION, MPI_DOUBLE_PRECISION, & + & 'native', MPI_INFO_NULL, ierr) + +! clear the last time step + + call clear_timestep + +! read back the time steps and accumulate norms + + call acc_sub_norms(idump) + + call MPI_File_close(fp, ierr) + + 20 continue + do m = 1, 5 + xce_acc(m) = xce_sub(m) / dble(idump) + end do + + return + end + diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/solve_subs.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/solve_subs.f90 new file mode 100644 index 000000000..913bd2778 --- /dev/null +++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/solve_subs.f90 @@ -0,0 +1,642 @@ + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine matvec_sub(ablock,avec,bvec) + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + +!--------------------------------------------------------------------- +! subtracts bvec=bvec - ablock*avec +!--------------------------------------------------------------------- + + implicit none + + double precision ablock,avec,bvec + dimension ablock(5,5),avec(5),bvec(5) + +!--------------------------------------------------------------------- +! rhs(i,ic,jc,kc,ccell) = rhs(i,ic,jc,kc,ccell) +! $ - lhs(i,1,ablock,ia,ja,ka,acell)* +!--------------------------------------------------------------------- + bvec(1) = bvec(1) - ablock(1,1)*avec(1) & + & - ablock(1,2)*avec(2) & + & - ablock(1,3)*avec(3) & + & - ablock(1,4)*avec(4) & + & - ablock(1,5)*avec(5) + bvec(2) = bvec(2) - ablock(2,1)*avec(1) & + & - ablock(2,2)*avec(2) & + & - ablock(2,3)*avec(3) & + & - ablock(2,4)*avec(4) & + & - ablock(2,5)*avec(5) + bvec(3) = bvec(3) - ablock(3,1)*avec(1) & + & - ablock(3,2)*avec(2) & + & - ablock(3,3)*avec(3) & + & - ablock(3,4)*avec(4) & + & - ablock(3,5)*avec(5) + bvec(4) = bvec(4) - ablock(4,1)*avec(1) & + & - ablock(4,2)*avec(2) & + & - ablock(4,3)*avec(3) & + & - ablock(4,4)*avec(4) & + & - ablock(4,5)*avec(5) + bvec(5) = bvec(5) - ablock(5,1)*avec(1) & + & - ablock(5,2)*avec(2) & + & - ablock(5,3)*avec(3) & + & - ablock(5,4)*avec(4) & + & - ablock(5,5)*avec(5) + + + return + end + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine matmul_sub(ablock, bblock, cblock) + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + +!--------------------------------------------------------------------- +! subtracts a(i,j,k) X b(i,j,k) from c(i,j,k) +!--------------------------------------------------------------------- + + implicit none + + double precision ablock, bblock, cblock + dimension ablock(5,5), bblock(5,5), cblock(5,5) + + + cblock(1,1) = cblock(1,1) - ablock(1,1)*bblock(1,1) & + & - ablock(1,2)*bblock(2,1) & + & - ablock(1,3)*bblock(3,1) & + & - ablock(1,4)*bblock(4,1) & + & - ablock(1,5)*bblock(5,1) + cblock(2,1) = cblock(2,1) - ablock(2,1)*bblock(1,1) & + & - ablock(2,2)*bblock(2,1) & + & - ablock(2,3)*bblock(3,1) & + & - ablock(2,4)*bblock(4,1) & + & - ablock(2,5)*bblock(5,1) + cblock(3,1) = cblock(3,1) - ablock(3,1)*bblock(1,1) & + & - ablock(3,2)*bblock(2,1) & + & - ablock(3,3)*bblock(3,1) & + & - ablock(3,4)*bblock(4,1) & + & - ablock(3,5)*bblock(5,1) + cblock(4,1) = cblock(4,1) - ablock(4,1)*bblock(1,1) & + & - ablock(4,2)*bblock(2,1) & + & - ablock(4,3)*bblock(3,1) & + & - ablock(4,4)*bblock(4,1) & + & - ablock(4,5)*bblock(5,1) + cblock(5,1) = cblock(5,1) - ablock(5,1)*bblock(1,1) & + & - ablock(5,2)*bblock(2,1) & + & - ablock(5,3)*bblock(3,1) & + & - ablock(5,4)*bblock(4,1) & + & - ablock(5,5)*bblock(5,1) + cblock(1,2) = cblock(1,2) - ablock(1,1)*bblock(1,2) & + & - ablock(1,2)*bblock(2,2) & + & - ablock(1,3)*bblock(3,2) & + & - ablock(1,4)*bblock(4,2) & + & - ablock(1,5)*bblock(5,2) + cblock(2,2) = cblock(2,2) - ablock(2,1)*bblock(1,2) & + & - ablock(2,2)*bblock(2,2) & + & - ablock(2,3)*bblock(3,2) & + & - ablock(2,4)*bblock(4,2) & + & - ablock(2,5)*bblock(5,2) + cblock(3,2) = cblock(3,2) - ablock(3,1)*bblock(1,2) & + & - ablock(3,2)*bblock(2,2) & + & - ablock(3,3)*bblock(3,2) & + & - ablock(3,4)*bblock(4,2) & + & - ablock(3,5)*bblock(5,2) + cblock(4,2) = cblock(4,2) - ablock(4,1)*bblock(1,2) & + & - ablock(4,2)*bblock(2,2) & + & - ablock(4,3)*bblock(3,2) & + & - ablock(4,4)*bblock(4,2) & + & - ablock(4,5)*bblock(5,2) + cblock(5,2) = cblock(5,2) - ablock(5,1)*bblock(1,2) & + & - ablock(5,2)*bblock(2,2) & + & - ablock(5,3)*bblock(3,2) & + & - ablock(5,4)*bblock(4,2) & + & - ablock(5,5)*bblock(5,2) + cblock(1,3) = cblock(1,3) - ablock(1,1)*bblock(1,3) & + & - ablock(1,2)*bblock(2,3) & + & - ablock(1,3)*bblock(3,3) & + & - ablock(1,4)*bblock(4,3) & + & - ablock(1,5)*bblock(5,3) + cblock(2,3) = cblock(2,3) - ablock(2,1)*bblock(1,3) & + & - ablock(2,2)*bblock(2,3) & + & - ablock(2,3)*bblock(3,3) & + & - ablock(2,4)*bblock(4,3) & + & - ablock(2,5)*bblock(5,3) + cblock(3,3) = cblock(3,3) - ablock(3,1)*bblock(1,3) & + & - ablock(3,2)*bblock(2,3) & + & - ablock(3,3)*bblock(3,3) & + & - ablock(3,4)*bblock(4,3) & + & - ablock(3,5)*bblock(5,3) + cblock(4,3) = cblock(4,3) - ablock(4,1)*bblock(1,3) & + & - ablock(4,2)*bblock(2,3) & + & - ablock(4,3)*bblock(3,3) & + & - ablock(4,4)*bblock(4,3) & + & - ablock(4,5)*bblock(5,3) + cblock(5,3) = cblock(5,3) - ablock(5,1)*bblock(1,3) & + & - ablock(5,2)*bblock(2,3) & + & - ablock(5,3)*bblock(3,3) & + & - ablock(5,4)*bblock(4,3) & + & - ablock(5,5)*bblock(5,3) + cblock(1,4) = cblock(1,4) - ablock(1,1)*bblock(1,4) & + & - ablock(1,2)*bblock(2,4) & + & - ablock(1,3)*bblock(3,4) & + & - ablock(1,4)*bblock(4,4) & + & - ablock(1,5)*bblock(5,4) + cblock(2,4) = cblock(2,4) - ablock(2,1)*bblock(1,4) & + & - ablock(2,2)*bblock(2,4) & + & - ablock(2,3)*bblock(3,4) & + & - ablock(2,4)*bblock(4,4) & + & - ablock(2,5)*bblock(5,4) + cblock(3,4) = cblock(3,4) - ablock(3,1)*bblock(1,4) & + & - ablock(3,2)*bblock(2,4) & + & - ablock(3,3)*bblock(3,4) & + & - ablock(3,4)*bblock(4,4) & + & - ablock(3,5)*bblock(5,4) + cblock(4,4) = cblock(4,4) - ablock(4,1)*bblock(1,4) & + & - ablock(4,2)*bblock(2,4) & + & - ablock(4,3)*bblock(3,4) & + & - ablock(4,4)*bblock(4,4) & + & - ablock(4,5)*bblock(5,4) + cblock(5,4) = cblock(5,4) - ablock(5,1)*bblock(1,4) & + & - ablock(5,2)*bblock(2,4) & + & - ablock(5,3)*bblock(3,4) & + & - ablock(5,4)*bblock(4,4) & + & - ablock(5,5)*bblock(5,4) + cblock(1,5) = cblock(1,5) - ablock(1,1)*bblock(1,5) & + & - ablock(1,2)*bblock(2,5) & + & - ablock(1,3)*bblock(3,5) & + & - ablock(1,4)*bblock(4,5) & + & - ablock(1,5)*bblock(5,5) + cblock(2,5) = cblock(2,5) - ablock(2,1)*bblock(1,5) & + & - ablock(2,2)*bblock(2,5) & + & - ablock(2,3)*bblock(3,5) & + & - ablock(2,4)*bblock(4,5) & + & - ablock(2,5)*bblock(5,5) + cblock(3,5) = cblock(3,5) - ablock(3,1)*bblock(1,5) & + & - ablock(3,2)*bblock(2,5) & + & - ablock(3,3)*bblock(3,5) & + & - ablock(3,4)*bblock(4,5) & + & - ablock(3,5)*bblock(5,5) + cblock(4,5) = cblock(4,5) - ablock(4,1)*bblock(1,5) & + & - ablock(4,2)*bblock(2,5) & + & - ablock(4,3)*bblock(3,5) & + & - ablock(4,4)*bblock(4,5) & + & - ablock(4,5)*bblock(5,5) + cblock(5,5) = cblock(5,5) - ablock(5,1)*bblock(1,5) & + & - ablock(5,2)*bblock(2,5) & + & - ablock(5,3)*bblock(3,5) & + & - ablock(5,4)*bblock(4,5) & + & - ablock(5,5)*bblock(5,5) + + + return + end + + + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine binvcrhs( lhs,c,r ) + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + +!--------------------------------------------------------------------- +! +!--------------------------------------------------------------------- + + implicit none + + double precision pivot, coeff, lhs + dimension lhs(5,5) + double precision c(5,5), r(5) + +!--------------------------------------------------------------------- +! +!--------------------------------------------------------------------- + + pivot = 1.00d0/lhs(1,1) + lhs(1,2) = lhs(1,2)*pivot + lhs(1,3) = lhs(1,3)*pivot + lhs(1,4) = lhs(1,4)*pivot + lhs(1,5) = lhs(1,5)*pivot + c(1,1) = c(1,1)*pivot + c(1,2) = c(1,2)*pivot + c(1,3) = c(1,3)*pivot + c(1,4) = c(1,4)*pivot + c(1,5) = c(1,5)*pivot + r(1) = r(1) *pivot + + coeff = lhs(2,1) + lhs(2,2)= lhs(2,2) - coeff*lhs(1,2) + lhs(2,3)= lhs(2,3) - coeff*lhs(1,3) + lhs(2,4)= lhs(2,4) - coeff*lhs(1,4) + lhs(2,5)= lhs(2,5) - coeff*lhs(1,5) + c(2,1) = c(2,1) - coeff*c(1,1) + c(2,2) = c(2,2) - coeff*c(1,2) + c(2,3) = c(2,3) - coeff*c(1,3) + c(2,4) = c(2,4) - coeff*c(1,4) + c(2,5) = c(2,5) - coeff*c(1,5) + r(2) = r(2) - coeff*r(1) + + coeff = lhs(3,1) + lhs(3,2)= lhs(3,2) - coeff*lhs(1,2) + lhs(3,3)= lhs(3,3) - coeff*lhs(1,3) + lhs(3,4)= lhs(3,4) - coeff*lhs(1,4) + lhs(3,5)= lhs(3,5) - coeff*lhs(1,5) + c(3,1) = c(3,1) - coeff*c(1,1) + c(3,2) = c(3,2) - coeff*c(1,2) + c(3,3) = c(3,3) - coeff*c(1,3) + c(3,4) = c(3,4) - coeff*c(1,4) + c(3,5) = c(3,5) - coeff*c(1,5) + r(3) = r(3) - coeff*r(1) + + coeff = lhs(4,1) + lhs(4,2)= lhs(4,2) - coeff*lhs(1,2) + lhs(4,3)= lhs(4,3) - coeff*lhs(1,3) + lhs(4,4)= lhs(4,4) - coeff*lhs(1,4) + lhs(4,5)= lhs(4,5) - coeff*lhs(1,5) + c(4,1) = c(4,1) - coeff*c(1,1) + c(4,2) = c(4,2) - coeff*c(1,2) + c(4,3) = c(4,3) - coeff*c(1,3) + c(4,4) = c(4,4) - coeff*c(1,4) + c(4,5) = c(4,5) - coeff*c(1,5) + r(4) = r(4) - coeff*r(1) + + coeff = lhs(5,1) + lhs(5,2)= lhs(5,2) - coeff*lhs(1,2) + lhs(5,3)= lhs(5,3) - coeff*lhs(1,3) + lhs(5,4)= lhs(5,4) - coeff*lhs(1,4) + lhs(5,5)= lhs(5,5) - coeff*lhs(1,5) + c(5,1) = c(5,1) - coeff*c(1,1) + c(5,2) = c(5,2) - coeff*c(1,2) + c(5,3) = c(5,3) - coeff*c(1,3) + c(5,4) = c(5,4) - coeff*c(1,4) + c(5,5) = c(5,5) - coeff*c(1,5) + r(5) = r(5) - coeff*r(1) + + + pivot = 1.00d0/lhs(2,2) + lhs(2,3) = lhs(2,3)*pivot + lhs(2,4) = lhs(2,4)*pivot + lhs(2,5) = lhs(2,5)*pivot + c(2,1) = c(2,1)*pivot + c(2,2) = c(2,2)*pivot + c(2,3) = c(2,3)*pivot + c(2,4) = c(2,4)*pivot + c(2,5) = c(2,5)*pivot + r(2) = r(2) *pivot + + coeff = lhs(1,2) + lhs(1,3)= lhs(1,3) - coeff*lhs(2,3) + lhs(1,4)= lhs(1,4) - coeff*lhs(2,4) + lhs(1,5)= lhs(1,5) - coeff*lhs(2,5) + c(1,1) = c(1,1) - coeff*c(2,1) + c(1,2) = c(1,2) - coeff*c(2,2) + c(1,3) = c(1,3) - coeff*c(2,3) + c(1,4) = c(1,4) - coeff*c(2,4) + c(1,5) = c(1,5) - coeff*c(2,5) + r(1) = r(1) - coeff*r(2) + + coeff = lhs(3,2) + lhs(3,3)= lhs(3,3) - coeff*lhs(2,3) + lhs(3,4)= lhs(3,4) - coeff*lhs(2,4) + lhs(3,5)= lhs(3,5) - coeff*lhs(2,5) + c(3,1) = c(3,1) - coeff*c(2,1) + c(3,2) = c(3,2) - coeff*c(2,2) + c(3,3) = c(3,3) - coeff*c(2,3) + c(3,4) = c(3,4) - coeff*c(2,4) + c(3,5) = c(3,5) - coeff*c(2,5) + r(3) = r(3) - coeff*r(2) + + coeff = lhs(4,2) + lhs(4,3)= lhs(4,3) - coeff*lhs(2,3) + lhs(4,4)= lhs(4,4) - coeff*lhs(2,4) + lhs(4,5)= lhs(4,5) - coeff*lhs(2,5) + c(4,1) = c(4,1) - coeff*c(2,1) + c(4,2) = c(4,2) - coeff*c(2,2) + c(4,3) = c(4,3) - coeff*c(2,3) + c(4,4) = c(4,4) - coeff*c(2,4) + c(4,5) = c(4,5) - coeff*c(2,5) + r(4) = r(4) - coeff*r(2) + + coeff = lhs(5,2) + lhs(5,3)= lhs(5,3) - coeff*lhs(2,3) + lhs(5,4)= lhs(5,4) - coeff*lhs(2,4) + lhs(5,5)= lhs(5,5) - coeff*lhs(2,5) + c(5,1) = c(5,1) - coeff*c(2,1) + c(5,2) = c(5,2) - coeff*c(2,2) + c(5,3) = c(5,3) - coeff*c(2,3) + c(5,4) = c(5,4) - coeff*c(2,4) + c(5,5) = c(5,5) - coeff*c(2,5) + r(5) = r(5) - coeff*r(2) + + + pivot = 1.00d0/lhs(3,3) + lhs(3,4) = lhs(3,4)*pivot + lhs(3,5) = lhs(3,5)*pivot + c(3,1) = c(3,1)*pivot + c(3,2) = c(3,2)*pivot + c(3,3) = c(3,3)*pivot + c(3,4) = c(3,4)*pivot + c(3,5) = c(3,5)*pivot + r(3) = r(3) *pivot + + coeff = lhs(1,3) + lhs(1,4)= lhs(1,4) - coeff*lhs(3,4) + lhs(1,5)= lhs(1,5) - coeff*lhs(3,5) + c(1,1) = c(1,1) - coeff*c(3,1) + c(1,2) = c(1,2) - coeff*c(3,2) + c(1,3) = c(1,3) - coeff*c(3,3) + c(1,4) = c(1,4) - coeff*c(3,4) + c(1,5) = c(1,5) - coeff*c(3,5) + r(1) = r(1) - coeff*r(3) + + coeff = lhs(2,3) + lhs(2,4)= lhs(2,4) - coeff*lhs(3,4) + lhs(2,5)= lhs(2,5) - coeff*lhs(3,5) + c(2,1) = c(2,1) - coeff*c(3,1) + c(2,2) = c(2,2) - coeff*c(3,2) + c(2,3) = c(2,3) - coeff*c(3,3) + c(2,4) = c(2,4) - coeff*c(3,4) + c(2,5) = c(2,5) - coeff*c(3,5) + r(2) = r(2) - coeff*r(3) + + coeff = lhs(4,3) + lhs(4,4)= lhs(4,4) - coeff*lhs(3,4) + lhs(4,5)= lhs(4,5) - coeff*lhs(3,5) + c(4,1) = c(4,1) - coeff*c(3,1) + c(4,2) = c(4,2) - coeff*c(3,2) + c(4,3) = c(4,3) - coeff*c(3,3) + c(4,4) = c(4,4) - coeff*c(3,4) + c(4,5) = c(4,5) - coeff*c(3,5) + r(4) = r(4) - coeff*r(3) + + coeff = lhs(5,3) + lhs(5,4)= lhs(5,4) - coeff*lhs(3,4) + lhs(5,5)= lhs(5,5) - coeff*lhs(3,5) + c(5,1) = c(5,1) - coeff*c(3,1) + c(5,2) = c(5,2) - coeff*c(3,2) + c(5,3) = c(5,3) - coeff*c(3,3) + c(5,4) = c(5,4) - coeff*c(3,4) + c(5,5) = c(5,5) - coeff*c(3,5) + r(5) = r(5) - coeff*r(3) + + + pivot = 1.00d0/lhs(4,4) + lhs(4,5) = lhs(4,5)*pivot + c(4,1) = c(4,1)*pivot + c(4,2) = c(4,2)*pivot + c(4,3) = c(4,3)*pivot + c(4,4) = c(4,4)*pivot + c(4,5) = c(4,5)*pivot + r(4) = r(4) *pivot + + coeff = lhs(1,4) + lhs(1,5)= lhs(1,5) - coeff*lhs(4,5) + c(1,1) = c(1,1) - coeff*c(4,1) + c(1,2) = c(1,2) - coeff*c(4,2) + c(1,3) = c(1,3) - coeff*c(4,3) + c(1,4) = c(1,4) - coeff*c(4,4) + c(1,5) = c(1,5) - coeff*c(4,5) + r(1) = r(1) - coeff*r(4) + + coeff = lhs(2,4) + lhs(2,5)= lhs(2,5) - coeff*lhs(4,5) + c(2,1) = c(2,1) - coeff*c(4,1) + c(2,2) = c(2,2) - coeff*c(4,2) + c(2,3) = c(2,3) - coeff*c(4,3) + c(2,4) = c(2,4) - coeff*c(4,4) + c(2,5) = c(2,5) - coeff*c(4,5) + r(2) = r(2) - coeff*r(4) + + coeff = lhs(3,4) + lhs(3,5)= lhs(3,5) - coeff*lhs(4,5) + c(3,1) = c(3,1) - coeff*c(4,1) + c(3,2) = c(3,2) - coeff*c(4,2) + c(3,3) = c(3,3) - coeff*c(4,3) + c(3,4) = c(3,4) - coeff*c(4,4) + c(3,5) = c(3,5) - coeff*c(4,5) + r(3) = r(3) - coeff*r(4) + + coeff = lhs(5,4) + lhs(5,5)= lhs(5,5) - coeff*lhs(4,5) + c(5,1) = c(5,1) - coeff*c(4,1) + c(5,2) = c(5,2) - coeff*c(4,2) + c(5,3) = c(5,3) - coeff*c(4,3) + c(5,4) = c(5,4) - coeff*c(4,4) + c(5,5) = c(5,5) - coeff*c(4,5) + r(5) = r(5) - coeff*r(4) + + + pivot = 1.00d0/lhs(5,5) + c(5,1) = c(5,1)*pivot + c(5,2) = c(5,2)*pivot + c(5,3) = c(5,3)*pivot + c(5,4) = c(5,4)*pivot + c(5,5) = c(5,5)*pivot + r(5) = r(5) *pivot + + coeff = lhs(1,5) + c(1,1) = c(1,1) - coeff*c(5,1) + c(1,2) = c(1,2) - coeff*c(5,2) + c(1,3) = c(1,3) - coeff*c(5,3) + c(1,4) = c(1,4) - coeff*c(5,4) + c(1,5) = c(1,5) - coeff*c(5,5) + r(1) = r(1) - coeff*r(5) + + coeff = lhs(2,5) + c(2,1) = c(2,1) - coeff*c(5,1) + c(2,2) = c(2,2) - coeff*c(5,2) + c(2,3) = c(2,3) - coeff*c(5,3) + c(2,4) = c(2,4) - coeff*c(5,4) + c(2,5) = c(2,5) - coeff*c(5,5) + r(2) = r(2) - coeff*r(5) + + coeff = lhs(3,5) + c(3,1) = c(3,1) - coeff*c(5,1) + c(3,2) = c(3,2) - coeff*c(5,2) + c(3,3) = c(3,3) - coeff*c(5,3) + c(3,4) = c(3,4) - coeff*c(5,4) + c(3,5) = c(3,5) - coeff*c(5,5) + r(3) = r(3) - coeff*r(5) + + coeff = lhs(4,5) + c(4,1) = c(4,1) - coeff*c(5,1) + c(4,2) = c(4,2) - coeff*c(5,2) + c(4,3) = c(4,3) - coeff*c(5,3) + c(4,4) = c(4,4) - coeff*c(5,4) + c(4,5) = c(4,5) - coeff*c(5,5) + r(4) = r(4) - coeff*r(5) + + + return + end + + + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine binvrhs( lhs,r ) + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + +!--------------------------------------------------------------------- +! +!--------------------------------------------------------------------- + + implicit none + + double precision pivot, coeff, lhs + dimension lhs(5,5) + double precision r(5) + +!--------------------------------------------------------------------- +! +!--------------------------------------------------------------------- + + + pivot = 1.00d0/lhs(1,1) + lhs(1,2) = lhs(1,2)*pivot + lhs(1,3) = lhs(1,3)*pivot + lhs(1,4) = lhs(1,4)*pivot + lhs(1,5) = lhs(1,5)*pivot + r(1) = r(1) *pivot + + coeff = lhs(2,1) + lhs(2,2)= lhs(2,2) - coeff*lhs(1,2) + lhs(2,3)= lhs(2,3) - coeff*lhs(1,3) + lhs(2,4)= lhs(2,4) - coeff*lhs(1,4) + lhs(2,5)= lhs(2,5) - coeff*lhs(1,5) + r(2) = r(2) - coeff*r(1) + + coeff = lhs(3,1) + lhs(3,2)= lhs(3,2) - coeff*lhs(1,2) + lhs(3,3)= lhs(3,3) - coeff*lhs(1,3) + lhs(3,4)= lhs(3,4) - coeff*lhs(1,4) + lhs(3,5)= lhs(3,5) - coeff*lhs(1,5) + r(3) = r(3) - coeff*r(1) + + coeff = lhs(4,1) + lhs(4,2)= lhs(4,2) - coeff*lhs(1,2) + lhs(4,3)= lhs(4,3) - coeff*lhs(1,3) + lhs(4,4)= lhs(4,4) - coeff*lhs(1,4) + lhs(4,5)= lhs(4,5) - coeff*lhs(1,5) + r(4) = r(4) - coeff*r(1) + + coeff = lhs(5,1) + lhs(5,2)= lhs(5,2) - coeff*lhs(1,2) + lhs(5,3)= lhs(5,3) - coeff*lhs(1,3) + lhs(5,4)= lhs(5,4) - coeff*lhs(1,4) + lhs(5,5)= lhs(5,5) - coeff*lhs(1,5) + r(5) = r(5) - coeff*r(1) + + + pivot = 1.00d0/lhs(2,2) + lhs(2,3) = lhs(2,3)*pivot + lhs(2,4) = lhs(2,4)*pivot + lhs(2,5) = lhs(2,5)*pivot + r(2) = r(2) *pivot + + coeff = lhs(1,2) + lhs(1,3)= lhs(1,3) - coeff*lhs(2,3) + lhs(1,4)= lhs(1,4) - coeff*lhs(2,4) + lhs(1,5)= lhs(1,5) - coeff*lhs(2,5) + r(1) = r(1) - coeff*r(2) + + coeff = lhs(3,2) + lhs(3,3)= lhs(3,3) - coeff*lhs(2,3) + lhs(3,4)= lhs(3,4) - coeff*lhs(2,4) + lhs(3,5)= lhs(3,5) - coeff*lhs(2,5) + r(3) = r(3) - coeff*r(2) + + coeff = lhs(4,2) + lhs(4,3)= lhs(4,3) - coeff*lhs(2,3) + lhs(4,4)= lhs(4,4) - coeff*lhs(2,4) + lhs(4,5)= lhs(4,5) - coeff*lhs(2,5) + r(4) = r(4) - coeff*r(2) + + coeff = lhs(5,2) + lhs(5,3)= lhs(5,3) - coeff*lhs(2,3) + lhs(5,4)= lhs(5,4) - coeff*lhs(2,4) + lhs(5,5)= lhs(5,5) - coeff*lhs(2,5) + r(5) = r(5) - coeff*r(2) + + + pivot = 1.00d0/lhs(3,3) + lhs(3,4) = lhs(3,4)*pivot + lhs(3,5) = lhs(3,5)*pivot + r(3) = r(3) *pivot + + coeff = lhs(1,3) + lhs(1,4)= lhs(1,4) - coeff*lhs(3,4) + lhs(1,5)= lhs(1,5) - coeff*lhs(3,5) + r(1) = r(1) - coeff*r(3) + + coeff = lhs(2,3) + lhs(2,4)= lhs(2,4) - coeff*lhs(3,4) + lhs(2,5)= lhs(2,5) - coeff*lhs(3,5) + r(2) = r(2) - coeff*r(3) + + coeff = lhs(4,3) + lhs(4,4)= lhs(4,4) - coeff*lhs(3,4) + lhs(4,5)= lhs(4,5) - coeff*lhs(3,5) + r(4) = r(4) - coeff*r(3) + + coeff = lhs(5,3) + lhs(5,4)= lhs(5,4) - coeff*lhs(3,4) + lhs(5,5)= lhs(5,5) - coeff*lhs(3,5) + r(5) = r(5) - coeff*r(3) + + + pivot = 1.00d0/lhs(4,4) + lhs(4,5) = lhs(4,5)*pivot + r(4) = r(4) *pivot + + coeff = lhs(1,4) + lhs(1,5)= lhs(1,5) - coeff*lhs(4,5) + r(1) = r(1) - coeff*r(4) + + coeff = lhs(2,4) + lhs(2,5)= lhs(2,5) - coeff*lhs(4,5) + r(2) = r(2) - coeff*r(4) + + coeff = lhs(3,4) + lhs(3,5)= lhs(3,5) - coeff*lhs(4,5) + r(3) = r(3) - coeff*r(4) + + coeff = lhs(5,4) + lhs(5,5)= lhs(5,5) - coeff*lhs(4,5) + r(5) = r(5) - coeff*r(4) + + + pivot = 1.00d0/lhs(5,5) + r(5) = r(5) *pivot + + coeff = lhs(1,5) + r(1) = r(1) - coeff*r(5) + + coeff = lhs(2,5) + r(2) = r(2) - coeff*r(5) + + coeff = lhs(3,5) + r(3) = r(3) - coeff*r(5) + + coeff = lhs(4,5) + r(4) = r(4) - coeff*r(5) + + + return + end + + + diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/verify.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/verify.f90 new file mode 100644 index 000000000..977b95ec9 --- /dev/null +++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/verify.f90 @@ -0,0 +1,529 @@ + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine set_class(no_time_steps, class) + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + +!--------------------------------------------------------------------- +! set problem class based on problem size +!--------------------------------------------------------------------- + + use bt_data + implicit none + + integer no_time_steps + character class + + + if ( (grid_points(1) .eq. 12 ) .and. & + & (grid_points(2) .eq. 12 ) .and. & + & (grid_points(3) .eq. 12 ) .and. & + & (no_time_steps .eq. 60 )) then + + class = 'S' + + elseif ( (grid_points(1) .eq. 24) .and. & + & (grid_points(2) .eq. 24) .and. & + & (grid_points(3) .eq. 24) .and. & + & (no_time_steps .eq. 200) ) then + + class = 'W' + + elseif ( (grid_points(1) .eq. 64) .and. & + & (grid_points(2) .eq. 64) .and. & + & (grid_points(3) .eq. 64) .and. & + & (no_time_steps .eq. 200) ) then + + class = 'A' + + elseif ( (grid_points(1) .eq. 102) .and. & + & (grid_points(2) .eq. 102) .and. & + & (grid_points(3) .eq. 102) .and. & + & (no_time_steps .eq. 200) ) then + + class = 'B' + + elseif ( (grid_points(1) .eq. 162) .and. & + & (grid_points(2) .eq. 162) .and. & + & (grid_points(3) .eq. 162) .and. & + & (no_time_steps .eq. 200) ) then + + class = 'C' + + elseif ( (grid_points(1) .eq. 408) .and. & + & (grid_points(2) .eq. 408) .and. & + & (grid_points(3) .eq. 408) .and. & + & (no_time_steps .eq. 250) ) then + + class = 'D' + + elseif ( (grid_points(1) .eq. 1020) .and. & + & (grid_points(2) .eq. 1020) .and. & + & (grid_points(3) .eq. 1020) .and. & + & (no_time_steps .eq. 250) ) then + + class = 'E' + + elseif ( (grid_points(1) .eq. 2560) .and. & + & (grid_points(2) .eq. 2560) .and. & + & (grid_points(3) .eq. 2560) .and. & + & (no_time_steps .eq. 250) ) then + + class = 'F' + + else + + class = 'U' + + endif + + return + end + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine verify(class, verified) + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + +!--------------------------------------------------------------------- +! verification routine +!--------------------------------------------------------------------- + + use, intrinsic :: ieee_arithmetic, only : ieee_is_nan + + use bt_data + use mpinpb + + implicit none + + double precision xcrref(5),xceref(5),xcrdif(5),xcedif(5), & + & epsilon, xce(5), xcr(5), dtref + integer m + character class + logical verified + +!--------------------------------------------------------------------- +! tolerance level +!--------------------------------------------------------------------- + epsilon = 1.0d-08 + verified = .true. + +!--------------------------------------------------------------------- +! compute the error norm and the residual norm, and exit if not printing +!--------------------------------------------------------------------- + + if (iotype .ne. 0) then + call timer_start(t_iov) + call accumulate_norms(xce) + call timer_stop(t_iov) + else + call error_norm(xce) + endif + + call copy_faces + + call rhs_norm(xcr) + + do m = 1, 5 + xcr(m) = xcr(m) / dt + enddo + + if (node .ne. 0) return + + do m = 1,5 + xcrref(m) = 1.0 + xceref(m) = 1.0 + end do + +!--------------------------------------------------------------------- +! reference data for 12X12X12 grids after 60 time steps, with DT = 1.0d-02 +!--------------------------------------------------------------------- + if ( class .eq. 'S' ) then + + dtref = 1.0d-2 + +!--------------------------------------------------------------------- +! Reference values of RMS-norms of residual. +!--------------------------------------------------------------------- + xcrref(1) = 1.7034283709541311d-01 + xcrref(2) = 1.2975252070034097d-02 + xcrref(3) = 3.2527926989486055d-02 + xcrref(4) = 2.6436421275166801d-02 + xcrref(5) = 1.9211784131744430d-01 + +!--------------------------------------------------------------------- +! Reference values of RMS-norms of solution error. +!--------------------------------------------------------------------- + + if (iotype .eq. 0) then + xceref(1) = 4.9976913345811579d-04 + xceref(2) = 4.5195666782961927d-05 + xceref(3) = 7.3973765172921357d-05 + xceref(4) = 7.3821238632439731d-05 + xceref(5) = 8.9269630987491446d-04 + else + xceref(1) = 0.1149036328945d+02 + xceref(2) = 0.9156788904727d+00 + xceref(3) = 0.2857899428614d+01 + xceref(4) = 0.2598273346734d+01 + xceref(5) = 0.2652795397547d+02 + endif + +!--------------------------------------------------------------------- +! reference data for 24X24X24 grids after 200 time steps, with DT = 0.8d-3 +!--------------------------------------------------------------------- + elseif ( class .eq. 'W' ) then + + dtref = 0.8d-3 + +!--------------------------------------------------------------------- +! Reference values of RMS-norms of residual. +!--------------------------------------------------------------------- + xcrref(1) = 0.1125590409344d+03 + xcrref(2) = 0.1180007595731d+02 + xcrref(3) = 0.2710329767846d+02 + xcrref(4) = 0.2469174937669d+02 + xcrref(5) = 0.2638427874317d+03 + +!--------------------------------------------------------------------- +! Reference values of RMS-norms of solution error. +!--------------------------------------------------------------------- + + if (iotype .eq. 0) then + xceref(1) = 0.4419655736008d+01 + xceref(2) = 0.4638531260002d+00 + xceref(3) = 0.1011551749967d+01 + xceref(4) = 0.9235878729944d+00 + xceref(5) = 0.1018045837718d+02 + else + xceref(1) = 0.6729594398612d+02 + xceref(2) = 0.5264523081690d+01 + xceref(3) = 0.1677107142637d+02 + xceref(4) = 0.1508721463436d+02 + xceref(5) = 0.1477018363393d+03 + endif + + +!--------------------------------------------------------------------- +! reference data for 64X64X64 grids after 200 time steps, with DT = 0.8d-3 +!--------------------------------------------------------------------- + elseif ( class .eq. 'A' ) then + + dtref = 0.8d-3 + +!--------------------------------------------------------------------- +! Reference values of RMS-norms of residual. +!--------------------------------------------------------------------- + xcrref(1) = 1.0806346714637264d+02 + xcrref(2) = 1.1319730901220813d+01 + xcrref(3) = 2.5974354511582465d+01 + xcrref(4) = 2.3665622544678910d+01 + xcrref(5) = 2.5278963211748344d+02 + +!--------------------------------------------------------------------- +! Reference values of RMS-norms of solution error. +!--------------------------------------------------------------------- + + if (iotype .eq. 0) then + xceref(1) = 4.2348416040525025d+00 + xceref(2) = 4.4390282496995698d-01 + xceref(3) = 9.6692480136345650d-01 + xceref(4) = 8.8302063039765474d-01 + xceref(5) = 9.7379901770829278d+00 + else + xceref(1) = 0.6482218724961d+02 + xceref(2) = 0.5066461714527d+01 + xceref(3) = 0.1613931961359d+02 + xceref(4) = 0.1452010201481d+02 + xceref(5) = 0.1420099377681d+03 + endif + +!--------------------------------------------------------------------- +! reference data for 102X102X102 grids after 200 time steps, +! with DT = 3.0d-04 +!--------------------------------------------------------------------- + elseif ( class .eq. 'B' ) then + + dtref = 3.0d-4 + +!--------------------------------------------------------------------- +! Reference values of RMS-norms of residual. +!--------------------------------------------------------------------- + xcrref(1) = 1.4233597229287254d+03 + xcrref(2) = 9.9330522590150238d+01 + xcrref(3) = 3.5646025644535285d+02 + xcrref(4) = 3.2485447959084092d+02 + xcrref(5) = 3.2707541254659363d+03 + +!--------------------------------------------------------------------- +! Reference values of RMS-norms of solution error. +!--------------------------------------------------------------------- + + if (iotype .eq. 0) then + xceref(1) = 5.2969847140936856d+01 + xceref(2) = 4.4632896115670668d+00 + xceref(3) = 1.3122573342210174d+01 + xceref(4) = 1.2006925323559144d+01 + xceref(5) = 1.2459576151035986d+02 + else + xceref(1) = 0.1477545106464d+03 + xceref(2) = 0.1108895555053d+02 + xceref(3) = 0.3698065590331d+02 + xceref(4) = 0.3310505581440d+02 + xceref(5) = 0.3157928282563d+03 + endif + +!--------------------------------------------------------------------- +! reference data for 162X162X162 grids after 200 time steps, +! with DT = 1.0d-04 +!--------------------------------------------------------------------- + elseif ( class .eq. 'C' ) then + + dtref = 1.0d-4 + +!--------------------------------------------------------------------- +! Reference values of RMS-norms of residual. +!--------------------------------------------------------------------- + xcrref(1) = 0.62398116551764615d+04 + xcrref(2) = 0.50793239190423964d+03 + xcrref(3) = 0.15423530093013596d+04 + xcrref(4) = 0.13302387929291190d+04 + xcrref(5) = 0.11604087428436455d+05 + +!--------------------------------------------------------------------- +! Reference values of RMS-norms of solution error. +!--------------------------------------------------------------------- + + if (iotype .eq. 0) then + xceref(1) = 0.16462008369091265d+03 + xceref(2) = 0.11497107903824313d+02 + xceref(3) = 0.41207446207461508d+02 + xceref(4) = 0.37087651059694167d+02 + xceref(5) = 0.36211053051841265d+03 + else + xceref(1) = 0.2597156483475d+03 + xceref(2) = 0.1985384289495d+02 + xceref(3) = 0.6517950485788d+02 + xceref(4) = 0.5757235541520d+02 + xceref(5) = 0.5215668188726d+03 + endif + + +!--------------------------------------------------------------------- +! reference data for 408x408x408 grids after 250 time steps, +! with DT = 0.2d-04 +!--------------------------------------------------------------------- + elseif ( class .eq. 'D' ) then + + dtref = 0.2d-4 + +!--------------------------------------------------------------------- +! Reference values of RMS-norms of residual. +!--------------------------------------------------------------------- + xcrref(1) = 0.2533188551738d+05 + xcrref(2) = 0.2346393716980d+04 + xcrref(3) = 0.6294554366904d+04 + xcrref(4) = 0.5352565376030d+04 + xcrref(5) = 0.3905864038618d+05 + +!--------------------------------------------------------------------- +! Reference values of RMS-norms of solution error. +!--------------------------------------------------------------------- + + if (iotype .eq. 0) then + xceref(1) = 0.3100009377557d+03 + xceref(2) = 0.2424086324913d+02 + xceref(3) = 0.7782212022645d+02 + xceref(4) = 0.6835623860116d+02 + xceref(5) = 0.6065737200368d+03 + else + xceref(1) = 0.3813781566713d+03 + xceref(2) = 0.3160872966198d+02 + xceref(3) = 0.9593576357290d+02 + xceref(4) = 0.8363391989815d+02 + xceref(5) = 0.7063466087423d+03 + endif + + +!--------------------------------------------------------------------- +! reference data for 1020x1020x1020 grids after 250 time steps, +! with DT = 0.4d-05 +!--------------------------------------------------------------------- + elseif ( class .eq. 'E' ) then + + dtref = 0.4d-5 + +!--------------------------------------------------------------------- +! Reference values of RMS-norms of residual. +!--------------------------------------------------------------------- + xcrref(1) = 0.9795372484517d+05 + xcrref(2) = 0.9739814511521d+04 + xcrref(3) = 0.2467606342965d+05 + xcrref(4) = 0.2092419572860d+05 + xcrref(5) = 0.1392138856939d+06 + +!--------------------------------------------------------------------- +! Reference values of RMS-norms of solution error. +!--------------------------------------------------------------------- + + if (iotype .eq. 0) then + xceref(1) = 0.4327562208414d+03 + xceref(2) = 0.3699051964887d+02 + xceref(3) = 0.1089845040954d+03 + xceref(4) = 0.9462517622043d+02 + xceref(5) = 0.7765512765309d+03 + else +! wr_interval = 5 + xceref(1) = 0.4729898413058d+03 + xceref(2) = 0.4145899331704d+02 + xceref(3) = 0.1192850917138d+03 + xceref(4) = 0.1032746026932d+03 + xceref(5) = 0.8270322177634d+03 +! wr_interval = 10 +! xceref(1) = 0.4718135916251d+03 +! xceref(2) = 0.4132620259096d+02 +! xceref(3) = 0.1189831133503d+03 +! xceref(4) = 0.1030212798803d+03 +! xceref(5) = 0.8255924078458d+03 + endif + +!--------------------------------------------------------------------- +! reference data for 2560x2560x2560 grids after 250 time steps, +! with DT = 0.6d-06 +!--------------------------------------------------------------------- + elseif ( class .eq. 'F' ) then + + dtref = 0.6d-6 + +!--------------------------------------------------------------------- +! Reference values of RMS-norms of residual. +!--------------------------------------------------------------------- + xcrref(1) = 0.4240735175585d+06 + xcrref(2) = 0.4348701133212d+05 + xcrref(3) = 0.1078114688845d+06 + xcrref(4) = 0.9142160938556d+05 + xcrref(5) = 0.5879842143431d+06 + +!--------------------------------------------------------------------- +! Reference values of RMS-norms of solution error. +!--------------------------------------------------------------------- + + if (iotype .eq. 0) then + xceref(1) = 0.5095577042351d+03 + xceref(2) = 0.4557065541652d+02 + xceref(3) = 0.1286632140581d+03 + xceref(4) = 0.1111419378722d+03 + xceref(5) = 0.8720011709356d+03 + endif + + else + + verified = .false. + + endif + +!--------------------------------------------------------------------- +! verification test for residuals if gridsize is one of +! the defined grid sizes above (class .ne. 'U') +!--------------------------------------------------------------------- + +!--------------------------------------------------------------------- +! Compute the difference of solution values and the known reference +! values. +!--------------------------------------------------------------------- + do m = 1, 5 + + xcrdif(m) = dabs((xcr(m)-xcrref(m))/xcrref(m)) + xcedif(m) = dabs((xce(m)-xceref(m))/xceref(m)) + + enddo + +!--------------------------------------------------------------------- +! Output the comparison of computed results to known cases. +!--------------------------------------------------------------------- + + if (class .ne. 'U') then + write(*, 1990) class + 1990 format(' Verification being performed for class ', a) + write (*,2000) epsilon + 2000 format(' accuracy setting for epsilon = ', E20.13) + verified = (dabs(dt-dtref) .le. epsilon) + if (.not.verified) then + class = 'U' + write (*,1000) dtref + 1000 format(' DT does not match the reference value of ', & + & E15.8) + endif + else + write(*, 1995) + 1995 format(' Unknown class') + endif + + + if (class .ne. 'U') then + write (*,2001) + else + write (*, 2005) + endif + + 2001 format(' Comparison of RMS-norms of residual') + 2005 format(' RMS-norms of residual') + do m = 1, 5 + if (class .eq. 'U') then + write(*, 2015) m, xcr(m) + else if ((.not.ieee_is_nan(xcrdif(m))) .and. & + & xcrdif(m) .le. epsilon) then + write (*,2011) m,xcr(m),xcrref(m),xcrdif(m) + else + verified = .false. + write (*,2010) m,xcr(m),xcrref(m),xcrdif(m) + endif + enddo + + if (class .ne. 'U') then + write (*,2002) + else + write (*,2006) + endif + 2002 format(' Comparison of RMS-norms of solution error') + 2006 format(' RMS-norms of solution error') + + do m = 1, 5 + if (class .eq. 'U') then + write(*, 2015) m, xce(m) + else if ((.not.ieee_is_nan(xcedif(m))) .and. & + & xcedif(m) .le. epsilon) then + write (*,2011) m,xce(m),xceref(m),xcedif(m) + else + verified = .false. + write (*,2010) m,xce(m),xceref(m),xcedif(m) + endif + enddo + + 2010 format(' FAILURE: ', i2, E20.13, E20.13, E20.13) + 2011 format(' ', i2, E20.13, E20.13, E20.13) + 2015 format(' ', i2, E20.13) + + if (class .eq. 'U') then + write(*, 2022) + write(*, 2023) + 2022 format(' No reference values provided') + 2023 format(' No verification performed') + else if (verified) then + write(*, 2020) + 2020 format(' Verification Successful') + else + write(*, 2021) + 2021 format(' Verification failed') + endif + + return + + + end diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/x_solve.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/x_solve.f90 new file mode 100644 index 000000000..125a46894 --- /dev/null +++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/x_solve.f90 @@ -0,0 +1,790 @@ + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine x_solve + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + +!--------------------------------------------------------------------- +! +! Performs line solves in X direction by first factoring +! the block-tridiagonal matrix into an upper triangular matrix, +! and then performing back substitution to solve for the unknow +! vectors of each line. +! +! Make sure we treat elements zero to cell_size in the direction +! of the sweep. +! +!--------------------------------------------------------------------- + + use bt_data + use mpinpb + + implicit none + + integer c, istart, stage, & + & first, last, recv_id, error, r_status(MPI_STATUS_SIZE), & + & isize,jsize,ksize,send_id + + istart = 0 + + if (timeron) call timer_start(t_xsolve) +!--------------------------------------------------------------------- +! in our terminology stage is the number of the cell in the x-direction +! i.e. stage = 1 means the start of the line stage=ncells means end +!--------------------------------------------------------------------- + do stage = 1,ncells + c = slice(1,stage) + isize = cell_size(1,c) - 1 + jsize = cell_size(2,c) - 1 + ksize = cell_size(3,c) - 1 + +!--------------------------------------------------------------------- +! set last-cell flag +!--------------------------------------------------------------------- + if (stage .eq. ncells) then + last = 1 + else + last = 0 + endif + + if (stage .eq. 1) then +!--------------------------------------------------------------------- +! This is the first cell, so solve without receiving data +!--------------------------------------------------------------------- + first = 1 +! call lhsx(c) + call x_solve_cell(first,last,c) + else +!--------------------------------------------------------------------- +! Not the first cell of this line, so receive info from +! processor working on preceeding cell +!--------------------------------------------------------------------- + first = 0 + if (timeron) call timer_start(t_xcomm) + call x_receive_solve_info(recv_id,c) +!--------------------------------------------------------------------- +! overlap computations and communications +!--------------------------------------------------------------------- +! call lhsx(c) +!--------------------------------------------------------------------- +! wait for completion +!--------------------------------------------------------------------- + call mpi_wait(send_id,r_status,error) + call mpi_wait(recv_id,r_status,error) + if (timeron) call timer_stop(t_xcomm) +!--------------------------------------------------------------------- +! install C'(istart) and rhs'(istart) to be used in this cell +!--------------------------------------------------------------------- + call x_unpack_solve_info(c) + call x_solve_cell(first,last,c) + endif + + if (last .eq. 0) call x_send_solve_info(send_id,c) + enddo + +!--------------------------------------------------------------------- +! now perform backsubstitution in reverse direction +!--------------------------------------------------------------------- + do stage = ncells, 1, -1 + c = slice(1,stage) + first = 0 + last = 0 + if (stage .eq. 1) first = 1 + if (stage .eq. ncells) then + last = 1 +!--------------------------------------------------------------------- +! last cell, so perform back substitute without waiting +!--------------------------------------------------------------------- + call x_backsubstitute(first, last,c) + else + if (timeron) call timer_start(t_xcomm) + call x_receive_backsub_info(recv_id,c) + call mpi_wait(send_id,r_status,error) + call mpi_wait(recv_id,r_status,error) + if (timeron) call timer_stop(t_xcomm) + call x_unpack_backsub_info(c) + call x_backsubstitute(first,last,c) + endif + if (first .eq. 0) call x_send_backsub_info(send_id,c) + enddo + + if (timeron) call timer_stop(t_xsolve) + + return + end + + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine x_unpack_solve_info(c) + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + +!--------------------------------------------------------------------- +! unpack C'(-1) and rhs'(-1) for +! all j and k +!--------------------------------------------------------------------- + + use bt_data + implicit none + + integer j,k,m,n,ptr,c,istart + + istart = 0 + ptr = 0 + do k=0,KMAX-1 + do j=0,JMAX-1 + do m=1,BLOCK_SIZE + do n=1,BLOCK_SIZE + lhsc(m,n,istart-1,j,k,c) = out_buffer(ptr+n) + enddo + ptr = ptr+BLOCK_SIZE + enddo + do n=1,BLOCK_SIZE + rhs(n,istart-1,j,k,c) = out_buffer(ptr+n) + enddo + ptr = ptr+BLOCK_SIZE + enddo + enddo + + return + end + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine x_send_solve_info(send_id,c) + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + +!--------------------------------------------------------------------- +! pack up and send C'(iend) and rhs'(iend) for +! all j and k +!--------------------------------------------------------------------- + + use bt_data + use mpinpb + + implicit none + + integer j,k,m,n,isize,ptr,c,jp,kp + integer error,send_id,buffer_size + + isize = cell_size(1,c)-1 + jp = cell_coord(2,c) - 1 + kp = cell_coord(3,c) - 1 + buffer_size=MAX_CELL_DIM*MAX_CELL_DIM* & + & (BLOCK_SIZE*BLOCK_SIZE + BLOCK_SIZE) + +!--------------------------------------------------------------------- +! pack up buffer +!--------------------------------------------------------------------- + ptr = 0 + do k=0,KMAX-1 + do j=0,JMAX-1 + do m=1,BLOCK_SIZE + do n=1,BLOCK_SIZE + in_buffer(ptr+n) = lhsc(m,n,isize,j,k,c) + enddo + ptr = ptr+BLOCK_SIZE + enddo + do n=1,BLOCK_SIZE + in_buffer(ptr+n) = rhs(n,isize,j,k,c) + enddo + ptr = ptr+BLOCK_SIZE + enddo + enddo + +!--------------------------------------------------------------------- +! send buffer +!--------------------------------------------------------------------- + if (timeron) call timer_start(t_xcomm) + call mpi_isend(in_buffer, buffer_size, & + & dp_type, successor(1), & + & WEST+jp+kp*NCELLS, comm_solve, & + & send_id,error) + if (timeron) call timer_stop(t_xcomm) + + return + end + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine x_send_backsub_info(send_id,c) + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + +!--------------------------------------------------------------------- +! pack up and send U(istart) for all j and k +!--------------------------------------------------------------------- + + use bt_data + use mpinpb + + implicit none + + integer j,k,n,ptr,c,istart,jp,kp + integer error,send_id,buffer_size + +!--------------------------------------------------------------------- +! Send element 0 to previous processor +!--------------------------------------------------------------------- + istart = 0 + jp = cell_coord(2,c)-1 + kp = cell_coord(3,c)-1 + buffer_size=MAX_CELL_DIM*MAX_CELL_DIM*BLOCK_SIZE + ptr = 0 + do k=0,KMAX-1 + do j=0,JMAX-1 + do n=1,BLOCK_SIZE + in_buffer(ptr+n) = rhs(n,istart,j,k,c) + enddo + ptr = ptr+BLOCK_SIZE + enddo + enddo + if (timeron) call timer_start(t_xcomm) + call mpi_isend(in_buffer, buffer_size, & + & dp_type, predecessor(1), & + & EAST+jp+kp*NCELLS, comm_solve, & + & send_id,error) + if (timeron) call timer_stop(t_xcomm) + + return + end + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine x_unpack_backsub_info(c) + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + +!--------------------------------------------------------------------- +! unpack U(isize) for all j and k +!--------------------------------------------------------------------- + + use bt_data + implicit none + + integer j,k,n,ptr,c + + ptr = 0 + do k=0,KMAX-1 + do j=0,JMAX-1 + do n=1,BLOCK_SIZE + backsub_info(n,j,k,c) = out_buffer(ptr+n) + enddo + ptr = ptr+BLOCK_SIZE + enddo + enddo + + return + end + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine x_receive_backsub_info(recv_id,c) + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + +!--------------------------------------------------------------------- +! post mpi receives +!--------------------------------------------------------------------- + + use bt_data + use mpinpb + + implicit none + + integer error,recv_id,jp,kp,c,buffer_size + + jp = cell_coord(2,c) - 1 + kp = cell_coord(3,c) - 1 + buffer_size=MAX_CELL_DIM*MAX_CELL_DIM*BLOCK_SIZE + call mpi_irecv(out_buffer, buffer_size, & + & dp_type, successor(1), & + & EAST+jp+kp*NCELLS, comm_solve, & + & recv_id, error) + + return + end + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine x_receive_solve_info(recv_id,c) + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + +!--------------------------------------------------------------------- +! post mpi receives +!--------------------------------------------------------------------- + + use bt_data + use mpinpb + + implicit none + + integer jp,kp,recv_id,error,c,buffer_size + + jp = cell_coord(2,c) - 1 + kp = cell_coord(3,c) - 1 + buffer_size=MAX_CELL_DIM*MAX_CELL_DIM* & + & (BLOCK_SIZE*BLOCK_SIZE + BLOCK_SIZE) + call mpi_irecv(out_buffer, buffer_size, & + & dp_type, predecessor(1), & + & WEST+jp+kp*NCELLS, comm_solve, & + & recv_id, error) + + return + end + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine x_backsubstitute(first, last, c) + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + +!--------------------------------------------------------------------- +! back solve: if last cell, then generate U(isize)=rhs(isize) +! else assume U(isize) is loaded in un pack backsub_info +! so just use it +! after call u(istart) will be sent to next cell +!--------------------------------------------------------------------- + + use bt_data + implicit none + + integer first, last, c, i, j, k + integer m,n,isize,jsize,ksize,istart + + istart = 0 + isize = cell_size(1,c)-1 + jsize = cell_size(2,c)-end(2,c)-1 + ksize = cell_size(3,c)-end(3,c)-1 + if (last .eq. 0) then + do k=start(3,c),ksize + do j=start(2,c),jsize +!--------------------------------------------------------------------- +! U(isize) uses info from previous cell if not last cell +!--------------------------------------------------------------------- + do m=1,BLOCK_SIZE + do n=1,BLOCK_SIZE + rhs(m,isize,j,k,c) = rhs(m,isize,j,k,c) & + & - lhsc(m,n,isize,j,k,c)* & + & backsub_info(n,j,k,c) +!--------------------------------------------------------------------- +! rhs(m,isize,j,k,c) = rhs(m,isize,j,k,c) +! $ - lhsc(m,n,isize,j,k,c)*rhs(n,isize+1,j,k,c) +!--------------------------------------------------------------------- + enddo + enddo + enddo + enddo + endif + do k=start(3,c),ksize + do j=start(2,c),jsize + do i=isize-1,istart,-1 + do m=1,BLOCK_SIZE + do n=1,BLOCK_SIZE + rhs(m,i,j,k,c) = rhs(m,i,j,k,c) & + & - lhsc(m,n,i,j,k,c)*rhs(n,i+1,j,k,c) + enddo + enddo + enddo + enddo + enddo + + return + end + + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine x_solve_cell(first,last,c) + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + +!--------------------------------------------------------------------- +! performs guaussian elimination on this cell. +! +! assumes that unpacking routines for non-first cells +! preload C' and rhs' from previous cell. +! +! assumed send happens outside this routine, but that +! c'(IMAX) and rhs'(IMAX) will be sent to next cell +!--------------------------------------------------------------------- + + use bt_data + implicit none + + double precision tmp1, tmp2, tmp3 + integer first,last,c + integer i,j,k,isize,ksize,jsize,istart + + istart = 0 + isize = cell_size(1,c)-1 + jsize = cell_size(2,c)-end(2,c)-1 + ksize = cell_size(3,c)-end(3,c)-1 + + call lhsabinit(lhsa, lhsb, isize) + + do k=start(3,c),ksize + do j=start(2,c),jsize + +!--------------------------------------------------------------------- +! This function computes the left hand side in the xi-direction +!--------------------------------------------------------------------- + +!--------------------------------------------------------------------- +! determine a (labeled f) and n jacobians for cell c +!--------------------------------------------------------------------- + do i = start(1,c)-1, cell_size(1,c) - end(1,c) + + tmp1 = rho_i(i,j,k,c) + tmp2 = tmp1 * tmp1 + tmp3 = tmp1 * tmp2 +!--------------------------------------------------------------------- +! +!--------------------------------------------------------------------- + fjac(1,1,i) = 0.0d+00 + fjac(1,2,i) = 1.0d+00 + fjac(1,3,i) = 0.0d+00 + fjac(1,4,i) = 0.0d+00 + fjac(1,5,i) = 0.0d+00 + + fjac(2,1,i) = -(u(2,i,j,k,c) * tmp2 * & + & u(2,i,j,k,c)) & + & + c2 * qs(i,j,k,c) + fjac(2,2,i) = ( 2.0d+00 - c2 ) & + & * ( u(2,i,j,k,c) * tmp1 ) + fjac(2,3,i) = - c2 * ( u(3,i,j,k,c) * tmp1 ) + fjac(2,4,i) = - c2 * ( u(4,i,j,k,c) * tmp1 ) + fjac(2,5,i) = c2 + + fjac(3,1,i) = - ( u(2,i,j,k,c)*u(3,i,j,k,c) ) * tmp2 + fjac(3,2,i) = u(3,i,j,k,c) * tmp1 + fjac(3,3,i) = u(2,i,j,k,c) * tmp1 + fjac(3,4,i) = 0.0d+00 + fjac(3,5,i) = 0.0d+00 + + fjac(4,1,i) = - ( u(2,i,j,k,c)*u(4,i,j,k,c) ) * tmp2 + fjac(4,2,i) = u(4,i,j,k,c) * tmp1 + fjac(4,3,i) = 0.0d+00 + fjac(4,4,i) = u(2,i,j,k,c) * tmp1 + fjac(4,5,i) = 0.0d+00 + + fjac(5,1,i) = ( c2 * 2.0d0 * qs(i,j,k,c) & + & - c1 * ( u(5,i,j,k,c) * tmp1 ) ) & + & * ( u(2,i,j,k,c) * tmp1 ) + fjac(5,2,i) = c1 * u(5,i,j,k,c) * tmp1 & + & - c2 & + & * ( u(2,i,j,k,c)*u(2,i,j,k,c) * tmp2 & + & + qs(i,j,k,c) ) + fjac(5,3,i) = - c2 * ( u(3,i,j,k,c)*u(2,i,j,k,c) ) & + & * tmp2 + fjac(5,4,i) = - c2 * ( u(4,i,j,k,c)*u(2,i,j,k,c) ) & + & * tmp2 + fjac(5,5,i) = c1 * ( u(2,i,j,k,c) * tmp1 ) + + njac(1,1,i) = 0.0d+00 + njac(1,2,i) = 0.0d+00 + njac(1,3,i) = 0.0d+00 + njac(1,4,i) = 0.0d+00 + njac(1,5,i) = 0.0d+00 + + njac(2,1,i) = - con43 * c3c4 * tmp2 * u(2,i,j,k,c) + njac(2,2,i) = con43 * c3c4 * tmp1 + njac(2,3,i) = 0.0d+00 + njac(2,4,i) = 0.0d+00 + njac(2,5,i) = 0.0d+00 + + njac(3,1,i) = - c3c4 * tmp2 * u(3,i,j,k,c) + njac(3,2,i) = 0.0d+00 + njac(3,3,i) = c3c4 * tmp1 + njac(3,4,i) = 0.0d+00 + njac(3,5,i) = 0.0d+00 + + njac(4,1,i) = - c3c4 * tmp2 * u(4,i,j,k,c) + njac(4,2,i) = 0.0d+00 + njac(4,3,i) = 0.0d+00 + njac(4,4,i) = c3c4 * tmp1 + njac(4,5,i) = 0.0d+00 + + njac(5,1,i) = - ( con43 * c3c4 & + & - c1345 ) * tmp3 * (u(2,i,j,k,c)**2) & + & - ( c3c4 - c1345 ) * tmp3 * (u(3,i,j,k,c)**2) & + & - ( c3c4 - c1345 ) * tmp3 * (u(4,i,j,k,c)**2) & + & - c1345 * tmp2 * u(5,i,j,k,c) + + njac(5,2,i) = ( con43 * c3c4 & + & - c1345 ) * tmp2 * u(2,i,j,k,c) + njac(5,3,i) = ( c3c4 - c1345 ) * tmp2 * u(3,i,j,k,c) + njac(5,4,i) = ( c3c4 - c1345 ) * tmp2 * u(4,i,j,k,c) + njac(5,5,i) = ( c1345 ) * tmp1 + + enddo +!--------------------------------------------------------------------- +! now jacobians set, so form left hand side in x direction +!--------------------------------------------------------------------- + do i = start(1,c), isize - end(1,c) + + tmp1 = dt * tx1 + tmp2 = dt * tx2 + + lhsa(1,1,i) = - tmp2 * fjac(1,1,i-1) & + & - tmp1 * njac(1,1,i-1) & + & - tmp1 * dx1 + lhsa(1,2,i) = - tmp2 * fjac(1,2,i-1) & + & - tmp1 * njac(1,2,i-1) + lhsa(1,3,i) = - tmp2 * fjac(1,3,i-1) & + & - tmp1 * njac(1,3,i-1) + lhsa(1,4,i) = - tmp2 * fjac(1,4,i-1) & + & - tmp1 * njac(1,4,i-1) + lhsa(1,5,i) = - tmp2 * fjac(1,5,i-1) & + & - tmp1 * njac(1,5,i-1) + + lhsa(2,1,i) = - tmp2 * fjac(2,1,i-1) & + & - tmp1 * njac(2,1,i-1) + lhsa(2,2,i) = - tmp2 * fjac(2,2,i-1) & + & - tmp1 * njac(2,2,i-1) & + & - tmp1 * dx2 + lhsa(2,3,i) = - tmp2 * fjac(2,3,i-1) & + & - tmp1 * njac(2,3,i-1) + lhsa(2,4,i) = - tmp2 * fjac(2,4,i-1) & + & - tmp1 * njac(2,4,i-1) + lhsa(2,5,i) = - tmp2 * fjac(2,5,i-1) & + & - tmp1 * njac(2,5,i-1) + + lhsa(3,1,i) = - tmp2 * fjac(3,1,i-1) & + & - tmp1 * njac(3,1,i-1) + lhsa(3,2,i) = - tmp2 * fjac(3,2,i-1) & + & - tmp1 * njac(3,2,i-1) + lhsa(3,3,i) = - tmp2 * fjac(3,3,i-1) & + & - tmp1 * njac(3,3,i-1) & + & - tmp1 * dx3 + lhsa(3,4,i) = - tmp2 * fjac(3,4,i-1) & + & - tmp1 * njac(3,4,i-1) + lhsa(3,5,i) = - tmp2 * fjac(3,5,i-1) & + & - tmp1 * njac(3,5,i-1) + + lhsa(4,1,i) = - tmp2 * fjac(4,1,i-1) & + & - tmp1 * njac(4,1,i-1) + lhsa(4,2,i) = - tmp2 * fjac(4,2,i-1) & + & - tmp1 * njac(4,2,i-1) + lhsa(4,3,i) = - tmp2 * fjac(4,3,i-1) & + & - tmp1 * njac(4,3,i-1) + lhsa(4,4,i) = - tmp2 * fjac(4,4,i-1) & + & - tmp1 * njac(4,4,i-1) & + & - tmp1 * dx4 + lhsa(4,5,i) = - tmp2 * fjac(4,5,i-1) & + & - tmp1 * njac(4,5,i-1) + + lhsa(5,1,i) = - tmp2 * fjac(5,1,i-1) & + & - tmp1 * njac(5,1,i-1) + lhsa(5,2,i) = - tmp2 * fjac(5,2,i-1) & + & - tmp1 * njac(5,2,i-1) + lhsa(5,3,i) = - tmp2 * fjac(5,3,i-1) & + & - tmp1 * njac(5,3,i-1) + lhsa(5,4,i) = - tmp2 * fjac(5,4,i-1) & + & - tmp1 * njac(5,4,i-1) + lhsa(5,5,i) = - tmp2 * fjac(5,5,i-1) & + & - tmp1 * njac(5,5,i-1) & + & - tmp1 * dx5 + + lhsb(1,1,i) = 1.0d+00 & + & + tmp1 * 2.0d+00 * njac(1,1,i) & + & + tmp1 * 2.0d+00 * dx1 + lhsb(1,2,i) = tmp1 * 2.0d+00 * njac(1,2,i) + lhsb(1,3,i) = tmp1 * 2.0d+00 * njac(1,3,i) + lhsb(1,4,i) = tmp1 * 2.0d+00 * njac(1,4,i) + lhsb(1,5,i) = tmp1 * 2.0d+00 * njac(1,5,i) + + lhsb(2,1,i) = tmp1 * 2.0d+00 * njac(2,1,i) + lhsb(2,2,i) = 1.0d+00 & + & + tmp1 * 2.0d+00 * njac(2,2,i) & + & + tmp1 * 2.0d+00 * dx2 + lhsb(2,3,i) = tmp1 * 2.0d+00 * njac(2,3,i) + lhsb(2,4,i) = tmp1 * 2.0d+00 * njac(2,4,i) + lhsb(2,5,i) = tmp1 * 2.0d+00 * njac(2,5,i) + + lhsb(3,1,i) = tmp1 * 2.0d+00 * njac(3,1,i) + lhsb(3,2,i) = tmp1 * 2.0d+00 * njac(3,2,i) + lhsb(3,3,i) = 1.0d+00 & + & + tmp1 * 2.0d+00 * njac(3,3,i) & + & + tmp1 * 2.0d+00 * dx3 + lhsb(3,4,i) = tmp1 * 2.0d+00 * njac(3,4,i) + lhsb(3,5,i) = tmp1 * 2.0d+00 * njac(3,5,i) + + lhsb(4,1,i) = tmp1 * 2.0d+00 * njac(4,1,i) + lhsb(4,2,i) = tmp1 * 2.0d+00 * njac(4,2,i) + lhsb(4,3,i) = tmp1 * 2.0d+00 * njac(4,3,i) + lhsb(4,4,i) = 1.0d+00 & + & + tmp1 * 2.0d+00 * njac(4,4,i) & + & + tmp1 * 2.0d+00 * dx4 + lhsb(4,5,i) = tmp1 * 2.0d+00 * njac(4,5,i) + + lhsb(5,1,i) = tmp1 * 2.0d+00 * njac(5,1,i) + lhsb(5,2,i) = tmp1 * 2.0d+00 * njac(5,2,i) + lhsb(5,3,i) = tmp1 * 2.0d+00 * njac(5,3,i) + lhsb(5,4,i) = tmp1 * 2.0d+00 * njac(5,4,i) + lhsb(5,5,i) = 1.0d+00 & + & + tmp1 * 2.0d+00 * njac(5,5,i) & + & + tmp1 * 2.0d+00 * dx5 + + lhsc(1,1,i,j,k,c) = tmp2 * fjac(1,1,i+1) & + & - tmp1 * njac(1,1,i+1) & + & - tmp1 * dx1 + lhsc(1,2,i,j,k,c) = tmp2 * fjac(1,2,i+1) & + & - tmp1 * njac(1,2,i+1) + lhsc(1,3,i,j,k,c) = tmp2 * fjac(1,3,i+1) & + & - tmp1 * njac(1,3,i+1) + lhsc(1,4,i,j,k,c) = tmp2 * fjac(1,4,i+1) & + & - tmp1 * njac(1,4,i+1) + lhsc(1,5,i,j,k,c) = tmp2 * fjac(1,5,i+1) & + & - tmp1 * njac(1,5,i+1) + + lhsc(2,1,i,j,k,c) = tmp2 * fjac(2,1,i+1) & + & - tmp1 * njac(2,1,i+1) + lhsc(2,2,i,j,k,c) = tmp2 * fjac(2,2,i+1) & + & - tmp1 * njac(2,2,i+1) & + & - tmp1 * dx2 + lhsc(2,3,i,j,k,c) = tmp2 * fjac(2,3,i+1) & + & - tmp1 * njac(2,3,i+1) + lhsc(2,4,i,j,k,c) = tmp2 * fjac(2,4,i+1) & + & - tmp1 * njac(2,4,i+1) + lhsc(2,5,i,j,k,c) = tmp2 * fjac(2,5,i+1) & + & - tmp1 * njac(2,5,i+1) + + lhsc(3,1,i,j,k,c) = tmp2 * fjac(3,1,i+1) & + & - tmp1 * njac(3,1,i+1) + lhsc(3,2,i,j,k,c) = tmp2 * fjac(3,2,i+1) & + & - tmp1 * njac(3,2,i+1) + lhsc(3,3,i,j,k,c) = tmp2 * fjac(3,3,i+1) & + & - tmp1 * njac(3,3,i+1) & + & - tmp1 * dx3 + lhsc(3,4,i,j,k,c) = tmp2 * fjac(3,4,i+1) & + & - tmp1 * njac(3,4,i+1) + lhsc(3,5,i,j,k,c) = tmp2 * fjac(3,5,i+1) & + & - tmp1 * njac(3,5,i+1) + + lhsc(4,1,i,j,k,c) = tmp2 * fjac(4,1,i+1) & + & - tmp1 * njac(4,1,i+1) + lhsc(4,2,i,j,k,c) = tmp2 * fjac(4,2,i+1) & + & - tmp1 * njac(4,2,i+1) + lhsc(4,3,i,j,k,c) = tmp2 * fjac(4,3,i+1) & + & - tmp1 * njac(4,3,i+1) + lhsc(4,4,i,j,k,c) = tmp2 * fjac(4,4,i+1) & + & - tmp1 * njac(4,4,i+1) & + & - tmp1 * dx4 + lhsc(4,5,i,j,k,c) = tmp2 * fjac(4,5,i+1) & + & - tmp1 * njac(4,5,i+1) + + lhsc(5,1,i,j,k,c) = tmp2 * fjac(5,1,i+1) & + & - tmp1 * njac(5,1,i+1) + lhsc(5,2,i,j,k,c) = tmp2 * fjac(5,2,i+1) & + & - tmp1 * njac(5,2,i+1) + lhsc(5,3,i,j,k,c) = tmp2 * fjac(5,3,i+1) & + & - tmp1 * njac(5,3,i+1) + lhsc(5,4,i,j,k,c) = tmp2 * fjac(5,4,i+1) & + & - tmp1 * njac(5,4,i+1) + lhsc(5,5,i,j,k,c) = tmp2 * fjac(5,5,i+1) & + & - tmp1 * njac(5,5,i+1) & + & - tmp1 * dx5 + + enddo + + +!--------------------------------------------------------------------- +! outer most do loops - sweeping in i direction +!--------------------------------------------------------------------- + if (first .eq. 1) then + +!--------------------------------------------------------------------- +! multiply c(istart,j,k) by b_inverse and copy back to c +! multiply rhs(istart) by b_inverse(istart) and copy to rhs +!--------------------------------------------------------------------- + call binvcrhs( lhsb(1,1,istart), & + & lhsc(1,1,istart,j,k,c), & + & rhs(1,istart,j,k,c) ) + + endif + +!--------------------------------------------------------------------- +! begin inner most do loop +! do all the elements of the cell unless last +!--------------------------------------------------------------------- + do i=istart+first,isize-last + +!--------------------------------------------------------------------- +! rhs(i) = rhs(i) - A*rhs(i-1) +!--------------------------------------------------------------------- + call matvec_sub(lhsa(1,1,i), & + & rhs(1,i-1,j,k,c),rhs(1,i,j,k,c)) + +!--------------------------------------------------------------------- +! B(i) = B(i) - C(i-1)*A(i) +!--------------------------------------------------------------------- + call matmul_sub(lhsa(1,1,i), & + & lhsc(1,1,i-1,j,k,c), & + & lhsb(1,1,i)) + + +!--------------------------------------------------------------------- +! multiply c(i,j,k) by b_inverse and copy back to c +! multiply rhs(1,j,k) by b_inverse(1,j,k) and copy to rhs +!--------------------------------------------------------------------- + call binvcrhs( lhsb(1,1,i), & + & lhsc(1,1,i,j,k,c), & + & rhs(1,i,j,k,c) ) + + enddo + +!--------------------------------------------------------------------- +! Now finish up special cases for last cell +!--------------------------------------------------------------------- + if (last .eq. 1) then + +!--------------------------------------------------------------------- +! rhs(isize) = rhs(isize) - A*rhs(isize-1) +!--------------------------------------------------------------------- + call matvec_sub(lhsa(1,1,isize), & + & rhs(1,isize-1,j,k,c),rhs(1,isize,j,k,c)) + +!--------------------------------------------------------------------- +! B(isize) = B(isize) - C(isize-1)*A(isize) +!--------------------------------------------------------------------- + call matmul_sub(lhsa(1,1,isize), & + & lhsc(1,1,isize-1,j,k,c), & + & lhsb(1,1,isize)) + +!--------------------------------------------------------------------- +! multiply rhs() by b_inverse() and copy to rhs +!--------------------------------------------------------------------- + call binvrhs( lhsb(1,1,isize), & + & rhs(1,isize,j,k,c) ) + + endif + enddo + enddo + + + return + end + diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/x_solve_vec.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/x_solve_vec.f90 new file mode 100644 index 000000000..593f6a3a6 --- /dev/null +++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/x_solve_vec.f90 @@ -0,0 +1,813 @@ + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine x_solve + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + +!--------------------------------------------------------------------- +! +! Performs line solves in X direction by first factoring +! the block-tridiagonal matrix into an upper triangular matrix, +! and then performing back substitution to solve for the unknow +! vectors of each line. +! +! Make sure we treat elements zero to cell_size in the direction +! of the sweep. +! +!--------------------------------------------------------------------- + + use bt_data + use mpinpb + + implicit none + + integer c, istart, stage, & + & first, last, recv_id, error, r_status(MPI_STATUS_SIZE), & + & isize,jsize,ksize,send_id + + istart = 0 + + if (timeron) call timer_start(t_xsolve) +!--------------------------------------------------------------------- +! in our terminology stage is the number of the cell in the x-direct +! i.e. stage = 1 means the start of the line stage=ncells means end +!--------------------------------------------------------------------- + do stage = 1,ncells + c = slice(1,stage) + isize = cell_size(1,c) - 1 + jsize = cell_size(2,c) - 1 + ksize = cell_size(3,c) - 1 + +!--------------------------------------------------------------------- +! set last-cell flag +!--------------------------------------------------------------------- + if (stage .eq. ncells) then + last = 1 + else + last = 0 + endif + + if (stage .eq. 1) then +!--------------------------------------------------------------------- +! This is the first cell, so solve without receiving data +!--------------------------------------------------------------------- + first = 1 +! call lhsx(c) + call x_solve_cell(first,last,c) + else +!--------------------------------------------------------------------- +! Not the first cell of this line, so receive info from +! processor working on preceeding cell +!--------------------------------------------------------------------- + first = 0 + if (timeron) call timer_start(t_xcomm) + call x_receive_solve_info(recv_id,c) +!--------------------------------------------------------------------- +! overlap computations and communications +!--------------------------------------------------------------------- +! call lhsx(c) +!--------------------------------------------------------------------- +! wait for completion +!--------------------------------------------------------------------- + call mpi_wait(send_id,r_status,error) + call mpi_wait(recv_id,r_status,error) + if (timeron) call timer_stop(t_xcomm) +!--------------------------------------------------------------------- +! install C'(istart) and rhs'(istart) to be used in this cell +!--------------------------------------------------------------------- + call x_unpack_solve_info(c) + call x_solve_cell(first,last,c) + endif + + if (last .eq. 0) call x_send_solve_info(send_id,c) + enddo + +!--------------------------------------------------------------------- +! now perform backsubstitution in reverse direction +!--------------------------------------------------------------------- + do stage = ncells, 1, -1 + c = slice(1,stage) + first = 0 + last = 0 + if (stage .eq. 1) first = 1 + if (stage .eq. ncells) then + last = 1 +!--------------------------------------------------------------------- +! last cell, so perform back substitute without waiting +!--------------------------------------------------------------------- + call x_backsubstitute(first, last,c) + else + if (timeron) call timer_start(t_xcomm) + call x_receive_backsub_info(recv_id,c) + call mpi_wait(send_id,r_status,error) + call mpi_wait(recv_id,r_status,error) + if (timeron) call timer_stop(t_xcomm) + call x_unpack_backsub_info(c) + call x_backsubstitute(first,last,c) + endif + if (first .eq. 0) call x_send_backsub_info(send_id,c) + enddo + + if (timeron) call timer_stop(t_xsolve) + + return + end + + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine x_unpack_solve_info(c) + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + +!--------------------------------------------------------------------- +! unpack C'(-1) and rhs'(-1) for +! all j and k +!--------------------------------------------------------------------- + + use bt_data + implicit none + integer j,k,m,n,ptr,c,istart + + istart = 0 + ptr = 0 + do k=0,KMAX-1 + do j=0,JMAX-1 + do m=1,BLOCK_SIZE + do n=1,BLOCK_SIZE + lhsc(m,n,istart-1,j,k,c) = out_buffer(ptr+n) + enddo + ptr = ptr+BLOCK_SIZE + enddo + do n=1,BLOCK_SIZE + rhs(n,istart-1,j,k,c) = out_buffer(ptr+n) + enddo + ptr = ptr+BLOCK_SIZE + enddo + enddo + + return + end + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine x_send_solve_info(send_id,c) + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + +!--------------------------------------------------------------------- +! pack up and send C'(iend) and rhs'(iend) for +! all j and k +!--------------------------------------------------------------------- + + use bt_data + use mpinpb + + implicit none + + integer j,k,m,n,isize,ptr,c,jp,kp + integer error,send_id,buffer_size + + isize = cell_size(1,c)-1 + jp = cell_coord(2,c) - 1 + kp = cell_coord(3,c) - 1 + buffer_size=MAX_CELL_DIM*MAX_CELL_DIM* & + & (BLOCK_SIZE*BLOCK_SIZE + BLOCK_SIZE) + +!--------------------------------------------------------------------- +! pack up buffer +!--------------------------------------------------------------------- + ptr = 0 + do k=0,KMAX-1 + do j=0,JMAX-1 + do m=1,BLOCK_SIZE + do n=1,BLOCK_SIZE + in_buffer(ptr+n) = lhsc(m,n,isize,j,k,c) + enddo + ptr = ptr+BLOCK_SIZE + enddo + do n=1,BLOCK_SIZE + in_buffer(ptr+n) = rhs(n,isize,j,k,c) + enddo + ptr = ptr+BLOCK_SIZE + enddo + enddo + +!--------------------------------------------------------------------- +! send buffer +!--------------------------------------------------------------------- + if (timeron) call timer_start(t_xcomm) + call mpi_isend(in_buffer, buffer_size, & + & dp_type, successor(1), & + & WEST+jp+kp*NCELLS, comm_solve, & + & send_id,error) + if (timeron) call timer_stop(t_xcomm) + + return + end + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine x_send_backsub_info(send_id,c) + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + +!--------------------------------------------------------------------- +! pack up and send U(istart) for all j and k +!--------------------------------------------------------------------- + + use bt_data + use mpinpb + + implicit none + + integer j,k,n,ptr,c,istart,jp,kp + integer error,send_id,buffer_size + +!--------------------------------------------------------------------- +! Send element 0 to previous processor +!--------------------------------------------------------------------- + istart = 0 + jp = cell_coord(2,c)-1 + kp = cell_coord(3,c)-1 + buffer_size=MAX_CELL_DIM*MAX_CELL_DIM*BLOCK_SIZE + ptr = 0 + do k=0,KMAX-1 + do j=0,JMAX-1 + do n=1,BLOCK_SIZE + in_buffer(ptr+n) = rhs(n,istart,j,k,c) + enddo + ptr = ptr+BLOCK_SIZE + enddo + enddo + if (timeron) call timer_start(t_xcomm) + call mpi_isend(in_buffer, buffer_size, & + & dp_type, predecessor(1), & + & EAST+jp+kp*NCELLS, comm_solve, & + & send_id,error) + if (timeron) call timer_stop(t_xcomm) + + return + end + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine x_unpack_backsub_info(c) + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + +!--------------------------------------------------------------------- +! unpack U(isize) for all j and k +!--------------------------------------------------------------------- + + use bt_data + implicit none + integer j,k,n,ptr,c + + ptr = 0 + do k=0,KMAX-1 + do j=0,JMAX-1 + do n=1,BLOCK_SIZE + backsub_info(n,j,k,c) = out_buffer(ptr+n) + enddo + ptr = ptr+BLOCK_SIZE + enddo + enddo + + return + end + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine x_receive_backsub_info(recv_id,c) + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + +!--------------------------------------------------------------------- +! post mpi receives +!--------------------------------------------------------------------- + + use bt_data + use mpinpb + + implicit none + + integer error,recv_id,jp,kp,c,buffer_size + jp = cell_coord(2,c) - 1 + kp = cell_coord(3,c) - 1 + buffer_size=MAX_CELL_DIM*MAX_CELL_DIM*BLOCK_SIZE + call mpi_irecv(out_buffer, buffer_size, & + & dp_type, successor(1), & + & EAST+jp+kp*NCELLS, comm_solve, & + & recv_id, error) + + return + end + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine x_receive_solve_info(recv_id,c) + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + +!--------------------------------------------------------------------- +! post mpi receives +!--------------------------------------------------------------------- + + use bt_data + use mpinpb + + implicit none + + integer jp,kp,recv_id,error,c,buffer_size + jp = cell_coord(2,c) - 1 + kp = cell_coord(3,c) - 1 + buffer_size=MAX_CELL_DIM*MAX_CELL_DIM* & + & (BLOCK_SIZE*BLOCK_SIZE + BLOCK_SIZE) + call mpi_irecv(out_buffer, buffer_size, & + & dp_type, predecessor(1), & + & WEST+jp+kp*NCELLS, comm_solve, & + & recv_id, error) + + return + end + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine x_backsubstitute(first, last, c) + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + +!--------------------------------------------------------------------- +! back solve: if last cell, then generate U(isize)=rhs(isize) +! else assume U(isize) is loaded in un pack backsub_info +! so just use it +! after call u(istart) will be sent to next cell +!--------------------------------------------------------------------- + + use bt_data + implicit none + + integer first, last, c, i, j, k + integer m,n,isize,jsize,ksize,istart + + istart = 0 + isize = cell_size(1,c)-1 + jsize = cell_size(2,c)-end(2,c)-1 + ksize = cell_size(3,c)-end(3,c)-1 + if (last .eq. 0) then + do k=start(3,c),ksize + do j=start(2,c),jsize +!--------------------------------------------------------------------- +! U(isize) uses info from previous cell if not last cell +!--------------------------------------------------------------------- + do m=1,BLOCK_SIZE + do n=1,BLOCK_SIZE + rhs(m,isize,j,k,c) = rhs(m,isize,j,k,c) & + & - lhsc(m,n,isize,j,k,c)* & + & backsub_info(n,j,k,c) +!--------------------------------------------------------------------- +! rhs(m,isize,j,k,c) = rhs(m,isize,j,k,c) +! $ - lhsc(m,n,isize,j,k,c)*rhs(n,isize+1,j,k,c) +!--------------------------------------------------------------------- + enddo + enddo + enddo + enddo + endif + do k=start(3,c),ksize + do j=start(2,c),jsize + do i=isize-1,istart,-1 + do m=1,BLOCK_SIZE + do n=1,BLOCK_SIZE + rhs(m,i,j,k,c) = rhs(m,i,j,k,c) & + & - lhsc(m,n,i,j,k,c)*rhs(n,i+1,j,k,c) + enddo + enddo + enddo + enddo + enddo + + return + end + + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine x_solve_cell(first,last,c) + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + +!--------------------------------------------------------------------- +! performs guaussian elimination on this cell. +! +! assumes that unpacking routines for non-first cells +! preload C' and rhs' from previous cell. +! +! assumed send happens outside this routine, but that +! c'(IMAX) and rhs'(IMAX) will be sent to next cell +!--------------------------------------------------------------------- + + use bt_data + implicit none + + double precision tmp1, tmp2, tmp3 + integer first,last,c + integer i,j,k,m,n,isize,ksize,jsize,istart + + istart = 0 + isize = cell_size(1,c)-1 + jsize = cell_size(2,c)-end(2,c)-1 + ksize = cell_size(3,c)-end(3,c)-1 + +!--------------------------------------------------------------------- +! zero the left hand side for starters +! set diagonal values to 1. This is overkill, but convenient +!--------------------------------------------------------------------- + do j = 0, jsize + do m = 1, 5 + do n = 1, 5 + lhsa(m,n,0,j) = 0.0d0 + lhsb(m,n,0,j) = 0.0d0 + lhsa(m,n,isize,j) = 0.0d0 + lhsb(m,n,isize,j) = 0.0d0 + enddo + lhsb(m,m,0,j) = 1.0d0 + lhsb(m,m,isize,j) = 1.0d0 + enddo + enddo + + do k=start(3,c),ksize + +!--------------------------------------------------------------------- +! This function computes the left hand side in the xi-direction +!--------------------------------------------------------------------- + +!--------------------------------------------------------------------- +! determine a (labeled f) and n jacobians for cell ! +!--------------------------------------------------------------------- + do j=start(2,c),jsize + do i = start(1,c)-1, cell_size(1,c) - end(1,c) + + tmp1 = rho_i(i,j,k,c) + tmp2 = tmp1 * tmp1 + tmp3 = tmp1 * tmp2 +!--------------------------------------------------------------------- +! +!--------------------------------------------------------------------- + fjac(1,1,i,j) = 0.0d+00 + fjac(1,2,i,j) = 1.0d+00 + fjac(1,3,i,j) = 0.0d+00 + fjac(1,4,i,j) = 0.0d+00 + fjac(1,5,i,j) = 0.0d+00 + + fjac(2,1,i,j) = -(u(2,i,j,k,c) * tmp2 * & + & u(2,i,j,k,c)) & + & + c2 * qs(i,j,k,c) + fjac(2,2,i,j) = ( 2.0d+00 - c2 ) & + & * ( u(2,i,j,k,c) * tmp1 ) + fjac(2,3,i,j) = - c2 * ( u(3,i,j,k,c) * tmp1 ) + fjac(2,4,i,j) = - c2 * ( u(4,i,j,k,c) * tmp1 ) + fjac(2,5,i,j) = c2 + + fjac(3,1,i,j) = - ( u(2,i,j,k,c)*u(3,i,j,k,c) ) * tmp2 + fjac(3,2,i,j) = u(3,i,j,k,c) * tmp1 + fjac(3,3,i,j) = u(2,i,j,k,c) * tmp1 + fjac(3,4,i,j) = 0.0d+00 + fjac(3,5,i,j) = 0.0d+00 + + fjac(4,1,i,j) = - ( u(2,i,j,k,c)*u(4,i,j,k,c) ) * tmp2 + fjac(4,2,i,j) = u(4,i,j,k,c) * tmp1 + fjac(4,3,i,j) = 0.0d+00 + fjac(4,4,i,j) = u(2,i,j,k,c) * tmp1 + fjac(4,5,i,j) = 0.0d+00 + + fjac(5,1,i,j) = ( c2 * 2.0d0 * qs(i,j,k,c) & + & - c1 * ( u(5,i,j,k,c) * tmp1 ) ) & + & * ( u(2,i,j,k,c) * tmp1 ) + fjac(5,2,i,j) = c1 * u(5,i,j,k,c) * tmp1 & + & - c2 & + & * ( u(2,i,j,k,c)*u(2,i,j,k,c) * tmp2 & + & + qs(i,j,k,c) ) + fjac(5,3,i,j) = - c2 * ( u(3,i,j,k,c)*u(2,i,j,k,c) ) & + & * tmp2 + fjac(5,4,i,j) = - c2 * ( u(4,i,j,k,c)*u(2,i,j,k,c) ) & + & * tmp2 + fjac(5,5,i,j) = c1 * ( u(2,i,j,k,c) * tmp1 ) + + njac(1,1,i,j) = 0.0d+00 + njac(1,2,i,j) = 0.0d+00 + njac(1,3,i,j) = 0.0d+00 + njac(1,4,i,j) = 0.0d+00 + njac(1,5,i,j) = 0.0d+00 + + njac(2,1,i,j) = - con43 * c3c4 * tmp2 * u(2,i,j,k,c) + njac(2,2,i,j) = con43 * c3c4 * tmp1 + njac(2,3,i,j) = 0.0d+00 + njac(2,4,i,j) = 0.0d+00 + njac(2,5,i,j) = 0.0d+00 + + njac(3,1,i,j) = - c3c4 * tmp2 * u(3,i,j,k,c) + njac(3,2,i,j) = 0.0d+00 + njac(3,3,i,j) = c3c4 * tmp1 + njac(3,4,i,j) = 0.0d+00 + njac(3,5,i,j) = 0.0d+00 + + njac(4,1,i,j) = - c3c4 * tmp2 * u(4,i,j,k,c) + njac(4,2,i,j) = 0.0d+00 + njac(4,3,i,j) = 0.0d+00 + njac(4,4,i,j) = c3c4 * tmp1 + njac(4,5,i,j) = 0.0d+00 + + njac(5,1,i,j) = - ( con43 * c3c4 & + & - c1345 ) * tmp3 * (u(2,i,j,k,c)**2) & + & - ( c3c4 - c1345 ) * tmp3 * (u(3,i,j,k,c)**2) & + & - ( c3c4 - c1345 ) * tmp3 * (u(4,i,j,k,c)**2) & + & - c1345 * tmp2 * u(5,i,j,k,c) + + njac(5,2,i,j) = ( con43 * c3c4 & + & - c1345 ) * tmp2 * u(2,i,j,k,c) + njac(5,3,i,j) = ( c3c4 - c1345 ) * tmp2 * u(3,i,j,k,c) + njac(5,4,i,j) = ( c3c4 - c1345 ) * tmp2 * u(4,i,j,k,c) + njac(5,5,i,j) = ( c1345 ) * tmp1 + + enddo + enddo + +!--------------------------------------------------------------------- +! now jacobians set, so form left hand side in x direction +!--------------------------------------------------------------------- + do j=start(2,c),jsize + do i = start(1,c), isize - end(1,c) + + tmp1 = dt * tx1 + tmp2 = dt * tx2 + + lhsa(1,1,i,j) = - tmp2 * fjac(1,1,i-1,j) & + & - tmp1 * njac(1,1,i-1,j) & + & - tmp1 * dx1 + lhsa(1,2,i,j) = - tmp2 * fjac(1,2,i-1,j) & + & - tmp1 * njac(1,2,i-1,j) + lhsa(1,3,i,j) = - tmp2 * fjac(1,3,i-1,j) & + & - tmp1 * njac(1,3,i-1,j) + lhsa(1,4,i,j) = - tmp2 * fjac(1,4,i-1,j) & + & - tmp1 * njac(1,4,i-1,j) + lhsa(1,5,i,j) = - tmp2 * fjac(1,5,i-1,j) & + & - tmp1 * njac(1,5,i-1,j) + + lhsa(2,1,i,j) = - tmp2 * fjac(2,1,i-1,j) & + & - tmp1 * njac(2,1,i-1,j) + lhsa(2,2,i,j) = - tmp2 * fjac(2,2,i-1,j) & + & - tmp1 * njac(2,2,i-1,j) & + & - tmp1 * dx2 + lhsa(2,3,i,j) = - tmp2 * fjac(2,3,i-1,j) & + & - tmp1 * njac(2,3,i-1,j) + lhsa(2,4,i,j) = - tmp2 * fjac(2,4,i-1,j) & + & - tmp1 * njac(2,4,i-1,j) + lhsa(2,5,i,j) = - tmp2 * fjac(2,5,i-1,j) & + & - tmp1 * njac(2,5,i-1,j) + + lhsa(3,1,i,j) = - tmp2 * fjac(3,1,i-1,j) & + & - tmp1 * njac(3,1,i-1,j) + lhsa(3,2,i,j) = - tmp2 * fjac(3,2,i-1,j) & + & - tmp1 * njac(3,2,i-1,j) + lhsa(3,3,i,j) = - tmp2 * fjac(3,3,i-1,j) & + & - tmp1 * njac(3,3,i-1,j) & + & - tmp1 * dx3 + lhsa(3,4,i,j) = - tmp2 * fjac(3,4,i-1,j) & + & - tmp1 * njac(3,4,i-1,j) + lhsa(3,5,i,j) = - tmp2 * fjac(3,5,i-1,j) & + & - tmp1 * njac(3,5,i-1,j) + + lhsa(4,1,i,j) = - tmp2 * fjac(4,1,i-1,j) & + & - tmp1 * njac(4,1,i-1,j) + lhsa(4,2,i,j) = - tmp2 * fjac(4,2,i-1,j) & + & - tmp1 * njac(4,2,i-1,j) + lhsa(4,3,i,j) = - tmp2 * fjac(4,3,i-1,j) & + & - tmp1 * njac(4,3,i-1,j) + lhsa(4,4,i,j) = - tmp2 * fjac(4,4,i-1,j) & + & - tmp1 * njac(4,4,i-1,j) & + & - tmp1 * dx4 + lhsa(4,5,i,j) = - tmp2 * fjac(4,5,i-1,j) & + & - tmp1 * njac(4,5,i-1,j) + + lhsa(5,1,i,j) = - tmp2 * fjac(5,1,i-1,j) & + & - tmp1 * njac(5,1,i-1,j) + lhsa(5,2,i,j) = - tmp2 * fjac(5,2,i-1,j) & + & - tmp1 * njac(5,2,i-1,j) + lhsa(5,3,i,j) = - tmp2 * fjac(5,3,i-1,j) & + & - tmp1 * njac(5,3,i-1,j) + lhsa(5,4,i,j) = - tmp2 * fjac(5,4,i-1,j) & + & - tmp1 * njac(5,4,i-1,j) + lhsa(5,5,i,j) = - tmp2 * fjac(5,5,i-1,j) & + & - tmp1 * njac(5,5,i-1,j) & + & - tmp1 * dx5 + + lhsb(1,1,i,j) = 1.0d+00 & + & + tmp1 * 2.0d+00 * njac(1,1,i,j) & + & + tmp1 * 2.0d+00 * dx1 + lhsb(1,2,i,j) = tmp1 * 2.0d+00 * njac(1,2,i,j) + lhsb(1,3,i,j) = tmp1 * 2.0d+00 * njac(1,3,i,j) + lhsb(1,4,i,j) = tmp1 * 2.0d+00 * njac(1,4,i,j) + lhsb(1,5,i,j) = tmp1 * 2.0d+00 * njac(1,5,i,j) + + lhsb(2,1,i,j) = tmp1 * 2.0d+00 * njac(2,1,i,j) + lhsb(2,2,i,j) = 1.0d+00 & + & + tmp1 * 2.0d+00 * njac(2,2,i,j) & + & + tmp1 * 2.0d+00 * dx2 + lhsb(2,3,i,j) = tmp1 * 2.0d+00 * njac(2,3,i,j) + lhsb(2,4,i,j) = tmp1 * 2.0d+00 * njac(2,4,i,j) + lhsb(2,5,i,j) = tmp1 * 2.0d+00 * njac(2,5,i,j) + + lhsb(3,1,i,j) = tmp1 * 2.0d+00 * njac(3,1,i,j) + lhsb(3,2,i,j) = tmp1 * 2.0d+00 * njac(3,2,i,j) + lhsb(3,3,i,j) = 1.0d+00 & + & + tmp1 * 2.0d+00 * njac(3,3,i,j) & + & + tmp1 * 2.0d+00 * dx3 + lhsb(3,4,i,j) = tmp1 * 2.0d+00 * njac(3,4,i,j) + lhsb(3,5,i,j) = tmp1 * 2.0d+00 * njac(3,5,i,j) + + lhsb(4,1,i,j) = tmp1 * 2.0d+00 * njac(4,1,i,j) + lhsb(4,2,i,j) = tmp1 * 2.0d+00 * njac(4,2,i,j) + lhsb(4,3,i,j) = tmp1 * 2.0d+00 * njac(4,3,i,j) + lhsb(4,4,i,j) = 1.0d+00 & + & + tmp1 * 2.0d+00 * njac(4,4,i,j) & + & + tmp1 * 2.0d+00 * dx4 + lhsb(4,5,i,j) = tmp1 * 2.0d+00 * njac(4,5,i,j) + + lhsb(5,1,i,j) = tmp1 * 2.0d+00 * njac(5,1,i,j) + lhsb(5,2,i,j) = tmp1 * 2.0d+00 * njac(5,2,i,j) + lhsb(5,3,i,j) = tmp1 * 2.0d+00 * njac(5,3,i,j) + lhsb(5,4,i,j) = tmp1 * 2.0d+00 * njac(5,4,i,j) + lhsb(5,5,i,j) = 1.0d+00 & + & + tmp1 * 2.0d+00 * njac(5,5,i,j) & + & + tmp1 * 2.0d+00 * dx5 + + lhsc(1,1,i,j,k,c) = tmp2 * fjac(1,1,i+1,j) & + & - tmp1 * njac(1,1,i+1,j) & + & - tmp1 * dx1 + lhsc(1,2,i,j,k,c) = tmp2 * fjac(1,2,i+1,j) & + & - tmp1 * njac(1,2,i+1,j) + lhsc(1,3,i,j,k,c) = tmp2 * fjac(1,3,i+1,j) & + & - tmp1 * njac(1,3,i+1,j) + lhsc(1,4,i,j,k,c) = tmp2 * fjac(1,4,i+1,j) & + & - tmp1 * njac(1,4,i+1,j) + lhsc(1,5,i,j,k,c) = tmp2 * fjac(1,5,i+1,j) & + & - tmp1 * njac(1,5,i+1,j) + + lhsc(2,1,i,j,k,c) = tmp2 * fjac(2,1,i+1,j) & + & - tmp1 * njac(2,1,i+1,j) + lhsc(2,2,i,j,k,c) = tmp2 * fjac(2,2,i+1,j) & + & - tmp1 * njac(2,2,i+1,j) & + & - tmp1 * dx2 + lhsc(2,3,i,j,k,c) = tmp2 * fjac(2,3,i+1,j) & + & - tmp1 * njac(2,3,i+1,j) + lhsc(2,4,i,j,k,c) = tmp2 * fjac(2,4,i+1,j) & + & - tmp1 * njac(2,4,i+1,j) + lhsc(2,5,i,j,k,c) = tmp2 * fjac(2,5,i+1,j) & + & - tmp1 * njac(2,5,i+1,j) + + lhsc(3,1,i,j,k,c) = tmp2 * fjac(3,1,i+1,j) & + & - tmp1 * njac(3,1,i+1,j) + lhsc(3,2,i,j,k,c) = tmp2 * fjac(3,2,i+1,j) & + & - tmp1 * njac(3,2,i+1,j) + lhsc(3,3,i,j,k,c) = tmp2 * fjac(3,3,i+1,j) & + & - tmp1 * njac(3,3,i+1,j) & + & - tmp1 * dx3 + lhsc(3,4,i,j,k,c) = tmp2 * fjac(3,4,i+1,j) & + & - tmp1 * njac(3,4,i+1,j) + lhsc(3,5,i,j,k,c) = tmp2 * fjac(3,5,i+1,j) & + & - tmp1 * njac(3,5,i+1,j) + + lhsc(4,1,i,j,k,c) = tmp2 * fjac(4,1,i+1,j) & + & - tmp1 * njac(4,1,i+1,j) + lhsc(4,2,i,j,k,c) = tmp2 * fjac(4,2,i+1,j) & + & - tmp1 * njac(4,2,i+1,j) + lhsc(4,3,i,j,k,c) = tmp2 * fjac(4,3,i+1,j) & + & - tmp1 * njac(4,3,i+1,j) + lhsc(4,4,i,j,k,c) = tmp2 * fjac(4,4,i+1,j) & + & - tmp1 * njac(4,4,i+1,j) & + & - tmp1 * dx4 + lhsc(4,5,i,j,k,c) = tmp2 * fjac(4,5,i+1,j) & + & - tmp1 * njac(4,5,i+1,j) + + lhsc(5,1,i,j,k,c) = tmp2 * fjac(5,1,i+1,j) & + & - tmp1 * njac(5,1,i+1,j) + lhsc(5,2,i,j,k,c) = tmp2 * fjac(5,2,i+1,j) & + & - tmp1 * njac(5,2,i+1,j) + lhsc(5,3,i,j,k,c) = tmp2 * fjac(5,3,i+1,j) & + & - tmp1 * njac(5,3,i+1,j) + lhsc(5,4,i,j,k,c) = tmp2 * fjac(5,4,i+1,j) & + & - tmp1 * njac(5,4,i+1,j) + lhsc(5,5,i,j,k,c) = tmp2 * fjac(5,5,i+1,j) & + & - tmp1 * njac(5,5,i+1,j) & + & - tmp1 * dx5 + + enddo + enddo + + +!--------------------------------------------------------------------- +! outer most do loops - sweeping in i direction +!--------------------------------------------------------------------- + if (first .eq. 1) then + +!--------------------------------------------------------------------- +! multiply c(istart,j,k) by b_inverse and copy back to ! +! multiply rhs(istart) by b_inverse(istart) and copy to rhs +!--------------------------------------------------------------------- +!dir$ ivdep + do j=start(2,c),jsize + call binvcrhs( lhsb(1,1,istart,j), & + & lhsc(1,1,istart,j,k,c), & + & rhs(1,istart,j,k,c) ) + enddo + + endif + +!--------------------------------------------------------------------- +! begin inner most do loop +! do all the elements of the cell unless last +!--------------------------------------------------------------------- + do i=istart+first,isize-last +!dir$ ivdep + do j=start(2,c),jsize + +!--------------------------------------------------------------------- +! rhs(i) = rhs(i) - A*rhs(i-1) +!--------------------------------------------------------------------- + call matvec_sub(lhsa(1,1,i,j), & + & rhs(1,i-1,j,k,c),rhs(1,i,j,k,c)) + +!--------------------------------------------------------------------- +! B(i) = B(i) - C(i-1)*A(i) +!--------------------------------------------------------------------- + call matmul_sub(lhsa(1,1,i,j), & + & lhsc(1,1,i-1,j,k,c), & + & lhsb(1,1,i,j)) + + +!--------------------------------------------------------------------- +! multiply c(i,j,k) by b_inverse and copy back to ! +! multiply rhs(1,j,k) by b_inverse(1,j,k) and copy to rhs +!--------------------------------------------------------------------- + call binvcrhs( lhsb(1,1,i,j), & + & lhsc(1,1,i,j,k,c), & + & rhs(1,i,j,k,c) ) + + enddo + enddo + +!--------------------------------------------------------------------- +! Now finish up special cases for last cell +!--------------------------------------------------------------------- + if (last .eq. 1) then + +!dir$ ivdep + do j=start(2,c),jsize +!--------------------------------------------------------------------- +! rhs(isize) = rhs(isize) - A*rhs(isize-1) +!--------------------------------------------------------------------- + call matvec_sub(lhsa(1,1,isize,j), & + & rhs(1,isize-1,j,k,c),rhs(1,isize,j,k,c)) + +!--------------------------------------------------------------------- +! B(isize) = B(isize) - C(isize-1)*A(isize) +!--------------------------------------------------------------------- + call matmul_sub(lhsa(1,1,isize,j), & + & lhsc(1,1,isize-1,j,k,c), & + & lhsb(1,1,isize,j)) + +!--------------------------------------------------------------------- +! multiply rhs() by b_inverse() and copy to rhs +!--------------------------------------------------------------------- + call binvrhs( lhsb(1,1,isize,j), & + & rhs(1,isize,j,k,c) ) + enddo + + endif + enddo + + + return + end + diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/y_solve.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/y_solve.f90 new file mode 100644 index 000000000..1398094a8 --- /dev/null +++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/y_solve.f90 @@ -0,0 +1,797 @@ +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine y_solve + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + +!--------------------------------------------------------------------- +! Performs line solves in Y direction by first factoring +! the block-tridiagonal matrix into an upper triangular matrix, +! and then performing back substitution to solve for the unknow +! vectors of each line. +! +! Make sure we treat elements zero to cell_size in the direction +! of the sweep. +!--------------------------------------------------------------------- + + use bt_data + use mpinpb + + implicit none + + integer & + & c, jstart, stage, & + & first, last, recv_id, error, r_status(MPI_STATUS_SIZE), & + & isize,jsize,ksize,send_id + + jstart = 0 + + if (timeron) call timer_start(t_ysolve) +!--------------------------------------------------------------------- +! in our terminology stage is the number of the cell in the y-direction +! i.e. stage = 1 means the start of the line stage=ncells means end +!--------------------------------------------------------------------- + do stage = 1,ncells + c = slice(2,stage) + isize = cell_size(1,c) - 1 + jsize = cell_size(2,c) - 1 + ksize = cell_size(3,c) - 1 + +!--------------------------------------------------------------------- +! set last-cell flag +!--------------------------------------------------------------------- + if (stage .eq. ncells) then + last = 1 + else + last = 0 + endif + + if (stage .eq. 1) then +!--------------------------------------------------------------------- +! This is the first cell, so solve without receiving data +!--------------------------------------------------------------------- + first = 1 +! call lhsy(c) + call y_solve_cell(first,last,c) + else +!--------------------------------------------------------------------- +! Not the first cell of this line, so receive info from +! processor working on preceeding cell +!--------------------------------------------------------------------- + first = 0 + if (timeron) call timer_start(t_ycomm) + call y_receive_solve_info(recv_id,c) +!--------------------------------------------------------------------- +! overlap computations and communications +!--------------------------------------------------------------------- +! call lhsy(c) +!--------------------------------------------------------------------- +! wait for completion +!--------------------------------------------------------------------- + call mpi_wait(send_id,r_status,error) + call mpi_wait(recv_id,r_status,error) + if (timeron) call timer_stop(t_ycomm) +!--------------------------------------------------------------------- +! install C'(jstart+1) and rhs'(jstart+1) to be used in this cell +!--------------------------------------------------------------------- + call y_unpack_solve_info(c) + call y_solve_cell(first,last,c) + endif + + if (last .eq. 0) call y_send_solve_info(send_id,c) + enddo + +!--------------------------------------------------------------------- +! now perform backsubstitution in reverse direction +!--------------------------------------------------------------------- + do stage = ncells, 1, -1 + c = slice(2,stage) + first = 0 + last = 0 + if (stage .eq. 1) first = 1 + if (stage .eq. ncells) then + last = 1 +!--------------------------------------------------------------------- +! last cell, so perform back substitute without waiting +!--------------------------------------------------------------------- + call y_backsubstitute(first, last,c) + else + if (timeron) call timer_start(t_ycomm) + call y_receive_backsub_info(recv_id,c) + call mpi_wait(send_id,r_status,error) + call mpi_wait(recv_id,r_status,error) + if (timeron) call timer_stop(t_ycomm) + call y_unpack_backsub_info(c) + call y_backsubstitute(first,last,c) + endif + if (first .eq. 0) call y_send_backsub_info(send_id,c) + enddo + + if (timeron) call timer_stop(t_ysolve) + + return + end + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine y_unpack_solve_info(c) + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + +!--------------------------------------------------------------------- +! unpack C'(-1) and rhs'(-1) for +! all i and k +!--------------------------------------------------------------------- + + use bt_data + implicit none + + integer i,k,m,n,ptr,c,jstart + + jstart = 0 + ptr = 0 + do k=0,KMAX-1 + do i=0,IMAX-1 + do m=1,BLOCK_SIZE + do n=1,BLOCK_SIZE + lhsc(m,n,i,jstart-1,k,c) = out_buffer(ptr+n) + enddo + ptr = ptr+BLOCK_SIZE + enddo + do n=1,BLOCK_SIZE + rhs(n,i,jstart-1,k,c) = out_buffer(ptr+n) + enddo + ptr = ptr+BLOCK_SIZE + enddo + enddo + + return + end + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine y_send_solve_info(send_id,c) + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + +!--------------------------------------------------------------------- +! pack up and send C'(jend) and rhs'(jend) for +! all i and k +!--------------------------------------------------------------------- + + use bt_data + use mpinpb + + implicit none + + integer i,k,m,n,jsize,ptr,c,ip,kp + integer error,send_id,buffer_size + + jsize = cell_size(2,c)-1 + ip = cell_coord(1,c) - 1 + kp = cell_coord(3,c) - 1 + buffer_size=MAX_CELL_DIM*MAX_CELL_DIM* & + & (BLOCK_SIZE*BLOCK_SIZE + BLOCK_SIZE) + +!--------------------------------------------------------------------- +! pack up buffer +!--------------------------------------------------------------------- + ptr = 0 + do k=0,KMAX-1 + do i=0,IMAX-1 + do m=1,BLOCK_SIZE + do n=1,BLOCK_SIZE + in_buffer(ptr+n) = lhsc(m,n,i,jsize,k,c) + enddo + ptr = ptr+BLOCK_SIZE + enddo + do n=1,BLOCK_SIZE + in_buffer(ptr+n) = rhs(n,i,jsize,k,c) + enddo + ptr = ptr+BLOCK_SIZE + enddo + enddo + +!--------------------------------------------------------------------- +! send buffer +!--------------------------------------------------------------------- + if (timeron) call timer_start(t_ycomm) + call mpi_isend(in_buffer, buffer_size, & + & dp_type, successor(2), & + & SOUTH+ip+kp*NCELLS, comm_solve, & + & send_id,error) + if (timeron) call timer_stop(t_ycomm) + + return + end + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine y_send_backsub_info(send_id,c) + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + +!--------------------------------------------------------------------- +! pack up and send U(jstart) for all i and k +!--------------------------------------------------------------------- + + use bt_data + use mpinpb + + implicit none + + integer i,k,n,ptr,c,jstart,ip,kp + integer error,send_id,buffer_size + +!--------------------------------------------------------------------- +! Send element 0 to previous processor +!--------------------------------------------------------------------- + jstart = 0 + ip = cell_coord(1,c)-1 + kp = cell_coord(3,c)-1 + buffer_size=MAX_CELL_DIM*MAX_CELL_DIM*BLOCK_SIZE + ptr = 0 + do k=0,KMAX-1 + do i=0,IMAX-1 + do n=1,BLOCK_SIZE + in_buffer(ptr+n) = rhs(n,i,jstart,k,c) + enddo + ptr = ptr+BLOCK_SIZE + enddo + enddo + if (timeron) call timer_start(t_ycomm) + call mpi_isend(in_buffer, buffer_size, & + & dp_type, predecessor(2), & + & NORTH+ip+kp*NCELLS, comm_solve, & + & send_id,error) + if (timeron) call timer_stop(t_ycomm) + + return + end + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine y_unpack_backsub_info(c) + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + +!--------------------------------------------------------------------- +! unpack U(jsize) for all i and k +!--------------------------------------------------------------------- + + use bt_data + implicit none + + integer i,k,n,ptr,c + + ptr = 0 + do k=0,KMAX-1 + do i=0,IMAX-1 + do n=1,BLOCK_SIZE + backsub_info(n,i,k,c) = out_buffer(ptr+n) + enddo + ptr = ptr+BLOCK_SIZE + enddo + enddo + + return + end + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine y_receive_backsub_info(recv_id,c) + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + +!--------------------------------------------------------------------- +! post mpi receives +!--------------------------------------------------------------------- + + use bt_data + use mpinpb + + implicit none + + integer error,recv_id,ip,kp,c,buffer_size + + ip = cell_coord(1,c) - 1 + kp = cell_coord(3,c) - 1 + buffer_size=MAX_CELL_DIM*MAX_CELL_DIM*BLOCK_SIZE + call mpi_irecv(out_buffer, buffer_size, & + & dp_type, successor(2), & + & NORTH+ip+kp*NCELLS, comm_solve, & + & recv_id, error) + return + end + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine y_receive_solve_info(recv_id,c) + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + +!--------------------------------------------------------------------- +! post mpi receives +!--------------------------------------------------------------------- + + use bt_data + use mpinpb + + implicit none + + integer ip,kp,recv_id,error,c,buffer_size + + ip = cell_coord(1,c) - 1 + kp = cell_coord(3,c) - 1 + buffer_size=MAX_CELL_DIM*MAX_CELL_DIM* & + & (BLOCK_SIZE*BLOCK_SIZE + BLOCK_SIZE) + call mpi_irecv(out_buffer, buffer_size, & + & dp_type, predecessor(2), & + & SOUTH+ip+kp*NCELLS, comm_solve, & + & recv_id, error) + + return + end + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine y_backsubstitute(first, last, c) + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + +!--------------------------------------------------------------------- +! back solve: if last cell, then generate U(jsize)=rhs(jsize) +! else assume U(jsize) is loaded in un pack backsub_info +! so just use it +! after call u(jstart) will be sent to next cell +!--------------------------------------------------------------------- + + use bt_data + implicit none + + integer first, last, c, i, k + integer m,n,j,jsize,isize,ksize,jstart + + jstart = 0 + isize = cell_size(1,c)-end(1,c)-1 + jsize = cell_size(2,c)-1 + ksize = cell_size(3,c)-end(3,c)-1 + if (last .eq. 0) then + do k=start(3,c),ksize + do i=start(1,c),isize +!--------------------------------------------------------------------- +! U(jsize) uses info from previous cell if not last cell +!--------------------------------------------------------------------- + do m=1,BLOCK_SIZE + do n=1,BLOCK_SIZE + rhs(m,i,jsize,k,c) = rhs(m,i,jsize,k,c) & + & - lhsc(m,n,i,jsize,k,c)* & + & backsub_info(n,i,k,c) + enddo + enddo + enddo + enddo + endif + do k=start(3,c),ksize + do j=jsize-1,jstart,-1 + do i=start(1,c),isize + do m=1,BLOCK_SIZE + do n=1,BLOCK_SIZE + rhs(m,i,j,k,c) = rhs(m,i,j,k,c) & + & - lhsc(m,n,i,j,k,c)*rhs(n,i,j+1,k,c) + enddo + enddo + enddo + enddo + enddo + + return + end + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine y_solve_cell(first,last,c) + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + +!--------------------------------------------------------------------- +! performs guaussian elimination on this cell. +! +! assumes that unpacking routines for non-first cells +! preload C' and rhs' from previous cell. +! +! assumed send happens outside this routine, but that +! c'(JMAX) and rhs'(JMAX) will be sent to next cell +!--------------------------------------------------------------------- + + use bt_data + implicit none + + double precision tmp1, tmp2, tmp3 + integer first,last,c + integer i,j,k,isize,ksize,jsize,jstart + double precision utmp(6,-2:JMAX+1) + + jstart = 0 + isize = cell_size(1,c)-end(1,c)-1 + jsize = cell_size(2,c)-1 + ksize = cell_size(3,c)-end(3,c)-1 + + call lhsabinit(lhsa, lhsb, jsize) + + do k=start(3,c),ksize + do i=start(1,c),isize + +!--------------------------------------------------------------------- +! This function computes the left hand side for the three y-factors +!--------------------------------------------------------------------- + +!--------------------------------------------------------------------- +! Compute the indices for storing the tri-diagonal matrix; +! determine a (labeled f) and n jacobians for cell c +!--------------------------------------------------------------------- + do j = start(2,c)-1, cell_size(2,c)-end(2,c) + utmp(1,j) = 1.0d0 / u(1,i,j,k,c) + utmp(2,j) = u(2,i,j,k,c) + utmp(3,j) = u(3,i,j,k,c) + utmp(4,j) = u(4,i,j,k,c) + utmp(5,j) = u(5,i,j,k,c) + utmp(6,j) = qs(i,j,k,c) + end do + + do j = start(2,c)-1, cell_size(2,c)-end(2,c) + + tmp1 = utmp(1,j) + tmp2 = tmp1 * tmp1 + tmp3 = tmp1 * tmp2 + + fjac(1,1,j) = 0.0d+00 + fjac(1,2,j) = 0.0d+00 + fjac(1,3,j) = 1.0d+00 + fjac(1,4,j) = 0.0d+00 + fjac(1,5,j) = 0.0d+00 + + fjac(2,1,j) = - ( utmp(2,j)*utmp(3,j) ) & + & * tmp2 + fjac(2,2,j) = utmp(3,j) * tmp1 + fjac(2,3,j) = utmp(2,j) * tmp1 + fjac(2,4,j) = 0.0d+00 + fjac(2,5,j) = 0.0d+00 + + fjac(3,1,j) = - ( utmp(3,j)*utmp(3,j)*tmp2) & + & + c2 * utmp(6,j) + fjac(3,2,j) = - c2 * utmp(2,j) * tmp1 + fjac(3,3,j) = ( 2.0d+00 - c2 ) & + & * utmp(3,j) * tmp1 + fjac(3,4,j) = - c2 * utmp(4,j) * tmp1 + fjac(3,5,j) = c2 + + fjac(4,1,j) = - ( utmp(3,j)*utmp(4,j) ) & + & * tmp2 + fjac(4,2,j) = 0.0d+00 + fjac(4,3,j) = utmp(4,j) * tmp1 + fjac(4,4,j) = utmp(3,j) * tmp1 + fjac(4,5,j) = 0.0d+00 + + fjac(5,1,j) = ( c2 * 2.0d0 * utmp(6,j) & + & - c1 * utmp(5,j) * tmp1 ) & + & * utmp(3,j) * tmp1 + fjac(5,2,j) = - c2 * utmp(2,j)*utmp(3,j) & + & * tmp2 + fjac(5,3,j) = c1 * utmp(5,j) * tmp1 & + & - c2 * ( utmp(6,j) & + & + utmp(3,j)*utmp(3,j) * tmp2 ) + fjac(5,4,j) = - c2 * ( utmp(3,j)*utmp(4,j) ) & + & * tmp2 + fjac(5,5,j) = c1 * utmp(3,j) * tmp1 + + njac(1,1,j) = 0.0d+00 + njac(1,2,j) = 0.0d+00 + njac(1,3,j) = 0.0d+00 + njac(1,4,j) = 0.0d+00 + njac(1,5,j) = 0.0d+00 + + njac(2,1,j) = - c3c4 * tmp2 * utmp(2,j) + njac(2,2,j) = c3c4 * tmp1 + njac(2,3,j) = 0.0d+00 + njac(2,4,j) = 0.0d+00 + njac(2,5,j) = 0.0d+00 + + njac(3,1,j) = - con43 * c3c4 * tmp2 * utmp(3,j) + njac(3,2,j) = 0.0d+00 + njac(3,3,j) = con43 * c3c4 * tmp1 + njac(3,4,j) = 0.0d+00 + njac(3,5,j) = 0.0d+00 + + njac(4,1,j) = - c3c4 * tmp2 * utmp(4,j) + njac(4,2,j) = 0.0d+00 + njac(4,3,j) = 0.0d+00 + njac(4,4,j) = c3c4 * tmp1 + njac(4,5,j) = 0.0d+00 + + njac(5,1,j) = - ( c3c4 & + & - c1345 ) * tmp3 * (utmp(2,j)**2) & + & - ( con43 * c3c4 & + & - c1345 ) * tmp3 * (utmp(3,j)**2) & + & - ( c3c4 - c1345 ) * tmp3 * (utmp(4,j)**2) & + & - c1345 * tmp2 * utmp(5,j) + + njac(5,2,j) = ( c3c4 - c1345 ) * tmp2 * utmp(2,j) + njac(5,3,j) = ( con43 * c3c4 & + & - c1345 ) * tmp2 * utmp(3,j) + njac(5,4,j) = ( c3c4 - c1345 ) * tmp2 * utmp(4,j) + njac(5,5,j) = ( c1345 ) * tmp1 + + enddo + +!--------------------------------------------------------------------- +! now joacobians set, so form left hand side in y direction +!--------------------------------------------------------------------- + do j = start(2,c), jsize-end(2,c) + + tmp1 = dt * ty1 + tmp2 = dt * ty2 + + lhsa(1,1,j) = - tmp2 * fjac(1,1,j-1) & + & - tmp1 * njac(1,1,j-1) & + & - tmp1 * dy1 + lhsa(1,2,j) = - tmp2 * fjac(1,2,j-1) & + & - tmp1 * njac(1,2,j-1) + lhsa(1,3,j) = - tmp2 * fjac(1,3,j-1) & + & - tmp1 * njac(1,3,j-1) + lhsa(1,4,j) = - tmp2 * fjac(1,4,j-1) & + & - tmp1 * njac(1,4,j-1) + lhsa(1,5,j) = - tmp2 * fjac(1,5,j-1) & + & - tmp1 * njac(1,5,j-1) + + lhsa(2,1,j) = - tmp2 * fjac(2,1,j-1) & + & - tmp1 * njac(2,1,j-1) + lhsa(2,2,j) = - tmp2 * fjac(2,2,j-1) & + & - tmp1 * njac(2,2,j-1) & + & - tmp1 * dy2 + lhsa(2,3,j) = - tmp2 * fjac(2,3,j-1) & + & - tmp1 * njac(2,3,j-1) + lhsa(2,4,j) = - tmp2 * fjac(2,4,j-1) & + & - tmp1 * njac(2,4,j-1) + lhsa(2,5,j) = - tmp2 * fjac(2,5,j-1) & + & - tmp1 * njac(2,5,j-1) + + lhsa(3,1,j) = - tmp2 * fjac(3,1,j-1) & + & - tmp1 * njac(3,1,j-1) + lhsa(3,2,j) = - tmp2 * fjac(3,2,j-1) & + & - tmp1 * njac(3,2,j-1) + lhsa(3,3,j) = - tmp2 * fjac(3,3,j-1) & + & - tmp1 * njac(3,3,j-1) & + & - tmp1 * dy3 + lhsa(3,4,j) = - tmp2 * fjac(3,4,j-1) & + & - tmp1 * njac(3,4,j-1) + lhsa(3,5,j) = - tmp2 * fjac(3,5,j-1) & + & - tmp1 * njac(3,5,j-1) + + lhsa(4,1,j) = - tmp2 * fjac(4,1,j-1) & + & - tmp1 * njac(4,1,j-1) + lhsa(4,2,j) = - tmp2 * fjac(4,2,j-1) & + & - tmp1 * njac(4,2,j-1) + lhsa(4,3,j) = - tmp2 * fjac(4,3,j-1) & + & - tmp1 * njac(4,3,j-1) + lhsa(4,4,j) = - tmp2 * fjac(4,4,j-1) & + & - tmp1 * njac(4,4,j-1) & + & - tmp1 * dy4 + lhsa(4,5,j) = - tmp2 * fjac(4,5,j-1) & + & - tmp1 * njac(4,5,j-1) + + lhsa(5,1,j) = - tmp2 * fjac(5,1,j-1) & + & - tmp1 * njac(5,1,j-1) + lhsa(5,2,j) = - tmp2 * fjac(5,2,j-1) & + & - tmp1 * njac(5,2,j-1) + lhsa(5,3,j) = - tmp2 * fjac(5,3,j-1) & + & - tmp1 * njac(5,3,j-1) + lhsa(5,4,j) = - tmp2 * fjac(5,4,j-1) & + & - tmp1 * njac(5,4,j-1) + lhsa(5,5,j) = - tmp2 * fjac(5,5,j-1) & + & - tmp1 * njac(5,5,j-1) & + & - tmp1 * dy5 + + lhsb(1,1,j) = 1.0d+00 & + & + tmp1 * 2.0d+00 * njac(1,1,j) & + & + tmp1 * 2.0d+00 * dy1 + lhsb(1,2,j) = tmp1 * 2.0d+00 * njac(1,2,j) + lhsb(1,3,j) = tmp1 * 2.0d+00 * njac(1,3,j) + lhsb(1,4,j) = tmp1 * 2.0d+00 * njac(1,4,j) + lhsb(1,5,j) = tmp1 * 2.0d+00 * njac(1,5,j) + + lhsb(2,1,j) = tmp1 * 2.0d+00 * njac(2,1,j) + lhsb(2,2,j) = 1.0d+00 & + & + tmp1 * 2.0d+00 * njac(2,2,j) & + & + tmp1 * 2.0d+00 * dy2 + lhsb(2,3,j) = tmp1 * 2.0d+00 * njac(2,3,j) + lhsb(2,4,j) = tmp1 * 2.0d+00 * njac(2,4,j) + lhsb(2,5,j) = tmp1 * 2.0d+00 * njac(2,5,j) + + lhsb(3,1,j) = tmp1 * 2.0d+00 * njac(3,1,j) + lhsb(3,2,j) = tmp1 * 2.0d+00 * njac(3,2,j) + lhsb(3,3,j) = 1.0d+00 & + & + tmp1 * 2.0d+00 * njac(3,3,j) & + & + tmp1 * 2.0d+00 * dy3 + lhsb(3,4,j) = tmp1 * 2.0d+00 * njac(3,4,j) + lhsb(3,5,j) = tmp1 * 2.0d+00 * njac(3,5,j) + + lhsb(4,1,j) = tmp1 * 2.0d+00 * njac(4,1,j) + lhsb(4,2,j) = tmp1 * 2.0d+00 * njac(4,2,j) + lhsb(4,3,j) = tmp1 * 2.0d+00 * njac(4,3,j) + lhsb(4,4,j) = 1.0d+00 & + & + tmp1 * 2.0d+00 * njac(4,4,j) & + & + tmp1 * 2.0d+00 * dy4 + lhsb(4,5,j) = tmp1 * 2.0d+00 * njac(4,5,j) + + lhsb(5,1,j) = tmp1 * 2.0d+00 * njac(5,1,j) + lhsb(5,2,j) = tmp1 * 2.0d+00 * njac(5,2,j) + lhsb(5,3,j) = tmp1 * 2.0d+00 * njac(5,3,j) + lhsb(5,4,j) = tmp1 * 2.0d+00 * njac(5,4,j) + lhsb(5,5,j) = 1.0d+00 & + & + tmp1 * 2.0d+00 * njac(5,5,j) & + & + tmp1 * 2.0d+00 * dy5 + + lhsc(1,1,i,j,k,c) = tmp2 * fjac(1,1,j+1) & + & - tmp1 * njac(1,1,j+1) & + & - tmp1 * dy1 + lhsc(1,2,i,j,k,c) = tmp2 * fjac(1,2,j+1) & + & - tmp1 * njac(1,2,j+1) + lhsc(1,3,i,j,k,c) = tmp2 * fjac(1,3,j+1) & + & - tmp1 * njac(1,3,j+1) + lhsc(1,4,i,j,k,c) = tmp2 * fjac(1,4,j+1) & + & - tmp1 * njac(1,4,j+1) + lhsc(1,5,i,j,k,c) = tmp2 * fjac(1,5,j+1) & + & - tmp1 * njac(1,5,j+1) + + lhsc(2,1,i,j,k,c) = tmp2 * fjac(2,1,j+1) & + & - tmp1 * njac(2,1,j+1) + lhsc(2,2,i,j,k,c) = tmp2 * fjac(2,2,j+1) & + & - tmp1 * njac(2,2,j+1) & + & - tmp1 * dy2 + lhsc(2,3,i,j,k,c) = tmp2 * fjac(2,3,j+1) & + & - tmp1 * njac(2,3,j+1) + lhsc(2,4,i,j,k,c) = tmp2 * fjac(2,4,j+1) & + & - tmp1 * njac(2,4,j+1) + lhsc(2,5,i,j,k,c) = tmp2 * fjac(2,5,j+1) & + & - tmp1 * njac(2,5,j+1) + + lhsc(3,1,i,j,k,c) = tmp2 * fjac(3,1,j+1) & + & - tmp1 * njac(3,1,j+1) + lhsc(3,2,i,j,k,c) = tmp2 * fjac(3,2,j+1) & + & - tmp1 * njac(3,2,j+1) + lhsc(3,3,i,j,k,c) = tmp2 * fjac(3,3,j+1) & + & - tmp1 * njac(3,3,j+1) & + & - tmp1 * dy3 + lhsc(3,4,i,j,k,c) = tmp2 * fjac(3,4,j+1) & + & - tmp1 * njac(3,4,j+1) + lhsc(3,5,i,j,k,c) = tmp2 * fjac(3,5,j+1) & + & - tmp1 * njac(3,5,j+1) + + lhsc(4,1,i,j,k,c) = tmp2 * fjac(4,1,j+1) & + & - tmp1 * njac(4,1,j+1) + lhsc(4,2,i,j,k,c) = tmp2 * fjac(4,2,j+1) & + & - tmp1 * njac(4,2,j+1) + lhsc(4,3,i,j,k,c) = tmp2 * fjac(4,3,j+1) & + & - tmp1 * njac(4,3,j+1) + lhsc(4,4,i,j,k,c) = tmp2 * fjac(4,4,j+1) & + & - tmp1 * njac(4,4,j+1) & + & - tmp1 * dy4 + lhsc(4,5,i,j,k,c) = tmp2 * fjac(4,5,j+1) & + & - tmp1 * njac(4,5,j+1) + + lhsc(5,1,i,j,k,c) = tmp2 * fjac(5,1,j+1) & + & - tmp1 * njac(5,1,j+1) + lhsc(5,2,i,j,k,c) = tmp2 * fjac(5,2,j+1) & + & - tmp1 * njac(5,2,j+1) + lhsc(5,3,i,j,k,c) = tmp2 * fjac(5,3,j+1) & + & - tmp1 * njac(5,3,j+1) + lhsc(5,4,i,j,k,c) = tmp2 * fjac(5,4,j+1) & + & - tmp1 * njac(5,4,j+1) + lhsc(5,5,i,j,k,c) = tmp2 * fjac(5,5,j+1) & + & - tmp1 * njac(5,5,j+1) & + & - tmp1 * dy5 + + enddo + + +!--------------------------------------------------------------------- +! outer most do loops - sweeping in i direction +!--------------------------------------------------------------------- + if (first .eq. 1) then + +!--------------------------------------------------------------------- +! multiply c(i,jstart,k) by b_inverse and copy back to c +! multiply rhs(jstart) by b_inverse(jstart) and copy to rhs +!--------------------------------------------------------------------- + call binvcrhs( lhsb(1,1,jstart), & + & lhsc(1,1,i,jstart,k,c), & + & rhs(1,i,jstart,k,c) ) + + endif + +!--------------------------------------------------------------------- +! begin inner most do loop +! do all the elements of the cell unless last +!--------------------------------------------------------------------- + do j=jstart+first,jsize-last + +!--------------------------------------------------------------------- +! subtract A*lhs_vector(j-1) from lhs_vector(j) +! +! rhs(j) = rhs(j) - A*rhs(j-1) +!--------------------------------------------------------------------- + call matvec_sub(lhsa(1,1,j), & + & rhs(1,i,j-1,k,c),rhs(1,i,j,k,c)) + +!--------------------------------------------------------------------- +! B(j) = B(j) - C(j-1)*A(j) +!--------------------------------------------------------------------- + call matmul_sub(lhsa(1,1,j), & + & lhsc(1,1,i,j-1,k,c), & + & lhsb(1,1,j)) + +!--------------------------------------------------------------------- +! multiply c(i,j,k) by b_inverse and copy back to c +! multiply rhs(i,1,k) by b_inverse(i,1,k) and copy to rhs +!--------------------------------------------------------------------- + call binvcrhs( lhsb(1,1,j), & + & lhsc(1,1,i,j,k,c), & + & rhs(1,i,j,k,c) ) + + enddo + +!--------------------------------------------------------------------- +! Now finish up special cases for last cell +!--------------------------------------------------------------------- + if (last .eq. 1) then + +!--------------------------------------------------------------------- +! rhs(jsize) = rhs(jsize) - A*rhs(jsize-1) +!--------------------------------------------------------------------- + call matvec_sub(lhsa(1,1,jsize), & + & rhs(1,i,jsize-1,k,c),rhs(1,i,jsize,k,c)) + +!--------------------------------------------------------------------- +! B(jsize) = B(jsize) - C(jsize-1)*A(jsize) +! call matmul_sub(aa,i,jsize,k,c, +! $ cc,i,jsize-1,k,c,bb,i,jsize,k,c) +!--------------------------------------------------------------------- + call matmul_sub(lhsa(1,1,jsize), & + & lhsc(1,1,i,jsize-1,k,c), & + & lhsb(1,1,jsize)) + +!--------------------------------------------------------------------- +! multiply rhs(jsize) by b_inverse(jsize) and copy to rhs +!--------------------------------------------------------------------- + call binvrhs( lhsb(1,1,jsize), & + & rhs(1,i,jsize,k,c) ) + + endif + enddo + enddo + + + return + end + + + diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/y_solve_vec.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/y_solve_vec.f90 new file mode 100644 index 000000000..201064511 --- /dev/null +++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/y_solve_vec.f90 @@ -0,0 +1,812 @@ +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine y_solve + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + +!--------------------------------------------------------------------- +! Performs line solves in Y direction by first factoring +! the block-tridiagonal matrix into an upper triangular matrix, +! and then performing back substitution to solve for the unknow +! vectors of each line. +! +! Make sure we treat elements zero to cell_size in the direction +! of the sweep. +!--------------------------------------------------------------------- + + use bt_data + use mpinpb + + implicit none + + integer & + & c, jstart, stage, & + & first, last, recv_id, error, r_status(MPI_STATUS_SIZE), & + & isize,jsize,ksize,send_id + + jstart = 0 + + if (timeron) call timer_start(t_ysolve) +!--------------------------------------------------------------------- +! in our terminology stage is the number of the cell in the y-direct +! i.e. stage = 1 means the start of the line stage=ncells means end +!--------------------------------------------------------------------- + do stage = 1,ncells + c = slice(2,stage) + isize = cell_size(1,c) - 1 + jsize = cell_size(2,c) - 1 + ksize = cell_size(3,c) - 1 + +!--------------------------------------------------------------------- +! set last-cell flag +!--------------------------------------------------------------------- + if (stage .eq. ncells) then + last = 1 + else + last = 0 + endif + + if (stage .eq. 1) then +!--------------------------------------------------------------------- +! This is the first cell, so solve without receiving data +!--------------------------------------------------------------------- + first = 1 +! call lhsy(c) + call y_solve_cell(first,last,c) + else +!--------------------------------------------------------------------- +! Not the first cell of this line, so receive info from +! processor working on preceeding cell +!--------------------------------------------------------------------- + first = 0 + if (timeron) call timer_start(t_ycomm) + call y_receive_solve_info(recv_id,c) +!--------------------------------------------------------------------- +! overlap computations and communications +!--------------------------------------------------------------------- +! call lhsy(c) +!--------------------------------------------------------------------- +! wait for completion +!--------------------------------------------------------------------- + call mpi_wait(send_id,r_status,error) + call mpi_wait(recv_id,r_status,error) + if (timeron) call timer_stop(t_ycomm) +!--------------------------------------------------------------------- +! install C'(jstart+1) and rhs'(jstart+1) to be used in this cell +!--------------------------------------------------------------------- + call y_unpack_solve_info(c) + call y_solve_cell(first,last,c) + endif + + if (last .eq. 0) call y_send_solve_info(send_id,c) + enddo + +!--------------------------------------------------------------------- +! now perform backsubstitution in reverse direction +!--------------------------------------------------------------------- + do stage = ncells, 1, -1 + c = slice(2,stage) + first = 0 + last = 0 + if (stage .eq. 1) first = 1 + if (stage .eq. ncells) then + last = 1 +!--------------------------------------------------------------------- +! last cell, so perform back substitute without waiting +!--------------------------------------------------------------------- + call y_backsubstitute(first, last,c) + else + if (timeron) call timer_start(t_ycomm) + call y_receive_backsub_info(recv_id,c) + call mpi_wait(send_id,r_status,error) + call mpi_wait(recv_id,r_status,error) + if (timeron) call timer_stop(t_ycomm) + call y_unpack_backsub_info(c) + call y_backsubstitute(first,last,c) + endif + if (first .eq. 0) call y_send_backsub_info(send_id,c) + enddo + + if (timeron) call timer_stop(t_ysolve) + + return + end + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine y_unpack_solve_info(c) + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + +!--------------------------------------------------------------------- +! unpack C'(-1) and rhs'(-1) for +! all i and k +!--------------------------------------------------------------------- + + use bt_data + implicit none + + integer i,k,m,n,ptr,c,jstart + + jstart = 0 + ptr = 0 + do k=0,KMAX-1 + do i=0,IMAX-1 + do m=1,BLOCK_SIZE + do n=1,BLOCK_SIZE + lhsc(m,n,i,jstart-1,k,c) = out_buffer(ptr+n) + enddo + ptr = ptr+BLOCK_SIZE + enddo + do n=1,BLOCK_SIZE + rhs(n,i,jstart-1,k,c) = out_buffer(ptr+n) + enddo + ptr = ptr+BLOCK_SIZE + enddo + enddo + + return + end + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine y_send_solve_info(send_id,c) + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + +!--------------------------------------------------------------------- +! pack up and send C'(jend) and rhs'(jend) for +! all i and k +!--------------------------------------------------------------------- + + use bt_data + use mpinpb + + implicit none + + integer i,k,m,n,jsize,ptr,c,ip,kp + integer error,send_id,buffer_size + + jsize = cell_size(2,c)-1 + ip = cell_coord(1,c) - 1 + kp = cell_coord(3,c) - 1 + buffer_size=MAX_CELL_DIM*MAX_CELL_DIM* & + & (BLOCK_SIZE*BLOCK_SIZE + BLOCK_SIZE) + +!--------------------------------------------------------------------- +! pack up buffer +!--------------------------------------------------------------------- + ptr = 0 + do k=0,KMAX-1 + do i=0,IMAX-1 + do m=1,BLOCK_SIZE + do n=1,BLOCK_SIZE + in_buffer(ptr+n) = lhsc(m,n,i,jsize,k,c) + enddo + ptr = ptr+BLOCK_SIZE + enddo + do n=1,BLOCK_SIZE + in_buffer(ptr+n) = rhs(n,i,jsize,k,c) + enddo + ptr = ptr+BLOCK_SIZE + enddo + enddo + +!--------------------------------------------------------------------- +! send buffer +!--------------------------------------------------------------------- + if (timeron) call timer_start(t_ycomm) + call mpi_isend(in_buffer, buffer_size, & + & dp_type, successor(2), & + & SOUTH+ip+kp*NCELLS, comm_solve, & + & send_id,error) + if (timeron) call timer_stop(t_ycomm) + + return + end + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine y_send_backsub_info(send_id,c) + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + +!--------------------------------------------------------------------- +! pack up and send U(jstart) for all i and k +!--------------------------------------------------------------------- + + use bt_data + use mpinpb + + implicit none + + integer i,k,n,ptr,c,jstart,ip,kp + integer error,send_id,buffer_size + +!--------------------------------------------------------------------- +! Send element 0 to previous processor +!--------------------------------------------------------------------- + jstart = 0 + ip = cell_coord(1,c)-1 + kp = cell_coord(3,c)-1 + buffer_size=MAX_CELL_DIM*MAX_CELL_DIM*BLOCK_SIZE + ptr = 0 + do k=0,KMAX-1 + do i=0,IMAX-1 + do n=1,BLOCK_SIZE + in_buffer(ptr+n) = rhs(n,i,jstart,k,c) + enddo + ptr = ptr+BLOCK_SIZE + enddo + enddo + if (timeron) call timer_start(t_ycomm) + call mpi_isend(in_buffer, buffer_size, & + & dp_type, predecessor(2), & + & NORTH+ip+kp*NCELLS, comm_solve, & + & send_id,error) + if (timeron) call timer_stop(t_ycomm) + + return + end + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine y_unpack_backsub_info(c) + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + +!--------------------------------------------------------------------- +! unpack U(jsize) for all i and k +!--------------------------------------------------------------------- + + use bt_data + implicit none + + integer i,k,n,ptr,c + + ptr = 0 + do k=0,KMAX-1 + do i=0,IMAX-1 + do n=1,BLOCK_SIZE + backsub_info(n,i,k,c) = out_buffer(ptr+n) + enddo + ptr = ptr+BLOCK_SIZE + enddo + enddo + + return + end + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine y_receive_backsub_info(recv_id,c) + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + +!--------------------------------------------------------------------- +! post mpi receives +!--------------------------------------------------------------------- + + use bt_data + use mpinpb + + implicit none + + integer error,recv_id,ip,kp,c,buffer_size + ip = cell_coord(1,c) - 1 + kp = cell_coord(3,c) - 1 + buffer_size=MAX_CELL_DIM*MAX_CELL_DIM*BLOCK_SIZE + call mpi_irecv(out_buffer, buffer_size, & + & dp_type, successor(2), & + & NORTH+ip+kp*NCELLS, comm_solve, & + & recv_id, error) + return + end + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine y_receive_solve_info(recv_id,c) + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + +!--------------------------------------------------------------------- +! post mpi receives +!--------------------------------------------------------------------- + + use bt_data + use mpinpb + + implicit none + + integer ip,kp,recv_id,error,c,buffer_size + ip = cell_coord(1,c) - 1 + kp = cell_coord(3,c) - 1 + buffer_size=MAX_CELL_DIM*MAX_CELL_DIM* & + & (BLOCK_SIZE*BLOCK_SIZE + BLOCK_SIZE) + call mpi_irecv(out_buffer, buffer_size, & + & dp_type, predecessor(2), & + & SOUTH+ip+kp*NCELLS, comm_solve, & + & recv_id, error) + + return + end + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine y_backsubstitute(first, last, c) + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + +!--------------------------------------------------------------------- +! back solve: if last cell, then generate U(jsize)=rhs(jsize) +! else assume U(jsize) is loaded in un pack backsub_info +! so just use it +! after call u(jstart) will be sent to next cell +!--------------------------------------------------------------------- + + use bt_data + implicit none + + integer first, last, c, i, k + integer m,n,j,jsize,isize,ksize,jstart + + jstart = 0 + isize = cell_size(1,c)-end(1,c)-1 + jsize = cell_size(2,c)-1 + ksize = cell_size(3,c)-end(3,c)-1 + if (last .eq. 0) then + do k=start(3,c),ksize + do i=start(1,c),isize +!--------------------------------------------------------------------- +! U(jsize) uses info from previous cell if not last cell +!--------------------------------------------------------------------- + do m=1,BLOCK_SIZE + do n=1,BLOCK_SIZE + rhs(m,i,jsize,k,c) = rhs(m,i,jsize,k,c) & + & - lhsc(m,n,i,jsize,k,c)* & + & backsub_info(n,i,k,c) + enddo + enddo + enddo + enddo + endif + do k=start(3,c),ksize + do j=jsize-1,jstart,-1 + do i=start(1,c),isize + do m=1,BLOCK_SIZE + do n=1,BLOCK_SIZE + rhs(m,i,j,k,c) = rhs(m,i,j,k,c) & + & - lhsc(m,n,i,j,k,c)*rhs(n,i,j+1,k,c) + enddo + enddo + enddo + enddo + enddo + + return + end + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine y_solve_cell(first,last,c) + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + +!--------------------------------------------------------------------- +! performs guaussian elimination on this cell. +! +! assumes that unpacking routines for non-first cells +! preload C' and rhs' from previous cell. +! +! assumed send happens outside this routine, but that +! c'(JMAX) and rhs'(JMAX) will be sent to next cell +!--------------------------------------------------------------------- + + use bt_data + implicit none + + double precision tmp1, tmp2, tmp3 + integer first,last,c + integer i,j,k,m,n,isize,ksize,jsize,jstart + + jstart = 0 + isize = cell_size(1,c)-end(1,c)-1 + jsize = cell_size(2,c)-1 + ksize = cell_size(3,c)-end(3,c)-1 + +!--------------------------------------------------------------------- +! zero the left hand side for starters +! set diagonal values to 1. This is overkill, but convenient +!--------------------------------------------------------------------- + do i = 0, isize + do m = 1, 5 + do n = 1, 5 + lhsa(m,n,i,0) = 0.0d0 + lhsb(m,n,i,0) = 0.0d0 + lhsa(m,n,i,jsize) = 0.0d0 + lhsb(m,n,i,jsize) = 0.0d0 + enddo + lhsb(m,m,i,0) = 1.0d0 + lhsb(m,m,i,jsize) = 1.0d0 + enddo + enddo + + do k=start(3,c),ksize + +!--------------------------------------------------------------------- +! This function computes the left hand side for the three y-factors +!--------------------------------------------------------------------- + +!--------------------------------------------------------------------- +! Compute the indices for storing the tri-diagonal matrix; +! determine a (labeled f) and n jacobians for cell ! +!--------------------------------------------------------------------- + + do j = start(2,c)-1, cell_size(2,c)-end(2,c) + do i=start(1,c),isize + + tmp1 = 1.0d0 / u(1,i,j,k,c) + tmp2 = tmp1 * tmp1 + tmp3 = tmp1 * tmp2 + + fjac(1,1,i,j) = 0.0d+00 + fjac(1,2,i,j) = 0.0d+00 + fjac(1,3,i,j) = 1.0d+00 + fjac(1,4,i,j) = 0.0d+00 + fjac(1,5,i,j) = 0.0d+00 + + fjac(2,1,i,j) = - ( u(2,i,j,k,c)*u(3,i,j,k,c) ) & + & * tmp2 + fjac(2,2,i,j) = u(3,i,j,k,c) * tmp1 + fjac(2,3,i,j) = u(2,i,j,k,c) * tmp1 + fjac(2,4,i,j) = 0.0d+00 + fjac(2,5,i,j) = 0.0d+00 + + fjac(3,1,i,j) = - ( u(3,i,j,k,c)*u(3,i,j,k,c)*tmp2) & + & + c2 * qs(i,j,k,c) + fjac(3,2,i,j) = - c2 * u(2,i,j,k,c) * tmp1 + fjac(3,3,i,j) = ( 2.0d+00 - c2 ) & + & * u(3,i,j,k,c) * tmp1 + fjac(3,4,i,j) = - c2 * u(4,i,j,k,c) * tmp1 + fjac(3,5,i,j) = c2 + + fjac(4,1,i,j) = - ( u(3,i,j,k,c)*u(4,i,j,k,c) ) & + & * tmp2 + fjac(4,2,i,j) = 0.0d+00 + fjac(4,3,i,j) = u(4,i,j,k,c) * tmp1 + fjac(4,4,i,j) = u(3,i,j,k,c) * tmp1 + fjac(4,5,i,j) = 0.0d+00 + + fjac(5,1,i,j) = ( c2 * 2.0d0 * qs(i,j,k,c) & + & - c1 * u(5,i,j,k,c) * tmp1 ) & + & * u(3,i,j,k,c) * tmp1 + fjac(5,2,i,j) = - c2 * u(2,i,j,k,c)*u(3,i,j,k,c) & + & * tmp2 + fjac(5,3,i,j) = c1 * u(5,i,j,k,c) * tmp1 & + & - c2 * ( qs(i,j,k,c) & + & + u(3,i,j,k,c)*u(3,i,j,k,c) * tmp2 ) + fjac(5,4,i,j) = - c2 * ( u(3,i,j,k,c)*u(4,i,j,k,c) ) & + & * tmp2 + fjac(5,5,i,j) = c1 * u(3,i,j,k,c) * tmp1 + + njac(1,1,i,j) = 0.0d+00 + njac(1,2,i,j) = 0.0d+00 + njac(1,3,i,j) = 0.0d+00 + njac(1,4,i,j) = 0.0d+00 + njac(1,5,i,j) = 0.0d+00 + + njac(2,1,i,j) = - c3c4 * tmp2 * u(2,i,j,k,c) + njac(2,2,i,j) = c3c4 * tmp1 + njac(2,3,i,j) = 0.0d+00 + njac(2,4,i,j) = 0.0d+00 + njac(2,5,i,j) = 0.0d+00 + + njac(3,1,i,j) = - con43 * c3c4 * tmp2 * u(3,i,j,k,c) + njac(3,2,i,j) = 0.0d+00 + njac(3,3,i,j) = con43 * c3c4 * tmp1 + njac(3,4,i,j) = 0.0d+00 + njac(3,5,i,j) = 0.0d+00 + + njac(4,1,i,j) = - c3c4 * tmp2 * u(4,i,j,k,c) + njac(4,2,i,j) = 0.0d+00 + njac(4,3,i,j) = 0.0d+00 + njac(4,4,i,j) = c3c4 * tmp1 + njac(4,5,i,j) = 0.0d+00 + + njac(5,1,i,j) = - ( c3c4 & + & - c1345 ) * tmp3 * (u(2,i,j,k,c)**2) & + & - ( con43 * c3c4 & + & - c1345 ) * tmp3 * (u(3,i,j,k,c)**2) & + & - ( c3c4 - c1345 ) * tmp3 * (u(4,i,j,k,c)**2) & + & - c1345 * tmp2 * u(5,i,j,k,c) + + njac(5,2,i,j) = ( c3c4 - c1345 ) * tmp2 * u(2,i,j,k,c) + njac(5,3,i,j) = ( con43 * c3c4 & + & - c1345 ) * tmp2 * u(3,i,j,k,c) + njac(5,4,i,j) = ( c3c4 - c1345 ) * tmp2 * u(4,i,j,k,c) + njac(5,5,i,j) = ( c1345 ) * tmp1 + + enddo + enddo + +!--------------------------------------------------------------------- +! now joacobians set, so form left hand side in y direction +!--------------------------------------------------------------------- + do j = start(2,c), jsize-end(2,c) + do i=start(1,c),isize + + tmp1 = dt * ty1 + tmp2 = dt * ty2 + + lhsa(1,1,i,j) = - tmp2 * fjac(1,1,i,j-1) & + & - tmp1 * njac(1,1,i,j-1) & + & - tmp1 * dy1 + lhsa(1,2,i,j) = - tmp2 * fjac(1,2,i,j-1) & + & - tmp1 * njac(1,2,i,j-1) + lhsa(1,3,i,j) = - tmp2 * fjac(1,3,i,j-1) & + & - tmp1 * njac(1,3,i,j-1) + lhsa(1,4,i,j) = - tmp2 * fjac(1,4,i,j-1) & + & - tmp1 * njac(1,4,i,j-1) + lhsa(1,5,i,j) = - tmp2 * fjac(1,5,i,j-1) & + & - tmp1 * njac(1,5,i,j-1) + + lhsa(2,1,i,j) = - tmp2 * fjac(2,1,i,j-1) & + & - tmp1 * njac(2,1,i,j-1) + lhsa(2,2,i,j) = - tmp2 * fjac(2,2,i,j-1) & + & - tmp1 * njac(2,2,i,j-1) & + & - tmp1 * dy2 + lhsa(2,3,i,j) = - tmp2 * fjac(2,3,i,j-1) & + & - tmp1 * njac(2,3,i,j-1) + lhsa(2,4,i,j) = - tmp2 * fjac(2,4,i,j-1) & + & - tmp1 * njac(2,4,i,j-1) + lhsa(2,5,i,j) = - tmp2 * fjac(2,5,i,j-1) & + & - tmp1 * njac(2,5,i,j-1) + + lhsa(3,1,i,j) = - tmp2 * fjac(3,1,i,j-1) & + & - tmp1 * njac(3,1,i,j-1) + lhsa(3,2,i,j) = - tmp2 * fjac(3,2,i,j-1) & + & - tmp1 * njac(3,2,i,j-1) + lhsa(3,3,i,j) = - tmp2 * fjac(3,3,i,j-1) & + & - tmp1 * njac(3,3,i,j-1) & + & - tmp1 * dy3 + lhsa(3,4,i,j) = - tmp2 * fjac(3,4,i,j-1) & + & - tmp1 * njac(3,4,i,j-1) + lhsa(3,5,i,j) = - tmp2 * fjac(3,5,i,j-1) & + & - tmp1 * njac(3,5,i,j-1) + + lhsa(4,1,i,j) = - tmp2 * fjac(4,1,i,j-1) & + & - tmp1 * njac(4,1,i,j-1) + lhsa(4,2,i,j) = - tmp2 * fjac(4,2,i,j-1) & + & - tmp1 * njac(4,2,i,j-1) + lhsa(4,3,i,j) = - tmp2 * fjac(4,3,i,j-1) & + & - tmp1 * njac(4,3,i,j-1) + lhsa(4,4,i,j) = - tmp2 * fjac(4,4,i,j-1) & + & - tmp1 * njac(4,4,i,j-1) & + & - tmp1 * dy4 + lhsa(4,5,i,j) = - tmp2 * fjac(4,5,i,j-1) & + & - tmp1 * njac(4,5,i,j-1) + + lhsa(5,1,i,j) = - tmp2 * fjac(5,1,i,j-1) & + & - tmp1 * njac(5,1,i,j-1) + lhsa(5,2,i,j) = - tmp2 * fjac(5,2,i,j-1) & + & - tmp1 * njac(5,2,i,j-1) + lhsa(5,3,i,j) = - tmp2 * fjac(5,3,i,j-1) & + & - tmp1 * njac(5,3,i,j-1) + lhsa(5,4,i,j) = - tmp2 * fjac(5,4,i,j-1) & + & - tmp1 * njac(5,4,i,j-1) + lhsa(5,5,i,j) = - tmp2 * fjac(5,5,i,j-1) & + & - tmp1 * njac(5,5,i,j-1) & + & - tmp1 * dy5 + + lhsb(1,1,i,j) = 1.0d+00 & + & + tmp1 * 2.0d+00 * njac(1,1,i,j) & + & + tmp1 * 2.0d+00 * dy1 + lhsb(1,2,i,j) = tmp1 * 2.0d+00 * njac(1,2,i,j) + lhsb(1,3,i,j) = tmp1 * 2.0d+00 * njac(1,3,i,j) + lhsb(1,4,i,j) = tmp1 * 2.0d+00 * njac(1,4,i,j) + lhsb(1,5,i,j) = tmp1 * 2.0d+00 * njac(1,5,i,j) + + lhsb(2,1,i,j) = tmp1 * 2.0d+00 * njac(2,1,i,j) + lhsb(2,2,i,j) = 1.0d+00 & + & + tmp1 * 2.0d+00 * njac(2,2,i,j) & + & + tmp1 * 2.0d+00 * dy2 + lhsb(2,3,i,j) = tmp1 * 2.0d+00 * njac(2,3,i,j) + lhsb(2,4,i,j) = tmp1 * 2.0d+00 * njac(2,4,i,j) + lhsb(2,5,i,j) = tmp1 * 2.0d+00 * njac(2,5,i,j) + + lhsb(3,1,i,j) = tmp1 * 2.0d+00 * njac(3,1,i,j) + lhsb(3,2,i,j) = tmp1 * 2.0d+00 * njac(3,2,i,j) + lhsb(3,3,i,j) = 1.0d+00 & + & + tmp1 * 2.0d+00 * njac(3,3,i,j) & + & + tmp1 * 2.0d+00 * dy3 + lhsb(3,4,i,j) = tmp1 * 2.0d+00 * njac(3,4,i,j) + lhsb(3,5,i,j) = tmp1 * 2.0d+00 * njac(3,5,i,j) + + lhsb(4,1,i,j) = tmp1 * 2.0d+00 * njac(4,1,i,j) + lhsb(4,2,i,j) = tmp1 * 2.0d+00 * njac(4,2,i,j) + lhsb(4,3,i,j) = tmp1 * 2.0d+00 * njac(4,3,i,j) + lhsb(4,4,i,j) = 1.0d+00 & + & + tmp1 * 2.0d+00 * njac(4,4,i,j) & + & + tmp1 * 2.0d+00 * dy4 + lhsb(4,5,i,j) = tmp1 * 2.0d+00 * njac(4,5,i,j) + + lhsb(5,1,i,j) = tmp1 * 2.0d+00 * njac(5,1,i,j) + lhsb(5,2,i,j) = tmp1 * 2.0d+00 * njac(5,2,i,j) + lhsb(5,3,i,j) = tmp1 * 2.0d+00 * njac(5,3,i,j) + lhsb(5,4,i,j) = tmp1 * 2.0d+00 * njac(5,4,i,j) + lhsb(5,5,i,j) = 1.0d+00 & + & + tmp1 * 2.0d+00 * njac(5,5,i,j) & + & + tmp1 * 2.0d+00 * dy5 + + lhsc(1,1,i,j,k,c) = tmp2 * fjac(1,1,i,j+1) & + & - tmp1 * njac(1,1,i,j+1) & + & - tmp1 * dy1 + lhsc(1,2,i,j,k,c) = tmp2 * fjac(1,2,i,j+1) & + & - tmp1 * njac(1,2,i,j+1) + lhsc(1,3,i,j,k,c) = tmp2 * fjac(1,3,i,j+1) & + & - tmp1 * njac(1,3,i,j+1) + lhsc(1,4,i,j,k,c) = tmp2 * fjac(1,4,i,j+1) & + & - tmp1 * njac(1,4,i,j+1) + lhsc(1,5,i,j,k,c) = tmp2 * fjac(1,5,i,j+1) & + & - tmp1 * njac(1,5,i,j+1) + + lhsc(2,1,i,j,k,c) = tmp2 * fjac(2,1,i,j+1) & + & - tmp1 * njac(2,1,i,j+1) + lhsc(2,2,i,j,k,c) = tmp2 * fjac(2,2,i,j+1) & + & - tmp1 * njac(2,2,i,j+1) & + & - tmp1 * dy2 + lhsc(2,3,i,j,k,c) = tmp2 * fjac(2,3,i,j+1) & + & - tmp1 * njac(2,3,i,j+1) + lhsc(2,4,i,j,k,c) = tmp2 * fjac(2,4,i,j+1) & + & - tmp1 * njac(2,4,i,j+1) + lhsc(2,5,i,j,k,c) = tmp2 * fjac(2,5,i,j+1) & + & - tmp1 * njac(2,5,i,j+1) + + lhsc(3,1,i,j,k,c) = tmp2 * fjac(3,1,i,j+1) & + & - tmp1 * njac(3,1,i,j+1) + lhsc(3,2,i,j,k,c) = tmp2 * fjac(3,2,i,j+1) & + & - tmp1 * njac(3,2,i,j+1) + lhsc(3,3,i,j,k,c) = tmp2 * fjac(3,3,i,j+1) & + & - tmp1 * njac(3,3,i,j+1) & + & - tmp1 * dy3 + lhsc(3,4,i,j,k,c) = tmp2 * fjac(3,4,i,j+1) & + & - tmp1 * njac(3,4,i,j+1) + lhsc(3,5,i,j,k,c) = tmp2 * fjac(3,5,i,j+1) & + & - tmp1 * njac(3,5,i,j+1) + + lhsc(4,1,i,j,k,c) = tmp2 * fjac(4,1,i,j+1) & + & - tmp1 * njac(4,1,i,j+1) + lhsc(4,2,i,j,k,c) = tmp2 * fjac(4,2,i,j+1) & + & - tmp1 * njac(4,2,i,j+1) + lhsc(4,3,i,j,k,c) = tmp2 * fjac(4,3,i,j+1) & + & - tmp1 * njac(4,3,i,j+1) + lhsc(4,4,i,j,k,c) = tmp2 * fjac(4,4,i,j+1) & + & - tmp1 * njac(4,4,i,j+1) & + & - tmp1 * dy4 + lhsc(4,5,i,j,k,c) = tmp2 * fjac(4,5,i,j+1) & + & - tmp1 * njac(4,5,i,j+1) + + lhsc(5,1,i,j,k,c) = tmp2 * fjac(5,1,i,j+1) & + & - tmp1 * njac(5,1,i,j+1) + lhsc(5,2,i,j,k,c) = tmp2 * fjac(5,2,i,j+1) & + & - tmp1 * njac(5,2,i,j+1) + lhsc(5,3,i,j,k,c) = tmp2 * fjac(5,3,i,j+1) & + & - tmp1 * njac(5,3,i,j+1) + lhsc(5,4,i,j,k,c) = tmp2 * fjac(5,4,i,j+1) & + & - tmp1 * njac(5,4,i,j+1) + lhsc(5,5,i,j,k,c) = tmp2 * fjac(5,5,i,j+1) & + & - tmp1 * njac(5,5,i,j+1) & + & - tmp1 * dy5 + + enddo + enddo + + +!--------------------------------------------------------------------- +! outer most do loops - sweeping in i direction +!--------------------------------------------------------------------- + if (first .eq. 1) then + +!--------------------------------------------------------------------- +! multiply c(i,jstart,k) by b_inverse and copy back to ! +! multiply rhs(jstart) by b_inverse(jstart) and copy to rhs +!--------------------------------------------------------------------- +!dir$ ivdep + do i=start(1,c),isize + call binvcrhs( lhsb(1,1,i,jstart), & + & lhsc(1,1,i,jstart,k,c), & + & rhs(1,i,jstart,k,c) ) + enddo + + endif + +!--------------------------------------------------------------------- +! begin inner most do loop +! do all the elements of the cell unless last +!--------------------------------------------------------------------- + do j=jstart+first,jsize-last +!dir$ ivdep + do i=start(1,c),isize + +!--------------------------------------------------------------------- +! subtract A*lhs_vector(j-1) from lhs_vector(j) +! +! rhs(j) = rhs(j) - A*rhs(j-1) +!--------------------------------------------------------------------- + call matvec_sub(lhsa(1,1,i,j), & + & rhs(1,i,j-1,k,c),rhs(1,i,j,k,c)) + +!--------------------------------------------------------------------- +! B(j) = B(j) - C(j-1)*A(j) +!--------------------------------------------------------------------- + call matmul_sub(lhsa(1,1,i,j), & + & lhsc(1,1,i,j-1,k,c), & + & lhsb(1,1,i,j)) + +!--------------------------------------------------------------------- +! multiply c(i,j,k) by b_inverse and copy back to ! +! multiply rhs(i,1,k) by b_inverse(i,1,k) and copy to rhs +!--------------------------------------------------------------------- + call binvcrhs( lhsb(1,1,i,j), & + & lhsc(1,1,i,j,k,c), & + & rhs(1,i,j,k,c) ) + + enddo + enddo + +!--------------------------------------------------------------------- +! Now finish up special cases for last cell +!--------------------------------------------------------------------- + if (last .eq. 1) then + +!dir$ ivdep + do i=start(1,c),isize +!--------------------------------------------------------------------- +! rhs(jsize) = rhs(jsize) - A*rhs(jsize-1) +!--------------------------------------------------------------------- + call matvec_sub(lhsa(1,1,i,jsize), & + & rhs(1,i,jsize-1,k,c),rhs(1,i,jsize,k,c)) + +!--------------------------------------------------------------------- +! B(jsize) = B(jsize) - C(jsize-1)*A(jsize) +! call matmul_sub(aa,i,jsize,k,c, +! $ cc,i,jsize-1,k,c,bb,i,jsize,k,c) +!--------------------------------------------------------------------- + call matmul_sub(lhsa(1,1,i,jsize), & + & lhsc(1,1,i,jsize-1,k,c), & + & lhsb(1,1,i,jsize)) + +!--------------------------------------------------------------------- +! multiply rhs(jsize) by b_inverse(jsize) and copy to rhs +!--------------------------------------------------------------------- + call binvrhs( lhsb(1,1,i,jsize), & + & rhs(1,i,jsize,k,c) ) + enddo + + endif + enddo + + + return + end + + + diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/z_solve.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/z_solve.f90 new file mode 100644 index 000000000..ccbba0147 --- /dev/null +++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/z_solve.f90 @@ -0,0 +1,802 @@ +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine z_solve + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + +!--------------------------------------------------------------------- +! Performs line solves in Z direction by first factoring +! the block-tridiagonal matrix into an upper triangular matrix, +! and then performing back substitution to solve for the unknow +! vectors of each line. +! +! Make sure we treat elements zero to cell_size in the direction +! of the sweep. +!--------------------------------------------------------------------- + + use bt_data + use mpinpb + + implicit none + + integer c, kstart, stage, & + & first, last, recv_id, error, r_status(MPI_STATUS_SIZE), & + & isize,jsize,ksize,send_id + + kstart = 0 + + if (timeron) call timer_start(t_zsolve) +!--------------------------------------------------------------------- +! in our terminology stage is the number of the cell in the y-direction +! i.e. stage = 1 means the start of the line stage=ncells means end +!--------------------------------------------------------------------- + do stage = 1,ncells + c = slice(3,stage) + isize = cell_size(1,c) - 1 + jsize = cell_size(2,c) - 1 + ksize = cell_size(3,c) - 1 +!--------------------------------------------------------------------- +! set last-cell flag +!--------------------------------------------------------------------- + if (stage .eq. ncells) then + last = 1 + else + last = 0 + endif + + if (stage .eq. 1) then +!--------------------------------------------------------------------- +! This is the first cell, so solve without receiving data +!--------------------------------------------------------------------- + first = 1 +! call lhsz(c) + call z_solve_cell(first,last,c) + else +!--------------------------------------------------------------------- +! Not the first cell of this line, so receive info from +! processor working on preceeding cell +!--------------------------------------------------------------------- + first = 0 + if (timeron) call timer_start(t_zcomm) + call z_receive_solve_info(recv_id,c) +!--------------------------------------------------------------------- +! overlap computations and communications +!--------------------------------------------------------------------- +! call lhsz(c) +!--------------------------------------------------------------------- +! wait for completion +!--------------------------------------------------------------------- + call mpi_wait(send_id,r_status,error) + call mpi_wait(recv_id,r_status,error) + if (timeron) call timer_stop(t_zcomm) +!--------------------------------------------------------------------- +! install C'(kstart+1) and rhs'(kstart+1) to be used in this cell +!--------------------------------------------------------------------- + call z_unpack_solve_info(c) + call z_solve_cell(first,last,c) + endif + + if (last .eq. 0) call z_send_solve_info(send_id,c) + enddo + +!--------------------------------------------------------------------- +! now perform backsubstitution in reverse direction +!--------------------------------------------------------------------- + do stage = ncells, 1, -1 + c = slice(3,stage) + first = 0 + last = 0 + if (stage .eq. 1) first = 1 + if (stage .eq. ncells) then + last = 1 +!--------------------------------------------------------------------- +! last cell, so perform back substitute without waiting +!--------------------------------------------------------------------- + call z_backsubstitute(first, last,c) + else + if (timeron) call timer_start(t_zcomm) + call z_receive_backsub_info(recv_id,c) + call mpi_wait(send_id,r_status,error) + call mpi_wait(recv_id,r_status,error) + if (timeron) call timer_stop(t_zcomm) + call z_unpack_backsub_info(c) + call z_backsubstitute(first,last,c) + endif + if (first .eq. 0) call z_send_backsub_info(send_id,c) + enddo + + if (timeron) call timer_stop(t_zsolve) + + return + end + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine z_unpack_solve_info(c) +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + +!--------------------------------------------------------------------- +! unpack C'(-1) and rhs'(-1) for +! all i and j +!--------------------------------------------------------------------- + + use bt_data + implicit none + + integer i,j,m,n,ptr,c,kstart + + kstart = 0 + ptr = 0 + do j=0,JMAX-1 + do i=0,IMAX-1 + do m=1,BLOCK_SIZE + do n=1,BLOCK_SIZE + lhsc(m,n,i,j,kstart-1,c) = out_buffer(ptr+n) + enddo + ptr = ptr+BLOCK_SIZE + enddo + do n=1,BLOCK_SIZE + rhs(n,i,j,kstart-1,c) = out_buffer(ptr+n) + enddo + ptr = ptr+BLOCK_SIZE + enddo + enddo + + return + end + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine z_send_solve_info(send_id,c) + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + +!--------------------------------------------------------------------- +! pack up and send C'(kend) and rhs'(kend) for +! all i and j +!--------------------------------------------------------------------- + + use bt_data + use mpinpb + + implicit none + + integer i,j,m,n,ksize,ptr,c,ip,jp + integer error,send_id,buffer_size + + ksize = cell_size(3,c)-1 + ip = cell_coord(1,c) - 1 + jp = cell_coord(2,c) - 1 + buffer_size=MAX_CELL_DIM*MAX_CELL_DIM* & + & (BLOCK_SIZE*BLOCK_SIZE + BLOCK_SIZE) + +!--------------------------------------------------------------------- +! pack up buffer +!--------------------------------------------------------------------- + ptr = 0 + do j=0,JMAX-1 + do i=0,IMAX-1 + do m=1,BLOCK_SIZE + do n=1,BLOCK_SIZE + in_buffer(ptr+n) = lhsc(m,n,i,j,ksize,c) + enddo + ptr = ptr+BLOCK_SIZE + enddo + do n=1,BLOCK_SIZE + in_buffer(ptr+n) = rhs(n,i,j,ksize,c) + enddo + ptr = ptr+BLOCK_SIZE + enddo + enddo + +!--------------------------------------------------------------------- +! send buffer +!--------------------------------------------------------------------- + if (timeron) call timer_start(t_zcomm) + call mpi_isend(in_buffer, buffer_size, & + & dp_type, successor(3), & + & BOTTOM+ip+jp*NCELLS, comm_solve, & + & send_id,error) + if (timeron) call timer_stop(t_zcomm) + + return + end + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine z_send_backsub_info(send_id,c) + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + +!--------------------------------------------------------------------- +! pack up and send U(jstart) for all i and j +!--------------------------------------------------------------------- + + use bt_data + use mpinpb + + implicit none + + integer i,j,n,ptr,c,kstart,ip,jp + integer error,send_id,buffer_size + +!--------------------------------------------------------------------- +! Send element 0 to previous processor +!--------------------------------------------------------------------- + kstart = 0 + ip = cell_coord(1,c)-1 + jp = cell_coord(2,c)-1 + buffer_size=MAX_CELL_DIM*MAX_CELL_DIM*BLOCK_SIZE + ptr = 0 + do j=0,JMAX-1 + do i=0,IMAX-1 + do n=1,BLOCK_SIZE + in_buffer(ptr+n) = rhs(n,i,j,kstart,c) + enddo + ptr = ptr+BLOCK_SIZE + enddo + enddo + + if (timeron) call timer_start(t_zcomm) + call mpi_isend(in_buffer, buffer_size, & + & dp_type, predecessor(3), & + & TOP+ip+jp*NCELLS, comm_solve, & + & send_id,error) + if (timeron) call timer_stop(t_zcomm) + + return + end + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine z_unpack_backsub_info(c) + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + +!--------------------------------------------------------------------- +! unpack U(ksize) for all i and j +!--------------------------------------------------------------------- + + use bt_data + implicit none + + integer i,j,n,ptr,c + + ptr = 0 + do j=0,JMAX-1 + do i=0,IMAX-1 + do n=1,BLOCK_SIZE + backsub_info(n,i,j,c) = out_buffer(ptr+n) + enddo + ptr = ptr+BLOCK_SIZE + enddo + enddo + + return + end + + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine z_receive_backsub_info(recv_id,c) + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + +!--------------------------------------------------------------------- +! post mpi receives +!--------------------------------------------------------------------- + + use bt_data + use mpinpb + + implicit none + + integer error,recv_id,ip,jp,c,buffer_size + + ip = cell_coord(1,c) - 1 + jp = cell_coord(2,c) - 1 + buffer_size=MAX_CELL_DIM*MAX_CELL_DIM*BLOCK_SIZE + call mpi_irecv(out_buffer, buffer_size, & + & dp_type, successor(3), & + & TOP+ip+jp*NCELLS, comm_solve, & + & recv_id, error) + + return + end + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine z_receive_solve_info(recv_id,c) + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + +!--------------------------------------------------------------------- +! post mpi receives +!--------------------------------------------------------------------- + + use bt_data + use mpinpb + + implicit none + + integer ip,jp,recv_id,error,c,buffer_size + + ip = cell_coord(1,c) - 1 + jp = cell_coord(2,c) - 1 + buffer_size=MAX_CELL_DIM*MAX_CELL_DIM* & + & (BLOCK_SIZE*BLOCK_SIZE + BLOCK_SIZE) + call mpi_irecv(out_buffer, buffer_size, & + & dp_type, predecessor(3), & + & BOTTOM+ip+jp*NCELLS, comm_solve, & + & recv_id, error) + + return + end + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine z_backsubstitute(first, last, c) + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + +!--------------------------------------------------------------------- +! back solve: if last cell, then generate U(ksize)=rhs(ksize) +! else assume U(ksize) is loaded in un pack backsub_info +! so just use it +! after call u(kstart) will be sent to next cell +!--------------------------------------------------------------------- + + use bt_data + implicit none + + integer first, last, c, i, k + integer m,n,j,jsize,isize,ksize,kstart + + kstart = 0 + isize = cell_size(1,c)-end(1,c)-1 + jsize = cell_size(2,c)-end(2,c)-1 + ksize = cell_size(3,c)-1 + if (last .eq. 0) then + do j=start(2,c),jsize + do i=start(1,c),isize +!--------------------------------------------------------------------- +! U(jsize) uses info from previous cell if not last cell +!--------------------------------------------------------------------- + do m=1,BLOCK_SIZE + do n=1,BLOCK_SIZE + rhs(m,i,j,ksize,c) = rhs(m,i,j,ksize,c) & + & - lhsc(m,n,i,j,ksize,c)* & + & backsub_info(n,i,j,c) + enddo + enddo + enddo + enddo + endif + do k=ksize-1,kstart,-1 + do j=start(2,c),jsize + do i=start(1,c),isize + do m=1,BLOCK_SIZE + do n=1,BLOCK_SIZE + rhs(m,i,j,k,c) = rhs(m,i,j,k,c) & + & - lhsc(m,n,i,j,k,c)*rhs(n,i,j,k+1,c) + enddo + enddo + enddo + enddo + enddo + + return + end + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine z_solve_cell(first,last,c) + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + +!--------------------------------------------------------------------- +! performs guaussian elimination on this cell. +! +! assumes that unpacking routines for non-first cells +! preload C' and rhs' from previous cell. +! +! assumed send happens outside this routine, but that +! c'(KMAX) and rhs'(KMAX) will be sent to next cell. +!--------------------------------------------------------------------- + + use bt_data + implicit none + + double precision tmp1, tmp2, tmp3 + integer first,last,c + integer i,j,k,isize,ksize,jsize,kstart + double precision utmp(6,-2:KMAX+1) + + kstart = 0 + isize = cell_size(1,c)-end(1,c)-1 + jsize = cell_size(2,c)-end(2,c)-1 + ksize = cell_size(3,c)-1 + + call lhsabinit(lhsa, lhsb, ksize) + + do j=start(2,c),jsize + do i=start(1,c),isize + +!--------------------------------------------------------------------- +! This function computes the left hand side for the three z-factors +!--------------------------------------------------------------------- + +!--------------------------------------------------------------------- +! Compute the indices for storing the block-diagonal matrix; +! determine c (labeled f) and s jacobians for cell c +!--------------------------------------------------------------------- + do k = start(3,c)-1, cell_size(3,c)-end(3,c) + utmp(1,k) = 1.0d0 / u(1,i,j,k,c) + utmp(2,k) = u(2,i,j,k,c) + utmp(3,k) = u(3,i,j,k,c) + utmp(4,k) = u(4,i,j,k,c) + utmp(5,k) = u(5,i,j,k,c) + utmp(6,k) = qs(i,j,k,c) + end do + + do k = start(3,c)-1, cell_size(3,c)-end(3,c) + + tmp1 = utmp(1,k) + tmp2 = tmp1 * tmp1 + tmp3 = tmp1 * tmp2 + + fjac(1,1,k) = 0.0d+00 + fjac(1,2,k) = 0.0d+00 + fjac(1,3,k) = 0.0d+00 + fjac(1,4,k) = 1.0d+00 + fjac(1,5,k) = 0.0d+00 + + fjac(2,1,k) = - ( utmp(2,k)*utmp(4,k) ) & + & * tmp2 + fjac(2,2,k) = utmp(4,k) * tmp1 + fjac(2,3,k) = 0.0d+00 + fjac(2,4,k) = utmp(2,k) * tmp1 + fjac(2,5,k) = 0.0d+00 + + fjac(3,1,k) = - ( utmp(3,k)*utmp(4,k) ) & + & * tmp2 + fjac(3,2,k) = 0.0d+00 + fjac(3,3,k) = utmp(4,k) * tmp1 + fjac(3,4,k) = utmp(3,k) * tmp1 + fjac(3,5,k) = 0.0d+00 + + fjac(4,1,k) = - (utmp(4,k)*utmp(4,k) * tmp2 ) & + & + c2 * utmp(6,k) + fjac(4,2,k) = - c2 * utmp(2,k) * tmp1 + fjac(4,3,k) = - c2 * utmp(3,k) * tmp1 + fjac(4,4,k) = ( 2.0d+00 - c2 ) & + & * utmp(4,k) * tmp1 + fjac(4,5,k) = c2 + + fjac(5,1,k) = ( c2 * 2.0d0 * utmp(6,k) & + & - c1 * ( utmp(5,k) * tmp1 ) ) & + & * ( utmp(4,k) * tmp1 ) + fjac(5,2,k) = - c2 * ( utmp(2,k)*utmp(4,k) ) & + & * tmp2 + fjac(5,3,k) = - c2 * ( utmp(3,k)*utmp(4,k) ) & + & * tmp2 + fjac(5,4,k) = c1 * ( utmp(5,k) * tmp1 ) & + & - c2 * ( utmp(6,k) & + & + utmp(4,k)*utmp(4,k) * tmp2 ) + fjac(5,5,k) = c1 * utmp(4,k) * tmp1 + + njac(1,1,k) = 0.0d+00 + njac(1,2,k) = 0.0d+00 + njac(1,3,k) = 0.0d+00 + njac(1,4,k) = 0.0d+00 + njac(1,5,k) = 0.0d+00 + + njac(2,1,k) = - c3c4 * tmp2 * utmp(2,k) + njac(2,2,k) = c3c4 * tmp1 + njac(2,3,k) = 0.0d+00 + njac(2,4,k) = 0.0d+00 + njac(2,5,k) = 0.0d+00 + + njac(3,1,k) = - c3c4 * tmp2 * utmp(3,k) + njac(3,2,k) = 0.0d+00 + njac(3,3,k) = c3c4 * tmp1 + njac(3,4,k) = 0.0d+00 + njac(3,5,k) = 0.0d+00 + + njac(4,1,k) = - con43 * c3c4 * tmp2 * utmp(4,k) + njac(4,2,k) = 0.0d+00 + njac(4,3,k) = 0.0d+00 + njac(4,4,k) = con43 * c3 * c4 * tmp1 + njac(4,5,k) = 0.0d+00 + + njac(5,1,k) = - ( c3c4 & + & - c1345 ) * tmp3 * (utmp(2,k)**2) & + & - ( c3c4 - c1345 ) * tmp3 * (utmp(3,k)**2) & + & - ( con43 * c3c4 & + & - c1345 ) * tmp3 * (utmp(4,k)**2) & + & - c1345 * tmp2 * utmp(5,k) + + njac(5,2,k) = ( c3c4 - c1345 ) * tmp2 * utmp(2,k) + njac(5,3,k) = ( c3c4 - c1345 ) * tmp2 * utmp(3,k) + njac(5,4,k) = ( con43 * c3c4 & + & - c1345 ) * tmp2 * utmp(4,k) + njac(5,5,k) = ( c1345 )* tmp1 + + + enddo + +!--------------------------------------------------------------------- +! now joacobians set, so form left hand side in z direction +!--------------------------------------------------------------------- + do k = start(3,c), ksize-end(3,c) + + tmp1 = dt * tz1 + tmp2 = dt * tz2 + + lhsa(1,1,k) = - tmp2 * fjac(1,1,k-1) & + & - tmp1 * njac(1,1,k-1) & + & - tmp1 * dz1 + lhsa(1,2,k) = - tmp2 * fjac(1,2,k-1) & + & - tmp1 * njac(1,2,k-1) + lhsa(1,3,k) = - tmp2 * fjac(1,3,k-1) & + & - tmp1 * njac(1,3,k-1) + lhsa(1,4,k) = - tmp2 * fjac(1,4,k-1) & + & - tmp1 * njac(1,4,k-1) + lhsa(1,5,k) = - tmp2 * fjac(1,5,k-1) & + & - tmp1 * njac(1,5,k-1) + + lhsa(2,1,k) = - tmp2 * fjac(2,1,k-1) & + & - tmp1 * njac(2,1,k-1) + lhsa(2,2,k) = - tmp2 * fjac(2,2,k-1) & + & - tmp1 * njac(2,2,k-1) & + & - tmp1 * dz2 + lhsa(2,3,k) = - tmp2 * fjac(2,3,k-1) & + & - tmp1 * njac(2,3,k-1) + lhsa(2,4,k) = - tmp2 * fjac(2,4,k-1) & + & - tmp1 * njac(2,4,k-1) + lhsa(2,5,k) = - tmp2 * fjac(2,5,k-1) & + & - tmp1 * njac(2,5,k-1) + + lhsa(3,1,k) = - tmp2 * fjac(3,1,k-1) & + & - tmp1 * njac(3,1,k-1) + lhsa(3,2,k) = - tmp2 * fjac(3,2,k-1) & + & - tmp1 * njac(3,2,k-1) + lhsa(3,3,k) = - tmp2 * fjac(3,3,k-1) & + & - tmp1 * njac(3,3,k-1) & + & - tmp1 * dz3 + lhsa(3,4,k) = - tmp2 * fjac(3,4,k-1) & + & - tmp1 * njac(3,4,k-1) + lhsa(3,5,k) = - tmp2 * fjac(3,5,k-1) & + & - tmp1 * njac(3,5,k-1) + + lhsa(4,1,k) = - tmp2 * fjac(4,1,k-1) & + & - tmp1 * njac(4,1,k-1) + lhsa(4,2,k) = - tmp2 * fjac(4,2,k-1) & + & - tmp1 * njac(4,2,k-1) + lhsa(4,3,k) = - tmp2 * fjac(4,3,k-1) & + & - tmp1 * njac(4,3,k-1) + lhsa(4,4,k) = - tmp2 * fjac(4,4,k-1) & + & - tmp1 * njac(4,4,k-1) & + & - tmp1 * dz4 + lhsa(4,5,k) = - tmp2 * fjac(4,5,k-1) & + & - tmp1 * njac(4,5,k-1) + + lhsa(5,1,k) = - tmp2 * fjac(5,1,k-1) & + & - tmp1 * njac(5,1,k-1) + lhsa(5,2,k) = - tmp2 * fjac(5,2,k-1) & + & - tmp1 * njac(5,2,k-1) + lhsa(5,3,k) = - tmp2 * fjac(5,3,k-1) & + & - tmp1 * njac(5,3,k-1) + lhsa(5,4,k) = - tmp2 * fjac(5,4,k-1) & + & - tmp1 * njac(5,4,k-1) + lhsa(5,5,k) = - tmp2 * fjac(5,5,k-1) & + & - tmp1 * njac(5,5,k-1) & + & - tmp1 * dz5 + + lhsb(1,1,k) = 1.0d+00 & + & + tmp1 * 2.0d+00 * njac(1,1,k) & + & + tmp1 * 2.0d+00 * dz1 + lhsb(1,2,k) = tmp1 * 2.0d+00 * njac(1,2,k) + lhsb(1,3,k) = tmp1 * 2.0d+00 * njac(1,3,k) + lhsb(1,4,k) = tmp1 * 2.0d+00 * njac(1,4,k) + lhsb(1,5,k) = tmp1 * 2.0d+00 * njac(1,5,k) + + lhsb(2,1,k) = tmp1 * 2.0d+00 * njac(2,1,k) + lhsb(2,2,k) = 1.0d+00 & + & + tmp1 * 2.0d+00 * njac(2,2,k) & + & + tmp1 * 2.0d+00 * dz2 + lhsb(2,3,k) = tmp1 * 2.0d+00 * njac(2,3,k) + lhsb(2,4,k) = tmp1 * 2.0d+00 * njac(2,4,k) + lhsb(2,5,k) = tmp1 * 2.0d+00 * njac(2,5,k) + + lhsb(3,1,k) = tmp1 * 2.0d+00 * njac(3,1,k) + lhsb(3,2,k) = tmp1 * 2.0d+00 * njac(3,2,k) + lhsb(3,3,k) = 1.0d+00 & + & + tmp1 * 2.0d+00 * njac(3,3,k) & + & + tmp1 * 2.0d+00 * dz3 + lhsb(3,4,k) = tmp1 * 2.0d+00 * njac(3,4,k) + lhsb(3,5,k) = tmp1 * 2.0d+00 * njac(3,5,k) + + lhsb(4,1,k) = tmp1 * 2.0d+00 * njac(4,1,k) + lhsb(4,2,k) = tmp1 * 2.0d+00 * njac(4,2,k) + lhsb(4,3,k) = tmp1 * 2.0d+00 * njac(4,3,k) + lhsb(4,4,k) = 1.0d+00 & + & + tmp1 * 2.0d+00 * njac(4,4,k) & + & + tmp1 * 2.0d+00 * dz4 + lhsb(4,5,k) = tmp1 * 2.0d+00 * njac(4,5,k) + + lhsb(5,1,k) = tmp1 * 2.0d+00 * njac(5,1,k) + lhsb(5,2,k) = tmp1 * 2.0d+00 * njac(5,2,k) + lhsb(5,3,k) = tmp1 * 2.0d+00 * njac(5,3,k) + lhsb(5,4,k) = tmp1 * 2.0d+00 * njac(5,4,k) + lhsb(5,5,k) = 1.0d+00 & + & + tmp1 * 2.0d+00 * njac(5,5,k) & + & + tmp1 * 2.0d+00 * dz5 + + lhsc(1,1,i,j,k,c) = tmp2 * fjac(1,1,k+1) & + & - tmp1 * njac(1,1,k+1) & + & - tmp1 * dz1 + lhsc(1,2,i,j,k,c) = tmp2 * fjac(1,2,k+1) & + & - tmp1 * njac(1,2,k+1) + lhsc(1,3,i,j,k,c) = tmp2 * fjac(1,3,k+1) & + & - tmp1 * njac(1,3,k+1) + lhsc(1,4,i,j,k,c) = tmp2 * fjac(1,4,k+1) & + & - tmp1 * njac(1,4,k+1) + lhsc(1,5,i,j,k,c) = tmp2 * fjac(1,5,k+1) & + & - tmp1 * njac(1,5,k+1) + + lhsc(2,1,i,j,k,c) = tmp2 * fjac(2,1,k+1) & + & - tmp1 * njac(2,1,k+1) + lhsc(2,2,i,j,k,c) = tmp2 * fjac(2,2,k+1) & + & - tmp1 * njac(2,2,k+1) & + & - tmp1 * dz2 + lhsc(2,3,i,j,k,c) = tmp2 * fjac(2,3,k+1) & + & - tmp1 * njac(2,3,k+1) + lhsc(2,4,i,j,k,c) = tmp2 * fjac(2,4,k+1) & + & - tmp1 * njac(2,4,k+1) + lhsc(2,5,i,j,k,c) = tmp2 * fjac(2,5,k+1) & + & - tmp1 * njac(2,5,k+1) + + lhsc(3,1,i,j,k,c) = tmp2 * fjac(3,1,k+1) & + & - tmp1 * njac(3,1,k+1) + lhsc(3,2,i,j,k,c) = tmp2 * fjac(3,2,k+1) & + & - tmp1 * njac(3,2,k+1) + lhsc(3,3,i,j,k,c) = tmp2 * fjac(3,3,k+1) & + & - tmp1 * njac(3,3,k+1) & + & - tmp1 * dz3 + lhsc(3,4,i,j,k,c) = tmp2 * fjac(3,4,k+1) & + & - tmp1 * njac(3,4,k+1) + lhsc(3,5,i,j,k,c) = tmp2 * fjac(3,5,k+1) & + & - tmp1 * njac(3,5,k+1) + + lhsc(4,1,i,j,k,c) = tmp2 * fjac(4,1,k+1) & + & - tmp1 * njac(4,1,k+1) + lhsc(4,2,i,j,k,c) = tmp2 * fjac(4,2,k+1) & + & - tmp1 * njac(4,2,k+1) + lhsc(4,3,i,j,k,c) = tmp2 * fjac(4,3,k+1) & + & - tmp1 * njac(4,3,k+1) + lhsc(4,4,i,j,k,c) = tmp2 * fjac(4,4,k+1) & + & - tmp1 * njac(4,4,k+1) & + & - tmp1 * dz4 + lhsc(4,5,i,j,k,c) = tmp2 * fjac(4,5,k+1) & + & - tmp1 * njac(4,5,k+1) + + lhsc(5,1,i,j,k,c) = tmp2 * fjac(5,1,k+1) & + & - tmp1 * njac(5,1,k+1) + lhsc(5,2,i,j,k,c) = tmp2 * fjac(5,2,k+1) & + & - tmp1 * njac(5,2,k+1) + lhsc(5,3,i,j,k,c) = tmp2 * fjac(5,3,k+1) & + & - tmp1 * njac(5,3,k+1) + lhsc(5,4,i,j,k,c) = tmp2 * fjac(5,4,k+1) & + & - tmp1 * njac(5,4,k+1) + lhsc(5,5,i,j,k,c) = tmp2 * fjac(5,5,k+1) & + & - tmp1 * njac(5,5,k+1) & + & - tmp1 * dz5 + + enddo + + +!--------------------------------------------------------------------- +! outer most do loops - sweeping in i direction +!--------------------------------------------------------------------- + if (first .eq. 1) then + +!--------------------------------------------------------------------- +! multiply c(i,j,kstart) by b_inverse and copy back to c +! multiply rhs(kstart) by b_inverse(kstart) and copy to rhs +!--------------------------------------------------------------------- + call binvcrhs( lhsb(1,1,kstart), & + & lhsc(1,1,i,j,kstart,c), & + & rhs(1,i,j,kstart,c) ) + + endif + +!--------------------------------------------------------------------- +! begin inner most do loop +! do all the elements of the cell unless last +!--------------------------------------------------------------------- + do k=kstart+first,ksize-last + +!--------------------------------------------------------------------- +! subtract A*lhs_vector(k-1) from lhs_vector(k) +! +! rhs(k) = rhs(k) - A*rhs(k-1) +!--------------------------------------------------------------------- + call matvec_sub(lhsa(1,1,k), & + & rhs(1,i,j,k-1,c),rhs(1,i,j,k,c)) + +!--------------------------------------------------------------------- +! B(k) = B(k) - C(k-1)*A(k) +! call matmul_sub(aa,i,j,k,c,cc,i,j,k-1,c,bb,i,j,k,c) +!--------------------------------------------------------------------- + call matmul_sub(lhsa(1,1,k), & + & lhsc(1,1,i,j,k-1,c), & + & lhsb(1,1,k)) + +!--------------------------------------------------------------------- +! multiply c(i,j,k) by b_inverse and copy back to c +! multiply rhs(i,j,1) by b_inverse(i,j,1) and copy to rhs +!--------------------------------------------------------------------- + call binvcrhs( lhsb(1,1,k), & + & lhsc(1,1,i,j,k,c), & + & rhs(1,i,j,k,c) ) + + enddo + +!--------------------------------------------------------------------- +! Now finish up special cases for last cell +!--------------------------------------------------------------------- + if (last .eq. 1) then + +!--------------------------------------------------------------------- +! rhs(ksize) = rhs(ksize) - A*rhs(ksize-1) +!--------------------------------------------------------------------- + call matvec_sub(lhsa(1,1,ksize), & + & rhs(1,i,j,ksize-1,c),rhs(1,i,j,ksize,c)) + +!--------------------------------------------------------------------- +! B(ksize) = B(ksize) - C(ksize-1)*A(ksize) +! call matmul_sub(aa,i,j,ksize,c, +! $ cc,i,j,ksize-1,c,bb,i,j,ksize,c) +!--------------------------------------------------------------------- + call matmul_sub(lhsa(1,1,ksize), & + & lhsc(1,1,i,j,ksize-1,c), & + & lhsb(1,1,ksize)) + +!--------------------------------------------------------------------- +! multiply rhs(ksize) by b_inverse(ksize) and copy to rhs +!--------------------------------------------------------------------- + call binvrhs( lhsb(1,1,ksize), & + & rhs(1,i,j,ksize,c) ) + + endif + enddo + enddo + + + return + end + + + + + + diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/z_solve_vec.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/z_solve_vec.f90 new file mode 100644 index 000000000..2491969a5 --- /dev/null +++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/BT/z_solve_vec.f90 @@ -0,0 +1,817 @@ +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine z_solve + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + +!--------------------------------------------------------------------- +! Performs line solves in Z direction by first factoring +! the block-tridiagonal matrix into an upper triangular matrix, +! and then performing back substitution to solve for the unknow +! vectors of each line. +! +! Make sure we treat elements zero to cell_size in the direction +! of the sweep. +!--------------------------------------------------------------------- + + use bt_data + use mpinpb + + implicit none + + integer c, kstart, stage, & + & first, last, recv_id, error, r_status(MPI_STATUS_SIZE), & + & isize,jsize,ksize,send_id + + kstart = 0 + + if (timeron) call timer_start(t_zsolve) +!--------------------------------------------------------------------- +! in our terminology stage is the number of the cell in the y-direct +! i.e. stage = 1 means the start of the line stage=ncells means end +!--------------------------------------------------------------------- + do stage = 1,ncells + c = slice(3,stage) + isize = cell_size(1,c) - 1 + jsize = cell_size(2,c) - 1 + ksize = cell_size(3,c) - 1 +!--------------------------------------------------------------------- +! set last-cell flag +!--------------------------------------------------------------------- + if (stage .eq. ncells) then + last = 1 + else + last = 0 + endif + + if (stage .eq. 1) then +!--------------------------------------------------------------------- +! This is the first cell, so solve without receiving data +!--------------------------------------------------------------------- + first = 1 +! call lhsz(c) + call z_solve_cell(first,last,c) + else +!--------------------------------------------------------------------- +! Not the first cell of this line, so receive info from +! processor working on preceeding cell +!--------------------------------------------------------------------- + first = 0 + if (timeron) call timer_start(t_zcomm) + call z_receive_solve_info(recv_id,c) +!--------------------------------------------------------------------- +! overlap computations and communications +!--------------------------------------------------------------------- +! call lhsz(c) +!--------------------------------------------------------------------- +! wait for completion +!--------------------------------------------------------------------- + call mpi_wait(send_id,r_status,error) + call mpi_wait(recv_id,r_status,error) + if (timeron) call timer_stop(t_zcomm) +!--------------------------------------------------------------------- +! install C'(kstart+1) and rhs'(kstart+1) to be used in this cell +!--------------------------------------------------------------------- + call z_unpack_solve_info(c) + call z_solve_cell(first,last,c) + endif + + if (last .eq. 0) call z_send_solve_info(send_id,c) + enddo + +!--------------------------------------------------------------------- +! now perform backsubstitution in reverse direction +!--------------------------------------------------------------------- + do stage = ncells, 1, -1 + c = slice(3,stage) + first = 0 + last = 0 + if (stage .eq. 1) first = 1 + if (stage .eq. ncells) then + last = 1 +!--------------------------------------------------------------------- +! last cell, so perform back substitute without waiting +!--------------------------------------------------------------------- + call z_backsubstitute(first, last,c) + else + if (timeron) call timer_start(t_zcomm) + call z_receive_backsub_info(recv_id,c) + call mpi_wait(send_id,r_status,error) + call mpi_wait(recv_id,r_status,error) + if (timeron) call timer_stop(t_zcomm) + call z_unpack_backsub_info(c) + call z_backsubstitute(first,last,c) + endif + if (first .eq. 0) call z_send_backsub_info(send_id,c) + enddo + + if (timeron) call timer_stop(t_zsolve) + + return + end + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine z_unpack_solve_info(c) +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + +!--------------------------------------------------------------------- +! unpack C'(-1) and rhs'(-1) for +! all i and j +!--------------------------------------------------------------------- + + use bt_data + implicit none + + integer i,j,m,n,ptr,c,kstart + + kstart = 0 + ptr = 0 + do j=0,JMAX-1 + do i=0,IMAX-1 + do m=1,BLOCK_SIZE + do n=1,BLOCK_SIZE + lhsc(m,n,i,j,kstart-1,c) = out_buffer(ptr+n) + enddo + ptr = ptr+BLOCK_SIZE + enddo + do n=1,BLOCK_SIZE + rhs(n,i,j,kstart-1,c) = out_buffer(ptr+n) + enddo + ptr = ptr+BLOCK_SIZE + enddo + enddo + + return + end + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine z_send_solve_info(send_id,c) + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + +!--------------------------------------------------------------------- +! pack up and send C'(kend) and rhs'(kend) for +! all i and j +!--------------------------------------------------------------------- + + use bt_data + use mpinpb + + implicit none + + integer i,j,m,n,ksize,ptr,c,ip,jp + integer error,send_id,buffer_size + + ksize = cell_size(3,c)-1 + ip = cell_coord(1,c) - 1 + jp = cell_coord(2,c) - 1 + buffer_size=MAX_CELL_DIM*MAX_CELL_DIM* & + & (BLOCK_SIZE*BLOCK_SIZE + BLOCK_SIZE) + +!--------------------------------------------------------------------- +! pack up buffer +!--------------------------------------------------------------------- + ptr = 0 + do j=0,JMAX-1 + do i=0,IMAX-1 + do m=1,BLOCK_SIZE + do n=1,BLOCK_SIZE + in_buffer(ptr+n) = lhsc(m,n,i,j,ksize,c) + enddo + ptr = ptr+BLOCK_SIZE + enddo + do n=1,BLOCK_SIZE + in_buffer(ptr+n) = rhs(n,i,j,ksize,c) + enddo + ptr = ptr+BLOCK_SIZE + enddo + enddo + +!--------------------------------------------------------------------- +! send buffer +!--------------------------------------------------------------------- + if (timeron) call timer_start(t_zcomm) + call mpi_isend(in_buffer, buffer_size, & + & dp_type, successor(3), & + & BOTTOM+ip+jp*NCELLS, comm_solve, & + & send_id,error) + if (timeron) call timer_stop(t_zcomm) + + return + end + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine z_send_backsub_info(send_id,c) + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + +!--------------------------------------------------------------------- +! pack up and send U(jstart) for all i and j +!--------------------------------------------------------------------- + + use bt_data + use mpinpb + + implicit none + + integer i,j,n,ptr,c,kstart,ip,jp + integer error,send_id,buffer_size + +!--------------------------------------------------------------------- +! Send element 0 to previous processor +!--------------------------------------------------------------------- + kstart = 0 + ip = cell_coord(1,c)-1 + jp = cell_coord(2,c)-1 + buffer_size=MAX_CELL_DIM*MAX_CELL_DIM*BLOCK_SIZE + ptr = 0 + do j=0,JMAX-1 + do i=0,IMAX-1 + do n=1,BLOCK_SIZE + in_buffer(ptr+n) = rhs(n,i,j,kstart,c) + enddo + ptr = ptr+BLOCK_SIZE + enddo + enddo + + if (timeron) call timer_start(t_zcomm) + call mpi_isend(in_buffer, buffer_size, & + & dp_type, predecessor(3), & + & TOP+ip+jp*NCELLS, comm_solve, & + & send_id,error) + if (timeron) call timer_stop(t_zcomm) + + return + end + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine z_unpack_backsub_info(c) + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + +!--------------------------------------------------------------------- +! unpack U(ksize) for all i and j +!--------------------------------------------------------------------- + + use bt_data + implicit none + + integer i,j,n,ptr,c + + ptr = 0 + do j=0,JMAX-1 + do i=0,IMAX-1 + do n=1,BLOCK_SIZE + backsub_info(n,i,j,c) = out_buffer(ptr+n) + enddo + ptr = ptr+BLOCK_SIZE + enddo + enddo + + return + end + + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine z_receive_backsub_info(recv_id,c) + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + +!--------------------------------------------------------------------- +! post mpi receives +!--------------------------------------------------------------------- + + use bt_data + use mpinpb + + implicit none + + integer error,recv_id,ip,jp,c,buffer_size + ip = cell_coord(1,c) - 1 + jp = cell_coord(2,c) - 1 + buffer_size=MAX_CELL_DIM*MAX_CELL_DIM*BLOCK_SIZE + call mpi_irecv(out_buffer, buffer_size, & + & dp_type, successor(3), & + & TOP+ip+jp*NCELLS, comm_solve, & + & recv_id, error) + + return + end + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine z_receive_solve_info(recv_id,c) + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + +!--------------------------------------------------------------------- +! post mpi receives +!--------------------------------------------------------------------- + + use bt_data + use mpinpb + + implicit none + + integer ip,jp,recv_id,error,c,buffer_size + ip = cell_coord(1,c) - 1 + jp = cell_coord(2,c) - 1 + buffer_size=MAX_CELL_DIM*MAX_CELL_DIM* & + & (BLOCK_SIZE*BLOCK_SIZE + BLOCK_SIZE) + call mpi_irecv(out_buffer, buffer_size, & + & dp_type, predecessor(3), & + & BOTTOM+ip+jp*NCELLS, comm_solve, & + & recv_id, error) + + return + end + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine z_backsubstitute(first, last, c) + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + +!--------------------------------------------------------------------- +! back solve: if last cell, then generate U(ksize)=rhs(ksize) +! else assume U(ksize) is loaded in un pack backsub_info +! so just use it +! after call u(kstart) will be sent to next cell +!--------------------------------------------------------------------- + + use bt_data + implicit none + + integer first, last, c, i, k + integer m,n,j,jsize,isize,ksize,kstart + + kstart = 0 + isize = cell_size(1,c)-end(1,c)-1 + jsize = cell_size(2,c)-end(2,c)-1 + ksize = cell_size(3,c)-1 + if (last .eq. 0) then + do j=start(2,c),jsize + do i=start(1,c),isize +!--------------------------------------------------------------------- +! U(jsize) uses info from previous cell if not last cell +!--------------------------------------------------------------------- + do m=1,BLOCK_SIZE + do n=1,BLOCK_SIZE + rhs(m,i,j,ksize,c) = rhs(m,i,j,ksize,c) & + & - lhsc(m,n,i,j,ksize,c)* & + & backsub_info(n,i,j,c) + enddo + enddo + enddo + enddo + endif + do k=ksize-1,kstart,-1 + do j=start(2,c),jsize + do i=start(1,c),isize + do m=1,BLOCK_SIZE + do n=1,BLOCK_SIZE + rhs(m,i,j,k,c) = rhs(m,i,j,k,c) & + & - lhsc(m,n,i,j,k,c)*rhs(n,i,j,k+1,c) + enddo + enddo + enddo + enddo + enddo + + return + end + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine z_solve_cell(first,last,c) + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + +!--------------------------------------------------------------------- +! performs guaussian elimination on this cell. +! +! assumes that unpacking routines for non-first cells +! preload C' and rhs' from previous cell. +! +! assumed send happens outside this routine, but that +! c'(KMAX) and rhs'(KMAX) will be sent to next cell. +!--------------------------------------------------------------------- + + use bt_data + implicit none + + double precision tmp1, tmp2, tmp3 + integer first,last,c + integer i,j,k,m,n,isize,ksize,jsize,kstart + + kstart = 0 + isize = cell_size(1,c)-end(1,c)-1 + jsize = cell_size(2,c)-end(2,c)-1 + ksize = cell_size(3,c)-1 + +!--------------------------------------------------------------------- +! zero the left hand side for starters +! set diagonal values to 1. This is overkill, but convenient +!--------------------------------------------------------------------- + do i = 0, isize + do m = 1, 5 + do n = 1, 5 + lhsa(m,n,i,0) = 0.0d0 + lhsb(m,n,i,0) = 0.0d0 + lhsa(m,n,i,ksize) = 0.0d0 + lhsb(m,n,i,ksize) = 0.0d0 + enddo + lhsb(m,m,i,0) = 1.0d0 + lhsb(m,m,i,ksize) = 1.0d0 + enddo + enddo + + do j=start(2,c),jsize + +!--------------------------------------------------------------------- +! This function computes the left hand side for the three z-factors +!--------------------------------------------------------------------- + +!--------------------------------------------------------------------- +! Compute the indices for storing the block-diagonal matrix; +! determine c (labeled f) and s jacobians for cell ! +!--------------------------------------------------------------------- + + do k = start(3,c)-1, cell_size(3,c)-end(3,c) + do i=start(1,c),isize + + tmp1 = 1.0d0 / u(1,i,j,k,c) + tmp2 = tmp1 * tmp1 + tmp3 = tmp1 * tmp2 + + fjac(1,1,i,k) = 0.0d+00 + fjac(1,2,i,k) = 0.0d+00 + fjac(1,3,i,k) = 0.0d+00 + fjac(1,4,i,k) = 1.0d+00 + fjac(1,5,i,k) = 0.0d+00 + + fjac(2,1,i,k) = - ( u(2,i,j,k,c)*u(4,i,j,k,c) ) & + & * tmp2 + fjac(2,2,i,k) = u(4,i,j,k,c) * tmp1 + fjac(2,3,i,k) = 0.0d+00 + fjac(2,4,i,k) = u(2,i,j,k,c) * tmp1 + fjac(2,5,i,k) = 0.0d+00 + + fjac(3,1,i,k) = - ( u(3,i,j,k,c)*u(4,i,j,k,c) ) & + & * tmp2 + fjac(3,2,i,k) = 0.0d+00 + fjac(3,3,i,k) = u(4,i,j,k,c) * tmp1 + fjac(3,4,i,k) = u(3,i,j,k,c) * tmp1 + fjac(3,5,i,k) = 0.0d+00 + + fjac(4,1,i,k) = - (u(4,i,j,k,c)*u(4,i,j,k,c) * tmp2 ) & + & + c2 * qs(i,j,k,c) + fjac(4,2,i,k) = - c2 * u(2,i,j,k,c) * tmp1 + fjac(4,3,i,k) = - c2 * u(3,i,j,k,c) * tmp1 + fjac(4,4,i,k) = ( 2.0d+00 - c2 ) & + & * u(4,i,j,k,c) * tmp1 + fjac(4,5,i,k) = c2 + + fjac(5,1,i,k) = ( c2 * 2.0d0 * qs(i,j,k,c) & + & - c1 * ( u(5,i,j,k,c) * tmp1 ) ) & + & * ( u(4,i,j,k,c) * tmp1 ) + fjac(5,2,i,k) = - c2 * ( u(2,i,j,k,c)*u(4,i,j,k,c) ) & + & * tmp2 + fjac(5,3,i,k) = - c2 * ( u(3,i,j,k,c)*u(4,i,j,k,c) ) & + & * tmp2 + fjac(5,4,i,k) = c1 * ( u(5,i,j,k,c) * tmp1 ) & + & - c2 * ( qs(i,j,k,c) & + & + u(4,i,j,k,c)*u(4,i,j,k,c) * tmp2 ) + fjac(5,5,i,k) = c1 * u(4,i,j,k,c) * tmp1 + + njac(1,1,i,k) = 0.0d+00 + njac(1,2,i,k) = 0.0d+00 + njac(1,3,i,k) = 0.0d+00 + njac(1,4,i,k) = 0.0d+00 + njac(1,5,i,k) = 0.0d+00 + + njac(2,1,i,k) = - c3c4 * tmp2 * u(2,i,j,k,c) + njac(2,2,i,k) = c3c4 * tmp1 + njac(2,3,i,k) = 0.0d+00 + njac(2,4,i,k) = 0.0d+00 + njac(2,5,i,k) = 0.0d+00 + + njac(3,1,i,k) = - c3c4 * tmp2 * u(3,i,j,k,c) + njac(3,2,i,k) = 0.0d+00 + njac(3,3,i,k) = c3c4 * tmp1 + njac(3,4,i,k) = 0.0d+00 + njac(3,5,i,k) = 0.0d+00 + + njac(4,1,i,k) = - con43 * c3c4 * tmp2 * u(4,i,j,k,c) + njac(4,2,i,k) = 0.0d+00 + njac(4,3,i,k) = 0.0d+00 + njac(4,4,i,k) = con43 * c3 * c4 * tmp1 + njac(4,5,i,k) = 0.0d+00 + + njac(5,1,i,k) = - ( c3c4 & + & - c1345 ) * tmp3 * (u(2,i,j,k,c)**2) & + & - ( c3c4 - c1345 ) * tmp3 * (u(3,i,j,k,c)**2) & + & - ( con43 * c3c4 & + & - c1345 ) * tmp3 * (u(4,i,j,k,c)**2) & + & - c1345 * tmp2 * u(5,i,j,k,c) + + njac(5,2,i,k) = ( c3c4 - c1345 ) * tmp2 * u(2,i,j,k,c) + njac(5,3,i,k) = ( c3c4 - c1345 ) * tmp2 * u(3,i,j,k,c) + njac(5,4,i,k) = ( con43 * c3c4 & + & - c1345 ) * tmp2 * u(4,i,j,k,c) + njac(5,5,i,k) = ( c1345 )* tmp1 + + + enddo + enddo + +!--------------------------------------------------------------------- +! now joacobians set, so form left hand side in z direction +!--------------------------------------------------------------------- + do k = start(3,c), ksize-end(3,c) + do i=start(1,c),isize + + tmp1 = dt * tz1 + tmp2 = dt * tz2 + + lhsa(1,1,i,k) = - tmp2 * fjac(1,1,i,k-1) & + & - tmp1 * njac(1,1,i,k-1) & + & - tmp1 * dz1 + lhsa(1,2,i,k) = - tmp2 * fjac(1,2,i,k-1) & + & - tmp1 * njac(1,2,i,k-1) + lhsa(1,3,i,k) = - tmp2 * fjac(1,3,i,k-1) & + & - tmp1 * njac(1,3,i,k-1) + lhsa(1,4,i,k) = - tmp2 * fjac(1,4,i,k-1) & + & - tmp1 * njac(1,4,i,k-1) + lhsa(1,5,i,k) = - tmp2 * fjac(1,5,i,k-1) & + & - tmp1 * njac(1,5,i,k-1) + + lhsa(2,1,i,k) = - tmp2 * fjac(2,1,i,k-1) & + & - tmp1 * njac(2,1,i,k-1) + lhsa(2,2,i,k) = - tmp2 * fjac(2,2,i,k-1) & + & - tmp1 * njac(2,2,i,k-1) & + & - tmp1 * dz2 + lhsa(2,3,i,k) = - tmp2 * fjac(2,3,i,k-1) & + & - tmp1 * njac(2,3,i,k-1) + lhsa(2,4,i,k) = - tmp2 * fjac(2,4,i,k-1) & + & - tmp1 * njac(2,4,i,k-1) + lhsa(2,5,i,k) = - tmp2 * fjac(2,5,i,k-1) & + & - tmp1 * njac(2,5,i,k-1) + + lhsa(3,1,i,k) = - tmp2 * fjac(3,1,i,k-1) & + & - tmp1 * njac(3,1,i,k-1) + lhsa(3,2,i,k) = - tmp2 * fjac(3,2,i,k-1) & + & - tmp1 * njac(3,2,i,k-1) + lhsa(3,3,i,k) = - tmp2 * fjac(3,3,i,k-1) & + & - tmp1 * njac(3,3,i,k-1) & + & - tmp1 * dz3 + lhsa(3,4,i,k) = - tmp2 * fjac(3,4,i,k-1) & + & - tmp1 * njac(3,4,i,k-1) + lhsa(3,5,i,k) = - tmp2 * fjac(3,5,i,k-1) & + & - tmp1 * njac(3,5,i,k-1) + + lhsa(4,1,i,k) = - tmp2 * fjac(4,1,i,k-1) & + & - tmp1 * njac(4,1,i,k-1) + lhsa(4,2,i,k) = - tmp2 * fjac(4,2,i,k-1) & + & - tmp1 * njac(4,2,i,k-1) + lhsa(4,3,i,k) = - tmp2 * fjac(4,3,i,k-1) & + & - tmp1 * njac(4,3,i,k-1) + lhsa(4,4,i,k) = - tmp2 * fjac(4,4,i,k-1) & + & - tmp1 * njac(4,4,i,k-1) & + & - tmp1 * dz4 + lhsa(4,5,i,k) = - tmp2 * fjac(4,5,i,k-1) & + & - tmp1 * njac(4,5,i,k-1) + + lhsa(5,1,i,k) = - tmp2 * fjac(5,1,i,k-1) & + & - tmp1 * njac(5,1,i,k-1) + lhsa(5,2,i,k) = - tmp2 * fjac(5,2,i,k-1) & + & - tmp1 * njac(5,2,i,k-1) + lhsa(5,3,i,k) = - tmp2 * fjac(5,3,i,k-1) & + & - tmp1 * njac(5,3,i,k-1) + lhsa(5,4,i,k) = - tmp2 * fjac(5,4,i,k-1) & + & - tmp1 * njac(5,4,i,k-1) + lhsa(5,5,i,k) = - tmp2 * fjac(5,5,i,k-1) & + & - tmp1 * njac(5,5,i,k-1) & + & - tmp1 * dz5 + + lhsb(1,1,i,k) = 1.0d+00 & + & + tmp1 * 2.0d+00 * njac(1,1,i,k) & + & + tmp1 * 2.0d+00 * dz1 + lhsb(1,2,i,k) = tmp1 * 2.0d+00 * njac(1,2,i,k) + lhsb(1,3,i,k) = tmp1 * 2.0d+00 * njac(1,3,i,k) + lhsb(1,4,i,k) = tmp1 * 2.0d+00 * njac(1,4,i,k) + lhsb(1,5,i,k) = tmp1 * 2.0d+00 * njac(1,5,i,k) + + lhsb(2,1,i,k) = tmp1 * 2.0d+00 * njac(2,1,i,k) + lhsb(2,2,i,k) = 1.0d+00 & + & + tmp1 * 2.0d+00 * njac(2,2,i,k) & + & + tmp1 * 2.0d+00 * dz2 + lhsb(2,3,i,k) = tmp1 * 2.0d+00 * njac(2,3,i,k) + lhsb(2,4,i,k) = tmp1 * 2.0d+00 * njac(2,4,i,k) + lhsb(2,5,i,k) = tmp1 * 2.0d+00 * njac(2,5,i,k) + + lhsb(3,1,i,k) = tmp1 * 2.0d+00 * njac(3,1,i,k) + lhsb(3,2,i,k) = tmp1 * 2.0d+00 * njac(3,2,i,k) + lhsb(3,3,i,k) = 1.0d+00 & + & + tmp1 * 2.0d+00 * njac(3,3,i,k) & + & + tmp1 * 2.0d+00 * dz3 + lhsb(3,4,i,k) = tmp1 * 2.0d+00 * njac(3,4,i,k) + lhsb(3,5,i,k) = tmp1 * 2.0d+00 * njac(3,5,i,k) + + lhsb(4,1,i,k) = tmp1 * 2.0d+00 * njac(4,1,i,k) + lhsb(4,2,i,k) = tmp1 * 2.0d+00 * njac(4,2,i,k) + lhsb(4,3,i,k) = tmp1 * 2.0d+00 * njac(4,3,i,k) + lhsb(4,4,i,k) = 1.0d+00 & + & + tmp1 * 2.0d+00 * njac(4,4,i,k) & + & + tmp1 * 2.0d+00 * dz4 + lhsb(4,5,i,k) = tmp1 * 2.0d+00 * njac(4,5,i,k) + + lhsb(5,1,i,k) = tmp1 * 2.0d+00 * njac(5,1,i,k) + lhsb(5,2,i,k) = tmp1 * 2.0d+00 * njac(5,2,i,k) + lhsb(5,3,i,k) = tmp1 * 2.0d+00 * njac(5,3,i,k) + lhsb(5,4,i,k) = tmp1 * 2.0d+00 * njac(5,4,i,k) + lhsb(5,5,i,k) = 1.0d+00 & + & + tmp1 * 2.0d+00 * njac(5,5,i,k) & + & + tmp1 * 2.0d+00 * dz5 + + lhsc(1,1,i,j,k,c) = tmp2 * fjac(1,1,i,k+1) & + & - tmp1 * njac(1,1,i,k+1) & + & - tmp1 * dz1 + lhsc(1,2,i,j,k,c) = tmp2 * fjac(1,2,i,k+1) & + & - tmp1 * njac(1,2,i,k+1) + lhsc(1,3,i,j,k,c) = tmp2 * fjac(1,3,i,k+1) & + & - tmp1 * njac(1,3,i,k+1) + lhsc(1,4,i,j,k,c) = tmp2 * fjac(1,4,i,k+1) & + & - tmp1 * njac(1,4,i,k+1) + lhsc(1,5,i,j,k,c) = tmp2 * fjac(1,5,i,k+1) & + & - tmp1 * njac(1,5,i,k+1) + + lhsc(2,1,i,j,k,c) = tmp2 * fjac(2,1,i,k+1) & + & - tmp1 * njac(2,1,i,k+1) + lhsc(2,2,i,j,k,c) = tmp2 * fjac(2,2,i,k+1) & + & - tmp1 * njac(2,2,i,k+1) & + & - tmp1 * dz2 + lhsc(2,3,i,j,k,c) = tmp2 * fjac(2,3,i,k+1) & + & - tmp1 * njac(2,3,i,k+1) + lhsc(2,4,i,j,k,c) = tmp2 * fjac(2,4,i,k+1) & + & - tmp1 * njac(2,4,i,k+1) + lhsc(2,5,i,j,k,c) = tmp2 * fjac(2,5,i,k+1) & + & - tmp1 * njac(2,5,i,k+1) + + lhsc(3,1,i,j,k,c) = tmp2 * fjac(3,1,i,k+1) & + & - tmp1 * njac(3,1,i,k+1) + lhsc(3,2,i,j,k,c) = tmp2 * fjac(3,2,i,k+1) & + & - tmp1 * njac(3,2,i,k+1) + lhsc(3,3,i,j,k,c) = tmp2 * fjac(3,3,i,k+1) & + & - tmp1 * njac(3,3,i,k+1) & + & - tmp1 * dz3 + lhsc(3,4,i,j,k,c) = tmp2 * fjac(3,4,i,k+1) & + & - tmp1 * njac(3,4,i,k+1) + lhsc(3,5,i,j,k,c) = tmp2 * fjac(3,5,i,k+1) & + & - tmp1 * njac(3,5,i,k+1) + + lhsc(4,1,i,j,k,c) = tmp2 * fjac(4,1,i,k+1) & + & - tmp1 * njac(4,1,i,k+1) + lhsc(4,2,i,j,k,c) = tmp2 * fjac(4,2,i,k+1) & + & - tmp1 * njac(4,2,i,k+1) + lhsc(4,3,i,j,k,c) = tmp2 * fjac(4,3,i,k+1) & + & - tmp1 * njac(4,3,i,k+1) + lhsc(4,4,i,j,k,c) = tmp2 * fjac(4,4,i,k+1) & + & - tmp1 * njac(4,4,i,k+1) & + & - tmp1 * dz4 + lhsc(4,5,i,j,k,c) = tmp2 * fjac(4,5,i,k+1) & + & - tmp1 * njac(4,5,i,k+1) + + lhsc(5,1,i,j,k,c) = tmp2 * fjac(5,1,i,k+1) & + & - tmp1 * njac(5,1,i,k+1) + lhsc(5,2,i,j,k,c) = tmp2 * fjac(5,2,i,k+1) & + & - tmp1 * njac(5,2,i,k+1) + lhsc(5,3,i,j,k,c) = tmp2 * fjac(5,3,i,k+1) & + & - tmp1 * njac(5,3,i,k+1) + lhsc(5,4,i,j,k,c) = tmp2 * fjac(5,4,i,k+1) & + & - tmp1 * njac(5,4,i,k+1) + lhsc(5,5,i,j,k,c) = tmp2 * fjac(5,5,i,k+1) & + & - tmp1 * njac(5,5,i,k+1) & + & - tmp1 * dz5 + + enddo + enddo + + +!--------------------------------------------------------------------- +! outer most do loops - sweeping in i direction +!--------------------------------------------------------------------- + if (first .eq. 1) then + +!--------------------------------------------------------------------- +! multiply c(i,j,kstart) by b_inverse and copy back to ! +! multiply rhs(kstart) by b_inverse(kstart) and copy to rhs +!--------------------------------------------------------------------- +!dir$ ivdep + do i=start(1,c),isize + call binvcrhs( lhsb(1,1,i,kstart), & + & lhsc(1,1,i,j,kstart,c), & + & rhs(1,i,j,kstart,c) ) + enddo + + endif + +!--------------------------------------------------------------------- +! begin inner most do loop +! do all the elements of the cell unless last +!--------------------------------------------------------------------- + do k=kstart+first,ksize-last +!dir$ ivdep + do i=start(1,c),isize + +!--------------------------------------------------------------------- +! subtract A*lhs_vector(k-1) from lhs_vector(k) +! +! rhs(k) = rhs(k) - A*rhs(k-1) +!--------------------------------------------------------------------- + call matvec_sub(lhsa(1,1,i,k), & + & rhs(1,i,j,k-1,c),rhs(1,i,j,k,c)) + +!--------------------------------------------------------------------- +! B(k) = B(k) - C(k-1)*A(k) +! call matmul_sub(aa,i,j,k,c,cc,i,j,k-1,c,bb,i,j,k,c) +!--------------------------------------------------------------------- + call matmul_sub(lhsa(1,1,i,k), & + & lhsc(1,1,i,j,k-1,c), & + & lhsb(1,1,i,k)) + +!--------------------------------------------------------------------- +! multiply c(i,j,k) by b_inverse and copy back to ! +! multiply rhs(i,j,1) by b_inverse(i,j,1) and copy to rhs +!--------------------------------------------------------------------- + call binvcrhs( lhsb(1,1,i,k), & + & lhsc(1,1,i,j,k,c), & + & rhs(1,i,j,k,c) ) + + enddo + enddo + +!--------------------------------------------------------------------- +! Now finish up special cases for last cell +!--------------------------------------------------------------------- + if (last .eq. 1) then + +!dir$ ivdep + do i=start(1,c),isize +!--------------------------------------------------------------------- +! rhs(ksize) = rhs(ksize) - A*rhs(ksize-1) +!--------------------------------------------------------------------- + call matvec_sub(lhsa(1,1,i,ksize), & + & rhs(1,i,j,ksize-1,c),rhs(1,i,j,ksize,c)) + +!--------------------------------------------------------------------- +! B(ksize) = B(ksize) - C(ksize-1)*A(ksize) +! call matmul_sub(aa,i,j,ksize,c, +! $ cc,i,j,ksize-1,c,bb,i,j,ksize,c) +!--------------------------------------------------------------------- + call matmul_sub(lhsa(1,1,i,ksize), & + & lhsc(1,1,i,j,ksize-1,c), & + & lhsb(1,1,i,ksize)) + +!--------------------------------------------------------------------- +! multiply rhs(ksize) by b_inverse(ksize) and copy to rhs +!--------------------------------------------------------------------- + call binvrhs( lhsb(1,1,i,ksize), & + & rhs(1,i,j,ksize,c) ) + enddo + + endif + enddo + + + return + end + + + + + + diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/CG/Makefile b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/CG/Makefile new file mode 100644 index 000000000..78daa13ee --- /dev/null +++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/CG/Makefile @@ -0,0 +1,28 @@ +SHELL=/bin/sh +BENCHMARK=cg +BENCHMARKU=CG + +include ../config/make.def + +OBJS = cg.o cg_data.o mpinpb.o ${COMMON}/print_results.o \ + ${COMMON}/get_active_nprocs.o \ + ${COMMON}/${RAND}.o ${COMMON}/timers.o + +include ../sys/make.common + +${PROGRAM}: config ${OBJS} + ${FLINK} ${FLINKFLAGS} -o ${PROGRAM} ${OBJS} ${FMPI_LIB} + +.f90.o: + ${FCOMPILE} $< + +cg.o: cg.f90 cg_data.o mpinpb.o +cg_data.o: cg_data.f90 mpinpb.o npbparams.h +mpinpb.o: mpinpb.f90 + +clean: + - rm -f *.o *.mod *~ + - rm -f npbparams.h core + + + diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/CG/cg.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/CG/cg.f90 new file mode 100644 index 000000000..ff94f5f80 --- /dev/null +++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/CG/cg.f90 @@ -0,0 +1,1541 @@ +!-------------------------------------------------------------------------! +! ! +! N A S P A R A L L E L B E N C H M A R K S 3.4 ! +! ! +! C G ! +! ! +!-------------------------------------------------------------------------! +! ! +! This benchmark is part of the NAS Parallel Benchmark 3.4 suite. ! +! It is described in NAS Technical Reports 95-020 and 02-007 ! +! ! +! Permission to use, copy, distribute and modify this software ! +! for any purpose with or without fee is hereby granted. We ! +! request, however, that all derived work reference the NAS ! +! Parallel Benchmarks 3.4. This software is provided "as is" ! +! without express or implied warranty. ! +! ! +! Information on NPB 3.4, including the technical report, the ! +! original specifications, source code, results and information ! +! on how to submit new results, is available at: ! +! ! +! http://www.nas.nasa.gov/Software/NPB/ ! +! ! +! Send comments or suggestions to npb@nas.nasa.gov ! +! ! +! NAS Parallel Benchmarks Group ! +! NASA Ames Research Center ! +! Mail Stop: T27A-1 ! +! Moffett Field, CA 94035-1000 ! +! ! +! E-mail: npb@nas.nasa.gov ! +! Fax: (650) 604-3957 ! +! ! +!-------------------------------------------------------------------------! + + +!--------------------------------------------------------------------- +! +! Authors: M. Yarrow +! C. Kuszmaul +! R. F. Van der Wijngaart +! H. Jin +! +!--------------------------------------------------------------------- + + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + program cg +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + use, intrinsic :: ieee_arithmetic, only : ieee_is_nan + + use cg_data + use mpinpb + use timing + + implicit none + + integer status(MPI_STATUS_SIZE), request, ierr + + integer i, j, k, it + + double precision zeta, randlc + external randlc + double precision rnorm + double precision norm_temp1(2), norm_temp2(2) + + double precision t, tmax, mflops + external timer_read + double precision timer_read + character class + logical verified + double precision zeta_verify_value, epsilon, err + + double precision tsum(t_last+2), t1(t_last+2), & + & tming(t_last+2), tmaxg(t_last+2) + character t_recs(t_last+2)*8 + + data t_recs/'total', 'conjg', 'rcomm', 'ncomm', & + & ' totcomp', ' totcomm'/ + + +!--------------------------------------------------------------------- +! Set up mpi initialization and number of proc testing +!--------------------------------------------------------------------- + call initialize_mpi + if (.not. active) goto 999 + +!--------------------------------------------------------------------- +! Set up processor info, such as whether sq num of procs, etc +!--------------------------------------------------------------------- + call setup_proc_info( ) + +!--------------------------------------------------------------------- +! Allocate space for work arrays +!--------------------------------------------------------------------- + call alloc_space( ) + + + if( na .eq. 1400 .and. & + & nonzer .eq. 7 .and. & + & niter .eq. 15 .and. & + & shift .eq. 10.d0 ) then + class = 'S' + zeta_verify_value = 8.5971775078648d0 + else if( na .eq. 7000 .and. & + & nonzer .eq. 8 .and. & + & niter .eq. 15 .and. & + & shift .eq. 12.d0 ) then + class = 'W' + zeta_verify_value = 10.362595087124d0 + else if( na .eq. 14000 .and. & + & nonzer .eq. 11 .and. & + & niter .eq. 15 .and. & + & shift .eq. 20.d0 ) then + class = 'A' + zeta_verify_value = 17.130235054029d0 + else if( na .eq. 75000 .and. & + & nonzer .eq. 13 .and. & + & niter .eq. 75 .and. & + & shift .eq. 60.d0 ) then + class = 'B' + zeta_verify_value = 22.712745482631d0 + else if( na .eq. 150000 .and. & + & nonzer .eq. 15 .and. & + & niter .eq. 75 .and. & + & shift .eq. 110.d0 ) then + class = 'C' + zeta_verify_value = 28.973605592845d0 + else if( na .eq. 1500000 .and. & + & nonzer .eq. 21 .and. & + & niter .eq. 100 .and. & + & shift .eq. 500.d0 ) then + class = 'D' + zeta_verify_value = 52.514532105794d0 + else if( na .eq. 9000000 .and. & + & nonzer .eq. 26 .and. & + & niter .eq. 100 .and. & + & shift .eq. 1.5d3 ) then + class = 'E' + zeta_verify_value = 77.522164599383d0 + else if( na .eq. 54000000 .and. & + & nonzer .eq. 31 .and. & + & niter .eq. 100 .and. & + & shift .eq. 5.0d3 ) then + class = 'F' + zeta_verify_value = 107.3070826433d0 + else + class = 'U' + endif + + if( me .eq. root )then + write( *,1000 ) + write( *,1001 ) na, class + write( *,1002 ) niter + write( *,1003 ) nonzer + write( *,1004 ) shift + write( *,1005 ) total_nodes + if (total_nodes .ne. nprocs) write (*, 1006) nprocs + + 1000 format(//,' NAS Parallel Benchmarks 3.4 -- CG Benchmark', /) + 1001 format(' Size: ', i10, ' (class ', a, ')' ) + 1002 format(' Iterations: ', i5 ) + 1003 format(' Number of nonzeroes per row: ', i8) + 1004 format(' Eigenvalue shift: ', f9.3) + 1005 format(' Total number of processes: ', i6) + 1006 format(' WARNING: Number of processes is not power of two (', & + & i0, ' active)') + endif + + +!--------------------------------------------------------------------- +! Set up partition's submatrix info: firstcol, lastcol, firstrow, lastrow +!--------------------------------------------------------------------- + call setup_submatrix_info( ) + + + do i = 1, t_last + call timer_clear(i) + end do + +!--------------------------------------------------------------------- +! Inialize random number generator +!--------------------------------------------------------------------- + tran = 314159265.0D0 + amult = 1220703125.0D0 + zeta = randlc( tran, amult ) + +!--------------------------------------------------------------------- +! Set up partition's sparse random matrix for given class size +!--------------------------------------------------------------------- + call makea(na, nz, a, colidx, rowstr, nonzer, & + & firstrow, lastrow, firstcol, lastcol, & + & rcond, arow, acol, aelt, v, iv, shift) + + + +!--------------------------------------------------------------------- +! Note: as a result of the above call to makea: +! values of j used in indexing rowstr go from 1 --> lastrow-firstrow+1 +! values of colidx which are col indexes go from firstcol --> lastcol +! So: +! Shift the col index vals from actual (firstcol --> lastcol ) +! to local, i.e., (1 --> lastcol-firstcol+1) +!--------------------------------------------------------------------- + do j=1,lastrow-firstrow+1 + do k=rowstr(j),rowstr(j+1)-1 + colidx(k) = colidx(k) - firstcol + 1 + enddo + enddo + +!--------------------------------------------------------------------- +! set starting vector to (1, 1, .... 1) +!--------------------------------------------------------------------- + do i = 1, naa+1 + x(i) = 1.0D0 + enddo + + zeta = 0.0d0 + +!--------------------------------------------------------------------- +!----> +! Do one iteration untimed to init all code and data page tables +!----> (then reinit, start timing, to niter its) +!--------------------------------------------------------------------- + do it = 1, 1 + +!--------------------------------------------------------------------- +! The call to the conjugate gradient routine: +!--------------------------------------------------------------------- + call conj_grad ( rnorm ) + +!--------------------------------------------------------------------- +! zeta = shift + 1/(x.z) +! So, first: (x.z) +! Also, find norm of z +! So, first: (z.z) +!--------------------------------------------------------------------- + norm_temp1(1) = 0.0d0 + norm_temp1(2) = 0.0d0 + do j=1, lastcol-firstcol+1 + norm_temp1(1) = norm_temp1(1) + x(j)*z(j) + norm_temp1(2) = norm_temp1(2) + z(j)*z(j) + enddo + + if (timeron) call timer_start(t_ncomm) + do i = 1, l2npcols + call mpi_irecv( norm_temp2, & + & 2, & + & dp_type, & + & reduce_exch_proc(i), & + & i, & + & comm_solve, & + & request, & + & ierr ) + call mpi_send( norm_temp1, & + & 2, & + & dp_type, & + & reduce_exch_proc(i), & + & i, & + & comm_solve, & + & ierr ) + call mpi_wait( request, status, ierr ) + + norm_temp1(1) = norm_temp1(1) + norm_temp2(1) + norm_temp1(2) = norm_temp1(2) + norm_temp2(2) + enddo + if (timeron) call timer_stop(t_ncomm) + + norm_temp1(2) = 1.0d0 / sqrt( norm_temp1(2) ) + + +!--------------------------------------------------------------------- +! Normalize z to obtain x +!--------------------------------------------------------------------- + do j=1, lastcol-firstcol+1 + x(j) = norm_temp1(2)*z(j) + enddo + + + enddo ! end of do one iteration untimed + + +!--------------------------------------------------------------------- +! set starting vector to (1, 1, .... 1) +!--------------------------------------------------------------------- +! +! NOTE: a questionable limit on size: should this be na/num_proc_cols+1 ? +! + do i = 1, naa+1 + x(i) = 1.0d0 + enddo + + zeta = 0.0d0 + +!--------------------------------------------------------------------- +! Synchronize and start timing +!--------------------------------------------------------------------- + do i = 1, t_last + call timer_clear(i) + end do + call mpi_barrier( comm_solve, ierr ) + + call timer_clear( 1 ) + call timer_start( 1 ) + +!--------------------------------------------------------------------- +!----> +! Main Iteration for inverse power method +!----> +!--------------------------------------------------------------------- + do it = 1, niter + +!--------------------------------------------------------------------- +! The call to the conjugate gradient routine: +!--------------------------------------------------------------------- + call conj_grad ( rnorm ) + + +!--------------------------------------------------------------------- +! zeta = shift + 1/(x.z) +! So, first: (x.z) +! Also, find norm of z +! So, first: (z.z) +!--------------------------------------------------------------------- + norm_temp1(1) = 0.0d0 + norm_temp1(2) = 0.0d0 + do j=1, lastcol-firstcol+1 + norm_temp1(1) = norm_temp1(1) + x(j)*z(j) + norm_temp1(2) = norm_temp1(2) + z(j)*z(j) + enddo + + if (timeron) call timer_start(t_ncomm) + do i = 1, l2npcols + call mpi_irecv( norm_temp2, & + & 2, & + & dp_type, & + & reduce_exch_proc(i), & + & i, & + & comm_solve, & + & request, & + & ierr ) + call mpi_send( norm_temp1, & + & 2, & + & dp_type, & + & reduce_exch_proc(i), & + & i, & + & comm_solve, & + & ierr ) + call mpi_wait( request, status, ierr ) + + norm_temp1(1) = norm_temp1(1) + norm_temp2(1) + norm_temp1(2) = norm_temp1(2) + norm_temp2(2) + enddo + if (timeron) call timer_stop(t_ncomm) + + norm_temp1(2) = 1.0d0 / sqrt( norm_temp1(2) ) + + + if( me .eq. root )then + zeta = shift + 1.0d0 / norm_temp1(1) + if( it .eq. 1 ) write( *,9000 ) + write( *,9001 ) it, rnorm, zeta + endif + 9000 format( /,' iteration ||r|| zeta' ) + 9001 format( 4x, i5, 6x, e21.14, f20.13 ) + +!--------------------------------------------------------------------- +! Normalize z to obtain x +!--------------------------------------------------------------------- + do j=1, lastcol-firstcol+1 + x(j) = norm_temp1(2)*z(j) + enddo + + + enddo ! end of main iter inv pow meth + + call timer_stop( 1 ) + +!--------------------------------------------------------------------- +! End of timed section +!--------------------------------------------------------------------- + + t = timer_read( 1 ) + + call mpi_reduce( t, & + & tmax, & + & 1, & + & dp_type, & + & MPI_MAX, & + & root, & + & comm_solve, & + & ierr ) + + if( me .eq. root )then + write(*,100) + 100 format(' Benchmark completed ') + + epsilon = 1.d-10 + if (class .ne. 'U') then + + err = abs( zeta - zeta_verify_value )/zeta_verify_value + if( (.not.ieee_is_nan(err)) .and. (err .le. epsilon) ) then + verified = .TRUE. + write(*, 200) + write(*, 201) zeta + write(*, 202) err + 200 format(' VERIFICATION SUCCESSFUL ') + 201 format(' Zeta is ', E20.13) + 202 format(' Error is ', E20.13) + else + verified = .FALSE. + write(*, 300) + write(*, 301) zeta + write(*, 302) zeta_verify_value + 300 format(' VERIFICATION FAILED') + 301 format(' Zeta ', E20.13) + 302 format(' The correct zeta is ', E20.13) + endif + else + verified = .FALSE. + write (*, 400) + write (*, 401) + write (*, 201) zeta + 400 format(' Problem size unknown') + 401 format(' NO VERIFICATION PERFORMED') + endif + + + if( tmax .ne. 0. ) then + mflops = 1.0d-6 * 2*niter*dble( na ) & + & * ( 3.+nonzer*dble(nonzer+1) & + & + 25.*(5.+nonzer*dble(nonzer+1)) & + & + 3. ) / tmax + else + mflops = 0.d0 + endif + + call print_results('CG', class, na, 0, 0, & + & niter, nprocs, total_nodes, tmax, & + & mflops, ' floating point', & + & verified, npbversion, compiletime, & + & cs1, cs2, cs3, cs4, cs5, cs6, cs7) + + + endif + + + if (.not.timeron) goto 999 + + do i = 1, t_last + t1(i) = timer_read(i) + end do + t1(t_conjg) = t1(t_conjg) - t1(t_rcomm) + t1(t_last+2) = t1(t_rcomm) + t1(t_ncomm) + t1(t_last+1) = t1(t_total) - t1(t_last+2) + + call MPI_Reduce(t1, tsum, t_last+2, dp_type, MPI_SUM, & + & 0, comm_solve, ierr) + call MPI_Reduce(t1, tming, t_last+2, dp_type, MPI_MIN, & + & 0, comm_solve, ierr) + call MPI_Reduce(t1, tmaxg, t_last+2, dp_type, MPI_MAX, & + & 0, comm_solve, ierr) + + if (me .eq. 0) then + write(*, 800) nprocs + do i = 1, t_last+2 + tsum(i) = tsum(i) / nprocs + write(*, 810) i, t_recs(i), tming(i), tmaxg(i), tsum(i) + end do + endif + 800 format(' nprocs =', i6, 11x, 'minimum', 5x, 'maximum', & + & 5x, 'average') + 810 format(' timer ', i2, '(', A8, ') :', 3(2x,f10.4)) + + 999 continue + call mpi_finalize(ierr) + + + + end ! end main + + + + + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + subroutine initialize_mpi +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + use cg_data + use mpinpb + use timing + + implicit none + + integer ierr + + + call mpi_init( ierr ) + +!--------------------------------------------------------------------- +! get a process grid that requires a pwr-2 number of procs. +! excess ranks are marked as inactive. +!--------------------------------------------------------------------- + call get_active_nprocs(3, num_proc_cols, num_proc_rows, nprocs, & + & total_nodes, me, comm_solve, active) + + if (.not. active) return + + if (.not. convertdouble) then + dp_type = MPI_DOUBLE_PRECISION + else + dp_type = MPI_REAL + endif + + root = 0 + + if (me .eq. root) then + call check_timer_flag( timeron ) + endif + + call mpi_bcast(timeron, 1, MPI_LOGICAL, 0, comm_solve, ierr) + + return + end + + + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + subroutine setup_proc_info( ) +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + use cg_data + use mpinpb + + implicit none + + integer i, ierr + + +!--------------------------------------------------------------------- +! set up dimension parameters after partition +! num_proc_rows & num_proc_cols are set by get_active_nprocs +!--------------------------------------------------------------------- + +!--------------------------------------------------------------------- +! num_procs must be a power of 2, and num_procs=num_proc_cols*num_proc_rows. +! num_proc_cols and num_proc_cols are to be found in npbparams.h. +! When num_procs is not square, then num_proc_cols must be = 2*num_proc_rows. +!--------------------------------------------------------------------- + num_procs = num_proc_cols * num_proc_rows + +!--------------------------------------------------------------------- +! num_procs must be a power of 2, and num_procs=num_proc_cols*num_proc_rows +! When num_procs is not square, then num_proc_cols = 2*num_proc_rows +!--------------------------------------------------------------------- +! First, number of procs must be power of two. +!--------------------------------------------------------------------- + if( nprocs .ne. num_procs )then + if( me .eq. root ) write( *,9000 ) nprocs, num_procs + 9000 format( /,'ERROR: Number of processes (', & + & i0, ') is not a power of two (', i0, '?)'/ ) + call mpi_abort(mpi_comm_world, mpi_err_other, ierr) + stop + endif + + + npcols = num_proc_cols + nprows = num_proc_rows + + + return + end + + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + subroutine setup_submatrix_info( ) +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + use cg_data + use mpinpb + + implicit none + + integer col_size, row_size + integer i, j + integer div_factor + + + proc_row = me / npcols + proc_col = me - proc_row*npcols + + +!--------------------------------------------------------------------- +! If na evenly divisible by npcols, then it is evenly divisible +! by nprows +!--------------------------------------------------------------------- + + if( na/npcols*npcols .eq. na )then + col_size = na/npcols + firstcol = proc_col*col_size + 1 + lastcol = firstcol - 1 + col_size + row_size = na/nprows + firstrow = proc_row*row_size + 1 + lastrow = firstrow - 1 + row_size +!--------------------------------------------------------------------- +! If na not evenly divisible by npcols, then first subdivide for nprows +! and then, if npcols not equal to nprows (i.e., not a sq number of procs), +! get col subdivisions by dividing by 2 each row subdivision. +!--------------------------------------------------------------------- + else + if( proc_row .lt. na - na/nprows*nprows)then + row_size = na/nprows+ 1 + firstrow = proc_row*row_size + 1 + lastrow = firstrow - 1 + row_size + else + row_size = na/nprows + firstrow = (na - na/nprows*nprows)*(row_size+1) & + & + (proc_row-(na-na/nprows*nprows)) & + & *row_size + 1 + lastrow = firstrow - 1 + row_size + endif + if( npcols .eq. nprows )then + if( proc_col .lt. na - na/npcols*npcols )then + col_size = na/npcols+ 1 + firstcol = proc_col*col_size + 1 + lastcol = firstcol - 1 + col_size + else + col_size = na/npcols + firstcol = (na - na/npcols*npcols)*(col_size+1) & + & + (proc_col-(na-na/npcols*npcols)) & + & *col_size + 1 + lastcol = firstcol - 1 + col_size + endif + else + if( (proc_col/2) .lt. & + & na - na/(npcols/2)*(npcols/2) )then + col_size = na/(npcols/2) + 1 + firstcol = (proc_col/2)*col_size + 1 + lastcol = firstcol - 1 + col_size + else + col_size = na/(npcols/2) + firstcol = (na - na/(npcols/2)*(npcols/2)) & + & *(col_size+1) & + & + ((proc_col/2)-(na-na/(npcols/2)*(npcols/2))) & + & *col_size + 1 + lastcol = firstcol - 1 + col_size + endif +!C write( *,* ) col_size,firstcol,lastcol + if( mod( me,2 ) .eq. 0 )then + lastcol = firstcol - 1 + (col_size-1)/2 + 1 + else + firstcol = firstcol + (col_size-1)/2 + 1 + lastcol = firstcol - 1 + col_size/2 +!C write( *,* ) firstcol,lastcol + endif + endif + endif + + + + if( npcols .eq. nprows )then + send_start = 1 + send_len = lastrow - firstrow + 1 + else + if( mod( me,2 ) .eq. 0 )then + send_start = 1 + send_len = (1 + lastrow-firstrow+1)/2 + else + send_start = (1 + lastrow-firstrow+1)/2 + 1 + send_len = (lastrow-firstrow+1)/2 + endif + endif + + + + +!--------------------------------------------------------------------- +! Transpose exchange processor +!--------------------------------------------------------------------- + + if( npcols .eq. nprows )then + exch_proc = mod( me,nprows )*nprows + me/nprows + else + exch_proc = 2*(mod( me/2,nprows )*nprows + me/2/nprows) & + & + mod( me,2 ) + endif + + + + i = npcols / 2 + l2npcols = 0 + do while( i .gt. 0 ) + l2npcols = l2npcols + 1 + i = i / 2 + enddo + + +!--------------------------------------------------------------------- +! Set up the reduce phase schedules... +!--------------------------------------------------------------------- + + div_factor = npcols + do i = 1, l2npcols + + j = mod( proc_col+div_factor/2, div_factor ) & + & + proc_col / div_factor * div_factor + reduce_exch_proc(i) = proc_row*npcols + j + + div_factor = div_factor / 2 + + enddo + + + do i = l2npcols, 1, -1 + + if( nprows .eq. npcols )then + reduce_send_starts(i) = send_start + reduce_send_lengths(i) = send_len + reduce_recv_lengths(i) = lastrow - firstrow + 1 + else + reduce_recv_lengths(i) = send_len + if( i .eq. l2npcols )then + reduce_send_lengths(i) = lastrow-firstrow+1 - send_len + if( me/2*2 .eq. me )then + reduce_send_starts(i) = send_start + send_len + else + reduce_send_starts(i) = 1 + endif + else + reduce_send_lengths(i) = send_len + reduce_send_starts(i) = send_start + endif + endif + reduce_recv_starts(i) = send_start + + enddo + + + exch_recv_length = lastcol - firstcol + 1 + + + return + end + + + + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + subroutine conj_grad ( rnorm ) +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + +!--------------------------------------------------------------------- +! Floaging point arrays here are named as in NPB1 spec discussion of +! CG algorithm +!--------------------------------------------------------------------- + + use cg_data + use mpinpb + use timing + + implicit none + + double precision rnorm + + integer status(MPI_STATUS_SIZE ), request + + integer i, j, k, ierr + integer cgit, cgitmax + + double precision d, sum, rho, rho0, alpha, beta + + external timer_read + double precision timer_read + + data cgitmax / 25 / + + + if (timeron) call timer_start(t_conjg) +!--------------------------------------------------------------------- +! Initialize the CG algorithm: +!--------------------------------------------------------------------- + do j=1,naa+1 + q(j) = 0.0d0 + z(j) = 0.0d0 + r(j) = x(j) + p(j) = r(j) + w(j) = 0.0d0 + enddo + + +!--------------------------------------------------------------------- +! rho = r.r +! Now, obtain the norm of r: First, sum squares of r elements locally... +!--------------------------------------------------------------------- + sum = 0.0d0 + do j=1, lastcol-firstcol+1 + sum = sum + r(j)*r(j) + enddo + +!--------------------------------------------------------------------- +! Exchange and sum with procs identified in reduce_exch_proc +! (This is equivalent to mpi_allreduce.) +! Sum the partial sums of rho, leaving rho on all processors +!--------------------------------------------------------------------- + if (timeron) call timer_start(t_rcomm) + do i = 1, l2npcols + call mpi_irecv( rho, & + & 1, & + & dp_type, & + & reduce_exch_proc(i), & + & i, & + & comm_solve, & + & request, & + & ierr ) + call mpi_send( sum, & + & 1, & + & dp_type, & + & reduce_exch_proc(i), & + & i, & + & comm_solve, & + & ierr ) + call mpi_wait( request, status, ierr ) + + sum = sum + rho + enddo + if (timeron) call timer_stop(t_rcomm) + rho = sum + + + +!--------------------------------------------------------------------- +!----> +! The conj grad iteration loop +!----> +!--------------------------------------------------------------------- + do cgit = 1, cgitmax + + +!--------------------------------------------------------------------- +! q = A.p +! The partition submatrix-vector multiply: use workspace w +!--------------------------------------------------------------------- + do j=1,lastrow-firstrow+1 + sum = 0.d0 + do k=rowstr(j),rowstr(j+1)-1 + sum = sum + a(k)*p(colidx(k)) + enddo + w(j) = sum + enddo + +!--------------------------------------------------------------------- +! Sum the partition submatrix-vec A.p's across rows +! Exchange and sum piece of w with procs identified in reduce_exch_proc +!--------------------------------------------------------------------- + if (timeron) call timer_start(t_rcomm) + do i = l2npcols, 1, -1 + call mpi_irecv( q(reduce_recv_starts(i)), & + & reduce_recv_lengths(i), & + & dp_type, & + & reduce_exch_proc(i), & + & i, & + & comm_solve, & + & request, & + & ierr ) + call mpi_send( w(reduce_send_starts(i)), & + & reduce_send_lengths(i), & + & dp_type, & + & reduce_exch_proc(i), & + & i, & + & comm_solve, & + & ierr ) + call mpi_wait( request, status, ierr ) + do j=send_start,send_start + reduce_recv_lengths(i) - 1 + w(j) = w(j) + q(j) + enddo + enddo + if (timeron) call timer_stop(t_rcomm) + + +!--------------------------------------------------------------------- +! Exchange piece of q with transpose processor: +!--------------------------------------------------------------------- + if( l2npcols .ne. 0 )then + if (timeron) call timer_start(t_rcomm) + call mpi_irecv( q, & + & exch_recv_length, & + & dp_type, & + & exch_proc, & + & 1, & + & comm_solve, & + & request, & + & ierr ) + + call mpi_send( w(send_start), & + & send_len, & + & dp_type, & + & exch_proc, & + & 1, & + & comm_solve, & + & ierr ) + call mpi_wait( request, status, ierr ) + if (timeron) call timer_stop(t_rcomm) + else + do j=1,exch_recv_length + q(j) = w(j) + enddo + endif + + +!--------------------------------------------------------------------- +! Clear w for reuse... +!--------------------------------------------------------------------- + do j=1, max( lastrow-firstrow+1, lastcol-firstcol+1 ) + w(j) = 0.0d0 + enddo + + +!--------------------------------------------------------------------- +! Obtain p.q +!--------------------------------------------------------------------- + sum = 0.0d0 + do j=1, lastcol-firstcol+1 + sum = sum + p(j)*q(j) + enddo + +!--------------------------------------------------------------------- +! Obtain d with a sum-reduce +!--------------------------------------------------------------------- + if (timeron) call timer_start(t_rcomm) + do i = 1, l2npcols + call mpi_irecv( d, & + & 1, & + & dp_type, & + & reduce_exch_proc(i), & + & i, & + & comm_solve, & + & request, & + & ierr ) + call mpi_send( sum, & + & 1, & + & dp_type, & + & reduce_exch_proc(i), & + & i, & + & comm_solve, & + & ierr ) + + call mpi_wait( request, status, ierr ) + + sum = sum + d + enddo + if (timeron) call timer_stop(t_rcomm) + d = sum + + +!--------------------------------------------------------------------- +! Obtain alpha = rho / (p.q) +!--------------------------------------------------------------------- + alpha = rho / d + +!--------------------------------------------------------------------- +! Save a temporary of rho +!--------------------------------------------------------------------- + rho0 = rho + +!--------------------------------------------------------------------- +! Obtain z = z + alpha*p +! and r = r - alpha*q +!--------------------------------------------------------------------- + do j=1, lastcol-firstcol+1 + z(j) = z(j) + alpha*p(j) + r(j) = r(j) - alpha*q(j) + enddo + +!--------------------------------------------------------------------- +! rho = r.r +! Now, obtain the norm of r: First, sum squares of r elements locally... +!--------------------------------------------------------------------- + sum = 0.0d0 + do j=1, lastcol-firstcol+1 + sum = sum + r(j)*r(j) + enddo + +!--------------------------------------------------------------------- +! Obtain rho with a sum-reduce +!--------------------------------------------------------------------- + if (timeron) call timer_start(t_rcomm) + do i = 1, l2npcols + call mpi_irecv( rho, & + & 1, & + & dp_type, & + & reduce_exch_proc(i), & + & i, & + & comm_solve, & + & request, & + & ierr ) + call mpi_send( sum, & + & 1, & + & dp_type, & + & reduce_exch_proc(i), & + & i, & + & comm_solve, & + & ierr ) + call mpi_wait( request, status, ierr ) + + sum = sum + rho + enddo + if (timeron) call timer_stop(t_rcomm) + rho = sum + +!--------------------------------------------------------------------- +! Obtain beta: +!--------------------------------------------------------------------- + beta = rho / rho0 + +!--------------------------------------------------------------------- +! p = r + beta*p +!--------------------------------------------------------------------- + do j=1, lastcol-firstcol+1 + p(j) = r(j) + beta*p(j) + enddo + + + + enddo ! end of do cgit=1,cgitmax + + + +!--------------------------------------------------------------------- +! Compute residual norm explicitly: ||r|| = ||x - A.z|| +! First, form A.z +! The partition submatrix-vector multiply +!--------------------------------------------------------------------- + do j=1,lastrow-firstrow+1 + sum = 0.d0 + do k=rowstr(j),rowstr(j+1)-1 + sum = sum + a(k)*z(colidx(k)) + enddo + w(j) = sum + enddo + + + +!--------------------------------------------------------------------- +! Sum the partition submatrix-vec A.z's across rows +!--------------------------------------------------------------------- + if (timeron) call timer_start(t_rcomm) + do i = l2npcols, 1, -1 + call mpi_irecv( r(reduce_recv_starts(i)), & + & reduce_recv_lengths(i), & + & dp_type, & + & reduce_exch_proc(i), & + & i, & + & comm_solve, & + & request, & + & ierr ) + call mpi_send( w(reduce_send_starts(i)), & + & reduce_send_lengths(i), & + & dp_type, & + & reduce_exch_proc(i), & + & i, & + & comm_solve, & + & ierr ) + call mpi_wait( request, status, ierr ) + + do j=send_start,send_start + reduce_recv_lengths(i) - 1 + w(j) = w(j) + r(j) + enddo + enddo + if (timeron) call timer_stop(t_rcomm) + + +!--------------------------------------------------------------------- +! Exchange piece of q with transpose processor: +!--------------------------------------------------------------------- + if( l2npcols .ne. 0 )then + if (timeron) call timer_start(t_rcomm) + call mpi_irecv( r, & + & exch_recv_length, & + & dp_type, & + & exch_proc, & + & 1, & + & comm_solve, & + & request, & + & ierr ) + + call mpi_send( w(send_start), & + & send_len, & + & dp_type, & + & exch_proc, & + & 1, & + & comm_solve, & + & ierr ) + call mpi_wait( request, status, ierr ) + if (timeron) call timer_stop(t_rcomm) + else + do j=1,exch_recv_length + r(j) = w(j) + enddo + endif + + +!--------------------------------------------------------------------- +! At this point, r contains A.z +!--------------------------------------------------------------------- + sum = 0.0d0 + do j=1, lastcol-firstcol+1 + d = x(j) - r(j) + sum = sum + d*d + enddo + +!--------------------------------------------------------------------- +! Obtain d with a sum-reduce +!--------------------------------------------------------------------- + if (timeron) call timer_start(t_rcomm) + do i = 1, l2npcols + call mpi_irecv( d, & + & 1, & + & dp_type, & + & reduce_exch_proc(i), & + & i, & + & comm_solve, & + & request, & + & ierr ) + call mpi_send( sum, & + & 1, & + & dp_type, & + & reduce_exch_proc(i), & + & i, & + & comm_solve, & + & ierr ) + call mpi_wait( request, status, ierr ) + + sum = sum + d + enddo + if (timeron) call timer_stop(t_rcomm) + d = sum + + + if( me .eq. root ) rnorm = sqrt( d ) + + if (timeron) call timer_stop(t_conjg) + + + return + end ! end of routine conj_grad + + + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + subroutine makea( n, nz, a, colidx, rowstr, nonzer, & + & firstrow, lastrow, firstcol, lastcol, & + & rcond, arow, acol, aelt, v, iv, shift ) +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + implicit none + integer n, nz, nonzer + integer firstrow, lastrow, firstcol, lastcol + integer colidx(nz), rowstr(n+1) + integer iv(2*n+1), arow(nz), acol(nz) + double precision v(n+1), aelt(nz) + double precision rcond, a(nz), shift + +!--------------------------------------------------------------------- +! generate the test problem for benchmark 6 +! makea generates a sparse matrix with a +! prescribed sparsity distribution +! +! parameter type usage +! +! input +! +! n i number of cols/rows of matrix +! nz i nonzeros as declared array size +! rcond r*8 condition number +! shift r*8 main diagonal shift +! +! output +! +! a r*8 array for nonzeros +! colidx i col indices +! rowstr i row pointers +! +! workspace +! +! iv, arow, acol i +! v, aelt r*8 +!--------------------------------------------------------------------- + + integer i, nnza, iouter, ivelt, ivelt1, irow, nzv, jcol + +!--------------------------------------------------------------------- +! nonzer is approximately (int(sqrt(nnza /n))); +!--------------------------------------------------------------------- + + double precision size, ratio, scale + external sparse, sprnvc, vecset + + size = 1.0D0 + ratio = rcond ** (1.0D0 / dfloat(n)) + nnza = 0 + +!--------------------------------------------------------------------- +! Initialize iv(n+1 .. 2n) to zero. +! Used by sprnvc to mark nonzero positions +!--------------------------------------------------------------------- + + do i = 1, n + iv(n+i) = 0 + enddo + do iouter = 1, n + nzv = nonzer + call sprnvc( n, nzv, v, colidx, iv(1), iv(n+1) ) + call vecset( n, v, colidx, nzv, iouter, .5D0 ) + do ivelt = 1, nzv + jcol = colidx(ivelt) + if (jcol.ge.firstcol .and. jcol.le.lastcol) then + scale = size * v(ivelt) + do ivelt1 = 1, nzv + irow = colidx(ivelt1) + if (irow.ge.firstrow .and. irow.le.lastrow) then + nnza = nnza + 1 + if (nnza .gt. nz) goto 9999 + acol(nnza) = jcol + arow(nnza) = irow + aelt(nnza) = v(ivelt1) * scale + endif + enddo + endif + enddo + size = size * ratio + enddo + + +!--------------------------------------------------------------------- +! ... add the identity * rcond to the generated matrix to bound +! the smallest eigenvalue from below by rcond +!--------------------------------------------------------------------- + do i = firstrow, lastrow + if (i.ge.firstcol .and. i.le.lastcol) then + iouter = n + i + nnza = nnza + 1 + if (nnza .gt. nz) goto 9999 + acol(nnza) = i + arow(nnza) = i + aelt(nnza) = rcond - shift + endif + enddo + + +!--------------------------------------------------------------------- +! ... make the sparse matrix from list of elements with duplicates +! (v and iv are used as workspace) +!--------------------------------------------------------------------- + call sparse( a, colidx, rowstr, n, arow, acol, aelt, & + & firstrow, lastrow, & + & v, iv(1), iv(n+1), nnza ) + return + + 9999 continue + write(*,*) 'Space for matrix elements exceeded in makea' + write(*,*) 'nnza, nzmax = ',nnza, nz + write(*,*) ' iouter = ',iouter + + stop + end +!-------end of makea------------------------------ + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + subroutine sparse( a, colidx, rowstr, n, arow, acol, aelt, & + & firstrow, lastrow, & + & x, mark, nzloc, nnza ) +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + implicit none + integer colidx(*), rowstr(*) + integer firstrow, lastrow + integer n, arow(*), acol(*), nnza + double precision a(*), aelt(*) + +!--------------------------------------------------------------------- +! rows range from firstrow to lastrow +! the rowstr pointers are defined for nrows = lastrow-firstrow+1 values +!--------------------------------------------------------------------- + integer nzloc(n), nrows + double precision x(n) + integer mark(n) + +!--------------------------------------------------- +! generate a sparse matrix from a list of +! [col, row, element] tri +!--------------------------------------------------- + + integer i, j, jajp1, nza, k, nzrow + double precision xi + +!--------------------------------------------------------------------- +! how many rows of result +!--------------------------------------------------------------------- + nrows = lastrow - firstrow + 1 + +!--------------------------------------------------------------------- +! ...count the number of triples in each row +!--------------------------------------------------------------------- + do j = 1, n + rowstr(j) = 0 + mark(j) = 0 + enddo + rowstr(n+1) = 0 + + do nza = 1, nnza + j = (arow(nza) - firstrow + 1) + 1 + rowstr(j) = rowstr(j) + 1 + enddo + + rowstr(1) = 1 + do j = 2, nrows+1 + rowstr(j) = rowstr(j) + rowstr(j-1) + enddo + + +!--------------------------------------------------------------------- +! ... rowstr(j) now is the location of the first nonzero +! of row j of a +!--------------------------------------------------------------------- + + +!--------------------------------------------------------------------- +! ... do a bucket sort of the triples on the row index +!--------------------------------------------------------------------- + do nza = 1, nnza + j = arow(nza) - firstrow + 1 + k = rowstr(j) + a(k) = aelt(nza) + colidx(k) = acol(nza) + rowstr(j) = rowstr(j) + 1 + enddo + + +!--------------------------------------------------------------------- +! ... rowstr(j) now points to the first element of row j+1 +!--------------------------------------------------------------------- + do j = nrows, 1, -1 + rowstr(j+1) = rowstr(j) + enddo + rowstr(1) = 1 + + +!--------------------------------------------------------------------- +! ... generate the actual output rows by adding elements +!--------------------------------------------------------------------- + nza = 0 + do i = 1, n + x(i) = 0.0 + mark(i) = 0 + enddo + + jajp1 = rowstr(1) + do j = 1, nrows + nzrow = 0 + +!--------------------------------------------------------------------- +! ...loop over the jth row of a +!--------------------------------------------------------------------- + do k = jajp1 , rowstr(j+1)-1 + i = colidx(k) + x(i) = x(i) + a(k) + if ( (mark(i) .eq. 0) .and. (x(i) .ne. 0.D0)) then + mark(i) = 1 + nzrow = nzrow + 1 + nzloc(nzrow) = i + endif + enddo + +!--------------------------------------------------------------------- +! ... extract the nonzeros of this row +!--------------------------------------------------------------------- + do k = 1, nzrow + i = nzloc(k) + mark(i) = 0 + xi = x(i) + x(i) = 0.D0 + if (xi .ne. 0.D0) then + nza = nza + 1 + a(nza) = xi + colidx(nza) = i + endif + enddo + jajp1 = rowstr(j+1) + rowstr(j+1) = nza + rowstr(1) + enddo +!C write (*, 11000) nza + return +11000 format ( //,'final nonzero count in sparse ', & + & /,'number of nonzeros = ', i16 ) + end +!-------end of sparse----------------------------- + + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + subroutine sprnvc( n, nz, v, iv, nzloc, mark ) +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + use cg_data, only : amult, tran + implicit none + + double precision v(*) + integer n, nz, iv(*), nzloc(n), nn1 + integer mark(n) + + +!--------------------------------------------------------------------- +! generate a sparse n-vector (v, iv) +! having nzv nonzeros +! +! mark(i) is set to 1 if position i is nonzero. +! mark is all zero on entry and is reset to all zero before exit +! this corrects a performance bug found by John G. Lewis, caused by +! reinitialization of mark on every one of the n calls to sprnvc +!--------------------------------------------------------------------- + + integer nzrow, nzv, ii, i, icnvrt + + external randlc, icnvrt + double precision randlc, vecelt, vecloc + + + nzv = 0 + nzrow = 0 + nn1 = 1 + 50 continue + nn1 = 2 * nn1 + if (nn1 .lt. n) goto 50 + +!--------------------------------------------------------------------- +! nn1 is the smallest power of two not less than n +!--------------------------------------------------------------------- + +100 continue + if (nzv .ge. nz) goto 110 + vecelt = randlc( tran, amult ) + +!--------------------------------------------------------------------- +! generate an integer between 1 and n in a portable manner +!--------------------------------------------------------------------- + vecloc = randlc(tran, amult) + i = icnvrt(vecloc, nn1) + 1 + if (i .gt. n) goto 100 + +!--------------------------------------------------------------------- +! was this integer generated already? +!--------------------------------------------------------------------- + if (mark(i) .eq. 0) then + mark(i) = 1 + nzrow = nzrow + 1 + nzloc(nzrow) = i + nzv = nzv + 1 + v(nzv) = vecelt + iv(nzv) = i + endif + goto 100 +110 continue + do ii = 1, nzrow + i = nzloc(ii) + mark(i) = 0 + enddo + return + end +!-------end of sprnvc----------------------------- + + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + function icnvrt(x, ipwr2) +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + implicit none + double precision x + integer ipwr2, icnvrt + +!--------------------------------------------------------------------- +! scale a double precision number x in (0,1) by a power of 2 and chop it +!--------------------------------------------------------------------- + icnvrt = int(ipwr2 * x) + + return + end +!-------end of icnvrt----------------------------- + + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + subroutine vecset(n, v, iv, nzv, i, val) +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + implicit none + integer n, iv(*), nzv, i, k + double precision v(*), val + +!--------------------------------------------------------------------- +! set ith element of sparse vector (v, iv) with +! nzv nonzeros to val +!--------------------------------------------------------------------- + + logical set + + set = .false. + do k = 1, nzv + if (iv(k) .eq. i) then + v(k) = val + set = .true. + endif + enddo + if (.not. set) then + nzv = nzv + 1 + v(nzv) = val + iv(nzv) = i + endif + return + end +!-------end of vecset----------------------------- + diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/CG/cg_data.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/CG/cg_data.f90 new file mode 100644 index 000000000..5d193f547 --- /dev/null +++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/CG/cg_data.f90 @@ -0,0 +1,161 @@ +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- +! +! cg_data module +! +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + module cg_data + + +!--------------------------------------------------------------------- +! Class specific parameters are defined in the npbparams.h file, +! which is written by the sys/setparams.c program. +!--------------------------------------------------------------------- + + include 'npbparams.h' + + + ! main_int_mem + integer, allocatable :: colidx(:), rowstr(:), & + & iv(:), arow(:), acol(:) + + ! main_flt_mem + double precision, allocatable :: & + & v(:), aelt(:), a(:), & + & x(:), & + & z(:), & + & p(:), & + & q(:), & + & r(:), & + & w(:) + + ! urando + double precision amult, tran + + + ! process grid + integer num_procs, num_proc_rows, num_proc_cols + + ! number of nonzeros after partition + integer nz + + ! partit_size + integer naa, nzz, & + & npcols, nprows, & + & proc_col, proc_row, & + & firstrow, & + & lastrow, & + & firstcol, & + & lastcol, & + & exch_proc, & + & exch_recv_length, & + & send_start, & + & send_len + + ! work arrays for reduction + integer l2npcols + integer, allocatable :: & + & reduce_exch_proc(:), & + & reduce_send_starts(:), & + & reduce_send_lengths(:), & + & reduce_recv_starts(:), & + & reduce_recv_lengths(:) + + + end module cg_data + + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- +! +! timing module +! +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + module timing + + integer t_total, t_conjg, t_rcomm, t_ncomm, t_last + parameter (t_total=1, t_conjg=2, t_rcomm=3, t_ncomm=4, t_last=4) + logical timeron + + end module timing + + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine alloc_space + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + +!--------------------------------------------------------------------- +! allocate space dynamically for data arrays +!--------------------------------------------------------------------- + + use cg_data + use mpinpb + + implicit none + + integer(8) naz + integer ios, ierr + + +!--------------------------------------------------------------------- +! set up dimension parameters after partition +!--------------------------------------------------------------------- + + naz = na ! to avoid integer overflow + naz = naz*(nonzer+1)/num_procs*(nonzer+1)+nonzer & + & + naz*(nonzer+2+num_procs/256)/num_proc_cols + nz = naz + if (nz .ne. naz) then + write(*,*) 'Error: integer overflow', nz, naz + call MPI_Abort(MPI_COMM_WORLD, MPI_ERR_OTHER, ierr) + endif + + naa = na / num_proc_rows + nzz = nz + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + allocate ( & + & colidx(nz), & + & rowstr(na+1), & + & iv(2*na+1), & + & arow(nz), & + & acol(nz), & + & stat = ios) + + if (ios .eq. 0) allocate ( & + & v(na+1), aelt(nz), a(nz), & + & x(naa+2), & + & z(naa+2), & + & p(naa+2), & + & q(naa+2), & + & r(naa+2), & + & w(naa+2), & + & stat = ios) + + if (ios .eq. 0) allocate ( & + & reduce_exch_proc(num_proc_cols), & + & reduce_send_starts(num_proc_cols), & + & reduce_send_lengths(num_proc_cols), & + & reduce_recv_starts(num_proc_cols), & + & reduce_recv_lengths(num_proc_cols), & + & stat = ios) + + if (ios .ne. 0) then + write(*,*) 'Error encountered in allocating space' + call MPI_Abort(MPI_COMM_WORLD, MPI_ERR_OTHER, ierr) + stop + endif + + return + end + diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/CG/mpinpb.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/CG/mpinpb.f90 new file mode 100644 index 000000000..1a34c2924 --- /dev/null +++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/CG/mpinpb.f90 @@ -0,0 +1,17 @@ +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- +! +! mpinpb module +! +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + module mpinpb + + include 'mpif.h' + + integer me, nprocs, total_nodes, root, comm_solve, dp_type + logical active + + end module mpinpb + diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/DT/DGraph.c b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/DT/DGraph.c new file mode 100644 index 000000000..3c7a57819 --- /dev/null +++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/DT/DGraph.c @@ -0,0 +1,184 @@ +#include +#include +#include + +#include "DGraph.h" + +DGArc *newArc(DGNode *tl,DGNode *hd){ + DGArc *ar=(DGArc *)malloc(sizeof(DGArc)); + ar->tail=tl; + ar->head=hd; + return ar; +} +void arcShow(DGArc *ar){ + DGNode *tl=(DGNode *)ar->tail, + *hd=(DGNode *)ar->head; + fprintf(stderr,"%d. |%s ->%s\n",ar->id,tl->name,hd->name); +} + +DGNode *newNode(char *nm){ + DGNode *nd=(DGNode *)malloc(sizeof(DGNode)); + nd->attribute=0; + nd->color=0; + nd->inDegree=0; + nd->outDegree=0; + nd->maxInDegree=SMALL_BLOCK_SIZE; + nd->maxOutDegree=SMALL_BLOCK_SIZE; + nd->inArc=(DGArc **)malloc(nd->maxInDegree*sizeof(DGArc*)); + nd->outArc=(DGArc **)malloc(nd->maxOutDegree*sizeof(DGArc*)); + nd->name=strdup(nm); + nd->feat=NULL; + return nd; +} +void nodeShow(DGNode* nd){ + fprintf( stderr,"%3d.%s: (%d,%d)\n", + nd->id,nd->name,nd->inDegree,nd->outDegree); +/* + if(nd->verified==1) fprintf(stderr,"%ld.%s\t: usable.",nd->id,nd->name); + else if(nd->verified==0) fprintf(stderr,"%ld.%s\t: unusable.",nd->id,nd->name); + else fprintf(stderr,"%ld.%s\t: notverified.",nd->id,nd->name); +*/ +} + +DGraph* newDGraph(char* nm){ + DGraph *dg=(DGraph *)malloc(sizeof(DGraph)); + dg->numNodes=0; + dg->numArcs=0; + dg->maxNodes=BLOCK_SIZE; + dg->maxArcs=BLOCK_SIZE; + dg->node=(DGNode **)malloc(dg->maxNodes*sizeof(DGNode*)); + dg->arc=(DGArc **)malloc(dg->maxArcs*sizeof(DGArc*)); + dg->name=strdup(nm); + return dg; +} +int AttachNode(DGraph* dg, DGNode* nd) { + int i=0,j,len=0; + DGNode **nds =NULL, *tmpnd=NULL; + DGArc **ar=NULL; + + if (dg->numNodes == dg->maxNodes-1 ) { + dg->maxNodes += BLOCK_SIZE; + nds =(DGNode **) calloc(dg->maxNodes,sizeof(DGNode*)); + memcpy(nds,dg->node,(dg->maxNodes-BLOCK_SIZE)*sizeof(DGNode*)); + free(dg->node); + dg->node=nds; + } + + len = strlen( nd->name); + for (i = 0; i < dg->numNodes; i++) { + tmpnd =dg->node[ i]; + ar=NULL; + if ( strlen( tmpnd->name) != len ) continue; + if ( strncmp( nd->name, tmpnd->name, len) ) continue; + if ( nd->inDegree > 0 ) { + tmpnd->maxInDegree += nd->maxInDegree; + ar =(DGArc **) calloc(tmpnd->maxInDegree,sizeof(DGArc*)); + memcpy(ar,tmpnd->inArc,(tmpnd->inDegree)*sizeof(DGArc*)); + free(tmpnd->inArc); + tmpnd->inArc=ar; + for (j = 0; j < nd->inDegree; j++ ) { + nd->inArc[ j]->head = tmpnd; + } + memcpy( &(tmpnd->inArc[ tmpnd->inDegree]), nd->inArc, nd->inDegree*sizeof( DGArc *)); + tmpnd->inDegree += nd->inDegree; + } + if ( nd->outDegree > 0 ) { + tmpnd->maxOutDegree += nd->maxOutDegree; + ar =(DGArc **) calloc(tmpnd->maxOutDegree,sizeof(DGArc*)); + memcpy(ar,tmpnd->outArc,(tmpnd->outDegree)*sizeof(DGArc*)); + free(tmpnd->outArc); + tmpnd->outArc=ar; + for (j = 0; j < nd->outDegree; j++ ) { + nd->outArc[ j]->tail = tmpnd; + } + memcpy( &(tmpnd->outArc[tmpnd->outDegree]),nd->outArc,nd->outDegree*sizeof( DGArc *)); + tmpnd->outDegree += nd->outDegree; + } + free(nd); + return i; + } + nd->id = dg->numNodes; + dg->node[dg->numNodes] = nd; + dg->numNodes++; +return nd->id; +} +int AttachArc(DGraph *dg,DGArc* nar){ +int arcId = -1; +int i=0,newNumber=0; +DGNode *head = nar->head, + *tail = nar->tail; +DGArc **ars=NULL,*probe=NULL; +/*fprintf(stderr,"AttachArc %ld\n",dg->numArcs); */ + if ( !tail || !head ) return arcId; + if ( dg->numArcs == dg->maxArcs-1 ) { + dg->maxArcs += BLOCK_SIZE; + ars =(DGArc **) calloc(dg->maxArcs,sizeof(DGArc*)); + memcpy(ars,dg->arc,(dg->maxArcs-BLOCK_SIZE)*sizeof(DGArc*)); + free(dg->arc); + dg->arc=ars; + } + for(i = 0; i < tail->outDegree; i++ ) { /* parallel arc */ + probe = tail->outArc[ i]; + if(probe->head == head + && + probe->length == nar->length + ){ + free(nar); + return probe->id; + } + } + + nar->id = dg->numArcs; + arcId=dg->numArcs; + dg->arc[dg->numArcs] = nar; + dg->numArcs++; + + head->inArc[ head->inDegree] = nar; + head->inDegree++; + if ( head->inDegree >= head->maxInDegree ) { + newNumber = head->maxInDegree + SMALL_BLOCK_SIZE; + ars =(DGArc **) calloc(newNumber,sizeof(DGArc*)); + memcpy(ars,head->inArc,(head->inDegree)*sizeof(DGArc*)); + free(head->inArc); + head->inArc=ars; + head->maxInDegree = newNumber; + } + tail->outArc[ tail->outDegree] = nar; + tail->outDegree++; + if(tail->outDegree >= tail->maxOutDegree ) { + newNumber = tail->maxOutDegree + SMALL_BLOCK_SIZE; + ars =(DGArc **) calloc(newNumber,sizeof(DGArc*)); + memcpy(ars,tail->outArc,(tail->outDegree)*sizeof(DGArc*)); + free(tail->outArc); + tail->outArc=ars; + tail->maxOutDegree = newNumber; + } +/*fprintf(stderr,"AttachArc: head->in=%d tail->out=%ld\n",head->inDegree,tail->outDegree);*/ +return arcId; +} +void graphShow(DGraph *dg,int DetailsLevel){ + int i=0,j=0; + fprintf(stderr," %d.%s: (%d,%d)\n",dg->id,dg->name,dg->numNodes,dg->numArcs); + if ( DetailsLevel < 1) return; + for (i = 0; i < dg->numNodes; i++ ) { + DGNode *focusNode = dg->node[ i]; + if(DetailsLevel >= 2) { + for (j = 0; j < focusNode->inDegree; j++ ) { + fprintf(stderr,"\t "); + nodeShow(focusNode->inArc[ j]->tail); + } + } + nodeShow(focusNode); + if ( DetailsLevel < 2) continue; + for (j = 0; j < focusNode->outDegree; j++ ) { + fprintf(stderr, "\t "); + nodeShow(focusNode->outArc[ j]->head); + } + fprintf(stderr, "---\n"); + } + fprintf(stderr,"----------------------------------------\n"); + if ( DetailsLevel < 3) return; +} + + + diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/DT/DGraph.h b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/DT/DGraph.h new file mode 100644 index 000000000..f38f898b2 --- /dev/null +++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/DT/DGraph.h @@ -0,0 +1,43 @@ +#ifndef _DGRAPH +#define _DGRAPH + +#define BLOCK_SIZE 128 +#define SMALL_BLOCK_SIZE 32 + +typedef struct{ + int id; + void *tail,*head; + int length,width,attribute,maxWidth; +}DGArc; + +typedef struct{ + int maxInDegree,maxOutDegree; + int inDegree,outDegree; + int id; + char *name; + DGArc **inArc,**outArc; + int depth,height,width; + int color,attribute,address,verified; + void *feat; +}DGNode; + +typedef struct{ + int maxNodes,maxArcs; + int id; + char *name; + int numNodes,numArcs; + DGNode **node; + DGArc **arc; +} DGraph; + +DGArc *newArc(DGNode *tl,DGNode *hd); +void arcShow(DGArc *ar); +DGNode *newNode(char *nm); +void nodeShow(DGNode* nd); + +DGraph* newDGraph(char *nm); +int AttachNode(DGraph *dg,DGNode *nd); +int AttachArc(DGraph *dg,DGArc* nar); +void graphShow(DGraph *dg,int DetailsLevel); + +#endif diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/DT/Makefile b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/DT/Makefile new file mode 100644 index 000000000..687ac3324 --- /dev/null +++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/DT/Makefile @@ -0,0 +1,26 @@ +SHELL=/bin/sh +BENCHMARK=dt +BENCHMARKU=DT + +include ../config/make.def + +include ../sys/make.common +#Override PROGRAM +DTPROGRAM = $(BINDIR)/$(BENCHMARK).$(CLASS).x + +OBJS = dt.o DGraph.o \ + ${COMMON}/c_print_results.o ${COMMON}/c_timers.o ${COMMON}/c_randdp.o + + +${PROGRAM}: config ${OBJS} + ${CLINK} ${CLINKFLAGS} -o ${DTPROGRAM} ${OBJS} ${CMPI_LIB} + +.c.o: + ${CCOMPILE} $< + +dt.o: dt.c npbparams.h +DGraph.o: DGraph.c DGraph.h + +clean: + - rm -f *.o *~ mputil* + - rm -f dt npbparams.h core diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/DT/README b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/DT/README new file mode 100644 index 000000000..873e3ae6f --- /dev/null +++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/DT/README @@ -0,0 +1,22 @@ +Data Traffic benchmark DT is new in the NPB suite +(released as part of NPB3.x-MPI package). +---------------------------------------------------- + +DT is written in C and same executable can run on any number of processors, +provided this number is not less than the number of nodes in the communication +graph. DT benchmark takes one argument: BH, WH, or SH. This argument +specifies the communication graph Black Hole, White Hole, or SHuffle +respectively. The current release contains verification numbers for +CLASSES S, W, A, and B only. Classes C and D are defined, but verification +numbers are not provided in this release. + +The following table summarizes the number of nodes in the communication +graph based on CLASS and graph TYPE. + +CLASS N_Source N_Nodes(BH,WH) N_Nodes(SH) + S 4 5 12 + W 8 11 32 + A 16 21 80 + B 32 43 192 + C 64 85 448 + D 128 171 1024 diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/DT/dt.c b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/DT/dt.c new file mode 100644 index 000000000..281979d31 --- /dev/null +++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/DT/dt.c @@ -0,0 +1,755 @@ +/************************************************************************* + * * + * N A S P A R A L L E L B E N C H M A R K S 3.4 * + * * + * D T * + * * + ************************************************************************* + * * + * This benchmark is part of the NAS Parallel Benchmark 3.4 suite. * + * * + * Permission to use, copy, distribute and modify this software * + * for any purpose with or without fee is hereby granted. We * + * request, however, that all derived work reference the NAS * + * Parallel Benchmarks 3.4. This software is provided "as is" * + * without express or implied warranty. * + * * + * Information on NPB 3.4, including the technical report, the * + * original specifications, source code, results and information * + * on how to submit new results, is available at: * + * * + * http: www.nas.nasa.gov/Software/NPB * + * * + * Send comments or suggestions to npb@nas.nasa.gov * + * Send bug reports to npb-bugs@nas.nasa.gov * + * * + * NAS Parallel Benchmarks Group * + * NASA Ames Research Center * + * Mail Stop: T27A-1 * + * Moffett Field, CA 94035-1000 * + * * + * E-mail: npb@nas.nasa.gov * + * Fax: (650) 604-3957 * + * * + ************************************************************************* + * * + * Author: M. Frumkin * * + * * + *************************************************************************/ + +#include +#include +#include + +#include "mpi.h" +#include "npbparams.h" + +#ifndef CLASS +#define CLASS 'S' +#endif + +int passed_verification; +extern double randlc( double *X, double *A ); +extern +void c_print_results( char *name, + char class, + int n1, + int n2, + int n3, + int niter, + int nprocs_compiled, + int nprocs_total, + double t, + double mops, + char *optype, + int passed_verification, + char *npbversion, + char *compiletime, + char *mpicc, + char *clink, + char *cmpi_lib, + char *cmpi_inc, + char *cflags, + char *clinkflags ); + +#include "../common/c_timers.h" +int timer_on=0,timers_tot=64; + +int verify(char *bmname,double rnm2){ + double verify_value=0.0; + double epsilon=1.0E-8; + char cls=CLASS; + int verified=-1; + if (cls != 'U') { + if(cls=='S') { + if(strstr(bmname,"BH")){ + verify_value=30892725.0; + }else if(strstr(bmname,"WH")){ + verify_value=67349758.0; + }else if(strstr(bmname,"SH")){ + verify_value=58875767.0; + }else{ + fprintf(stderr,"No such benchmark as %s.\n",bmname); + } + verified = 0; + }else if(cls=='W') { + if(strstr(bmname,"BH")){ + verify_value = 4102461.0; + }else if(strstr(bmname,"WH")){ + verify_value = 204280762.0; + }else if(strstr(bmname,"SH")){ + verify_value = 186944764.0; + }else{ + fprintf(stderr,"No such benchmark as %s.\n",bmname); + } + verified = 0; + }else if(cls=='A') { + if(strstr(bmname,"BH")){ + verify_value = 17809491.0; + }else if(strstr(bmname,"WH")){ + verify_value = 1289925229.0; + }else if(strstr(bmname,"SH")){ + verify_value = 610856482.0; + }else{ + fprintf(stderr,"No such benchmark as %s.\n",bmname); + } + verified = 0; + }else if(cls=='B') { + if(strstr(bmname,"BH")){ + verify_value = 4317114.0; + }else if(strstr(bmname,"WH")){ + verify_value = 7877279917.0; + }else if(strstr(bmname,"SH")){ + verify_value = 1836863082.0; + }else{ + fprintf(stderr,"No such benchmark as %s.\n",bmname); + verified = 0; + } + }else if(cls=='C') { + if(strstr(bmname,"BH")){ + verify_value = 0.0; + }else if(strstr(bmname,"WH")){ + verify_value = 0.0; + }else if(strstr(bmname,"SH")){ + verify_value = 0.0; + }else{ + fprintf(stderr,"No such benchmark as %s.\n",bmname); + verified = -1; + } + }else if(cls=='D') { + if(strstr(bmname,"BH")){ + verify_value = 0.0; + }else if(strstr(bmname,"WH")){ + verify_value = 0.0; + }else if(strstr(bmname,"SH")){ + verify_value = 0.0; + }else{ + fprintf(stderr,"No such benchmark as %s.\n",bmname); + } + verified = -1; + }else{ + fprintf(stderr,"No such class as %c.\n",cls); + } + fprintf(stderr," %s L2 Norm = %f\n",bmname,rnm2); + if(verified==-1){ + fprintf(stderr," No verification was performed.\n"); + }else if( rnm2 - verify_value < epsilon && + rnm2 - verify_value > -epsilon) { /* abs here does not work on ALTIX */ + verified = 1; + fprintf(stderr," Deviation = %f\n",(rnm2 - verify_value)); + }else{ + verified = 0; + fprintf(stderr," The correct verification value = %f\n",verify_value); + fprintf(stderr," Got value = %f\n",rnm2); + } + }else{ + verified = -1; + } + return verified; + } + +int ipowMod(int a,long long int n,int md){ + int seed=1,q=a,r=1; + if(n<0){ + fprintf(stderr,"ipowMod: exponent must be nonnegative exp=%lld\n",n); + n=-n; /* temp fix */ +/* return 1; */ + } + if(md<=0){ + fprintf(stderr,"ipowMod: module must be positive mod=%d",md); + return 1; + } + if(n==0) return 1; + while(n>1){ + int n2 = n/2; + if (n2*2==n){ + seed = (q*q)%md; + q=seed; + n = n2; + }else{ + seed = (r*q)%md; + r=seed; + n = n-1; + } + } + seed = (r*q)%md; + return seed; +} + +#include "DGraph.h" +DGraph *buildSH(char cls){ +/* + Nodes of the graph must be topologically sorted + to avoid MPI deadlock. +*/ + DGraph *dg; + int numSources=NUM_SOURCES; /* must be power of 2 */ + int numOfLayers=0,tmpS=numSources>>1; + int firstLayerNode=0; + DGArc *ar=NULL; + DGNode *nd=NULL; + int mask=0x0,ndid=0,ndoff=0; + int i=0,j=0; + char nm[BLOCK_SIZE]; + + sprintf(nm,"DT_SH.%c",cls); + dg=newDGraph(nm); + + while(tmpS>1){ + numOfLayers++; + tmpS>>=1; + } + for(i=0;inode[ndid],nd); + AttachArc(dg,ar); + ndoff+=mask; + ndid=firstLayerNode+ndoff; + ar=newArc(dg->node[ndid],nd); + AttachArc(dg,ar); + } + firstLayerNode+=numSources; + } + mask=0x00000001<node[ndid],nd); + AttachArc(dg,ar); + ndoff+=mask; + ndid=firstLayerNode+ndoff; + ar=newArc(dg->node[ndid],nd); + AttachArc(dg,ar); + } +return dg; +} +DGraph *buildWH(char cls){ +/* + Nodes of the graph must be topologically sorted + to avoid MPI deadlock. +*/ + int i=0,j=0; + int numSources=NUM_SOURCES,maxInDeg=4; + int numLayerNodes=numSources,firstLayerNode=0; + int totComparators=0; + int numPrevLayerNodes=numLayerNodes; + int id=0,sid=0; + DGraph *dg; + DGNode *nd=NULL,*source=NULL,*tmp=NULL,*snd=NULL; + DGArc *ar=NULL; + char nm[BLOCK_SIZE]; + + sprintf(nm,"DT_WH.%c",cls); + dg=newDGraph(nm); + + for(i=0;imaxInDeg){ + numLayerNodes=numLayerNodes/maxInDeg; + if(numLayerNodes*maxInDeg=numPrevLayerNodes) break; + snd=dg->node[firstLayerNode+sid]; + ar=newArc(dg->node[id],snd); + AttachArc(dg,ar); + } + } + firstLayerNode+=numPrevLayerNodes; + numPrevLayerNodes=numLayerNodes; + } + source=newNode("Source"); + AttachNode(dg,source); + for(i=0;inode[firstLayerNode+i]; + ar=newArc(source,nd); + AttachArc(dg,ar); + } + + for(i=0;inumNodes/2;i++){ /* Topological sorting */ + tmp=dg->node[i]; + dg->node[i]=dg->node[dg->numNodes-1-i]; + dg->node[i]->id=i; + dg->node[dg->numNodes-1-i]=tmp; + dg->node[dg->numNodes-1-i]->id=dg->numNodes-1-i; + } +return dg; +} +DGraph *buildBH(char cls){ +/* + Nodes of the graph must be topologically sorted + to avoid MPI deadlock. +*/ + int i=0,j=0; + int numSources=NUM_SOURCES,maxInDeg=4; + int numLayerNodes=numSources,firstLayerNode=0; + DGraph *dg; + DGNode *nd=NULL, *snd=NULL, *sink=NULL; + DGArc *ar=NULL; + int totComparators=0; + int numPrevLayerNodes=numLayerNodes; + int id=0, sid=0; + char nm[BLOCK_SIZE]; + + sprintf(nm,"DT_BH.%c",cls); + dg=newDGraph(nm); + + for(i=0;imaxInDeg){ + numLayerNodes=numLayerNodes/maxInDeg; + if(numLayerNodes*maxInDeg=numPrevLayerNodes) break; + snd=dg->node[firstLayerNode+sid]; + ar=newArc(snd,dg->node[id]); + AttachArc(dg,ar); + } + } + firstLayerNode+=numPrevLayerNodes; + numPrevLayerNodes=numLayerNodes; + } + sink=newNode("Sink"); + AttachNode(dg,sink); + for(i=0;inode[firstLayerNode+i]; + ar=newArc(nd,sink); + AttachArc(dg,ar); + } +return dg; +} + +typedef struct{ + int len; + double* val; +} Arr; +Arr *newArr(int len){ + Arr *arr=(Arr *)malloc(sizeof(Arr)); + arr->len=len; + arr->val=(double *)malloc(len*sizeof(double)); + return arr; +} +void arrShow(Arr* a){ + if(!a) fprintf(stderr,"-- NULL array\n"); + else{ + fprintf(stderr,"-- length=%d\n",a->len); + } +} +double CheckVal(Arr *feat){ + double csum=0.0; + int i=0; + for(i=0;ilen;i++){ + csum+=feat->val[i]*feat->val[i]/feat->len; /* The truncation does not work since + result will be 0 for large len */ + } + return csum; +} +int GetFNumDPar(int* mean, int* stdev){ + *mean=NUM_SAMPLES; + *stdev=STD_DEVIATION; + return 0; +} +int GetFeatureNum(char *mbname,int id){ + double tran=314159265.0; + double A=2*id+1; + double denom=randlc(&tran,&A); + char cval='S'; + int mean=NUM_SAMPLES,stdev=128; + int rtfs=0,len=0; + GetFNumDPar(&mean,&stdev); + rtfs=ipowMod((int)(1/denom)*(int)cval,(long long int) (2*id+1),2*stdev); + if(rtfs<0) rtfs=-rtfs; + len=mean-stdev+rtfs; + return len; +} +Arr* RandomFeatures(char *bmname,int fdim,int id){ + int len=GetFeatureNum(bmname,id)*fdim; + Arr* feat=newArr(len); + int nxg=2,nyg=2,nzg=2,nfg=5; + int nx=421,ny=419,nz=1427,nf=3527; + long long int expon=(len*(id+1))%3141592; + int seedx=ipowMod(nxg,expon,nx), + seedy=ipowMod(nyg,expon,ny), + seedz=ipowMod(nzg,expon,nz), + seedf=ipowMod(nfg,expon,nf); + int i=0; + if(timer_on){ + timer_clear(id+1); + timer_start(id+1); + } + for(i=0;ival[i]=seedx; + feat->val[i+1]=seedy; + feat->val[i+2]=seedz; + feat->val[i+3]=seedf; + } + if(timer_on){ + timer_stop(id+1); + fprintf(stderr,"** RandomFeatures time in node %d = %f\n",id,timer_read(id+1)); + } + return feat; +} +void Resample(Arr *a,int blen){ + long long int i=0,j=0,jlo=0,jhi=0; + double avval=0.0; + double *nval=(double *)malloc(blen*sizeof(double)); + Arr *tmp=newArr(10); + for(i=0;ilen-1;i++){ + jlo=(int)(0.5*(2*i-1)*(blen/a->len)); + jhi=(int)(0.5*(2*i+1)*(blen/a->len)); + + avval=a->val[i]/(jhi-jlo+1); + for(j=jlo;j<=jhi;j++){ + nval[j]+=avval; + } + } + nval[0]=a->val[0]; + nval[blen-1]=a->val[a->len-1]; + free(a->val); + a->val=nval; + a->len=blen; +} +#define fielddim 4 +Arr* WindowFilter(Arr *a, Arr* b,int w){ + int i=0,j=0,k=0; + double rms0=0.0,rms1=0.0,rmsm1=0.0; + double weight=((double) (w+1))/(w+2); + + w+=1; + if(timer_on){ + timer_clear(w); + timer_start(w); + } + if(a->lenlen) Resample(a,b->len); + if(a->len>b->len) Resample(b,a->len); + for(i=fielddim;ilen-fielddim;i+=fielddim){ + rms0=(a->val[i]-b->val[i])*(a->val[i]-b->val[i]) + +(a->val[i+1]-b->val[i+1])*(a->val[i+1]-b->val[i+1]) + +(a->val[i+2]-b->val[i+2])*(a->val[i+2]-b->val[i+2]) + +(a->val[i+3]-b->val[i+3])*(a->val[i+3]-b->val[i+3]); + j=i+fielddim; + rms1=(a->val[j]-b->val[j])*(a->val[j]-b->val[j]) + +(a->val[j+1]-b->val[j+1])*(a->val[j+1]-b->val[j+1]) + +(a->val[j+2]-b->val[j+2])*(a->val[j+2]-b->val[j+2]) + +(a->val[j+3]-b->val[j+3])*(a->val[j+3]-b->val[j+3]); + j=i-fielddim; + rmsm1=(a->val[j]-b->val[j])*(a->val[j]-b->val[j]) + +(a->val[j+1]-b->val[j+1])*(a->val[j+1]-b->val[j+1]) + +(a->val[j+2]-b->val[j+2])*(a->val[j+2]-b->val[j+2]) + +(a->val[j+3]-b->val[j+3])*(a->val[j+3]-b->val[j+3]); + k=0; + if(rms1val[i]=weight*b->val[i]; + a->val[i+1]=weight*b->val[i+1]; + a->val[i+2]=weight*b->val[i+2]; + a->val[i+3]=weight*b->val[i+3]; + }else if(k==1){ + j=i+fielddim; + a->val[i]=weight*b->val[j]; + a->val[i+1]=weight*b->val[j+1]; + a->val[i+2]=weight*b->val[j+2]; + a->val[i+3]=weight*b->val[j+3]; + }else { /*if(k==-1)*/ + j=i-fielddim; + a->val[i]=weight*b->val[j]; + a->val[i+1]=weight*b->val[j+1]; + a->val[i+2]=weight*b->val[j+2]; + a->val[i+3]=weight*b->val[j+3]; + } + } + if(timer_on){ + timer_stop(w); + fprintf(stderr,"** WindowFilter time in node %d = %f\n",(w-1),timer_read(w)); + } + return a; +} + +int SendResults(DGraph *dg,DGNode *nd,Arr *feat){ + int i=0,tag=0; + DGArc *ar=NULL; + DGNode *head=NULL; + if(!feat) return 0; + for(i=0;ioutDegree;i++){ + ar=nd->outArc[i]; + if(ar->tail!=nd) continue; + head=ar->head; + tag=ar->id; + if(head->address!=nd->address){ + MPI_Send(&feat->len,1,MPI_INT,head->address,tag,MPI_COMM_WORLD); + MPI_Send(feat->val,feat->len,MPI_DOUBLE,head->address,tag,MPI_COMM_WORLD); + } + } + return 1; +} +Arr* CombineStreams(DGraph *dg,DGNode *nd){ + Arr *resfeat=newArr(NUM_SAMPLES*fielddim); + int i=0,len=0,tag=0; + DGArc *ar=NULL; + DGNode *tail=NULL; + MPI_Status status; + Arr *feat=NULL,*featp=NULL; + + if(nd->inDegree==0) return NULL; + for(i=0;iinDegree;i++){ + ar=nd->inArc[i]; + if(ar->head!=nd) continue; + tail=ar->tail; + if(tail->address!=nd->address){ + len=0; + tag=ar->id; + MPI_Recv(&len,1,MPI_INT,tail->address,tag,MPI_COMM_WORLD,&status); + feat=newArr(len); + MPI_Recv(feat->val,feat->len,MPI_DOUBLE,tail->address,tag,MPI_COMM_WORLD,&status); + resfeat=WindowFilter(resfeat,feat,nd->id); + free(feat); + }else{ + featp=(Arr *)tail->feat; + feat=newArr(featp->len); + memcpy(feat->val,featp->val,featp->len*sizeof(double)); + resfeat=WindowFilter(resfeat,feat,nd->id); + free(feat); + } + } + for(i=0;ilen;i++) resfeat->val[i]=((int)resfeat->val[i])/nd->inDegree; + nd->feat=resfeat; + return nd->feat; +} +double Reduce(Arr *a,int w){ + double retv=0.0; + if(timer_on){ + timer_clear(w); + timer_start(w); + } + retv=(int)(w*CheckVal(a));/* The casting needed for node + and array dependent verifcation */ + if(timer_on){ + timer_stop(w); + fprintf(stderr,"** Reduce time in node %d = %f\n",(w-1),timer_read(w)); + } + return retv; +} + +double ReduceStreams(DGraph *dg,DGNode *nd){ + double csum=0.0; + int i=0,len=0,tag=0; + DGArc *ar=NULL; + DGNode *tail=NULL; + Arr *feat=NULL; + double retv=0.0; + + for(i=0;iinDegree;i++){ + ar=nd->inArc[i]; + if(ar->head!=nd) continue; + tail=ar->tail; + if(tail->address!=nd->address){ + MPI_Status status; + len=0; + tag=ar->id; + MPI_Recv(&len,1,MPI_INT,tail->address,tag,MPI_COMM_WORLD,&status); + feat=newArr(len); + MPI_Recv(feat->val,feat->len,MPI_DOUBLE,tail->address,tag,MPI_COMM_WORLD,&status); + csum+=Reduce(feat,(nd->id+1)); + free(feat); + }else{ + csum+=Reduce(tail->feat,(nd->id+1)); + } + } + if(nd->inDegree>0)csum=(((long long int)csum)/nd->inDegree); + retv=(nd->id+1)*csum; + return retv; +} + +int ProcessNodes(DGraph *dg,int me){ + double chksum=0.0; + Arr *feat=NULL; + int i=0,verified=0,tag; + DGNode *nd=NULL; + double rchksum=0.0; + MPI_Status status; + + for(i=0;inumNodes;i++){ + nd=dg->node[i]; + if(nd->address!=me) continue; + if(strstr(nd->name,"Source")){ + nd->feat=RandomFeatures(dg->name,fielddim,nd->id); + SendResults(dg,nd,nd->feat); + }else if(strstr(nd->name,"Sink")){ + chksum=ReduceStreams(dg,nd); + tag=dg->numArcs+nd->id; /* make these to avoid clash with arc tags */ + MPI_Send(&chksum,1,MPI_DOUBLE,0,tag,MPI_COMM_WORLD); + }else{ + feat=CombineStreams(dg,nd); + SendResults(dg,nd,feat); + } + } + if(me==0){ /* Report node */ + rchksum=0.0; + chksum=0.0; + for(i=0;inumNodes;i++){ + nd=dg->node[i]; + if(!strstr(nd->name,"Sink")) continue; + tag=dg->numArcs+nd->id; /* make these to avoid clash with arc tags */ + MPI_Recv(&rchksum,1,MPI_DOUBLE,nd->address,tag,MPI_COMM_WORLD,&status); + chksum+=rchksum; + } + verified=verify(dg->name,chksum); + } +return verified; +} + +int main(int argc,char **argv ){ + int my_rank,comm_size; + int i; + DGraph *dg=NULL; + int verified=0, featnum=0; + double bytes_sent=2.0,tot_time=0.0; + + MPI_Init( &argc, &argv ); + MPI_Comm_rank( MPI_COMM_WORLD, &my_rank ); + MPI_Comm_size( MPI_COMM_WORLD, &comm_size ); + + if(argc!=2|| + ( strncmp(argv[1],"BH",2)!=0 + &&strncmp(argv[1],"WH",2)!=0 + &&strncmp(argv[1],"SH",2)!=0 + ) + ){ + if(my_rank==0){ + fprintf(stderr,"** Usage: mpirun -np N ../bin/dt.S GraphName\n"); + fprintf(stderr,"** Where \n - N is integer number of MPI processes\n"); + fprintf(stderr," - S is the class S, W, or A \n"); + fprintf(stderr," - GraphName is the communication graph name BH, WH, or SH.\n"); + fprintf(stderr," - the number of MPI processes N should not be be less than \n"); + fprintf(stderr," the number of nodes in the graph\n"); + } + MPI_Finalize(); + exit(1); + } + if(strncmp(argv[1],"BH",2)==0){ + dg=buildBH(CLASS); + }else if(strncmp(argv[1],"WH",2)==0){ + dg=buildWH(CLASS); + }else if(strncmp(argv[1],"SH",2)==0){ + dg=buildSH(CLASS); + } + + if(timer_on&&dg->numNodes+1>timers_tot){ + timer_on=0; + if(my_rank==0) + fprintf(stderr,"Not enough timers. Node timeing is off. \n"); + } + if(dg->numNodes>comm_size){ + if(my_rank==0){ + fprintf(stderr,"** The number of MPI processes should not be less than \n"); + fprintf(stderr,"** the number of nodes in the graph\n"); + fprintf(stderr,"** Number of MPI processes = %d\n",comm_size); + fprintf(stderr,"** Number nodes in the graph = %d\n",dg->numNodes); + } + MPI_Finalize(); + exit(1); + } + for(i=0;inumNodes;i++){ + dg->node[i]->address=i; + } + if( my_rank == 0 ){ + printf( "\n\n NAS Parallel Benchmarks 3.4 -- DT Benchmark\n\n" ); + graphShow(dg,0); + timer_clear(0); + timer_start(0); + } + verified=ProcessNodes(dg,my_rank); + + featnum=NUM_SAMPLES*fielddim; + bytes_sent=featnum*dg->numArcs; + bytes_sent/=1048576; + if(my_rank==0){ + timer_stop(0); + tot_time=timer_read(0); + c_print_results( dg->name, + CLASS, + featnum, + 0, + 0, + dg->numNodes, + 0, + comm_size, + tot_time, + bytes_sent/tot_time, + "bytes transmitted", + verified, + NPBVERSION, + COMPILETIME, + MPICC, + CLINK, + CMPI_LIB, + CMPI_INC, + CFLAGS, + CLINKFLAGS ); + } + MPI_Finalize(); + return 0; +} diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/EP/Makefile b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/EP/Makefile new file mode 100644 index 000000000..b77a4e80d --- /dev/null +++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/EP/Makefile @@ -0,0 +1,28 @@ +SHELL=/bin/sh +BENCHMARK=ep +BENCHMARKU=EP + +include ../config/make.def + +OBJS = ep.o ep_data.o verify.o mpinpb.o \ + ${COMMON}/print_results.o ${COMMON}/${RAND}.o ${COMMON}/timers.o + +include ../sys/make.common + +${PROGRAM}: config ${OBJS} + ${FLINK} ${FLINKFLAGS} -o ${PROGRAM} ${OBJS} ${FMPI_LIB} + +.f90.o: + ${FCOMPILE} $< + +ep.o: ep.f90 ep_data.o mpinpb.o +ep_data.o: ep_data.f90 npbparams.h +verify.o: verify.f90 +mpinpb.o: mpinpb.f90 + +clean: + - rm -f *.o *~ *.mod + - rm -f npbparams.h core + + + diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/EP/README b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/EP/README new file mode 100644 index 000000000..6eb36571a --- /dev/null +++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/EP/README @@ -0,0 +1,6 @@ +This code implements the random-number generator described in the +NAS Parallel Benchmark document RNR Technical Report RNR-94-007. +The code is "embarrassingly" parallel in that no communication is +required for the generation of the random numbers itself. There is +no special requirement on the number of processors used for running +the benchmark. diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/EP/ep.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/EP/ep.f90 new file mode 100644 index 000000000..1bdaa9aac --- /dev/null +++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/EP/ep.f90 @@ -0,0 +1,319 @@ +!-------------------------------------------------------------------------! +! ! +! N A S P A R A L L E L B E N C H M A R K S 3.4 ! +! ! +! E P ! +! ! +!-------------------------------------------------------------------------! +! ! +! This benchmark is part of the NAS Parallel Benchmark 3.4 suite. ! +! It is described in NAS Technical Reports 95-020 and 02-007 ! +! ! +! Permission to use, copy, distribute and modify this software ! +! for any purpose with or without fee is hereby granted. We ! +! request, however, that all derived work reference the NAS ! +! Parallel Benchmarks 3.4. This software is provided "as is" ! +! without express or implied warranty. ! +! ! +! Information on NPB 3.4, including the technical report, the ! +! original specifications, source code, results and information ! +! on how to submit new results, is available at: ! +! ! +! http://www.nas.nasa.gov/Software/NPB/ ! +! ! +! Send comments or suggestions to npb@nas.nasa.gov ! +! ! +! NAS Parallel Benchmarks Group ! +! NASA Ames Research Center ! +! Mail Stop: T27A-1 ! +! Moffett Field, CA 94035-1000 ! +! ! +! E-mail: npb@nas.nasa.gov ! +! Fax: (650) 604-3957 ! +! ! +!-------------------------------------------------------------------------! + + +!--------------------------------------------------------------------- +! +! Authors: P. O. Frederickson +! D. H. Bailey +! A. C. Woo +! R. F. Van der Wijngaart +!--------------------------------------------------------------------- + +!--------------------------------------------------------------------- + program EMBAR +!--------------------------------------------------------------------- + +! This is the MPI version of the APP Benchmark 1, +! the "embarassingly parallel" benchmark. + + use ep_data + use mpinpb + + implicit none + + double precision Mops, t1, t2, t3, t4, x1, & + & x2, sx, sy, tm, an, tt, gc, dum(3) + + integer i, ik, kk, l, k, nit, no_large_nodes, & + & np, np_add, k_offset, j + integer ierr, ierrcode + + logical verified, timers_enabled + + double precision randlc, timer_read + external randlc, timer_read + + character size*15, classv + + double precision epsilon + parameter (epsilon=1.d-8) + + double precision tsum(t_last+2), t1m(t_last+2), & + & tming(t_last+2), tmaxg(t_last+2) + character t_recs(t_last+2)*8 + + data dum /1.d0, 1.d0, 1.d0/ + + data t_recs/'total', 'gpairs', 'randn', 'rcomm', & + & ' totcomp', ' totcomm'/ + + + call mpi_init(ierr) + comm_solve = MPI_COMM_WORLD + call mpi_comm_rank(comm_solve,node,ierr) + call mpi_comm_size(comm_solve,no_nodes,ierr) + + root = 0 + + if (.not. convertdouble) then + dp_type = MPI_DOUBLE_PRECISION + else + dp_type = MPI_REAL + endif + + if (node.eq.root) then + +! Because the size of the problem is too large to store in a 32-bit +! integer for some classes, we put it into a string (for printing). +! Have to strip off the decimal point put in there by the floating +! point print statement (internal file) + + write(*, 1000) + write(size, '(f15.0)' ) 2.d0**(m+1) + j = 15 + if (size(j:j) .eq. '.') j = j - 1 + write (*,1001) size(1:j), class + write(*, 1003) no_nodes + + 1000 format(/,' NAS Parallel Benchmarks 3.4 -- EP Benchmark',/) + 1001 format(' Number of random numbers generated: ', a15, & + & ' (class ', a, ')' ) + 1003 format(' Total number of processes: ', 2x, i13, /) + + call check_timer_flag( timers_enabled ) + endif + + call mpi_bcast(timers_enabled, 1, MPI_LOGICAL, root, & + & comm_solve, ierr) + + verified = .false. + +! Compute the number of "batches" of random number pairs generated +! per processor. Adjust if the number of processors does not evenly +! divide the total number + + np = nn / no_nodes + no_large_nodes = mod(nn, no_nodes) + if (node .lt. no_large_nodes) then + np_add = 1 + else + np_add = 0 + endif + np = np + np_add + + if (np .eq. 0) then + write (6, 1) no_nodes, nn + 1 format ('Too many nodes:', i0, 1x, i0) + ierrcode = 1 + call mpi_abort(MPI_COMM_WORLD,ierrcode,ierr) + stop + endif + +! Call the random number generator functions and initialize +! the x-array to reduce the effects of paging on the timings. +! Also, call all mathematical functions that are used. Make +! sure these initializations cannot be eliminated as dead code. + + call vranlc(0, dum(1), dum(2), dum(3)) + dum(1) = randlc(dum(2), dum(3)) + do 5 i = 1, 2*nk + x(i) = -1.d99 + 5 continue + Mops = log(sqrt(abs(max(1.d0,1.d0)))) + +!--------------------------------------------------------------------- +! Synchronize before placing time stamp +!--------------------------------------------------------------------- + do i = 1, t_last + call timer_clear(i) + end do + call mpi_barrier(comm_solve, ierr) + call timer_start(1) + + t1 = a + call vranlc(0, t1, a, x) + +! Compute AN = A ^ (2 * NK) (mod 2^46). + + t1 = a + + do 100 i = 1, mk + 1 + t2 = randlc(t1, t1) + 100 continue + + an = t1 + tt = s + gc = 0.d0 + sx = 0.d0 + sy = 0.d0 + + do 110 i = 0, nq - 1 + q(i) = 0.d0 + 110 continue + +! Each instance of this loop may be performed independently. We compute +! the k offsets separately to take into account the fact that some nodes +! have more numbers to generate than others + + if (np_add .eq. 1) then + k_offset = node * np -1 + else + k_offset = no_large_nodes*(np+1) + (node-no_large_nodes)*np -1 + endif + + do 150 k = 1, np + kk = k_offset + k + t1 = s + t2 = an + +! Find starting seed t1 for this kk. + + do 120 i = 1, 100 + ik = kk / 2 + if (2 * ik .ne. kk) t3 = randlc(t1, t2) + if (ik .eq. 0) goto 130 + t3 = randlc(t2, t2) + kk = ik + 120 continue + +! Compute uniform pseudorandom numbers. + 130 continue + + if (timers_enabled) call timer_start(t_randn) + call vranlc(2 * nk, t1, a, x) + if (timers_enabled) call timer_stop(t_randn) + +! Compute Gaussian deviates by acceptance-rejection method and +! tally counts in concentric square annuli. This loop is not +! vectorizable. + + if (timers_enabled) call timer_start(t_gpairs) + + do 140 i = 1, nk + x1 = 2.d0 * x(2*i-1) - 1.d0 + x2 = 2.d0 * x(2*i) - 1.d0 + t1 = x1 ** 2 + x2 ** 2 + if (t1 .le. 1.d0) then + t2 = sqrt(-2.d0 * log(t1) / t1) + t3 = abs(x1 * t2) + t4 = abs(x2 * t2) + l = max(t3, t4) + q(l) = q(l) + 1.d0 + sx = sx + t3 + sy = sy + t4 + endif + 140 continue + + if (timers_enabled) call timer_stop(t_gpairs) + + 150 continue + + if (timers_enabled) call timer_start(t_rcomm) + call mpi_allreduce(sx, x, 1, dp_type, & + & MPI_SUM, comm_solve, ierr) + sx = x(1) + call mpi_allreduce(sy, x, 1, dp_type, & + & MPI_SUM, comm_solve, ierr) + sy = x(1) + call mpi_allreduce(q, x, nq, dp_type, & + & MPI_SUM, comm_solve, ierr) + if (timers_enabled) call timer_stop(t_rcomm) + + do i = 1, nq + q(i-1) = x(i) + enddo + + do 160 i = 0, nq - 1 + gc = gc + q(i) + 160 continue + + call timer_stop(1) + tm = timer_read(1) + + call mpi_allreduce(tm, x, 1, dp_type, & + & MPI_MAX, comm_solve, ierr) + tm = x(1) + + if (node.eq.root) then + call verify(m, sx, sy, gc, verified, classv) + + nit = 0 + Mops = 2.d0**(m+1)/tm/1000000.d0 + + write (6,11) tm, m, gc, sx, sy, (i, q(i), i = 0, nq - 1) + 11 format ('EP Benchmark Results:'//'CPU Time =',f10.3/'N = 2^', & + & i5/'No. Gaussian Pairs =',f15.0/'Sums = ',1p,2d25.15/ & + & 'Counts:'/(i3,0p,f15.0)) + + call print_results('EP', class, m+1, 0, 0, nit, no_nodes, & + & no_nodes, tm, Mops, & + & 'Random numbers generated', & + & verified, npbversion, compiletime, cs1, & + & cs2, cs3, cs4, cs5, cs6, cs7) + + endif + + + if (.not.timers_enabled) goto 999 + + do i = 1, t_last + t1m(i) = timer_read(i) + end do + t1m(t_last+2) = t1m(t_rcomm) + t1m(t_last+1) = t1m(t_total) - t1m(t_last+2) + + call MPI_Reduce(t1m, tsum, t_last+2, dp_type, MPI_SUM, & + & 0, comm_solve, ierr) + call MPI_Reduce(t1m, tming, t_last+2, dp_type, MPI_MIN, & + & 0, comm_solve, ierr) + call MPI_Reduce(t1m, tmaxg, t_last+2, dp_type, MPI_MAX, & + & 0, comm_solve, ierr) + + if (node .eq. 0) then + write(*, 800) no_nodes + do i = 1, t_last+2 + tsum(i) = tsum(i) / no_nodes + write(*, 810) i, t_recs(i), tming(i), tmaxg(i), tsum(i) + end do + endif + 800 format(' nprocs =', i6, 11x, 'minimum', 5x, 'maximum', & + & 5x, 'average') + 810 format(' timer ', i2, '(', A8, ') :', 3(2x,f10.4)) + + 999 continue + call mpi_finalize(ierr) + + end diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/EP/ep_data.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/EP/ep_data.f90 new file mode 100644 index 000000000..93ee961c7 --- /dev/null +++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/EP/ep_data.f90 @@ -0,0 +1,39 @@ +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- +! +! ep_data module +! +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + module ep_data + +!--------------------------------------------------------------------- +! The following include file is generated automatically by the +! "setparams" utility, which defines the problem size 'm' +!--------------------------------------------------------------------- + + include 'npbparams.h' + +!--------------------------------------------------------------------- +! M is the Log_2 of the number of complex pairs of uniform (0, 1) random +! numbers. MK is the Log_2 of the size of each batch of uniform random +! numbers. MK can be set for convenience on a given system, since it does +! not affect the results. +!--------------------------------------------------------------------- + integer mk, mm, nn, nk, nq + parameter (mk = 16, mm = m - mk, nn = 2 ** mm, & + & nk = 2 ** mk, nq = 10) + + double precision a, s + parameter (a = 1220703125.d0, s = 271828183.d0) + +! ... storage + double precision x(2*nk), q(0:nq-1), qq(10000) + +! ... timer constants + integer t_total, t_gpairs, t_randn, t_rcomm, t_last + parameter (t_total=1, t_gpairs=2, t_randn=3, t_rcomm=4, t_last=4) + + end module ep_data + diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/EP/mpinpb.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/EP/mpinpb.f90 new file mode 100644 index 000000000..865402747 --- /dev/null +++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/EP/mpinpb.f90 @@ -0,0 +1,16 @@ +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- +! +! mpinpb module +! +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + module mpinpb + + include 'mpif.h' + + integer node, no_nodes, root, comm_solve, dp_type + + end module mpinpb + diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/EP/verify.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/EP/verify.f90 new file mode 100644 index 000000000..65fee595c --- /dev/null +++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/EP/verify.f90 @@ -0,0 +1,82 @@ +!--------------------------------------------------------------------- + subroutine verify(m, sx, sy, gc, verified, class) +!--------------------------------------------------------------------- + + use, intrinsic :: ieee_arithmetic, only : ieee_is_nan + + implicit none + integer m + double precision sx, sy, gc + logical verified + character class + + double precision sx_verify_value, sy_verify_value + double precision gc_verify_value + double precision sx_err, sy_err, gc_err + + double precision, parameter :: epsilon = 1.d-8 + + verified = .true. + if (m.eq.24) then + class = 'S' + sx_verify_value = 1.051299420395306D+07 + sy_verify_value = 1.051517131857535D+07 + gc_verify_value = 13176389.D0 + elseif (m.eq.25) then + class = 'W' + sx_verify_value = 2.102505525182392D+07 + sy_verify_value = 2.103162209578822D+07 + gc_verify_value = 26354769.D0 + elseif (m.eq.28) then + class = 'A' + sx_verify_value = 1.682235632304711D+08 + sy_verify_value = 1.682195123368299D+08 + gc_verify_value = 210832767.D0 + elseif (m.eq.30) then + class = 'B' + sx_verify_value = 6.728927543423024D+08 + sy_verify_value = 6.728951822504275D+08 + gc_verify_value = 843345606.D0 + elseif (m.eq.32) then + class = 'C' + sx_verify_value = 2.691444083862931D+09 + sy_verify_value = 2.691519118724585D+09 + gc_verify_value = 3373275903.D0 + elseif (m.eq.36) then + class = 'D' + sx_verify_value = 4.306350280812112D+10 + sy_verify_value = 4.306347571859157D+10 + gc_verify_value = 53972171957.D0 + elseif (m.eq.40) then + class = 'E' + sx_verify_value = 6.890169663167274D+11 + sy_verify_value = 6.890164670688535D+11 + gc_verify_value = 863554308186.D0 + elseif (m.eq.44) then + class = 'F' + sx_verify_value = 1.102426773788175D+13 + sy_verify_value = 1.102426773787993D+13 + gc_verify_value = 13816870608324.D0 + else + class = 'U' + verified = .false. + endif + if (verified) then + sx_err = abs((sx - sx_verify_value)/sx_verify_value) + sy_err = abs((sy - sy_verify_value)/sy_verify_value) + if (ieee_is_nan(sx_err) .or. ieee_is_nan(sy_err)) then + verified = .false. + else + verified = ((sx_err.le.epsilon) .and. (sy_err.le.epsilon)) + endif + endif + if (verified) then + gc_err = abs((gc - gc_verify_value)/gc_verify_value) + if (ieee_is_nan(gc_err) .or. gc_err.gt.epsilon) then + verified = .false. + endif + endif + + return + end + diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/FT/Makefile b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/FT/Makefile new file mode 100644 index 000000000..28e3e7df1 --- /dev/null +++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/FT/Makefile @@ -0,0 +1,25 @@ +SHELL=/bin/sh +BENCHMARK=ft +BENCHMARKU=FT + +include ../config/make.def + +include ../sys/make.common + +OBJS = ft.o ft_data.o mpinpb.o ${COMMON}/get_active_nprocs.o \ + ${COMMON}/${RAND}.o ${COMMON}/print_results.o ${COMMON}/timers.o + +${PROGRAM}: config ${OBJS} + ${FLINK} ${FLINKFLAGS} -o ${PROGRAM} ${OBJS} ${FMPI_LIB} + + +.f90.o: + ${FCOMPILE} $< + +ft.o: ft.f90 ft_data.o mpinpb.o +ft_data.o: ft_data.f90 mpinpb.o npbparams.h +mpinpb.o: mpinpb.f90 + +clean: + - rm -f *.o *.mod *~ mputil* + - rm -f ft npbparams.h core diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/FT/README b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/FT/README new file mode 100644 index 000000000..ab08b363b --- /dev/null +++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/FT/README @@ -0,0 +1,5 @@ +This code implements the time integration of a three-dimensional +partial differential equation using the Fast Fourier Transform. +Some of the dimension statements are not F77 conforming and will +not work using the g77 compiler. All dimension statements, +however, are legal F90. \ No newline at end of file diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/FT/ft.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/FT/ft.f90 new file mode 100644 index 000000000..ac2df0a58 --- /dev/null +++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/FT/ft.f90 @@ -0,0 +1,2124 @@ +!-------------------------------------------------------------------------! +! ! +! N A S P A R A L L E L B E N C H M A R K S 3.4 ! +! ! +! F T ! +! ! +!-------------------------------------------------------------------------! +! ! +! This benchmark is part of the NAS Parallel Benchmark 3.4 suite. ! +! It is described in NAS Technical Reports 95-020 and 02-007 ! +! ! +! Permission to use, copy, distribute and modify this software ! +! for any purpose with or without fee is hereby granted. We ! +! request, however, that all derived work reference the NAS ! +! Parallel Benchmarks 3.4. This software is provided "as is" ! +! without express or implied warranty. ! +! ! +! Information on NPB 3.4, including the technical report, the ! +! original specifications, source code, results and information ! +! on how to submit new results, is available at: ! +! ! +! http://www.nas.nasa.gov/Software/NPB/ ! +! ! +! Send comments or suggestions to npb@nas.nasa.gov ! +! ! +! NAS Parallel Benchmarks Group ! +! NASA Ames Research Center ! +! Mail Stop: T27A-1 ! +! Moffett Field, CA 94035-1000 ! +! ! +! E-mail: npb@nas.nasa.gov ! +! Fax: (650) 604-3957 ! +! ! +!-------------------------------------------------------------------------! + +!TO REDUCE THE AMOUNT OF MEMORY REQUIRED BY THE BENCHMARK WE NO LONGER +!STORE THE ENTIRE TIME EVOLUTION ARRAY "EX" FOR ALL TIME STEPS, BUT +!JUST FOR THE FIRST. ALSO, IT IS STORED ONLY FOR THE PART OF THE GRID +!FOR WHICH THE CALLING PROCESSOR IS RESPONSIBLE, SO THAT THE MEMORY +!USAGE BECOMES SCALABLE. THIS NEW ARRAY IS CALLED "TWIDDLE" (SEE +!NPB3.0-SER) + +!TO AVOID PROBLEMS WITH VERY LARGE ARRAY SIZES THAT ARE COMPUTED BY +!MULTIPLYING GRID DIMENSIONS (CAUSING INTEGER OVERFLOW IN THE VARIABLE +!NTOTAL) AND SUBSEQUENTLY DIVIDING BY THE NUMBER OF PROCESSORS, WE +!COMPUTE THE SIZE OF ARRAY PARTITIONS MORE CONSERVATIVELY AS +!((NX*NY)/NP)*NZ, WHERE NX, NY, AND NZ ARE GRID DIMENSIONS AND NP IS +!THE NUMBER OF PROCESSORS, THE RESULT IS STORED IN "NTDIVNP". FOR THE +!PERFORMANCE CALCULATION WE STORE THE TOTAL NUMBER OF GRID POINTS IN A +!FLOATING POINT NUMBER "NTOTAL_F" INSTEAD OF AN INTEGER. +!THIS FIX WILL FAIL IF THE NUMBER OF PROCESSORS IS SMALL. + +!UGLY HACK OF SUBROUTINE IPOW46: FOR VERY LARGE GRIDS THE SINGLE EXPONENT +!FROM NPB2.3 MAY NOT FIT IN A 32-BIT INTEGER. HOWEVER, WE KNOW THAT THE +!"EXPONENT" ARGUMENT OF THIS ROUTINE CAN ALWAYS BE FACTORED INTO A TERM +!DIVISIBLE BY NX (EXP_1) AND ANOTHER TERM (EXP_2). NX IS USUALLY A POWER +!OF TWO, SO WE CAN KEEP HALVING IT UNTIL THE PRODUCT OF EXP_1 +!AND EXP_2 IS SMALL ENOUGH (NAMELY EXP_2 ITSELF). THIS UPDATED VERSION +!OF IPWO46, WHICH NOW TAKES THE TWO FACTORS OF "EXPONENT" AS SEPARATE +!ARGUMENTS, MAY BREAK DOWN IF EXP_1 DOES NOT CONTAIN A LARGE POWER OF TWO. + +!--------------------------------------------------------------------- +! +! Authors: D. Bailey +! W. Saphir +! R. F. Van der Wijngaart +! +!--------------------------------------------------------------------- + +!--------------------------------------------------------------------- + +!--------------------------------------------------------------------- +! FT benchmark +!--------------------------------------------------------------------- + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + program ft + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + use ft_data + use ft_fields + use mpinpb + + implicit none + + integer i, ierr + + integer iter + double precision total_time, mflops + logical verified + character class + + + call setup(class) + if (.not. active) goto 999 + +!--------------------------------------------------------------------- +! Run the entire problem once to make sure all data is touched. +! This reduces variable startup costs, which is important for such a +! short benchmark. The other NPB 2 implementations are similar. +!--------------------------------------------------------------------- + do i = 1, t_max + call timer_clear(i) + end do + + call timer_start(T_init) + call compute_indexmap(twiddle, dims(1,3), dims(2,3), dims(3,3)) + call compute_initial_conditions(u1, dims(1,1), dims(2,1), & + & dims(3,1)) + call fft_init (dims(1,1)) + call fft(1, u1, u0) + call timer_stop(T_init) + if (me .eq. 0) then + write(*, 1000) timer_read(T_init) +1000 format(/' Initialization time =', f12.4/) + endif + +!--------------------------------------------------------------------- +! Start over from the beginning. Note that all operations must +! be timed, in contrast to other benchmarks. +!--------------------------------------------------------------------- + do i = 1, t_max + call timer_clear(i) + end do + call MPI_Barrier(comm_solve, ierr) + + call timer_start(T_total) + if (timers_enabled) call timer_start(T_setup) + + call compute_indexmap(twiddle, dims(1,3), dims(2,3), dims(3,3)) + call compute_initial_conditions(u1, dims(1,1), dims(2,1), & + & dims(3,1)) + call fft_init (dims(1,1)) + +! if (timers_enabled) call synchup() + if (timers_enabled) call timer_stop(T_setup) + + if (timers_enabled) call timer_start(T_fft) + call fft(1, u1, u0) + if (timers_enabled) call timer_stop(T_fft) + + do iter = 1, niter + if (timers_enabled) call timer_start(T_evolve) + call evolve(u0, u1, twiddle, & + & dims(1,1), dims(2,1), dims(3,1)) + if (timers_enabled) call timer_stop(T_evolve) + if (timers_enabled) call timer_start(T_fft) + call fft(-1, u1, u2) + if (timers_enabled) call timer_stop(T_fft) +! if (timers_enabled) call synchup() + if (timers_enabled) call timer_start(T_checksum) + call checksum(iter, u2, dims(1,1), dims(2,1), dims(3,1)) + if (timers_enabled) call timer_stop(T_checksum) + end do + + call verify(niter, verified, class) + call timer_stop(t_total) +!! if (np .ne. np_min) verified = .false. + total_time = timer_read(t_total) + + if( total_time .ne. 0. ) then + mflops = 1.0d-6*ntotal_f * & + & (14.8157+7.19641*log(ntotal_f) & + & + (5.23518+7.21113*log(ntotal_f))*niter) & + & /total_time + else + mflops = 0.0 + endif + if (me .eq. 0) then + call print_results('FT', class, nx, ny, nz, niter, np_min, np, & + & total_time, mflops, ' floating point', verified, & + & npbversion, compiletime, cs1, cs2, cs3, cs4, cs5, cs6, cs7) + endif + if (timers_enabled) call print_timers() + + 999 continue + call MPI_Finalize(ierr) + end + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine evolve(u0, u1, twiddle, d1, d2, d3) + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + +!--------------------------------------------------------------------- +! evolve u0 -> u1 (t time steps) in fourier space +!--------------------------------------------------------------------- + + use ft_data + implicit none + + integer d1, d2, d3 + double precision exi + double complex u0(d1,d2,d3) + double complex u1(d1,d2,d3) + double precision twiddle(d1,d2,d3) + integer i, j, k + + do k = 1, d3 + do j = 1, d2 + do i = 1, d1 + u0(i,j,k) = u0(i,j,k)*(twiddle(i,j,k)) + u1(i,j,k) = u0(i,j,k) + end do + end do + end do + + return + end + + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine compute_initial_conditions(u0, d1, d2, d3) + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + +!--------------------------------------------------------------------- +! Fill in array u0 with initial conditions from +! random number generator +!--------------------------------------------------------------------- + + use ft_data + implicit none + + integer d1, d2, d3 + double complex u0(d1, d2, d3) + integer k + double precision x0, start, an, dummy + +!--------------------------------------------------------------------- +! 0-D and 1-D layouts are easy because each processor gets a contiguous +! chunk of the array, in the Fortran ordering sense. +! For a 2-D layout, it's a bit more complicated. We always +! have entire x-lines (contiguous) in processor. +! We can do ny/np1 of them at a time since we have +! ny/np1 contiguous in y-direction. But then we jump +! by z-planes (nz/np2 of them, total). +! For the 0-D and 1-D layouts we could do larger chunks, but +! this turns out to have no measurable impact on performance. +!--------------------------------------------------------------------- + + + start = seed +!--------------------------------------------------------------------- +! Jump to the starting element for our first plane. +!--------------------------------------------------------------------- + call ipow46(a, 2*nx, (zstart(1)-1)*ny + (ystart(1)-1), an) + dummy = randlc(start, an) + call ipow46(a, 2*nx, ny, an) + +!--------------------------------------------------------------------- +! Go through by z planes filling in one square at a time. +!--------------------------------------------------------------------- + do k = 1, dims(3, 1) ! nz/np2 + x0 = start + call vranlc(2*nx*dims(2, 1), x0, a, u0(1, 1, k)) + if (k .ne. dims(3, 1)) dummy = randlc(start, an) + end do + + return + end + + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine ipow46(a, exp_1, exp_2, result) + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + +!--------------------------------------------------------------------- +! compute a^exponent mod 2^46 +!--------------------------------------------------------------------- + + implicit none + double precision a, result, dummy, q, r + integer exp_1, exp_2, n, n2, ierr + external randlc + double precision randlc + logical two_pow +!--------------------------------------------------------------------- +! Use +! a^n = a^(n/2)*a^(n/2) if n even else +! a^n = a*a^(n-1) if n odd +!--------------------------------------------------------------------- + result = 1 + if (exp_2 .eq. 0 .or. exp_1 .eq. 0) return + q = a + r = 1 + n = exp_1 + two_pow = .true. + + do while (two_pow) + n2 = n/2 + if (n2 * 2 .eq. n) then + dummy = randlc(q, q) + n = n2 + else + n = n * exp_2 + two_pow = .false. + endif + end do + + do while (n .gt. 1) + n2 = n/2 + if (n2 * 2 .eq. n) then + dummy = randlc(q, q) + n = n2 + else + dummy = randlc(r, q) + n = n-1 + endif + end do + dummy = randlc(r, q) + result = r + return + end + + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine setup(class) + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + use ft_data + use mpinpb + + implicit none + + character class + integer ierr, i, fstatus + debug = .FALSE. + + call MPI_Init(ierr) + +!--------------------------------------------------------------------- +! get a process grid that requires a pwr-2 number of procs. +! excess ranks are marked as inactive. +!--------------------------------------------------------------------- + call get_active_nprocs(3, np1, np2, np_min, & + & np, me, comm_solve, active) + + if (.not. active) return + + if (.not. convertdouble) then + dc_type = MPI_DOUBLE_COMPLEX + else + dc_type = MPI_COMPLEX + endif + + if (me .eq. 0) then + write(*, 1000) + + call check_timer_flag( timers_enabled ) + + open (unit=2,file='inputft.data',status='old', iostat=fstatus) + + if (fstatus .eq. 0) then + write(*,233) + 233 format(' Reading from input file inputft.data') + read (2,*) niter + read (2,*) layout_type + read (2,*) np1, np2 + close(2) + +!--------------------------------------------------------------------- +! check to make sure input data is consistent +!--------------------------------------------------------------------- + + +!--------------------------------------------------------------------- +! 1. product of processor grid dims must equal number of processors +!--------------------------------------------------------------------- + + if (np1 * np2 .ne. np_min) then + write(*, 238) + 238 format(' np1 and np2 given in input file are not valid.') + write(*, 239) np1*np2, np_min + 239 format(' Product is ', i5, ' and should be ', i5) + call MPI_Abort(MPI_COMM_WORLD, 1, ierr) + endif + +!--------------------------------------------------------------------- +! 2. layout type must be valid +!--------------------------------------------------------------------- + + if (layout_type .ne. layout_0D .and. & + & layout_type .ne. layout_1D .and. & + & layout_type .ne. layout_2D) then + write(*, 240) + 240 format(' Layout type specified in inputft.data is & + & invalid ') + call MPI_Abort(MPI_COMM_WORLD, 1, ierr) + endif + +!--------------------------------------------------------------------- +! 3. 0D layout must be 1x1 grid +!--------------------------------------------------------------------- + + if (layout_type .eq. layout_0D .and. & + & (np1 .ne.1 .or. np2 .ne. 1)) then + write(*, 241) + 241 format(' For 0D layout, both np1 and np2 must be 1 ') + call MPI_Abort(MPI_COMM_WORLD, 1, ierr) + endif +!--------------------------------------------------------------------- +! 4. 1D layout must be 1xN grid +!--------------------------------------------------------------------- + + if (layout_type .eq. layout_1D .and. np1 .ne. 1) then + write(*, 242) + 242 format(' For 1D layout, np1 must be 1 ') + call MPI_Abort(MPI_COMM_WORLD, 1, ierr) + endif + + else + write(*,234) + niter = niter_default + if (np_min .eq. 1) then + np1 = 1 + np2 = 1 + layout_type = layout_0D + else if (np_min .le. nz) then + np1 = 1 + np2 = np_min + layout_type = layout_1D + else + np1 = nz + np2 = np_min/nz + layout_type = layout_2D + endif + endif + + call set_class(nx, ny, nz, niter, class) + + 234 format(' No input file inputft.data. Using compiled defaults') + write(*, 1001) nx, ny, nz, class + write(*, 1002) niter + write(*, 1004) np + if (np .ne. np_min) write(*, 1006) np_min + write(*, 1005) np1, np2 + + if (layout_type .eq. layout_0D) then + write(*, 1010) '0D' + else if (layout_type .eq. layout_1D) then + write(*, 1010) '1D' + else + write(*, 1010) '2D' + endif + + 1000 format(//,' NAS Parallel Benchmarks 3.4 -- FT Benchmark',/) + 1001 format(' Size : ', i4, 'x', i4, 'x', i4, & + & ' (class ', a, ')') + 1002 format(' Iterations : ', 7x, i7) + 1004 format(' Number of processes : ', 7x, i7) + 1005 format(' Processor array : ', 5x, i4, 'x', i4) + 1006 format(' WARNING: Number of processes is not power of two (', & + & i0, ' active)') + 1010 format(' Layout type : ', 9x, A5) + endif + + +!--------------------------------------------------------------------- +! Broadcast parameters +!--------------------------------------------------------------------- + call MPI_BCAST(np1, 1, MPI_INTEGER, 0, comm_solve, ierr) + call MPI_BCAST(np2, 1, MPI_INTEGER, 0, comm_solve, ierr) + call MPI_BCAST(layout_type, 1, MPI_INTEGER, 0, comm_solve, & + & ierr) + call MPI_BCAST(niter, 1, MPI_INTEGER, 0, comm_solve, ierr) + call MPI_BCAST(timers_enabled, 1, MPI_LOGICAL, 0, comm_solve, & + & ierr) + + if (np1 .eq. 1 .and. np2 .eq. 1) then + layout_type = layout_0D + else if (np1 .eq. 1) then + layout_type = layout_1D + else + layout_type = layout_2D + endif + + if (layout_type .eq. layout_0D) then + do i = 1, 3 + dims(1, i) = nx + dims(2, i) = ny + dims(3, i) = nz + end do + else if (layout_type .eq. layout_1D) then + dims(1, 1) = nx + dims(2, 1) = ny + dims(3, 1) = nz + + dims(1, 2) = nx + dims(2, 2) = ny + dims(3, 2) = nz + + dims(1, 3) = nz + dims(2, 3) = nx + dims(3, 3) = ny + else if (layout_type .eq. layout_2D) then + dims(1, 1) = nx + dims(2, 1) = ny + dims(3, 1) = nz + + dims(1, 2) = ny + dims(2, 2) = nx + dims(3, 2) = nz + + dims(1, 3) = nz + dims(2, 3) = nx + dims(3, 3) = ny + + endif + do i = 1, 3 + dims(2, i) = dims(2, i) / np1 + dims(3, i) = dims(3, i) / np2 + end do + +!--------------------------------------------------------------------- +! Allocate space +!--------------------------------------------------------------------- + call alloc_space + +!--------------------------------------------------------------------- +! Determine processor coordinates of this processor +! Processor grid is np1xnp2. +! Arrays are always (n1, n2/np1, n3/np2) +! Processor coords are zero-based. +!--------------------------------------------------------------------- + me2 = mod(me, np2) ! goes from 0...np2-1 + me1 = me/np2 ! goes from 0...np1-1 +!--------------------------------------------------------------------- +! Communicators for rows/columns of processor grid. +! commslice1 is communicator of all procs with same me1, ranked as me2 +! commslice2 is communicator of all procs with same me2, ranked as me1 +! mpi_comm_split(comm, color, key, ...) +!--------------------------------------------------------------------- + call MPI_Comm_split(comm_solve, me1, me2, commslice1, ierr) + call MPI_Comm_split(comm_solve, me2, me1, commslice2, ierr) +! if (timers_enabled) call synchup() + + if (debug) print *, 'proc coords: ', me, me1, me2 + +!--------------------------------------------------------------------- +! Determine which section of the grid is owned by this +! processor. +!--------------------------------------------------------------------- + if (layout_type .eq. layout_0d) then + + do i = 1, 3 + xstart(i) = 1 + xend(i) = nx + ystart(i) = 1 + yend(i) = ny + zstart(i) = 1 + zend(i) = nz + end do + + else if (layout_type .eq. layout_1d) then + + xstart(1) = 1 + xend(1) = nx + ystart(1) = 1 + yend(1) = ny + zstart(1) = 1 + me2 * nz/np2 + zend(1) = (me2+1) * nz/np2 + + xstart(2) = 1 + xend(2) = nx + ystart(2) = 1 + yend(2) = ny + zstart(2) = 1 + me2 * nz/np2 + zend(2) = (me2+1) * nz/np2 + + xstart(3) = 1 + xend(3) = nx + ystart(3) = 1 + me2 * ny/np2 + yend(3) = (me2+1) * ny/np2 + zstart(3) = 1 + zend(3) = nz + + else if (layout_type .eq. layout_2d) then + + xstart(1) = 1 + xend(1) = nx + ystart(1) = 1 + me1 * ny/np1 + yend(1) = (me1+1) * ny/np1 + zstart(1) = 1 + me2 * nz/np2 + zend(1) = (me2+1) * nz/np2 + + xstart(2) = 1 + me1 * nx/np1 + xend(2) = (me1+1)*nx/np1 + ystart(2) = 1 + yend(2) = ny + zstart(2) = zstart(1) + zend(2) = zend(1) + + xstart(3) = xstart(2) + xend(3) = xend(2) + ystart(3) = 1 + me2 *ny/np2 + yend(3) = (me2+1)*ny/np2 + zstart(3) = 1 + zend(3) = nz + endif + +!--------------------------------------------------------------------- +! Set up info for blocking of ffts and transposes. This improves +! performance on cache-based systems. Blocking involves +! working on a chunk of the problem at a time, taking chunks +! along the first, second, or third dimension. +! +! - In cffts1 blocking is on 2nd dimension (with fft on 1st dim) +! - In cffts2/3 blocking is on 1st dimension (with fft on 2nd and 3rd dims) + +! Since 1st dim is always in processor, we'll assume it's long enough +! (default blocking factor is 16 so min size for 1st dim is 16) +! The only case we have to worry about is cffts1 in a 2d decomposition. +! so the blocking factor should not be larger than the 2nd dimension. +!--------------------------------------------------------------------- + + fftblock = fftblock_default + fftblockpad = fftblockpad_default + + if (layout_type .eq. layout_2d) then + if (dims(2, 1) .lt. fftblock) fftblock = dims(2, 1) + if (dims(2, 2) .lt. fftblock) fftblock = dims(2, 2) + if (dims(2, 3) .lt. fftblock) fftblock = dims(2, 3) + endif + + if (fftblock .ne. fftblock_default) fftblockpad = fftblock+3 + + return + end + + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine compute_indexmap(twiddle, d1, d2, d3) + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + +!--------------------------------------------------------------------- +! compute function from local (i,j,k) to ibar^2+jbar^2+kbar^2 +! for time evolution exponent. +!--------------------------------------------------------------------- + + use ft_data + use mpinpb + + implicit none + + integer d1, d2, d3 + integer i, j, k, ii, ii2, jj, ij2, kk + double precision ap, twiddle(d1, d2, d3) + +!--------------------------------------------------------------------- +! this function is very different depending on whether +! we are in the 0d, 1d or 2d layout. Compute separately. +! basically we want to convert the fortran indices +! 1 2 3 4 5 6 7 8 +! to +! 0 1 2 3 -4 -3 -2 -1 +! The following magic formula does the trick: +! mod(i-1+n/2, n) - n/2 +!--------------------------------------------------------------------- + + ap = - 4.d0 * alpha * pi *pi + + if (layout_type .eq. layout_0d) then ! xyz layout + do i = 1, dims(1,3) + ii = mod(i+xstart(3)-2+nx/2, nx) - nx/2 + ii2 = ii*ii + do j = 1, dims(2,3) + jj = mod(j+ystart(3)-2+ny/2, ny) - ny/2 + ij2 = jj*jj+ii2 + do k = 1, dims(3,3) + kk = mod(k+zstart(3)-2+nz/2, nz) - nz/2 + twiddle(i,j,k) = dexp(ap*dfloat(kk*kk+ij2)) + end do + end do + end do + else if (layout_type .eq. layout_1d) then ! zxy layout + do i = 1,dims(2,3) + ii = mod(i+xstart(3)-2+nx/2, nx) - nx/2 + ii2 = ii*ii + do j = 1,dims(3,3) + jj = mod(j+ystart(3)-2+ny/2, ny) - ny/2 + ij2 = jj*jj+ii2 + do k = 1,dims(1,3) + kk = mod(k+zstart(3)-2+nz/2, nz) - nz/2 + twiddle(k,i,j) = dexp(ap*dfloat(kk*kk+ij2)) + end do + end do + end do + else if (layout_type .eq. layout_2d) then ! zxy layout + do i = 1,dims(2,3) + ii = mod(i+xstart(3)-2+nx/2, nx) - nx/2 + ii2 = ii*ii + do j = 1, dims(3,3) + jj = mod(j+ystart(3)-2+ny/2, ny) - ny/2 + ij2 = jj*jj+ii2 + do k =1,dims(1,3) + kk = mod(k+zstart(3)-2+nz/2, nz) - nz/2 + twiddle(k,i,j) = dexp(ap*dfloat(kk*kk+ij2)) + end do + end do + end do + else + print *, ' Unknown layout type ', layout_type + stop + endif + + return + end + + + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine print_timers() + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + use ft_data + use mpinpb + + implicit none + + integer i, ierr + character*25 tstrings(T_max+2) + double precision t1(T_max+2), tsum(T_max+2), & + & tming(T_max+2), tmaxg(T_max+2) + data tstrings / ' total ', & + & ' setup ', & + & ' fft ', & + & ' evolve ', & + & ' checksum ', & + & ' fftlow ', & + & ' fftcopy ', & + & ' transpose ', & + & ' transpose1_loc ', & + & ' transpose1_glo ', & + & ' transpose1_fin ', & + & ' transpose2_loc ', & + & ' transpose2_glo ', & + & ' transpose2_fin ', & + & ' sync ', & + & ' init ', & + & ' totcomp ', & + & ' totcomm ' / + + do i = 1, t_max + t1(i) = timer_read(i) + end do + t1(t_max+2) = t1(t_transxzglo) + t1(t_transxyglo) + t1(t_synch) + t1(t_max+1) = t1(t_total) - t1(t_max+2) + + call MPI_Reduce(t1, tsum, t_max+2, MPI_DOUBLE_PRECISION, & + & MPI_SUM, 0, comm_solve, ierr) + call MPI_Reduce(t1, tming, t_max+2, MPI_DOUBLE_PRECISION, & + & MPI_MIN, 0, comm_solve, ierr) + call MPI_Reduce(t1, tmaxg, t_max+2, MPI_DOUBLE_PRECISION, & + & MPI_MAX, 0, comm_solve, ierr) + + if (me .ne. 0) return + write(*, 800) np_min + do i = 1, t_max+2 + if (tsum(i) .ne. 0.0d0) then + write(*, 810) i, tstrings(i), tming(i), tmaxg(i), & + & tsum(i)/np_min + endif + end do + 800 format(' nprocs =', i6, 19x, 'minimum', 5x, 'maximum', & + & 5x, 'average') + 810 format(' timer ', i2, '(', A16, ') :', 3(2X,F10.4)) + return + end + + + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine fft(dir, x1, x2) + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + use ft_data + implicit none + + integer dir + double complex x1(ntdivnp), x2(ntdivnp) + + double complex scratch(fftblockpad_default*maxdim*2) + +!--------------------------------------------------------------------- +! note: args x1, x2 must be different arrays +! note: args for cfftsx are (direction, layout, xin, xout, scratch) +! xin/xout may be the same and it can be somewhat faster +! if they are +! note: args for transpose are (layout1, layout2, xin, xout) +! xin/xout must be different +!--------------------------------------------------------------------- + + if (dir .eq. 1) then + if (layout_type .eq. layout_0d) then + call cffts1(1, dims(1,1), dims(2,1), dims(3,1), & + & x1, x1, scratch) + call cffts2(1, dims(1,2), dims(2,2), dims(3,2), & + & x1, x1, scratch) + call cffts3(1, dims(1,3), dims(2,3), dims(3,3), & + & x1, x2, scratch) + else if (layout_type .eq. layout_1d) then + call cffts1(1, dims(1,1), dims(2,1), dims(3,1), & + & x1, x1, scratch) + call cffts2(1, dims(1,2), dims(2,2), dims(3,2), & + & x1, x1, scratch) + if (timers_enabled) call timer_start(T_transpose) + call transpose_xy_z(2, 3, x1, x2) + if (timers_enabled) call timer_stop(T_transpose) + call cffts1(1, dims(1,3), dims(2,3), dims(3,3), & + & x2, x2, scratch) + else if (layout_type .eq. layout_2d) then + call cffts1(1, dims(1,1), dims(2,1), dims(3,1), & + & x1, x1, scratch) + if (timers_enabled) call timer_start(T_transpose) + call transpose_x_y(1, 2, x1, x2) + if (timers_enabled) call timer_stop(T_transpose) + call cffts1(1, dims(1,2), dims(2,2), dims(3,2), & + & x2, x2, scratch) + if (timers_enabled) call timer_start(T_transpose) + call transpose_x_z(2, 3, x2, x1) + if (timers_enabled) call timer_stop(T_transpose) + call cffts1(1, dims(1,3), dims(2,3), dims(3,3), & + & x1, x2, scratch) + endif + else + if (layout_type .eq. layout_0d) then + call cffts3(-1, dims(1,3), dims(2,3), dims(3,3), & + & x1, x1, scratch) + call cffts2(-1, dims(1,2), dims(2,2), dims(3,2), & + & x1, x1, scratch) + call cffts1(-1, dims(1,1), dims(2,1), dims(3,1), & + & x1, x2, scratch) + else if (layout_type .eq. layout_1d) then + call cffts1(-1, dims(1,3), dims(2,3), dims(3,3), & + & x1, x1, scratch) + if (timers_enabled) call timer_start(T_transpose) + call transpose_x_yz(3, 2, x1, x2) + if (timers_enabled) call timer_stop(T_transpose) + call cffts2(-1, dims(1,2), dims(2,2), dims(3,2), & + & x2, x2, scratch) + call cffts1(-1, dims(1,1), dims(2,1), dims(3,1), & + & x2, x2, scratch) + else if (layout_type .eq. layout_2d) then + call cffts1(-1, dims(1,3), dims(2,3), dims(3,3), & + & x1, x1, scratch) + if (timers_enabled) call timer_start(T_transpose) + call transpose_x_z(3, 2, x1, x2) + if (timers_enabled) call timer_stop(T_transpose) + call cffts1(-1, dims(1,2), dims(2,2), dims(3,2), & + & x2, x2, scratch) + if (timers_enabled) call timer_start(T_transpose) + call transpose_x_y(2, 1, x2, x1) + if (timers_enabled) call timer_stop(T_transpose) + call cffts1(-1, dims(1,1), dims(2,1), dims(3,1), & + & x1, x2, scratch) + endif + endif + return + end + + + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine cffts1(is, d1, d2, d3, x, xout, y) + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + use ft_data + implicit none + + integer is, d1, d2, d3, logd1 + double complex x(d1,d2,d3) + double complex xout(d1,d2,d3) + double complex y(fftblockpad, d1, 2) + integer i, j, k, jj + + logd1 = ilog2(d1) + + do k = 1, d3 + do jj = 0, d2 - fftblock, fftblock + if (timers_enabled) call timer_start(T_fftcopy) + do j = 1, fftblock + do i = 1, d1 + y(j,i,1) = x(i,j+jj,k) + enddo + enddo + if (timers_enabled) call timer_stop(T_fftcopy) + + if (timers_enabled) call timer_start(T_fftlow) + call cfftz (is, logd1, d1, y, y(1,1,2)) + if (timers_enabled) call timer_stop(T_fftlow) + + if (timers_enabled) call timer_start(T_fftcopy) + do j = 1, fftblock + do i = 1, d1 + xout(i,j+jj,k) = y(j,i,1) + enddo + enddo + if (timers_enabled) call timer_stop(T_fftcopy) + enddo + enddo + + return + end + + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine cffts2(is, d1, d2, d3, x, xout, y) + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + use ft_data + implicit none + + integer is, d1, d2, d3, logd2 + double complex x(d1,d2,d3) + double complex xout(d1,d2,d3) + double complex y(fftblockpad, d2, 2) + integer i, j, k, ii + + logd2 = ilog2(d2) + + do k = 1, d3 + do ii = 0, d1 - fftblock, fftblock + if (timers_enabled) call timer_start(T_fftcopy) + do j = 1, d2 + do i = 1, fftblock + y(i,j,1) = x(i+ii,j,k) + enddo + enddo + if (timers_enabled) call timer_stop(T_fftcopy) + + if (timers_enabled) call timer_start(T_fftlow) + call cfftz (is, logd2, d2, y, y(1, 1, 2)) + if (timers_enabled) call timer_stop(T_fftlow) + + if (timers_enabled) call timer_start(T_fftcopy) + do j = 1, d2 + do i = 1, fftblock + xout(i+ii,j,k) = y(i,j,1) + enddo + enddo + if (timers_enabled) call timer_stop(T_fftcopy) + enddo + enddo + + return + end + + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine cffts3(is, d1, d2, d3, x, xout, y) + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + use ft_data + implicit none + + integer is, d1, d2, d3, logd3 + double complex x(d1,d2,d3) + double complex xout(d1,d2,d3) + double complex y(fftblockpad, d3, 2) + integer i, j, k, ii + + logd3 = ilog2(d3) + + do j = 1, d2 + do ii = 0, d1 - fftblock, fftblock + if (timers_enabled) call timer_start(T_fftcopy) + do k = 1, d3 + do i = 1, fftblock + y(i,k,1) = x(i+ii,j,k) + enddo + enddo + if (timers_enabled) call timer_stop(T_fftcopy) + + if (timers_enabled) call timer_start(T_fftlow) + call cfftz (is, logd3, d3, y, y(1, 1, 2)) + if (timers_enabled) call timer_stop(T_fftlow) + + if (timers_enabled) call timer_start(T_fftcopy) + do k = 1, d3 + do i = 1, fftblock + xout(i+ii,j,k) = y(i,k,1) + enddo + enddo + if (timers_enabled) call timer_stop(T_fftcopy) + enddo + enddo + + return + end + + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine fft_init (n) + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + +!--------------------------------------------------------------------- +! compute the roots-of-unity array that will be used for subsequent FFTs. +!--------------------------------------------------------------------- + + use ft_data + implicit none + + integer m,n,nu,ku,i,j,ln + double precision t, ti + + +!--------------------------------------------------------------------- +! Initialize the U array with sines and cosines in a manner that permits +! stride one access at each FFT iteration. +!--------------------------------------------------------------------- + nu = n + m = ilog2(n) + u(1) = m + ku = 2 + ln = 1 + + do j = 1, m + t = pi / ln + + do i = 0, ln - 1 + ti = i * t + u(i+ku) = dcmplx (cos (ti), sin(ti)) + enddo + + ku = ku + ln + ln = 2 * ln + enddo + + return + end + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine cfftz (is, m, n, x, y) + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + +!--------------------------------------------------------------------- +! Computes NY N-point complex-to-complex FFTs of X using an algorithm due +! to Swarztrauber. X is both the input and the output array, while Y is a +! scratch array. It is assumed that N = 2^M. Before calling CFFTZ to +! perform FFTs, the array U must be initialized by calling CFFTZ with IS +! set to 0 and M set to MX, where MX is the maximum value of M for any +! subsequent call. +!--------------------------------------------------------------------- + + use ft_data + implicit none + + integer is,m,n,i,j,l,mx + double complex x, y + + dimension x(fftblockpad,n), y(fftblockpad,n) + +!--------------------------------------------------------------------- +! Check if input parameters are invalid. +!--------------------------------------------------------------------- + mx = u(1) + if ((is .ne. 1 .and. is .ne. -1) .or. m .lt. 1 .or. m .gt. mx) & + & then + write (*, 1) is, m, mx + 1 format ('CFFTZ: Either U has not been initialized, or else'/ & + & 'one of the input parameters is invalid', 3I5) + stop + endif + +!--------------------------------------------------------------------- +! Perform one variant of the Stockham FFT. +!--------------------------------------------------------------------- + do l = 1, m, 2 + call fftz2 (is, l, m, n, fftblock, fftblockpad, u, x, y) + if (l .eq. m) goto 160 + call fftz2 (is, l + 1, m, n, fftblock, fftblockpad, u, y, x) + enddo + + goto 180 + +!--------------------------------------------------------------------- +! Copy Y to X. +!--------------------------------------------------------------------- + 160 do j = 1, n + do i = 1, fftblock + x(i,j) = y(i,j) + enddo + enddo + + 180 continue + + return + end + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine fftz2 (is, l, m, n, ny, ny1, u, x, y) + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + +!--------------------------------------------------------------------- +! Performs the L-th iteration of the second variant of the Stockham FFT. +!--------------------------------------------------------------------- + + implicit none + + integer is,k,l,m,n,ny,ny1,n1,li,lj,lk,ku,i,j,i11,i12,i21,i22 + double complex u,x,y,u1,x11,x21 + dimension u(n), x(ny1,n), y(ny1,n) + + +!--------------------------------------------------------------------- +! Set initial parameters. +!--------------------------------------------------------------------- + + n1 = n / 2 + lk = 2 ** (l - 1) + li = 2 ** (m - l) + lj = 2 * lk + ku = li + 1 + + do i = 0, li - 1 + i11 = i * lk + 1 + i12 = i11 + n1 + i21 = i * lj + 1 + i22 = i21 + lk + if (is .ge. 1) then + u1 = u(ku+i) + else + u1 = dconjg (u(ku+i)) + endif + +!--------------------------------------------------------------------- +! This loop is vectorizable. +!--------------------------------------------------------------------- + do k = 0, lk - 1 + do j = 1, ny + x11 = x(j,i11+k) + x21 = x(j,i12+k) + y(j,i21+k) = x11 + x21 + y(j,i22+k) = u1 * (x11 - x21) + enddo + enddo + enddo + + return + end + +!--------------------------------------------------------------------- + + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + integer function ilog2(n) + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + implicit none + integer n, nn, lg + if (n .eq. 1) then + ilog2=0 + return + endif + lg = 1 + nn = 2 + do while (nn .lt. n) + nn = nn*2 + lg = lg+1 + end do + ilog2 = lg + return + end + + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine transpose_x_yz(l1, l2, xin, xout) + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + use ft_data + implicit none + + integer l1, l2 + double complex xin(ntdivnp), xout(ntdivnp) + + call transpose2_local(dims(1,l1),dims(2, l1)*dims(3, l1), & + & xin, xout) + + call transpose2_global(xout, xin) + + call transpose2_finish(dims(1,l1),dims(2, l1)*dims(3, l1), & + & xin, xout) + + return + end + + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine transpose_xy_z(l1, l2, xin, xout) + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + use ft_data + implicit none + + integer l1, l2 + double complex xin(ntdivnp), xout(ntdivnp) + + call transpose2_local(dims(1,l1)*dims(2, l1),dims(3, l1), & + & xin, xout) + call transpose2_global(xout, xin) + call transpose2_finish(dims(1,l1)*dims(2, l1),dims(3, l1), & + & xin, xout) + + return + end + + + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine transpose2_local(n1, n2, xin, xout) + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + use ft_data + use mpinpb + + implicit none + + integer n1, n2 + double complex xin(n1, n2), xout(n2, n1) + + double complex z(transblockpad, transblock) + + integer i, j, ii, jj + + if (timers_enabled) call timer_start(T_transxzloc) + +!--------------------------------------------------------------------- +! If possible, block the transpose for cache memory systems. +! How much does this help? Example: R8000 Power Challenge (90 MHz) +! Blocked version decreases time spend in this routine +! from 14 seconds to 5.2 seconds on 8 nodes class A. +!--------------------------------------------------------------------- + + if (n1 .lt. transblock .or. n2 .lt. transblock) then + if (n1 .ge. n2) then + do j = 1, n2 + do i = 1, n1 + xout(j, i) = xin(i, j) + end do + end do + else + do i = 1, n1 + do j = 1, n2 + xout(j, i) = xin(i, j) + end do + end do + endif + else + do j = 0, n2-1, transblock + do i = 0, n1-1, transblock + +!--------------------------------------------------------------------- +! Note: compiler should be able to take j+jj out of inner loop +!--------------------------------------------------------------------- + do jj = 1, transblock + do ii = 1, transblock + z(jj,ii) = xin(i+ii, j+jj) + end do + end do + + do ii = 1, transblock + do jj = 1, transblock + xout(j+jj, i+ii) = z(jj,ii) + end do + end do + + end do + end do + endif + if (timers_enabled) call timer_stop(T_transxzloc) + + return + end + + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine transpose2_global(xin, xout) + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + use ft_data + use mpinpb + + implicit none + + double complex xin(ntdivnp) + double complex xout(ntdivnp) + integer ierr + +! if (timers_enabled) call synchup() + + if (timers_enabled) call timer_start(T_transxzglo) + call mpi_alltoall(xin, ntdivnp/np_min, dc_type, & + & xout, ntdivnp/np_min, dc_type, & + & commslice1, ierr) + if (timers_enabled) call timer_stop(T_transxzglo) + + return + end + + + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine transpose2_finish(n1, n2, xin, xout) + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + use ft_data + implicit none + + integer n1, n2, ioff + double complex xin(n2, n1/np2, 0:np2-1), xout(n2*np2, n1/np2) + + integer i, j, p + + if (timers_enabled) call timer_start(T_transxzfin) + do p = 0, np2-1 + ioff = p*n2 + do j = 1, n1/np2 + do i = 1, n2 + xout(i+ioff, j) = xin(i, j, p) + end do + end do + end do + if (timers_enabled) call timer_stop(T_transxzfin) + + return + end + + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine transpose_x_z(l1, l2, xin, xout) + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + use ft_data + implicit none + + integer l1, l2 + double complex xin(ntdivnp), xout(ntdivnp) + + call transpose_x_z_local(dims(1,l1),dims(2,l1),dims(3,l1), & + & xin, xout) + call transpose_x_z_global(dims(1,l1),dims(2,l1),dims(3,l1), & + & xout, xin) + call transpose_x_z_finish(dims(1,l2),dims(2,l2),dims(3,l2), & + & xin, xout) + return + end + + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine transpose_x_z_local(d1, d2, d3, xin, xout) + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + use ft_data + implicit none + + integer d1, d2, d3 + double complex xin(d1,d2,d3) + double complex xout(d3,d2,d1) + integer block1, block3 + integer i, j, k, kk, ii, i1, k1 + + double complex buf(transblockpad, maxdim) + if (timers_enabled) call timer_start(T_transxzloc) + if (d1 .lt. 32) goto 100 + block3 = d3 + if (block3 .eq. 1) goto 100 + if (block3 .gt. transblock) block3 = transblock + block1 = d1 + if (block1*block3 .gt. transblock*transblock) & + & block1 = transblock*transblock/block3 +!--------------------------------------------------------------------- +! blocked transpose +!--------------------------------------------------------------------- + do j = 1, d2 + do kk = 0, d3-block3, block3 + do ii = 0, d1-block1, block1 + + do k = 1, block3 + k1 = k + kk + do i = 1, block1 + buf(k, i) = xin(i+ii, j, k1) + end do + end do + + do i = 1, block1 + i1 = i + ii + do k = 1, block3 + xout(k+kk, j, i1) = buf(k, i) + end do + end do + + end do + end do + end do + goto 200 + + +!--------------------------------------------------------------------- +! basic transpose +!--------------------------------------------------------------------- + 100 continue + + do j = 1, d2 + do k = 1, d3 + do i = 1, d1 + xout(k, j, i) = xin(i, j, k) + end do + end do + end do + +!--------------------------------------------------------------------- +! all done +!--------------------------------------------------------------------- + 200 continue + + if (timers_enabled) call timer_stop(T_transxzloc) + return + end + + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine transpose_x_z_global(d1, d2, d3, xin, xout) + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + use ft_data + use mpinpb + + implicit none + + integer d1, d2, d3 + double complex xin(d3,d2,d1) + double complex xout(d3,d2,d1) ! not real layout, but right size + integer ierr + +! if (timers_enabled) call synchup() + +!--------------------------------------------------------------------- +! do transpose among all processes with same 1-coord (me1) +!--------------------------------------------------------------------- + if (timers_enabled)call timer_start(T_transxzglo) + call mpi_alltoall(xin, d1*d2*d3/np2, dc_type, & + & xout, d1*d2*d3/np2, dc_type, & + & commslice1, ierr) + if (timers_enabled) call timer_stop(T_transxzglo) + return + end + + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine transpose_x_z_finish(d1, d2, d3, xin, xout) + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + use ft_data + implicit none + + integer d1, d2, d3 + double complex xin(d1/np2, d2, d3, 0:np2-1) + double complex xout(d1,d2,d3) + integer i, j, k, p, ioff + if (timers_enabled) call timer_start(T_transxzfin) +!--------------------------------------------------------------------- +! this is the most straightforward way of doing it. the +! calculation in the inner loop doesn't help. +! do i = 1, d1/np2 +! do j = 1, d2 +! do k = 1, d3 +! do p = 0, np2-1 +! ii = i + p*d1/np2 +! xout(ii, j, k) = xin(i, j, k, p) +! end do +! end do +! end do +! end do +!--------------------------------------------------------------------- + + do p = 0, np2-1 + ioff = p*d1/np2 + do k = 1, d3 + do j = 1, d2 + do i = 1, d1/np2 + xout(i+ioff, j, k) = xin(i, j, k, p) + end do + end do + end do + end do + if (timers_enabled) call timer_stop(T_transxzfin) + return + end + + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine transpose_x_y(l1, l2, xin, xout) + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + use ft_data + implicit none + + integer l1, l2 + double complex xin(ntdivnp), xout(ntdivnp) + +!--------------------------------------------------------------------- +! xy transpose is a little tricky, since we don't want +! to touch 3rd axis. But alltoall must involve 3rd axis (most +! slowly varying) to be efficient. So we do +! (nx, ny/np1, nz/np2) -> (ny/np1, nz/np2, nx) (local) +! (ny/np1, nz/np2, nx) -> ((ny/np1*nz/np2)*np1, nx/np1) (global) +! then local finish. +!--------------------------------------------------------------------- + + + call transpose_x_y_local(dims(1,l1),dims(2,l1),dims(3,l1), & + & xin, xout) + call transpose_x_y_global(dims(1,l1),dims(2,l1),dims(3,l1), & + & xout, xin) + call transpose_x_y_finish(dims(1,l2),dims(2,l2),dims(3,l2), & + & xin, xout) + + return + end + + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine transpose_x_y_local(d1, d2, d3, xin, xout) + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + use ft_data + implicit none + + integer d1, d2, d3 + double complex xin(d1, d2, d3) + double complex xout(d2, d3, d1) + integer i, j, k + if (timers_enabled) call timer_start(T_transxyloc) + + do k = 1, d3 + do i = 1, d1 + do j = 1, d2 + xout(j,k,i)=xin(i,j,k) + end do + end do + end do + if (timers_enabled) call timer_stop(T_transxyloc) + return + end + + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine transpose_x_y_global(d1, d2, d3, xin, xout) + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + use ft_data + use mpinpb + + implicit none + + integer d1, d2, d3 +!--------------------------------------------------------------------- +! array is in form (ny/np1, nz/np2, nx) +!--------------------------------------------------------------------- + double complex xin(d2,d3,d1) + double complex xout(d2,d3,d1) ! not real layout but right size + integer ierr + +! if (timers_enabled) call synchup() + +!--------------------------------------------------------------------- +! do transpose among all processes with same 1-coord (me1) +!--------------------------------------------------------------------- + if (timers_enabled) call timer_start(T_transxyglo) + call mpi_alltoall(xin, d1*d2*d3/np1, dc_type, & + & xout, d1*d2*d3/np1, dc_type, & + & commslice2, ierr) + if (timers_enabled) call timer_stop(T_transxyglo) + + return + end + + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine transpose_x_y_finish(d1, d2, d3, xin, xout) + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + use ft_data + implicit none + + integer d1, d2, d3 + double complex xin(d1/np1, d3, d2, 0:np1-1) + double complex xout(d1,d2,d3) + integer i, j, k, p, ioff + if (timers_enabled) call timer_start(T_transxyfin) +!--------------------------------------------------------------------- +! this is the most straightforward way of doing it. the +! calculation in the inner loop doesn't help. +! do i = 1, d1/np1 +! do j = 1, d2 +! do k = 1, d3 +! do p = 0, np1-1 +! ii = i + p*d1/np1 +! note order is screwy bcz we have (ny/np1, nz/np2, nx) -> (ny, nx/np1, nz/np2) +! xout(ii, j, k) = xin(i, k, j, p) +! end do +! end do +! end do +! end do +!--------------------------------------------------------------------- + + do p = 0, np1-1 + ioff = p*d1/np1 + do k = 1, d3 + do j = 1, d2 + do i = 1, d1/np1 + xout(i+ioff, j, k) = xin(i, k, j, p) + end do + end do + end do + end do + if (timers_enabled) call timer_stop(T_transxyfin) + return + end + + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine checksum(i, u1, d1, d2, d3) + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + use ft_data + use mpinpb + + implicit none + + integer i, d1, d2, d3 + double complex u1(d1, d2, d3) + integer j, q,r,s, ierr + double complex chk,allchk + chk = (0.0,0.0) + + do j=1,1024 + q = mod(j, nx)+1 + if (q .ge. xstart(1) .and. q .le. xend(1)) then + r = mod(3*j,ny)+1 + if (r .ge. ystart(1) .and. r .le. yend(1)) then + s = mod(5*j,nz)+1 + if (s .ge. zstart(1) .and. s .le. zend(1)) then + chk=chk+u1(q-xstart(1)+1,r-ystart(1)+1,s-zstart(1)+1) + end if + end if + end if + end do + chk = chk/ntotal_f + + if (timers_enabled) call timer_start(T_synch) + call MPI_Reduce(chk, allchk, 1, dc_type, MPI_SUM, & + & 0, comm_solve, ierr) + if (timers_enabled) call timer_stop(T_synch) + if (me .eq. 0) then + write (*, 30) i, allchk + 30 format (' T =',I5,5X,'Checksum =',1P2D22.12) + endif + +! sums(i) = allchk +! If we compute the checksum for diagnostic purposes, we let i be +! negative, so the result will not be stored in an array + if (i .gt. 0) sums(i) = allchk + + return + end + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine synchup + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + use ft_data + use mpinpb + + implicit none + + integer ierr + call timer_start(T_synch) + call mpi_barrier(comm_solve, ierr) + call timer_stop(T_synch) + return + end + + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine set_class (d1, d2, d3, nt, class) + +!--------------------------------------------------------------------- +! set problem class based on problem size +!--------------------------------------------------------------------- + + implicit none + + integer d1, d2, d3, nt + character class + + + class = 'U' + + if (d1 .eq. 64 .and. & + & d2 .eq. 64 .and. & + & d3 .eq. 64 .and. & + & nt .eq. 6) then + class = 'S' + + else if (d1 .eq. 128 .and. & + & d2 .eq. 128 .and. & + & d3 .eq. 32 .and. & + & nt .eq. 6) then + class = 'W' + + else if (d1 .eq. 256 .and. & + & d2 .eq. 256 .and. & + & d3 .eq. 128 .and. & + & nt .eq. 6) then + class = 'A' + + else if (d1 .eq. 512 .and. & + & d2 .eq. 256 .and. & + & d3 .eq. 256 .and. & + & nt .eq. 20) then + class = 'B' + + else if (d1 .eq. 512 .and. & + & d2 .eq. 512 .and. & + & d3 .eq. 512 .and. & + & nt .eq. 20) then + class = 'C' + + else if (d1 .eq. 2048 .and. & + & d2 .eq. 1024 .and. & + & d3 .eq. 1024 .and. & + & nt .eq. 25) then + class = 'D' + + else if (d1 .eq. 4096 .and. & + & d2 .eq. 2048 .and. & + & d3 .eq. 2048 .and. & + & nt .eq. 25) then + class = 'E' + + else if (d1 .eq. 8192 .and. & + & d2 .eq. 4096 .and. & + & d3 .eq. 4096 .and. & + & nt .eq. 25) then + class = 'F' + + endif + + return + end + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine verify (nt, verified, class) + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + use, intrinsic :: ieee_arithmetic, only : ieee_is_nan + + use ft_data + use mpinpb + + implicit none + + integer nt + character class + logical verified + integer ierr, size, i + double precision err, epsilon + +!--------------------------------------------------------------------- +! Reference checksums +!--------------------------------------------------------------------- + double complex csum_ref(25) + + + if (me .ne. 0) return + + epsilon = 1.0d-12 + verified = .FALSE. + + if ( class .eq. 'S' ) then +!--------------------------------------------------------------------- +! Sample size reference checksums +!--------------------------------------------------------------------- + csum_ref(1) = dcmplx(5.546087004964D+02, 4.845363331978D+02) + csum_ref(2) = dcmplx(5.546385409189D+02, 4.865304269511D+02) + csum_ref(3) = dcmplx(5.546148406171D+02, 4.883910722336D+02) + csum_ref(4) = dcmplx(5.545423607415D+02, 4.901273169046D+02) + csum_ref(5) = dcmplx(5.544255039624D+02, 4.917475857993D+02) + csum_ref(6) = dcmplx(5.542683411902D+02, 4.932597244941D+02) + + else if ( class .eq. 'W' ) then +!--------------------------------------------------------------------- +! Class W size reference checksums +!--------------------------------------------------------------------- + csum_ref(1) = dcmplx(5.673612178944D+02, 5.293246849175D+02) + csum_ref(2) = dcmplx(5.631436885271D+02, 5.282149986629D+02) + csum_ref(3) = dcmplx(5.594024089970D+02, 5.270996558037D+02) + csum_ref(4) = dcmplx(5.560698047020D+02, 5.260027904925D+02) + csum_ref(5) = dcmplx(5.530898991250D+02, 5.249400845633D+02) + csum_ref(6) = dcmplx(5.504159734538D+02, 5.239212247086D+02) + + else if ( class .eq. 'A' ) then +!--------------------------------------------------------------------- +! Class A size reference checksums +!--------------------------------------------------------------------- + csum_ref(1) = dcmplx(5.046735008193D+02, 5.114047905510D+02) + csum_ref(2) = dcmplx(5.059412319734D+02, 5.098809666433D+02) + csum_ref(3) = dcmplx(5.069376896287D+02, 5.098144042213D+02) + csum_ref(4) = dcmplx(5.077892868474D+02, 5.101336130759D+02) + csum_ref(5) = dcmplx(5.085233095391D+02, 5.104914655194D+02) + csum_ref(6) = dcmplx(5.091487099959D+02, 5.107917842803D+02) + + else if ( class .eq. 'B' ) then +!--------------------------------------------------------------------- +! Class B size reference checksums +!--------------------------------------------------------------------- + csum_ref(1) = dcmplx(5.177643571579D+02, 5.077803458597D+02) + csum_ref(2) = dcmplx(5.154521291263D+02, 5.088249431599D+02) + csum_ref(3) = dcmplx(5.146409228649D+02, 5.096208912659D+02) + csum_ref(4) = dcmplx(5.142378756213D+02, 5.101023387619D+02) + csum_ref(5) = dcmplx(5.139626667737D+02, 5.103976610617D+02) + csum_ref(6) = dcmplx(5.137423460082D+02, 5.105948019802D+02) + csum_ref(7) = dcmplx(5.135547056878D+02, 5.107404165783D+02) + csum_ref(8) = dcmplx(5.133910925466D+02, 5.108576573661D+02) + csum_ref(9) = dcmplx(5.132470705390D+02, 5.109577278523D+02) + csum_ref(10) = dcmplx(5.131197729984D+02, 5.110460304483D+02) + csum_ref(11) = dcmplx(5.130070319283D+02, 5.111252433800D+02) + csum_ref(12) = dcmplx(5.129070537032D+02, 5.111968077718D+02) + csum_ref(13) = dcmplx(5.128182883502D+02, 5.112616233064D+02) + csum_ref(14) = dcmplx(5.127393733383D+02, 5.113203605551D+02) + csum_ref(15) = dcmplx(5.126691062020D+02, 5.113735928093D+02) + csum_ref(16) = dcmplx(5.126064276004D+02, 5.114218460548D+02) + csum_ref(17) = dcmplx(5.125504076570D+02, 5.114656139760D+02) + csum_ref(18) = dcmplx(5.125002331720D+02, 5.115053595966D+02) + csum_ref(19) = dcmplx(5.124551951846D+02, 5.115415130407D+02) + csum_ref(20) = dcmplx(5.124146770029D+02, 5.115744692211D+02) + + else if ( class .eq. 'C' ) then +!--------------------------------------------------------------------- +! Class C size reference checksums +!--------------------------------------------------------------------- + csum_ref(1) = dcmplx(5.195078707457D+02, 5.149019699238D+02) + csum_ref(2) = dcmplx(5.155422171134D+02, 5.127578201997D+02) + csum_ref(3) = dcmplx(5.144678022222D+02, 5.122251847514D+02) + csum_ref(4) = dcmplx(5.140150594328D+02, 5.121090289018D+02) + csum_ref(5) = dcmplx(5.137550426810D+02, 5.121143685824D+02) + csum_ref(6) = dcmplx(5.135811056728D+02, 5.121496764568D+02) + csum_ref(7) = dcmplx(5.134569343165D+02, 5.121870921893D+02) + csum_ref(8) = dcmplx(5.133651975661D+02, 5.122193250322D+02) + csum_ref(9) = dcmplx(5.132955192805D+02, 5.122454735794D+02) + csum_ref(10) = dcmplx(5.132410471738D+02, 5.122663649603D+02) + csum_ref(11) = dcmplx(5.131971141679D+02, 5.122830879827D+02) + csum_ref(12) = dcmplx(5.131605205716D+02, 5.122965869718D+02) + csum_ref(13) = dcmplx(5.131290734194D+02, 5.123075927445D+02) + csum_ref(14) = dcmplx(5.131012720314D+02, 5.123166486553D+02) + csum_ref(15) = dcmplx(5.130760908195D+02, 5.123241541685D+02) + csum_ref(16) = dcmplx(5.130528295923D+02, 5.123304037599D+02) + csum_ref(17) = dcmplx(5.130310107773D+02, 5.123356167976D+02) + csum_ref(18) = dcmplx(5.130103090133D+02, 5.123399592211D+02) + csum_ref(19) = dcmplx(5.129905029333D+02, 5.123435588985D+02) + csum_ref(20) = dcmplx(5.129714421109D+02, 5.123465164008D+02) + + else if ( class .eq. 'D' ) then +!--------------------------------------------------------------------- +! Class D size reference checksums +!--------------------------------------------------------------------- + csum_ref(1) = dcmplx(5.122230065252D+02, 5.118534037109D+02) + csum_ref(2) = dcmplx(5.120463975765D+02, 5.117061181082D+02) + csum_ref(3) = dcmplx(5.119865766760D+02, 5.117096364601D+02) + csum_ref(4) = dcmplx(5.119518799488D+02, 5.117373863950D+02) + csum_ref(5) = dcmplx(5.119269088223D+02, 5.117680347632D+02) + csum_ref(6) = dcmplx(5.119082416858D+02, 5.117967875532D+02) + csum_ref(7) = dcmplx(5.118943814638D+02, 5.118225281841D+02) + csum_ref(8) = dcmplx(5.118842385057D+02, 5.118451629348D+02) + csum_ref(9) = dcmplx(5.118769435632D+02, 5.118649119387D+02) + csum_ref(10) = dcmplx(5.118718203448D+02, 5.118820803844D+02) + csum_ref(11) = dcmplx(5.118683569061D+02, 5.118969781011D+02) + csum_ref(12) = dcmplx(5.118661708593D+02, 5.119098918835D+02) + csum_ref(13) = dcmplx(5.118649768950D+02, 5.119210777066D+02) + csum_ref(14) = dcmplx(5.118645605626D+02, 5.119307604484D+02) + csum_ref(15) = dcmplx(5.118647586618D+02, 5.119391362671D+02) + csum_ref(16) = dcmplx(5.118654451572D+02, 5.119463757241D+02) + csum_ref(17) = dcmplx(5.118665212451D+02, 5.119526269238D+02) + csum_ref(18) = dcmplx(5.118679083821D+02, 5.119580184108D+02) + csum_ref(19) = dcmplx(5.118695433664D+02, 5.119626617538D+02) + csum_ref(20) = dcmplx(5.118713748264D+02, 5.119666538138D+02) + csum_ref(21) = dcmplx(5.118733606701D+02, 5.119700787219D+02) + csum_ref(22) = dcmplx(5.118754661974D+02, 5.119730095953D+02) + csum_ref(23) = dcmplx(5.118776626738D+02, 5.119755100241D+02) + csum_ref(24) = dcmplx(5.118799262314D+02, 5.119776353561D+02) + csum_ref(25) = dcmplx(5.118822370068D+02, 5.119794338060D+02) + + else if ( class .eq. 'E' ) then +!--------------------------------------------------------------------- +! Class E size reference checksums +!--------------------------------------------------------------------- + csum_ref(1) = dcmplx(5.121601045346D+02, 5.117395998266D+02) + csum_ref(2) = dcmplx(5.120905403678D+02, 5.118614716182D+02) + csum_ref(3) = dcmplx(5.120623229306D+02, 5.119074203747D+02) + csum_ref(4) = dcmplx(5.120438418997D+02, 5.119345900733D+02) + csum_ref(5) = dcmplx(5.120311521872D+02, 5.119551325550D+02) + csum_ref(6) = dcmplx(5.120226088809D+02, 5.119720179919D+02) + csum_ref(7) = dcmplx(5.120169296534D+02, 5.119861371665D+02) + csum_ref(8) = dcmplx(5.120131225172D+02, 5.119979364402D+02) + csum_ref(9) = dcmplx(5.120104767108D+02, 5.120077674092D+02) + csum_ref(10) = dcmplx(5.120085127969D+02, 5.120159443121D+02) + csum_ref(11) = dcmplx(5.120069224127D+02, 5.120227453670D+02) + csum_ref(12) = dcmplx(5.120055158164D+02, 5.120284096041D+02) + csum_ref(13) = dcmplx(5.120041820159D+02, 5.120331373793D+02) + csum_ref(14) = dcmplx(5.120028605402D+02, 5.120370938679D+02) + csum_ref(15) = dcmplx(5.120015223011D+02, 5.120404138831D+02) + csum_ref(16) = dcmplx(5.120001570022D+02, 5.120432068837D+02) + csum_ref(17) = dcmplx(5.119987650555D+02, 5.120455615860D+02) + csum_ref(18) = dcmplx(5.119973525091D+02, 5.120475499442D+02) + csum_ref(19) = dcmplx(5.119959279472D+02, 5.120492304629D+02) + csum_ref(20) = dcmplx(5.119945006558D+02, 5.120506508902D+02) + csum_ref(21) = dcmplx(5.119930795911D+02, 5.120518503782D+02) + csum_ref(22) = dcmplx(5.119916728462D+02, 5.120528612016D+02) + csum_ref(23) = dcmplx(5.119902874185D+02, 5.120537101195D+02) + csum_ref(24) = dcmplx(5.119889291565D+02, 5.120544194514D+02) + csum_ref(25) = dcmplx(5.119876028049D+02, 5.120550079284D+02) + + else if ( class .eq. 'F' ) then +!--------------------------------------------------------------------- +! Class F size reference checksums +!--------------------------------------------------------------------- + csum_ref( 1) = dcmplx(5.119892866928D+02, 5.121457822747D+02) + csum_ref( 2) = dcmplx(5.119560157487D+02, 5.121009044434D+02) + csum_ref( 3) = dcmplx(5.119437960123D+02, 5.120761074285D+02) + csum_ref( 4) = dcmplx(5.119395628845D+02, 5.120614320496D+02) + csum_ref( 5) = dcmplx(5.119390371879D+02, 5.120514085624D+02) + csum_ref( 6) = dcmplx(5.119405091840D+02, 5.120438117102D+02) + csum_ref( 7) = dcmplx(5.119430444528D+02, 5.120376348915D+02) + csum_ref( 8) = dcmplx(5.119460702242D+02, 5.120323831062D+02) + csum_ref( 9) = dcmplx(5.119492377036D+02, 5.120277980818D+02) + csum_ref(10) = dcmplx(5.119523446268D+02, 5.120237368268D+02) + csum_ref(11) = dcmplx(5.119552825361D+02, 5.120201137845D+02) + csum_ref(12) = dcmplx(5.119580008777D+02, 5.120168723492D+02) + csum_ref(13) = dcmplx(5.119604834177D+02, 5.120139707209D+02) + csum_ref(14) = dcmplx(5.119627332821D+02, 5.120113749334D+02) + csum_ref(15) = dcmplx(5.119647637538D+02, 5.120090554887D+02) + csum_ref(16) = dcmplx(5.119665927740D+02, 5.120069857863D+02) + csum_ref(17) = dcmplx(5.119682397643D+02, 5.120051414260D+02) + csum_ref(18) = dcmplx(5.119697238718D+02, 5.120034999132D+02) + csum_ref(19) = dcmplx(5.119710630664D+02, 5.120020405355D+02) + csum_ref(20) = dcmplx(5.119722737384D+02, 5.120007442976D+02) + csum_ref(21) = dcmplx(5.119733705802D+02, 5.119995938652D+02) + csum_ref(22) = dcmplx(5.119743666226D+02, 5.119985735001D+02) + csum_ref(23) = dcmplx(5.119752733481D+02, 5.119976689792D+02) + csum_ref(24) = dcmplx(5.119761008382D+02, 5.119968675026D+02) + csum_ref(25) = dcmplx(5.119768579280D+02, 5.119961575929D+02) + + endif + + + if (class .ne. 'U') then + + do i = 1, nt + err = abs( (sums(i) - csum_ref(i)) / csum_ref(i) ) + if (ieee_is_nan(err) .or. (err .gt. epsilon)) goto 100 + end do + verified = .TRUE. + 100 continue + + endif + +! call MPI_COMM_SIZE(comm_solve, size, ierr) +! if (size .ne. np) then +! write(*, 4010) np +! write(*, 4011) +! write(*, 4012) +!--------------------------------------------------------------------- +! multiple statements because some Fortran compilers have +! problems with long strings. +!--------------------------------------------------------------------- +! 4010 format( ' Warning: benchmark was compiled for ', i5, +! > 'processors') +! 4011 format( ' Must be run on this many processors for official', +! > ' verification') +! 4012 format( ' so memory access is repeatable') +! verified = .false. +! endif + + if (class .ne. 'U') then + if (verified) then + write(*,2000) + 2000 format(' Result verification successful') + else + write(*,2001) + 2001 format(' Result verification failed') + endif + endif + + + return + end + + diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/FT/ft_data.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/FT/ft_data.f90 new file mode 100644 index 000000000..62eadd6c2 --- /dev/null +++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/FT/ft_data.f90 @@ -0,0 +1,209 @@ +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- +! +! ft_data module +! +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + module ft_data + + include 'npbparams.h' + +! total number of grid points in floating point number + double precision ntotal_f + parameter (ntotal_f=dble(nx)*ny*nz) + +! total dimension scaled by the number of processes + integer ntdivnp + + + double precision seed, a, pi, alpha + parameter (seed = 314159265.d0, a = 1220703125.d0, & + & pi = 3.141592653589793238d0, alpha=1.0d-6) + +! roots of unity array +! relies on x being largest dimension? + double complex, allocatable :: u(:) + + +! for checksum data + double complex sums(0:niter_default) + +! number of iterations + integer niter + +! other stuff + logical debug, debugsynch + + +!-------------------------------------------------------------------- +! Cache blocking params. These values are good for most +! RISC processors. +! FFT parameters: +! fftblock controls how many ffts are done at a time. +! The default is appropriate for most cache-based machines +! On vector machines, the FFT can be vectorized with vector +! length equal to the block size, so the block size should +! be as large as possible. This is the size of the smallest +! dimension of the problem: 128 for class A, 256 for class B and +! 512 for class C. +! Transpose parameters: +! transblock is the blocking factor for the transposes when there +! is a 1-D layout. On vector machines it should probably be +! large (largest dimension of the problem). +!-------------------------------------------------------------------- + + integer fftblock_default, fftblockpad_default + parameter (fftblock_default=16, fftblockpad_default=18) + integer transblock, transblockpad + parameter(transblock=32, transblockpad=34) + + integer fftblock, fftblockpad + + +!-------------------------------------------------------------------- +! 2D processor array -> 2D grid decomposition (by pencils) +! If processor array is 1xN or -> 1D grid decomposition (by planes) +! If processor array is 1x1 -> 0D grid decomposition +! For simplicity, do not treat Nx1 (np2 = 1) specially +!-------------------------------------------------------------------- + integer np1, np2 + +! basic decomposition strategy + integer layout_type + integer layout_0D, layout_1D, layout_2D + parameter (layout_0D = 0, layout_1D = 1, layout_2D = 2) + +!-------------------------------------------------------------------- +! There are basically three stages +! 1: x-y-z layout +! 2: after x-transform (before y) +! 3: after y-transform (before z) +! The computation proceeds logically as + +! set up initial conditions +! fftx(1) +! transpose (1->2) +! ffty(2) +! transpose (2->3) +! fftz(3) +! time evolution +! fftz(3) +! transpose (3->2) +! ffty(2) +! transpose (2->1) +! fftx(1) +! compute residual(1) + +! for the 0D, 1D, 2D strategies, the layouts look like xxx +! +! 0D 1D 2D +! 1: xyz xyz xyz +! 2: xyz xyz yxz +! 3: xyz zyx zxy +!-------------------------------------------------------------------- + +! the array dimensions are stored in dims(coord, phase) + integer dims(3, 3) + integer xstart(3), ystart(3), zstart(3) + integer xend(3), yend(3), zend(3) + +!-------------------------------------------------------------------- +! Timing constants +!-------------------------------------------------------------------- + integer T_total, T_setup, T_fft, T_evolve, T_checksum, & + & T_fftlow, T_fftcopy, T_transpose, & + & T_transxzloc, T_transxzglo, T_transxzfin, & + & T_transxyloc, T_transxyglo, T_transxyfin, & + & T_synch, T_init, T_max + parameter (T_total = 1, T_setup = 2, T_fft = 3, & + & T_evolve = 4, T_checksum = 5, & + & T_fftlow = 6, T_fftcopy = 7, T_transpose = 8, & + & T_transxzloc = 9, T_transxzglo = 10, T_transxzfin = 11, & + & T_transxyloc = 12, T_transxyglo = 13, & + & T_transxyfin = 14, T_synch = 15, T_init = 16, & + & T_max = 16) + + logical timers_enabled + +!-------------------------------------------------------------------- +! external functions +!-------------------------------------------------------------------- + double precision, external :: randlc, timer_read + integer, external :: ilog2 + + end module ft_data + + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- +! +! ft_fields module +! +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + module ft_fields + +!--------------------------------------------------------------------- +! u0, u1, u2 are the main arrays in the problem. +! Depending on the decomposition, these arrays will have different +! dimensions. To accomodate all possibilities, we allocate them as +! one-dimensional arrays and pass them to subroutines for different +! views +! - u0 contains the initial (transformed) initial condition +! - u1 and u2 are working arrays +!--------------------------------------------------------------------- + double complex, allocatable :: & + & u0(:), u1(:), u2(:) + double precision, allocatable :: & + & twiddle(:) + + end module ft_fields + + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + subroutine alloc_space + +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + +!--------------------------------------------------------------------- +! allocate space dynamically for data arrays +!--------------------------------------------------------------------- + + use ft_data + use ft_fields + use mpinpb + + implicit none + + integer ios, ierr + + + ntdivnp = ((nx*ny)/np_min)*nz + +!--------------------------------------------------------------------- +! Padding+3 is to avoid accidental cache problems, +! since all array sizes are powers of two. +!--------------------------------------------------------------------- + allocate ( & + & u0 (ntdivnp+3), & + & u1 (ntdivnp+3), & + & u2 (ntdivnp+3), & + & twiddle(ntdivnp), & + & u (maxdim), & + & stat = ios) + + if (ios .ne. 0) then + write(*,*) 'Error encountered in allocating space' + call MPI_Abort(MPI_COMM_WORLD, MPI_ERR_OTHER, ierr) + stop + endif + + return + end + diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/FT/inputft.data.sample b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/FT/inputft.data.sample new file mode 100644 index 000000000..448ac42bc --- /dev/null +++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/FT/inputft.data.sample @@ -0,0 +1,3 @@ +6 ! number of iterations +2 ! layout type. 0 = 0d, 1 = 1d, 2 = 2d +2 4 ! processor layout. 0d must be "1 1"; 1d must be "1 N" diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/FT/mpinpb.f90 b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/FT/mpinpb.f90 new file mode 100644 index 000000000..d125ea4ca --- /dev/null +++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/FT/mpinpb.f90 @@ -0,0 +1,31 @@ +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- +! +! mpinpb module +! +!--------------------------------------------------------------------- +!--------------------------------------------------------------------- + + module mpinpb + + include 'mpif.h' + +!-------------------------------------------------------------------- +! 'np' number of processors, 'np_min' min number of processors +!-------------------------------------------------------------------- + integer np_min, np + +! we need a bunch of logic to keep track of how +! arrays are laid out. +! coords of this processor + integer me, me1, me2 + +! need a communicator for row/col in processor grid + integer comm_solve, commslice1, commslice2 + logical active + +! mpi data types + integer dc_type + + end module mpinpb + diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/IS/Makefile b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/IS/Makefile new file mode 100644 index 000000000..0ac4ae959 --- /dev/null +++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/IS/Makefile @@ -0,0 +1,23 @@ +SHELL=/bin/sh +BENCHMARK=is +BENCHMARKU=IS + +include ../config/make.def + +include ../sys/make.common + +OBJS = is.o ${COMMON}/c_print_results.o ${COMMON}/c_timers.o + + +${PROGRAM}: config ${OBJS} + ${CLINK} ${CLINKFLAGS} -o ${PROGRAM} ${OBJS} ${CMPI_LIB} + +.c.o: + ${CCOMPILE} $< + +is.o: is.c npbparams.h + + +clean: + - rm -f *.o *~ mputil* + - rm -f is npbparams.h core diff --git a/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/IS/is.c b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/IS/is.c new file mode 100644 index 000000000..e7227ae6e --- /dev/null +++ b/src/npb-24.04-imgs/npb-with-roi/NPB/NPB3.4-MPI/IS/is.c @@ -0,0 +1,1219 @@ +/************************************************************************* + * * + * N A S P A R A L L E L B E N C H M A R K S 3.4 * + * * + * I S * + * * + ************************************************************************* + * * + * This benchmark is part of the NAS Parallel Benchmark 3.4 suite. * + * It is described in NAS Technical Report 95-020. * + * * + * Permission to use, copy, distribute and modify this software * + * for any purpose with or without fee is hereby granted. We * + * request, however, that all derived work reference the NAS * + * Parallel Benchmarks 3.4. This software is provided "as is" * + * without express or implied warranty. * + * * + * Information on NPB 3.4, including the technical report, the * + * original specifications, source code, results and information * + * on how to submit new results, is available at: * + * * + * http://www.nas.nasa.gov/Software/NPB * + * * + * Send comments or suggestions to npb@nas.nasa.gov * + * Send bug reports to npb-bugs@nas.nasa.gov * + * * + * NAS Parallel Benchmarks Group * + * NASA Ames Research Center * + * Mail Stop: T27A-1 * + * Moffett Field, CA 94035-1000 * + * * + * E-mail: npb@nas.nasa.gov * + * Fax: (650) 604-3957 * + * * + ************************************************************************* + * * + * Author: M. Yarrow * + * H. Jin * + * * + *************************************************************************/ + +#include "mpi.h" +#include "npbparams.h" +#include +#include +#include + +/******************/ +/* default values */ +/******************/ +#ifndef CLASS +#define CLASS 'S' +#define NUM_PROCS 1 +#endif +#define MIN_PROCS 1 +#define ONE 1 + + +/*************/ +/* CLASS S */ +/*************/ +#if CLASS == 'S' +#define TOTAL_KEYS_LOG_2 16 +#define MAX_KEY_LOG_2 11 +#define NUM_BUCKETS_LOG_2 9 +#endif + + +/*************/ +/* CLASS W */ +/*************/ +#if CLASS == 'W' +#define TOTAL_KEYS_LOG_2 20 +#define MAX_KEY_LOG_2 16 +#define NUM_BUCKETS_LOG_2 10 +#endif + +/*************/ +/* CLASS A */ +/*************/ +#if CLASS == 'A' +#define TOTAL_KEYS_LOG_2 23 +#define MAX_KEY_LOG_2 19 +#define NUM_BUCKETS_LOG_2 10 +#endif + + +/*************/ +/* CLASS B */ +/*************/ +#if CLASS == 'B' +#define TOTAL_KEYS_LOG_2 25 +#define MAX_KEY_LOG_2 21 +#define NUM_BUCKETS_LOG_2 10 +#endif + + +/*************/ +/* CLASS C */ +/*************/ +#if CLASS == 'C' +#define TOTAL_KEYS_LOG_2 27 +#define MAX_KEY_LOG_2 23 +#define NUM_BUCKETS_LOG_2 10 +#endif + + +/*************/ +/* CLASS D */ +/*************/ +#if CLASS == 'D' +#define TOTAL_KEYS_LOG_2 29 /* 2^31 */ +#define MAX_KEY_LOG_2 27 +#define NUM_BUCKETS_LOG_2 10 +#undef MIN_PROCS +#define MIN_PROCS 4 +#endif + + +/*************/ +/* CLASS E */ +/*************/ +#if CLASS == 'E' +#define TOTAL_KEYS_LOG_2 29 /* 2^35 */ +#define MAX_KEY_LOG_2 31 +#define NUM_BUCKETS_LOG_2 10 +#undef MIN_PROCS +#define MIN_PROCS 64 +#undef ONE +#define ONE 1L +#endif + + +/******************************************************************* + * Defining MIN_PROCS is to avoid integer overflow for large problem + * sizes without using a larger integer type, such as long int. + * The actual total keys = TOTAL_KEYS * MIN_PROCS + *******************************************************************/ +#define TOTAL_KEYS (1 << TOTAL_KEYS_LOG_2) + +#define MAX_KEY (ONE << MAX_KEY_LOG_2) +#define NUM_BUCKETS (1 << NUM_BUCKETS_LOG_2) + +/*****************************************************************/ +/* NOTE: THIS CODE CANNOT BE RUN ON ARBITRARILY LARGE NUMBERS OF */ +/* PROCESSORS. THE LARGEST VERIFIED NUMBER IS 1024. INCREASE */ +/* MAX_PROCS AT YOUR PERIL */ +/*****************************************************************/ +#if CLASS == 'S' +#define MAX_PROCS 128 +#else +#define MAX_PROCS 1024 +#endif + +#define MAX_ITERATIONS 10 +#define TEST_ARRAY_SIZE 5 + + +/* Number of keys assigned to each processor + * #define NUM_KEYS (TOTAL_KEYS/NUM_PROCS*MIN_PROCS) + */ +int num_keys; + +/*****************************************************************/ +/* On larger number of processors, since the keys are (roughly) */ +/* gaussian distributed, the first and last processor sort keys */ +/* in a large interval, requiring array sizes to be larger. Note */ +/* that for large NUM_PROCS, NUM_KEYS is, however, a small number*/ +/* The required array size also depends on the bucket size used. */ +/* The following values are validated for the 1024-bucket setup. */ +/*****************************************************************/ +/* + * #if NUM_PROCS < 256 + * #define SIZE_OF_BUFFERS 3*NUM_KEYS/2 + * #elif NUM_PROCS < 512 + * #define SIZE_OF_BUFFERS 5*NUM_KEYS/2 + * #elif NUM_PROCS < 1024 + * #define SIZE_OF_BUFFERS 4*NUM_KEYS + * #else + * #define SIZE_OF_BUFFERS 13*NUM_KEYS/2 + * #endif + */ +int size_of_buffers; + + +/***********************************/ +/* Enable separate communication, */ +/* computation timing and printout */ +/***********************************/ +#define TIMING_ENABLED +#ifdef NO_MTIMERS +#undef TIMINIG_ENABLED +#define TIMER_START( x ) +#define TIMER_STOP( x ) +#else +#define TIMER_START( x ) if (timeron) timer_start( x ) +#define TIMER_STOP( x ) if (timeron) timer_stop( x ) +#define T_TOTAL 0 +#define T_RANK 1 +#define T_RCOMM 2 +#define T_VERIFY 3 +#define T_LAST 3 +#endif +int timeron; + + +/*************************************/ +/* Typedef: if necessary, change the */ +/* size of int here by changing the */ +/* int type to, say, long */ +/*************************************/ +typedef int INT_TYPE; +#if CLASS == 'D' || CLASS == 'E' +typedef long KEY_TYPE; +#else +typedef int KEY_TYPE; +#endif +#define MP_KEY_TYPE MPI_INT + + + +/********************/ +/* MPI properties: */ +/********************/ +int my_rank, np_total, + comm_size; +MPI_Comm comm_work; + + +/********************/ +/* Some global info */ +/********************/ +INT_TYPE *key_buff_ptr_global, /* used by full_verify to get */ + total_local_keys, /* copies of rank info */ + total_lesser_keys; + + +int passed_verification; + + + +/************************************/ +/* These are the three main arrays. */ +/* See SIZE_OF_BUFFERS def above */ +/************************************/ +INT_TYPE *key_array, + *key_buff1, + *key_buff2, + bucket_size[NUM_BUCKETS+TEST_ARRAY_SIZE], /* Top 5 elements for */ + bucket_size_totals[NUM_BUCKETS+TEST_ARRAY_SIZE], /* part. ver. vals */ + bucket_ptrs[NUM_BUCKETS], + process_bucket_distrib_ptr1[NUM_BUCKETS+TEST_ARRAY_SIZE], + process_bucket_distrib_ptr2[NUM_BUCKETS+TEST_ARRAY_SIZE]; +int *send_count, *recv_count, + *send_displ, *recv_displ; + + +/**********************/ +/* Partial verif info */ +/**********************/ +KEY_TYPE test_index_array[TEST_ARRAY_SIZE], + test_rank_array[TEST_ARRAY_SIZE]; + +int S_test_index_array[TEST_ARRAY_SIZE] = + {48427,17148,23627,62548,4431}, + S_test_rank_array[TEST_ARRAY_SIZE] = + {0,18,346,64917,65463}, + + W_test_index_array[TEST_ARRAY_SIZE] = + {357773,934767,875723,898999,404505}, + W_test_rank_array[TEST_ARRAY_SIZE] = + {1249,11698,1039987,1043896,1048018}, + + A_test_index_array[TEST_ARRAY_SIZE] = + {2112377,662041,5336171,3642833,4250760}, + A_test_rank_array[TEST_ARRAY_SIZE] = + {104,17523,123928,8288932,8388264}, + + B_test_index_array[TEST_ARRAY_SIZE] = + {41869,812306,5102857,18232239,26860214}, + B_test_rank_array[TEST_ARRAY_SIZE] = + {33422937,10244,59149,33135281,99}, + + C_test_index_array[TEST_ARRAY_SIZE] = + {44172927,72999161,74326391,129606274,21736814}, + C_test_rank_array[TEST_ARRAY_SIZE] = + {61147,882988,266290,133997595,133525895}; + +long D_test_index_array[TEST_ARRAY_SIZE] = + {1317351170,995930646,1157283250,1503301535,1453734525}, + D_test_rank_array[TEST_ARRAY_SIZE] = + {1,36538729,1978098519,2145192618,2147425337}, + + E_test_index_array[TEST_ARRAY_SIZE] = + {21492309536L,24606226181L,12608530949L,4065943607L,3324513396L}, + E_test_rank_array[TEST_ARRAY_SIZE] = + {3L,27580354L,3248475153L,30048754302L,31485259697L}; + + + +/***********************/ +/* function prototypes */ +/***********************/ +double randlc( double *X, double *A ); + +void full_verify( void ); + +void c_print_results( char *name, + char class, + int n1, + int n2, + int n3, + int niter, + int nprocs_active, + int nprocs_total, + double t, + double mops, + char *optype, + int passed_verification, + char *npbversion, + char *compiletime, + char *mpicc, + char *clink, + char *cmpi_lib, + char *cmpi_inc, + char *cflags, + char *clinkflags ); + +#include "../common/c_timers.h" + + +/*****************************************************************/ +/* Dynamically allocate space for main arrays */ +/*****************************************************************/ +void alloc_space(void) +{ + /* problem size after partition */ + num_keys = (TOTAL_KEYS/comm_size) * MIN_PROCS; + + /* buffer size for communication */ + if ( comm_size < 256 ) + size_of_buffers = 3*num_keys/2; + else if ( comm_size < 512 ) + size_of_buffers = 5*num_keys/2; + else if ( comm_size < 1024 ) + size_of_buffers = 4*num_keys; + else + size_of_buffers = 13*num_keys/2; + + /* allocate space */ + key_array = (INT_TYPE *)malloc(sizeof(INT_TYPE)*size_of_buffers); + key_buff1 = (INT_TYPE *)malloc(sizeof(INT_TYPE)*size_of_buffers); + key_buff2 = (INT_TYPE *)malloc(sizeof(INT_TYPE)*size_of_buffers); + + send_count = (int *)malloc(sizeof(int)*comm_size); + recv_count = (int *)malloc(sizeof(int)*comm_size); + send_displ = (int *)malloc(sizeof(int)*comm_size); + recv_displ = (int *)malloc(sizeof(int)*comm_size); + + if (!key_array || !key_buff1 || !key_buff2 || + !send_count || !recv_count || !send_displ || !recv_displ) { + printf("ERROR: memoy allocation failed\n"); + MPI_Abort(MPI_COMM_WORLD, 1); + exit(1); + } +} + + +/* + * FUNCTION RANDLC (X, A) + * + * This routine returns a uniform pseudorandom double precision number in the + * range (0, 1) by using the linear congruential generator + * + * x_{k+1} = a x_k (mod 2^46) + * + * where 0 < x_k < 2^46 and 0 < a < 2^46. This scheme generates 2^44 numbers + * before repeating. The argument A is the same as 'a' in the above formula, + * and X is the same as x_0. A and X must be odd double precision integers + * in the range (1, 2^46). The returned value RANDLC is normalized to be + * between 0 and 1, i.e. RANDLC = 2^(-46) * x_1. X is updated to contain + * the new seed x_1, so that subsequent calls to RANDLC using the same + * arguments will generate a continuous sequence. + * + * This routine should produce the same results on any computer with at least + * 48 mantissa bits in double precision floating point data. On Cray systems, + * double precision should be disabled. + * + * David H. Bailey October 26, 1990 + * + * IMPLICIT DOUBLE PRECISION (A-H, O-Z) + * SAVE KS, R23, R46, T23, T46 + * DATA KS/0/ + * + * If this is the first call to RANDLC, compute R23 = 2 ^ -23, R46 = 2 ^ -46, + * T23 = 2 ^ 23, and T46 = 2 ^ 46. These are computed in loops, rather than + * by merely using the ** operator, in order to insure that the results are + * exact on all systems. This code assumes that 0.5D0 is represented exactly. + */ + + +/*****************************************************************/ +/************* R A N D L C ************/ +/************* ************/ +/************* portable random number generator ************/ +/*****************************************************************/ + +double randlc( double *X, double *A ) +{ + static int KS=0; + static double R23, R46, T23, T46; + double T1, T2, T3, T4; + double A1; + double A2; + double X1; + double X2; + double Z; + int i, j; + + if (KS == 0) + { + R23 = 1.0; + R46 = 1.0; + T23 = 1.0; + T46 = 1.0; + + for (i=1; i<=23; i++) + { + R23 = 0.50 * R23; + T23 = 2.0 * T23; + } + for (i=1; i<=46; i++) + { + R46 = 0.50 * R46; + T46 = 2.0 * T46; + } + KS = 1; + } + +/* Break A into two parts such that A = 2^23 * A1 + A2 and set X = N. */ + + T1 = R23 * *A; + j = T1; + A1 = j; + A2 = *A - T23 * A1; + +/* Break X into two parts such that X = 2^23 * X1 + X2, compute + Z = A1 * X2 + A2 * X1 (mod 2^23), and then + X = 2^23 * Z + A2 * X2 (mod 2^46). */ + + T1 = R23 * *X; + j = T1; + X1 = j; + X2 = *X - T23 * X1; + T1 = A1 * X2 + A2 * X1; + + j = R23 * T1; + T2 = j; + Z = T1 - T23 * T2; + T3 = T23 * Z + A2 * X2; + j = R46 * T3; + T4 = j; + *X = T3 - T46 * T4; + return(R46 * *X); +} + + + +/*****************************************************************/ +/************ F I N D _ M Y _ S E E D ************/ +/************ ************/ +/************ returns parallel random number seq seed ************/ +/*****************************************************************/ + +/* + * Create a random number sequence of total length nn residing + * on np number of processors. Each processor will therefore have a + * subsequence of length nn/np. This routine returns that random + * number which is the first random number for the subsequence belonging + * to processor rank kn, and which is used as seed for proc kn ran # gen. + */ + +double find_my_seed( int kn, /* my processor rank, 0<=kn<=num procs */ + int np, /* np = num procs */ + long nn, /* total num of ran numbers, all procs */ + double s, /* Ran num seed, for ex.: 314159265.00 */ + double a ) /* Ran num gen mult, try 1220703125.00 */ +{ + + long i; + + double t1,t2,t3,an; + long mq,nq,kk,ik; + + + + nq = nn / np; + + for( mq=0; nq>1; mq++,nq/=2 ) + ; + + t1 = a; + + for( i=1; i<=mq; i++ ) + t2 = randlc( &t1, &t1 ); + + an = t1; + + kk = kn; + t1 = s; + t2 = an; + + for( i=1; i<=100; i++ ) + { + ik = kk / 2; + if( 2 * ik != kk ) + t3 = randlc( &t1, &t2 ); + if( ik == 0 ) + break; + t3 = randlc( &t2, &t2 ); + kk = ik; + } + + return( t1 ); + +} + + + + +/*****************************************************************/ +/************* C R E A T E _ S E Q ************/ +/*****************************************************************/ + +void create_seq( double seed, double a ) +{ + double x; + int i, k; + + k = MAX_KEY/4; + + for (i=0; i 0 ) + MPI_Irecv( &k, + 1, + MP_KEY_TYPE, + my_rank-1, + 1000, + comm_work, + &request ); + if( my_rank < comm_size-1 ) + MPI_Send( &key_array[last_local_key], + 1, + MP_KEY_TYPE, + my_rank+1, + 1000, + comm_work ); + if( my_rank > 0 ) + MPI_Wait( &request, &status ); + +/* Confirm that neighbor's greatest key value + is not greater than my least key value */ + j = 0; + if( my_rank > 0 && total_local_keys > 0 ) + if( k > key_array[0] ) + j++; + + +/* Confirm keys correctly sorted: count incorrectly sorted keys, if any */ + for( i=1; i key_array[i] ) + j++; + + + if( j != 0 ) + { + printf( "Processor %d: Full_verify: number of keys out of sort: %d\n", + my_rank, j ); + } + else + passed_verification++; + + TIMER_STOP( T_VERIFY ); + +} + + + + +/*****************************************************************/ +/************* R A N K ****************/ +/*****************************************************************/ + + +void rank( int iteration ) +{ + + INT_TYPE i, k; + + INT_TYPE shift = MAX_KEY_LOG_2 - NUM_BUCKETS_LOG_2; + INT_TYPE key; + KEY_TYPE bucket_sum_accumulator, j, m; + INT_TYPE local_bucket_sum_accumulator; + INT_TYPE min_key_val, max_key_val; + INT_TYPE *key_buff_ptr; + + + + TIMER_START( T_RANK ); + +/* Iteration alteration of keys */ + if(my_rank == 0 ) + { + key_array[iteration] = iteration; + key_array[iteration+MAX_ITERATIONS] = MAX_KEY - iteration; + } + + +/* Initialize */ + for( i=0; i> shift]++; + + +/* Accumulative bucket sizes are the bucket pointers */ + bucket_ptrs[0] = 0; + for( i=1; i< NUM_BUCKETS; i++ ) + bucket_ptrs[i] = bucket_ptrs[i-1] + bucket_size[i-1]; + + +/* Sort into appropriate bucket */ + for( i=0; i> shift]++] = key; + } + + TIMER_STOP( T_RANK ); + TIMER_START( T_RCOMM ); + +/* Get the bucket size totals for the entire problem. These + will be used to determine the redistribution of keys */ + MPI_Allreduce( bucket_size, + bucket_size_totals, + NUM_BUCKETS+TEST_ARRAY_SIZE, + MP_KEY_TYPE, + MPI_SUM, + comm_work ); + + TIMER_STOP( T_RCOMM ); + TIMER_START( T_RANK ); + +/* Determine Redistibution of keys: accumulate the bucket size totals + till this number surpasses NUM_KEYS (which the average number of keys + per processor). Then all keys in these buckets go to processor 0. + Continue accumulating again until supassing 2*NUM_KEYS. All keys + in these buckets go to processor 1, etc. This algorithm guarantees + that all processors have work ranking; no processors are left idle. + The optimum number of buckets, however, does not result in as high + a degree of load balancing (as even a distribution of keys as is + possible) as is obtained from increasing the number of buckets, but + more buckets results in more computation per processor so that the + optimum number of buckets turns out to be 1024 for machines tested. + Note that process_bucket_distrib_ptr1 and ..._ptr2 hold the bucket + number of first and last bucket which each processor will have after + the redistribution is done. */ + + bucket_sum_accumulator = 0; + local_bucket_sum_accumulator = 0; + send_displ[0] = 0; + process_bucket_distrib_ptr1[0] = 0; + for( i=0, j=0; i= (j+1)*num_keys ) + { + send_count[j] = local_bucket_sum_accumulator; + if( j != 0 ) + { + send_displ[j] = send_displ[j-1] + send_count[j-1]; + process_bucket_distrib_ptr1[j] = + process_bucket_distrib_ptr2[j-1]+1; + } + process_bucket_distrib_ptr2[j++] = i; + local_bucket_sum_accumulator = 0; + } + } + +/* When NUM_PROCS approaching NUM_BUCKETS, it is highly possible + that the last few processors don't get any buckets. So, we + need to set counts properly in this case to avoid any fallouts. */ + while( j < comm_size ) + { + send_count[j] = 0; + process_bucket_distrib_ptr1[j] = 1; + j++; + } + + TIMER_STOP( T_RANK ); + TIMER_START( T_RCOMM ); + +/* This is the redistribution section: first find out how many keys + each processor will send to every other processor: */ + MPI_Alltoall( send_count, + 1, + MPI_INT, + recv_count, + 1, + MPI_INT, + comm_work ); + +/* Determine the receive array displacements for the buckets */ + recv_displ[0] = 0; + for( i=1; i