For more complete information on NetPIPE, visit the webpage at:
https://www.ameslab.gov/scl/netpipe
NetPIPE was originally developed by Quinn Snell, Armin Mikler, John Gustafson, and Guy Helmer.
It is currently developed and maintained by Troy Benjegerdes, with help from several graduate students (Veerendra Allada and Kyle Schochenmaier).
Release 3.7 adds support for the OpenFabrics InfiniBand verbs module (NPibv), and should work with the OFED-1.1 or OFED-1.2 releases. It has been tested on IBM eHCA hardware as well as Mellanox PCI-Express InfiniBand adapters. The OpenFabrics verbs support currently lacks any support for the connection manager, and all connections are set up via TCP sockets. This means that RDMA Ethernet devices will not work. Patches to [email protected] are encouraged.
Release 3.6.2 mainly fixes some bugs. A number of portability issues with 64-bit architectures were taken care of, especially in the Infiniband module. A small typecasting error was fixed that caused segmentation faults on Red Hat Enterprise and Fedora Core systems (and probably others...). The bi-directional mode was also tested with the Infiniband module, and a subset of NetPIPE options are now supported.
Release 3.6.1 adds a bi-directional (-2) mode to allow data to be sent in both directions simultaneously. This has been tested with the TCP, MPI, MPI-2, and GM modules. You can also now test synchronous MPI communication (MPI_Ssend/MPI_Recv) using (-S). A launch utility (nplaunch) allows you to launch NPtcp, NPgm, NPib, and NPpvm from one side, using ssh to start the remote executable.
Version 3.6 adds the ability to test with and without cache effects, and the ability to offset both the source and destination buffers. A memcpy module has also been added.
Release 3.5 removes the CPU utilization measurements. getrusage() is probably not very accurate, so a dummy workload will eventually be used instead. The streaming mode has also been fixed: when run at Gigabit speeds, the TCP window size would collapse, limiting the performance of subsequent data points. The sockets are now reset between trials to prevent this. A module to evaluate memory copy rates has also been added. -n now sets a constant number of repeats for each trial. -r resets the sockets between each trial (automatic for streaming).
Release 3.3 includes an Infiniband module for the Mellanox VAPI. It also has an integrity check (-i), which is still being developed.
Version 3.2 includes additional modules to test PVM, TCGMSG, SHMEM, and MPI-2, as well as the GM, GPSHMEM, ARMCI, and LAPI software layers they run upon.
If you have problems or comments, please email [email protected]
NetPIPE: Network Protocol Independent Performance Evaluator, Release 2.3. Copyright 1997, 1998 Iowa State University Research Foundation, Inc.
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation. You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
NetPIPE requires an ANSI C compiler. You are on your own for installing the various libraries that NetPIPE can be used to test.
Review the provided makefile and change any necessary settings, such as the CC compiler or CFLAGS flags, any required extra libraries, and the PVM library and include-file pathnames if you have these communication libraries. Alternatively, you can specify these changes on the make command line. The line below compiles the NPtcp module using the icc compiler instead of the default cc compiler.
make CC=icc tcp
Compile NetPIPE with the desired communication interface by using:
make mpi       (this will use the default MPI on the system)
make pvm       (you may need to set some paths in the makefile)
make tcgmsg    (you will need to set some paths in the makefile)
make mpi2      (this will test 1-sided MPI_Put() functions)
make shmem     (1-sided library for Cray and SGI systems)
make tcp
make tcp6      (for IPv6 enabled systems)
make ipx       (for IPX enabled systems)
make sctp      (for SCTP enabled systems)
make sctp6     (for SCTP over IPv6 enabled systems)
make gm        (for Myrinet cards; you will need to set some paths)
make gpshmem   (SHMEM interface for other machines)
make armci     (still under development)
make lapi      (for the IBM SP)
make ib        (for Mellanox InfiniBand adapters, uses the VAPI layer)
make ibv       (for OpenFabrics InfiniBand devices)
make memcpy    (uses memcpy to copy data between buffers in 1 process)
make MP_memcpy (uses an optimized copy in MP_memcpy.c to copy data
                between buffers; requires icc or gcc 3.x)
NetPIPE dumps its output to the screen by default and also to the file np.out. The following parameters can be used to change how NetPIPE is run; they are listed roughly in order of general usefulness, and an example invocation combining several of them follows the list.
-b: specify send and receive TCP buffer sizes e.g. "-b 32768"
This can make a huge difference for Gigabit Ethernet cards.
You may need to tune the OS to set a larger maximum TCP
buffer size for optimal performance.
-O: specify send and optionally receive buffer offsets, e.g. "-O 1,3"
-l: lower bound (start value for block size) e.g. "-l 1"
-u: upper bound (stop value for block size) e.g. "-u 1048576"
-o: specify output filename e.g. "-o output.txt"
-z: for MPI, receive messages using MPI_ANY_SOURCE
-g: MPI-2: use MPI_Get() instead of MPI_Put()
-f: MPI-2: do not use a fence call (may not work for all packages)
-I: Invalidate cache: Take measures to eliminate the effects cache
has on performance.
-a: asynchronous receive (a.k.a. pre-posted receive)
May not have any effect, depending on your implementation
-B: burst all preposts before measuring performance
Normally only one receive is preposted at a time with -a
-p: set perturbation offset of buffer size, e.g. "-p 3"
-i: Integrity check: Check the integrity of data transfer instead
of performance
-s: stream option (default mode is "ping pong")
If this option is used, it must be specified on both
the sending and receiving processes
-S: Use synchronous sends/receives for MPI.
-2: Bi-directional communications. Transmit in both directions
simultaneously.
-P: Set the port number used by TCP to something other than
default.
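As an example, the following runs a TCP test between two machines using larger socket buffers and a named output file (the host names are hypothetical, and the buffer size is only illustrative):

remote_host> NPtcp -b 262144
local_host> NPtcp -h remote_host -b 262144 -o tcp_gige.out

On Linux systems, the kernel limits on socket buffer sizes may need to be raised before large -b values take full effect, for example (illustrative values, run as root):

sysctl -w net.core.rmem_max=8388608
sysctl -w net.core.wmem_max=8388608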
Compile NetPIPE using 'make tcp'
remote_host> NPtcp [options]
local_host> NPtcp -h remote_host [options]
OR
local_host> nplaunch NPtcp -h remote_host [options]
Compile NetPIPE using 'make tcp6'
remote_host> NPtcp6 [options]
local_host> NPtcp6 -h remote_host [options]
OR
local_host> nplaunch NPtcp6 -h remote_host [options]
Compile NetPIPE using 'make ipx'
remote_host> NPipx [options]
local_host> NPipx -h remote_host [options]
OR
local_host> nplaunch NPipx -h remote_host [options]
Compile NetPIPE using 'make sctp'
remote_host> NPsctp [options]
local_host> NPsctp -h remote_host [options]
OR
local_host> nplaunch NPsctp -h remote_host [options]
Compile NetPIPE using 'make sctp6'
remote_host> NPsctp6 [options]
local_host> NPsctp6 -h remote_host [options]
OR
local_host> nplaunch NPsctp6 -h remote_host [options]
Install MPICH
Compile NetPIPE using 'make mpi'
Use a p4pg file or edit the mpich/util/mach/mach.{ARCH} file
to specify the machines to run on
mpirun [-nolocal] -np 2 NPmpi [options]
'setenv P4_SOCKBUFSIZE 256000' can make a huge difference for
MPICH on Unix systems.
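As a sketch, assuming the ch_p4 device and a machine file that simply lists one hostname per line (the file name and host names are hypothetical):

cat machines
host1
host2
setenv P4_SOCKBUFSIZE 256000
mpirun -machinefile machines -np 2 NPmpi -o mpich.out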
Install LAM
Compile NetPIPE using 'make mpi'
put the machine names into a lamhosts file
'lamboot -v -b lamhosts' to start the lamd daemons
mpirun -np 2 [-O] NPmpi [options]
The -O parameter avoids data translation for homogeneous systems.
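For example, with hypothetical host names:

cat lamhosts
host1
host2
lamboot -v -b lamhosts
mpirun -np 2 -O NPmpi -o lam.out
lamhalt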
Install MPI/Pro
Compile NetPIPE using 'make mpi'
put the machine names into /etc/machines or a local machine file
mpirun -np 2 NPmpi [options]
Install MP_Lite (http://www.scl.ameslab.gov/Projects/MP_Lite/)
Compile NetPIPE using 'make MP_Lite'
mprun -np 2 -h {host1} {host2} NPmplite [options]
Install PVM (comes on the RedHat distributions now)
Set the PVM paths in the makefile if necessary.
Compile NetPIPE using 'make pvm'
use the 'pvm' utility to start the pvmd daemons
type 'pvm' to start it (this will also start pvmd on the local_host)
pvm> help --> lists all commands
pvm> add remote_host --> will start a pvmd on the machine called 'remote_host'
pvm> quit --> when you have all the pvmd machines started
remote_host> NPpvm [options]
local_host> NPpvm -h remote_host [options]
OR
local_host> nplaunch NPpvm -h remote_host [options]
Changing PVMDATA in netpipe.h and PvmRouteDirect in pvm.c can
affect the performance greatly.
Install TCGMSG package
Set the TCGMSG paths in the makefile.
Compile NetPIPE using 'make tcgmsg'
create a NPtcgmsg.p file with hosts and paths (see hosts/NPtcgmsg.p)
parallel NPtcgmsg
(no options can be passed into this version)
Install the MPI package
Compile NetPIPE using 'make mpi2'
Follow the directions for running the MPI package from above
The MPI_Put() function will be tested with fence calls by default.
Use -g to test MPI_Get() instead, or -f to do MPI_Put() without
fence calls (will not work with LAM).
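As a sketch, assuming the mpi2 target builds an executable named NPmpi2 and using the same launch method as the MPI module above:

mpirun -np 2 NPmpi2 (MPI_Put() with fence calls, the default)
mpirun -np 2 NPmpi2 -g (MPI_Get() with fence calls)
mpirun -np 2 NPmpi2 -f (MPI_Put() without fence calls; not with LAM)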
Must be run on a Cray or SGI system that supports SHMEM calls.
Compile NetPIPE using 'make shmem'
(Xuehua, fill out the rest)
Ask Ricky or Krzysztof for help :).
Install the GM package and configure the Myrinet cards
Compile NetPIPE using 'make gm'
remote_host> NPgm [options]
local_host> NPgm -h remote_host [options]
OR
local_host> nplaunch NPgm -h remote_host [options]
Log into IBM SP machine at NERSC
Compile NetPIPE using 'make lapi'
To run interactively at NERSC:
Set environment variable MP_MSG_API to lapi
e.g. 'setenv MP_MSG_API lapi', 'export MP_MSG_API=lapi'
Run NPlapi with '-procs 2' to tell the parallel environment you
want 2 nodes. Use any other options that are applicable to
NetPIPE.
To submit a batch job at NERSC:
Copy the file batchLapi from the 'hosts' directory to the directory
containing NPlapi.
Edit the copy of batchLapi:
job_name: Identifying name of job, can be anything
output: File to send stdout to
error: File to send stderr to (most of NetPIPE's output
will go here)
tasks_per_node: Number of tasks to be run on each node
node: Number of nodes to run on
(Use a combination of the above two options to determine
how NetPIPE runs. Use 1 task per node and 2 nodes to run
benchmark between nodes. Use 2 tasks per node and 1 node
to run benchmark on single node)
Use whatever command-line options are appropriate for NetPIPE
Submit the job with the command 'llsubmit batchLapi'
Check status of all your jobs with 'llqs -u <user>'
You should receive an email when the job finishes. The resulting output
files will then be available.
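For reference, a minimal batchLapi sketch using LoadLeveler keywords might look like the following. The copy shipped in the hosts directory is authoritative; the names and values here are only illustrative, and your site may require additional keywords (network or environment settings such as MP_MSG_API):

# @ job_name = netpipe_lapi
# @ output = np_lapi.out
# @ error = np_lapi.err
# @ job_type = parallel
# @ node = 2
# @ tasks_per_node = 1
# @ queue
NPlapi [options]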
Install the ARMCI package
Compile NetPIPE using 'make armci'
Follow the directions for running the MPI package from above
If running on interfaces other than the default, create a file
called armci_hosts, containing two lines, one for each hostname,
then run the package.
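For example, an armci_hosts file selecting two hypothetical hosts would simply contain:

host1
host2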
This test will only work on machines that are connected via both
TCP/IP and InfiniBand.
Install the InfiniBand adapters and the Mellanox VAPI software
(or OFED-1.1 / OFED-1.2)
Make sure the adapters are up and running (e.g. check that the
Mellanox-supplied bandwidth/latency program, perf_main, works if
you have it; for OFED, make sure ibv_rc_pingpong works)
Compile NetPIPE using 'make ib' for VAPI and 'make ibv' for OFED
(The environment variable MTHOME needs to be set to the directory
containing the include and lib directories for the Mellanox software).
The ibv module should build if the OFED libraries are in the standard
install locations; if not, edit the makefile.
remote_host> NPib [-options]
local_host> NPib -h remote_host [-options]
OR
local_host> nplaunch NPib -h remote_host [options]
(remote_host should be the IP address or hostname of the other host)
Other options (this documentation is out of date for the ibv module):
Use -m to select mtu size for Infiniband adapter.
Valid values are 256, 512, 1024, 2048, 4096. Default is 1024.
Use -t to select the communications type.
Possible values are
send_recv: basic send and receive
send_recv_with_imm: send and receive with immediate data
rdma_write: one-sided remote dma write
rdma_write_with_imm: one-sided remote dma write with immediate data
Default is send_recv.
Use -c to select the message completion type.
Possible values are
local_poll: poll on last byte of receive buffer
vapi_poll: use VAPI polling mechanism
event: use VAPI event completion mechanism
Default is local_poll.
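For example, to test one-sided RDMA writes with event-based completion and a 2048-byte MTU (host names are hypothetical):

remote_host> NPib -m 2048 -t rdma_write -c event
local_host> NPib -h remote_host -m 2048 -t rdma_write -c event -o ib_rdma.out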
NetPIPE generates a np.out file by default, which can be renamed using the -o option. This file contains three columns: the message size in bytes, the throughput in Mbps, and the round-trip time divided by two (in seconds). The first two columns can therefore be used to produce a throughput versus message size graph.
The screen output contains this same information, plus the test number and the number of ping-pongs involved in each test.
more np.out
     1       0.136403       0.00005593
     2       0.274586       0.00005557
     3       0.402104       0.00005692
     4       0.545668       0.00005593
     6       0.805053       0.00005686
     8       1.039586       0.00005871
    12       1.598912       0.00005726
    13       1.700719       0.00005832
    16       2.098007       0.00005818
    19       2.340364       0.00006194
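For a quick look at the throughput curve, the first two columns can be plotted directly, for example with gnuplot (an illustrative one-liner):

gnuplot -e "set logscale x; plot 'np.out' using 1:2 with linespoints; pause -1"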
The -I switch can be used to reduce the effects cache has on performance.
Without the switch, NetPIPE tests the performance of communicating
n-byte blocks by reading from an n-byte buffer on one node, sending data
over the communications link, and writing to an n-byte buffer on the other
node. For each block size, this trial will be repeated x times, where x
typically starts out very large for small block sizes, and decreases as the
block size grows. The same buffers on each node are used repeatedly, so
after the first transfer the entire buffer will be in cache on each node,
given that the block-size is less than the available cache. Thus each transfer
after the first will be read from cache on one end and written into cache on
the other. Depending on the cache architecture, a write to main memory may
not occur on the receiving end during the transfer loop.
While the performance measurements obtained from this method are certainly useful, it is also interesting to use the -I switch to measure performance when data is read from and written to main memory. In order to facilitate this, large pools of memory are allocated at startup, and each n-byte transfer comes from a region of the pool not in cache. Before each series of n-byte transfers, every byte of a large dummy buffer is individually accessed in order to flush the data for the transfer out of cache. After this step, the first n-byte transfer comes from the beginning of the large pool, the second comes from n-bytes after the beginning of the pool, and so on (note that stride between n-byte transfers will depend on the buffer alignment setting). In this way we make sure each read is coming from main memory.
On the receiving end data is written into a large pool in the same fashion that it was read on the transmitting end. Data will first be written into cache. What happens next depends on the cache architecture, but one case is that no transfer to main memory occurs, YET. For moderately large block sizes, however, a large number of transfer iterations will cause reuse of cache memory. As this occurs, data in the cache location to be replaced must be written back to main memory, so we incur a performance penalty while we wait for the write.
In summary, using the -I switch gives worst-case performance (i.e. all data transfers involve reading from or writing to memory not in cache) and not using the switch gives best-case performance (i.e. all data transfers involve only reading from or writing to memory in cache). Note that other combinations, such as reading from memory in cache and writing to memory not in cache, would give intermediary results. We chose to implement the methods that will measure the two extremes.
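To compare the two extremes on the same link, the same test can simply be run twice, once with and once without -I (a sketch; -I is given on both ends here, and host names are hypothetical):

remote_host> NPtcp
local_host> NPtcp -h remote_host -o tcp_cache.out

remote_host> NPtcp -I
local_host> NPtcp -h remote_host -I -o tcp_nocache.out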
- We need to replace the getrusage stuff from version 2.4 with a dummy workload ... We have a working version using DGEMM internally; email [email protected] for more info.
- The ibv module needs better documentation and support for the connection manager.