Chapter 3. SHMEM

Table of Contents
3.1. Using SHMEM_GET64 and SHMEM_PUT64 for data transfer
3.2. Optimizing Existing MPI and PVM Programs by Using SHMEM
3.3. Passing 32-bit Data
3.4. Copying Strided Data
3.5. Gathering and Scattering Data
3.6. Broadcasting Data to Multiple PEs
3.7. Merging Arrays
3.8. Reading and Updating in One Operation
3.9. Using Reduction routines

You can either use shared memory (SHMEM) routines alone or mix them into a program that primarily uses PVM (glossary, ) or MPI (glossary, ), thereby offering opportunities for optimizations beyond what the message-passing protocols can provide. Be aware, however, that SHMEM is not a standard protocol and will not be available on machines developed by companies other than Silicon Graphics and Cray Research. SHMEM is supported on Cray PVP systems, Cray MPP systems, and on Silicon Graphics systems.

For background information on SHMEM, see Section 1.1.2. For an introduction to the SHMEM routines, see the shmem_intro(3) man page.

This chapter describes the following optimization techniques:

3.1. Using SHMEM_GET64 and SHMEM_PUT64 for data transfer

In general, avoiding communications between PEs (including data transfer) improves performance. The fewer the number of communications, the faster your program can execute. Data transfer is, however, often necessary. Finding the fastest method of passing data is an important optimization, and the SHMEM routines are usually the fastest method available.

The SHMEM_PUT64 and SHMEM_GET64 routines avoid the extra overhead (glossary, ) sometimes associated with message passing routines by moving data directly between the user-specified memory locations on local and remote PEs.

For both small and large transfers, the SHMEM_PUT64 routine, which moves data from the local PE to a remote PE, and the SHMEM_GET64 routine, which moves data from a remote PE to the local PE, are virtually the same in terms of performance. At times, SHMEM_PUT64 may be the better choice because it lets the calling PE perform other work while the data is in the network. Because SHMEM_PUT64 is asynchronous, it may allow statements that follow it to execute while the data is in the process of being copied to the memory of the receiving PE. SHMEM_GET64 forces the calling PE to wait until the data is in local memory (glossary, ), meaning that no early work can be done.

Passing data in large chunks is always faster than passing it in small chunks because it saves subroutine overhead. Whenever possible, put all of your data (such as an array) into a single SHMEM_PUT64 or SHMEM_GET64 call rather than calling the routine iteratively.

In the following example, eight 64-bit words are transferred from PE 1 to PE 0 by using SHMEM_PUT64. PE numbering always begins with 0.


Example 3-1. Example of a SHMEM_PUT64 transfer

 1.       INCLUDE "mpp/shmem.fh"
 2.       INTEGER SOURCE(8), DEST(8)
 3.       INTRINSIC MY_PE
 4.       SAVE DEST
 5. C On the sending PE
 6.       IF (MY_PE() .EQ. 1) THEN
 7.         DO I = 1,8
 8.           SOURCE(I) = I
 9.         ENDDO
10. C PE 1 sends the data to PE 0.
11.        CALL SHMEM_PUT64(DEST, SOURCE, 8, 0)
12.       ENDIF
13. 
14. C Make sure the transfer is complete.
15.       CALL SHMEM_BARRIER_ALL()
16. 
17. C On the receiving PE
18.       IF (MY_PE() .EQ. 0) THEN
19.         PRINT *, 'DEST ON PE 0: ', DEST
20.       ENDIF
21. 
22.       END


See the following figure for an illustration of the transfer.

Figure 3-1. SHMEM_PUT64 data transfer

The output from the example is as follows:

 DEST ON PE 0:  1,  2,  3,  4,  5,  6,  7,  8

Defining the number of PEs in a program and the number in an active set (glossary, ) as powers of 2 (that is, 2, 4, 8, 16, 32, and so on) helped performance on CRAY T3D systems. Also, declaring arrays as powers of 2 was necessary if you were using Cray Research Adaptive Fortran (CRAFT) on CRAY T3D systems. Both have changed as follows on CRAY T3E systems:

For information on optimizing existing PVM and MPI programs using SHMEM_GET64 and SHMEM_PUT64, see Section 3.2. For a complete description of the MPP-specific statements in the preceding example, continue on with this section.

In the SHMEM_PUT64 example (see Example 3-1), line 1 imports the SHMEM INCLUDE file, which defines parameters needed by many of the routines. The location of the file may be different on your system. Check with your system administrator if you do not know the correct path.

1.       INCLUDE "mpp/shmem.fh"

Line 3 declares the intrinsic function MY_PE, which returns the number of the PE on which it executes. Two versions of MY_PE function exist on the CRAY T3E system, one in the external library and one as an intrinsic. The intrinsic version is marginally faster than external library version, but the external library version is now, and will be in the future, on more Cray Research and Silicon Graphics supercomputer systems. Declaring MY_PE as an intrinsic is not necessary, but it will ensure you of getting the slightly faster version of the routine.

3.       INTRINSIC MY_PE 

The defined constant N$PES, which returns the number of PEs in a program, is also slightly faster than the more portable external library routine NUM_PES. Like MY_PE, both versions return the same information.

The intrinsic function MY_PE and the constant N$PES are also faster than using equivalent message-passing routines, such as SHMEM_MY_PE and SHMEM_N_PES. Both methods return the same information and are available on Cray PVP systems as well as Cray MPP systems.

Line 4 ensures that the remote array (DEST) is symmetric (glossary, ), which means that it has the same address on remote PEs as on the local PE.

4.       SAVE DEST

You can make sure DEST is symmetric in any of the following ways. (None of these methods is significantly faster than the others.)

In line 6, the MY_PE function is called. The function returns the number of the calling PE, meaning only PE 1 will execute the THEN clause. As a result, the array SOURCE is initialized only on PE 1.

5. C On the sending PE
6.       IF (MY_PE() .EQ. 1) THEN
7.         DO I = 1,8
8.           SOURCE(I) = I
9.         ENDDO

In line 11, PE 1 executes the SHMEM_PUT64 routine call that sends the data. SHMEM_PUT64 is the variant of SHMEM_PUT64 that transfers 64-bit (KIND=8) data. It sends eight array elements from its SOURCE array to the DEST array on PE 0.

11.        SHMEM_PUT64(DEST, SOURCE, 8, 0)
12.      ENDIF

Line 15 is a barrier (glossary, ), which provides a synchronization point (glossary, ). No PE proceeds beyond this point in the program until all PEs have arrived. The effect in this case is to wait until the transfer has finished. Without the barrier, PE 0 could print the DEST array before receiving the data. Calling SHMEM_BARRIER_ALL is as fast as calling the BARRIER routine directly.

15.      CALL SHMEM_BARRIER_ALL()

Line 18 selects PE 0, which is passively receiving the data. Because SHMEM_PUT64 places the data directly into PE 0's local memory, PE 0 is not involved in the transfer operation. After being released from the barrier, PE 0 prints DEST, and the program exits.

18.      IF (MY_PE() .EQ. 0) THEN
19.        PRINT *, 'DEST ON PE 0: ', DEST
20.      ENDIF