You can either use shared memory (SHMEM) routines alone or mix them into a program that primarily uses PVM (glossary, ) or MPI (glossary, ), thereby offering opportunities for optimizations beyond what the message-passing protocols can provide. Be aware, however, that SHMEM is not a standard protocol and will not be available on machines developed by companies other than Silicon Graphics and Cray Research. SHMEM is supported on Cray PVP systems, Cray MPP systems, and on Silicon Graphics systems.
For background information on SHMEM, see Section 1.1.2. For an introduction to the SHMEM routines, see the shmem_intro(3) man page.
This chapter describes the following optimization techniques:
Improving data transfer rates in any CRAY T3E program by using SHMEM get and put routines (see Section 3.1). This section provides an introduction to data transfer, which is the most important capability that SHMEM offers.
Improving the performance of a PVM or MPI program by adding SHMEM data manipulation routines (see Section 3.2).
Avoiding performance pitfalls when passing 32-bit data rather than 64-bit data (see Section 3.3).
Copying strided data while maintaining maximum performance. The strided data routines enable you, for example, to divide the elements of an array among a set of processing elements (PEs) or pull elements from arrays on multiple PEs into a single array on one PE (see Section 3.4).
Gathering and scattering data and reordering it in the process (see Section 3.5).
Broadcasting data from one PE to all PEs (see Section 3.6).
Merging arrays from each PE into a single array on all PEs (see Section 3.7).
Executing an atomic memory operation to read and update a remote value in a single, uninterruptible operation (see Section 3.8).
Using reduction routines to execute a variety of operations across multiple PEs (see Section 3.9).
In general, avoiding communication between PEs (including data transfer) improves performance: the fewer the communications, the faster your program can execute. Data transfer is often necessary, however. Finding the fastest method of passing data is an important optimization, and the SHMEM routines are usually the fastest method available.
The SHMEM_PUT64 and SHMEM_GET64 routines avoid the extra overhead sometimes associated with message-passing routines by moving data directly between user-specified memory locations on the local and remote PEs.
For both small and large transfers, the SHMEM_PUT64 routine, which moves data from the local PE to a remote PE, and the SHMEM_GET64 routine, which moves data from a remote PE to the local PE, perform virtually the same. At times, SHMEM_PUT64 may be the better choice because it lets the calling PE perform other work while the data is in the network. Because SHMEM_PUT64 is asynchronous, statements that follow it can execute while the data is being copied to the memory of the receiving PE. SHMEM_GET64 forces the calling PE to wait until the data is in local memory, so no early work can be done.
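As a sketch of this overlap, the fragment below starts a put, does unrelated local work while the data is in flight, and then uses the SHMEM_QUIET routine to force completion of the outstanding put before synchronizing. This is an illustrative fragment under those assumptions, not a complete program:

```fortran
      INCLUDE "mpp/shmem.fh"
      INTEGER SOURCE(8), DEST(8), LOCAL(8)
      SAVE DEST
C     Start the transfer.  SHMEM_PUT64 may return before the
C     data has arrived in DEST on PE 0.
      CALL SHMEM_PUT64(DEST, SOURCE, 8, 0)
C     Overlap: do local work that does not depend on the
C     transfer while the data is in the network.
      DO I = 1,8
         LOCAL(I) = 2 * I
      ENDDO
C     Force completion of the put before synchronizing or
C     reusing SOURCE.
      CALL SHMEM_QUIET
```

An equivalent SHMEM_GET64 version would have no such window: the call itself would not return until the data was already in local memory.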
Passing data in large chunks is always faster than passing it in small chunks because it saves subroutine overhead. Whenever possible, put all of your data (such as an array) into a single SHMEM_PUT64 or SHMEM_GET64 call rather than calling the routine iteratively.
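To illustrate, both fragments below move the same eight words to PE 0, but the element-at-a-time loop pays the subroutine overhead on every iteration while the single call pays it once:

```fortran
C     Slower: eight calls, eight times the subroutine overhead.
      DO I = 1,8
         CALL SHMEM_PUT64(DEST(I), SOURCE(I), 1, 0)
      ENDDO

C     Faster: one call moves the entire array.
      CALL SHMEM_PUT64(DEST, SOURCE, 8, 0)
```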
In the following example, eight 64-bit words are transferred from PE 1 to PE 0 by using SHMEM_PUT64. PE numbering always begins with 0.
Example 3-1. Example of a SHMEM_PUT64 transfer
 1.       INCLUDE "mpp/shmem.fh"
 2.       INTEGER SOURCE(8), DEST(8)
 3.       INTRINSIC MY_PE
 4.       SAVE DEST
 5. C     On the sending PE
 6.       IF (MY_PE() .EQ. 1) THEN
 7.          DO I = 1,8
 8.             SOURCE(I) = I
 9.          ENDDO
10. C     PE 1 sends the data to PE 0.
11.          CALL SHMEM_PUT64(DEST, SOURCE, 8, 0)
12.       ENDIF
13.
14. C     Make sure the transfer is complete.
15.       CALL SHMEM_BARRIER_ALL()
16.
17. C     On the receiving PE
18.       IF (MY_PE() .EQ. 0) THEN
19.          PRINT *, 'DEST ON PE 0: ', DEST
20.       ENDIF
21.
22.       END
See the following figure for an illustration of the transfer.
The output from the example is as follows:
DEST ON PE 0: 1, 2, 3, 4, 5, 6, 7, 8
Defining the number of PEs in a program and the number in an active set as powers of 2 (that is, 2, 4, 8, 16, 32, and so on) helped performance on CRAY T3D systems. Also, declaring arrays as powers of 2 was necessary if you were using Cray Research Adaptive Fortran (CRAFT) on CRAY T3D systems. Both have changed as follows on CRAY T3E systems:
Declaring arrays such as SOURCE and DEST as multiples of 8 helps SHMEM speed things up somewhat, since 8 is the vector length of a key component of the PE remote data transfer hardware. Declaring the number of elements as a power of 2 does not affect performance unless that number is also a multiple of 8.
Defining the number of PEs, whether you are referring to all PEs in a program or to the number involved in an active set, as a power of 2 does not usually enhance performance in a significant way on the CRAY T3E system. Some SHMEM routines, notably SHMEM_BROADCAST, still do benefit somewhat from having the number of PEs defined as a power of 2.
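One way to act on the multiple-of-8 advice above is to round array dimensions up at declaration time. The PARAMETER arithmetic below is a hypothetical illustration of that padding:

```fortran
C     N is the number of elements the program actually needs.
      PARAMETER (N = 13)
C     NPAD rounds N up to the next multiple of 8 (here, 16) to
C     match the vector length of the transfer hardware.
      PARAMETER (NPAD = 8 * ((N + 7) / 8))
      INTEGER SOURCE(NPAD), DEST(NPAD)
      SAVE DEST
```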
For information on optimizing existing PVM and MPI programs using SHMEM_GET64 and SHMEM_PUT64, see Section 3.2. For a complete description of the MPP-specific statements in the preceding example, continue on with this section.
In the SHMEM_PUT64 example (see Example 3-1), line 1 imports the SHMEM INCLUDE file, which defines parameters needed by many of the routines. The location of the file may be different on your system. Check with your system administrator if you do not know the correct path.
1. INCLUDE "mpp/shmem.fh"
Line 3 declares MY_PE as an intrinsic function; MY_PE returns the number of the PE on which it executes. Two versions of the MY_PE function exist on the CRAY T3E system: one in the external library and one as an intrinsic. The intrinsic version is marginally faster than the external library version, but the external library version is, and will continue to be, available on more Cray Research and Silicon Graphics supercomputer systems. Declaring MY_PE as an intrinsic is not necessary, but it ensures that you get the slightly faster version of the routine.
3. INTRINSIC MY_PE
The defined constant N$PES, which evaluates to the number of PEs in a program, is likewise slightly faster than the more portable external library routine NUM_PES. As with MY_PE, both versions return the same information.
The intrinsic function MY_PE and the constant N$PES are also faster than the equivalent library routines, SHMEM_MY_PE and SHMEM_N_PES. Both methods return the same information and are available on Cray PVP systems as well as Cray MPP systems.
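The choice between the two forms can be seen side by side in the sketch below; each pair of statements assigns the same value, so a program can prefer the intrinsic and constant for speed or the library routines for portability:

```fortran
      INCLUDE "mpp/shmem.fh"
      INTRINSIC MY_PE
      INTEGER SHMEM_MY_PE, SHMEM_N_PES
      INTEGER MYPE, NPES
C     Faster on CRAY T3E: the intrinsic and the defined constant.
      MYPE = MY_PE()
      NPES = N$PES
C     More portable: the external library routines return the
C     same values.
      MYPE = SHMEM_MY_PE()
      NPES = SHMEM_N_PES()
```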
Line 4 ensures that the remote array (DEST) is symmetric, which means that it has the same address on remote PEs as on the local PE.
4. SAVE DEST
You can make sure DEST is symmetric in any of the following ways. (None of these methods is significantly faster than the others.)
Name it in a SAVE statement, as in the example.
Include it in a common block.
If it is an array, allocate it by using shpalloc(3).
If it is a stack variable, declare it by using the !CDIR$ SYMMETRIC directive.
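The four approaches can be sketched together as follows. This is not a single compilable unit, and the SHPALLOC argument list shown (address, length in words, error code, abort flag) is an assumption based on the general shape of the shpalloc(3) interface; verify the exact argument list on your system:

```fortran
C     (1) SAVE statement.
      INTEGER DEST1(8)
      SAVE DEST1

C     (2) Common block.
      INTEGER DEST2(8)
      COMMON /SYMDAT/ DEST2

C     (3) Heap allocation with SHPALLOC, via a Cray pointer.
      POINTER (PTR, DEST3)
      INTEGER DEST3(8)
      CALL SHPALLOC(PTR, 8, IERR, 0)

C     (4) Stack variable with the SYMMETRIC directive.
      INTEGER DEST4(8)
!CDIR$ SYMMETRIC DEST4
```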
In line 6, the MY_PE function is called. The function returns the number of the calling PE, meaning only PE 1 will execute the THEN clause. As a result, the array SOURCE is initialized only on PE 1.
 5. C     On the sending PE
 6.       IF (MY_PE() .EQ. 1) THEN
 7.          DO I = 1,8
 8.             SOURCE(I) = I
 9.          ENDDO
In line 11, PE 1 executes the SHMEM_PUT64 routine call that sends the data. SHMEM_PUT64 is the variant of SHMEM_PUT that transfers 64-bit (KIND=8) data. It sends eight array elements from its SOURCE array to the DEST array on PE 0.
11.          CALL SHMEM_PUT64(DEST, SOURCE, 8, 0)
12.       ENDIF
Line 15 is a barrier, which provides a synchronization point. No PE proceeds beyond this point in the program until all PEs have arrived. The effect in this case is to wait until the transfer has finished. Without the barrier, PE 0 could print the DEST array before receiving the data. Calling SHMEM_BARRIER_ALL is as fast as calling the BARRIER routine directly.
15. CALL SHMEM_BARRIER_ALL()
Line 18 selects PE 0, which is passively receiving the data. Because SHMEM_PUT64 places the data directly into PE 0's local memory, PE 0 is not involved in the transfer operation. After being released from the barrier, PE 0 prints DEST, and the program exits.
18.       IF (MY_PE() .EQ. 0) THEN
19.          PRINT *, 'DEST ON PE 0: ', DEST
20.       ENDIF