Cray T3E™ Fortran Optimization Guide

Table of Contents
Related Publications
Ordering Printed Publications
Reader Comments
1. Background Information
1.1. Message-passing Protocols
1.1.1. Parallel Virtual Machine (PVM)
1.1.2. SHMEM
1.2. Hardware
1.2.1. Memory
1.2.2. Processing Element
1.2.3. Network and Peripherals
1.2.4. Memory Performance Information
1.3. Measuring Performance
2. Parallel Virtual Machine (PVM)
2.1. Setting the Size of a Message
2.2. Allocating Send Buffers
2.3. The Advantage of 32-bit Data
2.4. Sending and Receiving Stride-1 Data
2.5. Mixing Send and Receive Routines
2.6. Initializing and Packing Data
2.7. Working While You Wait
2.8. Avoiding Barriers
2.9. Using Broadcast or Multicast
2.10. Minimizing Synchronization Time When Receiving Data
2.11. Using the Reduction Functions
2.12. Gathering and Scattering Data
3. SHMEM
3.1. Using SHMEM_GET64 and SHMEM_PUT64 for Data Transfer
3.2. Optimizing Existing MPI and PVM Programs by Using SHMEM
3.2.1. Optimizing by Using SHMEM_GET64
3.2.2. Optimizing by Using SHMEM_PUT64
3.3. Passing 32-bit Data
3.4. Copying Strided Data
3.5. Gathering and Scattering Data
3.6. Broadcasting Data to Multiple PEs
3.7. Merging Arrays
3.8. Reading and Updating in One Operation
3.9. Using Reduction Routines
4. Single-PE Optimization
4.1. Unrolling Loops
4.2. Software Pipelining
4.2.1. Optimizing a Program with Software Pipelining
4.2.2. Identifying Loops for Pipelining
4.2.3. How Pipelining Works
4.3. Optimizing for Cache
4.3.1. Rearranging Array Dimensions for Cache Reuse
4.3.2. Padding Common Blocks and Arrays to Reduce Cache Conflict
4.3.3. Automatic Padding
4.4. Optimizing for Stream Buffers
4.4.1. Splitting Loops
4.4.2. Padding Common Blocks and Arrays for Loop Splitting
4.4.3. Changed Behavior from Loop Splitting
4.4.4. Maximizing Inner Loop Trip Count
4.4.5. Minimizing Stream Count
4.4.6. Grouping Statements That Use the Same Streams
4.4.7. Enabling and Disabling Stream Buffers
4.5. Optimizing Division Operations
4.6. Vectorization
4.6.1. Using the IVDEP Directive
4.7. Bypassing Cache
5. Input/Output
5.1. Strategies for I/O
5.1.1. Using a Single, Shared File
5.1.2. Using Multiple Files and Multiple PEs
5.1.3. Using a Single PE
5.2. Unformatted I/O
5.2.1. Sequential, Unformatted Requests
5.3. Formatted I/O
5.3.1. Reduce Formatted I/O
5.3.2. Make Large I/O Requests
5.3.3. Minimize Data Items
5.3.4. Use Longer Records
5.3.5. Format Manually
5.3.6. Change Edit Descriptors for Character Data
5.4. FFIO
5.4.1. Memory-resident Data Files
5.4.2. Distributed I/O
5.4.3. Using the Cache Layer
5.4.4. Using Library Buffers
5.5. Random Access
5.6. Striping
6. Glossary
List of Tables
1-1. Latencies and bandwidths for data cache access
1-2. Latencies and bandwidths for access that does not hit cache
4-1. Functional unit
List of Figures
1-1. Data transfer comparison
1-2. Position of E registers
1-3. Flow of data on a CRAY T3E node
1-4. Data flow on the EV5 microprocessor
1-5. First value reaches the microprocessor
1-6. Ninth value reaches the microprocessor
1-7. Output stream
1-8. An external GigaRing network
2-1. Fan-out method used by broadcasting routines
2-2. A PvmMax reduction
2-3. The gather/scatter process
3-1. SHMEM_PUT64 data transfer
3-2. Identification of neighbors in the ring program
3-4. Reordering elements during a scatter operation
3-5. The broadcast operation
3-6. An example of SHMEM_FCOLLECT
3-7. The SHMEM_REAL8_MIN_TO_ALL example
4-1. Overlapped iterations
4-2. Pipelining a loop with multiplications
4-3. Before and after array A has been optimized
4-4. Arrays B and C in local memory
4-5. Cache conflict between arrays B and C
4-6. Arrays B and C in local memory after padding
4-7. Data cache after padding
5-1. Multiple PEs using a single file
5-2. Multiple PEs and multiple files
5-3. I/O to and from a single PE
5-4. Data paths between disk and an array
5-5. Data layout for distributed I/O
List of Examples
2-1. Transferring 32-bit data
2-5. PvmSum
2-6. Gather operation
2-7. Scatter operation
3-1. Example of a SHMEM_PUT64 transfer
3-2. PVM version of the ring program
3-3. SHMEM_GET64 version of the ring program
3-4. SHMEM_PUT64 version of the ring program
3-5. 32-bit version of ring program
3-6. Passing strided data using SHMEM_REAL_IGET
3-7. Passing strided data using SHMEM_REAL_IPUT
3-8. SHMEM_IXPUT version of a reordered scatter
3-9. One-to-all broadcasting
3-11. Remote fetch and increment
3-12. Minimum value reduction routine
3-13. Summation using a reduction routine
4-1. Unoptimized code
4-2. Automatic padding
4-3. Automatic padding for smaller arrays
4-4. Original loop
4-5. Splitting loops
4-6. Stripmining
4-7. Splitting loops across IF statements
4-8. Splitting individual statements
4-9. Rearranging array dimensions
4-10. Minimizing streams
4-11. Reduced streams version
4-12. Original code
4-13. Grouping statements within the loop
4-14. Loop that will be split into four
4-15. Loop that will be split into two
4-16. Original code
4-17. Modified code
4-18. Transforming a loop for vectorization
5-1. Distributed I/O
5-2. Disk striping from within a program