Optimizing Applications on the Cray X1™ System


Table of Contents
Preface
Accessing Cray Documentation
Error Message Explanations
Typographical Conventions
Ordering Documentation
Reader Comments
1. Overview
1.1. Related Publications
1.2. System Overview
1.2.1. Vector Processors
1.2.2. Cache
1.2.3. Distributed Shared Memory
1.2.4. I/O
1.3. Optimization Flowchart
2. Evaluating Code
2.1. Is the Program Memory Bound?
2.2. Is the Program I/O Bound?
2.3. Is the Program Processor Bound?
3. Optimizing Memory-bound Code
4. Optimizing I/O-bound Code
4.1. Identifying I/O Intensive Code
4.2. Does the Code Use Formatted I/O?
4.3. Does the Code Use Large, Sequential, Unformatted I/O Requests?
4.4. Does the Code Use Small, Sequential, Unformatted I/O Requests?
4.5. Does the Code Use Direct Access I/O?
4.6. Does the Code Use Asynchronous I/O Requests?
4.7. Using an Optimal Storage Device
4.7.1. Memory-Resident (MR) Files
4.8. Minimizing System Calls
5. Optimizing Processor-bound Code
5.1. Processor Optimization Principles
5.1.1. Vectorization
5.1.2. Multistreaming
5.2. Analyzing Your Code
5.3. Optimization Techniques
5.3.1. Using More Aggressive Compiler Options
5.3.2. Using Directives
5.3.3. Rewriting Your Source Code
5.3.4. Latency and Bandwidth Issues
6. Optimizing Parallel Code
6.1. Identifying Problem Areas
6.2. Optimization Techniques
6.3. Supported Programming Models
6.3.1. Message Passing Interface (MPI)
6.3.2. Shared Memory (SHMEM)
6.3.3. Co-array Fortran (CAF)
6.3.4. Unified Parallel C (UPC)
6.3.5. Pthreads
6.3.6. OpenMP
A. Performance Tools
A.1. Common Timing Tools
A.2. Loopmark Listings
A.3. Decompiled Listings
A.4. CrayPat
A.4.1. Loading the CrayPat Module
A.4.2. Online Documentation
A.4.3. Capturing Hardware Performance Counters
A.4.4. Sampling and Tracing Experiments
B. Hardware Performance Counters
B.1. E-chip (Cache)
B.2. M-chip (Memory)
B.3. P-chip (Processor)
C. Loopmark Examples
C.1. Listing Key
C.2. Examples
C.2.1. Vectorization
C.2.2. Multistreaming
C.2.3. Pattern-Matching
C.2.4. Loop Unrolling
C.2.5. Loop Interchange
C.2.6. Loop Collapse
C.2.7. Loop Fusion
C.2.8. Loop Blocking
D. Cache
D.1. Overview
D.1.1. D-cache and I-cache
D.1.2. E-cache
D.2. Cache Coherency and Consistency
D.3. Cache Localities and Strides
D.4. Cache Line Size
D.5. Cache Pollution Control
D.6. Bandwidth
Glossary
Index
List of Tables
B-1. E-chip Counters
B-2. M-chip Counters
B-3. P-chip Counters
List of Figures
1-1. Optimization Overview
2-1. Evaluating Code
3-1. Optimizing Memory-bound Code
4-1. I/O Optimization
4-2. I/O Optimization (continued)
5-1. Dividing Loop Iterations among SSPs
5-2. Optimizing Single-processor Performance
D-1. D-cache Structure
D-2. D-cache Address
D-3. E-cache Structure
D-4. E-cache Address
List of Examples
4-1. Fortran Direct Access
4-2. C++ Direct Access
List of Procedures
2.1. Determining Whether Code is Memory Bound
2.2. Determine Whether Code is I/O Bound
2.3. Determine Whether Code is Processor Bound
3.1. Optimizing Memory
4.1. Optimizing I/O Bound Code
4.2. Identifying I/O Intensive Code
4.3. Finding and Optimizing Formatted I/O
4.4. Finding and Optimizing Large, Sequential, Unformatted I/O Requests
4.7. Optimizing Small, Sequential, Unformatted I/O Requests
4.8. Finding and Optimizing Direct Access I/O
4.9. Optimizing Asynchronous I/O
5.1. Analyzing Single-processor Performance
6.1. Analyzing Parallel-processor Performance