Optimizing Applications on the Cray X1™ System - S-2315-50
Chapter 1. Overview
This section summarizes the Cray X1 hardware features that are relevant to optimization. A more thorough discussion can be found in the Cray X1 System Overview.
The Cray X1 system is a highly scalable, non-uniform memory access (NUMA) system consisting of up to 1024 node modules. Each node module consists of four multistreaming processors (MSPs) and up to 64 GB of flat, shared memory. Each MSP is in turn composed of four processing elements called single-streaming processors (SSPs) and 2 MB of shared cache memory. Each SSP contains both a superscalar (RISC-type) processing unit and a two-pipe vector processing unit. Unless the program has been compiled to operate in single-streaming mode, each MSP automatically distributes the multistreamable parts of a user program to its component SSPs.
Note: While single-streaming mode is functionally available now, it will not be optimized until the Cray Programming Environment 5.1 release. Therefore, documentation of single-streaming mode is deferred.
Note: For the purposes of this guide, the term processor is used to mean either an MSP or an SSP, depending on how the program is compiled.
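The configuration limits above can be turned into totals with simple arithmetic. The following is a minimal sketch, not a vendor tool: the function name `system_totals` is hypothetical, and the memory figure assumes every node module is populated with the full 64 GB.

```python
# Figures taken from the hardware description above (maximum configuration).
MAX_NODES = 1024          # node modules per system (upper limit)
MSPS_PER_NODE = 4         # multistreaming processors per node module
SSPS_PER_MSP = 4          # single-streaming processors per MSP
MAX_NODE_MEMORY_GB = 64   # flat, shared memory per node module (upper limit)


def system_totals(nodes=MAX_NODES):
    """Return processor and memory totals for a system with `nodes` node modules.

    Assumes every node module carries the maximum 64 GB of memory.
    """
    msps = nodes * MSPS_PER_NODE
    ssps = msps * SSPS_PER_MSP
    memory_gb = nodes * MAX_NODE_MEMORY_GB
    return {"nodes": nodes, "msps": msps, "ssps": ssps, "memory_gb": memory_gb}


if __name__ == "__main__":
    full = system_totals()
    one = system_totals(nodes=1)
    print(f"Full system: {full['msps']} MSPs, {full['ssps']} SSPs, "
          f"{full['memory_gb'] // 1024} TB of memory")
    print(f"One node:    {one['msps']} MSPs, {one['ssps']} SSPs, "
          f"{one['memory_gb']} GB of memory")
```

Under these assumptions a fully configured system has 4096 MSPs, 16384 SSPs, and 64 TB of memory, while a single node module has 4 MSPs and 16 SSPs.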
Node modules function as application nodes, support nodes, or operating system (OS) nodes, as defined by software. There is no physical difference between the different types of node modules. The node flavor merely dictates the kinds of processes and threads that can use a node's resources.
This design means that the Cray X1 is a flexible system:
- Applications that stay within a node module behave as if they are running on a 16-processor shared-memory PVP system, much like a Cray SV1 series system.
- Applications that span node modules behave as if they are running on a distributed-memory MPP system, like the Cray T3E system.
In addition, parallel programming models can be combined to produce programs that run as shared-memory processes nested within distributed-memory applications, when advantageous.
The vector components of the SSPs implement Cray's new vector (NV-1) instruction set architecture. The NV-1 instruction set features:
- Support for 32-bit and 64-bit two's-complement integers
- Support for IEEE-754 floating-point format (both 32-bit and 64-bit)
- Fixed 32-bit width instructions with regular encoding
- Hardware features to support improved vectorization
- Cache allocation control to support explicit communication and reduce cache pollution
- Relaxed memory ordering rules with mechanisms for explicit synchronization
- Support for decoupled scalar and vector execution for maximum performance
Cray X1 systems also feature large register sets to reduce the number of memory accesses, reduce register spills, eliminate write-after-read register dependencies, and hide memory latency. These include:
- 32 vector registers, each with 64 elements
- 8 vector mask registers
- 64 scalar registers
- 64 address registers
- 8 control registers
- 1 bit-matrix multiply register
- 1 vector carry register
Note: Because of differences in the system architectures, the B (buffer) and T (intermediate storage) registers used on Cray SV1 series systems and the E (global data transfer) registers used on Cray T3E systems are not needed and do not exist on Cray X1 systems.
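To illustrate how the 64-element vector registers and the four SSPs of an MSP interact in multistreaming mode, the following sketch divides a loop among the SSPs and then breaks each share into chunks that fit one vector register. The even block partitioning and the function names are assumptions made for illustration only; this guide does not describe how the compiler actually schedules iterations.

```python
VECTOR_LENGTH = 64   # elements per vector register (from the register list)
SSPS_PER_MSP = 4     # SSPs in one MSP


def split_among_ssps(n_iterations, n_ssps=SSPS_PER_MSP):
    """Divide iterations into contiguous blocks, one per SSP (assumed scheme)."""
    base, extra = divmod(n_iterations, n_ssps)
    # The first `extra` SSPs take one additional iteration each.
    return [base + (1 if i < extra else 0) for i in range(n_ssps)]


def vector_chunks(n_iterations, vector_length=VECTOR_LENGTH):
    """Strip-mine a block of iterations into chunks that fit one vector register."""
    full, remainder = divmod(n_iterations, vector_length)
    return [vector_length] * full + ([remainder] if remainder else [])


if __name__ == "__main__":
    for ssp, block in enumerate(split_among_ssps(1000)):
        chunks = vector_chunks(block)
        print(f"SSP {ssp}: {block} iterations in {len(chunks)} vector chunks, "
              f"last chunk has {chunks[-1]} elements")
```

With a 1000-iteration loop, each SSP in this model receives 250 iterations, which it processes as three full 64-element chunks and one 58-element chunk. Loops whose per-SSP share is a multiple of 64 leave no partially filled chunk.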
There are three forms of cache in a Cray X1 system. Each SSP has a 16 KB scalar data cache and a 16 KB instruction cache, while each MSP has a 2 MB instruction and data cache that is shared by the four SSPs. Processors may cache data from their local node modules only; references to memory on other node modules are not cached locally.
In addition, the 32 KB of vector register space effectively functions as the lowest level of processor data cache, similar to the L1 cache found on microprocessor-based systems.
For more information about cache, see Appendix D.
Memory is shared and flat within each node module in a Cray X1 system.
Memory is also shared and addressable across the entire Cray X1 system: any MSP or SSP on any node module can access any memory location on any other node module. However, because of differences in network channel speeds, shared-memory or distributed-memory operations that stay within a single node module are much faster than operations that span node modules.
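In practice, it matters whether two communicating processors sit on the same node module. The sketch below classifies a pair of MSPs as intra-node or inter-node and computes how many node modules a job needs. It assumes, hypothetically, that MSPs are numbered consecutively so that MSPs 0 to 3 belong to node 0, MSPs 4 to 7 to node 1, and so on; the numbering and placement used by the real system are not described here, and all function names are illustrative.

```python
MSPS_PER_NODE = 4  # from the hardware description


def node_of(msp_index):
    """Node module holding a given MSP, under the assumed consecutive numbering."""
    if msp_index < 0:
        raise ValueError("MSP index must be non-negative")
    return msp_index // MSPS_PER_NODE


def access_kind(msp_a, msp_b):
    """Classify memory traffic between two MSPs as intra-node or inter-node."""
    return "intra-node" if node_of(msp_a) == node_of(msp_b) else "inter-node"


def nodes_needed(n_msps):
    """Smallest number of node modules that can hold n_msps MSPs."""
    return -(-n_msps // MSPS_PER_NODE)  # ceiling division


if __name__ == "__main__":
    print(access_kind(0, 3))   # both MSPs on node 0
    print(access_kind(3, 4))   # MSP 3 on node 0, MSP 4 on node 1
    print(nodes_needed(5))     # a 5-MSP job cannot fit in one node module
```

A job that fits in four MSPs can stay within one node module. A five-MSP job necessarily spans two node modules, so some of its memory references cross the slower inter-node network and cannot be cached locally.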
The Cray X1 system uses dynamic memory allocation and virtual memory exclusively. Therefore, the hardware-based user area, stack, and heap allocation principles discussed in other Cray optimization guides are not relevant on the Cray X1.
In the Cray X1 system, all disk I/O is done using directly attached Fibre Channel RAID with intelligent RAID controllers. Therefore, hardware-specific optimizations such as disk striping are not discussed in this guide. For information about tuning disk I/O, see UNICOS/mp Networking Facilities Administration.
Within a program, performance gains can be realized through careful attention to I/O calls and use of the assign environment and flexible file I/O (FFIO) system. For more information, see Chapter 4.