Cray SV1™ Application Optimization Guide - S-2312-36
Chapter 1. Introduction
There are three models of the Cray SV1 series of systems:
The Cray SV1 model, which is the system as originally released.
The Cray SV1e model, which adds a faster processor to the original system.
The Cray SV1ex model, which adds faster memory to the Cray SV1e model and supports the Solid State Storage Device (SSD-I). See Section 6.8 for more information regarding the SSD-I.
Because the Cray SV1 series system performs multistreaming, its processors are referred to as multistreaming processors (MSPs). Each MSP divides parallel work among four single-streaming processors (SSPs). See Chapter 3 for information on optimizing your program using multistreaming.
Each SSP has 2 add and 2 multiply functional units, which together deliver 4 floating-point results per processor clock cycle.
The Cray SV1 series systems are significantly different from previous Cray vector machines in that they provide a cache for the data resulting from scalar, vector, and instruction buffer memory references. Like their predecessors, the Cray SV1 series systems achieve high bandwidth to memory for both unit and non-unit stride memory references.
The memory architecture is uniform access, shared central memory. Uniform memory access (UMA) means that the access time for any CPU memory reference to any location in memory is the same. Memory capacity for the system ranges from a minimum of 4 GB up to a maximum of 128 GB.
The following table describes differences among the models:
Number of SSPs
Clock rate of an SSP in megahertz
Peak GFLOPS per SSP
Peak GFLOPS for the system
Memory size in GB
* Maximum value includes SSD-I.
The Cray SV1 series system has two module types: processor and memory. The system must be configured with eight memory modules and one to eight processor modules. Each processor module has four CPUs.
Cray SV1 systems have Cray SV1 CPUs.
A Cray J90 processor module can be upgraded with a Cray SV1 processor module and the Cray J90 system can be configured with both processor module types, Cray J90 or Cray SV1. The original Cray SV1 model processors cannot be mixed with Cray SV1e or Cray SV1ex model processors.
The following figure shows the block diagram for a single processor.
The Cray SV1 system has Cray SV1 processors. Cray SV1ex systems have Cray SV1e processors. (Some Cray SV1ex systems also have extended memory, called ex memory.)
The Cray SV1 processor uses a custom CMOS chip. The processor is implemented using two chip types: processor and cache.
The processor chip contains the vector and scalar units. Scalar registers, scalar functional units, and the instruction buffers reside in the scalar unit, while the vector unit contains vector registers and the vector functional units. As in previous Cray vector systems, the processor contains eight vector (V) registers, eight scalar (S) registers backed by 64 T registers, and eight address (A) registers backed by 64 B registers. A parallel job also has access to eight shared B and eight shared T registers, which are used for low-overhead data passing and synchronization between processors.
A vector functional unit contains two pipes, each capable of producing a result every processor clock cycle. This results in a peak rate for a functional unit of two results per clock cycle. The maximum vector length, or VL, is 64. The combined floating-point functional units, add and multiply, deliver four results per processor clock cycle.
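The peak-rate arithmetic in this paragraph can be written out as a short calculation. The following Python sketch is illustrative only: the function name and the 500 MHz clock rate are assumptions chosen for the example, not values stated in this section.

```python
def peak_gflops(clock_mhz, pipes_per_unit=2, fp_units=2, ssps=1):
    """Peak floating-point rate in GFLOPS.

    pipes_per_unit: each vector functional unit has two pipes,
                    each producing one result per clock cycle.
    fp_units:       the add unit and the multiply unit.
    clock_mhz:      processor clock rate, supplied by the caller.
    ssps:           number of SSPs working together.
    """
    results_per_clock = pipes_per_unit * fp_units  # 2 * 2 = 4
    return results_per_clock * clock_mhz * ssps / 1000.0


# ASSUMPTION: a 500 MHz clock, used only to illustrate the arithmetic.
per_ssp = peak_gflops(500)            # a single SSP
per_msp = peak_gflops(500, ssps=4)    # one MSP spans four SSPs
print(per_ssp, per_msp)
```

Substituting the clock rate of a particular model gives that model's peak rate per SSP; multiplying by the number of SSPs gives the system peak.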
In addition to the add and multiply units, the other vector functional units are reciprocal, integer add, shift, pop/parity/leading zero, bit matrix multiply, and logical.
The vector units are capable of full chaining and tailgating. Chaining is reading from a V register that is still being written to, and tailgating is writing to a V register that is still being read by a prior vector instruction. Scalar floating-point operations are executed using the vector functional units. This is different from the Cray J90 system, which has separate floating-point functional units for scalar operations.
Two data paths or ports are provided to move data between processor registers and memory via cache. In any given clock cycle, two memory requests can be active and consist of two dual-port reads or one dual-port read and one dual-port write. If there are no read requests, there can be only one write request active. The processor can access all of memory, but an application is limited to a 16-GB address space.
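The port rule above reduces to two conditions: no more than two requests in total, and no more than one write. A minimal sketch of that rule as a predicate follows; the function name is hypothetical, and the sketch models only the combinations listed in this paragraph, not the hardware's actual arbitration.

```python
def port_mix_allowed(reads, writes):
    """Return True if this mix of active memory requests fits the two ports."""
    if reads < 0 or writes < 0:
        raise ValueError("request counts must be non-negative")
    # At most two requests in total, and never more than one write.
    return reads + writes <= 2 and writes <= 1


print(port_mix_allowed(2, 0))  # two reads
print(port_mix_allowed(1, 1))  # one read and one write
print(port_mix_allowed(0, 2))  # two writes with no reads
```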
Instructions are decoded by the scalar unit; when vector instructions are encountered, they are dispatched to the vector unit, which maintains an independent instruction queue. Barring instruction dependencies, the two units can execute independently.
There are 32 performance counters in four groups of eight each. Only one group can be active at a time, with software providing user access to the data. The groups are labeled 0 through 3. The collection order, based on how useful the performance information is to the user, is 0, 3, 2 and 1. For an example of using the hardware performance monitor, see Section 2.1.1.
Note: In addition to the processor clock, there is a system clock that runs at 100 MHz. When you use the cpu instruction to return the count of clock ticks, the count is in system clock ticks, not processor clock ticks.
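Because tick counts come from the 100 MHz system clock, elapsed time is the tick difference divided by 100,000,000, regardless of the processor clock rate. The following sketch shows the conversion; the function name and the sample tick values are hypothetical.

```python
SYSTEM_CLOCK_HZ = 100_000_000  # 100 MHz system clock


def ticks_to_seconds(start_ticks, end_ticks, clock_hz=SYSTEM_CLOCK_HZ):
    """Convert a pair of system-clock tick counts to elapsed seconds."""
    return (end_ticks - start_ticks) / clock_hz


# Hypothetical tick counts sampled before and after a timed region.
elapsed = ticks_to_seconds(1_000_000, 251_000_000)
print(elapsed)
```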
Cache lies between memory and a processor's registers. Its purpose is to speed up loads.
Data read from memory is accessed through a high-speed, 32-Kword cache. Each of the processors involved in multistreaming has its own cache.
The Cray SV1 system cache is a four-way set associative temporary storage area, meaning each line can be allocated into any of four places, or ways.
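To make the four-way placement concrete, the sketch below maps word addresses to sets. It rests on two assumptions that this section does not state: a cache line of one word, and simple modulo indexing. Treat `LINE_WORDS`, `NUM_SETS`, and `cache_set` as illustrative names, not a description of the hardware.

```python
CACHE_WORDS = 32 * 1024   # 32-Kword cache
WAYS = 4                  # four-way set associative
LINE_WORDS = 1            # ASSUMPTION: line size is not given in this section

NUM_SETS = CACHE_WORDS // (WAYS * LINE_WORDS)


def cache_set(word_address):
    """Set index for a word address, ASSUMING simple modulo indexing."""
    return (word_address // LINE_WORDS) % NUM_SETS


# Addresses separated by NUM_SETS * LINE_WORDS words map to the same set.
# Four of them can be resident at once (one per way); a fifth must evict one.
stride = NUM_SETS * LINE_WORDS
conflicting = [cache_set(i * stride) for i in range(5)]
print(NUM_SETS, conflicting)
```

Under these assumptions, a loop that strides through memory by a multiple of `stride` words can keep only four of its lines in cache at a time, however much of the cache is otherwise free.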
Moving data from cache to a register is up to ten times faster than moving data directly from memory to a register. Having the data items you are going to use available in cache can represent a significant optimization in itself.
On non-cached Cray vector systems, performance on vector constructs generally increases as a predictable function of vector length. This is not always the case on the Cray SV1 series of systems, since long vectors can lead to a reduction in data cache efficiency. In general, it is better to use blocked algorithms (similar to those commonly used on microprocessors), balancing the vector length against any potential data reuse that can be exploited via data cache.
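A minimal sketch of the blocking idea, assuming a two-pass computation in which the second pass rereads what the first pass wrote. The function names, the 8192-word block size, and the computation itself are hypothetical choices for illustration. Python is used only to show the loop structure; the cache benefit would appear in compiled vector code.

```python
# ASSUMPTION: three arrays of 8192 words each (24 Kwords) fit in a 32-Kword
# cache. 8192 is also a multiple of the maximum vector length of 64.
BLOCK = 8192


def two_pass_unblocked(x, y, a):
    """Run each pass over the whole array before starting the next."""
    n = len(x)
    for i in range(n):            # pass 1 streams all of x and y
        y[i] = a * x[i] + y[i]
    z = [0.0] * n
    for i in range(n):            # pass 2 rereads y, possibly after eviction
        z[i] = y[i] * y[i]
    return z


def two_pass_blocked(x, y, a, block=BLOCK):
    """Run both passes on one cache-sized block before moving on."""
    n = len(x)
    z = [0.0] * n
    for start in range(0, n, block):
        stop = min(start + block, n)
        for i in range(start, stop):   # pass 1 on this block
            y[i] = a * x[i] + y[i]
        for i in range(start, stop):   # pass 2 reuses the block just written
            z[i] = y[i] * y[i]
    return z


n = 20000
x = [float(i) for i in range(n)]
y_a = [1.0] * n
y_b = [1.0] * n
same = two_pass_unblocked(x, y_a, 2.0) == two_pass_blocked(x, y_b, 2.0)
print(same)
```

Both versions compute the same result. The blocked version trades one long vector sweep for several shorter ones so that the second pass finds its operands still in cache.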
For information on optimizing cache, see Chapter 3.
Disk drives, interfaces to other networks, and other peripherals are connected to the Cray SV1 series of systems using the high-speed GigaRing I/O system. The double-ring product is illustrated in Figure 1-3.
Each ring has a maximum transfer rate of 500 MBPS; together, the two rings provide an effective total bandwidth of 800 MBPS. Because the two rings rotate in opposite directions, each transfer takes the shortest path to the target node.
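Shortest-path selection on two counter-rotating rings can be pictured as comparing hop counts in each direction. The sketch below is a simplified model, not the GigaRing routing logic: the node numbering, the function name, and the tie-breaking rule are all assumptions.

```python
def shortest_ring_route(source, target, num_nodes):
    """Pick the ring direction with fewer hops between two node positions.

    Nodes are numbered 0..num_nodes-1 around the ring (a hypothetical
    numbering used only for this sketch).
    """
    if not (0 <= source < num_nodes and 0 <= target < num_nodes):
        raise ValueError("node index out of range")
    forward = (target - source) % num_nodes    # hops on one ring
    backward = (source - target) % num_nodes   # hops on the opposite ring
    if forward <= backward:                    # ASSUMPTION: ties go forward
        return "forward", forward
    return "backward", backward


print(shortest_ring_route(0, 6, 8))
```

With eight nodes, a transfer from node 0 to node 6 takes two hops on the backward ring instead of six on the forward ring.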