| Cray SV1™ Application Optimization Guide - S-2312-35 |
|---|
| Chapter 1. Introduction |
The Cray SV1 computer is significantly different from previous Cray vector machines in that it provides a cache for the data resulting from scalar, vector, and instruction buffer memory references. Like its predecessors, the Cray SV1 system achieves high bandwidth to memory for both unit and non-unit stride memory references.
The Cray SV1 system is configured with between 4 and 32 CPUs. Each CPU has 2 add and 2 multiply functional units, allowing it to deliver 4 floating point results per CPU clock cycle. With a 300 MHz CPU clock, the peak floating point rate is 1.2 Gflops per CPU and 38.4 Gflops for a 32-CPU system.
The memory architecture is uniform access, shared central memory. Uniform memory access (UMA) means that the access time for a memory reference is the same for any CPU and any location in memory. Memory capacity for the system ranges from a minimum of 4 Gbytes up to a maximum of 128 Gbytes.
The Cray SV1 system has two module types: processor and memory. The system must be configured with eight memory modules and one to eight processor modules. Each processor module has four CPUs. A Cray J90 processor module can be upgraded to a Cray SV1 processor module, and a Cray J90 system can be configured with both processor module types (Cray J90 and Cray SV1).
The following figure shows the block diagram for a single processor:
The Cray SV1 processor uses custom CMOS chips. It is implemented using two chip types: CPU and cache.
The CPU chip contains the vector and scalar units. Scalar registers, scalar functional units, and the instruction buffers reside in the scalar unit, while the vector unit contains vector registers and the vector functional units. As in previous Cray vector systems, the processor contains eight vector (V) registers, eight scalar (S) registers backed by 64 T registers, and eight address (A) registers backed up by 64 B registers. A parallel job also has access to eight shared B and eight shared T registers, which are used for low overhead data passing and synchronization between processors.
A vector functional unit contains two pipes, each capable of producing a result every CPU clock cycle. This gives a functional unit a peak rate of two results per clock cycle. The maximum vector length, or VL, is 64. The combined floating point functional units, add and multiply, deliver 4 results per CPU clock cycle. With a CPU clock rate of between 300 and 500 MHz, a peak floating point rate of between 1.2 and 2.0 Gflops per processor can be achieved.
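The peak rates quoted above follow from simple arithmetic. The following is a minimal sketch in Python, used here only as a calculator and not as Cray code; the function name `peak_gflops` is hypothetical.

```python
# Two pipes per functional unit, with the add and multiply units combined.
PIPES_PER_UNIT = 2
FLOATING_POINT_UNITS = 2  # add and multiply
RESULTS_PER_CLOCK = PIPES_PER_UNIT * FLOATING_POINT_UNITS  # 4


def peak_gflops(clock_mhz, cpus=1, results_per_clock=RESULTS_PER_CLOCK):
    """Peak rate in Gflops: results per clock x clock rate (MHz) x CPU count."""
    return results_per_clock * clock_mhz * cpus / 1000.0


for mhz in (300, 500):
    print(mhz, "MHz:", peak_gflops(mhz), "Gflops per CPU")

# The largest configuration described in this chapter has 32 CPUs.
print("32 CPUs at 300 MHz:", peak_gflops(300, cpus=32), "Gflops")
```

These are peak figures only; sustained rates depend on vector length, chaining, and memory behavior.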
In addition to the add and multiply units, the other vector functional units are reciprocal, integer add, shift, pop/parity/leading zero, bit matrix multiply, and logical.
The vector units are capable of full chaining and tailgating. Chaining is reading from a V register that is still being written to, and tailgating is writing to a V register that is still being read by a prior vector instruction. Scalar floating point operations are executed using the vector functional units. This is different from the Cray J90 system, which has separate floating point functional units for scalar operations.
Two data paths or ports are provided to move data between CPU registers and memory via cache. In any given clock cycle, two memory requests can be active and consist of two dual-port reads or one dual-port read and one dual-port write. If there are no read requests, there can be only one write request active. The processor can access up to 32 Gbytes of memory, but an application is limited to a 16-Gbyte address space.
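The port rule above can be restated as a small check. This is a sketch of the rule as worded in this section, not a description of the hardware arbitration logic; the function name `ports_can_issue` is hypothetical.

```python
def ports_can_issue(reads, writes):
    """Return True if this mix of memory requests can be active in one clock cycle.

    Rule as stated in the text: at most two requests in total, made up of
    two reads or one read and one write; never more than one write.
    """
    if reads < 0 or writes < 0:
        raise ValueError("request counts must be non-negative")
    return reads + writes <= 2 and writes <= 1


for mix in [(2, 0), (1, 1), (0, 1), (0, 2), (2, 1)]:
    print(mix, ports_can_issue(*mix))
```

One consequence is that a loop with two loads and one store per iteration cannot issue all three requests in the same clock cycle.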
Instructions are decoded by the scalar unit; when vector instructions are encountered, they are dispatched to the vector unit, which maintains an independent instruction queue. Barring instruction dependency, the two units can execute independently.
There are 32 performance counters in four groups of eight each. Only one group can be active at a time, with software providing user access to the data. The groups are labeled 0 through 3. The collection order, based on how useful the performance information is to the user, is 0, 3, 2 and 1. For an example of using the hardware performance monitor, see Section 2.1.1.
Note: In addition to the CPU clock, there is a system clock that runs at 100 MHz. The CPU instruction that returns the count of clock ticks reports ticks of the system clock, not the CPU clock.
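Because the tick count advances at the system clock rate, timings taken from it must be scaled before they are compared with CPU cycle counts. The following Python sketch shows the conversion; the function names are hypothetical, and the 300 MHz default is simply one of the CPU clock rates mentioned in this chapter.

```python
SYSTEM_CLOCK_MHZ = 100  # system clock rate stated in the note above


def ticks_to_seconds(ticks):
    """Convert a system-clock tick count to elapsed seconds."""
    return ticks / (SYSTEM_CLOCK_MHZ * 1_000_000)


def ticks_to_cpu_cycles(ticks, cpu_clock_mhz=300):
    """Convert system-clock ticks to CPU clock cycles (rounded down)."""
    return ticks * cpu_clock_mhz // SYSTEM_CLOCK_MHZ


elapsed_ticks = 2_500_000  # hypothetical difference between two tick readings
print(ticks_to_seconds(elapsed_ticks), "seconds")
print(ticks_to_cpu_cycles(elapsed_ticks), "CPU cycles at 300 MHz")
```

At 300 MHz each tick corresponds to 3 CPU clock cycles, so intervals shorter than a few CPU cycles cannot be resolved with this counter.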
Cache lies between memory and a processor's registers (see Figure 1-3). Its purpose is to speed up loads from memory.
Code that makes poor use of cache can cancel out the performance gains you realize from other optimizations. To get the best out of both, you must first know something about how cache works.
Data read from memory is accessed through a high-speed, 32-Kword cache. Each of the processors involved in multi-streaming has its own cache.
Cache is a four-way set associative temporary storage area, meaning each line in cache is divided into four places, or sets. See the representation of cache in Figure 1-3.
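The cache dimensions can be worked out from the figures in this section. The sketch below is Python used as a calculator. It assumes the 8-word transfer size used in the walkthrough later in this section, and the modulo placement in `row_index` is a common textbook scheme assumed for illustration, not a documented property of the Cray SV1 cache. Note that this guide calls each row of the cache a line and each of its four places a set.

```python
WORD_BYTES = 8             # 64-bit words
CACHE_WORDS = 32 * 1024    # 32-Kword cache
TRANSFER_WORDS = 8         # words fetched from memory per request
PLACES_PER_ROW = 4         # four-way set associative

cache_bytes = CACHE_WORDS * WORD_BYTES
blocks = CACHE_WORDS // TRANSFER_WORDS   # 8-word blocks the cache can hold
rows = blocks // PLACES_PER_ROW          # rows, each with four places


def row_index(word_address):
    """ASSUMPTION: simple modulo placement of 8-word blocks onto rows."""
    return (word_address // TRANSFER_WORDS) % rows


print(cache_bytes // 1024, "Kbytes,", blocks, "blocks,", rows, "rows")
print(row_index(0), row_index(8), row_index(TRANSFER_WORDS * rows))
```

Under this assumed mapping, word addresses that differ by a multiple of 8192 words compete for the same four places.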
Moving data from cache to a register is roughly 3 times faster than moving data directly from memory to a register (ideally, 32 clock periods (CPs) compared to about 102 CPs). Having the data items you are going to use available in cache can represent a significant optimization in itself.
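To see how the fraction of loads satisfied from cache affects the average cost of a load, the two latencies above can be combined in a simple weighted average. This is a rough model in Python, not a measurement: it assumes every load costs exactly one of the two ideal latencies and ignores overlap between requests. The function name is hypothetical.

```python
CACHE_CP = 32     # ideal cache-to-register latency, in clock periods
MEMORY_CP = 102   # approximate memory-to-register latency, in clock periods


def average_load_cp(hit_fraction):
    """Average clock periods per load for a given fraction of cache hits."""
    if not 0.0 <= hit_fraction <= 1.0:
        raise ValueError("hit_fraction must be between 0 and 1")
    return hit_fraction * CACHE_CP + (1.0 - hit_fraction) * MEMORY_CP


for hits in (0.0, 0.5, 0.875, 1.0):
    print(hits, average_load_cp(hits))
```

A hit fraction of 0.875 corresponds to one miss in every eight loads.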
The following example shows how cache helps move data quickly between memory and processor registers. The data being read in the example comes from an array named A that is accessed as follows in a Fortran program:
    DO I = 1, N
       ... = A(I)
    ENDDO
A single-streaming processor (SSP) in a multi-streaming program would be assigned its own part of the array. Each SSP would use the following procedure to move data from memory to its registers:
1. A register requests the value of A(1) from cache.
2. Cache does not have A(1). It requests A(1) from memory.
3. Cache receives 8 64-bit words (A(1) through A(8)) from memory and stores them in one of four sets.
4. The register receives A(1) from cache. The state of the data at this point is as illustrated in the following figure:
5. When the register needs A(2) through A(8), it finds them in cache.
6. Meanwhile, cache is prefetching A(9) through A(16) from memory.
7. By the time the register needs A(9), it is again available in cache.
In this way, a constant flow of data is set up between memory and the processor registers. Whenever an array item is needed, it will be available in cache.
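The procedure above can be modeled in a few lines. The following Python sketch counts how many reads of A(I) have to wait for memory. It makes several simplifying assumptions that are not stated in this guide: A(1) starts on an 8-word boundary, a prefetch always completes before its data is needed, and nothing is evicted from cache during the loop. The function name is hypothetical.

```python
LINE_WORDS = 8  # words delivered to cache per memory request


def count_memory_waits(n, prefetch=True):
    """Count reads of A(1)..A(n) that must wait for data to arrive from memory."""
    cached_blocks = set()
    waits = 0
    for i in range(1, n + 1):
        block = (i - 1) // LINE_WORDS      # A(1)..A(8) -> block 0, and so on
        if block not in cached_blocks:
            waits += 1                     # the register waits on memory
            cached_blocks.add(block)
        if prefetch:
            cached_blocks.add(block + 1)   # the next 8 words are fetched ahead
    return waits


print("with prefetch:   ", count_memory_waits(64))
print("without prefetch:", count_memory_waits(64, prefetch=False))
```

In this model only the first read waits on memory when prefetching keeps up, whereas without prefetching one read in every eight does.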
On non-cached Cray vector systems, performance on vector constructs generally increased as a predictable function of vector length. This is not always the case on the Cray SV1 system, since long vectors can lead to a reduction in data cache efficiency. In general, it is better to use blocked algorithms (similar to those commonly used on microprocessors), balancing the vector length against any potential data reuse that can be exploited via data cache.
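The following sketch illustrates what blocking means for loop structure. It is written in Python purely to show the shape of the loops and says nothing about Cray SV1 performance. The block size of 4096 words is a hypothetical choice that keeps one block of each of the three arrays (12 Kwords in total) within a 32-Kword cache; a suitable value for a real code would have to be determined by measurement.

```python
def unblocked(a):
    """Two full passes: for large arrays, a and b may leave cache between passes."""
    n = len(a)
    b = [0.0] * n
    c = [0.0] * n
    for i in range(n):
        b[i] = 2.0 * a[i]
    for i in range(n):
        c[i] = a[i] + b[i]
    return c


def blocked(a, block=4096):
    """Same computation, but both passes run over one block before moving on."""
    n = len(a)
    b = [0.0] * n
    c = [0.0] * n
    for start in range(0, n, block):
        stop = min(start + block, n)
        for i in range(start, stop):   # first pass over this block
            b[i] = 2.0 * a[i]
        for i in range(start, stop):   # second pass reuses a and b while cached
            c[i] = a[i] + b[i]
    return c


data = [float(i) for i in range(10000)]
print(blocked(data) == unblocked(data))
```

Each inner loop still runs over thousands of elements, so a vectorizing compiler can process it in full 64-element vectors; the blocking only limits how much data is touched between the two passes.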
Disk drives, interfaces to other networks, and other peripherals are connected to the Cray SV1 system using the high-speed GigaRing I/O system. The double-ring product is shown in the following figure.
Each of the rings has a maximum transfer rate of 500 Mbytes/s, giving the pair an effective total bandwidth of 800 Mbytes/s. Because the two rings carry data in opposite directions, the shortest path to the target node is selected for each transfer.
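Choosing the shorter way around a pair of counter-rotating rings can be sketched as follows. This Python model is illustrative only: the node numbering, the use of hop count as the measure of path length, the tie-breaking rule, and the function name `shortest_ring_path` are all assumptions and do not describe the actual GigaRing routing logic.

```python
def shortest_ring_path(source, target, nodes):
    """Pick the ring direction with fewer hops from source to target.

    Nodes are assumed to be numbered 0..nodes-1 around the ring.
    Ties go to the clockwise ring (an arbitrary choice for this sketch).
    """
    if not (0 <= source < nodes and 0 <= target < nodes):
        raise ValueError("node numbers must be in range(nodes)")
    clockwise_hops = (target - source) % nodes
    counter_hops = (source - target) % nodes
    if clockwise_hops <= counter_hops:
        return ("clockwise", clockwise_hops)
    return ("counterclockwise", counter_hops)


for target in (1, 4, 7):
    print(0, "->", target, shortest_ring_path(0, target, nodes=8))
```

With two directions available, no transfer on an 8-node ring in this model needs more than 4 hops, compared with up to 7 on a single ring.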