Chapter 5. Optimizing Processor-bound Code

Table of Contents
5.1. Processor Optimization Principles
5.2. Analyzing Your Code
5.3. Optimization Techniques

After addressing memory and I/O issues, focus on improving the performance of code running on a single processor.

5.1. Processor Optimization Principles

On Cray X1 systems, processor optimization is based on two principles: improving the vectorization of your code and improving your use of multistreaming.

5.1.1. Vectorization

The single-streaming processors (SSPs) on Cray X1 systems perform two basic types of processing: scalar and vector. A scalar is a single value that can be manipulated by the scalar hardware registers. Scalar processing consists of performing logical, arithmetic, or memory operations on scalar registers one at a time.

A vector is a series of values on which instructions operate. A vector can be an array, an array of structures, or any subset of an array such as a row, column, or diagonal. Vector processing is the application of arithmetic, logical, or memory operations to vectors by means of the hardware vector registers, which in effect performs parallel processing at the instruction level.
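The following minimal sketch contrasts the two kinds of operation. The program name, array names, and array size are illustrative assumptions, not taken from this manual, and whether a given loop is actually vectorized depends on the compiler and its options.

```fortran
PROGRAM SCALAR_VS_VECTOR
    IMPLICIT NONE
    INTEGER, PARAMETER :: N = 64    ! illustrative size, chosen arbitrarily
    REAL :: A(N), B(N), C(N)
    REAL :: S
    INTEGER :: I

    ! Set up some input data.
    DO I = 1, N
        A(I) = REAL(I)
        B(I) = 2.0
    ENDDO

    ! Scalar operation: one multiply on a single pair of values.
    S = A(1) * B(1)

    ! Candidate for vector processing: the same multiply applied to a
    ! series of values. No iteration depends on any other iteration.
    DO I = 1, N
        C(I) = A(I) * B(I)
    ENDDO

    PRINT *, 'scalar result:     ', S
    PRINT *, 'last vector result:', C(N)
END PROGRAM SCALAR_VS_VECTOR
```

The second loop is the kind of construct a vectorizing compiler can typically map onto vector registers, because each element is computed independently of the others.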

On Cray X1 systems, scalar registers are exclusive to scalar processes, while vector registers are accessible by either scalar or vector processes. Because the vector component of the SSP runs at twice the clock speed of the scalar component, vector processing is always faster than scalar processing by at least a factor of two. More typically, vectorized constructs can perform operations up to 20 times faster than similar nonvectorized constructs, thanks in part to the extremely large vector registers on the Cray X1 system.

Three principles therefore apply to vectorization:

  • always vectorize your code as fully as possible

  • always make your vectorized loops as "fat" as possible

  • even if a section of code cannot be vectorized, use the vector registers

This last principle may seem counterintuitive, but because vector register access is faster than cache access, even scalar operations run faster if they use the vector registers.
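As a rough illustration of the second principle, the sketch below assumes that a "fat" loop is one that does more work per iteration, which this section does not define explicitly. It merges two thin loops into one loop with a larger body. All names are hypothetical, and both versions compute the same values.

```fortran
PROGRAM FAT_LOOP
    IMPLICIT NONE
    INTEGER, PARAMETER :: N = 1000   ! illustrative size
    REAL :: A(N), B(N), C(N), D(N)
    INTEGER :: I

    DO I = 1, N
        A(I) = REAL(I)
        B(I) = 1.0
    ENDDO

    ! Two "thin" loops: each makes a separate pass and does little
    ! work per iteration.
    DO I = 1, N
        C(I) = A(I) + B(I)
    ENDDO
    DO I = 1, N
        D(I) = C(I) * 2.0
    ENDDO

    ! One "fatter" loop: a single pass with more work per iteration.
    ! The value stored in C(I) can be reused immediately for D(I).
    DO I = 1, N
        C(I) = A(I) + B(I)
        D(I) = C(I) * 2.0
    ENDDO

    PRINT *, 'D(N) =', D(N)
END PROGRAM FAT_LOOP
```

Merging is valid here only because the second statement in iteration I uses nothing but values produced in that same iteration. Loops with dependences across iterations need closer analysis before they can be combined.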

5.1.2. Multistreaming

Multistreaming automatically partitions loop iterations among the four single-streaming processors (SSPs) that make up a multistreaming processor (MSP). You may get speedup factors of up to four on loops to which this technique can be applied.

Figure 5-1 illustrates the loop iterations that each SSP will operate on for the following loop:

DO I = 1,2000
    A(I) = A(I) * 3.14
ENDDO

Figure 5-1. Dividing Loop Iterations among SSPs
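The short program below sketches one way the 2000 iterations could be divided among SSPs 0 through 3. It assumes an even split into contiguous blocks, which is an illustrative assumption. The compiler chooses the actual partitioning, and it may differ from this. The program and variable names are hypothetical.

```fortran
PROGRAM SSP_PARTITION
    IMPLICIT NONE
    INTEGER, PARAMETER :: N = 2000   ! loop trip count from the example
    INTEGER, PARAMETER :: NSSP = 4   ! SSPs per MSP
    INTEGER :: SSP, CHUNK, LO, HI

    ! Assumed block size: iterations per SSP, rounded up.
    CHUNK = (N + NSSP - 1) / NSSP

    DO SSP = 0, NSSP - 1
        LO = SSP * CHUNK + 1
        HI = MIN(N, (SSP + 1) * CHUNK)
        PRINT '(A,I1,A,I4,A,I4)', 'SSP ', SSP, ': I = ', LO, ' to ', HI
    ENDDO
END PROGRAM SSP_PARTITION
```

Under this assumption, each SSP handles 500 iterations: 1 to 500, 501 to 1000, 1001 to 1500, and 1501 to 2000.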

Multistreaming is on by default. The Cray X1 system does provide compiler options to disable multistreaming, but this mode currently idles SSPs 1-3 of the selected MSP.

Multistreaming causes gang scheduling of all requested processors, meaning that they are attached to the program whether or not they are actually executing code. Processor utilization efficiency is therefore directly proportional to the extent to which you are able to multistream the program.
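The cost of idle SSPs can be pictured with a simplified model that is not taken from this manual. Assume that a fraction F of the run time is spent in multistreamed code with all four SSPs busy, and that only one SSP does useful work for the rest of the time. The fraction of attached SSP time doing useful work is then roughly F + (1 - F)/4. The sketch below tabulates this for a few values of F. All names are hypothetical.

```fortran
PROGRAM MSP_UTILIZATION
    IMPLICIT NONE
    INTEGER, PARAMETER :: NSSP = 4   ! SSPs per MSP
    REAL :: F, UTIL
    INTEGER :: K

    DO K = 0, 4
        ! F: assumed fraction of run time spent in multistreamed code.
        F = 0.25 * REAL(K)
        ! Simplified model: all SSPs busy for F, one SSP busy otherwise.
        UTIL = F + (1.0 - F) / REAL(NSSP)
        PRINT '(A,F4.2,A,F6.4)', 'streamed fraction ', F, &
              '  utilization ', UTIL
    ENDDO
END PROGRAM MSP_UTILIZATION
```

In this model a program with no multistreamed code uses only a quarter of the processor time attached to it, and utilization rises steadily as more of the run time is multistreamed.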