5.3. Optimization Techniques

There are three basic approaches to optimizing code. In order of ascending difficulty and time consumed, they are:

  1. Use more aggressive compiler options to control the optimization of the entire program unit.

  2. Use directives to control the optimization of specific blocks within the program unit.

  3. Rewrite the source code to allow better optimization by the compiler.

When working through these techniques, remember that source files can be optimized separately and that there is no "one size fits all" solution. An aggressive compiler option that produces better results in one place may produce poorer results in another. The key is to test, modify, recompile, retest, and know when it's time to stop working on one module and shift your attention to another.

5.3.1. Using More Aggressive Compiler Options

The easiest way to improve optimization is to select different options when you compile your code. For example, some Fortran loops will not vectorize at the default vectorization level, vector2. In these cases, specifying the -O vector3 option on the ftn(1) command line may allow vectorization. Similarly, for Fortran loops that will not vectorize because they contain a very large number of lines (hundreds), the -O aggress option on the ftn command line enlarges the internal tables used for compiler analysis and may permit vectorization.

The Cray Fortran Compiler options that affect optimization are discussed in the Cray Fortran Compiler Commands and Directives Reference Manual. The Cray C and C++ Compiler options that affect optimization are discussed in the Cray C and C++ Reference Manual.

5.3.2. Using Directives

Directives can be used within the source code to control the compiler optimizations performed on selected blocks of code. For example, the Fortran compiler analyzes loops for dependencies. If a forward dependency is found (data is read, then written), the loop can be vectorized. This is an example of a loop that can be fully vectorized:

do i = 1,n
   a(i) = a(i+1) * b(i)
end do
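The opposite case, a backward dependency (data is written, then read by a later iteration), is a recurrence and generally cannot be vectorized as written. The following sketch is an illustration only and is not taken from the compiler documentation; the module and procedure names are hypothetical. It uses Fortran array assignment, which evaluates the whole right-hand side before storing, as a stand-in for what vector code does with its operands.

```fortran
module dependency_demo
  implicit none
contains

  ! Forward dependency: iteration i reads a(i+1), which is not
  ! overwritten until iteration i+1. Scalar order.
  subroutine forward_scalar(a, b, n)
    integer, intent(in) :: n
    real, intent(inout) :: a(n+1)
    real, intent(in)    :: b(n)
    integer :: i
    do i = 1, n
       a(i) = a(i+1) * b(i)
    end do
  end subroutine forward_scalar

  ! Same loop with all operands loaded before any store.
  subroutine forward_vector(a, b, n)
    integer, intent(in) :: n
    real, intent(inout) :: a(n+1)
    real, intent(in)    :: b(n)
    a(1:n) = a(2:n+1) * b(1:n)
  end subroutine forward_vector

  ! Backward dependency: iteration i reads a(i-1), which was just
  ! written by iteration i-1. Scalar order carries the new value along.
  subroutine backward_scalar(a, b, n)
    integer, intent(in) :: n
    real, intent(inout) :: a(0:n)
    real, intent(in)    :: b(n)
    integer :: i
    do i = 1, n
       a(i) = a(i-1) * b(i)
    end do
  end subroutine backward_scalar

  ! Loading all operands first reads stale a(i-1) values, so the
  ! result differs from the scalar loop.
  subroutine backward_vector(a, b, n)
    integer, intent(in) :: n
    real, intent(inout) :: a(0:n)
    real, intent(in)    :: b(n)
    a(1:n) = a(0:n-1) * b(1:n)
  end subroutine backward_vector

end module dependency_demo
```

Calling the two forward routines on identical inputs gives identical results, while the two backward routines disagree. That disagreement is why a compiler must treat the recurrence differently.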

In comparison, the loopmark listing may indicate that a loop can be only partially vectorized. In the following example, the compiler assumes that index values may collide, so the loop is vectorized with considerable overhead:

    6.  Vp----<       DO i = 1,n
    7.  VP r-<>          e(ix1(i)) = e(ix1(i)) - a(i)
    8.  VP---->       END DO
    9.
   10.                end

f90-6371 f90: VECTOR File = gs-2.f, Line = 6
  A vectorized loop contains potential conflicts due to indirect
  addressing at line 7, causing less efficient code to be generated.

f90-6204 f90: VECTOR File = gs-2.f, Line = 6
  A loop starting at line 6 was vectorized.

In a case like this, if you know that a loop is free of dependencies but the compiler cannot determine this, you can use the concurrent or ivdep directive to tell the compiler that the loop contains no vector dependencies. For example:

!dir$ concurrent
DO i = 1,n
   e(ix1(i)) = e(ix1(i)) - a(i)
END DO

With the directive in place, the loop vectorizes and multistreams cleanly and produces the following loopmark listing:

    6.       !dir$ concurrent
    7.  MV--<       DO i = 1, n
    8.  MV            e(ix1(i)) = e(ix1(i)) - a(i)
    9.  MV-->       END DO
   10.
   11.             end

f90-6203 f90: VECTOR File = gs-2.f, Line = 7
  A loop starting at line 7 was vectorized because an IVDEP
  or CONCURRENT compiler directive was specified.

f90-6203 f90: STREAM File = gs-2.f, Line = 7
  A loop starting at line 7 was streamed because an IVDEP
  or CONCURRENT compiler directive was specified.

In this case, inserting a concurrent or ivdep directive yields a considerable performance improvement.
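The directive is an assertion by the programmer, not something the compiler verifies. If the index array does contain a repeated value, the directive can lead to incorrect results. A debug-time check along the following lines can confirm the assumption before the directive is added. This is a minimal sketch: the module name, the function name has_repeated_index, and the argument nmax (assumed to be the extent of the array being updated) are all hypothetical.

```fortran
module index_check
  implicit none
contains

  ! Returns .true. if any value occurs more than once in ix(1:n).
  ! Assumes every ix(i) lies in the range 1..nmax.
  logical function has_repeated_index(ix, n, nmax)
    integer, intent(in) :: n, nmax
    integer, intent(in) :: ix(n)
    logical :: seen(nmax)      ! one flag per possible index value
    integer :: i

    seen = .false.
    has_repeated_index = .false.
    do i = 1, n
       if (seen(ix(i))) then
          has_repeated_index = .true.
          return
       end if
       seen(ix(i)) = .true.
    end do
  end function has_repeated_index

end module index_check
```

If the function returns .false. for every index array the loop will see, asserting independence with the directive is consistent with the data. If it can return .true., the loop has a real dependency and the directive should not be used.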

The Cray Fortran Compiler directives that affect optimization are discussed in the Cray Fortran Compiler Commands and Directives Reference Manual. The Cray C and C++ Compiler directives that affect optimization are discussed in the Cray C and C++ Reference Manual. In particular, note the Cray Streaming Directives (CSDs), which can be used to control multistreaming of specific blocks of code.

5.3.3. Rewriting Your Source Code

A complete guide to rewriting Fortran, C, or C++ source code is beyond the scope of this document. However, when examining loops that fail to vectorize or multistream, keep the following factors in mind.

5.3.3.1. Factors that Inhibit Vectorization

Vectorization inhibitors within DO loops include:

  • CALL statements not inlined

  • I/O statements

  • Backward branches

  • Statement labels with references from outside the loop

  • References to character variables

  • External functions that do not vectorize

  • RETURN, STOP, or PAUSE statements

  • Dependencies (see Section 5.3.3.2)

You can avoid many of these inhibitors by slightly modifying the source code, or through use of compiler options.
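As one illustration of such a modification, an I/O statement can often be moved out of a computational loop by splitting the loop in two. The sketch below uses hypothetical module and routine names and is not taken from the compiler documentation. Whether the compiler actually vectorizes the split form should be confirmed from the loopmark listing.

```fortran
module io_split_demo
  implicit none
contains

  ! Mixed form: the WRITE inside the loop is an I/O statement, which
  ! is on the list of vectorization inhibitors.
  subroutine scale_and_report_mixed(x, y, n, unit)
    integer, intent(in) :: n, unit
    real, intent(in)    :: x(n)
    real, intent(out)   :: y(n)
    integer :: i
    do i = 1, n
       y(i) = 2.0 * x(i) + 1.0
       write (unit, *) i, y(i)
    end do
  end subroutine scale_and_report_mixed

  ! Split form: the first loop contains only arithmetic and is a
  ! candidate for vectorization; the I/O is isolated in the second.
  subroutine scale_and_report_split(x, y, n, unit)
    integer, intent(in) :: n, unit
    real, intent(in)    :: x(n)
    real, intent(out)   :: y(n)
    integer :: i
    do i = 1, n
       y(i) = 2.0 * x(i) + 1.0
    end do
    do i = 1, n
       write (unit, *) i, y(i)
    end do
  end subroutine scale_and_report_split

end module io_split_demo
```

Both routines compute the same y and write the same records in the same order; only the placement of the I/O changes.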

5.3.3.2. Factors that Inhibit Multistreaming

The compiler uses certain criteria to judge whether or not a loop can be automatically multistreamed.

By the time the compiler evaluates a loop for multistreaming, the code has already gone through an initial restructuring, as it would for vectorization. Optimizations such as loop interchange (switching an inner loop with an outer loop) and loop splitting (changing a single loop into two or more loops) may already have been done.

If a loop passes the following tests, it is a candidate for multistreaming:

  • Its iterations can be divided among different processors without delivering incorrect results. The loop can contain private arrays if their values are not used outside the loop.

  • There are no function or subroutine calls within the loop, other than those specified by !dir$ ssp_private.

  • A scatter operation is unordered. That is, no indices can be repeated in the array to be scattered.

  • It is not prefaced by the nostream directive.

  • It has a trip count of at least 2.

When two nested loops both qualify for multistreaming, the compiler applies the following criteria, in order, to choose which loop to multistream:

  1. The outermost loop that is prefaced by the preferstream directive.

  2. The loop that the compiler estimates will have the greatest amount of work after the loop is interchanged to its outermost valid position.

  3. The outermost loop after initial restructuring.

Though the compiler favors outermost loops to multistream, any loop within a nest may be chosen to multistream or vectorize. It is no longer necessary for the vector loop to be the innermost loop and the streamed loop to be the outermost loop.
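The following sketch shows how the first criterion might be used in a simple loop nest. It assumes that the preferstream directive is written with the same !dir$ prefix as the concurrent directive shown earlier; check the exact form in the Cray Fortran Compiler Commands and Directives Reference Manual. The module and routine names are hypothetical. Compilers that do not recognize the directive treat the line as a comment.

```fortran
module stream_choice_demo
  implicit none
contains

  ! c = a + 2*b over an m-by-n array.
  subroutine add_scaled(c, a, b, m, n)
    integer, intent(in) :: m, n
    real, intent(in)    :: a(m,n), b(m,n)
    real, intent(out)   :: c(m,n)
    integer :: i, j

    ! Assumed directive spelling: request that the outer loop be the
    ! one chosen for multistreaming.
!dir$ preferstream
    do j = 1, n
       ! Inner loop runs over the first (contiguous) index and is
       ! left to the compiler as the vector loop.
       do i = 1, m
          c(i,j) = a(i,j) + 2.0 * b(i,j)
       end do
    end do
  end subroutine add_scaled

end module stream_choice_demo
```

The loopmark listing for the nest shows which loop the compiler actually streamed and which it vectorized.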

5.3.3.3. Factors that Enhance Vectorization and Multistreaming

One of the most powerful ways to control multistreaming is to insert Cray Streaming Directives (CSDs) in your source code. The CSDs are modeled after OpenMP directives. They let you control multistreaming at levels ranging from broad to very fine, and they enable you to force multistreaming of code that the compiler would not ordinarily multistream. A full guide to CSD usage is beyond the scope of this manual; the CSDs are covered in detail in the Cray Fortran Compiler Commands and Directives Reference Manual and the Cray C and C++ Reference Manual.

Once multistreaming divides the iterations of a loop among the SSPs working on your program, a number of single-processor optimizations performed by the compiler continue to improve the performance of your program. The most important is vectorization.

Vectorization usually yields a greater speedup than any of the other single-processor optimizations. The more you can get out of vectorization, the closer you can come to realizing the full potential of the Cray X1 system.

Vectorization and multistreaming coexist automatically, as follows:

  • Vectorization and multistreaming can both work on the same loop, causing an MSP to behave like an 8-pipe vector processor.

  • Multistreaming can be done on a loop inside or outside of the vector loop, as appropriate.

The major strategies you can follow in order to enhance vectorization are:

  • Keep the stride through the array small; ideally, use a stride of 1.

  • Avoid less efficient vector code, such as loops that contain variant IFs and reductions.
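As an illustration of both strategies, the following sketch computes the row sums of a two-dimensional array in two ways. The module and routine names are hypothetical. Fortran stores arrays in column-major order, so the first index is the contiguous one. The compiler may perform this loop interchange on its own during restructuring, so treat the second form as a way to make the intent explicit rather than a guaranteed speedup.

```fortran
module row_sum_demo
  implicit none
contains

  ! Inner loop walks along a row: consecutive j values are m elements
  ! apart in memory, and the inner loop is a reduction into a scalar.
  subroutine row_sums_strided(a, m, n, s)
    integer, intent(in) :: m, n
    real, intent(in)    :: a(m,n)
    real, intent(out)   :: s(m)
    integer :: i, j
    real :: t
    do i = 1, m
       t = 0.0
       do j = 1, n
          t = t + a(i,j)
       end do
       s(i) = t
    end do
  end subroutine row_sums_strided

  ! Loops interchanged: the inner loop runs down a column with
  ! stride 1 and updates independent elements of s, so it is no
  ! longer a reduction.
  subroutine row_sums_unit_stride(a, m, n, s)
    integer, intent(in) :: m, n
    real, intent(in)    :: a(m,n)
    real, intent(out)   :: s(m)
    integer :: i, j
    s = 0.0
    do j = 1, n
       do i = 1, m
          s(i) = s(i) + a(i,j)
       end do
    end do
  end subroutine row_sums_unit_stride

end module row_sum_demo
```

Both routines add the elements of each row in the same order, so they return the same sums; only the memory access pattern of the inner loop differs.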

5.3.4. Latency and Bandwidth Issues

Latency bound is a catchall category that includes:

All of these situations require modifying your source code, either by inserting CSDs to expand the streamed regions or by performing gross code restructuring to bring more parallelism into the computational routines.

Bandwidth bound issues become apparent when more time is spent loading operands from memory than is spent doing floating point calculations. Correcting this requires significant source code modification, including: