Optimizing Applications on the Cray X1™ System - S-2315-50
Chapter 5. Optimizing Processor-bound Code
There are three basic approaches to optimizing code. In order of ascending difficulty and time consumed, they are:
Use more aggressive compiler options to control the optimization of the entire program unit.
Use directives to control the optimization of specific blocks within the program unit.
Rewrite the source code to allow better optimization by the compiler.
When working through these techniques, remember that source files can be optimized separately and that there is no "one size fits all" solution. An aggressive compiler option that produces better results in one place may produce poorer results in another. The key is to test, modify, recompile, retest, and know when it's time to stop working on one module and shift your attention to another.
The easiest way to improve optimization is to select different options when you compile your code. For example, in Fortran code, some loops will not vectorize at the default vectorization level, vector2. In these cases, using the -O vector3 option on the ftn(1) command line may allow vectorization. Similarly, for Fortran loops that will not vectorize because they contain a very large number of lines (hundreds), the -O aggress option on the ftn command line enlarges the internal tables used for compiler analysis and may permit vectorization.
The Cray Fortran Compiler options that affect optimization are discussed in the Cray Fortran Compiler Commands and Directives Reference Manual. The Cray C and C++ Compiler options that affect optimization are discussed in the Cray C and C++ Reference Manual.
Directives can be used within the source code to control the compiler optimizations performed on selected blocks of code. For example, the Fortran compiler analyzes loops for dependencies. If a forward dependency is found (data is read, then written), the loop can be vectorized. This is an example of a loop that can be fully vectorized:
    do i = 1,n
       a(i) = a(i+1) * b(i)
    end do
In comparison, the loopmark listing may indicate that a loop can only be partially vectorized. In the following example, the compiler assumes that the indices may collide, so the loop is vectorized with considerable overhead:
     6.  Vp----<      DO i = 1,n
     7.  VP r-<>        e(ix1(i)) = e(ix1(i)) - a(i)
     8.  VP---->      END DO
     9.
    10.               end

    f90-6371 f90: VECTOR File = gs-2.f, Line = 6
      A vectorized loop contains potential conflicts due to indirect addressing at line 7, causing less efficient code to be generated.
    f90-6204 f90: VECTOR File = gs-2.f, Line = 6
      A loop starting at line 6 was vectorized.
In a case like this, if you know that a loop is free of dependencies but the compiler cannot determine this, you can use the concurrent or ivdep directive to tell the compiler that the loop contains no vector dependencies. For example:
    !dir$ concurrent
          DO i = 1,n
            e(ix1(i)) = e(ix1(i)) - a(i)
          END DO
This loop vectorizes and multistreams cleanly and produces the following loopmark listing:
     6.               !dir$ concurrent
     7.  MV--<        DO i = 1, n
     8.  MV             e(ix1(i)) = e(ix1(i)) - a(i)
     9.  MV-->        END DO
    10.
    11.               end

    f90-6203 f90: VECTOR File = gs-2.f, Line = 7
      A loop starting at line 7 was vectorized because an IVDEP or CONCURRENT compiler directive was specified.
    f90-6203 f90: STREAM File = gs-2.f, Line = 7
      A loop starting at line 7 was streamed because an IVDEP or CONCURRENT compiler directive was specified.
In this case, inserting a concurrent or ivdep directive yields a considerable performance improvement.
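The directive is a promise to the compiler, not something the compiler verifies, so it is worth confirming that the index array really contains no repeated values before relying on it. The following is a minimal sketch of such a check, assuming a small hypothetical data set; the program name, the array contents, and the seen work array are illustrative and do not come from the gs-2.f example. On compilers that do not recognize the directive, the !dir$ line is treated as an ordinary comment.

```fortran
program check_indices
  implicit none
  integer, parameter :: n = 8
  integer :: ix1(n), i
  real    :: e(n), a(n)
  logical :: seen(n)

  ! Hypothetical data: ix1 is a permutation of 1..n, so no index repeats.
  ix1 = [3, 1, 4, 8, 5, 2, 7, 6]
  e   = 10.0
  a   = [(real(i), i = 1, n)]

  ! Debug-time check that the promise made by the directive actually holds.
  seen = .false.
  do i = 1, n
    if (ix1(i) < 1 .or. ix1(i) > n) error stop 'index out of range'
    if (seen(ix1(i))) error stop 'repeated index: directive would be unsafe'
    seen(ix1(i)) = .true.
  end do

  ! With unique indices, no two iterations update the same element of e.
!dir$ concurrent
  do i = 1, n
    e(ix1(i)) = e(ix1(i)) - a(i)
  end do

  print *, e
end program check_indices
```

A check like this costs an extra pass over the index array, so it is probably best kept to test builds or to the point where the index array is first constructed.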
The Cray Fortran Compiler directives that affect optimization are discussed in the Cray Fortran Compiler Commands and Directives Reference Manual. The Cray C and C++ Compiler directives that affect optimization are discussed in the Cray C and C++ Reference Manual. In particular, note the Cray Streaming Directives (CSDs), which can be used to control multistreaming of specific blocks of code.
A complete guide to rewriting Fortran, C, or C++ source code is beyond the scope of this document. However, when examining loops that fail to vectorize or multistream, keep the following factors in mind.
CALL statements not inlined
Statement labels with references from outside the loop
References to character variables
External functions that do not vectorize
RETURN, STOP, or PAUSE statements
Dependencies (see Section 18.104.22.168)
You can avoid many of these inhibitors by slightly modifying the source code, or through use of compiler options.
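As an illustration of such a modification, the sketch below removes a CALL from a loop by writing the body of the called routine directly in the loop. The routine scale_one is hypothetical and stands in for any small procedure that the compiler does not inline on its own; the program only demonstrates that the rewritten loop computes the same values.

```fortran
program inline_call
  implicit none
  integer, parameter :: n = 1000
  real    :: x(n), y(n), s
  integer :: i

  s = 2.5
  x = [(real(i), i = 1, n)]
  y = x

  ! Original form: a CALL that is not inlined can inhibit vectorization.
  do i = 1, n
    call scale_one(x(i), s)
  end do

  ! Rewritten form: the body of scale_one is written directly in the loop,
  ! leaving only array arithmetic for the compiler to analyze.
  do i = 1, n
    y(i) = y(i) * s
  end do

  ! Both loops perform the same multiplication on the same inputs.
  if (any(x /= y)) error stop 'rewritten loop changed the results'
  print *, 'results match'

contains

  ! Hypothetical helper, standing in for a routine that is not inlined.
  subroutine scale_one(v, factor)
    real, intent(inout) :: v
    real, intent(in)    :: factor
    v = v * factor
  end subroutine scale_one

end program inline_call
```

Whether manual inlining is needed depends on the compiler options in use, so it is worth checking the loopmark listing before and after the change.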
By the time the compiler evaluates a program for multistreaming, the code has already gone through an initial restructuring, as it would for vectorization. Optimizations such as loop interchange (switching an inner loop with an outer loop) and loop splitting (changing a single loop into two or more loops) may already have been done.
If a loop passes the following tests, it is a candidate for multistreaming:
Its iterations can be divided among different processors without delivering incorrect results. The loop can contain private arrays if their values are not used outside the loop.
There are no function or subroutine calls within the loop, other than those specified by !dir$ ssp_private.
A scatter operation is unordered. That is, no indices can be repeated in the array to be scattered.
It is not prefaced by the nostream directive.
It has a trip count of at least 2.
The outermost loop that is prefaced by the preferstream directive.
The loop that the compiler estimates will have the greatest amount of work after the loop is interchanged to its outermost valid position.
The outermost loop after initial restructuring.
Though the compiler favors outermost loops to multistream, any loop within a nest may be chosen to multistream or vectorize. It is no longer necessary for the vector loop to be the innermost loop and the streamed loop to be the outermost loop.
One of the most powerful ways to control multistreaming is through insertion of Cray Streaming Directives (CSDs) in your source code. The CSDs are modeled after OpenMP directives, permit you to control multistreaming at a range of broad to very fine levels, and enable you to force multistreaming of code that the compiler would not ordinarily multistream. A full guide to CSD usage is beyond the scope of this manual; the CSDs are covered in detail in the Cray Fortran Compiler Commands and Directives Reference Manual and Cray C and C++ Reference Manual.
Once multistreaming divides the iterations of a loop among the SSPs working on your program, a number of single-processor optimizations created by the compiler continue to improve the performance of your program. The most important is vectorization.
Vectorization usually yields a greater speedup than any of the other single-processor optimizations. The more you can get out of vectorization, the closer you can come to realizing the full potential of the Cray X1 system.
Vectorization and multistreaming coexist automatically, as follows:
Vectorization and multistreaming can both work on the same loop, causing an MSP to behave like an 8-pipe vector processor.
Multistreaming can be done on a loop inside or outside of the vector loop, as appropriate.
The major strategies you can follow in order to enhance vectorization are:
Keep the stride through the array small; ideally, use a stride of 1.
Avoid less efficient vector code, such as loops that contain variant IFs and reductions.
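The stride guideline can be sketched as follows. Fortran stores arrays in column-major order, so the first subscript is the one that is contiguous in memory, and making it the inner loop index gives a stride of 1. The array names and the size n are arbitrary choices for illustration, and the compiler may perform the same interchange on its own.

```fortran
program stride_one
  implicit none
  integer, parameter :: n = 256
  real    :: b(n, n), a_strided(n, n), a_unit(n, n)
  integer :: i, j

  call random_number(b)

  ! Large stride: the inner loop varies the second subscript, so successive
  ! iterations touch elements that are n words apart in memory.
  do i = 1, n
    do j = 1, n
      a_strided(i, j) = 2.0 * b(i, j)
    end do
  end do

  ! Stride 1: the inner loop varies the first subscript, so successive
  ! iterations touch adjacent elements.
  do j = 1, n
    do i = 1, n
      a_unit(i, j) = 2.0 * b(i, j)
    end do
  end do

  ! The loop order changes the memory access pattern, not the results.
  if (any(a_strided /= a_unit)) error stop 'loop orders disagree'
  print *, 'both loop orders produce the same array'
end program stride_one
```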
short vector loops where efficiency suffers because there are not enough memory references to cover main memory latency
very fine granularity streamed regions where the multistreaming overhead and startup costs outweigh the advantages of multistreaming
frequently switching between small vector and small scalar loops, which causes excessive time to be consumed by repeatedly filling and draining the vector pipeline
All of these situations require modifying your source code, either by inserting CSDs to expand the streamed regions or by performing gross code restructuring to bring more parallelism into the computational routines.
Bandwidth-bound issues become apparent when more time is spent loading operands from memory than is spent doing floating-point calculations. Correcting this requires significant source code modification, including:
fusing loops to eliminate temporary vector carry-over arrays
working towards nesting loops so that the compiler can vectorize the outermost loops
stripmining outer loops so that the cache footprint for all arrays referenced is smaller than the cache size
using the no_cache_alloc directive on large arrays that have no temporal locality
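As a sketch of the first item in the list above, the example below fuses two loops so that the temporary array carrying values from one loop to the next is no longer needed. The array names and sizes are illustrative assumptions; the comparison uses a small tolerance because a compiler is free to evaluate the two forms slightly differently.

```fortran
program fuse_loops
  implicit none
  integer, parameter :: n = 1000
  real    :: a(n), b(n), c(n), tmp(n), d_split(n), d_fused(n)
  integer :: i

  call random_number(a)
  call random_number(b)
  call random_number(c)

  ! Unfused: tmp is written to memory by the first loop and read back by
  ! the second, which adds a store and a load per element.
  do i = 1, n
    tmp(i) = a(i) + b(i)
  end do
  do i = 1, n
    d_split(i) = tmp(i) * c(i)
  end do

  ! Fused: the intermediate sum is consumed immediately, so the temporary
  ! array and its memory traffic disappear.
  do i = 1, n
    d_fused(i) = (a(i) + b(i)) * c(i)
  end do

  if (maxval(abs(d_split - d_fused)) > 1.0e-6) error stop 'fused loop differs'
  print *, 'fused and unfused loops agree'
end program fuse_loops
```

The fused loop reads three operands and writes one result per iteration, while the unfused pair reads four and writes two for the same arithmetic.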