| Optimizing Applications on the Cray X1TM System - S-2315-52 | ||
|---|---|---|
| Prev Section | Next Section | |
On Cray X1 systems, the two most important optimization techniques are vectorization and multistreaming. In addition, the compiler often makes code transformations such as interchanging or unrolling loops to enable vectorization and multistreaming and to perform other optimizations. This chapter describes the processes the Cray Fortran, C, and C++ compilers use to automatically vectorize and multistream loops and the tools and techniques you can use to assist the compilers.
The SSPs on Cray X1 systems perform two types of processing: vector and scalar.
A vector is a named series of values in memory. A vector can be an array, an array of structures, or any subset of an array such as a row, column, or diagonal. Vector processing is the application of arithmetic, logical, or memory operations to multiple operands in vector registers.
Fortran arrays are stored in memory in column-major order (that is, the leftmost dimension varies fastest and the rightmost varies slowest). By default, the first element in each dimension is indexed 1, as in these Fortran array examples:
a(1), a(2), a(3), ..., a(10000) x(1,1), x(2,1), x(3,1), x(1,2), x(2,2), x(3,2) |
C and C++ arrays are stored in memory in row-major order (that is, the leftmost dimension varies slowest and the rightmost varies fastest). By default, the first element in each dimension is indexed 0, as in these C or C++ array examples:
m[0], m[1], m[2], ..., m[500] n[1][1], n[1][2], n[2][1], n[2][2], n[3][1], n[3][2] |
A scalar is a single value, including a single element of an array, that can be manipulated by a superscalar processor. Scalar processing consists of logical, arithmetic, or memory operations performed on operands in scalar registers, one at a time.
In the following Fortran examples, all the variables are scalars:
sum = sum + const b(4,3) = 5. |
The following C and C++ variables are also scalars:
x[0][2][4] tot += i; |
Vector registers are exclusive to vector processors, while scalar registers are components of both scalar or vector processors.
When vectorized loops are processed, operations are decomposed into segments, with functional units of the vector processor performing specific tasks. There is a delay in the pipeline until the results of the first operation are computed, but from then on the processing unit typically produces one vector result per clock period.
For example, consider the following loop:
do i = 1, nx c(i) = a(i) * b(i) + d end do |
Processing the loop proceeds as follows. Figure 6-1 is a graphical representation of the process.
Load the scalar value at address d into the scalar register.
Load 64 elements of array a(i) from memory addresses a(1) through a(64) into vector register 0 (V0).
Load 64 elements of array b(i) from memory into vector register 1 (V1).
Using the vector processor multiply unit, calculate the product of the values in vector registers 0 and 1 and store the 64 results in vector register 2 (V2).
Using the vector processor add unit, calculate the sum of the scalar value and each value in vector register 2 and store the 64 results in addresses c(1) through c(64) in memory.
Repeat until all values of c(i) have been calculated. Later steps can begin before earlier steps have completed.
An explicit or implied loop must be present for vectorization to occur. In determining what loops to vectorize, the compiler considers a number of factors. There are no simple rules that will predict the compiler's behavior in all cases.
These general guidelines apply to vectorization:
Vectorize loops whenever possible, even if the application is not multistreamed. Vectorization is the optimization technique that usually provides the greatest performance increase. The vector processor gains its performance advantage over the superscalar processor in two ways. First, the vector processor runs at twice the clock speed of the superscalar processor. Second, the vector processor chains all like operations and performs them in order, then moves on to the next set of like operations, and so on until the entire task is completed. Vector processing is usually faster than scalar processing by 20 times or more. More typically, vectorized loops can run up to 40 times faster than their scalar counterparts.
Make vectorized loops as "fat" as possible; that is, put as much work as possible inside the loop. See Section 6.2.1 for a definition of work. The reason this is desirable is data reuse; that is, data can be used multiple times after it is loaded from memory. The greatest performance bottleneck once a loop is vectorized is low bandwidth because cache and memory are much slower than the processor.
Use stride-1 loops (loops that increment or decrement by 1) whenever possible. Stride-1 reads and writes provide maximum bandwidth because they access data most likely to be in cache.
For nested loops, vectorize the loop with the highest trip count if possible. This is not an absolute. Under certain conditions, the compiler may determine that vectorizing a loop with less than the maximum trip count will produce the greatest net benefit. There is setup time spent in vectorizing a loop, so the trip count should be 4 or higher.
Use dense calculation loops; that is, use loops with a high ratio of floating point operations to memory references. (This flops-to-load ratio is also referred to as floating-point intensity.) For example, the statement:
c(i) = a(i) * b(i) + d |
has a flops-to-load ratio of 1:1 (2 floating point operations to 2 vector loads).
In analyzing source code, if there are no compiler options or directives prohibiting vectorization, the compiler checks all explicit and implied loops to determine whether they can be vectorized.
A fully vectorized loop is indicated in loopmark listings with V and should not be altered unless the code is producing incorrect results or you would prefer to vectorize a different loop in the loop nest.
Example 6-1. Fully Vectorized Loop
Source code:
subroutine vectorize1(nx,a,b,c,d) real a(nx),b(nx),c(nx),d do i = 1, nx c(i) = a(i) * b(i) + d end do end subroutine |
Compiler command:
% ftn -c -rm vectorize1.ftn |
Loopmark listing:
1. subroutine vectorize1(nx,a,b,c,d)
2.
3. real a(nx),b(nx),c(nx),d
4.
5. MV--< do i = 1, nx
6. MV c(i) = a(i) * b(i) + d
7. MV--> end do
8.
9. end subroutine
ftn-6204 ftn: VECTOR File = vectorize1.ftn, Line = 5
A loop starting at line 5 was vectorized.
ftn-6601 ftn: STREAM File = vectorize1.ftn, Line = 5
A loop starting at line 5 was multi-streamed. |
When the compiler cannot generate a fully vectorized loop, it tries to:
Conditionally vectorize the loop, creating a scalar version and a vector version. A runtime conditional expression chooses the scalar loop when needed to avoid a recurrence; otherwise, the vector loop is chosen.
Partially vectorize the loop. A loop is partially vectorized if it is split into multiple loops, at least one of which is scalar. This generally indicates that the loop contains some cross-iteration dependency that prevents full vectorization.
Generate a vector loop with a computed-safe vector length (VL). The safe VL is generated at runtime. If the computed-safe VL creates a fully vectorized loop (that is, it is greater than or equal to the machine-maximum VL), it is still moderately slower than its fully vectorized counterpart because of the overhead involved in the safe VL computation.
Generate a vector update loop. A vector update loop performs arithmetic on existing elements of an array and stores the results back into the same array. This type of loop requires both a gather operation from and a scatter operation to the same memory location and is not as fast as a fully vectorized loop.
Less than fully vectorized loops are indicated by Vp in the loopmark listing. Vp loops may present opportunities for programmer-assisted vectorization, as described in Section 6.1.4.
If a Vp type loop is in a critical portion of the code, it may be worth recompiling with the decompile option (-rd) and then examining the resulting .opt file to determine what was vectorized, what was done as scalar code, and where the limiting factor or dependency resides.
Example 6-2. Conditionally Vectorized Loop
Source code:
subroutine conditional1(nx,a,s1)
real a(nx)
integer s1
do i = 2, nx
if (s1 .eq. 0) then
a(i) = 0.0
else
a(i) = a(i-1)
end if
end do
end subroutine |
Compiler command:
% ftn -c -rm conditional1.ftn |
Loopmark listing:
1. subroutine conditional1(nx,a,b,s1)
2.
3. real a(nx),b(nx)
4. integer s1
5.
6. MVpr--< do i = 2, nx
7. MVpr if (s1 .eq. 0) then
8. MVpr a(i) = 0.0
9. MVpr else
10. MVpr a(i) = a(i-1)+b(i)
11. MVpr end if
12. MVpr--> end do
13.
14. end subroutine
ftn-6005 ftn: SCALAR File = conditional1.ftn, Line = 6
A loop starting at line 6 was unrolled 8 times.
ftn-6213 ftn: VECTOR File = conditional1.ftn, Line = 6
A loop starting at line 6 was conditionally vectorized.
ftn-6601 ftn: STREAM File = conditional1.ftn, Line = 6
A loop starting at line 6 was multi-streamed. |
Decompilation listing:
1. subroutine conditional1( nx, a, b, s1 )
1. t$1 = NX
3. t$2 = _zla( 0, NX )
6. if ( -2 + NX >= 0 ) then
7. if ( S1 == 0 ) then
6. @Induc01_N2 = 0
6. !dir$ ivdep
6. !dir$ stream
6. do
8. A(2 + @Induc01_N2) = 0.0
12. @Induc01_N2 = 1 + @Induc01_N2
11. if ( @Induc01_N2 >= -1 + NX ) exit
11. enddo
11. else
6. @Induc01_N2 = 0
6. @A_WR0_R0 = A(1)
6. do
10. @A_WR0_R0 = @A_WR0_R0 + B(2 + @Induc01_N2)
10. A(2 + @Induc01_N2) = @A_WR0_R0
12. @Induc01_N2 = 1 + @Induc01_N2
11. if ( @Induc01_N2 >= -1 + NX ) exit
11. enddo
11. endif
11. endif
14. return
14. end |
The decompilation listing shows the code generated to test for the conditions -2 + NX >= 0 and S1 == 0.
The Cray Fortran, C, and C++ compilers automatically produce highly optimized code by default. However, there are a number of factors that can inhibit automatic vectorization. Loops with the following attributes cannot be vectorized without programmer assistance:
Loops with ambiguous data dependencies
Loops containing non-inlined calls to subroutines and functions
Loops containing I/O statements
Statement labels with references from outside the loop
References to Fortran character variables
Intrinsic functions that do not vectorize
Fortran RETURN, STOP, or PAUSE statements
Loops in which C++ exceptions are thrown or caught
C++ typeid and runtime dynamic_cast operators
You can avoid many of these inhibitors by using compiler options or directives or by modifying the source code. Before you use any of the techniques described in the following sections, use CrayPat profile reports to focus on the functions that consume the most time. See Section 2.2.5 for details.
It is important to note that the superscalar processor performs the steps of a loop in a different order than the vector processor. For example, consider the following loop:
for(i=0; i<=2; i++) {
s[i] = t[i] + r[i];
x[i] = y[i] + z[i];
} |
The superscalar processor performs operations in this order:
s[0] = t[0] + r[0] x[0] = y[0] + z[0] s[1] = t[1] + r[1] x[1] = y[1] + z[1] s[2] = t[2] + r[2] x[2] = y[2] + z[2] |
In contrast, the vector processor performs operations in this order:
s[0] = t[0] + r[0] s[1] = t[1] + r[1] s[2] = t[2] + r[2] x[0] = y[0] + z[0] x[1] = y[1] + z[1] x[2] = y[2] + z[2] |
When you vectorize a loop, be sure to compare the vector and scalar results. Run your program with the vectorized loop, check results, then insert the Fortran !dir$ nextscalar directive or C and C++ #pragma _CRI novector directive immediately before the loop, run the program again, and compare results. Apart from small numerical differences due to reductions, if the results differ, incorrect vectorization was performed.
Check loopmark listings (see Section 2.3.1) to identify loops that the compiler did not vectorize or were less than fully vectorized. Then, use one or more of the approaches listed below:
Use compiler options. Compiler options apply to all loops in the compilation unit.
Use directives. Directives apply to specific blocks within a program.
You can apply vectorization to all but selected loops by:
Using the Fortran !dir$ nextscalar directive in conjunction with the -O vector2 or -O vector3 option. The !dir$ nextscalar directive applies to only one loop, the first loop that appears after the directive within the same program unit.
Using the C and C++ #pragma _CRI novector directive in conjunction with the -h vector2 or -h vector3 option. The #pragma _CRI novector directive applies only to the code block immediately following it.
Rewrite critical sections of source code.
When you use these techniques, remember that source files can be optimized separately and that there is no one-size-fits-all solution. Check the loopmark listing again to see if your changes causes a loop to be vectorized. Then, compare the results to the scalar version to make sure that the loops yield the correct results.
The following subsections describe situations where the compiler does not vectorize a loop and the techniques you can use to enable vectorization.
The Cray Fortran Compiler options and directives that affect optimization are documented in the Cray Fortran Compiler Commands and Directives Reference Manual. The Cray C and C++ Compiler options and directives that affect optimization are documented in the Cray C and C++ Reference Manual.
Caution:![]() | Use care when experimenting with compiler vectorization options. Illegal vectorizations can introduce errors and/or degrade performance. |
When the compiler analyzes and optimizes a loop, it uses a moderate approach (the Fortran -O noaggress option and the -h noaggress C and C++ option). If the compiler has not vectorized a loop because it is very large (hundreds of lines) or very complex, you can use the Fortran -O aggress or C and C++ -h aggress option. The trade-off of using -O aggress or -h aggress is more compile resources, time, and space.
Because I/O always involves external libraries and system calls, I/O is considered just another call to a subprogram. The compiler cannot vectorize a loop containing an I/O call, as shown in the following example.
Example 6-3. Loop Containing I/O Call
Source code:
subroutine io1(nx,a,b,c)
real a(nx),b(nx),c(nx)
open(8,file='c_array',access='direct', &
form='formatted',status='replace')
do i = 1, nx
c(i) = a(i) * b(i)
write(8,'(1x,f12.4)',rec=i) c(i)
end do
end subroutine |
Compiler command:
% ftn -c -rm io1.ftn |
Loopmark listing:
1. subroutine io1(nx,a,b,c)
2. real a(nx),b(nx),c(nx)
3.
4. open(8,file='c_array',access='direct', &
5. form='formatted',status='replace')
6.
7. 1--< do i = 1, nx
8. 1 c(i) = a(i) * b(i)
9. 1 write(8,'(1x,f12.4)',rec=i) c(i)
10. 1--> end do
11.
12. end subroutine
ftn-3021 ftn: INLINE File = io1.ftn, Line = 4
Routine _OPEN was not inlined because the compiler was unable to locate the
routine to expand it inline.
ftn-6286 ftn: VECTOR File = io1.ftn, Line = 7
A loop starting at line 7 was not vectorized because it contains input/output
operations at line 9.
ftn-6709 ftn: STREAM File = io1.ftn, Line = 7
A loop starting at line 7 was not multi-streamed because it contains
input/output operations. |
Example 6-4. Segmenting an I/O Call
You can use segmenting to enable partial vectorization. In the following example, the statement containing the write operation is moved to a separate loop.
Source code:
subroutine io2(nx,a,b,c)
real a(nx),b(nx),c(nx)
open(8,file='c_array',access='direct', &
form='formatted',status='replace')
do i = 1, nx
c(i) = a(i) * b(i)
end do
write(8,'(1x,f12.4)',rec=i) (c(i),i=1,nx)
end subroutine |
Compiler command:
% ftn -c -rm io2.ftn |
Loopmark listing:
1. subroutine io2(nx,a,b,c)
2. real a(nx),b(nx),c(nx)
3.
4. open(8,file='c_array',access='direct', &
5. form='formatted',status='replace')
6.
7. MV--< do i = 1, nx
8. MV c(i) = a(i) * b(i)
9. MV--> end do
10.
11. write(8,'(1x,f12.4)',rec=i) (c(i),i=1,nx)
12.
13. end subroutine
ftn-3021 ftn: INLINE File = io2.ftn, Line = 4
Routine _OPEN was not inlined because the compiler was unable to locate the
routine to expand it inline.
ftn-6204 ftn: VECTOR File = io2.ftn, Line = 7
A loop starting at line 7 was vectorized.
ftn-6601 ftn: STREAM File = io2.ftn, Line = 7
A loop starting at line 7 was multi-streamed.
|
The compiler generated a fully vectorized loop and a scalar loop.
One way to eliminate a scalar recurrence in a loop is to store the computed scalar values in a temporary array and index them in a different order. This technique is based on restructuring nested loops. Because scalars do not have another dimension with which to work, the scalar must be transformed into a vector.
To use this technique, you might convert a simple variable into a one-dimensional array or a one-dimensional array into a two-dimensional array. The key to this conversion is to make the scalar into a vector array with respect to the original outer loop. Thus, when the loops are switched to index the arrays in a different order, the new array is a vector with respect to the new inner loop.
Because this conversion involves restructuring the loop nest, you should take the same care when applying it that you would with any other loop nest restructuring.
For example, Example 6-5 does not vectorize because of the recurrence on array x. Example 6-6 shows how to convert the code so it will vectorize.
Example 6-5. Scalar Recurrence
Source code:
subroutine scalarrec1(nx,ny,y,vexpr,xinit)
real y(ny),vexpr(nx,ny)
do j = 1, nx
x = xinit
do k = 1, ny
x = x + vexpr(j,k)
y(k) = y(k) + x
end do
end do
end subroutine |
Compiler command:
% ftn -c -rm scalarrec1.ftn |
Loopmark listing:
1. subroutine scalarrec1(nx,ny,y,vexpr,xinit)
2. real y(ny),vexpr(nx,ny)
3.
4. 1----< do j = 1, nx
5. 1 x = xinit
6. 1 r--< do k = 1, ny
7. 1 r x = x + vexpr(j,k)
8. 1 r y(k) = y(k) + x
9. 1 r--> end do
10. 1----> end do
11.
12. end subroutine
ftn-6254 ftn: VECTOR File = scalarrec1.ftn, Line = 4
A loop starting at line 4 was not vectorized because a recurrence was found
on "Y" at line 8.
ftn-6751 ftn: STREAM File = scalarrec1.ftn, Line = 4
A loop starting at line 4 was not multi-streamed because a recurrence was
found on "Y" at line 8.
ftn-6005 ftn: SCALAR File = scalarrec1.ftn, Line = 6
A loop starting at line 6 was unrolled 8 times.
ftn-6254 ftn: VECTOR File = scalarrec1.ftn, Line = 6
A loop starting at line 6 was not vectorized because a recurrence was found
on "X" at line 7.
ftn-6751 ftn: STREAM File = scalarrec1.ftn, Line = 6
A loop starting at line 6 was not multi-streamed because a recurrence was
found on "X" at line 7. |
Example 6-6. Converting a Scalar Recurrence
You can enable vectorization in situations like that shown in Example 6-5 by replacing references to scalar x in the k loop with references to temporary array x(j), interchanging the k and j loops, and splitting out the initialization of x into its own loop.
Source code:
subroutine scalarrec2(nx,ny,x,y,vexpr,xinit)
real x(nx),y(ny),vexpr(nx,ny)
do j = 1, nx
x(j) = xinit
end do
do k = 1, ny
do j = 1, nx
x(j) = x(j) + vexpr(j,k)
y(k) = y(k) + x(j)
end do
end do
end subroutine |
Compiler command:
% ftn -c -rm scalarrec2.ftn |
Loopmark listing:
1. subroutine scalarrec2(nx,ny,x,y,vexpr,xinit)
2. real x(nx),y(ny),vexpr(nx,ny)
3.
4. MV----< do j = 1, nx
5. MV x(j) = xinit
6. MV----> end do
7.
8. 1-----< do k = 1, ny
9. 1 MV--< do j = 1, nx
10. 1 MV x(j) = x(j) + vexpr(j,k)
11. 1 MV y(k) = y(k) + x(j)
12. 1 MV--> end do
13. 1-----> end do
14.
15. end subroutine
ftn-6204 ftn: VECTOR File = scalarrec2.ftn, Line = 4
A loop starting at line 4 was vectorized.
ftn-6601 ftn: STREAM File = scalarrec2.ftn, Line = 4
A loop starting at line 4 was multi-streamed.
ftn-6254 ftn: VECTOR File = scalarrec2.ftn, Line = 8
A loop starting at line 8 was not vectorized because a recurrence was found
on "X" at line 10.
ftn-6751 ftn: STREAM File = scalarrec2.ftn, Line = 8
A loop starting at line 8 was not multi-streamed because a recurrence was
found on "X" at line 10.
ftn-6204 ftn: VECTOR File = scalarrec2.ftn, Line = 9
A loop starting at line 9 was vectorized.
ftn-6601 ftn: STREAM File = scalarrec2.ftn, Line = 9
A loop starting at line 9 was multi-streamed.
|
Because scalarrec2.ftn replaces the references to scalar x with references to vector x(j), the compiler was able to vectorize the j loop.
| Prev Section | Table of Contents | Title Page | Index | Next Section |
| Minimizing System Calls | Up one level | Streaming Loops |