| Optimizing Applications on the Cray X1TM System - S-2315-50 | ||
|---|---|---|
| Prev Section | Appendix C. Loopmark Examples | Next Section |
The following example shows a loop nest that has been vectorized (V) and unrolled (r):
1. subroutine sub(nx,ny,nz,a,b,c) 2. real a(nx,ny,nz) ,b(nx,ny,nz) ,c(nx,ny,nz) 3. 4. 1------< do k = 2,nz 5. 1 r----< do j = 2,ny 6. 1 r V--< do i = 2,nx 7. 1 r V c(i,j,k) = 2.5*(a(i,j,k)-a(i,j-1,k))*(b(i,j,k)-b(i,j-1,k)) 8. 1 r V--< end do 9. 1 r----< end do 10. 1------< end do 11. 12. end f90-6005 f90: SCALAR File = t.f, Line = 5 A loop starting at line 5 was unrolled 2 times. f90-6204 f90: VECTOR File = t.f, Line = 6 A loop starting at line 6 was vectorized |
In this example the loopmark indicates that the DO loop starting on line 6 has been vectorized and the DO loop starting on line 5 has been unrolled. The messages following the subroutine provide details concerning these optimizations. For example, the first message indicates that the loop starting at line 5 was unrolled twice.
If the compiler found loops that could not be vectorized, additional messages are generated explaining why specific loops could not be vectorized. For example, the above code example would generate the following additional messages:
f90-6294 f90: VECTOR File = t.f, Line = 4 A loop starting at line 4 was not vectorized because a better candidate was found at line 6. f90-6294 f90: VECTOR File = t.f, Line = 5 A loop starting at line 5 was not vectorized because a better candidate was found at line 6. |
The following subsections discuss other information that may be included in loopmark listings.
The most important type of optimization performed by the compiler is vectorization. This subsection describes the variations on vectorization that you may encounter and the causes, advantages, and disadvantages of each type.
A fully vectorized loop is indicated in loopmark listings with V. This is generally the best-performing loop possible, and it should not be altered unless the code is producing incorrect results.
The compiler may generate partially vectorized loops, which are indicated with Vp in loopmark listings. A loop is partially vectorized if it is split into multiple loops, at least one of which is scalar. This optimization is most effective on loops that contain a higher percentage of vector work compared to scalar work.
The following list contains portions of compiler messages that indicate partial vectorization (to obtain more detail on each message produced by the compilers, use the explain(1) command):
Partially vectorized
Partially vectorized with a single vector iteration
Partially vectorized with a vector length of N
Partially vectorized with a computed safe vector length
Partially and conditionally vectorized
The compiler may generate a conditionally vectorized loop, which results in two loops: a scalar and a vector version. A run-time conditional expression chooses the scalar loop when needed to avoid a recurrence; otherwise, the vector loop is chosen. Conditionally vectorized loops are indicated with Vp in loopmark listings. Here are two examples of loops that would be conditionally vectorized:
Fortran example:
do i=1,n
if (s1.EQ.0) then
a(i) = 0.0
else
a(i) = a(i-1)
end if
end do |
C++ example:
for (i=0; i<n; i++
if (s[i]!=0) {
a[i]=0.0
}else{
a[i]=a[i-1]
} |
If you know that there will never be a recurrence in that loop, you can insert an ivdep or concurrent directive to avoid the extra logic and get a fully vectorized loop. Likewise, if you know that there will always be a recurrence, you can use the nextscalar directive to prevent the conditional expression to get a faster scalar loop.
The compiler may generate a particular form of vectorization called a reduction loop, which is indicated with V in loopmark listings. A reduction loop (also known as a summation loop or vector collapse) sums the elements of one array dimension. If the size of that dimension is considerably larger than the hardware vector register size, most of the summation can occur with partial sums in vector registers. Some scalar code is required to sum up the elements of the vector. The following examples illustrate loops that would be optimized as reduction loops:
Fortran example:
do i=1,n sum=sum+a(i) end do |
C++ example:
for (i=0; i<n; i++){
sum = sum+a[i];
} |
The compiler may generate a short loop, which is indicated with Vs in loopmark listings. A short loop is a loop that is vectorized but also determined by the compiler to have trips less than or equal to the maximum vector length. In this case, the compiler deletes the branch to the top of the loop. If the shortloop directive is used or the trip count is constant, the top test for number of trips is deleted. A short loop is more efficient than a conventional loop.
The compiler determines that a loop is a short loop when any one of the following conditions is met:
The shortloop or shortloop128 directive is used when compiling
The constant trip count is less than or equal to the maximum hardware vector length
Compiler range analysis proves that, for a legal program, the trip count is less than or equal to the maximum vector length
The compiler may generate vector update loops, which are indicated with Vp in loopmark listings. A vector update loop performs arithmetic on existing elements of an array and stores the results back into the same array. This type of loop requires both a gather operation from and a scatter operation to the same memory location and is not as fast as a fully vectorized loop. The following examples illustrate code that would be optimized as vector update loops:
Fortran example:
do i=1,n a(ib(i)) = a(ib(i))+b(i) end do |
C++ example:
for (i=0; i<n; i++) {
a[ib[i]]-a[ib[i]]+b[i];
} |
To improve the performance of a vector update loop, insert an ivdep or concurrent directive, but only if you know there are no subscript collisions.
The compiler might generate a vector loop with a computed-safe vector length (VL) to avoid a recurrence. The safe VL is generated at run-time. If the computed-safe VL creates a fully vectorized loop (that is, it is greater than or equal to the machine-maximum VL), it is still moderately slower than its unconditional fully vectorized counterpart because of the overhead involved in the safe VL computation.
Loops vectorized with computed-safe VL are indicated with Vp in loopmark listings.
The following examples illustrate code that defines computed-safe VL loops:
Fortran example:
do i=1,n a(i) = a(i-m) end do |
C++ example:
for (i=0; i<n; i++) {
a[i]=a[i-m];
} |
If you know that there will be no dependencies caused by array index overlap between loop iterations, you can avoid the overhead associated with computed-safe VL loops by using the ivdep or concurrent directive to force the loop into a fully vectorized loop instead.
Historically, Cray compilers have vectorized inner loops only. On Cray X1 systems, the compilers now have the ability to vectorize either inner or outer loops.
As an example, the outer loop below (do i) is vectorized. Note that array a(i) is invariant with respect to the inner loop (do j). The load of a(i) is done before the inner loop and the store is done following it. This reduces the memory traffic within the do j loop by two memory operations and increase performance significantly.
4. V----< do i = 1,m
5. V r--< do j = 1,n
6. V r a(i) = a(i) + b(i,j)*c(j)
7. V r--> end do
8. V----> end do
9.
10. end
ftn-6204 f90: VECTOR File = outer.f, Line = 4
A loop starting at line 4 was vectorized.
ftn-6005 f90: SCALAR File = outer.f, Line = 5
A loop starting at line 5 was unrolled 2 times. |
A loop that is multistreamed is indicated with M, unless it has not been partitioned, in which case it is indicated with m. Loops may be both vectorized and multistreamed, especially when dealing with nested loops.
The compiler uses certain criteria to judge whether or not a loop can be automatically multistreamed.
At the point the compiler evaluates a program, the code has already gone through an initial restructuring, as it would for vectorization. Optimizations such as loop interchange (switching an inner loop with an outer loop) and loop splitting (changing a single loop into two or more loops) may have been done.
If a loop passes the following tests, it is a candidate for multistreaming:
Its iterations can be divided among different processors without delivering incorrect results. The loop can contain private arrays if their values are not used outside the loop.
There are no function or subroutine calls within the loop, other than those specified by !dir$ ssp_private.
A scatter operation is unordered. That is, no indices can be repeated in the array to be scattered.
It is not prefaced by the nostream directive.
It has a trip count of at least 2.
When two loops that qualify to be multistreamed are nested, the following criteria are used in the following order to choose which loop to multistream:
The outermost loop that is prefaced by the preferstream directive.
The loop that the compiler estimates will have the greatest amount of work after the loop is interchanged to its outermost valid position.
The outermost loop after initial restructuring.
Though the compiler favors outermost loops to multistream, any loop within a nest may be chosen to multistream or vectorize. It is no longer necessary for the vector loop to be the innermost loop and the streamed loop to be the outermost loop.
In the loopmark listing, the loopmark character of M is used to indicate loops that are streamed and partitioned, as shown in this example:
68. M-----< do j = 1,m 69. M V---< do i = 1,n 70. M V c(i,j) = a(i,j) + d(i,j) 71. M V a(i,j) = b(i,j) * 10.0 72. M V---> end do 73. M-----> end do 74. ir2=irtc() 75. ir=ir2-ir1 76. end f90-6601 f90: STREAM File = 2-level.f, Line = 68 A loop starting at line 68 was multistreamed. f90-6204 f90: VECTOR File = 2-level.f, Line = 69 A loop starting at line 69 was vectorized. |
M indicates that the iterations of the loop have been divided equally among the SSPs and will be executed in parallel.
In certain cases a loop may be multistreamed without being partitioned. These cases always involve nested loops, where an inner loop can be divided equally among the SSPs, but the outer loop is executed redundantly by all SSPs. In the loopmark listing the character of m is used to indicate multistreaming without partitioning:
37. Vm-----< do i = 1,n 38. Vm M---< do j = 1,20 39. Vm M a(i,j) = b(i,j) + c(i,j) 40. Vm M---> end do 41. Vm-----> end do |
In the above example, the loops would be divided up among the SSPs as illustrated below:
SSP 0
do i = 1,n do j=16,20 |
SSP 1
do i = 1,n do j=11,15 |
SSP 2
do i = 1,n do j=6,10 |
SSP 3
do i = 1,n do j=1,5 |
Thus the inner loop can be fully multistreamed, but the outer loop is vectorized and multistreamed without partitioning.
A loop that has been pattern-matched and replaced with a call to a library routine is indicated with an A. The following example shows a matrix-matrix multiply that has been pattern matched. The three outer loops are marked with D and have been deleted, and the innermost loop is marked with an A to indicate that it has been pattern-matched:
23. D--------< do k = 1,n3 ! 24. D D------< do j = 1,n2 ! M 25. D D D----< do i = 1,n1 ! V 26. D D D A-<> x(i,j) = x(i,j) + y(i,k) * z(k,j) 27. D D D----> end do 28. D D------> end do 29. D--------> end do 30. 31. ir2=irtc() 32. ir=ir2-ir1 33. end f90-6002 f90: SCALAR File = dgemm-1.f, Line = 23 A loop starting at line 23 was eliminated by optimization. f90-6002 f90: SCALAR File = dgemm-1.f, Line = 24 A loop starting at line 24 was eliminated by optimization. f90-6002 f90: SCALAR File = dgemm-1.f, Line = 25 A loop starting at line 25 was eliminated by optimization. f90-6202 f90: VECTOR File = dgemm-1.f, Line = 26 A loop starting at line 26 was replaced by a library call. |
Pattern matching is not always worthwhile. When applied to a routine that is called frequently, the cumulative time saved by using the optimized library routine will probably outweigh the call overhead. When applied to a routine that is used rarely, the calling overhead may actually result in a net degradation of performance. When compiling using the default optimization settings, the compiler attempts to determine whether each given candidate for pattern matching will in fact yield improved performance.
Loop unrolling is done for the following reasons:
More efficient use of vector operations
Vector register reuse
Cache reuse if all reused data does not fit in vector registers
Reduced number of executed branch instructions and other loop overhead
A nonvector loop is chosen to be unrolled when a pattern of nearest neighbor (for example, a(i,j) = a(i,j+1)) is seen for that loop. Unrolling a nonvector loop is referred to as unroll-and-jam. The nearest neighbor pattern can be either write-read or read-read. Unroll-and-jam is done primarily to improve register and cache reuse. Reusing vector registers or cache reduces bandwidth usage, thereby improving performance.
For example, consider the following unroll-and-jam optimization:
5. r-----< do j = 1,n-1 6. r V---< do i = 1,n ! vectorize 7. r V a(i,j) = b(i,j) + b(i,j+1) 8. r V---> enddo 9. r-----> enddo f90-6005 f90: SCALAR File = nn-3.f, Line = 5 A loop starting at line 5 was unrolled 2 times. f90-6204 f90: VECTOR File = nn-3.f, Line = 6 A loop starting at line 6 was vectorized. |
The loopmark character r indicates unrolling. The loop is converted to:
do j = 1,n-1,2 ! unrolled for reuse of b(i,j+1)
do i = 1,n ! vectorized
a(i,j) = b(i,j) + b(i,j+1)
a(i,j+1) = b(i,j+1) + b(i,j+2)
end do
end do |
Data b(i,j+1) is loaded once but used twice in the unrolled version.
Loop unwinding is a special case of unrolling, where the loop is entirely unrolled. For example:
4. V----< do j = 1,n
5. V W--< do i = 1,3
6. V W a(i,j) = b(i,j) + c(i,j) * s1
7. V W--> end do
8. V----> end do
9.
10. end
f90-6204 f90: VECTOR File = unwind-1.f, Line = 4
A loop starting at line 4 was vectorized.
f90-6008 f90: SCALAR File = unwind-1.f, Line = 5
A loop starting at line 5 was unwound. |
The loopmark character W indicates unwinding.
Loop interchange is done to promote vector register reuse and quicker cache reuse. As a result of interchange, there is additional outer-loop vectorization, resulting in invariant vector registers and operations. These vectors can be hoisted and reused within the inner loop, which is very effective in reducing bandwidth usage.
The following example illustrates interchange in matrix-vector multiplication:
25. ir----< do j = 1,m
26. ir V--< do i = 1,n
27. ir V a(i) = a(i) + b(i,j) * c(j)
28. ir V--> end do
29. ir----> end do
f90-6007 f90: SCALAR File = dgemv-1.f, Line = 25
A loop starting at line 25 was interchanged with
the loop starting at line 26.
f90-6005 f90: SCALAR File = dgemv-1.f, Line = 25
A loop starting at line 25 was unrolled 2 times.
f90-6204 f90: VECTOR File = dgemv-1.f, Line = 26
A loop starting at line 26 was vectorized. |
The loopmark character i indicates that the loop was interchanged. The inner loop is converted to a dot product:
do i = i,n ! vectorized do j = i,n a(i) = a(i) + b(i,j) * c(j) end do end do |
The vector load of a(i) can be loaded ahead of the j-loop and the vector store of a(i) can be pushed below. Within the inner loop, instead of two vector-loads and one vector-store, there will be one vector-load and no vector-stores.
Loop collapse combines two or more loops into a single loop. This produces:
Longer vectors
Less loop overhead
Better vector and/or streaming load balance
For example, consider this loop:
1. subroutine sub(n)
2. common/blk/s1,a(100,100),b(100,100),c(100,100)
3.
4. C-----< do j = 1,100
5. C Vr--< do i = 1,100
6. C Vr a(i,j) = b(i,j) + c(i,j) * s1
7. C Vr--> end do
8. C-----> end do
9.
10. end
f90-6003 f90: SCALAR File = unwind-1.f, Line = 4
A loop starting at line 4 was collapsed into
the loop starting at line 5.
f90-8135 f90: SCALAR File = unwind-1.f, Line = 5
Loop starting at line 5 was unrolled 2 times.
f90-6204 f90: VECTOR File = unwind-1.f, Line = 5
A loop starting at line 5 was vectorized. |
The loopmark C indicates loop collapse. Essentially, the loop is converted to:
do i = 1,100 * 100 a(i,1) = b(i,1) + c(i,1) * s1 end do |
Loop fusion combines two or more loops into a single loop. The goals are reduced loop overhead and register reuse. Loop fusion is very useful for optimizing Fortran array syntax, where each statement is an implied loop:
1. subroutine sub
2. common/blk/a(100),b(100)
3.
4. V---<> a=0.
5. f---<> b=0.
6.
7. end
f90-6204 f90: VECTOR File = fusion.f, Line = 4
A loop starting at line 4 was vectorized.
f90-6004 f90: SCALAR File = fusion.f, Line = 5
A loop starting at line 5 was fused with the
loop starting at line 4. |
The loopmark character f indicates loop fusion.
Loop fusion is also useful as illustrated in the case below, where array temp can be optimized away by temporarily storing it in a register instead of in memory:
5. Vr--< do i = 1,n
6. Vr temp(i) = b(i) * c(i)
7. Vr--> end do
8.
9. f---< do i = 1,n
10. f a(i) = temp(i) + d(i)
11. f---> end do
f90-8135 f90: SCALAR File = fusion.f, Line = 5
Loop starting at line 5 was unrolled 2 times.
f90-6204 f90: VECTOR File = fusion.f, Line = 5
A loop starting at line 5 was vectorized.
f90-6004 f90: SCALAR File = fusion.f, Line = 9
A loop starting at line 9 was fused with the
loop starting at line 5. |
Cray compilers do not perform automatic loop blocking at present. However, the blockable directive is available to do manual blocking:
4. 1--------< do k = 1,n
5. 1 !dir$ blockable (j,i)
6. 1 !dir$ blocking size (20)
7. 1 b------< do j = 1,m
8. 1 b !dir$ blocking size (20)
9. 1 b Vb---< do i = 1,mm
10. 1 b Vb z(i,k) = z(i,k) + x(i,j) * y(j,k)
11. 1 b Vb---> end do
12. 1 b------> end do
13. 1--------> end do
14.
15. end
f90-6051 f90: SCALAR File = blockable.f, Line = 7
A loop starting at line 7 was blocked according
to user directive with block size 20.
f90-6051 f90: SCALAR File = blockable.f, Line = 9
A loop starting at line 9 was blocked according
to user directive with block size 20.
f90-6205 f90: VECTOR File = blockable.f, Line = 9
A loop starting at line 9 was vectorized with a
single vector iteration. |
The loopmark character b indicates loop blocking. See the Cray Fortran Compiler Commands and Directives manual for more information.
| Prev Section | Table of Contents | Title Page | Index | Next Section |
| Loopmark Examples | Up one level | Cache |