C.2. Examples

The following example shows a loop nest that has been vectorized (V) and unrolled (r):

 1.					subroutine sub(nx,ny,nz,a,b,c)
 2.					real a(nx,ny,nz) ,b(nx,ny,nz) ,c(nx,ny,nz)
 3.	
 4.	1------<	do k = 2,nz
 5.	1 r----<		do j = 2,ny
 6.	1 r V--<			do i = 2,nx
 7.	1 r V					c(i,j,k) = 2.5*(a(i,j,k)-a(i,j-1,k))*(b(i,j,k)-b(i,j-1,k))
 8.	1 r V--<			end do
 9.	1 r----<		end do
10.	1------<	end do
11.
12.					end

f90-6005 f90: SCALAR File = t.f, Line = 5
  A loop starting at line 5 was unrolled 2 times.

f90-6204 f90: VECTOR File = t.f, Line = 6
  A loop starting at line 6 was vectorized

In this example the loopmark indicates that the DO loop starting on line 6 has been vectorized and the DO loop starting on line 5 has been unrolled. The messages following the subroutine provide details concerning these optimizations. For example, the first message indicates that the loop starting at line 5 was unrolled twice.

If the compiler found loops that could not be vectorized, additional messages are generated explaining why specific loops could not be vectorized. For example, the above code example would generate the following additional messages:

f90-6294 f90: VECTOR File = t.f, Line = 4
  A loop starting at line 4 was not vectorized because 
  a better candidate was found at line 6.

f90-6294 f90: VECTOR File = t.f, Line = 5
  A loop starting at line 5 was not vectorized 
  because a better candidate was found at line 6.

The following subsections discuss other information that may be included in loopmark listings.

C.2.1. Vectorization

The most important type of optimization performed by the compiler is vectorization. This subsection describes the variations on vectorization that you may encounter and the causes, advantages, and disadvantages of each type.

C.2.1.1. Fully Vectorized Loops

A fully vectorized loop is indicated in loopmark listings with V. This is generally the best-performing loop possible, and it should not be altered unless the code is producing incorrect results.

C.2.1.2. Partially Vectorized Loops

The compiler may generate partially vectorized loops, which are indicated with Vp in loopmark listings. A loop is partially vectorized if it is split into multiple loops, at least one of which is scalar. This optimization is most effective on loops that contain a higher percentage of vector work compared to scalar work.

The following list contains portions of compiler messages that indicate partial vectorization (to obtain more detail on each message produced by the compilers, use the explain(1) command):

  • Partially vectorized

  • Partially vectorized with a single vector iteration

  • Partially vectorized with a vector length of N

  • Partially vectorized with a computed safe vector length

  • Partially and conditionally vectorized

C.2.1.3. Conditionally Vectorized Loops

The compiler may generate a conditionally vectorized loop, which results in two loops: a scalar and a vector version. A run-time conditional expression chooses the scalar loop when needed to avoid a recurrence; otherwise, the vector loop is chosen. Conditionally vectorized loops are indicated with Vp in loopmark listings. Here are two examples of loops that would be conditionally vectorized:

Fortran example:

do i=1,n
   if (s1.EQ.0) then
      a(i) = 0.0
   else
      a(i) = a(i-1)
   end if
end do

C++ example:

for (i=0; i<n; i++
   if (s[i]!=0) {
      a[i]=0.0
   }else{
      a[i]=a[i-1]
   }

If you know that there will never be a recurrence in that loop, you can insert an ivdep or concurrent directive to avoid the extra logic and get a fully vectorized loop. Likewise, if you know that there will always be a recurrence, you can use the nextscalar directive to prevent the conditional expression to get a faster scalar loop.

C.2.1.4. Reduction Loop

The compiler may generate a particular form of vectorization called a reduction loop, which is indicated with V in loopmark listings. A reduction loop (also known as a summation loop or vector collapse) sums the elements of one array dimension. If the size of that dimension is considerably larger than the hardware vector register size, most of the summation can occur with partial sums in vector registers. Some scalar code is required to sum up the elements of the vector. The following examples illustrate loops that would be optimized as reduction loops:

Fortran example:

do i=1,n
   sum=sum+a(i)
end do

C++ example:

for (i=0; i<n; i++){
   sum = sum+a[i];
   }

C.2.1.5. Short Loops

The compiler may generate a short loop, which is indicated with Vs in loopmark listings. A short loop is a loop that is vectorized but also determined by the compiler to have trips less than or equal to the maximum vector length. In this case, the compiler deletes the branch to the top of the loop. If the shortloop directive is used or the trip count is constant, the top test for number of trips is deleted. A short loop is more efficient than a conventional loop.

The compiler determines that a loop is a short loop when any one of the following conditions is met:

  • The shortloop or shortloop128 directive is used when compiling

  • The constant trip count is less than or equal to the maximum hardware vector length

  • Compiler range analysis proves that, for a legal program, the trip count is less than or equal to the maximum vector length

C.2.1.6. Vector Update Loops

The compiler may generate vector update loops, which are indicated with Vp in loopmark listings. A vector update loop performs arithmetic on existing elements of an array and stores the results back into the same array. This type of loop requires both a gather operation from and a scatter operation to the same memory location and is not as fast as a fully vectorized loop. The following examples illustrate code that would be optimized as vector update loops:

Fortran example:

do i=1,n
   a(ib(i)) = a(ib(i))+b(i)
end do

C++ example:

for (i=0; i<n; i++) {
   a[ib[i]]-a[ib[i]]+b[i];
   }

To improve the performance of a vector update loop, insert an ivdep or concurrent directive, but only if you know there are no subscript collisions.

C.2.1.7. Computed-safe Vector Length Loops

The compiler might generate a vector loop with a computed-safe vector length (VL) to avoid a recurrence. The safe VL is generated at run-time. If the computed-safe VL creates a fully vectorized loop (that is, it is greater than or equal to the machine-maximum VL), it is still moderately slower than its unconditional fully vectorized counterpart because of the overhead involved in the safe VL computation.

Loops vectorized with computed-safe VL are indicated with Vp in loopmark listings.

The following examples illustrate code that defines computed-safe VL loops:

Fortran example:

do i=1,n
   a(i) = a(i-m)
end do

C++ example:

for (i=0; i<n; i++) {
   a[i]=a[i-m];
   }

If you know that there will be no dependencies caused by array index overlap between loop iterations, you can avoid the overhead associated with computed-safe VL loops by using the ivdep or concurrent directive to force the loop into a fully vectorized loop instead.

C.2.1.8. Outer-loop Vectorization

Historically, Cray compilers have vectorized inner loops only. On Cray X1 systems, the compilers now have the ability to vectorize either inner or outer loops.

As an example, the outer loop below (do i) is vectorized. Note that array a(i) is invariant with respect to the inner loop (do j). The load of a(i) is done before the inner loop and the store is done following it. This reduces the memory traffic within the do j loop by two memory operations and increase performance significantly.

    4.  V----<      do i = 1,m
    5.  V r--<           do j = 1,n
    6.  V r                    a(i) = a(i) + b(i,j)*c(j)
    7.  V r-->           end do
    8.  V---->      end do
    9.
   10.               end

ftn-6204 f90: VECTOR File = outer.f, Line = 4
  A loop starting at line 4 was vectorized.

ftn-6005 f90: SCALAR File = outer.f, Line = 5
  A loop starting at line 5 was unrolled 2 times.

C.2.2. Multistreaming

A loop that is multistreamed is indicated with M, unless it has not been partitioned, in which case it is indicated with m. Loops may be both vectorized and multistreamed, especially when dealing with nested loops.

The compiler uses certain criteria to judge whether or not a loop can be automatically multistreamed.

At the point the compiler evaluates a program, the code has already gone through an initial restructuring, as it would for vectorization. Optimizations such as loop interchange (switching an inner loop with an outer loop) and loop splitting (changing a single loop into two or more loops) may have been done.

If a loop passes the following tests, it is a candidate for multistreaming:

When two loops that qualify to be multistreamed are nested, the following criteria are used in the following order to choose which loop to multistream:

  1. The outermost loop that is prefaced by the preferstream directive.

  2. The loop that the compiler estimates will have the greatest amount of work after the loop is interchanged to its outermost valid position.

  3. The outermost loop after initial restructuring.

Though the compiler favors outermost loops to multistream, any loop within a nest may be chosen to multistream or vectorize. It is no longer necessary for the vector loop to be the innermost loop and the streamed loop to be the outermost loop.

C.2.2.1. Streamed and Partitioned Loops

In the loopmark listing, the loopmark character of M is used to indicate loops that are streamed and partitioned, as shown in this example:

   68.  M-----<       do j = 1,m
   69.  M V---<         do i = 1,n
   70.  M V               c(i,j) = a(i,j) + d(i,j)
   71.  M V               a(i,j) = b(i,j) * 10.0
   72.  M V--->         end do
   73.  M----->       end do
   74.                ir2=irtc()
   75.                ir=ir2-ir1
   76.                end

f90-6601 f90: STREAM File = 2-level.f, Line = 68
  A loop starting at line 68 was multistreamed.

f90-6204 f90: VECTOR File = 2-level.f, Line = 69
  A loop starting at line 69 was vectorized.

M indicates that the iterations of the loop have been divided equally among the SSPs and will be executed in parallel.

C.2.2.2. Streaming without Partitioning

In certain cases a loop may be multistreamed without being partitioned. These cases always involve nested loops, where an inner loop can be divided equally among the SSPs, but the outer loop is executed redundantly by all SSPs. In the loopmark listing the character of m is used to indicate multistreaming without partitioning:

37.  Vm-----<     do i = 1,n
38.  Vm M---<       do j = 1,20
39.  Vm M             a(i,j) = b(i,j) + c(i,j)
40.  Vm M--->       end do
41.  Vm----->     end do

In the above example, the loops would be divided up among the SSPs as illustrated below:

SSP 0

do i = 1,n
   do j=16,20

SSP 1

do i = 1,n
   do j=11,15

SSP 2

do i = 1,n
   do j=6,10

SSP 3

do i = 1,n
   do j=1,5

Thus the inner loop can be fully multistreamed, but the outer loop is vectorized and multistreamed without partitioning.

C.2.3. Pattern-Matching

A loop that has been pattern-matched and replaced with a call to a library routine is indicated with an A. The following example shows a matrix-matrix multiply that has been pattern matched. The three outer loops are marked with D and have been deleted, and the innermost loop is marked with an A to indicate that it has been pattern-matched:

   23.  D--------<       do k = 1,n3       !
   24.  D D------<         do j = 1,n2     !       M
   25.  D D D----<           do i = 1,n1   !       V
   26.  D D D A-<>             x(i,j) = x(i,j) + y(i,k) * z(k,j)
   27.  D D D---->           end do
   28.  D D------>         end do
   29.  D-------->       end do
   30.
   31.                   ir2=irtc()
   32.                   ir=ir2-ir1
   33.                   end

f90-6002 f90: SCALAR File = dgemm-1.f, Line = 23
  A loop starting at line 23 was eliminated by optimization.

f90-6002 f90: SCALAR File = dgemm-1.f, Line = 24
  A loop starting at line 24 was eliminated by optimization.

f90-6002 f90: SCALAR File = dgemm-1.f, Line = 25
  A loop starting at line 25 was eliminated by optimization.

f90-6202 f90: VECTOR File = dgemm-1.f, Line = 26
  A loop starting at line 26 was replaced by a library call.

Pattern matching is not always worthwhile. When applied to a routine that is called frequently, the cumulative time saved by using the optimized library routine will probably outweigh the call overhead. When applied to a routine that is used rarely, the calling overhead may actually result in a net degradation of performance. When compiling using the default optimization settings, the compiler attempts to determine whether each given candidate for pattern matching will in fact yield improved performance.

C.2.4. Loop Unrolling

Loop unrolling is done for the following reasons:

A nonvector loop is chosen to be unrolled when a pattern of nearest neighbor (for example, a(i,j) = a(i,j+1)) is seen for that loop. Unrolling a nonvector loop is referred to as unroll-and-jam. The nearest neighbor pattern can be either write-read or read-read. Unroll-and-jam is done primarily to improve register and cache reuse. Reusing vector registers or cache reduces bandwidth usage, thereby improving performance.

For example, consider the following unroll-and-jam optimization:

 5.  r-----<       do j = 1,n-1    
 6.  r V---<         do i = 1,n      ! vectorize    
 7.  r V              a(i,j) = b(i,j) + b(i,j+1)    
 8.  r V--->         enddo    
 9.  r----->       enddo  

f90-6005 f90: SCALAR File = nn-3.f, Line = 5 
  A loop starting at line 5 was unrolled 2 times.  

f90-6204 f90: VECTOR File = nn-3.f, Line = 6 
  A loop starting at line 6 was vectorized.

The loopmark character r indicates unrolling. The loop is converted to:

do j = 1,n-1,2    ! unrolled for reuse of b(i,j+1)
  do i = 1,n      ! vectorized
    a(i,j)   = b(i,j)   + b(i,j+1)
    a(i,j+1) = b(i,j+1) + b(i,j+2)
  end do
end do  

Data b(i,j+1) is loaded once but used twice in the unrolled version.

Loop unwinding is a special case of unrolling, where the loop is entirely unrolled. For example:

     4.  V----<       do j = 1,n
     5.  V W--<         do i = 1,3
     6.  V W              a(i,j) = b(i,j) + c(i,j) * s1
     7.  V W-->         end do
     8.  V---->       end do
     9.
    10.               end

f90-6204 f90: VECTOR File = unwind-1.f, Line = 4
  A loop starting at line 4 was vectorized.

f90-6008 f90: SCALAR File = unwind-1.f, Line = 5
  A loop starting at line 5 was unwound.   

The loopmark character W indicates unwinding.

C.2.5. Loop Interchange

Loop interchange is done to promote vector register reuse and quicker cache reuse. As a result of interchange, there is additional outer-loop vectorization, resulting in invariant vector registers and operations. These vectors can be hoisted and reused within the inner loop, which is very effective in reducing bandwidth usage.

The following example illustrates interchange in matrix-vector multiplication:

    25.  ir----<       do j = 1,m
    26.  ir V--<         do i = 1,n
    27.  ir V              a(i) = a(i) + b(i,j) * c(j)
    28.  ir V-->         end do
    29.  ir---->       end do

f90-6007 f90: SCALAR File = dgemv-1.f, Line = 25
  A loop starting at line 25 was interchanged with 
  the loop starting at line 26.

f90-6005 f90: SCALAR File = dgemv-1.f, Line = 25 
  A loop starting at line 25 was unrolled 2 times.
  
f90-6204 f90: VECTOR File = dgemv-1.f, Line = 26 
  A loop starting at line 26 was vectorized.

The loopmark character i indicates that the loop was interchanged. The inner loop is converted to a dot product:

do i = i,n        ! vectorized
  do j = i,n           
	     a(i) = a(i) + b(i,j) * c(j)
  end do
 end do  

The vector load of a(i) can be loaded ahead of the j-loop and the vector store of a(i) can be pushed below. Within the inner loop, instead of two vector-loads and one vector-store, there will be one vector-load and no vector-stores.

C.2.6. Loop Collapse

Loop collapse combines two or more loops into a single loop. This produces:

For example, consider this loop:

    1.                subroutine sub(n)
    2.                common/blk/s1,a(100,100),b(100,100),c(100,100)
    3.
    4.  C-----<       do j = 1,100
    5.  C Vr--<         do i = 1,100
    6.  C Vr              a(i,j) = b(i,j) + c(i,j) * s1
    7.  C Vr-->         end do
    8.  C----->       end do
    9.
   10.                end

f90-6003 f90: SCALAR File = unwind-1.f, Line = 4
  A loop starting at line 4 was collapsed into 
  the loop starting at line 5.

f90-8135 f90: SCALAR File = unwind-1.f, Line = 5
  Loop starting at line 5 was unrolled 2 times.

f90-6204 f90: VECTOR File = unwind-1.f, Line = 5
  A loop starting at line 5 was vectorized.

The loopmark C indicates loop collapse. Essentially, the loop is converted to:

do i = 1,100 * 100
  a(i,1) = b(i,1) + c(i,1) * s1
end do

C.2.7. Loop Fusion

Loop fusion combines two or more loops into a single loop. The goals are reduced loop overhead and register reuse. Loop fusion is very useful for optimizing Fortran array syntax, where each statement is an implied loop:

    1.                subroutine sub
    2.                common/blk/a(100),b(100)
    3.
    4.  V---<>        a=0.
    5.  f---<>        b=0.
    6.
    7.                end

f90-6204 f90: VECTOR File = fusion.f, Line = 4
  A loop starting at line 4 was vectorized.

f90-6004 f90: SCALAR File = fusion.f, Line = 5
  A loop starting at line 5 was fused with the
  loop starting at line 4.

The loopmark character f indicates loop fusion.

Loop fusion is also useful as illustrated in the case below, where array temp can be optimized away by temporarily storing it in a register instead of in memory:

    5.  Vr--<        do i = 1,n
    6.  Vr             temp(i) = b(i) * c(i)
    7.  Vr-->        end do
    8.
    9.  f---<        do i = 1,n
   10.  f              a(i) = temp(i) + d(i)
   11.  f--->        end do

f90-8135 f90: SCALAR File = fusion.f, Line = 5
  Loop starting at line 5 was unrolled 2 times.

f90-6204 f90: VECTOR File = fusion.f, Line = 5
  A loop starting at line 5 was vectorized.

f90-6004 f90: SCALAR File = fusion.f, Line = 9
  A loop starting at line 9 was fused with the
  loop starting at line 5.

C.2.8. Loop Blocking

Cray compilers do not perform automatic loop blocking at present. However, the blockable directive is available to do manual blocking:

    4.  1--------<       do k = 1,n
    5.  1          !dir$   blockable (j,i)
    6.  1          !dir$   blocking size (20)
    7.  1 b------<         do j = 1,m
    8.  1 b        !dir$     blocking size (20)
    9.  1 b Vb---<           do i = 1,mm
   10.  1 b Vb                z(i,k) = z(i,k) + x(i,j) * y(j,k)
   11.  1 b Vb--->           end do
   12.  1 b------>         end do
   13.  1-------->       end do
   14.
   15.                   end

f90-6051 f90: SCALAR File = blockable.f, Line = 7
  A loop starting at line 7 was blocked according 
  to user directive with block size 20.

f90-6051 f90: SCALAR File = blockable.f, Line = 9
  A loop starting at line 9 was blocked according 
  to user directive with block size 20.

f90-6205 f90: VECTOR File = blockable.f, Line = 9
  A loop starting at line 9 was vectorized with a 
  single vector iteration.

The loopmark character b indicates loop blocking. See the Cray Fortran Compiler Commands and Directives manual for more information.