4.3. Optimizing for Cache

Like CRAY T3D systems, the CRAY T3E system has an 8-Kbyte primary data cache that is direct mapped (glossary, ). But it also has a 96-Kbyte secondary cache that is three-way set associative (glossary, ). Although this makes optimizing for cache use less critical on the CRAY T3E system, programming for cache is still an important source of potential performance improvement.

The following sections describe how to rearrange array dimensions (see the next section) and add pad arrays (see Section 4.3.2). For background information on how data cache and secondary cache work, see Procedure 1.1, Moving data from memory, and the associated figures.

4.3.1. Rearranging Array Dimensions for Cache Reuse

You can decrease the execution time for your code by increasing the number of times a piece of data is used while it is resident in cache. One way to increase cache reuse is by making reused array dimensions the fastest-moving, or leftmost, dimensions of an array.

The array dimensions in the following example can be rearranged to increase reuse.


Example 4-1. Unoptimized code

COMMON A(N,3,3), B(N,3,3), C(N,3,3)
    DO I=1,3
      DO K=1,3
        DO L=1,N
          C(L,I,K) = A(L,I,1) * B(L,1,K)
   &          +A(L,I,2) * B(L,2,K)
   &          +A(L,I,3) * B(L,3,K)
        ENDDO
      ENDDO
    ENDDO


One way to detect potential cache reuse in a loop nest is by looking for array references that do not contain all of the loop nest's loop control variables. Dimensions that have references without a loop control variable, or that have loop control variables that differ from reference to reference, are generally candidates for reuse.

In the preceding example, the second and third dimensions have reuse potential because they each have references in which the subscript is not a loop control variable. These dimensions should be made the fastest running, leftmost dimensions. The dimension declarations can be changed as follows:

COMMON A(3,3,N), B(3,3,N), C(3,3,N)

Rearranging a dimension should be accompanied by changing the order of the nested DO loops. This optimization maximizes the number of loop invariant (glossary, ) references in the inner loop. Maximizing the proportion of stride-1 reference streams is also important and often easier to do. In the loop from the preceding example, the number of invariant references is the same for both the K and I loops. The K and I loops have more invariant references than the L loop. The I loop gives more stride-1 references, thus the loop nest should be changed in the following way:

    DO L=1,N
      DO K=1,3
        DO I=1,3
          C(I,K,L) = A(I,1,L) * B(1,K,L)
   &          +A(I,2,L) * B(2,K,L)
   &          +A(I,3,L) * B(3,K,L)
        ENDDO
      ENDDO
    ENDDO

For an illustration of how the internal ordering of array A changes as a result of the optimization, see the following figure. The array is now processed in the same order it is stored. The illustration assumes that N has a value of 6.

Figure 4-3. Before and after array A has been optimized

Of course, loop interchange can only be done if it does not violate data dependencies. The compiler will usually perform interchange under default optimization if it detects that doing so is both profitable in terms of performance and does not cause incorrect behavior.

Cache reuse in loops such as the one in the preceding example can be further increased by tiling (glossary, ). Tiling involves further stripmining (glossary, ) the inner loops and interchanging loops. This is especially profitable when the fastest-moving dimensions are large and less likely to stay entirely in cache. Tiling is currently not performed by the compiler. You must do it manually. For an example of stripmining, see Example 4-6.

4.3.2. Padding Common Blocks and Arrays to Reduce Cache Conflict

Reducing cache conflict is another method of increasing cache reuse. Reducing data cache conflict between arrays accessed in the same loop can reduce the run time of loops by up to 66%. Reducing secondary cache conflict can reduce the run time of loops by up to 85%.

Due to the addition of a 96-Kbyte secondary cache on CRAY T3E systems, data cache misses are not as costly as they were for CRAY T3D systems. When a value needed by the microprocessor is not in data cache, it is often in secondary cache. Accessing either data cache or secondary cache is faster than accessing local memory. Maximizing the number of data references to cache can effectively decrease the execution time of your program, and adding pad arrays is one way to do that.

You can either pad arrays yourself, or you can specify the command-line option -a pad and have the compiler do the padding. Because the compiler does not do extensive analysis of the data or any analysis of data reference patterns in your code, you may be able to get better results adding your own pad arrays. This section concentrates on how to add your own padding arrays. Information on instructing the compiler to add them automatically is included in Section 4.3.3.

Cache conflict in common blocks is a function of the declaration of the common block, the code that references it, and the size of the cache. Such conflict often happens when a common block's arrays have sizes that are powers of two, as in the following code fragment:

      COMMON /AAA/ A(1024), B(1024), C(1024)

      DO I=1,1024
        A(I) = B(I) + C(I)
      ENDDO

Data cache is 8 Kbytes, or 1,024 words, in size. Each of the arrays in this example is the same size as data cache. Every word in the B array is mapped to a line in data cache, element B(1) to the first cache line and B(1024) to the last cache line. Then the next data item in memory, in this case C(1), is mapped to the first line of data cache. (See the following figure.)

Figure 4-4. Arrays B and C in local memory

Because of the array sizes, all references to B(I) and C(I) for the same value of I will map to the same line in data cache. Data cache is direct mapped, meaning it can contain only one of the two, either B(I) or C(I). Usually B(1-4) will be loaded and, after B(1) is moved to a register, overwritten by C(1-4). (See Figure 4-5.) When B(2) is needed, it will have to be loaded into data cache a second time. Subsequent references to B and C may mean reloading the same cache line in the same manner for each array element. Such data cache thrashing (glossary, ) can be avoided by adding pad arrays of at least 4 words between the arrays in conflict.

Figure 4-5. Cache conflict between arrays B and C

Adding a CACHE_ALIGN directive as well guarantees that the common block starts at the beginning of an 8-word line, implying that 4 words of padding are sufficient to avoid conflict in this case.

      COMMON /AAA/ A(1024), B(1024), P(4), C(1024)
!DIR$ CACHE_ALIGN /AAA/

Now the layout in memory looks different. The P array, placed as it is between the B and the C arrays, causes the mappings of the two arrays to change. (See Figure 4-6.)

Figure 4-6. Arrays B and C in local memory after padding

The first elements of C now map to the second line of data cache, and the conflicts have disappeared. C(1-4) will not be replaced by B(5-8) until they are no longer needed. (See the following figure.)

Figure 4-7. Data cache after padding

The preceding example contains no conflict in secondary cache either. Although secondary cache is larger than data cache, keep in mind the need to maximize its performance, especially when you are dealing with large arrays. The following example is a similar code fragment but with larger arrays. This time the size of the arrays match the size of secondary cache. Any multiple of that size causes conflicts.

      COMMON /AAA/ A(4096), B(4096), C(4096)
!DIR$ CACHE_ALIGN /AAA/

      DO I=1,4096
        A(I) = B(I) + C(I)
      ENDDO

Output values and input values are both written to secondary cache, meaning you must be concerned about conflict with A as well as B and C. (For an example, see Procedure 1.1, Moving data from memory.) The values A(I), B(I), and C(I) will all map to the same line in secondary cache for the same values of I.

Although secondary cache is three-way, set-associative (three 8-word values can be held in each line), its random replacement strategy creates the potential for cache thrashing. Because its lines are 8 words, a pad array of 8 words (as opposed to 4 words for data cache) is necessary to eliminate conflicts in secondary cache. The P1 and P2 arrays are pad arrays in the following example:

      COMMON /AAA/ A(4096), P1(8), B(4096), P2(8), C(4096)

Padding individual dimensions within the same array may also reduce cache conflict, as in the following code fragment:

      COMMON /AAA/ A(1024,64), B(1024,64)
!DIR$ CACHE_ALIGN /AAA/

      DO J=1,64
        DO I=1,1024
          A(I,J) = B(I,J) - B(I,J+1)
        ENDDO
      ENDDO

Not only do the A(I,J) and B(I,J) references conflict in both data cache and secondary cache, but the two B references conflict with each other in data cache. Extending the first dimensions of both A and B by 8 words avoids any conflict in either cache. You need not use the extra words defined by the pad array. The loop bounds will remain the same using the following arrays:

      COMMON /AAA/ A(1032,64), B(1032,64)

This technique will work for all arrays, not just for those in common blocks.

Note: Use caution when applying these padding techniques yourself. Because they change the sequence association and storage association of the data that is padded, they can change the behavior of a program.

4.3.3. Automatic Padding

The compiler adds padding automatically after every array in a common block and after local static data when you specify the -a pad option on the f90 command line. The compiler computes a padding value for each array. It pads static arrays and common blocks if the arrays do not appear in an EQUIVALENCE statement.

You can specify your own pad size by adding a value to the -a pad option. The compiler adds padding after every array if you execute one of the following command line options:

% f90 -a pad ...
% f90 -a pad 64 ...

Using the first option, the compiler adds padding that depends on the size of the array, as follows:

Automatic padding also adds words so that the next array starts on a secondary cache line (8-word) boundary. It assumes that the preceding array also starts on an 8-word boundary. You can ensure that it does by using either the CACHE_ALIGN directive or the following compiler/loader option:

-W1"-D allocate(alignsz)=64b"

For the pad 64 argument, the compiler adds a pad of 64 words after all arrays. Using a large pad value on small arrays is not recommended.

Note: Automatic padding by the compiler could potentially slow some programs down or cause results to change, since the option breaks standard Fortran sequence association and storage association specifications for padded data. For restrictions that apply to automatic padding, see the CF90 Commands and Directives Reference Manual, publication SR-3901.

In the application that defined the following common block, every array is aligned with two other arrays in cache:


Example 4-2. Automatic padding

      COMMON A(1024), B(1024), C(1024), D(1024), E(1024), F(1024),
      % G(1024), H(1024), R(1024), S(1024), T(1024), U(1024)


The compiler automatically pads the arrays as follows if the -a pad option is specified:

      COMMON A(1024), PAD1(72),  B(1024), PAD2(72),  C(1024), PAD3(72),
     &       D(1024), PAD4(72),  E(1024), PAD5(72),  F(1024), PAD6(72),
     &       G(1024), PAD7(72),  H(1024), PAD8(72),  R(1024), PAD9(72),
     &       S(1024), PAD10(72), T(1024), PAD11(72), U(1024), PAD12(72)

Now none of the arrays start at the same location in cache. The pad at the end of the common block is added to potentially improve effects across common blocks in the same way that it improves things within common blocks.

A common block of smaller arrays such as the following can be saved from data cache conflicts:


Example 4-3. Automatic padding for smaller arrays

      COMMON A(256), B(256), C(256), D(256), E(256), F(256),
     & G(256), H(256), R(256), S(256), T(256), U(256)


The compiler automatically pads these arrays as follows:

       COMMON A(256), PAD1(8),  B(256), PAD2(8),  C(256), PAD3(8),
      &       D(256), PAD4(8),  E(256), PAD5(8),  F(256), PAD6(8),
      &       G(256), PAD7(8),  H(256), PAD8(8),  R(256), PAD9(8),
      &       S(256), PAD10(8), T(256), PAD11(8), U(256), PAD12(8)