After you have compiled and debugged your code and analyzed its performance, you can use a number of techniques to optimize performance. For details about compiler optimization and optimization reporting options, see the following manuals:
Cray C and C++ Reference Manual, Cray Fortran Reference Manual
PGI User's Guide
Using the GNU Compiler Collection (GCC)
PathScale Compiler Suite User Guide
Intel C++ Compiler Professional Edition for Linux
Intel Fortran Compiler Professional Edition for Linux
Optimization produces code that is more efficient and runs significantly faster than unoptimized code. Optimization can be performed at the compilation unit level through compiler driver options or to selected portions of code through the use of directives or pragmas. Because optimization may increase compilation time and may make debugging difficult, it is best to use performance analysis data in advance to isolate the portions of code where optimization would provide the greatest benefits.
You also can use aprun affinity options to optimize applications.
In the following example, a Fortran matrix multiply subroutine is optimized. The compiler driver option generates an optimization report.
Source code of matrix_multiply.f90:

    subroutine mxm(x,y,z,m,n)
    real*8 x(m,n), y(m,n), z(n,n)
    do k = 1,n
      do j = 1,n
        do i = 1,m
          x(i,j) = x(i,j) + y(i,k)*z(k,j)
        enddo
      enddo
    enddo
    end
PGI Fortran compiler command:
ftn -c -fast -Minfo matrix_multiply.f90
mxm:
     5, Interchange produces reordered loop nest: 7, 5, 9
     9, Generated 3 alternate loops for the inner loop
        Generated vector sse code for inner loop
        Generated 2 prefetch instructions for this loop
        Generated vector sse code for inner loop
        Generated 2 prefetch instructions for this loop
        Generated vector sse code for inner loop
        Generated 2 prefetch instructions for this loop
        Generated vector sse code for inner loop
        Generated 2 prefetch instructions for this loop
To generate an optimization report (loopmark listing) by using the Cray Fortran compiler, enter:
% module swap PrgEnv-pgi PrgEnv-cray
% ftn -ra -c matrix_multiply.f90
Optimization report (loopmark listing file matrix_multiply.lst):
%%% L o o p m a r k   L e g e n d %%%

     Primary Loop Type          Modifiers
     ------- ---- ----          ---------
                                a - vector atomic memory operation
     A - Pattern matched        b - blocked
     C - Collapsed              f - fused
     D - Deleted                i - interchanged
     E - Cloned                 m - streamed but not partitioned
     I - Inlined                p - conditional, partial and/or computed
     M - Multithreaded          r - unrolled
     P - Parallel/Tasked        s - shortloop
     V - Vectorized             t - array syntax temp used
     W - Unwound                w - unwound

     1.            subroutine mxm(x,y,z,m,n)
     2.            real*8 x(m,n), y(m,n), z(n,n)
     3.
     4.  D------<  do k = 1,n
     5.  D 2----<    do j = 1,n
     6.  D 2 A--<      do i = 1,m
     7.  D 2 A           x(i,j) = x(i,j) + y(i,k)*z(k,j)
     8.  D 2 A-->      enddo
     9.  D 2---->    enddo
    10.  D------>  enddo
    11.
    12.            end

ftn-6002 ftn: SCALAR File = matrix_multiply.f90, Line = 4
  A loop starting at line 4 was eliminated by optimization.
ftn-6002 ftn: SCALAR File = matrix_multiply.f90, Line = 5
  A loop starting at line 5 was eliminated by optimization.
ftn-6202 ftn: VECTOR File = matrix_multiply.f90, Line = 6
  A loop starting at line 6 was replaced by a library call.
Each Cray compute node has local-NUMA-node memory and remote-NUMA-node memory. Remote-NUMA-node memory references, such as a NUMA node 0 PE accessing NUMA node 1 memory, can adversely affect performance. Cray has added aprun memory affinity options to give you run time controls that may optimize memory references.
Applications can use one or all NUMA nodes of a Cray system compute node. If an application is placed using one NUMA node, other NUMA nodes are not used and the application processes are restricted to using local-NUMA-node memory. This memory usage policy is enforced by running the application processes within a cpuset. A
cpuset consists of cores and local memory on a compute node.
When an application is placed using all NUMA nodes, the cpuset includes all node memory and all CPUs. In this case, the application processes allocate local-NUMA-node memory first. If insufficient free local-NUMA-node memory is available, the allocation may be satisfied by using remote-NUMA-node memory. In other words, if there is not enough NUMA node n memory, the allocation may be satisfied by using NUMA node n+1 memory. An exception is the
-ss (strict memory containment) option. For this option, memory accesses are restricted to local-NUMA-node memory even if both NUMA nodes are available to the application.
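The local-first allocation policy and the -ss exception can be sketched as follows. This is an illustrative Python model with assumed page counts, not how CNL actually implements cpuset memory policy:

```python
# Illustrative sketch (not ALPS/CNL source): a local-first NUMA allocation
# policy, with a strict mode that behaves like aprun -ss.
def allocate_pages(n_pages, free, local, strict=False):
    """Place n_pages, preferring the local NUMA node.

    free   -- dict mapping NUMA node id -> free pages (mutated)
    local  -- the caller's local NUMA node
    strict -- if True, never spill to remote NUMA nodes (like -ss)
    """
    order = [local] if strict else [local] + [n for n in sorted(free) if n != local]
    placement, remaining = {}, n_pages
    for node in order:
        take = min(remaining, free[node])
        if take:
            placement[node] = take
            free[node] -= take
            remaining -= take
    if remaining:
        raise MemoryError("not enough local-NUMA-node memory")
    return placement

# Local node 0 has 4 free pages, so 2 pages spill to remote node 1.
print(allocate_pages(6, {0: 4, 1: 8}, local=0))   # {0: 4, 1: 2}
```

With strict=True, the same request fails instead of spilling, which is the behavior -ss enforces.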
The aprun memory affinity options are:
-S pes_per_numa_node
-sn numa_nodes_per_node
-sl list_of_numa_nodes
-ss (strict memory containment)
For details, see Using the aprun Command.
Use these aprun options for each element of an MPMD application and vary them with each MPMD element as required.
Compute nodes are considered for application placement if any of the following conditions is true:
* The -sn value is 2.
* The -sl list has more than one entry.
* The -sl list is the highest-ordered NUMA node.
* The -S value along with the -N value requires two NUMA nodes (such as -N 4 -S 2).
Use cnselect numcores.eq.number_of_cores to get a list of the Cray system compute nodes with that number of cores.
You can use the
aprun -L or
qsub -lmppnodes options to specify those lists or a subset of those lists. For additional information, see the aprun(1), cnselect(1), and qsub(1) man pages.
CNL can dynamically distribute work by allowing PEs and threads to migrate from one CPU to another within a node. In some cases, moving processes from CPU to CPU increases cache misses and translation lookaside buffer (TLB) misses and therefore reduces performance. Also, there may be cases where an application runs faster by avoiding or targeting a particular CPU. The aprun CPU affinity options let you bind a process to a particular CPU or the CPUs on a NUMA node. These options apply to all Cray multicore compute nodes.
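On Linux, the kind of per-process CPU binding that the aprun CPU affinity options request can be illustrated with the kernel's scheduler-affinity interface. This is a minimal sketch of CPU pinning in general; it is not aprun's implementation, which manages placement through cpusets:

```python
import os

def bind_to_cpu(cpu):
    """Pin the calling process to a single CPU and return the resulting mask."""
    os.sched_setaffinity(0, {cpu})      # pid 0 means "the current process"
    return os.sched_getaffinity(0)

original = os.sched_getaffinity(0)      # remember the starting CPU mask
print(bind_to_cpu(0))                   # {0}
os.sched_setaffinity(0, original)       # undo the binding
```

Once a process is pinned this way, the scheduler no longer migrates it, which avoids the cache and TLB misses described above.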
Applications are assigned to a
cpuset and can run only on the CPUs specified by the
cpuset. Also, applications can allocate memory only on the memory defined by the cpuset. A cpuset can be a compute node (default) or a NUMA node.
The CPU affinity options are:
-cc cpu-list | keyword
For details, see Using the aprun Command.
These aprun options can be used for each element of an MPMD application and can vary with each MPMD element.
The -F affinity option for aprun provides a program with exclusive access to all the processing and memory resources on a node.
This option assigns all compute node cores and compute node memory to the application's
cpuset. Used with the
-cc option, it enables an application programmer to bind processes to the CPUs named in the affinity string.
There are two modes: exclusive and share. The share mode restricts the application-specific
cpuset contents to only the application reserved cores and memory on NUMA node boundaries. For example, if an application requests and is assigned cores and memory on NUMA node
0, then only NUMA node
0 cores and memory are contained within the application
cpuset. The application cannot access the cores and memory of the other NUMA nodes on that compute node.
Administrators can modify
/etc/alps.conf to set a policy for access modes. If
nodeShare is not specified in this file, the default mode remains
exclusive; setting it to share makes share the default access mode. Users can override the system-wide policy by specifying aprun
-F exclusive at the command line or within their respective batch scripts. For additional information, see the aprun(1) man page.
Multicore systems can run more tasks simultaneously, which increases overall system performance. The trade-offs are that each core has less local memory (because it is shared by the cores) and less system interconnection bandwidth (which is also shared).
Processes are placed in packed rank-sequential order, starting with the first node. For a 100-core, 50-node job running on dual-core nodes, the layout of ranks on cores is:

Node 1: ranks 0, 1
Node 2: ranks 2, 3
...
Node 50: ranks 98, 99
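Packed rank-sequential placement amounts to a short calculation. The following illustrative Python sketch reproduces the node numbering used in this manual (nodes counted from 1):

```python
def packed_layout(n_ranks, cores_per_node):
    """Map ranks to nodes in packed rank-sequential order, nodes numbered from 1."""
    layout = {}
    for rank in range(n_ranks):
        layout.setdefault(rank // cores_per_node + 1, []).append(rank)
    return layout

layout = packed_layout(100, 2)   # 100 ranks on 50 dual-core nodes
print(layout[1], layout[50])     # [0, 1] [98, 99]
```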
MPI supports multiple interconnect device drivers for a single MPI job. This allows each process (rank) of an MPI job to create the most optimal messaging path to every other process in the job, based on the topology of the given ranks. The SMP device driver is based on shared memory and is used for communication between ranks that share a node. The GNI device driver is used for communication between ranks that span nodes.
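The per-pair device choice can be sketched as follows. This is a simplified model of the selection rule just described, not MPI library code, and the rank-to-node placement shown is an assumption for illustration:

```python
def select_device(node_of, rank_a, rank_b):
    """Pick the SMP (shared memory) device for on-node rank pairs,
    and the GNI (network) device for pairs that span nodes."""
    return "SMP" if node_of[rank_a] == node_of[rank_b] else "GNI"

node_of = {0: 1, 1: 1, 2: 2, 3: 2}    # assumed rank -> node placement
print(select_device(node_of, 0, 1))    # SMP: ranks 0 and 1 share node 1
print(select_device(node_of, 1, 2))    # GNI: ranks 1 and 2 span nodes
```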
To attain the fastest possible run time, try running your program on only one core of each node. (In this case, the other cores are allocated to your job, but are idle.) This allows each process to have full access to the system interconnection network.
For example, you could use the commands:

% cnselect numcores.gt.1
20-175
% aprun -n 64 -N 1 -L 20-175 ./prog1

This runs prog1 on one core of each of 64 multicore nodes.