2. Running Applications

The aprun utility launches applications on compute nodes. The utility submits applications to the Application Level Placement Scheduler (ALPS) for placement and execution, forwards your login node environment to the assigned compute nodes, forwards signals, and manages the stdin, stdout, and stderr streams.

This chapter describes how to run applications interactively on compute nodes and get application status reports. For a description of batch job processing, see Chapter 4, Using Workload Management Systems.

2.1 Using the aprun Command

Use the aprun command to specify the resources your application requires, request application placement, and initiate application launch.

The format of the aprun command is:

aprun [-a arch] [-b] [-B] [-C] [-cc cpu_list | keyword] [-cp cpu_placement_file_name] [-d depth] [-D value] [-L node_list] [-m size[h|hs]] [-j num_cpus] [-n pes] [-N pes_per_node] [-F access_mode] [-p protection_domain_identifier] [-q] [-r cores] [-R pe_dec] [-S pes_per_numa_node] [-sl list_of_numa_nodes] [-sn numa_nodes_per_node] [-ss] [-T] [-t sec] executable [arguments_for_executable]

where:

-b  

Bypasses the transfer of the executable program to the compute nodes. By default, the executable is transferred to the compute nodes as part of the aprun launch process. For an example, see Running Compute Node Commands.

-B  

Reuses the width, depth, nppn, and memory request options that are specified with the batch reservation. This option obviates the need to specify aprun options -n, -d, -N, and -m. aprun will exit with errors if these options are specified with the -B option.

-cc cpu_list|keyword  

Binds processing elements (PEs) to CPUs. CNL does not migrate processes that are bound to a CPU. This option applies to all multicore compute nodes. The cpu_list is not used for placement decisions, but is used only by CNL during application execution. For further information about binding (CPU affinity), see Using aprun CPU Affinity Options.

The cpu_list is a comma-separated or hyphen-separated list of logical CPU numbers and/or ranges. As PEs are created, they are bound to the CPUs in cpu_list in order: the first PE created is bound to the first CPU in cpu_list, the second PE created is bound to the second CPU in cpu_list, and so on. If more PEs are created than there are entries in cpu_list, binding wraps around to the beginning of cpu_list. The cpu_list can also contain an x, which indicates that the application-created process at that location in the fork sequence should not be bound to a CPU.

If multiple PEs are created on a compute node, the user may optionally specify a cpu_list for each PE. Multiple cpu_lists are separated by colons (:). This provides the user with the ability to control the placement for PEs that may conflict with other PEs that are simultaneously creating child processes and threads of their own.

% aprun -n 2 -d 3 -cc 0,1,2:4,5,6 ./a.out

The example above contains two cpu_lists. The first (0,1,2) is applied to the first PE created and any threads or child processes that result. The second (4,5,6) is applied to the second PE created and any threads or child processes that result.

Out-of-range cpu_list values are ignored unless all CPU values are out of range, in which case an error message is issued. For example, if you want to bind PEs starting with the highest CPU on a compute node and work down from there, you might use this -cc option:

% aprun -n 8 -cc 10-4 ./a.out

If the PEs were placed on Cray XE6 24-core compute nodes, the specified -cc range would be valid. However, if the PEs were placed on Cray XK6 eight-core compute nodes, CPUs 10-8 would be out of range and therefore not used.

The following keyword values can be used:

  • The cpu keyword (the default) binds each PE to a CPU within the assigned NUMA node. You do not have to indicate a specific CPU.

    If you specify a depth per PE (aprun -d depth), the PEs are constrained to CPUs with a distance of depth between them, so that each PE's threads can be constrained to the CPUs closest to the PE's CPU.

    The -cc cpu option is the typical use case for an MPI application.

    Note: If you oversubscribe CPUs for an OpenMP application, Cray recommends that you not use the -cc cpu default. Test the -cc none and -cc numa_node options and compare results to determine which option produces the better performance; see the example following this list.
  • The numa_node keyword constrains PEs to the CPUs within the assigned NUMA node. CNL can migrate a PE among the CPUs in the assigned NUMA node but not off the assigned NUMA node.

    If PEs create threads, the threads are constrained to the same NUMA-node CPUs as the PEs. There is one exception. If depth is greater than the number of CPUs per NUMA node, once the number of threads created by the PE has exceeded the number of CPUs per NUMA node, the remaining threads are constrained to CPUs within the next NUMA node on the compute node. For example, on 8-core nodes, if depth is 5, threads 0-3 are constrained to CPUs 0-3 and thread 4 is constrained to CPUs 4-7.

  • The none keyword allows PE migration within the assigned NUMA nodes.
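For example, a hypothetical pair of comparison runs for an OpenMP executable that oversubscribes CPUs (the name ./omp_app and the values shown are illustrative, not taken from this manual):

% setenv OMP_NUM_THREADS 8
% aprun -n 4 -d 4 -cc none ./omp_app
% aprun -n 4 -d 4 -cc numa_node ./omp_app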

-cp cpu_placement_file_name  

Provides the name of a CPU binding placement file. This option applies to all multicore compute nodes. This file must be located on a file system that is accessible to the compute nodes. The CPU placement file provides more extensive CPU binding instructions than the -cc options.

-p protection domain identifier  

Requests use of a protection domain, identified by a user pre-allocated protection domain identifier. You cannot use this option with protection domains already allocated by system services. Any cooperating set of applications must specify this same aprun -p option to have access to the shared protection domain. aprun will return an error if the protection domain identifier is not recognized or if the user is not the owner of the specified protection domain identifier.

-D value  

The -D option value is an integer bitmask setting that controls debug verbosity, where:

  • A value of 1 provides a small level of debug messages

  • A value of 2 provides a medium level of debug messages

  • A value of 4 provides a high level of debug messages

Because this option is a bitmask setting, value can be set to get any or all of the above levels of debug messages. Therefore, valid values are 0 through 7. For example, -D 3 provides all small and medium level debug messages.

-d depth  

Specifies the number of CPUs for each PE and its threads. ALPS allocates the number of CPUs equal to depth times pes. The -cc cpu_list option can restrict the placement of threads, resulting in more than one thread per CPU.

The default depth is 1.

For OpenMP applications, use both the OMP_NUM_THREADS environment variable to specify the number of threads and the aprun -d option to specify the number of CPUs hosting the threads. ALPS creates -n pes instances of the executable, and the executable spawns OMP_NUM_THREADS-1 additional threads per PE. For an OpenMP example, see Running an OpenMP Application.
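For example, a hypothetical launch of an OpenMP executable (the name ./omp_app and the counts are illustrative) that runs 4 PEs with 6 threads per PE:

% setenv OMP_NUM_THREADS 6
% aprun -n 4 -d 6 ./omp_app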

Note: For a PathScale OpenMP program, set the PSC_OMP_AFFINITY environment variable to FALSE.

For Cray systems, compute nodes must have at least depth CPUs. For Cray XE5 systems, depth cannot exceed 12. For Cray XK6 compute blades, depth cannot exceed 16. For Cray XE6 systems, depth cannot exceed 32.

-j num_cpus  

Specifies how many CPUs to use per compute unit for an ALPS job. For an explanation of compute unit affinity in ALPS, see Using Compute Unit Affinity on Cray Systems.

-L node_list  

Specifies the candidate nodes to constrain application placement. The syntax allows a comma-separated list of nodes (such as -L 32,33,40), a range of nodes (such as -L 41-87), or a combination of both formats. Node values can be expressed in decimal, octal (preceded by 0), or hexadecimal (preceded by 0x). The first number in a range must be less than the second number (8-6, for example, is invalid), but the nodes in a list can be in any order.

This option is used for applications launched interactively; use the qsub -lmppnodes=\"node_list\" option for batch and interactive batch jobs.

If the placement node list contains fewer nodes than the number required, a fatal error is produced. If resources are not currently available, aprun continues to retry.

A common source of node lists is the cnselect command. See the cnselect(1) man page for details.
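For example, a hypothetical interactive launch constrained to a combination of a node list and a node range:

% aprun -n 16 -L 32,33,40-45 ./a.out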

-m size[h|hs]  

Specifies the per-PE required Resident Set Size (RSS) memory size in megabytes. K, M, and G suffixes (case insensitive) are supported (16M = 16m = 16 megabytes, for example). If you do not include the -m option, the default amount of memory available to each PE equals the minimum value of (compute node memory size) / (number of CPUs) calculated for each compute node.

If you want huge pages (2 MB) allocated for an application, use the h or hs suffix.

Cray XE and Cray XK: The default huge page size is 2 MB. Additional sizes are available: 128KB, 512KB, 8MB, 16MB, and 64MB.

The use of the -m option is not required on Cray systems because the kernel allows the dynamic creation of huge pages. However, it is advisable to specify this option and preallocate an appropriate number of huge pages, when memory requirements are known, to reduce operating system overhead. See the intro_hugepages(1) man page.

-m sizeh  

Requests memory to be allocated to each PE, where memory is preferentially allocated out of the huge page pool. The application uses as much huge page memory as it can allocate, and 4 KB base pages thereafter.

-m sizehs  

Requests memory to be allocated to each PE, where memory is allocated out of the huge page pool. If the request cannot be satisfied, an error message is issued and the application launch is terminated.

Note: To use huge pages, you must first link the application with hugetlbfs:
% cc -c my_hugepages_app.c 
% cc -o my_hugepages_app my_hugepages_app.o -lhugetlbfs
Set the huge pages environment variable at run-time:
% setenv HUGETLB_MORECORE yes
Or
% export HUGETLB_MORECORE=yes
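The application source itself needs no special huge page calls when it is linked with -lhugetlbfs and HUGETLB_MORECORE is set; ordinary heap allocations are served from the huge page pool. A minimal sketch (the file name matches the compile lines above, but the contents are illustrative):

/* my_hugepages_app.c -- illustrative sketch only; large heap allocations
   are backed by huge pages when the program is linked with -lhugetlbfs
   and HUGETLB_MORECORE=yes is set in the environment. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    size_t bytes = 64UL * 1024 * 1024;   /* 64 MB heap request */
    char *buf = malloc(bytes);

    if (buf == NULL) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }
    memset(buf, 0, bytes);               /* touch the pages so they are faulted in */
    printf("allocated %zu bytes\n", bytes);
    free(buf);
    return 0;
}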

-n pes  

Specifies the number of processing elements (PEs) that your application requires. A PE is an instance of an ALPS-launched executable. You can express the number of PEs in decimal, octal, or hexadecimal form. If pes has a leading 0, it is interpreted as octal (-n 16 specifies 16 PEs, but -n 016 is interpreted as 14 PEs). If pes has a leading 0x, it is interpreted as hexadecimal (-n 16 specifies 16 PEs, but -n 0x16 is interpreted as 22 PEs). The default value is 1.

-N pes_per_node  

Specifies the number of PEs to place per node. For Cray systems, the default is the number of available NUMA nodes times the number of cores per NUMA node.

The maximum pes_per_node is 32 for systems with Cray XE6 compute blades.

-F exclusive|share  

exclusive mode provides a program with exclusive access to all the processing and memory resources on a node. Using this option with the -cc option binds processes to the CPUs listed in the affinity string. share mode restricts the application-specific cpuset contents to only the application-reserved cores and memory on NUMA node boundaries, meaning the application will not have access to cores and memory on other NUMA nodes on that compute node. The exclusive option does not need to be specified because exclusive access mode is enabled by default. However, if nodeShare is set to share in /etc/alps.conf, you must use -F exclusive to override the policy set in this file. You can check the value of nodeShare by executing apstat -svv | grep access.
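For example, a hypothetical launch that overrides a site-wide share policy:

% aprun -n 32 -F exclusive ./a.out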

-q  

Specifies quiet mode and suppresses all aprun-generated non-fatal messages. Do not use this option with the -D (debug) option; aprun terminates the application if both options are specified. Even with the -q option, aprun writes its help message and any ALPS fatal messages when exiting. Normally, this option should not be used.

-r cores  

Enables core specialization on Cray compute nodes, where the number of cores specified is the number of system services cores per node for the application. If the -r value is greater than one, the system services cores are assigned in round-robin fashion to each NUMA node in descending order, unless the -cc cpu_list affinity option is specified. In that case, specialized cores are assigned starting from the highest-numbered core, excluding those specified in cpu_list.
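For example, a hypothetical launch that sets aside one core per node for system services:

% aprun -n 16 -r 1 ./a.out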

-S pes_per_numa_node  

Specifies the number of PEs to allocate per NUMA node. You can use this option to reduce the number of PEs per NUMA node, thereby making more resources available per PE.

For 8-core compute nodes, the default is 4. For 12-core compute nodes, the default is 6. For 16-core compute nodes, the default value is 4. For 24-core compute nodes, the default is 6. For 32-core compute nodes, the default is 8. A zero value is not allowed and causes a fatal error. For further information, see Using aprun Memory Affinity Options.
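For example, a hypothetical launch that limits placement to two PEs per NUMA node:

% aprun -n 8 -S 2 ./a.out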

-sl list_of_numa_nodes  

Specifies the NUMA node or nodes (comma-separated or hyphen-separated) to use for application placement. A space is required between -sl and list_of_numa_nodes. The list_of_numa_nodes value can be, for example, -sl 0,1 on Cray XE5 and Cray XK6 compute nodes, -sl 0,1,2,3 on Cray XE6 compute nodes, or a range such as -sl 0-1 or -sl 0-3. The default is no placement constraints. You can use this option to determine whether restricting your PEs to one NUMA node per node affects performance.

List NUMA nodes in ascending order; -sl 1-0 and -sl 1,0 are invalid.

-sn numa_nodes_per_node  

Specifies the number of NUMA nodes per node to be allocated. Insert a space between -sn and numa_nodes_per_node. The numa_nodes_per_node value can be 1 or 2 on Cray XE5 and Cray XK6 compute nodes, or 1, 2, 3, or 4 on Cray XE6 compute nodes. The default is no placement constraints. You can use this option to find out if restricting your PEs to one NUMA node per node affects performance.

A zero value is not allowed and is a fatal error.

-ss  

Specifies strict memory containment per NUMA node. When -ss is specified, a PE can allocate only the memory that is local to its assigned NUMA node.

The default is to allow remote-NUMA-node memory allocation to all assigned NUMA nodes. You can use this option to find out if restricting each PE's memory access to local-NUMA-node memory affects performance.

-T  

Synchronizes the application's stdout and stderr to prevent interleaving of its output.

-t sec  

Specifies the per-PE CPU time limit in seconds. The sec time limit is constrained by your CPU time limit on the login node. For example, if your time limit on the login node is 3600 seconds but you specify a -t value of 5000, your application is constrained to 3600 seconds per PE. If your time limit on the login node is unlimited, the sec value is used (or, if not specified, the time per-PE is unlimited). You can determine your CPU time limit by using the limit command (csh) or the ulimit -a command (bash).

Note: For OpenMP or multithreaded applications where processes may have child tasks, the time used in the child tasks accumulates against the parent process. Thus, it may be necessary to multiply the sec value by the depth value in order to get a real-time value approximately equivalent to the same value for the PE of a non-threaded application.

: (colon)  

Separates the names of executables and their associated options for Multiple Program, Multiple Data (MPMD) mode. A space is required before and after the colon.

2.1.1 ALPS Application Environment Variables

The following environment variables modify the behavior of aprun:

APRUN_DEFAULT_MEMORY  

Specifies default per PE memory size. An explicit aprun -m value overrides this setting.

APRUN_XFER_LIMITS  

Sets the rlimit() transfer limits for aprun. If this is set to a non-zero string, aprun will transfer the {get,set}rlimit() limits to apinit, which will use those limits on the compute nodes. If it is not set or set to 0, none of the limits will be transferred other than RLIMIT_CORE, RLIMIT_CPU, and possibly RLIMIT_RSS.

APRUN_SYNC_TTY  

Sets synchronous tty for stdout and stderr output. Any non-zero value enables synchronous tty output. An explicit aprun -T value overrides this value.

PGAS_ERROR_FILE  

Redirects error messages issued by the PGAS library (libpgas) to standard output stream when set to stdout. The default is stderr.

CRAY_CUDA_PROXY  

Overrides the site default for execution in simultaneous contexts on GPU-equipped nodes (Hyper Q). Setting CRAY_CUDA_PROXY to 1 or on will explicitly enable the CUDA proxy. To explicitly disable CUDA proxy, set to 0 or off. Debugging is only supported with the CUDA proxy disabled.

APRUN_PRINT_APID  

When this variable is set and output is not suppressed with the -q option, the APID will be displayed upon launch and/or relaunch.

ALPS will pass values to the following application environment variable:

ALPS_APP_DEPTH  

Reflects the aprun -d value as determined by apshepherd. The default is 1. The value can differ between compute nodes or sets of compute nodes when executing an MPMD job; in that case, an instance of apshepherd determines the appropriate value locally for each executable.
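For example, a hypothetical check of the value an application sees, using env as the launched executable (the output shown is illustrative):

% aprun -n 1 -d 4 env | grep ALPS_APP_DEPTH
ALPS_APP_DEPTH=4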

2.2 Understanding Application Placement

The aprun placement options are -n, -N, -d, and -m. ALPS attempts to use the smallest number of nodes to fulfill the placement requirements specified by the -n, -N, -d, -S, -sl, -sn, and/or -m values. For example, the command:

% aprun -n 32 ./a.out

places 32 PEs on:

  • Cray XE5 dual-socket, quad-core processors on 4 nodes

  • Cray XE5 dual-socket, six-core processors on 3 nodes

  • Cray XE6 dual-socket, eight-core processors on 2 nodes

  • Cray XE6 dual-socket, 12-core processors on 2 nodes

  • Cray XE6 dual-socket, 16-core processors on 1 node

Note: Cray XK6 nodes are populated with single-socket host processors. There is still a one-to-one relationship between PEs and host processor cores.

The above aprun command would place 32 PEs on:

  • Cray XK6 single-socket, eight-core processors on 4 nodes.

  • Cray XK6 single-socket, 12-core processors on 3 nodes.

  • Cray XK6 single-socket, 16-core processors on 2 nodes.

The memory and CPU affinity options are optimization options, not placement options. You use memory affinity options if you think that remote-NUMA-node memory references are reducing performance. You use CPU affinity options if you think that process migration is reducing performance.

Note: For examples showing how to use memory affinity options, see Using aprun Memory Affinity Options. For examples showing how to use CPU affinity options, see Using aprun CPU Affinity Options.

2.2.1 System Interconnect Features Impacting Application Placement

ALPS uses interconnect software to make reservations available to workload managers through the BASIL API. The following interconnect features are used through ALPS to allocate system resources and ensure application resiliency using protection and communication domains:

  • Node Translation Table (NTT) — assists in addressing remote nodes within the application and enables software to address other NICs within the resource space of the application. NTTs have a value assigned to them called the granularity value. There are 8192 entries per NTT, which represents a granularity value of 1. For applications that use more than 8192 compute nodes, the granularity value will be greater than 1.

  • Protection Tag (pTag) — an 8-bit identifier that provides for memory protection and validation of incoming remote memory references. ALPS assigns a pTag-cookie pair to an application. This prevents application interference when sharing NTT entries. This is the default behavior of a private protection domain model. A flexible protection domain model allows users to share memory resources amongst their applications. For more information, see Using the aprun Command.

  • Cookies — an application-specific identifier that helps sort network traffic meant for different layers in the software stack.

  • Programmable Network Performance Counters — memory mapped registers in the interconnect ASIC that ALPS manages for use with CrayPat (the Cray performance analysis tool). Applications can share one interconnect ASIC, but only one application can have reserved access to performance counters. Thus compute nodes are assigned in pairs to avoid any conflicts.

These parameters interact to schedule applications for placement.

2.2.2 Application Placement Algorithms on Cray Systems

In previous versions of the Cray Linux Environment, applications were placed within the requested compute node resources by numerical node ID (NID) in serial order, as shown in Figure 1. Each color represents a different application, red being the largest. The larger blue spheres indicate the direction of origin in these cabinet views or torus cross-sections. The serial sequence is not necessarily ideal for placement of large applications within the actual torus topology of the Cray system. Cabinets and chassis are usually physically interleaved to reduce the maximum cable lengths. NIDs are numbered in physical order tracking these cabinet placements. While this aids in locating the physical position of the NID in cabinet space, this does not provide for an easy way to track the nodes or their interconnections within two- or three-dimensional topology space. This will likely inhibit optimal performance for larger jobs.

Figure 1. Cabinet View Showing Three Applications in Original Serial Ordering

Cabinet View Showing Three Applications in Original Serial Ordering

Figure 2 shows the reordered application in the cabinets. The benefit of the new packing will become more obvious in Figure 4.

Figure 2. Cabinet View Showing Three Applications in New Ordering

Cabinet View Showing Three Applications in New Ordering

Figure 3. Topology View of Original Application Ordering

Topology View of Original Application Ordering

A different view: Figure 3 shows the original ordering for the three applications with respect to the topology in a "flattened" cross-section.

Figure 4. Topology View of New Application Ordering

Topology View of New Application Ordering

To reduce this type of performance hit for large node count jobs, ALPS introduced an "XYZ" placement method. This method reorders the sequence of NID numbers used in assigning placements so that they conform to the mesh or torus topology. An example of this is shown in Figure 4. In the XYZ placement method, jobs are first packed from the origin (0, 0, 0) across the x-dimension, then the y-dimension, and finally the z-dimension, assuming these are ascending in size (which may not always be the case). A modification of this is known as max-major ordering: performance is improved for large applications, exploiting the torus bisection bandwidth by packing the minimum dimension first, the next-smallest dimension second, and the largest dimension last. For example, in a 10x4x8 topology, XYZ-ordered node coordinates look like the following: (0, 0, 0)...(0, 3, 0) ... (0, 3, 7), (1, 0, 0). The smallest dimension varies most quickly.
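The ordering itself is easy to state in code. The following sketch (dimension sizes are hard-coded from the 10x4x8 example above; ALPS derives them from the actual machine topology) prints node coordinates in max-major order, with the smallest dimension varying fastest:

/* Illustrative sketch of max-major ("XYZ") ordering for the 10x4x8 example. */
#include <stdio.h>

int main(void)
{
    const int X = 10, Y = 4, Z = 8;   /* torus dimensions from the example */

    /* The largest dimension (x) varies slowest; the smallest (y) varies fastest. */
    for (int x = 0; x < X; x++)
        for (int z = 0; z < Z; z++)
            for (int y = 0; y < Y; y++)
                printf("(%d, %d, %d)\n", x, y, z);

    return 0;
}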

For Cray systems, y-major ordering can be used to exploit the increase in bandwidth over SeaStar due to the doubling of channels in the x- and z-directions. This benefit results from the inclusion of two interconnect chips per package in the Gemini (Figure 5 and Figure 6). In this ordering, the y-dimension is varied last because it has the least bisectional bandwidth of the three axes in the torus.

Figure 5. SeaStar Interconnect Links

SeaStar Interconnect Links

Figure 6. Gemini Interconnect Links

Gemini Interconnect Links

For applications that are considered small node count jobs, the max-major and y-major placement methods may not be optimal. In fact for these types of jobs the original serial NID ordering has shown better bisectional bandwidth if the jobs are confined to a chassis within the Cray system. This effect is compounded by the fact that more applications can fit into the "small node count jobs" category as core density grows with successive processor generations. CLE introduced the hybridized xyz-by2 NID ordering method to leverage both the communications improvement found with XYZ placement methods and the benefit of the original simple NID ordering for small node count jobs. Cray recommends that sites use this NID ordering for best performance.

The following is the section of /etc/sysconfig/alps that describes the NID ordering selections available to system administrators. You can view this file on the login node to see which ordering the system is using:

<snip>
...
# The nid ordering option for apbridge can be defined.
# The choices are: (just leave unset) or
#   -On for numerical ordering, i.e. no special order
#   -Ox for max-major dimension ordering
#   -Oy for y-major dimension ordering (for gemini systems of 3+ cabinets)
#   -Or for reverse of max (i.e. min) ordering
#   -Of for field ordering (uses od_allocator_id column)
#	 -O2 for 2x2x2 ordering
ALPS_NIDORDER="-Ox"
<snip>

If ALPS_NIDORDER is not specified, -On is the default.

  • -On is the old default option that uses serial ordering, based solely on ascending NID value.

  • -Ox is max-major NID ordering.

  • -Oy is y-major dimension ordering, which will order along the y-axis last to exploit the bandwidth in Gemini networks.

  • -Or is reverse max-major NID ordering. Cray provides this NID ordering for experimental purposes only; there is no evidence it provides a performance improvement, and Cray does not recommend this option for normal use.

  • -Of gives the system administrator the option to customize NID ordering based on site preferences.

  • -O2 assigns node order based on the xyz-by2 NID reordering method, which merges the incidental small-node packing of the simple NID numbering method with the inter-application interaction reduction of the "xyz" method.

    Note: Cray recommends this option for Cray XE and Cray XK systems. Use of this option results in better application performance for larger applications running on Cray XE and Cray XK systems.

2.3 Gathering Application Status and Information on the Cray System

Before running applications, you should check the status of the compute nodes.

There are two ways to do this: the apstat command and the xtnodestat command.

The apstat command provides status information about reservations, compute resources, pending and placed applications, and cores. The format of the apstat command is:

apstat [-a] [-c] [-A apid ... | -R resid ...] [-f column_list] [-G] [-n | -no | -ng] [-P] [-p] [-r] [-s] [-v] [-X] [-z]

You can use apstat to display the following types of status information:

  • all applications

  • placed applications

  • applications by application IDs (APIDs)

  • applications by reservation IDs (ResIDs)

  • protection domain information (e.g., pTags, cookies)

  • hardware information such as number of cores, accelerators, and memory

  • pending applications

  • confirmed and claimed reservations

For example:

% apstat -a
Total placed applications: 3
Placed    Apid ResID     User PEs Nodes  Age  State Command
         48062     6     bill   1     1 4h02m run   lsms
         48108  1588      jim   4     1 0h15m run   gtp
         48109  1589      sue   4     2 0h07m run   bench6

Adding the -v option adds the following output to the above:

% apstat -av
...snip...
Application detail
Ap[1]: apid 48062, pagg 0x5201, resId 6, user bill,
       gid 12790, account 0, time 0, normal
  Batch System ID = 171737
  Created at Tue Aug 23 08:17:07 2011
  Originator: aprun on NID 26, pid 21089
  Number of commands 1, control network fanout 32
  Network: pTag 154, cookie 0x878e0000, NTTgran/entries 1/1, hugePageSz 2M
  Cmd[0]: lsms -n 1, 1024MB, XT, nodes 1
  Placement list entries: 1

Most of these values are discussed in greater detail in System Interconnect Features Impacting Application Placement, but the following are brief descriptions of the new apstat display values:

  • pTag — 8-bit protection tag identifier assigned to application

  • cookie — 32-bit identifier used to negotiate traffic between software applications

  • NTTgran/entries — Node Translation Table (NTT) granularity value and number of NTT entries. The NTT contains NIC addresses of compute nodes accessible by this application; ALPS assigns a granularity value of either 1, 2, 4, 8, 16, or 32. The combination of a pTag and the NTT creates a unique application identifier and prevents interference between applications.

  • hugePageSz — Indicates hugepage size value for the application.

Adding the -p option when applications are pending shows why each application is pending; pending reasons include:

  • PerfCtrs — Indicates that a node considered for placement was not available because it shared a network chip with a node using network performance counters

  • pTags — Indicates the application was not able to allocate a free pTag

The APID is also included in the resource summary that aprun displays when an application exits. For example:

% aprun -n 2 -d 2 ./omp1
Hello from rank 0 (thread 0) on nid00540
Hello from rank 1 (thread 0) on nid00541
Hello from rank 0 (thread 1) on nid00540
Hello from rank 1 (thread 1) on nid00541
Application 48109 resources: utime ~0s, stime ~0s%

The apstat -n command displays the status of nodes that are UP, along with core status. Nodes are listed in sequential order:

% apstat -n
   NID Arch State HW Rv Pl  PgSz     Avl    Conf  Placed  PEs Apids
    48   XT UP  I  4  1  1    4K 2048000  512000  512000    1 28489
    49   XT UP  I  4  1  1    4K 2048000  512000  512000    1 28490
    50   XT UP  I  4  -  -    4K 2048000       0       0    0
    51   XT UP  I  4  -  -    4K 2048000       0       0    0
    52   XT UP  I  4  1  1    4K 2048000  512000  512000    1 28489
    53   XT UP  I  4  -  -    4K 2048000       0       0    0
    54   XT UP  I  4  -  -    4K 2048000       0       0    0
    55   XT UP  I  4  -  -    4K 2048000       0       0    0
    56   XT UP  I  8  1  1    4K 4096000  512000  512000    1 28490
    58   XT UP  I  8  -  -    4K 4096000       0       0    0
    59   XT UP  I  8  -  -    4K 4096000       0       0    0
Compute node summary
    arch config     up    use   held  avail   down
      XT     20     11      4      0      7      9

The apstat -no command displays the same information as apstat -n, but the nodes are listed in the order that ALPS used to place an application. Site administrators can specify non-sequential node ordering to reduce system interconnect transfer times.

% apstat -no
 NID Arch State HW Rv Pl  PgSz     Avl    Conf  Placed  PEs Apids
   14   XT UP  B 24 24  -   4K  8192000 8189952       0    0
   15   XT UP  B 24  1  -   4K  8192000  341248       0    0
   16   XT UP  B 24 24 24   4K  8192000 8189952 8189952   24 290266
   17   XT UP  B 24 24 24   4K  8192000 8189952 8189952   24 290266
   18   XT UP  B 24 24 24   4K  8192000 8189952 8189952   24 290266
   19   XT UP  B 24 24 24   4K  8192000 8189952 8189952   24 290266
   20   XT UP  B 24 24 24   4K  8192000 8189952 8189952   24 290266
   21   XT UP  B 24 24 24   4K  8192000 8189952 8189952   24 290266
   32   XT UP  B 24 24 24   4K  8192000 8189952 8189952   24 290266
   33   XT UP  B 24 24 24   4K  8192000 8189952 8189952   24 290266
   34   XT UP  B 24 24 24   4K  8192000 8189952 8189952   24 290266
   35   XT UP  B 24 24 24   4K  8192000 8189952 8189952   24 290266
   36   XT UP  B 24 24 24   4K  8192000 8189952 8189952   24 290266

...snip...

Compute node summary
    arch config     up    use   held  avail   down
      XT   1124   1123    379    137     607      1

where HW is the number of cores in the node, Rv is the number of cores held in a reservation, and Pl is the number of cores being used by an application. If you want to display a 0 instead of a - in the Rv and Pl fields, add the -z option to the apstat command.

The following apstat -n command displays a job using core specialization, indicated by the + sign:

% apstat -n
NID Arch State HW Rv Pl  PgSz     Avl    Conf  Placed  PEs Apids
...
84   XT UP  B  8  8  7+   4K 4096000 4096000 4096000    8 1577851
85   XT UP  B  8  2  1+   4K 4096000 4096000 4096000    8 1577851
86   XT UP  B  8  8  8    4K 4096000 4096000 4096000    8 1577854

For apid 1577851, a total of 8 PEs are placed. On nid00084, eight cores are reserved, but the 7+ indicates that seven PEs were placed and one core was used for system services. A similar situation appears on nid00085: two cores are reserved, one application PE is placed, and one core is used for system services. For more information, see Core Specialization.

apstat -G will give general information about all nodes that have an accelerator:

% apstat -G
GPU Accelerators
 NID Module State Memory(MB)      Family ResId
   6      0    UP       6144 Tesla_X2090 928
   7      0    UP       6144 Tesla_X2090 928
  10      0    UP       6144 Tesla_X2090 928
  11      0    UP       6144 Tesla_X2090 928

Module is the accelerator module number on the node; 0 is the only valid value. Memory is the amount of accelerator memory on the node. Family is the name of the particular accelerator product line; in this case it is NVIDIA Tesla.

Using the new custom column output option (-f), you can specify which apstat columns to display. For example, to see the NID, Placed, and APID columns, put the column names in a quote-enclosed, comma-separated list:

apstat -no -f "NID,placed,apids"
NID  Placed   Apids
 28       0        
 29       0        
  2       0        
  3       0        
764 8388608 6817081
765 8388608 6817081
738 8388608 6817081
739 8388608 6817081
736 8388608 6817081
737 8388608 6817081
766 8388608 6817081
767 8388608 6817081
576 8388608 6817081
577 8388608 6817081
606 8388608 6817081
607 8388608 6817081
544 8388608 6817081

Here is another format string, this one displaying compute units:

apstat -no -f "NID,apids,CU"
NID Apids CU
 28       24
 29       24
  2       24
  3       24
764       16
765       16
738       16

2.3.1 Using the xtnodestat Command

The xtnodestat command is another way to display the current job and node status. Each character in the display represents a single node. For systems running a large number of jobs, multiple characters may be used to designate a job.

% xtnodestat
Current Allocation Status at Tue Aug 23 13:30:16 2011

     C0-0     C1-0     C2-0     C3-0
  n3 -------- ------X- -------- ------A-
  n2 -------- -------- --a----- --------
  n1 -------- -------A -----X-- --------
c2n0 -------- -------- -------- --------
  n3 X------- -------- -------- --------
  n2 -------- -------- -------- --------
  n1 -------- -------- -------- --------
c1n0 -------- ----X--- -------- --------
  n3 S-S-S-S- -e------ --X----X bb-b----
  n2 S-S-S-S- cd------ -------- bb-b----
  n1 S-S-S-SX -g------ -------- bb------
c0n0 S-S-S-S- -f------ -------- bb------
    s01234567 01234567 01234567 01234567

Legend:
   nonexistent node                  S  service node
;  free interactive compute node     -  free batch compute node
A  allocated interactive or ccm node ?  suspect compute node
W  waiting or non-running job        X  down compute node
Y  down or admindown service node    Z  admindown compute node

Available compute nodes:          0 interactive,        343 batch


Job ID     User       Size   Age        State           command line
--- ------ --------   -----  ---------  --------  ----------------------------------
a   762544 user1      1      0h00m      run        test_zgetrf
b   760520 user2      10     1h28m      run        gs_count_gpu
c   761842 user3      1      0h40m      run        userTest
d   761792 user3      1      0h45m      run        userTest
e   761807 user3      1      0h43m      run        userTest
f   755149 user4      1      5h13m      run        lsms
g   761770 user3      1      0h47m      run        userTest

The xtnodestat command displays the allocation grid, a legend, and a job listing. The column and row headings of the grid show the physical location of jobs: C represents a cabinet, c represents a chassis, s represents a slot, and n represents a node.

Note: If xtnodestat indicates that no compute nodes have been allocated for interactive processing, you can still run your job interactively by using the qsub -I command. Then launch your application with the aprun command.

Use the xtprocadmin -A command to display node attributes that show both the logical node IDs (NID heading) and the physical node IDs (NODENAME heading):

% xtprocadmin -A
  NID    (HEX)   NODENAME     TYPE ARCH        OS    CPUS CU  AVAILMEM   PAGESZ CLOCKMHZ GPU SOCKETS DIES C/CU
  1      0x1     c0-0c0s0n1  service  xt (service)   12   6    32768     4096     2500   0      1    1    2
  2      0x2     c0-0c0s0n2  service  xt (service)   12   6    32768     4096     2500   0      1    1    2
  5      0x5     c0-0c0s1n1  service  xt (service)   16   8    32768     4096     2600   0      1    1    2
  6      0x6     c0-0c0s1n2  service  xt (service)   12   6    32768     4096     2500   0      1    1    2
  12     0xc     c0-0c0s3n0  compute  xt       CNL   32  16    32768     4096     2700   0      2    2    2
  13     0xd     c0-0c0s3n1  compute  xt       CNL   32  16    32768     4096     2700   0      2    2    2
  14     0xe     c0-0c0s3n2  compute  xt       CNL   32  16    32768     4096     2700   0      2    2    2
  15     0xf     c0-0c0s3n3  compute  xt       CNL   32  16    32768     4096     2700   0      2    2    2
  20     0x14    c0-0c0s5n0  compute  xt       CNL   32  16    32768     4096     2700   0      2    2    2
  21     0x15    c0-0c0s5n1  compute  xt       CNL   32  16    32768     4096     2700   0      2    2    2
  22     0x16    c0-0c0s5n2  compute  xt       CNL   32  16    32768     4096     2700   0      2    2    2
  23     0x17    c0-0c0s5n3  compute  xt       CNL   32  16    32768     4096     2700   0      2    2    2
  36     0x24    c0-0c0s9n0  compute  xt       CNL   32  16    32768     4096     2700   0      2    2    2
  37     0x25    c0-0c0s9n1  compute  xt       CNL   32  16    32768     4096     2700   0      2    2    2
  38     0x26    c0-0c0s9n2  compute  xt       CNL   32  16    32768     4096     2700   0      2    2    2
  39     0x27    c0-0c0s9n3  compute  xt       CNL   32  16    32768     4096     2700   0      2    2    2

For more information, see the xtnodestat(1) and xtprocadmin(8) man pages.

2.4 Using the cnselect Command

The aprun utility supports manual and automatic node selection. For manual node selection, first use the cnselect command to get a candidate list of compute nodes that meet the criteria you specify. Then, for interactive jobs use the aprun -L node_list option. For batch and interactive batch jobs, add -lmppnodes=\"node_list\" to the job script or the qsub command line.

The format of the cnselect command is:

cnselect [-l] [-L fieldname] [-U] [-D] [-c] [-V] [[-e] expression]

where:

  • -l lists the names of fields in the compute nodes attributes database.

    Note: The cnselect utility displays nodeids, sorted by ascending NID number or unsorted. For some sites, node IDs are presented to ALPS in non-sequential order for application placement. Site administrators can specify non-sequential node ordering to reduce system interconnect transfer times.
  • -L fieldname lists the current possible values for a given field.

  • -U causes the user-supplied expression to be combined with other built-in conditions without being enclosed in parentheses. This option may be needed if you add other SQL qualifiers (such as ORDER BY) to the expression.

  • -V prints the version number and exits.

  • -c gives a count of the number of nodes rather than a list of the nodes themselves.

  • [-e] expression queries the compute node attributes database.

You can use cnselect to get a list of nodes selected by such characteristics as the number of cores per node (numcores), the amount of memory on the node (in megabytes), and the processor speed (in megahertz). For example, to run an application on Cray XK6 16-core nodes with 32 GB of memory or more, use:

% cnselect numcores.eq.16 .and. availmem.gt.32000
268-269,274-275,80-81,78-79
% aprun -n 32 -L 268-269 ./app1
Note: The cnselect utility returns -1 to stdout if the numcores criteria cannot be met; for example numcores.eq.16 on a system that has no 16-core compute nodes.

You can also use cnselect to get a list of nodes if a site-defined label exists. For example, to run an application on six-core nodes, you might use:

% cnselect -L label1
HEX-CORE
DODEC-CORE
16-Core
% cnselect -e "label1.eq.'HEX-CORE'"
60-63,76,82
% aprun -n 6 -L 60-63,76,82 ./app1

If you do not include the -L option on the aprun command or the -lmppnodes option on the qsub command, ALPS automatically places the application using available resources.

2.5 Understanding How Much Memory is Available to Applications

When running large applications, you should understand how much memory will be available per node. Cray Linux Environment (CLE) uses memory on each node for CNL and other functions such as I/O buffering, core specialization, and compute node resiliency. The remaining memory is available for user executables; user data arrays; stacks, libraries and buffers; and the SHMEM symmetric stack heap.

The amount of memory CNL uses depends on the number of cores, memory size, and whether optional software has been configured on the compute nodes. For a 24-core node with 32 GB of memory, roughly 28.8 to 30 GB of memory is available for applications.

The default stack size is 16 MB. You can determine the maximum stack size by using the limit command (csh) or the ulimit -a command (bash).

Note: The actual amount of memory CNL uses varies depending on the total amount of memory on the node and the OS services configured for the node.

You can use the aprun -m size option to specify the per-PE memory limit. For example, this command launches xthi on cores 0 and 1 of compute nodes 472 and 473. Each node has 8 GB of available memory, allowing 4 GB per PE.

% aprun -n 4 -N 2 -m4000 ./xthi | sort
Application 225108 resources: utime ~0s, stime ~0s
PE 0 nid00472 Core affinity = 0,1
PE 1 nid00472 Core affinity = 0,1
PE 2 nid00473 Core affinity = 0,1
PE 3 nid00473 Core affinity = 0,1 
% aprun -n 4 -N 2 -m4001 ./xthi | sort
Claim exceeds reservation's memory

You can change MPI buffer sizes and stack space from the defaults by setting certain environment variables. For more details, see the intro_mpi(3) man page.

2.6 Core Specialization

CLE offers a core-specialization functionality. Core specialization binds a set of Linux kernel-space processes and daemons to one or more cores within a Cray compute node to enable the software application to fully utilize the remaining cores within its cpuset. This restricts all possible overhead processing to the specialized cores within the reservation and may improve application performance. To help users calculate the new "scaled-up" width for a batch reservation that uses core specialization, use the apcount tool.

Note: apcount will work only if your system has uniform compute node types.

See the apcount(1) man page for further information.

2.7 Launching an MPMD Application

The aprun utility supports multiple-program, multiple-data (MPMD) launch mode. To run an application in MPMD mode under aprun, use the colon-separated -n pes executable1 : -n pes executable2 : ... format. In the first executable segment, you may use other aprun options such as -cc, -cp, -d, -L, -n, -N, -S, -sl, -sn, and -ss. If you specify the -m option, it must appear in the first executable segment, and its value is used for all subsequent executables. If you specify -m more than once while launching multiple applications in MPMD mode, aprun returns an error. For MPI applications, all of the executables share the same MPI_COMM_WORLD process communicator. MPMD mode does not work for system commands; at a minimum, applications must be enclosed within the MPI_Init() and MPI_Finalize() environment management routines.

For example, this command launches 128 instances of program1 and 256 instances of program2:

aprun -n 128 ./program1 : -n 256 ./program2

A space is required before and after the colon.
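Each executable in an MPMD job must make the MPI_Init() and MPI_Finalize() calls noted above. A minimal sketch of what either program1 or program2 could look like (illustrative only):

/* Illustrative MPMD executable skeleton; program1 and program2 would each
   follow this pattern. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);               /* required for MPMD launch            */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank); /* rank within the shared              */
    MPI_Comm_size(MPI_COMM_WORLD, &size); /* MPI_COMM_WORLD (384 in the          */
                                          /* 128 + 256 example above)            */
    printf("PE %d of %d\n", rank, size);

    MPI_Finalize();                       /* required before exit                */
    return 0;
}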

Note: MPMD applications that use the SHMEM parallel programming model, either standalone or nested within an MPI program, are not supported on Gemini based systems.

2.8 Managing Compute Node Processors from an MPI Program

MPI programs should call the MPI_Finalize() routine at the conclusion of the program. This call waits for all processing elements to complete before exiting. If one of the programs fails to call MPI_Finalize(), the program never completes and aprun stops responding. There are two ways to prevent this behavior:

  • Use the PBS Professional elapsed (wall clock) time limit to terminate the job after a specified time limit (such as -l walltime=2:00:00).

  • Use the aprun -t sec option to terminate the program. This option specifies the per-PE CPU time limit in seconds. A process will terminate only if it reaches the specified amount of CPU time (not wallclock time).

    For example, if you use:

    % aprun -n 8 -t 120 ./myprog1

    and a PE uses more than two minutes of CPU time, the application terminates.

2.9 About aprun Input and Output Modes

The aprun utility handles standard input (stdin) on behalf of the user and handles standard output (stdout) and standard error messages (stderr) for user applications.

2.10 About aprun Resource Limits

The aprun utility does not forward its user resource limits to each compute node (except for RLIMIT_CORE and RLIMIT_CPU, which are always forwarded).

You can set the APRUN_XFER_LIMITS environment variable to 1 (export APRUN_XFER_LIMITS=1 or setenv APRUN_XFER_LIMITS 1) to enable the forwarding of user resource limits. For more information, see the getrlimit(P) man page.

2.11 About aprun Signal Processing

The aprun utility forwards the following signals to an application:

  • SIGHUP

  • SIGINT

  • SIGQUIT

  • SIGTERM

  • SIGABRT

  • SIGUSR1

  • SIGUSR2

  • SIGURG

  • SIGWINCH

The aprun utility ignores SIGPIPE and SIGTTIN signals. All other signals remain at default and are not forwarded to an application. The default behaviors that terminate aprun also cause ALPS to terminate the application with a SIGKILL signal.

2.12 Reserved File Descriptors

The following file descriptors are used by ALPS and should not be closed by applications: 100, 102, 108, 110.