This chapter describes how to run applications interactively on compute nodes and get application status reports. For a description of batch job processing, see Chapter 4, Using Workload Management Systems.
The aprun utility launches applications on compute nodes. The utility submits applications to the Application Level Placement Scheduler (ALPS) for placement and execution, forwards your login node environment to the assigned compute nodes, forwards signals, and manages the stdin, stdout, and stderr streams of the application.
Use the aprun command to specify the resources your application requires, request application placement, and initiate application launch.
The format of the aprun command is:
aprun [-a arch] [-b] [-B] [-cc cpu_list | keyword] [-cp cpu_placement_file_name]
      [-d depth] [-D value] [-F access_mode] [-j cpus_per_cu] [-L node_list]
      [-m size[h|hs]] [-n pes] [-N pes_per_node] [-p protection_domain_identifier]
      [-q] [-r cores] [-S pes_per_numa_node] [-sl list_of_numa_nodes]
      [-sn numa_nodes_per_node] [-ss] [-t sec] [-T]
      executable [arguments_for_executable]
-b  Bypasses the transfer of the executable program to the compute nodes. By default, the executable is transferred to the compute nodes as part of the aprun launch process.
-B  Reuses the width, depth, nppn, and memory request options that are specified with the batch reservation. This option obviates the need to specify the aprun -n, -d, -N, and -m options.
-cc cpu_list | keyword  Binds processing elements (PEs) to CPUs. CNL does not migrate processes that are bound to a CPU. This option applies to all multicore compute nodes. The cpu_list is not used for placement decisions; it is used only by CNL during application execution. For further information about binding (CPU affinity), see Using aprun CPU Affinity Options.
The cpu_list is a comma-separated or hyphen-separated list of logical CPU numbers and/or ranges. As PEs are created, they are bound to the CPU in cpu_list corresponding to the number of PEs that have been created at that point. For example, the first PE created is bound to the first CPU in cpu_list, the second PE created is bound to the second CPU, and so on. If more PEs are created than there are entries in cpu_list, binding starts over at the beginning of cpu_list. The cpu_list can also contain an x, which indicates that the application-created process at that position in the fork sequence should not be bound to a CPU.
If multiple PEs are created on a compute node, the user may optionally specify a cpu_list for each PE. Multiple cpu_lists are separated by colons (:). This provides the user with the ability to control the placement for PEs that may conflict with other PEs that are simultaneously creating child processes and threads of their own.
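For illustration, a launch of that form might look like the following sketch, where ./a.out stands in for your executable:

% aprun -n 2 -cc 0,1,2:4,5,6 ./a.out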
The example above contains two cpu_lists. The first (0,1,2) is applied to the first PE created and any threads or child processes that result. The second (4,5,6) is applied to the second PE created and any threads or child processes that result.
Out-of-range cpu_list values are ignored unless all CPU values are out of range, in which case an error message is issued. For example, if you want to bind PEs starting with the highest CPU on a compute node and work down from there, you might use a descending range such as -cc 23-0. If the PEs were placed on Cray XE6 24-core compute nodes, the specified cpu_list values would all be valid; on nodes with fewer cores, the higher CPU numbers would be out of range and ignored.
The following keyword values can be used: cpu (the default; binds each PE to a CPU), numa_node (constrains each PE to the CPUs within its assigned NUMA node), and none (disables CPU binding).
-cp cpu_placement_file_name  Provides the name of a CPU binding placement file. This option applies to all multicore compute nodes. The file must be located on a file system that is accessible to the compute nodes. The CPU placement file provides more extensive CPU binding instructions than the -cc option.
-p protection_domain_identifier  Requests use of a protection domain, using the identifier of a protection domain pre-allocated by the user. You cannot use this option with protection domains already allocated by system services. Any cooperating set of applications must specify this same aprun -p option to share the protection domain.
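As a hedged sketch, assuming a user pre-allocated protection domain named mydomain and placeholder executables ./app1 and ./app2, two cooperating applications would both launch with the same identifier:

% aprun -p mydomain -n 32 ./app1
% aprun -p mydomain -n 16 ./app2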
-D value  Sets the aprun debug level bitmask. Because this option is a bitmask, value can be set to request any or all levels of debug messages. Valid values are 0 through 7; for example, -D 3 requests level 1 and level 2 messages.
-d depth  Specifies the number of CPUs for each PE and its threads. ALPS allocates the number of CPUs equal to depth times pes. The -cc cpu_list option can restrict the placement of threads, resulting in more than one thread per CPU. The default depth is 1.
For OpenMP applications, use both the OMP_NUM_THREADS environment variable to specify the number of threads and the aprun -d option to specify the number of CPUs hosting the threads.
Note: For a PathScale OpenMP program, set the PSC_OMP_AFFINITY environment variable to FALSE.
For Cray systems, compute nodes must have at least depth CPUs. For Cray XE5 systems, depth cannot exceed 12. For Cray XK6 compute blades, depth cannot exceed 16. For Cray XE6 systems, depth cannot exceed 32.
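For example, the following sketch (./omp_app is a placeholder executable) reserves four CPUs for each of eight PEs, matching the OpenMP thread count to the depth:

% export OMP_NUM_THREADS=4
% aprun -n 8 -d 4 ./omp_app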
-j cpus_per_cu  Specifies how many CPUs to use per compute unit for an ALPS job. For an explanation of compute unit affinity in ALPS, see Using Compute Unit Affinity on Cray Systems.
-L node_list  Specifies the candidate nodes to constrain application placement. The syntax allows a comma-separated list of nodes (such as -L 32,33,40), a range of nodes (such as -L 41-87), or a combination of both formats.
This option is used for applications launched interactively; use the qsub -lmppnodes=\"node_list\" option for batch and interactive batch jobs.
If the placement node list contains fewer nodes than the number required, a fatal error is produced. If resources are not currently available, aprun continues to retry.
A common source of node lists is the cnselect command. See the cnselect(1) man page for details.
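For instance, a hedged sketch using both list and range syntax (./app1 is a placeholder executable):

% aprun -n 16 -L 32,33,40-43 ./app1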
-m size[h|hs]  Specifies the per-PE required Resident Set Size (RSS) memory size in megabytes. K, M, and G suffixes (case insensitive) are supported (16M = 16m = 16 megabytes, for example). If you do not include the -m option, the default amount of memory available to each PE equals the compute node memory size divided by the number of CPUs, calculated for each compute node.
If you want huge pages (2 MB) allocated for an application, use the h or hs suffix. The h suffix requests huge pages: each node uses as much huge page memory as it can allocate and 4 KB base pages thereafter. The hs suffix requires huge pages: if the request cannot be satisfied, aprun issues an error message and the application launch is terminated.
Cray XE and Cray XK: The default huge page size is 2 MB. Additional sizes are available: 128 KB, 512 KB, 8 MB, 16 MB, and 64 MB.
To select one of these alternative huge page sizes, set the HUGETLB_DEFAULT_PAGE_SIZE environment variable.
Note: To use huge pages, you must first link the application with the huge pages library, -lhugetlbfs, and then set the HUGETLB_MORECORE environment variable to yes before launching.
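The following sketch shows the steps just described, using placeholder file names; the -m suffix follows the h/hs convention described above:

% cc -o my_app my_app.c -lhugetlbfs
% export HUGETLB_MORECORE=yes
% aprun -n 16 -m700h ./my_app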
-n pes  Specifies the number of processing elements (PEs) that your application requires. A PE is an instance of an ALPS-launched executable. You can express the number of PEs in decimal, octal, or hexadecimal form. If pes has a leading 0, it is interpreted as octal; if it has a leading 0x, it is interpreted as hexadecimal. The default is 1.
-N pes_per_node  Specifies the number of PEs to place per node. For Cray systems, the default is the number of available NUMA nodes times the number of cores per NUMA node.
The maximum pes_per_node is the number of cores on the node.
-q  Specifies quiet mode and suppresses all aprun-generated non-fatal messages. Do not use this option with the -D (debug) option; aprun terminates the application if both options are specified.
-r cores  Enables core specialization on Cray compute nodes, where the number of cores specified is the number of system-services cores per node for the application. When core specialization is enabled, the reserved cores handle system services and the application's PEs run on the remaining cores. For more information, see Core Specialization.
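For example, on 8-core nodes, a sketch that reserves one core per node for system services and runs seven PEs on the remainder (./app1 is a placeholder):

% aprun -n 7 -r 1 ./app1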
-S pes_per_numa_node  Specifies the number of PEs to allocate per NUMA node. You can use this option to reduce the number of PEs per NUMA node, thereby making more resources available per PE.
For 8-core compute nodes, the default is 4 PEs per NUMA node.
-sl list_of_numa_nodes  Specifies the NUMA node or nodes (comma separated or hyphen separated) to use for application placement. A space is required between -sl and list_of_numa_nodes. List NUMA nodes in ascending order; -sl 1,0 and -sl 1-0 are invalid.
-sn numa_nodes_per_node  Specifies the number of NUMA nodes per node to be allocated. Insert a space between -sn and numa_nodes_per_node. A zero value is not allowed and is a fatal error.
-ss  Specifies strict memory containment per NUMA node. When -ss is specified, a PE can allocate only the memory that is local to its assigned NUMA node.
The default is to allow remote-NUMA-node memory allocation to all assigned NUMA nodes. You can use this option to find out if restricting each PE's memory access to local-NUMA-node memory affects performance.
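A sketch combining these NUMA options (./app1 is a placeholder): place two PEs per NUMA node and restrict each PE to local-NUMA-node memory:

% aprun -n 8 -S 2 -ss ./app1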
-T  Synchronizes the application's stdout and stderr to prevent interleaving of its output.
-t sec  Specifies the per-PE CPU time limit in seconds. The sec time limit is constrained by your CPU time limit on the login node. For example, if your time limit on the login node is 3600 seconds but you specify a -t value of 5000, the application is constrained to 3600 seconds per PE.
Note: For OpenMP or multithreaded applications where processes may have child tasks, the time used in the child tasks accumulates against the parent process. Thus, it may be necessary to multiply the sec value by the depth value in order to get a real-time value approximately equivalent to the same value for the PE of a non-threaded application.
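For example, following the reasoning in the note above, a four-thread-per-PE application intended to get roughly 600 CPU-seconds per thread might be launched with the limit scaled by the depth (./omp_app is a placeholder):

% aprun -n 8 -d 4 -t 2400 ./omp_app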
: (colon)  Separates the names of executables and their associated options for Multiple Program, Multiple Data (MPMD) mode. A space is required before and after the colon.
The following environment variables modify the behavior of aprun:
APRUN_DEFAULT_MEMORY  Specifies the default per-PE memory size. An explicit aprun -m value overrides this setting.
PGAS_ERROR_FILE  Redirects error messages issued by the PGAS library (libpgas); for example, setting it to stdout sends PGAS error messages to standard output.
CRAY_CUDA_PROXY  Overrides the site default for execution in simultaneous contexts on GPU-equipped nodes (Hyper-Q). Setting it to 1 enables simultaneous contexts; setting it to 0 disables them.
APRUN_PRINT_APID  When this variable is set and output is not suppressed with the -q option, aprun displays the application ID (APID) at launch.
ALPS passes a value to the following application environment variable: ALPS_APP_DEPTH, which contains the depth (number of CPUs per PE) assigned to the application.
The aprun placement options are -n, -N, -d, and -m. ALPS attempts to use the smallest number of nodes to fulfill the placement requirements specified by the -n, -N, -d, and -m values. For example, the command:
aprun -n 32 ./a.out
places 32 PEs on:
Cray XE5 dual-socket, quad-core processors on 4 nodes
Cray XE5 dual-socket, six-core processors on 3 nodes
Cray XE6 dual-socket, eight-core processors on 2 nodes
Cray XE6 dual-socket, 12-core processors on 2 nodes
Cray XE6 dual-socket, 16-core processors on 1 node
Note: Cray XK6 nodes are populated with single-socket host processors. There is still a one-to-one relationship between PEs and host processor cores.
The above aprun command would place 32 PEs on:
Cray XK6 single-socket, eight-core processors on 4 nodes.
Cray XK6 single-socket, 12-core processors on 3 nodes.
Cray XK6 single-socket, 16-core processors on 2 nodes.
The memory and CPU affinity options are optimization options, not placement options. You use memory affinity options if you think that remote-NUMA-node memory references are reducing performance. You use CPU affinity options if you think that process migration is reducing performance.
Note: For examples showing how to use memory affinity options, see Using aprun Memory Affinity Options. For examples showing how to use CPU affinity options, see Using aprun CPU Affinity Options.
ALPS uses interconnect software to make reservations available to workload managers through the BASIL API. The following interconnect features are used through ALPS to allocate system resources and ensure application resiliency using protection and communication domains:
Node Translation Table (NTT) — assists in addressing remote nodes within the application and enables software to address other NICs within the resource space of the application. NTTs have a value assigned to them called the granularity value. There are 8192 entries per NTT, which represents a granularity value of 1. For applications that use more than 8192 compute nodes, the granularity value will be greater than 1.
Protection Tag (pTag) — an 8-bit identifier that provides for memory protection and validation of incoming remote memory references. ALPS assigns a pTag-cookie pair to an application. This prevents application interference when sharing NTT entries. This is the default behavior of a private protection domain model. A flexible protection domain model allows users to share memory resources amongst their applications. For more information, see Using the aprun Command.
Cookies — an application-specific identifier that helps sort network traffic meant for different layers in the software stack.
Programmable Network Performance Counters — memory-mapped registers in the interconnect ASIC that ALPS manages for use with CrayPat (the Cray performance analysis tool). Applications can share a single interconnect ASIC, but only one application can have reserved access to the performance counters, so compute nodes are assigned in pairs to avoid conflicts.
These parameters interact to schedule applications for placement.
In previous versions of the Cray Linux Environment, applications were placed within the requested compute node resources by numerical node ID (NID) in serial order, as shown in Figure 1. Each color represents a different application, red being the largest. The larger blue spheres indicate the direction of origin in these cabinet views or torus cross-sections. The serial sequence is not necessarily ideal for placing large applications within the actual torus topology of the Cray system. Cabinets and chassis are usually physically interleaved to reduce the maximum cable lengths, and NIDs are numbered in physical order, tracking these cabinet placements. While this aids in locating the physical position of a NID in cabinet space, it does not provide an easy way to track nodes or their interconnections within two- or three-dimensional topology space, which can inhibit optimal performance for larger jobs.
A different view: Figure 3 shows the original ordering for the three applications with respect to the topology in a "flattened" cross-section.
To reduce this type of performance hit for large node-count jobs, ALPS introduced an "XYZ" placement method. This method reorders the sequence of NID numbers used in assigning placements so that placements conform to the mesh or torus topology; an example is shown in Figure 4. In the XYZ placement method, jobs are first packed from the origin (0, 0, 0) across the x-dimension, then the y-dimension, and finally the z-dimension, assuming the dimensions ascend in size, which may not always be the case. A modification of this approach is known as max-major ordering: performance is improved for large applications, exploiting the torus bisection bandwidth, by packing the smallest dimension first, the next-smallest dimension second, and the largest dimension last. For example, in a 10x4x8 topology, the ordered node coordinates look like the following: (0, 0, 0) ... (0, 3, 0) ... (0, 3, 7), (1, 0, 0). The smallest dimension varies most quickly.
For Cray systems, y-major ordering can be used to exploit the increase in bandwidth over SeaStar due to the doubling of channels in the x- and z-directions. This benefit results from the inclusion of two interconnect chips per package in the Gemini (Figure 5 and Figure 6). In this ordering, the y-dimension is varied last because it has the least bisectional bandwidth of the three axes in the torus.
For applications that are considered small node-count jobs, the max-major and y-major placement methods may not be optimal. In fact, for these jobs the original serial NID ordering has shown better bisectional bandwidth when the jobs are confined to a chassis within the Cray system. This effect is compounded by the fact that more applications fit into the small node-count category as core density grows with successive processor generations. CLE introduced the hybridized xyz-by2 NID ordering method to leverage both the communications improvement of the XYZ placement methods and the benefit of the original simple NID ordering for small node-count jobs. Cray recommends that sites use this NID ordering for best performance.
The following is the section in /etc/sysconfig/alps that describes the NID-ordering selections available to system administrators. You can view this file on a login node to see which ordering the system is using:
<snip>
...
# The nid ordering option for apbridge can be defined.
# The choices are: (just leave unset) or
#   -On for numerical ordering, i.e. no special order
#   -Ox for max-major dimension ordering
#   -Oy for y-major dimension ordering (for gemini systems of 3+ cabinets)
#   -Or for reverse of max (i.e. min) ordering
#   -Of for field ordering (uses od_allocator_id column)
#   -O2 for 2x2x2 ordering
ALPS_NIDORDER="-Ox"
<snip>
If ALPS_NIDORDER is not specified, -On is the default.
-On is the old default option, which uses serial ordering based solely on ascending NID value.
-Ox is max-major NID ordering.
-Oy is y-major dimension ordering, which will order along the y-axis last to exploit the bandwidth in Gemini networks.
-Or is reverse max-major NID ordering. Cray provides this ordering for experimental purposes only; there is no evidence that it improves performance, and Cray does not recommend it for normal use.
-Of gives the system administrator the option to customize NID ordering based on site preferences.
-O2 assigns node order based on the xyz-by2 NID reordering method, which merges the incidental small-node packing of the simple NID-number method with the inter-application interaction reduction of the "xyz" method.
Note: Cray recommends this option for Cray XE and Cray XK systems. Use of this option results in better application performance for larger applications running on Cray XE and Cray XK systems.
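To check which ordering a particular system uses, you can inspect the file from a login node; the value shown here is only an example:

% grep ALPS_NIDORDER /etc/sysconfig/alps
ALPS_NIDORDER="-O2"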
Before running applications, you should check the status of the compute nodes.
There are two ways to do this: the apstat and xtnodestat commands.
The apstat command provides status information about reservations, compute resources, pending and placed applications, and cores. The format of the apstat command is:
apstat [-a] [-c] [-A apid ... | -R resid ...] [-f column_list] [-G] [-n | -no | -ng] [-P] [-p] [-r] [-s] [-v] [-X] [-z]
You can use apstat to display the following types of status information:
applications by application IDs (APIDs)
applications by reservation IDs (ResIDs)
protection domain information (e.g., pTags, cookies)
hardware information such as number of cores, accelerators, and memory
confirmed and claimed reservations
% apstat -a
Total placed applications: 3
Placed  Apid ResID User   PEs Nodes   Age State Command
       48062     6 bill     1     1 4h02m run   lsms
       48108  1588 jim      4     1 0h15m run   gtp
       48109  1589 sue      4     2 0h07m run   bench6
The -v option adds the following output to the above:
% apstat -av
...snip...
Application detail
Ap: apid 48062, pagg 0x5201, resId 6, user bill, gid 12790, account 0, time 0, normal
    Batch System ID = 171737
    Created at Tue Aug 23 08:17:07 2011
    Originator: aprun on NID 26, pid 21089
    Number of commands 1, control network fanout 32
    Network: pTag 154, cookie 0x878e0000, NTTgran/entries 1/1, hugePageSz 2M
    Cmd: lsms -n 1, 1024MB, XT, nodes 1
    Placement list entries: 1
Most of these values are discussed in greater detail in System Interconnect Features Impacting Application Placement; the following are brief descriptions of the new apstat display values:
pTag — 8-bit protection tag identifier assigned to application
cookie — 32-bit identifier used to negotiate traffic between software applications
NTTgran/entries — Network Translation Table (NTT) granularity value and number of NTT entries. The NTT contains NIC addresses of compute nodes accessible by this application; ALPS assigns a granularity value of either 1, 2, 4, 8, 16, or 32. The combination of a pTag and the NTT creates a unique application identifier and prevents interference between applications.
hugePageSz — Indicates hugepage size value for the application.
The -p option displays pending applications. When applications are pending, the following values can appear in the display:
PerfCtrs — Indicates that a node considered for placement was not available because it shared a network chip with a node using network performance counters
pTags — Indicates the application was not able to allocate a free pTag
An APID is also displayed in the aprun results after application execution. For example:
% aprun -n 2 -d 2 ./omp1
Hello from rank 0 (thread 0) on nid00540
Hello from rank 1 (thread 0) on nid00541
Hello from rank 0 (thread 1) on nid00540
Hello from rank 1 (thread 1) on nid00541
Application 48109 resources: utime ~0s, stime ~0s
The apstat -n command displays the status of the nodes that are UP, including core status. Nodes are listed in sequential order:
% apstat -n
  NID Arch State HW Rv Pl PgSz     Avl   Conf Placed PEs Apids
   48   XT UP  I  4  1  1   4K 2048000 512000 512000   1 28489
   49   XT UP  I  4  1  1   4K 2048000 512000 512000   1 28490
   50   XT UP  I  4  -  -   4K 2048000      0      0   0
   51   XT UP  I  4  -  -   4K 2048000      0      0   0
   52   XT UP  I  4  1  1   4K 2048000 512000 512000   1 28489
   53   XT UP  I  4  -  -   4K 2048000      0      0   0
   54   XT UP  I  4  -  -   4K 2048000      0      0   0
   55   XT UP  I  4  -  -   4K 2048000      0      0   0
   56   XT UP  I  8  1  1   4K 4096000 512000 512000   1 28490
   58   XT UP  I  8  -  -   4K 4096000      0      0   0
   59   XT UP  I  8  -  -   4K 4096000      0      0   0
Compute node summary
   arch config     up    use   held  avail   down
     XT     20     11      4      0      7      9
The apstat -no command displays the same information as apstat -n, but the nodes are listed in the order that ALPS used to place an application. Site administrators can specify non-sequential node ordering to reduce system interconnect transfer times.
% apstat -no
  NID Arch State HW Rv Pl PgSz     Avl    Conf  Placed PEs  Apids
   14   XT UP  B 24 24  -   4K 8192000 8189952       0   0
   15   XT UP  B 24  1  -   4K 8192000  341248       0   0
   16   XT UP  B 24 24 24   4K 8192000 8189952 8189952  24 290266
   17   XT UP  B 24 24 24   4K 8192000 8189952 8189952  24 290266
   18   XT UP  B 24 24 24   4K 8192000 8189952 8189952  24 290266
   19   XT UP  B 24 24 24   4K 8192000 8189952 8189952  24 290266
   20   XT UP  B 24 24 24   4K 8192000 8189952 8189952  24 290266
   21   XT UP  B 24 24 24   4K 8192000 8189952 8189952  24 290266
   32   XT UP  B 24 24 24   4K 8192000 8189952 8189952  24 290266
   33   XT UP  B 24 24 24   4K 8192000 8189952 8189952  24 290266
   34   XT UP  B 24 24 24   4K 8192000 8189952 8189952  24 290266
   35   XT UP  B 24 24 24   4K 8192000 8189952 8189952  24 290266
   36   XT UP  B 24 24 24   4K 8192000 8189952 8189952  24 290266
...snip...
Compute node summary
   arch config     up    use   held  avail   down
     XT   1124   1123    379    137    607      1
HW is the number of cores in the node, Rv is the number of cores held in a reservation, and Pl is the number of cores being used by an application. To display a 0 instead of a - in the Pl fields, add the -z option to the apstat command.
The following apstat -n command displays a job using core specialization, marked by the + sign:
% apstat -n
  NID Arch State HW Rv Pl PgSz     Avl    Conf  Placed PEs   Apids
  ...
   84   XT UP  B  8  8 7+   4K 4096000 4096000 4096000   8 1577851
   85   XT UP  B  8  2 1+   4K 4096000 4096000 4096000   8 1577851
   86   XT UP  B  8  8  8   4K 4096000 4096000 4096000   8 1577854
For apid 1577851, a total of 10 PEs are placed. On nid00084, eight cores are reserved, but the 7+ indicates that seven PEs were placed and one core was used for system services. A similar situation appears on nid00085: three cores are reserved, two application PEs are placed on two cores, and one core is used for system services. For more information, see Core Specialization.
The -G option gives general information about all nodes that have an accelerator:
% apstat -G
GPU Accelerators
 NID Module State Memory(MB)      Family ResId
   6      0    UP       6144 Tesla_X2090   928
   7      0    UP       6144 Tesla_X2090   928
  10      0    UP       6144 Tesla_X2090   928
  11      0    UP       6144 Tesla_X2090   928
Module is the accelerator module number on the node; 0 is the only valid value.
Memory is the amount of accelerator memory on the node.
Family is the name of the particular accelerator product line; in this case it is NVIDIA Tesla.
Using the custom column output option (-f), you can specify which apstat columns to display. For example, to see the NID, Placed, and Apids columns, supply the column names in a quote-enclosed, comma-separated list:
apstat -no -f "NID,placed,apids" NID Placed Apids 28 0 29 0 2 0 3 0 764 8388608 6817081 765 8388608 6817081 738 8388608 6817081 739 8388608 6817081 736 8388608 6817081 737 8388608 6817081 766 8388608 6817081 767 8388608 6817081 576 8388608 6817081 577 8388608 6817081 606 8388608 6817081 607 8388608 6817081 544 8388608 6817081
Here is a different format string, this one displaying compute units:
apstat -no -f "NID,apids,CU" NID Apids CU 28 24 29 24 2 24 3 24 764 16 765 16 738 16
The xtnodestat command is another way to display the current job and node status. Each character in the display represents a single node. For systems running a large number of jobs, multiple characters may be used to designate a job.
% xtnodestat
Current Allocation Status at Tue Aug 23 13:30:16 2011

     C0-0     C1-0     C2-0     C3-0
  n3 -------- ------X- -------- ------A-
  n2 -------- -------- --a----- --------
  n1 -------- -------A -----X-- --------
c2n0 -------- -------- -------- --------
  n3 X------- -------- -------- --------
  n2 -------- -------- -------- --------
  n1 -------- -------- -------- --------
c1n0 -------- ----X--- -------- --------
  n3 S-S-S-S- -e------ --X----X bb-b----
  n2 S-S-S-S- cd------ -------- bb-b----
  n1 S-S-S-SX -g------ -------- bb------
c0n0 S-S-S-S- -f------ -------- bb------
    s01234567 01234567 01234567 01234567

Legend:
   nonexistent node                   S  service node
;  free interactive compute node      -  free batch compute node
A  allocated interactive or ccm node  ?  suspect compute node
W  waiting or non-running job         X  down compute node
Y  down or admindown service node     Z  admindown compute node

Available compute nodes:   0 interactive,  343 batch

Job ID  User    Size   Age   State  command line
--- ------ -------- ----- --------- --------
a   762544  user1      1  0h00m  run  test_zgetrf
b   760520  user2     10  1h28m  run  gs_count_gpu
c   761842  user3      1  0h40m  run  userTest
d   761792  user3      1  0h45m  run  userTest
e   761807  user3      1  0h43m  run  userTest
f   755149  user4      1  5h13m  run  lsms
g   761770  user3      1  0h47m  run  userTest
The xtnodestat command displays the allocation grid, a legend, and a job listing. The column and row headings of the grid show the physical location of jobs:
C represents a cabinet,
c represents a chassis,
s represents a slot, and
n represents a node.
Note: If xtnodestat indicates that no compute nodes have been allocated for interactive processing, you can still run your job interactively by using the qsub -I command. Then launch your application with the aprun command.
Use the xtprocadmin -A command to display node attributes, showing both the logical node IDs (NID heading) and the physical node names (NODENAME heading):
% xtprocadmin -A
  NID (HEX) NODENAME   TYPE    ARCH OS        CPUS CU AVAILMEM PAGESZ CLOCKMHZ GPU SOCKETS DIES C/CU
    1   0x1 c0-0c0s0n1 service xt   (service)   12  6    32768   4096     2500   0       1    1    2
    2   0x2 c0-0c0s0n2 service xt   (service)   12  6    32768   4096     2500   0       1    1    2
    5   0x5 c0-0c0s1n1 service xt   (service)   16  8    32768   4096     2600   0       1    1    2
    6   0x6 c0-0c0s1n2 service xt   (service)   12  6    32768   4096     2500   0       1    1    2
   12   0xc c0-0c0s3n0 compute xt   CNL         32 16    32768   4096     2700   0       2    2    2
   13   0xd c0-0c0s3n1 compute xt   CNL         32 16    32768   4096     2700   0       2    2    2
   14   0xe c0-0c0s3n2 compute xt   CNL         32 16    32768   4096     2700   0       2    2    2
   15   0xf c0-0c0s3n3 compute xt   CNL         32 16    32768   4096     2700   0       2    2    2
   20  0x14 c0-0c0s5n0 compute xt   CNL         32 16    32768   4096     2700   0       2    2    2
   21  0x15 c0-0c0s5n1 compute xt   CNL         32 16    32768   4096     2700   0       2    2    2
   22  0x16 c0-0c0s5n2 compute xt   CNL         32 16    32768   4096     2700   0       2    2    2
   23  0x17 c0-0c0s5n3 compute xt   CNL         32 16    32768   4096     2700   0       2    2    2
   36  0x24 c0-0c0s9n0 compute xt   CNL         32 16    32768   4096     2700   0       2    2    2
   37  0x25 c0-0c0s9n1 compute xt   CNL         32 16    32768   4096     2700   0       2    2    2
   38  0x26 c0-0c0s9n2 compute xt   CNL         32 16    32768   4096     2700   0       2    2    2
   39  0x27 c0-0c0s9n3 compute xt   CNL         32 16    32768   4096     2700   0       2    2    2
For more information, see the xtnodestat(1) and xtprocadmin(8) man pages.
The aprun utility supports manual and automatic node selection. For manual node selection, first use the cnselect command to get a candidate list of compute nodes that meet the criteria you specify. Then, for interactive jobs, use the aprun -L node_list option. For batch and interactive batch jobs, add -lmppnodes=\"node_list\" to the job script or the qsub command line.
The format of the cnselect command is:
cnselect [-l] [-L fieldname] [-c] [-U] [-V] [[-e] expression]
-l lists the names of fields in the compute node attributes database.
Note: The cnselect utility displays node IDs either sorted by ascending NID number or unsorted. For some sites, node IDs are presented to ALPS in non-sequential order for application placement. Site administrators can specify non-sequential node ordering to reduce system interconnect transfer times.
-L fieldname lists the current possible values for a given field.
-U causes the user-supplied expression not to be enclosed in parentheses when it is combined with other built-in conditions. This option may be needed if you add other SQL qualifiers (such as ORDER BY) to the expression.
-V prints the version number and exits.
-c gives a count of the number of nodes rather than a list of the nodes themselves.
expression queries the compute node attributes database.
You can use cnselect to get a list of nodes selected by such characteristics as the number of cores per node (numcores), the amount of memory on the node (in megabytes), and the processor speed (in megahertz). For example, to run an application on Cray XK6 16-core nodes with 32 GB of memory or more, use:
% cnselect numcores.eq.16 .and. availmem.gt.32000
268-269,274-275,80-81,78-79
% aprun -n 32 -L 268-269 ./app1
Note: The cnselect utility returns no nodes if the criteria cannot be met; for example, numcores.eq.16 on a system that has no 16-core compute nodes.
You can also use cnselect to get a list of nodes if a site-defined label exists. For example, to run an application on six-core nodes, you might use:
% cnselect -L label1
HEX-CORE
DODEC-CORE
16-Core
% cnselect -e "label1.eq.'HEX-CORE'"
60-63,76,82
% aprun -n 6 -L 60-63,76,82 ./app1
If you do not include the -L option on the aprun command or the -lmppnodes option on the qsub command, ALPS automatically places the application using available resources.
When running large applications, you should understand how much memory will be available per node. Cray Linux Environment (CLE) uses memory on each node for CNL and other functions such as I/O buffering, core specialization, and compute node resiliency. The remaining memory is available for user executables; user data arrays; stacks, libraries and buffers; and the SHMEM symmetric stack heap.
The amount of memory CNL uses depends on the number of cores, memory size, and whether optional software has been configured on the compute nodes. For a 24-core node with 32 GB of memory, roughly 28.8 to 30 GB of memory is available for applications.
The default stack size is 16 MB. You can determine the maximum stack size by using the limit command (csh) or the ulimit -a command (bash).
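For example, from a bash login shell (a generic sketch, not Cray-specific):

% ulimit -s        # show the current stack limit, in kilobytes
% ulimit -s 32768  # raise it to 32 MB for this shell, up to the hard limit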
Note: The actual amount of memory CNL uses varies depending on the total amount of memory on the node and the OS services configured for the node.
You can use the aprun -m size option to specify the per-PE memory limit. For example, this command launches xthi on cores 0 and 1 of compute nodes 472 and 473. Each node has 8 GB of available memory, allowing 4 GB per PE.
% aprun -n 4 -N 2 -m4000 ./xthi | sort
Application 225108 resources: utime ~0s, stime ~0s
PE 0 nid00472 Core affinity = 0,1
PE 1 nid00472 Core affinity = 0,1
PE 2 nid00473 Core affinity = 0,1
PE 3 nid00473 Core affinity = 0,1
% aprun -n 4 -N 2 -m4001 ./xthi | sort
Claim exceeds reservation's memory
You can change MPI buffer sizes and stack space from the defaults by setting certain environment variables. For more details, see the intro_mpi(3) man page.
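As one hedged example (variable availability depends on your MPT version; consult intro_mpi(3)), the buffer that holds unexpected messages can be enlarged before launching; the size shown is arbitrary and ./mpi_app is a placeholder:

% export MPICH_UNEX_BUFFER_SIZE=120000000
% aprun -n 64 ./mpi_app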
CLE offers core-specialization functionality. Core specialization binds a set of Linux kernel-space processes and daemons to one or more cores within a Cray compute node, enabling the software application to fully utilize the remaining cores within its cpuset. This restricts all possible overhead processing to the specialized cores within the reservation and may improve application performance. To help users calculate the new "scaled-up" width for a batch reservation that uses core specialization, use the apcount tool.
Note: apcount will work only if your system has uniform compute node types.
See the apcount(1) man page for further information.
The aprun utility supports multiple-program, multiple-data (MPMD) launch mode. To run an application in MPMD mode under aprun, use the colon-separated format:
aprun -n pes executable1 : -n pes executable2 : ...
In the first executable segment, you may use other aprun options, such as -ss. If you specify the -m option, it must appear in the first executable segment, and its value is used for all subsequent executables. If you specify -m more than once while launching multiple applications in MPMD mode, aprun returns an error. For MPI applications, all of the executables share the same MPI_COMM_WORLD process communicator. MPMD mode does not work for system commands; at a minimum, applications must be enclosed within the MPI_Init() and MPI_Finalize() environment management routines.
For example, this command launches 128 instances of program1 and 256 instances of program2:
aprun -n 128 ./program1 : -n 256 ./program2
A space is required before and after the colon.
Note: MPMD applications that use the SHMEM parallel programming model, either standalone or nested within an MPI program, are not supported on Gemini based systems.
MPI programs should call the MPI_Finalize() routine at the conclusion of the program. This call waits for all processing elements to complete before exiting. If one of the programs fails to call MPI_Finalize(), the program never completes and aprun stops responding. There are two ways to prevent this behavior:
Use the PBS Professional elapsed (wall clock) time limit to terminate the job after a specified time limit (such as the qsub -l walltime option).
Use the aprun -t sec option to terminate the program. This option specifies the per-PE CPU time limit in seconds. A process terminates only if it reaches the specified amount of CPU time (not wall-clock time).
For example, if you use:
aprun -n 8 -t 120 ./myprog1
and a PE uses more than two minutes of CPU time, the application terminates.
The aprun utility handles standard input (stdin) on behalf of the user and handles standard output (stdout) and standard error messages (stderr) for user applications.
The aprun utility does not forward its user resource limits to each compute node (except for RLIMIT_CORE and RLIMIT_CPU, which are always forwarded).
You can set the APRUN_XFER_LIMITS environment variable to 1 (export APRUN_XFER_LIMITS=1 or setenv APRUN_XFER_LIMITS 1) to enable the forwarding of user resource limits. For more information, see the getrlimit(P) man page.
The aprun utility forwards the following signals to an application: SIGHUP, SIGINT, SIGQUIT, SIGTERM, SIGABRT, SIGUSR1, SIGUSR2, SIGURG, and SIGWINCH.
The aprun utility ignores SIGTTIN signals. All other signals remain at their default behavior and are not forwarded to an application. The default behaviors that terminate aprun also cause ALPS to terminate the application with a SIGKILL signal.
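For example, to deliver one of the forwarded signals to every PE of a running application, signal the aprun process itself (12345 is a placeholder for the aprun process ID):

% kill -USR1 12345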