Using Cluster Compatibility Mode in CLE [6]

6.1 Cluster Compatibility Mode

A Cray system is not a cluster but a massively parallel processing (MPP) computer. An MPP is one computer with many networked processors used for distributed computation and, in the case of Cray system architectures, a high-speed communications interface that facilitates optimal bandwidth and memory operations between those processors. When operating as an MPP machine, the Cray compute node kernel (Cray CNL) typically does not provide the full set of Linux services that cluster-based ISV applications use.

Cluster Compatibility Mode (CCM) is a software solution that provides the services needed to run most cluster-based independent software vendor (ISV) applications out-of-the-box with some configuration adjustments. It is built on top of the compute node root runtime environment (CNRTE), the infrastructure that provides dynamic library support in Cray systems.

6.1.1 CCM Implementation

CCM is tightly coupled to the workload management system. It enables users to execute cluster applications together with workload-managed jobs that are running in a traditional MPP batch or interactive queue (see Figure 7). Essentially, CCM uses the batch system to logically designate part of the Cray system as an emulated cluster for the duration of the job.

Figure 7. Cray Job Distribution Cross Section

Cray Job Distribution Cross Section

Users provision the emulated cluster by launching a batch or interactive job in LSF, PBS, or Moab and TORQUE using a CCM-specific queue. The user-specified nodes in the batch reservation are no longer available for MPP jobs for the duration of the CCM job. These nodes are listed in $HOME/.crayccm/ccm_nodelist.$PBS_JOBID or $HOME/.crayccm/ccm_nodelist.$LSF_JOBID, where the file name suffix is the unique job identifier created by the batch reservation. The user launches the application using ccmrun. When the job terminates, the applications clean up and the nodes are returned to the free pool of compute nodes (see Figure 8).
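
For example, a user holding an interactive PBS reservation can list the nodes assigned to the CCM job (the contents of the file vary by system and reservation):

% cat $HOME/.crayccm/ccm_nodelist.$PBS_JOBID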

Figure 8. CCM Job Flow Diagram

CCM Job Flow Diagram

6.2 Installation and Configuration of Applications for CCM

Users are encouraged to install programs using their local scratch directory and set paths accordingly to use CCM. However, if an ISV application requires root privileges, the site administrator must install the application on the boot node's shared root in xtopview. Compute nodes will then be able to mount the shared root using the compute node root runtime environment and use services necessary for the ISV application.
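
As an illustrative sketch only (the package name is hypothetical; follow the ISV's own installation instructions), a root-privileged installation on the shared root might look like the following:

boot:~ # xtopview
default/:/# rpm -ivh /software/isv_app-1.0-1.x86_64.rpm
default/:/# exit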

6.3 Using CCM

6.3.1 CCM Commands

After loading the ccm module, the user can issue the following two commands: ccmrun and ccmlogin.
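
Both commands become available once the module is loaded, for example:

% module load ccm
% ccmrun --help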

6.3.1.1 ccmrun

ccmrun, as the name implies, starts the cluster application. The head node is the first node in the emulated cluster where ccmrun sets up the CCM infrastructure and propagates the rest of the application. The following shows the syntax for ccmrun:

ccmrun [--help] [--ssh] [--nossh] [--rsh] [--norsh] [--nscd] [--nonscd]
[--rsip] [--norsip] executable [executable arg1] ... [executable argn]

where:

--ssh  

Launches a CCM job with the SSH daemon listening on port 203. This is the default behavior in the absence of custom configuration or environment settings.

--nossh  

Launches a CCM job without an SSH daemon.

--rsh  

Launches a CCM job with portmap and xinetd daemons on all compute nodes within the CCM reservation. This is the default behavior.

--norsh  

Launches a CCM job without the portmap and xinetd daemons. This option may improve performance if rsh is not required by the job.

--nscd  

Launches a CCM job with the name service caching daemon on the CCM compute nodes. This is the default behavior.

--nonscd  

Launches a CCM job without the name service caching daemon.

--rsip  

Turns on CCM RSIP (Realm Specific Internet Protocol) Small Port allocation behavior. When you select this option, RSIP allocates bind (INADDR_ANY) requests from non-RSIP port ranges. This prevents a CCM application from consuming ports from the limited RSIP pool. This is the default behavior.

--norsip  

Disables RSIP for the CCM application. When this option is specified, bind (INADDR_ANY) requests are no longer allocated from non-RSIP port ranges. This option is not generally recommended in most configurations: since the number of RSIP ports per host is extremely limited, specifying it could cause an application to run out of ports. However, this option may be helpful if an application fails in the default environment.

--help  

Displays the ccmrun usage statement.
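
Options may be combined on a single command line. For example, the following sketch (reusing the isv_app example from later in this chapter) launches a CCM job without the rsh-related and name service caching daemons:

% ccmrun --norsh --nonscd isv_app job=e5 cpus=32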

6.3.1.2 ccmlogin

ccmlogin supports a -n hostname option. If the -n option is specified, services are not initiated at startup time, and the user does not need to be in a batch session. ccmlogin also supports the -V option, which propagates the environment to compute nodes in the same manner as ssh -V.
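
For example, to propagate the environment to the compute nodes, or to log in to a specific node without initiating CCM services at startup (the node name shown is hypothetical):

% ccmlogin -V
% ccmlogin -n nid00042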

6.3.2 Starting a CCM Batch Job

You can use either PBS, Moab and TORQUE, or Platform LSF (Load Sharing Facility) to reserve the nodes for the cluster by using the qsub or bsub commands; then launch the application using ccmrun. All standard workload management reservation options are supported with ccmrun. An example using the application isv_app appears below:

Example 5. Launching a CCM application using PBS or Moab and TORQUE

% qsub -I -l mppwidth=32 -q ccm_queue
qsub: waiting for job 434781.sdb to start
qsub: job 434781.sdb ready
Initializing CCM Environment, please wait

After the user prompt re-appears, run the application using ccmrun:

% ccmrun isv_app job=e5 cpus=32

An equivalent batch script for this example would look like:

#mtscript
#PBS -l mppwidth=32
#PBS -q ccm_queue
#PBS -j oe
#PBS -S /bin/bash
cd $PBS_O_WORKDIR
export PATH=${PATH}:/mnt/lustre_server/ccmuser/isv_app/Commands
ln -s ../e5.inp e5.inp 
export TMPDIR=${PBS_O_WORKDIR}/temp 
mkdir $TMPDIR 
module load ccm
ccmrun isv_app job=e5 cpus=32 interactive

To submit the job enter the following at the command prompt:

% qsub mtscript

Example 6. Launching a CCM application using Platform LSF

For LSF, bsub requests use node counts rather than core counts while ccmrun still takes the number of cores as its argument:

% cnselect -L numcores
24
% module load xt-lsfhpc
% module load ccm
% bsub -n 2 -ext "CRAYXT[]" -q ccm_queue -o out_file ccmrun isv_app job=e5 cpus=32

The equivalent batch script for this example would look like:

#!/bin/bash
#lsfscript
#BSUB -n 2
#BSUB -q ccm_queue
#BSUB -o out_file
. /opt/modules/default/init/bash
cd $LS_SUBCWD
export PATH=${PATH}:/mnt/lustre_server/ccmuser/isv_app/Commands
ln -s ../e5.inp e5.inp 
export TMPDIR=${LS_SUBCWD}/temp
mkdir $TMPDIR
module load ccm
ccmrun isv_app job=e5 cpus=32 interactive

To submit the job to LSF, redirect lsfscript into the bsub command:

% bsub < lsfscript

6.3.3 X11 Forwarding in CCM

Applications that require X11 forwarding (or tunneling) can use the qsub -V option to pass the DISPLAY variable to the emulated cluster. Users can then forward X traffic by using ccmlogin, as in the following example:

ssh -Y login
qsub -V -q ccm_queue -lmppwidth=1
ccmlogin -V

6.3.4 ISV Application Acceleration (IAA)

IAA is a feature that potentially improves application performance by enabling the MPI implementation to directly use the high speed interconnect rather than requiring an additional TCP/IP layer. To MPI, the Aries or Gemini network looks as if it is an Infiniband network that supports the standard OFED (OpenFabrics Enterprise Distribution) API. By default, loading the ccm module automatically loads the cray-isvaccel module, which sets the general environment options for IAA. However, there are some settings that are specific to implementations of MPI. The method of passing these settings to CCM is highly application-specific. The following serves as a general guide to configuring your application's MPI and setting up the necessary CCM environment for application acceleration with Infiniband over the high speed network. Platform MPI and Open MPI are presently supported.
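
To confirm that the IAA environment is in place before launching an application, verify that the cray-isvaccel module was loaded along with ccm, for example:

% module load ccm
% module list 2>&1 | grep isvaccel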

6.3.4.1 Configuring Platform MPI (HP-MPI) and Launching mpirun

Cray recommends you pass the -IBV option to mpirun to ensure that Platform MPI takes advantage of application acceleration. Without this option, any unexpected problem in application acceleration will cause Platform MPI to fall back to using TCP/IP, resulting in poor performance without explanation.
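
For example, from within a CCM session (the process count and executable name are illustrative):

% mpirun -np 32 -IBV ./isv_app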

6.3.4.2 Caveats and Limitations for IAA

You may encounter the following known caveats or limitations when using IAA:

  • Only Platform MPI and Open MPI are presently supported.

  • IAA supports up to 2048 processing elements per application.

  • IAA does not yet support 32-bit applications.

  • IAA does not support application code that uses alternately named MPI entry points, such as PMPI_Init().

  • Use batch reservation resources efficiently, as IAA allocates resources based on the reservation made for CCM. It is possible that an unnecessarily large job reservation will result in memory registration errors and application failures.

6.3.4.3 Troubleshooting IAA

  • "Error detected by IBGNI. Subsequent operation may be unreliable."

    This message indicates that IAA has reported an error to the MPI implementation. Under most conditions, the MPI will properly handle the error and continue. If the job completes successfully, Cray recommends that you disregard the warning messages of this nature. However, if the job aborts, this message can provide important clues about what went wrong.

  • "libibgni: Could not open /tmp/ccm_alps_info (No such file or directory)."

    This means that CCM is improperly configured. Contact your system administrator if you receive this message.

  • "lsmod test for MPI_ICMOD_IBV__IBV_MAIN could not find module in list ib_core."

    This error indicates that Platform MPI is not correctly configured to use Infiniband.

  • "libibverbs: Fatal: Couldn't read uverbs ABI version."

    It is likely that the incorrect version of libibverbs is being linked with the application, which indicates a CLE installation issue. Contact your system administrator when you see this error.

  • "FLEXlm error: -15,570. System Error: 19 "Cannot assign requested address."

    This error can occur on systems that use Platform MPI and rely on RSIP for external connectivity. If MPI applications are run in quick succession, the ports available through RSIP can become exhausted. The solution is to leave more time between MPI runs.

  • "libhugetlbfs [nid000545:5652]: WARNING: Layout problem with segments 0 and 1. Segments would overlap."

    This is a warning from the huge pages library and will not interrupt execution of the application.

  • "Fatal error detected by IBGNI: Network error is unrecoverable: SOURCE_SSID_SRSP:MDD_INV"

    This is a secondary error caused by one or more PEs aborting with subsequent network messages arriving for them. Check earlier in your program output for the primary issue.

  • "mpid: IBV requested on node localhost, but not available."

    This happens when Platform MPI jobs are rerun too close together, for example in close succession after a ccmlogin; mpirun fails before the IBGNI packet wait timer completes. The solution is to allow enough time between executions of mpirun and ccmlogin:

    user@nid00002:~/osu_benchmarks_for_platform> mpirun -np 2 -IBV ./osu_bw 
    mpid: IBV requested on node localhost, but not available.
  • "PAM configuration can cause IAA to fail"

    The problem results in permission-denied errors when IAA tries to access the HSN from compute nodes other than the CCM head node. This happens because the application process is running in a different job container than the one that has permission to access the HSN.

    The second job container is created by PAM, specifically the following line in /etc/pam.d/common-session:

    session	optional	/opt/cray/job/default/lib64/security/pam_job.so
  • "bind: Invalid argument"

    Applications using older versions of MVAPICH may abort with this message due to a bug in the MPI implementation. This bug is present in at least MVAPICH version 1.2a1 and is fixed in MVAPICH2-1.8a2.

6.4 Independent Software Vendor (ISV) Example

Example 7. Launching the UMT/pyMPI benchmark using CCM

The UMT/pyMPI benchmark tests MPI and OpenMP parallel scaling efficiency, thread compiling, single CPU performance, and Python functionality.

The following example runs the UMT/pyMPI benchmark using CCM and assumes that you have installed the benchmark in your user scratch directory. The runSuOlson.py Python script runs the benchmark. The -V option passes environment variables to the cluster job:

module load ccm
qsub -V -q ccm_queue -I -lmppwidth=2 -l mppnodes=471
cd top_of_directory_where_extracted
a=`pwd`
export LD_LIBRARY_PATH=${a}/Teton:${a}/cmg2Kull/sources:${a}/CMG_CLEAN/src:${LD_LIBRARY_PATH}
ccmrun -n2 ${a}/Install/pyMPI-2.4b4/pyMPI python/runSuOlson.py

The following runs the UMT test contained in the package:

module load ccm
qsub -V -q ccm_queue -I -lmppwidth=2 -l mppnodes=471
qsub: waiting for job 394846.sdb to start
qsub: job 394846.sdb ready

Initializing CCM environment, Please Wait
waiting for jid....
waiting for jid....
CCM Start success, 1 of 1 responses
machine=> cd UMT_TEST
machine=> a=`pwd`
machine=> ccmrun -n2 ${a}/Install/pyMPI-2.4b4/pyMPI python/runSuOlson.py
writing grid file:  grid_2_13x13x13.cmg
Constructing mesh.
Mesh construction complete, next building region, opacity, material, etc.
mesh and data setup complete, building Teton object.
Setup complete, beginning time steps.
CYCLE 1 timerad = 3e-06
TempIters = 3 FluxIters = 3 GTAIters = 0
TrMax =     0.0031622776601684 in Zone 47 on Node 1
TeMax =     0.0031622776601684 in Zone 1239 on Node 1
Recommended time step for next rad cycle = 6e-05
 
 
********** Run Time Statistics **********
                  Cycle Advance             Accumulated 
                     Time (sec)         Angle Loop Time (sec)
RADTR              = 47.432             39.991999864578
 
CYCLE 2 timerad = 6.3e-05

...

The benchmark continues for several iterations before completing.

6.5 Troubleshooting

6.5.1 CCM Initialization Fails

Immediately after the user enters the qsub command line, output appears as in the following example:

Initializing CCM environment, Please Wait
 Cluster Compatibility Mode Start failed, 1 of 4 responses

This error usually results when /etc files (nsswitch.conf, resolv.conf, passwd, shadow, etc.) are not specialized to the cnos class view. If you encounter this error, the system administrator must migrate these files from the login class view to the cnos class view. For more information, see Managing System Software for Cray Cascade Systems.
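
As a sketch only, an administrator might specialize one such file to the cnos class view from within xtopview (the exact set of files to migrate is site-specific; see Managing System Software for Cray Cascade Systems for the complete procedure):

boot:~ # xtopview -c cnos -x /etc/opt/cray/sdb/node_classes
class/cnos:/ # xtspec /etc/nsswitch.conf
class/cnos:/ # exit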

6.5.2 pam_job.so Is Incompatible with CCM

The pam_job.so module is incompatible with CCM. This can cause symptoms such as failed job cleanup and slow login. PAM jobs should be enabled only for login class views, not for the cnos class view.

Procedure 1. Disabling CSA Accounting for the cnos class view

  1. Enter xtopview in the default view and edit /etc/opt/cray/ccm/ccm_mounts.local in the following manner:

    boot:~ # xtopview 
    default/:/# vi /etc/opt/cray/ccm/ccm_mounts.local 
    /etc/pam.d/common-session-pc.ccm /etc/pam.d/common-session bind 0 
    default/:/# exit
  2. Enter xtopview in the cnos view:

     boot:~ # xtopview -c cnos -x /etc/opt/cray/sdb/node_classes
  3. Edit /etc/pam.d/common-auth-pc:

    class/cnos:/ # vi /etc/pam.d/common-auth-pc

    and remove or comment the following line:

     # session optional        /opt/cray/job/default/lib64/security/pam_job.so
  4. Edit /etc/pam.d/common-session to include:

    session optional pam_mkhomedir.so skel=/software/skel 
    session required pam_limits.so 
    session required pam_unix2.so 
    session optional pam_ldap.so 
    session optional pam_umask.so 
    session optional /opt/cray/job/default/lib64/security/pam_job.so 
  5. Edit /etc/pam.d/common-session-pc.ccm to remove or comment all of the following:

    session optional pam_mkhomedir.so skel=/software/skel 
    session required pam_limits.so 
    session required pam_unix2.so 
    session optional pam_ldap.so 

6.5.3 PMGR_COLLECTIVE ERROR

When you see the error, "PMGR_COLLECTIVE ERROR: uninitialized MPI task: Missing required environment variable: MPIRUN_RANK," you are likely trying to run an application compiled with a mismatched version of MPI.
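
One way to check which MPI implementation an application is actually linked against is to inspect its dynamic library dependencies (the executable name is illustrative):

% ldd ./isv_app | grep -i mpi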

6.5.4 Job Hangs When sa Parameter Is Passed to Platform MPI

The sa parameter is provided by Platform MPI to enable MPI messages to continue flowing even when an application is consuming CPU time for long periods. Platform MPI enables a timer that generates signals at regular intervals. The signals interrupt the application and allow Platform MPI to use some necessary CPU cycles.

HP-MPI and Platform MPI 7.x have a bug that may cause intermittent hangs when this option is enabled. This issue does not exist with Platform MPI 8.0.

6.5.5 "MPI_Init: dlopen" Error(s)

The error message, "MPI_Init: dlopen /opt/platform_mpi/lib/linux_amd64/plugins/default.so: undefined symbol," is likely caused by a library search path that includes an MPI implementation which is different from the implementation being used by the application.
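
To see whether the library search path picks up a different MPI implementation than the one the application was built with, inspect LD_LIBRARY_PATH, for example:

% echo $LD_LIBRARY_PATH | tr ':' '\n' | grep -i mpi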

6.5.6 Bus Errors In an Application, MPI, or libibgni

Sometimes bus errors are due to bugs in the application software. However, the Linux kernel will also generate a bus error if it encounters various errors while handling a page fault. The most likely of those errors is running out of RAM or being unable to allocate a huge page due to memory fragmentation.

6.5.7 glibc.so Errors at Start of Application Launch

This error may occur nearly immediately after submission. In certain applications, like FLUENT, glibc errors and a stack trace appear in stderr. This problem typically involves the license server. Be sure to include a line return at the end of your ~/.flexlmrc file.

6.5.8 "orted: command not found"

This message can appear when using an Open MPI build that is not in the default PATH. To avoid the problem, use the --prefix argument to mpirun to specify the location of the Open MPI installation.
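
For example (the installation path, process count, and executable name are illustrative):

% mpirun --prefix /mnt/lustre_server/ccmuser/openmpi -np 16 ./isv_app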

6.6 Caveats and Limitations for CCM

6.6.1 ALPS Does Not Accurately Reflect CCM Job Resources

Because CCM is transparent to the user application, ALPS utilities such as apstat do not accurately reflect resources used by a CCM job.

6.6.2 Open MPI and Moab and TORQUE Integration Not Supported

Open MPI provides native Moab and TORQUE integration. However, CCM does not support this mode or applications that use a shrink-wrapped MPI with this mode. Checking ompi_info reveals whether Open MPI was built with this integration; the output will look like the following:

% ompi_info | grep tm             
MCA memory: ptmalloc2 (MCA v2.0, API v2.0, Component v1.3.3)
MCA ras: tm (MCA v2.0, API v2.0, Component v1.3.3)
MCA plm: tm (MCA v2.0, API v2.0, Component v1.3.3) 

You can rebuild Open MPI to disable Moab and TORQUE integration using the following options to the configure script:

./configure --enable-mca-no-build=plm-tm,ras-tm --disable-mpi-f77 \
   --disable-mpi-f90 \
   --prefix=path_to_install

This should result in no TM components being displayed by ompi_info:

% ompi_info | grep tm
MCA memory: ptmalloc2 (MCA v2.0, API v2.0, Component v1.3.3)

6.6.3 Miscellaneous Limitations

The following limitations apply to supporting cluster queues with CLE 4.1 on Cray systems:

  • Applications must fit in the physical node memory because swap space is not supported in CCM.

  • Core specialization is not supported with CCM.

  • CCM does not support applications built with the Cray Compiling Environment (CCE) using Fortran 2008 coarrays or Unified Parallel C (UPC) compiling options, nor any Cray-built libraries that use these implementations. Applications built using the Cray SHMEM or Cray MPI libraries are also not compatible with CCM.