Example Applications [9]

This chapter presents examples that show how to compile, link, and run applications.

Verify that your work area is in a Lustre-mounted directory. Then use the module list command to verify that the correct modules are loaded. Each of the following examples lists the modules that must be loaded.
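
For example, to list the loaded modules and, if necessary, switch programming environments (a sketch; the environment loaded by default at your site may differ):

% module list
% module swap PrgEnv-cray PrgEnv-gnu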

9.1 Running a Basic Application

This example shows how to compile program simple.c and launch the executable.

One of the following modules is required:

PrgEnv-cray
PrgEnv-pgi
PrgEnv-gnu
PrgEnv-pathscale
PrgEnv-intel

Create a C program, simple.c:

#include "mpi.h"

int main(int argc, char *argv[])
{
  int rank;
  int numprocs;
  MPI_Init(&argc,&argv);
  MPI_Comm_rank(MPI_COMM_WORLD,&rank);
  MPI_Comm_size(MPI_COMM_WORLD,&numprocs);
 
  printf("hello from pe %d of %d\n",rank,numprocs);
  MPI_Finalize();
}

Compile the program:

% cc -o simple simple.c

Run the program:

% aprun -n 6 ./simple
hello from pe 0 of 6
hello from pe 5 of 6
hello from pe 4 of 6
hello from pe 3 of 6
hello from pe 2 of 6
hello from pe 1 of 6
Application 135891 resources: utime ~0s, stime ~0s

9.2 Running an MPI Application

This example shows how to compile, link, and run an MPI program. The MPI program distributes the work of a reduction loop across the PEs, prints the subtotal for each PE, combines the results from the PEs, and prints the total.

One of the following modules is required:

PrgEnv-cray
PrgEnv-pgi
PrgEnv-gnu
PrgEnv-pathscale
PrgEnv-intel

Create a Fortran program, mpi.f90:

program reduce
include "mpif.h"

integer n, nres, ierr

call MPI_INIT (ierr)
call MPI_COMM_RANK (MPI_COMM_WORLD,mype,ierr)
call MPI_COMM_SIZE (MPI_COMM_WORLD,npes,ierr)

nres = 0
n = 0

do i=mype,100,npes
  n = n + i
enddo

print *, 'My PE:', mype, ' My part:',n

call MPI_REDUCE (n,nres,1,MPI_INTEGER,MPI_SUM,0,MPI_COMM_WORLD,ierr)

if (mype == 0) print *,'   PE:',mype,'Total is:',nres

call MPI_FINALIZE (ierr)

end

Compile mpi.f90:

% ftn -o mpi mpi.f90

Run program mpi:

% aprun -n 6 ./mpi | sort
    PE:            0 Total is:         5050
 My PE:            0  My part:          816
 My PE:            1  My part:          833
 My PE:            2  My part:          850
 My PE:            3  My part:          867
 My PE:            4  My part:          884
 My PE:            5  My part:          800
Application 3016865 resources: utime ~0s, stime ~0s

If desired, you could use this C version of the program:

/* program reduce */

#include <stdio.h>
#include "mpi.h"

int main (int argc, char *argv[])
{
  int i, sum, mype, npes, nres, ret;
  ret = MPI_Init (&argc, &argv);
  ret = MPI_Comm_size (MPI_COMM_WORLD, &npes);
  ret = MPI_Comm_rank (MPI_COMM_WORLD, &mype);
  nres = 0;
  sum = 0;
  for (i = mype; i <=100; i += npes) {
    sum = sum + i;
  }

  (void) printf ("My PE:%d  My part:%d\n",mype, sum);
  ret = MPI_Reduce (&sum,&nres,1,MPI_INT,MPI_SUM,0,MPI_COMM_WORLD);
  if (mype == 0)
  {
    (void) printf ("PE:%d  Total is:%d\n",mype, nres);
  }
  ret = MPI_Finalize ();

  return 0;
}

9.3 Using the Cray shmem_put Function

This example shows how to use the shmem_put64() function to copy a contiguous data object from the local PE to a contiguous data object on a different PE.

One of the following modules is required:

PrgEnv-cray
PrgEnv-pgi
PrgEnv-gnu
PrgEnv-pathscale
PrgEnv-intel

Source code of C program (shmem_put.c):

/*
 *      simple put test
 */

#include <stdio.h>
#include <stdlib.h>
#include <mpp/shmem.h>

/* Dimension of source and target of put operations */
#define DIM     1000000

long target[DIM];
long local[DIM];
 
int main(int argc,char **argv)
{
  register int i;
  int my_partner, my_pe;
 
  /* Prepare resources required for correct functionality 
     of SHMEM on XT. Alternatively, shmem_init() could 
     be called. */
  start_pes(0);
 
  for (i=0; i<DIM; i++) {
    target[i] = 0L;
    local[i] = shmem_my_pe() + (i * 10);
  }
 
  my_pe = shmem_my_pe();
 
  if(shmem_n_pes()%2) {
    if(my_pe == 0) printf("Test needs even number of processes\n");
    /* Clean up resources before exit. */
    shmem_finalize();
    exit(0);
  }

  shmem_barrier_all();
 
  /* Test has to be run on two procs. */
  my_partner = my_pe % 2 ? my_pe - 1 : my_pe + 1;
 
  shmem_put64(target,local,DIM,my_partner);
 
  /* Synchronize before verifying results. */
  shmem_barrier_all();
 
  /* Check results of put */
  for(i=0; i<DIM; i++) {
    if(target[i] != (my_partner + (i * 10))) {
      fprintf(stderr,"FAIL (1) on PE %d target[%d] = %d (%d)\n",
        shmem_my_pe(), i, target[i],my_partner+(i*10));
      shmem_finalize();
      exit(-1);
    }
  }
 
  printf(" PE %d: Test passed.\n",my_pe);

   /* Clean up resources. */
   shmem_finalize();
}

Compile shmem_put.c and create executable shmem_put:

% cc -o shmem_put shmem_put.c

Run shmem_put:

% aprun -n 12 -L 56 ./shmem_put
 PE 5: Test passed.
 PE 6: Test passed.
 PE 3: Test passed.
 PE 1: Test passed.
 PE 4: Test passed.
 PE 2: Test passed.
 PE 7: Test passed.
 PE 11: Test passed.
 PE 10: Test passed.
 PE 9: Test passed.
 PE 8: Test passed.
 PE 0: Test passed.

Application 57916 exit codes: 255
Application 57916 resources: utime ~1s, stime ~2s

9.4 Using the Cray shmem_get Function

This example shows how to use the shmem_get() function to copy a contiguous data object from a different PE to a contiguous data object on the local PE.

One of the following modules is required:

PrgEnv-pgi
PrgEnv-cray
PrgEnv-gnu
PrgEnv-pathscale
PrgEnv-intel

The cray-shmem module is also required.

Note: The Fortran module for Cray SHMEM is not supported. Use the INCLUDE 'mpp/shmem.fh' statement instead.

Source code of Fortran program (shmem_get.f90):

program reduction
include 'mpp/shmem.fh'

real values, sum
common /c/ values
real work

call start_pes(0)
values=my_pe()
call shmem_barrier_all   ! Synchronize all PEs
sum = 0.0
do i = 0,num_pes()-1
  call shmem_get(work, values, 1, i)   ! Get next value
  sum = sum + work     ! Sum it
enddo

print*, 'PE',my_pe(),' computedsum=',sum

call shmem_barrier_all
call shmem_finalize

end

Compile shmem_get.f90 and create executable shmem_get:

% ftn -o shmem_get shmem_get.f90

Run shmem_get:

% aprun -n 6 ./shmem_get
 PE            0  computedsum=    15.00000
 PE            5  computedsum=    15.00000
 PE            4  computedsum=    15.00000
 PE            3  computedsum=    15.00000
 PE            2  computedsum=    15.00000
 PE            1  computedsum=    15.00000
Application 137031 resources: utime ~0s, stime ~0s

9.5 Running Partitioned Global Address Space (PGAS) Applications

To run Unified Parallel C (UPC) applications, use the Cray C compiler; to run Fortran 2008 coarray applications, use the Cray Fortran compiler. UPC and coarrays are not supported by the PGI, GCC, PathScale, or Intel compilers.

This example shows how to compile and run a Cray C program that includes Unified Parallel C (UPC) functions.

Modules required:

PrgEnv-cray

Verify that the following additional modules are loaded; a quick check is shown after the list. They are part of the default module set loaded on the login node by the Base-opts module, but PGAS applications fail with an error if they are unloaded:

udreg
ugni
dmapp
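
One way to confirm that they are present (a sketch; on many systems module list writes to stderr, hence the redirection):

% module list 2>&1 | grep -E "udreg|ugni|dmapp"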

9.5.1 Running a Unified Parallel C (UPC) Application

The following is the source code of program upc_cray.c:

#include <upc.h>
#include <stdio.h>
int main (int argc, char *argv[])
{
  int i;
  for (i = 0; i < THREADS; ++i)
    {
      upc_barrier;
      if (i == MYTHREAD)
        printf ("Hello world from thread: %d\n", MYTHREAD);
    }
  return 0;
}

Compile upc_cray.c and run the executable upc_cray:

% cc -h upc -o upc_cray upc_cray.c
% aprun -n 2 ./upc_cray
Hello world from thread: 0
Hello world from thread: 1
Application 251523 resources: utime ~0s, stime ~0s

Note: You need to include the -h upc option on the cc command line.

9.5.2 Running a Fortran 2008 Application Using Coarrays

The following is the source code of program simple_caf.f90:

program simple_caf
implicit none

integer :: npes,mype,i
real    :: local_array(1000),total
real    :: coarray[*]

mype = this_image()
npes = num_images()

if (npes < 2) then
   print *, "Need at least 2 images to run"
   stop
end if

do i=1,1000
   local_array(i) = sin(real(mype*i))
end do

coarray = sum(local_array)
sync all

if (mype == 1) then
   total = coarray + coarray[2]
   print *, "Total from images 1 and 2 is ",total
end if

end program simple_caf

Compile simple_caf.f90 and run the executable:

% ftn -hcaf -o simple_caf simple_caf.f90
/opt/cray/xt-asyncpe/3.9.39/bin/ftn: INFO: linux target is being used
% aprun -n2 simple_caf
  Total from images 1 and 2 is  1.71800661
  Application 39512 resources: utime ~0s, stime ~0s

9.6 Running an Accelerated Cray LibSci Routine

The following sample program demonstrates the use of the libsci_acc accelerated libraries to perform LAPACK routines. The program solves a linear system of equations (AX = B) by computing the LU factorization of matrix A in DGETRF and completing the solution in DGETRS. For more information on auto-tuned LibSci GPU routines, see the intro_libsci_acc(3s) man page.

Modules required:

PrgEnv-cray
craype-accel-nvidia35
cray-libsci

Source of the program:

#include <stdio.h>
#include <stdlib.h> 
#include <math.h> 
#include <libsci_acc.h> 

int main ( int argc, char **argv ) {
  double *A, *B;
  int *ipiv;
  int n, nrhs, lda, ldb, info;
  int i, j;

  n = lda = ldb = 5;
  nrhs = 1;
  ipiv = (int *)malloc(sizeof(int)*n);
  B = (double *)malloc(sizeof(double)*n*nrhs);

  libsci_acc_init();
  libsci_acc_HostAlloc( (void **)&A, sizeof(double)*n*n );

  for ( i = 0; i < n; i++ ) {
    for ( j = 0; j < n; j++ ) {
      A[i*lda+j] = drand48();
    }
  }

  for ( i = 0; i < nrhs; i++ ) {
    for ( j = 0; j < n; j++ ) {
      B[i*ldb+j] = drand48();
    }
  }

  printf("\n\nMatrix A\n");
  for ( i = 0; i < n; i++ ) {
    if (i > 0)
      printf("\n");
    for ( j = 0; j < n; j++ ) {
      printf("\t%f",A[i*lda+j]);
    }
  }

  printf("\n\nRHS/B\n");
  for ( i = 0; i < nrhs; i++ ) {
    if (i > 0)
      printf("\n");
    for ( j = 0; j < n; j++ ) {
      if (i == 0)
        printf("|  %f\n",B[i*ldb+j]);
      else
        printf("  %f\n",B[i*ldb+j]);
    }
  }

  printf("\n\nSolution/X\n");
  dgetrf( n, n, A, lda, ipiv, &info );
  dgetrs( 'N', n, nrhs, A, lda, ipiv, B, ldb, &info );

  for ( i = 0; i < nrhs; i++ ) {
    printf("\n");
    for ( j = 0; j < n; j++ ) {
      printf("%f\n",B[i*ldb+j]);
    }
  }
  printf("\n");

  libsci_acc_FreeHost( A );
  free(ipiv);
  free(B);
  libsci_acc_finalize();
}
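
The listing above does not show a build step. Assuming the source is saved as libsci_acc_ex.c (a file name chosen here for illustration), it can typically be compiled with the cc driver, which links in the accelerated LibSci library when the modules listed above are loaded:

% cc -o a.out libsci_acc_ex.c

Run the executable: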
% aprun -n1 ./a.out

Matrix A
	0.000000	0.000985	0.041631	0.176643	0.364602
	0.091331	0.092298	0.487217	0.526750	0.454433
	0.233178	0.831292	0.931731	0.568060	0.556094
	0.050832	0.767051	0.018915	0.252360	0.298197
	0.875981	0.531557	0.920261	0.515431	0.810429

RHS/B
|  0.188420
|  0.886314
|  0.570614
|  0.076775
|  0.815274


Solution/X

3.105866
-2.649034
1.836310
-0.543425
0.034012

9.7 Running a PETSc Application

This example (Copyright 1995-2004 University of Chicago) shows how to use PETSc functions to solve a linear system of partial differential equations.

Note: There are many ways to use the PETSc solvers. This example is intended to show the basics of compiling and running a PETSc program on a Cray system. It presents one simple approach and may not be the best template to use in writing user code. For issues that are not specific to Cray systems, you can get technical support through petsc-users@mcs.anl.gov.

The source code for this example includes a comment about the use of the mpiexec command to launch the executable. Use aprun instead.

Modules required:

petsc

and one of the following:

PrgEnv-cray
PrgEnv-pgi
PrgEnv-gnu
PrgEnv-pathscale
PrgEnv-intel

Source code of program ex2f.F:

!
!  Description: Solves a linear system in parallel with KSP (Fortran code).
!               Also shows how to set a user-defined monitoring routine.
!
!  Program usage: mpiexec -np  ex2f [-help] [all PETSc options]
!
!/*T
!  Concepts: KSP^basic parallel example
!  Concepts: KSP^setting a user-defined monitoring routine
!  Processors: n
!T*/
!
! -----------------------------------------------------------------------

      program main
      implicit none
! - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
!                    Include files
! - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
!
!  This program uses CPP for preprocessing, as indicated by the use of
!  PETSc include files in the directory petsc/include/finclude.  This
!  convention enables use of the CPP preprocessor, which allows the use
!  of the #include statements that define PETSc objects and variables.
!
!  Use of the conventional Fortran include statements is also supported.
!  In this case, the PETSc include files are located in the directory
!  petsc/include/foldinclude.
!         
!  Since one must be very careful to include each file no more than once
!  in a Fortran routine, application programmers must explicitly list
!  each file needed for the various PETSc components within their
!  program (unlike the C/C++ interface).
!
!  See the Fortran section of the PETSc users manual for details.
!
!  The following include statements are required for KSP Fortran programs:
!     petsc.h       - base PETSc routines
!     petscvec.h    - vectors
!     petscmat.h    - matrices
!     petscpc.h     - preconditioners
!     petscksp.h    - Krylov subspace methods
!  Include the following to use PETSc random numbers:
!     petscsys.h    - system routines
!  Additional include statements may be needed if using additional
!  PETSc routines in a Fortran program, e.g.,
!     petscviewer.h - viewers
!     petscis.h     - index sets
!
#include "finclude/petsc.h"
#include "finclude/petscvec.h"
#include "finclude/petscmat.h"
#include "finclude/petscpc.h"
#include "finclude/petscksp.h"
#include "finclude/petscsys.h"
!
! - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
!                   Variable declarations
! - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
!
!  Variables:
!     ksp     - linear solver context
!     ksp      - Krylov subspace method context
!     pc       - preconditioner context
!     x, b, u  - approx solution, right-hand-side, exact solution vectors
!     A        - matrix that defines linear system
!     its      - iterations for convergence
!     norm     - norm of error in solution
!     rctx     - random number generator context
!
!  Note that vectors are declared as PETSc "Vec" objects.  These vectors
!  are mathematical objects that contain more than just an array of
!  double precision numbers. I.e., vectors in PETSc are not just
!        double precision x(*).
!  However, local vector data can be easily accessed via VecGetArray().
!  See the Fortran section of the PETSc users manual for details.
!  
      double precision  norm
      PetscInt  i,j,II,JJ,m,n,its
      PetscInt  Istart,Iend,ione
      PetscErrorCode ierr
      PetscMPIInt     rank,size
      PetscTruth  flg
      PetscScalar v,one,neg_one
      Vec         x,b,u
      Mat         A 
      KSP         ksp
      PetscRandom rctx

!  These variables are not currently used.
!      PC          pc
!      PCType      ptype 
!      double precision tol


!  Note: Any user-defined Fortran routines (such as MyKSPMonitor)
!  MUST be declared as external.

      external MyKSPMonitor,MyKSPConverged

! - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
!                 Beginning of program
! - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 

      call PetscInitialize(PETSC_NULL_CHARACTER,ierr)
      m = 3
      n = 3
      one  = 1.0
      neg_one = -1.0
      ione    = 1
      call PetscOptionsGetInt(PETSC_NULL_CHARACTER,'-m',m,flg,ierr)
      call PetscOptionsGetInt(PETSC_NULL_CHARACTER,'-n',n,flg,ierr)
      call MPI_Comm_rank(PETSC_COMM_WORLD,rank,ierr)
      call MPI_Comm_size(PETSC_COMM_WORLD,size,ierr)

! - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
!      Compute the matrix and right-hand-side vector that define
!      the linear system, Ax = b.
! - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 

!  Create parallel matrix, specifying only its global dimensions.
!  When using MatCreate(), the matrix format can be specified at
!  runtime. Also, the parallel partitioning of the matrix is
!  determined by PETSc at runtime.

      call MatCreate(PETSC_COMM_WORLD,A,ierr)
      call MatSetSizes(A,PETSC_DECIDE,PETSC_DECIDE,m*n,m*n,ierr)
      call MatSetFromOptions(A,ierr)

!  Currently, all PETSc parallel matrix formats are partitioned by
!  contiguous chunks of rows across the processors.  Determine which
!  rows of the matrix are locally owned. 

      call MatGetOwnershipRange(A,Istart,Iend,ierr)

!  Set matrix elements for the 2-D, five-point stencil in parallel.
!   - Each processor needs to insert only elements that it owns
!     locally (but any non-local elements will be sent to the
!     appropriate processor during matrix assembly). 
!   - Always specify global row and columns of matrix entries.
!   - Note that MatSetValues() uses 0-based row and column numbers
!     in Fortran as well as in C.

!     Note: this uses the less common natural ordering that orders first
!     all the unknowns for x = h then for x = 2h etc; Hence you see JJ = II +- n
!     instead of JJ = II +- m as you might expect. The more standard ordering
!     would first do all variables for y = h, then y = 2h etc.

      do 10, II=Istart,Iend-1
        v = -1.0
        i = II/n
        j = II - i*n  
        if (i.gt.0) then
          JJ = II - n
          call MatSetValues(A,ione,II,ione,JJ,v,INSERT_VALUES,ierr)
        endif
        if (i.lt.m-1) then
          JJ = II + n
          call MatSetValues(A,ione,II,ione,JJ,v,INSERT_VALUES,ierr)
        endif
        if (j.gt.0) then
          JJ = II - 1
          call MatSetValues(A,ione,II,ione,JJ,v,INSERT_VALUES,ierr)
        endif
        if (j.lt.n-1) then
          JJ = II + 1
          call MatSetValues(A,ione,II,ione,JJ,v,INSERT_VALUES,ierr)
        endif
        v = 4.0
        call  MatSetValues(A,ione,II,ione,II,v,INSERT_VALUES,ierr)
 10   continue

!  Assemble matrix, using the 2-step process:
!       MatAssemblyBegin(), MatAssemblyEnd()
!  Computations can be done while messages are in transition,
!  by placing code between these two statements.

      call MatAssemblyBegin(A,MAT_FINAL_ASSEMBLY,ierr)
      call MatAssemblyEnd(A,MAT_FINAL_ASSEMBLY,ierr)

!  Create parallel vectors.
!   - Here, the parallel partitioning of the vector is determined by
!     PETSc at runtime.  We could also specify the local dimensions
!     if desired -- or use the more general routine VecCreate().
!   - When solving a linear system, the vectors and matrices MUST
!     be partitioned accordingly.  PETSc automatically generates
!     appropriately partitioned matrices and vectors when MatCreate()
!     and VecCreate() are used with the same communicator. 
!   - Note: We form 1 vector from scratch and then duplicate as needed.

      call VecCreateMPI(PETSC_COMM_WORLD,PETSC_DECIDE,m*n,u,ierr)
      call VecSetFromOptions(u,ierr)
      call VecDuplicate(u,b,ierr)
      call VecDuplicate(b,x,ierr)

!  Set exact solution; then compute right-hand-side vector.
!  By default we use an exact solution of a vector with all
!  elements of 1.0;  Alternatively, using the runtime option
!  -random_sol forms a solution vector with random components.

      call PetscOptionsHasName(PETSC_NULL_CHARACTER,                    &
     &             "-random_exact_sol",flg,ierr)
      if (flg .eq. 1) then
         call PetscRandomCreate(PETSC_COMM_WORLD,rctx,ierr)
         call PetscRandomSetFromOptions(rctx,ierr)
         call VecSetRandom(u,rctx,ierr)
         call PetscRandomDestroy(rctx,ierr)
      else
         call VecSet(u,one,ierr)
      endif
      call MatMult(A,u,b,ierr)

!  View the exact solution vector if desired

      call PetscOptionsHasName(PETSC_NULL_CHARACTER,                    &
     &             "-view_exact_sol",flg,ierr)
      if (flg .eq. 1) then
         call VecView(u,PETSC_VIEWER_STDOUT_WORLD,ierr)
      endif

! - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
!         Create the linear solver and set various options
! - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 

!  Create linear solver context

      call KSPCreate(PETSC_COMM_WORLD,ksp,ierr)

!  Set operators. Here the matrix that defines the linear system
!  also serves as the preconditioning matrix.

      call KSPSetOperators(ksp,A,A,DIFFERENT_NONZERO_PATTERN,ierr)

!  Set linear solver defaults for this problem (optional).
!   - By extracting the KSP and PC contexts from the KSP context,
!     we can then directly directly call any KSP and PC routines
!     to set various options.
!   - The following four statements are optional; all of these
!     parameters could alternatively be specified at runtime via
!     KSPSetFromOptions(). All of these defaults can be
!     overridden at runtime, as indicated below.

!     We comment out this section of code since the Jacobi
!     preconditioner is not a good general default.

!      call KSPGetPC(ksp,pc,ierr)
!      ptype = PCJACOBI
!      call PCSetType(pc,ptype,ierr)
!      tol = 1.e-7
!      call KSPSetTolerances(ksp,tol,PETSC_DEFAULT_DOUBLE_PRECISION,
!     &     PETSC_DEFAULT_DOUBLE_PRECISION,PETSC_DEFAULT_INTEGER,ierr)

!  Set user-defined monitoring routine if desired

      call PetscOptionsHasName(PETSC_NULL_CHARACTER,'-my_ksp_monitor',  &
     &                    flg,ierr)
      if (flg .eq. 1) then
        call KSPMonitorSet(ksp,MyKSPMonitor,PETSC_NULL_OBJECT,          &
     &                     PETSC_NULL_FUNCTION,ierr)
      endif

!  Set runtime options, e.g.,
!      -ksp_type <type> -pc_type <type> -ksp_monitor -ksp_rtol 
!  These options will override those specified above as long as
!  KSPSetFromOptions() is called _after_ any other customization
!  routines.

      call KSPSetFromOptions(ksp,ierr)

!  Set convergence test routine if desired

      call PetscOptionsHasName(PETSC_NULL_CHARACTER,                    &
     &     '-my_ksp_convergence',flg,ierr)
      if (flg .eq. 1) then
        call KSPSetConvergenceTest(ksp,MyKSPConverged,                  &
     &          PETSC_NULL_OBJECT,ierr)
      endif

!
! - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
!                      Solve the linear system
! - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 

      call KSPSolve(ksp,b,x,ierr)

! - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
!                     Check solution and clean up
! - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 

!  Check the error

      call VecAXPY(x,neg_one,u,ierr)
      call VecNorm(x,NORM_2,norm,ierr)
      call KSPGetIterationNumber(ksp,its,ierr)
      if (rank .eq. 0) then
        if (norm .gt. 1.e-12) then
           write(6,100) norm,its
        else
           write(6,110) its
        endif
      endif
  100 format('Norm of error ',e10.4,' iterations ',i5)
  110 format('Norm of error < 1.e-12,iterations ',i5)

!  Free work space.  All PETSc objects should be destroyed when they
!  are no longer needed.

      call KSPDestroy(ksp,ierr)
      call VecDestroy(u,ierr)
      call VecDestroy(x,ierr)
      call VecDestroy(b,ierr)
      call MatDestroy(A,ierr)

!  Always call PetscFinalize() before exiting a program.  This routine
!    - finalizes the PETSc libraries as well as MPI
!    - provides summary and diagnostic information if certain runtime
!      options are chosen (e.g., -log_summary).  See PetscFinalize()
!      manpage for more information.

      call PetscFinalize(ierr)
      end

! --------------------------------------------------------------
!
!  MyKSPMonitor - This is a user-defined routine for monitoring
!  the KSP iterative solvers.
!
!  Input Parameters:
!    ksp   - iterative context
!    n     - iteration number
!    rnorm - 2-norm (preconditioned) residual value (may be estimated)
!    dummy - optional user-defined monitor context (unused here)
!
      subroutine MyKSPMonitor(ksp,n,rnorm,dummy,ierr)

      implicit none

#include "finclude/petsc.h"
#include "finclude/petscvec.h"
#include "finclude/petscksp.h"

      KSP              ksp
      Vec              x
      PetscErrorCode ierr
      PetscInt n,dummy
      PetscMPIInt rank
      double precision rnorm

!  Build the solution vector

      call KSPBuildSolution(ksp,PETSC_NULL_OBJECT,x,ierr)

!  Write the solution vector and residual norm to stdout
!   - Note that the parallel viewer PETSC_VIEWER_STDOUT_WORLD
!     handles data from multiple processors so that the
!     output is not jumbled.

      call MPI_Comm_rank(PETSC_COMM_WORLD,rank,ierr)
      if (rank .eq. 0) write(6,100) n
      call VecView(x,PETSC_VIEWER_STDOUT_WORLD,ierr)
      if (rank .eq. 0) write(6,200) n,rnorm

 100  format('iteration ',i5,' solution vector:')
 200  format('iteration ',i5,' residual norm ',e10.4)
      ierr = 0
      end

! --------------------------------------------------------------
!
!  MyKSPConverged - This is a user-defined routine for testing
!  convergence of the KSP iterative solvers.
!
!  Input Parameters:
!    ksp   - iterative context
!    n     - iteration number
!    rnorm - 2-norm (preconditioned) residual value (may be estimated)
!    dummy - optional user-defined monitor context (unused here)
!
      subroutine MyKSPConverged(ksp,n,rnorm,flag,dummy,ierr)

      implicit none

#include "finclude/petsc.h"
#include "finclude/petscvec.h"
#include "finclude/petscksp.h"

      KSP              ksp
      PetscErrorCode ierr
      PetscInt n,dummy
      KSPConvergedReason flag
      double precision rnorm

      if (rnorm .le. .05) then 
        flag = 1
      else
        flag = 0
      endif
      ierr = 0

      end

Use the following makefile.F:

.SUFFIXES:  .mod .o .F

### Compilers, linkers and flags.

FC        = ftn
LINKER    = ftn
FCFLAGS   =
LINKFLAGS =

### Fortran optimization options.

FOPTFLAGS      = -O3

.F.o: 
	$(FC) -c ${FOPTFLAGS} ${FCFLAGS} $*.F
            

all : ex2f
ex2f  :  ex2f.o 
	$(LINKER) -o $@ ex2f.o 

Create and run executable ex2f, including the PETSc run time option -mat_view to display the nonzero values of the 9x9 matrix A:

% make -f makefile.F
% aprun -n 2 ./ex2f -mat_view
row 0: (0, 4)  (1, -1)  (3, -1)
row 1: (0, -1)  (1, 4)  (2, -1)  (4, -1)
row 2: (1, -1)  (2, 4)  (5, -1)
row 3: (0, -1)  (3, 4)  (4, -1)  (6, -1)
row 4: (1, -1)  (3, -1)  (4, 4)  (5, -1)  (7, -1)
row 5: (2, -1)  (4, -1)  (5, 4)  (8, -1)
row 6: (3, -1)  (6, 4)  (7, -1)
row 7: (4, -1)  (6, -1)  (7, 4)  (8, -1)
row 8: (5, -1)  (7, -1)  (8, 4)
row 0: (0, 0.25)  (3, -1)
row 1: (1, 0.25)  (2, -1)
row 2: (1, -0.25)  (2, 0.266667)  (3, -1)
row 3: (0, -0.25)  (2, -0.266667)  (3, 0.287081)
row 0: (0, 0.25)  (1, -1)  (3, -1)
row 1: (0, -0.25)  (1, 0.266667)  (2, -1)  (4, -1)
row 2: (1, -0.266667)  (2, 0.267857)
row 3: (0, -0.25)  (3, 0.266667)  (4, -1)
row 4: (1, -0.266667)  (3, -0.266667)  (4, 0.288462)
Norm of error < 1.e-12,iterations     7
Application 155514 resources: utime 0, stime 12

9.8 Running an OpenMP Application

This example shows how to compile and run an OpenMP/MPI application.

One of the following modules is required:

PrgEnv-cray
PrgEnv-pgi
PrgEnv-gnu
PrgEnv-pathscale
PrgEnv-intel

Note: To compile an OpenMP program using a PGI or PathScale compiler, include -mp on the compiler driver command line. For a GCC compiler, include -fopenmp. For an Intel compiler, include -openmp. No option is required for the Cray compilers; -h omp is the default.

For a PathScale OpenMP program, set the PSC_OMP_AFFINITY environment variable to FALSE.

Source code of C program xthi.c:

#define _GNU_SOURCE

#include <stdio.h>
#include <unistd.h>
#include <string.h>
#include <sched.h>
#include <mpi.h>
#include <omp.h>

/* Borrowed from util-linux-2.13-pre7/schedutils/taskset.c */
static char *cpuset_to_cstr(cpu_set_t *mask, char *str)
{
  char *ptr = str;
  int i, j, entry_made = 0;
  for (i = 0; i < CPU_SETSIZE; i++) {
    if (CPU_ISSET(i, mask)) {
      int run = 0;
      entry_made = 1;
      for (j = i + 1; j < CPU_SETSIZE; j++) {
        if (CPU_ISSET(j, mask)) run++;
        else break;
      }
      if (!run)
        sprintf(ptr, "%d,", i);
      else if (run == 1) {
        sprintf(ptr, "%d,%d,", i, i + 1);
        i++;
      } else {
        sprintf(ptr, "%d-%d,", i, i + run);
        i += run;
      }
      while (*ptr != 0) ptr++;
    }
  }
  ptr -= entry_made;
  *ptr = 0;
  return(str);
}

int main(int argc, char *argv[])
{
  int rank, thread;
  cpu_set_t coremask;
  char clbuf[7 * CPU_SETSIZE], hnbuf[64];

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  memset(clbuf, 0, sizeof(clbuf));
  memset(hnbuf, 0, sizeof(hnbuf));
  (void)gethostname(hnbuf, sizeof(hnbuf));
  #pragma omp parallel private(thread, coremask, clbuf)
  {
    thread = omp_get_thread_num();
    (void)sched_getaffinity(0, sizeof(coremask), &coremask);
    cpuset_to_cstr(&coremask, clbuf);
    #pragma omp barrier
    printf("Hello from rank %d, thread %d, on %s. (core affinity = %s)\n",
            rank, thread, hnbuf, clbuf);
  }
  MPI_Finalize();
  return(0);
}

Load the PrgEnv-pathscale module:

% module swap PrgEnv-pgi PrgEnv-pathscale

Set the PSC_OMP_AFFINITY environment variable to FALSE:

% setenv PSC_OMP_AFFINITY FALSE

Or

% export PSC_OMP_AFFINITY=FALSE

Compile and link xthi.c:

% cc -mp -o xthi xthi.c

Set the OMP_NUM_THREADS environment variable to the number of threads in the team:

% setenv OMP_NUM_THREADS 2

Or

% export OMP_NUM_THREADS=2

Note: If you are running Intel-compiled code, you must use one of the following alternative methods when setting OMP_NUM_THREADS (see the sketch after this list):

  • Increase the aprun -d depth value by one.

  • Use the aprun -cc numa_node affinity option.
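
For example, with OMP_NUM_THREADS set to 6 under the Intel programming environment, either of the following launch lines could be used (a sketch only; adjust the PE count and depth to match your node):

% aprun -n 4 -d 7 ./xthi
% aprun -n 4 -d 6 -cc numa_node ./xthi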

Run program xthi:

% export OMP_NUM_THREADS=24
% aprun -n 1 -d 24 -L 56 xthi | sort
Application 57937 resources: utime ~1s, stime ~0s
Hello from rank 0, thread 0, on nid00056. (core affinity = 0)
Hello from rank 0, thread 10, on nid00056. (core affinity = 10)
Hello from rank 0, thread 11, on nid00056. (core affinity = 11)
Hello from rank 0, thread 12, on nid00056. (core affinity = 12)
Hello from rank 0, thread 13, on nid00056. (core affinity = 13)
Hello from rank 0, thread 14, on nid00056. (core affinity = 14)
Hello from rank 0, thread 15, on nid00056. (core affinity = 15)
Hello from rank 0, thread 16, on nid00056. (core affinity = 16)
Hello from rank 0, thread 17, on nid00056. (core affinity = 17)
Hello from rank 0, thread 18, on nid00056. (core affinity = 18)
Hello from rank 0, thread 19, on nid00056. (core affinity = 19)
Hello from rank 0, thread 1, on nid00056. (core affinity = 1)
Hello from rank 0, thread 20, on nid00056. (core affinity = 20)
Hello from rank 0, thread 21, on nid00056. (core affinity = 21)
Hello from rank 0, thread 22, on nid00056. (core affinity = 22)
Hello from rank 0, thread 23, on nid00056. (core affinity = 23)
Hello from rank 0, thread 2, on nid00056. (core affinity = 2)
Hello from rank 0, thread 3, on nid00056. (core affinity = 3)
Hello from rank 0, thread 4, on nid00056. (core affinity = 4)
Hello from rank 0, thread 5, on nid00056. (core affinity = 5)
Hello from rank 0, thread 6, on nid00056. (core affinity = 6)
Hello from rank 0, thread 7, on nid00056. (core affinity = 7)
Hello from rank 0, thread 8, on nid00056. (core affinity = 8)
Hello from rank 0, thread 9, on nid00056. (core affinity = 9)

The aprun command created one instance of xthi, which spawned 23 additional threads running on separate cores.

Here is another run of xthi:

% export OMP_NUM_THREADS=6
% aprun -n 4 -d 6 -L 56 xthi | sort
Application 57948 resources: utime ~1s, stime ~1s
Hello from rank 0, thread 0, on nid00056. (core affinity = 0)
Hello from rank 0, thread 1, on nid00056. (core affinity = 1)
Hello from rank 0, thread 2, on nid00056. (core affinity = 2)
Hello from rank 0, thread 3, on nid00056. (core affinity = 3)
Hello from rank 0, thread 4, on nid00056. (core affinity = 4)
Hello from rank 0, thread 5, on nid00056. (core affinity = 5)
Hello from rank 1, thread 0, on nid00056. (core affinity = 6)
Hello from rank 1, thread 1, on nid00056. (core affinity = 7)
Hello from rank 1, thread 2, on nid00056. (core affinity = 8)
Hello from rank 1, thread 3, on nid00056. (core affinity = 9)
Hello from rank 1, thread 4, on nid00056. (core affinity = 10)
Hello from rank 1, thread 5, on nid00056. (core affinity = 11)
Hello from rank 2, thread 0, on nid00056. (core affinity = 12)
Hello from rank 2, thread 1, on nid00056. (core affinity = 13)
Hello from rank 2, thread 2, on nid00056. (core affinity = 14)
Hello from rank 2, thread 3, on nid00056. (core affinity = 15)
Hello from rank 2, thread 4, on nid00056. (core affinity = 16)
Hello from rank 2, thread 5, on nid00056. (core affinity = 17)
Hello from rank 3, thread 0, on nid00056. (core affinity = 18)
Hello from rank 3, thread 1, on nid00056. (core affinity = 19)
Hello from rank 3, thread 2, on nid00056. (core affinity = 20)
Hello from rank 3, thread 3, on nid00056. (core affinity = 21)
Hello from rank 3, thread 4, on nid00056. (core affinity = 22)
Hello from rank 3, thread 5, on nid00056. (core affinity = 23)

The aprun command created four instances of xthi which spawned five additional threads per instance. All PEs are running on separate cores and each instance is confined to NUMA node domains on one compute node.

9.9 Running an Interactive Batch Job

This example shows how to compile and run an OpenMP/MPI application (see Running an OpenMP Application) on 16-core Cray X6 compute nodes using an interactive batch job.

Modules required:

pbs or moab

and one of the following:

PrgEnv-cray
PrgEnv-pgi
PrgEnv-gnu
PrgEnv-pathscale
PrgEnv-intel

Use the cnselect command to get a list of eight-core, dual-socket compute nodes:

% cnselect coremask.eq.65535
14-17,128-223,256-351,384-479,512-607,640-715

Initiate an interactive batch session:

% qsub -I -l mppwidth=8 -l mppdepth=4 -l mppnodes=\"14-15\"

Set the OMP_NUM_THREADS environment variable to the number of threads in the team:

% setenv OMP_NUM_THREADS 4

Or

% export OMP_NUM_THREADS=4

Run program xthi:

% aprun -n 8 -d 4 -L14-15 ./xthi | sort
Application 57953 resources: utime ~2s, stime ~2s
Hello from rank 0, thread 0, on nid00014. (core affinity = 0)
Hello from rank 0, thread 1, on nid00014. (core affinity = 1)
Hello from rank 0, thread 2, on nid00014. (core affinity = 2)
Hello from rank 0, thread 3, on nid00014. (core affinity = 3)
Hello from rank 1, thread 0, on nid00014. (core affinity = 4)
Hello from rank 1, thread 1, on nid00014. (core affinity = 5)
Hello from rank 1, thread 2, on nid00014. (core affinity = 6)
Hello from rank 1, thread 3, on nid00014. (core affinity = 7)
Hello from rank 2, thread 0, on nid00014. (core affinity = 8)
Hello from rank 2, thread 1, on nid00014. (core affinity = 9)
Hello from rank 2, thread 2, on nid00014. (core affinity = 10)
Hello from rank 2, thread 3, on nid00014. (core affinity = 11)
Hello from rank 3, thread 0, on nid00014. (core affinity = 12)
Hello from rank 3, thread 1, on nid00014. (core affinity = 13)
Hello from rank 3, thread 2, on nid00014. (core affinity = 14)
Hello from rank 3, thread 3, on nid00014. (core affinity = 15)
Hello from rank 4, thread 0, on nid00015. (core affinity = 0)
Hello from rank 4, thread 1, on nid00015. (core affinity = 1)
Hello from rank 4, thread 2, on nid00015. (core affinity = 2)
Hello from rank 4, thread 3, on nid00015. (core affinity = 3)
Hello from rank 5, thread 0, on nid00015. (core affinity = 4)
Hello from rank 5, thread 1, on nid00015. (core affinity = 5)
Hello from rank 5, thread 2, on nid00015. (core affinity = 6)
Hello from rank 5, thread 3, on nid00015. (core affinity = 7)
Hello from rank 6, thread 0, on nid00015. (core affinity = 8)
Hello from rank 6, thread 1, on nid00015. (core affinity = 9)
Hello from rank 6, thread 2, on nid00015. (core affinity = 10)
Hello from rank 6, thread 3, on nid00015. (core affinity = 11)
Hello from rank 7, thread 0, on nid00015. (core affinity = 12)
Hello from rank 7, thread 1, on nid00015. (core affinity = 13)
Hello from rank 7, thread 2, on nid00015. (core affinity = 14)
Hello from rank 7, thread 3, on nid00015. (core affinity = 15)

9.10 Running a Batch Job Script

In this example, a batch job script requests six PEs to run program mpi.

Modules required:

pbs or moab

and one of the following:

PrgEnv-cray
PrgEnv-pgi
PrgEnv-gnu
PrgEnv-pathscale
PrgEnv-intel

Create script1:

#!/bin/bash
#
# Define the destination of this job
# as the queue named "workq":
#PBS -q workq
#PBS -l mppwidth=6
# Tell WMS to keep both standard output and
# standard error on the execution host:
#PBS -k eo
cd /lus/nid0008/user1
aprun -n 6 ./mpi
exit 0

Set permissions to executable:

% chmod +x script1

Submit the job:

% qsub script1

The qsub command produces a batch job log file with output from mpi (see Running an MPI Application). The job output is written to a file named script1.onnnnn, where nnnnn is the job sequence number.

% cat script1.o238830 | sort
Application 848571 resources: utime ~0s, stime ~0s
 My PE:            0  My part:          816
 My PE:            1  My part:          833
 My PE:            2  My part:          850
 My PE:            3  My part:          867
 My PE:            4  My part:          884
 My PE:            5  My part:          800
    PE:            0 Total is:         5050

9.11 Running Multiple Sequential Applications

To run multiple sequential applications, the number of processors you specify as an argument to qsub must be equal to or greater than the largest number of processors required by a single invocation of aprun in your script. For example, in job script mult_seq, the -l mppwidth value is 6 because the largest aprun -n value is 6.

Modules required:

pbs or moab

and one of the following:

PrgEnv-cray
PrgEnv-pgi
PrgEnv-gnu
PrgEnv-pathscale
PrgEnv-intel

Create script mult_seq:

#!/bin/bash
#
# Define the destination of this job
# as the queue named "workq":
#PBS -q workq
#PBS -l mppwidth=6
# Tell WMS to keep both standard output and
# standard error on the execution host:
#PBS -k eo
cd /lus/nid000015/user1
aprun -n 2 ./simple
aprun -n 3 ./mpi
aprun -n 6 ./shmem_put
aprun -n 6 ./shmem_get
exit 0

The script launches applications simple (see Running a Basic Application), mpi (see Running an MPI Application), shmem_put (see Using the Cray shmem_put Function), and shmem_get (see Using the Cray shmem_get Function).

Set file permission to executable:

% chmod +x mult_seq

Run the script:

% qsub mult_seq

List the output:

% cat mult_seq.o465713
hello from pe 0 of 2
hello from pe 1 of 2
 My PE:            0  My part:         1683
 My PE:            1  My part:         1717
 My PE:            2  My part:         1650
  PE:            0 Total is:         5050
 PE 0: Test passed.
 PE 1: Test passed.
 PE 2: Test passed.
 PE 3: Test passed.
 PE 4: Test passed.
 PE 5: Test passed.
 PE            0  computedsum=    15.00000
 PE            1  computedsum=    15.00000
 PE            2  computedsum=    15.00000
 PE            3  computedsum=    15.00000
 PE            4  computedsum=    15.00000
 PE            5  computedsum=    15.00000

9.12 Running Multiple Parallel Applications

If you are running multiple parallel applications, the number of processors must be equal to or greater than the total number of processors specified by calls to aprun. For example, in job script mult_par, the -l mppwidth value is 11 because the total of the aprun -n values is 11.

Modules required:

pbs or moab

and one of the following:

PrgEnv-cray
PrgEnv-pgi
PrgEnv-gnu
PrgEnv-pathscale
PrgEnv-intel

Create mult_par:

#!/bin/bash
#
# Define the destination of this job
# as the queue named "workq":
#PBS -q workq
#PBS -l mppwidth=11
# Tell WMS to keep both standard output and
# standard error on the execution host:
#PBS -k eo
cd /lus/nid00007/user1
aprun -n 2 ./simple &
aprun -n 3 ./mpi &
aprun -n 6 ./shmem_put &
aprun -n 6 ./shmem_get &
wait
exit 0

The script launches applications simple (see Running a Basic Application), mpi (see Running an MPI Application), shmem_put (see Using the Cray shmem_put Function), and shmem_get (see Using the Cray shmem_get Function).

Set file permission to executable:

% chmod +x mult_par

Run the script:

% qsub mult_par

List the output:

% cat mult_par.o7231
hello from pe 0 of 2
hello from pe 1 of 2
Application 520255 resources: utime ~0s, stime ~0s
 My PE:            0  My part:         1683
 My PE:            2  My part:         1650
 My PE:            1  My part:         1717
    PE:            0 Total is:         5050
Application 520256 resources: utime ~0s, stime ~0s
 PE 0: Test passed.
 PE 5: Test passed.
 PE 4: Test passed.
 PE 3: Test passed.
 PE 2: Test passed.
 PE 1: Test passed.
Application 520258 exit codes: 64
Application 520258 resources: utime ~0s, stime ~0s 
 PE            0  computedsum=    15.00000
 PE            5  computedsum=    15.00000
 PE            4  computedsum=    15.00000
 PE            3  computedsum=    15.00000
 PE            2  computedsum=    15.00000
 PE            1  computedsum=    15.00000
Application 520259 resources: utime ~0s, stime ~0s

9.13 Using aprun Memory Affinity Options

In some cases, remote-NUMA-node memory references can reduce the performance of applications. You can use the aprun memory affinity options to control remote-NUMA-node memory references. For the -S, -sl, and -sn options, memory allocation is satisfied using local-NUMA-node memory. If there is not enough NUMA node 0 memory, NUMA node 1 memory may be used. For the -ss option, only local-NUMA-node memory can be allocated.

9.13.1 Using the aprun -S Option

This example runs each PE on a specific NUMA node 0 CPU:

% aprun -n 4 ./xthi | sort
Application 225110 resources: utime ~0s, stime ~0s
PE 0 nid00045 Core affinity = 0
PE 1 nid00045 Core affinity = 1
PE 2 nid00045 Core affinity = 2
PE 3 nid00045 Core affinity = 3

This example runs one PE on each NUMA node of nodes 45 and 70:

% aprun -n 4 -S 1 ./xthi | sort
Application 225111 resources: utime ~0s, stime ~0s
PE 0 nid00045 Core affinity = 0
PE 1 nid00045 Core affinity = 4
PE 2 nid00070 Core affinity = 0
PE 3 nid00070 Core affinity = 4

9.13.2 Using the aprun -sl Option

This example runs all PEs on NUMA node 1:

% aprun -n 4 -sl 1 ./xthi | sort
Application 57967 resources: utime ~1s, stime ~1s
Hello from rank 0, thread 0, on nid00014. (core affinity = 4)
Hello from rank 1, thread 0, on nid00014. (core affinity = 5)
Hello from rank 2, thread 0, on nid00014. (core affinity = 6)
Hello from rank 3, thread 0, on nid00014. (core affinity = 7)

This example runs all PEs on NUMA node 2:

% aprun -n 4 -sl 2 ./xthi | sort
Application 57968 resources: utime ~1s, stime ~1s
Hello from rank 0, thread 0, on nid00014. (core affinity = 8)
Hello from rank 1, thread 0, on nid00014. (core affinity = 9)
Hello from rank 2, thread 0, on nid00014. (core affinity = 10)
Hello from rank 3, thread 0, on nid00014. (core affinity = 11)

9.13.3 Using the aprun -sn Option

This example runs four PEs on NUMA node 0 of node 45 and four PEs on NUMA node 0 of node 70:

% aprun -n 8 -sn 1 ./xthi | sort
Application 2251114 resources: utime ~0s, stime ~0s
PE 0 nid00045 Core affinity = 0
PE 1 nid00045 Core affinity = 1
PE 2 nid00045 Core affinity = 2
PE 3 nid00045 Core affinity = 3
PE 4 nid00070 Core affinity = 0
PE 5 nid00070 Core affinity = 1
PE 6 nid00070 Core affinity = 2
PE 7 nid00070 Core affinity = 3

9.13.4 Using the aprun -ss Option

When -ss is specified, a PE can allocate only the memory that is local to its assigned NUMA node. The default is to allow remote-NUMA-node memory allocation. For example, by default any PE running on NUMA node 0 can allocate NUMA node 1 memory (if NUMA node 1 has been reserved for the application).

This example runs PEs 0-3 on NUMA node 0, PEs 4-7 on NUMA node 1, PEs 8-11 on NUMA node 2, and PEs 12-15 on NUMA node 3. PEs 0-3 cannot allocate memory on NUMA nodes 1, 2, or 3; PEs 4-7 cannot allocate memory on NUMA nodes 0, 2, or 3; and so on.

% aprun -n 16 -sl 0,1,2,3 -ss ./xthi | sort

Application 57970 resources: utime ~9s, stime ~2s
PE 0 nid00014. (core affinity = 0-3)
PE 10 nid00014. (core affinity = 8-11)
PE 11 nid00014. (core affinity = 8-11)
PE 12 nid00014. (core affinity = 12-15)
PE 13 nid00014. (core affinity = 12-15)
PE 14 nid00014. (core affinity = 12-15)
PE 15 nid00014. (core affinity = 12-15)
PE 1 nid00014. (core affinity = 0-3)
PE 2 nid00014. (core affinity = 0-3)
PE 3 nid00014. (core affinity = 0-3)
PE 4 nid00014. (core affinity = 4-7)
PE 5 nid00014. (core affinity = 4-7)
PE 6 nid00014. (core affinity = 4-7)
PE 7 nid00014. (core affinity = 4-7)
PE 8 nid00014. (core affinity = 8-11)
PE 9 nid00014. (core affinity = 8-11)

9.14 Using aprun CPU Affinity Options

The following examples show how you can use aprun CPU affinity options to bind a process to a particular CPU or the CPUs on a NUMA node.

9.14.1 Using the aprun -cc cpu_list Option

This example binds PEs to CPUs 0-4 and 7 on an 8-core node:

% aprun -n 6 -cc 0-4,7 ./xthi | sort
Application 225116 resources: utime ~0s, stime ~0s
PE 0 nid00045 Core affinity = 0
PE 1 nid00045 Core affinity = 1
PE 2 nid00045 Core affinity = 2
PE 3 nid00045 Core affinity = 3
PE 4 nid00045 Core affinity = 4
PE 5 nid00045 Core affinity = 7

9.14.2 Using the aprun -cc keyword Options

Processes can migrate from one CPU to another on a node. You can use the -cc option to bind PEs to CPUs. This example uses the -cc cpu (default) option to bind each PE to a CPU:

% aprun -n 8 -cc cpu ./xthi | sort
Application 225117 resources: utime ~0s, stime ~0s
PE 0 nid00045 Core affinity = 0
PE 1 nid00045 Core affinity = 1
PE 2 nid00045 Core affinity = 2
PE 3 nid00045 Core affinity = 3
PE 4 nid00045 Core affinity = 4
PE 5 nid00045 Core affinity = 5
PE 6 nid00045 Core affinity = 6
PE 7 nid00045 Core affinity = 7

This example uses the -cc numa_node option to bind each PE to the CPUs within a NUMA node:

% aprun -n 8 -cc numa_node ./xthi | sort
Application 225118 resources: utime ~0s, stime ~0s
PE 0 nid00045 Core affinity = 0-3
PE 1 nid00045 Core affinity = 0-3
PE 2 nid00045 Core affinity = 0-3
PE 3 nid00045 Core affinity = 0-3
PE 4 nid00045 Core affinity = 4-7
PE 5 nid00045 Core affinity = 4-7
PE 6 nid00045 Core affinity = 4-7
PE 7 nid00045 Core affinity = 4-7

9.15 Using Checkpoint/Restart Commands

To checkpoint and restart a job, first load these modules:

moab
blcr

This example shows the use of the qhold and qchkpt checkpoint commands and the qrls and qrerun restart commands.

Source code of cr.c:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <errno.h>
#include "mpi.h"
#include <signal.h>


static void sig_handler(int);

/* Counter that is incremented each time the app is checkpointed. */
static unsigned int Cnt = 0;

static int me;

int
main (int argc, char *argv[])
{
  int all, ret;
  int sleep_time=100000;
  ret = MPI_Init(&argc, &argv);
  ret = MPI_Comm_rank (MPI_COMM_WORLD, &me);
  ret = MPI_Comm_size(MPI_COMM_WORLD, &all);

  if (me == 0) {

    if (signal(SIGCONT, sig_handler) == SIG_ERR) {
      printf("Can't catch SIGCONT\n");
      ret = MPI_Finalize();
      exit(3);
    }
    printf ("Partition size is = %d\n", all);
  }

  ret = 999;
  while (ret != 0) {

    Cnt += 1;
    ret = sleep(sleep_time);
    if (ret != 0 ) {

      printf("PE %d PID %d interrupted at cnt: %d\n", me, getpid(), Cnt);
      sleep_time = ret;
    }
  }

  printf ("Finished with count at: &d, exiting \n", Cnt);

  ret = MPI_Finalize();

  return 0;
}


static void
sig_handler(int signo)
{
  printf("\n");

}

Load the modules and compile cr.c:

% module load moab
% module load blcr
% cc -o cr cr.c

Create script cr_script:

#!/usr/bin/ksh
#PBS -l mppwidth=2
#PBS -l mppnppn=1
#PBS -j oe
#PBS -l walltime=6:00:00
#PBS -c enabled

# cd to directory where job was submitted from:
cd /lus/nid00015/user12/c

export MPICH_VERSION_DISPLAY=1

aprun -n 2 -N 1 ./cr

wait;

Launch the job:

% qsub cr_script
87151.nid00003

The WMS returns the job identifier 87151.nid00003. Use just the first part (sequence number 87151) in checkpoint/restart commands.

Check the job status:

% qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
87151.nid00003            cr_script        user12          00:00:00 R workq

The job is running (qstat state S is R).

Check the status of application cr:

% apstat
Compute node summary
   arch config     up    use   held  avail   down
     XT     72     72      2      0     70      0

No pending applications are present

Total placed applications: 1
Placed  Apid ResId     User   PEs Nodes    Age   State Command
      331897     6   user12     2     2   0h03m  run   cr

The application is running (State is run).

Checkpoint the job, place it in hold state, and recheck job and application status:

% qhold 87151
% qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
87151.nid00003            cr_script        user12          00:00:00 H workq
% apstat
Compute node summary
   arch config     up    use   held  avail   down
     XT     72     72      0      0     72      0

No pending applications are present

No placed applications are present

The job is checkpointed and its state changes from run to hold. Application cr is checkpointed (apstat State field is chkpt), then stops running.

Note: The qhold command checkpointed the job because it was submitted with the -c enabled option.

Release the job, get status to verify, then restart it:

% qrls 87151
% qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
87151.nid00003            cr_script        user12          00:00:00 R workq
% apstat
Compute node summary
   arch config     up    use   held  avail   down
     XT     72     72      2      0     70      0

No pending applications are present

Total placed applications: 1
Placed  Apid ResId     User   PEs Nodes    Age   State Command
      331899     7   user12     2     2   0h00m  run   cr

The job is running (qstat S field is R and application State is run).

Checkpoint the job but keep it running:

% qchkpt 87151
% qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
87151.nid00003            cr_script        user12          00:00:00 R workq
% apstat
Compute node summary
   arch config     up    use   held  avail   down
     XT     72     72      2      0     70      0

No pending applications are present

Total placed applications: 1
Placed  Apid ResId     User   PEs Nodes    Age   State Command
      331899     7   user12     2     2   0h02m  run   cr

The qstat S field changed to R, and the application state changed from chkpt to run.

Use qdel to stop the job:

% qdel 87151
% qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
87151.nid00003            cr_script        user12          00:00:00 C workq

Use the qrerun command to restart a completed job previously checkpointed:

% qrerun 87151
% qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
87151.nid00003            cr_script        user12          00:00:00 R workq
% apstat
Compute node summary
   arch config     up    use   held  avail   down
     XT     72     72      2      0     70      0

No pending applications are present

Total placed applications: 1
Placed  Apid ResId     User   PEs Nodes    Age   State Command
      331901     8   user12     2     2   0h00m  run   cr

You can use qrerun to restart a job if the job remains queued in the completed state.

At any step in the checkpoint/restart process, you can use the qstat -f option to display details about the job and checkpoint files:

% qstat -f 87151
Job Id: 87151.nid00003
    Job_Name = cr_script
    Job_Owner = user12@nid00004
<snip>
    Checkpoint = enabled
<snip>
    comment = Job 87151.nid00003 was checkpointed and continued to /lus/scratc
        h/BLCR_checkpoint_dir/ckpt.87151.nid00003.1237761585 at Sun Mar 22 17:
        39:45 2009

<snip>
    checkpoint_dir = /lus/scratch/BLCR_checkpoint_dir
    checkpoint_name = ckpt.87151.nid00003.1237761585
    checkpoint_time = Sun Mar 22 17:39:45 2009
    checkpoint_restart_status = Successfully restarted job

You can get details about the checkpointed files in checkpoint_dir:

% cd /lus/scratch/BLCR_checkpoint_dir
% ls -al
<snip>
drwx------  3 user12 dev1   4096 2009-03-22 17:35 ckpt.87151.nid00003.1237761347
drwx------  3 user12 dev1   4096 2009-03-22 17:39 ckpt.87151.nid00003.123776158
% cd ckpt.87151.nid00003.123776158
% ls
331899  cpr.context  info.7828
% cd 331899
% ls
context.0  context.1

There is a context.n file for each width value (-l mppwidth=2).

9.16 Running Compute Node Commands

You can use the aprun -b option to run compute node BusyBox commands.

The following aprun command runs the compute node grep command to find references to MemTotal in compute node file /proc/meminfo:

% aprun -b grep MemTotal /proc/meminfo
MemTotal:      8124872 kB

9.17 Using the High-level PAPI Interface

PAPI provides simple high-level interfaces for instrumenting applications written in C or Fortran. This example shows the use of the PAPI_start_counters() and PAPI_stop_counters() functions.

Modules required:

xt-papi

and one of the following:

PrgEnv-cray
PrgEnv-pgi
PrgEnv-gnu
PrgEnv-pathscale
PrgEnv-intel

Source of papi_hl.c:

#include <stdio.h>
#include <stdlib.h>
#include <papi.h>

void main()
{

  int retval, Events[2]= {PAPI_TOT_CYC, PAPI_TOT_INS};
  long_long values[2];

  if (PAPI_start_counters (Events, 2) != PAPI_OK) {
    printf("Error starting counters\n");
    exit(1);
  }

  /* Do some computation here... */

  if (PAPI_stop_counters (values, 2) != PAPI_OK) {
    printf("Error stopping counters\n");
    exit(1);
  }

  printf("PAPI_TOT_CYC = %lld\n", values[0]);
  printf("PAPI_TOT_INS = %lld\n", values[1]);
}

Compile papi_hl.c:

% cc -o papi_hl papi_hl.c

Run papi_hl:

% aprun ./papi_hl
PAPI_TOT_CYC = 4020
PAPI_TOT_INS = 201
Application 520262 exit codes: 19
Application 520262 resources: utime ~0s, stime ~0s

9.18 Using the Low-level PAPI Interface

PAPI provides an advanced low-level interface for instrumenting applications. Initialize the PAPI library before calling any of these functions by issuing either a high-level function call or a call to PAPI_library_init(). This example shows the use of the PAPI_create_eventset(), PAPI_add_event(), PAPI_start(), and PAPI_read() functions.

Modules required:

xt-papi

and one of the following:

PrgEnv-cray
PrgEnv-pgi
PrgEnv-gnu
PrgEnv-pathscale
PrgEnv-intel

Source of papi_ll.c:

#include <stdio.h>
#include <stdlib.h>
#include <papi.h>

void main()
{
  int EventSet = PAPI_NULL;
  long_long values[1];

  /* Initialize PAPI library */
  if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) {
    printf("Error initializing PAPI library\n");
    exit(1);
  }

  /* Create Event Set */
  if (PAPI_create_eventset(&EventSet) != PAPI_OK) {
    printf("Error creating eventset\n");
  exit(1);
  }

  /* Add Total Instructions Executed to eventset */
  if (PAPI_add_event (EventSet, PAPI_TOT_INS) != PAPI_OK) {
    printf("Error adding event\n");
    exit(1);
  }

  /* Start counting ... */
  if (PAPI_start (EventSet) != PAPI_OK) {
    printf("Error starting counts\n");
    exit(1);
  }

  /* Do some computation here...*/

  if (PAPI_read (EventSet, values) != PAPI_OK) {
    printf("Error stopping counts\n");
    exit(1);
  }
  printf("PAPI_TOT_INS = %lld\n", values[0]);
}

Compile papi_ll.c:

% cc -o papi_ll papi_ll.c

Run papi_ll:

% aprun ./papi_ll
PAPI_TOT_INS = 97
Application 520264 exit codes: 18
Application 520264 resources: utime ~0s, stime ~0s

9.19 Using CrayPat

This example shows how to instrument a program, run the instrumented program, and generate CrayPat reports.

Modules required:

perftools

and one of the following:

PrgEnv-cray
PrgEnv-pgi
PrgEnv-gnu
PrgEnv-pathscale
PrgEnv-intel

Source code of pa1.f90:

program main
include 'mpif.h'

  call MPI_Init(ierr)     ! Required
  call MPI_Comm_rank(MPI_COMM_WORLD,mype,ierr)
  call MPI_Comm_size(MPI_COMM_WORLD,npes,ierr)

  print *,'hello from pe',mype,' of',npes

  do i=1+mype,1000,npes   ! Distribute the work
    call work(i,mype)
  enddo

  call MPI_Finalize(ierr) ! Required
end

Source code of pa2.c:

#include <stdio.h>

void work_(int *N, int *MYPE)
{
  int n=*N, mype=*MYPE;

  if (n == 42) {
    printf("PE %d: sizeof(long) = %d\n",mype,sizeof(long));
    printf("PE %d: The answer is: %d\n",mype,n);
  }
}

Compile pa2.c and pa1.f90 and create executable perf:

% cc -c pa2.c
% ftn -o perf pa1.f90 pa2.o

Run pat_build to generate instrumented program perf+pat:

% pat_build -u -g mpi perf perf+pat
INFO: A trace intercept routine was created for the function 'MAIN_'.
INFO: A trace intercept routine was created for the function 'work_'.

The tracegroup (-g option) is mpi.

Run perf+pat:

% aprun -n 4 ./perf+pat | sort
CrayPat/X:  Version 5.0 Revision 2635  06/04/09 03:13:22
Experiment data file written:
/mnt/lustre_server/user12/perf+pat+1652-30tdt.xf
Application 582809 resources: utime ~0s, stime ~0s
 hello from pe            0  of            4
 hello from pe            1  of            4
 hello from pe            2  of            4
 hello from pe            3  of            4
PE 1: sizeof(long) = 8
PE 1: The answer is: 42

Note: When executed, the instrumented program creates the experiment data file progname+pat+PIDkeyletters.xf, where PID is the process ID that was assigned to the instrumented program at run time.

Run pat_report to generate reports perf.rpt1 (using default pat_report options) and perf.rpt2 (using the -O calltree option):

% pat_report perf+pat+1652-30tdt.xf > perf.rpt1
pat_report:  Creating file:   perf+pat+1652-30tdt.ap2
Data file 1/1: [....................]
% pat_report -O calltree perf+pat+1652-30tdt.xf > perf.rpt2
pat_report:  Using existing file:   perf+pat+1652-30tdt.ap2
Data file 1/1: [....................]
% pat_report -O calltree -f ap2 perf+pat+1652-30tdt.xf
Output redirected to:  perf+pat+1652-30tdt.ap2

Note: The -f ap2 option is used to create a *.ap2 file for input to Cray Apprentice2 (see Using Cray Apprentice2).

List perf.rpt1:

CrayPat/X:  Version 5.0 Revision 2635 (xf 2571)  06/04/09 03:13:22

Number of PEs (MPI ranks):      4

Number of Threads per PE:       1

Number of Cores per Processor:  4

<snip>


Table 1:  Profile by Function Group and Function

 Time % |     Time |Imb. Time |   Imb. | Calls |Group
        |          |          | Time % |       | Function
        |          |          |        |       |  PE='HIDE'

 100.0% | 0.000151 |       -- |     -- | 257.0 |Total
|----------------------------------------------------------
|  98.9% | 0.000150 |       -- |     -- | 253.0 |USER
||---------------------------------------------------------
||  81.0% | 0.000122 | 0.000002 |   2.3% |   1.0 |MAIN_
||  14.5% | 0.000022 | 0.000001 |   4.8% |   1.0 |exit
||   2.1% | 0.000003 | 0.000001 |  20.1% |   1.0 |main
||   1.2% | 0.000002 | 0.000000 |  10.2% | 250.0 |work_
||=========================================================
|   1.1% | 0.000002 |       -- |     -- |   4.0 |MPI
|==========================================================

<snip>

Table 2:  Load Balance with MPI Message Stats

 Time % |     Time |Group
        |          | PE

 100.0% | 0.000189 |Total
|------------------------
|  98.6% | 0.000186 |USER
||-----------------------
||  25.5% | 0.000193 |pe.1
||  24.7% | 0.000187 |pe.0
||  24.3% | 0.000183 |pe.2
||  24.1% | 0.000182 |pe.3
||=======================
|   1.4% | 0.000003 |MPI
||-----------------------
||   0.4% | 0.000003 |pe.1
||   0.4% | 0.000003 |pe.2
||   0.3% | 0.000003 |pe.0
||   0.3% | 0.000003 |pe.3
|========================

<snip>

Table 5:  Program Wall Clock Time, Memory High Water Mark

  Process |  Process |PE
     Time |    HiMem |
          | (MBytes) |

 0.033981 |       20 |Total
|-----------------------
| 0.034040 |   19.742 |pe.2
| 0.034023 |   19.750 |pe.3
| 0.034010 |   19.754 |pe.0
| 0.033851 |   19.750 |pe.1
|=======================

=========  Additional details ============================

Experiment:  trace

<snip>

Estimated minimum overhead per call of a traced function,
  which was subtracted from the data shown in this report
  (for raw data, use the option:  -s overhead=include):
    Time    0.241  microseconds

Number of traced functions: 102
  (To see the list, specify:  -s traced_functions=show)

List perf.rpt2:

CrayPat/X:  Version 5.0 Revision 2635 (xf 2571)  06/04/09 03:13:22

Number of PEs (MPI ranks):      4

Number of Threads per PE:       1

Number of Cores per Processor:  4

<snip>

Table 1:  Function Calltree View

 Time % |     Time | Calls |Calltree
        |          |       | PE='HIDE'

 100.0% | 0.000181 | 657.0 |Total
|-------------------------------------
|  69.7% | 0.000126 | 255.0 |MAIN_
||------------------------------------
||  67.7% | 0.000122 |   1.0 |MAIN_(exclusive)
||   1.0% | 0.000002 | 250.0 |work_
||====================================
|  12.2% | 0.000022 |   1.0 |exit
|   1.8% | 0.000003 |   1.0 |main
|=====================================

=========  Additional details ============================

Experiment:  trace

<snip>

Estimated minimum overhead per call of a traced function,
  which was subtracted from the data shown in this report
  (for raw data, use the option:  -s overhead=include):
    Time    0.241  microseconds

Number of traced functions: 102
  (To see the list, specify:  -s traced_functions=show)

9.20 Using Cray Apprentice2

In the CrayPat example (Using CrayPat), we ran the instrumented program perf+pat and generated the file perf+pat+1652-30tdt.ap2.

To view this Cray Apprentice2 file, first load the perftools module.

% module load perftools

Then launch Cray Apprentice2:

% app2 perf+pat+1652-30tdt.ap2

Display the results in call-graph form:

Figure 9. Cray Apprentice2 Callgraph
