Cray XT System Management

S-2393-30 - Mar 2010

This book helps administrators effectively manage and monitor Cray XT systems. Administrators should have a working knowledge of Linux system administration, because Cray XT systems run a combination of software developed by Cray Inc., third-party vendors, and the open source community.

Links

Available formats

PDF: S-2393-30.pdf
HTML: html-S-2393-30

Table of contents

1  Introduction
    1.1  Audience for This Guide
    1.2  Cray System Administration Publications
    1.3  Related Publications
2  Introducing Cray XT System Components
    2.1  System Management Workstation (SMW)
    2.2  CLE
    2.3  Boot Root File System
    2.4  Shared Root File System
    2.5  Service Partition
    2.5.1  Service Nodes
    2.5.1.1  Boot Node
    2.5.1.2  Service Database (SDB) Node
    2.5.1.3  Syslog Node
    2.5.1.4  Login Nodes
    2.5.1.5  Network Nodes
    2.5.1.6  I/O Nodes
    2.5.2  Services on the Service Partition
    2.5.2.1  Resiliency Communication Agent (RCA)
    2.5.2.2  Lustre File System
    2.5.2.3  Cray Data Virtualization Service (Cray DVS)
    2.5.2.4  Application Level Placement Scheduler (ALPS) for Compute Nodes
    2.5.2.5  IP Implementation
    2.6  Compute Partition
    2.6.1  Compute Nodes
    2.7  Job Launch Commands
    2.8  Node Health Checker (NHC)
    2.9  Comprehensive System Accounting (CSA)
    2.10  Checkpoint/Restart (CPR)
    2.11  Portals
    2.12  Optional Workload-management (Batch) System Software Products
    2.13  Hardware Supervisory System (HSS)
    2.13.1  HSS Network
    2.13.2  HSS Interface
    2.13.3  Blade Control Processor (L0 Controller) and Cabinet Control Processors (L1 Controller)
    2.13.4  NTP Server
    2.13.5  Event Router
    2.13.6  HSS Managers
    2.13.6.1  State Manager
    2.13.6.2  Boot Manager
    2.13.6.3  System Environmental Data Collections (SEDC) Manager
    2.13.6.4  Diagnostics Manager
    2.13.6.5  Power Manager
    2.13.6.6  Flash Manager
    2.13.6.7  Router Manager
    2.13.6.8  NID Manager
    2.13.7  xtdiscover Command
    2.13.8  Event Logs
    2.13.9  Boot Logs
    2.13.10  Dump Logs
    2.14  Cray Management Services (CMS)
    2.15  Storage
    2.16  Other Administrative Information
    2.16.1  Identifying Components
    2.16.1.1  Physical ID
    2.16.1.2  Node ID (NID)
    2.16.1.3  Class Name
    2.16.2  Topology Class
    2.16.3  Persistent /var Directory
    2.16.4  Default Network IP Addresses
    2.16.5  /etc/hosts Files
    2.16.6  Native IP (SSIP)
    2.16.7  Realm-Specific IP Addressing (RSIP) for CNL Compute Nodes
    2.16.8  Security Auditing
    2.16.9  Logging Failed Login Attempts
    2.16.10  Logical Machines
3  Managing the System
    3.1  Connecting the SMW to the Console of a Service Node
    3.2  Logging On to the Boot Node
    3.3  Preparing a Service Node and Compute Node Boot Image
    3.3.1  Using shell_bootimage_label.sh to Prepare Boot Images
    3.4  Changing Boot Parameters
    3.5  Booting Nodes
    3.5.1  Booting the System
    3.5.2  Using the xtcli boot Command to Boot a Node or Set of Nodes
    3.5.3  Rebooting a Single CNL Compute Node
    3.5.4  Rebooting Login or Network Nodes
    3.6  Requesting and Displaying System Routing
    3.7  Shutting Down Service Nodes Using the xtshutdown Command
    3.8  Shutting Down the System or Part of the System Using the xtcli shutdown Command
    3.9  Shutting Down the System Using the auto.xtshutdown File
    3.10  Stopping System Components
    3.10.1  Reserving a Component
    3.10.2  Powering Down a Node
    3.10.3  Powering Down a Component
    3.10.4  Powering Down a Single Blade
    3.10.5  Forcing Components to Power Down
    3.10.6  Halting Selected Nodes
    3.10.7  Powering Off L0 Controllers or Slots
    3.11  Restarting a System Component
    3.12  Aborting Active Sessions on HSS Managers
    3.13  Displaying and Changing Software System Status
    3.13.1  Displaying the Status of Nodes from the Operating System
    3.13.2  Viewing and Changing the Status of Nodes
    3.13.3  Marking a Compute Node as a Service Node
    3.13.4  Finding Node Information
    3.14  Displaying and Changing Hardware System Status
    3.14.1  Generating HSS Physical IDs
    3.14.2  Disabling Hardware Components
    3.14.3  Enabling Hardware Components
    3.14.4  Setting Components to Empty
    3.14.5  Locking Components
    3.14.6  Unlocking Components
    3.14.7  Determining How Service Nodes Are Configured by Looking at Hardware
    3.15  Performing Parallel Operations on Nodes
    3.16  Handling Component Failures
    3.17  Capturing and Analyzing System-level and Node-level Dumps
    3.17.1  Dumping Information Using the xtdumpsys Command
    3.17.2  ldump and lcrash Utilities for Node Memory Dump and Analysis
    3.18  Using xtnmi Command to Collect Debug Information from Hung Nodes
4  Monitoring System Activity
    4.1  Displaying Installed SMW Release Level
    4.2  Displaying Installed CLE Release Level
    4.3  Displaying Boot Configuration Information
    4.4  Monitoring Multiple Nodes
    4.5  Managing Log Files Using CLE and HSS Commands
    4.5.1  Filtering the Event Log
    4.5.2  Adding Entries to Log Files
    4.5.3  Examining Log Files
    4.5.4  Removing Old Log Files
    4.6  Managing Log Files Using the Cray Management Services (CMS) Log Manager
    4.7  Checking the Status of System Components
    4.8  Checking the Status of Compute Processors
    4.9  Checking CNL Compute Node Connection
    4.10  Checking Link Control Block and Router Errors
    4.11  Monitoring the Status of Jobs Started Under a Third-party Batch System
    4.12  Listing Running Jobs
    4.13  Using the cray_pam Module to Monitor Failed Login Attempts
    4.14  Monitoring DDN RAID
    4.15  Monitoring LSI Engenio RAID
    4.16  Monitoring HSS Managers
    4.16.1  Examining Activity on HSS Managers
    4.16.2  Checking the Health of HSS Managers
    4.17  Monitoring Events
    4.18  Monitoring Node Console Messages
    4.19  Running System Diagnostics
    4.20  Showing the Component Alert, Warning, and Location History
    4.21  Viewing System State
    4.21.1  Displaying Component Information
    4.21.2  Displaying Alerts and Warnings
    4.21.3  Clearing Flags
5  Managing User Access
    5.1  Load Balancing Across Login Nodes
    5.2  Passwords
    5.2.1  Changing Default SMW Passwords After Completing Installation
    5.2.2  Changing root and crayadm Passwords on Boot and Service Nodes
    5.2.3  Changing the root Password on CNL Compute Nodes
    5.2.4  Changing Default MySQL Passwords on the SDB
    5.2.5  Assigning and Changing User Passwords
    5.2.6  Logins That Do Not Require Passwords
    5.3  Administering Accounts
    5.3.1  Managing Boot Node Accounts
    5.3.2  Managing User Accounts on Service Nodes
    5.3.2.1  Adding a User or Group
    5.3.2.2  Removing a User or Group
    5.3.2.3  Changing User or Group Information
    5.3.2.4  Assigning Groups of Compute Nodes to a User Group
    5.3.3  Setting Disk Quotas for a User on the Cray Local, non-Lustre File System
    5.3.4  Associating Users with Projects
    5.4  System-wide Default Modulefiles
    5.5  User Access to a Compiler Environment Using Modulefiles
    5.6  Maintaining *rc.local Scripts
    5.7  Using the pam_listfile Module in the Shared Root Environment
    5.8  ulimit Stack Size Limit
    5.9  Stopping a User's Job
    5.9.1  Stopping a CNL Job Running in Interactive Mode
    5.9.2  Stopping a Job Running Under a Batch System
6  Modifying an Installed System
    6.1  PBS Professional Licensing Requirements for Cray Systems
    6.2  Disabling Secure Shell (SSH) on Compute Nodes
    6.3  Modifying SSH Keys for Compute Nodes
    6.4  Configuring the System Environmental Data Collector (SEDC)
    6.5  Configuring the Shared-root File System on Service Nodes
    6.5.1  Specialization
    6.5.2  Visible Shared-root File System Layout
    6.5.3  How Specialization Is Implemented
    6.5.4  Working with the Shared-root File System
    6.5.4.1  Managing System Configuration with the xtopview Tool
    6.5.4.2  Updating Specialized Files from within the xtopview Shell
    6.5.4.3  Specializing Files
    6.5.4.4  Determining which Files are Specialized
    6.5.4.5  Checking Shared-root Configuration
    6.5.4.6  Verifying the Coherency of /etc/init.d Files Across All Shared Root Views
    6.5.4.7  Cloning a Shared-root Hierarchy
    6.5.4.8  Changing the Class of a Node
    6.5.4.9  Removing Specialization
    6.5.4.10  Displaying RCS Log Information for Shared Root Files
    6.5.4.11  Checking Out an RCS Version of Shared Root Files
    6.5.4.12  Listing Shared Root File Specification and Version Information
    6.5.4.13  Performing Archive Operations on Shared Root Files
    6.5.5  Logging Shared-root Activity
    6.6  Configuring Optional RPMs in the CNL Boot Image
    6.7  Configuring Cray Enhanced Linux Security Features
    6.7.1  Security Auditing and Cray Audit Extensions
    6.7.1.1  Lustre File System Requirements for Cray Audit
    6.7.1.2  System Performance Considerations for Cray Audit
    6.7.2  Using the cray_pam PAM to Log Failed Login Attempts
    6.8  Configuring cron Services
    6.9  Configuring the Load Balancer
    6.10  Configuring Node Health Checker (NHC)
    6.10.1  /etc/opt/cray/nodehealth/nodehealth.conf Configuration File
    6.10.2  Configuring Node Health Checker Tests
    6.10.2.1  Global Configuration Variables That Affect All NHC Tests
    6.10.2.2  Standard Variables Used With Each NHC Test
    6.10.3  Suspect Mode
    6.10.4  NHC Messages
    6.10.5  Warm Booting a Compute Node in suspect State
    6.10.6  Node Remains in suspect State Across a Reboot or After Panics or Crashes
    6.10.7  What if a Login Node Crashes While xtcheckhealth Binaries are Monitoring Nodes?
    6.10.8  Disabling NHC
    6.10.9  Configuring the Node Health Checker to Use SSL
    6.11  Activating Process Accounting for Service Nodes
    6.12  Configuring Boot-node Failover
    6.13  Creating Logical Machines
    6.13.1  Creating Routable Logical Machines
    6.13.1.1  Topology Class 0
    6.13.1.2  Topology Class 1
    6.13.1.3  Topology Class 2
    6.13.1.4  Topology Class 3
    6.13.2  Configuring a Logical Machine
    6.13.3  Booting a Logical Machine
    6.14  Updating Boot Configuration
    6.15  Modifying Boot Automation Files
    6.16  Callout to rc.local During Boot
    6.17  Changing the System Software Version to Be Booted
    6.17.1  Minor Release Switching within a System Set
    6.17.2  Major Release Switching using Separate System Sets
    6.18  Changing the Service Database (SDB)
    6.18.1  Service Database Tables
    6.18.2  Database Security
    6.18.3  Updating Database Tables
    6.18.3.1  Changing Nodes and Classes
    6.18.3.2  Changing Services
    6.19  Viewing the Service Database Contents with MySQL Commands
    6.20  Configuring the Lustre File System
    6.21  Configuring Cray Data Virtualization Service (Cray DVS)
    6.22  Enabling File-locking for Lustre Clients
    6.23  Setting and Viewing Node Attributes
    6.23.1  Setting Node Attributes Using the /etc/opt/cray/sdb/attr.xthwinv and /etc/opt/cray/sdb/attr.defaults Files
    6.23.1.1  Enabling Node Attributes during Boot Process
    6.23.1.2  Generating the /etc/opt/cray/sdb/attributes File
    6.23.2  SDB attributes Table
    6.23.3  Setting Attributes Using the xtprocadmin Command
    6.23.4  Viewing Node Attributes
    6.24  Using the XTAdmin Database segment Table
    6.25  Configuring Networking Services
    6.25.1  Changing the High-speed Network (HSN)
    6.25.2  Network File System (NFS)
    6.25.3  Configuring Ethernet Link Aggregation (Bonding, Channel Bonding)
    6.25.4  Configuring the Virtual Channel (VC)
    6.25.5  Increasing Size of ARP Tables
    6.25.6  Configuring Native IP (SSIP)
    6.25.7  Configuring Realm-Specific IP Addressing (RSIP)
    6.25.7.1  Using the XTinstall Program to Install and Configure RSIP
    6.25.8  IP Routes for CNL Nodes in the /etc/routes File
    6.26  Updating the System Configuration After A Hardware Change
    6.27  Changing the Location to Log syslog-ng Information
7  Managing Services
    7.1  Configuring the SMW to Synchronize to a Site NTP Server
    7.2  Synchronizing Time of Day on Compute Node clocks with the Clock on the Boot Node
    7.3  Adding and Starting a Service Using Standard Linux Mechanisms
    7.4  Adding and Starting a Service Using RCA
    7.4.1  Adding a Service to List of Services Available under RCA
    7.4.2  Indicating Nodes on Which the Service Will Be Started
    7.5  Creating a Snapshot of /var
    7.6  Setting Soft and Hard Limits to Prevent Login Node Hangs
    7.7  Handling Bus Errors
    7.8  Creating a Cray System Management Workstation (SMW) Bootable Backup Drive
    7.9  Setting Up the Bootable Backup Drive as an Alternate Boot Device
    7.10  Backing Up the System Configuration to a DVD-R or CD-R Media
    7.11  Archiving the SDB
    7.12  Backing Up Limited Shared-root Configuration Data
    7.12.1  Using the xtoparchive Utility to Archive the Shared-root File System
    7.12.2  Using Linux Utilities to Save the Shared-root File System
    7.13  Backing Up Boot Root and Shared Root
    7.13.1  Using the xthotbackup Command to Back Up Boot Root and Shared Root
    7.13.2  Using dump and restore Commands to Back Up Boot Root and Shared Root
    7.14  Backing Up User Data
    7.15  Recovering the System Configuration
    7.16  Rebooting a Stopped SMW
    7.16.1  SMW Recovery
    7.17  Recovering from Service Database Failure
    7.17.1  Database Server Failover
    7.17.2  Rebuilding Corrupted SDB Tables
    7.18  Using Persistent SCSI Device Names
    7.18.1  Using cray-scsidev-emulation Device Naming
    7.19  Using a Linux iptables Firewall to Limit Services
    7.20  Handling Single-node Failures
    7.21  Increasing the Boot Manager Time-out Value
    7.22  RAID Failure
8  Using the Application Level Placement Scheduler (ALPS)
    8.1  ALPS Functionality
    8.2  ALPS Architecture
    8.2.1  ALPS Clients
    8.2.1.1  The aprun Client
    8.2.1.2  The apstat Client
    8.2.1.3  The apkill Client
    8.2.1.4  The apmgr Client
    8.2.1.5  The apbasil Client
    8.2.2  ALPS Daemons
    8.2.2.1  The apbridge Daemon
    8.2.2.2  The apsched Daemon
    8.2.2.3  The apsys Daemon
    8.2.2.4  The apwatch Daemon
    8.2.2.5  The apinit Daemon
    8.2.2.6  The apres Daemon
    8.2.2.7  ALPS Log Files
    8.2.2.8  Changing Debug Message Level of apsched and apsys Daemons
    8.3  Configuring ALPS
    8.3.1  /etc/sysconfig/alps Configuration File
    8.3.2  /etc/alps.conf Configuration File
    8.4  Resynchronizing ALPS and the SDB Command After Manually Changing the SDB
    8.5  Identifying Reserved Resources
    8.6  Terminating a Batch Job
    8.7  Setting a Compute Node to Batch or Interactive Mode
    8.8  Manually Starting and Stopping ALPS Daemons on Service Nodes
    8.9  Manually Cleaning ALPS and PBS After Downed Login Node
    8.10  Verifying that ALPS is Communicating with Cray XT Compute Nodes
    8.11  ALPS and Node Health Monitoring Interaction
    8.11.1  aprun Actions
    8.11.2  apinit Actions
    8.11.3  apsys Actions
    8.11.4  apmgrcleanup Actions
    8.11.5  Node Health Checker Actions
    8.11.6  Verifying Application Cleanup
9  Using Comprehensive System Accounting
    9.1  Interacting with Batch Entry Systems or the PAM job Module
    9.2  CSA Configuration File Values
    9.3  Configuring CSA
    9.3.1  Obtaining File System and Node Information
    9.3.2  Editing the csa.conf File
    9.3.3  Editing Other System Configuration Files
    9.3.4  Creating a CNL Image with CSA Enabled
    9.3.5  Setting Up Project Accounting
    9.3.5.1  Disabling Project Accounting
    9.3.6  Setting Up Job Accounting
    9.4  Creating Accounting cron Jobs
    9.4.1  csanodeacct() cron Job for Login Nodes
    9.4.2  csarun() cron Job
    9.4.3  csaperiod() cron Job
    9.5  Enabling CSA
    9.6  Using LDAP with CSA
10  Using Checkpoint/Restart on Cray Systems
    10.1  Requirements and/or Limitations for Checkpoint/Restart
    10.2  Installation and Configuration
    10.2.1  Cray XT Installation and Configuration Options
    10.2.2  Configuring TORQUE and Moab to Work with CPR
    10.2.3  Configuring PBS Professional to Work with CPR
    10.3  Using Checkpoint/Restart
    10.3.1  Compiling Applications
    10.3.2  Using Checkpoint/Restart with TORQUE and Moab
    10.3.2.1  Common Checkpoint/Restart Error Messages
    10.3.3  Using Checkpoint/Restart with PBS Professional
11  OpenFabrics Interconnect Drivers for Cray XT Systems
    11.1  OFED Overview
    11.2  Using InfiniBand
    11.2.1  Storage Area Networking
    11.2.2  Lustre Routing
    11.2.3  IP Connectivity
    11.3  Configuration
    11.4  InfiniBand Configuration
    11.5  Subnet Manager (OpenSM) Configuration
    11.5.1  Starting OpenSM at Boot Time
    11.6  Internet Protocol over InfiniBand (IPoIB) Configuration
    11.7  Configuring SCSI RDMA Protocol (SRP) on Cray XT Systems
    11.8  Lustre Networking (LNET) Router
    11.8.1  Configuring the LNET router
    11.8.2  Configuring the InfiniBand Lustre Server
    11.8.3  Configuring the Portals Lustre Clients
    11.9  Sample Lustre Router Control File
SMW and CLE System Administration Commands
System States
Error Codes
Remote Access to the SMW
Updating the Time Zone
Creating Modulefiles
PBS Professional Licensing for Cray Systems
Utilities for Cray Service Personnel Use

Software Releases this book supports

Product                       Version  Sub Product  Release Date
Cray Linux Environment (CLE)  3.0                   Mar 2010

Other versions of this book

Publication Number  Release Date  Supported Software Releases
S-2393-52xx         Mar 2014      Cray Linux Environment (CLE) 5.2.UP00
S-2393-52xc         Mar 2014      Cray Linux Environment (CLE) 5.2.UP00
S-2393-5101         Dec 2013      Cray Linux Environment (CLE) 5.1.UP01
S-2393-4202         Oct 2013      Cray Linux Environment (CLE) 4.2.UP02
S-2393-51           Sep 2013      Cray Linux Environment (CLE) 5.1.UP00
S-2393-4201         Jul 2013      Cray Linux Environment (CLE) 4.2.UP01
S-2393-5003         Jun 2013      Cray Linux Environment (CLE) 5.0.UP03
S-2393-42           Apr 2013      Cray Linux Environment (CLE) 4.2
S-2393-5002         Mar 2013      Cray Linux Environment (CLE) 5.0.UP02
S-2393-4101         Dec 2012      Cray Linux Environment (CLE) 4.1.UP01
S-2393-4003         Mar 2012      Cray Linux Environment (CLE) 4.0.UP03
S-2393-4002         Dec 2011      Cray Linux Environment (CLE) 4.0.UP02
S-2393-4001         Sep 2011      Cray Linux Environment (CLE) 4.0.UP01
S-2393-3102         Jan 2011      Cray Linux Environment (CLE) 3.1.UP02
S-2393-31           Jun 2010      Cray Linux Environment (CLE) 3.1
S-2393-15           Nov 2006      System Management Workstation (SMW) 1.5, UNICOS/lc 1.5
S-2393-14           May 2006      System Management Workstation (SMW) 1.4, UNICOS/lc 1.4
S-2393-13           Nov 2005      System Management Workstation (SMW) 1.3, UNICOS/lc 1.3
S-2393-12           Aug 2005      Cray XT3 Programming Environment 1.2, System Management Workstation (SMW) 1.2, UNICOS/lc 1.2
S-2393-11           Jun 2005      System Management Workstation (SMW) 1.1, UNICOS/lc 1.1