APE - The Array Processor Experiment

INFN apeNEXT Computing Center

On this page, we try to bootstrap users on the use of the apeNEXT supercomputer in the Rome Computing Center:

  1. Installation structure
  2. Front-End workstation
  3. Compilation environment
  4. Execution environment
  5. Program exceptions
  6. From exceptions to source code

4 apeNEXT racks

The staff are reachable at this email alias.

Please read these presentations for more details on the developer tools, optimization methods, etc.

Installation structure

When complete, the apeNEXT installation will host a small intranet of roughly 70 computers:

  • theboss: the user front-end workstation. It is a system with two dual-core 1.8 GHz Opteron 265 processors and 2 GB of RAM, running Fedora Core 4 and exporting a 1 TB RAID5 volume for the user home directories.
  • rack1,...,rack13: the computers hosting one apeNEXT rack each.
  • blade1,...,blade4: the computers hosting one unit of each rack.
  • local storage servers: 13 iSCSI modules with 1 TB each, providing local-to-rack scratch storage. (Available soon.)
  • a storage server: a FibreChannel mass storage device with 25 TB of raw space. (Available soon.)
  • services: one or more computers for firewalling, web server, database, etc.

WARNING: The user home directories reside on a 1 TB RAID5 filesystem, which currently imposes a quota of 10 GB of storage per user.

Front-End workstation

To log onto the front-end workstation, please use:

ssh -p 23 username@apegate.roma1.infn.it
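
For convenience, the non-standard port can be recorded as a host alias in ~/.ssh/config (a sketch; the alias name apenext is arbitrary, and username stands for your account name):

```
Host apenext
    HostName apegate.roma1.infn.it
    Port 23
    User username
```

Afterwards, plain ssh apenext suffices.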

Compilation environment

To properly define the apeNEXT developer environment, please use:

source /nroot/nlogin (tcsh)
. /nroot/nlogin.sh (sh/bash/zsh)

For ease of access, we link here a document briefly describing the syntax changes in the latest version of the rtc TAO compiler for apeNEXT (ver. 0.1.43) as opposed to the APEmille/APE100 TAO compilers.

This is an example Makefile for apeNEXT programs:

CC=nlcc
CFLAGS=-O2
CPPFLAGS=-I.
COMPILE.c=$(CC) $(CFLAGS) $(CPPFLAGS)
EXPAPE=expape
MPP=mpp -os7
RTC=rtc -anext
SOFAN=sofan
SHAKER=npsk +a -k -regrename

.SUFFIXES: .c_ .sasm .masm .mem .no .c .smasm .i
.PRECIOUS: %.c_ %.sasm %.masm %.mem %.no %.smasm %.i

%.sasm: %.c
	$(COMPILE.c) -o $@ -S $?

%.sasm: %.zzt
	$(RTC) $?

%.masm: %.sasm
	$(MPP) -o $@ $?

%.mem:  %.smasm
	$(SOFAN) $? $@ && \
	$(SHAKER) $?

all: test1.mem test2.mem

Execution environment

As of today, our installation uses Torque, a derivative of Portable Batch System (PBS).

apeNEXT resources, i.e. racks, crates, units and boards, are modeled as queues, with one Torque queue for each resource type, so eventually there will be four queues in total. As of today, there is only one queue: crate.

All of the standard production queues are limited to 24 hours of run time. This seemingly strict requirement leaves enough scheduling opportunities to fulfill load-balance and resource-assignment quotas.

As soon as possible, we will expose one or more test boards, useful for running short test jobs.

Job output only becomes available at the end of the batch job's life, as it is stored locally on the host machine while the job runs. After the job ends, the output file is delivered back to the front-end host, theboss.
To guarantee delivery of job output back to theboss, please make sure your shell init script does not print any messages or manipulate the terminal. For instance, adding source /nroot/nlogin there is dangerous.
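
One way to keep an init script batch-safe is to guard interactive-only commands on the shell's i flag; a sketch (the echoed strings are just placeholders):

```shell
# Guard interactive-only commands so batch shells stay silent and
# never touch the terminal; shared setup goes outside the guard.
case $- in
  *i*)
    echo "welcome"   # interactive login: safe to print
    ;;
  *)
    :                # non-interactive (e.g. Torque batch): do nothing
    ;;
esac
echo "common setup runs in both cases"
```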

A typical batch job is an executable script file, let's call it run.sh, such as:

#!/bin/sh
cd "$PBS_O_WORKDIR" || exit 1

LOGFILE=job.log
# redirect stdout to $LOGFILE
exec 1>"$LOGFILE"
# redirect stderr to stdout
exec 2>&1

echo "PBS_O_WORKDIR=$PBS_O_WORKDIR"
echo "PBS_QUEUE    =$PBS_QUEUE"
echo "PBS_JOBNAME  =$PBS_JOBNAME"
echo "PBS_JOBID    =$PBS_JOBID"
echo "HOSTNAME     =$HOSTNAME"

. /nroot/nlogin.sh

nrun -dnose -minit0
# other stuff
nrun -dnose -Mnet 0x200005553fff myprogram.mem

WARNING: note the use of the -Mnet nrun switch to mask I/O exceptions during the transition period needed to fix some nasty problems in the apeNEXT I/O link HW.

Job submission is done via the nsub tool.

$ nsub -c crate run.sh
194.theboss

The string printed by nsub is your job id.

Job status is queried with the qstat command.

$ qstat -a

theboss:
                                                                   Req'd  Req'd   Elap
Job ID               Username Queue    Jobname    SessID NDS   TSK Memory Time  S Time
-------------------- -------- -------- ---------- ------ ----- --- ------ ----- - -----
194.theboss          rossetti crate    RUN.sh      15158     1  --    --  24:00 R   --

The status of the computing nodes is obtained via pbsnodes -a. This way, the job-to-host assignment can be deduced.

$ pbsnodes -a
rack1
     state = free
     np = 2
     ntype = cluster
     status = opsys=linux,uname=Linux rack1.ape 2.6.14-1.1656_FC4smp #1 SMP Thu Jan 5 22:24:06 EST 2006 i686,sessions=2437 14992 15013 15036 15057 3043,nsessions=6,nusers=3,idletime=13524,totmem=4106328kb,availmem=4024984kb,physmem=2074720kb,ncpus=4,loadave=0.00,netload=3801077742,state=free,jobs=? 15201,rectime=1143214899

rack3
     state = free
     np = 2
     ntype = cluster
     jobs = 0/194.theboss
     status = opsys=linux,uname=Linux rack3 2.6.15-1.1831_FC4smp #1 SMP Tue Feb 7 13:48:31 EST 2006 i686,sessions=2285 11099 11121 11137 11183 4104 15158,nsessions=7,nusers=3,idletime=7580,totmem=4100084kb,availmem=3978164kb,physmem=2068476kb,ncpus=4,loadave=0.32,netload=1511802068,state=free,jobs=194.theboss,rectime=1143214894

rack6
     state = free
     np = 2
     ntype = cluster
     status = opsys=linux,uname=Linux rack6 2.6.15-1.1831_FC4smp #1 SMP Tue Feb 7 13:48:31 EST 2006 i686,sessions=2446 25206 25225 25243 25268,nsessions=5,nusers=2,idletime=84650,totmem=4100084kb,availmem=4017812kb,physmem=2068476kb,ncpus=4,loadave=0.00,netload=2636850633,state=free,jobs=? 15201,rectime=1143214929

rack7
     state = free
     np = 2
     ntype = cluster
     status = opsys=linux,uname=Linux rack7 2.6.15-1.1831_FC4smp #1 SMP Tue Feb 7 13:48:31 EST 2006 i686,sessions=2518 25022 25058 25079 25102,nsessions=5,nusers=2,idletime=84371,totmem=4100084kb,availmem=4017776kb,physmem=2068476kb,ncpus=4,loadave=0.00,netload=3443563005,state=free,jobs=? 15201,rectime=1143214895

Program exceptions

In case of exceptions, a register dump is produced both on stderr and in the file Exception.log. If this file is already present, the old file is renamed to Exception.log-date-time and the new one takes its place. That file is precious for deducing the source of the exception, so it should be kept and submitted to the apeNEXT staff. Use this HW register list, a stripped-down version of the original, to decode the exceptions to some extent.

Note that Exception.log holds a group of register-dump lines for each node:

N[4,0,0] (C01,B00,N00)  BSREG: 0x00000000.00000000.00000000.FFFFFFFF

N[4,0,0] (C01,B00,N00)  SREG: 0x00000000.00000000.00000000.00001015
N[4,0,0] (C01,B00,N00)  GLOBAL KILL
N[4,0,0] (C01,B00,N00)  CR_EDACEXC (0x10): 0x00000000.00000000.00000000.00000000
N[4,0,0] (C01,B00,N00)  CR_MEMEXC  (0x20): 0x00000000.00000000.00000000.00004000
N[4,0,0] (C01,B00,N00)  CR_DMAMON  (0x22): 0x00000000.00000000.000C0000.001C761F
N[4,0,0] (C01,B00,N00)  CR_PC      (0x26): 0x00000000.00000000.0CFE69B2.0CFE6979
N[4,0,0] (C01,B00,N00)  CR_FILUEXC (0x30): 0x00000000.00000000.00000000.00000000
N[4,0,0] (C01,B00,N00)  CR_STKEXC  (0x40): 0x00000000.00000000.00000000.00000008
N[4,0,0] (C01,B00,N00)  CR_NETEXC  (0xD0): 0x00000000.00000000.00000000.00000000
...
N[6,7,4] (C01,B14,N10)  SREG: 0x00000000.00000000.00000000.00001085
N[6,7,4] (C01,B14,N10)  AGU/FILU EXCEPTION
N[6,7,4] (C01,B14,N10)  CR_EDACEXC (0x10): 0x00000000.00000000.00000000.00000000
N[6,7,4] (C01,B14,N10)  CR_MEMEXC  (0x20): 0x00000000.00000000.00000000.00004000
N[6,7,4] (C01,B14,N10)  CR_DMAMON  (0x22): 0x00000000.00000000.000C0000.001C0F47
N[6,7,4] (C01,B14,N10)  CR_PC      (0x26): 0x00000000.00000000.0CFE64A0.0CFE669D
N[6,7,4] (C01,B14,N10)  CR_FILUEXC (0x30): 0x00000000.00000000.00000000.0000000C
N[6,7,4] (C01,B14,N10)  CR_STKEXC  (0x40): 0x00000000.00000000.00000000.00000000
N[6,7,4] (C01,B14,N10)  CR_NETEXC  (0xD0): 0x00000000.00000000.00000000.00000000
...

Some nodes are real sources of exceptions, while others get killed as a consequence of the death of the former. Killed nodes are marked with GLOBAL KILL and are less interesting than the others.

In the example above, node [6,7,4] raised an exception of AGU/FILU type. To dig into the details, we have to decode register 0x30, i.e. the CR_FILUEXC register. Looking it up in the HW register list:

Address 0x30:	Exceptions                                    (CrFiluExc)
==========================
[0]	00000000 00000001	RW	AluBadOpL
[1]	00000000 00000002	RW	AluBadOpH
[2]	00000000 00000004	RW	AluDenInL
[3]	00000000 00000008	RW	AluDenInH
[4]	00000000 00000010	RW	AluDenOutL+)
[5]	00000000 00000020	RW	AluDenOutH+)
[6]	00000000 00000040	RW	AluOvfL (incl. IDiv0)
[7]	00000000 00000080	RW	AluOvfH (incl. IDiv0)

We decode the value 0x0C as AluDenInL + AluDenInH; that is, both the low/real and the high/complex part of a complex/vector calculation happened to contain denormalized IEEE 754 numbers, i.e. memory words whose bits do not form valid normalized values. For instance, this may occur in case of bad array indexing.
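
As a sketch, the bit table above can be applied mechanically in shell arithmetic; the register names are the eight listed bits, and 0x0C is the low word of CR_FILUEXC from the example:

```shell
# Print the name of every set bit in a CR_FILUEXC value,
# following the bit order of the register table above.
decode_filuexc() {
    val=$(( $1 ))
    bit=0
    for name in AluBadOpL AluBadOpH AluDenInL AluDenInH \
                AluDenOutL AluDenOutH AluOvfL AluOvfH; do
        if [ $(( (val >> bit) & 1 )) -eq 1 ]; then
            echo "$name"
        fi
        bit=$(( bit + 1 ))
    done
}

decode_filuexc 0x0C   # value from the example Exception.log
```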

From exceptions to source code

It is possible to get an idea of the zone of the original source code that raised the exception. The procedure is a bit elaborate, and it could well be wrapped in a script at some point in the future:

  • Produce the microcode dump:
    $ dispminit foo.mem >/tmp/foo.ncd
    
  • Look up the PMA (Program Memory Address) of the exception in register CR_PC in Exception.log; in the example above:
    N[6,7,4] (C01,B14,N10)  CR_PC      (0x26): 0x00000000.00000000.0CFE64A0.0CFE669D
    
    the PMA is 0x0CFE669D = 0x0C000000 + 0x00FE669D, where 0x0C000000 means that the code was executed in cache mode. So the real address to look up is 0x00FE669D.
  • View /tmp/foo.ncd and scroll down to the first section header above the faulty 0x00FE669D microcode line:
    !!   LABEL: GL_0x1154            STARTPMA: 0x00FD5E6E  ENDPMA: 0x00FD5E93  LEN: 91
    00FE6445: 0cfe6c00   0    -     -    -    -     E     -   0  00  0   0 00  ISUB  0   0 00 00 00 11       0x00FD5E6E
    00FE6446: 00000000   0    -     -    -    -    -    LFPC  0  00  0   0 00    -   0   0 00 00 00 00       0x00FD5E6F
    ...
    00FE669A: 000001b0   0    -     -    -    -   RXE    LAL  0  2a  0   0 a3    -   3   0 a2 00 00 00  1    0x00FD5F5C
    00FE669B: 00000000   0    -     -    -    -    -      -   0  00  0   0 a2    -   3   0 a3 00 00 00       0x00FD5F5D
    00FE669C: 00000000   0    -     -    -    -    RA     -   0  00  0   0 a3    -   3   0 a2 00 00 00  1    0x00FD5F5D
    00FE669D: 00000000   0   M2Q    -    -    -    -      -   0  00  0   0 a2    -   3   0 a3 00 00 00       0x00FD5F5E
    ...
    
  • Note the section name, GL_0x1154, and look it up in foo.sasm:
    !! --- 6120         do ix=0,(VOL3-1)
    IADD   25 0 0 U
    ATR    4099 0x1b0 U
    IADD   17 4099 0 U
    IADD   18 1 0 U
    LABEL GL_0x1154
    PRAGMA_MAXPHYS 45 146
    IPUSHLE.L 17 0
    JUMPIF GL_0x1155+0
    !! --- 6121 
    !! --- 6122            ixp1=iup[ix,0]
    ATR    4096 25#3 U
    LMTR    4097 :1  0x4cfd24.4096 U
    IADD   26 4097 0 U
    
  • Watch for TAO code embedded in the sasm comments, which carries the source code line numbers.
  • Go to the faulty TAO code zone, which in the case above is the code beginning at line 6122 of foo.zzt.

 

CVS $Revision: 1.26 $