NOvA Computing Do's and Don'ts

The computing infrastructure at Fermilab can be complicated at times. These pages are designed to address some of the subtle aspects of the FNAL setup and how certain coding/scripting patterns can be used to make your analysis jobs and scripts run smoothly.

Offline Computing nodes (nova-offline.fnal.gov, novagpvmXX)

These machines are provided as workspaces for people to develop analysis code and do limited testing of that code. They can also be used for interactive analysis, making histograms and looking at events.

HOWEVER... they are a shared resource, and the things you do can affect your fellow NOvA collaborators (and sometimes things outside of NOvA!)

Here are some simple Do's and don'ts related to the computing nodes:

General

Don't...

  • Leave rogue programs running
  • Leave large temporary files on disk
  • Leave stealth login sessions running

Sometimes when you have problems with your code you will try to close a window and inadvertently leave a program running. Some versions of ROOT are especially prone to this and will not die properly, leaving behind a "zombie" process. These programs continue to run and eat up resources (especially if you had an infinite loop that caused ROOT to go nuts!), making it hard or impossible for other people to use the machine. A quick way to spot such leftovers is shown below.
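To check for runaway processes, you can list your own processes sorted by CPU usage and kill anything you no longer recognize. A minimal sketch using the standard ps and kill commands (here <pid> is a placeholder for the process ID you find):

# List your own processes, heaviest CPU users first
ps -u ${USER} -o pid,etime,pcpu,pmem,comm --sort=-pcpu

# Kill a leftover "zombie" by its process ID
kill <pid>       # polite termination first (SIGTERM)
kill -9 <pid>    # last resort if it ignores SIGTERM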

Similarly, storing large (temporary) files on the /scratch disks eats up the space that is available. Your file by itself might not fill the disk, but if enough of them pile up there are problems, so check your usage from time to time (see below).
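A quick way to see how much space you are taking up is with du and df. This sketch assumes your files live under /scratch/nova (the default temporary area used later on this page) in a directory named after you; adjust the path to wherever you actually put your files:

# Summarize the total size of everything under your scratch directory
du -sh /scratch/nova/${USER}

# See how full the scratch disk is overall
df -h /scratch/nova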

Do...

Set limits for your own jobs. The Unix "ulimit" command allows you to set per-process limits for many of the different resources your jobs might need. Setting a limit on the amount of CPU time/memory/file size that you let your jobs take will ensure that when they do go crazy, they won't spin out of control (instead they hit their limit and die).

To set up limits, first see what your current limits are ("ulimit -a" shows this; if you are using bash the command is "ulimit", see the man page for details):

[anorman@novagpvm06 ~]$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 102400
max locked memory       (kbytes, -l) 32
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 102400
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

By default most things are unlimited. To change this (here, setting the max CPU time to 300 seconds, i.e. five minutes):

ulimit -t 300

Now if you ran a job and it went crazy, it would die after 5 minutes of flat-out CPU running. (Note: this is not wall time but CPU time. Jobs like the event display take very little actual CPU time compared to the time you spend looking at them.)

And yes, limit ROOT!
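Note that ulimit applies to your current shell and everything started from it. If you want to limit a single job without constraining the rest of your session, set the limit inside a subshell. A minimal sketch, where myMacro.C is a hypothetical stand-in for whatever you are running:

# The parentheses spawn a subshell: the limit applies only to the
# commands inside it, not to your login shell
( ulimit -t 300; root -l -b -q myMacro.C )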

Clean up after yourself

Make sure that temporary files you created are removed when your script/job crashes or when you are done with them. To do this in your scripts you can follow the standard pattern of a "cleanup" function that gets run on exit (no matter how the script exits!):

#!/bin/bash

TMP_FILES=""

function cleanup {
  rm -f ${TMP_FILES}
}

# Run cleanup whenever the script exits, for any reason
trap cleanup EXIT

# Now make the files and play with them
MYFILE=$(mktemp)
TMP_FILES="${TMP_FILES} ${MYFILE}"

# ... do stuff with ${MYFILE} ...

# Exit; when you do, cleanup gets called
exit


In practice, the right way to do this (and a very robust one) is to register a list of commands that are run on exit.

#######################################
# Declare an array to hold commands to run on exit
declare -a on_exit_items

# Run every registered command when the script exits
function on_exit() {
  echo "Performing cleanup on exit"
  for i in "${on_exit_items[@]}"
  do
    echo "$i"
    eval "$i"
  done
}

#######################################
# Function to add a command to the exit list
function cleanup_on_exit() {
  local n=${#on_exit_items[*]}
  on_exit_items[$n]="$*"
  if [[ $n -eq 0 ]]; then
    echo "Setting trap for cleanup on exit"
    trap on_exit EXIT
  fi
}


The first function (on_exit) is what gets run when the script terminates in almost any fashion (only a kill -9 prevents this). The second function registers a command to be run at that point; the first time it is called it also sets the trap for the EXIT condition (so you never have to remember to do it).

The way you use this is that as soon as you make (or think you might make) something that will need to be cleaned up, you register its cleanup function. The actual cleanup is deferred to exit.

For example, if you have a bunch of files in a temp directory, you would do something like:

DIR=$(mktemp -p ${TMPDIR} -d)
echo "Using temporary directory ${DIR}"

# These are the types of files we want to clean up;
# set variables for them so we can auto clean up
TEMP_FILES1="part*.fcl"
TEMP_FILES2="stage*.root"
TEMP_FILES3="merged.root"
TEMP_FILES4="pid.root"
TEMP_FILES5="lemid_hist.root"

# Register each command we want run on exit
cleanup_on_exit rm -f ${DIR}/${TEMP_FILES1}
cleanup_on_exit rm -f ${DIR}/${TEMP_FILES2}
cleanup_on_exit rm -f ${DIR}/${TEMP_FILES3}
cleanup_on_exit rm -f ${DIR}/${TEMP_FILES4}
cleanup_on_exit rm -f ${DIR}/${TEMP_FILES5}
cleanup_on_exit rmdir ${DIR}


Here "cleanup_on_exit" makes sure that everything gets cleaned up at the very end.
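If everything you create lives inside the one temporary directory, a single registration is enough. A minimal alternative sketch (be careful: rm -rf removes the directory and everything in it, so only use this if nothing you want to keep can end up in ${DIR}):

# One registered command covers the directory and all of its contents
cleanup_on_exit rm -rf ${DIR}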

Files

Don't...

  • Write files to /tmp

The /tmp area on these machines is reserved for special system files, and if it fills up you kill the entire machine. This area is designed to be small and holds things like your Kerberos ticket and other small files. It cannot hold large files (e.g. ROOT files!)

Do...

  • Write files to the /scratch area. This is a disk that is pseudo-local to the machine you are running on and is designed to hold files that you only need for a short time. Make (or copy) your intermediate files in this location and clean them up when you are done.

You can still use the mktemp command to ensure that you get unique file names, but you will need to redirect the base path of where they are made. As an example:


# Set up some sensible defaults
USRTEMP=""
DEFAULT_TEMP=/scratch/nova

# Now IF we set USRTEMP to something then we will make
# our files in that area; if we don't then it will default
# to the $DEFAULT_TEMP area

# Set up the temporary location where we put files
export TMPDIR=${USRTEMP:-${DEFAULT_TEMP}}

# Here $TMPDIR is a special environment variable that is checked
# by the mktemp program.

# If the base directory does not exist then make it
if [ ! -d ${TMPDIR} ] ; then
  # The directory does not exist so we make it
  mkdir -p ${TMPDIR}
  if [ $? -ne 0 ] ; then
    exit 1
  fi
  # Because we made a directory we will need to clean it up
  clean_base_dir=TRUE
fi

DIR=$(mktemp -p ${TMPDIR} -d)
echo "Using temporary directory ${DIR}"

if [ ${clean_base_dir} ] ; then
  cleanup_on_exit rmdir ${TMPDIR}
fi


Login Sessions

The NOvA computing environment is a shared one, and the virtual machines that provide our interactive login sessions are designed to be shared by many people. It may be tempting to start up a personal X Windows workspace on these machines so that you can run certain types of programs or development environments. However, if you leave a session like this active when you are not using it (i.e. you leave a persistent virtual desktop running), you will be eating into the available memory and other resources that are shared on the machine. In particular, starting a VNC (Virtual Network Computing) server and leaving it running can consume a large fraction of a given VM's memory.

VNC sessions might be useful for overcoming slow-connection issues when using the event display in batch mode. It is the recommendation of NOvA computing that all NOvA collaborators limit VNC session usage to novagpvm01-04. Please do not leave VNC server processes running if they are not in use.

If there is a specific application that you need to use that will only run under a VNC session (e.g. something that needs a specific frame buffer implementation that the remote X Windows protocol doesn't support), then your VNC session MUST MUST MUST be configured in the following manner:

  1. The VNC session must be configured to connect to the local host
  2. Authentication must be through a kerberized connection
  3. Remote access must be over a kerberized SSH tunnel to the target host

Starting a VNC session on port 5952 in local mode:

Xvnc :52 -localhost

This will bind the server to port 5952 (VNC display :52 maps to TCP port 5900 + 52 = 5952). A tunnel to the host can then be started from your own machine using:

ssh -L 5952:localhost:5952 -N -f -l username novagpvm01.fnal.gov

Then you can connect to the VNC session by starting your client and connecting to "localhost:52".
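For example, using a typical command-line client such as vncviewer (whatever VNC client you have installed will work the same way; the :52 display number is the one chosen above):

# Connect through the SSH tunnel to display :52 on the local end
vncviewer localhost:52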

When you are done working you should shut down your VNC session.
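If you are not sure whether you left a server behind, you can check for (and stop) leftover Xvnc processes under your account. A sketch using the standard pgrep/pkill commands:

# List any Xvnc servers still running under your account
pgrep -u ${USER} -l Xvnc

# Stop them all (only if you are sure none are in use!)
pkill -u ${USER} Xvnc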

Other Topics

More to come...