NuMI at Theta

NuMI Documentation



This wiki page documents how to process NuMI data on Theta. The documentation is easily adaptable to any experiment code that runs larsoft.

General notes:

  • Folders that will contain large data files, such as flux, swizzled, or reco2 files, should have their stripe count set to 56; this improves I/O performance. All subfolders of a folder whose stripe setting has been modified inherit that setting.
  • When pointing a piece of code to a directory (for example, where a fcl file lives), make sure the full path is given. A full path on Theta looks something like /lus/theta-fs0/projects/uboone/... Note that pwd omits the /lus/theta-fs0 part, so make sure it is included.
  • Try to make sure your folders are group-writable if you think other people will need to edit your files: chmod -R g+w <foldername>

Pulling a specific version of uboonecode to Theta

This section shows how to pull down a specific version of experiment-specific code such as uboonecode. Once the version has been pulled down, all the ups products that it depends on will also be in this folder.

In /projects/uboone there is a script called pullProducts, provided by the Fermilab SCD. It takes a specification (product name, version, qualifiers) and downloads the product plus its dependencies, based on a manifest file that lists all the required products.

On https://scisoft.fnal.gov you can find a variety of top-level products with supported manifest files.
Find the microboone entry and scroll down to the version you want.
Pull down the code with (example for v08_00_00_26):

./pullProducts /lus/theta-fs0/projects/uboone/uboonecode/ slf7 uboone-08.00.00.26 s78-e17 prof

If you set up a new products area, remember to stripe it! Every job reads in the libraries at start-up, and there can be an I/O bottleneck if the area is not striped. Therefore, stripe the entire area before installing any software. New products inherit the striping, so this only needs to be done for new areas.
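For example, for a new products area (using the uboonecode path from above):

# Stripe the empty products area before running pullProducts, so everything installed into it inherits the striping
lfs setstripe -c 56 /lus/theta-fs0/projects/uboone/uboonecode/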


Using Singularity containers to run larsoft

Now that we have pulled down the experiment code, we want to be able to run it on Theta. We use a Singularity container to do that. The container wrapper is essentially a bash script that sets up uboonecode and executes a set of larsoft commands; an example is sketched below.
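A minimal sketch of such a wrapper script is given here, assuming the products area pulled down above. The image name, the setup script location, and the fcl/input arguments are placeholders rather than the production values (the actual script used later on this page lives under /lus/theta-fs0/projects/uboone/container/).

#!/bin/bash
# Sketch of a container wrapper script -- image name, setup path and arguments are placeholders
FCL=$1        # fcl file to run
INFILE=$2     # input artroot file
NEVENTS=$3    # number of events to process

singularity exec -B /lus/theta-fs0/projects/uboone \
  /lus/theta-fs0/projects/uboone/container/<sl7_image>.simg \
  bash -c "source /lus/theta-fs0/projects/uboone/uboonecode/setup && \
           setup uboonecode v08_00_00_27 -q e17:prof && \
           lar -c ${FCL} -s ${INFILE} -n ${NEVENTS}"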


Setting up the Balsam environment

Documentation of the Balsam database can be found here: https://balsam.readthedocs.io/en/latest/index.html

We first want to create a Balsam database. This database is what Balsam uses to schedule jobs and monitor their state. Make sure the person who will use it the most creates the database: only the owner of the database can restart it if it goes down, for example after maintenance.


# This is for a first time setup

# Load Balsam. Check that this is the version of Balsam you want to use; to get the most up-to-date version, omit the version number.
module load balsam/0.3.5.1

# Create a balsam database called uboone_balsam
balsam init /lus/theta-fs0/projects/uboone/uboone_balsam

# Set some permissions for the database so all users can use it
find uboone_balsam/ -type d -exec chmod g+rwx {} \;
find uboone_balsam/ -type f -exec chmod  g+rw {}   \;
find uboone_balsam/ -executable -type f -exec chmod g+x {} \;
chmod 700 uboone_balsam/balsamdb/

# Activate the database
. balsamactivate uboone_balsam

# Now we should add the additional users who are going to use the database
balsam server --add-user <username here> # current user names I know are andrzej, cadams, kmistry

# Set permissions so that other users can modify the folders you create from balsam 
umask g=rwx

# Set the stripe of the balsam data folder (this is where your output files will go)
lfs setstripe -c 56 uboone_balsam/data/

Now that the database has been created, each time a user logs in they should run the following commands to set up Balsam and connect to the database:


# Check that this is the version of Balsam you want to use; to get the most up-to-date version, omit the version number.
module load balsam/0.3.5.1

# Activate the balsam database
source balsamactivate /projects/uboone/uboone_balsam

# Set permissions so that other users can modify the folders you create from balsam 
umask g=rwx


Creating a Balsam Workflow and populating the database

I would first recommend reading these excellent introductory slides written by Misha (one of the creators of Balsam): https://www.alcf.anl.gov/files/Salim_Balsam_SDL-10-2019%20%28MC4xOTAyMDQwMA%29.pdf

The first thing we want to do is create a Balsam application. An application is what a Balsam job runs on a Theta worker node. Typically we point Balsam at a singularity container script that sets up a version of uboonecode and runs larsoft for us, and that is what this example does.

To create the app, we can simply do:


balsam app --name <application name> --executable <path to executable>

# For example
balsam app --name uboonecode_v08_00_00_27 --executable /lus/theta-fs0/projects/uboone/container/container_uboonecode_v08_00_00_27.sh

# You can then see the created app registered in the balsam database; the --verbose flag is optional
balsam ls apps [--verbose]

# You might also want to make a pre/post processing script. These are generic python scripts that run before/after the executable is launched.
balsam app --name uboonecode_v08_00_00_27 --executable /lus/theta-fs0/projects/uboone/container/container_uboonecode_v08_00_00_27.sh --preprocess /lus/theta-fs0/projects/uboone/container/preprocess.py
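As a sketch of what such a preprocess script might look like (based on the Balsam documentation; the staging logic is a placeholder):

#!/usr/bin/env python
# Hypothetical preprocess.py sketch: Balsam runs this in the job's working directory before launching the executable
import balsam.launcher.dag as dag

job = dag.current_job  # the BalsamJob this preprocess step belongs to
print("Preprocessing job {} in workflow {}".format(job.name, job.workflow))
# e.g. stage in or rename input files here before the container script runs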

Now that we have the Balsam apps, we can go ahead and populate the Balsam database using a python script. I am going to show a comprehensive example of a workflow that does the following:
  • Loops over a set of files in a directory
  • Gets the number of events for the file (this is based on a file I made which has the file name and number of events in that file)
  • If the file has 50 events, then we create 50 jobs, each processing 1 event from the file
  • Adds a joining job to merge the 50 files created (each with 1 event) back into 1 file
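A sketch of what add_workflow.py might look like is shown below. This is only an outline: the events.txt file format, the application name, and the argument strings passed to the container script are assumptions, so adapt them to whatever your container script expects.

# Hypothetical sketch of add_workflow.py -- the events file format, application
# name and argument strings are assumptions; adapt them to your container script.
from balsam.launcher.dag import add_job, add_dependency

WORKFLOW = "uboone_beamoff_run1_preprocess_join"   # workflow name used when submitting
APP      = "uboonecode_v08_00_00_27"               # app registered with 'balsam app' above

# events.txt is assumed to hold lines of the form: <file name> <number of events>
with open("events.txt") as f:
    file_events = {name: int(n) for name, n in (line.split() for line in f if line.strip())}

for fname, nev in file_events.items():
    single_event_jobs = []
    for ev in range(nev):
        # one job per event: e.g. a file with 50 events gives 50 single-event jobs
        job = add_job(name="reco1_{}_{}".format(fname, ev),
                      workflow=WORKFLOW,
                      application=APP,
                      args="--input {} --nskip {} --nevents 1".format(fname, ev))
        single_event_jobs.append(job)

    # joining job that merges the single-event output files back into one file
    join = add_job(name="join_{}".format(fname),
                   workflow=WORKFLOW,
                   application=APP,
                   args="--merge {}".format(fname))
    for j in single_event_jobs:
        add_dependency(parent=j, child=join)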

We can then add the jobs to the database by running the script with python:

python add_workflow.py

To see the jobs in the database, you can do:

balsam ls

Now that the jobs are populated in the database, most of the work is done! If you want to test that submission is working, you can use the debug queues. There are two debug queues available, debug-cache-quad and debug-flat-quad. Each gives you a maximum of 8 nodes for up to an hour of runtime.

balsam submit-launch -n 2 -t 30 -A uboone -q debug-cache-quad --job-mode serial --wf-filter uboone_beamoff_run1_preprocess_join

Here n is the number of nodes to run on, t is the runtime in minutes, A is the project, q is the queue, and wf-filter is the workflow name; leave the job mode as it is.

If you want to submit using the actual allocation, change the queue to default. Note that the default queue requires a minimum of 128 nodes, for example:
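# Same options as above, but on the default queue with the minimum node count (pick a runtime to suit your workflow)
balsam submit-launch -n 128 -t 60 -A uboone -q default --job-mode serial --wf-filter uboone_beamoff_run1_preprocess_join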

To view how much allocation you have, just type sbank into the terminal.


Monitoring jobs on Theta

Firstly, you can visit the Theta activity page: https://status.alcf.anl.gov/theta/activity

This shows the current running and queued jobs in a visual way.

You can use qstat to see the current status of your jobs. Make sure to grep for your username, otherwise the output will show all currently running jobs and fill your terminal.

qstat | grep <username>

If you want to kill a job you can use qdel. Simply get the job id using qstat then:

qdel <jobid>

We can also monitor the jobs with balsam.

To see the current number of jobs running, or waiting for parents, we can do:

balsam ls --by-state

Omit the --by-state to have a more verbose view of the database.

Other options include (you can see all of them by adding --help):

balsam ls --state RUNNING
balsam ls --name reco1
balsam ls -wf uboone

You can also see the current log files and data files as they are being processed by going to the balsam data folder. I like to find a job that is currently running and do a tail -f on its log file, which prints the log live as it updates.
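For example (the directory layout under the data folder follows the workflow and job names, and the exact log file name may differ):

tail -f /lus/theta-fs0/projects/uboone/uboone_balsam/data/<workflow name>/<job directory>/<job name>.out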


Useful Tutorials

Getting Job information

How do I get the information for an individual job for debugging, such as what command was executed on the worker node?
We can do all of this from a python command line. After starting python (and making sure you have loaded balsam), it might look like:

>>> from balsam.launcher.dag import BalsamJob
>>> jobs = BalsamJob.objects.filter(workflow="<your workflow name>") # get jobs matching a specific workflow
>>> len(jobs) # should show how many jobs have been selected
>>> jobs # print all the information for a job

You can filter the jobs using any of the fields from the job output listed above. For example, to see all the jobs that are part of the workflow test_workflow and are currently running:

>>> jobs = BalsamJob.objects.filter(workflow="test_workflow", state="RUNNING")
>>> jobs.delete() # will delete the jobs from the database

Help, my workflow failed, how do I start again?

1. Cleanup jobs in database using python command line

>>> from balsam.launcher.dag import BalsamJob
>>> jobs = BalsamJob.objects.filter(workflow="<your crappy workflow name>") # get jobs matching a specific workflow
>>> len(jobs) # should show how many jobs have been selected
>>> jobs.delete() # Delete the jobs

2. Cleanup folders in the balsam data area
Go to the balsam database data folder and delete the directory for the workflow.

3. Re-populate database

python add_workflow.py

4. Resubmit
balsam submit-launch ...

How do I get all the jobs for a specific category?

We can use the python command line to do this:

>>> from balsam.launcher.dag import BalsamJob
>>> jobs = BalsamJob.objects.filter(workflow="<your workflow name>", name__contains="reco1") # gets the jobs for a workflow and keeps only those with reco1 in the name
>>> len(jobs) # should show how many jobs have been selected

Setting stripe count of directory

There are 56 Lustre object storage targets (OSTs) on Theta, so it is advised to set the stripe count to this value.

# To set the stripe count of a folder do (in this case we set it to 56)
lfs setstripe -c 56 <folder name>

# To check the stripe count you can use this command
lfs getstripe <folder name>

Transferring data to and from Theta with globus

In order to use globus to transfer files from dCache to Theta or vice versa, we need to obtain certificates so that both ends trust each other.

The proxy certificate for ANL was obtained following this webpage: https://www.alcf.anl.gov/user-guides/using-gridftp

The Fermilab certificates are found in the directory /etc/grid-security/certificates/

The first thing we need to do is create the Fermilab and ANL proxies.

Make sure these are refreshed regularly in order to maintain the permissions.

# I like to put this into a bash script to automate it.
# Make the Fermilab proxy
kinit
kx509
voms-proxy-init -noregen -voms fermilab:/fermilab/uboone/Role=Analysis
export MYFNALCRED=/tmp/x509up_u$(id -u)

# Now make the ANL proxy
export X509_CERT_DIR=/uboone/data/users/kmistry/work/ALCF_CA/  # This folder contains the ANL AND Fermilab proxies
myproxy-logon -s myproxy.alcf.anl.gov -t 56 --out /tmp/kmistry_ANLcred # Enter your generated password from MobilePass+

To convert a file list with pnfs paths into gsiftp URLs, one can use the following command:

sed -i 's#/pnfs/#gsiftp://fndca1.fnal.gov:2811/pnfs/fnal.gov/usr/#' <filelist.txt>
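The transfer itself can then be run with globus-url-copy. A hypothetical example is shown below: the ALCF gridftp endpoint and the destination path are assumptions, so check the ALCF GridFTP page linked above for the correct endpoint.

# Source credential is the Fermilab proxy, destination credential is the ANL proxy
globus-url-copy -vb -p 4 \
    -src-cred ${MYFNALCRED} -dst-cred /tmp/kmistry_ANLcred \
    gsiftp://fndca1.fnal.gov:2811/pnfs/fnal.gov/usr/uboone/<path to file> \
    gsiftp://<ALCF gridftp endpoint>/lus/theta-fs0/projects/uboone/<destination path>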

Modifying ANL processed file metadata for SAM declaration

Before declaring files that have been processed at ANL, we need to modify the file metadata: we set the parent file to be the swizzled file and update the swizzler version information. The script below shows how we do that. Some modifications were also made to the fcl files at Theta; head to the NuMI Production page to see what edits were made.