Fife Monitoring

Background

The Open Science Grid (OSG) Application Software Installation Service (OASIS) is the recommended method to distribute software on the Open Science Grid. It is implemented using CernVM File System (CVMFS or CernVM-FS) technology. OASIS/CVMFS provides a robust, scalable way to distribute software on remote sites, but there is an increased probability of encountering access problems on remote worker nodes. There are several ways to monitor the progress of a submitted job including graphs.

After submitting a job, users must wait for the result. If a job stays idle longer than expected, users are understandably concerned about what is going on. Several resources are available for seeing what is happening to a job, and knowing its state helps you decide whether to wait, resubmit, or debug.

Fifemon gathers information from many different sources and graphs it on a common timeline to help experiments understand their computing usage and identify problems.
The main monitoring page is at: http://fifemon.fnal.gov/monitor.
  • Users log in with their services account.
  • Users need to be affiliated with an experiment to see some details.

If you aren't affiliated with an experiment, see: https://cdcvs.fnal.gov/redmine/projects/fife/wiki/Requesting_interactive_account

There are many reasons why a job may be idle, such as:
  1. It takes some time to request and start a glidein on a remote worker node and to match your job.
  2. Jobs submitted to opportunistic, offsite, fermicloud, or AWS resources can take longer to start, so be especially patient with them.
  3. The job doesn't match any available slots (for example, it requests an unsupported version of Scientific Linux or a huge amount of memory).
  4. The job might be starting and failing right away, so it is put back in the queue.
  5. The system may be too busy running other jobs. (Details are under HTCondor Scheduling: https://cdcvs.fnal.gov/redmine/projects/fife/wiki/Fife_Monitoring#HTCondor-Scheduling)

Remember that the job could be started at Nebraska, Oklahoma or other OSG sites. You shouldn't expect to get direct access to BlueArc.

Running a Test Job

Since many people learn best by doing, Ken Herner prepared a test script that can be used to learn how to monitor a job.

Assuming you are affiliated with an experiment, and have set up your services account, you should be able to do the following steps (changing uboone to your_experiment).

ssh <vm_for_your_experiment>
 such as:
ssh uboonegpvm01

source /grid/fermiapp/products/common/etc/setups.sh
setup jobsub_client

cp /nashome/k/kherner/submission_test.sh /<experiment>/app/users/<user_name>
For example, mine is: /uboone/app/users/klato
because I am affiliated with uboone, and my user_name is klato.

mkdir -p /pnfs/uboone/scratch/users/klato
chmod g+w /pnfs/uboone/scratch/users/klato
jobsub_submit -G uboone -M --OS=SL6,SL7 --resource-provides=usage_model=DEDICATED,OPPORTUNISTIC,OFFSITE file:///uboone/app/users/klato/submission_test.sh

(Again, my experiment is uboone; the specifics depend on your experiment and user name.)

You can repeat this command several times if you want to simulate having multiple jobs.
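Repeating the submission can be scripted with a small loop. The following is only a sketch: the helper name repeat_cmd is hypothetical (it is not part of jobsub_client), and the command is prefixed with echo so that it prints what would be run instead of submitting anything. The experiment and user name are the example values used on this page.

```shell
#!/bin/sh
# repeat_cmd (hypothetical helper): run the given command COUNT times,
# stopping at the first failure.
repeat_cmd() {
    count=$1
    shift
    i=1
    while [ "$i" -le "$count" ]; do
        "$@" || return 1
        i=$((i + 1))
    done
}

# Example values from this page; substitute your own experiment and user name.
EXPERIMENT=uboone
USER_NAME=klato

# Dry run: "echo" prints each command instead of submitting it.
repeat_cmd 3 echo jobsub_submit -G "$EXPERIMENT" -M --OS=SL6,SL7 \
    --resource-provides=usage_model=DEDICATED,OPPORTUNISTIC,OFFSITE \
    "file:///$EXPERIMENT/app/users/$USER_NAME/submission_test.sh"
```

Once the printed commands look right, remove the echo to submit for real.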

Then you can go to the fifemon pages to see the jobs under your personal page.
http://fifemon.fnal.gov/monitor

Note: you will also receive email about the job, hopefully with a status of 0 indicating success.

This is an automated email from the Condor system
on machine "fifebatch1.fnal.gov".  Do not reply.

Condor job 75378.0
    /fife/local/scratch/uploads/uboone/klato/2014-08-28_103846.873398_8614/submission_test.sh_20140828_103847_17933_0_1_wrap.sh
exited normally with status 0

(And a lot more information about run time and statistics.)

Monitoring Results THIS SECTION IS OBSOLETE

Debugging

In addition to the monitoring pages, the status of a job can be checked via jobsub_q and the log files can be fetched via jobsub_fetchlog.

Checking job status

Check the status of your job via:

jobsub_q -G <experiment>

jobsub_q -G uboone

386.0@fifebatch2.fnal.gov    tlevshin    06/10    15:39    0+00:00:00    I    0    0.0    test_local_env.sh
386.1@fifebatch2.fnal.gov    tlevshin    06/10    15:39    0+00:00:00    I    0    0.0    test_local_env.sh
386.2@fifebatch2.fnal.gov    tlevshin    06/10    15:39    0+00:00:00    I
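When many jobs are listed, a per-state count is often easier to read than the raw listing. The sketch below assumes the job state is the sixth whitespace-separated column, as in the listing above, and that only job lines have an "@" in the first column; the helper name count_by_status is hypothetical. The letters follow the usual HTCondor convention (I for idle, R for running, H for held).

```shell
#!/bin/sh
# count_by_status (hypothetical helper): read jobsub_q-style lines on stdin
# and print the number of jobs in each state.
# Assumes the state is the 6th column; lines whose first field does not
# look like a job ID (no "@") are skipped.
count_by_status() {
    awk '$1 ~ /@/ && NF >= 6 { n[$6]++ } END { for (s in n) print s, n[s] }' | sort
}

# Demonstration on the three lines shown above.
count_by_status <<'EOF'
386.0@fifebatch2.fnal.gov    tlevshin    06/10    15:39    0+00:00:00    I    0    0.0    test_local_env.sh
386.1@fifebatch2.fnal.gov    tlevshin    06/10    15:39    0+00:00:00    I    0    0.0    test_local_env.sh
386.2@fifebatch2.fnal.gov    tlevshin    06/10    15:39    0+00:00:00    I
EOF
# prints: I 3
```

On a submission host this would typically be used as: jobsub_q -G uboone | count_by_status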

There can be worker node problems such as:
  1. cvmfs problems
  2. missing libraries

Be sure to check your log files carefully.

You can also see a personalized monitoring page here:

https://fifemon.fnal.gov/monitor/d/000000116/user-batch-details?orgId=1&var-cluster=fifebatch&var-user=YOUR_USERNAME

Of course, be sure to replace YOUR_USERNAME with your actual username in the above link.
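The substitution can also be done in the shell. This is a small sketch with a hypothetical helper name, user_page_url; when no argument is given it falls back to the local login name, which is assumed (not guaranteed) to match your services username.

```shell
#!/bin/sh
# user_page_url (hypothetical helper): print the personalized Fifemon page
# for the given username, defaulting to the current login name.
user_page_url() {
    user=${1:-$(id -un)}
    printf 'https://fifemon.fnal.gov/monitor/d/000000116/user-batch-details?orgId=1&var-cluster=fifebatch&var-user=%s\n' "$user"
}

user_page_url klato
```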

Log files

You can download log files via jobsub_fetchlog:

jobsub_fetchlog -G <your_experiment> -J <job_id_from_jobsub_submit>

You then get a .tgz file containing the job logs that you want.
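The tarball can be inspected and unpacked with standard tar. The helper below is a sketch with a hypothetical name, unpack_logs; the file and directory names in the usage comment are examples only.

```shell
#!/bin/sh
# unpack_logs (hypothetical helper): list the contents of a log tarball and
# extract it into a destination directory, which is created if needed.
unpack_logs() {
    tarball=$1
    dest=$2
    mkdir -p "$dest" || return 1
    tar -tzf "$tarball" || return 1      # show what the archive contains
    tar -xzf "$tarball" -C "$dest"       # extract into the destination
}

# Usage (file name as reported by jobsub_fetchlog):
#   unpack_logs 75464.0@fifebatch1.fnal.gov.tgz ./job_logs
```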

Specific example:

<uboonegpvm01.fnal.gov> jobsub_submit -G uboone -M --OS=SL5,SL6 --resource-provides=usage_model=DEDICATED,OPPORTUNISTIC file:///uboone/app/users/klato/submission_test.sh
Server response code: 200
Response OUTPUT:
/fife/local/scratch/uploads/uboone/klato/2014-09-03_132650.610775_5516

/fife/local/scratch/uploads/uboone/klato/2014-09-03_132650.610775_5516/submission_test.sh_20140903_132650_12591_0_1.cmd

submitting....

Submitting job(s).

1 job(s) submitted to cluster 75464.

JobsubJobId of first job: 75464.0@fifebatch1.fnal.gov

Use job id 75464.0@fifebatch1.fnal.gov to retrieve output

Remote Submission Processing Time: 3.10846090317 sec

<uboonegpvm01.fnal.gov> jobsub_fetchlog -G uboone -J 75464.0@fifebatch1.fnal.gov

CREDENTIALS    : {'cert': '/tmp/jobsub_x509up_u48896', 'env_key': '/tmp/jobsub_x509up_u48896', 'env_cert': '/tmp/jobsub_x509up_u48896', 'key': '/tmp/jobsub_x509up_u48896'}

SUBMIT_URL     : https://fifebatch1.fnal.gov:8443/jobsub/acctgroups/uboone/jobs/75464.0@fifebatch1.fnal.gov/sandbox/

Downloaded to 75464.0@fifebatch1.fnal.gov.tgz
Remote Submission Processing Time: 0.410796880722 sec
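To avoid copying the job ID by hand, it can be pulled out of the jobsub_submit output. This sketch assumes the output still contains a line of the form "Use job id ... to retrieve output", as in the transcript above; the helper name extract_jobid is hypothetical.

```shell
#!/bin/sh
# extract_jobid (hypothetical helper): read jobsub_submit output on stdin and
# print the job ID from the "Use job id ... to retrieve output" line.
extract_jobid() {
    sed -n 's/^Use job id \(.*\) to retrieve output.*$/\1/p'
}

# Demonstration on two lines copied from the transcript above.
printf '%s\n' \
    '1 job(s) submitted to cluster 75464.' \
    'Use job id 75464.0@fifebatch1.fnal.gov to retrieve output' | extract_jobid
# prints: 75464.0@fifebatch1.fnal.gov
```

The captured ID can then be passed to jobsub_fetchlog with the -J option, as shown in the transcript.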

You can also add the following option to the jobsub_fetchlog command line:

--unzipdir=/path/where/I/want/the/files/to/go

GlideinWMS Monitoring Page

You can also check the GlideinWMS Frontend monitoring page at: http://fifebatchgpvmhead1.fnal.gov/vofrontend/monitor

Getting More Information

Make sure you have enough information to debug the problem.
  • Within your script, add 'set -x' to simplify debugging.
  • Be sure you exit with a non-zero exit code in case of failure.
  • Add timestamps to your log, especially around file transfers, by including
    echo "Started $(date)"

  • Always log the site, host, and OS your job has been executing on.
    echo "Site: ${GLIDEIN_ResourceName}"
    echo "The worker node is $(hostname), OS: $(uname -a)"

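The recommendations in this list can be combined into one job script skeleton. This is a minimal sketch, not an official template: the function names log_environment and run_step are hypothetical, the payload is a placeholder echo, and GLIDEIN_ResourceName falls back to "unknown" when it is not set (for example when testing locally).

```shell
#!/bin/sh
# log_environment (hypothetical helper): record when and where the job runs.
log_environment() {
    echo "Started $(date)"
    echo "Site: ${GLIDEIN_ResourceName:-unknown}"
    echo "Worker node: $(hostname) OS: $(uname -a)"
}

# run_step (hypothetical helper): run one command with timestamps around it
# and return that command's exit status.
run_step() {
    echo "BEGIN $(date): $*"
    "$@"
    rc=$?
    echo "END $(date): $* (exit status $rc)"
    return "$rc"
}

set -x                       # trace each command to stderr for debugging
log_environment
# Placeholder payload; replace with the real work, e.g. a file transfer.
run_step echo "payload placeholder" || exit 1   # non-zero exit on failure
```

Wrapping each step in run_step means a failed transfer or executable shows up in the log with its exit status and the time it stopped.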

If your script has a bug, try testing it on your local machine since there can be authentication/authorization problems with a remote site.
You can always request a FermiCloud VM node that looks like a "generic" grid worker node and do testing and debugging there.

Internals

HTCondor Scheduling

The details of the HTCondor priority calculation are available at: http://research.cs.wisc.edu/htcondor/manual/v8.0/3_4User_Priorities.html#28293. Note that each experiment has a quota on FermiGrid that will be filled before priority is considered. But there must be open slots available on FermiGrid before an experiment quota can be filled since FermiGrid has a non-preemption policy.

Monitoring Pages

  1. GlideinWMS Frontend monitoring page at: http://fifebatchgpvmhead1.fnal.gov/vofrontend/monitor
  2. FIFEMON User’s Job Info at: https://fifemon.fnal.gov/monitor/d/000000116/user-batch-details?orgId=1&var-cluster=fifebatch&var-user=<username>
  3. FIFEMON batch details at: https://fifemon.fnal.gov/monitor/d/000000004/experiment-overview?orgId=1
    Choose your VO (<vo>: ArgoNeuT, CDF, etc.) from the pull-down menu on the left side.
  4. To get weekly and monthly data per experiment, click on 'Experiments' link (at the top) at: http://web1.fnal.gov/scoreboard/
  5. SAM station for input file download at: http://samweb.fnal.gov:8480/station_monitor/<vo>
    (<vo>: DO, CDF, etc.)
  6. Fermilab Public dCache at: /pnfs/fnal.gov/usr/<vo>/scratch/…
  7. General Fermilab dCache System Status: http://fndca.fnal.gov
  8. Per experiment space usage: http://fndca.fnal.gov/cgi-bin/sg_usage_cgi.py
  9. Per experiment dCache transfers: http://fndca.fnal.gov/cgi-bin/sg_transfers_cgi.py
  10. Active transfers in dCache: http://fndca3a.fnal.gov:2288/context/transfers.html
  11. Recent FTP transfers: http://fndca.fnal.gov/cgi-bin/dcache_files.py
  12. File lifetime plots in dCache: http://fndca.fnal.gov/dcache/dc_lifetime_plots.html
  13. Enstore: http://www-ccf.fnal.gov/enstore/

A Specific Example of Monitoring (for NOvA)

The main NOvA monitoring page is here:

https://fifemon.fnal.gov/monitor/d/000000004/experiment-overview?orgId=1&var-experiment=nova

The batch details page shows how many jobs are running or queued, slot usage, priority, etc.

https://fifemon.fnal.gov/monitor/d/000000053/experiment-batch-details?orgId=1&var-experiment=nova

The history of nova jobs can be seen by choosing the time region of interest in the upper right section of the page. Both of these pages will work for other experiments by substituting your experiment name for nova in the URL, or by choosing it from the pull-down menu on the left side.
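The URL substitution described above can be sketched in the shell. The helper name experiment_urls is hypothetical, and the dashboard IDs are simply copied from the two NOvA links; whether a given experiment name is accepted depends on the monitoring site.

```shell
#!/bin/sh
# experiment_urls (hypothetical helper): print the overview and batch-details
# pages for one experiment, following the NOvA URL pattern above.
experiment_urls() {
    base=https://fifemon.fnal.gov/monitor/d
    printf '%s/000000004/experiment-overview?orgId=1&var-experiment=%s\n' "$base" "$1"
    printf '%s/000000053/experiment-batch-details?orgId=1&var-experiment=%s\n' "$base" "$1"
}

experiment_urls uboone
```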

Feedback on Fife Monitoring

Please add items to the Feedback on Fife Monitoring page if you have anything you'd like changed, added, or deleted.