
NOvA AWS Tests

Here you can find a summary of the AWS tests run by the OPOS team. OPOS began running these tests in January 2016.

SNOW tickets

Date        SNOW ticket
2016-02-18  RITM0350652
2016-01-27  RITM0339738
2016-01-20  RITM0335878
2016-01-13  RITM0332760

Tools Used

The tests use the following submission tools:

Samweb station: nova-int

Instructions to submit jobs

Actual steps:
ssh novagpvm01 -l novapro
source /grid/fermiapp/nova/novaart/novasvn/setup/setup_nova.sh -r S15-05-04c -b maxopt
cd /nova/app/home/novapro/OPOS/repo/offline_production_operations_service-opos/Nova/scripts/Batch/AWS
export X509_USER_PROXY=/var/tmp/OPOS.[teammembername].Production.proxy
Edit the config file if necessary.
./submit_nova_art.py -f aws_test.cfg

The initial instructions provided by Paul Rojas in RITM0332760:

Here are the instructions for running the Nova jobs on the Amazon Cloud:

To run jobs on the Amazon Cloud, I'm using some modified Nova scripts.  These scripts will set up the samweb project and submit the job with jobsub automatically.
The files are available at /nova/app/users/projas/aws_test/
The three files necessary to run this are: submit_nova_art.py, art_sam_wrap.sh, and aws_test.cfg
aws_test.cfg is a configuration file necessary to run submit_nova_art.py
There are a couple of parameters in aws_test.cfg that you may want to change for your tests.
--defname is the Samweb dataset definition you want the project to use; for now you'll probably want to leave it alone.
--njobs is the number of jobs you want to submit.
--files_per_job is the number of files you'd like each job to process, currently set to 15 for these tests; this may change in the future.
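As a concrete illustration, a minimal aws_test.cfg could carry those options one per line, as sketched below. The option names come from the description above, but the file layout and the values shown are assumptions; compare against the reference copy in /nova/app/users/projas/aws_test/ before use.

```
--defname=nova_reco-2015_amazon_aws
--njobs=100
--files_per_job=15
```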

To actually run these jobs follow these instructions:

1) You'll need to log into a novagpvm (as yourself should be fine) and run the nova setup script:
$ source /grid/fermiapp/nova/novaart/novasvn/setup/setup_nova.sh -r S15-05-04c -b maxopt

2) Use kx509 to set up your certificate:
$ kx509

3) Copy the three files listed above from my directory into your local directory.

4) You'll need to edit line 43 of submit_nova_art.py to point to your local copy of art_sam_wrap.sh:
art_sam_wrap_cmd="/local_directory/art_sam_wrap.sh" 

5) Now you are ready to run the jobs:
$ ./submit_nova_art.py -f aws_test.cfg

Monitoring tools

Monitor progress using:

Samweb station: nova-int

https://fifemon.fnal.gov/hcf/dashboard/db/aws-vm-status-by-account

Select the appropriate VO, zone, and instances.

or get accounting info:
http://fermicloud035.fnal.gov:8100/gratia/xml/vo_instance_type_running_vms?exclude-instance-type=NONE&exclude-stopped=NO&facility=.*&probe=.*&exclude-empty-nulls-unkowns=YES&vo=nova&charge-type=both&endtime=2016-01-21+23%3A59%3A59&exclude-vo=other&account-type=.*&instance-type=.*&span=86400&exclude-aws-zone=NONE&aws-zone=.*&exclude-account-type=NONE&starttime=2016-01-08+00%3A00%3A00&exclude-facility=NONE%7CGeneric%7CObsolete

AWS VM Status 
https://fifemon.fnal.gov/hcf/dashboard/db/aws-vm-status-by-account-by-az?from=1452787715093&to=1452874115093&var-account=nova&var-az=All&var-type=c3_2xlarge&var-type=c3_xlarge&var-type=m3_2xlarge

Slots
https://fifemon.fnal.gov/hcf/dashboard/db/hep-cloud-slots?from=1452787793549&to=1452873893549

SAM audit

sam_audit_dataset should be run with the following setup:

-> Obtain an AWS token with the appropriate role from the machine holding the credentials

  • At the machine where the credentials are stored:
    aws sts assume-role --role-arn=arn:aws:iam::950490332792:role/AllowS3_ListBucket_NoFinancial --role-session-name=sam-audit-dataset | egrep 'Key|Session' | sed -e 's/^ *"//' -e 's/": *"/="/' -e 's/",.*$/"/' -e 's/SecretAccessKey/export AWS_SECRET_ACCESS_KEY/' -e 's/SessionToken/export AWS_SESSION_TOKEN/' -e 's/AccessKeyId/export AWS_ACCESS_KEY_ID/' > aws_role_token
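For reference, the aws_role_token file produced by the pipeline above should reduce to three export lines of the following shape. The key values shown here are AWS's documented placeholder examples, not real credentials.

```shell
# Hypothetical contents of aws_role_token; the values are AWS's
# documentation placeholders, not real credentials.
export AWS_ACCESS_KEY_ID="ASIAIOSFODNN7EXAMPLE"
export AWS_SECRET_ACCESS_KEY="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
export AWS_SESSION_TOKEN="AQoDYXdzEPT//////////wEXAMPLEtoken"
```

Sourcing this file (as done in the next step) makes the temporary role credentials visible to sam_audit_dataset and the underlying S3 tools.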

-> At the interactive node (novapro@novagpvm01):
source /grid/fermiapp/products/common/etc/setups.sh
setup fife_utils v3_0_1
setup ifdhc v1_8_10
source aws_role_token
-> Run the command (example):
sam_audit_dataset --name=nova_reco-2015_amazon_aws_recovery_feb11-16_OPOS_children --dest=s3://nova-analysis/data/output -e nova
Dataset nova_reco-2015_amazon_aws_recovery_feb11-16_OPOS_children at location s3://nova-analysis/data/output :
Total Files: 7839
Present and declared: 7839
Declared: 7839
Not Declared: 0
Present but not Declared: 0
Present at wrong location: 0
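When auditing many datasets, checking the report by eye gets error-prone. The snippet below is a small sketch that compares the two key counters; it assumes the report was saved to a file (here audit.txt, written inline from the example above) in exactly the format shown.

```shell
# Reproduce the example report above (normally: sam_audit_dataset ... > audit.txt)
cat > audit.txt <<'EOF'
Total Files: 7839
Present and declared: 7839
Declared: 7839
Not Declared: 0
Present but not Declared: 0
Present at wrong location: 0
EOF

# A dataset is fully transferred when every file is both present and declared
total=$(awk -F': ' '/^Total Files/ {print $2}' audit.txt)
present=$(awk -F': ' '/^Present and declared/ {print $2}' audit.txt)
if [ "$total" = "$present" ]; then
    echo "OK: all $total files present and declared"
else
    echo "MISMATCH: only $present of $total files present and declared"
fi
```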

In this example case:

$ samweb describe-definition nova_reco-2015_amazon_aws_recovery_feb11-16_OPOS_children
Definition Name: nova_reco-2015_amazon_aws_recovery_feb11-16_OPOS_children
Definition Id: 341951
Creation Date: 2016-02-26T22:58:59.884351+00:00
Username: novapro
Group: nova
Dimensions: ischildof:( defname: nova_reco-2015_amazon_aws_recovery_feb11-16 ) and full_path 's3:%'

HEP Cloud Slots
https://fifemon.fnal.gov/hcf/dashboard/db/hep-cloud-slots?from=1452701805446&to=1452787905447

VM Status by account and AZ
https://fifemon.fnal.gov/hcf/dashboard/db/aws-vm-status-by-account-by-az?from=1452701833790&to=1452788233791&var-account=nova&var-az=All&var-type=c3_2xlarge&var-type=c3_xlarge&var-type=m3_2xlarge

VM Status by account
https://fifemon.fnal.gov/hcf/dashboard/db/aws-vm-status-by-account?from=1452614496610&to=1452628970489&var-account=nova&var-region=us-west-2&var-type=c3_2xlarge&var-type=c3_xlarge&var-type=m3_2xlarge

Gratia VM preemption status
http://fermicloud035.fnal.gov:8100/gratia/xml/spot_status_count?exclude-instance-type=NONE&exclude-stopped=NO&facility=.*&probe=.*&exclude-empty-nulls-unkowns=YES&vo=.*nova.*&charge-type=spot&endtime=2016-01-14+17:00:59&exclude-vo=other&account-type=.*&overbid-charge-description=overbid&instance-type=.*&span=3600&exclude-aws-zone=NONE&aws-zone=.*&exclude-account-type=NONE&starttime=2016-01-13+20:00:00&exclude-facility=NONE%7CGeneric%7CObsolete

Gratia VM charges
http://fermicloud035.fnal.gov:8100/gratia/xml/vo_aws_charges?exclude-instance-type=NONE&exclude-stopped=YES&facility=.*&probe=.*&exclude-empty-nulls-unkowns=YES&vo=.*nova.*&charge-type=both&instance-type=.*&exclude-vo=other&account-type=.*&endtime=2016-01-14+18:00:59&span=3600&exclude-aws-zone=NONE&aws-zone=.*&exclude-account-type=NONE&exlude-empty-nulls-unkowns=NO&starttime=2016-01-13+20:00:00&exclude-facility=NONE%7CGeneric%7CObsolete

Idle jobs and running VM
https://fifemon.fnal.gov/hcf/dashboard/db/hep-cloud-summary?from=1453912229022&to=1453928419484

Preemption
http://fermicloud035.fnal.gov:8100/gratia/xml/spot_status_count?exclude-instance-type=NONE&exclude-stopped=NO&facility=.*&probe=.*&exclude-empty-nulls-unkowns=YES&vo=.*nova.*&charge-type=spot&endtime=2016-01-27+23:00:59&exclude-vo=other&account-type=.*&overbid-charge-description=overbid&instance-type=.*&span=3600&exclude-aws-zone=NONE&aws-zone=.*&exclude-account-type=NONE&starttime=2016-01-27+00:00:00&exclude-facility=NONE%7CGeneric%7CObsolete

Submission details

The details of the submissions can be found in the NOvA production ECL:
http://dbweb5.fnal.gov:8080/ECL/novapro/E/index

Test Results Summary

RITM0339738

Two-thousand-job test. Objective: get 2000 jobs running concurrently in AWS processing real data. Process the data available in dataset nova_reco-2015_amazon_aws (187k files); each job should process 100 files. Jobs are expected to last 6 hours.

Submission 1 (2016-01-27): First attempt to run 2000 concurrent jobs.
  Configuration: http://dbweb5.fnal.gov:8080/ECL/novapro/E/show?e=68

RITM0335878

One hundred jobs test. Objective: Get 100 jobs running concurrently in AWS.

Test 7 (01-22-16): Run 100 PID jobs over a ~1800-file DS.
  Configuration: ECL entry 67. Max concurrent jobs: 100. Disconnections: 0.
  Comments: Upgraded version of glideinWMS (v3.2.12) in production. For around a minute there were 100 jobs running concurrently! It took ~50 min to get the 100 jobs up and running.

Test 6 (01-22-16): Run 100 PID jobs over a ~1800-file DS.
  Configuration: ECL entry 66. Max concurrent jobs: 64. Disconnections: 11.
  Comments: Before submitting: "verify that the configuration of the FE is good enough to submit 100 concurrent jobs within 1 h. Joe has set parameter relative_to_queue to 10, which should be enough if the spot market is low enough that HEP Cloud can submit VMs." (Gabriele). After submission: "1) With the current version of glideinWMS for fifebatch (v3.2.11), relative_to_queue=10 does not help submit more glideins to HEP Cloud. We could see only 3 running, 1 idle, and 2 held glideins. 2) Submitting 200 'pressure jobs', I could see the number of idle glideins increase. So the bug in v11 limits the number of glideins submitted irrespective of relative_to_queue: it is a limit that kicks in after the relative_to_queue effect is calculated. We do need v3.2.12 for smooth operations. 3) About the HELD jobs, condor was trying to interact with a VM that was already removed ('Job cancel did not succeed after 3 tries, giving up.'). Condor tries to release the glidein back to idle three times and keeps failing. The fact that the glideins go back to idle, though, affects how many glideins the fifebatch system sends. Parag thinks that glideinWMS should improve the handling of this: if the VM is dead, it is probably not worth trying again. Parag will talk to Tony about this." (Gabriele)

Test 5 (01-22-16): Run 100 PID jobs over a ~1800-file DS.
  Configuration: ECL entry 54. Max concurrent jobs: 35. Disconnections: 34.
  Comments: "Still running no more than 35 jobs concurrently. Working with Joe to up the number of glideins submitted. He is tweaking the parameter <running_glideins_per_entry relative_to_queue=X> with X from 1 to 10 (3 did not make a change)." (Gabriele)

Test 4 (01-20-16): Run 100 PID jobs over a ~1800-file DS; submit 200 pressure jobs that would not run (instructions in ticket RITM0335878).
  Configuration: ECL entry 53. Max concurrent jobs: 48. Disconnections: 18.
  Comments: The dataset changed size as new files were uploaded to S3 (getting data ready for the future 1000-job test). All jobs were removed because some were getting stuck trying to access a file declared in SAM but not physically available in S3. The 100 slots weren't reached. Needs further action and investigation. In future tests, the DS must point to a snapshot.

RITM0332760

One hundred jobs test. Objective: Get 100 jobs running concurrently in AWS.

Test 3 (01-15-16): Run 100 PID jobs over a ~1800-file DS.
  Configuration: ECL entry 52. Max concurrent jobs: 88. Disconnections: 35.
  Comments: The testing time of test 2 overlaps with that of test 3. 14 jobs got stuck asking the project for a new file after the project had delivered all its files. "We still cannot reach 100 slots concurrently. Needs investigation." (Gabriele). "There are 3 problems: (1) pilot instantiation; (2) job termination when no file is available; (3) output storage..." (Gabriele). More details in ticket RITM0332760, additional comments by Gabriele Garzoglio on 2016-01-19 17:03:51.

Test 2 (01-14-16): Run 100 PID jobs over a ~1800-file DS.
  Configuration: ECL entry 51. Max concurrent jobs: 56. Disconnections: 42.
  Comments: Jobs didn't run concurrently. 8 jobs got stuck asking the project for a new file after the project had delivered all its files; they stay stuck for 180 min. "Still have problems with occupancy ... There's a limitation on the fife factory that is not sending enough jobs to the hep cloud gateway. Tony is working with Krista and Joe to address this. 4 stuck jobs are trying to check if their output file is already in the dropbox with something like this: uberftp -ls gsiftp://stkendca27a.fnal.gov/pnfs/fnal.gov/usr/nova/scratch/fts/dropbox/0/1/f/fardet_r00013492_s44_t00_S15-05-04c_v1_data.pidpart.root" (Gabriele). Incident INC000000651599 opened with dCache. "The limits on the FIFE factory have been raised. Please submit again." (Gabriele)

Test 1 (01-13-16): Run 100 PID jobs over a ~1800-file DS.
  Configuration: ECL entry 7. Max concurrent jobs: 48. Disconnections: 18.
  Comments: Jobs didn't run concurrently. "Tony (...) has adjusted the factory parameters so there will be more VMs launched next time." (Steve Timm). "We are also starting squid servers in all availability zones." (Gabriele)

New Round

The scripts for the new round of tests are:
/nova/app/users/projas/aws_test/submit_nova_art.py
/nova/app/users/projas/aws_test/art_sam_wrap.sh
/nova/app/users/projas/aws_test/aws_new_output_location.cfg