S3BulkMigration

To transfer large quantities of data to S3 for use there, you can use the fife_utils
v3_x tools and run parallel copy batches on FermiGrid. There are several things you need to do
by hand first, because jobsub/FermiGrid does not currently handle Amazon credentials for you
the way it handles Grid proxies.

Getting an S3 credential to use

Assuming you already have a long-term S3 key/token pair, you don't want to schlep that
around on FermiGrid; since it is a long-term credential, you want something more like a proxy.

So basically you want to get a session token from Amazon, write a small script that sets
the environment variables for that token, and put that script somewhere your jobs can grab it.
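
If you're curious what that boils down to, here is a minimal sketch using the stock AWS CLI
in bash (the aws_get_session_cred script described below does essentially this for you; the
token file layout matches the transcript further down, and the 12-hour duration is just an
example):

# ask STS for a temporary session token and capture the three pieces
read -r AKID SECRET TOKEN < <(aws sts get-session-token \
    --duration-seconds 43200 \
    --query 'Credentials.[AccessKeyId,SecretAccessKey,SessionToken]' \
    --output text)
# write a file that sets the corresponding environment variables
cat > /tmp/${USER}_token <<EOF
export AWS_SECRET_ACCESS_KEY="$SECRET"
export AWS_SESSION_TOKEN="$TOKEN"
export AWS_ACCESS_KEY_ID="$AKID"
EOF
# copy it somewhere the jobs can grab it (dCache scratch, as checked below)
ifdh cp /tmp/${USER}_token /pnfs/$EXPERIMENT/scratch/users/$USER/awst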

You want to do this on a machine where you keep your long-term AWS credentials, i.e. your
laptop or desktop (especially if you keep them on a thumb drive).

I have a script for this in $FIFE_UTILS/bin: aws_get_session_cred, which gets a temporary
token file, puts it in /tmp/${USER}_token, and copies it to your dCache scratch area for
jobs to use.

So you just set up the parts you need and run it (don't setup awscli if you have the
aws tools installed in your local system python):

<bel-kwinith>$ export EXPERIMENT=nova  # or your favorite...
<bel-kwinith>$ export SAM_EXPERIMENT=$EXPERIMENT
<bel-kwinith>$ setup fife_utils v3_0_1
<bel-kwinith>$ setup ifdhc v1_8_10
<bel-kwinith>$ aws_get_session_cred -g $EXPERIMENT
<bel-kwinith>$ # check your token file locally and in dcache
<bel-kwinith>$ cat /tmp/${USER}_token
export AWS_SECRET_ACCESS_KEY="xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" 
export AWS_SESSION_TOKEN="xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx..." 
export AWS_ACCESS_KEY_ID="xxxxxxxxxx" 
<bel-kwinith>$ ifdh more /pnfs/$EXPERIMENT/scratch/users/$USER/awst
export AWS_SECRET_ACCESS_KEY="xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" 
export AWS_SESSION_TOKEN="xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx..." 
export AWS_ACCESS_KEY_ID="xxxxxxxxxx" 

Now you have a file in dCache scratch with your temporary token info that the later scripts
can use. If you don't have ifdh on the machine where you keep your AWS credentials, you'll have
to do a double copy with scp or some such to get the credential where the jobs can reach it.
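
For reference, what a job does with that file is roughly the following (a sketch; the
fife_utils copy scripts do the equivalent of this for you, using the token path shown above):

# on the worker node: fetch the token file from dCache scratch and source it,
# so the AWS_* variables are set for whatever runs the S3 copies
ifdh cp /pnfs/$EXPERIMENT/scratch/users/$USER/awst ./awst
source ./awst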

If you do not have Fermi AWS credentials, contact Gabriele Garzoglio
to obtain them.

Launching clone jobs

Now you can launch your copy job(s):

<novagpvm02> setup fife_utils v3_0_1
<novagpvm02> setup ifdhc v1_8_10
<novagpvm02> launch_clone_jobs \
               --jobs=4 \
               --ncopies=4 \
               --name=fast_amazon_aws_test_beamdata_1st_2k \
               --group=nova \
               --dest=s3://nova-analysis/data/input/test5 \
               --zerodeep \
               --paranoid

Note that these are all sam_clone_dataset arguments except for --jobs= and --group=:
--jobs sets how many grid jobs will participate, --group is passed to jobsub_submit, and
--ncopies sets how many copy processes each job will run. So in the above example we're
running 16 copy processes, 4 each on 4 nodes.
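
To scale that up (hypothetical numbers, same dataset and destination as above), you would
just raise both knobs, e.g.:

<novagpvm02> launch_clone_jobs \
               --jobs=8 \
               --ncopies=8 \
               --name=fast_amazon_aws_test_beamdata_1st_2k \
               --group=nova \
               --dest=s3://nova-analysis/data/input/test5 \
               --zerodeep \
               --paranoid

which would run 8 x 8 = 64 copy processes, 8 on each of 8 nodes.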