
Feature #10319

hepcloud - Pythia Test (GEN-SIM)

Added by Gerard Bernabeu Altayo about 4 years ago. Updated about 4 years ago.

Status: Resolved
Priority: Normal
Start date: 09/22/2015
Due date: 09/29/2015
% Done: 0%
Estimated time:
Duration: 8

Description

Hi Gerard,

1. Pythia Test (GEN-SIM) (1 week):

This will take the form of a tarball given by Dave M. with a simple JDL for a
vanilla submit. The purpose of this test is to make sure that CMS jobs have the
environment they need to run on AWS. It will specifically test Frontier,
CVMFS, and stage-out to EOS at FNAL.

Please work with Steve Timm to understand the AWS environment and participate in
the AWS side of debugging. One specific test we need to do is lock down the AWS
network to only allow traffic to FNAL and determine if we have any "leakage"
attempts to other sites.

History

#1 Updated by Gerard Bernabeu Altayo about 4 years ago

I had 1 week but am just starting today, so I have a couple of days to get this done. I will just drop things here, using this Redmine project because it's actually unused... I wanted to remove it but didn't, so I'll just use it as a temporary placeholder for my logs.

#2 Updated by Gerard Bernabeu Altayo about 4 years ago

[cmsdataops@cmssrv271 ~]$ cd /home/cmsdataops/gerard
[cmsdataops@cmssrv271 gerard]$ tar -xzvf HepCloud1.tgz
HepCloud/jdlproto.jdl
HepCloud/JobPackage.pkl
HepCloud/dmason_BoogaBoogaHepCloudNoPileupTest1_150915_181137_6659-Sandbox.tar.bz2
HepCloud/Unpacker.py
HepCloud/submit.sh

This is supposed to output to EOS, where both of my proxy DNs are valid:

[root@cmssrv222 ~]# grep -i gerard /etc/grid-security/grid-mapfile
"/DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=People/CN=Gerard Bernabeu Altayo 949" gerard1
"/DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Gerard Bernabeu Altayo/CN=UID:gerard1" uscms6234
[root@cmssrv222 ~]# grep grid /etc/xrd.cf.mgm
sec.protocol gsi -crl:0 -cert:/etc/grid-security/daemon/hostcert.pem -key:/etc/grid-security/daemon/hostkey.pem -gridmap:/etc/grid-security/grid-mapfile -d:0 -md:sha256:sha1 -gmapopt:2
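
A quick way to check a mapping end to end is to copy a test file through the MGM with that proxy (a sketch; the destination path under /eos is hypothetical):

kx509                    # proxy lands in /tmp/x509up_u`id -u`, picked up by xrdcp
xrdcp /bin/hostname root://cmssrv222.fnal.gov//eos/uscms/store/user/gerard1/map-test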

I will submit directly from the schedd; let's go to it:

cmssrv272  gwms_submit_hepcloud     FNAL HEP CLOUD SUBMITTER/CE

[root@cmssrv272 ~]# condor_q
Configuration Warning "/etc/condor/config.d/30_hepcloud_ce", Line 0: obsolete use of ':' for parameter assignment at puppet : ///modules/s_gwms_collector/etc/condor/config.d/30_hepcloud_ce
Configuration Warning "/etc/condor/config.d/31_hepcloud_gsi", Line 0: obsolete use of ':' for parameter assignment at puppet : ///modules/s_gwms_collector/etc/condor/config.d/31_hepcloud_gsi

-- Submitter: cmssrv272.fnal.gov : <131.225.207.61:46575> : cmssrv272.fnal.gov
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD

0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended
[root@cmssrv272 ~]#
[root@cmssrv272 ~]# mkdir gba
[root@cmssrv272 ~]# cd gba/
[root@cmssrv272 gba]# vim helloworld.jdl
[root@cmssrv272 gba]# vim hello_world.sh
[root@cmssrv272 gba]# bash hello_world.sh
root
cmssrv272.fnal.gov
traceroute to www.fnal.gov (131.225.105.230), 30 hops max, 60 byte packets
1 r-cms-fcc2-1-vlan207.fnal.gov (131.225.207.209) 0.514 ms 0.620 ms 0.733 ms
2 r-dist-fcc2-1-po302.fnal.gov (131.225.15.194) 0.323 ms 0.478 ms 0.589 ms
3 s-lb-fcc2-1-transit.fnal.gov (131.225.23.86) 0.175 ms 0.184 ms 0.169 ms
4 131.225.23.85 (131.225.23.85) 0.627 ms 1.720 ms 0.743 ms
5 s-lb-fcc2-1-transit.fnal.gov (131.225.23.86) 1.854 ms 0.952 ms 1.121 ms
6 131.225.23.85 (131.225.23.85) 1.836 ms 1.900 ms 2.059 ms
7 s-lb-fcc2-1-transit.fnal.gov (131.225.23.86) 2.260 ms 2.333 ms 1.487 ms
8 131.225.23.85 (131.225.23.85) 2.130 ms 2.512 ms 2.234 ms
9 s-lb-fcc2-1-transit.fnal.gov (131.225.23.86) 3.936 ms 2.651 ms 2.684 ms
10 131.225.23.85 (131.225.23.85) 4.066 ms 4.683 ms 5.745 ms
11 s-lb-fcc2-1-transit.fnal.gov (131.225.23.86) 6.232 ms 6.603 ms 6.640 ms
12 131.225.23.85 (131.225.23.85) 8.806 ms 7.571 ms 7.918 ms
13 s-lb-fcc2-1-transit.fnal.gov (131.225.23.86) 7.839 ms 7.764 ms 7.407 ms
14 131.225.23.85 (131.225.23.85) 8.709 ms 9.702 ms 9.870 ms
15 s-lb-fcc2-1-transit.fnal.gov (131.225.23.86) 9.579 ms 10.792 ms 9.555 ms
16 131.225.23.85 (131.225.23.85) 14.026 ms 13.977 ms 15.003 ms
17 s-lb-fcc2-1-transit.fnal.gov (131.225.23.86) 14.599 ms 15.433 ms 14.217 ms
18 131.225.23.85 (131.225.23.85) 15.955 ms 17.677 ms 17.181 ms
19 s-lb-fcc2-1-transit.fnal.gov (131.225.23.86) 18.527 ms 18.503 ms 17.986 ms
20 131.225.23.85 (131.225.23.85) 20.454 ms 19.747 ms 20.035 ms
21 s-lb-fcc2-1-transit.fnal.gov (131.225.23.86) 19.656 ms 18.480 ms 17.353 ms
22 131.225.23.85 (131.225.23.85) 19.313 ms 18.610 ms 18.828 ms
23 s-lb-fcc2-1-transit.fnal.gov (131.225.23.86) 18.737 ms 17.946 ms 17.825 ms
24 131.225.23.85 (131.225.23.85) 19.552 ms 18.460 ms 18.695 ms
25 s-lb-fcc2-1-transit.fnal.gov (131.225.23.86) 19.365 ms 19.774 ms 19.220 ms
26 131.225.23.85 (131.225.23.85) 20.367 ms 20.339 ms *
27 * * *
28 * * *
29 * * *
30 * * *
traceroute to srm.fnal.gov (131.225.81.191), 30 hops max, 60 byte packets
1 r-cms-fcc2-1-vlan207.fnal.gov (131.225.207.209) 0.310 ms 0.517 ms 0.638 ms
2 r-dist-fcc2-1-po302.fnal.gov (131.225.15.194) 0.317 ms 0.427 ms 0.520 ms
3 r-s-core-fcc-228.fnal.gov (131.225.23.230) 0.343 ms 0.403 ms r-s-core-gcc-236.fnal.gov (131.225.23.238) 0.371 ms
4 r-dist-fcc2-3-transit-r-s-core-gcc.fnal.gov (131.225.40.61) 2.403 ms 2.491 ms 131.225.40.102 (131.225.40.102) 0.402 ms
5 srm.fnal.gov (131.225.81.191) 0.235 ms 0.227 ms 0.237 ms
traceroute to cmseos.fnal.gov (131.225.205.49), 30 hops max, 60 byte packets
1 cmseos.fnal.gov (131.225.205.49) 0.762 ms 0.762 ms 0.748 ms
traceroute to cmseos1.fnal.gov (131.225.205.149), 30 hops max, 60 byte packets
1 cmseos1.fnal.gov (131.225.205.149) 1.647 ms 1.648 ms 1.647 ms
traceroute to www.cern.ch (188.184.9.235), 30 hops max, 60 byte packets
1 r-cms-fcc2-1-vlan207.fnal.gov (131.225.207.209) 0.770 ms 0.834 ms 0.945 ms
2 r-dist-fcc2-1-po302.fnal.gov (131.225.15.194) 2.222 ms 2.379 ms 2.514 ms
3 r-s-core-fcc-228.fnal.gov (131.225.23.230) 0.425 ms 0.447 ms 0.493 ms
4 r-s-edge-1-te1-1.fnal.gov (131.225.15.245) 0.407 ms 0.439 ms 0.494 ms
5 r-s-bdr-vl375.fnal.gov (131.225.15.202) 0.423 ms 0.502 ms 0.571 ms
6 fnal-mr2.fnal.gov (198.49.208.229) 0.304 ms 0.234 ms 0.230 ms
7 starcr5-ip-b-fnalmr2.es.net (134.55.220.6) 2.250 ms starcr5-ip-a-fnalmr2.es.net (134.55.49.97) 2.205 ms starcr5-ip-b-fnalmr2.es.net (134.55.220.6) 2.480 ms
8 chiccr5-ip-a-starcr5.es.net (134.55.42.41) 2.374 ms 2.634 ms 2.962 ms
9 washcr5-ip-a-chiccr5.es.net (134.55.36.46) 19.480 ms 19.752 ms 20.020 ms
10 cern513cr5-ip-a-washcr5.es.net (134.55.37.61) 105.395 ms 105.550 ms 105.835 ms
11 e513-e-rbrxl-1-te22.cern.ch (192.65.184.213) 113.198 ms 107.855 ms 109.956 ms
12 e513-e-rbrxl-2-ne0.cern.ch (192.65.184.38) 106.230 ms 105.178 ms 105.165 ms
13 * * *
14 * * *
15 * * *
16 * * *
17 * * *
18 * * *
19 * * *
20 * * *
21 * * *
22 * * *
23 * * *
24 * * *
25 * * *
26 * * *
27 * * *
28 * * *
29 * * *
30 * * *
[root@cmssrv272 gba]#
[root@cmssrv272 gba]# cat helloworld.jdl
############
# Example submit file for vanilla job
############

Universe = vanilla
Executable = hello_world.sh

input = /dev/null
output = hello.out
error = hello.error

Queue

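The hello_world.sh script itself wasn't pasted; judging from its output above, it was presumably something like this (a hedged reconstruction, not the exact script):

#!/bin/bash
# Reconstructed from the output above: print the user and host, then trace
# the route to the FNAL services and to CERN to check connectivity.
whoami
hostname -f
for h in www.fnal.gov srm.fnal.gov cmseos.fnal.gov cmseos1.fnal.gov www.cern.ch; do
    traceroute $h
done
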
Hmmm, looks like 272 does NOT work. Tony says to use 271; I will scp this all there:

[root@cmssrv272 ~]# scp -rp gba cmssrv271:/root/
hello_world.sh 100% 189 0.2KB/s 00:00
helloworld.jdl 100% 269 0.3KB/s 00:00
[root@cmssrv272 ~]#
[root@cmssrv271 gba]# condor_status -any
MyType TargetType Name

glideresource None FNAL_HEPCLOUD_AWS_West_2_c3.2xlarge@hepcl
glideresource None FNAL_HEPCLOUD_AWS_West_2_cc2.8xlarge@hepc
glideresource None FNAL_HEPCLOUD_AWS_West_2_m3.2xlarge@hepcl
glideresource None FNAL_HEPCLOUD_AWS_West_2_r3.2xlarge@hepcl
DaemonMaster None cmssrv274.fnal.gov
Negotiator None cmssrv274.fnal.gov
HAD None
Collector None
Replication None -p 41450
DaemonMaster None cmssrv276.fnal.gov
HAD None
Collector None
Replication None -p 41450
Scheduler None cmssrv277.fnal.gov
DaemonMaster None cmssrv277.fnal.gov
[root@cmssrv271 gba]# condor_q -global
All queues are empty
[root@cmssrv271 gba]#
[root@cmssrv271 gba]# condor_submit helloworld.jdl

ERROR: Submitting jobs as user/group 0 (root) is not allowed for security reasons.
[root@cmssrv271 gba]# sudo -u gerard1 condor_submit helloworld.jdl
sudo: unknown user: gerard1
sudo: unable to initialize policy plugin
[root@cmssrv271 gba]# id gerard1
id: gerard1: No such user
[root@cmssrv271 gba]# useradd gerard1^C
[root@cmssrv271 gba]# man condor_submit
[root@cmssrv271 gba]# useradd gerard1
[root@cmssrv271 gba]# grep gerard1 /etc/passwd
gerard1:x:42936:42936::/home/gerard1:/bin/bash
[root@cmssrv271 gba]#
[root@cmssrv271 gba]# cd ..
[root@cmssrv271 ~]# cp -pr gba /home/gerard1/
[root@cmssrv271 ~]# chown gerard1 /home/gerard1/gba/
[root@cmssrv271 ~]# su gerard1
[gerard1@cmssrv271 root]$ pwd
/root
[gerard1@cmssrv271 root]$ cd /home/gerard1/gba/
[gerard1@cmssrv271 gba]$ condor_submit helloworld.jdl
Submitting job(s).
1 job(s) submitted to cluster 1.
[gerard1@cmssrv271 gba]$
[gerard1@cmssrv271 gba]$ condor_q

-- Submitter: cmssrv271.fnal.gov : <131.225.207.60:9615?sock=28571_43ec> : cmssrv271.fnal.gov
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
1.0 gerard1 9/25 12:01 0+00:00:00 I 0 0.0 hello_world.sh

1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended
[gerard1@cmssrv271 gba]$ condor_status -any
MyType TargetType Name

glideresource None FNAL_HEPCLOUD_AWS_West_2_c3.2xlarge@hepcl
glideresource None FNAL_HEPCLOUD_AWS_West_2_cc2.8xlarge@hepc
glideresource None FNAL_HEPCLOUD_AWS_West_2_m3.2xlarge@hepcl
glideresource None FNAL_HEPCLOUD_AWS_West_2_r3.2xlarge@hepcl
DaemonMaster None cmssrv274.fnal.gov
Negotiator None cmssrv274.fnal.gov
HAD None
Collector None
Replication None -p 41450
DaemonMaster None cmssrv276.fnal.gov
HAD None
Collector None
Replication None -p 41450
Scheduler None cmssrv277.fnal.gov
DaemonMaster None cmssrv277.fnal.gov
[gerard1@cmssrv271 gba]$

Now I have 2 jobs:

[root@cmssrv271 ~]# condor_q

-- Submitter: cmssrv271.fnal.gov : <131.225.207.60:40442> : cmssrv271.fnal.gov
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
1.0 gerard1 9/25 12:01 0+00:00:00 I 0 0.0 hello_world.sh
2.0 gerard1 9/25 12:02 0+00:00:00 I 1000 0.0 (interactive job )

2 jobs; 0 completed, 0 removed, 2 idle, 0 running, 0 held, 0 suspended
[root@cmssrv271 ~]#

Let's see if my hello world works; if so, I'll submit the real one. I will start preparing it in my area on cmssrv271...

As for the certificate/proxy, for ease of use I'll just use kx509, with which I can securely and easily get a proxy:

[root@cmsadmin1 gerard1]# yum install krb5-fermi-getcert
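
For reference, the whole flow is just a Kerberos ticket exchanged for a short-lived X.509 proxy (a sketch; the principal is an assumption and the check assumes the voms clients are installed):

kinit gerard1@FNAL.GOV                            # Kerberos principal is an assumption
kx509                                             # writes the proxy to /tmp/x509up_u`id -u`
voms-proxy-info -all -file /tmp/x509up_u`id -u`   # inspect subject and lifetime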

#3 Updated by Gerard Bernabeu Altayo about 4 years ago

After many fixes with Krista we managed to make the FE/Factory start a VM on AWS for my helloworld test, but the VM is not calling back and is neither pingable nor SSHable.

I'm installing the AWS tools in cmsadmin1:/home/gerard1/AWS to try to see the VM's console....

#4 Updated by Gerard Bernabeu Altayo about 4 years ago

-bash-4.1$ export EC2_HOME=/home/gerard1/AWS/ec2-api-tools-1.7.5.1/
-bash-4.1$ export PATH=$PATH:$EC2_HOME/bin
-bash-4.1$ export JAVA_HOME=/usr/lib/jvm/jre-1.6.0-openjdk.x86_64
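
With that environment set, the classic API tools can pull the console of a stuck instance (a sketch; the credentials and instance ID below are placeholders):

export AWS_ACCESS_KEY=AKIA...                          # placeholder
export AWS_SECRET_KEY=...                              # placeholder
ec2-describe-instances --region us-west-2              # find the instance ID
ec2-get-console-output i-0123abcd --region us-west-2   # instance ID is hypothetical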

Steve kicked in; there was all sorts of trouble with the VMs and proxies. In the end one needs to log in to cmsadmin1 (or any FNAL IP) to be able to SSH in; ping is disabled on the VMs...

#5 Updated by Gerard Bernabeu Altayo about 4 years ago

Using a new WN image now; I fixed a few things and am submitting a couple of jobs with my own proxy. Let's see what happens:

[gerard1@cmssrv271 HepCloud]$ condor_submit jdlproto.jdl
Submitting job(s).
1 job(s) submitted to cluster 6.
[gerard1@cmssrv271 HepCloud]$ cat jdlproto.jdl
universe=vanilla
transfer_input_files=dmason_BoogaBoogaHepCloudNoPileupTest1_150915_181137_6659-Sandbox.tar.bz2,/uscms/home/dmason/play/HepCloud/Unpacker.py,/uscms/home/dmason/play/HepCloud/JobPackage.pkl
should_transfer_files=YES
notification=NEVER
Executable=submit.sh
Args="dmason_BoogaBoogaHepCloudNoPileupTest1_150915_181137_6659-Sandbox.tar.bz2 1152990"
Output=CONDOROUTPUT.$(CLUSTER).$(Process).stdout
Error=CONDORERROR.$(CLUSTER).$(Process).stderr
Log=CONDORLOG.$(CLUSTER).$(Process).log
when_to_transfer_output=ON_EXIT
x509userproxy = myproxy.pem
Queue

[gerard1@cmssrv271 HepCloud]$ kx509 && cp /tmp/x509up_u`id -u` myproxy.pem
Service kx509/certificate
issuer= /DC=gov/DC=fnal/O=Fermilab/OU=Certificate Authorities/CN=Kerberized CA HSM
subject= /DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Gerard Bernabeu Altayo/CN=UID:gerard1
serial=037197E1
hash=acbf6dc7
[gerard1@cmssrv271 HepCloud]$ condor_submit jdlproto.jdl
Submitting job(s).
1 job(s) submitted to cluster 7.
[gerard1@cmssrv271 HepCloud]$ pwd
/home/gerard1/HepCloud
[gerard1@cmssrv271 HepCloud]$

One of them started running quickly!

[gerard1@cmssrv271 HepCloud]$ condor_q

-- Submitter: cmssrv271.fnal.gov : <131.225.207.60:9615?sock=28553_eeb9_3> : cmssrv271.fnal.gov
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
6.0 gerard1 9/28 16:29 0+00:00:51 R 0 0.0 submit.sh dmason_B
7.0 gerard1 9/28 16:29 0+00:00:00 I 0 0.0 submit.sh dmason_B

2 jobs; 0 completed, 0 removed, 1 idle, 1 running, 0 held, 0 suspended
[gerard1@cmssrv271 HepCloud]$

Actually both:

[gerard1@cmssrv271 HepCloud]$ condor_q

-- Submitter: cmssrv271.fnal.gov : <131.225.207.60:9615?sock=28553_eeb9_3> : cmssrv271.fnal.gov
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
6.0 gerard1 9/28 16:29 0+00:01:05 R 0 0.0 submit.sh dmason_B
7.0 gerard1 9/28 16:29 0+00:00:06 R 0 0.0 submit.sh dmason_B

2 jobs; 0 completed, 0 removed, 0 idle, 2 running, 0 held, 0 suspended
[gerard1@cmssrv271 HepCloud]$

They're both on the same WN:

[gerard1@cmssrv271 HepCloud]$ condor_q -l | grep aws
RemoteHost = "slot1@glidein_1481_329032418@ec2-52-89-172-73.us-west-2.compute.amazonaws.com"
RemoteHost = "slot1@glidein_1481_329032418@ec2-52-89-172-73.us-west-2.compute.amazonaws.com"
[gerard1@cmssrv271 HepCloud]$

Logging in to the WN, I see both jobs:

501 1400 0.0 0.0 191052 7944 ? S 13:56 0:00 /usr/bin/python /usr/sbin/pilot-launcher -p /var/run/pilot-launcher.pid
501 1481 0.0 0.0 108596 1832 ? S 13:56 0:00 \_ /bin/bash /home/glidein_pilot/glidein_startup.sh -v std -name hepcloud_instance -entry FNAL_HEPCLOUD_AWS_West_2_
501 5403 0.0 0.0 11472 1420 ? S 13:56 0:00 \_ /bin/bash /home/scratchgwms/glide_CU2FG6/main/condor_startup.sh glidein_config
501 6091 0.0 0.0 93672 8132 ? S 13:56 0:00 \_ /home/scratchgwms/glide_CU2FG6/main/condor/sbin/condor_master -f -pidfile /home/scratchgwms/glide_CU2FG6
501 6093 0.0 0.0 21332 2812 ? S 13:56 0:00 \_ condor_procd -A /home/scratchgwms/glide_CU2FG6/log/procd_address -L /home/scratchgwms/glide_CU2FG6/l
501 6094 0.0 0.0 94368 9112 ? S 13:56 0:00 \_ condor_startd -f
501 8327 0.0 0.0 93880 8464 ? S 16:29 0:00 \_ condor_starter -f -a slot1_1 cmssrv271.fnal.gov
501 8331 0.0 0.0 11336 1316 ? S 16:29 0:00 | \_ /bin/bash /home/scratchgwms/glide_CU2FG6/execute/dir_8327/condor_exec.exe dmason_BoogaBoogaH
501 8367 0.4 0.1 203188 29052 ? Sl 16:29 0:00 | \_ python2.6 Startup.py
501 8453 0.0 0.0 11332 1272 ? S 16:30 0:00 | \_ /bin/bash
501 8454 3.3 0.2 65560 31504 ? S 16:30 0:02 | \_ /cvmfs/cms.cern.ch/slc6_amd64_gcc481/cms/cmssw/CMSSW_7_1_13/external/slc6_amd64_
501 8455 0.0 0.0 93880 8404 ? S 16:30 0:00 \_ condor_starter -f -a slot1_2 cmssrv271.fnal.gov
501 8459 0.0 0.0 11336 1316 ? S 16:30 0:00 \_ /bin/bash /home/scratchgwms/glide_CU2FG6/execute/dir_8455/condor_exec.exe dmason_BoogaBoogaH
501 8491 0.8 0.2 203196 31344 ? Sl 16:30 0:00 \_ python2.6 Startup.py
501 8576 0.0 0.0 11332 1264 ? S 16:30 0:00 \_ /bin/bash
501 8577 1.5 0.2 65560 31868 ? S 16:30 0:00 \_ /cvmfs/cms.cern.ch/slc6_amd64_gcc481/cms/cmssw/CMSSW_7_1_13/external/slc6_amd64_

[root@ec2-52-89-172-73 ~]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/xvda1 6.0G 3.1G 2.7G 54% /
tmpfs 7.4G 0 7.4G 0% /dev/shm
/dev/xvdb 79G 229M 75G 1% /home/scratchgwms
cvmfs2 4.0G 421M 3.5G 11% /cvmfs/cms.cern.ch
[root@ec2-52-89-172-73 ~]#

The jobs finished really quickly:

[gerard1@cmssrv271 HepCloud]$ condor_history
ID OWNER SUBMITTED RUN_TIME ST COMPLETED CMD
7.0 gerard1 9/28 16:29 0+00:01:57 C 9/28 16:32 /home/gerard1/HepCloud/submit.sh dmason_BoogaBoogaHepCloudNoPileupTest1_150915_181137_6659-Sandbox.tar.bz2 1152990
6.0 gerard1 9/28 16:29 0+00:02:56 C 9/28 16:32 /home/gerard1/HepCloud/submit.sh dmason_BoogaBoogaHepCloudNoPileupTest1_150915_181137_6659-Sandbox.tar.bz2 1152990

Something went wrong, looking....

In the standard output I see:

Startup.py : 2015-09-28T16:30:18 : executing task
CMSRunHandler Diagnostic Handler invoked
errors.section_('error0')
errors.error0.type = 'Fatal Exception'
errors.error0.details = 'An exception of category 'Incomplete configuration' occurred while
[0] Constructing the EventProcessor
[1] Constructing ESSource: class=PoolDBESSource label='GlobalTag'
Exception Message:
Valid site-local-config not found at /cvmfs/cms.cern.ch/SITECONF/local/JobConfig/site-local-config.xml'

The first line in the errors is:

ERROR:root:Couldn't find SiteConfig

Siteconf seems to be wrong:

[root@ec2-52-89-172-73 ~]# ll /cvmfs/cms.cern.ch/SITECONF/local/JobConfig/site-local-config.xml
ls: cannot access /cvmfs/cms.cern.ch/SITECONF/local/JobConfig/site-local-config.xml: No such file or directory
[root@ec2-52-89-172-73 ~]# ll /cvmfs/cms.cern.ch/SITECONF/local/JobConfig/
ls: cannot access /cvmfs/cms.cern.ch/SITECONF/local/JobConfig/: No such file or directory
[root@ec2-52-89-172-73 ~]# ll /cvmfs/cms.cern.ch/SITECONF/local/
total 507
lrwxrwxrwx 1 cvmfs cvmfs 0 Jul 9 03:50 local ->
drwxrwxr-x 4 cvmfs cvmfs 3 Nov 30 2012 T0_CH_CERN
drwxrwxr-x 4 cvmfs cvmfs 3 Mar 16 2011 T1_CH_CERN
drwxrwxr-x 4 cvmfs cvmfs 5 Mar 16 2011 T1_DE_FZK
drwxrwxr-x 4 cvmfs cvmfs 4096 Mar 16 2011 T1_DE_KIT
drwxrwxr-x 4 cvmfs cvmfs 4096 Mar 16 2011 T1_ES_PIC
drwxrwxr-x 4 cvmfs cvmfs 4096 Mar 16 2011 T1_FR_CCIN2P3
drwxrwxr-x 4 cvmfs cvmfs 3 Mar 16 2011 T1_IT_CNAF
drwxr-xr-x 4 cvmfs cvmfs 4096 May 2 2013 T1_RU_JINR
drwxrwxr-x 4 cvmfs cvmfs 4096 Mar 16 2011 T1_TW_ASGC
drwxrwxr-x 4 cvmfs cvmfs 4096 Mar 16 2011 T1_UK_RAL
drwxrwxr-x 5 cvmfs cvmfs 4096 Mar 16 2011 T1_US_FNAL
drwxr-xr-x 4 cvmfs cvmfs 3 Jul 19 2013 T1_US_FNAL_Disk
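
The dangling 'local ->' symlink above is the tell: in cms.cern.ch, SITECONF/local is a CVMFS variant symlink resolved from the client's CMS_LOCAL_SITE setting, so with nothing set it points nowhere. A quick client-side check (sketch):

cat /etc/cvmfs/config.d/cms.cern.ch.conf     # empty/missing on the broken WN
readlink /cvmfs/cms.cern.ch/SITECONF/local   # empty target -> CMS_LOCAL_SITE unset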

#6 Updated by Gerard Bernabeu Altayo about 4 years ago

Jobs are finally running for more than 1 minute:

[gerard1@cmssrv271 HepCloud]$ condor_q -long | grep aws
RemoteHost = "slot1@glidein_1481_329032418@ec2-52-89-172-73.us-west-2.compute.amazonaws.com"
RemoteHost = "slot1@glidein_1481_329032418@ec2-52-89-172-73.us-west-2.compute.amazonaws.com"
RemoteHost = "slot1@glidein_1481_329032418@ec2-52-89-172-73.us-west-2.compute.amazonaws.com"
[gerard1@cmssrv271 HepCloud]$ condor_q

-- Submitter: cmssrv271.fnal.gov : <131.225.207.60:9615?sock=28553_eeb9_3> : cmssrv271.fnal.gov
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
9.0 gerard1 9/28 16:50 0+00:06:54 R 0 0.0 submit.sh dmason_B
10.0 gerard1 9/28 16:51 0+00:05:54 R 0 0.0 submit.sh dmason_B
12.0 gerard1 9/28 16:52 0+00:05:53 R 0 0.0 submit.sh dmason_B

3 jobs; 0 completed, 0 removed, 0 idle, 3 running, 0 held, 0 suspended
[gerard1@cmssrv271 HepCloud]$

The issue is a misconfigured CVMFS, which I am fixing this way:

cat /etc/cvmfs/config.d/cms.cern.ch.conf; echo export CMS_LOCAL_SITE=T1_US_FNAL_Disk > /etc/cvmfs/config.d/cms.cern.ch.conf && service autofs restart
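
After the autofs restart the symlink should resolve and the job config should be readable (a quick verification sketch):

cat /etc/cvmfs/config.d/cms.cern.ch.conf   # now: export CMS_LOCAL_SITE=T1_US_FNAL_Disk
ls -l /cvmfs/cms.cern.ch/SITECONF/local    # symlink now points at T1_US_FNAL_Disk
ls /cvmfs/cms.cern.ch/SITECONF/local/JobConfig/site-local-config.xml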

#7 Updated by Gerard Bernabeu Altayo about 4 years ago

  • Status changed from New to Resolved

Hi,

for this to work I must add everyone involved as CMS users with the Production role. Each time I did the test I was manually hacking the EOS 'gridmap file'. I cannot leave the changes in place because several automated processes replace the file every now and then (e.g. when users get added, which happens several times a week).

I need David Mason's approval to add you (users: fuess, timm, amitoj) as CMS users and map you to production. Actually, there is a standard SNOW workflow to request CMS membership. Alternatively, I can place a valid proxy in the schedd submission box and add you to my .k5login. Here is everything I do to submit a job (I just did):

1. Hack the EOS mapping at cmssrv222:/etc/grid-security/grid-mapfile (adding a DN -> cmsprod mapping; see the example line after this list)
2. ssh
2a. cd HepCloud && kx509 && cp /tmp/x509up_u`id -u` myproxy.pem #I can do this for you
2b. [gerard1@cmssrv271 HepCloud]$ condor_submit KISTI-jdlproto.jdl
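
Step 1 amounts to appending a line like this to the grid-mapfile (the DN is mine from comment #2; cmsprod is the production account the DN gets mapped to):

"/DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Gerard Bernabeu Altayo/CN=UID:gerard1" cmsprod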

And here are the contents of the 2 JDLs to submit to KISTI and to AWS:

[gerard1@cmssrv271 HepCloud]$ cat KISTI-jdlproto.jdl
universe=vanilla
transfer_input_files=dmason_BoogaBoogaHepCloudNoPileupTest1_150915_181137_6659-Sandbox.tar.bz2,Unpacker.py,JobPackage.pkl
should_transfer_files=YES
notification=NEVER
Executable=submit.sh
Args="dmason_BoogaBoogaHepCloudNoPileupTest1_150915_181137_6659-Sandbox.tar.bz2 1152990"
Output=CONDOROUTPUT.$(CLUSTER).$(Process).stdout
Error=CONDORERROR.$(CLUSTER).$(Process).stderr
Log=CONDORLOG.$(CLUSTER).$(Process).log
when_to_transfer_output=ON_EXIT
x509userproxy = myproxy.pem
+DESIRED_INSTANCE_TYPE = "m1.large-new"

Queue

[gerard1@cmssrv271 HepCloud]$ cat jdlproto.jdl
universe=vanilla
transfer_input_files=dmason_BoogaBoogaHepCloudNoPileupTest1_150915_181137_6659-Sandbox.tar.bz2,Unpacker.py,JobPackage.pkl
should_transfer_files=YES
notification=NEVER
Executable=submit.sh
Args="dmason_BoogaBoogaHepCloudNoPileupTest1_150915_181137_6659-Sandbox.tar.bz2 1152990"
Output=CONDOROUTPUT.$(CLUSTER).$(Process).stdout
Error=CONDORERROR.$(CLUSTER).$(Process).stderr
Log=CONDORLOG.$(CLUSTER).$(Process).log
when_to_transfer_output=ON_EXIT
x509userproxy = myproxy.pem
+DESIRED_INSTANCE_TYPE = "c3.2xlarge"
Queue
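
The two JDLs are identical except for +DESIRED_INSTANCE_TYPE, which steers the glidein request to KISTI (m1.large-new) or AWS (c3.2xlarge). To confirm where a job landed after submitting either one (same condor commands as above):

condor_submit jdlproto.jdl
condor_q -long | grep -i 'remotehost\|desired_instance_type'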


