GlideinwmsCondorAnnex » History » Version 6

Parag Mhashilkar, 04/07/2016 02:48 PM

h1. GlideinwmsCondorAnnex

This page documents a very early version of the condor_annex tools available from the HTCondor git repository. The notes are based on the following branch:
<pre>
[parag@fermicloud338 condor_annex]$ git branch
* V8_5-condor_annex-branch
  master
</pre>

h2. Preparations

* condor_annex requires --keypair, so I had to run "aws configure" first. This created a $HOME/.aws directory that stores the credentials in clear text, in INI-format files. The configured region can be read back with the following command:
<pre>aws configure get region</pre>
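
The region lookup order described under "Variables for Reference" (explicit --region, then the AWS CLI configuration, then 'us-west-1') could be sketched in shell roughly as follows. This is a minimal sketch, not the condor_annex code: resolve_region is a hypothetical helper name.

<pre>
# Hypothetical helper (not part of condor_annex): pick the AWS region.
# Order: explicit value, then the AWS CLI configuration, then a hard default.
resolve_region() {
    requested="$1"
    if [ -n "$requested" ]; then
        printf '%s\n' "$requested"
        return 0
    fi
    # "aws configure get region" prints the region of the default profile, if set
    configured="$(aws configure get region 2>/dev/null)" || configured=""
    if [ -n "$configured" ]; then
        printf '%s\n' "$configured"
    else
        # Hard default mentioned in these notes
        printf '%s\n' "us-west-1"
    fi
}
</pre>

For example, resolve_region "" falls back to the configured region, while resolve_region us-west-2 returns the explicit value unchanged.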

h2. Working Command

<pre>
/opt/condor/src/condor_annex/condor_annex \
    --verbose \
    --region=us-west-2 \
    --project-id=annex_parag \
    --instances=2 \
    --expiry="2016-04-06 17:00" \
    --central-manager=fermicloud385.fnal.gov \
    --keypair=parag-annex \
    --vpc=vpc-ed33af86 \
    --subnet=subnet-ec33af87,subnet-e233af89,subnet-e333af88 \
    --image-ids=ami-e826cd88 \
    --spot-prices=0.06 \
    --instance-types=m3.medium \
    --password-file=/cloud/login/parag/wspace/glideinWMS/annex/password_file
</pre>

h2. Internals of condor_annex

* The code is written in Perl and invokes AWS CLI commands.
* It requires the awscli Python package:
<pre>pip install awscli</pre>

h2. Variables for Reference

$projectID (--project-id): An arbitrary string chosen by the user; it is used to refer to this annex in future actions.

$annexSize (--instances): Make sure this is defined through --instances.

$expiry (--expiry): When this annex should go away. Note the date format of the argument value, for example "2016-04-06 17:00".

$region (--region): If not provided, the default region from ~/.aws/config is used. 'us-west-1' is the hard default if all else fails.

$centralManager (--central-manager): Central manager that the condor startd reports to. Because this option is required, we cannot use condor_annex as is with GlideinWMS.

$passwordFile (--password-file): Password file created using condor_store_cred and used by the condor startd/master daemons in the VM to join the Condor pool. Irrelevant in the case of GlideinWMS.

$stackName (--stack-name): Name of the AWS stack to use.

$keypairName (--keypair): Name of the AWS keypair to use while creating the stack.

$vpc (--vpc): VPC to use.

$subnet (--subnet): Subnets to use.

$imageIDList (--image-ids)

$spotPriceList (--spot-prices)

$instanceTypeList (--instance-types)

--

$s3Bucket="htcondor-annex-${safeCM}-${projectID}", where $safeCM is $centralManager after handling special characters like ':' and '.'.

$passwordLocation: Value of --password-location (a location in S3), or $s3Bucket/brussel-sprouts if not given.

$configLocation: Value of --config-location (a location in S3), or $s3Bucket/basename($configFile) if not given.
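
The derivation of $s3Bucket can be illustrated with a small shell function. This is only a sketch: annex_bucket_name is a hypothetical helper, and replacing ':' and '.' with '-' is an assumption, since these notes do not record how condor_annex actually handles those characters.

<pre>
# Hypothetical helper: derive the S3 bucket name for an annex.
# Assumption: ':' and '.' in the central manager name are replaced with '-'.
annex_bucket_name() {
    central_manager="$1"
    project_id="$2"
    safe_cm="$(printf '%s' "$central_manager" | sed 's/[:.]/-/g')"
    printf 'htcondor-annex-%s-%s\n' "$safe_cm" "$project_id"
}

# Prints: htcondor-annex-fermicloud385-fnal-gov-annex_parag
annex_bucket_name fermicloud385.fnal.gov annex_parag
</pre>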

h2. Workflow

+*VALIDATION PHASE*+

* Get the AWS region to use.

* Create or get the stack to operate on (modify/delete). If the stack does not exist, it is created as needed using the keypair configured in AWS. If the stack exists and --delete is given to the command, the stack is deleted.

* Use the VPC and subnets passed by the user, or default to the ones tagged with Name HTCondorAnnex. Since subnets are AZ specific, this is also a way to restrict the annex to a given AZ.

* Either provide the launch configuration ($launchConfigList) or provide $imageIDList, $spotPriceList and $instanceTypeList.

* The following AWS commands are used as part of various validations and information gathering in condor_annex:

<pre>
aws --region $region ec2 describe-key-pairs
aws --region $region ec2 describe-vpcs --filters 'Name=tag:Name,Values=HTCondorAnnex'
aws --region $region ec2 describe-subnets --filters 'Name=tag:Name,Values=HTCondorAnnex' "Name=vpc-id,Values=$vpc"
</pre>
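
As an illustration of the VPC choice described above, the following sketch uses the VPC given by the user and otherwise looks up the one tagged Name=HTCondorAnnex. resolve_vpc is a hypothetical helper, and the --query expression is my own addition for extracting the ID; it is not taken from condor_annex.

<pre>
# Hypothetical helper: pick the VPC for the annex.
# Uses the value passed by the user, otherwise looks up the VPC tagged
# Name=HTCondorAnnex in the given region.
resolve_vpc() {
    region="$1"
    requested_vpc="$2"
    if [ -n "$requested_vpc" ]; then
        printf '%s\n' "$requested_vpc"
        return 0
    fi
    found="$(aws --region "$region" ec2 describe-vpcs \
        --filters 'Name=tag:Name,Values=HTCondorAnnex' \
        --query 'Vpcs[0].VpcId' --output text)" || return 1
    # With --output text, a missing value is printed as "None"
    if [ -z "$found" ] || [ "$found" = "None" ]; then
        echo "no VPC tagged HTCondorAnnex in $region" >&2
        return 1
    fi
    printf '%s\n' "$found"
}
</pre>

The same pattern would apply to the subnet lookup, with the additional vpc-id filter.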

+*ACTION PHASE*+

* Create an S3 bucket and store $passwordFile in it. If storing the password file in the bucket fails, delete the bucket and roll back. The same is done for $configFile.
<pre>
aws s3api create-bucket --acl private --bucket $s3Bucket
aws s3 cp $passwordFile $passwordLocation
aws s3 cp $configFile $configLocation
</pre>
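
The create-then-roll-back behaviour could look roughly like the sketch below. store_password_file is a hypothetical helper, and it assumes that the destination is a full s3:// URI and that the rollback is done with "aws s3 rb --force"; the actual condor_annex code may clean up differently.

<pre>
# Hypothetical sketch of the bucket setup with rollback.
# $1 = bucket name, $2 = local password file, $3 = S3 destination (s3://...)
store_password_file() {
    bucket="$1"
    password_file="$2"
    password_location="$3"
    aws s3api create-bucket --acl private --bucket "$bucket" || return 1
    if ! aws s3 cp "$password_file" "$password_location"; then
        # Upload failed: remove the bucket (and anything in it) and report failure
        aws s3 rb "s3://$bucket" --force
        return 1
    fi
}
</pre>

The same function shape would work for $configFile and $configLocation.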

* Now create the CloudFormation stack. $parameters below carries the AMI IDs, spot prices, instance types, VPC, ProjectID, subnets and all the other required values gathered above.
<pre>
aws --region $region cloudformation create-stack \
        --template-url "https://s3.amazonaws.com/condor-annex-${region}/template-${VERSION}" \
        --stack-name $stackName --capabilities CAPABILITY_IAM --parameters $parameters
</pre>

*The CloudFormation launch configuration and Lambda require high privileges at this time for the following steps to work.*
!AWSPermissions-For-condor_annex.png!

* Create an autoscaling group if it does not exist, and wait for it to be created, since its name is needed to adjust the size. The autoscaling group information is obtained through describe-stacks: for every stack in its output, proceed only if StackName matches our stack and StackStatus is CREATE_COMPLETE or UPDATE_COMPLETE. Then loop until ResourceStatus is CREATE_COMPLETE or UPDATE_COMPLETE for all StackResources of type "AWS::AutoScaling::AutoScalingGroup". This is also a way of getting the autoscaling group names for future reference.

<pre>
aws --region $region cloudformation describe-stacks
aws --region $region cloudformation describe-stack-resources --stack-name $stackName
</pre>
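
The wait loop over the autoscaling group resources might be sketched as follows. The helper names, the retry count and the POLL_INTERVAL variable are illustrative assumptions, and the --query filter is one possible way to pick out the relevant statuses, not necessarily what condor_annex does.

<pre>
# Hypothetical helper: succeed only if every status given is
# CREATE_COMPLETE or UPDATE_COMPLETE (and at least one status is given).
all_complete() {
    [ "$#" -gt 0 ] || return 1
    for status in "$@"; do
        case "$status" in
            CREATE_COMPLETE|UPDATE_COMPLETE) ;;
            *) return 1 ;;
        esac
    done
    return 0
}

# Poll the stack until all of its autoscaling groups are ready.
# $1 = region, $2 = stack name, $3 = number of tries (assumed default: 60)
wait_for_asgs() {
    region="$1"
    stack_name="$2"
    tries="${3:-60}"
    while [ "$tries" -gt 0 ]; do
        statuses="$(aws --region "$region" cloudformation describe-stack-resources \
            --stack-name "$stack_name" \
            --query "StackResources[?ResourceType=='AWS::AutoScaling::AutoScalingGroup'].ResourceStatus" \
            --output text)" || return 1
        # Word splitting of $statuses is intended: one argument per status
        if all_complete $statuses; then
            return 0
        fi
        tries=$((tries - 1))
        sleep "${POLL_INTERVAL:-10}"
    done
    return 1
}
</pre>

Querying PhysicalResourceId instead of ResourceStatus with the same filter should give the autoscaling group names.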

* Set the desired size of each autoscaling group. The size is computed by splitting the required annex size across the autoscaling groups.

<pre>
aws --region $region autoscaling update-auto-scaling-group \
       --auto-scaling-group-name $asgName --max-size $size --desired-capacity $size
</pre>
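
How the annex size is split is not recorded in these notes. One simple possibility, shown here purely as an assumption, is an even split with the remainder going to the first groups. split_annex_size is a hypothetical helper.

<pre>
# Hypothetical helper: split an annex size across N autoscaling groups.
# Assumption: an even split, with the remainder going to the first groups.
# $1 = total annex size, $2 = number of autoscaling groups
split_annex_size() {
    total="$1"
    groups="$2"
    [ "$groups" -gt 0 ] || return 1
    base=$((total / groups))
    extra=$((total % groups))
    i=0
    sizes=""
    while [ "$i" -lt "$groups" ]; do
        size="$base"
        if [ "$i" -lt "$extra" ]; then
            size=$((base + 1))
        fi
        sizes="${sizes:+$sizes }$size"
        i=$((i + 1))
    done
    printf '%s\n' "$sizes"
}

# Prints: 2 2 1
split_annex_size 5 3
</pre>

Each resulting value would then be passed as $size to update-auto-scaling-group for the corresponding $asgName.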

* Once the annex has been created, set or update its expiration time. The code is complicated and not worth describing at this time; in short, it depends on heartbeats and alarms.

* Determine how big the annex has grown and whether it is at the required capacity.

* Wait for the annex nodes to join the HTCondor pool. This is a drawback for us: we do not want to use the annex in this mode and need a means to skip this step. It also uses condor_status -constraint 'ProjectID=="$projectID"', which means that it will conflict with the generic ProjectID ClassAd attribute.
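
The pool check in the last step could be approximated with the sketch below. The helper names, the retry count and POLL_INTERVAL are assumptions; only the ProjectID constraint comes from condor_annex itself. Note that condor_status prints one line per slot ad, so the count is of ads rather than of machines.

<pre>
# Hypothetical helper: count ads in the pool that carry this project ID.
count_annex_nodes() {
    project_id="$1"
    condor_status -constraint "ProjectID == \"$project_id\"" -af Name | wc -l | tr -d ' '
}

# Poll until at least $2 ads have joined, or give up after $3 tries.
# $1 = project ID, $2 = wanted count, $3 = number of tries (assumed default: 30)
wait_for_annex_nodes() {
    project_id="$1"
    wanted="$2"
    tries="${3:-30}"
    while [ "$tries" -gt 0 ]; do
        joined="$(count_annex_nodes "$project_id")"
        if [ "$joined" -ge "$wanted" ]; then
            return 0
        fi
        tries=$((tries - 1))
        sleep "${POLL_INTERVAL:-30}"
    done
    return 1
}
</pre>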