GlideinwmsCondorAnnex » History » Version 13
Parag Mhashilkar, 05/20/2016 09:53 AM
1 | 1 | Parag Mhashilkar | h1. GlideinwmsCondorAnnex |
---|---|---|---|
2 | 1 | Parag Mhashilkar | |
3 | 3 | Parag Mhashilkar | This page documents a very early version of the condor_annex tools available from the HTCondor git repository. |
4 | 3 | Parag Mhashilkar | <pre> |
5 | 3 | Parag Mhashilkar | [parag@fermicloud338 condor_annex]$ git branch |
6 | 3 | Parag Mhashilkar | * V8_5-condor_annex-branch |
7 | 3 | Parag Mhashilkar | master |
8 | 3 | Parag Mhashilkar | </pre> |
9 | 1 | Parag Mhashilkar | |
10 | 8 | Parag Mhashilkar | h2. Evaluation |
11 | 1 | Parag Mhashilkar | |
12 | 9 | Parag Mhashilkar | This evaluation is based on a very early version of condor_annex. HTCondor developers are already aware of some of the shortcomings and plan to address them and improve its functionality at some point. |
13 | 8 | Parag Mhashilkar | |
14 | 8 | Parag Mhashilkar | * *Need For Admin Privileges:* condor_annex performs several privileged operations. As a result, the AWS user executing condor_annex needs admin-like clearance. It is not clear that this is required, and it is a potential blocker for production-style use. The operations could be split into privileged and non-privileged ones: privileged operations would be run by the service/infrastructure provider to create the required environment, while non-privileged operations that use this environment would be run by a user to acquire resources. |
15 | 8 | Parag Mhashilkar | |
16 | 8 | Parag Mhashilkar | * *Lack of ClassAds:* GlideinWMS depends on information from HTCondor being available in ClassAd form. This is not a strict requirement, but having the information in ClassAds makes things easier and fits better into the HTCondor ecosystem. It would be useful for condor_annex to periodically update the status of the annex and make it available as a ClassAd in the Collector (?). This way one only deals with the HTCondor system and does not have to worry about the AWS components. |
17 | 8 | Parag Mhashilkar | |
18 | 10 | Parag Mhashilkar | * *Usage Restricted to extend HTCondor pool only:* This use case, though understandable, makes it impossible to integrate condor_annex within GlideinWMS. condor_annex expects HTCondor binaries to be baked into the VM. Through AWS functionality, the annex starts an HTCondor startd that reports back to a predefined collector, and the annex periodically monitors the state of the startds in that collector. GlideinWMS, in contrast, brings its own HTCondor binaries (as desired by the VO) and starts a startd that reports to the VO Collector. condor_annex is envisioned as a GlideinWMS factory-side tool and may not have insight into the VO Collector, so from the GlideinWMS point of view this functionality needs to be optional or easy to turn off. |
19 | 8 | Parag Mhashilkar | |
20 | 10 | Parag Mhashilkar | * *No Integration with condor_submit:* condor_annex is currently a standalone tool. Being able to drive it through condor_submit -annex <submit file> would integrate it better into the HTCondor ecosystem and simplify things for the user community. |
21 | 8 | Parag Mhashilkar | |
22 | 9 | Parag Mhashilkar | * condor_annex is written in Perl and requires the AWS client to be installed. Since the AWS client is available in Python, using the AWS Python APIs directly may also be a viable option. |
23 | 8 | Parag Mhashilkar | |
24 | 11 | Parag Mhashilkar | * Updating an annex that is already up |
25 | 11 | Parag Mhashilkar | |
26 | 13 | Parag Mhashilkar | * Getting condor logs back |
27 | 12 | Parag Mhashilkar | * No control over which instance shuts down |
28 | 12 | Parag Mhashilkar | |
29 | 8 | Parag Mhashilkar | h2. Working Setup |
30 | 8 | Parag Mhashilkar | |
31 | 8 | Parag Mhashilkar | h3. Preparations |
32 | 8 | Parag Mhashilkar | |
33 | 3 | Parag Mhashilkar | * condor_annex requires --keypair, so "aws configure" had to be run first. This created a $HOME/.aws directory with credentials in clear text. The settings are stored as INI files; a configured value can be queried with, for example: |
34 | 1 | Parag Mhashilkar | <pre>aws configure get region</pre> |
35 | 3 | Parag Mhashilkar | |
36 | 8 | Parag Mhashilkar | h3. Working Command |
37 | 3 | Parag Mhashilkar | |
38 | 3 | Parag Mhashilkar | <pre> |
39 | 3 | Parag Mhashilkar | /opt/condor/src/condor_annex/condor_annex \ |
40 | 3 | Parag Mhashilkar | --verbose \ |
41 | 3 | Parag Mhashilkar | --region=us-west-2 \ |
42 | 3 | Parag Mhashilkar | --project-id=annex_parag \ |
43 | 3 | Parag Mhashilkar | --instances=2 \ |
44 | 3 | Parag Mhashilkar | --expiry="2016-04-06 17:00" \ |
45 | 3 | Parag Mhashilkar | --central-manager=fermicloud385.fnal.gov \ |
46 | 3 | Parag Mhashilkar | --keypair=parag-annex \ |
47 | 3 | Parag Mhashilkar | --vpc=vpc-ed33af86 \ |
48 | 3 | Parag Mhashilkar | --subnet=subnet-ec33af87,subnet-e233af89,subnet-e333af88 \ |
49 | 3 | Parag Mhashilkar | --image-ids=ami-e826cd88 \ |
50 | 3 | Parag Mhashilkar | --spot-prices=0.06 \ |
51 | 3 | Parag Mhashilkar | --instance-types=m3.medium \ |
52 | 1 | Parag Mhashilkar | --password-file=/cloud/login/parag/wspace/glideinWMS/annex/password_file |
53 | 1 | Parag Mhashilkar | </pre> |
54 | 3 | Parag Mhashilkar | |
55 | 8 | Parag Mhashilkar | h3. Internals of condor_annex |
56 | 3 | Parag Mhashilkar | |
57 | 1 | Parag Mhashilkar | * The code is written in Perl and invokes AWS client commands. |
58 | 3 | Parag Mhashilkar | * Requires the awscli Python module |
59 | 3 | Parag Mhashilkar | <pre>pip install awscli</pre> |
60 | 3 | Parag Mhashilkar | |
61 | 8 | Parag Mhashilkar | h3. Variables for Reference |
62 | 3 | Parag Mhashilkar | |
63 | 3 | Parag Mhashilkar | $projectID (--project-id): An arbitrary string chosen by the user, used to refer to this annex in future actions |
64 | 3 | Parag Mhashilkar | Make sure $annexSize is defined through --instances |
65 | 3 | Parag Mhashilkar | |
66 | 3 | Parag Mhashilkar | $expiry (--expiry): When this annex should go away. Note the date format for the argument value. |
67 | 3 | Parag Mhashilkar | |
68 | 3 | Parag Mhashilkar | $region (--region): If not provided use default region from ~/.aws/config. Use 'us-west-1' as hard default if all else fails. |
69 | 3 | Parag Mhashilkar | |
70 | 4 | Parag Mhashilkar | $centralManager (--central-manager): Central manager that the HTCondor startd reports to. Because this is required, we cannot use condor_annex as is with GlideinWMS. |
71 | 1 | Parag Mhashilkar | |
72 | 4 | Parag Mhashilkar | $passwordFile (--password-file): Password file created using condor_cred and used by condor startds/master in the VM to join the Condor Pool. Irrelevant in case of GlideinWMS. |
73 | 3 | Parag Mhashilkar | |
74 | 3 | Parag Mhashilkar | $stackName (--stack-name): AWS stack name to use |
75 | 3 | Parag Mhashilkar | |
76 | 3 | Parag Mhashilkar | $keypairName (--keypair): Name of the keypair in AWS to use while creating stack |
77 | 3 | Parag Mhashilkar | |
78 | 3 | Parag Mhashilkar | $vpc (--vpc): VPC to use |
79 | 3 | Parag Mhashilkar | |
80 | 3 | Parag Mhashilkar | $subnet (--subnet): Subnets to use |
81 | 3 | Parag Mhashilkar | |
82 | 3 | Parag Mhashilkar | $imageIDList (--image-ids) |
83 | 3 | Parag Mhashilkar | |
84 | 1 | Parag Mhashilkar | $spotPriceList (--spot-prices) |
85 | 1 | Parag Mhashilkar | |
86 | 3 | Parag Mhashilkar | $instanceTypeList (--instance-types) |
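
The region fallback described for $region above can be sketched in Python as follows. This is only an illustration of the lookup order (command-line value, then the config file, then 'us-west-1'), not the Perl code condor_annex actually uses; the function name @resolve_region@ and the @profile@ parameter are hypothetical.

```python
import configparser
import os

# Hard default named in the notes above.
HARD_DEFAULT_REGION = "us-west-1"


def resolve_region(cli_region=None, config_path="~/.aws/config", profile="default"):
    """Pick the region: --region value, else the AWS config file, else the hard default."""
    if cli_region:
        return cli_region
    parser = configparser.ConfigParser()
    # ConfigParser.read() silently skips files that do not exist.
    parser.read(os.path.expanduser(config_path))
    # has_option() returns False when the section itself is missing.
    if parser.has_option(profile, "region"):
        return parser.get(profile, "region")
    return HARD_DEFAULT_REGION
```

Calling @resolve_region("us-west-2")@ returns the explicit value unchanged, while @resolve_region()@ on a machine without ~/.aws/config falls through to 'us-west-1'.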
87 | 4 | Parag Mhashilkar | |
88 | 5 | Parag Mhashilkar | -- |
89 | 4 | Parag Mhashilkar | |
90 | 1 | Parag Mhashilkar | $s3Bucket="htcondor-annex-${safeCM}-${projectID}" where $safeCM is $centralManager after handling special characters like ':' and '.' |
91 | 1 | Parag Mhashilkar | |
92 | 4 | Parag Mhashilkar | $passwordLocation: --password-location in s3 or $s3Bucket/brussel-sprouts |
93 | 4 | Parag Mhashilkar | |
94 | 4 | Parag Mhashilkar | $configLocation: --config-location in s3 or $s3Bucket/basename($configFile) |
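
A minimal Python sketch of how the derived names above fit together. The notes only say that $safeCM is $centralManager "after handling special characters"; replacing every character outside letters, digits and '-' with '-' is an assumption made here for illustration, and the helper names are hypothetical.

```python
import os
import re


def s3_bucket_name(central_manager, project_id):
    """Build the bucket name htcondor-annex-${safeCM}-${projectID}."""
    # ASSUMPTION: special characters such as ':' and '.' are replaced by '-'.
    # The actual substitution used by condor_annex is not documented above.
    safe_cm = re.sub(r"[^A-Za-z0-9-]", "-", central_manager)
    return "htcondor-annex-{}-{}".format(safe_cm, project_id)


def default_locations(bucket, config_file):
    """Default password and config locations when no explicit location is given."""
    return {
        "password": "{}/brussel-sprouts".format(bucket),
        "config": "{}/{}".format(bucket, os.path.basename(config_file)),
    }
```

With the values from the working command, @s3_bucket_name("fermicloud385.fnal.gov", "annex_parag")@ gives @htcondor-annex-fermicloud385-fnal-gov-annex_parag@ under this assumed substitution.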
95 | 4 | Parag Mhashilkar | |
96 | 8 | Parag Mhashilkar | h3. Workflow |
97 | 4 | Parag Mhashilkar | |
98 | 4 | Parag Mhashilkar | +*VALIDATION PHASE*+ |
99 | 4 | Parag Mhashilkar | |
100 | 3 | Parag Mhashilkar | * Get the aws region to use. |
101 | 3 | Parag Mhashilkar | |
102 | 3 | Parag Mhashilkar | * Create or get the stack to operate on (modify/delete). If the stack does not exist, it is created as needed using the keypair configured in AWS. If the stack exists and --delete is given to the command, the stack is deleted. Use the VPC and subnets passed by the user, or default to the ones with Name HTCondorAnnex. Since subnets are AZ specific, this is also a way to restrict the annex to a given AZ. Either provide a launch configuration ($launchConfigList) or provide $imageIDList, $spotPriceList and $instanceTypeList. The following AWS commands are used as part of various validations/information gathering in condor_annex: |
103 | 3 | Parag Mhashilkar | |
104 | 3 | Parag Mhashilkar | <pre> |
105 | 3 | Parag Mhashilkar | aws --region $region ec2 describe-key-pairs |
106 | 1 | Parag Mhashilkar | aws --region $region ec2 describe-vpcs --filters 'Name=tag:Name,Values=HTCondorAnnex' |
107 | 1 | Parag Mhashilkar | aws --region $region ec2 describe-subnets --filters 'Name=tag:Name,Values=HTCondorAnnex' "Name=vpc-id,Values=$vpc" |
108 | 1 | Parag Mhashilkar | </pre> |
109 | 4 | Parag Mhashilkar | |
110 | 4 | Parag Mhashilkar | +*ACTION PHASE*+ |
111 | 4 | Parag Mhashilkar | |
112 | 4 | Parag Mhashilkar | * Create an S3 bucket and store $passwordFile in it. If storing the password file in the bucket fails, delete the bucket and roll back. The same is done for $configFile. |
113 | 4 | Parag Mhashilkar | <pre> |
114 | 4 | Parag Mhashilkar | aws s3api create-bucket --acl private --bucket $s3Bucket |
115 | 4 | Parag Mhashilkar | aws s3 cp $passwordFile $passwordLocation |
116 | 4 | Parag Mhashilkar | aws s3 cp $configFile $configLocation |
117 | 4 | Parag Mhashilkar | </pre> |
118 | 4 | Parag Mhashilkar | |
119 | 1 | Parag Mhashilkar | * Now create the CloudFormation stack. $parameters below carries the AMI IDs, spot prices, instance types, VPCs, ProjectID, Subnet, ..., all the required information gathered above. |
120 | 5 | Parag Mhashilkar | <pre> |
121 | 5 | Parag Mhashilkar | aws --region $region cloudformation create-stack \ |
122 | 5 | Parag Mhashilkar | --template-url "https://s3.amazonaws.com/condor-annex-${region}/template-${VERSION}" \ |
123 | 5 | Parag Mhashilkar | --stack-name $stackName --capabilities CAPABILITY_IAM --parameters $parameters |
124 | 5 | Parag Mhashilkar | </pre> |
125 | 4 | Parag Mhashilkar | |
126 | 4 | Parag Mhashilkar | *CloudFormation launch configuration and Lambda require high privileges at this time for the following steps to work* |
127 | 6 | Parag Mhashilkar | !AWSPermissions-For-condor_annex.png! |
128 | 4 | Parag Mhashilkar | |
129 | 7 | Parag Mhashilkar | * Create an autoscaling group if it does not exist. Wait for it to be created, since its name is needed to adjust the size; the autoscaling group info is obtained through describe-stacks. For every stack in the describe-stacks output, proceed only if StackName matches our stack and StackStatus is CREATE_COMPLETE or UPDATE_COMPLETE. Then loop until ResourceStatus is CREATE_COMPLETE or UPDATE_COMPLETE for all the StackResources of type "AWS::AutoScaling::AutoScalingGroup". |
130 | 4 | Parag Mhashilkar | This is also a way of getting the autoscaling group names for future reference. |
131 | 4 | Parag Mhashilkar | |
132 | 4 | Parag Mhashilkar | <pre> |
133 | 4 | Parag Mhashilkar | aws --region $region cloudformation describe-stacks |
134 | 4 | Parag Mhashilkar | aws --region $region cloudformation describe-stack-resources --stack-name $stackName |
135 | 4 | Parag Mhashilkar | </pre> |
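
The readiness check on the autoscaling groups can be sketched in Python as below. It is a rough illustration, not the condor_annex code: it takes the JSON text printed by describe-stack-resources, and it assumes that PhysicalResourceId of an "AWS::AutoScaling::AutoScalingGroup" resource is the autoscaling group name. The function name @ready_asg_names@ is hypothetical.

```python
import json

READY_STATES = {"CREATE_COMPLETE", "UPDATE_COMPLETE"}
ASG_TYPE = "AWS::AutoScaling::AutoScalingGroup"


def ready_asg_names(describe_output):
    """Return the autoscaling group names once all of them are ready, else None.

    describe_output is the JSON text from
    'aws cloudformation describe-stack-resources --stack-name ...'.
    """
    resources = json.loads(describe_output)["StackResources"]
    groups = [r for r in resources if r["ResourceType"] == ASG_TYPE]
    # No groups yet, or at least one still being created/updated: keep waiting.
    if not groups or any(r["ResourceStatus"] not in READY_STATES for r in groups):
        return None
    # ASSUMPTION: the physical resource ID of an ASG is its group name.
    return [r["PhysicalResourceId"] for r in groups]
```

A caller would poll describe-stack-resources, sleeping between attempts, until this function returns a list instead of None.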
136 | 4 | Parag Mhashilkar | |
137 | 1 | Parag Mhashilkar | * Set the desired size of each autoscaling group. The size is computed by splitting the required annex size across the various autoscaling groups. |
138 | 1 | Parag Mhashilkar | |
139 | 5 | Parag Mhashilkar | <pre> |
140 | 5 | Parag Mhashilkar | aws --region $region autoscaling update-auto-scaling-group \ |
141 | 5 | Parag Mhashilkar | --auto-scaling-group-name $asgName --max-size $size --desired-capacity $size |
142 | 5 | Parag Mhashilkar | </pre> |
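
How the annex size is divided among the groups is not described above. As one possible reading, the sketch below splits it as evenly as possible, giving the leftover instances to the first groups; both the even split and the function name @split_annex_size@ are assumptions for illustration.

```python
def split_annex_size(annex_size, asg_names):
    """Split annex_size across the autoscaling groups as evenly as possible.

    ASSUMPTION: an even split; condor_annex may weight the groups differently.
    """
    if not asg_names:
        raise ValueError("at least one autoscaling group is required")
    base, extra = divmod(annex_size, len(asg_names))
    # The first 'extra' groups receive one additional instance each.
    return {name: base + (1 if i < extra else 0) for i, name in enumerate(asg_names)}
```

Each resulting value would then be passed as $size for the matching $asgName in the update-auto-scaling-group call shown above.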
143 | 4 | Parag Mhashilkar | |
144 | 4 | Parag Mhashilkar | * Once the annex has been created, set/update the expiration time. This is some complicated code not worth describing at this time. In short, it depends on heartbeats and alarms. |
145 | 4 | Parag Mhashilkar | |
146 | 4 | Parag Mhashilkar | * Determine how large the annex has grown and whether it has reached the required capacity |
147 | 4 | Parag Mhashilkar | |
148 | 4 | Parag Mhashilkar | * Wait for the annex nodes to join the HTCondor pool. This is a BUMMER because we do not want to use the annex in this mode and want a means to skip this step. It also uses condor_status -constraint 'ProjectID=="$projectID"', which means that it will conflict with the generic projectid ClassAd attribute. |