GlideinwmsCondorAnnex » History » Version 4

Parag Mhashilkar, 04/07/2016 02:43 PM


GlideinwmsCondorAnnex

This is some documentation on a very early version of the condor_annex tools available from the HTCondor git repository.

[parag@fermicloud338 condor_annex]$ git branch
* V8_5-condor_annex-branch
  master

Preparations

  • condor_annex requires --keypair, so "aws configure" had to be run first. This created a $HOME/.aws directory with the credentials stored in clear text. It is an INI file; the stored values can be queried with commands like the following:
    aws configure get region
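For reference, the files written by "aws configure" look roughly like this (the key values below are placeholders, not real credentials, and the exact layout may vary by awscli version):

```ini
; $HOME/.aws/credentials -- note: stored in clear text
[default]
aws_access_key_id = AKIAEXAMPLEKEYID
aws_secret_access_key = exampleSecretAccessKey

; $HOME/.aws/config
[default]
region = us-west-2
output = json
```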

Working Command

/opt/condor/src/condor_annex/condor_annex \
    --verbose \
    --region=us-west-2 \
    --project-id=annex_parag \
    --instances=2 \
    --expiry="2016-04-06 17:00" \
    --central-manager=fermicloud385.fnal.gov \
    --keypair=parag-annex \
    --vpc=vpc-ed33af86 \
    --subnet=subnet-ec33af87,subnet-e233af89,subnet-e333af88 \
    --image-ids=ami-e826cd88 \
    --spot-prices=0.06 \
    --instance-types=m3.medium \
    --password-file=/cloud/login/parag/wspace/glideinWMS/annex/password_file

Internals of condor_annex

  • The code is in Perl and it invokes aws client commands.
  • Requires the awscli Python module:
    pip install awscli

Variables for Reference

$projectID (--project-id): An arbitrary string chosen by the user to identify this annex in future actions.

$annexSize (--instances): The requested size of the annex; make sure it is defined.

$expiry (--expiry): When this annex should go away. Note the date format for the argument value.

$region (--region): If not provided, the default region from ~/.aws/config is used; 'us-west-1' is the hard default if all else fails.

$centralManager (--central-manager): Central manager to which the condor startds will report. Because this is required, condor_annex cannot be used as-is with GlideinWMS.

$passwordFile (--password-file): Password file created using condor_cred and used by the condor startds/master in the VM to join the HTCondor pool. Irrelevant in the case of GlideinWMS.

$stackName (--stack-name): AWS stack name to use

$keypairName (--keypair): Name of the keypair in AWS to use while creating stack

$vpc (--vpc): VPC to use

$subnet (--subnet): Subnets to use

$imageIDList (--image-ids)

$spotPriceList (--spot-prices)

$instanceTypeList (--instance-types)

--
$s3Bucket="htcondor-annex-${safeCM}-${projectID}", where $safeCM is $centralManager with special characters like ':' and '.' handled.
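A minimal sketch of how the bucket name could be derived; the exact substitution (here, replacing '.' and ':' with '-') is an assumption based on the description above:

```shell
# Hypothetical derivation of $s3Bucket: S3 bucket names cannot contain
# ':' and dots complicate virtual-hosted-style access, so map them to '-'
centralManager="fermicloud385.fnal.gov"
projectID="annex_parag"

safeCM=$(printf '%s' "$centralManager" | tr '.:' '--')
s3Bucket="htcondor-annex-${safeCM}-${projectID}"
echo "$s3Bucket"   # -> htcondor-annex-fermicloud385-fnal-gov-annex_parag
```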

$passwordLocation (--password-location): Location in S3; defaults to $s3Bucket/brussel-sprouts

$configLocation (--config-location): Location in S3; defaults to $s3Bucket/basename($configFile)

Workflow

VALIDATION PHASE

  • Get the AWS region to use.
  • Create or get the stack to operate on (modify/delete). If the stack does not exist, it is created as needed using the keypair configured in AWS. If the stack exists and --delete is given to the command, delete the stack. Use the VPC and subnets passed by the user, or use the defaults tagged with the Name HTCondorAnnex. Since subnets are AZ-specific, this is also a way to restrict the annex to a given AZ. Either provide a launch configuration ($launchConfigList) or provide $imageIDList, $spotPriceList, and $instanceTypeList. The following AWS commands are used as part of the various validations and information gathering in condor_annex:
aws --region $region ec2 describe-key-pairs
aws --region $region ec2 describe-vpcs --filters 'Name=tag:Name,Values=HTCondorAnnex'
aws --region $region ec2 describe-subnets --filters 'Name=tag:Name,Values=HTCondorAnnex' "Name=vpc-id,Values=$vpc"

ACTION PHASE

  • Create an S3 bucket to store $passwordFile and upload the file to it. If storing the password file in the bucket fails, delete the bucket and roll back. The same is done for $configFile:
    aws s3api create-bucket --acl private --bucket $s3Bucket
    aws s3 cp $passwordFile $passwordLocation
    aws s3 cp $configFile $configLocation
    
  • Now create the CloudFormation stack. $parameters below carries the AMI IDs, spot prices, instance types, VPC, ProjectID, subnets, ..., all the required information gathered above.
    aws --region $region cloudformation create-stack --template-url "https://s3.amazonaws.com/condor-annex-${region}/template-${VERSION}" --stack-name $stackName --capabilities CAPABILITY_IAM --parameters $parameters
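The create-and-roll-back behavior of the bucket step above can be sketched as follows; the aws calls are replaced here by stub functions, so this only illustrates the control flow, not the real commands:

```shell
# Stubs standing in for the real aws commands (hypothetical names)
create_bucket() { echo "create-bucket $1"; }        # aws s3api create-bucket ...
upload_file()  { return 1; }                        # simulate a failed `aws s3 cp`
delete_bucket() { echo "delete-bucket $1"; }        # aws s3api delete-bucket ...

s3Bucket="htcondor-annex-example"
create_bucket "$s3Bucket"
if ! upload_file passwordFile "s3://$s3Bucket/brussel-sprouts"; then
    # upload failed: remove the bucket so nothing half-configured is left behind
    delete_bucket "$s3Bucket"
fi
```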

The CloudFormation launch configuration and Lambda require high privileges at this time for the following steps to work

  • Create an autoscaling group if it does not exist. Wait for it to be created, since its name is needed to adjust the size, and get the autoscaling group info via describe-stacks. For every stack in the output, act on it if StackName matches our stack and StackStatus is CREATE_COMPLETE or UPDATE_COMPLETE. Loop until the ResourceStatus is CREATE_COMPLETE or UPDATE_COMPLETE for all the StackResources of type "AWS::AutoScaling::AutoScalingGroup". This is also a way of getting the autoscaling group names for future reference.

aws --region $region cloudformation describe-stacks
aws --region $region cloudformation describe-stack-resources --stack-name $stackName
  • Set the autoscaling group desired size, which is computed by splitting the required annex size across the various autoscaling groups
aws --region $region autoscaling update-auto-scaling-group --auto-scaling-group-name $asgName --max-size $size --desired-capacity $size
  • Once the annex has been created, set/update the expiration time. This code is complicated and not worth describing at this time; in short, it depends on heartbeats and alarms.
  • Determine how big the annex has grown and whether we are at the required capacity
  • Wait for the annex nodes to join the HTCondor pool. This is a BUMMER because we do not want to use the annex in this mode and want a means to skip this step. It also uses condor_status -constraint 'ProjectID=="$projectID"', which means it will conflict with the generic ProjectID classad attribute.
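The desired-capacity computation mentioned above, splitting the required annex size across the autoscaling groups, presumably does something like the following even split; the exact rounding behavior is an assumption:

```shell
# Hypothetical split of $annexSize across $numGroups autoscaling groups,
# handing the remainder out one instance at a time to the first groups
annexSize=10
numGroups=3
base=$(( annexSize / numGroups ))
rem=$(( annexSize % numGroups ))
for i in $(seq 1 "$numGroups"); do
    size=$base
    if [ "$i" -le "$rem" ]; then size=$(( size + 1 )); fi
    echo "asg-$i desired-capacity=$size"
done
# -> asg-1 desired-capacity=4
#    asg-2 desired-capacity=3
#    asg-3 desired-capacity=3
```

Each computed size would then be applied with the update-auto-scaling-group command shown above.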