Glideinwms On Cloud

Last Updated: October 16, 2012

Status So Far

  • Support for Amazon EC2 is available and tested in glideinwms v3.0
  • Check whether creation of Cloud VM is documented
    • Document the steps involved.
    • Give examples for a couple of different clouds.
  • Set up the factory with entries for the following clouds:
    • Amazon EC2
    • FutureGrid (Eucalyptus)
    • FutureGrid (OpenStack)
    • FutureGrid (Nimbus)
  • Set up and test VMs on each of the above clouds
    • Amazon EC2: Tony has created a VM with AMI ID ??. Check with Burt if we can get access to Amazon to run some test jobs.
    • FutureGrid (Eucalyptus)
    • FutureGrid (OpenStack)
    • FutureGrid (Nimbus)
  • Useful tools that we can provide
    • Tool to check the validity of the image file:
      • only check existence of glidein bootstrap software
      • report what cloud provider the image is configured for

OpenStack

SUCCESSFULLY TESTED Glideinwms on FutureGrid on March 16, 2013

IMAGE CREATION: https://cdcvs.fnal.gov/redmine/projects/gwms-cloud-vms/wiki

OpenNebula

This work is being done as part of the KISTI-Fermilab CRADA.

STEP 1: Condor Scheduler Setup on fgitb334.fnal.gov

  • Installed Condor in /opt as submit-only, as a non-root user. Anything will work provided there is a schedd to submit a job to.
  • Created a Condor JDF generic enough to launch a VM. For now, user data is ignored and not passed to the VM.
    ########################
    # FILE: ec2.jdf
    ########################
    
    Universe = grid
    Executable = system-info.sh
    output = joboutput/out.$(cluster).$(process)
    error = joboutput/err.$(cluster).$(process)
    log = joboutput/log.$(cluster).$(process)
    
    Grid_Resource = ec2 $ENV(EC2_URL)
    ec2_ami_id = $ENV(EC2_AMI_ID)
    ec2_instance_type = $ENV(EC2_INSTANCE_TYPE)
    # both attributes take paths to files holding the credentials
    ec2_access_key_id = $ENV(EC2_ACCESS_KEY_FILE)
    ec2_secret_access_key = $ENV(EC2_SECRET_KEY_FILE)
    #ec2_keypair_file = $ENV(CREDENTIAL_DIR)/ssh_key_pair.$(Cluster).$(Process).pem
    #ec2_user_data = $ENV(USER_DATA)#### -cluster $(Cluster) -subcluster $(Process)
    
    Notification = Never
    Queue 1
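
The JDF above takes all of its settings from the environment. A minimal sketch of setting those variables before submission; every value below is an illustrative placeholder:

    # endpoint and image/type for the target cloud (placeholders)
    export EC2_URL=https://fgitb334.fnal.gov:8444/
    export EC2_AMI_ID=ami-00000000
    export EC2_INSTANCE_TYPE=m1.small
    # paths to files holding the access and secret keys
    export EC2_ACCESS_KEY_FILE=$HOME/.ec2/access_key
    export EC2_SECRET_KEY_FILE=$HOME/.ec2/secret_key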
        

STEP 2: Launch a VM using the FermiCloud-Provided Image

  • Submit a condor job.
    condor_submit ec2.jdf
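  • Once submitted, the job can be followed with standard Condor tools; a held job usually indicates an EC2-side problem. A short sketch (the job id 42.0 is illustrative):
    # overall status and the hold reason, if any
    condor_q
    condor_q -l 42.0 | grep -i holdreason
    # the EC2 GAHP log (see Errors & Resolutions below) lives at the
    # location given by the EC2_GAHP_LOG config setting
    condor_config_val EC2_GAHP_LOG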

Errors & Resolutions

  • Authentication errors like the following in EC2GahpLog.<user>: you need to append the certificate of the CA used by the EC2_URL endpoint to /etc/pki/tls/certs/ca-bundle.crt. This ca-bundle.crt is curl's equivalent of /etc/grid-security/certificates (see the sketch after this list).
    06/13/13 15:16:16 curl_easy_perform() failed (60): 'Peer certificate cannot be authenticated with given CA certificates'.
  • Also, the EC2_INSTANCE_TYPE provided by the cloud provider needs to be configured correctly for the VM to launch properly.
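
A minimal sketch of the fix; endpoint-ca.pem is a placeholder for the CA certificate actually used by the EC2_URL endpoint:

    # append the endpoint's CA certificate to curl's trust bundle
    cat endpoint-ca.pem >> /etc/pki/tls/certs/ca-bundle.crt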

STEP 3: Launch a VM using a Custom Image

  • Create an image using your favorite tool. For this step, I used the SL6 (not SLF6) image created while testing Glideinwms submission to FutureGrid. The image was created using oz (part of the Aeolus project at Red Hat); a sketch of the build command follows this list.
  • This image has the glidein service installed, so by default the VM will shut down if the glidein setup fails or the glidein exits. Tweaked the image to disable auto shutdown. This step is purely specific to the image being used.
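
A hedged sketch of the oz build; sl6-glidein.tdl is a placeholder for the TDL file describing the SL6 install:

    # build the image from its TDL description
    # (-u customizes the image after the base install; -d sets debug level)
    oz-install -d 2 -u sl6-glidein.tdl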

Errors & Resolutions

  • econe-upload fails with the following error: the actual cause was no space left on device in /tmp, which messes up the service. Mount a large enough disk on /tmp and restart the services in order, oned.fcl before httpd (see the sketch after the error page). It is possible that this temporary staging area (/tmp) used by OpenNebula is configurable and can be set to some other location; this needs further investigation. In production, this location should have enough space to allow simultaneous upload of several images by different users, and the upload temporarily requires twice the image size in space.
    <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
    <html><head>
    <title>502 Bad Gateway</title>
    </head><body>
    <h1>Bad Gateway</h1>
    <p>The proxy server received an invalid
    response from an upstream server.<br />
    </p>
    <hr>
    <address>Apache/2.2.15 (Scientific Linux) Server at fgitb334.fnal.gov Port 8444</address>
    </body></html>
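    A sketch of the recovery steps; the service commands are assumptions (the one utility is normally run as oneadmin) and may differ per installation:
    # check free space in the staging area
    df -h /tmp
    # restart the services in order: OpenNebula first, then Apache
    su - oneadmin -c 'one stop && one start'
    service httpd restart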
      
  • econe-upload fails with the following error: this turned out to be harmless; after about 10 minutes the image was available.
    econe-upload: <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
    <html><head>
    <title>502 Proxy Error</title>
    </head><body>
    <h1>Proxy Error</h1>
    <p>The proxy server received an invalid
    response from an upstream server.<br />
    The proxy server could not handle the request <em><a href="/">POST&nbsp;/</a></em>.<p>
    Reason: <strong>Error reading from remote server</strong></p></p>
    <hr>
    <address>Apache/2.2.15 (Scientific Linux) Server at fgitb334.fnal.gov Port 8444</address>
    </body></html>
      
  • econe-describe-images errors: if you run econe-describe-images right after the upload command, you will see error messages. So far I had to wait a while for the upload and internal registration process to finish before econe-describe-images worked. This is BAD in production, where multiple users may be uploading and checking their images simultaneously. A polling sketch follows.
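    A polling sketch; IMAGE_ID and the 60-second interval are illustrative:
    # wait until the uploaded image is registered and listed
    until econe-describe-images | grep -q "$IMAGE_ID"; do
        sleep 60
    done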
  • VM is launched but does not get an IP address: IN PROGRESS
  • Added /etc/init.d/one-context from the default image to the custom image.
  • Make sure the one-context service starts and stops at the appropriate run levels by creating the start & stop links:
    [rc0.d] ln -s  ../init.d/one-context K91one-context
    [rc1.d] ln -s  ../init.d/one-context S09one-context
    [rc2.d] ln -s  ../init.d/one-context S09one-context
    [rc3.d] ln -s  ../init.d/one-context S09one-context
    [rc4.d] ln -s  ../init.d/one-context S09one-context
    [rc5.d] ln -s  ../init.d/one-context K91one-context
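    On SL, chkconfig can create the same links, assuming /etc/init.d/one-context carries a chkconfig header matching the links above (the header values here are an assumption):
    # assumed header inside /etc/init.d/one-context:
    #   # chkconfig: 1234 09 91
    chkconfig --add one-context
    chkconfig --list one-context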
    
  • Fixed several issues with the one-context and init.sh scripts and generalized them to work with SL5- and SL6-based images.
  • New scripts used m2.small but failed: the new instance type needs to be registered in etc/vmm_ec2/vmm_ec2.conf & econe.conf; a sketch follows.
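    A sketch of the econe.conf entry; the exact syntax is an assumption based on OpenNebula 3.x and may vary by version:
    # map the instance type to its erb template
    :instance_types:
      :m2.small:
        :template: m2.small.erb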

SUCCESS: Successfully launched a custom VM using Condor on June 21, 2013

Passing EC2_USER_DATA

  • Condor can pass EC2_USER_DATA if it is specified in the JDF via ec2_user_data. Alternatively, ec2_user_data_file can name a file containing the data you want to pass (see the sketch after this list).
  • Configure the erb file of the EC2_INSTANCE_TYPE in OpenNebula so that the EC2_USER_DATA is made available in context.sh, as shown below. Need to investigate whether there is a smarter way of appending the EC2_USER_DATA to the context rather than specifying the entire CONTEXT section in if-else form (see the alternative sketch after the erb file).
  • On the VM, EC2_USER_DATA is base64 encoded. Just run base64 --decode on the data to extract the actual content.
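
A minimal sketch of the two JDF variants and the on-VM decode; the file path is illustrative:

    # in the JDF: pass the user data inline ...
    ec2_user_data = -cluster $(Cluster) -subcluster $(Process)
    # ... or from a file
    ec2_user_data_file = /path/to/user_data

    # on the VM, after sourcing the context:
    echo "$EC2_USER_DATA" | base64 --decode
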
############################
# m2.small.erb
############################
NAME   = eco-vm-new
CPU    = 1
MEMORY = 1024
OS = [ ARCH = x86_64 ]

DISK = [ IMAGE_ID   = <%= erb_vm_info[:img_id] %>,
         target = vda ]

DISK  = [
  type     = swap,
  size     = 5120,
  target   = vdb ]

NIC=[NETWORK_ID=0,
     MODEL = virtio]

FEATURES=[ acpi="yes" ]

GRAPHICS = [
  type    = "vnc",
  listen  = "127.0.0.1",
  port    = "-1",
  autoport = "yes",
  keymap="en-us" ]

<% if erb_vm_info[:user_data] %>
CONTEXT = [
    ip_public   = "$NIC[IP, NETWORK_ID=0]",
    netmask     = "255.255.255.0",
    gateway     = "131.225.64.1",
    ns          = "131.225.8.120",
    files       = "/cloud/login/parag/wspace/fermicloud/context/init.sh",
    target      = "hdc",
    ctx_user    = "$USER[TEMPLATE]",
    root_pubkey = "id_dsa.pub",
    username    = "opennebula",
    user_pubkey = "id_dsa.pub",
    EC2_USER_DATA="<%= erb_vm_info[:user_data] %>" 
]
<% else %>
CONTEXT = [
    ip_public   = "$NIC[IP, NETWORK_ID=0]",
    netmask     = "255.255.255.0",
    gateway     = "131.225.64.1",
    ns          = "131.225.8.120",
    files       = "/cloud/login/parag/wspace/fermicloud/context/init.sh",
    target      = "hdc",
    ctx_user    = "$USER[TEMPLATE]",
    root_pubkey = "id_dsa.pub",
    username    = "opennebula",
    user_pubkey = "id_dsa.pub" 
]
<% end %>

REQUIREMENTS = "HYPERVISOR=\"kvm\"" 
RANK = "FREEMEMORY" 
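
One possible answer to the "smarter way" question above: an inline ERB conditional that appends the EC2_USER_DATA attribute to a single CONTEXT section. This is an untested sketch:

############################
# m2.small.erb (alternative, untested)
############################
CONTEXT = [
    ip_public   = "$NIC[IP, NETWORK_ID=0]",
    netmask     = "255.255.255.0",
    gateway     = "131.225.64.1",
    ns          = "131.225.8.120",
    files       = "/cloud/login/parag/wspace/fermicloud/context/init.sh",
    target      = "hdc",
    ctx_user    = "$USER[TEMPLATE]",
    root_pubkey = "id_dsa.pub",
    username    = "opennebula",
    user_pubkey = "id_dsa.pub"<% if erb_vm_info[:user_data] %>,
    EC2_USER_DATA="<%= erb_vm_info[:user_data] %>"<% end %>
]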

EC2_USER_DATA content size is restricted

EC2_USER_DATA size restrictions:

  • Amazon: 4K
  • OpenNebula: restricted by the Content-Length limit of the HTTP service (default 7301, but somehow the actual limit is around 2K)
    • Version 3.2: 64K (because of a bug)
    • Other versions: no limit
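
Since the data travels base64 encoded, the encoded size is presumably what counts against these limits (an assumption). A quick check:

    # size of the base64-encoded payload, e.g. against Amazon's 4K limit
    base64 -w0 /path/to/user_data | wc -c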