An advantage of using the cloud is flexibility. The other side of that coin is complexity. That makes it a perfect target for automation.
Running at the lowest “cost per throughput” requires multiple steps: building an optimized app for each instance type, creating custom machine images that trim required resources, and setting up the actual workflow. Most examples show how to do one or more of these tasks through a web interface. Automating all of these steps allows for continuous delivery and avoids errors.
We’ll take Amazon Web Services (AWS) as an example. It is not difficult to see how it generalizes to any cloud provider.
There are dedicated tools for building a custom AMI (e.g. Packer). However, instead of installing yet another tool, we’ll just leverage the AWS CLI and jq, a lightweight command-line JSON processor.
The first step is to create the most efficient build of the app for a specific instance type. For computationally demanding apps, this usually boils down to the processor and the extensions it supports (e.g. SSE, AVX, AVX2, …) and the compiler options. Benchmarking the app under normal workloads across various instance types and compile options will identify the most efficient combinations.
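For example, a rough way to compare compiler-flag sets on a given instance might look like the following (benchmark.c is just a stand-in for your real workload):

#rough comparison of compiler-flag sets on one instance; benchmark.c stands in for your real workload
for flags in "-O2" "-O3 -msse4.2" "-O3 -mavx2" "-O3 -march=native"; do
  gcc $flags -o bench benchmark.c -lm
  echo "== $flags ==" && time ./bench
done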
With the AWS CLI, generating these combinations is pretty easy. We’ll first find the most current Amazon Linux 2 AMI and construct a security group that only allows ssh, and only from the machine we are running on. This keeps things secure in case any of the builds contain sensitive information.
Benjamin Franklin wrote in the 1737 Poor Richard’s Almanack: “A penny saved is two pence clear.” So, let’s spin up a spot instance to save that penny. We make a spot request for a target instance type, InstanceType, with the current Amazon Linux 2 AMI:
#!/usr/bin/env bash

#AWS region where we want to do this
region="us-east-1"

#your secret key without the '.pem' extension
KeyName="my-aws-secret-pem"

#target instance type (set to whatever you are building for, e.g. c5.large)
#InstanceType="c5.large"

#maximum price you're willing to pay for InstanceType per hour ($/hour)
price="0.07"

#via AWS documentation for getting the most current Amazon Linux 2 AMI id
ImageId=$(aws ec2 describe-images --region $region --owners amazon \
  --filters 'Name=name,Values=amzn2-ami-hvm-2.0.????????-x86_64-gp2' 'Name=state,Values=available' \
  --output json | jq -r '.Images | sort_by(.CreationDate) | last(.[]).ImageId')

#only let in ssh and only from local machine's ip
myip=$(dig +short myip.opendns.com @resolver1.opendns.com)
sg=$(aws ec2 describe-security-groups --group-names only-${myip//./-} --output json 2> /dev/null | jq -r '.SecurityGroups | .[].GroupId')
[ -z "$sg" ] && echo "creating new security group..." && \
  sg=$(aws ec2 create-security-group --group-name only-${myip//./-} --description "only ssh and only from $myip" --output json | jq -r '.GroupId') && \
  aws ec2 authorize-security-group-ingress --group-name only-${myip//./-} --protocol tcp --port 22 --cidr ${myip}/32

#create a temporary specs.tmp file for the spot instance
cat >specs.tmp <<EOF
{
  "ImageId": "$ImageId",
  "KeyName": "$KeyName",
  "InstanceType": "$InstanceType",
  "NetworkInterfaces": [
    {
      "DeviceIndex": 0,
      "Groups": [ "$sg" ],
      "AssociatePublicIpAddress": true
    }
  ],
  "BlockDeviceMappings": [
    {
      "DeviceName": "/dev/xvda",
      "Ebs": {
        "DeleteOnTermination": true,
        "VolumeType": "standard"
      }
    }
  ]
}
EOF

#request a spot instance, returns "SpotInstanceRequestId"
SpotInstanceRequestId=$(aws ec2 request-spot-instances --region $region --spot-price $price --launch-specification file://specs.tmp --output json | jq -r '.SpotInstanceRequests | .[].SpotInstanceRequestId')
echo "SpotInstanceRequestId=$SpotInstanceRequestId"
Notice we changed the VolumeType to a standard HDD since we won’t be using it (much) and kept it small at 8GB. Assuming price wasn’t set too low, you will eventually get the actual InstanceId from
#returns "InstanceId", when available aws ec2 describe-spot-instance-requests --region $region --spot-instance-request-ids $SpotInstanceRequestId --output json | jq -r '.SpotInstanceRequests | .[].InstanceId'
And once we get the InstanceId, we can find the PublicIpAddress:
#returns "PublicIpAddress", when available aws ec2 describe-instances --region $region --instance-ids $InstanceId --output json | jq -r '.Reservations|.[].Instances|.[].PublicIpAddress'
Since spot instances are ephemeral by definition, connecting
#connect to spot instance without any side effects
ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i ${KeyName}.pem ec2-user@${PublicIpAddress}
leaves no side effects behind (nothing gets written to your local known_hosts). You can, of course, ssh into the instance and do the work by hand, but we are automating. So,
#simplify commands
opt="-o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i ${KeyName}.pem"
sshcmd="ssh ${opt} ec2-user@${PublicIpAddress}"

#make tmpfs at /shared of size 5GB on spot instance
${sshcmd} "sudo mkdir -p /shared && sudo chmod 777 /shared && sudo mount -o size=5G -t tmpfs tmpfs /shared"

#copy build system from local machine to spot instance's /shared
scp ${opt} -Cr ./build_octave ec2-user@${PublicIpAddress}:/shared

#now do the actual build, which will eventually generate a *.tar.xz
${sshcmd} "cd /shared/build_octave && ./to_build /shared/octave_${InstanceType::2}"

#and retrieve the archive to local machine
scp ${opt} ec2-user@${PublicIpAddress}:/shared/octave_${InstanceType::2}.tar.xz .
would move the build scripts (e.g. the contents of ./build_octave could be this) to a tmpfs on the spot instance, execute the build and retrieve the archive. It avoids EBS entirely. Pretty easy.
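The actual build scripts were linked from the original post; purely as an illustration (dependencies omitted, all names hypothetical), a to_build taking the install prefix as its argument might look roughly like this, with -march=native being where the per-instance-type tuning comes from:

#!/usr/bin/env bash
#hypothetical sketch of build_octave/to_build; $1 is the install prefix (e.g. /shared/octave_c5)
set -e
prefix="$1"
sudo yum groupinstall -y "Development Tools"   #compilers and friends on Amazon Linux 2
tar -xf octave-*.tar.* && cd octave-*/
CFLAGS="-O3 -march=native" CXXFLAGS="-O3 -march=native" ./configure --prefix="$prefix"
make -j "$(nproc)" && make install
#package the result next to the prefix, e.g. /shared/octave_c5.tar.xz
cd "$(dirname $prefix)" && tar -cJf "$(basename $prefix).tar.xz" "$(basename $prefix)"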
To avoid charges, one needs to clean up:
#terminate the spot instance
aws ec2 terminate-instances --region $region --instance-ids $InstanceId

#shouldn't be necessary, but do it anyway
aws ec2 cancel-spot-instance-requests --region $region --spot-instance-request-ids $SpotInstanceRequestId
This will shut down the spot instance and clear out the original request (it should clear out automatically, but it doesn’t always). Of course, you lose everything that was on the instance, but that’s what we want, since all we needed was the archive of the app.
We do this for all the apps we need for the full workflow. Now we turn to creating a custom AMI.
We don’t strictly need a custom AMI. We could simply move the archives to instances when needed, or place them in a container. However, since the app will be running on some AMI (inside a container or not), we might as well bake the app into the AMI, thereby eliminating the need for large transfers or containers. We can also set up the custom AMI via UserData for optimal “cost per throughput”.
For AWS Batch, we ideally want an “ecs-optimized” base AMI. Similarly to before, we can find such an ImageId:
#let's find the most current ECS-optimized Amazon Linux 2 AMI
ImageId=$(aws ec2 describe-images --region $region --owners amazon \
  --filters 'Name=name,Values=amzn2-ami-ecs-hvm-2.0.????????-x86_64-ebs' 'Name=state,Values=available' \
  --output json | jq -r '.Images | sort_by(.CreationDate) | last(.[]).ImageId')
Basing the custom AMI on this official base AMI ensures our workflow will be stable. We also want to pass in the UserData for the custom AMI. We can do this via
#what to do when our custom AMI starts up
#(with GNU base64, add -w 0 so the encoded string isn't wrapped across lines inside the JSON)
userdata=$(base64 userdata.sh)
and in specs.tmp add, say after the KeyName line:
"UserData": "$userdata",
where userdata.sh needs to be short (<16KB) and do something useful, say
#!/usr/bin/env bash
# this will run at instance boot
# note: root runs UserData
set -e

p="/shared"

#create the largest tmpfs we can, leaving roughly 700MB per 4 vCPUs for the system
[ ! -d "$p" ] && \
  reqMem=$(expr $(cat /proc/cpuinfo | grep -c processor) \* 700 / 4) && \
  size=$(expr $(free -ml | grep ^Mem | awk '{print $NF}') - ${reqMem}) && \
  mkdir -p "$p" && chmod 777 "$p" && mount -o size=${size}M -t tmpfs tmpfs "$p"

#unpack the per-instance-type build of the app and stage the workflow scripts
cd "$p" && itype=$(curl -s http://169.254.169.254/latest/meta-data/instance-type) && \
  tar -xf /usr/local/data/octave_${itype::2}.tar.xz octave_${itype::2} && ln -s octave_${itype::2} octave && \
  cp /usr/local/data/octave_workflow.tar.xz . && cp -r /usr/local/data/run_scripts .
This creates the largest tmpfs possible on /shared and readies the instance for computational tasks (note: UserData is executed as root).
I found that UserData could be avoided entirely by simply registering a custom systemd service instead. Here that also sidesteps an error at the initial boot, since /usr/local/data doesn’t exist yet at that point. It’s also vendor agnostic.
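For completeness, a minimal sketch of such a service, assuming the same setup logic as userdata.sh has been installed as /usr/local/bin/setup_shared.sh (both names are illustrative):

#register a one-shot service that runs the setup script at every boot (unit and script names are illustrative)
cat <<'EOF' | sudo tee /etc/systemd/system/setup-shared.service >/dev/null
[Unit]
Description=Prepare /shared tmpfs and unpack the app
After=network-online.target

[Service]
Type=oneshot
ExecStart=/usr/local/bin/setup_shared.sh
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl enable setup-shared.service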
Once the instance is up (cloud-init will throw non-fatal errors if you use UserData), we can transfer the app archive and any supporting files via the usual
#create a directory to store the app/data
${sshcmd} "sudo mkdir -p /usr/local/data && sudo chmod 777 /usr/local/data"

#stop ecs and clear out its data
${sshcmd} "sudo systemctl stop ecs && sudo rm -rf /var/lib/ecs/data/ecs_agent_data.json"

#pull in any rpm(s) we need
${sshcmd} "sudo amazon-linux-extras install -y epel && sudo yum install -y inotify-tools zstd libgfortran"

#copy app/data to custom AMI
scp ${opt} -r ../octave/octave_*.tar.xz ../run_scripts ../lmbench/bw_mem ec2-user@${PublicIpAddress}:/usr/local/data
(UserData will no longer cause cloud-init errors.) At the time of writing, AWS’s documentation is wrong about stopping the ecs agent (it says to run sudo stop ecs, which doesn’t do anything). Create the custom AMI:
#create custom AMI
ami=$(aws ec2 create-image --region $region --instance-id $InstanceId --name "octave-$(date "+%Y-%m-%d")" --description "AMI for octave" --no-reboot --output json | jq -r '.ImageId')
echo "custom_ami=$ami"
You have now created a custom AMI (it takes a minute or two to finish). Accordingly, you will be charged for storing that AMI until you deregister it:

#deregister the custom AMI once it's no longer needed
aws ec2 deregister-image --image-id $ami
AWS Batch requires a container. The Dockerfile is embarrassingly trivial
#trivial Dockerfile for AWS Batch
FROM amazonlinux:latest
WORKDIR /shared
ENTRYPOINT ["/shared/run_scripts/start.sh"]
though it’s easy to replace run_scripts/* with container overrides. Build the image and register it with ECR. We now have a custom ami and a simple container in ecr.
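For reference, the build-and-push might look something like this (the repository name octave-batch is a placeholder; on older AWS CLIs, aws ecr get-login --no-include-email replaces get-login-password):

#build the image and push it to ECR; repository name "octave-batch" is illustrative
account=$(aws sts get-caller-identity --output json | jq -r '.Account')
aws ecr create-repository --region $region --repository-name octave-batch 2>/dev/null
aws ecr get-login-password --region $region | docker login --username AWS --password-stdin ${account}.dkr.ecr.${region}.amazonaws.com
docker build -t octave-batch .
docker tag octave-batch:latest ${account}.dkr.ecr.${region}.amazonaws.com/octave-batch:latest
docker push ${account}.dkr.ecr.${region}.amazonaws.com/octave-batch:latest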
The final task is to set up AWS Batch. You need to create-compute-environment, create-job-queue and register-job-definition. Now that you have your custom ami and ecr image, it’s just a chore of issuing three commands (the compute environment takes the custom AMI via the imageId field of its compute resources). The job definition options mountPoints and volumes allow containers to mount and share /shared, thereby living up to its name.
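As an illustration of that last piece, a job definition sharing the host’s /shared might look roughly like this (job name, vcpus/memory and the image URI placeholder are all illustrative):

#sketch: job definition whose volumes/mountPoints expose the host's /shared to the container
image="<account-id>.dkr.ecr.${region}.amazonaws.com/octave-batch:latest"   #placeholder URI
aws batch register-job-definition --region $region --job-definition-name octave-job --type container \
  --container-properties '{
    "image": "'"$image"'",
    "vcpus": 1,
    "memory": 2048,
    "volumes":     [ { "name": "shared", "host": { "sourcePath": "/shared" } } ],
    "mountPoints": [ { "containerPath": "/shared", "sourceVolume": "shared" } ]
  }'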