cost-effective high-throughput computing in the cloud

High-throughput computing (HTC) describes the use of many computing resources over long periods of time to accomplish a computational task or job. Robustness and reliability of jobs over long time scales is a key concern. The usual metric for HTC is jobs per month, and the jobs are typically loosely coupled.

Public cloud, with its vast computing resources, is a natural fit for HTC. Both AWS and Azure offer a reasonably familiar batch service for HTC. AWS Batch requires the computational task to be containerized (e.g. as a Docker image), while Azure Batch merely allows it. AWS also offers a more traditional HTC/HPC implementation with AWS ParallelCluster. Beyond these, Kubernetes (container-based by design) is a common orchestration system on any cloud.

The challenge here is cost. Whereas a bare-metal system has a fixed operating cost (power, cooling, etc.), public clouds charge for each ‘service’ used: CPU, network, storage, persistent storage, and so on. Therefore the traditional metric for HTC, jobs per month (jobs/month), seems insufficient for cloud HTC. Instead, jobs per month per dollar (jobs/month/$), or simply cost per job ($/job), more accurately reflects cloud HTC.

The cost can be arbitrarily divided into compute (CPU) cost and all ‘other’ costs. The compute cost is typically known ($ per instance-hour). Of course, a more powerful (and therefore probably more expensive) instance may need less time for the same computational task. For example, an instance that costs 5% more per hour but finishes the job in 15% less time yields a 10.75% saving in cost per job (1.05 × 0.85 = 0.8925), a saving the jobs/month metric alone would miss. For cloud HTC, ideally the only cost would be compute.
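To make that arithmetic concrete, here is a tiny sketch of the cost-per-job comparison; the prices and times are hypothetical, not real cloud list prices.

```python
# A minimal sketch of the cost-per-job arithmetic above.
def cost_per_job(price_per_hour: float, job_hours: float) -> float:
    """Cost of one job = instance $/hour x wall-clock hours per job."""
    return price_per_hour * job_hours

baseline = cost_per_job(price_per_hour=1.00, job_hours=1.00)
faster   = cost_per_job(price_per_hour=1.05, job_hours=0.85)  # +5% price, -15% time

print(f"baseline: ${baseline:.4f}/job")
print(f"faster:   ${faster:.4f}/job")
print(f"saving:   {1 - faster / baseline:.2%}")  # -> 10.75%
```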

It’s easy to reach a cost breakdown where 80% goes to compute and the remaining 20% to ‘other’ services. Below, we suggest some easy-to-implement ideas that reduce the ‘other’ costs to essentially zero.

The ‘other’ costs depend mostly on the orchestration implementation and on the computational tasks themselves. Since most implementations use containers, storing and retrieving the container images costs money. Similarly, results of the calculation must be stored, usually to persistent storage, so both the storage itself (how much and for how long) and the network traffic to persistent storage add cost. Even ‘local’ storage (again, how much and for how long) and its network traffic add cost. Finally, the computational tasks may require input retrieved from persistent storage, adding further cost.

Containers run on instances, and typically one instance runs multiple identical containers. Therefore, to avoid the cost of retrieving the same container image multiple times, one can pre-load the image into the instance (for example, when building the machine image) and configure the runtime to use the cached image instead of pulling it again.
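As a sketch of the caching idea, assuming Docker is the container runtime on the instance and using a hypothetical image name, a job wrapper could pull the image only when it is not already cached:

```python
# Pull the container image only when it is not already on the instance.
# The image name is hypothetical; Docker is assumed to be installed.
import subprocess

IMAGE = "registry.example.com/htc-job:latest"  # hypothetical image

def ensure_image(image: str) -> None:
    """Pull the image only if it is missing from the local cache."""
    cached = subprocess.run(
        ["docker", "image", "inspect", image],
        capture_output=True,  # we only care about the exit code
    )
    if cached.returncode != 0:
        subprocess.run(["docker", "pull", image], check=True)

ensure_image(IMAGE)
```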

Alternatively, consider that containers themselves, while smaller than a virtual machine image, can still be quite large. We found that by building the container from SCRATCH (Docker’s empty base image) and preloading the application and its libraries in the virtual machine image, the containers become trivially small. While this reduces one benefit of containers (isolation), that ability is often not needed for HTC workflows.
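One possible shape of this idea (the bind-mount mechanism, paths, and names here are our assumptions for illustration, not a prescribed recipe): build an empty image from scratch and mount the application, preloaded on the VM image, into the container at run time.

```python
# Build a trivially small image FROM scratch; the application and its
# libraries live on the VM image and are bind-mounted in at run time.
# Paths and image names are hypothetical; the job binary is assumed to
# be static or to resolve its libraries under the mounted directory.
import os, subprocess, tempfile

DOCKERFILE = """\
FROM scratch
# Nothing is copied in: the application comes from the host mount.
CMD ["/opt/htc/bin/job"]
"""

with tempfile.TemporaryDirectory() as ctx:
    with open(os.path.join(ctx, "Dockerfile"), "w") as f:
        f.write(DOCKERFILE)
    subprocess.run(["docker", "build", "-t", "htc-empty", ctx], check=True)

# Bind-mount the application directory preloaded on the VM image.
subprocess.run(
    ["docker", "run", "--rm",
     "-v", "/opt/htc:/opt/htc:ro",  # app + libs from the instance
     "htc-empty"],
    check=True,
)
```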

Since an instance runs multiple containers, and each container typically runs the same computational task with slightly different input parameters (imagine multiple runs whose only difference is the pseudorandom number generator seed), the input files can often be shared between containers. With a bit of logic, one can create a shared file system between the containers, and the containers can coordinate among themselves to download the shared input. This reduces one download per container to one download per instance, which can be significant for large instances and/or large input files. For additional cost reduction, the shared file system can be a tmpfs/ramfs, providing a significant IO speedup while avoiding ‘local’ storage costs.
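A minimal sketch of that coordination, assuming the instance’s tmpfs directory is bind-mounted into every container and using a hypothetical S3 input location: the first container to take the lock downloads the input, and the rest simply wait for it.

```python
# Every container runs this; only the first to grab the lock downloads
# the shared input to a tmpfs directory. Assumes the instance's
# /dev/shm/htc-input is bind-mounted into each container (e.g. via
# docker run -v /dev/shm/htc-input:/dev/shm/htc-input); the bucket and
# file names are hypothetical.
import fcntl, os, subprocess

SHARED_DIR = "/dev/shm/htc-input"            # tmpfs shared across containers
MARKER = os.path.join(SHARED_DIR, ".complete")
LOCK = os.path.join(SHARED_DIR, ".lock")

os.makedirs(SHARED_DIR, exist_ok=True)
with open(LOCK, "w") as lock:
    fcntl.flock(lock, fcntl.LOCK_EX)         # blocks until the holder is done
    if not os.path.exists(MARKER):
        # First container on this instance: fetch the input exactly once.
        subprocess.run(
            ["aws", "s3", "cp", "s3://my-bucket/input.tar.gz", SHARED_DIR],
            check=True,
        )
        open(MARKER, "w").close()
# Lock released on exit; every container can now read SHARED_DIR.
```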

Similarly, to avoid paying for ‘local’ storage when capturing results, the containers can create a tmpfs/ramfs and run the computation on that file system. Not only is IO extremely quick (no waiting for an HDD or SSD), but all costs of using local storage are eliminated.
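For example (the job binary and its flag are hypothetical), the computation can be run with its working directory on /dev/shm, a tmpfs mounted by default on most Linux systems:

```python
# Run the computation with its output on a tmpfs, so results never
# touch billable local disk. The job command is hypothetical.
import subprocess, tempfile

with tempfile.TemporaryDirectory(dir="/dev/shm") as workdir:
    # All intermediate and result files land in RAM, not on HDD/SSD.
    subprocess.run(["/opt/htc/bin/job", "--output", workdir], check=True)
    # ... compress and push results to persistent storage (next section)
    # before the directory, and the RAM backing it, is released.
```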

Persistent storage cost, as pointed out in a previous post, can be reduced by always compressing files before pushing them to persistent storage. This reduces both the size of the stored file and the network traffic to storage.
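A minimal sketch of compress-then-upload with gzip and boto3, using hypothetical bucket and file names:

```python
# Gzip a result file, then upload only the compressed copy to S3.
import gzip, shutil, boto3

def push_compressed(path: str, bucket: str, key: str) -> None:
    """Compress a file with gzip and upload it to persistent storage."""
    gz_path = path + ".gz"
    with open(path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    boto3.client("s3").upload_file(gz_path, bucket, key)

push_compressed("results.dat", "my-htc-results", "run-0001/results.dat.gz")
```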

Independent of the HTC task, cost is incurred for storing virtual machine images, container images, results, and input until they are deleted. Automating the creation of the machine image and containers lets one delete them as soon as a run is done and simply recreate them when needed again. Similarly, result and input files should be removed once they are no longer needed. Under no circumstances should results be pulled out of the cloud (you’ll find unpleasant egress charges associated with that).
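Some of this cleanup can be delegated to the storage service itself. As one sketch, an S3 lifecycle rule (the bucket name, prefix, and 30-day retention window are hypothetical choices) expires result objects automatically, so forgotten files stop accruing storage charges:

```python
# Configure an S3 lifecycle rule that deletes objects under a results
# prefix after 30 days.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="my-htc-results",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-old-results",
                "Filter": {"Prefix": "runs/"},
                "Status": "Enabled",
                "Expiration": {"Days": 30},
            }
        ]
    },
)
```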

Using these simple ideas, it is easy to reach the point where 95%, and in some cases over 99%, of the cloud HTC cost goes to compute. As a side benefit, using tmpfs/ramfs file systems yields significantly smaller job wall-time variance (and perhaps a slightly lower median wall time as well), even for low-IO computation.
