AWS Graviton2, Batch and multi-arch docker

Amazon Web Services (AWS) introduced Graviton2, a processor built around 64-bit ARM Neoverse cores, about a year ago, with wide availability starting in June 2020. Marketing claims up to 40% better price performance over comparable x86-based instances. A footnote clarifies with "20% lower cost and up to 40% higher performance based on internal testing with varying characteristics of compute and memory requirements". As you know, details can be important.

On Graviton2 every vCPU is a full physical core with its own access to memory, whereas on the x86 instances each physical core presents two hyper-threads that share that core's memory access. Since many of our codes involve large matrix multiplications, we were interested in trying the new cores and comparing them to both Intel's and AMD's x86 offerings. That is, a 40% improvement is not something easily ignored.

Amazon Linux 2 (+ ECS optimized) is the preferred AMI to use for Batch. Installing the developer tools, we see the version of gcc is 7.3. Unfortunately, that version of gcc does not support the suggested compile options for Graviton2. Since the x86 platforms used highly aggressive compile options (e.g. on Intel instances, icc/icpc/ifort with AVX*/SSE* options), we spun up a CentOS 8 AMI, installed gcc-9, and made sure to create a static executable, which ran fine on Amazon Linux 2. It turned out to be a waste of time: the resulting executable was slower than one built with the default gcc and -march=armv8.2-a+fp16+crypto -mtune=cortex-a72. We note that, other than new compile-line options, no source code changes were required going from x86 to Graviton2.
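For illustration only (the file names are hypothetical, -O3 is simply a representative optimization level, and linking the single-precision FFTW is our assumption, see below), the Graviton2 build line looks something like:

$ gfortran -O3 -march=armv8.2-a+fp16+crypto -mtune=cortex-a72 -o solver solver.f90 -lfftw3f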

Next we had to make a container (AWS Batch requires containers for jobs). At scale, it is desirable to minimize the container size to reduce latency and storage/network charges, so the containers are built from scratch. Using Docker's buildx on macOS, we created a multi-arch container and pushed it to AWS's ECR with the following:

$ aws ecr get-login-password --region ${region} | docker login --username AWS --password-stdin ${accnum}.dkr.ecr.${region}.amazonaws.com

$ docker buildx build --platform linux/amd64,linux/arm64 -t ${accnum}.dkr.ecr.${region}.amazonaws.com/${ECRrepo}:latest --push .
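The Dockerfile behind that buildx command can be minimal. A sketch of a from-scratch image holding nothing but the solver (the binary names are hypothetical; buildx supplies TARGETARCH for each platform):

FROM scratch
ARG TARGETARCH
COPY solver-${TARGETARCH} /solver
ENTRYPOINT ["/solver"]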

Unfortunately, this multi-arch container failed to run on the Graviton2-based AMIs. The fix was to tell the container to mount /lib (in addition to /usr, /lib64 and /etc), most likely because on aarch64 the ELF dynamic loader lives under /lib (/lib/ld-linux-aarch64.so.1) rather than under /lib64 as on x86_64. x86 didn't require the /lib mount, but adding it had no bad side effects there either.
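In Batch, those mounts live in the job definition's container properties. A sketch of just the /lib piece (the volume name is arbitrary; /usr, /lib64 and /etc are handled the same way):

"volumes":     [ { "name": "lib", "host": { "sourcePath": "/lib" } } ],
"mountPoints": [ { "sourceVolume": "lib", "containerPath": "/lib", "readOnly": true } ]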

The rest is easy. Create two compute environments under Batch, one for Graviton2 and another for x86. Tie a job queue to each of these compute environments, setting the preference to either Graviton2 or x86. The job definition simply points to the multi-arch container.
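For reference, a Graviton2 compute environment and queue can be created along these lines (names are hypothetical; the compute-resources JSON holding the c6g instance types, subnets, roles and spot settings lives in a separate file), with a matching pair for x86:

$ aws batch create-compute-environment --compute-environment-name c6g-spot --type MANAGED --compute-resources file://c6g-spot.json

$ aws batch create-job-queue --job-queue-name c6g-queue --priority 1 --compute-environment-order order=1,computeEnvironment=c6g-spot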

The code tested is single-threaded, double-precision, compute-heavy, written in modern Fortran, and exploits vector processing. The majority of the computational task is inverse (complex-to-real) and forward (real-to-complex) Fourier transforms with a complex matrix (256×256) multiplication in between (i.e. solving Poisson's equation). For speed, these transforms/multiplications are done in single precision. In all cases, FFTW is used. For AMD, a build of FFTW optimized for EPYC was used, compiled with gcc 7.3. Unfortunately, the resulting executable was 25% slower than on either Intel or Graviton2. Future work will find the reason why and a workaround.
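As a rough sketch of the kernel's shape, and not our production code (the spectral operator is a placeholder, and we assume FFTW's Fortran 2003 interface in fftw3.f03 also declares the single-precision fftwf_ routines, linked with -lfftw3f):

program poisson_sketch
  use, intrinsic :: iso_c_binding
  implicit none
  include 'fftw3.f03'

  integer(C_INT), parameter :: n = 256
  real(C_FLOAT),            dimension(n, n)     :: field
  complex(C_FLOAT_COMPLEX), dimension(n/2+1, n) :: spec, op
  type(C_PTR) :: fwd, inv

  ! Single-precision real-to-complex forward and complex-to-real inverse plans.
  fwd = fftwf_plan_dft_r2c_2d(n, n, field, spec, FFTW_MEASURE)
  inv = fftwf_plan_dft_c2r_2d(n, n, spec, field, FFTW_MEASURE)

  call random_number(field)
  op = (1.0_C_FLOAT, 0.0_C_FLOAT)      ! placeholder; the real code applies a 256x256 complex operator here

  call fftwf_execute_dft_r2c(fwd, field, spec)
  spec = op * spec                     ! stand-in for the Poisson solve in spectral space
  call fftwf_execute_dft_c2r(inv, spec, field)
  field = field / real(n*n, C_FLOAT)   ! FFTW transforms are unnormalized

  call fftwf_destroy_plan(fwd)
  call fftwf_destroy_plan(inv)
end program poisson_sketch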

Previous benchmarking suggested 4xlarge instances (i.e. 16 vCPUs) were the sweet spot for Intel. Larger instances usually compete for "shared resources" (memory access here). The Intel runs did not turn off hyper-threading. While turning it off does speed up the execution of the code, the speedup wasn't quite a factor of 2 (as seen in the c5.24xlarge result below); that is, to minimize cost per computation, it was left on. In all cases, the runs were done in us-east-2 (Ohio) using spot instances. Numbers come from the AWS Billing console. We will state, without showing, that all physical observables computed from the simulations are statistically identical regardless of the underlying hardware or instance size.

| hardware | instance | vCPU (vCPU/container) | bw_mem check | total instance time (hour) | spot cost ($) | on-demand cost ($) | instance time * vCPU (vCPU * hour) |
|---|---|---|---|---|---|---|---|
| Intel | c5.4xlarge | 16 (1) | yes | 195.640 | 29.74 | 133.04 | 3,130 |
| Intel | c5.4xlarge | 16 (1) | no | 213.502 | 32.45 | 145.18 | 3,416 |
| Intel | c5.12xlarge | 48 (1) | yes | DNF | NA | NA | NA |
| Intel | c5.12xlarge | 48 (1) | no | 107.376 | 48.97 | 219.05 | 5,154 |
| Intel | c5.24xlarge | 96 (2) | yes | 57.484 | 52.44 | 234.54 | 5,518 |
| Intel | c5.24xlarge | 96 (2) | no | 54.938 | 50.11 | 224.15 | 5,274 |
| Graviton2 | c6g.4xlarge | 16 (1) | yes | 199.398 | 31.82 | 108.48 | 3,190 |
| Graviton2 | c6g.4xlarge | 16 (1) | no | 202.975 | 32.40 | 110.42 | 3,248 |
| Graviton2 | c6g.12xlarge | 48 (1) | yes | 77.655 | 37.19 | 126.73 | 3,727 |
| Graviton2 | c6g.12xlarge | 48 (1) | no | 77.262 | 37.00 | 126.09 | 3,709 |
| Graviton2 | c6g.16xlarge | 64 (1) | yes | 72.236 | 46.12 | 157.19 | 4,623 |
| Graviton2 | c6g.16xlarge | 64 (1) | no | 69.507 | 44.38 | 151.25 | 4,448 |

Total instance time and price for the same computational task for different architectures and instance sizes. Rows marked "yes" under bw_mem check ran with the memory-bandwidth check using bw_mem (see text); rows marked "no" ran without it.

First looking at Intel (the rows without the bw_mem check), we see the smallest instance gives the lowest cost/computation metric (the computation is the same for all results). Going from c5.4xlarge to c5.12xlarge, the number of vCPUs increased by a factor of 3, yet the total instance time only decreased by a factor of 2, increasing the cost/computation by 50%. Using c5.24xlarge but assigning 2 vCPUs per container (effectively turning off hyper-threading), the total instance time again decreased by almost a factor of 2 compared to c5.12xlarge (with 1 vCPU per container), leaving the cost only slightly higher. Looking at the total vCPU * hours for each case, it is clear Intel efficiency decreases rather dramatically with increasing instance size.

Moving on to Graviton2 (again the rows without the bw_mem check), similar to Intel, the smallest instance gives the lowest cost/computation. Going from c6g.4xlarge to c6g.12xlarge, the number of vCPUs increased by a factor of 3 and the total instance time decreased by a factor of 2.6, increasing the cost/computation by 14%. Comparing c6g.4xlarge to c6g.16xlarge, the number of vCPUs increased by a factor of 4 and the total instance time decreased by a factor of 2.9, increasing the cost/computation by 37%.

When a container starts, bw_mem (from lmbench) is run to test memory bandwidth (the "yes" rows above). If two consecutive bw_mem results are less than 4000 MB/s, the container is killed. Otherwise, during the run, bw_mem is run at an interval randomly chosen from 8, 16, 32 or 64 minutes; again, if two consecutive results are less than 4000 MB/s, the container is killed. The number of container kills for each instance is tracked, and once that number exceeds 32, the instance itself is killed.
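A stripped-down sketch of the per-container check (the 128 MB test size, the wrapper's process handling and the instance-level kill counter are details we're omitting; bw_mem prints "size bandwidth" with bandwidth in MB/s):

#!/bin/bash
solver_pid=$1                  # PID of the compute process (hypothetical wrapper argument)
fails=0
while true; do
    bw=$(bw_mem 128m rd 2>&1 | awk '{print int($2)}')
    if [ "${bw}" -lt 4000 ]; then
        fails=$((fails + 1))
    else
        fails=0
    fi
    # Two consecutive low-bandwidth readings: kill this container's compute process.
    [ "${fails}" -ge 2 ] && { kill "${solver_pid}"; exit 1; }
    sleep $(( (8 << (RANDOM % 4)) * 60 ))    # 8, 16, 32 or 64 minutes
done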

For Intel, fully loaded (i.e. 1 vCPU per container) larger instances running with bw_mem fail to finish the computation; such instances rarely return sufficient memory bandwidth. For fully loaded smaller instances, using bw_mem does reduce the total instance time, resulting in a lower cost/computation. For Graviton2, on the other hand, bw_mem has a small effect, and on the larger instances it actually increases the total instance time.

Comparing the best cost/computation between Intel and Graviton2, we see Graviton2 has the better metric for "on-demand" instances but the worse one for "spot" instances. Of course, "spot" instance cost depends on the market (i.e. demand for those instances at the time of the run). For "on-demand", cost/computation for Graviton2 is better than Intel by ~18%, while for "spot" instances it is worse by ~7%. Only at the 12xlarge size do we find Graviton2 to be computationally more efficient, by about 28%, with cost/computation 24% and 42% better for spot and on-demand instances, respectively. However, this instance size is not the optimum for cost/computation. Perhaps ARM's native compiler (compared to gcc) would give more gain.
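To make those numbers concrete using the table above: at the 4xlarge sweet spot with bw_mem, on-demand cost is $133.04 (Intel) versus $108.48 (Graviton2), i.e. (133.04 - 108.48)/133.04 ≈ 18% in Graviton2's favor, while spot cost is $29.74 versus $31.82, roughly 7% in Intel's favor. At 12xlarge without bw_mem, instance time is 107.376 versus 77.262 hours (≈ 28% less), spot cost is $48.97 versus $37.00 (≈ 24% less) and on-demand is $219.05 versus $126.09 (≈ 42% less).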

AnandTech benchmarked Graviton2 against x86 using gcc. They, too, found that on a single core Intel typically outperformed Graviton2 except on memory-intensive benchmarks. This is consistent with our results for the larger instance types.

So for these embarrassingly parallel simulations, we have found a 42% (24%) improvement in cost/computation for Graviton2 compared to Intel for on-demand (spot) instances, though not at the optimal point. At the optimal point we found a ~18% improvement for on-demand instances but a 7% penalty for spot instances. Given the runs with and without bw_mem, it is also clear Graviton2 is less dependent on "shared resources". This suggests that OpenMP and/or MPI runs on Graviton2 would perform better than on Intel.

We acknowledge and thank AWS for “Founders” credits.
