On-premise computing clusters usually have RAID for persistent storage and a robust backup solution. Storage in the cloud costs $.
Cloud storage cost is a product of amount (GB), time (months) and type of storage. High-IO, low-latency persistent storage is the most expensive type. Archive storage is where the cloud excels, often costing less than the fee to transfer the data back out of the cloud.
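To make the cost model concrete, here is a back-of-envelope sketch in bash. The per-GB-month and egress prices below are illustrative assumptions only (they vary by provider, region and tier); the point is that cost scales as amount (GB) times months times the rate for the chosen storage type, and that a year of archive storage can come in under a single transfer out:

# back-of-envelope comparison; prices are assumptions, not current list prices
TB=1024                      # 1 TB expressed in GB
BLOCK_PER_GB_MONTH=0.10      # assumed high-IO, low-latency block storage
OBJECT_PER_GB_MONTH=0.023    # assumed standard object storage
ARCHIVE_PER_GB_MONTH=0.004   # assumed archive storage
EGRESS_PER_GB=0.09           # assumed data-transfer-out charge
echo "1 TB for 12 months, block:   \$$(echo "$TB*12*$BLOCK_PER_GB_MONTH" | bc)"
echo "1 TB for 12 months, object:  \$$(echo "$TB*12*$OBJECT_PER_GB_MONTH" | bc)"
echo "1 TB for 12 months, archive: \$$(echo "$TB*12*$ARCHIVE_PER_GB_MONTH" | bc)"
echo "1 TB transferred out once:   \$$(echo "$TB*$EGRESS_PER_GB" | bc)"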
The cheapest persistent storage (not high IO or low latency) is based on an object store (e.g. AWS S3, Azure Blob, Google Cloud Storage). An object store is typically a shared resource with service rate limits (PUT or GET requests per second). On AWS, at least, S3 (object store) requests are routed to S3 nodes based on the object name. Each S3 node independently enforces the service rate limit. So if your object names (particularly the beginning of the object name) are similar, all your requests may be routed to the same S3 node. In this worst case, service limits will be reached, causing large latencies at scale.
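For example, a batch of simulation outputs named with a common leading prefix hits exactly this worst case: every key starts with the same characters, so every request lands on the same S3 node. The loop below is a hypothetical illustration (the file names are made up):

# hypothetical upload pattern that concentrates all requests on one key prefix
S3_bucket=com.entonos.projectA
for i in $(seq -f "%04g" 1 2000); do
  aws s3 cp "run-${i}.dat" "s3://${S3_bucket}/results/run-${i}.dat"
done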
At scale, S3 service rates can limit throughput. By 'randomizing' the object names, requests should be distributed over multiple S3 nodes. Prepending a hash of the object name is sufficient; we'll use the first five characters of its MD5 hash:
$ cat utils3.sh
#!/usr/bin/env bash
HASH_LEN=5
function encode {
  [ -n "$1" ] && echo "$(echo "$1" | md5sum | cut -c1-$HASH_LEN)$1"
}
encode takes one argument and spits out a 'random' name. For example:
$ source utils3.sh
$ encode this_is_an_object_name
34154this_is_an_object_name
$ encode this_is_an_object_name_also
ae49cthis_is_an_object_name_also
Clearly this 'randomizes' object names. One could use this with AWS CLI commands, such as:
$ source utils3.sh
$ S3_bucket=com.entonos.projectA
$ aws s3 cp result1.dat s3://${S3_bucket}/$(encode result1.dat)
upload: ./result1.dat to s3://com.entonos.projectA/99302result1.dat
$ aws s3 cp s3://${S3_bucket}/$(encode result1.dat) result1.dat
download: s3://com.entonos.projectA/99302result1.dat to ./result1.dat
While this would avoid S3 node service limits, it is a bit clumsy to use. There is also a problem with the aws s3 ls command.
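For instance, if the three encoded names shown so far all lived in the same bucket, a plain listing would come back ordered by hash prefix rather than by anything meaningful to a human (illustrative output; dates and sizes are placeholders):

$ aws s3 ls s3://${S3_bucket}/
2019-07-23 10:12:01 0 34154this_is_an_object_name
2019-07-23 10:14:37 0 99302result1.dat
2019-07-23 10:12:05 0 ae49cthis_is_an_object_name_also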
An object store has no direct concept of a directory structure. Many programs give the illusion that it exists; even AWS's own aws s3 ls command does. Yet there is no aws s3 mkdir or aws s3 rmdir command. Since the names are randomized, the output of the aws s3 ls command will appear in essentially random order, and it won't work at all for sub-directory listings.
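Under the hood there are only flat key names; tools simulate directories by treating '/' in the key as a delimiter. The lower-level s3api call makes the mechanics explicit; this sketch lists the 'directories' at the top of the bucket (bucket name is the example one used throughout this post):

# 'directories' are just distinct key prefixes up to the next '/'
$ aws s3api list-objects-v2 --bucket com.entonos.projectA \
    --delimiter "/" --query "CommonPrefixes[].Prefix"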
A simple approach is to keep the illusion of directories on S3 by randomizing object names at each directory level. If we're going to do that, we should simplify the commands as well:
#!/usr/bin/env bash
# utils3.sh - prepend hash to object names at each 'directory' level
# (re)set S3_bucket and HASH_LEN

# your full bucket name
#S3_bucket="com.companyA.project1"
# number of characters for hash
HASH_LEN=5

# encode each '/'-separated component by prepending its hash (recursive)
function encodes3 {
  [ -n "$1" ] && [ "$1" != "/" ] && \
    echo "$(echo "${1%%/*}" | md5sum | cut -c1-$HASH_LEN)${1%%/*}$([ "${1#*/}" != "$1" ] && echo "/$(encodes3 "${1#*/}")")" || echo "$1"
}

# inverse of encodes3: strip the hash from each component (recursive)
function decodes3 {
  local o="${1%%/*}"
  [ -n "$1" ] && [ "$1" != "/" ] && \
    echo "${o:$HASH_LEN}$([ "${1#*/}" != "$1" ] && echo "/$(decodes3 "${1#*/}")")" || echo "$1"
}

# drop-in for 'aws s3 cp': arguments starting with 's3:' are encoded
function cps3 {
  if [ -z "$2" ] || [ "${1::2}" == "--" ] || [ "${2::2}" == "--" ]; then return; fi
  local i="${1:3}"
  local o="${2:3}"
  [ "${1::3}" == "s3:" ] && i="s3://${S3_bucket}/$(encodes3 "${i#/}")" || i="$1"
  [ "${2::3}" == "s3:" ] && o="s3://${S3_bucket}/$(encodes3 "${o#/}")" || o="$2"
  shift 2 && aws s3 cp "$i" "$o" "$@"
}

# drop-in for 'aws s3 ls': encode the prefix, then decode and sort the listing
function lss3 {
  local o="$1" && [ -n "$1" ] && [ "${1::2}" != "--" ] && shift || o=""
  aws s3 ls "s3://${S3_bucket}/$(encodes3 "${o#/}")" "$@" | \
    sed -E "s/(.+) (.{$HASH_LEN})(.*)$/\3s \1 \3/" | \
    sort | \
    sed -E "s/^(.*)s (.*)/\2/"
}
Now encodes3 (and its inverse decodes3) will randomize object names at each 'directory' level via recursion:
$ source utils3.sh
$ encodes3 fun/stuff/here
e5dbbfun/9eb84stuff/bc98dhere
$ decodes3 $(encodes3 fun/stuff/here/)
fun/stuff/here/
$ S3_bucket=com.entonos.projectA
$ touch one two
$ cps3 one s3:dir1/one
upload: ./one to s3://com.entonos.projectA/b759ddir1/5bbf5one
$ cps3 two s3:dir1/two
upload: ./two to s3://com.entonos.projectA/b759ddir1/c1934two
$ lss3 dir1/
2019-07-23 10:19:06 0 one
2019-07-23 10:19:22 0 two
$ aws s3 ls s3://${S3_bucket}/b759ddir1/
2019-07-23 10:19:06 0 5bbf5one
2019-07-23 10:19:22 0 c1934two
$ cps3 s3:dir1/two two
download: s3://com.entonos.projectA/b759ddir1/c1934two to ./two
So lss3 replaces aws s3 ls, and cps3 replaces aws s3 cp. In other words, the following operations are equivalent:
$ aws s3 cp result1.dat s3://${S3_bucket}/74809doe1/99302result1.dat
$ aws s3 cp result1.dat s3://${S3_bucket}/$(encode doe1)/$(encode result1.dat)
$ aws s3 cp result1.dat s3://${S3_bucket}/$(encodes3 doe1/result1.dat)
$ cps3 result1.dat s3:doe1/result1.dat
Each results in an S3 object named 74809doe1/99302result1.dat.
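Going the other way, decodes3 recovers the human-friendly name from the stored object name; the output below follows directly from the definitions above:

$ decodes3 74809doe1/99302result1.dat
doe1/result1.dat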
Note that additional aws s3 options for cp and ls are supported, provided they are passed at the end of the command.
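The same wrapper pattern extends naturally to other aws s3 subcommands. As a sketch (not part of utils3.sh above), an rms3 that mirrors the lss3 conventions could encode the key before calling aws s3 rm:

# sketch: remove an encoded object; assumes utils3.sh is sourced so that
# encodes3 and S3_bucket are available
function rms3 {
  [ -z "$1" ] || [ "${1::2}" == "--" ] && return
  local o="$1" && shift
  aws s3 rm "s3://${S3_bucket}/$(encodes3 "${o#/}")" "$@"
}

With the objects from the example above, rms3 dir1/two would delete s3://com.entonos.projectA/b759ddir1/c1934two, and extra options such as --dryrun can be appended at the end, just as with cps3 and lss3.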
By using cps3 and lss3, one gets both 'randomized' object names and a human-friendly interface. Best of both worlds.
In a follow-up post, we show how to reduce cost further by automatically compressing/decompressing files stored to/retrieved from object storage.