For large HTC (or HPC) computations on the cloud, ‘spot instances’ (AWS-speak), ‘low-priority VMs’ (Azure-speak) or ‘preemptible VM instances’ (GoogleCloud-speak) are the low-cost options for compute. Of course, the challenge is that these instances/VMs can vanish at any time. If you’re doing a large HTC task, you want to make sure you save your result (and/or checkpoint) files to persistent storage as soon as they are generated. Otherwise you lose the computation you just paid for.
The obvious way of doing this is to have a process that periodically polls the run directory of each computational task. If it finds a result file, it pushes the file to persistent storage (S3, Blob or GoogleDrive). This gets a bit dicey since the polling needs to be frequent enough to catch the files before the instance is lost, but not so frequent that it steals precious CPU cycles from the main computational task.
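For illustration, a minimal sketch of that polling approach might look like the following (the push_to_storage helper, the file name pattern and the 60-second interval are placeholders, not part of the original setup):

#!/usr/bin/env bash
#naive polling loop: every 60 seconds, push any result files that have appeared
while true; do
    for f in result*.dat; do
        [ -e "$f" ] || continue
        push_to_storage "$f" && rm -f "$f"   #placeholder for aws s3 cp, azcopy, ...
    done
    sleep 60
done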
On Linux, a more elegant solution is to use the kernel’s inotify (present since 2.6.13). In particular, inotify-tools opens up inotify to Bash. You can ask for notifications when files change in a directory (including recursively) for different types of changes (close, close after write, close after read, delete, create, move, etc.). Since inotify-tools only triggers on the specified events on the specified directory, there is no need to poll the file system and, in fact, no extra CPU cycles are wasted trying to figure out whether a result file has appeared or not.
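As a quick illustration of what inotify-tools exposes (the directory name here is arbitrary), the following one-liner reports every file that is closed after being written anywhere under a run directory:

#watch the run directory recursively and report every file closed after a write;
#each event is printed as: <watched-dir> <event> <filename>
inotifywait -m -r -e close_write /path/to/rundir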
One implementation uses a spawned Bash script with associative arrays. The index of each array is a file name that we want to capture. We’ll use three such arrays: results, inputs and checkpoints. For example,
#!/usr/bin/env bash

declare -A inputs       #input files (may include result files)
declare -A results      #result files (may include input files)
declare -A checkpoints  #checkpoint files

inputs["input.dat"]=0
inputs["pre-result.dat"]=0
results["pre-result.dat"]=0
results["result1.dat"]=0
results["result2.dat"]=0
results["result3.dat"]=0
checkpoints["key.chk"]=0
checkpoints["data.chk"]=0
defines two input files (input.dat and pre-result.dat), four result files (pre-result.dat, result1.dat, result2.dat and result3.dat) and two checkpoint files (key.chk and data.chk). In this case, the first result file (pre-result.dat) is used as an input file. This is a common design pattern for, say, Monte Carlo based optimizations where pre-result.dat is the initial state for the other result files. Since the task may have been restarted, pre-result.dat may already exist and there is no need to re-calculate it if it is already present. If that is the case, we set results["pre-result.dat"]=$(cat pre-result.dat | md5sum) so we can detect whether pre-result.dat has, in fact, changed since it was pulled from persistent storage.
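A minimal sketch of that restart check, assuming the array definitions above have already run, might be:

#if pre-result.dat survived from a previous run, remember its checksum so the
#archiver only pushes it again if its contents actually change
if [ -s "pre-result.dat" ]; then
    results["pre-result.dat"]=$(cat pre-result.dat | md5sum)
fi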
Regardless, the following Bash script can then be spawned after setting the associative arrays. It simply starts inotifywait in monitoring mode on the current directory, waiting for result or checkpoint files to appear.
#!/usr/bin/env bash

trap 'exit 0' SIGTERM SIGINT SIGHUP SIGQUIT

#nothing to do if there are no result or checkpoint files to watch
[ "${#results[@]}" -lt 1 ] && [ "${#checkpoints[@]}" -lt 1 ] && exit 0

echo "archiver started, looking for:"
echo "  result list:     (${!results[@]})"
echo "  checkpoint list: (${!checkpoints[@]})"
echo "  input list:      (${!inputs[@]})"
echo "  in directory:    $PWD"

#loop forever
inotifywait -m -e close_write . | while read dir event file; do
    #result file? (non-empty and contents changed since last push)
    if [ ${results["$file"]+abc} ] && [ "$(stat -c%s "$file")" -gt 0 ] && \
       [[ ${results["$file"]} != $(cat "$file" | md5sum) ]]; then
        ( cps3 "$file" "s3:${name}/${file}.$JOB_ARRAY_INDEX" && \
          echo "$(date +%F-%H:%M:%S): result file pushed: $file -> s3:${name}/${file}.$JOB_ARRAY_INDEX"
          [ ${inputs["$file"]+abc} ] || rm -f "$file" ) &   #spawn so we can unblock
    #delete file if it didn't change
    elif [ ${results["$file"]+abc} ] && [[ ${results["$file"]} = $(cat "$file" | md5sum) ]] && \
         [ ! ${inputs["$file"]+abc} ]; then
        ( rm -f "$file" && echo "$(date +%F-%H:%M:%S): result file deleted: ${file}" ) &
    #delete file if size is 0
    elif [ ${results["$file"]+abc} ] && [ "$(stat -c%s "$file")" -eq 0 ]; then
        ( rm -f "$file" && echo "$(date +%F-%H:%M:%S): zero size result file deleted: $file" ) &
    #checkpoint file?
    elif [ ${checkpoints["$file"]+abc} ]; then
        ( cps3 "$file" "s3:${name}/${file}.${script}.${JOB_ARRAY_INDEX}" && rm -f "$file" && \
          echo "$(date +%F-%H:%M:%S): run file pushed: $file -> s3:${name}/${file}.${script}.${JOB_ARRAY_INDEX}" ) &   #spawn so we can unblock
    fi
done
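As a usage sketch (the file name archiver.sh, the run_main_computation placeholder and the cleanup steps are assumptions, not something from the original setup), the archiver could be started in the background next to the main task and stopped once the task is done:

#hypothetical launcher: the arrays and the loop above are assumed to be saved
#together as archiver.sh; run it in the background while the main task runs
./archiver.sh &
archiver_pid=$!

./run_main_computation                    #placeholder for the actual HTC task

sleep 5                                   #give the archiver a moment to push the last files
kill "$archiver_pid" 2>/dev/null          #the trap in archiver.sh lets it exit cleanly
pkill -P "$archiver_pid" 2>/dev/null      #also stop the inotifywait child blocking the loop
wait "$archiver_pid" 2>/dev/null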
The first ‘if’ statement checks whether the file is a result file, has a size greater than 0 bytes and has an md5sum different from the stored one (did the file contents actually change?). If so, the file is compressed and pushed to S3 (see previous post). If it is not also an input file, it is then deleted (it is no longer needed by the computation).
The next ‘elif’ statement simply deletes the result file if its md5sum didn’t change and it is not an input file. Similarly, the following ‘elif’ deletes the result file if it has zero size.
The last ‘elif’ catches checkpoint files: the file is compressed, pushed to S3 and then deleted.
Note that the variables name and script specify the HTC case and task, while JOB_ARRAY_INDEX enumerates which task is being done in a job array. Also, since we pipe the output of inotifywait, the resulting while loop runs in a different (sub)shell. After each ‘if/elif’ branch, the commands are spawned in the background so we unblock for the next inotifywait event. Until inotifywait is triggered, no CPU cycles are wasted.
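If the subshell created by the pipe ever becomes a problem (for example, if you want the loop to update the associative arrays and have the changes persist), one common Bash alternative, shown here only as a sketch, is to feed the loop from process substitution instead:

#keep the while loop in the current shell by reading from process substitution
while read dir event file; do
    #... same if/elif handling as above ...
    :
done < <(inotifywait -m -e close_write .)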
inotify-tools works perfectly well on tmpfs/ramfs (though not on NFS or other networked filesystems), and such filesystems should be considered if there is enough spare RAM on the compute nodes (it speeds up I/O and saves money on storage).
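For example, a RAM-backed scratch area for the run directory could be set up along these lines (the mount point and size are arbitrary choices):

#create a 4 GB tmpfs scratch area and run the task from there
sudo mkdir -p /mnt/scratch
sudo mount -t tmpfs -o size=4g tmpfs /mnt/scratch
cd /mnt/scratch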