For large HTC (or HPC) computations on the cloud, ‘spot instances’ (AWS-speak), ‘low-priority VMs’ (Azure-speak) or ‘preemptible VM instances’ (GoogleCloud-speak) are the low-cost options for compute. Of course, the challenge is that these instances/VMs can vanish at any time. If you’re doing a large HTC task, you want to make sure you save your result (and/or checkpoint) files to persistent storage as soon as they are generated. Otherwise you lose the computation you just paid for.
The obvious way of doing this is to have a process that periodically polls the run directory of each computational task. If it finds a result file, it pushes the file to persistent storage (S3, Blob or GoogleDrive). This gets a bit dicey since the polling needs to be frequent enough to catch the files before the instance is lost, but not so frequent that it steals precious CPU cycles from the main computational task.
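For illustration, a minimal sketch of that polling approach might look like the following (the push_to_storage helper, the file name pattern and the 60-second interval are placeholders, not part of the original setup):

#!/usr/bin/env bash
#naive polling loop: every 60 seconds, push any result files that have appeared
while true; do
    for f in result*.dat; do
        [ -e "$f" ] || continue
        push_to_storage "$f" && rm -f "$f"   #placeholder for aws s3 cp, azcopy, ...
    done
    sleep 60
done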
On Linux, a more elegant solution is to use the kernel’s inotify (present since 2.6.13). In particular, inotify-tools opens up inotify to Bash. You can ask for notifications when files change in a directory (including recursively) for different types of changes (close, close after write, close after read, delete, create, move, etc.). Since inotify-tools only triggers on the specified events on the specified directory, there is no need to poll the file system and, in fact, no extra CPU cycles are wasted trying to figure out whether a result file has appeared or not.
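As a quick illustration of what inotify-tools exposes (the directory name here is arbitrary), the following one-liner reports every file that is closed after being written anywhere under a run directory:

#watch the run directory recursively and report every file closed after a write;
#each event is printed as: <watched-dir> <event> <filename>
inotifywait -m -r -e close_write /path/to/rundir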
One implementation uses a spawned Bash script with associative arrays. The index of each array is a file name that we want to capture. We’ll use three such arrays: results, inputs and checkpoints. For example,
#!/usr/bin/env bash

declare -A inputs       #input files (may include result files)
declare -A results      #result files (may include input files)
declare -A checkpoints  #checkpoint files

inputs["input.dat"]=0
inputs["pre-result.dat"]=0
results["pre-result.dat"]=0
results["result1.dat"]=0
results["result2.dat"]=0
results["result3.dat"]=0
checkpoints["key.chk"]=0
checkpoints["data.chk"]=0
defines two input files (input.dat and pre-result.dat), four result files (pre-result.dat, result1.dat, result2.dat and result3.dat) and two checkpoint files (key.chk and data.chk). In this case, the first result file (pre-result.dat) is used as an input file. This is a common design pattern for, say, Monte Carlo based optimizations where pre-result.dat is the initial state for the other result files. Since the task may have been restarted, pre-result.dat may already exist and there is no need to re-calculate it if it is already present. If that is the case, we set results["pre-result.dat"]=$(cat pre-result.dat | md5sum) so we can detect whether pre-result.dat has, in fact, changed since it was pulled from persistent storage.
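A minimal sketch of that restart check, assuming the array definitions above have already run, might be:

#if pre-result.dat survived from a previous run, remember its checksum so the
#archiver only pushes it again if its contents actually change
if [ -s "pre-result.dat" ]; then
    results["pre-result.dat"]=$(cat pre-result.dat | md5sum)
fi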
Regardless, the following Bash script can then be spawned after setting the associative arrays. It simply starts inotifywait in monitoring mode on the current directory, waiting for result or checkpoint files to appear.
#!/usr/bin/env bash

trap 'exit 0' SIGTERM SIGINT SIGHUP SIGQUIT

#nothing to do if there are no result or checkpoint files to watch
[ "${#results[@]}" -lt 1 ] && [ "${#checkpoints[@]}" -lt 1 ] && exit 0

echo "archiver started, looking for:"
echo "  result list:     (${!results[@]})"
echo "  checkpoint list: (${!checkpoints[@]})"
echo "  input list:      (${!inputs[@]})"
echo "  in directory:    $PWD"

#loop forever
inotifywait -m -e close_write . | while read dir event file; do
    #result file? (non-empty and contents changed since last push)
    if [ ${results["$file"]+abc} ] && [ "$(stat -c%s "$file")" -gt 0 ] && \
       [[ ${results["$file"]} != $(cat "$file" | md5sum) ]]; then
        ( cps3 "$file" "s3:${name}/${file}.$JOB_ARRAY_INDEX" && \
          echo "$(date +%F-%H:%M:%S): result file pushed: $file -> s3:${name}/${file}.$JOB_ARRAY_INDEX"
          [ ${inputs["$file"]+abc} ] || rm -f "$file" ) &   #spawn so we can unblock
    #delete file if it didn't change
    elif [ ${results["$file"]+abc} ] && [[ ${results["$file"]} = $(cat "$file" | md5sum) ]] && \
         [ ! ${inputs["$file"]+abc} ]; then
        ( rm -f "$file" && echo "$(date +%F-%H:%M:%S): result file deleted: ${file}" ) &
    #delete file if size is 0
    elif [ ${results["$file"]+abc} ] && [ "$(stat -c%s "$file")" -eq 0 ]; then
        ( rm -f "$file" && echo "$(date +%F-%H:%M:%S): zero size result file deleted: $file" ) &
    #checkpoint file?
    elif [ ${checkpoints["$file"]+abc} ]; then
        ( cps3 "$file" "s3:${name}/${file}.${script}.${JOB_ARRAY_INDEX}" && rm -f "$file" && \
          echo "$(date +%F-%H:%M:%S): run file pushed: $file -> s3:${name}/${file}.${script}.${JOB_ARRAY_INDEX}" ) &   #spawn so we can unblock
    fi
done
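As a usage sketch (the file name archiver.sh, the run_main_computation placeholder and the cleanup steps are assumptions, not something from the original setup), the archiver could be started in the background next to the main task and stopped once the task is done:

#hypothetical launcher: the arrays and the loop above are assumed to be saved
#together as archiver.sh; run it in the background while the main task runs
./archiver.sh &
archiver_pid=$!

./run_main_computation                    #placeholder for the actual HTC task

sleep 5                                   #give the archiver a moment to push the last files
kill "$archiver_pid" 2>/dev/null          #the trap in archiver.sh lets it exit cleanly
pkill -P "$archiver_pid" 2>/dev/null      #also stop the inotifywait child blocking the loop
wait "$archiver_pid" 2>/dev/null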
The first ‘if’ statement checks whether the file is a result file, has a size greater than 0 bytes and has an md5sum different from the stored one (did the file contents actually change?). If so, the file is compressed and pushed to S3 (see previous post). If it is not also an input file, it is then deleted (it is no longer needed by the computation).
The next ‘elif’ statement simply deletes the result file if its md5sum didn’t change and it is not an input file. Similarly, the following ‘elif’ deletes the result file if it has zero size.
The last ‘elif’ catches checkpoint files: the file is compressed, pushed to S3 and then deleted.
Note that the variables name and script specify the HTC case and task, while JOB_ARRAY_INDEX enumerates which task is being done in a job array. Also, since we pipe the output of inotifywait, the resulting while loop runs in a different (sub)shell. After each ‘if/elif’ branch, the commands are spawned in the background so we unblock for the next inotifywait event. Until inotifywait is triggered, no CPU cycles are wasted.
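If the subshell created by the pipe ever becomes a problem (for example, if you want the loop to update the associative arrays and have the changes persist), one common Bash alternative, shown here only as a sketch, is to feed the loop from process substitution instead:

#keep the while loop in the current shell by reading from process substitution
while read dir event file; do
    #... same if/elif handling as above ...
    :
done < <(inotifywait -m -e close_write .)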
inotify-tools works perfectly well on tmpfs/ramfs (though not on NFS or other networked filesystems), and such filesystems should be considered if there is enough spare RAM on the compute nodes (it speeds up I/O and saves money on storage).
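For example, a RAM-backed scratch area for the run directory could be set up along these lines (the mount point and size are arbitrary choices):

#create a 4 GB tmpfs scratch area and run the task from there
sudo mkdir -p /mnt/scratch
sudo mount -t tmpfs -o size=4g tmpfs /mnt/scratch
cd /mnt/scratch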