On queue wait times
--------------------

Most researchers who leverage a shared resource have at some point or another suffered a setback in getting their results because the resources they needed were not available. They were in use by other users. (1/n)
This is what is known as "queue wait time": the average amount of time that your request waits for a set of resources to become available before it starts. Wait times increase as a function of cluster size (smaller == longer), job size (bigger == longer), ... (2/n)
user homogeneity (same needs at the same time == longer), and specialized resource needs (more niche == longer). Life science researchers in particular need to batch-process data in bulk, which creates large spikes in requests with long periods of relative quiet. (3/n)
A long time back, when I managed a small cluster dedicated to proteomics research, I ran a usage report as we neared our refresh cycle. Usage essentially held steady at 20%, with two spikes of 100% utilization for two weeks out of the year, right before ... (4/n)
the major conferences and grant deadlines, when *everyone* wanted to analyze their data for final reports. @jasonastowe used to phrase this as "too big when you don't need it, not big enough when you did". At around the same time Cloud became a thing and we started ... (5/n)
to leverage it for our own research. At the time, we essentially bootstrapped a Grid Engine cluster using the now-abandoned MIT StarCluster. We augmented it with some plugins and a golden image so we could use our Chef recipes for the software we needed. (6/n)
We staged data in and out of object storage. It was glorious and worked at the time. But I had root access and lots of sysadmin / devops skills not available to most researchers. It worked for our small group, but was not accessible to most people. (7/n)
So what would I do today if I were a researcher looking to leverage my local cluster most of the time and the cloud when I needed it (assuming my institution does not provide support for this need)? (8/n)
Easy: I would leverage one of the workflow systems that support running on both a local HPC scheduler and a cloud scheduler. The realistic options (*alphabetical order, not a recommendation order) are Nextflow, Snakemake, or WDL, which have support ... (9/n)
for running either on a cloud-native scheduler like the Google Cloud Life Sciences API or AWS Batch, or on open-source platforms like @TerraBioApp and @nextflowio Tower, or on commercial platforms ( @dnanexus @SevenBridges @DNAstack etc). (10/n)
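To make that concrete, here is a minimal sketch of what a dual-target setup might look like in a Nextflow config, assuming SLURM on the local cluster and AWS Batch in the cloud. The queue names, bucket, and region below are placeholders I made up, not recommendations:

    // nextflow.config -- hypothetical example; adjust names to your site
    profiles {
        // run on the local HPC cluster through its SLURM scheduler
        slurm {
            process.executor = 'slurm'
            process.queue    = 'general'              // placeholder partition name
        }
        // burst to the cloud through AWS Batch
        awsbatch {
            process.executor = 'awsbatch'
            process.queue    = 'my-batch-queue'       // placeholder Batch job queue
            aws.region       = 'us-east-1'            // placeholder region
            workDir          = 's3://my-bucket/work'  // stage intermediates in object storage
        }
    }

The same workflow then runs on-premise with "nextflow run main.nf -profile slurm" or in the cloud with "-profile awsbatch", without touching the pipeline code. Snakemake and the WDL runners have equivalent profile/backend mechanisms.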
I would also learn how to use Linux container images instead of relying on pre-installed command line tools, and learn how to leverage both Docker and Singularity containers in your workflows, since most local HPC clusters still do not support Docker. (11/n)
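As a sketch of how that plays out (image name and options are illustrative, and again using Nextflow purely as the example engine): declare the container once for the pipeline or per task, then flip the engine with configuration rather than editing the workflow.

    // container settings -- illustrative only; swap in your own image
    process.container = 'ubuntu:22.04'

    profiles {
        docker      { docker.enabled = true }      // cloud VMs, workstations
        singularity {
            singularity.enabled    = true          // typical on shared HPC
            singularity.autoMounts = true          // bind common host paths
        }
    }

Singularity can also pull and run Docker images directly (e.g. "singularity exec docker://ubuntu:22.04 bash"), so one image build can usually serve both environments.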
About data movement needs -> only egress what you actually need for downstream analysis outside of the cloud. Calculate how much it would cost to reanalyze from the source data at a later date vs. how much it costs to store or egress large parts of the results. (12/n)
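A back-of-envelope calculation is usually enough here. The numbers in the sketch below are made-up placeholders, not real price quotes; plug in your provider's actual rates and your own data sizes:

    # Hypothetical break-even estimate: keep/egress the results vs. recompute later.
    # All prices and sizes are illustrative placeholders.
    result_tb      = 5        # size of derived results, in TB
    storage_per_tb = 20.0     # $/TB-month for object storage (placeholder)
    egress_per_tb  = 90.0     # $/TB to move data out of the cloud (placeholder)
    recompute_cost = 400.0    # $ to reanalyze from the source data (placeholder)
    months_kept    = 12

    keep_in_cloud = result_tb * storage_per_tb * months_kept
    egress_once   = result_tb * egress_per_tb
    print(f"Store results in the cloud for {months_kept} months: ${keep_in_cloud:,.0f}")
    print(f"Egress the results once: ${egress_once:,.0f}")
    print(f"Recompute from source when needed: ${recompute_cost:,.0f}")

In this made-up case recomputing from source is cheapest, but the ordering depends entirely on your actual sizes, rates, and how long you keep the results.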
In summary - queue wait times suck, and you can avoid some of that risk to your research by leveraging the cloud, Linux containers, and a workflow system that runs both on-premise and in the cloud. Use object storage to stage data before sending compute jobs. (/fin)