Got a second?
There’s a frankly daunting number of tools and practices most new researchers in genomics and computational biology could get some serious mileage out of, and in some ways I am loath to add to the pile, but I maintain this one is particularly worth your time (heh).
I’m of course referring to timing the programs you run and scripts you write, which on the surface might sound like a ton of extra work for little benefit. But as it turns out, timing the vast majority of our code/programs is actually quite straightforward, and can go a long way in everything from debugging to just planning your work over the next day.
Why time?
Even newcomers to genomics are probably somewhat familiar with this concept - when going through a variant calling workflow for the first time, for instance, there’s likely an implicit understanding that each individual step will take time, just based on how large FASTQ and BAM files tend to be.
That said, there are a few reasons to explicitly measure these runtimes, whether you’re running pre-existing programs at the command line or writing a few custom Python scripts. Here are some to start with:
1. Assessing code efficiency
This might seem an obvious reason, but it’s still worth mentioning and expanding on. To motivate the concept, let’s picture this: you have a script that draws some large number of values over several iterations. In this instance, having your code also report the amount of time each iteration takes can be quite informative.
To begin with, let’s say you find that each subsequent iteration is taking longer than the last. That seems suboptimal! Perhaps that could tip you off to the fact that, say, your code is holding values from previous iterations in memory even after they’re done - to which one solution could be to write the values from a single replicate to file as soon as it’s done, before starting fresh for the next. Moreover, timing can help you plan your code somewhat if the number of iterations is user-determined (e.g. if simulating values of some sort). By running a few test iterations and timing them, you can assess how many iterations are feasible given your timeframe, if time is the main limiting factor at hand.
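As a minimal sketch of what that per-iteration timing might look like - note that n_iterations, run_replicate, and write_to_file here are hypothetical stand-ins for your own workload:

import time

for i in range(n_iterations):
    start = time.perf_counter()
    values = run_replicate(i)  # hypothetical per-iteration workload
    write_to_file(values, i)   # write each replicate out before moving on
    print(f'iteration {i}: {time.perf_counter() - start:.2f} s')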
Sneakier timing issues can also be found more quickly with this method. Many of us who spend more time processing genomic data than developing tools haven’t received formal training in concepts like memory management or big O notation - but inefficient code turning what seem like simple tasks into slow-running slogs affects us all the same. Let’s say you’re going through a large file in Python and picking out bits of information from each line, but your code starts with a few checks to exclude header/metadata lines:
with open(input_file, 'r') as f:
    for line in f:
        if line.startswith('#header'):
            continue
        elif line.startswith('<meta='):
            continue
        else:
            do_things(line)
Let’s also say you expect input_file to be a file millions and millions of lines long. This is a situation where saving even milliseconds per line can go a long way - and as it turns out, the if blocks are actually going to slow this loop down a fair bit in the long run. This is because checks like these are only true at the start of the file, but they’re still run against every single one of the millions of lines after the header! With this in mind, perhaps you could look into and time alternative approaches (e.g. tabix, the csv module, etc.) over snippets of the same file. No matter what alternative(s) you settle on, timing both this and the other methods over just a few iterations could relatively quickly show you which is the faster option - not only saving you a good bit of runtime when all is settled, but also informing future approaches to the same problem.
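One alternative worth timing against the version above - assuming header/metadata lines really do only ever appear at the top of the file - is itertools.dropwhile from the standard library, which stops evaluating its check entirely after the first non-header line. A minimal sketch, reusing do_things from above:

from itertools import dropwhile

def is_header(line):
    # assumes header/metadata lines only appear at the top of the file
    return line.startswith('#header') or line.startswith('<meta=')

with open(input_file, 'r') as f:
    # dropwhile stops calling is_header after its first False,
    # so the millions of data lines are never checked again
    for line in dropwhile(is_header, f):
        do_things(line)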
Finally, timing code can help you figure out when something is better off parallelized in some way. Perhaps you know something is projected to take about 10 hours per sample, and you’ve got a ton of different samples on your hands - in this case, perhaps it’s worth putting in the time to refactor your code so that it only works on one sample at a time, and then using something like GNU parallel to run the same script over multiple samples at once (sketched below). It might take a bit more development time at first, but the potential time savings offered by this addition can be massive. Plus, even if you don’t opt to go the parallel route, just having that information can help you plan around the expected runtime in other ways.
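To give a rough idea of the shape this takes - with process_sample.py and the data/ path made up for illustration, and assuming the script handles exactly one BAM file passed as an argument - the GNU parallel invocation could look something like:

# run the single-sample script over every BAM, four jobs at a time
parallel -j 4 python process_sample.py {} ::: data/*.bam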
And speaking of planning…
2. Planning your day
This one is a little less obvious unless you’ve thought about it a bit, but timing everything you run can help you plan your day as a whole, since you’ll get a better and better sense of how long specific tasks will take. If you know that a program will be running for the next 20 hours (e.g. HaplotypeCaller), that’s something that can be queued up first thing in the morning before you safely put that task on hold and move on to other things¹. Alternatively, if something is expected to run in a minute flat, then you’re likely able to move on to whatever the next task for that project is within the same day.
Most importantly, this helps you identify the ‘purgatory’ scripts/programs - the ones that take just too long to be considered trivial, but not long enough that you can start another involved task while they run. These are usually best run either while heading off for a break or while clearing out some side tasks, like an email that’s been on your mind for a bit but actually won’t take all that long to get out of the way.
In short, knowing how long various tasks are expected to take lets you structure your day a bit better, and means you’re either able to get more done if needed or take more time off while the computer continues doing all that hard work for you.
¹ Yes, I am only writing this blog post because I have a HaplotypeCaller run going that’ll keep on keeping on for another 3-4 hours.
3. Catching unexpected behaviour early
A bit more of a niche case, but sometimes a task that takes some known amount of time ends much sooner than expected without necessarily raising an error (this is especially common with scripts that run shell commands for you). I’ve found keeping an eye out for this has sometimes helped me debug code or catch bad assumptions in a script whenever it happens - in some cases, I’ve even put in an explicit catch where certain scripts that are expected to take a while when running over multiple files raise a warning if they complete too quickly.
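For what it’s worth, that kind of catch doesn’t need to be fancy - here’s a rough sketch of the idea, where process_files, file_list, and the 60 second threshold are placeholders to adapt to your own task:

import time
import warnings

start = time.perf_counter()
process_files(file_list)  # hypothetical multi-file workload
elapsed = time.perf_counter() - start

if elapsed < 60:  # placeholder threshold - tune to the expected runtime
    warnings.warn(f'finished in {elapsed:.1f} s - much faster than expected, check inputs')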
How do I time my code?
In the Unix shell
Timing Unix commands is probably the easiest of all, thanks to the built-in time utility. Prepend virtually any Unix command with time, whether it’s a simple oneliner or a massive GATK command, and the shell will dutifully report three time values:
time sleep 3
##
## real 0m3.009s
## user 0m0.001s
## sys 0m0.002s
If writing loops, time can be used multiple times for per-iteration reports:
time for i in {1..3}; do
    time sleep 3;
done
##
## real 0m3.007s
## user 0m0.001s
## sys 0m0.001s
##
## real 0m3.010s
## user 0m0.002s
## sys 0m0.005s
##
## real 0m3.006s
## user 0m0.002s
## sys 0m0.003s
##
## real 0m9.025s
## user 0m0.005s
## sys 0m0.009s
We see the output of time four times here - the first three from the inner part of the loop (time sleep 3) and the fourth representing the entire loop itself (time for i...).
Of the three numbers, real is probably the one you’ll want to refer to most - this value is just the ‘actual’ amount of time elapsed. user and sys represent CPU time instead, which is usually less relevant for those of us not involved in more intense development on a day-to-day basis, but is still worth learning about, especially if you’re running things in parallel (see here for more info).
In Python
There are a few ways to time code in Python, with perhaps the timeit module being the most widely used from what I’ve seen. That said, I’d personally put forward tqdm as a perfect all-in-one toolkit for those of us who specifically use Python to work with large data files like BAMs and VCFs.
First things first - if you’re looking to time an entire script, then I’d just run the script at the command line with the time utility above. What tqdm offers, however, is the ability to generate clean progress bars out of any iterable (i.e. object that can be looped over) within Python. These progress bars also display an iteration counter, time information, and a real-time report of average iterations per second (or seconds per iteration if the former is yielding values below 1 - tqdm is good about switching the units around as needed).
This lets you assess both how fast a loop is running and when it’s expected to complete based on the average time per iteration - all updated in real time during the script’s runtime, unlike the single post-run value reported by time above. The iteration counter also makes tqdm incredibly useful for debugging (e.g. if your script specifically breaks on the third iteration each time, that can be helpful information moving forward).
tqdm really needs to be seen in action for the full effect, but fortunately for us, there’s a nice GIF on the package’s PyPI page that shows just that.
In short - wrap anything you’re looping over with tqdm and watch the magic happen:

from tqdm import tqdm

for i in tqdm(my_iterable):
    do_things(i)  # i.e. insert your loop code here!
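One caveat: tqdm can only project a completion time if it knows how many iterations to expect, and a plain file object doesn’t expose its line count up front. If you know (or can estimate) it, passing total= gets you the full progress bar - the count below is made up for illustration:

from tqdm import tqdm

n_lines = 4_000_000  # hypothetical estimated line count for input_file
with open(input_file, 'r') as f:
    for line in tqdm(f, total=n_lines):
        do_things(line)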
In R
I’ll start by confessing I don’t time most of the work I do in R, specifically because all my ‘heavy-duty’ programming happens in Python. More often than not, the only things I’m ever loading into RStudio are neat, tidy datasets I’ve generated elsewhere and which take little processing time/power.
Still, R has a few neat options for the task if needed, and while I haven’t quite found something at the level of tqdm, it seems the tictoc package is a good starting point. To use tictoc, add tic() where you’d like to start a timer, and toc() when you want the timer to stop and report the time elapsed, like so:
library(tictoc)
tic()
Sys.sleep(3)
toc()
## 3.005 sec elapsed
tics and tocs can also be nested if needed, and given ‘names’ no less:
library(tictoc)
tic('outer timer')
tic('inner timer')
Sys.sleep(3)
toc(log = TRUE) # closes inner timer
## inner timer: 3.005 sec elapsed
Sys.sleep(3)
toc(log = TRUE) # closes outer timer
## outer timer: 6.014 sec elapsed
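And if you’ve been passing log = TRUE as above, the recorded timings can be pulled back out afterwards - a small follow-on sketch:

log_lines <- tic.log(format = TRUE) # returns the logged timings as strings
tic.clearlog() # clear the log once you're done with it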
I suppose I’m not completely surprised that there isn’t a full-on tqdm equivalent just yet, given all the functional programming options available in R, but tictoc still seems very useful and flexible for those doing ‘heavier’ computation with the language.
Conclusion
On the surface, the concept of timing code and command line programs seems
fairly involved, but we’ve now seen how it boils down to just inserting certain
utilities that do the hard work for us: whether that’s time
, tqdm
, or
tic()
and toc()
.
At this point, timing my code and shell commands is something I just do on principle wherever possible without really thinking too much about it, and I never find myself regretting it - either I see a nice progress bar do its thing, or I find myself with a bunch of potentially useful information if a script breaks. In either case, I maintain that timing is one of the easier habits to pick up, and it’s paid dividends for me when it comes to day-to-day computational work.