How and why to time your code - a quick primer

Got a second?

There’s a frankly daunting number of tools and practices that most new researchers in genomics and computational biology could get some serious mileage out of, and in some ways I am loath to add to the pile, but I maintain this one is particularly worth your time (heh).

I’m of course referring to timing the programs you run and scripts you write, which on the surface might sound like a ton of extra work for little benefit. But as it turns out, timing the vast majority of our code/programs is actually quite straightforward, and can go a long way in everything from debugging to just planning your work over the next day.

Why time?

Even newcomers to genomics are probably somewhat familiar with this concept - when going through a variant calling workflow for the first time, for instance, there’s likely an implicit understanding that each individual step will take time, just based on how large FASTQ and BAM files tend to be.

That said, it’s worth explicitly recording these runtimes, whether you’re running pre-existing programs at the command line or writing a few custom Python scripts. Here are a few reasons to start with:

1. Assessing code efficiency

This seems a bit of an obvious reason, but it’s still worth mentioning and expanding on. To motivate the concept, let’s picture this: you have a script that draws some large number of values over several iterations. In this instance, having your code also report the amount of time each iteration takes can be quite informative.

To begin with, let’s say you find that each subsequent iteration is taking longer than the last. That seems suboptimal! Perhaps that could tip you off to the fact that, say, your code is holding values from previous iterations in memory even after they’re done - to which one solution could be to write the values from a single replicate to file as soon as it’s done, before starting fresh for the next. Moreover, timing can help you plan your code somewhat if the number of iterations is user-determined (e.g. if simulating values of some sort). By running a few test iterations and timing them, you can assess how many are feasible given your timeframe, if time is the main limiting factor at hand.
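To make that concrete, here’s a minimal sketch of per-iteration timing with time.perf_counter - simulate_replicate() and the output filenames here are hypothetical stand-ins:

import time

n_replicates = 100  # hypothetical user-determined iteration count

for rep in range(n_replicates):
    start = time.perf_counter()
    values = simulate_replicate(rep)  # hypothetical single-replicate simulation
    with open(f'replicate_{rep}.txt', 'w') as out:
        out.write('\n'.join(str(v) for v in values))
    # this replicate's values are now on disk and don't need to stay in memory
    print(f'replicate {rep}: {time.perf_counter() - start:.2f} s')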

Sneakier timing issues can also be more quickly found with this method. Many of us who spend more time processing genomic data than developing tools haven’t received formal training in concepts like memory management or big O notation - but inefficient code turning what seem like simple tasks into slow-running slogs affects us all the same. Let’s say you’re going through a large file in Python and picking out bits of information from each line, but your code starts with a few checks to exclude header/metadata lines:

with open(input_file, 'r') as f:
    for line in f:
        # skip header/metadata lines
        if line.startswith('#header'):
            continue
        elif line.startswith('<meta='):
            continue
        else:
            do_things(line)  # pick out what's needed from this line

Let’s also say you expect input_file to be millions and millions of lines long. This is a situation where saving even milliseconds per line can go a long way, and as it turns out, those if checks are going to slow this loop a fair bit in the long run. Checks like these are only true at the start of the file - but they’re still run against every single one of the millions of lines after the header! With this in mind, perhaps you could look into and time alternative approaches (e.g. tabix, the csv module, etc.) over snippets of the same file. No matter what alternative(s) you settle on, timing each method over just a few iterations could relatively quickly show you which is the faster option - not only saving you a good bit of runtime when all is settled, but also informing future approaches to the same problem.
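As one hedged example of an alternative worth timing - with input_file and do_things() remaining the placeholders from above - itertools.dropwhile stops applying its check after the first non-header line, so the millions of data lines skip the startswith calls entirely:

import time
from itertools import dropwhile

def is_meta(line):
    # the same header/metadata checks as before
    return line.startswith('#header') or line.startswith('<meta=')

start = time.perf_counter()
with open(input_file, 'r') as f:
    # dropwhile calls is_meta only until it first returns False -
    # every line after that is yielded with no further checks
    for line in dropwhile(is_meta, f):
        do_things(line)
print(f'elapsed: {time.perf_counter() - start:.2f} s')

Wrapping each candidate approach in the same perf_counter calls gives you directly comparable numbers over the same snippet of the file.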

Finally, timing code can help you figure out when something is better off parallelized in some way. Perhaps you know something is projected to take about 10 hours per sample, and you’ve got a ton of different samples on your hands - in this case, it may be worth putting in the time to refactor your code so that it works on just one sample at a time, and then using something like GNU parallel to run the same script over multiple samples at once (see the sketch below). It might take a bit more development time at first, but the potential time savings can be massive. Plus, even if you don’t opt to go the parallel route, just having that information can help you plan around the expected runtime in other ways.
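For a rough sense of the pattern, here’s a sketch using Python’s built-in multiprocessing in place of GNU parallel - process_sample() and the sample names are hypothetical:

from multiprocessing import Pool

def process_sample(sample):
    # hypothetical: the refactored one-sample-at-a-time logic goes here
    print(f'processing {sample}')

samples = ['sample_A', 'sample_B', 'sample_C']  # placeholder sample names

if __name__ == '__main__':
    # work through up to four samples at once instead of one after another
    with Pool(processes=4) as pool:
        pool.map(process_sample, samples)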

And speaking of planning…

2. Planning your day

This one is a little less obvious unless you’ve thought about it a bit, but timing everything you run can help you plan your day as a whole, since you’ll get a better and better sense of how long specific tasks take. If you know that a program will be running for the next 20 hours (e.g. HaplotypeCaller), that’s something that can be queued up first thing in the morning before you safely put it on hold and move on to other things¹. Alternatively, if something is expected to run in a minute flat, you can likely move on to whatever the next task for that project is within the same day.

Most importantly, this helps you identify the ‘purgatory’ scripts/programs - the ones that take just too long to be considered trivial, but not long enough that you can start another involved task while they run. These are usually best run while heading off for a break or clearing out some side tasks, like an email that’s been on your mind for a bit but actually won’t take all that long to get out of the way.

In short, knowing how long various tasks are expected to take lets you structure your day a bit better, and means you’re either able to get more done if needed or take more time off while the computer continues doing all that hard work for you.

¹ Yes, I am only writing this blog post because I have a HaplotypeCaller run going that’ll keep on keeping on for another 3-4 hours.

3. Catching unexpected behaviour early

A bit more of a niche case, but sometimes a task that takes some known amount of time ends much sooner than expected without necessarily raising an error (this is especially common with scripts that run shell commands for you). Keeping an eye out for this has sometimes helped me debug code or catch bad assumptions in a script - in some cases, I’ve even put in an explicit catch, where scripts that are expected to take a while when running over multiple files raise a warning if they complete too quickly.
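A minimal sketch of what that kind of catch might look like - the threshold and run_over_all_files() are hypothetical stand-ins:

import time
import warnings

MIN_EXPECTED_SECONDS = 600  # hypothetical: this run should take at least 10 min

start = time.perf_counter()
run_over_all_files()  # placeholder for the script's actual multi-file work
elapsed = time.perf_counter() - start

if elapsed < MIN_EXPECTED_SECONDS:
    warnings.warn(
        f'completed in {elapsed:.1f} s, well under the expected runtime - '
        'were all input files actually processed?'
    )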

How do I time my code?

In the Unix shell

Timing Unix commands is probably the easiest of all, thanks to the built-in time utility. Prepend virtually any Unix command with time, whether it’s a simple one-liner or a massive GATK command, and the shell will dutifully report three time values:

time sleep 3
## 
## real 0m3.009s
## user 0m0.001s
## sys  0m0.002s

If writing loops, time can be used multiple times for per-iteration reports:

time for i in {1..3}; do
  time sleep 3;
done
## 
## real 0m3.007s
## user 0m0.001s
## sys  0m0.001s
## 
## real 0m3.010s
## user 0m0.002s
## sys  0m0.005s
## 
## real 0m3.006s
## user 0m0.002s
## sys  0m0.003s
## 
## real 0m9.025s
## user 0m0.005s
## sys  0m0.009s

We see the output of time four times here - the first three from the inner part of the loop (time sleep 3) and the fourth representing the entire loop itself (time for i...).

Of the three numbers, real is probably the one you’ll want to refer to most - it’s just the ‘actual’ (wall-clock) amount of time elapsed. user and sys instead represent CPU time, which is usually less relevant for those of us not doing more intensive development day-to-day, but is still worth learning about, especially if you’re running things in parallel (see here for more info).
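Incidentally, the same wall-clock vs. CPU time distinction can be seen from within Python, which is handy to keep in mind for the next section - time.perf_counter() tracks elapsed (‘real’) time, while time.process_time() tracks CPU time:

import time

wall_start = time.perf_counter()
cpu_start = time.process_time()

time.sleep(3)  # sleeping takes wall-clock time but essentially no CPU time

print(f'wall-clock: {time.perf_counter() - wall_start:.3f} s')  # ~3 s
print(f'CPU time:   {time.process_time() - cpu_start:.3f} s')   # ~0 s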

In Python

There are a few ways to time code in Python, with the timeit module perhaps being the most widely used from what I’ve seen. That said, I’d personally put forward tqdm as a perfect all-in-one toolkit for those of us who specifically use Python to work with large data files like BAMs and VCFs.

First things first - if you’re looking to time an entire script, I’d just run the script at the command line with the time utility above. What tqdm offers, however, is the ability to generate clean progress bars out of any iterable (i.e. any object that can be looped over) within Python. These progress bars also display an iteration counter, time information, and a real-time report of average iterations per second (or seconds per iteration if the former dips below 1 - tqdm is good about switching the units around as needed).

It really is phenomenal: you can assess both how fast a loop is running and when it’s expected to complete based on the average time per iteration - all updated in real time during the script’s runtime, unlike the single post-run value reported by time above. The iteration counter is incredibly useful for debugging too (e.g. if your script specifically breaks on the third iteration each time, that can be helpful information moving forward).

tqdm really needs to be seen in action for the full effect, but fortunately for us, there’s a nice GIF on the package’s PyPI page that shows just that, reproduced here:

[GIF: a tqdm progress bar updating in real time, reproduced from the package’s PyPI page]

In short - wrap anything you’re looping over with tqdm and watch the magic happen:

from tqdm import tqdm

my_iterable = range(10000)  # stand-in for whatever you're looping over

for i in tqdm(my_iterable):
    pass  # insert loop code here!
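For file-based work in particular, you can wrap the file handle itself. A file handle has no length, so the bar won’t show a percentage or time remaining by default - but if you know the line count (or can afford a quick pre-pass to get it), passing it via total fills those in. A sketch, reusing the input_file and do_things() placeholders from earlier:

from tqdm import tqdm

with open(input_file, 'r') as f:
    n_lines = sum(1 for _ in f)  # quick pre-pass to count lines

with open(input_file, 'r') as f:
    for line in tqdm(f, total=n_lines):
        do_things(line)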

In R

I’ll start by confessing I don’t time most of the work I do in R, specifically because all my ‘heavy-duty’ programming happens in Python. More often than not, the only things I’m ever loading into RStudio are neat, tidy datasets I’ve generated elsewhere, which take little processing time or power.

Still, R has a few neat options for the task if needed, and while I haven’t quite found something at the level of tqdm, the tictoc package seems like a good starting point. To use tictoc, add tic() where you’d like to start a timer, and toc() when you want the timer to stop and report the time elapsed, like so:

library(tictoc)
tic()
Sys.sleep(3)
toc()
## 3.005 sec elapsed

tics and tocs can also be nested if needed, and given ‘names’ no less - passing log = TRUE to toc() additionally stores each result in a log that can be retrieved later with tic.log():

library(tictoc)
tic('outer timer')
tic('inner timer')
Sys.sleep(3)
toc(log = TRUE) # closes inner timer
## inner timer: 3.005 sec elapsed
Sys.sleep(3)
toc(log = TRUE) # closes outer timer
## outer timer: 6.014 sec elapsed

I suppose I’m not completely surprised that there isn’t a full-on tqdm equivalent just yet, given how much R work leans on functional constructs (e.g. the apply family) over explicit loops, but tictoc still seems very useful and flexible for those doing ‘heavier’ computation with the language.

Conclusion

On the surface, timing code and command line programs might sound fairly involved, but we’ve now seen how it boils down to just inserting certain utilities that do the hard work for us: whether that’s time, tqdm, or tic() and toc().

At this point, timing my code and shell commands is something I just do on principle wherever possible without really thinking too much about it, and I never find myself regretting it - either I see a nice progress bar do its thing, or I find myself with a bunch of potentially useful information if a script breaks. In either case, I maintain that timing is one of the easier habits to pick up, and it’s paid dividends for me when it comes to day-to-day computational work.