Organizing computational projects

I’ve been working on a computational/bioinformatics project, and I’ve been finding it a little bit challenging to organize and document my work. I’m used to at-the-bench laboratory projects, where you typically have a small handful of discrete experiments, each documented in a pen-and-paper lab notebook. Analyzing the data from each experiment usually doesn’t require much more than an Excel spreadsheet and an R script. But with computational biology, it seems like every little task requires its own dedicated script or command-line program that generates an output file (or three) in its own unique format. It’s very easy to quickly end up with a giant pile of files with no clear record of what they are, where they came from, and why you made them. Coming back to the project after a little time away, it can all be very opaque: What was I doing again? Where did I stop? Why did I stop?

Here’s what’s working for me right now: I divide the project up into logical chunks, each with its own folder numbered and named after that chunk (“1. Identify foo homologues”, for example, or “2. Phylogeny of foo proteins”). Each folder contains all the files necessary for that chunk of analysis, even if that means duplicating some files. Each folder also has a README.txt file describing what the other files in the directory are and how they were created, ordered by their place in the workflow. Like this:

Identify foo homologues in Dictyostelid genomes
jeff smith 2013

Description of files

- Hidden Markov model of the foo domain. Obtained from Pfam protein families database 2013-05-27.

- Protein-coding sequences in Dictyostelium discoideum genome. Obtained from DictyBase 2013-05-27.

- Output of hmmer search for foo domains in D. discoideum genome. Command: hmmsearch foo.hmm dicty_primary_protein.fasta > hmmer_discoideum.out

- Summary of hmmer results for genes with significant alignment to foo domain model. Used by collate_sequences.R.

- R script to compile and rename matched sequences for further analysis

- Protein sequences of identified foo homologues
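If you'd rather not write each README by hand, the layout above could be scaffolded with a short script. This is only a rough sketch of one way to do it, not something from my actual workflow: the function names create_chunk and describe_file are made up for illustration, and the "filename - description" entry format is an assumption about how each README line is laid out.

```python
from datetime import date
from pathlib import Path


def create_chunk(project_dir, number, title, author):
    """Create a numbered chunk folder with a README.txt stub (hypothetical helper)."""
    chunk_dir = Path(project_dir) / f"{number}. {title}"
    chunk_dir.mkdir(parents=True, exist_ok=True)
    readme = chunk_dir / "README.txt"
    # Never overwrite an existing README: it may already hold hand-written notes.
    if not readme.exists():
        header = f"{title}\n{author} {date.today().year}\n\nDescription of files\n"
        readme.write_text(header, encoding="utf-8")
    return chunk_dir


def describe_file(chunk_dir, filename, description, command=None):
    """Append one file entry to the chunk's README.txt (hypothetical helper).

    Entries are appended in call order, so calling this right after each
    file is created keeps the README in workflow order.
    """
    # Assumed entry layout: "filename - description Command: ..."
    entry = f"\n{filename} - {description}"
    if command:
        entry += f" Command: {command}"
    with open(Path(chunk_dir) / "README.txt", "a", encoding="utf-8") as handle:
        handle.write(entry + "\n")
```

For example, calling create_chunk("my_project", 1, "Identify foo homologues", "jeff smith") and then describe_file(...) once per file, passing the hmmsearch command line as the command argument, would produce a README shaped like the one above. Recording the command at the moment you run it is the main point: that's the detail that's hardest to reconstruct later.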

Right now, I’m focusing on making the work intelligible to future readers (including future me). I’m less concerned with keeping an electronic lab notebook that documents the day-to-day details of the analyses I try and how they turn out. Some people use wikis for this, or revision control software. I’m holding off on that for now.

There’s also “A quick guide to organizing computational biology projects” in PLoS Computational Biology that has some good ideas. I’ve found this part to be especially true:

“Everything you do, you will probably have to do over again. Inevitably, you will discover some flaw in your initial preparation of the data being analyzed, or you will get access to new data, or you will decide that your parameterization of a particular model was not broad enough. This means that the experiment you did last week, or even the set of experiments you’ve been working on over the past month, will probably need to be redone. If you have organized and documented your work clearly, then repeating the experiment with the new data or the new parameterization will be much, much easier.”

Noble WS (2009) PLoS Comput Biol 5(7):e1000424