What does our new age of open science and open data mean for research that’s mainly laboratory benchwork? Most of the major ecology & evolution journals now require that the data for papers also be published in an open-access repository such as GenBank (for nucleotide sequences), TreeBASE (for phylogenetics), or Dryad (for other kinds of data). In at-the-bench wet-lab research, there’s often a fair amount of analysis linking the raw observations, the statistical result, and its biological interpretation. How much of that process should go into a public repository, and in what form?
In my own field of microbial evolution, one way in which I’d find open data useful has to do with fitness. There are many different ways to measure the survival and reproductive success of organisms, and they each have their uses. When reading papers, I often find that the authors plot their fitness data one way, and I wonder what it’d look like plotted another way. For example, I often see papers that plot mean group productivity and within-group relative fitness (a multilevel selection partition of social evolution), and I wonder what the data would look like as the absolute fitness of each microbial genotype (the neighbor-modulated fitness partition in kin selection theory). Much of the kin selection/group selection debate is about the best way to calculate and think about fitness. I prefer to plot fitness data in a way that’s easily interpreted in multiple ways. But it’d be nice to at least grab other people’s data so I can replot it a bit. So here’s my recommendation:
Make sure that the data you archive includes the raw colony counts (or plaque counts, or cell counts). With that, anybody can easily calculate their favored fitness measure.
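To illustrate why raw counts are enough, here is a minimal Python sketch that derives several fitness measures from the same four numbers. The function name `fitness_measures`, the two-genotype setup, and the particular definitions (fold change as absolute fitness, the ratio of fold changes as within-group relative fitness, total final count as group productivity) are assumptions chosen for illustration; other conventions exist, and any of them could be computed from the same counts.

```python
import math


def fitness_measures(initial_a, final_a, initial_b, final_b):
    """Derive several fitness measures for two genotypes (A and B)
    sharing one mixed group, starting from raw counts only.

    The definitions below are illustrative assumptions, not the only
    conventions in use.
    """
    # Absolute fitness: fold change in count for each genotype.
    abs_a = final_a / initial_a
    abs_b = final_b / initial_b

    return {
        "absolute_a": abs_a,
        "absolute_b": abs_b,
        # Group productivity: total final count of the whole group.
        "group_productivity": final_a + final_b,
        # Within-group relative fitness of A: ratio of fold changes.
        "relative_a": abs_a / abs_b,
        # Malthusian parameters: log fold change (number of
        # e-foldings over the assay) for each genotype.
        "malthusian_a": math.log(abs_a),
        "malthusian_b": math.log(abs_b),
    }


# Hypothetical counts, made up for this example.
measures = fitness_measures(initial_a=100, final_a=800,
                            initial_b=100, final_b=200)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```

None of these values can be reliably back-calculated from a single published summary statistic, but all of them follow directly from the archived counts.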
I’ve been working on a project that involves a large amount of flow cytometry data. How much of my data and calculations should go into Dryad? I’m tempted to say: all of it. Everything from the raw data files, to the flow cytometry gating scripts, to the cell counts, to the derived values (growth rates, fitness, etc.), to the statistical analyses, to the scripts for making the figures. Why not?
Sharing everything also helps us become better scientists. I often learn a few things when I look at other people’s scripts and spreadsheets. I’ve traditionally used Excel for basic data manipulation and plotting. Now, though, I’m thinking that maybe I should try to do as much as possible in R: the scripts are easy to share, it’s easier to use for large data sets, and it avoids the copy/paste errors that sometimes crop up in spreadsheets.