In my field, data is often some quantitative measure of abundance: How many cells are there? What fraction are this genotype or that genotype? How does the number or fraction change over time? Typically, the raw experimental measurements get put into a spreadsheet and analyzed, and the results are what go into figures or statistics programs.
A lot can go on in those spreadsheets, and a lot of mistakes can be made, yet in my experience there’s rarely any discussion or training about what happens to data between the experiment and the paper. At no time in grad school or either of my postdocs did anyone ask to talk to me about my spreadsheets. Over the years, I’ve figured a few things out and caught a lot of potential mistakes, so there are definitely things to discuss. I’ve also noticed some issues the few times I’ve looked closely at other researchers’ spreadsheets. It seems really strange to me that labs can spend so much time getting the experiments right but then be kind of careless with the data.
If you were going to teach a new grad student best practices for handling data and calculations, what would you cover? Here are some things that come to mind:
- Keeping the raw measurements collected in a single place for easy access later on
- Data quality control: making sure you didn’t type in the wrong number, or type it into the wrong place
- Annotating data so that you can go back to your lab notebook or original files to verify specific entries
- Identifying bad data points that should not be included
- When you should average multiple measurements, and how you should do it: whether you should take the arithmetic or the geometric mean, for example
- Making sure that lists of calculated values use the same formula for every entry
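
To make the averaging question concrete, here is a minimal Python sketch comparing the two means. The fold-change values are made up for illustration and are not from any real experiment. The point is only that for ratio-like data (fold changes, relative abundances) the two means can disagree substantially, so the choice deserves a deliberate decision.

```python
import statistics

# Hypothetical fold-change measurements from three replicates
# (illustrative values only, not real data).
fold_changes = [0.5, 2.0, 8.0]

# Arithmetic mean: sum of the values divided by their count.
arithmetic = statistics.mean(fold_changes)

# Geometric mean: nth root of the product of the values, which is the
# same as averaging on a log scale. Requires Python 3.8 or later and
# strictly positive inputs.
geometric = statistics.geometric_mean(fold_changes)

print(f"arithmetic mean: {arithmetic:.2f}")  # 3.50
print(f"geometric mean:  {geometric:.2f}")   # 2.00
```

In this toy case, a halving (0.5) and a doubling (2.0) cancel each other out in the geometric mean but not in the arithmetic mean, which is one common argument for using the geometric mean when the measurements are ratios.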