Science as a practice can be depressingly messy, especially when it relies on computer programs that other people have written, more so when those programs are undocumented, and worse still when they are not open source. Luckily, the establishment’s insistence on reproducibility and the abundance of simple sanity checks tend to iron out higher-level bugs, but it is well within reason that “smaller” problems in a closed-source program could be quietly fixed from one analysis to the next without alerting the community as a whole.
As an example of what I mean by “messiness”: while waiting for a reply from the author of a particular program regarding its output format, I spent part of the day reverse engineering it. This consisted of creating a series of mock datasets and seeing what the code spit out given these as input. Since I knew very well what it should spit out given the mock data, I could figure out how to interpret its output. I’m in the process of confirming my inferences with the author of the program, but in the meantime I can get on with my work.
But this brings me to the real point of this post: what is to be done? Data provenance is the overarching issue, and I am of the opinion that any set of results should be quick to regenerate (as measured in scientist-time, not computer-time) based solely on meta-data provided, and this is the key point, as part of the results themselves. A few simple guiding principles can go a long way toward achieving this goal.
- In the absence of a well-defined standard, it’s the individual scientist’s or consortium’s responsibility to define and actively use an organized meta-data standard.
- If it’s not open source it’s not science.
- A snapshot of the source code used to generate results should be provided, or pointed to, when the results are presented.
- Minimizing reproduction time is an integral part of science.
- The four principles above should be actively encouraged, nay demanded, by funding agencies, program heads, and research advisors.
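
To make the idea of meta-data traveling "as part of the results themselves" a bit more concrete, here is a minimal sketch in Python. Everything in it is my own assumption rather than an established standard: the function names (`save_with_provenance`, `current_commit`, `sha256_of`), the JSON layout, and the choice of fields are all hypothetical, and it presumes the analysis code lives in a git repository. It simply bundles the results with the code snapshot, parameters, and input checksums that would be needed to regenerate them.

```python
import hashlib
import json
import subprocess
import sys
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def current_commit():
    """Return the current git commit hash, or None if it cannot be determined."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return out.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        # git is not installed, or we are not inside a repository
        return None


def save_with_provenance(out_path, results, params, input_paths):
    """Write results plus the meta-data needed to regenerate them to one JSON file.

    The field names below are illustrative assumptions, not a standard.
    """
    record = {
        "provenance": {
            "created_utc": datetime.now(timezone.utc).isoformat(),
            "command": sys.argv,              # how the script was invoked
            "python": sys.version,            # interpreter used
            "code_commit": current_commit(),  # snapshot of the source code
            "parameters": params,             # every knob that was turned
            "inputs": {str(p): sha256_of(p) for p in input_paths},
        },
        "results": results,
    }
    Path(out_path).write_text(json.dumps(record, indent=2))
    return record
```

A call such as `save_with_provenance("fit.json", {"slope": 1.7}, {"seed": 42}, ["data.csv"])` (file names and values made up for illustration) would then leave a single file from which someone else could check out the same commit, verify they have the same inputs, and rerun with the same parameters. Whether JSON, a FITS header, or HDF5 attributes is the right container depends on the field; the point is only that the meta-data is not stored somewhere separate from the results.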