Verifiability of CS Research

Written by J David Smith
Published on 21 December 2016

It is (thankfully) becoming increasingly common for researchers in Computing to publish their code along with the associated paper. This does wonders for the reproducibility of the research, but it has recently become clear that this is not enough. For a concrete example, consider Errol by Andrysco, Jhala, and Lerner. The researchers on that project reported a 2x speed-up over the previous state-of-the-art (Grisu3), a number which was reproduced by the POPL Artifact Evaluation Committee when they ran the build scripts and benchmarks included in the Errol artifact. An author of Grisu3 thought the results suspicious, tested them, and informed the authors of Errol that they'd found Errol to be 2x slower than their own work. As it turns out, this was correct: Grisu3 had been erroneously compiled without optimizations enabled, due to the Errol authors' unfamiliarity with SCons.

What should one take away from this story? Not to use better build systems? (SCons, while having its own problems, is IMO better than make.) Not to include build scripts? Worse: not to publish code & artifacts? In my view, it is simple: reproducible work is insufficient for computer science.

Consider the role of experimental reproduction in the experimental sciences. (I have never worked in those fields, so this is purely the viewpoint of an outsider looking in.) The objective of reproducing an experiment is to verify its results. In experimental physics, for example, the experiment would be reproduced independently using separate lab equipment. This introduces independence into the results, as no two people using two distinct sets of equipment will perform the experiment identically. In Computer Science, on the other hand, many of our experiments come down to running a bundle of code. Absent defects in the machines used to run the code, every pair of computers will produce more-or-less identical results. (Benchmark timing is its own beast: while in theory any two machines would produce the same relative performance, in practice it can vary with distinguishing features of the machines, e.g. memory speed and size.) Therefore, simple reproduction of results cannot give us the same effect as in the experimental sciences.

However, organizations like the POPL Artifact Evaluation Committee don't merely engage in reproduction. They additionally seek to verify that the results in the paper match the artifact, and that the artifact seems to legitimately work. In Errol's case, the flaw was a simple miss in the build script, and it slipped past that process. More can be done to aid this kind of verification, and on my most recent project I've attempted to do just that.

The most obvious way to aid verification is to have good documentation, especially for how to build and run your code. Surfacing this documentation is also important. A README works well for build/run docs, but future researchers working on extensions need deeper insight into how the codebase operates. Documentation tools like Doxygen for C++ or Rustdoc for Rust can surface that documentation in a readable way that is useful both for the original authors returning after work on other projects and for future researchers.
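To make that concrete, here is a minimal Rustdoc sketch. It is not taken from my project: the `bench_utils` crate and the `speedup` function are invented purely for illustration. The `///` comments are what `cargo doc` renders into browsable HTML, and the embedded example is compiled and run as a doc-test by `cargo test`.

```rust
/// Returns the speedup of `new` relative to `baseline`, where both are
/// wall-clock times in seconds. A result of 2.0 means `new` ran twice
/// as fast as `baseline` did.
///
/// The example below is a doc-test: `cargo test` compiles and runs it,
/// so the documentation cannot silently drift away from the code.
///
/// ```
/// // `bench_utils` is a hypothetical crate name used for this sketch.
/// assert_eq!(bench_utils::speedup(10.0, 5.0), 2.0);
/// ```
pub fn speedup(baseline: f64, new: f64) -> f64 {
    assert!(new > 0.0, "the new measurement must be a positive duration");
    baseline / new
}
```

Even a handful of documented, doc-tested functions like this gives a reviewer (or a future researcher) small, checkable claims about what the code is supposed to do.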

While I'm still exploring ways to improve, I've found several tools that have helped with this in my most recent project. Over the next couple of weeks, I'm going to be writing about them in more detail. In particular, I want to examine how each improves the ability to verify my work, and how each falls short. Through this, I aim to get some sense of the direction I ought to head in for future improvement.