These days I am reviewing CIG papers. At the moment I am not active in academia, but I enjoy staying involved in the community, even if only through these “simple” tasks.
This, however, makes me think about the state of scientific work in academia, at least in the computer science field. Something bothered me during my Ph.D., and I am sure it is related to this. (Note: I am talking about what I know, that is, the computer science and AI community.) In general, researchers are evaluated on the number of publications. The number of publications is a proxy variable for measuring “quality”. Unfortunately, researchers know that and have started to game the system. They try to publish more, and to do that they need to lower the “quality” of their work. Because the medium used to communicate your work is a plain PDF file, the obvious way to cut work is to cut the implementation part.
And here we have a feedback loop. On one hand, good papers must be reproducible. But papers are rarely reproducible. Making an AI paper reproducible means: writing good code, making the code robust enough to run on different machines, documenting it (at least how to use it), providing the testing data, and, if the method is not trained/evaluated on a standard corpus, making that data accessible. It is a lot of extra work that does not pay off, because what matters is convincing reviewers to accept your work.
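To make the point concrete, here is a minimal sketch of what I mean by a reproducible entry point. Everything in it is hypothetical (the file names, the data path, the metric), it is only meant to show how little it takes to pin down the seed, document the usage, and point at the released data:

```python
# run_experiment.py -- hypothetical entry point for a reproducible AI paper.
# Dependencies would be pinned in a requirements.txt next to this file.
import argparse
import json
import os
import random


def main():
    parser = argparse.ArgumentParser(
        description="Reproduce the results table of the paper")
    parser.add_argument("--data", default="data/test_set.json",
                        help="path to the evaluation data released with the paper")
    parser.add_argument("--seed", type=int, default=42,
                        help="random seed, fixed so the numbers match the paper")
    args = parser.parse_args()

    # Fix every source of randomness up front so a rerun gives the same numbers.
    random.seed(args.seed)

    if not os.path.exists(args.data):
        raise SystemExit(f"Evaluation data not found at {args.data}; "
                         "see the README for the download link.")

    with open(args.data) as f:
        test_set = json.load(f)

    # ... run the algorithm described in the paper on test_set ...
    # and print the same metrics that appear in the results table.
    print(f"Evaluated {len(test_set)} instances with seed {args.seed}")


if __name__ == "__main__":
    main()
```

None of this is hard; it is simply unrewarded work under the current incentives.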
Therefore, reviewers keep receiving papers with serious reproducibility issues. Reviewing a paper, if done right, is a demanding job. I need about 5 hours per paper. I read it once first. Then I go through the literature review again. If it is not a paper on which I can swear to the Gods I know everything, I search the literature myself. Then I need to read a couple of related works. At that point I have an idea about the paper. Finally, I read it again and start writing my comments. I should also check for reproducibility, but if the paper does not provide an easy way to do so, that is impossible: I cannot reimplement the algorithm myself.
So, what’s the result? Reviewers rejecting all the non-reproducible papers? No. That would lead to conferences with 2 or 3 accepted papers. Instead, reviewers assume the claims are right (except for really questionable results), evaluate them, and move on.
However, this encourages paper writers to be sloppy about reproducibility and code quality, because it generally does not matter. And thus the vicious cycle gets stronger.
Thinking about a new scientific medium
The reproducibility problem is well known, and it does not affect only computer science. There are many other articles talking about it and proposing more general solutions. Here, however, I just want to focus on a specific idea for my field. One of the main reasons for the lack of reproducibility in computer science and AI is that the implementation is often not part of the final product. It is a tool used by the researcher to obtain numbers. The numbers are what matters. That’s the point: the implementation is just a pass-through to get to the real thing, the paper.
But it is 2018 and we are talking about a scientific community that is focused on implementations, algorithms, running code. Isn’t it strange that the main medium for academic research is a piece of paper? Yes, sure, we are advanced, we use digital paper in the form of PDF files, but that’s just a facade.
What if the code were the paper? What if the submission to a conference were, for instance, a Jupyter Notebook? It would be great to just go through the “paper” and see the algorithm running. It would be great for reviewers, who would finally have a self-contained source of truth for the authors’ claims. It would be great for the future users of your work: they could finally look at interactive code instead of boring and incomprehensible text in mid-century formats. And in the long run, it would be great for the researcher too. Even if this requires more work upfront, if your work is more accessible, clear and fun, it will be easier to achieve the greatest satisfaction for anyone working in this field: seeing your work used in the wild.
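As a toy illustration of what a cell in such a notebook-paper could look like: the claim and the evidence live next to each other, and a reviewer can change the seed and re-run. The “baseline” and “proposed” functions here are made-up stand-ins, not a real method:

```python
# A cell from a hypothetical notebook-paper.
import random

def baseline(instance):
    # Stand-in for the method we compare against.
    return random.random()

def proposed(instance):
    # Stand-in for the "new" algorithm described in the text above this cell.
    # Synthetic scores skewed upward, purely for illustration.
    return 0.5 + 0.5 * random.random()

random.seed(0)  # a reviewer can change this and re-run the cell
instances = range(1000)
baseline_score = sum(baseline(i) for i in instances) / len(instances)
proposed_score = sum(proposed(i) for i in instances) / len(instances)

# In a real notebook this would be a table or a plot rendered inline.
print(f"baseline: {baseline_score:.3f}")
print(f"proposed: {proposed_score:.3f}")
```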
It would be nice, but we are not there yet. Of course, there is huge inertia in the academic community. I think one of the problems is that Computer Science, AI and similar fields are in the adolescent period of their life. They are younger than the “true sciences” (e.g., physics and math) and they want to look “cooler” by copying their older brothers. But Computer Science has different requirements and, with time, it will learn to find its own way.
Then there are the technical problems. Jupyter notebooks are cool, but “not there yet” either. I am sure there are several edge cases that still need to be tackled before they can become a reliable academic medium. For instance, we need to guarantee that they will be future-proof, at least in their “textual” part.
In any case, a discussion on the future of the scientific paper (at least in the CS community) is required. I know I am not the only one raising concerns about the old model these days. I hope not to be the last.