
Research Code Doesn't Need to Be a Tangled Mess

Writing at The Shape of Code blog, Derek Jones asserted that 'Structural differences between academia and industry make it likely that research software will always be a tangled mess, only usable by the person who wrote it.'

Yes, there are structural (and insurmountable) differences between academia and industry in how programming languages are used to solve problems. Professional developers more often engineer code, because there are typically real-world consequences to releasing poor-quality software, from damage to a company's reputation to injury or death. An academic, on the other hand, generally uses a programming language as an ancillary means to achieve something that supports research.

We saw an example of why this can be a real problem last year, when the refactored version of the source code behind the software that generated Prof. Ferguson's data 'models' was published. The code, despite being cleaned up by highly skilled engineers, was still too unreadable to be verified with any assurance, and it's a safe bet that the original, undocumented version of it was appallingly written.

I did read the article in Nature which reported that several independent parties were able to reproduce the models using the (refactored) software, but it dismisses the concerns about the correctness of those results, and glosses over the fact that bug fixes and testing were done only after the software had been used by the government.

If (and it's an if) that is typical of the quality of software used in academia, should any data models that are cited in any research paper be trusted without access to the code? How much of what we thought we knew is actually wrong and comes from faulty data modelling?

But this isn't a critique of Ferguson's software specifically - he might have been unaware this was even an issue - but of the unquestioning faith in data models, which assumes that the mathematical workings reflect reality and have been implemented correctly in software. It's unacceptable to make decisions that could ruin tens of thousands of lives on the back of a 'model' generated by amateur software.

So, attention has been drawn to the issue of the quality of software used to support academic research (and rightly so), and there have been calls for research departments to adopt industry best practices when developing software that is used to influence public policy. However, this might be an unrealistic expectation.

In my day job, I work with code that stringently follows the SOLID design principles, to the point of minimising fixed dependencies and making the code almost perfectly testable through elaborate design patterns, in a department that is obsessed with Agile development and the Scrum 'methodology'. I have my own criticisms of this way of doing things: the coding patterns laid down by software engineering 'best practices' can actually obfuscate what the source code is supposed to do, and I often have to sift through multiple layers of classes and set breakpoints everywhere to determine what a given function actually does. It can seem as if the comprehensibility of our software is being sacrificed for testability, and there is so much overhead that could probably be done away with: morning 'stand up' meetings, retrospectives, countless metrics that diverge from the real state of things, and an hour or two spent administering 'work items' each day.
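To make that complaint concrete, here is a deliberately contrived sketch (the names and classes are hypothetical, not taken from any real codebase): the same small calculation written directly, and then wrapped in the sort of interface-plus-factory layering that such 'best practices' tend to produce.

```python
from abc import ABC, abstractmethod


# A direct implementation: what the code does is visible at a glance.
def mean_squared_error(predictions, targets):
    """Average squared difference between predictions and observed values."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(predictions)


# The same arithmetic hidden behind the kind of layering that 'testable'
# enterprise code often demands: an abstract interface, a concrete strategy,
# and a factory. Each piece is individually mockable, but a reader now has
# to follow three hops to find the calculation that actually matters.
class LossStrategy(ABC):
    @abstractmethod
    def compute(self, predictions, targets):
        ...


class MeanSquaredErrorStrategy(LossStrategy):
    def compute(self, predictions, targets):
        return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(predictions)


class LossStrategyFactory:
    _registry = {"mse": MeanSquaredErrorStrategy}

    @classmethod
    def create(cls, name):
        return cls._registry[name]()


if __name__ == "__main__":
    predictions, observations = [1.0, 2.0, 3.0], [1.5, 2.0, 2.0]
    print(mean_squared_error(predictions, observations))                         # direct call
    print(LossStrategyFactory.create("mse").compute(predictions, observations))  # via the layers
```

Both versions produce the same number; only the second makes you read three classes to find out how.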

As Jones pointed out, there is a huge skills gap between experienced software developers and those who go straight into research after graduating. Indeed, many of the things we consider 'best practices' don't make sense to those outside the software engineering world, and there is a difficult learning curve involved in following the coding patterns we're expected to use.

It would seem we are at an impasse. We can't realistically expect scientists to become software engineers, because that's a whole area of expertise in itself, and research departments can't or won't fund software engineers with industry experience. My argument here is that the situation isn't that black and white: scientists aren't caught between publishing badly-written software and publishing over-engineered software, between software rendered unreadable through ignorance and software rendered unreadable in order to conform to 'best practices'.

The simple, and realistic, answer to this problem is to adopt a handful of coding conventions that make source code readable, so that it's self-explanatory to any scientist or other reader who can follow the mathematics being implemented. One doesn't require years of development experience to achieve this (see the short example after the list below).

  • Make functions as short as possible, with each function ideally doing one specific thing only.
  • Make the code appear clean and as structured as possible.
  • Include comments in the code, where appropriate.
  • Give each function a name that's descriptive of its purpose.
  • Give each variable a descriptive name.
  • Avoid global variables where possible; declare variables close to where they are used.
  • Avoid duplicating code. It should be easier to extend existing classes if the functions are shorter and more specific.
  • When using third-party libraries, try to choose the best-documented, most widely used and actively maintained ones available.
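As a short sketch of what these conventions might look like in practice, here is a toy projection invented purely for illustration (the model and every name in it are hypothetical); the point is the shape of the code, not the epidemiology.

```python
"""A small, self-contained example of readable research code.

Short functions, descriptive names, and comments only where the intent
isn't obvious from the code itself.
"""


def daily_growth_factor(reproduction_number: float, infectious_period_days: float) -> float:
    """Convert a basic reproduction number into a per-day growth factor."""
    # Each infected person causes reproduction_number new infections,
    # spread evenly over the infectious period.
    return 1.0 + (reproduction_number - 1.0) / infectious_period_days


def project_cases(initial_cases: float, growth_factor: float, days: int) -> list[float]:
    """Project the daily case count forward by compounding the growth factor."""
    cases = [initial_cases]
    for _ in range(days):
        cases.append(cases[-1] * growth_factor)
    return cases


if __name__ == "__main__":
    growth = daily_growth_factor(reproduction_number=2.5, infectious_period_days=5.0)
    projection = project_cases(initial_cases=100.0, growth_factor=growth, days=14)
    print(f"Cases after two weeks: {projection[-1]:.0f}")
```

Nothing here requires design patterns or a testing framework; a reader who understands the underlying arithmetic can verify each function in isolation simply by reading it.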