10 Rules for Scientific Software Engineering
In the 20 years that VORtech has been in the business of scientific software engineering, we’ve seen many different codes and witnessed many different development processes. Some of the code was brilliant and a pleasure to work on. Some was less so. Some of the development processes were helpful, others were not.
From this experience we’ve gathered 10 rules that we think are helpful for scientific software engineers. Most of them are just sound software engineering practices and also apply beyond scientific/computational software. Others are less important for general software but essential for scientific software engineering with extremely complex algorithms.
Here they are:
1) Clarity before everything else
This rule is an unquestionable number one. When you’re implementing complex algorithms it is all too easy to create unintelligible code. And with complex algorithms that means: probably erroneous and certainly impossible to extend. So explain what you’re doing by adding comments, and choose clear names for files, classes, variables, functions and everything else. Use indentation properly and break lines that are too long and complex. Remember: a few weeks from now you will look at the code as if it were written by someone else. Would you still understand it?
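As a small illustration of this rule, the sketch below shows the same composite trapezoidal rule written twice. The function names, the integrand and the interval count are made up for this example; the point is only that the second version can be read without decoding it.

```python
# Hard to read: the names reveal nothing about intent.
def f(g, a, b, n):
    h = (b - a) / n
    s = 0.5 * (g(a) + g(b))
    for i in range(1, n):
        s += g(a + i * h)
    return s * h


# The same computation, written to be read.
def trapezoid_integral(integrand, lower, upper, num_intervals):
    """Approximate the integral of `integrand` over [lower, upper]
    with the composite trapezoidal rule."""
    step = (upper - lower) / num_intervals
    # End points carry half weight; interior points carry full weight.
    total = 0.5 * (integrand(lower) + integrand(upper))
    for k in range(1, num_intervals):
        total += integrand(lower + k * step)
    return total * step


def square(x):
    """Hypothetical integrand used only for this illustration."""
    return x * x


cryptic_result = f(square, 0.0, 1.0, 100)
clear_result = trapezoid_integral(square, 0.0, 1.0, 100)
```

Both calls perform exactly the same arithmetic and approximate the integral of x squared over [0, 1], which is 1/3. Only the second one tells a future reader what it is doing.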
2) Use coding standards
Computational software is almost always teamwork. And not only that: it is often created over generations of developers. Unless you agree on coding standards, the code will quickly turn into a thrift shop of styles. That makes it harder for new developers to find their way around. And it also allows errors to hide in unconventional coding. This rule is a corollary of rule number 1: keep your code neat and clear.
3) Avoid technical debt
You’ve probably experienced it often enough: working on a piece of code you notice that it is old-fashioned, badly programmed, unintelligible or plain wrong. But cleaning up someone else’s mess isn’t fun, right? What do you do: will you look the other way or will you improve the code? Looking the other way saves you time and trouble. But if everyone works like this, the code will build up a tremendous technical debt (i.e. the mess that someone will have to clean up one day). This debt may get so high that your code ends up a total loss. So, please improve the code that needs to be improved. As an extra benefit, it will also make you understand the code better. If you really have no time, at least create an issue in your issue management system to keep track of the debt.
4) Use the tools
These days, a version management tool, a continuous integration tool and an issue tracker are standard tools for all scientific software engineering. Yes, you can do without them just like you can get from Berlin to Paris walking. It just takes a bit longer, right? No, seriously, errors in scientific software can be very subtle and hard to find. If you can’t compare versions, if you do not test frequently and if you don’t log what’s going on, repairing bugs is hell. Good tools are so readily available that there is no valid reason not to use them.
5) Use libraries
You’re not really writing your own matrix solver from scratch, are you? Most of the basic algorithms are available for free in excellent numerical libraries. Unless you’ve come up with a fundamentally new approach, you would be wasting your time (and your boss’s money) if you do a proprietary implementation of a basic algorithm. In fact, you should always check for available source code for everything that you’re planning to do. Sure, not everything you find on the internet is good quality, but often it is at least a good starting point.
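To make the point concrete, here is a minimal sketch that compares a hand-rolled textbook formula with a library routine. It uses sample variance and Python’s standard `statistics` module instead of a matrix solver, purely to keep the example short and dependency-free; the function `naive_variance` and the sample values are invented for this illustration.

```python
import statistics


def naive_variance(samples):
    """Textbook one-pass formula: (sum(x^2) - n * mean^2) / (n - 1).

    Mathematically correct, but it subtracts two huge, nearly equal
    numbers, so rounding errors can swamp the answer.
    """
    n = len(samples)
    mean = sum(samples) / n
    sum_of_squares = sum(x * x for x in samples)
    return (sum_of_squares - n * mean * mean) / (n - 1)


# Small spread around a large offset; the exact sample variance is 30.
samples = [1e9 + 4.0, 1e9 + 7.0, 1e9 + 13.0, 1e9 + 16.0]

homegrown = naive_variance(samples)
library = statistics.variance(samples)

print("hand-rolled:", homegrown)
print("library:    ", library)
```

With these values the hand-rolled formula cannot return 30 in double precision, because the squared terms are too large to hold the digits that matter. The library routine has already dealt with that problem for you.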
6) Watch your language
Computer scientists have a soft spot for programming languages. They tend to follow the hype. Remember Java? Remember Ruby? Done any Delphi recently? Often it’s wise to be conservative when it comes to selecting a programming language. In spite of everything, Fortran, C and C++ are still the safest bets if you start on something new. Python has a good chance of becoming one of these safe bets. But didn’t we say the same about Java? If you have a good reason to select a less standard language, fine, go ahead. As long as you realize that the ultimate technical debt is working in a language that is no longer supported.
7) It’s sensitive!
If you don’t pay attention to the sensitivity of your calculations you have a pretty good chance of producing results that don’t mean a thing. We’ve seen huge codes that were effectively just random number generators (even though the users relied on the outcomes). An if-then-else with a float in the condition can switch to another branch based on only a minute difference in the condition. Adding a series of small numbers to a large number can make the information in the small numbers disappear. If you are aware of that and you are sure that that is what is supposed to happen, then fine. If not: don’t fool your users into thinking that your results are somehow significant.
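Both effects are easy to reproduce. The sketch below assumes ordinary IEEE 754 double precision, which is what Python floats use on common platforms; the variable names and numbers are chosen only for illustration.

```python
import math

# 1) A float in a condition: a tiny rounding difference flips the branch.
measured = 0.1 + 0.2          # stored as 0.30000000000000004
threshold = 0.3
exact_comparison = (measured == threshold)
tolerant_comparison = math.isclose(measured, threshold, rel_tol=1e-9)

# 2) Adding many small numbers to a large one: each 1.0 is below the
#    resolution of a double near 1e16, so every single addition is lost.
large = 1e16
small_terms = [1.0] * 1000

running_total = large
for term in small_terms:
    running_total += term

# math.fsum tracks the lost low-order bits and returns the correctly
# rounded sum of all the terms.
accurate_total = math.fsum([large] + small_terms)

print("exact comparison:   ", exact_comparison)
print("tolerant comparison:", tolerant_comparison)
print("lost in the loop:   ", accurate_total - running_total)
```

Whether a tolerance or a compensated sum is the right remedy depends on the algorithm. The important thing is to decide deliberately instead of finding out from a user.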
8) Create visualization and debugging facilities
Debuggers are wonderful tools and no developer should be without them (see rule 4). But debugging computational code is often just not doable with normal debuggers. How do you trace a minute divergence in a computation that runs for days and works on arrays of millions of elements? The only way to go about this (as far as we know, and we’ve tested quite a few tools) is to dump specific output to a file and use visualization to zoom in on the problem. Yes, there will be a lot of if-debug statements in your code but that doesn’t really hurt as long as you keep your code tidy. Having convenient debugging facilities in your code will be useful for the entire lifetime of the code (read: decades).
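A minimal sketch of such a facility is shown below. Everything in it is hypothetical: the `DEBUG_DUMP` switch, the file naming scheme and the helper names are assumptions made for this example, and a real code would more likely use a binary format and a plotting tool. It only shows the idea of dumping state behind a switch and then locating the first point where two runs diverge.

```python
import csv
import math
import os
import tempfile

DEBUG_DUMP = True  # hypothetical switch; real code might read it from a config


def dump_state(directory, label, step, values):
    """Write one row per element so runs can be compared or plotted later."""
    if not DEBUG_DUMP:
        return None
    path = os.path.join(directory, f"{label}_step{step:06d}.csv")
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["index", "value"])
        for index, value in enumerate(values):
            # repr() keeps enough digits to read the float back unchanged.
            writer.writerow([index, repr(value)])
    return path


def load_state(path):
    """Read back a dump written by dump_state."""
    with open(path, newline="") as handle:
        return [float(row["value"]) for row in csv.DictReader(handle)]


def first_divergence(reference, candidate, rel_tol=1e-12):
    """Return the first index where the two dumps differ, or None."""
    for index, (ref, cand) in enumerate(zip(reference, candidate)):
        if not math.isclose(ref, cand, rel_tol=rel_tol):
            return index
    return None


with tempfile.TemporaryDirectory() as workdir:
    # Stand-ins for the state of a trusted run and a suspect run.
    trusted = [math.sin(0.001 * i) for i in range(1000)]
    suspect = list(trusted)
    suspect[637] += 1e-9  # inject a minute difference to search for

    trusted_path = dump_state(workdir, "trusted", 42, trusted)
    suspect_path = dump_state(workdir, "suspect", 42, suspect)
    diverged_at = first_divergence(load_state(trusted_path),
                                   load_state(suspect_path))

print("first divergence at index:", diverged_at)
```

Once you know where and at which step the two runs part ways, a plot of the values around that index is usually enough to zoom in on the cause.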
9) Do the right documentation
Future developers will need some basic documentation to find their way around the code. If your code is large or even huge, they will need an overview of the basic structure (the architecture) and the principles behind it, along with some information on the development environment (folders, makefiles, tests). And they will probably need a clear explanation of the math. But that’s about it. Don’t waste your time writing thick volumes on each and every routine. Make sure that the code is internally documented because that is where developers will look once they’ve understood the basic plan of the software.
10) Don’t optimize until necessary
Optimizing is usually a matter of rearranging your code to make better use of the hardware. Sure, that is important and often cannot be avoided. But remember that the hardware is going to be different in a few years’ time (ever heard about a CPU with built-in FPGA and stacked with a GPU? Don’t be too surprised to see that happen in the next few years). So don’t start out to write your code specifically for one type of hardware. First write the code as it should be from an algorithmic and code quality point of view. Then optimize for the hardware, sacrificing as little of the original code as you can.
These 10 rules are not the result of any rigorous scientific inquiry. So if other rules or more rules are important for you, that’s OK with us. All we can say is that these 10 rules are helping us a lot in our work and take away much of the frustration that so easily arises in scientific software engineering. Perhaps they are useful for you as well.