Waarom modellen gebruiken als je data hebt?

Onderstaande tekst is in uitsluitend beschikbaar in het Engels.

Sometimes, the advent of Big Data and analytics seems to blow away the old way of computational modeling that explains the world around us in terms of the laws of physics. Are we indeed witnessing the end of this type of modeling and should its practitioners be looking for a new job?

These days, everybody seems to be sure that there is something like a tsunami of environmental observations coming. Let’s assume that this is true and explain the actual lack of useful environmental data right now as the calm before the storm. Then why will we not simply collect all that data (probably doing something with Hadoop, that will make us look cool) and then do a bit of statistics to find trends and deviations? Why not do away with those computational models that require you to describe your entire system in an incomprehensible format that simply can’t express what you are trying to tell, have all sorts of whims (like never converging when you’ve finally specified the model you had in mind) and in the end produce results that don’t match reality?

I found a few answers in a fascinating PhD thesis (Transport of nutrients from land to sea, by Arthur Beusen). In his work, Dr. Beusen tries to quantify the global flows of nutrients like nitrogen and phosphorus (OK, this topic will probably not keep many of us awake at night, but it is important so it’s good that someone studies it). The initial approach is to see if there are correlations between these flows and the properties of the river area from which these flows emerge. If so, you would be able to estimate the total, global flow from the known properties of all the river areas in the world.

Dr. Beusen and his co-workers indeed find some significant correlations that largely (80%) explain the observed concentrations. So their approach works. But there are several limitations (as the author of the thesis duly admits). Here are a few.

– Conditions may change over time: fertilizers were used much less 50 years ago, aquaculture virtually didn’t exist then and there was no climate change. So simply assuming that the correlations and trends that you get from the data of the last 50 years will also hold in the future is no safe bet. At least not safe enough to base million dollar policy decisions on.

– In the end, you only get correlations: if the precipitation is larger, then probably (but not certainly) the total nutrient mass is larger. But for an individual river, this may not hold (unless the correlation coefficient were 1.0, but that doesn’t count as statistics anymore, does it?). So, it’s not very helpful if you want to know the concentration for a particular river. And neither does it provide enough certainty to know what action to take for that particular river if something is wrong.

Apart from these, there are loads of other issues, like the question whether all the data is really valid and also established with the same observation methodology. But I think the arguments above are already serious enough to cast at least a shadow of doubt on whether statistical relations are all we will ever need.

Now, some of you may argue that the first argument (if you miss some variables, you may end up with the wrong prediction) also holds for models that are based on the laws of physics: if you omit one of the relevant processes in your model, it may also go astray big time. The point is, however, that your model is based on an understanding of your system. If you know about climate change, for instance, you may put terms in there that you think describe the influence of a higher temperature. That may not be correct right away, but you can test and improve your hypothesis as time moves on. Data-scientist, on the other hand, cannot say anything but “let’s wait and see, we will tell you once we’ve measured the correlations”. When talking about climate change, this is hardly a reassuring answer.

But I won’t claim that statistical relations will never replace models. I have the feeling that such a statement could one day get me into trouble. Even a great man like Thomas J Watson, the CEO of IBM in the 50s, is now mostly remembered only for predicting that the world doesn’t need more than some five computers (which he probably never said, by the way). Or think about Einstein’s teacher telling him that he would never get far in life. I’d rather be remembered for my foresight than for a lack of imagination. So it’s interesting to spend a moment considering what would be needed on the side of data to replace computational models with an eye on the arguments above.

For the influence of changing conditions, we might hope that if we observe many systems across the world (and perhaps even other worlds) we will always find some system that matches the new, changed conditions. Or find out how conditions alter the correlations. I’ve been told that this is one of the strategies of TomTom for predicting traffic. They simply collect so much data that there will usually be a matching situation for today’s traffic.

Regarding the issue that statistics don’t give you certainty, the answer of the data guys is probably that you should place more sensors. With more sensors you get more correlations and a more complete understanding of the nutrients behavior. Problem solved (so the data guys say, but they might at least be overlooking the costs of all those sensors).

So, is the end of computational models conceivable? Perhaps the answer is yes. But there is a funny way out of this miserable future for us modelers if it should ever emerge. And that is that the laws of physics are also just correlations. If I drop a hammer, there is a pretty high correlation between time-squared and its vertical position. The only difference with traditional statistics is the degree of certainty. So let’s just say that we are also statisticians, and in fact very good ones because our correlations are all 1.0. Then we belong to the cool guys again (though I can’t think of something to do with Hadoop; better keep that quiet).

Machine Learning