Guidelines for a successful data-science project

Misha Veldhoen | data scientist & scientific software engineer

In recent years, the demand for data-science services has grown enormously. Almost everyone I meet is excited about what insights from data could bring to their company, but their organization often lacks the knowledge and experience to work out those ideas themselves. Alongside the enthusiasm there is often also caution: data science has not yet proven its usefulness for their organization. A previous blog post addresses exactly this topic.

In many cases, funds are made available to let an external party with data-science expertise develop a pilot. In recent years, we have been fortunate to work on many different data-science projects for our customers. At the start of a project I often already have ideas about which class of machine-learning techniques would be suitable. But from experience I know that the most important success factor of such a pilot is not using the most advanced algorithm, but how the project is designed.

In this blog post, I therefore discuss (in no particular order) a number of guidelines that we at VORtech always follow when we work on a data-science project, in order to maximize its chances of success.

Collaborate with the customer and work at the customer’s office

At VORtech we work on data-science projects for a very diverse group of customers from different industries. This makes my work as a data scientist very interesting, because I get to work with people with very different backgrounds and expertise, who all speak their own language and look at their data in their own way. Inevitably, one picks up some domain knowledge during such projects: I now know a lot more about high-voltage cables, ships, scaffolds, water pipes, and fire trucks, for instance.

When I start a data-science project, I prefer to work at the customer's office. In many data-science projects, collecting, cleaning, and exploring the data is an important part of the initial work. The customer knows and understands their data much better than I do, and it would be a mistake not to make use of this. For example, the customer can help determine which values are certainly wrong: certain numbers may never be negative, certain values are reserved as error codes, or one event may never occur before another.
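To give a concrete idea, a few of these domain rules can be encoded as simple pandas checks. The sketch below is purely illustrative: the column names, the reserved error code 9999, and the rules themselves are invented for the example.

```python
import pandas as pd

# Hypothetical measurement data; all column names and rules are made up for illustration.
df = pd.DataFrame({
    "flow_rate":  [3.2, -1.0, 4.8, 9999.0],
    "started_at": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-05", "2021-01-07"]),
    "ended_at":   pd.to_datetime(["2021-01-02", "2021-01-01", "2021-01-06", "2021-01-08"]),
})

# Domain rules as supplied by the customer:
checks = {
    "negative flow rate":  df["flow_rate"] < 0,                # may never be negative
    "reserved error code": df["flow_rate"] == 9999.0,          # sentinel value, not a real measurement
    "end before start":    df["ended_at"] < df["started_at"],  # event ordering must hold
}

for name, mask in checks.items():
    print(f"{name}: {int(mask.sum())} suspicious row(s)")
```

Flagging rows like this, rather than silently dropping them, keeps the conversation with the customer open about what the suspicious values actually mean.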

Once the data has been cleaned up and the project objective has been established, I can start modeling the data with a statistical or machine-learning model. From that moment on, it is in principle less important to work at the customer's office. Even so, I still prefer to do so if possible. In many cases, my ultimate goal is not only to produce a report of my findings, but also to deliver something that the customer can actually use in daily operations, for example an application or a dashboard that gives real-time insight into the state of affairs. This works best if the customer has been closely involved in the development process, for instance by letting one of their own developers collaborate on the project.

Be careful with data that may contain errors

Customers have often accumulated a considerable volume of data over the years; for example, records of all transactions ever carried out, or time series of high-frequency sensor data. In principle this is a good start, but if the data has never been used for any kind of analysis, it is often heavily polluted. Examples are missing data, incorrectly entered data, technical problems with the sensor network, manual operations carried out on the data, and so on. These problems become visible as soon as I start exploring the data, but it is not always clear how they should be resolved. By making certain assumptions I can usually clean the data considerably, but this can be a very time-consuming task.

It is important that the customer is aware of the consequences of incorrect data. First, a machine-learning model that has been trained on erroneous data will not provide reliable predictions and can lead to wrong insights. To a certain extent, you can shield yourself against this by using so-called 'robust' methods, for example by using a median instead of an average, or by using the RANSAC algorithm when estimating model parameters, but this type of measure is not always sufficient. Second, a trained machine-learning model expects correct input when it makes a prediction; if it is fed incorrect input, its output is worth little.
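As a small illustration of such robust methods, the sketch below contrasts the mean with the median, and an ordinary least-squares fit with scikit-learn's RANSACRegressor. The data is synthetic and the amount of corruption is chosen for effect; none of this comes from a real customer dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, RANSACRegressor

rng = np.random.default_rng(0)

# Synthetic data: a clean linear trend with a handful of grossly corrupted records.
x = rng.uniform(0, 10, size=200)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=200)
y[:10] += 50.0  # simulate polluted measurements

# The median is far less sensitive to the corrupted values than the mean.
print("mean:", y.mean(), "median:", np.median(y))

# RANSAC repeatedly fits on random subsets and keeps the consensus set of inliers.
X = x.reshape(-1, 1)
ols = LinearRegression().fit(X, y)
ransac = RANSACRegressor(random_state=0).fit(X, y)
print("OLS slope:   ", ols.coef_[0])
print("RANSAC slope:", ransac.estimator_.coef_[0])
```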

It is therefore essential to get as close to a healthy, error-free database as possible. To prevent more incorrect data from entering the database in the future, the customer will have to handle their data differently and give it more attention, for example by implementing sanity checks in the applications that write to the database, or by actively monitoring the database contents.

Translate a customer question into a mathematically well-defined problem

The first step that I take after exploring the data is to translate the customer's question into a mathematical optimization problem. In this way, I formalize the exact objective of the project and immediately obtain a measure of success. In some cases this step is easy, but sometimes difficult questions arise. Suppose a customer wants a model that determines when a part in a machine needs to be replaced; the real goal is then to reduce maintenance costs. Replacing parts too early is expensive, but replacing them too late is perhaps even more costly. It is therefore necessary to think about the relative costs of replacing too early and too late, so that the right optimum can be found.
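To make this concrete, the objective for such a maintenance problem could be written down as an asymmetric cost function along the following lines. The cost figures and the function itself are purely hypothetical; in practice they would come out of the discussion with the customer.

```python
import numpy as np

# Hypothetical cost figures; in a real project these come from the customer.
COST_PER_DAY_TOO_EARLY = 500.0   # useful life thrown away, per day
COST_TOO_LATE = 20_000.0         # unplanned breakdown

def mean_maintenance_cost(predicted_day: np.ndarray, failure_day: np.ndarray) -> float:
    """Average cost of a replacement policy with asymmetric penalties."""
    days_early = failure_day - predicted_day
    cost = np.where(
        predicted_day <= failure_day,
        COST_PER_DAY_TOO_EARLY * days_early,  # replaced early: pay for the wasted remaining life
        COST_TOO_LATE,                        # replaced too late: pay for the breakdown
    )
    return float(cost.mean())

# Three parts: predicted replacement days vs. actual failure days.
print(mean_maintenance_cost(np.array([10, 20, 30]), np.array([12, 18, 35])))
```

A model (or a replacement threshold) is then tuned to minimize this cost, rather than a generic error metric such as mean squared error.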

It is tempting to start training machine-learning models as soon as you have the data, but this often leads to unusable models if it later turns out that the question was not the right one.

Use a baseline when modeling

Once the problem is well defined, modeling can begin. First, the data is divided into a part used to train the machine-learning model (the training set) and a part used to assess how well the model works on unseen data (the test set). The modeling itself is usually an iterative process: I try a set of appropriate models, see which one works best, study the residuals (the differences between the measurements and the model predictions), and use this information to make adjustments until I am satisfied with the result. The most important model in this iterative process (and one that I always use) is the simplest model I can come up with. For a regression problem (modeling a quantitative variable) this can be a model that always predicts the average of the relevant quantity in the training set. For a classification problem, I create a model that always predicts the same class, namely the class that is most prevalent in the training set. Finally, for a time-series problem, the predicted value is simply the previous value.
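In scikit-learn, these trivial baselines take only a few lines; the data below is a made-up stand-in for the training set.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier, DummyRegressor

# Made-up training data, purely for illustration.
X_train = np.zeros((4, 1))  # features are ignored by the dummy models

# Regression baseline: always predict the mean of the training targets.
y_reg = np.array([3.0, 5.0, 4.0, 6.0])
reg_baseline = DummyRegressor(strategy="mean").fit(X_train, y_reg)

# Classification baseline: always predict the most frequent class in the training set.
y_clf = np.array(["ok", "ok", "ok", "faulty"])
clf_baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_clf)

# Time-series baseline: the forecast for each step is simply the previous value.
series = pd.Series([10.0, 12.0, 11.0, 13.0])
naive_forecast = series.shift(1)
```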

Then I calculate the score of this almost trivial model. In some cases the score is surprisingly good, but in most cases there is still a lot to be gained by allowing the model to use several explanatory factors and by using a more advanced algorithm. So why not start with the more advanced algorithm right away? There are several reasons. First, the simple model is a quick test of whether we have asked the question correctly. Second, if the goal of the project is to create an operational tool, development of that tool can start immediately around the baseline predictor; improvements to the simple model can then be rolled out quickly and independently of the development of the tool. Finally, the simple model serves as a baseline against which to validate more sophisticated models. Had we started immediately with an advanced algorithm and achieved a nice result, it might seem reasonable to attribute that result to the specifics of the advanced model. But sometimes an almost equally good result can be achieved with a much simpler model. In such cases, the simpler model is often preferable, since simpler models usually make it easier to see how predictions are made, in contrast to more complex models, which for this reason are sometimes called 'black-box models'.
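A comparison against the baseline might then look like the sketch below, here with a random forest on synthetic data; in a real project the score would of course be computed on the customer's test set and with the metric defined when the problem was formalized.

```python
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the customer's data.
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
advanced = RandomForestRegressor(random_state=0).fit(X_train, y_train)

# The advanced model only earns its extra complexity if it clearly beats the baseline.
print("baseline R^2:     ", baseline.score(X_test, y_test))
print("random forest R^2:", advanced.score(X_test, y_test))
```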

Apply software best practices

This may sound like a trivial statement, but as with any scientific project, a data-science project can fail badly when the implementation is sloppy. In addition to my data-science projects, I also regularly work on software-engineering projects. The best practices that we use there, such as testing, a consistent coding style, a logical design, and version control, I apply in my data-science projects as well. It is important to ensure that it is always clear how a given result was obtained, and all results must be easily reproducible. Manual manipulations of the data are a big no-no.
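As a small example of what this looks like in practice, a data-cleaning step can be covered by a unit test just like any other piece of software. The cleaning function below is hypothetical; the test runs with pytest.

```python
# test_cleaning.py -- run with `pytest`
import pandas as pd

def drop_error_codes(df: pd.DataFrame, column: str, error_code: float = 9999.0) -> pd.DataFrame:
    """Remove rows where the given column contains the reserved error code."""
    return df[df[column] != error_code].reset_index(drop=True)

def test_drop_error_codes_removes_only_flagged_rows():
    df = pd.DataFrame({"flow_rate": [3.2, 9999.0, 4.8]})
    cleaned = drop_error_codes(df, "flow_rate")
    assert list(cleaned["flow_rate"]) == [3.2, 4.8]
```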

Conclusion

In many cases, the technical side of data science does not have to be very complicated. There are plenty of excellent books, blogs, MOOCs, and Meetups available for free or for a relatively small investment, so the basics of the most commonly used techniques can be picked up quickly. Open-source libraries with user-friendly interfaces, such as pandas, scikit-learn, and TensorFlow, mean that in principle anyone with a technical background can get up to speed fairly quickly.

Yet data-science projects do not always go well. When a data-science project fails, it is usually not because of a lack of technical knowledge, but because of how the project was designed. Fortunately, consistently applying a few relatively simple guidelines can go a long way toward making the project a success.
