Forget the magic ball, cook your data instead!

Everybody believes that if you have data you can predict the future. I must agree in a sense. A long time ago, people used a magic ball to predict the future. Nowadays you don't have to anymore, because we have data! Forget that magic ball, it's better to cook your data instead.

In the previous blog we talked about data science and the story behind the data. Now let’s look at what lies in front of the data. After getting the data, what do we do? With sufficient data you can identify patterns and trends that lead you to conclusions about the future.

Machine Learning

In this part I would like to focus on the pre-analysis of the data and the Machine Learning (ML) part of data science. There are thousands of articles out there about Machine Learning, so here I’ll limit myself to some basic information about ML.

What is Machine Learning? 

What, exactly, is Machine Learning? That turns out to be the key question. The best analogy I have read recently describes it as a big computer-powered food-making machine. You feed the machine with lots of data (called a training set) and, after a bit of algorithmic whirring, out comes the food – in the shape of a correlation or a pattern that the algorithm has “learned” from the training set.

The machine is then fed a new dataset and, on the basis of what it has “learned”, proceeds to emit correlations, recommendations and perhaps even judgments. For example: this person is likely to leave the company they work for if they change their address (an example of Workforce Analytics); based on “this data”, that person should be granted a loan; or “revolving door X” in “building A” needs to be replaced before it breaks down.

You train a computer on lots of data, and it learns to recognize structure. 

Based on this procedure the machine can provide predictions about the data you feed it. How accurate and precise those predictions are depends strongly on the quality and availability of the data. No matter how good your algorithm is, if garbage goes in, garbage will come out! So you need to be cautious about what you feed your machine. It is the same as with the food machine: even the best cooking machine cannot make good food from poor ingredients, even if the result looks good. You may design an ML model with 95% accuracy, but this doesn’t mean that your results are meaningful. On the contrary, they may lead you to wrong decisions. That’s why the most important step, before you even start creating a model, is to be sure about the quality of your data.
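To see why a high accuracy figure can be misleading, here is a minimal sketch. The labels are made up for illustration, and the "model" is a deliberately useless one that always predicts that an employee stays.

```python
# Hypothetical labels: 1 = employee left, 0 = employee stayed.
# 95 of 100 employees stayed, so the classes are heavily imbalanced.
actual = [0] * 95 + [1] * 5

# A "model" that has learned nothing and always predicts "stays".
predicted = [0] * len(actual)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of the actual positives that the model found."""
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if t == positive]
    found = sum(1 for _, p in pairs if p == positive)
    return found / len(pairs)


print(accuracy(actual, predicted))  # 0.95
print(recall(actual, predicted))    # 0.0
```

The model scores 95% accuracy, yet it never identifies a single employee who actually left, which is the only thing we wanted it to do.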

Data munging 

Data munging, as we call it, is the most time-consuming part of a data science project. In the vast majority of cases, data isn’t structured or “clean” and ready for analysis. We need to perform a pre-data analysis, as I like to call it: prepare the ingredients before you put them in a pan.

So, before we start building our predictive model we need to prepare our dataset. In cooking, we need to marinate the meat or wash the vegetables before we start cooking them. The same applies here, and there are four key steps in this pre-data analysis.

Step 1 – Missing values

First we need to explore the data and identify and treat missing values. There are several ways to treat missing values, and the right one always depends on what you are trying to achieve. You can remove the rows with missing data, substitute the missing data with a specific value (e.g. the average), interpolate them (e.g. linearly), or forward/backward fill them from neighbouring values.
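The four options can be sketched in plain Python as follows. The readings and the function names are hypothetical and only serve as an illustration; a data library would normally do this for you.

```python
from statistics import mean

# Hypothetical sensor readings; None marks a missing value.
readings = [10.0, None, 14.0, None, None, 20.0]


def drop_missing(values):
    """Option 1: remove the entries that are missing."""
    return [v for v in values if v is not None]


def fill_with_mean(values):
    """Option 2: substitute missing entries with the average of the known ones."""
    avg = mean(drop_missing(values))
    return [avg if v is None else v for v in values]


def interpolate_linear(values):
    """Option 3: fill gaps on a straight line between the nearest known values.

    Gaps at the very start or end have no neighbour on one side and stay None.
    """
    result = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            result[i] = values[left] + step * (i - left)
    return result


def forward_fill(values):
    """Option 4: carry the last known value forward into the gap."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled


print(drop_missing(readings))        # [10.0, 14.0, 20.0]
print(interpolate_linear(readings))  # [10.0, 12.0, 14.0, 16.0, 18.0, 20.0]
print(forward_fill(readings))        # [10.0, 10.0, 14.0, 14.0, 14.0, 20.0]
```

Note how each method tells a different story about the same gaps, which is why the choice depends on what you are trying to achieve.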

Step 2 - Calculate

Continuing with the preparation of our dataset, we sometimes need to perform operations to calculate new values. For instance, we may calculate the sum of a column, find the maximum value in the dataset, or sort values by a specific measure. These and many other mathematical operations are very important for our analysis. It is easier to see the big picture if your dataset is aggregated.
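As a small illustration of these operations, the sketch below sums, finds the maximum of, and sorts a made-up list of sales records. The departments and amounts are invented for the example.

```python
from collections import defaultdict

# Hypothetical sales records: (department, amount).
sales = [
    ("toys", 120.0),
    ("books", 80.0),
    ("toys", 60.0),
    ("books", 40.0),
    ("garden", 150.0),
]

# Aggregate: total amount per department.
totals = defaultdict(float)
for department, amount in sales:
    totals[department] += amount

# Sum over the whole dataset.
grand_total = sum(amount for _, amount in sales)

# Maximum: the single largest sale.
largest_sale = max(sales, key=lambda record: record[1])

# Sort: departments ranked from highest to lowest total.
ranking = sorted(totals.items(), key=lambda item: item[1], reverse=True)

print(grand_total)   # 450.0
print(largest_sale)  # ('garden', 150.0)
print(ranking)       # [('toys', 180.0), ('garden', 150.0), ('books', 120.0)]
```

The single largest sale comes from the garden department, but the aggregated ranking shows that toys brings in the most overall.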

Step 3 - Outliers

Large datasets often include values that are errors or outliers, which will skew the relationships in the model. Finding those outliers involves comprehensive exploration and visualization of the data. You must be careful to check that the values you flag really are errors that should be treated, and not genuine signals that the model should take into account. Just like with missing values, there are different ways to treat outliers, and the appropriate method depends on what you’re trying to achieve.
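One common rule of thumb for flagging candidate outliers is the interquartile range (IQR) rule. The sketch below is only one of many possible approaches, and the data and the factor of 1.5 are assumptions chosen for illustration.

```python
from statistics import quantiles

# Hypothetical monthly usage counts for a revolving door; 950 looks suspicious.
values = [102, 98, 105, 110, 95, 101, 99, 950, 104, 97]


def iqr_bounds(data, k=1.5):
    """Return (low, high) limits; values outside them are candidate outliers."""
    q1, _, q3 = quantiles(data, n=4)  # first quartile, median, third quartile
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


low, high = iqr_bounds(values)
outliers = [v for v in values if v < low or v > high]

print(outliers)  # [950]
```

The rule only flags candidates. Whether 950 is a broken sensor or a genuinely busy month is still a question you have to answer by looking at the data.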

Step 4 - Normalize

The final step before we build our model is to identify the numeric variables and normalize them. This means bringing the values onto a similar scale. Only then can we compare them and derive safe results. Always scale after treating outliers, because extreme values would otherwise distort the scale. It is also important that scaling does not lose the relative relationships between the values of a numeric feature.
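Two widely used ways to normalize are min-max scaling and z-score standardization. Here is a minimal sketch of both, using made-up ages and salaries; it assumes each feature has at least two distinct values, otherwise the division would fail.

```python
from statistics import mean, pstdev

# Hypothetical features on very different scales.
ages = [25, 35, 45, 55]
salaries = [30000, 50000, 70000, 90000]


def min_max_scale(values):
    """Rescale values linearly so the smallest becomes 0 and the largest 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


scaled_ages = min_max_scale(ages)
scaled_salaries = min_max_scale(salaries)

print(scaled_ages[0], scaled_ages[-1])          # 0.0 1.0
print(scaled_salaries[0], scaled_salaries[-1])  # 0.0 1.0
```

Both transformations are linear, so the order of the values and the relative distances between them are preserved within each feature.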

Introduction to Machine Learning

After preparing our dataset with these time-consuming but important steps, we are finally ready to create our predictive model using Machine Learning. There are different types of ML algorithms you can use for predictive models. Here I will just mention two basic types: Classification and Regression.

Classification

We use Classification to answer Yes/No or True/False questions. For example, is an employee likely to leave the company in the next six months? Or, in marketing, will a prospect or lead respond to a promotion? Other applications include fraud detection, text categorization and many more.
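To make the idea concrete, here is a toy classifier. It uses 1-nearest-neighbour, which is just one simple classification technique picked for illustration, and the training data and the function name predict_leaves are invented.

```python
from math import dist

# Hypothetical training set:
# (years at company, overtime hours per month) -> did the employee leave?
training = [
    ((1.0, 30.0), True),
    ((2.0, 25.0), True),
    ((6.0, 5.0), False),
    ((8.0, 10.0), False),
]


def predict_leaves(features):
    """Return the label of the closest training example (1-nearest-neighbour)."""
    _, label = min(training, key=lambda example: dist(example[0], features))
    return label


print(predict_leaves((1.5, 28.0)))  # True
print(predict_leaves((7.0, 8.0)))   # False
```

In a real project the features would be normalized first, since a distance-based method like this one is dominated by whichever feature has the largest scale.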

Regression

We use Regression to predict continuous numeric values rather than categories. Some examples would be predicting the sales of a department or how long an employee will stay at a company.
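The simplest form is a straight line fitted with least squares. The sketch below assumes a single input variable and uses invented numbers for advertising spend and department sales.

```python
from statistics import mean

# Hypothetical data: advertising spend vs. department sales (both in k EUR).
spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [12.0, 14.0, 16.0, 18.0, 20.0]


def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one input variable."""
    x_bar, y_bar = mean(xs), mean(ys)
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    variance = sum((x - x_bar) ** 2 for x in xs)
    slope = covariance / variance
    intercept = y_bar - slope * x_bar
    return slope, intercept


slope, intercept = fit_line(spend, sales)
forecast = slope * 6.0 + intercept  # predicted sales for a spend of 6.0

print(slope, intercept)  # 2.0 10.0
print(forecast)          # 22.0
```

Unlike the classifier, the output here is a number on a continuous scale rather than a Yes/No answer.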

 

See you next time!

In the next blog I will try to highlight the most basic and important features of those ML algorithms. Until then, wash your ingredients well before you cook. Stay ahead and stay tuned for the next blog.

 

This blogpost has been translated to Dutch. You can find the Dutch version here.

Written by Agis Christopoulos, Data Scientist at Valid.
