Now, this is one of the simplest sections, and yet it might be the most important one in this whole book. We're going to talk about cleaning your input data, which is something you're going to spend a lot of your time doing.
How well you clean and understand your raw input data is going to have a huge impact on the quality of your results - maybe even more so than which model you choose or how well you tune it. So, pay attention; this is important stuff!
Let's talk about an inconvenient truth of data science: you spend most of your time just cleaning and preparing your data, and relatively little of it analyzing it and trying out new algorithms. It's not quite as glamorous as people might make it out to be all the time, but it's an extremely important thing to pay attention to.
There are a lot of different things that you might find in raw data. Data that comes in to you is going to be dirty; it's going to be polluted in many different ways. If you don't deal with it, it's going to skew your results, and your business will ultimately end up making the wrong decisions.
If it comes out that you ingested a bunch of bad data, didn't account for it, didn't clean it up, and told your business to do something based on results that later turn out to be completely wrong, you're going to be in a lot of trouble! So, pay attention!
There are a lot of different kinds of problems in data that you need to watch out for:
- Outliers: Maybe you have people that are behaving strangely in your data, and when you dig into them, they turn out to be data you shouldn't be looking at in the first place. A good example would be if you're looking at web log data, and you see one session ID that keeps coming back over, and over, and over again, doing something at a ridiculous rate that a human could never do. What you're probably seeing there is a robot, a script being run somewhere to scrape your website. It might even be some sort of malicious attack. At any rate, you don't want that behavior data informing models that are meant to predict the behavior of real human beings using your website. So, watching for outliers is one way to identify data that you might want to strip out of your model when you're building it (there's a short sketch of this after the list).
- Missing data: What do you do when data's just not there? Going back to the example of a web log, you might have a referrer on that line, or you might not. What do you do if it's not there? Do you create a new classification for missing or not specified, or do you throw that line out entirely? You have to think about what the right thing to do is there (see the sketch after this list).
- Malicious data: There might be people trying to game your system, people trying to cheat, and you don't want them getting away with it. Let's say you're making a recommender system. There could be people out there fabricating behavior data in order to promote their new item. So, you need to be on the lookout for that sort of thing, identify shilling attacks or other types of attacks on your input data, and filter them out of your results; don't let them win.
- Erroneous data: What if there's a software bug somewhere in some system that's just writing out the wrong values in some set of situations? It can happen, and unfortunately there's no good way for you to know about it up front. But if you see data that just looks fishy, or results that don't make sense to you, digging in deeply enough can sometimes uncover an underlying bug that's causing the wrong data to be written in the first place. Maybe things aren't being combined properly at some point. Maybe session IDs aren't being held for an entire visit; people might be dropping their session ID and getting new session IDs as they go through a website, for example.
- Irrelevant data: A very simple one here. Maybe, for some reason, you're only interested in data from people in New York City. In that case, all the data from people in the rest of the world is irrelevant to what you're trying to find out. The first thing you want to do is throw that data away and whittle your dataset down to the data you actually care about.
- Inconsistent data: This is a huge problem. For example, with addresses, people can write the same address in many different ways: they might abbreviate "street" or they might not, or they might leave "street" off the end of the street name entirely. They might combine lines in different ways, they might spell things differently, they might use a ZIP code or a ZIP+4 code in the US, and they might or might not include a country. You need to figure out what variations you see and how you can normalize them all together.
- Here's another example of inconsistency: maybe I'm looking at data about movies. A movie might have different names in different countries, or a book might have different titles in different countries, but they mean the same thing. So, you need to look out for cases where the same data can be represented in many different ways, and normalize and combine those representations in order to get correct results.
- Formatting: This can also be an issue; things can be inconsistently formatted. Take the example of dates: in the US we always do month, day, year (MM/DD/YY), but in other countries they might do day, month, year (DD/MM/YY). You need to be aware of these formatting differences. Maybe phone numbers have parentheses around the area code, maybe they don't; maybe they have dashes between each section of the number, maybe they don't; maybe social security numbers have dashes, maybe they don't. These are all things that you need to watch out for, and you need to make sure that variations in formatting don't get treated as different entities, or different classifications, during your processing (the normalization sketch after this list shows one way to handle addresses, dates, and phone numbers).
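To make the outlier idea a bit more concrete, here's a minimal sketch of how you might spot and drop sessions that are hitting your site at an inhuman rate. The column names, the made-up numbers, and the 100-times-the-median cutoff are all just assumptions for illustration; the right rule depends entirely on your own data.

```python
import pandas as pd

# Hypothetical per-session hit counts parsed out of a web log.
# A real log would have thousands of sessions; these values are made up.
sessions = pd.DataFrame({
    'session_id': ['a1', 'b2', 'c3', 'd4', 'e5'],
    'hits':       [12,    9,    15,   11,   48700],  # e5 is hitting pages at a rate no human could
})

# One simple rule: treat anything more than 100x the median session as a bot.
# Always eyeball what you're about to throw away before you throw it away.
cutoff = 100 * sessions['hits'].median()
bots = sessions[sessions['hits'] > cutoff]
humans = sessions[sessions['hits'] <= cutoff]

print("Dropping suspected bots:")
print(bots)
print("Keeping:")
print(humans)
```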
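And here's a small sketch of the two obvious options for missing data: keep the rows but give the missing field its own classification, or drop those rows entirely. The referrer example mirrors the web log case above, and the data here is made up.

```python
import numpy as np
import pandas as pd

# Hypothetical web log rows where the referrer is sometimes absent.
hits = pd.DataFrame({
    'page':     ['/home', '/about', '/home', '/contact'],
    'referrer': ['google.com', np.nan, np.nan, 'bing.com'],
})

# Option 1: treat "missing" as its own classification so those rows still count.
hits_with_category = hits.fillna({'referrer': 'none'})

# Option 2: throw out any row that's missing the field entirely.
hits_dropped = hits.dropna(subset=['referrer'])

print(hits_with_category)
print(hits_dropped)
```

Neither option is automatically right; it depends on whether a missing referrer actually means something for the question you're trying to answer.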
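Finally, here's a rough sketch of normalizing inconsistent and inconsistently formatted data. The specific rules here (lowercasing, one canonical street abbreviation, stripping non-digits from phone numbers, assuming US-style month/day/year dates) are just examples; real normalization rules come from studying the variations in your own data.

```python
import re
import pandas as pd

# Three spellings of the same address.
addresses = pd.Series([
    '123 Main Street',
    '123 main st.',
    '123 MAIN ST',
])

def normalize_address(addr):
    addr = addr.lower().strip()
    addr = re.sub(r'[.,]', '', addr)           # drop punctuation
    addr = re.sub(r'\bstreet\b', 'st', addr)   # pick one canonical abbreviation
    return addr

print(addresses.apply(normalize_address).unique())  # -> ['123 main st']

# Dates and phone numbers get the same treatment: parse them into one
# canonical representation instead of comparing raw strings.
dates = pd.to_datetime(pd.Series(['03/04/23', '12/25/23']), format='%m/%d/%y')  # assumes US MM/DD/YY
phones = pd.Series(['(212) 555-1234', '212-555-1234']).str.replace(r'\D', '', regex=True)
print(dates.tolist())
print(phones.tolist())  # both become '2125551234'
```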
So, there are lots of things to watch out for, and the previous list names just the main ones to be aware of. Remember: garbage in, garbage out. Your model is only as good as the data that you give it, and that is extremely, extremely true! You can have a very simple model that performs very well if you give it a large amount of clean data, and it could actually outperform a complex model trained on a dirtier dataset.
Therefore, making sure that you have enough data, and high-quality data, is often most of the battle. You'd be surprised how simple some of the most successful algorithms used in the real world are. They're only successful by virtue of the quality and the amount of the data going into them. You don't always need fancy techniques to get good results. Often, the quality and quantity of your data counts just as much as anything else.
Always question your results! You don't want to go back and look for anomalies in your input data only when you get a result that you don't like. That will introduce an unintentional bias into your results, where you're letting results that you like, or expect, go through unquestioned, right? You want to question things all the time, because even if you find a result you like, if it turns out to be wrong, it's still wrong, and it's still going to be informing your company in the wrong direction. That could come back to bite you later on.
As an example, I have a website called No-Hate News. It's non-profit, so I'm not trying to make any money by telling you about it. Let's say I just want to find the most popular pages on this website that I own. That sounds like a pretty simple problem, doesn't it? I should just be able to go through my web logs, and count up how many hits each page has, and sort them, right? How hard can it be?! Well, turns out it's really hard! So, let's dive into this example and see why it's difficult, and see some examples of real-world data cleanup that has to happen.
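As a starting point, here's the kind of naive first pass I'm describing, written as a minimal sketch. It assumes a common Apache-style access log; the file name and the regular expression are placeholders, not anything specific to No-Hate News. The interesting part is everything that goes wrong once you run something like this on real logs.

```python
import re
from collections import Counter

# Matches the request portion of a typical Apache-style access log line,
# e.g. '... "GET /index.html HTTP/1.1" 200 ...'
log_pattern = re.compile(r'"(GET|POST) (\S+) HTTP')

page_counts = Counter()
with open('access_log.txt', errors='ignore') as f:  # placeholder file name
    for line in f:
        match = log_pattern.search(line)
        if match:
            page_counts[match.group(2)] += 1

# The "most popular" pages -- at least until we start questioning the data.
for page, count in page_counts.most_common(10):
    print(count, page)
```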