Stages of KDD

KDD can be split into stages. Seven of them are named and briefly described below. Keep in mind that a well-conducted KDD process may require looping back through these stages from time to time:

  1. Understanding: First, you have to understand your problem. Gather prior knowledge, understand challenges, limitations, and how the problem is generally dealt with (or not). Additionally, seeking inspiration in different places is advised. It's also important to set goals from the customer's viewpoint.
  2. Data selection: In this stage, you gather data. Collect samples for training (discovery), testing, and validation. How to sample and how large the samples should be are key decisions at this step. A well-designed and well-conducted sampling process can be the difference between meaningful and meaningless results.
  3. Data cleaning and preprocessing: Here, you investigate outliers, typos, and noise, and deal with them. Preprocessing is often required. Natural Language Processing (NLP) problems require texts to be turned into vectors before you try to model them, while image and video-related problems might require you to turn frames into tensors.
  4. Data reduction and transformation: This stage consists of applying data reduction and transformation techniques. By using fewer (and/or better) features, we can speed things up while avoiding traps that lead to meaningless patterns, such as overfitting.
  5. Exploratory analysis and model selection: Exploratory analysis aims at a deeper understanding of the problem. You can also pre-select several models you might find useful and drive part of the exploratory analysis toward ruling out as many of them as you can; check whether their assumptions hold under the observed conditions. By the end of this stage, you are expected to have selected one or more models that are highly likely to yield useful insights. Thinking of ways to combine them is a plus. Failing at model selection is OK; revisit prior stages if that is the case.
  6. Data mining: It's time to put your decisions to the test. Fit your models as well as you can, and make things scalable. Seek patterns and consistently check whether they are valid.
  7. Interpretation, evaluation, and delivery: Results can look so intuitive that they seem to require neither interpretation nor evaluation, but they always do. Score the results and ask yourself what they really mean. Failing at this stage can put all of your previous work at risk. You might need to deliver something; it could be as simple as a figure or a report, or as large as an entire application.
Although seven stages were listed, there are no strict rules for this. You might label both stages 5 and 6 as data mining. Others may consider all of these stages to be data mining itself.
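
To make stage 2 (data selection) a little more concrete, here is a minimal sketch in Python of one common way to draw training, validation, and test samples: shuffle the rows and cut them by fixed proportions. The function name `split_dataset`, the 60/20/20 proportions, and the seed are assumptions made for illustration only; they are not prescribed by KDD, and real projects often need stratified or time-aware sampling instead.

```python
import random


def split_dataset(rows, train_frac=0.6, valid_frac=0.2, seed=42):
    """Shuffle rows and split them into (train, validation, test) lists.

    The default fractions and seed are illustrative, not recommendations.
    """
    if not 0 < train_frac < 1 or not 0 <= valid_frac < 1:
        raise ValueError("fractions must lie between 0 and 1")
    if train_frac + valid_frac >= 1:
        raise ValueError("train_frac + valid_frac must leave room for a test set")

    shuffled = list(rows)                  # copy, so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # seeded, so the sample is reproducible

    n_train = round(len(shuffled) * train_frac)
    n_valid = round(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]    # whatever remains is held out for testing
    return train, valid, test


train, valid, test = split_dataset(range(100))
print(len(train), len(valid), len(test))
```

Fixing the seed keeps the three samples identical between runs, which helps when a later stage sends you back to this one and you want to compare results fairly.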

Stage-wise division makes KDD a good candidate for teamwork. These seven stages can be seen as the basic structure of KDD, and each of them benefits from different abilities and skills. Sharp skills in statistics, R, and SQL are usually a great advantage. Python and Linux are also useful.

To be honest, the boundaries between KDD, data science itself, and data mining are not fully agreed upon among data people in general. Data science is a much younger field than some of its predecessors, such as statistics and programming. Nowadays, it's common to refer to the entire KDD process as data mining, and it is well accepted as a practical tool for data science.

There is no right or wrong. Yet, it's good to know what these terms could mean. Now that we have a broad view of KDD, it's time to check some quick tips. The ones coming next are divided into four bullet points, and even though they are not restricted to KDD or data mining, they are very likely to improve the results coming from them. Let's check them out:

  • Write down your biases and face them: writing them down will make them much clearer and consequently easier to avoid
  • Keep research notes: revisiting them later might be needed as you face the same problem over and over
  • Stay loyal to statistics and good practices: data dredging is the name given to the reckless, seemingly random, unreasoned examination of data, which rarely leads to meaningful results. Staying loyal to statistical principles and good practices, on the other hand, is the safer path
  • Validate extensively: peer review, cross-validation, and double checks are a few ways to become more confident in your results. Sometimes, bugs can be mistaken for discoveries
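
Cross-validation, mentioned in the last tip, can be sketched in a few lines of Python. The version below is a minimal k-fold scheme under stated assumptions: the helper names `k_fold_indices` and `cross_validate`, the toy data, and the mean-only baseline "model" are all hypothetical and chosen only to keep the example self-contained. In practice you would plug in your own model and metric.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Strided slices give folds whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        # Every fold except the i-th one is used for training.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


def cross_validate(values, k=5):
    """Score a mean-only baseline by mean absolute error on each held-out fold."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(values), k):
        mean = sum(values[j] for j in train_idx) / len(train_idx)  # "fit"
        mae = sum(abs(values[j] - mean) for j in test_idx) / len(test_idx)
        scores.append(mae)
    return scores


data = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0, 18.0, 20.0]
scores = cross_validate(data, k=5)
print(scores)  # one error value per fold
```

Each observation is held out exactly once, so a result that looks good on only one of the folds is a hint that it may be a fluke (or a bug) rather than a discovery.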

Now that we have acknowledged the theory surrounding data mining and KDD, it's time to get our hands dirty with practical stuff. The first task to do with R is to get ourselves a dwarf name. Dwarves are well known for their mining abilities, and so may you be once you are done with this chapter.
