What about regressions?

All of the models we've seen so far can also tackle regression problems, not only classification problems. To do so, the only thing we need to change is to start the formula with a continuous variable. Instead of the usual vote ~ ., we would use <some continuous variable's name> ~ <independent variable #1> + <...> + <independent variable #n>.
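As a minimal illustration (using base R's lm() and the built-in mtcars dataset, neither of which is part of this chapter's examples), a regression formula looks like this:

```r
# The formula starts with a continuous variable (mpg) instead of a class label
model <- lm(mpg ~ wt + hp, data = mtcars)

# Predictions are now continuous values, not classes
head(predict(model))
```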

A misspecified model is missing important (omitted) variables, includes unimportant (irrelevant) ones, or both.

The dot sign shortcut still works for regression problems, but it's usually best to spell out each variable by name. This way you pay more attention to which variables you are using. Depending on the model you train and the sample size, misspecification can badly injure out-of-sample performance; in other words, your model won't be fit to make real-world predictions.

All real-world predictions are out-of-sample.
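To see this concretely, here is a sketch comparing a reasonably specified model against one that leaves important variables out; the built-in mtcars dataset and the chosen formulas are assumptions made purely for illustration:

```r
set.seed(42)
test_idx <- sample(nrow(mtcars), 10)      # hold out 10 rows as a test sample
train <- mtcars[-test_idx, ]
test  <- mtcars[test_idx, ]

full  <- lm(mpg ~ wt + hp, data = train)  # includes important predictors
short <- lm(mpg ~ qsec, data = train)     # leaves important predictors out

# Out-of-sample mean squared error for each specification
mse <- function(m) mean((predict(m, newdata = test) - test$mpg)^2)
mse(full)
mse(short)  # usually considerably larger than mse(full)
```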

Feature engineering takes care of creating and selecting features to improve your models. Some would say that it's more of an art than a science, and I agree with them. Nonetheless, there are a few tips that can help you to engineer features:

  • Be creative: Try new things, combine pieces of information, and always look for external data. Looking beyond the dataset at hand can sometimes mean a real leap forward.
  • Avoid using highly correlated variables: Depending on your training algorithm and sample size, correlated variables can confuse your models.
  • Trust: At least to some degree, trust what you have learned from experience. Search for what other people have done when handling the same or similar problems. Go out, observe the world, talk to people about your problem; all of this can help you grasp a new perspective that may lead you to the solution.
  • Test: Think about possible solutions and try them. If your test sample is not too small, the out-of-sample performance may give you hints about whether you are on the right path. But be cautious: even these metrics can sometimes lead you down the wrong path; we are dealing with probabilities.
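The correlation tip above, for instance, can be checked before training. A quick sketch in base R, again assuming the built-in mtcars data as a stand-in:

```r
cor_mat <- cor(mtcars)  # pairwise correlations between all numeric variables

# List variable pairs whose absolute correlation exceeds 0.9
high <- which(abs(cor_mat) > 0.9 & upper.tri(cor_mat), arr.ind = TRUE)
data.frame(var1 = rownames(cor_mat)[high[, 1]],
           var2 = colnames(cor_mat)[high[, 2]],
           cor  = round(cor_mat[high], 2))
```

Dropping one variable from each highly correlated pair is a simple starting point; more careful selection may still be needed.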

These key steps can help you to engineer features. A new feature might be the watershed that separates you from reaching the high ranks in a Kaggle competition or answering your business question.

All along, we've checked the hit rate, which is also known as the probability of detection. Almost exclusively, we have checked the test-sample hit rate. That is because I wanted to keep things simple while focusing on training a wide variety of models. The hit rate, however, is neither the only nor the best measure available for evaluating classification models.

The performance measure you consider for a model should be at least an approximation of the costs and revenues tied to the decisions you will make when relying on that model. Should false positives be worth as much as false negatives?
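One way to reflect that is to weight the confusion matrix by decision costs. In the sketch below, the labels, predictions, and the 5:1 cost ratio are all hypothetical:

```r
actual    <- factor(c("yes", "no", "yes", "yes", "no", "no", "yes", "no"))
predicted <- factor(c("yes", "no", "no",  "yes", "yes", "no", "yes", "no"))

conf <- table(predicted, actual)  # rows: predicted, columns: actual

# Assume a false negative costs 5 times as much as a false positive
costs <- matrix(c(0, 1,    # actual "no":  true negative = 0, false positive = 1
                  5, 0),   # actual "yes": false negative = 5, true positive = 0
                nrow = 2, dimnames = dimnames(conf))

sum(conf * costs)  # total cost of the model's mistakes
```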

That said, the code we've been using to measure performance would not be directly suitable for a regression problem; some adaptations would be required. A more traditional error measure for regressions is the mean squared error (MSE). The following pseudocode calculates it:

mean((predict(<model>, newdata = <data>[<test index>,]) - <data>[<test index>,<predict variable>])^2)
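Filled in with concrete names (an lm model on the built-in mtcars data and a random test index, both assumed here purely for illustration), the pseudocode becomes:

```r
set.seed(1)
test_index <- sample(nrow(mtcars), 10)
model <- lm(mpg ~ wt + hp, data = mtcars[-test_index, ])

# Mean squared error on the held-out test sample
mean((predict(model, newdata = mtcars[test_index, ]) -
        mtcars[test_index, "mpg"])^2)
```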

MSE may be the most traditional error measure for a continuous variable, but it's not the only one. There are many more measures; you may even come up with one of your own (hopefully cost-related), and none has yet been proven to be the best one in every situation. Moving forward, we will discuss something we briefly saw in Chapter 4, KDD, Data Mining, and Text Mining, and that is clustering.
