Chapter 9. Real-world Applications for Regression Models

We have arrived at the concluding chapter of the book. Compared with the previous chapters, this one is practical in essence: it contains mostly code, with no math or other theoretical explanation. It comprises four practical examples of real-world data science problems solved using linear models. The goal is to demonstrate how to approach such problems and how to develop the reasoning behind their solution, so that the examples can serve as blueprints for similar challenges you'll encounter.

For each problem, we will describe the question to be answered, provide a short description of the dataset, and choose the metric we want to maximize (or the error we want to minimize). Then, throughout the code, we will provide the ideas and intuitions that are key to solving each problem. In addition, when run, the code produces verbose output from the modeling, so that the reader has all the information needed to decide the next step. Due to space restrictions, the output shown here is truncated to the key lines (omitted lines are marked by […] in the output), but on your screen you'll get the complete picture.

Tip

In this chapter, each section comes with its own IPython Notebook. The problems are different, and each of them is developed and presented independently.

Downloading the datasets

In this section, we download all the datasets used in the examples in this chapter. We chose to store them in separate subdirectories of the folder that contains the IPython Notebook. Note that some of them are quite big (100+ MB).

Tip

We would like to thank the maintainers and the creators of the UCI dataset archive. Thanks to such repositories, modeling and achieving experiment repeatability are much easier than before. The UCI archive is from Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

For each dataset, we first download it and then display its first two lines. This serves two purposes: it confirms that the file has been correctly downloaded, unpacked, and placed in the right location, and it shows the structure of the file itself (header, fields, and so on):

In:
# urllib was reorganized in Python 3; fall back to the Python 2 module name
try:
    import urllib.request as urllib2
except ImportError:
    import urllib2
import requests, io, os
import zipfile, gzip

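def download_from_UCI(UCI_url, dest):
    # Fetch a plain (uncompressed) file and save it unchanged under dest
    r = requests.get(UCI_url)
    r.raise_for_status()  # fail on an HTTP error instead of saving an error page
    filename = UCI_url.split('/')[-1]
    print('Extracting in %s' % dest)
    if not os.path.isdir(dest):
        os.makedirs(dest)
    with open(os.path.join(dest, filename), 'wb') as fh:
        # The file is written as-is; nothing is decompressed despite the label
        print('\tdecompression %s' % filename)
        fh.write(r.content)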


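def unzip_from_UCI(UCI_url, dest):
    # Fetch a zip archive into memory and extract every member under dest
    r = requests.get(UCI_url)
    r.raise_for_status()  # fail on an HTTP error instead of parsing an error page
    z = zipfile.ZipFile(io.BytesIO(r.content))
    print('Extracting in %s' % dest)
    for name in z.namelist():
        print('\tunzipping %s' % name)
        z.extract(name, path=dest)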

def gzip_from_UCI(UCI_url, dest):
    # Fetch a gzip-compressed file and write its decompressed content under dest
    response = urllib2.urlopen(UCI_url)
    compressed_file = io.BytesIO(response.read())
    decompressed_file = gzip.GzipFile(fileobj=compressed_file)
    # [:-4] drops the last four characters, so 'kddcup.data.gz'
    # is saved as 'kddcup.dat'
    filename = UCI_url.split('/')[-1][:-4]
    print('Extracting in %s' % dest)
    if not os.path.isdir(dest):
        os.makedirs(dest)
    with open(os.path.join(dest, filename), 'wb') as outfile:
        print('\tgunzipping %s' % filename)
        cnt = decompressed_file.read()
        outfile.write(cnt)
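
The helpers above hold the complete response in memory (r.content and response.read()) before writing it to disk, which can be costly for the 100+ MB files mentioned earlier. The following is a minimal sketch of a chunked alternative, assuming Python 3 and using only the standard library. The name stream_from_url and its chunk_size parameter are hypothetical and not part of the original notebook.

```python
import os
import shutil
import urllib.request


def stream_from_url(url, dest, chunk_size=1024 * 1024):
    """Save the resource at `url` into directory `dest`, chunk by chunk.

    Hypothetical helper (an assumption, not from the original notebook).
    """
    os.makedirs(dest, exist_ok=True)  # exist_ok requires Python 3
    filename = url.split('/')[-1]
    target = os.path.join(dest, filename)
    # copyfileobj reads and writes at most chunk_size bytes per step,
    # so the whole payload is never held in memory at once.
    with urllib.request.urlopen(url) as response, open(target, 'wb') as fh:
        shutil.copyfileobj(response, fh, chunk_size)
    return target

# Example call (performs a network request):
# stream_from_url('https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data', './autos')
```

The file is saved unchanged, so this sketch only stands in for download_from_UCI; compressed archives would still need the unzip or gunzip step.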

Time series problem dataset

Dataset from: Brown, M. S., Pelosi, M. & Dirska, H. (2013). Dynamic-radius Species-conserving Genetic Algorithm for the Financial Forecasting of Dow Jones Index Stocks. Machine Learning and Data Mining in Pattern Recognition, 7988, 27-41.

In:
UCI_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00312/dow_jones_index.zip'
unzip_from_UCI(UCI_url, dest='./dji')
Out:
Extracting in ./dji
  unzipping dow_jones_index.data
  unzipping dow_jones_index.names
In:
!head -n 2 ./dji/dow_jones_index.data
Out:
[output omitted: the first two lines of dow_jones_index.data]
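
The !head command relies on a Unix-like shell being available to the notebook. Where it is not, the same check can be sketched in pure Python. The helper below is hypothetical (it does not appear in the original notebook) and assumes Python 3 and text files that can be read as UTF-8, with undecodable bytes replaced.

```python
from itertools import islice


def head(path, n=2, encoding='utf-8'):
    """Return the first n lines of a text file, without trailing newlines.

    Hypothetical stand-in for the shell command `head -n`.
    """
    with open(path, 'r', encoding=encoding, errors='replace') as fh:
        # islice stops after n lines, so large files are not read in full
        return [line.rstrip('\n') for line in islice(fh, n)]

# Example call, once the dataset has been downloaded:
# for line in head('./dji/dow_jones_index.data'):
#     print(line)
```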

Regression problem dataset

Dataset from: Thierry Bertin-Mahieux, Daniel P.W. Ellis, Brian Whitman, and Paul Lamere. The Million Song Dataset. In Proceedings of the 12th International Society for Music Information Retrieval Conference (ISMIR), 2011.

In:
UCI_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00203/YearPredictionMSD.txt.zip'
unzip_from_UCI(UCI_url, dest='./msd')
Out:
Extracting in ./msd
  unzipping YearPredictionMSD.txt
In:
!head -n 2 ./msd/YearPredictionMSD.txt
Out:
[output omitted: the first two lines of YearPredictionMSD.txt]

Multiclass classification problem dataset

Dataset from: Salvatore J. Stolfo, Wei Fan, Wenke Lee, Andreas Prodromidis, and Philip K. Chan. Cost-based Modeling and Evaluation for Data Mining With Application to Fraud and Intrusion Detection: Results from the JAM Project.

In:
UCI_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/kddcup99-mld/kddcup.data.gz'
gzip_from_UCI(UCI_url, dest='./kdd')
Out:
Extracting in ./kdd
  gunzipping kddcup.dat
In:
!head -n 2 ./kdd/kddcup.dat
Out:
[output omitted: the first two lines of kddcup.dat]

Ranking problem dataset

Creator/Donor: Jeffrey C. Schlimmer

In:
UCI_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data'
download_from_UCI(UCI_url, dest='./autos')
Out:
Extracting in ./autos
  decompression imports-85.data
In:
!head -n 2 ./autos/imports-85.data
Out:
[output omitted: the first two lines of imports-85.data]
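
Before moving on, it can be useful to confirm in one pass that all four files are in place. The following is a minimal sketch of such a check; the function name report_files is hypothetical, and the paths are the ones produced by the download cells in this section.

```python
import os

# Paths as produced by the download cells in this section
EXPECTED_FILES = [
    './dji/dow_jones_index.data',
    './msd/YearPredictionMSD.txt',
    './kdd/kddcup.dat',
    './autos/imports-85.data',
]


def report_files(paths):
    """Map each path to its size in bytes, or to None if the file is missing.

    Hypothetical helper, not part of the original notebook.
    """
    report = {}
    for path in paths:
        report[path] = os.path.getsize(path) if os.path.isfile(path) else None
    return report

# Example call:
# for path, size in report_files(EXPECTED_FILES).items():
#     print(path, 'MISSING' if size is None else '%d bytes' % size)
```

A missing entry usually points to an interrupted download or a different working directory than the one containing the notebook.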