15.1 (Using PCA to Help Visualize the Digits Dataset) In this chapter, we visualized the Digits dataset’s clusters. To do so, we first used scikit-learn’s TSNE estimator to reduce the dataset’s 64 features down to two, then plotted the results using Seaborn. Reimplement that example to perform dimensionality reduction using scikit-learn’s PCA estimator, then graph the results. How do the clusters compare to the diagram you created in the clustering case study?
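One possible starting point, sketched here with Matplotlib rather than Seaborn to keep it self-contained:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

digits = load_digits()

# reduce the 64 features to two principal components
pca = PCA(n_components=2, random_state=11)
reduced_data = pca.fit_transform(digits.data)

# color each dot by its target digit
plt.scatter(reduced_data[:, 0], reduced_data[:, 1],
            c=digits.target, cmap='nipy_spectral_r')
# call plt.show() to display the diagram
```

Unlike TSNE, PCA’s fit_transform is deterministic for this configuration, so rerunning it yields the same diagram.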
15.2 (Using TSNE to Help Visualize the Iris Dataset) In this chapter, we visualized the Iris dataset’s clusters. To do so, we first used scikit-learn’s PCA estimator to reduce the dataset’s four features down to two, then plotted the results using Seaborn. Reimplement that example to perform dimensionality reduction using scikit-learn’s TSNE estimator, then graph the results. How do the clusters compare to the diagram you created in the clustering case study?
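The dimensionality-reduction step might look like this sketch (again plotted with Matplotlib for self-containment):

```python
from sklearn.datasets import load_iris
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

iris = load_iris()

# reduce the four features to two dimensions
tsne = TSNE(n_components=2, random_state=11)
reduced_data = tsne.fit_transform(iris.data)

# color each dot by its species
plt.scatter(reduced_data[:, 0], reduced_data[:, 1],
            c=iris.target, cmap='viridis')
# call plt.show() to display the diagram
```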
15.3 (Seaborn pairplot Graph) Create a Seaborn pairplot graph (like the one we showed for Iris) for the California Housing dataset. Try Matplotlib’s features for panning and zooming the diagram. These are accessible via the icons in the Matplotlib window.
15.4 (Human Recognition of Handwritten Digits) In this chapter, we analyzed the Digits dataset and used scikit-learn’s KNeighborsClassifier to recognize the digits with high accuracy. Can humans recognize digit images as well as the KNeighborsClassifier did? Create a script that randomly selects and displays individual images and asks the user to enter a digit from 0 through 9 specifying the digit the image represents. Keep track of the user’s accuracy. How does the user compare to the k-nearest neighbors machine-learning algorithm?
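One way to structure such a script, as a sketch (the function name and prompt text are ours, not from the chapter):

```python
import random
from sklearn.datasets import load_digits
import matplotlib.pyplot as plt

digits = load_digits()

def run_quiz(rounds=10):
    """Show random digit images and track the user's accuracy."""
    correct = 0
    for _ in range(rounds):
        index = random.randrange(len(digits.images))
        plt.imshow(digits.images[index], cmap=plt.cm.gray_r)
        plt.show()  # display the image before asking
        guess = int(input('Which digit (0-9)? '))
        if guess == digits.target[index]:
            correct += 1
    return correct / rounds  # the user's accuracy as a fraction

# in an interactive session:  accuracy = run_quiz()
```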
15.5 (Using TSNE to Visualize the Digits Dataset in 3D) In Section 15.6, you visualized the Digits dataset’s clusters in two dimensions. In this exercise, you’ll create a 3D scatter plot using TSNE and Matplotlib’s Axes3D, which provides x-, y- and z-axes for plotting in three dimensions. To do so, load the Digits dataset, create a TSNE estimator that reduces the data to three dimensions and call the estimator’s fit_transform method to reduce the dataset’s dimensions. Store the result in reduced_data. Next, execute the following code:
from mpl_toolkits.mplot3d import Axes3D
figure = plt.figure(figsize=(9, 9))
axes = figure.add_subplot(111, projection='3d')
dots = axes.scatter(xs=reduced_data[:, 0], ys=reduced_data[:, 1],
    zs=reduced_data[:, 2], c=digits.target,
    cmap=plt.cm.get_cmap('nipy_spectral_r', 10))
The preceding code imports Axes3D, creates a Figure and calls its add_subplot method to get an Axes3D object for creating a three-dimensional graph. In the call to the Axes3D scatter method, the keyword arguments xs, ys and zs specify one-dimensional arrays of values to plot along the x-, y- and z-axes. Once the graph is displayed, be sure to drag the mouse on the image to rotate it left, right, up and down so you can see the clusters from various angles. The following images show the initial 3D graph and two rotated views:
15.6 (Simple Linear Regression with Average Yearly NYC Temperatures Time Series) Go to NOAA’s Climate at a Glance page (https:/) and download the available time series data for the New York City average annual temperatures from 1895 through present (1895–2017 at the time of this writing). For your convenience, we provided the data in the file ave_yearly_temp_nyc_1895-2017.csv. Reimplement the simple linear regression case study of Section 15.4 using the average yearly temperature data. How does the temperature trend compare to the average January high temperatures?
15.7 (Classification with the Iris Dataset) We used unsupervised learning with the Iris dataset to cluster its samples. This dataset is in fact labeled, so it can be used with scikit-learn’s supervised machine learning estimators. Use the techniques you learned in the Digits dataset classification case study to load the Iris dataset and perform classification on it with the k-nearest neighbors algorithm. Use a KNeighborsClassifier with the default k value. What is the prediction accuracy?
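A minimal sketch of the workflow, using the default k of 5:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=11)

knn = KNeighborsClassifier()  # n_neighbors defaults to 5
knn.fit(X_train, y_train)
accuracy = knn.score(X_test, y_test)  # fraction of test samples predicted correctly
```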
15.8 (Classification with the Iris Dataset: Hyperparameter Tuning) Using scikit-learn’s KFold class and cross_val_score function, determine the optimal k value for classifying Iris samples using a KNeighborsClassifier.
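The tuning loop might look like this sketch, which tries odd k values and keeps the mean cross-validation accuracy for each:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
kfold = KFold(n_splits=10, random_state=11, shuffle=True)

mean_accuracies = {}
for k in range(1, 20, 2):
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(estimator=knn, X=iris.data,
                             y=iris.target, cv=kfold)
    mean_accuracies[k] = scores.mean()

best_k = max(mean_accuracies, key=mean_accuracies.get)  # highest mean accuracy
```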
15.9 (Classification with the Iris Dataset: Choosing the Best Estimator) As we did in the digits case study, run multiple classification estimators for the Iris dataset and compare the results to see which one performs best.
15.10 (Clustering the Digits Dataset with DBSCAN and MeanShift) Recall that when using the DBSCAN and MeanShift clustering estimators you do not specify the number of clusters in advance. Use each of these estimators with the Digits dataset to determine whether each estimator recognizes 10 clusters of digits.
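A sketch of the DBSCAN half of the exercise (MeanShift is used analogously). Note that DBSCAN’s default eps of 0.5 is almost certainly too small for 64-dimensional pixel data, so expect to experiment with eps and min_samples; the value below is an arbitrary starting point, not a tuned choice:

```python
from sklearn.datasets import load_digits
from sklearn.cluster import DBSCAN

digits = load_digits()

dbscan = DBSCAN(eps=20, min_samples=5)  # eps is a starting guess to tune
labels = dbscan.fit_predict(digits.data)

# DBSCAN labels noise samples -1, so exclude that label when counting
n_clusters = len(set(labels) - {-1})
print(f'DBSCAN found {n_clusters} cluster(s)')
```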
15.11 (Using %timeit to Time Training and Prediction) In the k-nearest neighbors algorithm, the computation time for classifying samples increases with the value of k. Use %timeit to calculate the run time of the KNeighborsClassifier cross-validation for the Digits dataset. Use values of 1, 10 and 20 for k. Compare the results.
15.12 (Using cross_validate) In this chapter, we used the cross_val_score function and the KFold class to perform k-fold cross-validation of the KNeighborsClassifier and the Digits dataset. In the k-nearest neighbors algorithm, the computation time for classifying samples increases with the value of k. Investigate the sklearn.model_selection module’s cross_validate function, then use it in the loop of Section 15.3.4 both to perform the cross-validation and to calculate the computation times. Display the computation times as part of the loop’s output.
15.13 (Linear Regression with Sea Level Trends) NOAA’s Sea Level Trends website
https://tidesandcurrents.noaa.gov/sltrends/
provides time series data for sea levels worldwide. Use their Trend Tables link to access tables listing sea-level time series for cities in the U.S. and worldwide. The date ranges available vary by city. Choose several cities for which 100% of the data is available (as shown in the % Complete column). Clicking the link in the Station ID column displays a table of time series data, which you can then export to your system as a CSV file. Use the techniques you learned in this chapter to load and plot each dataset on the same diagram using Seaborn’s regplot function. In IPython interactive mode, each call to regplot uses the same diagram by default and adds data in a new color. Do the sea-level rises match in each location?
15.14 (Linear Regression with Sea Temperature Trends) Ocean temperatures are changing fish migratory patterns. Download NOAA’s global average surface temperature anomalies time series data for 1880–2018 from
https://www.ncdc.noaa.gov/cag/global/time-series/globe/ocean/ytd/12/1880-2018
then load and plot the dataset using Seaborn’s regplot function. What trend do you see?
15.15 (Linear Regression with the Diabetes Dataset) Investigate the Diabetes dataset bundled with scikit-learn
https://scikit-learn.org/stable/datasets/index.html#diabetes-dataset
The dataset contains 442 samples, each with 10 features and a label indicating the “disease progression one year after baseline.” Using this dataset, reimplement the steps of this chapter’s multiple linear regression case study in Section 15.5.
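An outline of the case-study steps applied to the Diabetes data, as a sketch:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

diabetes = load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(
    diabetes.data, diabetes.target, random_state=11)

linear_regression = LinearRegression()
linear_regression.fit(X_train, y_train)
predicted = linear_regression.predict(X_test)
r2 = r2_score(y_test, predicted)  # coefficient of determination
```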
15.16 (Simple Linear Regression with the California Housing Dataset) In the text, we performed multiple linear regression using the California Housing dataset. When you have meaningful features available and you have the choice between running simple and multiple linear regression, you’ll generally choose multiple linear regression to get more sophisticated predictions. As you saw, scikit-learn’s LinearRegression estimator uses all the numerical features by default to perform linear regressions.
In this exercise, you’ll perform simple linear regressions with each feature and compare the prediction results to the multiple linear regression in the chapter. To do so, first split the dataset into training and testing sets, then select one feature, as we did with the DataFrame in this chapter’s simple linear regression case study. Train the model using that one feature and make predictions as you did in the multiple linear regression case study. Do this for each of the eight features. Compare each simple linear regression’s R² score with that of the multiple linear regression. Which produced the best results?
15.17 (Binary Classification with the Breast Cancer Dataset) Check out the Breast Cancer Wisconsin Diagnostic dataset that’s bundled with scikit-learn
https://scikit-learn.org/stable/datasets/index.html#breast-cancer-dataset
The dataset contains 569 samples, each with 30 features and a label indicating whether a tumor was malignant (0) or benign (1). There are only two labels, so this dataset is commonly used to perform binary classification. Using this dataset, reimplement the steps of this chapter’s classification case study in Sections 15.2–15.3. Use the GaussianNB (short for Gaussian Naive Bayes) estimator. When you execute multiple classifiers (as in Section 15.3.3) to determine which one is best for the Breast Cancer Wisconsin Diagnostic dataset, include a LogisticRegression classifier in the estimators dictionary. Logistic regression is another popular algorithm for binary classification.
15.18 (Project: Determine k in k-Means Clustering) In the k-NN classification example, we demonstrated hyperparameter tuning to choose the best value of k. In k-means clustering, a challenge is determining the appropriate k value for clustering the data. One technique for determining k is called the elbow method. Investigate the elbow method, then use it with the Digits and Iris datasets to determine whether this technique yields the correct number of classes for each dataset.
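The elbow method plots each candidate k against the fitted model’s inertia (the within-cluster sum of squared distances) and looks for the k at which the curve bends; a sketch with the Iris data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris = load_iris()

inertias = []
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, random_state=11, n_init=10)
    kmeans.fit(iris.data)
    inertias.append(kmeans.inertia_)  # within-cluster sum of squared distances

# plot range(1, 11) against inertias; the "elbow" suggests the k to use
```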
15.19 (Project: Automated Hyperparameter Tuning) It’s relatively easy to tune one hyperparameter using the looping technique we presented in Section 15.3.4 for determining the k value in the k-nearest neighbors algorithm. What if you need to tune more than one hyperparameter? Scikit-learn’s sklearn.model_selection module provides tools for automated hyperparameter tuning to help you with this task. Class GridSearchCV uses a brute-force approach to hyperparameter tuning by trying every possible combination of the hyperparameters and the value ranges you specify for each. Class RandomizedSearchCV improves tuning performance by using random samples of the hyperparameter values you specify. Investigate these classes, then reimplement the hyperparameter tuning in Section 15.3.4 using each class. Time the results of each approach.
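For a single hyperparameter the grid is one-dimensional; a sketch of GridSearchCV on the Iris data (RandomizedSearchCV is used the same way, plus an n_iter argument limiting the number of sampled combinations):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()

param_grid = {'n_neighbors': list(range(1, 20, 2))}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=10)
grid.fit(iris.data, iris.target)  # cross-validates every k in the grid

best_k = grid.best_params_['n_neighbors']
best_score = grid.best_score_  # mean cross-validated accuracy of the best k
```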
15.20 (Quandl Financial Time Series) Quandl offers an enormous number of financial time series and a Python library for loading them as pandas DataFrames, making them easy to use in your machine learning studies. Many of the time series are free. Explore Quandl’s financial data search engine at
https://www.quandl.com/search
to see the range of time series data they offer. Investigate and install their Python module
conda install -c conda-forge quandl
then use it to download their 'YALE/SPCOMP' time series for the S&P Composite index (or another time series of your choice). Next, using the time series data you downloaded, perform the steps in the linear regression case study of Section 15.5. Use only rows for which all the features have values.
15.21 (Project: Multi-Classification of Digits with the MNIST Dataset) In this chapter, we analyzed the Digits dataset that’s bundled with scikit-learn. This is a subset and simplified version of the original MNIST dataset, which provides 70,000 digit-image samples and targets. Each sample represents a 28-by-28 image (784 features). Reimplement this chapter’s digits classification case study using MNIST. You can download MNIST in scikit-learn using the following statements:
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', version=1, return_X_y=True)
Function fetch_openml downloads datasets from OpenML.org, which hosts thousands of machine-learning datasets and provides various ways to search them.
15.22 (Project: Multi-Classification of Digits with the EMNIST Dataset) The EMNIST dataset contains over 800,000 digit and character images. You can work with all 800,000 characters or subsets. One subset has 280,000 digits with approximately 28,000 of each digit (0–9). When the samples are divided evenly among the classes, the dataset is said to have balanced classes. You can download the dataset from
https://www.nist.gov/itl/iad/image-group/emnist-dataset
in a format used with software called MATLAB, then use SciPy’s loadmat function (module scipy.io) to load the data. The downloaded dataset contains several files—one for the entire dataset and several for various subsets. Load the digits subset, then transform the loaded data into a format usable with scikit-learn. Next, reimplement this chapter’s digits classification case study using the 280,000 EMNIST digits.
15.23 (Project: Multi-Classification of Letters with the EMNIST Dataset) In the previous exercise, you downloaded the EMNIST dataset and worked with the digits subset. Another subset contains 145,600 letters with approximately 5600 of each letter (A–Z). Reimplement the preceding exercise using letter images rather than digits.
15.24 (Try It: Clustering) Acxiom is a marketing technology company. Their Personicx marketing software identifies clusters of people for marketing purposes. Try their “What’s My Cluster?” tool
https://isapps.acxiom.com/personicx/personicx.aspx
to see the marketing cluster to which they feel you belong.
15.25 (Project: AutoML.org and Auto-Sklearn) There are various ongoing efforts to simplify machine learning and make it available “to the masses.” One such effort comes from AutoML.org, which provides tools for automating machine-learning tasks. Their auto-sklearn library at
https://automl.github.io/auto-sklearn
inspects the dataset you wish to use, “automatically searches for the right learning algorithm” and “optimizes its hyperparameters.” Investigate auto-sklearn’s capabilities, then:
Reimplement the Digits classification case study (Sections 15.2–15.3) using the AutoSklearnClassifier in place of the KNeighborsClassifier estimator.
Reimplement the California Housing dataset regression case study (Section 15.5) using the AutoSklearnRegressor in place of the LinearRegression estimator.
In each case, how do auto-sklearn’s results compare to those in the original case studies? Does auto-sklearn choose the same models?
15.26 (Research: Support Vector Machines) Many books and articles indicate that support vector machines often yield the best supervised machine learning results. Research support vector machines vs. other machine learning algorithms. What are the primary reasons offered for why support vector machines perform best?
15.27 (Research: Machine Learning Ethics and Bias) Machine learning and artificial intelligence raise many ethics and bias issues. Should an AI algorithm be allowed to fire a company employee without human input? Should an AI-based military weapon be allowed to make kill decisions without human input? AI algorithms often learn from data collected by humans. What if the data contains human biases regarding race, religion, gender and more? Some AI programs have already been shown to learn such human biases. Research machine learning ethics and bias issues and make a top-10 list of the most common issues you encounter.
15.28 (Project: Feature Selection) Feature selection involves choosing which dataset features to use when training a machine learning model. Research feature selection and scikit-learn’s feature selection capabilities
https://scikit-learn.org/stable/modules/feature_selection.html
Apply scikit-learn’s feature selection capabilities to the Digits dataset, then reimplement the classification case study in Sections 15.2–15.3. Next, apply scikit-learn’s feature selection capabilities to the California Housing dataset, then reimplement the linear regression case study in Section 15.5. In each case, do you get better results?
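One of scikit-learn’s simplest selectors is SelectKBest, which keeps the k features that score highest under a statistical test; a sketch on the Digits data (k=20 is an arbitrary choice for illustration):

```python
from sklearn.datasets import load_digits
from sklearn.feature_selection import SelectKBest, f_classif

digits = load_digits()

# keep the 20 pixels with the highest ANOVA F-scores
selector = SelectKBest(score_func=f_classif, k=20)
X_selected = selector.fit_transform(digits.data, digits.target)
```

You’d then train and evaluate the classifiers on X_selected rather than digits.data.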
15.29 (Research: Feature Engineering) Feature engineering involves creating new features based on existing features in a dataset. For example, you might transform a feature into a different format (such as transforming textual data to numeric data or transforming a date-time stamp into just a time of day), or you might combine multiple features into a single feature (such as combining latitude and longitude features into a location feature). Research feature engineering and explain how it might be used to improve supervised machine learning prediction performance.
15.30 (Project: Desktop Machine Learning Workbench—KNIME Analytics Platform) There are many free and paid machine learning software packages (both web-based and desktop) for performing machine learning studies with little or no coding. Such tools are known as workbenches. KNIME is an open source desktop machine learning and analytics workbench available at
https://www.knime.com/knime-software/knime-analytics-platform
Investigate KNIME, then install it and use it to implement this chapter’s machine learning studies.
15.31 (Project: Exploring Web-Based Machine Learning Tools—Microsoft Azure Learning Studio, IBM Watson Studio and Google Cloud AI Platform) Microsoft’s Azure Learning Studio, IBM’s Watson Studio and Google’s Cloud AI Platform are all web-based machine learning tools. Microsoft and IBM provide free tiers and Google provides an extended free trial. Research each of these web-based tools, then use one or more of interest to you to implement this chapter’s machine learning studies.
15.32 (Research Project: Binary Classification with the Titanic Dataset and the Scikit-Learn DecisionTreeClassifier) Decision trees are a popular means of visualizing decision structures in business applications. Research “decision trees” online. Use the techniques you learned in the “Files and Exceptions” chapter to load the Titanic Disaster dataset from the RDatasets repository. One popular type of analysis on this dataset uses decision trees to predict whether a particular passenger survived or died in the tragedy. The DecisionTreeClassifier builds a decision tree internally, which you can output in the DOT graph description language with the export_graphviz function (module sklearn.tree). You can then use the open source Graphviz visualization software to create a decision-tree graphic from the DOT file.
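A sketch of the tree-building and DOT-export steps; the bundled Iris data stands in for the Titanic dataset here so the sketch runs without a download:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_graphviz

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=11)

tree = DecisionTreeClassifier(max_depth=3, random_state=11)
tree.fit(X_train, y_train)

# with out_file=None, export_graphviz returns the DOT source as a string
dot_source = export_graphviz(tree, out_file=None,
    feature_names=iris.feature_names, class_names=iris.target_names)
# save dot_source to tree.dot, then render:  dot -Tpng tree.dot -o tree.png
```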