- 64-bit operating system, 42–43
A
- ABC language, 22
- absolute errors, linear regression, 352
- Adaboost application, 424–425
- adjacency matrices, 165–166
- agglomerative clustering
- hierarchical cluster solution, 307–308
- linkage methods, 306
- metrics, 306
- overview, 305–306
- two-phase clustering solution, 308–310
- aggregation, shaping data through, 146–147
- AI applications, 13
- AI For Dummies (Mueller and Massaron), 13
- algorithms
- choosing, 33
- classifiers, 158
- hyperparameters
- GBM model, 427–428
- grid search, 364–368
- overview, 363–364
- randomized search, 368–369
- k-means
- big data, 304–305
- centroids, 298–299
- ground truth, 301–304
- image data, 299–301
- overview, 297
- K-Nearest Neighbors
- k-parameter, 344–345
- overview, 342–343
- predicting, 343–344
- linear regression
- limitations of, 333–334
- with multiple variables, 331–333
- overview, 329–331
- Naïve Bayes, 177
- classes, 339–340
- overview, 337–338
- text classifications, 340–342
- Random Forest
- optimizing, 422–424
- overview, 418–420
- Random Forest classifier, 420–421
- Random Forest regressor, 421–422
- align parameter, histograms, 205
- Al-Kindi, 12
- Amazon.com review dataset, 444
- Anaconda. See also IPython; Jupyter Notebook
- Anaconda Command Prompt, 27–29
- installing
- on Linux, 46–47
- on Mac OS X, 47–48
- on Windows, 42–45
- IPython, 29
- Jupyter QTConsole environment, 29–30
- Spyder IDE, 30–31
- Anaconda Command Prompt, 27–29
- Anaconda Prompt window, 86–87
- annotations, graph, 197–199
- ANOVA (Analysis of Variance), 263
- Apache Spark, 11
- APIs (application programming interfaces), 37, 116
- append() method, 143
- arrays
- 2-D, 140
- 3-D, 140
- n-dimensional, 36
- performing operations on
- matrix multiplication, 181
- matrix vector multiplication, 180
- simple arithmetic on vectors and matrices, 179–180
- vectorization, 179
- AskSam database, 113
- Aspirational Data Scientist blog, 434
- autopct parameter, pie charts, 203
- axes, graph
- accessing, 189–190
- formatting, 190–191
- handles, 190
- labels, 198
- plotting time series, 212–214
B
- backends, plotting
- defined, 96
- MatPlotLib, 189
- backward approach, selecting variables, 362
- bag of words model
- implementing TF-IDF transformations, 162–165
- understanding bag of words model, 159–160
- working with n-grams, 161–162
- bagging techniques, machine learning, 419–420, 425
- bar charts, 186, 203–205
- Basemap toolkit
- creating Basemap environment, 217–218
- deprecated packages, 218–220
- installing, 218
- plotting geographical data, 220–221
- Beautiful Soup library, 38
- Beginning Programming with Python For Dummies, 2nd Edition (Mueller), 10, 25, 34
- Bell Labs, 252
- benchmarking, 241–243
- bias
- in data, 173
- machine learning algorithms, 350
- big data. See also reducing data dimensionality
- AI and, 13
- clustering, 304–305
- data science and, 13
- defined, 275, 383
- SGD optimization, 383–386
- binning
- defined, 123
- in feature creation, 177
- transforming numerical variable into categorical ones, 259–260
- bins, histogram, 205
- black box, defined, 420
- Boolean values, 151–152
- Boston dataset
- dividing between training and test sets, 354–356
- performing regression with SVR class, 400–401
- testing and detecting interactions between variables, 375–379
- variable transformations, 372–373
- Bower, Jonathan, 436
- box plots, 186, 206–208
- Exploratory Data Analysis, 262–263
- outliers and, 318–319
- quartiles, 206
- whiskers, 206
- branches, decision trees, 414
- Breiman, Leo, 419
- bunches, data, 34, 160
C
- C parameter, SVM LinearSVC class, 402
- Canadian Institute for Advanced Research (CIFAR) datasets, 443
- Canopy Express, 41–42
- cases, database, 100, 170
- categorical variables
- Exploratory Data Analysis
- contingency tables, 261
- frequencies, 259–260
- overview, 259
- levels
- combining, 132–133
- renaming, 131–132
- C/C++ language, 10
- cells, Jupyter Notebook, 51–52
- central tendency, EDA, 254–255
- centroid-based algorithms, 298–299
- Chebyshev’s inequality, 320
- checkpoints, Jupyter Notebook, 95
- chi-square test for tables, EDA, 271
- Chrome browser, 61
- CIFAR (Canadian Institute for Advanced Research) datasets, 443
- classes. See also names of specific classes
- logistic regression algorithm, 336–337
- Naïve Bayes algorithm, 339–340
- classifiers, defined, 158
- classification trees, 415–417
- Cleveland, William S., 12
- cluster analysis, 177, 324–325
- clustering
- agglomerative
- hierarchical cluster solution, 307–308
- linkage methods, 306
- metrics, 306
- overview, 305–306
- two-phase clustering solution, 308–310
- cross-tabulation, 301–302
- DBScan, 310–312
- ground truth, 301–302
- inertia, 302–304
- with k-means algorithm
- big data, 304–305
- centroid-based algorithms, 298–299
- ground truth, 301–302
- image data, 299–301
- overview, 297
- CNTK (Microsoft Cognitive Toolkit), 37
- code execution, checking, 78–79
- code repository
- adding notebook content, 53–55
- creating new notebooks, 51–53
- defining new folders, 50–51
- exporting notebooks, 55
- importing notebooks, 56–57
- removing notebooks, 55–56
- coding style, 33
- Colaboratory. See Google Colab
- collaborative filtering, 291–293
- colors, on graphs, 188, 193, 194–195
- colors parameter
- bar charts, 204
- histograms, 205
- pie charts, 202
- columns, dataset, 140–141. See also features, dataset
- COM (Component Object Model) applications, 116
- comma-separated value (CSV) files, 106–109
- Common Crawl 2012 web corpus, 444–445
- competencies of data scientists, 12–13
- Component Object Model (COM) applications, 116
- concat() method, 143
- concatenating data
- adding new cases and variables, 142–144
- removing data, 144
- sorting and shuffling, 145–146
- concept drift phenomenon, 317
- Conductrics site, 435
- contingency tables, EDA, 261
- Continuum Analytics Anaconda, 40–41
- correlations
- Exploratory Data Analysis
- chi-square test for tables, 271
- covariance and, 268–270
- nonparametric, 270–271
- showing on scatterplots, 211–212
- counterclock parameter, pie charts, 202
- CountVectorizer function, 239–241
- covariances, EDA, 268–270
- cross-tabulation, clustering, 301–302
- cross-validation
- on k-folds, 357–358
- multicore parallelism, 248
- overview, 356–357
- sampling and, 358–360
- CSV (comma-separated value) files, 106–109
- cube root transformations, 178
- cycle_graph() template, 166
D
- data analysis, 13, 32. See also EDA
- data capture, 13
- data correlations. See correlations
- data groupings
- depicting on box plots, 206–208
- depicting on scatterplots, 209–210
- data maps, 126–128
- data munging. See wrangling data
- data plans, 126–128
- data science
- AI and, 13
- big data and, 13
- core competencies of data scientists, 12–13
- history of, 12
- overview, –10
- pipeline
- exploratory data analysis, 15
- learning from data, 15
- preparing data, 15
- understanding meaning of data, 16
- visualization, 15
- programming languages and, 14
- Python and
- choosing, reasons for, 17–18
- contributions to, 23–24
- fusing data science and application development, 16–17
- loading data, 19
- training models, 19
- viewing results, 19–20
- working with problems in
- evaluating, 171
- formulating hypotheses, 174
- preparing data, 174
- researching solutions, 173–174
- Data Science Central, 433–434
- data wrangling. See wrangling data
- Database Administrators (DBAs), 11
- Database Management Systems (DBMSs), 11, 114, 115
- dataframes
- DataFrame.to_sql() function, 114
- datasets, 34, 170, 437–445
- Amazon.com review dataset, 444
- CIFAR datasets, 443
- Common Crawl 2012 web corpus, 444–445
- downloading, 48–58
- flat-file
- CSV delimited format, 107–109
- Excel, 109–110
- Microsoft Office files, 109–110
- text files, 106–107
- handwritten data, 442
- high-dimensional sparse datasets, 160
- image datasets, 443
- Kaggle competition, 438
- Kaggle site, 439
- large datasets, 444–445
- Madelon Data Set, 440
- MNIST dataset, 442
- MovieLens site, 440–441
- NIST dataset, 442
- overview, 437
- pattern recognition, 442
- size, 32
- Spambase Data Set, 441
- Titanic tragedy datasets, 438–439
- understanding datasets used in book, 57–58
- used in book, 57–58
- dates in data
- formatting date and time values, 134
- overview, 133–134
- time transformation, 135
- DBAs (Database Administrators), 11
- DBMSs (Database Management Systems), 11, 114, 115
- debugging codes, 33
- decision trees
- branches, 414
- classification trees, 415–417
- general discussion, 412–415
- leaf nodes, 414–415
- Random Forest algorithm
- optimizing, 422–424
- overview, 418–420
- Random Forest classifier, 420–421
- Random Forest regressor, 421–422
- regression trees, 417–418
- deep learning, 37
- hardware for, 406
- neural networks, 371
- classifying with, 408–410
- deep learning, 406
- multilayer perceptron, 408–410
- overview, 406–407
- regressing with, 408–410
- deprecated packages, 218–220
- describe() function, 127
- dicing data, 141
- directed graphs, 224–225
- discretization, 177
- distributions
- defined, 178
- Exploratory Data Analysis, 265–266
- transforming, 273–274
- using different statistical distributions, 272
- Z-score standardization, 273
- showing with histograms, 205–206
- Domingos, Pedro, 176
- double mapping, 152
- drop() method, 144
- drop_duplicates() function, 126
- dual parameter, SVM LinearSVC class, 402
- dummy variables, 177
- duplicates
- effect on data results, 124–125
- removing, 126
E
- EDA (Exploratory Data Analysis), 15
- categorical data
- contingency tables, 261
- frequencies, 259–260
- overview, 259
- correlations
- chi-square test for tables, 271
- covariance and, 268–270
- nonparametric, 270–271
- distributions
- transforming, 273–274
- using different statistical distributions, 272
- Z-score standardization, 273
- nonlinear transformations, 372
- numeric data
- central tendency, 254–255
- kurtosis, 257–258
- means, 254–255
- medians, 254–255
- normality, 257–258
- overview, 253–254
- percentiles, 256–257
- range, 256
- skewness, 257–258
- variance, 255–256
- overview, 251–253
- visualization for
- box plots, 262–263
- distributions, 265–266
- overview, 261
- parallel coordinates, 264–265
- scatterplots, 266–267
- t-tests, 263–264
- ElasticNet class, linear models, 382–383
- Encapsulated Postscript (EPS), 188
- encoding
- defined, 38
- missing data, 137–138
- one-hot-encoding, 236–238
- ensembles
- boosting, 424–428
- decision trees
- classification trees, 415–417
- general discussion, 412–415
- Random Forest algorithm, 418–424
- regression trees, 417–418
- imputing missing data, 138
- overview, 411
- Enthought Canopy Express, 41–42
- enumerations, 129
- EPS (Encapsulated Postscript), 188
- error bar charts, 186
- error messages
- Firefox, 61
- indentation and, 26
- estimator interface, Scikit-learn library, 231–233
- ETL (Extract, Transformation, and Loading) specialists, 13
- Excel files, 106, 107, 109–110
- explode parameter, pie charts, 202
- Exploratory Data Analysis. See EDA
- exponential transformations, 178
- Extract, Transformation, and Loading (ETL) specialists, 13
- Extremely Randomized Trees machine learning technique, 325–326
F
- factor analysis
- hidden factors, 281–282
- psychometrics, 280–281
- features, dataset. See also variables
- defined, 100, 170
- feature creation
- binning, 177
- combining variables, 176–177
- defining, 175–176
- discretization, 177
- transforming distributions, 178
- using indicator variables, 177–178
- “A Few Useful Things to Know about Machine Learning” paper (Domingos), 176
- filtering data
- collaborative filtering, 291–293
- dicing, 141
- overview, 139
- slicing, 140–141
- finalized code, 11
- Firefox browser, 18
- configuring to use local runtime support, 63
- error dialog box, 61
- flat-file datasets
- CSV delimited format, 107–109
- Excel, 109–110
- Microsoft Office files, 109–110
- text files, 106–107
- folders, Jupyter Notebook, 50–51
- Forecastwatch.com, 23–24
- Fortran language, 10
- frame-of-reference mistruths, 173
- frequencies, EDA, 259–260
- functional coding, 17
- Functional Programming For Dummies (Mueller), 11
- functions. See also names of specific functions
- hash functions, 235
- magic functions
- accessing lists, 90
- working with, 91
G
- Gaussian distribution, 257, 319–320
- GBM (Gradient Boosting Machine), 425–428
- geographical data
- deprecated packages, 218–220
- plotting with Basemap toolkit, 218, 220–221
- plotting with Notebook, 216–218
- GitHub, 436
- opening existing Google Colab notebooks in, 67–68
- saving Google Colab notebooks to, 69–70
- storing Google Colab notebooks on, 64
- GitHubGist, 70
- GMT (Greenwich Mean Time), 134
- Google Accounts
- creating, 64
- overview, 63
- signing in, 64–65
- Google Colab, 59–80
- code cells
- Add a Comment option, 72
- Add a Form option, 72
- Clear Output option, 72
- Delete Cell option, 72
- Link to Cell option, 72
- View Output Fullscreen option, 72
- editing cells, 74–75
- example projects, 66
- executing code, 76
- getting help, 80
- Google Account
- creating, 64
- overview, 63
- signing in, 64–65
- hardware acceleration, 75–76
- Help menu, 80
- Jupyter Notebook compared to, 61–63
- local runtime support, 63
- moving cells, 75
- notebooks
- checking code execution, 78–79
- creating new, 65–66
- downloading, 71
- opening existing, 66–68
- saving, 68–70
- sharing, 79
- table of contents, 77
- viewing notebook information, 77–78
- overview, 60–61
- special cells
- headings, 74
- overview, 73
- table of contents, 74
- text cells, 72–73
- Welcome page, 65
- Google Docs, 63
- Google Drive
- opening existing Google Colab notebooks in, 66–67
- revision history, 68
- saving Google Colab notebooks on, 68–69
- Gradient Boosting Machine (GBM), 425–428
- graphical user interface (GUI), 29–30
- graphics
- CIFAR datasets, 443
- integrating into Jupyter Notebook
- embedding images, 96–98
- embedding plots, 96
- loading examples from online sites, 96
- graphs
- adding grids to, 191–192
- adjacency matrices, 165
- annotations, 197–199
- axes
- accessing, 189–190
- formatting, 190–191
- handles, 190
- plotting time series, 212–214
- bar charts, 203–205
- box plots, 206–208
- building with NetworkX basics, 166–167
- defining line appearance on
- adding markers, 195–197
- line styles, 193–194
- using colors, 194–195
- directed, 224–225
- histograms, 205–206
- labels, 197–198
- legends, 197, 199–200
- MatPlotLib library
- defining plots, 186–187
- drawing multiple lines and plots, 187–188
- saving work to disk, 188–189
- pie charts, 202–203
- plotting geographical data, 216–221
- plotting time series, 212–216
- scatterplots, 208–212
- undirected, 222–223
- greedy approach, to selecting variables, 362
- Greenwich Mean Time (GMT), 134
- grid() function, 192
- grid searching
- hyperparameters and, 364–368
- multicore parallelism, 248
- grids, adding to graphs, 191–192
- ground truth, clustering, 301–302
- groupby() function, 127
- groups, data
- depicting on box plots, 206–208
- depicting on scatterplots, 209–210
- Grover, Prince, 413
- GUI (graphical user interface), 29–30
H
- hairballs
- defined, 165
- using NetworkX to avoid, 166
- handles, axes, 190
- handwritten data, datasets, 442
- hardware acceleration, Google Colab, 75–76
- hashing trick, Scikit-learn library
- demonstrating, 235–238
- hash functions, 235
- overview, 234–235
- sparse matrices, 239–240
- HashingVectorizer function, 239–241
- hatch parameter, bar charts, 204
- Help menu, Google Colab, 80
- Help mode, Python
- entering, 88
- exiting, 89
- requesting help in, 88–89
- hierarchical clustering
- hierarchical cluster solution, 307–308
- linkage methods, 306
- metrics, 306
- overview, 305–306
- two-phase clustering solution, 308–310
- high-dimensional sparse datasets, 160
- histograms, 186
- bins, 205
- showing distributions with, 205–206
- History of the Peloponnesian War (Thucydides), 12
- HTML data
- parsing into Beautiful Soup library, 38
- parsing XML and, 150–151
- using XPath for data extraction, 151–152
- hyperparameters of algorithms
- GBM model, 427–428
- grid search, 364–368
- overview, 363–364
- randomized search, 368–369
- hyperplanes, SVM, 390
- hypotheses, formulating, 174
I
- IDA (Initial Data Analysis), 252
- IDE (Integrated Development Environment), 30
- IDF (Inverse Document Frequency), 163
- image data
- clustering, 299–301
- cropping, 112
- embedding images into Notebook, 96
- flattening, 113
- generating variations on, 103–104
- resizing, 112
- as unstructured files, 111
- image datasets, 443
- imperative coding, 17
- importance estimation, 420
- indentation, in Python, 26
- index, dataset, 136, 141, 143
- indicator variables, 177–178
- inertia, clustering, 302–304
- Informatica tool, 13
- information redundancy, 268
- information resources
- Aspirational Data Scientist blog, 434
- Conductrics, 435
- Data Science Central, 433–434
- GitHub, 436
- KDnuggets, 432
- Open-Source Data Science Masters, 436
- Oracle Data Science blog, 433
- Quora, 432–433
- Subreddit, 432
- Udacity, 435
- Information Retrieval (IR), 159
- Initial Data Analysis (IDA), 252
- installing
- Anaconda
- on Linux, 46–47
- on Mac OS X, 47–48
- on Windows, 42–45
- Basemap toolkit, 218
- preferred installer program, 244
- Integrated Development Environment (IDE), 30
- Interactive Python. See IPython
- International Council for Science, 12
- Internet Explorer browser, 63
- Inverse Document Frequency (IDF), 163
- inverse transformations, 178
- IPython
- help, 89–90, 92
- Jupyter Console, 84–92
- Jupyter Notebook and, 59
- overview, 29
- playing with data, 33
- IR (Information Retrieval), 159
- Iris dataset
- classification trees, 415–419
- correlation and covariance, 268–271
- counting for categorical data, 259–261
- factor analysis, 281–283
- hyperparameters optimization, 363–364
- loading, 253–254
- logistic regression algorithm, 335
- visualizing data, 262–267
J
- Java, 11
- Java Virtual Machine (JVM), 11
- JavaScript Object Notation (JSON), 119
- join() method, 143
- Journal of Data Science, 12
- jQuery, 116
- JSON (JavaScript Object Notation), 119
- Jupyter Console
- changing window appearance, 86–87
- discovering objects
- getting object help, 91
- obtaining object specifics, 92
- using IPython object help, 92
- interacting with screen text, 84–86
- IPython help, 89–90
- magic functions, 90–91
- Python help, 87–89
- Jupyter Notebook, 18
- code repository
- adding notebook content, 53–55
- creating new notebooks, 51–53
- defining new folders, 50–51
- exporting notebooks, 55
- importing notebooks, 56–57
- removing notebooks, 55–56
- Google Colab and, 59, 61–63
- graphs in, 188
- integrating multimedia into
- embedding images, 96–98
- embedding plots, 96
- loading examples from online sites, 96
- plotting geographical data, 216–218
- restarting kernels, 94–95
- restoring checkpoints, 95
- starting, 49–50
- stopping server, 50
- working with styles, 93–94
- Jupyter QTConsole environment, 29–30
- JVM (Java Virtual Machine), 11
K
- Kaggle data science competition, 176, 412, 438
- Kaggle site, 439
- KDnuggets site, 432
- Keras library, 37
- kernels
- nonlinear, 398–399
- restarting
- using graphs, 189
- using notebook backend, 189
- keys, defined, 34
- k-folds, cross-validation on, 357–358
- k-means algorithm
- big data, 304–305
- centroid-based algorithms, 298–299
- ground truth, 301–304
- image data, 299–301
- overview, 297
- KNN (K-Nearest Neighbors) algorithm
- hyperparameters, 365–368
- k-parameter, 344–345
- overview, 342–343
- predicting, 343–344
- kurtosis, EDA, 257–258
L
- L1 type (Lasso) regularization
- combining with L2 regularization, 382–383
- overview, 381
- L2 type (Ridge) regularization
- combining with L1 regularization, 382–383
- overview, 380–381
- labels, using on graphs, 197–198
- labels parameter, pie charts, 202
- lambda calculus, 17
- languages, programming. See programming languages
- Lasso regularization. See L1 type regularization
- leaf nodes, decision trees, 414
- learning from data. See machine learning
- legends, graph, 197, 199–200
- levels, categorical variables
- combining, 132–133
- defined, 129
- renaming, 131–132
- libraries, 35–38. See also names of specific libraries
- line graphs, 186, 214
- adding markers, 195–197
- using colors, 192, 194–195
- working with line styles, 193–194
- linear models
- nonlinear transformations
- main effects model, 375–379
- variable transformations, 372–375
- regularization of
- ElasticNet class, 382–383
- Lasso (L1 type), 381
- leveraging, 382
- overview, 379–380
- Ridge (L2 type), 380–381
- linear regression algorithm
- limitations of, 333–334
- with multiple variables, 331–333
- overview, 329–331
- R squared measure for, 352
- LinearSVC class, SVM, 402–406
- linkage methods, agglomerative clustering, 306
- Linux
- Enthought Canopy Express and, 41
- installing Anaconda on, 46–47
- local runtime support on, 63
- lists, output, 109
- loading data on Python
- overview, 19
- speed of data analysis and, 33
- local runtime support
- on Google Colab, 63
- on Linux, 63
- on Mac OS X, 63
- on Windows, 63
- local storage, 68
- logarithm transformations, variable, 178, 274
- logistic regression algorithm
- loss parameter, SVM LinearSVC class, 402
M
- Mac OS X
- Enthought Canopy Express and, 41
- installing Anaconda on, 47–48
- using local runtime support on, 63
- machine learning. See also algorithms
- big data, 383–386
- cross-validation
- on k-folds, 357–358
- overview, 356–357
- sampling, 358–360
- ensembles
- boosting, 424–428
- decision trees, 412–424
- overview, 411
- hyperparameters, 363–369
- linear models
- nonlinear transformations, 372–379
- regularization of, 379–383
- model fitting
- bias, 350
- overview, 348–349
- strategy for, 350–353
- test sets, 354–356
- training, 354–356
- variance, 350
- neural networks
- classifying with, 408–410
- deep learning, 406
- multilayer perceptron, 408–410
- overview, 406–407
- regressing with, 408–410
- no-free-lunch theorem, 351
- optimization
- grid search, 364–368
- overview, 363–364
- randomized search, 368–369
- selection
- greedy, 362
- RFECV method, 363
- by univariate measures, 360–362
- Stochastic Gradient Descent optimization, 383–386
- support vector machines
- adjusting parameters, 390–392
- classifying with, 392–397
- creating stochastic solution with, 401–406
- general discussion, 387–390
- hyperplanes, 390
- margins, 389
- nonlinear kernels, 398–399
- overfitting, 392
- performing regression with SVR, 399–401
- scaling, 396
- Scikit-learn library, 391
- underfitting, 392
- Machine Learning For Dummies (Mueller and Massaron), 13
- Madelon Data Set, 440
- magic functions
- accessing lists, 90
- working with, 91
- main effects model, 375–379
- manifold learning (nonlinear dimensionality reduction), 283–285
- Manuscript on Deciphering Cryptographic Messages, 12
- margins, SVM, 389
- Markdown cells, 53
- markers on graphs, 195–197
- MATLAB, 14, 185–186
- MATLAB For Dummies (Mueller), 14, 185
- MatPlotLib library, 32, 38, 185–200. See also graphs
- adding grids, 191–192
- analyzing image data with, 102
- annotating charts, 198–199
- creating legends, 199–200
- defining line appearance
- adding markers, 195–197
- using colors, 194–195
- working with line styles, 193–194
- graphs
- defining plots, 186–187
- drawing multiple lines and plots, 187–188
- saving work to disk, 188–189
- labels, 197–198
- setting axis
- access code for, 189–190
- formatting, 190–191
- matrices
- adjacency, 165
- multiplication, 181
- scipy.sparse matrix, 160
- sparse, 239–240
- vector multiplication, 180
- mean function, 139
- means, EDA, 254–255
- median function, 139
- medians, EDA, 254–255
- memory
- memory profiler, 244–245, 247
- streaming data into, 102–103
- uploading data into, 101–102
- metrics, agglomerative clustering, 306
- microservices, 117
- Microsoft Cognitive Toolkit (CNTK), 37
- Microsoft Office files, 109–110
- missing data
- encoding, 137–138
- finding, 136–137
- imputing, 138–139
- mistruths in data
- bias, 173
- commission, 172
- frame of reference, 173
- omission, 172
- perspective, 172–173
- MLP (multilayer perceptron), 408–410
- MNIST (Modified National Institute of Standards and Technology) dataset, 442
- model fitting
- bias, 350
- no-free-lunch theorem, 351
- overview, 348–349
- ROC AUC, 353
- strategy for, 350–353
- test sets, 354–356
- training, 354–356
- variance, 350
- model interface, Scikit-learn library, 231, 234
- MongoDB database, 115
- most_frequent function, 139
- movie recommendations, 291–293
- MovieLens site, 440–441
- MS SSIS tool, 13
- multicore parallelism (multiprocessing), 247–250
- multilabel prediction, 248
- multilayer perceptron (MLP), 408–410
- multimedia, integrating in Jupyter Notebook
- embedding images, 96
- embedding plots, 96
- loading examples from online sites, 96
- obtaining online, 96–98
- multiprocessing (multicore parallelism), 247–250
- multivariate approach to outliers
- cluster analysis, 324–325
- Extremely Randomized Trees machine learning technique, 325–326
- principal component analysis, 322–324
- Random Forests machine learning technique, 325–326
- multivariate correlation, 175
- munging
- clustering
- agglomerative, 305–310
- DBScan, 310–312
- with k-means, 297–305
- overview, 295–297
- defined, 229
- Exploratory Data Analysis
- categorical data, 259–261
- correlations, 268–271
- distributions, 272–274
- numeric data, 253–258
- overview, 251–253
- visualization for, 261–267
- outliers
- anomalies, 316
- concept drift phenomenon, 317
- effect on machine learning algorithms, 315–316
- multivariate approach to, 322–326
- novel data, 316–317
- overview, 313–314
- univariate approach to, 317–321
- overview, 229–230
- reducing data dimensionality
- collaborative filtering, 291–293
- factor analysis, 280–282
- nonlinear dimensionality reduction, 283–285
- Non-Negative Matrix Factorization, 289–291
- overview, 275–276
- principal components analysis, 282–283, 285–289
- singular value decomposition, 276–280
- Scikit-learn library
- application speed and performance and, 240–247
- classes, 230–231
- defining applications for data science, 231–234
- hashing trick, 234–240
- multicore parallelism, 247–250
- object-based interfaces, 231
- MySQL database, 115
- MySQLdb library, 24
N
- Naïve Bayes algorithm, 177, 375
- classes, 339–340
- overview, 337–338
- text classifications, 340–342
- National Institute of Standards and Technology (NIST) dataset, 442
- Natural Language Processing (NLP), 159
- Natural Language Toolkit (NLTK), 154
- n-dimensional arrays, 36
- NetworkX library, 38, 166–167
- neural networks, 371
- classifying with, 408–410
- deep learning, 406
- multilayer perceptron, 408–410
- overview, 406–407
- regressing with, 408–410
- n-grams, 161–162
- NIST (National Institute of Standards and Technology) dataset, 442
- NLP (Natural Language Processing), 159
- NLTK (Natural Language Toolkit), 154
- NMF (Non-Negative Matrix Factorization), 289–291
- no-free-lunch theorem, 351
- nonlinear dimensionality reduction (manifold learning), 283–285
- nonlinear kernels, SVM, 398–399
- nonlinear transformations
- main effects model, 375–379
- variable transformations, 372–375
- Non-Negative Matrix Factorization (NMF), 289–291
- nonparametric correlations, EDA, 270–271
- normality, EDA, 257–258
- NoSQL (Not only SQL) databases, 115–116
- Notebook. See Jupyter Notebook
- notebooks, Google Colab, 65–71
- creating new, 65–66
- downloading, 71
- opening existing
- in GitHub, 67–68
- in Google Drive, 66–67
- local storage, 68
- saving
- on GitHub, 69–70
- on GitHubGist, 70
- on Google Drive, 68–69
- sharing, 79
- viewing
- checking code execution, 78–79
- displaying table of contents, 77
- getting notebook information, 77–78
- novel data, 316–317
- numeric data, EDA
- central tendency, 254–255
- kurtosis, 257–258
- means, 254–255
- medians, 254–255
- normality, 257–258
- overview, 253–254
- percentiles, 256–257
- range, 256
- skewness, 257–258
- variance, 255–256
- NumPy library
- arrays tool, 246
- computing covariance and correlation matrices, 269–270
- functions, 36
- matrix multiplication, 181
- matrix vector multiplication, 180
- n-dimensional arrays, 36, 179
- researching solutions, 174
- shaping data with, 122
O
- object-based interfaces, Scikit-learn library, 231
- object-oriented coding, 18
- objects, 91–92
- getting object help, 91
- IPython object help, 92
- obtaining object specifics, 92
- ODBC (Open Database Connectivity), 115
- one-hot-encoding, 236–238
- online resources
- Aspirational Data Scientist, 434
- Cheat Sheet,
- Conductrics, 435
- Cross Validated, 174
- Data Science Central, 433–434
- directives list, 134
- GitHub, 436
- Google Scholar, 174
- Imputer parameters, 138
- Internet World Stats,
- John Mueller blog,
- KDnuggets, 432
- list of distributions, 178
- MatPlotLib graph types, 186
- Microsoft Academic Search, 174
- Open-Source Data Science Masters, 436
- Oracle Data Science Blog, 433
- parsers for CSV files, 108–109
- Python Enhancement Proposals, 25
- Python tutorials,
- pythonclock.org, 22
- Quora, 174, 432–433
- read table method arguments, 107
- regular expressions, 155
- source codes,
- Stack Overflow, 174
- standard graph types list, 166
- Subreddit, 432
- telephone number manipulation routines, 157
- Udacity, 435
- Unicode encodings, 153
- Unicode problems in Python, 153
- working with databases, 115
- Open Database Connectivity (ODBC), 115
- Open-Source Data Science Masters (OSDSM), 436
- optimization
- grid search, 364–368
- overview, 363–364
- randomized search, 368–369
- Oracle Data Science blog, 433
- OSDSM (Open-Source Data Science Masters), 436
- outliers, 15, 177
- anomalies, 316
- concept drift phenomenon, 317
- effect on machine learning algorithms, 315–316
- multivariate approach to
- cluster analysis, 324–325
- Extremely Randomized Trees machine learning technique, 325–326
- principal component analysis, 322–324
- Random Forests machine learning technique, 325–326
- novel data, 316–317
- overview, 313–314
- univariate approach to
- box plots, 318–319
- Chebyshev’s inequality, 320
- Gaussian distribution, 319–320
- overview, 317–318
- winsorizing, 321
- overfitting
- online resources, 440
- SVM, 392
P
- pandas library
- categorical data, 259
- checking current version of, 129
- CSV files and, 107
- data analysis with, 36
- measuring central tendency, 254–255
- NaN output, 131
- parsers, 106
- reading flat-file data, 106
- removing duplicates, 126
- researching solutions, 174
- shaping data with, 122–123
- working with Excel worksheets in, 110
- parallel coordinates, EDA, 264–265
- parameters, SVM, 390–392
- Parr, Terence, 413
- parsers, 106
- parsing HTML data, 150–151
- PATH environment variable, 45
- pattern matching, 156, 442
- PCA (principal components analysis)
- facial recognition, 285–289
- outliers and, 322–324
- overview, 282–283
- PDF (Portable Document Format), 188
- Pearson correlation, 269–270
- penalty parameter, SVM LinearSVC class, 402
- PEPs (Python Enhancement Proposals), 25
- percentiles, EDA, 256–257
- pie charts, 186, 202–203
- PIP (preferred installer program), 244
- pipeline, data science
- Exploratory Data Analysis, 15
- learning from data, 15
- overview, 14
- preparing data, 15
- in prototyping, 32
- understanding meaning of data, 16
- visualizing data, 15
- plotting data
- geographical data
- with Basemap toolkit, 218, 220–221
- deprecated packages, 218–220
- with Notebook environment, 216–218
- plots
- defined, 186–187
- multiple plot lines, 187–188
- time series
- on axes, 212–214
- trends, 214–216
- polynomial transformations, 178
- Portable Document Format (PDF), 188
- Portable Network Graphic (PNG) format, 188
- PostgreSQL database, 115
- Postscript (PS), 188
- PowerPoint, 14
- predictor interface, Scikit-learn library, 231, 233
- preferred installer program (PIP), 244
- principal components analysis. See PCA
- procedural coding, 18
- programming languages. See also Python
- choosing, 10–11
- data science and, 14
- Java, 11
- Natural Language Toolkit, 154
- R, 10–11
- Scala, 11
- SQL, 11, 114
- XPath, 151–152
- prototyping, 31–32
- PS (Postscript), 188
- PSF (Python Software Foundation), 22
- PyMongo library, 115
- Python
- Anaconda
- Anaconda Command Prompt, 27–29
- installing, 42–47
- IPython, 29
- Jupyter QTConsole environment, 29–30
- Spyder IDE, 30–31
- contributions to data science, 23–24
- developments of, 22
- factors affecting speed of execution, 32–33
- goals of, 24–25
- help mode
- entering, 88
- exiting, 89
- requesting help in, 88–89
- indentation, 26
- interactive help, 89
- issues with flat-file headers, 106
- language statements, 25–26
- libraries
- Beautiful Soup, 38
- Keras and TensorFlow, 37
- matplotlib, 38
- NetworkX, 38
- NumPy, 36
- pandas, 36
- Scikit-learn, 36–37
- SciPy, 35
- licensing issues, 22
- overview, 10
- performing rapid prototyping and experimentation, 31–32
- philosophy, 23
- Python 2.x, 22
- Python 3.x, 22, 153
- role in data science and, 16–20
- streaming data using, 102
- visualizing data, 33–35
- working with, 25–31
- Python Enhancement Proposals (PEPs), 25
- Python interpreter, 27
- Python Software Foundation (PSF), 22
- pythonclock.org, 22
- python-history.blogspot, 23
- Python(x,y), 42
Q
- QTConsole, 30
- quartiles, box plots, 206
- Quixote display framework, 24
- Quora, 432–433
R
- R (programming language), 10–11
- R squared measure, for linear regression, 352
- radial basis function (rbf) kernel, 398
- Random Forest algorithm
- optimizing, 422–424
- overview, 418–420
- Random Forest classifier, 420–421
- Random Forest regressor, 421–422
- Random Forest classifier, 420–421
- Random Forest regressor, 421–422
- Random Forests machine learning technique, 325–326
- random sampling, 105
- randomized search, 368–369
- random.shuffle() method, 145
- ranges, EDA, 256
- rbf (radial basis function) kernel, 398
- read_sql() method, 114
- read_sql_query() method, 114
- read_sql_table() method, 114
- read_table() method arguments, 107
- Receiver Operating Characteristic Area Under Curve (ROC AUC), 353
- reducing data dimensionality
- collaborative filtering, 291–293
- factor analysis
- hidden factors, 281–282
- psychometrics, 280–281
- nonlinear dimensionality reduction, 283–285
- non-negative matrix factorization, 289–291
- overview, 275–276
- principal components analysis, 282–283, 285–289
- singular value decomposition, 276–280
- t-SNE algorithm, 283–284
- regression
- linear regression algorithm
- limitations of, 333–334
- with multiple variables, 331–333
- overview, 329–331
- R squared measure for, 352
- logistic regression algorithm
- performing with SVM, 399–401
- regression trees, 417–418
- regular expressions, 155–158
- regularization of linear models
- ElasticNet class, 382–383
- Lasso (L1 type), 381
- leveraging, 382
- overview, 379–380
- Ridge (L2 type), 380–381
- relational databases, managing data from, 113–115
- repository, code. See code repository
- researching solutions, 173–174
- reset_index() method, 143
- RFECV class, 363
- Ridge regularization. See L2 type regularization
- ROC AUC (Receiver Operating Characteristic Area Under Curve), 353
- root nodes, 118
- root words, 153
- rows, database, 100, 140
S
- sampling data
- cross-validation and, 358–360
- overview, 104–105
- random sampling, 105
- saving
- Google Colab notebooks
- on GitHub, 69–70
- on GitHub Gist, 70
- on Google Drive, 68–69
- Jupyter Notebook files, 55, 188–189
- matplotlib library work to disk, 188–189
- Scala, 11
- Scalable Vector Graphics (SVG), 188
- scaling, SVM, 396
- scatterplots
- depicting groups, 209–210
- Exploratory Data Analysis, 266–267
- overview, 208–209
- showing correlations, 211–212
- Scikit-learn library, 34, 57
- application speed and performance
- benchmarking, 241–243
- memory profiler, 244–245, 247
- overview, 240–241
- classes, 230–231
- conda, 244
- defining applications for data science, 231–234
- hashing trick
- demonstrating, 235–238
- hash functions, 235
- overview, 234–235
- sparse matrices, 239–240
- hyperparameters optimization, 363–364
- model fitting and, 351–352
- multicore parallelism, 247–250
- object-based interfaces, 231
- overview, 36–37
- preferred installer program, 244
- researching solutions, 174
- SVM and, 391
- toy datasets, 100
- 20 Newsgroups dataset, 159
- SciPy library, 35
- researching solutions, 174
- sparse matrices, 239
- scipy.sparse matrix, 160
- screen text, Jupyter Console, 84–86
- Search Code Snippets option, Google Colab Help menu, 80
- selecting data
- greedy approach to, 362
- RFECV class, 363
- univariate approach to, 360–362
- SelectPercentile class, 360–361
- SGD (Stochastic Gradient Descent), 383–386, 409
- shadow parameter, pie charts, 203
- shaping data, 32
- categorical variables
- combining levels, 132–133
- creating, 130–131
- renaming levels, 131–132
- concatenating data
- adding new cases and variables, 142–144
- removing data, 144
- sorting and shuffling, 145–146
- date and time
- formatting, 134
- time transformations, 135
- dicing data, 141
- graph data
- adjacency matrices, 165
- NetworkX basics, 166–167
- HTML pages
- parsing XML and, 150–151
- using XPath for data extraction, 151–152
- missing data
- encoding, 137–138
- finding missing data, 136–137
- imputing missing data, 138–139
- with NumPy, 122
- with pandas, 122–123
- raw text
- regular expressions, 155–158
- stop words, 153–155
- Unicode and, 153
- through aggregation, 146–147
- using bag of words model
- n-grams, 161–162
- overview, 158–160
- TF-IDF transformations, 162–165
- validating data
- creating data maps and data plans, 126–128
- removing duplicates, 126
- verifying contents, 124–125
- shared group knowledge (wisdom of crowds), 411
- Singular Value Decomposition (SVD), 276–280
- 64-bit operating system, 42–43
- skewed values, 178
- skewness, EDA, 257–258
- slicing data
- sort_values() method, 145
- Spambase Data Set, 441
- sparse matrices, Scikit-learn library, 239–240
- Spearman correlation, 270
- speed of execution, 32–33
- Spyder IDE, 30–31
- SQL (Structured Query Language), 11, 14, 114
- SQL Server database, 115
- sqlalchemy library, 114
- SQLite database, 115
- square root transformations, 178
- squared errors, linear regression, 352
- statistical distributions, EDA, 272
- statistics
- descriptive, 253–254
- history of, 12
- statsmodels library, 36
- stemming stop words, 153–155
- Stochastic Gradient Descent (SGD), 383–386, 409
- stop words, 153–155, 341
- StratifiedKFold class, 359–360
- streaming data, into memory, 102–103
- strings, 106
- formatting date and time values with, 134
- special directives, 134
- Structured Query Language (SQL), 11, 14, 114
- Subreddit, 432
- support vector machines. See SVM
- Support Vector Regression (SVR), 399–401
- support vectors, SVM, 389
- SVD (Singular Value Decomposition), 276–280
- SVG (Scalable Vector Graphics), 188
- SVM (support vector machines)
- adjusting parameters, 390–392
- classifying with, 392–397
- creating stochastic solution with, 401–406
- defined, 371
- general discussion, 387–390
- hyperplanes, 390
- LinearSVC class, 402–406
- margins, 389
- nonlinear kernels, 398–399
- overfitting, 392
- performing regression with SVR, 399–401
- scaling, 396
- Scikit-learn library, 391
- support vectors, 389
- underfitting, 392
- SVR (Support Vector Regression), 399–401
- syncing Google Colab, 62
T
- Table of Contents, in Google Colab notebooks, 77
- tablets, using code on, 61
- TensorFlow library, 37
- Teradata tool, 13
- Term Frequency times Inverse Document Frequency (TF-IDF) transformations, 162–165, 235, 290
- test datasets, 354–356
- text classifications, predicting, 340–342
- text files
- accessing flat-file datasets from, 106–107
- raw text
- regular expressions, 155–158
- stop words, 153–155
- Unicode and, 153
- TF-IDF (Term Frequency times Inverse Document Frequency) transformations, 162–165, 235, 290
- Theano library, 37
- third-level headings, 54
- Thucydides, 12
- time series
- plotting on axes, 212–214
- trends, 214–216
- timedelta() function, 135
- timeit commands, 241–243
- Titanic tragedy datasets, 412–413, 438–439
- tokenizing, 158
- toy datasets, 100
- training data, 354, 356
- transformer interface, Scikit-learn library, 231, 234
- tree ensembles, 138. See also ensembles
- trendlines, plotting, 212, 214–216
- triple mapping, 152
- t-SNE algorithm, 283–284
- t-tests, EDA, 263–264
- Tukey, John, 252
- tuples, 162
U
- Udacity, 435
- underfitting, SVM, 392
- undirected graphs, 222–223
- Unicode, 38, 153
- univariate approach
- to outliers
- box plots, 318–319
- Chebyshev’s inequality, 320
- Gaussian distribution, 319–320
- overview, 317–318
- winsorizing, 321
- to selecting variables, 360–362
- Unicode Transformation Format 8-bit (UTF-8), 153
- unstructured file form, sending data in, 111–113
- uploading data, into memory, 101–102
V
- validating data
- creating data maps and data plans, 126–128
- removing duplicates, 126
- verifying contents, 124–125
- values, dataset
- van Rossum, Guido, 22
- Vapnik, Vladimir, 387
- variable distributions, 253
- variables, 19, 170. See also features, dataset
- categorical, 129–133
- combining for feature creation, 176–177
- in databases, 100
- dummy, 177
- indicator, 177–178
- variable transformations, 372–375
- variances
- Exploratory Data Analysis, 255–256
- machine learning algorithms, 350
- vectorization, 179
- Visual Studio code support, 45
- visualizing data, 15, 33–35
- bar charts, 203–205
- box plots, 206–208
- directed graphs, 224–225
- Exploratory Data Analysis
- box plots, 262–263
- distributions, 265–266
- overview, 261
- parallel coordinates, 264–265
- scatterplots, 266–267
- t-tests, 263–264
- with graphs, 202–225
- histograms, 205–206
- overview, 201
- pie charts, 202–203
- plotting geographical data
- with Basemap toolkit, 218, 220–221
- deprecated packages, 218–220
- with Notebook, 216–218
- plotting time series
- on axes, 212–214
- trends, 214–216
- scatterplots, 208–212
- undirected graphs, 222–223
W
- web services, 116
- web-based data, 116–118
- whiskers, box plots, 206
- Windows
- Enthought Canopy Express and, 41
- installing Anaconda on, 42–45
- local runtime support on, 63
- Windows 7 system, 18
- WinPython and, 42
- WinPython, 42
- winsorizing, 321
- wisdom of crowds (shared group knowledge), 411
- Wolpert, David, 351
- wrangling data
- clustering
- agglomerative, 305–310
- DBScan, 310–312
- with k-means, 297–305
- overview, 295–297
- defined, 229
- Exploratory Data Analysis
- categorical data, 259–261
- correlations, 268–271
- distributions, 272–274
- numeric data, 253–258
- overview, 251–253
- visualization for, 261–267
- outliers
- anomalies, 316
- concept drift phenomenon, 317
- effect on machine learning algorithms, 315–316
- multivariate approach to, 322–326
- novel data, 316–317
- overview, 313–314
- univariate approach to, 317–321
- overview, 229–230
- reducing data dimensionality
- collaborative filtering, 291–293
- factor analysis, 280–282
- nonlinear dimensionality reduction, 283–285
- Non-Negative Matrix Factorization, 289–291
- overview, 275–276
- principal components analysis, 282–283, 285–289
- singular value decomposition, 276–280
- Scikit-learn library
- application speed and performance and, 240–247
- classes, 230–231
- defining applications for data science, 231–234
- hashing trick, 234–240
- multicore parallelism, 247–250
- object-based interfaces, 231
X
- x axis, graphs, 189
- XML pages, 38
- JSON versus, 119
- parsing, 150–151
- working with web data through, 117
- XPath, 151–152
Z
- Z-score standardization, EDA, 273