- 64-bit operating system, 42–43
A
- ABC language, 22
- absolute errors, linear regression, 352
- Adaboost application, 424–425
- adjacency matrices, 165–166
- agglomerative clustering
- hierarchical cluster solution, 307–308
- linkage methods, 306
- metrics, 306
- overview, 305–306
- two-phase clustering solution, 308–310
- aggregation, shaping data through, 146–147
- AI applications, 13
- AI For Dummies (Mueller and Massaron), 13
- algorithms
- choosing, 33
- classifiers, 158
- hyperparameters
- GBM model, 427–428
- grid search, 364–368
- overview, 363–364
- randomized search, 368–369
- k-means
- big data, 304–305
- centroids, 298–299
- ground truth, 301–304
- image data, 299–301
- overview, 297
- K-Nearest Neighbors
- k-parameter, 344–345
- overview, 342–343
- predicting, 343–344
- linear regression
- limitations of, 333–334
- with multiple variables, 331–333
- overview, 329–331
- Naïve Bayes, 177
- classes, 339–340
- overview, 337–338
- text classifications, 340–342
- Random Forest
- optimizing, 422–424
- overview, 418–420
- Random Forest classifier, 420–421
- Random Forest regressor, 421–422
- align parameter, histograms, 205
- Al-Kindi, 12
- Amazon.com review dataset, 444
- Anaconda. See also IPython; Jupyter Notebook
- Anaconda Command Prompt, 27–29
- installing
- on Linux, 46–47
- on Mac OS X, 47–48
- on Windows, 42–45
- IPython, 29
- Jupyter QTConsole environment, 29–30
- Spyder IDE, 30–31
- Anaconda Command Prompt, 27–29
- Anaconda Prompt window, 86–87
- annotations, graph, 197–199
- ANOVA (Analysis of Variance), 263
- Apache Spark, 11
- APIs (application programming interfaces), 37, 116
- append() method, 143
- arrays
- 2-D, 140
- 3-D, 140
- n-dimensional, 36
- performing operations on
- matrix multiplication, 181
- matrix vector multiplication, 180
- simple arithmetic on vectors and matrices, 179–180
- vectorization, 179
- AskSam database, 113
- Aspirational Data Scientist blog, 434
- autopct parameter, pie charts, 203
- axes, graph
- accessing, 189–190
- formatting, 190–191
- handles, 190
- labels, 198
- plotting time series, 212–214
B
- backends, plotting
- defined, 96
- MatPlotLib, 189
- backward approach, selecting variables, 362
- bag of words model
- implementing TF-IDF transformations, 162–165
- understanding bag of words model, 159–160
- working with n-grams, 161–162
- bagging techniques, machine learning, 419–420, 425
- bar charts, 186, 203–205
- Basemap toolkit
- creating Basemap environment, 217–218
- deprecated packages, 218–220
- installing, 218
- plotting geographical data, 220–221
- Beautiful Soup library, 38
- Beginning Programming with Python For Dummies, 2nd Edition (Mueller), 10, 25, 34
- Bell Labs, 252
- benchmarking, 241–243
- bias
- in data, 173
- machine learning algorithms, 350
- big data. See also reducing data dimensionality
- AI and, 13
- clustering, 304–305
- data science and, 13
- defined, 275, 383
- SGD optimization, 383–386
- binning
- defined, 123
- in feature creation, 177
- transforming numerical variable into categorical ones, 259–260
- bins, histogram, 205
- black box, defined, 420
- Boolean values, 151–152
- Boston dataset
- dividing between training and test sets, 354–356
- performing regression with SVR class, 400–401
- testing and detecting interactions between variables, 375–379
- variable transformations, 372–373
- Bower, Jonathan, 436
- box plots, 186, 206–208
- Exploratory Data Analysis, 262–263
- outliers and, 318–319
- quartiles, 206
- whiskers, 206
- branches, decision trees, 414
- Breiman, Leo, 419
- bunches, data, 34, 160
C
- C parameter, SVM LinearSVC class, 402
- Canadian Institute for Advanced Research (CIFAR) datasets, 443
- Canopy Express, 41–42
- cases, database, 100, 170
- categorical variables
- Exploratory Data Analysis
- contingency tables, 261
- frequencies, 259–260
- overview, 259
- levels
- combining, 132–133
- renaming, 131–132
- C/C++ language, 10
- cells, Jupyter Notebook, 51–52
- central tendency, EDA, 254–255
- centroid-based algorithms, 298–299
- Chebyshev’s inequality, 320
- checkpoints, Jupyter Notebook, 95
- chi-square test for tables, EDA, 271
- Chrome browser, 61
- CIFAR (Canadian Institute for Advanced Research) datasets, 443
- classes. See also names of specific classes
- logistic regression algorithm, 336–337
- Naïve Bayes algorithm, 339–340
- classifiers, defined, 158
- classification trees, 415–417
- Cleveland, William S., 12
- cluster analysis, 177, 324–325
- clustering
- agglomerative
- hierarchical cluster solution, 307–308
- linkage methods, 306
- metrics, 306
- overview, 305–306
- two-phase clustering solution, 308–310
- cross-tabulation, 301–302
- DBScan, 310–312
- ground truth, 301–302
- inertia, 302–304
- with k-means algorithm
- big data, 304–305
- centroid-based algorithms, 298–299
- ground truth, 301–302
- image data, 299–301
- overview, 297
- CNTK (Microsoft Cognitive Toolkit), 37
- code execution, checking, 78–79
- code repository
- adding notebook content, 53–55
- creating new notebooks, 51–53
- defining new folders, 50–51
- exporting notebooks, 55
- importing notebooks, 56–57
- removing notebooks, 55–56
- coding style, 33
- Colaboratory. See Google Colab
- collaborative filtering, 291–293
- colors, on graphs, 188, 193, 194–195
- colors parameter
- bar charts, 204
- histograms, 205
- pie charts, 202
- columns, dataset, 140–141. See also features, dataset
- COM (Component Object Model) applications, 116
- comma-separated value (CSV) files, 106–109
- Common Crawl 2012 web corpus, 444–445
- competencies of data scientists, 12–13
- Component Object Model (COM) applications, 116
- concat() method, 143
- concatenating data
- adding new cases and variables, 142–144
- removing data, 144
- sorting and shuffling, 145–146
- concept drift phenomenon, 317
- Conductrics site, 435
- contingency tables, EDA, 261
- Continuum Analytics Anaconda, 40–41
- correlations
- Exploratory Data Analysis
- chi-square test for tables, 271
- covariance and, 268–270
- nonparametric, 270–271
- showing on scatterplots, 211–212
- counterclock parameter, pie charts, 202
- CountVectorizer function, 239–241
- covariances, EDA, 268–270
- cross-tabulation, clustering, 301–302
- cross-validation
- on k-folds, 357–358
- multicore parallelism, 248
- overview, 356–357
- sampling and, 358–360
- CSV (comma-separated value) files, 106–109
- cube root transformations, 178
- cycle_graph() template, 166
D
- data analysis, 13, 32. See also EDA
- data capture, 13
- data correlations. See correlations
- data groupings
- depicting on box plots, 206–208
- depicting on scatterplots, 209–210
- data maps, 126–128
- data munging. See wrangling data
- data plans, 126–128
- data science
- AI and, 13
- big data and, 13
- core competencies of data scientists, 12–13
- history of, 12
- overview, –10
- pipeline
- exploratory data analysis, 15
- learning from data, 15
- preparing data, 15
- understanding meaning of data, 16
- visualization, 15
- programming languages and, 14
- Python and
- choosing, reasons for, 17–18
- contributions to, 23–24
- fusing data science and application development, 16–17
- loading data, 19
- training models, 19
- viewing results, 19–20
- working with problems in
- evaluating, 171
- formulating hypotheses, 174
- preparing data, 174
- researching solutions, 173–174
- Data Science Central, 433–434
- data wrangling. See wrangling data
- Database Administrators (DBAs), 11
- Database Management Systems (DBMSs), 11, 114, 115
- dataframes
- DataFrame.to_sql() function, 114
- datasets, 34, 170, 437–445
- Amazon.com review dataset, 444
- CIFAR datasets, 443
- Common Crawl 2012 web corpus, 444–445
- downloading, 48–58
- flat-file
- CSV delimited format, 107–109
- Excel, 109–110
- Microsoft Office files, 109–110
- text files, 106–107
- handwritten data, 442
- high-dimensional sparse datasets, 160
- image datasets, 443
- Kaggle competition, 438
- Kaggle site, 439
- large datasets, 444–445
- Madelon Data Set, 440
- MNIST dataset, 442
- MovieLens site, 440–441
- NIST dataset, 442
- overview, 437
- pattern recognition, 442
- size, 32
- Spambase Data Set, 441
- Titanic tragedy datasets, 438–439
- understanding datasets used in book, 57–58
- used in book, 57–58
- dates in data
- formatting date and time values, 134
- overview, 133–134
- time transformation, 135
- DBAs (Database Administrators), 11
- DBMSs (Database Management Systems), 11, 114, 115
- debugging codes, 33
- decision trees
- branches, 414
- classification trees, 415–417
- general discussion, 412–415
- leaf nodes, 414–415
- Random Forest algorithm
- optimizing, 422–424
- overview, 418–420
- Random Forest classifier, 420–421
- Random Forest regressor, 421–422
- regression trees, 417–418
- deep learning, 37
- hardware for, 406
- neural networks, 371
- classifying with, 408–410
- deep learning, 406
- multilayer perceptron, 408–410
- overview, 406–407
- regressing with, 408–410
- deprecated packages, 218–220
- describe() function, 127
- dicing data, 141
- directed graphs, 224–225
- discretization, 177
- distributions
- defined, 178
- Exploratory Data Analysis, 265–266
- transforming, 273–274
- using different statistical distributions, 272
- Z-score standardization, 273
- showing with histograms, 205–206
- Domingos, Pedro, 176
- double mapping, 152
- drop() method, 144
- drop_duplicates() function, 126
- dual parameter, SVM LinearSVC class, 402
- dummy variables, 177
- duplicates
- effect on data results, 124–125
- removing, 126
E
- EDA (Exploratory Data Analysis), 15
- categorical data
- contingency tables, 261
- frequencies, 259–260
- overview, 259
- correlations
- chi-square test for tables, 271
- covariance and, 268–270
- nonparametric, 270–271
- distributions
- transforming, 273–274
- using different statistical distributions, 272
- Z-score standardization, 273
- nonlinear transformations, 372
- numeric data
- central tendency, 254–255
- kurtosis, 257–258
- means, 254–255
- medians, 254–255
- normality, 257–258
- overview, 253–254
- percentiles, 256–257
- range, 256
- skewness, 257–258
- variance, 255–256
- overview, 251–253
- visualization for
- box plots, 262–263
- distributions, 265–266
- overview, 261
- parallel coordinates, 264–265
- scatterplots, 266–267
- t-tests, 263–264
- ElasticNet class, linear models, 382–383
- Encapsulated Postscript (EPS), 188
- encoding
- defined, 38
- missing data, 137–138
- one-hot-encoding, 236–238
- ensembles
- boosting, 424–428
- decision trees
- classification trees, 415–417
- general discussion, 412–415
- Random Forest algorithm, 418–424
- regression trees, 417–418
- imputing missing data, 138
- overview, 411
- Enthought Canopy Express, 41–42
- enumerations, 129
- EPS (Encapsulated Postscript), 188
- error bar charts, 186
- error messages
- Firefox, 61
- indentation and, 26
- estimator interface, Scikit-learn library, 231–233
- ETL (Extract, Transformation, and Loading) specialists, 13
- Excel files, 106, 107, 109–110
- explode parameter, pie charts, 202
- Exploratory Data Analysis. See EDA
- exponential transformations, 178
- Extract, Transformation, and Loading (ETL) specialists, 13
- Extremely Randomized Trees machine learning technique, 325–326
F
- factor analysis
- hidden factors, 281–282
- psychometrics, 280–281
- features, dataset. See also variables
- defined, 100, 170
- feature creation
- binning, 177
- combining variables, 176–177
- defining, 175–176
- discretization, 177
- transforming distributions, 178
- using indicator variables, 177–178
- “A Few Useful Things to Know about Machine Learning” paper (Domingos), 176
- filtering data
- collaborative filtering, 291–293
- dicing, 141
- overview, 139
- slicing, 140–141
- finalized code, 11
- Firefox browser, 18
- configuring to use local runtime support, 63
- error dialog box, 61
- flat-file datasets
- CSV delimited format, 107–109
- Excel, 109–110
- Microsoft Office files, 109–110
- text files, 106–107
- folders, Jupyter Notebook, 50–51
- Forecastwatch.com, 23–24
- Fortran language, 10
- frame-of-reference mistruths, 173
- frequencies, EDA, 259–260
- functional coding, 17
- Functional Programming For Dummies (Mueller), 11
- functions. See also names of specific functions
- hash functions, 235
- magic functions
- accessing lists, 90
- working with, 91
G
- Gaussian distribution, 257, 319–320
- GBM (Gradient Boosting Machine), 425–428
- geographical data
- deprecated packages, 218–220
- plotting with Basemap toolkit, 218, 220–221
- plotting with Notebook, 216–218
- GitHub, 436
- opening existing Google Colab notebooks in, 67–68
- saving Google Colab notebooks to, 69–70
- storing Google Colab notebooks on, 64
- GitHubGist, 70
- GMT (Greenwich Mean Time), 134
- Google Accounts
- creating, 64
- overview, 63
- signing in, 64–65
- Google Colab, 59–80
- code cells
- Add a Comment option, 72
- Add a Form option, 72
- Clear Output option, 72
- Delete Cell option, 72
- Link to Cell option, 72
- View Output Fullscreen option, 72
- editing cells, 74–75
- example projects, 66
- executing code, 76
- getting help, 80
- Google Account
- creating, 64
- overview, 63
- signing in, 64–65
- hardware acceleration, 75–76
- Help menu, 80
- Jupyter Notebook compared to, 61–63
- local runtime support, 63
- moving cells, 75
- notebooks
- checking code execution, 78–79
- creating new, 65–66
- downloading, 71
- opening existing, 66–68
- saving, 68–70
- sharing, 79
- table of contents, 77
- viewing notebook information, 77–78
- overview, 60–61
- special cells
- headings, 74
- overview, 73
- table of contents, 74
- text cells, 72–73
- Welcome page, 65
- Google Docs, 63
- Google Drive
- opening existing Google Colab notebooks in, 66–67
- revision history, 68
- saving Google Colab notebooks on, 68–69
- Gradient Boosting Machine (GBM), 425–428
- graphical user interface (GUI), 29–30
- graphics
- CIFAR datasets, 443
- integrating into Jupyter Notebook
- embedding images, 96–98
- embedding plots, 96
- loading examples from online sites, 96
- graphs
- adding grids to, 191–192
- adjacency matrices, 165
- annotations, 197–199
- axes
- accessing, 189–190
- formatting, 190–191
- handles, 190
- plotting time series, 212–214
- bar charts, 203–205
- box plots, 206–208
- building with NetworkX basics, 166–167
- defining line appearance on
- adding markers, 195–197
- line styles, 193–194
- using colors, 194–195
- directed, 224–225
- histograms, 205–206
- labels, 197–198
- legends, 197, 199–200
- MatPlotLib library
- defining plots, 186–187
- drawing multiple lines and plots, 187–188
- saving work to disk, 188–189
- pie charts, 202–203
- plotting geographical data, 216–221
- plotting time series, 212–216
- scatterplots, 208–212
- undirected, 222–223
- greedy approach, to selecting variables, 362
- Greenwich Mean Time (GMT), 134
- grid() function, 192
- grid searching
- hyperparameters and, 364–368
- multicore parallelism, 248
- grids, adding to graphs, 191–192
- ground truth, clustering, 301–302
- groupby() function, 127
- groups, data
- depicting on box plots, 206–208
- depicting on scatterplots, 209–210
- Grover, Prince, 413
- GUI (graphical user interface), 29–30
H
- hairballs
- defined, 165
- using NetworkX to avoid, 166
- handles, axes, 190
- handwritten data, datasets, 442
- hardware acceleration, Google Colab, 75–76
- hashing trick, Scikit-learn library
- demonstrating, 235–238
- hash functions, 235
- overview, 234–235
- sparse matrices, 239–240
- HashingVectorizer function, 239–241
- hatch parameter, bar charts, 204
- Help menu, Google Colab, 80
- Help mode, Python
- entering, 88
- exiting, 89
- requesting help in, 88–89
- hierarchical clustering
- hierarchical cluster solution, 307–308
- linkage methods, 306
- metrics, 306
- overview, 305–306
- two-phase clustering solution, 308–310
- high-dimensional sparse datasets, 160
- histograms, 186
- bins, 205
- showing distributions with, 205–206
- History of the Peloponnesian War (Thucydides), 12
- HTML data
- parsing into Beautiful Soup library, 38
- parsing XML and, 150–151
- using XPath for data extraction, 151–152
- hyperparameters of algorithms
- GBM model, 427–428
- grid search, 364–368
- overview, 363–364
- randomized search, 368–369
- hyperplanes, SVM, 390
- hypotheses, formulating, 174
I
- IDA (Initial Data Analysis), 252
- IDE (Integrated Development Environment), 30
- IDF (Inverse Document Frequency), 163
- image data
- clustering, 299–301
- cropping, 112
- embedding images into Notebook, 96
- flattening, 113
- generating variations on, 103–104
- resizing, 112
- as unstructured files, 111
- image datasets, 443
- imperative coding, 17
- importance estimation, 420
- indentation, in Python, 26
- index, dataset, 136, 141, 143
- indicator variables, 177–178
- inertia, clustering, 302–304
- Informatica tool, 13
- information redundancy, 268
- information resources
- Aspirational Data Scientist blog, 434
- Conductrics, 435
- Data Science Central, 433–434
- GitHub, 436
- KDnuggets, 432
- Open-Source Data Science Masters, 436
- Oracle Data Science blog, 433
- Quora, 432–433
- Subreddit, 432
- Udacity, 435
- Information Retrieval (IR), 159
- Initial Data Analysis (IDA), 252
- installing
- Anaconda
- on Linux, 46–47
- on Mac OS X, 47–48
- on Windows, 42–45
- Basemap toolkit, 218
- preferred installer program, 244
- Integrated Development Environment (IDE), 30
- Interactive Python. See IPython
- International Council for Science, 12
- Internet Explorer browser, 63
- Inverse Document Frequency (IDF), 163
- inverse transformations, 178
- IPython
- help, 89–90, 92
- Jupyter Console, 84–92
- Jupyter Notebook and, 59
- overview, 29
- playing with data, 33
- IR (Information Retrieval), 159
- Iris dataset
- classification trees, 415–419
- correlation and covariance, 268–271
- counting for categorical data, 259–261
- factor analysis, 281–283
- hyperparameters optimization, 363–364
- loading, 253–254
- logistic regression algorithm, 335
- visualizing data, 262–267
J
- Java, 11
- Java Virtual Machine (JVM), 11
- JavaScript Object Notation (JSON), 119
- join() method, 143
- Journal of Data Science, 12
- jQuery, 116
- JSON (JavaScript Object Notation), 119
- Jupyter Console
- changing window appearance, 86–87
- discovering objects
- getting object help, 91
- obtaining object specifics, 92
- using IPython object help, 92
- interacting with screen text, 84–86
- IPython help, 89–90
- magic functions, 90–91
- Python help, 87–89
- Jupyter Notebook, 18
- code repository
- adding notebook content, 53–55
- creating new notebooks, 51–53
- defining new folders, 50–51
- exporting notebooks, 55
- importing notebooks, 56–57
- removing notebooks, 55–56
- Google Colab and, 59, 61–63
- graphs in, 188
- integrating multimedia into
- embedding images, 96–98
- embedding plots, 96
- loading examples from online sites, 96
- plotting geographical data, 216–218
- restarting kernels, 94–95
- restoring checkpoints, 95
- starting, 49–50
- stopping server, 50
- working with styles, 93–94
- Jupyter QTConsole environment, 29–30
- JVM (Java Virtual Machine), 11
K
- Kaggle data science competition, 176, 412, 438
- Kaggle site, 439
- KDnuggets site, 432
- Keras library, 37
- kernels
- nonlinear, 398–399
- restarting
- using graphs, 189
- using notebook backend, 189
- keys, defined, 34
- k-folds, cross-validation on, 357–358
- k-means algorithm
- big data, 304–305
- centroid-based algorithms, 298–299
- ground truth, 301–304
- image data, 299–301
- overview, 297
- KNN (K-Nearest Neighbors) algorithm
- hyperparameters, 365–368
- k-parameter, 344–345
- overview, 342–343
- predicting, 343–344
- kurtosis, EDA, 257–258
L
- L1 type (Lasso) regularization
- combining with L2 regularization, 382–383
- overview, 381
- L2 type (Ridge) regularization
- combining with L1 regularization, 382–383
- overview, 380–381
- labels, using on graphs, 197–198
- labels parameter, pie charts, 202
- lambda calculus, 17
- languages, programming. See programming languages
- Lasso regularization. See L1 type regularization
- leaf nodes, decision trees, 414
- learning from data. See machine learning
- legends, graph, 197, 199–200
- levels, categorical variables
- combining, 132–133
- defined, 129
- renaming, 131–132
- libraries, 35–38. See also names of specific libraries
- line graphs, 186, 214
- adding markers, 195–197
- using colors, 192, 194–195
- working with line styles, 193–194
- linear models
- nonlinear transformations
- main effects model, 375–379
- variable transformations, 372–375
- regularization of
- ElasticNet class, 382–383
- Lasso (L1 type), 381
- leveraging, 382
- overview, 379–380
- Ridge (L2 type), 380–381
- linear regression algorithm
- limitations of, 333–334
- with multiple variables, 331–333
- overview, 329–331
- R squared measure for, 352
- LinearSVC class, SVM, 402–406
- linkage methods, agglomerative clustering, 306
- Linux
- Enthought Canopy Express and, 41
- installing Anaconda on, 46–47
- local runtime support on, 63
- lists, output, 109
- loading data on Python
- overview, 19
- speed of data analysis and, 33
- local runtime support
- on Google Colab, 63
- on Linux, 63
- on Mac OS X, 63
- on Windows, 63
- local storage, 68
- logarithm transformations, variable, 178, 274
- logistic regression algorithm
- loss parameter, SVM LinearSVC class, 402
M
- Mac OS X
- Enthought Canopy Express and, 41
- installing Anaconda on, 47–48
- using local runtime support on, 63
- machine learning. See also algorithms
- big data, 383–386
- cross-validation
- on k-folds, 357–358
- overview, 356–357
- sampling, 358–360
- ensembles
- boosting, 424–428
- decision trees, 412–424
- overview, 411
- hyperparameters, 363–369
- linear models
- nonlinear transformations, 372–379
- regularization of, 379–383
- model fitting
- bias, 350
- overview, 348–349
- strategy for, 350–353
- test sets, 354–356
- training, 354–356
- variance, 350
- neural networks
- classifying with, 408–410
- deep learning, 406
- multilayer perceptron, 408–410
- overview, 406–407
- regressing with, 408–410
- no-free-lunch theorem, 351
- optimization
- grid search, 364–368
- overview, 363–364
- randomized search, 368–369
- selection
- greedy, 362
- RFECV method, 363
- by univariate measures, 360–362
- Stochastic Gradient Descent optimization, 383–386
- support vector machines
- adjusting parameters, 390–392
- classifying with, 392–397
- creating stochastic solution with, 401–406
- general discussion, 387–390
- hyperplanes, 390
- margins, 389
- nonlinear kernels, 398–399
- overfitting, 392
- performing regression with SVR, 399–401
- scaling, 396
- Scikit-learn library, 391
- underfitting, 392
- Machine Learning For Dummies (Mueller and Massaron), 13
- Madelon Data Set, 440
- magic functions
- accessing lists, 90
- working with, 91
- main effects model, 375–379
- manifold learning (nonlinear dimensionality reduction), 283–285
- Manuscript on Deciphering Cryptographic Messages, 12
- margins, SVM, 389
- Markdown cells, 53
- markers on graphs, 195–197
- MATLAB, 14, 185–186
- MATLAB For Dummies (Mueller), 14, 185
- MatPlotLib library, 32, 38, 185–200. See also graphs
- adding grids, 191–192
- analyzing image data with, 102
- annotating charts, 198–199
- creating legends, 199–200
- defining line appearance
- adding markers, 195–197
- using colors, 194–195
- working with line styles, 193–194
- graphs
- defining plots, 186–187
- drawing multiple lines and plots, 187–188
- saving work to disk, 188–189
- labels, 197–198
- setting axis
- access code for, 189–190
- formatting, 190–191
- matrices
- adjacency, 165
- multiplication, 181
- scipy.sparse matrix, 160
- sparse, 239–240
- vector multiplication, 180
- mean function, 139
- means, EDA, 254–255
- median function, 139
- medians, EDA, 254–255
- memory
- memory profiler, 244–245, 247
- streaming data into, 102–103
- uploading data into, 101–102
- metrics, agglomerative clustering, 306
- microservices, 117
- Microsoft Cognitive Toolkit (CNTK), 37
- Microsoft Office files, 109–110
- missing data
- encoding, 137–138
- finding, 136–137
- imputing, 138–139
- mistruths in data
- bias, 173
- commission, 172
- frame of reference, 173
- omission, 172
- perspective, 172–173
- MLP (multilayer perceptron), 408–410
- MNIST (Modified National Institute of Standards and Technology) dataset, 442
- model fitting
- bias, 350
- no-free-lunch theorem, 351
- overview, 348–349
- ROC AUC, 353
- strategy for, 350–353
- test sets, 354–356
- training, 354–356
- variance, 350
- model interface, Scikit-learn library, 231, 234
- MongoDB database, 115
- most_frequent function, 139
- movie recommendations, 291–293
- MovieLens site, 440–441
- MS SSIS tool, 13
- multicore parallelism (multiprocessing), 247–250
- multilabel prediction, 248
- multilayer perceptron (MLP), 408–410
- multimedia, integrating in Jupyter Notebook
- embedding images, 96
- embedding plots, 96
- loading examples from online sites, 96
- obtaining online, 96–98
- multiprocessing (multicore parallelism), 247–250
- multivariate approach to outliers
- cluster analysis, 324–325
- Extremely Randomized Trees machine learning technique, 325–326
- principal component analysis, 322–324
- Random Forests machine learning technique, 325–326
- multivariate correlation, 175
- munging
- clustering
- agglomerative, 305–310
- DBScan, 310–312
- with k-means, 297–305
- overview, 295–297
- defined, 229
- Exploratory Data Analysis
- categorical data, 259–261
- correlations, 268–271
- distributions, 272–274
- numeric data, 253–258
- overview, 251–253
- visualization for, 261–267
- outliers
- anomalies, 316
- concept drift phenomenon, 317
- effect on machine learning algorithms, 315–316
- multivariate approach to, 322–326
- novel data, 316–317
- overview, 313–314
- univariate approach to, 317–321
- overview, 229–230
- reducing data dimensionality
- collaborative filtering, 291–293
- factor analysis, 280–282
- nonlinear dimensionality reduction, 283–285
- Non-Negative Matrix Factorization, 289–291
- overview, 275–276
- principal components analysis, 282–283, 285–289
- singular value decomposition, 276–280
- Scikit-learn library
- application speed and performance and, 240–247
- classes, 230–231
- defining applications for data science, 231–234
- hashing trick, 234–240
- multicore parallelism, 247–250
- object-based interfaces, 231
- MySQL database, 115
- MySQLdb library, 24
N
- Naïve Bayes algorithm, 177, 375
- classes, 339–340
- overview, 337–338
- text classifications, 340–342
- National Institute of Standards and Technology (NIST) dataset, 442
- Natural Language Processing (NLP), 159
- Natural Language Toolkit (NLTK), 154
- n-dimensional arrays, 36
- NetworkX library, 38, 166–167
- neural networks, 371
- classifying with, 408–410
- deep learning, 406
- multilayer perceptron, 408–410
- overview, 406–407
- regressing with, 408–410
- n-grams, 161–162
- NIST (National Institute of Standards and Technology) dataset, 442
- NLP (Natural Language Processing), 159
- NLTK (Natural Language Toolkit), 154
- NMF (Non-Negative Matrix Factorization), 289–291
- no-free-lunch theorem, 351
- nonlinear dimensionality reduction (manifold learning), 283–285
- nonlinear kernels, SVM, 398–399
- nonlinear transformations
- main effects model, 375–379
- variable transformations, 372–375
- Non-Negative Matrix Factorization (NMF), 289–291
- nonparametric correlations, EDA, 270–271
- normality, EDA, 257–258
- NoSQL (Not only SQL) databases, 115–116
- Notebook. See Jupyter Notebook
- notebooks, Google Colab, 65–71
- creating new, 65–66
- downloading, 71
- opening existing
- in GitHub, 67–68
- in Google Drive, 66–67
- local storage, 68
- saving
- on GitHub, 69–70
- on GitHubGist, 70
- on Google Drive, 68–69
- sharing, 79
- viewing
- checking code execution, 78–79
- displaying table of contents, 77
- getting notebook information, 77–78
- novel data, 316–317
- numeric data, EDA
- central tendency, 254–255
- kurtosis, 257–258
- means, 254–255
- medians, 254–255
- normality, 257–258
- overview, 253–254
- percentiles, 256–257
- range, 256
- skewness, 257–258
- variance, 255–256
- NumPy library
- arrays tool, 246
- computing covariance and correlation matrices, 269–270
- functions, 36
- matrix multiplication, 181
- matrix vector multiplication, 180
- n-dimensional arrays, 36, 179
- researching solutions, 174
- shaping data with, 122
O
- object-based interfaces, Scikit-learn library, 231
- object-oriented coding, 18
- objects, 91–92
- getting object help, 91
- IPython object help, 92
- obtaining object specifics, 92
- ODBC (Open Database Connectivity), 115
- one-hot-encoding, 236–238
- online resources
- Aspirational Data Scientist, 434
- Cheat Sheet,
- Conductrics, 435
- Cross Validated, 174
- Data Science Central, 433–434
- directives list, 134
- GitHub, 436
- Google Scholar, 174
- Imputer parameters, 138
- Internet World Stats,
- John Mueller blog,
- KDnuggets, 432
- list of distributions, 178
- MatPlotLib graph types, 186
- Microsoft Academic Search, 174
- Open-Source Data Science Masters, 436
- Oracle Data Science Blog, 433
- parsers for CSV files, 108–109
- Python Enhancement Proposals, 25
- Python tutorials,
- pythonclock.org, 22
- Quora, 174, 432–433
- read table method arguments, 107
- regular expressions, 155
- source codes,
- Stack Overflow, 174
- standard graph types list, 166
- Subreddit, 432
- telephone number manipulation routines, 157
- Udacity, 435
- Unicode encodings, 153
- Unicode problems in Python, 153
- working with databases, 115
- Open Database Connectivity (ODBC), 115
- Open-Source Data Science Masters (OSDSM), 436
- optimization
- grid search, 364–368
- overview, 363–364
- randomized search, 368–369
- Oracle Data Science blog, 433
- OSDSM (Open-Source Data Science Masters), 436
- outliers, 15, 177
- anomalies, 316
- concept drift phenomenon, 317
- effect on machine learning algorithms, 315–316
- multivariate approach to
- cluster analysis, 324–325
- Extremely Randomized Trees machine learning technique, 325–326
- principal component analysis, 322–324
- Random Forests machine learning technique, 325–326
- novel data, 316–317
- overview, 313–314
- univariate approach to
- box plots, 318–319
- Chebyshev’s inequality, 320
- Gaussian distribution, 319–320
- overview, 317–318
- winsorizing, 321
- overfitting
- online resources, 440
- SVM, 392
P
- pandas library
- categorical data, 259
- checking current version of, 129
- CSV files and, 107
- data analysis with, 36
- measuring central tendency, 254–255
- NaN output, 131
- parsers, 106
- reading flat-file data, 106
- removing duplicates, 126
- researching solutions, 174
- shaping data with, 122–123
- working with Excel worksheets in, 110
- parallel coordinates, EDA, 264–265
- parameters, SVM, 390–392
- Parr, Terence, 413
- parsers, 106
- parsing HTML data, 150–151
- PATH environment variable, 45
- pattern matching, 156, 442
- PCA (principal components analysis)
- facial recognition, 285–289
- outliers and, 322–324
- overview, 282–283
- PDF (Portable Document Format), 188
- Pearson correlation, 269–270
- penalty parameter, SVM LinearSVC class, 402
- PEPs (Python Enhancement Proposals), 25
- percentiles, EDA, 256–257
- pie charts, 186, 202–203
- PIP (preferred installer program), 244
- pipeline, data science
- Exploratory Data Analysis, 15
- learning from data, 15
- overview, 14
- preparing data, 15
- in prototyping, 32
- understanding meaning of data, 16
- visualizing data, 15
- plotting data
- geographical data
- with Basemap toolkit, 218, 220–221
- deprecated packages, 218–220
- with Notebook environment, 216–218
- plots
- defined, 186–187
- multiple plot lines, 187–188
- time series
- on axes, 212–214
- trends, 214–216
- polynomial transformations, 178
- Portable Document Format (PDF), 188
- Portable Network Graphic (PNG) format, 188
- PostgreSQL database, 115
- Postscript (PS), 188
- PowerPoint, 14
- predictor interface, Scikit-learn library, 231, 233
- preferred installer program (PIP), 244
- principal components analysis. See PCA
- procedural coding, 18
- programming languages. See also Python
- choosing, 10–11
- data science and, 14
- Java, 11
- Natural Language Toolkit, 154
- R, 10–11
- Scala, 11
- SQL, 11, 114
- XPath, 151–152
- prototyping, 31–32
- PS (Postscript), 188
- PSF (Python Software Foundation), 22
- PyMongo library, 115
- Python
- Anaconda
- Anaconda Command Prompt, 27–29
- installing, 42–47
- IPython, 29
- Jupyter QTConsole environment, 29–30
- Spyder IDE, 30–31
- contributions to data science, 23–24
- developments of, 22
- factors affecting speed of execution, 32–33
- goals of, 24–25
- help mode
- entering, 88
- exiting, 89
- requesting help in, 88–89
- indentation, 26
- interactive help, 89
- issues with flat-file headers, 106
- language statements, 25–26
- libraries
- Beautiful Soup, 38
- Keras and TensorFlow, 37
- matplotlib, 38
- NetworkX, 38
- NumPy, 36
- pandas, 36
- Scikit-learn, 36–37
- SciPy, 35
- licensing issues, 22
- overview, 10
- performing rapid prototyping and experimentation, 31–32
- philosophy, 23
- Python 2.x, 22
- Python 3.x, 22, 153
- role in data science and, 16–20
- streaming data using, 102
- visualizing data, 33–35
- working with, 25–31
- Python Enhancement Proposals (PEPs), 25
- Python interpreter, 27
- Python Software Foundation (PSF), 22
- pythonclock.org, 22
- python-history.blogspot, 23
- Python(x,y), 42
Q
- QTConsole, 30
- quartiles, box plots, 206
- Quixote display framework, 24
- Quora, 432–433
R
- R (programming language), 10–11
- R squared measure, for linear regression, 352
- radial basis function (rbf) kernel, 398
- Random Forest algorithm
- optimizing, 422–424
- overview, 418–420
- Random Forest classifier, 420–421
- Random Forest regressor, 421–422
- Random Forest classifier, 420–421
- Random Forest regressor, 421–422
- Random Forests machine learning technique, 325–326
- random sampling, 105
- randomized search, 368–369
- random.shuffle() method, 145
- ranges, EDA, 256
- rbf (radial basis function) kernel, 398
- read_sql() method, 114
- read_sql_query() method, 114
- read_sql_table() method, 114
- read_table() method arguments, 107
- Receiver Operating Characteristic Area Under Curve (ROC AUC), 353
- reducing data dimensionality
- collaborative filtering, 291–293
- factor analysis
- hidden factors, 281–282
- psychometrics, 280–281
- nonlinear dimensionality reduction, 283–285
- non-negative matrix factorization, 289–291
- overview, 275–276
- principal components analysis, 282–283, 285–289
- singular value decomposition, 276–280
- t-SNE algorithm, 283–284
- regression
- linear regression algorithm
- limitations of, 333–334
- with multiple variables, 331–333
- overview, 329–331
- R squared measure for, 352
- logistic regression algorithm
- performing with SVM, 399–401
- regression trees, 417–418
- regular expressions, 155–158
- regularization of linear models
- ElasticNet class, 382–383
- Lasso (L1 type), 381
- leveraging, 382
- overview, 379–380
- Ridge (L2 type), 380–381
- relational databases, managing data from, 113–115
- repository, code. See code repository
- researching solutions, 173–174
- reset_index() method, 143
- RFECV class, 363
- Ridge regularization. See L2 type regularization
- ROC AUC (Receiver Operating Characteristic Area Under Curve), 353
- root nodes, 118
- root words, 153
- rows, database, 100, 140
S
- sampling data
- cross-validation and, 358–360
- overview, 104–105
- random sampling, 105
- saving
- Google Colab notebooks
- on GitHub, 69–70
- on GitHub Gist, 70
- on Google Drive, 68–69
- Jupyter Notebook files, 55, 188–189
- matplotlib library work to disk, 188–189
- Scala, 11
- Scalable Vector Graphics (SVG), 188
- scaling, SVM, 396
- scatterplots
- depicting groups, 209–210
- Exploratory Data Analysis, 266–267
- overview, 208–209
- showing correlations, 211–212
- Scikit-learn library, 34, 57
- application speed and performance
- benchmarking, 241–243
- memory profiler, 244–245, 247
- overview, 240–241
- classes, 230–231
- conda, 244
- defining applications for data science, 231–234
- hashing trick
- demonstrating, 235–238
- hash functions, 235
- overview, 234–235
- sparse matrices, 239–240
- hyperparameters optimization, 363–364
- model fitting and, 351–352
- multicore parallelism, 247–250
- object-based interfaces, 231
- overview, 36–37
- preferred installer program, 244
- researching solutions, 174
- SVM and, 391
- toy datasets, 100
- 20 Newsgroups dataset, 159
- SciPy library, 35
- researching solutions, 174
- sparse matrices, 239
- scipy.sparse matrix, 160
- screen text, Jupyter Console, 84–86
- Search Code Snippets option, Google Colab Help menu, 80
- selecting data
- greedy approach to, 362
- RFECV class, 363
- univariate approach to, 360–362
- SelectPercentile class, 360–361
- SGD (Stochastic Gradient Descent), 383–386, 409
- shadow parameter, pie charts, 203
- shaping data, 32
- categorical variables
- combining levels, 132–133
- creating, 130–131
- renaming levels, 131–132
- concatenating data
- adding new cases and variables, 142–144
- removing data, 144
- sorting and shuffling, 145–146
- date and time
- formatting, 134
- time transformations, 135
- dicing data, 141
- graph data
- adjacency matrices, 165
- NetworkX basics, 166–167
- HTML pages
- parsing XML and, 150–151
- using XPath for data extraction, 151–152
- missing data
- encoding, 137–138
- finding missing data, 136–137
- imputing missing data, 138–139
- with NumPy, 122
- with pandas, 122–123
- raw text
- regular expressions, 155–158
- stop words, 153–155
- Unicode and, 153
- through aggregation, 146–147
- using bag of words model
- n-grams, 161–162
- overview, 158–160
- TF-IDF transformations, 162–165
- validating data
- creating data maps and data plans, 126–128
- removing duplicates, 126
- verifying contents, 124–125
- shared group knowledge (wisdom of crowds), 411
- Singular Value Decomposition (SVD), 276–280
- 64-bit operating system, 42–43
- skewed values, 178
- skewness, EDA, 257–258
- slicing data
- sort_values() method, 145
- Spambase Data Set, 441
- sparse matrices, Scikit-learn library, 239–240
- Spearman correlation, 270
- speed of execution, 32–33
- Spyder IDE, 30–31
- SQL (Structured Query Language), 11, 14, 114
- SQL Server database, 115
- sqlalchemy library, 114
- SQLite database, 115
- square root transformations, 178
- squared errors, linear regression, 352
- statistical distributions, EDA, 272
- statistics
- descriptive, 253–254
- history of, 12
- statsmodels library, 36
- stemming stop words, 153–155
- Stochastic Gradient Descent (SGD), 383–386, 409
- stop words, 153–155, 341
- StratifiedKFold class, 359–360
- streaming data, into memory, 102–103
- strings, 106
- formatting date and time values with, 134
- special directives, 134
- Structured Query Language (SQL), 11, 14, 114
- Subreddit, 432
- support vector machines. See SVM
- Support Vector Regression (SVR), 399–401
- support vectors, SVM, 389
- SVD (Singular Value Decomposition), 276–280
- SVG (Scalable Vector Graphics), 188
- SVM (support vector machines)
- adjusting parameters, 390–392
- classifying with, 392–397
- creating stochastic solution with, 401–406
- defined, 371
- general discussion, 387–390
- hyperplanes, 390
- LinearSVC class, 402–406
- margins, 389
- nonlinear kernels, 398–399
- overfitting, 392
- performing regression with SVR, 399–401
- scaling, 396
- Scikit-learn library, 391
- support vectors, 389
- underfitting, 392
- SVR (Support Vector Regression), 399–401
- syncing Google Colab, 62
T
- Table of Contents, in Google Colab notebooks, 77
- tablets, using code on, 61
- TensorFlow library, 37
- Teradata tool, 13
- Term Frequency times Inverse Document Frequency (TF-IDF) transformations, 162–165, 235, 290
- test datasets, 354–356
- text classifications, predicting, 340–342
- text files
- accessing flat-file datasets from, 106–107
- raw text
- regular expressions, 155–158
- stop words, 153–155
- Unicode and, 153
- TF-IDF (Term Frequency times Inverse Document Frequency) transformations, 162–165, 235, 290
- Theano library, 37
- third-level headings, 54
- Thucydides, 12
- time series
- plotting on axes, 212–214
- trends, 214–216
- timedelta() function, 135
- timeit commands, 241–243
- Titanic tragedy datasets, 412–413, 438–439
- tokenizing, 158
- toy datasets, 100
- training data, 354, 356
- transformer interface, Scikit-learn library, 231, 234
- tree ensembles, 138. See also ensembles
- trendlines, plotting, 212, 214–216
- triple mapping, 152
- t-SNE algorithm, 283–284
- t-tests, EDA, 263–264
- Tukey, John, 252
- tuples, 162
U
- Udacity, 435
- underfitting, SVM, 392
- undirected graphs, 222–223
- Unicode, 38, 153
- univariate approach
- to outliers
- box plots, 318–319
- Chebyshev’s inequality, 320
- Gaussian distribution, 319–320
- overview, 317–318
- winsorizing, 321
- to selecting variables, 360–362
- Unicode Transformation Format 8-bit (UTF-8), 153
- unstructured file form, sending data in, 111–113
- uploading data, into memory, 101–102
V
- validating data
- creating data maps and data plans, 126–128
- removing duplicates, 126
- verifying contents, 124–125
- values, dataset
- van Rossum, Guido, 22
- Vapnik, Vladimir, 387
- variable distributions, 253
- variables, 19, 170. See also features, dataset
- categorical, 129–133
- combining for feature creation, 176–177
- in databases, 100
- dummy, 177
- indicator, 177–178
- variable transformations, 372–375
- variances
- Exploratory Data Analysis, 255–256
- machine learning algorithms, 350
- vectorization, 179
- Visual Studio code support, 45
- visualizing data, 15, 33–35
- bar charts, 203–205
- box plots, 206–208
- directed graphs, 224–225
- Exploratory Data Analysis
- box plots, 262–263
- distributions, 265–266
- overview, 261
- parallel coordinates, 264–265
- scatterplots, 266–267
- t-tests, 263–264
- with graphs, 202–225
- histograms, 205–206
- overview, 201
- pie charts, 202–203
- plotting geographical data
- with Basemap toolkit, 218, 220–221
- deprecated packages, 218–220
- with Notebook, 216–218
- plotting time series
- on axes, 212–214
- trends, 214–216
- scatterplots, 208–212
- undirected graphs, 222–223
W
- web services, 116
- web-based data, 116–118
- whiskers, box plots, 206
- Windows
- Enthought Canopy Express and, 41
- installing Anaconda on, 42–45
- local runtime support on, 63
- Windows 7 system, 18
- WinPython and, 42
- WinPython, 42
- winsorizing, 321
- wisdom of crowds (shared group knowledge), 411
- Wolpert, David, 351
- wrangling data
- clustering
- agglomerative, 305–310
- DBScan, 310–312
- with k-means, 297–305
- overview, 295–297
- defined, 229
- Exploratory Data Analysis
- categorical data, 259–261
- correlations, 268–271
- distributions, 272–274
- numeric data, 253–258
- overview, 251–253
- visualization for, 261–267
- outliers
- anomalies, 316
- concept drift phenomenon, 317
- effect on machine learning algorithms, 315–316
- multivariate approach to, 322–326
- novel data, 316–317
- overview, 313–314
- univariate approach to, 317–321
- overview, 229–230
- reducing data dimensionality
- collaborative filtering, 291–293
- factor analysis, 280–282
- nonlinear dimensionality reduction, 283–285
- Non-Negative Matrix Factorization, 289–291
- overview, 275–276
- principal components analysis, 282–283, 285–289
- singular value decomposition, 276–280
- Scikit-learn library
- application speed and performance and, 240–247
- classes, 230–231
- defining applications for data science, 231–234
- hashing trick, 234–240
- multicore parallelism, 247–250
- object-based interfaces, 231
X
- x axis, graphs, 189
- XML pages, 38
- JSON versus, 119
- parsing, 150–151
- working with web data through, 117
- XPath, 151–152
Z
- Z-score standardization, EDA, 273