Index

Numerics

  • 2-D arrays, 140
  • 3-D arrays, 140
  • 64-bit operating system, 42–43

A

  • ABC language, 22
  • absolute errors, linear regression, 352
  • AdaBoost application, 424–425
  • adjacency matrices, 165–166
  • agglomerative clustering
    • hierarchical cluster solution, 307–308
    • linkage methods, 306
    • metrics, 306
    • overview, 305–306
    • two-phase clustering solution, 308–310
  • aggregation, shaping data through, 146–147
  • AI applications, 13
  • AI For Dummies (Mueller and Massaron), 13
  • algorithms
    • choosing, 33
    • classifiers, 158
    • hyperparameters
      • GBM model, 427–428
      • grid search, 364–368
      • overview, 363–364
      • randomized search, 368–369
    • k-means
      • big data, 304–305
      • centroids, 298–299
      • ground truth, 301–304
      • image data, 299–301
      • overview, 297
    • K-Nearest Neighbors
      • k-parameter, 344–345
      • overview, 342–343
      • predicting, 343–344
    • linear regression
      • limitations of, 333–334
      • with multiple variables, 331–333
      • overview, 329–331
    • logistic regression
      • applying, 335
      • classes, 336–337
      • overview, 334
    • Naïve Bayes, 177
      • classes, 339–340
      • overview, 337–338
      • text classifications, 340–342
    • Random Forest
      • optimizing, 422–424
      • overview, 418–420
      • Random Forest classifier, 420–421
      • Random Forest regressor, 421–422
    • t-SNE algorithm, 283–284
  • align parameter, histograms, 205
  • Al-Kindi, 12
  • Amazon.com review dataset, 444
  • Anaconda. See also IPython; Jupyter Notebook
    • Anaconda Command Prompt, 27–29
    • installing
      • on Linux, 46–47
      • on Mac OS X, 47–48
      • on Windows, 42–45
    • IPython, 29
    • Jupyter QTConsole environment, 29–30
    • Spyder IDE, 30–31
  • Anaconda Command Prompt, 27–29
  • Anaconda Prompt window, 86–87
  • annotations, graph, 197–199
  • ANOVA (Analysis of Variance), 263
  • Apache Spark, 11
  • APIs (application programming interfaces), 37, 116
  • append() method, 143
  • arrays
    • 2-D, 140
    • 3-D, 140
    • n-dimensional, 36
    • performing operations on
      • matrix multiplication, 181
      • matrix vector multiplication, 180
      • simple arithmetic on vectors and matrices, 179–180
      • vectorization, 179
  • AskSam database, 113
  • Aspirational Data Scientist blog, 434
  • autopct parameter, pie charts, 203
  • axes, graph
    • accessing, 189–190
    • formatting, 190–191
    • handles, 190
    • labels, 198
    • plotting time series, 212–214

B

  • backends, plotting
    • defined, 96
    • MatPlotLib, 189
  • backward approach, selecting variables, 362
  • bag of words model
    • implementing TF-IDF transformations, 162–165
    • understanding bag of words model, 159–160
    • working with n-grams, 161–162
  • bagging techniques, machine learning, 419–420, 425
  • bar charts, 186, 203–205
  • Basemap toolkit
    • creating Basemap environment, 217–218
    • deprecated packages, 218–220
    • installing, 218
    • plotting geographical data, 220–221
  • Beautiful Soup library, 38
  • Beginning Programming with Python For Dummies, 2nd Edition (Mueller), 10, 25, 34
  • Bell Labs, 252
  • benchmarking, 241–243
  • bias
    • in data, 173
    • machine learning algorithms, 350
  • big data. See also reducing data dimensionality
    • AI and, 13
    • clustering, 304–305
    • data science and, 13
    • defined, 275, 383
    • SGD optimization, 383–386
  • binning
    • defined, 123
    • in feature creation, 177
    • transforming numerical variable into categorical ones, 259–260
  • bins, histogram, 205
  • black box, defined, 420
  • Boolean values, 151–152
  • Boston dataset
    • dividing between training and test sets, 354–356
    • performing regression with SVR class, 400–401
    • testing and detecting interactions between variables, 375–379
    • variable transformations, 372–373
  • Bower, Jonathan, 436
  • box plots, 186, 206–208
    • Exploratory Data Analysis, 262–263
    • outliers and, 318–319
    • quartiles, 206
    • whiskers, 206
  • branches, decision trees, 414
  • Breiman, Leo, 419
  • bunches, data, 34, 160

C

  • C parameter, SVM LinearSVC class, 402
  • Canadian Institute for Advanced Research (CIFAR) datasets, 443
  • Canopy Express, 41–42
  • cases, database, 100, 170
  • categorical variables
    • Exploratory Data Analysis
      • contingency tables, 261
      • frequencies, 259–260
      • overview, 259
    • levels
      • combining, 132–133
      • renaming, 131–132
    • overview, 129–131
  • C/C++ language, 10
  • cells, Jupyter Notebook, 51–52
  • central tendency, EDA, 254–255
  • centroid-based algorithms, 298–299
  • Chebyshev’s inequality, 320
  • checkpoints, Jupyter Notebook, 95
  • chi-square test for tables, EDA, 271
  • Chrome browser, 61
  • CIFAR (Canadian Institute for Advanced Research) datasets, 443
  • classes. See also names of specific classes
    • logistic regression algorithm, 336–337
    • Naïve Bayes algorithm, 339–340
  • classifiers, defined, 158
  • classification trees, 415–417
  • Cleveland, William S., 12
  • cluster analysis, 177, 324–325
  • clustering
    • agglomerative
      • hierarchical cluster solution, 307–308
      • linkage methods, 306
      • metrics, 306
      • overview, 305–306
      • two-phase clustering solution, 308–310
    • cross-tabulation, 301–302
    • DBScan, 310–312
    • ground truth, 301–302
    • inertia, 302–304
    • with k-means algorithm
      • big data, 304–305
      • centroid-based algorithms, 298–299
      • ground truth, 301–302
      • image data, 299–301
      • overview, 297
    • overview, 295–297
  • CNTK (Microsoft Cognitive Toolkit), 37
  • code execution, checking, 78–79
  • code repository
    • adding notebook content, 53–55
    • creating new notebooks, 51–53
    • defining new folders, 50–51
    • exporting notebooks, 55
    • importing notebooks, 56–57
    • removing notebooks, 55–56
  • coding style, 33
  • Colaboratory. See Google Colab
  • collaborative filtering, 291–293
  • colors, on graphs, 188, 193, 194–195
  • colors parameter
    • bar charts, 204
    • histograms, 205
    • pie charts, 202
  • columns, dataset, 140–141. See also features, dataset
  • COM (Component Object Model) applications, 116
  • comma-separated value (CSV) files, 106–109
  • Common Crawl 2012 web corpus, 444–445
  • competencies of data scientists, 12–13
  • Component Object Model (COM) applications, 116
  • concat() method, 143
  • concatenating data
    • adding new cases and variables, 142–144
    • removing data, 144
    • sorting and shuffling, 145–146
  • concept drift phenomenon, 317
  • Conductrics site, 435
  • contingency tables, EDA, 261
  • Continuum Analytics Anaconda, 40–41
  • correlations
    • Exploratory Data Analysis
      • chi-square test for tables, 271
      • covariance and, 268–270
      • nonparametric, 270–271
    • showing on scatterplots, 211–212
  • counterclock parameter, pie charts, 202
  • CountVectorizer function, 239–241
  • covariances, EDA, 268–270
  • cross-tabulation, clustering, 301–302
  • cross-validation
    • on k-folds, 357–358
    • multicore parallelism, 248
    • overview, 356–357
    • sampling and, 358–360
  • CSV (comma-separated value) files, 106–109
  • cube root transformations, 178
  • cycle_graph() template, 166

D

  • data analysis, 13, 32. See also EDA
  • data capture, 13
  • data correlations. See correlations
  • data groupings
    • depicting on box plots, 206–208
    • depicting on scatterplots, 209–210
  • data maps, 126–128
  • data munging. See wrangling data
  • data plans, 126–128
  • data science
    • AI and, 13
    • big data and, 13
    • core competencies of data scientists, 12–13
    • history of, 12
    • overview, 9–10
    • pipeline
      • exploratory data analysis, 15
      • learning from data, 15
      • preparing data, 15
      • understanding meaning of data, 16
      • visualization, 15
    • programming languages and, 14
    • Python and
      • choosing, reasons for, 17–18
      • contributions to, 23–24
      • fusing data science and application development, 16–17
      • loading data, 19
      • training models, 19
      • viewing results, 19–20
    • working with problems in
      • evaluating, 171
      • formulating hypotheses, 174
      • preparing data, 174
      • researching solutions, 173–174
  • Data Science Central, 433–434
  • data wrangling. See wrangling data
  • Database Administrators (DBAs), 11
  • Database Management Systems (DBMSs), 11, 114, 115
  • dataframes
  • DataFrame.to_sql() function, 114
  • datasets, 34, 170, 437–445
    • Amazon.com review dataset, 444
    • CIFAR datasets, 443
    • Common Crawl 2012 web corpus, 444–445
    • downloading, 48–58
    • flat-file
      • CSV delimited format, 107–109
      • Excel, 109–110
      • Microsoft Office files, 109–110
      • text files, 106–107
    • handwritten data, 442
    • high-dimensional sparse datasets, 160
    • image datasets, 443
    • Kaggle competition, 438
    • Kaggle site, 439
    • large datasets, 444–445
    • Madelon Data Set, 440
    • MNIST dataset, 442
    • MovieLens site, 440–441
    • NIST dataset, 442
    • overview, 437
    • pattern recognition, 442
    • size, 32
    • Spambase Data Set, 441
    • Titanic tragedy datasets, 438–439
    • used in book, 57–58
  • dates in data
    • formatting date and time values, 134
    • overview, 133–134
    • time transformation, 135
  • DBAs (Database Administrators), 11
  • DBMSs (Database Management Systems), 11, 114, 115
  • debugging codes, 33
  • decision trees
    • branches, 414
    • classification trees, 415–417
    • general discussion, 412–415
    • leaf nodes, 414–415
    • Random Forest algorithm
      • optimizing, 422–424
      • overview, 418–420
      • Random Forest classifier, 420–421
      • Random Forest regressor, 421–422
    • regression trees, 417–418
  • deep learning, 37
    • hardware for, 406
    • neural networks, 371
      • classifying with, 408–410
      • deep learning, 406
      • multilayer perceptron, 408–410
      • overview, 406–407
      • regressing with, 408–410
  • deprecated packages, 218–220
  • describe() function, 127
  • dicing data, 141
  • directed graphs, 224–225
  • discretization, 177
  • distributions
    • defined, 178
    • Exploratory Data Analysis, 265–266
      • transforming, 273–274
      • using different statistical distributions, 272
      • Z-score standardization, 273
    • showing with histograms, 205–206
  • Domingos, Pedro, 176
  • double mapping, 152
  • drop() method, 144
  • drop_duplicates() function, 126
  • dual parameter, SVM LinearSVC class, 402
  • dummy variables, 177
  • duplicates
    • effect on data results, 124–125
    • removing, 126

E

  • EDA (Exploratory Data Analysis), 15
    • categorical data
      • contingency tables, 261
      • frequencies, 259–260
      • overview, 259
    • correlations
      • chi-square test for tables, 271
      • covariance and, 268–270
      • nonparametric, 270–271
    • distributions
      • transforming, 273–274
      • using different statistical distributions, 272
      • Z-score standardization, 273
    • nonlinear transformations, 372
    • numeric data
      • central tendency, 254–255
      • kurtosis, 257–258
      • means, 254–255
      • medians, 254–255
      • normality, 257–258
      • overview, 253–254
      • percentiles, 256–257
      • range, 256
      • skewness, 257–258
      • variance, 255–256
    • overview, 251–253
    • visualization for
      • box plots, 262–263
      • distributions, 265–266
      • overview, 261
      • parallel coordinates, 264–265
      • scatterplots, 266–267
      • t-tests, 263–264
  • ElasticNet class, linear models, 382–383
  • Encapsulated PostScript (EPS), 188
  • encoding
    • defined, 38
    • missing data, 137–138
    • one-hot-encoding, 236–238
  • ensembles
    • boosting, 424–428
    • decision trees
      • classification trees, 415–417
      • general discussion, 412–415
      • Random Forest algorithm, 418–424
      • regression trees, 417–418
    • imputing missing data, 138
    • overview, 411
  • Enthought Canopy Express, 41–42
  • enumerations, 129
  • EPS (Encapsulated PostScript), 188
  • error bar charts, 186
  • error messages
    • Firefox, 61
    • indentation and, 26
  • estimator interface, Scikit-learn library, 231–233
  • ETL (Extract, Transformation, and Loading) specialists, 13
  • Excel files, 106, 107, 109–110
  • explode parameter, pie charts, 202
  • Exploratory Data Analysis. See EDA
  • exponential transformations, 178
  • Extract, Transformation, and Loading (ETL) specialists, 13
  • Extremely Randomized Trees machine learning technique, 325–326

F

  • factor analysis
    • hidden factors, 281–282
    • psychometrics, 280–281
  • features, dataset. See also variables
    • defined, 100, 170
    • feature creation
      • binning, 177
      • combining variables, 176–177
      • defining, 175–176
      • discretization, 177
      • transforming distributions, 178
      • using indicator variables, 177–178
  • “A Few Useful Things to Know about Machine Learning” paper (Domingos), 176
  • filtering data
    • collaborative filtering, 291–293
    • dicing, 141
    • overview, 139
    • slicing, 140–141
  • finalized code, 11
  • Firefox browser, 18
    • configuring to use local runtime support, 63
    • error dialog box, 61
  • flat-file datasets
    • CSV delimited format, 107–109
    • Excel, 109–110
    • Microsoft Office files, 109–110
    • text files, 106–107
  • folders, Jupyter Notebook, 50–51
  • Forecastwatch.com, 23–24
  • Fortran language, 10
  • frame-of-reference mistruths, 173
  • frequencies, EDA, 259–260
  • functional coding, 17
  • Functional Programming For Dummies (Mueller), 11
  • functions. See also names of specific functions
    • hash functions, 235
    • magic functions
      • accessing lists, 90
      • working with, 91
    • print, 25

G

  • Gaussian distribution, 257, 319–320
  • GBM (Gradient Boosting Machine), 425–428
  • geographical data
    • deprecated packages, 218–220
    • plotting with Basemap toolkit, 218, 220–221
    • plotting with Notebook, 216–218
  • GitHub, 436
    • opening existing Google Colab notebooks in, 67–68
    • saving Google Colab notebooks to, 69–70
    • storing Google Colab notebooks on, 64
  • GitHubGist, 70
  • GMT (Greenwich Mean Time), 134
  • Google Accounts
    • creating, 64
    • overview, 63
    • signing in, 64–65
  • Google Colab, 59–80
    • code cells
      • Add a Comment option, 72
      • Add a Form option, 72
      • Clear Output option, 72
      • Delete Cell option, 72
      • Link to Cell option, 72
      • View Output Fullscreen option, 72
    • editing cells, 74–75
    • example projects, 66
    • executing code, 76
    • getting help, 80
    • Google Account
      • creating, 64
      • overview, 63
      • signing in, 64–65
    • hardware acceleration, 75–76
    • Help menu, 80
    • Jupyter Notebook compared to, 61–63
    • local runtime support, 63
    • moving cells, 75
    • notebooks
      • checking code execution, 78–79
      • creating new, 65–66
      • downloading, 71
      • opening existing, 66–68
      • saving, 68–70
      • sharing, 79
      • table of contents, 77
      • viewing notebook information, 77–78
    • overview, 60–61
    • special cells
      • headings, 74
      • overview, 73
      • table of contents, 74
    • text cells, 72–73
    • Welcome page, 65
  • Google Docs, 63
  • Google Drive
    • opening existing Google Colab notebooks in, 66–67
    • revision history, 68
    • saving Google Colab notebooks on, 68–69
  • Gradient Boosting Machine (GBM), 425–428
  • graphical user interface (GUI), 29–30
  • graphics
    • CIFAR datasets, 443
    • integrating into Jupyter Notebook
      • embedding images, 96–98
      • embedding plots, 96
      • loading examples from online sites, 96
  • graphs
    • adding grids to, 191–192
    • adjacency matrices, 165
    • annotations, 197–199
    • axes
      • accessing, 189–190
      • formatting, 190–191
      • handles, 190
      • plotting time series, 212–214
    • bar charts, 203–205
    • box plots, 206–208
    • building with NetworkX basics, 166–167
    • defining line appearance on
      • adding markers, 195–197
      • line styles, 193–194
      • using colors, 194–195
    • directed, 224–225
    • histograms, 205–206
    • labels, 197–198
    • legends, 197, 199–200
    • MatPlotLib library
      • defining plots, 186–187
      • drawing multiple lines and plots, 187–188
      • saving work to disk, 188–189
    • pie charts, 202–203
    • plotting geographical data, 216–221
    • plotting time series, 212–216
    • scatterplots, 208–212
    • undirected, 222–223
  • greedy approach, to selecting variables, 362
  • Greenwich Mean Time (GMT), 134
  • grid() function, 192
  • grid searching
    • hyperparameters and, 364–368
    • multicore parallelism, 248
  • grids, adding to graphs, 191–192
  • ground truth, clustering, 301–302
  • groupby() function, 127
  • groups, data
    • depicting on box plots, 206–208
    • depicting on scatterplots, 209–210
  • Grover, Prince, 413
  • GUI (graphical user interface), 29–30

H

  • hairballs
    • defined, 165
    • using NetworkX to avoid, 166
  • handles, axes, 190
  • handwritten data, datasets, 442
  • hardware acceleration, Google Colab, 75–76
  • hashing trick, Scikit-learn library
    • demonstrating, 235–238
    • hash functions, 235
    • overview, 234–235
    • sparse matrices, 239–240
  • HashingVectorizer function, 239–241
  • hatch parameter, bar charts, 204
  • Help menu, Google Colab, 80
  • Help mode, Python
    • entering, 88
    • exiting, 89
    • requesting help in, 88–89
  • hierarchical clustering. See agglomerative clustering
  • high-dimensional sparse datasets, 160
  • histograms, 186
    • bins, 205
    • showing distributions with, 205–206
  • History of the Peloponnesian War (Thucydides), 12
  • HTML data
    • parsing into Beautiful Soup library, 38
    • parsing XML and, 150–151
    • using XPath for data extraction, 151–152
  • hyperparameters of algorithms
    • GBM model, 427–428
    • grid search, 364–368
    • overview, 363–364
    • randomized search, 368–369
  • hyperplanes, SVM, 390
  • hypotheses
    • defined, 233
    • formulating, 174

I

  • IDA (Initial Data Analysis), 252
  • IDE (Integrated Development Environment), 30
  • IDF (Inverse Document Frequency), 163
  • image data
    • clustering, 299–301
    • cropping, 112
    • embedding images into Notebook, 96
    • flattening, 113
    • generating variations on, 103–104
    • resizing, 112
    • as unstructured files, 111
  • image datasets, 443
  • imperative coding, 17
  • importance estimation, 420
  • indentation, in Python, 26
  • index, dataset, 136, 141, 143
  • indicator variables, 177–178
  • inertia, clustering, 302–304
  • Informatica tool, 13
  • information redundancy, 268
  • information resources
    • Aspirational Data Scientist blog, 434
    • Conductrics, 435
    • Data Science Central, 433–434
    • GitHub, 436
    • KDnuggets, 432
    • Open-Source Data Science Masters, 436
    • Oracle Data Science blog, 433
    • Quora, 432–433
    • Subreddit, 432
    • Udacity, 435
  • Information Retrieval (IR), 159
  • Initial Data Analysis (IDA), 252
  • installing
    • Anaconda
      • on Linux, 46–47
      • on Mac OS X, 47–48
      • on Windows, 42–45
    • Basemap toolkit, 218
    • preferred installer program, 244
  • Integrated Development Environment (IDE), 30
  • Interactive Python. See IPython
  • International Council for Science, 12
  • Internet Explorer browser, 63
  • Inverse Document Frequency (IDF), 163
  • inverse transformations, 178
  • IPython
    • help, 89–90, 92
    • Jupyter Console, 84–92
    • Jupyter Notebook and, 59
    • overview, 29
    • playing with data, 33
  • IR (Information Retrieval), 159
  • Iris dataset
    • classification trees, 415–419
    • correlation and covariance, 268–271
    • counting for categorical data, 259–261
    • factor analysis, 281–283
    • hyperparameters optimization, 363–364
    • loading, 253–254
    • logistic regression algorithm, 335
    • visualizing data, 262–267

J

  • Java, 11
  • Java Virtual Machine (JVM), 11
  • JavaScript Object Notation (JSON), 119
  • join() method, 143
  • Journal of Data Science, 12
  • jQuery, 116
  • JSON (JavaScript Object Notation), 119
  • Jupyter Console
    • changing window appearance, 86–87
    • discovering objects
      • getting object help, 91
      • obtaining object specifics, 92
      • using IPython object help, 92
    • interacting with screen text, 84–86
    • IPython help, 89–90
    • magic functions, 90–91
    • Python help, 87–89
  • Jupyter Notebook, 18
    • code repository
      • adding notebook content, 53–55
      • creating new notebooks, 51–53
      • defining new folders, 50–51
      • exporting notebooks, 55
      • importing notebooks, 56–57
      • removing notebooks, 55–56
    • Google Colab and, 59, 61–63
    • graphs in, 188
    • integrating multimedia into
      • embedding images, 96–98
      • embedding plots, 96
      • loading examples from online sites, 96
    • plotting geographical data, 216–218
    • restarting kernels, 94–95
    • restoring checkpoints, 95
    • starting, 49–50
    • stopping server, 50
    • working with styles, 93–94
  • Jupyter QTConsole environment, 29–30
  • JVM (Java Virtual Machine), 11

K

  • Kaggle data science competition, 176, 412, 438
  • Kaggle site, 439
  • KDnuggets site, 432
  • Keras library, 37
  • kernels
    • nonlinear, 398–399
    • restarting
      • using graphs, 189
      • using notebook backend, 189
  • keys, defined, 34
  • k-folds, cross-validation on, 357–358
  • k-means algorithm
    • big data, 304–305
    • centroid-based algorithms, 298–299
    • ground truth, 301–304
    • image data, 299–301
    • overview, 297
  • KNN (K-Nearest Neighbors) algorithm
    • hyperparameters, 365–368
    • k-parameter, 344–345
    • overview, 342–343
    • predicting, 343–344
  • kurtosis, EDA, 257–258

L

  • L1 type (Lasso) regularization
    • combining with L2 regularization, 382–383
    • overview, 381
  • L2 type (Ridge) regularization
    • combining with L1 regularization, 382–383
    • overview, 380–381
  • labels, using on graphs, 197–198
  • labels parameter, pie charts, 202
  • lambda calculus, 17
  • languages, programming. See programming languages
  • Lasso regularization. See L1 type regularization
  • leaf nodes, decision trees, 414
  • learning from data. See machine learning
  • legends, graph, 197, 199–200
  • levels, categorical variables
    • combining, 132–133
    • defined, 129
    • renaming, 131–132
  • libraries, 35–38. See also names of specific libraries
  • line graphs, 186, 214
    • adding markers, 195–197
    • using colors, 192, 194–195
    • working with line styles, 193–194
  • linear models
    • nonlinear transformations
      • main effects model, 375–379
      • variable transformations, 372–375
    • regularization of
      • ElasticNet class, 382–383
      • Lasso (L1 type), 381
      • leveraging, 382
      • overview, 379–380
      • Ridge (L2 type), 380–381
  • linear regression algorithm
    • limitations of, 333–334
    • with multiple variables, 331–333
    • overview, 329–331
    • R squared measure for, 352
  • LinearSVC class, SVM, 402–406
  • linkage methods, agglomerative clustering, 306
  • Linux
    • Enthought Canopy Express and, 41
    • installing Anaconda on, 46–47
    • local runtime support on, 63
  • lists, output, 109
  • loading data on Python
    • overview, 19
    • speed of data analysis and, 33
  • local runtime support
    • on Google Colab, 63
    • on Linux, 63
    • on Mac OS X, 63
    • on Windows, 63
  • local storage, 68
  • logarithm transformations, variable, 178, 274
  • logistic regression algorithm
    • applying, 335
    • classes, 336–337
    • overview, 334
  • loss parameter, SVM LinearSVC class, 402

M

  • Mac OS X
    • Enthought Canopy Express and, 41
    • installing Anaconda on, 47–48
    • using local runtime support on, 63
  • machine learning. See also algorithms
    • big data, 383–386
    • cross-validation
      • on k-folds, 357–358
      • overview, 356–357
      • sampling, 358–360
    • ensembles
      • boosting, 424–428
      • decision trees, 412–424
      • overview, 411
    • hyperparameters, 363–369
    • linear models
      • nonlinear transformations, 372–379
      • regularization of, 379–383
    • model fitting
      • bias, 350
      • overview, 348–349
      • strategy for, 350–353
      • test sets, 354–356
      • training, 354–356
      • variance, 350
    • neural networks
      • classifying with, 408–410
      • deep learning, 406
      • multilayer perceptron, 408–410
      • overview, 406–407
      • regressing with, 408–410
    • no-free-lunch theorem, 351
    • optimization
      • grid search, 364–368
      • overview, 363–364
      • randomized search, 368–369
    • selection
      • greedy, 362
      • RFECV class, 363
      • by univariate measures, 360–362
    • Stochastic Gradient Descent optimization, 383–386
    • support vector machines
      • adjusting parameters, 390–392
      • classifying with, 392–397
      • creating stochastic solution with, 401–406
      • general discussion, 387–390
      • hyperplanes, 390
      • margins, 389
      • nonlinear kernels, 398–399
      • overfitting, 392
      • performing regression with SVR, 399–401
      • scaling, 396
      • Scikit-learn library, 391
      • underfitting, 392
  • Machine Learning For Dummies (Mueller and Massaron), 13
  • Madelon Data Set, 440
  • magic functions
    • accessing lists, 90
    • working with, 91
  • main effects model, 375–379
  • manifold learning (nonlinear dimensionality reduction), 283–285
  • Manuscript on Deciphering Cryptographic Messages, 12
  • margins, SVM, 389
  • Markdown cells, 53
  • markers on graphs, 195–197
  • MATLAB, 14, 185–186
  • MATLAB For Dummies (Mueller), 14, 185
  • MatPlotLib library, 32, 38, 185–200. See also graphs
    • adding grids, 191–192
    • analyzing image data with, 102
    • annotating charts, 198–199
    • creating legends, 199–200
    • defining line appearance
      • adding markers, 195–197
      • using colors, 194–195
      • working with line styles, 193–194
    • graphs
      • defining plots, 186–187
      • drawing multiple lines and plots, 187–188
      • saving work to disk, 188–189
    • labels, 197–198
    • setting axes
      • accessing, 189–190
      • formatting, 190–191
  • matrices
    • adjacency, 165
    • multiplication, 181
    • scipy.sparse matrix, 160
    • sparse, 239–240
    • vector multiplication, 180
  • mean function, 139
  • means, EDA, 254–255
  • median function, 139
  • medians, EDA, 254–255
  • memory
    • memory profiler, 244–245, 247
    • streaming data into, 102–103
    • uploading data into, 101–102
  • metrics, agglomerative clustering, 306
  • microservices, 117
  • Microsoft Cognitive Toolkit (CNTK), 37
  • Microsoft Office files, 109–110
  • missing data
    • encoding, 137–138
    • finding, 136–137
    • imputing, 138–139
  • mistruths in data
    • bias, 173
    • commission, 172
    • frame of reference, 173
    • omission, 172
    • perspective, 172–173
  • MLP (multilayer perceptron), 408–410
  • MNIST (Modified National Institute of Standards and Technology) dataset, 442
  • model fitting
    • bias, 350
    • no-free-lunch theorem, 351
    • overview, 348–349
    • ROC AUC, 353
    • strategy for, 350–353
    • test sets, 354–356
    • training, 354–356
    • variance, 350
  • model interface, Scikit-learn library, 231, 234
  • MongoDB database, 115
  • most_frequent function, 139
  • movie recommendations, 291–293
  • MovieLens site, 440–441
  • MS SSIS tool, 13
  • multicore parallelism (multiprocessing), 247–250
  • multilabel prediction, 248
  • multilayer perceptron (MLP), 408–410
  • multimedia, integrating in Jupyter Notebook
    • embedding images, 96
    • embedding plots, 96
    • loading examples from online sites, 96
    • obtaining online, 96–98
  • multiprocessing (multicore parallelism), 247–250
  • multivariate approach to outliers
    • cluster analysis, 324–325
    • Extremely Randomized Trees machine learning technique, 325–326
    • principal component analysis, 322–324
    • Random Forests machine learning technique, 325–326
  • multivariate correlation, 175
  • munging. See wrangling data
  • MySQL database, 115
  • MySQLdb library, 24

N

  • Naïve Bayes algorithm, 177, 375
    • classes, 339–340
    • overview, 337–338
    • text classifications, 340–342
  • National Institute of Standards and Technology (NIST) dataset, 442
  • Natural Language Processing (NLP), 159
  • Natural Language Toolkit (NLTK), 154
  • n-dimensional arrays, 36
  • NetworkX library, 38, 166–167
  • neural networks, 371
    • classifying with, 408–410
    • deep learning, 406
    • multilayer perceptron, 408–410
    • overview, 406–407
    • regressing with, 408–410
  • n-grams, 161–162
  • NIST (National Institute of Standards and Technology) dataset, 442
  • NLP (Natural Language Processing), 159
  • NLTK (Natural Language Toolkit), 154
  • NMF (Non-Negative Matrix Factorization), 289–291
  • no-free-lunch theorem, 351
  • nonlinear dimensionality reduction (manifold learning), 283–285
  • nonlinear kernels, SVM, 398–399
  • nonlinear transformations
    • main effects model, 375–379
    • variable transformations, 372–375
  • Non-Negative Matrix Factorization (NMF), 289–291
  • nonparametric correlations, EDA, 270–271
  • normality, EDA, 257–258
  • NoSQL (Not only SQL) databases, 115–116
  • Notebook. See Jupyter Notebook
  • notebooks, Google Colab, 65–71
    • creating new, 65–66
    • downloading, 71
    • opening existing
      • in GitHub, 67–68
      • in Google Drive, 66–67
      • local storage, 68
    • saving
      • on GitHub, 69–70
      • on GitHubGist, 70
      • on Google Drive, 68–69
    • sharing, 79
    • viewing
      • checking code execution, 78–79
      • displaying table of contents, 77
      • getting notebook information, 77–78
  • novel data, 316–317
  • numeric data, EDA
    • central tendency, 254–255
    • kurtosis, 257–258
    • means, 254–255
    • medians, 254–255
    • normality, 257–258
    • overview, 253–254
    • percentiles, 256–257
    • range, 256
    • skewness, 257–258
    • variance, 255–256
  • NumPy library
    • arrays tool, 246
    • computing covariance and correlation matrices, 269–270
    • functions, 36
    • matrix multiplication, 181
    • matrix vector multiplication, 180
    • n-dimensional arrays, 36, 179
    • researching solutions, 174
    • shaping data with, 122

O

  • object-based interfaces, Scikit-learn library, 231
  • object-oriented coding, 18
  • objects, 91–92
    • getting object help, 91
    • IPython object help, 92
    • obtaining object specifics, 92
  • ODBC (Open Database Connectivity), 115
  • one-hot-encoding, 236–238
  • online resources
    • Aspirational Data Scientist, 434
    • Cheat Sheet, 4
    • Conductrics, 435
    • Cross Validated, 174
    • Data Science Central, 433–434
    • directives list, 134
    • GitHub, 436
    • Google Scholar, 174
    • Imputer parameters, 138
    • Internet World Stats, 1
    • John Mueller blog, 5
    • KDnuggets, 432
    • list of distributions, 178
    • MatPlotLib graph types, 186
    • Microsoft Academic Search, 174
    • Open-Source Data Science Masters, 436
    • Oracle Data Science Blog, 433
    • parsers for CSV files, 108–109
    • Python Enhancement Proposals, 25
    • Python tutorials, 3
    • pythonclock.org, 22
    • Quora, 174, 432–433
    • read table method arguments, 107
    • regular expressions, 155
    • source codes, 5
    • Stack Overflow, 174
    • standard graph types list, 166
    • Subreddit, 432
    • telephone number manipulation routines, 157
    • Udacity, 435
    • Unicode encodings, 153
    • Unicode problems in Python, 153
    • working with databases, 115
  • Open Database Connectivity (ODBC), 115
  • Open-Source Data Science Masters (OSDSM), 436
  • optimization
    • grid search, 364–368
    • overview, 363–364
    • randomized search, 368–369
  • Oracle Data Science blog, 433
  • OSDSM (Open-Source Data Science Masters), 436
  • outliers, 15, 177
    • anomalies, 316
    • concept drift phenomenon, 317
    • effect on machine learning algorithms, 315–316
    • multivariate approach to
      • cluster analysis, 324–325
      • Extremely Randomized Trees machine learning technique, 325–326
      • principal component analysis, 322–324
      • Random Forests machine learning technique, 325–326
    • novel data, 316–317
    • overview, 313–314
    • univariate approach to
      • box plots, 318–319
      • Chebyshev’s inequality, 320
      • Gaussian distribution, 319–320
      • overview, 317–318
      • winsorizing, 321
  • overfitting
    • online resources, 440
    • SVM, 392

P

  • pandas library
    • categorical data, 259
    • checking current version of, 129
    • CSV files and, 107
    • data analysis with, 36
    • measuring central tendency, 254–255
    • NaN output, 131
    • parsers, 106
    • reading flat-file data, 106
    • removing duplicates, 126
    • researching solutions, 174
    • shaping data with, 122–123
    • working with Excel worksheets in, 110
  • parallel coordinates, EDA, 264–265
  • parameters, SVM, 390–392
  • Parr, Terence, 413
  • parsers, 106
  • parsing HTML data, 150–151
  • PATH environment variable, 45
  • pattern matching, 156, 442
  • PCA (principal components analysis)
    • facial recognition, 285–289
    • outliers and, 322–324
    • overview, 282–283
  • PDF (Portable Document Format), 188
  • Pearson correlation, 269–270
  • penalty parameter, SVM LinearSVC class, 402
  • PEPs (Python Enhancement Proposals), 25
  • percentiles, EDA, 256–257
  • pie charts, 186, 202–203
  • PIP (preferred installer program), 244
  • pipeline, data science
    • Exploratory Data Analysis, 15
    • learning from data, 15
    • overview, 14
    • preparing data, 15
    • in prototyping, 32
    • understanding meaning of data, 16
    • visualizing data, 15
  • plotting data
    • geographical data
      • with Basemap toolkit, 218, 220–221
      • deprecated packages, 218–220
      • with Notebook environment, 216–218
    • plots
      • defined, 186–187
      • multiple plot lines, 187–188
    • time series
      • on axes, 212–214
      • trends, 214–216
  • polynomial transformations, 178
  • Portable Document Format (PDF), 188
  • Portable Network Graphics (PNG) format, 188
  • PostgreSQL database, 115
  • PostScript (PS), 188
  • PowerPoint, 14
  • predictor interface, Scikit-learn library, 231, 233
  • preferred installer program (PIP), 244
  • principal components analysis. See PCA
  • procedural coding, 18
  • programming languages. See also Python
    • choosing, 10–11
    • data science and, 14
    • Java, 11
    • Natural Language Toolkit, 154
    • R, 10–11
    • Scala, 11
    • SQL, 11, 114
    • XPath, 151–152
  • prototyping, 31–32
  • PS (PostScript), 188
  • PSF (Python Software Foundation), 22
  • PyMongo library, 115
  • Python
    • Anaconda
      • Anaconda Command Prompt, 27–29
      • installing, 42–47
      • IPython, 29
      • Jupyter QTConsole environment, 29–30
      • Spyder IDE, 30–31
    • contributions to data science, 23–24
    • developments of, 22
    • factors affecting speed of execution, 32–33
    • goals of, 24–25
    • help mode
      • entering, 88
      • exiting, 89
      • requesting help in, 88–89
    • indentation, 26
    • interactive help, 89
    • issues with flat-file headers, 106
    • language statements, 25–26
    • libraries
      • Beautiful Soup, 38
      • Keras and TensorFlow, 37
      • MatPlotLib, 38
      • NetworkX, 38
      • NumPy, 36
      • pandas, 36
      • Scikit-learn, 36–37
      • SciPy, 35
    • licensing issues, 22
    • overview, 10
    • performing rapid prototyping and experimentation, 31–32
    • philosophy, 23
    • Python 2.x, 22
    • Python 3.x, 22, 153
    • role in data science and, 16–20
    • streaming data using, 102
    • visualizing data, 33–35
    • working with, 25–31
  • Python Enhancement Proposals (PEPs), 25
  • Python interpreter, 27
  • Python Software Foundation (PSF), 22
  • pythonclock.org, 22
  • python-history.blogspot, 23
  • Python(x,y), 42

Q

  • QTConsole, 30
  • quartiles, box plots, 206
  • Quixote display framework, 24
  • Quora, 432–433

R

  • R (programming language), 10–11
  • R squared measure, for linear regression, 352
  • radial basis function (rbf) kernel, 398
  • Random Forest algorithm
    • optimizing, 422–424
    • overview, 418–420
    • Random Forest classifier, 420–421
    • Random Forest regressor, 421–422
  • Random Forest classifier, 420–421
  • Random Forest regressor, 421–422
  • Random Forests machine learning technique, 325–326
  • random sampling, 105
  • randomized search, 368–369
  • random.shuffle() method, 145
  • ranges, EDA, 256
  • rbf (radial basis function) kernel, 398
  • read_sql() method, 114
  • read_sql_query() method, 114
  • read_sql_table() method, 114
  • read_table() method arguments, 107
  • Receiver Operating Characteristic Area Under Curve (ROC AUC), 353
  • reducing data dimensionality
    • collaborative filtering, 291–293
    • factor analysis
      • hidden factors, 281–282
      • psychometrics, 280–281
    • nonlinear dimensionality reduction, 283–285
    • non-negative matrix factorization, 289–291
    • overview, 275–276
    • principal components analysis, 282–283, 285–289
    • singular value decomposition, 276–280
    • t-SNE algorithm, 283–284
  • regression
    • linear regression algorithm
      • limitations of, 333–334
      • with multiple variables, 331–333
      • overview, 329–331
      • R squared measure for, 352
    • logistic regression algorithm
      • applying, 335
      • classes, 336–337
      • overview, 334
    • performing with SVM, 399–401
    • regression trees, 417–418
  • regular expressions, 155–158
  • regularization of linear models
    • ElasticNet class, 382–383
    • Lasso (L1 type), 381
    • leveraging, 382
    • overview, 379–380
    • Ridge (L2 type), 380–381
  • relational databases, managing data from, 113–115
  • repository, code. See code repository
  • researching solutions, 173–174
  • reset_index() method, 143
  • RFECV class, 363
  • Ridge regularization. See L2 type regularization
  • ROC AUC (Receiver Operating Characteristic Area Under Curve), 353
  • root nodes, 118
  • root words, 153
  • rows, database, 100, 140

S

  • sampling data
    • cross-validation and, 358–360
    • overview, 104–105
    • random sampling, 105
  • saving
    • Google Colab notebooks
      • on GitHub, 69–70
      • on GitHubGist, 70
      • on Google Drive, 68–69
    • Jupyter Notebook files, 55, 188–189
    • MatPlotLib library work to disk, 188–189
  • Scala, 11
  • Scalable Vector Graphics (SVG), 188
  • scaling, SVM, 396
  • scatterplots
    • depicting groups, 209–210
    • Exploratory Data Analysis, 266–267
    • overview, 208–209
    • showing correlations, 211–212
  • Scikit-learn library, 34, 57
    • application speed and performance
      • benchmarking, 241–243
      • memory profiler, 244–245, 247
      • overview, 240–241
    • classes, 230–231
    • conda, 244
    • defining applications for data science, 231–234
    • hashing trick
      • demonstrating, 235–238
      • hash functions, 235
      • overview, 234–235
      • sparse matrices, 239–240
    • hyperparameters optimization, 363–364
    • model fitting and, 351–352
    • multicore parallelism, 247–250
    • object-based interfaces, 231
    • overview, 36–37
    • preferred installer program, 244
    • researching solutions, 174
    • SVM and, 391
    • toy datasets, 100
    • 20 Newsgroups dataset, 159
  • SciPy library, 35
    • researching solutions, 174
    • sparse matrices, 239
  • scipy.sparse matrix, 160
  • screen text, Jupyter Console, 84–86
  • Search Code Snippets option, Google Colab Help menu, 80
  • selecting data
    • greedy approach to, 362
    • RFECV class, 363
    • univariate approach to, 360–362
  • SelectPercentile class, 360–361
  • SGD (Stochastic Gradient Descent), 383–386, 409
  • shadow parameter, pie charts, 203
  • shaping data, 32
    • categorical variables
      • combining levels, 132–133
      • creating, 130–131
      • renaming levels, 131–132
    • concatenating data
      • adding new cases and variables, 142–144
      • removing data, 144
      • sorting and shuffling, 145–146
    • date and time
      • formatting, 134
      • time transformations, 135
    • dicing data, 141
    • graph data
      • adjacency matrices, 165
      • NetworkX basics, 166–167
    • HTML pages
      • parsing XML and, 150–151
      • using XPath for data extraction, 151–152
    • missing data
      • encoding, 137–138
      • finding missing data, 136–137
      • imputing missing data, 138–139
    • with NumPy, 122
    • with pandas, 122–123
    • raw text
      • regular expressions, 155–158
      • stop words, 153–155
      • Unicode and, 153
    • slicing data
      • columns, 140–141
      • rows, 140
    • through aggregation, 146–147
    • using bag of words model
      • n-grams, 161–162
      • overview, 158–160
      • TF-IDF transformations, 162–165
    • validating data
      • creating data maps and data plans, 126–128
      • removing duplicates, 126
      • verifying contents, 124–125
  • shared group knowledge (wisdom of crowds), 411
  • Singular Value Decomposition (SVD), 276–280
  • 64-bit operating system, 42–43
  • skewed values, 178
  • skewness, EDA, 257–258
  • slicing data
    • columns, 140–141
    • rows, 140
  • sort_values() method, 145
  • Spambase Data Set, 441
  • sparse matrices, Scikit-learn library, 239–240
  • Spearman correlation, 270
  • speed of execution, 32–33
  • Spyder IDE, 30–31
  • SQL (Structured Query Language), 11, 14, 114
  • SQL Server database, 115
  • sqlalchemy library, 114
  • SQLite database, 115
  • square root transformations, 178
  • squared errors, linear regression, 352
  • statistical distributions, EDA, 272
  • statistics
    • descriptive, 253–254
    • history of, 12
  • statsmodels library, 36
  • stemming and stop words, 153–155
  • Stochastic Gradient Descent (SGD), 383–386, 409
  • stop words, 153–155, 341
  • StratifiedKFold class, 359–360
  • streaming data, into memory, 102–103
  • strings, 106
    • formatting date and time values with, 134
    • special directives, 134
  • Structured Query Language (SQL), 11, 14, 114
  • Subreddit, 432
  • support vector machines. See SVM
  • Support Vector Regression (SVR), 399–401
  • support vectors, SVM, 389
  • SVD (Singular Value Decomposition), 276–280
  • SVG (Scalable Vector Graphics), 188
  • SVM (support vector machines)
    • adjusting parameters, 390–392
    • classifying with, 392–397
    • creating stochastic solution with, 401–406
    • defined, 371
    • general discussion, 387–390
    • hyperplanes, 390
    • LinearSVC class, 402–406
    • margins, 389
    • nonlinear kernels, 398–399
    • overfitting, 392
    • performing regression with SVR, 399–401
    • scaling, 396
    • Scikit-learn library, 391
    • support vectors, 389
    • underfitting, 392
  • SVR (Support Vector Regression), 399–401
  • syncing Google Colab, 62

T

  • Table of Contents, in Google Colab notebooks, 77
  • tablets, using code on, 61
  • TensorFlow library, 37
  • Teradata tool, 13
  • Term Frequency times Inverse Document Frequency (TF-IDF) transformations, 162–165, 235, 290
  • test datasets, 354–356
  • text classifications, predicting, 340–342
  • text files
    • accessing flat-file datasets from, 106–107
    • raw text
      • regular expressions, 155–158
      • stop words, 153–155
      • Unicode and, 153
  • TF-IDF (Term Frequency times Inverse Document Frequency) transformations, 162–165, 235, 290
  • Theano library, 37
  • third-level headings, 54
  • 3-D arrays, 140
  • Thucydides, 12
  • time series
    • plotting on axes, 212–214
    • trends, 214–216
  • timedelta() function, 135
  • timeit commands, 241–243
  • Titanic tragedy datasets, 412–413, 438–439
  • tokenizing, 158
  • toy datasets, 100
  • training data, 354, 356
  • transformer interface, Scikit-learn library, 231, 234
  • tree ensembles, 138. See also ensembles
  • trendlines, plotting, 212, 214–216
  • triple mapping, 152
  • t-SNE algorithm, 283–284
  • t-tests, EDA, 263–264
  • Tukey, John, 252
  • tuples, 162
  • 2-D arrays, 140

U

  • Udacity, 435
  • underfitting, SVM, 392
  • undirected graphs, 222–223
  • Unicode, 38, 153
  • univariate approach
    • to outliers
      • box plots, 318–319
      • Chebyshev’s inequality, 320
      • Gaussian distribution, 319–320
      • overview, 317–318
      • winsorizing, 321
    • to selecting variables, 360–362
  • Unicode Transformation Format 8-bit (UTF-8), 153
  • unstructured file form, sending data in, 111–113
  • uploading data, into memory, 101–102

V

  • validating data
    • creating data maps and data plans, 126–128
    • removing duplicates, 126
    • verifying contents, 124–125
  • values, dataset
    • defined, 34
    • skewed, 178
  • van Rossum, Guido, 22
  • Vapnik, Vladimir, 387
  • variable distributions, 253
  • variables, 19, 170. See also features, dataset
    • categorical, 129–133
    • combining for feature creation, 176–177
    • in databases, 100
    • dummy, 177
    • indicator, 177–178
    • variable transformations, 372–375
  • variances
    • Exploratory Data Analysis, 255–256
    • machine learning algorithms, 350
  • vectorization, 179
  • Visual Studio Code support, 45
  • visualizing data, 15, 33–35
    • bar charts, 203–205
    • box plots, 206–208
    • directed graphs, 224–225
    • Exploratory Data Analysis
      • box plots, 262–263
      • distributions, 265–266
      • overview, 261
      • parallel coordinates, 264–265
      • scatterplots, 266–267
      • t-tests, 263–264
    • with graphs, 202–225
    • histograms, 205–206
    • overview, 201
    • pie charts, 202–203
    • plotting geographical data
      • with Basemap toolkit, 218, 220–221
      • deprecated packages, 218–220
      • with Notebook, 216–218
    • plotting time series
      • on axes, 212–214
      • trends, 214–216
    • scatterplots, 208–212
    • undirected graphs, 222–223

W

  • web services, 116
  • web-based data, 116–118
  • whiskers, box plots, 206
  • Windows
    • Enthought Canopy Express and, 41
    • installing Anaconda on, 42–45
    • local runtime support on, 63
    • Windows 7 system, 18
    • WinPython and, 42
  • WinPython, 42
  • winsorizing, 321
  • wisdom of crowds (shared group knowledge), 411
  • Wolpert, David, 351
  • wrangling data
    • clustering
      • agglomerative, 305–310
      • DBScan, 310–312
      • with k-means, 297–305
      • overview, 295–297
    • defined, 229
    • Exploratory Data Analysis
      • categorical data, 259–261
      • correlations, 268–271
      • distributions, 272–274
      • numeric data, 253–258
      • overview, 251–253
      • visualization for, 261–267
    • outliers
      • anomalies, 316
      • concept drift phenomenon, 317
      • effect on machine learning algorithms, 315–316
      • multivariate approach to, 322–326
      • novel data, 316–317
      • overview, 313–314
      • univariate approach to, 317–321
    • overview, 229–230
    • reducing data dimensionality
      • collaborative filtering, 291–293
      • factor analysis, 280–282
      • nonlinear dimensionality reduction, 283–285
      • Non-Negative Matrix Factorization, 289–291
      • overview, 275–276
      • principal components analysis, 282–283, 285–289
      • singular value decomposition, 276–280
    • Scikit-learn library
      • application speed and performance and, 240–247
      • classes, 230–231
      • defining applications for data science, 231–234
      • hashing trick, 234–240
      • multicore parallelism, 247–250
      • object-based interfaces, 231

X

  • x axis, graphs, 189
  • XML pages, 38
    • JSON versus, 119
    • parsing, 150–151
    • working with web data through, 117
  • XPath, 151–152

Y

  • y axis, graphs, 189

Z

  • Z-score standardization, EDA, 273