Index

Symbols

* operator, specifying model interactions, 324
{} (curly brackets), dictionary syntax, 397
% (percent) operator, calling magic commands, 427
+ (plus) operator, adding covariates to linear models, 324
() (round brackets)
- line breaks, 393–394
- tuple syntax, 396
[] (square brackets)
- dictionary values, 397
- getting first character of string, 230
- list syntax, 395–396

Numbers

2D density plot, 88–89

A

Aggregation (or aggregate)
- of built-in methods, 178–179
- of calculations, 23
- of functions, 179–182
- multiple functions simultaneously, 182–184
- one-variable grouped aggregation, 176–177
- options for applying functions in agg and aggregate methods, 182–184
- overview of, 176
- saving groupby object without running aggregate, transform, or filter, 190–191
AIC (Akaike information criterion), 327, 329
Alignment
- DataFrame, 44–45
- Series, 39–42
Anaconda
- command prompt, 381–382
- installers for, 373–374
- Miniconda, 374
- package installation, 389–390
- Python distribution, 385
- Spyder IDE, 382
- uninstalling, 374
AnacondaCon conference, 364
ANOVA (analysis of variance), 326–327
Anscombe’s quartet
- for data visualization, 65–66, 70–71
- plotting with facets, 99–100
Apache Arrow, 58, 280
apply
- concept map for, 372
- creating/using functions, 131–132
- functions across rows or columns of data, 133
- lambda functions, 141–142
- numba library and, 140–141
- over a DataFrame, 135–138
- over a Series, 133–135
- overview of, 131
- primer on, 131–132
- summary/conclusion, 142
- vectorized functions, 138–141
*args, function parameter, 408
Arrays
- scientific computing stack, 359
- sklearn library and, 286–287
- working with, 415–416
Arrow, 58
- for dates and times, 280
assert, checking data assembly with, 166
assign, modifying columns with, 50–52
Assignment
- multiple, 413–414
- passing/reassigning values, 395–396
astype method
- converting column to categorical type, 225–226
- converting to numeric values, 221–222
- converting values to strings, 220
Attributes
- class, 417
- dot notation and, 10–11
- Series, 35
Average cluster algorithm, in hierarchical clustering, 353–354
Axes, plotting, 67–71

B

Bar plots, 89–91
Bash shell, 377–378
BIC (Bayesian information criterion), 327, 329
“The Big Book of Python,” 365
“The Big Book of R,” 365
Binary
- feather format for saving, 56–57
- logistic regression for binary response variable, 297
- serialize and save data in binary format, 53
Bivariate statistics
- in matplotlib, 74–76
- in seaborn, 83–94
Booleans (bool)
- subsetting DataFrame, 43
- subsetting Series, 36–39
Boxplots
- for bivariate statistics, 75–76
- creating, 114–115
Broadcasting, Pandas support for, 40–41, 44–45

C

Calculations
- datetime, 257–258
- involving multiple variables, 191
- with missing data (values), 215–216
- of multiple functions simultaneously, 182–184
- timing execution of, 360, 427–428
Carpentries, 364
CAS (computer algebra systems), 359
category
- converting column to, 225–226
- manipulating categorical data, 226
- overview of, 225
- representing categorical variables, 221
- sklearn library used with categorical variables, 291–293
- statsmodels library used with categorical variables, 289–291
Centroid cluster algorithm, in hierarchical clustering, 353–354
Chaining methods, 423–425
Characters
- formatting strings of, 430
- getting first character of string, 230
- getting last character of string, 231–233
- slicing multiple letters of string, 230
- strings as series of, 229
Classes, 417–418
Clustering
- average cluster algorithm, 353–354
- centroid cluster algorithm, 353–354
- complete cluster algorithm, 352
- dimension reduction using PCA, 347–351
- hierarchical clustering, 351–356
- k-means, 345–351
- manually setting threshold for, 355–356
- overview of, 345
- single cluster algorithm, 352–353
- summary/conclusion, 356
- ward cluster algorithm, 354–355
Code
- profiling, 360
- reuse, 405
- style, 393–394
- timing execution of, 360, 427–428
coerce, 224–225
Colon (:), use in slicing syntax, 15, 399–400
Colors, multivariate statistics in seaborn, 95–97
Columns
- adding, 45–47
- concatenation generally, 150–151
- concatenation with different indices, 153–154
- converting to category, 225–226
- directly changing, 47–50
- dot notation to pull values of, 10–11
- dropping values, 52
- methods of indexing, 11
- modifying with assign, 50–52
- rows and columns both containing variables, 126–129
- selecting, 15–16
- single value returns, 8–9
- slicing, 18–21
- subsetting by name, 7–8
- subsetting by range, 16–18
- subsetting generally, 21–23
- subsetting using slicing syntax, 15–16
Columns, with multiple variables
- overview of, 122–123
- split and add individually, 123–125
- split and combine in single step, 125–126
Columns, with values not variables
- keeping multiple columns fixed, 120–122
- keeping one column fixed, 118–120
- overview of, 118
Command line
- basic commands, 378
- Linux, 378
- Mac, 377
- overview of, 377
- Windows, 377
Comma-separated values. See CSV (comma-separated values)
compile, pattern compilation, 246–247
Complete cluster algorithm, in hierarchical clustering, 352
Comprehensions
- function comprehension, 403–404
- list comprehension, 158–160
- overview of, 401–402
Computer algebra systems (CAS), 359
Concatenation (concat)
- adding columns, 150–151
- adding rows, 147–150
- dataframe parts and, 146–147
- with different indices, 151–154
- ignore_index parameter after, 149–150
- observational units across multiple tables, 154–160
- overview of, 146
- split and combine in single step, 125–126
Concept maps, 369–372
concurrent.futures, 360
conda
- creating environments, 385–387
- install, 374
- managing packages, 389
- update, 390
Conditional statements, 433–434
Conferences, 363–364
Confidence interval, in linear regression example, 285
Containers
- join method and, 234–235
- looping over contents, 401–402
- overview, 395
- types of, 229
Conversion, of data types
- to category, 225–226
- to datetime, 250–253
- to numeric, 221–225
- to string, 220–221
Counting
- groupby count, 197–199
- missing data (values), 210–212
- Poisson regression and, 304–308
Count (bar) plot, for univariate statistics, 81–83
Covariates
- adding to linear models, 324
- multiple linear regression with three covariates, 320–322
Cox proportional hazards model
- survival analysis, 314–316
- testing assumptions, 315–316
C printf style formatting, 429
cProfile, profiling code, 360
create (environments), 385–387
Cross-validation
- model diagnostics, 329–333
- regularization techniques, 341–343
cross_val_score, 332–333
CSV (comma-separated values)
- for data storage, 55
- importing CSV files, 55
- loading multiple files using list comprehension, 158–160
Cumulative sum (cumsum), 199
cython, performance-related library, 360

D

Dash, 362
Dashboards, 362
Dask library, 360
Data assembly
- adding rows, 147–150
- checking your work on, 166
- combining data sets, 145
- concatenation, 146–154
- concatenation with different indices, 151–154
- dataframe parts and, 146–147
- ignore_index parameter after concatenation, 149–150
- loading multiple files using list comprehension, 158–160
- many-to-many merges, 163–166
- many-to-one merges, 163
- merging multiple data sets, 160–166
- observational units across multiple tables, 154–160
- one-to-one merges, 162–163
- overview of, 145
- summary/conclusion, 167
- tidy data, 167
DataFrame
- adding columns, 45–47
- aggregation, 182–183
- alignment and vectorization, 44–45
- apply function(s), 135–138
- basic plots, 27–28
- boolean subsetting, 43
- as class, 417–418
- concatenation, 149
- concept map for basics in, 369
- converting to Arrow objects, 58
- converting to dictionary objects, 58–59
- creating, 32–33
- defined, 3
- directly changing columns, 47–50
- exporting, 56
- grouped and aggregated calculations, 23–27
- grouped frequency counts, 27
- grouped means, 23–26
- histogram, 111
- loading first data set, 4–6
- methods, 43
- ndarray save method, 53
- overview of, 3, 42
- parts of, 42–43
- single value returns, 8–9
- slicing columns, 18–21
- subsetting columns by name, 7–8
- subsetting columns by range, 16–18
- subsetting columns using slicing syntax, 15–16
- subsetting rows and columns, 21–23
- subsetting rows by index label, 11–13
- subsetting rows by row number, 13–14
- summary/conclusion, 28–29
- type function for checking, 5
- writing CSV files (to_csv method), 55
Data models, 281–282
- diagnostics (See Model diagnostics)
- generalized linear (See GLM (generalized linear models))
- linear (See Linear regression models)
Data normalization
- multiple observational units in a table, 169–170
- overview, 169
Data sets
- cleaning data, 416
- combining, 145
- downloading for this book, 375
- equality tests for missing data, 203–204
- exporting/importing data (See Exporting/importing data)
- Indemics (Interactive Epidemic Simulation), 196
- lists for data storage, 395–396
- loading, 4–6
- many-to-many merges, 163–166
- many-to-one merges, 163
- merging, 160–166
- one-to-one merges, 162–163
- tidy data, 117
Data structures
- adding columns, 45–47
- concept map for, 370
- creating, 31–33
- CSV (comma-separated values), 55
- DataFrame alignment and vectorization, 44–45
- DataFrame boolean subsetting, 43
- DataFrame generally, 42–43
- directly changing columns, 47–50
- dropping values, 52
- Excel and, 55–56
- exporting/importing data, 52
- feather format, 56–57
- making changes to, 45
- overview of, 31
- pickle data, 53–54
- Series alignment and vectorization, 39–42
- Series boolean subsetting, 36–39
- Series generally, 33–35
- Series methods, 35–37
- Series similarity with ndarray, 35–36
- summary/conclusion, 63
Data types (dtype)
- category dtype, 225
- converting to category, 225–226
- converting to datetime, 250–253
- converting to numeric, 221–225
- converting to string, 220–221
- getting list of types stored in column, 225–226
- manipulating categorical data, 226
- overview of, 219
- Series attributes, 35
- specifying from numpy library, 221
- summary/conclusion, 227
- to_numeric function, 222–225
- viewing list of, 219–220
date_range function, 266–269
datetime
- adding columns to data structures, 45–47
- Arrow with, 280
- calculations, 257–258
- converting to, 250–253
- directly changing columns, 48–49
- extracting date components (year, month, day), 254–257
- frequencies, 268
- getting stock-related data, 261–263
- loading date related data, 253–254
- methods, 259–261
- object, 249–250
- offsets, 268–269
- overview of, 249
- ranges, 266–269
- resampling, 276–278
- shifting values, 270–276
- subsetting data based on dates, 263–266
- summary/conclusion, 280
- time zones, 278–279
DatetimeIndex, 263–265, 268
Day, extracting date components from datetime object, 254–257
Daylight savings time, 278
def keyword, use with functions, 405–406
Density plots
- 2D density plot, 88–89
- plot.kde function, 111–112
- for univariate statistics, 80
Diagnostics. See Model diagnostics
Dictionaries (dict)
- creating DataFrame, 32–33
- converting DataFrame objects to, 58–59
- overview of, 396–398
- passing method to, 182–183
Directories, working, 383–384
distplot, creating histograms, 81–82
dmatrices function, patsy library, 331–333
Docstrings (docstring), function documentation, 132, 405
Dot notation, to pull a column of values, 10–11
dropna parameter
- counting missing values, 210–212
- dropping missing values, 214–215
Dropping (drop)
- data structure values, 52
- missing data (values), 214–215
dtype. See Data types (dtype)

E

EAFP (easier to ask for forgiveness than permission), 191
Elastic net, regularization technique, 340–341
elif, 433–434
else, 433–434
Environments
- creating, 385–388
- deleting, 387
- Pipenv, 387–388
- Pyenv, 387
Equality tests, for missing data, 203–204
errors parameter, numeric, 223–224
EuroSciPy conference, 364
Excel
- DataFrame and, 56
- Series and, 56
Exporting/importing data
- Arrow, 58
- CSV (comma-separated values), 55
- dictionary, 58–59
- Excel, 55–56
- feather format, 56–57
- JSON, 59–62
- methods, 63
- output types, 62–63
- overview of, 52
- pickle data, 53–54

F

Facets, plotting, 99–104
Feather format, interface with R language, 56–57
Files
- loading multiple using list comprehension, 158–160
- working directories and, 383
fillna method, 212–213
Filter (filter), groupby operations, 188–189
Find
- missing data (values), 210–212
- patterns, 244–245
findall, patterns, 244–245
Fizz Buzz, 433–434
float/float64, 221
Folders
- project organization, 379
- working directories and, 383
for loop. See Loops (for loop)
format method, 236
Formats/formatting
- date formats, 252
- serialize and save data in binary format, 53
- strings (string), 236–239, 429–431
Formatted literal strings (f-strings), 236–239
formula API, in statsmodels library, 284–285
freq parameter, 268
Frequency
- datetime, 268
- grouped frequency counts, 27
- offsets, 268–269
- resampling converting between, 276–278
f-strings (formatted literal strings), 236–239
Functions
- across rows or columns of data, 133
- aggregation, 179–182
- apply over DataFrame, 135–138
- apply over Series, 133–135
- arbitrary parameters, 407–408
- calculating multiple simultaneously, 182–184
- comprehensions and, 403–404
- creating/using, 131–132
- custom, 180–181
- default parameters, 407
- groupby, 178
- **kwargs, 408
- lambda, 141–142
- options for applying in agg and aggregate methods, 182–184
- overview of, 405–408
- regular expressions (RegEx), 240
- vectorized, 138–141
- z-score example of transforming data, 184–186

G

Ganssle, Paul, 280
Gapminder data set, 4
Generalized linear models (GLM). See also Linear regression models
- logistic regression, 446–447
- model diagnostics, 327–329
- more GLM options, 308–309
- negative binomial regression, 306–308, 448–449
- overview of, 297
- Poisson regression, 304–308, 447–449
- sklearn library for logistic regression, 300–304
- statsmodels library for logistic regression, 299–300
- statsmodels library for Poisson regression, 304–306
- summary/conclusion, 309
- survival analysis, 311–317
- testing Cox model assumptions, 315–316
Generators
- converting to list, 16–17
- overview of, 409–411
get
- dictionary values with, 397–398
- selecting groups, 191–192
Git for Windows, 377
github, 365
GLM (generalized linear models). See Generalized linear models
glm function, in statsmodels library, 306, 308–309
Going it alone, 363–365
groupby operations
- aggregation, 176–184
- aggregation functions, 179–182
- applying functions in agg and aggregate methods, 182–184
- built-in aggregation methods, 178–179
- calculations generally, 24–25
- calculations involving multiple variables, 191
- calculations of means, 23–26
- compared with SQL, 175
- filtering, 188–189
- flattening results, 194–195
- frequency counts, 27
- iterating through groups, 192–194
- methods and functions, 178
- missing value example, 186–188
- multiple groups, 194
- one-variable grouped aggregation, 176–177
- overview of, 175
- saving without running aggregate, transform, or filter methods, 190–191
- selecting groups, 192
- summary/conclusion, 199–200
- transform, 184–188
- working with multiIndex, 195–199
- z-score example of transforming data, 184–186
Groups
- iterating through, 192–194
- selecting, 191–192
- working with multiple, 194
Guido, Sarah, 241

H

Hendryx-Parker, Calvin, 387
hexbin plot
- bivariate statistics in seaborn, 87–88
- plt.hexbin function, 113–114
Hierarchical clustering
- average cluster algorithm, 353–354
- centroid cluster algorithm, 353–354
- complete cluster algorithm, 352
- manually setting threshold for, 355–356
- overview of, 351–352
- single cluster algorithm, 352–353
- ward cluster algorithm, 354–355
Histograms
- creating using plot.hist functions, 111
- of model residuals, 323
- for univariate statistics in matplotlib, 73–74
- for univariate statistics in seaborn, 79–83

I

Ibis, 361
id, unique identifiers, 220
IDEs (integrated development environments), Python, 382
if, 433–434
ignore_index parameter, after concatenation, 149–150
iloc
- indexing rows or columns, 11
- Series attributes, 35
- subsetting rows and columns, 21–23
- subsetting rows by number, 13–14
Importing (import). See also Exporting/importing data
- itertools library, 410–411
- libraries, 391–392
- loading first data set, 4–5
- matplotlib library, 66–72
- pandas, 415
Indemics (Interactive Epidemic Simulation) data set, 208
Indices
- beginning and ending indices in ranges, 399
- concatenate columns with different indices, 153–154
- concatenate rows with different indices, 151–153
- date ranges, 267–268
- issues with absolute, 22
- out of bounds notification, 138
- reindexing as source of missing values, 209–210
- subsetting columns by index position break, 8
- subsetting data based on, 263–266
- subsetting rows by index label, 11–13
- working with multiIndex, 195–199
inplace parameter, functions and methods, 49–50
Installation
- of Anaconda, 373–374
- from command line, 377–378
- Python packages, 374
Integers (int/int64)
- converting to string, 220–221
- vectors with integers (scalars), 40
Integrated development environments (IDEs), 382
Interactive Epidemic Simulation (Indemics) data set, 196
Interpolation, in filling missing data, 213–214
IPython (ipython)
- ipython command, 381–382
- magic commands, 427
Iteration. See Loops (for loop)
iTerm2, 377
itertools library, 410–411

J

JavaScript Object Notation (JSON), 59–62
join
- merges and, 160
- string methods, 234–235
jointplot, creating seaborn scatterplot, 85–88
JSON data, 59–62
Jupyter, 360
jupyter command, 382
JupyterCon, 364
Jupyter Days, 364

K

KaplanMeierFitter, lifelines library, 312–313
KDE plot, of bivariate statistics, 89–90
keep_default_na parameter, specifying NaN values, 205
Kelleher, Adam, 241
Kelleher, Andrew, 241
Keys, creating DataFrame, 32–33
Key–value pairs, 397–398
Key–value stores, 408
Keywords
- lambda keyword, 142
- passing keyword argument, 134–135
k-fold cross validation, 329–333
k-means
- clustering, 345–351
- using PCA, 349–351
**kwargs, 408

L

L1 regularization, 337–338, 341
L2 regularization, 338–341
lambda functions, applying, 141–142
Lander, Jared, 241
LASSO regression, 337–338, 341
Leap years/leap seconds, 278
Learning resources, for self-directed learners, 363–365
Libraries. See also by individual types
- importing, 391–392
- performance libraries, 360
lifelines library, 311–313
- CoxPHFitter class, 314–315
- KaplanMeierFitter class, 312–313
Linear regression models. See also GLM (generalized linear models)
- with categorical variables, 289–293
- cross-validation, 341–343
- elastic net, 340–341
- LASSO regression regularization, 337–338
- model diagnostics, 324–327
- multiple regression, 287–289
- one-hot encoding in, 294–295
- R² (coefficient of determination) regression score function, 332
- reasons for regularization, 335–337
- replicating results in R, 444–446
- residuals, 320–322
- restoring labels in sklearn models, 293
- ridge regression, 338–340
- simple linear regression, 283–287
- sklearn library for multiple regression, 288–289
- sklearn library for simple linear regression, 285–287
- statsmodels library for multiple regression, 287–288
- statsmodels library for simple linear regression, 284–285
- summary/conclusion, 296
Line breaks, 393–394
Linux
- command line, 378
- installing Anaconda, 373–374
- running python and ipython commands, 382
- viewing working directory, 383
List comprehension, 158–160
Lists (list)
- comprehensions and, 403–404
- converting generator to, 16–17, 409–410
- creating Series, 31–32
- of data types, 219–220
- loading multiple files using list comprehension, 158–160
- looping, 401–402
- multiple assignment, 413–414
- overview of, 395–396
- single value returns, 9–10
lmplot
- creating scatterplots, 85
- with hue parameter, 96–97
Loading data
- datetime data, 253–254
- as source of missing data, 205–206
loc
- indexing rows or columns, 11–13
- Series attributes, 35
- subsetting rows and columns, 21–23
- subsetting rows or columns, 15–16
Logic, three-valued, 203–204
Logistic regression
- example of, 435–441
- overview of, 297–304
- replicating results in R, 446–447
- sklearn library for, 300–304
- statsmodels library for, 299–300
- working with GLM models, 328–329
logit function, performing logistic regression, 299–300
Loops (for loop)
- comprehensions and, 403–404
- overview of, 401–402
- through groups, 192–194
- through lists, 401–402

M

Mac
- command line, 377–378
- installing Anaconda, 373
- pwd command for viewing working directory, 383
- running python and ipython commands, 382
Machine learning models, 285, 361–362
Machine Learning Operations (MLOps), 362
Many-to-many merges, 163–166
Many-to-one merges, 163
Markham, Kevin, 422
match, pattern matching, 240–243
matplotlib library
- axes subplots, 67–71
- bivariate statistics, 74–76
- figure anatomy, 71–72
- figure objects, 67–71
- multivariate statistics, 76–78
- overview of, 66–72
- statistical graphics, 72–73
- univariate statistics, 73–74
Matrices, 331–333, 415–416
Mean (mean)
- custom functions, 180–181
- group calculations involving multiple variables, 191
- grouped means, 23–26
- numpy library, 179
- Series in identifying, 37–38
Meetups, 363
melt function
- converting wide data into tidy data, 118–120
- line breaks, 393–394
- rows and columns both containing variables, 126–127
Merges (merge)
- many-to-many, 163–166
- many-to-one, 163
- of multiple data sets, 160–166
- one-to-one, 162–163
- as source of missing data, 206–207
Methods
- built-in aggregation methods, 178–179
- chaining, 423–425
- class, 418
- datetime, 259–261
- export, 62–63
- Series, 35–37
- string, 233–236
Miniconda, 374
Mirjalili, Vahid, 241
Missing data (NaN values)
- built-in NA value, 218
- calculations with, 215–216
- cleaning, 212–215
- concatenation and, 148–149, 153
- date range for filling in, 272–273
- dropping, 214–215
- fill forward or fill backward, 212–213
- finding and counting, 210–212
- interpolation in filling, 213–214
- loading data as source of, 205–206
- merged data as source of, 206–207
- overview of, 203
- recoding or replacing (fillna method), 212
- reindexing causing, 209–210
- sources of, 205–210
- specifying with na_values parameter, 205–206
- summary/conclusion, 218
- transform example, 186–188
- user input creating, 207–208
- what is a NaN value, 203–204
- working with, 210–216
MLOps (Machine Learning Operations), 362
Model diagnostics
- comparing multiple models, 324–329
- k-fold cross validation, 329–333
- overview of, 319
- q-q plots, 322–324
- residuals, 319–324
- summary/conclusion, 334
- working with GLM models, 327–329
- working with linear models, 324–327
Models
- data, 281–282
- generalized linear (See GLM (generalized linear models))
- linear (See Linear regression models)
Month, extracting date components from datetime object, 254–257
Müller, Andreas, 241
Multiple assignment, 413–414
Multiple regression
- with categorical variables, 289–293
- overview of, 287
- residuals, 320–322
- sklearn library for, 288–289
- statsmodels library for, 287–288
Multivariate statistics
- in matplotlib, 76–78
- in seaborn, 94–99

N

na_filter parameter, specifying NaN values, 205–206
Name, subsetting columns by, 7–8
NaN. See Missing data (NaN values)
NA value, missing data with built-in, 218
na_values parameter, specifying NaN values, 205–206
ndarray
- restoring labels in sklearn models, 293
- Series similarity with, 35–36
- working with matrices and arrays, 415–416
Negative binomial regression, 306–308, 448–449
- replicating results in R, 448–449
Negative numbers, slicing values from end of container, 230–231
New York ACS logistic regression example, 435–441
Normal distribution
- of data, 336
- q-q plots and, 322–324
Normalization, data, 169–173
numba library
- performance-related libraries, 360
- timing execution of statements or expressions, 360
- vectorize decorator from, 140–141
Numbers (numeric)
- converting variables to numeric values, 221–225
- formatting number strings, 238–239, 430–431
- negative numbers, 230–231
- to_numeric function, 222–225
numpy library
- broadcasting support, 44–45
- exporting/importing data, 53–55
- mean, 179
- ndarray, 415–416
- performance and, 360
- restoring labels in sklearn models, 293
- Series similarity with numpy.ndarray, 35
- sklearn library taking numpy arrays, 286–287
- specifying dtype from, 220–221
- vectorize, 140
nunique method, grouped frequency counts, 27

O

Object-oriented languages, 417
Objects
- classes, 417–418
- converting to datetime, 250–253
- datetime, 249–250
- figure, plotting, 67–71
- lists as, 395–396
- plots and plotting using Pandas objects, 111–115
Observational units
- across multiple tables, 154–160
- in a table, 169–173
Odds ratios, performing logistic regression, 300
Offsets, frequency, 268–269
One-to-one merges, 162–163
OSX. See Mac
Overdispersion of data, negative binomial regression for, 306–308, 448–449

P

Packages
- benefits of isolated environments, 385–386
- installing, 389–390
- updating, 390
pairgrid, bivariate statistics, 93–94
Pairwise relationships (pairplot)
- bivariate statistics, 93–94
- with hue parameter, 98
pandera, 361
Panel, 362
Parameters
- arbitrary function parameters, 407–408
- default function parameters, 407
- functions taking, 406–407
passing/reassigning values, 395–396
patsy library, 331–333
Patterns. See also Regular expressions (regex)
- compiling, 246–247
- matching, 240–243
- substituting, 245–246
PCA (principal component analysis), 347–351
pd
- alias for pandas, 5
- reading pickle data, 53–54
PEP8 (Python Enhancement Proposal 8), 393
Performance
- avoiding premature optimization, 360
- profiling code, 360
- timing your code, 360, 427–428
pickle data, 53–54
Pipeline, 294–295
Pipenv, 387–388
pip install, 374, 389–390
Pivot/unpivot
- columns containing multiple variables, 122–126
- converting wide data into tidy data, 119–120
- keeping multiple columns fixed, 120–122
- rows and columns both containing variables, 127–128
Placeholders, formatting strings, 238, 430
Plots/plotting (plot)
- basic plots, 27–28
- bivariate statistics in matplotlib, 74–76
- bivariate statistics in seaborn, 83–94
- concept map for, 371
- creating boxplots (plot.box), 113–115
- creating density plots (plot.kde), 111–112
- creating scatterplots (plot.scatter), 112–113
- linear regression residuals, 320–322
- matplotlib library, 66–72
- multivariate statistics in matplotlib, 76–78
- multivariate statistics in seaborn, 94–99
- overview of, 65
- Pandas objects and, 111–115
- q-q plots, 322–324
- seaborn library, 78
- statistical graphics, 72–73
- summary/conclusion, 115
- themes and styles in seaborn, 105–108
- univariate statistics in matplotlib, 73–74
- univariate statistics in seaborn, 79–83
PLOT_TYPE functions, 111
plt.hexbin function, 113–114
Podcast resources, for self-directed learners, 364–365
Point representation, Anscombe’s data set, 67
poisson function, in statsmodels library, 304–306
Poisson regression
- negative binomial regression as alternative to, 306–308, 448–449
- overview of, 304
- replicating results in R, 447–449
- statsmodels library for, 304–306
Polars, 360
Principal component analysis (PCA), 347–351
Project templates, 379, 383
Pryke, Benjamin, 422
PyCon conference, 364
PyData, 364
pyenv, 374
Pyenv, 387–388
pyjanitor, 361
Python
- Anaconda distribution, 385
- assert, 166
- command line and text editor, 381
- comparing Pandas types with, 7
- conferences, 364
- enhanced features in Pandas, 3
- IDEs (integrated development environments), 382
- ipython command, 381–382
- jupyter command, 382
- as object-oriented language, 417
- running from command line, 377–378
- scientific computing stack, 350
- ways to use, 381–382
- working with objects, 5
- as zero-indexed language, 399
Python Enhancement Proposal 8 (PEP8), 393

Q

q-q plots, model diagnostics, 322–324

R

random_state method, directly changing columns, 47–48
range, 409–410
Ranges (range)
- beginning and ending indices, 399
- date ranges, 266–269
- filling in missing values, 272–273
- overview of, 409–411
- passing range of values, 395–396
- subsetting columns, 16–18
Raschka, Sebastian, 241
R ecosystem, 362
- replicating results in, 443–449
Regex. See Regular expressions (regex)
regplot, creating scatterplot, 83–85
Regression
- keeping labels in sklearn models, 293
- LASSO regression regularization, 337–338
- logistic regression, 297–304, 446–447
- more GLM options, 308–309
- multiple regression, 287–289
- negative binomial regression, 306–308, 448–449
- New York ACS example, 435–441
- Poisson regression, 304–308, 447–449
- reasons for regularization, 335–337
- ridge regression regularization, 338–340
- simple linear regression, 283–287
- sklearn library for logistic regression, 300–304
- sklearn library for multiple regression, 288–289
- sklearn library for simple linear regression, 285–287
- statsmodels library for logistic regression, 299–300
- statsmodels library for multiple regression, 287–288
- statsmodels library for Poisson regression, 304–306
- statsmodels library for simple linear regression, 284–285
Regular expressions (RegEx)
- functions in re, 240
- overview of, 239
- pattern compilation, 246–247
- pattern matching, 240–243
- pattern substitution, 245–246
- regex library, 247
- special characters, 240
- syntax, special characters, and functions, 240
Regularization
- cross-validation, 341–343
- elastic net, 340–341
- LASSO regression, 337–338
- overview of, 335
- reasons for, 335–337
- ridge regression, 338–340
- summary/conclusion, 343
reindex method, reindexing as source of missing values, 209–210
re module, 240–243, 247
Resampling, datetime, 276–278
Residuals, model diagnostics, 319–324
Residual sum of squares (RSS), 326–327
Resources, 363–365
Ridge regression
- elastic net and, 341
- regularization techniques, 338–340
R language, interface with (to_feather method), 56–57
Rows
- concatenation generally, 145–147
- concatenation with different indices, 151–153
- methods of indexing, 11
- multiple observational units in a table, 169–173
- removing row numbers from output, 55
- rows and columns both containing variables, 126–129
- subsetting multiple, 13
- subsetting rows and columns, 21–23
- subsetting rows by index label, 11–13
- subsetting rows by row number, 13–14
RSS (residual sum of squares), 326–327
Rug plots, for univariate statistics, 80–81

S

Scalars, 40
Scatterplots
- for bivariate statistics, 74–75
- matplotlib example, 69
- for multivariate statistics, 77–78
- plot.scatter function, 112–113
Scientific computing stack, 350
SciPy conference, 364
scipy library
- hierarchical clustering, 351
- performance libraries, 360
- scientific computing stack, 359
Scripts
- project templates for running, 383
- running Python from command line, 377–378
seaborn
- Anscombe’s quartet for data visualization, 65–66
- bivariate statistics, 83–94
- multivariate statistics, 94–99
- overview of, 78
- themes and styles, 105–108
- tips data set, 187
- titanic data set, 297–299
- univariate statistics, 79–83
Searches. See Find
Semicolon (;), types of delimiters, 55
Serialization, serialize and save data in binary format, 53
Series
- adding columns, 45–47
- aggregation functions, 183–184
- alignment and vectorization, 39–42
- apply function(s) over, 133–135
- attributes, 35
- boolean subsetting, 36–39
- categorical attributes or methods, 226
- as class, 417–418
- creating, 31–32
- defined, 3
- directly changing columns, 47–50
- exporting/importing data, 53
- exporting to Excel (to_excel method), 56
- histogram, 111
- methods, 35–37
- overview of, 33–35
- similarity with ndarray, 35–36
- single value returns, 8–9
- writing CSV files (to_csv method), 55
SettingWithCopyWarning, 419–422
Shape
- DataFrame attributes, 5
- Series attributes, 35
Shape, in plotting, 97–98
Shell scripts, running Python from command line, 377–378
Shiny for Python, 362
Simple linear regression
- overview of, 283
- sklearn library, 285–287
- statsmodels library, 284–285
Single cluster algorithm, in hierarchical clustering, 352–353
Siuba, 360
Size, in plotting, 77–78
size attribute, Series, 35
sklearn library
- defaults in, 302–304
- importing PCA function, 347–348
- keeping labels in sklearn models, 293
- k-fold cross validation, 330–331
- KMeans function, 345–347
- for logistic regression, 300–304
- logistic regression example, 439–441
- for multiple regression, 288–289
- one-hot encoding with, 294–295
- for simple linear regression, 285–287
- splitting data into training and testing sets, 335–336
- transformer pipelines in, 294–295
Slicing
- colon (:) use in slicing syntax, 15, 399–400
- columns, 18–21
- string from beginning or to end, 232
- strings, 230–231
- strings incrementally, 232–233
- subsetting columns, 15–16
- subsetting multiple rows and columns, 22–23
- values, 399–400
snakeviz, profiling code, 360
sns.distplot, creating histograms, 81
sns.set_style function, 105–108
Special characters, regular expressions, 240
Split–apply–combine, 175
splitlines method, strings, 235–236
split method
- split and add columns individually, 123–125
- split and combine in single step, 125–126
Spyder IDE, 382
SQL
- comparing Pandas to, 162
- groupby compared with SQL GROUP BY, 175
Square brackets ([])
- getting first character of string, 230
- list syntax, 395–396
Statistical graphics
- bivariate statistics in matplotlib, 74–76
- bivariate statistics in seaborn, 83–94
- matplotlib library, 66–72
- multivariate statistics in matplotlib, 76–78
- multivariate statistics in seaborn, 94–99
- overview of, 72–73
- seaborn library, 78
- univariate statistics in matplotlib, 73–74
- univariate statistics in seaborn, 79–83
Statistics
- basic plots, 27–28
- grouped and aggregated calculations, 23–27
- grouped frequency counts, 27
- grouped means, 23–26
statsmodels library
- for logistic regression, 299–300
- for multiple regression, 287–288
- for Poisson regression, 304–306
- for simple linear regression, 284–285
Stocks/stock prices, 261–263
Storage
- of information in dictionaries, 396–398
- lists for data storage, 395–396
str accessor, 123
Streamlit, 362
strftime, for date formats, 252–253
Strings (string)
- accessing methods, 123
- converting values to, 220–221
- formatting, 236–239, 429–431
- getting last character in, 231–233
- methods, 233–236
- overview of, 229
- pattern compilation, 246–247
- pattern matching, 240–243
- pattern substitution, 245–246
- regular expressions (regex) and, 239–240, 247
- subset and slice, 229–231
- summary/conclusion, 247
str.replace, pattern substitution, 245–246
Styles, seaborn, 105–108
Subplot syntax, 68
Subsets/subsetting
- columns by index position break, 8
- columns by name, 7–8
- columns by range, 16–18
- columns generally, 21–23
- columns using slicing syntax, 15–16
- data by dates, 263–266
- DataFrame boolean subsetting, 43
- lists, 395–396
- modifying with SettingWithCopyWarning, 419–420
- multiple rows, 13
- rows by index label, 11–13
- rows by row number, 13–14
- rows generally, 21–23
- strings, 229–231
- tuples, 396
sum
- cumulative (cumsum), 199
- custom functions, 180
Summarization. See Aggregation (or aggregate)
Survival analysis, 311–317
- Cox proportional hazards model, 314–316
- data for, 311–312
- Kaplan–Meier curves, 312–314
- overview, 311
- summary/conclusion, 317
SymPy, 359

T

Tables
- observational units across multiple, 154–160
- observational units in, 169–173
Tab separated values (TSV), 55, 253
tail, returning last row, 13
T attribute, Series, 35
Templates, project, 379, 383
Terminal application, Mac, 377
Text. See also Characters; Strings (string)
- function documentation (docstring), 132
- overview of, 229
Themes, seaborn, 105–109
Three-valued logic, 203–204
Tidy data
- columns containing multiple variables, 122–126
- columns containing values not variables, 118–122
- concept map for, 372
- data assembly, 167
- data normalization, 169–173
- definition of, 117
- keeping multiple columns fixed, 120–122
- keeping one column fixed, 118–120
- overview of, 117
- rows and columns both containing variables, 126–129
- split and add columns individually, 123–125
- split and combine in single step, 125–126
- summary/conclusion, 129
tidyverse, 360
Time. See datetime
TimedeltaIndex, 265–266
timedelta object
- date calculations, 257–258
- subsetting date-based data, 265–266
timeit function, timing execution of statements or expressions, 360, 427–428
Time zones, 278–279
tips data set, seaborn library, 187, 283
titanic data set, 297–299
to_csv method, 55
to_datetime function, 250–253
to_dict method, 58–59
to_excel method, 56
to_feather method, 57
to_numeric function, 222–225
Transform (transform)
- applying to data, 323–324
- missing value example of transforming data, 186–188
- overview of, 184
- z-score example of transforming data, 184–186
Transformer pipelines, 294–295
True, 434
TSV (tab separated values), 55, 253
Tuples (tuple), 396
2D density plot, 88–89
type function, working with Python objects, 5

U

Unique identifiers, 220
Univariate statistics
- in matplotlib, 73–74
- in seaborn, 79–83
Updates, package, 390
User input, as source of missing data, 207–208

V

value_counts method, 27, 211–212
Values (value)
- columns containing values not variables (See Columns, with values not variables)
- converting to strings, 220–221
- creating DataFrame values, 34
- directly changing columns, 47–50
- dropping, 52
- functions taking, 406–407
- missing (See Missing data (NaN values))
- multiple assignment of list of, 413–414
- passing/reassigning, 395–396
- replacing with SettingWithCopyWarning, 420–421
- Series attributes, 35
- shifting datetime values, 270–276
- slicing, 399–400
VanderPlas, Jake, 359
Variables
- adding covariates to linear models, 324
- bi-variable statistics (See Bivariate statistics)
- calculations involving multiple, 191
- columns containing multiple (See Columns, with multiple variables)
- columns containing values not variables (See Columns, with values not variables)
- converting to numeric values, 221–225
- multiple assignment, 413–414
- multiple linear regression with three covariates, 320–322
- multiple variable statistics (See Multivariate statistics)
- one-variable grouped aggregation, 176–177
- rows and columns both containing, 126–129
- single variable statistics (See Univariate statistics)
- sklearn library used with categorical variables, 291–293
- statsmodels library used with categorical variables, 289–291
Vectors (vectorize)
- applying vectorized function, 138–141
- with common index labels (automatic alignment), 41–42
- DataFrame alignment and vectorization, 44–45
- Series alignment and vectorization, 39–42
- Series referred to as vectors, 35
- timing, 427–428
- using numba library, 140–141
- using numpy library, 140
- vectors of different length, 40–41
- vectors of same length, 39–40
- vectors with integers (scalars), 40
Violin plots
- bivariate statistics, 91–93
- creating scatterplots, 91–93
- with hue parameter, 96–97
Visualization
- Anscombe’s quartet for data visualization, 65–66
- using plots for, 27–28
- value of, 65–66
Voilà, 362

W

Ward cluster algorithm, in hierarchical clustering, 354–355
Wickham, Hadley, 99, 117
“Wide” data, converting into tidy data, 118–120
Windows
- Anaconda command prompt, 381–382
- cd command for viewing working directory, 383
- command line, 377
- installing Anaconda, 373

X

xarray library, 359
XGBoost, 361

Y

Year, extracting date components from datetime object, 254–257

Z

Zero-indexed languages, 399
z-score, transforming data, 184–186
