Index

Symbols

  • * operator, specifying model interactions, 324

  • {} (curly brackets), dictionary syntax, 397

  • % (percent) operator, calling magic commands, 427

  • + (plus) operator, adding covariates to linear models, 324

  • () (round brackets)

  • [] (square brackets)

    • dictionary values, 397

    • getting first character of string, 230

    • list syntax, 395–396

Numbers

  • 2D density plot, 88–89

A

  • Aggregation (or aggregate)

    • of built-in methods, 178–179

    • of calculations, 23

    • of functions, 179–182

    • multiple functions simultaneously, 182–184

    • one-variable grouped aggregation, 176–177

    • options for applying functions in and aggregate methods, 182–184

    • overview of, 176

    • saving groupby object without running aggregate, transform, or filter, 190–191

  • AIC (Akaike information criterion), 327, 329

  • Alignment

  • Anaconda

    • command prompt, 381–382

    • installers for, 373–374

    • Miniconda, 374

    • package installation, 389–390

    • Python distribution, 385

    • Spyder IDE, 382

    • uninstalling, 374

  • AnacondaCon conference, 364

  • ANOVA (analysis of variance), 326–327

  • Anscombe’s quartet

    • for data visualization, 65–66, 70–71

    • plotting with facets, 99–100

  • Apache Arrow, 58, 280

  • apply

    • concept map for, 372

    • creating/using functions, 131–132

    • functions across rows or columns of data, 133

    • lambda functions, 141–142

    • numba library and, 140–141

    • over a DataFrame, 135–138

    • over a Series, 133–135

    • overview of, 131

    • primer on, 131–132

    • summary/conclusion, 142

    • vectorized functions, 138–141

  • *args, function parameter, 408

  • Arrays

    • scientific computing stack, 359

    • sklearn library and, 286–287

    • working with, 415–416

  • Arrow, 58

    • for dates and times, 280

  • assert, checking data assembly with, 166

  • assign, modifying columns with, 50–52

  • Assignment

  • astype method

    • converting column to categorical type, 225–226

    • converting to numeric values, 221–222

    • converting values to strings, 220

  • Attributes

    • class, 417

    • dot notation and, 10–11

    • Series, 35

  • Average cluster algorithm, in hierarchical clustering, 353–354

  • Axes, plotting, 67–71

B

  • Bar plots, 89–91

  • Bash shell, 377–378

  • BIC (Bayesian information criterion), 327, 329

  • “The Big Book of Python,” 365

  • “The Big Book of R,” 365

  • Binary

    • feather format for saving, 56–57

    • logistic regression for binary response variable, 297

    • serialize and save data in binary format, 53

  • Bivariate statistics

    • in matplotlib, 74–76

    • in seaborn, 83–94

  • Booleans (bool)

    • subsetting DataFrame, 43

    • subsetting Series, 36–39

  • Boxplots

    • for bivariate statistics, 75–76

    • creating, 114–115

  • Broadcasting, Pandas support for, 40–41, 44–45

C

  • Calculations

    • datetime, 257–258

    • involving multiple variables, 191

    • with missing data (values), 215–216

    • of multiple functions simultaneously, 182–184

    • timing execution of, 360, 427–428

  • Carpentries, 364

  • CAS (computer algebra systems), 359

  • category

    • converting column to, 225–226

    • manipulating categorical data, 226

    • overview of, 225

    • representing categorical variables, 221

    • sklearn library used with categorical variables, 291–293

    • statsmodels library used with categorical variables, 289–291

  • Centroid cluster algorithm, in hierarchical clustering, 353–354

  • Chaining methods, 423–425

  • Characters

    • formatting strings of, 430

    • getting first character of string, 230

    • getting last character of string, 231–233

    • slicing multiple letters of string, 230

    • strings as series of, 229

  • Classes, 417–418

  • Clustering

    • average cluster algorithm, 353–354

    • centroid cluster algorithm, 353–354

    • complete cluster algorithm, 352

    • dimension reduction using PCA, 347–351

    • hierarchical clustering, 351–356

    • k-means, 345–351

    • manually setting threshold for, 355–356

    • overview of, 345

    • single cluster algorithm, 352–353

    • summary/conclusion, 356

    • ward cluster algorithm, 354–355

  • Code

  • coerce, 224–225

  • Colon (:), use in slicing syntax, 15, 399–400

  • Colors, multivariate statistics in seaborn, 95–97

  • Columns

    • adding, 45–47

    • concatenation generally, 150–151

    • concatenation with different indices, 153–154

    • converting to category, 225–226

    • directly changing, 47–50

    • dot notation to pull values of, 10–11

    • dropping values, 52

    • methods of indexing, 11

    • modifying with assign, 50–52

    • rows and columns both containing variables, 126–129

    • selecting, 15–16

    • single value returns, 8–9

    • slicing, 18–21

    • subsetting by name, 7–8

    • subsetting by range, 16–18

    • subsetting generally, 21–23

    • subsetting using slicing syntax, 15–16

  • Columns, with multiple variables

    • overview of, 122–123

    • split and add individually, 123–125

    • split and combine in single step, 125–126

  • Columns, with values not variables

    • keeping multiple columns fixed, 120–122

    • keeping one column fixed, 118–120

    • overview of, 118

  • Command line

  • Comma-separated values. See CSV (comma-separated values)

  • compile, pattern compilation, 246–247

  • Complete cluster algorithm, in hierarchical clustering, 352

  • Comprehensions

  • Computer algebra systems (CAS), 359

  • Concatenation (concat)

    • adding columns, 150–151

    • adding rows, 147–150

    • dataframe parts and, 146–147

    • with different indices, 151–154

    • ignore_index parameter after, 149–150

    • observational units across multiple tables, 154–160

    • overview of, 146

    • split and combine in single step, 125–126

  • Concept maps, 369–372

  • concurrent.futures, 360

  • conda

    • creating environments, 385–387

    • install, 374

    • managing packages, 389

    • update, 390

  • Conditional statements, 433–434

  • Conferences, 363–364

  • Confidence interval, in linear regression example, 285

  • Containers

  • Conversion, of data types

  • Counting

    • groupby count, 197–199

    • missing data (values), 210–212

    • Poisson regression and, 304–308

  • Count (bar) plot, for univariate statistics, 81–83

  • Covariates

    • adding to linear models, 324

    • multiple linear regression with three covariates, 320–322

  • Cox proportional hazards model

    • survival analysis, 314–316

    • testing assumptions, 315–316

  • C printf style formatting, 429

  • cProfile, profiling code, 360

  • create (environments), 385–387

  • Cross-validation

    • model diagnostics, 329–333

    • regularization techniques, 341–343

  • cross_val_score, 332–333

  • CSV (comma-separated values)

    • for data storage, 55

    • importing CSV files, 55

    • loading multiple files using list comprehension, 158–160

  • Cumulative sum (cumsum), 199

  • cython, performance-related library, 360

D

  • Dash, 362

  • Dashboards, 362

  • Dask library, 360

  • Data assembly

    • adding rows, 147–150

    • checking your work on, 166

    • combining data sets, 145

    • concatenation, 146–154

    • concatenation with different indices, 151–154

    • dataframe parts and, 146–147

    • ignore_index parameter after concatenation, 149–150

    • loading multiple files using list comprehension, 158–160

    • many-to-many merges, 163–166

    • many-to-one merges, 163

    • merging multiple data sets, 160–166

    • observational units across multiple tables, 154–160

    • one-to-one merges, 162–163

    • overview of, 145

    • summary/conclusion, 167

    • tidy data, 167

  • DataFrame

    • adding columns, 45–47

    • aggregation, 182–183

    • alignment and vectorization, 44–45

    • apply function(s), 135–138

    • basic plots, 27–28

    • boolean subsetting, 43

    • as class, 417–418

    • concatenation, 149

    • concept map for basics in, 369

    • converting to Arrow objects, 58

    • converting to dictionary objects, 58–59

    • creating, 32–33

    • defined, 3

    • directly changing columns, 47–50

    • exporting, 56

    • grouped and aggregated calculations, 23–27

    • grouped frequency counts, 27

    • grouped means, 23–26

    • histogram, 111

    • loading first data set, 4–6

    • methods, 43

    • ndarray save method, 53

    • overview of, 3, 42

    • parts of, 42–43

    • single value returns, 8–9

    • slicing columns, 18–21

    • subsetting columns by name, 7–8

    • subsetting columns by range, 16–18

    • subsetting columns using slicing syntax, 15–16

    • subsetting rows and columns, 21–23

    • subsetting rows by index label, 11–13

    • subsetting rows by row number, 13–14

    • summary/conclusion, 28–29

    • type function for checking, 5

    • writing CSV files (to_csv method), 55

  • Data models, 281–282

  • Data normalization

    • multiple observational units in a table, 169–170

    • overview, 169

  • Data sets

    • cleaning data, 416

    • combining, 145

    • downloading for this book, 375

    • equality tests for missing data, 203–204

    • exporting/importing data (See Exporting/importing data)

    • Indemics (Interactive Epidemic Simulation), 196

    • lists for data storage, 395–396

    • loading, 4–6

    • many-to-many merges, 163–166

    • many-to-one merges, 163

    • merging, 160–166

    • one-to-one merges, 162–163

    • tidy data, 117

  • Data structures

    • adding columns, 45–47

    • concept map for, 370

    • creating, 31–33

    • CSV (comma-separated values), 55

    • DataFrame alignment and vectorization, 44–45

    • DataFrame boolean subsetting, 43

    • DataFrame generally, 42–43

    • directly changing columns, 47–50

    • dropping values, 52

    • Excel and, 55–56

    • exporting/importing data, 52

    • feather format, 56–57

    • making changes to, 45

    • overview of, 31

    • pickle data, 53–54

    • Series alignment and vectorization, 39–42

    • Series boolean subsetting, 36–39

    • Series generally, 33–35

    • Series methods, 35–37

    • Series similarity with ndarray, 35–36

    • summary/conclusion, 63

  • Data types (dtype)

    • category dtype, 225

    • converting to category, 225–226

    • converting to datetime, 250–253

    • converting to numeric, 221–225

    • converting to string, 220–221

    • getting list of types stored in column, 225–226

    • manipulating categorical data, 226

    • overview of, 219

    • Series attributes, 35

    • specifying from numpy library, 221

    • summary/conclusion, 227

    • to_numeric function, 222–225

    • viewing list of, 219–220

  • date_range function, 266–269

  • datetime

    • adding columns to data structures, 45–47

    • Arrow with, 280

    • calculations, 257–258

    • converting to, 250–253

    • directly changing columns, 48–49

    • extracting date components (year, month, day), 254–257

    • frequencies, 268

    • getting stock-related data, 261–263

    • loading date related data, 253–254

    • methods, 259–261

    • object, 249–250

    • offsets, 268–269

    • overview of, 249

    • ranges, 266–269

    • resampling, 276–278

    • shifting values, 270–276

    • subsetting data based on dates, 263–266

    • summary/conclusion, 280

    • time zones, 278–279

  • DatetimeIndex, 263–265, 268

  • Day, extracting date components from datetime object, 254–257

  • Daylight savings time, 278

  • def keyword, use with functions, 405–406

  • Density plots

    • 2D density plot, 88–89

    • plot.kde function, 111–112

    • for univariate statistics, 80

  • Diagnostics. See Model diagnostics

  • Dictionaries (dict)

    • creating DataFrame, 32–33

    • converting DataFrame objects to, 58–59

    • overview of, 396–398

    • passing method to, 182–183

  • Directories, working, 383–384

  • distplot, creating histograms, 81–82

  • dmatrices function, patsy library, 331–333

  • Docstrings (docstring), function documentation, 132, 405

  • Dot notation, to pull a column of values, 10–11

  • dropna parameter

    • counting missing values, 210–212

    • dropping missing values, 214–215

  • Dropping (drop)

    • data structure values, 52

    • missing data (values), 214–215

  • dtype. See Data types (dtype)

E

  • EAFP (easier to ask for forgiveness than permission), 191

  • Elastic net, regularization technique, 340–341

  • elif, 433–434

  • else, 433–434

  • Environments

  • Equality tests, for missing data, 203–204

  • errors parameter, numeric, 223–224

  • EuroSciPy conference, 364

  • Excel

    • DataFrame and, 56

    • Series and, 56

  • Exporting/importing data

    • Arrow, 58

    • CSV (comma-separated values), 55

    • dictionary, 58–59

    • Excel, 55–56

    • feather format, 56–57

    • JSON, 59–62

    • methods, 63

    • output types, 62–63

    • overview of, 52

    • pickle data, 53–54

F

  • Facets, plotting, 99–104

  • Feather format, interface with R language, 56–57

  • Files

    • loading multiple using list comprehension, 158–160

    • working directories and, 383

  • fillna method, 212–213

  • Filter (filter), groupby operations, 188–189

  • Find

  • findall, patterns, 244–245

  • Fizz Buzz, 433–434

  • float/float64, 221

  • Folders

    • project organization, 379

    • working directories and, 383

  • for loop. See Loops (for loop)

  • format method, 236

  • Formats/formatting

    • date formats, 252

    • serialize and save data in binary format, 53

    • strings (string), 236–239, 429–431

  • Formatted literal strings (f-strings), 236–239

  • formula API, in statsmodels library, 284–285

  • freq parameter, 268

  • Frequency

    • datetime, 268

    • grouped frequency counts, 27

    • offsets, 268–269

    • resampling converting between, 276–278

  • f-strings, 236–238

  • f-strings (formatted literal strings), 236–239

  • Functions

    • across rows or columns of data, 133

    • aggregation, 179–182

    • apply over DataFrame, 135–138

    • apply over Series, 133–135

    • arbitrary parameters, 407–408

    • calculating multiple simultaneously, 182–184

    • comprehensions and, 403–404

    • creating/using, 131–132

    • custom, 180–181

    • default parameters, 407

    • groupby, 178

    • **kwargs, 408

    • lambda, 141–142

    • options for applying in and aggregate methods, 182–184

    • overview of, 405–408

    • regular expressions (RegEx), 240

    • vectorized, 138–141

    • z-score example of transforming data, 184–186

G

  • Ganssle, Paul, 280

  • Gapminder data set, 4

  • Generalized linear models (GLM). See also Linear regression models

    • logistic regression, 446–447

    • model diagnostics, 327–329

    • more GLM options, 308–309

    • negative binomial regression, 306–308, 448–449

    • overview of, 297

    • Poisson regression, 304–308, 447–449

    • sklearn library for logistic regression, 300–304

    • statsmodels library for logistic regression, 299–300

    • statsmodels library for Poisson regression, 304–306

    • summary/conclusion, 309

    • survival analysis, 311–317

    • testing Cox model assumptions, 315–316

  • Generators

    • converting to list, 16–17

    • overview of, 409–411

  • get

    • dictionary values with, 397–398

    • selecting groups, 191–192

  • Git for Windows, 377

  • GitHub, 365

  • GLM (generalized linear models). See Generalized linear models

  • glm function, in statsmodels library, 306, 308–309

  • Going it alone, 363–365

  • groupby operations

    • aggregation, 176–184

    • aggregation functions, 179–182

    • applying functions in and aggregate methods, 182–184

    • built-in aggregation methods, 178–179

    • calculations generally, 24–25

    • calculations involving multiple variables, 191

    • calculations of means, 23–26

    • compared with SQL, 175

    • filtering, 188–189

    • flattening results, 194–195

    • frequency counts, 27

    • iterating through groups, 192–194

    • methods and functions, 178

    • missing value example, 186–188

    • multiple groups, 194

    • one-variable grouped aggregation, 176–177

    • overview of, 175

    • saving without running aggregate, transform, or filter methods, 190–191

    • selecting groups, 192

    • summary/conclusion, 199–200

    • transform, 184–188

    • working with multiIndex, 195–199

    • z-score example of transforming data, 184–186

  • Groups

    • iterating through, 192–194

    • selecting, 191–192

    • working with multiple, 194

  • Guido, Sarah, 241

H

  • Hendryx-Parker, Calvin, 387

  • hexbin plot

    • bivariate statistics in seaborn, 87–88

    • plt.hexbin function, 113–114

  • Hierarchical clustering

    • average cluster algorithm, 353–354

    • centroid cluster algorithm, 353–354

    • complete cluster algorithm, 352

    • manually setting threshold for, 355–356

    • overview of, 351–352

    • single cluster algorithm, 352–353

    • ward cluster algorithm, 354–355

  • Histograms

    • creating using plot.hist functions, 111

    • of model residuals, 323

    • for univariate statistics in matplotlib, 73–74

    • for univariate statistics in seaborn, 79–83

I

  • Ibis, 361

  • id, unique identifiers, 220

  • IDEs (integrated development environments), Python, 382

  • if, 433–434

  • ignore_index parameter, after concatenation, 149–150

  • iloc

    • indexing rows or columns, 11

    • Series attributes, 35

    • subsetting rows and columns, 21–23

    • subsetting rows by number, 13–14

  • Importing (import). See also Exporting/importing data

    • itertools library, 410–411

    • libraries, 391–392

    • loading first data set, 4–5

    • matplotlib library, 66–72

    • pandas, 415

  • Indemics (Interactive Epidemic Simulation) data set, 208

  • Indices

    • beginning and ending indices in ranges, 399

    • concatenate columns with different indices, 153–154

    • concatenate rows with different indices, 151–153

    • date ranges, 267–268

    • issues with absolute, 22

    • out of bounds notification, 138

    • reindexing as source of missing values, 209–210

    • subsetting columns by index position break, 8

    • subsetting date based on, 263–266

    • subsetting rows by index label, 11–13

    • working with multiIndex, 195–199

  • inplace parameter, functions and methods, 49–50

  • Installation

  • Integers (int/int64)

    • converting to string, 220–221

    • vectors with integers (scalars), 40

  • Integrated development environments (IDEs), 382

  • Interactive Epidemic Simulation (Indemics) data set, 196

  • Interpolation, in filling missing data, 213–214

  • IPython (ipython)

    • ipython command, 381–382

    • magic commands, 427

  • Iteration. See Loops (for loop)

  • iTerm2, 377

  • itertools library, 410–411

J

  • JavaScript Object Notation, 59–62

  • join

  • jointplot, creating seaborn scatterplot, 85–88

  • JSON data, 59–62

  • Jupyter, 360

  • jupyter command, 382

  • JupyterCon, 364

  • Jupyter Days, 364

K

  • KaplanMeierFitter, lifelines library, 312–313

  • KDE plot, of bivariate statistics, 89–90

  • keep_default_na parameter, specifying NaN values, 205

  • Kelleher, Adam, 241

  • Kelleher, Andrew, 241

  • Keys, creating DataFrame, 32–33

  • Key–value pairs, 397–398

  • Key–value stores, 408

  • Keywords

    • lambda keyword, 142

    • passing keyword argument, 134–135

  • k-fold cross validation, 329–333

  • k-means

  • **kwargs, 408

L

  • L1 regularization, 337–338, 341

  • L2 regularization, 338–341

  • lambda functions, applying, 141–142

  • Lander, Jared, 241

  • LASSO regression, 337–338, 341

  • Leap years/leap seconds, 278

  • Learning resources, for self-directed learners, 363–365

  • Libraries. See also by individual types

    • importing, 391–392

    • performance libraries, 360

  • lifelines library, 311–313

    • CoxPHFitter class, 314–315

    • KaplanMeierFitter class, 312–313

  • Linear regression models. See also GLM (generalized linear models)

    • with categorical variables, 289–293

    • cross-validation, 341–343

    • elastic net, 340–341

    • LASSO regression regularization, 337–338

    • model diagnostics, 324–327

    • multiple regression, 287–289

    • one-hot encoding in, 294–295

    • R2 (coefficient of determination) regression score function, 332

    • reasons for regularization, 335–337

    • replicating results in R, 444–446

    • residuals, 320–322

    • restoring labels in sklearn models, 293

    • ridge regression, 338–340

    • simple linear regression, 283–287

    • sklearn library for multiple regression, 288–289

    • sklearn library for simple linear regression, 285–287

    • statsmodels library for multiple regression, 287–288

    • statsmodels library for simple linear regression, 284–285

    • summary/conclusion, 296

  • Line breaks, 393–394

  • Linux

    • command line, 378

    • installing Anaconda, 373–374

    • running python and ipython commands, 382

    • viewing working directory, 383

  • List comprehension, 158–160

  • Lists (list)

    • comprehensions and, 403–404

    • converting generator to, 16–17, 409–410

    • creating Series, 31–32

    • of data types, 219–220

    • loading multiple files using list comprehension, 158–160

    • looping, 401–402

    • multiple assignment, 413–414

    • overview of, 395–396

    • single value returns, 9–10

  • lmplot

    • creating scatterplots, 85

    • with hue parameter, 96–97

  • Loading data

    • datetime data, 253–254

    • as source of missing data, 205–206

  • loc

    • indexing rows or columns, 11–13

    • Series attributes, 35

    • subsetting rows and columns, 21–23

    • subsetting rows or columns, 15–16

  • Logic, three-valued, 203–204

  • Logistic regression

    • example of, 435–441

    • overview of, 297–304

    • replicating results in R, 446–447

    • sklearn library for, 300–304

    • statsmodels library for, 299–300

    • working with GLM models, 328–329

  • logit function, performing logistic regression, 299–300

  • Loops (for loop)

M

  • Mac

    • command line, 377–378

    • installing Anaconda, 373

    • pwd command for viewing working directory, 383

    • running python and ipython commands, 382

  • Machine learning models, 285, 361–362

  • Machine Learning Operations (MLOps), 362

  • Many-to-many merges, 163–166

  • Many-to-one merges, 163

  • Markham, Kevin, 422

  • match, pattern matching, 240–243

  • matplotlib library

    • axes subplots, 67–71

    • bivariate statistics, 74–76

    • figure anatomy, 71–72

    • figure objects, 67–71

    • multivariate statistics, 76–78

    • overview of, 66–72

    • statistical graphics, 72–73

    • univariate statistics, 73–74

  • Matrices, 331–333, 415–416

  • Mean (mean)

    • custom functions, 180–181

    • group calculations involving multiple variables, 191

    • grouped means, 23–26

    • numpy library, 179

    • Series in identifying, 37–38

  • Meetups, 363

  • melt function

    • converting wide data into tidy data, 118–120

    • line breaks, 393–394

    • rows and columns both containing variables, 126–127

  • Merges (merge)

  • Methods

  • Miniconda, 374

  • Mirjalili, Vahid, 241

  • Missing data (NaN values)

    • built-in Na value, 218

    • calculations with, 215–216

    • cleaning, 212–215

    • concatenation and, 148–149, 153

    • date range for filling in, 272–273

    • dropping, 214–215

    • fill forward or fill backward, 212–213

    • finding and counting, 210–212

    • interpolation in filling, 213–214

    • loading data as source of, 205–206

    • merged data as source of, 206–207

    • overview of, 203

    • recoding or replacing (fillna method), 212

    • reindexing causing, 209–210

    • sources of, 205–210

    • specifying with na_values parameter, 205–206

    • summary/conclusion, 218

    • transform example, 186–188

    • user input creating, 207–208

    • what is a NaN value, 203–204

    • working with, 210–216

  • MLOps (Machine Learning Operations), 362

  • Model diagnostics

    • comparing multiple models, 324–329

    • k-fold cross validation, 329–333

    • overview of, 319

    • q-q plots, 322–324

    • residuals, 319–324

    • summary/conclusion, 334

    • working with GLM models, 327–329

    • working with linear models, 324–327

  • Models

  • Month, extracting date components from datetime object, 254–257

  • Müller, Andreas, 241

  • Multiple assignment, 413–414

  • Multiple regression

    • with categorical variables, 289–293

    • overview of, 287

    • residuals, 320–322

    • sklearn library for, 288–289

    • statsmodels library for, 287–288

  • Multivariate statistics

    • in matplotlib, 76–78

    • in seaborn, 94–99

N

  • na_filter parameter, specifying NaN values, 205–206

  • Name, subsetting columns by, 7–8

  • NaN. See Missing data (NaN values)

  • Na value, missing data with built-in, 218

  • na_values parameter, specifying NaN values, 205–206

  • ndarray

    • restoring labels in sklearn models, 293

    • Series similarity with, 35–36

    • working with matrices and arrays, 415–416

  • Negative binomial regression, 306–308, 448–449

    • replicating results in R, 448–449

  • Negative numbers, slicing values from end of container, 230–231

  • New York ACS logistic regression example, 435–441

  • Normal distribution

  • Normalization, data, 169–173

  • numba library

    • performance-related libraries, 360

    • timing execution of statements or expressions, 360

    • vectorize decorator from, 140–141

  • Numbers (numeric)

    • converting variables to numeric values, 221–225

    • formatting number strings, 238–239, 430–431

    • negative numbers, 230–231

    • to_numeric function, 222–225

  • numpy library

    • broadcasting support, 44–45

    • exporting/importing data, 53–55

    • mean, 179

    • ndarray, 415–416

    • performance and, 360

    • restoring labels in sklearn models, 293

    • Series similarity with numpy.ndarray, 35

    • sklearn library taking numpy arrays, 286–287

    • specifying dtype from, 220–221

    • vectorize, 140

  • nunique method, grouped frequency counts, 27

O

  • Object-oriented languages, 417

  • Objects

    • classes, 417–418

    • converting to datetime, 250–253

    • datetime, 249–250

    • figure, plotting, 67–71

    • lists as, 395–396

    • plots and plotting using Pandas objects, 111–115

  • Observational units

  • Odds ratios, performing logistic regression, 300

  • Offsets, frequency, 268–269

  • One-to-one merges, 162–163

  • OSX. See Mac

  • Overdispersion of data, negative binomial regression for, 306–308, 448–449

P

  • Packages

    • benefits of isolated environments, 385–386

    • installing, 389–390

    • updating, 390

  • pairgrid, bivariate statistics, 93–94

  • Pairwise relationships (pairplot)

    • bivariate statistics, 93–94

    • with hue parameter, 98

  • pandera, 361

  • Panel, 362

  • Parameters

    • arbitrary function parameters, 407–408

    • default function parameters, 407

    • functions taking, 406–407

  • passing/reassigning values, 395–396

  • patsy library, 331–333

  • Patterns. See also Regular expressions (regex)

  • PCA (principal component analysis), 347–351

  • pd

    • alias for pandas, 5

    • reading pickle data, 53–54

  • PEP8 (Python Enhancement Proposal 8), 393

  • Performance

    • avoiding premature optimization, 360

    • profiling code, 360

    • timing your code, 360, 427–428

  • pickle data, 53–54

  • Pipeline, 294–295

  • Pipenv, 387–388

  • pip install, 374, 389–390

  • Pivot/unpivot

    • columns containing multiple variables, 122–126

    • converting wide data into tidy data, 119–120

    • keeping multiple columns fixed, 120–122

    • rows and columns both containing variables, 127–128

  • Placeholders, formatting strings, 238, 430

  • Plots/plotting (plot)

    • basic plots, 27–28

    • bivariate statistics in matplotlib, 74–76

    • bivariate statistics in seaborn, 83–94

    • concept map for, 371

    • creating boxplots (plot.box), 113–115

    • creating density plots (plot.kde), 111–112

    • creating scatterplots (plot.scatter), 112–113

    • linear regression residuals, 320–322

    • matplotlib library, 66–72

    • multivariate statistics in matplotlib, 76–78

    • multivariate statistics in seaborn, 94–99

    • overview of, 65

    • Pandas objects and, 111–115

    • q-q plots, 322–324

    • seaborn library, 78

    • statistical graphics, 72–73

    • summary/conclusion, 115

    • themes and styles in seaborn, 105–108

    • univariate statistics in matplotlib, 73–74

    • univariate statistics in seaborn, 79–83

  • PLOT_TYPE functions, 111

  • plt.hexbin function, 113–114

  • Podcast resources, for self-directed learners, 364–365

  • Point representation, Anscombe’s data set, 67

  • poisson function, in statsmodels library, 304–306

  • Poisson regression

    • negative binomial regression as alternative to, 306–308, 448–449

    • overview of, 304

    • replicating results in R, 447–449

    • statsmodels library for, 304–306

  • Polars, 360

  • Principal component analysis (PCA), 347–351

  • Project templates, 379, 383

  • Pryke, Benjamin, 422

  • PyCon conference, 364

  • PyData, 364

  • pyenv, 374

  • Pyenv, 387–388

  • pyjanitor, 361

  • Python

    • Anaconda distribution, 385

    • assert, 166

    • command line and text editor, 381

    • comparing Pandas types with, 7

    • conferences, 364

    • enhanced features in Pandas, 3

    • IDEs (integrated development environments), 382

    • ipython command, 381–382

    • jupyter command, 382

    • as object-oriented language, 417

    • running from command line, 377–378

    • scientific computing stack, 350

    • ways to use, 381–382

    • working with objects, 5

    • as zero-indexed language, 399

  • Python Enhancement Proposal 8 (PEP8), 393

Q

  • q-q plots, model diagnostics, 322–324

R

  • random_state method, directly changing columns, 47–48

  • range, 409–410

  • Ranges (range)

    • beginning and ending indices, 399

    • date ranges, 266–269

    • filling in missing values, 272–273

    • overview of, 409–411

    • passing range of values, 395–396

    • subsetting columns, 16–18

  • Raschka, Sebastian, 241

  • R ecosystem, 362

    • replicating results in, 443–449

  • Regex. See Regular expressions (regex)

  • regplot, creating scatterplot, 83–85

  • Regression

    • keeping labels in sklearn models, 293

    • LASSO regression regularization, 337338

    • logistic regression, 297304, 446447

    • more GLM options, 308309

    • multiple regression, 287289

    • negative binomial regression, 306308, 448449

    • New York ACS example, 435441

    • Poisson regression, 304308, 447449

    • reasons for regularization, 335337

    • ridge regression regularization, 338340

    • simple linear regression, 283287

    • sklearn library for logistic regression, 300304

    • sklearn library for multiple regression, 288289

    • sklearn library for simple linear regression, 285287

    • statsmodels library for logistic regression, 299300

    • statsmodels library for multiple regression, 287288

    • statsmodels library for Poisson regression, 304306

    • statsmodels library for simple linear regression, 284285

  • Regular expressions (RegEx)

    • functions in re, 240

    • overview of, 239

    • pattern compilation, 246247

    • pattern matching, 240243

    • pattern substitution, 245246

    • regex library, 247

    • special characters, 240

    • syntax, special characters, and functions, 240

  • Regularization

  • reindex method, reindexing as source of missing values, 209210

  • re module, 240243, 247

  • Resampling, datetime, 276278

  • Residuals, model diagnostics, 319324

  • Residual sum of squares (RSS), 326327

  • Resources, 363365

  • Ridge

    • regression elastic net and, 341

    • regularization techniques, 338–340

  • R language, interface with (to_feather method), 56–57

  • Rows

    • concatenation generally, 145–147

    • concatenation with different indices, 151–153

    • methods of indexing, 11

    • multiple observational units in a table, 169–173

    • removing row numbers from output, 55

    • rows and columns both containing variables, 126–129

    • subsetting multiple, 13

    • subsetting rows and columns, 21–23

    • subsetting rows by index label, 11–13

    • subsetting rows by row number, 13–14

  • RSS (residual sum of squares), 326–327

  • Rug plots, for univariate statistics, 80–81

S

  • Scalars, 40

  • Scatterplots

    • for bivariate statistics, 74–75

    • matplotlib example, 69

    • for multivariate statistics, 77–78

    • plot.scatter function, 112–113

  • Scientific computing stack, 350

  • SciPy conference, 364

  • scipy library

    • hierarchical clustering, 351

    • performance libraries, 360

    • scientific computing stack, 359

  • Scripts

    • project templates for running, 383

    • running Python from command line, 377–378

  • seaborn

    • Anscombe’s quartet for data visualization, 65–66

    • bivariate statistics, 83–94

    • multivariate statistics, 94–99

    • overview of, 78

    • themes and styles, 105–108

    • tips data set, 187

    • titanic data set, 297–299

    • univariate statistics, 79–83

  • Searches. See Find

  • Semicolon (;), types of delimiters, 55

  • Serialization, serialize and save data in binary format, 53

  • Series

    • adding columns, 45–47

    • aggregation functions, 183–184

    • alignment and vectorization, 39–42

    • apply function(s) over, 133–135

    • attributes, 35

    • boolean subsetting, 36–39

    • categorical attributes or methods, 226

    • as class, 417–418

    • creating, 31–32

    • defined, 3

    • directly changing columns, 47–50

    • exporting/importing data, 53

    • exporting to Excel (to_excel method), 56

    • histogram, 111

    • methods, 35–37

    • overview of, 33–35

    • similarity with ndarray, 35–36

    • single value returns, 89

    • writing CSV files (to_csv method), 55

  • SettingWithCopyWarning, 419–422

  • Shape

    • DataFrame attributes, 5

    • Series attributes, 35

  • Shape, in plotting, 97–98

  • Shell scripts, running Python from command line, 377–378

  • Shiny for Python, 362

  • Simple linear regression

    • overview of, 283

    • sklearn library, 285–287

    • statsmodels library, 284–285

  • Single cluster algorithm, in hierarchical clustering, 352–353

  • Siuba, 360

  • Size, in plotting, 77–78

  • size attribute, Series, 35

  • sklearn library

    • defaults in, 302–304

    • importing PCA function, 347–348

    • keeping labels in sklearn models, 293

    • k-fold cross validation, 330–331

    • KMeans function, 345–347

    • for logistic regression, 300–304

    • logistic regression example, 439–441

    • for multiple regression, 288–289

    • one-hot encoding with, 294–295

    • for simple linear regression, 285–287

    • splitting data into training and testing sets, 335–336

    • transformer pipelines in, 294–295

  • Slicing

    • colon (:) use in slicing syntax, 15, 399–400

    • columns, 18–21

    • string from beginning or to end, 232

    • strings, 230–231

    • strings incrementally, 232–233

    • subsetting columns, 15–16

    • subsetting multiple rows and columns, 22–23

    • values, 399–400

  • snakeviz, profiling code, 360

  • sns.distplot, creating histograms, 81

  • sns.set_style function, 105–108

  • Special characters, regular expressions, 240

  • Split–apply–combine, 175

  • splitlines method, strings, 235–236

  • split method

    • split and add columns individually, 123–125

    • split and combine in single step, 125–126

  • Spyder IDE, 382

  • SQL

    • comparing Pandas to, 162

    • groupby compared with SQL GROUP BY, 175

  • Square brackets ([])

    • getting first character of string, 230

    • list syntax, 395–396

  • Statistical graphics

    • bivariate statistics in matplotlib, 74–76

    • bivariate statistics in seaborn, 83–94

    • matplotlib library, 66–72

    • multivariate statistics in matplotlib, 76–78

    • multivariate statistics in seaborn, 94–99

    • overview of, 72–73

    • seaborn library, 78

    • univariate statistics in matplotlib, 73–74

    • univariate statistics in seaborn, 79–83

  • Statistics

    • basic plots, 27–28

    • grouped and aggregated calculations, 23–27

    • grouped frequency counts, 27

    • grouped means, 23–26

  • statsmodels library

    • for logistic regression, 299–300

    • for multiple regression, 287–288

    • for Poisson regression, 304–306

    • for simple linear regression, 284–285

  • Stocks/stock prices, 261–263

  • Storage

    • of information in dictionaries, 396–398

    • lists for data storage, 395–396

  • str accessor, 123

  • Streamlit, 362

  • strftime, for date formats, 252–253

  • Strings (string)

  • str.replace, pattern substitution, 245–246

  • Styles, seaborn, 105–108

  • Subplot syntax, 68

  • Subsets/subsetting

    • columns by index position break, 8

    • columns by name, 78

    • columns by range, 16–18

    • columns generally, 21–23

    • columns using slicing syntax, 15–16

    • data by dates, 263–266

    • DataFrame boolean subsetting, 43

    • lists, 395–396

    • modifying with SettingWithCopyWarning, 419–420

    • multiple rows, 13

    • rows by index label, 11–13

    • rows by row number, 13–14

    • rows generally, 21–23

    • strings, 229–231

    • tuples, 396

  • sum

    • cumulative (cumsum), 199

    • custom functions, 180

  • Summarization. See Aggregation (or aggregate)

  • Survival analysis, 311–317

    • Cox proportional hazards model, 314–316

    • data for, 311–312

    • Kaplan-Meier curves, 312–314

    • overview, 311

    • summary/conclusion, 317

  • SymPy, 359

T

  • Tables

    • observational units across multiple, 154–160

    • observational units in, 169–173

  • Tab separated values (TSV), 55, 253

  • tail, returning last row, 13

  • T attribute, Series, 35

  • Templates, project, 379, 383

  • Terminal application, Mac, 377

  • Text. See also Characters; Strings (string)

    • function documentation (docstring), 132

    • overview of, 229

  • Themes, seaborn, 105–109

  • Three-valued logic, 203–204

  • Tidy data

    • columns containing multiple variables, 122–126

    • columns containing values not variables, 118–122

    • concept map for, 372

    • data assembly, 167

    • data normalization, 169–173

    • definition of, 117

    • keeping multiple columns fixed, 120–122

    • keeping one column fixed, 118–120

    • overview of, 117

    • rows and columns both containing variables, 126–129

    • split and add columns individually, 123–125

    • split and combine in single step, 125–126

    • summary/conclusion, 129

  • tidyverse, 360

  • Time. See datetime

  • TimedeltaIndex, 265–266

  • timedelta object

    • date calculations, 257–258

    • subsetting date-based data, 265–266

  • timeit function, timing execution of statements or expressions, 360, 427–428

  • Time zones, 278–279

  • tips data set, seaborn library, 187, 283

  • titanic data set, 297–299

  • to_csv method, 55

  • to_datetime function, 250–253

  • to_dict method, 58–59

  • to_excel method, 56

  • to_feather method, 57

  • to_numeric function, 222–225

  • Transform (transform)

    • applying to data, 323–324

    • missing value example of transforming data, 186–188

    • overview of, 184

    • z-score example of transforming data, 184–186

  • Transformer pipelines, 294–295

  • True, 434

  • TSV (tab separated values), 55, 253

  • Tuples (tuple), 396

  • 2D density plot, 88–89

  • type function, working with Python objects, 5

U

  • Unique identifiers, 220

  • Univariate statistics

    • in matplotlib, 73–74

    • in seaborn, 79–83

  • Updates, package, 390

  • User input, as source of missing data, 207–208

V

  • value_counts method, 27, 211–212

  • Values (value)

  • VanderPlas, Jake, 359

  • Variables

  • Vectors (vectorize)

    • applying vectorized function, 138–141

    • with common index labels (automatic alignment), 41–42

    • DataFrame alignment and vectorization, 44–45

    • Series alignment and vectorization, 39–42

    • Series referred to as vectors, 35

    • timing, 427–428

    • using numba library, 140–141

    • using numpy library, 140

    • vectors of different length, 40–41

    • vectors of same length, 39–40

    • vectors with integers (scalars), 40

  • Violin plots

    • bivariate statistics, 91–93

    • creating scatterplots, 91–93

    • with hue parameter, 96–97

  • Visualization

    • Anscombe’s quartet for data visualization, 65–66

    • using plots for, 27–28

    • value of, 65–66

  • Voilà, 362

W

  • Ward cluster algorithm, in hierarchical clustering, 354–355

  • Wickham, Hadley, 99, 117

  • “Wide” data, converting into tidy data, 118–120

  • Windows

    • Anaconda command prompt, 381–382

    • cd command for viewing working directory, 383

    • command line, 377

    • installing Anaconda, 373

X

  • xarray library, 359

  • XGBoost, 361

Y

  • Year, extracting date components from datetime object, 254–257

Z

  • Zero-indexed languages, 399

  • z-score, transforming data, 184–186
