Index

A

abstracted query language, in databases
active products
aggregations, in databases
algorithmic templating
algorithms, long division and
alternative hypothesis
analytical tools
  limitations of
  strengths of
  when to use
API keys
APIs (application programming interface)
artificial intelligence
artificial neural networks.
    See neural networks.
ASCII text
assessment of data.
    See data assessment.
atomic statistics
awareness, value of

B

backing up code
backslash character
BACON (BAyesian Clustering Over Networks)
Bayesian statistics, vs. frequentist statistics
  prior distributions
  propagating uncertainty
  updating with new data
beer recommendation algorithm
beginners, plan execution for
Bernoulli distribution
best practices
  asking questions
  code organization
  code repositories (repos) and versioning
  documentation
  staying close to data
  when finishing projects
big data
  benefits of
  how to use
  types of
  when to use
bioinformatics example
Bitbucket.org
black box methods
boosting
branching

brute-force method
  of documenting code
  overview
bugs.
    See software bugs.

C

caching
checking assumptions
  about contents of data
  about distribution of data
  trick for uncovering assumptions
classification
closeness-of-prediction function
cloud services
  benefits of
  how to use
  types of
  when to use
clustering
  how it works
  what to watch out for
  when to use it

code
  backing up
  organization of
code documentation
code repositories, storage in
colleagues, consulting
combining data sources
comma-separated value.
    See CSV.
communicating goals
complex methods
component analysis
  how it works
  what to watch out for
  when to use it
Comprehensive R Archive Network.
    See CRAN.
computer clusters
computer users, as data generators
concurrency, in databases
confidence intervals
copyright
correlation, conflating with significance
corrupted data
costs, sunk
CRAN (Comprehensive R Archive Network)
credible intervals
cross-validation
CSV (comma-separated value)

customers
  after software delivery
  improper use of product by.
    See also listening to customer.

D

DAG (directed acyclic graph)

data
  indiscriminate collection of
  scouting for
    combining data sources
    copyright and licensing
    measuring or collecting things yourself
    using Google search
    web scraping
data assessment
  checking assumptions
    about contents of data
    about distribution of data
    trick for uncovering assumptions
  descriptive statistics
    choosing specific statistics to calculate
    common descriptive statistics
    making tables or graphs
  Enron email data set example
  looking for something specific
    characterizing examples
    data snooping
    finding few examples
  rough statistical analysis
    classification
    clustering
    increasing sophistication
    inference
    other statistical methods
    taking subset of data
data frames, in R
data generators, computers and internet users as
data management, vs. data science
data points, neighborhood of

data science
  defined
  vs. data management
data science projects, lifecycle of

data scientists
  as explorer
  compared with software developers
data wrangling
  common pitfalls
    escape characters
    line endings, parsing
    outliers
  in Enron email analysis
  PDFs and
  preparation for
    end of data and file
    making plan
    messy data, types of
    possible obstacles and uncertainties
    pretending to be an algorithm
  techniques and tools
  track and field case study
    common heuristic comparisons
    comparing performances using all data available
    IAAF scoring tables
data.frame function, in R
database indexes
databases
  benefits of
    abstracted query language
    aggregations
    caching
    concurrency
    indexing
    scaling
  document-oriented
  graph databases
  how to use
  non-relational
  poorly designed
  relational
  types of
    document-oriented
    other
    relational
  using cloud services with
  when to use
data-centric, use of term
DataLoader object
deep learning
delimited files
deliverables, suggesting
delivering product
  content
    disclaimers for less significant results
    making conclusive results prominent
    omitting virtually inconclusive results
    user experience
  feedback
    asking for
    is not disapproval
    meaning of
    understanding
  understanding customer
    considering audience
    considering how results will be used.
    See delivery media.
delivery media
  analytical tools
    limitations of
    strengths of
    when to use
  instructions for how to redo analysis
    limitations of
    strengths of
    when to use
  interactive graphical applications
    limitations of
    strengths of
    when to use
  other types of products
  reports
    limitations of
    strengths of
    when to use
  white papers
    limitations of
    strengths of
    when to use
descriptive statistics
  choosing specific statistics to calculate
  common descriptive statistics
  making tables or graphs
developers.
    See software developers.
directed acyclic graph.
    See DAG.
division
dl object
documentation
  for developers
  for users
  of code
document-oriented databases
domain knowledge, relative importance of

E

Elasticsearch
else command
EM (expected maximization)
empirical Bayes
Enron email
EOL (end-of-line) character
error terms
escape characters
Euclidean geometry
Excel.
    See Microsoft Excel.
expectations, reconsidering
expected maximization.
    See EM.
exponential growth model
Extensible Business Reporting Language.
    See XBRL.

F

feature extraction
feedback
  asking for
  meaning of
  understanding
    meaning of
    perceptions
    who is providing
fields, in abstract algebra

file formats
  bad
  converters
  deciding which to use
  flat files
  HTML
  JSON
  unusual
  XML
flat files
forking
frequentist statistics, vs. Bayesian statistics
  prior distributions
  propagating uncertainty
  updating with new data
functional programming
functionality of product, solving problems with
functions

G

Gaussian mixture models
generators of data.
    See data generators.
getData method
Git, reconstructing project history from
GitHub.com

goals
  adjusting
  changes in, plan modification due to
  communicating
  setting
goodness-of-fit function
Google, big data use by
GPUs (graphics processing units)
graph databases
graphs, making
gravity model example
growth model, exponential
GUI-based applications

H

Hadoop framework
hierarchical clustering
histograms
HPC (high-performance computing)
  benefits of
  how to use
  types of
  when to use
HTML (hypertext markup language)
  converting PDFs to
  wrangling data from, preparation for
HttpUrlConnection package
hyper-parameters
hypothesis tests

I

IAAF scoring tables
IaaS (infrastructure as a service)
ICA (independent component analysis)
if command
Improved Inference of Gene Regulatory Networks through Integrated Bayesian Clustering and Dynamic Modeling of Time-Course Expression Data (PLoS ONE)
independent component analysis.
    See ICA.
indexing in databases
inference.
    See also statistical modeling and inference.
inferential statistics
infrastructure as a service.
    See IaaS.
__init__ method
interactive graphical applications
  limitations of
  strengths of
  when to use
internet users, as data generators
Introducing the Enron Corpus (Klimt and Yang)
inverted pyramid of journalism
IoT (Internet of Things)
iPython GUI
IRR formula, Excel
iteration

J

jargon

Java programming language
  overview
  when to use
joining tables
JSON (JavaScript object notation)

K

keys
Klimt, Brian
k-means

knowledge
  as first priority
  iterating ideas based on
knowledge of domain.
    See domain knowledge.

L

latent variables
lede
legality of data usage
licensing
lifecycle of data science projects
likelihood function
limma package, in R
line endings, parsing
linear models

linear regression
  in Python
  in R
linear_model object, in Python
linearModel variable, in R
Linux OS, line endings in
listening to customer
  asking specific questions to uncover facts
  iterating ideas based on knowledge
  resolving wishes and pragmatism
  suggesting deliverables
lm function, in R
loadData() function
local drives, storage in
log files
logistic regression
long division

M

Mac OS, line endings in
machine learning
  how it works
  what to watch out for
  when to use it
macros, in spreadsheet software
managers, team, assigning
MAP (maximum posteriori estimation)
MapReduce technology
Markov chain Monte Carlo.
    See MCMC.
mathematics
  long division
  mathematical models
  vs. statistics
MATLAB programming language
  overview
  when to use
maximum likelihood estimation.
    See MLE.
maximum posteriori estimation.
    See MAP.
McKinlay, Chris
MCMC (Markov chain Monte Carlo)
mean reversion
messy data, types of
meta-algorithms

methods
  built-in
    linear regression in Python
    linear regression in R
  complex
  in object-oriented programming
  writing
microRNA example
Microsoft Excel
Microsoft Word
miRanda algorithm
MLE (maximum likelihood estimation)
model of gravity example
MongoDB
multicore package, in R
multiple testing correction
multiprocessing package, in Python

N

natural language processing tools.
    See NLP tools.
Natural Language Toolkit.
    See NLTK.
negative results
neighborhood of a data point
network drives, storage in
neural networks
NLP (natural language processing) tools
NLTK (Natural Language Toolkit)
non-relational databases
normal distribution
NoSQL
null hypothesis
numpy package

O

object-oriented programming
obstacles, anticipating

Octave programming language
  overview
  when to use
OpenOffice Calc
opinions
over-fitting

P

PaaS (platform as a service)
pandas package
parsing line endings
particle physics, Standard Model of
passive products
PCA (principal component analysis)
pdf2html application
pdf2txt application

PDFs
  converting to text or HTML
  overview
Pearson correlation coefficient
plain language
plain text
plan execution
  for beginners
  for software engineers
  for statisticians
  for team leaders
  for team members
  modifying plan in progress
  results
    practical usefulness of
    reevaluating original goals based on
    statistical significance of
planning
  examples
    beer recommendation algorithm
    bioinformatics and gene expression
    Enron email analysis
    top performances in track and field
  flexibility in.
    See also plan execution.
platform as a service.
    See PaaS.
PMT formula, Excel
point estimates
positive results
posterior distributions
postmortem
  determining different methods
  reviewing old plans
  reviewing old goals
  reviewing technology choices
pragmatism, wishes and
predict function, in R
pride
principal component analysis.
    See PCA.
probability
probability distributions, in statistical models
problems with products
  doesn’t achieve expected goals
  improper use by customers
  software bugs
  user experience problems
product functionality.
    See functionality of product.
product problems.
    See problems with products.
product revisions.
    See revisions to product.
programming statistical software
  applications
  functions
  getting started
  languages
    Java
    MATLAB
    Octave
    Python
    R
  object-oriented
  scripting
  switching from spreadsheets to scripts
project postmortem.
    See postmortem.
projects.
    See data science projects.
p-values

Python programming language
  linear regression in
  overview
  reading of flat files by
  when to use

Q

querying, using databases

questions
  answering using data
    anticipating obstacles
    figuring out software to use
    figuring out what data to use
    has someone done this before?
    is data relevant and sufficient?
  asking
    concrete in assumptions
    measurable success without much cost
    specific questions to uncover facts
quotients

R

R programming language
  linear regression in
  overview
  reading of flat files by
  when to use
random forest method
random variables
RDKit package
READMEs
reduce step, MapReduce framework
relational databases
reports
  limitations of
  strengths of
  when to use
repos (repositories)
REST API

results
  conclusive, making prominent
  disclaimers for less significant
  inconclusive
  negative
  positive
revisions to product
  choosing
  designing
  due to uncertainty
  engineering
RStudio GUI

S

SaaS (software as a service)
scaling, databases and
scikit-learn package
scipy package
scoring tables, IAAF
scouting for data
  combining data sources
  copyright and licensing
  measuring or collecting things yourself
  using Google search
  web scraping
scraping web

scripts
  data wrangling and
  overview
  programming
  switching from spreadsheets to
self.filename attribute
shuffle step, MapReduce framework
shyness
social network analysis
software as a service.
    See SaaS.
software bugs
  fixing
  how to recognize
  how to remedy

software developers
  compared with data scientists
  whether need to be developer
software engineers, plan execution for
  checking final results
  consulting statisticians
  planning for broadest possible set of outcomes or states
software, figuring out which to use
sort command
spherical geometry

spreadsheets
  overview
  switching to scripts from
SPSS GUI
SQL (Structured Query Language)
standard errors
Standard Model of particle physics example
statistical analysis
  classification
  clustering
  increasing sophistication
  inference
  other statistical methods
  taking subset of data
statistical modeling and inference
  Bayesian vs. frequentist statistics
    prior distributions
    propagating uncertainty
    updating with new data
  defining statistical model
  drawing conclusions from models
  fitting model
    expected maximization and variational Bayes
    likelihood function
    Markov chain Monte Carlo
    maximum likelihood estimation
    maximum posteriori estimation
    over-fitting
  latent variables
  quantifying uncertainty
    drawing conclusions from models involving uncertainty
    formulating statistical model with uncertainty
    probability distributions in statistical models
statistical software
  accessibility of
  GUI-based applications
  programming
    applications
    functions
    getting started
    languages
    object-oriented
    scripting
    switching from spreadsheets to scripts
  spreadsheets
  tools
    familiarity with
    flexibility of
    implementation of methods
    informative
    interoperability of
    knowledge of
    permissive licenses
    popularity of
    purpose-built
    well documented
  translating statistics into
    using built-in methods
    writing methods
statisticians, plan execution for
  consulting software engineers
  spending time with customers
  testing software

statistics
  descriptive
    choosing specific statistics to calculate
    common descriptive statistics
    making tables or graphs
  inferential
  overview
  relation to data science
  statistical methods
    clustering
    component analysis
    machine learning
  translating into software
    using built-in methods
    writing methods
  vs. mathematics
  whether need to know.
    See also statistical modeling and inference; statistical software.
storage
  in code repositories
  in local drives
  in network drives
  in READMEs
  in wiki systems
  using cloud services
  with web-based document hosting
Structured Query Language.
    See SQL.
subsets of data
sum command
SUM formula, Excel
sunk costs
supercomputers
SVMs (support vector machines)

T

tables
  joining
  making
tab-separated value.
    See TSV.
TargetScan algorithm
team leaders, plan execution for
team members, plan execution for
  assigning a manager
  planning
  specifying expectations
technical variance
technology, as second priority after knowledge
templating, algorithmic
text, converting PDFs to
time boxing
timidness
tools, for statistical software
track and field example
  common heuristic comparisons
  comparing performances using all data available
  IAAF scoring tables
train-test separation
TSV (tab-separated value)
Tufte, Edward
Tumblr, API of
Twitter, big data use by

U

Uber example
uncertainty, quantifying
  drawing conclusions from models involving uncertainty
  formulating statistical model with uncertainty
  probability distributions in statistical models
unsupervised learning.
    See clustering.
url package
urllib package
user documentation
user experience
  inverted pyramid of journalism
  plain language with no jargon
  problems with
    how to recognize
    how to remedy
  science of
  visualizations
UTF-8 text

V

values of data points

variables
  latent
  random
variance
VB (variational Bayes)
VBA (Visual Basic for Applications)
versioning
Visual Display of Quantitative Information, The (Tufte)
visualizations

W

web hosting, using cloud services
web scraping
web-based APIs
web-based document hosting, for storage
white papers
  limitations of
  strengths of
  when to use
wiki systems, storage in
Windows OS, line endings in
wishes, pragmatism and
Word.
    See Microsoft Word.
wrangling data.
    See data wrangling.

X

XBRL (Extensible Business Reporting Language)
XML (extensible markup language)

Y

Yang, Yiming
