abstracted query language, in databases
active products
aggregations, in databases
algorithmic templating
algorithms, long division and
alternative hypothesis
analytical tools
limitations of
strengths of
when to use
API keys
APIs (application programming interfaces)
artificial intelligence
artificial neural networks.
See neural networks.
ASCII text
assessment of data.
See data assessment.
atomic statistics
awareness, value of
backing up code
backslash character
BACON (BAyesian Clustering Over Networks), 2nd
Bayesian statistics, vs. frequentist statistics
prior distributions
propagating uncertainty
updating with new data
beer recommendation algorithm, 2nd
beginners, plan execution for
Bernoulli distribution
best practices
asking questions
code organization
code repositories (repos) and versioning
documentation
staying close to data
when finishing projects
big data, 2nd
benefits of
how to use
types of
when to use
bioinformatics example, 2nd
Bitbucket.org
black box methods, 2nd
boosting
branching
brute-force method
of documenting code
overview
bugs.
See software bugs.
caching
checking assumptions
about contents of data
about distribution of data
trick for uncovering assumptions
classification
closeness-of-prediction function
cloud services
benefits of
how to use
types of
when to use
clustering, 2nd
how it works
what to watch out for
when to use it
code
backing up
organization of
code documentation
code repositories, storage in
colleagues, consulting
combining data sources
comma-separated value.
See CSV.
communicating goals
complex methods
component analysis
how it works
what to watch out for
when to use it
Comprehensive R Archive Network.
See CRAN.
computer clusters
computer users, as data generators
concurrency, in databases
confidence intervals
copyright
correlation, conflating with significance
corrupted data
costs, sunk
CRAN (Comprehensive R Archive Network)
credible intervals
cross-validation
CSV (comma-separated value)
customers
after software delivery
improper use of product by.
See also listening to customer.
DAG (directed acyclic graph)
data
indiscriminate collection of
scouting for
combining data sources
copyright and licensing
measuring or collecting things yourself
using Google search
web scraping
data assessment
checking assumptions
about contents of data
about distribution of data
trick for uncovering assumptions
descriptive statistics
choosing specific statistics to calculate
common descriptive statistics
making tables or graphs
Enron email data set example
looking for something specific
characterizing examples
data snooping
finding few examples
rough statistical analysis
classification
clustering
increasing sophistication
inference
other statistical methods
taking subset of data
data frames, in R
data generators, computers and internet users as
data management, vs. data science
data points, neighborhood of
data science
defined
vs. data management
data science projects, lifecycle of
data scientists
as explorer
compared with software developers
data wrangling
common pitfalls
escape characters
line endings, parsing
outliers
in Enron email analysis
PDFs and
preparation for
end of data and file
making plan
messy data, types of
possible obstacles and uncertainties
pretending to be an algorithm
techniques and tools
track and field case study
common heuristic comparisons
comparing performances using all data available
IAAF scoring tables
data.frame function, in R
database indexes
databases
benefits of
abstracted query language
aggregations
caching
concurrency
indexing
scaling
document-oriented
graph databases
how to use
non-relational
poorly designed
relational, 2nd
types of
document-oriented
other
relational
using cloud services with
when to use
data-centric, use of term
DataLoader object
deep learning
delimited files
deliverables, suggesting
delivering product
content
disclaimers for less significant results
making conclusive results prominent
omitting virtually inconclusive results
user experience
feedback
asking for
is not disapproval
meaning of
understanding
understanding customer
considering audience
considering how results will be used.
See delivery media.
delivery media
analytical tools
limitations of
strengths of
when to use
instructions for how to redo analysis
limitations of
strengths of
when to use
interactive graphical applications
limitations of
strengths of
when to use
other types of products
reports
limitations of
strengths of
when to use
white papers
limitations of
strengths of
when to use
descriptive statistics, 2nd
choosing specific statistics to calculate
common descriptive statistics
making tables or graphs
developers.
See software developers.
directed acyclic graph.
See DAG.
division
dl object
documentation, 2nd
for developers
for users
of code
document-oriented databases
domain knowledge, relative importance of
Elasticsearch, 2nd
else command
EM (expectation maximization), 2nd
empirical Bayes
Enron email, 2nd
EOL (end-of-line) character
error terms
escape characters
Euclidean geometry
Excel.
See Microsoft Excel.
expectations, reconsidering
expectation maximization.
See EM.
exponential growth model
Extensible Business Reporting Language.
See XBRL.
feature extraction
feedback
asking for
meaning of
understanding
meaning of
perceptions
who is providing
fields, in abstract algebra
file formats
bad
converters
deciding which to use
flat files
HTML
JSON
unusual
XML
flat files
forking
frequentist statistics, vs. Bayesian statistics
prior distributions
propagating uncertainty
updating with new data
functional programming
functionality of product, solving problems with
functions
Gaussian mixture models
generators of data.
See data generators.
getData method
Git, reconstructing project history from
GitHub.com
goals
adjusting
changes in, plan modification due to
communicating
setting
goodness-of-fit function
Google, big data use by
GPUs (graphics processing units)
graph databases
graphs, making
gravity model example
growth model, exponential
GUI-based applications, 2nd
Hadoop framework
hierarchical clustering
histograms
HPC (high-performance computing)
benefits of
how to use
types of
when to use
HTML (hypertext markup language)
converting PDFs to
wrangling data from, preparation for
HttpUrlConnection package
hyper-parameters
hypothesis tests
IAAF scoring tables
IaaS (infrastructure as a service)
ICA (independent component analysis)
if command
Improved Inference of Gene Regulatory Networks through Integrated Bayesian Clustering and Dynamic Modeling of Time-Course Expression Data (PLoS ONE)
independent component analysis.
See ICA.
indexing in databases, 2nd
inference.
See also statistical modeling and inference.
inferential statistics
infrastructure as a service.
See IaaS.
__init__ method
interactive graphical applications
limitations of
strengths of
when to use
internet users, as data generators
Introducing the Enron Corpus (Klimt and Yang)
inverted pyramid of journalism
IoT (Internet of Things), 2nd
iPython GUI
IRR formula, Excel
iteration
jargon
Java programming language
overview
when to use
joining tables
JSON (JavaScript object notation), 2nd
keys
Klimt, Brian, 2nd
k-means
knowledge
as first priority
iterating ideas based on
knowledge of domain.
See domain knowledge.
latent variables
lede
legality of data usage
licensing
lifecycle of data science projects
likelihood function, 2nd
limma package, in R
line endings, parsing
linear models
linear regression
in Python
in R
linear_model object, in Python
linearModel variable, in R
Linux OS, line endings in
listening to customer
asking specific questions to uncover facts
iterate ideas based on knowledge
resolving wishes and pragmatism
suggesting deliverables
lm function, in R
loadData() function
local drives, storage in
log files
logistic regression
long division
Mac OS, line endings in
machine learning
how it works
what to watch out for
when to use it
macros, in spreadsheet software
managers, team, assigning
MAP (maximum a posteriori estimation), 2nd
MapReduce technology
Markov chain Monte Carlo.
See MCMC.
mathematics
long division
mathematical models
vs. statistics
MATLAB programming language
overview
when to use
maximum likelihood estimation.
See MLE.
maximum a posteriori estimation.
See MAP.
McKinlay, Chris
MCMC (Markov chain Monte Carlo), 2nd
mean reversion
messy data, types of
meta-algorithms
methods
built-in
linear regression in Python
linear regression in R
complex
in object-oriented programming
writing
microRNA example
Microsoft Excel, 2nd
Microsoft Word
miRanda algorithm
MLE (maximum likelihood estimation), 2nd, 3rd
model of gravity example
MongoDB
multicore package, in R
multiple testing correction
multiprocessing package, in R
natural language processing tools.
See NLP tools.
Natural Language Toolkit.
See NLTK.
negative results
neighborhood of a data point
network drives, storage in
neural networks
NLP (natural language processing) tools
NLTK (Natural Language Toolkit)
non-relational databases
normal distribution
NoSQL, 2nd, 3rd
null hypothesis
numpy package, 2nd
object-oriented programming
obstacles, anticipating
Octave programming language
overview
when to use
OpenOffice Calc
opinions
over-fitting, 2nd
PaaS (platform as a service)
pandas package
parsing line endings
particle physics, Standard Model of
passive products
PCA (principal component analysis)
pdf2html application
pdf2txt application
PDFs
converting to text or HTML
overview
Pearson correlation coefficient
plain language
plain text
plan execution
for beginners
for software engineers
for statisticians
for team leaders
for team members
modifying plan in progress
results
practical usefulness of
reevaluating original goals based on
statistical significance of
planning
examples
beer recommendation algorithm
bioinformatics and gene expression
Enron email analysis
top performances in track and field
flexibility in.
See also plan execution.
platform as a service.
See PaaS.
PMT formula, Excel
point estimates
positive results
posterior distributions
postmortem
determining different methods
reviewing old plans
reviewing old goals
reviewing technology choices
pragmatism, wishes and
predict function, in R
pride
principal component analysis.
See PCA.
probability
probability distributions, in statistical models
problems with products
doesn’t achieve expected goals
improper use by customers
software bugs
user experience problems
product functionality.
See functionality of product.
product problems.
See problems with products.
product revisions.
See revisions to product.
programming statistical software
applications
functions
getting started
languages
Java
MATLAB
Octave
Python
R
object-oriented
scripting
switching from spreadsheets to scripts
project postmortem.
See postmortem.
projects.
See data science projects.
p-values
Python programming language
linear regression in
overview
reading of flat files by
when to use
querying, using databases
questions
answering using data
anticipate obstacles
figuring out software to use
figuring out what data to use
has someone done this before?
is data relevant and sufficient?
asking
concrete in assumptions
measurable success without much cost
specific questions to uncover facts
quotients
R programming language
linear regression in
overview
reading of flat files by
when to use
random forest method
random variables
RDKit package
READMEs, 2nd
reduce step, MapReduce framework
relational databases, 2nd
reports
limitations of
strengths of
when to use
repos (repositories)
REST API
results
conclusive, making prominent
disclaimers for less significant
inconclusive
negative
positive
revisions to product, 2nd
choosing
designing
due to uncertainty
engineering
RStudio GUI
SaaS (software as a service)
scaling, databases and
scikit-learn package
scipy package
scoring tables, IAAF
scouting for data
combining data sources
copyright and licensing
measuring or collecting things yourself
using Google search
web scraping
scraping web, 2nd
scripts
data wrangling and
overview
programming
switching from spreadsheets to
self.filename attribute
shuffle step, MapReduce framework
shyness
social network analysis, 2nd
software as a service.
See SaaS.
software bugs
fixing
how to recognize
how to remedy, 2nd
software developers
compared with data scientists
whether need to be developer
software engineers, plan execution for
checking final results
consulting statisticians
planning for broadest possible set of outcomes or states
software, figuring out which to use
sort command
spherical geometry
spreadsheets
overview
switching to scripts from
SPSS GUI
SQL (Structured Query Language), 2nd
standard errors
Standard Model of particle physics example
statistical analysis
classification
clustering
increasing sophistication
inference
other statistical methods
taking subset of data
statistical modeling and inference
Bayesian vs. frequentist statistics
prior distributions
propagating uncertainty
updating with new data
defining statistical model
drawing conclusions from models
fitting model
expectation maximization and variational Bayes
likelihood function
Markov chain Monte Carlo
maximum likelihood estimation
maximum a posteriori estimation
over-fitting
latent variables
quantifying uncertainty
drawing conclusions from models involving uncertainty
formulating statistical model with uncertainty
probability distributions in statistical models
statistical software
accessibility of
GUI-based applications, 2nd
programming
applications
functions
getting started
languages
object-oriented
scripting
switching from spreadsheets to scripts
spreadsheets
tools
familiarity with
flexibility of
implementation of methods
informative
interoperability of
knowledge of
permissive licenses
popularity of
purpose-built
well documented
translating statistics into
using built-in methods
writing methods
statisticians, plan execution for
consulting software engineers
spending time with customers
testing software
statistics
descriptive
choosing specific statistics to calculate
common descriptive statistics
making tables or graphs
inferential
overview
relation to data science
statistical methods
clustering
component analysis
machine learning
translating into software
using built-in methods
writing methods
vs. mathematics
whether need to know.
See also statistical modeling and inference; statistical software.
storage
in code repositories
in local drives
in network drives
in READMEs
in wiki systems
using cloud services
with web-based document hosting
Structured Query Language.
See SQL.
subsets of data
sum command
SUM formula, Excel
sunk costs
supercomputers
SVMs (support vector machines)
tables
joining
making
tab-separated value.
See TSV.
TargetScan algorithm
team leaders, plan execution for
team members, plan execution for
assigning a manager
planning
specifying expectations
technical variance
technology, as second priority after knowledge
templating, algorithmic
text, converting PDFs to
time boxing
timidness
tools, for statistical software
track and field example, 2nd, 3rd
common heuristic comparisons
comparing performances using all data available
IAAF scoring tables
train-test separation
TSV (tab-separated value)
Tufte, Edward
Tumblr, API of
Twitter, big data use by
Uber example
uncertainty, quantifying
drawing conclusions from models involving uncertainty
formulating statistical model with uncertainty
probability distributions in statistical models
unsupervised learning.
See clustering.
url package
urllib package
user documentation
user experience
inverted pyramid of journalism
plain language with no jargon
problems with
how to recognize
how to remedy, 2nd, 3rd, 4th
science of
visualizations
UTF-8 text
values of data points
variables
latent
random
variance
VB (variational Bayes), 2nd
VBA (Visual Basic for Applications)
versioning
Visual Display of Quantitative Information, The (Tufte)
visualizations
web hosting, using cloud services
web scraping, 2nd
web-based APIs
web-based document hosting, for storage
white papers
limitations of
strengths of
when to use
wiki systems, storage in
Windows OS, line endings in
wishes, pragmatism and
Word.
See Microsoft Word.
wrangling data.
See data wrangling.
XBRL (Extensible Business Reporting Language)
XML (extensible markup language)