Table of Contents


Brief Table of Contents



About this Book

About the Cover Illustration

Part 1. Preparing and gathering data and knowledge

Chapter 1. Philosophies of data science

1.1. Data science and this book

1.2. Awareness is valuable

1.3. Developer vs. data scientist

1.4. Do I need to be a software developer?

1.5. Do I need to know statistics?

1.6. Priorities: knowledge first, technology second, opinions third

1.7. Best practices

1.7.1. Documentation

1.7.2. Code repositories and versioning

1.7.3. Code organization

1.7.4. Ask questions

1.7.5. Stay close to the data

1.8. Reading this book: how I discuss concepts


Chapter 2. Setting goals by asking good questions

2.1. Listening to the customer

2.1.1. Resolving wishes and pragmatism

2.1.2. The customer is probably not a data scientist

2.1.3. Asking specific questions to uncover fact, not opinions

2.1.4. Suggesting deliverables: guess and check

2.1.5. Iterate your ideas based on knowledge, not wishes

2.2. Ask good questions—of the data

2.2.1. Good questions are concrete in their assumptions

2.2.2. Good answers: measurable success without too much cost

2.3. Answering the question using data

2.3.1. Is the data relevant and sufficient?

2.3.2. Has someone done this before?

2.3.3. Figuring out what data and software you could use

2.3.4. Anticipate obstacles to getting everything you want

2.4. Setting goals

2.4.1. What is possible?

2.4.2. What is valuable?

2.4.3. What is efficient?

2.5. Planning: be flexible



Chapter 3. Data all around us: the virtual wilderness

3.1. Data as the object of study

3.1.1. The users of computers and the internet became data generators

3.1.2. Data for its own sake

3.1.3. Data scientist as explorer

3.2. Where data might live, and how to interact with it

3.2.1. Flat files

3.2.2. HTML

3.2.3. XML

3.2.4. JSON

3.2.5. Relational databases

3.2.6. Non-relational databases

3.2.7. APIs

3.2.8. Common bad formats

3.2.9. Unusual formats

3.2.10. Deciding which format to use

3.3. Scouting for data

3.3.1. First step: Google search

3.3.2. Copyright and licensing

3.3.3. The data you have: is it enough?

3.3.4. Combining data sources

3.3.5. Web scraping

3.3.6. Measuring or collecting things yourself

3.4. Example: microRNA and gene expression



Chapter 4. Data wrangling: from capture to domestication

4.1. Case study: best all-time performances in track and field

4.1.1. Common heuristic comparisons

4.1.2. IAAF Scoring Tables

4.1.3. Comparing performances using all data available

4.2. Getting ready to wrangle

4.2.1. Some types of messy data

4.2.2. Pretend you’re an algorithm

4.2.3. Keep imagining: what are the possible obstacles and uncertainties?

4.2.4. Look at the end of the data and the file

4.2.5. Make a plan

4.3. Techniques and tools

4.3.1. File format converters

4.3.2. Proprietary data wranglers

4.3.3. Scripting: use the plan, but then guess and check

4.4. Common pitfalls

4.4.1. Watch out for Windows/Mac/Linux problems

4.4.2. Escape characters

4.4.3. The outliers

4.4.4. Horror stories around the wranglers’ campfire



Chapter 5. Data assessment: poking and prodding

5.1. Example: the Enron email data set

5.2. Descriptive statistics

5.2.1. Stay close to the data

5.2.2. Common descriptive statistics

5.2.3. Choosing specific statistics to calculate

5.2.4. Make tables or graphs where appropriate

5.3. Check assumptions about the data

5.3.1. Assumptions about the contents of the data

5.3.2. Assumptions about the distribution of the data

5.3.3. A handy trick for uncovering your assumptions

5.4. Looking for something specific

5.4.1. Find a few examples

5.4.2. Characterize the examples: what makes them different?

5.4.3. Data snooping (or not)

5.5. Rough statistical analysis

5.5.1. Dumb it down

5.5.2. Take a subset of the data

5.5.3. Increasing sophistication: does it improve results?



Part 2. Building a product with software and statistics

Chapter 6. Developing a plan

6.1. What have you learned?

6.1.1. Examples

6.1.2. Evaluating what you’ve learned

6.2. Reconsidering expectations and goals

6.2.1. Unexpected new information

6.2.2. Adjusting goals

6.2.3. Consider more exploratory work

6.3. Planning

6.3.1. Examples

6.4. Communicating new goals



Chapter 7. Statistics and modeling: concepts and foundations

7.1. How I think about statistics

7.2. Statistics: the field as it relates to data science

7.2.1. What statistics is

7.2.2. What statistics is not

7.3. Mathematics

7.3.1. Example: long division

7.3.2. Mathematical models

7.3.3. Mathematics vs. statistics

7.4. Statistical modeling and inference

7.4.1. Defining a statistical model

7.4.2. Latent variables

7.4.3. Quantifying uncertainty: randomness, variance, and error terms

7.4.4. Fitting a model

7.4.5. Bayesian vs. frequentist statistics

7.4.6. Drawing conclusions from models

7.5. Miscellaneous statistical methods

7.5.1. Clustering

7.5.2. Component analysis

7.5.3. Machine learning and black box methods



Chapter 8. Software: statistics in action

8.1. Spreadsheets and GUI-based applications

8.1.1. Spreadsheets

8.1.2. Other GUI-based statistical applications

8.1.3. Data science for the masses

8.2. Programming

8.2.1. Getting started with programming

8.2.2. Languages

8.3. Choosing statistical software tools

8.3.1. Does the tool have an implementation of the methods?

8.3.2. Flexibility is good

8.3.3. Informative is good

8.3.4. Common is good

8.3.5. Well documented is good

8.3.6. Purpose-built is good

8.3.7. Interoperability is good

8.3.8. Permissive licenses are good

8.3.9. Knowledge and familiarity are good

8.4. Translating statistics into software

8.4.1. Using built-in methods

8.4.2. Writing your own methods



Chapter 9. Supplementary software: bigger, faster, more efficient

9.1. Databases

9.1.1. Types of databases

9.1.2. Benefits of databases

9.1.3. How to use databases

9.1.4. When to use databases

9.2. High-performance computing

9.2.1. Types of HPC

9.2.2. Benefits of HPC

9.2.3. How to use HPC

9.2.4. When to use HPC

9.3. Cloud services

9.3.1. Types of cloud services

9.3.2. Benefits of cloud services

9.3.3. How to use cloud services

9.3.4. When to use cloud services

9.4. Big data technologies

9.4.1. Types of big data technologies

9.4.2. Benefits of big data technologies

9.4.3. How to use big data technologies

9.4.4. When to use big data technologies

9.5. Anything as a service



Chapter 10. Plan execution: putting it all together

10.1. Tips for executing the plan

10.1.1. If you’re a statistician

10.1.2. If you’re a software engineer

10.1.3. If you’re a beginner

10.1.4. If you’re a member of a team

10.1.5. If you’re leading a team

10.2. Modifying the plan in progress

10.2.1. Sometimes the goals change

10.2.2. Something might be more difficult than you thought

10.2.3. Sometimes you realize you made a bad choice

10.3. Results: knowing when they’re good enough

10.3.1. Statistical significance

10.3.2. Practical usefulness

10.3.3. Reevaluating your original accuracy and significance goals

10.4. Case study: protocols for measurement of gene activity

10.4.1. The project

10.4.2. What I knew

10.4.3. What I needed to learn

10.4.4. The resources

10.4.5. The statistical model

10.4.6. The software

10.4.7. The plan

10.4.8. The results

10.4.9. Submitting for publication and feedback

10.4.10. How it ended



Part 3. Finishing off the product and wrapping up

Chapter 11. Delivering a product

11.1. Understanding your customer

11.1.1. Who is the entire audience for the results?

11.1.2. What will be done with the results?

11.2. Delivery media

11.2.1. Report or white paper

11.2.2. Analytical tool

11.2.3. Interactive graphical application

11.2.4. Instructions for how to redo the analysis

11.2.5. Other types of products

11.3. Content

11.3.1. Make important, conclusive results prominent

11.3.2. Don’t include results that are virtually inconclusive

11.3.3. Include obvious disclaimers for less significant results

11.3.4. User experience

11.4. Example: analyzing video game play



Chapter 12. After product delivery: problems and revisions

12.1. Problems with the product and its use

12.1.1. Customers not using the product correctly

12.1.2. UX problems

12.1.3. Software bugs

12.1.4. The product doesn’t solve real problems

12.2. Feedback

12.2.1. Feedback means someone is using your product

12.2.2. Feedback is not disapproval

12.2.3. Read between the lines

12.2.4. Ask for feedback if you must

12.3. Product revisions

12.3.1. Uncertainty can make revisions necessary

12.3.2. Designing revisions

12.3.3. Engineering revisions

12.3.4. Deciding which revisions to make



Chapter 13. Wrapping up: putting the project away

13.1. Putting the project away neatly

13.1.1. Documentation

13.1.2. Storage

13.1.3. Thinking ahead to future scenarios

13.1.4. Best practices

13.2. Learning from the project

13.2.1. Project postmortem

13.3. Looking toward the future



Exercises: Examples and Answers

Chapter 2

Chapter 3

Chapter 4

Chapter 5

Chapter 6

Chapter 7

Chapter 8

Chapter 9

Chapter 10

Chapter 11

Chapter 12

Chapter 13

The lifecycle of a data science project


List of Figures

List of Tables
