Index

Note: Page numbers followed by f indicate figures.

Abstraction 144

Accuracy 207–208

Apollo Lunar Surface Experiments Package (ALSEP) 369–372, 370–371f

Apriori algorithm 221

ASCII editor 185

Autocoding

lexical parsing 27–28

12 lines of Python code 31–34

medical nomenclature 24

natural language autocoders 25–26

nomenclature coding 24–25

on-the-fly autocoding 28

Bayesian analysis 297–301

Big Data

analysis 5

data preparation 4

data structure and content 3

definition 1–2

goals 3

introspection 5

location 3

longevity 4

measurements 4

mechanisms 5–6

purpose of 7–8

reproducibility 4

research universe 8–13

stakes 4

Big Data resources

analytic algorithm 332

ASCII editor 185

back-of-envelope analyses

estimation-only analyses 266

mean-field averaging 267

complete and representative data 188

complexity

approximate/local solutions unacceptable 327

incremental 328

model for reality 328

random intervals 327

simple design 327

data description 331

data flattening 200–205

data objects identification and classification 187

data plotting

data distribution 190

Gnuplot 190

Matplotlib 190

normal/Gaussian distribution 191–192

data properties

annotate with metadata 193

data within data object 195

immutable data 196

introspective data 196

membership in defined class 196

scientific value 193

simplified data 197

time stamped data 194

uniqueness/identity 193

data reduction 332

denominators 259–260

formulated questions 329

immutability See Immutability

large files, view and search 198–200

multimodality 270

number of records

catchment population 186–187

data manager 186

sample number/dimension dichotomy 187

outliers and anomalies 264–266

preference prediction 268–270

query output adequacy 330

readme/index file 186

reduce human errors

data entry errors 426

identification errors 426

medical errors 427

motor vehicle accidents 427

rocket launch errors 427

reformulated questions 330

resource builders

Big Data designers 428

Big Data indexers 429

data curators 430

data managers 430

domain experts 429

metadata experts 429

network specialists 431

ontologists and classification experts 429

security experts 431

software programmers 429

resource evaluation 329

resource users

data analysts 431

data reduction specialists 433

data validators 431

data visualizers 433

free-lance Big Data consultants 434

generalist problem solver 432

scientists with minimal programming skills 433

results and conclusions 335

security policy/restricted data 197–198

self-descriptive information 188

solution estimation 192

validation 336

word frequency distributions 260–264

Big Data statistics

biomarkers 307

cancel-out hypothesis 308

creating unbiased models 303–304

credent results 305–306

death certificates 307–308

DNA sequences 309

hypothesis 304–305

multidimensionality 314–317, 315f

overfitting 309

pitfalls

ambiguity of system elements 313

blending bias 312

complexity bias 313

misguided data 311

statistical method bias 313

Simpson’s paradox 310–311

time-window bias 306–307

Biomarkers 307

Black holes 271–274

Blockchain

conditions 177

creating 177–178

properties 178

time stamp 179

triples 177

Burrows Wheeler transform (BWT) 36–50

Cancel-out hypothesis 308

Cancer Biomedical Informatics Grid (CaBig™) 339–344

Central Limit Theorem 291–293, 292f

Class blending 110–111

Class hierarchy 103–104

Classification 101–104

Classifier algorithm 247

Clustering algorithm

vs. classifiers 247

drawbacks 246

k-means algorithm 246

operation 246

purpose 245

CODIS (Combined DNA Index System) 368–369

Compliance 164–165

Concordances 16–19, 34–36

Correlation method

dot product 244–245, 244f

Pearson correlation 243–244, 243f

Python’s Scipy 243

Counting

gene 224–225

medical error/counting errors 212–213

negations 214

systematic counting error 211

word counting rules 211–213

Cryptography 381–387

Cygwin 199–200

Data analysis

classification 249–250

clustering algorithm

vs. classifiers 247

drawbacks 246

k-means algorithm 246

operation 246

purpose 245

correlation method

dot product 244–245, 244f

Pearson correlation 243–244, 243f

Python’s Scipy 243

data persistence methods 247–249

fast operation

addition and multiplication 238

cryptographic programs, beware of 241–242

inexact answers 242

one-pass equation 242–243, 242–243f

one-way hashes 238–240

pseudorandom number generator 240–241

random access to files 237

time stamps 238

NoSQL databases 252–256

random number generator See Random number generator

speed and scalability issues

combinatorics 237

high-speed programming languages 232

iterative loops, system calls within 234

line-by-line reading 233

look-up tables and pre-computed pointers 235

pay for smart speed 237

persistent data 233

proprietary software 234

RegEx language 236

software testing on data subset 233

solutions 231–232

turn-key application 234

unpredictable software 236

utilities 234

SQLite 251–252

Data identification

advantages 53–55

data objects, naming 55

data scrubbing 69–71

deidentification 66–68

description 53–54

identifier system, properties of 55–58

in image header 71–74

one-way hash 74–82

poor identifiers

accession number 63

names 60–61

social security number 62

reidentification 68–69

unique identifier

life science identifiers 64

object identifier 64–66

properties 58

registries 63–64

UUID 58–59

Data Quality Act 397

Data range 209–211

Data reanalysis

additional analyses and updating results 356

clarification and improved earlier studies 355

data and data documentation errors 353

data misinterpretation 353

data verification 354–355

exoplanets 357–359, 358f

extending original study 356

irreproducible results 351–352

JADE collider data 356–357

message framing 353–354

outright fraud 353

scientific misconduct 354

validation 355

vindication 357

Data repurposing

abandoned data 365–366

Apollo Lunar Surface Experiments Package (ALSEP) data 369–372, 370–371f

CODIS (Combined DNA Index System) 368–369

dark data 365–367

Hadley data 369

legacy data 366–367

new uses for data 363, 364f

novel data sets creation 365

original research performance 364

Plate Boundary Observatory data 369–370

zip codes 367–368

Data scrubbing 69–71

Data security

decryption 381–382 , 385

encryption 381–382 , 384–385

no-cost solution 384

personal identifiers 388–391

public/private key cryptography

algorithm 382

limitations 384

for RSA encryption 382–383

signature and authentication 383–384

use 382

redundancy 385–386

time and money 386

Data sharing

complaints

bureaucratic hurdles 380

comply rules 376

data compartmentalization 379

data hackers 378

data misinterpretation 374

data protector 375

flawed data 377

legal ownership 376

limited access to responsible professionals 375

missing data 380

reimbursement 377

research parasites 374

research protocols 379

universal data standards 375

Labeled-Release data on life on Mars 387–388

reasons 373–374

Denominators 259–260

Digital Millennium Copyright Act of 1998 (DMCA) 399

Dot product 244–245, 244f

Dublin Core 93–95

Encapsulation 143

Failure

abandonware 338

approach to Big Data 328–337

Big Data projects 323

Cancer Biomedical Informatics Grid 339–344

categories 322

data managers 322

failed standards

Ada 95 323–324

BLOB 323

data management principles 326

instability 325

metric system 324

OSI 323

triples 326

Gaussian copula function 344–347

hospital informatics 322

National Biological Information Infrastructure 337, 338f

occurrence 321–322

precautions

legacy data, preserving 339

utilities 338

random intervals 327

Frequency distribution of words

categorical data 260, 262

quantitative data 260

Zipf distribution

cumulative index 262–264, 264f

most frequent word 261–262

Pareto’s principle 260

“stop” word 261–262

Gaussian copula function 344–347

Gnuplot 190

GraphViz 120–122

Hadley data 369

Havasupai Tribe v. Arizona Board of Regents 413–416

Identifier See Data identification; Immutability and identifiers

ImageMagick 71–72

Immutability and identifiers

blockchains and distributed ledgers 176–179

coping with data 173–174

immortal data objects 173

metadata tags 170–171

reconciliation across institutions 174–175

replicative annotations 171–172

trusted timestamp 176

zero-knowledge reconciliation 179–183

Indexing 22–24, 29–31

Infamous birthday problem 294–295, 294f

Inheritance 143

Introspection 5, 196

Big Data resources 140, 152–154

data object 140–142

object oriented programming

abstraction 144

benefits 144

encapsulation 143

feature 139–140

inheritance 143

objects 138

polymorphism 143–144

reflection 144

Ruby 138–139

time stamping 145–147

triplestore 147–152

JADE collider data 356–357

Labeled-Release study 387–388

Legalities

accuracy and legitimacy 395–397

consent

biases by consent process 408

confidential consent status 407

confidentiality 404–405

confidentiality risk 409

data managers 404

divert responsibility 410

informed consent 404, 406

legally valid consent form 405

preserving consent 407

privacy 405

records of actions 408

retraction 408

train staff on consent-related issues 408

unintended purposes 410

unmerited revenue source 409

Havasupai tribe 413–416

privacy policies 411–412

protection

breaches 402–403

identification theft 403–404

tort 402

resources, right to create, use and share

data managers, suggestions for 399–400

Digital Millennium Copyright Act of 1998 399

No Electronic Theft Act of 1997 399

standards

intellectual property 401

license fee 400–401

precautions using 401–402

timely access to data 412–413

unconsented data 409–411

Life science identifiers (LSID) 64

Lotsa data 2

Matplotlib 190

Measurement

accuracy 207–208

biometrics 225–226

control concept 222–223

counting

gene 224–225

medical error/counting errors 212–213

negations 214

systematic counting error 211

word counting rules 211–213

data range 209–211

data reduction

Apriori algorithm 221

gravitational forces 219

process 221

randomness 220–221

redundancy 219–220

narrow range data 226

normalizing and transforming data

converting interval data set 217, 218f

population difference, adjusting 216

rendering data values dimensionless 216, 217f

weighting 218

precision 207–208

statistical significance 223–224

steganography 208

Message digest version 5 (md5) algorithm 74–75

Metadata

concept 85

Dublin Core 93–95

namespace 88–90

semantics 87–88

triples 87–88, 90–92

XML 85–87

Monte Carlo simulations 288–291

Monty Hall problem 295–297

Mutability See Immutability and identifiers

Namespace 88–90

Natural language autocoders 25–26

No Electronic Theft Act of 1997 (NET Act) 399

Noisy class 110–111

Nomenclature coding 24–25

NoSQL databases 252–256

Object by relationships 97–101

Object by similarity 98–100

Object identifier (OID)

creating 64–65

HL7 65–66

problem 65

Object oriented programming 116

abstraction 144

benefits 144

data object, assigning 107

encapsulation 143

feature 139–140

inheritance 143

multiclass inheritance 107

objects 138

polymorphism 143–144

reflection 144

Ruby 138–139

syntax 106–107

One-way hash algorithm 74–82, 238–240

On-the-fly autocoding 28

Ontologies

class blending (noisy class) 110–111

classification

Aristotle 102

biological classifications 101–102

data domain 104

data objects hierarchy 102

vs. identification system 104

parent class 103–104

taxonomy 104

class model

Big Data resource 108–109

complex ontology 108–109

inheritance rules 107

multiclass inheritance 107–108

object oriented programming 106–107

Python/Perl programming languages 106

Ruby programming language 106

simple classification 109

class relationships visualization

classification of human neoplasms 121, 122f

Class Object 121, 121f

corrupted classification 122, 123f

GraphViz 120

RDF Schema 123–124

multiple parent classes 104–106

paradoxes 115–116

pitfalls

classes and properties 113

descriptive language 113

miscellaneous classes 112

transitive classes 112

RDF Schema 117–120

upper level ontology 114–115

Outliers 264–266

Overfitting 309

Pearson correlation 243–244, 243f

Personal identifiers 388–391

Plate Boundary Observatory data 369–370

Polymorphism 143–144

Precision 207–208

Pseudorandom number generator 240–241

calculus 282–283, 282f

integration 281–282, 281f

pi calculation 279–280, 280f

sample 278

simple simulation 278–279

Python/Perl programming languages 106–107

Python’s Scipy 243

Random number generator

Bayesian analysis 297–301

Central Limit Theorem 291–293, 292f

frequency of unlikely occurrences 293–294

infamous birthday problem 294–295, 294f

Monte Carlo simulations 288–291

Monty Hall problem 295–297

pseudorandom number generator

calculus 282–283, 282f

integration 281–282, 281f

pi calculation 279–280, 280f

sample 278

simple simulation 278–279

repeated sampling

output/conclusion 287

power estimates 288

random numbers generation 285–286

repeated simulation 286–287

sample size 288

scalability 287–288

shuffling 284–285

statistical method 284

Reflection 144

Resource Description Framework (RDF) Schema

and class properties 117–120

features 90

GraphViz 123–124

syntax for triples 91–92

Ruby programming language 106–107, 138–139

Semantics 87–88

Simpson’s paradox 310–311

Small data 3–5

Societal issues

anti-hypothesis 421

Big Brother hypothesis 420

Big Snoop hypothesis 419

Borg invasion hypothesis 420

Citizen Scientists 437–440, 440f

decision-making algorithms 425–428

Egghead heaven hypothesis 421

Facebook hypothesis 421

George Orwell’s 1984 440–442

hubris and hyperbole 434–437

Junkyard hypothesis 420

public mistrust 424–425

reduced cost and increased productivity 422–424

resource builders 428

resource users 431

Scavenger hunt hypothesis 421

Specification

complex 161

compliance 164

strength 161

versioning 161, 163

SQLite 251–252

Standards

Chocolate Teapot 165–167

coercion 162

complex 161

compliance 164–165

construction rules 160–161

creation 157

Darwinian struggle 162

filtering-out process 156

measures 162

new standards 156

popular 159

profit 157

purpose of 159

strength 161

versioning 161–164

Suggested Upper Merged Ontology (SUMO) 114–115

Term extraction 19–22

Time stamping 145–147, 176, 194, 238

Time-window bias 306–307

Triples 87–88, 90–92

Triplestore 147–152

Trusted timestamp 176

Unique identifier

life science identifiers 64

object identifier 64–66

properties 58

registries 63–64

UUID 58–59

Universally unique identifier (UUID)

collisions 59

Linux 59

properties 58

Python 59

Versioning 161–164

Word counting 211–213

World Intellectual Property Organization (WIPO) 401

XML (eXtensible Markup Language)

drawback 86–87

importance 86

properties 86

syntax 85–86

XML Schema 86

Zero-knowledge reconciliation 179–183
