Index

Note: Page numbers followed by f indicate figures.

A

Abstraction  144
Accuracy  207–208
Apollo Lunar Surface Experiments Package (ALSEP)  369–372 , 370–371f
Apriori algorithm  221
ASCII editor  185
Autocoding 
lexical parsing  27–28
12 lines of Python code  31–34
medical nomenclature  24
natural language autocoders  25–26
nomenclature coding  24–25
on-the-fly autocoding  28

B

Bayesian analysis  297–301
Big Data 
analysis  5
data preparation  4
data structure and content  3
definition  1–2
goals  3
introspection  5
location  3
longevity  4
measurements  4
mechanisms  5–6
purpose of  7–8
reproducibility  4
research universe  8–13
stakes  4
Big Data resources 
analytic algorithm  332
ASCII editor  185
back-of-envelope analyses 
estimation-only analyses  266
mean-field averaging  267
complete and representative data  188
complexity 
approximate/local solutions unacceptable  327
incremental  328
model for reality  328
random intervals  327
simple design  327
data description  331
data flattening  200–205
data objects identification and classification  187
data plotting 
data distribution  190
Gnuplot  190
Matplotlib  190
normal/Gaussian distribution  191–192
data properties 
annotate with metadata  193
data within data object  195
immutable data  196
introspective data  196
membership in defined class  196
scientific value  193
simplified data  197
time stamped data  194
uniqueness/identity  193
data reduction  332
denominators  259–260
formulated questions  329
immutability  See Immutability
large files, view and search  198–200
multimodality  270
number of records 
catchment population  186–187
data manager  186
sample number/dimension dichotomy  187
outliers and anomalies  264–266
preference prediction  268–270
query output adequacy  330
readme/index file  186
reduce human errors 
data entry errors  426
identification errors  426
medical errors  427
motor vehicle accidents  427
rocket launch errors  427
reformulated questions  330
resource builders 
Big Data designers  428
Big Data indexers  429
data curators  430
data managers  430
domain experts  429
metadata experts  429
network specialists  431
ontologists and classification experts  429
security experts  431
software programmers  429
resource evaluation  329
resource users 
data analysts  431
data reduction specialists  433
data validators  431
data visualizers  433
free-lance Big Data consultants  434
generalist problem solver  432
scientists with minimal programming skills  433
results and conclusions  335
security policy/restricted data  197–198
self-descriptive information  188
solution estimation  192
validation  336
word frequency distributions  260–264
Big Data statistics 
biomarkers  307
cancel-out hypothesis  308
creating unbiased models  303–304
credent results  305–306
death certificates  307–308
DNA sequences  309
hypothesis  304–305
multidimensionality  314–317 , 315f
overfitting  309
pitfalls 
ambiguity of system elements  313
blending bias  312
complexity bias  313
misguided data  311
statistical method bias  313
Simpson’s paradox  310–311
time-window bias  306–307
Biomarkers  307
Black holes  271–274
Blockchain 
conditions  177
creating  177–178
properties  178
time stamp  179
triples  177
Burrows Wheeler transform (BWT)  36–50

C

Cancel-out hypothesis  308
Cancer Biomedical Informatics Grid (caBIG™)  339–344
Central Limit Theorem  291–293 , 292f
Class blending  110–111
Class hierarchy  103–104
Classification  101–104
Classifier algorithm  247
Clustering algorithm 
vs. classifiers  247
drawbacks  246
k-means algorithm  246
operation  246
purpose  245
CODIS (Combined DNA Index System)  368–369
Compliance  164–165
Concordances  16–19 , 34–36
Correlation method 
dot product  244–245 , 244f
Pearson correlation  243–244 , 243f
Python’s Scipy  243
Counting 
gene  224–225
medical error/counting errors  212–213
negations  214
systematic counting error  211
word counting rules  211–213
Cryptography  381–387
Cygwin  199–200

D

Data analysis 
classification  249–250
clustering algorithm 
vs. classifiers  247
drawbacks  246
k-means algorithm  246
operation  246
purpose  245
correlation method 
dot product  244–245 , 244f
Pearson correlation  243–244 , 243f
Python’s Scipy  243
data persistence methods  247–249
fast operation 
addition and multiplication  238
cryptographic programs, beware of  241–242
inexact answers  242
one-pass equation  242–243 , 242–243f
one-way hashes  238–240
pseudorandom number generator  240–241
random access to files  237
time stamps  238
NoSQL databases  252–256
random number generator  See Random number generator
speed and scalability issues 
combinatorics  237
high-speed programming languages  232
iterative loops, system calls within  234
line-by-line reading  233
look-up tables and pre-computed pointers  235
pay for smart speed  237
persistent data  233
proprietary software  234
RegEx language  236
software testing on data subset  233
solutions  231–232
turn-key application  234
unpredictable software  236
utilities  234
SQLite  251–252
Data identification 
advantages  53–55
data objects, naming  55
data scrubbing  69–71
deidentification  66–68
description  53–54
identifier system, properties of  55–58
in image header  71–74
one-way hash  74–82
poor identifiers 
accession number  63
names  60–61
social security number  62
reidentification  68–69
unique identifier 
life science identifiers  64
object identifier  64–66
properties  58
registries  63–64
UUID  58–59
Data Quality Act  397
Data range  209–211
Data reanalysis 
additional analyses and updating results  356
clarification and improved earlier studies  355
data and data documentation errors  353
data misinterpretation  353
data verification  354–355
exoplanets  357–359 , 358f
extending original study  356
irreproducible results  351–352
JADE collider data  356–357
message framing  353–354
outright fraud  353
scientific misconduct  354
validation  355
vindication  357
Data repurposing 
abandoned data  365–366
Apollo Lunar Surface Experiments Package (ALSEP) data  369–372 , 370–371f
CODIS (Combined DNA Index System)  368–369
dark data  365–367
Hadley data  369
legacy data  366–367
new uses for data  363 , 364f
novel data sets creation  365
original research performance  364
Plate Boundary Observatory data  369–370
zip codes  367–368
Data scrubbing  69–71
Data security 
decryption  381–382 , 385
encryption  381–382 , 384–385
no-cost solution  384
personal identifiers  388–391
public/private key cryptography 
algorithm  382
limitations  384
for RSA encryption  382–383
signature and authentication  383–384
use  382
redundancy  385–386
time and money  386
Data sharing 
complaints 
bureaucratic hurdles  380
comply rules  376
data compartmentalization  379
data hackers  378
data misinterpretation  374
data protector  375
flawed data  377
legal ownership  376
limited access to responsible professionals  375
missing data  380
reimbursement  377
research parasites  374
research protocols  379
universal data standards  375
Labeled-Release data on life on Mars  387–388
reasons  373–374
Denominators  259–260
Digital Millennium Copyright Act of 1998 (DMCA)  399
Dot product  244–245 , 244f
Dublin Core  93–95

E

Encapsulation  143

F

Failure 
abandonware  338
approach to Big Data  328–337
Big Data projects  323
Cancer Biomedical Informatics Grid  339–344
categories  322
data managers  322
failed standards 
Ada 95  323–324
BLOB  323
data management principles  326
instability  325
metric system  324
OSI  323
triples  326
Gaussian copula function  344–347
hospital informatics  322
National Biological Information Infrastructure  337 , 338f
occurrence  321–322
precautions 
legacy data, preserving  339
utilities  338
random intervals  327
Frequency distribution of words 
categorical data  260 , 262
quantitative data  260
Zipf distribution 
cumulative index  262–264 , 264f
most frequent word  261–262
Pareto’s principle  260
“stop” word  261–262

G

Gaussian copula function  344–347
Gnuplot  190
GraphViz  120–122

H

Hadley data  369
Havasupai Tribe v. Arizona Board of Regents  413–416

I

ImageMagick  71–72
Immutability and identifiers 
blockchains and distributed ledgers  176–179
coping with data  173–174
immortal data objects  173
metadata tags  170–171
reconciliation across institutions  174–175
replicative annotations  171–172
trusted timestamp  176
zero-knowledge reconciliation  179–183
Indexing  22–24 , 29–31
Infamous birthday problem  294–295 , 294f
Inheritance  143
Introspection  5 , 196
Big Data resources  140 , 152–154
data object  140–142
object oriented programming 
abstraction  144
benefits  144
encapsulation  143
feature  139–140
inheritance  143
objects  138
polymorphism  143–144
reflection  144
Ruby  138–139
time stamping  145–147
triplestore  147–152

J

JADE collider data  356–357

L

Labeled-Release study  387–388
Legalities 
accuracy and legitimacy  395–397
consent 
biases by consent process  408
confidential consent status  407
confidentiality  404–405
confidentiality risk  409
data managers  404
divert responsibility  410
informed consent  404 , 406
legally valid consent form  405
preserving consent  407
privacy  405
records of actions  408
retraction  408
train staff on consent-related issues  408
unintended purposes  410
unmerited revenue source  409
Havasupai tribe  413–416
privacy policies  411–412
protection 
breaches  402–403
identification theft  403–404
tort  402
resources, right to create, use and share 
copyright laws  398–399
data managers, suggestions for  399–400
Digital Millennium Copyright Act of 1998  399
No Electronic Theft Act of 1997  399
standards 
intellectual property  401
license fee  400–401
precautions using  401–402
timely access to data  412–413
unconsented data  409–411
Life science identifiers (LSID)  64
Lotsa data  2

M

Matplotlib  190
Measurement 
accuracy  207–208
biometrics  225–226
control concept  222–223
counting 
gene  224–225
medical error/counting errors  212–213
negations  214
systematic counting error  211
word counting rules  211–213
data range  209–211
data reduction 
Apriori algorithm  221
gravitational forces  219
process  221
randomness  220–221
redundancy  219–220
narrow range data  226
normalizing and transforming data 
converting interval data set  217 , 218f
population difference, adjusting  216
rendering data values dimensionless  216 , 217f
weighting  218
precision  207–208
statistical significance  223–224
steganography  208
Message digest version 5 (md5) algorithm  74–75
Metadata 
concept  85
Dublin Core  93–95
namespace  88–90
semantics  87–88
triples  87–88 , 90–92
XML  85–87
Monte Carlo simulations  288–291
Monty Hall problem  295–297

N

Namespace  88–90
Natural language autocoders  25–26
No Electronic Theft Act of 1997 (NET Act)  399
Noisy class  110–111
Nomenclature coding  24–25
NoSQL databases  252–256

O

Object by relationships  97–101
Object by similarity  98–100
Object identifier (OID) 
creating  64–65
HL7  65–66
problem  65
Object oriented programming  116
abstraction  144
benefits  144
data object, assigning  107
encapsulation  143
feature  139–140
inheritance  143
multiclass inheritance  107
objects  138
polymorphism  143–144
reflection  144
Ruby  138–139
syntax  106–107
One-way hash algorithm  74–82 , 238–240
On-the-fly autocoding  28
Ontologies 
class blending (noisy class)  110–111
classification 
Aristotle  102
biological classifications  101–102
data domain  104
data objects hierarchy  102
vs. identification system  104
parent class  103–104
taxonomy  104
class model 
Big Data resource  108–109
complex ontology  108–109
inheritance rules  107
multiclass inheritance  107–108
object oriented programming  106–107
Python/Perl programming languages  106
Ruby programming language  106
simple classification  109
class relationships visualization 
classification of human neoplasms  121 , 122f
Class Object  121 , 121f
corrupted classification  122 , 123f
GraphViz  120
RDF Schema  123–124
multiple parent classes  104–106
paradoxes  115–116
pitfalls 
classes and properties  113
descriptive language  113
miscellaneous classes  112
transitive classes  112
RDF Schema  117–120
upper level ontology  114–115
Outliers  264–266
Overfitting  309

P

Pearson correlation  243–244 , 243f
Personal identifiers  388–391
Plate Boundary Observatory data  369–370
Polymorphism  143–144
Precision  207–208
Pseudorandom number generator  240–241
calculus  282–283 , 282f
integration  281–282 , 281f
pi calculation  279–280 , 280f
sample  278
simple simulation  278–279
Python/Perl programming languages  106–107
Python’s Scipy  243

R

Random number generator 
Bayesian analysis  297–301
Central Limit Theorem  291–293 , 292f
frequency of unlikely occurrences  293–294
infamous birthday problem  294–295 , 294f
Monte Carlo simulations  288–291
Monty Hall problem  295–297
pseudorandom number generator 
calculus  282–283 , 282f
integration  281–282 , 281f
pi calculation  279–280 , 280f
sample  278
simple simulation  278–279
repeated sampling 
output/conclusion  287
power estimates  288
random numbers generation  285–286
repeated simulation  286–287
sample size  288
scalability  287–288
shuffling  284–285
statistical method  284
Reflection  144
Resource Description Framework (RDF) Schema 
and class properties  117–120
features  90
GraphViz  123–124
syntax for triples  91–92
Ruby programming language  106–107 , 138–139

S

Semantics  87–88
Simpson’s paradox  310–311
Small data  3–5
Societal issues 
anti-hypothesis  421
Big Brother hypothesis  420
Big Snoop hypothesis  419
Borg invasion hypothesis  420
Citizen Scientists  437–440 , 440f
decision-making algorithms  425–428
Egghead heaven hypothesis  421
Facebook hypothesis  421
George Orwell’s 1984  440–442
hubris and hyperbole  434–437
Junkyard hypothesis  420
public mistrust  424–425
reduced cost and increased productivity  422–424
resource builders  428
resource users  431
Scavenger hunt hypothesis  421
Specification 
complex  161
compliance  164
strength  161
versioning  161 , 163
SQLite  251–252
Standards 
Chocolate Teapot  165–167
coercion  162
complex  161
compliance  164–165
construction rules  160–161
creation  157
Darwinian struggle  162
filtering-out process  156
measures  162
new standards  156
popular  159
profit  157
purpose of  159
strength  161
versioning  161–164
Suggested Upper Merged Ontology (SUMO)  114–115

T

Term extraction  19–22
Time stamping  145–147 , 176 , 194 , 238
Time-window bias  306–307
Triples  87–88 , 90–92
Triplestore  147–152
Trusted timestamp  176

U

Unique identifier 
life science identifiers  64
object identifier  64–66
properties  58
registries  63–64
UUID  58–59
Universally unique identifier (UUID) 
collisions  59
Linux  59
properties  58
Python  59

V

Versioning  161–164

W

Word counting  211–213
World Intellectual Property Organization (WIPO)  401

X

XML (eXtensible Markup Language) 
drawback  86–87
importance  86
properties  86
syntax  85–86
XML Schema  86

Z

Zero-knowledge reconciliation  179–183