Glossary

Accuracy and precision,    Accuracy measures how close your data comes to being correct. Precision provides a measurement of reproducibility (i.e., whether repeated measurements of the same quantity produce the same result). Data can be accurate but imprecise, or precise but inaccurate. If you have a 10-pound object and you report its weight as 7.2376 pounds every time you weigh the object, then your precision is remarkable, but your accuracy is dismal.
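
A minimal sketch, in Python, of how the two concepts can be quantified; the object's true weight and the repeated readings are hypothetical. The mean of the readings gauges accuracy against the true value, and their spread (standard deviation) gauges precision.

    import statistics

    true_weight = 10.0                                # known weight of the object, in pounds
    readings = [7.2376, 7.2375, 7.2377, 7.2376]       # hypothetical repeated measurements

    accuracy_error = abs(statistics.mean(readings) - true_weight)   # closeness to the truth
    precision_spread = statistics.stdev(readings)                   # reproducibility

    print("accuracy error:", accuracy_error)          # large error: poor accuracy
    print("precision (spread):", precision_spread)    # tiny spread: high precision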

Algorithm,    Algorithms are perfect machines. They never make mistakes; they need no fuel; they never wear down; they are spiritual, not physical. The ability to use Big Data effectively depends on the availability of appropriate algorithms. In the past half-century, many brilliant algorithms have been developed for the kinds of computation-intensive work required for Big Data analysis.107,252

Annotation,    Annotation involves describing data elements with metadata or attaching supplemental information to data objects.

Anonymization versus deidentification,    Anonymization is a process whereby all the links between an individual and the individual’s data record are irreversibly removed. The difference between anonymization and deidentification is that anonymization is irreversible. There is no method for reestablishing the identity of the patient from anonymized records. Deidentified records can, under strictly controlled circumstances, be reidentified. Reidentification is typically achieved by entrusting a third party with a confidential list that maps individuals to deidentified records. Obviously, reidentification opens another opportunity for harming individuals if the confidentiality of the reidentification list is breached. The advantage of reidentification is that suspected errors in a deidentified database can be found and corrected if permission is obtained to reidentify individuals. For example, if the results of a study based on blood sample measurements indicate that the original samples were mislabeled, it might be important to reidentify the samples and conduct further tests to resolve the issue. In a fully anonymized data set, the opportunities for verifying the quality of the data are highly limited.

Artificial intelligence,    Artificial intelligence is the field of computer science that seeks to create machines and computer programs that seem to have human intelligence. The field of artificial intelligence sometimes includes the related fields of machine learning and computational intelligence. Over the past few decades, the term “artificial intelligence” has taken a battering from professionals inside and outside the field—for good reasons. First and foremost is that computers do not think in the way that humans think. Though powerful computers can now beat chess masters at their own game, the algorithms for doing so do not simulate human thought processes. Furthermore, most of the predicted benefits from artificial intelligence have not come to pass, despite decades of generous funding. The areas of neural networks, expert systems, and language translation have not met expectations. Detractors have suggested that artificial intelligence is not a well-defined subdiscipline within computer science, as it has encroached into areas unrelated to machine intelligence and has appropriated techniques from other fields, including statistics and numerical analysis. Some of the goals of artificial intelligence have been achieved (e.g., speech-to-text translation), and the analytic methods employed in Big Data analysis should be counted among the enduring successes of the field.

ASCII,    ASCII is the American Standard Code for Information Interchange, ISO-14962-1997. The ASCII standard is a way of assigning numeric codes to alphanumeric characters and punctuation. The original standard encodes each character in 7 bits, yielding 128 characters; the common 8-bit extensions (a string of 0’s and 1’s of length 8) allow 256 characters. Uppercase letters are assigned a different ASCII character than their lowercase equivalents. UNICODE, an expansion of ASCII, with a greater binary string length per character, accommodates non-Latin alphabets. See Binary data.
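
A brief illustration, in Python, of the character-to-code mapping; the built-in ord() and chr() functions expose the numeric codes directly.

    print(ord('A'), ord('a'))            # 65 97; uppercase and lowercase letters have different codes
    print(chr(66))                       # 'B'
    print('Big Data'.encode('ascii'))    # the underlying byte values of an ASCII string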

Bayh-Dole Act (Patent and Trademark Amendments of 1980, P.L. 96-517),    Adopted in 1980, U.S. Bayh-Dole legislation and subsequent extensions gave universities and corporations the right to keep and control any intellectual property (including data sets) developed under federal grants. The Bayh-Dole Act has provided entrepreneurial opportunities for researchers who work under federal grants, but has created conflicts of interest that should be disclosed to human subjects during the informed consent process. It is within the realm of possibility that a researcher who stands to gain considerable wealth, depending on the outcome of the project, may behave recklessly or dishonestly to achieve his or her ends.

Big Data resource,    A Big Data collection that is accessible for analysis. Readers should understand that there are collections of Big Data (i.e., data sources that are large, complex, and actively growing) that are not designed to support analysis; hence, not Big Data resources. Such Big Data collections might include some of the older hospital information systems, which were designed to deliver individual patient records, upon request, but could not support projects wherein all of the data contained in all of the records was opened for selection and analysis. Aside from privacy and security issues, opening a hospital information system to these kinds of analyses would place enormous computational stress on the systems (i.e., produce system crashes). In the late 1990s and the early 2000s, data warehousing was popular. Large organizations would collect all of the digital information created within their institutions, and these data were stored as Big Data collections, called data warehouses. If an authorized person within the institution needed some specific set of information (e.g., emails sent or received in February, 2003; all of the bills paid in November, 1999), it could be found somewhere within the warehouse. For the most part, these data warehouses were not true Big Data resources because they were not organized to support a full analysis of all of the contained data. Another type of Big Data collection that may or may not be considered a Big Data resource is compilations of scientific data that are accessible for analysis by private concerns, but closed for analysis by the public. In this case, a scientist may make a discovery, based on her analysis of a private Big Data collection, but the data collection is not open for unauthorized critical review. In the opinion of some scientists, including myself, if the results of a data analysis are not available for review, the analysis is illegitimate; the Big Data collection is never consummated as a true Big Data resource. Of course, this opinion is not universally shared, and Big Data professionals hold various definitions for a Big Data resource.

Binary data,    All digital information is coded as binary data; strings of 0’s and 1’s. In common usage, the term “binary data” is restricted to digital information that is not intended to be machine interpreted as alphanumeric characters (text). Binary data includes images, sound files, and movie files. Text files, also called plain-text files or ASCII files, are constructed so that every consecutive eight-bit digital sequence can be mapped to an ASCII character. Proprietary word processor files store alphanumeric data in something other than ASCII format, and these files are also referred to as binary files; not as text files. See ASCII.

Binary large object,    See BLOB.

Binary sizes,    Binary sizes are named in 1000-fold intervals, as shown.
1 bit = binary digit (0 or 1)
1 byte = 8 bits (the number of bits required to express an ASCII character)
1000 bytes = 1 kilobyte
1000 kilobytes = 1 megabyte
1000 megabytes = 1 gigabyte
1000 gigabytes = 1 terabyte
1000 terabytes = 1 petabyte
1000 petabytes = 1 exabyte
1000 exabytes = 1 zettabyte
1000 zettabytes = 1 yottabyte

Black box,    In physics, a black box is a device with observable inputs and outputs, but what goes on inside the box is unknowable. The term is used to describe software, algorithms, machines, and systems whose inner workings are inscrutable.

BLOB,    A large assemblage of binary data (e.g., images, movies, multimedia files, even collections of executable binary code) that is associated with a common group identifier and that can, in theory, be moved (from computer to computer) or searched as a single data object. Traditional databases do not easily handle BLOBs. BLOBs belong to Big Data.

Cherry-picking,    The process whereby data objects are chosen for some quality that is intended to boost the likelihood that an experiment is successful, but which biases the study. For example, a clinical trial manager might prefer patients who seem intelligent and dependable, and thus more likely to comply with the rigors of a long and complex treatment plan. By picking those trial candidates with a set of desirable attributes, the data manager is biasing the results of the trial, which may no longer apply to a real-world patient population.

Classifier,    A classifier is a method or algorithm that takes a data object and assigns it to its proper class within a pre-existing classification. Classifier algorithms should not be confused with clustering algorithms, which group data objects based on their similarities to one another. See Recommenders and Predictive analytics.

Cloud computing,    According to the U.S. National Institute of Standards and Technology (NIST), cloud computing enables “ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction.”253 As the NIST definition would suggest, cloud computing is similar to Big Data, but there are several features that are expected in one and not the other. Cloud computing typically offers an interface and a collection of in-cloud computational services. Cloud data is typically contributed by a large community, and the contributed data is deposited often for no reason other than to provide convenient storage. These features are not expected in Big Data resources. Perhaps the most important distinction between cloud computing and Big Data relates to mutability. Because cloud data is contributed by many different entities, for many purposes, nobody expects much constancy; data can be freely extracted from the cloud or modified in place. In the cloud, the greatest emphasis is placed on controlling computational access to cloud data, with less emphasis on controlling the content of the cloud. In contrast, Big Data resources are designed to achieve a chosen set of goals using a constructed set of data. In most cases, the data held in a Big Data resource is immutable. Once it is entered into the resource, it cannot be modified or deleted without a very good reason.

Combined DNA Index System (CODIS),    A large database prepared from human DNA samples. In CODIS the DNA is extracted and 13 short tandem repeat fragments are selected from predetermined locations in the genome. The 13 fragments are sequenced, and the sequence data is stored in the CODIS database. The CODIS sequences are intended to uniquely identify individuals (or their identical twins). New DNA samples from the same individual should always match the stored CODIS sequence. CODIS is used primarily by law enforcement.

Confidentiality and privacy,    See Privacy and confidentiality.

Confounder,    Unanticipated or ignored factor that alters the outcome of a data analysis. Confounders are particularly important in Big Data analytics because most analyses are observational, based on collected parameters from large numbers of data records, and there is very little control over confounders. Confounders are less of a problem in controlled prospective experiments, in which a control group and a treated group are alike, to every extent feasible—only differing in their treatment. Differences between the control group and the treated group are presumed to be caused by the treatment, as the confounders have been eliminated. One of the greatest challenges of Big Data analytics involves developing new analytic protocols that reduce the effect of confounders in observational studies.

Correlation distance or correlation score,    The correlation distance provides a measure of similarity between two variables. Two similar variables will rise and fall together. The Pearson correlation score is popular and can be easily implemented.19,105 It produces a score that varies from -1 to 1. A score of 1 indicates perfect correlation; a score of -1 indicates perfect anticorrelation (i.e., one variable rises while the other falls). A Pearson score of 0 indicates a lack of correlation. Other correlation measures can be applied to Big Data sets.109,110
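
A minimal sketch of a Pearson correlation computed with numpy's corrcoef function; the variables are hypothetical.

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y_correlated = 2.0 * x + 1.0        # rises and falls with x; score close to +1
    y_anticorrelated = -3.0 * x         # falls as x rises; score close to -1

    print(np.corrcoef(x, y_correlated)[0, 1])
    print(np.corrcoef(x, y_anticorrelated)[0, 1])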

Curator,    The word “curator” derives from the Latin curatus, the same root for “curative,” and conveys that curators “take care of” things. In a Big Data resource, the curator must accrue legacy and prospective data into the resource, must ensure that there is an adequate protocol for verifying the data, must choose appropriate nomenclatures for annotating the data, must annotate the data, and must make appropriate adjustments to data annotations when new versions of nomenclatures are made available and when one nomenclature is replaced by another.

Curse of dimensionality,    As the number of attributes for a data object increases, the distance between data objects grows to enormous size. The multidimensional space becomes sparsely populated, and the distances between any two objects, even the two closest neighbors, become absurdly large. When you have thousands of dimensions, the space that holds the objects is so large that distances between objects become difficult or impossible to compute, and computational results become useless for most purposes.
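
A small demonstration of the effect, sketched in Python with randomly generated objects (an assumption made for brevity): as the number of attributes grows, the nearest and farthest neighbors of an object become nearly equidistant.

    import numpy as np

    rng = np.random.default_rng(0)
    for dim in (2, 10, 100, 1000):
        points = rng.random((100, dim))     # 100 random objects, each with 'dim' attributes
        dists = np.linalg.norm(points - points[0], axis=1)[1:]   # distances from the first object
        print(dim, round(dists.min() / dists.max(), 2))          # ratio drifts toward 1 as dim grows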

Data cleaning,    Synonymous with data fixing or data correcting, data cleaning is the process by which errors, inexplicable anomalies, and missing values are somehow handled. There are three options for data cleaning: correcting the error, deleting the error, or leaving it unchanged.143 Data cleaning should not be confused with data scrubbing. See Data scrubbing.

Data manager,    This book uses “data manager” as a catch-all term, without attaching any specific meaning to the name. Depending on the institutional and cultural milieu, synonyms and plesionyms (i.e., near-synonyms) for data manager would include technical lead, team liaison, data quality manager, chief curator, chief of operations, project manager, group supervisor, and so on.

Data object,    A commonly used but somewhat inelegant definition for a data object is “the thing that the data values are about.” In a medical record, the data object might be a patient and the data values might be the patient’s blood chemistries. A well-specified data object has an identifier and is capable of encapsulating data values, metadata, and other self-descriptive data, such as the name of a class in which the data object holds membership. In the object-oriented paradigm, every data object is a member of a class and inherits the methods that belong to its class, as well as all of the methods of all the classes in its ancestral lineage. As a member of a class, it shares a set of class properties with the other members of its class. A class is itself a type of data object. Data objects are the subjects of meaningful assertions. See Meaning.

Data object model,    The data object model is a term defined for object-oriented programming languages, but it is often applied in vague ways to Big Data resources. In the context of Big Data, the term applies to the way that data objects are described and organized in the resource and to the manner in which objects interface to the resource for purposes of searching, retrieving, and exchanging whole data objects (e.g., records) and their data attributes (e.g., data values in the records). Those who read the Big Data literature extensively will find that this term is used in many different ways and is often confused with its plesionym, data modeling. See Modeling.

Data point,    The singular form of data is datum. Strictly speaking, the term should be datum point or datumpoint. Most information scientists, myself included, have abandoned consistent usage rules for the word “data.” In this book, the term “data” always refers collectively to information, numeric or textual, structured or unstructured, in any quantity.

Data Quality Act,    Passed as part of the FY 2001 Consolidated Appropriations Act (Pub. L. No. 106-554), the act requires federal agencies to base their policies and regulations on high-quality data and permits the public to challenge and correct inaccurate data.194 For an in-depth discussion, see Chapter 13.

Data reduction,    In almost all circumstances, it is impractical to work with all of the data in a Big Data resource. When the data analysis is confined to a set of data extracted from the resource, it may be impractical or counterproductive to work with every element of the collected data. In most cases, the data analyst will eliminate some or most of the data elements or will develop methods whereby the data is approximated. The term “data reduction” is sometimes reserved for methods that reduce the dimensionality of multivariate data sets. In this book, the term “data reduction” is applied to any method whereby items of data are excluded from a data set or are replaced by a simplified transformation or by a mathematical formula that represents values. Obviously, data reduction, if done unwisely, will create biases. See Mean-field approximations. See Dimensionality.

Data scrubbing,    A term that is very similar to data deidentification and is sometimes used improperly as a synonym for data deidentification. Data scrubbing refers to the removal, from data records, of identifying information (i.e., information linking the record to an individual) plus any other information that is considered unwanted. This may include any personal, sensitive, or private information contained in a record, any incriminating or otherwise objectionable language contained in a record, and any information irrelevant to the purpose served by the record. See Deidentification.

Data sharing,    Data sharing involves one entity sending data to another entity, usually with the understanding that the other entity will store and use the data. This process may involve free or purchased data, and it may be done willingly, or in compliance with regulations, laws, or court orders.

Deep analytics,    Jargon occasionally applied to the skill set needed for Big Data analysis. Statistics and machine learning are often cited as two of the most important areas of deep analytic expertise. In a recent McKinsey report, entitled “Big data: The next frontier for innovation, competition, and productivity,” the authors asserted that the United States “faces a shortage of 140,000 to 190,000 people with deep analytical skills.”247

Deidentification,    The process of removing all of the links in a data record that can connect the information in a record to an individual. This usually includes the record identifier, demographic information (e.g., place of birth), personal information (e.g., birthdate), biometrics (e.g., fingerprints), and so on. The process of deidentification will vary based on the type of records included in the Big Data resource. For an in-depth discussion, see Chapter 2. See Reidentification. See Data scrubbing.

Digital Millennium Copyright Act (DMCA),    This act was signed into law in 1998. This law deals with many different areas of copyright protection, most of which are only peripherally relevant to Big Data. In particular, the law focuses on copyright protections for recorded works, particularly works that have been theft-protected by the copyright holders.199 The law also contains a section (Title II) dealing with the obligations of online service providers who inadvertently distribute copyrighted material. Service providers may be protected from copyright infringement liability if they block access to the copyrighted material when the copyright holder or the holder’s agent claims infringement. To qualify for liability protection, service providers must comply with various guidelines (i.e., the so-called safe harbor guidelines) included in the act.

Dimensionality,    The dimensionality of a data object consists of the number of attributes that describe the object. Depending on the design and content of the data structure that contains the data object (i.e., database, array, list of records, object instance), the attributes will be called by different names, including field, variable, parameter, feature, or property. Data objects with high dimensionality create computational challenges, and data analysts typically reduce the dimensionality of data objects wherever possible. See Chapter 9.

DMCA,    See Digital Millennium Copyright Act.

Dublin Core metadata,    The Dublin Core is a set of metadata elements developed by a group of librarians who met in Dublin, Ohio. It would be very useful if every electronic document were annotated with the Dublin Core elements. The Dublin Core Metadata is discussed in detail in Chapter 4. The syntax for including the elements is found at http://dublincore.org/documents/dces/.

Dynamic range,    Every measuring device has a dynamic range beyond which its measurements are without meaning. A bathroom scale may be accurate for weights that vary from 50 to 250 pounds, but you would not expect it to produce a sensible measurement for the weight of a mustard seed or an elephant.

Electronic medical record,    Abbreviated as EMR or EHR (electronic health record), the EMR is the digital equivalent of a patient’s medical chart. Central to the idea of the EMR is the notion that all of the documents, transactions, and all packets of information containing test results and other information on a patient are linked to the patient’s unique identifier. By retrieving all data linked to the patient’s identifier, the EMR (i.e., the entire patient’s chart) can be assembled instantly.

Euclidean distance,    Two points, (x1, y1), (x2, y2), in Cartesian coordinates are separated by a hypotenuse distance, that being the square root of the sum of the squares of the differences between the respective x axis and y axis coordinates. In n-dimensional space, the Euclidean distance between two points is the square root of the sum of the squares of the differences in coordinates for each of the n dimensions. The significance of the Euclidean distance for Big Data is that data objects are often characterized by multiple feature values, and these feature values can be listed as though they were coordinate values for an n-dimensional object. The smaller the Euclidean distance between two objects, the greater their similarity to each other. Several of the most popular correlation and clustering algorithms involve pairwise comparisons of the Euclidean distances between data objects in a data collection.
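
A direct Python implementation of the n-dimensional definition; the two feature vectors are hypothetical.

    import math

    def euclidean_distance(a, b):
        # square root of the sum of the squared coordinate differences
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    object_1 = [5.1, 3.5, 1.4, 0.2]
    object_2 = [6.7, 3.0, 5.2, 2.3]
    print(euclidean_distance(object_1, object_2))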

Fourier series,    Periodic functions (i.e., functions with repeating trends in the data, including waveforms and periodic time series data) can be represented as the sum of oscillating functions (i.e., functions involving sines, cosines, or complex exponentials). The summation function is the Fourier series. See Fourier transform.

Fourier transform,    A transform is a mathematical operation that takes a function or a time series (e.g., values obtained at intervals of time) and transforms it into something else. An inverse transform takes the transform function and produces the original function. Transforms are useful when there are operations that can be more easily performed on the transformed function than on the original function. Possibly the most useful transform is the Fourier transform, which can be computed with great speed on modern computers using a modified form known as the fast Fourier transform. Periodic functions and waveforms (periodic time series) can be transformed using this method. Operations on the transformed function can sometimes eliminate periodic artifacts or frequencies that occur below a selected threshold (e.g., noise). The transform can be used to find similarities between two signals. When the operations on the transform function are complete, the inverse of the transform can be calculated and substituted for the original set of data.
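
A minimal sketch of the noise-removal use described above, using numpy's fast Fourier transform; the signal, sampling rate, and 50 Hz cutoff are hypothetical choices.

    import numpy as np

    t = np.linspace(0.0, 1.0, 1000, endpoint=False)    # one second sampled 1000 times
    signal = np.sin(2 * np.pi * 5 * t) + 0.3 * np.sin(2 * np.pi * 120 * t)   # 5 Hz wave plus 120 Hz noise

    spectrum = np.fft.rfft(signal)                      # fast Fourier transform
    freqs = np.fft.rfftfreq(len(signal), d=t[1] - t[0])
    spectrum[freqs > 50] = 0                            # suppress frequencies above 50 Hz
    cleaned = np.fft.irfft(spectrum)                    # inverse transform restores the filtered signal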

Gaussian copula function,    A formerly honored and currently vilified formula developed for Wall Street that calculated the risk of default correlation (i.e., the likelihood of two investment vehicles defaulting together). The formula uses the current market value of the vehicles, without factoring in historical data. The formula is easy to implement and became a favorite model for calculating risk in the securitization market. For a time, the Gaussian copula function was the dominant risk model on Wall Street. In about 2008, the function stopped working; soon thereafter came the global market collapse. In some circles, the Gaussian copula function is blamed for the disaster.123

Grid,    A collection of computers and computer resources that are coordinated to provide a desired functionality. The grid is the intellectual predecessor of cloud computing. Cloud computing is less physically and administratively restricted than grid computing. See Cloud computing.

Heterogeneous data,    Sets of data that are dissimilar with regard to content, purpose, format, organization, and annotations. One of the purposes of Big Data is to discover relationships among heterogeneous data sources. For example, epidemiologic data sets may be of service to molecular biologists who have gene sequence data on diverse human populations. The epidemiologic data is likely to contain different types of data values, annotated and formatted in a manner that is completely different from the data and annotations in a gene sequence database. The two types of related data, epidemiologic and genetic, have dissimilar content; hence they are heterogeneous to one another.

Human Genome Project,    The Human Genome Project is a massive bioinformatics project in which multiple laboratories contributed to sequencing the 3 billion base pair haploid human genome (i.e., the full sequence of human DNA). The project began its work in 1990, a draft human genome was prepared in 2000, and a completed genome was finished in 2003, marking the start of the so-called postgenomics era. All of the data produced for the Human Genome Project is freely available to the public.

Identification,    The process of providing a data object with an identifier or distinguishing one data object from all other data objects on the basis of its associated identifier. See Identifier.

Identifier,    A string that is associated with a particular thing (e.g., person, document, transaction, data object) and not associated with any other thing.254 In the context of Big Data, identification usually involves permanently assigning a seemingly random sequence of numeric digits (0-9) and alphabet characters (a-z and A-Z) to a data object. The data object can be a class of objects. See Identification.

Immutability,    Immutability is the principle that data collected in a Big Data resource is permanent and can never be modified. At first thought, it would seem that immutability is a ridiculous and impossible constraint. In the real world, mistakes are made, information changes, and the methods for describing information changes. This is all true, but the astute Big Data manager knows how to accrue information into data objects without changing the preexisting data. Methods for achieving this seemingly impossible trick are described in Chapter 6.
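
One way to accrue new information without altering preexisting data, sketched here in Python as an append-only list of time-stamped assertions; this is an illustrative simplification, not the only method.

    import time

    record = []    # an append-only collection of assertions about one data object

    def assert_value(record, attribute, value):
        # new information is appended with a timestamp; earlier entries are never altered
        record.append({"attribute": attribute, "value": value, "timestamp": time.time()})

    assert_value(record, "diagnosis", "pneumonia")
    assert_value(record, "diagnosis", "aspergillosis")   # a correction accrues; the original assertion remains

    latest = [e for e in record if e["attribute"] == "diagnosis"][-1]
    print(latest["value"])    # the most recent assertion, with the full history preserved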

Indexes,    Every writer must search deeply into his or her soul to find the correct plural form of “index.” Is it “indexes” or is it “indices?” Latinists insist that “indices” is the proper and exclusive plural form. Grammarians agree, reserving “indexes” for the third person singular verb form: “The student indexes his thesis.” Nonetheless, popular usage of the plural of “index,” referring to the section at the end of a book, is almost always “indexes,” the form used herein.

Informed consent,    Human subjects who are put at risk must provide affirmative consent if they are to be included in a government-sponsored study. This legally applies in the United States and most other nations and ethically applies to any study that involves putting humans at risk. To this end, researchers provide prospective human subjects with an “informed consent” document that informs the subject of the risks of the study and discloses foreseen financial conflicts among the researchers (see Glossary item, Bayh-Dole Act). The informed consent must be clear to laymen, must be revocable (i.e., subjects can change their mind and withdraw from the study, if feasible to do so), must not contain exculpatory language (e.g., no waivers of responsibility for the researchers), must not promise any benefit or monetary compensation as a reward for participation, and must not be coercive (i.e., must not suggest a negative consequence as a result of nonparticipation).

Integration,    This occurs when information is gathered from multiple data sets, relating diverse data extracted from different data sources. Integration can broadly be categorized as being pre-integrated or as being integrated on the fly. Pre-integration includes such efforts as absorbing new databases into a Big Data resource or merging legacy data with current data. On-the-fly integration involves merging data objects at the moment when the individual objects are parsed. This might be done during a query that traverses multiple databases or multiple networks. On-the-fly data integration can only work with data objects that support introspection. The two closely related topics of integration and interoperability are often confused with one another. An easy way to remember the difference is to note that integration refers to data; interoperability refers to software.

Intellectual property,    Data, software, algorithms, and applications that are created by an entity capable of ownership (e.g., humans, corporations, universities). The entity holds rights over the manner in which the intellectual property can be used and distributed. Protections for intellectual property may come in the form of copyrights, patents, and license agreements. Copyright applies to published information. Patents apply to novel processes and inventions. Certain types of intellectual property can only be protected by being secretive. For example, magic tricks cannot be copyrighted or patented, which is why magicians guard their intellectual property so closely. Intellectual property can be sold outright, essentially transferring ownership to another entity. In other cases, intellectual property is retained by the creator who permits its limited use to others via a legal contrivance (e.g., license, contract, transfer agreement, royalty, usage fee, and so on). In some cases, ownership of the intellectual property is retained, but the property is freely shared with the world (e.g., open source license, GNU license, FOSS license, Creative Commons license).

Introspection,    Well-designed Big Data resources support introspection, a method whereby data objects within the resource can be interrogated to yield their properties, values, and class membership. Through introspection, the relationships among the data objects in the Big Data resource can be examined and the structure of the resource can be determined. Introspection is the method by which a data user can find everything there is to know about a Big Data resource, without downloading the complete resource. For an in-depth discussion, see Chapter 4.
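
Programming languages offer introspection at the level of individual data objects, which hints at what a Big Data resource must provide on a larger scale; a minimal Python sketch with a hypothetical Patient class:

    class Patient:
        def __init__(self, identifier, age):
            self.identifier = identifier
            self.age = age

    p = Patient("hypothetical-uuid-0001", 48)
    print(type(p).__name__)        # class membership: Patient
    print(vars(p))                 # properties and values held by the object
    print(isinstance(p, Patient))  # True; class relationships can be interrogated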

ISO/IEC 11179,    The standard produced by the International Standards Organization (ISO) for defining metadata, such as XML tags. The standard requires that the definitions for metadata used in XML (the so-called tags) be accessible and should include the following information for each tag: Name (the label assigned to the tag), Identifier (the unique identifier assigned to the tag), Version (the version of the tag), Registration Authority (the entity authorized to register the tag), Language (the language in which the tag is specified), Definition (a statement that clearly represents the concept and essential nature of the tag), Obligation (indicating whether the tag is required), Datatype (indicating the type of data that can be represented in the value of the tag), Maximum Occurrence (indicating any limit to the repeatability of the tag), and Comment (a remark describing how the tag might be used).

KISS,    Acronym for Keep It Simple Stupid. With respect to Big Data, there are basically two schools of thought. The first is that reality is quite complex; the advent of powerful computers and enormous data collections allows us to tackle important problems, despite their inherent size and complexity. KISS represents a second school of thought: that big problems are just small problems that are waiting to be simplified.

k-means algorithm,    The k-means algorithm assigns any number of data objects to one of k clusters.107 The algorithm is described fully in Chapter 9. The k-means algorithm should not be confused with the k-nearest neighbor algorithm.

k-nearest neighbor algorithm,    A simple and popular classifier algorithm that assigns a class (in a preexisting classification) to an object whose class is unknown.107 The k-nearest neighbor algorithm is very simple. From a collection of data objects whose classes are known, the algorithm computes the distances from the object of unknown class to the objects of known class and selects the k nearest (k being a number chosen by the user). The most common class among these k nearest objects is assigned to the object of unknown class. The k-nearest neighbor algorithm and its limitations are discussed in Chapter 9. The k-nearest neighbor algorithm, a classifier method, should not be confused with the k-means algorithm, a clustering method.
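
A minimal Python sketch of the method just described; the training objects, their feature values, and their class labels are hypothetical.

    import math
    from collections import Counter

    def knn_classify(unknown, labeled_objects, k=3):
        # labeled_objects is a list of (feature_vector, class_label) pairs
        distances = sorted((math.dist(unknown, features), label) for features, label in labeled_objects)
        nearest_labels = [label for _, label in distances[:k]]
        return Counter(nearest_labels).most_common(1)[0][0]   # most common class among the k nearest

    training = [([1.0, 1.1], "A"), ([0.9, 1.0], "A"), ([5.0, 5.2], "B"), ([5.1, 4.9], "B")]
    print(knn_classify([1.2, 0.9], training, k=3))   # "A"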

Large Hadron Collider (LHC),    The LHC is the world’s largest and most powerful particle accelerator, and is expected to produce about 15 petabytes (15 million gigabytes) of data annually.255

Linear regression,    A method for obtaining a straight line through a two-dimensional scatter plot. It is not, as it is commonly believed, a “best-fit” technique, but it does minimize the sum of squared errors (in the y axis values) under the assumption that the x axis values are correct and exact. This means that you would get a different straight line if you regress x on y rather than y on x. Linear regression is a popular method that has been extended, modified, and modeled for many different processes, including machine learning. Data analysts who use linear regression should be cautioned that it is a method, much like the venerable p value, that is commonly misinterpreted.104 See p value.
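
A short demonstration of the asymmetry noted above, using numpy's polyfit on hypothetical data: regressing y on x and regressing x on y yield different lines.

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.7])        # hypothetical noisy measurements

    slope_yx, intercept_yx = np.polyfit(x, y, 1)   # minimizes squared errors in the y values
    slope_xy, intercept_xy = np.polyfit(y, x, 1)   # minimizes squared errors in the x values

    print(slope_yx, 1.0 / slope_xy)                # the implied slopes differ; the two regressions are not the same line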

Mahalanobis distance,    A distance measure based on correlations between variables; hence, it measures the similarity of the objects whose attributes are compared. As a correlation measure, it is not influenced by the relative scale of the different attributes. It is used routinely in clustering and classifier algorithms. See Euclidean distance.

MapReduce,    A method by which computationally intensive problems can be processed on multiple computers in parallel. The method can be divided into a mapping step and a reducing step. In the mapping step, a master computer divides a problem into smaller problems that are distributed to other computers. In the reducing step, the master computer collects the output from the other computers. Although MapReduce is intended for Big Data resources, holding petabytes of data, most Big Data problems do not require MapReduce.
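
The two steps can be sketched on a single machine; a word-count example in Python, with the understanding that the real method distributes the mapped work across many computers.

    from collections import defaultdict

    documents = ["big data is big", "data needs algorithms"]   # hypothetical inputs

    # Map step: each document is converted into (word, 1) pairs; this work could be farmed out.
    mapped = [(word, 1) for doc in documents for word in doc.split()]

    # Reduce step: the master collects the pairs and sums the counts for each word.
    counts = defaultdict(int)
    for word, n in mapped:
        counts[word] += n
    print(dict(counts))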

Mean-field approximation,    A method whereby the average behavior for a population of objects substitutes for the behavior of each and every object in the population. This method greatly simplifies calculations. It is based on the observation that large collections of objects can be characterized by their average behavior. Mean-field approximation has been used with great success to understand the behavior of gases, epidemics, crystals, viruses, and all manner of large population phenomena.

Meaning,    In informatics, meaning is achieved when described data is bound to a unique identifier of a data object. “Jules J. Berman’s height is five feet eleven inches” comes pretty close to being a meaningful statement. The statement contains data (five feet eleven inches), and the data is described (height). The described data belongs to a unique object (Jules J. Berman). If this data were entered into a Big Data resource, it would need a unique identifier to distinguish one instance of Jules J. Berman from all the other persons who are named Jules J. Berman. The statement would also benefit from a formal system that ensures that the metadata makes sense (e.g., what exactly is height and does Jules J. Berman fall into a class of objects for which height is an allowable property?) and that the data is appropriate (e.g., is 5 feet 11 inches an allowable measure of a person’s height?). A statement with meaning does not need to be a true statement (e.g., the height of Jules J. Berman was not 5 feet 11 inches when Jules J. Berman was an infant). See Semantics.

Metadata,    Data that describes data. For example, in XML a data quantity may be flanked by a beginning and an ending metadata tag describing the included data: <age>48 years</age>. In the example, <age> is the metadata and “48 years” is the data.

Minimal necessary,    In the field of medical informatics, there is a concept known as “minimal necessary” that applies to shared confidential data.33 It holds that when records are shared, only the minimum necessary information should be released. Information not directly relevant to the intended purposes of the study should be withheld.

Missing data,    Most complex data sets have missing data values. Somewhere along the line data elements were not entered, records were lost, or some systemic error produced empty data fields. Big Data, being large, complex, and composed of data objects collected from diverse sources, is almost certain to have missing data. Various mathematical approaches to missing data have been developed, commonly involving assigning values on a statistical basis; so-called imputation methods. The underlying assumption for such methods is that missing data arises at random. When missing data arises nonrandomly, there is no satisfactory statistical fix. The Big Data curator must track down the source of the errors and somehow rectify the situation. In either case, the issue of missing data introduces a potential bias, and it is crucial to fully document the method by which missing data is handled. In the realm of clinical trials, only a minority of data analyses bother to describe their chosen method for handling missing data.256 See Data cleaning.
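
A sketch of one simple imputation method, replacing missing values with the mean of the observed values; this is only defensible when the data are missing at random, and the values shown are hypothetical.

    import numpy as np

    measurements = np.array([4.1, np.nan, 3.9, 4.3, np.nan])   # a data field with gaps
    column_mean = np.nanmean(measurements)                     # mean of the observed values only
    imputed = np.where(np.isnan(measurements), column_mean, measurements)
    print(imputed)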

Modeling,    Modeling involves explaining the behavior of a system, often with a formula, sometimes with descriptive language. The formula for the data describes the distribution of the data and often predicts how the different variables will change with one another. Consequently, modeling comes closer than other Big Data techniques to explaining the behavior of data objects and of the system in which the data objects interact. The topic of modeling is discussed in Chapter 9. Data modeling is often confused with the task of creating a data object model. See Data object model.

Monte Carlo simulation,    This technique was introduced in 1946 by John von Neumann, Stan Ulam, and Nick Metropolis.252 For this technique, the computer generates random numbers and uses the resultant values to simulate repeated trials of a probabilistic event. Monte Carlo simulations can easily simulate various processes (e.g., Markov models and Poisson processes) and can be used to solve a wide range of problems.98,257 The Achilles heel of the Monte Carlo simulation, when applied to enormous sets of data, is that so-called random number generators may introduce periodic (nonrandom) repeats over large stretches of data.38 What you thought was a fine Monte Carlo simulation, based on small data test cases, may produce misleading results for large data sets. The wise Big Data analyst will avail himself of the best possible random number generators and will test his outputs for randomness. Various tests of randomness are available.111
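
A classic toy example, estimating the value of pi by random sampling; the seed and number of trials are arbitrary choices.

    import random

    random.seed(2023)        # reproducibility; the quality of the generator matters for very large runs
    trials = 1_000_000
    inside = sum(1 for _ in range(trials)
                 if random.random() ** 2 + random.random() ** 2 <= 1.0)
    print(4.0 * inside / trials)   # approaches 3.14159... as the number of trials increases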

Multiple comparisons bias,    When you compare a control group against a treated group using multiple hypotheses based on the effects of many different measured parameters, you will eventually encounter statistical significance, based on chance alone. For example, if you are trying to determine whether a population that has been treated with a particular drug is likely to suffer a serious clinical symptom, and you start looking for statistically significant associations (e.g., liver disease, kidney disease, prostate disease, heart disease, etc.), then eventually you will find an organ in which disease is more likely to occur in the treated group than in the untreated group. Because Big Data tends to have high dimensionality, biases associated with multiple comparisons must be carefully avoided. Methods for reducing multiple comparison bias are available to Big Data analysts. They include the Bonferroni correction, the Sidak correction, and the Holm-Bonferroni correction.
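
A sketch of the Bonferroni correction, the simplest of the listed methods: the significance threshold is divided by the number of comparisons. The p values are hypothetical.

    p_values = [0.04, 0.008, 0.03, 0.20, 0.001]    # results from five separate comparisons
    alpha = 0.05
    bonferroni_threshold = alpha / len(p_values)   # 0.01

    significant = [p for p in p_values if p < bonferroni_threshold]
    print(bonferroni_threshold, significant)       # only 0.008 and 0.001 survive the correction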

Mutability,    Mutability refers to the ability to alter the data held in a data object or to change the identity of a data object. Serious Big Data is not mutable. Data can be added, but data cannot be erased or altered. Big Data resources that are mutable cannot establish a sensible data identification system and cannot support verification and validation activities. For a full discussion of mutability and immutability, as it applies to Big Data resources, see Chapter 9.

n3,    See Notation 3.

Namespace,    A namespace is the metadata realm in which a metadata tag applies. The purpose of a namespace is to distinguish metadata tags that have the same name, but a different meaning. For example, within a single XML file, the metadata term “date” may be used to signify a calendar date, the fruit, or the social engagement. To avoid confusion, the metadata term is given a prefix that is associated with a Web document that defines the term within the document’s namespace. See Chapter 4.

Negative study bias,    When a project produces negative results (fails to confirm a hypothesis), there may be little enthusiasm to publish the work.258 When statisticians analyze the results from many different published manuscripts (i.e., perform a meta-analysis), their work is biased by the pervasive absence of negative studies.259 In the field of medicine, negative study bias creates a false sense that every kind of treatment yields positive results.

Neural network,    A dynamic system in which outputs are calculated by a summation of weighted functions operating on inputs. Weights for the individual functions are determined by a learning process, simulating the learning process hypothesized for human neurons. In the computer model, individual functions that contribute to a correct output (based on the training data) have their weights increased (strengthening their influence to the calculated output). Over the past 10 or 15 years, neural networks have lost some favor in the artificial intelligence community. They can become computationally complex for very large sets of multidimensional input data. More importantly, complex neural networks cannot be understood or explained by humans, endowing these systems with a “magical” quality that some scientists find unacceptable. See Nongeneralizable predictor. See Overfitting.
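
A single artificial neuron (a perceptron), reduced to a few lines of Python, illustrates the weighted summation and the error-driven adjustment of weights; training it on the logical AND function is an arbitrary choice made for brevity.

    inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
    targets = [0, 0, 0, 1]                        # the logical AND function
    weights, bias, rate = [0.0, 0.0], 0.0, 0.1

    for _ in range(20):                           # repeated passes over the training data
        for (x1, x2), target in zip(inputs, targets):
            output = 1 if (weights[0] * x1 + weights[1] * x2 + bias) > 0 else 0
            error = target - output               # weights contributing to correct outputs are strengthened
            weights[0] += rate * error * x1
            weights[1] += rate * error * x2
            bias += rate * error

    print(weights, bias)                          # the trained neuron now computes AND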

Nomenclature,    A nomenclature is a specialized vocabulary, usually containing terms that comprehensively cover a well-defined field of knowledge. For example, there may be a nomenclature of diseases, celestial bodies, or makes and models of automobiles. Some nomenclatures are ordered alphabetically. Others are ordered by synonymy, wherein all synonyms and plesionyms (near-synonyms) are collected under a canonical (best or preferred) term. In many nomenclatures, grouped synonyms are collected under a code (unique alphanumeric string) assigned to the group. Nomenclatures have many purposes: to enhance interoperability and integration, to allow synonymous terms to be retrieved regardless of which specific synonym is entered as a query, to support comprehensive analyses of textual data, to express detail, to tag information in textual documents, and to drive down the complexity of documents by uniting synonymous terms under a common code. Sets of documents held in more than one Big Data resource can be harmonized under a nomenclature by substituting or appending a nomenclature code to every nomenclature term that appears in any of the documents. See Classification. See Vocabulary.

Nongeneralizable predictor,    Sometimes Big Data analysis can yield results that are true, but nongeneralizable (i.e., irrelevant to everything outside the set of data objects under study). The most useful scientific findings are generalizable (e.g., the laws of physics operate on the planet Jupiter or the star Alpha Centauri much as they do on earth). Many of the most popular analytic methods for Big Data are not generalizable because they produce predictions that only apply to highly restricted sets of data or the predictions are not explainable by any underlying theory that relates input data with the calculated predictions. Data analysis is incomplete until a comprehensible, generalizable, and testable theory for the predictive method is developed.

Notation 3,    Also called n3. A syntax for expressing assertions as triples (unique subject + metadata + data). Notation 3 expresses the same information as the more formal RDF syntax, but n3 is compact and easy for humans to read. Both n3 and RDF can be parsed and equivalently tokenized (i.e., broken into elements that can be reorganized in a different format, such as a database record). See RDF.

Object-oriented programming,    In object-oriented programming, all data objects must belong to one of the classes built into the language or to a class created by the programmer. Class methods are subroutines that belong to a class. The members of a class have access to the methods for the class. There is a hierarchy of classes (with superclasses and subclasses). A data object can access any method from any superclass of its class. All object-oriented programming languages operate under this general strategy. The two most important differences among the object-oriented programming languages relate to syntax (i.e., the required style in which data objects call their available methods) and content (the built-in classes and methods available to objects). Various esoteric issues, such as types of polymorphism offered by the language, multiparental inheritance, and non-Boolean logic operations, may play a role in how expert programmers choose a specific language for a specific project. See Data object.
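
A minimal Python sketch of the strategy: a subclass data object has access to its own methods and to the methods of its superclass. The class names are hypothetical.

    class MedicalDevice:                    # superclass
        def identify(self):
            return type(self).__name__

    class ImagingDevice(MedicalDevice):     # subclass; inherits identify()
        def modality(self):
            return "imaging"

    scanner = ImagingDevice()               # a data object belonging to class ImagingDevice
    print(scanner.identify(), scanner.modality())   # methods from the superclass and the class itself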

Object rank,    A generalization of PageRank, the indexing method employed by Google. Object ranking involves assigning each object a quantitative score that provides some clue to its relevance or popularity. For the typical object ranking project, objects take the form of a key word phrase. See Page rank.

One-way hash,    A one-way hash is an algorithm that transforms one string into another string (a fixed-length sequence of seemingly random characters) in such a way that the original string cannot be calculated by operations on the one-way hash value (i.e., the calculation is one way only). One-way hash values can be calculated for any string, including a person’s name, a document, or an image. For any input string, the resultant one-way hash will always be the same. If a single byte of the input string is modified, the resulting one-way hash will be changed and will have a totally different sequence than the one-way hash sequence calculated for the unmodified string. One-way hash values can be made sufficiently long (e.g., 256 bits) that a hash string collision (i.e., the occurrence of two different input strings with the same one-way hash output value) is negligible. For an in-depth discussion of the uses of one-way hashes in Big Data resources, see Chapter 2.
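
A short demonstration using the SHA-256 algorithm from Python's hashlib module: any input string yields a fixed-length value, and altering a single byte yields a completely different value.

    import hashlib

    print(hashlib.sha256(b"Jules J. Berman").hexdigest())
    print(hashlib.sha256(b"Jules J. BermaN").hexdigest())   # one changed byte, an entirely different hash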

Ontology,    An ontology is a collection of classes and their relationships to one another. Ontologies are usually rule-based systems (i.e., membership in a class is determined by one or more class rules). Two properties distinguish ontologies from classification. Ontologies permit classes to have more than one parent class and more than one child class. For example, the class of automobiles may be a direct subclass of “motorized devices” and a direct subclass of “mechanized transporters.” In addition, an instance of a class can be an instance of any number of additional classes. For example, a Lamborghini may be a member of class “automobiles” and class “luxury items.” This means that the lineage of an instance in an ontology can be highly complex, with a single instance occurring in multiple classes and with many connections between classes. Because recursive relations are permitted, it is possible to build an ontology wherein a class is both an ancestor class and a descendant class of itself. A classification is a highly restrained ontology wherein instances can belong to only one class and each class may have only one direct parent class. Because classifications have an enforced linear hierarchy, they can be easily modeled and the lineage of any instance can be traced unambiguously. See Classification.

Open access,    A document is open access if its complete contents are available to the public. Open access applies to documents in the same manner as open source applies to software.

Open source,    Software is open source if the source code is available to anyone who has access to the software.

Outlier,    Outliers are extreme data values. The occurrence of outliers hinders the task of developing models, equations, or curves that closely fit all the available data. In some cases, outliers are simply mistakes that can be ignored by the data analyst. In other cases, the outlier may be the most important data in the data set. There is no simple method to know the value of an outlier; it usually falls to the judgment of the data analyst. The importance of outliers to Big Data is that as the size of the data increases, the number of outliers also increases. Therefore, every Big Data analyst must develop a reasonable approach to dealing with outliers, based on the kind of data under study.

Overfitting,    Overfitting occurs when a formula describes a set of data very closely, but does not lead to any sensible explanation for the behavior of the data and does not predict the behavior of comparable data sets. In the case of overfitting, the formula is said to describe the noise of the system rather than the characteristic behavior of the system. Overfitting occurs frequently with models that perform iterative approximations on training data, coming closer and closer to the training data set with each iteration. Neural networks are an example of a data modeling strategy that is prone to overfitting.
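
A small numerical illustration, with hypothetical data: a ninth-degree polynomial fitted to ten noisy points reproduces the training data almost exactly, yet performs far worse on new samples drawn from the same underlying process.

    import numpy as np

    rng = np.random.default_rng(1)
    x_train = np.linspace(0.0, 1.0, 10)
    y_train = np.sin(2 * np.pi * x_train) + rng.normal(0.0, 0.2, 10)   # signal plus noise

    coeffs = np.polyfit(x_train, y_train, 9)      # a 9th-degree polynomial through 10 points
    train_error = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)

    x_new = np.linspace(0.05, 0.95, 10)           # fresh samples from the same process
    y_new = np.sin(2 * np.pi * x_new)
    new_error = np.mean((np.polyval(coeffs, x_new) - y_new) ** 2)
    print(train_error, new_error)                 # tiny on the training data, much larger on new data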

p value,    The p value is the probability of getting a set of results that are as extreme or more extreme as the set of results observed, assuming that the null hypothesis is true (that there is no statistical difference between the results). The p value has come under great criticism over the decades, with a growing consensus that the p value is often misinterpreted, used incorrectly, or used in situations wherein it does not apply.154 In the realm of Big Data, repeated samplings of data from large data sets will produce small p values that cannot be directly applied to determining statistical significance. It is best to think of the p value as just another piece of information that tells you something about how sets of observations compare with one another, not as a test of statistical significance.

Page rank,    PageRank is a method, popularized by Google, for displaying an ordered set of results (for a phrase search conducted over every page of the Web). The rank of a page is determined by two scores: the relevancy of the page to the query phrase and the importance of the page. The relevancy of the page is determined by factors such as how closely the page matches the query phrase and whether the content of the page is focused on the subject of the query. The importance of the page is determined by how many Web pages link to and from the page and by the importance of the Web pages involved in the linkages. It is easy to see that the methods for scoring relevance and importance are subject to many algorithmic variances, particularly with respect to the choice of measures (i.e., the way in which a page’s focus on a particular topic is quantified) and the weights applied to each measurement. The reason that PageRank query responses can be completed very rapidly is that the score of a page’s importance can be precomputed and stored with the page’s Web address. Word matches from the query phrase to Web pages are quickly assembled using a precomputed index of words, the pages containing the words, and locations of the words in the pages.260

Parallel computing,    Some computational tasks can be broken down and distributed to other computers, to be calculated “in parallel.” The method of parallel programming allows a collection of desktop computers to complete intensive calculations of the sort that would ordinarily require the aid of a supercomputer. Parallel programming has been studied as a practical way to deal with the higher computational demands brought by Big Data. Although there are many important problems that require parallel computing, the vast majority of Big Data analyses can be easily accomplished with a single, off-the-shelf personal computer. See MapReduce.

Pareto’s principle,    Also known as the 80/20 rule, Pareto’s principle holds that a small number of causes may account for the vast majority of observed instances. For example, a small number of rich people account for the majority of wealth. Likewise, a small number of diseases account for the vast majority of human illnesses. A small number of children account for the majority of the behavioral problems encountered in a classroom. A small number of states or provinces contain the majority of the population of a country. A small number of books, compared with the total number of published books, account for the majority of book sales. Sets of data that follow Pareto’s principle are often said to follow a Zipf distribution, or a power law distribution. These types of distributions are not tractable by standard statistical descriptors. For example, simple measurements, such as average and standard deviation, have virtually no practical meaning when applied to Zipf distributions. Furthermore, the Gaussian distribution does not apply, and none of the statistical inferences built upon an assumption of a Gaussian distribution will hold on data sets that observe Pareto’s principle. See Power law. See Zipf distribution.

Patent farming,    Also known as patent ambushing.60 The practice of hiding intellectual property within a standard or device, at the time of its creation, is known as patent farming. After the property is marketed, the patent farmer announces the presence of his or her hidden patented material and presses for royalties—metaphorically harvesting his crop.

Pearson’s correlation,    All similarity scores are based on comparing one data object with another, attribute by attribute, usually summing the squares of the differences in magnitude for each attribute, and using the calculation to compute a final outcome, known as the correlation score. One of the most popular correlation methods is Pearson’s correlation, which produces a score that can vary from -1 to +1. Two objects with a high score (near +1) are highly similar. Pearson’s correlation can be used to compare complex data objects that differ in size and content. For example, Pearson’s correlation can compare two different books using the terms contained in each book and the number of occurrences of each term.19

Plesionymy,    Nearly synonymous words, or pairs of words that are sometimes synonymous; other times not. For example, the noun forms of “smell” and “odor” are synonymous. As verb forms, “smell” applies, but “odor” does not. You can smell a fish, but you cannot odor a fish. Smell and odor are plesionyms. Plesionymy is another challenge for machine translators.

Polysemy,    Polysemy occurs when a word has more than one distinct meaning. The intended meaning of a word can sometimes be determined by the context in which the word is used. For example, “she rose to the occasion” and “her favorite flower is the rose.” Sometimes polysemy cannot be resolved, for example, “eats shoots and leaves.”

Polytely,    From the Greek root meaning “many goals,” polytely refers to problems that involve a large number of variables acting with one another in many different ways, where the rules of interaction may vary as times and conditions change. The outcome of such interactions may have many different consequences. Because Big Data is immense and complex, polytely is an important impediment to Big Data analysis.

Power law,    A mathematical relationship in which one quantity varies as a fixed power of another. Zipf distributions are examples of power law distributions. See Zipf distribution.

Power series,    A power series of a single variable is an infinite sum of increasing powers of x, multiplied by constants. Power series are very useful because it is easy to calculate the derivative or the integral of a power series and because different power series can be added and multiplied together. When the high exponent terms of a power series are small, as happens when the absolute value of x is less than 1 or when the constants associated with the higher exponents all equal 0, the series can be approximated by summing only the first few terms. Many different kinds of distributions can be represented as a power series. Distributions that cannot be wholly represented by a power series may sometimes be segmented by ranges of x. Within a segment, the distribution might be representable as a power series. A power series should not be confused with a power law distribution. See Power law.
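
For example, the exponential function can be approximated by summing the first few terms of its power series (1 + x + x^2/2! + x^3/3! + ...). The short sketch below shows how close the truncated sum comes for a small value of x.

    # Approximating exp(x) by summing the first six terms of its power series.
    from math import exp, factorial

    x = 0.5
    approximation = sum(x ** n / factorial(n) for n in range(6))
    print(round(approximation, 6), round(exp(x), 6))    # 1.648698 versus 1.648721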

Precision and accuracy,    See Accuracy and precision.

Predictive analytics,    This term most often applies to a collection of techniques that have been used, with great success, in marketing. These are recommenders, classifiers, and clustering.103 Though all of these techniques can be used for purposes other than marketing, they are often described in marketing terms: recommenders (e.g., predicting which products a person might prefer to buy), profile clustering (e.g., grouping individuals into marketing clusters based on the similarity of their profiles), and product classifiers (e.g., assigning a product or individual to a prediction category based on a set of features). See Recommender. See Classifier.

Predictive modeling contests,    Everyone knows that science is competitive, but very few areas of science have been constructed as a competitive game. Predictive analytics is an exception. Kaggle is a Web site that runs predictive-modeling contests. Its motto is “We’re making data science a sport.” Competitors with the most successful predictive models win prizes. Prizes vary from thousands to millions of dollars, and hundreds of teams may enter the fray.261

Principal component analysis,    A method for reducing the dimensionality of data sets. This method takes a list of parameters and reduces it to a smaller list of variables, with each component of the smaller list constructed from combinations of variables in the longer list. Furthermore, principal component analysis provides an indication of which variables in both the original and the new list are least correlated with the other variables. Principal component analysis requires matrix operations on large matrices. Such operations are computationally intensive and can easily exceed the capacity of most computers.104
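
A minimal sketch of the idea, assuming the numpy library is available: the data are centered, a singular value decomposition supplies the principal components, and the observations are projected onto the first two components. The small data matrix is invented for illustration.

    # A minimal principal component analysis sketch, assuming numpy is installed.
    # Rows are observations; columns are the original variables (invented values).
    import numpy as np

    data = np.array([[2.5, 2.4, 1.1],
                     [0.5, 0.7, 0.2],
                     [2.2, 2.9, 1.0],
                     [1.9, 2.2, 0.8],
                     [3.1, 3.0, 1.3]])

    centered = data - data.mean(axis=0)       # subtract the mean of each variable
    u, s, vt = np.linalg.svd(centered, full_matrices=False)

    explained = (s ** 2) / np.sum(s ** 2)     # share of the variance carried by each component
    reduced = centered @ vt[:2].T             # keep only the first two components

    print(np.round(explained, 3))
    print(np.round(reduced, 3))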

Privacy and confidentiality,    The concepts of confidentiality and of privacy are often confused, and it is useful to clarify their separate meanings. Confidentiality is the process of keeping a secret with which you have been entrusted. You break confidentiality if you reveal the secret to another person. You violate privacy when you use the secret to annoy the person whose confidential information was acquired. If you give me your unlisted telephone number in confidence, then I am expected to protect this confidentiality by never revealing the number to other persons. I may also be expected to protect your privacy by never using the telephone number to call you at all hours of the day and night. In this case, the same information object (unlisted telephone number) is encumbered by confidentiality and privacy obligations.

Protocol,    A set of instructions, policies, or fully described procedures for accomplishing a service, operation, or task. Protocols are fundamental to Big Data. Data is generated and collected according to protocols. There are protocols for conducting experiments, and there are protocols for measuring the results. There are protocols for choosing the human subjects included in a clinical trial, and there are protocols for interacting with the human subjects during the course of the trial. All network communications are conducted via protocols; the Internet operates under a protocol [Transmission Control Protocol/Internet Protocol (TCP/IP)].

Public data, public databases,    A term that usually refers to data collections composed of freely available data or of public domain data that can be accessed via database services that are open to the public, such as a Web search engine. Here are a few Web sites that collect information on public Big Data resources: aws.amazon.com/datasets, www.data.gov, and www.google.com/publicdata/directory.

Public domain,    Data that is not owned by an entity. Public domain materials include documents whose copyright terms have expired, materials produced by the federal government, materials that contain no creative content (i.e., materials that cannot be copyrighted), or materials donated to the public domain by the entity that holds copyright. Public domain data can be accessed, copied, and redistributed without violating piracy laws. It is important to note that plagiarism laws and rules of ethics apply to public domain data. You must properly attribute authorship to public domain documents. If you fail to attribute authorship or if you purposely and falsely attribute authorship to the wrong person (e.g., yourself), then this would be an unethical act and an act of plagiarism.

Query,    The term “query” usually refers to a request, sent to a database, for information (e.g., Web pages, documents, lines of text, images) that matches a provided word or phrase (i.e., the query term). More generally, a query is a parameter or set of parameters that is submitted as input to a computer program, which searches a data collection for items that match or bear some relationship to the query parameters. In the context of Big Data, the user may need to find classes of objects that have properties relevant to a particular area of interest. In this case, the query is basically introspective, and the output may yield metadata describing individual objects, classes of objects, or the relationships among objects that share particular properties. For example, “weight” may be a property, and this property may fall into the domain of several different classes of data objects. The user might want to know the names of the classes of objects that have the “weight” property and the numbers of object instances in each class. Eventually, the user might want to select several of these classes (e.g., including dogs and cats, but excluding microwave ovens), along with data object instances whose weights fall within a specified range (e.g., 20 to 30 pounds). This approach to querying could work with any data set that has been well specified with metadata, but it is particularly important when using Big Data resources. See Introspection.
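
A minimal sketch of such an introspective query, using a handful of invented data objects held as Python dictionaries: the query first asks which classes carry the “weight” property and then selects the instances whose weights fall within the specified range.

    # A minimal sketch of a query over self-describing data objects (invented records).
    objects = [
        {"id": "obj1", "class": "dog",            "weight": 22.5},
        {"id": "obj2", "class": "cat",            "weight": 9.1},
        {"id": "obj3", "class": "dog",            "weight": 65.0},
        {"id": "obj4", "class": "microwave_oven", "weight": 28.0},
        {"id": "obj5", "class": "cat",            "weight": 24.3},
    ]

    # Which classes of objects carry the "weight" property, and how many instances of each?
    classes = {}
    for obj in objects:
        if "weight" in obj:
            classes[obj["class"]] = classes.get(obj["class"], 0) + 1
    print(classes)

    # Select dogs and cats (excluding microwave ovens) weighing 20 to 30 pounds.
    matches = [obj["id"] for obj in objects
               if obj["class"] in ("dog", "cat") and 20 <= obj["weight"] <= 30]
    print(matches)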

RDF,    See Resource Description Framework.

Recommender,    A collection of methods for predicting the preferences of individuals. Recommender methods often rely on one or two simple assumptions. (1) If an individual expresses a preference for a certain type of product and the individual encounters a new product that is similar to a previously preferred product, then he is likely to prefer the new product. (2) If an individual expresses preferences that are similar to the preferences expressed by a cluster of individuals and if the members of the cluster prefer a product that the individual has not yet encountered, then the individual will most likely prefer the product. See Predictive analytics. See Classifier.
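
A minimal sketch of the first assumption, with invented products and features: each product is reduced to a set of descriptive features, and the new product that shares the most features with a previously preferred product is recommended.

    # A minimal sketch of a similarity-based recommender (invented products and features).
    products = {
        "trail_shoes":   {"outdoor", "footwear", "running"},
        "road_shoes":    {"footwear", "running", "pavement"},
        "camping_stove": {"outdoor", "cooking"},
        "espresso_pot":  {"cooking", "coffee"},
    }
    liked = "trail_shoes"    # a product the individual already prefers

    def jaccard(set_a, set_b):
        # similarity = number of shared features / number of combined features
        return len(set_a & set_b) / len(set_a | set_b)

    scores = {name: jaccard(products[liked], features)
              for name, features in products.items() if name != liked}
    best = max(scores, key=scores.get)
    print(best, round(scores[best], 2))    # the most similar not-yet-encountered product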

Reflection,    A programming technique wherein a computer program will modify itself, at runtime, based on information it acquires through introspection. For example, a computer program may iterate over a collection of data objects, examining the self-descriptive information for each object in the collection (i.e., object introspection). If the information indicates that the data object belongs to a particular class of objects, the program might call a method appropriate for the class. The program executes in a manner determined by descriptive information obtained during runtime; metaphorically reflecting upon the purpose of its computational task. Because introspection is a property of well-constructed Big Data resources, reflection is an available technique to programmers who deal with Big Data. See Introspection.
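
A minimal sketch in Python, with invented classes: the program iterates over a mixed collection, asks each object what it is at runtime, looks up a method by name, and calls it.

    # A minimal sketch of reflection: behavior chosen at runtime through introspection.
    class Image:
        def render(self):
            return "rendering pixels"

    class Document:
        def render(self):
            return "laying out text"

    collection = [Image(), Document(), Image()]

    for obj in collection:
        class_name = type(obj).__name__          # introspection: what kind of object is this?
        method = getattr(obj, "render", None)    # reflection: find a method by name at runtime
        if callable(method):
            print(class_name, "->", method())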

RegEx,    Short for Regular Expressions, RegEx is a syntax for describing patterns in text. For example, if I wanted to pull all lines from a text file that began with an uppercase “B” and contained at least one digit and ended with a lowercase x, then I might use the regular expression “^B.*[0-9].*x$”. This syntax for expressing patterns of strings that can be matched by prebuilt methods available to a programming language is somewhat standardized. This means that a RegEx expression in Perl will match the same pattern in Python, Ruby, or any language that employs RegEx. The relevance of RegEx to Big Data is severalfold. RegEx can be used to build or transform data from one format to another; hence creating or merging data records. It can be used to convert sets of data to a desired format; hence transforming data sets. It can be used to extract records that meet a set of characteristics specified by a user; thus filtering subsets of data or executing data queries over text-based files or text-based indexes. The big drawback to RegEx is speed: programs that perform many RegEx operations, particularly when those operations are repeated for each parsed line or record, will suffer reduced performance. RegEx-heavy programs that operate just fine on megabyte files may take hours, days, or months to parse through terabytes of data.
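
A minimal sketch of the pattern at work, using Python’s standard re module and a few invented lines of text; only the first line satisfies all three conditions.

    # A minimal sketch of the pattern described above, using Python's re module.
    import re

    lines = [
        "Box 7 contains the index",      # starts with B, has a digit, ends with a lowercase x
        "Box 7 contains the indexes",    # fails: does not end with x
        "box 7 contains the index",      # fails: does not start with an uppercase B
        "Box contains the index",        # fails: contains no digit
    ]

    pattern = re.compile(r"^B.*[0-9].*x$")
    for line in lines:
        if pattern.search(line):
            print(line)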

Regular expressions,    See RegEx.

Reidentification,    A term casually applied to any instance whereby information can be linked to a specific person after the links between the information and the person associated with the information were removed. Used this way, the term reidentification connotes an insufficient deidentification process. In the health care industry, the term “reidentification” means something else entirely. In the United States, regulations define “reidentification” under the “Standards for Privacy of Individually Identifiable Health Information.”33 Reidentification is defined therein as a legally valid process whereby deidentified records can be linked back to their human subjects under circumstances deemed compelling by a privacy board. Reidentification is typically accomplished via a confidential list of links between human subject names and deidentified records, held by a trusted party. As used by the health care industry, reidentification only applies to the approved process of reestablishing the identity of a deidentified record. When a human subject is identified through fraud, through trickery, or through the deliberate use of computational methods to break the confidentiality of insufficiently deidentified records, the term “reidentification” would not apply.

Reification,    The process whereby the subject of a statement is inferred, without actually being named. Reification applies to human languages, programming languages, and data specification languages. Here is an example of reification in the English language: “He sat outside.” The sentence does not identify who “he” is, but you can infer that “he” must exist and you can define “he” as the object that sat outside. In object-oriented programming languages, an object is reified without declaring it an instance of a class, when it accepts a class method. Likewise, in a specification, an object comes into existence if it is the thing that is being described in a statement (e.g., an RDF triple). Reifications bypass the normal naming and identification step, a neat trick that saves a little time and effort, often causing great confusion. See full discussion in Chapter 1.

Representation bias,    This occurs when the population sampled does not represent the population intended for study. For example, the population on which the normal range of prostate-specific antigen (PSA) was based was selected from a county in the state of Minnesota. The male population under study consisted almost exclusively of white men (i.e., virtually no African-Americans, Asians, Hispanics, etc.). It may have been assumed that PSA levels would not vary with race. It was eventually determined that the normal PSA ranges varied greatly by race.262 The Minnesota data, though plentiful, did not represent racial subpopulations. A sharp distinction must be drawn between Big-ness and Whole-ness.138

Resource Description Framework (RDF),    A syntax within XML that formally expresses assertions in three components, the so-called RDF triple. The RDF triple consists of a uniquely identified subject plus a metadata descriptor for the data plus a data element. Triples are necessary and sufficient to create statements that convey meaning. Triples can be aggregated with other triples from the same data set or from other data sets as long as each triple pertains to a unique subject that is identified equivalently through the data sets. Enormous data sets of RDF triples can be merged or integrated functionally with other Big Data resources. See Notation 3. See Semantics. See Triples.
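
A minimal sketch of the underlying idea, representing each triple as a plain (subject identifier, metadata descriptor, data) tuple so that triples drawn from two hypothetical resources can be merged on their shared subject; real RDF carries the same information in a formal XML or Notation 3 syntax, and the identifiers below are invented.

    # A minimal sketch: triples as (subject identifier, metadata descriptor, data) tuples.
    # The two sets stand in for triples drawn from two hypothetical data resources.
    resource_1 = {
        ("patient:8f2a", "specimen:type", "blood"),
        ("patient:8f2a", "glucose:mg_per_dl", "95"),
    }
    resource_2 = {
        ("patient:8f2a", "diagnosis:code", "E11.9"),
        ("patient:77c0", "specimen:type", "saliva"),
    }

    merged = resource_1 | resource_2    # triples from different resources aggregate cleanly
    for triple in sorted(t for t in merged if t[0] == "patient:8f2a"):
        print(triple)                   # everything now known about one uniquely identified subject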

Second trial bias,    This can occur when a clinical trial yields a greatly improved outcome when it is repeated with a second group of subjects. In the medical field, a second trial bias arises when trialists find subsets of patients from the first trial who do not respond well to treatment, thereby learning which clinical features are associated with poor trial response (e.g., certain preexisting medical conditions, lack of a good home support system, obesity, nationality). During the accrual process for the second trial, potential subjects who profile as nonresponders are excluded. Trialists may justify this practice by asserting that the purpose of the second trial is to find a set of subjects who will benefit from treatment. With a population enriched with good responders, the second trial may yield results that look much better than the first trial. Second trial bias can be considered a type of cherry-picking. See Cherry-picking.

Selection bias,    See Cherry-picking.

Semantics,    The study of meaning. In the context of Big Data, semantics is the technique of creating meaningful assertions about data objects. A meaningful assertion, as used here, is a triple consisting of an identified data object, a data value, and a descriptor for the data value. In practical terms, semantics involves making assertions about data objects (i.e., making triples), combining assertions about data objects (i.e., merging triples), and assigning data objects to classes; hence relating triples to other triples. As a word of warning, few informaticians would define semantics in these terms, but I would suggest that most definitions for semantics would be functionally equivalent to the definition offered here. See Triples. See RDF.

Serious Big Data,    The 3V’s (data volume, data variety, and data velocity) plus “seriousness.” Seriousness is a tongue-in-cheek term that the author applies to Big Data resources whose objects are provided with an adequate identifier and a trusted time stamp and that provide data users with introspection, including pointers to the protocols that produced the data objects. Metadata in Big Data resources is appended with namespaces. Serious Big Data resources can be merged with other serious Big Data resources. In the opinion of the author, Big Data resources that lack seriousness should not be used in science, in legal work, in banking, and in the realm of public policy. See Identifier. See Trusted time stamp. See Introspection. See Namespace. See Merging.

Simpson’s paradox,    When a correlation that holds in two different data sets is reversed when the data sets are combined, for example, baseball player A may have a higher batting average than player B for each of two seasons, but when the data for the two seasons are combined, player B may have the higher two-season average. Simpson’s paradox has particular significance for Big Data research, wherein data samples are variously recombined and reanalyzed at different stages of the analytic process. For a full discussion, with examples, see Chapter 10.
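
A small worked example, with invented batting figures, shows the reversal.

    # Simpson's paradox with invented batting figures: (hits, at-bats) per season.
    player_a = {"season_1": (4, 10),   "season_2": (25, 100)}
    player_b = {"season_1": (35, 100), "season_2": (2, 10)}

    def average(hits, at_bats):
        return hits / at_bats

    for season in ("season_1", "season_2"):
        print(season,
              round(average(*player_a[season]), 3),    # player A: .400, then .250
              round(average(*player_b[season]), 3))    # player B: .350, then .200

    combined_a = average(4 + 25, 10 + 100)     # .264
    combined_b = average(35 + 2, 100 + 10)     # .336
    print("combined", round(combined_a, 3), round(combined_b, 3))
    # Player A wins each season, yet player B has the higher combined average.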

Specification,    A method used for describing objects (physical objects such as nuts and bolts or symbolic objects such as numbers). Specifications do not require specific types of information and do not impose any order of appearance of the data contained in the document. Specifications do not generally require certification by a standards organization. They are generally produced by special interest organizations, and their legitimacy depends on their popularity. Examples of specifications are RDF, produced by the World Wide Web Consortium, and TCP/IP, maintained by the Internet Engineering Task Force.

Sponsor bias,    Are the results of Big Data analytics skewed in favor of the corporate sponsors of the resource? In a fascinating meta-analysis, Yank and coworkers asked whether the results of clinical trials, conducted with financial ties to a drug company, were biased to produce results favorable to the sponsors.263 They reviewed the literature on clinical trials for antihypertensive agents and found that ties to a drug company did not bias the results (i.e., the experimental data), but they did bias the conclusions (i.e., the interpretations drawn from the results). This suggests that regardless of the results of a trial, the conclusions published by the investigators were more likely to be favorable if the trial was financed by a drug company. This should come as no surprise. Two scientists can look at the same results and draw entirely different conclusions.

Square kilometer array (SKA),    SKA is designed to collect data from millions of connected radio telescopes and is expected to produce more than one exabyte (1 billion gigabytes) every day.7

String,    A string is a sequence of characters. Words, phrases, numbers, and alphanumeric sequences (e.g. identifiers, one-way hash values, passwords) are strings. A book is a long string. The complete sequence of the human genome (three billion characters, with each character an A, T, G, or C) is a very long string. Every subsequence of a string is another string.

Supercomputer,    Computers that can perform many times faster than a desktop personal computer. In 2012, the top supercomputers could perform in excess of a petaflop (i.e., 10 to the 15th power floating point operations per second). By my calculations, a 1 petaflop computer performs about 250,000 operations in the time required for my laptop to finish one operation.

Support vector machine (SVM),    A machine-learning technique that classifies objects. The method starts with a training set consisting of two classes of objects as input. The SVM computes a hyperplane, in a multidimensional space, that separates objects of the two classes. The dimension of the hyperspace is determined by the number of dimensions or attributes associated with the objects. Additional objects (i.e., test set objects) are assigned membership in one class or the other, depending on which side of the hyperplane they reside.
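
A minimal sketch, assuming the scikit-learn library is available and using a tiny invented training set of two-attribute objects.

    # A minimal support vector machine sketch, assuming scikit-learn is installed.
    # The training objects (two attributes each) and their class labels are invented.
    from sklearn.svm import SVC

    training_objects = [[1.0, 1.2], [1.5, 0.9], [1.1, 1.4],    # class 0
                        [4.0, 4.2], [4.5, 3.8], [3.9, 4.4]]    # class 1
    labels = [0, 0, 0, 1, 1, 1]

    model = SVC(kernel="linear")          # compute the separating hyperplane
    model.fit(training_objects, labels)

    test_objects = [[1.2, 1.0], [4.2, 4.0]]
    print(model.predict(test_objects))    # each test object is assigned to one side of the hyperplane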

Syntax,    Syntax is the standard form or structure of a statement. What we know as English grammar is equivalent to the syntax for the English language. If I write “Jules hates pizza,” the statement would be syntactically valid but factually incorrect. If I write “Jules drives to work in his pizza,” the statement would be syntactically valid but nonsensical. For programming languages, syntax refers to the enforced structure of command lines. In the context of triple stores, syntax refers to the arrangement and notation requirements for the three elements of a statement (e.g., RDF format or n3 format). Charles Mead distinctly summarized the difference between syntax and semantics: “Syntax is structure; semantics is meaning.”264

Taxonomy,    The definition varies, but as used here, a taxonomy is the collection of named instances (class members) in a classification. When you see a schematic showing class relationships, with individual classes represented by geometric shapes and the relationships represented by arrows or connecting lines between the classes, then you are essentially looking at the structure of a classification. You can think of building a taxonomy as the act of pouring all of the names of all of the instances into their proper classes within the classification schematic. A taxonomy is similar to a nomenclature; the difference is that in a taxonomy, every named instance must have an assigned class.

Term extraction algorithm,    Terms are phrases, most often noun phrases, and sometimes individual words that have a precise meaning within a knowledge domain. For example, “software validation,” “RDF triple,” and “WorldWide Telescope” are examples of terms that might appear in the index or the glossary of this book. The most useful terms might appear up to a dozen times in the text, but when they occur on every page, their value as a searchable item is diminished; there are just too many instances of the term to be of practical value. Hence, terms are sometimes described functionally as noun phrases that have low-frequency and high information content. Various algorithms are available to extract candidate terms from textual documents. The candidate terms can be examined by a curator who determines whether they should be included in the index created for the document from which they were extracted. The curator may also compare the extracted candidate terms against a standard nomenclature to determine whether the candidate terms should be added to the nomenclature. For an in-depth discussion, see Chapter 1.
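
A crude sketch of a frequency-based screen, using a short invented passage: single words that occur more than once, but are not common stop words, are kept as candidate terms for a curator to review. Real term extraction algorithms work over noun phrases and much larger documents.

    # A minimal sketch of frequency-based candidate term extraction (invented text).
    # Real extractors favor noun phrases; this sketch only screens single words.
    import re
    from collections import Counter

    text = """Software validation is described in Chapter 1.
    An RDF triple is a statement with three parts.
    Each RDF triple should be validated against the software validation protocol.
    The WorldWide Telescope is a Big Data resource."""

    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)

    stop_words = {"the", "of", "and", "a", "an", "is", "in", "with", "to", "for", "be",
                  "each", "should", "against"}
    candidates = [w for w, n in counts.items()
                  if 2 <= n <= 12 and w not in stop_words]    # low frequency, not a common word
    print(sorted(candidates))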

Text editor,    A text editor (also called an ASCII editor) is a software application designed to create, modify, and display simple unformatted text files. Text editors are different from word processors, which are designed to include style, font, and other formatting symbols. Text editors are much faster than word processors because they display the contents of files without having to interpret and execute formatting instructions. Unlike word processors, text editors can open files of enormous size (e.g., in the gigabyte range).

Thesaurus,    A vocabulary that groups together synonymous terms. A thesaurus is very similar to a nomenclature. There are two minor differences: nomenclatures do not always group terms by synonymy and nomenclatures are often restricted to a well-defined topic or knowledge domain (e.g., names of stars, infectious diseases, etc.).

Time stamp,    Many data objects are temporal events, and all temporal events must be given a time stamp indicating the time that the event occurred, using a standard measurement for time. The time stamp must be accurate, persistent, and immutable. The Unix epoch time (equivalent to the Posix epoch time) is available for most operating systems and consists of the number of seconds that have elapsed since January 1, 1970, midnight, Greenwich Mean Time. The Unix epoch time can easily be converted into any other standard representation of time. The duration of any event can be calculated by subtracting the beginning time from the ending time. Because the timing of events can be maliciously altered, scrupulous data managers employ a protocol by which a time stamp can be verified (discussed in Chapter 4).
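
A minimal sketch of Unix epoch time in Python: the epoch count is captured before and after an event, converted to a human-readable representation, and used to compute the event’s duration.

    # A minimal sketch of Unix (Posix) epoch time stamps in Python.
    import time
    from datetime import datetime, timezone

    start = time.time()     # seconds elapsed since January 1, 1970, midnight (UTC)
    time.sleep(1.5)         # the "event" being timed
    end = time.time()

    print("duration in seconds:", round(end - start, 2))
    print("start, human readable:", datetime.fromtimestamp(start, tz=timezone.utc))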

Time-window bias,    A bias produced by the choice of a time measurement. In medicine, survival is measured as the interval between diagnosis and death. Suppose a test is introduced that provides early diagnoses. Patients given the test will be diagnosed at a younger age than patients who are not given the test. Such a test will always produce improved survival simply because the interval between diagnosis and death will be lengthened. Assuming the test does not lead to any improved treatment, the age at which the patient dies is unchanged by the testing procedure. The bias is caused by the choice of timing interval (i.e., time from diagnosis to death). Survival is improved without a prolongation of life beyond what would be expected without the test. Some of the touted advantages of early diagnosis are the direct result of timing bias.

Triple,    In computer semantics, a triple is an identified data object associated with a data element and the description of the data element. In theory, all Big Data resources can be composed as collections of triples. When the data and metadata held in sets of triples are organized into ontologies consisting of classes of objects and associated properties (metadata), the resource can potentially provide introspection (the ability of a data object to be self-descriptive). This topic is discussed in depth in Chapter 4. See Introspection. See Data object. See Semantics. See Resource Description Framework.

Uniqueness,    In computational sciences, uniqueness is achieved when a data object is associated with a unique identifier (i.e., a character string that has not been assigned to any other object). A full discussion of uniqueness is found in Chapter 2. See Identifier.

Universally Unique Identifiers (UUID),    A UUID is a protocol for assigning identifiers to data objects without using a central registry. UUIDs were originally used in the Apollo Network Computing System.27 See Chapter 2.
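
A minimal sketch, using Python’s standard uuid module: each call produces a new identifier without consulting any central registry.

    # A minimal sketch of UUID generation with Python's standard uuid module.
    import uuid

    identifier = uuid.uuid4()    # a random UUID, created without a central registry
    print(identifier)            # a 36-character hexadecimal identifier, unique for practical purposes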

UUID,    See Universally Unique Identifiers.

Validation,    This involves demonstrating that the conclusions that come from data analyses fulfill their intended purpose and are consistent.265 You validate a conclusion (which may appear in the form of a hypothesis, or a statement about the value of a new laboratory test, or a therapeutic protocol) by showing that you draw the same conclusion repeatedly whenever you analyze relevant data sets, and that the conclusion satisfies some criteria for correctness or suitability. Validation is somewhat different from reproducibility. Reproducibility involves getting the same measurement over and over when you perform the test. Validation involves drawing the same conclusion over and over. See Verification and validation.

Variable,    In algebra, a variable is a quantity in an equation that can change, as opposed to a constant quantity that cannot change. In computer science, a variable can be perceived as a container that can be assigned a value. If you assign the integer 7 to a container named “x”, then “x” equals 7 until you reassign some other value to the container (i.e., variables are mutable). In most computer languages, when you issue a command assigning a value to a new (undeclared) variable, the variable automatically comes into existence to accept the assignment. The process whereby an object comes into existence, because its existence was implied by an action (such as value assignment), is called reification. See Reification. See Dimensionality.
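
A minimal sketch of the behavior described above, in Python.

    # A variable comes into existence (is reified) when a value is first assigned to it.
    x = 7            # the container "x" now exists and holds the integer 7
    print(x)
    x = "seven"      # variables are mutable; the same container now holds a different value
    print(x)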

Verification and validation,    As applied to data resources, verification is the process that ensures that data conforms to a set of specifications. Validation is the process that checks whether the data can be applied in a manner that fulfills its intended purpose. This often involves showing that correct conclusions can be obtained from a competent analysis of the data. For example, a Big Data resource might contain position, velocity, direction, and mass data for the earth and for a meteor that is traveling sunwards. The data may meet all specifications for measurement, error tolerance, data typing, and data completeness. A competent analysis of the data indicates that the meteor will miss the earth by a safe 50,000 miles, plus or minus 10,000 miles. If the meteor nonetheless smashes into the earth, destroying all planetary life, then an extraterrestrial observer might conclude that the data was verified, but not validated.

Vocabulary,    This is a comprehensive collection of the words used in a general area of knowledge. The term “vocabulary” and the term “nomenclature” are nearly synonymous. In common usage, a vocabulary is a list of words and typically includes a wide range of terms and classes of terms. Nomenclatures typically focus on a class of terms within a vocabulary. For example, a physics vocabulary might contain the terms “quark, black hole, Geiger counter, and Albert Einstein”; a nomenclature might be devoted to the names of celestial bodies. See Nomenclature.

Web service,    A server-based collection of data, plus a collection of software routines operating on the data, that can be accessed by remote clients. One of the features of Web services is that they permit client users (e.g., humans or software agents) to discover the kinds of data and methods offered by the Web service and the rules for submitting server requests. To access Web services, clients must compose their requests as messages conveyed in a language that the server is configured to accept, a so-called Web services language.

WorldWide Telescope,    A Big Data effort from the Microsoft Corporation bringing astronomical maps, imagery, data, analytic methods, and visualization technology to standard Web browsers. More information is available at http://www.worldwidetelescope.org/Home.aspx.

eXtensible Markup Language (XML),    A syntax for marking data values with descriptors (metadata). The descriptors are commonly known as tags. In XML, every data value is enclosed by a start tag, indicating that a value will follow, and an end tag, indicating that the value has preceded the tag, for example, <name>Tara Raboomdeay</name>. The enclosing angle brackets, “<>”, and the end-tag marker, “/”, are hallmarks of XML markup. This simple but powerful relationship between metadata and data allows us to employ each metadata/data pair as though it were a small database that can be combined with related metadata/data pairs from any other XML document. The full value of metadata/data pairs comes when we can associate the pair with a unique object, forming a so-called triple. See Triple. See Meaning.
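
A minimal sketch, using Python’s standard xml.etree module to read metadata/data pairs of the sort shown above; the surrounding patient record and its tags are invented for illustration.

    # A minimal sketch of parsing XML metadata/data pairs with Python's standard library.
    import xml.etree.ElementTree as ET

    record = "<patient><name>Tara Raboomdeay</name><weight_lbs>125</weight_lbs></patient>"
    root = ET.fromstring(record)

    for element in root:
        print(element.tag, ":", element.text)    # each tag (metadata) paired with its value (data)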

Zipf distribution,    George Kingsley Zipf (1902–1950) was an American linguist who demonstrated that, for most languages, a small number of words account for the majority of occurrences of all the words found in prose. Specifically, he found that the frequency of any word is inversely proportional to its rank in a list of words ordered by decreasing frequency. The first word in the frequency list will occur about twice as often as the second word in the list, three times as often as the third word in the list, and so on. Many Big Data collections follow a Zipf distribution (income distribution in a population, energy consumption by country, and so on). Zipf distributions within Big Data cannot be sensibly described by the standard statistical measures that apply to normal distributions. Zipf distributions are instances of Pareto’s principle. See Pareto’s principle.
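
A minimal sketch of the rank-frequency relationship in an ideal Zipf distribution: if the most frequent word occurs 1000 times, the word of rank n is expected to occur about 1000/n times, so the product of rank and frequency stays roughly constant.

    # An ideal Zipf distribution: the nth ranked word occurs about 1/n as often as the first.
    top_frequency = 1000
    for rank in range(1, 6):
        frequency = top_frequency / rank
        print(rank, round(frequency), round(rank * frequency))    # rank * frequency is constant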
