4

Metadata, Semantics, and Triples

Abstract

Data has no value unless it has been described, and described data has no meaning unless it has been associated with an identifier. The “triple,” consisting of a data value, its descriptor, plus its associated identifier, is the basic unit of meaning (semantics) in information science. The concept of triples will be new to most readers, but this simple concept has enormous value whenever we work with complex types of data and whenever we need to merge and integrate data obtained from multiple sources. It is important that statisticians and researchers, accustomed to working with small or simple sets of data, become aware of the necessity for semantic rigor when designing and analyzing Big Data.

Keywords

Meaning; Triple; Namespace; XML; Metadata; Semantics

Section 4.1. Metadata

Life is a concept.

Patrick Forterre [1]

When you think about it, numbers are meaningless. The number “8” has no connection to anything in the physical realm until we attach some information to the number (e.g., 8 candles, 8 minutes). Some numbers, like “0” or “− 5” have no physical meaning under any set of circumstances. There really is no such thing as “0 dollars”; it is an abstraction indicating the absence of a positive number of dollars. Likewise, there is no such thing as “− 5 walnuts”; it is an abstraction that we use to make sense of subtractions (5 − 10 = − 5).

When we write “8 walnuts,” “walnuts” is the metadata that tells us what is being referred to by the data, in this case the number “8.”

When we write “8 o'clock”, “8” is the data and “o'clock” is the metadata.

Section 4.2. eXtensible Markup Language

The purpose of narrative is to present us with complexity and ambiguity.

Scott Turow

XML (eXtensible Markup Language) is a syntax for attaching descriptors (so-called metadata) to data values. [Glossary Metadata]

In XML, descriptors are commonly known as tags.

XML has its own syntax: a set of rules for expressing data/metadata pairs. Every data value is flanked by a start-tag and an end-tag. Enclosing angle brackets, “<>”, and the end-tag marker, “/”, are hallmarks of XML markup. For example:

<name>Tara Raboomdeay</name>

This simple but powerful relationship between metadata and data allows us to employ every metadata/data pair as a minuscule database that can be combined with related metadata/data pairs from the same XML document or from different XML documents.

It is impossible to overstate the importance of XML (eXtensible Markup Language) as a data organization tool. With XML, every piece of data tells us something about itself. When a data value has been annotated with metadata, it can be associated with other, related data, even when the other data is located in a seemingly unrelated database. [Glossary Integration].

When all data is flanked by metadata, it is relatively easy to port the data into spreadsheets, where the column headings correspond to the metadata tags, and the data values correspond to the value found in the cells of the spreadsheet. The rows correspond to the record number.
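As a minimal sketch of this porting step, the following Python script reads a small XML document and lays it out as spreadsheet rows. The record structure, the tag names, and the values are hypothetical, chosen only for illustration; the script assumes that every record is a flat list of tagged values.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical two-record XML document; tags and values are illustrative only.
XML_TEXT = """<?xml version="1.0"?>
<records>
  <record><name>Tara Raboomdeay</name><age>48</age></record>
  <record><name>John Harrington</name><age>35</age></record>
</records>"""


def xml_to_rows(xml_text):
    """Return (header, rows): tags become column headings, values become cells."""
    root = ET.fromstring(xml_text)
    header = []
    # First pass: collect every tag, in order of first appearance.
    for record in root:
        for field in record:
            if field.tag not in header:
                header.append(field.tag)
    # Second pass: one spreadsheet row per record; a missing tag yields an empty cell.
    rows = []
    for record in root:
        cells = {field.tag: (field.text or "").strip() for field in record}
        rows.append([cells.get(tag, "") for tag in header])
    return header, rows


def rows_to_csv(header, rows):
    """Serialize the header and rows as comma-separated text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


header, rows = xml_to_rows(XML_TEXT)
print(rows_to_csv(header, rows))
```

The resulting text can be opened by any spreadsheet application; each row corresponds to one record, in the order in which the records appear in the XML document.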

A file that contains XML markup is considered a proper XML document only if it is well formed. Here are the properties of a well-formed XML document.

  •   The document must have a proper XML header. The header can vary somewhat, but it usually looks something like:

              <?xml version="1.0"?>

  •   XML files are ASCII files consisting of characters available to a standard keyboard.
  •   Tags in XML files must conform to composition rules (e.g., spaces are not permitted within a tag, and tags are case-sensitive).
  •   Tags must be properly nested (i.e., no overlapping). For example, the following is properly nested XML.

<chapter><chapter_title>Introspection</chapter_title></chapter>

Compare the previous example with the following, improperly nested XML.

<chapter><chapter_title>Introspection</chapter></chapter_title>

Web browsers will not display XML files that are not well formed.
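The nesting rule is easy to test with a script. Here is a short sketch that hands the two chapter examples to the XML parser in Python's standard library. The function name is_well_formed is our own, and the test covers only what this particular parser enforces (tag syntax and proper nesting); it is not a complete audit of the properties listed above.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the standard-library parser accepts the text."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        # Raised for overlapping tags, unclosed tags, and other syntax errors.
        return False


PROPER = "<chapter><chapter_title>Introspection</chapter_title></chapter>"
IMPROPER = "<chapter><chapter_title>Introspection</chapter></chapter_title>"

print(is_well_formed(PROPER))    # properly nested
print(is_well_formed(IMPROPER))  # overlapping tags
```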

The actual structure of an XML file is determined by another XML file known as an XML Schema. The XML Schema file lists the tags and determines the structure for those XML files that are intended to comply with a specific Schema document. A valid XML file conforms to the rules of structure and content defined in its assigned XML Schema.

Every XML file that is valid under a particular Schema will contain data that is described using the same tags that are listed in that same XML schema, permitting data integration among those files. This is one of the strengths of XML.

The greatest drawback of XML is that data/metadata pairs are not assigned to a unique object. XML describes its data, but it does not tell us the object of the data. This gaping hole in XML was filled by RDF (Resource Description Framework), a modified XML syntax designed to associate every data/metadata pair with a unique data object. Before we can begin to understand RDF, we need to understand the concept of “meaning,” in the context of information science.

Section 4.3. Semantics and Triples

Supplementary bulletin from the Office of Fluctuation Control, Bureau of Edible Condiments, Soluble and Indigestible Fats and Glutinous Derivatives, Washington, D.C. Correction of Directive #943456201, . . . the quotation on groundhog meat should read ‘ground hog meat.’

Bob Elliot and Ray Goulding, comedy routine

Metadata gives structure to data values, but it does not tell us anything about how the data value relates to anything else. For example,

<height_in_feet_inches>5'11"</height_in_feet_inches>

What does it mean to know that 5′11″ is a height attribute, expressed in feet and inches? Nothing really. The metadata/data pair has no meaning, as it stands, because it does not describe anything in particular. If we were to assert that John Harrington has a height of 5′11″, then we would be making a meaningful statement. This brings us to ask ourselves: What is the meaning of meaning? This question sounds like another one of those Zen mysteries that has no answer. In informatics, “meaningfulness” is achieved when described data (i.e., a metadata/data pair) is bound to the unique identifier of a data object.

Let us look once more at our example:

"John Harrington's height is five feet eleven inches."

This sentence has meaning because there is data (five feet eleven inches), and it is described (person's height), and it is bound to a unique individual (John Harrington). Let us generate a unique identifier for John Harrington using our UUID generator (discussed in Section 3.3) and rewrite our assertion in a format in which metadata/data pairs are associated with a unique identifier:

9c7bfe97-e637-461f-a30b-d931b97907fe  name   John Harrington
9c7bfe97-e637-461f-a30b-d931b97907fe  height 5'11"

We now have two meaningful assertions: one that associates the name “John Harrington” with a unique identifier (9c7bfe97-e637-461f-a30b-d931b97907fe); and one that tells us that the object associated with the unique identifier (i.e., John Harrington) is 5′11″ tall. We could insert these two assertions into a Big Data resource, knowing that both assertions fulfill our definition of meaning. Of course, we would need to have some process in place to ensure that any future information collected on our unique John Harrington (i.e., the John Harrington assigned the identifier 9c7bfe97-e637-461f-a30b-d931b97907fe) will be assigned the same identifier.
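A minimal sketch of this step follows. It uses Python's standard uuid module as a stand-in for the UUID generator of Section 3.3; the function names are our own, and the identifier produced on any given run will differ from the one shown above.

```python
import uuid


def new_identifier():
    """Generate a fresh, random (version 4) UUID as a string."""
    return str(uuid.uuid4())


def make_triples(identifier, described_values):
    """Bind each metadata/data pair to the identifier, yielding one triple per pair."""
    return [(identifier, descriptor, value)
            for descriptor, value in described_values.items()]


harrington_id = new_identifier()
triples = make_triples(harrington_id,
                       {"name": "John Harrington", "height": "5'11\""})

for identifier, descriptor, value in triples:
    print(identifier, descriptor, value, sep="  ")
```

Note that the script does nothing to ensure that the same John Harrington receives the same identifier the next time his data is collected. That responsibility belongs to the identifier system of the resource, not to the triple itself.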

A statement with meaning does not need to be a true statement (e.g., The height of John Harrington was not 5 feet 11 inches when John Harrington was an infant). That is to say, an assertion can be meaningful but false.

Semantics is the study of meaning. In the context of Big Data, semantics is the technique of creating meaningful assertions about data objects. All meaningful assertions, without exception, can be structured as a 3-item list consisting of an identified data object, a data value, and a descriptor for the data value. These 3-item assertions are referred to as “triples.” Just as sentences are the fundamental informational unit of spoken languages, the triple is the fundamental unit of computer information systems.

In practical terms, semantics involves making assertions about data objects (i.e., making triples), combining assertions about data objects (i.e., aggregating triples), and assigning data objects to classes; hence relating triples to other triples. As a word of warning, few informaticians would define semantics in these terms, but I would suggest that all legitimate definitions for the term “semantics” are functionally equivalent to the definition offered here. For example, every cell in a spreadsheet is a data value that has a descriptor (the column header), and a subject (the row identifier). A spreadsheet can be pulled apart and re-assembled as a set of triples (known as a triplestore) equal in number to the number of cells contained in the original spreadsheet. Each triple would be an assertion consisting of the following:

<row identifier> <column header> <content of cell>

Likewise, any relational database, no matter how many relational tables are included, can be decomposed into a triplestore. The primary keys of the relational tables would correspond to the identifier of the RDF triple. Column header and cell contents complete the triple.
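The decomposition of a spreadsheet can be sketched in a few lines of Python. In this illustration, the row number serves as the row identifier, and the sample table is hypothetical. For a relational table, we would assume that the primary key takes the place of the row number.

```python
def spreadsheet_to_triples(header, rows):
    """Emit one (row identifier, column header, cell content) triple per cell."""
    triples = []
    for row_number, row in enumerate(rows, start=1):
        for column_header, cell in zip(header, row):
            triples.append((row_number, column_header, cell))
    return triples


# Hypothetical two-row, two-column spreadsheet.
header = ["name", "height"]
rows = [
    ["John Harrington", "5'11\""],
    ["Tara Raboomdeay", "5'4\""],
]

triplestore = spreadsheet_to_triples(header, rows)
for triple in triplestore:
    print(triple)

# The number of triples equals the number of cells.
print(len(triplestore) == len(header) * len(rows))
```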

If spreadsheets and relational databases are equivalent to triplestores, then is there any special advantage to creating triplestores? Yes. A triple is a stand-alone unit of meaning. It does not rely on the software environment (e.g., Excel spreadsheet or SQL database engine) to convey its meaning. Hence, triples can be merged without providing any additional structure. Every triple on the planet could be concatenated to create the ultimate superduper triplestore, from which all of the individual triples pertaining to any particular unique identifier could be collected. This is something that could not be done with spreadsheets and database engines. Enormous triplestores can serve as native databases or as a large relational table, or as pre-indexed tables. Regardless, the final products have all the functionality of any popular database engine [2].

Section 4.4. Namespaces

It is once again the vexing problem of identity within variety; without a solution to this disturbing problem there can be no system, no classification.

Roman Jakobson

A namespace is the metadata realm in which a metadata tag applies. The purpose of a namespace is to distinguish metadata tags that have the same name, but different meaning. For example, within a single XML file, the metadata term “date” may be used to signify a calendar date, or the fruit, or the social engagement. To avoid confusion, the metadata term is given a prefix that is associated with a Web document that defines the term within an assigned Web location. [Glossary Namespace]

For example, an XML page might contain three date-related values, and their metadata descriptors:

<calendar:date>June 16, 1904</calendar:date>
<agriculture:date>Thoory</agriculture:date>
<social:date>Pyramus and Thisbe</social:date>

At the top of the XML document you would expect to find declarations for the namespaces used in the XML page. Formal XML namespace declarations have the syntax:

xmlns:prefix="URI"

In the fictitious example used in this section, the namespace declarations might appear in the “root” tag at the top of the XML page, with each declaration binding one of the three prefixes (calendar, agriculture, and social) to a fake web address.

The namespace URIs are the web locations that define the meanings of the tags that reside within their namespace.
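To see how a parser treats namespace declarations, consider the following sketch. The three web addresses are placeholders that we made up for this illustration (they use the reserved example.org domain and define nothing), and the function name expanded_names is our own. The script shows that the standard Python XML parser replaces each prefix with its full namespace URI, so that the three “date” tags become three distinct names.

```python
import xml.etree.ElementTree as ET

# Root tag with three namespace declarations; the URIs are made-up placeholders.
XML_TEXT = """<root xmlns:calendar="http://www.example.org/calendar/"
      xmlns:agriculture="http://www.example.org/agriculture/"
      xmlns:social="http://www.example.org/social/">
  <calendar:date>June 16, 1904</calendar:date>
  <agriculture:date>Thoory</agriculture:date>
  <social:date>Pyramus and Thisbe</social:date>
</root>"""


def expanded_names(xml_text):
    """Return (namespace URI, local tag name, value) for each child of the root."""
    root = ET.fromstring(xml_text)
    results = []
    for element in root:
        # ElementTree stores a prefixed tag as "{namespace URI}local-name".
        uri, local_name = element.tag[1:].split("}", 1)
        results.append((uri, local_name, element.text))
    return results


for uri, local_name, value in expanded_names(XML_TEXT):
    print(uri, local_name, value, sep="  ")
```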

The relevance of namespaces to Big Data resources relates to the heterogeneity of information contained in or linked to a resource. Every description of a value must be provided a unique namespace. With namespaces, a single data object residing in a Big Data resource can be associated with assertions (i.e., object-metadata-data triples) that include descriptors of the same name, without losing the intended sense of the assertions. Furthermore, triples held in different Big Data resources can be merged, with their proper meanings preserved.

Here is an example wherein two resources are merged, with their data arranged as assertion triples.

Big Data resource 1
29847575938125   calendar:date   February 4, 1986
83654560466294   calendar:date   June 16, 1904
Big Data resource 2
57839109275632   social:date    Jack and Jill
83654560466294   social:date    Pyramus and Thisbe
Merged Big Data Resource 1 + 2
29847575938125   calendar:date   February 4, 1986
57839109275632   social:date    Jack and Jill
83654560466294   social:date    Pyramus and Thisbe
83654560466294   calendar:date   June 16, 1904

There you have it. The object identified as 83654560466294 is associated with a “date” metadata tag in both resources. When the resources are merged, the unambiguous meaning of the metadata tag is conveyed through the appended namespaces (i.e., social: and calendar:).
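The merge shown above requires no special software. As a rough sketch, the following Python script holds each resource as a list of triples, merges the lists, and collects every assertion made about one identifier. The function names are our own.

```python
RESOURCE_1 = [
    ("29847575938125", "calendar:date", "February 4, 1986"),
    ("83654560466294", "calendar:date", "June 16, 1904"),
]
RESOURCE_2 = [
    ("57839109275632", "social:date", "Jack and Jill"),
    ("83654560466294", "social:date", "Pyramus and Thisbe"),
]


def merge_resources(*resources):
    """Pool the triples of every resource, dropping exact duplicates."""
    merged = set()
    for resource in resources:
        merged.update(resource)
    # Sorting by identifier groups the assertions made about each object.
    return sorted(merged)


def triples_about(triplestore, identifier):
    """Collect the (descriptor, value) pairs asserted for one identifier."""
    return [(descriptor, value)
            for subject, descriptor, value in triplestore
            if subject == identifier]


merged = merge_resources(RESOURCE_1, RESOURCE_2)
for triple in merged:
    print(*triple, sep="   ")

print(triples_about(merged, "83654560466294"))
```

Because each descriptor carries its namespace prefix, the two “date” assertions about object 83654560466294 sit side by side in the merged list without ambiguity.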

Section 4.5. Case Study: A Syntax for Triples

I really do not know that anything has ever been more exciting than diagramming sentences.

Gertrude Stein

If you want to represent data as triples, you will need to use a standard grammar and syntax. RDF (Resource Description Framework) is a dialect of XML designed to convey triples. Providing detailed instruction in RDF syntax, or its dialects, lies far outside the scope of this book. However, every Big Data manager must be aware of those features of RDF that enhance the value of Big Data resources. These would include:

  1.  The ability to express any triple in RDF (i.e., the ability to make RDF statements).
  2.  The ability to assign the subject of an RDF statement to a unique, identified, and defined class of objects (i.e., the ability to assign the subject of a triple to a class).

RDF is a formal syntax for triples. The subjects of triples can be assigned to classes of objects defined in RDF Schemas and linked from documents composed of RDF triples. RDF Schemas will be described in detail in Section 5.9.

When data objects are assigned to classes, the data analysts can discover new relationships among the objects that fall into a class, and can also determine relationships among different related classes (i.e., ancestor classes and descendant classes, also known as superclasses and subclasses). RDF triples plus RDF Schemas provide a semantic structure that supports introspection and reflection. [Glossary Child class, Subclass, RDF Schema, RDFS, Introspection, Reflection]

  3.  The ability for all data developers to use the same publicly available RDF Schemas and namespace documents with which to describe their data, thus supporting data integration over multiple Big Data resources.

This last feature allows us to turn the Web into a worldwide Big Data resource composed of RDF documents.

We will briefly examine each of these three features in RDF. First, consider the following triple:

pubmed:8718907        creator        Bill Moore

Every triple consists of an identifier (the subject of the triple), followed by metadata, followed by a value. In RDF syntax the triple is flanked by metadata indicating the beginning and end of the triple. This is the <rdf:Description> tag and its end-tag </rdf:Description>. The identifier is listed as an attribute within the <rdf:Description> tag, and is described with the rdf:about attribute, indicating the subject of the triple. There follows a metadata descriptor, in this case <creator>, enclosing the value, “Bill Moore.”

<rdf:Description rdf:about="urn:pubmed:8718907">
 <creator>Bill Moore</creator>
</rdf:Description>

The RDF triple tells us that Bill Moore wrote the manuscript identified with the PubMed number 8718907. The PubMed number is the National Library of Medicine's unique identifier assigned to a specific journal article. We could express the title of the article in another triple.

pubmed:8718907, title, "A prototype Internet autopsy database. 1625 consecutive fetal and neonatal autopsy facesheets spanning 20 years."

In RDF, the same triple is expressed as:

<rdf:Description rdf:about="urn:pubmed:8718907">
 <title>A prototype Internet autopsy database. 1625 consecutive fetal and neonatal autopsy facesheets spanning 20 years</title>
</rdf:Description>

RDF permits us to nest triples if they apply to the same unique object.

<rdf:Description rdf:about="urn:pubmed:8718907">
 <creator>Bill Moore</creator>
 <title>A prototype Internet autopsy database. 1625 consecutive fetal and neonatal autopsy facesheets spanning 20 years</title>
</rdf:Description>

Here we see that the PubMed manuscript identified as 8718907 was written by Bill Moore (the first triple) and is titled “A prototype Internet autopsy database. 1625 consecutive fetal and neonatal autopsy facesheets spanning 20 years” (a second triple).

What do we mean by the metadata tag “title”? How can we be sure that the metadata term “title” refers to the name of a document and does not refer to an honorific (e.g., The Count of Monte Cristo or the Duke of Earl)? We append a namespace to the metadata. Namespaces were described in Section 4.4.

<rdf:Description rdf:about="urn:pubmed:8718907">
 <dc:creator>Bill Moore</dc:creator>
 <dc:title>A prototype Internet autopsy database. 1625 consecutive fetal and neonatal autopsy facesheets spanning 20 years</dc:title>
</rdf:Description>
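A fragment such as this one can be pulled back apart into its triples with a short script. The following sketch wraps the fragment in an rdf:RDF root tag that declares the two prefixes; the root tag and the two namespace URIs are supplied by us (they are the addresses conventionally used for the RDF syntax and the Dublin Core elements) and do not appear in the fragment itself. A general-purpose XML parser is used here; a dedicated RDF library would handle many constructions that this sketch ignores.

```python
import xml.etree.ElementTree as ET

# Namespace URIs supplied for this sketch; conventional, but our assumption.
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"

RDF_TEXT = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="urn:pubmed:8718907">
    <dc:creator>Bill Moore</dc:creator>
    <dc:title>A prototype Internet autopsy database. 1625 consecutive fetal and neonatal autopsy facesheets spanning 20 years</dc:title>
  </rdf:Description>
</rdf:RDF>"""


def rdf_to_triples(rdf_text):
    """Return one (subject, descriptor, value) triple per nested property tag."""
    root = ET.fromstring(rdf_text)
    triples = []
    for description in root.iter("{%s}Description" % RDF_NS):
        # The rdf:about attribute holds the identifier of the subject.
        subject = description.get("{%s}about" % RDF_NS)
        for prop in description:
            # prop.tag is the descriptor, expanded to "{namespace URI}name".
            triples.append((subject, prop.tag, (prop.text or "").strip()))
    return triples


for triple in rdf_to_triples(RDF_TEXT):
    print(triple)
```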

In this case, we appended “dc:” to our metadata. By convention, “dc:” refers to the Dublin Core metadata set at: http://dublincore.org/documents/2012/06/14/dces/.

We will be describing the Dublin Core in more detail, in Section 4.6. [Glossary Dublin Core metadata].

RDF was developed as a semantic framework for the Web. The object identifier system for RDF was created to describe Web addresses or unique resources that are available through the Internet. The identification of unique addresses is done through the use of a Uniform Resource Name (URN) [3]. In many cases the object of a triple designed for the Web will be a Web address. In other cases the URN will be an identifier, such as the PubMed reference number in the example above. In this case, we appended the “urn:” prefix to the PubMed reference in the “about” declaration for the object of the triple.

<rdf:Description rdf:about="urn:pubmed:8718907">

Let us create an RDF triple whose subject is an actual Web address.

<rdf:Description rdf:about="http://www.usa.gov/">
 <dc:title>USA.gov: The U.S. Government's Official Web Portal</dc:title>
</rdf:Description>

Here we created a triple wherein the object is uniquely identified by the unique Web address http://www.usa.gov/, and the title of the Web page is “USA.gov: The U.S. Government's Official Web Portal.” The RDF syntax for triples was created for the purpose of identifying information with its URI (Uniform Resource Identifier). The URI is a string of characters that uniquely identifies a Web resource (such as a unique Web address, or some unique location at a Web address, or some unique piece of information that can be ultimately reached through the Worldwide Web). In theory, using URIs as identifiers for triples will guarantee that all triples will be accessible through the so-called “Semantic Web” (i.e., the Web of meaningful assertions) [3]. Using RDF, Big Data resources can design a scaffold for their information that can be understood by humans, parsed by computers, and shared by other Big Data resources. This solution transforms every RDF-compliant Web page into an accessible database whose contents can be searched, extracted, aggregated, and integrated along with all the data contained in every existing Big Data resource.

In practice, the RDF syntax is just one of many available formats for packaging triples, and can be used with identifiers that are not valid URIs (i.e., that do not relate in any way to Web addresses or Web resources). The point to remember is that Big Data resources that employ triples can port their data into RDF syntax, or into any other syntax for triples, as needed. [Glossary Notation 3, Turtle]
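As an illustration of porting triples into another syntax, here is a sketch that writes triples in the one-statement-per-line style of N-Triples, in which the subject and the descriptor are enclosed in angle brackets, the value is a quoted string, and each statement ends with a period. The function names are our own, the Dublin Core namespace URI is our assumption, and the escaping covers only the most common special characters.

```python
# Namespace URI supplied for this sketch; conventional, but our assumption.
DC_NS = "http://purl.org/dc/elements/1.1/"


def escape_literal(value):
    """Escape backslashes, double quotes, and newlines inside a quoted value."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n"))


def to_ntriples(triples):
    """Write each (subject, descriptor, value) triple as one line of text."""
    lines = []
    for subject, descriptor, value in triples:
        lines.append('<%s> <%s> "%s" .' % (subject, descriptor, escape_literal(value)))
    return "\n".join(lines)


triples = [
    ("urn:pubmed:8718907", DC_NS + "creator", "Bill Moore"),
    ("http://www.usa.gov/", DC_NS + "title",
     "USA.gov: The U.S. Government's Official Web Portal"),
]
print(to_ntriples(triples))
```

Because each line is a complete assertion, files written this way can be concatenated freely, which is the merging property of triples described in Section 4.3.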

Section 4.6. Case Study: Dublin Core

For myself, I always write about Dublin, because if I can get to the heart of Dublin I can get to the heart of all the cities of the world. In the particular is contained the universal.

James Joyce

James Joyce believed that Dublin held the meaning of every city in the world. In a similar vein, the Dublin Core metadata descriptors hold the meaning of every document in the world. The principal difference between the two Dublin-centric philosophies is that James Joyce hailed from Dublin, Ireland, while the Dublin Core metadata descriptors hailed from Dublin, Ohio, United States. For it was in Dublin, Ohio, in 1995, that a coterie of interested Internet technologists and librarians met for the purpose of identifying a core set of descriptive data elements that every electronic document should contain.

The specification resulting from this early workshop came to be known as the Dublin Core [4]. The Dublin Core elements include such information as the date that the file was created, the name of the entity that created the file, and a general comment on the contents of the file. The Dublin Core elements aid in indexing and retrieving electronic files, and should be included in every electronic document, including every image file. The Dublin Core metadata specification is found at: http://dublincore.org/documents/dces/

Some of the most useful Dublin Core elements are [5]:

  •   Contributor—the entity that contributes to the document
  •   Coverage—the general area of information covered in the document
  •   Creator—the entity primarily responsible for creating the document
  •   Date—a time associated with an event relevant to the document
  •   Description—description of the document
  •   Format—file format
  •   Identifier—a character string that uniquely and unambiguously identifies the document
  •   Language—the language of the document
  •   Publisher—the entity that makes the resource available
  •   Relation—a pointer to another, related document, typically the identifier of the related document
  •   Rights—the property rights that apply to the document
  •   Source—an identifier linking to another document from which the current document was derived
  •   Subject—the topic of the document
  •   Title—title of the document
  •   Type—genre of the document

An XML syntax for expressing the Dublin Core elements is available [6,7].
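For readers who would like to see what such an annotation might look like, here is a rough sketch that builds a small block of Dublin Core elements with Python's standard XML library. The wrapper tag “metadata” and the function name are hypothetical, the namespace URI is our assumption, and the output should not be taken as the official syntax described in the references.

```python
import xml.etree.ElementTree as ET

# Namespace URI supplied for this sketch; conventional, but our assumption.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def dublin_core_record(elements):
    """Build an XML fragment holding one dc: tag per (element name, value) pair."""
    record = ET.Element("metadata")  # hypothetical wrapper tag
    for name, value in elements.items():
        child = ET.SubElement(record, "{%s}%s" % (DC_NS, name))
        child.text = value
    return ET.tostring(record, encoding="unicode")


xml_out = dublin_core_record({
    "identifier": "urn:pubmed:8718907",
    "creator": "Bill Moore",
    "title": "A prototype Internet autopsy database. 1625 consecutive fetal "
             "and neonatal autopsy facesheets spanning 20 years",
})
print(xml_out)
```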

Glossary

Child class The direct or first generation subclass of a class. Sometimes referred to as the daughter class or, less precisely, as the subclass.

Dublin Core metadata The Dublin Core is a set of metadata elements developed by a group of librarians who met in Dublin, Ohio. It would be very useful if every electronic document were annotated with the Dublin Core elements. The Dublin Core Metadata is discussed in detail in Chapter 4. The syntax for including the elements is found at: http://dublincore.org/documents/dces/

Integration Occurs when information is gathered from multiple data sets, relating diverse data extracted from different data sources. Integration can broadly be categorized as pre-computed or computed on-the-fly. Pre-computed integration includes such efforts as absorbing new databases into a Big Data resource or merging legacy data with current data. On-the-fly integration involves merging data objects at the moment when the individual objects are parsed. This might be done during a query that traverses multiple databases or multiple networks. On-the-fly data integration can only work with data objects that support introspection. The two closely related topics of integration and interoperability are often confused with one another. An easy way to remember the difference is to note that integration refers to data; interoperability refers to software.

Introspection Well-designed Big Data resources support introspection, a method whereby data objects within the resource can be interrogated to yield their properties, values, and class membership. Through introspection the relationships among the data objects in the Big Data resource can be examined and the structure of the resource can be determined. Introspection is the method by which a data user can find everything there is to know about a Big Data resource without downloading the complete resource.

Metadata Data that describes data. For example, in XML, a data quantity may be flanked by a beginning and an ending metadata tag describing the included data quantity: <age>48 years</age>. In the example, <age> is the metadata and 48 years is the data.

Namespace A namespace is the metadata realm in which a metadata tag applies. The purpose of a namespace is to distinguish metadata tags that have the same name, but a different meaning. For example, within a single XML file, the metadata term “date” may be used to signify a calendar date, or the fruit, or the social engagement. To avoid confusion the metadata term is given a prefix that is associated with a Web document that defines the term within the document's namespace.

Notation 3 Also called n3. A syntax for expressing assertions as triples (unique subject + metadata + data). Notation 3 expresses the same information as the more formal RDF syntax, but n3 is compact and easy for humans to read. Both n3 and RDF can be parsed and equivalently tokenized (i.e., broken into elements that can be re-organized in a different format, such as a database record).

RDF Schema Resource Description Framework Schema (RDFS). A document containing a list of classes, their definitions, and the names of the parent class(es) for each class (e.g., Class Marsupialia is a subclass of Class Metatheria). In an RDF Schema, the list of classes is typically followed by a list of properties that apply to one or more classes in the Schema. To be useful, RDF Schemas are posted on the Internet, as a Web page, with a unique Web address. Anyone can incorporate the classes and properties of a public RDF Schema into their own RDF documents (public or private) by linking named classes and properties, in their RDF document, to the web address of the RDF Schema where the classes and properties are defined.

RDFS Same as RDF Schema.

Reflection A programming technique wherein a computer program will modify itself, at run-time, based on information it acquires through introspection. For example, a computer program may iterate over a collection of data objects, examining the self-descriptive information for each object in the collection (i.e., object introspection). If the information indicates that the data object belongs to a particular class of objects, the program might call a method appropriate for the class. The program executes in a manner determined by descriptive information obtained during run-time; metaphorically reflecting upon the purpose of its computational task. Because introspection is a property of well-constructed Big Data resources, reflection is an available technique to programmers who deal with Big Data.

Subclass A class in which every member descends from some higher class (i.e., a superclass) within the class hierarchy. Members of a subclass have properties specific to the subclass. As every member of a subclass is also a member of the superclass, the members of a subclass inherit the properties and methods of the ancestral classes. For example, all mammals have mammary glands because mammary glands are a defining property of the mammal class. In addition, all mammals have vertebrae because the class of mammals is a subclass of the class of vertebrates. A subclass is the immediate child class of its parent class.

Turtle Another syntax for expressing triples. From RDF came a simplified syntax for triples, known as Notation 3 or N3 [8]. From N3 came Turtle, thought to fit more closely to RDF. From Turtle came an even more simplified form, known as N-Triples.

References

[1] Forterre P. The two ages of the RNA world, and the transition to the DNA world: a story of viruses and cells. Biochimie. 2005;87:793–803.

[2] Neumann T., Weikum G. xRDF3X: Fast querying, high update rates, and consistency for RDF databases. Proceedings of the VLDB Endowment. 2010;3:256–263.

[3] Berners-Lee T. Linked data - design issues. July 27, 2006. Available at: https://www.w3.org/DesignIssues/LinkedData.html [viewed December 20, 2017].

[4] Kunze J. Encoding Dublin Core Metadata in HTML. Dublin Core Metadata Initiative. Network Working Group Request for Comments 2731. The Internet Society; December 1999. Available at: http://www.ietf.org/rfc/rfc2731.txt [viewed August 1, 2015].

[5] Berman J.J. Principles of big data: preparing, sharing, and analyzing complex information. Waltham, MA: Morgan Kaufmann; 2013.

[6] Dublin Core Metadata Initiative. Available from: http://dublincore.org/ The Dublin Core is a set of basic metadata that describe XML documents. The Dublin Core was developed by a forward-seeing group of library scientists who understood that every XML document needs to include self-describing metadata that will allow the document to be indexed and appropriately retrieved.

[7] Dublin Core Metadata Element Set, Version 1.1: Reference Description. Available from: http://dublincore.org/documents/1999/07/02/dces/ [viewed January 18, 2018].

[8] Primer: Getting into RDF & Semantic Web using N3. Available from: http://www.w3.org/2000/10/swap/Primer.html [viewed September 17, 2015].
