2 Structure for data

Introduction

Organising data (the raw text that goes into a digital system) effectively and efficiently is important for all aspects of publishing. It provides the digital building blocks from which both print and digital products can be produced. Imposing a structure on data allows it to be used flexibly, depending on what you want to do with it. This structure needs to be applied consistently and to be readable by different machines for different purposes, so various systems and protocols have been developed to ensure this takes place. This chapter will not cover all the systems and languages that have developed but provides a brief overview of those that are most central to publishing activity.

Tagging, mark-up and the growth of XML

We have looked at the development of digital production methods. In order for these to take place, more fundamental developments in digital technology were necessary. The pre-press editorial activity in which copyeditors marked up text was one of the areas that started to be undertaken on screen, so developments in working with and managing data on screen were needed.

For documents to be formatted they need to be coded or tagged (marked up). These codes are picked up by the formatting program so the text can be put into the correct format for different sorts of output, whether print or digital products. This marking up needed to be formalised to make it easier for everyone to work to the same procedure, and so tagging schemes developed that would allow for systematic mark-up of text. Various systems developed and they began to merge as it became clear that a more standardised language was required. This led to the development of SGML (standard generalised mark-up language), based on GML (generalised mark-up language), a coding system developed at IBM. SGML is a ‘meta’ language: it is not strictly a coding system itself but a way of writing and defining coding systems.

SGML provides a structure for data: it is about describing what sorts of data there are (e.g. what is a name, what is an address, or what is a heading, what is a paragraph of text). It is not about the appearance of data (i.e. about the way a book page looks). The term structured data refers to these codes embedded in the data that define structure and relationships. The more structure there is, the more the content is enriched as more detail can be extracted.

SGML became an ISO standard and was the predecessor to XML (extensible mark-up language). XML developed in the late 1990s; it was similar to SGML but simpler and stricter to use, less complicated for users, and able to work as a transfer language for the web. XML applies codes to describe what the content is (a heading, for instance) and then, when you want to output that data, a program specifies what the format is to be for the section of content that carries that particular code.
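
As a minimal sketch of what such mark-up looks like (the element names here are invented for illustration, not drawn from any particular publisher’s scheme), a short article might be tagged like this:

  <?xml version="1.0" encoding="UTF-8"?>
  <!-- Hypothetical structural mark-up: nothing here says how a heading
       or a paragraph should look on the page, only what each piece of
       content is. -->
  <article>
    <heading>Structure for data</heading>
    <author>
      <name>A. N. Author</name>
      <affiliation>Example University</affiliation>
    </author>
    <para>Organising data effectively and efficiently is important
    for all aspects of publishing.</para>
  </article>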

So with XML you can create a structured document: in other words, a document that has embedded coding that both describes its structures and defines the relationships between different parts of the data (e.g. a name that might link to an address). Structured files can be created in various ways. The two main ones are:

  • XML files created as part of the workflow, which can be used for simple text layout
  • files created in a package such as InDesign or Quark first and then converted to XML – which is necessary for more complex layouts

The aim is to get to a tagged document that carries information only about structure and not about the output, format or presentation. That way the data is ready in a neutral state, as it were, and can be used and output in all sorts of ways and formats, just as the publisher wants, now and into the future. So what this means for publishing is that the source is the one database, or data warehouse, but the output can be in many formats and products, from print books to websites.

Imposing a format

To impose format the XML file is used with a style sheet: using the appropriate style sheet language (XSL, the extensible stylesheet language) it can be converted into whatever format is wanted. The program you choose depends on the sort of output you want; this could be, for instance, a web page using HTML5, an EPUB file for an ebook or a typesetting program for a printed paperback. In that way XML can be regarded as the basic input, while the programs used to format that content are as varied as the outputs required for it. Data can be extracted and repurposed, archived until needed or transformed into different, new formats as they develop, which may enable further accessibility.
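
To sketch how this works in practice, the fragment below is a simplified XSLT style sheet (XSLT is the transformation part of the XSL family) that would turn the hypothetical <article> mark-up shown earlier into a simple HTML page; a different style sheet applied to the same XML could produce EPUB or typesetting input instead:

  <?xml version="1.0" encoding="UTF-8"?>
  <xsl:stylesheet version="1.0"
                  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <!-- Wrap the whole article in a minimal HTML page -->
    <xsl:template match="/article">
      <html>
        <body>
          <xsl:apply-templates/>
        </body>
      </html>
    </xsl:template>
    <!-- Structure-to-presentation rules; other elements (such as author)
         would be given their own rules in the same way -->
    <xsl:template match="heading">
      <h1><xsl:value-of select="."/></h1>
    </xsl:template>
    <xsl:template match="para">
      <p><xsl:value-of select="."/></p>
    </xsl:template>
  </xsl:stylesheet>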

Where XML is particularly important for publishing is in its ability to enable the quick and easy search and selection of material, not just for finding and reusing material in different ways from the publisher’s point of view, but for the usability and user-friendliness of the end product. If you, as a researcher, are searching for articles across a database of journals, for instance, the XML provides the structure for you to reach the information you want quickly and, most importantly, accurately.

In order to define the structure of a document that you will code using XML, you use a document type definition, or DTD. Depending on the sort of documents you use and the way you want to use them, the DTD specifies certain types of data fields, or elements, of the document or series of documents. Schemas, which are also ways to define structure, are often more detailed.
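
For the hypothetical article mark-up used earlier, a very simple DTD might read as follows (again, the element names are illustrative only); it states that an article must contain one heading, one author and at least one paragraph, in that order:

  <!-- article.dtd: a sketch of a document type definition -->
  <!ELEMENT article (heading, author, para+)>
  <!ELEMENT heading (#PCDATA)>
  <!ELEMENT author (name, affiliation)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT affiliation (#PCDATA)>
  <!ELEMENT para (#PCDATA)>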

XML is not limited to publishing uses. Publishers use it in a reasonably narrow way simply to output – but it is the structure that is behind, for instance, e-commerce systems. For something more complex, such as an e-commerce system, a schema might be used instead, which allows a lot more information to be defined and validated.
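
A fragment of a schema written in the W3C XML Schema language (XSD) gives a flavour of this extra control; unlike a DTD it can constrain data types, so a price must be a decimal number and an ISBN must be thirteen digits (the element names and rules here are invented for illustration):

  <?xml version="1.0" encoding="UTF-8"?>
  <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <!-- A hypothetical product record with typed, validated fields -->
    <xs:element name="product">
      <xs:complexType>
        <xs:sequence>
          <xs:element name="title" type="xs:string"/>
          <xs:element name="price" type="xs:decimal"/>
          <xs:element name="isbn">
            <xs:simpleType>
              <xs:restriction base="xs:string">
                <xs:pattern value="[0-9]{13}"/>
              </xs:restriction>
            </xs:simpleType>
          </xs:element>
        </xs:sequence>
      </xs:complexType>
    </xs:element>
  </xs:schema>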

One of the advantages of XML is that it is easy to use on an ordinary keyboard and it is text based, so it remains accessible to any system in the future (unlike databases built on proprietary or highly customised systems). What you see when you view content with the XML codes is what you have got, and it can be edited reasonably easily. You do not need lots of additional programs in order to read it, so it should be as future-proof as possible.

So XML is at the root of all digital production today and it continues to be developed. One example of current improvements is systems that can auto-check: if a code indicates the start of a heading, for instance, the system knows which closing code needs to be put in place once the heading is complete.

Metadata: data about data

The fact that data is held in a digital format means that digital products can be developed more easily, particularly when using the web environment. When dealing with databases and places where digital documents may be held it is also important to understand the basics of metadata. Metadata is essentially data or information about other data. The technical definition is more complex but at a basic level it can be seen as ‘data about data’, rather like a label about a thing rather than the thing itself; in other words, metadata might describe or define aspects of a document that are not actually part of the data itself. So, for example, directories might be regarded as forms of metadata. If you have a business directory it might arrange data in certain ways, by type of business or location; this information would form part of the metadata. Similarly, in library catalogues the book title is the actual data but the location of the title on the shelves is held within the metadata. XML, as explained above, provides a good way to apply these labels to the data and embed them within the document itself. Information systems that can read XML can then translate that information. This is becoming more important in trying to locate information on the web and in helping computers process information more effectively.
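
As a small illustration, the widely used Dublin Core element set allows this kind of label to be written in XML alongside or inside a document (the values below are invented); the record describes the book rather than reproducing its content:

  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <!-- Data about the book, not the book itself -->
    <dc:title>An Example Handbook of Publishing</dc:title>
    <dc:creator>A. N. Author</dc:creator>
    <dc:identifier>urn:isbn:9780000000000</dc:identifier>
    <dc:date>2015</dc:date>
    <dc:subject>Publishing; Digital production</dc:subject>
  </metadata>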

In publishing the ISBN is a form of metadata. In itself an ISBN is not that useful to a user but it is a piece of data that has other information attached to it, such as the title of a book, which can be useful. In this respect the ISBN works much like a digital object identifier (DOI): each is a character string that does not change; the location of the document and other metadata about it may change but the DOI will stay the same, just as the ISBN will refer to one particular edition of a title only (even if the position of that title varies between databases).

So metadata makes information easier to find and more manageable. When creating databases this means that a user can extract relevant data more accurately. The difference between using metadata and a simple search engine is that the latter focuses on the use of keywords, while more precision can be developed using metadata; so, for instance, rules can be used to connect pieces of information and draw links which may allow the user to pinpoint more accurately the information they require. In the case of publishing this is clearly seen in a large library journals database which can enable a researcher to search across a variety of terms from publication year to type of article. It can also be seen in the information you might get about a book on an internet retail site, as the ISBN will have other information attached to it about the book, from price to content.

Case study: The importance of metadata for discoverability

In the context of ISBNs, the importance of metadata for publishing can be seen particularly well, as metadata is central to discoverability; the more detailed the metadata is about a book, the easier it can be to bring it to a customer’s attention as they carry out a casual search. As a good illustration of this, Nielsen, which manages BookScan in the UK, produced a white paper emphasising the importance of metadata to publishers, with a detailed analysis of the correlation between the level of metadata provided for a book and its sales pattern. This simple correlation obviously does need to be considered in light of wider issues such as market changes, different genres of title, etc., but the overall analysis showed that the more detailed the metadata, the more chance the customer has of finding what they want, and therefore buying it.

The ISBN, as we have seen, is a form of metadata with several other pieces of data related to it: these can cover not just the book’s title but the category of book, cover image, product form, etc. This can be used in an online search for a book by a consumer but also by the systems used by booksellers when selling through bookshops (offline, as Nielsen terms it). One level of data is regarded as essential to meet a basic BIC (Book Industry Communication) standard to ensure an efficient supply chain, and these elements include a cover image. There is also an enhanced level of metadata elements (promoted by BIC), which includes short and long descriptions and the table of contents of the book. By comparing titles that had smaller amounts of metadata (e.g. no cover image) with those with complete records (e.g. with full content information), the statistics showed that in practically all cases a richer data record corresponded to increased sales. This makes sense for discoverability, which is an issue in online retailing, and it is useful to note that good, well-prepared metadata is the basis of it. The full report is worth reading for details, including statistics indicating that a cover image is important for the sales of a fiction title or that short descriptions are more important than long ones for children’s books.

Taxonomies and schemas: organising metadata

Metadata is important in terms of the way data is stored but this does not necessarily mean specific data can be found; the metadata needs to be organised in a way that can be accessed again. More and more individual users are becoming aware of the need to codify and structure information in order to be able to find it again in a digital environment. An example of simple tagging is on photo-storing websites, where users can tag pictures with whatever keywords they think will be useful when searching. Another example is a user on Twitter who may add a hashtag with anything they like, to see if a current thread can draw together a variety of views and comments; they can then follow themes as they emerge and subside again. However, these ‘folksonomies’ are not organised in any particular way. This means:

  • definitions of words can vary
  • use of terms can change
  • tags are not necessarily consistently used, particularly between sites

One only needs to look at some of the Twitter hashtags to see very different things linked randomly together, so diluting the use of that tag. But there are much more structured approaches, with taxonomies or automated indexing systems aiding definition and use of specific terms; these need to be used consistently to organise the finding of information effectively.

For metadata to be effective you need to decide which structural approach to take. The definitions of terms such as taxonomies and thesauri are not necessarily scientific and fixed but there are some loose distinctions. For publishing they can be important in that many large projects such as reference works need to be carefully set up to ensure the metadata is sophisticated enough for the data itself to be used in all sorts of different ways by different types of user. Publishers that work in the area of large reference databases will often have groups of people specialising in developing and updating taxonomies. Commonly used terms are:

  1. Taxonomy: loosely, this can be seen as a structured list that is formed into a hierarchy; broader terms sit at the top of the tree, while the list continues to drill down in more detail (a small illustrative fragment follows this list). Working out a taxonomy for, as an example, a sophisticated database of legal information can be a very involved job. Getting the definitions right for chunks of data, using terms accurately and consistently and ensuring there are no clashes or repeats of terminology are challenging tasks, and the logic of the hierarchy must be clear. There are also links between different parts of the taxonomy which need to be clear, and terms can appear in different places in certain cases. Web designers have also used ‘taxonomy’ as the term for the mapped structure of a website.
  2. Thesaurus: is essentially more sophisticated again, with more interrelated terms and more examples of words, phrases and similar terms that may help a user locate the particular concept or term they are looking for. So it is structured like a taxonomy but has lateral connections and an index underlying it which may show you other things like terms to use, terms not to use, related terms, narrower and broader categories; in this way it covers scope and relationships between terms, so helps you find things and makes links for you, while taxonomy is focused on the classification.
  3. Ontology: is more sophisticated again and important in allowing computers to interact with each other, helping programs crawl around the web in order to find things. Information needs to be marked up in a way that allows it to be found, and an ontology finds related things (like a thesaurus) but with more precisely defined terms for a concept, item or relationship.
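
The fragment below gives the small illustrative taxonomy referred to in the first item of the list above; it is written as nested XML purely to show how broader terms sit above narrower ones (the terms and element names are invented):

  <!-- Hypothetical taxonomy fragment -->
  <taxonomy domain="law">
    <term name="Contract law">
      <term name="Formation of contract">
        <term name="Offer and acceptance"/>
        <term name="Consideration"/>
      </term>
      <term name="Breach of contract"/>
    </term>
    <term name="Intellectual property">
      <term name="Copyright"/>
      <term name="Trade marks"/>
    </term>
  </taxonomy>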

RDF: dealing with different types of data

One development in the last few years has been the widening range of types of media created and held in databases. Early developments in the web were based around the creation of documents, held in silos, which could be interrogated in order to find what was needed; the basic piece of data was still the document, such as an article or other piece of text. However, data is held in other formats and these are growing in quantity as more individuals are able to create them for themselves – video footage, for instance, or visual material, maybe audio files. All these too need a way to be organised and recognised in order for searches to include them effectively and to ensure searches are as rich and deep as they can be. These different types and sources of data need to be linked effectively so that, for instance, a piece of video information about a famous tourist site could be linked to a GPS system to locate it – hence the development of linked data.

The problem to be faced is the wide range of incompatible standards for metadata syntax. With different data sets using different systems, the linking between them becomes much less productive and enriched compared with a system that aims to produce consistency between all sorts of data types. RDF (resource description framework) is a language for representing information about resources on the web. It provides interoperability between applications, which should be able to exchange information much more effectively. This is not just relevant for finding information or publishing it; it is useful across internet activities from e-commerce to collaboration services (like SharePoint). It provides a standard for exchanging metadata and schemas so that integration can happen much more quickly. It looks at the semantics of metadata, not just the structure and syntax, allowing it to make more powerful connections between things.

Everything expressed in RDF (which can be written in XML) means something; any resource can be described: it could be a fact in a document or it could be a piece of visual material, and it has, as it were, an address that makes it readable by a machine. A framework for describing resources allows relationships to be understood, which, in turn, means that other items can be inferred or integrated.

Essentially RDF is made up of three parts describing relationships:

  • the resource itself – usually stated as a uniform resource identifier (URI) – which is the thing, such as an article
  • the properties (or attributes) of it, i.e. an author or a title, which all articles have
  • the property values – i.e. the specific data of those attributes – like the author’s name and the actual title of the article

The idea is based around statements about resources in the form of ‘subject–predicate–object expressions’. So, as an example:

  1. The resource (or subject) is grass
  2. The property (or predicate) is the colour
  3. The value (or object) is green

So where you might have a sentence such as ‘The sky is the colour blue’, RDF provides a structured way of expressing it that a machine can make sense of.
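
A minimal sketch of these statements in RDF’s XML syntax shows the pattern (the example.org addresses and the ex:colour property are invented for illustration):

  <?xml version="1.0" encoding="UTF-8"?>
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
           xmlns:ex="http://example.org/terms/">
    <!-- Subject: grass; predicate: colour; object: green -->
    <rdf:Description rdf:about="http://example.org/things/grass">
      <ex:colour>green</ex:colour>
    </rdf:Description>
    <!-- The same pattern expresses 'the sky is the colour blue' -->
    <rdf:Description rdf:about="http://example.org/things/sky">
      <ex:colour>blue</ex:colour>
    </rdf:Description>
  </rdf:RDF>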

Recognising resources in this way helps link documents in a way that forms the basis of the semantic web (which we will look at in Chapter 3). With these three elements you can start to link different things. For instance, you can link all the different values to the one thing, or you can link all the same properties together and bring up all the related things. By using a controlled vocabulary, a list of terms with assigned meanings, everyone is working to the same definitions, which makes things easier to find. As outlined above, these controlled vocabularies include things like taxonomies. So RDF provides the basis for creating vocabularies (such as the web ontology language, OWL) that can be used to describe things in the world and how they relate to each other.

Topic maps

While RDF is limited to the model that describes one relationship between two things, topic maps can provide a more sophisticated level of interconnection as they are not limited to mapping one relationship at a time. Topic maps are also a way of organising sets of data and the relationships between them, together with any other information about the data. A topic map represents information rather in the way a concept or mind map does, but it does it in a standardised way to make the information findable. It is a form of semantic web technology. This, like RDF, essentially classifies information using a basic three-part model:

  • topics (the things – from documents to concepts)
  • associations (describing the relationship between topics and the roles each topic plays in the relationship)
  • occurrences (which connect a topic to information related to that topic)

Some have referred to them as the GPS of the information universe, in that they facilitate knowledge management by describing knowledge structures and associating information resources with them.
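
Purely as an illustration of this three-part model (the mark-up below is a simplified sketch and not a conforming XTM topic map document), a fragment describing a landmark might look like this:

  <!-- Hypothetical, simplified topic map mark-up -->
  <topicMap>
    <topic id="eiffel-tower">
      <name>Eiffel Tower</name>
      <!-- An occurrence connects the topic to related information -->
      <occurrence type="video" href="http://example.org/media/eiffel.mp4"/>
    </topic>
    <topic id="paris">
      <name>Paris</name>
    </topic>
    <!-- An association describes the relationship and the role each topic plays -->
    <association type="located-in">
      <member role="landmark" topicRef="#eiffel-tower"/>
      <member role="city" topicRef="#paris"/>
    </association>
  </topicMap>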

Other key developments and frameworks for structured data

There are other types of systems which may be used where RDF can be limited. So you may hear about systems like INDECS and ONIX which are used for certain types of data such as rights information. ONIX is the international standard for representing and communicating book industry product information in electronic form. Rich product information can be provided to any part of the supply chain. However, as it also provides the template for the content and structure of a product record, it is useful for forming the basis of many internal information management systems for publishers. The same core data can be used, for instance, by marketing or for a customer-facing helpdesk or by production. This standard therefore already provides a good data source for additional data uses. And as there is less manual intervention at each particular point (instead of, for instance, updating a price across lots of records and different databases across the company) the information is more likely to be consistent and accurate across the company and its suppliers.
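
A heavily simplified sketch in the style of an ONIX 3.0 product record gives a flavour of how identifier and title travel together in one record (the values are invented, and a real message would sit inside an ONIXMessage envelope with many more mandatory elements):

  <Product>
    <RecordReference>example.com.0001</RecordReference>
    <ProductIdentifier>
      <!-- 15 indicates an ISBN-13 in the ONIX code lists -->
      <ProductIDType>15</ProductIDType>
      <IDValue>9780000000000</IDValue>
    </ProductIdentifier>
    <DescriptiveDetail>
      <TitleDetail>
        <TitleType>01</TitleType>
        <TitleElement>
          <TitleElementLevel>01</TitleElementLevel>
          <TitleText>An Example Handbook of Publishing</TitleText>
        </TitleElement>
      </TitleDetail>
    </DescriptiveDetail>
  </Product>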

INDECS is a system for enabling semantic interoperability, so that one computer can understand the way another computer might use terms in relation to pieces of intellectual property. It is important for building e-commerce systems around content of this sort, so that deals carried out around the sale and use of intellectual property can be matched between computers. ACAP (automated content access protocol) is a global protocol describing copyright and conditions of use which has been set up by the publishing community to provide a standard framework that can be used across the industry to manage copyright and permissions. It is an important initiative for establishing that copyright ownership cannot be forgotten within an internet environment.

The importance of frameworks like these is that they focus on interoperability from machine to machine in such a way that the sort of business model a publisher uses, the type of business they carry out or the varied legal protocols within which business is conducted can all be accommodated.

Managing rights and digital rights management

Managing rights is a particularly knotty problem in the electronic environment. DOIs (a form of metadata, as mentioned above) are critical to ownership of content. Owners can describe the intellectual property they own and link rights and controls to it. Digital rights management, or DRM, describes the access controls that can be built into or around the use of an electronic product. This term is used somewhat loosely and can mean anything from encryption (e.g. in the days of CD-ROM, when you might need a user key to gain access) to the system managing rights across a variety of data in terms of controlling access, tracking usage and collecting revenues. The technology can be cumbersome and can annoy customers. Nor does it necessarily offer much protection for the owner as DRM can be reasonably easily broken by someone determined to do so. However, in the wider context it can allow for control of where and how content can be used, and we will explore it in more detail in Chapter 10.

Conclusion: continuing advances

The advances in frameworks like these meant that publishers could develop digital businesses and in some areas were able to move quickly. For instance, in the journals area the use of DOIs to identify a document or article has particular relevance in allowing publishers to make content, particularly chunks of intellectual property, findable. Industry-wide initiatives such as CrossRef are additionally important for providing the links from the references in a journal article to the cited article. As we will see in Chapter 7, the journals industry has been one of the fastest to migrate into the digital environment and initiatives like these have helped that process, something which has been much more cumbersome to manage with regard to books.

One final development that is currently of interest in relation to structured documents is natural language processing (NLP). NLP is concerned with the interaction between computers and human (natural) language, and with ways for computers to make sense of that language. For these purposes it provides an important way to code up data automatically. Archives may not always be tagged; in order to tag them you can use natural language processing, which uses various approaches, from artificial intelligence to cross-referencing with other databases, as well as algorithms and pattern recognition, in order to understand documents and code them appropriately. As the system works it can be trained and learns more, so it builds up a vocabulary and becomes more accurate as it goes. Developments like these become very important when working within web-based environments, which we will consider in Chapter 3.
