Chapter 1. Introduction

“Our agreement or disagreement is at times based on a misunderstanding.”

Mokokoma Mokhonoana

In the era of the Big Data and Artificial Intelligence frenzy, data are considered gold mines, waiting for organizations and businesses to find and extract their gold. Whether you call this Data Science, Data Analytics, Business Intelligence, or something else, you can’t deny that data-related investments have increased significantly and that the demand for data wizards and witches has skyrocketed.

Do these data professionals manage to find gold? Well, not always. Sometimes, the large ocean of data that an organization claims to have proves to be a small pond. Other times, the data are there but they contain no gold, or at least not the kind of gold that the organization can use. Often it is also the case that both data and gold are there, but the infrastructure or technology needed for the gold’s extraction is not yet available or mature enough. But it can also be that the data wizards have all they could wish for (an abundance of the right data, gold to be found, and state-of-the-art technology) and still fail. The reason? Bad data semantics.

Data and (bad) semantics

Semantics is the study of meaning. By creating a common understanding of the meaning of things, semantics helps people understand each other despite different experiences or points of view. Common meaning, in turn, helps computer systems interpret more accurately what people mean, as well as interface more efficiently and productively with other disparate computer systems.

In that sense, Semantic Data Modeling can be defined as the development of descriptions and representations of data in such a way that the latter’s meaning is explicit, accurate, and commonly understood by both humans and systems. This kind of modeling applies to a wide range of data artifacts, including knowledge organization systems 1 (such as metadata schemas, controlled vocabularies, taxonomies, ontologies and knowledge graphs), as well as conceptual models for database design, like the Entity-Relationship Model 2 for relational databases or the property graph model for graph databases 3.
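
To make the distinction between implicit and explicit data semantics a bit more tangible, here is a minimal Python sketch. The record, its field names, and the accompanying vocabulary are entirely hypothetical; the point is simply that the second representation spells out what each field and each coded value means, so that both humans and programs can interpret the data without guessing.

    # A record whose meaning is implicit: what does "status" refer to,
    # and is "NL" a country, a language, or something else entirely?
    raw_record = {"id": 4211, "status": "A", "loc": "NL"}

    # The same record accompanied by an explicit (hypothetical) semantic description:
    # every field gets a human-readable definition, and opaque codes are defined too.
    field_semantics = {
        "id": {"definition": "Unique identifier of a job vacancy in the recruiting system"},
        "status": {
            "definition": "Publication status of the vacancy",
            "values": {"A": "Active and visible to job seekers",
                       "C": "Closed, no longer accepting applications"},
        },
        "loc": {
            "definition": "Country where the advertised position is based",
            "values": {"NL": "Netherlands"},
        },
    }

    def describe(record: dict, semantics: dict) -> None:
        """Print the record with its meaning made explicit."""
        for field, value in record.items():
            meta = semantics[field]
            meaning = meta.get("values", {}).get(value, value)
            print(f"{field} = {value!r}: {meta['definition']} -> {meaning}")

    describe(raw_record, field_semantics)

A knowledge organization system does essentially the same job at a much larger scale, using shared identifiers and formal relations instead of ad hoc dictionaries.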

The rising importance of semantics in the data world is evident not only from Gartner’s recent inclusion of Knowledge Graphs in its 2018 hype cycle, but also from the investments that large organizations in many different domains have started making in semantic technologies. Apart from Google and its famous Knowledge Graph 4, Amazon 5, LinkedIn 6, Thomson Reuters 7, BBC 8 and other prominent companies are currently developing and using semantic data models within their products and services.

In particular, semantic models are increasingly seen as a backbone for Machine Learning, Deep Learning, and AI business use cases, as well as an asset for natural language understanding as utilized in bots, virtual assistants, customer care, and other cognitive applications 9 10. Moreover, there has been a growing appreciation of such models as semantic layers that bring context to data and help integrate siloed information resources 11.

No matter what kind of semantic data models you are used to building, you are doing it well only if your models convey, in a clear, accurate, and commonly understood way, those aspects of the data’s meaning that are important for their effective interpretation and usage. If that’s not the case, there is a substantial risk that your models will not be used or, even worse, will be misused, with undesired consequences.

To see what bad semantic modeling looks like, let’s take a real-world example, namely the ESCO Classification 12. ESCO is a multilingual ontology that defines and interrelates concepts about skills, competences, qualifications and occupations, for the EU labour market domain. It’s the result of a 6-year project led by the European Commission, and its main goal is “to provide a common reference terminology for the labour market and help bridge the communication gap between the world of work and the world of education and training” 13. In data science terms, this means that ESCO’s ambition is to be used to semantically analyze labour market data (e.g., CVs and job vacancies) and derive relevant analytics that could provide useful insights to job seekers, employers, governments and policy makers.

Now, ESCO provides several semantic “goodies” to achieve this ambition. For example, it identifies and groups together all terms that may refer to the same occupation, skill or qualification entity; this is a very useful piece of knowledge, as it can be used to identify job vacancies for the same occupation, even if the latter is expressed in many different ways. Equally (if not more) useful is the knowledge the model provides about the skills that are most relevant to a given occupation. Such knowledge can be used, for example, by education providers to identify gaps between the demand and supply of particular skills in the market, and update their curricula accordingly. Unfortunately, the modelers of this piece of knowledge in ESCO have fallen into one of the many semantic modeling pitfalls I describe in this book, namely not documenting vagueness and not warning the model’s users about it.
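
To make these two “goodies” more concrete, here is a minimal sketch of how such knowledge might look in RDF, built with Python’s rdflib library. The namespace, the entity identifiers, and the hasEssentialSkill property are hypothetical stand-ins rather than ESCO’s actual URIs and property names, but the structure is the same idea: one concept per occupation, all its alternative terms grouped as labels, and links from the occupation to its relevant skills.

    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import RDF, SKOS

    EX = Namespace("http://example.org/labour-market/")  # hypothetical namespace

    g = Graph()
    g.bind("skos", SKOS)
    g.bind("ex", EX)

    # One concept per occupation, with all the terms that may refer to it
    # grouped together as preferred and alternative labels.
    g.add((EX.dataScientist, RDF.type, SKOS.Concept))
    g.add((EX.dataScientist, SKOS.prefLabel, Literal("data scientist", lang="en")))
    g.add((EX.dataScientist, SKOS.altLabel, Literal("data analytics specialist", lang="en")))
    g.add((EX.dataScientist, SKOS.altLabel, Literal("research data analyst", lang="en")))

    # A skill linked to the occupation via a hypothetical "essential skill" relation.
    g.add((EX.statistics, RDF.type, SKOS.Concept))
    g.add((EX.statistics, SKOS.prefLabel, Literal("statistics", lang="en")))
    g.add((EX.dataScientist, EX.hasEssentialSkill, EX.statistics))

    print(g.serialize(format="turtle"))  # rdflib 6+ returns the Turtle text as a string

With occupations modeled this way, a job vacancy using any of the alternative terms can be resolved to the same occupation concept, and the skills attached to that concept can then be aggregated across vacancies.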

Here is the problem: if you ask 100 different professionals which skills are most important for their profession, you will most likely get 100 different answers. If, even worse, you attempt to distinguish between essential and optional skills, as ESCO does, then you should prepare for a lot of debate and disagreement. The reason is that the notion of essentiality of a skill for a profession is vague, i.e., it typically lacks crisp applicability criteria that clearly separate the essential from the non-essential skills. Without such criteria, the model’s users can interpret the occupation-skill relation any way they like, declare its concrete instances wrong if they do not agree with them, or wrongly apply them in a context where they are not valid. Yet nowhere in the ESCO model or its documentation is there any mention of vagueness as a feature of the knowledge it contains, nor any guidance on how to interpret and use this kind of knowledge.

To be fair to ESCO, similar problems appear in many semantic data models. And to be fair to data modelers, semantic modeling is hard: human language and perception are full of ambiguity, vagueness, imprecision, and other phenomena that make the formal and universally accepted representation of data semantics quite a difficult task.

Moreover, a key challenge in building a good semantic model is to find the right level of semantic expressiveness and clarity that will benefit our users and applications, without excessive development and maintenance costs. In my experience, software developers and data engineers tend to under-specify meaning when building data models, while ontologists, linguists and domain experts tend to over-specify it and debate semantic distinctions about which the model’s users may not care at all. Our job as semantic modelers is to find the right balance and achieve the semantic clarity that our data, domains, applications and users need.

Unfortunately, in an era where discussions among data scientists are monopolized by the latest trends in Machine Learning and Statistical Analysis, the role of semantic modeling in data analytics is often underplayed. “Now you are talking semantics” is a common reaction I get from data practitioners who believe that statistical reasoning and machine learning are all that’s needed to tackle semantics-related data analytics tasks, and that spending too much time understanding and modeling those semantics is a waste of effort and resources. Bad semantic modeling, however, is like “technical debt” in software engineering; sooner or later it claims its dues.

My goal in this book is to help you master the craft of semantic data modeling and increase the usability and value of your data and applications. For that, I will take you on a journey into the world of data semantics, as applied in the real world, and show you what pitfalls to avoid and what dilemmas to break if you want to build and use high-quality and valuable semantic representations of data.

Avoiding pitfalls

A pitfall in semantic modeling is a situation in which we make a decision that is clearly wrong with respect to the data’s semantics, the model’s requirements, or some other aspect of the model’s development process. More importantly, this decision leads to undesired consequences when the model is put to use. The probability and/or the severity of these consequences may vary, but a pitfall is still a mistake that we should strive to avoid whenever possible. ESCO’s non-treatment of vagueness, which I described above, may not seem like a big problem at first, but it’s undeniably a pitfall whose consequences remain to be seen.

Falling into a pitfall is not always a result of the modeler’s incompetence or inexperience. More often than we would like to admit, the academic and industry communities that develop semantic modeling languages, methodologies and tools contribute to the problem in at least three ways:

  1. By using contradictory or even completely wrong terminology when describing and teaching semantic modeling.

  2. By ignoring or dismissing some of the pitfalls as nonexistent or unimportant.

  3. By falling themselves into these pitfalls and producing technology, literature and actual models that contain them.

To see how this actually happens, consider the following two excerpts from two different semantic modeling resources:

“…OWL classes are interpreted as sets that contain individuals… The word concept is sometimes used in place of class. Classes are a concrete representation of concepts…”

“A [SKOS] concept can be viewed as an idea or notion; a unit of thought…. the concepts of a thesaurus or classification scheme are modeled as individuals in the SKOS data model …”

The first excerpt is found in a quite popular tutorial about Protege 14, a tool that enables you to build semantic models according to the Web Ontology Language (OWL) 15. The second one is derived from the specification of the Simple Knowledge Organization System (SKOS) 16, a World Wide Web Consortium (W3C) recommendation designed for the representation of thesauri, classification schemes, taxonomies and other types of structured controlled vocabularies.

Based on these definitions, what do you understand a concept in a semantic model to be? Is it a set of things, as the Protege tutorial suggests, or some unit of thought, as SKOS claims? And what should you do if you need to model a concept in OWL that is not really a set of things? Would you still have to make it a class? The answer, as I demonstrate in the rest of the book, is that the SKOS definition is more accurate and useful, and that the “concept = class” claim of the OWL tutorial is at best misleading, causing more than one type of semantic modeling error.
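
The contrast between the two readings is easy to see in code. The following rdflib sketch (using a hypothetical example namespace) models the notion “data scientist” twice: once as an OWL class, i.e., a set whose members are individual people, and once as a SKOS concept, i.e., a unit of thought that can be labeled and defined but has no members.

    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import OWL, RDF, RDFS, SKOS

    EX = Namespace("http://example.org/")  # hypothetical namespace
    g = Graph()

    # Reading 1: "Data Scientist" as an OWL class, i.e., a set of individuals.
    g.add((EX.DataScientist, RDF.type, OWL.Class))
    g.add((EX.DataScientist, RDFS.label, Literal("Data Scientist", lang="en")))
    g.add((EX.alice, RDF.type, EX.DataScientist))  # Alice is a member of the set

    # Reading 2: "Data Scientist" as a SKOS concept, i.e., a unit of thought.
    g.add((EX.dataScientistConcept, RDF.type, SKOS.Concept))
    g.add((EX.dataScientistConcept, SKOS.prefLabel, Literal("Data Scientist", lang="en")))
    g.add((EX.dataScientistConcept, SKOS.definition,
           Literal("The notion of a professional who analyzes and interprets data", lang="en")))
    # A SKOS concept has no instances; it is related to other concepts (e.g., via
    # skos:broader) and annotated with labels and definitions instead.

Notice that the second reading never forces you to decide what the concept’s instances would be, which is exactly why it also fits notions that are not naturally sets of things.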

In any case, my goal in this book is not to assign blame to people and communities for bad semantic modeling advice, but rather to help you navigate this not-so-smooth landscape and show you how to recognize and avoid pitfalls like the above. The same applies to dilemmas.

Breaking dilemmas

Contrary to a pitfall, a dilemma is a situation in which we have to choose between different courses of action, each of which comes with its own pros and cons, and for which there is no clear decision process or criteria to apply.

As an example, consider the options that the developers of ESCO have for treating the vague “essential” relation between occupations and skills. One option is to flag the relation as “vague” so that the users know what to expect, though that of course won’t reduce the potential disagreements that may occur. Another option is to try to create different versions of the relation that are applicable to different contexts (e.g., different countries, industries, or user groups) so that the level of potential disagreement is lower. Doing this, however, is costlier and more difficult. So, what would you advise ESCO to do?

To tackle a modeling dilemma you need to treat it as a decision-making problem, i.e., you need to formulate the alternative options and find a way to evaluate them from a feasibility, cost-benefit, strategic or other perspective that makes sense for your use case. Therefore, for each dilemma I cover in this book, I won’t give you a definite, “expert” solution that is almost always the correct one, simply because there’s no such thing. Instead, I will show you how to frame each dilemma as a decision-making problem and what information you should look for in order to reach a decision.
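
As a concrete (and deliberately simplified) illustration of this framing, the sketch below scores ESCO’s two options for the vague “essential” relation against a few evaluation criteria. The options come from the discussion above; the criteria, weights, and scores are entirely made up and would need to be replaced by the ones that matter in your own context.

    # Hypothetical decision matrix for the ESCO dilemma discussed above.
    # Scores (1 = poor, 5 = excellent) and weights are illustrative only.
    options = {
        "Flag the relation as vague and document it": {
            "development cost": 5,           # cheap: mostly a documentation effort
            "reduction of disagreement": 2,  # users are warned, but still disagree
            "maintenance effort": 5,
        },
        "Create context-specific versions of the relation": {
            "development cost": 2,           # expensive: one relation per country/industry
            "reduction of disagreement": 4,
            "maintenance effort": 2,
        },
    }

    weights = {"development cost": 0.3, "reduction of disagreement": 0.5, "maintenance effort": 0.2}

    for option, scores in options.items():
        total = sum(weights[criterion] * score for criterion, score in scores.items())
        print(f"{option}: weighted score = {total:.2f}")

The particular numbers are beside the point; what matters is the discipline of making the options, criteria, and trade-offs explicit before committing to a decision.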

Why you should read this book

My firm belief (and wish) is that this book should be read by anyone who works in the data field, in any role and capacity. Not because it is the best book in its field, nor because semantics are more important than other aspects of data science. The reason you, as a data professional, should care about semantics is that, when not done correctly, they can undermine your efforts to derive value from data, no matter how big and sophisticated the data infrastructure and algorithms you and your organization have managed to build.

To make it a bit more specific, you will find this book useful if you can recognize yourself in one or more of the following situations:

  • You know a lot about semantic data modeling, though mostly from an academic and research perspective. You probably have a PhD in the field and excellent knowledge of modeling languages and frameworks, but you never had the chance to apply this knowledge in an industrial setting. You are now in your first industry role as a Taxonomist, Ontologist or other type of Data Modeler, and you have the opportunity to apply your knowledge to real-world problems. You have started realizing, though, that things are very different from what the academic papers and textbooks describe; the methods and techniques you’ve learned are not as applicable or effective as you thought, you face difficult situations for which there is no obvious decision to be made, and, ultimately, the semantic models you develop are misunderstood, misapplied, and provide no added value. This book will help you put your valuable and hard-earned knowledge into practice and improve the quality of your work.

  • You are a Data or Information Architect, tasked with developing a master semantic data model that will solve the problem of semantic heterogeneity between the many disparate data sources, applications and products that your organization has. To this end, you have already applied several out-of-the-box semantic data management solutions that promised seamless integration, but all have failed. This book will help you better understand the (not-so-obvious) dimensions and challenges you need to address in order to achieve the semantic interoperability you want.

  • You are a Data Scientist, an expert in Machine Learning and Statistical Data Analysis, and part of a multidisciplinary team that builds semantic models for AI applications (e.g., knowledge graphs for virtual assistants). You interact daily with ontologists, linguists and other semantics professionals, but you struggle to understand their lingo and to see how your skills are compatible with theirs. This book will introduce you to the basics of semantic data modeling and will help you identify the aspects of it where your expertise can have the biggest impact.

  • You are a Data Scientist, an expert in Machine Learning and Statistical Data Analysis, working with data that have been created and semantically described by other people, teams and organizations. Pretty often you find yourself unsure about what these data really represent and whether they are appropriate for the kind of analysis you want to perform or the solution you want to build. Even worse, you make wrong assumptions about these data’s semantics, ending up with applications that do not work as you expected. This book will show you how semantics affect every part of the data science lifecycle, and it will teach you how to be more critical of the data you work with and anticipate semantics-related problems that may occur.

In other words, this is an interdisciplinary book for different types of data professionals who want to learn how to “talk semantics” in order to work more effectively together and produce value from data.

What you will (not) find in this book

Before we proceed, it’s only fair to warn you that in this book you will not find detailed descriptions and step-by-step guides for specific data modeling languages, methodologies, tools and frameworks. I will make several references to such frameworks throughout the book, but if, for example, you would like to learn in detail about OWL or Entity-Relationship Modeling, there are other excellent textbooks and tutorials you can easily find. If you are completely new to data modeling, I suggest you first read the Basics part, then pick up a tutorial or textbook on the subject, start building models, and revisit this book to check whether you have fallen into any of the pitfalls I mention in the second part or come across any of the dilemmas of the third part.

You should also be aware that this book is not an academic textbook or research monograph where you would find extended philosophical analyses, debates and discussions on semantics and conceptual modeling, or the latest trends in the field. Instead, it’s a practical and pragmatic guide that will help you cut through the complexity of data semantics and develop usable and valuable semantic models in a variety of domains, scenarios and contexts. For that, this book will:

  • Focus on semantic thinking, not just languages or tools: Most textbooks and tutorials on semantic modeling assume that producing good semantic models is primarily a matter of using the right language or tool. This book, instead, teaches the principles and techniques necessary to use whatever modeling language or framework you have available in the right way, in an effort to avoid the Garbage In, Garbage Out effect!

  • Focus on what doesn’t work: Knowing what doesn’t work and why can be a more effective way to improve the quality of a system or process than knowing only what (in theory or in some cases) works. This book applies this principle to the task of semantic data modeling by focusing on a) identifying as many ways as possible in which things can go wrong, b) what the consequences of that would be, and c) what can be done to avoid such situations.

  • Cover non-boolean phenomena: Most semantic modeling methodologies and frameworks assume that all human knowledge can be separated into true and false statements, and they provide little support for tackling “noisy” phenomena like vagueness or uncertainty. The real world, however, is full of such phenomena, and this book will help you not merely handle them but actually use them to your advantage.

  • Focus on decisions in context: Semantic data modeling is challenging, and modelers face many types of dilemmas for which they need to make decisions. Describing successful yet isolated experiments or “success stories” rarely helps break these dilemmas. This book focuses on identifying as many difficult situations as possible and showing you how to break through them in your own context.

  • Include organizational and strategic aspects: A semantic data modeling initiative is rarely a one-off engineering project; instead it’s a continuous effort of fueling an organization with up-to-date and useful semantic knowledge that serves its business and data strategy. As such, it requires considering not only technical but also organizational and strategic aspects that may affect it.

Book outline

This book is arranged into three parts:

  1. In Part I, called The Basics, you will find fundamental concepts, phenomena, and processes related to semantic data modeling, as well as a comprehensive overview of languages, frameworks, methodologies, tools and techniques that are available for this purpose. The goal of this part is to set the tone for the rest of the book and establish a common ground and terminology for the other parts.

  2. In Part II, called The Pitfalls, you will find a set of pitfalls that we often fall into when we develop and apply semantic data models, and which we can easily avoid if we are more rigorous, improve our understanding of semantic phenomena and languages, and follow specific processes and guidelines.

  3. In Part III, called The Dilemmas, you will find a number of dilemmas that we often face when we develop and apply semantic data models, and which we can only solve if we take into consideration the broader technical, business and strategic context in which we operate.

1 https://www.isko.org/cyclo/kos

2 https://en.wikipedia.org/wiki/Entity%E2%80%93relationship_model

3 http://graphdatamodeling.com/Graph%20Data%20Modeling/GraphDataModeling/page/PropertyGraphs.html

4 https://en.wikipedia.org/wiki/Knowledge_Graph

5 https://blog.aboutamazon.com/devices/how-alexa-keeps-getting-smarter

6 https://engineering.linkedin.com/blog/2016/10/building-the-linkedin-knowledge-graph

7 https://www.thomsonreuters.com/en/press-releases/2017/october/thomson-reuters-launches-first-of-its-kind-knowledge-graph-feed.html

8 https://www.bbc.co.uk/ontologies

9 https://dataconomy.com/2017/10/combine-machine-learning-knowledge-graphs/

10 https://hackernoon.com/knowledge-graphs-for-enhanced-machine-reasoning-at-forge-ai-ef1ffa03af3d

11 https://www.forbes.com/sites/cognitiveworld/2018/08/16/holistic-information-the-rise-of-360-semantic-data-hubs/#2db41f1c217a

12 https://ec.europa.eu/esco/portal/home

13 https://ec.europa.eu/esco/portal/howtouse/21da6a9a-02d1-4533-8057-dea0a824a17a

14 http://mowl-power.cs.man.ac.uk/protegeowltutorial/resources/ProtegeOWLTutorialP4_v1_3.pdf

15 https://www.w3.org/OWL/

16 https://www.w3.org/TR/skos-primer/
