3 New Requirements, “Not only SQL” and the Cloud

The strength of Relational Database Management Systems (RDBMSs) lies in the fact that they are established and mature systems with a declarative standard language (SQL) that offers complex operations (like joins). A further plus is that RDBMSs can actively check database constraints, and the relation schemas ensure that inserted values adhere to the predefined attribute domains. RDBMSs also offer extensive transaction support and an established authorization concept for multiple users; furthermore, they have active components (like triggers) that allow an RDBMS to react automatically to data modifications. Nevertheless, the relational data model and RDBMSs might not be ideal candidates for data storage in some applications. In this chapter, we discuss weaknesses of the relational data model and conventional RDBMSs, and give an overview of the new requirements and issues that modern data management faces.

3.1 Weaknesses of the Relational Data Model

Undoubtedly, the relational model has its merits. It has a sound logical foundation and allows database constraints to be expressed. For some data, however, the underlying table structure is not a natural fit. The following weaknesses can become an issue when using the relational data model.

3.1.1 Inadequate Representation of Data

Translating arbitrary data into the relational table format is not an easy task. Data usually come in more complex formats, like objects in object-oriented programs, XML documents, or even unstructured text documents. “Squeezing” data into rows and columns requires some careful thought and engineering effort and still might lead to unnecessary storage overhead. We will see examples of this phenomenon when talking about ways to store objects or XML documents in an RDBMS (see Section 5.3). Efficiency of data retrieval might also suffer from the relational table representation, as data might have to be recombined from several tables. Another aspect is that, due to normalization (which avoids logical redundancy and hence anomalies of data manipulation; see Section 2.2), data belonging to a single entity might end up in several different tables. Retrieving the data for an entity might hence imply the computation of several table joins based on a set of foreign keys.
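As a brief illustration, assume a customer order with several order lines (the table and column names here are invented for this sketch and are not part of the library example): normalization spreads this single entity over three tables, and reassembling it requires two foreign-key joins.

-- Hypothetical normalized schema: one "order" entity is split over three tables.
CREATE TABLE CustomerOrder (
  OrderID   INTEGER PRIMARY KEY,
  OrderDate DATE
);
CREATE TABLE Product (
  ProductID INTEGER PRIMARY KEY,
  Name      VARCHAR(100)
);
CREATE TABLE OrderLine (
  OrderID   INTEGER REFERENCES CustomerOrder(OrderID),
  ProductID INTEGER REFERENCES Product(ProductID),
  Quantity  INTEGER,
  PRIMARY KEY (OrderID, ProductID)
);

-- Retrieving one complete order recombines the data via two foreign-key joins:
SELECT o.OrderDate, p.Name, l.Quantity
FROM CustomerOrder o
JOIN OrderLine l ON l.OrderID = o.OrderID
JOIN Product p ON p.ProductID = l.ProductID
WHERE o.OrderID = 42;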

Fig. 3.1. Example for semantic overloading

3.1.2 Semantic Overloading

Semantic overloading of the relational data model is another issue: as we have seen in Section 2.1, entities as well as relationships between entities are both mapped to relations in a database schema. In our library example the ER diagram contained the entities Book and Reader as well as the relationship BookLending (see Figure 3.1).

In our final database schema, both the entities and the relationship were represented as relation schemas. That is, we have no means of expressing the fact that relationships between entities are of a different nature than the entities themselves. As we will see later, graph databases offer the concepts of nodes and edges; there it is possible to differentiate between the entities themselves and the links between them.
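A minimal SQL sketch of this effect on the library example follows (the column names are assumed here for illustration only): on the level of the table definitions, nothing marks BookLending as a relationship rather than an entity; at best, the foreign keys hint at its different role.

CREATE TABLE Book (
  BookID INTEGER PRIMARY KEY,
  Title  VARCHAR(200)
);
CREATE TABLE Reader (
  ReaderID INTEGER PRIMARY KEY,
  Name     VARCHAR(100)
);
-- The relationship between Book and Reader is just another relation schema:
CREATE TABLE BookLending (
  BookID     INTEGER REFERENCES Book(BookID),
  ReaderID   INTEGER REFERENCES Reader(ReaderID),
  ReturnDate DATE,
  PRIMARY KEY (BookID, ReaderID)
);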

3.1.3 Weak Support for Recursion

In the relational data model it is difficult to execute recursive queries that repeatedly join a table with itself (and take the union of the intermediate results). The purpose of a recursive query is to compute the transitive closure of some table attributes.

The transitive closure of a relation R with attributes (A1, A2) defined on the same domain is the relation R augmented with all tuples that can be successively deduced by transitivity; that is, if (a,b) and (b,c) are tuples of R, then (a,c) is contained in the transitive closure as well.

Although the relational algebra can be extended by a unary transitive closure operation to express recursion in a query, such a query is still costly to compute in a real RDBMS. The following example illustrates this difficulty.

Assume we have a relation schema Person with attributes ID, Name, and Child, where dom(ID) = dom(Child) = Integer and dom(Name) = String. A recursive query on this schema is to retrieve the IDs of all descendants of a person (not just the children). Our example table is filled as shown in Table 3.1. Executing the recursive query for Alice (that is, retrieving the IDs of all descendants of Alice) requires extensive use of the union and join operators. The expected result is shown in Table 3.2 (where the NULL entries provide the termination condition: we stop the recursion whenever there are no more children).

Table 3.1. Base table for recursive query

Table 3.2. Result table for recursive query

Descendant
ancestorID   descendantID
1            3
3            5
1            5
5            NULL
1            NULL

Expressing this query in SQL is quite complex. We have to define the Descendant table recursively as follows: the Descendant table is repeatedly joined with the Person table whenever a descendantID in the Descendant table occurs as a person’s ID in the Person table; the intermediate results are “unioned” with the Descendant table.

WITH RECURSIVE Descendant(ancestorID, descendantID) AS
(
  SELECT ID, Child FROM Person WHERE Name = 'Alice'
  UNION ALL
  SELECT p.ID, p.Child
  FROM Descendant d, Person p
  WHERE p.ID = d.descendantID
)
SELECT ancestorID, descendantID FROM Descendant

In XML or graph databases, recursion is much more natural and can be implemented with better performance.

3.1.4 Homogeneity

A further problem of the relational data model is that a table is a homogeneous data structure. More precisely, the relational data model requires both horizontal and vertical homogeneity. Horizontal homogeneity is the fact that all tuples have to range over the same set of attributes; that is, all rows have a fixed uniform format defined by the columns. Vertical homogeneity is the fact that values in one column have to come from the same predefined attribute domain; mixing values from different domains in one column is not allowed. What is more, only atomic values are allowed in table cells; set-based or even more complex values are not supported in the conventional relational data model.
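A small sketch (with invented table and column names) makes both kinds of homogeneity concrete: every row must provide exactly the declared columns, every value must stem from the declared domain, and set-valued entries cannot be stored in a single cell.

CREATE TABLE Contact (
  ContactID INTEGER PRIMARY KEY,
  Name      VARCHAR(100),
  City      VARCHAR(100)
);

-- Accepted: the row matches the uniform row format.
INSERT INTO Contact VALUES (1, 'Alice', 'Berlin');

-- Typically rejected: vertical homogeneity is violated
-- (a non-numeric value in the INTEGER column ContactID).
INSERT INTO Contact VALUES ('n/a', 'Bob', 'Hamburg');

-- Not expressible in the conventional relational model:
-- a set of cities in a single (atomic) cell.
-- INSERT INTO Contact VALUES (3, 'Carol', {'Berlin', 'Paris'});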

3.2 Weaknesses of RDBMSs

Apart from the conceptual problems with the relational table format, the established RDBMSs have some weaknesses with respect to features desired by advanced applications:

Infrequent updates: It is usually conjectured that RDBMSs are designed for frequent queries but very infrequent updates; hence, for a frequently changing data set, relational databases might be more inefficient than desirable.

SQL dialects: Although the standardization of SQL has its advantages, not all RDBMSs fully support the standard and some deliberately use their own syntax; all of this complicates portability of SQL code between different RDBMSs.

Restricted data types: RDBMSs can be considered quite inflexible regarding the support of modern data types or formats. For example, although an XML data type exists in SQL, other data formats (like JSON documents) usually cannot be handled natively by an RDBMS.

Declarative access: Another argument against SQL is that queries are usually expressed declaratively based on the (expected) content of the database tables; that is, with SQL we retrieve data by specifying a set of desired attributes or comparing attribute values with other values. However, other data formats might require non-declarative access. For data formats like (tree-like) XML documents or graphs, navigational access is usually better suited; that is, we navigate in the data structure by – for instance – going from parents to children and checking some conditions there. RDBMSs usually support one of the XML query languages (like XPath or XQuery) to query XML documents, but operations for the modification of XML documents (like the XQuery Update Facility) are usually only poorly supported. Native XML databases with a focus on XML data management or document databases for JSON documents might do a better job here.

Short-lived transactions: Transactions are well supported by RDBMSs. The typical RDBMS transaction is, however, very short-lived. The implemented transaction management mechanisms are usually not suited for long-term transactions. Yet the support for long-term transactions is particularly important for data stream processing, where queries are periodically executed on continuous streams of data; for example, to obtain continuous analytics from sensors in order to automatically monitor some processes. Stream data management systems specialize in this field of analytical services.

Lower throughput: When handling massive amounts of data, an RDBMS might not achieve as high a data throughput as one would require. Along the same lines, it has also been conjectured that RDBMSs are poor at distributed management of data in a network of database servers.

Rigid schema: A further hindrance of RDBMSs is that, due to the fixed database schema, schema evolution is poorly supported: changes to the relation schemas (like adding a new column to a table) are difficult and costly as they require a reorganization of the data stored in the database system; a small example follows at the end of this list.

Non-versioned data: Versioning of data (keeping multiple versions of a record with different timestamps) is another feature that is usually disregarded by conventional RDBMSs.
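As announced under “Rigid schema”, the following small sketch shows a typical schema change (reusing the invented Contact table from the sketch in Section 3.1.4). Even this seemingly simple statement affects every stored row and, depending on the RDBMS, may lock the table or trigger a rewrite of the stored data; applications relying on the old row format have to be adapted as well.

-- Schema evolution: adding an attribute to an existing table.
ALTER TABLE Contact ADD COLUMN Email VARCHAR(100);

-- Existing rows now carry the new column, initially filled with NULL
-- (or with a default value if one is declared):
SELECT ContactID, Name, Email FROM Contact;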

3.3 New Data Management Challenges

Over the last decades, data processing has experienced some major shifts. Some of the new challenges for database management are the following:

Complexity: Data are organized in complex structures – for example, a quite novel complex structure is a social network: a network of people interconnected with each other in arbitrary ways in a graph structure. Similarly, the Semantic Web connects and describes resources in a graph structure. In other application domains, like geographic information systems (GIS) or computer aided design (CAD), data are represented as complex structures with lots of interrelated substructures.

Schema independence: Schema independence means that documents can be processed without a given schema definition. In other words, data can be structured in an arbitrary way without complying with any prescribed format. Moreover, data of the same type can be represented differently even in the same data set; for example, address data could either be stored as one single string of characters or be split into different strings for street, house number, city and zip code. This flexibility allows data from different sources to be combined in the same data set even if they are not formatted in the same way. Schema-independent databases are also called schemaless.

Sparseness: If there is an (optional) schema for a data set, it may happen that a lot of data items are not available: many values might just be unknown or nonexistent. If these missing data were represented as null values, they would unnecessarily occupy storage space. It is hence better to simply omit such values in the data model so that they do not show up in the data set; a small relational example of this problem follows at the end of this list.

Self-descriptiveness: As a consequence of schema independence and sparseness, metadata are attached to individual values in order to enable data processing; these metadata describe the use and semantics of the values (like the name part of a property or key-value pair, or element and attribute names in XML). In this way, data can be interpreted and processed directly, and there is no need to acquire metadata (like schema information) from other sources. This is particularly important when data come from unknown or unreachable sources.

Variability: Data are constantly changing: the database system has to handle frequent data modifications in the form of insertions, updates and deletions. In addition, the structure of the processed data might be frequently altered; the database system hence has to either support schema evolution and adapt to the changed structure, or it must be able to handle schemaless data.

Scalability: Data are distributed over a huge number of interconnected servers: for example, in online shopping systems, or when using cloud storage, data are stored in huge server farms. Database users need not know on which server the data they want to retrieve are stored; hence they must be able to interact with the database system without being aware of the data distribution. Moreover, the database system has to support flexible horizontal scaling: servers can leave the network and new servers can enter the network on demand. The database system has to dynamically adapt the distribution of data to the altered network structure.

Volume: Large data volumes (“big data”) have to be processed: Database systems must provide high read and write throughput in order to provide the desired availability and responsiveness of the system or even allow for real-time analytics; data management has to be massively parallel in order to achieve this.
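To make the sparseness problem from the list above concrete, consider a relational product catalog (the schema is invented for this sketch): every optional attribute needs its own column, so for any given row most cells are NULL, whereas a schemaless record could simply omit the attributes that do not apply.

-- One wide table for very different kinds of products:
CREATE TABLE CatalogItem (
  ItemID     INTEGER PRIMARY KEY,
  Name       VARCHAR(100),
  ScreenSize DECIMAL(4,1), -- only meaningful for monitors
  Voltage    INTEGER,      -- only meaningful for electrical appliances
  PageCount  INTEGER,      -- only meaningful for books
  ShoeSize   INTEGER       -- only meaningful for shoes
);

-- A book uses only three of the six columns; the others are stored as NULLs:
INSERT INTO CatalogItem (ItemID, Name, PageCount)
VALUES (7, 'SQL Primer', 312);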

Non-relational databases have been developed as a reaction to these challenges and new requirements. The employed data models and their underlying technology, however, already have quite a history of research and development: non-relational database systems have been around for decades, and database research has continuously developed different theories and systems in the non-relational area. Yet only recently have these systems seen an upswing in the face of the changed data management requirements. In addition, new database products have emerged (sometimes driven by large companies) as flexible solutions for these requirements. Historically, the term “NoSQL” applied to database systems that offered query languages and access methods other than standard SQL. More recently, “NOSQL” has come to mean “Not only SQL”; NOSQL is basically an umbrella term that covers database systems that

have data models other than the conventional relational tables,

support programmatic access to the database system or query languages other than SQL (but might support SQL as well),

can cope with schema evolution or can handle schemaless data,

support data distribution in a server network by design,

do not strictly adhere to the ACID properties (in particular in terms of consistency) of conventional RDBMSs.

A NOSQL database system can have a non-relational data model, support non-standard query languages, support programmatic access, support schema evolution or schema independence, support data distribution, or have a weak consistency concept.

The bottom line is that, for some advanced applications, a specialized database system with a focus on the requirements of these applications might perform better than a conventional RDBMS. NOSQL symbolically stands for the revival, adoption and improvement of data models, query languages and network protocols for novel database applications.

Web resources:

Roberto V. Zicari’s Portal for Big Data, New Data Management Technologies and Data Science: http://www.odbms.org/

ODBMS Industry Watch: http://www.odbms.org/blog/

Stefan Edlich’s Guide to the Non-Relational Universe: http://nosql-databases.org/

Rahul Chaudhary’s NoSQL Weekly newsletter: http://www.nosqlweekly.com/

Cloud databases can in particular cater to new requirements in terms of scalability and volume. Databases-as-a-service are offered by providers of cloud computing infrastructure as a remote storage platform. Pricing models are usually based on both the time of usage and the volume of the data. A service level agreement (SLA) then regulates minimum availability or security conditions of the hosted service. Cloud providers make use of multitenancy to enable elasticity of their services: several groups of clients access the same database platform; they might potentially be separated by having the database run in different virtual machines. While cloud databases claim to reduce administration overhead, using a database-as-a-service still requires management skills (for example, for modeling the data and setting up backups). Security and confidentiality as well as legislation (for example, in terms of data protection and privacy) are major issues, in particular if data are hosted in countries whose legislation differs from that of the country of origin.

3.4 Bibliographic Notes

Weaknesses of conventional RDBMSs have been discussed for some time; see for example the book by Connolly and Begg [CB09] for some of the issues that have been reiterated in this chapter. Undoubtedly, the data management landscape has changed in the last decades towards large-scale and distributed storage (including cloud computing). Against this background, Agrawal, Das and Abbadi [ADA12] describe challenges that arise with data management in cloud applications. With a special focus on database applications, the book by Fowler and Sadalage [FS12] describes the NOSQL paradigm and its basic principles and gives hands-on experience by surveying some NOSQL databases. The book by Redmond and Wilson [RW12] provides a great starting point for getting an overview of currently available NOSQL database systems of all kinds: starting from PostgreSQL as an RDBMS, it moves on to a technical description of six NOSQL databases.
