2 Relational Database Management Systems

The relational data model is based on the concept of storing records of data as rows inside tables. Each row represents an entity of the real world with table columns being attributes (or properties) of interest of these entities. The relational data model has been the predominant data model of database systems for several decades. Relational Database Management Systems (RDBMSs) have been a commercial success since the 1980s. Powerful systems with a rich set of features are available on the market. These systems also fulfill all the basic requirements for database systems as introduced in Section 1.1. In the following sections, we briefly review the main concepts and terminology of the relational data model.

2.1 Relational Data Model

The relational data model is based on some theoretical notions which will briefly be introduced in the following section. Afterwards we present a way to map an ER model to a database schema.

2.1.1 Database and Relation Schemas

A relational database consists of a set of tables. Each table has a predefined name (the relation symbol) and a set of predefined column names (the attribute names). Each attribute Ai ranges over a predefined domain dom(Ai) such that the values in the column (of attribute Ai) can only come from this domain. A table is then filled row-wise with values that represent the state of an entity; that is, the rows are tuples of values that adhere to the predefined attribute domains as shown in Table 2.1. Each table hence corresponds to the mathematical notion of a relation in the sense that the set of tuples in a relation is a subset of the cartesian product of the attribute domains: if r is the set of tuples in a table, then r ⊆ dom(A1) × … × dom(An).

The definition of the attribute names Ai for the relation symbol R is called a relation schema; the set of the relation schemas of all relation symbols in the database is then called a database schema. That is, with the database schema we define which tables will be created in the database; and with each relation schema we define which attributes are stored in each table.

Table 2.1. A relational table


In addition to the mere attribute definitions, each relation schema can have intrarelational constraints, and the database schema can have interrelational constraints. These constraints describe which dependencies exist between the stored data; intrarelational constraints describe dependencies inside a single table, whereas interrelational constraints describe dependencies between different tables. Database constraints can be used to verify whether the data inserted into the tables are semantically correct. Intrarelational constraints can for example be functional dependencies – and in particular key constraints: the key attributes BookID and ReaderID in our ER diagram will be keys in the corresponding database tables and hence serve as unique identifiers for books and readers. Interrelational constraints can for example be inclusion dependencies – and in particular foreign key constraints: when using the ID of a book in another table (for example a table for all the book lendings), we must make sure that the ID is included in the Book table; in other words, readers can only borrow books that are already registered in the Book table. Written more formally, we define a table by assigning to its relation symbol Ri the set of its attributes Aij and the set Σi of its intrarelational constraints.

Formal specification of a relation schema with intrarelational constraints: Ri = ({Ai1, …, Aim}, Σi)

A database schema then consists of a database name D together with a set of relation schemas Ri and a set Σ of interrelational constraints.

Formal specification of a database schema with interrelational constraints: D = ({R1, …, Rn}, Σ)
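In SQL, such a relation schema is declared with a CREATE TABLE statement that fixes the attribute domains (as column types) and the intrarelational constraints. The following is a minimal sketch for the Book relation of our library example; the column types are assumptions chosen for illustration:

CREATE TABLE Book (
  BookID INTEGER      NOT NULL,  -- dom(BookID): integer identifiers
  Title  VARCHAR(200) NOT NULL,  -- dom(Title): character strings
  Year   INTEGER,                -- dom(Year): publication years
  PRIMARY KEY (BookID)           -- intrarelational key constraint from Σi
);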

2.1.2 Mapping ER Models to Schemas

With some simple steps, an ER diagram can be translated into a database schema: Each entity name corresponds to a relation symbol. In our example, the entity Book is mapped to the relation symbol Book. Entity attributes correspond to relation attributes. In our example, the entity attributes BookID and Title will also be attributes in the relation Book; hence they will be in the relation schema of relation Book.

However, the relational data model does not allow multi-valued and composite attributes. In the case of multi-valued attributes, a new relation schema is created for each multi-valued attribute, containing additional foreign keys (to connect the new relation schema to the original relation schema). In our example, the multi-valued attribute Author must be translated into a new relation BookAuthors with attributes BookID and Author and a foreign key constraint BookAuthors.BookID ⊆ Book.BookID.

Composite attributes (like Publisher) should usually be treated as single-valued attributes. We have two options for doing this:

by combining their subattributes into one value;

or by only storing the subattributes (like City and Name) and disregarding the composite attribute (like Publisher) altogether.

Relationships are also translated into a relation schema; for example, we have a BookLending relation in our database with the attribute ReturnDate. In order to be able to map the values from the entities connected by the relationship together, the relation also contains the key attributes of the entities participating in the relationship. That is why the BookLending relation also has a BookID attribute and a ReaderID attribute with foreign key constraints on them. Note that this is the most general case of mapping an arbitrary relationship; in simpler cases (like a 1:1 relationship) we might also simply add the primary key of one entity as a foreign key to the other entity.

What we see in the end is that we can indeed easily map the conceptual model (the ER diagram) for our library example into a relational database schema. The definitions of the database schema and relation schemas are as follows:

Database schema:

Library = ({Book, BookAuthors, Reader, BookLending}, { BookAuthors.BookID ⊆ Book.BookID, BookLending.BookID ⊆ Book.BookID, BookLending.ReaderID ⊆ Reader.ReaderID})

Relation schemas:

Book = ({BookID, Title, Year}, {BookID → Title, Year})

BookAuthors = ({BookID, Author}, {})

Reader = ({ReaderID, Name, Email}, {ReaderID → Name, Email})

BookLending = ({BookID, ReaderID, ReturnDate}, {BookID, ReaderID → ReturnDate})
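Rendered in SQL, these schemas correspond to CREATE TABLE statements; the following sketch complements the Book table declared in Section 2.1.1 (column types are again illustrative assumptions), with the interrelational constraints expressed as foreign keys:

CREATE TABLE Reader (
  ReaderID INTEGER      NOT NULL,
  Name     VARCHAR(100) NOT NULL,
  Email    VARCHAR(100),
  PRIMARY KEY (ReaderID)                    -- ReaderID → Name, Email
);

CREATE TABLE BookAuthors (
  BookID INTEGER      NOT NULL,
  Author VARCHAR(100) NOT NULL,
  FOREIGN KEY (BookID) REFERENCES Book (BookID)  -- BookAuthors.BookID ⊆ Book.BookID
);

CREATE TABLE BookLending (
  BookID     INTEGER NOT NULL,
  ReaderID   INTEGER NOT NULL,
  ReturnDate DATE,
  PRIMARY KEY (BookID, ReaderID),           -- BookID, ReaderID → ReturnDate
  FOREIGN KEY (BookID)   REFERENCES Book (BookID),      -- BookLending.BookID ⊆ Book.BookID
  FOREIGN KEY (ReaderID) REFERENCES Reader (ReaderID)   -- BookLending.ReaderID ⊆ Reader.ReaderID
);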

2.2 Normalization

Some database designs are problematic – for example, if tables contain too many attributes, or tables combine the “wrong” attributes, or tables store data duplicates (that is, when we have logical redundancy). Such problematic database designs entail problems when inserting, deleting or updating values; these problems are known as anomalies. Different types of anomalies exist:

Table 2.2. Unnormalized relational table


Insertion anomaly: we need all attribute values before inserting a tuple (but some may still be unknown)

Deletion anomaly: when deleting a tuple, information is lost that we still need in the database

Update anomaly: when data are stored redundantly, values have to be changed in more than one tuple (or even in more than one table)

Normalization results in a good distribution of the attributes among the tables and hence helps reduce anomalies.

The normalization steps depend on the database constraints (in particular, functional dependencies) that hold in the data tables. For example, to obtain the so-called third normal form (3NF) we have to remove all transitive functional dependencies from the tables. We do not go into detail here but discuss normalization only with our library example. Assume that the information on books, readers, and book lendings were not kept in separate tables, but all the data were stored together in one big table (for the sake of simplicity, we leave out the author information altogether). For two readers and three books we would have a single table as shown in Table 2.2.

What we see is that the more books a reader has currently borrowed, the more often his name appears in the table; and if we only want to change the information belonging to a certain book, we would still have to read the whole row which also contains information on the reader. Due to these considerations, it is commonly agreed that it is advantageous to store data in different tables and link them with foreign key constraints (according to the schema developed in Section 2.1). A normalized version of the Library table (in 3NF) hence looks as shown in Table 2.3.
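The decomposition loses no information: joining the normalized tables along the foreign keys reconstructs the rows of the unnormalized table. A sketch, assuming that Table 2.2 combined the attributes of Book, Reader, and BookLending:

SELECT r.ReaderID, r.Name, r.Email, b.BookID, b.Title, l.ReturnDate
FROM   Reader r
JOIN   BookLending l ON l.ReaderID = r.ReaderID  -- follows BookLending.ReaderID ⊆ Reader.ReaderID
JOIN   Book b        ON b.BookID   = l.BookID;   -- follows BookLending.BookID ⊆ Book.BookID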

2.3 Referential Integrity

We have seen above that foreign key constraints are a special case of interrelational constraints. Referential integrity means that values of the attributes that belong to the foreign key indeed exist as values of the primary key in the referenced table – if there is more than one option to choose a key, it suffices that the referenced attributes are a candidate key. That is, in the referenced table, there must be some tuple to which the foreign key belongs. In our example we stated the requirement that the BookID and the ReaderID in the BookLending table indeed exist in the Book and Reader table, respectively. We can optionally allow that the value of the foreign key is NULL (that is, NULL in all the attributes that the foreign key is composed of).

Table 2.3. Normalized relational table


Referential integrity must be ensured when inserting or updating tuples in the referencing table; but also deleting tuples from the referenced table as well as updating the primary key (or candidate key) in the referenced table affects referential integrity. We will discuss these cases with our library example; an SQL sketch of the corresponding referential actions follows the list:

Insert tuple into referencing table: Whenever we insert a tuple in a table that has foreign key attributes, we must make sure that the values inserted into the foreign key attributes are equal to values contained in the referenced primary key or candidate key.

Update tuple in referencing table: The same applies when values of foreign keys are updated.

Update referenced key: Whenever the referenced primary key (or candidate key) is modified, all referencing foreign keys must also be updated. When the referencing foreign keys are referenced themselves by some other tuple, this referencing tuple must also be updated; this is called a cascading update .

Delete tuple in referenced table: Deleting a tuple can violate referential integrity whenever there are other tuples whose foreign keys reference the primary (or candidate) key of the deleted tuple. We could then either disallow the deletion of a referenced tuple or impose a cascading deletion which also deletes all referencing tuples. Alternatively, foreign keys can be set to a default value (if one is defined) or to NULL (if this is allowed).
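In SQL, these reactions can be declared as referential actions together with the foreign key. The following is a hedged alternative definition of the BookLending table with explicit actions (standard SQL; which actions are supported varies between products):

CREATE TABLE BookLending (
  BookID     INTEGER NOT NULL,
  ReaderID   INTEGER NOT NULL,
  ReturnDate DATE,
  PRIMARY KEY (BookID, ReaderID),
  FOREIGN KEY (BookID) REFERENCES Book (BookID)
    ON UPDATE CASCADE,   -- cascading update: propagate a changed BookID
  FOREIGN KEY (ReaderID) REFERENCES Reader (ReaderID)
    ON DELETE CASCADE    -- cascading deletion: deleting a reader deletes his lendings
);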

2.4 Relational Query Languages

After having designed the relational database in a good way, how can data actually be inserted into the database; and after that, how can information be retrieved from the database? For data retrieval we might want to specify conditions to select relevant tuples, combine values from different tables, or restrict tables to a subset of attributes. The Structured Query Language (SQL) is the standardized language to communicate with RDBMSs; it covers data definition, data manipulation and data querying. For example, you can create a database schema, create a table, insert data into a table, delete data from a table, and query data with the well-known declarative syntax. Some commonly used SQL statements are the following (a small worked example for our library follows the list):

CREATE SCHEMA …

CREATE TABLE …

INSERT INTO … VALUES …

DELETE FROM … WHERE …

UPDATE … SET … WHERE …

SELECT … FROM … WHERE …

SELECT … FROM … GROUP BY …

SELECT … FROM … ORDER BY …

SELECT COUNT(*) FROM …
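As a small worked example for our library (a sketch; the sample values are hypothetical, and syntax details vary slightly between products):

INSERT INTO Book (BookID, Title, Year)
VALUES (1, 'Introduction to Databases', 2015);  -- hypothetical sample book

SELECT Title
FROM   Book
WHERE  Year = 2015
ORDER  BY Title;                                -- all 2015 titles, alphabetically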

Other (more mathematical) ways to express queries on relational tables are the logic-based relational calculus or the operator-based relational algebra. Typical relational algebra operators, together with examples, are the following (SQL counterparts are sketched after the list):

Projection π (restricting a table to some attributes). For example, IDs of readers currently having borrowed a book:

πReaderID(BookLending)

Selection σ (with condition on answer tuples). For example, all books to be returned before 29-10-2016:

σReturnDate<29-10-2016(BookLending)

Renaming ρ (giving a new name to an attribute). For example, rename ReturnDate to DueDate:

ρDueDate←ReturnDate(BookLending)

Union ∪, Difference −, Intersection ∩ are operators that work only between tables with identical relation schemas.

Natural Join ⋈ (combining two tables on attributes with the same name). For example, any information on books currently lent out (the answer table has attributes BookID, Author, Title, ReaderID, and ReturnDate):

Book ⋈ BookLending

Advanced Join Operators (like Theta Join, Equi-Join and Semijoin) are available for more concise notation of queries.
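The basic operators have direct SQL counterparts. A sketch for the examples above (the date literal format is an assumption; SQL requires a quoted or typed date):

-- projection πReaderID(BookLending); DISTINCT enforces set semantics
SELECT DISTINCT ReaderID FROM BookLending;

-- selection σReturnDate<29-10-2016(BookLending)
SELECT * FROM BookLending WHERE ReturnDate < '2016-10-29';

-- renaming ρDueDate←ReturnDate(BookLending)
SELECT BookID, ReaderID, ReturnDate AS DueDate FROM BookLending;

-- natural join Book ⋈ BookLending (joins on the shared attribute BookID)
SELECT * FROM Book NATURAL JOIN BookLending;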

Let us now look at the different ways of expressing one and the same query over the BookLending table in a comparative example.

A natural language version of our query would be

“Find all books that must be returned before 29-10-2016”.

In relational calculus we would express this query as a logical formula with variables x, y, z (that is, placeholders for BookID, ReaderID and ReturnDate):

Q(x, y, z) = {(x, y, z) | BookLending(x, y, z) ∧ z < 29-10-2016}

In relational algebra we would use the selection operator σ :

σReturnDate<29-10-2016(BookLending)

And, last but not least in SQL:

SELECT BookID, ReaderID, ReturnDate FROM BookLending

WHERE ReturnDate < '2016-10-29'

One more thing to note about relational algebra is that an algebra query can be illustrated by a tree: the inner nodes are the algebra operators and the leaf nodes are the relation symbols. Such an operator tree nicely shows the order of evaluation. It is helpful for the optimization of queries (for example, towards smaller intermediate results with fewer tuples and fewer attributes). As an example, consider the query for “names of readers having borrowed a book that must be returned before 29-10-2016”:

πName(σReturnDate<29-10-2016(Reader ⋈ BookLending))

While the algebra tree in Figure 2.1 on the left-hand side is the one corresponding to the query, an equivalent version of the query is shown in Figure 2.1 on the right-hand side. The first tree computes a join on the two tables Reader and BookLending in their entirety, which results in a huge intermediate result. The second tree is better in the sense that from the BookLending table only relevant rows are selected (the ones with a matching return date) before it participates in the join operation; hence the intermediate result has (potentially) fewer rows.
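In algebra notation, the optimized tree corresponds to pushing the selection below the join:

πName(Reader ⋈ σReturnDate<29-10-2016(BookLending))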


Fig. 2.1. An algebra tree (left) and its optimization (right)

2.5 Concurrency Management

For improved performance, several users or processes should be able to work with the database system at the same time – that is, concurrently. Concurrency management aims at making such concurrent data accesses possible. To achieve this, Relational Database Management Systems usually have extensive support for transactions and provide a component that is in charge of controlling the correct execution of concurrent transactions. We briefly review these two important concepts in this section.

2.5.1 Transactions

A transaction can be characterized as a sequence of read and write operations on a database; this sequence of operations must be treated as a whole and cannot be interrupted. A transaction is supposed to lead from one consistent database state to another consistent database state. Concurrency management uses transactions to support the following properties of database systems:

Logical data integrity: Are the written values the correct and final results of a computation?

Physical data integrity & recovery: How can correct values be restored after a system crash?

Multi-user support: How can users concurrently operate on the same database without interfering?

We briefly illustrate these three properties with the example of a bank transfer. Assume we transfer an amount x of money from bank account K1 to another bank account K2. This transfer can be executed in two basic steps where each step consists of a read and a write operation:

Subtract x from K1: Read(K1) and Write(K1 − x)

Add x to K2: Read(K2) and Write(K2 + x)

To achieve logical data integrity, both steps must be fully executed; otherwise one of the following two errors occurs:

amount x is lost (it ends up neither on K1 nor on K2)

amount x is duplicated (it ends up both on K1 and on K2)

Let us now assume that the bank transfer is defined as a transaction consisting of the following sequence of operations:

T: Read(K1) Write(K1 − x) Read(K2) Write(K2 + x)

What happens if the database server crashes after the first write operation and the second write operation cannot be executed anymore? In this case, the transaction T is not finalized: it has not been committed . To achieve physical data integrity and as part of the recovery management, the database system maintains a transaction log . This transaction log records the state of each transaction: it stores which operation of which transaction is currently being executed. After a system restart all operations of uncommitted transactions have to be undone; in our example, the first write operation has to be voided. The transaction log also has to take care of committed transactions: If all results of a transaction have been computed (and the transaction has already been committed) but disk writing is interrupted, after a system restart all affected computations of the transaction have to be redone and then stored to disk.
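At the SQL level, the transfer is wrapped into a transaction so that the DBMS can enforce this all-or-nothing behavior. A minimal sketch, assuming a hypothetical Account table with columns AccountID and Balance and a transfer amount of 100:

BEGIN;  -- start transaction T (START TRANSACTION in some systems)
UPDATE Account SET Balance = Balance - 100 WHERE AccountID = 'K1';  -- Write(K1 − x)
UPDATE Account SET Balance = Balance + 100 WHERE AccountID = 'K2';  -- Write(K2 + x)
COMMIT; -- make both writes durable; after a crash before this point,
        -- the transaction log is used to undo the first write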

Let us now assume that we have multiple users of the database system working on the same data. Assume one transaction (T1) transfers amount x from bank account K1 to bank account K2 while concurrently another transaction (T2) transfers amount y from bank account K1 to bank account K3:

T1: Read1(K1) Write1(K1 − x) Read1(K2) Write1(K2 + x)

T2: Read2(K1) Write2(K1 − y) Read2(K3) Write2(K3 + y)

It is desirable to execute concurrent transactions in parallel by interleaving their operations: the advantage is that a user does not have to wait for the transaction of another user to finish before the database system starts processing his transaction. However, concurrency of transactions obviously opens space for several error cases: if operations are interleaved in a wrong order, the overall results might be incorrect. For example, with the following interleaving, the final result is only K1 − x (the withdrawal of y by T2 is lost):

Read1(K1) Read2(K1) Write2(K1 − y) Write1(K1 − x) …

As a second error case, with the following interleaving the final result is only K1 − y (the withdrawal of x by T1 is lost):

Read1(K1) Read2(K1) Write1(K1 − x) Write2(K1 − y) …

We can observe that T1 and T2 are in conflict on K1 because both try to read from and write to this account; they are however not in conflict on K2 and K3. A correct order of the operations would be the following, where the final result is K1 − x − y:

Read1(K1) Write1(K1 − x) Read2(K1) Write2(K1 − y) …

Most Relational Database Management Systems manage transactions according to the following properties – the so-called ACID properties:

Atomicity: Either execute all operations of a transaction or none of them; this is the “all or nothing” principle.

Consistency: After the transaction, all values in the database are correct; that is, the database has to find a correct order of operations and additionally has to check the database constraints (the data dependencies denoted Σ and Σi in the database schema in Section 2.1).

Isolation: Concurrent transactions of different users do not interfere; again, the database system has to find a correct order of operations to ensure this.

Durability: All transaction results persist in the database even after a system crash; in this case the database system uses the transaction log for recovery.

ACID properties refer to Atomicity, Consistency, Isolation, Durability.

Database systems adhering to the ACID properties are often called ACID-compliant .

2.5.2 Concurrency Control

There are two basic variants of concurrency control: optimistic and pessimistic concurrency control. Optimistic concurrency control assumes that conflicting modifications happen only rarely and can be resolved after they have occurred. An example of an optimistic concurrency control mechanism is snapshot-based multiversion concurrency control. A snapshot is a copy of the current state of the database. Each access (or a sequence of accesses in a transaction) retrieves one such snapshot at the time of access. Hence, different accesses (or different transactions) might work on different copies of the data. For read-only accesses (that is, queries) this is no problem: they can simply operate on their own snapshot. However, as soon as some data are updated, concurrent transactions must be validated after they have finished. Some transactions must then be undone (“rolled back”) and restarted if a conflict happened: for example, if two transactions write the same data item, or if one transaction reads a stale, outdated copy of a data item that has been updated by another transaction in the meantime. In contrast, pessimistic concurrency control avoids any conflicting modifications by maintaining metadata (like locks or timestamps). Problems that can occur with pessimistic concurrency control are deadlocks (between two transactions) or starvation (of a transaction that is deferred until other transactions finalize).

A correct ordering of operations in concurrent transactions by interleaving the operations is also called a schedule. Note that although operations from different transactions can be interleaved, inside each transaction the order of operations is fixed and operations cannot be swapped. An important notion for a good schedule is serializability: an interleaved schedule is serializable if and only if it is equivalent to a serial schedule without interleaving. Equivalence is here defined by the fact that the interleaved schedule reads the same values and writes the same final results as the serial schedule. The database component that is in charge of finding a good schedule is the scheduler (which has already been briefly introduced in Section 1.2). The input of the scheduler is a set of transactions to be executed concurrently; the operations of the transactions, however, might not be fully known beforehand but come in dynamically at runtime. The output of the scheduler is a serializable schedule of the transactions. The two basic options that a scheduler has to find such a good schedule are:

defer some operations until serializability can be achieved, or

abort some transactions if serializability cannot be achieved.

Two common pessimistic scheduler algorithms are Two-Phase-Locking (2PL) and Timestamp Scheduling. We briefly review them here:

Two-Phase Locking (2PL): Lock and unlock operations on data items are added to the schedule. There are two types of locks: read locks and read-write locks. Read locks are non-exclusive in the sense that multiple transactions can read the value of the data item concurrently. Read-write locks are exclusive in the sense that only one transaction has the right to modify the value of the locked data item; no other transaction then has the right to even read this item: other transactions have to wait for the unlock operation on the data item. In the two-phase locking approach, each transaction consists of a locking phase followed by an unlocking phase. That is, all lock operations have to be executed before any unlock operation inside the transaction.

Timestamp Scheduler: With timestamp scheduling, each transaction Ti has a timestamp with its starting time S(Ti). Each data item A has two timestamps: the write stamp W(A) is the timestamp of the last transaction with a write operation on A; the read stamp R(A) is the timestamp of the last transaction with a read operation on A. Upon each read and write operation, the timestamp scheduler checks these timestamps to decide whether the operation can be executed. To be more specific, upon a read operation on item A in transaction Ti, the write stamp has to be prior to the starting time of the transaction (that is, W(A) ≤ S(Ti)); upon a write operation on item A, the read stamp has to be prior to the starting time (that is, R(A) ≤ S(Ti)); otherwise the whole transaction has to be aborted because A has been accessed by another transaction started after Ti in the meantime. Ti then has to be restarted at a later time. For a write operation, there is also the case of dropping a write: if W(A) > S(Ti), a more recent transaction has already written a value for A which must not be overwritten by Ti.
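At the SQL level, pessimistic locking surfaces, for example, as explicit row locks. A sketch reusing the hypothetical Account table from Section 2.5.1 (SELECT … FOR UPDATE is available in PostgreSQL and MySQL, among others):

BEGIN;
-- locking phase: acquire an exclusive lock on the row of account K1;
-- concurrent transactions that want to lock or modify this row must wait
SELECT Balance FROM Account WHERE AccountID = 'K1' FOR UPDATE;
UPDATE Account SET Balance = Balance - 100 WHERE AccountID = 'K1';
COMMIT;  -- unlocking phase: all locks are released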

2.6 Bibliographic Notes

The relational data model has its roots in the seminal article by Codd [Cod70]. Full coverage of all aspects of the relational data model and relational database management systems is given in the textbooks by Jukic [Juk13], Connolly and Begg [CB09], and Garcia-Molina, Ullman and Widom [GMUW08]. For a more theoretical background on the relational model see the book by Date [Dat07]. Weikum and Vossen [WV01] give particular focus to transactions and concurrency control.

The world’s leading vendors of commercial RDBMSs are Oracle (who now also own MySQL), IBM (with the DB2 database), Microsoft (with their Access and SQL Server products), Teradata, and Sybase (an SAP-owned company). Several open-source RDBMSs are also available and widely adopted: for example, PostgreSQL and MariaDB. They are backed by highly experienced developers and active community support. MySQL (now owned by Oracle) is available as an open-source “Community Edition”.

Web resources:

ISO/IEC Standards catalogue (Information Technology – Data management and interchange): http://www.iso.org/iso/home/store/catalogue_tc/catalogue_tc_browse.htm?commid=45342

MariaDB: https://mariadb.org

documentation page: https://mariadb.com/kb/en/mariadb/

MySQL: http://www.mysql.com

documentation page: http://dev.mysql.com/doc/

PostgreSQL: http://www.postgresql.org

documentation page: http://www.postgresql.org/docs/manuals/

The Structured Query Language (SQL) is the standardized language to communicate with RDBMSs [Int11]. However, not all RDBMSs fully adhere to the SQL standard, nor do they implement all features of the standard – which complicates the portability of SQL code between the different RDBMS products. Several graphical front-ends for designing and querying RDBMSs as well as management consoles are offered by RDBMS vendors to facilitate the interaction with these database systems.
