14 Further Database Technologies

Apart from the technologies surveyed in this book, several other database management technologies have been studied for decades. Some of them are optimized for specialized applications. This chapter provides a brief overview of some of these technologies.

14.1 Linked Data and RDF Data Management

An individual data item may not contain sufficient information on its own. More valuable information can be derived when several data items are connected by links and these links are annotated with semantic information about the relationship of the data items. As such, linked data correspond in general to the property-graph model defined in Section 4.3. However, the term linked data often refers more specifically to data that are specified in the Resource Description Framework (RDF), where the actual data items are referenced by uniform resource identifiers (URIs). An RDF data set consists of triples in which two data items (the subject and the object) are linked by their relationship (the predicate). Subjects and predicates are identified by URIs; an object can either be a URI pointing to the actual data or a literal value such as a string. (RDF additionally allows unnamed blank nodes as subjects and objects.)

Web resources:

W3C recommendations:

RDF: http://www.w3.org/TR/#tr_RDF

SPARQL: http://www.w3.org/TR/#tr_SPARQL

AllegroGraph: http://franz.com/agraph/allegrograph/

documentation page: http://franz.com/agraph/support/documentation/

GitHub repository: https://github.com/franzinc

Apache Jena: http://jena.apache.org/

documentation page: http://jena.apache.org/documentation/

GitHub repository: https://github.com/apache/jena

Sesame: http://rdf4j.org/

documentation page: http://rdf4j.org/documentation.docbook

OpenLink Virtuoso: http://virtuoso.openlinksw.com/

documentation page: http://docs.openlinksw.com/virtuoso/

GitHub repository: https://github.com/openlink/virtuoso-opensource

Several data stores specialize in storing RDF triples; they are also called triple stores. A widely used query language for RDF graphs is SPARQL, which has an SQL-like syntax.
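To make the triple structure more concrete, the following minimal Python sketch stores a few triples as plain tuples and evaluates a single triple pattern, which is roughly what a SPARQL query with one triple pattern does. The URIs under http://example.org/ and the helper name match_pattern are assumptions made up for this illustration; a real triple store would use indexes instead of a linear scan.

```python
# Minimal sketch of an RDF-like triple set and single-pattern matching.
# All URIs and the function name are illustrative assumptions.

EX = "http://example.org/"

triples = {
    (EX + "alice", EX + "knows", EX + "bob"),
    (EX + "alice", EX + "name", '"Alice"'),   # literal in object position
    (EX + "bob", EX + "name", '"Bob"'),
}


def match_pattern(data, subject=None, predicate=None, obj=None):
    """Return all triples matching the pattern; None acts as a variable."""
    return sorted(
        (s, p, o)
        for (s, p, o) in data
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    )


# Roughly: SELECT ?s ?o WHERE { ?s <http://example.org/name> ?o }
names = match_pattern(triples, predicate=EX + "name")
```

Leaving a position as None corresponds to using a variable in that position of the pattern.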

14.2 Data Stream Management

A data stream is an infinite sequence of transient values; that is, the data are not stored persistently for later retrieval but instead they are processed “on the fly” as they are produced. A data stream management system (DSMS) processes this data sequence by running so-called continuous queries on the stream; in general, these are queries that are executed interminably. Hence a data stream management system must handle queries that might be running for months or even years.

The continuous queries consume the data in the stream step by step and usually produce an infinite output stream. That is, the result will change over time. Continuous queries can aggregate data from the entire stream (for example, calculating the overall average of all values) or they can look at subsets of the data stream independently (for example, by using a sliding window and evaluating the query only on the data inside the window). Data streams can, for example, be produced by sensor networks or by network traffic monitoring. Applications of data stream management include real-time decision support systems and intrusion detection systems.
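As a small illustration of a continuous query over the entire stream, the following Python sketch emits the overall average after each arriving value. The function name running_average and the sample values are assumptions for this example; a finite list stands in for the conceptually infinite stream.

```python
from typing import Iterable, Iterator


def running_average(stream: Iterable[float]) -> Iterator[float]:
    """Continuous query: emit the overall average after each arriving value."""
    count = 0
    total = 0.0
    for value in stream:
        count += 1
        total += value
        yield total / count   # one output item per input item


# A finite list stands in for the (conceptually infinite) stream.
sensor_values = [10.0, 20.0, 30.0, 40.0]
averages = list(running_average(sensor_values))
```

Because the query keeps only a counter and a running sum, its state stays constant in size no matter how long the stream runs.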

Web resources:

Apache Flink: http://flink.apache.org/

documentation page: https://ci.apache.org/projects/flink/flink-docs-master/

GitHub repository: https://github.com/apache/flink

Apache Samza: http://samza.apache.org/

documentation page: http://samza.apache.org/learn/documentation/

GitHub repository: https://github.com/apache/samza

Apache Storm: http://storm.apache.org/

documentation page: http://storm.apache.org/documentation/Home.html

GitHub repository: https://github.com/apache/storm

A large number of data stream management systems are based on the relational data model, and their continuous query languages are similar to SQL. In this case, one item in the data stream can be represented as a pair 〈timestamp, tuple〉 where the tuple corresponds to a row of an infinite relational table (that is, a table with infinitely many tuples) and all tuples adhere to the same relation schema.

When using sliding window semantics in a continuous query, one can usually specify:

the range of the window: this can be measured either by the size of the window (how many stream items a window must contain) or by a time constraint (for example, the window must contain all items that were produced in the last 30 seconds).

the slide length: this can again be measured either by size (how many data items must pass by before a new window starts) or by time interval (how many seconds must elapse before a new window starts).
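The two parameters can be sketched in Python for the count-based case. The function name sliding_windows and its parameters size (the range) and slide (the slide length) are assumptions for this illustration; time-based windows would compare timestamps instead of counting items.

```python
from collections import deque
from typing import Iterable, Iterator, List


def sliding_windows(stream: Iterable[int], size: int, slide: int) -> Iterator[List[int]]:
    """Yield count-based windows of `size` items, one every `slide` items."""
    window: deque = deque(maxlen=size)   # oldest items fall out automatically
    since_last = 0                       # items seen since the last emitted window
    for item in stream:
        window.append(item)
        since_last += 1
        if len(window) == size and since_last >= slide:
            yield list(window)
            since_last = 0


# Range of 3 items, slide length of 2 items; the query sums each window.
windows = list(sliding_windows(range(1, 8), size=3, slide=2))
window_sums = [sum(w) for w in windows]
```

With a range of 3 and a slide length of 2, consecutive windows overlap in one item; choosing the slide length equal to the range yields disjoint windows.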

From time to time there may be peaks in the data stream where there are too many data items to process in real time. A data stream management system must be prepared for this situation. A simple solution is to drop data items when they cannot be processed immediately; the disadvantage is that the accuracy of the result is usually reduced. If high accuracy is required, the excess items can be persisted to disk and processed later during idle times. For some queries it is also possible to compute only a summary (a synopsis) of several items and then take the synopsis as the input for more complex queries.
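The difference between dropping items and keeping a synopsis can be sketched as follows. The class Synopsis, the function shed_load and the sample burst are assumptions for this illustration, and the sketch presumes that updating the synopsis is cheap enough to keep up with the peak.

```python
from dataclasses import dataclass


@dataclass
class Synopsis:
    """Compact summary of many stream items (count, sum, min, max)."""
    count: int = 0
    total: float = 0.0
    minimum: float = float("inf")
    maximum: float = float("-inf")

    def add(self, value: float) -> None:
        self.count += 1
        self.total += value
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)

    @property
    def mean(self) -> float:
        return self.total / self.count if self.count else 0.0


def shed_load(burst, capacity):
    """Keep at most `capacity` items of a burst and drop the rest."""
    kept = burst[:capacity]
    dropped = len(burst) - len(kept)
    return kept, dropped


burst = [4.0, 8.0, 15.0, 16.0, 23.0, 42.0]

# Strategy 1: drop what cannot be processed in time (the result loses accuracy).
kept, dropped = shed_load(burst, capacity=4)
approx_mean = sum(kept) / len(kept)

# Strategy 2: fold every item into a synopsis and query the synopsis later.
synopsis = Synopsis()
for value in burst:
    synopsis.add(value)
```

In this sample the mean computed after dropping two items differs noticeably from the mean obtained from the synopsis, which illustrates the accuracy loss of load shedding.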

14.3 Array Databases

Array databases organize data along multiple dimensions and can be used to store and manipulate data with complex structures. Such complex data often occur in the natural sciences; an example is astronomical data obtained from satellite observations.

Web resources:

Rasdaman: http://www.rasdaman.org/

documentation page: http://www.rasdaman.org/wiki/Documentation

source repository: http://www.rasdaman.org/browser

SciDB: http://scidb.org/

documentation page: http://www.paradigm4.com/resources/documentation/

In the array data model, data are stored in multidimensional arrays. Each array cell contains a tuple of a certain length; the elements of such a tuple can either be scalar values or they can themselves be arrays. In other words, the array data model can express arbitrary nestings of arrays. The tuples are addressed by specifying the corresponding dimensions; in SciDB [SBZB13] the individual scalar values of a tuple can furthermore be addressed by named attributes. An example from the SciDB paper [SBZB13] shows how to specify a two-dimensional matrix (along the two dimensions I and J) in which each cell contains a tuple with two attributes (an attribute named M of type integer and an attribute named N of type float):

CREATE ARRAY example <M: int, N: float> [I=1:1000, J=1000:20000]

Array databases offer several advanced functions to manipulate array data. For example, tuples can be aggregated or specialized join operators can be executed.
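The addressing scheme of the array data model can be imitated with a small Python sketch. The classes Cell and Array2D and the method sum_attribute are assumptions made for this illustration and do not reflect how SciDB stores or processes arrays; the dimension bounds are taken from the CREATE ARRAY statement above.

```python
from typing import Dict, NamedTuple, Tuple


class Cell(NamedTuple):
    """One array cell: a tuple with the named attributes M and N."""
    M: int
    N: float


class Array2D:
    """Sparse two-dimensional array with explicit dimension bounds."""

    def __init__(self, i_range: Tuple[int, int], j_range: Tuple[int, int]) -> None:
        self.i_range = i_range   # inclusive bounds of dimension I
        self.j_range = j_range   # inclusive bounds of dimension J
        self.cells: Dict[Tuple[int, int], Cell] = {}

    def _check(self, i: int, j: int) -> None:
        if not (self.i_range[0] <= i <= self.i_range[1]
                and self.j_range[0] <= j <= self.j_range[1]):
            raise IndexError(f"cell ({i}, {j}) is outside the array bounds")

    def __setitem__(self, key: Tuple[int, int], cell: Cell) -> None:
        self._check(*key)
        self.cells[key] = cell

    def __getitem__(self, key: Tuple[int, int]) -> Cell:
        self._check(*key)
        return self.cells[key]

    def sum_attribute(self, name: str) -> float:
        """Aggregate one attribute over all non-empty cells."""
        return sum(getattr(cell, name) for cell in self.cells.values())


example = Array2D(i_range=(1, 1000), j_range=(1000, 20000))
example[1, 1000] = Cell(M=7, N=0.5)
example[2, 1500] = Cell(M=3, N=1.5)

m_of_first = example[1, 1000].M           # address by dimensions, then attribute
total_n = example.sum_attribute("N")      # simple aggregation over attribute N
```

A cell is first addressed by its position along the dimensions I and J; the scalar values inside the cell are then addressed by attribute name.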

14.4 Geographic Information Systems

Geographic information (like map data) has long been considered a particular form of data with special storage and evaluation needs. These needs have been answered by Geographic Information Systems (GIS) and several databases offer GIS functionality. GIS data often require specialized data types for geometric elements of maps (like points, lines or polygons). On these data types, specific evaluation operations are usually offered (like computing the intersection of two elements).

Web resources:

Open Geospatial Consortium: www.opengeospatial.org/

standards: http://www.opengeospatial.org/standards

Open Source Geospatial Foundation: http://www.osgeo.org/

GeoNetwork opensource: http://geonetwork-opensource.org/

documentation page: http://geonetwork-opensource.org/docs.html

GitHub repository: https://github.com/geonetwork/

GeoServer: http://geoserver.org/

documentation page: http://docs.geoserver.org/

GitHub repository: https://github.com/geoserver/geoserver

PostGIS: http://postgis.net/

documentation page: http://postgis.net/documentation

GitHub repository: https://github.com/postgis/postgis/

QGIS: www.qgis.org/

documentation page: www.qgis.org/en/docs/

GitHub repository: https://github.com/qgis/QGIS

GRASS GIS: http://grass.osgeo.org/

documentation page: http://grass.osgeo.org/documentation/

SVN repository: http://trac.osgeo.org/grass/browser

GeoJSON: http://geojson.org/

specification: http://geojson.org/geojson-spec.html

The GIS community has developed a wide range of standards and specifications to enable interoperability between different systems. For example, GeoJSON is a recent specification for describing GIS data in JSON format. A simple example of the specification of a single point (by defining its x-coordinate and its y-coordinate) looks like this (see the GeoJSON specification for more details):

{ "type": "Point", "coordinates": [100.0, 0.0] }
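Since GeoJSON is ordinary JSON, it can be processed with any JSON library. The following Python sketch parses the point shown above and computes a bounding box for a line; the helper name bounding_box and the LineString coordinates are assumptions for this illustration, and only the geometry types Point and LineString are handled.

```python
import json

point_text = '{ "type": "Point", "coordinates": [100.0, 0.0] }'
point = json.loads(point_text)

# GeoJSON lists the x-coordinate (longitude or easting) first, then y.
x, y = point["coordinates"]


def bounding_box(geometry):
    """Return (min_x, min_y, max_x, max_y) for a Point or LineString."""
    if geometry["type"] == "Point":
        coords = [geometry["coordinates"]]
    elif geometry["type"] == "LineString":
        coords = geometry["coordinates"]
    else:
        raise ValueError("unsupported geometry type: " + geometry["type"])
    xs = [c[0] for c in coords]
    ys = [c[1] for c in coords]
    return (min(xs), min(ys), max(xs), max(ys))


line = {"type": "LineString", "coordinates": [[100.0, 0.0], [101.0, 1.0]]}
line_text = json.dumps(line)   # serialize back to GeoJSON text
box = bounding_box(line)
```

Dedicated GIS databases offer far richer operations, such as intersection tests, on these geometry types.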

14.5 In-Memory Databases

In-memory databases rely on servers with large-scale main memory. The main memory is the primary storage location for the data. This makes data management considerably faster because it avoids the memory-to-disk bottleneck when writing and the disk-to-memory bottleneck when reading data. In particular, in-memory data management works at the granularity of memory addresses and not at the granularity of data blocks such as the memory pages that have to be transferred from and to the disk.

Durability of data is not ensured when data are maintained only in main memory. A system crash or a power outage will usually erase the main memory, and all data are lost. Durability can, for example, be added to in-memory databases by

Logging: Transaction logs are stored to the disk and then applied upon recovery from a system crash.

Snapshots: The state of the database is stored to disk periodically; in other words, a regular snapshot of the database is taken and stored durably.

Replication: All data is replicated to other in-memory database servers (at best at geographically dispersed locations). In case of a crash of a single server, a replication server can take over.
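The interplay of logging and snapshots can be sketched with a tiny key-value store in Python. The class DurableStore, its file names and its JSON record format are assumptions for this illustration and are not taken from any of the systems mentioned in this section; replication is not shown.

```python
import json
import os
import tempfile


class DurableStore:
    """In-memory key-value store with an append-only log and snapshots."""

    def __init__(self, directory: str) -> None:
        self.log_path = os.path.join(directory, "store.log")
        self.snapshot_path = os.path.join(directory, "store.snapshot")
        self.data = {}          # primary copy lives in main memory
        self._recover()

    def put(self, key: str, value) -> None:
        # Write the log record to disk before changing the in-memory state.
        with open(self.log_path, "a", encoding="utf-8") as log:
            log.write(json.dumps({"key": key, "value": value}) + "\n")
            log.flush()
            os.fsync(log.fileno())
        self.data[key] = value

    def snapshot(self) -> None:
        # Persist the full state, then empty the log that it supersedes.
        tmp_path = self.snapshot_path + ".tmp"
        with open(tmp_path, "w", encoding="utf-8") as snap:
            json.dump(self.data, snap)
            snap.flush()
            os.fsync(snap.fileno())
        os.replace(tmp_path, self.snapshot_path)   # swap in the new snapshot
        open(self.log_path, "w", encoding="utf-8").close()

    def _recover(self) -> None:
        # Load the latest snapshot, then replay newer log records on top.
        if os.path.exists(self.snapshot_path):
            with open(self.snapshot_path, encoding="utf-8") as snap:
                self.data = json.load(snap)
        if os.path.exists(self.log_path):
            with open(self.log_path, encoding="utf-8") as log:
                for line in log:
                    record = json.loads(line)
                    self.data[record["key"]] = record["value"]


directory = tempfile.mkdtemp()
store = DurableStore(directory)
store.put("a", 1)
store.snapshot()
store.put("b", 2)

# Simulate a crash: discard the object and rebuild the state from disk.
recovered = DurableStore(directory)
```

Taking a snapshot keeps the log short, so recovery only has to replay the records written since the last snapshot.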

Several databases offer a main-memory mode as an alternative to disk-based storage.

Web resources:

Aerospike: http://www.aerospike.com/

documentation page: http://www.aerospike.com/docs/

GitHub repository: https://github.com/aerospike

Apache Geode: http://geode.incubator.apache.org/

documentation page: http://geode.incubator.apache.org/docs/

GitHub repository: https://github.com/apache/incubator-geode

Hazelcast: http://hazelcast.com/

documentation page: http://hazelcast.org/documentation/

GitHub repository: https://github.com/hazelcast

Scalaris: http://scalaris.zib.de/

documentation: https://github.com/scalaris-team/scalaris/tree/master/user-dev-guide

GitHub repository: https://github.com/scalaris-team/scalaris

VoltDB: http://voltdb.com/

documentation page: http://docs.voltdb.com/

GitHub repository: https://github.com/VoltDB/voltdb

14.6 NewSQL Databases

A major criticism of traditional relational, SQL-based database systems was their inability to run efficiently as distributed database systems. However, SQL has the huge advantage of being a standardized language and, as such, of being accepted by the majority of database administrators and users. Hence, as a kind of reaction to the diversity of NoSQL database systems and the overabundance of their different interfaces, interest has arisen in enhancing relational database systems with better support for the changed requirements while keeping their relational data model; the term NewSQL thus describes the adoption of the design principles underlying NoSQL systems to build new distributed relational database systems with an SQL interface. A redesign of conventional RDBMSs would in particular make them scalable: in a NewSQL database, relational data can be stored on a varying number of servers while SQL queries are still answered efficiently. Moreover, such a system would tolerate failures in the network while maintaining the ACID properties.

Web resources:

TokuDB: http://www.tokutek.com/tokudb-for-mysql/

documentation page: http://docs.tokutek.com/tokudb/

GitHub repository: https://github.com/Tokutek/tokudb-engine

14.7 Bibliographic Notes

An invaluable book on in-memory data management is the monograph by Plattner and Zeier [PZ11]. The VoltDB system is another in-memory data store [SW13]. Data stream management is the topic of the books by Golab and Özsu [GÖ10] and by Garofalakis, Gehrke and Rastogi [MG12]. The notion of linked data and its underlying graph semantics is covered in the position paper by Bizer, Heath and Berners-Lee [BHBL09]; the book by Wood et al. covers the linked data paradigm from the practical perspective of RDF stores and the book by DuCharme [DuC13] focuses on SPARQL queries. Good resources for geographic information systems including spatial data modeling are the textbooks by Chang [Cha10] and by Heywood, Cornelius and Carter [HCC11]. Array data management is extensively surveyed in [RC13]. Recent array database systems include SciDB [SBZB13], rasdaman [BS14] and the SciLens platform on top of MonetDB [IGN+12]. Stonebraker [Sto12] gives a brief discussion of NewSQL data stores.
