The NoSQL world

As social media grew, so did data requirements. The need to store and retrieve large amounts of data almost immediately led some of the companies facing the problem to look for alternatives.

Projects such as BigTable (Google) and Dynamo (Amazon) were among the first attempts to solve this problem. They encouraged a new movement that we now know as the NoSQL initiative. The term was proposed by Johan Oskarsson at a conference in California about these topics, for which he created the Twitter hashtag #NoSQL.

We can define NoSQL as a broad class of database management systems that differ from the classical relational model (RDBMS) in important respects, the most noticeable being that they do not use SQL as their primary query language.

Stored data does not require fixed structures such as tables. As a result, these systems typically don't support JOIN operations and do not fully guarantee the ACID (atomicity, consistency, isolation, and durability) properties, which are the soul of the relational model. In exchange, they usually scale horizontally very efficiently.

As a reminder, the four ACID properties are defined as follows:

  • Atomicity: This is key to the relational model; an operation consisting of more than one action must not fail halfway through, because that would leave data in an inconsistent state. The whole set of actions is treated as a single unit: either all of them take effect or none do.
  • Consistency: The database must be in a valid state both before and after any transaction.
  • Isolation: Transactions running at the same time must not interfere with one another; no side effects of one transaction should be visible to the others until it has finished.
  • Durability: Once an operation completes successfully, its result will not be undone by the system, even after a failure.
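Atomicity is the easiest of the four properties to observe directly. The following is a minimal sketch, assuming Python's built-in sqlite3 module as a stand-in for any relational engine; the accounts table and the transfer helper are hypothetical names introduced only for this illustration.

```python
import sqlite3

def transfer(conn, source, target, amount):
    """Move `amount` between two accounts as one atomic unit."""
    try:
        # Used as a context manager, the connection commits if the block
        # succeeds and rolls back if an exception escapes from it.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, source),
            )
            row = conn.execute(
                "SELECT balance FROM accounts WHERE name = ?", (source,)
            ).fetchone()
            if row[0] < 0:
                # Failing in the middle: the debit above must not survive.
                raise ValueError("insufficient funds")
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, target),
            )
        return True
    except ValueError:
        return False

def balances(conn):
    """Return the current balances as a {name: balance} dictionary."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)]
)
conn.commit()

ok = transfer(conn, "alice", "bob", 30)       # both updates are applied
failed = transfer(conn, "alice", "bob", 500)  # the debit is rolled back
```

After the second call, the balances are the same as after the first one: the partial debit never becomes visible, which is precisely the guarantee that many NoSQL systems relax.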

NoSQL systems are sometimes called "not only SQL" to underline the fact that they can also support query languages such as SQL, although this depends on the implementation and the type of database.

Academic researchers refer to these databases as structured storage databases, a term that also covers classical relational databases. NoSQL databases are often classified according to how they store data, in categories such as Key-Value (Redis), BigTable/Column Family (Cassandra, HBase), Document Databases (MongoDB, CouchDB, RavenDB), and Graph Oriented Databases (Neo4j).

With the growth of real-time websites, it became clear that more processing power was needed for large volumes of data. Organizing data in horizontally scalable structures reached corporate consensus, since this approach can support millions of requests per second.

Many attempts have been made to categorize the different offerings now found in the NoSQL world according to various aspects: scalability, flexibility, functionality, and so on. One of these classifications, established by Scofield and Popescu (http://NoSQL.mypopescu.com/post/396337069/presentation-NoSQL-codemash-an-interesting), rates each type of database on the following criteria:

 

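| Type                 | Performance | Scalability     | Flexibility | Complexity | Functionality      |
|----------------------|-------------|-----------------|-------------|------------|--------------------|
| Key-value stores     | High        | High            | High        | None       | Variable (none)    |
| Column stores        | High        | High            | Moderate    | Low        | Minimal            |
| Document stores      | High        | Variable (high) | High        | Low        | Variable (low)     |
| Graph databases      | Variable    | Variable        | High        | High       | Graph theory       |
| Relational databases | Variable    | Variable        | Low         | Moderate   | Relational algebra |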

Architectural changes with respect to RDBMS

So, the first step when adopting one of these models is to identify clearly which one best suits our needs. Let's quickly review these different architectural approaches:

  • The key/value approach is similar to other lightweight storage systems used today on the Web, especially the localStorage and sessionStorage APIs, which allow a web page to read and write data in a dedicated area of the local system. Storage is structured in pairs, the left-hand side being the key we'll use later on to retrieve the associated value.

    These databases don't care about the type of information saved as the value (numbers, documents, multimedia, and so on), although there might be some limitations.

  • The document approach stores simple documents, where a document can be a complex data structure:
    • Normally, such data is represented in JSON, the most common format in use today, especially in web contexts.
    • The architecture allows you to read fragments of a document, or to change or insert fragments, without being constrained by any schema.
    • The absence of a schema, which many consider one of the best features of NoSQL databases, has a few drawbacks.
    • One drawback is that when we retrieve some data about, say, a person (a name or an account), we are assuming an implicit schema, as Fowler calls it: we take it for granted that a person has a name field or an account field.
    • In fact, most implementations rely on the existence of an ID, which in practice works like the key in a key/value store.
    • So, we can think of these two approaches as similar, both belonging to a type of aggregate-oriented structure.
  • In the column family model, the structure defines a single key (called a row key) and, associated with it, families of columns, each of which is a set of related information.
    • Thus, in this model, information is accessed using the row key and the column family name, so two values are needed for data access; still, the model is reminiscent of the aggregate-oriented model.
  • Finally, the graph-oriented model breaks information into even smaller units and relates those units in a very rich, connected manner.
    • These databases define a special query language for expressing complex relationships that would be difficult to express in other types of databases, including RDBMSs.
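To make the four models above more concrete, here is a rough sketch of how each one shapes the same kind of data, using plain Python dictionaries and lists rather than any real database. Every name in it (kv_store, documents, column_families, edges, and the helper functions) is a hypothetical choice made for illustration.

```python
# Key/value: an opaque value retrieved by its key.
kv_store = {"session:42": b"any bytes the application likes"}

# Document: a nested structure addressed by an ID, with no enforced schema.
documents = {
    "p1": {"_id": "p1", "name": "Ada", "account": {"balance": 10}},
    "p2": {"_id": "p2", "nickname": "Bob"},  # no "name": nothing forbids it
}

def person_name(doc):
    # The reader relies on an implicit schema: it expects a "name" field
    # and has to decide what to do when the field is missing.
    return doc.get("name", "<unknown>")

# Column family: row key -> column family name -> related columns.
column_families = {
    "user:1": {
        "profile": {"name": "Ada", "city": "London"},
        "orders": {"o-100": "book", "o-101": "pen"},
    }
}

def read_family(row_key, family):
    # Two values are needed to reach the data: row key and family name.
    return column_families[row_key][family]

# Graph: small nodes plus explicit, labeled relationships between them.
nodes = {
    "ada": {"kind": "person"},
    "bob": {"kind": "person"},
    "neo": {"kind": "product"},
}
edges = [("ada", "FRIEND_OF", "bob"), ("bob", "BOUGHT", "neo")]

def neighbors(node, label):
    # Follow every edge with the given label that starts at `node`.
    return [dst for src, lbl, dst in edges if src == node and lbl == label]
```

Note how the first three shapes are all reached through a single key (plus a family name in the column case), which is why they are grouped as aggregate-oriented, while the graph shape puts the relationships themselves in the foreground.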

As we mentioned earlier, most NoSQL databases cannot perform joins in queries. Consequently, the database schema needs to be designed differently.

This has led to several techniques for managing relational data in a NoSQL database.

Running multiple queries

This idea relies on the fast response times typical of these databases. Instead of getting all the data in a single request, several queries are chained in order to get the desired information.

If the performance penalty is not acceptable, other approaches are possible.
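A minimal sketch of chained queries follows, assuming two in-memory collections modeled as dictionaries; users, posts, find_by_id, and post_with_author are hypothetical names, and each lookup stands in for one round trip to the database.

```python
users = {1: {"_id": 1, "username": "ada"}}
posts = {10: {"_id": 10, "title": "Hello NoSQL", "author_id": 1}}

def find_by_id(collection, doc_id):
    """Stand-in for a single query against one collection."""
    return collection.get(doc_id)

def post_with_author(post_id):
    # First query: fetch the post itself.
    post = find_by_id(posts, post_id)
    # Second query: use a value from the first result to fetch the author.
    author = find_by_id(users, post["author_id"])
    return {"title": post["title"], "author": author["username"]}
```

The application code does the work that a JOIN would do in SQL, at the price of one extra round trip per related collection.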

The problem of nonnormalized data

In this case, the issue is solved with a different approach: instead of storing foreign keys, the corresponding foreign values are stored together with the model's data.

Let's imagine blog entries. Each entry can store both the username and the user ID, so we can read the username without requiring an extra query.

The shortcoming is that when the username changes, the modification has to be written in more than one place in the database. So, this kind of approach is handy when reads greatly outnumber writes.
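The trade-off can be sketched as follows, again with plain Python structures standing in for collections; the field names and the rename_user helper are assumptions made for this example.

```python
users = {1: {"_id": 1, "username": "ada"}}

# Each post carries a copy of the username next to the user ID,
# so displaying a post needs no second query.
posts = [
    {"_id": 10, "title": "First", "user_id": 1, "username": "ada"},
    {"_id": 11, "title": "Second", "user_id": 1, "username": "ada"},
]

def rename_user(user_id, new_name):
    """Change a username everywhere it is stored; return the write count."""
    users[user_id]["username"] = new_name
    writes = 1
    for post in posts:
        if post["user_id"] == user_id:
            post["username"] = new_name  # one extra write per copy
            writes += 1
    return writes

writes = rename_user(1, "ada_l")
```

Here a single rename costs three writes instead of one, and the cost grows with the number of posts, which is why the technique pays off mainly for read-heavy data.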

Data nesting

As we will see in the MongoDB examples, a common practice is to place more data in a smaller number of collections. In the blogging application we imagined earlier, this means that we could store the comments in the same document as the blog post.

In this way, a single query gets all the related comments. With this approach, a single document contains all the data you need for a specific task.

In fact, this has become a de facto standard, given the absence of a fixed schema in these databases.

Tip

In other words, the philosophy followed here is roughly this: save your data in such a way that the number of storage units involved in a query is minimal (ideally, only one).
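As a sketch of data nesting, the blog post below embeds its comments, so one lookup returns everything needed to render the page. The document layout and the add_comment and comments_for helpers are hypothetical, and a dictionary stands in for the collection.

```python
posts = {
    10: {
        "_id": 10,
        "title": "Hello NoSQL",
        "body": "First post",
        # Comments live inside the post document instead of
        # in a separate collection.
        "comments": [{"author": "bob", "text": "Nice post"}],
    }
}

def add_comment(post_id, author, text):
    """Append a comment to the embedded list of the given post."""
    posts[post_id]["comments"].append({"author": author, "text": text})

def comments_for(post_id):
    """One lookup returns the post together with all its comments."""
    return [comment["text"] for comment in posts[post_id]["comments"]]

add_comment(10, "carol", "Thanks for sharing")
```

The design choice is the mirror image of normalization: the document is shaped around the task (showing a post), not around the entities involved.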

The terminology changes as well. The following table summarizes the equivalences between SQL terms and those of a NoSQL database (MongoDB, in this case):

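| SQL                                                | MongoDB                                                     |
|----------------------------------------------------|-------------------------------------------------------------|
| Database                                           | Database                                                    |
| Table                                              | Collection                                                  |
| Row                                                | Document or BSON document                                   |
| Column                                             | Field                                                       |
| Index                                              | Index                                                       |
| Table joins                                        | Embedded documents (with linking)                           |
| Primary key (unique column or column combinations) | Primary key (automatically set to the _id field in MongoDB) |
| Aggregation (for example, by group)                | Aggregation pipeline                                        |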

About CRUD operations

In the case of MongoDB, which we'll use in this chapter, a read operation is a query that targets a specific collection of documents. Queries specify criteria (conditions) that identify which documents MongoDB has to return to the client.

A query can also state which fields are required in the output. This is done using a projection: an expression that enumerates the fields to return from the matching documents. The behavior of MongoDB follows these rules:

  • Any query targets a single collection
  • The query syntax allows you to establish filters, ordering, and other related limitations
  • No predefined order is used unless the sort() method forms a part of the query
  • Update and delete operations use the same query syntax as read operations to select the documents they affect
  • Queries with a statistical character (aggregation queries) use the $match pipeline stage, which accepts the same query structure
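The interplay of criteria, projection, and ordering described above can be emulated in a few lines of plain Python. This is only a sketch of the idea, not MongoDB's API: the find function, its parameters, and the people collection are all hypothetical, and the criteria support equality tests only.

```python
def find(collection, criteria=None, projection=None, sort_key=None):
    """Return the documents of one collection that match `criteria`."""
    criteria = criteria or {}
    # Filter: keep documents whose fields equal every requested value.
    matches = [
        doc for doc in collection
        if all(doc.get(field) == value for field, value in criteria.items())
    ]
    # Ordering: applied only when explicitly requested.
    if sort_key is not None:
        matches.sort(key=lambda doc: doc[sort_key])
    # Projection: keep only the enumerated fields of each match.
    if projection is not None:
        matches = [
            {field: doc[field] for field in projection if field in doc}
            for doc in matches
        ]
    return matches

people = [
    {"_id": 1, "name": "Carol", "city": "Oslo", "age": 41},
    {"_id": 2, "name": "Ada", "city": "Oslo", "age": 36},
    {"_id": 3, "name": "Bob", "city": "Rome", "age": 29},
]

result = find(people, {"city": "Oslo"}, projection=["name"], sort_key="name")
```

Without sort_key, the matches come back in whatever order the collection holds them, mirroring the rule that no predefined order is used unless sorting is part of the query.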

Traditionally, even in the relational model, operations that change information (create, update, or delete) have their own syntax (DDL or DML in the SQL world). In MongoDB, they are known as data modification operations, since they modify data in a single collection. For update operations, however, a conceptual distinction is usually made between partial updates (modifications), which change only some fields, and full updates (replacements), which swap the whole document. In a replacement, only the _id field is preserved.
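The difference between a modification and a replacement can be sketched with a dictionary standing in for the collection. The update_fields and replace_document helpers are hypothetical names for the two kinds of update, not functions of any MongoDB driver.

```python
def update_fields(collection, doc_id, changes):
    """Modification: change some fields and leave the rest untouched."""
    collection[doc_id].update(changes)

def replace_document(collection, doc_id, new_doc):
    """Replacement: swap the whole document, preserving only _id."""
    collection[doc_id] = {**new_doc, "_id": doc_id}

users = {1: {"_id": 1, "name": "Ada", "city": "Oslo"}}

update_fields(users, 1, {"city": "Rome"})
after_update = dict(users[1])    # name is still present

replace_document(users, 1, {"nickname": "ada_l"})
after_replace = dict(users[1])   # name and city are gone, _id remains
```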

To summarize, the available operations are as follows:

  • Adding information is performed with insert operations (either adding new data to an existing collection or adding a new document)
  • Changes take two forms: updates modify existing data, while remove operations delete data from a given collection entirely
  • Each of these three operations targets a single collection, and each write is atomic at the level of a single document
  • As mentioned earlier, update and remove can use different criteria to establish which documents are updated or removed:
    • The syntax for these criteria is clearly similar to the one used in pure read queries
    • In fact, some of these operations are piped, that is, linked to the previous query by chained calls

So, in the case of MongoDB, we would have a scheme like the one shown in the following figure:

[Figure: About CRUD operations]