Overview
In this chapter, we will study the SQL and NoSQL databases. First, you will learn how to decide on the ideal database to be used based on the format of the data at hand. Then, we will implement storing and querying data in different databases such as MySQL, MongoDB, and Cassandra. Furthermore, to improve efficiency, we will optimize the databases for maximum performance. This chapter includes hands-on experiences so that you can learn how to deal with practical real-world problems related to databases. By the end of this chapter, you will be able to develop and implement data models on MySQL, MongoDB, and Cassandra databases.
In the previous chapter, we covered case studies on how AI has been controversial, along with various industry-standard use cases and hands-on information about the tools and frameworks related to storage. In a continuation of earlier discussions, we will be diving into detailed discussions on the kinds of databases that are available, along with their evolution, internal structures, and best possible use cases and examples.
Understanding database management is quite important as it gives you a clear idea about how data flows, that is, the transaction of data between the application and the database. For businesses and organizations, multiple types of data can be handled quite effectively using the knowledge of database management. Nevertheless, there are several challenges associated with data storage, such as infrastructure, cost, compatibility, and security.
In practical scenarios, databases enable us to accelerate our predictive analytics power by deploying Artificial Intelligence (AI) to get rid of several real-world problems such as credit card fraud detection, stock market prediction, and early stage disease detection.
Currently, various kinds of databases are available on the market, such as Relational (SQL), Document store (MongoDB), Columnar (Cassandra), Key-Value store (NoSQL), and Graph (ArangoDB) databases. Each of these databases has its characteristics and advantages and disadvantages. These databases are broadly divided into two groups: SQL and NoSQL databases.
SQL databases are relational database systems (RDBMS), which are structured/relational databases. The main language that's used to deal with the data is Structured Query Language (SQL). It is used to form a query and access the data.
NoSQL databases are Not Only SQL databases, wherein the data is in a semi-structured format. The main language that deals with data can vary for each database. This type of database has triggered the evolution of large distributed data stores running over a bunch of commodity hardware called nodes. Without the ability to use distributed data stores and having our data stored on a cluster of inexpensive nodes, we would have to rely on expensive single machines with very large capacities. This is not only expensive, but it also presents a single point of failure infrastructure. Generally, these databases are used where the data is huge and needs to be accessed at the lowest latency possible.
In this chapter, we are going to learn about the relational, document store, and columnar databases by going through examples and case studies. Lastly, we will learn how these databases are used in a hybrid architecture. In the next section, we will discuss database components to understand concepts including the query engine and interface, tables, schemas, and buffer pool.
The typical database components can be seen in the following diagram. These components include the following:
These are the row-oriented type of databases. Here, each row of the database is a record. A collection of columns forms a record and is stored in a structure called a table. These collections of tables form the database.
These databases are accessed using SQL. It was developed by two IBM software researchers in the 1970s and it became more and more dominant in the field after that due to its straightforward, understandable syntax, ease of learning, and fewer lines of code being required to use complex functionalities:
In the preceding diagram, user database interaction through SQL is demonstrated, where the efficacy of the SQL architecture to deal with "structured data" is quite visible.
SQL databases work on Atomicity, Consistency, Isolation, and Durability (ACID) properties, which guarantees strong consistencies and isolated transaction behavior. These are crucial while designing a data storage system.
Also, when we think about AI, having just one type of database is not sufficient due to the various forms of data. Therefore, there is a need for a hybrid data architecture in various cases, as we covered in the last section, Exploring the Collective Knowledge of Databases, of the chapter. In some hybrid architectures, we use the properties of the consistency and availability of MySQL and Mongo DB, respectively.
Undoubtedly, MySQL and Oracle are the most popular choices among RDBMS, as per their efficacies, but we will restrict our discussions to just MySQL in the next section as it is open source.
MySQL is an open source relational database system that's used by websites such as Facebook, Flickr, Twitter, and YouTube because it supports stored procedures, cursors, triggers, and all the SQL database features.
As you already know, MySQL uses the SQL language to deal with data. Let's take a look at the syntax and semantics.
SQL is a structured query language; it is composed of predicates or clauses:
The following SQL commands are used to deal with the structure of the database or table:
This command creates a database:
CREATE DATABASE fashionmart;
This clause is used to create a table. Here, you have to specify the columns and their data types and constraints:
CREATE TABLE products (p_id INT,
p_name TEXT,
p_buy_price FLOAT,
p_manufacturer TEXT,
p_created_at TIMESTAMP,
PRIMARY KEY (p_id)
);
This command will create a table product that has five columns. p_id will have PRIMARY KEY as a constraint. We will discuss the primary key later.
The table structure can be displayed using this command:
DESC products;
This clause is used to modify the table structure to, for example, add, rename, or drop a column:
ALTER TABLE products ADD COLUMN p_modified_at TIMESTAMP NOT NULL;
Note
In case you run into an error while running the preceding code, you can set the SQL_MODE in MYSQL to Allow Invalid Dates using the following command: SET SQL_MODE='ALLOW_INVALID_DATES';
In this command, we have added the p_modified_at column with the TIMESTAMP data type and the NOT NULL constraint.
This command drops the table and removes all the records:
DROP TABLE products;
The following SQL commands are used to deal with the actual data in the table:
This command is used to insert records into the table:
INSERT INTO products (p_id, p_name, p_buy_price,
p_manufacturer, p_created_at)
VALUES (1, 'Z-1 Running shoe', 34, 'Z-1', now());
In this command, columns can be specified after the table name and the values inside the VALUES clause.
Adding single rows for multiple queries is exhausting and time-consuming. We can add multiple rows in a single query like so:
INSERT INTO products(p_id, p_name, p_buy_price,
p_manufacturer, p_created_at)
VALUES
(2, 'XIMO Trek shirt', 15, 'XIMO', now()),
(3, 'XIMO Trek shorts', 18, 'XIMO', now()),
(4, 'NY cap', 18, 'NY', now());
This command is used to update existing records inside the table:
UPDATE products SET p_buy_price=40
In this command, the SET clause is used to set a value for the column, while the WHERE clause filters the rows for the matching condition. If the condition is missing, it will update the value for all the records. In this example, we are updating the buy price to 40 for the product whose ID is 1 in the products table.
This command is used to delete records from the table:
DELETE FROM products WHERE p_id=1;
Here, if the condition is not given, then it will delete all the records from the table. As we have given the condition where the product ID is 1, it will only delete that record.
The following SQL commands are used to deal with providing control access to data stored in a database:
This command allows the owner of the object with GRANT access to accord privileges to other users. The syntax for the GRANT command is as follows:
GRANT privilege_name ON object_name
TO {user_name | PUBLIC | role_name} [with GRANT option];
An example of the GRANT command is as follows:
GRANT SELECT ON student TO user5
This command grants a SELECT permission on the student table to user5.
This command cancels previously granted or denied permissions. The syntax for the REVOKE command is as follows:
REVOKE privilege_name ON object_name
FROM {User_name | PUBLIC | Role_name}
An example of the REVOKE command is as follows:
REVOKE SELECT ON student TO user5
This command revokes the SELECT permission on the student table from user5.
The following SQL commands are used to deal with SQL's logical transactions:
This command permanently saves any transaction in the database. The syntax is as follows:
COMMIT;
This command restores the database to the last committed state. The syntax is as follows:
ROLLBACK;
This command temporarily saves a transaction so that we can roll back to that point if required. The syntax is as follows:
SAVEPOINT savepoint_name;
To retrieve data in SQL, we often use the SELECT statement as it plays a significant role. Now, let's discuss data retrieval in terms of the appropriate commands and examples of how to use them.
This section focuses on data access:
This command is used to access the data from the table:
This command will display all the records from the products table. Here, the FROM clause specifies the table name.
To compare and combine in SQL, joins are used. As output joins return specific rows of data from two or more tables in a database. Hence, this is not a command but a concept, where we can join more than one condition based on another condition. Let's understand this better by going through an example. Consider we have two tables, sales, and products, with the following data:
Using the following code, we will join the sales table with p_id as the FOREIGN KEY pointing to the products table:
products.p_name AS product_name,
products.p_manufacturer AS manufacturer,
sales.s_profit AS profit
FROM products
JOIN sales
ON products.p_id=sales.p_id;
This should give us the following output:
This command joins products and sales using the JOIN and ON clauses. The ON clause holds the condition to join.
There are various types of joins available. The one we observed in the previous example was an inner join. Conceptually, inner joins find and return matching data from tables, whereas outer joins find and return matching data, as well as some dissimilar data from tables. Let's discuss outer joins by going through several examples.
This join, in the preceding scenario, will show non-matching data from the left table, products; that is, data from the products table that does not have sales data associated with it:
SELECT
products.p_name AS product_name,
products.p_manufacturer AS manufacturer,
sales.s_profit AS profit
FROM products
LEFT OUTER JOIN sales
ON products.p_id=sales.p_id;
This should give us the following output:
RIGHT OUTER JOINS
This join, in alternate scenarios, will show non-matching data from the right table, sales; that is, data from the sales table that does not have products table data associated with it:
SELECT
products.p_name AS product_name,
products.p_manufacturer AS manufacturer,
sales.s_profit AS profit
FROM products
RIGHT OUTER JOIN sales
ON products.p_id=sales.p_id;
This should give us the following output:
The GROUP BY clause is used to group records on one or more columns and then perform all the operations associated with it. One thing to take care of is that the column not used in aggregate functions should be included in the GROUP BY clause.
Aggregate functions include SUM, MIN, MAX, COUNT, and AVG. Let's look at an example:
sales.p_id AS p_id,
SUM(sales.s_profit) AS profit
FROM sales
GROUP BY sales.p_id;
This command is grouping the table on the p_id column and performing addition (using the SUM function) over the profit column to find out the total profit for each p_id. You can use any aggregate function with the GROUP BY clause.
This should give us the following output:
This clause is used to sort the result on a column. ORDER BY executes last in query lineage. You can sort the column in ascending or descending order:
SELECT
sales.p_id AS p_id,
SUM(sales.s_profit) AS profit
FROM sales
GROUP BY sales.p_id
ORDER BY profit ASC;
This should give us the following output:
This section is dedicated to the constraints/rules in the table. These constraints help MySQL pre-validate the record before inserting it into the table. These constraints can be created by using the CREATE TABLE statement. We will discuss a few constraints here.
The following are the main constraints:
To understand the importance of SQL databases, let's take an example of a store called FashionMart that sells products such as clothing and jewelry to its customers. We have been given the task of making an online e-commerce website for the store. Initially, we want to achieve the following tasks with this website:
Let's assume that we saved this information in file storage such as Products.csv and Sales.csv files. Now, we can easily add a new product to Products.csv and also add new sales data to Sales.csv; all you need to do is append one record to the end of the file:
At this point, everything is working fine, but suddenly, the store owner finds out that one of the product's prices and names needs to be changed. So, to apply these changes, the product entry record needs to be found in Products.csv and then modified. Even after doing this, there will be records with duplicate product names in Sales.csv that need to be changed. This will lead to inconsistency in the data and its management becoming tedious. So, it's clear that file storage is not a good option for this use case. There are some more points to consider, such as a lock on the file when the file is being read/modified simultaneously. Also, if there are a million products in the store, how efficient will the search be?
Considering the previous criteria, we need to build the data system so that users can digitally run the FashionMart store. The following exercise focuses on designing a data model for the database and converts this CSV-based non-maintainable data system into a relational database system.
Let's design a basic building block of the database, that is, a data model.
A data model is a way of representing entities and their inter-relationships. We will discuss data modeling in detail later in this chapter:
This data model gives us an overview of a relational database consisting of two tables, Products, and Sales. Products contains fields such as p_id, p_name, p_buy_price, p_manufacturer (we could also have a manufacturer table and refer to it here with a foreign key), and p_created_at, while the Sales table has fields such as s_id, p_id (foreign key of the Products table), s_sale_price, s_profit, and s_created_at. This data model prevents the duplicity of product name data and overcomes problems we saw in the file storage system.
Since the physical data model is ready, we'll implement it in the next exercise.
In this exercise, we need to implement the relational database for the FashionMart store.
We will be creating the products and sales tables and then performing basic SQL operations on them. We will also join the tables and find the overall profits.
Before proceeding with this exercise, we need to set up the MySQL database. Please follow the instructions on the Preface to install it.
Perform the following steps to complete this exercise:
Windows:
mysql
Linux:
sudo mysql
macOS:
mysql
use fashionmart;
You should get the following output:
We have used a USE statement to select a database. This particular database remains as the default until the end of the session or until we deploy another USE statement to select another database.
CREATE TABLE products (p_id INT,
p_name TEXT,
p_buy_price FLOAT,
p_manufacturer TEXT,
p_created_at TIMESTAMP,
PRIMARY KEY (p_id)
);
The products table is created with the required columns and p_id is declared as the primary key.
CREATE TABLE sales (s_id INT,
p_id INT,
s_sale_price FLOAT,
s_profit FLOAT,
s_created_at TIMESTAMP,
PRIMARY KEY (s_id),
FOREIGN KEY (p_id)
REFERENCES products(p_id)
ON DELETE CASCADE
);
The sales table is created with s_id as a primary key and p_id as a foreign key to the products table's p_id column.
desc products;
You should get the following output:
desc sales;
You should get the following output:
With that, we have created our tables and database for the FashionMart store successfully. Now, let's insert the data into it.
INSERT INTO products(p_id, p_name, p_buy_price,
p_manufacturer, p_created_at)
VALUES
(1, 'Z-1 Running shoe', 34, 'Z-1', now()),
(2, 'XIMO Trek shirt', 15, 'XIMO', now()),
(3, 'XIMO Trek shorts', 18, 'XIMO', now()),
(4, 'NY cap', 18, 'NY', now());
INSERT INTO sales(s_id, p_id, s_sale_price,
s_profit, s_created_at)
VALUES
(1,2,18,3,now()),
(2,3,20,2,now()),
(3,3,19,1,now()),
(4,1,40,6,now()),
(5,1,34,0,now());
The formula that's used for calculating profit is (sale price – buy price). In each record, we can see that the sale price is different for the same product. It is assumed that the selling prices are different based on various sale offers.
SELECT * FROM products;
You should get the following output:
SELECT * FROM sales;
You should get the following output:
SELECT
products.p_name AS product_name,
products.p_manufacturer AS manufacturer,
sales.s_profit AS profit
FROM products
JOIN sales
ON products.p_id=sales.p_id;
You should get the following output:
This result is achieved using the JOIN clause and the ON clause, but with a condition. Here, in the joined table, the column name is renamed using the AS query and a join operation is executed based on the p_id column. There are multiple entries for the same product, so we will group the products.
SELECT
products.p_name AS product_name,
products.p_manufacturer AS manufacturer,
sales_subq.profit AS total_profit
FROM products
JOIN
(SELECT
SUM(sales.s_profit) AS profit
FROM sales
GROUP BY sales.p_id) AS sales_subq
ON products.p_id=sales_subq.p_id;
You should get the following output:
In this query, the aggregate operator, SUM, is used to sum the profit column among grouped records. We used a SELECT subquery to create an output of the total profit for each p_id by applying the SUM operator over the s_profit column. Furthermore, we used the GROUP BY command over the p_id column and assigned an alias of sales_subq. After that, a join operation was performed on the products and sales_subq tables to map the p_id column, resulting in the outcome of three columns, that is, product_name, manufacturer, and total_profit.
Notice that the NY Cap product is inside the store but that not a single unit was sold and is not shown. But still, we need to display it with profit as 0.
SELECT
products.p_name AS product_name,
products.p_manufacturer AS manufacturer,
IFNULL(sales_subq.profit,0) AS total_profit
FROM products
LEFT OUTER JOIN
(SELECT
sales.p_id AS p_id,
SUM(sales.s_profit) AS profit
FROM sales
GROUP BY sales.p_id) AS sales_subq
ON products.p_id=sales_subq.p_id;
You should get the following output:
This result is achieved with the LEFT OUTER JOIN clause. The IF NULL operator is used to display the record if the joining condition does not match.
Note
To access the source code for this specific section, please refer to https://packt.live/3gUYcUR.
By completing this exercise, you have successfully implemented a fashionmart database where the deployment of SQL queries helped you keep track of records on sales-related issues. Now, let's learn how to optimize our data model.
Data modeling is the technique of deciding on how the data should be stored so that inserts, updates, and data retrieval happens with the lowest latency possible. If the data modeling is done efficiently, then we can utilize the data store to its fullest, irrespective of the database being used.
There are three types of data models:
The terminology that's used in this data model includes Entities, Attributes, and Relationships. This data model is independent of the database and is very much understandable by a business user and they can relate to it very easily. When you want to explain data storage to non-technical audiences, use the conceptual data model.
This is the prerequisite for a physical data model. Here, you specify the fields and their data types. Although this is database-independent, the model has to abide by normalization rules to go to the next stage.
This is the final stage where database-specific details are presented. Everything from columns, data types, constraints, indices, views, and keys should be mentioned here.
To design a data model, you should know the structure of the data and how a user sees the application.
As per ACID properties, it is very much required in the relational model that there should not be any duplicate records. The dependencies need to be defined clearly as well.
To standardize these, the process of normalization is used to structure the database. The following are the main forms of normalization:
This form says that the columns should be atomic and should have unique names. The column should not have multiple values for multiple purposes.
Here, emp_id with a value of 1 has got two departments, which is not a valid form. So, to accommodate and maintain the validity, we will store the data like so:
Valid schema example
In the following example of a valid schema, we have created a separate record for John Doe by splitting the department column values, such as IT and Pre-sales, into two different rows:
This form tells us that the table should be in 1st Normalization form and that all the columns should be fully dependent on the primary key.
Let's take an example of the performance of employees.
Invalid schema example
In the current example, the primary key is emp_id but the location is dependent on the department. This schema holds the information in the location column, which does not depend on the key column, emp_id, so there is a partial dependency here. Each department has a specific location, so emp_id is partially dependent on location. If we maintain department and location with employee ID in the same table, it will create redundancy of data:
To achieve the necessary validity and remove partial dependency, we will store the data as shown in Figure 5.22 and Figure 5.23, respectively.
Valid schema example
The schema creates a department table to remove the partial dependency. Here, the location is dependent on department_id, which can be seen in the following table:
So, now, we have two tables, department and employee. department_id and emp_id stand as the primary keys for the department and employee tables, respectively. With the previously mentioned approach, we removed the partial dependency by only having department_id in the employee table:
This form says that the schema should be in 2nd Normalization form and that transitive dependencies should not exist in the single table. A transitive dependency is if A determines B (A-B) and B determines C (B-C) so that A determines C (A-C).
Invalid schema example
This schema shows that a transitive dependency such as emp_id determines delivery_status and that delivery_status determines performance_label, which is a transitive dependency and is not valid for the 3rd Normalization form. Let's elaborate more by explaining the components of the table. If John Doe delivers the package on time, he is meeting his set expectations. But if he delivers it before the given time, this infers that he has exceeded his expectations. The same scenarios are taking place for Max and Claire Yates as well. So, based on the delivery statuses of John, Max, and Claire, we can understand their performance labels. Since this is a transitive dependency, it is not a valid 3rd Normalization form:
Valid schema example
This schema creates a delivery table to measure performance, eliminate transitive dependency, and remove duplicity, as shown in the following table:
Now, emp_id only determines delivery_status_id. delivery_status is stored in the delivery table, altogether forming a valid schema, as shown in the following table:
This technique is used while building a data warehouse. It consists of facts and dimensions. Dimensions are categories of the data and their information; for example, Time, Place, and Company. These dimensions are used to build the fact tables that contain the metrics.
There are two types of data models/schemas in dimensional data modeling:
In this, there is a fact table at the center of the star, followed by several associated dimension tables. As per its similarity with the star structure, it is referred to as a star schema. It is the simplest type of data warehouse schema and is optimized for querying large datasets. In this schema, the dimension tables do not have interrelationships, so they only interact through the Fact table:
In this schema, the dimensions can be interconnected to form the Fact table. It looks just like a snowflake, which is complex looking. This is used where complex schemas are integrated to form a bigger family:
These popular modeling techniques are key to successful and future-proof database design, although these tricks won't solve problems such as having a huge number of records. We will discuss this later in this chapter.
The star schema is an appropriate choice when only a single join creates a relationship between the fact table and any dimension tables, whereas the snowflake schema is an appropriate choice when there is a requirement to have many joins to fetch the data. To simplify this, if we want to design a simple database, then the star schema is appropriate, whereas the snowflake schema is an appropriate choice for designing a complex database.
Now that you know about relational database design, modeling, and its implementation, let's talk about performance tuning.
Companies such as Facebook use MySQL to deal with petabytes of data, but it's difficult to achieve such performance with default settings and configuration. So far, we have set up a model using SQL queries. However, it is quite understandable that by getting more products, the entries in tables of e-commerce sites such as FashionMart will also be increasing. In such a scenario, the expectation from the databases would be to perform at the same durability and speed. To ensure that, we need to tune the performance of our commands. Some of the best practices to ensure performance tuning are as follows:
Instead of selecting all the columns unless required, try to mention column names in the SELECT clause. This will lower the unnecessary overhead on the MySQL query engine.
ORDER BY orders the results in ascending or descending order, but this also impacts the performance of MySQL since before showing the results, they need to be sorted.
Many web applications use pagination to show records batch by batch. You can use LIMIT and OFFSET to skip and limit the records you want to show. This will improve query performance since you don't need to bring in all the records from the result.
Sub-queries put extra load on the query engine. Instead, use joins if possible, which are treated as part of the same query.
UNION combines two query results and removes duplicates, which are classed as overhead for the query engine. UNION ALL does not remove duplicates and instead leads to better performance.
Always try to perform a join on a numerical column instead of a text column as this will save a lot of time.
A database index is a data structure that improves the speed of data retrieval operations on a database table by taking a search key as input. Indexing returns a collection of matching records efficiently. It is used to quickly locate and access the data in the database. Indexes provide good performance, but too many of them will spoil the show as every index needs to be maintained and modified for every change in the data.
These tips and practices can be used to implement production-ready MySQL databases, where you can deal with a large number of queries. We'll implement some MySQL queries in the following activity.
Imagine we have a globally successful e-commerce site for the PacktFashion store. PacktFashion is facing tough challenges to manage the products in its stock. As products are being sold out quickly, PacktFashion is unable to maintain the inventory status. One of the customers who is staying in Prague needs a quick delivery and would like to know whether their product is available in the nearby warehouse. PacktFashion is facing trouble resolving the customer's query related to product availability in stock.
This activity aims to implement a data model so that we manage the inventory of the site. We will be listing those products that are in and out of stock. Furthermore, we need to find the list of available products in the warehouse of Prague.
You need to implement the data model as shown in the following screenshot. This model will help us manage the products, sales, inventory, manufacturer, the status of the product, and the warehouse location:
Note
We have created a data model based on the snowflake schema. You can use another approach if you prefer. It should be in its 3rd Normalization form at a minimum.
Note
The code for this activity can be found here: https://packt.live/2Wvf9xB.
Perform the following steps to complete this activity:
Note
The location table and products table must have Prague and XIMO Trek shirt as one of their records. Also, the status table should have values of IN or OUT only.
The data in the tables should be similar to the following outputs:
You should get an output similar to the following:
You should get an output similar to the following:
You should get an output similar to the following:
We have completed the SQL databases section. Now, designing and implementing relational data models and querying them should not feel like a difficult task.
We'll learn about NoSQL databases in the next section.
Note
The solution to this activity can be found on page 598.
In the previous section, we looked at how SQL databases are used in the industry, along with some examples. However, with all the great things SQL databases can do, there is still significant room for improvement when it comes to dealing with complex and huge data with quick retrieval requirements. The times before and after NoSQL databases were created can be seen in the following diagram:
These NoSQL databases evolved because of the fundamental limitations of SQL databases. The data in almost every sector has grown exponentially over time and the variety of usage has also expanded. Earlier, databases were limited to enterprise resource systems, research projects, telecommunications, and so on, but today, new domains have emerged with huge data in a variety of formats, such as social media and profiling millions of users, oil and gas, with billions of binary records, e-commerce, and life sciences. Because of the single server system, complex schemas, and the expense of vertical scaling, it became difficult for SQL databases to cope with high-performance requirements.
SQL databases such as RDBMS have a fixed, static, or predefined schema and these databases are not suited for hierarchical data storage, which is why these databases are best suited for complex queries. On the other hand, NoSQL databases are non-relational or distributed database systems. They have a dynamic schema. These databases are best suited for hierarchical data storage and are not very good for complex queries.
NoSQL databases are distributed databases built and supported in various languages.
There are various kinds of NoSQL databases available. Some of the most popular include document store, columnar, key-value store, and graph databases. Let's take a look at them:
We will discuss document store and columnar databases and provide examples in the following sections. Before we dive into NoSQL in more detail, let's have a careful look at one of the most important big data theorems: CAP theorem. This theorem is used to set the agenda for the data storage system.
This theorem is the de facto standard for deciding on the most appropriate NoSQL data storage:
As shown in the preceding diagram, the databases that give good consistency and fair availability are relational databases/SQL databases. One of the databases that can guarantee partitioning and availability is Cassandra. Some of the databases that have a good mixture of consistency and partitioning include HBase, MongoDB, and Redis.
So, with this theorem, the following is clear:
So, ideally, only two designs (we call this an agenda) can be chosen (either CP or AP) while considering data storage. Let's start with CP and learn how it's implemented through the MongoDB database.
MongoDB is a document-oriented database and is a generalized form of the NoSQL database. This database is popular among the community, especially for use cases such as logging, user profiling, search engines, geolocation data, and configuration storage.
It solves the scalability limitation of relational databases with its distributed architecture. So, it is considered an alternative to SQL databases. The installation and implementation of MongoDB is quite easy. For instance, Coinbase (digital currency exchange) uses MongoDB to build faster applications, diversify data type handling, and for efficient application management purposes. Let's now discuss the advantages and disadvantages of MongoDB.
The query language that's used for MongoDB is JavaScript. The MongoDB team has provided a variety of functions to deal with collections and data.
As we discussed previously, you must be aware of SQL terminology. We will try to align the terminology of MongoDB with the former to aid our understanding:
Now, let's talk about the functions of the newly created sales MongoDB collection:
Databases are selected through the use command:
use fashionmart
show dbs
If you create the collection here, only the database is created.
db.createCollection("Sales")
This function creates the collection with the sales name in the fashionmart database.
You can create a JavaScript object and insert it into the collection:
db.Sales.insert(<JavaScript object or JSON object)
Multiple records can be combined in this object:
db.Sales.insert(
[
{
"ShirtID":"1",
"ShirtColor":"2",
"ShirtSize":"XL",
"ShirtDetails": [
{
"styleID": "4",
"designerID": "6",
}],
},
{
"ShirtID":"2",
"ShirtColor":"Red",
"ShirtSize":"XXL",
"ShirtDetails": [
{
"styleID": "5",
"designerID": "6",
}],
}
]
);
This function can be used to find the records from the collection:
db.Sales.find({ShirtID:"2"}).pretty()
It should give us the following output:
Here, we are querying the sales collection with the ShirtID criteria (the where clause in MySQL) as 2. The pretty function at the end displays the output JSON in a structured way. Here, the pretty() method is used mainly to display the result in an easier-to-read format.
Let's take another example of searching for an element value in an array and implement it using the Find query and then, alternatively, with the $elemMatch operator:
db.Sales.find({"ShirtDetails.styleID":"4"}).pretty()
It should give us the following output:
Alternately, you can use the $elemMatch operator, as follows:
db.Sales.find({ShirtDetails: {$elemMatch :{ styleID:"4"}}}).pretty()
It should give us the following output:
Both commands will retrieve the whole matching document.
In MongoDB, aggregation pipelines are a breakthrough. Using these, you can chain various operations, such as grouping, aggregation, and exploding the embedded document. Here, operations such as "exploding the embedded document" imply the deconstruction of an array field from the input documents to the output document for each element.
It is not a standard practice to use hardcoded values such as column name, alias name, and even the column that we want to perform the count on. To solve this, we can create JavaScript functions that are dynamic and return an object depending on the arguments so that we don't have to rely on the hardcoded values.
Let's understand this through an example:
db.Sales.aggregate([{$unwind: "$ShirtID"}]).pretty();
It should give us the following output:
Here, we are using a JavaScript function that returns the aggregate pipeline object:
function sales_designer(match_field, match_value) {
var matchObject = {};
var eq_object = {};
eq_object["$eq"] = match_value
matchObject[match_field] = eq_object
return [
{
"$unwind":"$ShirtDetails"
},
{
"$match": matchObject
},
{
"$project": {
"StyleID":"$ShirtDetails.styleID"
}
}
]
};
The JavaScript function executes in the following sequence:
Note
Other criteria can also be parameterized and used as input in the pipeline function.
In this example, the $unwind operator in MongoDB has been used to deconstruct an array field from the input documents to the output document for each element:
db.Sales.aggregate(sales_designer("ShirtDetails.designerID","6")).pretty()
It should give us the following output:
{ "id" : ObjectID("5e60bd8bdf7688ca2692fb05"), "StyleID" : "4" }
{ "id" : ObjectID("5e60bd8bdf7688ca2692fb06"), "StyleID" : "5" }
This command runs this pipeline and executes the commands in a sequence of declaration.
To understand MongoDB better, we will look at a simple example. For ease of understanding, we will be using the FashionMart example.
Imagine that you want to build a report for each product and that there are thousands of products in the inventory. Each product has millions of sales. If we need to build the solution using a relational database, then it will be a combination of the product's dimension and its sales, and the query will include a join between these two. This will result in extensively scanning and mapping each product to its sales data. In the case of a billion sales records for over a hundred thousand products, this join would be complex, which makes it an expensive solution. Another major drawback would be the query being performed on a single database server.
Now, let's think about this problem using MongoDB. As we know, joins are not encouraged, so the data will probably be in a denormalized manner. So, if we want reports for each product, the sales entity will be embedded inside the product entity. This way, we avoid joins and put data in a single collection. Also, using the aggregations in MongoDB, we can get interesting analytics out of the data.
Let's create a data model for this example. We will call the collection Product.
This collection will hold the product and sales information inside the same collection. It should look as follows:
{
"_id": "5498da1bf83a61f58ef6c6d5",
"p_name": "Z-1 Running shoe",
"p_manufacturer": "Z-1",
"p_buy_price": 34,
"p_created_at": Date(),
"sales": [
{
"s_sale_price": 40,
"s_profit": 6,
"p_created_at": Date(),
},
{
"s_sale_price": 41,
"s_profit": 7,
"p_created_at": Date()
}
]
}
In the data model, the example document is shown with an autogenerated mongo ID. Then, the sales object is added to the Sales array field. If you query for a product of Z-1 manufacturer, then without joining the collection, all the necessary information will be fetched.
We will implement this example in the next exercise using the MongoDB database.
In this exercise, we will implement the data model for FashionMart in MongoDB. Then, we'll display the report of a specific product, along with its sales data. Then, we will find the total products in the inventory and the bestselling product whose profit is greater than 6.
Before proceeding with this exercise, we need to set up a MongoDB database. Please follow the instructions on the Preface to install it.
Perform the following steps to complete this exercise:
$mongo
Notec
For additional help on starting the MongoDB shell, refer to the following link: https://docs.mongodb.com/manual/mongo/#start-the-mongo-shell-and-connect-to-mongodb. For a quick reference of all the important MongoDB shell commands, visit: https://docs.mongodb.com/manual/reference/mongo-shell/.
show dbs
You should get the following output:
admin 0.000GB
config 0.000GB
demo 0.000GB
local 0.000GB
use fashionmart
You should get the following output:
switched to db fashionmart
Now, you can directly start using the database without creating it. It will be created when you create a collection in it.
db.createCollection("products");
You should get the following output:
{
"ok" : 1,
"$clusterTime" : {
"clusterTime" : Timestamp(1593812618, 7),
"signature" : {
"hash" : BinData(0,"XLov2h+xWNKbYrcWIqsCOROMuKY="),
"keyId" : NumberLong("6808867179686002691")
}
},
"operationTime" : Timestamp(1593812618, 7)
}
In MongoDB, we do not have a structure for the collection, so we will start with data insertion in the products collection.
todayDate=new Date()
You should get the following output:
ISODate("2019-12-21T16:05:48.765Z")
We can also create a variable for time efficiency. Since the query language to access MongoDB is JavaScript, the way we access the interface is a mix of procedural and declarative.
products={
"p_name": "Z-1 Running shoe",
"p_manufacturer": "Z-1",
"p_buy_price": 34,
"p_created_at": todayDate,
"sales": [
{
"s_sale_price": 40,
"s_profit": 6,
"p_created_at": todayDate,
},
{
"s_sale_price": 41,
"s_profit": 7,
"p_created_at": todayDate
}
]
}
You should get the following output:
db.products.insert(products);
You should get the following output:
WriteResult({ "nInserted" : 1 })
total_sales_for_product_pipeline=[
{
"$match": {
"p_name": "Z-1 Running shoe"
}
},
{
"$project": {
"product":"$p_name",
"total_sales": {
"$size": "$sales"
}
}
}
];
With this, we have created a JavaScript array object that will hold the total_sales pipeline operations.
db.products.aggregate(total_sales_for_product_pipeline).pretty()
You should get the following output:
{
"_id" : ObjectId("5dfe4c0b211649ccef9a7426"),
"product" : "Z-1 Running shoe",
"total_sales" : 2
}
Here, we have filtered the records for shoe and then performed a count over the sales object field. We used the aggregate pipelining function to perform the functions. This way, various pipelines can be built as JavaScript objects and then executed to get the results.
function total_product_sales(match_product,
projection_field,
count_field) {
var matchObject = {};
var countObject = {};
matchObject[projection_field] = match_product
countObject["$size"] = "$"+count_field;
return [
{
"$match": matchObject
},
{
"$project": {
"product":"$"+projection_field,
"total_sales": countObject
}
}
]
};
Here, we have created a total_product_sales function that will return the desired pipeline object. You can modify this function to suit your requirements. Now, it is time to call this function inside the aggregate function.
db.products.aggregate(total_product_sales("Z-1 Running shoe", "p_name","sales")).pretty();
You should get the following output:
{
"_id" : ObjectId("5dfe4c0b211649ccef9a7426"),
"product" : "Z-1 Running shoe",
"total_sales" : 2
}
db.products.count();
You should get the following output:
1
function gt_product_sales(gt_value, compare_field) {
var gt_object = {};
var match_object = {};
gt_object["$gt"] = gt_value
match_object[compare_field] = gt_object
return [
{
"$unwind":"$sales"
},
{
"$match": match_object
},
{
"$project": {
"product":"$p_name",
"sales":"$sales"
}
}
]
};
Here, we created a JavaScript function called gt_product_sales(). This function will take gt_value and compare_field and return the JavaScript object that acts as input to the MongoDB aggregate function. Also, we used the $unwind operator to flatten the document and create one document for each element in the sales array field.
db.products.aggregate(gt_product_sales(6, "sales.s_profit"));
You should get the following output:
This command executes the aggregate function to get the products that have a profit greater than the given gt_value. In our case, we have passed a value of 6.
Note
To access the source code for this specific section, please refer to https://packt.live/3epAaQ0.
With that, you have addressed a problem statement in MongoDB. Now, you can start thinking of various ways to perform similar operations and try to implement that on your local instance. By implementing this exercise, we have understood the importance of MongoDB to overcome complexities while dealing with large databases. Unlike SQL, since it's a schema-less NoSQL database, there are flexibilities while storing JSON documents. So far, we've learned how MongoDB reduces the complexity of deployment by speeding up application development.
Now, let's move on to data modeling.
In this section, we will discuss how to model the data for MongoDB. NoSQL data modeling is a new paradigm and MongoDB, being a document store database, saves data in BSON (Binary JSON) format. So, to make the most of MongoDB, the modeler has to think in JSON.
Let's go over the techniques of data modeling in MongoDB.
The reason why a record is called a document is that MongoDB is expected to find all the required information in one place. Imagine a real-world scenario where you are reading an important document. It's obvious that you expect to find all the information in the same document. It would be annoying if you could only read pieces of information through various documents.
To make reading over all the information as low latency as possible, MongoDB does not encourage joins in the collections. Here, we are talking about data duplication, which is a violation of normalization rules, but it is the key to quick reads from storage discs.
Let's take an example from the FashionMart use case. We created two tables, Products, and Sales. Right now, you can get sales data for a product by joining these two tables, but what if there are thousands of products and billions of sales records? Especially if you think in terms of distributed computing, the Sales table data is partitioned and stored in various machines in the cluster, so when you join the records of Products and Sales, then there will be a large amount of data transfer, as shown in the following diagram:
On the left-hand side, let's assume that the Products and Sales data is stored in separate collections, that is, separate nodes. As you can see, there is a lot of data exchange over the network because of the joins. However, on the right-hand side, the Products and Sales data is together, so there are no joins. This means that less data exchange happens over the network.
The resulting data model will be of the embedded type, where the Sales data will be embedded in the Products collection, as shown in the following query:
{
"_id":ObjectId("5dfe4c0b211649ccef9a7426"),
"p_name":"Z-1 Running shoe",
"p_manufacturer":"Z-1",
"p_buy_price":34,
"p_created_at":ISODate("2019-12-21T16:38:27.274Z"),
"sales":[
{
"s_sale_price":40,
"s_profit":6,
"p_created_at":ISODate("2019-12-21T16:38:27.274Z")
},
{
"s_sale_price": 41,
"s_profit":7,
"p_created_at":ISODate("2019-12-21T16:38:27.274Z")
}
]
}
MongoDB is designed not to have joins, which is why there are two workarounds for this application: side joins and out-of-the-box joins. These are new concepts where you maintain SQL-like relationships between MongoDB collections. Joins have to be performed by the application layer. Let's have a look at one such example:
products
{
"_id":ObjectId("5dfe4c0b211649ccef9a7426"),
"p_name":"Z-1 Running shoe",
"p_manufacturer":"Z-1",
"p_buy_price":34,
"p_created_at":ISODate("2019-12-21T16:38:27.274Z"),
}
Sales
{
"_id":ObjectId("4dfe4c0b475649ccef9a7426"),
"s_sale_price":40,
"s_profit":6,
"s_created_at":ISODate("2019-12-21T16:38:27.274Z")
"p_id":ObjectId("5dfe4c0b211649ccef9a7426")
},
{
"_id":ObjectId("2dfe4c0b475649jfkf9a7426"),
"s_sale_price": 41,
"s_profit":7,
"s_created_at":ISODate("2019-12-21T16:38:27.274Z")
"p_id":ObjectId("5dfe4c0b211649ccef9a7426")
}
It should give us the following output:
The procedure to join these two collections is to take the product's ID and query the sales collection against the matching p_id field. In this case, we can create an index for the p_id column so that it can be retrieved and matching can be made simpler. This can be made possible at the application layer in a programming language by taking an ID from a collection and searching the records of the target collection based on the ID. You can also use the $lookup operator with the aggregate pipeline to join the two collections.
With that, we have covered the various techniques of data modeling in MongoDB. There's lots of domain-specific variety that can be used for the respective domains.
To make the database production-ready, there are some tricks to consider:
Using these techniques, you can build a high-performing MongoDB instance. How you use your resources is more important than what resources you use.
Let's take the example of a globally successful e-commerce site for the PacktFashion store. PacktFashion is facing tough challenges with managing the user logs on their site. They want to store users who are either visitors or buyers on their site so that they can send them sale offers in the future.
This activity aims to add product information and store information about users and their logs on our website. Furthermore, we will generate a report to show the user, along with the products they view, and the action (bought or not bought) that was performed by them.
You need to implement the data model shown in the following screenshot. The user_logs collection will store the events performed by the users, that is, whether they bought or returned a product. The many-to-many relationships between products and users are shown in this collection:
Note
The code for this activity can be found here: https://packt.live/2ZlIZGy.
Perform the following steps to complete this activity:
The data in the collection should be similar to the following output:
The data in the collections should be similar to the following outputs:
You should get the following output:
With this activity, we've completed the MongoDB section. We have successfully learned about MongoDB's design, its implementation, and how it differs from relational databases. Now, let's move on to the next NoSQL database, Cassandra.
Note
The solution to this activity can be found on page 606.
The topic of NoSQL isn't complete without discussing columnar storage. In Cassandra, the data is stored and read in columns instead of rows. Each column is stored separately with the same row offset related to the table. We will study offset in the Data Modeling subsection:
Cassandra is a mixture of structured data, consistency, high scalability, and no single point of failure, and also has a powerful column family design. This particular database was developed by Facebook in 2008 and is horizontally scalable. Imagine you run a multinational company. You have thousands of employees, but you want the information of all the employees at the London location. In such a scenario, a traditional relational database will scan each row, parse it, and compare the u_location column value against the query. This is extremely time-consuming.
On the other hand, the columnar store only requires columns to operate. It not only saves resources but reduces the query processing time as well.
Let's look at an example where denormalized user data is stored in a user table and then converted into columnar format, as shown in the following diagram:
Now, each column is stored separately with metadata and offset the same as a row. The table user is converted into a columnar format where each column is separate, where, if queried with u_fname, then only u_id and u_fname will be brought into memory.
The query language that's used for Cassandra is CQL. Let's go over the terminology associated with it.
Just like how we understood the terminology of MongoDB, we will do the same to find out more about the key concepts of Cassandra. Before we dive into an example, we should know the basic terminology and concepts:
The Cassandra terminology is similar to that of the relational database:
This is the collection of tables and column families as it stores all of its columns separately. There is one keyspace per node in a cluster.
Keyspace has two attributes, Replication, and Durable Writes, which gives us the freedom to configure a particular keyspace. Let's understand replication first.
Replication has two options, as shown in the following table:
Durable write
If this is set to TRUE, then Cassandra maintains an update log for every data modification so that if the system crashes, we have the logs of the data to recover. The default value is TRUE; it should be kept as the default if possible. Just to show an example, we have added some syntax to change it to false.
Let's create a keyspace called fashionmart to understand replication and durable write better, as shown in the following code:
CREATE KEYSPACE fashionmart
WITH replication = {'class':'SimpleStrategy',
'replication_factor': 3}
AND DURABLE_WRITES = false;
DESC keyspaces;
USE fashionmart;
Here, 'SimpleStrategy' stands for "Strategy name" and describes a simple replication factor for the cluster. The replication factor value is 3.
This is like a table in that it is a collection of columns. You can create a ColumnFamily like so:
CREATE COLUMNFAMILY user(
u_id int PRIMARY KEY,
u_lname text,
u_age int,
u_location text,
u_created_at timestamp,
u_last_visited timestamp);
It should give us the following output:
Just like a relational database, you will have to specify column names and data types.
You can use the SELECT clause to retrieve records from COLUMNFAMILY. The Cassandra query engine locates the column with a byte offset and brings only the specified column into memory. This saves a lot of overhead while filtering the records:
SELECT u_fname, u_lname, u_age FROM user;
The SELECT query retrieves u_fname, u_lname, and u_age from the user COLUMNFAMILY.
Just like relational databases, you can also use aggregate functions, such as SUM, AVG, MIN, and MAX. This way, you can deal with data inside Cassandra using CQL.
Let's understand Cassandra in more detail using an example.
We have been discussing the FashionMart use case in this chapter and it has evolved from being a simple products and sales model to a SQL normalized model that has a searchable NoSQL design. Now, we will discuss the user side of it. We'll assume that there will be millions of users accessing this website, so we need to store their information to understand our audience better.
In this example, we need to store the user table in such a way that it can store the data of millions of users and give us access to the data with very low latency and high availability. Normally, with a relational database, this kind of table requires an index, but when querying an unindexed column that has this many records, it can take a lot of time to retrieve the data.
Instead of this, we will utilize the Cassandra columnar storage format, where each table will be stored separately in a column family. Let's implement this in the next exercise.
Let's take the previous problem statement where we need to store user information in a Cassandra column family called user. This exercise aims to store the number of visits that users have made to the site and perform queries on it. We will find the maximum and the minimum number of visits, along with the average age of the user.
Before proceeding with this exercise, we need to set up the Cassandra database. Please follow the instructions on the Preface to install it.
Perform the following steps to complete this exercise:
Windows:
Open Cassandra CLI application
Linux:
root@ubuntu: -$ cqlsh
macOS:
MyMac:~ root$ cqlsh
You should get the following output:
cqlsh>
Note
If you are having issues starting Cassandra CLI, you may try starting it with the cassandra -f command. For installation instructions for Cassandra on Windows 10, you can refer this link: https://medium.com/@sushantgautam_930/simple-way-to-install-cassandra-in-windows-10-6497e93989e6. For installation instructions on Linux, you can visit https://cassandra.apache.org/doc/latest/getting_started/installing.html.
CREATE KEYSPACE fashionmart
WITH replication = {'class':'SimpleStrategy',
'replication_factor' : 3};
use fashionmart;
u_id int PRIMARY KEY,
u_location text,
u_created_at timestamp,
u_last_visited timestamp);
DESC tables;
You should get the following output:
user
SELECT * FROM user;
You should get the following output:
Column families in Cassandra are any group of columns, be they a table or an actual family. Now, let's add data to it.
INSERT INTO user(u_id, u_fname, u_lname, u_age,
u_location, u_created_at,
u_last_visited)
VALUES(1,'Derek', 'Gates', 18, 'San Francisco',
toTimestamp(now()), '2019-10-30 12:05:00+0000');
BEGIN BATCH
INSERT INTO user(u_id, u_fname, u_lname, u_age,
u_location, u_created_at,
u_last_visited)
VALUES(2, 'Max', 'Frost', 15, 'London', toTimestamp(now()),
toTimestamp(now()));
INSERT INTO user(u_id, u_fname, u_lname, u_age,
u_location, u_created_at, u_last_visited)
VALUES(3,'Isabela', 'McGuire', 29, 'Seattle',
toTimestamp(now()), '2019-11-07 07:30+0000');
INSERT INTO user(u_id, u_fname, u_lname, u_age,
u_location, u_created_at,
u_last_visited)
VALUES(4,'Roger', 'Moore', 68, 'Amsterdam',
toTimestamp(now()), '2019-12-02 14:50+0000');
APPLY BATCH;
SELECT * FROM user;
You should get the following output:
ALTER COLUMNFAMILY user ADD u_visits int;
By default, it will assign a null value to each record. Let's add the number of visits for each record with batch update statements.
BEGIN BATCH
UPDATE user
SET u_visits=1
UPDATE user
SET u_visits=21
WHERE u_id=2;
UPDATE user
SET u_visits=14
WHERE u_id=3;
UPDATE user
SET u_visits=18
WHERE u_id=4;
APPLY BATCH;
SELECT * FROM user;
You should get the following output:
SELECT MAX(u_visits) AS max_visits FROM user;
You should get the following output:
The MAX function will find the highest value in the u_visits column and show it within the max_visits header.
SELECT MIN(u_visits) AS min_visits FROM user;
You should get the following output:
The MIN function will find the lowest value in the u_visits column and show it within the min_visits header.
SELECT SUM(u_visits) as total_visits FROM user;
You should get the following output:
Here, we have used the SUM aggregation function to get the result of total visits, that is, 54 in this query.
SELECT AVG(u_age) as average_age FROM user;
You should get the following output:
Here, we have used the AVG aggregation function to calculate the average age on the u_age column. We got an average age of 32 in this query.
Note
To access the source code for this specific section, please refer to https://packt.live/3eocOKI.
In this exercise, we have successfully implemented the inventory management of an e-commerce site using Cassandra. The column family, most visits, least visits, and total visits, along with the average age group, have been executed. We'll now move on to the data modeling techniques that can be performed in Cassandra in the next section.
We will discuss the structure and benefits of columnar storage in this section. Let's start with keyspace. A keyspace is a collection of column families, similar to the database as a collection of tables in RDBMS:
Each column is stored on the same row offset as other column values. The internal structure of a column family is shown in the following diagram:
Here, each row has one key that is offset. Internally, each Cassandra table is a list of nested key-value pairs, that is, a map of maps. The column is a unit of storage and the row is used to replicate the data across the cluster. A column family contains columns and the table is a super column family.
We know that Cassandra is a columnar database and has the functionality to group columns to form a column family. The efficiency of this can be improvised if we create the column family based on query patterns. So, the columns that are accessed in a single query are stored in consecutive memory locations and are faster to retrieve. We can decide on the columns that need to be grouped based on the following questions:
Once the column families have been designed properly, Cassandra will use the row key as the offset and read the required data. This is a good strategy in the case of read-heavy designs.
Data is distributed in various partitions and depends on the partition_key. The result of the hash function over the partition key is used to assign a slot/node to the data. Sometimes, the data is not distributed across clusters due to the presence of similar IDs or similar data in the partition column. So, our job is to evenly distribute the data by making sure that the partition_key is sufficiently unique. Also, the partition column we choose should have uniformly distributed values.
Let's understand it better through an example.
If you choose country column for partitioning the data and if 70% of the records are from the United States, then it's pretty obvious that 70% of data will end up on a single partition. So, the values should be distributed uniformly.
Apart from storing large quantities of data, Cassandra is capable of scaling out in keeping with the growth of data. This helps us pick out the right columns to avoid unbalanced cluster scenarios such as "Hot Spots" (one node ends up with higher tokens than others) and "Wide Rows" (a partition row grows larger than the other rows). To adjust to data growth and maintain even data distributions, you can use the allocate_tokens_for_keyspace setting.
Cassandra is very good at micro writes where the data is pushed to map of maps to ensure that the performance has a constant lookup time. If you are looking for a data store that can perform micro writes with high availability, Cassandra would be a good choice.
In this section, we discussed the Cassandra storage strategy and the key things we need to consider while designing a successful data model. A better design model can give you acceptable performance. In the next section, we will dive deep into performance tuning and best practices.
Cassandra, is a distributed database, has been very popular among the NoSQL domain, but still, some tips need to be considered before running Cassandra in production:
These tips are crucial to getting the most out of your Cassandra setup as it can prove to be a really good form of data storage for your use cases.
Now that we've had an overview of Cassandra, we'll try to go through the following problem statement and apply what we have learned. Let's assume you are heading the FashionMart project and, at this stage, the company has asked you to have feedback from customers to have a continuous evaluation of their activities. With millions of customers and their feedback, we have to create and maintain the column family for feedback and establish the total amount of feedback.
You need to implement the data model shown in the following screenshot. The feedback_logs column family will store the feedback and its details, such as user id, location, and creation date:
Note
The code for this activity can be found here: https://packt.live/3gTafC1.
Perform the following steps to complete this activity:
The data in feedback_logs should be similar to the following output:
You should get the following output:
With this activity, we've completed the Cassandra section. We have successfully learned about Cassandra's design, its implementation, and how it differs from relational databases. Now, let's move on and combine all the databases for various business use cases.
Note
The solution to this activity can be found on page 613.
This section is all about the implementation of what we've learned in this chapter. So far, we have gone through the nitty-gritty of databases and discussed MySQL for RDBMS and NoSQL databases such as MongoDB and Cassandra. Now, let's take the FashionMart store example and think about a scenario where we have the following schema:
In a case where the number of daily active users is 1 million and the Sales data is around 5 billion, then performing joins using MySQL between the dimensions will cause serious overhead. But here, the benefit is consistency, so if you want to update a Sales record, then it will be referenced in your next query.
Now, think about solving this use case using MongoDB. If we embedded a Sales entity inside Products by denormalizing the sales data, then it will drastically reduce latency as no data is shuffling or joins over the network. So, if you want to update sales data, then you will have to update it everywhere it is present due to data duplicity, otherwise we will lose consistency:
However, if we put both things together, then we could use them as follows:
The following diagram is a pretty good example of a MongoDB and MySQL-based hybrid approach in a real-world project scenario where users can achieve eventual consistency:
Likewise, we use the hybrid (SQL and NoSQL) approach in the majority of projects in the real world. So, whenever there is a trade-off between consistency and availability, always remember this example.
In this chapter, we had a hands-on overview of SQL and NoSQL databases. We looked into their usage, their best practices, and even various strategies to achieve near-perfect latency.
We learned about how to create a table, insert the data into a table, and how to select data from a table in various databases. Furthermore, we have explored complex aggregation over data, along with the resulting visualizations through sample data analytics.
Understanding the different types of databases and their evolutions, along with the various types of query languages and data modeling from this chapter, will help you to get to grips with the essence of big data file formats in the coming chapters. To start with big data, you will need to learn about the use of file formats. Such concepts can be assimilated by understanding the nitty-gritty of databases.
In the next chapter, you will learn about big data file formatting, along with compression, partitioning, and read-write strategy and performance optimizations.
3.21.98.207