Before we can leverage application data and user interaction insights, we need to learn how to persistently store and manage data of varying schemas in an organized manner, using robust and distributed database systems. Two types of database systems will be discussed in detail, namely, relational and nonrelational databases. Examples of interfacing with PostgreSQL and MongoDB will demonstrate the use cases for each. In addition, advanced features such as fuzzy matching and full-text indexing will be exhibited, and boilerplates will be provided for the reader to use as building blocks for their own applications. Finally, the reader will learn how to seamlessly interface such databases with Streamlit to visualize data and to execute create, read, update, and delete operations. Installing PostgreSQL and pgAdmin is required to follow along with this chapter.
5.1 Relational Databases
Since most of the apps made with Streamlit manipulate data in one way or another, that data sometimes needs to be stored on disk for later use. In most cases, the data is in a specific format or, in other words, structured. We can take advantage of this property by storing the data in a SQL database.
PostgreSQL will be the tool with which we demonstrate this use case, as it is free and open source.
5.1.1 Introduction to SQL
Structured Query Language (SQL) is used to perform CRUD operations on data with a similar structure, that is, different entries (rows) of data with the same features (columns). An example of a relational SQL database is a company directory of employee data split into two separate tables, namely, the employees’ personal information and their pay grades. Both forms of data can be represented as tables in the database, as shown in Tables 5-1 and 5-2, with a one-to-one linkage between the tables indicating that every employee must have a pay grade. A one-to-one relationship is one in which every row in one table holds the ID of a corresponding row in another table, extending the information of the first table with the second.
There are more types of mapping between tables, such as one-to-many and many-to-many relationships, but we will not discuss them as they would add little value to the purpose of this book. For the sake of demonstrating a real-life scenario, however, we will move forward with a one-to-one relationship in some of the examples.
As a demonstration, consider the following four CRUD operations on the tables shown in Tables 5-1 and 5-2. The SQL statement for the delete operation is shown here; sketches of the remaining three follow below:

Create: To make a new person entry with pay grade 3

Read: To retrieve all pay grade data with a base salary level equal to L3

Update: To change the pay grade of Bob to pay grade 2

Delete: To remove pay grade 4 from existence

```sql
DELETE FROM PayGradeLevels WHERE id = 4;
```
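The remaining three statements might look as follows. This is a sketch that assumes the table names Persons and PayGradeLevels and space-free column identifiers (Name, DOB, PayGradeID, BaseSalary) derived from Tables 5-1 and 5-2; the new person's name is purely illustrative:

```sql
-- Create: add a new person entry holding pay grade 3
INSERT INTO Persons (Name, DOB, PayGradeID)
VALUES ('Carol', '1995-06-15', 3);

-- Read: retrieve all pay grade rows with a base salary level of L3
SELECT * FROM PayGradeLevels WHERE BaseSalary = 'L3';

-- Update: change the pay grade of Bob to pay grade 2
UPDATE Persons SET PayGradeID = 2 WHERE Name = 'Bob';
```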
Table 5-1. Persons table

| ID | Name  | DOB        | Pay Grade ID |
|----|-------|------------|--------------|
| 1  | Adam  | 01/01/1990 | 2            |
| 2  | Sara  | 01/01/1980 | 1            |
| 3  | Bob   | 01/01/1970 | 1            |
| 4  | Alice | 01/01/2000 | 3            |
Table 5-2. Pay grade levels table

| ID | Base Salary | Reimbursements | Bonuses |
|----|-------------|----------------|---------|
| 1  | L3          | L2             | L0      |
| 2  | L1          | L1             | L1      |
| 3  | L3          | L3             | L3      |
| 4  | L1          | L3             | L1      |
One of the most important concepts that anyone building a database needs to know is the primary key; this refers to the ID column in both Tables 5-1 and 5-2. It is a unique, always valid (that is, not null) identifier used to refer to a single row. Moreover, it is indexed, meaning it is internally managed by the database in a way that speeds up queries that filter by ID. It is worth mentioning that indexing can be applied to any column, but doing so indiscriminately is not preferred, as it can grow the disk space usage of that column by up to a factor of two. Indexing enables quick searches for specific values among all rows; hence, it is encouraged for columns that represent IDs.
Another term, foreign key, will resonate in any backend developer’s ears: it is an ID used to refer to a row in a different table, like the Pay Grade ID column in Table 5-1, which points to the ID column in Table 5-2.
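To make the two concepts concrete, here is a sketch of how the two tables might be defined in PostgreSQL, again assuming space-free column identifiers:

```sql
CREATE TABLE PayGradeLevels (
    ID             SERIAL PRIMARY KEY,  -- unique, not null, and indexed
    BaseSalary     TEXT,
    Reimbursements TEXT,
    Bonuses        TEXT
);

CREATE TABLE Persons (
    ID         SERIAL PRIMARY KEY,
    Name       TEXT,
    DOB        DATE,
    -- foreign key: points to a row in PayGradeLevels
    PayGradeID INTEGER REFERENCES PayGradeLevels (ID)
);
```

Declaring the ID columns as SERIAL is also what allows the database to generate ID values on the developer’s behalf, as we will see shortly.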
5.1.2 Connecting a PostgreSQL Database to Streamlit
Before proceeding with the integration with Streamlit, we will first add some data into the database using raw SQL, as shown in Figure 5-4. Notice that we did not need to set the ID values for either table, as these are generated on the developer’s behalf.
.streamlit/secrets.toml
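This file holds the connection credentials that Streamlit exposes through st.secrets. A minimal sketch with placeholder values, which you should replace with your own:

```toml
[postgres]
host = "localhost"
port = 5432
dbname = "company"
user = "postgres"
password = "your_password"
```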
To interface Python with PostgreSQL, we need a capable library. For this example, we will use psycopg2, but other libraries such as SQLAlchemy can do the job as well, and they will be introduced in later chapters.
As discussed in previous chapters, Streamlit reruns the Python script upon user actions. This was not a big deal before, but it can introduce more overhead in the current example, as a new database connection would be established with every rerun. To avoid that unnecessary call, we can cache the first established connection. Streamlit allows caching function calls out of the box via its native function decorator st.cache, which also accepts other parameters, for instance, an expiration period for the cache; once the cache is invalidated, the next function call reexecutes the function body. The decorator is applied on the fifth line of Listing 5-2 to save the established connection.
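Listing 5-2 is not reproduced here, but a minimal sketch of the pattern just described might look like the following, assuming the credentials live under the [postgres] section of secrets.toml shown earlier (note the decorator on the fifth line):

```python
import psycopg2
import streamlit as st

# allow_output_mutation is needed because connection objects cannot be hashed
@st.cache(allow_output_mutation=True, ttl=3600)  # cache expires after one hour
def init_connection():
    return psycopg2.connect(**st.secrets["postgres"])

conn = init_connection()
```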
5.1.3 Displaying Tables in Streamlit
After querying data from the database, we can display it in text format, or we can use more visually appealing tools from Streamlit, which require a modification of how the data is represented.
Among data scientists and developers, it is common practice to parse structured data, whether it is sensor values, identification information, or any repeating data with a structure, into a Pandas dataframe. Dataframes are essentially NumPy arrays with extra capabilities, such as named columns and SQL-like querying techniques. That being said, they also share the fast vectorization capabilities of plain NumPy arrays, which is essentially a parallelized way of performing mathematical computations on an array as a whole instead of element by element.
df_demo.py
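The contents of df_demo.py are not reproduced here; a sketch of what such a demo might contain, reusing the cached conn from the previous listing and the Persons table, follows (note that PostgreSQL lowercases unquoted column names):

```python
import pandas as pd
import streamlit as st

# Pull the query results straight into a dataframe
df = pd.read_sql("SELECT * FROM Persons;", conn)

# A vectorized operation: applied to the whole column at once,
# not row by row
df["name_length"] = df["name"].str.len()

st.dataframe(df)  # render as an interactive, scrollable table
```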
5.2 Nonrelational Databases
While in most use cases you will be dealing with a structured dataset where the schema, attributes, data types, and metadata will be known beforehand, there are indeed instances where none of these will be known. Imagine that you are trying to create a search engine, where users can input documents with varying lengths, numbers of headers or pictures, and types of media. It is simply impossible to know what sort of schema or table to provision to store such data. In such applications, the utility of a NoSQL database, the likes of MongoDB, is essential to store and manage unstructured data.
5.2.1 Introduction to MongoDB
MongoDB enables you to store data as a JSON document with varying attributes and data types into a collection with a varying schema. Please note that with MongoDB, a document is analogous to a row, and a collection is analogous to a table in a relational database system. Even if your dataset is structured to begin with, there is no harm in using a NoSQL database system, as you may eventually scale your application enough to end up dealing with an abundance of unstructured data. In addition, if you require features such as full-text indexing where every word in every document in every collection is reverse-indexed, or if you need fuzzy matching in your queries to mitigate the effect of typos, then MongoDB is the place to be.
To provide an overview of MongoDB and how it can be interfaced with Streamlit, in this section we will create a search engine application for restaurants using a publicly available unstructured dataset of restaurant ratings. The goal of the application is to allow the user to search for restaurants by cuisine and address. For the cuisine, a simple exact match against a predefined list of cuisine types will be used to filter the data. For the address, however, full-text indexing will be required to ensure that the address can be matched with n-grams (contiguous sequences of words or tokens in a document of text) corresponding to the borough and street address, which are located in disparate objects and arrays within the document, as shown in Figure 5-8. In addition, fuzzy matching will be needed to guarantee that similar but nonidentical search queries are matched with terms that differ from the record by at most two characters.
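For reference, a document from the sample restaurant ratings dataset has roughly the following shape (abbreviated here); note how the borough and the street sit in different parts of the document:

```json
{
  "name": "Riviera Caterer",
  "cuisine": "American",
  "borough": "Brooklyn",
  "address": {
    "building": "2780",
    "street": "Stillwell Avenue",
    "zipcode": "11224",
    "coord": [-73.98241999999999, 40.579505]
  },
  "grades": [
    { "grade": "A", "score": 5 }
  ],
  "restaurant_id": "40356018"
}
```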
5.2.2 Provisioning a Cloud Database
1. To begin with, you may set up an account and project at www.mongodb.com/atlas/database.

2. Provision the free M0 Sandbox cluster as shown in Figures 5-9 and 5-10. If necessary, modify the hosted region to ensure minimal latency between your database and server.

3. Once the cluster has been provisioned, you may proceed with whitelisting the IP addresses that you wish to access the database from in the Network Access menu item. While it is not recommended, you may whitelist all addresses as shown in Figure 5-11 to access the database from anywhere.

4. Subsequently, you will need to create user credentials to access the database from the Overview tab in the Databases menu item, as shown in Figure 5-12.

6. Finally, you may upload your own dataset or alternatively load a sample dataset provided by MongoDB in the Collections tab of the Databases menu item, as shown in Figure 5-15. For this example, we will be using the sample restaurants collection from MongoDB’s own datasets.
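With the cluster provisioned and credentials created, connecting from Python uses the pymongo driver. A minimal sketch, assuming the Atlas connection string is stored under a [mongo] section in secrets.toml:

```python
import pymongo
import streamlit as st

# Cache the client so that Streamlit reruns do not open new connections
@st.cache(allow_output_mutation=True)
def init_client():
    # uri is the connection string copied from the Atlas "Connect" dialog
    return pymongo.MongoClient(st.secrets["mongo"]["uri"])

client = init_client()
collection = client["sample_restaurants"]["restaurants"]
```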
5.2.3 Full-Text Indexing
1. To create an index, open the Search tab found in the Databases menu item, and click Create Search Index, as shown in Figure 5-16.

2. Subsequently, select the JSON Editor from the Configuration Method tab.

3. Select the database and collection that you want to create the index for, enter a name for the index, and enter the following index definition, as shown in Figure 5-17:

```json
{"mappings": {"dynamic": true}}
```
5.2.4 Querying the Database
To query the database, we will construct an aggregation pipeline with four stages, sketched below:

1. Search using fuzzy matching

2. Project documents with the search score

3. Filter documents

4. Limit the results
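A sketch of such a pipeline using pymongo follows, assuming the search index created earlier was named default and that cuisine and search_term hold the user's selections; the field paths follow the document shape shown in Figure 5-8:

```python
pipeline = [
    # 1. Full-text search with fuzzy matching (at most two character edits)
    {"$search": {
        "index": "default",
        "text": {
            "query": search_term,
            "path": ["borough", "address.street"],
            "fuzzy": {"maxEdits": 2},
        },
    }},
    # 2. Project the fields of interest along with the relevance score
    {"$project": {
        "name": 1,
        "cuisine": 1,
        "borough": 1,
        "address.street": 1,
        "score": {"$meta": "searchScore"},
    }},
    # 3. Filter documents by the selected cuisine
    {"$match": {"cuisine": cuisine}},
    # 4. Limit the number of returned results
    {"$limit": 10},
]

results = list(collection.aggregate(pipeline))
```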
For information regarding additional options for the aggregation pipeline, please refer to https://docs.mongodb.com/manual/reference/aggregation/.
5.2.5 Displaying Tables in Streamlit
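Displaying the results mirrors what we did with PostgreSQL: the queried documents are flattened into a dataframe and rendered with Streamlit. A minimal sketch, assuming results holds the output of the aggregation pipeline above:

```python
import pandas as pd
import streamlit as st

# Flatten the nested documents into a dataframe
# (address.street becomes its own column)
df = pd.json_normalize(results)

# Drop the MongoDB ObjectId column, which is not useful for display
df = df.drop(columns=["_id"], errors="ignore")

st.dataframe(df)
```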
5.3 Summary
In this chapter, we covered relational and nonrelational databases, which are used to store and retrieve structured and unstructured data, respectively. Specifically, we learned how to provision a PostgreSQL database and interface it with a Streamlit application to host and process our structured datasets. Similarly, we learned how to provision a MongoDB cloud database to store unstructured data consisting of collections of documents with varying schemas, objects, and elements. We saw how MongoDB can be used to create a full-text index, where every token in each document is indexed for enhanced querying, and we demonstrated how to perform fuzzy matching of query terms with document tokens, thereby mitigating the effects of typos. Finally, we saw how to wrap all of these operations, from establishing a database client to writing an aggregation pipeline to postprocessing queried results, within a Streamlit application.