
5. Database Integration

Mohammad Khorasani (Doha, Qatar), Mohamed Abdou (Cambridge, United Kingdom), and Javier Hernández Fernández (Doha, Qatar)

Before we begin to utilize application data and user interaction insights, we need to learn how to store and manage data of varying schemas persistently, in an organized manner, using robust and distributed database systems. Two types of database systems will be discussed in detail, namely, relational and nonrelational databases. Examples of interfacing with PostgreSQL and MongoDB will be shown to demonstrate the use cases for each. In addition, advanced features such as fuzzy matching and full-text indexing will be exhibited, and boilerplates will be provided to the reader to use as building blocks for their own applications. Finally, the reader will also learn how to seamlessly interface such databases with Streamlit to visualize data and to execute create, read, update, and delete operations on it. Installing PostgreSQL and pgAdmin is required to fully follow along with this chapter.

5.1 Relational Databases

Since most of the apps made with Streamlit manipulate data in one way or another, that data sometimes needs to be stored on disk for later use. In most cases, this data is in a specific format or, in other words, structured. We can make use of this property to store it in a SQL database.

PostgreSQL will be the tool we use to demonstrate this use case, as it is free and open source.

5.1.1 Introduction to SQL

Structured Query Language (SQL) is used to perform CRUD operations on data with a similar structure, meaning different entries (rows) that share the same features (columns). An example of a relational SQL database is a company directory of employee data split into two separate tables, namely, the employees’ personal information and their pay grades. Both forms of data can be represented as tables in the database as shown in Tables 5-1 and 5-2, where there is a one-to-one linkage between the two tables indicating that every employee shall have a pay grade. A one-to-one relationship is one in which every row in one table holds the corresponding ID of a row in another table, extending the information of the first table with the second.

There are further types of mapping between tables, such as one-to-many and many-to-many relationships, but we will not discuss them as they would not add value to the purpose of this book. For the sake of demonstrating a real-life scenario, we will move forward with one-to-one relationships for some of the examples.

CRUD, which stands for Create, Read, Update, and Delete, refers to the main operations you can do within a database, with SQL commands as follows:
  • Create: To make a new person entry with pay grade 3

INSERT INTO Persons (Name, DOB, PayGradeId) VALUES ('Charlie', '01/01/1995', 3);
  • Read: To retrieve all pay grade data with a base salary level equal to L3

SELECT * FROM PayGradeLevels WHERE BaseSalary = 'L3';
  • Update: To update the pay grade of Bob to pay grade 2

UPDATE Persons SET PayGradeId = 2 WHERE ID = 3;
  • Delete: To remove pay grade 4 from existence

DELETE FROM PayGradeLevels WHERE ID = 4;

Table 5-1

Persons table

ID   Name    DOB          Pay Grade ID
1    Adam    01/01/1990   2
2    Sara    01/01/1980   1
3    Bob     01/01/1970   1
4    Alice   01/01/2000   3

Table 5-2

Pay grade levels table

ID   Base Salary   Reimbursements   Bonuses
1    L3            L2               L0
2    L1            L1               L1
3    L3            L3               L3
4    L1            L3               L1

One of the most important concepts that anyone building a database needs to know is the primary key; this corresponds to the ID column in both Tables 5-1 and 5-2, as it is a unique, always valid (that is, not null) identifier used to refer to a single row. Moreover, it is indexed, which means it is internally managed by the database in a way that speeds up queries that filter by the ID. It is worth mentioning that indexing is a feature that can be applied to any column, but it is not always preferable, as it can grow the disk space used by that column to up to twice its size. Indexing enables quick lookup of specific values among all rows; hence, it is encouraged for columns that represent IDs.
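As an illustration (a sketch, not taken from the book's figures), a secondary index could be added to the Pay Grade ID column of the Persons table as follows, assuming the column is named PayGradeId as in the UPDATE example above:

-- The primary key (ID) is indexed automatically; this adds a secondary
-- index so that filtering by PayGradeId no longer scans every row
CREATE INDEX idx_persons_paygradeid ON Persons (PayGradeId);

A query such as SELECT * FROM Persons WHERE PayGradeId = 1 can then use this index, at the cost of the additional disk space mentioned earlier.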

Another term, the foreign key, will resonate in backend developers’ ears: it is an ID used to refer to a row in a different table. This is the case for the Pay Grade ID column in Table 5-1, which points to the ID column in Table 5-2.

5.1.2 Connecting a PostgreSQL Database to Streamlit

First, we will need to create the database and the tables from the example discussed in Section 5.1 using pgAdmin4, which is a GUI (graphical user interface) program used to interact with and manage PostgreSQL databases. Assuming it is installed and set up, we will create a new database following the steps in Figures 5-1 and 5-2 sequentially.
Figure 5-1

Creating a new PostgreSQL database

Figure 5-2

Creating a new PostgreSQL database (continued)

After our database is ready, we will click the Query Tool in the top-left corner to run raw SQL commands. As shown in Figure 5-3, we create two tables, both with primary keys, and one of them with a foreign key. Other features of the columns – also referred to as constraints in database terminology – include specifying a column to be not null and to be autoincrementing. Marking a column as not null configures the database to reject any insert or update in which that column is set to null or not set at all (when the data type’s default value is null). SERIAL is a data type unique to PostgreSQL that ensures the column is not null and autoincrements its integer value.
Figure 5-3

Running SQL commands from create_tables.sql to create the database tables
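Since Figure 5-3 appears only as a screenshot, the following is a minimal sketch of what create_tables.sql might contain; the column names and data types are assumptions derived from Tables 5-1 and 5-2, so the exact script in the figure may differ:

-- PayGradeLevels is created first so that Persons can reference it
CREATE TABLE PayGradeLevels (
    ID SERIAL PRIMARY KEY,          -- autoincrementing and not null
    BaseSalary TEXT NOT NULL,
    Reimbursements TEXT NOT NULL,
    Bonuses TEXT NOT NULL
);

CREATE TABLE Persons (
    ID SERIAL PRIMARY KEY,
    Name TEXT NOT NULL,
    DOB DATE NOT NULL,
    PayGradeId INTEGER NOT NULL REFERENCES PayGradeLevels (ID)  -- foreign key
);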

Before proceeding with the integration with Streamlit, we will first add some data to the database using raw SQL, as shown in Figure 5-4. Notice how we did not need to set the ID values for either table, as these are generated on the developer’s behalf by the SERIAL columns.
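As with the table creation script, a minimal sketch of what inserting_data.sql from Figure 5-4 might contain is shown below; the values mirror Tables 5-1 and 5-2, and the ISO date format is an assumption, so the actual script may differ:

-- The SERIAL ID columns are populated automatically, so they are omitted here
INSERT INTO PayGradeLevels (BaseSalary, Reimbursements, Bonuses)
VALUES ('L3', 'L2', 'L0'),
       ('L1', 'L1', 'L1'),
       ('L3', 'L3', 'L3'),
       ('L1', 'L3', 'L1');

INSERT INTO Persons (Name, DOB, PayGradeId)
VALUES ('Adam', '1990-01-01', 2),
       ('Sara', '1980-01-01', 1),
       ('Bob', '1970-01-01', 1),
       ('Alice', '2000-01-01', 3);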

First, we will need to prepare our database access credentials for use in Streamlit. These credentials, including the username and password, can be written as environment variables in a .env file or in a secrets.yaml file that the application then reads and parses. Ideally, however, they should be written to the secrets.toml file inside a folder called .streamlit, as shown in Listing 5-1. This file is read and parsed by the Streamlit engine so that the variables within it are easily at the developer’s disposal. For this TOML file to be parsed correctly, the database’s connection data can be grouped under a section named "db_postrgres," which is the same key used within Streamlit (st.secrets["db_postrgres"]) to retrieve the connection metadata. As a reminder, always choose strong passwords; if the database is publicly reachable, almost anyone can try brute-forcing the password and gaining access to the data.
[db_postrgres]
host = "127.0.0.1"
port = "5432"
user = "postgres"
password = "admin"
dbname = "CompanyData"
Listing 5-1

.streamlit/secrets.toml

Figure 5-4

Inserting data using inserting_data.sql into the database

To interface Python with PostgreSQL, we need a capable library. For this example, we will use psycopg2, but other libraries such as SQLAlchemy can do the job as well; SQLAlchemy will be introduced in later chapters.

As discussed in previous chapters, Streamlit reruns the Python script upon user actions. This wasn’t a big deal before, but it can introduce more overhead here, because in the current example a new database connection would be established with every rerun. To avoid that unnecessary call, we can cache the first established connection. Streamlit allows caching function calls out of the box using its native function decorator st.cache. It can also accept other parameters, for instance, an expiration period for the cache, which will cause the next function call to reexecute the function body if the cache has been invalidated by then. In Listing 5-2, this decorator is applied to init_connection to save the established connection.

After the new connection is established, we will need a cursor with which to execute SQL commands, also known as SQL queries. The cursor needs to be disposed of after the query finishes; otherwise, it is retained in memory, and with every new query the memory becomes more bloated, causing memory leaks, which are every software developer’s nightmare. The developer can choose to close it manually or use a context manager, which will close it once its usage scope is exited by the interpreter. Examples of both cases are the run_query and run_query_with_context_manager functions in Listing 5-2, with output in Figure 5-5.
import streamlit as st
import psycopg2

# Cache the connection so that it is not re-established on every rerun
@st.cache(allow_output_mutation=True,
          hash_funcs={"_thread.RLock": lambda _: None})
def init_connection():
    return psycopg2.connect(**st.secrets["db_postrgres"])

conn = init_connection()

# Query by creating a cursor and closing it manually
def run_query(query_str):
    cur = conn.cursor()
    cur.execute(query_str)
    data = cur.fetchall()
    cur.close()
    return data

# Query with a context manager that closes the cursor automatically
def run_query_with_context_manager(query_str):
    with conn.cursor() as cur:
        cur.execute(query_str)
        return cur.fetchall()

query = st.text_input("Query")
c1, c2 = st.columns(2)
output = None
with c1:
    if st.button("Run with context manager"):
        output = run_query_with_context_manager(query)
with c2:
    if st.button("Run without context manager"):
        output = run_query(query)
st.write(output)
Listing 5-2

main.py

5.1.3 Displaying Tables in Streamlit

After querying data from the database, we can display it in text format, or we can use more visually appealing tools from Streamlit, which require a modification of how the data is represented.

Among data scientists and developers, it is common to parse structured data – whether sensor values, identification information, or any repeating data with a structure – into a Pandas dataframe. Dataframes are essentially NumPy arrays with extra capabilities such as labeled columns and SQL-like querying techniques. That being said, they also share the fast vectorization capabilities of plain NumPy arrays, which is essentially a parallelized way to perform mathematical computations on an array as a whole instead of element by element.
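As a short illustration of vectorization (a sketch, not code from the book), the snippet below applies an operation to an entire dataframe column at once rather than iterating row by row:

import pandas as pd

df = pd.DataFrame([["Adam", "01/01/1990"], ["Sara", "01/01/1980"]],
                  columns=["Name", "DOB"])

# Vectorized: the whole DOB column is parsed and transformed in one call
df["Year"] = pd.to_datetime(df["DOB"], format="%d/%m/%Y").dt.year

# Equivalent element-by-element loop, noticeably slower on large dataframes
years = [pd.to_datetime(dob, format="%d/%m/%Y").year for dob in df["DOB"]]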

Streamlit can print dataframes right away with a single command in two different ways. st.table displays a noninteractive representation of the dataframe, as shown in Figure 5-6. Figure 5-7 displays st.dataframe, which renders an interactive representation of the dataframe where the user can sort any column just by clicking it. As a trade-off, this makes the web application slower, as more CPU and/or memory is required because the complexity of the sorting algorithm grows in an O(n·log(n)) manner.
import streamlit as st
import pandas as pd
df = pd.DataFrame([["Adam", "01/01/1990", 2],
                   ["Sara", "01/01/1980", 1],
                   ["Bob", "01/01/1970", 1],
                   ["Alice", "01/01/2000", 3]
                   ], columns=["Name", "DOB", "Paygrade ID"])
st.table(df)
st.dataframe(df)
Listing 5-3

df_demo.py

Figure 5-5

Running user SQL commands from Streamlit

Figure 5-6

st.table from Listing 5-3

Figure 5-7

st.dataframe from Listing 5-3

5.2 Nonrelational Databases

While in most use cases you will be dealing with a structured dataset where the schema, attributes, data types, and metadata will be known beforehand, there are indeed instances where none of these will be known. Imagine that you are trying to create a search engine, where users can input documents with varying lengths, numbers of headers or pictures, and types of media. It is simply impossible to know what sort of schema or table to provision to store such data. In such applications, the utility of a NoSQL database, the likes of MongoDB, is essential to store and manage unstructured data.

5.2.1 Introduction to MongoDB

MongoDB enables you to store data as a JSON document with varying attributes and data types into a collection with a varying schema. Please note that with MongoDB, a document is analogous to a row, and a collection is analogous to a table in a relational database system. Even if your dataset is structured to begin with, there is no harm in using a NoSQL database system, as you may eventually scale your application enough to end up dealing with an abundance of unstructured data. In addition, if you require features such as full-text indexing where every word in every document in every collection is reverse-indexed, or if you need fuzzy matching in your queries to mitigate the effect of typos, then MongoDB is the place to be.

To provide an overview of MongoDB and how it can be interfaced with Streamlit, in this section we will create a search engine application for restaurants using a publicly available unstructured dataset of restaurant ratings. The goal of the application is to allow the user to search for restaurants based on the type of cuisine and address they want. For the cuisine, a simple one-to-one match with a predefined list of cuisine types will be used to filter the data. However, for the address, full-text indexing will be required to ensure that the address can be matched with n-grams (contiguous sequences of words or tokens in a text document) corresponding to the borough and street address, which are located in disparate objects and arrays within the document as shown in Figure 5-8. In addition, fuzzy matching will be needed to guarantee that similar but nonidentical search queries are matched with terms that differ from the record by at most two characters.

5.2.2 Provisioning a Cloud Database

MongoDB can be provisioned both locally and on the cloud; however, to utilize the full-text indexing feature provided by MongoDB’s Atlas Search service, you will need to host your database on the cloud, as per the following steps:
  1.

    To begin with, you may set up an account and project at www.mongodb.com/atlas/database.

     
  2.

    Provision the free M0 Sandbox cluster as shown in Figures 5-9 and 5-10. If necessary, modify the hosted region to ensure minimal latency between your database and server.

     
  3.

    Once the cluster has been provisioned, you may proceed with whitelisting the IP addresses that you wish to access the database from in the Network Access menu item. While it is not recommended, you may whitelist all addresses as shown in Figure 5-11 to access the database from anywhere.

     
  4.
    Subsequently, you will need to create user credentials to access the database from the Overview tab in the Databases menu item as shown in Figure 5-12.
    Figure 5-8

    A sample document from the restaurants dataset

    Figure 5-9

    Setting up a MongoDB database

     
  5.
    You may then create a connection string by selecting the Connect your application option in the Choose a connection method tab and proceeding to select the Python driver that is most suitable for your application, as shown in Figures 5-13 and 5-14.
    Figure 5-10

    Provisioning a free M0 Sandbox cluster

    Figure 5-11

    Configuring network access to the database

     
  6.
    Finally, you may upload your own dataset or alternatively load a sample dataset provided by MongoDB in the Collections tab of the Databases menu item, as shown in Figure 5-15. For this example, we will be using the sample restaurants collection from MongoDB’s own datasets.
    Figure 5-12

    Creating user credentials for the database

     

5.2.3 Full-Text Indexing

Full-text indexing indexes every single token, for all the objects in all the documents of a database. It is an extremely powerful form of indexing that allows you to perform accurate queries and retrieve all matching documents, very similar to how search engines work. In MongoDB, you can create a full-text index on a cloud database as follows:
  1.

    To create an index, open the Search tab found in the Databases menu item, and click Create Search Index, as shown in Figure 5-16.

     
  2.

    Subsequently, select the JSON Editor from the Configuration Method tab.

     
  3.
    Select the database and collection that you want to create the index for, enter a name for the index, and enter the following index, as shown in Figure 5-17:
    {
      "mappings": {
        "dynamic": true
      }
    }
    Figure 5-13

    Creating a connection string for the database

     

5.2.4 Querying the Database

To query your indexed database in MongoDB, you should initially connect to your database with the connection string acquired in Figure 5-13, by using it to establish a client. It will be handy to invoke the client as a function that is cached with the @st.cache command. This will ensure that each time a query is executed, instead of establishing a new client, the one that was previously cached is used, thereby saving a considerable amount of runtime and resources as shown in the following:
from pymongo import MongoClient
@st.cache(allow_output_mutation=True, hash_funcs={"_thread.RLock": lambda _: None})
def create_client():
    return MongoClient('<connection_string>')
Subsequently, you will need to create an Aggregation, which is simply a multistage filtering pipeline, written in JSON, containing all the filters you want to apply to your data in order to query it, as follows:
  1.

    Search using fuzzy matching

     
In this stage, we need to specify the name of the index that we created previously to use for searching the documents. In addition, we need to insert the user’s query with string formatting; we also need to specify the path, or in other words the objects to search through in the documents, that is, borough and street address (nested elements and/or objects can be accessed with a period, i.e., address.street). Most importantly, however, we need to enable fuzzy matching and specify the maximum number of single-character edits needed to match the query with the token using maxEdits, and we also need to determine the number of characters at the start of each query that must match the token using prefixLength:
'$search': {
    'index': 'default',
    'text': {
        'query': '%s' % (address),
        'path': ['borough','address.street'],
        'fuzzy': {
            'maxEdits': 2,
            'prefixLength': 2
        }
    }
}
Figure 5-14

Creating a connection string for the database (continued)

  2.

    Project documents with search score

     
At this stage, we will pass only the objects that we want from the documents and will also compute the relevance score using the ’searchScore’ tag:
'$project': {
    'Name': '$name',
    'Cuisine': '$cuisine',
    'Address': '$address.street',
    'Borough': '$borough',
    'Grade': '$grades.grade',
    'Score': {
        '$meta': 'searchScore'
    }
}
Figure 5-15

Loading a dataset into the database

Figure 5-16

Creating a full-text index for the database

  3.

    Filter documents

     
Subsequently, we will filter the passed documents with the user entry for the type of cuisine. Please note that unlike fuzzy matching, at this stage queries must be identical to the tokens in the documents for a successful filtering:
'$match': {
    'Cuisine': '%s' % (cuisine)
}
Figure 5-17

Creating a full-text index for the database (continued)

  4.

    Limit results

     
Finally, at this stage we will simply limit the number of passed results to the number we require:
'$limit': 5

For information regarding additional options for the aggregation pipeline, please refer to https://docs.mongodb.com/manual/reference/aggregation/.

5.2.5 Displaying Tables in Streamlit

Once the aggregation pipeline is complete and has passed the queried results to Python, you will need to perform postprocessing before you can render the table in Streamlit. Specifically, you will need to convert the returned object from the MongoDB client into a Pandas dataframe and specify the columns you want to keep. In addition, you need to parse any returned lists (such as the restaurant grades) and convert to plain text as shown in the following:
df = pd.DataFrame(result)[['Name','Address','Grade','Score']]
df['Grade'] = [','.join(map(str, x)) for x in df['Grade']]
You may refer to Listing 5-4 to peruse the complete code for this section. You may also view the associated Streamlit application in Figure 5-18 that displays an example of a fuzzy matched query and the returned table in Streamlit.
import streamlit as st
import pandas as pd
from pymongo import MongoClient
@st.cache(allow_output_mutation=True,
    hash_funcs={"_thread.RLock": lambda _: None})
def create_client():
    return MongoClient('<connection_string>')
def query(cuisine,address):
    result = create_client()['sample_restaurants']['restaurants'].aggregate([
        {
            '$search': {
                'index': 'default',
                'text': {
                    'query': '%s' % (address),
                    'path': ['borough','address.street'],
                    'fuzzy': {
                        'maxEdits': 2,
                        'prefixLength': 2
                    }
                }
            }
        }, {
            '$project': {
                'Name': '$name',
                'Cuisine': '$cuisine',
                'Address': '$address.street',
                'Borough': '$borough',
                'Grade': '$grades.grade',
                'Score': {
                    '$meta': 'searchScore'
                }
            }
        }, {
            '$match': {
                'Cuisine': '%s' % (cuisine)
            }
        }, {
            '$limit': 5
        }
    ])
    try:
        df = pd.DataFrame(result)[['Name','Address','Grade','Score']]
        df['Grade'] = [','.join(map(str, x)) for x in df['Grade']]
        return df
    except:
        return None
if __name__ == '__main__':
    st.title('Restaurants Explorer')
    cuisine = st.selectbox('Cuisine',['American','Chinese','Delicatessen',
    'Hamburgers','Ice Cream, Gelato, Yogurt, Ices','Irish'])
    address = st.text_input('Address')
    if st.button('Search'):
        if address != '':
            st.write(query(cuisine,address))
        else:
            st.warning('Please enter an address')
Listing 5-4

mongodb.py

Figure 5-18

Output of Listing 5-4

5.3 Summary

In this chapter, we covered relational and nonrelational databases that are used to store and retrieve structured and unstructured data, respectively. Specifically, we learned how to provision a PostgreSQL database and interface it with a Streamlit application to host and process our structured datasets. Similarly, we learned how to provision a MongoDB cloud database to store unstructured data consisting of collections of documents with varying schemas, objects, and elements. We saw how MongoDB can be used to create a full-text index where every token in each document is indexed for enhanced querying, and we demonstrated how to perform fuzzy matching of query terms with document tokens, thereby mitigating the effects of typos. Finally, we witnessed how to wrap all of these operations from establishing a database client to writing an aggregation pipeline and to postprocessing queried results within a Streamlit application.
