Importing data from a database

Very often, our work on data analysis and visualization sits at the consumer end of the data pipeline: we use data that has already been produced rather than producing it ourselves. A modern application, for example, holds various datasets inside relational databases (or other databases such as MongoDB), and we use these databases to produce beautiful graphs.

This recipe will show you how to use SQL drivers from Python to access data.

We will demonstrate this recipe using a SQLite database because it requires the least effort to set up, but the interface is similar for most other SQL-based database engines (such as MySQL and PostgreSQL). There are, however, differences in the SQL dialects those engines support. This example uses simple, standard SQL and should be reproducible on most common SQL database engines.

Getting ready

To be able to execute this recipe, we need to install the SQLite library as shown here:

$ sudo apt-get install sqlite3

Python support for SQLite is available by default, so we don't need to install anything Python-related. Just fire the following code snippet in IPython to verify that everything is present:

import sqlite3
sqlite3.version
sqlite3.sqlite_version

We get output similar to the following:

In [1]: import sqlite3

In [2]: sqlite3.version
Out[2]: '2.6.0'

In [3]: sqlite3.sqlite_version
Out[3]: '3.8.4.3'

Here, sqlite3.version gets us the version of the Python sqlite3 module, and sqlite3.sqlite_version returns the version of the system SQLite library.

How to do it...

To be able to read from the database, we need to perform the following steps (a condensed sketch follows the list):

  1. Connect to the database engine (or the file in the case of SQLite).
  2. Run the query against the selected tables.
  3. Read the result returned from the database engine.
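
Condensed into code, the three steps look like the following. This is a minimal sketch against a temporary in-memory database, with a throwaway table invented purely for illustration:

import sqlite3

# step 1: connect (here, to a temporary in-memory database)
con = sqlite3.connect(":memory:")
cur = con.cursor()

# create and fill a tiny table so the query has something to select
cur.execute("CREATE TABLE demo (name TEXT, value INTEGER)")
cur.executemany("INSERT INTO demo VALUES (?, ?)", [("a", 1), ("b", 2)])

# step 2: run the query against the selected table
cur.execute("SELECT name, value FROM demo ORDER BY value DESC")

# step 3: read the result returned from the database engine
for row in cur.fetchall():
    print(row)

con.close()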

We will not try to teach SQL here, as there are many books on that particular topic, but just for the purpose of clarity, we will explain the SQL query in this code sample:

SELECT ID, Name, Population FROM City ORDER BY Population DESC LIMIT 1000

ID, Name, and Population are columns (fields) of the table City from which we select data. ORDER BY tells the database engine to sort our data by the Population column, and DESC means descending order. LIMIT allows us to get just the first 1,000 records found.

For this example, we will use the world.sql example dump, which holds world city names and populations. The City table it creates has more than 5,000 entries.

First, we need to import this SQL file into a SQLite database. Here's the code to do it:

import sqlite3
import sys

if len(sys.argv) < 2:
    print("Error: you must supply at least the SQL script.")
    print("Usage: %s sql-dump.sql [table.db]" % sys.argv[0])
    sys.exit(1)

script_path = sys.argv[1]

if len(sys.argv) == 3:
    db = sys.argv[2]
else:
    # if no database file is given,
    # create an in-memory database
    db = ":memory:"

try:
    con = sqlite3.connect(db)
    with con:
        cur = con.cursor()
        with open(script_path, 'r', encoding='utf-8') as f:
            # executescript() runs every statement in the dump
            cur.executescript(f.read())
except sqlite3.Error as err:
    print("Error occurred: %s" % err)

This reads the SQL file and executes its statements against the opened SQLite database file. If we don't specify a filename, SQLite creates the database in memory. The executescript() call then runs all the statements in the dump in a single call.

If we encounter any errors, we catch exceptions and print the error message to the user.
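
Once the import succeeds, we can quickly check that the City table actually made it into the file. Here's a small sketch using SQLite's PRAGMA table_info, assuming the dump was imported into world.db as shown later:

import sqlite3

con = sqlite3.connect("world.db")
cur = con.cursor()
# PRAGMA table_info returns one row per column:
# (index, name, type, notnull, default_value, primary_key)
for col in cur.execute("PRAGMA table_info(City)"):
    print(col)
con.close()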

After we have imported data into the database, we are able to query the data and do some processing. Here is the code to read the data from the database file:

import sqlite3
import sys

if len(sys.argv) != 2:
    print("Please specify the database file.")
    sys.exit(1)

db = sys.argv[1]

try:
    con = sqlite3.connect(db)
    with con:
        cur = con.cursor()
        query = 'SELECT ID, Name, Population FROM City ORDER BY Population DESC LIMIT 1000'
        con.text_factory = str  # return text as str (the default in Python 3)
        cur.execute(query)

        resultset = cur.fetchall()

        # extract column names from the cursor metadata
        col_names = [cn[0] for cn in cur.description]
        print("%10s %30s %10s" % tuple(col_names))
        print("=" * (10 + 1 + 30 + 1 + 10))

        for row in resultset:
            print("%10s %30s %10s" % row)
except sqlite3.Error as err:
    print("[ERROR]:", err)

Here's an example of how to use the two preceding scripts:

$ python ch02-sqlite-import.py world.sql world.db
$ python ch02-sqlite-read.py world.db 
        ID                           Name Population
====================================================
      1024                Mumbai (Bombay)   10500000
      2331                          Seoul    9981619
       206                      São Paulo    9968485
      1890                       Shanghai    9696300

How it works...

First, we verify that the user has provided the database file path. This is just a quick check to ensure that we can proceed with the rest of the code.

Then, we try to connect to the database; if that fails, we catch sqlite3.Error and print it to the user.

If the connection is successful, we obtain a cursor using con.cursor(). A cursor is an iterator-like structure that enables us to traverse records of the result set returned from a database.
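
Because the cursor is also iterable, we could loop over it directly instead of materializing the whole result set first. A minimal sketch, assuming the same world.db file:

import sqlite3

con = sqlite3.connect("world.db")
cur = con.cursor()
cur.execute("SELECT Name, Population FROM City LIMIT 5")
# iterating the cursor fetches rows one at a time instead of
# loading the entire result set into memory at once
for name, population in cur:
    print(name, population)
con.close()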

We define a query, execute it through the cursor, and fetch the result set using cur.fetchall(). Had we expected just one result, we would have used fetchone(), as in the following sketch.
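
For example, an aggregate query returns exactly one row, so fetchone() is the natural fit. A small sketch, again assuming world.db:

import sqlite3

con = sqlite3.connect("world.db")
cur = con.cursor()
cur.execute("SELECT COUNT(*) FROM City")
# fetchone() returns a single row as a tuple (or None if there
# are no rows left to fetch)
(city_count,) = cur.fetchone()
print(city_count)
con.close()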

A list comprehension over cur.description gives us the column names. description is a read-only attribute that returns more than we need for just the column names: a 7-item tuple for every column, of which we fetch only the first item (the name).
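
To see what description holds, we can print it directly. Note that sqlite3 populates only the name in each 7-item tuple and leaves the remaining six items as None:

import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("SELECT 1 AS ID, 'x' AS Name")
# each entry is a 7-item tuple; item 0 is the column name
for col in cur.description:
    print(col)
# prints:
# ('ID', None, None, None, None, None, None)
# ('Name', None, None, None, None, None, None)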

We then use simple string formatting to print the header of our table with column names. After that, we iterate over resultset and print every row in a similar manner.

There's more...

Databases are the most common sources of data today. We cannot present everything in this short recipe, but we can suggest where to look for more information.

The official Python documentation is the first place to look for an explanation of how to work with databases. The most common databases are open source ones, such as MySQL, PostgreSQL, and SQLite; at the other end of the spectrum are enterprise database systems such as MS SQL Server, Oracle, and Sybase. Python has support for most of them, and the interface is largely abstracted, so you don't have to change your program if the underlying database changes, although some tweaks may be required. Whether they are needed depends on whether you have used the specifics of a particular database system. For example, Oracle supports PL/SQL, a procedural language that is not standard SQL, and some things will not work if your database changes from Oracle to MS SQL Server. Similarly, SQLite does not support MySQL-specific data types or storage engine options (MyISAM and InnoDB). These things can be annoying, but having your code rely on standard SQL (available at http://en.wikipedia.org/wiki/SQL:2011) will keep it portable from one database system to another.
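
As a rough illustration of that abstraction, the following sketch issues the same City query through the standard cursor/execute/fetchall pattern shared by DB-API-compliant drivers. Only the connection line and the parameter placeholder style change between drivers; the psycopg2 connection details below are hypothetical placeholders:

import sqlite3

def top_cities(con, placeholder, limit=10):
    # the cursor/execute/fetchall pattern is identical for every
    # DB-API-compliant driver; only the placeholder style differs
    cur = con.cursor()
    query = ('SELECT ID, Name, Population FROM City '
             'ORDER BY Population DESC LIMIT ' + placeholder)
    cur.execute(query, (limit,))
    return cur.fetchall()

# SQLite uses '?' as its parameter placeholder
con = sqlite3.connect('world.db')
print(top_cities(con, '?'))

# PostgreSQL through psycopg2 uses '%s'; only the connection
# changes (the dbname and user below are made up):
# import psycopg2
# con = psycopg2.connect(dbname='world', user='reader')
# print(top_cities(con, '%s'))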
