Chapter 8. Data Miner App

New challenges appear as data size increases. Large datasets bring problems of excessive processing time and high memory consumption. These problems can turn data analysis into a painful process or even make it completely impossible.

In this chapter, we will create an application capable of processing huge datasets in an efficient way. We will review our code, implementing new tools and techniques that will make our analysis not only run faster, but also make better use of computer hardware, allowing virtually any amount of data to be processed.

In order to achieve those goals, we will learn how to use databases and how to stream the data into them, making the use of computing power constant and stable regardless of the amount of data.

These tools will also enable us to perform more advanced searches and calculations and to cross-reference information from different sources, allowing you to mine the data for valuable information.

This chapter will cover the following topics:

  • What code efficiency is and how to measure it
  • How to import data into a spatial database
  • How to abstract database data into Python objects
  • Making queries and getting information from a spatial database

Understanding code efficiency

What constitutes efficient code depends on the points that are being analyzed. When we talk about computational efficiency, there are four points that may be taken into consideration:

  • The time the code takes to execute
  • How much memory it uses to run
  • How much disk space it uses
  • Whether the code uses all the available computing power

Good and efficient code is not only about computational efficiency; it's also about writing code that brings these favorable qualities to the development process (to cite just a few of them):

  • Clean and organized code
  • Readable code
  • Easy to maintain and debug
  • Generalized
  • Shielded against misuse

It's obvious that some points are contradictory. Here are just a few examples. To speed up a process, you may need to use more memory. To use less memory, you may need more disk space. Alternatively, for faster code, you may need to give up on generalization and write very specific functions.

It is the developer who determines the balance between antagonistic characteristics, based on the software requirements and the gains obtained by investing in one point or another. For example, if much cleaner code can be written with very little penalty in terms of execution time, the developer may opt for clean and maintainable code that will be easier for the whole team to understand.

The second block of good characteristics is subject to human evaluation, whereas the items in the first block can be measured and compared by the computer.

Measuring execution time

In order to measure how fast a piece of code is executed, we need to measure its execution time. The time measured is relative and varies, depending on a number of factors: the operating system, whether there are other programs running, the hardware, and so on.

For our efficiency tests, we will measure the execution time, make changes in the code, and measure it again. In this way, we will see if the changes improve the code efficiency or not.
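
To see why a single measurement is not enough, the following sketch times the same call several times and reports the spread. It is only an illustration under assumptions: `build_text` and `time_once` are hypothetical helpers, and `time.perf_counter` requires Python 3.3 or later.

```python
# Hypothetical sketch: the same call timed several times gives different numbers.
import time


def build_text(n):
    """Small workload used only for this illustration."""
    return "".join(str(i) for i in range(n))


def time_once(func, *args):
    """Return the wall-clock seconds taken by one call to func(*args)."""
    start = time.perf_counter()  # high-resolution clock, Python 3.3+
    func(*args)
    return time.perf_counter() - start


measurements = [time_once(build_text, 10000) for _ in range(5)]
print("fastest: {:.6f}s, slowest: {:.6f}s".format(
    min(measurements), max(measurements)))
```

The two numbers will rarely match, which is why comparisons between versions of the code should be made on the same machine, under similar conditions, and over many repetitions.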

Let's start with a simple example and measure how long it takes to run.

  1. As before, make a copy of the previous chapter folder in your geopy project and rename it as Chapter8. Your project structure should look like this:
    ├───Chapter1
    ├───Chapter2
    ├───Chapter3
    ├───Chapter4
    ├───Chapter5
    ├───Chapter6
    ├───Chapter7
    ├───Chapter8
    │   ├───experiments
    │   ├───map_maker
    │   ├───output
    │   └───utils
    └───data
  2. Click on your experiments folder and create a new Python file inside it. Name that file timing.py.
  3. Now add the following code to that file:
    # coding=utf-8
    
    
    def make_list1(items_list):
        result = ""
        for item in items_list:
            template = "I like {}. \n"
            text = template.format(item)
            result = result + text
        return result
    
    
    if __name__ == '__main__':
        my_list = ['bacon', 'lasagna', 'salad', 'eggs', 'apples']
        print(make_list1(my_list))
  4. Run the code. Press Alt + Shift + F10 and select timing from the list. You should get this output:
    I like bacon.
    I like lasagna.
    I like salad.
    I like eggs.
    I like apples.

    Nothing fancy: it's a simple, inefficient function that formats text and produces a printable list of things.

  5. Now we are going to measure how long it takes to execute. Modify your code:
    # coding=utf-8
    
    
    from timeit import timeit
    
    
    def make_list1(items_list):
        result = ""
        for item in items_list:
            template = "I like {}. \n"
            text = template.format(item)
            result = result + text
        return result
    
    
    if __name__ == '__main__':
        my_list = ['bacon', 'lasagna', 'salad', 'eggs', 'apples']
        number = 100
        execution_time = timeit('make_list1(my_list)',
            setup='from __main__ import make_list1, my_list',
            number=number)
        print("It took {}s to execute the code {} times".format(
            execution_time, number))
  6. Run your code again with Shift + F10 and look at the results:
    It took 0.000379365835017s to execute the code 100 times
    
    Process finished with exit code 0

    Here we are using the timeit module to measure the execution time of our function. Since some pieces of code run very fast, we need to repeat the execution many times to get a more precise measurement and a more meaningful number. The number of times that the statement is repeated is given by the number parameter.

  7. Increase your number parameter to 1000000 and run the code again:
    It took 3.66938576408s to execute the code 1000000 times
    
    Process finished with exit code 0

    Now we have a more consistent number to work with. If your computer is much faster than mine you can increase the number. If it's slower, decrease it.

    Grab a piece of paper and take note of that result. We are going to change the function and see if we make the code more efficient.

  8. Add another version of our function; name it make_list2:
    def make_list2(items_list):
        result = ""
        template = "I like {}. \n"
        for item in items_list:
            text = template.format(item)
            result = result + text
        return result
  9. Also change your if __name__ == '__main__': block. We will make it clear which version of the function we are executing:
    if __name__ == '__main__':
        my_list = ['bacon', 'lasagna', 'salad', 'eggs', 'apples']
        number = 1000000
        function_version = 2
        statement = 'make_list{}(my_list)'.format(function_version)
        setup = 'from __main__ import make_list{}, my_list'.format(
            function_version)
        execution_time = timeit(statement, setup=setup, number=number)
        print("Version {}.".format(function_version))
        print("It took {}s to execute the code {} times".format(
            execution_time, number))
  10. Run the code again and see your results. On my computer, I got this:
    Version 2.
    It took 3.5384931206s to execute the code 1000000 times
    
    Process finished with exit code 0

    That was a slight improvement in execution time. The only change that was made in version 2 was that we moved the template out of the for loop.

  11. Make a third version of the function:
    def make_list3(items_list):
        result = ""
        template = "I like "
        for item in items_list:
            text = template + item + ". \n"
            result = result + text
        return result
  12. Change your function_version variable to 3 and run the code again:
    Version 3.
    It took 1.88675713574s to execute the code 1000000 times
    
    Process finished with exit code 0

This time we changed how each "I like ..." line is built. Instead of using string formatting, we concatenated the parts of the string and got code that ran almost twice as fast as the previous version.
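
The same measure, change, and measure again routine can be applied to any further idea. As a sketch only, the block below adds a hypothetical fourth variant, `make_list4`, which is not part of the original timing.py and builds the text with a single `str.join` call. It also uses `timeit.repeat` with a callable and keeps the best of several runs, an assumed refinement that reduces the influence of other programs running at the same time. No result is claimed here; run it to see how the two versions compare on your machine.

```python
# Hypothetical fourth variant, not part of the original timing.py.
from timeit import repeat


def make_list3(items_list):
    """Version 3 from the text: plain string concatenation."""
    result = ""
    template = "I like "
    for item in items_list:
        text = template + item + ". \n"
        result = result + text
    return result


def make_list4(items_list):
    """Assumed alternative: build the pieces, then join them once."""
    return "".join(["I like " + item + ". \n" for item in items_list])


my_list = ['bacon', 'lasagna', 'salad', 'eggs', 'apples']

# A faster version is only useful if it produces exactly the same text.
assert make_list3(my_list) == make_list4(my_list)

for func in (make_list3, make_list4):
    # repeat() returns one total per run; the minimum is the run that was
    # least disturbed by other processes.
    times = repeat(lambda: func(my_list), number=10000, repeat=3)
    print("{}: best of 3 = {:.4f}s for 10000 calls".format(
        func.__name__, min(times)))
```

Checking that both functions return identical output before timing them guards against "optimizations" that silently change behavior.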

You can find out which small changes will reduce the execution time by trial and error, by consulting articles on the Internet, or by experience. But there is a more reliable and powerful way to find out where your code spends most of its time; this is called profiling.
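
As a rough idea of what profiling looks like, here is a minimal sketch using the standard library's cProfile and pstats modules. This is an assumption about tooling, since any profiler could be used, and `run_many` is a hypothetical helper that simply calls make_list1 repeatedly so that there is something to measure.

```python
# Minimal profiling sketch with the standard library (Python 3).
import cProfile
import io
import pstats


def make_list1(items_list):
    """Version 1 from the text, reproduced so the block is self-contained."""
    result = ""
    for item in items_list:
        template = "I like {}. \n"
        text = template.format(item)
        result = result + text
    return result


def run_many(times):
    """Hypothetical driver: call make_list1 repeatedly."""
    my_list = ['bacon', 'lasagna', 'salad', 'eggs', 'apples']
    for _ in range(times):
        make_list1(my_list)


profiler = cProfile.Profile()
profiler.enable()
run_many(10000)
profiler.disable()

# Collect the report in a string, sorted by cumulative time per function.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats('cumulative').print_stats(5)
report = stream.getvalue()
print(report)
```

The report lists each function with its number of calls and the time spent in it, which points directly at the part of the code worth optimizing instead of leaving it to guesswork.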
