New challenges appear as data size increases. Large datasets bring problems related to excessive processing time and high memory consumption. These problems may turn data analysis into a painful process or may even make it completely impossible.
In this chapter, we will create an application capable of processing huge datasets in an efficient way. We will review our code, implementing new tools and techniques that will make our analysis not only run faster, but also make better use of computer hardware, allowing virtually any amount of data to be processed.
In order to achieve those goals, we will learn how to use databases and how to stream the data into them, making the use of computing power constant and stable regardless of the amount of data.
These tools will also enable us to perform more advanced searches and calculations, and to cross-reference information from different sources, so that we can mine the data for valuable information.
This chapter will cover the following topics:
What constitutes efficient code depends on the points that are being analyzed. When we talk about computational efficiency, there are four points that may be taken into consideration:
Good and efficient code is not only about computational efficiency; it's also about writing code that brings these favorable qualities to the development process (to cite just a few of them):
It's obvious that some points are contradictory. Here are just a few examples. To speed up a process, you may need to use more memory. To use less memory, you may need more disk space. Alternatively, for faster code, you may need to give up on generalization and write very specific functions.
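The first trade-off, spending memory to save time, can be illustrated with a small sketch. It assumes a made-up calculation (the function names `slow_square_sum` and `cached_square_sum` are hypothetical, not part of the chapter's project) and uses the standard library's `functools.lru_cache` to keep results in memory:

```python
from functools import lru_cache
from timeit import timeit


def slow_square_sum(n):
    """Recompute the sum of squares below n on every call."""
    return sum(i * i for i in range(n))


@lru_cache(maxsize=None)
def cached_square_sum(n):
    """Same calculation, but every result is kept in memory after the first call."""
    return sum(i * i for i in range(n))


if __name__ == '__main__':
    number = 200
    plain = timeit(lambda: slow_square_sum(10000), number=number)
    cached = timeit(lambda: cached_square_sum(10000), number=number)
    print("Without cache: {}s".format(plain))
    print("With cache:    {}s".format(cached))
```

The cached version should normally be faster on repeated calls, at the cost of holding every computed result in memory for as long as the program runs.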
It is the developer who determines the balance between conflicting characteristics, based on the software requirements and the gains obtained by investing in one point or another. For example, if much cleaner code can be written with very little penalty in terms of execution time, the developer may opt for clean and maintainable code that will be easier for the whole team to understand.
The second block of good characteristics is subject to human evaluation, whereas the items in the first block can be measured and compared by the computer.
In order to measure how fast a piece of code is executed, we need to measure its execution time. The time measured is relative and varies, depending on a number of factors: the operating system, whether there are other programs running, the hardware, and so on.
For our efficiency tests, we will measure the execution time, make changes in the code, and measure it again. In this way, we will see if the changes improve the code efficiency or not.
Let's start with a simple example and measure how long it takes to run.
Copy the previous chapter's folder in your geopy project and rename the copy as Chapter8. Your project structure should look like this:

    ├───Chapter1
    ├───Chapter2
    ├───Chapter3
    ├───Chapter4
    ├───Chapter5
    ├───Chapter6
    ├───Chapter7
    ├───Chapter8
    │   ├───experiments
    │   ├───map_maker
    │   ├───output
    │   └───utils
    └───data
Open the experiments folder and create a new Python file inside it. Name that file timing.py and add the following code:

    # coding=utf-8


    def make_list1(items_list):
        result = ""
        for item in items_list:
            template = "I like {}. "
            text = template.format(item)
            result = result + text
        return result


    if __name__ == '__main__':
        my_list = ['bacon', 'lasagna', 'salad', 'eggs', 'apples']
        print(make_list1(my_list))

Running the file prints:

    I like bacon. I like lasagna. I like salad. I like eggs. I like apples.
Nothing fancy: it's a simple, inefficient function that formats text and produces a printable list of things.
Next, edit the file to time the function:

    # coding=utf-8

    from timeit import timeit


    def make_list1(items_list):
        result = ""
        for item in items_list:
            template = "I like {}. "
            text = template.format(item)
            result = result + text
        return result


    if __name__ == '__main__':
        my_list = ['bacon', 'lasagna', 'salad', 'eggs', 'apples']
        number = 100
        execution_time = timeit('make_list1(my_list)',
                                setup='from __main__ import make_list1, my_list',
                                number=number)
        print("It took {}s to execute the code {} times".format(
            execution_time, number))

Running it prints:

    It took 0.000379365835017s to execute the code 100 times

    Process finished with exit code 0
Here we are using the timeit module to measure the execution time of our function. Since some pieces of code run very fast, we need to repeat the execution many times to get a more precise measurement and a more meaningful number. The number of times that the statement is repeated is given by the number parameter.
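Because the measured time depends on whatever else the machine is doing, one common approach is to repeat the whole measurement several times and keep the fastest round. The following is a minimal sketch of that idea using `timeit.repeat`; the helper name `best_time` and its `rounds` parameter are assumptions made for illustration, not part of the chapter's code:

```python
from timeit import repeat


def make_list1(items_list):
    result = ""
    for item in items_list:
        template = "I like {}. "
        text = template.format(item)
        result = result + text
    return result


def best_time(func, argument, number=1000, rounds=5):
    """Time `number` calls of func(argument), `rounds` times over,
    and return the fastest total (the round least disturbed by other processes)."""
    timings = repeat(lambda: func(argument), number=number, repeat=rounds)
    return min(timings)


if __name__ == '__main__':
    my_list = ['bacon', 'lasagna', 'salad', 'eggs', 'apples']
    print("Best of 5 rounds: {}s".format(best_time(make_list1, my_list)))
```

Passing a callable instead of a statement string adds the small overhead of one extra function call per execution, so the absolute numbers will not match the string-based measurements exactly.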
Change number to 1000000 and run the code again:

    It took 3.66938576408s to execute the code 1000000 times

    Process finished with exit code 0
Now we have a more consistent number to work with. If your computer is much faster than mine you can increase the number. If it's slower, decrease it.
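Instead of adjusting the number by hand, the repetition count can also be chosen automatically. The sketch below assumes Python 3.6 or later, where `timeit.Timer` provides an `autorange()` method that keeps increasing the number of executions until the total time reaches at least 0.2 seconds:

```python
from timeit import Timer


def make_list1(items_list):
    result = ""
    for item in items_list:
        template = "I like {}. "
        text = template.format(item)
        result = result + text
    return result


my_list = ['bacon', 'lasagna', 'salad', 'eggs', 'apples']

if __name__ == '__main__':
    timer = Timer(lambda: make_list1(my_list))
    # autorange() returns the chosen repetition count and the total time taken.
    number, execution_time = timer.autorange()
    print("It took {}s to execute the code {} times".format(
        execution_time, number))
```

The count picked this way differs from machine to machine, so it is convenient for a quick check but less suitable when results from different runs need the same number of repetitions.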
Grab a piece of paper and take note of that result. We are going to change the function and see if we make the code more efficient.
Add a second version of the function, named make_list2:

    def make_list2(items_list):
        result = ""
        template = "I like {}. "
        for item in items_list:
            text = template.format(item)
            result = result + text
        return result
Also change the if __name__ == '__main__': block. We will make it clear which version of the function we are executing:

    if __name__ == '__main__':
        my_list = ['bacon', 'lasagna', 'salad', 'eggs', 'apples']
        number = 1000000
        function_version = 2
        statement = 'make_list{}(my_list)'.format(function_version)
        setup = 'from __main__ import make_list{}, my_list'.format(
            function_version)
        execution_time = timeit(statement, setup=setup, number=number)
        print("Version {}.".format(function_version))
        print("It took {}s to execute the code {} times".format(
            execution_time, number))

Running it prints:

    Version 2.
    It took 3.5384931206s to execute the code 1000000 times

    Process finished with exit code 0
That was a slight improvement in execution time. The only change made in version 2 was moving the template out of the for loop.
    def make_list3(items_list):
        result = ""
        template = "I like "
        for item in items_list:
            text = template + item + ". "
            result = result + text
        return result
Change the function_version variable to 3 and run the code again:

    Version 3.
    It took 1.88675713574s to execute the code 1000000 times

    Process finished with exit code 0
This time we changed how each "I like ..." sentence is built. Instead of using string formatting, we concatenated the parts of the string and got code that ran almost twice as fast as the previous version.
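When comparing several versions, it helps to confirm that they all still return the same text before trusting the timings. The following sketch runs the three versions side by side; the helper `compare_versions` is a hypothetical name introduced here, and it passes callables to `timeit` rather than statement strings:

```python
from timeit import timeit


def make_list1(items_list):
    result = ""
    for item in items_list:
        template = "I like {}. "
        text = template.format(item)
        result = result + text
    return result


def make_list2(items_list):
    result = ""
    template = "I like {}. "
    for item in items_list:
        text = template.format(item)
        result = result + text
    return result


def make_list3(items_list):
    result = ""
    template = "I like "
    for item in items_list:
        text = template + item + ". "
        result = result + text
    return result


def compare_versions(functions, items_list, number=100000):
    """Check that every version returns the same text, then time each one."""
    expected = functions[0](items_list)
    timings = {}
    for func in functions:
        # A faster version is only useful if it still gives the same result.
        assert func(items_list) == expected, func.__name__
        timings[func.__name__] = timeit(lambda: func(items_list),
                                        number=number)
    return timings


if __name__ == '__main__':
    my_list = ['bacon', 'lasagna', 'salad', 'eggs', 'apples']
    versions = [make_list1, make_list2, make_list3]
    for name, seconds in sorted(compare_versions(versions, my_list).items()):
        print("{}: {}s".format(name, seconds))
```

The absolute times will differ from the ones shown in this chapter, since they depend on the machine and the Python version; only the comparison between versions on the same machine is meaningful.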
You can find out which small changes will reduce the execution time by trial and error, by consulting articles on the Internet, or by experience. But there is a more reliable and powerful way to find out where your code spends the most time; this is called profiling.
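As a first taste of what profiling looks like, here is a minimal sketch using the standard library's `cProfile` and `pstats` modules. The helpers `run_many` and `profile_call` are hypothetical names used only for this illustration, and this is just one of several ways to invoke the profiler:

```python
import cProfile
import io
import pstats


def make_list1(items_list):
    result = ""
    for item in items_list:
        template = "I like {}. "
        text = template.format(item)
        result = result + text
    return result


def run_many(times):
    """Call make_list1 repeatedly so the profiler has something to measure."""
    my_list = ['bacon', 'lasagna', 'salad', 'eggs', 'apples']
    for _ in range(times):
        make_list1(my_list)


def profile_call(func, *args):
    """Profile one call of func(*args) and return the report as text."""
    profiler = cProfile.Profile()
    profiler.runcall(func, *args)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Show the five entries with the highest cumulative time.
    stats.sort_stats('cumulative').print_stats(5)
    return stream.getvalue()


if __name__ == '__main__':
    print(profile_call(run_many, 100000))
```

The report lists, for each function, how many times it was called and how much time was spent in it, which points to the parts of the code worth optimizing first.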