M. Wilkes, Advanced Python Development, https://doi.org/10.1007/978-1-4842-5793-7_10

10. Speeding things up

Matthew Wilkes¹

(1)

Leeds, West Yorkshire, UK

There are two main approaches to improving the speed of code: optimizing the code we’ve written and optimizing the control flow of the program to run less code. People often focus on optimizing the code rather than the control flow because it’s easier to make self-contained changes, but the most significant benefits are usually in changing the flow.

Optimizing a function

The first step to optimizing a function is having a good understanding of its performance before making any changes. The Python standard library has a profile module to assist with this. Profile introspects code as it runs to build up an idea of the time spent on each function call. The profiler can detect multiple calls to the same function and monitor any functions called indirectly. You can then generate a report that shows the function call chart for an entire run.

We can profile a statement using the profile.run(...) function. This uses the reference profiler, which is always available, but most people use the optimized profiler at cProfile.run(...) ¹. The profiler will exec the string passed as the first argument, generate profiling information, and then automatically format the profile results into a report.

>>> from apd.aggregation.analysis import interactable_plot_multiple_charts

>>> import cProfile

>>> cProfile.run("interactable_plot_multiple_charts()()", sort="cumulative")

164 function calls in 2.608 seconds

Ordered by: cumulative time

ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
     1    0.001    0.001    2.608    2.608  {built-in method builtins.exec}
     1    0.001    0.001    2.606    2.606  <string>:1(<module>)
     1    0.004    0.004    2.597    2.597  analysis.py:327(run_in_thread)
     9    2.558    0.284    2.558    0.284  {method 'acquire' of '_thread.lock' objects}
     1    0.000    0.000    2.531    2.531  _base.py:635(__exit__)
...

The table displayed here shows the number of times a function was invoked (ncalls), the time spent executing that function (tottime), and that total time divided by the number of calls (percall). It also shows the cumulative time spent executing that function and all indirectly called functions, both as a total and divided by the number of calls (cumtime and the second percall). A function having a high cumtime and a low tottime implies that the function itself could not benefit from optimizing, but the control flow involving that function may.
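The difference between tottime and cumtime is easier to see on a toy program. The following is a minimal sketch, not part of the apd.aggregation code: thin_wrapper and slow_leaf are hypothetical functions, and the report is captured with the standard library's pstats module rather than printed directly.

```python
import cProfile
import io
import pstats


def slow_leaf(n: int) -> int:
    # The loop runs in this function's own body, so its tottime is high.
    total = 0
    for i in range(n):
        total += i * i
    return total


def thin_wrapper(n: int) -> int:
    # Does almost nothing itself (low tottime), but its cumtime includes
    # the time spent in both slow_leaf() calls.
    return slow_leaf(n) + slow_leaf(n)


profiler = cProfile.Profile()
result = profiler.runcall(thin_wrapper, 200_000)

# Format the collected data, sorted by cumulative time, into a string
report = io.StringIO()
stats = pstats.Stats(profiler, stream=report)
stats.sort_stats("cumulative").print_stats(5)
print(report.getvalue())
```

In the resulting report, thin_wrapper should appear at the top with a large cumtime and a tiny tottime, which is the pattern that suggests looking at control flow rather than at the function body.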

Tip

Some IDEs and code editors have built-in support for running profilers and viewing their output. If you’re using an IDE, then this may be a more natural interface for you. The behavior of the profilers is still the same, however.

When running code in a Jupyter notebook, you can also generate the same report using the “cell magic” functionality (Figure 10-1). A cell magic is an annotation on a cell to use a named plugin during execution, in this case a profiler. If you make the first line of your cell %%prun -s cumulative, then once the cell has completed executing, the notebook displays a pop-up window containing a profile report for the whole cell.

Caution

The “cell magic” approach is not currently compatible with top-level await support in IPython. If you use the %%prun cell magic, then that cell cannot await a coroutine.

Figure 10-1

Example of profiling a Jupyter notebook cell

Profiling and threads

The preceding examples generate reports that list lots of threading internal functions rather than our substantive functions. This is because our interactable_plot_multiple_charts(...)(...) function ² starts a new thread to handle running the underlying coroutines. The profiler does not reach into the started thread to start a profiler, so we only see the main thread waiting for the worker thread to finish.

We can fix this by changing the way our code wraps a coroutine into a thread, giving us the opportunity to insert a profiler inside the child thread. For example, we could add a debug= flag and then submit a different function to the thread pool if debug=True is passed, as shown in Listing 10-1.

import asyncio
import functools
import typing as t
from concurrent.futures import ThreadPoolExecutor

# Figure and plot_multiple_charts are assumed to be available in the
# enclosing analysis module

_Coroutine_Result = t.TypeVar("_Coroutine_Result")


def wrap_coroutine(
    f: t.Callable[..., t.Coroutine[t.Any, t.Any, _Coroutine_Result]],
    debug: bool = False,
) -> t.Callable[..., _Coroutine_Result]:
    """Given a coroutine, return a function that runs that coroutine
    in a new event loop in an isolated thread"""

    @functools.wraps(f)
    def run_in_thread(*args: t.Any, **kwargs: t.Any) -> _Coroutine_Result:
        loop = asyncio.new_event_loop()
        wrapped = f(*args, **kwargs)

        if debug:
            # Create a new function that runs the loop inside a cProfile
            # session, so it can be profiled transparently
            def fn():
                import cProfile

                # Note: runctx() prints its report and returns None
                return cProfile.runctx(
                    "loop.run_until_complete(wrapped)",
                    {},
                    {"loop": loop, "wrapped": wrapped},
                    sort="cumulative",
                )

            task_callable = fn
        else:
            # If not debugging just submit the loop run function with the
            # desired coroutine
            task_callable = functools.partial(loop.run_until_complete, wrapped)
        with ThreadPoolExecutor(max_workers=1) as pool:
            task = pool.submit(task_callable)
        # Mypy can get confused when nesting generic functions, like we do here
        # The fact that Task is generic means we lose the association with
        # _Coroutine_Result. Adding an explicit cast restores this.
        return t.cast(_Coroutine_Result, task.result())

    return run_in_thread


def interactable_plot_multiple_charts(
    *args: t.Any, debug: bool = False, **kwargs: t.Any
) -> t.Callable[..., Figure]:
    with_config = functools.partial(plot_multiple_charts, *args, **kwargs)
    return wrap_coroutine(with_config, debug=debug)

Listing 10-1

Example of wrap_coroutine to optionally include profiling

In Listing 10-1, we use the runctx(...) function from the profiler, rather than the run(...) function. runctx(...) allows passing global and local variables to the expression we’re profiling.³ The interpreter does not introspect the string representing the code to run to determine what variables are needed. You must pass them explicitly.
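As a minimal sketch of passing names explicitly, the following profiles a stand-in function with runctx(...). The total_squares function and the answer variable are hypothetical names chosen for illustration. Because runctx(...) prints its report and returns None, the sketch assigns the result inside the profiled statement and reads it back from the locals mapping afterward.

```python
import cProfile


def total_squares(limit: int) -> int:
    # Stand-in for the real work being profiled
    total = 0
    for value in range(limit):
        total += value * value
    return total


# Every name used by the statement must be supplied explicitly
namespace = {"total_squares": total_squares, "limit": 50_000}

# The statement is executed with these globals and locals; assigning to
# "answer" stores the value in the locals mapping we passed in
cProfile.runctx("answer = total_squares(limit)", {}, namespace, sort="cumulative")

print(namespace["answer"])
```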

With this change in place, the same code we used to plot all the charts with interactive elements can also request that profiling information be collected, so users in Jupyter notebooks can easily debug new chart types they’re adding, as demonstrated in Figure 10-2.

Figure 10-2

Integrated profiling option being used from Jupyter

The profiler running in the child thread still includes some overhead functions at the top, but we can now see the functions we wanted to profile rather than only thread management functions. If we only look at the functions relevant to our code, the output is as follows:

ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
    20    0.011    0.001    2.607    0.130  analysis.py:282(plot_sensor)
    12    0.028    0.002    2.108    0.176  analysis.py:304(<listcomp>)
  3491    0.061    0.000    1.697    0.000  analysis.py:146(clean_watthours_to_watts)
 33607    0.078    0.000    0.351    0.000  query.py:114(subiterator)
    12    0.000    0.000    0.300    0.025  analysis.py:60(draw_date)
 33603    0.033    0.000    0.255    0.000  query.py:39(get_data)
     3    0.001    0.000    0.254    0.085  analysis.py:361(plot_multiple_charts)
 16772    0.023    0.000    0.214    0.000  analysis.py:223(clean_passthrough)
 33595    0.089    0.000    0.207    0.000  database.py:77(from_sql_result)
  8459    0.039    0.000    0.170    0.000  analysis.py:175(clean_temperature_fluctuations)
    24    0.000    0.000    0.140    0.006  query.py:74(get_deployment_by_id)
     2    0.000    0.000    0.080    0.040  query.py:24(with_database)

It appears that the plot_sensor(...) function is called 20 times, the list comprehension points = [dp async for dp in config.clean(query_results)] is called 12 times, and the clean_watthours_to_watts(...) function is called 3491 times. The huge number of reported calls for the clean function is due to the way the profiler interacts with generator functions. Every time that a new item is requested from the generator, it is classed as a new invocation of the function. Equally, every time an item is yielded, it is classed as that invocation returning. This approach may seem more complex than measuring the time from the first invocation until the generator is exhausted, but it means that the tottime and cumtime totals do not include the time that the iterator was idle and waiting for other code to request the next item. However, it also means that the percall numbers represent the time taken to retrieve a single item, not the time taken to run the generator from start to finish.
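The call counting for generators can be checked on a small example. This is a sketch using a hypothetical countdown generator, not our cleaner functions; it reads the raw call count from pstats instead of parsing the printed report. The exact count may vary slightly between Python versions, but it should be at least one call per item yielded.

```python
import cProfile
import pstats


def countdown(start):
    # A generator: each request for the next item resumes the function body
    while start > 0:
        yield start
        start -= 1


def consume():
    return list(countdown(5))


profiler = cProfile.Profile()
items = profiler.runcall(consume)

# pstats stores raw numbers as {(file, line, name): (cc, nc, tt, ct, callers)}
raw = pstats.Stats(profiler).stats
generator_calls = next(
    nc for (_, _, name), (_, nc, _, _, _) in raw.items() if name == "countdown"
)
print(f"items yielded: {len(items)}, calls reported: {generator_calls}")
```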

Caution

Profilers need a function to determine the current time. By default profile uses time.process_time() and cProfile uses time.perf_counter(). These measure very different things. The process_time() function measures time the CPU was busy, but perf_counter() measures real-world time. The real-world time is often called “wall time,” meaning the time as measured by a clock on a wall.
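The difference between the two clocks is easiest to see when a program is waiting rather than computing. The following sketch measures the same sleep with both timers; the 0.2 second duration is an arbitrary choice.

```python
import time

wall_start = time.perf_counter()
cpu_start = time.process_time()

time.sleep(0.2)  # Waiting, not computing

wall_elapsed = time.perf_counter() - wall_start
cpu_elapsed = time.process_time() - cpu_start

# The wall-clock figure includes the sleep; the CPU figure should be near zero
print(f"wall: {wall_elapsed:.3f}s  cpu: {cpu_elapsed:.3f}s")
```

Code that spends its time waiting on a database or network behaves like the sleep here, so a CPU-time profile can make it look much cheaper than it feels to a user.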

Interpreting the profile report

The clean_watthours_to_watts(...) function should draw your eye immediately, as it’s a relatively low-level function with a very high cumtime. This function is being used as a support function to draw one of four charts, but it’s responsible for 65% of the total execution time of plot_sensor(...). This function is where we would start optimization, but if we compare the tottime and the cumtime, we can see that only about 2% of that total time (0.061 of 2.607 seconds) is spent in the function’s own code.

The discrepancy tells us that it’s not the code we’ve directly written in this function that’s introducing the slowdown; it’s the fact that we’re calling other functions indirectly as part of our implementation of clean_watthours_to_watts(...). Right now, we’re looking at optimizing functions rather than optimizing execution flow. As optimizing this function requires optimizing the pattern of calling functions out of our control, we’ll pass by it for now. The second half of this chapter covers strategies for improving performance by altering control flow, and we’ll return to fix this function there.

Instead, let’s concentrate on the items that have a high tottime rather than cumtime, representing that the time spent is in executing code that we wrote, rather than code that we’re using. These numbers are significantly lower than the times we looked at previously; they’re relatively simple functions and represent a smaller potential benefit, but that may not always be the case.

ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
    12    0.103    0.009    2.448    0.204  analysis.py:304(<listcomp>)
 33595    0.082    0.000    0.273    0.000  database.py:77(from_sql_result)
 33607    0.067    0.000    0.404    0.000  query.py:114(subiterator)

We see that two functions related to the database interface are potential candidates. These are each run over 33,000 times and take less than a tenth of a second total time each, so they are not particularly tempting optimization targets. Still, they’re the highest in terms of tottime of our code, so they represent the best chance we have to do the simple, self-contained type of optimization.

The first thing to do is to try changing something about the implementation and measuring any difference. The existing implementation is very short, containing only a single line of code. It’s unlikely that we could optimize it at all, but let’s try.

@classmethod
def from_sql_result(cls, result) -> DataPoint:
    return cls(**result._asdict())

One thing that’s not immediately clear in the preceding implementation that may cause slowdown is that a dictionary of values is generated and mapped dynamically to keyword arguments.⁴ An idea to test would be to explicitly pass the arguments, as we know that they are consistent.

@classmethod
def from_sql_result(cls, result) -> DataPoint:
    if result.id is None:
        return cls(
            data=result.data,
            deployment_id=result.deployment_id,
            sensor_name=result.sensor_name,
            collected_at=result.collected_at,
        )
    else:
        return cls(
            id=result.id,
            data=result.data,
            deployment_id=result.deployment_id,
            sensor_name=result.sensor_name,
            collected_at=result.collected_at,
        )

The most important part of this process is to test our hypothesis. We need to re-run the code and compare the results. We also need to be aware of the fact that the code may vary in execution time because of external factors, such as load on the computer, so it’s a good idea to try running the code a few times to see if the results are stable. We’re looking for a significant speedup here, as our change would introduce maintainability issues, so a trivially small speed boost isn’t worth it.

33595 0.109 0.000 0.147 0.000 database.py:77(from_sql_result)

The result here shows that more time was spent in the from_sql_result() function than the previous implementation, but the cumulative time has decreased. This result tells us that the changes we made to from_sql_result() directly caused that function to take longer, but in doing so we changed the control flow to eliminate the call to _asdict() and pass values directly which more than made up for the slowdown we introduced.

In other words, this function’s implementation has no definite improvement to performance other than by changing the control flow to avoid code in _asdict(). It also reduces maintainability of the code by requiring us to list the fields in use in multiple places. As a result, we’ll stick with our original implementation rather than the “optimized” version.
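One way to check whether a difference like this is stable is to repeat the measurement several times and look at the spread. The sketch below uses timeit.repeat from the standard library (the timeit module is covered later in this chapter). Row and Point are hypothetical stand-ins for the database result and the DataPoint class, with fewer fields than the real ones.

```python
import timeit
from collections import namedtuple
from dataclasses import dataclass

# Hypothetical stand-ins for the database row and the DataPoint class
Row = namedtuple("Row", ["id", "sensor_name", "data"])


@dataclass
class Point:
    id: int
    sensor_name: str
    data: float


row = Row(id=1, sensor_name="Temperature", data=21.5)


def via_asdict():
    return Point(**row._asdict())


def via_explicit():
    return Point(id=row.id, sensor_name=row.sensor_name, data=row.data)


# Several repeats give a feel for how stable each measurement is
for candidate in (via_asdict, via_explicit):
    timings = timeit.repeat(candidate, number=20_000, repeat=5)
    print(f"{candidate.__name__}: best {min(timings):.4f}s, worst {max(timings):.4f}s")
```

If the best and worst timings of one variant overlap with those of the other, the difference is probably too small to justify a less maintainable implementation.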

Tip

There is another potential optimization to class creation, setting a __slots__ attribute on the class, like __slots__ = {"sensor_name", "data", "deployment_id", "id", "collected_at"}. This allows a developer to guarantee that only specifically named attributes will ever be set on an instance, which allows the interpreter to add many optimizations. At the time of writing, there are some incompatibilities between data classes and __slots__ that make it less easy to use, but if you want to optimize instantiation of your objects, then I recommend taking a look.
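As a minimal sketch of what __slots__ guarantees, the following uses a plain class rather than a data class. SlottedPoint is a hypothetical name with only two of the fields. Python 3.10 and later also accept @dataclass(slots=True), which may remove some of the friction mentioned in the tip.

```python
class SlottedPoint:
    # Only these attribute names may ever be set on an instance
    __slots__ = ("sensor_name", "data")

    def __init__(self, sensor_name, data):
        self.sensor_name = sensor_name
        self.data = data


point = SlottedPoint("Temperature", 21.5)

try:
    point.unit = "C"  # Not listed in __slots__, so this is rejected
except AttributeError as error:
    print(f"rejected: {error}")

# Instances have no per-instance __dict__ to allocate or look up
print(hasattr(point, "__dict__"))
```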

The same is true of the other two candidates: the subiterator() function and the list comprehension are very minimal; changes to them decrease readability and do not bring substantial performance improvements.

It’s relatively rare for a small, easily understood function to be a candidate for significant performance improvement, as poor performance is often correlated with complexity. If the complexity in your system is due to the composition of simple functions, then performance improvements come from optimizing control flow. If you have very long functions that do complex things, then it’s more likely that significant improvements can come from optimizing functions in isolation.

Other profilers

The profiler that comes with Python is enough to get useful information in most cases. Still, as code performance is such an important topic, there are other profilers available that have unique advantages and disadvantages.

timeit

The most important alternative profiler to mention is also from the Python standard library, called timeit. timeit is useful for profiling fast, independent functions. Rather than monitoring a program in normal operation, timeit runs given code repeatedly and returns the cumulative time taken.

>>> import timeit

>>> from apd.aggregation.utils import merc_y

>>> timeit.timeit("merc_y(52.2)", globals={"merc_y": merc_y})

1.8951617999996415

When called with the default arguments, as previously shown, the output is the number of seconds needed to execute the first argument one million times, measured using the most accurate clock available. Only the first argument (stmt=) is required, which is a string representation of the code to be executed each time. A second string argument (setup=) represents setup code that must be executed before the test starts, and a globals= dictionary allows passing arbitrary items into the namespace of the code being profiled. This is especially useful for passing in the function under test, rather than importing it in the setup= code. The optional number= argument allows us to specify how many times the code should be run, as one million executions is an inappropriate amount for functions that take more than about 50 microseconds to execute.⁵

Both the string representing the code to test and the setup= strings can be multiline strings containing a series of Python statements. Be aware, however, that any definitions or imports in the first string are run every time, so all setup code should be done in the second string or passed directly as a global.
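The following sketch combines the stmt=, setup=, globals=, and number= arguments in one call. The merc_y_example function is a hypothetical stand-in written for this example, not the implementation from apd.aggregation.utils.

```python
import math
import timeit


def merc_y_example(lat: float) -> float:
    # Hypothetical stand-in, not the apd.aggregation implementation
    return math.log(math.tan(math.pi / 4 + math.radians(lat) / 2))


runs = 100_000
elapsed = timeit.timeit(
    stmt="fn(latitude)",             # executed on every iteration
    setup="latitude = 52.2",         # executed once, before timing starts
    globals={"fn": merc_y_example},  # passes the function under test directly
    number=runs,
)
print(f"{elapsed / runs * 1e6:.3f} microseconds per call")
```

Dividing the total by number= gives a per-call figure, which is easier to compare between runs that use different iteration counts.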

line_profiler

A commonly recommended alternative profiler is line_profiler by Robert Kern.⁶ It records information on a line-by-line basis rather than a function-by-function basis, making it very useful for pinpointing where exactly a function’s performance issues are coming from.

Unfortunately, the trade-offs for line_profiler are quite significant. It requires modification to your Python program to annotate each function you wish to profile, and while those annotations are in place, the code cannot be run except through line_profiler’s custom environment. Also, at the time of writing, it’s not been possible to install line_profiler with pip for approximately two years. Although you will find many people recommending this profiler online, that’s partially due to it being available before other alternatives. I would recommend avoiding this profiler unless absolutely necessary for debugging a complex function; you may find it costs you more time to set up than you save in convenience once installed.

yappi

Another alternative profiler is yappi,⁷ which provides transparent profiling of Python code running across multiple threads and in asyncio event loops. Numbers such as the call count for iterators represent the number of times the iterator is called rather than the number of items retrieved, and no code modifications are needed to support profiling multiple threads.

The disadvantage to yappi is that it’s a relatively small project under heavy development, so you may find it to be less polished than many other Python libraries. I would recommend yappi for cases where the built-in profiler is insufficient. At the time of writing, I’d still recommend the built-in profiling tools as my first choice, but yappi is a close second.

The interface to yappi is somewhat different to the built-in profilers that we’ve used so far, as it doesn’t offer an equivalent to the run(...) function call. The yappi profiler must be enabled and disabled around the code being profiled. There is an equivalent API for the default profiler, as shown in Table 10-1.

Table 10-1

Comparison of cProfile and yappi profiling

cProfile using enable/disable API

import cProfile

profiler = cProfile.Profile()
profiler.enable()
method_to_profile()
profiler.disable()
profiler.print_stats()

Yappi-based profiling

import yappi

yappi.start()
method_to_profile()
yappi.stop()
yappi.get_func_stats().print_all()
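The enable/disable style lends itself to being wrapped in a context manager, so the profiler is always stopped even if the profiled code raises an exception. The following is a sketch for the built-in profiler only; profiled is a hypothetical helper name, and method_to_profile here is a placeholder function.

```python
import contextlib
import cProfile
import io
import pstats


@contextlib.contextmanager
def profiled(sort="cumulative", limit=10):
    """Hypothetical helper: profile the body of a with block using cProfile."""
    profiler = cProfile.Profile()
    report = io.StringIO()
    profiler.enable()
    try:
        yield report
    finally:
        # Always stop profiling, then write the sorted report to the buffer
        profiler.disable()
        pstats.Stats(profiler, stream=report).sort_stats(sort).print_stats(limit)


def method_to_profile():
    # Placeholder workload
    return sorted(str(n) for n in range(50_000))


with profiled() as report:
    method_to_profile()

print(report.getvalue())
```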

Using yappi in a Jupyter cell gives us the ability to call the functions in the underlying code without needing to work around threading and asyncio issues. We could have used yappi to profile our code without making the debug= parameter change earlier. In the preceding example, if method_to_profile() called interactable_plot_multiple_charts(...) and widgets.interactive(...), the resulting profile output would be as follows:

Clock type: CPU
Ordered by: totaltime, desc

name                                  ncall  tsub      ttot      tavg
..futures\thread.py:52 _WorkItem.run  17     0.000000  9.765625  0.574449
..rrent\futures\thread.py:66 _worker  5/1    0.000000  6.734375  1.346875
..38\Lib\threading.py:859 Thread.run  5/1    0.000000  6.734375  1.346875
..ndowsSelectorEventLoop.run_forever  1      0.000000  6.734375  6.734375
..b\asyncio\events.py:79 Handle._run  101    0.000000  6.734375  0.066677
..lectorEventLoop.run_until_complete  1      0.000000  6.734375  6.734375
..WindowsSelectorEventLoop._run_once  56     0.000000  6.734375  0.120257
..gation\analysis.py:282 plot_sensor  4      0.093750  6.500000  1.625000
..egation\analysis.py:304 <listcomp>  12     0.031250  5.515625  0.459635
...

The total times displayed by yappi are significantly higher than those from cProfile in this example. You should only ever compare the times produced by a profiler to results generated on the same hardware with the same tools, as performance can vary wildly⁸ when profilers are enabled.

Yappi Helper Functions

Yappi supports filtering stats by function and module out of the box. There is also an option to provide custom filter functions, to determine exactly which code should be displayed in performance reports. There are some other options available; you should check the documentation of yappi to find the recommended way to filter output to only show code you’re interested in.

The code accompanying this chapter has some helper functions to make yappi profiling more comfortable from a Jupyter context. These are profile_with_yappi, a context manager to handle activating and deactivating the profiler; jupyter_page_file, a context manager to help display the profiling data in the same way as the %%prun cell magic, not merged in with cell output; and yappi_package_matches, a helper that uses the filter_callback= option to restrict the stats displayed to only show modules within a given Python package. An example of using these helper functions is shown as Listing 10-2.

from apd.aggregation.analysis import (interactable_plot_multiple_charts, configs)
from apd.aggregation.utils import (jupyter_page_file, profile_with_yappi, yappi_package_matches)
import yappi

with profile_with_yappi():
    plot = interactable_plot_multiple_charts()
    plot()

with jupyter_page_file() as output:
    yappi.get_func_stats(
        filter_callback=lambda stat: yappi_package_matches(stat, ["apd.aggregation"])
    ).print_all(output)

Listing 10-2

Jupyter cell for yappi profiling, with part of the Jupyter output shown

[Image: part of the Jupyter output for this cell]

None of these three helpers are strictly needed, but they provide for a more user-friendly interface.

Tracemalloc

The profilers we’ve looked at so far all measure the CPU resources needed to run a piece of code. The other primary resource available to us is memory. A program that runs quickly but requires a large amount of RAM would run significantly more slowly on systems that have less RAM available.

Python has a built-in RAM allocation profiler, called tracemalloc. This module provides tracemalloc.start() and tracemalloc.stop() functions to enable and disable the profiler, respectively. A profile result can be requested at any time by using the tracemalloc.take_snapshot() function. An example of using this on our plotting code is given as Listing 10-3.

The result of this is a Snapshot object, which has a statistics(...) method to return a list of individual statistics. The first argument to this function is the key by which to group results. The most useful two keys to use are "lineno" (for line-by-line profiling) and "filename" (for whole file profiling). A cumulative= flag allows the user to choose between including the memory use of indirectly called functions or not. That is, should each statistic line represent what a line does directly or all the consequences of running that line?

import tracemalloc
from apd.aggregation.analysis import interactable_plot_multiple_charts

tracemalloc.start()
plot = interactable_plot_multiple_charts()()
snapshot = tracemalloc.take_snapshot()
tracemalloc.stop()

for line in snapshot.statistics("lineno", cumulative=True):
    print(line)

Listing 10-3

Example script to debug memory usage after plotting the charts

The documentation in the standard library provides some helper functions to provide for better formatting of the output data, especially the code sample for the display_top(...) function.⁹

Caution

The tracemalloc profiler only shows memory allocations that are still active at the time that the snapshot is generated. Profiling our program shows that the SQL parsing uses a lot of RAM but won’t show our DataPoint objects, despite them taking up more RAM. Our objects are short-lived, unlike the SQL ones, so they have already been discarded by the time we generate the snapshot. When debugging peak memory usage, you must create a snapshot at the peak.
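The effect of snapshot timing can be demonstrated with a deliberately short-lived allocation. This is a sketch with arbitrary sizes; snapshot_size is a hypothetical helper, and tracemalloc.get_traced_memory() is used to read the peak that tracemalloc recorded regardless of when the snapshots were taken.

```python
import tracemalloc


def snapshot_size(snapshot) -> int:
    # Total bytes across all live allocations recorded in the snapshot
    return sum(stat.size for stat in snapshot.statistics("filename"))


tracemalloc.start()

# Roughly 20 MB of short-lived data
short_lived = [bytes(10_000) for _ in range(2_000)]
at_peak = tracemalloc.take_snapshot()

del short_lived  # Freed before the next snapshot is taken
after_release = tracemalloc.take_snapshot()

current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"snapshot at peak:       {snapshot_size(at_peak)} bytes")
print(f"snapshot after release: {snapshot_size(after_release)} bytes")
print(f"peak recorded:          {peak} bytes")
```

Only the first snapshot contains the large list; a snapshot taken at the end of the program would give no hint that it ever existed.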

New Relic

If you’re running a web-based application, then the commercial service New Relic may provide useful profiling insights.¹⁰ It provides a tightly integrated profiling system that allows you to monitor the control flow from web requests, the functions involved in servicing them, and interactions with databases and third-party services as part of the render process.

The trade-offs for New Relic and its competitors are substantial. You gain access to an excellent set of profiling data, but it doesn’t fit all application types and costs a significant amount of money. Besides, the fact that the actions of real users are used to perform the profiling means that you should consider user privacy before introducing New Relic to your system. That said, New Relic’s profiling has provided some of the most useful performance analyses I’ve seen.

Optimizing control flow

More commonly, it’s not a single function that is the cause of performance problems within a Python system. As we saw earlier, writing code in a naïve way generally results in a function that cannot be optimized beyond changing what it’s doing.

In my experience, the most common source of low performance is a function that calculates more than it needs to. For example, in our first implementations of features to get collated data, we did not yet have database-side filtering, so we added a loop to filter the data we want from the data that’s not relevant.

Filtering the input data later doesn’t just move work around; it can increase the total amount of work being done. In this situation, the work done is loading data from the database, setting up DataPoint records, and extracting the relevant data from those records. By moving the filtering from the loading step to the extracting step, we set up DataPoint records for objects that we know we don’t care about.
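The extra work caused by late filtering can be counted directly. In this sketch, Record is a hypothetical stand-in for DataPoint, and the rows list stands in for database results; a counter records how many objects each approach constructs.

```python
from dataclasses import dataclass

built = {"count": 0}  # Tracks how many records have been constructed


@dataclass
class Record:
    sensor_name: str
    value: float

    def __post_init__(self):
        built["count"] += 1


rows = [("Temperature", 21.0), ("Humidity", 40.0), ("Temperature", 22.5)] * 1_000


def filter_late(raw_rows, wanted):
    # Build every record, then throw most of them away
    records = (Record(name, value) for name, value in raw_rows)
    return [record for record in records if record.sensor_name == wanted]


def filter_early(raw_rows, wanted):
    # Discard unwanted rows before paying for record construction
    kept = ((name, value) for name, value in raw_rows if name == wanted)
    return [Record(name, value) for name, value in kept]


built["count"] = 0
late = filter_late(rows, "Humidity")
built_late = built["count"]

built["count"] = 0
early = filter_early(rows, "Humidity")
built_early = built["count"]

print(f"filter late built {built_late} records, filter early built {built_early}")
```

Both functions return the same 1,000 records, but the late filter constructs all 3,000 first. Filtering in the database goes one step further, as the unwanted rows are never transferred at all.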

Complexity

The time taken by a function is not always directly proportional to the size of the input, but it’s a good approximation for functions that loop over the data once. Sorting and other more complex operations behave differently.

The relationship between how long functions take (or how much memory they need) and their input size is called computational complexity. Most programmers never need to worry about the exact complexity class of functions, but it’s worth being aware of the broad-strokes differences when optimizing your code.

You can estimate the relationship between input size and time using the timeit function with different inputs, but as a rule of thumb, it’s best to avoid nesting loops within loops. Nested loops that always have a very small number of iterations are okay, but looping over user input within another loop over user input results in the time a function takes increasing rapidly¹¹ as the amount of user input increases.
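As a sketch of estimating this relationship, the following times two hypothetical functions that find the items common to two lists, one with a nested loop and one with a set lookup, while doubling the input size. The sizes and repeat count are arbitrary choices.

```python
import timeit


def common_items_nested(first, second):
    # A loop within a loop: work grows with len(first) * len(second)
    return [a for a in first if any(a == b for b in second)]


def common_items_set(first, second):
    # One pass over each input: work grows with len(first) + len(second)
    lookup = set(second)
    return [a for a in first if a in lookup]


for size in (200, 400, 800):
    first = list(range(size))
    second = list(range(size // 2, size + size // 2))
    nested = timeit.timeit(lambda: common_items_nested(first, second), number=5)
    with_set = timeit.timeit(lambda: common_items_set(first, second), number=5)
    print(f"n={size}: nested {nested:.4f}s, set-based {with_set:.4f}s")
```

On most machines, the nested version should take roughly four times longer each time the input doubles, while the set-based version should only roughly double.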

The longer a function takes for a given input size, the more important it is to minimize the amount of extraneous data it processes.

In Figure 10-3, the horizontal axis maps to the time taken and the vertical axis to the amount of input a stage in the pipeline has to process. The width of a step, and therefore the time it takes to process, is proportional to the amount of data that it is processing.

These two flows illustrate the amount of work that needs to happen to process a single sensor, with the top flow having database-level filtering and the bottom having filtering in Python. In both cases, the total amount of output is the same, but the intermediate stages have different amounts of data to process and therefore take a different amount of time.

Figure 10-3

Diagram of the size of data set for code that filters in the database vs. filtering during cleaning

There are two places that we discard data: when we are finding only the data for the sensor in question and when discarding invalid data. By moving the sensor filter to the database, we reduce the amount of work done in the load step and therefore the amount of time needed. We are moving the bulk of the filtering, with the more complex filtering for removing invalid data still happening in the clean step. If we could move this filtering to the database, it would further decrease the time taken by the load step, albeit not as much.

We already assumed that we’d need to filter in the database when we wrote the functions, partially to improve the usability of the API, but we can test the assumption that it improves performance by using the yappi profiler and the ability to provide explicit configurations to our drawing system. We can then directly compare the time taken to draw a chart with database-backed filtering with Python filtering. The implementation of the performance analysis for filtering in the database is shown in Listing 10-4.

import yappi
from apd.aggregation.analysis import (interactable_plot_multiple_charts, Config)
from apd.aggregation.analysis import (clean_temperature_fluctuations, get_one_sensor_by_deployment)
from apd.aggregation.utils import profile_with_yappi

yappi.set_clock_type("wall")

filter_in_db = Config(
    clean=clean_temperature_fluctuations,
    title="Ambient temperature",
    ylabel="Degrees C",
    get_data=get_one_sensor_by_deployment("Temperature"),
)

with profile_with_yappi():
    plot = interactable_plot_multiple_charts(configs=[filter_in_db])
    plot()

yappi.get_func_stats().print_all()

Listing 10-4

Jupyter cell to profile a single chart, filtering in SQL

The following statistics are a partial output from the cell’s output, showing some of the entries that are most interesting to us. We can see that 10828 data objects were loaded, that the get_data(...) function took 2.7 seconds, and that 6 database calls were made totaling 2.4 seconds. The list comprehension on line 304 of analysis.py (points = [dp async for dp in config.clean(query_results)]) is where the cleaner function is called. Cleaning the data took 0.287 seconds, but the time in the cleaning function itself was negligible.

name                                  ncall  tsub      ttot      tavg
..lectorEventLoop.run_until_complete  1      0.000240  3.001717  3.001717
..alysis.py:341 plot_multiple_charts  1      2.843012  2.999702  2.999702
..gation\analysis.py:282 plot_sensor  1      0.000000  2.720996  2.720996
..query.py:86 get_data_by_deployment  1      2.706142  2.706195  2.706195
..d\aggregation\query.py:39 get_data  1      2.569511  2.663460  2.663460
..lchemy\orm\query.py:3197 Query.all  6      0.008771  2.407840  0.401307
..lchemy\orm\loading.py:35 instances  10828  0.005485  1.588923  0.000147
..egation\analysis.py:304 <listcomp>  4      0.000044  0.286975  0.071744
..175 clean_temperature_fluctuations  4      0.000000  0.286888  0.071722

We can re-run the same test but with a new version of this same chart, where all the filtering happens in Python. Listing 10-5 demonstrates this, by adding a new cleaner function that does the filtering and using the existing get_data_by_deployment(...) function as the data source. This represents how we would need to filter data if we hadn’t added a sensor_name= parameter to get_data(...).

import yappi
from apd.aggregation.analysis import (interactable_plot_multiple_charts, Config, clean_temperature_fluctuations, get_data_by_deployment)
from apd.aggregation.utils import (jupyter_page_file, profile_with_yappi, YappiPackageFilter)


async def filter_and_clean_temperature_fluctuations(datapoints):
    filtered = (item async for item in datapoints if item.sensor_name == "Temperature")
    cleaned = clean_temperature_fluctuations(filtered)
    async for item in cleaned:
        yield item


filter_in_python = Config(
    clean=filter_and_clean_temperature_fluctuations,
    title="Ambient temperature",
    ylabel="Degrees C",
    get_data=get_data_by_deployment,
)

with profile_with_yappi():
    plot = interactable_plot_multiple_charts(configs=[filter_in_python])
    plot()

yappi.get_func_stats().print_all()

Listing 10-5

Jupyter cell to profile drawing the same chart but without any database filtering

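In this version, the filtering happens in filter_and_clean_temperature_fluctuations(...), so we expect this function to take a long time. The additional time taken is partially in the generator expression in that function, but not entirely. The total time taken by plot_multiple_charts(...) has increased from 3.0 seconds to 8.0 seconds, of which 1.3 seconds are the filtering. Most of the rest is the cost of loading rows that are later discarded: Query.all now takes 6.1 seconds rather than 2.4 seconds, and 67305 objects are loaded rather than 10828. This shows that by filtering in the database, we’ve saved 3.7 seconds of overhead, and the chart as a whole is drawn in roughly 60% less time (3.0 seconds rather than 8.0 seconds).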

name                                  ncall  tsub      ttot      tavg
..lectorEventLoop.run_until_complete  1      0.000269  7.967136  7.967136
..alysis.py:341 plot_multiple_charts  1      7.637066  7.964143  7.964143
..gationanalysis.py:282 plot_sensor   1      0.000000  6.977470  6.977470
..query.py:86 get_data_by_deployment  1      6.958155  6.958210  6.958210
..daggregationquery.py:39 get_data    1      6.285337  6.881415  6.881415
..lchemyormquery.py:3197 Query.all    6      0.137161  6.112309  1.018718
..lchemyormloading.py:35 instances    67305  0.065920  3.424629  0.000051
..egationanalysis.py:304 <listcomp>   4      0.000488  1.335928  0.333982
..and_clean_temperature_fluctuations  4      0.000042  1.335361  0.333840
..175 clean_temperature_fluctuations  4      0.000000  1.335306  0.333826
..-input-4-927271627100>:7 <genexpr>  4      0.000029  1.335199  0.333800

Visualizing profiling data

Complex iterator functions are hard to profile, as seen with clean_temperature_fluctuations(...) listing its tsub time as exactly zero. It is a complex function that calls other methods, so for it to spend exactly zero time must be a rounding error. Profiling running code can point you in the right direction, but you’ll only ever get indicative numbers from this approach. It’s also hard to see how the 0.287 seconds total time breaks down by constituent functions from this view.

Both the built-in profile module and yappi support exporting their data in pstats format, a Python-specific profile format that can be passed to visualization tools. Yappi also supports the callgrind format from the Valgrind profiling tool.
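As a minimal sketch of the pstats route using only the standard library, the following profiles a small workload with cProfile, writes the data to disk, and loads it back. The busy_work(...) function and the temporary file name are illustrative assumptions rather than part of apd.aggregation.

```python
import cProfile
import os
import pstats
import tempfile


def busy_work(n: int) -> int:
    # Hypothetical workload standing in for the code being profiled
    return sum(i * i for i in range(n))


profiler = cProfile.Profile()
profiler.enable()
result = busy_work(10_000)
profiler.disable()

# Write the collected data to a file in pstats format
stats_path = os.path.join(tempfile.mkdtemp(), "example.pstats")
profiler.dump_stats(stats_path)

# Load the saved file again; other tools can read the same file
loaded = pstats.Stats(stats_path)
loaded.sort_stats("cumulative")
```

Calling loaded.print_stats(10) would then print the ten entries with the highest cumulative time, much like the report from cProfile.run(...).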

We can save a callgrind profile from yappi using yappi.get_func_stats().save("callgrind.filter_in_db", "callgrind") and then load it into a callgrind visualizer like KCachegrind.¹² Figure 10-4 shows an example of displaying the database-filtered version of this code in QCachegrind, where the area of the blocks corresponds to the time spent in the corresponding function.

Figure 10-4

Call chart for clean_temperature_fluctuations when filtering data in the database

You may be surprised to learn that get_data(...) is not only present in this chart but is by far the largest single block. The clean_temperature_fluctuations(...) function doesn’t appear to call the get_data(...) function, so it’s not immediately obvious why this function should account for most of the time taken.

Iterators make reasoning about call flow difficult, because pulling an item from an iterable in a loop doesn’t look like a function call. Under the hood, Python is calling youriterable.__next__() (or youriterable.__anext__()), which passes execution back to the underlying function, completing the previous yield. A for loop can, therefore, cause any number of functions to be called, even if its body is empty. The async for construction makes this a bit clearer, as it is explicitly saying that the underlying code may await. It wouldn’t be possible for the underlying code to await unless control was passing to other code rather than just interacting with a static data structure. When profiling code that consumes iterables, you will find that the underlying data generation functions appear in the output as though they were called by the functions that use the iterable.
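The effect is easy to demonstrate in isolation. In this small sketch (the produce(...) generator and the calls list are hypothetical names used only for illustration), a loop with an empty body still causes the generator function to run once per item:

```python
calls = []


def produce(count):
    # Each value is only generated when the consumer asks for it
    for number in range(count):
        calls.append(f"producing {number}")
        yield number


generator = produce(3)

# next() calls generator.__next__(), running produce() up to its first yield
first = next(generator)

for _ in generator:
    # The loop body does nothing, yet produce() resumes on every iteration
    pass
```

A profiler attributes the time spent inside produce(...) to whichever code is driving the loop, which is why get_data(...) shows up underneath the cleaner function.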

Consuming Iterables and Single Dispatch Functions

We can write a function that consumes an iterator as soon as possible, which simplifies the call stack somewhat. Consuming the iterator can reduce performance by preventing tasks running in parallel and requires that there is enough memory to contain the whole iterable, but it does greatly simplify the output of profiling tools. Simple functions for consuming an iterable and an async iterable while retaining the same interface are shown as Listing 10-6.

def consume(input_iterator):
    items = [item for item in input_iterator]

    def inner_iterator():
        for item in items:
            yield item

    return inner_iterator()

async def consume_async(input_iterator):
    items = [item async for item in input_iterator]

    async def inner_iterator():
        for item in items:
            yield item

    return inner_iterator()

Listing 10-6

Pair of functions for consuming iterators in place

This pair of functions takes an iterator (or async iterator) and consumes it as soon as it’s called (or awaited), returning a new iterator that yields only from that preconsumed source. These functions are used as follows:

# Synchronous
nums = (a for a in range(10))
consumed = consume(nums)

# Async
async def async_range(num):
    for a in range(num):
        yield a

nums = async_range(10)
consumed = await consume_async(nums)

We can simplify this using the functools module in the standard library, specifically the @singledispatch decorator. Back in the second chapter, we looked at Python’s dynamic dispatch functionality, which allows a function to be looked up by the class to which it’s attached. We’re doing something similar here; we have a pair of functions that are associated with an underlying data type, but these data types aren’t classes we’ve written. We have no control over what functions are attached to them, as the two types are features of the core language rather than classes we’ve created and can edit.

The @singledispatch decorator marks functions as having multiple implementations, differentiated by the type of the first argument. Rewriting our functions to use this approach (Listing 10-7) only involves adding decorators to them to join the alternative implementations to the base one and a type hint to differentiate the variants.

import collections.abc
import functools

@functools.singledispatch
def consume(input_iterator):
    items = [item for item in input_iterator]

    def inner_iterator():
        for item in items:
            yield item

    return inner_iterator()

@consume.register
async def consume_async(input_iterator: collections.abc.AsyncIterator):
    items = [item async for item in input_iterator]

    async def inner_iterator():
        for item in items:
            yield item

    return inner_iterator()

Listing 10-7

Pair of functions for consuming iterators in place with single dispatch

These two functions behave in exactly the same way as the previous implementations, except that the consume(...) function can be used for either type of iterator. It transparently switches between synchronous and asynchronous implementations based on the type of its input. If the first argument is an AsyncIterator, then the consume_async(...) variant is used; otherwise the consume(...) variant is used.

nums = (a for a in range(10))
consumed = consume(nums)

nums = async_range(10)
consumed = await consume(nums)
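A bare await at the top level only works in environments such as Jupyter that already run an event loop. The following self-contained sketch shows the same dispatch behavior in a plain script by using asyncio.run(...). It is a condensed restatement of Listing 10-7, with the synchronous body shortened to list(...) and iter(...) and with a hypothetical main() coroutine added for illustration.

```python
import asyncio
import collections.abc
import functools


@functools.singledispatch
def consume(input_iterator):
    # Default implementation: eagerly exhaust a synchronous iterator
    items = list(input_iterator)
    return iter(items)


@consume.register
async def consume_async(input_iterator: collections.abc.AsyncIterator):
    # Selected when the first argument is an async iterator; must be awaited
    items = [item async for item in input_iterator]

    async def inner_iterator():
        for item in items:
            yield item

    return inner_iterator()


async def async_range(num):
    for a in range(num):
        yield a


async def main():
    # consume() dispatches to consume_async(), which returns a coroutine
    consumed = await consume(async_range(3))
    return [item async for item in consumed]


sync_result = list(consume(a for a in range(3)))
async_result = asyncio.run(main())
```

Both calls go through the single consume name; only the caller's need to await the result reveals which implementation was chosen.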

The functions passed to register must have a type definition or a type passed to the register function itself. We’ve used collections.abc.AsyncIterator rather than typing.AsyncIterator as the type here, as the type must be runtime checkable. This means that @singledispatch is limited to dispatching on concrete classes or abstract base classes.

The typing.AsyncIterator type is a generic type: we can use typing.AsyncIterator[int] to mean an iterator of ints. This is used by mypy for static analysis, but isn’t used at runtime. There’s no way that a running Python program can know if an arbitrary async iterator is a typing.AsyncIterator[int] iterator without consuming the whole iterator and checking its contents.

collections.abc.AsyncIterator makes no guarantees about the contents of the iterator, so it is similar to typing.AsyncIterator[typing.Any], but as it is an abstract base class, it can be checked with isinstance(...) at runtime.
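The difference can be checked directly. This short sketch, which reuses the async_range(...) helper from earlier, shows that the abstract base class works with isinstance(...) while a parameterized generic is rejected at runtime:

```python
import collections.abc
import typing


async def async_range(num):
    for a in range(num):
        yield a


async_numbers = async_range(3)
sync_numbers = (a for a in range(3))

# The abstract base class supports isinstance() checks at runtime
is_async = isinstance(async_numbers, collections.abc.AsyncIterator)
sync_is_async = isinstance(sync_numbers, collections.abc.AsyncIterator)

# A parameterized generic cannot be used in an isinstance() check
try:
    isinstance(async_numbers, typing.AsyncIterator[int])
except TypeError as err:
    generic_check_error = str(err)
else:
    generic_check_error = None
```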

Caching

Another way that we can improve performance is to cache the results of function calls. A cached function call keeps a record of past calls and their results, to avoid computing the same value multiple times. So far, we’ve been plotting temperatures using the centigrade temperature system, but a few countries have retained the archaic Fahrenheit system of measurement. It would be nice if we could specify which temperature system we want to use to display our charts, so users can choose the system with which they are most familiar.

The work of converting the temperature scale is orthogonal to the task done by the existing clean_temperature_fluctuations(...) method; we may want to convert temperatures without cleaning out fluctuations, for example. To achieve this, we’ll create a new function that takes a cleaner and a temperature system and returns a new cleaner that calls the underlying one, then does a temperature conversion.

def convert_temperature(magnitude: float, origin_unit: str, target_unit: str) -> float:
    temp = ureg.Quantity(magnitude, origin_unit)
    return temp.to(target_unit).magnitude

def convert_temperature_system(cleaner, temperature_unit):
    async def converter(datapoints):
        results = cleaner(datapoints)
        async for date, temp_c in results:
            yield date, convert_temperature(temp_c, "degC", temperature_unit)

    return converter

The preceding function does not have any type hints, as they are very verbose. Both the cleaner argument and the return value from convert_temperature_system(...) are of the type t.Callable[[t.AsyncIterator[DataPoint]], t.AsyncIterator[t.Tuple[datetime.datetime, float]]], which is a ridiculously complex construction to include twice in a single line of code. These types are used repeatedly in our analysis functions and, while hard to recognize at a glance, map to easily understood concepts. These are good candidates for factoring out into variables, the result of which is given as Listing 10-8.

CLEANED_DT_FLOAT = t.AsyncIterator[t.Tuple[datetime.datetime, float]]
CLEANED_COORD_FLOAT = t.AsyncIterator[t.Tuple[t.Tuple[float, float], float]]

DT_FLOAT_CLEANER = t.Callable[[t.AsyncIterator[DataPoint]], CLEANED_DT_FLOAT]
COORD_FLOAT_CLEANER = t.Callable[[t.AsyncIterator[DataPoint]], CLEANED_COORD_FLOAT]

def convert_temperature(magnitude: float, origin_unit: str, target_unit: str) -> float:
    temp = ureg.Quantity(magnitude, origin_unit)
    return temp.to(target_unit).magnitude

def convert_temperature_system(
    cleaner: DT_FLOAT_CLEANER, temperature_unit: str,
) -> DT_FLOAT_CLEANER:
    async def converter(datapoints: t.AsyncIterator[DataPoint],) -> CLEANED_DT_FLOAT:
        results = cleaner(datapoints)
        async for date, temp_c in results:
            yield date, convert_temperature(temp_c, "degC", temperature_unit)

    return converter

Listing 10-8

Typed conversion functions

Typing Protocols, Typevars and Variance

We have used t.TypeVar(...) before, to represent a placeholder in a generic type, such as when we defined the draw(...) function in our config class. We had to use T_key and T_value type variables there because some functions in the class used a tuple of key and value and others used a pair of key and value iterables.

That is, when a clean= function is of the type

t.Callable[[t.AsyncIterator[DataPoint]], t.AsyncIterator[t.Tuple[datetime.datetime, float]]]

the corresponding draw= function is of the type

t.Callable[[t.Any, t.Iterable[datetime.datetime], t.Iterable[float], t.Optional[str]], None]

We need to have access to both the datetime and float component types independently to build both type declarations. Type variables allow us to tell mypy that a type is a placeholder that will be supplied later; here we need both a T_key and a T_value type variable. We can also use them to define a pattern for a generic type called Cleaned and two instances of that type with specific values.

Cleaned = t.AsyncIterator[t.Tuple[T_key, T_value]]

CLEANED_DT_FLOAT = Cleaned[datetime.datetime, float]

CLEANED_COORD_FLOAT = Cleaned[t.Tuple[float, float], float]

If you’re expecting there to be lots of different types of cleaned/cleaner types, then this approach is a bit clearer than explicitly assigning the full types to every function.

The cleaner functions that return this data are a bit more complicated, as mypy’s ability to infer the use of generic types in callables has limits. To create complex aliases for callable and class types (as opposed to data variables), we must use the protocol feature. A protocol is a class that defines attributes that an underlying object must possess to be considered a match, very much like a custom abstract base class’s __subclasshook__, but in a declarative style and for static typing rather than runtime type checking.

We want to define a callable that takes an AsyncIterator of datapoints and returns some other type. The other type here is being represented by the T_cleaned_co type variable, as follows:

T_cleaned_co = t.TypeVar("T_cleaned_co", covariant=True, bound=Cleaned)

class CleanerFunc(Protocol[T_cleaned_co]):
    def __call__(self, datapoints: t.AsyncIterator[DataPoint]) -> T_cleaned_co:
        ...

This CleanerFunc type can then be used to generate the *_CLEANER variables that match the CLEANED_* variables from earlier. The type used in square brackets for CleanerFunc is the variant of Cleaned that this particular function provides.

DT_FLOAT_CLEANER = CleanerFunc[CLEANED_DT_FLOAT]

COORD_FLOAT_CLEANER = CleanerFunc[CLEANED_COORD_FLOAT]

The covariant= argument in the TypeVar is a new addition, as is the _co suffix we used for the variable name. Previously, our type variables have been used to define both function parameters and function return values. These are invariant types: the type definitions must match exactly. If we declare a function that expects a Sensor[float] as an argument, we cannot pass a Sensor[int]. Normally, if we were to define a function that expects a float as an argument, it would be fine to pass an int.

This is because we haven’t given mypy permission to use its compatibility checking logic on the constituent types of the Sensor class. This permission is given with the optional covariant= and contravariant= parameters to type variables. A covariant type is one where the normal subtype logic applies, so if the Sensor’s T_value were covariant, then functions that expect Sensor[float] can accept Sensor[int], in the same way that functions that expect float can accept int. This makes sense for generic classes that have functions that provide data to the function they’re passed to.

A contravariant type (usually named with the _contra suffix) is one where the inverse logic holds. If Sensor’s T_value were contravariant, then functions that expect Sensor[float] cannot accept Sensor[int], but they must accept things less specific than float, such as Sensor[complex]. This is useful for generic classes that have functions that consume data from the function they’re passed to.

We’re defining a protocol that provides data,¹³ so a covariant type is the natural fit. Sensors are simultaneously a provider (sensor.value()) and a consumer (sensor.format(...)) of data and so must be invariant.

Mypy detects the appropriate type of variance when checking a protocol and raises an error if it doesn’t match. As we are defining a function that provides data, we must set covariant=True to prevent this error from showing.
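To make the two directions concrete, here is a small sketch that is separate from the apd code. The Provider and Consumer protocols and the classes that satisfy them are hypothetical names chosen for illustration. The code runs as ordinary Python, and the comments describe what a type checker such as mypy would be expected to accept.

```python
import typing as t

T_co = t.TypeVar("T_co", covariant=True)
T_contra = t.TypeVar("T_contra", contravariant=True)


class Provider(t.Protocol[T_co]):
    # T_co only appears as a return type, so it can be covariant
    def get(self) -> T_co:
        ...


class Consumer(t.Protocol[T_contra]):
    # T_contra only appears as a parameter type, so it can be contravariant
    def put(self, value: T_contra) -> None:
        ...


class IntProvider:
    def get(self) -> int:
        return 21


class ComplexConsumer:
    def __init__(self) -> None:
        self.seen: t.List[complex] = []

    def put(self, value: complex) -> None:
        self.seen.append(value)


def read_float(source: Provider[float]) -> float:
    return source.get()


def write_float(sink: Consumer[float], value: float) -> None:
    sink.put(value)


# Covariance: a provider of int is usable where a provider of float is expected
reading = read_float(IntProvider())

# Contravariance: a consumer of complex is usable where a consumer of float is expected
sink = ComplexConsumer()
write_float(sink, 1.5)
```

Swapping the two (passing a provider of complex to read_float, or a consumer of int to write_float) is the kind of call a type checker should reject, even though Python itself would run it.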

The bound= parameter specifies a minimum specification that this variable can be inferred to be. As this is specified to be Cleaned, T_cleaned_co is only valid if it can be inferred to be a match to Cleaned[Any, Any]. CleanerFunc[int] is not valid, as int is not a subtype of Cleaned[Any, Any]. The bound= parameter can also be used to create a reference to the type of an existing variable, in which case it allows the definition of types that follow the signature of some externally provided function.

Protocols and type variables are powerful features that can allow for much simpler typing, but they can also make code look confusing if they’re overused. Storing types as variables in a module is a good middle ground, but you should ensure that all typing boilerplate is well commented and perhaps even placed in a utility file to avoid overwhelming new contributors to your code.

With our new conversion code in place, we can create a plot configuration that draws the temperature chart in degrees Fahrenheit. Listing 10-9 shows how end-users of the apd.aggregation package can create a new Config object that behaves in the same way as the existing one but renders its values in their preferred temperature scale.

import yappi

from apd.aggregation.analysis import (interactable_plot_multiple_charts, Config)
from apd.aggregation.analysis import (convert_temperature_system, clean_temperature_fluctuations)
from apd.aggregation.analysis import get_one_sensor_by_deployment

filter_in_db = Config(
    clean=convert_temperature_system(clean_temperature_fluctuations, "degF"),
    title="Ambient temperature",
    ylabel="Degrees F",
    get_data=get_one_sensor_by_deployment("Temperature"),
)

display(interactable_plot_multiple_charts(configs=[filter_in_db])())

Listing 10-9

Jupyter cell to generate a single chart showing temperature in degrees F

We’ve changed the control flow by adding this function, so we should do another profiling run to find what changes it made. We wouldn’t want temperature conversion to take a significant amount of time.

..ationanalysis.py:191 datapoint_ok 10818 0.031250 0.031250 0.000003

..onutils.py:41 convert_temperature 8455 0.078125 6.578125 0.000778

The convert_temperature(...) function itself is invoked 8455 times, although datapoint_ok(...) is invoked 10818 times. This tells us that by filtering through datapoint_ok(...) and the cleaning function before converting the temperature, we’ve avoided 2363 calls to convert_temperature(...) for data we don’t need to know about to draw the current chart. However, the calls we did make still took 6.58 seconds, tripling the total time taken to draw this chart. This is excessive.

We can optimize this function by reimplementing it to remove the dependency on pint, thereby reducing the overhead involved. If convert_temperature(...) were a simple arithmetic function, the time taken would be reduced to 0.02 seconds, at the expense of a lot of flexibility. This is fine for a simple conversion where both units are known; pint excels in situations where the exact conversion isn’t known ahead of time.
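A pint-free version might look something like the following sketch. The function name convert_temperature_simple and the decision to support only the degC and degF unit names are assumptions made for illustration; this is not the implementation from the accompanying code.

```python
def convert_temperature_simple(magnitude: float, origin_unit: str, target_unit: str) -> float:
    # Hypothetical pint-free replacement; only degC and degF are understood
    if origin_unit == target_unit:
        return magnitude
    if (origin_unit, target_unit) == ("degC", "degF"):
        return magnitude * 9 / 5 + 32
    if (origin_unit, target_unit) == ("degF", "degC"):
        return (magnitude - 32) * 5 / 9
    # Anything else would previously have been handled by pint
    raise ValueError(f"Unsupported conversion: {origin_unit} to {target_unit}")
```

The loss of flexibility is visible in the final line: any unit that pint would have handled transparently, such as kelvin, now raises an error.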

Alternatively, we can cache the results of the convert_temperature(...) function. A simple cache can be achieved by creating a dictionary that maps between values keyed in degrees C and values in the chosen temperature system. The implementation in Listing 10-10 builds up a dictionary for every invocation of the iterator, preventing the same items being calculated multiple times.

def convert_temperature_system(
    cleaner: DT_FLOAT_CLEANER, temperature_unit: str,
) -> DT_FLOAT_CLEANER:
    async def converter(datapoints: t.AsyncIterator[DataPoint],) -> CLEANED_DT_FLOAT:
        temperatures = {}
        results = cleaner(datapoints)
        async for date, temp_c in results:
            if temp_c in temperatures:
                temp_f = temperatures[temp_c]
            else:
                temp_f = temperatures[temp_c] = convert_temperature(temp_c, "degC", temperature_unit)
            yield date, temp_f

    return converter

Listing 10-10

A simple manual cache

A cache’s efficiency¹⁴ is usually measured by hit rate. If our data set were to be [21.0, 21.0, 21.0, 21.0], then our hit rate would be 75% (miss, hit, hit, hit). If it were [1, 2, 3, 4], then the hit rate would drop to zero. The preceding cache implementation assumes a reasonable hit rate, as it makes no effort to evict unused values from its cache. A cache is always a trade-off between the extra memory used and the time saving it allows. The exact tipping point where it becomes worth it depends on the size of the data being stored and your individual requirements for memory and time.
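The hit rate of an unbounded cache like this one depends only on how often values repeat, so it can be estimated from the data alone. The hit_rate(...) helper below is a hypothetical function written for this illustration:

```python
import typing as t


def hit_rate(values: t.Iterable[t.Hashable]) -> float:
    # Fraction of lookups that an unbounded cache would answer without recomputing
    seen = set()
    hits = 0
    total = 0
    for value in values:
        total += 1
        if value in seen:
            hits += 1
        else:
            seen.add(value)
    return hits / total if total else 0.0


repeated = hit_rate([21.0, 21.0, 21.0, 21.0])
distinct = hit_rate([1, 2, 3, 4])
```

Here repeated is 0.75 and distinct is 0.0, matching the two data sets described above.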

A common strategy for evicting data from a cache is that of an LRU (least recently used) cache. This strategy defines a maximum cache size. If the cache is full, when a new item is to be added, it replaces the one that has gone the longest without being accessed.
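The eviction rule can be sketched in a few lines with collections.OrderedDict. The LRUCache class here is a hypothetical teaching aid that shows the strategy; it is not how functools implements its cache.

```python
import collections


class LRUCache:
    """Illustrative LRU cache; functools.lru_cache is implemented differently."""

    def __init__(self, maxsize: int) -> None:
        self.maxsize = maxsize
        self._data = collections.OrderedDict()

    def get(self, key, default=None):
        if key not in self._data:
            return default
        # Reading an item marks it as the most recently used
        self._data.move_to_end(key)
        return self._data[key]

    def put(self, key, value) -> None:
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.maxsize:
            # Discard the entry that has gone longest without being accessed
            self._data.popitem(last=False)


cache = LRUCache(maxsize=2)
cache.put(20.0, 68.0)
cache.put(21.0, 69.8)
cache.get(20.0)        # 20.0 is now more recently used than 21.0
cache.put(22.0, 71.6)  # the cache is full, so 21.0 is evicted
```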

The functools module provides an implementation of an LRU cache as a decorator, which makes it convenient for wrapping our functions. We can also use it to create cached versions of existing functions by manually wrapping a function in an LRU cache decorator.

Caution

An LRU cache can only be used if a function takes only hashable types as arguments. If a mutable type (such as a dictionary, list, set, or data class without frozen=True) is passed to a function wrapped in an LRU cache, a TypeError will be raised.
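Both ways of applying the cache, and the hashability restriction, can be seen in this short sketch. The to_fahrenheit(...) and total(...) functions are stand-ins invented for the example rather than functions from apd.aggregation.

```python
import functools


def to_fahrenheit(temp_c: float) -> float:
    # Stand-in for an expensive conversion such as the pint-based one
    return temp_c * 9 / 5 + 32


# Wrap an existing function manually instead of using the @ syntax
cached_to_fahrenheit = functools.lru_cache(maxsize=128)(to_fahrenheit)

for temp_c in [21.0, 21.0, 21.5, 21.0]:
    cached_to_fahrenheit(temp_c)

info = cached_to_fahrenheit.cache_info()


@functools.lru_cache(maxsize=None)
def total(values):
    return sum(values)


tuple_total = total((1, 2, 3))  # tuples are hashable, so this call is cached

try:
    total([1, 2, 3])            # lists are not hashable
except TypeError as err:
    unhashable_error = err
else:
    unhashable_error = None
```

After the loop, info reports two hits and two misses: the first 21.0 and the 21.5 are computed, and the two repeated 21.0 values are served from the cache.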

If we take our original, pint-based convert_temperature(...) function and add the LRU cache decorator, we can now benchmark the time it takes with a cache in place. The result of this is that the number of calls made to the function is drastically reduced but the time taken per invocation remains consistent. The 8455 invocations without the cache have become 67 invocations, corresponding to a hit rate of 99.2% and reducing the time overhead in offering this feature from 217% to 1%.

..onutils.py:40 convert_temperature 67 0.000000 0.031250 0.000466

It’s possible to retrieve additional information about the efficiency of an LRU cache without running a profiler, using the cache_info() method on the decorated function. This can be useful when debugging a complex system, as you can check which caches are performing well and which aren’t.

>>> from apd.aggregation.utils import convert_temperature

>>> convert_temperature.cache_info()

CacheInfo(hits=8455, misses=219, maxsize=128, currsize=128)

Figure 10-5 shows the time taken by all three approaches, on a logarithmic scale (the horizontal lines represent tenfold increases, not a linear increase). This helps demonstrate how close the caching and optimized approaches are; for our particular problem, caching a very expensive function results in performance in the same order of magnitude as an alternative, less flexible implementation.

Figure 10-5

Summary of performance of three approaches

Rewriting the function to avoid using pint would still result in performance improvement, but caching the results provides an improvement of approximately the same magnitude for a much smaller change, both in terms of lines of code and conceptually.

As always, there is a balancing act at play here. It’s likely that people would want temperature only in degrees Celsius or degrees Fahrenheit, so a conversion function that only provides those two is probably good enough. The conversion itself is straightforward and well understood, so the risk of introducing bugs is minimal. More complex functions may not be so easy to optimize, which makes caching a more appealing approach. Alternatively, they may process data that implies a lower hit rate, making refactoring more appealing.

The benefit of the @lru_cache decorator isn’t in the inherent efficiency of the cache (it’s a rather bare-bones cache implementation), but in that it’s easy to implement for Python functions. The implementation of a function decorated with a cache can be understood by everyone who needs to work with it, as they can ignore the cache and focus on the function body. If you’re writing a custom caching layer, for example, using systems like Redis as the storage rather than a dictionary, you should build your integration such that it doesn’t pollute the decorated code with cache-specific instructions.

Cached properties

Another caching decorator available in the functools module is @functools.cached_property. This type of cache is more limited than an LRU cache, but it fits a use case that’s common enough that it warrants inclusion in the Python standard library. A function decorated with @cached_property acts in the same way as one decorated with @property, but the underlying function is called only once.

The first time that the program reads the property, it is transparently replaced with the result of the underlying function call.¹⁵ So long as the underlying function behaves predictably and without side effects,¹⁶ a @cached_property is indistinguishable from a regular @property. Like @property, this can only be used as an attribute on a class and must take the form of a function that takes no arguments except for self.
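A minimal sketch of this behavior follows, using a hypothetical Connection class whose handle property counts how often its body runs:

```python
import functools


class Connection:
    def __init__(self) -> None:
        self.setup_calls = 0

    @functools.cached_property
    def handle(self) -> str:
        # Hypothetical expensive setup; the body only runs on first access
        self.setup_calls += 1
        return f"handle-{self.setup_calls}"


conn = Connection()
first = conn.handle   # runs the function and stores the result on the instance
second = conn.handle  # served from the stored value
```

Both reads return the same value, and setup_calls stays at 1, because the second read never reaches the function body.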

One place this can be of use is in the implementation of the DHT sensors back in the apd.sensors package. The value() methods of these two sensors delegate heavily to the DHT22 class from the Adafruit interface package. In the following method, only a small fraction of the code is relevant to extracting the value; the rest is setup code:

def value(self) -> t.Optional[t.Any]:
    try:
        import adafruit_dht
        import board

        # Force using legacy interface
        adafruit_dht._USE_PULSEIO = False
        sensor_type = getattr(adafruit_dht, self.board)
        pin = getattr(board, self.pin)
    except (ImportError, NotImplementedError, AttributeError):
        # No DHT library results in an ImportError.
        # Running on an unknown platform results in a
        # NotImplementedError when getting the pin
        return None
    try:
        return ureg.Quantity(sensor_type(pin).temperature, ureg.celsius)
    except (RuntimeError, AttributeError):
        return None

We can change this to factor out the common code for creating the sensor interface into a base class, which contains a sensor property. The temperature and humidity sensors can then drop all their interface code and instead rely on the existence of self.sensor.

class DHTSensor:
    def __init__(self) -> None:
        self.board = os.environ.get("APD_SENSORS_TEMPERATURE_BOARD", "DHT22")
        self.pin = os.environ.get("APD_SENSORS_TEMPERATURE_PIN", "D20")

    @property
    def sensor(self) -> t.Any:
        try:
            import adafruit_dht
            import board

            # Force using legacy interface
            adafruit_dht._USE_PULSEIO = False
            sensor_type = getattr(adafruit_dht, self.board)
            pin = getattr(board, self.pin)
            return sensor_type(pin)
        except (ImportError, NotImplementedError, AttributeError):
            # No DHT library results in an ImportError.
            # Running on an unknown platform results in a
            # NotImplementedError when getting the pin
            return None

class Temperature(Sensor[t.Optional[t.Any]], DHTSensor):
    name = "Temperature"
    title = "Ambient Temperature"

    def value(self) -> t.Optional[t.Any]:
        try:
            return ureg.Quantity(self.sensor.temperature, ureg.celsius)
        except (RuntimeError, AttributeError):
            # self.sensor is None when the DHT interface is unavailable,
            # which results in an AttributeError, as in the original method
            return None

    ...

The @property line in the DHTSensor class can be replaced with @cached_property to cache the sensor object between invocations. Adding a cache here doesn’t impact the performance of our existing code, as we do not hold long-lived references to sensors and repeatedly query their value, but any third-party users of the sensor code may find it to be an advantage.

Exercise 10-1: Optimizing clean_watthours_to_watts

At the start of this chapter, we identified the clean_watthours_to_watts(...) function as the most in need of optimization. On my test data set, it was adding multiple seconds to the execution run.

In the accompanying code to this chapter, there are some extended tests to measure the behavior of this function and its performance. Tests to validate performance are tricky, as they are usually the slowest tests, so I don’t recommend adding them as a matter of course. If you do add them, make sure to mark them as such so that you can skip them in your normal test runs.

Modify the clean_watthours_to_watts(...) function so that the test passes. You will need to achieve a speedup of approximately 16x for the test to pass. The strategies discussed in this chapter are sufficient to achieve a speedup of about 100x.

Summary

The most important lesson to learn from this chapter is that no matter how well you understand your problem space, you should always measure your performance improvements, not just assume that they are improvements. There is often a range of options available to you to improve performance, some of which are more performant than others. It can be disappointing to think of a clever way of making something faster only to learn that it doesn’t actually help, but it’s still better to know.

The fastest option may require more RAM than can reasonably be assumed to be available, or it may require the removal of certain features. You must consider these carefully, as fast code that doesn’t do what the user needs is not useful.

The two caching functions in functools are worth being aware of for everyday programming. Use @functools.lru_cache for functions that take arguments and @functools.cached_property for calculated properties of objects that are needed in multiple places.

If your typing hints start to look cumbersome, then you should tidy them up. You can assign types to variables and represent them with classes like TypedDict and Protocol, especially when you need to define more complex structured types. Remember that these are not for runtime type checking and consider moving them to a typing utility module for clearer code. This reorganization has been applied in the sample code for this chapter.

Additional resources

The following links go into more depth on the topics covered in this section:

If you’re interested in the logic of the different variances used in typing, I’d recommend reading up on the Liskov Substitution Principle. The Wikipedia page at https://en.wikipedia.org/wiki/Liskov_substitution_principle is a good starting place, especially for links to computer science course materials on the subject.
More details on how mypy handles protocols and some advanced uses, such as allowing limited runtime checking of protocol types, are found at https://mypy.readthedocs.io/en/stable/protocols.html.
Beaker (https://beaker.readthedocs.io/en/latest/) is a caching library for Python that supports various back-end storages. It’s especially aimed at web applications, but can be used in any type of program. It’s useful for situations where you need multiple types of cache for different data.
The two third-party profilers we’ve used in this chapter are https://github.com/rkern/line_profiler and https://github.com/sumerc/yappi.
Documentation on how to customize the timer used with the built-in profiling tools is available in the standard library’s docs at https://docs.python.org/3/library/profile.html#using-a-custom-timer.

Footnotes

If you’re using a Python implementation other than CPython (such as PyPy or Jython), this optimized profiler won’t be available, and you’ll need to use the reference implementation.

This function is called twice because it was written to be used as part of an interactive widget. interactable_plot_multiple_charts(...) takes setup arguments and returns a function that can be hooked up to widgets. We call it twice here because we want to set up the function and call it once with no special arguments, rather than plumb it in to interactive widgets.

Providing the loop and wrapped variables as explicit local variables also ensures that Python knows how to create a closure over these variables and make them available to the profiled expression. If we passed locals=locals(), we wouldn’t see these variables passed down unless we gave Python a hint that we needed them from the containing scope using nonlocal loop and nonlocal wrapped statements.

The timeit profiler (explained in the next section) can be used to demonstrate this relationship:

>>> import timeit

>>> def func(a, b, c, d, e, f, g, h, i, j, k):

...     return a+b+c+d+e+f+g+h+i+j+k

...

>>> timeit.timeit("func(**vals)", "vals={'a':1, 'b':1, 'c':1, 'd':1, 'e':1, 'f':1, 'g':1, 'h':1, 'i':1, 'j':1, 'k':1}", globals={'func':func})

0.7101785999999777

>>> timeit.timeit("func(a=1,b=1,c=1,d=1,e=1,f=1,g=1,h=1,i=1,j=1,k=1)", globals={'func':func})

0.6051479999999998

>>> timeit.timeit("func(1,1,1,1,1,1,1,1,1,1,1)", globals={'func':func})

0.479350299999993

The difference between these approaches is marginal for trivial functions and irrelevant for more complex functions. You should continue to use the one that makes your code clearest; we’re only trying this in our example as a last resort for performance improvements.

A function that takes 1 millisecond to execute translates to timeit taking over 15 minutes to execute with the default parameters.
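
The arithmetic behind that estimate can be sketched as follows, along with one way to keep a run short. This assumes timeit’s default of 1,000,000 repetitions; estimated_minutes and slow_function are hypothetical names introduced only for illustration, and slow_function merely sleeps for roughly a millisecond.

```python
import time
import timeit

DEFAULT_NUMBER = 1_000_000  # timeit.timeit's default number of repetitions


def estimated_minutes(seconds_per_call, number=DEFAULT_NUMBER):
    """Rough wall-clock estimate for a timeit run, ignoring timeit's own overhead."""
    return seconds_per_call * number / 60


def slow_function():
    # Hypothetical stand-in for a function that takes about 1 ms
    time.sleep(0.001)


# 0.001 s per call * 1,000,000 calls is about 1000 s, a little under 17 minutes
print(f"{estimated_minutes(0.001):.1f} minutes")

# Passing a smaller number keeps the run short; divide to get the per-call time
runs = 50
total = timeit.timeit(slow_function, number=runs)
print(f"{total / runs * 1000:.2f} ms per call")
```

Choosing number explicitly trades some statistical confidence for a run that finishes in a reasonable time, which is usually the right trade for slower functions.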

https://github.com/rkern/line_profiler

https://github.com/sumerc/yappi

I’ve seen real-world Python code that is an order of magnitude faster in a Linux VM on an OSX host than on the host itself, even running the same release of Python and all dependencies. Python build, OS version, and profiler can all make a big difference, so you should establish a baseline whenever you’re doing benchmarking; don’t rely on ones you’ve generated on previous days.

https://docs.python.org/3/library/tracemalloc.html#pretty-top

Other commercial profiling tools are available.

Specifically, this is polynomial complexity, sometimes written as O(n^c). The time taken is the time to execute the loop body multiplied by the length of each of the nested loops.
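
As a minimal sketch of that relationship, the hypothetical function below counts how many times the body of two nested loops runs. The counter stands in for whatever real work the loop body would do.

```python
def count_body_executions(items_a, items_b):
    """Count how often the body of two nested loops is executed."""
    executions = 0
    for _a in items_a:
        for _b in items_b:
            executions += 1  # stands in for the real loop body
    return executions


# 10 outer iterations * 20 inner iterations = 200 executions of the body
print(count_body_executions(range(10), range(20)))
```

Doubling the length of either input doubles the number of executions, and doubling both quadruples it.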

The screenshot is from the Windows port, QCachegrind. As Valgrind is a Linux tool, you’ll find a wider range of utilities if you use Linux.

Although it consumes DataPoint objects, that’s a fixed type. It’s only the way the TypeVar object is used that matters.

That is, the use of a cache, not a type of cache. We can only talk about the efficiency of a cache if we know the information about the requests being made of it.

This replacement is thread-safe, so even if multiple threads try to read the property, the function won’t be called multiple times for a given object.

Side effects in a functional programming context are things a function does other than returning an output variable. If a function manipulates mutable data, such as changing a global variable, then returning a cached return value also prevents these changes from happening on future invocations.
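
A small sketch of this effect, assuming functools.lru_cache as the cache. The function double and the list call_log are hypothetical names used only to make the side effect visible.

```python
import functools

call_log = []  # global mutable state, used to make the side effect visible


@functools.lru_cache(maxsize=None)
def double(value):
    call_log.append(value)  # side effect: mutates a global list
    return value * 2


first = double(21)
second = double(21)  # served from the cache, so the function body does not run

print(first, second)  # 42 42
print(call_log)       # [21]
```

The second call returns the correct value, but the append never happens again, so any code relying on that side effect would silently stop seeing it.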
