Introduction
In the previous chapter, we saw how to create a basic Python extension module. We added code to expose functionality from the underlying C++ library of statistical functions. We saw how to perform the conversion between PyObject pointers and native C++ types. While not especially difficult, this conversion is potentially error prone. In this chapter, we consider two frameworks – Boost.Python and PyBind – that overcome these difficulties, making the development of Python extension modules easier. We build two quite similar wrapper components, the first based on Boost.Python and the second on PyBind. The intention here is to compare the two frameworks. Following this, we look at a typical Python client and develop a script to measure the relative performance of the extension modules. We end the chapter with a simple Flask app that demonstrates using our PyBind module as part of a (limited) statistics service.
Boost.Python
The Boost Python Library is a framework for connecting Python to C++. It allows us to expose C++ classes, functions, and objects to Python in a non-intrusive way using types provided by the framework. We can continue to write “regular” C++ code in the wrapper layer using the types provided. The Boost Python Library is extensive. It provides support for automatic conversion of Python types to Boost types, function overloading, and exception translation, among other things. Using Boost.Python allows us to manipulate Python objects easily in C++, simplifying the syntax when compared to a lower-level approach such as the one we saw in the previous chapter.
Prerequisites
\boost_1_76_0\stage\lib\libboost_python38-vc142-mt-gd-x32-1_76.lib
\boost_1_76_0\stage\lib\libboost_python38-vc142-mt-x32-1_76.lib
\boost_1_76_0\stage\lib\libboost_python38-vc142-mt-gd-x64-1_76.lib
\boost_1_76_0\stage\lib\libboost_python38-vc142-mt-x64-1_76.lib
The Boost installation and build process for these libraries is described in more detail in Appendix A.
Project Settings
Table 8-1. Project settings for StatsPythonBoost
Tab | Property | Value |
---|---|---|
General | C++ Language Standard | ISO C++17 Standard (/std:c++17) |
C/C++ > General | Additional Include Directories | <Users\user>\Anaconda3\include $(BOOST_ROOT) $(SolutionDir)Common\include |
Linker > General | Additional Library Directories | <Users\user>\Anaconda3\libs $(BOOST_ROOT)\stage\lib |
Build Events > Post-Build Event | Command Line | (see in the following) |
We can see from Table 8-1 that the project settings are similar to those of the previous project. In this case, we have not renamed the target output. We leave this for the post-build script (see below). In the Additional Include Directories, we reference the location of Python.h and the StatsLib project include directory. In addition, we reference the Boost libraries with the $(BOOST_ROOT) macro. Similarly, in the Additional Library Directories, we add a reference to both the Python libs and the Boost libs.
With these settings in place, everything should build without warnings or errors.
Code Organization
The Visual Studio Community Edition 2019–generated project for a Windows dll generates a handful of files that we ignore. We ignore the dllmain.cpp file (which contains the entry point for a standard Windows dll). We also ignore the files framework.h and pch.cpp (except insofar as it includes pch.h, the precompiled header).
The macro indicates that in this dll module, we are statically linking to Boost Python:
\boost_1_76_0\stage\lib\libboost_python38-vc142-mt-...-...-1_76.lib
The “...” depend on the specific processor architecture, though in our case we target only x64. The second line brings in all the Boost Python headers. The rest of the code is organized as before into three main areas: the functions (Functions.h/Functions.cpp), the conversion layer (Conversion.h/Conversion.cpp), and the module definition. In addition, for this project, we have a wrapper class StatisticalTests.h/StatisticalTests.cpp that wraps up the t-test functionality. We will deal with each of these areas in turn.
Functions
Inside the API namespace we declare two functions: DescriptiveStatistics and LinearRegression. Both functions take the corresponding boost::python arguments. Boost.Python comes with a set of derived object types corresponding to Python’s built-in types:
Python type | Boost.Python type |
---|---|
list | boost::python::list |
dict | boost::python::dict |
tuple | boost::python::tuple |
str | boost::python::str |
The DescriptiveStatistics wrapper function
The DescriptiveStatistics function in Listing 8-1 should look familiar. It follows the same structure as the raw Python example in the previous chapter. The major difference in the function declaration is that instead of PyObject pointers, we can use types defined in the Boost.Python library. In this case, both parameters are passed in as const references to a boost::python::list. The second parameter is defaulted, as we want to be able to call DescriptiveStatistics with or without the keys. The input arguments are converted to a std::vector<double> and a std::vector<std::string>, respectively. These are then used in the call to the underlying statistical library function. The results package is returned as before (a std::unordered_map<std::string, double> type) and converted to a boost::python::dict.
The LinearRegression wrapper function
As can be seen from Listing 8-2, the LinearRegression function follows the same structure as previously. The function takes in two lists, converts them into the corresponding datasets, calls the underlying function, and converts the results package into a Python dictionary.
StatisticalTests
Wrapping up the TTest class in a function
As shown in Listing 8-3, the approach of providing a procedural wrapper for a class is straightforward: we get the input data and create an instance of the TTest class (depending on the function call and the arguments). We then call Perform to do the calculation and Results to retrieve the results. These are then translated back to the Python caller. The SummaryDataTTest function in this example takes four parameters corresponding to the constructor arguments of the summary data t-test. The arguments are typed as const references to a boost::python::object. This provides a wrapper around PyObject. The function then makes use of boost::python::extract<T>(val) to get a double value out of the argument. In general, the syntax is cleaner and more direct than using PyArg_ParseTuple. The remainder of the function calls Perform and retrieves the Results. As in the previous case of DescriptiveStatistics and LinearRegression, these are converted to a boost::python::dict and returned to the caller.
The Conversion Layer
Converting a boost::python::object list to a std::vector
Listing 8-4 starts by constructing an empty std::vector. Then, we iterate over the input list extracting the individual values and inserting them into the vector. We use this basic approach to illustrate accessing list elements in a standard manner. We could have used the boost::python::stl_input_iterator<T> to construct the results vector<T> directly from iterators. We use this function to convert a list of doubles to a vector of doubles and also to convert a list of string keys to a vector of strings.
Converting the results package to a Python dictionary
In this case, we input a const reference to a std::unordered_map<std::string, double> and return the contents into a boost::python::dict by simply iterating over the results. The final function is to_list. This is similar to the previous to_dict function. In this case, we create a Python list and populate it from a vector of doubles.
The Module Definition
The functions: StatsPythonBoost module definition
In order to do this, we need two separate overloaded functions. This is the same approach that we used in the C++/CLI wrapper in Chapter 3. In this case, however, we do not need to explicitly write the overloads. We make use of the macro BOOST_PYTHON_FUNCTION_OVERLOADS to generate the overloads for us. The arguments are the generator name, the function we want to overload, the minimum number of parameters (1 in this case), and the maximum number of parameters (2 in this case). Having defined this, we then pass the f_overloads structure, along with the docstring, to the def function.
The classes: StatsPythonBoost module definition
The C++ wrapper class for the t-test is defined in StatisticalTests.h. The class template argument references our wrapper class. In this case, we have named it StudentTTest to distinguish it from the underlying Stats::TTest class. This class holds an instance of the underlying Stats::TTest class. The constructors determine the type of t-test to be performed and convert between boost::python types and the underlying C++ types, using the same conversions that we have seen.
From the module definition in Listing 8-6b, we can see that the first parameter is the name of the class, "TTest". This is the name for the type we will call from Python. Alongside this, we define an init function (the constructor) which takes four arguments. We then define two additional init functions, one each for the remaining constructors with their corresponding arguments. Finally, we define the two functions Perform and Results. All the functions provide a docstring. That is all we need to do to expose a native C++ type to Python.
The DataManager::ListDataSets function
The items contain the dataset name and the number of observations in the data. The function first obtains the currently loaded datasets from the m_manager member that this class wraps. Inside the for-loop, we use the function boost::python::make_tuple to create a Python tuple element with the dataset information. This is then appended to the results list and returned to the caller. The remaining functions are similarly straightforward.
Exception Handling
This is the error that is thrown from the underlying StatsLib. Basically, the same error handling that we wrote in the previous chapter is now provided for free.
PyBind
In this section, we develop our third and final Python extension module. This time we use PyBind. Boost.Python has been around for a long time and the Boost library that it is a part of offers a wide range of functionality. This makes it a relatively heavyweight solution if all we want to do is create Python extension modules. PyBind is a lighter-weight alternative. It is a header-only library that provides an extensive range of functions to facilitate writing C++ extension modules for Python. PyBind is available from here: https://github.com/pybind/pybind11.
Prerequisites
The only prerequisite for this section is to install PyBind into your Python environment. You can either run pip install pybind11 from a command prompt, or download the wheel (https://pypi.org/project/pybind11/#files) and run pip install "pybind11-2.7.0-py2.py3-none-any.whl".
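Once the package is installed, the include directory required by the project settings can be located from Python itself. The following is a minimal sketch rather than part of the original project: the helper name pybind11_include_dir is hypothetical, while pybind11.get_include() is provided by the pybind11 package.

```python
import importlib.util
from typing import Optional


def pybind11_include_dir() -> Optional[str]:
    """Return the pybind11 header directory, or None if pybind11 is not installed."""
    if importlib.util.find_spec("pybind11") is None:
        return None
    # Imported lazily so that the script still runs without the package
    import pybind11
    return pybind11.get_include()


if __name__ == "__main__":
    path = pybind11_include_dir()
    if path is None:
        print("pybind11 is not installed in this environment")
    else:
        print("Add this to Additional Include Directories:", path)
```

The printed path is the value to use for the PyBind entry in Additional Include Directories.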
Project Settings
Project settings for StatsPythonPyBind
Tab | Property | Value |
---|---|---|
General | C++ Language Standard | ISO C++17 Standard (/std:c++17) |
C/C++ > General | Additional Include Directories | <Users\user>\Anaconda3\include <Users>\AppData\Roaming\Python\Python37\site-packages\pybind11\include $(SolutionDir)Common\include |
Linker > General | Additional Library Directories | <Users\user>\Anaconda3\libs $(BOOST_ROOT)\stage\lib |
Build Events > Post-Build Event | Command Line | (see in the following) |
Additionally, we have removed the pch file and set the project to not use precompiled headers. Finally, we have added a reference to the StatsLib project in the project References. At this point, everything should build without warnings or errors.
Code Organization: module.cpp
In this project, there is only a single file, module.cpp. This file contains all the code. As we have seen in the previous section on Boost.Python and in the previous chapter as well, we have generally separated the conversion layer from the wrapped functions and classes. And we have separated these from the module definition. This was a convenient way to organize the code in the wrapper layer and allowed us to separate concerns (like converting types or calling functions) appropriately. However, PyBind simplifies both these aspects.
This is followed by our StatsLib includes.
The function definitions in the StatsPythonPyBind module
The PYBIND11_MODULE macro defines the module name StatsPythonPyBind that is used by Python in the import statement. Inside the module definition, we can see the declarations of the DescriptiveStatistics and LinearRegression functions. The .def(...) function is used to define an exported function. Just as before, we give it the name that is called from Python, and the final parameter is a docstring.
What is important here is not how the function wraps the TTest class, but rather the fact that the wrapper function uses native C++ and STL types both for the function parameters and the return value. Using Boost.Python, we would have had to convert from/to boost::python::object. But here we no longer need to convert from Python types to C++ types. Of course, we can, if we wish, explicitly wrap functions. This is a design choice.
The description of the TTest class exported to Python
Listing 8-8b shows how the TTest class from the underlying C++ StatsLib is exposed to Python. As in the case of Boost.Python, we describe the type “TTest” that we want to use. But, in this case, the template argument to the py::class_ object is the underlying Stats::TTest class. The class that is referenced is not a wrapper class, as was the case with Boost.Python. After the template arguments and the parameters passed to the constructor of py::class_, we use the .def function to describe the structure of the class. In this case, we declare the three TTest constructors with their respective arguments passed as template parameters to the py::init<> function. Again, it is worth highlighting that we do not need to do any conversions; we simply pass in native C++ types and STL types (rather than boost::python::object types). Finally, we declare the functions Perform and Results, and an anonymous function to return a string representation of the object to Python.
The DataManager class definition
As we can see from Listing 8-8c, all we need to do in the .def function is to provide a mapping from the function names used by Python to the underlying C++ functions. Apart from the functions that are available in the DataManager class, we also have access to functions that form part of the definition of the Python class. For example, the DataManager extends the __repr__ function with a custom to_string function that outputs internal information regarding the dataset.
As we can see in this project, both the wrapper and the “conversion” layer are minimal. PyBind provides a wide range of facilities, allowing us to easily connect C++ code to Python. In this chapter, we have only just scratched the surface. There are a large number of features and we have only covered a fraction of them. Moreover, we are aware that we have really only written code for the most “vanilla” situations (taking advantage of the fact that PyBind allows us to do this easily).
However, while using PyBind makes exposing C++ classes and functions straightforward, we need to be aware that there is a lot going on under the hood. In particular, we need to be aware of the return value policies that can be passed to the module_::def() and the class_::def() functions. These annotations allow us to tune the memory management for functions that return a non-trivial type. In this project, we have only used the default policy return_value_policy::automatic. A full discussion of this topic is beyond the scope of this chapter. But, as the documentation points out, return value policies are tricky, and it’s important to get them right.
If we take a step back for a moment, we can see that in terms of the module definition, both Boost.Python and PyBind provide us with a meta-language for defining Python entities. It might seem a complicated way to go. Arguably, writing equivalent classes in native Python is somewhat easier than using a meta-language to describe C++ classes. However, the approach we have adopted here, describing native C++ classes, clearly addresses a different issue, that is, it provides a (relatively) easy way to export classes out of C++ and have them managed in an expected way in a Python environment.
Output from the Python help function for the TTest class
Listing 8-9 shows the output from the StatsPythonPyBind module using the built-in help() function. We can see that it provides a description of the class methods and the class initialization along with the docstrings that we provided. It also provides detailed information both about the argument types used and the return types. We can see quite clearly how the declarative C++ class description has been translated into a Python entity. The output from StatsPythonBoost is similar, though not identical, and worth comparing. As an alternative to the help function, we can use the inspect module to introspect on our Python extension. The inspect module provides additional useful functions to help get information about objects. This can be useful if you need to display a detailed traceback. As expected, we can retrieve all the information from our module except, of course, the source code. What both these approaches serve to illustrate is that, with a limited amount of C++ code, we have developed a proper Python object.
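The kind of introspection described above can be sketched with the standard inspect module. Since StatsPythonPyBind may not be built on the reader's machine, this sketch uses the standard library's compiled math module as a stand-in; the helper names describe_module and source_available are hypothetical.

```python
import inspect
import math  # stand-in for a compiled extension module such as StatsPythonPyBind


def describe_module(module) -> dict:
    """Collect the public callables of a module together with their docstrings."""
    members = {}
    for name, obj in inspect.getmembers(module, callable):
        if not name.startswith("_"):
            members[name] = inspect.getdoc(obj) or ""
    return members


def source_available(module) -> bool:
    """Return True if Python source code for the module can be retrieved."""
    try:
        inspect.getsource(module)
        return True
    except (OSError, TypeError):
        # Compiled modules carry no Python source to retrieve
        return False


if __name__ == "__main__":
    info = describe_module(math)
    print(f"{len(info)} public callables found")
    print("sqrt:", info["sqrt"])
    print("source available:", source_available(math))
```

Passing the extension module in place of math should list its exported functions and classes with the docstrings supplied in the module definition, while the source check is expected to fail in the same way.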
Exception Handling
The exception handling provides sufficient information to determine the cause of the issue and processing can proceed appropriately. It is worth pointing out that PyBind’s exception handling capabilities go beyond simple translation of C++ exceptions. PyBind provides support for several specific Python exceptions. It also supports registering custom exception handlers. The details are covered in the PyBind documentation.
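From the Python side, a translated exception is caught like any other. The sketch below is illustrative only: descriptive_statistics_stub is a hypothetical stand-in for a wrapped function, its error message is invented, and the body of report_exception (a name used in this chapter's exercises) is an assumption. It relies on PyBind's documented default translation of std::invalid_argument to ValueError.

```python
def report_exception(inst: Exception) -> str:
    """Format an exception raised by the extension module for display."""
    message = f"{type(inst).__name__}: {inst}"
    print(message)
    return message


def descriptive_statistics_stub(data: list) -> dict:
    """Hypothetical stand-in for a wrapped function that rejects empty input."""
    if not data:
        # A C++ std::invalid_argument would arrive in Python as a ValueError
        raise ValueError("The data is empty.")
    return {"Count": float(len(data))}


def safe_call(func, *args):
    """Call func and report, rather than propagate, any exception."""
    try:
        return func(*args)
    except Exception as inst:
        report_exception(inst)
        return None


if __name__ == "__main__":
    print(safe_call(descriptive_statistics_stub, [1.0, 2.0, 3.0]))
    print(safe_call(descriptive_statistics_stub, []))
```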
The Python “Client”
This allows us to easily switch between the Boost.Python extension module and the PyBind extension module. This is not proposed as a general approach; it just facilitates testing the functions and classes here.
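One possible way to make the two modules interchangeable behind a single alias is sketched here. It assumes both modules expose the same names; the helper load_first_available is hypothetical and is not taken from the original script.

```python
import importlib
from types import ModuleType
from typing import Sequence


def load_first_available(names: Sequence[str]) -> ModuleType:
    """Import and return the first module in names that can be imported."""
    for name in names:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    raise ImportError(f"None of the modules could be imported: {', '.join(names)}")


if __name__ == "__main__":
    # The standard library 'statistics' module is listed last purely so that
    # the sketch runs where neither extension module has been built.
    Stats = load_first_available(["StatsPythonPyBind", "StatsPythonBoost", "statistics"])
    print("Using module:", Stats.__name__)
```

Reordering the list is then enough to run the same client code against the other extension module.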
A simple function to compare the results from two t-tests
In Listing 8-10, the function takes as inputs two Pandas data frame objects (simple datasets loaded from csv files) and converts them to lists, the type our Python interface to StatsLib expects. The first call uses the procedural interface. The second, equivalent call constructs an instance of the TTest class that we declared and calls the functions Perform and Results. Unsurprisingly, both approaches produce the same results.
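The idea of checking the procedural interface against the class interface can be sketched without the extension module. Everything below is a stand-in: StubStats imitates a module exposing SummaryDataTTest and TTest, the argument order (mu0, mean, sd, n) and the result keys "t" and "df" are assumptions, and the summary-data test is used instead of the two-sample test to keep the example short.

```python
import math
from types import SimpleNamespace


def _summary_t(mu0: float, mean: float, sd: float, n: float) -> dict:
    """One-sample t statistic computed from summary data (stand-in calculation)."""
    return {"t": (mean - mu0) / (sd / math.sqrt(n)), "df": n - 1.0}


class _StubTTest:
    """Hypothetical stand-in for the exported TTest class."""

    def __init__(self, mu0: float, mean: float, sd: float, n: float) -> None:
        self._args = (mu0, mean, sd, n)
        self._results: dict = {}

    def Perform(self) -> None:
        self._results = _summary_t(*self._args)

    def Results(self) -> dict:
        return self._results


# Stand-in "module" offering both the procedural and the class interface
StubStats = SimpleNamespace(SummaryDataTTest=_summary_t, TTest=_StubTTest)


def compare_interfaces(stats, mu0: float, mean: float, sd: float, n: float) -> bool:
    """Run the same test through both interfaces and compare the results."""
    procedural = stats.SummaryDataTTest(mu0, mean, sd, n)
    test = stats.TTest(mu0, mean, sd, n)
    test.Perform()
    return procedural == test.Results()


if __name__ == "__main__":
    print("Results agree:", compare_interfaces(StubStats, 5.0, 6.0, 2.0, 16.0))
```

With the real module available, the imported module object would be passed in place of StubStats, provided its names and signatures match the assumptions above.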
Performance
One of the reasons for trying to connect C++ and Python is the potential for performance gains from C++ code. To this end, we have written a small script, PerformanceTest.py. We want to test the performance of the mean and (sample) standard deviation functions. We would like to do this for Python vs. PyBind computing Mean and StdDev for 500k items.
From the Python side we have two approaches. Firstly, we define the functions mean, variance, and stddev. The implementations of these only use basic Python functionality. We also define the same functions, this time using the Python statistics library. This allows us to have two different baselines.
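A minimal sketch of the two Python baselines might look as follows. The function names mean, variance, and stddev follow the text; the names mean_lib and stddev_lib for the statistics-based versions are hypothetical, and the actual PerformanceTest.py may differ in detail.

```python
import math
import statistics
from typing import Sequence


def mean(x: Sequence[float]) -> float:
    """Arithmetic mean using only built-in functionality."""
    return sum(x) / len(x)


def variance(x: Sequence[float]) -> float:
    """Sample variance (divides by n - 1)."""
    m = mean(x)
    return sum((xi - m) ** 2 for xi in x) / (len(x) - 1)


def stddev(x: Sequence[float]) -> float:
    """Sample standard deviation."""
    return math.sqrt(variance(x))


def mean_lib(x: Sequence[float]) -> float:
    """Mean computed by the statistics library."""
    return statistics.mean(x)


def stddev_lib(x: Sequence[float]) -> float:
    """Sample standard deviation computed by the statistics library."""
    return statistics.stdev(x)


if __name__ == "__main__":
    data = [3, 7, 11, 0, 7, 0, 4, 5, 6, 2]
    print(mean(data), mean_lib(data))
    print(stddev(data), stddev_lib(data))
```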
Enhancing the module definition with additional C++ functions
In Listing 8-11, we add the function "Mean", supply the address of the C++ implementation, and add the documentation string.
The StandardDeviation function is slightly more involved. The underlying C++ function takes two parameters, a std::vector<double> and an enumeration for the VarianceType. If we just pass the function address to the module definition, we will get a runtime error from Python as the function expects two arguments. To address this, we need to extend the code. At this point we have a choice. We can either write a small wrapper function that provides a hardcoded VarianceType argument or we can expose the VarianceType enumeration. We’ll look at both approaches.
Wrapper for the underlying StandardDeviation function
Definition of the SampleStandardDeviation wrapper function
In Listing 8-13, we use the name “StdDevS” to reflect the fact that we are requesting the sample standard deviation. Now we can use this function in our performance test.
Defining the enumeration for VarianceType
Defining additional arguments for the StdDev function
With these modifications in place, we can return to the performance test. The PerformanceTest.py script is straightforward. We import the required libraries, including StatsPythonPyBind. We define two versions of both mean and stddev in Python. One version doesn’t use the statistics library and the second version does. This just facilitates the comparison between Python functions and our library functions. We add a simple test function that uses random data and returns the mean and stddev with timing information.
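The timing part of such a script can be sketched as follows. This is an assumption-laden outline rather than the original PerformanceTest.py: the helper names timed and run_test are hypothetical, and only the statistics library is timed so that the sketch runs without the extension module.

```python
import random
import statistics
import time
from typing import Callable, Dict, List, Tuple


def timed(func: Callable[[List[float]], float], data: List[float]) -> Tuple[float, float]:
    """Return the function result and the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    result = func(data)
    elapsed = time.perf_counter() - start
    return result, elapsed


def run_test(n: int = 500_000, seed: int = 0) -> Dict[str, Tuple[float, float]]:
    """Time mean and stddev on n uniformly distributed random values."""
    rng = random.Random(seed)
    data = [rng.random() for _ in range(n)]
    candidates = (("mean", statistics.mean), ("stddev", statistics.stdev))
    return {label: timed(func, data) for label, func in candidates}


if __name__ == "__main__":
    for label, (value, seconds) in run_test().items():
        print(f"{label}: {value:.6f} in {seconds:.4f}s")
```

When the extension module is available, its functions can be appended to the candidates tuple so that they are timed on exactly the same data.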
The Python function mean(x) is about two orders of magnitude faster than the native C++ function. Changing the C++ code to use a for-loop instead of std::accumulate made no significant difference. It might be interesting to investigate if the latency in the C++ side is due to the conversion layer or simply unnecessary copying of vectors. Nevertheless, the native C++ StdDev function is substantially faster than either of the Python variants.
The Statistics Service
This starts the Flask service on port 5000. In your browser address bar, go to http://localhost:5000/. This points to the Summary Data T-Test page which is the main page for this app. Fill in the required details and press submit. The results are returned as expected, using the underlying TTest class from the StatsPythonPyBind module.
Apart from the small amount of code required to get this up and running, what is worth emphasizing is what we have achieved in terms of a multi-language development infrastructure. We now have an infrastructure that allows us to develop and adapt native C++ code, build this into a library, incorporate the library into a Python module, and have this functionality available for use in a Python web service. This flexibility is valuable when developing software systems.
Summary
In this chapter, we have built Python modules using the frameworks provided by Boost.Python and PyBind. Both modules exposed the functionality of the underlying library of statistical functions in a similar way. We have seen that both frameworks do a lot of work on our behalf both in terms of type conversions and also error handling. Furthermore, both frameworks allow us to expose native C++ classes to Python. We concluded this chapter by looking at measuring the performance of the underlying C++ function calls vs. the Python equivalents. The potential for performance enhancements is an obvious reason for connecting C++ to Python. However, equally compelling as a reason for connecting C++ to Python (if not more so) is that it gives us access to a wide variety of different Python libraries covering everything from machine learning (NumPy and Pandas, for example) to web services (Django and Flask, for example) and more. As we have seen, being able to expose functionality written in C++ to Python with minimal effort gives you a useful additional architectural choice when developing loosely coupled software systems.
Additional Resources
The main reference for Boost.Python is the excellent Boost documentation www.boost.org/doc/libs/1_77_0/libs/python/doc/html/index.html and the reference manual at www.boost.org/doc/libs/1_77_0/libs/python/doc/html/reference/index.html. There is also a useful tutorial covering exposing classes: www.boost.org/doc/libs/1_77_0/libs/python/doc/html/tutorial/tutorial/exposing.html.
The excellent PyBind documentation at https://pybind11.readthedocs.io/en/latest/ has a lot of useful information.
Exercises
The exercises in this section deal with exposing the same functionality as previously, but this time via the Boost.Python module and the PyBind module.
The following exercises use the StatsPythonBoost project:
- In StatisticalTests.h, add these declarations for the three functions:

    boost::python::dict SummaryDataZTest(const boost::python::object& mu0, const boost::python::object& mean, const boost::python::object& sd, const boost::python::object& n);
    boost::python::dict OneSampleZTest(const boost::python::object& mu0, const boost::python::list& x1);
    boost::python::dict TwoSampleZTest(const boost::python::list& x1, const boost::python::list& x2);
In StatisticalTests.cpp, add the implementations of these functions. Follow the code for the t-test wrapper functions.
In module.cpp, add the three new functions to the module BOOST_PYTHON_MODULE(StatsPythonBoost) {}
- After rebuilding StatsPythonBoost, open the StatsPython project in VSCode. Open the StatsPython.py script. Add functions to test the z-test functions using the data we have used previously. For example, we can add the following function:

    def one_sample_ztest() -> None:
        """ Perform a one-sample z-test """
        try:
            data: list = [3, 7, 11, 0, 7, 0, 4, 5, 6, 2]
            results = Stats.OneSampleZTest(3.0, data)
            print_results(results, "One-sample z-test.")
        except Exception as inst:
            report_exception(inst)
- In Functions.h, add the following declaration:

    boost::python::list MovingAverage(const boost::python::list& dates, const boost::python::list& observations, const boost::python::object& window);
- In Functions.cpp:
Add #include "TimeSeries.h" to the top of the file.
Add the implementation: the function takes three non-optional arguments: a list of dates, a list of observations, and a window size.
Convert the inputs using the existing conversion functions and pass these to the constructor of the TimeSeries class.
Return the results using the Conversion::to_list function.
- In module.cpp, add the new function:

    def("MovingAverage", API::MovingAverage, "Compute a simple moving average of size = window.");
Build StatsPythonBoost. It should build without warnings and errors. You should be able to test the MovingAverage function interactively, adapting the script we used previously.
Open the StatsPython project in VSCode. Open the StatsPython.py script. Add a function to test the moving average, including exception handling. Run the script, and debug if required.
3) In the StatsPythonBoost project, add a TimeSeries class that wraps the native C++ TimeSeries class and computes a simple moving average.
Add a TimeSeries.h and a TimeSeries.cpp file to the project. These will contain the wrapper class definition and implementation, respectively.
- In TimeSeries.h, add the class declaration. For example:

    namespace API
    {
        namespace TS
        {
            // TimeSeries wrapper class
            class TimeSeries final
            {
            public:
                // Constructor, destructor, assignment operator and MovingAverage function
            private:
                Stats::TimeSeries m_ts;
            };
        }
    }
In TimeSeries.cpp, add the class implementation. The constructor converts the boost::python::list arguments to appropriate std::vector types. The MovingAverage function extracts the window size argument and forwards the call to the m_ts member. The results are returned using the Conversion::to_list() function.
- In module.cpp, add the include file, and add the class declaration to BOOST_PYTHON_MODULE(StatsPythonBoost) as follows:

    // Declare the TimeSeries class
    class_<API::TS::TimeSeries>("TimeSeries",
        init<const list&, const list&>("Construct a time series from a vector of dates and observations."))
        .def("MovingAverage", &API::TS::TimeSeries::MovingAverage,
            "Compute a simple moving average of size = window.");
After rebuilding StatsPythonBoost, open the StatsPython project in VSCode. Open the StatsPython.py script. Add a function to test the moving average, including exception handling. Run the script, debug if required.
The following exercises use the StatsPythonPyBind project:
In module.cpp, add declarations/definitions for the three functions.
In the module definition, add entries for these three functions. Follow the code for the t-test wrapper functions.
After rebuilding the StatsPythonPyBind project, open the StatsPython project in VSCode. Open the StatsPython.py script. Add functions to test the z-test functions using the data we have used previously.
After rebuilding the StatsPythonPyBind project, open the StatsPython project in VSCode. Open the StatsPython.py script. Add functions to test the z-test functions using the data we have used previously. We can extend the function we used previously to test the one-sample z-test to test both the procedural wrapper and the class as follows:
The results output from both calls should be identical.
In module.cpp, add #include "TimeSeries.h".
In module.cpp, add a declaration/definition of the wrapper function.
In module.cpp, add the definition of the MovingAverage function to the list of functions exposed by the PYBIND11_MODULE.
After rebuilding the StatsPythonPyBind project, open the StatsPython project in VSCode. Open the StatsPython.py script. Add a function to test the moving average, including exception handling. Run the script, debug if required.
In module.cpp, add the include file, and add the class declaration. The class definition will be similar to the class definition we added to the StatsPythonBoost project previously.
After rebuilding the StatsPythonPyBind project, open the StatsPython project in VSCode. Open the StatsPython.py script. Add a function to test the moving average, including exception handling. Run the script, debug if required.
It is worth emphasizing that exposing the ZTest class and the TimeSeries class using PyBind has been quite straightforward, compared to the amount of work required to expose wrappers either via CPython or the Boost.Python wrapper.