Speeding up Python with C++ and pybind11
In two previous posts, I described using Cython and C, respectively, to speed up a Python program that involves simple numerical computations. In this post, I will use C++, the “big brother” of the industrial languages, to speed up the same program. For Python/C++ interop, I will use the excellent pybind11.
This task consists of three components or steps:
- Develop (or obtain) the C++ code, independent of Python. (That is, it’s perfectly fine if we use a C++ library that was not originally intended to be called from Python.)
- Develop “binding” code using pybind11. This code will make selected functions, classes, and methods in the C++ code accessible from Python. Building this code will generate a ‘.so’ file (on Linux) that is a proper Python module; that is, it is understood by the Python interpreter at run time without needing any other tool.
- Use the generated module in Python. Again, this access does not need pybind11 or any other tool.
The C++ implementation
The C++ implementation consists of a header file
// File `src/cc/datex/cc_version01.h`.
#ifndef _CC_VERSION01_
#define _CC_VERSION01_
#include <vector>
long weekday(long ts);
std::vector<long> weekdays(std::vector<long> const & ts);
#endif
and a source file
// File `src/cc/datex/cc_version01.cc`.
#include <vector>

long weekday(long ts)
{
    long ts0, weekday0, DAY_SECONDS, WEEK_SECONDS;
    long ts_delta, td, nday, weekday;

    ts0 = 1489363200;   // 2017-03-13 0:0:0 UTC, Monday
    weekday0 = 1;       // ISO weekday: Monday is 1, Sunday is 7
    DAY_SECONDS = 86400;
    WEEK_SECONDS = 604800;

    ts_delta = ts - ts0;
    if (ts_delta < 0) {
        ts_delta += ((-ts_delta) / WEEK_SECONDS + 1) * WEEK_SECONDS;
    }

    td = ts_delta % WEEK_SECONDS;
    nday = td / DAY_SECONDS;
    weekday = weekday0 + nday;
    if (weekday > 7) {
        weekday = weekday - 7;
    }
    return weekday;
}

std::vector<long> weekdays(std::vector<long> const & ts)
{
    long n = ts.size();
    std::vector<long> out(n);
    for (long i = 0; i < n; i++) {
        out[i] = weekday(ts[i]);
    }
    return out;
}
The function weekday is a straightforward port of the Python version (and is identical to the C version in the previous post). The function weekdays takes a vector, calls weekday on each element of the vector, and returns the result in a new vector.
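As a quick cross-check of the arithmetic above, here is a minimal pure-Python sketch of the same logic, compared against the standard library. The names weekday_ref and weekday_stdlib are made up for this illustration and are not part of the datex package; the original Python version from the earlier post is not reproduced here.

```python
from datetime import datetime, timedelta, timezone

TS0 = 1489363200              # 2017-03-13 00:00:00 UTC, a Monday
DAY_SECONDS = 86400
WEEK_SECONDS = 7 * DAY_SECONDS


def weekday_ref(ts):
    """ISO weekday (Monday=1 .. Sunday=7) of a POSIX timestamp, in UTC."""
    ts_delta = ts - TS0
    if ts_delta < 0:
        # Shift forward by whole weeks so that the remainder is non-negative.
        # This mirrors the C++ code; see the note below the block.
        ts_delta += ((-ts_delta) // WEEK_SECONDS + 1) * WEEK_SECONDS
    nday = (ts_delta % WEEK_SECONDS) // DAY_SECONDS
    return 1 + nday


def weekday_stdlib(ts):
    """The same quantity computed with the standard library."""
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    return (epoch + timedelta(seconds=ts)).isoweekday()


# Compare the two on timestamps before, at, and after the reference Monday.
for ts in (0, 1489363199, 1489363200, 9999999999):
    print(ts, weekday_ref(ts), weekday_stdlib(ts))
```

The explicit adjustment for a negative ts_delta matters in C++, where % truncates toward zero and can return a negative remainder. Python's % already returns a non-negative remainder for a positive divisor, so the branch is kept here only to mirror the C++ code.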
To highlight the notion that this C++ implementation can be a library totally unaware of Python, these two files are stored in a directory separate from the Python package.
Python bindings for the C++ code
The binding code is C++ code that uses pybind11 to define a Python module and specify the contents of the module. My initial attempt is as follows:
// File `src/python_ext/datex/cc/version01.cc`.
#include "datex/cc_version01.h"
#include <pybind11/pybind11.h>
#include <pybind11/stl.h>

PYBIND11_MODULE(version01, m)
{
    m.def("weekday", &weekday);
    m.def("weekdays", &weekdays);
}
This code trivially exposes the C++ functions weekday and weekdays to Python in a new module named “version01”. (I would rather name the module “_version01”, but pybind11 does not appear to allow it.)
Pay special attention to
#include <pybind11/stl.h>
With this statement, collections of compatible types are automatically converted between Python and C++. For example, a Python list or tuple corresponds to a C++ vector, whereas a Python dict corresponds to a C++ map. In the case of weekdays, the input from Python can be a list or a numpy array (or possibly other iterables); it will be converted to a vector as input to the C++ function. Conversely, the vector returned from the C++ function becomes a list coming out of the Python function weekdays in the interface module version01. Each of these conversions copies the data.
At this point, the directory structure looks like this:
├── setup.py
├── src
│   ├── c
│   ├── cc
│   │   └── datex
│   │       ├── cc_version01.cc
│   │       └── cc_version01.h
│   ├── python
│   │   ├── datex
│   │   │   ├── c
│   │   │   ├── cc
│   │   │   │   ├── __init__.py
│   │   │   ├── cy
│   │   │   ├── __init__.py
│   │   │   ├── version01.py
│   │   │   └── version03.py
│   ├── python_ext
│   │   └── datex
│   │       ├── c
│   │       ├── cc
│   │       │   ├── version01.cc
│   │       ├── cy
└── tests
    ├── datex
    │   ├── __init__.py
    │   └── test_1.py
Below is the content of setup.py (skipping non-C++ details):
from setuptools import setup, Extension, find_packages

cy_extensions = ...
cffi_extensions = ...

cc_options = ['--std=c++17', '-O3', '-Wall', '-Wextra', '-Wfatal-errors']
cc_extensions = [
    Extension(
        'datex.cc.version01',
        sources=['src/python_ext/datex/cc/version01.cc',
                 'src/cc/datex/cc_version01.cc'],
        include_dirs=['src/cc'],
        extra_compile_args=cc_options,
    ),
]

setup(
    name='datex',
    version='0.1.0',
    package_dir={'': 'src/python'},
    packages=find_packages(where='src/python'),
    ext_modules=cc_extensions + cy_extensions,
    cffi_modules=cffi_extensions,
)
Typing
$ pip install --user .
will install the package datex.
To ease import, put this line in the file src/python/datex/cc/__init__.py:
# File `src/python/datex/cc/__init__.py`.
from . import version01
The version01 being imported here is the dynamic library generated by the C++ extension, and there is no Python source file corresponding to it.
Let’s check the speed of the C++ extension in comparison with the Python version as well as the C and cython extensions. Below is the benchmarking code:
# File `tests/datex/test_1.py`.
from functools import partial

import numpy
from zpz.profile import Timer

from datex import version01, version03
from datex import cy, c, cc


def check_it(fn, timestamps):
    # Check correctness against the original Python version.
    z = fn(timestamps)
    z0 = version01.weekdays(timestamps)
    assert all(a == b for a, b in zip(z, z0))


def time_it(fn, timestamps, repeat=1):
    tt = Timer().start()
    for _ in range(repeat):
        z = fn(timestamps)
    t = tt.stop().seconds
    name = fn.__module__ + '.' + fn.__name__
    print('{: <42}: {: >8.4f} seconds'.format(name, t))


def do_all(fn, n):
    timestamps_np = numpy.random.randint(10000000, 9999999999, size=n, dtype=numpy.int64)
    functions = [
        (version01.weekdays, timestamps_np),
        (version03.weekdays, timestamps_np),
        (cy.version09.weekdays, memoryview(timestamps_np)),
        (c.version01.weekdays, timestamps_np),
        (cc.version01.weekdays, memoryview(timestamps_np)),
    ]
    # Cache JIT type of work so that it does not distort benchmarks.
    _ = c.version01.weekdays(timestamps_np[:10])
    for f, ts in functions:
        fn(f, ts)


def test_all():
    # This is called by `py.test` to verify that code runs and is correct.
    do_all(check_it, 10)


def benchmark(n, repeat):
    do_all(partial(time_it, repeat=repeat), n)


if __name__ == "__main__":
    # Running the script (i.e. not by `py.test`) will do time benchmarking.
    from argparse import ArgumentParser
    p = ArgumentParser()
    p.add_argument('--n', type=int, default=10000000)
    p.add_argument('--repeat', type=int, default=1)
    args = p.parse_args()
    benchmark(args.n, args.repeat)
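The Timer class used above comes from the zpz package, which may not be installed everywhere. If it is not available, a stand-in built on the standard library could look like the sketch below. It assumes that all the benchmark needs is the total wall-clock time over the repeats; it is not a description of how zpz.profile.Timer works internally.

```python
import time


def time_it(fn, timestamps, repeat=1):
    """Call fn(timestamps) `repeat` times; print and return elapsed seconds."""
    start = time.perf_counter()
    for _ in range(repeat):
        fn(timestamps)
    elapsed = time.perf_counter() - start
    # Fall back gracefully for callables without __module__ or __name__.
    module = getattr(fn, '__module__', None) or '?'
    name = module + '.' + getattr(fn, '__name__', repr(fn))
    print('{: <42}: {: >8.4f} seconds'.format(name, elapsed))
    return elapsed


# Example with a built-in function in place of a `weekdays` implementation.
time_it(sorted, list(range(100000, 0, -1)), repeat=3)
```

It takes the same arguments as the time_it function in the benchmark script, so it can be dropped in without touching do_all.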
Here’s the benchmark outcome:
$ python test_1.py
datex.version01.weekdays                  :   5.2313 seconds
datex.version03.weekdays                  :  17.4581 seconds
datex.cy._version09.weekdays              :   0.0590 seconds
datex.c.version01.weekdays                :   0.0711 seconds
datex.cc.version01.weekdays               :   0.4835 seconds
I passed a memoryview to the function datex.cc.version01.weekdays. In another run where I passed a numpy array directly to the function, it took about three times as long (1.40 seconds, specifically) for this particular input size. Although the C++ extension is clearly faster than the Python versions, it falls far behind the C and Cython extensions. The slowness is mainly due to array copying at the Python/C++ boundary, when pybind11/stl.h is at work.
Vectorize it
Pybind11 provides an option to vectorize a function using numpy. All I need to do is add a vectorized function signature in the binding code:
// File `src/python_ext/datex/cc/version01.cc`.
#include "datex/cc_version01.h"
#include <pybind11/pybind11.h>
#include <pybind11/stl.h>
#include <pybind11/numpy.h>

namespace py = pybind11;

PYBIND11_MODULE(version01, m)
{
    m.def("weekday", &weekday);
    m.def("weekdays", &weekdays);
    m.def("vectorized_weekday", py::vectorize(weekday));
}
Note the inclusion of <pybind11/numpy.h> and the use of py::vectorize on the “scalar” function weekday. Then add this new function to the benchmark code:
# File `tests/datex/test_1.py`.
...

def do_all(fn, n):
    ...
    functions = [
        (version01.weekdays, timestamps_np),
        (version03.weekdays, timestamps_np),
        (cy.version09.weekdays, memoryview(timestamps_np)),
        (c.version01.weekdays, timestamps_np),
        (cc.version01.weekdays, memoryview(timestamps_np)),
        (cc.version01.vectorized_weekday, timestamps_np),
    ]
    ...

...
Now check the speed:
$ python test_1.py
datex.version01.weekdays                  :   5.6730 seconds
datex.version03.weekdays                  :  17.5845 seconds
datex.cy._version09.weekdays              :   0.0571 seconds
datex.c.version01.weekdays                :   0.0703 seconds
datex.cc.version01.weekdays               :   0.4766 seconds
datex.cc.version01.vectorized_weekday     :   0.0688 seconds
The vectorized version is about seven times as fast as the copied-vector version. Its speed is comparable to that of the C extension (via cffi), but is slightly slower than the Cython version. Notice that I did not use memoryview while calling vectorized_weekday. Using memoryview makes no difference here.
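Conceptually, py::vectorize turns a function of scalars into one that also accepts arrays and applies the scalar function element by element. The pure-Python sketch below imitates only that calling convention, for one-dimensional input. It is an analogy rather than pybind11's implementation: the real thing loops in compiled code over numpy data and handles broadcasting and multiple dimensions, all of which this sketch ignores. The names vectorize_sketch and is_weekend are hypothetical.

```python
from collections.abc import Sequence


def vectorize_sketch(scalar_fn):
    """Wrap `scalar_fn` so it accepts either a scalar or a sequence."""
    def wrapper(x):
        if isinstance(x, Sequence) and not isinstance(x, (str, bytes)):
            # Sequence in: a list of per-element results out.
            return [scalar_fn(v) for v in x]
        # Scalar in: scalar out.
        return scalar_fn(x)
    return wrapper


def is_weekend(iso_weekday):
    # Hypothetical scalar function, used only for this demonstration.
    return iso_weekday >= 6


v_is_weekend = vectorize_sketch(is_weekend)
print(v_is_weekend(7))
print(v_is_weekend([1, 5, 6, 7]))
```

The point of the analogy is that the scalar function itself stays unchanged; only the wrapper knows about containers.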
Use numpy to achieve zero-copy array passing
I imagine vectorize is only applicable to certain types of functions. To have full control, I can use numpy arrays provided by pybind11, called pybind11::array, as input arguments and return values, mixed with arguments of other types, and manipulate these arrays however I need to. Below is my second version of the C++ extension.
// File `src/python_ext/datex/cc/version02.cc`.
#include <datex/cc_version01.h>
#include <pybind11/pybind11.h>
#include <pybind11/numpy.h>

namespace py = pybind11;

// `c_style` requests a C-contiguous array, which the raw-pointer indexing
// below relies on. An input that is already contiguous and of the right
// dtype is passed through without a copy.
py::array_t<long> weekdays_py(
    py::array_t<long, py::array::c_style | py::array::forcecast> ts)
{
    long n = ts.size();
    py::array_t<long> out = py::array_t<long>(n);
    auto p_ts = ts.data();
    auto p_out = out.mutable_data();
    for (long i = 0; i < n; i++) {
        p_out[i] = weekday(p_ts[i]);
    }
    return out;
}

PYBIND11_MODULE(version02, m)
{
    m.def("weekday", &weekday);
    m.def("weekdays", &weekdays_py);
}
Note the absence of #include <pybind11/stl.h>. Automatic conversions done by the inclusion of that header tend to interfere with the desire for full control.
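The distinction between copying an array and sharing its memory can be seen with standard-library types alone. In the sketch below, array.array stands in for a numpy array and memoryview stands in for a zero-copy handle such as py::array_t. This is only an analogy, and none of these names come from the datex code. It also shows why a loop that indexes a raw data pointer assumes contiguous memory: a strided view covers the same buffer but skips elements.

```python
from array import array

# Four consecutive midnights (UTC) stored as signed 64-bit integers.
timestamps = array('q', [1489363200, 1489449600, 1489536000, 1489622400])

view = memoryview(timestamps)   # no copy: a window onto the same memory
copied = list(timestamps)       # a new list holding independent values

view[0] = 0                     # writing through the view ...
print(timestamps[0])            # ... changes the original array
print(copied[0])                # the copy keeps the old value

every_other = view[::2]         # still no copy, but a strided window
print(view.contiguous, every_other.contiguous)
print(view.strides, every_other.strides)
```

A function that receives a view like every_other must either honor the strides or ask for a contiguous array first; stepping through the raw pointer one element at a time would read the skipped elements as well.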
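Now add version02 to the benchmark code (after building it as a second extension in setup.py and importing it in src/python/datex/cc/__init__.py),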
# File `tests/datex/test_1.py`.
...

def do_all(fn, n):
    ...
    functions = [
        (version01.weekdays, timestamps_np),
        (version03.weekdays, timestamps_np),
        (cy.version09.weekdays, memoryview(timestamps_np)),
        (c.version01.weekdays, timestamps_np),
        (cc.version01.weekdays, memoryview(timestamps_np)),
        (cc.version01.vectorized_weekday, timestamps_np),
        (cc.version02.weekdays, timestamps_np),
    ]
    ...

...
and check the speed,
$ python test_1.py
datex.version01.weekdays                  :   5.4988 seconds
datex.version03.weekdays                  :  18.0567 seconds
datex.cy._version09.weekdays              :   0.0570 seconds
datex.c.version01.weekdays                :   0.0796 seconds
datex.cc.version01.weekdays               :   0.4754 seconds
datex.cc.version01.vectorized_weekday     :   0.0688 seconds
datex.cc.version02.weekdays               :   0.0698 seconds
This version, cc.version02.weekdays, runs at essentially the same speed as the vectorized version (0.0698 versus 0.0688 seconds). To recap, pybind11::vectorize and pybind11::array are suitable in different situations, and both are straightforward to use.
Pybind11 shines in more complex use cases, notably in object-oriented code. This has not been needed in the example of this post. However, if you are interested, check out the documentation and give it a spin!
All the code in this post can be found at https://github.com/zpz/experiments.py.