
NumPy Array Processing With Cython: 1250x Faster

Here we see how to speed up NumPy array processing using Cython. By explicitly declaring the "ndarray" data type, your array processing can be 1250x faster.

4 years ago • 11 min read

By Ahmed Fawzy Gad

This tutorial will show you how to speed up the processing of NumPy arrays using Cython. By explicitly specifying the data types of variables in Python, Cython can give drastic speed increases at runtime.

The sections covered in this tutorial are as follows:

Looping through NumPy arrays
The Cython type for NumPy arrays
Data type of NumPy array elements
NumPy array as a function argument
Indexing, not iterating, over a NumPy Array
Disabling bounds checking and negative indices
Summary

For an introduction to Cython and how to use it, check out my post on using Cython to boost Python scripts. Otherwise, let's get started!

Looping Through a NumPy Array

We'll start with the same code as in the previous tutorial, except here we'll iterate through a NumPy array rather than a list. The NumPy array is created in the arr variable using the arange() function, which returns one billion numbers starting from 0 with a step of 1.

import time
import numpy

total = 0
arr = numpy.arange(1000000000)

t1 = time.time()

for k in arr:
    total = total + k
print("Total = ", total)

t2 = time.time()
t = t2 - t1
print("%.20f" % t)

I'm running this on a machine with a Core i7-6500U CPU @ 2.5 GHz and 16 GB of DDR3 RAM. The Python code completed in 458 seconds (7.63 minutes), which is far too long.

Let's see how much time it takes to complete after editing the Cython script created in the previous tutorial. The only change is that the for loop now iterates over the NumPy array. Note that you have to rebuild the Cython script using the command below before using it.
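Waiting several minutes for a baseline is impractical while experimenting, so here is a minimal sketch of a scaled-down version of the same loop. The function name sum_by_iteration and the two array sizes are hypothetical choices for illustration, and time.perf_counter() is swapped in for time.time() because it is better suited to measuring short intervals.

```python
import time

import numpy


def sum_by_iteration(arr):
    """Sum the array one element at a time, as in the loop above."""
    total = 0
    for k in arr:
        total = total + k
    return total


# Hypothetical, much smaller sizes than the one billion used in the article.
for n in (10_000, 100_000):
    # int64 is requested explicitly so the running total cannot overflow
    # on platforms where the default integer type is 32 bits wide.
    arr = numpy.arange(n, dtype=numpy.int64)
    start = time.perf_counter()
    total = sum_by_iteration(arr)
    elapsed = time.perf_counter() - start
    print(f"n={n}: total={total}, {elapsed:.6f} s")
```

Timings will differ from machine to machine; the point is only to have a quick reference before moving to Cython.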

python setup.py build_ext --inplace

The Cython script in its current form completed in 128 seconds (2.13 minutes). Still long, but it's a start. Let's see how we can make it even faster.

Cython Type for NumPy Array

Previously we saw that Cython code runs very quickly after explicitly defining C types for the variables used. This is also the case for the NumPy array. If we leave the NumPy array in its current form, Cython works exactly as regular Python does by creating an object for each number in the array. To make things run faster we need to define a C data type for the NumPy array as well, just like for any other variable.

The data type for NumPy arrays is ndarray, which stands for n-dimensional array. If you used the keyword int for creating a variable of type integer, then you can use ndarray for creating a variable for a NumPy array. Note that ndarray must be called using NumPy, because ndarray is inside NumPy. So, the syntax for creating a NumPy array variable is numpy.ndarray. The code listed below creates a variable named arr with data type NumPy ndarray.

The first important thing to note is that NumPy is imported using the regular keyword import in the second line. In the third line, you may notice that NumPy is also imported using the keyword cimport.

This is a good moment to point out that Cython files fall into two categories:

Definition file (.pxd)
Implementation file (.pyx)

The definition file has the extension .pxd and is used to hold C declarations, such as data types to be imported and used in other Cython files. The other file is the implementation file with extension .pyx, which we are currently using to write Cython code. Within this file, we can import a definition file to use what is declared within it.

The code below is to be written inside an implementation file with extension .pyx. The cimport numpy statement imports a definition file in Cython named "numpy". This is done because the Cython "numpy" file has the data types for handling NumPy arrays.

The code below defines the variables discussed previously, which are maxval, total, k, t1, t2, and t. There is a new variable named arr which holds the array, with data type numpy.ndarray. Previously two import statements were used, namely import numpy and cimport numpy. Which one is relevant here? Here we need cimport numpy, not the regular import. This is what lets us access the numpy.ndarray type declared within the Cython numpy definition file, so we can declare the arr variable as numpy.ndarray.

The maxval variable is set equal to the length of the NumPy array. We can start by creating an array of length 10,000 and increase this number later to compare how Cython improves compared to Python.

import time
import numpy
cimport numpy

cdef unsigned long long int maxval
cdef unsigned long long int total
cdef int k
cdef double t1, t2, t
cdef numpy.ndarray arr

maxval = 10000
arr = numpy.arange(maxval)

t1 = time.time()

for k in arr:
    total = total + k
print("Total =", total)

t2 = time.time()
t = t2 - t1
print("%.20f" % t)

After declaring a variable of type numpy.ndarray and defining its length, the next step is to create the array using the numpy.arange() function. Notice that here we're using the Python NumPy, imported using the import numpy statement.

Running the above code, Cython took just 0.001 seconds to complete. For Python, the code took 0.003 seconds. Cython is nearly 3x faster than Python in this case.

When the maxval variable is set to 1 million, the Cython code runs in 0.096 seconds while Python takes 0.293 seconds (Cython is again about 3x faster). When working with 100 million, Cython takes 10.220 seconds compared to 37.173 with Python. For 1 billion, Cython takes 120 seconds, whereas Python takes 458. Still, Cython can do better. Let's see how.
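To put the reported timings side by side, the following sketch computes the speedup at each array size. The numbers are copied from the paragraphs above; the dictionary names are just illustrative.

```python
# Timings in seconds as reported above: (regular Python, Cython with a plain
# numpy.ndarray declaration), keyed by the number of array elements.
reported = {
    10_000: (0.003, 0.001),
    1_000_000: (0.293, 0.096),
    100_000_000: (37.173, 10.220),
    1_000_000_000: (458.0, 120.0),
}

# Speedup = Python time divided by Cython time.
speedups = {n: py / cy for n, (py, cy) in reported.items()}

for n, factor in speedups.items():
    print(f"{n:>13,} elements: {factor:.2f}x")
```

The ratio stays between roughly 3x and 4x at every size, which suggests that declaring the array type alone does not remove the per-element overhead.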

Data Type of NumPy Array Elements

The first improvement is related to the datatype of the array. The datatype of the NumPy array arr is defined by the cdef line shown below. All we did there is define the type of the array, but we can give more information to Cython to simplify things.

Note that nothing warns you that a part of the code needs to be optimized. Everything will work; you have to investigate your code to find the parts that could be optimized to run faster.

cdef numpy.ndarray arr

In addition to defining the datatype of the array, we can define two more pieces of information:

Datatype for array elements
Number of dimensions

The datatype of the array elements is int and defined according to the line below. The numpy imported using cimport has a type corresponding to each type in NumPy but with _t at the end. For example, int in regular NumPy corresponds to int_t in Cython.

The second argument is ndim, which specifies the number of dimensions in the array. It is set to 1 here. Note that its default value is also 1, so it can be omitted from our example. If more dimensions are being used, we must specify it.

cdef numpy.ndarray[numpy.int_t, ndim=1] arr

Unfortunately, you are only permitted to define the type of the NumPy array this way when it is an argument of a function or a local variable inside a function, not in the script body. I hope Cython overcomes this issue soon. We now need to edit the previous code to move it into a function, which will be created in the next section. For now, let's create the array after defining it.

Note that we defined the type of the variable arr to be numpy.ndarray, but do not forget that this is the type of the container. This container has elements, and these elements are treated as objects if nothing else is specified. To make sure the elements are integers, the dtype argument is set to numpy.int_ according to the next line (older NumPy releases also accepted the alias numpy.int, which was removed in NumPy 1.24).

arr = numpy.arange(maxval, dtype=numpy.int_)

The numpy in the cdef declaration (numpy.ndarray and numpy.int_t) is the one imported using the cimport keyword, while the call to numpy.arange() uses the regular Python import. Generally, whenever you find numpy used to define the type of a variable, make sure it is the one imported from Cython using the cimport keyword.
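The element type and the number of dimensions are also visible from regular Python, through the dtype and ndim attributes of an array. The sketch below inspects them; the helper name describe is hypothetical, and numpy.int64 is chosen only for illustration (which fixed-width type numpy.int_t corresponds to depends on the platform, so treat that match as an assumption to check on your machine).

```python
import numpy


def describe(a):
    """Return the (element dtype name, number of dimensions) pair of an array."""
    return a.dtype.name, a.ndim


# A one-dimensional array of 64-bit integers.
arr = numpy.arange(10, dtype=numpy.int64)
print(describe(arr))

# The same data viewed as a 2 x 5 grid: same element type, two dimensions.
grid = arr.reshape(2, 5)
print(describe(grid))
```

Checking these two attributes before calling a typed Cython function is one way to see in advance whether an array matches the declared element type and ndim.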

NumPy Array as a Function Argument

After preparing the array, the next step is to create a function that accepts an argument of type numpy.ndarray, as listed below. The function is named do_calc().

import time
import numpy
cimport numpy

ctypedef numpy.int_t DTYPE_t
def do_calc(numpy.ndarray[DTYPE_t, ndim=1] arr):
    cdef int maxval
    cdef unsigned long long int total = 0  # initialize explicitly; C locals start with an undefined value
    cdef int k
    cdef double t1, t2, t

    t1 = time.time()

    for k in arr:
        total = total + k
    print("Total = ", total)
    
    t2 = time.time()
    t = t2 - t1
    print("%.20f" % t)

# Calling script (a regular Python file)
import test_cython
import numpy
arr = numpy.arange(1000000000, dtype=numpy.int_)  # numpy.int was removed in NumPy 1.24
test_cython.do_calc(arr)

After building the Cython script, we call the function do_calc() from a regular Python script, shown at the end of the listing above. The computational time in this case is reduced from 120 seconds to 98 seconds. This makes Cython about 5x faster than Python for summing 1 billion numbers. As you might expect by now, to me this is still not fast enough. We'll see another trick to speed up computation in the next section.
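When a run takes this long, it helps to know in advance what total to expect. The array holds the integers 0 to n - 1, whose sum has the closed form n(n - 1)/2. The sketch below uses that formula and also checks that the result fits in an unsigned 64-bit integer, the minimum width of the unsigned long long type used for total. The names expected_total and UINT64_MAX are illustrative.

```python
def expected_total(n):
    """Sum of the integers 0, 1, ..., n - 1, i.e. the sum of numpy.arange(n)."""
    return n * (n - 1) // 2


# Largest value an unsigned 64-bit integer can hold.
UINT64_MAX = 2**64 - 1

n = 1_000_000_000
print("Expected total:", expected_total(n))
print("Fits in 64 unsigned bits:", expected_total(n) <= UINT64_MAX)
```

If the value printed by do_calc() differs from this, the accumulator was probably not initialized or the element type does not match the array.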

Indexing vs. Iterating Over NumPy Arrays

Cython has only cut the computational time by a factor of 5, which is hardly enough to encourage me to use it. The problem, however, is not Cython itself but the way we are using it, specifically how the loop is written. Let's take a closer look at the loop, which is given below.

The previous tutorial made an important point: Python is just an interface. An interface makes things easier for the user, but the easy way is not always the efficient way to do something.

Python [the interface] has its own way of iterating over arrays, which is what the loop below uses. The loop variable k walks through the arr NumPy array: each element is fetched from the array in turn and assigned to k. Looping through an array this way is a style introduced in Python, but it is not the way C loops through an array.

for k in arr:
    total = total + k

The normal way to loop through an array in most programming languages is to create indices starting from 0 [sometimes from 1] up to the last index in the array, and to use each index to fetch the corresponding element. Because C does not know how to loop through the array in the Python style, the above loop is executed in Python style and therefore takes a long time to run.
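To make the two styles concrete, here is a plain-Python sketch that sums the same array both ways and confirms the totals agree. The array size of 1000 is a hypothetical choice to keep it quick. In regular Python both loops are slow; the difference in speed only appears once the code is compiled by Cython with a typed array and a typed index.

```python
import numpy

arr = numpy.arange(1000, dtype=numpy.int64)

# Python style: the loop variable receives each element in turn.
total_iter = 0
for value in arr:
    total_iter += value

# Index style: the loop variable is an index used to fetch each element.
total_index = 0
for k in range(arr.shape[0]):
    total_index += arr[k]

print(total_iter == total_index)
```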

In order to overcome this issue, we need to create a loop in the normal style that uses indices for accessing the array elements. The new loop is implemented as follows.

First, there is a new variable named arr_shape used to store the number of elements within the array. In our example, there is only a single dimension and its length is returned by indexing the result of arr.shape using index 0.

The arr_shape variable is then fed to the range() function which returns the indices for accessing the array elements. In this case, the variable k represents an index, not an array value.

Inside the loop, the elements are returned by indexing the variable arr by the index k.

cdef int arr_shape = arr.shape[0]
for k in range(arr_shape):
    total = total + arr[k]

Let's edit the Cython script to include the above loop. The new script is listed below. The old loop is commented out.

import time
import numpy
cimport numpy

ctypedef numpy.int_t DTYPE_t

def do_calc(numpy.ndarray[DTYPE_t, ndim=1] arr):
    cdef int maxval
    cdef unsigned long long int total = 0  # initialize explicitly; C locals start with an undefined value
    cdef int k
    cdef double t1, t2, t
    cdef int arr_shape = arr.shape[0]

    t1=time.time()

#    for k in arr:
#        total = total + k

    for k in range(arr_shape):
        total = total + arr[k]
    print("Total =", total)
    
    t2=time.time()
    t = t2-t1
    print("%.20f" % t)

After rebuilding the Cython script, the computational time is now around a single second for summing 1 billion numbers, thanks to changing the loop to use indices. So, the time is reduced from 120 seconds to just 1 second. This is what we expected from Cython.

Note that nothing went wrong when we used the Python style for looping through the array, and nothing indicated why the code was not optimized. We therefore have to examine each part of the code carefully for possible optimizations.

Note that regular Python takes more than 500 seconds to execute the above code, while Cython takes around 1 second. Thus, Cython is 500x faster than Python for summing 1 billion numbers. Super. Remember that we sacrificed some of Python's simplicity to reduce the computational time. In my opinion, reducing the time by a factor of 500 is worth the effort of optimizing the code using Cython.

Reaching 500x faster code is great, but there is still one more improvement, which is discussed in the next section.

Disabling Bounds Checking and Negative Indices

A number of factors cause the code to be slower, as discussed in the Cython documentation:

Bounds checking for making sure the indices are within the range of the array.
Using negative indices for accessing array elements.

These two features are active by default when Cython executes the code. You can use a negative index such as -1 to access the last element in the array. Cython also makes sure no index is out of range, raising an IndexError instead of crashing if that happens. If you do not need these features, you can disable them to save more time by adding the following lines.

cimport cython
@cython.boundscheck(False)
@cython.wraparound(False)

The new code after disabling such features is as follows.

import time
import numpy
cimport numpy
cimport cython

ctypedef numpy.int_t DTYPE_t

@cython.boundscheck(False) # turn off bounds-checking for entire function
@cython.wraparound(False)  # turn off negative index wrapping for entire function
def do_calc(numpy.ndarray[DTYPE_t, ndim=1] arr):
    cdef int maxval
    cdef unsigned long long int total = 0  # initialize explicitly; C locals start with an undefined value
    cdef int k
    cdef double t1, t2, t
    cdef int arr_shape = arr.shape[0]

    t1=time.time()

#    for k in arr:
#        total = total + k

    for k in range(arr_shape):
        total = total + arr[k]
    print("Total =", total)

    t2=time.time()
    t = t2-t1
    print("%.20f" % t)

After building and running the Cython script, the time is now around 0.4 seconds. Compared to the computational time of the Python script [which is around 500 seconds], Cython is now around 1250 times faster than Python.
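The headline figure follows directly from the timings reported in this tutorial. The sketch below recomputes the speedup at each stage against the rounded 500-second Python baseline used in this section (the measured baseline reported earlier was 458 seconds); the stage labels are paraphrases for illustration.

```python
# Rounded Python baseline in seconds, as used in this section.
python_baseline = 500.0

# Cython timings in seconds as reported at each stage of the tutorial.
stages = [
    ("numpy.ndarray type only", 120.0),
    ("element type and ndim, inside a function", 98.0),
    ("indexing instead of iterating", 1.0),
    ("bounds checking and wraparound disabled", 0.4),
]

# Speedup = baseline time divided by the Cython time at that stage.
speedups = [(label, python_baseline / seconds) for label, seconds in stages]

for label, factor in speedups:
    print(f"{label}: about {factor:.0f}x")
```

Most of the gain comes from the switch to indexing; disabling the two checks then takes the factor from about 500 to about 1250.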

Summary

This tutorial used Cython to boost the performance of NumPy array processing. We accomplished this in four different ways:

1. Defining the NumPy Array Data Type

We began by specifying the data type of the NumPy array using numpy.ndarray. We saw that this type is available in the definition file imported using the cimport keyword.

2. Specifying the Data Type of Array Elements + Number of Dimensions

Just assigning the numpy.ndarray type to a variable is a start, but it's not enough. There are still two pieces of information to be provided: the data type of the array elements, and the dimensionality of the array. Both have a big impact on processing time.

These details are only accepted when the NumPy arrays are defined as a function argument, or as a local variable inside a function. We therefore add the Cython code at these points. You can also specify the return data type of the function.

3. Looping Through NumPy Arrays Using Indexing

The third way to reduce processing time is to avoid Pythonic looping, in which a variable is assigned value by value from the array. Instead, just loop through the array using indexing. This leads to a major reduction in time.

4. Disabling Unnecessary Features

Finally, you can reduce some extra milliseconds by disabling some checks that are done by default in Cython for each function. These include "bounds checking" and "wrapping around." Disabling these features depends on your exact needs. For example, if you use negative indexing, then you need the wrapping around feature enabled.
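For reference, the two behaviors being switched off can be seen in ordinary NumPy indexing. This is a minimal sketch in regular Python, with illustrative variable names. Once boundscheck and wraparound are disabled in a Cython function, neither behavior can be relied on there, and an out-of-range index may read invalid memory or crash instead of raising an error.

```python
import numpy

arr = numpy.arange(10)

# Wraparound: a negative index counts from the end of the array.
last = arr[-1]

# Bounds checking: an out-of-range index raises IndexError.
try:
    arr[10]
    out_of_range_detected = False
except IndexError:
    out_of_range_detected = True

print(last, out_of_range_detected)
```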

Conclusion

This tutorial discussed using Cython to manipulate NumPy arrays more than 1000 times faster than Python alone. The key to reducing the computational time is to specify the data types of the variables, and to index the array rather than iterate through it.

In the next tutorial, we will summarize and build on what we have covered so far by using Cython to reduce the computational time for a Python implementation of the genetic algorithm.
