Cythonizing Genetic Algorithms: 18x Faster

In this tutorial we use Cython to reduce the execution time of the genetic algorithm implemented in Python. We've brought down our computational time from 1.46 seconds to a mere 0.08 seconds, so that 1 million generations run in less than 10 seconds with Cython, compared to 180 seconds in Python.


By Ahmed Fawzy Gad

In two previous tutorials we saw an introduction to Cython, a language that mainly adds static data types to the variables used in Python. This boosts the performance of Python scripts, resulting in dramatic speed increases. For example, when applied to NumPy arrays, Cython completed the sum of 1 billion numbers 1250 times faster than Python.

This tutorial builds upon what we discussed previously to speed up the execution of a project that implements the genetic algorithm (GA) in Python. The base project is available on GitHub. We'll inspect the code and follow the instructions discussed in the previous two tutorials to make as many changes as possible to boost performance, and then run the generations in significantly less time compared to Python.

We'll begin by downloading the GitHub project. Then we'll look at cythonizing each part of the genetic algorithm: the fitness function, mating pool, crossover, and mutation. We'll also see how to implement different NumPy functions in C speed, and will conclude the post with the final implementation of the full code and a comparison of its performance with Python.

Note that you do not need to know the genetic algorithm to complete this tutorial; we will go over each part of it, and all you need to do is cythonize the Python code regardless of whether it is the genetic algorithm or something else. If you do want more details about how the genetic algorithm works, see my other posts on LinkedIn (with implementation on GitHub):

  1. Introduction to Optimization with Genetic Algorithm
  2. Genetic Algorithm Implementation in Python

Let's get started.


Downloading and Using the GitHub Project

The Python implementation of the genetic algorithm is available at this GitHub page. The project has two files. The first is the ga.py file, which implements the genetic algorithm operations including:

  • Fitness function calculation using the cal_pop_fitness() function
  • Mating pool using the select_mating_pool() function
  • Crossover using the crossover() function (single point crossover is implemented)
  • Mutation using the mutation() function (just a single gene has its value updated)

The second file is named Example_GeneticAlgorithm.py. We look at a basic example of optimizing the following equation, where x is an input vector with 6 elements and the w values are the weights to be optimized:

y = w1*x1 + w2*x2 + w3*x3 + w4*x4 + w5*x5 + w6*x6

The Example_GeneticAlgorithm.py script prepares the initial population and loops through the generations. In each generation, the functions listed above in ga.py are called.
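Before diving into the files, here is a minimal pure-Python sketch (the variable names are illustrative, not copied from the project) of what one generation's fitness evaluation boils down to for this equation:

import numpy

equation_inputs = numpy.array([4, -2, 3.5, 5, -11, -4.7])           # x1..x6
population = numpy.random.uniform(low=-4.0, high=4.0, size=(8, 6))  # 8 individuals, 6 weights each

# y = w1*x1 + w2*x2 + ... + w6*x6, evaluated for every individual at once
y = numpy.sum(population * equation_inputs, axis=1)
print(y)   # one fitness value per individual, shape (8,)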

Throughout this tutorial, we are going to inspect the implementation of both the ga.py and Example_GeneticAlgorithm.py scripts and see what we can change to reduce the computational time. By just running the project and removing all print statements (which are very time consuming), the Python code takes around 1.46 seconds to go through 10,000 generations (run on Core i7-6500U CPU @ 2.5 GHz, with 16 GB DDR3 RAM).

Let's start with the ga.py file.

Cythonizing Functions Inside ga.py

Inside the ga.py file, the first function is cal_pop_fitness(). This calculates the fitness value for each individual in the population. It is the first step in the GA.

Fitness Function

The cal_pop_fitness() function accepts two arguments: a vector with 6 values (x1 to x6 in the equation above), and the population for which the fitness values will be calculated. The population consists of individuals, and the length of each individual is 6 (because there are 6 weights, w1 to w6, for the 6 inputs x1 to x6). If, for example, there are 8 individuals, then the size of the array holding the population is 8 x 6. In other words, it is a 2D-array (or a matrix).

The function calculates a fitness value for each individual by summing the products between each of the 6 weights for each individual and the 6 equation inputs. The function then returns the fitness values for all individuals as a vector.

def cal_pop_fitness(equation_inputs, pop):
    fitness = numpy.sum(pop*equation_inputs, axis=1)
    return fitness

How can we cythonize this? According to the four tips stated in the previous tutorial about using Cython and NumPy, the first step is to process the NumPy array inside a function – this is already the case. After defining the function, all we need to do is define the data types of the arguments, the return data type, and the data types of the local variables defined within the function (optionally, we can also disable unnecessary features such as bounds checking). Here is the new function after making these changes:

import numpy
cimport numpy
import cython
 
@cython.wraparound(False)
@cython.nonecheck(False)
@cython.boundscheck(False)
cpdef numpy.ndarray[numpy.double_t, ndim=1] cal_pop_fitness(numpy.ndarray[numpy.double_t, ndim=1] equation_inputs, numpy.ndarray[numpy.double_t, ndim=2] pop):
    cdef numpy.ndarray[numpy.double_t, ndim=1] fitness

    fitness = numpy.sum(pop*equation_inputs, axis=1)

    return fitness

Above the function, three Cython decorators are used to disable three features: wraparound (since we no longer use negative indices), checking for None values, and bounds checking. Note that we only disable bounds checking because we are sure that no index will exceed the bounds.

Generally, we can define a function in Cython in three ways:

  1. def: defines a function that works at Python speed, and thus is a bit slow. The def keyword can be used to define a function inside a Python or a Cython script. Also, the function defined using def can be called inside or outside the Cython/Python script.
  2. cdef: this can only be defined within a Cython script, and cannot be called from a Python script. It works faster than a function defined using def.
  3. cpdef: this gives the advantages of both def and cdef. The function can only be defined inside a Cython script, but it can be called from either a Cython or a Python script. cpdef is nearly as fast as cdef.

Because we may use all of the functions defined inside the Cython script from a Python script, we will use the cpdef keyword for defining all functions.
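As a quick, hedged illustration (toy function names, not part of the project), the three kinds of definitions look like this in a Cython script:

# toy example -- not part of the GA project
def py_add(a, b):                 # Python speed; callable from Python and Cython
    return a + b

cdef int c_add(int a, int b):     # C speed; callable only from Cython code
    return a + b

cpdef int cp_add(int a, int b):   # close to C speed; callable from Python and Cython
    return a + b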

Immediately after cpdef, the return data type of the function is set to numpy.ndarray[numpy.double_t, ndim=1]. This means the function will return a variable of type numpy.ndarray. The type of the elements inside the array is also set to double using numpy.double_t. Finally, the number of dimensions is set to 1 using the ndim argument, because a 1D array (vector) is returned. Note that if there is a mismatch between the number of dimensions defined in the return type and the actual data returned, an exception will be thrown.

Next, the data types of the two arguments are defined. All of them are numpy.ndarray and the elements type is double. The first argument has one dimension, while the second argument has two.

Now the function header is completely defined. Inside the function there is a single local variable, the fitness vector. It is defined in the same way as the first function argument. Finally, the 1-D array is returned.

At this point cal_pop_fitness() is cythonized; it is less readable than the Python version, but it is now faster.

Mating Pool

The next function, select_mating_pool(), is implemented in Python as follows:

def select_mating_pool(pop, fitness, num_parents):
    parents = numpy.empty((num_parents, pop.shape[1]))
    for parent_num in range(num_parents):
        max_fitness_idx = numpy.where(fitness == numpy.max(fitness))
        max_fitness_idx = max_fitness_idx[0][0]
        parents[parent_num, :] = pop[max_fitness_idx, :]
        fitness[max_fitness_idx] = -99999999999
    return parents

The Cython version is below. You can easily understand the Cython function, as it does not differ much from the Python version. This function returns the mating pool, which consists of more than one individual. As a result, the returned array is 2D and therefore ndim is set to 2 in the return data type. The local variables in the function are defined using the cdef keyword. Just note that slicing and indexing the NumPy arrays is done the same way as in Python. Looping through the array also uses indexing, which is the faster way to do so.

import numpy
cimport numpy
import cython

@cython.wraparound(False)
@cython.nonecheck(False)
@cython.boundscheck(False)
cpdef numpy.ndarray[numpy.double_t, ndim=2] select_mating_pool(numpy.ndarray[numpy.double_t, ndim=2] pop, numpy.ndarray[numpy.double_t, ndim=1] fitness, int num_parents):
    cdef numpy.ndarray[numpy.double_t, ndim=2] parents
    cdef int parent_num, max_fitness_idx, min_val, max_fitness

    min_val = -999999

    parents = numpy.empty((num_parents, pop.shape[1]))
    for parent_num in range(num_parents):
        max_fitness_idx = numpy.where(fitness == numpy.max(fitness))[0][0]
        parents[parent_num, :] = pop[max_fitness_idx, :]
        fitness[max_fitness_idx] = min_val
    return parents

Crossover

The next function is crossover(), defined below in Python.

def crossover(parents, offspring_size):
    offspring = numpy.empty(offspring_size)
    crossover_point = numpy.uint8(offspring_size[1]/2)

    for k in range(offspring_size[0]):
        parent1_idx = k%parents.shape[0]
        parent2_idx = (k+1)%parents.shape[0]
        offspring[k, 0:crossover_point] = parents[parent1_idx, 0:crossover_point]
        offspring[k, crossover_point:] = parents[parent2_idx, crossover_point:]
    return offspring

The Cython version is as follows. Note that the wraparound() decorator is set to True because negative indexing is required here. Note also that the type of the offspring_size argument is tuple, so you must provide this argument as such. Any mismatch will cause an error.

Because the crossover_point local variable is defined as an integer variable, we use numpy.uint8() to enforce this and prevent any errors. The remaining part of the function is exactly the same as in Python. Note that there are still some changes to be made later, where we'll replace some time-consuming operations by others that take less time.

import numpy
cimport numpy
import cython

@cython.wraparound(True)
@cython.nonecheck(False)
@cython.boundscheck(False)
cpdef numpy.ndarray[numpy.double_t, ndim=2] crossover(numpy.ndarray[numpy.double_t, ndim=2] parents, tuple offspring_size):
    cdef numpy.ndarray[numpy.double_t, ndim=2] offspring
    offspring = numpy.empty(offspring_size)
    cdef int k, parent1_idx, parent2_idx
    cdef numpy.int_t crossover_point
    crossover_point = numpy.uint8(offspring_size[1]/2)

    for k in range(offspring_size[0]):
        parent1_idx = k%parents.shape[0]
        parent2_idx = (k+1)%parents.shape[0]
        
        offspring[k, 0:crossover_point] = parents[parent1_idx, 0:crossover_point]
        offspring[k, crossover_point:] = parents[parent2_idx, crossover_point:]
    return offspring

Mutation

The last function in the ga.py file is mutation(), shown here in Python:

def mutation(offspring_crossover, num_mutations=1):
    mutations_counter = numpy.uint8(offspring_crossover.shape[1] / num_mutations)
    for idx in range(offspring_crossover.shape[0]):
        gene_idx = mutations_counter - 1
        for mutation_num in range(num_mutations):
            random_value = numpy.random.uniform(-1.0, 1.0, 1)
            offspring_crossover[idx, gene_idx] = offspring_crossover[idx, gene_idx] + random_value
            gene_idx = gene_idx + mutations_counter
    return offspring_crossover

The cythonized version is below. It follows the steps we've seen before: disabling unused features, using cpdef rather than def, and declaring the data types for the arguments, return values, and local variables. Because negative indexing is not required, it is disabled for this function.

import numpy
cimport numpy
import cython

@cython.wraparound(False)
@cython.nonecheck(False)
@cython.boundscheck(False)
cpdef numpy.ndarray[numpy.double_t, ndim=2] mutation(numpy.ndarray[numpy.double_t, ndim=2] offspring_crossover, int num_mutations=1):
    cdef int idx, mutation_num, gene_idx
    cdef double random_value
    cdef Py_ssize_t mutations_counter
    mutations_counter = numpy.uint8(offspring_crossover.shape[1] / num_mutations)
    for idx in range(offspring_crossover.shape[0]):
        gene_idx = mutations_counter - 1
        for mutation_num in range(num_mutations):
            random_value = numpy.random.uniform(-1.0, 1.0, 1)
            offspring_crossover[idx, gene_idx] = offspring_crossover[idx, gene_idx] + random_value
            gene_idx = gene_idx + mutations_counter
    return offspring_crossover

We've finished cythonizing ga.py! The new full code is listed below. Just save this code into a file named ga.pyx and we will build it in the Building the .pyx Files section using the setup.py file.

import numpy
cimport numpy
import time
import cython

@cython.wraparound(False)
@cython.nonecheck(False)
@cython.boundscheck(False)
cpdef numpy.ndarray[numpy.double_t, ndim=1] cal_pop_fitness(numpy.ndarray[numpy.double_t, ndim=1] equation_inputs, numpy.ndarray[numpy.double_t, ndim=2] pop):
    cdef numpy.ndarray[numpy.double_t, ndim=1] fitness
    fitness = numpy.sum(pop*equation_inputs, axis=1)
    return fitness

@cython.wraparound(False)
@cython.nonecheck(False)
@cython.boundscheck(False)
cpdef numpy.ndarray[numpy.double_t, ndim=2] select_mating_pool(numpy.ndarray[numpy.double_t, ndim=2] pop, numpy.ndarray[numpy.double_t, ndim=1] fitness, int num_parents):
    cdef numpy.ndarray[numpy.double_t, ndim=2] parents
    cdef int parent_num, max_fitness_idx, max_fitness, a
    cdef double min_val  # declared as double: the sentinel value below would overflow a C int

    min_val = -99999999999

    parents = numpy.empty((num_parents, pop.shape[1]))
    for parent_num in range(num_parents):
        max_fitness_idx = numpy.where(fitness == numpy.max(fitness))[0][0]
        parents[parent_num, :] = pop[max_fitness_idx, :]
        fitness[max_fitness_idx] = min_val
    return parents

@cython.wraparound(True)
@cython.nonecheck(False)
@cython.boundscheck(False)
cpdef numpy.ndarray[numpy.double_t, ndim=2] crossover(numpy.ndarray[numpy.double_t, ndim=2] parents, tuple offspring_size):
    cdef numpy.ndarray[numpy.double_t, ndim=2] offspring
    offspring = numpy.empty(offspring_size)
    cdef int k, parent1_idx, parent2_idx
    cdef numpy.int_t crossover_point
    crossover_point = numpy.uint8(offspring_size[1]/2)

    for k in range(offspring_size[0]):
        parent1_idx = k%parents.shape[0]
        parent2_idx = (k+1)%parents.shape[0]

        offspring[k, 0:crossover_point] = parents[parent1_idx, 0:crossover_point]
        offspring[k, crossover_point:] = parents[parent2_idx, crossover_point:]
    return offspring

@cython.wraparound(False)
@cython.nonecheck(False)
@cython.boundscheck(False)
cpdef numpy.ndarray[numpy.double_t, ndim=2] mutation(numpy.ndarray[numpy.double_t, ndim=2] offspring_crossover, int num_mutations=1):
    cdef int idx, mutation_num, gene_idx
    cdef double random_value
    cdef Py_ssize_t mutations_counter
    mutations_counter = numpy.uint8(offspring_crossover.shape[1] / num_mutations)
    for idx in range(offspring_crossover.shape[0]):
        gene_idx = mutations_counter - 1
        for mutation_num in range(num_mutations):
            random_value = numpy.random.uniform(-1.0, 1.0, 1)
            offspring_crossover[idx, gene_idx] = offspring_crossover[idx, gene_idx] + random_value
            gene_idx = gene_idx + mutations_counter
    return offspring_crossover

The second file, Example_GeneticAlgorithm.py, calls the functions defined in the ga.py file. Let's finish cythonizing this second file before we get our GA running.

Cythonizing Example_GeneticAlgorithm.py

The Python implementation of the Example_GeneticAlgorithm.py file is as follows. The time module is imported so we can compare the performance of Python and Cython.

import numpy
import ga
import time

equation_inputs = [4,-2,3.5,5,-11,-4.7]

num_weights = len(equation_inputs)

sol_per_pop = 8
num_parents_mating = 4

pop_size = (sol_per_pop,num_weights)
new_population = numpy.random.uniform(low=-4.0, high=4.0, size=pop_size)

best_outputs = []
num_generations = 10000
t1 = time.time()
for generation in range(num_generations):
    fitness = ga.cal_pop_fitness(equation_inputs, new_population)

    best_outputs.append(numpy.max(numpy.sum(new_population*equation_inputs, axis=1)))

    parents = ga.select_mating_pool(new_population, fitness,
                                      num_parents_mating)

    offspring_crossover = ga.crossover(parents,
                                       offspring_size=(pop_size[0]-parents.shape[0], num_weights))

    offspring_mutation = ga.mutation(offspring_crossover, num_mutations=2)

    new_population[0:parents.shape[0], :] = parents
    new_population[parents.shape[0]:, :] = offspring_mutation
t2 = time.time()
t = t2-t1
print("Total Time %.20f" % t)

The cythonized code is listed below. The ga module is imported as a regular Python module. All you have to do is declare the data types of all the variables used. Just take care to match the passed variables with the types accepted by the functions edited previously.

import ga
import numpy
cimport numpy
import time

cdef numpy.ndarray equation_inputs, parents, new_population, fitness, offspring_crossover, offspring_mutation
cdef int num_weights, sol_per_pop, num_parents_mating, num_generations
cdef tuple pop_size
cdef double t1, t2, t

equation_inputs = numpy.array([4,-2,3.5,5,-11,-4.7])
num_weights = equation_inputs.shape[0]

sol_per_pop = 8
num_parents_mating = 4

pop_size = (sol_per_pop, num_weights)
new_population = numpy.random.uniform(low=-4.0, high=4.0, size=pop_size)

num_generations = 10000

t1 = time.time()
for generation in range(num_generations):
    fitness = ga.cal_pop_fitness(equation_inputs, new_population)

    parents = ga.select_mating_pool(new_population, fitness,
                                      num_parents_mating)

    offspring_crossover = ga.crossover(parents,
                                       offspring_size=(pop_size[0]-parents.shape[0], num_weights))

    offspring_mutation = ga.mutation(offspring_crossover, num_mutations=2)

    new_population[0:parents.shape[0], :] = parents
    new_population[parents.shape[0]:, :] = offspring_mutation
t2 = time.time()
t = t2-t1
print("Total Time %.20f" % t)

At module level we can only assign the plain numpy.ndarray data type to the NumPy variables and nothing more. We cannot specify the number of dimensions or the element data type, because Cython does not support typed NumPy buffers outside of a function. If the code were wrapped in a function, however, we could define everything and speed up the processing. We will do exactly that further on.
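One practical consequence of the typed signatures is worth spelling out. The sketch below (assuming the ga extension has already been built, as described in the next section) shows why equation_inputs is converted with numpy.array() here instead of being left as a plain Python list, as it was in the pure-Python script:

import numpy
import ga   # the compiled extension module

eq_list = [4, -2, 3.5, 5, -11, -4.7]
pop = numpy.random.uniform(low=-4.0, high=4.0, size=(8, 6))

# ga.cal_pop_fitness(eq_list, pop)   # raises TypeError: a list is not a numpy.ndarray
eq_arr = numpy.array(eq_list)        # float64 array matches numpy.double_t, ndim=1
print(ga.cal_pop_fitness(eq_arr, pop))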

For now, just save the Cython code into a file named Example_GeneticAlgorithm.pyx, which will be built along with the ga.pyx file.

Building The .pyx Files

The next step is to build the .pyx files to generate the .pyd/.so files to be imported into the project. The setup.py file used for this purpose is listed below. Because there are two .pyx files to be built, the cythonize() function is not given an explicit file name but is instead asked to build all files with the .pyx extension.

import distutils.core
import Cython.Build
import numpy

distutils.core.setup(
    ext_modules = Cython.Build.cythonize("*.pyx"),
    include_dirs=[numpy.get_include()]
)

In order to build the files, issue the command below from the command line.

python setup.py build_ext --inplace

After the command completes successfully, we can import the compiled Example_GeneticAlgorithm module using the following command. This will run the code automatically.

import Example_GeneticAlgorithm
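One caveat (standard Python behaviour, not specific to this project): importing a module only executes its body the first time. To run the experiment again in the same interpreter session, reload it:

import importlib
import Example_GeneticAlgorithm              # first import runs the GA and prints the time
importlib.reload(Example_GeneticAlgorithm)   # re-running in the same session requires a reload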

The Cython code takes 0.945 seconds to complete. Compare this to 1.46 seconds for the Python code; Cython is 1.55 times faster (note that the code is running on a machine with Core i7-6500U CPU @ 2.5 GHz and 16 GB DDR3 RAM).

To further reduce the time we can make a simple edit: use a function to wrap the contents of the Example_GeneticAlgorithm.pyx file.

Evolving Generations Within a Function vs. Script Body

Let's create a function named optimize() within Example_GeneticAlgorithm.pyx, and put the contents of this file within our new function:

import ga
import numpy
cimport numpy
import time
import cython
 
@cython.wraparound(False)
@cython.nonecheck(False)
@cython.boundscheck(False)
cpdef optimize():
    cdef numpy.ndarray equation_inputs, parents, new_population, fitness, offspring_crossover, offspring_mutation
    cdef int num_weights, sol_per_pop, num_parents_mating, num_generations
    cdef list pop_size
    cdef double t1, t2, t

    equation_inputs = numpy.array([4,-2,3.5,5,-11,-4.7])
    num_weights = equation_inputs.shape[0]

    sol_per_pop = 8
    num_parents_mating = 4
    
    pop_size = [sol_per_pop,num_weights]
    #Creating the initial population.
    new_population = numpy.random.uniform(low=-4.0, high=4.0, size=pop_size)

    num_generations = 10000
    t1 = time.time()
    for generation in range(num_generations):
        fitness = ga.cal_pop_fitness(equation_inputs, new_population)
    
        parents = ga.select_mating_pool(new_population, fitness,
                                          num_parents_mating)

        offspring_crossover = ga.crossover(parents,
                                           offspring_size=(pop_size[0]-parents.shape[0], num_weights))

        offspring_mutation = ga.mutation(offspring_crossover, num_mutations=2)
    
        new_population[0:parents.shape[0], :] = parents
        new_population[parents.shape[0]:, :] = offspring_mutation
    t2 = time.time()
    t = t2-t1
    print("Total Time %.20f" % t)
    print(ga.cal_pop_fitness(equation_inputs, new_population))

In order to call the optimize() function, just rebuild the Cython .pyx files and issue the following Python commands from the command line:

import Example_GeneticAlgorithm
Example_GeneticAlgorithm.optimize()

This now takes 0.944 seconds rather than 0.945; almost no change at all. One reason is the overhead of calling the functions of the external ga module inside the loop. Instead, we'll save that overhead by copying the optimize() function into the ga.pyx file itself. Because the functions are then part of the same file, there is less overhead in calling them.

Because the optimize() function is now part of the ga.pyx file, we no longer need the Example_GeneticAlgorithm.pyx file. You can edit the setup.py file to specify that just the ga.pyx file is to be built.
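A minimal sketch of such a setup.py, keeping the same structure as before and only narrowing the pattern, could be:

import distutils.core
import Cython.Build
import numpy

distutils.core.setup(
    ext_modules = Cython.Build.cythonize("ga.pyx"),   # build only ga.pyx
    include_dirs=[numpy.get_include()]
)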

The commands below are used to call the optimize() function. The time is now 0.9 seconds rather than 0.944 and thus the Cython code is now 1.62 times faster than Python.

import ga
ga.optimize()

Now all of the code has been Cythonized, but still more can be done to improve the speed. Let's see how we can use C functions, rather than Python functions – this will give the most drastic speed increase yet.

Implement Python Features in C Speed

Python makes many things easier for the programmer, and this is one of its benefits. But in some cases this convenience comes at the cost of execution time. In this section, we are going to inspect some functions that are available in Python but slow, and see how to implement them so they run in C speed.

Implementing NumPy sum() in C Speed

Inside the cal_pop_fitness() function, the sum of products between each individual and the equation inputs was calculated using the numpy.sum() function. We can implement this function manually using two for loops, as in the code below. Note that the loops run in C speed. For this reason, the variable fitness is declared as a numpy.ndarray type and initialized as a zeros array using numpy.zeros(). The result of calculating the fitness values is saved in this variable.

@cython.wraparound(False)
@cython.nonecheck(False)
@cython.boundscheck(False)
cpdef cal_pop_fitness(numpy.ndarray[numpy.double_t, ndim=1] equation_inputs, numpy.ndarray[numpy.double_t, ndim=2] pop):
    cdef numpy.ndarray[numpy.double_t, ndim=1] fitness
    fitness = numpy.zeros(pop.shape[0])
    # fitness = numpy.sum(pop*equation_inputs, axis=1) # slower than looping.
    for i in range(pop.shape[0]):
        for j in range(pop.shape[1]):
            fitness[i] += pop[i, j]*equation_inputs[j]
    return fitness

After making this edit, we can build the .pyx file and see how much faster the new code is. The code using the above function takes just 0.8 seconds. Thus, implementing the numpy.sum() function using loops saved 0.1 seconds (100 milliseconds). Let's think about something else to optimize.

Inside the select_mating_pool() function, the index of the maximum element in the fitness array was returned using this line.

max_fitness_idx = numpy.where(fitness == numpy.max(fitness))[0][0]

We can edit the function to implement this line in C speed using the loop below. By doing so, the execution time is now 0.44 seconds rather than 0.8 seconds. Compared to Python, Cython is now 3.32 times faster.

@cython.wraparound(False)
@cython.nonecheck(False)
@cython.boundscheck(False)
cpdef numpy.ndarray[numpy.double_t, ndim=2] select_mating_pool(numpy.ndarray[numpy.double_t, ndim=2] pop, numpy.ndarray[numpy.double_t, ndim=1] fitness, int num_parents):
    cdef numpy.ndarray[numpy.double_t, ndim=2] parents
    cdef int parent_num, max_fitness_idx, max_fitness, a
    cdef double min_val  # declared as double: the sentinel value below would overflow a C int

    min_val = -99999999999

    parents = numpy.empty((num_parents, pop.shape[1]))
    for parent_num in range(num_parents):
        max_fitness_idx = 0
        # numpy.where(fitness == numpy.max(fitness))
        for a in range(1, fitness.shape[0]):
            if fitness[a] > fitness[max_fitness_idx]:
                max_fitness_idx = a
        parents[parent_num, :] = pop[max_fitness_idx, :]
        fitness[max_fitness_idx] = min_val
    return parents

NumPy Array Slicing in C Speed

Slicing copies part of one array into another. We can implement this in Cython for parents and pop in the new function listed below. By doing that, Cython takes just 0.427 seconds rather than 0.44 seconds.

@cython.wraparound(False)
@cython.nonecheck(False)
@cython.boundscheck(False)
cpdef numpy.ndarray[numpy.double_t, ndim=2] select_mating_pool(numpy.ndarray[numpy.double_t, ndim=2] pop, numpy.ndarray[numpy.double_t, ndim=1] fitness, int num_parents):
    cdef numpy.ndarray[numpy.double_t, ndim=2] parents
    cdef int parent_num, max_fitness_idx, max_fitness, a
    cdef double min_val  # declared as double: the sentinel value below would overflow a C int

    min_val = -99999999999

    parents = numpy.empty((num_parents, pop.shape[1]))
    for parent_num in range(num_parents):
        max_fitness_idx = 0
        # numpy.where(fitness == numpy.max(fitness))
        for a in range(1, fitness.shape[0]):
            if fitness[a] > fitness[max_fitness_idx]:
                max_fitness_idx = a

        # parents[parent_num, :] = pop[max_fitness_idx, :] # slower than looping by 20 ms
        for a in range(parents.shape[1]):
            parents[parent_num, a] = pop[max_fitness_idx, a]
        fitness[max_fitness_idx] = min_val
    return parents

Because slicing is also used in the crossover() function, we can edit it to implement array slicing using loops that run at C speed. The new function is below, and takes 0.344 seconds rather than 0.427. These changes might seem minor, but when code like this runs for hundreds of thousands of generations, they add up to make a huge impact. At this point, the code runs 4.24 times faster than in Python.

The value assigned to the crossover_point variable was converted previously using numpy.uint8(). Now, it is converted in C style using (int).

@cython.wraparound(True)
@cython.nonecheck(False)
@cython.boundscheck(False)
cpdef numpy.ndarray[numpy.double_t, ndim=2] crossover(numpy.ndarray[numpy.double_t, ndim=2] parents, tuple offspring_size):
    cdef numpy.ndarray[numpy.double_t, ndim=2] offspring
    offspring = numpy.empty(offspring_size)
    cdef int k, parent1_idx, parent2_idx
    cdef numpy.int_t crossover_point
    crossover_point = (int) (offspring_size[1]/2)

    for k in range(offspring_size[0]):
        parent1_idx = k%parents.shape[0]
        parent2_idx = (k+1)%parents.shape[0]
        
        for m in range(crossover_point):
            offspring[k, m] = parents[parent1_idx, m]
        for m in range(crossover_point, parents.shape[1]):
            offspring[k, m] = parents[parent2_idx, m]

        # The next 2 lines are slower than the loops above, which run in C speed.
        # offspring[k, 0:crossover_point] = parents[parent1_idx, 0:crossover_point]
        # offspring[k, crossover_point:] = parents[parent2_idx, crossover_point:]
    return offspring

Generating Random Values in C

The mutation() function uses the numpy.random.uniform() function to return a random double value that is added to a gene:

random_value = numpy.random.uniform(-1.0, 1.0, 1)

We can avoid using this function and generate the random number using the rand() function that is available in the stdlib library of C. The implementation for the mutation() function thus becomes:

from libc.stdlib cimport rand, RAND_MAX
cdef double DOUBLE_RAND_MAX = RAND_MAX # a double variable holding the maximum random integer in C

@cython.wraparound(False)
@cython.nonecheck(False)
@cython.boundscheck(False)
cpdef numpy.ndarray[numpy.double_t, ndim=2] mutation(numpy.ndarray[numpy.double_t, ndim=2] offspring_crossover, int num_mutations=1):
    cdef int idx, mutation_num, gene_idx
    cdef double random_value
    cdef Py_ssize_t mutations_counter
    mutations_counter = (int) (offspring_crossover.shape[1] / num_mutations) # using numpy.uint8() is slower than using (int)
    cdef double rand_double
    for idx in range(offspring_crossover.shape[0]):
        gene_idx = mutations_counter - 1
        for mutation_num in range(num_mutations):
            # random_value = numpy.random.uniform(-1.0, 1.0, 1)
            rand_double = rand()/DOUBLE_RAND_MAX
            random_value = rand_double * (1.0 - (-1.0)) + (-1.0)
            offspring_crossover[idx, gene_idx] = offspring_crossover[idx, gene_idx] + random_value
            gene_idx = gene_idx + mutations_counter
    return offspring_crossover

First, the rand() function is imported from stdlib so that we have access to this C function. rand() returns an integer value in the range 0 to RAND_MAX, which is a constant (its value is at least 32767). Since we want the random number to fall in the range 0 to 1, we need to divide the returned random value by the maximum possible random integer. We do this by copying RAND_MAX into a double variable named DOUBLE_RAND_MAX and dividing the random number by this value. The scaled random value is now available in the rand_double variable. This is then rescaled so that it falls in the range -1 to 1, and is saved in the random_value variable.
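The mapping itself is easy to sanity-check in plain Python (just the arithmetic; C's rand() is of course not callable from here):

# map r in [0, 1] to [low, high] via r*(high - low) + low
low, high = -1.0, 1.0
for r in (0.0, 0.25, 0.5, 1.0):
    print(r, "->", r * (high - low) + low)   # -1.0, -0.5, 0.0, 1.0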

By generating the random value using the C rand() function, Cython now takes just 0.08 seconds (80 milliseconds) to run. Compare that to 0.344 seconds from earlier. This is the biggest difference yet. Now the code runs 18.25 times faster than in Python.

Now that we've made all of our edits, the full ga.pyx file looks like this:

import numpy
cimport numpy
import time
import cython

from libc.stdlib cimport rand, RAND_MAX

cdef double DOUBLE_RAND_MAX = RAND_MAX # a double variable holding the maximum random integer in C

@cython.wraparound(False)
@cython.nonecheck(False)
@cython.boundscheck(False)
cpdef cal_pop_fitness(numpy.ndarray[numpy.double_t, ndim=1] equation_inputs, numpy.ndarray[numpy.double_t, ndim=2] pop):
    cdef numpy.ndarray[numpy.double_t, ndim=1] fitness
    fitness = numpy.zeros(pop.shape[0])
    # fitness = numpy.sum(pop*equation_inputs, axis=1) # slower than looping.
    for i in range(pop.shape[0]):
        for j in range(pop.shape[1]):
            fitness[i] += pop[i, j]*equation_inputs[j]
    return fitness

@cython.wraparound(False)
@cython.nonecheck(False)
@cython.boundscheck(False)
cpdef numpy.ndarray[numpy.double_t, ndim=2] select_mating_pool(numpy.ndarray[numpy.double_t, ndim=2] pop, numpy.ndarray[numpy.double_t, ndim=1] fitness, int num_parents):
    cdef numpy.ndarray[numpy.double_t, ndim=2] parents
    cdef int parent_num, max_fitness_idx, max_fitness, a
    cdef double min_val  # declared as double: the sentinel value below would overflow a C int

    min_val = -99999999999

    parents = numpy.empty((num_parents, pop.shape[1]))
    for parent_num in range(num_parents):
        max_fitness_idx = 0
        # numpy.where(fitness == numpy.max(fitness)) # slower than looping by 250 ms.
        for a in range(1, fitness.shape[0]):
            if fitness[a] > fitness[max_fitness_idx]:
                max_fitness_idx = a
        # parents[parent_num, :] = pop[max_fitness_idx, :]
        for a in range(parents.shape[1]):
            parents[parent_num, a] = pop[max_fitness_idx, a]
        fitness[max_fitness_idx] = min_val
    return parents

@cython.wraparound(True)
@cython.nonecheck(False)
@cython.boundscheck(False)
cpdef numpy.ndarray[numpy.double_t, ndim=2] crossover(numpy.ndarray[numpy.double_t, ndim=2] parents, tuple offspring_size):
    cdef numpy.ndarray[numpy.double_t, ndim=2] offspring
    offspring = numpy.empty(offspring_size)
    cdef int k, parent1_idx, parent2_idx
    cdef numpy.int_t crossover_point
    crossover_point = (int) (offspring_size[1]/2)

    for k in range(offspring_size[0]):
        parent1_idx = k%parents.shape[0]
        parent2_idx = (k+1)%parents.shape[0]

        for m in range(crossover_point):
            offspring[k, m] = parents[parent1_idx, m]
        for m in range(crossover_point, parents.shape[1]):
            offspring[k, m] = parents[parent2_idx, m]

        # The next 2 lines are slower than the loops above, which run in C speed.
        # offspring[k, 0:crossover_point] = parents[parent1_idx, 0:crossover_point]
        # offspring[k, crossover_point:] = parents[parent2_idx, crossover_point:]
    return offspring

@cython.wraparound(False)
@cython.nonecheck(False)
@cython.boundscheck(False)
cpdef numpy.ndarray[numpy.double_t, ndim=2] mutation(numpy.ndarray[numpy.double_t, ndim=2] offspring_crossover, int num_mutations=1):
    cdef int idx, mutation_num, gene_idx
    cdef double random_value
    cdef Py_ssize_t mutations_counter
    mutations_counter = (int) (offspring_crossover.shape[1] / num_mutations) # using numpy.uint8() is slower than using (int)
    cdef double rand_double
    for idx in range(offspring_crossover.shape[0]):
        gene_idx = mutations_counter - 1
        for mutation_num in range(num_mutations):
            # random_value = numpy.random.uniform(-1.0, 1.0, 1)
            rand_double = rand()/DOUBLE_RAND_MAX
            random_value = rand_double * (1.0 - (-1.0)) + (-1.0)
            offspring_crossover[idx, gene_idx] = offspring_crossover[idx, gene_idx] + random_value
            gene_idx = gene_idx + mutations_counter
    return offspring_crossover

@cython.wraparound(False)
@cython.nonecheck(False)
@cython.boundscheck(False)
cpdef optimize():
    cdef numpy.ndarray equation_inputs, parents, new_population, fitness, offspring_crossover, offspring_mutation
    cdef int num_weights, sol_per_pop, num_parents_mating, num_generations
    cdef list pop_size
    cdef double t1, t2, t

    equation_inputs = numpy.array([4,-2,3.5,5,-11,-4.7])
    num_weights = equation_inputs.shape[0]

    sol_per_pop = 8
    num_parents_mating = 4
    
    pop_size = [sol_per_pop,num_weights]
    #Creating the initial population.
    new_population = numpy.random.uniform(low=-4.0, high=4.0, size=pop_size)

    num_generations = 10000
    t1 = time.time()
    for generation in range(num_generations):
        fitness = cal_pop_fitness(equation_inputs, new_population)
    
        parents = select_mating_pool(new_population, fitness,
                                          num_parents_mating)

        offspring_crossover = crossover(parents,
                                           offspring_size=(pop_size[0]-parents.shape[0], num_weights))

        offspring_mutation = mutation(offspring_crossover, num_mutations=2)
    
        new_population[0:parents.shape[0], :] = parents
        new_population[parents.shape[0]:, :] = offspring_mutation
    t2 = time.time()
    t = t2-t1
    print("Total Time %.20f" % t)
    print(cal_pop_fitness(equation_inputs, new_population))

Conclusion

This tutorial used Cython to reduce the time of execution of a genetic algorithm Python implementation using NumPy. We've brought down our computational time from 1.46 seconds to a mere 0.08 seconds, resulting in an 18x speed increase. As a result, we can do 1 million generations in less than 10 seconds with Cython, compared to 180 seconds in Python.

This same methodology can be used for any code written in Python; inspect it line by line, identify bottlenecks, and reduce computation time by implementing the tricks we saw here. You don't necessarily need to know C, but knowledge of C will clearly help you implement faster workarounds. Even without a deep understanding of C, simple tricks like defining variable types can make a big difference when running long or computationally expensive code.
