Migrating Python to compiled code

Hans Petter Langtangen [1, 2]

[1] Center for Biomedical Computing, Simula Research Laboratory
[2] Department of Informatics, University of Oslo

Jun 14, 2016


Table of contents

Pure Python code for Monte Carlo simulation
      The computational problem
      A scalar Python implementation
      A vectorized Python implementation
Migrating scalar Python code to Cython
      A plain Cython implementation
      A better Cython implementation
Migrating code to C
      Writing a C program
      Migrating loops to C code via F2PY
      Migrating loops to C code via Cython
      Comparing efficiency
References

Monte Carlo simulations are usually known to require long execution times. Implementing such simulations in pure Python may lead to inefficient code. The purpose of this note is to show how Python implementations of Monte Carlo simulations, can be made much more efficient by porting the code to Cython. Pure C implementations are included for comparison of efficiency. The reader should know about basic Python and perhaps a bit about Monte Carlo simulations. Cython will be introduced in a step-by-step fashion

Pure Python code for Monte Carlo simulation

A short, intuitive algorithm in Python is first developed. Then this code is vectorized using functionality of the Numerical Python package. Later sections migrate the algorithm to Cython code and also plain C code for comparison. At the end the various techniques are ranked according to their computational efficiency.

The computational problem

A die is thrown \( m \) times. What is the probability of getting six eyes at least \( n \) times? For example, if \( m=5 \) and \( n=3 \), this is the same as asking for the probability that three or more out of five dice show six eyes.

The probability can be estimated by Monte Carlo simulation. We simulate the process a large number of times, \( N \), and count how many times, \( M \), the experiment turned out successfully, i.e., when we got at least \( n \) out of \( m \) dice with six eyes in a throw.

Monte Carlo simulation has traditionally been viewed as a very costly computational method, normally requiring very sophisticated, fast computer implementations in compiled languages. An interesting question is how useful high-level languages like Python and associated tools are for Monte Carlo simulation. This will now be explored.

A scalar Python implementation

Let us introduce the more descriptive variables ndice for \( m \) and nsix for \( n \). The Monte Carlo method is simply a loop, repeated N times, where the body of the loop may directly express the problem at hand. Here, we draw ndice random integers r in \( [1,6] \) inside the loop and count of many (six) that equal 6. If six >= nsix, the experiment is a success and we increase the counter M by one.

A Python function implementing this approach may look as follows:

import random

def dice6_py(N, ndice, nsix):
    M = 0                     # no of successful events
    for i in range(N):        # repeat N experiments
        six = 0               # how many dice with six eyes?
        for j in range(ndice):
            r = random.randint(1, 6)  # roll die no. j
            if r == 6:
               six += 1
        if six >= nsix:       # successful event?
            M += 1
    p = float(M)/N
    return p

The float(M) transformation is important since M/N will imply integer division when M and N both are integers in Python v2.x and many other languages.

We will refer to this implementation is the plain Python implementation. Timing the function can be done by:

import time
t0 = time.clock()
p = dice6_py(N, ndice, nsix)
t1 = time.clock()
print 'CPU time for loops in Python:', t1-t0
The table to appear later shows the performance of this plain, pure Python code relative to other approaches. There is a factor of 30+ to be gained in computational efficiency by reading on.

The function above can be verified by studying the (somewhat simplified) case \( m=n \) where the probability becomes \( 6^{-n} \). The probability quickly becomes small with increasing \( n \). For such small probabilities the number of successful events \( M \) is small, and \( M/N \) will not be a good approximation to the probability unless \( M \) is reasonably large, which requires a very large \( N \). For example, with \( n=4 \) and \( N=10^5 \) the average probability in 25 full Monte Carlo experiments is 0.00078 while the exact answer is 0.00077. With \( N=10^6 \) we get the two correct significant digits from the Monte Carlo simulation, but the extra digit costs a factor of 10 in computing resources since the CPU time scales linearly with \( N \).

A vectorized Python implementation

A vectorized version of the previous program consists of replacing the explicit loops in Python by efficient operations on vectors or arrays, using functionality in the Numerical Python (numpy) package. Each array operation takes place in C or Fortran and is hence much more efficient than the corresponding loop version in Python.

First, we must generate all the random numbers to be used in one operation, which runs fast since all numbers are then calculated in efficient C code. This is accomplished using the numpy.random module. Second, the analysis of the large collection of random numbers must be done by appropriate vector/array operations such that no looping in Python is needed. The solution algorithm must therefore be expressed through a series of function calls to the numpy library. Vectorization requires knowledge of the library's functionality and how to assemble the relevant building blocks to an algorithm without operations on individual array elements.

Generation of ndice random number of eyes for N experiments is performed by

import numpy as np
eyes = np.random.random_integers(1, 6, size=(N, ndice))
Each row in the eyes array corresponds to one Monte Carlo experiment.

The next step is to count the number of successes in each experiment. This counting should not make use of any loop. Instead we can test eyes == 6 to get a boolean array where an element i,j is True if throw (or die) number j in Monte Carlo experiment number i gave six eyes. Summing up the rows in this boolean array (True is interpreted as 1 and False as 0), we are interested in the rows where the sum is equal to or greater than nsix, because the number of such rows equals the number of successful events. The vectorized algorithm can be expressed as

def dice6_vec1(N, ndice, nsix):
    eyes = np.random.random_integers(1, 6, size=(N, ndice))
    compare = eyes == 6
    throws_with_6 = np.sum(compare, axis=1)  # sum over columns
    nsuccesses = throws_with_6 >= nsix
    M = np.sum(nsuccesses)
    p = float(M)/N
    return p

The use of np.sum instead of Python's own sum function is essential for the speed of this function: using M = sum(nsucccesses) instead slows down the code by a factor of almost 10! We shall refer to the dice6_vec1 function as the vectorized Python, version1 implementation.

The criticism against the vectorized version is that the original problem description, which was almost literally turned into Python code in the dice6_py function, has now become much more complicated. We have to decode the calls to various numpy functionality to actually realize that dice6_py and dice6_vec correspond to the same mathematics.

Here is another possible vectorized algorithm, which is easier to understand, because we retain the Monte Carlo loop and vectorize only each individual experiment:

def dice6_vec2(N, ndice, nsix):
    eyes = np.random.random_integers(1, 6, (N, ndice))
    six = [6 for i in range(ndice)]
    M = 0
    for i in range(N):
        # Check experiment no. i:
        compare = eyes[i,:] == six
        if np.sum(compare) >= nsix:
            M += 1
    p = float(M)/N
    return p

We refer to this implementation as vectorized Python, version 2. As will be shown later, this implementation is significantly slower than the plain Python implementation (!) and very much slower than the vectorized Python, version 1 approach. A conclusion is that readable, partially vectorized code, may run slower than straightforward scalar code.

Migrating scalar Python code to Cython

A plain Cython implementation

A Cython program starts with the scalar Python implementation, but all variables are specified with their types, using Cython's variable declaration syntax, like cdef int M = 0 where we in standard Python just write M = 0. Adding such variable declarations in the scalar Python implementation is straightforward:

import random

def dice6_cy1(int N, int ndice, int nsix):
    cdef int M = 0            # no of successful events
    cdef int six, r
    cdef double p
    for i in range(N):        # repeat N experiments
        six = 0               # how many dice with six eyes?
        for j in range(ndice):
            r = random.randint(1, 6)  # roll die no. j
            if r == 6:
               six += 1
        if six >= nsix:       # successful event?
            M += 1
    p = float(M)/N
    return p

This code must be put in a separate file with extension .pyx. Running Cython on this file translates the Cython code to C. Thereafter, the C code must be compiled and linked to form a shared library, which can be imported in Python as a module. All these tasks are normally automated by a setup.py script. Let the dice6_cy1 function above be stored in a file dice6.pyx. A proper setup.py script looks as follows:

from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext

setup(
  name='Monte Carlo simulation',
  ext_modules=[Extension('_dice6_cy', ['dice6.pyx'],)],
  cmdclass={'build_ext': build_ext},
)

Running

Terminal> python setup.py build_ext --inplace
generates the C code and creates a (shared library) file _dice6_cy.so (known as a C extension module) which can be loaded into Python as a module with name _dice6_cy:
from _dice6_cy import dice6_cy1
import time
t0 = time.clock()
p = dice6_cy1(N, ndice, nsix)
t1 = time.clock()
print t1 - t0
We refer to this implementation as Cython random.randint. Although most of the statements in the dice6_cy1 function are turned into plain and fast C code, the speed is not much improved compared with the original scalar Python code.

To investigate what takes time in this Cython implementation, we can perform a profiling. The template for profiling a Python function whose call syntax is stored in some string statement, reads

import cProfile, pstats
cProfile.runctx(statement, globals(), locals(), 'tmp_profile.dat')
s = pstats.Stats('tmp_profile.dat')
s.strip_dirs().sort_stats('time').print_stats(30)
Data from the profiling are here stored in the file tmp_profile.dat. Our interest now is the dice6_cy1 function so we set
statement = 'dice6_cy1(N, ndice, nsix)'
In addition, a Cython file in which there are functions we want to profile must start with the line
# cython: profile=True
to turn on profiling when creating the extension module. The profiling output from the present example looks like
       5400004 function calls in 7.525 CPU seconds

 Ordered by: internal time

 ncalls  tottime  percall  cumtime  percall filename:lineno(function)
1800000    4.511    0.000    4.863    0.000 random.py:160(randrange)
1800000    1.525    0.000    6.388    0.000 random.py:224(randint)
      1    1.137    1.137    7.525    7.525 dice6.pyx:6(dice6_cy1)
1800000    0.352    0.000    0.352    0.000 {method 'random' ...
      1    0.000    0.000    7.525    7.525 {dice6_cy.dice6_cy1}
We easily see that it is the call to random.randint that consumes almost all the time. The reason is that the generated C code must call a Python module (random), which implies a lot of overhead. The C code should only call plain C functions, or if Python functions must be called, they should involve so much computations that the overhead in calling Python from C is negligible.

Instead of profiling the code to uncover inefficient constructs we can generate a visual representation of how the Python code is translated to C. Running

Terminal> cython -a dice6.pyx
creates a file dice6.html which can be loaded into a web browser to inspect what Cython has done with the Python code.





White lines indicate that the Python code is translated into C code, while the yellow lines indicate that the generated C code must make calls back to Python (using the Python C API, which implies overhead). Here, the random.randint call is in yellow, so this call is not translated to efficient C code.

A better Cython implementation

To speed up the previous Cython code, we have to get rid of the random.randint call every time we need a random variable. Either we must call some C function for generating a random variable or we must create a bunch of random numbers simultaneously as we did in the vectorized functions shown above. We first try the latter well-known strategy and apply the numpy.random module to generate all the random numbers we need at once:

import  numpy as np
cimport numpy as np

@cython.boundscheck(False)  # turn off array bounds check
@cython.wraparound(False)   # turn off negative indices ([-1,-1])
def dice6_cy2(int N, int ndice, int nsix):
    # Use numpy to generate all random numbers
    ...
    cdef np.ndarray[np.int_t, ndim=2, mode='c'] eyes = \ 
         np.random.random_integers(1, 6, (N, ndice))
This code needs some explanation. The cimport statement imports a special version of numpy for Cython and is needed after the standard numpy import. The declaration of the array of random numbers could just go as
cdef np.ndarray eyes = np.random.random_integers(1, 6, (N, ndice))
However, the processing of the eyes array will then be slow because Cython does not have enough information about the array. To generate optimal C code, we must provide information on the element types in the array, the number of dimensions of the array, that the array is stored in contiguous memory, that we do not want the overhead of checking whether indices are within their bounds or not, and that we do not need negative indices (which slows down array indexing). The latter two properties are taken care of by the @cython.boundscheck(False) and the @cython.wraparound(False) statements (decorators) right before the function, respectively, while the rest of the information is specified within square brackets in the cdef np.ndarray declaration. Inside the brackets, np.int_t denotes integer array elements (np.int is the usual data type object, but np.int_t is a Cython precompiled version of this object), ndim=2 tells that the array has two dimensions (indices), and mode='c' indicates contiguous storage of the array. With all this extra information, Cython can generate C code that works with numpy arrays as efficiently as native C arrays.

The rest of the code is a plain copy of the dice6_py function, but with the random.randint call replaced by an array look-up eyes[i,j] to retrieve the next random number. The two loops will now be as efficient as if they were coded directly in pure C.

The complete code for the efficient version of the dice6_cy1 function looks as follows:

import  numpy as np
cimport numpy as np
import cython
@cython.boundscheck(False)  # turn off array bounds check
@cython.wraparound(False)   # turn off negative indices ([-1,-1])
def dice6_cy2(int N, int ndice, int nsix):
    # Use numpy to generate all random numbers
    cdef int M = 0            # no of successful events
    cdef int six, r
    cdef double p
    cdef np.ndarray[np.int_t, ndim=2, mode='c'] eyes = \ 
         np.random.random_integers(1, 6, (N, ndice))
    for i in range(N):
        six = 0               # how many dice with six eyes?
        for j in range(ndice):
            r = eyes[i,j]     # roll die no. j
            if r == 6:
               six += 1
        if six >= nsix:       # successful event?
            M += 1
    p = float(M)/N
    return p
This Cython implementation is named Cython numpy.random.

The disadvantage with the dice6_cy2 function is that large simulations (large N) also require large amounts of memory, which usually limits the possibility for high accuracy much more than the CPU time. It would be advantageous to have a fast random number generator a la random.randint in C. The C library stdlib has a generator of random integers, rand(), generating numbers from 0 to up RAND_MAX. Both the rand function and the RAND_MAX integer are easy to access in a Cython program:

from libc.stdlib cimport rand, RAND_MAX

r = 1 + int(6.0*rand()/RAND_MAX) # random integer 1,...,6
Note that rand() returns an integer so we must avoid integer division by ensuring that the denominator is a real number. We also need to explicitly convert the resulting real fraction to int since r is declared as int.

With this way of generating random numbers we can create a version of dice6_cy1 that is as fast as dice6_cy, but avoids all the memory demands and the somewhat complicated array declarations of the latter:

from libc.stdlib cimport rand, RAND_MAX
def dice6_cy3(int N, int ndice, int nsix):
    cdef int M = 0            # no of successful events
    cdef int six, r
    cdef double p
    for i in range(N):
        six = 0               # how many dice with six eyes?
        for j in range(ndice):
            # Roll die no. j
            r = 1 + int(6.0*rand()/RAND_MAX)
            if r == 6:
               six += 1
        if six >= nsix:       # successful event?
            M += 1
    p = float(M)/N
    return p
This final Cython implementation will be referred to as Cython stdlib.rand.

Migrating code to C

Writing a C program

A natural next improvement would be to program the Monte Carlo simulation loops directly in a compiled programming language, which guarantees optimal speed. Here we choose the C programming language for this purpose. The C version of our dice6 function and an associated main program take the form

#include <stdio.h>
#include <stdlib.h>

double dice6(int N, int ndice, int nsix)
{
  int M = 0;
  int six, r, i, j;
  double p;

  for (i = 0; i < N; i++) {
    six = 0;
    for (j = 0; j < ndice; j++) {
      r = 1 + rand()/(RAND_MAX*6.0); /* roll die no. j */
      if (r == 6)
	six += 1;
    }
    if (six >= nsix)
      M += 1;
  }
  p = ((double) M)/N;
  return p;
}

int main(int nargs, const char* argv[])
{
  int N = atoi(argv[1]);
  int ndice = 6;
  int nsix = 3;
  double p = dice6(N, ndice, nsix);
  printf("C code: N=%d, p=%.6f\n", N, p);
  return 0;
}

This code is placed in a file dice6_c.c. The file can typically be compiled and run by

Terminal> gcc -O3 -o dice6.capp dice6_c.c
Terminal> ./dice6.capp 1000000
This solution is later referred to as C program.

Migrating loops to C code via F2PY

Instead of programming the whole application in C, we may consider migrating the loops to the C function dice6 shown above and then have the rest of the program (essentially the calling main program) in Python. This is a convenient solution if we were to do many other, less CPU time critical things for convenience in Python.

There are many alternative techniques for calling C functions from Python. Here we shall explain two. The first applies the program f2py to generate the necessary code that glues Python and C. The f2py program was actually made for gluing Python and Fortran, but it can work with C too. We need a specification of the C function to call in terms of a Fortran 90 module. Such a module can be written by hand, but f2py can also generate it. To this end, we make a Fortran file dice6_c_signature.f with the signature of the C function written in Fortran 77 syntax with some annotations:

      real*8 function dice6(n, ndice, nsix)
Cf2py intent(c) dice6
      integer n, ndice, nsix
Cf2py intent(c) n, ndice, nsix
      return
      end

The annotations intent(c) are necessary to tell f2py that the Fortran variables are to be treated as plain C variables and not as pointers (which is the default interpretation of variables in Fortran). The C2fpy are special comment lines that f2py recognizes, and these lines are used to provide extra information to f2py which have no meaning in plain Fortran 77.

We must run f2py to generate a .pyf file with a Fortran 90 module specification of the C function to call:

Terminal> f2py -m _dice6_c1 -h dice6_c.pyf \ 
          dice6_c_signature.f
Here _dice6_c1 is the name of the module with the C function that is to be imported in Python, and dice6_c.pyf is the name of the Fortran 90 module file to be generated. Programmers who know Fortran 90 may want to write the dice6_c.pyf file by hand.

The next step is to use the information in dice6_c.pyf to generate a (C extension) module _dice6_c1. Fortunately, f2py generates the necessary code, and compiles and links the relevant files, to form a shared library file _dice6_c1.so, by a short command:

Terminal> f2py -c dice6_c.pyf dice6_c.c
We can now test the module:
>>> import _dice6_c1
>>> print dir(_dice6_c1)  # module contents
['__doc__', '__file__', '__name__', '__package__',
 '__version__', 'dice6']
>>> print _dice6_c1.dice6.__doc__
dice6 - Function signature:
  dice6 = dice6(n,ndice,nsix)
Required arguments:
  n : input int
  ndice : input int
  nsix : input int
Return objects:
  dice6 : float
>>> _dice6_c1.dice6(N=1000, ndice=4, nsix=2)
0.145
The method of calling the C function dice6 via an f2py generated module is referred to as C via f2py.

Migrating loops to C code via Cython

The Cython tool can also be used to call C code, not only generating C code from the Cython language. Our C code is in the file dice6_c.c, but for Cython to see this code we need to create a header file dice6_c.h listing the definition of the function(s) we want to call from Python. The header file takes the form

extern double dice6(int N, int ndice, int nsix);

The next step is to make a .pyx file with a definition of the C function from the header file and a Python function that calls the C function:

cdef extern from "dice6_c.h":
    double dice6(int N, int ndice, int nsix)

def dice6_cwrap(int N, int ndice, int nsix):
    return dice6(N, ndice, nsix)

Cython must use this file, named dice6_cwrap.pyx, to generate C code, which is to be compiled and linked with the dice6_c.c code. All this is accomplished in a setup.py script:

from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext

sources = ['dice6_cwrap.pyx', 'dice6_c.c']

setup(
  name='Monte Carlo simulation',
  ext_modules=[Extension('_dice6_c2', sources)],
  cmdclass={'build_ext': build_ext},
)
This setup.py script is run as
Terminal> python setup.py build_ext --inplace
resulting in a shared library file _dice6_c2.so, which can be loaded into Python as a module:
>>> import _dice6_c2
>>> print dir(_dice6_c2)
['__builtins__', '__doc__', '__file__', '__name__',
 '__package__', '__test__', 'dice6_cwrap']
We see that the module contains the function dice6_cwrap, which was made to call the underlying C function dice6.

Comparing efficiency

All the files corresponding to the various techniques described above are available as a tarball or accessible from through the GitHub project MC_cython. A file make.sh performs all the compilations, while compare.py runs all methods and prints out the CPU time required by each method, normalized by the fastest approach. The results for \( N=450,000 \) are listed below (MacBook Air running Ubuntu in a VMWare Fusion virtual machine).

Method Timing
C program 1.0
Cython stdlib.rand 1.2
Cython numpy.random 1.2
C via f2py 1.2
C via Cython 1.2
vectorized Python, version 1 1.9
Cython random.randint 33.6
plain Python 37.7
vectorized Python, version 2 105.0

The CPU time of the plain Python version was 10 s, which is reasonably fast for obtaining a fairly accurate result in this problem. The lesson learned is therefore that a Monte Carlo simulation can be implemented in plain Python first. If more speed is needed, one can just add type information and create a Cython code. Studying the HTML file with what Cython manages to translate to C may give hints about how successful the Cython code is and point to optimizations, like avoiding the call to random.randint in the present case. Optimal Cython code runs here at approximately the same speed as calling a handwritten C function with the time-consuming loops. It is to be noticed that the stand-alone C program here ran faster than calling C from Python, probably because the amount of calculations is not large enough to make the overhead of calling C negligible.

Vectorized Python do give a great speed-up compared to plain loops in Python, if done correctly, but the efficiency is not on par with Cython or handwritten C. Even more important is the fact that vectorized code is not at all as readable as the algorithm expressed in plain Python, Cython, or C. Cython therefore provides a very attractive combination of readability, ease of programming, and high speed.

References

This chapter is taken from the book A Primer on Scientific Programming with Python by H. P. Langtangen, 4th edition, Springer, 2014.