$$ \newcommand{\tp}{\thinspace .} \newcommand{\Prob}[1]{\hbox{P}(#1)} $$

This chapter is taken from the book A Primer on Scientific Programming with Python by H. P. Langtangen, 5th edition, Springer, 2016.

Computing probabilities

With the mathematical rules from probability theory one may compute the probability that a certain event happens, say the probability that you get one black ball when drawing three balls from a hat with four black balls, six white balls, and three green balls. Unfortunately, theoretical calculations of probabilities may quickly become hard or impossible if the problem is changed slightly. There is a simple numerical way of computing probabilities that is generally applicable to problems with uncertainty. The principal ideas of this approximate technique are explained below, followed by three examples of increasing complexity.

Principles of Monte Carlo simulation

Assume that we perform \( N \) experiments where the outcome of each experiment is random. Suppose that some event takes place \( M \) times in these \( N \) experiments. An estimate of the probability of the event is then \( M/N \). The estimate becomes more accurate as \( N \) is increased, and the exact probability is assumed to be reached in the limit as \( N\rightarrow\infty \). (Note that in this limit, \( M\rightarrow\infty \) too, so for rare events, where \( M \) may be small in a program, one must increase \( N \) such that \( M \) is sufficiently large for \( M/N \) to become a good approximation to the probability.)
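As a minimal illustration of the \( M/N \) estimate (a sketch, not taken from the book's example files), the following program estimates the probability of getting a six when throwing one die, where the exact answer is \( 1/6\approx 0.1667 \):

import random
N = 100000                          # no of experiments
M = 0                               # no of successful events
for i in range(N):
    if random.randint(1, 6) == 6:   # event: getting a six
        M += 1
print 'estimate: %.4f, exact: %.4f' % (float(M)/N, 1/6.0)

Increasing N brings the estimate closer to the exact value, in line with the reasoning above.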

Programs that run a large number of experiments and record the outcome of events are often called simulation programs. (Note that this term is also applied to programs that solve equations arising in mathematical models in general, but it is even more common to use the term when random numbers are used to estimate probabilities.) The mathematical technique of letting the computer perform lots of experiments based on drawing random numbers is commonly called Monte Carlo simulation. This technique has proven to be extremely useful throughout science and industry in problems where uncertain or random behavior is involved.

As far as the laws of mathematics refer to reality, they are not certain; and as far as they are certain, they do not refer to reality. Albert Einstein, physicist, 1879-1955.
For example, in finance the stock market has a random variation that must be taken into account when trying to optimize investments. In offshore engineering, environmental loads from wind, currents, and waves show random behavior. In nuclear and particle physics, random behavior is fundamental according to quantum mechanics and statistical physics. Many probabilistic problems can be calculated exactly by mathematics from probability theory, but very often Monte Carlo simulation is the only way to solve statistical problems. The sections Example: Throwing dice through Example: Policies for limiting population growth use examples to explain the essence of Monte Carlo simulation in problems with inherent uncertainty. However, Monte Carlo simulation can also be used to compute deterministic quantities, such as integrals of functions (see the section Monte Carlo integration).

It appears that Monte Carlo simulation programmed in pure Python is a computationally feasible approach, even on small laptops, in all the forthcoming examples. Significant speed-up can be achieved by vectorizing the code, which is explained in detail for many of the examples. However, large-scale Monte Carlo simulations and other heavy computations run slowly in pure Python, and the core of the computations should be moved to a compiled language such as C. In the document Migrating Python to compiled code [1], you can find a Monte Carlo application that is implemented in pure Python, in vectorized numpy Python, in the extended (and very closely related) Cython language, as well as in pure C code. Various ways of combining Python with C are also illustrated.

Example: Throwing dice

You throw two dice, one black and one green. What is the probability that the number of eyes on the black die is larger than the number of eyes on the green die?

Straightforward solution

We can simulate \( N \) throws of two dice in a program. For each throw we see if the event is successful, and if so, increase \( M \) by one:

import sys
N = int(sys.argv[1])               # no of experiments

import random
M = 0                              # no of successful events
for i in range(N):
    black = random.randint(1, 6)   # throw black
    green = random.randint(1, 6)   # throw green
    if black > green:              # success?
        M += 1
p = float(M)/N
print 'probability:', p

This program is named black_gt_green.py.

Vectorization

Although the black_gt_green.py program runs \( N=10^6 \) experiments in a few seconds, Monte Carlo simulation programs can quickly require substantial simulation time, so speeding up the algorithm by vectorization is often desired. Let us vectorize the code shown above. The idea is to draw all the \( 2N \) random numbers at once. We make an array of random numbers between 1 and 6 with 2 rows and \( N \) columns. The first row can be taken as the number of eyes on the black die in all the experiments, while the second row holds the corresponding numbers of eyes on the green die:

import numpy as np
r = np.random.random_integers(1, 6, size=(2, N))
black = r[0,:]            # eyes for all throws with black
green = r[1,:]            # eyes for all throws with green

The condition black > green results in an array of length \( N \) with boolean values: True when the element in black is greater than the corresponding element in green, and False if not. The number of True elements in the boolean array black > green is then \( M \). This number can be computed by summing up all the boolean values: in arithmetic operations, True is 1 and False is 0, so the sum equals \( M \). Fast summation of arrays requires np.sum rather than Python's standard sum function. The code goes as follows:

success = black > green   # success[i] is true if black[i]>green[i]
M = np.sum(success)       # sum up all successes
p = float(M)/N
print 'probability:', p

The code, found in the file black_gt_green_vec.py, runs over 10 times faster than the corresponding scalar code in black_gt_green.py.

Exact solution

In this simple example we can quite easily compute the exact solution. To this end, we set up all the outcomes of the experiment, i.e., all the possible combinations of eyes on two dice:

combinations = [(black, green)
                for black in range(1, 7)
                for green in range(1, 7)]

Then we count how many of the (black, green) pairs have the property black > green:

success = [black > green for black, green in combinations]
M = sum(success)

It turns out that M is 15, giving a probability \( 15/36 \approx 0.41667 \) since there are 36 combinations in total. (Of the 36 combinations, 6 are ties; by symmetry, black wins in half of the remaining 30, i.e., in 15 cases.) Running the Monte Carlo simulations with \( N=10^6 \) typically gives probabilities in \( [0.416, 0.417] \).

A game

Suppose a game is constructed such that you have to pay 1 euro to throw the two dice. You win 2 euros if there are more eyes on the black than on the green die. Should you play this game? We can easily simulate the game directly (file black_gt_green_game.py):

import sys
N = int(sys.argv[1])               # no of experiments

import random
start_capital = 10
money = start_capital
for i in range(N):
    money -= 1                     # pay for the game
    black = random.randint(1, 6)   # throw black
    green = random.randint(1, 6)   # throw green
    if black > green:              # success?
        money += 2                 # get award

net_profit_total = money - start_capital
net_profit_per_game = net_profit_total/float(N)
print 'Net profit per game in the long run:', net_profit_per_game

Experimenting with a few values of \( N \) shows that the net profit per game is always negative. That is, you should not play this game.

A vectorized version is beneficial for efficiency reasons (the corresponding file is black_gt_green_game_vec.py):

import sys
N = int(sys.argv[1])      # no of experiments

import numpy as np
r = np.random.random_integers(1, 6, size=(2, N))

money = 10 - N            # capital after N throws
black = r[0,:]            # eyes for all throws with black
green = r[1,:]            # eyes for all throws with green
success = black > green   # success[i] is true if black[i]>green[i]
M = np.sum(success)       # sum up all successes
money += 2*M              # add all awards for winning
print 'Net profit per game in the long run:', (money-10)/float(N)

Decide if a game is fair

Suppose the cost of playing a game once is \( q \) and that the award for winning is \( r \). The net income in a winning game is \( r-q \). Winning \( M \) out of \( N \) games means that the cost is \( Nq \) and the income is \( Mr \), making a net profit \( s=Mr-Nq \). Since \( p=M/N \) is (in the long run) the probability of winning the game, we have \( s=(pr-q)N \). A fair game means that we neither win nor lose in the long run: \( s=0 \), which implies \( r=q/p \). That is, given the cost \( q \) and the probability \( p \) of winning, the award for winning must be \( r=q/p \) in a fair game.

When somebody comes up with a game, you can use Monte Carlo simulation to estimate \( p \) and then conclude that you should not play the game if \( r < q/p \). The example above has \( p=15/36 \) (exact) and \( q=1 \), so \( r=2.4 \) makes a fair game.

The reasoning above is based on common sense and an intuitive interpretation of probability. More precise reasoning from probability theory introduces the game as an experiment with two outcomes: either you win with probability \( p \) or you lose with probability \( 1-p \). The expected payment is then the sum of the probabilities times the corresponding net income for each event: \( -q(1-p) + (r-q)p \) (recall that the net income in a winning game is \( r-q \)). A fair game has zero expected payment, which leads to \( r = q/p \).
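For the dice game above, these formulas are easy to evaluate directly (a small check, not taken from the book's files):

p = 15/36.0   # exact probability of winning
q = 1.0       # cost of playing one game
r = 2.0       # award for winning in the game above
print 'expected payment per game:', -q*(1 - p) + (r - q)*p   # -0.1667
print 'fair award:', q/p                                     # 2.4

The expected payment of about \( -0.17 \) euros per game is consistent with the negative net profit observed in the simulations.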

Example: Drawing balls from a hat

Suppose there are 12 balls in a hat: four black, four red, and four blue. We want to make a program that draws three balls at random from the hat. It is natural to represent the collection of balls as a list. Each list element can be an integer 1, 2, or 3, since we have three different types of balls, but it would be easier to work with the program if the balls could have a color instead of an integer number. This is easily accomplished by defining color names:

colors = 'black', 'red', 'blue'   # (tuple of strings)
hat = []
for color in colors:
    for i in range(4):
        hat.append(color)

Drawing a ball at random is performed by

import random
color = random.choice(hat)
print color

Drawing \( n \) balls without replacing the drawn balls requires us to remove an element from the hat when it is drawn. There are three ways to implement the procedure: (i) we perform a hat.remove(color), (ii) we draw a random index with randint from the set of legal indices in the hat list, and then we do a del hat[index] to remove the element, or (iii) we can compress the code in (ii) to hat.pop(index).

def draw_ball(hat):
    color = random.choice(hat)
    hat.remove(color)
    return color, hat

def draw_ball(hat):
    index = random.randint(0, len(hat)-1)
    color = hat[index]
    del hat[index]
    return color, hat

def draw_ball(hat):
    index = random.randint(0, len(hat)-1)
    color = hat.pop(index)
    return color, hat

# Draw n balls from the hat
n = 3       # example: draw three balls
balls = []
for i in range(n):
    color, hat = draw_ball(hat)
    balls.append(color)
print 'Got the balls', balls

We can extend the experiment above and ask the question: what is the probability of drawing two or more black balls from a hat with 12 balls, four black, four red, and four blue? To this end, we perform \( N \) experiments, count how many times \( M \) we get two or more black balls, and estimate the probability as \( M/N \). Each experiment consists of making the hat list, drawing a number of balls, and counting how many black balls we got. The latter task is easy with the count method in list objects: hat.count('black') counts how many elements with value 'black' we have in the list hat. A complete program for this task is listed below. The program appears in the file balls_in_hat.py.

import random

def draw_ball(hat):
    """Draw a ball using list index."""
    index = random.randint(0, len(hat)-1)
    color = hat.pop(index)
    return color, hat

def draw_ball(hat):
    """Draw a ball using list index."""
    index = random.randint(0, len(hat)-1)
    color = hat[index]
    del hat[index]
    return color, hat

def draw_ball(hat):
    """Draw a ball using list element."""
    color = random.choice(hat)
    hat.remove(color)
    return color, hat

def new_hat():
    colors = 'black', 'red', 'blue'   # (tuple of strings)
    hat = []
    for color in colors:
        for i in range(4):
            hat.append(color)
    return hat

n = int(raw_input('How many balls are to be drawn? '))
N = int(raw_input('How many experiments? '))

# Run experiments
M = 0  # no of successes
for e in range(N):
    hat = new_hat()
    balls = []           # the n balls we draw
    for i in range(n):
        color, hat = draw_ball(hat)
        balls.append(color)
    if balls.count('black') >= 2:  # at least two black balls?
        M += 1
print 'Probability:', float(M)/N

Running the program with \( n=5 \) (drawing 5 balls each time) and \( N=4000 \) gives a probability of 0.57. Drawing only 2 balls at a time reduces the probability to about 0.09.

One can with the aid of probability theory derive theoretical expressions for such probabilities, but it is much simpler to let the computer perform a large number of experiments to estimate an approximate probability.
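For this particular problem, however, brute-force enumeration with itertools.combinations gives the exact answer with little effort. The sketch below assumes the new_hat function from the program above; since combinations treats the list positions as distinct, all draws are equally likely:

from itertools import combinations

def exact_probability(n, hat):
    """Exact probability of at least two black balls among n drawn."""
    M = 0       # draws with at least two black balls
    total = 0   # total number of possible draws
    for draw in combinations(hat, n):
        total += 1
        if draw.count('black') >= 2:
            M += 1
    return float(M)/total

hat = new_hat()
print 'n=2:', exact_probability(2, hat)   # 6/66    ~ 0.0909
print 'n=5:', exact_probability(5, hat)   # 456/792 ~ 0.5758

These exact values agree well with the Monte Carlo estimates 0.09 and 0.57 reported above.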

A class version of the code in this section is better than the code presented, because we avoid shuffling the hat variable in and out of functions. Exercise 21: Make a class for drawing balls from a hat asks you to design and implement a class Hat.

Random mutations of genes

A simple mutation model

Mutation of genes is easily modeled by replacing the letter in a randomly chosen position of the DNA by a randomly chosen letter from the alphabet A, C, G, and T. Python's random module can be used to generate random numbers. Selecting a random position means generating a random index in the DNA string, and the function random.randint(a, b) generates random integers between a and b (both included). Generating a random letter is most easily done by having a list of the actual letters and using random.choice(list) to pick an arbitrary element from list. A function for replacing the letter in a randomly selected position (index) by a random letter among A, C, G, and T is most straightforwardly implemented by converting the DNA string to a list of letters, since changing a character in a Python string is impossible without constructing a new string. However, an element in a list can be changed in-place:

import random

def mutate_v1(dna):
    dna_list = list(dna)
    mutation_site = random.randint(0, len(dna_list) - 1)
    dna_list[mutation_site] = random.choice(list('ATCG'))
    return ''.join(dna_list)

Using get_base_frequencies_v2 and format_frequencies from the document Files, strings, and dictionaries [2], we can easily mutate a gene a number of times and see how the frequencies of the bases A, C, G, and T change:

dna = 'ACGGAGATTTCGGTATGCAT'
print 'Starting DNA:', dna
print format_frequencies(get_base_frequencies_v2(dna))

nmutations = 10000
for i in range(nmutations):
    dna = mutate_v1(dna)

print 'DNA after %d mutations:' % nmutations, dna
print format_frequencies(get_base_frequencies_v2(dna))

Here is the output from a run:

Starting DNA: ACGGAGATTTCGGTATGCAT
A: 0.25, C: 0.15, T: 0.30, G: 0.30
DNA after 10000 mutations: AACCAATCCGACGAGGAGTG
A: 0.35, C: 0.25, T: 0.10, G: 0.30

Vectorized version

The efficiency of the mutate_v1 function with its surrounding loop can be increased significantly by performing all the mutations at once using numpy arrays. This speed-up is of interest for long dna strings and many mutations. The idea is to draw all the mutation sites at once, and also all the new bases at these sites at once. The np.random module provides functions for drawing several random numbers at a time, but only integers and real numbers can be drawn, not characters from the alphabet A, C, G, and T. We therefore have to represent these four characters by the numbers (say) 0, 1, 2, and 3. Afterwards we can translate the integers to letters by some clever vectorized indexing.

Drawing N mutation sites is a matter of drawing N random integers among the legal indices:

import numpy as np
mutation_sites = np.random.random_integers(0, len(dna)-1, size=N)

Drawing N bases, represented as the integers 0-3, is similarly done by

new_bases_i = np.random.random_integers(0, 3, N)

Converting say the integers 1 to the base symbol C is done by picking out the indices (in a boolean array) where new_bases_i equals 1, and inserting the character 'C' in a companion array of characters:

new_bases_c = np.zeros(N, dtype='c')
indices = new_bases_i == 1
new_bases_c[indices] = 'C'

We must do this integer-to-letter conversion for all four integers/letters. Thereafter, new_bases_c must be inserted in dna for all the indices corresponding to the randomly drawn mutation sites,

dna[mutation_sites] = new_bases_c

The final step is to convert the numpy array of characters dna back to a standard string by first converting dna to a list and then joining the list elements: ''.join(dna.tolist()).

The complete vectorized function can now be expressed as follows:

import numpy as np
# Use integers in random numpy arrays and map these
# to characters according to
i2c = {0: 'A', 1: 'C', 2: 'G', 3: 'T'}

def mutate_v2(dna, N):
    dna = np.array(dna, dtype='c')  # array of characters
    mutation_sites = np.random.random_integers(
        0, len(dna) - 1, size=N)
    # Must draw bases as integers
    new_bases_i = np.random.random_integers(0, 3, size=N)
    # Translate integers to characters
    new_bases_c = np.zeros(N, dtype='c')
    for i in i2c:
        new_bases_c[new_bases_i == i] = i2c[i]
    dna[mutation_sites] = new_bases_c
    return ''.join(dna.tolist())

It is of interest to time mutate_v2 versus mutate_v1. For this purpose we need a long test string. A straightforward generation of random letters is

def generate_string_v1(N, alphabet='ACGT'):
    return ''.join([random.choice(alphabet) for i in xrange(N)])

A vectorized version of this function can also be made, using the ideas explored above for the mutate_v2 function:

def generate_string_v2(N, alphabet='ACGT'):
    # Draw random integers 0,1,2,3 to represent bases
    dna_i = np.random.random_integers(0, 3, N)
    # Translate integers to characters
    dna = np.zeros(N, dtype='c')
    for i in i2c:
        dna[dna_i == i] = i2c[i]
    return ''.join(dna.tolist())

The time_mutate function in the file mutate.py performs timing of the generation of test strings and the mutations. To generate a DNA string of length 100,000 the vectorized function is about 8 times faster. When performing 10,000 mutations on this string, the vectorized version is almost 3000 times faster! These numbers stay approximately the same for larger strings and more mutations as well. Hence, this case study on vectorization is a striking example of how a straightforward and convenient function like mutate_v1 may be far too slow for large-scale computations.
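A rough way to reproduce such timings yourself, without the infrastructure in mutate.py, is to wrap the calls in time measurements (the exact numbers will of course depend on the hardware):

import time

dna = generate_string_v2(100000)
nmutations = 10000

t0 = time.time()
dna1 = dna
for i in range(nmutations):
    dna1 = mutate_v1(dna1)
t1 = time.time()
dna2 = mutate_v2(dna, nmutations)
t2 = time.time()
print 'mutate_v1: %.2f s, mutate_v2: %.4f s' % (t1 - t0, t2 - t1)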

A Markov chain mutation model

The observed rate at which mutations occur at a given position in the genome is not independent of the type of nucleotide (base) at that position, as was assumed in the previous simple mutation model. We should therefore take into account that the rate of transition depends on the base.

There are a number of reasons why the observed mutation rates vary between different nucleotides. One reason is that there are different mechanisms generating transitions from one base to another. Another reason is that there are extensive repair processes in living cells, and the efficiency of these repair mechanisms varies for different nucleotides.

Mutation of nucleotides may be modeled using distinct probabilities for the transitions from each nucleotide to every other nucleotide. For example, the probability of replacing A by C may be prescribed as (say) 0.2. In total we need \( 4\times 4 \) probabilities since each nucleotide can transform into itself (no change) or into three others. The four transition probabilities for a given nucleotide must sum to one. Such statistical evolution, based on probabilities for transitioning from one state to another, is known as a Markov process or Markov chain.

First we need to set up the probability matrix, i.e., the \( 4\times4 \) table of probabilities where each row corresponds to the transitions of A, C, G, or T into A, C, G, or T. Say the probability of a transition from A to A is 0.2, from A to C is 0.1, from A to G is 0.3, and from A to T is 0.4.

Rather than just prescribing some arbitrary transition probabilities for test purposes, we can use random numbers for these probabilities. To this end, we generate three random numbers to divide the interval \( [0,1] \) into four intervals corresponding to the four possible transitions. The lengths of the intervals give the transition probabilities, and their sum is ensured to be 1. The interval limits, 0, 1, and three random numbers must be sorted in ascending order to form the intervals. We use the function random.random() to generate random numbers in \( [0,1) \):

slice_points = sorted(
    [0] + [random.random() for i in range(3)] + [1])
transition_probabilities = [slice_points[i+1] - slice_points[i]
                            for i in range(4)]

The transition probabilities are handy to have available as a dictionary:

markov_chain['A'] = {'A': ..., 'C': ..., 'G': ..., 'T': ...}

which can be computed by

markov_chain['A'] = {base: p for base, p in
                     zip('ACGT', transition_probabilities)}

To select a transition, we need to draw a random letter (A, C, G, or T) according to the probabilities markov_chain[b], where b is the base at the current position. Actually, this is a very common operation, namely drawing a random value from a discrete probability distribution (markov_chain[b]). The natural approach is therefore to write a general function for drawing from any discrete probability distribution given as a dictionary:

def draw(discrete_probdist):
    """
    Draw random value from discrete probability distribution
    represented as a dict: P(x=value) = discrete_probdist[value].
    """
    # Method:
    # http://en.wikipedia.org/wiki/Pseudo-random_number_sampling
    limit = 0
    r = random.random()
    for value in discrete_probdist:
        limit += discrete_probdist[value]
        if r < limit:
            return value

Basically, the algorithm divides \( [0,1] \) into intervals of lengths equal to the probabilities of the various outcomes and checks which interval is hit by a random variable in \( [0,1] \). The corresponding value is the random choice.
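A small illustration of the function at work, with a made-up distribution:

probdist = {'A': 0.5, 'C': 0.3, 'G': 0.1, 'T': 0.1}
print [draw(probdist) for i in range(10)]   # 'A' appears most often on average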

A complete function creating all the transition probabilities and storing them in a dictionary of dictionaries takes the form

def create_markov_chain():
    markov_chain = {}
    for from_base in 'ATGC':
        # Generate random transition probabilities by dividing
        # [0,1] into four intervals of random length
        slice_points = sorted(
            [0] + [random.random() for i in range(3)] + [1])
        transition_probabilities = \
            [slice_points[i+1] - slice_points[i] for i in range(4)]
        markov_chain[from_base] = {base: p for base, p
                                   in zip('ATGC', transition_probabilities)}
    return markov_chain

mc = create_markov_chain()
print mc
print mc['A']['T'] # probability of transition from A to T

It is natural to develop a function for checking that the generated probabilities are consistent. The transition from a particular base into one of the four bases happens with probability 1, which means that the probabilities in a row must sum up to 1:

def check_transition_probabilities(markov_chain):
    for from_base in 'ATGC':
        s = sum(markov_chain[from_base][to_base]
                for to_base in 'ATGC')
        if abs(s - 1) > 1E-15:
            raise ValueError('Wrong sum: %s for "%s"' %
                             (s, from_base))

Another test is to check that draw actually draws random values in accordance with the underlying probabilities. To this end, we draw a large number of values, N, count the frequencies of the various values, divide by N and compare the empirical normalized frequencies with the probabilities:

def check_draw_approx(discrete_probdist, N=1000000):
    """
    See if draw results in frequencies approx equal to
    the probability distribution.
    """
    frequencies = {value: 0 for value in discrete_probdist}
    for i in range(N):
        value = draw(discrete_probdist)
        frequencies[value] += 1
    for value in frequencies:
        frequencies[value] /= float(N)
    print ', '.join(['%s: %.4f (exact %.4f)' %
                     (v, frequencies[v], discrete_probdist[v])
                     for v in frequencies])

This test is only approximate, but it does bring evidence of the correctness of the implementation of the draw function.

A vectorized version of draw can also be made. We refer to the source code file mutate.py for details (the function is relatively complicated).
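One possible strategy, sketched here under the simplifying assumption that all N draws come from the same distribution (the function in mutate.py may be organized differently), is to accumulate the probabilities with np.cumsum and locate N uniform random numbers in the resulting intervals with np.searchsorted:

import numpy as np

def draw_vec(discrete_probdist, N):
    """Draw N values from discrete_probdist (a sketch)."""
    values = list(discrete_probdist.keys())
    cum_prob = np.cumsum([discrete_probdist[v] for v in values])
    r = np.random.random(N)                          # uniform in [0,1)
    indices = np.searchsorted(cum_prob, r, side='right')
    indices = np.minimum(indices, len(values) - 1)   # guard against round-off
    return [values[i] for i in indices]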

Now we have all the tools needed to run the Markov chain of transitions for a randomly selected position in a DNA sequence:

def mutate_via_markov_chain(dna, markov_chain):
    dna_list = list(dna)
    mutation_site = random.randint(0, len(dna_list) - 1)
    from_base = dna[mutation_site]
    to_base = draw(markov_chain[from_base])
    dna_list[mutation_site] = to_base
    return ''.join(dna_list)

Exercise 47: Speed up Markov chain mutation suggests some efficiency enhancements of simulating mutations via these functions.

Here is a simulation of mutations using the method based on Markov chains:

dna = 'TTACGGAGATTTCGGTATGCAT'
print 'Starting DNA:', dna
print format_frequencies(get_base_frequencies_v2(dna))

mc = create_markov_chain()
import pprint
print 'Transition probabilities:\n', pprint.pformat(mc)
nmutations = 10000
for i in range(nmutations):
    dna = mutate_via_markov_chain(dna, mc)

print 'DNA after %d mutations (Markov chain):' % nmutations, dna
print format_frequencies(get_base_frequencies_v2(dna))

The output will differ each time the program is run unless random.seed(i) is called at the beginning of the program for some integer i. This call makes the sequence of random numbers the same every time the program is run, which is very useful for debugging.
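For instance, placing

import random
random.seed(10)   # any fixed integer gives reproducible runs

at the top of the program makes every run produce the same result. An example of the output may look like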

Starting DNA: TTACGGAGATTTCGGTATGCAT
A: 0.23, C: 0.14, T: 0.36, G: 0.27
Transition probabilities:
{'A': {'A': 0.4288890546751146,
       'C': 0.4219086988655296,
       'G': 0.00668870644455688,
       'T': 0.14251354001479888},
 'C': {'A': 0.24999667668640035,
       'C': 0.04718309085408834,
       'G': 0.6250440975238185,
       'T': 0.0777761349356928},
 'G': {'A': 0.16022955651881965,
       'C': 0.34652746609882423,
       'G': 0.1328031742612512,
       'T': 0.3604398031211049},
 'T': {'A': 0.20609823213950174,
       'C': 0.17641112746655452,
       'G': 0.010267621176125452,
       'T': 0.6072230192178183}}
DNA after 10000 mutations (Markov chain): GGTTTAAGTCAGCTATGATTCT
A: 0.23, C: 0.14, T: 0.41, G: 0.23

Note that the mutated DNA should contain more nucleotides of the type into which the total probability of transitioning is largest. The total probability of transitioning into a particular base can be computed by a bit of probability algebra. Let \( X \) be the initial base at some position in the DNA and let \( Y \) be the new base after mutation at this position. The probability \( P(Y=b) \), where \( b \) is some base (A, C, G, or T), is built up of four mutually exclusive joint events: $$ P(Y=b) = P(X=A \cap Y=b) + P(X=C \cap Y=b) + P(X=G \cap Y=b) + P(X=T \cap Y=b)\tp$$ A joint event can be expressed by the (conditional) transition probabilities, e.g., $$ P(X=A \cap Y=b) = P(Y=b \,|\, X=A)\, P(X=A), $$ leading to $$ P(Y=b) = \sum_{i\in\{A,C,G,T\}} P(Y=b\,|\,X=i)P(X=i)\tp$$ The probabilities \( P(Y=b\,|\,X=i) \) correspond to a column in the transition probability matrix. If each of the initial events \( X=i \) is equally probable, \( P(X=i)=1/4 \), then \( P(Y=b) \) is the sum of the probabilities in the column corresponding to \( b \), divided by 4. We can now compute \( P(Y=b) \) for \( b \) as A, C, G, and T:

def transition_into_bases(markov_chain):
    return {to_base: sum(markov_chain[from_base][to_base]
                         for from_base in 'ATGC')/4.0
            for to_base in 'ATGC'}

print transition_into_bases(mc)

The \( P(Y=b) \) probabilities corresponding to the example run above read

{'A': 0.26, 'C': 0.25, 'T': 0.30, 'G': 0.19}

Transition into T (\( P(Y=T) \)) has the greatest probability (0.30), and this is confirmed by T having the greatest frequency (0.41) in the mutated DNA.

The various functions performing mutations are located in the file mutate.py.

Example: Policies for limiting population growth

China has for many years officially allowed only one child per couple. However, the success of the policy has been somewhat limited. One challenge is the current over-representation of males in the population (families have favored sons). An alternative policy is to allow each couple to continue having children until they get a son. We can simulate both policies and see how a population develops under the one child and the one son policies. Since we expect to work with a large population over several generations, we aim at vectorized code from the start.

Suppose we have a collection of n individuals, called parents, consisting of males and females randomly drawn such that a certain portion (male_portion) constitutes males. The parents array holds integer values, 1 for males and 2 for females. We can introduce the constants MALE=1 and FEMALE=2 to make the code easier to read. Our task is to see how the parents array develops from one generation to the next under the two policies. Let us first show how to draw the random integer array parents where there is a probability male_portion of getting the value MALE:

import numpy as np
r = np.random.random(n)
parents = np.zeros(n, int)
MALE = 1; FEMALE = 2
parents[r <  male_portion] = MALE
parents[r >= male_portion] = FEMALE

The number of potential couples is the minimum of males and females. However, only a fraction (fertility) of the couples will actually get a child. Under the perfect one child policy, these couples can have one child each:

males = len(parents[parents==MALE])
females = len(parents) - males
couples = min(males, females)
n = int(fertility*couples)  # couples that get a child

# The next generation, one child per couple
r = np.random.random(n)
children = np.zeros(n, int)
children[r <  male_portion] = MALE
children[r >= male_portion] = FEMALE

The code for generating a new population will be needed in every generation. Therefore, it is natural to collect the last statements in a separate function such that we can repeat the statements when needed.

def get_children(n, male_portion, fertility):
    n = int(fertility*n)
    r = np.random.random(n)
    children = np.zeros(n, int)
    children[r <  male_portion] = MALE
    children[r >= male_portion] = FEMALE
    return children

Under the one son policy, the families can continue getting a new child until they get the first son:

# First try
children = get_children(couples, male_portion, fertility)

# Continue with getting a new child for each daughter
daughters = children[children == FEMALE]
while len(daughters) > 0:
    new_children = get_children(len(daughters),
                                male_portion, fertility)
    children = np.concatenate((children, new_children))
    daughters = new_children[new_children == FEMALE]

The program birth_policy.py organizes the code segments above for the two policies into a function advance_generation, which we can call repeatedly to see the evolution of the population.

def advance_generation(parents, policy='one child',
                       male_portion=0.5, fertility=1.0):
    males = len(parents[parents==MALE])
    females = len(parents) - males
    couples = min(males, females)

    if policy == 'one child':
        children = get_children(couples, male_portion, fertility)
    elif policy == 'one son':
        # First try at getting a child
        children = get_children(couples, male_portion, fertility)
        # Continue with getting a new child for each daughter
        daughters = children[children == FEMALE]
        while len(daughters) > 0:
            new_children = get_children(len(daughters),
                                        male_portion, fertility)
            children = np.concatenate((children, new_children))
            daughters = new_children[new_children == FEMALE]
    return children

The simulation is then a matter of repeated calls to advance_generation:

N = 1000000              # population size
male_portion = 0.51
fertility = 0.92
# Start with a "perfect" generation of parents
parents = get_children(N, male_portion=0.5, fertility=1.0)
print 'one son policy, start: %d' % len(parents)
for i in range(10):
    parents = advance_generation(parents, 'one son',
                                 male_portion, fertility)
    print '%3d: %d' % (i+1, len(parents))

Under ideal conditions with unit fertility and a male_portion of 0.5, the program predicts that the one child policy halves the population from one generation to the next, while the one son policy keeps the population constant: the number of children per couple until the first son is geometrically distributed with mean \( 1/0.5=2 \), so each couple gets one daughter and one son on average. Increasing male_portion slightly and decreasing fertility, which corresponds more to reality, will in both cases lead to a reduction of the population. You can try the program out with various values of these input parameters.

An obvious extension is to incorporate the effect that a portion of the population does not follow the policy and gets \( c \) children on average. The program birth_policy.py can account for this effect, which is quite dramatic: if a fraction 0.01 of the population does not follow the one son policy and gets 4 children on average, the population grows by a factor of 1.5 over 10 generations (with male_portion and fertility kept at the ideal values 0.5 and 1, respectively). A sketch of one way to implement such an extension is given below.
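The sketch splits the couples into followers of the policy and law-breakers who get a fixed number of children (hypothetical code; birth_policy.py may organize this differently; get_children, MALE, and FEMALE are as defined above):

def advance_generation_one_son_v2(parents, male_portion=0.5,
                                  fertility=1.0, law_breakers=0.01,
                                  wanted_children=4):
    males = len(parents[parents == MALE])
    females = len(parents) - males
    couples = min(males, females)
    rebels = int(law_breakers*couples)   # couples ignoring the policy
    followers = couples - rebels

    # Followers obey the one son policy
    children = get_children(followers, male_portion, fertility)
    daughters = children[children == FEMALE]
    while len(daughters) > 0:
        new_children = get_children(len(daughters),
                                    male_portion, fertility)
        children = np.concatenate((children, new_children))
        daughters = new_children[new_children == FEMALE]
    # Law-breakers get wanted_children children per couple
    rebel_children = get_children(rebels*wanted_children,
                                  male_portion, fertility)
    return np.concatenate((children, rebel_children))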

Normally, simple models like difference or differential equations are used to model population growth. However, these models track the number of individuals through time with a very simple growth factor from one generation to the next. The model above tracks each individual in the population and applies rules involving random actions to each individual. Such a detailed and much more computer-time consuming model can be used to see the effect of different policies. Using the results of this detailed model, we can (sometimes) estimate growth factors for simpler models so that these mimic the overall effect on the population size. Exercise 26: Estimate growth in a simulation model asks you to investigate if a certain realization of the one son policy leads to simple exponential growth.