This chapter is taken from the book A Primer on Scientific Programming with Python by H. P. Langtangen, 5th edition, Springer, 2016.
With the mathematical rules from probability theory one may compute the probability that a certain event happens, say the probability that you get one black ball when drawing three balls from a hat with four black balls, six white balls, and three green balls. Unfortunately, theoretical calculations of probabilities can quickly become hard or impossible if the problem is changed slightly. There is a simple numerical way of computing probabilities that is generally applicable to problems with uncertainty. The principal ideas of this approximate technique are explained below, followed by three examples of increasing complexity.
Assume that we perform \( N \) experiments where the outcome of each experiment is random. Suppose that some event takes place \( M \) times in these \( N \) experiments. An estimate of the probability of the event is then \( M/N \). The estimate becomes more accurate as \( N \) is increased, and the exact probability is assumed to be reached in the limit as \( N\rightarrow\infty \). (Note that in this limit, \( M\rightarrow\infty \) too, so for rare events, where \( M \) may be small in a program, one must increase \( N \) such that \( M \) is sufficiently large for \( M/N \) to become a good approximation to the probability.)
Programs that run a large number of experiments and record the outcome of events are often called simulation programs. (Note that this term is also applied to programs that solve equations arising in mathematical models in general, but it is even more common to use the term when random numbers are used to estimate probabilities.) The mathematical technique of letting the computer perform lots of experiments based on drawing random numbers is commonly called Monte Carlo simulation. This technique has proven to be extremely useful throughout science and industry in problems involving uncertain or random behavior.
"As far as the laws of mathematics refer to reality, they are not certain; and as far as they are certain, they do not refer to reality." Albert Einstein, physicist, 1879-1955.

For example, in finance the stock market has a random variation that must be taken into account when trying to optimize investments. In offshore engineering, environmental loads from wind, currents, and waves show random behavior. In nuclear and particle physics, random behavior is fundamental according to quantum mechanics and statistical physics. Many probabilistic problems can be solved exactly by mathematics from probability theory, but very often Monte Carlo simulation is the only practical way to solve statistical problems. The sections Example: Throwing dice - Example: Policies for limiting population growth apply examples to explain the essence of Monte Carlo simulation in problems with inherent uncertainty. However, deterministic problems, such as the integration of functions, can also be computed by Monte Carlo simulation (see the section Monte Carlo integration).
It appears that Monte Carlo simulation programmed in pure Python is
a computationally feasible approach, even on small laptops, in all the
forthcoming examples. Significant speed-up can be achieved by
vectorizing the code, which is explained in detail for many of the
examples. However, large-scale Monte Carlo simulations and other
heavy computations run slowly in pure Python, and the core of the
computations should be moved to a compiled language such as C.
In the document Migrating Python to compiled code [1], you can find a Monte Carlo application
that is implemented in pure Python, in vectorized numpy
Python,
in the extended (and very closely related) Cython language, as well as
in pure C code. Various ways of combining Python with C are also
illustrated.
You throw two dice, one black and one green. What is the probability that the number of eyes on the black die is larger than the number of eyes on the green die?
We can simulate \( N \) throws of two dice in a program. For each throw we see if the event is successful, and if so, increase \( M \) by one:
import sys
N = int(sys.argv[1])              # no of experiments

import random
M = 0                             # no of successful events
for i in range(N):
    black = random.randint(1, 6)  # throw black die
    green = random.randint(1, 6)  # throw green die
    if black > green:             # success?
        M += 1
p = float(M)/N
print 'probability:', p
This program is named black_gt_green.py.
Although the black_gt_green.py program runs \( N=10^6 \) experiments in a few seconds, Monte Carlo simulation programs can quickly require quite some simulation time, so speeding up the algorithm by vectorization is often desired. Let us vectorize the code shown above. The idea is to draw all the \( 2N \) random numbers at once. We make an array of random numbers between 1 and 6 with 2 rows and \( N \) columns. The first row can be taken as the number of eyes on the black die in all the experiments, while the second row holds the corresponding eyes on the green die:
import numpy as np
r = np.random.random_integers(1, 6, size=(2, N))
black = r[0,:]   # eyes for all throws with black
green = r[1,:]   # eyes for all throws with green
The condition black > green results in an array of length \( N \) of boolean values: True when the element in black is greater than the corresponding element in green, and False if not. The number of True elements in the boolean array black > green is then \( M \). This number can be computed by summing up all the boolean values. In arithmetic operations, True is 1 and False is 0, so the sum equals \( M \). Fast summation of arrays requires np.sum and not Python's standard sum function. The code goes like
success = black > green # success[i] is true if black[i]>green[i]
M = np.sum(success) # sum up all successes
p = float(M)/N
print 'probability:', p
The code, found in the file black_gt_green_vec.py, runs over 10 times faster than the corresponding scalar code in black_gt_green.py.
In this simple example we can quite easily compute the exact solution. To this end, we set up all the outcomes of the experiment, i.e., all the possible combinations of eyes on two dice:
combinations = [(black, green)
for black in range(1, 7)
for green in range(1, 7)]
Then we count how many of the (black, green) pairs have the property black > green:
success = [black > green for black, green in combinations]
M = sum(success)
It turns out that M is 15, giving a probability \( 15/36 \approx 0.41667 \) since there are 36 combinations in total. Running the Monte Carlo simulations with \( N=10^6 \) typically gives probabilities in \( [0.416, 0.417] \).
Suppose a game is constructed such that you have to pay 1 euro to throw the two dice. You win 2 euros if there are more eyes on the black than on the green die. Should you play this game? We can easily simulate the game directly (file black_gt_green_game.py):
import sys
N = int(sys.argv[1])                  # no of experiments

import random
start_capital = 10
money = start_capital
for i in range(N):
    money -= 1                        # pay for the game
    black = random.randint(1, 6)      # throw black die
    green = random.randint(1, 6)      # throw green die
    if black > green:                 # success?
        money += 2                    # get award
net_profit_total = money - start_capital
net_profit_per_game = net_profit_total/float(N)
print 'Net profit per game in the long run:', net_profit_per_game
Experimenting with a few \( N \) shows that the net profit per game is always negative. That is, you should not play this game.
A vectorized version is beneficial for efficiency reasons (the corresponding file is black_gt_green_game_vec.py):
import sys
N = int(sys.argv[1]) # no of experiments
import numpy as np
r = np.random.random_integers(1, 6, size=(2, N))
money = 10 - N # capital after N throws
black = r[0,:] # eyes for all throws with black
green = r[1,:] # eyes for all throws with green
success = black > green # success[i] is true if black[i]>green[i]
M = np.sum(success) # sum up all successes
money += 2*M # add all awards for winning
print 'Net profit per game in the long run:', (money-10)/float(N)
Suppose the cost of playing a game once is \( q \) and that the award for winning is \( r \). The net income in a winning game is \( r-q \). Winning \( M \) out of \( N \) games means that the cost is \( Nq \) and the income is \( Mr \), making a net profit \( s=Mr-Nq \). Now \( p=M/N \) is the probability of winning the game so \( s=(pr-q)N \). A fair game means that we neither win nor lose in the long run: \( s=0 \), which implies that \( r=q/p \). That is, given the cost \( q \) and the probability \( p \) of winning, the award paid for winning the game must be \( r=q/p \) in a fair game.
When somebody comes up with a game, you can use Monte Carlo simulation to estimate \( p \) and then conclude that you should not play the game if \( r < q/p \). The example above has \( p=15/36 \) (exact) and \( q=1 \), so \( r=2.4 \) makes a fair game.
The reasoning above is based on common sense and an intuitive interpretation of probability. More precise reasoning from probability theory will introduce the game as an experiment with two outcomes: either you win with probability \( p \) or you lose with probability \( 1-p \). The expected payment is then the sum of the probabilities times the corresponding net income for each event: \( -q(1-p) + (r-q)p \) (recall that the net income in a winning game is \( r-q \)). A fair game has zero expected payment, which leads to \( r = q/p \).
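To make this concrete, the fair award can be estimated directly from a Monte Carlo estimate of \( p \). The following is a minimal sketch, not one of the book's program files:

import random

# Estimate p = P(black > green) by Monte Carlo
N = 1000000                        # no of experiments
M = 0                              # no of successful events
for i in range(N):
    if random.randint(1, 6) > random.randint(1, 6):
        M += 1
p = float(M)/N

q = 1.0                            # cost of playing one game
print 'fair award r = q/p =', q/p  # approx 36/15 = 2.4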
Suppose there are 12 balls in a hat: four black, four red, and four blue. We want to make a program that draws three balls at random from the hat. It is natural to represent the collection of balls as a list. Each list element can be an integer 1, 2, or 3, since we have three different types of balls, but it would be easier to work with the program if the balls could have a color instead of an integer number. This is easily accomplished by defining color names:
colors = 'black', 'red', 'blue'    # (tuple of strings)
hat = []
for color in colors:
    for i in range(4):
        hat.append(color)
Drawing a ball at random is performed by
import random
color = random.choice(hat)
print color
Drawing \( n \) balls without replacing the drawn balls requires us to remove an element from the hat when it is drawn. There are three ways to implement the procedure: (i) we perform a hat.remove(color), (ii) we draw a random index with randint from the set of legal indices in the hat list, and then we do a del hat[index] to remove the element, or (iii) we can compress the code in (ii) to hat.pop(index).
def draw_ball(hat):
    color = random.choice(hat)
    hat.remove(color)
    return color, hat

def draw_ball(hat):
    index = random.randint(0, len(hat)-1)
    color = hat[index]
    del hat[index]
    return color, hat

def draw_ball(hat):
    index = random.randint(0, len(hat)-1)
    color = hat.pop(index)
    return color, hat
# Draw n balls from the hat
balls = []
for i in range(n):
    color, hat = draw_ball(hat)
    balls.append(color)
print 'Got the balls', balls
We can extend the experiment above and ask the question: what is the probability of drawing two or more black balls from a hat with 12 balls, four black, four red, and four blue? To this end, we perform \( N \) experiments, count how many times \( M \) we get two or more black balls, and estimate the probability as \( M/N \). Each experiment consists of making the hat list, drawing a number of balls, and counting how many black balls we got. The latter task is easy with the count method in list objects: hat.count('black') counts how many elements with value 'black' we have in the list hat.
A complete program for this task is listed below. The program appears
in the file balls_in_hat.py.
import random

def draw_ball(hat):
    """Draw a ball using list index."""
    index = random.randint(0, len(hat)-1)
    color = hat.pop(index)
    return color, hat

def draw_ball(hat):
    """Draw a ball using list index."""
    index = random.randint(0, len(hat)-1)
    color = hat[index]
    del hat[index]
    return color, hat

def draw_ball(hat):
    """Draw a ball using list element."""
    color = random.choice(hat)
    hat.remove(color)
    return color, hat

def new_hat():
    colors = 'black', 'red', 'blue'    # (tuple of strings)
    hat = []
    for color in colors:
        for i in range(4):
            hat.append(color)
    return hat

n = int(raw_input('How many balls are to be drawn? '))
N = int(raw_input('How many experiments? '))

# Run experiments
M = 0    # no of successes
for e in range(N):
    hat = new_hat()
    balls = []                         # the n balls we draw
    for i in range(n):
        color, hat = draw_ball(hat)
        balls.append(color)
    if balls.count('black') >= 2:      # at least two black balls?
        M += 1
print 'Probability:', float(M)/N
Running the program with \( n=5 \) (drawing 5 balls each time) and \( N=4000 \) gives a probability of 0.57. Drawing only 2 balls at a time reduces the probability to about 0.09.
With the aid of probability theory one can derive theoretical expressions for such probabilities, but it is much simpler to let the computer perform a large number of experiments to estimate an approximate probability.
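For reference, this particular probability can also be computed exactly: the number of black balls among the \( n \) drawn follows a hypergeometric distribution, so the probability of at least two black balls is $$ P = \sum_{k=2}^{\min(n,4)} \frac{\binom{4}{k}\binom{8}{n-k}}{\binom{12}{n}}$$ For \( n=5 \) this gives \( 456/792 \approx 0.576 \), and for \( n=2 \) it gives \( 6/66 \approx 0.091 \), in good agreement with the simulated values above.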
A class version of the code in this section is better than the code presented, because we avoid shuffling the hat variable in and out of functions. Exercise 21: Make a class for drawing balls from a hat asks you to design and implement a class Hat.
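For illustration, one possible design could look like the sketch below; the exercise asks you to develop your own version, so the names and details here are only suggestions:

import random

class Hat:
    """Hat of colored balls; one possible design sketch."""
    def __init__(self, colors=('black', 'red', 'blue'), n_of_each=4):
        self.hat = [color for color in colors
                    for i in range(n_of_each)]

    def draw(self, n):
        """Draw n balls at random, without replacement."""
        return [self.hat.pop(random.randint(0, len(self.hat)-1))
                for i in range(n)]

hat = Hat()
print 'Got the balls', hat.draw(3)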
Mutation of genes is easily modeled by replacing the letter in a randomly chosen position of the DNA by a randomly chosen letter from the alphabet A, C, G, and T. Python's random module can be used to generate random numbers. Selecting a random position means generating a random index in the DNA string, and the function random.randint(a, b) generates random integers between a and b (both included). Generating a random letter is most easily done by having a list of the actual letters and using random.choice(list) to pick an arbitrary element from list. A function for replacing the letter in a randomly selected position (index) by a random letter among A, C, G, and T is most straightforwardly implemented by converting the DNA string to a list of letters, since changing a character in a Python string is impossible without constructing a new string. However, an element in a list can be changed in-place:
import random

def mutate_v1(dna):
    dna_list = list(dna)
    mutation_site = random.randint(0, len(dna_list) - 1)
    dna_list[mutation_site] = random.choice(list('ATCG'))
    return ''.join(dna_list)
Using get_base_frequencies_v2 and format_frequencies from the document Files, strings, and dictionaries [2], we can easily mutate a gene a number of times and see how the frequencies of the bases A, C, G, and T change:
dna = 'ACGGAGATTTCGGTATGCAT'
print 'Starting DNA:', dna
print format_frequencies(get_base_frequencies_v2(dna))

nmutations = 10000
for i in range(nmutations):
    dna = mutate_v1(dna)
print 'DNA after %d mutations:' % nmutations, dna
print format_frequencies(get_base_frequencies_v2(dna))
Here is the output from a run:
Starting DNA: ACGGAGATTTCGGTATGCAT
A: 0.25, C: 0.15, T: 0.30, G: 0.30
DNA after 10000 mutations: AACCAATCCGACGAGGAGTG
A: 0.35, C: 0.25, T: 0.10, G: 0.30
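The two helper functions are developed in [2]; for the present example, minimal versions consistent with the output above could look as follows (a sketch, not the original implementations):

# Possible minimal versions of the helpers from [2]
def get_base_frequencies_v2(dna):
    """Return dict of the relative frequency of each base."""
    return {base: dna.count(base)/float(len(dna))
            for base in 'ATGC'}

def format_frequencies(frequencies):
    """Format a dict of frequencies as a readable string."""
    return ', '.join(['%s: %.2f' % (base, frequencies[base])
                      for base in 'ACTG'])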
The efficiency of the mutate_v1 function with its surrounding loop can be significantly increased by performing all the mutations at once using numpy arrays. This speed-up is of interest for long dna strings and many mutations. The idea is to draw all the mutation sites at once, and also all the new bases at these sites at once. The np.random module provides functions for drawing several random numbers at a time, but only integers and real numbers can be drawn, not characters from the alphabet A, C, G, and T. We therefore have to represent these four characters by the numbers (say) 0, 1, 2, and 3. Afterwards we can translate the integers to letters by some clever vectorized indexing.
Drawing N mutation sites is a matter of drawing N random integers among the legal indices:
import numpy as np
mutation_sites = np.random.random_integers(0, len(dna)-1, size=N)
Drawing N bases, represented as the integers 0-3, is similarly done by
new_bases_i = np.random.random_integers(0, 3, N)
Converting, say, the integer 1 to the base symbol C is done by picking out the indices (in a boolean array) where new_bases_i equals 1, and inserting the character 'C' in a companion array of characters:
new_bases_c = np.zeros(N, dtype='c')
indices = new_bases_i == 1
new_bases_c[indices] = 'C'
We must do this integer-to-letter conversion for all four integers/letters. Thereafter, new_bases_c must be inserted in dna for all the indices corresponding to the randomly drawn mutation sites:

dna[mutation_sites] = new_bases_c

The final step is to convert the numpy array of characters dna back to a standard string by first converting dna to a list and then joining the list elements: ''.join(dna.tolist()).
The complete vectorized function can now be expressed as follows:
import numpy as np

# Use integers in random numpy arrays and map these
# to characters according to
i2c = {0: 'A', 1: 'C', 2: 'G', 3: 'T'}

def mutate_v2(dna, N):
    dna = np.array(dna, dtype='c')    # array of characters
    mutation_sites = np.random.random_integers(
        0, len(dna) - 1, size=N)
    # Must draw bases as integers
    new_bases_i = np.random.random_integers(0, 3, size=N)
    # Translate integers to characters
    new_bases_c = np.zeros(N, dtype='c')
    for i in i2c:
        new_bases_c[new_bases_i == i] = i2c[i]
    dna[mutation_sites] = new_bases_c
    return ''.join(dna.tolist())
It is of interest to time mutate_v2 versus mutate_v1. For this purpose we need a long test string. A straightforward generation of random letters is

def generate_string_v1(N, alphabet='ACGT'):
    return ''.join([random.choice(alphabet) for i in xrange(N)])
A vectorized version of this function can also be made, using the ideas explored above for the mutate_v2 function:

def generate_string_v2(N, alphabet='ACGT'):
    # Draw random integers 0,1,2,3 to represent bases
    dna_i = np.random.random_integers(0, 3, N)
    # Translate integers to characters
    dna = np.zeros(N, dtype='c')
    for i in i2c:
        dna[dna_i == i] = i2c[i]
    return ''.join(dna.tolist())
The time_mutate function in the file mutate.py performs timing of the generation of test strings and the mutations. To generate a DNA string of length 100,000 the vectorized function is about 8 times faster. When performing 10,000 mutations on this string, the vectorized version is almost 3000 times faster! These numbers stay approximately the same also for larger strings and more mutations. Hence, this case study on vectorization is a striking example of the fact that a straightforward and convenient function like mutate_v1 may occasionally be very slow for large-scale computations.
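If you want to reproduce such timings yourself, a minimal sketch could be along these lines (the time_mutate function in mutate.py is more elaborate; time.clock and the sizes below are just one possible setup):

import time

# Time scalar vs vectorized mutation on a long string
dna = generate_string_v2(100000)
t0 = time.clock()
for i in range(10000):
    dna = mutate_v1(dna)
t1 = time.clock()
dna = mutate_v2(dna, 10000)
t2 = time.clock()
print 'mutate_v1: %g s, mutate_v2: %g s, speed-up: %g' % \
      (t1 - t0, t2 - t1, (t1 - t0)/(t2 - t1))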
The observed rate at which mutations occur at a given position in the genome is not independent of the type of nucleotide (base) at that position, as was assumed in the previous simple mutation model. We should therefore take into account that the rate of transition depends on the base.
There are a number of reasons why the observed mutation rates vary between different nucleotides. One reason is that there are different mechanisms generating transitions from one base to another. Another reason is that there are extensive repair processes in living cells, and the efficiency of this repair mechanism varies for different nucleotides.
Mutation of nucleotides may be modeled using distinct probabilities for the transitions from each nucleotide to every other nucleotide. For example, the probability of replacing A by C may be prescribed as (say) 0.2. In total we need \( 4\times 4 \) probabilities since each nucleotide can transform into itself (no change) or three others. The four transition probabilities for a given nucleotide must sum up to one. Such statistical evolution, based on probabilities for transitioning from one state to another, is known as a Markov process or Markov chain.
First we need to set up the probability matrix, i.e., the \( 4\times4 \) table of probabilities where each row corresponds to the transitions of A, C, G, or T into A, C, G, or T. Say the probability of a transition from A to A is 0.2, from A to C is 0.1, from A to G is 0.3, and from A to T is 0.4.
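In the dictionary-of-dictionaries representation used below, this example row would be written as (the numbers are only the test values just prescribed):

markov_chain = {}
markov_chain['A'] = {'A': 0.2, 'C': 0.1, 'G': 0.3, 'T': 0.4}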
Rather than just prescribing some arbitrary transition probabilities
for test purposes, we can use random numbers for these probabilities.
To this end, we generate three random numbers to divide the interval
\( [0,1] \) into four intervals corresponding to the four possible
transitions. The lengths of the intervals give the transition
probabilities, and their sum is ensured to be 1. The interval limits,
0, 1, and three random numbers must be sorted in ascending order to
form the intervals. We use the function random.random()
to generate
random numbers in \( [0,1) \):
slice_points = sorted(
[0] + [random.random() for i in range(3)] + [1])
transition_probabilities = [slice_points[i+1] - slice_points[i]
for i in range(4)]
The transition probabilities are handy to have available as a dictionary:
markov_chain['A'] = {'A': ..., 'C': ..., 'G': ..., 'T': ...}
which can be computed by
markov_chain['A'] = {base: p for base, p in
zip('ACGT', transition_probabilities)}
To select a transition, we need to draw a random letter (A, C, G, or T) according to the probabilities markov_chain[b], where b is the base at the current position. Actually, this is a very common operation, namely drawing a random value from a discrete probability distribution (markov_chain[b]). The natural approach is therefore to write a general function for drawing from any discrete probability distribution given as a dictionary:
def draw(discrete_probdist):
    """
    Draw random value from discrete probability distribution
    represented as a dict: P(x=value) = discrete_probdist[value].
    """
    # Method:
    # http://en.wikipedia.org/wiki/Pseudo-random_number_sampling
    limit = 0
    r = random.random()
    for value in discrete_probdist:
        limit += discrete_probdist[value]
        if r < limit:
            return value
Basically, the algorithm divides \( [0,1] \) into intervals of lengths equal to the probabilities of the various outcomes and checks which interval is hit by a random variable in \( [0,1] \). The corresponding value is the random choice.
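As a quick illustration, draw can be applied to any dictionary of probabilities that sum to 1, e.g., a fair coin:

print draw({'heads': 0.5, 'tails': 0.5})    # prints 'heads' or 'tails'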
A complete function creating all the transition probabilities and storing them in a dictionary of dictionaries takes the form
def create_markov_chain():
    markov_chain = {}
    for from_base in 'ATGC':
        # Generate random transition probabilities by dividing
        # [0,1] into four intervals of random length
        slice_points = sorted(
            [0] + [random.random() for i in range(3)] + [1])
        transition_probabilities = \
            [slice_points[i+1] - slice_points[i] for i in range(4)]
        markov_chain[from_base] = {base: p for base, p
            in zip('ATGC', transition_probabilities)}
    return markov_chain

mc = create_markov_chain()
print mc
print mc['A']['T']    # probability of transition from A to T
It is natural to develop a function for checking that the generated probabilities are consistent. The transition from a particular base into one of the four bases happens with probability 1, which means that the probabilities in a row must sum up to 1:
def check_transition_probabilities(markov_chain):
    for from_base in 'ATGC':
        s = sum(markov_chain[from_base][to_base]
                for to_base in 'ATGC')
        if abs(s - 1) > 1E-15:
            raise ValueError('Wrong sum: %s for "%s"' % \
                             (s, from_base))
Another test is to check that draw actually draws random values in accordance with the underlying probabilities. To this end, we draw a large number of values, N, count the frequencies of the various values, divide by N, and compare the empirical normalized frequencies with the probabilities:
def check_draw_approx(discrete_probdist, N=1000000):
    """
    See if draw results in frequencies approx equal to
    the probability distribution.
    """
    frequencies = {value: 0 for value in discrete_probdist}
    for i in range(N):
        value = draw(discrete_probdist)
        frequencies[value] += 1
    for value in frequencies:
        frequencies[value] /= float(N)
    print ', '.join(['%s: %.4f (exact %.4f)' % \
                     (v, frequencies[v], discrete_probdist[v])
                     for v in frequencies])
This test is only approximate, but it does bring evidence of the correctness of the implementation of the draw function.
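For instance, the test can be applied to one of the rows of the Markov chain mc generated above:

check_draw_approx(mc['A'])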
A vectorized version of draw can also be made. We refer to the source code file mutate.py for details (the function is relatively complicated).
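The idea, however, can be sketched: precompute the cumulative probabilities and locate N uniform random numbers among them. The following is a possible simplified version, not identical to the one in mutate.py:

import numpy as np

def draw_vec(discrete_probdist, N):
    """Draw N random values from a discrete probability distribution."""
    values = list(discrete_probdist.keys())
    # Upper interval limits: cumulative sums of the probabilities
    limits = np.cumsum([discrete_probdist[v] for v in values])
    r = np.random.random(N)            # N uniform numbers in [0,1)
    # Find the interval that each number falls into
    indices = np.searchsorted(limits, r, side='right')
    # Guard against round-off in the last cumulative sum
    indices = np.minimum(indices, len(values) - 1)
    return [values[i] for i in indices]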
Now we have all the tools needed to run the Markov chain of transitions for a randomly selected position in a DNA sequence:
def mutate_via_markov_chain(dna, markov_chain):
    dna_list = list(dna)
    mutation_site = random.randint(0, len(dna_list) - 1)
    from_base = dna[mutation_site]
    to_base = draw(markov_chain[from_base])
    dna_list[mutation_site] = to_base
    return ''.join(dna_list)
Exercise 47: Speed up Markov chain mutation suggests some efficiency enhancements of simulating mutations via these functions.
Here is a simulation of mutations using the method based on Markov chains:
dna = 'TTACGGAGATTTCGGTATGCAT'
print 'Starting DNA:', dna
print format_frequencies(get_base_frequencies_v2(dna))

mc = create_markov_chain()
import pprint
print 'Transition probabilities:\n', pprint.pformat(mc)

nmutations = 10000
for i in range(nmutations):
    dna = mutate_via_markov_chain(dna, mc)
print 'DNA after %d mutations (Markov chain):' % nmutations, dna
print format_frequencies(get_base_frequencies_v2(dna))
The output will differ each time the program is run unless random.seed(i) is called in the beginning of the program for some integer i. This call makes the sequence of random numbers the same every time the program is run and is very useful for debugging. An example of the output may look like
Starting DNA: TTACGGAGATTTCGGTATGCAT
A: 0.23, C: 0.14, T: 0.36, G: 0.27
Transition probabilities:
{'A': {'A': 0.4288890546751146,
'C': 0.4219086988655296,
'G': 0.00668870644455688,
'T': 0.14251354001479888},
'C': {'A': 0.24999667668640035,
'C': 0.04718309085408834,
'G': 0.6250440975238185,
'T': 0.0777761349356928},
'G': {'A': 0.16022955651881965,
'C': 0.34652746609882423,
'G': 0.1328031742612512,
'T': 0.3604398031211049},
'T': {'A': 0.20609823213950174,
'C': 0.17641112746655452,
'G': 0.010267621176125452,
'T': 0.6072230192178183}}
DNA after 10000 mutations (Markov chain): GGTTTAAGTCAGCTATGATTCT
A: 0.23, C: 0.14, T: 0.41, G: 0.23
Note that the mutated DNA should contain more nucleotides of the type where the total probability of transitioning into that particular nucleotide is largest. The total probability of transitioning into a particular base can be computed by a bit of probability algebra. Let \( X \) be the initial base at some position in the DNA and let \( Y \) be the new base after mutation at this position. The probability \( P(Y=b) \), where \( b \) is some base (A, C, G, or T), is built up of four mutually exclusive events: $$ P(Y=b) = P(X=A \cap Y=b) + P(X=C \cap Y=b) + P(X=G \cap Y=b) + P(X=T \cap Y=b)$$ A joint event can be expressed by the (conditional) transition probabilities, e.g., $$ P(X=A \cap Y=b) = P(Y=b | X=A) P(X=A) $$ leading to $$ P(Y=b) = \sum_{i\in\{A,C,G,T\}} P(Y=b|X=i)P(X=i)$$ The probabilities \( P(Y=b|X=i) \) correspond to a column in the transition probability matrix. If each of the initial events \( X=i \) is equally probable, \( P(X=i)=1/4 \), and \( P(Y=b) \) is then the sum of the probabilities in the column corresponding to \( b \), divided by 4. We can now compute \( P(Y=b) \) for \( b \) as A, C, G, and T:
def transition_into_bases(markov_chain):
    return {to_base: sum(markov_chain[from_base][to_base]
                         for from_base in 'ATGC')/4.0
            for to_base in 'ATGC'}

print transition_into_bases(mc)
The \( P(Y=b) \) probabilities corresponding to the example run above read
{'A': 0.26, 'C': 0.25, 'T': 0.30, 'G': 0.19}
Transition into T (\( P(Y=T) \)) has the greatest probability (0.30), and this is also confirmed by the greatest frequency (0.41).
The various functions performing mutations are located in the file mutate.py.
China has for many years officially allowed only one child per couple. However, the success of the policy has been somewhat limited. One challenge is the current over-representation of males in the population (families have favored sons). An alternative policy is to allow each couple to continue having children until they get a son. We can simulate both policies and see how a population will develop under the one child and the one son policies. Since we expect to work with a large population over several generations, we aim at vectorized code at once.
Suppose we have a collection of n individuals, called parents, consisting of males and females randomly drawn such that a certain portion (male_portion) constitutes males. The parents array holds integer values, 1 for male and 2 for female. We can introduce the constants MALE=1 and FEMALE=2 to make the code easier to read.
Our task is to see how the parents array develops from one generation to the next under the two policies. Let us first show how to draw the random integer array parents where there is a probability male_portion of getting the value MALE:
import numpy as np
r = np.random.random(n)
parents = np.zeros(n, int)
MALE = 1; FEMALE = 2
parents[r < male_portion] = MALE
parents[r >= male_portion] = FEMALE
The number of potential couples is the minimum of males and females. However, only a fraction (fertility) of the couples will actually get a child. Under the perfect one child policy, these couples can have one child each:
males = len(parents[parents==MALE])
females = len(parents) - males
couples = min(males, females)
n = int(fertility*couples)    # couples that get a child

# The next generation, one child per couple
r = np.random.random(n)
children = np.zeros(n, int)
children[r < male_portion] = MALE
children[r >= male_portion] = FEMALE
The code for generating a new population will be needed in every generation. Therefore, it is natural to collect the last statements in a separate function such that we can repeat the statements when needed.
def get_children(n, male_portion, fertility):
    n = int(fertility*n)
    r = np.random.random(n)
    children = np.zeros(n, int)
    children[r < male_portion] = MALE
    children[r >= male_portion] = FEMALE
    return children
Under the one son policy, the families can continue getting a new child until they get the first son:
# First try
children = get_children(couples, male_portion, fertility)

# Continue with getting a new child for each daughter
daughters = children[children == FEMALE]
while len(daughters) > 0:
    new_children = get_children(len(daughters),
                                male_portion, fertility)
    children = np.concatenate((children, new_children))
    daughters = new_children[new_children == FEMALE]
The program birth_policy.py organizes the code segments above for the two policies into a function advance_generation, which we can call repeatedly to see the evolution of the population.
def advance_generation(parents, policy='one child',
                       male_portion=0.5, fertility=1.0):
    males = len(parents[parents==MALE])
    females = len(parents) - males
    couples = min(males, females)

    if policy == 'one child':
        children = get_children(couples, male_portion, fertility)
    elif policy == 'one son':
        # First try at getting a child
        children = get_children(couples, male_portion, fertility)
        # Continue with getting a new child for each daughter
        daughters = children[children == FEMALE]
        while len(daughters) > 0:
            new_children = get_children(len(daughters),
                                        male_portion, fertility)
            children = np.concatenate((children, new_children))
            daughters = new_children[new_children == FEMALE]
    return children
The simulation is then a matter of repeated calls to advance_generation:
N = 1000000                # population size
male_portion = 0.51
fertility = 0.92

# Start with a "perfect" generation of parents
parents = get_children(N, male_portion=0.5, fertility=1.0)
print 'one son policy, start: %d' % len(parents)
for i in range(10):
    parents = advance_generation(parents, 'one son',
                                 male_portion, fertility)
    print '%3d: %d' % (i+1, len(parents))
Under ideal conditions with unit fertility
and a male_portion
of 0.5, the program predicts that the one child policy halves
the population from one generation to the next, while the one son
policy, where we expect each couple to get one daughter and one son on
average, keeps the population constant. Increasing
male_portion
slightly and decreasing fertility
, which
corresponds more to reality, will in both cases lead to a reduction of
the population. You can try the program out with various values of
these input parameters.
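The constant population under the ideal one son policy can also be understood mathematically: with son probability \( p=1/2 \) per birth, the number of children a couple gets before stopping is geometrically distributed, and the expected number of children per couple becomes $$ \sum_{k=1}^{\infty} k(1-p)^{k-1}p = \frac{1}{p} = 2$$ Since \( N \) parents form about \( N/2 \) couples, the next generation again consists of about \( N \) individuals, whereas the one child policy gives only \( N/2 \).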
An obvious extension is to incorporate the effect that a portion of the population does not follow the policy and gets \( c \) children on average. The program birth_policy.py can account for this effect, which is quite dramatic: if a fraction 0.01 of the population does not follow the one son policy and gets 4 children on average, the population grows by a factor of 1.5 over 10 generations (with male_portion and fertility kept at the ideal values 0.5 and 1, respectively).
Normally, simple models like difference or differential equations are used to model population growth. However, these models track the number of individuals through time with a very simple growth factor from one generation to the next. The model above tracks each individual in the population and applies rules involving random actions to each individual. Such a detailed and much more computer-time consuming model can be used to see the effect of different policies. Using the results of this detailed model, we can (sometimes) estimate growth factors for simpler models so that these mimic the overall effect on the population size. Exercise 26: Estimate growth in a simulation model asks you to investigate if a certain realization of the one son policy leads to simple exponential growth.