This chapter is taken from the book A Primer on Scientific Programming with Python by H. P. Langtangen, 5th edition, Springer, 2016.

Reading data from file

Getting input data into a program from the command line, or from questions and answers in the terminal window, works for small amounts of data. Otherwise, input data must be available in files. Anyone with some computer experience is used to saving and loading data files in programs. The task now is to understand how Python programs can read and write files. The basic recipes are quite simple and are illustrated through examples.

Suppose we have recorded some measurement data in the file src/input/data.txt. The goal of our first example of reading files is to read the measurement values in data.txt, find the average value, and print it out in the terminal window.

Before letting a program read a file, we must know the file format, i.e., what the contents of the file look like, because the structure of the text in the file greatly influences the set of statements needed to read the file. We therefore start by viewing the contents of the file data.txt. To this end, load the file into a text editor or viewer. On Unix and Mac one can use emacs, vim, more, or less. On Windows, WordPad is appropriate, as is the type command in a DOS or PowerShell window, and even word processors such as LibreOffice or Microsoft Word can be used. What we see is a column of numbers:

21.8
18.1
19
23
26
17.8

Our task is to read this column of numbers into a list in the program and compute the average of the list items.

Reading a file line by line

To read a file, we first need to open the file. This action creates a file object, here stored in the variable infile:

infile = open('data.txt', 'r')

The second argument to the open function, the string 'r', specifies that we want to open the file for reading. We shall later see that a file can be opened for writing instead, by providing 'w' as the second argument. After the file is read, one should close the file object with infile.close().

The basic technique for reading the file line by line applies a for loop like this:

for line in infile:
    # do something with line

The line variable is a string holding the current line in the file. The for loop over lines in a file has the same syntax as when we go through a list. Just think of the file object infile as a collection of elements, here lines in a file, and the for loop visits these elements in sequence such that the line variable refers to one line at a time. If something seemingly goes wrong in such a loop over lines in a file, it is useful to do a print line inside the loop.
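
When inspecting lines in such a loop, it can help to print repr(line) rather than the line itself, since repr makes the trailing newline character visible. The following is a minimal sketch. It first writes a small sample file; the name lines_demo.txt and the use of a temporary directory are assumptions made only to keep the example self-contained, and print is called with parentheses so that the code runs under both Python 2 and Python 3.

```python
import os
import tempfile

# Assumption: write a small sample file in a temporary directory so
# that the example can run on its own.
filename = os.path.join(tempfile.mkdtemp(), 'lines_demo.txt')
outfile = open(filename, 'w')
outfile.write('21.8\n18.1\n19\n')
outfile.close()

infile = open(filename, 'r')
seen = []
for line in infile:
    # repr shows the string with quotes and the newline as \n
    print(repr(line))
    seen.append(line)
infile.close()
```

Each element collected in seen ends with '\n', which shows that the newline is part of every line read from the file.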

Instead of reading one line at a time, we can load all lines into a list of strings (lines) by

lines = infile.readlines()

This statement is equivalent to

lines = []
for line in infile:
    lines.append(line)

or the list comprehension:

lines = [line for line in infile]
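
One detail worth knowing is that a file object is read from its current position, so after one complete pass there are no lines left to read. The sketch below compares the three approaches by reopening the file for each of them, and shows that a second readlines() call on the same file object returns an empty list. The sample file and its name are assumptions made to keep the example self-contained.

```python
import os
import tempfile

# Assumption: a small sample file in a temporary directory
filename = os.path.join(tempfile.mkdtemp(), 'data_demo.txt')
outfile = open(filename, 'w')
outfile.write('21.8\n18.1\n19\n')
outfile.close()

# Approach 1: readlines
infile = open(filename, 'r')
lines1 = infile.readlines()
leftover = infile.readlines()  # nothing left to read: empty list
infile.close()

# Approach 2: explicit loop
infile = open(filename, 'r')
lines2 = []
for line in infile:
    lines2.append(line)
infile.close()

# Approach 3: list comprehension
infile = open(filename, 'r')
lines3 = [line for line in infile]
infile.close()

print(lines1 == lines2 == lines3)
print(leftover)
```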

In the present example, we load the file into the list lines. The next task is to compute the average of the numbers in the file. Trying a straightforward sum of all numbers on all lines,

mean = 0
for number in lines:
    mean = mean + number
mean = mean/len(lines)

gives an error message:

TypeError: unsupported operand type(s) for +: 'int' and 'str'

The reason is that lines holds each line (number) as a string, not a float or int that we can add to other numbers. A fix is to convert each line to a float:

mean = 0
for line in lines:
    number = float(line)
    mean = mean + number
mean = mean/len(lines)

This code snippet works fine. The complete code can be found in the file mean1.py.
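
The conversion float(line) tolerates the trailing newline in each line, but it raises a ValueError for a blank line, which data files sometimes contain, for instance at the end. A hedged variant that skips such lines could look as follows. The function name read_numbers and the sample file are assumptions introduced for illustration and are not part of mean1.py.

```python
import os
import tempfile


def read_numbers(filename):
    """Return the numbers in a one-column file, skipping blank lines."""
    infile = open(filename, 'r')
    numbers = []
    for line in infile:
        if line.strip() == '':  # only whitespace: nothing to convert
            continue
        numbers.append(float(line))
    infile.close()
    return numbers


# Assumption: sample file with the numbers from data.txt plus blank lines
filename = os.path.join(tempfile.mkdtemp(), 'data_demo.txt')
outfile = open(filename, 'w')
outfile.write('21.8\n18.1\n19\n\n23\n26\n17.8\n\n')
outfile.close()

numbers = read_numbers(filename)
mean = sum(numbers)/len(numbers)
print(mean)
```

Dividing by len(numbers) rather than by the number of lines in the file matters here, since the blank lines must not be counted.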

Summing up a list of numbers is often done in numerical programs, so Python has a special function sum for performing this task. However, sum must in the present case operate on a list of floats, not strings. We can use a list comprehension to turn all elements in lines into corresponding float objects:

mean = sum([float(line) for line in lines])/len(lines)

An alternative implementation is to load the lines into a list of float objects directly. Using this strategy, the complete program (found in file mean2.py) takes the form

infile = open('data.txt', 'r')
numbers = [float(line) for line in infile.readlines()]
infile.close()
mean = sum(numbers)/len(numbers)
print mean

Alternative ways of reading a file

A newcomer to programming might find it confusing to see that one problem is solved by many alternative sets of statements, but this is the very nature of programming. A clever programmer will judge several alternative solutions to a programming task and choose one that is particularly compact, easy to understand, and/or easy to extend later. We therefore present more examples on how to read the data.txt file and compute with the data.

The modern with statement

Modern Python code applies the with statement to deal with files:

with open('data.txt', 'r') as infile:
    for line in infile:
        # process line

When no errors occur, this snippet is equivalent to

infile = open('data.txt', 'r')
for line in infile:
    # process line
infile.close()

Note that there is no need to close the file when using the with statement. The advantage of the with construction is shorter code and better handling of errors: the file is closed automatically when the with block is left, even if something goes wrong while working with the file. A downside is that the syntax differs from the classical open-close pattern that one finds in most other programming languages. Remembering to close a file is key in programming, and to practice this habit, we mostly apply the open-close construction in this document.
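
The error handling of the with statement can be illustrated by a small experiment. The sketch below is built on assumptions: the sample file deliberately contains a line that cannot be converted to a float, and the file name is hypothetical. The attribute infile.closed tells whether the file object has been closed.

```python
import os
import tempfile

# Assumption: a sample file where the second line is not a number
filename = os.path.join(tempfile.mkdtemp(), 'data_demo.txt')
with open(filename, 'w') as outfile:
    outfile.write('21.8\nnot a number\n')

try:
    with open(filename, 'r') as infile:
        for line in infile:
            number = float(line)  # raises ValueError on the second line
except ValueError:
    print('Could not convert a line to float')

# The with statement has closed the file although an error occurred
print(infile.closed)
```

With the plain open-close pattern, the statement infile.close() placed after the loop would never be reached in this situation.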

The old while construction

The call infile.readline() returns a string containing the text at the current line. A new infile.readline() will read the next line. When infile.readline() returns an empty string, the end of the file is reached and we must stop further reading. The following while loop reads the file line by line using infile.readline():

while True:
    line = infile.readline()
    if not line:
        break
    # process line

This is perhaps a somewhat strange loop, but it is a well-established way of reading a file in Python, especially in older code. The while loop would run forever, since its condition is always True, were it not for the break statement. Inside the loop we test whether line evaluates to False, which happens only when we reach the end of the file, because line then becomes an empty string, and an empty string evaluates to False in Python. The break statement then terminates the loop and makes the program flow jump to the first statement after the while block.
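
A natural question is whether a blank line in the middle of the file would stop the loop too early. It does not: for a blank line, readline returns the string '\n', which is non-empty and therefore evaluates to True. Only the empty string '' signals the end of the file. The following sketch demonstrates this with a sample file (an assumption made for illustration) that has a blank line between two numbers.

```python
import os
import tempfile

# Assumption: sample file with a blank line in the middle
filename = os.path.join(tempfile.mkdtemp(), 'blank_demo.txt')
outfile = open(filename, 'w')
outfile.write('21.8\n\n18.1\n')
outfile.close()

infile = open(filename, 'r')
n = 0
while True:
    line = infile.readline()
    if not line:  # true only for the empty string '' at end of file
        break
    n += 1        # counts the blank line '\n' as well
infile.close()

print(n)             # 3
print(bool(''))      # False
print(bool('\n'))    # True
```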

Computing the average of the numbers in the data.txt file can now be done in yet another way:

infile = open('data.txt', 'r')
mean = 0
n = 0
while True:
    line = infile.readline()
    if not line:
        break
    mean += float(line)
    n += 1
infile.close()
mean = mean/float(n)

Reading a file into a string

The call infile.read() reads the whole file and returns the text as a string object. The following interactive session illustrates the use and result of infile.read():

>>> infile = open('data.txt', 'r')
>>> filestr = infile.read()
>>> filestr
'21.8\n18.1\n19\n23\n26\n17.8\n'
>>> print filestr
21.8
18.1
19
23
26
17.8

Note the difference between just writing filestr and writing print filestr. The former dumps the string with quotes and with each newline shown as the two characters backslash and n, while the latter is a pretty print where the string is written out without quotes and with the newline characters rendered as line breaks.

Having the numbers inside a string instead of inside a file does not look like a major step forward. However, string objects have many useful functions for extracting information. A very useful feature is split: filestr.split() will split the string into words, separated by whitespace (blanks, tabs, or newlines) by default, or by another delimiter string given as argument to split. The "words" in this file are the numbers:

>>> words = filestr.split()
>>> words
['21.8', '18.1', '19', '23', '26', '17.8']
>>> numbers = [float(w) for w in words]
>>> mean = sum(numbers)/len(numbers)
>>> print mean
20.95
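
The effect of giving a delimiter to split can be seen in a short sketch. The strings below are made-up examples chosen for illustration. Note that split() without arguments treats any run of whitespace as one separator, whereas split(' ') splits at every single blank and therefore produces an empty string between two consecutive blanks.

```python
text = 'Oct  117.7'  # two blanks between the words
words_default = text.split()
words_blank = text.split(' ')
print(words_default)  # ['Oct', '117.7']
print(words_blank)    # ['Oct', '', '117.7']

csv_text = '21.8,18.1,19'  # comma-separated values
fields = csv_text.split(',')
print(fields)         # ['21.8', '18.1', '19']
numbers = [float(f) for f in fields]
```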

A more compact program looks as follows (mean3.py):

infile = open('data.txt', 'r')
numbers = [float(w) for w in infile.read().split()]
infile.close()
mean = sum(numbers)/len(numbers)

The next section tells you more about splitting strings.

Reading a mixture of text and numbers

The data.txt file has a very simple structure since it contains numbers only. Many data files contain a mix of text and numbers. The file rainfall.dat from www.worldclimate.com provides an example:

Average rainfall (in mm) in Rome: 1188 months between 1782 and 1970
Jan  81.2
Feb  63.2
Mar  70.3
Apr  55.7
May  53.0
Jun  36.4
Jul  17.5
Aug  27.5
Sep  60.9
Oct  117.7
Nov  111.0
Dec  97.9
Year 792.9

How can we read the rainfall data in this file and store the information in lists suitable for further analysis? The most straightforward solution is to read the file line by line, and for each line split the line into words, store the first word (the month) in one list and the second word (the average rainfall) in another list. The elements in this latter list need to be float objects if we want to compute with them.

The complete code, wrapped in a function, may look like this (file rainfall1.py):

def extract_data(filename):
    infile = open(filename, 'r')
    infile.readline() # skip the first line
    months = []
    rainfall = []
    for line in infile:
        words = line.split()
        # words[0]: month, words[1]: rainfall
        months.append(words[0])
        rainfall.append(float(words[1]))
    infile.close()
    months = months[:-1]      # Drop the "Year" entry
    annual_avg = rainfall[-1] # Store the annual average
    rainfall = rainfall[:-1]  # Redefine to contain monthly data
    return months, rainfall, annual_avg

months, values, avg = extract_data('rainfall.dat')
print 'The average rainfall for the months:'
for month, value in zip(months, values):
    print month, value
print 'The average rainfall for the year:', avg

Note that the first line in the file is just a comment line and of no interest to us. We therefore read this line by infile.readline() and do not store the content in any object. The for loop over the lines in the file will then start from the next (second) line.

We first store all 13 lines of data in the months and rainfall lists. Thereafter, we manipulate these lists a bit since we want months to contain the names of the 12 months only. The rainfall list should correspond to this month list. The annual average is taken out of rainfall and stored in a separate variable. Recall that the -1 index corresponds to the last element of a list, and the slice :-1 picks out all elements from the start up to, but not including, the last element.
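
The indexing and slicing involved can be tried out on shortened lists. The lists below hold only three months plus the yearly entry and are an assumption made to keep the sketch brief.

```python
# Assumption: shortened versions of the lists built from the file
months = ['Jan', 'Feb', 'Mar', 'Year']
rainfall = [81.2, 63.2, 70.3, 792.9]

annual_avg = rainfall[-1]  # last element
rainfall = rainfall[:-1]   # everything except the last element
months = months[:-1]

print(months)      # ['Jan', 'Feb', 'Mar']
print(annual_avg)  # 792.9
```

The order of the statements matters: the annual average must be extracted before rainfall is redefined by the slice.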

We could, alternatively, have written shorter code where the names of the months and the rainfall numbers are stored in a nested list:

def extract_data(filename):
    infile = open(filename, 'r')
    infile.readline()  # skip the first line
    data = [line.split() for line in infile]
    annual_avg = float(data[-1][1])  # convert the yearly value to float
    data = [(m, float(r)) for m, r in data[:-1]]
    infile.close()
    return data, annual_avg

This is more advanced code, but understanding what is going on is a good test of one's understanding of nested list indexing and list comprehensions. An executable program is found in the file rainfall2.py.
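
To see how such a nested list is indexed, consider the sketch below. The list data is a shortened, hand-written stand-in (an assumption) for what the function returns: a list of (month, rainfall) pairs.

```python
# Assumption: shortened stand-in for the nested list returned above
data = [('Jan', 81.2), ('Feb', 63.2), ('Mar', 70.3)]

first_pair = data[0]      # ('Jan', 81.2)
first_month = data[0][0]  # 'Jan'
first_value = data[0][1]  # 81.2

# Split the pairs into two separate lists again
months = [m for m, r in data]
rainfall = [r for m, r in data]

print(months)    # ['Jan', 'Feb', 'Mar']
print(rainfall)  # [81.2, 63.2, 70.3]
```

The first index selects a pair and the second index selects the month name or the number within that pair.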

Is there more to file reading?

With the example code in this section, you have the very basic tools for reading files with a simple structure: columns of text or numbers. Many files used in scientific computations have such a format, but many files are more complicated too. Then you need the techniques of string processing.