This chapter is taken from the book A Primer on Scientific Programming with Python by H. P. Langtangen, 5th edition, Springer, 2016.
Getting input data into a program from the command line, or from questions and answers in the terminal window, works for small amounts of data. Otherwise, input data must be available in files. Anyone with some computer experience is used to saving and loading data files in programs. The task now is to understand how Python programs can read and write files. The basic recipes are quite simple and are illustrated through examples.
Suppose we have recorded some measurement data in the file src/input/data.txt. The goal of our first example of reading files is to read the measurement values in data.txt, find the average value, and print it out in the terminal window.
Before trying to let a program read a file, we must know the file format, i.e., what the contents of the file look like, because the structure of the text in the file greatly influences the set of statements needed to read the file. We therefore start with viewing the contents of the file data.txt. To this end, load the file into a text editor or viewer (one can use emacs, vim, more, or less on Unix and Mac, while on Windows, WordPad is appropriate, or the type command in a DOS or PowerShell window; word processors such as LibreOffice or Microsoft Word can also be used). What we see is a column with numbers:
21.8
18.1
19
23
26
17.8
Our task is to read this column of numbers into a list in the program and compute the average of the list items.
To read a file, we first need to open the file. This action creates a file object, here stored in the variable infile:
infile = open('data.txt', 'r')
The second argument to the open function, the string 'r', specifies that we want to open the file for reading. We shall later see that a file can be opened for writing instead, by providing 'w' as the second argument. After the file is read, one should close the file object with infile.close().
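The open, read, close sequence can be sketched as follows. This is only an illustration written for Python 3; the temporary directory and the three-line sample file created at the top are assumptions made so that the snippet runs on its own.

```python
import os
import tempfile

# Assumption: create a small stand-in for data.txt in a temporary directory.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, 'data.txt')
outfile = open(path, 'w')          # 'w' opens the file for writing
outfile.write('21.8\n18.1\n19\n')
outfile.close()

infile = open(path, 'r')           # 'r' opens the file for reading
first_line = infile.readline()     # read one line, including its newline
infile.close()                     # release the file when we are done

print(repr(first_line))
print(infile.closed)               # True once the file object is closed
```

The closed attribute of the file object is a convenient way to check whether infile.close() has been called.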
The basic technique for reading the file line by line applies a for loop like this:
for line in infile:
    # do something with line
The line variable is a string holding the current line in the file. The for loop over lines in a file has the same syntax as when we go through a list. Just think of the file object infile as a collection of elements, here lines in a file, and the for loop visits these elements in sequence such that the line variable refers to one line at a time. If something seemingly goes wrong in such a loop over lines in a file, it is useful to print the line variable inside the loop.
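A small sketch of this debugging technique is shown here, written for Python 3. The io.StringIO object is an assumption: it stands in for an opened data.txt so that the snippet runs without a file on disk, and it can be looped over exactly like a file object.

```python
import io

# Assumption: io.StringIO replaces open('data.txt', 'r') for illustration.
infile = io.StringIO('21.8\n18.1\n19\n')

seen = []
for line in infile:
    # repr shows invisible characters such as the trailing newline
    print(repr(line))
    seen.append(line)
```

Printing repr(line) rather than line itself reveals that every line still carries its newline character, which is often the source of surprises.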
Instead of reading one line at a time, we can load all lines into a list of strings (lines) by
lines = infile.readlines()
This statement is equivalent to
lines = []
for line in infile:
    lines.append(line)
or the list comprehension:
lines = [line for line in infile]
In the present example, we load the file into the list lines.
The next task is to compute the average of the numbers in the file. Trying
a straightforward sum of all numbers on all lines,
mean = 0
for number in lines:
    mean = mean + number
mean = mean/len(lines)
gives an error message:
TypeError: unsupported operand type(s) for +: 'int' and 'str'
The reason is that lines holds each line (number) as a string, not a float or int that we can add to other numbers. A fix is to convert each line to a float:
mean = 0
for line in lines:
    number = float(line)
    mean = mean + number
mean = mean/len(lines)
This code snippet works fine. The complete code can be found in the file mean1.py.
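It may be worth seeing why the conversion succeeds even though each line ends with a newline character. The following sketch (Python 3, with made-up sample strings) suggests that float ignores surrounding whitespace, while a line containing no number at all cannot be converted.

```python
# float ignores surrounding whitespace, including the trailing newline
print(float('21.8\n'))
print(float('  19 \n'))

# A line without a number (here a blank line) cannot be converted
try:
    float('\n')
    blank_ok = True
except ValueError:
    blank_ok = False
print(blank_ok)
```

A data file with blank lines would therefore stop the loop above with a ValueError, so such lines must be skipped before the conversion.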
Summing up a list of numbers is often done in numerical programs, so Python has a special function sum for performing this task. However, sum must in the present case operate on a list of floats, not strings. We can use a list comprehension to turn all elements in lines into corresponding float objects:
mean = sum([float(line) for line in lines])/len(lines)
An alternative implementation is to load the lines into a list of float objects directly. Using this strategy, the complete program (found in file mean2.py) takes the form
infile = open('data.txt', 'r')
numbers = [float(line) for line in infile.readlines()]
infile.close()
mean = sum(numbers)/len(numbers)
print(mean)
A newcomer to programming might find it confusing to see that one problem is solved by many alternative sets of statements, but this is the very nature of programming. A clever programmer will judge several alternative solutions to a programming task and choose one that is particularly compact, easy to understand, and/or easy to extend later. We therefore present more examples of how to read the data.txt file and compute with the data.
Modern Python code applies the with statement to deal with files:
with open('data.txt', 'r') as infile:
    for line in infile:
        # process line
This snippet is equivalent to
infile = open('data.txt', 'r')
for line in infile:
    # process line
infile.close()
Note that there is no need to close the file when using the with statement. The advantage of the with construction is shorter code and better handling of errors if something goes wrong with opening or working with the file. A downside is that the syntax differs from the very classical open-close pattern that one finds in most other programming languages. Remembering to close a file is key in programming, and to practice that habit, we mostly apply the open-close construction in this document.
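The better error handling can be illustrated with a minimal sketch (Python 3). The sample file with a deliberately invalid second line is an assumption, written to a temporary directory so that the snippet is self-contained.

```python
import os
import tempfile

# Assumption: a small sample file whose second line is not a number.
path = os.path.join(tempfile.mkdtemp(), 'data.txt')
with open(path, 'w') as outfile:
    outfile.write('21.8\nnot a number\n')

try:
    with open(path, 'r') as infile:
        for line in infile:
            number = float(line)   # fails on the second line
except ValueError:
    pass

# The with statement has closed the file although the loop was interrupted
print(infile.closed)
```

With the plain open-close pattern, the infile.close() call after the loop would have been skipped when the error occurred.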
The call infile.readline() returns a string containing the text at the current line. A new infile.readline() will read the next line. When infile.readline() returns an empty string, the end of the file is reached and we must stop further reading. The following while loop reads the file line by line using infile.readline():
while True:
    line = infile.readline()
    if not line:
        break
    # process line
This is perhaps a somewhat strange loop, but it is a well-established way of reading a file in Python, especially in older code. The shown while loop would run forever since the condition is always True. However, inside the loop we test whether line evaluates to False, which it does when we reach the end of the file, because line then becomes an empty string, and an empty string counts as False in a boolean test. When that happens, the break statement breaks the loop and makes the program flow jump to the first statement after the while block.
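One detail that is easy to overlook: a blank line in the middle of a file is not an empty string, since it still contains the newline character. The following sketch (Python 3) illustrates this; the io.StringIO object is an assumption standing in for a file with a blank second line.

```python
import io

# Assumption: io.StringIO replaces a real file that contains a blank line.
infile = io.StringIO('21.8\n\n18.1\n')

lines = []
while True:
    line = infile.readline()
    if not line:          # only the empty string '' signals end of file
        break
    lines.append(line)

print(lines)
```

The loop therefore reads all three lines, including the blank one, and stops only at the end of the file.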
Computing the average of the numbers in the data.txt file can now be done in yet another way:
infile = open('data.txt', 'r')
mean = 0
n = 0
while True:
    line = infile.readline()
    if not line:
        break
    mean += float(line)
    n += 1
infile.close()
mean = mean/float(n)
The call infile.read() reads the whole file and returns the text as a string object. The following interactive session illustrates the use and result of infile.read():
>>> infile = open('data.txt', 'r')
>>> filestr = infile.read()
>>> filestr
'21.8\n18.1\n19\n23\n26\n17.8\n'
>>> print(filestr)
21.8
18.1
19
23
26
17.8
Note the difference between just writing filestr and writing print(filestr). The former dumps the string with newlines as backslash n characters, while the latter is a pretty print where the string is written out without quotes and with the newline characters as visible line breaks.
Having the numbers inside a string instead of inside a file does not look like a major step forward. However, string objects have many useful functions for extracting information. A very useful feature is split: filestr.split() will split the string into words (separated by whitespace, or by any other delimiter string that you pass as an argument). The "words" in this file are the numbers:
>>> words = filestr.split()
>>> words
['21.8', '18.1', '19', '23', '26', '17.8']
>>> numbers = [float(w) for w in words]
>>> mean = sum(numbers)/len(numbers)
>>> print(mean)
20.95
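The split method can also be given an explicit delimiter. A brief sketch (Python 3) is shown here; the comma-separated sample string is hypothetical and not taken from data.txt.

```python
# Default: split on any run of whitespace (blanks, tabs, newlines)
default_words = '21.8  18.1\n19'.split()
print(default_words)

# Explicit delimiter: split on commas (hypothetical sample string)
words = '21.8,18.1,19'.split(',')
numbers = [float(w) for w in words]
print(numbers)
```

Without an argument, any run of whitespace acts as one separator, which is why the newlines in filestr pose no problem.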
A more compact program looks as follows (mean3.py):
infile = open('data.txt', 'r')
numbers = [float(w) for w in infile.read().split()]
mean = sum(numbers)/len(numbers)
The next section tells you more about splitting strings.
The data.txt file has a very simple structure since it contains numbers only. Many data files contain a mix of text and numbers. The file rainfall.dat from www.worldclimate.com provides an example:
Average rainfall (in mm) in Rome: 1188 months between 1782 and 1970
Jan 81.2
Feb 63.2
Mar 70.3
Apr 55.7
May 53.0
Jun 36.4
Jul 17.5
Aug 27.5
Sep 60.9
Oct 117.7
Nov 111.0
Dec 97.9
Year 792.9
How can we read the rainfall data in this file and store the information
in lists suitable for further analysis?
The most straightforward solution is to read the file line by line, and for each line split the line into words, store the first word (the month) in one list and the second word (the average rainfall) in another list. The elements in this latter list need to be float objects if we want to compute with them.
The complete code, wrapped in a function, may look like this (file rainfall1.py):
def extract_data(filename):
    infile = open(filename, 'r')
    infile.readline()  # skip the first line
    months = []
    rainfall = []
    for line in infile:
        words = line.split()
        # words[0]: month, words[1]: rainfall
        months.append(words[0])
        rainfall.append(float(words[1]))
    infile.close()
    months = months[:-1]       # Drop the "Year" entry
    annual_avg = rainfall[-1]  # Store the annual average
    rainfall = rainfall[:-1]   # Redefine to contain monthly data
    return months, rainfall, annual_avg

months, values, avg = extract_data('rainfall.dat')
print('The average rainfall for the months:')
for month, value in zip(months, values):
    print(month, value)
print('The average rainfall for the year:', avg)
Note that the first line in the file is just a comment line and of no interest to us. We therefore read this line by infile.readline() and do not store the content in any object. The for loop over the lines in the file will then start from the next (second) line.
We store all the data into 13 elements in the months and rainfall lists. Thereafter, we manipulate these lists a bit since we want months to contain the names of the 12 months only. The rainfall list should correspond to this months list. The annual average is taken out of rainfall and stored in a separate variable. 
Recall that the -1 index corresponds to the last element of a list, and the slice :-1 picks out all elements from the start up to, but not including, the last element.
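A short illustration of the two operations follows (Python 3); the shortened rainfall list is a hypothetical stand-in for the 13 values read from the file.

```python
# Hypothetical, shortened list: three monthly values plus the yearly value
rainfall = [81.2, 63.2, 70.3, 792.9]

last = rainfall[-1]    # the last element
rest = rainfall[:-1]   # a new list with everything except the last element

print(last)
print(rest)
```

Note that slicing creates a new list and leaves the original list unchanged.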
We could, alternatively, have written a shorter code where the name of the months and the rainfall numbers are stored in a nested list:
def extract_data(filename):
    infile = open(filename, 'r')
    infile.readline()  # skip the first line
    data = [line.split() for line in infile]
    annual_avg = float(data[-1][1])  # convert the yearly value to float
    data = [(m, float(r)) for m, r in data[:-1]]
    infile.close()
    return data, annual_avg
This is more advanced code, but understanding what is going on is a good test of your understanding of nested list indexing and list comprehensions. An executable program is found in the file rainfall2.py.
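To work with the nested list returned by this version of extract_data, one may index it or unpack it, for instance as sketched here (Python 3). The shortened data list is an assumption with the same form as the returned list.

```python
# Assumption: a shortened list of (month, rainfall) pairs
data = [('Jan', 81.2), ('Feb', 63.2), ('Mar', 70.3)]

first_month, first_value = data[0]   # unpack one (month, value) pair
values = [r for m, r in data]        # pick out all rainfall values
months, rainfall = zip(*data)        # split into two tuples

print(first_month, first_value)
print(values)
print(months)
```

The zip(*data) construction returns tuples, not lists; wrap the results in list() if they are to be modified later.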
With the example code in this section, you have the very basic tools for reading files with a simple structure: columns of text or numbers. Many files used in scientific computations have such a format, but many files are more complicated too. Then you need the techniques of string processing.