
This chapter is taken from the book A Primer on Scientific Programming with Python by H. P. Langtangen, 5th edition, Springer, 2016.

Writing data to file

Writing data to file is easy. There is basically one function to pay attention to: outfile.write(s), which writes a string s to a file handled by the file object outfile. Unlike print, outfile.write(s) does not append a newline character to the written string. It will therefore often be necessary to add a newline character,

outfile.write(s + '\n')
if the string s is meant to appear on a single line in the file and s does not already contain a trailing newline character. File writing is then a matter of constructing strings containing the text we want to have in the file and for each such string call outfile.write.

Writing to a file requires the file object outfile to be opened for writing:

# write to new file, or overwrite file:
outfile = open(filename, 'w')

# append to the end of an existing file:
outfile = open(filename, 'a')
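The two modes can be tried out in a few lines. The following is a minimal sketch, written for Python 3, and the scratch file name tmp_demo.txt is a hypothetical choice for illustration. It uses the with statement, which is not used elsewhere in this section, to close the file automatically.

```python
import os
import tempfile

# Hypothetical scratch file in a temporary directory
filename = os.path.join(tempfile.mkdtemp(), 'tmp_demo.txt')

# 'w' creates the file (or overwrites an existing one)
with open(filename, 'w') as outfile:
    outfile.write('first line' + '\n')   # newline must be added explicitly
    outfile.write('second ')             # no newline: the next write
    outfile.write('line\n')              # continues on the same line

# 'a' appends to the end of the existing file
with open(filename, 'a') as outfile:
    outfile.write('third line\n')

# Read the file back to inspect the result
with open(filename, 'r') as infile:
    lines = infile.readlines()
```

After running this, lines holds three strings, one per line in the file, each ending with the newline character we wrote ourselves.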

Example: Writing a table to file

Problem

As a worked example of file writing, we shall write out a nested list with tabular data to file. A sample list may look as

[[ 0.75,        0.29619813, -0.29619813, -0.75      ],
 [ 0.29619813,  0.11697778, -0.11697778, -0.29619813],
 [-0.29619813, -0.11697778,  0.11697778,  0.29619813],
 [-0.75,       -0.29619813,  0.29619813,  0.75      ]]

Solution

We iterate through the rows (first index) in the list, and for each row, we iterate through the column values (second index) and write each value to the file. At the end of each row, we must insert a newline character in the file to get a linebreak. The code resides in the file write1.py:

data = [[ 0.75,        0.29619813, -0.29619813, -0.75      ],
        [ 0.29619813,  0.11697778, -0.11697778, -0.29619813],
        [-0.29619813, -0.11697778,  0.11697778,  0.29619813],
        [-0.75,       -0.29619813,  0.29619813,  0.75      ]]

outfile = open('tmp_table.dat', 'w')
for row in data:
    for column in row:
        outfile.write('%14.8f' % column)
    outfile.write('\n')
outfile.close()

The resulting data file becomes

    0.75000000    0.29619813   -0.29619813   -0.75000000
    0.29619813    0.11697778   -0.11697778   -0.29619813
   -0.29619813   -0.11697778    0.11697778    0.29619813
   -0.75000000   -0.29619813    0.29619813    0.75000000

An extension of this program consists in adding column and row headings:

           column  1     column  2     column  3     column  4
row  1    0.75000000    0.29619813   -0.29619813   -0.75000000
row  2    0.29619813    0.11697778   -0.11697778   -0.29619813
row  3   -0.29619813   -0.11697778    0.11697778    0.29619813
row  4   -0.75000000   -0.29619813    0.29619813    0.75000000
To obtain this end result, we need to add some statements to the program write1.py. For the column headings we must know the number of columns, i.e., the length of the rows, and loop from 1 to this length:

ncolumns = len(data[0])
outfile.write('          ')
for i in range(1, ncolumns+1):
    outfile.write('%10s    ' % ('column %2d' % i))
outfile.write('\n')
Note the use of a nested printf construction: the text we want to insert is itself a printf string. We could also have written the text as 'column ' + str(i), but then the length of the resulting string would depend on the number of digits in i. It is recommended to always use printf constructions for a tabular output format, because this gives automatic padding of blanks so that the width of the output strings remains the same. The tuning of the widths is commonly done in a trial-and-error process.

To add the row headings, we need a counter over the row numbers:

row_counter = 1
for row in data:
    outfile.write('row %2d' % row_counter)
    for column in row:
        outfile.write('%14.8f' % column)
    outfile.write('\n')
    row_counter += 1
The complete code is found in the file write2.py. We could, alternatively, iterate over the indices in the list:

for i in range(len(data)):
    outfile.write('row %2d' % (i+1))
    for j in range(len(data[i])):
        outfile.write('%14.8f' % data[i][j])
    outfile.write('\n')
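The pieces above can also be collected in one function. This is a sketch for Python 3 under our own assumptions: the function name write_table is hypothetical, enumerate replaces the manual row counter, and an in-memory io.StringIO object stands in for a file opened with 'w' so that the result can be inspected as a string.

```python
import io

def write_table(data, outfile):
    """Write the nested list data, with headings, to the file object outfile."""
    ncolumns = len(data[0])
    outfile.write(' ' * 10)
    for i in range(1, ncolumns + 1):
        outfile.write('%10s    ' % ('column %2d' % i))
    outfile.write('\n')
    # enumerate supplies the row counter, starting at 1
    for row_counter, row in enumerate(data, 1):
        outfile.write('row %2d' % row_counter)
        for value in row:
            outfile.write('%14.8f' % value)
        outfile.write('\n')

data = [[0.75, 0.29619813], [-0.29619813, -0.75]]
buffer = io.StringIO()      # in-memory stand-in for a file opened with 'w'
write_table(data, buffer)
table = buffer.getvalue()
```

Since write_table only calls outfile.write, it should work equally well with a real file object returned by open.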

Standard input and output as file objects

Reading user input from the keyboard is done with the function raw_input. The keyboard is a medium that the computer in fact treats as a file, referred to as standard input.

The print command prints text in the terminal window. This medium is also viewed as a file from the computer's point of view and called standard output. All general-purpose programming languages allow reading from standard input and writing to standard output. This reading and writing can be done with two types of tools, either file-like objects or special tools like raw_input and print in Python. We will here describe the file-like objects: sys.stdin for standard input and sys.stdout for standard output. These objects behave as file objects, except that they do not need to be opened or closed. The statement

s = raw_input('Give s:')
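is essentially equivalent to the following, except that sys.stdin.readline() keeps the trailing newline character, which raw_input strips: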

print 'Give s: ',
s = sys.stdin.readline()
Recall that the trailing comma in the print statement avoids the newline that print by default adds to the output string. Similarly,

s = eval(raw_input('Give s:'))
is equivalent to

print 'Give s: ',
s = eval(sys.stdin.readline())
For output to the terminal window, the statement

print s
is, when s is a string, equivalent to

sys.stdout.write(s + '\n')

Why it is handy to have access to standard input and output as file objects can be illustrated by an example. Suppose you have a function that reads data from a file object infile and writes data to a file object outfile. A sample function may take the form

def x2f(infile, outfile, f):
    for line in infile:
        x = float(line)
        y = f(x)
        outfile.write('%g\n' % y)
This function works with all types of files, including web pages as infile. With sys.stdin as infile and/or sys.stdout as outfile, the x2f function also works with standard input and/or standard output. Without sys.stdin and sys.stdout, we would need different code, employing raw_input and print, to deal with standard input and output. Now we can write a single function that deals with all file media in a unified way.
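To illustrate the point, here is a small sketch for Python 3 that calls x2f with two different media. The function square and the in-memory io.StringIO objects are our own assumptions for the sake of the example. They play the role of a user-supplied f and of ordinary files.

```python
import io
import sys

def x2f(infile, outfile, f):
    for line in infile:
        x = float(line)
        y = f(x)
        outfile.write('%g\n' % y)

def square(x):
    """Hypothetical example function f."""
    return x**2

# File-like objects in memory, standing in for files on disk
infile = io.StringIO('1\n2.5\n-3\n')
outfile = io.StringIO()
x2f(infile, outfile, square)
result = outfile.getvalue()

# The same function, now writing to the terminal window
x2f(io.StringIO('4\n'), sys.stdout, square)
```

The body of x2f is untouched in both calls; only the objects passed as infile and outfile differ.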

There is also something called standard error. Usually this is the terminal window, just as standard output, but programs can distinguish between writing ordinary output to standard output and error messages to standard error, and these output media can be redirected to, e.g., files such that one can separate error messages from ordinary output. In Python, standard error is the file-like object sys.stderr. A typical application of sys.stderr is to report errors:

if x < 0:
    sys.stderr.write('Illegal value of x\n')
    sys.exit(1)
This message to sys.stderr is an alternative to print or raising an exception.

Redirecting standard input, output, and error

Standard output from a program prog can be redirected to a file output instead of the screen, by using the greater than sign:

Terminal> prog > output
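Here, prog can be any program, including a Python program run as python myprog.py. Similarly, output to the medium called standard error can be redirected, together with standard output, by using &> in the Bash shell (use 2> to redirect standard error only):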

Terminal> prog &> output
For example, error messages are normally written to standard error, which is exemplified in this little terminal session on a Unix machine:

Terminal> ls bla-bla1 bla-bla2
ls: cannot access bla-bla1: No such file or directory
ls: cannot access bla-bla2: No such file or directory
Terminal> ls bla-bla1 bla-bla2 &> errors
Terminal> cat errors  # print the file errors
ls: cannot access bla-bla1: No such file or directory
ls: cannot access bla-bla2: No such file or directory
When the program reads from standard input (the keyboard), we can equally well redirect standard input from a file, say with name input, such that the program reads from this file rather than from the keyboard:

Terminal> prog < input
Combinations are also possible:

Terminal> prog < input > output
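The separation of standard output and standard error can also be observed from within Python. The sketch below, for Python 3, is our own illustration and not part of the redirection syntax above: it starts a small child Python program with the standard subprocess module and collects the two output streams in separate strings. The variable names are hypothetical.

```python
import subprocess
import sys

# Child program: one line to standard output, one to standard error
child_code = (
    "import sys\n"
    "sys.stdout.write('ordinary output\\n')\n"
    "sys.stderr.write('error message\\n')\n"
)

process = subprocess.Popen(
    [sys.executable, '-c', child_code],
    stdout=subprocess.PIPE,    # collect standard output here
    stderr=subprocess.PIPE,    # collect standard error separately
    universal_newlines=True)   # return text (str) rather than bytes
out, err = process.communicate()
```

Here out contains only the ordinary output, while err contains the error message, just as the two would end up in different files when redirected separately in the terminal.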

Note

The redirection of standard output, input, and error does not work for Python programs executed with the run command inside IPython, only when executed directly in the operating system in a terminal window, or with the same command prefixed with an exclamation mark in IPython.

Inside a Python program we can also let standard input, output, and error work with ordinary files instead. Here is the technique:

sys_stdout_orig = sys.stdout
sys.stdout = open('output', 'w')
sys_stdin_orig = sys.stdin
sys.stdin = open('input', 'r')
Now, any print statement will write to the output file, and any raw_input call will read from the input file. (Without storing the original sys.stdout and sys.stdin objects in new variables, these objects would get lost in the redefinition above and we would never be able to reach the common standard input and output in the program.)
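The stored original objects make it possible to switch back again. A minimal sketch for Python 3 follows, where an in-memory io.StringIO object is our stand-in for open('output', 'w') and the try-finally construction is our own addition to make sure the original sys.stdout is restored:

```python
import io
import sys

sys_stdout_orig = sys.stdout
sys.stdout = io.StringIO()         # stand-in for open('output', 'w')
try:
    print('this text ends up in the buffer')
    captured = sys.stdout.getvalue()
finally:
    sys.stdout = sys_stdout_orig   # restore, even if an error occurred

print('back in the terminal window')
```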

What is a file, really?

This section is not mandatory for understanding the rest of the document. Nevertheless, the information here is fundamental for understanding what files are about.

A file is simply a sequence of characters. In addition to the sequence of characters, a file has some data associated with it, typically the name of the file, its location on the disk, and the file size. These data are stored somewhere by the operating system. Without this extra information beyond the pure file contents as a sequence of characters, the operating system cannot find a file with a given name on the disk.

Each character in the file is represented as a byte, consisting of eight bits. Each bit is either 0 or 1. The zeros and ones in a byte can be combined in \( 2^8=256 \) ways. This means that there are 256 different types of characters. Some of these characters can be recognized from the keyboard, but there are also characters that do not have a familiar symbol. Such characters look cryptic when printed.
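The counting can be checked in Python. The sketch below assumes Python 3, where the bytes type holds raw byte values, and it assumes the ASCII convention in which the byte values 65 to 68 correspond to the characters A to D:

```python
# One byte holds 8 bits, giving 2**8 distinct values
n_values = 2**8

# Every possible byte value, once each
all_bytes = bytes(range(256))

# Byte values 65-68 correspond to the keyboard characters A-D (ASCII)
abcd = bytes([65, 66, 67, 68]).decode('ascii')

# Byte value 4 has no familiar symbol; repr shows it as an escape code
cryptic = repr(bytes([4]))
```

The string cryptic contains the escape code \x04, which is how Python displays a byte without a familiar symbol.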

Pure text files

To see that a file is really just a sequence of characters, invoke an editor for plain text, typically the editor you use to write Python programs. Write the four characters ABCD into the editor, do not press the Return key, and save the text to a file test1.txt. Use your favorite tool for file and folder overview and move to the folder containing the test1.txt file. This tool may be Windows Explorer, My Computer, or a DOS window on Windows; a terminal window, Konqueror, or Nautilus on Linux; or a terminal window or Finder on Mac. If you choose a terminal window, use the cd (change directory) command to move to the proper folder and write dir (Windows) or ls -l (Linux/Mac) to list the files and their sizes. In a graphical program like Windows Explorer, Konqueror, Nautilus, or Finder, select a view that shows the size of each file (choose view as details in Windows Explorer, View as List in Nautilus, the list view icon in Finder, or you just point at a file icon in Konqueror and watch the pop-up text). You will see that the test1.txt file has a size of 4 bytes (if you use ls -l, the size measured in bytes is found in column 5, right before the date). The 4 bytes are exactly the 4 characters ABCD in the file. Physically, the file is just a sequence of 4 bytes on your hard disk.

Go back to the editor again and add a newline by pressing the Return key. Save this new version of the file as test2.txt. When you now check the size of the file it has grown to five bytes. The reason is that we added a newline character (symbolically known as backslash n: \n).

Instead of examining files via editors and folder viewers we may use Python interactively:

>>> file1 = open('test1.txt', 'r').read()  # read file into string
>>> file1
'ABCD'
>>> len(file1)        # length of string in bytes/characters
4
>>> file2 = open('test2.txt', 'r').read()
>>> file2
'ABCD\n'
>>> len(file2)
5
Python has in fact a function that returns the size of a file directly:

>>> import os
>>> size = os.path.getsize('test1.txt')
>>> size
4
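The two test files can also be created from Python instead of an editor. This is a sketch for Python 3, assuming a temporary scratch folder (our choice) and binary mode 'wb', which writes the bytes exactly as given:

```python
import os
import tempfile

folder = tempfile.mkdtemp()          # hypothetical scratch folder
name1 = os.path.join(folder, 'test1.txt')
name2 = os.path.join(folder, 'test2.txt')

# Binary mode ('wb') writes the bytes exactly as given
with open(name1, 'wb') as outfile:
    outfile.write(b'ABCD')
with open(name2, 'wb') as outfile:
    outfile.write(b'ABCD\n')

size1 = os.path.getsize(name1)       # four characters, four bytes
size2 = os.path.getsize(name2)       # the newline adds one byte
```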

Word processor files

Most computer users write text in a word processing program, such as Microsoft Word or LibreOffice. Let us investigate what happens with our four characters ABCD in such a program. Start the word processor, open a new document, and type in the four characters ABCD only. Save the document as a .docx file (Microsoft Word) or an .odt file (LibreOffice). Load this file into an editor for pure text and look at the contents. You will see that there are numerous strange characters that you did not write (!). This additional "text" contains information on what type of document this is, the font you used, etc. The LibreOffice version of this file has 8858 bytes and the Microsoft Word version contains over 26 kB! However, if you save the file as a pure text file, with extension .txt, the size is down to eight bytes in LibreOffice and five in Microsoft Word.

Instead of loading the LibreOffice file into an editor we can again read the file contents into a string in Python and examine this string:

>>> infile = open('test3.odt', 'r')  # open LibreOffice file
>>> s = infile.read()
>>> len(s)   # file size
8858
>>> s
'PK\x03\x04\x14\x00\x00\x08\x00\x00sKWD^\xc62\x0c\'\x00...
\x00meta.xml<?xml version="1.0" encoding="UTF-8"?>\n<office:...
" xmlns:meta="urn:oasis:names:tc:opendocument:xmlns:meta:1.0"
Each backslash followed by x and a number is a code for a special character not found on the keyboard (recall that there are 256 characters and only a subset is associated with keyboard symbols). Although we show just a small portion of all the characters in this file in the above output (otherwise, the output would have occupied several pages in this document with thousands of symbols like \x04...), we can guarantee that you cannot find the pure sequence of characters ABCD. However, the computer program that generated the file, LibreOffice in this example, can easily interpret the meaning of all the characters in the file and translate the information into nice, readable text on the screen where you can recognize the text ABCD.
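The leading characters PK\x03\x04 in the output are the signature of a zip archive, which is the container format an .odt file uses to bundle XML files such as meta.xml. The following sketch for Python 3 does not read a real .odt file. It builds a small zip archive in memory with the standard zipfile module, with member names and contents chosen by us for illustration, and shows that a program knowing the format can recover the text:

```python
import io
import zipfile

# Build a small zip archive in memory (a stand-in, not a real .odt file)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w', zipfile.ZIP_DEFLATED) as archive:
    archive.writestr('meta.xml', '<?xml version="1.0" encoding="UTF-8"?>\n')
    archive.writestr('content.xml', '<text>' + 'ABCD ' * 200 + '</text>')

raw = buffer.getvalue()     # the archive as a raw sequence of bytes
signature = raw[:4]         # first four bytes of the archive

# The zipfile module knows the format and recovers the original text
with zipfile.ZipFile(io.BytesIO(raw)) as archive:
    names = archive.namelist()
    content = archive.read('content.xml').decode('utf-8')
```

The raw bytes begin with the same four-byte signature as the LibreOffice file above, while the readable text only reappears after a program has interpreted the format.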

You are now in a position to look into Exercise 8: Try MSWord or LibreOffice to write a program to see what happens if one attempts to use LibreOffice to write Python programs.

Image files

A digital image - captured by a digital camera or a mobile phone - is a file. And since it is a file, the image is just a sequence of characters. Loading some JPEG file into a pure text editor reveals all the strange characters in there. On the first line you will (normally) find some recognizable text in between the strange characters. This text reflects the type of camera used to capture the image and the date and time when the picture was taken. The next lines contain more information about the image. Thereafter, the file contains a set of numbers representing the image. The basic representation of an image is a set of \( m\times n \) pixels, where each pixel has a color represented as a combination of 256 values of red, green, and blue, which can be stored as three bytes (resulting in \( 256^3 \) color values). A 6-megapixel camera will then need to store \( 3\times 6\cdot 10^6 = 18 \) megabytes for one picture. The JPEG file contains only a couple of megabytes. The reason is that JPEG is a compressed file format, produced by applying a smart technique that can throw away pixel information in the original picture such that the human eye can hardly detect the inferior quality.
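The storage estimate is easy to reproduce. In this small sketch the variable names are our own:

```python
bytes_per_pixel = 3                   # one byte each for red, green, blue
colors = 256**bytes_per_pixel         # number of distinct color values
pixels = 6 * 10**6                    # 6-megapixel camera
raw_size = bytes_per_pixel * pixels   # bytes needed without compression
raw_megabytes = raw_size / 10.0**6    # the same size in megabytes
```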

A video is just a sequence of images, and therefore a video is also a stream of bytes. If the change from one video frame (image) to the next is small, one can use smart methods to compress the image information in time. Such compression is particularly important for videos since the file sizes soon get too large for being transferred over the Internet. A small video file occasionally has bad visual quality, caused by too much compression.

Music files

An MP3 file is much like a JPEG file: first, there is some information about the music (artist, title, album, etc.), and then comes the music itself as a stream of bytes. A typical MP3 file has a size of something like five million bytes or five megabytes (5 MB). The exact size depends on the complexity of the music, the length of the track, and the MP3 resolution. On a 16 GB MP3 player you can then store roughly \( 16,000,000,000/5,000,000 = 3200 \) MP3 files. MP3 is, like JPEG, a compressed format. The complete data of a song on a CD (the WAV file) contains about ten times as many bytes. As for pictures, the idea is that one can throw away a lot of bytes in an intelligent way, such that the human ear hardly detects the difference between a compressed and uncompressed version of the music file.
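The same kind of estimate applies to the music player. In this sketch the variable names are our own, and the factor of ten for the uncompressed file is the rough figure quoted above:

```python
player_capacity = 16 * 10**9     # 16 gigabytes, in bytes
mp3_size = 5 * 10**6             # typical MP3 file, in bytes
n_mp3_files = player_capacity // mp3_size

wav_size = 10 * mp3_size         # uncompressed: about ten times as many bytes
n_wav_files = player_capacity // wav_size
```

With these rough numbers the player holds about ten times as many compressed songs as uncompressed ones.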

PDF files

Looking at a PDF file in a pure text editor shows that the file contains some readable text mixed with some unreadable characters. It is not possible for a human to look at the stream of bytes and deduce the text in the document (well, from the assumption that there are always some strange people doing strange things, there might be somebody out there who, with a lot of training, can interpret the pure PDF code with the eyes). A PDF file reader can easily interpret the contents of the file and display the text in a human-readable form on the screen.

Remarks

We have repeated many times that a file is just a stream of bytes. A human can interpret (read) the stream of bytes if it makes sense in a human language - or a computer language (provided the human is a programmer). When the series of bytes does not make sense to any human, a computer program must be used to interpret the sequence of characters.

Think of a report. When you write the report as pure text in a text editor, the resulting file contains just the characters you typed in from the keyboard. On the other hand, if you applied a word processor like Microsoft Word or LibreOffice, the report file contains a large number of extra bytes describing properties of the formatting of the text. This stream of extra bytes does not make sense to a human, and a computer program is required to interpret the file content and display it in a form that a human can understand. Behind the sequence of bytes in the file there are strict rules telling what the series of bytes means. These rules reflect the file format. When the rules or file format is publicly documented, a programmer can use this documentation to make her own program for interpreting the file contents (however, interpreting such files is much more complicated than our examples on reading human-readable files in this document). It happens, though, that secret file formats are used, which require certain programs from certain companies to interpret the files.