Many programs need to manipulate text. For example, when we read the contents of a file into a string or list of strings (lines), we may want to change parts of the text in the string(s) - and maybe write out the modified text to a new file. So far in this document we have converted parts of the text to numbers and computed with the numbers. Now it is time to learn how to manipulate the text strings themselves.
Python has a rich set of operations on string objects. Some of the most common operations are listed below.
The expression s[i:j]
extracts the
substring starting with character number i
and ending with character number j-1
(similarly to lists, 0
is the
index of the first character):
>>> s = 'Berlin: 18.4 C at 4 pm'
>>> s[8:] # from index 8 to the end of the string
'18.4 C at 4 pm'
>>> s[8:12] # index 8, 9, 10 and 11 (not 12!)
'18.4'
A negative upper index
counts, as usual, from the right such that s[-1]
is the last
element, s[-2]
is the next last element, and so on.
>>> s[8:-1]
'18.4 C at 4 p'
>>> s[8:-8]
'18.4 C'
The call
s.find(s1)
returns
the index where
the substring s1
first appears in s
. If the substring is
not found, -1 is returned.
>>> s.find('Berlin') # where does 'Berlin' start?
0
>>> s.find('pm')
20
>>> s.find('Oslo') # not found
-1
Sometimes the aim is to just check if a string is contained in another string, and then we can use the syntax:
>>> 'Berlin' in s:
True
>>> 'Oslo' in s:
False
Here is a typical use of the latter construction in an if
test:
>>> if 'C' in s:
... print 'C found'
... else:
... print 'no C'
...
C found
Two other convenient methods
for checking if a string starts with or ends with a specified string
are startswith
and endswith
:
>>> s.startswith('Berlin')
True
>>> s.endswith('am')
False
The call
s.replace(s1, s2)
replaces substring
s1
by
s2
everywhere in s
:
>>> s.replace(' ', '_')
'Berlin:_18.4_C__at_4_pm'
>>> s.replace('Berlin', 'Bonn')
'Bonn: 18.4 C at 4 pm'
A variant of the last example, where several string operations are put together, consists of replacing the text before the first colon:
>>> s.replace(s[:s.find(':')], 'Bonn')
'Bonn: 18.4 C at 4 pm'
Take a break at this point and convince yourself that you understand how we specify the substring to be replaced!
The call s.split()
splits the string s
into words separated by
whitespace (space, tabulator, or newline):
>>> s.split()
['Berlin:', '18.4', 'C', 'at', '4', 'pm']
Splitting a string s
into words separated by a text t
can be
done by s.split(t)
. For example, we may split with respect to colon:
>>> s.split(':')
['Berlin', ' 18.4 C at 4 pm']
We know that s
contains a city name, a colon, a temperature,
and then C
:
>>> s = 'Berlin: 18.4 C at 4 pm'
With s.splitlines()
, a multi-line string is split into
lines (very useful when a file has been read into a string and we want
a list of lines):
>>> t = '1st line\n2nd line\n3rd line'
>>> print t
1st line
2nd line
3rd line
>>> t.splitlines()
['1st line', '2nd line', '3rd line']
s.lower()
transforms all characters to their lower case
equivalents, and s.upper()
performs a similar transformation to
upper case letters:
>>> s.lower()
'berlin: 18.4 c at 4 pm'
>>> s.upper()
'BERLIN: 18.4 C AT 4 PM'
A string cannot be changed, i.e., any change always results in a new string. Replacement of a character is not possible:
>>> s[18] = 5
...
TypeError: 'str' object does not support item assignment
If we want to replace s[18]
, a new string must
be constructed, for example by keeping the substrings on
either side of s[18]
and inserting a '5'
in between:
>>> s[:18] + '5' + s[19:]
'Berlin: 18.4 C at 5 pm'
One can easily test whether a string contains digits only or not:
>>> '214'.isdigit()
True
>>> ' 214 '.isdigit()
False
>>> '2.14'.isdigit()
False
We can also check if a string contains spaces only by calling the
isspace
method.
More precisely, isspace
tests for whitespace, which means the
space character, newline, or the TAB character:
>>> ' '.isspace() # blanks
True
>>> ' \n'.isspace() # newline
True
>>> ' \t '.isspace() # TAB
True
>>> ''.isspace() # empty string
False
The isspace
is handy for testing for blank lines in files.
An alternative is to strip first and then test for an empty string:
>>> line = ' \n'
>>> line.strip() == ''
True
Stripping off leading and/or trailing spaces in a string is sometimes useful:
>>> s = ' text with leading/trailing space \n'
>>> s.strip()
'text with leading/trailing space'
>>> s.lstrip() # left strip
'text with leading/trailing space \n'
>>> s.rstrip() # right strip
' text with leading/trailing space'
The opposite of the split
method is join
, which joins elements
in a list of strings with a specified delimiter in between.
That is,
the following two types of statements are inverse operations:
t = delimiter.join(words)
words = t.split(delimiter)
An example on using join
may be
>>> strings = ['Newton', 'Secant', 'Bisection']
>>> t = ', '.join(strings)
>>> t
'Newton, Secant, Bisection'
As an illustration of the usefulness of split
and join
,
we want to remove the first two words on a line.
This task can be done by
first splitting the line into words and then joining the words of interest:
>>> line = 'This is a line of words separated by space'
>>> words = line.split()
>>> line2 = ' '.join(words[2:])
>>> line2
'a line of words separated by space'
There are many more methods in string objects. All methods are described in the String Methods section of the Python Standard Library online document.
Suppose we have a file consisting of pairs of real numbers, i.e., text of the form \( (a,b) \), where \( a \) and \( b \) are real numbers. This notation for a pair of numbers is often used for points in the plane, vectors in the plane, and complex numbers. A sample file may look as follows:
(1.3,0) (-1,2) (3,-1.5)
(0,1) (1,0) (1,1)
(0,-0.01) (10.5,-1) (2.5,-2.5)
The file can be found as read_pairs1.dat
. Our task is to read this
text into a nested list pairs
such that pairs[i]
holds the pair
with index i
, and this pair is a tuple of two float
objects. We
assume that there are no blanks inside the parentheses of a pair of
numbers (we rely on a split operation, which would otherwise not work).
To solve this programming problem, we can read in the file line by line; for each line: split the line into words (i.e., split with respect to whitespace); for each word: strip off the parentheses, split with respect to comma, and convert the resulting two words to floats. Our brief algorithm can be almost directly translated to Python code:
# Load the file into list of lines
with open('read_pairs1.dat', 'r') as infile:
lines = infile.readlines()
# Analyze the contents of each line
pairs = [] # list of (n1, n2) pairs of numbers
for line in lines:
words = line.split()
for word in words:
word = word[1:-1] # strip off parenthesis
n1, n2 = word.split(',')
n1 = float(n1); n2 = float(n2)
pair = (n1, n2)
pairs.append(pair) # add 2-tuple to last row
This code is available in the file
read_pairs1.py.
The width
statement is the modern Python way of reading files,
with the advantage that we do not need to think about closing the file.
Figure 3 shows a snapshot of the state of the
variables in the program after having treated the first line.
You should explain each line in the program to yourself, and
compare your understanding with the figure.
The output from the program becomes
[(1.3, 0.0), (-1.0, 2.0), (3.0, -1.5), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.0, -0.01), (10.5, -1.0), (2.5, -2.5)]
We remark that our solution to this programming problem relies heavily on the fact that spaces inside the parentheses are not allowed. If spaces were allowed, the simple split to obtain the pairs on a line as words would not work. What can we then do?
We can first strip off all blanks on a line, and then observe that the
pairs are separated by the text ')('
. The first and last pair on a
line will have an extra parenthesis that we need to remove. The rest
of code is similar to the previous code and can be found in
read_pairs2.py:
with open('read_pairs2.dat', 'r') as infile:
lines = infile.readlines()
# Analyze the contents of each line
pairs = [] # list of (n1, n2) pairs of numbers
for line in lines:
line = line.strip() # remove whitespace such as newline
line = line.replace(' ', '') # remove all blanks
words = line.split(')(')
# strip off leading/trailing parenthesis in first/last word:
words[0] = words[0][1:] # (-1,3 -> -1,3
words[-1] = words[-1][:-1] # 8.5,9) -> 8.5,9
for word in words:
n1, n2 = word.split(',')
n1 = float(n1); n2 = float(n2)
pair = (n1, n2)
pairs.append(pair)
The program can be tested on the file read_pairs2.dat
:
(1.3 , 0) (-1 , 2 ) (3, -1.5)
(0 , 1) ( 1, 0) ( 1 , 1 )
(0,-0.01) (10.5,-1) (2.5, -2.5)
A third approach is to notice that if the pairs were separated by commas,
(1, 3.0), (-1, 2), (3, -1.5) (0, 1), (1, 0), (1, 1)
the file text is very close to the Python syntax of a list of 2-tuples. By adding enclosing brackets, plus a comma at the end of each line,
[(1, 3.0), (-1, 2), (3, -1.5), (0, 1), (1, 0), (1, 1),]
we have a string to which we can apply eval
to get the pairs
list
directly. Here is the code doing this (program read_pairs3.py):
with open('read_pairs3.dat', 'r') as infile:
listtext = '['
for line in infile:
# add line, without \n (line[:-1]), with a trailing comma:
listtext += line[:-1] + ', '
listtext = listtext + ']'
pairs = eval(listtext)
In general, it is a good idea to construct file formats that are as close
as possible to valid Python syntax such that one can take advantage
of the eval
or exec
functions to turn text into "live objects".
Suppose we have a file with coordinates \( (x,y,z) \) in three-dimensional space. The file format looks as follows:
x=-1.345 y= 0.1112 z= 9.1928
x=-1.231 y=-0.1251 z= 1001.2
x= 0.100 y= 1.4344E+6 z=-1.0100
x= 0.200 y= 0.0012 z=-1.3423E+4
x= 1.5E+5 y=-0.7666 z= 1027
The goal is to read this file and create a list with (x,y,z)
3-tuples, and thereafter convert the nested list to a two-dimensional
array with which we can compute.
Note that there is sometimes a space between the =
signs and the
following number and sometimes not. Splitting with respect to space and
extracting every second word is therefore not an option.
We shall present three solutions.
The file format looks very regular with the x=
, y=
,
and z=
texts starting in the same columns at every line.
By counting characters, we realize that the x=
text starts
in column 2, the y=
text starts in column 16, while the
z=
text starts in column 31. Introducing
x_start = 2
y_start = 16
z_start = 31
the three numbers in a line
string are obtained as the substrings
x = line[x_start+2:y_start]
y = line[y_start+2:z_start]
z = line[z_start+2:]
The following code, found in file
file2coor_v1.py,
creates the coor
array with shape \( (n,3) \),
where \( n \) is the number of \( (x,y,z) \) coordinates.
infile = open('xyz.dat', 'r')
coor = [] # list of (x,y,z) tuples
for line in infile:
x_start = 2
y_start = 16
z_start = 31
x = line[x_start+2:y_start]
y = line[y_start+2:z_start]
z = line[z_start+2:]
print 'debug: x="%s", y="%s", z="%s"' % (x,y,z)
coor.append((float(x), float(y), float(z)))
infile.close()
import numpy as np
coor = np.array(coor)
print coor.shape, coor
The print
statement inside the loop
is always wise to include when doing string manipulations,
simply
because counting indices for substring limits quickly leads to errors.
Running the program, the output from the loop looks like this
debug: x="-1.345 ", y=" 0.1112 ", z=" 9.1928 "
for the first line in the file. The double quotes show the exact
extent of the extracted coordinates. Note that the last quote
appears on the next line. This is because line
has a newline
at the end (this newline must be there to define the end of the line),
and the substring line[z_start:]
contains the newline at the
of line
. Writing line[z_start:-1]
would leave the
newline out of the \( z \) coordinate. However, this has no effect in
practice since we transform the substrings to float
, and an extra
newline or other blanks make no harm.
The coor
object at the end of the program has the value
[[ -1.34500000e+00 1.11200000e-01 9.19280000e+00] [ -1.23100000e+00 -1.25100000e-01 1.00120000e+03] [ 1.00000000e-01 1.43440000e+06 -1.01000000e+00] [ 2.00000000e-01 1.20000000e-03 -1.34230000e+04] [ 1.50000000e+05 -7.66600000e-01 1.02700000e+03]]
One problem with the solution approach above is that the program
will not work if the file format is subject to a
change in the
column positions of x=
, y=
, or z=
.
Instead of hardcoding numbers for the column positions, we can use
the find
method in string objects to locate these column positions:
x_start = line.find('x=')
y_start = line.find('y=')
z_start = line.find('z=')
The rest of the code is similar to the complete program listed above, and the complete code is stored in the file file2coor_v2.py.
String splitting is a powerful tool, also in the present case. Let us split with respect to the equal sign. The first line in the file then gives us the words
['x', '-1.345 y', ' 0.1112 z', ' 9.1928']
We throw away the first word, and strip off the last character in the next word. The final word can be used as is. The complete program is found in the file file2coor_v3.py and looks like
infile = open('xyz.dat', 'r')
coor = [] # list of (x,y,z) tuples
for line in infile:
words = line.split('=')
x = float(words[1][:-1])
y = float(words[2][:-1])
z = float(words[3])
coor.append((x, y, z))
infile.close()
import numpy as np
coor = np.array(coor)
print coor.shape, coor
More sophisticated examples of string operations appear in the section Example: Extracting data from HTML.