$$ \newcommand{\tp}{\thinspace .} $$

 

 

 

Summary

Chapter topics

Dictionaries

Array or list-like objects with text or other (fixed-valued) Python objects as indices are called dictionaries. They are very useful for storing general collections of objects in a single data structure. The table below displays some of the most important dictionary operations.

Construction Meaning
a = {} initialize an empty dictionary
a = {'point': [0,0.1], 'value': 7} initialize a dictionary
a = dict(point=[2,7], value=3) initialize a dictionary w/string keys
a.update(b) add/update key-value pairs from b in a
a.update(key1=value1, key2=value2) add/update key-value pairs in a
a['hide'] = True add new key-value pair to a
a['point'] get value corresponding to key point
for key in a: loop over keys in unknown order
for key in sorted(a): loop over keys in alphabetic order
'value' in a True if string value is a key in a
del a['point'] delete a key-value pair from a
list(a.keys()) list of keys
list(a.values()) list of values
len(a) number of key-value pairs in a
isinstance(a, dict) is True if a is a dictionary

Strings

Some of the most useful functionalities in a string object s are listed below.

Split the string into substrings separated by delimiter:

words = s.split(delimiter)

Join elements in a list of strings:

newstring = delimiter.join(words[i:j])

Extract a substring:

substring = s[2:n-4]

Replace a substring substr by new a string replacement:

s_new = s.replace(substr, replacement)

Check if a substring is contained within another string:

if 'some text' in s:
    ...

Find the index where some text starts:

index = s.find(text)
if index == -1:
    print 'Could not find "%s" in "%s" (text, s)
else:
    substring = s[index:]  # strip off chars before text

Extend a string:

s += another_string     # append at the end
s = another_string + s  # append at the beginning

Check if a string contains whitespace only:

if s.isspace():
   ...

Note: you cannot change the characters in a string like you can change elements in a list (a string is in this sense like a tuple). You have to make a new string:

>>> filename = 'myfile1.txt'
>>> filename[6] = '2'
Traceback (most recent call last):
  ...
TypeError: 'str' object does not support item assignment
>>> filename.replace('1', '2')
'myfile2.txt'
>>> filename[:6] + '2' + filename[7:]   # 'myfile' + '2' + '.txt'
'myfile2.txt'

Downloading Internet files

Internet files can be downloaded if we know their URL:

import urllib
url = 'http://www.some.where.net/path/thing.html'
urllib.urlretrieve(url, filename='thing.html')

The downloaded information is put in the local file thing.html in the current working folder. Alternatively, we can open the URL as a file object:

webpage = urllib.urlopen(url)

HTML files are often messy to interpret by string operations.

Terminology

The important computer science topics in this document are

Example: A file database

Problem

We have a file containing information about the courses that students have taken. The file format consists of blocks with student data, where each block starts with the student's name (Name:), followed by the courses that the student has taken. Each course line starts with the name of the course, then comes the semester when the exam was taken, then the size of the course in terms of credit points, and finally the grade is listed (letters A to F). Here is an example of a file with three student entries:

Name: John Doe
Astronomy                         2003 fall 10 A
Introductory Physics              2003 fall 10 C
Calculus I                        2003 fall 10 A
Calculus II                       2004 spring 10 B
Linear Algebra                    2004 spring 10 C
Quantum Mechanics I               2004 fall 10 A
Quantum Mechanics II              2005 spring 10 A
Numerical Linear Algebra          2004 fall 5 E
Numerical Methods                 2004 spring 20 C

Name: Jan Modaal
Calculus I                        2005 fall 10 A
Calculus II                       2006 spring 10 A
Introductory C++ Programming      2005 fall 15 D
Introductory Python Programming   2006 spring 5 A
Astronomy                         2005 fall 10 A
Basic Philosophy                  2005 fall 10 F

Name: Kari Nordmann
Introductory Python Programming   2006 spring 5 A
Astronomy                         2005 fall 10 D

Our problem consists of reading this file into a dictionary data with the student name as key and a list of courses as value. Each element in the list of courses is a dictionary holding the course name, the semester, the credit points, and the grade. A value in the data dictionary may look as

'Kari Nordmann': [{'credit': 5,
                   'grade': 'A',
                   'semester': '2006 spring',
                   'title': 'Introductory Python Programming'},
                  {'credit': 10,
                   'grade': 'D',
                   'semester': '2005 fall',
                   'title': 'Astronomy'}],

Having the data dictionary, the next task is to print out the average grade of each student.

Solution

We divide the problem into two major tasks: loading the file data into the data dictionary, and computing the average grades. These two tasks are naturally placed in two functions.

We need to have a strategy for reading the file and interpreting the contents. It will be natural to read the file line by line, and for each line check if this is a line containing a new student's name, a course information line, or a blank line. In the latter case we jump to the next pass in the loop. When a new student name is encountered, we initialize a new entry in the data dictionary to an empty list. In the case of a line about a course, we must interpret the contents on that line, which we postpone a bit.

We can now sketch the algorithm described above in terms of some unfinished Python code, just to get the overview:

def load(studentfile):
    infile = open(studentfile, 'r')
    data = {}
    for line in infile:
        i = line.find('Name:')
        if i != -1:
            # line contains 'Name:', extract the name.
            ...
        elif line.isspace():     # Blank line?
            continue             # Yes, go to next loop iteration.
        else:
            # This must be a course line, interpret the line.
            ...
    infile.close()
    return data

If we find 'Name:' as a substring in line, we must extract the name. This can be done by the substring line[i+5:]. Alternatively, we can split the line with respect to colon and strip off the first word:

words = line.split(':')
name = ' '.join(words[1:])

We have chosen the former strategy of extracting the name as a substring in the final program.

Each course line is naturally split into words for extracting information:

words = line.split()

The name of the course consists of a number of words, but we do not know how many. Nevertheless, we know that the final words contain the semester, the credit points, and the grade. We can hence count from the right and extract information, and when we are finished with the semester information, the rest of the words list holds the words in the name of the course. The code goes as follows:

grade = words[-1]
credit = int(words[-2])
semester = ' '.join(words[-4:-2])
course_name = ' '.join(words[:-4])
data[name].append({'title':    course_name,
                   'semester': semester,
                   'credit':   credit,
                   'grade':    grade})

This code is a good example of the usefulness of split and join operations when extracting information from a text.

Now to the second task of computing the average grade. Since the grades are letters we cannot compute with them. A natural way to proceed is to convert the letters to numbers, compute the average number, and then convert that number back to a letter. Conversion between letters and numbers is easily represented by a dictionary:

grade2number = {'A': 5, 'B': 4, 'C': 3, 'D': 2, 'E': 1, 'F': 0}

To convert from numbers to grades, we construct the inverse dictionary:

number2grade = {}
for grade in grade2number:
    number2grade[grade2number[grade]] = grade

In the computation of the average grade we should use a weighted sum such that larger courses count more than smaller courses. The weighted mean value of a set of numbers \( r_i \) with weights \( w_i \), \( i=0,\ldots,n-1 \), is given by $$ \begin{equation*} {\sum_{i=0}^{n-1} w_ir_i\over\sum_{i=0}^{n-1} w_i}\tp\end{equation*} $$ This weighted mean value must then be rounded to the nearest integer, which can be used as key in number2grade to find the corresponding grade expressed as a letter. The weight \( w_i \) is naturally taken as the number of credit points in the course with grade \( r_i \). The whole process is performed by the following function:

def average_grade(data, name):
    sum = 0; weights = 0
    for course in data[name]:
        weight = course['credit']
        grade  = course['grade']
        sum += grade2number[grade]*weight
        weights += weight
    avg = sum/float(weights)
    return number2grade[round(avg)]

The complete code is found in the file students.py. Running this program gives the following output of the average grades:

John Doe: B
Kari Nordmann: C
Jan Modaal: C

One feature of the students.py code is that the output of the names are sorted after the last name. How can we accomplish that? A straight for name in data loop will visit the keys in an unknown (random) order. To visit the keys in alphabetic order, we must use

for name in sorted(data):

This default sort will sort with respect to the first character in the name strings. We want a sort according to the last part of the name. A tailored sort function can then be written. In this function we extract the last word in the names and compare them:

def sort_names(name1, name2):
    last_name1 = name1.split()[-1]
    last_name2 = name2.split()[-1]
    if last_name1 < last_name2:
        return -1
    elif last_name1 > last_name2:
        return 1
    else:
        return 0

We can now pass on sort_names to the sorted function to get a sequence that is sorted with respect to the last word in the students' names:

for name in sorted(data, sort_names):
    print '%s: %s' % (name, average_grade(data, name))