Much of the material in this document is taken from Appendix H.3 in the book A Primer on Scientific Programming with Python, 4th edition, by the same author, published by Springer, 2014.

Traverse a folder tree

The Unix command find, available from Bash and other shells, traverses all files in a folder tree and performs operations on them. For example, the following command lists all files in the home folder tree that are larger than 10 Mb:

Terminal> find $HOME -type f -size +10000k -exec ls -s -h {} \;

Python has support for doing the same in a cross-platform manner through the os.path.walk and os.walk functions. The former, which is available in Python 2 only, is called as os.path.walk(root, myfunc, arg), where root is the root of the folder tree for traversal and myfunc is a user-defined function that is called for each folder in the tree. The myfunc function takes three arguments: arg, dirname, and files, where arg is any user-defined data structure (it must be mutable if myfunc is to store results in it), dirname is the path of the current folder, starting with root, and files is a list of the names of the files and subfolders in that folder.

The find command above can then be implemented as follows in Python:

import os

def checksize(arg, dirname, files):
    for file in files:
        # construct the file's complete path:
        filename = os.path.join(dirname, file)
        if os.path.isfile(filename):
            size = os.path.getsize(filename)
            if size > 10000000:
                if arg is None:
                    print('%.2fMb %s' % (size/1000000.0, filename))
                elif isinstance(arg, list):
                    arg.append((size/1000000.0, filename))

root = os.environ['HOME']
os.path.walk(root, checksize, None)  # print list of large files

arg = []
os.path.walk(root, checksize, arg)
# arg is now a list of large files
for size, filename in arg:
    print('%s has size %s Mb' % (filename, size))

Note that if arg is None we just print large files (as in the find command above), but if arg is a list, we build a list of large files, consisting of 2-tuples with the size (in Mb) and the filename.
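
Since each tuple starts with the size, the collected list can be sorted directly, for instance to show the largest files first. The following is only a minimal sketch: the list contents are made-up sample data standing in for what checksize would collect, and the names large_files and largest_first are assumptions of this example.

```python
# Hypothetical sample data in the same format that checksize collects:
# 2-tuples of (size in Mb, filename).
large_files = [
    (12.5, '/home/user/data/results.dat'),
    (340.0, '/home/user/video/lecture.mp4'),
    (48.2, '/home/user/backup.tar.gz'),
]

# Tuples compare element by element, so sorting orders by size first.
largest_first = sorted(large_files, reverse=True)

for size, filename in largest_first:
    print('%8.2f Mb  %s' % (size, filename))
```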

Python has an alternative construction for folder tree traversal, os.walk, which returns an iterator:

for dirpath, dirnames, filenames in os.walk(root):

Here, dirpath is the path to a folder in the tree (starting with root), dirnames is a list of the names of its subfolders, and filenames is a list of the names of its ordinary files. The equivalent to the os.path.walk construction is then

arg = []
for dirpath, dirnames, filenames in os.walk(root):
    checksize(arg, dirpath, dirnames + filenames)
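
In Python 3, where os.path.walk is no longer available, the same search can be written with os.walk alone. The sketch below is one possible formulation and is not taken from the book; the function name find_large_files and its limit parameter are assumptions introduced for illustration.

```python
import os

def find_large_files(root, limit=10000000):
    """Return a list of (size in Mb, path) for files larger than limit bytes."""
    large = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            # construct the file's complete path:
            path = os.path.join(dirpath, name)
            # skip broken symbolic links and other non-regular entries
            if os.path.isfile(path):
                size = os.path.getsize(path)
                if size > limit:
                    large.append((size/1000000.0, path))
    return large
```

A call such as find_large_files(os.environ['HOME']) would then play the role of the find command above, returning the list instead of printing it.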