Much of the material in this document is taken from Appendix H.3 in the book A Primer on Scientific Programming with Python, 4th edition, by the same author, published by Springer, 2014.
Unix has a convenient command, find, for traversing
all files in a folder tree and performing operations on them.
For example, the following command lists all files in the
home directory tree that are larger than 10 Mb:
Terminal> find $HOME -type f -size +10000k -exec ls -s -h {} \;
Python supports the same kind of traversal in a cross-platform manner through
the os.path.walk and os.walk functions. Note that os.path.walk exists only
in Python 2 (it was removed in Python 3), while os.walk is available in both.
The former is called as
os.path.walk(root, myfunc, arg), where root is the root of the
folder tree to traverse and myfunc is a user-defined function that
is called for each folder in the tree, root included. The myfunc function
takes three arguments: arg, dirname, and files, where arg is
any user-defined data structure (it must be mutable if myfunc is to store
results in it), dirname is the path of the current folder (starting
with root), and files is a list of the names of the files and
subfolders in that folder.
The find command above can then be implemented as follows in Python:
# Python 2 code (os.path.walk and the print statement do not exist in Python 3)
import os

def checksize(arg, dirname, files):
    for file in files:
        # construct the file's complete path:
        filename = os.path.join(dirname, file)
        if os.path.isfile(filename):
            size = os.path.getsize(filename)
            if size > 10000000:
                if arg is None:
                    print '%.2fMb %s' % (size/1000000.0, filename)
                elif isinstance(arg, list):
                    arg.append((size/1000000.0, filename))

root = os.environ['HOME']
os.path.walk(root, checksize, None)  # print list of large files

arg = []
os.path.walk(root, checksize, arg)
# arg is now a list of large files
for size, filename in arg:
    print filename, 'has size', size, 'Mb'
Note that if arg is None we just print large files (as in the
find command above), but if arg is a list, we build a list
of large files, consisting of 2-tuples with the size (in Mb) and the filename.
Python has an alternative construction for folder tree traversal, os.walk,
which returns an iterator:
for dirpath, dirnames, filenames in os.walk(root):
Here, dirpath is the path (starting with root) to
a folder with subfolders dirnames and ordinary files filenames.
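To make the three values concrete, the following minimal sketch (written for Python 3) builds a small throwaway folder tree and records what os.walk yields for it. The helper name list_tree and the file names a.txt, b.txt, and sub are made up for this illustration and do not come from the original text.

```python
import os
import tempfile

def list_tree(root):
    """Return the (dirpath, dirnames, filenames) tuples that os.walk yields."""
    result = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Sorting dirnames in place also fixes the order in which
        # os.walk visits the subfolders (top-down traversal).
        dirnames.sort()
        result.append((dirpath, list(dirnames), sorted(filenames)))
    return result

# Build a throwaway tree: root/a.txt and root/sub/b.txt
with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, 'sub'))
    for name in ('a.txt', os.path.join('sub', 'b.txt')):
        with open(os.path.join(root, name), 'w') as f:
            f.write('x')
    tree = list_tree(root)
    for dirpath, dirnames, filenames in tree:
        # Show dirpath relative to root to keep the output short
        print(os.path.relpath(dirpath, root), dirnames, filenames)
```

For this tree the loop visits two folders: first root itself, with dirnames equal to ['sub'] and filenames equal to ['a.txt'], and then sub, with no subfolders and the single file b.txt.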
The os.walk equivalent of the os.path.walk(root, checksize, arg) call is then
arg = []
for dirpath, dirnames, filenames in os.walk(root):
    checksize(arg, dirpath, dirnames + filenames)
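Since os.path.walk is not available in Python 3, it may be useful to see the large-file search expressed with os.walk alone. The following is a minimal Python 3 sketch, not the author's code: the function name large_files and the parameter min_bytes are hypothetical, and the 10000000-byte default simply mirrors the threshold used in checksize.

```python
import os

def large_files(root, min_bytes=10000000):
    """Yield (size_in_Mb, path) for each regular file under root
    whose size exceeds min_bytes (hypothetical helper)."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            # construct the file's complete path:
            path = os.path.join(dirpath, name)
            # skip entries that are not regular files (e.g., broken links)
            if os.path.isfile(path):
                size = os.path.getsize(path)
                if size > min_bytes:
                    yield size/1000000.0, path
```

A call such as list(large_files(os.path.expanduser('~'))) would then play the role of the arg list built above. Because os.walk already separates subfolders from files, only filenames needs to be examined here, and writing the function as a generator means the caller decides whether to print the results or collect them.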