Much of the material in this document is taken from Appendix H.3 in the book A Primer on Scientific Programming with Python, 4th edition, by the same author, published by Springer, 2014.
Unix shells such as Bash provide a convenient command, find, for traversing all files in a folder tree and performing operations on them. For example, the following command lists all files in the home directory tree that are larger than 10 Mb:
Terminal> find $HOME -type f -size +10000k -exec ls -s -h {} \;
Python supports the same kind of traversal in a cross-platform manner through the functions os.path.walk and os.walk. The former, which is available in Python 2 only, is called as os.path.walk(root, myfunc, arg), where root is the root of the folder tree to traverse and myfunc is a user-defined function that is called for each folder in the tree, including root itself. The myfunc function takes three arguments: arg, dirname, and files. Here, arg is any (mutable) user-defined data structure, dirname is the path of the current folder, with root as its leading part, and files is a list of the names (without path) of the files and subfolders in dirname.
The find command above can then be implemented as follows in Python:
import os

def checksize(arg, dirname, files):
    for file in files:
        # construct the file's complete path:
        filename = os.path.join(dirname, file)
        if os.path.isfile(filename):
            size = os.path.getsize(filename)
            if size > 10000000:
                if arg is None:
                    print '%.2fMb %s' % (size/1000000.0, filename)
                elif isinstance(arg, list):
                    arg.append((size/1000000.0, filename))

root = os.environ['HOME']
os.path.walk(root, checksize, None)  # print list of large files

arg = []
os.path.walk(root, checksize, arg)
# arg is now a list of large files
for size, filename in arg:
    print filename, 'has size', size, 'Mb'
Note that if arg is None we just print large files (as in the find command above), but if arg is a list, we build a list of large files, consisting of 2-tuples with the size (in Mb) and the filename.
Python has an alternative construction for folder tree traversal, os.walk, which returns an iterator:
for dirpath, dirnames, filenames in os.walk(root):
Here, dirpath is the path to a folder in the tree, with root as its leading part, dirnames is a list of the names of its subfolders, and filenames is a list of the names of its ordinary files.
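To make the three names concrete, the following minimal sketch builds a small temporary folder tree and records what os.walk yields for it. It assumes Python 3 (unlike the Python 2 code above), and the helper name list_tree as well as the folder and file names are made up for illustration.

```python
import os
import tempfile


def list_tree(root):
    """Return a list of (dirpath, dirnames, filenames) triples from os.walk.

    The name lists are sorted since os.walk does not guarantee their order.
    """
    result = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Sorting dirnames in place also fixes the order in which
        # os.walk descends into the subfolders (top-down traversal)
        dirnames.sort()
        result.append((dirpath, list(dirnames), sorted(filenames)))
    return result


# Build a hypothetical tree: root/a.txt and root/sub/b.txt
with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, 'sub'))
    for name in ('a.txt', os.path.join('sub', 'b.txt')):
        with open(os.path.join(root, name), 'w') as f:
            f.write('hello')

    tree = list_tree(root)
    for dirpath, dirnames, filenames in tree:
        # Show dirpath relative to root for readability
        print(os.path.relpath(dirpath, root), dirnames, filenames)
```

Each triple describes one folder: the first one is the root itself, and the dirpath of every later triple starts with the root path.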
The equivalent to the os.path.walk construction is then
arg = []
for dirpath, dirnames, filenames in os.walk(root):
    checksize(arg, dirpath, dirnames + filenames)
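Since os.walk already separates files from subfolders, the size check can also be written directly in the loop, without a callback function. The sketch below is one possible way to do this, assuming Python 3; the function name find_large_files and the min_size parameter are illustrative choices, not part of the os module. The demonstration runs on a temporary folder with a lowered size limit rather than on the home directory.

```python
import os
import tempfile


def find_large_files(root, min_size=10000000):
    """Return (size in Mb, path) tuples for files larger than min_size bytes."""
    large = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            # construct the file's complete path:
            path = os.path.join(dirpath, name)
            # skip entries that are not regular files (e.g., broken links)
            if os.path.isfile(path):
                size = os.path.getsize(path)
                if size > min_size:
                    large.append((size/1000000.0, path))
    return large


# Small demonstration in a temporary folder with a lowered size limit
with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, 'sub'))
    with open(os.path.join(root, 'small.dat'), 'wb') as f:
        f.write(b'x'*10)
    with open(os.path.join(root, 'sub', 'big.dat'), 'wb') as f:
        f.write(b'x'*2000)

    found = find_large_files(root, min_size=1000)
    for size, path in found:
        print('%.4fMb %s' % (size, os.path.relpath(path, root)))
```

Calling find_large_files(os.environ['HOME']) would then correspond to the list-building variant of the os.path.walk example.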