Here's the problem we're trying to solve – if you're doing this as homework, see the full info for this exercise:
You already know how to collect the list of filenames in a single directory. And you know how to read through a file and count the lines. This exercise combines both problems and adds a few twists:
“Non-blank” for this exercise is defined as: a line that has at least one non-whitespace character. However, it’s hard to distinguish a completely empty line from one that is full of invisible whitespace characters, so use the string's strip() method.
When you run h.py from the command line:
0004-shakefiles $ python h.py
tempdata/comedies/allswellthatendswell has 3164 non-blank lines out of 4515 total lines
tempdata/comedies/asyoulikeit has 2904 non-blank lines out of 4122 total lines
tempdata/comedies/comedyoferrors has 2112 non-blank lines out of 2937 total lines
...
tempdata/tragedies/titusandronicus has 2837 non-blank lines out of 3767 total lines
All together, Shakespeare's 42 text files have:
125097 non-blank lines out of 172948 total lines
Rather than write the program in a linear, top-to-bottom fashion, I'm going to do something different for this exercise. I'm going to approach it from its smallest, most standalone problem – how to test if a text string is "blank" (or non-blank) – and then work my way up to the bigger problems, such as "how do I open a file". In fact, each step that we do in this program we've already done before in all of the past exercises. This is just a different way of assembling those pieces.
The final program will be pretty ugly (you can see it here). But if you can step through it, piece by piece, and make sure each piece works before moving forward, you won't even notice the complex ugliness you've created until the very end. And by then, it doesn't matter as long as the program works!
A "non-blank line", for the purposes of this exercise, is defined as a line that has at least one non-whitespace character. Conversely, a "blank line" can be thought of as a line that consists solely of 0 or more whitespace characters.
However, this is not the same thing as an empty string. Try this out at the interactive Python shell:
>>> a = "" # empty string
>>> b = " " # string with some spaces
>>> c = "\n" # string consisting of just a newline character
>>> a == b
False
>>> a == c
False
String objects have a strip() method which trims a string of whitespace characters from both the left and right side:
>>> " hello world ! ".strip()
'hello world !'
So, if we use strip() on a string that consists solely of whitespace characters, the result should be an empty string (i.e. a string with 0 characters):
>>> "" == " ".strip()
True
>>> "" == "\n".strip()
True
So, now we have the conditional expression that can test if a string – stored in a variable named line – is blank:
line.strip() == ""
Of course, if we want to test for non-blank lines, we just invert the test to test for non-equality:
line.strip() != ""
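You may be tempted to write this with "is not" instead of "!=", but don't: "is not" compares object identity rather than value, so whether it works for strings depends on the interpreter, and Python 3.8 and later emit a SyntaxWarning when it is used with a literal. If you prefer a shorter notation, note that an empty string counts as false, so the stripped string can serve as the non-blank test all by itself:
bool(line.strip())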
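As a quick check of the blank-line test, here is a minimal sketch that wraps it in a helper. The name is_blank is made up for this illustration and is not part of the exercise; the test itself is the equality comparison against the empty string described above.

```python
def is_blank(line):
    """Return True if the line has no non-whitespace characters.

    is_blank is a hypothetical helper name used only for this illustration.
    """
    return line.strip() == ""

# strip() with no arguments removes spaces, tabs, and newlines alike
print(is_blank(""))           # True
print(is_blank(" \t \n"))     # True
print(is_blank("  to be\n"))  # False
```

Because the comparison uses == (value equality), the result does not depend on how the interpreter happens to store empty strings.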
OK, now let's step up one level: given a text file object, txtfile, we need to read every line of the file:
for line in txtfile:
    # etc. etc.
What we need to do to fulfill the requirements of the exercise is keep count of each line. So before the for-loop, let's set up a variable, total_line_count, and assign its initial value to 0:
total_line_count = 0
for line in txtfile:
    # do something
For every iteration of the loop, we want to add 1 to total_line_count:
total_line_count = 0
for line in txtfile:
    total_line_count += 1
Now we also want to count the non-blank lines. So let's initialize another variable, nonblank_line_count:
total_line_count = 0
nonblank_line_count = 0
for line in txtfile:
    total_line_count += 1
However, we increment nonblank_line_count only when a given line passes our test for non-blank lines. This is where a conditional statement is necessary:
total_line_count = 0
nonblank_line_count = 0
for line in txtfile:
    total_line_count += 1
    if line.strip() != "":
        nonblank_line_count += 1
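To try the counting loop before any real file is involved, one option is to let an in-memory object stand in for txtfile. The sketch below assumes io.StringIO as that stand-in (it is not used elsewhere in this walkthrough), and the sample text is made up.

```python
from io import StringIO

# StringIO wraps a string in a file-like object, so it can stand in for txtfile
txtfile = StringIO("To be, or not to be\n\n   \nthat is the question\n")

total_line_count = 0
nonblank_line_count = 0
for line in txtfile:
    total_line_count += 1
    # a line counts as non-blank if anything is left after stripping whitespace
    if line.strip() != "":
        nonblank_line_count += 1

print(nonblank_line_count, "non-blank lines out of", total_line_count, "total lines")
# prints: 2 non-blank lines out of 4 total lines
```

The second line is empty and the third holds only spaces, so both are counted in the total but not as non-blank.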
OK, that txtfile variable is just a placeholder. Let's write the code so that txtfile points to an opened file:
txtfile = open(fname, 'r')
Obviously, fname is as yet undefined and will now be a placeholder. But we'll deal with that later. Pretend it points to something like "tempdata/comedies/asyoulikeit".
total_line_count = 0
nonblank_line_count = 0
txtfile = open(fname, 'r')
for line in txtfile:
    total_line_count += 1
    if line.strip() != "":
        nonblank_line_count += 1
The requirements of the exercise include printing out a line count for the given file, e.g.
tempdata/comedies/asyoulikeit has 2904 non-blank lines out of 4122 total lines
If fname points to a file name, then that print() call looks like this:
print(fname, 'has', nonblank_line_count, 'non-blank lines out of', total_line_count, 'total lines')
The Python style guide recommends keeping lines of code shorter than 80 characters. For lists of arguments/parameters inside parentheses, we don't have to worry about the whitespace (e.g. newline characters) being interpreted as indents. Here's one way to write the code snippet above:
print(fname, 'has', nonblank_line_count,
      'non-blank lines out of', total_line_count, 'total lines')
And here's what it looks like as part of our larger code snippet; note that I'm also closing the txtfile object since we're done reading its lines at this point:
total_line_count = 0
nonblank_line_count = 0
txtfile = open(fname, 'r')
for line in txtfile:
    total_line_count += 1
    if line.strip() != "":
        nonblank_line_count += 1
txtfile.close()
print(fname, 'has', nonblank_line_count,
      'non-blank lines out of', total_line_count, 'total lines')
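If you'd like to run this chunk without the Shakespeare files on hand, here is a rough sketch that writes a throwaway file first. Two things in it are assumptions that go beyond the walkthrough: the tempfile module creates the scratch directory, and a with-block is used in place of the explicit close() call, since it closes the file automatically. The name sample.txt and its contents are made up.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    # create a small, made-up file to count
    fname = os.path.join(tmpdir, "sample.txt")
    with open(fname, 'w') as f:
        f.write("first line\n\n  \nlast line\n")

    total_line_count = 0
    nonblank_line_count = 0
    # the with-block closes txtfile when the indented code finishes
    with open(fname, 'r') as txtfile:
        for line in txtfile:
            total_line_count += 1
            if line.strip() != "":
                nonblank_line_count += 1

    print(fname, 'has', nonblank_line_count,
          'non-blank lines out of', total_line_count, 'total lines')
```

The counting logic is the same as in the snippet above; only the way the file is opened and closed differs.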
It's important to note that this code works as is. That is, if you supply fname with a valid filename – e.g. "tempdata/tragedies/hamlet" – it will work. No matter what we do in the next steps, we can at least feel sure that this chunk works.
In subsequent code snippets, I'll refer to this chunk of code as single_file_reading_routine, a placeholder for brevity's sake.
Moving up another level: we need to get a list of file names, then iterate through them. This is what the glob() function is for (we'll write the import statement later).
In the previous lesson, we saw how to use glob() to get all of the files within a single subdirectory:
filepattern = join("tempdata", "tragedies", "*")
filenames = glob(filepattern)
To get it for each subdirectory, we could just copy-paste-repeat the code (never mind how I'm using a list here, it's not important):
filenames = []
filepattern = join("tempdata", "comedies", "*")
filenames.extend(glob(filepattern))
filepattern = join("tempdata", "histories", "*")
filenames.extend(glob(filepattern))
filepattern = join("tempdata", "poetry", "*")
filenames.extend(glob(filepattern))
filepattern = join("tempdata", "tragedies", "*")
filenames.extend(glob(filepattern))
But…that's pretty ugly. Turns out there's a way to specify an equivalent wildcard with glob():
filepattern = join('tempdata', '**', '*')
filenames = glob(filepattern)
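Here the double asterisk stands in for the subdirectory name. Unless glob() is called with recursive=True, it treats ** exactly like a single *, so this pattern – tempdata/**/* – goes through every immediate subdirectory in tempdata, and then grabs the filenames within each of those subdirectories. It does not, however, match tempdata/glossary or tempdata/README, because the pattern requires a subdirectory between tempdata and the filename – which is just peachy as far as we're concerned.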
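To see what the pattern does and does not match, here is a small sketch that builds a made-up miniature of the tempdata layout in a scratch directory. The file names, the tempfile usage, and the comparison against recursive=True are illustrative assumptions, not part of the exercise.

```python
import os
import tempfile
from glob import glob
from os.path import join

with tempfile.TemporaryDirectory() as root:
    # a miniature, made-up stand-in for the tempdata layout
    for relpath in ["README",
                    join("comedies", "asyoulikeit"),
                    join("tragedies", "hamlet")]:
        fullpath = join(root, relpath)
        os.makedirs(os.path.dirname(fullpath), exist_ok=True)
        with open(fullpath, 'w') as f:
            f.write("placeholder\n")

    one_star = sorted(glob(join(root, '*', '*')))
    two_star = sorted(glob(join(root, '**', '*')))
    # with recursive=True, ** may also match zero directories or nested ones
    recursive = sorted(glob(join(root, '**', '*'), recursive=True))

    matched = [os.path.relpath(path, root) for path in two_star]
    has_readme = join(root, "README") in recursive

print(matched)               # the two files inside subdirectories, not README
print(one_star == two_star)  # True: without recursive=True, ** acts like *
print(has_readme)            # True: the recursive form also picks up README
```

So the non-recursive pattern is the one that suits this exercise, since it skips the top-level files on its own.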
Throwing that list of filenames into a loop, and referring to our previous chunk of file reading code, this is what our script looks like so far:
filepattern = join('tempdata', '**', '*')
filenames = glob(filepattern)
for fname in filenames:
    single_file_reading_routine # etc. etc.
The exercise requires us to print the total line count and total non-blank line count of all of the files. So this means we need to initialize two new variables, which I'll just call all_line_count and all_nonblank_line_count:
all_line_count = 0
all_nonblank_line_count = 0
filepattern = join('tempdata', '**', '*')
filenames = glob(filepattern)
for fname in filenames:
    single_file_reading_routine # etc. etc.
But how do we add to these? Well, revisiting our single_file_reading_routine, we could just increment them in the same places that we increment the other two line-counting variables:
# this is the single_file_reading_routine
total_line_count = 0
nonblank_line_count = 0
txtfile = open(fname, 'r')
for line in txtfile:
    total_line_count += 1
    all_line_count += 1 # <== new line
    if line.strip() != "":
        nonblank_line_count += 1
        # also add it to the count across all files
        all_nonblank_line_count += 1 # <== new line
txtfile.close()
print(fname, 'has', nonblank_line_count,
      'non-blank lines out of', total_line_count, 'total lines')
But…why don't we wait until the end of this routine, at which point total_line_count and nonblank_line_count contain the values for the given file? Then we can just add those values to all_line_count and all_nonblank_line_count, respectively, since these two variables can be thought of as aggregations of the line counts for each file:
all_line_count = 0
all_nonblank_line_count = 0
filepattern = join('tempdata', '**', '*')
filenames = glob(filepattern)
for fname in filenames:
    single_file_reading_routine # etc. etc.
    # note that the routine above has initialized and set the variables
    # of nonblank_line_count and total_line_count, which we
    # can access at this point
    all_nonblank_line_count += nonblank_line_count
    all_line_count += total_line_count
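As a worked example of that aggregation step in isolation, the sketch below skips the file reading and starts from per-file counts. The three pairs are copied from the sample output at the top of this walkthrough; the per_file_counts list and the unpacking in the for-loop are just devices for this illustration.

```python
# (non-blank, total) counts for three files, copied from the sample output
per_file_counts = [(3164, 4515), (2904, 4122), (2112, 2937)]

all_line_count = 0
all_nonblank_line_count = 0
for nonblank_line_count, total_line_count in per_file_counts:
    # add each file's finished counts to the running totals
    all_nonblank_line_count += nonblank_line_count
    all_line_count += total_line_count

print(all_nonblank_line_count, "non-blank lines out of",
      all_line_count, "total lines")
# prints: 8180 non-blank lines out of 11574 total lines
```

Adding once per file gives the same totals as incrementing once per line, with fewer additions and without touching the inner loop.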
Now we just have to write the very beginning and end of our program.
First, import glob() and join(), which we call in our code:
from os.path import join
from glob import glob
The exercise requirements expect this to be the final line of output:
All together, Shakespeare's 42 text files have:
125097 non-blank lines out of 172948 total lines
So, using our previously instantiated variables (note that calling len() on the filenames list gets us a count of items in that list):
print("All together, Shakespeare's",
      len(filenames), "text files have:")
print(all_nonblank_line_count,
      "non-blank lines out of",
      all_line_count, "total lines")
Here's what everything looks like, assembled and written out. I've tried to indicate using comments how and where each individual part is in the final code. The result is more intimidating than the individual parts!
## The setup
from os.path import join
from glob import glob
## keeping track of the line counts across all files
all_line_count = 0
all_nonblank_line_count = 0
## globbing a list of filenames
filepattern = join('tempdata', '**', '*')
filenames = glob(filepattern)
for fname in filenames:
    ## the start of single_file_reading_routine
    ## Actually open a file for reading
    txtfile = open(fname, 'r')
    total_line_count = 0
    nonblank_line_count = 0
    for line in txtfile:
        ## Count all the lines
        total_line_count += 1
        ## Test whether a text string is non-blank
        if line.strip() != "":
            ## Count just the non-blank lines
            nonblank_line_count += 1
    txtfile.close()
    ## Print out the file's stats
    print(fname, 'has', nonblank_line_count,
          'non-blank lines out of', total_line_count, 'total lines')
    ## end of single_file_reading_routine
    ## keeping track of the line counts across all files
    all_nonblank_line_count += nonblank_line_count
    all_line_count += total_line_count
## the finish
print("All together, Shakespeare's",
      len(filenames), "text files have:")
print(all_nonblank_line_count,
      "non-blank lines out of",
      all_line_count, "total lines")
It looks a little less intimidating without all of those comments:
from os.path import join
from glob import glob
all_line_count = 0
all_nonblank_line_count = 0
filepattern = join('tempdata', '**', '*')
filenames = glob(filepattern)
for fname in filenames:
    txtfile = open(fname, 'r')
    total_line_count = 0
    nonblank_line_count = 0
    for line in txtfile:
        total_line_count += 1
        if line.strip() != "":
            nonblank_line_count += 1
    txtfile.close()
    print(fname, 'has', nonblank_line_count,
          'non-blank lines out of', total_line_count, 'total lines')
    all_nonblank_line_count += nonblank_line_count
    all_line_count += total_line_count
print("All together, Shakespeare's",
      len(filenames), "text files have:")
print(all_nonblank_line_count,
      "non-blank lines out of",
      all_line_count, "total lines")
If you aren't already too bored with this problem, see if you can approach it from a typical top-down approach, i.e. start with the imports:
from os.path import join
from glob import glob
Then glob together the file list, and start the first for-loop:
filenames = glob(join('tempdata', '**', '*'))
for fname in filenames:
    txtfile = open(fname, 'r')
See if that approach makes more sense than what I've just demonstrated. It kind of depends on how you think about the problem, though as you get better and better, you'll solve problems using a mix of approaches. This exercise is less about trying to solve this dull Shakespeare problem, and more about how many different ways you can tackle a problem, including deconstructing it into smaller tasks.
We didn't cover the tactic of creating our own functions, mostly because I think this problem is simple enough to do as one ugly block of code. But in case you were wondering, yes, writing linear, top-to-bottom scripts of 40-50+ lines results in pretty ugly looking code.
In later exercises, we'll work on breaking up our program into pieces. I hinted at this when I used single_file_reading_routine as a pseudocode abbreviation for a chunk of code that worked in isolation.
In practice, we'll learn to turn those self-sufficient chunks of code – i.e. routines – into functions. If you think of defining variables as creating human-readable labels for data, think of defining functions as creating human-readable abbreviations for blocks of code that we intend to run, over and over. Or, at least, separate it from the main body of code.
Here's a solution to the exercise in which single_file_reading_routine has been turned into a function that accepts a single argument – a filename – and returns a list of two numbers (nonblank_line_count and total_line_count). Notice how the main routine of the program is not much more than 10 lines, with single_file_reading_routine abstracted out into its own function:
from os.path import join
from glob import glob
def single_file_reading_routine(fname):
    txtfile = open(fname, 'r')
    total_line_count = 0
    nonblank_line_count = 0
    for line in txtfile:
        total_line_count += 1
        if line.strip() != "":
            nonblank_line_count += 1
    txtfile.close()
    print(fname, 'has', nonblank_line_count,
          'non-blank lines out of', total_line_count, 'total lines')
    return [nonblank_line_count, total_line_count]
## Main routine
all_line_count = 0
all_nonblank_line_count = 0
filepattern = join('tempdata', '**', '*')
filenames = glob(filepattern)
for fname in filenames:
    line_counts = single_file_reading_routine(fname)
    all_nonblank_line_count += line_counts[0]
    all_line_count += line_counts[1]
print("All together, Shakespeare's",
      len(filenames), "text files have:")
print(all_nonblank_line_count,
      "non-blank lines out of",
      all_line_count, "total lines")
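For comparison, here is one possible variation on the function-based version, offered as a sketch rather than as the expected solution. It assumes a few things not covered above: the function is renamed count_lines (a made-up name), it returns a tuple that the caller unpacks into two variables instead of indexing a list, and it runs against two made-up files in a scratch directory so that it works without the tempdata folder.

```python
import os
import tempfile
from glob import glob
from os.path import join

def count_lines(fname):
    """Return (nonblank_line_count, total_line_count) for one text file."""
    total_line_count = 0
    nonblank_line_count = 0
    with open(fname, 'r') as txtfile:
        for line in txtfile:
            total_line_count += 1
            if line.strip() != "":
                nonblank_line_count += 1
    return nonblank_line_count, total_line_count

with tempfile.TemporaryDirectory() as root:
    # two made-up files inside one made-up subdirectory
    os.makedirs(join(root, "comedies"))
    with open(join(root, "comedies", "sample_a"), 'w') as f:
        f.write("one\n\ntwo\n")
    with open(join(root, "comedies", "sample_b"), 'w') as f:
        f.write("\n \nthree\n")

    all_line_count = 0
    all_nonblank_line_count = 0
    filenames = glob(join(root, '*', '*'))
    for fname in filenames:
        # unpack the returned pair straight into two variables
        nonblank_line_count, total_line_count = count_lines(fname)
        all_nonblank_line_count += nonblank_line_count
        all_line_count += total_line_count

print(len(filenames), "text files have:")
print(all_nonblank_line_count, "non-blank lines out of",
      all_line_count, "total lines")
```

Unpacking gives each returned number a readable name at the call site, which avoids having to remember which index holds which count.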