Counting the non-blank lines in Shakespeare's tragedies

A simple text-processing and analysis exercise, using everything we've learned so far about file paths, for-loops, and conditional branching.
This article is part of a sequence.
Extracting and Reading Shakespeare
A walkthrough of modules, file system operations, and Shakespeare.
The problem

Here's the problem we're trying to solve – if you're doing this as homework, see the full info for this exercise:

0004-shakefiles/h.py
Count and print the number of non-blank lines for each of Shakespeare's tragedies

You already know how to collect the list of filenames in a single directory. And you know how to read through a file and count the lines. This exercise combines both problems and adds a few twists:

  • Loop through all of the Shakespeare files (there are 42 of them)
  • When reading the lines, keep track of how many of the lines are non-blank.
  • Print the number of non-blank lines versus total lines in each text.
  • At the end, print the total count of non-blank lines and all lines.

“Non-blank” for this exercise is defined as: a line that has at least one non-whitespace character. A completely empty line is hard to tell apart from one that is full of invisible whitespace characters, so treat both as blank by using the string's strip() method.

Expectations

When you run h.py from the command-line:

0004-shakefiles $ python h.py
  • The program's first 3 lines of output to screen should be:
    tempdata/comedies/allswellthatendswell has 3164 non-blank lines out of 4515 total lines
    tempdata/comedies/asyoulikeit has 2904 non-blank lines out of 4122 total lines
    tempdata/comedies/comedyoferrors has 2112 non-blank lines out of 2937 total lines
    
  • The program's last 3 lines of output to screen should be:
    tempdata/tragedies/titusandronicus has 2837 non-blank lines out of 3767 total lines
    All together, Shakespeare's 42 text files have:
    125097 non-blank lines out of 172948 total lines
    

From the inside out

Rather than write the program in a linear, top-to-bottom fashion, I'm going to do something different for this exercise. I'm going to approach it from its smallest, most standalone problem – how to test if a text string is "blank" (or non-blank) – and then work my way up to the bigger problems, such as "how do I open a file". In fact, each step that we do in this program we've already done before in all of the past exercises. This is just a different way of assembling those pieces.

The final program will be pretty ugly (you can see it here). But if you can step through it, piece by piece, and make sure each piece works before moving forward, you won't even notice the complex ugliness you've created until the very end. And by then, it doesn't matter as long as the program works!

Test whether a text string is "blank"

A "non-blank line", for the purposes of this exercise, is defined as a line that has at least one non-whitespace character. Conversely, a "blank line" can be thought of as a line that consists solely of 0 or more whitespace characters.

However, this is not the same thing as an empty string. Try this out at the interactive Python shell:

>>> a = ""    # empty string
>>> b = "   " # string with some spaces
>>> c = "\n"  # string consisting of just a newline character
>>> a == b
False
>>> a == c
False

The string's strip() method

String objects have a strip() method which trims a string of whitespace characters from both the left and right side:

>>> "   hello world !  ".strip()
'hello world !'

So, if we use strip() on a string that consists solely of whitespace characters, the result should be an empty string (i.e. a string with 0 characters):

>>> "" == "    ".strip()
True
>>> "" == "\n".strip()
True

So, now we have the conditional expression that can test if a string – stored in a variable named line – is blank:

line.strip() == ""

Of course, if we want to test for non-blank lines, we just invert the test to test for non-equality:

line.strip() != ""

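You may also see this test written with is not, but avoid that form. The is not operator compares object identity rather than value, so it only works by accident of how CPython reuses the empty string, and Python 3.8 and later flag it with a SyntaxWarning:

line.strip() is not ""   # don't do this; use != instead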
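As a quick check of the blank-line test, here is a minimal sketch that wraps it in a helper function. The name is_nonblank is hypothetical (it is not part of the exercise), and the sample strings are made up for illustration. It also shows that, because an empty string counts as false in Python, bool(line.strip()) gives the same answer as line.strip() != "".

```python
def is_nonblank(line):
    """Return True if line has at least one non-whitespace character."""
    return line.strip() != ""


# Made-up sample lines, including the kinds of "invisible" blanks
# discussed above.
samples = ["", "   ", "\n", " \t\n", "To be, or not to be\n", "  x  "]

for s in samples:
    # An empty string is falsy, so bool(s.strip()) is an equivalent test.
    assert is_nonblank(s) == bool(s.strip())
    print(repr(s), "->", is_nonblank(s))
```

Only the last two samples should come out as non-blank.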

Read the lines of a file

OK, now let's step up one level: given a text file object, txtfile, we need to read every line of the file:

for line in txtfile:
    # etc. etc.

Count all the lines

What we need to do to fulfill the requirements of the exercise is keep count of each line. So before the for-loop, let's set up a variable, total_line_count, and assign its initial value to 0:

total_line_count = 0
for line in txtfile:
    # do something

For every iteration of the loop, we want to add 1 to total_line_count:

total_line_count = 0
for line in txtfile:
    total_line_count += 1

Count just the non-blank lines

Now we also want to count the non-blank lines. So let's initialize another variable, nonblank_line_count:

total_line_count = 0
nonblank_line_count = 0
for line in txtfile:
    total_line_count += 1

However, we increment nonblank_line_count only when a given line passes our test for non-blank lines. This is where a conditional statement is necessary:

total_line_count = 0
nonblank_line_count = 0
for line in txtfile:
    total_line_count += 1
    if line.strip() != "":
        nonblank_line_count += 1
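Since txtfile is still a placeholder at this point, one way to try the counting loop right away is to stand in an in-memory text stream for the file. This is only a sketch: io.StringIO is a standard-library class that can be looped over line by line like an open file, and the sample text is made up.

```python
import io

# Made-up text: three lines with words, two blank-looking lines.
sample_text = "ACT I\n\nSCENE I. Elsinore.\n   \nWho's there?\n"
txtfile = io.StringIO(sample_text)

total_line_count = 0
nonblank_line_count = 0
for line in txtfile:
    total_line_count += 1
    if line.strip() != "":
        nonblank_line_count += 1

print(nonblank_line_count, "non-blank lines out of", total_line_count)
```

With this sample, the loop should see 5 lines in total, 3 of them non-blank.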

Actually open a file for reading

OK, that txtfile variable is just a placeholder. Let's write the code so that txtfile points to an opened file:

txtfile = open(fname, 'r')

Obviously, fname is yet undefined and will now be a placeholder. But we'll deal with that later. Pretend it points to something like "tempdata/comedies/asyoulikeit".

total_line_count = 0
nonblank_line_count = 0
txtfile = open(fname, 'r')
for line in txtfile:
    total_line_count += 1
    if line.strip() != "":
        nonblank_line_count += 1

The requirements of the exercise include printing out a line count for the given file, e.g.

tempdata/comedies/asyoulikeit has 2904 non-blank lines out of 4122 total lines

If fname points to a file name, then that print() call looks like this:

print(fname, 'has', nonblank_line_count, 'non-blank lines out of', total_line_count, 'total lines')

The Python style guide recommends keeping lines of code shorter than 80 characters. Inside parentheses, Python ignores line breaks and extra indentation, so a long list of arguments can be split across several lines without the whitespace being interpreted as an indent. Here's one way to write the code snippet above:

print(fname, 'has', nonblank_line_count,
      'non-blank lines out of', total_line_count, 'total lines')

And here's what it looks like as part of our larger code snippet; note that I'm also closing the txtfile object since we're done reading its lines at this point:

total_line_count = 0
nonblank_line_count = 0
txtfile = open(fname, 'r')
for line in txtfile:
    total_line_count += 1
    if line.strip() != "":
        nonblank_line_count += 1
txtfile.close()
print(fname, 'has', nonblank_line_count,
      'non-blank lines out of', total_line_count, 'total lines')

It's important to note that this code works as is. That is, if you supply fname with a valid filename – e.g. "tempdata/tragedies/hamlet" – it will work. No matter what we do in the next steps, we can at least feel sure that this chunk works.

In subsequent code snippets, I'll use the name single_file_reading_routine as a placeholder for this chunk of code, for brevity's sake.
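The file-reading chunk closes the file with an explicit txtfile.close() call. A common alternative, sketched here rather than prescribed, is Python's with statement, which closes the file automatically when the indented block ends. To keep the sketch runnable anywhere, it first writes a small throwaway file; the file name sample.txt and its contents are made up.

```python
import os
import tempfile

# Create a small throwaway file so the sketch runs anywhere.
# (In the exercise, fname would be something like "tempdata/tragedies/hamlet".)
tmpdir = tempfile.mkdtemp()
fname = os.path.join(tmpdir, "sample.txt")
with open(fname, "w") as f:
    f.write("First line\n\n  \nLast line\n")

total_line_count = 0
nonblank_line_count = 0
# The with-block closes txtfile automatically, even if an error occurs.
with open(fname, "r") as txtfile:
    for line in txtfile:
        total_line_count += 1
        if line.strip() != "":
            nonblank_line_count += 1

print(fname, 'has', nonblank_line_count,
      'non-blank lines out of', total_line_count, 'total lines')
```

The counting logic is unchanged; only the opening and closing of the file is handled differently.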

Globbing a list of filenames

Moving up another level: we need to get a list of file names, then iterate through them. This is what the glob() function is for (we'll write the import statement later).

In the previous lesson, we saw how to use glob() to get all of the files within a single subdirectory:

filepattern = join("tempdata", "tragedies", "*")
filenames = glob(filepattern)

To get the filenames for each subdirectory, we could just copy-paste-repeat the code (never mind how I'm using a list here, it's not important):

filenames = []
filepattern = join("tempdata", "comedies", "*")
filenames.extend(glob(filepattern))
filepattern = join("tempdata", "histories", "*")
filenames.extend(glob(filepattern))
filepattern = join("tempdata", "poetry", "*")
filenames.extend(glob(filepattern))
filepattern = join("tempdata", "tragedies", "*")
filenames.extend(glob(filepattern))

But…that's pretty ugly. Turns out there's a way to specify an equivalent wildcard with glob():

filepattern = join('tempdata', '**', '*')
filenames = glob(filepattern)

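Used this way (without glob()'s recursive=True argument), the double asterisk behaves like a single asterisk and matches any one directory name. So this pattern, tempdata/**/*, goes through every subdirectory directly inside tempdata and grabs the filenames within each of those subdirectories. It does not, however, match tempdata/glossary or tempdata/README, which sit directly inside tempdata rather than inside a subdirectory. That is just peachy as far as we're concerned.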
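To see which paths a pattern like this picks up, here is a small sketch that builds a made-up stand-in for the tempdata directory inside a temporary folder (the file names are only examples) and globs it. Note that glob() returns its results in arbitrary order, so the sketch wraps the call in sorted() to get a predictable listing; that is an addition for illustration, not something the exercise code does.

```python
import os
import tempfile
from glob import glob
from os.path import join

# Build a tiny, made-up stand-in for the tempdata directory.
root = tempfile.mkdtemp()
for subdir, name in [("comedies", "asyoulikeit"), ("tragedies", "hamlet")]:
    os.makedirs(join(root, subdir), exist_ok=True)
    with open(join(root, subdir, name), "w") as f:
        f.write("placeholder\n")
with open(join(root, "README"), "w") as f:
    f.write("placeholder\n")

# Without recursive=True, '**' matches exactly one path component,
# so only files one level below root are found.
filenames = sorted(glob(join(root, "**", "*")))
relative = [os.path.relpath(p, root) for p in filenames]
print(relative)
```

The README file at the top level should be absent from the printed list, while the two files inside subdirectories should be present.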

Throwing that list of filenames into a loop, and referring to our previous chunk of file reading code, this is what our script looks like so far:

filepattern = join('tempdata', '**', '*')
filenames = glob(filepattern)
for fname in filenames:
    single_file_reading_routine # etc. etc.

Keeping track of the total total line counts

The exercise requires us to print the total line count and total non-blank line count of all of the files. So this means we need to initialize two new variables, which I'll just call all_line_count and all_nonblank_line_count:

all_line_count = 0
all_nonblank_line_count = 0
filepattern = join('tempdata', '**', '*')
filenames = glob(filepattern)
for fname in filenames:
    single_file_reading_routine # etc. etc.

But how do we add to these? Well, revisiting our single_file_reading_routine, we could just increment them in the same places that we increment the other two line-counting variables:

# this is the single_file_reading_routine
total_line_count = 0
nonblank_line_count = 0
txtfile = open(fname, 'r')
for line in txtfile:
    total_line_count += 1
    all_line_count += 1                 # <== new line
    if line.strip() != "":
        nonblank_line_count += 1
        all_nonblank_line_count += 1    # <== new line
txtfile.close()
print(fname, 'has', nonblank_line_count,
      'non-blank lines out of', total_line_count, 'total lines')

But…why don't we wait until the end of this routine, at which point total_line_count and nonblank_line_count contain the final values for the given file? Then we can just add those values to all_line_count and all_nonblank_line_count, respectively, since these two variables can be thought of as aggregations of the line counts for each file.

all_line_count = 0
all_nonblank_line_count = 0
filepattern = join('tempdata', '**', '*')
filenames = glob(filepattern)
for fname in filenames:
    single_file_reading_routine # etc. etc.
    # note that the routine above has initialized and set the variables
    # of nonblank_line_count and total_line_count, which we
    # can access at this point
    all_nonblank_line_count += nonblank_line_count 
    all_line_count += total_line_count 
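To see the aggregation step in isolation, here is a small sketch that skips the file reading and feeds in per-file counts directly. The list name per_file_counts is hypothetical; the numbers are the first three lines of the expected output listed under Expectations.

```python
# (nonblank_line_count, total_line_count) for the first three files,
# copied from the expected output shown earlier in this article.
per_file_counts = [
    (3164, 4515),   # allswellthatendswell
    (2904, 4122),   # asyoulikeit
    (2112, 2937),   # comedyoferrors
]

all_line_count = 0
all_nonblank_line_count = 0
for nonblank_line_count, total_line_count in per_file_counts:
    # Add each file's finished counts to the running grand totals.
    all_nonblank_line_count += nonblank_line_count
    all_line_count += total_line_count

print(all_nonblank_line_count, "non-blank lines out of",
      all_line_count, "total lines")
```

Each pass through the loop adds one file's finished counts to the grand totals, which is exactly what the two += lines do after single_file_reading_routine has run.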

Setup and finish

Now we just have to write the very beginning and end of our program.

First, import glob() and join(), which we call in our code:

from os.path import join
from glob import glob

The exercise requirements expect these to be the final two lines of output:

All together, Shakespeare's 42 text files have:
125097 non-blank lines out of 172948 total lines

So, using our previously instantiated variables (note that calling len() on the filenames list gets us a count of items in that list):

print("All together, Shakespeare's",
      len(filenames), "text files have:")
print(all_nonblank_line_count,
      "non-blank lines out of",
      all_line_count, "total lines")

All together, from the inside out

Here's what everything looks like, assembled and written out. I've tried to indicate using comments how and where each individual part is in the final code. The result is more intimidating than the individual parts!

      ## The setup
from os.path import join
from glob import glob
      ## keeping track of the total total line counts
all_line_count = 0
all_nonblank_line_count = 0

      ## globbing a list of filenames
filepattern = join('tempdata', '**', '*')
filenames = glob(filepattern)
for fname in filenames:
          ## the start of single_file_reading_routine
              ## Actually open a file for reading
    txtfile = open(fname, 'r')
    total_line_count = 0
    nonblank_line_count = 0
    for line in txtfile:
              ## Count all the lines
        total_line_count += 1
              ## Test whether a text string is "blank"
        if line.strip() != "":
                  ## Count just the non-blank lines
            nonblank_line_count += 1
    txtfile.close()
              ## Print out the file's stats
    print(fname, 'has', nonblank_line_count,
          'non-blank lines out of', total_line_count, 'total lines')
          ## end of single_file_reading_routine

        ## keeping track of the total total line counts
    all_nonblank_line_count += nonblank_line_count 
    all_line_count += total_line_count 

    ## the finish
print("All together, Shakespeare's",
      len(filenames), "text files have:")
print(all_nonblank_line_count,
      "non-blank lines out of",
      all_line_count, "total lines")

It looks a little less intimidating without all of those comments:

from os.path import join
from glob import glob
all_line_count = 0
all_nonblank_line_count = 0

filepattern = join('tempdata', '**', '*')
filenames = glob(filepattern)
for fname in filenames:
    txtfile = open(fname, 'r')
    total_line_count = 0
    nonblank_line_count = 0
    for line in txtfile:
        total_line_count += 1
        if line.strip() != "":
            nonblank_line_count += 1
    txtfile.close()
    print(fname, 'has', nonblank_line_count,
          'non-blank lines out of', total_line_count, 'total lines')

    all_nonblank_line_count += nonblank_line_count 
    all_line_count += total_line_count 

print("All together, Shakespeare's",
      len(filenames), "text files have:")
print(all_nonblank_line_count,
      "non-blank lines out of",
      all_line_count, "total lines")

Thinking about design

If you aren't already too bored with this problem, see if you can approach it from a typical top-down approach, i.e. start with the imports:

from os.path import join
from glob import glob

Then glob together the file list, and start the first for-loop:

filenames = glob(join('tempdata', '**', '*'))
for fname in filenames:
    txtfile = open(fname, 'r')

See if that approach makes more sense than what I've just demonstrated. It kind of depends on how you think about the problem, though as you get better and better, you'll solve problems using a mix of approaches. This exercise is less about trying to solve this dull Shakespeare problem, and more about how many different ways you can tackle a problem, including deconstructing it into smaller tasks.

Thinking about functions

We didn't cover the tactic of creating our own functions, mostly because I think this problem is simple enough to do as one ugly block of code. But in case you were wondering: yes, writing scripts as 40-50+ lines of linear, top-to-bottom code results in pretty ugly-looking code.

In later exercises, we'll work on breaking up our program into pieces. I hinted at this when I used single_file_reading_routine as a pseudocode abbreviation for a chunk of code that worked in isolation.

In practice, we'll learn to turn those self-sufficient chunks of code – i.e. routines – into functions. If you think of defining variables as creating human-readable labels for data, think of defining functions as creating human-readable abbreviations for blocks of code that we intend to run over and over, or that we at least want to separate from the main body of code.

Here's a solution to the exercise in which single_file_reading_routine has been turned into a function that accepts a single argument – a filename – and returns a list of two numbers (nonblank_line_count and total_line_count). Notice how the main routine of the program is not much more than 10 lines, with single_file_reading_routine abstracted out into its own function:

from os.path import join
from glob import glob

def single_file_reading_routine(fname):
    txtfile = open(fname, 'r')
    total_line_count = 0
    nonblank_line_count = 0
    for line in txtfile:
        total_line_count += 1
        if line.strip() != "":
            nonblank_line_count += 1
    txtfile.close()
    print(fname, 'has', nonblank_line_count,
          'non-blank lines out of', total_line_count, 'total lines')
    return [nonblank_line_count, total_line_count]

## Main routine
all_line_count = 0
all_nonblank_line_count = 0
filepattern = join('tempdata', '**', '*')
filenames = glob(filepattern)
for fname in filenames:
    line_counts = single_file_reading_routine(fname)
    all_nonblank_line_count += line_counts[0] 
    all_line_count += line_counts[1] 

print("All together, Shakespeare's",
      len(filenames), "text files have:")
print(all_nonblank_line_count,
      "non-blank lines out of",
      all_line_count, "total lines")
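
One possible variation on the function idea, offered as a sketch rather than as the exercise's official solution: have the function accept any iterable of lines instead of a filename, and return the two counts as a tuple that the caller unpacks into two named variables. The name count_lines is hypothetical, and the sample text is made up. Because the function no longer opens a file itself, it can be tried out without any files on disk.

```python
import io


def count_lines(lines):
    """Return (nonblank_count, total_count) for any iterable of strings."""
    total_count = 0
    nonblank_count = 0
    for line in lines:
        total_count += 1
        if line.strip() != "":
            nonblank_count += 1
    return nonblank_count, total_count


# An open file is an iterable of lines, and so is io.StringIO,
# which stands in for a real file here.
fake_file = io.StringIO("Enter HAMLET\n\nTo be, or not to be\n")
nonblank, total = count_lines(fake_file)   # tuple unpacking
print(nonblank, "non-blank lines out of", total, "total lines")
```

Unpacking into nonblank and total avoids having to remember which of line_counts[0] and line_counts[1] is which.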

References and Related Readings

PEP 0008 Style Guide for Python Code
One of Guido van Rossum's key insights is that code is read much more often than it is written. The guidelines provided here are intended to improve the readability of code and make it consistent across the wide spectrum of Python code. As PEP 20 says, "Readability counts".