The problem

Here's the problem we're trying to solve – if you're doing this as homework, see the full info for this exercise:


          0004-shakefiles/e.py

Read through and count each line in the Hamlet text, then print the total number of lines

Re-open the tempdata/tragedies/hamlet file as before, but this time read through the entire file, line-by-line, and print the total number of lines in the file.

Expectations

When you run e.py from the command-line:

0004-shakefiles $ python e.py

The program's output to screen should be:

tempdata/tragedies/hamlet has 6045 lines

Count every line by reading every line

This is mostly an extension of the previous exercise, to prepare you for the similar-sounding yet significantly more complicated next exercise: Print the final 5 lines of Romeo and Juliet

The main takeaway of this lesson is that in order to get a count of how many lines are in a file, we have to literally read every line in the file (and keep count).

Using a for-loop with a file object

Revisiting the code for the previous exercise:

import os
fname = os.path.join('tempdata', 'tragedies', 'hamlet')
hamletfile = open(fname, 'r')
for x in range(5):
    print(hamletfile.readline().strip())
hamletfile.close()

Instead of iterating through the first 5 lines (i.e. range(5)), we want to iterate through all of the lines.

However, we can't use range() because we don't know what number to use (without manually looking it up) to signify the end of file.

So instead of passing in a range() as the iterable object, we pass in the file object itself, making the file object the thing that the for loop iterates over:

for x in hamletfile:
    # ...

A file object is iterable: when it is used in a for-loop, it hands over one line of the file per iteration. In this variation, the variable x does not represent an integer within a given range. Instead, with each iteration, x points to a line (i.e. a string object, including the newline character at its end).
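To see what x holds on each pass, here is a minimal sketch that uses io.StringIO as a stand-in for the opened Hamlet file (the three sample lines are made up for illustration). An in-memory text stream is iterated the same way as a file opened in text mode: one line per iteration, trailing newline included.

```python
import io

# Stand-in for the opened file: io.StringIO is iterated line-by-line,
# just like a file object returned by open(fname, 'r').
fakefile = io.StringIO("first line\nsecond line\nthird line\n")

seen = []
for x in fakefile:
    # x is a str that still ends with its newline character
    seen.append(x)
    print(type(x).__name__, repr(x))

fakefile.close()
```

The repr() call makes the trailing newline visible, which is why the earlier snippets call strip() before printing.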

If we wanted to use this form of the for-loop and still print only the first 5 lines of Hamlet, we would have to keep track of the line number with our own variable, and then use a conditional branch to test whether that variable is less than 5:

line_num = 0
for x in hamletfile:
    if line_num < 5:
        print(x.strip()) # x is a line of text, i.e. a string object
    line_num += 1

We also don't have to call the readline() method explicitly; this kind of for-loop fetches each line for us. Here's the full code snippet in context:

import os
fname = os.path.join('tempdata', 'tragedies', 'hamlet')
hamletfile = open(fname, 'r')
line_num = 0
for x in hamletfile:
    if line_num < 5:
        print(x.strip()) # x is a line of text, i.e. a string object
    line_num += 1

hamletfile.close()
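The manual counter can also be handed off to Python's built-in enumerate(), and a break stops the loop once five lines have been printed instead of reading the rest of the file. This is a sketch of an alternative, not something the exercise requires; the function name print_first_lines and the temporary sample file are made up so that the example runs without the Hamlet text.

```python
import os
import tempfile

def print_first_lines(path, limit=5):
    """Print the first `limit` lines of the file and return them as a list."""
    printed = []
    infile = open(path, 'r')
    for line_num, x in enumerate(infile):
        if line_num >= limit:
            break  # stop reading; the remaining lines are not needed
        print(x.strip())
        printed.append(x.strip())
    infile.close()
    return printed

# Build a small throwaway file (hypothetical sample data) to try it on
tmpdir = tempfile.mkdtemp()
sample = os.path.join(tmpdir, 'sample.txt')
outfile = open(sample, 'w')
for n in range(8):
    outfile.write("line %d\n" % n)
outfile.close()

first_five = print_first_lines(sample)
```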

Keeping count

However, for this exercise, we don't have to print the actual text of the lines. We just want a line count – which is stored inside the line_num variable.

But how does the for-loop know when to end? The loop ends once the hamletfile file object has no more lines to hand over, i.e. after the final line has been read. At that point, line_num contains the final line count:

import os
fname = os.path.join('tempdata', 'tragedies', 'hamlet')
hamletfile = open(fname, 'r')
line_num = 0
for x in hamletfile:
    line_num += 1
hamletfile.close()

print(fname, "has", line_num, "lines")
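The same count can also be written with a with block, which closes the file automatically, even if an error occurs partway through. The sketch below wraps the loop in a function; the name count_lines and the temporary sample file are illustrative assumptions so that it runs without the Hamlet text.

```python
import os
import tempfile

def count_lines(path):
    """Return the number of lines in the text file at `path`."""
    line_num = 0
    with open(path, 'r') as infile:   # closed automatically on exit
        for x in infile:
            line_num += 1
    return line_num

# Throwaway sample file standing in for tempdata/tragedies/hamlet
tmpdir = tempfile.mkdtemp()
sample = os.path.join(tmpdir, 'sample.txt')
with open(sample, 'w') as outfile:
    outfile.write("To be, or not to be\nthat is the question\n")

print(sample, "has", count_lines(sample), "lines")
```

For the exercise itself, count_lines(fname), with fname built by os.path.join() as before, would take the place of the open/loop/close sequence.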

Why is counting lines so complicated?

That covers the requirements of this exercise. Still, it's worth asking: isn't there an easier way to count lines?

With modern computers, reading files happens so quickly that even a 10,000-line file seems to be read "instantaneously". However, a truly instantaneous read is not physically possible. At the physical layer (i.e. the itty-bitty-electron level), data bits are being read sequentially. While it's possible to get the byte count of a file – i.e. how much space its contents take up on disk – using a helper function:

>>> import os
>>> fname = os.path.join('tempdata', 'tragedies', 'hamlet')
>>> os.path.getsize(fname)
182567

– you have no idea which of those individual bytes represent newline characters, i.e. "\n", which define where each line in a text file ends. Hence the need to read through the whole file to get an exact line count.
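To make that concrete, here is a sketch that counts newline bytes directly by reading the file in binary mode, one chunk at a time. It still has to look at every byte, so it is no shortcut around reading the file; the function name count_newline_bytes, the chunk size, and the sample file are assumptions for illustration. Note that it counts "\n" bytes only, so a final line that lacks a trailing newline would not be included.

```python
import os
import tempfile

def count_newline_bytes(path, chunk_size=65536):
    """Count b'\\n' bytes by scanning the whole file in fixed-size chunks."""
    total = 0
    with open(path, 'rb') as infile:
        while True:
            chunk = infile.read(chunk_size)
            if not chunk:        # an empty bytes object means end of file
                break
            total += chunk.count(b'\n')
    return total

# Throwaway sample file (hypothetical data) to try it on
tmpdir = tempfile.mkdtemp()
sample = os.path.join(tmpdir, 'sample.txt')
with open(sample, 'wb') as outfile:
    outfile.write(b"one\ntwo\nthree\n")

print(os.path.getsize(sample), "bytes,", count_newline_bytes(sample), "newlines")
```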

So if the next exercise involves printing just the final 5 lines of a given text file…you can guess that that will require reading every line up to and including those final 5 lines.
