The Checklist
In your compciv-2016 Git repository create a subfolder and name it:
exercises/0004-shakefiles
The folder structure will look like this (not including any subfolders such as `tempdata/`):
compciv-2016
└── exercises
    └── 0004-shakefiles
        ├── a.py
        ├── b.py
        ├── c.py
        ├── d.py
        ├── e.py
        ├── f.py
        ├── g.py
        └── h.py
Background information
This is an expansive, overly verbose set of exercises that not only covers a fairly boring topic (how to organize and read files) but also attempts to introduce software design concepts, such as how to write a program by tackling its smallest problem first and then stepping backwards through the process, rather than taking the typical top-down approach.
Although these exercises follow basic patterns that will apply to virtually everything else we'll do, don't worry about memorizing the details. Make sure that you can actually get the code to work on your computer. And make sure you can reason through it. Because all of the finished code and answers are basically just given to you, I'm expecting that you actually take the time to write it out, and not just copy-paste it.
The finished programs are fairly intimidating at first glance. However, even just typing out the code and changing up the variables will slow you down enough to see how everything fits together. Try re-arranging or tidying up the code on your own.
For example, I'm often verbose in my solutions so that you can follow the process, line-by-line:
from glob import glob
from os.path import join

DATA_DIR = 'tempdata'
filepattern = join(DATA_DIR, '**', '*')
filenames = glob(filepattern)
But if you're wondering, "Well, that seems like it could all be one line" – then do it yourself, and see what happens:
filenames = glob(join('tempdata', '**', '*'))
Don't take my solutions as gospel – that's not the way programming works. You should try things out that seem to make sense to you. In later exercises, I will not be doing as much hand-holding, and it's going to be more of a free-for-all in terms of how I feel like naming things and organizing my code. Since the provided code and answers should just "work", you should take the time to be confident with not just rewriting the code, but altering it to your tastes.
Before you start this lesson
Create a .gitignore
Your Git repository should be properly configured. In your compciv-2016 folder, create a text file named .gitignore. It should contain this:
.DS_Store
creds_*
tempdata/
__pycache__/
*.py[cod]
Here's an example of what it would look like in your GitHub repo.
The point of this file is to keep you from pushing tempdata into your GitHub repository. The tempdata directory, during the exercise, contains downloaded files that you work with, but never actually alter. Thus, I don't need to ever see this directory, as I can recreate it on my own.
Basically, there's no point in everyone pushing Shakespeare's complete works into their online repos. The upshot is that you will never see tempdata when doing any of the git commands, such as git status. This is the point of .gitignore.
Reading about the fundamentals
Though I've created guides on how to complete every one of these exercises, it's expected that you've read these following guides so that you're familiar with the basics:
An overview of the new functions
Here are the specific modules and functions you'll practice in these exercises:
- glob.glob() - Return a possibly-empty list of path names that match pathname, which must be a string containing a path specification.
- os.path.join() - Join one or more path components intelligently.
- os.makedirs() - Recursive directory creation function
- shutil.unpack_archive() - Unpack an archive.
The Exercises
0004-shakefiles/a.py » Create the `tempdata` directory idempotently
For many of the assignments, you will be stashing downloaded files and data into a local directory named tempdata. Write a Python program to create that directory. The program should be “smart” enough not to crash/error-out if the tempdata directory already exists.
When you run a.py from the command-line:
0004-shakefiles $ python a.py
- The program should not output anything to screen.
- The program creates this file path: tempdata (directory)
- The program must not crash if the tempdata directory already exists.
- Idempotent is a fun word to use. It’s also a “feature” that is useful to design towards, as a programmer. You never know how many times your program will be executed, or under what circumstances.
- It’s kind of neat how os.makedirs() will throw an error if you try to use it to create an existing directory and you leave out the exist_ok argument. However, other file-system changing functions will not be nearly as careful by default…
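Here is a minimal sketch of the idempotent directory creation described above. The function name make_data_dir is just something I made up for illustration; the only real work is the call to os.makedirs() with exist_ok=True.

```python
import os

DATA_DIR = 'tempdata'

def make_data_dir(dirname=DATA_DIR):
    # make_data_dir is a hypothetical helper name, not part of the assignment.
    # exist_ok=True tells os.makedirs() not to raise FileExistsError
    # when the directory is already there, so repeated calls are safe.
    os.makedirs(dirname, exist_ok=True)
    return dirname
```

Calling make_data_dir() once or ten times should leave you with the same single tempdata directory and no output to screen.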
0004-shakefiles/b.py » Download the zip file of Shakespearean texts to the tempdata directory
Write the Python commands to download the file from the following URL:
http://stash.compciv.org/scrapespeare/matty.shakespeare.tar.gz
And save it to:
tempdata/matty.shakespeare.tar.gz
You don’t need to unzip it, just worry about downloading it and saving it to disk.
When you run b.py from the command-line:
0004-shakefiles $ python b.py
- The program's output to screen should be:
Downloading: http://stash.compciv.org/scrapespeare/matty.shakespeare.tar.gz
Writing file: tempdata/matty.shakespeare.tar.gz
- The program creates this file path: tempdata/matty.shakespeare.tar.gz
- The program accesses this remote file: http://stash.compciv.org/scrapespeare/matty.shakespeare.tar.gz
Downloading a file and then saving it to disk with code is significantly more complicated than doing it through the browser.
This program is idempotent. If the file has already been downloaded, it will just be re-downloaded. Sometimes, that’s a good thing. Later on, for truly massive files that just never change, we will probably introduce a conditional statement so that our programs download files only when needed.
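One way the download-and-save step could be sketched is below. I'm assuming the standard library's urllib.request here, which may not be the library used in the walkthrough, and the function name download is hypothetical. The sketch also assumes the tempdata directory already exists.

```python
from os.path import basename, join
from urllib.request import urlopen

SRC_URL = 'http://stash.compciv.org/scrapespeare/matty.shakespeare.tar.gz'
DATA_DIR = 'tempdata'
DEST_FILENAME = join(DATA_DIR, basename(SRC_URL))

def download(url, dest_filename):
    # download is a hypothetical helper name.
    # Step 1: fetch the raw bytes from the URL
    print('Downloading:', url)
    with urlopen(url) as resp:
        data = resp.read()
    # Step 2: write those bytes to disk; 'wb' because an archive is binary.
    # Opening in write mode replaces any earlier copy of the file.
    print('Writing file:', dest_filename)
    with open(dest_filename, 'wb') as f:
        f.write(data)
    return dest_filename
```

Calling download(SRC_URL, DEST_FILENAME) would re-download and overwrite the file every time, which matches the behavior described above.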
0004-shakefiles/c.py » Unzip the contents of the Shakespearean zip file into tempdata
Like downloading files, unzipping files is more complicated when you do it programmatically. The zip file might not unpack its contents where you thought it would…
When you run c.py from the command-line:
0004-shakefiles $ python c.py
- The program's output to screen should be:
Unpacked tempdata/matty.shakespeare.tar.gz into: tempdata
- The program creates this file path: tempdata/comedies (directory)
- The program creates this file path: tempdata/histories (directory)
- The program creates this file path: tempdata/poetry (directory)
- The program creates this file path: tempdata/tragedies (directory)
- You might have assumed that unzipping tempdata/matty.shakespeare.tar.gz would unpack the contents of the zip file into tempdata. But when you execute this particular program (i.e. c.py), you are outside the tempdata directory. Unless you tell it otherwise, Python assumes you want things done relative to where you executed the script.
- We’ve been keeping things simple but it is very easy to not know where “you” are when you executed a script.
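A short sketch of unpacking with an explicit destination follows. The function name unpack is hypothetical; the key detail is passing extract_dir to shutil.unpack_archive(), since leaving it out unpacks into the current working directory.

```python
import shutil

def unpack(archive_filename, extract_dir):
    # unpack is a hypothetical helper name.
    # Without extract_dir, the contents would land in whatever directory
    # the script was executed from, not next to the archive.
    shutil.unpack_archive(archive_filename, extract_dir=extract_dir)
    print('Unpacked', archive_filename, 'into:', extract_dir)
```

For this exercise the call would look something like unpack('tempdata/matty.shakespeare.tar.gz', 'tempdata').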
0004-shakefiles/d.py » Print the first 5 lines of the Hamlet text
From the text file at tempdata/tragedies/hamlet, read and print the first 5 lines of text.
When you run d.py from the command-line:
0004-shakefiles $ python d.py
- The program's output to screen should be:
HAMLET DRAMATIS PERSONAE
- A filename is not an actual file. It’s just a string that represents the human-readable name of a file, e.g. tempdata/tragedies/hamlet
- Opening a file, by calling the open() function on a filename, does not actually read the file. It just gives us access to a stream object, which has several methods for reading data from the “stream”, including all-at-once or line-by-line.
- By default, the open() function will attempt to read a file and will throw an error if that file doesn’t exist. This is much, much preferable to the situation when you open an existing file to write to it, which will immediately wipe out that file.
- Each line of text in a file has a newline character. That’s what makes it separate from the next line. Keeping in mind that a line of text is, well, a string, you can use its strip() method to remove whitespace from both sides of the text, including newlines.
- It’s considered good manners to invoke a file stream’s close() method when you’re done with the file. Imagine a scenario in which other programs are trying to open that file…
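To make the open / read / strip / close sequence concrete, here is a rough sketch. The function name first_lines is hypothetical, and I'm assuming you want each line stripped of surrounding whitespace; whether the expected output keeps leading whitespace is something to check against your own run.

```python
def first_lines(filename, n=5):
    # first_lines is a hypothetical helper name.
    lines = []
    f = open(filename, 'r')      # gives us a stream; nothing is read yet
    for _ in range(n):
        line = f.readline()      # reads exactly one line from the stream
        if line == '':           # an empty string means end of file
            break
        lines.append(line.strip())  # drop the newline and other whitespace
    f.close()                    # good manners: release the file
    return lines
```

Printing each item returned by first_lines('tempdata/tragedies/hamlet') would cover the exercise.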
0004-shakefiles/e.py » Read through and count each line in the Hamlet text, then print the total number of lines
Re-open the tempdata/tragedies/hamlet file as before, but read through the entire file, line-by-line, and print the total count of the number of lines in the file.
When you run e.py from the command-line:
0004-shakefiles $ python e.py
- The program's output to screen should be:
tempdata/tragedies/hamlet has 6045 lines
Opening and reading files via programming is so cumbersome at first. But it’s worth doing, over-and-over, until it becomes routine and reflex, as there is a lot of nuance that can come into play. Think about how Excel, or even your plain text editor, will bring your system down to a halt when you have it open a massive file. You don’t want that happening in your scripts.
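A sketch of counting lines without loading the whole file into memory at once is below. The names count_lines and line_report are hypothetical. Looping over the file object reads one line at a time, and the with statement closes the file for you when the block ends.

```python
def count_lines(filename):
    # count_lines is a hypothetical helper name.
    count = 0
    with open(filename, 'r') as f:
        for line in f:       # the stream yields one line per iteration
            count += 1
    return count

def line_report(filename):
    # Builds the message in the format the exercise expects
    return '{} has {} lines'.format(filename, count_lines(filename))
```

For example, print(line_report('tempdata/tragedies/hamlet')) would produce the single line of output that e.py asks for.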
0004-shakefiles/f.py » Print the final 5 lines of Romeo and Juliet
Open the file at tempdata/tragedies/romeoandjuliet and read and print the final 5 lines.
This seems like the same exercise as d.py, except that we read from Romeo and Juliet instead of Hamlet, and that we read the final 5 lines instead of the first 5 lines.
That first difference is easy to do; that second one is a much different problem to tackle.
This tutorial walks through the process.
Pay special attention to the expected output, particularly:
- There is no space between the line number and the colon, e.g. 2: not 2 :
- The last line ends at 4766. Make sure you’re not off-by-one.
Having trouble with adding a number to a string, i.e. 1 and ":" to make "1:"? Try using the str() function to convert a number to a string.
When you run f.py from the command-line:
0004-shakefiles $ python f.py
- The program's output to screen should be:
4762: Some shall be pardon'd, and some punished:
4763: For never was a story of more woe
4764: Than this of Juliet and her Romeo.
4765:
4766: [Exeunt]
- The range() function is an easy way to generate a list of numbers to loop through.
- Combining strings and other data values in order to generate a pre-defined format of string is a common situation, and an extremely annoying one if all you know is how to add strings together via the + operator. Stay on the lookout for other methods, as complicated as they first seem.
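One possible sketch of the "final 5 lines with line numbers" problem follows. The function name numbered_tail is hypothetical, and I'm assuming it's acceptable to read all the lines into a list first, then use range() to walk over the last few indexes. The str.format() method stands in for the str() and + approach.

```python
def numbered_tail(filename, n=5):
    # numbered_tail is a hypothetical helper name.
    with open(filename, 'r') as f:
        lines = f.readlines()          # every line, as a list of strings
    total = len(lines)
    start = max(total - n, 0)          # guard against files shorter than n
    result = []
    for i in range(start, total):
        # list indexes start at 0 but line numbers start at 1,
        # hence i + 1 (this is where off-by-one errors creep in)
        result.append('{}: {}'.format(i + 1, lines[i].strip()))
    return result
```

Note that a blank line comes out as the number, a colon, and a trailing space with this format string; compare that against the expected output before relying on it.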
0004-shakefiles/g.py » Print the final 5 lines for all of Shakespeare's tragedies
For each file in tempdata/tragedies/:
- Count and print the number of lines in the file.
- Print the text of the final 5 lines, along with the corresponding line number.
When you run g.py from the command-line:
0004-shakefiles $ python g.py
- The program's first 6 lines of output to screen should be:
tempdata/tragedies/antonyandcleopatra has 5998 lines
5994: In solemn show attend this funeral;
5995: And then to Rome. Come, Dolabella, see
5996: High order in this great solemnity.
5997:
5998: [Exeunt]
- The program's last 6 lines of output to screen should be:
tempdata/tragedies/titusandronicus has 3767 lines
3763: By whom our heavy haps had their beginning:
3764: Then, afterwards, to order well the state,
3765: That like events may ne'er it ruinate.
3766:
3767: [Exeunt]
- The syntax for glob.glob() seems awkward, doesn’t it? Consider using the from glob import glob style of import statement.
- Repeating this exercise for all of the Shakespeare files would be very easy.
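Putting the directory listing and the tail-printing together might look roughly like this. The function name tail_report is hypothetical, and the call to sorted() is my own assumption: glob() makes no promise about ordering, while the expected output happens to be alphabetical.

```python
from glob import glob
from os.path import join

def tail_report(dirname, n=5):
    # tail_report is a hypothetical helper name.
    report = []
    # sorted() is an assumption so the files come out alphabetically
    for filename in sorted(glob(join(dirname, '*'))):
        with open(filename, 'r') as f:
            lines = f.readlines()
        total = len(lines)
        report.append('{} has {} lines'.format(filename, total))
        # same numbered-tail logic as in the previous exercise
        for i in range(max(total - n, 0), total):
            report.append('{}: {}'.format(i + 1, lines[i].strip()))
    return report
```

Looping over tail_report(join('tempdata', 'tragedies')) and printing each item would give the full output for g.py.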
0004-shakefiles/h.py » Count and print the number of non-blank lines for each of Shakespeare's texts
You already know how to collect the list of filenames in a single directory. And you know how to read through a file and count the lines. This exercise combines both problems and adds a few twists:
- Loop through all of the Shakespeare files (there are 42 of them)
- When reading the lines, keep track of how many of the lines are non-blank.
- Print the number of non-blank lines versus total lines in each text.
- At the end, print the total count of non-blank lines and all lines.
“Non-blank” for this exercise is defined as: a line that has at least one non-whitespace character. However, it’s hard to distinguish between a completely empty line and one that is full of invisible whitespace characters. So use the strip() method.
When you run h.py from the command-line:
0004-shakefiles $ python h.py
- The program's first 3 lines of output to screen should be:
tempdata/comedies/allswellthatendswell has 3164 non-blank lines out of 4515 total lines
tempdata/comedies/asyoulikeit has 2904 non-blank lines out of 4122 total lines
tempdata/comedies/comedyoferrors has 2112 non-blank lines out of 2937 total lines
- The program's last 3 lines of output to screen should be:
tempdata/tragedies/titusandronicus has 2837 non-blank lines out of 3767 total lines
All together, Shakespeare's 42 text files have:
125097 non-blank lines out of 172948 total lines
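Here is one way the whole thing could be sketched. The names count_nonblank and summarize are hypothetical. The glob pattern is the one from the background section; without recursive=True, the '**' part behaves like a plain '*', so it matches files exactly one subfolder down. The sorted() call and the isfile() filter are my own assumptions.

```python
from glob import glob
from os.path import isfile, join

def count_nonblank(filename):
    # count_nonblank is a hypothetical helper name.
    nonblank = 0
    total = 0
    with open(filename, 'r') as f:
        for line in f:
            total += 1
            # strip() turns a whitespace-only line into an empty string
            if line.strip() != '':
                nonblank += 1
    return nonblank, total

def summarize(data_dir):
    # summarize is a hypothetical helper name.
    pattern = join(data_dir, '**', '*')
    filenames = sorted(fn for fn in glob(pattern) if isfile(fn))
    report = []
    all_nonblank = 0
    all_total = 0
    for fn in filenames:
        nonblank, total = count_nonblank(fn)
        all_nonblank += nonblank
        all_total += total
        report.append('{} has {} non-blank lines out of {} total lines'
                      .format(fn, nonblank, total))
    report.append("All together, Shakespeare's {} text files have:"
                  .format(len(filenames)))
    report.append('{} non-blank lines out of {} total lines'
                  .format(all_nonblank, all_total))
    return report
```

Printing each item of summarize('tempdata') would produce the per-file lines followed by the two-line grand total.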