Here's the problem we're trying to solve – if you're doing this as homework, see the full info for this exercise:
Write the Python commands to download the file from the following URL:
And save it to:
You don’t need to unzip it, just worry about downloading it and saving it to disk.
When you run b.py from the command-line:
0004-shakefiles $ python b.py
Downloading: http://stash.compciv.org/scrapespeare/matty.shakespeare.tar.gz
Writing file: tempdata/matty.shakespeare.tar.gz
By now, we already know how to download a file from the Web with the Requests library:
import requests
resp = requests.get('http://www.example.com')
thetext = resp.text
However, downloading a file programmatically doesn't mean that the file has been saved (i.e. written) to our hard drive. That is its own step.
Before we can write a file to disk, we must open a new file stream object with the open() function. This is similar to what we have to do when reading an existing file, but take notice of the second argument:
>>> newfile = open("somenewfile.txt", "w")
The "w" string tells the open() function that we want to write to this file object rather than read from it. As you can imagine, "w" stands for write. You can think of the first argument, e.g. the string "somenewfile.txt", as us naming the file to be created.
It's worth stopping for a moment and considering: what happens when we try to open an existing file path to write to?
The answer: if you give the open() function an existing filename with the intention of writing to it – whatever existed at that filename is permanently erased.
There is no Recycle Bin at the programming level. The Python interpreter just assumes you know what you're doing, and won't even throw an error or warning. It will just wipe out the existing file before writing to it.
So consider this your warning to be incredibly mindful whenever you want to write a file to disk.
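One way to guard against accidental overwrites is Python's exclusive-creation mode, "x", which makes open() raise FileExistsError instead of erasing an existing file. The sketch below is only an illustration: the helper name write_new_file and the temporary directory are assumptions made up for the demo, not part of the exercise.

```python
import os
import tempfile

def write_new_file(path, text):
    """Write text to path, refusing to overwrite an existing file.

    Hypothetical helper for illustration only.
    """
    # "x" means exclusive creation: open() raises FileExistsError
    # if something already exists at path, instead of wiping it.
    newfile = open(path, "x")
    count = newfile.write(text)
    newfile.close()
    return count

# Work in a throwaway directory so the demo cannot touch real files.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "somenewfile.txt")

first_count = write_new_file(demo_path, "hello")  # creates the file

try:
    write_new_file(demo_path, "this would clobber the file")
    was_overwritten = True
except FileExistsError:
    was_overwritten = False  # the original contents are untouched

print(first_count, was_overwritten)
```

The trade-off is that "x" fails on every re-run of a script unless the old file is removed first, so it suits files that should only ever be created once.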
OK, going back to the step in which we opened a file for writing:
>>> newfile = open("somenewfile.txt", "w")
What is that newfile variable pointing to? Use the type() function to find out:
>>> type(newfile)
_io.TextIOWrapper
For simplicity's sake, I'm just going to refer to it as a "file object" (or file stream object). Let's use the Tab autocomplete to get its list of methods:
>>> newfile.    # hit Tab here
newfile.buffer          newfile.isatty          newfile.readlines
newfile.close           newfile.line_buffering  newfile.seek
newfile.closed          newfile.mode            newfile.seekable
newfile.detach          newfile.name            newfile.tell
newfile.encoding        newfile.newlines        newfile.truncate
newfile.errors          newfile.read            newfile.writable
newfile.fileno          newfile.readable        newfile.write
newfile.flush           newfile.readline        newfile.writelines
You can guess that the write function is what we want. But this object also has a read function… That's because it's a file object, and file objects can be written to or read from. It doesn't matter how we called the open() function.
That said… go ahead and try to read from it:
>>> newfile.read()
UnsupportedOperation: not readable
That is how Python reminds us that the file is not meant to be read from, since we called open() with "w" as the second argument.
Now that we have that cleared up, let's just write to the file. You can pass in a string object as the argument, and call write() as many times as you want to:
>>> newfile.write("hello")
5
>>> newfile.write("world")
5
The write() function returns the number of characters that were written to the file. After we've finished writing to the file, we call its close() method:
>>> newfile.close()
Now switch to your text editor and look for the file you just created. If you've been following this example, the filename we used is somenewfile.txt.
The contents of that file should be a single line: helloworld
Or, alternatively, you could use Python to re-open the file and then read it:
(Note: when just opening a file in order to read it, the second argument of the open() function is optional. By default, open() assumes you want to read from the given filepath. I include "r" here just to be explicit.)
>>> myfile = open("somenewfile.txt", "r")
>>> txt = myfile.read()
>>> print(txt)
helloworld
>>> myfile.close()
Note that helloworld is not on two different lines. The write() method doesn't automatically add newline characters to the argument we pass in. If we want newlines in the file, we have to explicitly include the newline character, \n, in the strings we write.
Let's try it now. And let's also deliberately overwrite our old file, somenewfile.txt:
>>> newfile = open("somenewfile.txt", "w")
>>> newfile.write("hello\n")
>>> newfile.write("world\n")
>>> newfile.close()
If you read from somenewfile.txt, you'll see that its contents are:
hello
world
I've written a separate guide about writing files, but this section should contain all you need to know for this particular lesson.
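As an aside, the open-write-close sequence is often written with Python's with statement, which closes the file automatically when the indented block ends, even if an error occurs partway through. This is a minimal sketch of the same hello/world example; the temporary directory is an assumption so that the demo doesn't overwrite your own somenewfile.txt.

```python
import os
import tempfile

demo_path = os.path.join(tempfile.mkdtemp(), "somenewfile.txt")

# The file is closed automatically at the end of the with block.
with open(demo_path, "w") as newfile:
    newfile.write("hello\n")
    newfile.write("world\n")

# Re-open for reading ("r" is the default mode) and split into lines.
with open(demo_path) as myfile:
    lines = myfile.read().splitlines()

print(newfile.closed)
print(lines)
```

The rest of this lesson keeps the explicit close() calls, which do the same job as long as you remember to make them.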
Let's go back to requests.get(), from the beginning:
>>> import requests
>>> resp = requests.get("http://www.example.com")
>>> exampletxt = resp.text
>>> type(exampletxt)
str
If the download succeeded, the exampletxt variable contains the raw HTML of the page at http://www.example.com, and that raw HTML is just a string object.
Which means we can pass it into a file object's write() method, just as we wrote the strings "hello" and "world" to the file:
>>> outfile = open("example.com.html", "w")
>>> outfile.write(exampletxt)
1270
>>> outfile.close()
If you use your text editor to open example.com.html (in whichever directory you saved it to), the file should contain the raw HTML of www.example.com.
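One detail worth knowing when saving web pages as text: unless told otherwise, open() in "w" mode encodes the string with a default encoding that, on many setups, depends on the platform, so a page containing characters outside that encoding can fail to save. A hedged sketch of being explicit about it follows; the save_text helper and the sample HTML string are stand-ins invented for this illustration (in practice the string would come from resp.text).

```python
import os
import tempfile

def save_text(text, path):
    """Write a string to path as UTF-8 and return the character count.

    Hypothetical helper, not part of Requests or the exercise.
    """
    outfile = open(path, "w", encoding="utf-8")
    count = outfile.write(text)
    outfile.close()
    return count

# Stand-in for resp.text; note the non-ASCII characters.
sample_html = "<html><body><p>Café, naïve</p></body></html>"

demo_path = os.path.join(tempfile.mkdtemp(), "example.com.html")
chars_written = save_text(sample_html, demo_path)

# Read it back with the same encoding to confirm a clean round trip.
infile = open(demo_path, "r", encoding="utf-8")
round_trip = infile.read()
infile.close()

print(chars_written == len(sample_html), round_trip == sample_html)
```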
Not all files are text. Rather than explain in detail, for now, I will just show how the open() function needs to be called when writing a non-text file to disk – it requires a change to the second argument:
>>> zfile = open("mynewzipfile.zip", "wb")
Think of that "wb" as standing for: "write bytes".
Use type() on zfile to see what it points to:
>>> type(zfile)
_io.BufferedWriter
Again, I think of this as a file object, but note that it is different from the previous example involving a text file, in which the object had a type of _io.TextIOWrapper.
Whether it is a binary or text file, the same write() method exists, though for a binary file it expects bytes rather than a string.
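To make the difference concrete, here is a small sketch that writes a few hand-typed bytes in "wb" mode and shows that passing a str to the same write() method is rejected. The file name mynewfile.bin and the three byte values are arbitrary choices for the demo.

```python
import os
import tempfile

demo_path = os.path.join(tempfile.mkdtemp(), "mynewfile.bin")

zfile = open(demo_path, "wb")

# A bytes literal is prefixed with b; write() returns the number of bytes.
bytes_written = zfile.write(b"\x01\x02\x03")

# A file opened in binary mode does not accept str objects.
try:
    zfile.write("hello")
    str_accepted = True
except TypeError:
    str_accepted = False

zfile.close()

print(bytes_written, str_accepted, os.path.getsize(demo_path))
```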
But typically, we don't manually type in the bytes that we want to write to a file. Let's go back to the requests.get() method, but this time, let's download a zip file from the following path:
>>> import requests
>>> zipurl = 'http://stash.compciv.org/scrapespeare/matty.shakespeare.tar.gz'
>>> resp = requests.get(zipurl)
The content attribute of the Response object
The resp variable contains the result of the download from the given URL. This should take considerably longer (by a few seconds, at least) than downloading example.com because we're downloading the entire works of Shakespeare.
However, it's his (text) works all in a zip file, which itself is not a text file. It's a binary file full of bytes, not string characters.
The Requests library is designed so that, for binary files, we don't use the text attribute, i.e.
>>> thedata = resp.text
Instead, we use the content attribute. This is just something you have to memorize and get used to. Again, use the type() function to see what kind of object resp.content actually is (it's not a str, to hammer on this point):
>>> thedata = resp.content
>>> type(thedata)
bytes
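The relationship between the two types can be sketched without any network access: a str becomes bytes when it is encoded, and bytes become a str again when decoded. Roughly speaking, resp.text is resp.content decoded with whatever encoding Requests determines for the response. The sample string below is an assumption standing in for downloaded data.

```python
# Stand-in for text that might be downloaded; not real response data.
sample_text = "To be, or not to be"

# Encoding turns characters into raw bytes...
sample_bytes = sample_text.encode("utf-8")

# ...and decoding turns those bytes back into characters.
decoded_text = sample_bytes.decode("utf-8")

print(type(sample_text).__name__)
print(type(sample_bytes).__name__)
print(decoded_text == sample_text)
```

A compressed archive has no meaningful decoding into characters, which is why its raw bytes are written to disk untouched.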
OK, all together: downloading a zip file and then saving it to disk:
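import requests

# download the archive; resp.content holds its raw bytes
zipurl = 'http://stash.compciv.org/scrapespeare/matty.shakespeare.tar.gz'
resp = requests.get(zipurl)

# open the destination in binary mode and write the bytes to disk
zname = "matty.shakespeare.tar.gz"
zfile = open(zname, 'wb')
zfile.write(resp.content)
zfile.close()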
Check to see if matty.shakespeare.tar.gz was actually saved to your computer at the given path. You can even double-click it to see if it unzips. Note that we did not programmatically unzip the file. We simply downloaded and saved it to a path.
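That check can also be done from Python. The sketch below uses os.path and the standard library's tarfile module to confirm that a path exists, is non-empty, and looks like a tar archive, without extracting anything. Because it shouldn't depend on the real download, it builds a tiny stand-in archive in a temporary directory; the helper name looks_like_saved_archive and the stand-in file names are assumptions for illustration.

```python
import io
import os
import tarfile
import tempfile

def looks_like_saved_archive(path):
    """Return True if path exists, is non-empty, and is a readable tar archive."""
    return (
        os.path.exists(path)
        and os.path.getsize(path) > 0
        and tarfile.is_tarfile(path)
    )

demo_dir = tempfile.mkdtemp()

# Build a small stand-in .tar.gz (not the real Shakespeare archive).
archive_path = os.path.join(demo_dir, "standin.tar.gz")
payload = b"hello world\n"
with tarfile.open(archive_path, "w:gz") as tar:
    info = tarfile.TarInfo(name="hello.txt")
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))

# A plain text file with a misleading name, for comparison.
fake_path = os.path.join(demo_dir, "notreally.tar.gz")
with open(fake_path, "w") as f:
    f.write("this is not an archive")

# A path where nothing was ever saved.
missing_path = os.path.join(demo_dir, "missing.tar.gz")

print(looks_like_saved_archive(archive_path))
print(looks_like_saved_archive(fake_path))
print(looks_like_saved_archive(missing_path))
```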
This is going to seem exceedingly pedantic. On Mac OS X and Linux, the following file path:
tempdata/somefile.zip
means that the somefile.zip file is inside the tempdata directory.
However, in Windows, that path looks like this:
tempdata\somefile.zip
The differences between operating systems mean that, just to be safe, it's better to defer the naming of a file path to the join() function that is part of Python's os.path module (which is automatically included if you ran import os).
Here's what that looks like:
>>> import os
>>> mydirname = 'tempdata'
>>> myfilename = 'somefile.zip'
>>> myfullfilename = os.path.join(mydirname, myfilename)
>>> print(myfullfilename)  # note that this will be different on Windows machines
tempdata/somefile.zip
Or, focusing on brevity:
fname = os.path.join("tempdata", "somefile.zip")
Yes, that seems like a lot of code to generate the string of tempdata/somefile.zip. But besides being cross-platform compatible, it's worth using this pattern because in real-world programming, paths can get fairly complicated (e.g. with deeply nested subdirectories). It's just easier to use Python's helper functions to deal with it, in the long run.
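For example, a deeper path can be built by passing more pieces to join(). In this sketch the directory names are made up for illustration, and the printed result depends on the operating system, so no output is shown.

```python
import os

# Hypothetical nested directories, purely for illustration.
parts = ["tempdata", "shakespeare", "tragedies", "hamlet.txt"]

# join() accepts any number of path pieces and inserts the
# separator that is correct for the current operating system.
nested_path = os.path.join(*parts)

# os.sep is that separator: "/" on Mac and Linux, "\\" on Windows.
print(nested_path)
print(os.sep)

# The pieces can be recovered without hard-coding a separator.
directory, filename = os.path.split(nested_path)
print(directory)
print(filename)
```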
Revisiting our download-and-save code from the previous example, except using the join() function, and saving it to the tempdata directory (assuming that it's been created):
import requests
import os

zipurl = 'http://stash.compciv.org/scrapespeare/matty.shakespeare.tar.gz'
resp = requests.get(zipurl)

# assuming the subdirectory tempdata has been created:
zname = os.path.join('tempdata', "matty.shakespeare.tar.gz")
zfile = open(zname, 'wb')
zfile.write(resp.content)
zfile.close()
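The code above assumes the tempdata directory already exists; if it doesn't, open() raises FileNotFoundError. One way to remove that assumption is os.makedirs() with exist_ok=True, which creates the directory if needed and does nothing if it is already there. The sketch below runs inside a temporary directory (an assumption, to keep the demo away from real files) and writes a few stand-in bytes rather than the real download.

```python
import os
import tempfile

# Throwaway parent directory for the demo.
base_dir = tempfile.mkdtemp()
data_dir = os.path.join(base_dir, "tempdata")

# Create the directory if it is missing; no error if it already exists.
os.makedirs(data_dir, exist_ok=True)
os.makedirs(data_dir, exist_ok=True)  # safe to call a second time

zname = os.path.join(data_dir, "matty.shakespeare.tar.gz")

# Stand-in bytes; in the real script this would be resp.content.
zfile = open(zname, "wb")
zfile.write(b"stand-in bytes")
zfile.close()

print(os.path.isdir(data_dir), os.path.getsize(zname))
```

In the actual script, a single os.makedirs('tempdata', exist_ok=True) line before the open() call would serve the same purpose.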