Here's the problem we're trying to solve – if you're doing this as homework, see the full info for this exercise:
Write the Python commands to download the file from the following URL:
http://stash.compciv.org/scrapespeare/matty.shakespeare.tar.gz
And save it to:
tempdata/matty.shakespeare.tar.gz
You don’t need to unzip it, just worry about downloading it and saving it to disk.
When you run b.py from the command-line:
0004-shakefiles $ python b.py
Downloading: http://stash.compciv.org/scrapespeare/matty.shakespeare.tar.gz
Writing file: tempdata/matty.shakespeare.tar.gz
tempdata/matty.shakespeare.tar.gz
By now, we already know how to download a file from the Web with the Requests library:
import requests
resp = requests.get('http://www.example.com')
thetext = resp.text
However, downloading a file programmatically doesn't mean that the file has been saved (i.e. written) to our hard drive. That is its own step.
Before we can write a file to disk, we must open a new file stream object with the open() function. This is similar to what we have to do when reading an existing file, but take notice of the second argument:
>>> newfile = open("somenewfile.txt", "w")
That "w" string tells the open() function that we want to write to this file object rather than read from it. As you can imagine, "w" stands for write. The first argument, e.g. the string "somenewfile.txt", is the name of the file to be created.
It's worth stopping for a moment and considering: what happens when we try to open an existing file path to write to?
The answer: if you give the open() function an existing filename with the intention of writing to it – whatever existed at that filename is permanently erased.
There is no Recycle Bin at the programming level. The Python interpreter just assumes you know what you're doing, and won't even throw an error or warning. It will just wipe out the existing file before writing to it.
So consider this your warning to be incredibly mindful whenever you want to write a file to disk.
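If you want Python to refuse to overwrite, one option is the "x" mode of open(), which creates a new file and raises FileExistsError when the path already exists. Here is a minimal sketch of that idea; the helper name write_new_file and the throwaway directory are made up for illustration:

```python
import os
import tempfile


def write_new_file(path, text):
    """Write text to path, but refuse to overwrite an existing file."""
    # "x" means exclusive creation: open() raises FileExistsError
    # instead of silently erasing whatever is already at that path.
    f = open(path, "x")
    count = f.write(text)
    f.close()
    return count


# Work in a throwaway directory so nothing real is at risk
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "somenewfile.txt")

write_new_file(demo_path, "first version")
try:
    write_new_file(demo_path, "second version")
    overwritten = True
except FileExistsError:
    overwritten = False

print("Was the file overwritten?", overwritten)
```

The second call fails rather than wiping out the first version, so the original contents survive.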
OK, going back to that step in which we opened a file at the path somenewfile.txt:
>>> newfile = open("somenewfile.txt", "w")
What is that newfile variable pointing to? Use the type() function to find out:
>>> type(newfile)
_io.TextIOWrapper
For simplicity's sake, I'm just going to refer to it as a "file object" (or file stream object). Let's use the Tab autocomplete to get its list of methods:
>>> newfile. # hit Tab here
newfile.buffer newfile.isatty newfile.readlines
newfile.close newfile.line_buffering newfile.seek
newfile.closed newfile.mode newfile.seekable
newfile.detach newfile.name newfile.tell
newfile.encoding newfile.newlines newfile.truncate
newfile.errors newfile.read newfile.writable
newfile.fileno newfile.readable newfile.write
newfile.flush newfile.readline newfile.writelines
You can guess that the write() method is what we want. But this object also has a read() method. That's because every file object exposes both methods, no matter which mode we passed to open(). Whether a given method actually works depends on that mode.
That said, go ahead and try to read() from newfile:
>>> newfile.read()
UnsupportedOperation: not readable
This is how Python reminds us that the file is not meant to be read from, since we called open() with the "w" argument.
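Rather than triggering the error, you can ask the file object what it supports. The readable() and writable() methods both appear in the Tab-completion list above. A small sketch, using a temporary path of my own choosing:

```python
import os
import tempfile

# Throwaway location for the demo file
demo_path = os.path.join(tempfile.mkdtemp(), "somenewfile.txt")

newfile = open(demo_path, "w")
can_read = newfile.readable()   # False: opened with "w"
can_write = newfile.writable()  # True: opened with "w"
newfile.close()

print("readable:", can_read)
print("writable:", can_write)
```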
Now that we have that cleared up, let's just write to the file. You can pass in a string object as the argument, and call write() as many times as you want to:
>>> newfile.write("hello")
5
>>> newfile.write("world")
5
The write() method returns the number of characters that were written to the file. After we've finished writing to the file, we call the close() method:
newfile.close()
Now switch to your text editor and look for the file you just created. If you've been following this example, the filename we used is: somenewfile.txt
This is what the contents of that file should look like:
helloworld
Or, alternatively, you could use Python to re-open the file and then read it:
(Note: when opening a file just to read it, the second argument of the open() function is optional. By default, open() assumes you want to read from the given filepath. I include "r" here just to be explicit.)
>>> myfile = open("somenewfile.txt", "r")
>>> txt = myfile.read()
>>> print(txt)
helloworld
>>> myfile.close()
Notice that helloworld is not on two different lines. The write() method doesn't automatically add newline characters to the argument we pass in. If we want newlines in the file, we have to explicitly include the newline character: \n
Let's try it now. And let's also deliberately overwrite our old file (at the path somenewfile.txt):
>>> newfile = open("somenewfile.txt", "w")
>>> newfile.write("hello\n")
>>> newfile.write("world\n")
>>> newfile.close()
If you read from somenewfile.txt, you'll see that its contents are:
hello
world
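It's easy to forget the close() call. A common alternative is Python's with statement, which closes the file automatically when the indented block ends. This is a sketch of the same hello/world example written that way; the temporary directory is just a stand-in so the demo doesn't touch your working files:

```python
import os
import tempfile

# Throwaway location for the demo file
demo_path = os.path.join(tempfile.mkdtemp(), "somenewfile.txt")

# The file is closed automatically when the with block ends
with open(demo_path, "w") as newfile:
    newfile.write("hello\n")
    newfile.write("world\n")

is_closed = newfile.closed  # True: no explicit close() needed

# Re-open the file for reading, again letting with handle the close
with open(demo_path, "r") as myfile:
    txt = myfile.read()

print(txt)
```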
I've written a separate guide about writing files, but this section should contain all you need to know for this particular lesson.
Let's go back to requests.get(), from the beginning:
>>> import requests
>>> resp = requests.get("http://www.example.com")
>>> exampletxt = resp.text
>>> type(exampletxt)
str
If the download succeeded, the exampletxt variable contains the raw HTML of the page at http://www.example.com, and that raw HTML is just a str object.
This means we can pass it into a file object's write() method, just as we wrote the strings "hello" and "world" to the file:
>>> outfile = open("example.com.html", "w")
>>> outfile.write(exampletxt)
1270
>>> outfile.close()
If you use your text editor to open example.com.html (in whichever directory you saved it), the file should contain the raw HTML of www.example.com.
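One detail worth knowing when saving downloaded text: if you don't say otherwise, open() picks a text encoding that can vary from one platform to another. Passing encoding="utf-8" makes the choice explicit. The sketch below uses a hard-coded string as a stand-in for resp.text, so it runs without a network connection:

```python
import os
import tempfile

# Stand-in for resp.text; \u00e9 is a non-ASCII character (e with acute accent)
exampletxt = "<html><body><p>caf\u00e9</p></body></html>"

demo_path = os.path.join(tempfile.mkdtemp(), "example.com.html")

# Write the text with an explicit encoding
outfile = open(demo_path, "w", encoding="utf-8")
outfile.write(exampletxt)
outfile.close()

# Read it back with the same encoding
infile = open(demo_path, "r", encoding="utf-8")
roundtrip = infile.read()
infile.close()

print(roundtrip == exampletxt)
```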
Not all files are text. Rather than explain in detail, for now, I will just show how the open() function needs to be called when writing a non-text file to disk. It requires a change to the second argument:
>>> zfile = open("mynewzipfile.zip", "wb")
Think of that "wb" as standing for "write bytes" (the b selects binary mode).
Check the type() of zfile to see what it points to:
>>> type(zfile)
_io.BufferedWriter
Again, I think of this as a file object – but note that it is different from the previous example involving a text file, in which the object had a type of: _io.TextIOWrapper
Whether it is a binary or text file, the same read() and write() methods exist.
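The methods have the same names, but a file opened in binary mode expects bytes, not str. Here is a short sketch of what happens with each; the file name is made up for the demo:

```python
import os
import tempfile

demo_path = os.path.join(tempfile.mkdtemp(), "mynewfile.bin")

zfile = open(demo_path, "wb")

# A bytes literal (note the b prefix) is accepted;
# write() returns the number of bytes written
written = zfile.write(b"hello")

# A regular str is rejected with a TypeError
try:
    zfile.write("world")
    str_accepted = True
except TypeError:
    str_accepted = False

zfile.close()

# Reading in binary mode ("rb") gives back a bytes object
infile = open(demo_path, "rb")
data = infile.read()
infile.close()

print(written, str_accepted, data)
```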
But typically, we don't manually type in the bytes that we want to write to a file. Let's go back to the requests.get() function, but this time, let's download a zipped archive from the following URL:
http://stash.compciv.org/scrapespeare/matty.shakespeare.tar.gz
>>> import requests
>>> zipurl = 'http://stash.compciv.org/scrapespeare/matty.shakespeare.tar.gz'
>>> resp = requests.get(zipurl)
The content attribute of the Response object
So the resp variable contains the result of the download from the given URL. This should take considerably longer (by a few seconds, at least) than downloading example.com because we're downloading the entire works of Shakespeare.
However, it's his (text) works all in a zip file, which itself is not a text file. It's a binary file full of bytes, not string characters.
The Requests library is designed so that, for binary files, we use the content attribute instead of the text attribute. That is, instead of:
>>> thedata = resp.text
we read resp.content. This is just something you have to memorize and get used to. Again, use the type() function to see what kind of object resp.content actually is (it's not a str, to hammer on this point):
>>> thedata = resp.content
>>> type(thedata)
bytes
OK, all together: downloading a zip file and then saving it to disk:
import requests
zipurl = 'http://stash.compciv.org/scrapespeare/matty.shakespeare.tar.gz'
resp = requests.get(zipurl)
zname = "matty.shakespeare.tar.gz"
zfile = open(zname, 'wb')
zfile.write(resp.content)
zfile.close()
Check to see if matty.shakespeare.tar.gz was actually saved to your computer at the given path. You can even double-click it to see if it unzips. Note that we did not programmatically unzip the file. We simply downloaded and saved it to a path.
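If you'd rather check from Python than by double-clicking, one rough sanity test is to look at the file's size and its first two bytes: gzip files begin with the bytes 1f 8b. The sketch below is only an illustration. The helper name looks_like_gzip is my own, and the two sample files are created locally as stand-ins for a real download:

```python
import gzip
import os
import tempfile


def looks_like_gzip(path):
    """Return True if the file at path starts with the gzip magic bytes."""
    f = open(path, "rb")
    first_two = f.read(2)
    f.close()
    return first_two == b"\x1f\x8b"


demo_dir = tempfile.mkdtemp()
good_path = os.path.join(demo_dir, "sample.gz")
bad_path = os.path.join(demo_dir, "notreally.gz")

# A genuine gzip file, made locally as a stand-in for a download
gz = gzip.open(good_path, "wb")
gz.write(b"To be, or not to be")
gz.close()

# A file with a .gz name that actually holds an HTML error page
f = open(bad_path, "wb")
f.write(b"<html>404 Not Found</html>")
f.close()

good_size = os.path.getsize(good_path)
print(good_size, looks_like_gzip(good_path), looks_like_gzip(bad_path))
```

This only inspects the header, so it can't tell you whether the whole archive arrived intact.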
This is going to seem exceedingly pedantic. On Mac OS X and Linux, the following file path:
tempdata/somefile.zip
means that the somefile.zip file is inside the tempdata subdirectory.
However, in Windows, that path looks like this:
tempdata\somefile.zip
Because of these differences between operating systems, it's safer to leave the construction of a file path to the join() function in Python's os.path module (which becomes available as os.path once you run import os).
Here's what that looks like:
>>> import os
>>> mydirname = 'tempdata'
>>> myfilename = 'somefile.zip'
>>> myfullfilename = os.path.join(mydirname, myfilename)
>>> print(myfullfilename)
tempdata/somefile.zip # note that this will be different on Windows machines
Or, focusing on brevity:
fname = os.path.join("tempdata", "somefile.zip")
Yes, that seems like a lot of code to generate the string tempdata/somefile.zip. But besides being cross-platform compatible, this pattern is worth using because in real-world programming, paths can get fairly complicated (e.g. with deeply nested subdirectories). In the long run, it's easier to let Python's helper functions deal with it.
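To give a feel for the nested case, here is a small sketch. The directory and file names are invented for illustration; os.path.join(), os.path.basename(), os.path.dirname() and os.sep are all part of the standard library:

```python
import os

# join() accepts any number of path segments
nested = os.path.join("tempdata", "plays", "tragedies", "hamlet.txt")
print(nested)

# os.sep is the separator for the current operating system,
# so splitting on it recovers the individual segments
parts = nested.split(os.sep)

# Helpers for pulling a path apart again
filename = os.path.basename(nested)  # the last segment
dirname = os.path.dirname(nested)    # everything before it

print(parts)
print(filename)
print(dirname)
```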
Here is our download-and-save code from the previous example again, this time using the join() function and saving the file to the tempdata directory (assuming that it's been created):
import requests
import os
zipurl = 'http://stash.compciv.org/scrapespeare/matty.shakespeare.tar.gz'
resp = requests.get(zipurl)
# assuming the subdirectory tempdata has been created:
zname = os.path.join('tempdata', "matty.shakespeare.tar.gz")
zfile = open(zname, 'wb')
zfile.write(resp.content)
zfile.close()
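The code above fails if the tempdata directory doesn't exist yet. One way around that is os.makedirs() with exist_ok=True, which creates the directory when it's missing and does nothing when it's already there. Below is a sketch of that approach. The helper save_bytes is a name I made up, and the bytes being saved are placeholders; in real use you would pass in resp.content:

```python
import os
import tempfile


def save_bytes(content, dirname, filename):
    """Write bytes to dirname/filename, creating dirname if needed."""
    # exist_ok=True means no error if the directory is already there
    os.makedirs(dirname, exist_ok=True)
    path = os.path.join(dirname, filename)
    f = open(path, "wb")
    f.write(content)
    f.close()
    return path


# Use a throwaway parent directory for the demo
base = tempfile.mkdtemp()
target_dir = os.path.join(base, "tempdata")

# Placeholder bytes standing in for resp.content
saved_path = save_bytes(b"placeholder bytes", target_dir,
                        "matty.shakespeare.tar.gz")

# Calling it again is safe: the directory already exists,
# and the file is simply overwritten
saved_again = save_bytes(b"second write", target_dir,
                         "matty.shakespeare.tar.gz")

print("Writing file:", saved_path)
```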