Writing files is a lot like reading from them. In day-to-day consumer-friendly computing, we’re usually given a few warnings to prevent accidents in which we overwrite existing files, e.g. “Are you sure you want to permanently erase the items in the Trash?”. Python does not do that, which means we have to be extra careful when writing files.
The process of writing to a file is very similar to the process of reading from a file (which is covered in a separate lesson), in that both require opening a file object for access. The difference is in the second argument to
open(), in which the string
"w" – short for write – is passed.
newfile = open("hello.txt", "w")
When a file object is instantiated in write mode, it has access to the
write() method. Closing a file object works the same way as it does when reading from the file:
newfile.write("hello world")
newfile.write("goodbye world")
newfile.write("wokka wokka")
newfile.close()
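One detail worth seeing for yourself: write() does not add line breaks between calls. The sketch below is an illustration rather than part of the original lesson. It writes into a throwaway directory made with tempfile (my assumption, so that nothing real gets overwritten) and uses a with block, which closes the file for us when the block ends.

```python
import os
import tempfile

# Work in a throwaway directory so nothing on disk is overwritten
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "hello.txt")

# The with-block closes the file automatically when the block ends
with open(demo_path, "w") as newfile:
    newfile.write("hello world")
    newfile.write("goodbye world\n")  # a line break must be added explicitly

# Read the file back to see exactly what was written
with open(demo_path, "r") as infile:
    contents = infile.read()

print(repr(contents))  # → 'hello worldgoodbye world\n'
```

The two strings run together because nothing separated them; if you want one item per line, append "\n" to each string yourself.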
The above example works when writing text to a file. However, when writing binary data, i.e. bytes, the string
"wb" must be passed into the open() function:
newfile = open("somebinaryfile.zip","wb")
Most of the kinds of files we write will be in text mode, but occasionally, we'll download a binary file – such as a zip or image file – and save it to disk.
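To make the text/binary distinction concrete, here is a minimal sketch that writes a few bytes in "wb" mode and reads them back. The filename demo.bin, the payload, and the temporary directory are all made up for illustration. It also shows that a file opened in binary mode refuses a regular string.

```python
import os
import tempfile

# Hypothetical scratch location; nothing outside it is touched
demo_dir = tempfile.mkdtemp()
bin_path = os.path.join(demo_dir, "demo.bin")

# Four arbitrary bytes (they happen to be how many zip files begin)
payload = bytes([0x50, 0x4B, 0x03, 0x04])

with open(bin_path, "wb") as outfile:
    written = outfile.write(payload)  # write() returns the number of bytes written

with open(bin_path, "rb") as infile:
    roundtrip = infile.read()

# Binary mode only accepts bytes-like objects, not str
try:
    with open(bin_path, "wb") as outfile:
        outfile.write("this is text, not bytes")
    str_rejected = False
except TypeError:
    str_rejected = True

print(written, roundtrip == payload, str_rejected)  # → 4 True True
```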
When trying to open a file for reading, but passing in a non-existent filename, Python will raise a
FileNotFoundError. Try this at the interactive Python shell:
>>> myfile = open("blaskdfjsadklfjdfsadflkj", "r")
FileNotFoundError: [Errno 2] No such file or directory: 'blaskdfjsadklfjdfsadflkj'
What happens if you try to open a file in write-mode with an equally nonsensical name?
>>> myfile = open("blaskdfjsadklfjdfsadflkj", "w")
Nothing, at least error-wise. Instead, a file of the name
blaskdfjsadklfjdfsadflkj will be created wherever your code is running. If you ran it from your
~/Desktop directory, for instance, the new, empty file would appear on your Desktop.
OK, but what happens when you try to open a file for writing using a filename that already exists? Nothing, error-wise. But whatever file that existing filename pointed to is basically wiped out. You may get an error message if you attempt to write to a path that points to a directory or some kind of protected file. But for every other kind of file, it's just gone and there is no confirmation message.
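The silent overwrite is easy to demonstrate safely in a scratch directory. The sketch below is not from the original lesson, and its filenames are invented. It also shows one possible safeguard: Python's open() accepts an "x" mode ("exclusive creation") that raises FileExistsError instead of wiping out an existing file.

```python
import os
import tempfile

# Hypothetical scratch file inside a throwaway directory
demo_dir = tempfile.mkdtemp()
path = os.path.join(demo_dir, "notes.txt")

with open(path, "w") as outfile:
    outfile.write("original contents")

# Merely opening in "w" mode empties the file, even if we never call write()
open(path, "w").close()
with open(path, "r") as infile:
    after_reopen = infile.read()

# "x" mode refuses to touch a path that already exists
try:
    with open(path, "x") as outfile:
        outfile.write("this never gets written")
    blocked = False
except FileExistsError:
    blocked = True

print(repr(after_reopen), blocked)  # → '' True
```

Whether "x" is the right choice depends on the task; it is handy when overwriting would always be a mistake.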
This is why in each of the assignments, I have you create a new
tempdata subdirectory and stash things into it, to reduce the likelihood that you end up overwriting existing files in your other directories. But you should still be careful – i.e. take a few seconds and think about what you're doing before hitting Enter – whenever you pass
"w" or "wb" into the open() function.
A good portion of this chapter is spent warning you about how writing files can accidentally destroy data, so it's worth asking: why do we even want to write files in the first place?
The answer is pretty easy: so that the data we've collected/created can live on after our program finishes its work – or, as is frequently the case, dies unexpectedly.
Consider the following code, which downloads the HTML contents of the current New York Times homepage into a variable named nyttext:
import requests
resp = requests.get("https://www.nytimes.com")
nyttext = resp.text
If my program ends there, whatever was stored in the variables resp and
nyttext is gone. For many situations, that's probably what we want. But if we want to examine how the NYT homepage changes over time, then we would need to save copies of it that persisted from one Python session to the next. This means saving files to our hard drive:
from os.path import join
import requests

resp = requests.get("https://www.nytimes.com")
nyttext = resp.text
outfname = join("tempdata", "nytimes.com.html")
outfile = open(outfname, "w")
outfile.write(nyttext)
outfile.close()
Of course, if we re-run this script in the next hour, day, or even the next second, whatever was at
"tempdata/nytimes.com.html" will get overwritten.
One strategy is to incorporate the current timestamp into the filename to be saved. Here, I create a subdirectory named
nytimes.com, and every file in it is given a name like
1453688120.431147.html – with the numbers being the result of the
time.time() function, which returns the "current time in seconds since the Epoch":
from os.path import join
from os import makedirs
import requests
import time

# Set up the storage area
STORAGE_DIR = join("tempdata", "nytimes.com")
makedirs(STORAGE_DIR, exist_ok=True)

# Download the page
resp = requests.get("https://www.nytimes.com")

# Set up the new file
current_time = str(time.time())
print("The time in seconds since epoch is now:", current_time)
outfname = join(STORAGE_DIR, current_time + '.html')
outfile = open(outfname, "w")
outfile.write(resp.text)
outfile.close()
If you were to save that code into a script named
nytdownload.py and then repeatedly run it via the command-line interpreter:
$ python nytdownload.py
The time in seconds since epoch is now: 1453689209.676369
$ python nytdownload.py
The time in seconds since epoch is now: 1453689210.85706
$ python nytdownload.py
The time in seconds since epoch is now: 1453689212.452021
$ python nytdownload.py
The time in seconds since epoch is now: 1453689213.67095
You would have a
tempdata/nytimes.com subdirectory full of files:
.
├── nytdownload.py
└── tempdata
    └── nytimes.com
        ├── 1453689209.676369.html
        ├── 1453689210.85706.html
        ├── 1453689212.452021.html
        └── 1453689213.67095.html
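Epoch seconds are unique enough, but they are hard to read at a glance. As a variation (my own, not something the script above does), the filename could be built from a formatted date instead. In this sketch, timestamped_path is a hypothetical helper, the page text is a placeholder rather than a real download, and the date is fixed so the output is predictable; in real use you would pass datetime.now().

```python
import os
import tempfile
from datetime import datetime


def timestamped_path(directory, when, suffix=".html"):
    """Hypothetical helper: build a path like <directory>/2016-01-25_023329.html"""
    return os.path.join(directory, when.strftime("%Y-%m-%d_%H%M%S") + suffix)


# Throwaway storage area standing in for tempdata/nytimes.com
storage_dir = os.path.join(tempfile.mkdtemp(), "nytimes.com")
os.makedirs(storage_dir, exist_ok=True)

# A fixed moment in time, used here only to keep the example reproducible
snapshot_path = timestamped_path(storage_dir, datetime(2016, 1, 25, 2, 33, 29))

with open(snapshot_path, "w") as outfile:
    outfile.write("<html>placeholder page text</html>")

print(os.path.basename(snapshot_path))  # → 2016-01-25_023329.html
```

The trade-off is resolution: with whole seconds in the name, two runs within the same second would write to the same path, which time.time() with its fractional seconds makes far less likely.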
Sometimes, you only want to download a file once. For example, the works of Shakespeare are unlikely to change in the near future, so we'd want to download the file only if we've never downloaded it before.
We can use the
exists() function from the
os.path module, which returns
True if the path passed into it currently exists (and False otherwise):
from os.path import join
from os.path import exists
import requests

SHAKE_URL = "http://stash.compciv.org/scrapespeare/matty.shakespeare.tar.gz"
SHAKE_LOCAL_PATH = join("tempdata", "shakespeare.tar.gz")

if exists(SHAKE_LOCAL_PATH):
    print("Skipping download;", SHAKE_LOCAL_PATH, 'already exists')
else:
    print("Downloading", SHAKE_URL)
    resp = requests.get(SHAKE_URL)
    outfile = open(SHAKE_LOCAL_PATH, 'wb')
    # remember that Requests Response objects have the `content`
    # attribute when dealing with the contents of binary files
    outfile.write(resp.content)
    print("Saved file to:", SHAKE_LOCAL_PATH)
    outfile.close()
Save that code into a file, e.g.
shakeydownload.py, and run it from the command-line. Assuming you don't have anything at the path
tempdata/shakespeare.tar.gz, and the download successfully completes, you should see this output after a few seconds, or however long it takes your Internet connection to download all of Shakespeare's work:
$ python shakeydownload.py
Downloading http://stash.compciv.org/scrapespeare/matty.shakespeare.tar.gz
Saved file to: tempdata/shakespeare.tar.gz
Try re-running the script. The script should finish near-instantaneously since it doesn't have to download the file:
$ python shakeydownload.py
Skipping download; tempdata/shakespeare.tar.gz already exists
$ python shakeydownload.py
Skipping download; tempdata/shakespeare.tar.gz already exists
If you delete (or rename) tempdata/shakespeare.tar.gz,
shakeydownload.py will operate as if you had never downloaded the file before.
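The check-then-download pattern can be tried without touching the network. In the sketch below, save_once and fake_fetch are hypothetical names of my own: fake_fetch stands in for the requests.get() call and simply counts how often it is asked for data, and the file lives in a temporary directory rather than tempdata.

```python
import os
import tempfile


def save_once(path, fetch):
    """Hypothetical helper: write fetch()'s bytes to path only if nothing
    exists there yet. Returns True if the fetch actually happened."""
    if os.path.exists(path):
        return False
    data = fetch()
    with open(path, "wb") as outfile:
        outfile.write(data)
    return True


calls = []


def fake_fetch():
    # Stand-in for a real download; records each call so we can count them
    calls.append(1)
    return b"pretend archive bytes"


local_path = os.path.join(tempfile.mkdtemp(), "shakespeare.tar.gz")

first = save_once(local_path, fake_fetch)   # nothing there yet, so it fetches
second = save_once(local_path, fake_fetch)  # file exists, so it skips
os.remove(local_path)                       # simulate deleting the file
third = save_once(local_path, fake_fetch)   # fetches again

print(first, second, third, len(calls))  # → True False True 2
```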