Getting acquainted with Requests and scraping webpages as text

This set of exercises is meant to serve as a hands-on overview of the concept of libraries and objects -- and their methods and attributes -- while introducing the basic concept behind scraping webpages for data.
This assignment is due on Tuesday, January 19
6 exercises
4.5 possible points
Create a subfolder named 0003-requests-sotu inside your compciv-2016/exercises folder.
The Checklist

In your compciv-2016 Git repository create a subfolder and name it:

     exercises/0003-requests-sotu

The folder structure will look like this (not including any subfolders such as `tempdata/`):

        compciv-2016
        └── exercises
            └── 0003-requests-sotu
               ├── a.py
               ├── b.py
               ├── c.py
               ├── d.py
               ├── e.py
               ├── f.py
    
a.py 0.5 points An example webpage request with requests and example.com
b.py 0.5 points Cause a ConnectionError
c.py 0.5 points Crash your program with a MissingSchema error
d.py 0.5 points Fetch President Obama's 2016 State of the Union Address
e.py 0.5 points Count up the times "Applause" is used in the 2016 SOTU webpage
f.py 2.0 points Count up the times "Applause" is used in all of the webpages for Obama's States of the Union

Background information

With every State of the Union address, it's a common data news story to analyze what's been said over the years.

Here's how The Atlantic did it last year:

image sotu-analysis-atlantic.png

Here's a fancy visualization by the Washington Post:

image sotu-analysis-wapo.png

And why restrict ourselves to States of the Union? All of Obama's speeches and remarks can be found on the White House website. NPR did a word analysis of those, though theirs focused on the questions that the press corps asked:

image npr-whitehouse-press-words.png

NPR even did an analysis of the cadence of President Obama's speeches.

All of these involve different levels of insight and data-wrangling. But at the core of each analysis is simply: gather up the speeches and count up the words.

For this exercise, we will focus on President Obama's State of the Union addresses, just to get acquainted with basic web request operations and text searching. In future exercises, we'll learn more techniques and libraries for more expansive searches, but the core process will be the same.

About the Requests library

Our primary library for downloading data and files from the Web will be Requests, dubbed "HTTP for Humans".

To bring in the Requests library into your current Python script, use the import statement:

import requests

You have to do this at the beginning of every script for which you want to use the Requests library.

Note: If you get an error, i.e. an ImportError, it means you don't have the Requests library installed. Email me if you're having that issue, because it likely means Anaconda isn't installed properly.

The get method

The get method of the requests module is the one we will use most frequently – which corresponds to how the majority of the HTTP requests your browser makes involve the GET method. Even without knowing much about HTTP, the concept of GET is about as simple as its name: it will get a resource from a web server.

The get() method requires one argument: a web URL, e.g. http://www.example.com. The URL's scheme – i.e. "http://" – is required, even though you probably never type it out in your browser.

Run this from the interactive prompt:

>>> requests.get("http://www.example.com")
<Response [200]>

You might have expected the command to just dump the text contents of http://www.example.com to the screen. But it turns out there's a lot more to getting a webpage than just getting what you see rendered in your browser.

You can see this for yourself by popping open the Developer Tools (in Chrome, for OSX, the shortcut is: Command-Alt-J), clicking the Network panel, then visiting a page:

image example-com-get-request.png

What each of those various attributes means isn't important to figure out now; it's enough to know that they exist as part of every request for a web resource, whether it's a webpage, image file, data file, etc.

Returning to our previous code snippet, let's assign the result of the requests.get() command to a variable, then inspect that variable. I like using resp for the variable name – short for "response":

>>> resp = requests.get("http://www.example.com")

Use the type() function to see what that resp object actually is:

>>> type(resp)
requests.models.Response

If you want to get the text of a successful requests.get() response, use its text attribute:

>>> resp = requests.get("http://www.example.com")
>>> print(resp.text)
<!doctype html>
<html>
<head>
    <title>Example Domain</title>

    <meta charset="utf-8" />
    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1" />
    <style type="text/css">
    # .... and so on

So that's the basics of just getting the contents of a webpage. Here's a little more explanation of the concepts:

A quick primer on objects and their methods and attributes

The concept of "objects" in programming is a complicated – and somewhat boring and confusing – topic that's easier to explain by example. This is a segue into introducing the concept, just so that you have some context when trying to access the data/contents, etc. after successfully downloading a webpage via requests.get().

Here's a brief overview of what you need to know about objects for this lesson:

  • Use the type() function to find out the type of an object, e.g. type('hello')
  • Getting a web resource means getting all of the metadata involved in the transaction. Thus, the requests.get("http://somepage.com") method returns an object of type requests.models.Response, not just a text string of somepage.com's raw HTML.
  • Objects have methods and attributes. Different objects have different methods and attributes.
  • At the interactive Python prompt, use the Tab key to autocomplete names of methods and attributes belonging to the object that a particular variable refers to. This is a much better alternative than trying to memorize everything.
  • Methods and attributes will feel like the same kind of concept. Practically speaking, invoking a method requires including the closed parentheses: "hello world".upper(). Attributes do not require the parentheses, e.g. (9).denominator (the parentheses around the 9 just keep Python from reading "9." as a float).
  • To get the raw text of a webpage from a requests.models.Response, use its text attribute.

And here's a longer explanation:

So what is that requests.models.Response?

Just as "hello world" is an object of the type str (i.e. string literal) and 42 is an object of type int (i.e. integer), the object returned by requests.get(), and subsequently assigned to the resp variable, has its own type: requests.models.Response.

Each type of object has its own methods and attributes that can be referenced. The difference between a method and an attribute can be…confusing…so for a while, it may be a matter of memorization and just knowing how to inspect things.

Methods

Think of methods as functions that belong to an object (type() and print() are examples of functions that are called without being part of an object).

For instance, the str object has a method named upper() (return an upper-cased version of the string):

>>> "Hello world".upper()
'HELLO WORLD'

It doesn't make sense to call upper() all by its lonesome, i.e.

>>> upper()

…because there is nothing to translate to uppercase. The upper() (and its counterpart, lower()) method only makes sense in the context of a text string. Hence, we think of upper() as being a method that belongs to the String object.

You can think of methods as referring to things that require an "action" or calculation of some sort, e.g. "hello world".upper() executes an action in which the characters of "hello world" are translated into their upper-case equivalents: "HELLO WORLD". The use of the closed parentheses is what executes that action. If you don't include the parentheses, nothing happens except Python telling you, "Yep, that's a function":

>>> 'hello world'.upper
<function str.upper>

Attributes

Think of attributes as properties inherent to an object – i.e. no calculation or action is required to derive them – they just are. For example, a person's birthdate can be thought of as an attribute – an unchangeable fact about that person. But a person's age is better thought of as a method – because age is derived from birthdate, i.e. finding the difference between the current date and the birthdate.

For our practical purposes, attributes are referenced the same way as methods, but without the use of closed parentheses:

>>> a = 42
>>> a.numerator
42
>>> a.denominator
1

Exploring the methods and attributes of requests.models.Response

Because the requests.models.Response object encapsulates the entire HTTP response – not just the raw text of the webpage – it has quite a few methods and attributes.

Let's use the autocomplete functionality of interactive Python to get a list of methods and attributes belonging to the requests.models.Response object.

Starting from the very beginning, including the import statement:

>>> import requests
>>> resp = requests.get("http://www.example.com")

At the next prompt, type out resp, then the dot. Then, hit Tab – the interactive prompt should return a list of all possible methods and attributes:

>>> resp.  #( hit the Tab key)
resp.apparent_encoding      resp.iter_lines
resp.close                  resp.json
resp.connection             resp.links
resp.content                resp.ok
resp.cookies                resp.raise_for_status
resp.elapsed                resp.raw
resp.encoding               resp.reason
resp.headers                resp.request
resp.history                resp.status_code
resp.is_permanent_redirect  resp.text
resp.is_redirect            resp.url
resp.iter_content    

The text attribute contains the raw HTML of the webpage you requested. It's just a String object, which you can verify for yourself:

>>> type(resp.text)
str

And you can use the len() function to get the number of characters:

>>> len(resp.text)
1270

The status_code attribute

Check out the HTTP Status Cats photo gallery (or HTTP Status Dogs, if you prefer dogs) for an informal overview of HTTP status codes.

Then find the status code of the response that you received with the status_code attribute:

>>> resp.status_code
200
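
A 200 means everything went OK. To see a different code, you can point Requests at a page that deliberately returns an error. Here's a quick demonstration – not part of the assignment – using httpbin.org, a free HTTP testing service that responds with whatever status code you put in the URL:

>>> resp = requests.get("http://httpbin.org/status/404")
>>> resp.status_code
404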

We'll use a few of the other attributes and methods of the requests.models.Response object in other situations. But text and status_code will fulfill the majority of our needs.

The Exercises

0003-requests-sotu/a.py » An example webpage request with requests and example.com

0003-requests-sotu/a.py
An example webpage request with requests and example.com
0.5 points

Use the Requests library to get the URL, http://example.com.

Print out the response’s status code, length of the text response (i.e. you’ll have to use the len() function), and the URL.

Expectations

When you run a.py from the command-line:

0003-requests-sotu $ python a.py
  • The program's output to screen should be:
    200
    1270
    http://example.com/
    
Some takeaways from this exercise:
  • Note that the url attribute of the response object is not exactly the same as the URL that you attempted to retrieve – in this case, a trailing slash was added. This is a common phenomenon, especially when the destination URL redirects you to another page (such as a bit.ly URL shortener, e.g. https://bit.ly/a)
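
To tie it together, here's one way a.py might look – a minimal sketch that prints the three values the expectations ask for:

import requests

resp = requests.get("http://example.com")
print(resp.status_code)   # e.g. 200
print(len(resp.text))     # number of characters in the raw HTML
print(resp.url)           # the URL the response actually came from

And to see the url takeaway in a more dramatic form, try a URL that redirects. This demonstration again uses httpbin.org (a testing service, not part of the assignment), whose /redirect/1 endpoint redirects once to its /get page:

>>> resp = requests.get("http://httpbin.org/redirect/1")
>>> resp.url
'http://httpbin.org/get'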

0003-requests-sotu/b.py » Cause a ConnectionError

0003-requests-sotu/b.py
Cause a ConnectionError
0.5 points

Write a program that prints out the name of a URL that you want to get.

Then, that program should crash due to a runtime error of type ConnectionError, which will happen if you try to connect to a non-existent web server – or if your own Internet connection is down (obviously, the former is easier to specify in your script).

Expectations

When you run b.py from the command-line:

0003-requests-sotu $ python b.py
  • The program's output to screen should be:
    #[the URL you attempt to GET should be printed here]
    
  • The program should crash because of a ConnectionError
Some takeaways from this exercise:
  • Ideally, programs that we write are reproducible and deterministic. But one of the tricky things about programming against online data – or any kind of external resource – is that you can never be sure that when your script runs that the external resource is actually available. The web server, or your Internet connection, might have gone down since you last ran the script.
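
Here's a minimal sketch of what b.py might look like – the domain name below is made up, and the program works as long as it doesn't belong to a real, reachable web server:

import requests

# a made-up domain; trying to connect to it raises a ConnectionError
url = "http://www.this-server-does-not-exist-zzz.com"
print(url)
resp = requests.get(url)   # the program crashes on this line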

0003-requests-sotu/c.py » Crash your program with a MissingSchema error

0003-requests-sotu/c.py
Crash your program with a MissingSchema error
0.5 points

Write a program that prints out the name of a URL that you want to get (technically, this URL will be invalid…)

Then, that program should crash due to a runtime error of type MissingSchema.

Expectations

When you run c.py from the command-line:

0003-requests-sotu $ python c.py
  • The program's output to screen should be:
    #[the URL you attempt to GET should be printed here]
    
  • The program should crash because of a MissingSchema
Some takeaways from this exercise:
  • Browsers do a lot for us, even when it comes to resolving the URLs that we manually type in. Because a web browser is used to visit, well, webpages, a browser usually just fills in http:// if we omit it – and this is perfectly fine 99.9% of our casual browser usage. However, the Requests library doesn’t provide that convenience, so it’s good to get acquainted with the corresponding error message.
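
Here's a minimal sketch of what c.py might look like – leaving the "http://" scheme off the URL is what triggers the error:

import requests

# note the missing "http://" – Requests raises a MissingSchema error
url = "www.example.com"
print(url)
resp = requests.get(url)   # the program crashes on this line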

0003-requests-sotu/d.py » Fetch President Obama's 2016 State of the Union Address

0003-requests-sotu/d.py
Fetch President Obama's 2016 State of the Union Address
0.5 points

Basically the same as exercise a.py, except using this specific URL:

https://www.whitehouse.gov/the-press-office/2016/01/12/remarks-president-barack-obama-%E2%80%93-prepared-delivery-state-union-address

Print out the status code, the length of the text, and the URL of the response.

Expectations

When you run d.py from the command-line:

0003-requests-sotu $ python d.py
  • The program's output to screen should be:
    200
    147766   # approximately
    https://www.whitehouse.gov/the-press-office/2016/01/12/remarks-president-barack-obama-%E2%80%93-prepared-delivery-state-union-address
    
Some takeaways from this exercise:
  • Once you know how to get one kind of webpage, you basically have the pattern for getting any (public) webpage or web resource
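
Here's one way d.py might look – the same pattern as a.py, just with a different URL:

import requests

url = "https://www.whitehouse.gov/the-press-office/2016/01/12/remarks-president-barack-obama-%E2%80%93-prepared-delivery-state-union-address"
resp = requests.get(url)
print(resp.status_code)
print(len(resp.text))
print(resp.url)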

0003-requests-sotu/e.py » Count up the times "Applause" is used in the 2016 SOTU webpage

0003-requests-sotu/e.py
Count up the times "Applause" is used in the 2016 SOTU webpage
0.5 points

The raw HTML of a webpage is just text, i.e. a String object. String objects in Python have the count() method, which takes in a single argument – another string object to search for – and returns an integer, representing how many times that string was found in the invoking string:

>>> a = "hello world"
>>> a.count('world')
1
>>> a.count('o')
2

For this exercise, print out the following calculations for the webpage of the 2016 State of the Union address:

  1. The number of times the string "Applause" appears.
  2. The number of times the string "Applause" appears regardless of case, i.e. "APPLAUSE", "applause", etc.
  3. The number of times “<p>” appears.

Hint:

  • Don’t try to guess all the different ways "Applause" can be capitalized. Use the lower() or upper() methods to convert all of the characters in the HTML into one case or the other, then search accordingly.
  • Since upper() and lower() both return string objects, you can chain method calls like this: "hello world".upper().count("L")
  • The fact that the results for #2 and #3 are the same is a weird coincidence. I guess we can interpret that as: President Obama received applause for every paragraph-long statement he made, on average.
Expectations

When you run e.py from the command-line:

0003-requests-sotu $ python e.py
  • The program's output to screen should be:
    84
    89
    89
    
Some takeaways from this exercise:
  • Why does “<p>” appear so many times in the HTML, but we don’t see it at all when we view the URL via browser? That’s the point of HTML – it includes a bunch of meta information that your browser uses to render the page. But the raw HTML itself is not shown. Our simple program doesn’t know how to parse HTML in such a way that ignores the HTML tags…but we will learn that soon…
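
Here's one way e.py might look, chaining lower() and count() as the hints suggest:

import requests

url = "https://www.whitehouse.gov/the-press-office/2016/01/12/remarks-president-barack-obama-%E2%80%93-prepared-delivery-state-union-address"
resp = requests.get(url)
text = resp.text
print(text.count("Applause"))           # exact-case matches only
print(text.lower().count("applause"))   # case-insensitive count
print(text.count("<p>"))                # occurrences of the literal string "<p>"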

0003-requests-sotu/f.py » Count up the times "Applause" is used in all of the webpages for Obama's States of the Union

0003-requests-sotu/f.py
Count up the times "Applause" is used in all of the webpages for Obama's States of the Union
2.0 points

Similar to the previous exercises, except repeated for all of the States of the Union during Obama’s presidency (technically, 2009’s was not a SOTU address, but we’ll count it anyway).

Here are the URLs:

  • https://www.whitehouse.gov/the-press-office/remarks-president-barack-obama-address-joint-session-congress
  • https://www.whitehouse.gov/the-press-office/remarks-president-state-union-address
  • https://www.whitehouse.gov/the-press-office/2011/01/25/remarks-president-state-union-address
  • https://www.whitehouse.gov/the-press-office/2012/01/24/remarks-president-state-union-address
  • https://www.whitehouse.gov/the-press-office/2013/02/12/remarks-president-state-union-address
  • https://www.whitehouse.gov/the-press-office/2014/01/28/president-barack-obamas-state-union-address
  • https://www.whitehouse.gov/the-press-office/2015/01/20/remarks-president-state-union-address-january-20-2015
  • https://www.whitehouse.gov/the-press-office/2016/01/12/remarks-president-barack-obama-%E2%80%93-prepared-delivery-state-union-address

For each URL, print out:

  1. The URL of the response
  2. The number of characters in the text of each response
  3. The number of times “Applause” appears in the text of each response, case-insensitive
  • If you already know how to do a for-loop and how to iterate across a collection of objects in Python, you should do that (there's a sketch at the end of this exercise). If you don't, there's nothing wrong with repeating the same code, over and over, for each URL. It should feel really annoying, though, so keep that in mind when we discuss loops and collections and other slightly more complicated data objects.
Expectations

When you run f.py from the command-line:

0003-requests-sotu $ python f.py
  • The program's output to screen should be:
    https://www.whitehouse.gov/the-press-office/remarks-president-barack-obama-address-joint-session-congress
    146186   # approximately
    0
    https://www.whitehouse.gov/the-press-office/remarks-president-state-union-address
    157621   # approximately
    116
    https://www.whitehouse.gov/the-press-office/2011/01/25/remarks-president-state-union-address
    155070   # approximately
    80
    https://www.whitehouse.gov/the-press-office/2012/01/24/remarks-president-state-union-address
    157847   # approximately
    87
    https://www.whitehouse.gov/the-press-office/2013/02/12/remarks-president-state-union-address
    156855   # approximately
    87
    https://www.whitehouse.gov/the-press-office/2014/01/28/president-barack-obamas-state-union-address
    152266   # approximately
    0
    https://www.whitehouse.gov/the-press-office/2015/01/20/remarks-president-state-union-address-january-20-2015
    156269   # approximately
    86
    https://www.whitehouse.gov/the-press-office/2016/01/12/remarks-president-barack-obama-%E2%80%93-prepared-delivery-state-union-address
    147766   # approximately
    89
    
Some takeaways from this exercise:
  • What happened in 2009 and 2014? Did Obama just bomb those speeches? Or is it a result of inconsistent transcription, just one of many “dirty/inconsistent data” issues we’ll run into?
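
Here's one way f.py might look, using a list and a for-loop as suggested above:

import requests

urls = [
    "https://www.whitehouse.gov/the-press-office/remarks-president-barack-obama-address-joint-session-congress",
    "https://www.whitehouse.gov/the-press-office/remarks-president-state-union-address",
    "https://www.whitehouse.gov/the-press-office/2011/01/25/remarks-president-state-union-address",
    "https://www.whitehouse.gov/the-press-office/2012/01/24/remarks-president-state-union-address",
    "https://www.whitehouse.gov/the-press-office/2013/02/12/remarks-president-state-union-address",
    "https://www.whitehouse.gov/the-press-office/2014/01/28/president-barack-obamas-state-union-address",
    "https://www.whitehouse.gov/the-press-office/2015/01/20/remarks-president-state-union-address-january-20-2015",
    "https://www.whitehouse.gov/the-press-office/2016/01/12/remarks-president-barack-obama-%E2%80%93-prepared-delivery-state-union-address",
]

for url in urls:
    resp = requests.get(url)
    print(resp.url)                               # the URL of the response
    print(len(resp.text))                         # number of characters in the raw HTML
    print(resp.text.lower().count("applause"))    # case-insensitive "Applause" count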