GoT Baby Names (2014 Edition)?

Read and analyze a file of comma-delimited baby name records from the Social Security Administration.
This assignment is due on Wednesday, February 3
7 exercises
5.5 possible points
Create a subfolder named 0012-got-babynames-2014 inside your compciv-2016/exercises folder.

Summary

This exercise set is meant to test your familiarity with loops, list objects, simple string methods, and your own eye for recognizing basic patterns and structure within a text file. These are baby steps (albeit big ones) for learning how to programmatically analyze text files in a far more efficient and scalable way than we ever could with spreadsheets.

The Checklist

In your compciv-2016 Git repository create a subfolder and name it:

     exercises/0012-got-babynames-2014

The folder structure will look like this (not including any subfolders such as `tempdata/`):

        compciv-2016
        └── exercises
            └── 0012-got-babynames-2014
               ├── a.py
               ├── b.py
               ├── c.py
               ├── d.py
               ├── e.py
               ├── f.py
               ├── g.py
    
a.py 0.5 points Download the 2014 text file of babynames and count the lines
b.py 0.5 points Print the sum of the babies whose names were recorded in 2014
c.py 1.0 points How many baby girls were named by parents inspired by characters portrayed by Emilia Clarke in Game of Thrones?
d.py 1.0 points Print the top 5 popular names for both baby boys and girls in 2014
e.py 0.5 points Print the total number of babies in 2014 by gender
f.py 1.0 points Print the total number of babies in 2014, by last character of their given names
g.py 1.0 points Print a human-readable list of baby counts in 2014, by gender and by last character of their given names

Background information

This is just a fun exercise using the most adorable of datasets as a way to practice the process of turning raw text into data. Sure, you could download the provided dataset and import it into Excel. But I think you'll (eventually) find that it's much, much faster to do it programmatically, especially if you want to analyze more than one year's worth of data. Or to programmatically turn the data into an analysis tool that can be used in other kinds of bespoke analyses…

Be sure to read up on loops, conditionals, and lists, and to be familiar with how we downloaded and stored files in the Shakespeare text files exercises.

Beware of CSVs

If you think you know how to use Python's csv module, go ahead and use it if you think it makes things easier. But this dataset is simple enough that you don't need it: "simple" not in its potential for interesting insights, but in that it has none of the many problems inherent to comma-delimited text that can seriously wreck your day/will to live.

In fact, don't think for a minute (after finishing this exercise set) that parsing (or creating) CSV-text is as simple as understanding that the values are separated by commas and calling the split() method. Everything about data is overly complicated, even commas, but especially text, even though it's "just text".

For every other situation involving comma-delimited text data, we will be using Python's csv module.
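To make the warning concrete, here is a minimal sketch of how split(',') goes wrong once a value contains a quoted comma. The sample record is invented for illustration; nothing like it appears in the 2014 baby names file.

```python
import csv

# An invented record whose first field contains a comma inside quotes.
line = '"Smith, Jr.",M,10'

# Naive splitting treats the quoted comma as a delimiter: 4 pieces.
naive = line.split(',')

# csv.reader accepts any iterable of lines and honors the quoting: 3 fields.
parsed = next(csv.reader([line]))

print(len(naive), naive)
print(len(parsed), parsed)
```

The csv module returns the three fields you would expect, while split() quietly hands back four.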

Getting the raw data yourself

For this exercise, I've extracted the 2014 data file from the original dataset for you to use, so that you don't have to download and process the entire dataset (which spans 100+ years).

However, if you're interested in the raw dataset (and I might revisit the data for future exercises and examples), you can download it yourself from the Social Security Administration's website. In fact, once you've finished this exercise set, it's just a call to glob.glob() and an extra for-loop (and about 100MB of free disk space) to repeat the analysis for every year of Social Security data.
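As a rough sketch of that idea, the function below totals the baby counts in every file that matches a pattern. The yob*.txt pattern and the rawdata folder name are assumptions about how the unzipped files are named and where you put them, and totals_by_file is a made-up helper name; adjust all of them to whatever you actually have on disk.

```python
import glob
import os

def totals_by_file(folder, pattern='yob*.txt'):
    """Return {filename: total babies} for each matching file in folder."""
    totals = {}
    # glob.glob() returns the list of paths that match the wildcard pattern.
    for path in sorted(glob.glob(os.path.join(folder, pattern))):
        total = 0
        with open(path) as f:
            for line in f:
                name, sex, babies = line.strip().split(',')
                total += int(babies)
        totals[os.path.basename(path)] = total
    return totals

# Example call, assuming the files were unzipped into a folder named rawdata:
# for filename, total in totals_by_file('rawdata').items():
#     print(filename, total)
```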

Caveats: It ain't all babies

The phrasing and terminology of this exercise don't quite precisely describe the data. The name counts are based on applications for Social Security cards. The majority of these come from babies born in the given year, but not all of them. Also, the data only includes names that are at least 2 characters long. And finally, the data only includes applications for which the year of birth, sex, and U.S. state of birth are known.

But it's just easier to refer to this as the babyname-counting exercises. You can read the background of the data on the Social Security Administration's website, which, incidentally, also just simplifies things as being a bunch of baby names. So don't judge me.

The Exercises

0012-got-babynames-2014/a.py » Download the 2014 text file of babynames and count the lines

0012-got-babynames-2014/a.py
Download the 2014 text file of babynames and count the lines
0.5 points
  • Make a tempdata subdirectory inside your working directory, i.e.

      0012-got-babynames-2014/tempdata
    
  • Download the list of Social Security babynames data for 2014 and save it to your tempdata folder at this path:

      0012-got-babynames-2014/tempdata/ssa-babynames-nationwide-2014.txt
    
  • Count the lines in the file and print the number.

The file can be found here:

http://stash.compciv.org/ssa_baby_names/ssa-babynames-nationwide-2014.txt

Go ahead and download the file the “old-fashioned” way by clicking on the URL, just to confirm that it is indeed a text file.

Notice how the actual filename, when you ignore the directories in its path, is the same for the URL as for where I’m telling you to save the file, i.e. ssa-babynames-nationwide-2014.txt

As you might have guessed, there’s a Python helper function to isolate the base filename as a string, for your convenience. Use it if you think it saves you some typing (which, well, is generally one of the best reasons to learn new functions and syntax):

>>> from os.path import basename
>>> basename("/hello/world/file.txt")
'file.txt'
>>> basename("http://www.example.com/whatev/file.txt")
'file.txt'

If you want, you can write the downloaded data to the destination path and close the file. Then re-open the file for reading and count the lines, all in the same script. It will happen so fast that you could do it 100 times without noticeable delay.

If you want to feel more graceful about it, you could store the text of the response in a variable:

mytxt = resp.text

Then write the contents of that mytxt variable into the destination file path, i.e. tempdata/ssa-babynames-nationwide-2014.txt…then call the splitlines() method of the mytxt string object, which returns a list of strings based on splitting on the newline character, and call len() on that list:

(on second thought, there’s probably no need to assign a new variable, since the script is so short…)

len(mytxt.splitlines())

Whatever gets you excited about downloading and reading text files…

After this exercise, the other exercises in this set won’t require you to bring in the requests library or the makedirs function. We only need to download the file once.
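Putting the steps above together, one possible shape for a.py is sketched below. The names fetch_text, save_and_count, main, SRC_URL and DATA_DIR are made up for this illustration, and the script assumes the requests library is installed. Nothing is downloaded until main() is actually called.

```python
from os import makedirs
from os.path import basename, join

SRC_URL = 'http://stash.compciv.org/ssa_baby_names/ssa-babynames-nationwide-2014.txt'
DATA_DIR = 'tempdata'

def fetch_text(url):
    """Download url and return the body as a string (needs the requests library)."""
    import requests  # imported here so the other helper works without it
    resp = requests.get(url)
    return resp.text

def save_and_count(text, dest_path):
    """Write text to dest_path and return how many lines it has."""
    with open(dest_path, 'w') as f:
        f.write(text)
    return len(text.splitlines())

def main():
    makedirs(DATA_DIR, exist_ok=True)  # no error if tempdata already exists
    dest_path = join(DATA_DIR, basename(SRC_URL))
    line_count = save_and_count(fetch_text(SRC_URL), dest_path)
    print("There are", line_count, "lines in", dest_path)

# Call main() to actually download the file and print the count.
```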

Expectations

When you run a.py from the command-line:

0012-got-babynames-2014 $ python a.py
  • The program's output to screen should be:
    There are 33044 lines in tempdata/ssa-babynames-nationwide-2014.txt

0012-got-babynames-2014/b.py » Print the sum of the babies whose names were recorded in 2014

0012-got-babynames-2014/b.py
Print the sum of the babies whose names were recorded in 2014
0.5 points

Read through each line in the 2014 babynames file and sum the count of babies (the 3rd column). Print the total value to screen.

  • Inside the for-loop with which you iterate and read each line of text, you should be calling the string object’s split() method, which pretty much does what it says. Check out the Python documentation for a more detailed explanation.

  • When you call the string’s split() function, it returns a list of strings.

  • If a string looks like a number – e.g. "42" – it is still a string, i.e. the result of "42" + "42" is probably not what you want it to be. Use one of the number types’ constructor functions – e.g. int() – to convert strings into numbers.
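The three points above can be sketched on a tiny stand-in for the file. The three records are the top names listed in the expected output of d.py; the variable names are only for illustration.

```python
# Three records standing in for the full file.
sample_lines = [
    "Emma,F,20799\n",
    "Olivia,F,19674\n",
    "Noah,M,19144\n",
]

total = 0
for line in sample_lines:
    cols = line.strip().split(',')   # e.g. ['Emma', 'F', '20799']
    total += int(cols[2])            # the 3rd column, converted from str to int

print(total)        # 59617
print("42" + "42")  # 4242, because + joins strings instead of adding them
```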

Expectations

When you run b.py from the command-line:

0012-got-babynames-2014 $ python b.py
  • The program's output to screen should be:
    There are 3670151 babies whose names were recorded in 2014.
    

0012-got-babynames-2014/c.py » How many baby girls were named by parents inspired by characters portrayed by Emilia Clarke in Game of Thrones?

0012-got-babynames-2014/c.py
How many baby girls were named by parents inspired by characters portrayed by Emilia Clarke in Game of Thrones?
1.0 points

Game of Thrones is a popular HBO show and book about a “game” that is actually mostly about just one throne. Actress Emilia Clarke portrays a powerful warrior who has traveled back through time to set her ancient enemies on fire. Her name is “Daenerys” but she will also respond to “Khaleesi”.

In the 2014 baby names dataset, find all records for baby girls in which the given name is:

  • exactly 'Daenerys'
  • or begins with either 'Khalees' or 'Khaless'

For the latter case, sum the baby count as belonging to "Khaleesi".

Remember that we can check a string for the existence of a substring with the in operator:

if "hel" in "hello":
     print("hey there")

If you want to feel especially Pythonic, you can use Python’s “unpacking” feature to assign variables the values in a sequence (e.g. a list or tuple) in a slick one-liner:

name, sex, babies = line.strip().split(',')

I probably should’ve formally covered regular expressions in the lessons by now, as they are exactly the kind of thing we want when the text you’re looking for has some unpredictable variation. Oh well, in this example, we just have to check for the two variations of "Khaleesi".

But feel free to use them if you know them. They are absolutely the best thing to use here.
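Without regular expressions, one way to express "begins with" is the string's startswith() method, which is stricter than in because it only matches at the start of the string. The sketch below runs on invented records; the counts are placeholders, not the real 2014 figures.

```python
# Invented sample records; the counts are NOT the real 2014 figures.
sample_lines = [
    "Daenerys,F,10\n",
    "Khaleesi,F,7\n",
    "Khalessi,F,3\n",
    "Khalil,M,5\n",
]

daenerys_count = 0
khaleesi_count = 0
for line in sample_lines:
    name, sex, babies = line.strip().split(',')
    if sex != 'F':
        continue  # only baby girls are counted
    if name == 'Daenerys':
        daenerys_count += int(babies)
    elif name.startswith(('Khalees', 'Khaless')):  # a tuple means "any of these"
        khaleesi_count += int(babies)

print("Daenerys:", daenerys_count)
print("Khaleesi:", khaleesi_count)
```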

Expectations

When you run c.py from the command-line:

0012-got-babynames-2014 $ python c.py
  • The program's output to screen should be:
    Daenerys: 86
    Khaleesi: 398
    

0012-got-babynames-2014/d.py » Print the top 5 popular names for both baby boys and girls in 2014

0012-got-babynames-2014/d.py
Print the top 5 popular names for both baby boys and girls in 2014
1.0 points

Print the top 5 names of girls in order of the count of babies named. Then do the same for boys.

We haven’t formally learned how to sort a basic Python list, never mind a list of lists, but sorting won’t be required for this exercise.

By default, the Social Security Administration lists the names in order of gender – "F", then "M", and then by their respective count, in descending order.

This means that the first 5 lines of the file happen to be the 5 most popular baby girl names. It also means that once we’ve iterated past all of the girl names, the first 5 baby boy names we encounter will be the 5 most popular baby boy names. So we can do this using an if-statement or two.
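A minimal sketch of that if-statement approach, run on a shortened stand-in for the file (the records come from the expected output of this exercise). TOP_N and the list names are illustrative, and the approach only works because the file is already ordered.

```python
# A shortened stand-in for the file: girls first, then boys, each by descending count.
sample_lines = [
    "Emma,F,20799\n", "Olivia,F,19674\n", "Sophia,F,18490\n",
    "Noah,M,19144\n", "Liam,M,18342\n", "Mason,M,17092\n",
]

TOP_N = 2  # the exercise asks for 5; 2 keeps the sample small
top_girls = []
top_boys = []
for line in sample_lines:
    name, sex, babies = line.strip().split(',')
    # Keep a record only while that sex's list still has room.
    if sex == 'F' and len(top_girls) < TOP_N:
        top_girls.append((name, int(babies)))
    elif sex == 'M' and len(top_boys) < TOP_N:
        top_boys.append((name, int(babies)))

print(top_girls)
print(top_boys)
```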

However, if you want to do it the proper way, you can check out the Python documentation for the built-in sorted function, including this how-to guide. In later exercises, this is how we will be sorting our sequences.
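For a taste of sorted(), here is a small sketch on records that have been deliberately shuffled. The key and reverse arguments are part of the built-in function; the list names are only for illustration.

```python
# Records deliberately shuffled so the sort has something to do.
records = [
    ("Olivia", "F", 19674),
    ("Noah", "M", 19144),
    ("Emma", "F", 20799),
    ("Liam", "M", 18342),
]

girls = [r for r in records if r[1] == 'F']
# key picks the value to sort by; reverse=True puts the biggest count first.
girls_by_count = sorted(girls, key=lambda r: r[2], reverse=True)

print(girls_by_count)
print(girls)   # the original list is left in its original order
```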

Expectations

When you run d.py from the command-line:

0012-got-babynames-2014 $ python d.py
  • The program's output to screen should be:
    Top baby girl names
    1. Emma 20799
    2. Olivia 19674
    3. Sophia 18490
    4. Isabella 16950
    5. Ava 15586
    
    Top baby boy names
    1. Noah 19144
    2. Liam 18342
    3. Mason 17092
    4. Jacob 16712
    5. William 16687
    

0012-got-babynames-2014/e.py » Print the total number of babies in 2014 by gender

0012-got-babynames-2014/e.py
Print the total number of babies in 2014 by gender
0.5 points

Sum the count of babies and print the totals by gender.

Pretty much the same thing as exercise b.py, except that you keep one running total per gender.

Expectations

When you run e.py from the command-line:

0012-got-babynames-2014 $ python e.py
  • The program's output to screen should be:
    F: 1768775
    M: 1901376
    

0012-got-babynames-2014/f.py » Print the total number of babies in 2014, by last character of their given names

0012-got-babynames-2014/f.py
Print the total number of babies in 2014, by last character of their given names
1.0 points

This is similar to the previous exercise, except that instead of aggregating by a manageable number of categories, i.e. M and F, we’re asked to keep count for every letter in the alphabet.

Don’t try to simply do what you did for e.py, but on a much more tedious scale.

Learn to use the dictionary object, which can be used to contain any arbitrary and scalable collection of keys (in this case, alphabet letters) and values (in this case, baby counts for a given letter).

Remember that string objects are sequences and, like lists, can have their last member accessed using square-bracket notation and the index value of -1.

The way to do this exercise is not to have a massive if/elif/else conditional branch, even though it would technically work.

Instead, use a dictionary whose keys are the letters of the alphabet and whose values are the current count of babies for a given letter.

For example, instead of doing this (assuming fileobject is the file of records):

ax = 0
bx = 0
cx = 0
for line in fileobject:
    name, sex, babies = line.strip().split(',')
    last_letter = name[-1]
    if last_letter == 'a':
        ax += int(babies)
    elif last_letter == 'b':
        bx += int(babies)
    elif last_letter == 'c':
        cx += int(babies)

# etc.

Try this:

mydict = {}
for line in fileobject:
    name, sex, babies = line.strip().split(',')
    last_letter = name[-1]
    if mydict.get(last_letter):
        mydict[last_letter] += int(babies)
    else:
        mydict[last_letter] = int(babies)

If you don’t understand the purpose of the if-statement, or the get() method, recall what happens when you try to access a dictionary’s key without it being previously set.
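As a quick reminder, this small sketch shows both behaviors on a one-entry dictionary: square brackets raise a KeyError for a missing key, while get() returns None (or a default that you supply).

```python
mydict = {'a': 10}

print(mydict.get('a'))   # 10
print(mydict.get('z'))   # None: get() returns None for a missing key

try:
    mydict['z'] += 5     # square brackets on a missing key raise KeyError
except KeyError as err:
    print("KeyError:", err)

# get() also takes a default value, which shortens the if/else to one line:
mydict['z'] = mydict.get('z', 0) + 5
print(mydict)            # {'a': 10, 'z': 5}
```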

If you feel pretty confident about dictionaries and iterable objects in general, feel free to use the defaultdict or even the Counter types – both of which are in the collections module and offer a few relevant conveniences if you take the time to study and try them out.
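Here is a brief sketch of what those two types buy you: neither needs the "is the key already there?" check. The three records come from the expected output of d.py and stand in for the full file; the variable names are illustrative.

```python
from collections import Counter, defaultdict

# Three records standing in for the full file.
sample_lines = ["Emma,F,20799\n", "Olivia,F,19674\n", "Noah,M,19144\n"]

by_letter = defaultdict(int)   # missing keys start at int(), i.e. 0
tally = Counter()              # missing keys also count as 0

for line in sample_lines:
    name, sex, babies = line.strip().split(',')
    by_letter[name[-1]] += int(babies)
    tally[name[-1]] += int(babies)

print(dict(by_letter))
print(tally.most_common(1))    # the letter with the largest count
```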

For the final part of this exercise, in which you print the list of sums in alphabetical order, remember that you can’t simply iterate through the dictionary of counts like this:

for key, val in mydict.items():
    print(key + ':', val)

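This doesn’t work because dictionaries don’t keep their keys sorted. Depending on your Python version, the keys come out in insertion order (Python 3.7 and later) or in an arbitrary order, but not in alphabetical order unless you happened to add them that way.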

We can get around this by iterating through a sequence that we know to be in alphabetical order, even if we have to create it ourselves:

for letter in 'abcdefg':
    val = mydict[letter]
    # etc

Python has the string module for common string operations, including constants such as string.ascii_letters, which lists all lowercase and uppercase letters from a to Z. You probably want to use string.ascii_lowercase rather than manually typing out the alphabet, no matter how well you’ve memorized your ABCs:

>>> import string
>>> string.ascii_lowercase
'abcdefghijklmnopqrstuvwxyz'
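
Combining the two ideas, a sketch of the printing step might look like this. The dictionary is a small invented one, and using get(letter, 0) is an assumption about how you want to handle letters that never appear.

```python
import string

# A small invented dictionary of counts, built in non-alphabetical order.
mydict = {'n': 3, 'a': 5, 'e': 2}

lines = []
for letter in string.ascii_lowercase:
    val = mydict.get(letter, 0)   # 0 for letters that never appeared
    lines.append(letter + ': ' + str(val))

print(lines[0])    # a: 5
print(lines[1])    # b: 0
print(len(lines))  # 26
```
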
Expectations

When you run f.py from the command-line:

0012-got-babynames-2014 $ python f.py
  • The program's output to screen should be:
    a: 683400
    b: 31658
    c: 24966
    d: 46701
    e: 459362
    f: 3003
    g: 5375
    h: 228547
    i: 103362
    j: 1537
    k: 36097
    l: 170789
    m: 63365
    n: 901859
    o: 84050
    p: 2569
    q: 363
    r: 225161
    s: 148429
    t: 72840
    u: 3965
    v: 3364
    w: 30306
    x: 20757
    y: 313782
    z: 4544
    

0012-got-babynames-2014/g.py » Print a human-readable list of baby counts in 2014, by gender and by last character of their given names

0012-got-babynames-2014/g.py
Print a human-readable list of baby counts in 2014, by gender and by last character of their given names
1.0 points

Nearly the same as the previous exercise, except that the count-by-letter is done for each sex, "M" and "F".

Also, note that the output is specifically formatted so that the columns of numbers are easier to glance at and compare. Your answer is expected to follow this exact format:

  • Each column has a fixed width of exactly 8 characters.
  • The dashed line that separates the header from the data is, consequently, composed of 24 hyphens.
  • The first column is left-justified
  • The second and third columns are right-justified

Check out the Hints section for how to efficiently do this. Or you could just read the Python documentation on the ljust() and rjust() methods.

Think about nested dictionaries

You could complete this exercise by keeping a separate dictionary by gender:

m_dict = {}
f_dict = {}
# ...etc
letter = name[-1]
if sex == 'F':
    if f_dict.get(letter):
        f_dict[letter] += int(babies)
    else:
        f_dict[letter] = int(babies)
else:
    if m_dict.get(letter):
        m_dict[letter] += int(babies)
    else:
        m_dict[letter] = int(babies)

But consider using a nested dictionary, so that you don’t need as much repetitious, conditional logic:

mydict = {'M': {}, 'F': {}}
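
One way the nested version might play out is sketched below, with the sex value choosing the inner dictionary so that no if sex == 'F' branch is needed. The three records come from the expected output of d.py and stand in for the full file.

```python
sample_lines = ["Emma,F,20799\n", "Olivia,F,19674\n", "Noah,M,19144\n"]

mydict = {'M': {}, 'F': {}}
for line in sample_lines:
    name, sex, babies = line.strip().split(',')
    letter = name[-1]
    counts = mydict[sex]   # pick the inner dict for this sex
    counts[letter] = counts.get(letter, 0) + int(babies)

print(mydict)
```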

This exercise is intentionally similar to the previous one, so that you can think of a more elegant way to use dictionaries.

Padding and justifying the text

The rjust() string method has 1 required argument: an integer representing a desired length for a right-justified string. This is also referred to as “padding the string”:

>>> mystr = 'hello'
>>> mystr.rjust(7)
'  hello'

Though we don’t need it for this exercise, a very common use-case is zero-padding, in which the 0 character is added to the left side of a number to give it a uniform length. The rjust() method takes a second optional argument: the character used to “fill” the padding:

>>> mynumbers = [42, 9561, 28777]
>>> for n in mynumbers:
...     print(n)
42
9561
28777

>>> for n in mynumbers:
...     print(str(n).rjust(5, '0'))
00042
09561
28777
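
For the layout this exercise asks for, a minimal sketch of the header, the divider, and one data row could look like this. COL_WIDTH is an illustrative name; the counts for the letter a are the ones shown in the expected output.

```python
COL_WIDTH = 8

# First column left-justified, the other two right-justified.
header = 'letter'.ljust(COL_WIDTH) + 'F'.rjust(COL_WIDTH) + 'M'.rjust(COL_WIDTH)
divider = '-' * (COL_WIDTH * 3)

# Counts for the letter a, taken from the expected output of g.py.
row = 'a'.ljust(COL_WIDTH) + str(655469).rjust(COL_WIDTH) + str(27931).rjust(COL_WIDTH)

print(header)
print(divider)
print(row)
```
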
Expectations

When you run g.py from the command-line:

0012-got-babynames-2014 $ python g.py
  • The program's output to screen should be:
    letter         F       M
    ------------------------
    a         655469   27931
    b            573   31085
    c           1349   23617
    d           3060   43641
    e         328326  131036
    f            164    2839
    g            690    4685
    h         127602  100945
    i          57205   46157
    j            187    1350
    k            583   35514
    l          44417  126372
    m           5132   58233
    n         233833  668026
    o           2189   81861
    p             58    2511
    q             53     310
    r          48361  176800
    s          21309  127120
    t          20824   52016
    u            787    3178
    v            401    2963
    w           3905   26401
    x           2091   18666
    y         209187  104595
    z           1020    3524
    
Some takeaways from this exercise:
  • By using the ljust() and rjust() methods, we’ve effectively created a fixed-width data file, which is easier for humans to read at a glance at the cost of a little more programmatic complexity.

  • Don’t underestimate the ability to use plain text as a data visualization. The output of this program, simple as it is, allows the reader to see immediately the letters for which the gender gap spans an order of magnitude or more, e.g. in 2014, there were more than 20 times as many baby girls as baby boys with names that end in a.

  • It almost seems as if certain kinds of sounds are associated with gender.

References and Related Readings

Popular Baby Names: Beyond the Top 1000 Names
Download year-by-year baby name datasets.
Background Information on the baby names dataset
In 1998, the Social Security Administration published Actuarial Note #139, Name Distributions in the Social Security Area, August 1997, on the distribution of given names of Social Security number holders. The note, written by actuary Michael W. Shackleford, gave birth to the present website.
Comma-separated values: Standardization
CSV files are so simple! Why are people wasting time thinking of a "standardization" for it? I mean how complicated can it get...?
Top U.S. baby names list includes Anakin, Leia and Khaleesi
For the first time ever, a popular “Star Wars” character has found its way to the country’s top 1,000 baby names, according to the Social Security Administration.
Built-in Functions: sorted
Even though the list object has its own `sort()` method, I strongly urge you to ignore it and instead use the `sorted()` function, which returns a new sorted list without mutating the original.
Sorting HOW TO
This tutorial describes several ways to sort sequences in Python. I highly recommend focusing on just the `sorted()` examples.
CSV File Reading and Writing
The csv module implements classes to read and write tabular data in CSV format. It allows programmers to say, “write this data in the format preferred by Excel,” or “read data from this file which was generated by Excel,” without knowing the precise details of the CSV format used by Excel. Programmers can also describe the CSV formats understood by other applications or define their own special-purpose CSV formats.