The Checklist
In your compciv-2016 Git repository create a subfolder and name it:
exercises/0012-got-babynames-2014
The folder structure will look like this (not including any subfolders such as `tempdata/`):
compciv-2016
└── exercises
    └── 0012-got-babynames-2014
        ├── a.py
        ├── b.py
        ├── c.py
        ├── d.py
        ├── e.py
        ├── f.py
        └── g.py
Background information
This is just a fun exercise using the most adorable of datasets as a way to practice the process of turning raw text into data. Sure, you could download the provided dataset and import it into Excel. But I think you'll (eventually) find that it's much, much faster to do it programmatically, especially if you want to analyze more than one year's worth of data. Or to programmatically turn the data into an analysis tool that can be used in other kinds of bespoke analyses…
Be sure to read up on loops, conditionals, and lists, and be familiar with how we downloaded and stored files in the Shakespeare text files exercises.
Beware of CSVs
If you think you know how to use Python's csv module, go ahead and use it if you think it makes things easier. But this dataset is so simple that you won't need it – "simple" not as in its potential for interesting insights, but in that it has none of the many problems inherent to comma-delimited text that can seriously wreck your day/will to live.
In fact, don't think for a minute (after finishing this exercise set) that parsing (or creating) CSV text is as simple as understanding that the values are separated by commas and calling the split() method. Everything about data is overly complicated, even commas, but especially text, even though it's "just text".
For every other situation involving comma-delimited text data, we will be using Python's csv module.
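To see why split() alone is risky for general comma-delimited text, here is a minimal sketch. The sample line is made up for illustration (it is not from the babynames file); it has one quoted value that contains a comma.

```python
import csv

# A made-up CSV line: the first value is quoted and contains a comma.
line = '"Smith, John",M,42'

# Naive splitting cuts the quoted value into two pieces.
naive = line.split(',')

# csv.reader understands quoting; it accepts any iterable of lines.
parsed = next(csv.reader([line]))

print(len(naive), naive)
print(len(parsed), parsed)
```

The babynames file has no quoted values, which is why split() is safe for this exercise set.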
Getting the raw data yourself
For this exercise, I've extracted the 2014 data file from the original dataset for you to use, so that you don't have to download and process the entire dataset (which spans 100+ years).
However, if you're interested in the raw dataset (and I might revisit the data for future exercises and examples), you can download it yourself from the Social Security Administration's website. In fact, once you've finished this exercise set, it's just a call to glob.glob() and an extra for-loop (and about 100MB of free disk space) to repeat the analysis for every year of Social Security data.
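As a rough sketch of that glob-plus-for-loop idea, the function below counts the lines of every per-year file in a folder. The folder name `tempdata/names` and the `yob*.txt` filename pattern are assumptions about how you unzipped the raw dataset; adjust them to match what you actually have on disk.

```python
from glob import glob
from os.path import basename, join

def count_lines_by_file(data_dir):
    """Return a dict mapping each matching filename to its line count."""
    counts = {}
    # Assumption: the unzipped per-year files are named like yob2014.txt
    for path in sorted(glob(join(data_dir, 'yob*.txt'))):
        with open(path) as f:
            counts[basename(path)] = len(f.read().splitlines())
    return counts

# 'tempdata/names' is a hypothetical folder; if it doesn't exist,
# glob() finds nothing and the loop simply prints nothing.
for fname, n in count_lines_by_file(join('tempdata', 'names')).items():
    print(fname, n)
```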
Caveats: It ain't all babies
The phrasing and terminology of this exercise don't quite precisely describe the data. The name counts are based on applications for Social Security cards. The majority of these come from babies born in the given year, but not all of them. Also, the data only includes names that are at least 2 characters long. And finally, the data only includes applications for which the year of birth, sex, and U.S. state of birth are known.
But it's just easier to refer to this as the babyname-counting exercises. You can read the background of the data on the Social Security Administration's website, which, incidentally, also just simplifies things as being a bunch of baby names. So don't judge me.
The Exercises
0012-got-babynames-2014/a.py » Download the 2014 text file of babynames and count the lines
- Make a tempdata subdirectory inside your working directory, i.e. 0012-got-babynames-2014/tempdata
- Download the list of Social Security babynames data for 2014 and save it to your tempdata folder at this path: 0012-got-babynames-2014/tempdata/ssa-babynames-nationwide-2014.txt
- Count the lines in the file and print the number.
The file can be found here:
http://stash.compciv.org/ssa_baby_names/ssa-babynames-nationwide-2014.txt
Go ahead and download the file the “old-fashioned” way by clicking on the URL, just to confirm that it is indeed a text file.
Notice how the actual filename, when you ignore the directories in its path, is the same for the URL as for where I’m telling you to save the file, i.e. ssa-babynames-nationwide-2014.txt
As you might have guessed, there’s a Python helper function to isolate the base filename as a string, for your convenience. Use it if you think it saves you some typing (which, well, is generally one of the best reasons to learn new functions and syntax):
>>> from os.path import basename
>>> basename("/hello/world/file.txt")
'file.txt'
>>> basename("http://www.example.com/whatev/file.txt")
'file.txt'
If you want, you can write the downloaded data to the destination path and close the file. Then re-open the file for reading and count the lines, all in the same script. It will happen so fast that you could do it 100 times without noticeable delay.
If you want to feel more graceful about it, you could store the text of the response in a variable:
mytxt = resp.text
Then write the contents of that mytxt variable into the destination file path, i.e. tempdata/ssa-babynames-nationwide-2014.txt
…then call the splitlines() method of the mytxt string object, which returns a list of strings based on splitting on the newline character, and call len() on that list:
(on second thought, there’s probably no need to assign a new variable, since the script is so short…)
len(mytxt.splitlines())
Whatever gets you excited about downloading and reading text files…
After this exercise, the other exercises in this set won’t require you to bring in the requests library or the makedirs function. We only need to download the file once.
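The steps above can be put together roughly as follows. This is only one possible sketch, not the required solution: the helper name `save_and_count` and the `main()` wrapper are made up for illustration, and the download relies on the third-party requests library being installed.

```python
import os

URL = 'http://stash.compciv.org/ssa_baby_names/ssa-babynames-nationwide-2014.txt'
DEST_DIR = 'tempdata'

def save_and_count(text, dest_path):
    """Write text to dest_path (creating its folder if needed); return the line count."""
    folder = os.path.dirname(dest_path)
    if folder:
        os.makedirs(folder, exist_ok=True)
    with open(dest_path, 'w') as f:
        f.write(text)
    return len(text.splitlines())

def main():
    # requests is imported here so the helper above can be used without it
    import requests
    resp = requests.get(URL)
    dest_path = os.path.join(DEST_DIR, os.path.basename(URL))
    n = save_and_count(resp.text, dest_path)
    print('There are', n, 'lines in', dest_path)

# Call main() yourself to perform the actual download.
```

Keeping the file-writing step separate from the download makes it easy to try the counting logic on any string before touching the network.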
When you run a.py from the command-line:
0012-got-babynames-2014 $ python a.py
The program's output to screen should be:
There are 33044 lines in tempdata/ssa-babynames-nationwide-2014.txt
0012-got-babynames-2014/b.py » Print the sum of the babies whose names were recorded in 2014
Read through each line in the 2014 babynames file and sum the count of babies (the 3rd column). Print the total value to screen.
- Inside the for-loop with which you iterate and read each line of text, you should be calling the string object’s split() method, which pretty much does what it says. Check out the Python documentation for a more detailed explanation.
- When you call the string’s split() method, it returns a list of strings.
- If a string looks like a number – e.g. "42" – it is still a string, i.e. the result of "42" + "42" is probably not what you want it to be. Use one of the number types’ constructor functions – e.g. int() – to convert strings into numbers.
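A small sketch of these hints, using three sample lines in the file's name,sex,count layout (the counts are the ones that appear in the expected output for d.py) rather than the whole file:

```python
# Three sample lines in name,sex,count form
lines = ['Emma,F,20799', 'Olivia,F,19674', 'Noah,M,19144']

# Strings that look like numbers are concatenated, not added
joined = '42' + '42'

total = 0
for line in lines:
    cols = line.strip().split(',')   # e.g. ['Emma', 'F', '20799']
    total += int(cols[2])            # convert the 3rd column to a number first

print(joined)
print(total)
```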
When you run b.py from the command-line:
0012-got-babynames-2014 $ python b.py
The program's output to screen should be:
There are 3670151 babies whose names were recorded in 2014.
0012-got-babynames-2014/c.py » How many baby girls were named by parents inspired by characters portrayed by Emilia Clarke in Game of Thrones?
Game of Thrones is a popular HBO show and book about a “game” that is actually mostly about just one throne. Actress Emilia Clarke portrays a powerful warrior who has traveled back through time to set her ancient enemies on fire. Her name is “Daenerys” but she will also respond to “Khaleesi”.
In the 2014 baby names dataset, find all records for baby girls in which the given name is:
- exactly 'Daenerys'
- or begins with either 'Khalees' or 'Khaless'
For the latter case, sum the baby count as belonging to "Khaleesi".
Remember that we can check a string for the existence of a substring with the in keyword:
if "hel" in "hello":
    print("hey there")
If you want to feel especially Pythonic, you can use Python’s “unpacking” feature to assign variables the values in a sequence (e.g. a list or tuple) in a slick one-liner:
name, sex, babies = line.strip().split(',')
I probably should’ve formally covered regular expressions in the lessons by now, as they are exactly the kind of thing we want when the text you’re looking for has some unpredictable variation. Oh well, in this example, we just have to check for the two variations of "Khaleesi".
But feel free to use them if you know them. They are absolutely the best thing to use here.
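For those who do want to try a regular expression, here is a hedged sketch of the matching logic. The records and counts below are made up for illustration and are not from the real file; the pattern is just one way to express "begins with 'Khalees' or 'Khaless'".

```python
import re

# Made-up records for illustration; these counts are NOT from the real file
lines = ['Daenerys,F,10', 'Khaleesi,F,20', 'Khalessi,F,5',
         'Khalil,M,7', 'Daenerys,M,1']

# Matches names that begin with 'Khalees' or 'Khaless'
khaleesi_pattern = re.compile(r'^Khale(es|ss)')

totals = {'Daenerys': 0, 'Khaleesi': 0}
for line in lines:
    name, sex, babies = line.strip().split(',')
    if sex != 'F':
        continue                      # only baby girls are counted
    if name == 'Daenerys':
        totals['Daenerys'] += int(babies)
    elif khaleesi_pattern.match(name):
        totals['Khaleesi'] += int(babies)

print(totals)
```

Without regular expressions, the string method `name.startswith(('Khalees', 'Khaless'))` performs the same prefix check; unlike the in keyword, it only looks at the beginning of the string.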
When you run c.py from the command-line:
0012-got-babynames-2014 $ python c.py
The program's output to screen should be:
Daenerys: 86
Khaleesi: 398
0012-got-babynames-2014/d.py » Print the top 5 popular names for both baby boys and girls in 2014
Print the top 5 names of girls in order of the count of babies named. Then do the same for boys.
We haven’t formally learned how to sort a basic Python list, never mind a list of lists, but sorting won’t be required for this exercise.
By default, the Social Security Administration lists the names in order of gender – "F", then "M" – and then by their respective count, in descending order.
This means that the first 5 lines of the file happen to be the 5 baby girl names with the most babies named. It also means that once we've iterated past all of the girl names, the first 5 baby boy names we encounter will be the 5 most popular baby boy names. So we can do this using an if-statement or two.
However, if you want to do it the proper way, you can check out the Python documentation for the built-in sorted function, including this how-to guide. In later exercises, this is how we will be sorting our sequences.
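As a taste of the sorted() approach, here is a minimal sketch on a handful of rows that are deliberately out of order. The counts are the ones listed in this exercise's expected output; the tuple layout is just one convenient assumption about how you might store each record.

```python
# A few rows as (name, sex, count) tuples, deliberately shuffled
rows = [('Ava', 'F', 15586), ('Emma', 'F', 20799), ('Noah', 'M', 19144),
        ('Sophia', 'F', 18490), ('Liam', 'M', 18342), ('Olivia', 'F', 19674)]

girls = [r for r in rows if r[1] == 'F']

# key= picks the value to sort by; reverse=True puts the largest count first
top_girls = sorted(girls, key=lambda r: r[2], reverse=True)[:3]

for rank, (name, sex, count) in enumerate(top_girls, start=1):
    print(str(rank) + '.', name, count)
```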
When you run d.py from the command-line:
0012-got-babynames-2014 $ python d.py
The program's output to screen should be:
Top baby girl names
1. Emma 20799
2. Olivia 19674
3. Sophia 18490
4. Isabella 16950
5. Ava 15586
Top baby boy names
1. Noah 19144
2. Liam 18342
3. Mason 17092
4. Jacob 16712
5. William 16687
0012-got-babynames-2014/e.py » Print the total number of babies in 2014 by gender
Sum the count of babies and print the totals by gender.
Pretty much the same thing as exercise b.py.
When you run e.py from the command-line:
0012-got-babynames-2014 $ python e.py
The program's output to screen should be:
F: 1768775
M: 1901376
0012-got-babynames-2014/f.py » Print the total number of babies in 2014, by last character of their given names
This is similar to the previous exercise, except that instead of aggregating by a manageable number of categories, i.e. M and F, we’re asked to keep count for every letter in the alphabet.
Don’t try to simply do what you did for e.py, but on a much more tedious scale.
Learn to use the dictionary object, which can be used to contain any arbitrary and scalable collection of keys (in this case, alphabet letters) and values (in this case, baby counts for a given letter).
Remember that string objects are sequences, and like lists, can have their last member be accessed using square-bracket notation and the index value of -1.
The way to do this exercise is not to have a massive if/elif/else conditional branch, even though it would technically work.
Instead, use a dictionary, in which its keys are the letters of the alphabet, and the values are the current count of babies for a given letter.
For example, instead of doing this (assuming fileobject is the file of records):
ax = 0
bx = 0
cx = 0
for line in fileobject:
    name, sex, babies = line.strip().split(',')
    last_letter = name[-1]
    if last_letter == 'a':
        ax += int(babies)
    elif last_letter == 'b':
        bx += int(babies)
    elif last_letter == 'c':
        cx += int(babies)
    # etc.
Try this:
mydict = {}
for line in fileobject:
    name, sex, babies = line.strip().split(',')
    last_letter = name[-1]
    if mydict.get(last_letter):
        mydict[last_letter] += int(babies)
    else:
        mydict[last_letter] = int(babies)
If you don’t understand the purpose of the if-statement, or the get() method, recall what happens when you try to access a dictionary’s key without it being previously set.
If you feel pretty confident about dictionaries and iterable objects in general, feel free to use the defaultdict or even the Counter types – both of which are in the collections module and offer a few relevant conveniences if you take the time to study and try them out.
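Here is a brief, hedged sketch of how those conveniences compare to a plain dictionary. It uses three sample lines (with counts taken from the expected output of d.py) instead of the real file.

```python
from collections import Counter, defaultdict

lines = ['Emma,F,20799', 'Olivia,F,19674', 'Noah,M,19144']

# A plain dict raises KeyError when you update a key that was never set
plain = {}
key_error_raised = False
try:
    plain['a'] += 1
except KeyError:
    key_error_raised = True

# defaultdict(int) starts every missing key at 0; Counter behaves similarly
by_letter = defaultdict(int)
tally = Counter()
for line in lines:
    name, sex, babies = line.strip().split(',')
    by_letter[name[-1]] += int(babies)
    tally[name[-1]] += int(babies)

print(key_error_raised)
print(dict(by_letter))
print(tally.most_common(1))
```

Both types remove the need for the if/else around get(); Counter additionally offers helpers such as most_common().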
For the final part of this exercise, in which you print the list of sums in alphabetical order, remember that you can’t simply iterate through the dictionary of counts like this:
for key, val in mydict.items():
    print(key + ':', val)
This doesn’t work because dictionaries don’t keep their keys sorted…depending on your Python version, the keys come out either in arbitrary order or in the order they were first inserted, neither of which is alphabetical.
We can get around this by iterating through a sequence that we know to be in alphabetical order, even if we have to create it ourselves:
for letter in 'abcdefg':
    val = mydict[letter]
    # etc
Python has the string module for common string operations, including constants such as string.ascii_letters, which lists all lower- and uppercase letters from a to Z. You probably want to use string.ascii_lowercase, rather than manually typing out the alphabet, no matter how well you’ve memorized your abcs:
>>> import string
>>> string.ascii_lowercase
'abcdefghijklmnopqrstuvwxyz'
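Putting the alphabet constant to work might look like the sketch below. The contents of mydict are a hypothetical partial result for illustration; a real run over the 2014 file would hold a count for every letter.

```python
import string

# Hypothetical partial counts, inserted out of alphabetical order
mydict = {'h': 19144, 'a': 40473}

output_lines = []
for letter in string.ascii_lowercase:
    # get() with a default of 0 avoids a KeyError for letters never seen
    val = mydict.get(letter, 0)
    if val:
        output_lines.append(letter + ': ' + str(val))

for text in output_lines:
    print(text)

# Alternatively, sorted() on a dictionary returns its keys in sorted order
ordered_keys = sorted(mydict)
```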
When you run f.py from the command-line:
0012-got-babynames-2014 $ python f.py
The program's output to screen should be:
a: 683400
b: 31658
c: 24966
d: 46701
e: 459362
f: 3003
g: 5375
h: 228547
i: 103362
j: 1537
k: 36097
l: 170789
m: 63365
n: 901859
o: 84050
p: 2569
q: 363
r: 225161
s: 148429
t: 72840
u: 3965
v: 3364
w: 30306
x: 20757
y: 313782
z: 4544
0012-got-babynames-2014/g.py » Print a human-readable list of baby counts in 2014, by gender and by last character of their given names
Nearly the same as the previous exercise, except that the count-by-letter is done for each sex, "M" and "F".
Also, note that the output is specifically formatted so that the columns of numbers are easier to glance at and compare. Your answer is expected to follow this exact format:
- Each column has a fixed width of exactly 8 characters.
- The dashed line that separates the header from the data is, consequently, composed of 24 hyphens.
- The first column is left-justified
- The second and third columns are right-justified
Check out the Hints section for how to efficiently do this. Or you could just read the Python documentation on the ljust() and rjust() methods.
Think about nested dictionaries
You could complete this exercise by keeping a separate dictionary by gender:
m_dict = {}
f_dict = {}
# ...etc
letter = name[-1]
if sex == 'F':
    if f_dict.get(letter):
        f_dict[letter] += int(babies)
    # etc etc
else:
    if m_dict.get(letter):
        m_dict[letter] += int(babies)
But consider using a nested dictionary, so that you don’t need as much repetitious, conditional logic:
mydict = {'M': {}, 'F': {}}
This exercise is intentionally similar to the previous one, so that you can think of a more elegant way to use dictionaries.
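One possible way to use the nested dictionary is sketched below, on four sample lines (counts from the expected output of d.py). The variable name `inner` is made up for illustration.

```python
lines = ['Emma,F,20799', 'Olivia,F,19674', 'Noah,M,19144', 'Liam,M,18342']

mydict = {'M': {}, 'F': {}}
for line in lines:
    name, sex, babies = line.strip().split(',')
    letter = name[-1]
    # sex selects the inner dictionary, so no if/else on gender is needed
    inner = mydict[sex]
    inner[letter] = inner.get(letter, 0) + int(babies)

print(mydict['F'])
print(mydict['M'])
```

Because the value of sex is itself the outer key, the same two lines of counting logic serve both genders.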
Padding and justifying the text
The rjust() string method has 1 required argument: an integer representing a desired length for a right-justified string. This is also referred to as “padding the string”:
>>> mystr = 'hello'
>>> mystr.rjust(7)
'  hello'
Though we don’t need it for this exercise, a very common use-case is to do zero-padding, in which the 0 character is added to the left side of a number to give it a uniform length. The rjust() method takes a second optional argument: the character used to “fill” the padding:
>>> mynumbers = [42, 9561, 28777]
>>> for n in mynumbers:
...     print(n)
42
9561
28777
>>> for n in mynumbers:
...     print(str(n).rjust(5, '0'))
00042
09561
28777
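Combining ljust() and rjust() to build one fixed-width row could look like this sketch. The helper name `format_row` is hypothetical; the width of 8 and the sample counts for the letter a come from this exercise's stated format and expected output.

```python
def format_row(first, second, third, width=8):
    """Left-justify the first column and right-justify the other two."""
    return (str(first).ljust(width)
            + str(second).rjust(width)
            + str(third).rjust(width))

header = format_row('letter', 'F', 'M')
divider = '-' * len(header)          # 3 columns x 8 characters = 24 hyphens
row = format_row('a', 655469, 27931)

print(header)
print(divider)
print(row)
```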
When you run g.py from the command-line:
0012-got-babynames-2014 $ python g.py
The program's output to screen should be:
letter         F       M
------------------------
a         655469   27931
b            573   31085
c           1349   23617
d           3060   43641
e         328326  131036
f            164    2839
g            690    4685
h         127602  100945
i          57205   46157
j            187    1350
k            583   35514
l          44417  126372
m           5132   58233
n         233833  668026
o           2189   81861
p             58    2511
q             53     310
r          48361  176800
s          21309  127120
t          20824   52016
u            787    3178
v            401    2963
w           3905   26401
x           2091   18666
y         209187  104595
z           1020    3524
By using the ljust() and rjust() methods, we’ve effectively created a fixed-width data file, which is easier for humans to read at a glance, at the cost of a little more programmatic complexity.
Don’t underestimate the ability to use plain text as a data visualization. The output of this program, simple as it is, allows the reader to see immediately for which letters the gender gap differs by orders of magnitude, e.g. in 2014, there were more than 20 times as many baby girls with names that end in a than baby boys.
Almost seems as if certain kinds of sounds are associated with gender.