The Checklist

In your compciv-2016 Git repository create a subfolder and name it:

     exercises/0020-gender-detector

The folder structure will look like this (not including any subfolders such as `tempdata/`):

        compciv-2016
        └── exercises
            └── 0020-gender-detector
               ├── a.py
               ├── b.py
               ├── c.py
               ├── d.py
               ├── e.py
               ├── f.py
               ├── g.py
               ├── h.py
               ├── i.py
               ├── j.py
               ├── k.py
               ├── m.py
               └── n.py

`a.py`	0.5 points	Download all of the baby names data from the Social Security Administration
`b.py`	0.5 points	Count the total number of babies since 1950 by gender.
`c.py`	0.5 points	Count and total the number of unique names by gender since 1950
`d.py`	0.5 points	For each year since 1950, count and print the number of unique names by sexes
`e.py`	0.5 points	Count the number of unique names and sum the baby counts for the year 2014
`f.py`	0.5 points	Print the number of babies per name, for every five years since 1950.
`g.py`	0.5 points	Reshape the 2014 babynames file so that it is optimized for use in a gender-detecting program.
`h.py`	0.5 points	Print the 5 most popular names in 2014 that are relatively gender-ambiguous.
`i.py`	0.5 points	Print a breakdown of popular names in 2014 by gender ambiguity.
`j.py`	0.5 points	Aggregate a number of data files from 1950 to 2014, then reshape it for use in a gender-detecting program.
`k.py`	0.5 points	Make a function that uses the CSV data to analyze a name, then analyze a list of names
`m.py`	0.5 points	Convert wrangledbabynames.csv to wrangledbabynames.json
`n.py`	0.5 points	Make a function that uses the JSON data to analyze a name, then analyze a list of names

Background information

Answers and examples data

Check out the Github repo for answers and example data.

The power of automated methods

Via A practical guide to methods and ethics of gender identification [J. Nathan Matias; MIT Center for Civic Media]:

Surveys and human coding will never be able to function in real-time or at scale (the Global Media Monitoring project took 5 years to analyze 16,000 media items). To do that, we turn to automated methods. The simplest approach is to use historical birth records to estimate the likely sex of a first name.

The Exercises


          0020-gender-detector/a.py

Download all of the baby names data from the Social Security Administration

0.5 points

Download the zip file here:

https://www.ssa.gov/oact/babynames/names.zip

Save into tempdata, as in tempdata/names.zip

Unzip it into tempdata – it should unpack 135 text files.

The hints section contains virtually all of the code. You still have to write the final line, though: the one that counts the number of text files.

import requests
from os import makedirs
from os.path import join
from shutil import unpack_archive
from glob import glob
SOURCE_URL = 'https://www.ssa.gov/oact/babynames/names.zip'
DATA_DIR = 'tempdata'
DATA_ZIP_PATH = join(DATA_DIR, 'names.zip')
# make the directory
makedirs(DATA_DIR, exist_ok=True)

print("Downloading", SOURCE_URL)
resp = requests.get(SOURCE_URL)
# save it to disk
# we use 'wb' because these are BYTES
with open(DATA_ZIP_PATH, 'wb') as f:
    # we use resp.content because it is BYTES
    f.write(resp.content)

# now let's unzip it into tempdata/
unpack_archive(DATA_ZIP_PATH, extract_dir=DATA_DIR)

# get all the filenames
babynamefilenames = glob(join(DATA_DIR, '*.txt'))

Expectations

When you run a.py from the command-line:

0020-gender-detector $ python a.py

The program's output to screen should be:

Downloading https://www.ssa.gov/oact/babynames/names.zip
There are 135 txt files

The program creates this file path: tempdata/names.zip
The program creates this file path: tempdata/yob2014.txt
The program accesses this remote file: https://www.ssa.gov/oact/babynames/names.zip


          0020-gender-detector/b.py

Count the total number of babies since 1950 by gender.

0.5 points

Use the glob.glob() function to get a list of filenames in a given directory
Filter that globbed list to include only files from the years 1950 onward
Total up the babies by 'F' and 'M'
Print a message that lists the total babies for F and M
Then print the F/M ratio of the total baby count

You can glob for filenames like this:

from os.path import join, basename
from glob import glob
DATA_DIR = 'tempdata'
alltxtfiles_path = join(DATA_DIR, '*.txt')
alltxtfiles_names = glob(alltxtfiles_path)

Filtering to include only the data files from 1950 onward is a little tricky.

But we can use the filenames themselves to our advantage. Remember, they’re just strings.

And they look like this:

tempdata/yob1949.txt
tempdata/yob1950.txt
tempdata/yob1951.txt

If you know how to use regular expressions, go to town. Otherwise, consider this step by step deconstruction of the filename:

from os.path import basename

a_filename = 'tempdata/yob1951.txt'
bname = basename(a_filename)
# bname is now: yob1951.txt
# the first digit is at the index position of 3
# and we want the next 4 characters, so....
year = bname[3:7]
# bingo: year is the string "1951"
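
If you do want to go the regular-expression route, here is a minimal sketch. The helper name `extract_year` is made up for illustration, and it assumes every data file follows the `yobYYYY.txt` naming pattern shown above:

```python
import re
from os.path import basename

# matches names like yob1951.txt and captures the four-digit year
YEAR_PATTERN = re.compile(r'^yob(\d{4})\.txt$')

def extract_year(filename):
    """Return the year as an int, or None if the name doesn't fit the pattern."""
    match = YEAR_PATTERN.match(basename(filename))
    if match:
        return int(match.group(1))
    return None

print(extract_year('tempdata/yob1951.txt'))
print(extract_year('tempdata/names.zip'))
```

Returning None for non-matching names is just one design choice; it makes it easy to skip files that don't belong, like the zip file itself.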

Basically, you want to filter those files you previously globbed (i.e. alltxtfiles_names) by testing if the year component of the file name is greater-than-or-equal-to 1950. If you don’t convert it to an integer, the comparison is with the string “1950”:

myfilenames = []
for fname in alltxtfiles_names:
  bname = basename(fname) # e.g. "yob1980.txt"
  year = bname[3:7]       # e.g.    "1980"
  if year >= "1950":
      myfilenames.append(fname)
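
The string comparison works here only because every year has exactly four digits, so alphabetical order and numerical order agree. A quick sketch, using a made-up sample of filenames, that checks both approaches against each other and shows where string comparison would go wrong:

```python
from os.path import basename

# made-up sample of filenames, in the same shape that glob() returns
sample_names = ['tempdata/yob1949.txt', 'tempdata/yob1950.txt', 'tempdata/yob2014.txt']

by_string = [f for f in sample_names if basename(f)[3:7] >= "1950"]
by_integer = [f for f in sample_names if int(basename(f)[3:7]) >= 1950]

# both filters keep the same files
print(by_string == by_integer)

# but strings of different lengths do not compare like numbers:
# "9" sorts after "1", so this is True even though 950 < 1950
print("950" >= "1950")
```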

Now, do the counting of baby names using the filtered list in myfilenames

Here’s one approach:

totalsdict = {'M': 0, 'F': 0}

for fname in myfilenames:
    babyfile = open(fname, "r")
    for line in babyfile:
        name, gender, babies = line.split(',')
        # need to convert babies to a number before adding
        totalsdict[gender] += int(babies)

# Now, totalsdict contains two keys, 'M' and 'F', which both point
# to very large integers

print("F:", str(totalsdict['F']).rjust(6),
      "M:", str(totalsdict['M']).rjust(6))


f_to_m_ratio = round(100 * totalsdict['F'] / totalsdict['M'])
print("F/M baby ratio:", f_to_m_ratio)

Expectations

When you run b.py from the command-line:

0020-gender-detector $ python b.py

The program's output to screen should be:

F: 115355620 M: 123329590
F/M baby ratio: 94


          0020-gender-detector/c.py

Count and total the number of unique names by gender since 1950

0.5 points

Print the count of unique names per gender, as well as the female-to-male ratio.

As in b.py, only read files that correspond to the year 1950 and afterwards.

The final count of names per gender should reflect the number of unique names. Even though “Michael” shows up in every data file, it should be counted exactly once. Consider using Python’s set collection to make this smooth.

The same approach as b.py, except instead of tallying total baby counts, you are simply counting all unique names over the years.

How to keep a collection of unique items? Try using the set collection. The set object has an add() method to add one object at a time. If the object already exists, it won’t expand the set’s collection.

To get the number of items in a set, just use the len() function:

tally = {'M': set(), 'F': set()}
tally['F'].add("Lisa")
tally['F'].add("Lisa")
tally['F'].add("Lisa")
print(len(tally['F']))
# 1

Here’s 60% of the solution (well, my solution):

from os.path import join, basename
from glob import glob
DATA_DIR = 'tempdata'

# unlike b.py, this simply keeps a count of "names", not total babies
tally = {'M': set(), 'F': set()}

for fname in glob(join(DATA_DIR, '*.txt')):
    # doing the filtering of filenames in one loop
    year = basename(fname)[3:7]
    if year >= "1950":
        for line in open(fname, 'r'):
            name, gender, babies = line.split(',')
            tally[gender].add(name)

Expectations

When you run c.py from the command-line:

0020-gender-detector $ python c.py

The program's output to screen should be:

F:  60480 M:  36345
F/M name ratio: 166


          0020-gender-detector/d.py

For each year since 1950, count and print the number of unique names by sexes

0.5 points

Print the count of unique names per gender, as well as the female-to-male ratio.

As in c.py, only read files that correspond to the year 1950 and afterwards.

The output should be seen for every year, not just one big lump sum as in c.py.

If this sounds almost exactly like c.py, that’s intentional. Basically, just shift where the print statements take place – if you want them to print for every year, well, they have to be inside that for-loop.

In fact, every print statement should be in a for-loop.

The tally object, i.e. the dictionary of sets, can be used just as it was in c.py:

tally = {'M': set(), 'F': set()}

But just like the print statements, the object must be initialized with each iteration of the for-loop, i.e. with every new year file.
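
To see why the placement of the initialization matters, here is a small sketch that uses in-memory lists instead of real files. The `fake_files` data is made up for illustration; the point is only where `tally` gets created:

```python
# made-up stand-ins for the lines of two yearly files
fake_files = {
    '1950': ['Mary,F,10', 'Pat,F,5', 'Pat,M,7'],
    '1951': ['Mary,F,12', 'John,M,9'],
}

results = {}
for year in sorted(fake_files):
    # a fresh tally for each year, so names from earlier years don't carry over
    tally = {'M': set(), 'F': set()}
    for line in fake_files[year]:
        name, gender, babies = line.strip().split(',')
        tally[gender].add(name)
    results[year] = (len(tally['F']), len(tally['M']))
    print(year)
    print("F:", len(tally['F']), "M:", len(tally['M']))
```

If `tally` were created above the year-loop instead, the 1951 counts would also include the names seen in 1950.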

Expectations

When you run d.py from the command-line:

0020-gender-detector $ python d.py

The program's first 6 lines of output to screen should be:

1950
F:   6112 M:   4195
F/M name ratio: 146
1951
F:   6213 M:   4247
F/M name ratio: 146

The program's last 6 lines of output to screen should be:

2013
F:  19191 M:  14012
F/M name ratio: 137
2014
F:  19067 M:  13977
F/M name ratio: 136


          0020-gender-detector/e.py

Count the number of unique names and sum the baby counts for the year 2014

0.5 points

For the year 2014, i.e. file tempdata/yob2014.txt, total the number of unique names and baby counts.

At the end of the program, print the number of unique names and baby counts for:

M and F total
Just M
Just F

There are countless ways to do this, and the variations among them are not particularly important.

Below is my complete answer with rambling comments. If you follow it strictly, please at least don’t copy the comments.

from os.path import join, basename
# we're only dealing with one file, i.e. yob2014.txt
# but it's worth storing its "year" value in a variable just to abstract things
YEAR = 2014
DATA_DIR = 'tempdata'
thefilename = join(DATA_DIR, 'yob' + str(YEAR) + '.txt')

names_dict = {}
thefile =  open(thefilename, 'r')
for line in thefile:
    name, gender, count = line.split(',')
    # Every name can show up twice for a year, as M or F
    # but some names only show up for either M or F
    # that means we need to initialize a new value
    # for a given name in names_dict if it doesn't already exist
    if not names_dict.get(name):
        # i.e. names_dict does not yet have `name` as a valid key
        # so we make it a valid key by initializing it and pointing it to
        # a dictionary that we can add values to
        names_dict[name] = {'M': 0, 'F': 0}

    # Now that names_dict[name] is itself a dictionary, {'M': 0, 'F': 0}
    # we can safely add the `count` variable to it
    #  e.g. names_dict['Jennifer']['F'] += int("24222")
    names_dict[name][gender] += int(count)
# at this point, when the for loop is done
# we're done reading from the file, so we can close it
thefile.close()

# names_dict now contains a dict of dicts:
# {
#    'Jennifer': {'F': 24222, 'M': 32},
#    'Amanda':   {'F': 10000, 'M': 0 },
#    'John':     {'F': 12,    'M': 12000}
# }


############################################################
# Now it's time to print things...
# The year 20YY has XXXXX unique names for ZZZZ total babies

# which means we need to get the total number of baby names,
# which is simply a len() call on the keys of names_dict
total_namecount = len(names_dict.keys())

# and then we need to get the total baby count...here's one straightforward
# way to do it:
total_babycount = 0
for v in names_dict.values():
    totes = v['M'] + v['F'] # count up males and females
    # and add it to the total_babycount
    total_babycount += totes

# or, you could've done this:
# ...sum(v['F'] + v['M'] for v in names_dict.values())

print("Total:", total_namecount, 'unique names for', total_babycount, 'babies')

# now we do the same thing, except for just boys and their names
ncount = 0
bcount = 0
for v in names_dict.values():
    # don't count it as a boy name if no babies were actually given the name
    if v['M'] > 0:
        bcount += v['M']
        ncount += 1
print("    M:", ncount, "unique names for", bcount, "babies")


# now we do the same thing, except for just girls and their names
ncount = 0
bcount = 0
for v in names_dict.values():
    # don't count it as a girl name if no babies were actually given the name
    if v['F'] > 0:
        bcount += v['F']
        ncount += 1
print("    F:", ncount, "unique names for", bcount, "babies")


# or if you wanted to be thrifty and not repeat yourself, do a for loop:
# for gender in ['M', 'F']:
#     ncount = 0
#     bcount = 0
#     for v in names_dict.values():
#         # don't count it as a name for this gender if no babies were actually given the name
#         if v[gender] > 0:
#             bcount += v[gender]
#             ncount += 1
#     print("    %s:" % gender, ncount, "unique names for", bcount, "babies")
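
As an aside, the "initialize the name if it doesn't exist yet" step can also be handled by `collections.defaultdict` from the standard library. This is an alternative sketch, not the approach used in the answer above, and the sample lines are made up:

```python
from collections import defaultdict

# made-up sample lines in the same name,gender,count format
sample_lines = ['Jennifer,F,24222', 'Jennifer,M,32', 'Amanda,F,10000']

# every new name automatically starts out as {'M': 0, 'F': 0}
names_dict = defaultdict(lambda: {'M': 0, 'F': 0})
for line in sample_lines:
    name, gender, count = line.strip().split(',')
    names_dict[name][gender] += int(count)

total_babycount = sum(v['M'] + v['F'] for v in names_dict.values())
print(len(names_dict), 'unique names for', total_babycount, 'babies')
```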

Expectations

When you run e.py from the command-line:

0020-gender-detector $ python e.py

The program's output to screen should be:

Total: 30579 unique names for 3670151 babies
    M: 13977 unique names for 1901376 babies
    F: 19067 unique names for 1768775 babies


          0020-gender-detector/f.py

Print the number of babies per name, for every five years since 1950.

0.5 points

Just like exercise e.py, except in a for-loop. It’s probably not exactly as easy as just copying and pasting the answer from e.py into this one, but it’s almost that…

The answer is a little different too: divide the total number of babies by the number of names, to get a baby per name ratio.

Also, note that instead of looking at every year, we’re actually looking at every 5 years. This is how you can use range() to accommodate that:

START_YEAR = 1950
END_YEAR = 2015

for year in range(START_YEAR, END_YEAR, 5):
    # etc. etc.
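
Keep in mind that range() stops before its second argument, which is why END_YEAR is 2015 rather than 2014. A quick check of which years the loop actually visits:

```python
START_YEAR = 1950
END_YEAR = 2015

years = list(range(START_YEAR, END_YEAR, 5))
# first year, last year, and how many years in total
print(years[0], years[-1], len(years))
```

The last year visited is 2010, which matches the final block of the expected output.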

Here’s one way to print the number of babies per name ratio at the end, the total followed by the breakdown for both genders (this comes at the end, so come up with your own variables):

print("Total:", round(total_babycount / total_namecount), 'babies per name')
# for boys and girls separately
for gd in ['M', 'F']:
    babyct = 0
    namect = 0
    for v in names_dict.values():
        if v[gd] > 0:
            babyct += v[gd]
            namect += 1
    print("    %s:" % gd, round(babyct / namect), 'babies per name')

Expectations

When you run f.py from the command-line:

0020-gender-detector $ python f.py

The program's first 8 lines of output to screen should be:

1950
Total: 378 babies per name
    M: 427 babies per name
    F: 280 babies per name
1955
Total: 401 babies per name
    M: 469 babies per name
    F: 291 babies per name

The program's last 8 lines of output to screen should be:

2005
Total: 127 babies per name
    M: 149 babies per name
    F: 96 babies per name
2010
Total: 117 babies per name
    M: 134 babies per name
    F: 90 babies per name

Some takeaways from this exercise:

It seems that the variety of names has vastly expanded since 1950, for both boys and girls.


          0020-gender-detector/g.py

Reshape the 2014 babynames file so that it is optimized for use in a gender-detecting program.

0.5 points

Your program must read the data in tempdata/yob2014.txt and “wrangle” it into a far more usable dataset and save it as tempdata/wrangled2014.csv

The resulting file must contain these headers:

year
name
gender
ratio
females
males
total

And the data rows must be sorted as:

in descending order of the total baby count
as a tiebreaker, in ascending alphabetical order of the name

More specifically, we want to turn this:

Emma,F,20799
Olivia,F,19674
Sophia,F,18490
Isabella,F,16950
Ava,F,15586

Into:

year,name,gender,ratio,females,males,total
2014,Emma,F,100,20799,12,20811
2014,Olivia,F,100,19674,22,19696
2014,Noah,M,99,106,19144,19250
2014,Sophia,F,100,18490,17,18507

What’s the difference?

For starters, our result data file will now have headers – something the SSA has neglected to include, which prevents the data from being ready-to-use in a spreadsheet.

Second, we’ve added columns that will be useful to our ultimate purposes. For example, we want to classify a person’s gender based on the traditional gender perception of their name. For the name “Leslie” in the 2014 data, this means looking at these two rows in yob2014.txt:

      Leslie,F,994
      Leslie,M,61

And then comparing the number of male babies versus female babies to determine the “likely” gender of the name “Leslie”. The math doesn’t have to be that complicated: 994 is more than 61 – so we classify “Leslie” as female because far more females were named “Leslie” than males, Leslie Nielsen notwithstanding.

We’re also interested in how big the gap between the male and female counts is. Here’s a simple metric: the majority gender’s count as a percentage of the total number of babies:

       100 * (994 / (994 + 61)) = 94.2

We can think of this as expressing that for any given person named “Leslie”, they are 94.2% likely to be female based on Social Security Administration trends.
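
The same arithmetic as a small function, just to make the two steps (pick the majority gender, then compute its share) explicit. The function name `majority_ratio` is made up for this sketch, and ties are given to 'F' only as an arbitrary choice:

```python
def majority_ratio(females, males):
    """Return (likely_gender, percentage) for a pair of baby counts."""
    total = females + males
    if females >= males:
        return 'F', 100 * females / total
    return 'M', 100 * males / total

# the "Leslie" numbers from yob2014.txt
gender, ratio = majority_ratio(994, 61)
print(gender, round(ratio, 1))
```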

By reshaping the raw data this way, we make it much easier for anyone to import our work into a spreadsheet. It’s nice having granular data in the way that SSA provides us, but not when doing analyses.

Data-wrangling, which is often what people think of when they think of data-cleaning, is one of the most difficult programmatic tasks, in the sense that naming things and cache invalidation are difficult in computer science. There’s not just one way to do it, and there’s not one clear, absolutely superior goal.

To make things easy, I’ve set a relatively simple and straightforward goal. It certainly has its flaws, which may become evident when trying to use it in real-world analysis. But creating it is relatively straightforward, at least in my muddled mind.

That said, to reduce confusion, I’ll provide my answer, which is a bit verbose, but hopefully makes it clear that this is just the same kind of data manipulation and handling we’ve done before in Python.

The start

Nothing fancy – for this exercise, we’ll be working only with yob2014.txt, with the knowledge that if we can deal with one file, we can deal with every file as we please:

from os.path import join, basename
import csv
DATA_DIR = 'tempdata'
YEAR = 2014
thefilename = join(DATA_DIR, 'yob' + str(YEAR) + '.txt')

Setting up the wrangled file

Let’s create a constant for the new file we’ll be making. In fact, let’s create another constant that stores the list of column names this new file will have:

WRANGLED_HEADERS = ['year', 'name', 'gender' , 'ratio' , 'females', 'males', 'total']
WRANGLED_DATA_FILENAME = join(DATA_DIR, 'wrangled2014.csv')

Gathering up the name data

This step is exactly the same as it is for every previous exercise: for every name, collect the number of babies by gender.

namesdict = {}
with open(thefilename, 'r') as thefile:
    for line in thefile:
        name, gender, count = line.split(',')
        if not namesdict.get(name): # need to initialize a new dict for the name
            namesdict[name] = {'M': 0, 'F': 0}
        namesdict[name][gender] += int(count)

Let’s make a list

The object namesdict has been perfectly serviceable as a dictionary of dicts; in fact, we could probably continue using it without too much trouble. If you’ve been following along in Interactive Python, you can inspect it and see something like this:

{
 'Taytem': {'F': 9, 'M': 0},
 'Favour': {'F': 19, 'M': 0},
 'Yitzchok': {'F': 0, 'M': 119},
 'Daymon': {'F': 0, 'M': 25}
}

However, what we eventually need is a collection of dictionaries with different attribute names and more attributes.

So I’ve opted to just create a new list and then fill it with dictionary objects that have all the headers and values we need.

It starts like this:

my_awesome_list = []

Each dictionary we want to add to my_awesome_list will contain values derived from each key-value pair in namesdict.

So, basically a for-loop:

for name, counts in namesdict.items():
    xdict = {}
    xdict['year'] = YEAR # i.e. 2014
    xdict['name'] = name
    xdict['females'] = counts['F']
    xdict['males'] = counts['M']
    xdict['total'] = xdict['males'] + xdict['females']
    # the "likely" gender is determined by comparing females vs males numbers
    if xdict['females'] >= xdict['males']:
        xdict['gender'] = 'F'
        xdict['ratio'] = round(100 * xdict['females'] / xdict['total'])
    else:
        xdict['gender'] = 'M'
        xdict['ratio'] = round(100 * xdict['males'] / xdict['total'])

    # finally, add our new dict, xdict, to my_awesome_list
    my_awesome_list.append(xdict)

There are some questions worth asking. For example, we used to track the count of female and male babies with the F and M keys, like this:

{'Daniel': {'F': 100, 'M': 9000}}

Why the change to females and males?

{'name': 'Daniel', 'females': 100, 'males': 9000}

It’s a matter of opinion. But remember that we’re creating a new data file. And when a user comes upon it, what’s going to make more sense when looking at the headers:

  name,F,M
  Daniel,100,9000

Or:

  name,females,males
  Daniel,100,9000

It requires a little more work in organizing the data, but part of data-wrangling is producing a more useful public face that may not have been necessary when initially working with the data.

Creating a new file and using csv.DictWriter()

This is basically just a pattern you memorize: when serializing a list of dictionaries as a flat CSV file, you use csv.DictWriter, that’s all. That said, it took me quite a few tries to memorize it. Hopefully it’s clear why we created easy-to-remember constants for WRANGLED_DATA_FILENAME and WRANGLED_HEADERS:

# let's create the new file to write to
wfile = open(WRANGLED_DATA_FILENAME, 'w')
# turn it into a DictWriter object, and tell it what the fieldnames are
wcsv = csv.DictWriter(wfile, fieldnames=WRANGLED_HEADERS)
# write the headers row
wcsv.writeheader()

Sorting the data before writing it

Oh, but we can’t write the actual data rows just yet… As one more pain-in-the-butt requirement, we have to sort the rows by the total column (in descending order), then by the name column. Do it how you like; this is how I did it:

def xfoo(xdict):
    # and return a tuple of negative total, and normal name
    return (-xdict['total'], xdict['name'])

my_final_list = sorted(my_awesome_list, key=xfoo)
for row in my_final_list:
    wcsv.writerow(row)
# the end...close the file
wfile.close()
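
If the tuple trick in the key function seems mysterious, here it is in isolation on a few made-up rows. Python compares tuples element by element, so the negative total decides the order first and the name only matters when totals tie:

```python
# made-up rows; Avery and Blake tie on total, so the name decides their order
rows = [
    {'name': 'Blake', 'total': 50},
    {'name': 'Emma', 'total': 300},
    {'name': 'Avery', 'total': 50},
]

def by_total_then_name(row):
    # negative total puts bigger totals first; name stays in ascending order
    return (-row['total'], row['name'])

ordered = [row['name'] for row in sorted(rows, key=by_total_then_name)]
print(ordered)
```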

Print the first five lines of text

Just to make sure that we’ve produced the file we want, this exercise asks us to re-open the text file, not to parse it into data, but just to print the first five lines as plain text. This works:

finalfile = open(WRANGLED_DATA_FILENAME, 'r')
thestupidlines = finalfile.readlines()[0:5]
for line in thestupidlines:
    # remember each text line has a newline character
    # that we don't want to print out for aesthetic reasons
    print(line.strip())

And that should get you to the desired output.

Expectations

When you run g.py from the command-line:

0020-gender-detector $ python g.py

The program's output to screen should be:

year,name,gender,ratio,females,males,total
2014,Emma,F,100,20799,12,20811
2014,Olivia,F,100,19674,22,19696
2014,Noah,M,99,106,19144,19250
2014,Sophia,F,100,18490,17,18507

The program creates this file path: tempdata/wrangled2014.csv


          0020-gender-detector/h.py

Print the 5 most popular names in 2014 that are relatively gender-ambiguous.

0.5 points

Print a list of the 5 most popular baby names in 2014 that skewed no more than 60% towards either male or female, i.e. a ratio of less than or equal to 60.

This exercise depends on you having created tempdata/wrangled2014.csv in the previous exercise.

Well, you don’t have to have made that wrangled file – obviously you can copy-and-paste all the code you used to generate that file into this script. The Python interpreter doesn’t really care…but you – and anyone else who has to deal with your code – will care.

Having one script create and package a file for other scripts to use is a very common pattern.

“Most popular baby names” means sorting the list by total, then filtering for names in which the gender ratio is less than or equal to 60%.

(See the answer on Github)
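
For a feel of the overall shape (deserialize, convert, filter, sort, print), here is a rough sketch that runs against a tiny made-up CSV held in memory rather than the real tempdata/wrangled2014.csv. The names and numbers are invented, so don't expect them to match the real output:

```python
import csv
from io import StringIO

# made-up data in the same layout as wrangled2014.csv
fake_csv = StringIO("""year,name,gender,ratio,females,males,total
2014,Aaa,F,100,500,0,500
2014,Bbb,M,55,180,220,400
2014,Ccc,F,58,174,126,300
2014,Ddd,M,90,20,180,200
""")

datarows = list(csv.DictReader(fake_csv))
for row in datarows:
    # csv gives us text, so convert the numeric columns
    for key in ('ratio', 'females', 'males', 'total'):
        row[key] = int(row[key])

ambiguous = [row for row in datarows if row['ratio'] <= 60]
ambiguous = sorted(ambiguous, key=lambda row: -row['total'])
for row in ambiguous[:5]:
    print(row['name'].ljust(10), row['gender'], row['ratio'], row['total'])
```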

Expectations

When you run h.py from the command-line:

0020-gender-detector $ python h.py

The program's output to screen should be:

Most popular names with <= 60% gender skew:
Charlie    M 54 3102
Dakota     F 56 2012
Skyler     F 54 1981
Phoenix    M 59 1530
Justice    F 59 1274


          0020-gender-detector/i.py

Print a breakdown of popular names in 2014 by gender ambiguity.

0.5 points

This exercise depends on you having created tempdata/wrangled2014.csv previously.

For each of the gender ratio breakpoints 60, 70, 80, 90, and 99:

Print the number and percentage of popular baby names (names given to at least 100 total babies in 2014) that have a gender ratio less than or equal to the given break point.

For example, for the breakpoint of 70%, find and count all baby names in which the ratio is 70% or less toward one gender or another.

All of the code from h.py that is used to deserialize the data into a list of dictionaries can be reused here.

You have one additional data processing step: filtering that list to include only names that have 100 or more total babies. You should be able to figure this out:

    bigdatarows = []
    for row in datarows:
        if SOMETHINGSOMETHING:
            bigdatarows.append(row)

The number of “popular” names in 2014 is simply len(bigdatarows)

As for what you need to print out, the process is about the same as it was in the previous exercise. But you start off with a for-loop:

    print("Popular names in 2014 with gender ratio less than or equal to:")
    for genderratio in (60, 70, 80, 90, 99):

Expectations

When you run i.py from the command-line:

0020-gender-detector $ python i.py

The program's output to screen should be:

Popular names in 2014 with gender ratio less than or equal to:
  60%: 64/3495
  70%: 139/3495
  80%: 214/3495
  90%: 381/3495
  99%: 953/3495

Some takeaways from this exercise:

There don’t seem to be many names that fall in what we might have assumed to be “ambiguous”. In fact, only 953 names out of 3,495 – less than a third – of the popular babynames are at the 99% or below threshold…which means that more than two-thirds of the popular names are essentially 100% for one gender or the other.


          0020-gender-detector/j.py

Aggregate a number of data files from 1950 to 2014, then reshape it for use in a gender-detecting program.

0.5 points

Very much the same as g.py except done over multiple files:

Include each file from 1950 to 2014, in 10 year intervals, i.e. 1950, 1960, 1970, etc
Include the 2014 file
Before reading each file, print to screen: "Parsing NAME_OF_FILE" just so you know you’re reading the right files
As in g.py, create a new CSV file, but name it tempdata/wrangledbabynames.csv
As in g.py, print the first 5 lines of the new CSV file.

This is barely an exercise. It’s meant to serve as another example of how, once you get something working once, there’s no reason why you can’t apply the same operation across many values or files. So why restrict ourselves to wrangling just the 2014 file when we can literally do it for every other baby name data file?

The beginning of this script looks very much the same as it did in g.py, though note we’re saving to a different file name, tempdata/wrangledbabynames.csv:

from os.path import join, basename
import csv
DATA_DIR = 'tempdata'
# as before, we create new headers for our wrangled file
# though we leave out year because we probably don't care about it for our ultimate needs
WRANGLED_HEADERS = ['name', 'gender' , 'ratio' , 'females', 'males', 'total']
WRANGLED_DATA_FILENAME = join(DATA_DIR, 'wrangledbabynames.csv')

Since we’re reading from multiple years, we need to create a list of numbers, starting from 1950 and ending at 2014.

This is how to produce the list of numbers using a range:

for year in range(1950, 2014, 10):
    print(year)

Note that it stops before 2014, so we just have to add that manually to the list:

START_YEAR = 1950
END_YEAR = 2014
# let's just get a list of all decades, between 1950 and 2014:
years = list(range(START_YEAR, END_YEAR, 10))
# and let's tack on the END_YEAR manually:
years.append(END_YEAR)

Also note that the interval and number of years is arbitrary. You could just as easily aggregate every file since 1880 if you wished.

After that, things are largely the same. We loop through every year, but the namesdict collection stays “above” it all because it is counting up names for every file:

namesdict = {}
for year in years:
    # get the file for this particular year
    filename = join(DATA_DIR, 'yob' + str(year) + '.txt')
    print("Parsing", filename)
    # fill in the rest yourself

Similarly, we wait till that year-loop is finished before creating my_awesome_list, which contains the same stuff as namesdict.

my_awesome_list = []
# just the same as it was for g.py, except no "year"
for name, babiescount in namesdict.items():
    xdict = {'name': name, 'females': babiescount['F'], 'males': babiescount['M']}

In fact, everything here on out should be pretty much the same as g.py.

Remember, you’re just doing what you did to the 2014 file, except to a bunch of files. That includes aggregating it into a single list at the end.

Expectations

When you run j.py from the command-line:

0020-gender-detector $ python j.py

The program's output to screen should be:

Parsing tempdata/yob1950.txt
Parsing tempdata/yob1960.txt
Parsing tempdata/yob1970.txt
Parsing tempdata/yob1980.txt
Parsing tempdata/yob1990.txt
Parsing tempdata/yob2000.txt
Parsing tempdata/yob2010.txt
Parsing tempdata/yob2014.txt
name,gender,ratio,females,males,total
Michael,M,100,1963,433277,435240
James,M,100,1391,342651,344042
David,M,100,1139,330092,331231
John,M,100,1095,320621,321716

The program creates this file path: tempdata/wrangledbabynames.csv


          0020-gender-detector/k.py

Make a function that uses the CSV data to analyze a name, then analyze a list of names

0.5 points

Write a function that given a name, e.g. "Mary", returns a dictionary that represents the record for the name "Mary", based on the data in tempdata/wrangledbabynames.csv.

Then use that function to print out the likely gender and ratio for each of the following names:

Michael
Kelly
Kanye
THOR
casey
Arya
ZZZblahblah

The search for the name should be case-insensitive, i.e. return the records for "Thor" and "Casey", respectively, when the values "THOR" and "casey" are passed in.

For names that have no valid record, return the following dictionary:

{ 'name': "whateverthenamewas",
  'gender': 'NA',
  'ratio': None,
  'males': None,
  'females': None,
  'total': 0
}
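
One possible way to handle both the case-insensitive search and the "no valid record" fallback is sketched below. The function name `find_name` and the sample records are made up for illustration; in the real script the rows would come from tempdata/wrangledbabynames.csv:

```python
# made-up records, shaped like rows already read from wrangledbabynames.csv
datarows = [
    {'name': 'Thor', 'gender': 'M', 'ratio': 100, 'males': 90, 'females': 0, 'total': 90},
    {'name': 'Casey', 'gender': 'M', 'ratio': 59, 'males': 59, 'females': 41, 'total': 100},
]

def find_name(name, rows):
    """Case-insensitive search; returns the NA record when nothing matches."""
    for row in rows:
        if row['name'].lower() == name.lower():
            return row
    return {'name': name, 'gender': 'NA', 'ratio': None,
            'males': None, 'females': None, 'total': 0}

for n in ['THOR', 'casey', 'ZZZblahblah']:
    rec = find_name(n, datarows)
    print(n, rec['gender'], rec['ratio'])
```

Lowercasing both sides of the comparison is what makes "THOR" and "casey" find their records.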

At the end of the script, print the total tally of names by gender:

    Total:
    F: 2  M: 4  NA: 1

And the cumulative number of babies:

    females: 62045 males: 454031

Expectations

When you run k.py from the command-line:

0020-gender-detector $ python k.py

The program's output to screen should be:

Michael M 100
Kelly F 86
Kanye M 100
THOR M 100
casey M 59
Arya F 88
ZZZblahblah NA None
Total:
F: 2 M: 4 NA: 1
females: 62045 males: 454031


          0020-gender-detector/m.py

Convert wrangledbabynames.csv to wrangledbabynames.json

0.5 points

What’s the difference between storing data as CSV versus JSON?

Find out for yourself. Open, read, and deserialize tempdata/wrangledbabynames.csv as in previous exercises, including converting the appropriate fields ('total', 'ratio', etc.) to numbers.

Then use json.dumps() to serialize it as text and save to:

    tempdata/wrangledbabynames.json

Then print the number of characters in the CSV file, followed by the number of characters in the new JSON file. Finally, print the ratio of the JSON’s character count to the CSV’s character count.
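
As a rough illustration of the steps above, the following sketch converts a two-row sample held in memory. The csv_text string stands in for the contents of tempdata/wrangledbabynames.csv, and indent=2 is an assumption about how the JSON is formatted; a real m.py would read the CSV file and write json_text to tempdata/wrangledbabynames.json.

```python
import csv
import io
import json

# Hypothetical stand-in for the text of tempdata/wrangledbabynames.csv
csv_text = """name,gender,ratio,females,males,total
Michael,M,100,1963,433277,435240
James,M,100,1391,342651,344042
"""

# Deserialize the CSV and convert the numeric fields from strings to ints
records = []
for row in csv.DictReader(io.StringIO(csv_text)):
    record = dict(row)
    for field in ('ratio', 'females', 'males', 'total'):
        record[field] = int(record[field])
    records.append(record)

# Serialize the list of dictionaries as JSON text (indent=2 is an assumption)
json_text = json.dumps(records, indent=2)

print("CSV has", len(csv_text), "characters")
print("JSON has", len(json_text), "characters")
print("JSON/CSV character ratio: %.1f" % (len(json_text) / len(csv_text)))
```

The character counts for this tiny sample will differ from the full-file figures, but the JSON text should already come out several times longer than the CSV text.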

Expectations

When you run m.py from the command-line:

0020-gender-detector $ python m.py

The program's output to screen should be:

CSV has 1214044 characters
JSON has 6801478 characters
JSON requires 4.6 times more text characters than CSV

The program creates this file path: tempdata/wrangledbabynames.json

Some takeaways from this exercise:

If you inspect the resulting JSON file, you’ll see that each record looks like this:
```
  {
    "males": 433277,
    "ratio": 100,
    "females": 1963,
    "name": "Michael",
    "total": 435240,
    "gender": "M"
  },
  {
    "males": 342651,
    "ratio": 100,
    "females": 1391,
    "name": "James",
    "total": 344042,
    "gender": "M"
  }
```
Notice how the number values are unquoted. That’s part of JSON’s spec: the ability to define values other than text strings, which is what CSV is limited to:
```
  name,gender,ratio,females,males,total
  Michael,M,100,1963,433277,435240
  James,M,100,1391,342651,344042
```
On the other hand, the CSV version of the data is much more compact. The JSON format requires the attribute names (i.e. the column headers) to be repeated for every record.

That is why the JSON file ends up at about 5.6 times the size of the CSV file (6,801,478 versus 1,214,044 characters), i.e. 4.6 times more text.
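
Both takeaways can be checked on a single record. The snippet below is a small sketch using hand-written CSV and JSON strings for the "Michael" row (not the real files): values read from the CSV come back as strings, while JSON preserves the number type, at the cost of spelling out every key inside every record.

```python
import csv
import io
import json

# One record in each format, written by hand for illustration
csv_text = "name,gender,ratio,females,males,total\nMichael,M,100,1963,433277,435240\n"
json_text = ('[{"name": "Michael", "gender": "M", "ratio": 100, '
             '"females": 1963, "males": 433277, "total": 435240}]')

csv_row = next(csv.DictReader(io.StringIO(csv_text)))
json_row = json.loads(json_text)[0]

print(type(csv_row['total']).__name__)    # str
print(type(json_row['total']).__name__)   # int

# The key "total" is written once per record in JSON,
# whereas the CSV states it a single time in the header row
print(json_text.count('"total"'), csv_text.count('total'))
```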


          0020-gender-detector/n.py

Make a function that uses the JSON data to analyze a name, then analyze a list of names

0.5 points

Same as k.py, except you’ll be reading from: tempdata/wrangledbabynames.json

Create a new Python script file named n.py

In it, create a function named detect_gender() that operates almost identically to detect_gender_from_csv in k.py, except that it reads from the JSON file. It should need less code because you no longer have to manually convert strings of numbers to actual numbers, as you did for the 'total' and 'ratio' fields when reading the CSV.
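
For comparison, the CSV version needs a conversion step along the lines of `row['total'] = int(row['total'])` for each numeric field, while a JSON-backed version can skip it entirely. A minimal sketch follows, assuming a two-argument detect_gender(name, records) signature and an in-memory SAMPLE_JSON string in place of tempdata/wrangledbabynames.json; both are illustrative choices.

```python
import json

# Hypothetical stand-in for the contents of tempdata/wrangledbabynames.json
SAMPLE_JSON = """[
  {"name": "Michael", "gender": "M", "ratio": 100,
   "females": 1963, "males": 433277, "total": 435240}
]"""


def detect_gender(name, records):
    """Return the record matching name (case-insensitive), or the 'NA' placeholder."""
    for record in records:
        if record['name'].lower() == name.lower():
            return record
    return {'name': name, 'gender': 'NA', 'ratio': None,
            'males': None, 'females': None, 'total': 0}


# json.loads() already yields ints for the numeric fields,
# so no manual conversion is needed here
records = json.loads(SAMPLE_JSON)

for name in ['michael', 'ZZZblahblah']:
    result = detect_gender(name, records)
    print(name, result['gender'], result['ratio'])
```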

Expectations

When you run n.py from the command-line:

0020-gender-detector $ python n.py

The program's output to screen should be:

Michael M 100
Kelly F 86
Kanye M 100
THOR M 100
casey M 59
Arya F 88
ZZZblahblah NA None
Total:
F: 2 M: 4 NA: 1
females: 62045 males: 454031
