The Checklist
In your compciv-2016 Git repository create a subfolder and name it:
exercises/0020-gender-detector
The folder structure will look like this (not including any subfolders such as `tempdata/`):
compciv-2016
└── exercises
    └── 0020-gender-detector
        ├── a.py
        ├── b.py
        ├── c.py
        ├── d.py
        ├── e.py
        ├── f.py
        ├── g.py
        ├── h.py
        ├── i.py
        ├── j.py
        ├── k.py
        ├── m.py
        └── n.py
Background information
Answers and examples data
Check out the Github repo for answers and example data.
The power of automated methods
Surveys and human coding will never be able to function in real-time or at scale (the Global Media Monitoring project took 5 years to analyze 16,000 media items). To do that, we turn to automated methods. The simplest approach is to use historical birth records to estimate the likely sex of a first name.
The Exercises
0020-gender-detector/a.py » Download all of the baby names data from the Social Security Administration
Download the zip file here:
https://www.ssa.gov/oact/babynames/names.zip
Save into tempdata, as in tempdata/names.zip
Unzip it into tempdata – it should unpack 135 text files.
The hints section contains virtually all of the code; you just have to write the final line (the one that counts the number of text files).
import requests
from os import makedirs
from os.path import join
from shutil import unpack_archive
from glob import glob
SOURCE_URL = 'https://www.ssa.gov/oact/babynames/names.zip'
DATA_DIR = 'tempdata'
DATA_ZIP_PATH = join(DATA_DIR, 'names.zip')
# make the directory
makedirs(DATA_DIR, exist_ok=True)
print("Downloading", SOURCE_URL)
resp = requests.get(SOURCE_URL)
# save it to disk
# we use 'wb' because these are BYTES
with open(DATA_ZIP_PATH, 'wb') as f:
    # we use resp.content because it is BYTES
    f.write(resp.content)
# now let's unzip it into tempdata/
unpack_archive(DATA_ZIP_PATH, extract_dir=DATA_DIR)
# get all the filenames
babynamefilenames = glob(join(DATA_DIR, '*.txt'))
When you run a.py from the command-line:
0020-gender-detector $ python a.py
- The program's output to screen should be:
Downloading https://www.ssa.gov/oact/babynames/names.zip
There are 135 txt files
- The program creates this file path:
tempdata/names.zip
- The program creates this file path:
tempdata/yob2014.txt
- The program accesses this remote file: https://www.ssa.gov/oact/babynames/names.zip
0020-gender-detector/b.py » Count the total number of babies since 1950 by gender.
- Use the glob.glob() function to get a list of filenames in a given directory
- Filter that globbed list to include only files from the years 1950 onward
- Total up the babies by 'F' and 'M'
- Print a message that lists the total babies for F and M
- Then print the F/M ratio of the total baby count
You can glob for filenames like this:
from os.path import join, basename
from glob import glob
DATA_DIR = 'tempdata'
alltxtfiles_path = join(DATA_DIR, '*.txt')
alltxtfiles_names = glob(alltxtfiles_path)
Filtering to include only the data files from 1950 onward is a little tricky.
But we can use the filenames themselves to our advantage. Remember, they’re just strings.
And they look like this:
- tempdata/yob1949.txt
- tempdata/yob1950.txt
- tempdata/yob1951.txt
If you know how to use regular expressions, go to town. Otherwise, consider this step by step deconstruction of the filename:
from os.path import basename
a_filename = 'tempdata/yob1951.txt'
bname = basename(a_filename)
# bname is now: yob1951.txt
# the first digit is at the index position of 3
# and we want the next 4 characters, so....
year = bname[3:7]
# bingo: year is the string "1951"
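For readers who do want the regular-expression route, here is a minimal sketch. The helper name `extract_year` and the pattern are my own illustration, assuming every data file is named like `yobYYYY.txt`:

```python
import re
from os.path import basename

# Hypothetical pattern: "yob", four digits, then ".txt"
YEAR_PATTERN = re.compile(r'^yob(\d{4})\.txt$')

def extract_year(filename):
    """Return the year as an int, or None if the name doesn't match."""
    match = YEAR_PATTERN.match(basename(filename))
    if match:
        return int(match.group(1))
    return None

print(extract_year('tempdata/yob1951.txt'))   # 1951
print(extract_year('tempdata/names.zip'))     # None
```

Unlike slicing, the pattern also rejects files that merely happen to sit in the same folder, which may or may not matter for your setup.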
Basically, you want to filter those files you previously globbed (i.e. alltxtfiles_names) by testing if the year component of the file name is greater-than-or-equal-to 1950. If you don’t convert it to an integer, the comparison has to be against the string “1950”:
myfilenames = []
for fname in alltxtfiles_names:
    bname = basename(fname) # e.g. "yob1980.txt"
    year = bname[3:7] # e.g. "1980"
    if year >= "1950":
        myfilenames.append(fname)
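The string comparison works because every year in these filenames has exactly four digits, so alphabetical order and numeric order agree. If you would rather compare numbers, here is a sketch of the integer variant; the hard-coded `sample_filenames` list is a stand-in for the globbed list, not real output:

```python
from os.path import basename

# Stand-in for the globbed list; these paths are illustrative only
sample_filenames = ['tempdata/yob1949.txt', 'tempdata/yob1950.txt', 'tempdata/yob1951.txt']

recent_filenames = []
for fname in sample_filenames:
    year = int(basename(fname)[3:7])   # e.g. 1950, as a number
    if year >= 1950:
        recent_filenames.append(fname)

print(recent_filenames)
# ['tempdata/yob1950.txt', 'tempdata/yob1951.txt']
```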
Now, do the counting of baby names using the filtered list in myfilenames
Here’s one approach:
totalsdict = {'M': 0, 'F': 0}
for fname in myfilenames:
    babyfile = open(fname, "r")
    for line in babyfile:
        name, gender, babies = line.split(',')
        # need to convert babies to a number before adding
        totalsdict[gender] += int(babies)
# Now, totalsdict contains two keys, 'M' and 'F', which both point
# to very large integers
print("F:", str(totalsdict['F']).rjust(6),
"M:", str(totalsdict['M']).rjust(6))
f_to_m_ratio = round(100 * totalsdict['F'] / totalsdict['M'])
print("F/M baby ratio:", f_to_m_ratio)
When you run b.py from the command-line:
0020-gender-detector $ python b.py
- The program's output to screen should be:
F: 115355620 M: 123329590
F/M baby ratio: 94
0020-gender-detector/c.py » Count and total the number of unique names by gender since 1950
Print the count of unique names per gender, as well as the female-to-male ratio.
As in b.py, only read files that correspond to the year 1950 and afterwards.
The final count of names per gender should reflect the number of unique names. The fact that “Michael” shows up in every data file still means it should be counted exactly once. Consider using Python’s set collection to make this smooth.
The same approach as b.py, except instead of tallying total baby counts, you are simply counting all unique names over the years.
How to keep a collection of unique items? Try using the set collection. The set object has an add() method to add one object at a time. If the object already exists, it won’t expand the set’s collection.
To get the number of items in a set, just use the len() function:
tally = {'M': set(), 'F': set()}
tally['F'].add("Lisa")
tally['F'].add("Lisa")
tally['F'].add("Lisa")
print(len(tally['F']))
# 1
Here’s 60% of the solution (well, my solution):
from os.path import join, basename
from glob import glob
DATA_DIR = 'tempdata'
# unlike b.py, this simply keeps a count of "names", not total babies
tally = {'M': set(), 'F': set()}
for fname in glob(join(DATA_DIR, '*.txt')):
    # doing the filtering of filenames in one loop
    year = basename(fname)[3:7]
    if year >= "1950":
        for line in open(fname, 'r'):
            name, gender, babies = line.split(',')
            tally[gender].add(name)
When you run c.py from the command-line:
0020-gender-detector $ python c.py
- The program's output to screen should be:
F: 60480 M: 36345
F/M name ratio: 166
0020-gender-detector/d.py » For each year since 1950, count and print the number of unique names by gender
Print the count of unique names per gender, as well as the female-to-male ratio.
As in c.py, only read files that correspond to the year 1950 and afterwards.
The output should be seen for every year, not just one big lump sum as in c.py.
If this sounds almost exactly like c.py, that’s intentional. Basically, just shift where the print statements take place – if you want them to print for every year, well, they have to be inside that for-loop.
In fact, every print statement should be in a for-loop.
The tally object, i.e. the dictionary of sets, can be used just as it was in c.py:
tally = {'M': set(), 'F': set()}
But just like the print statements, the object must be initialized with each iteration of the for-loop, i.e. with every new year file.
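To make the "initialize inside the loop" point concrete, here is a small sketch that uses made-up in-memory rows instead of the real year files. The `sample_files` dictionary and `yearly_counts` are illustrative names, not part of the exercise:

```python
# Made-up rows standing in for two year files; not real SSA data
sample_files = {
    1950: ["Mary,F,10", "Pat,F,5", "Pat,M,7"],
    1951: ["Mary,F,12", "John,M,9"],
}

yearly_counts = {}
for year, lines in sorted(sample_files.items()):
    # a fresh tally for every year, so names from earlier years don't carry over
    tally = {'M': set(), 'F': set()}
    for line in lines:
        name, gender, babies = line.strip().split(',')
        tally[gender].add(name)
    yearly_counts[year] = (len(tally['F']), len(tally['M']))
    print(year)
    print("F:", len(tally['F']), "M:", len(tally['M']))
```

If `tally` were created once above the loop, the 1951 counts would wrongly include the names seen in 1950.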
When you run d.py from the command-line:
0020-gender-detector $ python d.py
- The program's first 6 lines of output to screen should be:
1950
F: 6112 M: 4195
F/M name ratio: 146
1951
F: 6213 M: 4247
F/M name ratio: 146
- The program's last 6 lines of output to screen should be:
2013
F: 19191 M: 14012
F/M name ratio: 137
2014
F: 19067 M: 13977
F/M name ratio: 136
0020-gender-detector/e.py » Count the number of unique names and sum the baby counts for the year 2014
For the year 2014, i.e. file tempdata/yob2014.txt, total the number of unique names and baby counts.
At the end of the program, print the number of unique names and baby counts for:
- M and F total
- Just M
- Just F
There are countless ways to do this, and the variations among them are not particularly important.
Below is my complete answer with rambling comments. If you follow it strictly, please at least don’t copy the comments.
from os.path import join, basename
# we're only dealing with one file, i.e. yob2014.txt
# but it's worth storing its "year" value in a variable just to abstract things
YEAR = 2014
DATA_DIR = 'tempdata'
thefilename = join(DATA_DIR, 'yob' + str(YEAR) + '.txt')
names_dict = {}
thefile = open(thefilename, 'r')
for line in thefile:
    name, gender, count = line.split(',')
    # Every name can show up twice for a year, as M or F
    # but some names only show up for either M or F
    # that means we need to initialize a new value
    # for a given name in names_dict if it doesn't already exist
    if not names_dict.get(name):
        # i.e. names_dict does not yet have `name` as a valid key
        # so we make it a valid key by initializing it and pointing it to
        # a dictionary that we can add values to
        names_dict[name] = {'M': 0, 'F': 0}
    # Now that names_dict[name] is itself a dictionary, {'M': 0, 'F': 0}
    # we can safely add the `count` variable to it
    # e.g. names_dict['Jennifer']['F'] = int("24222")
    names_dict[name][gender] += int(count)
# at this point, when the for loop is done
# we're done reading from the file, so we can close it
thefile.close()
# names_dict now contains a dict of dicts:
# {
# 'Jennifer': {'F': 24222, 'M': 32},
# 'Amanda': {'F': 10000, 'M': 0 },
# 'John': {'F': 12, 'M': 12000}
# }
############################################################
# Now it's time to print things...
# The year 20YY has XXXXX unique names for ZZZZ total babies
# which means we need to get the total number of baby names,
# which is simply a len() call on the keys of names_dict
total_namecount = len(names_dict.keys())
# and then we need to get the total baby count...here's one straightforward
# way to do it:
total_babycount = 0
for v in names_dict.values():
    totes = v['M'] + v['F'] # count up males and females
    # and add it to the total_babycount
    total_babycount += totes
# or, you could've done this:
# ...sum(v['F'] + v['M'] for v in names_dict.values())
print("Total:", total_namecount, 'unique names for', total_babycount, 'babies')
# now we do the same thing, except for just boys and their names
ncount = 0
bcount = 0
for v in names_dict.values():
    # don't count it as a boy name if no babies were actually given the name
    if v['M'] > 0:
        bcount += v['M']
        ncount += 1
print(" M:", ncount, "unique names for", bcount, "babies")
# now we do the same thing, except for just girls and their names
ncount = 0
bcount = 0
for v in names_dict.values():
    # don't count it as a girl name if no babies were actually given the name
    if v['F'] > 0:
        bcount += v['F']
        ncount += 1
print(" F:", ncount, "unique names for", bcount, "babies")
# or if you wanted to be thrifty and not repeat yourself, do a for loop:
# for gender in ['M', 'F']:
#     ncount = 0
#     bcount = 0
#     for v in names_dict.values():
#         # don't count the name for this gender if no babies were actually given it
#         if v[gender] > 0:
#             bcount += v[gender]
#             ncount += 1
#     print(" %s:" % gender, ncount, "unique names for", bcount, "babies")
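As one of those countless variations, here is a sketch that swaps the "initialize if missing" check for `collections.defaultdict`. This is not the approach used above, just an alternative, and the `sample_lines` data is made up:

```python
from collections import defaultdict

# Made-up lines in the same "name,gender,count" layout as the SSA files
sample_lines = ["Jennifer,F,100\n", "Jennifer,M,3\n", "John,M,200\n"]

# every missing name starts out as {'M': 0, 'F': 0} automatically
names_dict = defaultdict(lambda: {'M': 0, 'F': 0})
for line in sample_lines:
    name, gender, count = line.strip().split(',')
    names_dict[name][gender] += int(count)

total_namecount = len(names_dict)
total_babycount = sum(v['M'] + v['F'] for v in names_dict.values())
print("Total:", total_namecount, 'unique names for', total_babycount, 'babies')
# Total: 2 unique names for 303 babies
```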
When you run e.py from the command-line:
0020-gender-detector $ python e.py
- The program's output to screen should be:
Total: 30579 unique names for 3670151 babies
 M: 13977 unique names for 1901376 babies
 F: 19067 unique names for 1768775 babies
0020-gender-detector/f.py » Print the number of babies per name, for every five years since 1950.
Just like exercise e.py, except in a for-loop. It’s probably not exactly as easy as just copying and pasting the answer from e.py into this one, but it’s almost that…
The answer is a little different too: divide the total number of babies by the number of names, to get a baby per name ratio.
Also, note that instead of looking at every year, we’re actually looking at every 5 years. This is how you can use range() to accommodate that:
START_YEAR = 1950
END_YEAR = 2015
for year in range(START_YEAR, END_YEAR, 5):
    # etc. etc.
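A quick way to check which years that `range()` call produces is to turn it into a list. This sketch only inspects the values, it doesn't open any files:

```python
START_YEAR = 1950
END_YEAR = 2015

years = list(range(START_YEAR, END_YEAR, 5))
print(years[:3])    # [1950, 1955, 1960]
print(years[-1])    # 2010
print(len(years))   # 13
```

The end value is excluded, so the last year visited is 2010.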
Here’s one way to print the number of babies per name ratio at the end, the total followed by the breakdown for both genders (this comes at the end, so come up with your own variables):
print("Total:", round(total_babycount / total_namecount), 'babies per name')
# for boys and girls separately
for gd in ['M', 'F']:
    babyct = 0
    namect = 0
    for v in names_dict.values():
        if v[gd] > 0:
            babyct += v[gd]
            namect += 1
    print(" %s:" % gd, round(babyct / namect), 'babies per name')
When you run f.py from the command-line:
0020-gender-detector $ python f.py
- The program's first 8 lines of output to screen should be:
1950
Total: 378 babies per name
 M: 427 babies per name
 F: 280 babies per name
1955
Total: 401 babies per name
 M: 469 babies per name
 F: 291 babies per name
- The program's last 8 lines of output to screen should be:
2005
Total: 127 babies per name
 M: 149 babies per name
 F: 96 babies per name
2010
Total: 117 babies per name
 M: 134 babies per name
 F: 90 babies per name
It seems that the variety of names has vastly expanded since 1950, for both boys and girls.
0020-gender-detector/g.py » Reshape the 2014 babynames file so that it is optimized for use in a gender-detecting program.
Your program must read the data in tempdata/yob2014.txt and “wrangle” it into a far more usable dataset and save it as tempdata/wrangled2014.csv
The resulting file must contain these headers:
- year
- name
- gender
- ratio
- females
- males
- total
And the data rows must be sorted as:
- in descending order of the total baby count
- as a tiebreaker, in ascending alphabetical order of the name
More specifically, we want to turn this:
Emma,F,20799
Olivia,F,19674
Sophia,F,18490
Isabella,F,16950
Ava,F,15586
Into:
year,name,gender,ratio,females,males,total
2014,Emma,F,100,20799,12,20811
2014,Olivia,F,100,19674,22,19696
2014,Noah,M,99,106,19144,19250
2014,Sophia,F,100,18490,17,18507
What’s the difference?
For starters, our result data file will now have headers – something the SSA has neglected to do and which prevents the data from being ready-to-use in a spreadsheet.
Second, we’ve added columns that will be useful to our ultimate purposes. For example, we want to classify a person’s gender based on the traditional gender perception of their name. For the name “Leslie” in the 2014 data, this means looking at these two rows in yob2014.txt:
Leslie,F,994
Leslie,M,61
And then comparing the number of male babies versus female babies to determine the “likely” gender of the name “Leslie”. The math doesn’t have to be that complicated: 994 is more than 61 – so we classify “Leslie” as female because far more females were named “Leslie” than males, Leslie Nielsen notwithstanding.
We’re also interested in how big the gap between the male and female counts is. Here’s a simple metric: find the ratio as determined by the majority gender versus the total number of babies:
100 * (994 / (994 + 61)) = 94.2
We can think of this as expressing that for any given person named “Leslie”, they are 94.2% likely to be female based on Social Security Administration trends.
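The same arithmetic as a small Python sketch. The function name `likely_gender` is hypothetical, and sending ties to 'F' is an arbitrary choice made here:

```python
def likely_gender(females, males):
    """Hypothetical helper: return (gender, percentage) for the majority gender."""
    total = females + males
    if females >= males:
        return 'F', 100 * females / total
    return 'M', 100 * males / total

gender, ratio = likely_gender(994, 61)
print(gender, round(ratio, 1))   # F 94.2
```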
By reshaping the raw data this way, we make it much easier for anyone to import our work into a spreadsheet. It’s nice having granular data in the way that SSA provides us, but not when doing analyses.
Data-wrangling, which is often what people think of when they think of data-cleaning, is one of the most difficult programmatic tasks, in the sense that naming things and cache invalidation are difficult in computer science. There’s not just one way to do it, and there’s not one clear, absolutely superior goal.
To make things easy, I’ve set a relatively simple and straightforward goal. It certainly has its flaws, which may become evident when trying to use it in real-world analysis. But creating it is relatively straightforward, at least in my muddled mind.
That said, to reduce confusion, I’ll provide my answer, which is a bit verbose, but hopefully makes it clear that this is just the same kind of data manipulation and handling we’ve done before in Python.
The start
Nothing fancy – for this exercise, we’ll be working only with yob2014.txt, with the knowledge that if we can deal with one file, we can deal with every file as we please:
from os.path import join, basename
import csv
DATA_DIR = 'tempdata'
YEAR = 2014
thefilename = join(DATA_DIR, 'yob' + str(YEAR) + '.txt')
Setting up the wrangled file
Let’s create a constant for the new file we’ll be making. In fact, let’s create another constant that stores the list of column names this new file will have:
WRANGLED_HEADERS = ['year', 'name', 'gender' , 'ratio' , 'females', 'males', 'total']
WRANGLED_DATA_FILENAME = join(DATA_DIR, 'wrangled2014.csv')
Gathering up the name data
This step is exactly the same as it is for every previous exercise: for every name, collect the number of babies by gender.
namesdict = {}
with open(thefilename, 'r') as thefile:
    for line in thefile:
        name, gender, count = line.split(',')
        if not namesdict.get(name): # need to initialize a new dict for the name
            namesdict[name] = {'M': 0, 'F': 0}
        namesdict[name][gender] += int(count)
Let’s make a list
The object namesdict has been perfectly serviceable as a dictionary of dicts; in fact, we could probably continue using it without too much trouble. If you’ve been following along in Interactive Python, you can inspect it and see something like this:
{
'Taytem': {'F': 9, 'M': 0},
'Favour': {'F': 19, 'M': 0},
'Yitzchok': {'F': 0, 'M': 119},
'Daymon': {'F': 0, 'M': 25}
}
However, what we eventually need is a collection of dictionaries with different attribute names and more attributes.
So I’ve opted to just create a new list and then fill it with dictionary objects that have all the headers and values we need.
It starts like this:
my_awesome_list = []
Each dictionary we want to add to my_awesome_list will contain values derived from each key-value pair in namesdict.
So, basically a for-loop:
for name, counts in namesdict.items():
    xdict = {}
    xdict['year'] = YEAR # i.e. 2014
    xdict['name'] = name
    xdict['females'] = counts['F']
    xdict['males'] = counts['M']
    xdict['total'] = xdict['males'] + xdict['females']
    # the "likely" gender is determined by comparing females vs males numbers
    if xdict['females'] >= xdict['males']:
        xdict['gender'] = 'F'
        xdict['ratio'] = round(100 * xdict['females'] / xdict['total'])
    else:
        xdict['gender'] = 'M'
        xdict['ratio'] = round(100 * xdict['males'] / xdict['total'])
    # finally, add our new dict, xdict, to my_awesome_list
    my_awesome_list.append(xdict)
There are some questions worth asking. For example, we used to track the count of female and male babies with the F and M keys, like this:
{'Daniel': {'F': 100, 'M': 9000}}
Why the change to females and males?
{'name': 'Daniel', 'females': 100, 'males': 9000}
It’s a matter of opinion. But remember that we’re creating a new data file. And when a user comes upon it, what’s going to make more sense when looking at the headers:
name,F,M
Daniel,100,9000
Or:
name,females,males
Daniel,100,9000
It requires a little more work in organizing the data, but part of data-wrangling is producing a more useful public face that may not have been necessary when initially working with the data.
Creating a new file and using csv.DictWriter()
This is basically just a pattern you memorize: when serializing a list of dictionaries as a flat CSV file, you use csv.DictWriter, that’s all. That said, it took me quite a few tries to memorize it. Hopefully it’s clear why we created easy-to-remember constants for WRANGLED_DATA_FILENAME and WRANGLED_HEADERS:
# let's create the new file to write to
wfile = open(WRANGLED_DATA_FILENAME, 'w')
# turn it into a DictWriter object, and tell it what the fieldnames are
wcsv = csv.DictWriter(wfile, fieldnames=WRANGLED_HEADERS)
# write the headers row
wcsv.writeheader()
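If you want to see the DictWriter pattern in isolation before pointing it at a real file, here is a self-contained sketch that writes to an in-memory buffer. The `HEADERS` and `rows` values are made up, and `io.StringIO` stands in for the opened file:

```python
import csv
import io

HEADERS = ['name', 'females', 'males']
rows = [
    {'name': 'Daniel', 'females': 100, 'males': 9000},
    {'name': 'Leslie', 'females': 994, 'males': 61},
]

# io.StringIO stands in for a real file opened with open(path, 'w', newline='')
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=HEADERS, lineterminator='\n')
writer.writeheader()
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```

When writing to an actual file, the csv module's documentation recommends opening it with `newline=''` so that line endings come out right on every platform.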
Sorting the data before writing it
Oh, but we can’t write the actual data rows just yet… As one more pain-in-the-butt requirement, we have to sort the rows by the total column (in descending order), then by the name column. Do it how you like; this is how I did it:
def xfoo(xdict):
    # and return a tuple of negative total, and normal name
    return (-xdict['total'], xdict['name'])
my_final_list = sorted(my_awesome_list, key=xfoo)
for row in my_final_list:
    wcsv.writerow(row)
# the end...close the file
wfile.close()
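The tuple trick is easier to see on a tiny list. A sketch with made-up rows, using a lambda in place of a named key function:

```python
sample_rows = [
    {'name': 'Zoe', 'total': 50},
    {'name': 'Emma', 'total': 200},
    {'name': 'Abe', 'total': 50},
]

# negative total sorts largest first; name breaks ties alphabetically
ordered = sorted(sample_rows, key=lambda row: (-row['total'], row['name']))
ordered_names = [row['name'] for row in ordered]
print(ordered_names)
# ['Emma', 'Abe', 'Zoe']
```

Python compares tuples element by element, so the name is only consulted when two totals are equal.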
Write the first five lines of text
Just to make sure that we’ve produced the file we want, this exercise asks us to re-open the file, not to parse it into data, but just to print the first five lines as plain text. This works:
finalfile = open(WRANGLED_DATA_FILENAME, 'r')
thestupidlines = finalfile.readlines()[0:5]
for line in thestupidlines:
    # remember each text line has a newline character
    # that we don't want to print out for aesthetic reasons
    print(line.strip())
And that should get you to the desired output.
When you run g.py from the command-line:
0020-gender-detector $ python g.py
- The program's output to screen should be:
year,name,gender,ratio,females,males,total
2014,Emma,F,100,20799,12,20811
2014,Olivia,F,100,19674,22,19696
2014,Noah,M,99,106,19144,19250
2014,Sophia,F,100,18490,17,18507
- The program creates this file path:
tempdata/wrangled2014.csv
0020-gender-detector/h.py » Print the 5 most popular names in 2014 that are relatively gender-ambiguous.
Print a list of the 5 most popular baby names in 2014 that skewed no more than 60% towards either male or female, i.e. a ratio of less than or equal to 60.
This exercise depends on you having created tempdata/wrangled2014.csv in the previous exercise.
Well, you don’t have to have made that wrangled file – obviously you can copy-and-paste all the code you used to generate that file into this script. The Python interpreter doesn’t really care…but you – and anyone else who has to deal with your code – will care.
Having one script create and package a file for other scripts to use is a very common pattern.
Most popular baby names means to sort the list by total, then filter for names in which the gender ratio is less than or equal to 60%.
(See the answer on Github)
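For orientation, here is a rough sketch of the read-convert-filter steps, run against a made-up CSV string rather than the real wrangled file. It is not the answer from the repo; the sample names and numbers are invented, and the rows are assumed to be already sorted by total as g.py leaves them:

```python
import csv
import io

# Made-up rows in the wrangled layout, already sorted by total; not real SSA data
SAMPLE_CSV = """year,name,gender,ratio,females,males,total
2014,Alpha,F,100,900,0,900
2014,Bravo,M,55,360,440,800
2014,Charly,F,60,300,200,500
2014,Delta,M,90,10,90,100
"""

datarows = []
for row in csv.DictReader(io.StringIO(SAMPLE_CSV)):
    # csv gives us strings, so convert the numeric columns by hand
    for key in ('ratio', 'females', 'males', 'total'):
        row[key] = int(row[key])
    datarows.append(row)

ambiguous = [row for row in datarows if row['ratio'] <= 60]
for row in ambiguous[:5]:
    print(row['name'], row['gender'], row['ratio'], row['total'])
```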
When you run h.py from the command-line:
0020-gender-detector $ python h.py
- The program's output to screen should be:
Most popular names with <= 60% gender skew:
Charlie M 54 3102
Dakota F 56 2012
Skyler F 54 1981
Phoenix M 59 1530
Justice F 59 1274
0020-gender-detector/i.py » Print a breakdown of popular names in 2014 by gender ambiguity.
This exercise depends on you having created tempdata/wrangled2014.csv previously.
For each of the following gender ratio breakpoints:
- 60%
- 70%
- 80%
- 90%
- 99%
Print the number and percentage of popular baby names (names given to at least 100 total babies in 2014) that have a gender ratio less than or equal to the given break point.
For example, for the breakpoint of 70%, find and count all baby names in which the ratio is 70% or less toward one gender or another.
All of the code from h.py that is used to deserialize the data into a list of dictionaries can be reused here.
You have one additional data processing step: filtering that list to include only names that have 100 or more total babies. You should be able to figure this out:
bigdatarows = []
for row in datarows:
    if SOMETHINGSOMETHING:
        bigdatarows.append(row)
The number of “popular” names in 2014 is simply len(bigdatarows)
As for what you need to print out, the process is about the same as it was in the previous exercise. But you start off with a for-loop:
print("Popular names in 2014 with gender ratio less than or equal to:")
for genderratio in (60, 70, 80, 90, 99):
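One possible shape for the body of that loop, sketched against made-up rows so the counts are easy to verify by eye. The names `popular_rows` and `counts` are my own, and only the `ratio` and `total` keys matter here:

```python
# Made-up rows; only the 'ratio' and 'total' keys matter here
sample_rows = [
    {'name': 'Alpha', 'ratio': 55, 'total': 500},
    {'name': 'Bravo', 'ratio': 75, 'total': 300},
    {'name': 'Charly', 'ratio': 100, 'total': 150},
    {'name': 'Delta', 'ratio': 52, 'total': 40},   # fewer than 100 babies: dropped
]

popular_rows = [row for row in sample_rows if row['total'] >= 100]

counts = {}
for genderratio in (60, 70, 80, 90, 99):
    # count the popular names at or below this breakpoint
    counts[genderratio] = sum(1 for row in popular_rows if row['ratio'] <= genderratio)
    print("%s%%: %s/%s" % (genderratio, counts[genderratio], len(popular_rows)))
```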
When you run i.py from the command-line:
0020-gender-detector $ python i.py
- The program's output to screen should be:
Popular names in 2014 with gender ratio less than or equal to:
60%: 64/3495
70%: 139/3495
80%: 214/3495
90%: 381/3495
99%: 953/3495
There don’t seem to be many names that fall in what we might have assumed to be “ambiguous”. In fact, only 953 names out of 3,495 – less than a third – of the popular babynames are at the 99% or below threshold…which means that more than two-thirds of the popular names are essentially 100% for one gender or the other.
0020-gender-detector/j.py » Aggregate a number of data files from 1950 to 2014, then reshape it for use in a gender-detecting program.
Very much the same as g.py except done over multiple files:
- Include each file from 1950 to 2014, in 10 year intervals, i.e. 1950, 1960, 1970, etc
- Include the 2014 file
- Before reading each file, print to screen: "Parsing NAME_OF_FILE" just so you know you’re reading the right files
- As in g.py, create a new CSV file, but name it tempdata/wrangledbabynames.csv
- As in g.py, print the first 5 lines of the new CSV file.
This is barely an exercise. It’s meant to serve as another example of how, once you get something working once, there’s no reason why you can’t apply the same operation across many values or files. So why restrict ourselves to wrangling just the 2014 file when we can literally do it for every other baby name data file?
The beginning of this script looks very much the same as it did in g.py, though note we’re saving to a different file name, tempdata/wrangledbabynames.csv:
from os.path import join, basename
import csv
DATA_DIR = 'tempdata'
# as before, we create new headers for our wrangled file
# though we leave out year because we probably don't care about it for our ultimate needs
WRANGLED_HEADERS = ['name', 'gender' , 'ratio' , 'females', 'males', 'total']
WRANGLED_DATA_FILENAME = join(DATA_DIR, 'wrangledbabynames.csv')
Since we’re reading from multiple years, we need to create a list of numbers, starting from 1950 and ending at 2014. This is how to produce the list of numbers using a range:
for year in range(1950, 2014, 10):
    print(year)
1950
1960
1970
1980
1990
2000
2010
Note that it stops before 2014, so we just have to add that manually to the list:
START_YEAR = 1950
END_YEAR = 2014
# let's just get a list of all decades, between 1950 and 2014:
years = list(range(START_YEAR, END_YEAR, 10))
# and let's tack on the END_YEAR manually:
years.append(END_YEAR)
Also note that the interval and number of years is arbitrary. You could just as easily aggregate every file since 1880 if you wished.
After that, things are largely the same. We loop through every year, but the namesdict collection stays “above” it all because it is counting up names for every file:
namesdict = {}
for year in years:
    # get the file for this particular year
    filename = join(DATA_DIR, 'yob' + str(year) + '.txt')
    print("Parsing", filename)
    # fill in the rest yourself
Similarly, we wait till that year-loop is finished before creating my_awesome_list, which contains the same stuff as namesdict.
my_awesome_list = []
# just the same as it was for g.py, except no "year"
for name, babiescount in namesdict.items():
    xdict = {'name': name, 'females': babiescount['F'], 'males': babiescount['M']}
In fact, everything here on out should be pretty much the same as g.py.
Remember, you’re just doing what you did to the 2014 file, except to a bunch of files. That includes aggregating it into a single list at the end.
When you run j.py from the command-line:
0020-gender-detector $ python j.py
- The program's output to screen should be:
Parsing tempdata/yob1950.txt
Parsing tempdata/yob1960.txt
Parsing tempdata/yob1970.txt
Parsing tempdata/yob1980.txt
Parsing tempdata/yob1990.txt
Parsing tempdata/yob2000.txt
Parsing tempdata/yob2010.txt
Parsing tempdata/yob2014.txt
name,gender,ratio,females,males,total
Michael,M,100,1963,433277,435240
James,M,100,1391,342651,344042
David,M,100,1139,330092,331231
John,M,100,1095,320621,321716
- The program creates this file path:
tempdata/wrangledbabynames.csv
0020-gender-detector/k.py » Make a function that uses the CSV data to analyze a name, then analyze a list of names
Write a function that, given a name, e.g. "Mary", returns a dictionary that represents the record for the name "Mary", based on the data in tempdata/wrangledbabynames.csv.
Then use that function to print out the likely gender and ratio for each of the following names:
- Michael
- Kelly
- Kanye
- THOR
- casey
- Arya
- ZZZblahblah
The search for the name should be case-insensitive, i.e. return the records for "Thor" and "Casey", respectively, when the values "THOR" and "casey" are passed in.
For names that have no valid record, return the following dictionary:
{ 'name': "whateverthenamewas",
'gender': 'NA',
'ratio': None,
'males': None,
'females': None,
'total': 0
}
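Here is a minimal sketch of the lookup logic on its own, run against a couple of made-up records instead of the real CSV. The function name `find_record` and the sample values are hypothetical; your version would load the rows from tempdata/wrangledbabynames.csv first:

```python
# Made-up records standing in for rows read from the wrangled CSV
sample_rows = [
    {'name': 'Thor', 'gender': 'M', 'ratio': 100, 'males': 30, 'females': 0, 'total': 30},
    {'name': 'Casey', 'gender': 'M', 'ratio': 59, 'males': 59, 'females': 41, 'total': 100},
]

def find_record(name, rows):
    """Hypothetical helper: case-insensitive lookup with an 'NA' fallback."""
    wanted = name.lower()
    for row in rows:
        if row['name'].lower() == wanted:
            return row
    # no match: return the placeholder record described above
    return {'name': name, 'gender': 'NA', 'ratio': None,
            'males': None, 'females': None, 'total': 0}

for name in ['THOR', 'casey', 'ZZZblahblah']:
    record = find_record(name, sample_rows)
    print(name, record['gender'], record['ratio'])
```

Scanning the whole list on every call is fine for a handful of names; for many lookups, a dictionary keyed on the lowercased name would be the usual next step.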
At the end of the script, print the total tally of names by gender:
Total:
F: 2 M: 4 NA: 1
And the number of babies cumulative:
females: 62045 males: 454031
When you run k.py from the command-line:
0020-gender-detector $ python k.py
- The program's output to screen should be:
Michael M 100
Kelly F 86
Kanye M 100
THOR M 100
casey M 59
Arya F 88
ZZZblahblah NA None
Total:
F: 2 M: 4 NA: 1
females: 62045 males: 454031
0020-gender-detector/m.py » Convert wrangledbabynames.csv to wrangledbabynames.json
What’s the difference between storing data as CSV versus JSON?
Find out for yourself. Open, read, and deserialize tempdata/wrangledbabynames.csv as in previous exercises, including converting the appropriate fields ('total', 'ratio', etc.) to numbers.
Then use json.dumps() to serialize it as text and save to:
tempdata/wrangledbabynames.json
Then print the number of characters in the csv file, followed by the number of characters in the new json file. Finally, print the ratio of the json’s character count to the csv’s character count.
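The steps above can be sketched on a tiny made-up dataset. This version keeps everything in memory rather than touching the tempdata/ files, and the `indent=2` argument is my own choice for readability:

```python
import csv
import io
import json

# Made-up rows; the real script would read tempdata/wrangledbabynames.csv instead
SAMPLE_CSV = """name,gender,ratio,females,males,total
Alpha,F,90,90,10,100
Bravo,M,80,10,40,50
"""

rows = []
for row in csv.DictReader(io.StringIO(SAMPLE_CSV)):
    # convert the numeric columns so JSON stores them as numbers
    for key in ('ratio', 'females', 'males', 'total'):
        row[key] = int(row[key])
    rows.append(row)

# indent=2 is a choice made here for readability; it also inflates the size
json_text = json.dumps(rows, indent=2)

print("CSV has", len(SAMPLE_CSV), "characters")
print("JSON has", len(json_text), "characters")
print("JSON/CSV ratio:", round(len(json_text) / len(SAMPLE_CSV), 1))
```

The exact ratio depends on how the JSON is formatted, so don't expect this toy example to match the numbers for the full file.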
When you run m.py from the command-line:
0020-gender-detector $ python m.py
- The program's output to screen should be:
CSV has 1214044 characters
JSON has 6801478 characters
JSON requires 4.6 times more text characters than CSV
- The program creates this file path:
tempdata/wrangledbabynames.json
If you inspect the resulting JSON file, you’ll see that each record looks like this:
{
  "males": 433277,
  "ratio": 100,
  "females": 1963,
  "name": "Michael",
  "total": 435240,
  "gender": "M"
},
{
  "males": 342651,
  "ratio": 100,
  "females": 1391,
  "name": "James",
  "total": 344042,
  "gender": "M"
}
Notice how the number values are unquoted. That’s part of JSON’s spec: the ability to define values other than text strings, which is what CSV is limited to:
name,gender,ratio,females,males,total
Michael,M,100,1963,433277,435240
James,M,100,1391,342651,344042
On the other hand, the CSV version of the data is much more compact. The JSON format requires that the attributes (i.e. the column headers) have to be repeated for every record.
That is why the JSON file ends up being more than 4 times as big as the CSV file.
0020-gender-detector/n.py » Make a function that uses the JSON data to analyze a name, then analyze a list of names
Same as k.py, except you’ll be reading from: tempdata/wrangledbabynames.json
Create a new Python script file named zoofoo.py. In it, create a function named detect_gender() which should operate pretty much identically to detect_gender_from_csv in k.py…except that it reads from the JSON file. It should have less code because you no longer have to manually convert strings of numbers to actual numbers, i.e. you don’t have to do this:
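A minimal sketch of that manual conversion next to its JSON counterpart, using one made-up record. The key names follow the wrangled headers from j.py; everything else is illustrative:

```python
import csv
import io
import json

# Made-up single record in both formats
csv_text = "name,gender,ratio,females,males,total\nAlpha,F,90,90,10,100\n"
json_text = '[{"name": "Alpha", "gender": "F", "ratio": 90, "females": 90, "males": 10, "total": 100}]'

# With CSV, every value arrives as a string and must be converted by hand
csv_row = next(csv.DictReader(io.StringIO(csv_text)))
for key in ('ratio', 'females', 'males', 'total'):
    csv_row[key] = int(csv_row[key])

# With JSON, the numbers are already numbers
json_row = json.loads(json_text)[0]

print(csv_row == json_row)   # True
```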
When you run n.py from the command-line:
0020-gender-detector $ python n.py
- The program's output to screen should be:
Michael M 100
Kelly F 86
Kanye M 100
THOR M 100
casey M 59
Arya F 88
ZZZblahblah NA None
Total:
F: 2 M: 4 NA: 1
females: 62045 males: 454031