This is a project that dovetails directly from the gender-detector-building homework: now that you've built an "algorithm" that can automatically classify a given name as more likely to be "male" or "female", let's run that algorithm on a whole bunch of data rows.
The structure and objective of this project are best described by the example projects, which you can find here and which I will reference throughout when describing the overall project:
The Pulitzer Prize Board - What is the gender makeup of the dozen-plus journalists and scholars who decide on journalism's most prestigious prize? How many women have served on the board, total? How has the gender breakdown changed over the decade? How has the gender breakdown changed over each year?
The Guardian bylines - Using the Guardian's API, I've downloaded the metadata for several thousand of its articles, including their bylines. Let's see the gender breakdown by section and for Page 1 stories.
California college payrolls - Does the "gender gap" exist for those employed by California's state higher education? What is the gender makeup of employees who make less than $100,000 versus those who make more than $250,000? Is there a difference in makeup between the UC and community college system?
FEC individual donors - When contributing to a federal campaign, donors are required to provide their full name and occupation. Filtering that list to donors who describe themselves as either teachers or attorneys, what is the male-vs-female ratio? Bearing in mind that people who donate to campaigns are a specific subset of the general population, what is the gender breakdown between teachers vs attorneys?
Note: The Pulitzer Prize board project is by far the easiest to clone from Github:

```
git clone https://github.com/compciv/gendered-pulitzer-board
```

It's also the easiest to try out without too many external dependencies or massive amounts of free disk space. Its "fetch_data" phase will probably be completely irrelevant to you, unless you're trying to scrape a weird Angular-heavy, Drupal-powered website. But for the most part, it conforms to what I'd like to see in a finished product from you.
The other projects follow the same pattern and motions, though I've spent less time documenting them. They may also crash your computer if you run the data-fetching scripts without enough free disk space. Still, you can clone their repos and look at the code.
These are must-haves – missing any of these parts is grounds for a 5% reduction:
```
compciv-2016
└── projects
    └── gender-detector-data
        ├── README.md
        ├── analyze.py
        ├── classify.py
        ├── fetch_data.py
        ├── fetch_gender_data.py
        ├── gender.py
        ├── wrangle_data.py
        ├── wrangle_gender_data.py
        └── tempdata
```
- a detect_gender() function, as you've already done in a previous homework
- an extract_usable_name() function which, given a name string from your wrangled data, returns a first name that detect_gender() can make sense of

(Keep reading for more details)
In your project repo, please create a README.md file.
It should be a Markdown-formatted text file that is relatively easy and pleasant to read. And it should contain these sections:
List as many articles and information sources as you can find that are relevant to your topic. For example, if you're interested in the gender gap and you are using a public payroll database, you should be reading articles like these – "Government workforce is closing the gender pay gap, but reforms still needed, report says" – and describing (to me) how this affects your expectations and predictions.
Write a step-by-step list of instructions on what I (or anybody) would need to do to set up your project on our own machines and perform the same data analysis as you. Ideally, this list of instructions should basically be a list of every Python script in your project folder…because the user shouldn't have to do anything special (such as: "Please go to this webpage and click on this button to get the data I used") to get started.
With that said, every script in your project folder should be listed in the order that they should be run, along with a description of what it does, e.g.
- fetch_data.py - Running this script will download two large zip files from the FEC website and store them in tempdata/
- unpack_data.py - Running this script will unzip the zipped files
Describe your 3 analyses and their results in brief. If you want, you can also include their raw output.
Assuming your data didn't come out of thin air, your project should include a script that simply downloads the dataset that you're intending to analyze, whether it be a CSV or JSON file, or some other format.
You can keep this script simple: it points to a remote location and downloads the file, preferably into a tempdata folder (which the script creates) that won't be committed to the repo.
Call this script whatever you want, but it should probably have "fetch" somewhere in its name, e.g.
fetch_data.py. And I should be able to run it on my computer and end up with the same raw data that you started out with.
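To make that concrete, here's a minimal sketch of what a fetch script could look like – the URL below is a hypothetical placeholder for wherever your raw dataset actually lives, not part of the assignment:

```python
# fetch_data.py -- a minimal sketch; the URL is a hypothetical placeholder
import os
from urllib.request import urlopen

DATA_URL = 'http://example.com/some-dataset.csv'  # hypothetical
DEST_DIR = 'tempdata'

def fetch(url=DATA_URL, dest_dir=DEST_DIR):
    """Download the raw file into tempdata/, creating the folder if needed."""
    os.makedirs(dest_dir, exist_ok=True)
    dest_path = os.path.join(dest_dir, os.path.basename(url))
    with urlopen(url) as resp, open(dest_path, 'wb') as f:
        f.write(resp.read())
    return dest_path
```

Running it should leave the raw file in tempdata/, ready for the wrangling step.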
It's possible, but probably not likely, that the downloaded data file contains things exactly as you want them to be. Perhaps the file is massive, and you only want a subset of the data.
Or maybe the data came as JSON and you want to simplify it to a nice, flat CSV, as I do in my Guardian Bylines project. This "wrangling" script (you could call it
wrangle_data.py) is where you could do it.
What if you absolutely have nothing to actually wrangle? I seriously doubt that. But let's say that's the case. OK, then your wrangling script simply creates a new file/folder:
```
tempdata/wrangled/dataset1.csv
tempdata/wrangled/dataset2.csv
tempdata/wrangled/dataset3.csv
...
```

(or what have you)
That's right, just make an identical data file, except under a
wrangled moniker or subfolder. I've decided that it's better to make you wasteful than it is to leave this script optional. Maybe you'll think of something you can wrangle in the meantime.
In the FEC individual donors project, the raw data files of donations are very large. I don't want to attempt to perform gender-detection on every record, so I have a "wrangle" script that simply reads through the raw data files and selects only rows in which the "OCCUPATION" column includes "TEACHER" or "ATTORNEY" or "LAWYER".
It then creates a new file of just these select records.
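That filtering step can be sketched like so – the column names and sample rows here are simplified stand-ins, not the real FEC file layout:

```python
# wrangle_data.py -- a sketch of the occupation filter; the sample
# data below is a tiny in-memory stand-in for the real raw files
import csv
import io

KEYWORDS = ('TEACHER', 'ATTORNEY', 'LAWYER')

def is_relevant(row):
    """True if the row's OCCUPATION mentions any of the keywords."""
    occupation = (row.get('OCCUPATION') or '').upper()
    return any(kw in occupation for kw in KEYWORDS)

# demo; the real script would stream the big raw files from tempdata/
# and write the surviving rows out to tempdata/wrangled/
raw = io.StringIO(
    "NAME,OCCUPATION\n"
    "JANE SMITH,TEACHER\n"
    "JOHN DOE,ENGINEER\n"
    "AL SWEARENGEN,TRIAL LAWYER\n"
)
kept = [row for row in csv.DictReader(raw) if is_relevant(row)]
```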
In fact, I even have a separate script that is in charge of unzipping the downloaded zip files. That's probably more orderly than you need – I like it because the files are so large that even unzipping them is a non-trivial amount of time for things to go wrong.
But make what makes sense to you.
OK, this is pretty much done for you. You can copy a.py from the gender-detector homework into this project – maybe even rename it to something like fetch_gender_data.py.
It should just work…right? I mean, it downloads a zip file and unzips it into
tempdata. What more does it need to do?
Again, this is something you can copy over…though make sure you copy enough.
Remember that in one assignment, we "wrangled" the data into a more usable CSV (j.py).
In another exercise, m.py, we turned that wrangled CSV into a JSON file…just because I felt like making you do it.
Do you just copy j.py, or m.py as well? Frankly, I want you to at least be able to combine the two scripts. Or even just have the recognition that it doesn't matter whether the "wrangled" baby name data is stored as JSON or CSV…you just have to make sure that whatever you do, it integrates with the rest of the project (which I mention just below).
For simplicity's sake, make a script named
gender.py. This script will contain your
detect_gender() function…in other words, it will pretty much be the same as zoofoo.py in the gender-detector homework.
This script takes care of loading the wrangled babynames data (whatever format it is in) and provides a reference to the
detect_gender() function. If you've done everything up to this point, you should be able to start up iPython and do this:
```
>>> from gender import detect_gender
>>> detect_gender("Beyonce")
```
The purpose of having a separate script,
gender.py, is just to emphasize how everything about gender-detection, or at least what we did in the homework, was completed independent of the current project.
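Here's a minimal sketch of what gender.py might look like. The file path, CSV columns, and single-letter return values are all assumptions – your homework version may return richer data, and that's fine as long as classify.py knows what to expect:

```python
# gender.py -- a minimal sketch; the path and CSV columns are hypothetical
import csv
import os

# hypothetical location of the wrangled babynames data; yours may differ
WRANGLED_PATH = os.path.join('tempdata', 'wrangled', 'babynames.csv')

def load_names_data(path=WRANGLED_PATH):
    """Read the wrangled CSV into a simple name -> gender lookup."""
    lookup = {}
    with open(path) as f:
        for row in csv.DictReader(f):
            lookup[row['name'].lower()] = row['gender']
    return lookup

def detect_gender(name, lookup=None):
    """Return 'M', 'F', or 'NA' for the given first name."""
    if lookup is None:
        lookup = load_names_data()
    return lookup.get(name.lower(), 'NA')
```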
There's an important philosophical implication here, which applies to many real-life endeavors (programming and non-programming): the mechanisms we use to judge or filter something are often created independently, sometimes in a vacuum. Or, to use a classic aphorism: the left hand doesn't know what the right hand is doing.
And to bring it back specifically to this project: in the gender-detector homework, we quite clearly were using the U.S. Social Security Administration data to inform our "gender detection". If your project is using data full of non-American names…you can assume that you'll have some sporadic results…
Don't see that as a failure, necessarily, but as a reality to be aware of.
OK, back to scripts that you have to write yourself: classify.py might be the most difficult part of the project, depending on what your data looks like.
What does classify.py do? It classifies each row in your wrangled dataset as male, female, or non-determined (or Other, or whatever other non-binary classifications you want to provide, if you feel like it).
It produces a new data file that is, hence, classified…which is a really confusing word to use, I now realize, but basically, you've added gender classification to the wrangled dataset and saved it to a new file.
At a minimum, this is what classify.py will do:

- For each row in the wrangled dataset, it passes a name to detect_gender(), which returns a result that indicates if the name is male/female gendered, or neither.
- The resulting attributes, including usable_name, are added to the data row, which is written to a new file, e.g. classified_data.csv, if you will.
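The core loop could be sketched like this – the detect_gender() stub, its sample data, and the column names are hypothetical stand-ins (in the real script, detect_gender() would be imported from gender.py):

```python
# classify.py -- a sketch of the core loop; in the real script,
# detect_gender() is imported from gender.py instead of stubbed here
import csv
import io

SAMPLE_NAMES = {'joseph': 'M', 'laura': 'F'}  # hypothetical stand-in data

def detect_gender(name):
    return SAMPLE_NAMES.get(name.lower(), 'NA')

def extract_usable_name(namestr):
    # naive first pass: take the leftmost whitespace-separated word
    return namestr.split(' ')[0]

def classify_rows(rows, name_field='name'):
    """Add usable_name and gender attributes to each wrangled row."""
    for row in rows:
        row['usable_name'] = extract_usable_name(row[name_field])
        row['gender'] = detect_gender(row['usable_name'])
        yield row

# demo on an in-memory CSV; the real script reads the wrangled file
# and writes the classified rows out with csv.DictWriter
wrangled = io.StringIO("name,year\nJoseph Jr. (III),1990\nLaura Barnett,2001\n")
classified = list(classify_rows(csv.DictReader(wrangled)))
```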
One way to look at classify.py is that it takes the source data file and adds 3 new attributes/columns to it, including usable_name. And that's exactly right. Unless you've become ambitious and decided to re-write
detect_gender() to be much faster than it was for the homework…our gender detection is pretty slow, as far as computational processes go. On my own laptop, it maxed out at 200 gender-detections per second…which is agonizingly slow when trying to run it over 100,000 records…meh, at least it's better than doing 100,000 records by hand.
So if anything, having classify.py be its own script lets you run it and then go take a walk or a nap.
However, there is one important function that classify.py should be responsible for: extracting the best possible string from a given name field, such that the
detect_gender() function can return a valid result.
So here's a requirement: inside classify.py, define an extract_usable_name() function. This function has one job: it takes in as an argument a single string, like "Nguyen, Daniel", and it returns something like "Daniel".
Maybe your data has a first_name field, where only first names are entered. Ok, so this is what your
extract_usable_name() function can look like:
```python
def extract_usable_name(namestr):
    return namestr
```
If it really is that easy, then good for you. However, it probably isn't that easy. Remember, human names, even just white-bread American names, are complex.
In my Pulitzer Prize board project, the Pulitzer people were kind enough to provide a "first_name" field. However, for many of the board members, the
first_name field did not simply contain the "first name":
```
Joseph Jr. (III)
James
Vermont C.
Benjamin
Andrew W.
```
The names above all belong to men. But you should know enough about the
detect_gender() function by now to know that it will work for
"Joseph", but will not work for
"Joseph Jr.", nevermind
"Joseph Jr. (III)". In fact, it fails on anything that is more than one word.
Maybe we could make
detect_gender() more sophisticated? That's one approach. Another is to just simply pretend
detect_gender() is a black box that can't be changed. So we change what we send it.
And that's what the
extract_usable_name() function is for. Here's one way to implement it, for the above Pulitzer Prize board members:
```python
def extract_usable_name(namestr):
    nameparts = namestr.split(' ')
    return nameparts[0]
```
It simply splits whatever
namestr is by a whitespace character, which creates a list of strings. And then it returns the first element in that list (i.e. the leftmost word):
```
>>> extract_usable_name('Joseph Jr. (III)')
'Joseph'
>>> extract_usable_name('James')
'James'
>>> extract_usable_name('Vermont C.')
'Vermont'
```
Seems almost too simple, right? Well, that's OK – part of why it works is because the data source, Pulitzer.org, did the job of neatly separating the names of each board member into first_name and last_name fields…so that's why our job is a little easier. That said, our function fails in these situations:
```
>>> extract_usable_name('G. Scott')
'G.'
>>> extract_usable_name('C.K.')
'C.K.'
```
So it's up to you to tweak
extract_usable_name() for maximum effectiveness. To handle
"G. Scott", for example, we can make it so that the function only returns name parts that do not have a period in them:
```python
def extract_usable_name(namestr):
    nameparts = namestr.split(' ')
    for n in nameparts:
        if '.' not in n:
            return n
    # if we haven't returned by now, just return a blank
    return ""
```
And here's what that gets us:
```
>>> extract_usable_name('G. Scott')
'Scott'
>>> extract_usable_name('C.K.')
''
```
"C.K." is still a problem. However, "C.K." (as in Charles Kenny McClatchy, of Sacramento Bee fame) is not our problem. That is, there's nothing in the source data that can help us magically derive a usable first name from "C.K." in an automated way. If this were a real research project, we'd have to alter it manually using our external knowledge – which is fine (usually…). It's just one of the many kinds of problems that computers can't solve for us – hence that old axiom about naming things being one of the hardest problems in computer science.
For the purposes of this assignment, if you run into such a problem…just let it slide. You can see that I did so in the data that I've uploaded to the project repo.
OK, one more example of how naming things – or deriving the name of things – can be a huge pain in the butt.
In my Guardian bylines project, I take advantage of the Guardian's API, which has a
byline field. Problem is, there are many ways that a byline string can turn out:
The first two variations are annoying enough, though easily solved with the "split-the-name-string-by-space-and-take-the-first-part" algorithm.
The third variation,
"Interview by Laura Barnett", means I have to throw in an
if/else statement, i.e. "if the word
' by ' is in the string, do something different".
And the next 3 variations throw my entire analysis into chaos: how should I deal with stories that have multiple authors? This is not a task that can be solved by a clever algorithm – it requires me to make a decision that substantially changes my methodology.
However, in this case, I've told myself: whoever is named first in the byline is the most important, because that's the way the world is…which, as you can imagine, vastly simplifies the
extract_usable_name() function, which you can see in my project repo.
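For what it's worth, a byline-handling extract_usable_name() could be sketched like this. The separators handled here (' by ', ' and ', commas) are my guesses at the common variations, not an exhaustive list, and this is not necessarily how my repo's version works:

```python
def extract_usable_name(byline):
    """Derive a usable first name from a Guardian-style byline string."""
    # "Interview by Laura Barnett" -> keep the part after ' by '
    if ' by ' in byline:
        byline = byline.split(' by ')[-1]
    # multiple authors: keep only the first-listed author
    byline = byline.split(' and ')[0].split(',')[0]
    # finally, take the leftmost word as the first name
    return byline.split(' ')[0]
```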
OK, just in case you forgot what classify.py needs to produce: some kind of "classified" data file, e.g. classified_data.csv. Because this finished file (or files) is what analyze.py reads as its input.
Finally, this is the script that does the counting and dividing and whatever math you need to come up with interesting statistics and findings.
It shouldn't need to worry about
gender.py…in fact, it doesn't really need to know about any of the other files in your project, except for the location of the "classified" data files produced by classify.py.
It then reads the classified dataset, counts things up, and spits out, at a minimum, the number of males versus females. And, if you find it relevant: the number of non-classified records.
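At its simplest, that counting step could look like this – the filename and columns are hypothetical, and the demo uses an in-memory file in place of classify.py's actual output:

```python
# analyze.py -- a sketch of the minimal male-vs-female tally
import csv
import io
from collections import Counter

def tally_genders(rows):
    """Count classified rows by their gender label."""
    return Counter(row['gender'] for row in rows)

# demo with an in-memory stand-in for the classified data file
classified = io.StringIO("name,gender\nJoseph,M\nLaura,F\nC.K.,NA\nJane,F\n")
counts = tally_genders(csv.DictReader(classified))
```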
The requirement for analyze.py is that it spits out analysis for 3 different facets.
What is a facet?
One such facet is: the male vs female ratio for the entire dataset at hand. I mean, that's probably the most obvious one.
So, what's another facet? It depends on your dataset.
Think back to the gender-detector homework, specifically exercises c.py and f.py:
Both exercises analyze the variety of names by gender. But c.py calculates it as a lump sum: the number of unique names by sex for all of the years since 1950. Whereas f.py calculates a slightly different quantity, but more importantly, calculates it for a sequence of years, so the user can see how the variety of names has changed from 1950 to 2014. Same kind of theme, but completely different insights.
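In code terms, the lump-sum-vs-by-year distinction often comes down to grouping the same tally by a facet column. Here's a hedged sketch (the column names and toy rows are hypothetical):

```python
from collections import Counter, defaultdict

def tally_by_facet(rows, facet_field):
    """F-vs-M counts for each value of a facet column, e.g. 'year' or 'section'."""
    tallies = defaultdict(Counter)
    for row in rows:
        tallies[row[facet_field]][row['gender']] += 1
    return tallies

# toy classified rows; the real ones come from classify.py's output file
rows = [
    {'year': '1990', 'gender': 'F'},
    {'year': '1990', 'gender': 'M'},
    {'year': '2014', 'gender': 'F'},
]
by_year = tally_by_facet(rows, 'year')
```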
So basically, think of three different ways to do an F vs. M count.
Here are some examples:
Depending on how experienced you are with Python, the code will probably be as annoyingly cumbersome as it was for the gender-detector homework, i.e.
That's OK…as long as you realize that that is all it is. If you can't think of 3 ways to analyze your dataset by gender, then ask me, and I'll help. The most important thing is that, by this point, you realize that whatever suggestion I give you – or, if you come up with a new idea – shouldn't require rebuilding the solution.
Instead of writing a whole new program to download an extra dataset, you should be able to alter your existing program – maybe even just extend a loop – and be done with it. When you can successfully break down a problem in a computational way, adding new features and scopes should not at all be like having to write a whole new paper, as you would for an end-of-term essay.