The Checklist
In your compciv-2016 Git repository create a subfolder and name it:
exercises/0013-sorted-names
The folder structure will look like this (not including any subfolders such as `tempdata/`):
compciv-2016
└── exercises
    └── 0013-sorted-names
        ├── a.py
        ├── b.py
        ├── c.py
        ├── d.py
        └── e.py
Background information
Same dataset as the previous exercise, but now you get to practice the built-in sorted function.
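As a quick refresher, here is a minimal sketch of how sorted behaves with its key and reverse arguments. The three rows are copied from the expected output of b.py further down; the variable names are just for illustration.

```python
# Three rows in [name, sex, count] form, taken from the b.py expected output.
rows = [["Noah", "M", 19144], ["Emma", "F", 20799], ["Olivia", "F", 19674]]

# sorted() returns a new list and leaves the original list untouched.
by_name = sorted(rows, key=lambda row: row[0])

# key= picks the value to compare; reverse=True gives descending order.
by_count_desc = sorted(rows, key=lambda row: row[2], reverse=True)

print(by_name[0])        # the row whose name comes first alphabetically
print(by_count_desc[0])  # the row with the largest count
```

Because sorted returns a new list, rows itself is still in its original order afterwards.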
The Exercises
0013-sorted-names/a.py » Download the 2014 text file of babynames and count the number of characters
(Yes, this is virtually the same as 0012-got-babynames-2014/a.py)
- Make a tempdata subdirectory inside your working directory, i.e. 0013-sorted-names/tempdata
- Download the list of Social Security babynames data for 2014: http://stash.compciv.org/ssa_baby_names/ssa-babynames-nationwide-2014.txt
  Save it to your tempdata folder at this path: 0013-sorted-names/tempdata/ssa-babynames-nationwide-2014.txt
- Count and print the number of characters in the file.
When you run a.py from the command-line:
0013-sorted-names $ python a.py
The program's output to screen should be:
There are 425485 characters in tempdata/ssa-babynames-nationwide-2014.txt
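One possible shape for a.py, as a rough sketch rather than the official solution. It assumes Python 3 and uses urlretrieve from the standard library; the helper names fetch_file and count_characters are made up for this example. The download call is left commented out because it needs network access.

```python
import os
from urllib.request import urlretrieve

SRC_URL = 'http://stash.compciv.org/ssa_baby_names/ssa-babynames-nationwide-2014.txt'
DATA_DIR = 'tempdata'
DEST_PATH = os.path.join(DATA_DIR, os.path.basename(SRC_URL))


def fetch_file(url, dest_path):
    """Download url to dest_path, creating the parent folder if needed."""
    os.makedirs(os.path.dirname(dest_path), exist_ok=True)
    urlretrieve(url, dest_path)


def count_characters(path):
    """Return the number of characters in the text file at path."""
    with open(path, 'r') as f:
        return len(f.read())


# Usage (requires network access, so it is commented out here):
# fetch_file(SRC_URL, DEST_PATH)
# print("There are", count_characters(DEST_PATH), "characters in", DEST_PATH)
```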
0013-sorted-names/b.py » Print the 10 most popular names in 2014, regardless of gender
The Social Security Administration’s baby name data is ordered by gender, then by baby count in descending order. Rearrange the list so that it is just sorted by baby count in descending order. Then print the first 10 rows.
The easiest way to approach this (and the other exercises) is to iterate through each line in the file and create a list. Then, do the sorting:
records_list = []
f = open(yourfilename, 'r')
for line in f:
    name, sex, babies = line.strip().split(',')
    row = [name, sex, int(babies)]
    records_list.append(row)
Another way to approach this, if you’ve forgotten how a for-loop can iterate through a file object:
records_list = []
lines = open(yourfilename, 'r').readlines()
for line in lines:
    name, sex, babies = line.strip().split(',')
    row = [name, sex, int(babies)]
    records_list.append(row)
What does that for-loop do? Well, records_list now contains a list of lists, as opposed to just a list of strings.
In other words, the above for-loop turned each line (a string):
"Emma,F,20799"
into a list object containing 3 objects:
["Emma", "F", 20799]
Now, we just need to:
- Sort records_list in reverse order of its third element, i.e. the baby count.
- Then loop through just the first 10 elements, and print the results.
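The two steps above can be sketched on a handful of lines. This is one possible approach, not the only one: the sample lines are copied from the expected output (shuffled on purpose), and the names sample_lines, sorted_records, and output_lines are invented for the example.

```python
# A few lines copied from the expected output, deliberately out of order.
sample_lines = [
    "Noah,M,19144",
    "Liam,M,18342",
    "Emma,F,20799",
    "Sophia,F,18490",
    "Olivia,F,19674",
]

records_list = []
for line in sample_lines:
    name, sex, babies = line.strip().split(',')
    records_list.append([name, sex, int(babies)])

# Sort on the third element (index 2), largest count first.
sorted_records = sorted(records_list, key=lambda row: row[2], reverse=True)

# Slice off the top entries; enumerate(..., 1) numbers them starting at 1.
output_lines = []
for rank, (name, sex, babies) in enumerate(sorted_records[:3], 1):
    output_lines.append("%s. %s,%s,%s" % (rank, name, sex, babies))

print("\n".join(output_lines))
```

For the real exercise, the slice would be [:10] and the lines would come from the downloaded file.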
When you run b.py from the command-line:
0013-sorted-names $ python b.py
The program's output to screen should be:
1. Emma,F,20799
2. Olivia,F,19674
3. Noah,M,19144
4. Sophia,F,18490
5. Liam,M,18342
6. Mason,M,17092
7. Isabella,F,16950
8. Jacob,M,16712
9. William,M,16687
10. Ethan,M,15619
0013-sorted-names/c.py » Print the 10 longest names, given to at least 2,000 babies in 2014
Of the names given to at least 2,000 babies (male and female combined) in 2014, print the top 10 in descending order of character length. In case of a tie (e.g. 2 names with 10 letters), sort by number of babies in descending order.
The 2,000 baby count is the combined number of boys and girls for a given name. So you’ll want to create a new list from the original data that aggregates both boy and girl babies into a single count per name.
A partial answer for c.py:
(You can also view it on GitHub)
from os.path import join
DATADIR = 'tempdata'
FPATH = join(DATADIR, 'ssa-babynames-nationwide-2014.txt')
Now we need to create a dictionary derived from the data in which every name is a key and points to the total number of babies (i.e. both “M” and “F”) e.g.
{
    'Mackenzie': 4152,
    'Christopher': 10293
}
namesdict = {}
with open(FPATH) as f:
    for line in f:
        name, sex, babies = line.strip().split(',')
        if namesdict.get(name):
            namesdict[name] += int(babies)
        else:
            namesdict[name] = int(babies)
This is necessary because the assignment asks for the longest names among those given to at least 2,000 babies, M and F combined. So we need to rebuild the data as a gender-agnostic list of names and total counts.
After namesdict is populated, we filter it to include only the key-value pairs in which the value (i.e. number of babies) is at least 2,000, as per the assignment requirements.
Finally, sort that filtered list of “popular” names by length of name, then by number of babies, both in descending order.
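Here is a small sketch of the filter-then-sort step, under the assumption that namesdict has already been built. The five real totals are taken from the expected output; 'Rarename' is a hypothetical entry added only so the filter has something to drop, and MIN_BABIES, popular, and ranked are names made up for this example.

```python
namesdict = {
    'Mackenzie': 4152,
    'Christopher': 10293,
    'Alexander': 15326,
    'Charlotte': 10055,
    'Elizabeth': 9498,
    'Rarename': 150,  # hypothetical entry, included only to show the filter
}
MIN_BABIES = 2000

# Keep only the (name, count) pairs that meet the threshold.
popular = [(name, count) for name, count in namesdict.items()
           if count >= MIN_BABIES]

# Tuples compare element by element: name length first, then baby count.
# reverse=True makes both parts descending.
ranked = sorted(popular, key=lambda pair: (len(pair[0]), pair[1]), reverse=True)

for name, count in ranked:
    print(name, count)
```

Using a tuple as the sort key is what handles the tie-breaking: two 9-letter names fall back to comparing their counts.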
When you run c.py from the command-line:
0013-sorted-names $ python c.py
The program's output to screen should be:
Christopher 10293
Alexander 15326
Charlotte 10055
Elizabeth 9498
Sebastian 9246
Christian 8520
Gabriella 5051
Annabelle 4324
Nathaniel 4257
Mackenzie 4152
0013-sorted-names/d.py » Print the 5 most popular female and male names in 2014 that contain at least one "x"
Iterate through the list of names in 2014 and print the 5 most popular names that contain at least one "x", for both females and males.
Follow the process in b.py, in which we write a for-loop just to make a list of lists from the file…but with one twist…use an if-statement to only append rows which meet a certain condition…i.e. the name contains at least one "x":
x_list = []
f = open(yourbabynamesfilename, 'r')
for line in f:
    name, sex, babies = line.strip().split(',')
    # Placeholder: replace this string with your own condition
    if "SOMETHING SOMETHING SOMETHING":
        row = [name, sex, int(babies)]
        x_list.append(row)
Then you can write two for-loops to create two new lists from x_list, one in which the gender is F and one in which it is M, and sort each in descending order of count. Then iterate through each list for the top 5 names.
There are more graceful ways to do it, but whatever makes sense to you with the least amount of typing…
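As one illustration of the split-and-sort step, here is a sketch that starts from an x_list that has already been filtered. The rows are copied from the expected output; top_by_sex and labels are names invented for this example, and a list comprehension stands in for the explicit for-loops described above.

```python
# Rows copied from the expected output, in no particular order.
x_list = [
    ["Alexander", "M", 15293],
    ["Alexa", "F", 4227],
    ["Xavier", "M", 4726],
    ["Ximena", "F", 2323],
    ["Jaxon", "M", 7635],
    ["Alexis", "F", 4188],
]

top_by_sex = {}
for sex in ["F", "M"]:
    # Keep only the rows for this sex, then sort by count, largest first.
    rows = [row for row in x_list if row[1] == sex]
    top_by_sex[sex] = sorted(rows, key=lambda row: row[2], reverse=True)[:2]

labels = {"F": "Female", "M": "Male"}
for sex in ["F", "M"]:
    print(labels[sex])
    for rank, (name, _, babies) in enumerate(top_by_sex[sex], 1):
        print("%s. %s %s" % (rank, name, babies))
```

The slice [:2] keeps the sample short; the exercise itself asks for the top 5.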
When you run d.py from the command-line:
0013-sorted-names $ python d.py
The program's output to screen should be:
Female
1. Alexa 4227
2. Alexis 4188
3. Alexandra 3288
4. Ximena 2323
5. Alexandria 1589
Male
1. Alexander 15293
2. Jaxon 7635
3. Jaxson 4900
4. Xavier 4726
5. Maxwell 3703
0013-sorted-names/e.py » Print the percentage of babies in 2014 who had popular names.
Print the percentage of babies, rounded to one decimal place, whose names fall in each of these five brackets of popularity:
- Top 10 most popular names
- Top 11 to 100 most popular names
- Top 101 to 1000 most popular names
- Top 1,001 to 10,000 most popular names
- All other names, 10,001 and so on
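The bracket arithmetic can be sketched with list slicing on a toy dataset. Everything here is hypothetical: the counts are invented so the sums are easy to check by hand, the brackets are much smaller than the real ones, and the sketch assumes you already have one total per name. It also assumes Python 3, where / is true division.

```python
# Hypothetical per-name totals, invented so the arithmetic is easy to check.
counts = [30, 50, 2, 10, 3, 5]
# Hypothetical brackets as (first_rank, last_rank); None means "to the end".
brackets = [(1, 2), (3, 5), (6, None)]

ranked = sorted(counts, reverse=True)
total = sum(ranked)

shares = []
for first, last in brackets:
    # Ranks start at 1 but list indexes start at 0, so rank `first`
    # lives at index first - 1. The slice end is exclusive, so `last` fits as-is.
    chunk = ranked[first - 1:last]
    last_rank = last if last is not None else len(ranked)
    pct = round(100 * sum(chunk) / total, 1)
    shares.append(pct)
    print("Names %s to %s: %s" % (first, last_rank, pct))
```

Since every name lands in exactly one bracket, the shares should add up to roughly 100, which is a handy sanity check on the real data too.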
When you run e.py from the command-line:
0013-sorted-names $ python e.py
The program's output to screen should be:
Names 1 to 10: 4.9
Names 11 to 100: 22.9
Names 101 to 1000: 43.0
Names 1001 to 10000: 23.9
Names 10001 to 30579: 5.3