Sorting those Baby Names

Read (again) and sort a file of comma-delimited baby name records from the Social Security Administration.
This assignment is due on Wednesday, February 17
5 exercises
4.5 possible points
Create a subfolder named 0013-sorted-names inside your compciv-2016/exercises folder.

Summary

This exercise set uses the same dataset as 0012-get-babynames-2014, but focuses on a different aspect of lists and sequences: how to sort them.

The Checklist

In your compciv-2016 Git repository create a subfolder and name it:

     exercises/0013-sorted-names

The folder structure will look like this (not including any subfolders such as `tempdata/`):

        compciv-2016
        └── exercises
            └── 0013-sorted-names
                ├── a.py
                ├── b.py
                ├── c.py
                ├── d.py
                └── e.py
    
a.py 0.5 points Download the 2014 text file of babynames and count the number of characters
b.py 0.5 points Print the 10 most popular names in 2014, regardless of gender
c.py 1.0 points Print the 10 longest names, given to at least 2,000 babies in 2014
d.py 1.0 points Print the 5 most popular female and male names in 2014 that contain at least one "x"
e.py 1.5 points Print the percentage of babies in 2014 who had popular names.

Background information

Same dataset as the previous exercise, but now you get to practice the built-in sorted function.
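As a quick refresher, here is a minimal sketch of how sorted() behaves on a list of lists. The three sample rows are copied from the b.py expected output on this page; the variable names are only for illustration.

```python
# Sample rows copied from the b.py expected output on this page
rows = [["Noah", "M", 19144], ["Emma", "F", 20799], ["Olivia", "F", 19674]]

# sorted() returns a new list; with no key, lists are compared element by
# element, so this orders the rows alphabetically by name
by_name = sorted(rows)

# key picks the value to sort on; reverse=True puts the largest value first
by_count = sorted(rows, key=lambda row: row[2], reverse=True)

print([row[0] for row in by_name])
print([row[0] for row in by_count])
print(rows[0][0])  # the original list keeps its original order
```

Because sorted() hands back a new list, you can sort the same data several different ways without worrying about what order the original is in.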

The Exercises

0013-sorted-names/a.py » Download the 2014 text file of babynames and count the number of characters

0.5 points

(Yes, this is virtually the same as 0012-got-babynames-2014/a.py)

Expectations

When you run a.py from the command-line:

0013-sorted-names $ python a.py
  • The program's output to screen should be:
    There are 425485 characters in tempdata/ssa-babynames-nationwide-2014.txt

0013-sorted-names/b.py » Print the 10 most popular names in 2014, regardless of gender

0.5 points

The Social Security Administration’s baby name data is ordered by gender, then by baby count in descending order. Rearrange the list so that it is just sorted by baby count in descending order. Then print the first 10 rows.

The easiest way to approach this (and the other exercises) is to iterate through each line in the file and create a list. Then, do the sorting:

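records_list = []
# yourfilename is a placeholder for the path to the downloaded text file
with open(yourfilename, 'r') as f:
    for line in f:
        # each line looks like: Emma,F,20799
        name, sex, babies = line.strip().split(',')
        row = [name, sex, int(babies)]
        records_list.append(row)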

Another way to approach this, if you’ve forgotten how a for-loop can iterate through a file object:

records_list = []
with open(yourfilename, 'r') as f:
    lines = f.readlines()  # a list of strings, one per line
for line in lines:
    name, sex, babies = line.strip().split(',')
    row = [name, sex, int(babies)]
    records_list.append(row)

What does that for-loop do? Well, records_list now contains a list of lists, as opposed to just a list of strings.

In other words, the above for-loop turned each line (a string):

"Emma,F,20799"

Into a list object, containing 3 objects:

["Emma", "F", 20799]

Now, we just need to:

  • Sort records_list in reverse order of its third element, i.e. the baby count.
  • Then loop through just the first 10 elements, and print the results.
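Those two steps can be sketched roughly as follows. This is only an illustration: the sample rows are copied from the b.py expected output and deliberately shuffled, and top_rows is a hypothetical helper name, not something the assignment requires.

```python
# A few rows copied from the b.py expected output, deliberately out of order
records_list = [
    ["Noah", "M", 19144],
    ["Sophia", "F", 18490],
    ["Emma", "F", 20799],
    ["Olivia", "F", 19674],
]


def top_rows(records, n=10):
    """Return the n rows with the highest baby count, largest first."""
    return sorted(records, key=lambda row: row[2], reverse=True)[:n]


lines = []
# enumerate(..., start=1) supplies the "1.", "2.", ... prefixes
for i, (name, sex, babies) in enumerate(top_rows(records_list), start=1):
    lines.append("{}. {},{},{}".format(i, name, sex, babies))
print("\n".join(lines))
```

With the real file there are far more than 10 rows, so the [:n] slice is what trims the sorted list down to the top 10.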
Expectations

When you run b.py from the command-line:

0013-sorted-names $ python b.py
  • The program's output to screen should be:
    1. Emma,F,20799
    2. Olivia,F,19674
    3. Noah,M,19144
    4. Sophia,F,18490
    5. Liam,M,18342
    6. Mason,M,17092
    7. Isabella,F,16950
    8. Jacob,M,16712
    9. William,M,16687
    10. Ethan,M,15619
    

0013-sorted-names/c.py » Print the 10 longest names, given to at least 2,000 babies in 2014

1.0 points

Of the names given to at least 2,000 babies (male and female combined) in 2014, print the top 10 in descending order of character length. In case of a tie (e.g. two names with 9 letters), sort by number of babies, largest first.

The 2,000 baby count is the combined number of boys and girls for a given name. So you’ll want to create a new list from the original data that aggregates both boy and girl babies into a single count per name.

A partial answer for c.py:

(You can also view it on Github)

from os.path import join

DATADIR = 'tempdata'
FPATH = join(DATADIR, 'ssa-babynames-nationwide-2014.txt')

Now we need to create a dictionary derived from the data, in which every name is a key that points to the total number of babies (i.e. both “M” and “F”), e.g.

  {
      'Mackenzie': 4152,
      'Christopher': 10293
  }

namesdict = {}
with open(FPATH) as f:
    for line in f:
        name, sex, babies = line.strip().split(',')
        # add to the running total if this name was already seen for the other sex
        if namesdict.get(name):
            namesdict[name] += int(babies)
        else:
            namesdict[name] = int(babies)

This is necessary because the assignment requires that we select the longest names from a list of names, each of which has been given to at least 2,000 babies, M and F combined. So we basically need to rebuild the data as a gender-agnostic list of names and totals.

After namesdict is populated, we filter it to include only the key-value pairs in which the value (i.e. number of babies) is at least 2,000, as per the assignment requirements.

Then, finally, with that filtered list of “popular” names, you can sort it by length of name, then by number of babies.
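One possible way to do the filter-then-sort step is sketched below. The totals are copied from the c.py expected output, except for "Zz", which is a made-up entry included only to show the filter dropping a rare name. MIN_BABIES, popular, and longest are illustrative names, and the column widths in the print call are a guess at the layout.

```python
# Totals copied from the c.py expected output, plus one made-up entry ("Zz")
# that exists only to show a name falling below the threshold
namesdict = {
    "Mackenzie": 4152,
    "Alexander": 15326,
    "Christopher": 10293,
    "Charlotte": 10055,
    "Zz": 12,
}

MIN_BABIES = 2000

# keep only the (name, total) pairs that meet the threshold
popular = [(name, total) for name, total in namesdict.items() if total >= MIN_BABIES]

# sort on a tuple: name length first, then baby count;
# reverse=True makes both parts descending
longest = sorted(popular, key=lambda pair: (len(pair[0]), pair[1]), reverse=True)

for name, total in longest[:10]:
    print("{:<18}{:>6}".format(name, total))
```

Sorting on a tuple is what handles the tie-break: Python compares the first elements, and only looks at the second when the first are equal.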

Expectations

When you run c.py from the command-line:

0013-sorted-names $ python c.py
  • The program's output to screen should be:
    Christopher        10293
    Alexander          15326
    Charlotte          10055
    Elizabeth           9498
    Sebastian           9246
    Christian           8520
    Gabriella           5051
    Annabelle           4324
    Nathaniel           4257
    Mackenzie           4152
    

0013-sorted-names/d.py » Print the 5 most popular female and male names in 2014 that contain at least one "x"

1.0 points

Iterate through the list of names in 2014 and print the 5 most popular names that contain at least one "x", for both females and males.

Follow the process in b.py, in which we write a for-loop just to make a list of lists from the file…but with one twist…use an if-statement to only append rows which meet a certain condition…i.e. the name contains at least one "x":

x_list = []
with open(yourbabynamesfilename, 'r') as f:
    for line in f:
        name, sex, babies = line.strip().split(',')
        # placeholder: replace this string with a real test of whether name contains an "x"
        if "SOMETHING SOMETHING SOMETHING":
            row = [name, sex, int(babies)]
            x_list.append(row)

Then you can do two for-loops to create two new lists from x_list, one in which the gender is F and one in which it is M, and sort each in descending order of count. Then iterate through each list for the top 5 names.

There are more graceful ways to do it, but whatever makes sense to you with the least amount of typing…
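Here is a rough sketch of that split-and-sort step, assuming x_list has already been built. The sample rows are copied from the d.py expected output and shuffled; the females, males, and top names are illustrative, and the column widths are a guess.

```python
# Rows copied from the d.py expected output, deliberately out of order
x_list = [
    ["Xavier", "M", 4726],
    ["Alexa", "F", 4227],
    ["Alexander", "M", 15293],
    ["Ximena", "F", 2323],
    ["Jaxon", "M", 7635],
    ["Alexis", "F", 4188],
]

# first loop: collect the female rows
females = []
for row in x_list:
    if row[1] == "F":
        females.append(row)

# second loop: collect the male rows
males = []
for row in x_list:
    if row[1] == "M":
        males.append(row)

top = {}
for label, rows in [("Female", females), ("Male", males)]:
    # sort by baby count, largest first, and keep at most 5 rows
    top[label] = sorted(rows, key=lambda row: row[2], reverse=True)[:5]
    print(label)
    for i, (name, sex, babies) in enumerate(top[label], start=1):
        print("{}. {:<12}{:>10}".format(i, name, babies))
```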

Expectations

When you run d.py from the command-line:

0013-sorted-names $ python d.py
  • The program's output to screen should be:
    Female
    1. Alexa             4227
    2. Alexis            4188
    3. Alexandra         3288
    4. Ximena            2323
    5. Alexandria        1589
    Male
    1. Alexander        15293
    2. Jaxon             7635
    3. Jaxson            4900
    4. Xavier            4726
    5. Maxwell           3703
    

0013-sorted-names/e.py » Print the percentage of babies in 2014 who had popular names.

1.5 points

Print the percentage of babies, rounded to one decimal place, who have a name in each of these five brackets of popularity:

  • Top 10 most popular names
  • Top 11 to 100 most popular names
  • Top 101 to 1000 most popular names
  • Top 1,001 to 10,000 most popular names
  • All other names, 10,001 and so on
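The bracket arithmetic might be sketched like this. It assumes you already have one baby count per name (for example, the values of a combined name-to-total dictionary like the one in c.py); whether the ranking should combine M and F is my reading of the expected output, not something stated outright. bracket_percentages is a hypothetical helper, and the counts below are made-up numbers chosen only to make the mechanics easy to check by hand.

```python
def bracket_percentages(counts, brackets):
    """counts: one baby count per name.
    brackets: (start, end) popularity ranks, 1-based and inclusive.
    Returns each bracket's share of all babies, as a percentage
    rounded to one decimal place."""
    ranked = sorted(counts, reverse=True)  # most popular first
    total = sum(ranked)
    results = []
    for start, end in brackets:
        # the slice ranked[start - 1:end] covers ranks start through end
        share = sum(ranked[start - 1:end])
        results.append(round(100 * share / total, 1))
    return results


# Made-up counts, only to show the mechanics
counts = [5, 50, 10, 30, 5]
brackets = [(1, 2), (3, 5)]
for (start, end), pct in zip(brackets, bracket_percentages(counts, brackets)):
    print("Names {} to {}: {}".format(start, end, pct))
```

For the real data, the brackets would be the five rank ranges listed above, with the last one ending at the total number of names.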
Expectations

When you run e.py from the command-line:

0013-sorted-names $ python e.py
  • The program's output to screen should be:
    Names 1 to 10: 4.9
    Names 11 to 100: 22.9
    Names 101 to 1000: 43.0
    Names 1001 to 10000: 23.9
    Names 10001 to 30579: 5.3
    

References and Related Readings

Sorting Python collections with the sorted method
Sorting a list of items is not as simple as it seems. But it is also far more important than it seems.
Built-in Functions: sorted
Even though the list object has its own `sort()` method, I strongly urge you to ignore it and instead use the `sorted()` function, which returns a new sorted list without mutating the original.
Sorting HOW TO
This tutorial describes several ways to sort sequences in Python. I highly recommend focusing on just the `sorted()` examples.