In Excel, sorting data is as “easy” as clicking a column-header. But of course, it’s much more complicated in the programming-land to do sorting. We have to write our own functions to tell Python’s sorting functions exactly how each item in a collection should be ranked.
It’s annoying at first, but the necessity of such explicitness becomes evident when working with data. Not everything in the world can be ranked in alphabetical order.
In Python, as in most high-level programming languages, it's very easy to sort a list of simple objects such as numbers or text strings:
>>> sorted([3, 0, 2])
[0, 2, 3]
>>> sorted(['oranges', 'apples'])
['apples', 'oranges']
It's when we want to sort a collection of objects that are not all the same that things get slightly more complicated:
>>> sorted(["apples", 42])
TypeError: unorderable types: int() < str()
Or, if the objects are all the same type of thing, but there's no obvious way that they should be sorted. How should a list of lists be sorted? By length of list? By the alphabetical/numerical order of the things in each list?
A real-life analogy would be trying to sort a list of movies. Sorting movie titles in alphabetical order is pretty straightforward. But how would we write a program that sorts a list of movies by quality? It's ultimately a problem as hard as defining what "quality" means – is it number of Oscar awards? The box office receipts? The average of critics' reviews scores? But the core concept is that when we want to sort by something more ephemeral or abstract than the alphabetical order of a name, we have to do a little more work.
This guide focuses on the built-in sorted function. The Python documentation has a nice how-to tutorial that has even more ways and examples of how to do sorting.
Python has a built-in function named sorted which, given an iterable collection, such as a list, will return a new list of items, but in sorted order:
>>> mylist = [42, -5, 0]
>>> sorted(mylist)
[-5, 0, 42]
Note: by "iterable collection", I mean that sorted
can deal with sequences other than lists. Strings, for example, are iterable:
>>> sorted("Hello World!")
[' ', '!', 'H', 'W', 'd', 'e', 'l', 'l', 'l', 'o', 'o', 'r']
…but, for the most part, we'll be dealing with lists.
By default, sorted
will return numbers and text strings in ascending order, i.e. from lowest to highest, and alphabetically.
Of course, even that is more complicated than you might think. Since every digital object is ultimately just a bunch of ones and zeros, letters and numbers are actually sorted by their binary values, e.g. 1000001
(i.e. 65) for "A"
, and 1100001
(i.e. 97) for "a"
. Which should explain this result:
>>> sorted(["alpha", "Zed"])
['Zed', 'alpha']
And when numbers are represented as strings, they are not sorted in numerical order, but by their binary/hexadecimal value, e.g. the character "1"
has a value of 49
:
>>> sorted(["hundred", "100"])
['100', 'hundred']
Try to apply that factoid in explaining this seemingly contradictory result:
>>> sorted([9, 10])
[9, 10]
>>> sorted(["9", "10"])
['10', '9']
The sorted
function takes a named argument, reverse
, which is set to False
by default. If we want things to be sorted in reverse-order (which would be descending order for character strings and numbers), pass True
to the reverse
argument:
>>> mylist = [42, -9, 0]
>>> sorted(mylist)
[-9, 0, 42]
>>> sorted(mylist, reverse=True)
[42, 0, -9]
Besides a reverse
argument, the sorted
function takes one other named argument: key.
However, the key
argument does not take a simple value, such as True
. Instead, we have to pass in the name of a function, a function that defines how each object in the collection should be evaluated as for sorting purposes:
sorted(["a", 2, "z"], key=some_function_name)
Which means that a function by that name has to actually be defined:
def some_function_name(item):
return do_something(item)
# obviously, do_something is some other hypothetical function
# that also has to be defined
sorted(["a", 2, "z"], key=some_function_name)
Here's an actual example: how do we sort a list of text strings that have different kinds of capitalization?
mylist = ['apples', 'Oranges', 'bananas']
Remember that while text strings are seemingly sorted in alphabetical order, uppercase characters take precedence:
>>> sorted(mylist)
['Oranges', 'apples', 'bananas']
And maybe that's what we want. But in other situations, we might want the strings to be sorted alphabetically without regard to capitalization.
One way to do this is to convert everything to either all-upper or all-lower case characters:
>>> my_big_list = ['APPLES', 'ORANGES', 'BANANAS']
>>> sorted(my_big_list)
['APPLES', 'BANANAS', 'ORANGES']
Of course, doing that manual conversion has a lot of disadvantages, the primary one being that you have to manually convert each item in the list.
What a key function allows us to do is delegate that work to the sorted function. Essentially, we tell the sorted
function: hey, sort each item in this list not by the item's actual value, but by some transformation of its value.
For the current scenario, the "transformation" of the value is simply: the uppercased version of the string. That function is fairly trivial to write:
def my_foo(item):
return item.upper()
And now we can pass my_foo
into the sorted
function. Here's all the code, all together:
def my_foo(item):
return item.upper()
mylist = ['Oranges', 'apples', 'bananas']
sorted(mylist, key=my_foo)
The result of that sort will be:
['apples', 'bananas', 'Oranges']
Note how the resulting list contains the original values, but sorted according to the result of my_foo
being called upon each item. The sorted
function – and the key
function that it calls – never alters the original list.
How about sorting a list that contains numbers and strings?
['apples', 42, 'Oranges']
Note that this kind of scenario is a bit unusual in that…well, what is the purpose of sorting a list that contains text strings and numbers? The answer: um…I don't know. Which is exactly what the Python interpreter is thinking when it throws you an error if you try to sort the list without a key function:
>>> mylist = ['apples', 42, 'Oranges']
>>> sorted(mylist)
TypeError: unorderable types: int() < str()
It just doesn't know how to order a number compared to a string. So, let's give it a simple key function which simply transforms any given value into a string, using the str()
constructor function:
def my_foofoo(item):
return str(item)
With my_foofoo
(or whatever you want to call it), we can use it as the key function:
>>> mylist = ['apples', 42, 'Oranges']
>>> sorted(mylist, key=my_foofoo)
[42, 'Oranges', 'apples']
Suppose we have a list of lists. Each sub-list contains two values: a name (as a str
) and an age (an int
):
list_o_peeps = [
['Lisa', 42],
['Bart', 9],
['Maggie', 2]
]
If we want to sort by the second value of each list, we simply define a function that, given an item that is ostensibly a collection or sequence, attempts to return the second value:
def sort_foo(x):
return x[1]
>>> sorted(list_o_peeps, key=sort_foo)
[['Maggie', 2], ['Bart', 9], ['Lisa', 42]]
Pretty much the same thing as sorting lists, except of course we have to change how we reference the values, i.e. using keys instead of numerical indexes:
dict_o_peeps = [
{'name': 'Lisa', 'age': 42},
{'name': 'Bart', 'age': 9},
{'name': 'Maggie', 'age': 2}
]
def foo2(x):
return x['age']
>>> sorted(dict_o_peeps, key=foo2)
[{'age': 2, 'name': 'Maggie'},
{'age': 9, 'name': 'Bart'},
{'age': 42, 'name': 'Lisa'}]
Let's try a more complicated key function. Instead of simply pulling out a value via index, e.g. sort by name
.
Let's sort by length of the name
value in descending order (i.e. reverse the standard sort):
def foo_len(item):
name = item['name']
return len(name)
peeps = [
{'name': 'Lisa', 'age': 42},
{'name': 'Homer', 'age': 70},
{'name': 'Maggie', 'age': 2}
]
>>> sorted(peeps, key=foo_len, reverse=True)
[{'age': 2, 'name': 'Maggie'},
{'age': 70, 'name': 'Homer'},
{'age': 42, 'name': 'Lisa'}]
As the collections of things we want to sort get bigger and more complicated, so do the criteria we use to rank them. Given a collection of dictionaries that looks like this:
peeps = [
{'name': 'Mary', 'age': 42},
{'name': 'Mary', 'age': 9},
{'name': 'John', 'age': 42},
{'name': 'Mary', 'age': 2}
]
If we want to sort by the name
in each dictionary, how does sorted
know what to do with all the identical "Mary"
's? It doesn't know, so it will return them in an arbitrary order.
If we want it to sort by name
, and then by age
for when the name is the same, we define the key function to return a tuple of values:
def foo3(x):
return (x['name'], x['age'])
>>> sorted(peeps, key=foo3)
[{'age': 42, 'name': 'John'},
{'age': 2, 'name': 'Mary'},
{'age': 9, 'name': 'Mary'},
{'age': 42, 'name': 'Mary'}]
Alternatively, if we wanted to sort by age
, then by name
:
def foo4(x):
return (x['age'], x['name'])
>>> sorted(peeps, key=foo4)
[{'age': 2, 'name': 'Mary'},
{'age': 9, 'name': 'Mary'},
{'age': 42, 'name': 'John'},
{'age': 42, 'name': 'Mary'}]
If you've understood everything up until now, e.g. what a list and dictionary are, what the sorted
function does, and what a key function is – then you're good to go. This section reviews other ways of writing out the same concepts, with cleaner-looking code at the cost of having to learn more about Python's standard library.
Consider this previous example:
peeps = [
['Lisa', 42],
['Bart', 9],
['Maggie', 2]
]
def foo(x):
return x[1]
sorted(peeps, key=foo)
It's such a common pattern to sort dictionaries by a single key that we can pass the itemgetter
method (from the operator
module) as the key function, along with a number that refers to which value from each list that we want to sort by:
from operator import itemgetter
peeps = [
['Lisa', 42],
['Bart', 9],
['Maggie', 2]
]
sorted(peeps, key=itemgetter(1))
# [['Maggie', 2], ['Bart', 9], ['Lisa', 42]]
Consider this previous example:
peeps = [
{'name': 'Lisa', 'age': 42},
{'name': 'Bart', 'age': 9},
{'name': 'Maggie', 'age': 2}
]
def foo(x):
return x['age']
sorted(peeps, key=foo)
It's such a common pattern to sort dictionaries by a single key that we can pass the itemgetter
method (from the operator
module) as the key function, along with a string that refers to the key in the dictionary that we want to sort by:
from operator import itemgetter
peeps = [
{'name': 'Lisa', 'age': 42},
{'name': 'Bart', 'age': 9},
{'name': 'Maggie', 'age': 2}
]
sorted(peeps, key=itemgetter('age'))
# [{'age': 2, 'name': 'Maggie'},
# {'age': 9, 'name': 'Bart'},
# {'age': 42, 'name': 'Lisa'}]
Note that the sorted
is not an in-place function, i.e. it does not modify the list that's passed into it:
>>> mylist = [8, 3, 12]
>>> newlist = sorted(mylist)
>>> mylist
[8, 3, 12]
>>> newlist
[3, 8, 12]
This is in contrast to the sort
method that the list object has. It actually changes the calling list. And it returns a NoneType
value:
>>> mylist = [8, 3, 12]
>>> returnvalue = mylist.sort()
>>> type(returnvalue)
NoneType
>>> mylist
[3, 8, 12]
For many intents and purposes, the fact that sort
mutates a list may not matter…except it can be hard to infer from reading the code (as a human) to see that the mylist
list has actually changed.
For our purposes, it's just a lot easier to call the sorted
method.
Automate the Boring Stuff with Python has examples on using the list's sort method.
Let's try sorting some real-world data. The U.S. Geological Survey has real-time earthquake feeds in CSV form.
I've downloaded a snapshot of M4.5+ earthquakes over a recent 7-day period:
http://www.compciv.org/files/datadumps/apis/usgs/4.5_week.csv
To preview the data in a spreadsheet-like form, you can view it on Github here.
CSV is just text that, passed through the appropriate deserialization function, can be turned into a Python collection. For our purposes, we need the csv.DictReader function:
from csv import DictReader
import requests
DATA_URL = 'http://www.compciv.org/files/datadumps/apis/usgs/4.5_week.csv'
resp = requests.get(DATA_URL)
txt = resp.text
lines = txt.splitlines()
datarows = list(DictReader(lines))
# or obviously, do it all in one go
# datarows = list(DictReader(resp.text.splitlines()))
datarows
is a list of dictionaries. Inspect the first element in datarows
, i.e. datarows[0]
:
{'depth': '10.99',
'depthError': '4.2',
'dmin': '28.964',
'gap': '120',
'horizontalError': '12.8',
'id': 'us20004zj3',
'latitude': '-25.6731',
'locationSource': 'us',
'longitude': '-13.7383',
'mag': '5.2',
'magError': '0.064',
'magNst': '81',
'magSource': 'us',
'magType': 'mb',
'net': 'us',
'nst': '',
'place': 'Southern Mid-Atlantic Ridge',
'rms': '1.01',
'status': 'reviewed',
'time': '2016-02-11T17:00:32.920Z',
'type': 'earthquake',
'updated': '2016-02-11T17:42:15.185Z'}
It's a dictionary. But note how all the columns that contain numbers are actually denoted as strings. This is because, well, a CSV file is a text file…and, without further instruction, the Python csv.DictReader
function is just going to treat everything as text strings. Note how this is much different than if you had imported the CSV file into Excel…in which case, Excel will try to guess what's in a column.
Which is sometimes good. And sometimes catastrophic (which is why we're not using Excel).
The USGS dataset has a mag
column that specifies the magnitude of the earthquake as a float.
To sort the dataset in descending order of magnitude, we need to write a function that extracts the value from an object's mag
key. We also have to convert that value into a float object, because all values by default are just strings:
def foo(quake):
return float(quake['mag'])
sortedquakes = sorted(datarows, key=foo, reverse=True)[0:5]
for q in sortedquakes:
print(q['mag'], q['place'], q['time'])
# 6.4 92km WSW of Panguna, Papua New Guinea 2016-02-08T16:19:13.090Z
# 6.4 24km SSE of Yujing, Taiwan 2016-02-05T19:57:27.390Z
# 6.3 40km W of Ovalle, Chile 2016-02-10T00:33:05.690Z
# 5.8 89km NNE of Hihifo, Tonga 2016-02-07T02:03:40.400Z
# 5.5 276km WNW of Nadi, Fiji 2016-02-06T01:39:18.300Z
Let's say we have a geospatial point. Such as the location of Stanford, California, which is at the longitude/latitude coordinates of:
longitude | latitude |
---|---|
-122.1660756 | 37.42410599999999 |
How do we find the earthquakes nearest to that point? We need whatever distance formula is appropriate for, given two points on a X/Y plane returns a value that represents the distance between those two points
You can read about the haversine function here:
The haversine formula is an equation important in navigation, giving great-circle distances between two points on a sphere from their longitudes and latitudes. It is a special case of a more general formula in spherical trigonometry, the law of haversines, relating the sides and angles of spherical triangles.
And here's a Python implementation that I almost certainly copied from this StackOverflow answer, which I found by Googling for "haversine python":
def haversine(lon1, lat1, lon2, lat2):
from math import radians, cos, sin, asin, sqrt
lon1 = radians(lon1)
lon2 = radians(lon2)
lat1 = radians(lat1)
lat2 = radians(lat2)
# haversine formula
dlon = lon2 - lon1
dlat = lat2 - lat1
a = sin(dlat /2 ) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
c = 2 * asin(sqrt(a))
r = 6371 # Radius of earth in kilometers.
# return the final calculation
return c * r
Here's one way to approach this problem: Use the haversine
function, but write another function that calls haversine
and provides the known coordinates (i.e. longitude/latitude of Stanford, CA):
STANFORD_LNG = -122.1660756
STANFORD_LAT = 37.42410599999999
def haver_foo(item):
# remember that each record in the CSV
# has to have its numerical columns converted to numbers:
lng = float(item['longitude'])
lat = float(item['latitude'])
return haversine(STANFORD_LNG, STANFORD_LAT, lng, lat)
nearest_quakes = sorted(datarows, key=haver_foo)[0:5]
for quake in nearest_quakes:
print(quake['mag'], quake['place'])
# 4.6 13km SSW of Teziutlan, Mexico
# 4.9 18km N of Chicomuselo, Mexico
# 4.6 33km WSW of La Libertad, El Salvador
# 5 77km SSW of Masachapa, Nicaragua
# 5.3 72km S of Semisopochnoi Island, Alaska