Text makes up the vast majority of the data that we gather, analyze, and produce. So expect to spend a significant amount of time learning about the str object, which Python uses to represent text and other string literals, and the many methods and functions for working with text.
The str object (think of str as an abbreviation for string) is the data type that Python uses to handle literal text values. Run the type() function for yourself:
>>> type("hello")
<class 'str'>
I'll frequently use the term string literal to describe text values inside code. I like to think of the word literal this way: the text value is meant to be interpreted literally, or as-is, by the Python interpreter.
In the snippet below, the text hello world
, enclosed in double quote marks, is the string literal:
len("hello world")
The text len is not a string literal; it is the name of a function, which the Python interpreter treats as something to execute. The len() function will return the number of characters inside the string literal "hello world", which is 11, since the whitespace character counts as a character.
The following snippet, which looks similar, sans the quotation marks, is interpreted completely differently by the Python interpreter:
len(hello world)
In fact, it will throw a syntax error, because the Python interpreter tried to interpret hello world
as actual Python code, not a literal text value:
len(hello world)
^
SyntaxError: invalid syntax
In other words, paying attention to quotation marks is extremely important when writing code that works with text values.
Text values can be denoted as String objects by enclosing them in quote marks: either single or double:
string_a = 'this is a proper string'
string_b = "this is also proper"
string_c = 'Double quotes are interpreted "literally" inside single quotes'
string_d = "And vice versa, d'oh!"
Just make sure you use the same delimiter that you started with. This is bad:
bad_string = "hello world'
What if you want to use both double and single quotes within a single String object? One strategy is to use the backslash character, \, which is often referred to as an escape character. It changes the meaning of the character that immediately follows it; the backslash and that character together form an escape sequence.
In the snippet below, the inner set of double quotes is preceded by backslashes; the Python interpreter no longer interprets them as delimiters, but instead as literal characters that happen to be quotation marks:
good_string = "He said, \"She said, 'Goodbye!' really loudly.\""
Calling the backslash an escape character can be thought of this way: the presence of a backslash before an otherwise special character, e.g. a quotation mark, lets that character "escape" into a more ordinary meaning, i.e. the quotation mark no longer delimits a string literal; it is just a literal text character itself.
Get used to seeing the backslash in many other kinds of contexts – including turning literal characters into special metacharacters.
For example, when the backslash precedes a character that is normally just a plain literal, such as the letter "n", that letter is escaped from its plain, literal meaning and takes on special meaning in the program.
In this case, "\n"
is what's used to represent newline characters:
>>> print("We like\nto\nparty")
We like
to
party
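A few other escape sequences come up often enough to be worth trying. The sketch below is only illustrative; the variable names and sample values are made up for the example:

```python
# Each escape sequence is a backslash plus the character that follows it.
tabbed = "name\tage"          # \t is a tab character
path = "C:\\temp"             # \\ is one literal backslash
quoted = "She said \"hi\""    # \" is a literal double quote

print(tabbed)
print(path)
print(quoted)

# An escape sequence is stored as a single character, not two.
print(len("\n"), len("\\"), len("\""))   # 1 1 1
```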
In Python, a String value can contain newline characters when they are represented as "\n"
, but not the newline characters that you create when you hit the Enter key, i.e.
bad_multi_string = "
hey
there
whats going on"
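The above code will result in a syntax error. Older Python 3 releases word it as shown below; newer releases report an unterminated string literal instead: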
SyntaxError: EOL while scanning string literal
As you can imagine, using "\n"
to represent line breaks in a passage of text is going to be incredibly annoying to write and difficult to read:
mysong = "Oh Mickey\nyou're so fine\nYou're so fine\nyou blow my mind\n\"Hey Mickey\"\n\"Hey Mickey\""
However, by using triple quote delimiters – either single or double, though the common style is to use the double quote marks – we can create a string that spans multiple lines:
mysong = """
Oh Mickey
you're so fine
You're so fine
you blow my mind
"Hey Mickey"
"Hey Mickey"
"""
Note that there's no need to escape individual quote marks in the multi-line passage, as the Python interpreter will keep reading text as string literals until it reaches the closing triple-quote delimiter.
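To check that the two spellings really produce the same value, here is a small sketch using a shortened version of the lyrics. The names escaped, multiline, and padded are just for illustration:

```python
escaped = "Oh Mickey\nyou're so fine"
multiline = """Oh Mickey
you're so fine"""

print(escaped == multiline)   # True: same characters, different spelling
print(repr(multiline))        # repr() shows the line break as \n

# Starting the text on the line after the opening delimiter
# puts a newline at the very start of the string.
padded = """
Oh Mickey
"""
print(padded.startswith("\n"), padded.endswith("\n"))   # True True
```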
Combining strings, also known as concatenation, can be done using the +
operator:
>>> "a" + "b"
'ab'
It's worth noting that when numbers are denoted as string literals, they behave just like any other text character. Try to figure out why the result of this operation is not "2"
:
>>> "1" + "1"
'11'
What happens when you try to add an actual number value, i.e. of int or float type, to a str object? You get an error (the exact wording depends on your Python version):
>>> "Party like it's " + 1999
TypeError: Can't convert 'int' object to str implicitly
The type of error will be different if you add a String object to a number:
>>> 99 + " bottles of bees"
TypeError: unsupported operand type(s) for +: 'int' and 'str'
But the point still remains: you can't concatenate two different types of objects. Python requires you to convert one of the objects to the other's type. If we want to convert a number (or any other object) into a String, we use the str()
function. Yes, that confusingly looks like str
, as in the thing you get when you do type("hello")
, but the difference is in those parentheses:
>>> str(99) + " bottles of bees"
'99 bottles of bees'
The str() function and others like it, such as dict(), int(), and list(), are more specifically known as constructors, in that they construct a new object of their namesake type.
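Conversion works in both directions. As a quick sketch (the variable names are illustrative), str() turns a number into text for concatenation, while int() and float() turn numeric text into numbers for arithmetic:

```python
year = 1999
message = "Party like it's " + str(year)   # int -> str, then concatenate
print(message)                              # Party like it's 1999

total = int("1") + int("1")                 # str -> int, then add
print(total)                                # 2

price = float("2.50") + 1                   # str -> float, then add
print(price)                                # 3.5
```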
The len()
function can be used to return the number of characters in a string:
>>> len("hello" + "and" + "welcome" + "to" + "the" + "rock")
24
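Strings can be compared using the equality operator, ==. (Avoid the keyword is for this purpose: it tests whether two names refer to the very same object, not whether two strings contain the same characters.)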
>>> "hello" == "hello"
True
Note that this is case-sensitive: "hello"
and "Hello"
are completely different values to the Python interpreter:
>>> "Hello" == "hello"
False
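Since the keyword is came up as a way of comparing, it's worth seeing how it differs from ==. In this sketch, b is built at run time so that it is a separate object with the same characters as a. What is prints for strings depends on interpreter details, so treat that line as something to experiment with rather than rely on:

```python
a = "hello world"
b = " ".join(["hello", "world"])   # an equal string, assembled at run time

print(a == b)   # True: == compares the characters
print(a is b)   # compares object identity; an implementation detail for strings
```

For comparing text values, == is the operator to reach for.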
The comparison operators, such as greater than and less than – >
and <
, respectively – can also be used to compare whether one string precedes another, particularly useful for determining alphabetical order. Once again, case matters:
>>> "a" > "z"
False
>>> "a" > "Z"
True
And number values that are string literals do not sort in the same way as actual numbers:
>>> 1000 > 9
True
>>> "1000" > "9"
False
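Both results follow from strings being compared one character at a time, by each character's numeric code. A small sketch, with a made-up list of values, shows the effect on sorting and one way around it:

```python
# Upper-case letters have lower codes than lower-case ones.
print(ord("Z"), ord("a"))             # 90 97

values = ["1000", "9", "25"]
as_text = sorted(values)              # compared character by character
as_numbers = sorted(values, key=int)  # compared by numeric value

print(as_text)      # ['1000', '25', '9']
print(as_numbers)   # ['9', '25', '1000']
```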
The in
keyword can be used to determine if one string is the substring of the other:
>>> "she" in "she sells seashells"
True
Again, this is case-sensitive:
>>> "She" in "she sells seashells"
False
String objects have a wide variety of methods; I list some of the most common and useful ones here; you can check the Python documentation for the full suite.
Strings are immutable objects – their values cannot change. I find this an incredibly hard concept to explain without referring to other programming languages, so I won't get into the deep details of what this entails. However, it is worth explaining an observable effect of immutability:
Consider what happens when we call a method of a particular string, such as upper(), which produces an upper-cased version of that string:
>>> mystring = "hello"
>>> mystring.upper()
'HELLO'
It's important to note that the calling string, i.e. mystring
, is not itself transformed:
>>> mystring = "hello"
>>> mystring.upper()
'HELLO'
>>> print(mystring)
hello
Instead, the upper() method returns an entirely new string object. This doesn't really impact most of our coding; you just have to internalize the concept, so that you're not confused when a variable still points to the same string value after you've invoked one of that string's methods.
If you want the variable to take the value of whatever the string's method returned, you can always reassign the variable:
>>> mystring = "hello"
>>> mystring = mystring.upper()
>>> print(mystring)
HELLO
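Another observable effect of immutability is that a string can't be edited in place. In this sketch, trying to assign to a single character raises a TypeError, while building a new string and reassigning the variable works fine:

```python
mystring = "hello"

try:
    mystring[0] = "J"           # in-place change is not allowed on a str
except TypeError as err:
    print("TypeError:", err)

# Build a new string and point the variable at it instead.
mystring = mystring + "!"
print(mystring)                 # hello!
```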
The upper() and lower() methods return, respectively, upper-cased and lower-cased versions of the calling string. This is useful when trying to detect whether one string is in another, but you're unsure of how things are capitalized:
>>> a = "And A Happy New Year"
>>> "happy" in a
False
>>> "happy" in a.lower()
True
The replace() method takes two String objects as arguments. It returns a new string in which all instances of the first string argument that occur in the calling string have been replaced by the second string argument:
>>> m = "she sells seashells"
>>> a = 'she'
>>> b = 'Mary'
>>> m.replace(a, b)
'Mary sells seaMarylls'
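Notice that replace() matched the "she" hiding inside "seashells". If that's not what you want, one option is the optional third argument, which limits how many replacements are made. A brief sketch:

```python
m = "she sells seashells"

print(m.count("she"))               # 2: one word, plus one inside "seashells"

first_only = m.replace("she", "Mary", 1)   # replace at most 1 occurrence
print(first_only)                   # Mary sells seashells

print(m)                            # the original string is unchanged
```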
When reading text files in the wild, particularly web pages, the text content we care about is often surrounded by extraneous whitespace characters. This includes tab characters, "\t"; newlines, "\n"; and regular spaces, " ".
The strip() method returns a version of the calling string in which all leading and trailing whitespace characters are removed. Whitespace characters that occur between non-whitespace characters are left unstripped:
>>> a = """
yo
what's
up?
"""
>>> print(a)
Results in this output:
yo
what's
up?
Calling the strip()
method results in this output:
>>> print(a.strip())
yo
what's
up?
(Note that the print() function, by default, adds its own newline character to the end of its output.)
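By default, the strip() method will operate on whitespace characters. However, you can supply your own string of characters to trim. The argument is treated as a set of individual characters, not as an exact substring, so any run of those characters is removed from both ends: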
>>> a = "hahahahaharryhahahaha"
>>> a.strip("ha")
'rry'
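The argument to strip() is read as a collection of characters to remove, so their order doesn't matter. The sketch below uses a shorter, made-up string to show that, along with the one-sided variants lstrip() and rstrip():

```python
a = "haharryha"

print(a.strip("ha"))    # rry
print(a.strip("ah"))    # rry (same characters, different order)

print(a.lstrip("ha"))   # rryha   (left side only)
print(a.rstrip("ha"))   # haharry (right side only)

print("www.example.com".strip("w.moc"))   # example
```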
The split()
method is one of the most important String methods to learn, because we will frequently be using it to convert a chunk of text into a list of string values.
The split() method is usually called with one argument: a string with which to delimit (i.e. separate) values in the calling string. (Called with no argument, it splits on any run of whitespace.)
>>> mystring = "hey-you-what-you-want"
>>> mywords = mystring.split('-')
>>> type(mywords)
<class 'list'>
>>> len(mywords)
5
>>> for w in mywords:
...     print(w.upper())
HEY
YOU
WHAT
YOU
WANT
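A few variations on split() are handy to know. This sketch, with made-up sample strings, shows the no-argument form, the optional limit on the number of splits, and what happens when two delimiters sit next to each other:

```python
messy = "  hey   you\twhat\nyou want  "

# No argument: split on any run of whitespace, ignoring the ends.
print(messy.split())        # ['hey', 'you', 'what', 'you', 'want']

# Second argument: split at most that many times.
print("hey-you-what-you-want".split("-", 1))   # ['hey', 'you-what-you-want']

# With an explicit delimiter, adjacent delimiters yield an empty string.
print("a,,b".split(","))    # ['a', '', 'b']
```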
Consider this pipe-delimited text string, in which the pipe character, |, is used to separate a person's last name, first name, and birthdate:
mydata = """Jane|Mary|1978-12-02"""
If we saved that text as a file and opened it in Excel, the tabular result would look like this:
Jane | Mary | 1978-12-02 |
If we use split("|") on the string, we get a list object in which the last name, first name, and birthdate are assigned to the 0th, 1st, and 2nd indices, respectively:
cols = mydata.split("|")
This allows us to reorganize the data as we wish:
>>> print(cols[1], cols[0], "has a birthday on", cols[2])
Mary Jane has a birthday on 1978-12-02
>>> birthdate = cols[2].split('-')
>>> year = birthdate[0]
>>> print(cols[1], cols[0], "was born in", year)
Mary Jane was born in 1978
If the string object contains multiple records, i.e. multiple rows, we can think of it as a data file that uses the newline character, "\n"
to separate the rows:
mydata = """Jane|Mary|1978-12-02
Smith|John|1990-03-22
Lee|Pat|1991-08-07"""
rows = mydata.split("\n")
for row in rows:
    cols = row.split("|")
    print(cols[1], cols[0], "has a birthday on", cols[2])
The output:
Mary Jane has a birthday on 1978-12-02
John Smith has a birthday on 1990-03-22
Pat Lee has a birthday on 1991-08-07
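Going one step further, each row can be turned into a small labeled record rather than a bare list. This is only a sketch: the dictionary keys ("first", "last", "year") are names chosen for the example, and splitlines() is used here as an alternative to split("\n"):

```python
mydata = """Jane|Mary|1978-12-02
Smith|John|1990-03-22
Lee|Pat|1991-08-07"""

records = []
for row in mydata.splitlines():
    last, first, birthdate = row.split("|")    # unpack the three columns
    year, month, day = birthdate.split("-")    # split the date the same way
    records.append({"first": first, "last": last, "year": int(year)})

for rec in records:
    print(rec["first"], rec["last"], "was born in", rec["year"])
```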
While string objects are not lists, per se, they are sequences that allow for many of the same kind of operations via square-bracket-notation, e.g.
>>> mylist = ["hello", "world"]
>>> print(mylist[0])
hello
>>> mystring = "hello world"
>>> print(mystring[0])
h
>>> print(mystring[0:5])
hello
When using square-bracket notation to operate on a string object, we can think of that string object as being a collection of individual characters:
>>> mystring = "123456"
>>> print(mystring[0])
1
>>> print(mystring[-1])
6
When we want to get a slice of a string, we use the same square-bracket notation with two values separated by a colon to specify the start and end of the substring (the end index itself is excluded). The result is not a list, but a new string object:
>>> birthdate = "1975-04-02"
>>> yr = birthdate[0:4]
>>> print(yr)
1975
>>> type(yr)
<class 'str'>
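Slices are forgiving in ways that single indexes are not. As a short sketch (the variable names are illustrative), either side of the colon can be left out, negative numbers count from the end, and an out-of-range slice simply stops at the end of the string:

```python
birthdate = "1975-04-02"

year = birthdate[:4]      # from the start up to, not including, index 4
month = birthdate[5:7]    # indexes 5 and 6
day = birthdate[-2:]      # the last two characters
print(year, month, day)   # 1975 04 02

print(birthdate[0:100])   # a slice past the end is fine: 1975-04-02

try:
    birthdate[100]        # a single index past the end is not
except IndexError as err:
    print("IndexError:", err)
```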
Since strings are sequences, we can iterate across them with a for-loop when we want to perform an operation on each individual character:
>>> for c in "hello":
...     print(c.upper())
H
E
L
L
O
However, the more common use case is to split a string object with split()
, and loop across the resulting elements:
>>> mystr = "apples,oranges,pears,peaches"
>>> for fruit in mystr.split(','):
...     print("I like", fruit)
I like apples
I like oranges
I like pears
I like peaches
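The reverse trip, from a list of strings back to a single string, can be made with the join() method, which is called on the delimiter rather than on the list. A minimal sketch, with illustrative variable names:

```python
mystr = "apples,oranges,pears,peaches"

shouted = []
for fruit in mystr.split(","):
    shouted.append(fruit.upper())

# join() is called on the delimiter and takes the list as its argument.
rejoined = " | ".join(shouted)
print(rejoined)   # APPLES | ORANGES | PEARS | PEACHES

# Splitting and joining on the same delimiter gets the original back.
print(",".join(mystr.split(",")) == mystr)   # True
```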
In other lessons, we'll learn all the different ways that text can be converted (i.e. deserialized) into Python objects, and vice versa (i.e. serialized). Why are there so many different ways? And why is so much data stored as text?
Text is the most flexible communication technology. Pictures may be worth a thousand words, when there’s a picture to match what you’re trying to say. But let’s hit the random button on wikipedia and pick a sentence, see if you can draw a picture to convey it, mm? Here:
“Human rights are moral principles or norms that describe certain standards of human behaviour, and are regularly protected as legal rights in national and international law.”
Not a chance. Text can convey ideas with a precisely controlled level of ambiguity and precision, implied context and elaborated content, unmatched by anything else. It is not a coincidence that all of literature and poetry, history and philosophy, mathematics, logic, programming and engineering rely on textual encodings for their ideas.