11

Strings and Text Data

Introduction

Most data in the world can be stored as text and strings. Even values that may eventually be numeric data may initially come in the form of text. It’s important to be able to work with text data. This chapter won’t be specific to Pandas. That is, we will mainly explore how you manipulate strings within Python without Pandas. The following chapters will cover some more Pandas materials. Then we will come back to strings and see how it all ties back with Pandas. As an aside, some of the string examples in this chapter come from Monty Python and the Holy Grail.

Learning Objectives

  • Recall how to subset containers and sequences

  • Recognize strings are a type of container object

  • Modify strings based on use case

  • Create regular expression patterns to match strings

  • Combine prose text with code output into a single sentence

11.1 Strings

In Python, a string is simply a series of characters. Strings are created with a pair of matching single or double quotes. Below are two strings, grail and a scratch. These strings are assigned to the variables word and sent, respectively.

word = 'grail'
sent = 'a scratch'

So far in this book, we have seen strings in a column represented as the object dtype.

11.1.1 Subset and Slice Strings

A string can be thought of as a container of characters. You can subset a string like any other Python container (e.g., list or Series).

Table 11.1 and Table 11.2 show the strings with their associated index. This information will help you understand the examples in which we slice values using the index.

Table 11.1 Index Positions for the String "grail"

index       0    1    2    3    4
string      g    r    a    i    l
neg index  -5   -4   -3   -2   -1

Table 11.2 Index Positions for the String "a scratch"

index       0    1    2    3    4    5    6    7    8
string      a   ' '   s    c    r    a    t    c    h
neg index  -9   -8   -7   -6   -5   -4   -3   -2   -1

11.1.1.1 Single Letter

To get the first letter of our strings, we can use the square bracket notation, [ ]. This notation is the same method we used in Section 1.3 when we looked at various slices of data.

print(word[0])
g
print(sent[3])
c

11.1.1.2 Slice Multiple Letters

Alternatively, we can use slicing notation (Appendix L) to get ranges from our strings.

# get the first 3 characters
# note index 3 is really the 4th character
print(word[0:3])
gra

Recall that when using slicing notation in Python, it is left-side inclusive, right-side exclusive. In other words, it will include the index value specified first, but it will not include the index value specified second.

For example, the notation [0:3] includes the characters starting at index 0 up to, but not including, index 3. Put another way, [0:3] covers the indices from 0 to 2, inclusive.
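One way to check the left-side inclusive, right-side exclusive rule is to compare a slice with the individual characters it covers. The following is a minimal sketch that reuses the word string from above; first_three and one_by_one are names introduced only for this illustration.

```python
word = 'grail'

# [0:3] covers index 0, 1, and 2, and stops before index 3
first_three = word[0:3]
print(first_three)

# build the same result one index at a time
one_by_one = word[0] + word[1] + word[2]
print(one_by_one)

# the number of characters in a slice is the stop value minus the start value
print(len(first_three) == 3 - 0)
```

A handy consequence of this rule is that the length of a slice can be read directly from its two index values, as long as both fall inside the string.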

11.1.1.3 Negative Numbers

Recall that in Python, passing in a negative index actually starts the count from the end of a container.

# get the last letter from "a scratch"
print(sent[-1])
h

The negative index refers to the index position as well, so you can also use it to slice values.

# get 'a' from "a scratch"
print(sent[-9:-8])
a

You can combine non-negative numbers with negative numbers.

# get 'a'
print(sent[0:-8])
a

Note that you can’t actually get the last letter when using a negative index for the second value.

# try to get "scratch" (the last letter is dropped)
print(sent[2:-1])
scratc
# same attempt, using a negative start index
print(sent[-7:-1])
scratc

11.1.2 Get the Last Character in a String

Just getting the last element in a string (or any container) can be done with the negative index, -1. However, it becomes problematic when we want to use slicing notation and also include the last character. For example, if we tried to use slicing notation to get the word “scratch” from the sent variable, the result returned would be one letter short.

Since Python is right-side exclusive, we need to specify an index position that is one greater than the last index. To do this, we can get the len (length) of the string and then pass that value into the slicing notation.

# note that the last index is one position smaller than
# the number returned by len
s_len = len(sent)
print(s_len)
9
print(sent[2:s_len])
scratch

11.1.2.1 Slice from the Beginning or to the End

A very common task is to slice a value from the beginning to a certain point in the string (or container). The first element will always be 0, so we can always write something like word[0:3] to get the first three elements, or word[-3:len(word)] to get the last three elements.

Another shortcut for this task is to leave out the data on the left or right side of the :. If the left side of the : is empty, then the slice will start from the beginning and end at the index on the right (non-inclusive). If the right side of the : is empty, then the slice will start from the index on the left, and end at the end of the string. For example, these slices are equivalent:

print(word[0:3])
gra
# leave the left side empty
print(word[:3])
gra
print(sent[2:len(sent)])
scratch
# leave the right side empty
print(sent[2:])
scratch
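The empty-side shortcut also combines with negative index values, which avoids the call to len() when you want the last few characters. This is a small sketch using the same word and sent strings; last_three and from_end are names made up for the example.

```python
word = 'grail'
sent = 'a scratch'

# last three characters: start 3 from the end, leave the right side empty
last_three = word[-3:]
print(last_three)

# the word "scratch", counted back from the end of the string
from_end = sent[-7:]
print(from_end)

# equivalent to spelling out the length on the right side
print(from_end == sent[-7:len(sent)])
```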

Another way to specify the entire string is to leave both values empty.

print(sent[:])
a scratch

11.1.2.2 Slice Increments (Steps)

The final notation while slicing allows you to slice in increments. To do this, you use a second colon, :, to provide a third number. The third number allows you to specify the increment to pull values out.

For example, you can get every other character by passing in 2 as the step.

# step by 2, to get every other character
print(sent[::2])
asrth

Any integer can be passed here, so if you wanted every third character (or value in a container), you could pass in 3.

# get every third character
print(sent[::3])
act
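The step can be combined with a start value, and it can also be negative, in which case the slice walks backward through the string. The sketch below goes slightly beyond the examples above; reversed_sent and from_two are illustrative names.

```python
sent = 'a scratch'

# a step of -1 walks the string from the end to the beginning
reversed_sent = sent[::-1]
print(reversed_sent)

# start at index 2, then take every second character
from_two = sent[2::2]
print(from_two)
```

Reversing with [::-1] works on any sequence (lists and tuples included), not just strings.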

11.2 String Methods

Many methods are also used when processing data in Python. A list of all the string methods can be found on the “String Methods” documentation page.1 Table 11.3 and Table 11.4 summarize some string methods that are commonly used in Python.

1. String methods: https://docs.python.org/3/library/stdtypes.html#string-methods

Table 11.3 Python String Methods

Method          Description
.capitalize()   Capitalizes the first character
.count()        Counts the number of occurrences of a string
.startswith()   True if the string begins with specified value
.endswith()     True if the string ends with specified value
.find()         Smallest index of where the string matched, -1 if no match
.index()        Same as find but raises ValueError if no match
.isalpha()      True if all characters are alphabetic
.isdecimal()    True if all characters are decimal numbers (see documentation as well as .isdigit(), .isnumeric(), and .isalnum())
.isalnum()      True if all characters are alphanumeric (alphabetic or numeric)
.lower()        Copy of a string with all lowercase letters
.upper()        Copy of a string with all uppercase letters
.replace()      Copy of a string with the old values replaced with new
.strip()        Removes leading and trailing whitespace; also see lstrip and rstrip
.split()        Returns a list of values split by the delimiter (separator)
.partition()    Similar to split(maxsplit=1) but also returns the separator
.center()       Centers the string to a given width
.zfill()        Copy of a string left filled with '0'

Table 11.4 Examples of Using Python String Methods

Code                                                 Results
"black Knight".capitalize()                          'Black knight'
"It's just a flesh wound!".count('u')                2
"Halt! Who goes there?".startswith('Halt')           True
"coconut".endswith('nut')                            True
"It's just a flesh wound!".find('u')                 6
"It's just a flesh wound!".index('scratch')          ValueError
"old woman".isalpha()                                False (there is a whitespace)
"37".isdecimal()                                     True
"I'm 37".isalnum()                                   False (apostrophe and space)
"Black Knight".lower()                               'black knight'
"Black Knight".upper()                               'BLACK KNIGHT'
"flesh wound!".replace('flesh wound', 'scratch')     'scratch!'
" I'm not dead.   ".strip()                          "I'm not dead."
"NI! NI! NI! NI!".split(sep=' ')                     ['NI!', 'NI!', 'NI!', 'NI!']
"3,4".partition(',')                                 ('3', ',', '4')
"nine".center(10)                                    '   nine   '
"9".zfill(5)                                         '00009'
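A few of these methods behave in ways that are easy to miss in a table. The sketch below shows the difference between .find() and .index() when nothing matches, passes the width to .center() and .zfill() as a positional argument (these two methods do not accept keyword arguments), and shows what .partition() returns when the separator is absent. The variable names are made up for the illustration.

```python
line = "It's just a flesh wound!"

# .find() signals a missing substring with -1
missing = line.find('scratch')
print(missing)

# .index() raises ValueError for the same search
try:
    line.index('scratch')
except ValueError as err:
    print(f"ValueError: {err}")

# the width for .center() and .zfill() is given positionally
centered = "nine".center(10)
padded = "9".zfill(5)
print(repr(centered))
print(repr(padded))

# .partition() always returns a 3-tuple, even without a separator match
with_sep = "3,4".partition(',')
without_sep = "34".partition(',')
print(with_sep)
print(without_sep)
```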

11.3 More String Methods

There are a few more string methods that are useful, but hard to convey in a table.

11.3.1 Join

The .join() method takes a container (e.g., a list) and returns a new string that combines each element in the list. For example, suppose we wanted to combine coordinates in the degrees, minutes, seconds (DMS) notation.

d1 = '40°'
m1 = "46'"
s1 = '52.837"'
u1 = 'N'

d2 = '73°'
m2 = "58'"
s2 = '26.302"'
u2 = 'W'

We can join all the values with a space, ' ', by using the .join() method on the space string.

coords = ' '.join([d1, m1, s1, u1, d2, m2, s2, u2])
print(coords)
40° 46' 52.837" N 73° 58' 26.302" W

This method is also useful if you have a list of strings that you want to separate using your own delimiter (e.g., tabs with \t and commas with ,). If we wanted, we could now .split() on a space, " ", and get the individual parts from coords.

coords.split(" ")
['40°', "46'", '52.837"', 'N', '73°', "58'", '26.302"', 'W']
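As a rough sketch of the custom-delimiter idea, the example below joins part of the coordinate with a comma and with a tab, then splits it back apart. It also shows one caveat: .join() only accepts strings, so other types need to be converted first. The names parts, comma_joined, tab_joined, and nums are hypothetical.

```python
parts = ['40°', "46'", '52.837"', 'N']

# the string you call .join() on becomes the delimiter
comma_joined = ','.join(parts)
tab_joined = '\t'.join(parts)
print(comma_joined)
print(tab_joined)

# splitting on the same delimiter recovers the original list
print(comma_joined.split(',') == parts)

# .join() needs strings, so convert numbers before joining
nums = [40, 46, 52.837]
num_joined = ' '.join(str(n) for n in nums)
print(num_joined)
```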

11.3.2 Splitlines

The .splitlines() method is similar to the .split() method. It is typically used on strings that are multiple lines long and will return a list in which each element of the list is a line in the multiple-line string.

multi_str = """Guard: What? Ridden on a horse?
King Arthur: Yes!
Guard: You're using coconuts!
King Arthur: What?
Guard: You've got ... coconut[s] and you're bangin' 'em together.
"""

print(multi_str)
Guard: What? Ridden on a horse?
King Arthur: Yes!
Guard: You're using coconuts!
King Arthur: What?
Guard: You've got ... coconut[s] and you're bangin' 'em together.

We can get every line as a separate element in a list using .splitlines().

multi_str_split = multi_str.splitlines()
print(multi_str_split)
[
  "Guard: What? Ridden on a horse?",
  "King Arthur: Yes!",
  "Guard: You're using coconuts!",
  "King Arthur: What?",
  "Guard: You've got ... coconut[s] and you're bangin' 'em together."
]

Finally, suppose we just wanted the text from the “Guard”. This is a two-person conversation, so the “Guard” speaks every other line.

guard = multi_str_split[::2]
print(guard)
[
  "Guard: What? Ridden on a horse?",
  "Guard: You're using coconuts!",
  "Guard: You've got ... coconut[s] and you're bangin' 'em together."
]

Each of those lines still begins with the speaker label. There are a few ways to keep only the words spoken by the “Guard”. One way is to call the .replace() method on the string to swap the Guard: label for an empty string, '', and then use the .splitlines() method with the same slice.

guard = multi_str.replace("Guard: ","").splitlines()[::2]
print(guard)
[
  "What? Ridden on a horse?",
  "You're using coconuts!",
  "You've got ... coconut[s] and you're bangin' 'em together."
]
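Slicing every other line assumes the speakers strictly alternate. An alternative sketch, under the assumption that every line starts with a speaker label followed by a colon and a space, filters on the label with .startswith() and drops it with .partition(). The name guard_lines is made up for this example.

```python
multi_str = """Guard: What? Ridden on a horse?
King Arthur: Yes!
Guard: You're using coconuts!
King Arthur: What?
Guard: You've got ... coconut[s] and you're bangin' 'em together.
"""

# keep only the lines that begin with the Guard label,
# then take the text after the first ': ' separator
guard_lines = [
    line.partition(': ')[2]
    for line in multi_str.splitlines()
    if line.startswith('Guard:')
]

for line in guard_lines:
    print(line)
```

This version keeps working if one speaker happens to have two lines in a row, at the cost of being a little longer.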

11.4 String Formatting (F-Strings)

Formatting strings allows you to specify a generic template for a string, and insert variables into the pattern. It can also handle various ways to visually represent strings—for example, showing two decimal values in a float, or showing a number as a percentage instead of a decimal value.

String formatting can even help when you want to print something to the console. Instead of just printing out the variable, you can print a string that provides hints about the value that is printed.

This chapter will only talk about “formatted literal strings”, also known as f-strings, which were introduced in Python 3.6. Older C-Style formatting and the .format() method have been moved to Appendix W.1 and Appendix W.2, respectively.

To create an f-string, we will write our strings as f"":

s = f"hello"
print(s)
hello

The f prefix tells Python that the string is an f-string. This now allows us to use { } within the string to put in Python variables or calculations.

num = 7
s = f"I only know {num} digits of pi."
print(s)
I only know 7 digits of pi.

This allows us to create readable strings using Python variables. You can put different types of objects within an f-string.

const = "e"
value = 2.718
s = f"Some digits of {const}: {value}"
print(s)
Some digits of e: 2.718

lat = "40.7815° N"
lon = "73.9733° W"
s = f"Hayden Planetarium Coordinates: {lat}, {lon}"
print(s)
Hayden Planetarium Coordinates: 40.7815° N, 73.9733° W

Variables can be reused within an f-string.

word = "scratch"

s = f"""Black Knight: 'Tis but a {word}.
King Arthur: A {word}? Your arm's off!
"""
print(s)
Black Knight: 'Tis but a scratch.
King Arthur: A scratch? Your arm's off!

11.4.1 Formatting Numbers

Numbers can also be formatted.

p = 3.14159265359
print(f"Some digits of pi: {p}")
Some digits of pi: 3.14159265359

You can specify how to format a placeholder by using the optional colon character, :, followed by the format specification mini-language2 to change how the value appears in the string. Here is an example that formats a number with thousands-place comma separators.

2. String formatting mini-language: https://docs.python.org/3.4/library/string.html#format-string-syntax

digits = 67890
s = f"In 2005, Lu Chao of China recited {digits:,} digits of pi."
print(s)
In 2005, Lu Chao of China recited 67,890 digits of pi.

The formatting mini-language also supports how many decimal values are displayed.

prop = 7 / 67890
s = f"I remember {prop:.4} or {prop:.4%} of what Lu Chao recited."
print(s)
I remember 0.0001031 or 0.0103% of what Lu Chao recited.

We can also use the formatting mini-language to left pad a digit with 0.

id = 42
print(f"My ID number is {id:05d}")
My ID number is 00042

In :05d, the colon tells us we are going to provide a formatting pattern, the 0 is the character used to pad, the 5 is the total width of the result, and the d says the value is formatted as an integer.

The formatting mini-language is not the only option; many of the built-in string methods can produce the same result.

id_zfill = "42".zfill(5)
print(f"My ID number is {id_zfill}")
My ID number is 00042

Or we can put a Python expression directly in the f-string.

print(f"My ID number is {'42'.zfill(5)}")
My ID number is 00042

It is usually better to do all the function calls before creating the f-string, so all you are passing into the f-string is a variable. This just makes the code easier to read.
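The format specification mini-language also controls fixed decimal places, field width, and alignment. The sketch below shows a few of the more common specifiers; the variable names are hypothetical, and the square brackets are only there to make the padding visible.

```python
value = 3.14159265359
name = "Arthur"

# .2f: fixed-point notation with two digits after the decimal point
two_places = f"{value:.2f}"
print(two_places)

# 10.3f: three decimal places, right-aligned in a field 10 characters wide
wide = f"{value:10.3f}"
print(repr(wide))

# <, >, and ^ align text to the left, right, and center of the field
left = f"[{name:<10}]"
right = f"[{name:>10}]"
center = f"[{name:^10}]"
print(left)
print(right)
print(center)
```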

11.5 Regular Expressions (RegEx)

When the base Python string methods that search for patterns aren’t enough, you can throw the kitchen sink at the problem by using regular expressions (regex). The extremely powerful regular expressions provide a (nontrivial) way to find and match patterns in strings. The downside is that after you finish writing a complex regular expression, it becomes difficult to figure out what the pattern does by looking at it. That is, the syntax is difficult to read.

For many data tasks, such as matching a telephone number or address field validation, it’s almost easier to Google which type of pattern you are trying to match, and paste what someone has already written into your own code (don’t forget to document where you got the pattern from).

Before continuing, you might want to visit regex101.3 It’s a great place and reference for regular expressions and testing patterns on test strings. It even has a Python mode, so you can directly copy/paste a pattern from the site into your own Python code.

Regular expressions in Python use the re module.4 This module also has a great How To5 that can be used as an additional resource.

3. Regex101 website: https://regex101.com/

4. re module documentation: https://docs.python.org/3/library/re.html

5. Regular Expression HOWTO: https://docs.python.org/3/howto/regex.html#regex-howto

Table 11.5 and Table 11.6 show some RegEx syntax and special characters that will be used in this section.

Table 11.5 Basic RegEx Syntax

Syntax   Description
.        Matches any one character
^        Matches from the beginning of a string
$        Matches from the end of a string
*        Matches zero or more repetitions of the previous character
+        Matches one or more repetitions of the previous character
?        Matches zero or one repetition of the previous character
{m}      Matches m repetitions of the previous character
{m,n}    Matches any number from m to n of the previous character
\        Escape character
[ ]      A set of characters (e.g., [a-z] will match all letters from a to z)
|        OR; A | B will match A or B
( )      Matches the pattern specified within the parentheses exactly

Table 11.6 RegEx Special Characters

Sequence   Description
\d         A digit
\D         Any character NOT a digit (opposite of \d)
\s         Any whitespace character
\S         Any character NOT a whitespace (opposite of \s)
\w         Word characters
\W         Any character NOT a word character (opposite of \w)

To use regular expressions, we write a string that contains the RegEx pattern, and provide a string for the pattern to match. Various functions within re can be used to handle specific needs. Some common tasks are provided in Table 11.7.

Table 11.7 Common RegEx Functions in re

Function      Description
search()      Find the first occurrence of a string
match()       Match from the beginning of a string
fullmatch()   Match the entire string
split()       Split string by the pattern
findall()     Find all non-overlapping matches of a string
finditer()    Similar to findall but returns a Python iterator
sub()         Substitute the matched pattern with the provided string
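The difference between search(), match(), and fullmatch() is easiest to see when the same pattern is run against the same string. The following is a minimal sketch with a made-up test string; the pattern is written as a raw string (the r prefix) so that the backslash in \d reaches the re module unchanged.

```python
import re

text = 'call 1234567890 now'
digits = r'\d{10}'

# match() only looks at the beginning of the string
m_match = re.match(digits, text)
print(m_match)

# search() scans the string for the first location that matches
m_search = re.search(digits, text)
print(m_search)

# fullmatch() requires the whole string to match the pattern
m_full = re.fullmatch(digits, text)
m_full_digits = re.fullmatch(digits, '1234567890')
print(m_full)
print(m_full_digits)

# split() breaks the string apart wherever the pattern matches
pieces = re.split(r'\s+', text)
print(pieces)
```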

11.5.1 Match a Pattern

We will be using the re module to write the regular expression pattern we want to match in a string. Let’s write a pattern that will match 10 digits (the digits for a U.S. telephone number).

import re

tele_num = '1234567890'

There are many ways we can match 10 consecutive digits. We can use the match() function to see if the pattern matches a string. The output of many re functions is a match object.

m = re.match(pattern=r'\d\d\d\d\d\d\d\d\d\d', string=tele_num)
print(type(m))
<class 're.Match'>
print(m)
<re.Match object; span=(0, 10), match='1234567890'>

If we look at the printed match object, we see that, if there was a match, the span identifies the index of the string where the matches occurred, and the match identifies the exact string that got matched.

Many times when we are matching a pattern to a string, we simply want a True or False value indicating whether there was a match. If you just need a True/False value returned, you can run the built-in bool() function to get the boolean value of the match object.

print(bool(m))
True

At other times, a regular expression match will be part of an if statement (Appendix X), so this kind of bool() casting is unnecessary.

# should print match
if m:
  print('match')
else:
  print('no match')
match

If we wanted to extract some of the match object values, such as the index positions or the actual string that matched, we can use a few methods on the match object.

# get the first index of the string match
print(m.start())
0
# get the index one past the end of the string match
print(m.end())
10
# get the start and end positions of the string match
print(m.span())
(0, 10)
# the string that matched the pattern
print(m.group())
1234567890

Telephone numbers can be a little more complex than a series of 10 consecutive digits. Here’s another common representation.

tele_num_spaces = '123 456 7890'

Suppose we use the previous pattern in this example.

# we can simplify the previous pattern
m = re.match(pattern=r'\d{10}', string=tele_num_spaces)
print(m)
None

You can tell the pattern did not match because match() returned None instead of a match object. If we run our if statement again, it will print 'no match'.

if m:
   print('match')
else:
   print('no match')
no match

Let’s modify our pattern this time, by assuming the new string has three digits, a space, another three digits, and another space, followed by four digits. If we want to make it general to the original example, the spaces can be matched zero or one time. The new RegEx pattern will look like the following code:

# you may see the RegEx pattern as a separate variable
# because it can get long and
# make the actual match function call hard to read
p = r'\d{3}\s?\d{3}\s?\d{4}'
m = re.match(pattern=p, string=tele_num_spaces)
print(m)
<re.Match object; span=(0, 12), match='123 456 7890'>

Area codes can also be surrounded by parentheses, and a dash can separate the seven main digits.

tele_num_space_paren_dash = '(123) 456-7890'
p = r'\(?\d{3}\)?\s?\d{3}\s?-?\d{4}'
m = re.match(pattern=p, string=tele_num_space_paren_dash)
print(m)
<re.Match object; span=(0, 14), match='(123) 456-7890'>

Finally, there could be a country code before the number.

cnty_tele_num_space_paren_dash = '+1 (123) 456-7890'
p = r'\+?1\s?\(?\d{3}\)?\s?\d{3}\s?-?\d{4}'
m = re.match(pattern=p, string=cnty_tele_num_space_paren_dash)
print(m)
<re.Match object; span=(0, 17), match='+1 (123) 456-7890'>

As these examples suggest, although powerful, regular expressions can easily become unwieldy. Even something as simple as a telephone number can lead to a daunting series of symbols and numbers. Even so, sometimes regular expressions are the only way to get something done.
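Because a pattern like this is hard to verify by eye, it can help to run it against a handful of candidate strings, including ones that should fail. The sketch below uses fullmatch() so that trailing characters are not silently ignored; the list of candidates and the results dictionary are invented for the illustration.

```python
import re

# country code pattern, written as a raw string
p = r'\+?1\s?\(?\d{3}\)?\s?\d{3}\s?-?\d{4}'

candidates = [
    '+1 (123) 456-7890',
    '11234567890',
    '1 123 456 7890',
    '(123) 456-7890',    # no country code
    '+1 (123 456-7890',  # unbalanced parenthesis
]

# record whether each candidate matches the entire pattern
results = {c: bool(re.fullmatch(p, c)) for c in candidates}
for candidate, matched in results.items():
    print(f"{candidate!r}: {matched}")
```

Two results are worth noting. The string without a country code is rejected because the 1 is not optional in this pattern, and the string with an unbalanced parenthesis is accepted because each parenthesis is optional on its own. Checks like these are a cheap way to find such gaps.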

11.5.2 Remember What Your RegEx Patterns Are

The last regular expression for a phone number had many complex components. Chances are you will forget what most of the pattern means soon after you write it, let alone when you eventually come back to review your code.

Let’s see how we can re-write the last example in a more maintainable way, by utilizing one of the quirks of the Python language.

In Python, two strings next to each other will be concatenated and joined together into a single string.

"multiple" "strings" "next" "to" "each" "other"
'multiplestringsnexttoeachother'

Note that no extra delimiter, space, or character is added between subsequent strings, they are just concatenated together.

That also means that we could break up our long pattern string across multiple lines. We can tell Python to treat all the separate strings as a single value that we can assign to a variable by wrapping the statement in a pair of round parentheses, ( ).

p = (
  r'\+?'
  r'1'
  r'\s?'
  r'\(?'
  r'\d{3}'
  r'\)?'
  r'\s?'
  r'\d{3}'
  r'\s?'
  r'-?'
  r'\d{4}'
)
print(p)
\+?1\s?\(?\d{3}\)?\s?\d{3}\s?-?\d{4}

Now that we have our code across multiple lines, we can add comments to our string, as if it was regular Python code.

p = (
  r'\+?'     # maybe starts with a +
  r'1'       # the number 1
  r'\s?'     # maybe there's a whitespace
  r'\(?'     # maybe there's an open round parenthesis (
  r'\d{3}'   # 3 numbers
  r'\)?'     # maybe there's a closing round parenthesis )
  r'\s?'     # maybe there's a whitespace
  r'\d{3}'   # 3 numbers
  r'\s?'     # maybe there's a whitespace
  r'-?'      # maybe there's a dash character
  r'\d{4}'   # 4 numbers
)
print(p)
\+?1\s?\(?\d{3}\)?\s?\d{3}\s?-?\d{4}

This technique allows you to write your regular expressions in a manner that you can understand later on, and make it easier to debug the pattern if something is not matching as you expect.

cnty_tele_num_space_paren_dash = '+1 (123) 456-7890'
m = re.match(pattern=p, string=cnty_tele_num_space_paren_dash)
print(m)
<re.Match object; span=(0, 17), match='+1 (123) 456-7890'>
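The re module also offers a built-in route to commented patterns: the re.VERBOSE flag, which ignores whitespace inside the pattern and treats # as the start of a comment. This is an alternative to the string concatenation approach above, not a replacement for it; the sketch assumes the same country code pattern.

```python
import re

# with re.VERBOSE, whitespace and comments inside the pattern are ignored
p = re.compile(
    r"""
    \+?      # maybe starts with a +
    1        # the number 1
    \s?      # maybe there's a whitespace
    \(?      # maybe there's an open round parenthesis (
    \d{3}    # 3 numbers
    \)?      # maybe there's a closing round parenthesis )
    \s?      # maybe there's a whitespace
    \d{3}    # 3 numbers
    \s?      # maybe there's a whitespace
    -?       # maybe there's a dash character
    \d{4}    # 4 numbers
    """,
    flags=re.VERBOSE,
)

m = p.match('+1 (123) 456-7890')
print(m)
```

One trade-off is that a literal space or # in a verbose pattern has to be escaped or placed inside a character class, so the flag suits patterns like this one that rely on \s rather than literal spaces.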

11.5.3 Find a Pattern

We can use the findall() function to find all matches of a pattern within a string. Let’s write a pattern that matches digits and use it to find all the digits from a string.

# python will concatenate 2 strings next to each other
s = (
  "14 Ncuti Gatwa, "
  "13 Jodie Whittaker, war John Hurt, 12 Peter Capaldi, "
  "11 Matt Smith, 10 David Tennant, 9 Christopher Eccleston"
)

print(s)
14 Ncuti Gatwa, 13 Jodie Whittaker, war John Hurt, 12 Peter Capaldi,
11 Matt Smith, 10 David Tennant, 9 Christopher Eccleston
# pattern to match 1 or more digits
p = r"\d+"

m = re.findall(pattern=p, string=s)
print(m)
['14', '13', '12', '11', '10', '9']
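findall() returns plain strings, so the positions of the matches are lost. When the positions matter, finditer() yields a match object for each hit instead. The sketch below uses a shortened version of the string; the names found and as_ints are made up for the example.

```python
import re

s = "14 Ncuti Gatwa, 13 Jodie Whittaker"

# each item from finditer() is a match object with .group() and .span()
found = [(m.group(), m.span()) for m in re.finditer(r'\d+', s)]
for text, span in found:
    print(text, span)

# findall() results are strings; convert them if numbers are needed
as_ints = [int(d) for d in re.findall(r'\d+', s)]
print(as_ints)
```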

11.5.4 Substitute a Pattern

In our str.replace() example (Section 11.3.2), we wanted to get all the lines from the Guard, so we ended up doing a direct string replacement on the script. However, using regular expressions, we can generalize the pattern so we can get either the line from the Guard or the line from King Arthur.

multi_str = """Guard: What? Ridden on a horse?
King Arthur: Yes!
Guard: You're using coconuts!
King Arthur: What?
Guard: You've got ... coconut[s] and you're bangin' 'em together.
"""

p = r'\w+\s?\w+:\s?'

s = re.sub(pattern=p, string=multi_str, repl='')
print(s)
What? Ridden on a horse?
Yes!
You're using coconuts!
What?
You've got ... coconut[s] and you're bangin' 'em together.

Now we can get either party’s lines by splitting the result into lines and slicing the list with increments.

guard = s.splitlines()[ ::2]
kinga = s.splitlines()[1::2] # skip the first element
print(guard)
[
  "What? Ridden on a horse?",
  "You're using coconuts!",
  "You've got ... coconut[s] and you're bangin' 'em together."
]
print(kinga)
[
  "Yes!",
  "What?"
]

Don’t be afraid to mix and match regular expressions with the simpler pattern match and string methods.
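Another option, sketched here under the assumption that every line has the form "speaker: text", is to capture the speaker and the spoken text with parentheses and let findall() return them as pairs. This removes the reliance on the speakers alternating. The names pairs, guard, and kinga are illustrative.

```python
import re

multi_str = """Guard: What? Ridden on a horse?
King Arthur: Yes!
Guard: You're using coconuts!
King Arthur: What?
Guard: You've got ... coconut[s] and you're bangin' 'em together.
"""

# group 1 captures the speaker, group 2 captures the rest of the line
# (. does not match a newline, so .* stops at the end of each line)
p = r'(\w+\s?\w+):\s?(.*)'

# with two groups, findall() returns a list of (speaker, text) tuples
pairs = re.findall(p, multi_str)

guard = [text for speaker, text in pairs if speaker == 'Guard']
kinga = [text for speaker, text in pairs if speaker == 'King Arthur']
print(guard)
print(kinga)
```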

11.5.5 Compile a Pattern

When we work with data, typically many operations will occur on a column-by-column or row-by-row basis. Python’s re module allows you to compile() a pattern so it can be reused. This can lead to performance benefits, especially if your data set is large. Here we will see how to compile a pattern and use it just as we did in the previous examples in this section.

The syntax is almost the same. We write our regular expression pattern, but this time, instead of saving it to a variable directly, we pass the string into the compile() function and save that result. We can then use the other re functions on the compiled pattern. Also, since the pattern is already compiled, you no longer need to specify the pattern parameter in the method.

Here is the match() example:

# pattern to match 10 digits
p = re.compile(r'\d{10}')
s = '1234567890'

# note: calling match on the compiled pattern
# not using the re.match function
m = p.match(s)
print(m)
<re.Match object; span=(0, 10), match='1234567890'>

The findall() example:

p = re.compile(r'\d+')
s = (
  "14 Ncuti Gatwa, "
  "13 Jodie Whittaker, war John Hurt, 12 Peter Capaldi, "
  "11 Matt Smith, 10 David Tennant, 9 Christopher Eccleston"
)

m = p.findall(s)
print(m)
['14', '13', '12', '11', '10', '9']

The sub() or substitution example:

p = re.compile(r'\w+\s?\w+:\s?')
s = "Guard: You're using coconuts!"

m = p.sub(string=s, repl='')
print(m)
You're using coconuts!
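The reuse that compile() is meant for usually looks like a loop or comprehension over many values. The following is a minimal sketch with a hypothetical list of strings standing in for a column of data.

```python
import re

# compile once, then reuse the pattern for every value
p = re.compile(r'\d{3}\s?\d{3}\s?\d{4}')

numbers = ['1234567890', '123 456 7890', '123-456-7890', 'not a number']

# fullmatch() on the compiled pattern checks each whole string
is_phone = [bool(p.fullmatch(n)) for n in numbers]
print(is_phone)
```

The dashed number fails here because this particular pattern only allows optional whitespace between the groups of digits.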

11.6 The regex Library

The re library is popular because it comes with the Python installation. However, seasoned regular expression writers may find the regex library to have more comprehensive features. It is backward compatible with the re library, so all the code from the re RegEx section (Section 11.5) will still work with the regex library. The documentation for this library can be found on the PyPI page.6

6. regex documentation: https://pypi.python.org/pypi/regex

import regex

# a re example using the regex library
p = regex.compile(r'\d+')
s = (
  "14 Ncuti Gatwa, "
  "13 Jodie Whittaker, war John Hurt, 12 Peter Capaldi, "
  "11 Matt Smith, 10 David Tennant, 9 Christopher Eccleston"
)

m = p.findall(s)
print(m)
['14', '13', '12', '11', '10', '9']

I will defer to the examples and explanations on http://www.rexegg.com/ for more details.

Conclusion

The world is filled with data stored as text. Understanding how to manipulate text strings is a fundamental skill for the data scientist. Python has many built-in string methods and libraries that can make string and text manipulation easier. This chapter covered some of the fundamental methods of string manipulations that we can build on when working with data.
