The most important property of a program is whether it accomplishes the intention of its user.
C.A.R. Hoare
In This Chapter
This chapter covers some Python Standard Library components that are powerful tools for both data science and general Python use. It starts with various ways to sort data and then moves to reading and writing files using context managers. Next, this chapter looks at representing time with datetime objects. Finally, this chapter covers searching text using the powerful regular expression library. It is important to have at least a high-level understanding of these topics, as they are all used heavily in production programming. This chapter should give you enough familiarity with these topics that you will recognize them when you need them.
Some Python data structures, such as lists, NumPy arrays and Pandas DataFrames, have built-in sorting capabilities. You can use these data structures out of the box or customize them with your own sorting functions.
For Python lists you can use the built-in sort() method, which sorts a list in place. For example, say that you define a list of strings representing whales:
whales = [ 'Blue', 'Killer', 'Sperm', 'Humpback', 'Beluga', 'Bowhead' ]
If you now use this list’s sort() method as follows:
whales.sort()
you see that the list is now sorted alphabetically:
whales
['Beluga', 'Blue', 'Bowhead', 'Humpback', 'Killer', 'Sperm']
This method does not return a copy of the list. If you capture the return value, you see that it is None:
return_value = whales.sort()
print(return_value)
None
If you want to create a sorted copy of a list, you can use Python’s built-in sorted() function, which returns a new sorted list:
sorted(whales)
['Beluga', 'Blue', 'Bowhead', 'Humpback', 'Killer', 'Sperm']
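The difference between the two approaches can be seen side by side. The following is a minimal sketch that uses a short, made-up list (not the chapter's full whales list) to show that sorted() leaves its input alone while list.sort() changes the list and returns None.

```python
# Hypothetical sample data, chosen only for illustration.
whales = ['Sperm', 'Blue', 'Killer']

# sorted() builds a new list and leaves the original untouched.
ordered_copy = sorted(whales)
print(whales)        # ['Sperm', 'Blue', 'Killer']
print(ordered_copy)  # ['Blue', 'Killer', 'Sperm']

# list.sort() reorders the same list in place and returns None.
result = whales.sort()
print(whales)  # ['Blue', 'Killer', 'Sperm']
print(result)  # None
```

A reasonable rule of thumb is to use sorted() when the original order still matters elsewhere in the program and list.sort() when it does not.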
You can use sorted() on any iterable, including lists, strings, sets, tuples, and dictionaries. Regardless of the iterable type, this function returns a sorted list. If you call it on a string, it returns a sorted list of the string’s characters:
sorted("Moby Dick")
[' ', 'D', 'M', 'b', 'c', 'i', 'k', 'o', 'y']
Both the list.sort() method and the sorted() function take an optional reverse parameter, which defaults to False. Setting it to True sorts in descending order:
sorted(whales, reverse=True)
['Sperm', 'Killer', 'Humpback', 'Bowhead', 'Blue', 'Beluga']
Both list.sort() and sorted() also take an optional key argument, a function that is called on each item to produce the value used for comparison. To sort whales by the length of the strings, for example, you can define a lambda that returns the string length and pass it as the key:
sorted(whales, key=lambda x: len(x))
['Blue', 'Sperm', 'Beluga', 'Killer', 'Bowhead', 'Humpback']
You can also define more complex key functions. The following example shows how to define a function that returns the length of a string, unless that string is 'Beluga', in which case it returns 1. This means that as long as the other strings have a length greater than 1, the key function will sort the list by string length, except for 'Beluga', which is placed first:
def beluga_first(item):
    if item == 'Beluga':
        return 1
    return len(item)

sorted(whales, key=beluga_first)
['Beluga', 'Blue', 'Sperm', 'Killer', 'Bowhead', 'Humpback']
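A key function can also return a tuple, in which case items are compared element by element. The sketch below is an illustration rather than part of the chapter's listings: the helper name length_then_name is made up, and it sorts by string length first and alphabetically among strings of equal length.

```python
whales = ['Blue', 'Killer', 'Sperm', 'Humpback', 'Beluga', 'Bowhead']

def length_then_name(item):
    # Tuples compare element by element: length first, then the name itself.
    return (len(item), item)

by_length_then_name = sorted(whales, key=length_then_name)
print(by_length_then_name)
# ['Blue', 'Sperm', 'Beluga', 'Killer', 'Bowhead', 'Humpback']
```

Here 'Beluga' lands before 'Killer' because both have six characters and 'Beluga' comes first alphabetically.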
You can also use sorted() with classes that you define. Listing 15.1 defines the class Food and creates four instances of it. It then sorts the instances by using the attribute rating as a sort key.
class Food:
    def __init__(self, rating, name):
        self.rating = rating
        self.name = name

    def __repr__(self):
        return f'Food({self.rating}, {self.name})'

foods = [Food(3, 'Banana'),
         Food(9, 'Orange'),
         Food(2, 'Tomato'),
         Food(1, 'Olive')]
foods
[Food(3, Banana), Food(9, Orange), Food(2, Tomato), Food(1, Olive)]
sorted(foods, key=lambda x: x.rating)
[Food(1, Olive), Food(2, Tomato), Food(3, Banana), Food(9, Orange)]
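As an alternative to writing a lambda, the standard library's operator.attrgetter() builds a key function from an attribute name. The following sketch redefines a small Food class so that it runs on its own; the variable names by_rating and best_first are chosen here for illustration.

```python
from operator import attrgetter

class Food:
    def __init__(self, rating, name):
        self.rating = rating
        self.name = name

    def __repr__(self):
        return f'Food({self.rating}, {self.name})'

foods = [Food(3, 'Banana'), Food(9, 'Orange'),
         Food(2, 'Tomato'), Food(1, 'Olive')]

# attrgetter('rating') behaves like lambda x: x.rating
by_rating = sorted(foods, key=attrgetter('rating'))
print(by_rating)
# [Food(1, Olive), Food(2, Tomato), Food(3, Banana), Food(9, Orange)]

# Combine with reverse=True to put the highest rating first.
best_first = sorted(foods, key=attrgetter('rating'), reverse=True)
print(best_first)
# [Food(9, Orange), Food(3, Banana), Food(2, Tomato), Food(1, Olive)]
```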
If you call sorted() on a dictionary, it returns a sorted list of the dictionary’s keys. As of Python 3.7 (see https://docs.python.org/3/whatsnew/3.7.html), dictionary keys appear in the order in which they were inserted into the dictionary. Listing 15.2 creates a dictionary of whale weights based on data from https://www.whalefacts.org/how-big-are-whales/. It prints the dictionary keys to demonstrate that they retain the order in which they were inserted. You then use sorted() to get a list of key names sorted alphabetically and print out the whale names and weights, in order.
weights = {'Blue': 300000,
           'Killer': 12000,
           'Sperm': 100000,
           'Humpback': 78000,
           'Beluga': 3500,
           'Bowhead': 200000}
for key in weights:
    print(key)
Blue
Killer
Sperm
Humpback
Beluga
Bowhead
sorted(weights)
['Beluga', 'Blue', 'Bowhead', 'Humpback', 'Killer', 'Sperm']
for key in sorted(weights):
    print(f'{key} {weights[key]}')
Beluga 3500
Blue 300000
Bowhead 200000
Humpback 78000
Killer 12000
Sperm 100000
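Sorting a dictionary by its values rather than its keys follows from the key argument described earlier. This is a minimal sketch, not one of the chapter's listings: it passes the dictionary's own get() method as the key function so that each whale name is ordered by its weight. The name heaviest_first is made up for the example.

```python
weights = {'Blue': 300000, 'Killer': 12000, 'Sperm': 100000,
           'Humpback': 78000, 'Beluga': 3500, 'Bowhead': 200000}

# weights.get(name) returns the weight, so names are ordered by weight.
heaviest_first = sorted(weights, key=weights.get, reverse=True)

for name in heaviest_first:
    print(f'{name} {weights[name]}')
# Blue 300000
# Bowhead 200000
# Sperm 100000
# Humpback 78000
# Killer 12000
# Beluga 3500
```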
Pandas DataFrames have a sorting method, .sort_values(), which takes a column name or a list of column names to sort by (see Listing 15.3). When you pass a list, rows are ordered by the first column, and ties are broken by the following columns.
import pandas as pd
data = {'first': ['Dan', 'Barb', 'Bob'],
        'last': ['Huerando', 'Pousin', 'Smith'],
        'score': [0, 143, 99]}
df = pd.DataFrame(data)
df
   first      last  score
0    Dan  Huerando      0
1   Barb    Pousin    143
2    Bob     Smith     99
df.sort_values(by=['last', 'first'])
   first      last  score
0    Dan  Huerando      0
1   Barb    Pousin    143
2    Bob     Smith     99
You have already seen that Pandas can read various files directly into a DataFrame.
At times, you will want to read and write file data without using Pandas. Python has a built-in function, open(), that, given a path, returns an open file object. The following example shows how I open a configuration file from my home directory (although you can use any file path the same way):
read_me = open('/Users/kbehrman/.vimrc')
read_me
<_io.TextIOWrapper name='/Users/kbehrman/.vimrc' mode='r' encoding='UTF-8'>
You can read a single line from a file object by using the .readline() method. The returned string includes the trailing newline character:
read_me.readline()
'set nocompatible\n'
The file object keeps track of your place in the file. With each subsequent call to .readline(), the next line is returned as a string:
read_me.readline()
'filetype off\n'
It is important to close a file when you are done with it, or the open handle may interfere with the ability to open the file again. You do this with the close() method:
read_me.close()
Using a context manager compound statement is a way to automatically close files. This type of statement starts with the keyword with and closes the file when its block exits. The following example opens a file by using a context manager and reads it by using the readlines() method:
with open('/Users/kbehrman/.vimrc') as open_file:
    data = open_file.readlines()
data[0]
'set nocompatible\n'
The file contents are read as a list of strings and assigned to the variable named data. The context is then exited, and the file object is automatically closed.
When you open a file, the file object is ready to read as text by default. You can specify other modes, such as read binary ('rb'), write ('w'), and write binary ('wb'). The following example uses the 'w' argument to write a new file:
text = 'My intriguing story'
with open('/Users/kbehrman/my_new_file.txt', 'w') as open_file:
    open_file.write(text)
Here’s how you can check to make sure the file is indeed created:
!ls /Users/kbehrman
Applications  Downloads     Movies    Public
Desktop       Google Drive  Music     my_new_file.txt
Documents     Library       Pictures  sample.json
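The same check can be made from Python itself. The following is a minimal sketch under stated assumptions: it writes to a scratch directory created with tempfile.mkdtemp() instead of a real home directory, and it uses the file object's closed attribute and read() method, neither of which is shown elsewhere in this chapter.

```python
import os
import tempfile

# Hypothetical scratch location so the sketch does not touch a real home directory.
path = os.path.join(tempfile.mkdtemp(), 'my_new_file.txt')

text = 'My intriguing story'
with open(path, 'w') as open_file:
    open_file.write(text)

# Once the with block ends, the file object has been closed.
print(open_file.closed)      # True
print(os.path.exists(path))  # True

# Reopen in the default read-text mode and read the whole file back.
with open(path) as open_file:
    read_back = open_file.read()
print(read_back == text)     # True
```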
JSON is a common format for transmitting and storing data. The Python Standard Library includes a module for translating to and from JSON. This module can translate between JSON strings and Python types. This example shows how to open and read a JSON file:
import json
with open('/Users/kbehrman/sample.json') as open_file:
    data = json.load(open_file)
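Translation works in both directions. The sketch below is an illustration only: the record dictionary and the temporary file path are made up, and it uses json.dump() and json.load() for files alongside json.dumps() and json.loads() for strings.

```python
import json
import os
import tempfile

# Hypothetical data and scratch path, used only for this illustration.
record = {'name': 'Beluga', 'weight': 3500, 'tags': ['white', 'arctic']}
path = os.path.join(tempfile.mkdtemp(), 'sample.json')

# Write the dictionary to a file as JSON, then read it back.
with open(path, 'w') as open_file:
    json.dump(record, open_file)
with open(path) as open_file:
    loaded = json.load(open_file)
print(loaded == record)  # True

# The same translation without a file: Python object to string and back.
as_text = json.dumps(record)
from_text = json.loads(as_text)
print(type(as_text).__name__)  # str
print(from_text == record)     # True
```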
datetime Objects
Data that models values over time, called time series data, is commonly used in solving data science problems. In order to use this kind of data, you need a way to represent time. One common way is to use strings. If you need more functionality, such as the ability to easily add and subtract or easily pull out values for year, month, and day, you need something more sophisticated. The datetime library offers various ways to model time along with useful functionality for time value manipulation. The datetime.datetime class represents a moment in time down to the microsecond. Listing 15.4 demonstrates how to create a datetime object and access some of its values.
from datetime import datetime
dt = datetime(2022, 10, 1, 13, 59, 33, 10000)
dt
datetime.datetime(2022, 10, 1, 13, 59, 33, 10000)
dt.year
2022
dt.month
10
dt.day
1
dt.hour
13
dt.minute
59
dt.second
33
dt.microsecond
10000
You can get an object for the current time by using the datetime.now() method:
datetime.now()
datetime.datetime(2021, 3, 7, 13, 25, 22, 984991)
You can translate strings to datetime objects and datetime objects to strings by using the datetime.strptime() and datetime.strftime() methods. Both of these methods rely on format codes that define how the string should be processed. These format codes are defined in the Python documentation, at https://docs.python.org/3/library/datetime.html#strftime-strptime-behavior.
Listing 15.5 uses the format codes %Y for a four-digit year, %m for a two-digit month, and %d for a two-digit day to create a datetime from a string. You can then use %y, which represents a two-digit year, to create a new string version.
dt = datetime.strptime('1968-06-20', '%Y-%m-%d')
dt
datetime.datetime(1968, 6, 20, 0, 0)
dt.strftime('%m/%d/%y')
'06/20/68'
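The same two methods handle time-of-day fields as well. The following is a small sketch that uses a made-up timestamp string and the additional format codes %H, %M, and %S (hour, minute, and second) from the documentation linked above.

```python
from datetime import datetime

# Hypothetical timestamp string, used only for illustration.
raw = '2022-10-01 13:59:33'

# Parse the string into a datetime object.
stamp = datetime.strptime(raw, '%Y-%m-%d %H:%M:%S')
print(stamp.hour, stamp.minute, stamp.second)  # 13 59 33

# Format the same moment day-first, dropping the seconds.
day_first = stamp.strftime('%d/%m/%Y %H:%M')
print(day_first)  # 01/10/2022 13:59

# Formatting with the original pattern reproduces the original string.
round_trip = stamp.strftime('%Y-%m-%d %H:%M:%S')
print(round_trip == raw)  # True
```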
You can use the datetime.timedelta class to create a new datetime relative to an existing one:
from datetime import timedelta
delta = timedelta(days=3)
dt - delta
datetime.datetime(1968, 6, 17, 0, 0)
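A timedelta is also what you get back when you subtract one datetime from another. The dates in this sketch are chosen arbitrarily for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical start and end moments.
start = datetime(1968, 6, 20)
end = datetime(1968, 7, 4, 12, 30)

# Subtracting two datetime objects yields a timedelta.
elapsed = end - start
print(elapsed)                  # 14 days, 12:30:00
print(elapsed.days)             # 14
print(elapsed.total_seconds())  # 1254600.0

# Adding a timedelta moves a datetime forward.
one_week_later = start + timedelta(weeks=1)
print(one_week_later)           # 1968-06-27 00:00:00
```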
Python 3.9 introduced a new module, called zoneinfo, for setting time zones. With this module, it is easy to set the time zone of a datetime:
from zoneinfo import ZoneInfo
dt = datetime(2032, 10, 14, 23, tzinfo=ZoneInfo("America/Jujuy"))
dt.tzname()
'-03'
Note
As of the writing of this book, Colab is still running Python 3.7, so you may not have access to zoneinfo yet.
The datetime library also includes a datetime.date class:
from datetime import date
date.today()
datetime.date(2021, 3, 7)
This class is similar to datetime.datetime except that it tracks only the date and not the time of day.
The last library covered in this chapter is the regex library, re. Regular expressions (regex) provide a sophisticated language for searching within text. You can define a search pattern as a string and then use it to search target text. At the simplest level, the search pattern can be exactly the text you want to match. The following example defines text containing ship captains and their email addresses. It then searches this text using the re.match() function, which returns a match object:
import re
captains = '''Ahab: ahab@pequod.com
Peleg: peleg@pequod.com
Ishmael: ishmael@pequod.com
Herman: herman@acushnet.io
Pollard: pollard@essex.me'''
re.match("Ahab:", captains)
<re.Match object; span=(0, 5), match='Ahab:'>
You can use the result of this match with an if statement, whose code block will execute only if the text is matched:
if re.match("Ahab:", captains):
    print("We found Ahab")
We found Ahab
The re.match() function matches from the beginning of the string. If you try to match a substring later in the source string, it will not match:
if re.match("Peleg", captains):
    print("We found Peleg")
else:
    print("No Peleg found!")
No Peleg found!
If you want to match any substring contained within text, you use the re.search() function:
re.search("Peleg", captains)
<re.Match object; span=(22, 27), match='Peleg'>
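When no match is found, both functions return None, which is why they work directly in an if statement. The sketch below uses a short, made-up string rather than the captains text to show the two functions side by side.

```python
import re

# Hypothetical sample text, not the chapter's captains string.
text = 'Ahab: captain\nPeleg: owner'

# re.match() only looks at the start of the string, so this returns None.
at_start = re.match('Peleg', text)
print(at_start)  # None

# re.search() scans the whole string and finds the first occurrence.
anywhere = re.search('Peleg', text)
print(anywhere.group())  # Peleg
print(anywhere.span())   # (14, 19)
```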
Character sets provide syntax for defining more generalized matches. The syntax for character sets is some group of characters enclosed in square brackets. To search for the first occurrence of either 0 or 1, you could use this character set:
"[01]"
To search for the first occurrence of a vowel followed by a punctuation mark, you could use this character set:
"[aeiou][!,?.;]"
You can indicate a range of characters in a character set by using a hyphen. For any digit, you would use the syntax [0-9]; for any capital letter, [A-Z]; and for any lowercase letter, [a-z]. You can follow a character set with a + to match one or more instances. You can follow a character set with a number in curly brackets to match that exact number of occurrences in a row. Listing 15.6 demonstrates the use of character sets.
re.search("[A-Z][a-z]", captains)
<re.Match object; span=(0, 2), match='Ah'>
re.search("[A-Za-z]+", captains)
<re.Match object; span=(0, 4), match='Ahab'>
re.search("[A-Za-z]{7}", captains)
<re.Match object; span=(46, 53), match='Ishmael'>
re.search(r"[a-z]+@[a-z]+\.[a-z]+", captains)
<re.Match object; span=(6, 21), match='ahab@pequod.com'>
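As a further illustration of ranges and repetition, the following sketch searches a made-up sentence for a phone-style number. The text and variable names are assumptions chosen for this example only.

```python
import re

# Hypothetical sample text, not taken from the chapter's data.
text = 'Call 555-0142 before 9pm'

# Exactly three digits, a hyphen, then exactly four digits.
phone = re.search(r'[0-9]{3}-[0-9]{4}', text)
print(phone.group())  # 555-0142
print(phone.span())   # (5, 13)

# One or more digits: the search stops at the first run it finds.
first_digits = re.search(r'[0-9]+', text)
print(first_digits.group())  # 555
```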
Character classes are predefined groups of characters supplied for easier matching. You can see the whole list of character classes in the re documentation (see https://docs.python.org/3/library/re.html). Some commonly used character classes are \d for digit characters, \s for whitespace characters, and \w for word characters. Word characters generally match any characters that are commonly used in words as well as numeric digits and underscores.
To search for the first occurrence of a digit surrounded by word characters, you could use "\w\d\w":
re.search(r"\w\d\w", "His panic over Y2K was overwhelming.")
<re.Match object; span=(15, 18), match='Y2K'>
You can use the + or curly brackets to indicate multiple consecutive occurrences of a character class in the same way you do with character sets:
re.search(r"\w+@\w+\.\w+", captains)
<re.Match object; span=(6, 21), match='ahab@pequod.com'>
If you enclose parts of a regular expression pattern in parentheses, they become a group. You can access groups on a match object by using the group() method. Groups are numbered, with group 0 being the whole match:
m = re.search(r"(\w+)@(\w+)\.(\w+)", captains)
print(f'Group 0 is {m.group(0)}')
Group 0 is ahab@pequod.com
print(f'Group 1 is {m.group(1)}')
Group 1 is ahab
print(f'Group 2 is {m.group(2)}')
Group 2 is pequod
print(f'Group 3 is {m.group(3)}')
Group 3 is com
It is often useful to refer to groups by names rather than by using numbers. The syntax for defining a named group is as follows:
(?P<GROUP_NAME>PATTERN)
You can then get groups by using the group names instead of their numbers:
m = re.search(r"(?P<name>\w+)@(?P<SLD>\w+)\.(?P<TLD>\w+)", captains)
print(f'''
Email address: {m.group()}
Name: {m.group("name")}
Secondary level domain: {m.group("SLD")}
Top level Domain: {m.group("TLD")}''')

Email address: ahab@pequod.com
Name: ahab
Secondary level domain: pequod
Top level Domain: com
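Match objects can also hand back all groups at once. This sketch, run against a single made-up line, uses the match object's groupdict() and groups() methods, which are part of the re module but are not otherwise covered in this chapter. The pattern is written as a raw string (r'...') so that Python passes the backslashes through to the regex engine unchanged.

```python
import re

# Same named-group pattern as above, written as a raw string.
pattern = r'(?P<name>\w+)@(?P<SLD>\w+)\.(?P<TLD>\w+)'

# Hypothetical single line of text for illustration.
m = re.search(pattern, 'Herman: herman@acushnet.io')

# groupdict() maps each group name to the text it matched.
print(m.groupdict())
# {'name': 'herman', 'SLD': 'acushnet', 'TLD': 'io'}

# groups() returns the same matches as a tuple, in group order.
print(m.groups())
# ('herman', 'acushnet', 'io')
```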
Until now, you have only been able to find the first occurrence of a match. You can use the re.findall() function to match all occurrences. This function returns each match as a string:
re.findall(r"\w+@\w+\.\w+", captains)
['ahab@pequod.com', 'peleg@pequod.com', 'ishmael@pequod.com', 'herman@acushnet.io', 'pollard@essex.me']
If you have defined groups, re.findall() returns each match as a tuple of strings, with each string being the match for a group:
re.findall(r"(?P<name>\w+)@(?P<SLD>\w+)\.(?P<TLD>\w+)", captains)
[('ahab', 'pequod', 'com'),
 ('peleg', 'pequod', 'com'),
 ('ishmael', 'pequod', 'com'),
 ('herman', 'acushnet', 'io'),
 ('pollard', 'essex', 'me')]
If you are searching for all matches in a large text, you can use re.finditer(). This function returns an iterator, which yields the next match with each iteration:
iterator = re.finditer(r"\w+@\w+\.\w+", captains)
print(f"An {type(iterator)} object is returned by finditer")
An <class 'callable_iterator'> object is returned by finditer
m = next(iterator)
f"""The first match, {m.group()} is processed without processing the rest of the text"""
'The first match, ahab@pequod.com is processed without processing the rest of the text'
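In practice, the iterator is usually consumed with a for loop rather than with next(). The following sketch runs over a short, made-up string and records where each match starts.

```python
import re

# Hypothetical sample text containing two addresses.
text = 'ahab@pequod.com, herman@acushnet.io'

found = []
for m in re.finditer(r'\w+@\w+\.\w+', text):
    # Each m is a match object, so start() and group() are available.
    found.append((m.start(), m.group()))

print(found)
# [(0, 'ahab@pequod.com'), (17, 'herman@acushnet.io')]
```

Because matches are produced one at a time, the loop can stop early without scanning the rest of the text.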
You can use regular expressions for substitution as well as for matching. The re.sub() function takes a match pattern, a replacement string, and a source text:
re.sub(r"\d", "#", "Your secret pin is 12345")
'Your secret pin is #####'
You can refer to named groups in a replacement string by using this syntax:
\g<GROUP_NAME>
To reverse the email addresses in the captains text, you could use substitution as follows:
new_text = re.sub(r"(?P<name>\w+)@(?P<SLD>\w+)\.(?P<TLD>\w+)",
                  r"\g<TLD>.\g<SLD>.\g<name>",
                  captains)
print(new_text)
Ahab: com.pequod.ahab
Peleg: com.pequod.peleg
Ishmael: com.pequod.ishmael
Herman: io.acushnet.herman
Pollard: me.essex.pollard
There is some cost to compiling a regular expression pattern. If you are using the same regular expression many times, it is more efficient to compile it once. You do so by using the re.compile() function, which returns a compiled regular expression object based on a match pattern:
regex = re.compile(r"\w+: (?P<name>\w+)@(?P<SLD>\w+)\.(?P<TLD>\w+)")
regex
re.compile(r'\w+: (?P<name>\w+)@(?P<SLD>\w+)\.(?P<TLD>\w+)', re.UNICODE)
This object has methods that map to many of the re functions, such as match(), search(), findall(), finditer(), and sub(), as demonstrated in Listing 15.7.
regex.match(captains)
<re.Match object; span=(0, 21), match='Ahab: ahab@pequod.com'>
regex.search(captains)
<re.Match object; span=(0, 21), match='Ahab: ahab@pequod.com'>
regex.findall(captains)
[('ahab', 'pequod', 'com'),
 ('peleg', 'pequod', 'com'),
 ('ishmael', 'pequod', 'com'),
 ('herman', 'acushnet', 'io'),
 ('pollard', 'essex', 'me')]
new_text = regex.sub(r"Ahoy \g<name>!", captains)
print(new_text)
Ahoy ahab!
Ahoy peleg!
Ahoy ishmael!
Ahoy herman!
Ahoy pollard!
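A typical place for a compiled pattern is a loop that applies the same expression to many inputs. The sketch below is illustrative: the list of lines and the variable names are made up, and the pattern is compiled once and reused on every line.

```python
import re

# Compile once, outside the loop.
email = re.compile(r'(?P<name>\w+)@(?P<SLD>\w+)\.(?P<TLD>\w+)')

# Hypothetical input lines, one of which has no address.
lines = ['Ahab: ahab@pequod.com',
         'no address here',
         'Pollard: pollard@essex.me']

names = []
for line in lines:
    m = email.search(line)  # reuse the compiled pattern
    if m:                   # search() returns None when nothing matches
        names.append(m.group('name'))

print(names)  # ['ahab', 'pollard']
```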
This chapter introduces data sorting, file objects, the datetime library, and the re library. Having at least a passing knowledge of these topics is important for any Python developer. You can do sorting either with the sorted() function or with an object’s sort() method, such as the one attached to list objects. You can open files by using the open() function, and while files are open, you can read from them or write to them. The datetime library models time and is particularly useful when dealing with time series data. Finally, you can use the re library to define complicated text searches.
1. What is the final value of sorted_names in the following example?
names = ['Rolly', 'Polly', 'Molly']
sorted_names = names.sort()
2. How would you sort the list nums = [0, 4, 3, 2, 5] in descending order?
3. What cleanup specific to file objects does a context manager handle?
4. How would you create a datetime object from the following variables?
year = 2022
month = 10
day = 14
hour = 12
minute = 59
second = 11
microsecond = 100
5. What does \d represent in a regular expression pattern?