15

Other Topics

The most important property of a program is whether it accomplishes the intention of its user.

C.A.R. Hoare

In This Chapter

This chapter covers some Python Standard Library components that are powerful tools for both data science and general Python use. It starts with various ways to sort data and then moves to reading and writing files using context managers. Next, this chapter looks at representing time with datetime objects. Finally, this chapter covers searching text using the powerful regular expression library. It is important to have at least a high-level understanding of these topics, as they are all widely used in production code. This chapter should give you enough familiarity with them that you will recognize them when you need them.

Sorting

Some Python data structures, such as lists, NumPy arrays, and Pandas DataFrames, have built-in sorting capabilities. You can use these capabilities out of the box or customize them with your own sorting functions.

Lists

For Python lists you can use the built-in sort() method, which sorts a list in place. For example, say that you define a list of strings representing whales:

whales = [ 'Blue', 'Killer', 'Sperm', 'Humpback', 'Beluga', 'Bowhead' ]

If you now use this list’s sort() method as follows:

whales.sort()

you see that the list is now sorted alphabetically:

whales
['Beluga', 'Blue', 'Bowhead', 'Humpback', 'Killer', 'Sperm']

This method does not return a copy of the list. If you capture the return value, you see that it is None:

return_value = whales.sort()
print(return_value)
None

If you want to create a sorted copy of a list, you can use Python’s built-in sorted() function, which returns a sorted list:

sorted(whales)
['Beluga', 'Blue', 'Bowhead', 'Humpback', 'Killer', 'Sperm']

You can use sorted() on any iterable, including lists, strings, sets, tuples, and dictionaries. Regardless of the iterable type, this function returns a sorted list. If you call it on a string, it returns a sorted list of the string’s characters:

sorted("Moby Dick")
[' ', 'D', 'M', 'b', 'c', 'i', 'k', 'o', 'y']

Both the list.sort() method and the sorted() function take an optional reverse parameter, which defaults to False:

sorted(whales, reverse=True)
['Sperm', 'Killer', 'Humpback', 'Bowhead', 'Blue', 'Beluga']

Both list.sort() and sorted() also take an optional key argument that defines how the items are ordered. The key is a function that is called on each item, and the items are sorted by the values it returns. To sort whales using the length of the strings, for example, you can define a lambda that returns the string length and pass it as the key:

sorted(whales, key=lambda x: len(x))
['Blue', 'Sperm', 'Beluga', 'Killer', 'Bowhead', 'Humpback']

You can also define more complex key functions. The following example shows how to define a function that returns the length of a string, unless that string is 'Beluga', in which case it returns 1. This means that as long as the other strings have a length greater than 1, the key function will sort the list by string length, except for 'Beluga', which is placed first:

def beluga_first(item):
    if item == 'Beluga':
        return 1
    return len(item)

sorted(whales, key=beluga_first)
['Beluga', 'Blue', 'Sperm', 'Killer', 'Bowhead', 'Humpback']
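A key function can also return a tuple, in which case items are compared by the first element and ties are broken by the next. The following is a minimal sketch of that idea; the variable names are illustrative and not part of the earlier examples:

```python
whales = ['Blue', 'Killer', 'Sperm', 'Humpback', 'Beluga', 'Bowhead']

# Sort by length; names of equal length fall back to alphabetical order.
by_length_then_name = sorted(whales, key=lambda name: (len(name), name))
print(by_length_then_name)

# Negating the length puts the longest names first while keeping
# the alphabetical tie-break in ascending order.
longest_first = sorted(whales, key=lambda name: (-len(name), name))
print(longest_first)
```

Negating a numeric part of the key is a common way to mix descending and ascending order in a single sort, which the reverse parameter alone cannot do.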

You can also use sorted() with classes that you define. Listing 15.1 defines the class Food and instantiates four instances of it. It then sorts the instances by using the attribute rating as a sort key.

Listing 15.1 Sorting Objects Using a Lambda

class Food(): 
    def __init__(self, rating, name):
        self.rating = rating
        self.name = name

    def __repr__(self):
        return f'Food({self.rating}, {self.name})'

foods = [Food(3, 'Banana'),
         Food(9, 'Orange'),
         Food(2, 'Tomato'),
         Food(1, 'Olive')]

foods
[Food(3, Banana), Food(9, Orange), Food(2, Tomato), Food(1, Olive)]

sorted(foods, key=lambda x: x.rating)
[Food(1, Olive), Food(2, Tomato), Food(3, Banana), Food(9, Orange)]
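As an alternative to a lambda, the Standard Library's operator.attrgetter() builds a key function from an attribute name. The sketch below redefines a Food class like the one in the listing so that it runs on its own:

```python
from operator import attrgetter

class Food:
    def __init__(self, rating, name):
        self.rating = rating
        self.name = name

    def __repr__(self):
        return f'Food({self.rating}, {self.name})'

foods = [Food(3, 'Banana'), Food(9, 'Orange'),
         Food(2, 'Tomato'), Food(1, 'Olive')]

# attrgetter('rating') behaves like lambda x: x.rating
by_rating = sorted(foods, key=attrgetter('rating'))
print(by_rating)

# The same approach works for any attribute, here combined with reverse
by_name_desc = sorted(foods, key=attrgetter('name'), reverse=True)
print(by_name_desc)
```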

If you call sorted() on a dictionary, it will return a sorted list of the dictionary’s key names. As of Python 3.7 (see https://docs.python.org/3/whatsnew/3.7.html), dictionary keys appear in the order in which they were inserted into the dictionary. Listing 15.2 creates a dictionary of whale weights based on data from https://www.whalefacts.org/how-big-are-whales/. It prints the dictionary keys to demonstrate that they retain the order in which they were inserted. You then use sorted() to get a list of key names sorted alphanumerically and print out the whale names and weights, in order.

Listing 15.2 Sorting Dictionary Keys

weights = {'Blue': 300000,
           'Killer': 12000,
           'Sperm': 100000,
           'Humpback': 78000,
           'Beluga':  3500,
           'Bowhead': 200000 }

for key in weights:
    print(key)
Blue
Killer
Sperm
Humpback
Beluga
Bowhead

sorted(weights)
['Beluga', 'Blue', 'Bowhead', 'Humpback', 'Killer', 'Sperm']

for key in sorted(weights):
    print(f'{key} {weights[key]}')
Beluga 3500
Blue 300000
Bowhead 200000
Humpback 78000
Killer 12000
Sperm 100000
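If you want to order a dictionary by its values rather than its keys, one approach is to sort the key/value pairs returned by .items(). This is a small sketch using the same whale weights; the names by_weight and heaviest are illustrative:

```python
weights = {'Blue': 300000,
           'Killer': 12000,
           'Sperm': 100000,
           'Humpback': 78000,
           'Beluga': 3500,
           'Bowhead': 200000}

# Each item is a (key, value) tuple; sort on the value, heaviest first.
by_weight = sorted(weights.items(), key=lambda pair: pair[1], reverse=True)

for name, weight in by_weight:
    print(f'{name} {weight}')

heaviest = by_weight[0][0]
print(heaviest)
```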

Pandas DataFrames have a sorting method, .sort_values(), which takes a column name or a list of column names to sort by (see Listing 15.3).

Listing 15.3 Sorting Pandas DataFrames

import pandas as pd
data = {'first': ['Dan', 'Barb', 'Bob'],
        'last': ['Huerando', 'Pousin', 'Smith'],
        'score': [0, 143, 99]}

df = pd.DataFrame(data)
df

        first    last       score
0       Dan      Huerando       0
1       Barb     Pousin       143
2       Bob      Smith         99

df.sort_values(by=['last','first'])

        first    last       score
0       Dan      Huerando       0
1       Barb     Pousin       143
2       Bob      Smith         99

Reading and Writing Files

You have already seen that Pandas can read various files directly into a DataFrame. At times, you will want to read and write file data without using Pandas. Python has a built-in function, open(), that, given a path, will return an open file object. The following example shows how I open a configuration file from my home directory (although you can use any file path the same way):

read_me = open('/Users/kbehrman/.vimrc')
read_me
<_io.TextIOWrapper name='/Users/kbehrman/.vimrc' mode='r' encoding='UTF-8'>

You can read a single line from a file object by using the .readline() method:

read_me.readline()
'set nocompatible\n'

The file object keeps track of your place in the file. With each subsequent call to .readline(), the next line is returned as a string:

read_me.readline()
'filetype off\n'

It is important to close a file when you are done with it, or it may interfere with the ability to open the file again. You do this with the close() method:

read_me.close()

Context Managers

Using a context manager compound statement is a way to automatically close files. This type of statement starts with the keyword with and closes the file when its indented block exits, even if an exception is raised inside the block. The following example opens a file by using a context manager and reads it by using the readlines() method:

with open('/Users/kbehrman/.vimrc') as open_file:
    data = open_file.readlines()

data[0]
'set nocompatible\n'

The file contents are read as a list of strings and assigned to the variable named data, and then the context is exited, and the file object is automatically closed.
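A file object is also iterable, so you can loop over it one line at a time instead of reading every line into memory at once. The following sketch writes a small temporary file first so that it does not depend on any particular path; the file contents are made up for illustration:

```python
import os
import tempfile

# Create a throwaway file so the example is self-contained.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('set nocompatible\nfiletype off\n')
    path = tmp.name

# Iterate over the file line by line, stripping the trailing newline.
with open(path) as open_file:
    lines = [line.rstrip('\n') for line in open_file]

print(lines)
print(open_file.closed)  # the context manager has already closed the file

os.remove(path)  # clean up the temporary file
```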

When you open a file, the file object is ready to read as text by default. You can specify other modes, such as read binary ('rb'), write ('w'), and write binary ('wb'). The following example uses the 'w' argument to write a new file:

text = 'My intriguing story'

with open('/Users/kbehrman/my_new_file.txt', 'w') as open_file:
    open_file.write(text)

Here’s how you can check to make sure the file is indeed created:

!ls /Users/kbehrman
Applications    Downloads       Movies          Public
Desktop         Google Drive    Music           my_new_file.txt
Documents       Library         Pictures        sample.json

JSON is a common format for transmitting and storing data. The Python Standard Library includes a module for translating to and from JSON. This module can translate between JSON strings and Python types. This example shows how to open and read a JSON file:

import json

with open('/Users/kbehrman/sample.json') as open_file:
    data = json.load(open_file)
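Writing JSON works the same way in the other direction. The sketch below round-trips a small dictionary through a temporary file with json.dump() and json.load(), and shows the string-based counterparts json.dumps() and json.loads(); the record itself is invented for the example:

```python
import json
import os
import tempfile

record = {'name': 'Blue', 'weight': 300000, 'tags': ['whale', 'mammal']}

# Write the dictionary to a JSON file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'sample.json')
with open(path, 'w') as open_file:
    json.dump(record, open_file)

# Read it back; JSON objects become dicts and JSON arrays become lists.
with open(path) as open_file:
    loaded = json.load(open_file)

print(loaded == record)

# dumps() and loads() do the same translation to and from strings.
as_string = json.dumps(record)
print(as_string)
print(json.loads(as_string)['tags'])
```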

datetime Objects

Data that models values over time, called time series data, is commonly used in solving data science problems. In order to use this kind of data, you need a way to represent time. One common way is to use strings. If you need more functionality, such as the ability to easily add and subtract or easily pull out values for year, month, and day, you need something more sophisticated. The datetime library offers various ways to model time along with useful functionality for time value manipulation. The datetime.datetime class represents a moment in time down to the microsecond. Listing 15.4 demonstrates how to create a datetime object and access some of its values.

Listing 15.4 datetime Attributes

from datetime import datetime

dt = datetime(2022, 10, 1, 13, 59, 33, 10000)
dt
datetime.datetime(2022, 10, 1, 13, 59, 33, 10000)

dt.year
2022

dt.month
10

dt.day
1

dt.hour
13

dt.minute
59

dt.second
33

dt.microsecond
10000

You can get an object for the current time by using the datetime.now() method:

datetime.now()
datetime.datetime(2021, 3, 7, 13, 25, 22, 984991)

You can translate strings to datetime objects and datetime objects to strings by using the datetime.strptime() and datetime.strftime() methods. Both of these methods rely on format codes that define how the string should be processed. These format codes are defined in the Python documentation, at https://docs.python.org/3/library/datetime.html#strftime-strptime-behavior.

Listing 15.5 uses the format codes %Y for a four-digit year, %m for a two-digit month, and %d for a two-digit day to create a datetime from a string. It then uses %y, which represents a two-digit year, to create a new string version.

Listing 15.5 datetime Objects to and from Strings

dt = datetime.strptime('1968-06-20', '%Y-%m-%d')
dt
datetime.datetime(1968, 6, 20, 0, 0)

dt.strftime('%m/%d/%y')
'06/20/68'
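Format codes also cover the time of day. The following sketch parses a timestamp that includes hours, minutes, and seconds, and shows the ISO 8601 helpers isoformat() and fromisoformat() as a shortcut for that particular layout; the sample timestamp is arbitrary:

```python
from datetime import datetime

# %H, %M, and %S are the 24-hour, minute, and second codes.
stamp = datetime.strptime('2022-10-01 13:59:33', '%Y-%m-%d %H:%M:%S')
print(stamp)

# Literal text in the format string is passed through unchanged.
label = stamp.strftime('%H:%M on %d/%m/%Y')
print(label)

# ISO 8601 strings can be produced and parsed without format codes.
iso = stamp.isoformat()
print(iso)
print(datetime.fromisoformat(iso) == stamp)
```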

You can use the datetime.timedelta class to create a new datetime relative to an existing one:

from datetime import timedelta
delta = timedelta(days=3)

dt - delta
datetime.datetime(1968, 6, 17, 0, 0)
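The relationship works in both directions: subtracting one datetime from another gives you a timedelta. Here is a short sketch with arbitrary dates chosen for illustration:

```python
from datetime import datetime, timedelta

start = datetime(1968, 6, 17)
end = datetime(1968, 6, 20, 12, 30)

# The difference between two datetimes is a timedelta.
elapsed = end - start
print(elapsed.days)             # whole days in the difference
print(elapsed.total_seconds())  # the full difference in seconds

# timedelta accepts several units, which are combined.
later = start + timedelta(weeks=1, hours=2)
print(later)
```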

Python 3.9 introduced a new package, called zoneinfo, for setting time zones. With this package, it is easy to set the time zone of a datetime:

from zoneinfo import ZoneInfo

dt = datetime(2032, 10, 14, 23, tzinfo=ZoneInfo("America/Jujuy"))
dt.tzname()
'-03'

Note

As of the writing of this book, Colab is still running Python 3.7, so you may not have access to zoneinfo yet.

The datetime library also includes a datetime.date class:

from datetime import date

date.today()
datetime.date(2021, 3, 7)

This class is similar to datetime.datetime except that it tracks only the date and not the time of day.

Regular Expressions

The last package covered in this chapter is the regex library, re. Regular expressions (regex) provide a sophisticated language for searching within text. You can define a search pattern as a string and then use it to search target text. At the simplest level, the search pattern can be exactly the text you want to match. The following example defines text containing ship captains and their email addresses. It then searches this text using the re.match() function, which returns a match object:

captains = '''Ahab: ahab@pequod.com
Peleg: peleg@pequod.com
Ishmael: ishmael@pequod.com
Herman: herman@acushnet.io
Pollard: pollard@essex.me'''

import re
re.match("Ahab:", captains )
<re.Match object; span=(0, 5), match='Ahab:'>

You can use the result of this match with an if statement, whose code block will execute only if the text is matched.

if re.match("Ahab:", captains ):
    print("We found Ahab")
We found Ahab

The re.match() function matches from the beginning of the string. If you try to match a substring later in the source string, it will not match:

if re.match("Peleg", captains):
    print("We found Peleg")
else:
    print("No Peleg found!")
No Peleg found!

If you want to match any substring contained within text, you use the re.search() function:

re.search("Peleg", captains)
<re.Match object; span=(22, 27), match='Peleg'>

Character Sets

Character sets provide syntax for defining more generalized matches. The syntax for character sets is some group of characters enclosed in square brackets. To search for the first occurrence of either 0 or 1, you could use this character set:

"[01]"

To search for the first occurrence of a vowel followed by a punctuation mark, you could use this character set:

"[aeiou][!,?.;]"

You can indicate a range of characters in a character set by using a hyphen. For any digit, you would use the syntax [0-9]; for any capital letter, [A-Z]; and for any lowercase letter, [a-z]. You can follow a character set with a + to match one or more instances. You can follow a character set with a number in curly brackets to match that exact number of occurrences in a row. Listing 15.6 demonstrates the use of character sets.

Listing 15.6 Character Sets

re.search("[A-Z][a-z]", captains)
<re.Match object; span=(0, 2), match='Ah'>

re.search("[A-Za-z]+", captains)
<re.Match object; span=(0, 4), match='Ahab'>

re.search("[A-Za-z]{7}", captains)
<re.Match object; span=(46, 53), match='Ishmael'>

re.search(r"[a-z]+@[a-z]+\.[a-z]+", captains)
<re.Match object; span=(6, 21), match='ahab@pequod.com'>
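In a regular expression, an unescaped dot matches any character except a newline, so a literal period needs a backslash in front of it. The following sketch compares the two spellings against a deliberately malformed address; the test string is made up for the comparison:

```python
import re

loose_pattern = r"[a-z]+@[a-z]+.[a-z]+"    # . matches any character
strict_pattern = r"[a-z]+@[a-z]+\.[a-z]+"  # \. matches only a period

text = "ahab@pequod-com"  # a hyphen where the period should be

loose = re.search(loose_pattern, text)
strict = re.search(strict_pattern, text)

print(loose)   # the unescaped dot accepts the hyphen
print(strict)  # None, because there is no literal period
```

Writing patterns as raw strings (with the r prefix) keeps Python from interpreting the backslash before the regular expression engine sees it.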

Character Classes

Character classes are predefined groups of characters supplied for easier matching. You can see the whole list of character classes in the re documentation (see https://docs.python.org/3/library/re.html). Some commonly used character classes are \d for digit characters, \s for white space characters, and \w for word characters. Word characters generally match any characters that are commonly used in words as well as numeric digits and underscores. Because these classes start with a backslash, it is safest to write patterns as raw strings (with an r prefix) so that Python does not treat the backslash as a string escape.

To search for the first occurrence of a digit surrounded by word characters, you could use r"\w\d\w":

re.search(r"\w\d\w", "His panic over Y2K was overwhelming.")
<re.Match object; span=(15, 18), match='Y2K'>

You can use the + or curly brackets to indicate multiple consecutive occurrences of a character class in the same way you do with character sets:

re.search(r"\w+@\w+\.\w+", captains)
<re.Match object; span=(6, 21), match='ahab@pequod.com'>

Groups

If you enclose parts of a regular expression pattern in parentheses, they become a group. You can access groups on a match object by using the group() method. Groups are numbered, with group 0 being the whole match:

m = re.search(r"(\w+)@(\w+)\.(\w+)", captains)

print(f'Group  0 is {m.group(0)}')
Group  0 is ahab@pequod.com

print(f'Group  1 is {m.group(1)}')
Group  1 is ahab

print(f'Group  2 is {m.group(2)}')
Group  2 is pequod

print(f'Group  3 is {m.group(3)}')
Group  3 is com

Named Groups

It is often useful to refer to groups by names rather than by using numbers. The syntax for defining a named group is as follows:

(?P<GROUP_NAME>PATTERN)

You can then get groups by using the group names instead of their numbers:

m = re.search(r"(?P<name>\w+)@(?P<SLD>\w+)\.(?P<TLD>\w+)", captains)

print(f'''
Email address: {m.group()}
Name:  {m.group("name")}
Secondary level domain: {m.group("SLD")}
Top level Domain: {m.group("TLD")}''')
Email address: ahab@pequod.com
Name:  ahab
Secondary level domain: pequod
Top level Domain: com
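A match object can also hand back all of its groups at once. The sketch below uses groupdict() for named groups and groups() for a positional tuple; it defines a shortened captains string so that it runs on its own:

```python
import re

captains = '''Ahab: ahab@pequod.com
Peleg: peleg@pequod.com'''

m = re.search(r"(?P<name>\w+)@(?P<SLD>\w+)\.(?P<TLD>\w+)", captains)

# All named groups as a dictionary keyed by group name
parts = m.groupdict()
print(parts)

# All groups as a tuple, in the order they appear in the pattern
print(m.groups())

# span() accepts a group name and returns where that group matched
print(m.span('name'))
```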

Find All

Until now, you have only been able to find the first occurrence of a match. You can use the re.findall() function to match all occurrences. This function returns each match as a string:

re.findall(r"\w+@\w+\.\w+", captains)
['ahab@pequod.com',
 'peleg@pequod.com',
 'ishmael@pequod.com',
 'herman@acushnet.io',
 'pollard@essex.me']

If you have defined groups, re.findall() returns each match as a tuple of strings, with each string being the match for a group:

re.findall(r"(?P<name>\w+)@(?P<SLD>\w+)\.(?P<TLD>\w+)", captains)
[('ahab', 'pequod', 'com'),
 ('peleg', 'pequod', 'com'),
 ('ishmael', 'pequod', 'com'),
 ('herman', 'acushnet', 'io'),
 ('pollard', 'essex', 'me')]

Find Iterator

If you are searching for all matches in a large text, you can use re.finditer(). This function returns an iterator, which returns each subsequent match with each iteration:

iterator = re.finditer(r"\w+@\w+\.\w+", captains)

print(f"An {type(iterator)} object is returned by finditer" )
An <class 'callable_iterator'> object is returned by finditer

m = next(iterator)
f"""The first match, {m.group()} is processed
without processing the rest of the text"""
'The first match, ahab@pequod.com is processed\nwithout processing the rest of the text'
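In practice you will usually consume the iterator with a for loop rather than calling next() yourself. The following sketch stops as soon as it finds an address ending in .io, so later matches are never computed; the stopping condition is just an illustrative choice:

```python
import re

captains = '''Ahab: ahab@pequod.com
Peleg: peleg@pequod.com
Ishmael: ishmael@pequod.com
Herman: herman@acushnet.io
Pollard: pollard@essex.me'''

first_io = None
checked = 0

# Each iteration yields the next match object on demand.
for m in re.finditer(r"\w+@\w+\.\w+", captains):
    checked += 1
    if m.group().endswith('.io'):
        first_io = m.group()
        break  # the rest of the text is not searched

print(first_io)
print(checked)
```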

Substitution

You can use regular expressions for substitution as well as for matching. The re.sub() function takes a match pattern, a replacement string, and a source text:

re.sub(r"\d", "#", "Your secret pin is 12345")
'Your secret pin is #####'
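re.sub() has two further options that are worth knowing about: a count argument that limits the number of replacements, and the ability to pass a function instead of a replacement string. This is a brief sketch of both; the helper double_it and the sample text are invented for the example:

```python
import re

text = "Your secret pin is 12345"

# count limits how many matches are replaced, working left to right.
partly_masked = re.sub(r"\d", "#", text, count=3)
print(partly_masked)

def double_it(match):
    """Return the matched number doubled, as a string."""
    return str(int(match.group()) * 2)

# When the replacement is a function, it is called with each match
# object, and its return value is used as the replacement text.
doubled = re.sub(r"\d+", double_it, "3 whales, 12 ships")
print(doubled)
```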

Substitution Using Named Groups

You can refer to named groups in a replacement string by using this syntax:

\g<GROUP_NAME>

To reverse the email addresses in the captains text, you could use substitution as follows:

new_text = re.sub(r"(?P<name>\w+)@(?P<SLD>\w+)\.(?P<TLD>\w+)",
                  r"\g<TLD>.\g<SLD>.\g<name>", captains)

print(new_text)
Ahab: com.pequod.ahab
Peleg: com.pequod.peleg
Ishmael: com.pequod.ishmael
Herman: io.acushnet.herman
Pollard: me.essex.pollard

Compiling Regular Expressions

There is some cost to compiling a regular expression pattern. If you are using the same regular expression many times, it is more efficient to compile it once. You do so by using the re.compile() function, which returns a compiled regular expression object based on a match pattern:

regex = re.compile(r"\w+: (?P<name>\w+)@(?P<SLD>\w+)\.(?P<TLD>\w+)")
regex
re.compile(r'\w+: (?P<name>\w+)@(?P<SLD>\w+)\.(?P<TLD>\w+)', re.UNICODE)

This object has methods that map to many of the re functions, such as match(), search(), findall(), finditer(), and sub(), as demonstrated in Listing 15.7.

Listing 15.7 Compiled Regular Expression

regex.match(captains)
<re.Match object; span=(0, 21), match='Ahab: ahab@pequod.com'>

regex.search(captains)
<re.Match object; span=(0, 21), match='Ahab: ahab@pequod.com'>

regex.findall(captains)
[('ahab', 'pequod', 'com'),
 ('peleg', 'pequod', 'com'),
 ('ishmael', 'pequod', 'com'),
 ('herman', 'acushnet', 'io'),
 ('pollard', 'essex', 'me')]

new_text = regex.sub(r"Ahoy \g<name>!", captains)
print(new_text)
Ahoy ahab!
Ahoy peleg!
Ahoy ishmael!
Ahoy herman!
Ahoy pollard!
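A compiled pattern is most useful when you apply it to many separate strings, and re.compile() also accepts flags such as re.IGNORECASE. The following sketch shows both; the list of lines is made up for illustration:

```python
import re

email = re.compile(r"(?P<name>\w+)@(?P<SLD>\w+)\.(?P<TLD>\w+)")

lines = ["Ahab: ahab@pequod.com",
         "no address here",
         "Herman: herman@acushnet.io"]

# Reuse the same compiled object for every line.
names = []
for line in lines:
    m = email.search(line)
    if m:  # search() returns None when a line has no match
        names.append(m.group("name"))
print(names)

# Flags are passed as the second argument to re.compile().
any_case = re.compile(r"ahab", re.IGNORECASE)
print(any_case.findall("Ahab: ahab@pequod.com"))
```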

Summary

This chapter introduces data sorting, file objects, the datetime library, and the re library. Having at least a passing knowledge of these topics is important for any Python developer. You can do sorting either with the sorted() function or object sort() methods, such as the one attached to list objects. You can open files by using the open() function, and while files are open, you can read from them or write to them. The datetime library models time and is particularly useful when dealing with time series data. Finally, you can use the re library to define complicated text searches.

Questions

1.   What is the final value of sorted_names in the following example?

names = ['Rolly', 'Polly', 'Molly']
sorted_names = names.sort()

2.   How would you sort the list nums = [0, 4, 3, 2, 5] in descending order?

3.   What cleanup specific to file objects does a context manager handle?

4.   How would you create a datetime object from the following variables:

year = 2022
month = 10
day = 14
hour = 12
minute = 59
second = 11
microsecond = 100

5.   What does \d represent in a regular expression pattern?
