Using string functions with a pandas DataFrame

Let's apply string functions to a pandas DataFrame. We will continue to use the same dataset that was imported in the previous section. Most of Python's built-in string methods have vectorized counterparts in pandas, available through the .str accessor of a Series.

Here is a list of pandas string functions that reflect Python string methods:

Figure 2 - List of vectorized string functions in pandas

Let's practice the following use cases:

Extract the first comment from the text column of the DataFrame and convert it to lowercase, as follows:

text[0].lower()

Convert all the comments in the text column to lowercase and display the first eight entries, as follows:

text.str.lower().head(8)

Extract the first comment and convert it to uppercase, as follows:

text[0].upper()

Get the length of each comment in the text field and display the first eight entries, as follows:

text.str.len().head(8)

Combine all the comments into a single string and display the first 500 characters, as follows:

text.str.cat()[0:500]

It is worth checking that the comments really have been concatenated. When would we need to combine all the comments into a single string? One example is finding the words that users choose most frequently when commenting.
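The word-frequency idea can be sketched as follows. This is a minimal illustration, not the book's own code: the three-comment Series is a hypothetical stand-in for the text column, and collections.Counter is one convenient way to count words. Note that str.cat() with no arguments joins the strings with no separator, so passing sep=" " is assumed here to keep words from adjacent comments apart.

```python
from collections import Counter

import pandas as pd

# Hypothetical stand-in for the `text` column used in this section.
text = pd.Series(["Go Wolves", "Wolves win", "go team"])

# With no separator, the last word of one comment fuses with the
# first word of the next ("WolvesWolves", "wingo").
joined_no_sep = text.str.cat()

# A space separator keeps the words of adjacent comments apart.
joined = text.str.cat(sep=" ")

# Lowercase, split on whitespace, and count each word.
word_counts = Counter(joined.lower().split())

print(joined_no_sep)
print(word_counts.most_common(2))
```

Real comments would also need punctuation handling before counting, which this sketch leaves out.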

Slice each string in a series and return the result in an elementwise fashion with series.str.slice(), as shown in the following code snippet:

text.str.slice(0, 10).head(8)
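To try the element-wise calls above without the dataset, here is a small self-contained sketch. The sample Series is hypothetical and only stands in for the text column. It contrasts a plain Python string method applied to one element with the vectorized .str methods, which return one result per row.

```python
import pandas as pd

# Hypothetical stand-in for the `text` column used in this section.
text = pd.Series([
    "Wolves look good tonight",
    "Wiggins with the dunk!",
    "Great game",
])

# text[0] is an ordinary Python str, so the built-in method applies directly.
first_lower = text[0].lower()

# The .str accessor applies the operation to every element of the Series.
all_lower = text.str.lower()
lengths = text.str.len()            # number of characters in each comment
prefixes = text.str.slice(0, 10)    # first 10 characters of each comment

print(first_lower)
print(lengths.tolist())
print(prefixes.tolist())
```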

Replace the occurrences of a given substring with a different substring using str.replace(), as shown in the following code snippet:

text.str.replace("Wolves", "Fox").head(8)

In the preceding example, every occurrence of Wolves is replaced with Fox in the returned Series. This works like the search-and-replace feature found in many content management systems and editors.
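The following sketch shows the replacement on a hypothetical three-comment Series. Passing regex=False is an assumption made here to request a literal substring match explicitly, because the default for that parameter has differed between pandas versions. The match is case-sensitive, and the original Series is left unchanged.

```python
import pandas as pd

# Hypothetical stand-in for the `text` column used in this section.
text = pd.Series(["Wolves win", "Go Wolves! Wolves!", "wolves lose"])

# Literal, case-sensitive replacement; returns a new Series.
replaced = text.str.replace("Wolves", "Fox", regex=False)

print(replaced.tolist())
print(text.tolist())  # the original comments are untouched
```

The lowercase "wolves" in the last comment is not replaced, which is worth keeping in mind when comments vary in capitalization.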

While working with text data, we frequently test whether character strings contain a certain substring or pattern of characters. Let's search for only those comments that mention Andrew Wiggins. We'd need to match all posts that mention him and avoid matching posts that don't mention him.

Use series.str.contains() to get a series of true/false values, indicating whether each string contains a given substring, as follows:

# Get first 10 comments about Andrew Wiggins
selected_comments = text.str.lower().str.contains("wigg|drew")

text[selected_comments].head(10)

Just for information, let's calculate the ratio of comments that mention Andrew Wiggins, as follows:

len(text[selected_comments])/len(text)

The output is 0.06649063850216035, so about 6.6% of the comments match the pattern we supplied to str.contains() and are counted as mentioning Andrew Wiggins.

Posts about Andrew Wiggins could refer to him by any number of names (Wiggins, Andrew, Wigg, Drew), so we needed something a little more flexible than a single substring to match all the posts we're interested in. The pattern we supplied, "wigg|drew", is a simple example of a regular expression: the | character means "or", so a comment matches if it contains either wigg or drew.
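A small sketch makes the behavior of the pattern concrete. The four comments below are hypothetical, chosen to show that "wigg|drew" matches anywhere inside a string: it catches Andrew as intended, but it also catches an unrelated word such as withdrew. The na=False argument is an assumption added here so that missing comments, if any, count as non-matches.

```python
import pandas as pd

# Hypothetical comments standing in for the `text` column.
text = pd.Series([
    "Wiggins is on fire",
    "Andrew with the block",
    "Great defense tonight",
    "He withdrew from the game",
])

# Lowercase first so the match ignores capitalization, then test each
# comment against the pattern "wigg OR drew".
mask = text.str.lower().str.contains("wigg|drew", na=False)

# The mean of a boolean Series is the fraction of True values.
ratio = mask.mean()

print(mask.tolist())
print(ratio)
```

Because the last comment is a false positive, a ratio computed this way is best read as an upper estimate of how many comments mention the player.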
