Chapter 2. Textual Data: Every string has its place

Imagine trying to communicate without words.

All programs process data, and one of the most important types of data is text. In this chapter, you’ll work through the basics of textual data. You’ll automatically search text and get back exactly what you’re looking for. Along the way, you’ll pick up key programming concepts such as methods and how you can use them to bend your data to your will. And finally, you’ll instantly power up your programs with the help of library code.

Your new gig at Starbuzz Coffee

Starbuzz Coffee has made a name for itself as the fastest growing coffee shop around. If you’ve seen one on your local corner, look across the street; you’ll see another one.

The Starbuzz CEO is always on the lookout for ways to boost profits, and he’s come up with a great idea. He wants a program that will show him the current price of coffee beans so that his buyers can make informed decisions about when to buy.

Here’s the current Starbuzz code

The previous programmer has already made a head start on the code, and we can use this as a basis. Here’s the existing Python code, but what does it do?

image with no caption
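The original listing was a figure that is not reproduced here. As a rough sketch only, a program of this kind might look something like the code below. The page name in PRICES_URL is a guess (only the domain is mentioned later in this chapter), and the function name fetch_page is made up for illustration; neither is necessarily what the previous programmer wrote.

```python
import urllib.request

# Assumed address: the domain appears later in this chapter, but the
# page name "prices.html" is a guess made purely for illustration.
PRICES_URL = "http://www.beans-r-us.appspot.com/prices.html"


def fetch_page(url):
    """Download a web page and return its contents as one string."""
    page = urllib.request.urlopen(url)   # open a connection to the page
    data = page.read()                   # read the raw bytes
    return data.decode("utf8")           # turn the bytes into text
```

Calling print(fetch_page(PRICES_URL)) would then display everything the page contains, HTML tags and all.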

Brain Power

Take a good look at the existing Starbuzz code. What do you think it actually does?

The cost is embedded in the HTML

Take a closer look at the results of the program. The current price of beans is right in the middle of the output:

image with no caption

The Starbuzz CEO would find it a lot easier if you could extract the price of beans and just display that, rather than have to look for it in the HTML. But how do you do that?

A string is a series of characters

The output of the Starbuzz program is an example of a string. In other words, it’s a series of characters like this:

image with no caption

Somewhere within the string is the price of coffee beans. To retrieve just the price, all you need to do is go to the right bit of the string, retrieve the characters that give the price, and display just those characters. But how?

Find characters inside the text

The computer keeps track of individual characters by using two pieces of information: the start of the string and the offset of an individual character. The offset is how far the individual character is from the start of the string.

image with no caption

The first character in a string has an offset of 0, because it is zero characters from the start. The second character has an offset of 1, and so on:

image with no caption

The offset value is always 1 less than the position. Python lets you read a single character from a string by providing the offset value in square brackets after the variable name. Because the offset value is used to find a character, it is called the index of the character:

image with no caption
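Here is a small sketch of reading single characters by index. The string "coffee" is just a stand-in chosen for illustration; the real Starbuzz text is a whole page of HTML.

```python
# A short sample string (an assumption made for this example).
text = "coffee"

first = text[0]   # offset 0: the first character, "c"
fifth = text[4]   # offset 4: the fifth character, "e"

print(first, fifth)
```

Notice that the fifth character lives at index 4, one less than its position.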

But how do you get at more than one character?

For Starbuzz, you don’t just need a single character. You need to extract the price from the string of HTML, and the price is made up of several characters.

You need to extract a smaller substring from a bigger string. A substring is a sequence of characters contained within another string. Specifying substrings in Python is a little like reading single characters from a string, except that you provide two index values within the square brackets:

image with no caption
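A minimal sketch of the two-index form, using a made-up line of text rather than the real web page. The first index is where the substring starts; the second is where it stops, and the character at that second index is not included.

```python
# A sample line standing in for part of the page (invented for illustration).
line = "price: $5.49 per lb"

# Start at offset 8 and stop just before offset 12.
price = line[8:12]

print(price)
```

The substring is four characters long: offsets 8, 9, 10, and 11.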

Beans’R’Us is rewarding loyal customers

The CEO just got great news from the beans supplier.

The supplier actually maintains two prices: one for regular customers and one for loyalty program customers. The different prices are published on different web pages:

image with no caption

That means you need to change the web page address in the code:

image with no caption

Let’s run it to make sure everything works OK.

The price moved

The web page for loyalty customers is much more dynamic than the old web page. The page for regular customers always displays the price in a substring beginning at index 234. That’s not true for the loyalty program web page. The price on that page can be almost anywhere. All you know for sure is that the price follows the substring >$:

image with no caption

You need to search for the price string.

Searching is complex

You already know how to extract a substring, so you could run through the entire web page and check each pair of characters to see whether it matches >$, like this:

image with no caption
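One way a hand-written search might look is sketched below. The function name find_marker and the sample string are invented for this example, and the original figure may well have approached it differently.

```python
def find_marker(text, marker=">$"):
    """Return the offset where marker first appears in text, or -1 if absent."""
    width = len(marker)
    # Slide along the string, looking at one marker-sized chunk at a time.
    for offset in range(len(text) - width + 1):
        if text[offset:offset + width] == marker:
            return offset   # found it: report where the chunk starts
    return -1               # reached the end without a match


# A made-up fragment of HTML for illustration.
sample = "<strong>$5.52</strong>"
print(find_marker(sample))
```

Even this short version has to keep track of the current offset, the chunk being compared, and what to return when nothing matches.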

You could do it this way... but should you?

There’s a lot to worry about. Which two characters are you currently comparing? Where in the string are you right now? What if “>$” isn’t found? Searching for substrings in strings is a little more complex than it first appears...

But if you don’t want to write code to search the string, what else could you do?

Python data is smart

The more code you write, the more you will find that you need to do the same kinds of things to the data in your variables all the time. To save you from writing the same code over and over, programming languages provide built-in functionality. Python data is smart: it can do things.

Let’s look at an example.

Imagine you have a piece of text in a variable that you want to display in uppercase (all CAPITAL letters):

msg = "Monster truck rally. 4pm. Monday."

You could write code that reads through each character in the string and prints out the matching uppercase letter. But if you’re programming in a language like Python, you can do this:

image with no caption
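In code, that looks something like the following sketch. The variable name shout is ours, added so you can see that the original string is left unchanged.

```python
msg = "Monster truck rally. 4pm. Monday."

# Ask the string for an uppercase copy of itself.
shout = msg.upper()

print(shout)   # the uppercase copy
print(msg)     # the original string is untouched
```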

But what does msg.upper() mean?

Well, msg is the string containing our piece of text. The .upper() that follows it is called a string method. A method is just an instruction for the string. When you call msg.upper(), you are telling the string to give you an UPPERCASE version of its data.

But is there a string method that can help you search for a substring within a string object?

image with no caption
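There is: strings have a find() method, which returns the index where a substring starts, or -1 if the substring isn't there. The sketch below uses a made-up line of HTML and assumes the price is always four characters long (such as 5.52); both are assumptions for illustration.

```python
# A made-up fragment of the loyalty page (an assumption for this example).
sample = "<p>Current price of coffee beans = <strong>$5.52</strong></p>"

where = sample.find(">$")        # index of the ">" in ">$"
start = where + 2                # skip over the two characters ">$"
price = sample[start:start + 4]  # assume the price is four characters long

print(price)
```

Because find() does the searching, the price can sit anywhere on the page and the code still picks it out.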

The new version of the program works, but now there’s a design issue.

The Starbuzz CEO wants to know when the price of the beans falls below $4.74. The program needs to keep checking the Beans’R’Us website until that happens. It’s time to restructure the program to add in this new feature.

Let’s add a loop to the program that stops when the price of coffee is right.

Brain Power

Look at the error message in detail. Try to identify which line in the code caused the crash and guess what a TypeError might be. Why do you think the code crashed?

Strings and numbers are different

The program crashed because it tried to compare a string with a number, which is something that doesn’t make a lot of sense to a computer program. When a piece of data is classified as a string or a number, this refers to more than just the contents of the variable. We are also referring to its data type. If one piece of data is a string and the other is a number, Python can’t tell which one is greater, so a comparison using > or < fails with a TypeError.

image with no caption
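You can reproduce the problem in a few lines. This sketch uses made-up values and catches the error so that the example keeps running; the variable names are ours.

```python
price = "5.52"   # text taken from a web page is a string
limit = 4.74     # the CEO's target is a number

try:
    cheap = price < limit          # string compared with number
except TypeError as err:
    cheap = None                   # the comparison never produced an answer
    print("Cannot compare:", err)
```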

Think back to the previous chapter. You’ve seen this problem before, back when you were working on the guessing game program:

image with no caption

In the guessing-game program, you needed to convert the user’s guess into an integer (a whole number) by using the int() function. But coffee bean prices aren’t whole numbers, because they contain digits after a decimal point. They are floating point numbers or floats, and to convert a string to a float, you need to use a function other than int(). You need to use float():

image with no caption
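A minimal sketch of the conversion, using a made-up price string:

```python
price_text = "4.49"          # a string, as sliced out of the web page

price = float(price_text)    # now a floating point number

# Two numbers can be compared without any trouble.
print(price < 4.74)
```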

THE DEPARTMENT OF WEBLAND SECURITY
37 YOU DON’T NEED TO KNOW WHERE WE ARE
(BUT WE KNOW WHERE YOU LIVE)
WASHINGTON, D.C.

From: The Department of Webland Security
Secret Service - Corporate Enforcement Unit
 
To Whom It May Concern:

A recent investigation into an apparent Distributed Denial
of Service (DDoS) attack on the www.beans-r-us.appspot.com domain
showed that much of the traffic originated from machines
located in various Starbuzz outlets from around the world.
The number of web transactions (which reached a peak of
several hundred thousand requests worldwide) resulted in
a crash of the Beans'R'Us servers, resulting in a
significant loss of business.

In accordance with the powers invested in this office
by the United States Attorney General, we are alerting the
developer of the very dim view we take of this kind of
thing. In short:

  We're watching you, Bud. Consider yourself on notice.

Yours faithfully,
Head of Internet Affairs

That sounds weird. What happened?

The program has overloaded the Beans’R’Us server

It looks like there’s a problem with the program. It’s sending so many requests that it overwhelmed the Beans’R’Us website. So why did that happen? Let’s look at the code and see:

image with no caption

If the value of price isn’t low enough (if it’s more than 4.74), the program goes back to the top of the loop immediately and sends another request.

With the code written this way, the program will generate thousands of requests per hour. Multiply that by all the Starbuzz outlets around the world, and you can start to see the scale of the problem.

You need to delay the pricing requests. But how?

Time... if only you had more of it

Just when you’re feeling completely lost, you get a phone call from the Starbuzz coder who wrote the original version of the program:

image with no caption

It seems that she can’t get back because of a storm in the mountains. But she does make a suggestion. You need to regulate how often you send requests to the Beans’R’Us web server. One way to do this is to use the time library. This will apparently make it possible to send requests every 15 minutes or so, which should help to lighten the load.

There’s just one thing: what’s a library?

You’re already using library code

image with no caption

But how will the time library help us? Let’s see...
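The time library has a sleep() function that pauses the program for a given number of seconds. Here is a sketch of how a delayed loop might look. The names wait_for_price and get_price are invented for this example; get_price stands for whatever code fetches the page and extracts the price as a float.

```python
import time


def wait_for_price(get_price, target=4.74, delay_seconds=900):
    """Keep checking get_price() until it returns a value below target."""
    price = get_price()
    while price >= target:
        time.sleep(delay_seconds)   # 900 seconds is 15 minutes
        price = get_price()         # then ask again
    return price
```

With the pause in place, each outlet sends roughly four requests an hour instead of thousands.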

Order is restored

Starbuzz Coffee is off the blacklist, because their price-checking programs no longer kill the Beans’R’Us web server. The nice people at Webland Security have, rather quietly, gone away.

Coffee beans get ordered when the price is right!

Your Programming Toolbox

You’ve got Chapter 2 under your belt. Let’s look back at what you’ve learned in this chapter:

Programming Tools

* Strings are sequences of individual characters.

* Individual string characters are referenced by index.

* Index values are offsets that start from zero.

* Methods provide variables with built-in functionality.

* Programming libraries provide a collection of related pre-built code and functions.

* As well as having a value, the data in a variable also has a “data type.”

* Number is a data type.

* String is a data type.

Python Tools

* s[4] - access the 5th character of the variable “s”, which is a string

* s[6:12] - access a substring within the string “s” (from index 6 up to, but not including, index 12)

* s.find() method for searching strings

* s.upper() method for converting strings to UPPERCASE

* float() converts strings to decimal point numbers known as “floats”

* + addition operator

* > greater than operator

* urllib.request library for talking to the Web

* time library for working with dates and times
