© Raju Kumar Mishra 2018
Raju Kumar Mishra, PySpark Recipes, https://doi.org/10.1007/978-1-4842-3141-8_3

3. Introduction to Python and NumPy

Raju Kumar Mishra, Bangalore, Karnataka, India
 
Python is a general-purpose, high-level programming language. It was developed by Guido van Rossum, and its popularity has grown rapidly since its inception. A wealth of handy, high-performance packages for numerical and statistical calculations makes Python popular among data scientists. Python uses indentation to delimit blocks of code, which may bother programmers who are just learning the language. But indentation improves code readability, and Python’s popularity and ease of use make it a good choice for programming Spark.
This chapter introduces the basics of Python programming. We will also discuss NumPy. In addition, I have included a recipe on IPython and on integrating IPython with PySpark.
If you know Python well, you can skip this chapter. But my suggestion is to go through the content regardless, because this chapter might still provide good insight and boost your knowledge further.
This chapter covers the following recipes:
  • Recipe 3-1. Create data and verify the data type
  • Recipe 3-2. Create and index a Python string
  • Recipe 3-3. Typecast from one data type to another
  • Recipe 3-4. Work with a Python list
  • Recipe 3-5. Work with a Python tuple
  • Recipe 3-6. Work with a Python set
  • Recipe 3-7. Work with a Python dictionary
  • Recipe 3-8. Define and call functions
  • Recipe 3-9. Create and call lambda functions
  • Recipe 3-10. Work with Python conditionals
  • Recipe 3-11. Work with Python for and while loops
  • Recipe 3-12. Work with NumPy
  • Recipe 3-13. Integrate IPython and IPython Notebook with PySpark

Recipe 3-1. Create Data and Verify the Data Type

Problem

You want to create data and verify the data type.

Solution

Python is a dynamically typed language. What does that mean? Dynamically typed means that when defining a variable, the programmer does not declare its data type, as is required in other programming languages such as C. To learn about Python data types, you want to do the following:
  • Create an integer and verify its data type
  • Create a long integer and verify its data type
  • Create a decimal and verify its data type
  • Create a Boolean and verify its data type
The Python interpreter infers the data type of a literal whenever it appears in the console or in a Python script, including when the literal is assigned to a variable. To verify the data type of a literal or variable, you can use the Python type() function.
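As a minimal sketch of what dynamic typing means in practice, the following snippet binds one variable to values of different types and records what type() reports each time. The variable names are illustrative only, and the snippet is written to run under Python 3, where the console prints <class 'int'> rather than the <type 'int'> form that a Python 2 console prints.

```python
# One name can be bound to values of different types over its lifetime;
# no type declaration is needed at any point.
value = 15
first_type = type(value).__name__      # name of the type of an integer literal

value = 15.4
second_type = type(value).__name__     # name of the type of a decimal literal

value = True
third_type = type(value).__name__      # name of the type of a Boolean literal

print(first_type, second_type, third_type)
```

Using the __name__ attribute of the type gives a short string that reads the same way regardless of how a particular console formats type objects.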

How It Works

Let’s follow the steps in this section to create data and verify data types.

Step 3-1-1. Creating an Integer and Verifying Its Data Type

Let’s create an integer in Python. In the following line of code, we associate the Python variable pythonInt with the value 15:
>>> pythonInt = 15
>>> pythonInt
Here is the output:
15
We can verify the type of a Python object by using the type() function:
>>> type(pythonInt)
Here is the output:
<type 'int'>

Step 3-1-2. Creating a Long Integer and Verifying Its Data Type

The Long data type is used for large integers. At the time of creation, the number is suffixed by L. Creation of a Long data type is shown here:
>>> pythonLongInt = 15L
>>> pythonLongInt
Here is the output:
15L
Using the type() function, we can see that the data type of pythonLongInt is long:
>>> type(pythonLongInt)
Here is the output:
<type 'long'>

Step 3-1-3. Creating a Decimal and Verifying Its Data Type

Decimal numbers are used a lot in any numerical computation. Let’s create a floating-point number:
>>> pythonFloat = 15.4
>>> pythonFloat
Here is the output:
15.4
And now let’s check its type:
>>> type(pythonFloat)
Here is the output:
<type 'float'>

Step 3-1-4. Creating a Boolean and Verifying Its Data Type

A programmer’s life is filled with concerns about various conditions, such as whether a given number is greater than five. Python represents the answer to such a question with a Boolean value, either True or False. Here’s an example:
>>> pythonBool = True
>>> pythonBool
The output is shown here:
True
>>> type(pythonBool)
Here is the output:
<type 'bool'>
>>> pythonBool = False
>>> pythonBool
Here is the output:
False

Recipe 3-2. Create and Index a Python String

Problem

You want to create and index a Python string.

Solution

Strings are central to natural language processing and other text-related problems. You want to do the following:
  • Create a string and verify its data type
  • Index a string
  • Verify whether a substring lies in a given string
  • Check whether a string starts with a given substring
  • Check whether a string ends with a given substring
You can create a string by using either a pair of double quotes (" ") or a pair of single quotes (' '). Indexing is done with square brackets. There are various ways to verify whether a substring is in a given string; one of them is the find() function, which reports where a substring occurs inside a string.
The startswith() function indicates whether a given string starts with a given substring. Similarly, the endswith() function indicates whether a given string ends with a given substring.

How It Works

The steps in this section will solve our problem.

Step 3-2-1. Creating a String and Verifying Its Data Type

To create a string, you must put the given string literal between either double quotes or single quotes. Always remember to not use mixed quotes, meaning having a single quote on one side and a double quote on the other side. The following lines of code create strings by using double and single quotes. Let’s start with double quotes:
>>> pythonString  = "PySpark Recipes"
>>> pythonString
Here is the output:
'PySpark Recipes'
You can use the following code to verify the type:
>>> type(pythonString)
Here is the output:
<type 'str'>
Now let’s create a string by using single quotes:
>>> pythonString  = 'PySpark Recipes'
>>> pythonString
Here is the output:
'PySpark Recipes'

Step 3-2-2. Indexing a String

String elements can be indexed. When you index a string, you start from zero:
>>> pythonStr = "Learning PySpark is fun"
>>> pythonStr
Here is the output:
'Learning PySpark is fun'
Let’s get the character at the tenth position. Remember that string indexes start at zero, so the tenth character is at index 9:
>>> pythonStr[9]
Here is the output:
'P'

Step 3-2-3. Verifying That a Substring Lies in a Given String

If a substring is found inside a string, the Python string find() function will return the lowest index of the matching substring in the string. But if the given substring is not found, the find() method returns -1. Let’s first look again at the character at index 9:
>>> pythonStr[9]
Here is the output:
'P'
We are going to search to see whether the substring Py is in our string pythonStr. We’ll use the find() method:
>>> pythonStr.find("Py")
Here is the output:
9
>>> pythonStr.find("py")
Here is the output:
-1
The first call returns 9, the index at which Py starts. The second call returns -1 because find() is case sensitive, and the lowercase py does not occur in pythonStr.

Step 3-2-4. Checking Whether a String Starts with a Given Substring

Now let’s focus on another important function, startswith(). This function checks whether a given string starts with a particular substring. If it does, startswith() returns True; otherwise, it returns False. Here is an example:
>>> pythonStr.startswith("Learning")
Here is the output:
True

Step 3-2-5. Checking Whether a String Ends with a Given Substring

Similarly, the endswith() function indicates whether a given string ends with a given substring:
>>> pythonStr.endswith("fun")
Here is the output:
True
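The string operations from this recipe can be gathered into one short script. This is a sketch rather than a required pattern; the variable names are illustrative, and the in operator is included as one more way of testing for a substring alongside find().

```python
python_str = "Learning PySpark is fun"

# Indexing starts at zero, so index 9 holds the tenth character.
tenth_char = python_str[9]

# find() returns the lowest matching index, or -1 when there is no match.
found_at = python_str.find("Py")
missing = python_str.find("py")      # lowercase "py" does not occur

# The in operator answers the yes/no question directly.
contains_spark = "Spark" in python_str

starts_ok = python_str.startswith("Learning")
ends_ok = python_str.endswith("fun")
```

When only a yes-or-no answer is needed, the in operator reads more naturally than comparing the result of find() with -1.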

Recipe 3-3. Typecast from One Data Type to Another

Problem

You want to typecast from one data type to another.

Solution

Typecasting from one data type to another is a general activity in data analysis. The following set of problems will help you to understand typecasting in Python. You want to do the following:
  • Typecast an integer number to a float
  • Typecast a string to an integer
  • Typecast a string to a float
Typecasting means changing one data type to another—for example, changing a string to an integer or float. In PySpark, we’ll often typecast one data type to another. There are four important functions in Python for typecasting one data type to another: int(), float(), long(), and str().

How It Works

We’ll follow the steps in this section to solve the given problems.

Step 3-3-1. Typecasting an Integer Number to a Float

Let’s start with creating an integer:
>>> pythonInt = 17
We have created an integer variable, pythonInt. The type of a Python object can be found by using the type() function:
>>> type(pythonInt)
Here is the output:
<type 'int'>
We can see clearly that the type() function has returned int.
To typecast any data type to float, we use the float() function:
>>> pythonFloat = float(pythonInt)
The pythonInt value has been changed to float. We can see the change by printing the variable:
>>> print pythonFloat
Here is the output:
17.0
Applying the type() function to pythonFloat confirms that the integer value has been typecast to a floating-point number:
>>> type(pythonFloat)
Here is the output:
<type 'float'>

Step 3-3-2. Typecasting a String to an Integer

The function int() typecasts a String, Float, or Long type to an Integer. Typecasting a string to an integer will come up again and again in our problem-solving with PySpark. We’ll start by creating a string:
>>> pythonString = "15"
>>> type(pythonString)
Here is the output:
<type 'str'>
In the preceding Python code snippet, we have created a string, and we also checked the type of our created variable pythonString.
The Python built-in int() function can typecast a variable to an integer. The following code uses an int() function and changes a string to an integer:
>>> pythonInteger = int(pythonString)
>>> pythonInteger
Here is the output:
15
>>> type(pythonInteger)
Here is the output:
<type 'int'>

Step 3-3-3. Typecasting a String to a Float

Let’s solve our last question of typecasting. Let’s create a string with the value 15.4, and typecast it to a float by using the float() function:
>>> pythonString = "15.4"
>>> type(pythonString)
Here is the output:
<type 'str'>
Next, we’ll typecast our string to a floating-point number:
>>> pythonFloat = float(pythonString)
>>> pythonFloat
Here is the output:
15.4
>>> type(pythonFloat)
Here is the output:
<type 'float'>
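Typecasting a string can fail when the text does not match the target type. The following sketch, which uses a hypothetical helper named to_int_or_none (not part of Python or PySpark), shows one way to guard against that: int() raises a ValueError for a string such as "15.4", so the helper catches the error and returns None instead.

```python
def to_int_or_none(text):
    """Return int(text), or None when text is not a valid integer literal.

    This helper is hypothetical and exists only for illustration.
    """
    try:
        return int(text)
    except ValueError:
        return None

whole = to_int_or_none("15")        # a valid integer string
bad = to_int_or_none("15.4")        # int() rejects a decimal string
via_float = int(float("15.4"))      # convert to float first; int() then drops the fraction
```

Going through float() first is a common workaround when a decimal string must end up as an integer, but note that the fractional part is discarded rather than rounded.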
There are four types of collections in Python. The four types are list, set, dictionary, and tuple. In the following recipes, we are going to discuss these collections one by one.

Recipe 3-4. Work with a Python List

Problem

You want to work with a Python list.

Solution

A list is an ordered Python collection. A list is used in many problems. We are going to concentrate on the following problems:
  • Creating a list
  • Extending a list
  • Appending a list
  • Counting the number of elements in a list
  • Sorting a list
A Python list is mutable. Generally, we create a list with objects of a similar type, but a list can also hold objects of different types. A list is created using square brackets, [ ]. A list object has many built-in functions to work with. Extending a list can be done by using the extend() function, and appending to a list can be done using the append() function. You might be wondering, what is the difference between appending to a list and extending a list? You will get the answer soon.
Counting the number of elements in a list can be done by using the len() function, and sorting by using the sort() function.

How It Works

Use the following steps to solve our problem.

Step 3-4-1. Creating a List

Let’s create a list:
>>> pythonList = [2.3,3.4,4.3,2.4,2.3,4.0]
>>> pythonList
Here is the output:
[2.3, 3.4, 4.3, 2.4, 2.3, 4.0]
A list can be indexed by using square brackets, [ ]. The first element of a list is indexed with 0:
>>> pythonList[0]
Here is the output:
2.3
>>> pythonList[1]
Here is the output:
3.4
pythonList[0] indexes the first element of the list pythonList, and pythonList[1] indexes the second element. Hence if a list has n elements, the last element can be indexed as pythonList[n-1].
The elements of a list can be of different data types. The following example makes this clear:
>>> pythonList1 = ["a",1]
The preceding line creates a list with two elements. The first element of the list is a string, and the second element is an integer.
>>> pythonList1
Here is the output:
['a', 1]
>>> type(pythonList1)
Here is the output:
<type 'list'>
The type() function reports the type of pythonList1 as list. Now let’s check the type of each element:
>>> type (pythonList1[0])
Here is the output:
<type 'str'>
>>> type (pythonList1[1])
Here is the output:
<type 'int'>
The type of the first element is a string, and the second element is an integer, which is being shown by the type() function.
A list is a mutable collection in Python. A mutable collection means the elements of the collection can be changed. Let’s look at an example:
>>> pythonList1 = ["a",1]
>>> pythonList1[0] = 5
>>> print pythonList1
Here is the output:
[5, 1]
In this example, we re-created the list pythonList1 and then changed its first element to 5. Printing pythonList1 shows that its first element is now 5.

Step 3-4-2. Extending a List

The extend() function takes an iterable object, such as another list, as an argument and adds each of its elements to the end of the calling list:
>>> pythonList1 = [5,1]
>>> print pythonList1
Here is the output:
[5, 1]
>>> pythonList1.extend([1,2,3])
>>> pythonList1
Here is the output:
[5, 1, 1, 2, 3]

Step 3-4-3. Appending a List

Applying append() to a list adds the Python object provided as an argument to the end of the list as a single element. In this example, append() adds another list as one nested element of the existing list:
>>> pythonList1 = [5,1]
>>> pythonList1.append([1,2,3])
>>> print pythonList1
Here is the output:
[5, 1, [1, 2, 3]]
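To make the difference between extending and appending concrete, here is a small side-by-side sketch. The list names are illustrative; both lists start from the same two elements and receive the same argument.

```python
extended = [5, 1]
extended.extend([1, 2, 3])   # adds each element of the argument

appended = [5, 1]
appended.append([1, 2, 3])   # adds the argument itself as one element

extended_length = len(extended)   # five separate elements
appended_length = len(appended)   # two numbers plus one nested list
```

The lengths tell the story: extend() grows the list by the number of elements in its argument, whereas append() always grows the list by exactly one.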

Step 3-4-4. Counting the Number of Elements in a List

The length of a list is the number of elements in the list. The len() function will return the length of a list as follows:
>>> pythonList1 = [5,1]
>>> len(pythonList1)
Here is the output:
2

Step 3-4-5. Sorting a List

Sorting a list can be done in increasing or decreasing order. The sort() function can be applied in the following ways to sort a given list in either ascending or descending order.
Let’s start with sorting our list pythonList in an ascending order in the following code:
>>> pythonList = [2.3,3.4,4.3,2.4,2.3,4.0]
>>> pythonList
Here is the output:
[2.3, 3.4, 4.3, 2.4, 2.3, 4.0]
>>> pythonList.sort()
>>> pythonList
Here is the output:
[2.3, 2.3, 2.4, 3.4, 4.0, 4.3]
In order to sort data in descending order, we have to provide the reverse argument as True:
>>> pythonList.sort(reverse=True)
>>> pythonList
Here is the output:
[4.3, 4.0, 3.4, 2.4, 2.3, 2.3]
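One detail worth knowing is that sort() reorders the list in place and returns None. When the original order must be preserved, Python’s built-in sorted() function returns a new sorted list instead. The following sketch contrasts the two; the variable names are illustrative.

```python
readings = [2.3, 3.4, 4.3, 2.4, 2.3, 4.0]

# sorted() returns a new list and does not modify its argument.
ascending = sorted(readings)
descending = sorted(readings, reverse=True)
unchanged = list(readings)            # copy taken before sorting in place

# sort() reorders the list itself and returns None.
in_place_result = readings.sort()
```

A common mistake is to write readings = readings.sort(), which replaces the list with None; the sketch stores the return value separately to make that behavior visible.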

Recipe 3-5. Work with a Python Tuple

Problem

You want to work with a Python tuple.

Solution

A tuple is an immutable ordered collection. We are going to solve the following set of problems:
  • Creating a tuple
  • Getting the index of an element of a tuple
  • Counting the occurrence of a tuple element
A tuple, an immutable collection, is generally used to create record data. A tuple is created by using a set of parentheses, ( ). A tuple is an ordered sequence. We can put different data types together in a tuple. The index() function on a tuple object will provide us the index of the first occurrence of a given element. Another function, count(), defined on a tuple will return the frequency of a given element.

How It Works

Let’s work out the solution step-by-step.

Step 3-5-1. Creating a Tuple

The following code creates a tuple by using parentheses, ( ):
>>> pythonTuple = (2.0,9,"a",True,"a")
Here we have created a tuple, pythonTuple, which has five elements. The first element of our tuple is a decimal number, the second element is an integer, the third one is a string, the fourth one is a Boolean, and the last one is a string.
Now let’s check the type of the pythonTuple object:
>>> type(pythonTuple)
Here is the output:
<type 'tuple'>
Indexing a tuple works the same way as indexing a list; we use square brackets, [ ]:
>>> pythonTuple[2]
Here is the output:
'a'
This next line of code shows that a tuple is immutable. We try to assign a new value to the element at index 1:
>>> pythonTuple[1] = 5
Here is the output:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'tuple' object does not support item assignment
We can see that we cannot modify the elements of a tuple.

Step 3-5-2. Getting the Index of a Tuple Element

The index() function requires an element as an argument and returns the index of the first occurrence of a value in a given tuple. In our tuple pythonTuple, 'a' is at index 2. We can get this index as follows:
>>> pythonTuple.index('a')
Here is the output:
2
If a value is not found in a given tuple, the index() function raises a ValueError exception.

Step 3-5-3. Counting the Occurrences of a Tuple Element

Providing an element as input in a count() function returns the frequency of that element in a given tuple. We can use the count() function in the following way:
>>> pythonTuple.count("a")
Here is the output:
2
In our tuple pythonTuple, the character a has occurred twice. Therefore, the count() function has returned 2.
You can try a simple exercise: apply count() on that tuple with 1 as an argument. You will get 1 as the answer, because the Boolean True compares equal to the integer 1.
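Because a tuple is often used as a record, it is convenient to unpack its elements into separate names. The sketch below does that and also wraps index() in a hypothetical helper, index_or_none, so that a missing value yields None instead of an exception. The field names (price, quantity, and so on) are invented for illustration; the tuple in this recipe carries no such meaning.

```python
python_tuple = (2.0, 9, "a", True, "a")

# Unpacking assigns each element to its own name (names are hypothetical).
price, quantity, code, in_stock, grade = python_tuple

def index_or_none(items, value):
    """Return the first index of value in items, or None if it is absent."""
    try:
        return items.index(value)
    except ValueError:
        return None

first_a = index_or_none(python_tuple, "a")
no_z = index_or_none(python_tuple, "z")
count_of_one = python_tuple.count(1)   # True == 1, so True is counted
```

Unpacking requires exactly as many names on the left as there are elements in the tuple; otherwise Python raises a ValueError.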

Recipe 3-6. Work with a Python Set

Problem

You want to work with a Python set.

Solution

Dealing with a collection of distinct elements requires us to use a set. We are going to work on the following tasks:
  • Creating a set
  • Adding a new element to a set
  • Performing a union on sets
  • Performing an intersection operation on sets
A set cannot have duplicate elements. A Python set is created using a set of curly brackets, { }. The most important point to note is that we can include duplicate items when creating a set, but the set will keep only one copy of each. As with other collections, many functions have been defined on the set object too. If you want to add a new element to a set, you can use the add() function.
Taking the union of two sets is a common activity that we’ll find in our day-to-day tasks. The union() function, which is defined on the set object, returns the union of two sets.
The intersection() function is used to compute the intersection of two given Python sets.

How It Works

In this section, we will solve our given problems step-by-step.

Step 3-6-1. Creating a Set

Let’s create a set of stationery items and then verify that it contains only distinct elements. This line of code creates a set of stationery items:
>>> pythonSet = {'Book','Pen','NoteBook','Pencil','Book'}
>>> pythonSet
Here is the output:
set(['Pencil', 'Pen', 'Book', 'NoteBook'])
We have created a set. We can observe that putting in a duplicate element doesn’t throw an error, but the set ignores that duplicate element when it is created.

Step 3-6-2. Adding a New Element to a Set

By using the add() function, we can add a new element to the set. We have already created the set pythonSet. Now we are going to add a new element, Eraser, to our set, as shown here:
>>> pythonSet.add("Eraser")
>>> pythonSet
Here is the output:
set(['Pencil', 'Pen', 'Book', 'Eraser', 'NoteBook'])
You can see in this example that we have added the new element Eraser to our set.

Step 3-6-3. Performing a Union on Sets

A union operation on a Python set will behave in a mathematical way. Let’s create another set for this example:
>>> pythonSet1 = {'NoteBook','Pencil','Diary','Marker'}
>>> pythonSet1
Here is the output:
set(['Marker', 'Pencil', 'NoteBook', 'Diary'])
A union of two sets returns another set containing every element that appears in either set, with elements common to both included only once. In the following example, we can see that the union of pythonSet and pythonSet1 has returned all the merged elements in pythonSet and pythonSet1:
>>> pythonSet.union(pythonSet1)
Here is the output:
set(['Pencil', 'Pen', 'NoteBook', 'Book', 'Eraser', 'Diary', 'Marker'])

Step 3-6-4. Performing an Intersection Operation on Sets

The intersection() function will return a new set with common elements from two given sets. We have already created two sets, pythonSet and pythonSet1. We can observe that Pencil and NoteBook are common elements in our sets. In the following line, we use intersection() on our sets:
>>> pythonSet.intersection(pythonSet1)
Here is the output:
set(['Pencil', 'NoteBook'])
After running the code, it is clear that the intersection will return the elements that are common to both sets.
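Python sets also support the operators | and &, which mirror union() and intersection(). The following sketch rebuilds the two sets from this recipe and uses the operators; it also checks that a union leaves the original set unchanged. The variable names are illustrative.

```python
python_set = {'Book', 'Pen', 'NoteBook', 'Pencil', 'Book', 'Eraser'}
python_set1 = {'NoteBook', 'Pencil', 'Diary', 'Marker'}

# The | and & operators mirror union() and intersection().
all_items = python_set | python_set1
common_items = python_set & python_set1

# union() returns a new set; the set it is called on is not modified.
size_before = len(python_set)
python_set.union(python_set1)
size_after = len(python_set)
```

Because sets are unordered, the order in which the console prints their elements can differ from run to run; comparing sets with == ignores order.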

Recipe 3-7. Work with a Python Dictionary

Problem

You want to work with a Python dictionary.

Solution

You have seen that lists and tuples are indexed by position numbers. Lookups are often clearer when the index is a meaningful word rather than a number. A Python dictionary is a data structure that stores key/value pairs; each element of a dictionary is one such pair. In this exercise, you want to do the following operations on a Python dictionary:
  • Create a dictionary of stationery items, with the item ID as the key and the item name as the value
  • Index an element using a key
  • Get all the keys
  • Get all the values
The creation of a dictionary can be achieved by using a set of curly brackets, { }. You might be a little confused that we created a Python set using curly brackets and now we are going to create a dictionary in the same way. The difference is that, to create a dictionary, we have to provide key/value pairs inside the curly brackets, for example, {key:value}. In a key/value pair, the key is separated from the value by a colon (:). Two different key/value pairs are separated by a comma (,).
You can index a dictionary by using square brackets, [ ]. A Python dictionary object has a get() function, which returns the value for a given key.
You can get all keys by using the keys() function; the values() function returns all the values.

How It Works

We’ll use the steps in this section to solve our problem.

Step 3-7-1. Creating a Dictionary of Stationery Items

We’ll create a dictionary with the following line of Python code:
>>> pythonDict = {'item1':'Pencil','item2':'Pen', 'item3':'NoteBook'}
>>> pythonDict
Here is the output:
{'item2': 'Pen', 'item3': 'NoteBook', 'item1': 'Pencil'}

Step 3-7-2. Indexing an Element by Using a Key

Let’s fetch the value of 'item1':
>>> pythonDict['item1']
Here is the output:
'Pencil'
But if the key is not found in the dictionary, a KeyError exception is thrown:
>>> pythonDict['item4']
Here is the output:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'item4'
To prevent a KeyError exception, we can use the get() function on a dictionary. Then, if a value is not found for a given key, it returns None (and otherwise returns the value associated with that key). Let’s explore get() in the following Python code snippet:
>>> pythonDict.get('item1')
Here is the output:
'Pencil'
We know that the key item4 is not in our dictionary. Therefore, the get() function returns None in this case, and the console displays nothing:
>>> pythonDict.get('item4')

Step 3-7-3. Getting All the Keys

Getting all the keys together is often required in problem-solving. Let’s get all the keys by using our keys() function:
>>> pythonDict.keys()
Here is the output:
['item2', 'item3', 'item1']
The keys() function returns all the keys in a list.

Step 3-7-4. Getting All the Values

Applying the values() function will return all the values in a dictionary:
>>> pythonDict.values()
Here is the output:
['Pen', 'NoteBook', 'Pencil']
Similar to the keys() function, values() also returns all the values in a list as output.
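The get() function also accepts an optional second argument that is returned when the key is missing. The sketch below shows that default, and it sorts the keys to obtain a predictable order. It is written for Python 3, where keys() and values() return view objects rather than the lists a Python 2 console shows; the variable names are illustrative.

```python
python_dict = {'item1': 'Pencil', 'item2': 'Pen', 'item3': 'NoteBook'}

# get() returns None for a missing key, or the default if one is supplied.
missing_value = python_dict.get('item4')
fallback_value = python_dict.get('item4', 'Unknown')

# Sorting the keys gives a predictable order regardless of Python version.
ordered_keys = sorted(python_dict.keys())
ordered_pairs = [(key, python_dict[key]) for key in ordered_keys]
```

Supplying a default to get() is often tidier than testing for None afterward, especially when None could itself be a legitimate stored value.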

Recipe 3-8. Define and Call Functions

Problem

You want to write a Python function that takes an integer as input and returns True if the input is an even number, and otherwise returns False.

Solution

Functions improve code readability and reusability. In Python, a function definition is started with the def keyword. The user-defined name of a function follows that def keyword. If a function takes an argument, it is written within parentheses; otherwise, we write just the set of parentheses. The colon symbol (:) follows the parentheses. Then we start writing the statements as the body of the function. The body is indented with respect to the line containing the def keyword. If our user-defined function returns a value, that is achieved by using the return Python keyword.

How It Works

We are going to create a function named isEvenInteger. The following lines define our function:
>>> def isEvenInteger(ourNum) :
...     return ourNum % 2 == 0
In this code example, we can see that our function name is isEvenInteger. Our function name follows the def keyword. The argument to our function, which is ourNum in this case, is within parentheses. After the colon, we define our function body. In this function body, there is only one statement, which returns the result of a logical expression. We return a value from a function by using the return keyword. If a function doesn’t return any value, the return keyword is omitted. The logical expression in the function body checks whether the remainder of dividing the input number by 2 equals zero.
After defining our function isEvenInteger, we should call it so we can watch it work. To test our function, we’ll call it with an even integer argument and then with an odd integer argument. Let’s start with the even integer input. Here we provide input 4, which is an even integer:
>>> isEvenInteger(4)
Here is the output:
True
As we can see, the output is True. This means that the input we have provided is an even number.
Now let’s provide an odd number as input:
>>> isEvenInteger(5)
Here is the output:
False
With an odd number as the input, our function isEvenInteger has returned False.
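The same function can be written as a script with a docstring and then reused inside other expressions. The following sketch also includes a second function, announce, that has no return statement, to show that calling such a function yields None. Both the snake_case names and the announce function are illustrative additions, and the print call assumes Python 3.

```python
def is_even_integer(our_num):
    """Return True if our_num is evenly divisible by 2, otherwise False."""
    return our_num % 2 == 0

def announce(our_num):
    """Print a message; with no return statement, the call yields None."""
    print("checking", our_num)

# Reuse the function to pick the even numbers from 1 through 10.
even_numbers = [n for n in range(1, 11) if is_even_integer(n)]
nothing = announce(4)
```

Defining the test once and calling it wherever it is needed is exactly the reusability that functions are meant to provide.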

Recipe 3-9. Create and Call Lambda Functions

Problem

You want to create a lambda function that takes an integer as input and returns True if the input is an even number, and otherwise returns False.

Solution

A lambda function is also known as an anonymous function. A lambda function is a function without a name. A lambda function in Python is defined with the lambda keyword.

How It Works

Let’s create a lambda function that will check whether a given number is an even number:
>>> isEvenInteger = lambda ourNum : ourNum%2 == 0
Our lambda function is lambda ourNum : ourNum%2 == 0. You can see that we have used the keyword lambda to define it. After the keyword lambda, the arguments of the function have been written. Then after the colon, we have a logical expression. Our logical expression will check whether the input number is divisible by 2. In Python, functions are objects, so our lambda function is also an object. Therefore, we can put our object in a variable. That is how we can put our lambda function in the variable isEvenInteger:
>>> isEvenInteger(4)
Here is the output:
True
>>> isEvenInteger(5)
Here is the output:
False
The drawback of lambda functions is that the body is limited to a single expression. The result of that expression is returned automatically, so the return keyword is not required. A lambda function cannot contain multiple statements the way a function defined with def can.
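Lambda functions are most often passed directly to another function rather than stored in a variable. As a sketch of that style, the following snippet hands lambdas to Python’s built-in filter(), map(), and sorted() functions. The sample list is made up for illustration, and list() is wrapped around filter() and map() because in Python 3 they return lazy iterators.

```python
numbers = [7, 4, 10, 3, 8]

# filter() keeps the elements for which the lambda returns True.
evens = list(filter(lambda our_num: our_num % 2 == 0, numbers))

# map() applies the lambda to every element.
squares = list(map(lambda our_num: our_num * our_num, numbers))

# sorted() accepts a lambda as its key; key 0 (even) sorts before key 1 (odd).
evens_first = sorted(numbers, key=lambda our_num: our_num % 2)
```

Because sorted() is stable, elements with the same key keep their original relative order, so the even numbers appear in the order they had in numbers, followed by the odd ones.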

Recipe 3-10. Work with Python Conditionals

Problem

You want to work with Python conditionals.

Solution

A car manufacturing company manufactures three types of cars. These three types of cars are differentiated by their number of cylinders. It is obvious that the different number of cylinders results in different gas mileage. Car A has four cylinders, car B has six cylinders, and car C has eight cylinders. Table 3-1 shows the number of cylinders and respective gas mileage of the cars.
Table 3-1. Number of Cylinders and Respective Gas Mileage
Number of Cylinders    Gas Mileage (Miles per Gallon)
4                      22
6                      18
8                      16
Mr. A, a sales executive, always forgets about this relationship between the number of cylinders and the gas mileage. So you want to create a small Python application that will return the mileage, given the number of cylinders. This will be a great help to Mr. A.
Conditionals are essential in solving almost any programming problem, and Python is no exception. Conditionals are implemented in Python by using the if, elif, and else keywords.

How It Works

To create the required Python application, we are going to write a Python function that will take the number of cylinders as input and return the associated mileage.
Here’s the function we’ll use to create our application:
>>> def mpgFind(numOfCylinders) :
...     if (numOfCylinders == 4) :
...         mpg = 22
...     elif (numOfCylinders == 6) :
...         mpg = 18
...     else :
...         mpg = 16
...     return mpg
Here we have defined a function named mpgFind. Our function takes the argument numOfCylinders. Inside the function, the argument is first tested by the logical expression numOfCylinders == 4. If numOfCylinders has the value 4, the logical expression evaluates to True and the if block is executed. Our if block has only one statement, mpg = 22. If required, more than one statement can be provided.
If numOfCylinders == 4 evaluates to False, the elif logical expression is tested. If the value of numOfCylinders is 6, the logical expression of elif evaluates to True, and 18 is assigned to mpg.
What if the logical expression of if and elif both return False? In this condition, the else block will be executed. The variable mpg will be assigned 16.
Let’s call our function with the input 4 and see the result:
>>> mpgFind(4)
Here is the output:
22
You can clearly see that our function can help our sales executive, Mr. A.
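One consequence of ending the chain with a plain else is that any input other than 4 or 6, such as 5, is treated as an eight-cylinder car. As a sketch of a stricter variant, the hypothetical function mpg_find_strict below tests for 8 explicitly and raises a ValueError for cylinder counts that do not appear in Table 3-1. Raising an error is an assumption about the desired behavior, not part of the original application.

```python
def mpg_find_strict(num_of_cylinders):
    """Return the mileage from Table 3-1, rejecting unlisted cylinder counts."""
    if num_of_cylinders == 4:
        mpg = 22
    elif num_of_cylinders == 6:
        mpg = 18
    elif num_of_cylinders == 8:
        mpg = 16
    else:
        # Hypothetical policy: refuse to guess for counts not in the table.
        raise ValueError("no mileage recorded for %r cylinders" % (num_of_cylinders,))
    return mpg

four_cylinder_mpg = mpg_find_strict(4)
eight_cylinder_mpg = mpg_find_strict(8)
```

An if statement can carry as many elif branches as needed, and the final else then serves as the catch-all for everything the earlier branches did not match.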

Recipe 3-11. Work with Python “for” and “while” Loops

Problem

You want to work with the Python for and while loops.

Solution

After learning that his application was written in Python, sales executive Mr. A became interested in learning Python. He joined a Python class. His instructor, Mr. X, gave him an assignment to solve. Mr. X asked his class to implement a Python function that will take a list of integers from 1 to 101 as input and then return the sum of the even numbers in the list. Mr. A found that the required function can be implemented using the following:
  • for loop
  • while loop
Loops let us reuse code: we use a loop to run the same segment of code many times. Python provides for and while loops for this purpose. To get the sum of the even numbers in a list, we need two steps: first we find the even numbers, and then we add them up.

How It Works

Step 3-11-1. Implementing a for Loop

First, we are going to get the summation of even numbers in a list by using a for loop in Python:
>>> def sumUsingForLoop(numbers) :
...     sumOfEvenNumbers = 0
...     for i in numbers :
...         if i % 2 == 0 :
...             sumOfEvenNumbers = sumOfEvenNumbers + i
...         else :
...             pass
...     return sumOfEvenNumbers
Here we have written a Python function named sumUsingForLoop. Our function takes numbers as input. A for loop in Python iterates over a Python sequence. If we send a list of integers, the for loop iterates over it element by element. The if block checks whether the number being considered is even. If it is, the number is added to the variable sumOfEvenNumbers. After the iteration completes, the final sum is returned.
Let’s check the working of our function:
>>> numbers = range(1,102)
>>> numbers
Here is the output:
[1, 2, 3, 4, 5, 6, 7, ..., 98, 99, 100, 101]
We have created a list of integers, from 1 to 101, using the range() function. Let’s call our function sumUsingForLoop and see the result:
>>> sumUsingForLoop(numbers)
Here is the output:
2550

Step 3-11-2. Implementing a while Loop

In this section, we are going to implement a solution to our problem that uses a while loop:
>>> def sumUsingWhileLoop(numbers):
...     i = 0
...     sumOfEvenNumbers = 0
...     while i < len(numbers):
...         if numbers[i] % 2 == 0:
...             sumOfEvenNumbers = sumOfEvenNumbers + numbers[i]
...         i = i + 1
...     return sumOfEvenNumbers
Here we have defined a Python function named sumUsingWhileLoop. The function takes a list, numbers, as input. A while loop repeats its body as long as its condition holds; here the index i starts at 0 and is increased by 1 on each pass, so the loop visits the list element by element until i reaches the length of the list. The if block checks whether the current element is even. If it is, the element is added to the variable sumOfEvenNumbers. After the loop completes, the final sum is returned.
Let’s test our function:
>>> sumUsingWhileLoop(numbers)
Here is the output:
2550
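Both loop versions can be cross-checked in two ways: with Python's built-in sum() over a generator expression, and with the closed form for the even numbers from 2 to 100, which is 2 x (1 + 2 + ... + 50) = 50 x 51 = 2550. The following is a minimal sketch; the function name sumUsingBuiltin is illustrative and not part of the recipe.

```python
def sumUsingBuiltin(numbers):
    # Keep only the even numbers and let the built-in sum() add them up.
    return sum(n for n in numbers if n % 2 == 0)

# list(range(...)) yields a list on both Python 2 and Python 3.
numbers = list(range(1, 102))
result = sumUsingBuiltin(numbers)

# Closed form: the even numbers 2..100 are 2 * (1 + 2 + ... + 50).
closedForm = 2 * (50 * 51 // 2)

print(result)      # 2550
print(closedForm)  # 2550
```

Agreement between the loop, the built-in, and the closed form is a quick sanity check that the loop bounds are right.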

Recipe 3-12. Work with NumPy

Problem

You want to work with NumPy.

Solution

Company XYZ wants to build its new factory at location A. The company needs a location whose temperature meets certain specific criteria. Environmental scientist Mr. Y gathers temperature readings at site A for five days at different times of day.
Table 3-2 depicts the temperature readings.
Table 3-2. Temperatures in Celsius
You want to do the following:
  • Install pip
  • Install NumPy
  • Create a two-dimensional array using the NumPy array() function
  • Create a two-dimensional array using vertical and column stacking of smaller arrays
  • Find and change the data type of array elements
  • Find the shape of a given array
  • Calculate minimum and maximum temperature for each day
  • Calculate minimum and maximum temperature column-wise
  • Calculate the mean temperature for each day and column-wise
  • Calculate the standard deviation of temperature for each day and column-wise
  • Calculate the variance of temperature for each day and column-wise
  • Calculate the median temperature for each day and column-wise
  • Calculate the overall mean from all the gathered temperature data
  • Calculate the variance and standard deviation over all five days of temperature data
These sorts of simple mathematical questions can be solved easily by using NumPy. You might be thinking that we could solve these problems by using nested lists, so why use NumPy? Operations on a NumPy ndarray are vectorized: the looping happens inside NumPy's compiled code rather than in Python, which is typically much faster than looping over nested lists. NumPy is also open source and easily available.
The NumPy ndarray is a higher-level abstraction for multidimensional array data. It also provides many functions for working on those multidimensional arrays. In order to work with NumPy, we first have to install it. After creating a two-dimensional array of given temperature data, we can apply many NumPy functions to solve the given problems.
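To make the contrast with nested lists concrete, the following sketch computes the mean of each row both ways. It is only an illustration under assumptions: the variable names are hypothetical, the conventional alias np is used instead of the alias chosen later in this recipe, and only two days of readings are included to keep it short.

```python
import numpy as np  # conventional alias; the recipe's own code uses NumPy

# Two days of readings from the temperature data used in this recipe.
readings = [[15, 16, 17, 17, 18, 17, 16, 14],
            [14, 15, 17, 17, 16, 17, 16, 15]]

# Nested lists: an explicit Python-level loop over the rows.
listMeans = [sum(row) / float(len(row)) for row in readings]

# ndarray: a single call, with the looping done inside NumPy.
arrayMeans = np.array(readings).mean(axis=1)

print(listMeans)   # [16.25, 15.875]
print(arrayMeans)
```

On data this small the speed difference is negligible; the benefit shows up as arrays grow, and the array version is also shorter to write.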

How It Works

We’ll use the following steps to solve our problem.

Step 3-12-1. Installing pip

pip is a Python package management system that is written in Python itself. We can use pip to install other Python packages. Using the yum installer, we can install pip as follows:
[pysparkbook@localhost ~]$ sudo yum install python-pip
After installing pip, we have to install pyparsing. Run the following command to install pyparsing:
[pysparkbook@localhost ~]$ sudo yum install ftp://mirror.switch.ch/pool/4/mirror/centos/7.3.1611/cloud/x86_64/openstack-kilo/common/pyparsing-2.0.3-1.el7.noarch.rpm

Step 3-12-2. Installing NumPy

After installing pip, we can use it to install NumPy. The following command installs NumPy on our machine:
[pysparkbook@localhost ~]$ sudo pip install numpy
Here is the output:
Collecting numpy
  Downloading numpy-1.12.0-cp27-cp27mu-manylinux1_x86_64.whl (16.5MB)
    100% |████████████████████████████████| 16.5MB 64kB/s
Installing collected packages: numpy
Successfully installed numpy-1.12.0

Step 3-12-3. Creating a Two-Dimensional Array by Using array( )

A multidimensional array can be created in many ways. In this step, we are going to create a two-dimensional array by using a nested Python list. So let’s start creating a daily list of temperature data. In the following chunk of code, we are creating five lists for five days of temperatures:
>>> import numpy as NumPy
>>> temp1 = [15, 16, 17, 17, 18, 17, 16, 14]
>>> temp2 = [14, 15, 17, 17, 16, 17, 16, 15]
>>> temp3 = [16, 15, 17, 18, 17, 16, 15, 14]
>>> temp4 = [16, 17, 18, 19, 17, 15, 15, 14]
>>> temp5 = [16, 15, 17, 17, 17, 16, 15, 13]
The variable temp1 has the temperature measurements of the first day. Similarly, temp2, temp3, temp4, and temp5 have measurements of the temperature on the second, third, fourth, and fifth day, respectively.
Our two-dimensional array of temperatures can be created by using the NumPy array() function as follows:
>>> dayWiseTemp = NumPy.array([temp1,temp2,temp3,temp4,temp5])
>>> dayWiseTemp
Here is the output:
array([[15, 16, 17, 17, 18, 17, 16, 14],
       [14, 15, 17, 17, 16, 17, 16, 15],
       [16, 15, 17, 18, 17, 16, 15, 14],
       [16, 17, 18, 19, 17, 15, 15, 14],
       [16, 15, 17, 17, 17, 16, 15, 13]])
Now we have created a two-dimensional array.

Step 3-12-4. Creating a Two-Dimensional Array by Stacking

We can create an array by using vertical stacking and column stacking of data. First, we are going to re-create the same temperature array by using vertical stacking, which is done with the NumPy vstack() function:
>>> dayWiseTemp = NumPy.vstack((temp1,temp2,temp3,temp4,temp5))
>>> dayWiseTemp
Here is the output:
array([[15, 16, 17, 17, 18, 17, 16, 14],
       [14, 15, 17, 17, 16, 17, 16, 15],
       [16, 15, 17, 18, 17, 16, 15, 14],
       [16, 17, 18, 19, 17, 15, 15, 14],
       [16, 15, 17, 17, 17, 16, 15, 13]])
Now let’s see how to do column stacking. The temperature data has been collected at different times on different days. Let’s now create one array of temperature data for each time of day:
>>> d6am = NumPy.array([15, 14, 16, 16, 16])
>>> d8am = NumPy.array([16, 15, 15, 17, 15])
>>> d10am = NumPy.array([17, 17, 17, 18, 17])
>>> d12am = NumPy.array([17, 17, 18, 19, 17])
>>> d2pm = NumPy.array([18, 16, 17, 17, 17])
>>> d4pm = NumPy.array([17, 17, 16, 15, 16])
>>> d6pm = NumPy.array([16, 16, 15, 15, 15])
>>> d8pm = NumPy.array([14, 15, 14, 14, 13])
Column stacking can be done with the NumPy column_stack() function as follows:
>>> dayWiseTemp = NumPy.column_stack((d6am,d8am,d10am,d12am,d2pm,d4pm,d6pm,d8pm))
>>> dayWiseTemp
Here is the output:
array([[15, 16, 17, 17, 18, 17, 16, 14],
       [14, 15, 17, 17, 16, 17, 16, 15],
       [16, 15, 17, 18, 17, 16, 15, 14],
       [16, 17, 18, 19, 17, 15, 15, 14],
       [16, 15, 17, 17, 17, 16, 15, 13]])
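The relationship between the two stacking functions can be sketched on a small subset of the data: vstack() treats each input as a row, whereas column_stack() treats each input as a column. The example below is an illustration only; the variable names are hypothetical and it uses just the first four readings of the first two days.

```python
import numpy as np  # conventional alias; the recipe's own code uses NumPy

day1 = [15, 16, 17, 17]   # first four readings of day 1
day2 = [14, 15, 17, 17]   # first four readings of day 2

# vstack: each input becomes one row of the result.
byRows = np.vstack((day1, day2))

# column_stack: each input becomes one column of the result.
t6am = np.array([15, 14])
t8am = np.array([16, 15])
t10am = np.array([17, 17])
t12 = np.array([17, 17])
byCols = np.column_stack((t6am, t8am, t10am, t12))

# Both routes build the same 2 x 4 array.
print(np.array_equal(byRows, byCols))   # True

# Stacking the per-time arrays as rows and transposing also gives the same array.
print(np.array_equal(np.vstack((t6am, t8am, t10am, t12)).T, byCols))   # True
```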

Step 3-12-5. Knowing and Changing the Data Type of Array Elements

The NumPy array dtype attribute will return the data type of a NumPy array:
>>> dayWiseTemp.dtype
Here is the output:
dtype('int64')
We can observe that NumPy has inferred the data type as int64. From historical temperature data for the given location, we know that the temperature will generally vary between 0 and 100. Storing such small values as 64-bit integers wastes memory. We can use 32-bit integers, which take half the memory of 64-bit integers.
In order to create an array with the data type int32, we can provide the dtype argument for the array() function:
>>> dayWiseTemp = NumPy.array([temp1,temp2,temp3,temp4,temp5],dtype='int32')
>>> dayWiseTemp.dtype
Here is the output:
dtype('int32')
But if the array has already been created with the default data type, no worries. The data type of its elements can be changed by using the array's astype() method, which returns a new array with the requested type:
>>> dayWiseTemp = dayWiseTemp.astype('int32')
>>> dayWiseTemp.dtype
Here is the output:
dtype('int32')
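The memory saving can be checked with the itemsize and nbytes attributes of an array. The following sketch is an illustration with hypothetical variable names; it requests int64 explicitly so that the result does not depend on the platform default, and it uses only two days of readings.

```python
import numpy as np  # conventional alias; the recipe's own code uses NumPy

rows = [[15, 16, 17, 17, 18, 17, 16, 14],
        [14, 15, 17, 17, 16, 17, 16, 15]]

temps64 = np.array(rows, dtype='int64')
temps32 = temps64.astype('int32')   # astype() returns a new array

# itemsize is the size of one element in bytes; nbytes is the total for all 16 elements.
print("int64: %d bytes per element, %d bytes in total" % (temps64.itemsize, temps64.nbytes))
print("int32: %d bytes per element, %d bytes in total" % (temps32.itemsize, temps32.nbytes))

# The original array keeps its data type.
print(temps64.dtype)   # int64
```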

Step 3-12-6. Knowing the Dimensions of an Array

The shape of an array can be obtained by using the shape attribute of the array:
>>> dayWiseTemp.shape
Here is the output:
(5, 8)
The output clearly indicates that our array has five rows and eight columns.
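Because shape is an ordinary tuple, it can be unpacked into row and column counts. The related attributes ndim and size give the number of dimensions and the total number of elements. The following is a small sketch with hypothetical variable names, using two days of readings.

```python
import numpy as np  # conventional alias; the recipe's own code uses NumPy

temps = np.array([[15, 16, 17, 17, 18, 17, 16, 14],
                  [14, 15, 17, 17, 16, 17, 16, 15]])

rows, cols = temps.shape   # shape is a plain tuple, so it can be unpacked
print(rows)         # 2
print(cols)         # 8
print(temps.ndim)   # 2
print(temps.size)   # 16
```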

Step 3-12-7. Calculating the Minimum and Maximum Temperature Each Day

For a NumPy array, we can use the min() function to calculate the minimum. We can calculate the minimum of an array’s data either by row or by column. To calculate the minimum temperature value of a row, we have to set the value of the axis argument to 1.
In our case, the data in a row indicates the temperature during one day. The following line of code will compute the minimum temperature value during a day:
>>> dayWiseTemp.min(axis=1)
Here is the output:
array([14, 14, 14, 14, 13], dtype=int32)
In similar fashion, the daily maximum temperature can be calculated by using the NumPy array max() function:
>>> dayWiseTemp.max(axis=1)
Here is the output:
array([18, 17, 18, 19, 17], dtype=int32)

Step 3-12-8. Calculating the Minimum and Maximum Temperature by Column

We can get the minimum and maximum temperature of a column by using the same min() and max() functions, respectively. But now we have to set the axis argument to 0, as follows:
>>> dayWiseTemp.min(axis=0)
Here is the output:
array([14, 15, 17, 17, 16, 15, 15, 13], dtype=int32)
>>> dayWiseTemp.max(axis=0)
Here is the output:
array([16, 17, 18, 19, 18, 17, 16, 15], dtype=int32)
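A way to remember the axis argument is that it names the axis that gets collapsed: axis=1 collapses the columns and leaves one value per row, whereas axis=0 collapses the rows and leaves one value per column. The sketch below illustrates this on a small subset of the data; the variable names are hypothetical.

```python
import numpy as np  # conventional alias; the recipe's own code uses NumPy

# 2 days x 3 reading times (a subset of the recipe's data).
temps = np.array([[15, 16, 17],
                  [14, 15, 17]])

perDay = temps.min(axis=1)    # collapse the columns: one value per row
perTime = temps.min(axis=0)   # collapse the rows: one value per column

print(perDay)          # [15 14]
print(perTime)         # [14 15 17]
print(perDay.shape)    # (2,)
print(perTime.shape)   # (3,)
```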

Step 3-12-9. Calculating the Mean Temperature Each Day and by Column

Now we are going to calculate the mean value of the daily temperatures. Let’s start by calculating the mean temperature of a day:
>>> dayWiseTemp.mean(axis=1)
Here is the output:
array([ 16.25,  15.875,  16.,  16.375,  15.75 ])
In order to calculate the mean temperature by column, we have to set the axis argument to 0:
>>> dayWiseTemp.mean(axis=0)
Here is the output:
array([ 15.4,  15.6,  17.2,  17.6,  17.,  16.2,  15.4,  14. ])

Step 3-12-10. Calculating the Standard Deviation of Temperature Each Day

Let’s start with calculating the daily standard deviation of temperature. We are going to calculate the standard deviation by using the std() function of the ndarray class:
>>> dayWiseTemp.std(axis=1)
Here is the output:
array([ 1.19895788,  1.05326872,  1.22474487,  1.57619003,  1.29903811])
The column’s standard deviation can be calculated by using the std() function with the value of axis set to 0:
>>> dayWiseTemp.std(axis=0)
Here is the output:
array([ 0.8,  0.8,  0.4,  0.8,  0.63245553,
        0.74833148,  0.48989795,  0.63245553])

Step 3-12-11. Calculating the Variance of Temperature Each Day

The NumPy array var() function can be used to calculate variance per row. Let’s start with calculating the daily temperature variance:
>>> dayWiseTemp.var(axis=1)
Here is the output:
array([ 1.4375,  1.109375,  1.5,  2.484375,  1.6875  ])
The temperature variance of columns can be calculated as follows:
>>> dayWiseTemp.var(axis=0)
Here is the output:
array([ 0.64,  0.64,  0.16,  0.64,  0.4 ,  0.56,  0.24,  0.4 ])
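The standard deviation is the square root of the variance, and by default both std() and var() divide by the number of values n (the population form). Passing ddof=1 divides by n - 1 instead (the sample form). The following sketch checks this on the first day's readings; the variable names are hypothetical.

```python
import numpy as np  # conventional alias; the recipe's own code uses NumPy

day1 = np.array([15, 16, 17, 17, 18, 17, 16, 14])

popVar = day1.var()            # divides by n (ddof=0, the default)
popStd = day1.std()
sampleVar = day1.var(ddof=1)   # divides by n - 1

print(popVar)                            # 1.4375
print(np.isclose(popStd ** 2, popVar))   # True
print(sampleVar > popVar)                # True
```

Which form is appropriate depends on whether the readings are treated as the whole population or as a sample of it; the recipe's outputs use the default population form.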

Step 3-12-12. Calculating Daily and Hourly Medians

The median can be calculated by using the NumPy median() function. Setting axis to 1 gives the median for each day:
>>> NumPy.median(dayWiseTemp,axis=1)
Here is the output:
array([ 16.5,  16. ,  16. ,  16.5,  16. ])
Setting axis to 0 gives the median for each reading time:
>>> NumPy.median(dayWiseTemp,axis=0)
Here is the output:
array([ 16.,  15.,  17.,  17.,  17.,  16.,  15.,  14.])
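Each day has an even number of readings (eight), so the median is the average of the two middle values after sorting. The sketch below reproduces the first day's median by hand and compares it with median(); the variable names are hypothetical.

```python
import numpy as np  # conventional alias; the recipe's own code uses NumPy

day1 = [15, 16, 17, 17, 18, 17, 16, 14]

ordered = sorted(day1)   # [14, 15, 16, 16, 17, 17, 17, 18]
mid = len(ordered) // 2
# Even count: average the two middle values.
manualMedian = (ordered[mid - 1] + ordered[mid]) / 2.0

print(manualMedian)      # 16.5
print(np.median(day1))   # 16.5
```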

Step 3-12-13. Calculating the Overall Mean of All the Gathered Temperature Data

The NumPy mean() function can be used to calculate the overall mean of all data:
>>> NumPy.mean(dayWiseTemp)
Here is the output:
16.050000000000001

Step 3-12-14. Calculating the Variance and Standard Deviation over All Five Days of Temperature Data

The NumPy var() function can be used to calculate the variance of all the gathered data:
>>> NumPy.var(dayWiseTemp)
Here is the output:
1.6974999999999993
And the std() function can be used to calculate the standard deviation:
>>> NumPy.std(dayWiseTemp)
Here is the output:
1.3028814220795379
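When no axis is given, these functions treat the array as one flat collection of 40 readings. The following sketch recomputes the overall mean, variance, and standard deviation from their definitions and compares them with the NumPy results; the variable names are hypothetical.

```python
import numpy as np  # conventional alias; the recipe's own code uses NumPy

temps = np.array([[15, 16, 17, 17, 18, 17, 16, 14],
                  [14, 15, 17, 17, 16, 17, 16, 15],
                  [16, 15, 17, 18, 17, 16, 15, 14],
                  [16, 17, 18, 19, 17, 15, 15, 14],
                  [16, 15, 17, 17, 17, 16, 15, 13]])

flat = temps.ravel()                              # all 40 readings as one 1-D array
overallMean = flat.sum() / float(flat.size)       # sum of readings / number of readings
overallVar = ((flat - overallMean) ** 2).mean()   # mean squared deviation from the mean
overallStd = overallVar ** 0.5                    # square root of the variance

print(np.isclose(overallMean, np.mean(temps)))   # True
print(np.isclose(overallVar, np.var(temps)))     # True
print(np.isclose(overallStd, np.std(temps)))     # True
```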
Note
You can learn more about NumPy at www.numpy.org.

Recipe 3-13. Integrate IPython and IPython Notebook with PySpark

Problem

You want to integrate IPython and IPython Notebook with PySpark.

Solution

Integrating IPython with PySpark improves programmer efficiency. To accomplish this integration, we’ll do the following:
  • Install IPython
  • Integrate PySpark with IPython
  • Install IPython Notebook
  • Integrate PySpark with IPython Notebook
  • Run PySpark commands on IPython Notebook
You have become familiar with the Python interactive shell. This interactive shell enables us to learn Python faster, because we can see the result of each command, line by line. The Python shell that comes with Python is very basic. It does not come with tab completion and other features. A more advanced Python interactive shell is IPython. It has many advanced features that facilitate coding.
IPython Notebook provides a web-browser-based environment for writing Python code. Readers sometimes assume that a browser-based notebook needs an Internet connection. That’s not the case; IPython Notebook is served from our own machine, so we can run it without an Internet connection.
We can start PySpark with IPython and IPython Notebook.
It is time to install IPython and IPython Notebook and integrate PySpark with those. IPython and IPython Notebook can be installed using pip.

How It Works

In order to connect PySpark with IPython and IPython Notebook, we have to perform the following steps.

Step 3-13-1. Installing IPython

Let’s install IPython first. We have already installed pip, which can be used to install IPython with the following command:
[pysparkbook@localhost ~]$ sudo pip install ipython
Here is the output:
Collecting ipython
Successfully installed ipython-5.2.2 pathlib2-2.2.1 pexpect-4.2.1 pickleshare-0.7.4 prompt-toolkit-1.0.13 ptyprocess-0.5.1 scandir-1.4 simplegeneric-0.8.1 wcwidth-0.1.7

Step 3-13-2. Integrating PySpark with IPython

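After installation, we are going to start PySpark with IPython. These steps apply to Spark 1.x; from Spark 2.0 onward, the IPYTHON and IPYTHON_OPTS variables used in this recipe have been removed in favor of PYSPARK_DRIVER_PYTHON and PYSPARK_DRIVER_PYTHON_OPTS. First, we set the environment variable IPYTHON equal to 1, and then we start PySpark, as follows:
[pysparkbook@localhost ~]$ export IPYTHON=1
[pysparkbook@localhost ~]$ pyspark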
After starting PySpark, you’ll see the shell in Figure 3-1.
Figure 3-1. Shell
In Figure 3-1, you can see that the In [1] prompt has replaced the >>> prompt of the standard Python console.

Step 3-13-3. Installing IPython Notebook

Now we are going to install IPython Notebook. Again, let’s use pip for the installation:
[pysparkbook@localhost ~]$ sudo pip install ipython[notebook]

Step 3-13-4. Integrating PySpark with IPython Notebook

After installation, we have to set some environment variables:
[pysparkbook@localhost ~]$ export IPYTHON_OPTS="notebook"
[pysparkbook@localhost ~]$ export XDG_RUNTIME_DIR=""
Now it is time to start PySpark:
[pysparkbook@localhost ~]$ pyspark
After starting PySpark with IPython Notebook, we’ll see the web browser shown in Figure 3-2.
Figure 3-2. After starting PySpark with Notebook, you’ll see a web browser
You might be surprised that the browser shows Jupyter. There is nothing unusual here; Jupyter is the new name for IPython Notebook. Notebook is easy to work with; for example, by clicking the Upload button, we can upload files to the machine.

Step 3-13-5. Running PySpark Commands on IPython Notebook

In order to run PySpark commands, let’s create a new notebook by using Python 2, as shown in Figure 3-3.
Figure 3-3. Creating a new notebook using Python 2
After creating the notebook, you will see the web page in Figure 3-4.
Figure 3-4. After creating the notebook, you’ll see this web page
Now we can run Python commands to create a list. After writing a command, we have to run it. You can see in Figure 3-4 that we can run a command by using the Run button in the notebook.
In Figure 3-5, we are printing pythonList.
Figure 3-5. Printing the Python list