© Valentina Porcu 2018
Valentina Porcu, Python for Data Mining Quick Syntax Reference, https://doi.org/10.1007/978-1-4842-4113-4_9

9. SciPy and NumPy

Valentina Porcu
Nuoro, Italy

pandas is a central package for data analysis, and it is often used together with two other packages: SciPy and NumPy.

SciPy

SciPy is one of the most important packages for mathematical and statistical analysis in Python, and it is linked closely to NumPy. SciPy contains more than 60 statistical functions organized in families of modules:
  • scipy.cluster

  • scipy.constants

  • scipy.fftpack

  • scipy.integrate

  • scipy.interpolate

  • scipy.io

  • scipy.lib

  • scipy.linalg

  • scipy.misc

  • scipy.optimize

  • scipy.signal

  • scipy.sparse

  • scipy.spatial

  • scipy.special

  • scipy.stats

  • scipy.weave

>>> import scipy as sp
>>> from scipy import stats
>>> from scipy import cluster
We can get help with these modules by typing, for example:
>>> help(sp.cluster)
# or, if SciPy was imported without an alias
>>> import scipy
>>> help(scipy.cluster)

Documentation opens directly in the window, from which we can exit by pressing q.

For instance, SciPy can be used to evaluate the probability density function of a distribution at a given point (here, the standard normal distribution at 5):
>>> from scipy.stats import norm
>>> norm.pdf(5)
1.4867195147342979e-06
or the cumulative distribution function (x is any number or array defined beforehand):
>>> norm.cdf(x)
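
The two calls above can be sketched on a few concrete values. This is a minimal illustration, assuming SciPy is installed; the names points, density, and cumulative are chosen here for the example and do not come from the original text.

```python
import numpy as np
from scipy.stats import norm

# Illustrative evaluation points (an assumption, not from the text)
points = np.array([-1.96, 0.0, 1.96])

# Probability density function of the standard normal at each point
density = norm.pdf(points)

# Cumulative distribution function: P(X <= x) for each point
cumulative = norm.cdf(points)

print(density)
print(cumulative)
```

Both functions accept a single number or an array, so a whole column of values can be evaluated in one call.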

SciPy features very technical modules. For the purposes of this book, we are mainly interested in how it combines with NumPy, so this chapter focuses primarily on that second package.

NumPy

Early in Python's development, programmers found themselves having to incorporate tools for scientific computation. Their first attempt resulted in the Numeric package, which was developed in 1995, followed by an alternative package called Numarray. The functions of these two packages were merged in 2006 with NumPy.

NumPy stands for “numeric Python” and is an open-source library dedicated to scientific computing, especially with regard to array management. It is sometimes considered the Python counterpart of MATLAB, and features several high-level mathematical functions for linear algebra and random number generation.
>>> import numpy as np
>>> from numpy import *
NumPy’s most important structure is a particular type of multidimensional array, called ndarray. An ndarray consists of two elements: the data (the true ndarray) and the metadata describing the data (dtype, or data type). Each ndarray is associated with one and only one dtype. We can have one-dimensional arrays:
>>> arr1 = np.array([0,1,2,3,4])
>>> arr1
array([0, 1, 2, 3, 4])
Or multidimensional arrays:
>>> arr2 = np.array([[5,6,7,8,9], [10,11,12,13,14]])
>>> arr2
array([[ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])
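>>> arr2_1 = np.array([[5,6,7,8,9], [10,11,12,13,14], [2,5,7]])
# the rows here have different lengths: older NumPy versions build an array of list objects, whereas recent versions (1.24 and later) raise a ValueError
# the important thing is that, within one array, the data have the same type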
We can display the data type using the .dtype attribute.
>>> arr1.dtype
dtype('int64')
# the amount of items in the array
>>> arr1.size
5
# the number of bytes used by each item
>>> arr1.itemsize
8
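
As a small sketch of how these attributes relate to one another, the following example inspects a two-dimensional array. The array name arr and the explicit dtype=np.int64 are assumptions made for the example, so that the item size is the same on every platform.

```python
import numpy as np

# Two rows, five columns; dtype fixed explicitly for reproducibility
arr = np.array([[5, 6, 7, 8, 9], [10, 11, 12, 13, 14]], dtype=np.int64)

print(arr.ndim)      # number of dimensions
print(arr.shape)     # (rows, columns)
print(arr.size)      # total number of items
print(arr.itemsize)  # bytes used by each item
print(arr.nbytes)    # total bytes, equal to size * itemsize
```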
For instance, by using the NumPy arange() function, we can create an array of consecutive numbers; in the following case, from 0 to 99:
>>> arr3 = np.arange(100)
>>> arr3
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
       34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50,
       51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67,
       68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84,
       85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99])
# arange also allows us to set a range of numbers
>>> np.arange(1,10)
array([1, 2, 3, 4, 5, 6, 7, 8, 9])
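
arange() also accepts a third argument, the step. The following sketch shows an ascending and a descending sequence; the variable names evens and countdown are illustrative only.

```python
import numpy as np

# start, stop (excluded), step
evens = np.arange(0, 20, 2)

# a negative step counts down; the stop value is still excluded
countdown = np.arange(10, 0, -1)

print(evens)
print(countdown)
```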
Let’s create a fourth array:
>>> arr4 = np.array([['a','b','c','d','e'], ['f', 'g', 'h', 'i','l'], ['m', 'n', 'o', 'p','q']])
>>> arr4
array([['a', 'b', 'c', 'd', 'e'],
       ['f', 'g', 'h', 'i', 'l'],
       ['m', 'n', 'o', 'p', 'q']],
      dtype='<U1')
We can select, for example, the first element (0):
>>> arr4[0]
array(['a', 'b', 'c', 'd', 'e'],
      dtype='<U1')
If we want to select the fourth item (index 3) of the first element, we proceed as follows:
>>> arr4[0][3]
'd'
Remember that, in Python, we start counting from zero, not one. So, if we select index 1, we actually get the second element:
>>> arr4[1]
array(['f', 'g', 'h', 'i', 'l'],
      dtype='<U1')
>>> arr4[1][1]
'g'
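
NumPy also accepts a row index and a column index inside a single pair of brackets, and either index can be a slice. The sketch below rebuilds the same array of letters under the hypothetical name letters to show both forms.

```python
import numpy as np

letters = np.array([['a', 'b', 'c', 'd', 'e'],
                    ['f', 'g', 'h', 'i', 'l'],
                    ['m', 'n', 'o', 'p', 'q']])

# row 1, column 1: equivalent to letters[1][1]
item = letters[1, 1]

# all rows, first column only
first_col = letters[:, 0]

# rows 0 and 1, columns 1 and 2
block = letters[0:2, 1:3]

print(item)
print(first_col)
print(block)
```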
We can also use a negative step to reverse the order of the elements of the array, as follows:
>>> arr4[::-1]
array([['m', 'n', 'o', 'p', 'q'],
       ['f', 'g', 'h', 'i', 'l'],
       ['a', 'b', 'c', 'd', 'e']],
      dtype='<U1')
The arr4 array is composed of three rows. To flatten them into a single one-dimensional array, we can use the .ravel() method:
>>> arr4.ravel()
array(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'l', 'm', 'n', 'o','p', 'q'],
      dtype='<U1')
Let’s create another array, arr5:
>>> arr5 = np.array([19, 76, 2, 13, 48, 986, 1, 18])
>>> arr5
array([19, 76, 2, 13, 48, 986, 1, 18])
# we can reorder it from the lowest value to the highest (older NumPy versions also offered np.msort() for this)
>>> np.sort(arr5)
array([  1,   2,  13,  18,  19,  48,  76, 986])
We can reorganize the data of an array using various functions. Let’s use the last array created, arr5:
# reshape() allows us to reorganize the data of an array; in the following example, into four rows of two columns each
>>> arr5.reshape(4,2)
array([[ 19,  76],
       [  2,  13],
       [ 48, 986],
       [  1,  18]])
>>> arr5.reshape(8,1)
array([[ 19],
       [ 76],
       [  2],
       [ 13],
       [ 48],
       [986],
       [  1],
       [ 18]])
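
Two details of reshape() are worth a short sketch: one dimension can be given as -1 so that NumPy infers it, and the requested shape must hold exactly the same number of items. The names values and two_cols are assumptions for this example.

```python
import numpy as np

values = np.array([19, 76, 2, 13, 48, 986, 1, 18])

# -1 asks NumPy to infer the number of rows from the 8 items
two_cols = values.reshape(-1, 2)
print(two_cols.shape)

# reshape() returns a new array object; the original keeps its shape
print(values.shape)

# a shape that does not match the number of items raises ValueError
try:
    values.reshape(3, 3)
    mismatch_raised = False
except ValueError:
    mismatch_raised = True
print(mismatch_raised)
```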
# we create two more arrays, each organized in three rows and three columns
>>> x = np.array([20, 42, 17, 3, 7, 12, 14, 70, 9])
>>> x = x.reshape(3,3)
>>> x
array([[20, 42, 17],
       [ 3,  7, 12],
       [14, 70,  9]])
>>> y = x * 3
>>> y
array([[ 60, 126,  51],
       [  9,  21,  36],
       [ 42, 210,  27]])
# similar to reshape() is the resize() method, which changes the array in place instead of returning a new one
>>> z = np.array([120, 72, 37, 43, 57, 12, 54, 20, 9])
>>> z
array([120,  72,  37,  43,  57,  12,  54,  20,   9])
>>> z.resize(3,3)
>>> z
array([[120,  72,  37],
       [ 43,  57,  12],
       [ 54,  20,   9]])
# we can concatenate two arrays horizontally
>>> np.hstack((x, y))
array([[ 20,  42,  17,  60, 126,  51],
       [  3,   7,  12,   9,  21,  36],
       [ 14,  70,   9,  42, 210,  27]])
# we get the same result using the function .concatenate() by specifying the axis
>>> np.concatenate((x,y), axis = 1)
array([[ 20,  42,  17,  60, 126,  51],
       [  3,   7,  12,   9,  21,  36],
       [ 14,  70,   9,  42, 210,  27]])
# or we arrange the data vertically
>>> np.vstack((x,y))
array([[ 20,  42,  17],
       [  3,   7,  12],
       [ 14,  70,   9],
       [ 60, 126,  51],
       [  9,  21,  36],
       [ 42, 210,  27]])
# we get the same result with the concatenate() function, without specifying the axis or by specifying it as zero
>>> np.concatenate((x,y))
array([[ 20,  42,  17],
       [  3,   7,  12],
       [ 14,  70,   9],
       [ 60, 126,  51],
       [  9,  21,  36],
       [ 42, 210,  27]])
# the function .dstack() stacks the arrays along the third axis (depth), pairing the corresponding items of x and y
>>> np.dstack((x,y))
array([[[ 20,  60],
        [ 42, 126],
        [ 17,  51]],
       [[  3,   9],
        [  7,  21],
        [ 12,  36]],
       [[ 14,  42],
        [ 70, 210],
        [  9,  27]]])
# the .hsplit() function splits the array horizontally (by columns) into parts of equal size (in this case, three parts)
>>> np.hsplit(x, 3)
[array([[20],
       [ 3],
       [14]]), array([[42],
       [ 7],
       [70]]), array([[17],
       [12],
       [ 9]])]
# we can also use the .split() function, which splits along the rows (axis 0) by default
>>> np.split(x, 3)
[array([[20, 42, 17]]), array([[ 3,  7, 12]]), array([[14, 70,  9]])]
# or the .vsplit() function, which means vertical splitting
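
Since vsplit() is mentioned without an example, here is a minimal sketch comparing it with split(). It rebuilds the same 3 x 3 array x; the result names rows, same, and cols are illustrative.

```python
import numpy as np

x = np.array([[20, 42, 17],
              [ 3,  7, 12],
              [14, 70,  9]])

# vertical splitting: one sub-array per row
rows = np.vsplit(x, 3)

# split() along axis 0 gives the same pieces
same = np.split(x, 3, axis=0)

# split() along axis 1 matches hsplit(): one sub-array per column
cols = np.split(x, 3, axis=1)

print(rows)
print(cols)
```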
# we can also convert an array to a list
>>> z.tolist()
[[120, 72, 37], [43, 57, 12], [54, 20, 9]]
As we have seen, NumPy handles both numeric data in various formats and strings, but we can also generate data automatically. The zeros() and ones() functions create arrays with zeros or ones only.
# the first element indicates the number of cases (rows); the second, the number of variables (columns)
>>> np.zeros((4,3))
array([[ 0.,  0.,  0.],
       [ 0.,  0.,  0.],
       [ 0.,  0.,  0.],
       [ 0.,  0.,  0.]])
>>> np.ones((5,2))
array([[ 1.,  1.],
       [ 1.,  1.],
       [ 1.,  1.],
       [ 1.,  1.],
       [ 1.,  1.]])
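
By default, zeros() and ones() produce floating-point values, as the trailing dots in the output show. The sketch below, with illustrative variable names, requests integers instead and uses two related NumPy functions, full() and ones_like(), which are not covered in the text above.

```python
import numpy as np

# integer zeros instead of the default float64
int_zeros = np.zeros((2, 3), dtype=int)

# an array filled with an arbitrary constant
sevens = np.full((2, 2), 7)

# ones with the same shape and dtype as an existing array
like = np.ones_like(int_zeros)

print(int_zeros)
print(sevens)
print(like)
```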
NumPy supports many data types. Table 9-1 lists some of them, taken from the documentation available at https://docs.scipy.org/doc/.
Table 9-1. Options for Storing Data

Data Type     Description
bool_         Boolean (True or False), stored as a byte
int_          Default integer type (same as C long; normally either int64 or int32)
intc          Identical to C int (normally int32 or int64)
intp          Integer used for indexing (same as C ssize_t; normally either int32 or int64)
int8          Byte (-128 to 127)
int16         Integer (-32768 to 32767)
int32         Integer (-2147483648 to 2147483647)
int64         Integer (-9223372036854775808 to 9223372036854775807)
uint8         Unsigned integer (0 to 255)
uint16        Unsigned integer (0 to 65535)
uint32        Unsigned integer (0 to 4294967295)
uint64        Unsigned integer (0 to 18446744073709551615)
float_        Shorthand for float64
float16       Half-precision float: sign bit, 5-bit exponent, 10-bit mantissa
float32       Single-precision float: sign bit, 8-bit exponent, 23-bit mantissa
float64       Double-precision float: sign bit, 11-bit exponent, 52-bit mantissa
complex_      Shorthand for complex128
complex64     Complex number, represented by two 32-bit floats (real and imaginary components)
complex128    Complex number, represented by two 64-bit floats (real and imaginary components)

When we create an array, we can specify which data type we want it to use through the dtype argument:
>>> o1 = np.arange(10, dtype = 'int16')
>>> o1
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=int16)
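
The integer ranges listed in Table 9-1 can be checked from Python with np.iinfo(), and an existing array can be converted to another type with the astype() method. This is a short sketch; the names info8 and o1_float are assumptions for the example.

```python
import numpy as np

# limits of a signed 8-bit integer, as listed in Table 9-1
info8 = np.iinfo(np.int8)
print(info8.min, info8.max)

# limit of an unsigned 16-bit integer
print(np.iinfo(np.uint16).max)

# astype() returns a converted copy; the original array is unchanged
o1 = np.arange(10, dtype='int16')
o1_float = o1.astype('float32')
print(o1.dtype, o1.itemsize)
print(o1_float.dtype, o1_float.itemsize)
```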
NumPy can also be used to create matrices and perform matrix calculations.
# we create two matrices, mat1 and mat2
>>> mat1 = np.matrix([[10, 11, 12, 13, 14], [15, 16, 17, 18, 19], [20, 21, 22, 23, 24]])
>>> mat1
matrix([[10, 11, 12, 13, 14],
        [15, 16, 17, 18, 19],
        [20, 21, 22, 23, 24]])
>>> mat2 = np.matrix([[25, 26, 27, 28, 29], [30, 31, 32, 33, 34], [35, 36, 37, 38, 39]])
>>> mat2
matrix([[25, 26, 27, 28, 29],
        [30, 31, 32, 33, 34],
        [35, 36, 37, 38, 39]])
# we can do some mathematical operations
>>> mat1 + mat2
matrix([[35, 37, 39, 41, 43],
        [45, 47, 49, 51, 53],
        [55, 57, 59, 61, 63]])
>>> mat2 - mat1
matrix([[15, 15, 15, 15, 15],
        [15, 15, 15, 15, 15],
        [15, 15, 15, 15, 15]])
>>> mat2 / mat1
matrix([[ 2.5      ,  2.36363636,  2.25      ,  2.15384615,  2.07142857],
        [ 2.       ,  1.9375    ,  1.88235294,  1.83333333,  1.78947368],
        [ 1.75     ,  1.71428571,  1.68181818,  1.65217391,  1.625     ]])
>>> mat1 * 3
matrix([[30, 33, 36, 39, 42],
        [45, 48, 51, 54, 57],
        [60, 63, 66, 69, 72]])
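
One point worth noting alongside the operations above: between two np.matrix objects, the * operator performs matrix multiplication, whereas between two ordinary arrays it multiplies item by item. The sketch below uses two small arrays, a and b, invented for the example.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

# ndarray: * multiplies corresponding items
elementwise = a * b

# ndarray: @ is the matrix product
product = a @ b

# np.matrix: * is already the matrix product
m_product = np.matrix(a) * np.matrix(b)

print(elementwise)
print(product)
print(m_product)
```

With the 3 x 5 matrices mat1 and mat2 shown earlier, mat1 * mat2 would therefore fail, because their shapes are not compatible for a matrix product.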
# NumPy lets us calculate the mean
>>> np.mean(mat1)
17.0
# maximum value
>>> np.max(mat1)
24
# minimum value
>>> np.min(mat1)
10
# median
>>> np.median(mat1)
17.0
# variance
>>> np.var(mat1)
18.666666666666668
# standard deviation
>>> np.std(mat1)
4.3204937989385739
# covariance can be calculated by np.cov()
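
As a sketch of np.cov(), and of how the same statistics can be computed per row or per column with the axis argument, consider the values of mat1 stored in a plain array. The name data is an assumption; note that np.cov() treats each row as one variable by default and uses the sample (n - 1) denominator.

```python
import numpy as np

data = np.array([[10, 11, 12, 13, 14],
                 [15, 16, 17, 18, 19],
                 [20, 21, 22, 23, 24]])

# mean of each column (axis=0) and of each row (axis=1)
col_means = data.mean(axis=0)
row_means = data.mean(axis=1)

# covariance matrix: one row and one column per variable (here, per row of data)
cov = np.cov(data)

print(col_means)
print(row_means)
print(cov)
```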
NumPy also allows you to import numeric data using the loadtxt() function:
# loadtxt() does not expand "~" by itself, so we resolve the path with os.path.expanduser()
>>> import os
>>> path = os.path.expanduser('~/Python_test/df2')
>>> up1 = np.loadtxt(path, delimiter = ',', usecols = (1,), unpack = True)
>>> up1
array([  0.        ,  15.98293881,  41.7360094 ,  21.02081375,
        54.06254967,   6.68691812,  43.83810058,  39.55058136,
        58.04370289,  85.02891662,  98.25872709])
# to import more columns, we change the usecols argument:
>>> up1 = np.loadtxt(path, delimiter = ',', usecols = (1,2,3,4,5), unpack = True)
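
Because the file used above is not available to the reader, the following self-contained sketch writes a tiny comma-separated file to a temporary directory and reads it back. The file name sample.csv and its contents are invented for the example.

```python
import os
import tempfile
import numpy as np

# three rows, three comma-separated columns (made-up data)
csv_text = "1,0.5,10\n2,1.5,20\n3,2.5,30\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.csv")
    with open(path, "w") as fh:
        fh.write(csv_text)

    # read only the second column (index 1)
    second = np.loadtxt(path, delimiter=",", usecols=(1,))

    # unpack=True returns one array per selected column
    col_b, col_c = np.loadtxt(path, delimiter=",", usecols=(1, 2), unpack=True)

print(second)
print(col_b, col_c)
```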

NumPy: Generating Random Numbers and Seeds

To generate random numbers, we need to import the NumPy package.
>>> import numpy as np
We can generate a random number as follows:
>>> np.random.rand()
0.992777076172216
# or specify in parentheses the number of rows and columns to be generated automatically
>>> np.random.rand(2,3)
array([[ 0.39352349,  0.57116926,  0.88967038],
       [ 0.76375617,  0.24620255,  0.17408501]])
We can create arrays of random integers of any shape:
# of a given size (10 is the upper limit of the range from which the integers are drawn, and it is itself excluded)
>>> np.random.randint(10, size= 8)
array([5, 8, 3, 7, 6, 8, 5, 3])
# next we create an array with four cases and five variables
>>> np.random.randint(10, size=(4, 5))
array([[4, 7, 1, 9, 8],
       [7, 4, 0, 3, 7],
       [3, 6, 9, 3, 4],
       [4, 7, 3, 3, 9]])
# and then we create a three-dimensional array
>>> np.random.randint(10, size=(4, 5, 6))
array([[[7, 6, 1, 0, 1, 2],
        [3, 2, 3, 2, 9, 2],
        [5, 5, 6, 5, 0, 2],
        [1, 6, 2, 1, 9, 0],
        [1, 5, 3, 1, 3, 6]],
       [[0, 1, 9, 3, 4, 3],
        [2, 6, 6, 6, 6, 2],
        [2, 3, 7, 6, 2, 5],
        [1, 7, 7, 7, 0, 3],
        [3, 8, 6, 4, 1, 6]],
       [[1, 1, 2, 3, 2, 6],
        [0, 8, 1, 8, 1, 7],
        [7, 4, 3, 1, 6, 7],
        [7, 3, 1, 9, 8, 8],
        [9, 0, 1, 7, 2, 7]],
       [[3, 3, 2, 2, 5, 6],
        [4, 9, 3, 7, 4, 4],
        [8, 6, 3, 3, 7, 0],
        [9, 7, 0, 5, 2, 0],
        [3, 3, 9, 5, 2, 9]]])
# with random.rand() we generate real numbers; with random.randint() we generate whole numbers
We can set a seed to be sure to generate the same random numbers and then repeat the examples:
>>> np.random.seed(12345)
>>> np.random.rand()
0.9296160928171479
>>> np.random.rand()
0.3163755545817859
>>> np.random.seed(12345)
>>> np.random.rand()
0.9296160928171479
>>> np.random.rand()
0.3163755545817859
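
Recent NumPy versions (1.17 and later) also provide a generator object as an alternative to the global seed. The sketch below shows the same reproducibility idea with np.random.default_rng(); the generator names rng_a and rng_b are illustrative, and the numbers produced differ from those obtained with np.random.seed().

```python
import numpy as np

# two independent generators created from the same seed
rng_a = np.random.default_rng(12345)
rng_b = np.random.default_rng(12345)

# each generator produces the same sequence of floats in [0, 1)
draws_a = rng_a.random(3)
draws_b = rng_b.random(3)
print(np.array_equal(draws_a, draws_b))

# integers() excludes the upper limit by default, like randint()
ints = rng_a.integers(1, 7, size=5)
print(ints)
```

Keeping the seed inside a generator object avoids changing the global random state that other parts of a program may rely on.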
We can generate random integers with np.random.randint():
>>> print(np.random.randint(0, 100))
27
>>> print(np.random.randint(0, 100))
32
# 0 and 100 are the limits within which we can extract elements. The upper limit is excluded, so the number 100 cannot be extracted. This means that if we want to simulate, for example, the roll of a die, we set the limits to 1 and 7.
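
The die example can be sketched as follows. The seed value and the number of rolls are arbitrary choices for this illustration.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed, only to make the run repeatable

# 1000 rolls of a six-sided die: lower limit 1 included, upper limit 7 excluded
rolls = np.random.randint(1, 7, size=1000)

# count how many times each face came up
faces, counts = np.unique(rolls, return_counts=True)

print(rolls.min(), rolls.max())
print(dict(zip(faces.tolist(), counts.tolist())))
```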
We can also create a list and extract its elements at random:
>>> test1 = ["object1", "object2", "object3", "object4", "object5"]
>>> print(np.random.choice(test1))
object2
>>> print(np.random.choice(test1))
object4
We can create another random object using np.random.randn(), which draws values from the standard normal distribution (mean 0, standard deviation 1).
>>> x = np.random.randn(1000)
# we import matplotlib (described in Chapter 10)
>>> import matplotlib.pyplot as plt
# and create a histogram of the x object
>>> plt.hist(x, bins = 100)
# we display the histogram
>>> plt.show()
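
Besides plotting, we can check numerically that the values behave as expected. The sketch below, with an arbitrary seed and sample size, verifies that the mean and standard deviation are close to 0 and 1, and shows one common way to obtain a normal sample with a different mean and spread; the values 50 and 10 are assumptions for the example.

```python
import numpy as np

np.random.seed(12345)

# a large standard normal sample
sample = np.random.randn(100000)
print(sample.mean(), sample.std())

# shift and scale: mean 50, standard deviation 10
scaled = 50 + 10 * np.random.randn(100000)
print(scaled.mean(), scaled.std())
```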
NumPy can also be used to generate random datasets, like the one plotted in Figure 9-1. To store such a dataset in a data frame, we must also load the pandas package. We’ve already loaded the NumPy package, so let’s proceed as follows:
# we import pandas
>>> import pandas as pd
Figure 9-1. Plot of a random dataset created with NumPy and pandas

We create a data frame using the pandas DataFrame() function. We can tell immediately which package a function belongs to, because the call is prefixed with the name of the package (or its alias):
package_name.function_name()
# we use the DataFrame() function of the pandas (pd) package to create the data frame, and the random.randn() function of the NumPy package (np) to generate random data. In parentheses, we first put the number of cases to be generated (in this case, 10) followed by the number of variables (in this case, 5)
>>> rdf = pd.DataFrame(np.random.randn(10,5))
>>> rdf
          0         1         2         3         4
0  0.669980  0.626433 -0.693932 -0.841258 -0.165200
1  0.108567 -0.743791  0.367369  0.645242 -0.297283
2  1.674781  0.241534 -0.403371  0.175751  0.274626
3 -2.339962 -0.083003 -1.387095  1.559257 -1.025012
4  0.383104  0.968755  0.236508  0.186294  0.094319
5  0.956150 -1.366423  0.694575 -0.107877  1.727657
6 -0.699931 -1.184346  0.581632  0.333015 -1.137382
7  0.867757 -0.872935  0.417772 -0.045722  0.432780
8 -0.685488  1.046816  0.465459 -0.446164  0.227635
9 -0.019854  0.643384  1.459784  0.559970 -0.358676
If we recreate a data frame using the same instructions, we will almost certainly get different data. To get the same data, we need to use a function that allows us to set a seed. In this way, we replicate the data.
# we set the seed
>>> np.random.seed(12345)
# we create the data frame
>>> rdf = pd.DataFrame(np.random.randn(10,5))
# we visualize the data frame
>>> rdf
          0         1         2         3         4
0 -0.204708  0.478943 -0.519439 -0.555730  1.965781
1  1.393406  0.092908  0.281746  0.769023  1.246435
2  1.007189 -1.296221  0.274992  0.228913  1.352917
3  0.886429 -2.001637 -0.371843  1.669025 -0.438570
4 -0.539741  0.476985  3.248944 -1.021228 -0.577087
5  0.124121  0.302614  0.523772  0.000940  1.343810
6 -0.713544 -0.831154 -2.370232 -1.860761 -0.860757
7  0.560145 -1.265934  0.119827 -1.063512  0.332883
8 -2.359419 -0.199543 -1.541996 -0.970736 -1.307030
9  0.286350  0.377984 -0.753887  0.331286  1.349742
# we reinsert the same seed
>>> np.random.seed(12345)
# we recreate a dataset
>>> rdf2 = pd.DataFrame(np.random.randn(10,5))
# the generated data are identical
>>> rdf2
          0         1         2         3         4
0 -0.204708  0.478943 -0.519439 -0.555730  1.965781
1  1.393406  0.092908  0.281746  0.769023  1.246435
2  1.007189 -1.296221  0.274992  0.228913  1.352917
3  0.886429 -2.001637 -0.371843  1.669025 -0.438570
4 -0.539741  0.476985  3.248944 -1.021228 -0.577087
5  0.124121  0.302614  0.523772  0.000940  1.343810
6 -0.713544 -0.831154 -2.370232 -1.860761 -0.860757
7  0.560145 -1.265934  0.119827 -1.063512  0.332883
8 -2.359419 -0.199543 -1.541996 -0.970736 -1.307030
9  0.286350  0.377984 -0.753887  0.331286  1.349742
>>> np.random.seed(12345)
>>> rdf = pd.DataFrame(np.random.randn(10,5))
>>> rdf
          0         1         2         3         4
0 -0.204708  0.478943 -0.519439 -0.555730  1.965781
1  1.393406  0.092908  0.281746  0.769023  1.246435
2  1.007189 -1.296221  0.274992  0.228913  1.352917
3  0.886429 -2.001637 -0.371843  1.669025 -0.438570
4 -0.539741  0.476985  3.248944 -1.021228 -0.577087
5  0.124121  0.302614  0.523772  0.000940  1.343810
6 -0.713544 -0.831154 -2.370232 -1.860761 -0.860757
7  0.560145 -1.265934  0.119827 -1.063512  0.332883
8 -2.359419 -0.199543 -1.541996 -0.970736 -1.307030
9  0.286350  0.377984 -0.753887  0.331286  1.349742
Note that our completely random variables do not have a name; they are identified merely by numbers. Let’s change the column names:
# we create a list that contains the names of the variables
>>> var_names = ['var1', 'var2', 'var3', 'var4', 'var5']
# we assign the list to the .columns attribute
>>> rdf2.columns = var_names
# we check the first cases of the data frame
>>> rdf2.head(2)
       var1      var2      var3      var4      var5
0 -0.204708  0.478943 -0.519439 -0.555730  1.965781
1  1.393406  0.092908  0.281746  0.769023  1.246435
# the variable names are correct
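
The column names can also be supplied when the data frame is created, through the columns argument of DataFrame(), or changed selectively with rename(). This is a minimal sketch assuming pandas is installed; the names df, renamed, and 'first' are invented for the example.

```python
import numpy as np
import pandas as pd

np.random.seed(12345)
names = ['var1', 'var2', 'var3', 'var4', 'var5']

# name the columns at creation time
df = pd.DataFrame(np.random.randn(10, 5), columns=names)

# rename() returns a new data frame; df itself is unchanged
renamed = df.rename(columns={'var1': 'first'})

print(df.head(2))
print(list(renamed.columns))
```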
We can acquire other types of distributions using other NumPy functions.
# binomial distribution
>>> rdf_bin = pd.DataFrame(np.random.binomial(100, 0.5, (10,5)))
>>> rdf_bin
    0   1   2   3   4
0  47  56  48  42  53
1  43  56  51  56  46
2  50  42  40  55  46
3  46  43  54  51  53
4  41  48  47  45  42
5  53  55  51  50  58
6  51  57  46  53  48
7  56  53  46  54  54
8  53  50  52  53  46
9  50  46  54  57  56
# Poisson distribution
>>> rdf_poi = pd.DataFrame(np.random.poisson(100, (10,5)))
>>> rdf_poi
     0    1    2    3    4
0  109  107   98  111   95
1  115  109  101  108   95
2  105   97  102  100   94
3   94   93   94  122   96
4  117   85  135   90   83
5  103  106  105   93  116
6  111   95  100   95   80
7   81   75   84   93  101
8  105  109  104  104  113
9   97  120   90   98   95
# uniform distribution
>>> rdf_un = pd.DataFrame(np.random.uniform(1, 100,(10,5)))
>>> rdf_un
           0          1          2          3          4
0  49.139046  98.433411  60.777590  76.202858  68.767153
1  93.309206  95.028762  99.021002  13.489383  97.683461
2  23.681460  19.419586  51.411931  53.519271  28.981285
3  31.705992  31.678553  27.372500  59.749684  22.496303
4  40.835909  29.567563  18.210241  23.345639  98.875500
5  43.971084  82.990738  57.678770  65.538128  73.244077
6  39.737081  39.893383  86.095572  81.191942  83.817845
7  19.610805  36.600078  48.716414  96.641192  56.768005
8  82.922981   8.534653  55.760657  55.246106  90.638916
9  37.579379  90.215102  14.922471  10.818199  97.345143
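
To connect the arguments of these functions with the values they produce, the sketch below draws a large sample from each distribution and compares the sample mean with the theoretical one. The seed and the sample size n are arbitrary choices for the illustration.

```python
import numpy as np

np.random.seed(12345)
n = 100000

# binomial: 100 trials with success probability 0.5, expected mean 100 * 0.5 = 50
binom = np.random.binomial(100, 0.5, n)

# Poisson: rate 100, expected mean 100
pois = np.random.poisson(100, n)

# uniform on [1, 100), expected mean (1 + 100) / 2 = 50.5
unif = np.random.uniform(1, 100, n)

print(binom.mean(), pois.mean(), unif.mean())
```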
How do we save a data frame or array created in NumPy? First, we create a data frame
>>> tos = pd.DataFrame(np.random.randn(10,5))
then save it with the NumPy save() function.
>>> np.save('tos_saved', tos)
# the file tos_saved.npy now appears in our working directory
To load our saved file, we use the load() function in NumPy. The result is a NumPy array, not a pandas data frame.
>>> load1 = np.load('tos_saved.npy')
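
The full round trip can be sketched as follows. To avoid leaving files behind, this example writes to a temporary directory, which is an assumption of the sketch rather than part of the original workflow; it also converts the data frame to an array explicitly and rebuilds a data frame after loading.

```python
import os
import tempfile
import numpy as np
import pandas as pd

tos = pd.DataFrame(np.random.randn(10, 5))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tos_saved.npy")

    # save the underlying values as a NumPy array
    np.save(path, tos.to_numpy())

    # np.load() returns a plain ndarray
    loaded = np.load(path)

# wrap the array in a data frame again if needed
restored = pd.DataFrame(loaded)

print(type(loaded))
print(restored.shape)
```

Column names and the index are not stored in the .npy file, so they have to be reassigned after loading if they matter.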

NumPy contains other advanced mathematics modules, such as numpy.linalg for linear algebra and numpy.fft for Fourier transforms. Older releases also included a number of functions for financial analysis, such as interest and future value calculations; in recent releases, these have moved to the separate numpy-financial package. You can find all the necessary documentation about NumPy features at https://docs.scipy.org/doc/.

Summary

NumPy and SciPy are two very important packages used by data analysts. Each has a variety of features, but the packages can be combined to create random datasets, for example.
