Valentina Porcu, Python for Data Mining Quick Syntax Reference, https://doi.org/10.1007/978-1-4842-4113-4_9

9. SciPy and NumPy

Valentina Porcu¹

(1)

Nuoro, Italy

pandas is a very important package for data analysis, and it is often used in conjunction with two other packages: SciPy and NumPy.

SciPy

SciPy is one of the most important packages for mathematical and statistical analysis in Python, and it is linked closely to NumPy. SciPy contains more than 60 statistical functions organized in families of modules:

scipy.cluster
scipy.constants
scipy.fftpack
scipy.integrate
scipy.interpolate
scipy.io
scipy.lib
scipy.linalg
scipy.misc
scipy.optimize
scipy.signal
scipy.sparse
scipy.spatial
scipy.special
scipy.stats
scipy.weave

>>> import scipy as sp

>>> from scipy import stats

>>> from scipy import cluster

We can get help with these modules by typing, for example:

>>> help(sp.cluster)

# or, if the package was imported with a plain import scipy

>>> help(scipy.cluster)

Documentation opens directly in the window, from which we can exit by pressing q.

For instance, SciPy can be used to evaluate the probability density function of a distribution at a given number:

>>> from scipy.stats import norm

>>> norm.pdf(5)

1.4867195147342979e-06

or the cumulative distribution function (CDF), where x is a number or an array defined beforehand:

>>> norm.cdf(x)
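As a minimal sketch of how these two functions relate, we can evaluate both on a whole array at once. The values of x below are chosen only for illustration and do not come from the original example; ppf() is the inverse of cdf().

```python
import numpy as np
from scipy.stats import norm

# Illustrative points; any number or array works
x = np.array([-1.96, 0.0, 1.96])

density = norm.pdf(x)      # height of the standard normal curve at each x
cumulative = norm.cdf(x)   # probability of drawing a value <= x

print(density)
print(cumulative)

# ppf maps a probability back to the corresponding x
recovered = norm.ppf(cumulative)
print(recovered)
```

The density is highest at 0 and the cumulative probability at 0 is exactly one half, because the standard normal distribution is symmetric around zero.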

SciPy features very technical modules. For the purposes of this book, we are particularly interested in how it combines with NumPy, so this chapter focuses primarily on that second package.

NumPy

At the beginning of Python's development, programmers soon found themselves having to incorporate tools for scientific computation. Their first attempt resulted in the Numeric package, which was developed in 1995, followed by an alternative package called Numarray. The functions of these two packages were merged in 2006 to create NumPy.

NumPy stands for “numeric Python” and is an open-source library dedicated to scientific computing, especially with regard to array management. It is sometimes considered a MATLAB equivalent for Python, and it features several high-level mathematical functions for algebra and random number generation.

>>> import numpy as np

>>> from numpy import *

NumPy’s most important structure is a particular type of multidimensional array, called ndarray. An ndarray consists of two parts: the data (the actual array) and the metadata describing the data (the dtype, or data type). Each ndarray is associated with one and only one dtype. We can have one-dimensional arrays:

>>> arr1 = np.array([0,1,2,3,4])

>>> arr1

array([0, 1, 2, 3, 4])

Or multidimensional arrays:

>>> arr2 = np.array([[5,6,7,8,9], [10,11,12,13,14]])

>>> arr2

array([[ 5, 6, 7, 8, 9],

[10, 11, 12, 13, 14]])

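>>> arr2_1 = np.array([[5,6,7,8,9], [10,11,12,13,14], [2,5,7]])

# the rows must have the same length and the data the same type; with a ragged list like this one, recent NumPy versions raise a ValueError unless dtype=object is specified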

We can display the data type using the .dtype attribute.

>>> arr1.dtype

dtype('int64')

# the number of items in the array

>>> arr1.size

# the size of each item in bytes

>>> arr1.itemsize
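The following sketch puts these attributes side by side. The dtype is set explicitly (an assumption made here so that the result does not depend on the platform), and nbytes, ndim, and shape are shown as related attributes.

```python
import numpy as np

# dtype is given explicitly so the result is the same on every platform
arr = np.array([0, 1, 2, 3, 4], dtype=np.int64)

print(arr.dtype)            # int64
print(arr.size)             # 5 items
print(arr.itemsize)         # 8 bytes per item
print(arr.nbytes)           # 40 bytes in total (size * itemsize)
print(arr.ndim, arr.shape)  # 1 (5,)
```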

For instance, by using the NumPy arange() function, we can create an array of numbers. In the following case, the numbers run from 0 to 99, because the upper limit (100) is excluded:

>>> arr3 = np.arange(100)

>>> arr3

array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,

17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,

34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50,

51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67,

68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84,

85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99])

# arange also allows us to set a range of numbers

>>> np.arange(1,10)

array([1, 2, 3, 4, 5, 6, 7, 8, 9])
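arange() also accepts a third argument, the step. The sketch below shows a few variations; linspace() is included as a related function that takes a number of samples instead of a step and includes the end point.

```python
import numpy as np

evens = np.arange(0, 10, 2)       # start, stop (excluded), step
print(evens)                      # [0 2 4 6 8]

countdown = np.arange(5, 0, -1)   # a negative step counts downward
print(countdown)                  # [5 4 3 2 1]

# linspace includes the end point and takes the number of samples
grid = np.linspace(0.0, 1.0, 5)
print(grid)
```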

Let’s create a fourth array:

>>> arr4 = np.array([['a','b','c','d','e'], ['f', 'g', 'h', 'i','l'], ['m', 'n', 'o', 'p','q']])

>>> arr4

array([['a', 'b', 'c', 'd', 'e'],

['f', 'g', 'h', 'i', 'l'],

['m', 'n', 'o', 'p', 'q']],

dtype='<U1')

We can select, for example, the first element (0):

>>> arr4[0]

array(['a', 'b', 'c', 'd', 'e'],

dtype='<U1')

If we want to select the fourth item (index 3) of the first row, we proceed as follows:

>>> arr4[0][3]

'd'

Remember that, in Python, we start counting from zero, not one. So, if we select index 1, we actually get the second element:

>>> arr4[1]

array(['f', 'g', 'h', 'i', 'l'],

dtype='<U1')

>>> arr4[1][1]

'g'
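NumPy also accepts both indices inside a single pair of brackets, and they can be combined with slices. A short sketch, reusing the same arr4 values:

```python
import numpy as np

arr4 = np.array([['a', 'b', 'c', 'd', 'e'],
                 ['f', 'g', 'h', 'i', 'l'],
                 ['m', 'n', 'o', 'p', 'q']])

print(arr4[0, 3])     # d  (same item as arr4[0][3])
print(arr4[:, 0])     # first column: ['a' 'f' 'm']
print(arr4[1, 1:4])   # items 1 to 3 of the second row: ['g' 'h' 'i']
print(arr4[-1, -1])   # last item of the last row: q
```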

We can also use a negative step to reverse the order of the rows:

>>> arr4[::-1]

array([['m', 'n', 'o', 'p', 'q'],

['f', 'g', 'h', 'i', 'l'],

['a', 'b', 'c', 'd', 'e']],

dtype='<U1')

The arr4 array is composed of three rows. To flatten them into a single one-dimensional array, we can use the .ravel() method:

>>> arr4.ravel()

array(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'l', 'm', 'n', 'o','p', 'q'],

dtype='<U1')
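One detail worth knowing is that .ravel() returns a view of the original data whenever the memory layout allows it, whereas the related .flatten() method always returns a copy. The sketch below uses a small numeric array (not from the original example) to make the difference visible.

```python
import numpy as np

a = np.arange(6).reshape(2, 3)

flat_view = a.ravel()     # a view of the same data when the layout allows it
flat_copy = a.flatten()   # always an independent copy

flat_view[0] = 99         # also changes a[0, 0]
flat_copy[1] = -1         # leaves a untouched

print(a)
print(np.shares_memory(a, flat_view), np.shares_memory(a, flat_copy))
```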

Let’s create another array, arr5:

>>> arr5 = np.array([19, 76, 2, 13, 48, 986, 1, 18])

>>> arr5

array([19, 76, 2, 13, 48, 986, 1, 18])

# we can reorder it from the lowest value to the highest (np.sort() replaces np.msort(), which recent NumPy versions no longer provide)

>>> np.sort(arr5)

array([1, 2, 13, 18, 19, 48, 76, 986])

We can reorganize the data of an array using various functions. Let’s use the last array created, arr5:

# reshape() allows us to reorganize the data of an array: in the following example, into four rows of two columns each

>>> arr5.reshape(4,2)

array([[ 19, 76],

[ 2, 13],

[ 48, 986],

[ 1, 18]])

>>> arr5.reshape(8,1)

array([[ 19],

[ 76],

[ 2],

[ 13],

[ 48],

[986],

[ 1],

[ 18]])
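Two properties of reshape() are easy to overlook: one dimension can be given as -1 so that NumPy infers it, and the new shape must account for every item. A minimal sketch with the same arr5 values:

```python
import numpy as np

arr5 = np.array([19, 76, 2, 13, 48, 986, 1, 18])

two_cols = arr5.reshape(-1, 2)   # -1 lets NumPy infer the number of rows (4)
print(two_cols.shape)            # (4, 2)

# reshape returns a new array object; arr5 keeps its original shape
print(arr5.shape)                # (8,)

# A shape that does not account for all eight items is rejected
try:
    arr5.reshape(3, 3)
except ValueError as err:
    print("reshape failed:", err)
```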

# we create two more arrays, each organized in three rows of three columns

>>> x = np.array([20, 42, 17, 3, 7, 12, 14, 70, 9])

>>> x = x.reshape(3,3)

>>> x

array([[20, 42, 17],

[ 3, 7, 12],

[14, 70, 9]])

>>> y = x * 3

>>> y

array([[ 60, 126, 51],

[ 9, 21, 36],

[ 42, 210, 27]])

# similar to reshape() is the resize() method, which changes the array in place instead of returning a new one

>>> z = np.array([120, 72, 37, 43, 57, 12, 54, 20, 9])

>>> z

array([120, 72, 37, 43, 57, 12, 54, 20, 9])

>>> z.resize(3,3)

>>> z

array([[120, 72, 37],

[ 43, 57, 12],

[ 54, 20, 9]])

# we can concatenate two arrays horizontally

>>> np.hstack((x, y))

array([[ 20, 42, 17, 60, 126, 51],

[ 3, 7, 12, 9, 21, 36],

[ 14, 70, 9, 42, 210, 27]])

# we get the same result using the function .concatenate() by specifying the axis

>>> np.concatenate((x,y), axis = 1)

array([[ 20, 42, 17, 60, 126, 51],

[ 3, 7, 12, 9, 21, 36],

[ 14, 70, 9, 42, 210, 27]])

# or we arrange the data vertically

>>> np.vstack((x,y))

array([[ 20, 42, 17],

[ 3, 7, 12],

[ 14, 70, 9],

[ 60, 126, 51],

[ 9, 21, 36],

[ 42, 210, 27]])

# we get the same result with the concatenate() function , without specifying the axis, or by specifying it as zero

>>> np.concatenate((x,y))

array([[ 20, 42, 17],

[ 3, 7, 12],

[ 14, 70, 9],

[ 60, 126, 51],

[ 9, 21, 36],

[ 42, 210, 27]])

# the function .dstack() stacks the arrays along a third axis, pairing each element of x with the corresponding element of y

>>> np.dstack((x,y))

array([[[ 20, 60],

[ 42, 126],

[ 17, 51]],

[[ 3, 9],

[ 7, 21],

[ 12, 36]],

[[ 14, 42],

[ 70, 210],

[ 9, 27]]])
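The stacking functions are easiest to compare by looking at the shape of each result. The following sketch rebuilds x and y and checks the equivalences with concatenate() described in the comments above.

```python
import numpy as np

x = np.array([[20, 42, 17], [3, 7, 12], [14, 70, 9]])
y = x * 3

side_by_side = np.hstack((x, y))   # shape (3, 6)
stacked = np.vstack((x, y))        # shape (6, 3)
depth = np.dstack((x, y))          # shape (3, 3, 2)

print(side_by_side.shape, stacked.shape, depth.shape)

# hstack and vstack match concatenate along axis 1 and axis 0 for 2-D input
print(np.array_equal(side_by_side, np.concatenate((x, y), axis=1)))  # True
print(np.array_equal(stacked, np.concatenate((x, y), axis=0)))       # True

# depth[i, j] holds the pair (x[i, j], y[i, j])
print(depth[0, 1])   # [ 42 126]
```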

# the .hsplit() function divides the array column-wise into equal parts (in this case, three parts)

>>> np.hsplit(x, 3)

[array([[20],

[ 3],

[14]]), array([[42],

[ 7],

[70]]), array([[17],

[12],

[ 9]])]

# we can also use the .split() function, which by default splits along the first axis (the rows)

>>> np.split(x, 3)

[array([[20, 42, 17]]), array([[ 3, 7, 12]]), array([[14, 70, 9]])]

# or the .vsplit() function, which means vertical splitting and also divides the array by rows
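Since no call to vsplit() is shown, here is a brief sketch of it next to split() with an explicit axis. Passing a list of indices instead of a count (the last lines) is an additional option, shown here only as an illustration.

```python
import numpy as np

x = np.array([[20, 42, 17], [3, 7, 12], [14, 70, 9]])

rows = np.vsplit(x, 3)           # three arrays of shape (1, 3), one per row
cols = np.split(x, 3, axis=1)    # same result as np.hsplit(x, 3)

print(rows[0])                   # [[20 42 17]]
print(cols[0].shape)             # (3, 1)

# Indices instead of a count: cut before column 1
left, right = np.hsplit(x, [1])
print(left.shape, right.shape)   # (3, 1) (3, 2)
```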

# we can also convert an array to a list

>>> z.tolist()

[[120, 72, 37], [43, 57, 12], [54, 20, 9]]

As we have seen, NumPy handles both numeric data in various formats and strings. We can also generate arrays filled with predefined values: the zeros() and ones() functions create arrays containing only zeros or only ones.

# the first number indicates the number of rows (cases); the second, the number of columns (variables)

>>> np.zeros((4,3))

array([[ 0., 0., 0.],

[ 0., 0., 0.],

[ 0., 0., 0.],

[ 0., 0., 0.]])

>>> np.ones((5,2))

array([[ 1., 1.],

[ 1., 1.],

[ 1., 1.],

[ 1., 1.],

[ 1., 1.]])
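Both functions return floating-point values unless told otherwise. The sketch below shows the dtype argument together with two related helpers, full() and zeros_like(), which are not covered in the text above.

```python
import numpy as np

z = np.zeros((4, 3))               # 4 rows, 3 columns, float64 by default
o = np.ones((5, 2), dtype=int)     # integer ones instead of floats
f = np.full((2, 2), 7)             # any constant value
like = np.zeros_like(o)            # same shape and dtype as an existing array

print(z.shape, z.dtype)            # (4, 3) float64
print(o.sum())                     # 10
print(f)
print(like.shape, like.dtype == o.dtype)
```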

NumPy supports many data types for storing data. Table 9-1 includes some of them from the documentation available at https://docs.scipy.org/doc/ .

Table 9-1

Options for Storing Data

Data Type	Description
bool_	Boolean (True or False), stored as a byte
int_	Default integer type (same as C long; normally either int64 or int32)
intc	Identical to C int (normally int32 or int64)
intp	Integer used for indexing (same as C ssize_t; normally either int32 or int64)
int8	Byte (-128 to 127)
int16	Integer (-32768 to 32767)
int32	Integer (-2147483648 to 2147483647)
int64	Integer (-9223372036854775808 to 9223372036854775807)
uint8	Unsigned integer (0 to 255)
uint16	Unsigned integer (0 to 65535)
uint32	Unsigned integer (0 to 4294967295)
uint64	Unsigned integer (0 to 18446744073709551615)
float_	Shorthand for float64
float16	Half-precision float: sign bit, 5-bit exponent, 10-bit mantissa
float32	Single-precision float: sign bit, 8-bit exponent, 23-bit mantissa
float64	Double-precision float: sign bit, 11-bit exponent, 52-bit mantissa
complex_	Shorthand for complex128
complex64	Complex number, represented by two 32-bit floats (real and imaginary components)
complex128	Complex number, represented by two 64-bit floats (real and imaginary components)

When we create an array, we can specify which data type its items should have:

>>> o1 = np.arange(10, dtype = 'int16')

>>> o1

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=int16)
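The limits listed in Table 9-1 can be queried from NumPy itself with iinfo(), and an existing array can be converted to another type with astype(). A minimal sketch:

```python
import numpy as np

# Limits of a few integer types from Table 9-1
for int_type in (np.int8, np.int16, np.uint8):
    info = np.iinfo(int_type)
    print(info.dtype, info.min, info.max)

o1 = np.arange(10, dtype='int16')
as_float = o1.astype('float32')    # astype returns a converted copy
print(as_float.dtype, o1.dtype)    # float32 int16

# Memory use depends on the type: 10 items * 2 bytes versus 10 * 4 bytes
print(o1.nbytes, as_float.nbytes)  # 20 40
```

Choosing a smaller type saves memory, at the cost of a narrower range of values that can be stored.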

NumPy can also be used to create matrices and perform matrix calculations.

# we create two matrices, mat1 and mat2

>>> mat1 = np.matrix([[10, 11, 12, 13, 14], [15, 16, 17, 18, 19], [20, 21, 22, 23, 24]])

>>> mat1

matrix([[10, 11, 12, 13, 14],

[15, 16, 17, 18, 19],

[20, 21, 22, 23, 24]])

>>> mat2 = np.matrix([[25, 26, 27, 28, 29], [30, 31, 32, 33, 34], [35, 36, 37, 38, 39]])

>>> mat2

matrix([[25, 26, 27, 28, 29],

[30, 31, 32, 33, 34],

[35, 36, 37, 38, 39]])

# we can do some mathematical operations

>>> mat1 + mat2

matrix([[35, 37, 39, 41, 43],

[45, 47, 49, 51, 53],

[55, 57, 59, 61, 63]])

>>> mat2 - mat1

matrix([[15, 15, 15, 15, 15],

[15, 15, 15, 15, 15],

[15, 15, 15, 15, 15]])

>>> mat2 / mat1

matrix([[ 2.5 , 2.36363636, 2.25 , 2.15384615, 2.07142857],

[ 2. , 1.9375 , 1.88235294, 1.83333333, 1.78947368],

[ 1.75 , 1.71428571, 1.68181818, 1.65217391, 1.625 ]])

>>> mat1 * 3

matrix([[30, 33, 36, 39, 42],

[45, 48, 51, 54, 57],

[60, 63, 66, 69, 72]])
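The operations above all work item by item or with a single number. Note that the * operator between two np.matrix objects means a true matrix product, so mat1 * mat2 would fail here because the shapes (3 by 5 and 3 by 5) do not match. The NumPy documentation recommends plain arrays over np.matrix; the sketch below therefore uses arrays with the same values, with * for the item-by-item product and @ for the matrix product.

```python
import numpy as np

a = np.array([[10, 11, 12, 13, 14],
              [15, 16, 17, 18, 19],
              [20, 21, 22, 23, 24]])
b = a + 15          # same values as mat2

elementwise = a * b     # item-by-item product, shape (3, 5)
product = a @ b.T       # matrix product of (3, 5) and (5, 3), shape (3, 3)

print(elementwise.shape, product.shape)

# Each entry of the matrix product is a row-by-row dot product
print(product[0, 0] == np.dot(a[0], b[0]))   # True
```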

# NumPy lets us calculate the mean

>>> np.mean(mat1)

17.0

# maximum value

>>> np.max(mat1)

# minimum value

>>> np.min(mat1)

# median

>>> np.median(mat1)

17.0

# variance

>>> np.var(mat1)

18.666666666666668

# standard deviation

>>> np.std(mat1)

4.3204937989385739

# covariance can be calculated by np.cov()
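The functions above summarize the whole matrix in a single number. As a sketch of two common variations, the axis argument computes one value per column or per row, and ddof=1 switches from the population variance to the sample variance. The array below holds the same values as mat1.

```python
import numpy as np

a = np.arange(10, 25).reshape(3, 5)   # same values as mat1, as a plain array

print(a.mean(axis=0))     # column means: [15. 16. 17. 18. 19.]
print(a.mean(axis=1))     # row means: [12. 17. 22.]

print(np.var(a))          # population variance (ddof=0), about 18.67
print(np.var(a, ddof=1))  # sample variance (divides by n - 1): 20.0

# np.cov treats each row as one variable and divides by n - 1 by default
c = np.cov(a)
print(c.shape)            # (3, 3)
```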

NumPy also allows us to import numeric data using the loadtxt() function:

# loadtxt() does not expand the ~ shortcut, so we build the full path first

>>> import os

>>> path = os.path.expanduser('~/Python_test/df2')

>>> up1 = np.loadtxt(path, delimiter = ',', usecols = (1,), unpack = True)

>>> up1

array([ 0. , 15.98293881, 41.7360094 , 21.02081375,

54.06254967, 6.68691812, 43.83810058, 39.55058136,

58.04370289, 85.02891662, 98.25872709])

# to import more columns, we change the usecols argument:

>>> up1 = np.loadtxt(path, delimiter = ',', usecols = (1,2,3,4,5), unpack = True)
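To try loadtxt() without a file on disk, we can pass it an in-memory text buffer. The data below is hypothetical and stands in for a real file; skiprows is an additional argument, useful when the first line holds column names.

```python
import io
import numpy as np

# Hypothetical comma-separated data standing in for a file on disk
text = io.StringIO("1,0.5,10\n2,1.5,20\n3,2.5,30\n")

# unpack=True transposes the result, so each column can go to its own name
ids, values = np.loadtxt(text, delimiter=",", usecols=(0, 1), unpack=True)
print(ids)      # [1. 2. 3.]
print(values)   # [0.5 1.5 2.5]

# skiprows skips header lines before the numeric data starts
with_header = io.StringIO("id,value\n1,0.5\n2,1.5\n")
table = np.loadtxt(with_header, delimiter=",", skiprows=1)
print(table.shape)   # (2, 2)
```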

NumPy: Generating Random Numbers and Seeds

To generate random numbers, we need to import the NumPy package.

>>> import numpy as np

We can generate a random number as follows:

>>> np.random.rand()

0.992777076172216

# or specify in parentheses the number of rows and columns to be generated automatically

>>> np.random.rand(2,3)

array([[ 0.39352349, 0.57116926, 0.88967038],

[ 0.76375617, 0.24620255, 0.17408501]])

We can create arrays of random integers of any size:

# a one-dimensional array of eight integers (10 is the upper limit of the range from which the numbers are drawn, and is itself excluded)

>>> np.random.randint(10, size= 8)

array([5, 8, 3, 7, 6, 8, 5, 3])

# next we create an array with four rows (cases) and five columns (variables)

>>> np.random.randint(10, size=(4, 5))

array([[4, 7, 1, 9, 8],

[7, 4, 0, 3, 7],

[3, 6, 9, 3, 4],

[4, 7, 3, 3, 9]])

# and then we create a three-dimensional array

>>> np.random.randint(10, size=(4, 5, 6))

array([[[7, 6, 1, 0, 1, 2],

[3, 2, 3, 2, 9, 2],

[5, 5, 6, 5, 0, 2],

[1, 6, 2, 1, 9, 0],

[1, 5, 3, 1, 3, 6]],

[[0, 1, 9, 3, 4, 3],

[2, 6, 6, 6, 6, 2],

[2, 3, 7, 6, 2, 5],

[1, 7, 7, 7, 0, 3],

[3, 8, 6, 4, 1, 6]],

[[1, 1, 2, 3, 2, 6],

[0, 8, 1, 8, 1, 7],

[7, 4, 3, 1, 6, 7],

[7, 3, 1, 9, 8, 8],

[9, 0, 1, 7, 2, 7]],

[[3, 3, 2, 2, 5, 6],

[4, 9, 3, 7, 4, 4],

[8, 6, 3, 3, 7, 0],

[9, 7, 0, 5, 2, 0],

[3, 3, 9, 5, 2, 9]]])

# with random.rand() we generate real numbers; with random.randint() we generate whole numbers

We can set a seed to be sure to generate the same random numbers and then repeat the examples:

>>> np.random.seed(12345)

>>> np.random.rand()

0.9296160928171479

>>> np.random.rand()

0.3163755545817859

>>> np.random.seed(12345)

>>> np.random.rand()

0.9296160928171479

>>> np.random.rand()

0.3163755545817859
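Recent NumPy versions (1.17 and later) also offer a generator object as an alternative to the global seed. It is not used in this chapter, so the following is only a sketch of the same idea: two generators created with the same seed produce the same numbers.

```python
import numpy as np

# Generator API (NumPy 1.17 and later); an alternative to np.random.seed()
rng_a = np.random.default_rng(12345)
rng_b = np.random.default_rng(12345)

first = rng_a.random(3)    # three floats in [0, 1)
second = rng_b.random(3)   # same seed, same three floats

print(np.array_equal(first, second))   # True

# Each generator keeps its own state, so other code cannot disturb it
print(rng_a.integers(0, 10, size=5))
```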

We can generate random integers with np.random.randint():

>>> print(np.random.randint(0, 100))

# 0 and 100 are the limits within which the number is drawn. The upper limit is excluded, so the number 100 cannot be drawn. This means that if we want to simulate, for example, rolling a die, we set the limits to 1 and 7.
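The die example can be sketched as follows. The seed and the number of rolls are arbitrary choices made for illustration.

```python
import numpy as np

np.random.seed(12345)                        # only to make the run repeatable
rolls = np.random.randint(1, 7, size=1000)   # 7 is excluded, so values are 1 to 6

print(rolls.min(), rolls.max())

# Count how often each face came up
faces, counts = np.unique(rolls, return_counts=True)
print(dict(zip(faces.tolist(), counts.tolist())))
```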

We can also create a list and draw elements from it at random:

>>> test1 = ["object1", "object2", "object3", "object4", "object5"]

>>> print(np.random.choice(test1))

object2

>>> print(np.random.choice(test1))

object4

We can create another random object using np.random.randn(), which draws samples from the standard normal distribution (mean 0, standard deviation 1).

>>> x = np.random.randn(1000)

# we import matplotlib (described in Chapter 10)

>>> import matplotlib.pyplot as plt

# and create a histogram of the x object

>>> plt.hist(x, bins = 100)

# we display the histogram

>>> plt.show()
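If we only need the numbers behind the histogram, np.histogram() returns the bin counts and bin edges without drawing anything. A small sketch, with an arbitrary seed and ten bins chosen for brevity:

```python
import numpy as np

np.random.seed(12345)
x = np.random.randn(1000)

print(x.mean(), x.std())   # close to 0 and 1, but not exactly

# np.histogram returns the bin counts and the bin edges
counts, edges = np.histogram(x, bins=10)
print(counts.sum())        # 1000: every value falls into one bin
print(len(edges))          # 11 edges delimit 10 bins
```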

NumPy can also be used to generate random datasets, like the one plotted in Figure 9-1. To do this, we must also load the pandas package. We’ve already loaded the NumPy package, so let’s proceed as follows:

# we import pandas

>>> import pandas as pd

Figure 9-1. Plot of a random dataset created with NumPy and pandas

We create a data frame using the pandas DataFrame() function. We can tell at a glance which package a function belongs to, because it is called with the package name (or its alias) as a prefix:

package_name.function_name()

# we use the DataFrame() function of the pandas (pd) package to create the data frame, and the random.randn() function of the NumPy (np) package to generate random data. In parentheses, we first put the number of cases to be generated (in this case, 10) followed by the number of variables (in this case, 5)

>>> rdf = pd.DataFrame(np.random.randn(10,5))

>>> rdf

0 1 2 3 4

0 0.669980 0.626433 -0.693932 -0.841258 -0.165200

1 0.108567 -0.743791 0.367369 0.645242 -0.297283

2 1.674781 0.241534 -0.403371 0.175751 0.274626

3 -2.339962 -0.083003 -1.387095 1.559257 -1.025012

4 0.383104 0.968755 0.236508 0.186294 0.094319

5 0.956150 -1.366423 0.694575 -0.107877 1.727657

6 -0.699931 -1.184346 0.581632 0.333015 -1.137382

7 0.867757 -0.872935 0.417772 -0.045722 0.432780

8 -0.685488 1.046816 0.465459 -0.446164 0.227635

9 -0.019854 0.643384 1.459784 0.559970 -0.358676

If we recreate a data frame using the same instructions, we will almost certainly get different data. To get the same data, we need to use a function that allows us to set a seed. In this way, we replicate the data.

# we set the seed

>>> np.random.seed(12345)

# we create the data frame

>>> rdf = pd.DataFrame(np.random.randn(10,5))

# we visualize the data frame

>>> rdf

0 1 2 3 4

0 -0.204708 0.478943 -0.519439 -0.555730 1.965781

1 1.393406 0.092908 0.281746 0.769023 1.246435

2 1.007189 -1.296221 0.274992 0.228913 1.352917

3 0.886429 -2.001637 -0.371843 1.669025 -0.438570

4 -0.539741 0.476985 3.248944 -1.021228 -0.577087

5 0.124121 0.302614 0.523772 0.000940 1.343810

6 -0.713544 -0.831154 -2.370232 -1.860761 -0.860757

7 0.560145 -1.265934 0.119827 -1.063512 0.332883

8 -2.359419 -0.199543 -1.541996 -0.970736 -1.307030

9 0.286350 0.377984 -0.753887 0.331286 1.349742

# we reinsert the same seed

>>> np.random.seed(12345)

# we recreate a dataset

>>> rdf2 = pd.DataFrame(np.random.randn(10,5))

# the generated data are identical

>>> rdf2

0 1 2 3 4

0 -0.204708 0.478943 -0.519439 -0.555730 1.965781

1 1.393406 0.092908 0.281746 0.769023 1.246435

2 1.007189 -1.296221 0.274992 0.228913 1.352917

3 0.886429 -2.001637 -0.371843 1.669025 -0.438570

4 -0.539741 0.476985 3.248944 -1.021228 -0.577087

5 0.124121 0.302614 0.523772 0.000940 1.343810

6 -0.713544 -0.831154 -2.370232 -1.860761 -0.860757

7 0.560145 -1.265934 0.119827 -1.063512 0.332883

8 -2.359419 -0.199543 -1.541996 -0.970736 -1.307030

9 0.286350 0.377984 -0.753887 0.331286 1.349742

>>> np.random.seed(12345)

>>> rdf = pd.DataFrame(np.random.randn(10,5))

>>> rdf

0 1 2 3 4

0 -0.204708 0.478943 -0.519439 -0.555730 1.965781

1 1.393406 0.092908 0.281746 0.769023 1.246435

2 1.007189 -1.296221 0.274992 0.228913 1.352917

3 0.886429 -2.001637 -0.371843 1.669025 -0.438570

4 -0.539741 0.476985 3.248944 -1.021228 -0.577087

5 0.124121 0.302614 0.523772 0.000940 1.343810

6 -0.713544 -0.831154 -2.370232 -1.860761 -0.860757

7 0.560145 -1.265934 0.119827 -1.063512 0.332883

8 -2.359419 -0.199543 -1.541996 -0.970736 -1.307030

9 0.286350 0.377984 -0.753887 0.331286 1.349742

Note that our randomly generated variables do not have names; they are identified merely by numbers. Let’s change the column names:

# we create a list that contains the names of the variables

>>> var_names = ['var1', 'var2', 'var3', 'var4', 'var5']

# we assign the list to the .columns attribute

>>> rdf2.columns = var_names

# we check the first cases of the data frame

>>> rdf2.head(2)

var1 var2 var3 var4 var5

0 -0.204708 0.478943 -0.519439 -0.555730 1.965781

1 1.393406 0.092908 0.281746 0.769023 1.246435

# the variable names are correct
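The names can also be set when the data frame is created, through the columns argument of DataFrame(). In the sketch below, rdf3 and the new name 'first' are hypothetical, used only for illustration; rename() changes selected names and returns a new data frame.

```python
import numpy as np
import pandas as pd

np.random.seed(12345)
var_names = ['var1', 'var2', 'var3', 'var4', 'var5']

# columns= sets the names at creation time
rdf3 = pd.DataFrame(np.random.randn(10, 5), columns=var_names)
print(rdf3.head(2))

# rename() changes selected names and leaves rdf3 itself unchanged
renamed = rdf3.rename(columns={'var1': 'first'})
print(list(renamed.columns))
```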

We can acquire other types of distributions using other NumPy functions.

# binomial distribution

>>> rdf_bin = pd.DataFrame(np.random.binomial(100, 0.5, (10,5)))

>>> rdf_bin

0 1 2 3 4

0 47 56 48 42 53

1 43 56 51 56 46

2 50 42 40 55 46

3 46 43 54 51 53

4 41 48 47 45 42

5 53 55 51 50 58

6 51 57 46 53 48

7 56 53 46 54 54

8 53 50 52 53 46

9 50 46 54 57 56

# Poisson distribution

>>> rdf_poi = pd.DataFrame(np.random.poisson(100, (10,5)))

>>> rdf_poi

0 1 2 3 4

0 109 107 98 111 95

1 115 109 101 108 95

2 105 97 102 100 94

3 94 93 94 122 96

4 117 85 135 90 83

5 103 106 105 93 116

6 111 95 100 95 80

7 81 75 84 93 101

8 105 109 104 104 113

9 97 120 90 98 95

# uniform distribution

>>> rdf_un = pd.DataFrame(np.random.uniform(1, 100,(10,5)))

>>> rdf_un

0 1 2 3 4

0 49.139046 98.433411 60.777590 76.202858 68.767153

1 93.309206 95.028762 99.021002 13.489383 97.683461

2 23.681460 19.419586 51.411931 53.519271 28.981285

3 31.705992 31.678553 27.372500 59.749684 22.496303

4 40.835909 29.567563 18.210241 23.345639 98.875500

5 43.971084 82.990738 57.678770 65.538128 73.244077

6 39.737081 39.893383 86.095572 81.191942 83.817845

7 19.610805 36.600078 48.716414 96.641192 56.768005

8 82.922981 8.534653 55.760657 55.246106 90.638916

9 37.579379 90.215102 14.922471 10.818199 97.345143
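To see what the arguments of these functions mean, we can draw a large sample from each distribution and compare its mean with the theoretical value. The seed and the sample size are arbitrary choices for this sketch.

```python
import numpy as np

np.random.seed(12345)
n = 100_000   # a large sample, chosen only for illustration

binom = np.random.binomial(100, 0.5, n)   # 100 trials, success probability 0.5
poiss = np.random.poisson(100, n)         # expected value (lambda) of 100
unif = np.random.uniform(1, 100, n)       # values in [1, 100)

print(binom.mean())   # near 100 * 0.5 = 50
print(poiss.mean())   # near 100
print(unif.mean())    # near (1 + 100) / 2 = 50.5
```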

How do we save a data frame or array created in NumPy? First, we create a data frame:

>>> tos = pd.DataFrame(np.random.randn(10,5))

Then we save it with the NumPy save() function, which stores the values as an array in a file with the .npy extension.

>>> np.save('tos_saved', tos)

# we can check it in our work directory

To load our saved file, we use the load() function in NumPy. The result is a NumPy array, not a data frame, so the row and column labels are not restored.

>>> load1 = np.load('tos_saved.npy')
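The round trip can be checked with a plain array, as in the following sketch. It writes to a temporary folder (an assumption made here so that nothing is left in the work directory) and confirms that the loaded values match the saved ones.

```python
import os
import tempfile
import numpy as np

data = np.random.randn(10, 5)

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'tos_saved.npy')   # temporary, illustrative path
    np.save(path, data)
    loaded = np.load(path)

print(type(loaded).__name__)          # ndarray
print(np.array_equal(data, loaded))   # True
```

When the labels of a data frame matter, pandas' own methods, such as to_csv() or to_pickle(), are the more natural choice.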

NumPy contains other advanced mathematics modules, such as numpy.linalg for linear algebra and numpy.fft for Fourier transforms. It has also included a number of functions for financial analysis, such as interest and future-value calculations; in recent releases these have moved to the separate numpy-financial package. You can find all the necessary documentation about NumPy features at https://docs.scipy.org/doc/ .

Summary

NumPy and SciPy are two very important packages used by data analysts. Each has a variety of features, and NumPy can be combined with pandas, for example, to create random datasets.
