Chapter 10. Further Reading

NumPy is a powerful scientific module in Python; hopefully, the previous nine chapters have shown you enough to demonstrate this. ndarray is at the core of many other Python scientific modules. The best way to use NumPy is to use numpy.ndarray as the basic data format and combine it with other scientific modules for preprocessing, analysis, computation, export, and so on. In this chapter, we introduce a few modules that work with NumPy and can make your work or research more efficient.

In this chapter, we will be covering the following topics:

  • pandas
  • scikit-learn
  • netCDF4
  • scipy

pandas

pandas is by far the most widely preferred data preprocessing module in Python. The way it handles data is very similar to R. Its data frame not only gives you visually appealing printouts of tables, but also allows you to access data in a more intuitive way. If you are not familiar with R, think of working with spreadsheet software such as Microsoft Excel, or with SQL tables, but in a programmatic way. This covers much of what pandas does.

You can download and install pandas from its official site at http://pandas.pydata.org/. A better way is to use pip or to install a Python scientific distribution such as Anaconda.

Remember how we used numpy.genfromtxt() to read the csv data in Chapter 4, NumPy Core and Libs Submodules? In practice, using pandas to read tables and then passing the preprocessed data to ndarray (simply calling np.array(data_frame) converts a data frame into a multidimensional ndarray) is a better workflow for analytics. In this section, we are going to show you two basic data structures in pandas: Series (for one-dimensional data) and DataFrame (for two-dimensional, tabular data).

Then, we will show you how to use pandas to read a table and pass data to ndarray for further analysis. Let's start with pandas.Series:

In [1]: import pandas as pd 
In [2]: py_list = [3, 8, 15, 25, 11] 
In [3]: series = pd.Series(py_list) 
In [4]: series 
Out[4]: 
0     3 
1     8
2    15
3    25 
4    11 
dtype: int64 

In the preceding example, you can see that we've converted the Python list into a pandas Series and that, when we printed series, the values are lined up neatly and each has an index number associated with it (0 to 4). We can, of course, specify our own index (for example, one that starts from 1 or uses letters). Take a look at the following code example:

In [5]: indices = ['A', 'B', 'C', 'D', 'E'] 
In [6]: series = pd.Series(py_list, index = indices) 
In [7]: series 
Out[7]: 
A     3 
B     8 
C    15 
D    25 
E    11 
dtype: int64 

We changed the indices from numbers to the letters A through E. More conveniently, when we convert a Python dictionary to a pandas Series, the dictionary keys become the index automatically. Try converting a dictionary yourself as practice. Next, we are going to explore DataFrame, which is the data structure that's used most often in pandas.
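As a brief aside before the DataFrame example, the dictionary conversion suggested above might look like the following sketch. The dictionary py_dict is hypothetical; it simply reuses the letters and values from the earlier Series session.

```python
import pandas as pd

# Hypothetical dictionary reusing the letters and values from the session above.
py_dict = {'A': 3, 'B': 8, 'C': 15, 'D': 25, 'E': 11}

# The dictionary keys become the index labels; the values become the data.
series_from_dict = pd.Series(py_dict)

print(series_from_dict)
# Values can then be looked up by label, just as with an explicit index.
print(series_from_dict['C'])
```

No separate index argument is needed here, because the keys already carry the labels. The DataFrame session below then picks up the main thread.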

In [8]: data = {'Name': ['Brian', 'George', 'Kate', 'Amy', 'Joe'], 
   ...:         'Age': [23, 41, 26, 19, 35]} 
In [9]: data_frame = pd.DataFrame(data) 
In [10]: data_frame 
Out[10]: 
   Age    Name 
0   23   Brian 
1   41  George
2   26    Kate 
3   19     Amy 
4   35     Joe 

In the preceding example, we created a DataFrame that contains two columns, Name and Age. (The printout lists Age first because the pandas version used here orders columns built from a dictionary alphabetically; more recent versions keep the order in which the keys were given.) You can see from the printout that it looks just like a table because it's well formatted. Of course, you can also change the index of the data frame. But the advantages of a data frame go well beyond this. We can access or sort the data in each column by its column name (two notations are available: data_frame.column_name or data_frame['column_name']); we can even compute summary statistics. Take a look at this code example:

In [11]: data_frame = pd.DataFrame(data) 
In [12]: data_frame['Age'] 
Out[12]: 
0    23 
1    41 
2    26 
3    19 
4    35 
Name: Age, dtype: int64 
In [13]: data_frame.sort_values(by = 'Age') 
Out[13]: 
   Age    Name 
3   19     Amy 
0   23   Brian 
2   26    Kate 
4   35     Joe 
1   41  George 
In [14]: data_frame.describe() 
Out[14]: 
             Age 
count   5.000000 
mean   28.800000 
std     9.011104 
min    19.000000 
25%    23.000000 
50%    26.000000 
75%    35.000000 
max    41.000000 

In the preceding example, we selected only the Age column and then sorted the DataFrame by Age. When we call describe(), it calculates summary statistics (count, mean, standard deviation, minimum, maximum, and percentiles) for all numeric fields.
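The following is a minimal sketch, not part of the original session, that ties the column access above back to ndarray. It rebuilds the same Name/Age data, shows that the two column notations return the same column, and converts the result with np.array(). The variable names (ages_attr, ages_item, frame_array, and so on) are illustrative only.

```python
import numpy as np
import pandas as pd

data = {'Name': ['Brian', 'George', 'Kate', 'Amy', 'Joe'],
        'Age': [23, 41, 26, 19, 35]}
data_frame = pd.DataFrame(data)

# Attribute notation and bracket notation return the same column.
ages_attr = data_frame.Age
ages_item = data_frame['Age']
same_column = ages_attr.equals(ages_item)

# Pass a single numeric column to NumPy for further computation.
ages = np.array(data_frame['Age'])
mean_age = ages.mean()

# Converting the whole frame gives a two-dimensional array; because the
# columns mix text and integers, the elements are generic Python objects.
frame_array = np.array(data_frame)

print(same_column, mean_age, frame_array.shape)
```

Bracket notation is the safer habit in scripts, since attribute notation only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute.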

In the last part of this section, we are going to use pandas to read a csv file and pass one field value to ndarray for further computation. The example file is from the Office for National Statistics (ONS). Visit http://www.ons.gov.uk/ons/datasets-and-tables/index.html for more details. We will use the dataset titled Sale counts by dwelling type and local authority, England and Wales from the ONS website. You can search for it by topic name to reach the download page, or pick any dataset that you are interested in. In the following example, we renamed our example dataset to sales.csv:

In [15]: sales = pd.read_csv('sales.csv') 
In [16]: sales.shape 
Out[16]: (348, 97) 
In [17]: sales.columns[:3] 
Out[17]: Index([u'LA_Code', u'LA_Name', u'1995_COUNT_ALL_TYPES'], dtype='object') 
In [18]: sales['1995_COUNT_ALL_TYPES'].head() 
Out[18]: 
0    1,188 
1    1,652 
2    1,684 
3    2,314 
4    1,558 
Name: 1995_COUNT_ALL_TYPES, dtype: object 

First, we read sales.csv into a DataFrame object called sales; when we printed the shape of sales, we found that there were 348 records and 97 columns in the data frame. The columns attribute of a DataFrame returns a pandas Index object rather than an ordinary Python list, but it can be sliced in the same way, and we printed the first three column names in the data: LA_Code, LA_Name, and 1995_COUNT_ALL_TYPES. Then, we printed the first five records in 1995_COUNT_ALL_TYPES using the head() function (the tail() function prints the last five records). Note that the dtype of this column is object: the values contain thousands separators, so pandas read them as strings rather than numbers.
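To pass this field to ndarray, the comma-formatted strings first have to become numbers. The sketch below is one possible way to do that, under stated assumptions: since sales.csv is not bundled here, it builds a small stand-in frame (sales_sample, a hypothetical name) from the five values printed by head() above, strips the commas, and converts the result to an ndarray.

```python
import numpy as np
import pandas as pd

# Stand-in for the column read from sales.csv: the five values shown by
# head() above, kept as strings because of the comma separators.
column = '1995_COUNT_ALL_TYPES'
sales_sample = pd.DataFrame({column: ['1,188', '1,652', '1,684', '2,314', '1,558']})

# Remove the thousands separators, then cast the strings to integers.
counts = sales_sample[column].str.replace(',', '', regex=False).astype(int)

# Hand the cleaned column to NumPy for further computation.
counts_array = np.array(counts)
total = counts_array.sum()

print(counts_array, total)
```

When reading the real file, passing thousands=',' to pd.read_csv() may achieve the same result at load time, provided the values are quoted in the file; whether that applies to the ONS download is an assumption worth checking against the actual data.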

Again, pandas is a powerful preprocessing module in Python (its strength lies in data handling in general rather than in preprocessing alone, but in the preceding examples we covered only the preprocessing part), and it has many handy functions to help you clean your data and prepare it for analysis. This section is just an introduction; there is a lot that we can't cover due to space restrictions, such as pivoting, datetime handling, and more. Hopefully, you get the idea and can start making your scripts more efficient.
