Importing and manipulating data with Pandas

Until now, we have seen how to import and export data using mostly the tools provided in the Python standard library. Now, we'll see how to perform some of the operations shown above in just a few lines using the Pandas library. Pandas is an open source, BSD-licensed library that simplifies data import and manipulation by providing data structures and parsing functions.

We will demonstrate how to import, manipulate and export data using Pandas.

Getting ready

To be able to use the code in this section, we need to install Pandas. This can again be done using pip, as shown here:

pip install pandas

How to do it...

Here, we will again import the data from ch02-data.csv, add a new column to the original data, and export the result as CSV, as shown in the following code snippet:

import pandas as pd

data = pd.read_csv('ch02-data.csv')
data['amount_x_2'] = data['amount'] * 2
data.to_csv('ch02-data_more.csv')

How it works...

First, we import Pandas into our environment and then we call the function read_csv on the file that we want to read. This function automatically parses the CSV format and organizes the data in an indexed structure called a DataFrame. Then, we take the column amount, multiply each of its elements by two, and store the result in a new column called amount_x_2. Finally, we save the result into a new file named ch02-data_more.csv using the method to_csv. A DataFrame is a Pandas object that represents a table, and we can access its columns as shown in the following section.
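The steps above can be sketched without touching the file system. This is a minimal sketch, assuming a small in-memory sample in place of ch02-data.csv; the sample values and the use of index=False are illustrative choices, not part of the recipe.

```python
import io

import pandas as pd

# Hypothetical sample standing in for the contents of ch02-data.csv
csv_text = "day,amount\n2013-01-24,323\n2013-01-25,233\n2013-01-26,433\n"

# read_csv accepts file-like objects as well as file names
data = pd.read_csv(io.StringIO(csv_text))

# Element-wise multiplication creates the new column in one step
data['amount_x_2'] = data['amount'] * 2

# index=False (an assumption here) leaves the row index out of the output
out = io.StringIO()
data.to_csv(out, index=False)
result = out.getvalue()
print(result)
```

Without index=False, to_csv also writes the row index as an unnamed first column, which is what the recipe's call produces.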

There's more...

DataFrames are very handy structures; they're designed to be fast and easy to access. Each column that they contain becomes an attribute of the object that represents the data frame. For example, we can print the values in the column amount of the object data defined earlier as shown here:

>>> print(data.amount)
0    323
1    233
2    433
3    555
4    123
5      0
6    221
Name: amount, dtype: int64

We can also print the list of all the columns in a DataFrame, as shown in the following code:

>>> print(data.columns)
Index([u'day', u'amount'], dtype='object')
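Attribute access is a convenience on top of bracket access, and it only works when the column name is a valid Python identifier that does not clash with an existing DataFrame attribute or method. The following is a minimal sketch with made-up values; the column count is a hypothetical name chosen to show the clash.

```python
import pandas as pd

# Hypothetical values; 'count' is deliberately named after a DataFrame method
df = pd.DataFrame({'day': ['Mon', 'Tue'], 'amount': [323, 233], 'count': [1, 2]})

# Attribute access and bracket access return the same column
same = df.amount.equals(df['amount'])

# df.count is still the DataFrame method, so this column needs df['count']
clash = callable(df.count)

column_names = list(df.columns)
print(same, clash, column_names)
```

Bracket access works for every column name, so it is the safer choice in code that handles arbitrary files.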

Also, the function read_csv that we used to import the data has many parameters that we can use to deal with messy files and parse particular data formats. For example, if the values in our file are delimited by spaces instead of commas, we can use the parameter delimiter to parse the data correctly. Here's an example in which we import data from a file where the values are separated by a variable number of spaces, skip the first row, and specify our own column names:

# r'\s+' matches one or more whitespace characters; ' *' can also match an empty string
pd.read_csv('ch02-data.tab', skiprows=1,
  delimiter=r'\s+', names=['day','amount'])
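To try this kind of parsing without a file on disk, the following minimal sketch reads whitespace-separated values from an in-memory string. The sample contents are hypothetical, and the separator is written as the regular expression r'\s+' (one or more whitespace characters) through the sep parameter, for which delimiter is an alias.

```python
import io

import pandas as pd

# Hypothetical contents: one header line to skip, then values
# separated by a varying number of spaces
text = (
    "day    amount\n"
    "2013-01-24   323\n"
    "2013-01-25 233\n"
    "2013-01-26      433\n"
)

# skiprows=1 drops the original header; names supplies our own column names
df = pd.read_csv(io.StringIO(text), skiprows=1, sep=r'\s+',
                 names=['day', 'amount'])
print(df)
```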