Chapter 5. Retrieving, Processing, and Storing Data

Data can be found everywhere, in all shapes and forms. We can get it from the Web, by e-mail and FTP, or create it ourselves in a lab experiment or marketing poll. An exhaustive overview of how to acquire data in various formats would require many more pages than we have available. Sometimes we need to store data before we can analyze it or after we are done with our analysis, so we also discuss storing data in this chapter. Chapter 8, Working with Databases, gives information about various databases (relational and NoSQL) and related APIs. The following is a list of the topics that we are going to cover in this chapter:

  • Writing CSV files with NumPy and pandas
  • The binary .npy and pickle formats
  • Reading and writing to Excel with pandas
  • JSON
  • REST web services
  • Parsing RSS feeds
  • Scraping the Web
  • Parsing HTML
  • Storing data with PyTables
  • HDF5 pandas I/O

Writing CSV files with NumPy and pandas

In the previous chapters, we learned about reading CSV files. Writing CSV files is just as straightforward, but uses different functions and methods. Let's first generate some data to be stored in the CSV format. In the following code snippet, seed the random generator, generate a 3 x 4 NumPy array, and set one of the array values to NaN:

import numpy as np

np.random.seed(42)

a = np.random.randn(3, 4)
a[2][2] = np.nan
print(a)

This code will print the array as follows:

[[ 0.49671415 -0.1382643   0.64768854  1.52302986]
 [-0.23415337 -0.23413696  1.57921282  0.76743473]
 [-0.46947439  0.54256004         nan -0.46572975]]

The NumPy savetxt() function is the counterpart of the NumPy loadtxt() function and can save arrays in delimited file formats such as CSV. Save the array we created with the following function call:

np.savetxt('np.csv', a, fmt='%.2f', delimiter=',', header=" #1,  #2,  #3,  #4")

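In the preceding function call, we specified the name of the file to be saved, the array, an optional format, a delimiter (the default is a space), and an optional header. By default, savetxt() prefixes the header line with "# " to mark it as a comment, which is why the first line of the file starts with an extra # character.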

View the np.csv file we created with the cat command (cat np.csv) or an editor, such as Notepad on Windows. The contents of the file should be displayed as follows:

#  #1,  #2,  #3,  #4
0.50,-0.14,0.65,1.52
-0.23,-0.23,1.58,0.77
-0.47,0.54,nan,-0.47
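
To check that the file can be read back, the following is a minimal sketch that loads it again with loadtxt(). The temporary path and the variable name loaded are illustrative assumptions, not part of the book's code bundle. Because loadtxt() treats lines starting with # as comments by default, the header line is skipped:

```python
import os
import tempfile

import numpy as np

np.random.seed(42)
a = np.random.randn(3, 4)
a[2][2] = np.nan

# Hypothetical location: a temporary directory keeps the working directory clean
path = os.path.join(tempfile.mkdtemp(), 'np.csv')
np.savetxt(path, a, fmt='%.2f', delimiter=',', header=" #1,  #2,  #3,  #4")

# loadtxt() ignores lines starting with '#', so the header is not parsed as data
loaded = np.loadtxt(path, delimiter=',')
print(loaded.shape)
print(np.isnan(loaded[2][2]))
```

Note that the values come back rounded to two decimals, because that is all the '%.2f' format wrote to the file.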

Create a pandas DataFrame from the random values array:

df = pd.DataFrame(a)
print(df)

As you can observe, pandas automatically comes up with column names for our data:

          0         1         2         3
0  0.496714 -0.138264  0.647689  1.523030
1 -0.234153 -0.234137  1.579213  0.767435
2 -0.469474  0.542560       NaN -0.465730
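
If the default integer labels are not descriptive enough, we can pass our own through the columns parameter of the DataFrame constructor. The following is a small sketch; the labels w, x, y, and z are hypothetical and chosen only for illustration:

```python
import numpy as np
import pandas as pd

np.random.seed(42)
a = np.random.randn(3, 4)
a[2][2] = np.nan

# The column labels are made up for this example
df_named = pd.DataFrame(a, columns=['w', 'x', 'y', 'z'])
print(df_named.columns.tolist())
```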

Write a DataFrame to a CSV file with the pandas to_csv() method as follows:

df.to_csv('pd.csv', float_format='%.2f', na_rep="NAN!")

We gave this method the name of the file, an optional format string analogous to the fmt parameter of the NumPy savetxt() function, and an optional string that represents NaN. By default, to_csv() also writes the row index as the first column. View the pd.csv file to see the following:

,0,1,2,3
0,0.50,-0.14,0.65,1.52
1,-0.23,-0.23,1.58,0.77
2,-0.47,0.54,NAN!,-0.47
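
Reading this file back requires telling pandas about the custom NaN marker. The following sketch assumes a temporary file path and the variable name restored, neither of which comes from the book's code bundle. It uses the index_col and na_values parameters of read_csv():

```python
import os
import tempfile

import numpy as np
import pandas as pd

np.random.seed(42)
a = np.random.randn(3, 4)
a[2][2] = np.nan
df = pd.DataFrame(a)

# Hypothetical location: a temporary directory keeps the working directory clean
path = os.path.join(tempfile.mkdtemp(), 'pd.csv')
df.to_csv(path, float_format='%.2f', na_rep="NAN!")

# index_col=0 restores the row labels from the unnamed first column;
# na_values makes read_csv() treat the "NAN!" marker as missing data
restored = pd.read_csv(path, index_col=0, na_values=["NAN!"])
print(restored)
```

The column labels come back as the strings '0' to '3' rather than integers, since read_csv() reads the header row as text.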

Take a look at the code in the writing_csv.py file in this book's code bundle:

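import numpy as np
import pandas as pd

# Seed the generator so the random values are reproducible
np.random.seed(42)

a = np.random.randn(3, 4)
a[2][2] = np.nan
print(a)

# Save the array as CSV with two decimals and a header line
np.savetxt('np.csv', a, fmt='%.2f', delimiter=',', header=" #1,  #2,  #3,  #4")

# Write the same data with pandas, using a custom NaN marker
df = pd.DataFrame(a)
print(df)
df.to_csv('pd.csv', float_format='%.2f', na_rep="NAN!")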