Data can be found everywhere, in all shapes and forms. We can get it from the Web, by e-mail and FTP, or create it ourselves in a lab experiment or marketing poll. An exhaustive overview of how to acquire data in various formats would require many more pages than we have available. Sometimes we need to store data before we can analyze it or after we are done with our analysis, so this chapter also discusses storing data. Chapter 8, Working with Databases, gives information about various databases (relational and NoSQL) and related APIs. The following is a list of the topics that we are going to cover in this chapter:
.npy and pickle formats

In the previous chapters, we learned about reading CSV files. Writing CSV files is just as straightforward, but uses different functions and methods. Let's first generate some data to be stored in the CSV format. Generate a 3 x 4 NumPy array after seeding the random generator in the following code snippet. Set one of the array values to NaN:
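import numpy as np

np.random.seed(42)           # seed the generator so the values are reproducible
a = np.random.randn(3, 4)    # 3 x 4 array of standard normal values
a[2][2] = np.nan             # mark one value as missing
print(a)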
This code will print the array as follows:
[[ 0.49671415 -0.1382643   0.64768854  1.52302986]
 [-0.23415337 -0.23413696  1.57921282  0.76743473]
 [-0.46947439  0.54256004         nan -0.46572975]]
The NumPy savetxt() function is the counterpart of the NumPy loadtxt() function and can save arrays in delimited file formats such as CSV. Save the array we created with the following function call:
np.savetxt('np.csv', a, fmt='%.2f', delimiter=',', header=" #1, #2, #3, #4")
In the preceding function call, we specified the name of the file to be saved, the array, an optional format, a delimiter (the default is space), and an optional header.
The format parameter takes a printf-style specifier such as %.2f; the width, precision, and type fields are described at http://docs.python.org/2/library/string.html#format-specification-mini-language.
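To get a feel for the format parameter, the following minimal sketch writes the same array with a few different format strings. Writing to an in-memory io.StringIO buffer instead of a file, the to_text() helper, and the particular format strings are assumptions made for illustration; passing a text buffer to savetxt() assumes a reasonably recent NumPy.

```python
import io

import numpy as np

np.random.seed(42)
a = np.random.randn(3, 4)
a[2][2] = np.nan


def to_text(arr, **kwargs):
    """Return what np.savetxt() would write, as a string (illustrative helper)."""
    buf = io.StringIO()
    np.savetxt(buf, arr, **kwargs)
    return buf.getvalue()


# Two decimals, comma-separated, as in the text.
two_decimals = to_text(a, fmt='%.2f', delimiter=',')

# Scientific notation with one decimal.
scientific = to_text(a, fmt='%.1e', delimiter=',')

# Fixed width of 8 characters with 3 decimals, default space delimiter.
fixed_width = to_text(a, fmt='%8.3f')

print(two_decimals)
print(scientific)
print(fixed_width)
```

The same format string is applied to every value in the array, including the NaN entry, which is written as nan.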
View the np.csv file we created with the cat command (cat np.csv) or an editor, such as Notepad on Windows. The contents of the file should be displayed as follows:
# #1, #2, #3, #4
0.50,-0.14,0.65,1.52
-0.23,-0.23,1.58,0.77
-0.47,0.54,nan,-0.47
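Since savetxt() is described as the counterpart of loadtxt(), a quick way to check the file is to read it back. The sketch below is one way to do that; writing into a temporary directory rather than the current one is an assumption made to avoid overwriting existing files.

```python
import os
import tempfile

import numpy as np

np.random.seed(42)
a = np.random.randn(3, 4)
a[2][2] = np.nan

# Write into a temporary directory (an assumption, to avoid touching real files).
path = os.path.join(tempfile.mkdtemp(), 'np.csv')
np.savetxt(path, a, fmt='%.2f', delimiter=',', header=" #1, #2, #3, #4")

# loadtxt() skips lines starting with '#' by default, so the header is ignored.
b = np.loadtxt(path, delimiter=',')

print(b.shape)
print(np.isnan(b[2][2]))

# The file only holds two decimals, so b is compared with a rounded version of a.
print(np.allclose(b, np.round(a, 2), equal_nan=True))
```

Note that the round trip is lossy: the digits dropped by the '%.2f' format cannot be recovered from the file.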
Create a pandas DataFrame from the random values array:
import pandas as pd

df = pd.DataFrame(a)    # a is the NumPy array created earlier
print(df)
As you can observe, pandas automatically comes up with column names for our data:
          0         1         2         3
0  0.496714 -0.138264  0.647689  1.523030
1 -0.234153 -0.234137  1.579213  0.767435
2 -0.469474  0.542560       NaN -0.465730
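If the automatically generated integer labels are not descriptive enough, column names can be supplied when the DataFrame is created. In the following sketch, the labels w, x, y, and z are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

np.random.seed(42)
a = np.random.randn(3, 4)
a[2][2] = np.nan

# Default: integer labels 0..3 for the columns and 0..2 for the rows.
df = pd.DataFrame(a)
print(list(df.columns))

# Hypothetical labels, chosen only for illustration.
named = pd.DataFrame(a, columns=['w', 'x', 'y', 'z'])
print(named)
```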
Write a DataFrame to a CSV file with the pandas to_csv() method as follows:
df.to_csv('pd.csv', float_format='%.2f', na_rep="NAN!")
We gave this method the name of the file, an optional format string analogous to the format parameter of the NumPy savetxt() function, and an optional string that represents NaN. View the pd.csv file to see the following:
,0,1,2,3
0,0.50,-0.14,0.65,1.52
1,-0.23,-0.23,1.58,0.77
2,-0.47,0.54,NAN!,-0.47
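The first, unnamed column in this file holds the row labels, and the custom NAN! marker is not something a reader will recognize as missing data on its own. The following sketch shows one way to read the file back with read_csv(), and how to leave the row labels out when writing. The temporary directory is an assumption made to avoid overwriting existing files.

```python
import os
import tempfile

import numpy as np
import pandas as pd

np.random.seed(42)
a = np.random.randn(3, 4)
a[2][2] = np.nan
df = pd.DataFrame(a)

# Write into a temporary directory (an assumption, to avoid touching real files).
path = os.path.join(tempfile.mkdtemp(), 'pd.csv')
df.to_csv(path, float_format='%.2f', na_rep="NAN!")

# index_col=0 turns the first, unnamed column back into the row index;
# na_values tells the parser that "NAN!" stands for a missing value.
restored = pd.read_csv(path, index_col=0, na_values=['NAN!'])
print(restored)

# Without a file name, to_csv() returns the CSV text as a string;
# index=False leaves the row labels out altogether.
no_index = df.to_csv(index=False, float_format='%.2f', na_rep="NAN!")
print(no_index)
```

After the round trip, the column labels come back as strings rather than integers, because CSV files carry no type information for the header row.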
Take a look at the code in the writing_csv.py file in this book's code bundle:
import numpy as np
import pandas as pd

np.random.seed(42)

# Generate a 3 x 4 array and set one value to NaN
a = np.random.randn(3, 4)
a[2][2] = np.nan
print(a)

# Save the array as CSV with NumPy
np.savetxt('np.csv', a, fmt='%.2f', delimiter=',', header=" #1, #2, #3, #4")

# Save the same data as CSV with pandas
df = pd.DataFrame(a)
print(df)
df.to_csv('pd.csv', float_format='%.2f', na_rep="NAN!")