pandas can read and write data stored in the JavaScript Object Notation (JSON) format. This is one of my favorites due to its ability to be used across platforms and with many programming languages.
To demonstrate saving as JSON, we will save the Excel data we just read in to a JSON file and then take a look at the contents:
In [25]: # write the excel data to a JSON file df.head().to_json("data/stocks.json") !cat data/stocks.json # osx or Linux # !type datastocks.json # windows {"Date":{"0":1405900800000,"1":1405641600000,"2":1405555200000,"3":1405468800000,"4":1405382400000}, "Open":{"0":83.46,"1":83.3,"2":84.35,"3":83.77,"4":84.3}, "High":{"0":83.53,"1":83.4,"2":84.63,"3":84.91,"4":84.38}, "Low":{"0":81.81,"1":82.52,"2":83.33,"3":83.66,"4":83.2}, "Close":{"0":81.93,"1":83.35,"2":83.63,"3":84.91,"4":83.58}, "Volume":{"0":2359300,"1":4020800,"2":1974000,"3":1755600,"4":1874700}, "Adj Close":{"0":81.93,"1":83.35,"2":83.63,"3":84.91,"4":83.58}}
JSON-based data can be read with the pd.read_json()
function:
In [26]: # read data in from JSON df_from_json = pd.read_json("data/stocks.json") df_from_json.head(5) Out[26]: Adj Close Close Date High Low Open Volume 0 81.93 81.93 2014-07-21 83.53 81.81 83.46 2359300 1 83.35 83.35 2014-07-18 83.40 82.52 83.30 4020800 2 83.63 83.63 2014-07-17 84.63 83.33 84.35 1974000 3 84.91 84.91 2014-07-16 84.91 83.66 83.77 1755600 4 83.58 83.58 2014-07-15 84.38 83.20 84.30 1874700
Notice two slight differences here caused by the reading/writing of data from JSON. First, the columns have been reordered alphabetically. Second, the index for DataFrame
, although containing content, is sorted as a string. These issues can be fixed easily, but they will not be covered here.
pandas has very nice support for reading data from HTML files (or HTML from URLs). Underneath the covers, pandas makes use of the LXML, Html5Lib, and BeautifulSoup4 packages, which provide some very impressive capabilities for reading and writing HTML tables.
The pd.read_html()
function will read HTML from a file (or URL) and parse all HTML tables found in the content into one or more pandas DataFrame
object. The function always returns a list of DataFrame
objects (actually, zero or more, depending on the number of tables found in the HTML).
To demonstrate, we will read table data from the FDIC failed bank list located at https://www.fdic.gov/bank/individual/failed/banklist.html. Viewing the page, you can see there is a list of quite a few failed banks.
This data is actually very simple to read with pandas and its pd.read_html()
function. The following reads from this URL and reports the findings:
In [27]: # the URL to read url = "http://www.fdic.gov/bank/individual/failed/banklist.html" # read it banks = pd.read_html(url) # examine a subset of the first table read banks[0][0:5].ix[:,0:4] Out[27]: Bank Name City ST CERT 0 Doral BankEn Espanol San Juan PR 32102 1 Capitol City Bank & Trust Company Atlanta GA 33938 2 Highland Community Bank Chicago IL 20290 3 First National Bank of Crestview Crestview FL 17557 4 Northern Star Bank Mankato MN 34983
Again, that was almost too easy!
DataFrame
can be written to an HTML file with the .to_html()
method. The method creates a file containing the table tag for the data (not the entire HTML document). The following writes the stock data we read earlier to an HTML file:
In [28]: # read the stock data df = pd.read_excel("data/stocks.xlsx") # write the first two rows to HTML df.head(2).to_html("data/stocks.html") # check the first 28 lines of the output !head -n 28 data/stocks.html # max or linux # !type datastocks.html # window, but prints the entire file <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>Date</th> <th>Open</th> <th>High</th> <th>Low</th> <th>Close</th> <th>Volume</th> <th>Adj Close</th> </tr> </thead> <tbody> <tr> <th>0</th> <td>2014-07-21</td> <td> 83.46</td> <td> 83.53</td> <td> 81.81</td> <td> 81.93</td> <td> 2359300</td> <td> 81.93</td> </tr> <tr> <th>1</th> <td>2014-07-18</td> <td> 83.30</td>
Viewing this in the browser looks like what is shown in the following:
This is useful, as you can use pandas to write HTML fragments to be included in websites, updating them when needed and therefore, having the new data available to the site statically, instead of through a more complicated data query or service call.
HDF5 is a data model, library, and file format to store and manage data. It is commonly used in scientific computing environments. It supports an unlimited variety of data types and is designed for flexible and efficient I/O and for high volume and complex data.
HDF5 is portable and is extensible, allowing applications to evolve in their use of HDF5. The HDF5 Technology suite includes tools and applications to manage, manipulate, view, and analyze data in the HDF5 format. HDF5 is:
HDFStore
is a hierarchical, dictionary-like object that reads and writes pandas objects to the HDF5 format. Under the covers, HDFStore
uses the PyTables library, so make sure that it is installed if you want to use this format.
The following demonstrates writing DataFrame
into an HDF5 format. The output shows you that the HDF5 store has a root level object named df
, which is a frame and whose shape is eight rows of three columns:
In [29]: # seed for replication np.random.seed(123456) # create a DataFrame of dates and random numbers in three columns df = pd.DataFrame(np.random.randn(8, 3), index=pd.date_range('1/1/2000', periods=8), columns=['A', 'B', 'C']) # create HDF5 store store = pd.HDFStore('data/store.h5') store['df'] = df # persisting happened here store Out[29]: <class 'pandas.io.pytables.HDFStore'> File path: data/store.h5 /df frame (shape->[8,3])
The following reads the HDF5 store and retrieves DataFrame
:
In [30]: # read in data from HDF5 store = pd.HDFStore("data/store.h5") df = store['df'] df Out[30]: A B C 2000-01-01 0.469112 -0.282863 -1.509059 2000-01-02 -1.135632 1.212112 -0.173215 2000-01-03 0.119209 -1.044236 -0.861849 2000-01-04 -2.104569 -0.494929 1.071804 2000-01-05 0.721555 -0.706771 -1.039575 2000-01-06 0.271860 -0.424972 0.567020 2000-01-07 0.276232 -1.087401 -0.673690 2000-01-08 0.113648 -1.478427 0.524988
DataFrame
is written to the HDF5 file at the point it is assigned to the store object. Changes to DataFrame
made after that point are not persisted, at least not until the object is assigned to the data store object again. The following demonstrates this by making a change to DataFrame
and then reassigning it to the HDF5 store, thereby updating the data store:
In [31]: # this changes the DataFrame, but did not persist df.ix[0].A = 1 # to persist the change, assign the DataFrame to the # HDF5 store object store['df'] = df # it is now persisted # the following loads the store and # shows the first two rows, demonstrating # the the persisting was done pd.HDFStore("data/store.h5")['df'].head(2) # it's now in there Out[31]: A B C 2000-01-01 1.000000 -0.282863 -1.509059 2000-01-02 -1.135632 1.212112 -0.173215
18.224.30.19