Reading and writing JSON files

pandas can read and write data stored in the JavaScript Object Notation (JSON) format. This is one of my favorites due to its ability to be used across platforms and with many programming languages.

To demonstrate saving as JSON, we will save the Excel data we just read in to a JSON file and then take a look at the contents:

In [25]:
   # write the excel data to a JSON file
   df.head().to_json("data/stocks.json")
   !cat data/stocks.json # osx or Linux
   # !type datastocks.json # windows

   {"Date":{"0":1405900800000,"1":1405641600000,"2":1405555200000,"3":1405468800000,"4":1405382400000}, "Open":{"0":83.46,"1":83.3,"2":84.35,"3":83.77,"4":84.3}, "High":{"0":83.53,"1":83.4,"2":84.63,"3":84.91,"4":84.38}, "Low":{"0":81.81,"1":82.52,"2":83.33,"3":83.66,"4":83.2}, "Close":{"0":81.93,"1":83.35,"2":83.63,"3":84.91,"4":83.58},   "Volume":{"0":2359300,"1":4020800,"2":1974000,"3":1755600,"4":1874700}, 
"Adj Close":{"0":81.93,"1":83.35,"2":83.63,"3":84.91,"4":83.58}}

JSON-based data can be read with the pd.read_json() function:

In [26]:
   # read data in from JSON
   df_from_json = pd.read_json("data/stocks.json")
   df_from_json.head(5)

Out[26]:
      Adj Close  Close       Date   High    Low   Open   Volume
   0      81.93  81.93 2014-07-21  83.53  81.81  83.46  2359300
   1      83.35  83.35 2014-07-18  83.40  82.52  83.30  4020800
   2      83.63  83.63 2014-07-17  84.63  83.33  84.35  1974000
   3      84.91  84.91 2014-07-16  84.91  83.66  83.77  1755600
   4      83.58  83.58 2014-07-15  84.38  83.20  84.30  1874700

Notice two slight differences here caused by the reading/writing of data from JSON. First, the columns have been reordered alphabetically. Second, the index for DataFrame, although containing content, is sorted as a string. These issues can be fixed easily, but they will not be covered here.

Reading HTML data from the Web

pandas has very nice support for reading data from HTML files (or HTML from URLs). Underneath the covers, pandas makes use of the LXML, Html5Lib, and BeautifulSoup4 packages, which provide some very impressive capabilities for reading and writing HTML tables.

The pd.read_html() function will read HTML from a file (or URL) and parse all HTML tables found in the content into one or more pandas DataFrame object. The function always returns a list of DataFrame objects (actually, zero or more, depending on the number of tables found in the HTML).

To demonstrate, we will read table data from the FDIC failed bank list located at https://www.fdic.gov/bank/individual/failed/banklist.html. Viewing the page, you can see there is a list of quite a few failed banks.

Reading HTML data from the Web

This data is actually very simple to read with pandas and its pd.read_html() function. The following reads from this URL and reports the findings:

In [27]:
   # the URL to read
   url = "http://www.fdic.gov/bank/individual/failed/banklist.html"
   # read it
   banks = pd.read_html(url)
   # examine a subset of the first table read
   banks[0][0:5].ix[:,0:4]

Out[27]:
                              Bank Name       City  ST   CERT
   0               Doral BankEn Espanol   San Juan  PR  32102
   1  Capitol City Bank & Trust Company    Atlanta  GA  33938
   2            Highland Community Bank    Chicago  IL  20290
   3   First National Bank of Crestview  Crestview  FL  17557
   4                 Northern Star Bank    Mankato  MN  34983

Again, that was almost too easy!

DataFrame can be written to an HTML file with the .to_html() method. The method creates a file containing the table tag for the data (not the entire HTML document). The following writes the stock data we read earlier to an HTML file:

In [28]:
   # read the stock data
   df = pd.read_excel("data/stocks.xlsx")
   # write the first two rows to HTML
   df.head(2).to_html("data/stocks.html")
   # check the first 28 lines of the output
   !head -n 28 data/stocks.html # max or linux
   # !type datastocks.html # window, but prints the entire file

   <table border="1" class="dataframe">
     <thead>
       <tr style="text-align: right;">
         <th></th>
         <th>Date</th>
         <th>Open</th>
         <th>High</th>
         <th>Low</th>
         <th>Close</th>
         <th>Volume</th>
         <th>Adj Close</th>
       </tr>
     </thead>
     <tbody>
       <tr>
         <th>0</th>
         <td>2014-07-21</td>
         <td> 83.46</td>
         <td> 83.53</td>
         <td> 81.81</td>
         <td> 81.93</td>
         <td> 2359300</td>
         <td> 81.93</td>
       </tr>
       <tr>
         <th>1</th>
         <td>2014-07-18</td>
         <td> 83.30</td>

Viewing this in the browser looks like what is shown in the following:

Reading HTML data from the Web

This is useful, as you can use pandas to write HTML fragments to be included in websites, updating them when needed and therefore, having the new data available to the site statically, instead of through a more complicated data query or service call.

Reading and writing HDF5 format files

HDF5 is a data model, library, and file format to store and manage data. It is commonly used in scientific computing environments. It supports an unlimited variety of data types and is designed for flexible and efficient I/O and for high volume and complex data.

HDF5 is portable and is extensible, allowing applications to evolve in their use of HDF5. The HDF5 Technology suite includes tools and applications to manage, manipulate, view, and analyze data in the HDF5 format. HDF5 is:

  • A versatile data model that can represent very complex data objects and a wide variety of metadata
  • A completely portable file format with no limit on the number or size of data objects in the collection
  • A software library that runs on a range of computational platforms, from laptops to massively parallel systems, and implements a high-level API with C, C++, Fortran 90, and Java interfaces
  • A rich set of integrated performance features that allow for access time and storage space optimizations
  • Tools and applications to manage, manipulate, view, and analyze the data in the collection

HDFStore is a hierarchical, dictionary-like object that reads and writes pandas objects to the HDF5 format. Under the covers, HDFStore uses the PyTables library, so make sure that it is installed if you want to use this format.

The following demonstrates writing DataFrame into an HDF5 format. The output shows you that the HDF5 store has a root level object named df, which is a frame and whose shape is eight rows of three columns:

In [29]:
   # seed for replication
   np.random.seed(123456)
   # create a DataFrame of dates and random numbers in three columns
   df = pd.DataFrame(np.random.randn(8, 3), 
                     index=pd.date_range('1/1/2000', periods=8),
                     columns=['A', 'B', 'C'])

   # create HDF5 store
   store = pd.HDFStore('data/store.h5')
   store['df'] = df # persisting happened here
   store

Out[29]:
   <class 'pandas.io.pytables.HDFStore'>
   File path: data/store.h5
   /df            frame        (shape->[8,3])

The following reads the HDF5 store and retrieves DataFrame:

In [30]:
   # read in data from HDF5
   store = pd.HDFStore("data/store.h5")
   df = store['df']
   df

Out[30]:
                      A         B         C
   2000-01-01  0.469112 -0.282863 -1.509059
   2000-01-02 -1.135632  1.212112 -0.173215
   2000-01-03  0.119209 -1.044236 -0.861849
   2000-01-04 -2.104569 -0.494929  1.071804
   2000-01-05  0.721555 -0.706771 -1.039575
   2000-01-06  0.271860 -0.424972  0.567020
   2000-01-07  0.276232 -1.087401 -0.673690
   2000-01-08  0.113648 -1.478427  0.524988

DataFrame is written to the HDF5 file at the point it is assigned to the store object. Changes to DataFrame made after that point are not persisted, at least not until the object is assigned to the data store object again. The following demonstrates this by making a change to DataFrame and then reassigning it to the HDF5 store, thereby updating the data store:

In [31]:
   # this changes the DataFrame, but did not persist
   df.ix[0].A = 1 

   # to persist the change, assign the DataFrame to the 
   # HDF5 store object
   store['df'] = df
   # it is now persisted

   # the following loads the store and 
   # shows the first two rows, demonstrating
   # the the persisting was done
   pd.HDFStore("data/store.h5")['df'].head(2) # it's now in there

Out[31]:
                     A         B         C
   2000-01-01  1.000000 -0.282863 -1.509059
   2000-01-02 -1.135632  1.212112 -0.173215
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.224.30.19