Chapter 3. Working with Financial Data

Clearly, data beats algorithms. Without comprehensive data, you tend to get non-comprehensive predictions.1

Rob Thomas

In algorithmic trading, one generally has to deal with four types of data, as illustrated in Table 3-1. Although this simplifies the real world of data, distinguishing data along the two pairs historical vs. real-time and structured vs. unstructured often proves useful in technical settings.

Table 3-1. Types of financial data (examples)

             structured                    unstructured

historical   end-of-day closing prices     financial news articles

real-time    bid/ask prices for FX         posts on Twitter

This book is mainly concerned with structured data (numerical, tabular data) of both historical and real-time types. This chapter in particular focuses on historical, structured data, like end-of-day closing values for the SAP SE stock traded at the Frankfurt Stock Exchange. However, this category also subsumes intraday data, like, for example, 1-minute-bar data for the Apple, Inc. stock traded at the NASDAQ stock exchange. The processing of real-time, structured data is covered in [Link to Come].

An algorithmic trading project typically starts with a trading idea or hypothesis which needs to be (back-)tested based on historical financial data. This is the context for this chapter, the plan for which is as follows. “Reading Financial Data From Different Sources” uses pandas to read data from different file- and web-based sources. “Working with Open Data Sources” introduces Quandl as a popular open data source platform. “Eikon Data API” introduces the Python wrapper for the Refinitiv Eikon Data API. Finally, “Storing Financial Data Efficiently” briefly shows how to store historical, structured data efficiently with pandas based on the HDF5 binary storage format.

The goal for this chapter is to have financial data available in a format with which the backtesting of trading ideas and hypotheses can be implemented effectively. The three major themes are the importing, handling, and storage of data. This and subsequent chapters assume a Python 3.8 installation with Python packages installed as explained in detail in Chapter 2. For the time being, it is not yet relevant on which infrastructure exactly this Python environment is provided. For more details on efficient input-output operations with Python, see Hilpisch (2018, ch. 9).

Reading Financial Data From Different Sources

This section makes heavy use of the capabilities of pandas, the popular data analysis package for Python (see pandas home page). pandas comprehensively supports the three main tasks this chapter is concerned with: reading data, handling data, and storing data. One of its strengths is the reading of data from different types of sources as the remainder of this section illustrates.

The Data Set

In this section, we work with a fairly small data set for the Apple, Inc. stock price (with symbol AAPL and Reuters Instrument Code or RIC AAPL.O) as retrieved from the Eikon Data API for April 2020.

With such historical financial data stored in a CSV file on disk, pure Python can be used to read and print its content.

In [1]: fn = '../data/AAPL.csv'  1

In [2]: with open(fn, 'r') as f:  1
            for _ in range(5):  2
                print(f.readline(), end='')  3
        Date,HIGH,CLOSE,LOW,OPEN,COUNT,VOLUME
        2020-04-01,248.72,240.91,239.13,246.5,460606.0,44054638.0
        2020-04-02,245.15,244.93,236.9,240.34,380294.0,41483493.0
        2020-04-03,245.7,241.41,238.9741,242.8,293699.0,32470017.0
        2020-04-06,263.11,262.47,249.38,250.9,486681.0,50455071.0
1

Opens the file on disk (adjust the path and filename if necessary).

2

Sets up a for loop with 5 iterations.

3

Prints the first 5 lines in the opened CSV file.

This approach allows for a simple inspection of the data. One learns that there is a header line and that the single data points per row represent Date, HIGH, CLOSE, LOW, OPEN, COUNT and VOLUME, respectively. However, the data is not yet available in memory for further usage with Python.

Reading from a CSV File with Python

To work with data stored as a CSV file, the file needs to be parsed and the data needs to be stored in a Python data structure. Python has a built-in module called csv which supports the reading of data from a CSV file. The first approach yields a list object containing other list objects with the data from the file.

In [3]: import csv  1

In [4]: csv_reader = csv.reader(open(fn, 'r'))  2

In [5]: data = list(csv_reader)  3

In [6]: data[:5]  4
Out[6]: [['Date', 'HIGH', 'CLOSE', 'LOW', 'OPEN', 'COUNT', 'VOLUME'],
         ['2020-04-01',
          '248.72',
          '240.91',
          '239.13',
          '246.5',
          '460606.0',
          '44054638.0'],
         ['2020-04-02',
          '245.15',
          '244.93',
          '236.9',
          '240.34',
          '380294.0',
          '41483493.0'],
         ['2020-04-03',
          '245.7',
          '241.41',
          '238.9741',
          '242.8',
          '293699.0',
          '32470017.0'],
         ['2020-04-06',
          '263.11',
          '262.47',
          '249.38',
          '250.9',
          '486681.0',
          '50455071.0']]
1

Imports the csv module.

2

Instantiates a csv.reader iterator object.

3

The list() constructor consumes the iterator and collects every single line from the CSV file as a list object in the resulting list object.

4

Prints out the first five elements of the list object.

Working with such a nested list object — e.g. for the calculation of the average closing price — is possible in principle but not really efficient or intuitive. Using a csv.DictReader iterator object instead of the standard csv.reader object makes such tasks a bit more manageable. Every row of data in the CSV file (apart from the header row) is then imported as a dict object so that single values can be accessed via the respective key.

In [7]: csv_reader = csv.DictReader(open(fn, 'r'))  1

In [8]: data = list(csv_reader)

In [9]: data[:3]
Out[9]: [{'Date': '2020-04-01',
          'HIGH': '248.72',
          'CLOSE': '240.91',
          'LOW': '239.13',
          'OPEN': '246.5',
          'COUNT': '460606.0',
          'VOLUME': '44054638.0'},
         {'Date': '2020-04-02',
          'HIGH': '245.15',
          'CLOSE': '244.93',
          'LOW': '236.9',
          'OPEN': '240.34',
          'COUNT': '380294.0',
          'VOLUME': '41483493.0'},
         {'Date': '2020-04-03',
          'HIGH': '245.7',
          'CLOSE': '241.41',
          'LOW': '238.9741',
          'OPEN': '242.8',
          'COUNT': '293699.0',
          'VOLUME': '32470017.0'}]
1

Here, the csv.DictReader iterator object is instantiated; it reads every data row into a dict object, using the information from the header row as keys.

Based on the single dict objects, aggregations are now somewhat easier to accomplish. However, the respective Python code can still hardly be called a convenient way of calculating the mean of the Apple stock's closing prices.

In [10]: sum([float(l['CLOSE']) for l in data]) / len(data)  1
Out[10]: 272.38619047619045
1

First, a list object is generated via a list comprehension with all closing values; second, the sum is taken over all these values; third, the resulting sum is divided by the number of closing values.

This is one of the major reasons why pandas has gained such popularity in the Python community. It makes the importing of data and the handling of, for example, financial time series data sets more convenient (and often also considerably faster) than pure Python.

Reading from a CSV File with pandas

From this point on, this section uses pandas to work with the Apple stock price data set. The major function used is read_csv(), which allows for a number of customizations via different parameters (see the read_csv() API reference). As a result of the data reading procedure, read_csv() yields a DataFrame object, which is the central means of storing (tabular) data with pandas. The DataFrame class has many powerful methods that are particularly helpful in financial applications (refer to the DataFrame API reference).

In [11]: import pandas as pd  1

In [12]: data = pd.read_csv(fn, index_col=0,
                            parse_dates=True)  2

In [13]: data.info()  3
         <class 'pandas.core.frame.DataFrame'>
         DatetimeIndex: 21 entries, 2020-04-01 to 2020-04-30
         Data columns (total 6 columns):
          #   Column  Non-Null Count  Dtype
         ---  ------  --------------  -----
          0   HIGH    21 non-null     float64
          1   CLOSE   21 non-null     float64
          2   LOW     21 non-null     float64
          3   OPEN    21 non-null     float64
          4   COUNT   21 non-null     float64
          5   VOLUME  21 non-null     float64
         dtypes: float64(6)
         memory usage: 1.1 KB

In [14]: data.tail()  4
Out[14]:               HIGH   CLOSE     LOW    OPEN     COUNT      VOLUME
         Date
         2020-04-24  283.01  282.97  277.00  277.20  306176.0  31627183.0
         2020-04-27  284.54  283.17  279.95  281.80  300771.0  29271893.0
         2020-04-28  285.83  278.58  278.20  285.08  285384.0  28001187.0
         2020-04-29  289.67  287.73  283.89  284.73  324890.0  34320204.0
         2020-04-30  294.53  293.80  288.35  289.96  471129.0  45765968.0
1

The pandas package is imported.

2

This imports the data from the CSV file, indicating that the first column shall be treated as the index column and that the entries in that column shall be interpreted as date-time information.

3

This method call prints out meta information regarding the resulting DataFrame object.

4

The data.tail() method prints out by default the five most recent data rows.

Calculating the mean of the Apple stock closing values is now only a single method call away.

In [15]: data['CLOSE'].mean()
Out[15]: 272.38619047619056

Chapter 4 introduces more functionality of pandas for the handling of financial data. For details on working with pandas and the powerful DataFrame class also refer to the official pandas Documentation page and to the book McKinney (2017).

Tip

Although the Python standard library provides capabilities to read data from CSV files, pandas in general significantly simplifies and speeds up such operations. An additional benefit is that the data analysis capabilities of pandas are immediately available since read_csv() returns a DataFrame object.

Exporting to Excel and JSON

pandas also excels at exporting data stored in DataFrame objects when this data needs to be shared in a format that is not Python-specific. Apart from exporting to CSV files, pandas can, for example, also export the data to Excel spreadsheet files as well as to JSON files, both of which are popular data exchange formats in the financial industry. Such an export typically requires a single method call only.

In [16]: data.to_excel('data/aapl.xls', 'AAPL')  1

In [17]: data.to_json('data/aapl.json')  2

In [18]: ls -n data/
         total 24
         -rw-r--r--  1 501  20  3067 Aug 25 11:47 aapl.json
         -rw-r--r--  1 501  20  5632 Aug 25 11:47 aapl.xls
1

Exports the data to an Excel spreadsheet file on disk.

2

Exports the data to a JSON file on disk.

In particular when it comes to the interaction with Excel spreadsheet files, there are more elegant ways than just doing a data dump to a new file. xlwings, for example, is a powerful Python package allowing for an efficient and intelligent interaction between Python and Excel (visit the xlwings home page).
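As a minimal sketch of what such an interaction could look like, consider the following lines of code; they assume that xlwings is installed and that Excel is available on the machine, and the file name used is only illustrative:

import xlwings as xw  # assumes xlwings is installed and Excel is available

wb = xw.Book()  # opens a new, empty workbook in Excel
sht = wb.sheets[0]  # selects the first worksheet
sht.range('A1').value = data  # writes the DataFrame (incl. index and header) starting at cell A1
wb.save('data/aapl_xlwings.xlsx')  # saves the workbook to disk (illustrative file name)
wb.close()  # closes the workbook

In contrast to a plain file export, this approach allows one to write into specific cells and ranges of an already formatted workbook.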

Reading from Excel and JSON

Now that the data is also available in the form of an Excel spreadsheet file and a JSON data file, pandas can read data from these sources as well. The approach is as straightforward as with CSV files.

In [19]: data_copy_1 = pd.read_excel('data/aapl.xls', 'AAPL',
                                     index_col=0)  1

In [20]: data_copy_1.head()  2
Out[20]:               HIGH   CLOSE       LOW    OPEN   COUNT    VOLUME
         Date
         2020-04-01  248.72  240.91  239.1300  246.50  460606  44054638
         2020-04-02  245.15  244.93  236.9000  240.34  380294  41483493
         2020-04-03  245.70  241.41  238.9741  242.80  293699  32470017
         2020-04-06  263.11  262.47  249.3800  250.90  486681  50455071
         2020-04-07  271.70  259.43  259.0000  270.80  467375  50721831

In [21]: data_copy_2 = pd.read_json('data/aapl.json')  3

In [22]: data_copy_2.head()  4
Out[22]:               HIGH   CLOSE       LOW    OPEN   COUNT    VOLUME
         2020-04-01  248.72  240.91  239.1300  246.50  460606  44054638
         2020-04-02  245.15  244.93  236.9000  240.34  380294  41483493
         2020-04-03  245.70  241.41  238.9741  242.80  293699  32470017
         2020-04-06  263.11  262.47  249.3800  250.90  486681  50455071
         2020-04-07  271.70  259.43  259.0000  270.80  467375  50721831

In [23]: !rm data/*
1

This reads the data from the Excel spreadsheet file to a new DataFrame object.

2

The first five rows of the first in-memory copy of the data are printed.

3

This reads the data from the JSON file to yet another DataFrame object.

4

This then prints the first five rows of the second in-memory copy of the data.

pandas proves useful for reading and writing financial data from and to different types of data files. Often, the reading might be tricky due to non-standard storage conventions (like a ";" instead of a "," as separator), but pandas generally provides the right set of parameters to cope with such cases, as the sketch following this paragraph illustrates. Although all examples in this section use a small data set only, one can expect high-performance input-output operations from pandas in the most important scenarios when the data sets are much larger.
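The following is a hedged sketch of such a case; the file name and the formatting conventions are purely hypothetical and only serve to illustrate the relevant read_csv() parameters:

# hypothetical CSV file using ";" as separator and "," as decimal mark
data_eu = pd.read_csv('data/prices_european.csv',  # illustrative file name
                      sep=';',  # non-standard column separator
                      decimal=',',  # comma as decimal mark
                      index_col=0,  # first column as index
                      parse_dates=True)  # interpret the index values as dates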

Working with Open Data Sources

To a great extent, the attractiveness of the Python ecosystem stems from the fact that almost all packages available are open source and can be used for free. Financial analytics in general and algorithmic trading in particular, however, cannot live on open source software and algorithms alone — data plays a vital role as well, as the quote at the beginning of the chapter emphasizes. The previous section uses a small data set from a commercial data source. While there have been helpful open (financial) data sources available for some years (such as the ones provided by Yahoo! Finance or Google Finance), there are not too many left at the time of this writing in 2020. One of the more obvious reasons for this trend might be the ever-changing terms of data licensing agreements.

The one notable exception for the purposes of this book is Quandl (http://quandl.com), a platform that aggregates a large number of open as well as premium (= to-be-paid-for) data sources. The data is provided via a unified API for which a Python wrapper package is available.

The Python wrapper package for the Quandl data API (see the Python wrapper page on Quandl and the Github page of the package) is installed with conda through conda install quandl. The first example shows how to retrieve historical average prices for the BTC/USD exchange rate since the introduction of Bitcoin as a cryptocurrency. With Quandl, requests always expect a combination of the database and the specific data set desired; in the example, these are BCHAIN and MKPRU. Such information can generally be looked up on the Quandl platform. For the example, the relevant page on Quandl is BCHAIN/MKPRU.

By default, the quandl package returns a pandas DataFrame object. In the example, the Value column is also presented in annualized fashion, i.e. with year-end values. Note that the number shown for 2020 is the last available value in the data set (here from August 2020) and not necessarily the year-end value.

While a large part of the data sets on the Quandl platform are free, some of the free data sets require an API key (such a key is also required once a certain limit of free API calls is exceeded). Every user obtains such a key by signing up for a free Quandl account on the Quandl sign-up page. Data requests requiring an API key expect the key to be provided via the parameter api_key. In the example, the API key (which is found on the account settings page) is read from a configuration file via the configparser module.
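Such a configuration file could, for instance, have the following structure; the section and option names match the ones used in the code that follows, while the values shown are of course placeholders to be replaced by your own credentials:

[quandl]
api_key = YOUR_API_KEY

[eikon]
app_key = YOUR_APP_KEY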

In [24]: import configparser
         config = configparser.ConfigParser()
         config.read('../pyalgo.cfg')
Out[24]: ['../pyalgo.cfg']

In [25]: import quandl as q  1

In [26]: data = q.get('BCHAIN/MKPRU', api_key=config['quandl']['api_key'])  2

In [27]: data.info()
         <class 'pandas.core.frame.DataFrame'>
         DatetimeIndex: 4254 entries, 2009-01-03 to 2020-08-26
         Data columns (total 1 columns):
          #   Column  Non-Null Count  Dtype
         ---  ------  --------------  -----
          0   Value   4254 non-null   float64
         dtypes: float64(1)
         memory usage: 66.5 KB

In [28]: data['Value'].resample('A').last()  3
Out[28]: Date
         2009-12-31        0.000000
         2010-12-31        0.299999
         2011-12-31        4.995000
         2012-12-31       13.590000
         2013-12-31      731.000000
         2014-12-31      317.400000
         2015-12-31      428.000000
         2016-12-31      952.150000
         2017-12-31    13215.574000
         2018-12-31     3832.921667
         2019-12-31     7385.360000
         2020-12-31    11763.930000
         Freq: A-DEC, Name: Value, dtype: float64
1

Imports the Python wrapper package for Quandl.

2

Reads historical data for the BTC/USD exchange rate.

3

Selects the Value column, resamples it — from the originally daily values to yearly values — and defines the last available observation to be the relevant one.

Quandl also provides, for example, diverse data sets for single stocks, like end-of-day stock prices, stock fundamentals or data sets related to options traded on a certain stock.

In [29]: data = q.get('FSE/SAP_X', start_date='2018-1-1',
                      end_date='2020-05-01',
                      api_key=config['quandl']['api_key'])

In [30]: data.info()
         <class 'pandas.core.frame.DataFrame'>
         DatetimeIndex: 579 entries, 2018-01-02 to 2020-04-30
         Data columns (total 10 columns):
          #   Column                 Non-Null Count  Dtype
         ---  ------                 --------------  -----
          0   Open                   257 non-null    float64
          1   High                   579 non-null    float64
          2   Low                    579 non-null    float64
          3   Close                  579 non-null    float64
          4   Change                 0 non-null      object
          5   Traded Volume          533 non-null    float64
          6   Turnover               533 non-null    float64
          7   Last Price of the Day  0 non-null      object
          8   Daily Traded Units     0 non-null      object
          9   Daily Turnover         0 non-null      object
         dtypes: float64(6), object(4)
         memory usage: 49.8+ KB

The API key can also be configured permanently with the Python wrapper via

q.ApiConfig.api_key = 'YOUR_API_KEY'

The Quandl platform also offers premium data sets for which a subscription or fee is required. Most of these data sets offer free samples. The example retrieves option implied volatilities for the Microsoft Corporation stock. The free sample data set is quite large, with more than 1,000 rows and many columns (only a subset is shown). The last line of code displays the 30-, 60- and 90-day implied volatility values for the five most recent days available.

In [31]: q.ApiConfig.api_key = config['quandl']['api_key']

In [32]: vol = q.get('VOL/MSFT')

In [33]: vol.iloc[:, :10].info()
         <class 'pandas.core.frame.DataFrame'>
         DatetimeIndex: 1006 entries, 2015-01-02 to 2018-12-31
         Data columns (total 10 columns):
          #   Column  Non-Null Count  Dtype
         ---  ------  --------------  -----
          0   Hv10    1006 non-null   float64
          1   Hv20    1006 non-null   float64
          2   Hv30    1006 non-null   float64
          3   Hv60    1006 non-null   float64
          4   Hv90    1006 non-null   float64
          5   Hv120   1006 non-null   float64
          6   Hv150   1006 non-null   float64
          7   Hv180   1006 non-null   float64
          8   Phv10   1006 non-null   float64
          9   Phv20   1006 non-null   float64
         dtypes: float64(10)
         memory usage: 86.5 KB

In [34]: vol[['IvMean30', 'IvMean60', 'IvMean90']].tail()
Out[34]:             IvMean30  IvMean60  IvMean90
         Date
         2018-12-24    0.4310    0.4112    0.3829
         2018-12-26    0.4059    0.3844    0.3587
         2018-12-27    0.3918    0.3879    0.3618
         2018-12-28    0.3940    0.3736    0.3482
         2018-12-31    0.3760    0.3519    0.3310

This concludes the overview of the Python wrapper package quandl for the Quandl data API. The Quandl platform and service is growing rapidly and proves to be a valuable source for financial data in an algorithmic trading context.

Note

Open source software is a trend that started many years ago. It has lowered the barriers to entry in many areas, including algorithmic trading. A newer, reinforcing trend in this regard is open data sources. In some cases, such as with Quandl, they even provide high-quality data sets. It cannot be expected that open data will completely replace professional data subscriptions any time soon, but they represent a valuable means to get started with algorithmic trading in a cost-efficient manner.

Eikon Data API

Open data sources are a blessing for algorithmic traders who want to get started in the space and to quickly test hypotheses and ideas based on real financial data sets. Sooner or later, however, open data sets will not suffice anymore to satisfy the requirements of more ambitious traders and professionals.

Refinitiv is one of the biggest financial data and news providers in the world. Its current desktop flagship product is Eikon, which is the equivalent of the Terminal by Bloomberg, the major competitor in the data services field. Figure 3-1 shows a screenshot of Eikon in the browser-based version. It provides access to petabytes of data via a single access point.

Figure 3-1. Browser version of Eikon terminal

Recently, Refinitiv has streamlined its API landscape and released a Python wrapper package, called eikon, for the Eikon Data API, which is installed via pip install eikon. If you have a subscription to the Refinitiv Eikon data services, you can use the Python package to programmatically retrieve historical as well as streaming structured and unstructured data from the unified API. A technical prerequisite is that a local desktop application is running that provides a desktop API session. The latest such desktop application at the time of this writing is called Workspace (see Figure 3-2).

Figure 3-2. Workspace application with desktop API services

If you are an Eikon subscriber and have an account for the Developer Community pages, you find an overview of the Python Eikon Scripting Library under Quick Start.

In order to use the Eikon Data API, the Eikon app_key needs to be set. You get it via the App Key Generator (APPKEY) application in either Eikon or Workspace.

In [35]: import eikon as ek  1

In [36]: ek.set_app_key(config['eikon']['app_key'])  2

In [37]: help(ek)  3
         Help on package eikon:

         NAME
             eikon - # coding: utf-8

         PACKAGE CONTENTS
             Profile
             data_grid
             eikonError
             json_requests
             news_request
             streaming_session (package)
             symbology
             time_series
             tools

         SUBMODULES
             cache
             desktop_session
             istream_callback
             itemstream
             session
             stream
             stream_connection
             streamingprice
             streamingprice_callback
             streamingprices

         VERSION
             1.1.5

         FILE

              /Users/yves/Python/envs/py38/lib/python3.8/site-packages/eikon/__init__
          .py
1

Imports the eikon package as ek.

2

Sets the app_key.

3

Shows the help text for the main module.

Retrieving Historical Structured Data

The retrieval of historical financial time series data is as straightforward as with the other wrappers used before.

In [39]: symbols = ['AAPL.O', 'MSFT.O', 'GOOG.O']  1

In [40]: data = ek.get_timeseries(symbols,  2
                                  start_date='2020-01-01',  3
                                  end_date='2020-05-01',  4
                                  interval='daily',  5
                                  fields=['*'])  6

In [41]: data.keys()  7
Out[41]: MultiIndex([('AAPL.O',   'HIGH'),
                     ('AAPL.O',  'CLOSE'),
                     ('AAPL.O',    'LOW'),
                     ('AAPL.O',   'OPEN'),
                     ('AAPL.O',  'COUNT'),
                     ('AAPL.O', 'VOLUME'),
                     ('MSFT.O',   'HIGH'),
                     ('MSFT.O',  'CLOSE'),
                     ('MSFT.O',    'LOW'),
                     ('MSFT.O',   'OPEN'),
                     ('MSFT.O',  'COUNT'),
                     ('MSFT.O', 'VOLUME'),
                     ('GOOG.O',   'HIGH'),
                     ('GOOG.O',  'CLOSE'),
                     ('GOOG.O',    'LOW'),
                     ('GOOG.O',   'OPEN'),
                     ('GOOG.O',  'COUNT'),
                     ('GOOG.O', 'VOLUME')],
                    )

In [42]: type(data['AAPL.O'])  8
Out[42]: pandas.core.frame.DataFrame

In [43]: data['AAPL.O'].info()  9
         <class 'pandas.core.frame.DataFrame'>
         DatetimeIndex: 84 entries, 2020-01-02 to 2020-05-01
         Data columns (total 6 columns):
          #   Column  Non-Null Count  Dtype
         ---  ------  --------------  -----
          0   HIGH    84 non-null     float64
          1   CLOSE   84 non-null     float64
          2   LOW     84 non-null     float64
          3   OPEN    84 non-null     float64
          4   COUNT   84 non-null     Int64
          5   VOLUME  84 non-null     Int64
         dtypes: Int64(2), float64(4)
         memory usage: 4.8 KB

In [44]: data['AAPL.O'].tail()  10
Out[44]:               HIGH   CLOSE     LOW    OPEN   COUNT    VOLUME
         Date
         2020-04-27  284.54  283.17  279.95  281.80  300771  29271893
         2020-04-28  285.83  278.58  278.20  285.08  285384  28001187
         2020-04-29  289.67  287.73  283.89  284.73  324890  34320204
         2020-04-30  294.53  293.80  288.35  289.96  471129  45765968
         2020-05-01  299.00  289.07  285.85  286.25  558319  60154175
1

Defines a few symbols as a list object.

2

The central line of code that retrieves the data for the symbols …

3

… for the given start date and …

4

… the given end date.

5

The time interval is here chosen to be daily.

6

All fields are requested.

7

The function get_timeseries() returns a multi-index DataFrame object.

8

The values corresponding to each level are regular DataFrame objects.

9

This provides an overview of the data stored in the DataFrame object.

10

The final five rows of data are shown.

The beauty of working with a professional data service API becomes evident when one wishes to work with multiple symbols and in particular with a different granularity of the financial data, i.e. other time intervals.

In [45]: %%time
         data = ek.get_timeseries(symbols,  1
                                  start_date='2020-08-14',  2
                                  end_date='2020-08-15',  3
                                  interval='minute',  4
                                  fields='*')
         CPU times: user 58.2 ms, sys: 3.16 ms, total: 61.4 ms
         Wall time: 2.02 s

In [46]: print(data['GOOG.O'].loc['2020-08-14 16:00:00':
                                  '2020-08-14 16:04:00'])  5
                                  HIGH       LOW      OPEN      CLOSE  COUNT  VOLUME
         Date
         2020-08-14 16:00:00  1510.7439  1509.220  1509.940  1510.5239     48    1362
         2020-08-14 16:01:00  1511.2900  1509.980  1510.500  1511.2900     52    1002
         2020-08-14 16:02:00  1513.0000  1510.964  1510.964  1512.8600     72    1762
         2020-08-14 16:03:00  1513.6499  1512.160  1512.990  1513.2300    108    4534
         2020-08-14 16:04:00  1513.6500  1511.540  1513.418  1512.7100     40    1364

In [47]: for sym in symbols:
             print('\n' + sym + '\n', data[sym].iloc[-300:-295])  6

         AAPL.O
                                   HIGH       LOW    OPEN   CLOSE  COUNT  VOLUME
         Date
         2020-08-14 19:01:00  457.1699  456.6300  457.14  456.83   1457  104693
         2020-08-14 19:02:00  456.9399  456.4255  456.81  456.45   1178   79740
         2020-08-14 19:03:00  456.8199  456.4402  456.45  456.67    908   68517
         2020-08-14 19:04:00  456.9800  456.6100  456.67  456.97    665   53649
         2020-08-14 19:05:00  457.1900  456.9300  456.98  457.00    679   49636

         MSFT.O
                                   HIGH       LOW      OPEN     CLOSE  COUNT  VOLUME
         Date
         2020-08-14 19:01:00  208.6300  208.5083  208.5500  208.5674    333   21368
         2020-08-14 19:02:00  208.5750  208.3550  208.5501  208.3600    513   37270
         2020-08-14 19:03:00  208.4923  208.3000  208.3600  208.4000    303   23903
         2020-08-14 19:04:00  208.4200  208.3301  208.3901  208.4099    222   15861
         2020-08-14 19:05:00  208.4699  208.3600  208.3920  208.4069    235    9569

         GOOG.O
                                  HIGH        LOW       OPEN      CLOSE  COUNT  VOLUME
         Date
         2020-08-14 19:01:00  1510.42  1509.3288  1509.5100  1509.8550     47    1577
         2020-08-14 19:02:00  1510.30  1508.8000  1509.7559  1508.8647     71    2950
         2020-08-14 19:03:00  1510.21  1508.7200  1508.7200  1509.8100     33     603
         2020-08-14 19:04:00  1510.21  1508.7200  1509.8800  1509.8299     41     934
         2020-08-14 19:05:00  1510.21  1508.7300  1509.5500  1509.6600     30     445
1

Data is retrieved for all symbols at once.

2

The time interval …

3

… is drastically shortened.

4

The function call retrieves minute bars for the symbols.

5

Prints five rows from the Google, Inc. data set.

6

Prints five data rows from every DataFrame object.

The code above illustrates how convenient it is to retrieve historical financial time series data from the Eikon API with Python. By default, the function get_timeseries() provides the following options for the interval parameter: tick, minute, hour, daily, weekly, monthly, quarterly and yearly. This gives all the flexibility needed in an algorithmic trading context — in particular, when combined with the resampling capabilities of pandas as shown in the code below.

In [48]: %%time
         data = ek.get_timeseries(symbols[0],
                                  start_date='2020-08-14 15:00:00',  1
                                  end_date='2020-08-14 15:30:00',  2
                                  interval='tick',  3
                                  fields=['*'])
         CPU times: user 257 ms, sys: 17.3 ms, total: 274 ms
         Wall time: 2.31 s

In [49]: data.info()  4
         <class 'pandas.core.frame.DataFrame'>
         DatetimeIndex: 47346 entries, 2020-08-14 15:00:00.019000 to 2020-08-14
          15:29:59.987000
         Data columns (total 2 columns):
          #   Column  Non-Null Count  Dtype
         ---  ------  --------------  -----
          0   VALUE   47311 non-null  float64
          1   VOLUME  47346 non-null  Int64
         dtypes: Int64(1), float64(1)
         memory usage: 1.1 MB

In [50]: data.head()  5
Out[50]:                             VALUE  VOLUME
         Date
         2020-08-14 15:00:00.019  453.2499      60
         2020-08-14 15:00:00.036  453.2294       3
         2020-08-14 15:00:00.146  453.2100       5
         2020-08-14 15:00:00.146  453.2100     100
         2020-08-14 15:00:00.236  453.2100       2

In [51]: resampled = data.resample('30s', label='right').agg(
                     {'VALUE': 'last', 'VOLUME': 'sum'}) 6

In [52]: resampled.tail()  7
Out[52]:                         VALUE  VOLUME
         Date
         2020-08-14 15:28:00  453.9000   29746
         2020-08-14 15:28:30  454.2869   86441
         2020-08-14 15:29:00  454.3900   49513
         2020-08-14 15:29:30  454.7550   98520
         2020-08-14 15:30:00  454.6200   55592
1

A time interval of …

2

… thirty minutes is chosen (due to data retrieval limits).

3

The interval parameter is set to tick.

4

Close to 50,000 price ticks are retrieved for the interval.

5

The time series data set shows highly irregular (heterogeneous) interval lengths between two ticks.

6

The tick data is resampled to a 30 second interval length (by taking the last value and the sum, respectively) …

7

… which is reflected in the DatetimeIndex of the new DataFrame object.

Retrieving Historical Unstructured Data

A major strength of working with the Eikon API via Python is the easy retrieval of unstructured data, which can then be parsed and analyzed with Python packages for natural language processing (NLP). Such a procedure is as simple and straightforward as for financial time series data. The code that follows retrieves news headlines for a fixed time interval that include Apple, Inc. as a company and "Macbook" as a word. A maximum of the five most recent hits is displayed.

In [53]: headlines = ek.get_news_headlines(query='R:AAPL.O macbook',  1
                                           count=5,  2
                                           date_from='2020-4-1',  3
                                           date_to='2020-5-1')  4

In [54]: headlines  5
Out[54]:                                           versionCreated  
         2020-04-20 21:33:37.332 2020-04-20 21:33:37.332000+00:00
         2020-04-20 10:20:23.201 2020-04-20 10:20:23.201000+00:00
         2020-04-20 02:32:27.721 2020-04-20 02:32:27.721000+00:00
         2020-04-15 12:06:58.693 2020-04-15 12:06:58.693000+00:00
         2020-04-09 21:34:08.671 2020-04-09 21:34:08.671000+00:00

                                                                               text  
         2020-04-20 21:33:37.332  Apple said to launch new AirPods, MacBook Pro ...
         2020-04-20 10:20:23.201  Apple might launch upgraded AirPods, 13-inch M...
         2020-04-20 02:32:27.721  Apple to reportedly launch new AirPods alongsi...
         2020-04-15 12:06:58.693  Apple files a patent for iPhones, MacBook indu...
         2020-04-09 21:34:08.671  Apple rolls out new software update for MacBoo...

                                                                       storyId  
         2020-04-20 21:33:37.332  urn:newsml:reuters.com:20200420:nNRAble9rq:1
         2020-04-20 10:20:23.201  urn:newsml:reuters.com:20200420:nNRAbl8eob:1
         2020-04-20 02:32:27.721  urn:newsml:reuters.com:20200420:nNRAbl4mfz:1
         2020-04-15 12:06:58.693  urn:newsml:reuters.com:20200415:nNRAbjvsix:1
         2020-04-09 21:34:08.671  urn:newsml:reuters.com:20200409:nNRAbi2nbb:1

                                 sourceCode
         2020-04-20 21:33:37.332  NS:TIMIND
         2020-04-20 10:20:23.201  NS:BUSSTA
         2020-04-20 02:32:27.721  NS:HINDUT
         2020-04-15 12:06:58.693  NS:HINDUT
         2020-04-09 21:34:08.671  NS:TIMIND

In [55]: story = headlines.iloc[0]  6

In [56]: story  7
Out[56]: versionCreated                     2020-04-20 21:33:37.332000+00:00
         text              Apple said to launch new AirPods, MacBook Pro ...
         storyId                urn:newsml:reuters.com:20200420:nNRAble9rq:1
         sourceCode                                                NS:TIMIND
         Name: 2020-04-20 21:33:37.332000, dtype: object

In [57]: news_text = ek.get_news_story(story['storyId'])  8

In [58]: from IPython.display import HTML  9

In [59]: HTML(news_text)  10
Out[59]: <IPython.core.display.HTML object>
NEW DELHI: Apple recently launched its much-awaited affordable smartphone
iPhone SE. Now it seems that the company is gearing up for another launch.
Apple is said to launch the next generation of AirPods and the all-new
13-inch MacBook Pro next month.

In February an online report revealed that the Cupertino-based tech giant
is working on AirPods Pro Lite. Now a tweet by tipster Job Posser has
revealed that Apple will soon come up with new AirPods and MacBook Pro.
Jon Posser tweeted, "New AirPods (which were supposed to be at the
March Event) is now ready to go.

Probably alongside the MacBook Pro next month." However, not many details
about the upcoming products are available right now. The company was
supposed to launch these products at the March event along with the iPhone SE.

But due to the ongoing pandemic coronavirus, the event got cancelled.
It is expected that Apple will launch the AirPods Pro Lite and the 13-inch
MacBook Pro just like the way it launched the iPhone SE. Meanwhile,
Apple has scheduled its annual developer conference WWDC to take place in June.

This year the company has decided to hold an online-only event due to
the outbreak of coronavirus. Reports suggest that this year the company
is planning to launch the all-new AirTags and a premium pair of over-ear
Bluetooth headphones at the event. Using the Apple AirTags users will
be able to locate real-world items such as keys or suitcase in the Find My app.

The AirTags will also have offline finding capabilities that the company
introduced in the core of iOS 13.Apart from this, Apple is also said to
unveil its high-end Bluetooth headphones. It is expected that the Bluetooth
headphones will offer better sound quality and battery backup as compared
to the AirPods.

For Reprint Rights: timescontent.com

Copyright (c) 2020 BENNETT,COLEMAN & CO.LTD.
1

The query parameter for the retrieval operation.

2

Sets the maximum number of hits to five.

3

Defines the interval …

4

… for which to look for news headlines.

5

Gives out the results object (output shortened).

6

One particular headline is picked …

7

… and shown, including its storyId.

8

This retrieves the news text as html code.

9

In Jupyter Notebook, for example, the html code …

10

… can be rendered for better reading.
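As a brief illustration of the NLP angle mentioned at the beginning of this sub-section, the following sketch derives a simple word frequency count from the retrieved story; it assumes the news_text variable from above, and serious applications would rely on dedicated packages such as nltk or spaCy:

import re
from collections import Counter

plain_text = re.sub('<[^>]+>', ' ', news_text)  # strips the HTML tags
words = re.findall('[a-z]+', plain_text.lower())  # simple lowercase tokenization
print(Counter(words).most_common(5))  # the five most frequent words in the story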

This concludes the illustration of the Python wrapper package for the Refinitiv Eikon data API.

Storing Financial Data Efficiently

In algorithmic trading, one of the most important scenarios for the management of data sets is "retrieve once, use multiple times". Or, from an input-output (IO) perspective, it is "write once, read multiple times". In the first case, data might be retrieved from a web service and then used multiple times to backtest a strategy, based on a temporary, in-memory copy of the data set. In the second case, tick data that is received continually is written to disk and later used multiple times for certain manipulations (like aggregations) in combination with backtesting procedures.

This section assumes that the in-memory data structure to store the data is a pandas DataFrame object, no matter from which source the data is acquired (from a CSV file, a web service, etc.).

To have a somewhat meaningful data set available in terms of size, the section uses a sample financial data set generated by the use of pseudo-random numbers. “Python Scripts” presents the Python module with a function called generate_sample_data() that accomplishes the task.

In principle, this function generates a sample financial data set in tabular form of arbitrary size (available memory, of course, sets a limit).
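For orientation, the following is a minimal sketch of what such a generator function might look like; the actual module presented in “Python Scripts” may differ in its details:

import numpy as np
import pandas as pd

def generate_sample_data(rows, cols, freq='1min'):
    ''' Generates sample financial time series data (random walks starting at 100). '''
    rows = int(rows)
    cols = int(cols)
    index = pd.date_range('2021-1-1', periods=rows, freq=freq)  # regular DatetimeIndex
    steps = np.random.standard_normal((rows, cols))  # random increments
    steps[0] = 0.0  # the first observation equals the starting value
    values = 100 + 0.05 * steps.cumsum(axis=0)  # cumulated random-walk paths
    return pd.DataFrame(values, index=index,
                        columns=['No%d' % i for i in range(cols)])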

In [60]: from sample_data import generate_sample_data  1

In [61]: print(generate_sample_data(rows=5, cols=4))  2
                                     No0         No1         No2         No3
         2021-01-01 00:00:00  100.000000  100.000000  100.000000  100.000000
         2021-01-01 00:01:00  100.019641   99.950661  100.052993   99.913841
         2021-01-01 00:02:00   99.998164   99.796667  100.109971   99.955398
         2021-01-01 00:03:00  100.051537   99.660550  100.136336  100.024150
         2021-01-01 00:04:00   99.984614   99.729158  100.210888   99.976584
1

Imports the function from the Python script.

2

Prints a sample financial data set with five rows and four columns.

Storing DataFrame Objects

The storage of a pandas DataFrame object as a whole is made simple by the pandas HDFStore wrapper functionality for the HDF5 binary storage standard. It allows one to dump complete DataFrame objects in a single step to a file-based database object. To illustrate the implementation, the first step is to create a sample data set of meaningful size — here, the generated DataFrame object has a size of about 420 MB in memory.

In [62]: %time data = generate_sample_data(rows=5e6, cols=10).round(4)  1
         CPU times: user 3.88 s, sys: 830 ms, total: 4.71 s
         Wall time: 4.72 s

In [63]: data.info()
         <class 'pandas.core.frame.DataFrame'>
         DatetimeIndex: 5000000 entries, 2021-01-01 00:00:00 to 2030-07-05
          05:19:00
         Freq: T
         Data columns (total 10 columns):
          #   Column  Dtype
         ---  ------  -----
          0   No0     float64
          1   No1     float64
          2   No2     float64
          3   No3     float64
          4   No4     float64
          5   No5     float64
          6   No6     float64
          7   No7     float64
          8   No8     float64
          9   No9     float64
         dtypes: float64(10)
         memory usage: 419.6 MB
1

A sample financial data set with 5,000,000 rows and ten columns is generated; the generation takes a couple of seconds.

The second step is to open an HDFStore object (i.e. an HDF5 database file) on disk and to write the DataFrame object to it.2 The size on disk of about 440 MB is a bit larger than for the in-memory DataFrame object. However, the writing is about five times faster than the in-memory generation of the sample data set. Working in Python with binary stores like HDF5 database files usually gets you writing speeds close to the theoretical maximum of the hardware available.3

In [64]: h5 = pd.HDFStore('data/data.h5', 'w')  1

In [65]: %time h5['data'] = data  2
         CPU times: user 356 ms, sys: 472 ms, total: 828 ms
         Wall time: 1.08 s

In [66]: h5  3
Out[66]: <class 'pandas.io.pytables.HDFStore'>
         File path: data/data.h5

In [67]: ls -n data/data.*
         -rw-r--r--@ 1 501  20  440007240 Aug 25 11:48 data/data.h5

In [68]: h5.close()  4
1

This opens the database file on disk for writing (and overwrites a potentially existing file with the same name).

2

Writing the DataFrame object to disk takes less than a second.

3

Prints out meta information for the database file.

4

Closes the database file.

The third step is to read the data from the file-based HDFStore object. Reading also generally takes place close to the theoretical maximum speed.

In [69]: h5 = pd.HDFStore('data/data.h5', 'r')  1

In [70]: %time data_copy = h5['data']  2
         CPU times: user 388 ms, sys: 425 ms, total: 813 ms
         Wall time: 812 ms

In [71]: data_copy.info()
         <class 'pandas.core.frame.DataFrame'>
         DatetimeIndex: 5000000 entries, 2021-01-01 00:00:00 to 2030-07-05
          05:19:00
         Freq: T
         Data columns (total 10 columns):
          #   Column  Dtype
         ---  ------  -----
          0   No0     float64
          1   No1     float64
          2   No2     float64
          3   No3     float64
          4   No4     float64
          5   No5     float64
          6   No6     float64
          7   No7     float64
          8   No8     float64
          9   No9     float64
         dtypes: float64(10)
         memory usage: 419.6 MB

In [72]: h5.close()

In [73]: rm data/data.h5
1

Opens the database file for reading.

2

Reading takes less than a second.

There is another, somewhat more flexible way of writing the data from a DataFrame object to an HDFStore object. To this end, one can use the to_hdf() method of the DataFrame object and set the format parameter to table (see the to_hdf() API reference page). This allows the appending of new data to the table object on disk and also, for example, searching over the data on disk, which is not possible with the first approach. The price to pay is slower writing and reading speeds.

In [74]: %time data.to_hdf('data/data.h5', 'data', format='table')  1
         CPU times: user 3.25 s, sys: 491 ms, total: 3.74 s
         Wall time: 3.8 s

In [75]: ls -n data/data.*
         -rw-r--r--@ 1 501  20  446911563 Aug 25 11:48 data/data.h5

In [76]: %time data_copy = pd.read_hdf('data/data.h5', 'data')  2
         CPU times: user 236 ms, sys: 266 ms, total: 502 ms
         Wall time: 503 ms

In [77]: data_copy.info()
         <class 'pandas.core.frame.DataFrame'>
         DatetimeIndex: 5000000 entries, 2021-01-01 00:00:00 to 2030-07-05
          05:19:00
         Freq: T
         Data columns (total 10 columns):
          #   Column  Dtype
         ---  ------  -----
          0   No0     float64
          1   No1     float64
          2   No2     float64
          3   No3     float64
          4   No4     float64
          5   No5     float64
          6   No6     float64
          7   No7     float64
          8   No8     float64
          9   No9     float64
         dtypes: float64(10)
         memory usage: 419.6 MB
1

This defines the writing format to be of type table. Writing becomes slower since this format type involves a bit more overhead and leads to a somewhat increased file size.

2

Reading is in general also slower with this format, although it happens to be faster in this particular run.
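To illustrate the on-disk search capability mentioned above, the following sketch selects only a sub-set of the rows via a where condition; it assumes that the data.h5 file written with format='table' is still on disk, and the date used is simply chosen to lie within the generated sample period:

# on-disk query via the index (requires the table format)
subset = pd.read_hdf('data/data.h5', 'data',
                     where='index > "2029-01-01"')  # only matching rows are loaded
subset.info()  # the resulting DataFrame object is much smaller than the full data set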

In practice, the advantage of this approach is that one can work with the table object on disk like with any other Table object of the PyTables package, which is used by pandas in this context. This provides access to certain basic capabilities of the PyTables package — like, for instance, appending rows to a table object, as the sketch following this paragraph indicates.
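The following is a hedged sketch of appending a single row via PyTables; the values appended are purely illustrative, and the session that comes afterwards re-opens the same file read-only for inspection:

import numpy as np
import tables as tb  # PyTables

h5 = tb.open_file('data/data.h5', 'a')  # opens the database file for appending
tab = h5.root.data.table  # the table object created by pandas
last = tab[-1]  # the last row currently stored
new_index = last['index'] + 60 * 10 ** 9  # one minute later (nanoseconds since epoch)
new_values = np.full(10, 100.0)  # ten illustrative column values
tab.append([(new_index, new_values)])  # appends the new row to the table
tab.flush()  # writes the changes to disk
h5.close()  # closes the database file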

In [78]: import tables as tb  1

In [79]: h5 = tb.open_file('data/data.h5', 'r')  2

In [80]: h5  3
Out[80]: File(filename=data/data.h5, title='', mode='r', root_uep='/',
          filters=Filters(complevel=0, shuffle=False, bitshuffle=False,
          fletcher32=False, least_significant_digit=None))
         / (RootGroup) ''
         /data (Group) ''
         /data/table (Table(5000000,)) ''
           description := {
           "index": Int64Col(shape=(), dflt=0, pos=0),
           "values_block_0": Float64Col(shape=(10,), dflt=0.0, pos=1)}
           byteorder := 'little'
           chunkshape := (2978,)
           autoindex := True
           colindexes := {
             "index": Index(6, medium, shuffle, zlib(1)).is_csi=False}

In [81]: h5.root.data.table[:3]  4
Out[81]: array([(1609459200000000000, [100.    , 100.    , 100.    , 100.    ,
          100.    , 100.    , 100.    , 100.    , 100.    , 100.    ]),
         (1609459260000000000, [100.0752, 100.1164, 100.0224, 100.0073,
          100.1142, 100.0474,  99.9329, 100.0254, 100.1009, 100.066 ]),
         (1609459320000000000, [100.1593, 100.1721, 100.0519, 100.0933,
          100.1578, 100.0301,  99.92  , 100.0965, 100.1441, 100.0717])],
               dtype=[('index', '<i8'), ('values_block_0', '<f8', (10,))])

In [82]: h5.close()  5

In [83]: rm data/data.h5
1

Imports the PyTables package.

2

Opens the database file for reading.

3

Shows the contents of the database file.

4

Prints the first three rows in the table.

5

Closes the database.

Although this second approach provides more flexibility, it does not open the doors to the full capabilities of the PyTables package. Nevertheless, the two approaches introduced in this sub-section are convenient and efficient when you are working with more or less immutable data sets that fit into memory. Nowadays, however, algorithmic trading generally has to deal with continuously and rapidly growing data sets, like tick data for stock prices or foreign exchange rates. To cope with the requirements of such a scenario, alternative approaches might prove useful.

Tip

Using the HDFStore wrapper for the HDF5 binary storage standard, pandas is able to write and read financial data almost at the maximum speed the available hardware allows. Exports to other file-based formats, like CSV, are generally much slower alternatives.
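A quick way to check this claim on your own hardware is to time both exports for the same DataFrame object; the following sketch is illustrative only, timings vary with the machine and the file names are assumptions:

import time

t0 = time.time()
data.to_csv('data/data.csv')  # text-based export
print('CSV write:  {:.2f}s'.format(time.time() - t0))

t0 = time.time()
data.to_hdf('data/data_demo.h5', 'data')  # binary export via HDFStore (fixed format)
print('HDF5 write: {:.2f}s'.format(time.time() - t0))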

Using TsTables

The PyTables package — with import name tables — is a wrapper for the HDF5 binary storage library that is also used by pandas for its HDFStore implementation presented in the previous sub-section. The TsTables package (see Github page of the package) in turn is dedicated to the efficient handling of large financial time series data sets based on the HDF5 binary storage library. It is effectively an enhancement of the PyTables package and adds support for time series data to its capabilities. It implements a hierarchical storage approach that allows for a fast retrieval of data sub-sets selected by providing start and end dates and times, respectively. The major scenario supported by TsTables is “write once, retrieve multiple times”.

The setup illustrated in this sub-section is that data is continuously collected from a web source, professional data provider, etc., and stored temporarily in memory in a DataFrame object. After a while, or after a certain number of data points has been retrieved, the collected data is then stored in a TsTables table object in an HDF5 database. First, the generation of the sample data.

In [84]: %%time
         data = generate_sample_data(rows=2.5e6, cols=5,
                                     freq='1s').round(4)  1
         CPU times: user 915 ms, sys: 191 ms, total: 1.11 s
         Wall time: 1.14 s

In [85]: data.info()
         <class 'pandas.core.frame.DataFrame'>
         DatetimeIndex: 2500000 entries, 2021-01-01 00:00:00 to 2021-01-29
          22:26:39
         Freq: S
         Data columns (total 5 columns):
          #   Column  Dtype
         ---  ------  -----
          0   No0     float64
          1   No1     float64
          2   No2     float64
          3   No3     float64
          4   No4     float64
         dtypes: float64(5)
         memory usage: 114.4 MB
1

This generates a sample financial data set with 2,500,000 rows and five columns at a one-second frequency; the sample data is rounded to four digits.

Second, some more imports and the creation of the TsTables table object. The major part is the definition of the desc class which provides the description for the table object’s data structure.

Caution

Currently, TsTables only works with the old pandas version 0.19. A friendly fork that works with newer versions of pandas is available at http://github.com/yhilpisch/tstables and can be installed via

pip install git+https://github.com/yhilpisch/tstables.git
In [86]: import tstables  1

In [87]: import tables as tb  2

In [88]: class desc(tb.IsDescription):
             ''' Description of TsTables table structure.
             '''
             timestamp = tb.Int64Col(pos=0)  3
             No0 = tb.Float64Col(pos=1)  4
             No1 = tb.Float64Col(pos=2)
             No2 = tb.Float64Col(pos=3)
             No3 = tb.Float64Col(pos=4)
             No4 = tb.Float64Col(pos=5)

In [89]: h5 = tb.open_file('data/data.h5ts', 'w')  5

In [90]: ts = h5.create_ts('/', 'data', desc)  6

In [91]: h5  7
Out[91]: File(filename=data/data.h5ts, title='', mode='w', root_uep='/',
          filters=Filters(complevel=0, shuffle=False, bitshuffle=False,
          fletcher32=False, least_significant_digit=None))
         / (RootGroup) ''
         /data (Group/Timeseries) ''
         /data/y2020 (Group) ''
         /data/y2020/m08 (Group) ''
         /data/y2020/m08/d25 (Group) ''
         /data/y2020/m08/d25/ts_data (Table(0,)) ''
           description := {
           "timestamp": Int64Col(shape=(), dflt=0, pos=0),
           "No0": Float64Col(shape=(), dflt=0.0, pos=1),
           "No1": Float64Col(shape=(), dflt=0.0, pos=2),
           "No2": Float64Col(shape=(), dflt=0.0, pos=3),
           "No3": Float64Col(shape=(), dflt=0.0, pos=4),
           "No4": Float64Col(shape=(), dflt=0.0, pos=5)}
           byteorder := 'little'
           chunkshape := (1365,)
1

Imports TsTables (install it from https://github.com/yhilpisch/tstables).

2

PyTables is imported.

3

The first column of the table is a timestamp represented as an int value.

4

All data columns contain float values.

5

This opens a new database file for writing.

6

The TsTables table is created at the root node, with name data and given the class-based description desc.

7

Inspecting the database file reveals the basic principle behind the hierarchical structuring in years, months and days.

Third, the writing of the sample data stored in a DataFrame object to the table object on disk. One of the major benefits of TsTables is the convenience with which this operation is accomplished, namely by a simple method call. Even better, that convenience is coupled with speed. With regard to the structure in the database, TsTables chunks the data into sub-sets of a single day. In the example case, where the frequency is set to one second, this translates into 24 x 60 x 60 = 86,400 data rows per full day's worth of data.

In [92]: %time ts.append(data)  1
         CPU times: user 476 ms, sys: 238 ms, total: 714 ms
         Wall time: 739 ms

In [93]: # h5  2
File(filename=data/data.h5ts, title='', mode='w', root_uep='/',
	filters=Filters(complevel=0, shuffle=False, bitshuffle=False,
	fletcher32=False, least_significant_digit=None))
/ (RootGroup) ''
/data (Group/Timeseries) ''
/data/y2020 (Group) ''
/data/y2021 (Group) ''
/data/y2021/m01 (Group) ''
/data/y2021/m01/d01 (Group) ''
/data/y2021/m01/d01/ts_data (Table(86400,)) ''
  description := {
  "timestamp": Int64Col(shape=(), dflt=0, pos=0),
  "No0": Float64Col(shape=(), dflt=0.0, pos=1),
  "No1": Float64Col(shape=(), dflt=0.0, pos=2),
  "No2": Float64Col(shape=(), dflt=0.0, pos=3),
  "No3": Float64Col(shape=(), dflt=0.0, pos=4),
  "No4": Float64Col(shape=(), dflt=0.0, pos=5)}
  byteorder := 'little'
  chunkshape := (1365,)
/data/y2021/m01/d02 (Group) ''
/data/y2021/m01/d02/ts_data (Table(86400,)) ''
  description := {
  "timestamp": Int64Col(shape=(), dflt=0, pos=0),
  "No0": Float64Col(shape=(), dflt=0.0, pos=1),
  "No1": Float64Col(shape=(), dflt=0.0, pos=2),
  "No2": Float64Col(shape=(), dflt=0.0, pos=3),
  "No3": Float64Col(shape=(), dflt=0.0, pos=4),
  "No4": Float64Col(shape=(), dflt=0.0, pos=5)}
  byteorder := 'little'
  chunkshape := (1365,)
/data/y2021/m01/d03 (Group) ''
/data/y2021/m01/d03/ts_data (Table(86400,)) ''
  description := {
  "timestamp": Int64Col(shape=(), dflt=0, pos=0),
	...
1

This appends the DataFrame object via a simple method call.

2

The table object shows 86,400 rows per day after the append() operation.

Reading sub-sets of the data from a TsTables table object is generally really fast since this is what it is optimized for in the first place. In this regard, TsTables supports typical algorithmic trading applications, like backtesting, pretty well. Another contributing factor is that TsTables returns the data already as a DataFrame object such that additional conversions are not necessary in general.

In [94]: import datetime

In [95]: start = datetime.datetime(2021, 1, 2)  1

In [96]: end = datetime.datetime(2021, 1, 3)  2

In [97]: %time subset = ts.read_range(start, end)  3
         CPU times: user 10.3 ms, sys: 3.63 ms, total: 14 ms
         Wall time: 12.8 ms

In [98]: start = datetime.datetime(2021, 1, 2, 12, 30, 0)

In [99]: end = datetime.datetime(2021, 1, 5, 17, 15, 30)

In [100]: %time subset = ts.read_range(start, end)
          CPU times: user 28.6 ms, sys: 18.5 ms, total: 47.1 ms
          Wall time: 46.1 ms

In [101]: subset.info()
          <class 'pandas.core.frame.DataFrame'>
          DatetimeIndex: 276331 entries, 2021-01-02 12:30:00 to 2021-01-05
           17:15:30
          Data columns (total 5 columns):
           #   Column  Non-Null Count   Dtype
          ---  ------  --------------   -----
           0   No0     276331 non-null  float64
           1   No1     276331 non-null  float64
           2   No2     276331 non-null  float64
           3   No3     276331 non-null  float64
           4   No4     276331 non-null  float64
          dtypes: float64(5)
          memory usage: 12.6 MB

In [102]: h5.close()

In [103]: rm data/*
1

This defines the starting date and …

2

… end date for the data retrieval operation.

3

The read_range() method takes the start and end dates as input — reading here is only a matter of milliseconds.

New data that is retrieved during a day can be appended to the TsTables table object as illustrated before. The package is therefore a valuable addition to the capabilities of pandas in combination with HDFStore objects when it comes to the efficient storage and retrieval of (large) financial time series data sets over time.
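
As an illustration only, the following minimal sketch shows such an intraday update. It assumes that the ts table object from above is still open (that is, before the h5.close() call) and uses a hypothetical new_rows DataFrame standing in for freshly retrieved one-second bars whose timestamps lie after the last timestamp already stored in the table.

import datetime
import numpy as np
import pandas as pd

# hypothetical batch of freshly retrieved one-second bars;
# the timestamps are assumed to come after the data already stored
index = pd.date_range('2021-06-01 09:00:00', periods=3600, freq='1s')
new_rows = pd.DataFrame(100 + np.random.standard_normal((len(index), 5)),
                        index=index, columns=['No%d' % i for i in range(5)])

ts.append(new_rows)  # appends the new rows to the day-based structure on disk

# the appended data is immediately available for fast range reads
start = datetime.datetime(2021, 6, 1, 9, 0, 0)
end = datetime.datetime(2021, 6, 1, 10, 0, 0)
subset = ts.read_range(start, end)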

Storing Data with SQLite3

Financial time series data can also be written directly from a DataFrame object to a relational database like SQLite3. A relational database might be the right choice in scenarios where the SQL query language is applied to implement more sophisticated analyses. With regard to speed and disk usage, however, relational databases cannot compare with the other approaches that rely on binary storage formats like HDF5.

The DataFrame class provides the method to_sql() (see the to_sql() API reference page) to write data to a table in a relational database. The resulting size on disk of more than 100 MB indicates that there is quite some overhead when using relational databases.

In [104]: %time data = generate_sample_data(1e6, 5, '1min').round(4)  1
          CPU times: user 342 ms, sys: 60.5 ms, total: 402 ms
          Wall time: 405 ms

In [105]: data.info()  1
          <class 'pandas.core.frame.DataFrame'>
          DatetimeIndex: 1000000 entries, 2021-01-01 00:00:00 to 2022-11-26
           10:39:00
          Freq: T
          Data columns (total 5 columns):
           #   Column  Non-Null Count    Dtype
          ---  ------  --------------    -----
           0   No0     1000000 non-null  float64
           1   No1     1000000 non-null  float64
           2   No2     1000000 non-null  float64
           3   No3     1000000 non-null  float64
           4   No4     1000000 non-null  float64
          dtypes: float64(5)
          memory usage: 45.8 MB

In [106]: import sqlite3 as sq3  2

In [107]: con = sq3.connect('data/data.sql')  3

In [108]: %time data.to_sql('data', con)  4
          CPU times: user 4.6 s, sys: 352 ms, total: 4.95 s
          Wall time: 5.07 s

In [109]: ls -n data/data.*
          -rw-r--r--@ 1 501  20  105316352 Aug 25 11:48 data/data.sql
1

The sample financial data set has 1,000,000 rows and five columns; memory usage is about 46 MB.

2

This imports the SQLite3 module.

3

A connection is opened to a new database file.

4

Writing the data to the relational database takes a couple of seconds.

One strength of relational databases is the ability to implement (out-of-memory) analytics tasks based on standardized SQL statements. As an example, consider a query that selects all those rows for which the value in column No1 is greater than 105 and the value in column No2 is smaller than 108.

In [110]: query = 'SELECT * FROM data WHERE No1 > 105 and No2 < 108'  1

In [111]: %time res = con.execute(query).fetchall()  2
          CPU times: user 109 ms, sys: 30.3 ms, total: 139 ms
          Wall time: 138 ms

In [112]: res[:5]  3
Out[112]: [('2021-01-03 19:19:00', 103.6894, 105.0117, 103.9025, 95.8619,
           93.6062),
          ('2021-01-03 19:20:00', 103.6724, 105.0654, 103.9277, 95.8915,
           93.5673),
          ('2021-01-03 19:21:00', 103.6213, 105.1132, 103.8598, 95.7606,
           93.5618),
          ('2021-01-03 19:22:00', 103.6724, 105.1896, 103.8704, 95.7302,
           93.4139),
          ('2021-01-03 19:23:00', 103.8115, 105.1152, 103.8342, 95.706,
           93.4436)]

In [113]: len(res)  4
Out[113]: 5035

In [114]: con.close()

In [115]: rm data/*
1

The SQL query as a Python str object.

2

The query is executed, retrieving all result rows.

3

The first five results printed.

4

The length of the results list object.

Admittedly, such simple queries are possible with pandas as well if the data set fits into memory. However, the SQL query language has proven useful and powerful for decades and should be part of the algorithmic trader's arsenal of data weapons.
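
For comparison, here is a minimal sketch of the same selection with pandas, assuming that the data DataFrame object is still in memory and that the database connection con is still open (that is, before the con.close() call above).

import pandas as pd

# in-memory equivalent of the SQL query via boolean indexing
res_pd = data[(data['No1'] > 105) & (data['No2'] < 108)]

# alternatively, let SQLite execute the query and read the result
# back into a DataFrame in a single step
res_sql = pd.read_sql('SELECT * FROM data WHERE No1 > 105 and No2 < 108', con)

Which variant is preferable mainly depends on whether the filtering should happen in memory or in the database.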

Note

pandas also supports database connections via SQLAlchemy, a Python abstraction layer package for diverse relational databases (refer to the SQLAlchemy home page). This in turn allows for the use of, for example, MySQL as the relational database backend.
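
As an illustration only, here is a minimal sketch based on SQLAlchemy with a file-based SQLite backend; the connection string is a hypothetical example, and pointing it at, say, a mysql+pymysql:// URL would target a MySQL server instead.

from sqlalchemy import create_engine
import pandas as pd

# SQLAlchemy engine for a file-based SQLite database (hypothetical path)
engine = create_engine('sqlite:///data/data.sqlite')

# write the sample data via the engine ...
data.to_sql('data', engine, if_exists='replace')

# ... and read a filtered subset back into a DataFrame
df = pd.read_sql('SELECT * FROM data WHERE No1 > 105', engine)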

Conclusions

This chapter covers the handling of financial time series data. It illustrates the reading of such data from different file-based sources, like CSV files. It also shows how to retrieve financial data from web services, such as the one provided by Quandl, for end-of-day and options data. Open financial data sources are a valuable addition to the financial landscape. Quandl is a platform integrating thousands of open data sets under the umbrella of a unified API.

Another important topic covered in this chapter is the efficient storage of complete DataFrame objects on disk as well as the writing of the data contained in such an in-memory object to databases. Database flavors used in this chapter include the HDF5 database standard as well as the lightweight relational database SQLite3. This chapter lays the foundation for Chapter 4, which addresses vectorized backtesting, Chapter 5, which covers machine learning and deep learning for market prediction, and Chapter 6, which discusses event-based backtesting of trading strategies.

Further Resources

You can find more information about Quandl at:

  • https://www.quandl.com

Information about the Python package used to retrieve data from that source is found here:

  • https://github.com/quandl/quandl-python

You should consult the official documentation pages for more information on the packages used in this chapter:

  • pandas home page: https://pandas.pydata.org

  • PyTables home page: https://www.pytables.org

  • TsTables on GitHub: https://github.com/afiedler/tstables

  • SQLite home page: https://www.sqlite.org

Books cited in this chapter:

  • Hilpisch, Yves (2018): Python for Finance — Mastering Data-Driven Finance. 2nd ed., O’Reilly, Beijing et al.

  • McKinney, Wes (2017): Python for Data Analysis — Data Wrangling with Pandas, NumPy, and IPython. 2nd ed., O’Reilly, Beijing et al.

Python Scripts

The following Python script generates sample financial time series data based on a Monte Carlo simulation for a geometric Brownian motion (see Hilpisch (2018, ch. 12)).

#
# Python Module to Generate a
# Sample Financial Data Set
#
# Python for Algorithmic Trading
# (c) Dr. Yves J. Hilpisch
# The Python Quants GmbH
#
import numpy as np
import pandas as pd

r = 0.05  # constant short rate
sigma = 0.5  # volatility factor


def generate_sample_data(rows, cols, freq='1min'):
    '''
    Function to generate sample financial data.

    Parameters
    ==========
    rows: int
        number of rows to generate
    cols: int
        number of columns to generate
    freq: str
        frequency string for DatetimeIndex

    Returns
    =======
    df: DataFrame
        DataFrame object with the sample data
    '''
    rows = int(rows)
    cols = int(cols)
    # generate a DatetimeIndex object given the frequency
    index = pd.date_range('2021-1-1', periods=rows, freq=freq)
    # determine time delta in year fractions
    dt = (index[1] - index[0]) / pd.Timedelta(value='365D')
    # generate column names
    columns = ['No%d' % i for i in range(cols)]
    # generate sample paths for geometric Brownian motion
    raw = np.exp(np.cumsum((r - 0.5 * sigma ** 2) * dt +
                 sigma * np.sqrt(dt) *
                 np.random.standard_normal((rows, cols)), axis=0))
    # normalize the data to start at 100
    raw = raw / raw[0] * 100
    # generate the DataFrame object
    df = pd.DataFrame(raw, index=index, columns=columns)
    return df


if __name__ == '__main__':
    rows = 5  # number of rows
    columns = 3  # number of columns
    freq = 'D'  # daily frequency
    print(generate_sample_data(rows, columns, freq))

1 Source: “Bad Election Day Forecasts Deal Blow to Data Science — Prediction models suffered from narrow data, faulty algorithms and human foibles.” Wall Street Journal, November 9, 2016.

2 Of course, multiple DataFrame objects could also be stored in a single HDFStore object.

3 All values reported here are from the author’s Mac Mini with an Intel i7 six-core processor (12 threads), 32 GB of random access memory (DDR4 RAM), and a 512 GB solid state drive (SSD).
