Clearly, data beats algorithms. Without comprehensive data, you tend to get non-comprehensive predictions.

Rob Thomas
In algorithmic trading, one generally has to deal with four types of data, as illustrated in Table 3-1. Although it simplifies the real data world, distinguishing data along the pairs historical vs. real-time and structured vs. unstructured often proves useful in technical settings.
Table 3-1. Types of financial data

             | structured                 | unstructured
-------------+----------------------------+-------------------------
 historical  | end-of-day closing prices  | financial news articles
 real-time   | bid/ask prices for FX      | posts on Twitter
This book is mainly concerned with structured data (numerical, tabular data) of both historical and real-time types. This chapter in particular focuses on historical, structured data, like end-of-day closing values for the SAP SE stock traded at the Frankfurt Stock Exchange. However, this category also subsumes intraday data, such as 1-minute-bar data for the Apple, Inc. stock traded at the NASDAQ stock exchange. The processing of real-time, structured data is covered in [Link to Come].
An algorithmic trading project typically starts with a trading idea or hypothesis which needs to be (back-)tested based on historical financial data. This is the context for this chapter, the plan for which is as follows. "Reading Financial Data From Different Sources" uses pandas to read data from different file- and web-based sources. "Working with Open Data Sources" introduces Quandl as a popular open data source platform. "Eikon Data API" introduces the Python wrapper for the Refinitiv Eikon Data API. Finally, "Storing Financial Data Efficiently" briefly shows how to store historical, structured data efficiently with pandas based on the HDF5 binary storage format.
The goal for this chapter is to have financial data available in a format with which the backtesting of trading ideas and hypotheses can be implemented effectively. The three major themes are the importing, the handling, and the storing of data. This and subsequent chapters assume a Python 3.8 installation with Python packages installed as explained in detail in Chapter 2. For the time being, it is not yet relevant on which infrastructure exactly this Python environment is provided. For more details on efficient input-output operations with Python, see Hilpisch (2018, ch. 9).
This section makes heavy use of the capabilities of pandas, the popular data analysis package for Python (see the pandas home page). pandas comprehensively supports the three main tasks this chapter is concerned with: reading data, handling data, and storing data. One of its strengths is the reading of data from different types of sources, as the remainder of this section illustrates.
In this section, we work with a fairly small data set for the Apple, Inc. stock price (with symbol AAPL and Reuters Instrument Code, or RIC, AAPL.O) as retrieved from the Eikon Data API for April 2020.
Having stored such historical financial data in a CSV file on disk, pure Python can be used to read and print its content.
In [1]: fn = '../data/AAPL.csv'

In [2]: with open(fn, 'r') as f:
            for _ in range(5):
                print(f.readline(), end='')
Date,HIGH,CLOSE,LOW,OPEN,COUNT,VOLUME
2020-04-01,248.72,240.91,239.13,246.5,460606.0,44054638.0
2020-04-02,245.15,244.93,236.9,240.34,380294.0,41483493.0
2020-04-03,245.7,241.41,238.9741,242.8,293699.0,32470017.0
2020-04-06,263.11,262.47,249.38,250.9,486681.0,50455071.0

Opens the file on disk (adjust path and filename if necessary).
Sets up a for loop with five iterations.
Prints the first five lines in the opened CSV file.
This approach allows for simple inspection of the data. One learns that there is a header line and that the single data points per row represent Date, HIGH, CLOSE, LOW, OPEN, COUNT, and VOLUME, respectively. However, the data is not yet available in memory for further usage with Python.
To work with data stored as a CSV file, the file needs to be parsed and the data needs to be stored in a Python data structure. Python has a built-in module called csv which supports the reading of data from a CSV file. The first approach yields a list object containing other list objects with the data from the file.
In [3]: import csv

In [4]: csv_reader = csv.reader(open(fn, 'r'))

In [5]: data = list(csv_reader)

In [6]: data[:5]
Out[6]: [['Date', 'HIGH', 'CLOSE', 'LOW', 'OPEN', 'COUNT', 'VOLUME'],
         ['2020-04-01', '248.72', '240.91', '239.13', '246.5',
          '460606.0', '44054638.0'],
         ['2020-04-02', '245.15', '244.93', '236.9', '240.34',
          '380294.0', '41483493.0'],
         ['2020-04-03', '245.7', '241.41', '238.9741', '242.8',
          '293699.0', '32470017.0'],
         ['2020-04-06', '263.11', '262.47', '249.38', '250.9',
          '486681.0', '50455071.0']]
Imports the csv module.
Instantiates a csv.reader iterator object.
list() consumes the iterator, adding every single line from the CSV file as a list object to the resulting list object.
Prints out the first five elements of the list object.
Working with such a nested list object, e.g., for the calculation of the average closing price, is possible in principle but not really efficient or intuitive. Using a csv.DictReader iterator object instead of the standard csv.reader object makes such tasks a bit more manageable. Every row of data in the CSV file (apart from the header row) is then imported as a dict object so that single values can be accessed via the respective key.
In [7]: csv_reader = csv.DictReader(open(fn, 'r'))

In [8]: data = list(csv_reader)

In [9]: data[:3]
Out[9]: [{'Date': '2020-04-01',
          'HIGH': '248.72',
          'CLOSE': '240.91',
          'LOW': '239.13',
          'OPEN': '246.5',
          'COUNT': '460606.0',
          'VOLUME': '44054638.0'},
         {'Date': '2020-04-02',
          'HIGH': '245.15',
          'CLOSE': '244.93',
          'LOW': '236.9',
          'OPEN': '240.34',
          'COUNT': '380294.0',
          'VOLUME': '41483493.0'},
         {'Date': '2020-04-03',
          'HIGH': '245.7',
          'CLOSE': '241.41',
          'LOW': '238.9741',
          'OPEN': '242.8',
          'COUNT': '293699.0',
          'VOLUME': '32470017.0'}]
Here, the csv.DictReader iterator object is instantiated; it reads every data row into a dict object, given the information in the header row.
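The key-based access can be illustrated with a self-contained sketch; the inline CSV snippet below merely stands in for the file on disk (only a subset of the columns and a single, illustrative data row are used):

```python
import csv
import io

# Inline CSV snippet standing in for the AAPL.csv file used above.
raw = io.StringIO('Date,HIGH,CLOSE\n2020-04-01,248.72,240.91\n')

rows = list(csv.DictReader(raw))

# Every row is a dict keyed by the header names; note that the
# values are still str objects, not floats.
print(rows[0]['CLOSE'])
```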
Based on the single dict objects, aggregations are now somewhat easier to accomplish. However, one still cannot speak of a convenient way of calculating the mean of the Apple closing stock price when inspecting the respective Python code.
In [10]: sum([float(l['CLOSE']) for l in data]) / len(data)
Out[10]: 272.38619047619045
First, a list object is generated via a list comprehension with all closing values; second, the sum is taken over all these values; third, the resulting sum is divided by the number of closing values.
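For comparison, the same computation can also be delegated to the standard library's statistics module; the three sample rows below are illustrative only, not the full AAPL data set:

```python
import statistics

# Sample rows mimicking the dict objects produced by csv.DictReader
# (illustrative values, not the complete April 2020 data).
data = [{'CLOSE': '240.91'}, {'CLOSE': '244.93'}, {'CLOSE': '241.41'}]

# statistics.mean() replaces the explicit sum/len combination.
mean_close = statistics.mean(float(row['CLOSE']) for row in data)
print(round(mean_close, 2))
```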
This is one of the major reasons why pandas has gained such popularity in the Python community. It makes the importing of data and the handling of, for example, financial time series data sets more convenient (and also often considerably faster) than pure Python.
From this point on, this section uses pandas to work with the Apple stock price data set. The major function used is read_csv(), which allows for a number of customizations via different parameters (see the read_csv() API reference). read_csv() yields as a result of the data reading procedure a DataFrame object, which is the central means of storing (tabular) data with pandas. The DataFrame class has many powerful methods that are particularly helpful in financial applications (refer to the DataFrame API reference).
In [11]: import pandas as pd

In [12]: data = pd.read_csv(fn, index_col=0, parse_dates=True)

In [13]: data.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 21 entries, 2020-04-01 to 2020-04-30
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   HIGH    21 non-null     float64
 1   CLOSE   21 non-null     float64
 2   LOW     21 non-null     float64
 3   OPEN    21 non-null     float64
 4   COUNT   21 non-null     float64
 5   VOLUME  21 non-null     float64
dtypes: float64(6)
memory usage: 1.1 KB

In [14]: data.tail()
Out[14]:               HIGH   CLOSE     LOW    OPEN     COUNT      VOLUME
         Date
         2020-04-24  283.01  282.97  277.00  277.20  306176.0  31627183.0
         2020-04-27  284.54  283.17  279.95  281.80  300771.0  29271893.0
         2020-04-28  285.83  278.58  278.20  285.08  285384.0  28001187.0
         2020-04-29  289.67  287.73  283.89  284.73  324890.0  34320204.0
         2020-04-30  294.53  293.80  288.35  289.96  471129.0  45765968.0
The pandas package is imported.
This imports the data from the CSV file, indicating that the first column shall be treated as the index column and that the entries in that column shall be interpreted as date-time information.
This method call prints out meta information regarding the resulting DataFrame object.
The data.tail() method prints out by default the five most recent data rows.
Calculating the mean of the Apple stock closing values now requires only a single method call.
In [15]: data['CLOSE'].mean()
Out[15]: 272.38619047619056
Chapter 4 introduces more functionality of pandas for the handling of financial data. For details on working with pandas and the powerful DataFrame class, also refer to the official pandas documentation page and to the book by McKinney (2017).
Although the Python standard library provides capabilities to read data from CSV files, pandas in general significantly simplifies and speeds up such operations. An additional benefit is that the data analysis capabilities of pandas are immediately available, since read_csv() returns a DataFrame object.
pandas also excels at exporting data stored in DataFrame objects when this data needs to be shared in a non-Python-specific format. Apart from being able to export to CSV files, pandas also allows, for example, exports in the form of Excel spreadsheet files as well as JSON files, both of which are popular data exchange formats in the financial industry. Such an exporting procedure typically needs a single method call only.
In [16]: data.to_excel('data/aapl.xls', 'AAPL')

In [17]: data.to_json('data/aapl.json')

In [18]: ls -n data/
total 24
-rw-r--r--  1 501  20  3067 Aug 25 11:47 aapl.json
-rw-r--r--  1 501  20  5632 Aug 25 11:47 aapl.xls
In particular when it comes to the interaction with Excel spreadsheet files, there are more elegant ways than just doing a data dump to a new file. xlwings, for example, is a powerful Python package allowing for an efficient and intelligent interaction between Python and Excel (visit the xlwings home page).
Now that the data is also available in the form of an Excel spreadsheet file and a JSON data file, pandas can read data from these sources as well. The approach is as straightforward as with CSV files.
In [19]: data_copy_1 = pd.read_excel('data/aapl.xls', 'AAPL',
                                     index_col=0)

In [20]: data_copy_1.head()
Out[20]:               HIGH   CLOSE       LOW    OPEN   COUNT    VOLUME
         Date
         2020-04-01  248.72  240.91  239.1300  246.50  460606  44054638
         2020-04-02  245.15  244.93  236.9000  240.34  380294  41483493
         2020-04-03  245.70  241.41  238.9741  242.80  293699  32470017
         2020-04-06  263.11  262.47  249.3800  250.90  486681  50455071
         2020-04-07  271.70  259.43  259.0000  270.80  467375  50721831

In [21]: data_copy_2 = pd.read_json('data/aapl.json')

In [22]: data_copy_2.head()
Out[22]:               HIGH   CLOSE       LOW    OPEN   COUNT    VOLUME
         2020-04-01  248.72  240.91  239.1300  246.50  460606  44054638
         2020-04-02  245.15  244.93  236.9000  240.34  380294  41483493
         2020-04-03  245.70  241.41  238.9741  242.80  293699  32470017
         2020-04-06  263.11  262.47  249.3800  250.90  486681  50455071
         2020-04-07  271.70  259.43  259.0000  270.80  467375  50721831

In [23]: !rm data/*
This reads the data from the Excel spreadsheet file to a new DataFrame object.
The first five rows of the first in-memory copy of the data are printed.
This reads the data from the JSON file to yet another DataFrame object.
This then prints the first five rows of the second in-memory copy of the data.
pandas proves useful for reading and writing financial data from and to different types of data files. Often, the reading might be tricky due to non-standard storage formats (like a ";" instead of a "," as separator), but pandas generally provides the right set of parameter combinations to cope with such cases. Although all examples in this section use a small data set only, one can expect high performance input-output operations from pandas in the most important scenarios when the data sets are much larger.
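Such a non-standard case can be sketched with a small, hypothetical example: a file using ";" as the field separator and "," as the decimal mark, a combination common in European locales; the sep and decimal parameters of read_csv() handle it directly:

```python
import io
import pandas as pd

# Hypothetical CSV content with ";" as separator and "," as decimal mark.
raw = io.StringIO('Date;CLOSE\n2020-04-01;240,91\n2020-04-02;244,93\n')

# sep and decimal tell read_csv() how to parse such a file.
df = pd.read_csv(raw, sep=';', decimal=',', index_col=0, parse_dates=True)
print(df)
```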
To a great extent, the attractiveness of the Python ecosystem stems from the fact that almost all packages available are open source and can be used for free. Financial analytics in general and algorithmic trading in particular, however, cannot live with open source software and algorithms alone — data plays a vital role as well, as the quote at the beginning of the chapter emphasizes. The previous section uses a small data set from a commercial data source. While there have been helpful open (financial) data sources available for some years (such as the ones provided by Yahoo! Finance or Google Finance), there are not too many left at the time of this writing in 2020. One of the more obvious reasons for this trend might be the ever-changing terms of data licensing agreements.
The one notable exception for the purposes of this book is Quandl (http://quandl.com), a platform that aggregates a large number of open as well as premium (= to-be-paid-for) data sources. The data is provided via a unified API for which a Python wrapper package is available.
The Python wrapper package for the Quandl data API (see the Python wrapper page on Quandl and the GitHub page of the package) is installed with conda through conda install quandl. The first example shows how to retrieve historical average prices for the BTC/USD exchange rate since the introduction of Bitcoin as a cryptocurrency. With Quandl, requests always expect a combination of the database and the specific data set desired; in the example, BCHAIN and MKPRU. Such information can generally be looked up on the Quandl platform. For the example, the relevant page on Quandl is BCHAIN/MKPRU.
By default, the quandl package returns a pandas DataFrame object. In the example, the Value column is also presented in annualized fashion, i.e., with year-end values. Note that the number shown for 2020 is the last available value in the data set (from August 2020) and not necessarily the year-end value.
While a large part of the data sets on the Quandl platform are free, some of the free data sets require an API key (such a key is also required after a certain limit of free API calls). Every user obtains such a key by signing up for a free Quandl account on the Quandl sign-up page. Data requests requiring an API key expect the key to be provided via the parameter api_key. In the example, the API key (which is found on the account settings page) is read from a configuration file via the configparser module.
In [24]: import configparser
         config = configparser.ConfigParser()
         config.read('../pyalgo.cfg')
Out[24]: ['../pyalgo.cfg']

In [25]: import quandl as q

In [26]: data = q.get('BCHAIN/MKPRU',
                      api_key=config['quandl']['api_key'])

In [27]: data.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 4254 entries, 2009-01-03 to 2020-08-26
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Value   4254 non-null   float64
dtypes: float64(1)
memory usage: 66.5 KB

In [28]: data['Value'].resample('A').last()
Out[28]: Date
         2009-12-31        0.000000
         2010-12-31        0.299999
         2011-12-31        4.995000
         2012-12-31       13.590000
         2013-12-31      731.000000
         2014-12-31      317.400000
         2015-12-31      428.000000
         2016-12-31      952.150000
         2017-12-31    13215.574000
         2018-12-31     3832.921667
         2019-12-31     7385.360000
         2020-12-31    11763.930000
         Freq: A-DEC, Name: Value, dtype: float64
Imports the Python wrapper package for Quandl.
Reads historical data for the BTC/USD exchange rate.
Selects the Value column, resamples it from the originally daily values to yearly values, and defines the last available observation to be the relevant one.
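The resampling logic can be illustrated independently of Quandl with a small synthetic series; weekly bins ('W') are used here to keep the sketch compact, whereas the example above uses annual bins, but the .last() semantics, keeping the final observation per bin, are the same:

```python
import pandas as pd

# Synthetic daily series; the values are made up for illustration only.
idx = pd.date_range('2020-04-01', periods=21, freq='D')
s = pd.Series(range(21), index=idx, name='Value')

# .resample(...).last() keeps the final observation in each bin.
weekly_last = s.resample('W').last()
print(weekly_last)
```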
Quandl also provides, for example, diverse data sets for single stocks, like end-of-day stock prices, stock fundamentals or data sets related to options traded on a certain stock.
In [29]: data = q.get('FSE/SAP_X', start_date='2018-1-1',
                      end_date='2020-05-01',
                      api_key=config['quandl']['api_key'])

In [30]: data.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 579 entries, 2018-01-02 to 2020-04-30
Data columns (total 10 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   Open                   257 non-null    float64
 1   High                   579 non-null    float64
 2   Low                    579 non-null    float64
 3   Close                  579 non-null    float64
 4   Change                 0 non-null      object
 5   Traded Volume          533 non-null    float64
 6   Turnover               533 non-null    float64
 7   Last Price of the Day  0 non-null      object
 8   Daily Traded Units     0 non-null      object
 9   Daily Turnover         0 non-null      object
dtypes: float64(6), object(4)
memory usage: 49.8+ KB
The API key can also be configured permanently with the Python wrapper via

q.ApiConfig.api_key = 'YOUR_API_KEY'
The Quandl platform also offers premium data sets for which a subscription or fee is required. Most of these data sets offer free samples. The example retrieves option implied volatilities for the Microsoft, Inc. stock. The free sample data set is quite large, with more than 4,100 rows and many columns (only a subset is shown). The last line of code displays the 30, 60, and 90 days implied volatility values for the five most recent days available.
In [31]: q.ApiConfig.api_key = config['quandl']['api_key']

In [32]: vol = q.get('VOL/MSFT')

In [33]: vol.iloc[:, :10].info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1006 entries, 2015-01-02 to 2018-12-31
Data columns (total 10 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Hv10    1006 non-null   float64
 1   Hv20    1006 non-null   float64
 2   Hv30    1006 non-null   float64
 3   Hv60    1006 non-null   float64
 4   Hv90    1006 non-null   float64
 5   Hv120   1006 non-null   float64
 6   Hv150   1006 non-null   float64
 7   Hv180   1006 non-null   float64
 8   Phv10   1006 non-null   float64
 9   Phv20   1006 non-null   float64
dtypes: float64(10)
memory usage: 86.5 KB

In [34]: vol[['IvMean30', 'IvMean60', 'IvMean90']].tail()
Out[34]:             IvMean30  IvMean60  IvMean90
         Date
         2018-12-24    0.4310    0.4112    0.3829
         2018-12-26    0.4059    0.3844    0.3587
         2018-12-27    0.3918    0.3879    0.3618
         2018-12-28    0.3940    0.3736    0.3482
         2018-12-31    0.3760    0.3519    0.3310
This concludes the overview of the Python wrapper package quandl for the Quandl data API. The Quandl platform and service is growing rapidly and proves to be a valuable source for financial data in an algorithmic trading context.
Open source software is a trend that started many years ago. It has lowered the barriers to entry in many areas, including algorithmic trading. A new, reinforcing trend in this regard is open data sources. In some cases, such as with Quandl, they even provide high quality data sets. It cannot be expected that open data will completely replace professional data subscriptions any time soon, but they represent a valuable means to get started with algorithmic trading in a cost efficient manner.
Open data sources are a blessing for algorithmic traders who want to get started in the space and to quickly test hypotheses and ideas based on real financial data sets. Sooner or later, however, open data sets will no longer suffice to satisfy the requirements of more ambitious traders and professionals.
Refinitiv is one of the biggest financial data and news providers in the world. Its current desktop flagship product is Eikon, which is the equivalent of the Terminal by Bloomberg, the major competitor in the data services field. Figure 3-1 shows a screenshot of Eikon in the browser-based version. It provides access to petabytes of data via a single access point.
Recently, Refinitiv has streamlined its API landscape and released a Python wrapper package for the Eikon data API, called eikon, which is installed via pip install eikon. If you have a subscription to the Refinitiv Eikon data services, you can use the Python package to programmatically retrieve historical as well as streaming structured and unstructured data from the unified API. A technical prerequisite is that a local desktop application is running that provides a desktop API session. The latest such desktop application at the time of this writing is called Workspace (see Figure 3-2).
If you are an Eikon subscriber and have an account for the Developer Community pages, you will find an overview of the Python Eikon Scripting Library under Quick Start.
In order to use the Eikon Data API, the Eikon app_key needs to be set. You get it via the App Key Generator (APPKEY) application in either Eikon or Workspace.
In [35]: import eikon as ek

In [36]: ek.set_app_key(config['eikon']['app_key'])

In [37]: help(ek)
Help on package eikon:

NAME
    eikon - # coding: utf-8

PACKAGE CONTENTS
    Profile
    data_grid
    eikonError
    json_requests
    news_request
    streaming_session (package)
    symbology
    time_series
    tools

SUBMODULES
    cache
    desktop_session
    istream_callback
    itemstream
    session
    stream
    stream_connection
    streamingprice
    streamingprice_callback
    streamingprices

VERSION
    1.1.5

FILE
    /Users/yves/Python/envs/py38/lib/python3.8/site-packages/eikon/__init__.py
The retrieval of historical financial time series data is as straightforward as with the other wrappers used before.
In [39]: symbols = ['AAPL.O', 'MSFT.O', 'GOOG.O']

In [40]: data = ek.get_timeseries(symbols,
                                  start_date='2020-01-01',
                                  end_date='2020-05-01',
                                  interval='daily',
                                  fields=['*'])

In [41]: data.keys()
Out[41]: MultiIndex([('AAPL.O',   'HIGH'),
                     ('AAPL.O',  'CLOSE'),
                     ('AAPL.O',    'LOW'),
                     ('AAPL.O',   'OPEN'),
                     ('AAPL.O',  'COUNT'),
                     ('AAPL.O', 'VOLUME'),
                     ('MSFT.O',   'HIGH'),
                     ('MSFT.O',  'CLOSE'),
                     ('MSFT.O',    'LOW'),
                     ('MSFT.O',   'OPEN'),
                     ('MSFT.O',  'COUNT'),
                     ('MSFT.O', 'VOLUME'),
                     ('GOOG.O',   'HIGH'),
                     ('GOOG.O',  'CLOSE'),
                     ('GOOG.O',    'LOW'),
                     ('GOOG.O',   'OPEN'),
                     ('GOOG.O',  'COUNT'),
                     ('GOOG.O', 'VOLUME')],
                    )

In [42]: type(data['AAPL.O'])
Out[42]: pandas.core.frame.DataFrame

In [43]: data['AAPL.O'].info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 84 entries, 2020-01-02 to 2020-05-01
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   HIGH    84 non-null     float64
 1   CLOSE   84 non-null     float64
 2   LOW     84 non-null     float64
 3   OPEN    84 non-null     float64
 4   COUNT   84 non-null     Int64
 5   VOLUME  84 non-null     Int64
dtypes: Int64(2), float64(4)
memory usage: 4.8 KB

In [44]: data['AAPL.O'].tail()
Out[44]:               HIGH   CLOSE     LOW    OPEN   COUNT    VOLUME
         Date
         2020-04-27  284.54  283.17  279.95  281.80  300771  29271893
         2020-04-28  285.83  278.58  278.20  285.08  285384  28001187
         2020-04-29  289.67  287.73  283.89  284.73  324890  34320204
         2020-04-30  294.53  293.80  288.35  289.96  471129  45765968
         2020-05-01  299.00  289.07  285.85  286.25  558319  60154175
Defines a few symbols as a list object.
The central line of code that retrieves data for the symbols …
… for the given start date and …
… the given end date.
The time interval is here chosen to be daily.
All fields are requested.
The function get_timeseries() returns a multi-index DataFrame object.
The values corresponding to each level are regular DataFrame objects.
This provides an overview of the data stored in the DataFrame object.
The final five rows of data are shown.
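The multi-index column layout returned by get_timeseries() can be sketched with synthetic data (all numbers below are made up): level 0 holds the symbol (RIC), level 1 the field, and selecting a symbol at the first level yields a regular DataFrame object:

```python
import pandas as pd

# Two symbols with two fields each at the column level; made-up values.
cols = pd.MultiIndex.from_product([['AAPL.O', 'MSFT.O'], ['OPEN', 'CLOSE']])
idx = pd.date_range('2020-01-02', periods=2, freq='D')
data = pd.DataFrame([[300.0, 298.0, 160.0, 158.5],
                     [296.2, 297.4, 158.3, 159.0]],
                    index=idx, columns=cols)

# Selecting a symbol at the first column level drops that level
# and returns a regular DataFrame with the fields as columns.
sub = data['AAPL.O']
print(type(sub), list(sub.columns))
```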
The beauty of working with a professional data service API becomes evident when one wishes to work with multiple symbols and, in particular, with a different granularity of the financial data, i.e., other time intervals.
In [45]: %%time
         data = ek.get_timeseries(symbols,
                                  start_date='2020-08-14',
                                  end_date='2020-08-15',
                                  interval='minute',
                                  fields='*')
CPU times: user 58.2 ms, sys: 3.16 ms, total: 61.4 ms
Wall time: 2.02 s

In [46]: print(data['GOOG.O'].loc['2020-08-14 16:00:00':
                                  '2020-08-14 16:04:00'])
                          HIGH       LOW      OPEN      CLOSE  COUNT  VOLUME
Date
2020-08-14 16:00:00  1510.7439  1509.220  1509.940  1510.5239     48    1362
2020-08-14 16:01:00  1511.2900  1509.980  1510.500  1511.2900     52    1002
2020-08-14 16:02:00  1513.0000  1510.964  1510.964  1512.8600     72    1762
2020-08-14 16:03:00  1513.6499  1512.160  1512.990  1513.2300    108    4534
2020-08-14 16:04:00  1513.6500  1511.540  1513.418  1512.7100     40    1364

In [47]: for sym in symbols:
             print('\n' + sym + '\n', data[sym].iloc[-300:-295])

AAPL.O
                          HIGH       LOW    OPEN   CLOSE  COUNT  VOLUME
Date
2020-08-14 19:01:00  457.1699  456.6300  457.14  456.83   1457  104693
2020-08-14 19:02:00  456.9399  456.4255  456.81  456.45   1178   79740
2020-08-14 19:03:00  456.8199  456.4402  456.45  456.67    908   68517
2020-08-14 19:04:00  456.9800  456.6100  456.67  456.97    665   53649
2020-08-14 19:05:00  457.1900  456.9300  456.98  457.00    679   49636

MSFT.O
                         HIGH       LOW      OPEN     CLOSE  COUNT  VOLUME
Date
2020-08-14 19:01:00  208.6300  208.5083  208.5500  208.5674    333   21368
2020-08-14 19:02:00  208.5750  208.3550  208.5501  208.3600    513   37270
2020-08-14 19:03:00  208.4923  208.3000  208.3600  208.4000    303   23903
2020-08-14 19:04:00  208.4200  208.3301  208.3901  208.4099    222   15861
2020-08-14 19:05:00  208.4699  208.3600  208.3920  208.4069    235    9569

GOOG.O
                        HIGH        LOW       OPEN      CLOSE  COUNT  VOLUME
Date
2020-08-14 19:01:00  1510.42  1509.3288  1509.5100  1509.8550     47    1577
2020-08-14 19:02:00  1510.30  1508.8000  1509.7559  1508.8647     71    2950
2020-08-14 19:03:00  1510.21  1508.7200  1508.7200  1509.8100     33     603
2020-08-14 19:04:00  1510.21  1508.7200  1509.8800  1509.8299     41     934
2020-08-14 19:05:00  1510.21  1508.7300  1509.5500  1509.6600     30     445
Data is retrieved for all symbols at once.
The time interval …
… is drastically shortened.
The function call retrieves minute bars for the symbols.
Prints five rows from the Google, Inc. data set.
Prints five data rows from every DataFrame object.
The code above illustrates how convenient it is to retrieve historical financial time series data from the Eikon API with Python. By default, the function get_timeseries() provides the following options for the interval parameter: tick, minute, hour, daily, weekly, monthly, quarterly, and yearly. This gives all the flexibility needed in an algorithmic trading context, in particular when combined with the resampling capabilities of pandas, as shown in the code below.
In [48]: %%time
         data = ek.get_timeseries(symbols[0],
                                  start_date='2020-08-14 15:00:00',
                                  end_date='2020-08-14 15:30:00',
                                  interval='tick',
                                  fields=['*'])
CPU times: user 257 ms, sys: 17.3 ms, total: 274 ms
Wall time: 2.31 s

In [49]: data.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 47346 entries, 2020-08-14 15:00:00.019000 to
               2020-08-14 15:29:59.987000
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   VALUE   47311 non-null  float64
 1   VOLUME  47346 non-null  Int64
dtypes: Int64(1), float64(1)
memory usage: 1.1 MB

In [50]: data.head()
Out[50]:                             VALUE  VOLUME
         Date
         2020-08-14 15:00:00.019  453.2499      60
         2020-08-14 15:00:00.036  453.2294       3
         2020-08-14 15:00:00.146  453.2100       5
         2020-08-14 15:00:00.146  453.2100     100
         2020-08-14 15:00:00.236  453.2100       2

In [51]: resampled = data.resample('30s', label='right').agg(
             {'VALUE': 'last', 'VOLUME': 'sum'})

In [52]: resampled.tail()
Out[52]:                         VALUE  VOLUME
         Date
         2020-08-14 15:28:00  453.9000   29746
         2020-08-14 15:28:30  454.2869   86441
         2020-08-14 15:29:00  454.3900   49513
         2020-08-14 15:29:30  454.7550   98520
         2020-08-14 15:30:00  454.6200   55592
A time interval of …
… half an hour is chosen (due to data retrieval limits).
The interval parameter is set to tick.
Close to 50,000 price ticks are retrieved for the interval.
The time series data set shows highly irregular (heterogeneous) interval lengths between two ticks.
The tick data is resampled to a 30 second interval length (by taking the last value and the sum, respectively) …
… which is reflected in the DatetimeIndex of the new DataFrame object.
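The aggregation pattern used above, last value for prices and sum for volumes, works on any irregularly spaced time series; a self-contained sketch with made-up tick data:

```python
import pandas as pd

# Synthetic irregular tick data; timestamps and values are made up.
ticks = pd.DataFrame(
    {'VALUE': [100.0, 100.5, 101.0, 100.8, 101.2],
     'VOLUME': [10, 5, 20, 15, 30]},
    index=pd.to_datetime(['2020-08-14 15:00:00.1',
                          '2020-08-14 15:00:12.5',
                          '2020-08-14 15:00:29.9',
                          '2020-08-14 15:00:31.0',
                          '2020-08-14 15:00:58.2']))

# Last traded price and summed volume per 30 second bin, labeled
# with the right bin edge as in the Eikon example above.
bars = ticks.resample('30s', label='right').agg(
    {'VALUE': 'last', 'VOLUME': 'sum'})
print(bars)
```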
A major strength of working with the Eikon API via Python is the easy retrieval of unstructured data, which can then be parsed and analyzed with Python packages for natural language processing (NLP). Such a procedure is as simple and straightforward as for financial time series data. The code that follows retrieves news headlines for a fixed time interval that include Apple, Inc. as a company as well as "macbook" as a word. At most, the five most recent hits are displayed.
In [53]: headlines = ek.get_news_headlines(query='R:AAPL.O macbook',
                                           count=5,
                                           date_from='2020-4-1',
                                           date_to='2020-5-1')

In [54]: headlines
Out[54]:
                                           versionCreated  \
2020-04-20 21:33:37.332  2020-04-20 21:33:37.332000+00:00
2020-04-20 10:20:23.201  2020-04-20 10:20:23.201000+00:00
2020-04-20 02:32:27.721  2020-04-20 02:32:27.721000+00:00
2020-04-15 12:06:58.693  2020-04-15 12:06:58.693000+00:00
2020-04-09 21:34:08.671  2020-04-09 21:34:08.671000+00:00

                                                                      text  \
2020-04-20 21:33:37.332  Apple said to launch new AirPods, MacBook Pro...
2020-04-20 10:20:23.201  Apple might launch upgraded AirPods, 13-inch M...
2020-04-20 02:32:27.721  Apple to reportedly launch new AirPods alongsi...
2020-04-15 12:06:58.693  Apple files a patent for iPhones, MacBook indu...
2020-04-09 21:34:08.671  Apple rolls out new software update for MacBoo...

                                                               storyId  \
2020-04-20 21:33:37.332  urn:newsml:reuters.com:20200420:nNRAble9rq:1
2020-04-20 10:20:23.201  urn:newsml:reuters.com:20200420:nNRAbl8eob:1
2020-04-20 02:32:27.721  urn:newsml:reuters.com:20200420:nNRAbl4mfz:1
2020-04-15 12:06:58.693  urn:newsml:reuters.com:20200415:nNRAbjvsix:1
2020-04-09 21:34:08.671  urn:newsml:reuters.com:20200409:nNRAbi2nbb:1

                        sourceCode
2020-04-20 21:33:37.332  NS:TIMIND
2020-04-20 10:20:23.201  NS:BUSSTA
2020-04-20 02:32:27.721  NS:HINDUT
2020-04-15 12:06:58.693  NS:HINDUT
2020-04-09 21:34:08.671  NS:TIMIND

In [55]: story = headlines.iloc[0]

In [56]: story
Out[56]: versionCreated            2020-04-20 21:33:37.332000+00:00
         text              Apple said to launch new AirPods, MacBook Pro...
         storyId               urn:newsml:reuters.com:20200420:nNRAble9rq:1
         sourceCode                                           NS:TIMIND
         Name: 2020-04-20 21:33:37.332000, dtype: object
In [57]: news_text = ek.get_news_story(story['storyId'])

In [58]: from IPython.display import HTML

In [59]: HTML(news_text)
Out[59]: <IPython.core.display.HTML object>
NEW DELHI: Apple recently launched its much-awaited affordable smartphone iPhone SE. Now it seems that the company is gearing up for another launch. Apple is said to launch the next generation of AirPods and the all-new 13-inch MacBook Pro next month. In February an online report revealed that the Cupertino-based tech giant is working on AirPods Pro Lite. Now a tweet by tipster Job Posser has revealed that Apple will soon come up with new AirPods and MacBook Pro. Jon Posser tweeted, "New AirPods (which were supposed to be at the March Event) is now ready to go. Probably alongside the MacBook Pro next month." However, not many details about the upcoming products are available right now. The company was supposed to launch these products at the March event along with the iPhone SE. But due to the ongoing pandemic coronavirus, the event got cancelled. It is expected that Apple will launch the AirPods Pro Lite and the 13-inch MacBook Pro just like the way it launched the iPhone SE. Meanwhile, Apple has scheduled its annual developer conference WWDC to take place in June. This year the company has decided to hold an online-only event due to the outbreak of coronavirus. Reports suggest that this year the company is planning to launch the all-new AirTags and a premium pair of over-ear Bluetooth headphones at the event. Using the Apple AirTags users will be able to locate real-world items such as keys or suitcase in the Find My app. The AirTags will also have offline finding capabilities that the company introduced in the core of iOS 13.Apart from this, Apple is also said to unveil its high-end Bluetooth headphones. It is expected that the Bluetooth headphones will offer better sound quality and battery backup as compared to the AirPods. For Reprint Rights: timescontent.com Copyright (c) 2020 BENNETT,COLEMAN & CO.LTD.
The query parameter for the retrieval operation.
Sets the maximum number of hits to five.
Defines the interval for which to look for news headlines.
Gives out the results object (output shortened).
One particular headline is picked and its storyId shown.
This retrieves the news text as HTML code.
In Jupyter Notebook, for example, the HTML code can be rendered for better reading.
This concludes the illustration of the Python wrapper package for the Refinitiv Eikon Data API.
In algorithmic trading, one of the most important scenarios for the management of data sets is "retrieve once, use multiple times". Or, from an input-output (IO) perspective, it is "write once, read multiple times". In the first case, data might be retrieved from a web service and then used multiple times to backtest a strategy based on a temporary, in-memory copy of the data set. In the second case, tick data that is received continually is written to disk and later on used multiple times for certain manipulations (like aggregations) in combination with a backtesting procedure.
This section assumes that the in-memory data structure to store the data is a pandas DataFrame object, no matter from which source the data is acquired (a CSV file, a web service, etc.). To have a somewhat meaningful data set available in terms of size, the section uses a sample financial data set generated by the use of pseudo-random numbers. "Python Scripts" presents the Python module with a function called generate_sample_data() that accomplishes the task. In principle, this function generates a sample financial data set in tabular form of arbitrary size (available memory, of course, sets a limit).
In [60]: from sample_data import generate_sample_data

In [61]: print(generate_sample_data(rows=5, cols=4))
                            No0         No1         No2         No3
2021-01-01 00:00:00  100.000000  100.000000  100.000000  100.000000
2021-01-01 00:01:00  100.019641   99.950661  100.052993   99.913841
2021-01-01 00:02:00   99.998164   99.796667  100.109971   99.955398
2021-01-01 00:03:00  100.051537   99.660550  100.136336  100.024150
2021-01-01 00:04:00   99.984614   99.729158  100.210888   99.976584
Imports the function from the Python script.
Prints a sample financial data set with five rows and four columns.
The storage of a pandas DataFrame object as a whole is made simple by the pandas HDFStore wrapper functionality for the HDF5 binary storage standard. It allows one to dump complete DataFrame objects in a single step to a file-based database object. To illustrate the implementation, the first step is to create a sample data set of meaningful size; here, the DataFrame generated is about 420 MB in size.
In [62]: %time data = generate_sample_data(rows=5e6, cols=10).round(4)
CPU times: user 3.88 s, sys: 830 ms, total: 4.71 s
Wall time: 4.72 s

In [63]: data.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 5000000 entries, 2021-01-01 00:00:00 to 2030-07-05 05:19:00
Freq: T
Data columns (total 10 columns):
 #   Column  Dtype
---  ------  -----
 0   No0     float64
 1   No1     float64
 2   No2     float64
 3   No3     float64
 4   No4     float64
 5   No5     float64
 6   No6     float64
 7   No7     float64
 8   No8     float64
 9   No9     float64
dtypes: float64(10)
memory usage: 419.6 MB
A sample financial data set with 5,000,000 rows and ten columns is generated; the generation takes a couple of seconds.
The second step is to open an HDFStore object (i.e. an HDF5 database file) on disk and to write the DataFrame object to it.2 The size on disk of about 440 MB is a bit larger than for the in-memory DataFrame object. However, writing is about five times faster than the in-memory generation of the sample data set. Working in Python with binary stores like HDF5 database files usually gets you writing speeds close to the theoretical maximum of the available hardware.3
In [64]: h5 = pd.HDFStore('data/data.h5', 'w')

In [65]: %time h5['data'] = data
CPU times: user 356 ms, sys: 472 ms, total: 828 ms
Wall time: 1.08 s

In [66]: h5
Out[66]:
<class 'pandas.io.pytables.HDFStore'>
File path: data/data.h5

In [67]: ls -n data/data.*
-rw-r--r--@ 1 501  20  440007240 Aug 25 11:48 data/data.h5

In [68]: h5.close()
This opens the database file on disk for writing (and overwrites a potentially existing file with the same name).
Writing the DataFrame object to disk takes less than a second.
Prints meta information for the database file.
Closes the database file.
The third step is to read the data from the file-based HDFStore object. Reading also generally takes place close to the theoretical maximum speed.
In [69]: h5 = pd.HDFStore('data/data.h5', 'r')

In [70]: %time data_copy = h5['data']
CPU times: user 388 ms, sys: 425 ms, total: 813 ms
Wall time: 812 ms

In [71]: data_copy.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 5000000 entries, 2021-01-01 00:00:00 to 2030-07-05 05:19:00
Freq: T
Data columns (total 10 columns):
 #   Column  Dtype
---  ------  -----
 0   No0     float64
 1   No1     float64
 2   No2     float64
 3   No3     float64
 4   No4     float64
 5   No5     float64
 6   No6     float64
 7   No7     float64
 8   No8     float64
 9   No9     float64
dtypes: float64(10)
memory usage: 419.6 MB

In [72]: h5.close()

In [73]: rm data/data.h5
There is another, somewhat more flexible way of writing the data from a DataFrame object to an HDFStore object. To this end, one can use the to_hdf() method of the DataFrame object and set the format parameter to table (see the to_hdf API reference page). This allows the appending of new data to the table object on disk and also, for example, the searching over the data on disk, which is not possible with the first approach. The price to pay is slower writing and reading speeds.
In [74]: %time data.to_hdf('data/data.h5', 'data', format='table')
CPU times: user 3.25 s, sys: 491 ms, total: 3.74 s
Wall time: 3.8 s

In [75]: ls -n data/data.*
-rw-r--r--@ 1 501  20  446911563 Aug 25 11:48 data/data.h5

In [76]: %time data_copy = pd.read_hdf('data/data.h5', 'data')
CPU times: user 236 ms, sys: 266 ms, total: 502 ms
Wall time: 503 ms

In [77]: data_copy.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 5000000 entries, 2021-01-01 00:00:00 to 2030-07-05 05:19:00
Freq: T
Data columns (total 10 columns):
 #   Column  Dtype
---  ------  -----
 0   No0     float64
 1   No1     float64
 2   No2     float64
 3   No3     float64
 4   No4     float64
 5   No5     float64
 6   No6     float64
 7   No7     float64
 8   No8     float64
 9   No9     float64
dtypes: float64(10)
memory usage: 419.6 MB
This defines the writing format to be of type table. Writing becomes slower since this format type involves a bit more overhead and leads to a somewhat increased file size.
Reading is also slower in this application scenario.
In practice, the advantage of this approach is that one can work with the resulting table object on disk like with any other table object of the PyTables package, which is used by pandas in this context. This provides access to certain basic capabilities of the PyTables package, like appending rows to a table object.
In [78]: import tables as tb

In [79]: h5 = tb.open_file('data/data.h5', 'r')

In [80]: h5
Out[80]:
File(filename=data/data.h5, title='', mode='r', root_uep='/', filters=Filters(complevel=0, shuffle=False, bitshuffle=False, fletcher32=False, least_significant_digit=None))
/ (RootGroup) ''
/data (Group) ''
/data/table (Table(5000000,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values_block_0": Float64Col(shape=(10,), dflt=0.0, pos=1)}
  byteorder := 'little'
  chunkshape := (2978,)
  autoindex := True
  colindexes := {
    "index": Index(6, medium, shuffle, zlib(1)).is_csi=False}

In [81]: h5.root.data.table[:3]
Out[81]:
array([(1609459200000000000, [100.    , 100.    , 100.    , 100.    , 100.    , 100.    , 100.    , 100.    , 100.    , 100.    ]),
       (1609459260000000000, [100.0752, 100.1164, 100.0224, 100.0073, 100.1142, 100.0474,  99.9329, 100.0254, 100.1009, 100.066 ]),
       (1609459320000000000, [100.1593, 100.1721, 100.0519, 100.0933, 100.1578, 100.0301,  99.92  , 100.0965, 100.1441, 100.0717])],
      dtype=[('index', '<i8'), ('values_block_0', '<f8', (10,))])

In [82]: h5.close()

In [83]: rm data/data.h5
Imports the PyTables package.
Opens the database file for reading.
Shows the contents of the database file.
Prints the first three rows in the table.
Closes the database.
Although this second approach provides more flexibility, it does not open the doors to the full capabilities of the PyTables package. Nevertheless, the two approaches introduced in this sub-section are convenient and efficient when you are working with more or less immutable data sets that fit into memory. Nowadays, however, algorithmic trading in general has to deal with continuously and rapidly growing data sets, such as tick data for stock prices or foreign exchange rates. To cope with the requirements of such a scenario, alternative approaches might prove useful.
Using the HDFStore wrapper for the HDF5 binary storage standard, pandas is able to write and read financial data almost at the maximum speed the available hardware allows. Exports to other file-based formats, like CSV, are generally much slower alternatives.
The PyTables package (with import name tables) is a wrapper for the HDF5 binary storage library that is also used by pandas for its HDFStore implementation presented in the previous sub-section. The TsTables package (see the Github page of the package) in turn is dedicated to the efficient handling of large financial time series data sets based on the HDF5 binary storage library. It is effectively an enhancement of the PyTables package that adds support for time series data to its capabilities. It implements a hierarchical storage approach that allows for the fast retrieval of data sub-sets selected by providing start and end dates and times, respectively. The major scenario supported by TsTables is "write once, retrieve multiple times".
The setup illustrated in this sub-section is that data is continuously collected from a web source, a professional data provider, etc. and is stored temporarily in-memory in a DataFrame object. After a while, or after a certain number of data points has been retrieved, the collected data is then stored in a TsTables table object in an HDF5 database. First, the generation of the sample data.
In [84]: %%time
         data = generate_sample_data(rows=2.5e6, cols=5,
                                     freq='1s').round(4)
CPU times: user 915 ms, sys: 191 ms, total: 1.11 s
Wall time: 1.14 s

In [85]: data.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2500000 entries, 2021-01-01 00:00:00 to 2021-01-29 22:26:39
Freq: S
Data columns (total 5 columns):
 #   Column  Dtype
---  ------  -----
 0   No0     float64
 1   No1     float64
 2   No2     float64
 3   No3     float64
 4   No4     float64
dtypes: float64(5)
memory usage: 114.4 MB
This generates a sample financial data set with 2,500,000 rows and five columns at a one-second frequency; the sample data is rounded to four digits.
Second, some more imports and the creation of the TsTables table object. The major part is the definition of the desc class, which provides the description of the table object's data structure.
Currently, TsTables only works with the old pandas version 0.19. A friendly fork that works with newer versions of pandas is available at http://github.com/yhilpisch/tstables and can be installed via
pip install git+https://github.com/yhilpisch/tstables.git
In [86]: import tstables

In [87]: import tables as tb

In [88]: class desc(tb.IsDescription):
             ''' Description of TsTables table structure. '''
             timestamp = tb.Int64Col(pos=0)
             No0 = tb.Float64Col(pos=1)
             No1 = tb.Float64Col(pos=2)
             No2 = tb.Float64Col(pos=3)
             No3 = tb.Float64Col(pos=4)
             No4 = tb.Float64Col(pos=5)

In [89]: h5 = tb.open_file('data/data.h5ts', 'w')

In [90]: ts = h5.create_ts('/', 'data', desc)
In [91]: h5
Out[91]:
File(filename=data/data.h5ts, title='', mode='w', root_uep='/', filters=Filters(complevel=0, shuffle=False, bitshuffle=False, fletcher32=False, least_significant_digit=None))
/ (RootGroup) ''
/data (Group/Timeseries) ''
/data/y2020 (Group) ''
/data/y2020/m08 (Group) ''
/data/y2020/m08/d25 (Group) ''
/data/y2020/m08/d25/ts_data (Table(0,)) ''
  description := {
  "timestamp": Int64Col(shape=(), dflt=0, pos=0),
  "No0": Float64Col(shape=(), dflt=0.0, pos=1),
  "No1": Float64Col(shape=(), dflt=0.0, pos=2),
  "No2": Float64Col(shape=(), dflt=0.0, pos=3),
  "No3": Float64Col(shape=(), dflt=0.0, pos=4),
  "No4": Float64Col(shape=(), dflt=0.0, pos=5)}
  byteorder := 'little'
  chunkshape := (1365,)
TsTables (install it from https://github.com/yhilpisch/tstables) and PyTables are imported.
The first column of the table is a timestamp represented as an int value.
All data columns contain float values.
This opens a new database file for writing.
The TsTables table is created at the root node, with name data, and given the class-based description desc.
Inspecting the database file reveals the basic principle behind the hierarchical structuring in years, months and days.
Third, the writing of the sample data stored in a DataFrame object to the table object on disk. One of the major benefits of TsTables is the convenience with which this operation is accomplished, namely by a simple method call. Even better, that convenience is coupled with speed. With regard to the structure in the database, TsTables chunks the data into sub-sets of a single day. In the example case, where the frequency is set to one second, this translates into 24 x 60 x 60 = 86,400 data rows per full day's worth of data.
In [92]: %time ts.append(data)
CPU times: user 476 ms, sys: 238 ms, total: 714 ms
Wall time: 739 ms

In [93]: # h5
File(filename=data/data.h5ts, title='', mode='w', root_uep='/', filters=Filters(complevel=0, shuffle=False, bitshuffle=False, fletcher32=False, least_significant_digit=None))
/ (RootGroup) ''
/data (Group/Timeseries) ''
/data/y2020 (Group) ''
/data/y2021 (Group) ''
/data/y2021/m01 (Group) ''
/data/y2021/m01/d01 (Group) ''
/data/y2021/m01/d01/ts_data (Table(86400,)) ''
  description := {
  "timestamp": Int64Col(shape=(), dflt=0, pos=0),
  "No0": Float64Col(shape=(), dflt=0.0, pos=1),
  "No1": Float64Col(shape=(), dflt=0.0, pos=2),
  "No2": Float64Col(shape=(), dflt=0.0, pos=3),
  "No3": Float64Col(shape=(), dflt=0.0, pos=4),
  "No4": Float64Col(shape=(), dflt=0.0, pos=5)}
  byteorder := 'little'
  chunkshape := (1365,)
/data/y2021/m01/d02 (Group) ''
/data/y2021/m01/d02/ts_data (Table(86400,)) ''
  description := {
  "timestamp": Int64Col(shape=(), dflt=0, pos=0),
  "No0": Float64Col(shape=(), dflt=0.0, pos=1),
  "No1": Float64Col(shape=(), dflt=0.0, pos=2),
  "No2": Float64Col(shape=(), dflt=0.0, pos=3),
  "No3": Float64Col(shape=(), dflt=0.0, pos=4),
  "No4": Float64Col(shape=(), dflt=0.0, pos=5)}
  byteorder := 'little'
  chunkshape := (1365,)
/data/y2021/m01/d03 (Group) ''
/data/y2021/m01/d03/ts_data (Table(86400,)) ''
  description := {
  "timestamp": Int64Col(shape=(), dflt=0, pos=0),
  ...
This appends the DataFrame object via a simple method call.
The table object shows 86,400 rows per day after the append() operation.
Reading sub-sets of the data from a TsTables table object is generally very fast since this is what it is optimized for in the first place. In this regard, TsTables supports typical algorithmic trading applications, like backtesting, pretty well. Another contributing factor is that TsTables returns the data already as a DataFrame object, such that additional conversions are not necessary in general.
In [94]: import datetime

In [95]: start = datetime.datetime(2021, 1, 2)

In [96]: end = datetime.datetime(2021, 1, 3)

In [97]: %time subset = ts.read_range(start, end)
CPU times: user 10.3 ms, sys: 3.63 ms, total: 14 ms
Wall time: 12.8 ms

In [98]: start = datetime.datetime(2021, 1, 2, 12, 30, 0)

In [99]: end = datetime.datetime(2021, 1, 5, 17, 15, 30)

In [100]: %time subset = ts.read_range(start, end)
CPU times: user 28.6 ms, sys: 18.5 ms, total: 47.1 ms
Wall time: 46.1 ms

In [101]: subset.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 276331 entries, 2021-01-02 12:30:00 to 2021-01-05 17:15:30
Data columns (total 5 columns):
 #   Column  Non-Null Count   Dtype
---  ------  --------------   -----
 0   No0     276331 non-null  float64
 1   No1     276331 non-null  float64
 2   No2     276331 non-null  float64
 3   No3     276331 non-null  float64
 4   No4     276331 non-null  float64
dtypes: float64(5)
memory usage: 12.6 MB

In [102]: h5.close()

In [103]: rm data/*
This defines the starting date and the end date for the data retrieval operation.
The read_range() method takes the start and end dates as input; reading here is only a matter of milliseconds.
New data that is retrieved during a day can be appended to the TsTables table object as illustrated before. The package is therefore a valuable addition to the capabilities of pandas in combination with HDFStore objects when it comes to the efficient storage and retrieval of (large) financial time series data sets over time.
Financial time series data can also be written directly from a DataFrame object to a relational database like SQLite3. The use of a relational database might be useful in scenarios where the SQL query language is applied to implement more sophisticated analyses. With regard to speed and also disk usage, relational databases cannot, however, compare with approaches that rely on binary storage formats like HDF5.

The DataFrame class provides the method to_sql() (see the to_sql() API reference page) to write data to a table in a relational database. The size on disk of 100+ MB indicates that there is quite some overhead when using relational databases.
In [104]: %time data = generate_sample_data(1e6, 5, '1min').round(4)
CPU times: user 342 ms, sys: 60.5 ms, total: 402 ms
Wall time: 405 ms

In [105]: data.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1000000 entries, 2021-01-01 00:00:00 to 2022-11-26 10:39:00
Freq: T
Data columns (total 5 columns):
 #   Column  Non-Null Count    Dtype
---  ------  --------------    -----
 0   No0     1000000 non-null  float64
 1   No1     1000000 non-null  float64
 2   No2     1000000 non-null  float64
 3   No3     1000000 non-null  float64
 4   No4     1000000 non-null  float64
dtypes: float64(5)
memory usage: 45.8 MB

In [106]: import sqlite3 as sq3

In [107]: con = sq3.connect('data/data.sql')

In [108]: %time data.to_sql('data', con)
CPU times: user 4.6 s, sys: 352 ms, total: 4.95 s
Wall time: 5.07 s

In [109]: ls -n data/data.*
-rw-r--r--@ 1 501  20  105316352 Aug 25 11:48 data/data.sql
The sample financial data set has 1,000,000 rows and five columns; memory usage is about 46 MB.
This imports the SQLite3 module.
A connection is opened to a new database file.
Writing the data to the relational database takes a couple of seconds.
One strength of relational databases is the ability to implement (out-of-memory) analytics tasks based on standardized SQL statements. As an example, consider a query that selects all those rows for which the value in column No1 is greater than 105 and the value in column No2 is smaller than 108.
In [110]: query = 'SELECT * FROM data WHERE No1 > 105 and No2 < 108'

In [111]: %time res = con.execute(query).fetchall()
CPU times: user 109 ms, sys: 30.3 ms, total: 139 ms
Wall time: 138 ms

In [112]: res[:5]
Out[112]:
[('2021-01-03 19:19:00', 103.6894, 105.0117, 103.9025, 95.8619, 93.6062),
 ('2021-01-03 19:20:00', 103.6724, 105.0654, 103.9277, 95.8915, 93.5673),
 ('2021-01-03 19:21:00', 103.6213, 105.1132, 103.8598, 95.7606, 93.5618),
 ('2021-01-03 19:22:00', 103.6724, 105.1896, 103.8704, 95.7302, 93.4139),
 ('2021-01-03 19:23:00', 103.8115, 105.1152, 103.8342, 95.706, 93.4436)]

In [113]: len(res)
Out[113]: 5035

In [114]: con.close()

In [115]: rm data/*
The SQL query as a Python str object.
The query is executed to retrieve all result rows.
The first five result rows are printed.
The length of the results list object.
Admittedly, such simple queries are also possible with pandas if the data set fits into memory. However, the SQL query language has proven useful and powerful for decades now and should be in the algorithmic trader's arsenal of data weapons.
pandas also supports database connections via SQLAlchemy, a Python abstraction layer package for diverse relational databases (refer to the SQLAlchemy home page). This in turn allows for the use of, for example, MySQL as the relational database backend.
This chapter covers the handling of financial time series data. It illustrates the reading of such data from different file-based sources, like CSV files. It also shows how to retrieve financial data from web services, like the one of Quandl for end-of-day and options data. Open financial data sources are a valuable addition to the financial landscape. Quandl is a platform integrating thousands of open data sets under the umbrella of a unified API.

Another important topic covered in this chapter is the efficient storage of complete DataFrame objects on disk, as well as of the data contained in such an in-memory object, to databases. Database flavors used in this chapter include the HDF5 database standard as well as the lightweight relational database SQLite3. This chapter lays the foundation for Chapter 4, which addresses vectorized backtesting, Chapter 5, which covers machine learning and deep learning for market prediction, as well as Chapter 6, which discusses event-based backtesting of trading strategies.
You can find more information about Quandl by following these links:
Information about the package used to retrieve data from that source is found here:
You should consult the official documentation pages for more information on the packages used in this chapter:
Books cited in this chapter:
Hilpisch, Yves (2018): Python for Finance — Mastering Data-Driven Finance. 2nd ed., O’Reilly, Beijing et al.
McKinney, Wes (2017): Python for Data Analysis — Data Wrangling with Pandas, NumPy, and IPython. 2nd ed., O’Reilly, Beijing et al.
The following Python script generates sample financial time series data based on a Monte Carlo simulation for a geometric Brownian motion (see Hilpisch (2018, ch. 12)).
#
# Python Module to Generate a
# Sample Financial Data Set
#
# Python for Algorithmic Trading
# (c) Dr. Yves J. Hilpisch
# The Python Quants GmbH
#
import numpy as np
import pandas as pd

r = 0.05  # constant short rate
sigma = 0.5  # volatility factor


def generate_sample_data(rows, cols, freq='1min'):
    '''
    Function to generate sample financial data.

    Parameters
    ==========
    rows: int
        number of rows to generate
    cols: int
        number of columns to generate
    freq: str
        frequency string for DatetimeIndex

    Returns
    =======
    df: DataFrame
        DataFrame object with the sample data
    '''
    rows = int(rows)
    cols = int(cols)
    # generate a DatetimeIndex object given the frequency
    index = pd.date_range('2021-1-1', periods=rows, freq=freq)
    # determine time delta in year fractions
    dt = (index[1] - index[0]) / pd.Timedelta(value='365D')
    # generate column names
    columns = ['No%d' % i for i in range(cols)]
    # generate sample paths for geometric Brownian motion
    raw = np.exp(np.cumsum((r - 0.5 * sigma ** 2) * dt +
                           sigma * np.sqrt(dt) *
                           np.random.standard_normal((rows, cols)), axis=0))
    # normalize the data to start at 100
    raw = raw / raw[0] * 100
    # generate the DataFrame object
    df = pd.DataFrame(raw, index=index, columns=columns)
    return df


if __name__ == '__main__':
    rows = 5  # number of rows
    columns = 3  # number of columns
    freq = 'D'  # daily frequency
    print(generate_sample_data(rows, columns, freq))
1 Source: "Bad Election Day Forecasts Deal Blow to Data Science — Prediction models suffered from narrow data, faulty algorithms and human foibles." Wall Street Journal, November 9, 2016.
2 Of course, multiple DataFrame objects could also be stored in a single HDFStore object.
3 All values reported here are from the author’s MacMini with Intel i7 hexa core processor (12 threads), 32 GB of random access memory (DDR4 RAM) and a 512 GB solid state drive (SSD).