14
Data and Alpha Design

By Weijia Li

Data plays a central role in alpha design. First, we need basic data to run a simulation; by basic data, we mean the price and volume of each security. No matter what kind of alpha idea you want to backtest, you need this data to calculate statistics such as return, Sharpe ratio, and turnover. Without these statistics, we can never know whether an alpha idea is good. Second, data itself can inspire alpha ideas. For example, you can plot the price/volume history of some stocks and look for repeating patterns, or apply technical analysis to the price/volume data. If you have access to company earnings data, one natural idea would be to trade stocks based on company earnings.
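To make the statistics above concrete, here is a minimal sketch in Python (pandas/numpy) of how annualized Sharpe, average return, and average turnover could be computed from daily prices and dollar positions. The inputs and the 252-day annualization are illustrative assumptions, not part of any particular simulation platform.

# A minimal sketch of the basic statistics mentioned above, computed from
# daily close prices and daily dollar positions; the inputs 'positions' and
# 'prices' and the 252-day annualization are illustrative assumptions.
import numpy as np
import pandas as pd

def alpha_stats(positions, prices):
    """positions: dollar holdings (rows = days, columns = tickers);
    prices: close prices aligned on the same index and columns."""
    stock_returns = prices.pct_change().fillna(0.0)
    # PnL from holding yesterday's positions through today's price move
    daily_pnl = (positions.shift(1) * stock_returns).sum(axis=1)
    book_size = positions.abs().sum(axis=1).replace(0.0, np.nan)
    daily_return = daily_pnl / book_size.shift(1)
    # Turnover: fraction of the book traded each day, averaged over time
    traded = (positions - positions.shift(1)).abs().sum(axis=1)
    turnover = (traded / book_size).mean()
    sharpe = np.sqrt(252) * daily_return.mean() / daily_return.std()
    return {"annualized_sharpe": sharpe,
            "avg_daily_return": daily_return.mean(),
            "avg_daily_turnover": turnover}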

HOW WE FIND DATA FOR ALPHA

Finding new data has always been a critical skill for an alpha researcher. We always prefer alphas with good performance and low correlation to existing ones, and a new dataset can serve both purposes. Sometimes the signals we extract from one dataset are not strong enough, even after we try our best to improve them; if we can get another dataset that looks at companies from a different angle, we may be able to improve the original signals. We also want uncorrelated alphas to diversify the alpha pool. Even when the alpha ideas are different, however, signals derived from the same dataset can still be highly correlated, because the shared data introduces an intrinsic correlation between them. A new dataset brings new ideas and new ways of using the data, so the signals found in it will most likely have low correlation to signals based on other data. By using new data, we may achieve both performance improvement and diversification; one simple way to check this is sketched below.
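As a rough illustration of the diversification argument, the following sketch assumes daily PnL series already exist for several candidate alphas and simply computes their pairwise correlation matrix; the input name and the threshold mentioned in the comment are hypothetical.

# A minimal sketch, assuming daily PnL series already exist for several
# alphas; 'pnl_by_alpha' and the 0.3 threshold below are hypothetical.
import pandas as pd

def pnl_correlation(pnl_by_alpha):
    """pnl_by_alpha: dict mapping an alpha name to its daily PnL Series."""
    pnl = pd.DataFrame(pnl_by_alpha).dropna()
    return pnl.corr()

# A signal built on a genuinely new dataset would ideally show low
# correlation (say, below roughly 0.3) against the existing pool.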

Data from Literature

It is nice to have new data to make new alphas, but how can we get it? In this alpha-hunting quest, if you find the proper, relevant data, you are halfway there. There are different sources of new data. The most common source is academic literature. As we mentioned, data is usually associated with alpha ideas. If we search “stock return” on the internet, we will find thousands of papers trying to capture the “abnormal return” (i.e. the alpha). Those papers describe all kinds of data used in their studies: price/volume, fundamental, earnings, etc. Once we get the data, we can of course try the same method described in the paper to develop the alpha. You can also find useful data by searching publicly available content on the internet. However, the less well known the data, the more valuable it can be: if the data is well known and widely available to the public, many people will build similar alpha models, which will gradually arbitrage the alpha (i.e. the abnormal return) away. There is still a possibility that, even if the data is popular, no one has applied it to price prediction, so the data retains some utility.

Data from Vendors

Data is valuable information, so providing data is also a business. Many data vendors specialize in collecting, parsing, processing, and delivering data. If the data is simple, a vendor may provide only the raw data it collected (e.g. price/volume data). Sometimes vendors do some parsing and processing before delivering the data to their clients; fundamental data is an example. For unstructured yet sophisticated data such as news, tweets, and so on, vendors may apply natural language processing techniques to analyze the content of the raw data and provide machine-readable data to their clients instead of raw data that is only human readable. Vendors can even sell alpha models directly: the data itself is the output of an alpha model, and the clients need only load the data and trade according to it.

DATA VALIDATION

Whenever we – as alpha researchers – get new data, the first thing to do is not to test alpha signals on it but to check its usability. In alpha simulation, data delivery time is a very important factor. Any piece of data that is not associated with a timestamp is useless: without knowing when a data point became available, we do not know when the simulation is allowed to use it, so we would essentially be using the data blindly. Only with correct timestamps can we simulate correctly. If we use data before it was actually available, we introduce forward-looking bias, which makes the alpha performance look spuriously amazing. If we use the data later than necessary, the alpha performance will not be as good as it could be, and we will be leaving money on the table. For example, if AAPL’s earnings exceeded expectations, its stock price would most likely increase, all other things equal, so the time of the announcement would be a good time to buy. If, however, we waited a week to buy AAPL, we would not realize those gains, because the “good” news would already be priced into the stock. So, for every dataset we want to use in alpha research, we need to learn the data delivery timeline and make sure we access the data only when it is available; then we can be confident there is no forward-looking bias and the simulation results are meaningful. In addition, we need to make sure the data can support alpha production, meaning it will continue to be generated on some schedule in the future. Sometimes the data producer ceases to generate the data; in that case the data is not usable for alphas, because nothing will supply it in real time when the alphas are live.
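As a sketch of what such a point-in-time check might look like, the following assumes each record carries an 'available_at' delivery timestamp (a hypothetical column name supplied by the vendor) and excludes anything that was not yet published at the simulation time.

# A minimal sketch of a point-in-time check; 'available_at' is a hypothetical
# delivery-timestamp column, and records not yet published at the simulation
# time are excluded to avoid forward-looking bias.
import pandas as pd

def point_in_time_view(data, sim_time):
    """Return only the rows that were actually deliverable at sim_time."""
    if "available_at" not in data.columns:
        raise ValueError("data must carry a delivery timestamp")
    return data[data["available_at"] <= pd.Timestamp(sim_time)]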

Another possible problem with data is survival bias. Suppose a data vendor provides an alpha model that performs well when we test it on history. This does not mean the model will perform well in the future, because we do not know how many models the vendor developed before selecting this one. If the vendor tried 1,000 models and only this one survived, we may face survival bias. The bias is introduced by the vendor and is out of our control. In this case, reserving an out-of-sample period for the dataset can be helpful: an alpha that continues to do well out of sample is a good indicator of robustness, because that period could not have been used for the vendor's selection.
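A minimal sketch of such an out-of-sample comparison, assuming a daily PnL series for the candidate model, is shown below; the split date and the 252-day annualization are illustrative assumptions.

# A minimal sketch of an out-of-sample check on a daily PnL series; the split
# date and the 252-day annualization are illustrative.
import numpy as np
import pandas as pd

def in_vs_out_of_sample_sharpe(daily_pnl, split_date):
    """Compare annualized Sharpe before and after split_date."""
    def sharpe(x):
        return np.sqrt(252) * x.mean() / x.std()
    split = pd.Timestamp(split_date)
    in_sample = daily_pnl[daily_pnl.index < split]
    out_sample = daily_pnl[daily_pnl.index >= split]
    return sharpe(in_sample), sharpe(out_sample)

# A large drop from the in-sample to the out-of-sample Sharpe is a warning
# sign that the model was selected mainly for its fit to history.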

When using data in alpha design, we should always consider a sanity check. In historical simulation, a single bad data point can kill the entire alpha signal. In live production, it is very dangerous to assume that the data will always be correct: if the data goes wrong, it can completely distort the alpha signals and cause a big loss. We can do some basic checking, such as removing outliers in the alpha code. With these basic safeguards, our alpha will be more robust.
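One simple safeguard is to clip extreme values before they enter the signal. The sketch below shows winsorization at fixed percentiles; the 1%/99% cutoffs are illustrative, not a recommendation from the text.

# A minimal sketch of a basic sanity check: clipping extreme values at fixed
# percentiles (winsorization) so a single bad print cannot distort the signal.
import pandas as pd

def winsorize(values, lower=0.01, upper=0.99):
    """values: raw cross-sectional data for one date, as a pandas Series."""
    lo, hi = values.quantile(lower), values.quantile(upper)
    return values.clip(lower=lo, upper=hi)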

UNDERSTAND THE DATA BEFORE USING IT

Serious alpha research is based on a deep understanding of the data. We should always try to understand the data so we can make better use of it. For some simple data, just crunching the numbers to make an alpha may be fine. For complicated data, however, a deep understanding will definitely make a difference. Sometimes, to understand certain data, we need to acquire extra knowledge; to understand hundreds of fundamental factors, for example, we need to learn some concepts of corporate finance. Only when we fully understand the data can we come up with alpha ideas that make sense. Alphas based on a deep understanding of the underlying data are more robust and more likely to survive.

EMBRACE THE BIG DATA ERA

Nowadays, data grows explosively, and it grows in three dimensions: variety, volume, and velocity. In the past, perhaps only price/volume and fundamental data were considered for predicting stock price movements. Today we have many more choices, so many types of data that we can sometimes test quite unconventional ideas. Kamstra et al. (2002) present a “SAD” effect (seasonal affective disorder): stock market returns vary seasonally with the length of the day. Hirshleifer and Shumway (2003) found that morning sunshine at a country’s leading stock exchange can predict market index returns that day. Preis et al. (2013) used Google Trends data and beat the market significantly: a 326% return versus 16%.

A huge amount of data is created every day. Take the US equity market, for example: level 1 tick data is about 50 gigabytes per day, and level 2 tick data is over 100 gigabytes per day. Social media also contributes a lot of data; Twitter users send out 58 million tweets per day on average.

Data is also created fast. The latency for high-frequency trading firms is now in single-digit microseconds, and firms keep buying faster processing and faster connections to trade faster. Data vendors likewise push speed to the limit to win more clients.

More data is always better, as long as we can handle it. Managing the rapid growth of data is very challenging: there are cost considerations for storage devices, computing machines, customized databases, and so on. On the other hand, if we handle the data well, we have more data to play with. We can make more alphas, faster alphas, and better alphas. So a big investment in big data can mean bigger returns.
