Backtesting

Backtesting means calculating the results of a trading strategy on a historical dataset. In our case, we will use the same dataset on which our statistical models were optimized, so we will overestimate the effectiveness of the strategies. In real life, we might use a different time period or a different group of equities (or both) to measure efficiency more objectively.
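The out-of-sample idea can be illustrated with a minimal sketch. It assumes a hypothetical data frame `toy` with made-up columns `cash_ratio` and `trs`; neither the data nor the simulated relationship comes from the dataset used in this chapter. A threshold is tuned on one half of the rows and then evaluated on the other half, which played no part in choosing it.

```r
set.seed(42)                                  # reproducible toy data
n <- 1000
toy <- data.frame(cash_ratio = runif(n, 0, 3))
# Synthetic return: the link to cash_ratio is an assumption for illustration
toy$trs <- 5 + 4 * toy$cash_ratio + rnorm(n, sd = 20)

# Randomly split the rows into a fitting half and a hold-out half
in_fit  <- sample(c(TRUE, FALSE), n, replace = TRUE)
fit     <- toy[in_fit, ]
holdout <- toy[!in_fit, ]

# "Optimize" the threshold on the fitting half only
candidates <- seq(0.5, 2.5, by = 0.1)
gap <- sapply(candidates, function(k)
  mean(fit$trs[fit$cash_ratio > k]) - mean(fit$trs[fit$cash_ratio <= k]))
best <- candidates[which.max(gap)]

# Evaluate the same rule on the hold-out half
in_sample  <- max(gap)
out_sample <- mean(holdout$trs[holdout$cash_ratio > best]) -
              mean(holdout$trs[holdout$cash_ratio <= best])
print(c(threshold = best, in_sample = in_sample, out_of_sample = out_sample))
```

Because the threshold was picked to maximize the in-sample gap, the in-sample figure tends to flatter the rule; the hold-out figure is the more honest estimate.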

No matter how we got the best performers separated, testing the investment idea follows the same logic. You translate the result into rules, pick the firms (normally from a different sample) that fulfill the requirements and place them into one cluster, and then create another cluster to contain all the other companies. Finally, compare the mean and/or median performance of the two groups.

To test the selection rules of the decision tree, we first create a subset of firms that fulfill all of the following requirements: a cash ratio above 1.6, a fixed asset ratio exceeding 12.3 percent, an asset/employee rate below 398, an asset turnover below 1.66, and one-year revenue growth for the previous year above 43.5 percent. Then, we add the firms with a cash ratio below 1.6 and an asset/employee rate above 2156:

d$condition1 <- (d[,3]  >   1.6) 
d$condition2 <- (d[,4]  >  12.3) 
d$condition3 <- (d[,5]  <   398) 
d$condition4 <- (d[,10] <  1.66) 
d$condition5 <- (d[,13] >  43.5)
d$selected1 <- d$condition1 & d$condition2 & d$condition3 & d$condition4 & d$condition5
d$condition6 <- (d[,3]  <   1.6)
d$condition7 <- (d[,5]  >  2156) 
d$selected2  <- d$condition6 & d$condition7
d$tree <- d$selected1 | d$selected2

The preceding code creates two logical variables (one for each subset) that are TRUE if the requirements are fulfilled and FALSE otherwise. A third variable, tree, combines them with a logical OR. This way, we end up with two clusters: TRUE for firms qualifying for investment and FALSE for all others. Next, we calculate the mean, the number of firms, the standard deviation, and the median of the total return (column 19) for both clusters and for the whole sample:

f <- function(x) c(mean(x), length(x), sd(x), median(x))
report <- aggregate( x = d[,19], by = list(d$tree), FUN = f )$x
colnames(report) = c("mean","N","standard deviation","median")
report <- rbind(report, f(d[,19]))
rownames(report) <- c("Not selected","Selected","Total")
print(report)

Once we are ready with the reclustering, an ANOVA table will help us compare the performance of the firms selected and not selected. To ensure that significantly different averages are not simply due to outliers, it is always wise to compare medians too. In our case, the categorization seems to work just fine, as even the medians show a huge difference:

                  mean    N standard deviation    median
Not selected  5.490854 6588           22.21786  3.601526
Selected     19.620651  260           24.98839 15.412807
Total         6.384709 7198           23.08327  4.245684
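The comparison of means and medians can also be backed by formal tests. The following is only a sketch on simulated data: the data frame `toy`, its columns `trs` and `group`, and the group sizes are all assumptions made for illustration, not the chapter's dataset. It uses `aov()` from base R for a one-way ANOVA table and `wilcox.test()` as a rank-based check that is much less sensitive to outliers.

```r
set.seed(1)
# Hypothetical stand-in for the real data: a return column and a group label
toy <- data.frame(
  trs   = c(rnorm(900, mean = 5, sd = 22), rnorm(100, mean = 19, sd = 25)),
  group = factor(rep(c("Not selected", "Selected"), times = c(900, 100)))
)

# One-way ANOVA table: does the mean return differ between the two groups?
anova_table <- summary(aov(trs ~ group, data = toy))
print(anova_table)

# Wilcoxon rank-sum test: compares the groups using ranks instead of raw values
rank_test <- wilcox.test(trs ~ group, data = toy)
print(rank_test$p.value)
```

If both tests point in the same direction, it is less likely that a handful of extreme returns is driving the difference in averages.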

Testing the cluster-based investment idea is slightly more complicated. Here, we only see that the cluster of the better firms differs on average in some respects from the other two groups. It is important to notice that these were not the differences that we used to create the clusters; we are simply turning the logic around and saying that criteria on the financial ratios may result in separating the better performers.

We need to go through all eight variables that showed significant differences and create a range of acceptance for each. Using very narrow ranges may leave a very small number of shares to pick from; applying a range that is far too wide will make the difference in total return (TRS) between the groups disappear. Once again, checking medians may help.
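The trade-off between narrow and wide ranges can be explored by sweeping a threshold and watching both the number of selected firms and the gap in medians. The sketch below uses simulated data; the data frame `toy`, the columns `ratio` and `trs`, and the candidate minimum values are hypothetical and chosen only for illustration.

```r
set.seed(7)
n <- 2000
# Hypothetical ratio and return; the link between them is made up
toy <- data.frame(ratio = rnorm(n, mean = 30, sd = 10))
toy$trs <- 0.5 * toy$ratio + rnorm(n, sd = 20)

# Progressively lower minimum thresholds, that is, wider acceptance ranges
mins <- c(45, 40, 35, 30, 25, 20)
sweep <- t(sapply(mins, function(m) {
  sel <- toy$ratio >= m
  c(min        = m,
    N_selected = sum(sel),
    median_gap = median(toy$trs[sel]) - median(toy$trs[!sel]))
}))
print(round(sweep, 2))
```

Each row of the result shows how many firms pass a given minimum and how far the median return of the selected firms is from that of the rest, which makes it easier to pick a range that is neither too strict nor too loose.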

To get the means and medians for the three clusters that we identified previously, we will use the following code. To save space when printing the table, we numbered the three groups instead of using their original names:

  1. Underperformers
  2. Mid-range performers
  3. Overperformers.

Here is the code:

d$cluster = k_clust$cluster
# Columns to summarize: total return (19) and the eight financial ratios
vars <- c(19,3,4,6,10,12,14,16,17)
# Group by the k-means cluster; [-1,] drops the row holding the group labels
z <- round(cbind(
  t(aggregate(d[,vars], list(d$cluster), function(x) mean(x, na.rm = TRUE))),
  t(aggregate(d[,vars], list(d$cluster), function(x) median(x, na.rm = TRUE)))
)[-1,], 2)
colnames(z) = c("1-mean","2-mean","3-mean","1-median", "2-median", "3-median")
z

                                   1-mean 2-mean 3-mean 1-median 2-median 3-median
Total.Return.YTD..I.               -16.62   9.41  48.07   -13.45     8.25    42.28
Cash.Assets.Y.1                     15.58  13.11  12.98    11.49     9.07     8.95
Net.Fixed.Assets.to.Tot.Assets.Y.1  26.93  29.76  31.97    21.87    24.73    26.78
P.CF.5Yr.Avg.Y.1                    18.75  19.46  28.72    11.19    10.09    10.08
Asset.Turnover.Y.1                   1.13   1.06   1.05     0.96     0.89     0.91
OI...Net.Sales.Y.1                  13.77  14.71  15.02    10.59    11.23    11.49
LTD.Capital.Y.1                     17.28  20.41  17.21    11.95    16.55    10.59
Market.Cap.Y.1                     278.06 659.94 603.10     3.27     4.97     4.43
P.E.Y.1                             20.81  19.79  19.46    16.87    15.93    14.80

The following table shows the rules that we developed based on the ANOVA table for the clusters. Due to small differences or overlapping ranges, we dropped three variables from the criteria. Remember that your main task is to separate overperformers from underperformers, so an overlap with the mid-range is more acceptable than an overlap with the underperformers (set wider ranges of acceptance where the mid-range really is in the middle).

 

Variable                           Min      Max
Cash/Assets                        none     14
Net Fixed Assets to Total Assets   23       none
P/CF 5Yr Average                   dropped  dropped
Asset Turnover                     none     1.7
OI / Net Sales                     11       none
LTD/Capital                        dropped  dropped
Market Cap (M)                     dropped  dropped
P/E                                none     20

Table 1

With the following code, we will first arrange all the requirements into one variable. Then, a final comparison table is created:

# Combine the five remaining rules from Table 1 into one logical variable
d$selected <- (d[,3] <= 14) & (d[,4] >= 23) & (d[,10] <= 1.7) & (d[,12] >= 11) & (d[,17] <= 20)
# Firms with a missing ratio cannot be checked, so treat them as not selected
d$selected[is.na(d$selected)] <- FALSE
h <- function(x) c(mean(x, na.rm = TRUE), length(x[!is.na(x)]), sd(x, na.rm = TRUE), median(x, na.rm = TRUE))
backtest <- aggregate(d[,19], list(d$selected), h)
backtest <- backtest$x
backtest <- rbind(backtest, h(d[,19]))
colnames(backtest) = c("mean", "N", "Stdev", "Median")
rownames(backtest) = c("Not selected", "Selected", "Total")
print(backtest)
                 mean    N    Stdev   Median
Not selected 5.887845 6255 23.08020 3.710650
Selected     9.680451  943 22.84361 7.644033
Total        6.384709 7198 23.08327 4.245684

As you can see, our selected firms have an average return of 9.68 percent and a median return of 7.64 percent. We may therefore conclude that the strategy based on the decision tree performed better with respect to both the mean (19.05 percent) and the median (14.98 percent). To check the overlap between the two strategies, we will calculate a crosstab:

d$tree <- tree$where %in% c(13,17)
crosstable <- table(d$selected, d$tree)
rownames(crosstable) = c("cluster-0","cluster-1")
colnames(crosstable) = c("tree-0","tree-1")
crosstable <- addmargins(crosstable)
crosstable

           tree-0 tree-1  Sum
  cluster-0   5970    285 6255
  cluster-1    817    126  943
  Sum         6787    411 7198

Here, we see that the two strategies are pretty different: only 126 firms got selected under both strategies. But are they something extraordinary? Indeed. These shares achieved an average TRS of 19.9 percent with a median of 14.4, which is calculated as follows:

mean(d[d$selected & d$tree,19])
[1] 19.90455
median(d[d$selected & d$tree,19])
[1] 14.43585
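The same check can be extended to every cell of the crosstab at once. The following sketch uses `tapply()` on simulated data; the data frame `toy` and its columns `selected`, `tree`, and `trs` are hypothetical stand-ins for d$selected, d$tree, and d[,19], and the simulated values say nothing about the real results.

```r
set.seed(3)
n <- 2000
# Hypothetical flags and returns, generated independently for illustration
toy <- data.frame(
  selected = runif(n) < 0.15,
  tree     = runif(n) < 0.06,
  trs      = rnorm(n, mean = 6, sd = 23)
)

# Mean, median, and count of returns for each combination of the two flags
groups      <- list(cluster = toy$selected, tree = toy$tree)
cell_mean   <- tapply(toy$trs, groups, mean)
cell_median <- tapply(toy$trs, groups, median)
cell_n      <- table(cluster = toy$selected, tree = toy$tree)

print(round(cell_mean, 2))
print(round(cell_median, 2))
print(cell_n)
```

The cell where both flags are TRUE corresponds to the firms picked by both strategies, and the other three cells show how each strategy performs on the firms that the other one rejected.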