Long before the concept of what's trending became a popular topic of study for data scientists, there was an older concept that is still not well served by data science: trends. At present, the analysis of trends, if it can be called that, is primarily carried out by people "eyeballing" time series charts and offering interpretations. But what is it that people's eyes are doing?
This chapter describes an implementation in Apache Spark of a new algorithm for studying trends numerically, called TrendCalculus, invented by Andrew Morgan. The original reference implementation is written in Lua and was open-sourced in 2015; the code can be viewed at https://bitbucket.org/bytesumo/trendcalculus-public.
This chapter explains the core method, which quickly extracts trend change points from a time series; these are the moments when trends change direction. We will describe the TrendCalculus algorithm in detail while implementing it in Apache Spark. The result is a set of scalable functions to quickly compare trends across time series, to make inferences about trends, and to examine correlation across timeframes. Using these new methods, we demonstrate how to construct a causal ranking technique to extract potential causal models from across thousands of time series inputs.
In this chapter we will learn:
When presented with a problem, among the first hypotheses that data scientists consider are those related to trends. Trends are an excellent way to visualize data and lend themselves particularly well to large datasets, where the general direction of change can often be seen. In Chapter 5, Spark for Geographic Analysis, we produced a simple algorithm to attempt to predict the price of crude oil. In that study, we concentrated on the direction of change in the price, which is, by definition, the trend of the price. Trends are a natural way to think, explain, and forecast.
To explain and demonstrate our new trend methods, this chapter is organized into two sections. The first is technical and delivers the code we need to execute our new algorithm. The second applies that method to real data. We hope it demonstrates that trends, despite their apparent simplicity as a concept, can be more complicated to calculate than we may have first thought, particularly in the presence of noise. Noise produces many local highs and lows (referred to as jitter in this chapter), which make it difficult to find trend turning points and to determine the general direction of change over time. Ignoring noise in a time series and extracting interpretable trend signals are the central challenges we demonstrate how to overcome.
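The effect of jitter can be illustrated with a minimal Python sketch. This is not TrendCalculus and not the chapter's Spark code; it is a naive turning-point detector, written only to show the problem. The function name `naive_turning_points` and the synthetic series are assumptions made for illustration. The clean series rises and then falls, so it has exactly one turning point; adding a small deterministic zigzag makes every interior point look like a turning point.

```python
def naive_turning_points(values):
    """Return the indices where the first difference changes sign.

    Hypothetical helper for illustration only: it treats every local
    high or low as a turning point, which is exactly what fails on
    noisy data.
    """
    points = []
    for i in range(1, len(values) - 1):
        before = values[i] - values[i - 1]
        after = values[i + 1] - values[i]
        if before * after < 0:  # sign flip means a local high or low
            points.append(i)
    return points


# Synthetic clean series: rises from 0 to 10, then falls back to 0.
underlying = list(range(0, 11)) + list(range(9, -1, -1))

# Deterministic jitter: +1.5 on odd indices, -1.5 on even indices.
noisy = [v + (1.5 if i % 2 else -1.5) for i, v in enumerate(underlying)]

print(naive_turning_points(underlying))   # [10]
print(len(naive_turning_points(noisy)))   # 19
```

The single true reversal at index 10 is buried among 19 local highs and lows, even though the overall rise and fall is unchanged. A trend method therefore has to separate the general direction of change from this jitter rather than react to every local extreme.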
The dictionary definition of trend is a general direction in which something is developing or changing, but there are other, more focused definitions that might be more helpful for guiding data science. Two such definitions come from Eurostat, the official statistical agency of the European Union, and Salomé Areias, who studies social trends:
"A trend is the slow variation over a longer period of time, usually several years, generally associated with the structural causes affecting the phenomenon being measured." - EUROSTAT, official statistical agency in the European Union (http://ec.europa.eu/eurostat/statistics-explained/index.php/Glossary:Trend)
"A Trend is defined by a shift in behavior or mentality that influences a significant amount of people." - Salomé Areias, social trend commentator (https://salomeareias.wordpress.com/what-is-a-trend/)
We generally think of trends as nothing more than a long rise or fall in stock market prices. However, trends can also refer to many other use cases that relate to economics, politics, popular culture, and society: for example, the study of sentiments revealed by media outlets when they report on the news. In this chapter, we will use the price of oil as a simple demonstration; however, the technique could be applied to any data where trends occur in the following manner: