Time series have become ubiquitous in the modern era. Every transaction at a shop, every click on a web page, every interaction with an advertisement leaves some kind of event in a database. These mountains of data are a treasure trove for those who know how to extract and analyze them, and a large share of the data points are time series. In this blog post, I will use Excel to showcase what can easily go wrong with time series data.
According to Wikipedia, "a stationary process is a stochastic process whose unconditional joint probability distribution does not change when shifted in time" (1). In plain terms, the mean and standard deviation of the series remain the same over time.
In the real world, almost all time series tend to be non-stationary. Knowing whether something is stationary or non-stationary has a big impact on the analyses that can be conducted on the data.
You can quite often get a good idea of whether a series is stationary by simply looking at it: if it fluctuates randomly around a constant level, you probably have a stationary time series. There are more rigorous statistical tests for drawing conclusions, but this is a good starting point.
As an example, the following chart shows the Apple stock price over the last year (2). This series is non-stationary, as can be seen from the value slowly drifting up and down.
On the other hand, the difference in the stock price (daily change) is stationary. The change in the stock price fluctuates around zero at all times.
What makes non-stationary time series a major issue in analytics is that many widely used algorithms require independent observations. In a non-stationary time series, the previous value largely explains the following value, so independence is violated.
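To make the dependence concrete, here is a small Python/NumPy sketch (my own illustration, not part of the original Excel workbook): in a random walk the lag-1 autocorrelation is close to 1, meaning each value is almost entirely explained by the previous one, while in the differenced series it is close to 0.

```python
import numpy as np

# A random walk is non-stationary: each value is the previous value plus noise,
# so consecutive observations are strongly dependent.
rng = np.random.default_rng(3)
walk = np.cumsum(rng.normal(size=500))

def lag1_autocorr(x):
    """Correlation between each value and the one immediately before it."""
    return np.corrcoef(x[:-1], x[1:])[0, 1]

print(f"random walk lag-1 autocorrelation: {lag1_autocorr(walk):.3f}")          # close to 1
print(f"differenced lag-1 autocorrelation: {lag1_autocorr(np.diff(walk)):.3f}") # close to 0
```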
The Excel file is available here if you wish to look around by yourself.
The example in question was calculated using Excel. It uses a simple Pearson correlation (the most widely used correlation coefficient; there are plenty of others as well) to highlight why non-stationary time series can seem to be related even when the values are purely random. Because the values are random, you can generate new ones by pressing F9, which forces a recalculation. The rest of the example uses the specific values calculated on my local machine, but the results will be similar when you recalculate.
In this example, the starting value of each time series is a uniformly distributed random value between -100 and 100 (3). Each subsequent value is the previous value plus a random number drawn from a normal distribution with mean 0 and standard deviation 1 (4). The most important thing to remember is that each value depends only on the previous value in the same series, not on any other series.
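The same construction can be sketched in Python with NumPy (a stand-in for the Excel formulas; the seed and series length are my own choices):

```python
import numpy as np

rng = np.random.default_rng(42)

def random_walk(n_steps, rng):
    """Random walk as described above: start uniformly in [-100, 100],
    then add N(0, 1) increments. Each value depends only on the previous one."""
    start = rng.uniform(-100, 100)
    increments = rng.normal(loc=0.0, scale=1.0, size=n_steps - 1)
    return np.concatenate([[start], start + np.cumsum(increments)])

walk = random_walk(250, rng)
print(walk[:5])
```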
Excel contains a function for the Pearson correlation coefficient, so to save some time we use it. Put simply, Pearson gives a value between -1 and 1: 0 means no correlation, values close to 1 signify a very strong positive linear correlation, and values close to -1 a very strong negative linear correlation. (5)
Significance is calculated using a two-tailed t-test. In short, p-values under 0.05 imply that the correlation is statistically significant.
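For readers who prefer code to spreadsheet formulas, the same two numbers (the Pearson coefficient and its two-tailed p-value) can be computed with SciPy. This is a small sketch with made-up data, not the workbook's values:

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 0.8 * x + rng.normal(size=100)  # genuinely related to x
z = rng.normal(size=100)            # independent of x

# pearsonr returns the correlation coefficient and a two-tailed p-value
r_xy, p_xy = pearsonr(x, y)
r_xz, p_xz = pearsonr(x, z)
print(f"related pair:   r={r_xy:.2f}, p={p_xy:.4f}")
print(f"unrelated pair: r={r_xz:.2f}, p={p_xz:.4f}")
```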
The following image shows what the values look like from a single experiment. As one can see, the values are relatively stable but the process tends to drift higher or lower.
The following table displays the correlation values for the previous time series. As the table shows, there are some strong correlations (over 0.7) even though the series are completely independent of one another.
The situation becomes even worse when you analyze the statistical significance of the correlations. Almost all of them are statistically significant, i.e. the test concludes that the correlation differs from zero.
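This spurious significance is easy to reproduce outside Excel. The sketch below (my own simulation, with arbitrary seed and sample sizes) generates pairs of completely independent random walks and counts how often the Pearson test nonetheless declares them significantly correlated:

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(1)

def random_walk(n, rng):
    """Uniform start in [-100, 100] plus cumulative N(0, 1) increments."""
    start = rng.uniform(-100, 100)
    return start + np.concatenate([[0.0], np.cumsum(rng.normal(size=n - 1))])

n_pairs, n_points = 200, 250
significant = 0
for _ in range(n_pairs):
    a = random_walk(n_points, rng)
    b = random_walk(n_points, rng)
    _, p = pearsonr(a, b)  # two-tailed p-value
    if p < 0.05:
        significant += 1

# Far more than the nominal 5% of independent pairs come out "significant"
print(f"{significant}/{n_pairs} independent pairs look significant at p < 0.05")
```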
Taking the first difference of a time series turns the non-stationary series into a stationary one (in some cases multiple differences are needed, but usually a single difference is sufficient). Looking at the correlations of the differenced series, we notice that all of them are below 0.1, a dramatic drop in the correlation values. The significance levels confirm this: almost none of the correlations are statistically significant.
This example only looked at correlation, but even good old reliable regression suffers if the data contain a non-stationary time series: plenty of relationships will appear even when the series are unrelated. The most important thing for managers to remember is that even when an analysis points to statistically significant results, you must be a bit careful, especially if you know the data come from a time series. Even a seasoned veteran can accidentally produce faulty results by forgetting to check the background assumptions behind the analytics they are using.
(1) https://en.wikipedia.org/wiki/Stationary_process
(2) https://finance.yahoo.com/quote/AAPL/history/
(3) https://en.wikipedia.org/wiki/Continuous_uniform_distribution
(4) https://en.wikipedia.org/wiki/Normal_distribution
(5) https://en.wikipedia.org/wiki/Pearson_correlation_coefficient