## Wandering into the world of time-series data with linear regression

In the previous blog posts ("How the nonlinear world cheats on us" and "Treacherous time-series") we looked at non-linear data and time series in the context of linear models. In this post, we will dive a little deeper and investigate what kind of models linear regression generates when it is fed time-series data.

## Linear regression

Linear regression is a solid method with a long history. A simple Google search for "most popular machine learning algorithms" will typically list linear regression among the most used methods in machine learning. According to Wikipedia, linear regression in statistics is defined as "a linear approach for modeling the relationship between a scalar response and one or more explanatory variables (also known as dependent and independent variables)" (1). Overall, this ends up looking like the following formula:
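Written out in the usual textbook notation for $n$ explanatory variables (the symbols here are the standard convention, not taken from the Excel file):

$$
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n + \varepsilon
$$

Here $\beta_0$ is the intercept, $\beta_1, \dots, \beta_n$ are the coefficients and $\varepsilon$ is the error term.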

Each explanatory variable (the x's in the equation) has its own coefficient (the betas), which shows its impact on the estimated value (y). As almost no model represents the target variable perfectly, there is also an error term in the model. In addition to the coefficient, the regression output reports for each explanatory variable whether it is statistically significant, i.e. whether its coefficient differs enough from 0. In other words, a coefficient that is not significant is so small relative to its uncertainty that it could just as well be zero.
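As a reminder of how significance is usually assessed in ordinary least squares output (a sketch in standard notation, where $\hat{\beta}_j$ is the estimated coefficient and $\mathrm{SE}(\hat{\beta}_j)$ its standard error):

$$
t_j = \frac{\hat{\beta}_j}{\mathrm{SE}(\hat{\beta}_j)}
$$

The statistic $t_j$ is compared against a t-distribution to obtain a p-value. A small p-value means that an estimate this far from zero would be unlikely if the true coefficient were zero, provided the model assumptions hold.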

It is also important to understand that in most cases when someone uses the term *linear regression*, they mean *ordinary least squares* regression. Notably, there are many other kinds of linear regression (such as lasso, ridge and least-angle regression), which end up creating a similar kind of linear model.
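To make the difference concrete, here is a sketch of the objectives in standard notation (with $m$ observations, $n$ explanatory variables and a penalty weight $\lambda \ge 0$ chosen by the modeller). Ordinary least squares picks the coefficients that minimise the sum of squared errors:

$$
\hat{\beta}^{\text{OLS}} = \arg\min_{\beta} \sum_{i=1}^{m} \Big( y_i - \beta_0 - \sum_{j=1}^{n} \beta_j x_{ij} \Big)^2
$$

Ridge regression adds the penalty $\lambda \sum_{j=1}^{n} \beta_j^2$ to this sum, and lasso adds $\lambda \sum_{j=1}^{n} |\beta_j|$ instead. The fitted model has the same linear form in every case; only the way the coefficients are chosen differs.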

### Linear regression assumptions

Linear regression assumptions are as follows (2):

- Linear relationship between the variables (if x increases by 1 unit, y will always change by the same n units)
- Independence of observations
- Normally distributed errors (difference between the observed value and the estimated value)
- No autocorrelation (no correlation over time) in the errors
- No multicollinearity (correlation between the explanatory variables)
- Homoscedastic errors (basically, the spread of the errors stays in the same ballpark for all values)

When the assumptions are violated, the models will be faulty in some way. There are plenty of tests to detect violations (such as Durbin-Watson for autocorrelation and Jarque-Bera for normality of the errors) and methods to improve the situation (such as the Box-Cox transformation), but the model constructor needs to take these into account.
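As a rough illustration of two of the tests mentioned above, the following is a minimal sketch of the Durbin-Watson and Jarque-Bera statistics written with plain NumPy. The function names and the three simulated series are made up for this example; they stand in for model errors and have nothing to do with the Excel file.

```python
import numpy as np

def durbin_watson(resid):
    """Durbin-Watson statistic: about 2 means no first-order autocorrelation,
    values near 0 mean strong positive autocorrelation."""
    resid = np.asarray(resid, dtype=float)
    return float(np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2))

def jarque_bera(resid):
    """Jarque-Bera statistic and p-value for the null of normally distributed errors."""
    resid = np.asarray(resid, dtype=float)
    n = resid.size
    centered = resid - resid.mean()
    m2 = np.mean(centered ** 2)
    skew = np.mean(centered ** 3) / m2 ** 1.5
    kurt = np.mean(centered ** 4) / m2 ** 2
    jb = n / 6.0 * (skew ** 2 + (kurt - 3.0) ** 2 / 4.0)
    # Under the null, JB is asymptotically chi-squared with 2 degrees of
    # freedom, whose survival function is exactly exp(-x / 2).
    return float(jb), float(np.exp(-jb / 2.0))

# Hypothetical stand-ins for model errors
rng = np.random.default_rng(42)
white_noise = rng.standard_normal(1000)   # well-behaved errors
random_walk = white_noise.cumsum()        # strongly autocorrelated errors
skewed = rng.exponential(size=1000)       # clearly non-normal errors

print("DW, white noise:", round(durbin_watson(white_noise), 2))
print("DW, random walk:", round(durbin_watson(random_walk), 2))
print("JB p-value, skewed errors:", jarque_bera(skewed)[1])
```

In practice you would feed the residuals of a fitted model into these functions. The statsmodels package ships ready-made versions (`durbin_watson` and `jarque_bera` in `statsmodels.stats.stattools`), which are likely a better choice than hand-rolled code.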

### What is wrong with linear regression?

Absolutely nothing. The main issue with linear regression is that if the assumptions are not valid, the models will not be good. It is easy to run a linear regression even with Excel's own Analysis ToolPak, but interpreting the output requires knowing which of the reported numbers tell you whether the background assumptions hold.

## Linear regression with time series data

The following example also uses random time-series data in an Excel file, where the file contains all the required calculations for a linear regression and a results worksheet. The file can be downloaded here.

In this example, we run a simple ordinary least squares regression and deliberately skip all of the background checks. As such, this example represents a case in which a modeller runs a quick analysis and accidentally generates a model that should not be used.

### Example results from the Excel file

There are two different worksheets in the Excel file. The first one, "Data", contains all of the calculations linked to the linear regression, while the second one, "Results", shows what the regression looks like. One example is posted as a picture below:

From the picture alone it looks like we have quite a nice model on our hands. You will get a similar result whenever you recalculate the values.

### Number of statistically significant explanatory variables

The Excel file also contains information on whether the coefficients and the intercept are statistically significant. Column AB contains the coefficients, while the significances (p-values) are presented in column AE. Normally 0.05 is used as the threshold for significance, so if the p-value is below 0.05 we can assume that the coefficient differs enough from zero.

I recalculated the Excel file a hundred times and counted how many of the explanatory variables had a statistically significant coefficient each time. The number of significant variables was distributed as follows:

- 6: 11 cases
- 7: 26 cases
- 8: 18 cases
- 9: 24 cases
- 10: 21 cases

From the results, we can conclude that a regression on random time series will produce multiple statistically significant explanatory variables even though the series have no real relationship with each other. Notably, this does not apply only to a small number of the variables, but rather to most of them. As such, without an investigation of the assumptions, it is extremely easy to draw incorrect conclusions about causality between the explanatory variables and the dependent variable.
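The experiment can be approximated outside Excel. The sketch below is not a reproduction of the Excel file: it assumes the random series are independent random walks, uses ten explanatory variables and 500 observations, and treats |t| > 1.96 as an approximation of the 0.05 threshold. All of these choices, and the function names, are assumptions made for illustration.

```python
import numpy as np

def ols_t_stats(X, y):
    """OLS coefficients and their t-statistics; X must already contain an intercept column."""
    n_obs, n_params = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n_obs - n_params)  # residual variance
    std_err = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
    return beta, beta / std_err

def count_significant(rng, n_obs=500, n_vars=10, random_walk=True, crit=1.96):
    """Count slope coefficients with |t| > crit for mutually independent series."""
    noise = rng.standard_normal((n_obs, n_vars + 1))
    # Assumption: "random time series" is modelled as a random walk (cumulative sum of noise)
    series = noise.cumsum(axis=0) if random_walk else noise
    y = series[:, 0]
    X = np.column_stack([np.ones(n_obs), series[:, 1:]])
    _, t_stats = ols_t_stats(X, y)
    return int(np.sum(np.abs(t_stats[1:]) > crit))  # skip the intercept

rng = np.random.default_rng(0)
walk_counts = [count_significant(rng, random_walk=True) for _ in range(100)]
noise_counts = [count_significant(rng, random_walk=False) for _ in range(100)]

# Position k holds the number of runs with exactly k significant variables
print("random walks:", np.bincount(walk_counts, minlength=11))
print("white noise: ", np.bincount(noise_counts, minlength=11))
```

The white-noise runs serve as a control: the series are just as unrelated, but without the time-series structure only about one coefficient in twenty should cross the threshold by chance, whereas the random walks tend to produce far more.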

### What is wrong with the results?

As indicated above, we are ignoring all of the various checks for potential issues with the background assumptions. Even without having access to the values, it is almost certain that there is autocorrelation in the model residuals (the previous model error is very close to the current error), and there is likely strong multicollinearity between the explanatory variables. We also know, because the data is random, that there is no real relationship between the variables, and the observations are not independent of each other. As such, the model fails most of the assumptions. Hence, the model should never be used, even though statistically significant results are present.
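Two of these suspicions are easy to check numerically. The following sketch again assumes independent random walks rather than the actual Excel data, fits the regression, and then computes the Durbin-Watson statistic of the residuals and the variance inflation factor (VIF) of each explanatory variable. The helper names are hypothetical.

```python
import numpy as np

def fit_ols(X, y):
    """Return OLS coefficients and residuals."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta, y - X @ beta

def r_squared(X, y):
    _, resid = fit_ols(X, y)
    return 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)

def vif(regressors):
    """Variance inflation factor per column: 1 / (1 - R^2) from regressing it on the others."""
    n_obs, n_vars = regressors.shape
    factors = []
    for j in range(n_vars):
        others = np.delete(regressors, j, axis=1)
        X = np.column_stack([np.ones(n_obs), others])
        factors.append(1.0 / (1.0 - r_squared(X, regressors[:, j])))
    return np.array(factors)

# Assumption: 11 independent random walks, the first one is the dependent variable
rng = np.random.default_rng(1)
walks = rng.standard_normal((500, 11)).cumsum(axis=0)
y, regressors = walks[:, 0], walks[:, 1:]

X = np.column_stack([np.ones(len(y)), regressors])
_, resid = fit_ols(X, y)
dw = np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

print("Durbin-Watson:", round(dw, 2))          # far below 2 signals autocorrelated errors
print("largest VIF:", round(vif(regressors).max(), 1))
```

A Durbin-Watson value far below 2 flags positive autocorrelation in the residuals, and VIF values well above 1 flag multicollinearity. Which cut-off counts as too high is a judgment call; values of 5 or 10 are common rules of thumb for VIF.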

## Managerial implications

The managerial implications are very similar to those of the previous blog posts: statistical significance does not guarantee that there is a real relationship between variables. Even if the estimate looks nice in the training period, it does not mean that you have a good model. It is necessary to make sure that the background assumptions hold, or at least to use some kind of testing period to verify that the model keeps performing well. The basic tests that are used to check the assumptions would have highlighted the issues with the regression, but the model constructor needs to know what values to look for.
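The testing-period idea can be sketched as follows. The example assumes independent random walks, a 70/30 split in time order and R² as the performance measure; these are illustrative choices, not something taken from the Excel file.

```python
import numpy as np

def r2(y, y_hat):
    """Coefficient of determination; can be negative on data the model was not fitted to."""
    return 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

def train_test_r2(rng, n_obs=500, n_vars=10, train_share=0.7):
    """Fit on the first part of the series and score on both parts."""
    walks = rng.standard_normal((n_obs, n_vars + 1)).cumsum(axis=0)
    y = walks[:, 0]
    X = np.column_stack([np.ones(n_obs), walks[:, 1:]])
    split = int(n_obs * train_share)  # keep time order: no shuffling
    beta, *_ = np.linalg.lstsq(X[:split], y[:split], rcond=None)
    return r2(y[:split], X[:split] @ beta), r2(y[split:], X[split:] @ beta)

rng = np.random.default_rng(2)
scores = np.array([train_test_r2(rng) for _ in range(50)])

print("median R^2, training period:", round(np.median(scores[:, 0]), 2))
print("median R^2, testing period: ", round(np.median(scores[:, 1]), 2))
```

A model that fits the training period well but scores much worse, or even negatively, on the testing period has most likely picked up a relationship that is not really there.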

## Footnotes

(1) https://en.wikipedia.org/wiki/Linear_regression

(2) Keeping it simple by using the term error and not residual