Nonlinear interactions in a world dominated by linear analyses
Since the dawn of time, humankind has tried to make sense of the surrounding world, a pursuit that has led to significant scientific contributions. Many of these findings have changed how we view the world and how history itself has progressed (1). If you spend some time investigating these important equations, you will notice that most of them are nonlinear in some fashion. In this follow-up to the blog post on linear regressions, we will look into how nonlinear interactions are represented when we use linear models.
Nonlinear models
According to Wikipedia, a nonlinear system is one in which the change of the output is not proportional to the change of the input (2). Put simply, an equation like Y = x1 + x2 is linear, while Y = x1 * x2 is nonlinear. The nonlinear category covers all kinds of models: neural networks, polynomial equations, random forests, nonlinear regression, and so on. Linear models tend to be easier to estimate than nonlinear ones, which partially explains their popularity. In addition, they are very easy to reason about: if x1 increases by one unit, Y will always change by the same fixed number of units.
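To make the proportionality point concrete, here is a minimal Python sketch (Python is used here purely for illustration; the actual exercise lives in Excel):

```python
# Linear: the output always changes by the same amount per unit change in x1
def linear(x1, x2):
    return x1 + x2

# Nonlinear: the change in the output depends on the other input as well
def nonlinear(x1, x2):
    return x1 * x2

print(linear(2, 3) - linear(1, 3))        # 1, regardless of x2
print(linear(2, 9) - linear(1, 9))        # still 1
print(nonlinear(2, 3) - nonlinear(1, 3))  # 3, equal to x2
print(nonlinear(2, 9) - nonlinear(1, 9))  # 9: the effect of x1 depends on x2
```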
Extending the correlation analysis
If you wish to tag along and try this exercise yourself, the Excel file can be obtained from here.
We will create two different regression models and analyze how they show up in differenced correlations. As shown in the Treacherous time series blog post, with differenced time series almost nothing is statistically significant unless there is a real interaction between the variables. We will use this to our advantage and see how well these models correlate with their explanatory variables.
The linear model is a simple y = x1 + x2 + x3 + x4 + x5, while the nonlinear one is y = x1 * x2 / (x3 * x3). Let's add a small error to both models (uniformly distributed between -5% and +5%) to reflect real-world measurement errors.
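In code, the setup could look roughly like the sketch below. The details of how the explanatory variables are generated (random walks with an offset that keeps x3 safely away from zero) are my assumptions, not necessarily how the Excel file builds them:

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed for reproducibility
n = 100  # number of observations; the count in the Excel file may differ

# Explanatory variables as random walks, offset so x3 stays clearly above zero
x = {i: np.cumsum(rng.normal(0, 1, n)) + 50 for i in range(1, 6)}

def err():
    # Multiplicative measurement error, uniform between -5% and +5%
    return rng.uniform(0.95, 1.05, n)

y_linear = (x[1] + x[2] + x[3] + x[4] + x[5]) * err()
y_nonlinear = x[1] * x[2] / (x[3] * x[3]) * err()
```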
Monte Carlo analysis
To get a better understanding of the problem, we will recalculate the correlations multiple times. As there is much more variability in this analysis, we will run it 100 times and look at the statistical significance of each run. To run the analysis, I used a slow and dirty "copy-paste the previous results to a new row" tactic. It's also possible to create a macro or use an external tool like @Risk (3), but to keep things as simple as possible I chose the easy route. If you're following along in the Excel file, all of the values can be found in the Regression Models tab.
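For readers who prefer scripting over copy-pasting, a rough equivalent of the 100-run analysis could look like this (assuming the same data construction as in the earlier sketch, with scipy's pearsonr providing the correlation p-values on the differenced series):

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(42)

def one_run(n=100):
    # Rebuild the data exactly as in the earlier sketch
    x = {i: np.cumsum(rng.normal(0, 1, n)) + 50 for i in (1, 2, 3)}
    y = x[1] * x[2] / (x[3] * x[3]) * rng.uniform(0.95, 1.05, n)
    # Correlate the differenced series, as in the previous blog post
    dy = np.diff(y)
    return [pearsonr(np.diff(x[i]), dy)[1] for i in (1, 2, 3)]

# 100 analyses: each row holds the p-values for x1, x2 and x3
p_values = np.array([one_run() for _ in range(100)])
```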
Results
To recalculate the values (if necessary), press F9. As you can see from the significance values, the explanatory variables of the linear model are almost always statistically significant. In some runs a few of the explanatory variables lack a significant correlation, but in most cases they have one. The variables not connected to the model show almost no correlation, just as in the previous blog post.
The following figure shows the cumulative densities of the statistical significance for the three explanatory variables of the nonlinear model. A cumulative density shows the sorted significances (lower p-values, i.e. stronger statistical significance, on the left side), which makes it possible to estimate how frequently specific situations occur. To help you read the graph: a value of 70 on the bottom axis combined with 0.15 on the vertical axis means that 70% of the cases have a p-value of 0.15 or less.
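A rough recreation of such a chart from the simulated p-values could look like this (continuing the sketch above; matplotlib is assumed):

```python
import numpy as np
import matplotlib.pyplot as plt

# p_values: the (runs, variables) array from the Monte Carlo sketch above
for idx, label in enumerate(["x1", "x2", "x3"]):
    sorted_p = np.sort(p_values[:, idx])
    share = np.arange(1, len(sorted_p) + 1) / len(sorted_p) * 100
    plt.plot(share, sorted_p, label=label)

plt.axhline(0.05, linestyle="--", color="grey")  # 5% significance threshold
plt.xlabel("Share of runs (%)")
plt.ylabel("p-value")
plt.legend()
plt.show()
```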
If we investigate the statistical significance of the correlations, we can conclude that roughly one-third of the nonlinear correlations are not statistically significant (the portion where the line rises above 0.05). In some cases the correlation is indistinguishable from white noise, with no apparent connection between the variables (the far right side of the chart).
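With the p-values from the sketch above, the same share can be checked directly:

```python
# Share of runs per variable where the correlation is not significant (p > 0.05)
print((p_values > 0.05).mean(axis=0))
```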
If you look closely, you may spot that the yellow line (x3) is more robust, as the function depends much more strongly on it than on the other two variables (it appears squared in the denominator), but even x3 is far from perfect.
Managerial implications
Note that the above was only one small case with a nonlinear model that didn't use any coefficients; it's certainly possible to find nonlinear models that fare better or worse. The main takeaway is that nonlinear relationships can hide behind a simple linear analysis. The analysis works extremely well for linear models, but in many nonlinear cases the connections between the variables stay hidden.
This blog post is not a critique of linear models. They have their place and should be used frequently thanks to their simplicity and interpretability, but one has to be careful when drawing conclusions from them. In the worst case, based on this kind of linear analysis, one could claim that there is no connection, and the truth would be lost in time and space.
Footnotes
(1) https://www.sciencealert.com/the-17-equations-that-changed-the-course-of-history
(2) https://en.wikipedia.org/wiki/Nonlinear_system
(3) @Risk, a Monte Carlo simulation add-in for Microsoft Excel.