By applying non-linear transformations to our independent variables, we can enlarge the set of functions we search over while keeping the optimization procedure simple. Enlarging the function class introduces a potential tradeoff between model misspecification error and sampling error (i.e., overfitting).
Over the last few weeks, we’ve focused on models which are functions that are linear in both the parameters and our inputs $x$. In the notation below, $x$ is an array (a vector of independent variables), and $x_j$ is an element of that array (a particular independent variable). We take $x_0=1$, so that $\beta_0$ represents the intercept.
$$ f(\beta, x) = \sum_{j=0}^d \beta_j x_j $$
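As a quick illustration, here is that prediction rule written as a dot product in Python (a minimal sketch; the specific numbers are made up for the example):

```python
import numpy as np

# Linear-in-parameters model f(beta, x).
# We prepend x_0 = 1 so that beta_0 acts as the intercept.
def f(beta, x):
    x = np.concatenate(([1.0], x))  # x_0 = 1
    return np.dot(beta, x)

beta = np.array([2.0, 0.5, -1.0])   # [intercept, beta_1, beta_2]
x = np.array([3.0, 4.0])            # two independent variables
print(f(beta, x))                   # 2.0 + 0.5*3.0 - 1.0*4.0 = -0.5
```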
Doubling the Inputs
In the figure below, using the Brookline dataset, we first fit a linear regression model of math ~ income. We then generate predictions from the model using both the original incomes (left) and twice the incomes (right).
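A minimal sketch of this demonstration, assuming the Brookline data is available as a CSV with math and income columns (the file name brookline.csv is just a placeholder), using the statsmodels formula API:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Placeholder file name; the course materials may load the data differently.
df = pd.read_csv("brookline.csv")

# Fit the linear regression math ~ income.
model = smf.ols("math ~ income", data=df).fit()

# Predictions at the original incomes (left panel) ...
pred_original = model.predict(df)

# ... and at twice the incomes (right panel).
df_doubled = df.assign(income=2 * df["income"])
pred_doubled = model.predict(df_doubled)
```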
Doubling the Parameters
We can also generate predictions using both the fitted parameter values and twice those parameter values.
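Continuing the sketch above, we can form these predictions by hand from the design matrix, once with the fitted coefficients and once with the coefficients doubled. Because the model is linear in the parameters, doubling every coefficient exactly doubles every prediction:

```python
import numpy as np

# Fitted coefficients from the math ~ income model above: [intercept, slope].
beta_hat = model.params.values

# Design matrix with a column of ones for the intercept.
X = np.column_stack([np.ones(len(df)), df["income"]])

pred_beta = X @ beta_hat
pred_doubled_beta = X @ (2 * beta_hat)

# Linearity in the parameters: f(2*beta, x) = 2 * f(beta, x).
assert np.allclose(pred_doubled_beta, 2 * pred_beta)
```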
A key drawback of linear models is that they are in fact linear functions. This can be problematic when the Conditional Expectation Function is not linear. When we search over a restrictive set of functions, we’re placing an upper bound on how well we can do. There will always be some error because the thing that we would like to learn (the Conditional Expectation Function) is not in the set of functions that we are searching over. We refer to this error as Model Misspecification Error.
Model misspecification error is defined at the population level. It captures the error between the Conditional Expectation Function and the “best” function in the set of functions that we’re searching over. Given that we’ve restricted our attention so far to functions that are linear in both the parameters and the inputs, the Model Misspecification Error at a given data point is the difference in predicted value between the Conditional Expectation Function and the Population OLS Model.
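In symbols (writing $\beta^*$ for the Population OLS coefficients, a notation introduced just for this expression), the misspecification error at a point $x$ is:

$$ \text{Misspecification Error}(x) = \mathbb{E}[Y \mid X = x] - f(\beta^*, x), \qquad \beta^* = \arg\min_{\beta} \, \mathbb{E}\big[(Y - f(\beta, X))^2\big] $$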