*This post is the seventh one of our series on* *the history and foundations of econometric and machine learning models. The first fours were on econometrics techniques. Part 6 is online here.*

## Boosting and sequential learning

As we have seen before, modelling here is based on solving an optimization problem, and solving the problem described by equation (6) is all the more complex because the functional space \mathcal{M} is large. The idea of boosting, as introduced by Shapire & Freund (2012), is to learn, slowly, from the errors of the model, in an iterative way. In the first step, we estimate a model m_1 for y, from \mathbf{X}, which will give an error \varepsilon_1. In the second step, we estimate a model m_2 for \varepsilon_1, from X, which will give an error \varepsilon_2, etc. We will then retain as a model, after k iterations m^{(k)}(\cdot)=\underbrace{m_1(\cdot)}_{\sim y}+\underbrace{m_2(\cdot)}_{\sim \epsilon_1}+\underbrace{m_3(\cdot)}_{\sim \epsilon_2}+\cdots+\underbrace{m_k(\cdot)}_{\sim \epsilon_{k-1}}=m^{(k-1)}(\cdot)+m_k(\cdot)~~~(7)Here, the error \varepsilon is seen as the difference between y and the model m(\mathbf{x}), but it can also be seen as the gradient associated with the quadratic loss function. Formally, \varepsilon can be seen as \nabla\ell in a more general context (here we find an interpretation that reminds us of residuals in generalized linear models).

Equation (7) can be seen as a descent of the gradient, but written in a dual way. The problem will then be rewritten as an optimization problem: m^{(k)}=m^{(k-1)}+\underset{h\in\mathcal{H}}{\text{argmin}}\left\lbrace \sum_{i=1}^n \ell(\underbrace{y_i-m^{(k-1)}(\boldsymbol{x}_i)}_{\varepsilon_{k,i}},h(\boldsymbol{x}_i))\right\rbrace~~~(8)where the trick is to consider a relatively simple space \mathcal{H} (we will speak of “weak learner”). Classically, \mathcal{H} functions are step-functions (which will be found in classification and regression trees) called “stumps”. To ensure that learning is indeed slow, it is not uncommon to use a shrinkage parameter, and instead of setting, for example, \varepsilon_1=y-m_1 (\mathbf{x}), we will set \varepsilon_1=y-\alpha\cdot m_1 (\mathbf{x}) with \alpha\in[0.1]. It should be noted that it is because a non-linear space is used for \mathcal{H}, and learning is slow, that this algorithm works well. In the case of the Gaussian linear model, remember that the residuals \varepsilon=y-\mathbf{x}^T\beta are orthogonal to the explanatory variables, \mathbf{X}, and it is then impossible to learn from our errors. The main difficulty is to stop in time, because after too many iterations, it is no longer the m function that is approximated, but the noise. This problem is called overlearning.

This presentation has the advantage of having a heuristic reminiscent of an econometric model, by iteratively modelling the residuals by a (very) simple model. But this is often not the presentation used in the learning literature, which places more emphasis on an optimization algorithm heuristic (and gradient approximation). The function is learned iteratively, starting from a constant value, m^{(0)}=\underset{m\in\mathbb{R}}{\text{argmin}}\left\lbrace\sum_{i=1}^n \ell(y_i,m)\right\rbracethen we consider the following learning procedure{\displaystyle m^{(k)}=m^{(k-1)}+{\underset{h\in {\mathcal {H}}}{\text{argmin}}}\sum _{i=1}^{n}\ell(y_{i},m^{(k-1)}(\mathbf{x}_{i})+h(\mathbf{x}_{i}))}~~~(9)which can be written, if \mathcal{H} is a set of differentiable functions, {\displaystyle m^{(k)}=m^{(k-1)}-\gamma_{k}\sum _{i=1}^{n}\nabla _{m^{(k-1)}}\ell(y_{i},m^{(k-1)}(\mathbf{x}_{i})),} where {\displaystyle \gamma _{k}=\underset{\gamma }{\text{argmin }}\sum _{i=1}^{n}\ell\left(y_{i},m^{(k-1)}( \mathbf{x}_{i})-\gamma \nabla _{m^{(k-1)}}\ell(y_{i},m^{(k-1)}( \mathbf{x}_{i}))\right).} To better understand the relationship with the approach described above, at step k, pseudo-residuals are defined by setting r_{i,k}=-\left.\frac{\partial \ell(y_i,m(\mathbf{x}_i))}{\partial m(\mathbf{x}_i)}\right\vert_{m(\mathbf{x})=m^{(k-1)}( \mathbf{x})}\text{ where }i=1,\cdots,nA simple model is then sought to explain these pseudo-residuals according to the explanatory variables \mathbf{x}_i, i.e. r_{i,k}=h^\star(\mathbf{x}_i) , where h^\star\in\mathcal{H}. In a second step, we look for an optimal multiplier by solving\gamma_k = \underset{\gamma\in\mathbb{R}}{\text{argmin}}\left\lbrace\sum_{i=1}^n \ell(y_i,m^{(k-1)}( \mathbf{x}_i)+\gamma h^\star(\mathbf{x}_i))\right\rbrace then update the model by setting m_k (\cdot)=m_(k-1) (\cdot)+\gamma_k h^\star (\cdot) . More formally, we move from equation (8) – which clearly shows that we are building a model on residuals – to equation (9) – which will then be translated as a gradient calculation problem – noting that \ell(y,m+h)=\ell(y-m,h) . Classically, class \mathcal{H} of functions consists in regression trees. It is also possible to use a form of penalty by setting m_k (\cdot)=m_(k-1) (\cdot)+\nu\gamma_k h^\star (\cdot) , with \nu\in(0,1) . But let’s go back a little further – in our next post – on the importance of penalization before discussing the numerical aspects of optimization.

*To be continued (keep in mind that references are online here)…*