---
output:
  bookdown::word_document2: default
  bookdown::pdf_document2:
    template: templates/brief_template.tex
  bookdown::html_document2: default
documentclass: book
bibliography: references.bib
editor_options: 
  chunk_output_type: console
---

```{r echo=FALSE}
library(knitr)
```

<!-- Needed for leaving space to the quote, * is for no indentation after title -->

<!-- \titlespacing*{\chapter}{0pt}{80px}{35pt} -->

# Statistical models for Non Life Insurance Pricing {#chap:models}
<!-- \minitoc -->  <!--this will include a mini table of contents-->

<!-- \chaptermark{Statistical models for Non Life Insurance Pricing} -->

In this chapter we illustrate some of the most widespread statistical models for technical pricing. For each model we describe its benefits and drawbacks, and in section \@ref(chap:considerations-on-models) we compare the models by discussing how well they fit the pricing needs.


## Statistical Models

In this section we start by describing the \ac{glm}, which is the most widely used model in technical pricing today, and then present some of its extensions: the \ac{gam}, the Shrinkage estimators for \ac{glm} and the Bayesian \ac{glm}.


### Generalized Linear Models {#chap:glm}

The \ac{glm}s are widely illustrated in many statistics textbooks. Among them we refer to [@wuthrich-data-analytics], [@gigante2010tariffazione], [@azzalini1996statistical], [@davison2003statistical], [@james2013introduction], [@friedman2001elements] and [@portugues-predictive-modeling]. More details on the concepts described in this section are reported in appendix \@ref(chap:appendix-exp-family).


#### Linear Exponential Families {#chap:linear-exp-families}

One of the \ac{glm} assumptions is that the response variables belong to a _Linear Exponential Family_. In this section we are going to explain what a linear exponential family is and which distributions fit its definition.

```{definition, linear-exp-family, name = "Linear Exponential Family"}
A Linear Exponential Family $\mathcal{F}$ is a parametric family of probability distributions with density function (or probability function in the discrete case) that can be expressed in the form:
$$
f(y; \theta, \lambda) = \exp{\left\{ \frac{y\theta-b(\theta)}{\lambda} \right\}} \ c(y,\lambda), \quad y\in \mathcal{Y}\subseteq\mathbb{R}
$$
where:

\begin{itemize}
\item $\theta\in\Theta\subseteq\mathbb{R}$ is called \textit{canonical parameter};
\item $\lambda\in\Lambda\subseteq]0, +\infty[$ is called \textit{dispersion parameter};
\item $b: \Theta \rightarrow \mathbb{R}$ is a real function called \textit{cumulant function};
\item $c: (\mathcal{Y}, \Lambda) \rightarrow [0, +\infty[$ is a real function;
\item $\Theta$ is a non degenerate interval, i.e. $\text{int}\Theta$ is not empty.
\end{itemize}

```

An exponential family $\mathcal{F}$ is characterized by the elements $\left( \Theta, b(\cdot), \Lambda, c(\cdot, \cdot) \right)$. By properly choosing the sets $\Theta, \Lambda$ and the functions $b(\cdot), c(\cdot, \cdot)$, it is possible to obtain many useful families.

It can be easily shown that the Normal, Poisson, Gamma and Binomial families are linear exponential families. In table \@ref(tab:exp-families) the characterizations of these families are reported.

The distributions that belong to a linear exponential family have many useful properties. For example, for $\theta$ in the interior of $\Theta$ they have finite moments of all orders, and these moments can be obtained from the derivatives of the cumulant function $b(\cdot)$. If $Y$ is a random variable with distribution belonging to an exponential family $\mathcal{F}$ with parameters $\theta, \lambda$, its expected value and variance are:
\begin{align}
\label{eq:exp-fam-expected-value}
E(Y)   & = b'(\theta) \\
Var(Y) & = \lambda b''(\theta)
\end{align}
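
As a quick numerical check of these two identities, the following sketch takes the Poisson family, for which $b(\theta) = e^\theta$ and $\lambda = 1$, and approximates $b'(\theta)$ and $b''(\theta)$ by finite differences. The mean $\mu = 3$ and the step `h` are arbitrary illustrative choices, not values used elsewhere in this work.

```r
# Poisson family: theta = log(mu), b(theta) = exp(theta), lambda = 1
b <- function(theta) exp(theta)
mu <- 3                      # illustrative mean (assumption)
theta <- log(mu)
h <- 1e-4                    # finite-difference step (assumption)

# Central finite differences for b'(theta) and b''(theta)
b_prime  <- (b(theta + h) - b(theta - h)) / (2 * h)
b_second <- (b(theta + h) - 2 * b(theta) + b(theta - h)) / h^2

# For a Poisson distribution both E(Y) and Var(Y) equal mu
c(expected_value = b_prime, variance = 1 * b_second)
```

Both values should be numerically close to $\mu$, in agreement with the fact that a Poisson distribution has equal mean and variance.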

As, within a specified family, the parameters $\theta$ and $\lambda$ determine a distribution, in practical problems the object of estimation will be the pair $(\theta, \lambda)$. In many problems it is natural to consider distributions from a linear exponential family where the dispersion parameter can be expressed as $\lambda = \frac{\phi}{\omega}$, where $\omega>0$ is a known _weight_ and $\phi>0$ is a parameter that we will keep calling _dispersion parameter_. In this case, the density (or probability) function depends on the parameters $\theta$ and $\phi$ and will be expressed as:
$$
f(y; \theta, \phi, \omega) = \exp{\left\{ \frac{\omega}{\phi} \left[y\theta - b(\theta) \right] \right\}} \ c(y, \phi, \omega), \quad y\in \mathcal{Y}\subseteq\mathbb{R}
$$

In this case the parameters $\theta$ and $\phi$ are the object of estimation, while $\omega$ is a known value. As we will see later, this representation allows us to treat as known weights:

* the exposure $v$ in the Poisson distribution;
* the number of trials $n$ in the Binomial distribution.
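
The role of the exposure as a weight can be sketched on simulated data. Assuming claim counts `N` with exposure `v` (all names and parameter values below are illustrative assumptions), fitting the claim frequency `N / v` with weights `v` should give the same coefficients as fitting the counts with an offset $\log(v)$. The `quasipoisson` family is used in the weighted fit only to avoid the warnings that `poisson` raises for non-integer responses.

```r
set.seed(1)
n_obs <- 500
x <- runif(n_obs)
v <- runif(n_obs, min = 0.1, max = 1)          # exposure (known weight)
N <- rpois(n_obs, lambda = v * exp(-2 + x))    # simulated claim counts
freq <- N / v                                  # observed claim frequency

# Frequency as response, exposure as known weight
fit_weight <- glm(freq ~ x, family = quasipoisson(link = "log"), weights = v)
# Counts as response, log-exposure as offset
fit_offset <- glm(N ~ x + offset(log(v)), family = poisson(link = "log"))

cbind(weight = coef(fit_weight), offset = coef(fit_offset))
```

The two columns should agree up to the numerical tolerance of the fitting algorithm, because the two formulations lead to the same estimating equations for the regression parameters.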


#### Model assumptions {#chap:glm-assumptions}

Let's assume that, for $n$ statistical units, the observations
$$\mathcal{D} = \left\{ (\boldsymbol{x}_1, \omega_1, y_1), \dots,  (\boldsymbol{x}_n, \omega_n, y_n) \right\}$$
are available, where $\boldsymbol{x}_i$ is a vector of explanatory variables determinations, $\omega_i$ is a known weight and $y_i$ is the response variable determination. The components of $\boldsymbol{x}_i$ and the values $\omega_i, y_i$ are all real numbers. The vector $\boldsymbol{y} = (y_1, \dots, y_n)^t$ is considered a determination of the response random vector $\boldsymbol{Y} = (Y_1, \dots, Y_n)^t$.

In \ac{glm} we assume that:

<!-- \begin{enumerate}[topsep=8pt,itemsep=4pt,partopsep=4pt,parsep=4pt] -->

\begin{enumerate}[itemsep=4pt]
\item The response variables $Y_1, \dots, Y_n$ are stochastically independent and their probability distributions belong to the same linear exponential family; i.e. the probability distribution of $Y_i$ has density function (or probability function in the discrete case) that can be expressed as:
$$
f(y_i; \theta_i, \phi, \omega_i) = \exp{\left\{ \frac{\omega_i}{\phi} \left[y_i\theta_i - b(\theta_i) \right] \right\}} \ c(y_i, \phi, \omega_i), \quad y_i\in \mathcal{Y}\subseteq\mathbb{R}
$$
We highlight that only $\theta_i$ and $\omega_i$ depend on $i$, while the dispersion parameter $\phi$ is the same for all the observations.
\item The vector of explanatory variables determinations $\boldsymbol{x}_i = \left(1, x_{i1}, \dots, x_{ip} \right)^t$ affects the probability distribution of the response variable $Y_i$ through the linear predictor:
$$
\eta_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip}
$$
that is a linear function of the regression parameters $\boldsymbol{\beta} = \left( \beta_0, \beta_1, \dots, \beta_p \right)$.
\item The linear predictor $\eta_i$ is linked to the expected value of the response variable $\mu_i = E(Y_i)$ by the following relation:
$$
g(\mu_i) = \eta_i = \boldsymbol{x}_i^t \boldsymbol{\beta}
$$
where $g:\mathbb{R}\rightarrow\mathbb{R}$ is a monotonic function with continuous first and second derivatives. $g(\cdot)$ is called \textit{link function}.
\end{enumerate}

Assumption 1 is often called the stochastic assumption, while assumptions 2 and 3 are called the structural assumptions.
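
In R, the three assumptions map onto the arguments of `glm()`: the `family` encodes the stochastic assumption, the formula encodes the linear predictor and the `link` encodes the link function. The following is a minimal sketch on simulated data; the variable names and the coefficients are illustrative assumptions.

```r
set.seed(2)
n_obs <- 1000
x1 <- runif(n_obs)
x2 <- rbinom(n_obs, size = 1, prob = 0.5)
y  <- rpois(n_obs, lambda = exp(-1 + 0.5 * x1 + 0.3 * x2))

# Poisson response (assumption 1), linear predictor in x1 and x2
# (assumption 2), logarithmic link function (assumption 3)
fit <- glm(y ~ x1 + x2, family = poisson(link = "log"))

eta <- predict(fit, type = "link")       # linear predictor x_i^t beta
mu  <- predict(fit, type = "response")   # expected value E(Y_i)

# The link function connects the two: g(mu_i) = eta_i
all.equal(log(mu), eta)
```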

Let's indicate with $\boldsymbol{X}$ the design matrix. The design matrix is the matrix in which each row $\boldsymbol{x}_{i\cdot}$ represents the vector of the explanatory variables for the observation $i$ and each column $\boldsymbol{x}_{\cdot j}$ represents the vector of the observations for the explanatory variable $j$. We also include a column of 1s as the first column, which is used to model the intercept of the linear predictor. Thus, $\boldsymbol{X}$ is an $n\times(p+1)$ matrix. We assume, as is common in actuarial datasets, that $n>p+1$. The design matrix is represented in figure \@ref(fig:design-matrix).


```{tikz, design-matrix, fig.cap = "Design Matrix $\\boldsymbol{X}$.", fig.ext = 'pdf', cache = TRUE, echo = FALSE, fig.align = 'center'}
\usetikzlibrary{arrows.meta, bending, matrix, positioning}
\pgfdeclarelayer{bg}    % declare background layer
\pgfsetlayers{bg,main}  % set the order of the layers (main is the standard layer)
%\usepackage{xcolor}
\definecolor{col1}{HTML}{F8766D} % hue palette
\definecolor{col2}{HTML}{00BFC4} % hue palette
%\definecolor{col1}{HTML}{8C0808}  % Brick red
%\definecolor{col1}{HTML}{BD2027}  % red
%\definecolor{col2}{HTML}{7F7F7F}  % Grey
%\definecolor{col2}{HTML}{BFBFBF}  % Light Grey
\begin{center}
\begin{tikzpicture}[node distance = 1mm and 0mm, baseline]

% Draw matrix
\matrix (M1) [matrix of nodes,{left delimiter=[},{right delimiter=]}]
{
    $1$      & $x_{11}$ & $\dots$  & $x_{1j}$ & $\dots$  & $x_{1p}$ \\
    $1$      & $x_{21}$ & $\dots$  & $x_{2j}$ & $\dots$  & $x_{2p}$ \\
    $\vdots$ & $\vdots$ & $\ddots$ & $\vdots$ & $\ddots$ & $\vdots$ \\
    $1$      & $x_{i1}$ & $\dots$  & $x_{ij}$ & $\dots$  & $x_{ip}$ \\
    $\vdots$ & $\vdots$ & $\ddots$ & $\vdots$ & $\ddots$ & $\vdots$ \\
    $1$      & $x_{n1}$ & $\dots$  & $x_{nj}$ & $\dots$  & $x_{np}$ \\
};

% Draw red vertical rectangle
\begin{pgfonlayer}{bg}    % select the background layer
%\draw[red!60, very thick, fill = red!60, fill opacity = 0.2] 
%        (M1-1-4.north west) -| (M1-6-4.south east) -| (M1-1-4.north west);
\draw[col1, very thick, fill = col1, fill opacity = 0.2] 
        (M1-1-4.north west) -| (M1-6-4.south east) -| (M1-1-4.north west);
\end{pgfonlayer}
\node (ev) [below = 1cm of M1-6-4.south, align = center] {$\boldsymbol{x}_{\cdot j}$\\explanatory\\variable $j$};
\draw[col1, very thick,shorten >=1mm, -{Stealth[bend]}] 
        (ev.north) to (M1-6-4.south);

% Draw blue horizontal rectangle
\begin{pgfonlayer}{bg}    % select the background layer
%\draw[blue!60, very thick, fill = blue!60, fill opacity = 0.2]
%        (M1-4-1.north west) -| (M1-4-6.south east) -| (M1-4-1.north west);
\draw[col2, very thick, fill = col2, fill opacity = 0.2]
        (M1-4-1.north west) -| (M1-4-6.south east) -| (M1-4-1.north west);
\end{pgfonlayer}
\node (obs) [right = 1cm of M1-4-6.east, align = center] {$\boldsymbol{x}_{i\cdot}$\\observation $i$};
\draw[col2, very thick,shorten >=1mm, -{Stealth[bend]}] 
        (obs.west) to (M1-4-6.east);

\end{tikzpicture}
\end{center}

```


We can then express the \ac{glm} structural assumptions in a matrix form as:
$$
\boldsymbol{g}(\boldsymbol{\mu}) = \boldsymbol{X} \boldsymbol{\beta}
$$
where $\boldsymbol{g}(\cdot)$ is to be understood as the vector-valued function that maps every $\mu_i$ to $g(\mu_i)$:
$$
\begin{array}{cccc}
\boldsymbol{g}: & \mathbb{R}^n & \longrightarrow & \mathbb{R}^n \\
                & \left(
                    \begin{matrix} \mu_1  \\ \vdots \\ \mu_n \end{matrix}
                  \right)
                  & \longmapsto & 
                  \left(
                    \begin{matrix} g(\mu_1)  \\ \vdots \\ g(\mu_n) \end{matrix}
                  \right)
\end{array}
$$

We assume the design matrix to be a full rank matrix, i.e. $\text{rank}(\boldsymbol{X}) = p+1$. This assumption corresponds to assuming that the columns of $\boldsymbol{X}$ are linearly independent.
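
The design matrix and the full rank assumption can be inspected directly in R. The sketch below builds $\boldsymbol{X}$ with `model.matrix()` for two hypothetical quantitative variables `x1` and `x2` and checks its rank through a QR decomposition; the data are simulated only for illustration.

```r
set.seed(3)
n_obs <- 6
dat <- data.frame(x1 = runif(n_obs), x2 = runif(n_obs))

X <- model.matrix(~ x1 + x2, data = dat)   # first column of 1s for the intercept
dim(X)                                     # n x (p + 1)
qr(X)$rank                                 # equals p + 1 when X has full rank

# Adding a column that is a linear combination of the others breaks full rank
X_bad <- cbind(X, x3 = 2 * dat$x1 - dat$x2)
qr(X_bad)$rank                             # still p + 1, although X_bad has p + 2 columns
```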

The function $g(\cdot)$ can be chosen as any monotonic function with continuous first and second derivatives. Given a family $\mathcal{F}$, a common choice is its canonical link function that is defined as:
$$
g(\mu) = b'^{-1}(\mu)
$$
From \@ref(eq:exp-fam-expected-value) we obtain that, as $\mu = b'(\theta)$, choosing the canonical link function corresponds to using $\theta$ as the linear predictor:
$$
\eta = g(\mu) = b'^{-1}(\mu) = \theta
$$

In table \@ref(tab:can-link-fun) the canonical link functions for the families already mentioned are reported.
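
In R, the default link of the family objects in the `stats` package corresponds to the canonical one (for the Gamma family R uses $1/\mu$, which matches the canonical link up to the sign under the usual parametrization). The sketch below lists these defaults and checks, for the Poisson family where $b'(\theta) = e^\theta$, that the link is the inverse of $b'$; the grid of means is an arbitrary choice.

```r
# Default links of the stats family objects
sapply(list(Normal = gaussian(), Poisson = poisson(),
            Gamma = Gamma(), Binomial = binomial()),
       function(fam) fam$link)

# Poisson: b'(theta) = exp(theta), so g(mu) = b'^{-1}(mu) = log(mu)
b_prime <- function(theta) exp(theta)
mu <- c(0.1, 1, 5)                      # illustrative means (assumption)
theta <- poisson()$linkfun(mu)          # canonical parameter theta = g(mu)
all.equal(b_prime(theta), mu)
```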


#### Model fitting {#chap:glm-model-fitting}

The model depends on the parameters $\left(\boldsymbol{\beta}, \phi\right)$. Indeed, the parameters $\theta_i$ can be obtained from $\boldsymbol{\beta}$ as:
$$
\theta_i = b'^{-1}(\mu_i) = b'^{-1}(g^{-1}(\eta_i)) = b'^{-1}\left(g^{-1}\left(\boldsymbol{x}_i^t\boldsymbol{\beta}\right)\right)
$$

Therefore, fitting the model corresponds to estimating $\left(\boldsymbol{\beta}, \phi\right)$. The technique used in \ac{glm} is _Maximum Likelihood_. Let's indicate with $L\left(\boldsymbol{\beta}, \phi; \boldsymbol{y}\right)$ the model likelihood. Recall that the likelihood is a function of the parameters that maps $\left(\boldsymbol{\beta}, \phi\right)$ to the density (or probability in the discrete case) of the observed values $\boldsymbol{y}$ given the parameters $\left(\boldsymbol{\beta}, \phi\right)$:
$$
\begin{array}{cccc}
L: & \mathbb{R}^{p+1} \times \Lambda & \longrightarrow & [0, +\infty[ \\
   & \left(\boldsymbol{\beta}, \phi\right) & \longmapsto & f_{\boldsymbol{Y}}(\boldsymbol{y}; \boldsymbol{\theta}, \phi)
\end{array}
$$

The maximum likelihood estimates are the values $\left(\boldsymbol{\beta}, \phi\right)$ that maximize $L\left(\boldsymbol{\beta}, \phi; \boldsymbol{y}\right)$. In practice, $\boldsymbol{\beta}$ are the parameters of interest, while $\phi$ is considered a nuisance parameter. It is possible to show that, for any fixed $\phi$, the value of $\boldsymbol{\beta}$ that maximizes $L(\cdot, \phi)$ does not depend on $\phi$. Therefore, $\boldsymbol{\beta}$ and $\phi$ can be estimated separately.

Let's indicate with $\tilde{\boldsymbol{\beta}}$ the maximum likelihood estimator for $\boldsymbol{\beta}$. Its determination $\hat{\boldsymbol{\beta}}$ is defined as:
\begin{equation}
\label{eq:max-lik-est}
\hat{\boldsymbol{\beta}} = \argmax_{\boldsymbol{\beta}\in\mathbb{R}^{p+1}}{L\left(\boldsymbol{\beta}, \phi; \boldsymbol{y}\right)}
\end{equation}

Finding the values $\hat{\boldsymbol{\beta}}$ that maximize the likelihood corresponds to finding the values that maximize the log-likelihood $\ell\left(\boldsymbol{\beta}, \phi; \boldsymbol{y}\right) = \log{\left(L\left(\boldsymbol{\beta}, \phi; \boldsymbol{y}\right)\right)}$. By the independence assumption on $Y_1, \dots, Y_n$ we get:
\begin{align}
\nonumber
\ell\left(\boldsymbol{\beta}, \phi; \boldsymbol{y}\right) & =
\log{\left(L\left(\boldsymbol{\beta}, \phi; \boldsymbol{y}\right)\right)}
\\ \nonumber & =
\log{\left(\prod_{i=1}^{n}{\exp{\left\{ \frac{\omega_i}{\phi} \left[y_i\theta_i - b(\theta_i) \right] \right\}} c(y_i, \phi, \omega_i)}\right)}
\\ \label{eq:log-like} & =
\sum_{i=1}^{n}{
\left\{
\frac{\omega_i}{\phi} \left[y_i\theta_i - b(\theta_i) \right] + \log{\left(c(y_i, \phi, \omega_i)\right)}
\right\}
}
\\ \nonumber & =
\sum_{i=1}^{n}{\ell_i\left(\boldsymbol{\beta}, \phi; \boldsymbol{y}\right)}
\end{align}
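
The decomposition of the log-likelihood into a sum of individual contributions can be checked numerically. In the sketch below, on simulated Poisson data with illustrative coefficients, the sum of the log-probabilities evaluated at the fitted means is compared with the value returned by `logLik()`.

```r
set.seed(4)
n_obs <- 300
x <- runif(n_obs)
y <- rpois(n_obs, lambda = exp(0.2 + 0.8 * x))
fit <- glm(y ~ x, family = poisson())

# Sum of the individual log-probability contributions at the fitted means
loglik_by_hand <- sum(dpois(y, lambda = fitted(fit), log = TRUE))
all.equal(loglik_by_hand, as.numeric(logLik(fit)))
```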

The maximum of $\ell\left(\boldsymbol{\beta}, \phi; \boldsymbol{y}\right)$ with respect to $\boldsymbol{\beta}$ can be found by setting all its partial derivatives equal to $0$:
$$
\frac{\partial \ell\left(\boldsymbol{\beta}, \phi; \boldsymbol{y}\right)}
{\partial\beta_j}
= 0, \quad \forall j\in\{0,1,\dots,p\}
$$

These equations can be solved with numerical methods, such as the Newton-Raphson algorithm or its variant Fisher scoring. It is possible to show that Fisher scoring corresponds to iteratively solving a weighted least squares problem (_Iteratively Reweighted Least Squares_), and that with the canonical link the two algorithms coincide.
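
To make the iterative weighted least squares idea concrete, the following sketch implements it by hand for a Poisson model with logarithmic link and compares the result with `glm()`. It is an illustration under stated assumptions rather than the routine used by any specific software: the data are simulated, and the starting value, the maximum number of iterations and the stopping tolerance are arbitrary choices.

```r
set.seed(123)
n_obs <- 1000
x <- runif(n_obs)
X <- cbind(1, x)                         # design matrix with intercept
y <- rpois(n_obs, lambda = exp(0.5 + 1 * x))

beta <- c(log(mean(y)), 0)               # starting value (assumption)
for (iter in 1:25) {
  eta <- drop(X %*% beta)                # current linear predictor
  mu  <- exp(eta)                        # current fitted means
  z   <- eta + (y - mu) / mu             # working response
  w   <- mu                              # working weights (Poisson, log link)
  # Weighted least squares step: solve (X^t W X) beta = X^t W z
  beta_new <- drop(solve(crossprod(X, w * X), crossprod(X, w * z)))
  converged <- max(abs(beta_new - beta)) < 1e-10
  beta <- beta_new
  if (converged) break
}

fit <- glm(y ~ x, family = poisson())
cbind(by_hand = beta, glm = coef(fit))
```

Each pass of the loop is an ordinary weighted least squares fit in which both the response `z` and the weights `w` are recomputed from the current estimate.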

In the case with Normal response and identity link, the optimization problem \@ref(eq:max-lik-est) has an explicit solution:
$$
\hat{\boldsymbol{\beta}} = \left( \boldsymbol{X}^t \boldsymbol{X} \right)^{-1} \boldsymbol{X}^t \boldsymbol{y}
$$
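
The explicit solution can be verified on simulated data by comparing it with the output of `glm()` with Normal response and identity link. The coefficients and the noise level used in the simulation are illustrative assumptions.

```r
set.seed(5)
n_obs <- 200
x <- runif(n_obs)
y <- rnorm(n_obs, mean = 1 + 2 * x, sd = 0.1)

X <- cbind(1, x)                                        # design matrix
beta_hat <- drop(solve(crossprod(X), crossprod(X, y)))  # (X^t X)^{-1} X^t y

fit <- glm(y ~ x, family = gaussian(link = "identity"))
cbind(closed_form = beta_hat, glm = coef(fit))
```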

A statistic that can be used to measure the goodness of fit of a model is the _Deviance_. It compares the current model log-likelihood $\ell\left(\hat{\boldsymbol{\beta}}, \phi; \boldsymbol{y}\right)$ with the _saturated model_ log-likelihood $\ell_{S}\left(\boldsymbol{\beta}^*, \phi; \boldsymbol{y}\right)$. The saturated model is the model with $n$ parameters, i.e. a model in which the expected values of the response variables $\mu_1, \dots, \mu_n$ are estimated with their observed values $y_1, \dots, y_n$. It is possible to show that $\ell_{S}\left(\boldsymbol{\beta}^*, \phi; \boldsymbol{y}\right) \ge \ell\left(\hat{\boldsymbol{\beta}}, \phi; \boldsymbol{y}\right)$. The closer $\ell\left(\hat{\boldsymbol{\beta}}, \phi; \boldsymbol{y}\right)$ is to $\ell_{S}\left(\boldsymbol{\beta}^*, \phi; \boldsymbol{y}\right)$, the better the fit of the current model.

```{definition, deviance-def, name = "Deviance"}
Given the log-likelihood of the current model $\ell\left(\hat{\boldsymbol{\beta}}, \phi; \boldsymbol{y}\right)$ and the log-likelihood of the saturated model $\ell_{S}\left(\boldsymbol{\beta}^*, \phi; \boldsymbol{y}\right)$, the \textit{Scaled Deviance} of the current model is defined as:
$$
S(\hat{\boldsymbol{\beta}}, \phi, \boldsymbol{y}) =
-2\left(
\ell\left(\hat{\boldsymbol{\beta}}, \phi; \boldsymbol{y}\right)
- \ell_{S}\left(\boldsymbol{\beta}^*, \phi; \boldsymbol{y}\right)
\right)
$$
The \textit{Deviance} of the current model is defined as:
$$
D(\hat{\boldsymbol{\beta}}, \boldsymbol{y}) =
\phi \, S(\hat{\boldsymbol{\beta}}, \phi, \boldsymbol{y})
$$

```

In deviance notation $D(\hat{\boldsymbol{\beta}}, \boldsymbol{y})$, the parameter $\phi$ is not reported because the deviance does not depend on $\phi$. Indeed, from \@ref(eq:log-like) we get:
\begin{align*}
S(\hat{\boldsymbol{\beta}}, \phi, \boldsymbol{y})
& =
-2\left(
\ell\left(\hat{\boldsymbol{\beta}}, \phi; \boldsymbol{y}\right)
- \ell_{S}\left(\boldsymbol{\beta}^*, \phi; \boldsymbol{y}\right)
\right)
\\ & =
-2\left(
\sum_{i=1}^{n}{
\left\{
\frac{\omega_i}{\phi} \left[y_i\hat{\theta}_i - b(\hat{\theta}_i) \right] + \log{\left(c(y_i, \phi, \omega_i)\right)}
\right\}
}
\right.
\\ & \qquad \qquad -
\left.
\sum_{i=1}^{n}{
\left\{
\frac{\omega_i}{\phi} \left[y_i\theta_i^* - b(\theta_i^*) \right] + \log{\left(c(y_i, \phi, \omega_i)\right)}
\right\}
}
\right)
\\ & =
-2\left(
\sum_{i=1}^{n}{
\frac{\omega_i}{\phi}
\left\{
\left[y_i\hat{\theta}_i - b(\hat{\theta}_i) \right]
- \left[y_i\theta_i^* - b(\theta_i^*) \right]
\right\}
}
\right)
%
\\[12pt]
%
D(\hat{\boldsymbol{\beta}}, \boldsymbol{y})
& =
-2\left(
\sum_{i=1}^{n}{
\omega_i
\left\{
\left[y_i\hat{\theta}_i - b(\hat{\theta}_i) \right]
- \left[y_i\theta_i^* - b(\theta_i^*) \right]
\right\}
}
\right)
\end{align*}

In table \@ref(tab:deviance) the deviances for the families mentioned are reported.
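
As an illustration of the derivation above, the following sketch computes the deviance of a Poisson model by hand, using $\theta = \log(\mu)$, $b(\theta) = e^\theta$ and $\omega_i = 1$, and compares it with the value returned by `deviance()`. The simulated data and coefficients are illustrative assumptions.

```r
set.seed(6)
n_obs <- 300
x <- runif(n_obs)
y <- rpois(n_obs, lambda = exp(0.2 + 0.8 * x))
fit <- glm(y ~ x, family = poisson())
mu_hat <- fitted(fit)

# Current model: y * theta_hat - b(theta_hat), with theta_hat = log(mu_hat)
term_current <- y * log(mu_hat) - mu_hat
# Saturated model: mu_i* = y_i, with the convention y * log(y) = 0 when y = 0
term_saturated <- ifelse(y == 0, 0, y * log(y)) - y

deviance_by_hand <- -2 * sum(term_current - term_saturated)
all.equal(deviance_by_hand, deviance(fit))
```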

As $\ell_{S}\left(\boldsymbol{\beta}^*, \phi; \boldsymbol{y}\right)$ does not depend on $\hat{\boldsymbol{\beta}}$, maximizing the likelihood in equation \@ref(eq:max-lik-est) is the same as minimizing the deviance, which can be seen as a _Loss Function_:
\begin{equation}
\label{eq:max-lik-est-deviance}
\hat{\boldsymbol{\beta}} = \argmin_{\boldsymbol{\beta}\in\mathbb{R}^{p+1}}{D(\boldsymbol{\beta}, \boldsymbol{y})}
\end{equation}
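
The equivalence between maximizing the likelihood and minimizing the deviance can be sketched by passing the Poisson deviance to a general purpose optimizer and comparing the result with `glm()`. The use of `optim()` with the BFGS method and the chosen starting point are assumptions made only for this illustration.

```r
set.seed(7)
n_obs <- 500
x <- runif(n_obs)
X <- cbind(1, x)
y <- rpois(n_obs, lambda = exp(0.2 + 0.8 * x))

# Poisson deviance as a function of beta (log link, unit weights)
poisson_deviance <- function(beta) {
  mu <- exp(drop(X %*% beta))
  2 * sum(ifelse(y == 0, 0, y * log(y / mu)) - (y - mu))
}

# Generic numerical minimization of the loss function
opt <- optim(par = c(log(mean(y)), 0), fn = poisson_deviance, method = "BFGS")
fit <- glm(y ~ x, family = poisson())
cbind(optim = opt$par, glm = coef(fit))
```

The two columns should agree up to the precision of the generic optimizer, which is typically lower than that of the dedicated algorithm used by `glm()`.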

<!--
* Maximum likelihood
* Iteratively reweighted least squares
  + Newton-Raphson, Fisher scoring
* Modello saturo
* Deviance
  + Loss function. Optimization process

-->

We conclude this section by also mentioning that in practice it is possible to use the estimator $\hat{\boldsymbol{\beta}}$ defined by the optimization problem \@ref(eq:max-lik-est) for a specific exponential family without assuming the distribution of the response variable to belong to that family. Basically, in these cases we just make assumptions on the first two moments of the response variable $Y$ and not on its full distribution. In these models the function $L\left(\boldsymbol{\beta}, \phi; \boldsymbol{y}\right)$ is called _Quasi-Likelihood_ and the function $D(\boldsymbol{\beta}, \boldsymbol{y})$ is called _Quasi-Deviance_.

For instance, if we fit our model by minimizing the loss function of the Poisson distribution without assuming that the response variable is Poisson distributed, we are implicitly employing a Quasi-Likelihood model. When the Poisson likelihood is used in this way, the model is also called _Quasi-Poisson_ model.
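
A minimal sketch of this idea in R uses the `quasipoisson` family on counts that are deliberately not Poisson distributed. Here the counts are simulated from a negative binomial distribution; the coefficients and the `size` parameter are illustrative assumptions.

```r
set.seed(8)
n_obs <- 2000
x <- runif(n_obs)
mu <- exp(0.5 + 1 * x)
# Overdispersed counts: the variance is larger than the mean
y <- rnbinom(n_obs, mu = mu, size = 1)

fit_pois  <- glm(y ~ x, family = poisson())
fit_quasi <- glm(y ~ x, family = quasipoisson())

all.equal(coef(fit_pois), coef(fit_quasi))   # same point estimates of beta
phi_hat <- summary(fit_quasi)$dispersion     # estimated dispersion parameter
phi_hat
```

The regression coefficients coincide, as they are obtained from the same loss function, while the estimated dispersion is expected to be larger than 1 for data of this kind and only affects the standard errors.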


#### Variable effects {#chap:var-effects}

As we mentioned in section \@ref(chap:pricing-variables-encoding), the explanatory variables can be _quantitative_ or _qualitative_. In \ac{glm}s, if no transformations of the explanatory variables are added to the linear predictor $\eta$, the effect of each variable on $\eta$ is linear. In figure \@ref(fig:expl-var-types) the effects of quantitative and qualitative variables are shown. The data is simulated from a \ac{glm} with Normal response and identity link.

(ref:expl-var-types-caption-latex) Explanatory variables types.

(ref:expl-var-types-caption-gitbook) Explanatory variables types, quantitative (top-left), qualitative (top-right), quantitative and qualitative without interaction (bottom-left) and quantitative and qualitative with interaction (bottom-right).

```{r, plot-quant-qual-build, echo = FALSE, cache = TRUE}

set.seed(42)

## Colors defined in index.Rmd
# col1 <- hue_pal()(2)[1]
# col2 <- hue_pal()(2)[2]

line_size <- 2

n <- 200
b0 <- 1
b1 <- 2
b2 <- 1
b12 <- -1
sigma <- .1

df1 <- tibble(x = runif(n = n, min = 0, max = 1)) %>% 
  mutate(mu = b0 + b1 * x)
df1$y <- rnorm(n = n, mean = df1$mu, sd = sigma)

df2 <- tibble(x = c(rep(0, times = n/2), rep(1, times = n/2))) %>% 
  mutate(mu = b0 + b2 * x)
df2$y <- rnorm(n = n, mean = df2$mu, sd = sigma)

df3 <- tibble(x1 = runif(n = 2*n, min = 0, max = 1),
              x2 = c(rep(0, times = n), rep(1, times = n))) %>% 
  mutate(mu = b0 + b1 * x1 + b2*x2)
df3$y <- rnorm(n = 2*n, mean = df3$mu, sd = sigma)

df4 <- tibble(x1 = runif(n = 2*n, min = 0, max = 1),
              x2 = c(rep(0, times = n), rep(1, times = n))) %>% 
  mutate(mu = b0 + b1 * x1 + b2 * x2 + b12 * x1 * x2)
df4$y <- rnorm(n = 2*n, mean = df4$mu, sd = sigma)


p_quant_qual_1 <- df1 %>% 
  ggplot(aes(x = x, y = y)) +
  geom_abline(
    intercept = b0,
    slope = b1,
    color = col1,
    size = line_size
  ) +
  geom_point(alpha = .5) +
  # labs(title = "Quantitative variable") +
  scale_x_continuous(limits = c(0, 1)) +
  easy_remove_axes(
    which = "both",
    what = "text",
    teach = FALSE
  )
  

p_quant_qual_2 <- df2 %>% 
  mutate(x = 1/2 * x + 1/4) %>% 
  ggplot(aes(x = x, y = y)) +
  # geom_abline(
  #   intercept = b0,
  #   slope = b2,
  #   color = col1
  # ) +
  geom_point(
    # data = tibble(x = c(0, 1), y = c(b0, b0 + b2)),
    data = tibble(x = c(1/4, 3/4), y = c(b0, b0 + b2)),
    mapping = aes(x = x, y = y),
    color = col1,
    size = 5#,
    # alpha = .8
  ) +
  geom_point(alpha = .5) +
  geom_point(
    # data = tibble(x = c(0, 1), y = c(b0, b0 + b2)),
    data = tibble(x = c(1/4, 3/4), y = c(b0, b0 + b2)),
    mapping = aes(x = x, y = y),
    color = col1,
    size = 5,
    alpha = .8
  ) +
  # labs(title = "Qualitative variable") +
  scale_x_continuous(limits = c(0, 1)) +
  easy_remove_axes(
    which = "both",
    what = "text",
    teach = FALSE
  )


p_quant_qual_3 <- df3 %>% 
  mutate(x2 = factor(x2)) %>% 
  ggplot(aes(x = x1, color = x2, y = y)) +
  geom_abline(
    intercept = b0,
    slope = b1,
    color = col1,
    size = line_size
  ) +
  geom_abline(
    intercept = b0 + b2,
    slope = b1,
    color = col2,
    size = line_size
  ) +
  geom_point(alpha = .5) +
  scale_color_manual(values = c(col1, col2)) +
  # labs(title = "Quantitative and qualitative variable without interaction") +
  scale_x_continuous(limits = c(0, 1)) +
  easy_remove_axes(
    which = "both",
    what = "text",
    teach = FALSE
  )


p_quant_qual_4 <- df4 %>% 
  mutate(x2 = factor(x2)) %>% 
  ggplot(aes(x = x1, color = x2, y = y)) +
  geom_abline(
    intercept = b0,
    slope = b1,
    color = col1,
    size = line_size
  ) +
  geom_abline(
    intercept = b0 + b2,
    slope = b1 + b12,
    color = col2,
    size = line_size
  ) +
  geom_point(alpha = .5) +
  scale_color_manual(values = c(col1, col2)) +
  # labs(title = "Quantitative and qualitative variable with interaction") +
  scale_x_continuous(limits = c(0, 1)) +
  easy_remove_axes(
    which = "both",
    what = "text",
    teach = FALSE
  )
```

```{r, plot-quant-qual-print, out.width = "50%", fig.width = 5, fig.height = 3.5, fig.align='center', fig.cap=ifelse(knitr::is_html_output(), "(ref:expl-var-types-caption-gitbook)", "(ref:expl-var-types-caption-latex)"), label="expl-var-types", echo=FALSE, fig.ncol=2, fig.subcap=c('Quantitative', 'Qualitative', 'Quantitative and qualitative \\\\ without interaction', 'Quantitative and qualitative \\\\ with interaction'), cache = TRUE}

# To align the plots in a subfigure environment
plot_grid_split <- function(..., align = "hv", axis = "tblr"){
  aligned_plots <- cowplot::align_plots(..., align = align, axis = axis)
  plots <- lapply(1:length(aligned_plots), function(x){
    cowplot::ggdraw(aligned_plots[[x]])
  })
  invisible(capture.output(plots))
}

plot_grid_split(p_quant_qual_1, p_quant_qual_2, p_quant_qual_3, p_quant_qual_4)

```

In the top-left panel, we see the effect of the quantitative variable $x$ in the model $\mu_i = \beta_0 + \beta_1 x_i$. As we can see it is a straight line. The coefficient $\beta_1$ represents the slope of the line, thus $\beta_1>0$ means that $x$ and $Y$ are positively correlated, while $\beta_1<0$ means that $x$ and $Y$ are negatively correlated. For example, if $x$ is the power of the vehicle and $Y$ the yearly number of claims, $\beta_1>0$ means that the more powerful the vehicle is, the more claims the policyholder will experience on average.

In the top-right panel, we see the effect of a qualitative binary variable $x$ in the model $\mu_i = \beta_0 + \beta_1 x_i$. The variable is encoded with values $0$ and $1$, so $\beta_1$ represents the effect of the modality $x=1$. In general, for a qualitative variable with $K$ modalities we will have $K-1$ dummy variables $x'_1, \dots, x'_{K-1}$ and the model will be $\mu_i = \beta_0 + \beta_1 x'_{i1} + \beta_2 x'_{i2} + \dots + \beta_{K-1} x'_{i, K-1}$. Thus, the $\beta_j$ coefficient represents the relative effect of the modality $j$ compared to the base level modality, that is the one not explicitly included in the dummy encoding. For example, if $x$ is the vehicle make, $Y$ the yearly number of claims, the base level for $x$ is 'Fiat' and the $j$^th^ modality is 'Ferrari', then $\beta_j>0$ means that Ferrari cars on average experience more claims than Fiat cars.
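
The dummy encoding can be inspected with `model.matrix()`. In the sketch below the vehicle makes are hypothetical values, and 'Fiat' is set as the base level by listing it first among the factor levels.

```r
# Hypothetical vehicle makes; 'Fiat' is the base level (first factor level)
make <- factor(c("Fiat", "Ferrari", "Toyota", "Fiat", "Toyota", "Ferrari"),
               levels = c("Fiat", "Ferrari", "Toyota"))

# K = 3 modalities give an intercept plus K - 1 = 2 dummy variables
X <- model.matrix(~ make)
X
```

The rows corresponding to 'Fiat' have all the dummy variables equal to $0$, so their linear predictor reduces to the intercept $\beta_0$.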

In general, in a multivariate model, the coefficient $\beta_j$ represents the effect of the variable $j$ given all the others. In the example of Fiat and Ferrari cars, if in the model there is also the variable 'vehicle power', the coefficient $\beta_j$ corresponding to the modality 'Ferrari' represents how much riskier a Ferrari car is compared to a Fiat car with the same power. If the explanatory variables are strongly correlated, it is important to be aware of this aspect. For example, Ferrari cars are usually more powerful than Fiat cars. So, it is possible that in general Ferrari cars are riskier than Fiat cars, but when comparing a Ferrari car to a Fiat car with the same power, the Ferrari could be less risky. This effect is called _Simpson's paradox_ [@blyth1972simpson].
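
This reversal can be reproduced with a small simulation. The sketch below assumes a hypothetical portfolio in which the power increases the risk, Ferrari cars are on average more powerful, and at equal power Ferrari cars are less risky; all the names and parameter values are invented for illustration.

```r
set.seed(9)
n_group <- 20000
ferrari <- rep(c(0, 1), each = n_group)            # 0 = Fiat, 1 = Ferrari
# Power in arbitrary units: Ferrari cars are on average more powerful
power <- runif(2 * n_group, min = 0.5, max = 2.5) + ferrari
# Assumed true model: power increases risk, 'Ferrari' decreases it at equal power
claims <- rpois(2 * n_group, lambda = exp(-3 + 0.8 * power - 0.5 * ferrari))

fit_marginal    <- glm(claims ~ ferrari, family = poisson())
fit_conditional <- glm(claims ~ ferrari + power, family = poisson())

effects <- c(marginal    = unname(coef(fit_marginal)["ferrari"]),
             conditional = unname(coef(fit_conditional)["ferrari"]))
effects
```

Under these assumptions the marginal coefficient is positive, while the coefficient conditional on the power is negative, which is the pattern described above.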

In the bottom-left panel of figure \@ref(fig:expl-var-types), we see the effect of a quantitative variable $x_1$ and a qualitative binary variable $x_2$ together in the model $\mu_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2}$. As we can see, the effects of the $x_1$ variable in the two groups defined by the $x_2$ variable are represented by two parallel straight lines. The first one is $\mu_i = \beta_0 + \beta_1 x_{i1}$ and the second is $\mu_i = \left(\beta_0 + \beta_2\right) + \beta_1 x_{i1}$. The coefficient $\beta_2$ represents the vertical distance between the two lines.

In the bottom-right panel, the interaction effect between $x_1$ and $x_2$ is included in the model. The model becomes $\mu_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i1} x_{i2}$. That means that the effect of the $x_1$ variable depends on the determination of the $x_2$ variable. In the group with $x_2=0$ the effect is represented by the line $\mu_i = \beta_0 + \beta_1 x_{i1}$; in the group with $x_2=1$ the effect is represented by the line $\mu_i = \left(\beta_0 + \beta_2\right) + \left(\beta_1 + \beta_3\right) x_{i1}$.
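
A model with interaction can be sketched in R with the formula operator `*`. The simulation below uses illustrative coefficients $\beta_0 = 1$, $\beta_1 = 2$, $\beta_2 = 1$ and $\beta_3 = -1$ with Normal response and identity link, and recovers the slope of $x_1$ in each of the two groups.

```r
set.seed(10)
n_obs <- 400
x1 <- runif(n_obs)
x2 <- rep(c(0, 1), each = n_obs / 2)
y  <- rnorm(n_obs, mean = 1 + 2 * x1 + 1 * x2 - 1 * x1 * x2, sd = 0.1)

# In a formula, x1 * x2 expands to x1 + x2 + x1:x2
fit <- glm(y ~ x1 * x2, family = gaussian())
beta <- coef(fit)

slopes <- c(group_0 = unname(beta["x1"]),                  # beta_1
            group_1 = unname(beta["x1"] + beta["x1:x2"]))  # beta_1 + beta_3
slopes
```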


For quantitative variables, it is also possible to consider nonlinear effects in \ac{glm}s. Some examples are reported in figure \@ref(fig:expl-var-quant-effect).


(ref:expl-var-quant-effect-caption-latex) Explanatory quantitative variables effects.

(ref:expl-var-quant-effect-caption-gitbook) Explanatory quantitative variables effects, polynomial degree 2 (top-left), polynomial degree 4 (top-right), piece-wise linear (bottom-left) and piece-wise polynomial degree 2 (bottom-right).

```{r, plot-quant-effect-build, echo = FALSE, cache = TRUE}
set.seed(42)

## Colors defined in index.Rmd
# col1 <- hue_pal()(2)[1]
# col2 <- hue_pal()(2)[2]

line_size <- 2

n <- 200
b0 <- 1
b1 <- 2
b2 <- 1
b12 <- -1
sigma <- 0.05


f1 <- function(x){1.5 * (x - .3)^2 + .25}

f2 <- function(x){20*(x - .5)^4 + -4 * (x - .8)^2 - 2 * x + 2}

f3 <- function(x){
  case_when(
    x <= .25 ~ -2 * x + 1,
    x <= .75 ~ -1/2 * x + 5/8,
    TRUE ~ 1/4
  )
}

f4 <- function(x){
  case_when(
    x <= .75 ~ 1.5 * (x - .75)^2 + .2,
    TRUE ~ .2
  )
}

df <- tibble(x = runif(n = n, min = 0, max = 1)) %>% 
  mutate(
    mu1 = f1(x),
    mu2 = f2(x),
    mu3 = f3(x),
    mu4 = f4(x)
  )


df$y1 <- rnorm(n = n, mean = df$mu1, sd = sigma)
df$y2 <- rnorm(n = n, mean = df$mu2, sd = sigma)
df$y3 <- rnorm(n = n, mean = df$mu3, sd = sigma)
df$y4 <- rnorm(n = n, mean = df$mu4, sd = sigma)


p_quant_effect_1 <- df %>% 
  select(x, value = y1) %>% 
  ggplot() +
  stat_function(
    fun = f1,
    col = col1,
    size = line_size,
    xlim = c(-0.05, 1.05)
  ) +
  geom_point(aes(x = x, y = value),
             alpha = .4) +
  # scale_x_continuous(limits = c(0, 1)) +
  coord_cartesian(xlim = c(0, 1),
                  ylim = c(NA, 1.05)) +
  easy_remove_axes(
    which = "both",
    what = "text",
    teach = FALSE
  )


p_quant_effect_2 <- df %>% 
  select(x, value = y2) %>%  
  ggplot() +
  stat_function(
    fun = f2,
    col = col1,
    size = line_size,
    xlim = c(-0.05, 1.05)
  ) +
  geom_point(aes(x = x, y = value),
             alpha = .4) +
  # scale_x_continuous(limits = c(0, 1)) +
  coord_cartesian(xlim = c(0, 1),
                  ylim = c(NA, 1.05)) +
  easy_remove_axes(
    which = "both",
    what = "text",
    teach = FALSE
  )

p_quant_effect_3 <- df %>% 
  select(x, value = y3) %>%  
  ggplot() +
  # geom_vline(data = tibble(xint = c(.25, .75)),
  #            aes(xintercept = xint),
  #            linetype = "dotted") +
  geom_vline(xintercept = c(.25, .75),
             linetype = "dotted") +
  stat_function(
    fun = f3,
    col = col1,
    size = line_size,
    xlim = c(-0.05, 1.05)
  ) +
  geom_point(aes(x = x, y = value),
             alpha = .4) +
  # scale_x_continuous(limits = c(0, 1)) +
  coord_cartesian(xlim = c(0, 1),
                  ylim = c(NA, 1.05)) +
  easy_remove_axes(
    which = "both",
    what = "text",
    teach = FALSE
  )

p_quant_effect_4 <- df %>% 
  select(x, value = y4) %>%  
  ggplot() +
  geom_vline(xintercept = .75,
             linetype = "dotted") +
  stat_function(
    fun = f4,
    col = col1,
    size = line_size,
    xlim = c(-0.05, 1.05)
  ) +
  geom_point(aes(x = x, y = value),
             alpha = .4) +
  # scale_x_continuous(limits = c(0, 1)) +
  coord_cartesian(xlim = c(0, 1),
                  ylim = c(NA, 1.05)) +
  easy_remove_axes(
    which = "both",
    what = "text",
    teach = FALSE
  )
```

```{r, plot-quant-effect-print, out.width = "50%", fig.width = 5, fig.height = 3.5, fig.align='center', fig.cap=ifelse(knitr::is_html_output(), "(ref:expl-var-quant-effect-caption-gitbook)", "(ref:expl-var-quant-effect-caption-latex)"), label="expl-var-quant-effect", echo=FALSE, fig.ncol=2, fig.subcap=c('Polynomial degree 2', 'Polynomial degree 4', 'Piece-wise linear', 'Piece-wise polynomial degree 2'), cache = TRUE}

# # To align the plots in a subfigure environment
# plot_grid_split <- function(..., align = "hv", axis = "tblr"){
#   aligned_plots <- cowplot::align_plots(..., align = align, axis = axis)
#   plots <- lapply(1:length(aligned_plots), function(x){
#     cowplot::ggdraw(aligned_plots[[x]])
#   })
#   invisible(capture.output(plots))
# }

plot_grid_split(p_quant_effect_1, p_quant_effect_2, p_quant_effect_3, p_quant_effect_4)
```


The basic way to capture a non-linear effect is to add polynomial terms to the linear predictor. For instance, if $x$ is a quantitative variable, we can add the term $x^2$, obtaining the model $\mu_i = \beta_0 + \beta_1 x_{i} + \beta_2 x_i^2$. An example of a model with both the $x$ and $x^2$ terms is represented in the top-left panel of figure \@ref(fig:expl-var-quant-effect). With the quadratic term, the effect graph becomes a parabola.

With the same logic, it is possible to add more power terms. In general, if we want to model $x$ with a polynomial of degree $d$, we can consider the model $\mu_i = \beta_0 + \beta_1 x_{i} + \beta_2 x_{i}^2 + \dots + \beta_d x_{i}^d$. In the top-right panel of figure \@ref(fig:expl-var-quant-effect) a 4^th^ degree polynomial effect is represented. We highlight that the model is still considered linear, as the attribute "Linear" in "Generalized Linear Model" refers to the relation between the parameters $\beta_j$ and the linear predictor $\eta_i$, which is still linear.
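
As a minimal sketch of polynomial effects, the following R code simulates data with a quadratic mean and fits polynomial terms with `lm()`. The sample size and the true coefficients are illustrative assumptions, not values used elsewhere in this chapter. The design matrix has one column per power of $x$, which is why the model remains linear in the coefficients.

```r
set.seed(1)

# Simulated data: the true coefficients below are illustrative assumptions
n <- 200
x <- runif(n)
y <- 1 + 2 * x - 3 * x^2 + rnorm(n, sd = 0.1)

# Degree-2 polynomial: I() protects the power inside the formula
fit_2 <- lm(y ~ x + I(x^2))

# Degree-d polynomial with raw (non-orthogonal) powers x, x^2, ..., x^d
d <- 4
fit_d <- lm(y ~ poly(x, degree = d, raw = TRUE))

# One column per power, plus the intercept
X <- model.matrix(fit_d)
dim(X)
coef(fit_2)
```

Using `raw = TRUE` keeps the coefficients directly comparable with the $\beta_j$ of the formula above; the default orthogonal polynomials give the same fitted values with a different parametrization.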

Another way to model non-linear effects of explanatory variables is to split the effect into pieces. The bottom-left panel of figure \@ref(fig:expl-var-quant-effect) represents a case in which the effect of $x$ is split into 3 pieces. As the effect is linear in every piece, the graph of the variable effect is a broken line. This effect can be achieved by adding to the model terms of the form $(x-\nu)_+$, where $(x)_+$ represents the positive part of $x$ ($(x)_+ = \max(0,x)$) and $\nu$ is the value of $x$ at the angular point. The $\nu$ values are called _knots_. If the knots are $\nu_1, \nu_2, \dots, \nu_m$, the model can be represented as $\mu_i = \beta_0 + \beta_1 x_i + \beta_2 (x_i-\nu_1)_+ + \beta_3 (x_i-\nu_2)_+ + \dots + \beta_{m+1} (x_i-\nu_m)_+$. Functions of this kind are called _linear splines_ and will be further discussed in section \@ref(chap:gam). If we want the effect to be null beyond a certain point $\nu$, we can consider the variable $x' = \min(\nu, x)$ instead of $x$. This corresponds to collapsing onto $\nu$ all the values of $x$ greater than $\nu$.
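
The linear spline terms can be sketched in R as follows. The helper `pos_part()`, the knots and the coefficients are hypothetical choices made only for illustration.

```r
set.seed(1)

# Positive part (x)_+ = max(0, x), vectorized
pos_part <- function(x) pmax(x, 0)

# Hypothetical knots and coefficients, chosen only for illustration
knots <- c(0.25, 0.75)
n <- 300
x <- runif(n)
mu <- 0.2 + 2 * x - 2 * pos_part(x - knots[1]) - 1.5 * pos_part(x - knots[2])
y <- mu + rnorm(n, sd = 0.05)

# One (x - nu)_+ column per knot: the fitted effect is a broken line
fit_spline <- lm(y ~ x + pos_part(x - knots[1]) + pos_part(x - knots[2]))
coef(fit_spline)

# Null effect beyond nu: cap the variable at nu, x' = min(nu, x)
nu <- 0.75
x_capped <- pmin(nu, x)
fit_capped <- lm(y ~ x_capped)
```

Each coefficient of a `pos_part()` term measures the change in slope at the corresponding knot, so the slope in a given piece is the sum of the coefficients of all the terms that are active there.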

The piece-wise approach can be enhanced by also considering polynomial terms. For instance, in the bottom-right panel of figure \@ref(fig:expl-var-quant-effect), the model represented is $\mu_i = \beta_0 + \left( x_i - \nu \right)_-^2$, where $(x)_-$ is the negative part of $x$ ($(x)_- = \min(0,x)$). The function $f(x) = (x-\nu)^2$ is a parabola with vertex at $\nu$. Omitting the linear term leads to a monotonic effect made of a semi-parabola followed by a horizontal half-line that starts from its vertex.
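
A small numerical check of this shape, as a sketch: the helper `neg_part()`, the vertex `nu` and the scale coefficient `b1` are assumptions introduced here for illustration (the formula above has no explicit coefficient on the squared term).

```r
# Negative part (x)_- = min(0, x), vectorized
neg_part <- function(x) pmin(x, 0)

# Hypothetical vertex and coefficients, chosen only for illustration
nu <- 0.75
effect <- function(x, b0 = 1, b1 = -1) b0 + b1 * neg_part(x - nu)^2

x_grid <- seq(0, 1, by = 0.05)
mu <- effect(x_grid)

# Monotone up to the vertex, constant afterwards
all(diff(mu) >= 0)
unique(mu[x_grid >= nu])
```

With a negative `b1` the effect increases up to $\nu$ and stays flat afterwards; with a positive `b1` it decreases and then stays flat.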

The examples represented in figures \@ref(fig:expl-var-types) and \@ref(fig:expl-var-quant-effect) are based on simulated data. That means that the linear predictor structure is known and the coefficients $\beta_0, \beta_1, \dots, \beta_J$ are known. In practice, the true model is not known, and both the structure and the coefficients must be estimated from the data. Thus, we make assumptions on the structure and we estimate the coefficients with $\hat{\beta}_0, \hat{\beta}_1, \dots, \hat{\beta}_J$. In many cases it is not clear whether to include a variable and in which form. For example, with the same data both the bottom-left and the bottom-right models could work well. In section \@ref(chap:variable-selection) we are going to discuss some variable selection techniques for \ac{glm}.


<!--
* Qualitative variables / binary variables
  + Dummy variables
* Quantitative variables
  + Linear effects
  + Polynomial effects
    - GAM reference
  + piece wise
* Interactions
  + Manual interactions
    Problem: Scalability. With p parameters there are choose(p, 2) possible interactions
 -->

<!--
Disclaimer:
The form of the linear predictor is not known a priori and must be estimated
In the plots I always represented the curve with the true \beta from which the points were simulated
In practice the \beta are not known and we only have the \hat{\beta}
-->


#### Link functions and relativities

As we mentioned in \@ref(chap:linear-exp-families), \ac{glm} supports several families. In figure \@ref(fig:resp-var) the models $g(\mu_i) = \beta_0 + \beta_1 x_i$ with different response variable distributions and link functions are represented.

For all the distributions we used the canonical link function, except for the Gamma, for which we adopted the function $g(\mu)=\log(\mu)$ instead of its canonical link function $g(\mu)=-\frac{1}{\mu}$.

In the Binomial case the canonical link function is $g(p) = \log{\left(\frac{p}{1-p}\right)}$, which is called the logit function. Its inverse is $g^{-1}(\eta) = \frac{e^{\eta}}{1 + e^{\eta}}$, which is called the logistic function.

As we can see from the plots, a linear effect on $x$ corresponds to a logistic effect when the link is logit and to an exponential effect when the link is log.
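
The link functions mentioned above can be inspected through the family objects of base R. This is only a sketch of how `glm()` families expose them; note that R parametrizes the canonical Gamma link as $1/\mu$, which differs from $-1/\mu$ only in sign.

```r
# Family objects in base R expose the link and its inverse
logit <- binomial()$linkfun      # g(p) = log(p / (1 - p))
logistic <- binomial()$linkinv   # g^{-1}(eta) = exp(eta) / (1 + exp(eta))

p <- c(0.1, 0.5, 0.9)
eta <- logit(p)
all.equal(logistic(eta), p)
all.equal(eta, log(p / (1 - p)))

# Default links of the glm() families, and the log link chosen for the Gamma
poisson()$link
Gamma()$link
Gamma(link = "log")$link
```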

(ref:resp-var-caption-latex) Response variables and link functions.

(ref:resp-var-caption-gitbook) Response variables and link functions, Normal - identity (top-left), Binomial - logit (top-right), Poisson - log (bottom-left) and Gamma - log (bottom-right).


```{r, plot-resp-var, echo = FALSE, cache = TRUE}

set.seed(42)

## Colors defined in index.Rmd
# col1 <- hue_pal()(2)[1]
# col2 <- hue_pal()(2)[2]

line_size <- 2

n <- 200
b0 <- -2
b1 <- 4
# b2 <- 1
# b12 <- -1
sigma <- .2
alpha <- 4

df <- tibble(x = runif(n = n, min = 0, max = 1)) %>% 
  mutate(
    eta = b0 + b1 * x,
    mu1 = eta,
    mu2 = plogis(eta),
    mu3 = exp(eta),
    mu4 = exp(eta)
  )

df$y1 <- rnorm(n = n, mean = df$mu1, sd = sigma)
df$y2 <- rbinom(n = n, size = 1, prob = df$mu2)
df$y3 <- rpois(n = n, lambda = df$mu3)
df$y4 <- rgamma(n = n, shape = alpha, rate = alpha/df$mu4)


p_resp_1 <- df %>% 
  select(x, value = y1) %>% 
  ggplot() +
  geom_abline(
    intercept = b0,
    slope = b1,
    col = col1,
    size = line_size
  ) +
  geom_point(aes(x = x, y = value),
             alpha = .4) +
  # scale_x_continuous(limits = c(0, 1)) +
  coord_cartesian(xlim = c(0, 1)) +
  easy_remove_axes(
    which = "x",
    what = "text",
    teach = FALSE
  ) +
  labs(x = "x", y = "y")

p_resp_2 <- df %>% 
  select(x, value = y2) %>%  
  ggplot() +
  stat_function(
    fun = function(x){plogis(b0 + b1 * x)},
    col = col1,
    size = line_size,
    xlim = c(-0.05, 1.05)
  ) +
  geom_point(aes(x = x, y = value),
             alpha = .4) +
  # scale_x_continuous(limits = c(0, 1)) +
  coord_cartesian(xlim = c(0, 1)) +
  easy_remove_axes(
    which = "x",
    what = "text",
    teach = FALSE
  ) +
  labs(x = "x", y = "y")

p_resp_3 <- df %>% 
  select(x, value = y3) %>%  
  ggplot() +
  stat_function(
    fun = function(x){exp(b0 + b1 * x)},
    col = col1,
    size = line_size,
    xlim = c(-0.05, 1.05)
  ) +
  geom_point(aes(x = x, y = value),
             alpha = .4) +
  # scale_x_continuous(limits = c(0, 1)) +
  scale_y_continuous(breaks = c(0, 2, 4, 6, 8, 10, 12),
                     minor_breaks = 0:12) +
  coord_cartesian(xlim = c(0, 1),
                  ylim = c(0, 11)) +
  easy_remove_axes(
    which = "x",
    what = "text",
    teach = FALSE
  ) +
  labs(x = "x", y = "y")

p_resp_4 <- df %>% 
  select(x, value = y4) %>%  
  ggplot() +
  stat_function(
    fun = function(x){exp(b0 + b1 * x)},
    col = col1,
    size = line_size,
    xlim = c(-0.05, 1.05)
  ) +
  geom_point(aes(x = x, y = value),
             alpha = .4) +
  # scale_x_continuous(limits = c(0, 1)) +
  scale_y_continuous(breaks = c(0, 2, 4, 6, 8, 10, 12),
                     minor_breaks = 0:12) +
  coord_cartesian(xlim = c(0, 1),
                  ylim = c(0, 11)) +
  easy_remove_axes(
    which = "x",
    what = "text",
    teach = FALSE
  ) +
  labs(x = "x", y = "y")

```

```{r, plot-resp-var-print, out.width = "50%", fig.width = 5, fig.height = 3.5, fig.align='center', fig.cap=ifelse(knitr::is_html_output(), "(ref:resp-var-caption-gitbook)", "(ref:resp-var-caption-latex)"), label="resp-var", echo=FALSE, fig.ncol=2, fig.subcap=c('Normal - identity', 'Binomial - logit', 'Poisson - log', 'Gamma - log'), cache = TRUE}

# # To align the plots in a subfigure environment
# plot_grid_split <- function(..., align = "hv", axis = "tblr"){
#   aligned_plots <- cowplot::align_plots(..., align = align, axis = axis)
#   plots <- lapply(1:length(aligned_plots), function(x){
#     cowplot::ggdraw(aligned_plots[[x]])
#   })
#   invisible(capture.output(plots))
# }

plot_grid_split(p_resp_1, p_resp_2, p_resp_3, p_resp_4,
                align = "h")
```


If $g(\mu) = \log(\mu)$, the model structure can be expressed as:
\begin{align*}
\mu_i & = e^{\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip}} \\
& = e^{\beta_0} \left(e^{\beta_1}\right)^{x_{i1}} \left(e^{\beta_2}\right)^{x_{i2}} \dots \left(e^{\beta_p}\right)^{x_{ip}}
\end{align*}

The term $e^{\beta_j}$ can be seen as the multiplicative factor corresponding to the variable $x_j$. If $x_j$ is a dummy variable, $e^{\beta_j}$ is the factor the expected value $\mu_i$ is multiplied by when $x_{ij}=1$. If $x_j$ is a quantitative variable, $e^{\beta_j}$ is the factor the expected value $\mu_i$ is multiplied by for every one-unit increase in $x_{ij}$. Indeed:
$$\left(e^{\beta_j}\right)^{x_j+1} = e^{\beta_j} \left(e^{\beta_j}\right)^{x_j}$$
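
The multiplicative reading of the coefficients can be checked numerically. The sketch below fits a Poisson \ac{glm} with log link on a simulated portfolio; the variable names (`young`, `power`) and the true coefficients are illustrative assumptions.

```r
set.seed(1)

# Simulated portfolio: variable names and true coefficients are illustrative
n <- 5000
d <- data.frame(
  young = rbinom(n, size = 1, prob = 0.3),  # dummy variable
  power = runif(n, min = 0, max = 5)        # quantitative variable
)
d$claims <- rpois(n, lambda = exp(-2 + 0.5 * d$young + 0.1 * d$power))

fit <- glm(claims ~ young + power, family = poisson(link = "log"), data = d)

# Multiplicative factors e^{beta_j}
relativities <- exp(coef(fit))
relativities

# One-unit increase in `power` multiplies the expected value by e^{beta_power}
new <- data.frame(young = 0, power = c(2, 3))
mu <- predict(fit, newdata = new, type = "response")
mu[2] / mu[1]
```

The ratio of the two predictions coincides with the `power` entry of `relativities`, whatever the starting value of `power`.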

The fact that with a log link the relation between the coefficients $\beta_0, \beta_1, \dots, \beta_p$ and the expected value $\mu_i$ becomes multiplicative is particularly useful to deal with the exposure $v_i$. In section \@ref(chap:exposure) we have seen that the observations are often pairs (policy, accounting year), so they have different exposures $v_i$. Thus, we usually work with the number of claims occurred in the exposure period $M_i$ and we observe its realization $m_i$. The assumption we make is that:
$$M_i \sim Poisson(v_i \mu_i)$$
where $\mu_i$ is the expected value of the yearly number of claims $N_i$.

Under the \ac{glm} assumptions, we obtain
\begin{align*}
E(M_i) & = v_i \mu_i = v_i e^{\beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip}} \\
& = e^{\log(v_i)}e^{\beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip}} \\
& = e^{\beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip} + \log(v_i)}
\end{align*}

That means that we can model $M_1, M_2, \dots, M_n$ as response variables in a \ac{glm} with Poisson response in which the linear predictor depends on an offset additive term $\log(v_i)$.
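
In R the offset can be passed to `glm()` inside the formula. The following is a minimal sketch on simulated records; the exposure distribution and the true coefficients are assumptions made for illustration.

```r
set.seed(1)

# Simulated (policy, accounting year) records; all values are illustrative
n <- 5000
d <- data.frame(
  x = rbinom(n, size = 1, prob = 0.4),
  v = runif(n, min = 0.1, max = 1)   # exposure in years
)
mu_yearly <- exp(-2 + 0.6 * d$x)     # expected yearly number of claims
d$m <- rpois(n, lambda = d$v * mu_yearly)

# log(v) enters the linear predictor with a coefficient fixed at 1
fit <- glm(m ~ x + offset(log(v)), family = poisson(link = "log"), data = d)
coef(fit)

# Yearly frequency for a full year of exposure (v = 1, so the offset is 0)
pred <- predict(fit, newdata = data.frame(x = c(0, 1), v = 1), type = "response")
pred
```

Because the offset has no coefficient to estimate, the fitted $\beta_j$ describe the yearly frequency $\mu_i$ and not the frequency over the observed exposure period.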

If $g(p) = \text{logit}(p)$, the model structure can be expressed as:
\begin{align*}
\frac{p_i}{1-p_i} & = e^{\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip}} \\
& = e^{\beta_0} \left(e^{\beta_1}\right)^{x_{i1}} \left(e^{\beta_2}\right)^{x_{i2}} \dots \left(e^{\beta_p}\right)^{x_{ip}}
\end{align*}

Thus, the term $e^{\beta_j}$ can be seen as the multiplicative relativity corresponding to the variable $x_j$. However, in this case the relativity doesn't multiply directly the probability of success $p$, but it multiplies the odds of success $\frac{p}{1-p}$.
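
The difference between multiplying the odds and multiplying the probability can be seen in a short sketch on simulated binary data (the true coefficients are illustrative assumptions).

```r
set.seed(1)

# Simulated binary response; the true coefficients are illustrative
n <- 2000
d <- data.frame(x = runif(n))
d$y <- rbinom(n, size = 1, prob = plogis(-2 + 4 * d$x))

fit <- glm(y ~ x, family = binomial(link = "logit"), data = d)

# Odds p / (1 - p) at x = 0 and x = 1
p <- predict(fit, newdata = data.frame(x = c(0, 1)), type = "response")
odds <- p / (1 - p)

# The odds ratio equals e^{beta_1}; the probability ratio does not
odds[2] / odds[1]
exp(coef(fit)[["x"]])
p[2] / p[1]
```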


#### Variables selection {#chap:variable-selection}

One of the most challenging aspects of \ac{glm} fitting is selecting the variables and their effects by looking at the observed data. In practice, we usually have many explanatory variables available but only some of them are relevant for the prediction of the response variable. Adding useless variables to the model increases the variance of the estimators of the coefficients $\tilde{\beta}_0, \tilde{\beta}_1, \dots, \tilde{\beta}_p$ and hence the variance of the predictions $\tilde{\mu}_i = g^{-1}\left(\tilde{\beta}_0 + \tilde{\beta}_1 x_{i1} + \dots + \tilde{\beta}_p x_{ip}\right)$. On the other hand, being too frugal with explanatory variables could lead to wasting part of the predictive power of the available explanatory variables.

One useful tool that can help us understand whether an explanatory variable $x$ is relevant is plotting the points $(x_i, y_i)$, as we did in figures \@ref(fig:expl-var-types), \@ref(fig:expl-var-quant-effect) and \@ref(fig:resp-var). If there are too many observations and the plot is not easily readable, it is possible to group the points by the values of $x$ and show for each group the average of $y_i$ and a confidence interval that gives an idea of the uncertainty around the average. If $x$ is a continuous variable with too many distinct values, it is possible to group them into buckets. Showing the average of $y_i$ for groups of $x$ is particularly useful for Binomial and Poisson data, where the fact that $y_i$ takes only a few different values compromises the plot readability. An example is reported in figure \@ref(fig:var-selection). The top-left and bottom-left panels represent a case in which $x$ and $y$ are not related, while the top-right and bottom-right panels represent a case of positive correlation. From the ungrouped plot in the top-right panel the effect is not clear, while from the bottom-right panel it is evident.

(ref:var-selection-caption-latex) Explanatory variable effect evaluation. The top-left and bottom-left panels represent a case in which $x$ and $y$ are not related, while the top-right and bottom-right panels represent a case of positive correlation. From the ungrouped plot in the top-right panel the effect is not clear, while from the bottom-right panel it is evident.

(ref:var-selection-caption-gitbook) Explanatory variable effect evaluation, No effect - ungrouped (top-left), Positive effect - ungrouped (top-right), No effect - grouped (bottom-left) and Positive effect - grouped (bottom-right). The top-left and bottom-left panels represent a case in which $x$ and $y$ are not related, while the top-right and bottom-right panels represent a case of positive correlation. From the ungrouped plot in the top-right panel the effect is not clear, while from the bottom-right panel it is evident.

(ref:var-selection-caption-short) Explanatory variable effect evaluation.

```{r, plot-var-selection, message=FALSE, out.width = "50%", fig.align='center', fig.cap=ifelse(knitr::is_html_output(), "(ref:var-selection-caption-gitbook)", "(ref:var-selection-caption-latex)"), fig.scap="(ref:var-selection-caption-short)", label="var-selection", echo=FALSE, fig.ncol=2, fig.subcap=c('No effect - ungrouped', 'Positive effect - ungrouped', 'No effect - grouped', 'Positive effect - grouped'), cache = TRUE}

set.seed(42)

## Colors defined in index.Rmd
# col1 <- hue_pal()(2)[1]
# col2 <- hue_pal()(2)[2]

line_size <- 2

n <- 1000
b0 <- -2
b1 <- 4
sigma <- 2

df <- tibble(x = rbeta(n = n, shape1 = 3, shape2 = 3)) %>% 
  mutate(
    mu1 = 0,
    mu2 = b0 + b1 * x,
    x_group = floor(10*x) / 10 + 0.05
  )

df$y1 <- rnorm(n = n, mean = df$mu1, sd = sigma)
df$y2 <- rnorm(n = n, mean = df$mu2, sd = sigma)


# Top-left plot
p_var_selection_1 <- df %>% 
  select(x, value = y1) %>% 
  ggplot() +
  geom_point(aes(x = x, y = value),
             alpha = .4) +
  coord_cartesian(xlim = c(0, 1),
                  ylim = c(-7, 7)) +
  labs(x = "x", y = "y")

p_var_selection_1


# Top-right plot
p_var_selection_2 <- df %>% 
  select(x, value = y2) %>% 
  ggplot() +
  geom_point(aes(x = x, y = value),
             alpha = .4) +
  coord_cartesian(xlim = c(0, 1),
                  ylim = c(-7, 7)) +
  labs(x = "x", y = "y")

p_var_selection_2


# Aggregate data
df_summary <- df %>%
  group_by(x_group) %>%
  summarize(
    n = n(),
    y1_mean = mean(y1),
    y1_sd = sd(y1) / sqrt(n),
    y2_mean = mean(y2),
    y2_sd = sd(y2) / sqrt(n)
  ) %>%
  mutate(
    y1_up = y1_mean + 2 * y1_sd,
    y1_down = y1_mean - 2 * y1_sd,
    y2_up = y2_mean + 2 * y2_sd,
    y2_down = y2_mean - 2 * y2_sd
  )


# Bottom-left plot
df_plot_1 <- df_summary %>%
  select(x = x_group, mean = y1_mean, down = y1_down, up = y1_up, n) %>%
  pivot_longer(cols = c("mean", "n"),
               names_to = "variable", values_to = "value") %>%
  mutate(variable = factor(variable, levels = c("mean", "n")))


p_1 <- df_plot_1 %>%
  mutate(x = factor(x)) %>%
  ggplot(aes(x = x, y = value)) +
  facet_grid(
    variable ~ .,
    scales = "free",
    labeller = labeller(variable = c("mean" = "y", "n" = "count"))
  ) +
  geom_point(data = filter(df_plot_1, variable == "mean")) +
  geom_line(data = filter(df_plot_1, variable == "mean"),
            group = 1) +
  geom_ribbon(data = filter(df_plot_1, variable == "mean"),
              aes(x = x, ymin = down, ymax = up),
              alpha = .3) +
  geom_col(data = filter(df_plot_1, variable == "n"),
           alpha = .8, col = hue_pal()(1), fill = hue_pal()(1)) +
  labs(x = "x", y = "", title = "")


p_1_ylim <- p_1 +
  coord_cartesian(ylim = c(-2.5, 2.5))

g_1 <- ggplotGrob(p_1)
g_1_ylim <- ggplotGrob(p_1_ylim)

g_1[["grobs"]][[2]] <- g_1_ylim[["grobs"]][[2]]
g_1[["grobs"]][[6]] <- g_1_ylim[["grobs"]][[6]]


g_1$heights[7] = 3*g_1$heights[7]

grid.newpage()
grid.draw(g_1)


# Bottom-right plot
df_plot_2 <- df_summary %>%
  select(x = x_group, mean = y2_mean, down = y2_down, up = y2_up, n) %>%
  pivot_longer(cols = c("mean", "n"),
               names_to = "variable", values_to = "value") %>%
  mutate(variable = factor(variable, levels = c("mean", "n")))


p_2 <- df_plot_2 %>%
  mutate(x = factor(x)) %>%
  ggplot(aes(x = x, y = value)) +
  facet_grid(
    variable ~ .,
    scales = "free",
    labeller = labeller(variable = c("mean" = "y", "n" = "count"))#,
  ) +
  geom_point(data = filter(df_plot_2, variable == "mean")) +
  geom_line(data = filter(df_plot_2, variable == "mean"),
            group = 1) +
  geom_ribbon(data = filter(df_plot_2, variable == "mean"),
              aes(x = x, ymin = down, ymax = up),
              alpha = .3) +
  geom_col(data = filter(df_plot_2, variable == "n"),
           alpha = .8, col = hue_pal()(1), fill = hue_pal()(1)) +
  labs(x = "x", y = "", title = "")

p_2_ylim <- p_2 +
  coord_cartesian(ylim = c(-2.5, 2.5))

g_2 <- ggplotGrob(p_2)
g_2_ylim <- ggplotGrob(p_2_ylim)

g_2[["grobs"]][[2]] <- g_2_ylim[["grobs"]][[2]]
g_2[["grobs"]][[6]] <- g_2_ylim[["grobs"]][[6]]


g_2$heights[7] = 3*g_2$heights[7]

grid.newpage()
grid.draw(g_2)

```

If we are dealing with a multivariate model where we already inserted the variables $x_1, \dots, x_p$ and we want to evaluate the additional information brought by $x_{p+1}$, it is possible to look at the plot $(x_{i, p+1}, r_i)$, where $r_i$ are the residuals of the model without the variable $x_{p+1}$:
$$
r_i = y_i - \hat{\mu}_i = y_i - g^{-1}\left( \hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \dots + \hat{\beta}_p x_{ip} \right), \quad i\in\{1,2,\dots,n\}
$$

If the plot shows a clear trend, we will add the variable $x_{p+1}$ to the model, otherwise we won't.
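
A possible sketch of this residual check in R, on simulated data in which the candidate variable (here called `x2`, a hypothetical name) has a real effect. The residuals are averaged by bucket to make the trend readable.

```r
set.seed(1)

# Simulated data in which the candidate variable x2 has a real effect
n <- 2000
d <- data.frame(x1 = runif(n), x2 = runif(n))
d$y <- rpois(n, lambda = exp(-1 + 1 * d$x1 + 1 * d$x2))

# Model without the candidate variable
fit <- glm(y ~ x1, family = poisson, data = d)

# r_i = y_i - mu_hat_i (response residuals)
d$r <- d$y - fitted(fit)

# Average residual by bucket of x2: a trend suggests adding x2
d$x2_group <- cut(d$x2, breaks = seq(0, 1, by = 0.2), include.lowest = TRUE)
res_by_group <- tapply(d$r, d$x2_group, mean)
res_by_group
```

In this simulation the average residual grows with the bucket of `x2`, which is the kind of trend that would support adding the variable.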

The choice of adding or not a variable in the model can be supported by _hypothesis testing_. Given a \ac{glm} with coefficients $\beta_0, \beta_1, \dots, \beta_p$, it is possible to test if a group of coefficients $\beta_{j_1}, \beta_{j_2}, \dots, \beta_{j_s}$ is equal to zero. Formally the hypotheses tested are:
$$
\begin{cases}
H_0: & \beta_{j_k} = 0 \ \forall k \in \{1, 2, \dots, s\} \\
H_1: & \exists k: \beta_{j_k} \neq 0
\end{cases}
$$

If the hypothesis $H_0$ is not rejected, the variables $x_{j_1}, x_{j_2}, \dots, x_{j_s}$ can be removed from the model; if the hypothesis $H_0$ is rejected, at least one of the variables $x_{j_1}, x_{j_2}, \dots, x_{j_s}$ should be kept in the model.

If we want to test if a quantitative variable $x_j$ has a significant effect, we can conduct the test on the single coefficient $\beta_j$. If we want to test if a qualitative variable with dummy encoding $x_{j_1}, x_{j_2}, \dots, x_{j_s}$ is significant, we can conduct the test on the coefficients $\beta_{j_1}, \beta_{j_2}, \dots, \beta_{j_s}$. For qualitative variables it is also possible to conduct a test for each level $x_{j_k}$; in this case we would test one by one if each level has an effect that is significantly different from the base level effect.

To conduct the test it is possible to use several statistics. One of them is based on the _log likelihood ratio_. Let's indicate with $\hat{\boldsymbol{\beta}}$ the estimated coefficients from the model without any constraint and with $\hat{\boldsymbol{\beta}}^{(0)}$ the estimated coefficients with the constraints defined by $H_0$. As the space that $\hat{\boldsymbol{\beta}}^{(0)}$ belongs to is a subset of the space that $\hat{\boldsymbol{\beta}}$ belongs to, it follows that:
$$
L\left(\hat{\boldsymbol{\beta}}^{(0)}\right) \le L\left(\hat{\boldsymbol{\beta}}\right)
$$
and then:
$$
\frac{L\left(\hat{\boldsymbol{\beta}}^{(0)}\right)}{L\left(\hat{\boldsymbol{\beta}}\right)} \le 1
$$

The basic idea is that, if the variables $x_{j_1}, x_{j_2}, \dots, x_{j_s}$ have a significant effect, $L\left(\hat{\boldsymbol{\beta}}\right)$ will be much higher than $L\left(\hat{\boldsymbol{\beta}}^{(0)}\right)$ and we will reject the hypothesis $H_0$, while if the variables $x_{j_1}, x_{j_2}, \dots, x_{j_s}$ do not have a significant effect, $L\left(\hat{\boldsymbol{\beta}}\right)$ will be more or less the same as $L\left(\hat{\boldsymbol{\beta}}^{(0)}\right)$ and we will not reject the hypothesis $H_0$.

To perform the test, the quantity usually employed is the following:
\begin{align*}
\lambda & = -2 \log{\left( \frac{L\left(\hat{\boldsymbol{\beta}}^{(0)}\right)}{L\left(\hat{\boldsymbol{\beta}}\right)} \right)} \\
& = -2 \left[ \ell\left(\hat{\boldsymbol{\beta}}^{(0)}\right) - \ell\left(\hat{\boldsymbol{\beta}}\right) \right]
\end{align*}

If we indicate $\tilde{\lambda} = -2 \left[ \ell\left(\tilde{\boldsymbol{\beta}}^{(0)}\right) - \ell\left(\tilde{\boldsymbol{\beta}}\right) \right]$, it is possible to demonstrate that, under the hypothesis $H_0$, $\tilde{\lambda}$ approximately follows a chi-squared distribution with $s$ degrees of freedom:
$$
\tilde{\lambda} \dot\sim \chi^2(s)
$$

Therefore, with a significance level $\alpha$ we will reject $H_0$ when $\lambda > \chi^2_{s, 1-\alpha}$, where $\chi^2_{s, 1-\alpha}$ is the quantile of order $1-\alpha$ of the distribution $\chi^2(s)$.
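
The test can be sketched in R by computing $\lambda$ from the two log likelihoods and comparing it with the chi-squared quantile. The data are simulated so that $H_0$ is true for the tested variables; all names and coefficients are illustrative assumptions.

```r
set.seed(1)

# Simulated data: x2 and x3 have no effect (H0 is true for them)
n <- 1000
d <- data.frame(x1 = runif(n), x2 = runif(n), x3 = runif(n))
d$y <- rpois(n, lambda = exp(-1 + 1 * d$x1))

fit_0 <- glm(y ~ x1, family = poisson, data = d)            # constrained by H0
fit_1 <- glm(y ~ x1 + x2 + x3, family = poisson, data = d)  # unconstrained

# lambda = -2 [ l(beta_hat^(0)) - l(beta_hat) ]
lambda <- as.numeric(-2 * (logLik(fit_0) - logLik(fit_1)))
s <- length(coef(fit_1)) - length(coef(fit_0))

alpha <- 0.05
critical_value <- qchisq(1 - alpha, df = s)
p_value <- pchisq(lambda, df = s, lower.tail = FALSE)
c(lambda = lambda, critical_value = critical_value, p_value = p_value)

# The same test through anova()
anova(fit_0, fit_1, test = "Chisq")
```

For nested \ac{glm}s of the same family, `anova()` with `test = "Chisq"` reports the same statistic as the difference between the two deviances.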

The same approach can be used in general for testing hypotheses that can be expressed as $H_0: \boldsymbol{L}\boldsymbol{\beta} = \xi$, where $\boldsymbol{L}$ is an $s\times(p+1)$ matrix and $\xi\in\mathbb{R}^{s}$. This is particularly useful for qualitative variables, to test whether some of the levels have the same effect and can therefore be unified. For example, if $x_{j_1}$ and $x_{j_2}$ are two dummy variables that describe two levels of the same qualitative variable, we can perform the test $H_0: \beta_{j_1} = \beta_{j_2}$ in order to decide whether unifying the two levels is suitable or not.
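
As a sketch of the level-unification test, the code below compares a model with separate levels against one in which two levels are merged. The variable `area`, its levels and the true effects are hypothetical.

```r
set.seed(1)

# Simulated qualitative variable with levels A (base), B, C;
# B and C share the same true effect
n <- 3000
d <- data.frame(area = factor(sample(c("A", "B", "C"), n, replace = TRUE)))
d$y <- rpois(n, lambda = exp(-1 + 0.5 * (d$area != "A")))

# Unconstrained model: one coefficient per non-base level
fit_1 <- glm(y ~ area, family = poisson, data = d)

# Model under H0: beta_B = beta_C, obtained by unifying the two levels
d$area_merged <- d$area
levels(d$area_merged) <- c("A", "BC", "BC")
fit_0 <- glm(y ~ area_merged, family = poisson, data = d)

# One constraint, hence s = 1 degree of freedom
anova(fit_0, fit_1, test = "Chisq")
```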

However, selecting variables through hypothesis testing has some drawbacks. First of all, conducting many tests gives rise to the multiple testing problem. Let's consider the case in which we conduct a test of the kind $H_0: \beta_j = 0$ with a significance level $\alpha=0.05$ on many variables that have no effect on the response. Although the null hypotheses are always true, on average we will reject one of them every 20 tests. That means that if we have 100 such variables available, we will select on average 5 of them purely by chance, falling into _overfitting_. To fix this problem, it is possible to use the Bonferroni correction, which consists in dividing $\alpha$ by the number of tests conducted when defining the rejection region. However, this correction can be too restrictive and lead to discarding some useful variables from the model, falling into _underfitting_.
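
The multiple testing problem can be illustrated with a small simulation. The sketch assumes 100 candidate variables that are all unrelated to a Normal response, and counts the rejections with and without the Bonferroni correction.

```r
set.seed(1)

# 100 candidate variables, none of which affects the response
n <- 500
n_vars <- 100
X <- matrix(rnorm(n * n_vars), nrow = n)
y <- rnorm(n)

# p-value of H0: beta_j = 0, one variable at a time
p_values <- apply(X, 2, function(x_j) {
  summary(lm(y ~ x_j))$coefficients["x_j", "Pr(>|t|)"]
})

alpha <- 0.05
n_reject <- sum(p_values < alpha)                  # without correction
n_reject_bonf <- sum(p_values < alpha / n_vars)    # Bonferroni threshold
c(uncorrected = n_reject, bonferroni = n_reject_bonf)

# Equivalent formulation through adjusted p-values
sum(p.adjust(p_values, method = "bonferroni") < alpha)
```

Every rejection in this simulation is a false positive by construction, so the uncorrected count gives a direct picture of how many irrelevant variables would enter the model.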

Moreover, the aim of hypothesis testing is to establish whether the data support a hypothesis. If the aim of the model is prediction, basing variable selection on hypothesis testing could lead to a sub-optimal model.

Another method for variables selection is comparing the models by computing _information criteria_. Two information criteria commonly used are the \ac{aic} and the \ac{bic}:
\begin{align*}
AIC & = -2\ell\left(\hat{\boldsymbol{\beta}}\right) + 2 (p+1) \\
BIC & = -2\ell\left(\hat{\boldsymbol{\beta}}\right) + \log(n) (p+1)
\end{align*}

The aim of these statistics is to penalize the likelihood by adding a component that measures the complexity of the model. Among the models considered, the optimal model will be the one that minimizes the information criterion. Thus, if two models have the same likelihood, the optimal model will be the one with fewer parameters. If the model with more parameters has a higher likelihood, it will be chosen only if that increase in likelihood compensates for the increase in complexity.
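
The two criteria can be computed by hand and compared with the built-in `AIC()` and `BIC()` functions. The sketch uses a Poisson \ac{glm}, for which no dispersion parameter has to be counted; the helper `info_criteria()` and the simulated data are illustrative assumptions.

```r
set.seed(1)

n <- 1000
d <- data.frame(x1 = runif(n), x2 = runif(n))
d$y <- rpois(n, lambda = exp(-1 + 1 * d$x1))   # x2 has no effect

fit_small <- glm(y ~ x1, family = poisson, data = d)
fit_large <- glm(y ~ x1 + x2, family = poisson, data = d)

# Manual computation from the maximized log likelihood
info_criteria <- function(fit) {
  ll <- as.numeric(logLik(fit))
  k <- length(coef(fit))            # p + 1
  c(AIC = -2 * ll + 2 * k, BIC = -2 * ll + log(nobs(fit)) * k)
}

rbind(small = info_criteria(fit_small), large = info_criteria(fit_large))

# Built-in equivalents
c(AIC(fit_small), BIC(fit_small))
```

For families with an estimated dispersion parameter, R counts that parameter as well, so the manual formula above would need one more unit in `k`.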

Another way to assess the predictive performance of a model is by randomly splitting the available dataset $\mathcal{D} = \left\{ (\boldsymbol{x}_1, \omega_1, y_1), \dots,  (\boldsymbol{x}_n, \omega_n, y_n) \right\}$ into a _training set_ (or learning set) $\mathcal{D}^{\mathcal{B}}$ and a _test set_ $\mathcal{D}^{\bar{\mathcal{B}}}$, with $\mathcal{B}\subset\{1,2,\dots,n\}$ labeling the dataset $\mathcal{D}^{\mathcal{B}} = \left\{ (\boldsymbol{x}_i, \omega_i, y_i): i \in \mathcal{B} \right\} \subset \mathcal{D}$ and $\bar{\mathcal{B}} = \{1, 2, \dots, n\} \setminus \mathcal{B}$. With this split, we can fit the model only on the training set $\mathcal{D}^{\mathcal{B}}$ and assess its performance on the test set $\mathcal{D}^{\bar{\mathcal{B}}}$ by computing the deviance $D\left( \hat{\boldsymbol{\beta}}^{\mathcal{B}}, \boldsymbol{y}^{\bar{\mathcal{B}}} \right)$, where $\hat{\boldsymbol{\beta}}^{\mathcal{B}}$ is the vector of the coefficients estimated on the training set and $\boldsymbol{y}^{\bar{\mathcal{B}}}$ is the vector of the observed response variables in the test set. This way it is possible to choose the best model as the one that minimizes the deviance on the test set and then refit it on the whole dataset $\mathcal{D}$.
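
A minimal sketch of the train-test procedure for a Poisson response follows. The 80/20 split, the helper `poisson_deviance()` and the two candidate formulas are assumptions made for illustration.

```r
set.seed(1)

n <- 2000
d <- data.frame(x1 = runif(n), x2 = runif(n))
d$y <- rpois(n, lambda = exp(-1 + 1 * d$x1))

# Random split: B labels the training set, its complement the test set
B <- sample(seq_len(n), size = 0.8 * n)
train <- d[B, ]
test <- d[-B, ]

# Poisson deviance of observations y against predicted means mu
poisson_deviance <- function(y, mu) {
  2 * sum(ifelse(y == 0, 0, y * log(y / mu)) - (y - mu))
}

candidates <- list(y ~ x1, y ~ x1 + x2)
test_deviance <- sapply(candidates, function(f) {
  fit <- glm(f, family = poisson, data = train)
  mu <- predict(fit, newdata = test, type = "response")
  poisson_deviance(test$y, mu)
})
test_deviance

# Refit the selected structure on the whole dataset
best <- candidates[[which.min(test_deviance)]]
fit_final <- glm(best, family = poisson, data = d)
```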

A limit of the train-test approach is that it could lead to overfitting on the test set. Indeed, in particular if the dataset is small, it is possible that a specific set of variables minimizes the deviance on the test set just by chance. To prevent this, it is possible to conduct a _K-fold cross validation_. This consists in randomly partitioning the dataset $\mathcal{D}$ into $K$ subsets $\mathcal{D}^{\mathcal{B}_1}, \mathcal{D}^{\mathcal{B}_2}, \dots, \mathcal{D}^{\mathcal{B}_K}$ and, for each subset $\mathcal{D}^{\mathcal{B}_k}$, performing a train-test procedure keeping $\mathcal{D}^{\setminus \mathcal{B}_k} = \mathcal{D} \setminus \mathcal{D}^{\mathcal{B}_k}$ as a training set and $\mathcal{D}^{\mathcal{B}_k}$ as a test set. For each $k$ we can compute the testing deviance $D\left( \hat{\boldsymbol{\beta}}^{\setminus \mathcal{B}_k}, \boldsymbol{y}^{\mathcal{B}_k} \right)$. We can then compute the average deviance across the $K$ subsets as:
$$
D^{CV(K)} = \frac{1}{K} \sum_{k=1}^{K}{D\left( \hat{\boldsymbol{\beta}}^{\setminus \mathcal{B}_k}, \boldsymbol{y}^{\mathcal{B}_k} \right)}
$$
Thus, the best model will be the one that minimizes $D^{CV(K)}$.

The higher $K$ is, the less subject to randomness $D^{CV(K)}$ is. However, the higher $K$ is, the more computationally expensive the procedure is. A common choice for $K$ is $K=10$. If we choose $K=n$, the procedure is also called _leave-one-out cross validation_.
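
The K-fold procedure can be sketched as follows for a Poisson response. The helpers `poisson_deviance()` and `cv_deviance()` are hypothetical names, and the response column is assumed to be called `y`.

```r
set.seed(1)

n <- 2000
d <- data.frame(x1 = runif(n), x2 = runif(n))
d$y <- rpois(n, lambda = exp(-1 + 1 * d$x1))

poisson_deviance <- function(y, mu) {
  2 * sum(ifelse(y == 0, 0, y * log(y / mu)) - (y - mu))
}

# Random partition of the indices into K folds of (almost) equal size
K <- 10
fold <- sample(rep(seq_len(K), length.out = n))

# Average test deviance over the K folds; the response is assumed to be `y`
cv_deviance <- function(formula, data, fold) {
  dev_k <- sapply(sort(unique(fold)), function(k) {
    fit <- glm(formula, family = poisson, data = data[fold != k, ])
    mu <- predict(fit, newdata = data[fold == k, ], type = "response")
    poisson_deviance(data$y[fold == k], mu)
  })
  mean(dev_k)   # D^{CV(K)}
}

cv_results <- c(
  small = cv_deviance(y ~ x1, d, fold),
  large = cv_deviance(y ~ x1 + x2, d, fold)
)
cv_results
```

Using the same `fold` vector for every candidate model keeps the comparison fair, since all models are evaluated on identical partitions.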


#### Scalability and manual fitting

One problem of \ac{glm} is that the variables selection process is not so easily scalable. Indeed, if we consider $p$ explanatory variables, even without taking into account interactions and quantitative variables transformations, there are $2^p$ possible models that can be obtained by choosing a subset of those variables. As $p$ increases, building all these models and choosing the optimal one becomes unfeasible.

One strategy to reduce the time consumption is to use a _stepwise procedure_. First of all we must choose a criterion to compare the models, such as the AIC. Then we have to choose a starting model that considers the variables subset $\mathcal{C}_0 \subset \{ 1,2,\dots,p \}$. It is possible to compare this model with all the models that can be obtained by removing one variable from $\mathcal{C}_0$ or adding to $\mathcal{C}_0$ one that is not included in it. For all these models we compute the AIC and we choose as $\mathcal{C}_1$ the set of variables that minimizes it. If none of the considered models has an AIC lower than the one obtained with the variables $\mathcal{C}_0$, the procedure ends and our final set of variables is $\mathcal{C}_0$. Otherwise, we will repeat the step with $\mathcal{C}_1$. The procedure can be iteratively repeated until we obtain a subset of variables $\mathcal{C}_f$ that can't be improved by removing or adding a variable. The model with the variables $\mathcal{C}_f$ will be our chosen one.
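
In R a stepwise search of this kind is available through `stats::step()`. The sketch below starts from the intercept-only model; the simulated variables and the choice of starting model are illustrative assumptions.

```r
set.seed(1)

n <- 1000
d <- data.frame(x1 = runif(n), x2 = runif(n), x3 = runif(n), x4 = runif(n))
d$y <- rpois(n, lambda = exp(-1 + 1 * d$x1 + 0.8 * d$x2))

# Starting model C_0 (intercept only) and the largest model allowed
fit_start <- glm(y ~ 1, family = poisson, data = d)
scope <- list(lower = ~ 1, upper = ~ x1 + x2 + x3 + x4)

# At each step, add or drop the single variable that lowers the AIC the most
fit_step <- step(fit_start, scope = scope, direction = "both", trace = 0)
formula(fit_step)
AIC(fit_start)
AIC(fit_step)
```

The argument `k` of `step()` sets the penalty per parameter: the default `k = 2` corresponds to the AIC, while `k = log(n)` gives a BIC-based search.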

This procedure is much faster than computing all the $2^p$ models, but it is still not so scalable for large $p$. Moreover, in this procedure we are not taking into account the interactions and the possible transformations for the quantitative variables. This can be achieved by slightly modifying the algorithm, but it would further increase the complexity and the computation time. Another option is starting from the result of the stepwise regression and manually choosing interactions and quantitative variables transformations by looking at plots as described in \@ref(chap:var-effects).

One characteristic of \ac{glm} is that the variables effects can be easily interpreted and the variables selection process is largely manual. This aspect can be problematic if there are a lot of explanatory variables, but it brings some important benefits too. Indeed, the actuary can make choices on variables selection based not only on observed data, but also on his domain knowledge. For instance, the choice of selecting or not an explanatory variable can also be guided by its interpretation: if the observed effect makes sense, it can be added to the model even if it is not statistically significant and it doesn't decrease the AIC, and, conversely, if the observed effect is not reasonable, the actuary can choose not to include the variable in the model even if its effect is supported by the data. So, in the actuarial practice, in the \ac{glm} fitting process there is always a subjective component that affects the final result. For this reason it is important for the actuary to have a deep knowledge of the phenomenon he is modeling.

An aspect that facilitates the model building and reduces the complexity of the process, even if there are many variables, is that usually the models are not built from scratch. Actuarial models are usually updated once a year, so it is also possible to start the new model by fitting on new data the final model from the year before. As the models are usually built with policies data from more than one year, the new dataset partially overlaps with the one from the previous year. However, the overlapping observations aren't identical: the new dataset will have the new settlement information and some new explanatory variables. Anyway, if the actuary is familiar with the effects of the variables in the previous models, he already knows which will probably be relevant and which don't even deserve too much attention.


<!--
* problem: effect not known
  + not known variables
  + not known shape
* approaches
  + looking to residuals (x_i, r_i)
  + hypotheses testing
    - multiple tests, p-hacking
    - overfitting
  + AIC/BIC
  + Training/Test, Cross Validation
    - possible overfitting on the test set
* manual work
  + scalability
    - choice of variables. There are 2^p possible models
    - how to model quantitative variables
    - choice of interactions. There are choose(p, 2) possible interactions
    Usually actuarial models are updated once a year, so it is possible to start from models made the year before
* domain knowledge, expertise. Advantage of GLMs: prior information can be built into the model
-->


\newpage

### Generalized Additive Models {#chap:gam}

In section \@ref(chap:var-effects) we have seen that sometimes quantitative variables do not have a linear effect on the linear predictor. In \ac{glm} it is possible to deal with non-linear effects by adding polynomial or piece-wise terms. \ac{gam}s are models based on \ac{glm} that introduce a more flexible way to deal with quantitative variables with non-linear effects. For more details on \ac{gam} we refer to [@wuthrich-data-analytics], [@james2013introduction], [@friedman2001elements], [@ruppert2003semiparametric], [@wood2017generalized] and [@sica-2016].


<!--
Smooth functions
Spline cubiche
GAM assumptions
Model fitting
Model effects
-->


#### Model assumptions

In \ac{gam}, the assumptions are the same as in \ac{glm}, stated in \@ref(chap:glm-assumptions), with the following advancement in the linear predictor:
$$
g(\mu_i) = \eta_i = \boldsymbol{x}_i^t \boldsymbol{\beta} + \sum_{l=1}^{q}{f_l(z_{i,l})}, \qquad i\in\{1,2,\dots,n\}
$$
where

* $\boldsymbol{x}_i$ is the vector of the variables with a linear effect as described in \ac{glm}, that also includes a term $1$ that represents the intercept;
* $\boldsymbol{\beta}$ is the vector of the linear coefficients as described in \ac{glm};
* $z_{i,1}, z_{i,2}, \dots, z_{i,q}$ are the quantitative variables with a non-linear effect;
* $f_1(\cdot), f_2(\cdot), \dots, f_q(\cdot)$ are continuous functions with continuous first and second derivatives.

The functions $f_l(\cdot)$ introduce the possibility to model non-linear effects of the variables $z_l$.
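
As an illustration of this linear predictor, the following is a minimal sketch on simulated data, assuming the `mgcv` package. The variable names (`x1`, `z1`, `z2`), the simulated effects and the Poisson response are hypothetical choices made only for the example.

```r
library(mgcv)  # provides gam() and s()

set.seed(1)
n <- 2000
# Hypothetical portfolio: one binary variable and two quantitative ones
dat <- data.frame(
  x1 = rbinom(n, size = 1, prob = 0.4),
  z1 = runif(n, 18, 80),
  z2 = runif(n, 0, 20)
)
# Simulated log-mean: linear effect of x1, non-linear effects of z1 and z2
eta <- -2 + 0.3 * dat$x1 + 0.5 * cos(dat$z1 / 10) - 0.002 * (dat$z2 - 10)^2
dat$y <- rpois(n, lambda = exp(eta))

# x1 enters linearly (the x_i^t beta part); z1 and z2 enter through the
# smooth functions f_1 and f_2, here natural cubic regression splines
fit_gam <- gam(
  y ~ x1 + s(z1, bs = "cr") + s(z2, bs = "cr"),
  family = poisson(link = "log"),
  data = dat
)

coef(fit_gam)[c("(Intercept)", "x1")]  # linear coefficients beta
length(fit_gam$smooth)                 # number q of smooth terms
```

In this sketch the smoothing parameters are left to the default selection of `gam()`.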


#### Cubic splines {#chap:cubic-splines}

A class of functions commonly used for modeling $f_l(\cdot)$ is the class of the _cubic splines_.

```{definition, cubic-splines, name = "Cubic Splines"}
Let us consider $m$ real numbers $\nu_1 < \nu_2 < \dots < \nu_m$, called \textit{knots}, and a function $f:[\nu_1, \nu_m] \to \mathbb{R}$ such that:
$$
f(x) = h_t(x), \qquad x\in[\nu_t, \nu_{t+1}[
$$
where, for $t = 1, 2, \dots, m-1$, $h_t(x) = \alpha_t + \vartheta_t x + \gamma_t x^2 + \delta_t x^3$. For the last index $m-1$, $f(x) = h_{m-1}(x)$ is extended to $[\nu_{m-1}, \nu_{m}]$.

$f(x)$ is a \textit{cubic spline} if it satisfies the following conditions in the internal knots $\nu_2, \nu_3, \dots, \nu_{m-1}$
\begin{equation}
\label{eq:cubic-spline-constraints}
h_{t-1}(\nu_t) = h_{t}(\nu_t), \qquad h'_{t-1}(\nu_t) = h'_{t}(\nu_t), \qquad h''_{t-1}(\nu_t) = h''_{t}(\nu_t)
\end{equation}
```

The constraints on $f(\cdot)$ make it a continuous function with continuous first and second derivatives in $]\nu_1, \nu_m[$. It is possible to extend the cubic spline $f(\cdot)$ to an interval $[a, b] \supset [\nu_1, \nu_m]$ with linear extensions on $[a, \nu_1[$ and $]\nu_m, b]$. The function $f:[a,b]\to\mathbb{R}$ built in this way is called _natural cubic spline_.

In definition \@ref(def:cubic-splines) we introduced 4 parameters ($\alpha_t, \vartheta_t, \gamma_t, \delta_t$) for each of the $m-1$ intervals $[\nu_t, \nu_{t+1}[$, $t\in\{1,2,\dots,m-1\}$. So, we have $4(m-1)$ parameters. In equation \@ref(eq:cubic-spline-constraints) we introduced $3$ constraints for each of the $m-2$ internal knots $\nu_2, \nu_3, \dots, \nu_{m-1}$. So the free parameters become $4(m-1) - 3(m-2) = m+2$. The linear extension on $[a, \nu_1[$ and $]\nu_m, b]$ corresponds to the constraint $f''(x)=0$ on $[a, \nu_1[ \, \cup \, ]\nu_m, b]$, thus, by continuity of the second derivative, $h''_1(\nu_1)=0$ and $h''_{m-1}(\nu_m)=0$. Adding these two constraints, we get that the natural cubic spline $f:[a,b]\to\mathbb{R}$ has $m$ degrees of freedom.

With an approach similar to the one we used in \@ref(chap:var-effects) for piece-wise effects in \ac{glm}, we can represent a cubic spline by the functions $x \mapsto \left( x-\nu_t \right)_+^3$, $t=1,2,\dots,m$.

The expression:
\begin{equation}
\label{eq:cubic-spline-decomposition}
f(x) = \alpha_0 + \vartheta_0 x + \sum_{t=1}^{m}{c_t \left( x-\nu_t \right)_+^3}, \qquad \text{with} \quad \sum_{t=1}^{m}{c_t} = 0 \quad \text{and} \quad \sum_{t=1}^{m}{c_t\nu_t} = 0
\end{equation}
gives a natural cubic spline.

The two side constraints ensure that we have a smooth linear extension to the right of $\nu_m$. The expression \@ref(eq:cubic-spline-decomposition) presents $m+2$ parameters, thus, with the 2 side constraints, there are $m$ degrees of freedom. From this expression it is easy to show that the natural cubic splines over the interval $[a,b]$ with knots $\nu_1, \nu_2, \dots, \nu_m$ constitute an $m$-dimensional vector space.
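
The role of the two side constraints can be checked with a short computation that uses only the symbols of \@ref(eq:cubic-spline-decomposition). For $x > \nu_m$ all the truncated terms are active, so that:
$$
f''(x) = 6\sum_{t=1}^{m}{c_t \left( x-\nu_t \right)}
= 6x \sum_{t=1}^{m}{c_t} - 6\sum_{t=1}^{m}{c_t\nu_t} = 0
$$
Hence $f$ is linear to the right of $\nu_m$. For $x < \nu_1$ all the truncated terms vanish and $f(x) = \alpha_0 + \vartheta_0 x$ is linear as well, without any further constraint.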


#### Smoothing

If we try to fit a cubic spline to data, we find that, as the number of knots increases, the function tends to overfit the data. This happens because increasing the number of knots increases the number of parameters of the model and, thus, the variance of the parameter estimators. We can see this effect in figure \@ref(fig:cub-spline). The data used for figures \@ref(fig:cub-spline) and \@ref(fig:gam-lambda) come from the LIDAR dataset described in [@ruppert2003semiparametric] and [@holst1996locally].

(ref:cub-spline-caption-latex) \ac{glm} with cubic splines for different numbers of knots. The more knots there are, the wigglier the estimated function $\hat{f}(x)$ is.

(ref:cub-spline-caption-gitbook) \ac{glm} with cubic splines for different numbers of knots, $0$ knots (top-left), $3$ knots (top-right), $7$ knots (bottom-left) and $35$ knots (bottom-right). The more knots there are, the wigglier the estimated function $\hat{f}(x)$ is.

(ref:cub-spline-caption-short) \ac{glm} with cubic splines for different numbers of knots.

```{r, plot-cub-spline-build, echo = FALSE, cache = TRUE}

# Read Lidar data
lidar = read.table("data/lidar.dat", header = TRUE) %>% 
  as_tibble()

lidar = lidar[sort.list(lidar$range), ]

# Standardize lidar$range
lidar$range = (lidar$range - mean(lidar$range)) / sd(lidar$range)


# Linear fitting

fit <- lm(logratio~., data = lidar)

p_cub_spline_1 <- lidar %>% 
  ggplot(aes(x = range, y = logratio)) +
  geom_abline(
    intercept = fit$coefficients[1],
    slope = fit$coefficients[2],
    col = col1,
    size = line_size
  ) +
  geom_point(alpha = .5)


# Cubic spline fitting

fbase = function(x, k, p) ifelse(x - k > 0, (x - k)^p, 0)
p <- 3


plot_spline_lidar <- function(dataset, knots, p = 3){
  # Create basis for splines
  lidar1 <- dataset
  for (i in 1:length(knots)){
    lidar1 <- lidar1 %>% 
      mutate(b = fbase(range, knots[i], p))
    
    names(lidar1)[ncol(lidar1)] = paste("b", i, sep = "")
  }
  
  # Fit GLM
  fit <- lm(logratio~., data = lidar1)
  
  # Create fitted values
  new_lidar1 <- tibble(
    range = seq(from = min(dataset$range), to = max(dataset$range), length.out = 1e3)
  )
  for (i in 1:length(knots)){
    new_lidar1 <- new_lidar1 %>% 
      mutate(b = fbase(range, knots[i], p))
    
    names(new_lidar1)[ncol(new_lidar1)] = paste("b", i, sep = "")
  }
  
  new_lidar1$fit <- predict(fit, newdata = new_lidar1)
  
  
  p <- lidar1 %>% 
    ggplot(aes(x = range, y = logratio)) +
    geom_vline(data = tibble(knots),
               aes(xintercept = knots),
               linetype = "dotted",
               alpha = .5) +
    geom_line(
      data = new_lidar1,
      aes(x = range, y = fit),
      col = col1,
      size = line_size
    ) +
    geom_point(alpha = .5)
  
  return(p)
}


p_cub_spline_2 <- plot_spline_lidar(dataset = lidar,
                        knots = c(-1, 0, 1),
                        p = 3)

p_cub_spline_3 <- plot_spline_lidar(dataset = lidar,
                        knots = seq(from = -1.5, to = 1.5, by = .5),
                        p = 3)

p_cub_spline_4 <- plot_spline_lidar(dataset = lidar,
                        knots = seq(from = -1.7, to = 1.7, by = .1),
                        p = 3)
```


```{r, plot-cub-spline-print, out.width = "50%", fig.width = 5, fig.height = 3.5, fig.align='center', fig.cap=ifelse(knitr::is_html_output(), "(ref:cub-spline-caption-gitbook)", "(ref:cub-spline-caption-latex)"), fig.scap="(ref:cub-spline-caption-short)", label="cub-spline", echo=FALSE, fig.ncol=2, fig.subcap=c('$0$ knots', '$3$ knots', '$7$ knots', '$35$ knots'), cache = TRUE}

# # To align the plots in a subfigure environment
# plot_grid_split <- function(..., align = "hv", axis = "tblr"){
#   aligned_plots <- cowplot::align_plots(..., align = align, axis = axis)
#   plots <- lapply(1:length(aligned_plots), function(x){
#     cowplot::ggdraw(aligned_plots[[x]])
#   })
#   invisible(capture.output(plots))
# }

plot_grid_split(p_cub_spline_1, p_cub_spline_2, p_cub_spline_3, p_cub_spline_4)

```

In the top-left panel, the fitted model is just linear and it clearly doesn't follow the path traced by the observed points. In the bottom-right panel, the fitted model is a spline with 35 knots and it is clearly too wiggly compared to the path traced by the observed points.

A measure of the wiggliness of a function is given by the integral of its squared second derivative:
\begin{equation}
\label{eq:wiggliness}
\int_{a}^{b}{\left( f''(x) \right)^2 dx}
\end{equation}

In figure \@ref(fig:wiggliness) some examples of functions $f(x)$ and their squared second derivatives $\left(f''(x)\right)^2$ are represented. The top panels show four functions of increasing wiggliness and the bottom panels show their squared second derivatives.


(ref:wiggliness-caption-latex) Squared second derivative $\left(f''(x)\right)^2$ for functions with different wiggliness. The wigglier the function $f(x)$ is, the higher $\left(f''(x)\right)^2$ is.

(ref:wiggliness-caption-short) Squared second derivative $\left(f''(x)\right)^2$ for functions with different wiggliness.

```{r, plot-wiggliness-print, out.width = "100%", fig.width = 10, fig.height = 5, fig.align='center', fig.cap="(ref:wiggliness-caption-latex)", fig.scap="(ref:wiggliness-caption-short)", label="wiggliness", echo=FALSE, cache = TRUE}

# Define expressions and compute derivatives
f2_expr <- expression(sin(2*pi*x))
f2_2_expr <- D(D(f2_expr, name = "x"), name = "x")

f3_expr <- expression(sin(3*pi*x))
f3_2_expr <- D(D(f3_expr, name = "x"), name = "x")

f4_expr <- expression(0.76*(sin(pi*x) + sin(4*pi*x)) - 0.5)
f4_2_expr <- D(D(f4_expr, name = "x"), name = "x")


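# Define functions
# The linear function of the first panel is drawn with geom_abline() below,
# so f1, f2 and f3 are the curves shown in panels 2, 3 and 4, while
# f1_2, ..., f4_2 are the squared second derivatives of panels 1 to 4.
f1_2 <- function(x){rep(0, length(x))}

f1 <- function(x){eval(f2_expr[[1]])}
f2_2 <- function(x){eval(f2_2_expr)^2}

f2 <- function(x){eval(f3_expr[[1]])}
f3_2 <- function(x){eval(f3_2_expr)^2}

f3 <- function(x){eval(f4_expr[[1]])}
f4_2 <- function(x){eval(f4_2_expr)^2}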


# Plot functions
tibble(
  x = 0:1,
  y = 0:1,
  type = c("f", "f2") %>% 
    factor(levels = c("f", "f2")),
  fun = 1:2
) %>% 
  ggplot(aes(x = x, y = y)) +
  geom_abline(aes(intercept = intercept, slope = slope),
              data = data.frame(x = 0, y = 0,
                                type = factor("f"),
                                fun = 1,
                                intercept = -.5,
                                slope = 1),
              col = col1,
              size = line_size) +
  geom_area(stat = "function", fun = f1_2,
            data = data.frame(x = 0, y = 0,
                              type = factor("f2"),
                              fun = 1),
            alpha = .5,
            col = col1,
            size = line_size / 2) +
  stat_function(fun = f1,
                data = data.frame(x = 0, y = 0,
                                  type = factor("f"),
                                  fun = 2),
                col = col1,
                size = line_size) +
  geom_area(stat = "function", fun = f2_2,
            data = data.frame(x = 0, y = 0,
                              type = factor("f2"),
                              fun = 2),
            alpha = .5,
            col = col1,
            size = line_size / 2) +
  stat_function(fun = f2,
                data = data.frame(x = 0, y = 0,
                                  type = factor("f"),
                                  fun = 3),
                col = col1,
                size = line_size) +
  geom_area(stat = "function", fun = f3_2,
            data = data.frame(x = 0, y = 0,
                              type = factor("f2"),
                              fun = 3),
            alpha = .5,
            col = col1,
            size = line_size / 2) +
  stat_function(fun = f3,
                data = data.frame(x = 0, y = 0,
                                  type = factor("f"),
                                  fun = 4),
                col = col1,
                size = line_size) +
  geom_area(stat = "function", fun = f4_2,
            data = data.frame(x = 0, y = 0,
                              type = factor("f2"),
                              fun = 4),
            alpha = .5,
            col = col1,
            size = line_size / 2) +
  facet_grid(type ~ fun,
             scales = "free_y") +
  scale_x_continuous(limits = c(0, 1)) #+
  # easy_remove_axes(
  #   which = "both",
  #   what = "text",
  #   teach = FALSE
  # )

```

As we can see from the plots, if a function $f(x)$ is linear, $\left(f''(x)\right)^2$ is identically zero and, as the curvature of the function increases, $\left(f''(x)\right)^2$ increases. For this reason, $\left(f''(x)\right)^2$ can also be seen as a measure of how much $f(x)$ differs from a linear function.
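
The measure \@ref(eq:wiggliness) can be evaluated numerically. The following base-R sketch differentiates an expression symbolically with `D()` and integrates the squared second derivative with `integrate()`. The helper name `wiggliness` and the test functions are hypothetical choices made for the example.

```r
# Wiggliness of f on [a, b]: integral of the squared second derivative.
# f_expr is an unevaluated expression in x, differentiated symbolically.
wiggliness <- function(f_expr, a = 0, b = 1) {
  f2_expr <- D(D(f_expr, "x"), "x")
  integrand <- function(x) {
    # rep_len() keeps the output vectorised when f'' is a constant
    rep_len(eval(f2_expr, list(x = x)), length(x))^2
  }
  integrate(integrand, lower = a, upper = b)$value
}

w_linear <- wiggliness(quote(x - 0.5))
w_sin2   <- wiggliness(quote(sin(2 * pi * x)))
w_sin3   <- wiggliness(quote(sin(3 * pi * x)))

c(linear = w_linear, sin2 = w_sin2, sin3 = w_sin3)
```

On $[0,1]$ the exact values are $0$, $(2\pi)^4/2$ and $(3\pi)^4/2$ respectively, so the measure grows quickly with the frequency of the oscillation.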

To contain this wiggliness and to prevent overfitting, \ac{gam} introduces a penalization based on $\left(f''(x)\right)^2$. In a regression problem with one explanatory variable $x$, this is achieved by taking the minimization problem in \@ref(eq:max-lik-est-deviance) and adding a penalization term to the deviance:
\begin{equation}
\label{eq:gam-est-deviance}
\hat{f} = \argmin_{f}{\left\{D(f, \boldsymbol{y}) + \lambda \int_{a}^{b}{\left( f''(x) \right)^2 dx}\right\}}
\end{equation}

<!-- \begin{equation} -->
<!-- \label{eq:gam-est-deviance} -->
<!-- \boldsymbol{\hat{f}} = \argmin_{\boldsymbol{f}}{\left(D(\boldsymbol{f}, \boldsymbol{y}) + \sum_{l=1}^{s}{\lambda_l \int_{a_l}^{b_l}{\left( f''_l(z_l) \right)^2 dz_l}}\right)} -->
<!-- \end{equation} -->

In formula \@ref(eq:gam-est-deviance) we used $f$ to indicate all the parameters of the natural cubic spline. The hyper-parameter $\lambda\ge0$ is called _smoothing parameter_ and measures how much we want to penalize wiggliness. If we choose $\lambda=0$, we do not penalize $\left( f''(x) \right)^2$ and the optimization problem corresponds to maximum likelihood estimation. The higher $\lambda$ is, the stronger the penalization for $\left( f''(x) \right)^2$ and the smoother the estimate $\hat{f}(x)$ will be. If $\lambda\to+\infty$, the penalization for wiggliness becomes infinite, so the result has $\hat{f}''(x)=0$, that is, $\hat{f}(x)$ is linear.

An example with different levels of $\lambda$ can be seen in figure \@ref(fig:gam-lambda). All the models have been fitted with $m = 50$ knots. As we can see, with $\lambda = 0$ we are clearly overfitting observations, while with $\lambda = 10^6$ we are clearly underfitting them.


(ref:gam-lambda-caption-latex) \ac{gam} estimate $\hat{f}(x)$ for different levels of the smoothing parameter $\lambda$. The higher $\lambda$ is, the less wiggly the estimated function $\hat{f}(x)$ is.

(ref:gam-lambda-caption-gitbook) \ac{gam} estimate $\hat{f}(x)$ for different levels of the smoothing parameter $\lambda$, $\lambda = 0$ (top-left), $\lambda = 10$ (top-right), $\lambda = 10^3$ (bottom-left) and $\lambda = 10^6$ (bottom-right). The higher $\lambda$ is, the less wiggly the estimated function $\hat{f}(x)$ is.

(ref:gam-lambda-caption-short) \ac{gam} estimate $\hat{f}(x)$ for different levels of the smoothing parameter $\lambda$.

```{r, plot-gam-lambda-build, echo = FALSE, cache = TRUE}

# Hyperparameters
k <- 50

sp1 <- 0
sp2 <- 10
sp3 <- 1e3
sp4 <- 1e6


# Plotting function
plot_gam_lidar <- function(dataset, k, sp){
  
  # Model fitting
  
  ## Attention, the default basis is bs='tp'
  # tp: thin plate regression splines
  # cr: natural cubic regression spline
  # bs: B-splines
  
  gam1 <- gam(formula = as.formula(str_c("logratio ~ s(range, bs = 'cr', k = ", k, ")")),
              sp = sp,
              data = dataset,
              family = gaussian())
  
  
  # Predictions on new data
  new_dataset <- tibble(range = seq(from = min(dataset$range),
                                  to = max(dataset$range),
                                  length.out = 1e3))
  
  new_dataset$fit1 <- predict(gam1, newdata = new_dataset)
  
  
  # Plotting results
  p <- dataset %>% 
    ggplot(aes(x = range, y = logratio)) +
    geom_line(
      data = new_dataset,
      aes(x = range, y = fit1),
      col = col1,
      size = line_size
    ) +
    geom_point(alpha = .5) +
    easy_remove_legend(
      teach = FALSE
    )
  
  return(p)
  
}

p_gam_lambda_1 <- plot_gam_lidar(dataset = lidar, k = k, sp = sp1)
p_gam_lambda_2 <- plot_gam_lidar(dataset = lidar, k = k, sp = sp2)
p_gam_lambda_3 <- plot_gam_lidar(dataset = lidar, k = k, sp = sp3)
p_gam_lambda_4 <- plot_gam_lidar(dataset = lidar, k = k, sp = sp4)


```

```{r, plot-gam-lambda-print, out.width = "50%", fig.width = 5, fig.height = 3.5, fig.align='center', fig.cap=ifelse(knitr::is_html_output(), "(ref:gam-lambda-caption-gitbook)", "(ref:gam-lambda-caption-latex)"), fig.scap="(ref:gam-lambda-caption-short)", label="gam-lambda", echo=FALSE, fig.ncol=2, fig.subcap=c('$\\lambda = 0$', '$\\lambda = 10$', '$\\lambda = 10^3$', '$\\lambda = 10^6$'), cache = TRUE}

plot_grid_split(p_gam_lambda_1, p_gam_lambda_2, p_gam_lambda_3, p_gam_lambda_4)

```


As long as the number of knots $m$ is large enough, the exact number $m$ and the positioning of the knots $\nu_1, \nu_2, \dots, \nu_m$ are not important. If the function is flexible enough, the curve is tuned just by tuning $\lambda$. It is possible to use as many knots as there are distinct observed values of $x$, but this could be too computationally expensive. A possible choice is to fix a number of knots $m$ and to position $\nu_1, \nu_2, \dots, \nu_m$ on the empirical quantiles of the observed $x$ or equally spaced over the range of $x$. In our example, $m=50$ knots seem to be enough.

In general, if we have $q$ quantitative explanatory variables that we want to fit with splines, the formula \@ref(eq:gam-est-deviance) becomes:
\begin{equation}
\label{eq:gam-est-deviance-multi}
\boldsymbol{\hat{f}} = \argmin_{\boldsymbol{f}}
{\left\{
  D(\boldsymbol{f}, \boldsymbol{y})
    + \sum_{l=1}^{q}{
      \lambda_l \int_{a_l}^{b_l}{\left( f_l''(x_l) \right)^2 dx_l}
    }
\right\}
} 
\end{equation}
where $\boldsymbol{f} = \left( f_1, f_2,\dots,f_q \right)$.

In this context, we have to tune a vector of $q$ smoothing parameters:
$$\boldsymbol{\lambda} = \left( \lambda_1, \lambda_2, \dots, \lambda_q \right)$$


#### Choice of the smoothing parameters {#chap:gam-choice-lambda}

As described, selecting the effect of $x_l$ just consists in finding the optimal parameter $\lambda_l$. The technique commonly used in machine learning for hyper-parameter tuning is _Cross Validation_.

For a set $\Lambda$ of values of $\boldsymbol{\lambda}$ we can perform a K-fold cross validation as described in \@ref(chap:variable-selection) and, for each $\boldsymbol{\lambda}\in\Lambda$, we can compute the average test deviance:
$$
D^{CV(K)}_{\boldsymbol{\lambda}} =
\frac{1}{K} \sum_{k=1}^{K}{
D\left(
\boldsymbol{\hat{f}}_{\boldsymbol{\lambda}}^{\setminus \mathcal{B}_k}, \boldsymbol{y}^{\mathcal{B}_k}
\right)
}
$$
At this point, we will choose the hyper-parameter vector that minimizes the cross-validation deviance:
$$
\hat{\boldsymbol{\lambda}} = \argmin_{\boldsymbol{\lambda}\in\Lambda}{D^{CV(K)}_{\boldsymbol{\lambda}}}
$$
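
The grid search above can be sketched in base R for a single smooth term with Normal response. The example assumes simulated data, a hypothetical grid $\Lambda$ and `stats::smooth.spline()` as the fitting routine. Note that the `lambda` argument of `smooth.spline()` is expressed on the internal scale of that function, so its values are not directly comparable with the $\lambda$ of \@ref(eq:gam-est-deviance).

```r
set.seed(123)
n <- 200
x <- sort(runif(n))
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)

K <- 5
fold <- sample(rep(seq_len(K), length.out = n))  # random fold labels B_1, ..., B_K
lambda_grid <- 10^seq(-8, -1, by = 1)            # candidate set Lambda (assumed)

# Average test deviance (here: mean squared error) for one value of lambda
cv_deviance <- function(lambda) {
  fold_dev <- vapply(seq_len(K), function(k) {
    train <- fold != k
    fit <- smooth.spline(x[train], y[train], lambda = lambda)
    pred <- predict(fit, x = x[!train])$y
    mean((y[!train] - pred)^2)
  }, numeric(1))
  mean(fold_dev)
}

cv_scores <- vapply(lambda_grid, cv_deviance, numeric(1))
lambda_hat <- lambda_grid[which.min(cv_scores)]
```

Each candidate value requires $K$ fits, so the cost grows quickly when the grid covers several smoothing parameters.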

Especially when $q$ is large, this procedure can be too computationally expensive. Two alternatives are the _Generalized Cross Validation_ (GCV) and the _Un-Biased Risk Estimator_ (UBRE). The idea behind these approaches is to estimate the test set deviance by just computing the training set deviance and applying to it a correction that penalizes for the complexity of the model, as it is done in AIC and BIC, described in section \@ref(chap:variable-selection).

GCV and UBRE formulas are based on the fact that \ac{gam}s with identity link and Normal response are _linear smoothers_, i.e. $\boldsymbol{\hat{\mu}}$ can be expressed as:
\begin{equation}
\label{eq:linear-smoother}
\boldsymbol{\hat{\mu}} = H \boldsymbol{y}
\end{equation}
where $H$ is an $n\times n$ matrix that depends on the design matrix $\boldsymbol{X}$ and the hyper-parameters. $H$ is called _smoothing matrix_. In general, if the link is not the identity and the response is not Normal, \ac{gam} is not a linear smoother, but the formula \@ref(eq:linear-smoother) still holds as a local approximation.

It can be proven that the trace of $H$ measures the flexibility of the function. Indeed, in the linear model $H = \boldsymbol{X} \left( \boldsymbol{X}^t\boldsymbol{X} \right)^{-1}\boldsymbol{X}^t$ and $\text{tr}(H) = \text{tr}\left( \boldsymbol{X} \left( \boldsymbol{X}^t\boldsymbol{X} \right)^{-1}\boldsymbol{X}^t \right) = p + 1$, which is the number of _degrees of freedom_ of the model. For this reason $\text{tr}(H)$ is also called the number of _effective degrees of freedom_ of the model. In \ac{gam}, as the smoothing parameter $\lambda$ increases, the flexibility of the model decreases and $\text{tr}(H)$ decreases.
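
Both statements can be checked numerically. The base-R sketch below uses simulated data: `hatvalues()` returns the diagonal of $H$ for a linear model, and the `df` component of a `smooth.spline()` fit reports its equivalent degrees of freedom. The grid of `lambda` values is a hypothetical choice on the internal scale of `smooth.spline()`.

```r
set.seed(42)
n <- 200
x <- sort(runif(n))
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)

# Linear model with p = 2 covariates: tr(H) equals p + 1 = 3
fit_lm <- lm(y ~ x + I(x^2))
trace_lm <- sum(hatvalues(fit_lm))

# Smoothing splines: the effective degrees of freedom shrink as lambda grows
lambda_grid <- 10^c(-7, -5, -3)
edf <- vapply(
  lambda_grid,
  function(l) smooth.spline(x, y, lambda = l)$df,
  numeric(1)
)

trace_lm
rbind(lambda = lambda_grid, edf = edf)
```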

For linear smoothers it can be proved that the _leave-one-out cross validation_ deviance is:
$$
D^{CV(n)} = \frac{1}{n} \sum_{i=1}^{n}{\left( \frac{y_i - \hat{\mu}_i}{1 - H_{ii}} \right)^2}
$$
where $H_{ii}$ is the $i$^th^ element of the diagonal of the smoothing matrix $H$.

This formula is particularly convenient because it requires just one fit on the whole dataset instead of one fit for each fold. Indeed, the quantities that must be computed are $\hat{\mu}_i$ and $H_{ii}$, which both depend only on the model fitted on the whole dataset.
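
For an ordinary linear model the identity is exact, which gives a simple way to verify it. The following base-R sketch, on simulated data, compares the shortcut with the brute-force computation based on $n$ separate fits.

```r
set.seed(7)
n <- 50
x <- runif(n)
y <- 1 + 2 * x + rnorm(n, sd = 0.5)
dat <- data.frame(x = x, y = y)

fit <- lm(y ~ x, data = dat)
h <- hatvalues(fit)  # diagonal elements H_ii

# Shortcut: a single fit on the whole dataset
loocv_shortcut <- mean((residuals(fit) / (1 - h))^2)

# Brute force: n fits, each leaving one observation out
loo_error <- vapply(seq_len(n), function(i) {
  fit_i <- lm(y ~ x, data = dat[-i, ])
  dat$y[i] - predict(fit_i, newdata = dat[i, ])
}, numeric(1))
loocv_brute <- mean(loo_error^2)

all.equal(loocv_shortcut, loocv_brute)
```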

This formula can be further simplified by replacing the values $H_{ii}$ with their average $\frac{\text{tr}(H)}{n}$. This way we get the GCV.
\begin{equation}
\label{eq:gcv}
GCV = \frac{1}{n} \sum_{i=1}^{n}{\left( \frac{y_i - \hat{\mu}_i}{1 - \frac{\text{tr}(H)}{n}} \right)^2}
\end{equation}

As for K-fold cross validation we can compute the GCV for different values of $\boldsymbol{\lambda} \in \Lambda$ and choose the value $\boldsymbol{\hat{\lambda}}$ that minimizes the GCV.

The formula \@ref(eq:gcv) can be expressed as:
\begin{align}
\nonumber
GCV & = \frac{1}{n} \sum_{i=1}^{n}{\left( \frac{y_i - \hat{\mu}_i}{1 - \frac{\text{tr}(H)}{n}} \right)^2} \\
\nonumber
& = \left( \frac{n}{n - \text{tr}(H)} \right)^2 \underbrace{\frac{\sum_{i=1}^{n}{\left( y_i - \hat{\mu}_i \right)^2}}{n}}_{=D(\boldsymbol{\hat{f}},\boldsymbol{y})} \\
\label{eq:gcv-gam}
& = \left( \frac{n}{n - \text{tr}(H)} \right)^2 D(\boldsymbol{\hat{f}},\boldsymbol{y})
\end{align}

The expression \@ref(eq:gcv-gam) generalizes the GCV to all the \ac{gam}s, also in the case in which the link is not identity and the response is not Normal.

To select the best value of $\boldsymbol{\lambda}$, we can perform a similar procedure using the UBRE in place of the GCV. The UBRE is defined as:
$$
UBRE = \frac{1}{n} D(\boldsymbol{\hat{f}}, \boldsymbol{y}) - \phi + \frac{2}{n} \text{tr}(H)\phi 
$$

Using the UBRE instead of the GCV is preferred when $\phi$ is known, as in the Poisson regression case, in which $\phi=1$.
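
A minimal base-R sketch of both criteria follows, on simulated Normal data with known variance. The helper names `gcv_score` and `ubre_score` are hypothetical. The sketch assumes that the deviance entering both formulas is the deviance per observation (the mean squared residual in the Normal case), and it uses `smooth.spline()` only as a convenient linear smoother, with `lambda` on that function's internal scale.

```r
# Assumption: `mean_dev` is the deviance divided by n
gcv_score <- function(mean_dev, edf, n) {
  (n / (n - edf))^2 * mean_dev
}
ubre_score <- function(mean_dev, edf, n, phi) {
  mean_dev - phi + 2 * edf * phi / n
}

set.seed(2024)
n <- 200
sigma <- 0.3
x <- sort(runif(n))
y <- sin(2 * pi * x) + rnorm(n, sd = sigma)

lambda_grid <- 10^seq(-8, -1, by = 1)  # hypothetical grid
scores <- do.call(rbind, lapply(lambda_grid, function(l) {
  fit <- smooth.spline(x, y, lambda = l)
  mean_dev <- mean((y - predict(fit, x = x)$y)^2)
  data.frame(
    lambda = l,
    edf = fit$df,  # effective degrees of freedom tr(H)
    gcv = gcv_score(mean_dev, fit$df, n),
    ubre = ubre_score(mean_dev, fit$df, n, phi = sigma^2)
  )
}))

lambda_gcv  <- scores$lambda[which.min(scores$gcv)]
lambda_ubre <- scores$lambda[which.min(scores$ubre)]
```

Both criteria need one fit per candidate value, instead of the $K$ fits required by K-fold cross validation.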


#### Why cubic splines

In section \@ref(chap:cubic-splines) we said that _cubic splines_ are commonly used for modeling $f(\cdot)$ in \ac{gam}s. This choice comes from the following theorem.

```{theorem, th-splines-property, name = "Spline property"}
Given the knots $\nu_1, \nu_2, \dots, \nu_m$ and the values $y_1, y_2, \dots, y_m$, for any $a\le\nu_1$ and $b\ge\nu_m$, only one natural cubic spline $s(\cdot)$, such that $s(\nu_k)=y_k, \ k\in \{1, \dots, m\}$, exists.

Moreover, given a twice continuously differentiable function $f(\cdot)$ such that $f(\nu_k)=y_k, \ k\in \{1, \dots, m\}$, then, for any $a\le\nu_1$ and $b\ge\nu_m$, it holds that
$$
\int_a^b{\left( s''(x)\right)^2 dx} \le \int_a^b{\left( f''(x)\right)^2 dx}
$$
```
One consequence of this theorem is that, among all the continuous functions $f(\cdot)$ with continuous first and second derivatives, the one that solves the optimization problem \@ref(eq:gam-est-deviance) is always a natural cubic spline.

Let's consider the distinct observed values $x_1^* < x_2^* < \dots < x_m^*$ of the variable $x$. If $f(\cdot)$ is a continuous function with continuous first and second derivatives, by theorem \@ref(thm:th-splines-property) there exists only one natural cubic spline $s(\cdot)$ such that $s(x_k^*) = f(x_k^*), \ k\in \{1, \dots, m\}$. As $s(x_i)=f(x_i) \ \forall i \in \mathcal{D}$, it results that $D(s, \boldsymbol{y}) = D(f, \boldsymbol{y})$. As $D(s, \boldsymbol{y}) = D(f, \boldsymbol{y})$ and $\int_a^b{\left( s''(x)\right)^2 dx} \le \int_a^b{\left( f''(x)\right)^2 dx}$, it results that, for any given $\lambda \ge 0$:
$$
D(s, \boldsymbol{y}) + \lambda \int_{a}^{b}{\left( s''(x) \right)^2 dx}
\ \le \ 
D(f, \boldsymbol{y}) + \lambda \int_{a}^{b}{\left( f''(x) \right)^2 dx}
$$

For this reason, if the aim of the model estimation is to minimize \@ref(eq:gam-est-deviance), for choosing $f$ we can just consider the class of the natural cubic splines.


#### Other bases {#other-basis}

In section \@ref(chap:cubic-splines) we said that natural cubic splines on knots $\nu_1, \dots, \nu_m$ constitute an $m$-dimensional vector space and that a possible basis decomposition is \@ref(eq:cubic-spline-decomposition).

Another commonly used basis is the _B-spline_ basis. B-splines are functions defined recursively as follows.

```{definition, b-splines, name = "B-splines"}
For $k\in\{1,2,\dots,m-1\}$:
$$
B_{0,k}(x) = 
\begin{cases}
1: & \nu_k \le x < \nu_{k+1} \\
0: & \text{otherwise}
\end{cases}
$$

For $j \ge 0$ and $k\in\{1,2,\dots,m+j\}$:
$$
B_{j+1,k}(x) = 
\frac{x-\nu_{k-j-1}}{\nu_{k}-\nu_{k-j-1}} B_{j,k-1}(x)
+ \frac{\nu_{k+1}-x}{\nu_{k+1}-\nu_{k-j}} B_{j,k}(x)
$$
```

In figure \@ref(fig:b-splines-plot) some B-splines of different degrees are represented. 

(ref:plot-b-splines-caption-latex) B-splines with different degrees.

(ref:plot-b-splines-caption-gitbook) B-splines with different degrees, degree $1$ (left), degree $2$ (center), degree $3$ (right).


```{r, plot-b-splines-build, echo = FALSE, cache = TRUE}

knots <- c(1,2,3)
x <- seq(from = 0, to = 4, by = .01)

base_1 = bs(x, knots = knots, degree = 1, intercept = TRUE)

p_b_splines_1 <- cbind(x, base_1) %>% 
  as_tibble() %>% 
  pivot_longer(
    cols = -x,
    names_to = "base",
    values_to = "value"
  ) %>% 
  ggplot(aes(x = x, y = value, color = base)) +
  geom_line() +
  easy_remove_axes(
    which = "y",
    what = "text",
    teach = FALSE
  ) +
  easy_remove_legend() +
  scale_x_continuous(
    breaks = 0:4,
    labels = c(
      expression(nu[1]),
      expression(nu[2]),
      expression(nu[3]),
      expression(nu[4]),
      expression(nu[5])
    )
  ) +
  labs(x = "x", y = "y")


base_2 = bs(x, knots = knots, degree = 2, intercept = TRUE)

p_b_splines_2 <- cbind(x, base_2) %>% 
  as_tibble() %>% 
  pivot_longer(
    cols = -x,
    names_to = "base",
    values_to = "value"
  ) %>% 
  ggplot(aes(x = x, y = value, color = base)) +
  geom_line() +
  easy_remove_axes(
    which = "y",
    what = "text",
    teach = FALSE
  ) +
  easy_remove_legend() +
  scale_x_continuous(
    breaks = 0:4,
    labels = c(
      expression(nu[1]),
      expression(nu[2]),
      expression(nu[3]),
      expression(nu[4]),
      expression(nu[5])
    )
  ) +
  labs(x = "x", y = "y")


base_3 = bs(x, knots = knots, degree = 3, intercept = TRUE)

p_b_splines_3 <- cbind(x, base_3) %>% 
  as_tibble() %>% 
  pivot_longer(
    cols = -x,
    names_to = "base",
    values_to = "value"
  ) %>% 
  ggplot(aes(x = x, y = value, color = base)) +
  geom_line() +
  easy_remove_axes(
    which = "y",
    what = "text",
    teach = FALSE
  ) +
  easy_remove_legend() +
  scale_x_continuous(
    breaks = 0:4,
    labels = c(
      expression(nu[1]),
      expression(nu[2]),
      expression(nu[3]),
      expression(nu[4]),
      expression(nu[5])
    )
  ) +
  labs(x = "x", y = "y")

```


```{r, plot-b-splines-print, out.width = "32%", fig.width = 3, fig.height = 2.25, fig.align='center', fig.cap = ifelse(knitr::is_html_output(), "(ref:plot-b-splines-caption-gitbook)", "(ref:plot-b-splines-caption-latex)"), label = "b-splines-plot", echo = FALSE, fig.ncol = 3, fig.subcap = c('degree $1$', 'degree $2$', 'degree $3$'), cache = TRUE}

plot_grid_split(p_b_splines_1, p_b_splines_2, p_b_splines_3)

```

It can be proven that the functions $B_{j,k}(x)$ are, given $j$, splines of degree $j$. Moreover, $B_{j,1}, \dots, B_{j,m+j-1}$ constitute a basis of the vector space of the splines of degree $j$ on the knots $\nu_1, \dots, \nu_m$. In particular, $B_{3,1}, \dots, B_{3,m+2}$ constitute a basis of the vector space of the splines of degree $3$. Therefore, if $s(x)$ is a third-degree spline on the knots $\nu_1, \dots, \nu_m$, it can be expressed as a linear combination of $B_{3,1}, \dots, B_{3,m+2}$:
$$
s(x) = \sum_{k=1}^{m+2}{\beta_{k}B_{3,k}(x)}
$$

B-splines are preferred to truncated polynomials as basis functions, since they are less correlated and lead to more stable and less computationally expensive estimates.
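
The recursion can be sketched in a few lines of base R. The example below uses the standard Cox-de Boor indexing, in which $B_{i,d}$ is supported on $[\nu_i, \nu_{i+d+1}[$, so the indices are shifted with respect to definition \@ref(def:b-splines). The function name `bspline` and the knots are hypothetical. The result is compared with `splines::splineDesign()`, used here only as a reference implementation.

```r
library(splines)  # splineDesign(), used only as a reference

# Cox-de Boor recursion (standard indexing): B_{i,d} is supported on
# [knots[i], knots[i + d + 1])
bspline <- function(x, knots, i, degree) {
  if (degree == 0) {
    return(as.numeric(knots[i] <= x & x < knots[i + 1]))
  }
  d_left  <- knots[i + degree] - knots[i]
  d_right <- knots[i + degree + 1] - knots[i + 1]
  left <- if (d_left > 0) {
    (x - knots[i]) / d_left * bspline(x, knots, i, degree - 1)
  } else 0
  right <- if (d_right > 0) {
    (knots[i + degree + 1] - x) / d_right * bspline(x, knots, i + 1, degree - 1)
  } else 0
  left + right
}

knots <- 0:7                  # hypothetical equally spaced knots
x <- seq(3, 3.99, by = 0.01)  # interval covered by four cubic B-splines
B_manual <- sapply(1:4, function(i) bspline(x, knots, i, degree = 3))
B_ref <- splineDesign(knots = knots, x = x, ord = 4)

max(abs(B_manual - B_ref))  # discrepancy with the reference implementation
range(rowSums(B_manual))    # the basis functions sum to one on this interval
```

Each basis function is non-zero only on a few adjacent intervals, which is the reason why B-spline design matrices are sparse and numerically well behaved.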

<!--
Da scrivere
- why splines
- other basis
- calcoli a partire da integrale
-->


#### GAM extensions

As in \ac{glm}, in \ac{gam} we can consider interactions between a variable with non-linear effect $x_{l_2}$ and another variable $x_{l_1}$. This can be achieved by adding to the linear predictor a term such as:
$$
x_{l_1} f(x_{l_2})
$$
This is particularly useful when $x_{l_1}$ is a binary variable and we want to fit two different curves for $x_{l_2}$ in the case $x_{l_1}=0$ and in the case $x_{l_1}=1$.

If we want to consider a more complex interaction between two quantitative variables with non-linear effect, \ac{gam} can be extended by considering non-parametric interactions and modeling them with two-dimensional splines, such as:
$$
f(x_{l_1}, x_{l_2})
$$
In this case, instead of fitting a curve on $(x_l, y)$, we will fit a flexible surface on $(x_{l_1}, x_{l_2}, y)$. This approach can be adopted also for modeling geographical data, in which $x_{l_1}, x_{l_2}$ are coordinates on the map, such as longitude and latitude.
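
Both kinds of interaction can be sketched with the `mgcv` package, assuming simulated data. The variable names (`d`, `age`, `lon`, `lat`) and the simulated effects are hypothetical. In `mgcv`, the `by` argument of `s()` multiplies a smooth by a variable, and `s()` with two arguments fits a two-dimensional smooth.

```r
library(mgcv)

set.seed(99)
n <- 600
dat <- data.frame(
  d   = rbinom(n, 1, 0.5),  # binary variable x_{l1}
  age = runif(n, 18, 80),   # quantitative variable x_{l2}
  lon = runif(n, 7, 18),    # hypothetical coordinates
  lat = runif(n, 37, 47)
)
dat$y <- with(dat,
  sin(age / 10) + d * 0.02 * (age - 50) +
    0.1 * cos(lon / 2) * sin(lat / 3) + rnorm(n, sd = 0.2)
)

fit_int <- gam(
  y ~ s(age) +        # baseline curve f(age)
    s(age, by = d) +  # additional curve d * f(age), active only when d = 1
    s(lon, lat),      # two-dimensional smooth surface f(lon, lat)
  data = dat
)

length(fit_int$smooth)  # three smooth terms
```

With this specification the curve for $d=0$ is the baseline smooth, while the curve for $d=1$ is the sum of the baseline smooth and the additional one.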

<!--
Da scrivere
- Generalizzare a più variabili
- Altre splines (interazioni)
- GCV
-->


#### Some considerations on GAM

As we have seen in this chapter, \ac{gam}s are flexible tools for fitting quantitative variables with non-linear effects.

One big advantage of \ac{gam}s is that they are based on the same assumptions as \ac{glm}s, except for the non-linearity of the components $f(x_{l})$. The connection with \ac{glm} leads to highly interpretable results. Indeed, as we do in \ac{glm}, in \ac{gam} we can easily observe the marginal effect of a variable $x_l$ on the response $y$ just by plotting the graph $\left(x_l, f(x_l)\right)$, while the interactions between variables are added manually, so we have full control over them.

Another big advantage is that \ac{gam} not only provides a flexible tool that produces interpretable results, but this tool works almost automatically. Indeed, in \ac{glm}, when we have to fit a non-linear effect of a quantitative variable $x_l$, we have to perform a manual process of piece-wise splitting and polynomial fitting on the range of $x_l$, whereas in \ac{gam} we do not have to explicitly specify the shape of $f(x_l)$ and everything is done by the algorithm.

This higher flexibility and automation comes at the cost of introducing more complexity in the model. Indeed, in a \ac{gam} there are many more parameters than in a \ac{glm} and this leads to a more computationally expensive fitting. However, while this higher complexity increases machine time, it significantly reduces the human time that in \ac{glm} would be needed for manual fitting.

<!--
- flessibile
- basato su GLM. Within the GLM framework. Interpretabile. Le interazioni sono inserite manualmente
- automatico. Non devo fare lavoro manuale con splitwise e polinomi vari
- semiparametrico. More complexity. Machine time
-->

\newpage

### Shrinkage estimators for GLM {#chap:shrinkage-estimators}

In this section we are going to present the shrinkage estimators, a class of estimators that is particularly useful in \ac{glm} when there are many explanatory variables. All the models presented in this section are actually \ac{glm}s: the advancements consist in the way the parameters are estimated. For more details on the shrinkage estimators we refer to [@wuthrich-data-analytics], [@james2013introduction], [@friedman2001elements], [@portugues-predictive-modeling] and [@ruppert2003semiparametric].


#### The Bias-Variance Trade Off

One of the properties of \ac{glm} with identity link and Normal response is that the maximum likelihood estimator $\boldsymbol{\tilde{\beta}}^{ML}$ is unbiased, that is $E\left(\boldsymbol{\tilde{\beta}}^{ML}\right) = \boldsymbol{\beta}$.^[In general this property does not hold for \ac{glm} with other links and response distributions, but it holds asymptotically: $\boldsymbol{\tilde{\beta}}^{ML} \xrightarrow{n\to+\infty} \boldsymbol{\beta}$.] However, the bias is not the only relevant aspect of an estimator. If we consider the _Mean Squared Error_ (MSE) of an estimator $\tilde{\beta}_j$, we get the following result:
\begin{equation}
\label{eq:bias-variance}
MSE\left(\tilde{\beta}_j\right) \eqdef E\left(\left(\tilde{\beta}_j - \beta_j\right)^2\right) =
\underbrace{\left( E(\tilde{\beta}_j)-\beta_j \right)^2}_{\text{Bias}^2} + 
\underbrace{Var\left( \tilde{\beta}_j \right)}_{\text{Variance}}
\end{equation}

The idea behind shrinkage estimators is to add an amount of smart bias to $\tilde{\beta}_j$ in order to reduce its variance.
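
As a minimal numerical sketch of this idea (not taken from the cited references), consider estimating a mean $\mu$ with the shrunk estimator $c\,\bar{y}$, $0 < c \le 1$. The values of `mu`, `sigma`, `n` and `shrink` are arbitrary assumptions chosen only for illustration; the decomposition \@ref(eq:bias-variance) is evaluated in closed form and checked by simulation.

```r
# Illustrative assumptions: true mean, standard deviation and sample size
mu <- 1
sigma <- 2
n <- 10
shrink <- c(1, 0.8, 0.5)  # shrink = 1 is the unbiased sample mean

# Closed-form Bias^2 and Variance of the estimator shrink * mean(y)
bias2 <- ((shrink - 1) * mu)^2
variance <- shrink^2 * sigma^2 / n
mse <- bias2 + variance

# Monte Carlo check of the decomposition
set.seed(1)
y_bar <- replicate(20000, mean(rnorm(n, mean = mu, sd = sigma)))
mse_mc <- sapply(shrink, function(k) mean((k * y_bar - mu)^2))

data.frame(shrink, bias2, variance, mse, mse_mc)
```

In this setting the shrunk estimators accept a small squared bias in exchange for a larger reduction in variance, so their MSE is lower than that of the unbiased sample mean. With a different choice of `mu`, `sigma` and `n` the balance may go the other way.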

The trade-off between Bias and Variance described by equation \@ref(eq:bias-variance) can be interpreted in relation to the complexity of the model. Figure \@ref(fig:bias-variance) shows how the prediction error changes as the complexity of the model increases. In linear smoothers, the complexity of the model can be measured by the effective degrees of freedom $\text{tr}(H)$ (see section \@ref(chap:gam-choice-lambda)), which in the case of the \ac{glm} with maximum likelihood estimators is just the number of parameters $p+1$. Usually, in machine learning models, the complexity of the model is determined by tuning one or more hyper-parameters. In the case of \ac{gam}, for example, by increasing $\lambda$ we obtain a less complex model, while by decreasing it we obtain a more complex model.

A model that is not complex enough produces estimators with low variance but high bias, which leads to underfitting. A model that is too complex produces estimators with low bias but high variance, which leads to overfitting.

Since in \ac{glm} with maximum likelihood estimators the complexity increases with $p$, these estimators are not suitable when $p$ is large (high dimensionality).
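
The relation between complexity and error can be sketched with a small simulation. The data-generating process below (a sine signal plus Normal noise) and the use of the polynomial degree as a measure of complexity are assumptions made only for this illustration.

```r
set.seed(42)

# Hypothetical data-generating process: a smooth non-linear signal plus noise
simulate <- function(n) {
  x <- runif(n)
  data.frame(x = x, y = sin(2 * pi * x) + rnorm(n, sd = 0.3))
}
train <- simulate(50)
test <- simulate(1000)

# Fit polynomials of increasing degree and compare the two errors
degrees <- 1:10
errors <- t(sapply(degrees, function(d) {
  fit <- lm(y ~ poly(x, d), data = train)
  c(
    degree = d,
    in_sample = mean(residuals(fit)^2),
    out_of_sample = mean((test$y - predict(fit, newdata = test))^2)
  )
}))

errors
```

Since the polynomial models are nested, the in-sample error cannot increase with the degree, while the out-of-sample error is expected to stop improving once the model starts fitting the noise.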

(ref:bias-variance-caption) The Bias-Variance trade off. The horizontal axis shows the increase in complexity and the vertical axis shows the effect on in-sample error (training set) and out-of-sample error (test set). The error can be measured as the Mean Squared Error (MSE) or in general as the Deviance.

(ref:bias-variance-caption-short) The Bias-Variance trade off.

```{r, plot-bias-variance, out.width = "80%", fig.align='center', fig.cap = "(ref:bias-variance-caption)", fig.scap = "(ref:bias-variance-caption-short)", label = "bias-variance", echo = FALSE, fig.ncol = 2, cache = TRUE}

train_error <- function(x){
  (1 - 0.1) / (x + 1)^2
}

test_error <- function(x){
  train_error(x) + x^2/10 + 0.1
}


text_low <- textGrob(
  "Low Complexity\nHigh Bias\nLow Variance\nUnderfitting",
  gp = gpar(fontsize = 10)
)

text_high <- textGrob(
  "High Complexity\nLow Bias\nHigh Variance\nOverfitting",
  gp = gpar(fontsize = 10)
)

text_opt <- textGrob("Optimal Complexity", gp = gpar(fontsize = 10))

text_y <- -0.3
text_x_low <- 0.25
text_x_high <- 3.25


x_opt <- 1.0479
text_y_opt <- 1.55


tibble(
  x = c(0, 0),
  y = c(0, 0),
  # set = c("Training", "Test") %>% 
  set = c("In-Sample Error", "Out-Of-Sample Error") %>% 
    # factor(., levels = .)
    fct_inorder()
) %>% 
  ggplot(aes(x = x, y = y, color = set)) +
  geom_vline(
    xintercept = x_opt,
    linetype = "dotted"
  ) +
  stat_function(
    # data = tibble(x = 0, y = 0, set = "Training set"),
    data = tibble(x = 0, y = 0, set = "In-Sample Error"),
    fun = train_error,
    size = line_size
  ) +
  stat_function(
    # data = tibble(x = 0, y = 0, set = "Test set"),
    data = tibble(x = 0, y = 0, set = "Out-Of-Sample Error"),
    fun = test_error,
    size = line_size
  ) +
  scale_x_continuous(limits = c(0, 3.5)) +
  scale_y_continuous(limits = c(0, NA)) +
  labs(x = "Model Complexity", y = "Loss") +
  theme(plot.margin = unit(c(1.5, 1, 1.2, 1), "lines")) +
  # annotation_custom(text_high, xmin = 0.5, xmax = 0.5, ymin = -0.07, ymax = -0.07) + 
  annotation_custom(
    grob = text_low,
    xmin = text_x_low, xmax = text_x_low,
    ymin = text_y, ymax = text_y
  ) +
  annotation_custom(
    grob = text_high,
    xmin = text_x_high, xmax = text_x_high,
    ymin = text_y, ymax = text_y
  ) + 
  annotation_custom(
    grob = text_opt,
    xmin = x_opt, xmax = x_opt,
    ymin = text_y_opt, ymax = text_y_opt
  ) + 
  coord_cartesian(
    ylim = c(0, 1.4),
    clip = "off"
  ) +
  easy_remove_axes(
    which = "both",
    what = c("text", "ticks")
  ) +
  easy_remove_legend_title() +
  theme(
    legend.position = "bottom",
    legend.justification = "center",
    legend.margin = margin(0, 0, 0, 0),
    legend.box.margin = margin(5, 0, 0, 0)
    # legend.box.margin = margin(15, 0, 0, 0)
  )

```


#### Ridge Regression

One shrinkage estimator for \ac{glm} is the _Ridge Regression_. In Ridge Regression, the decrease in variance of $\tilde{\boldsymbol{\beta}}$ is achieved by considering the optimization problem \@ref(eq:max-lik-est-deviance) and adding to the Deviance a penalization term that depends on the magnitude of $\beta_1, \beta_2, \dots, \beta_p$:
\begin{equation}
\label{eq:ridge-est-deviance}
\hat{\boldsymbol{\beta}} = \argmin_{\boldsymbol{\beta}\in\mathbb{R}^{p+1}}{\left\{D(\boldsymbol{\beta}, \boldsymbol{y}) + \lambda \|\boldsymbol{\beta}_{\setminus0}\|_2^2\right\}}
\end{equation}
where:

* $\boldsymbol{\beta}_{\setminus0} = \left(\beta_1, \beta_2, \dots, \beta_p\right)$ is the set of the \ac{glm} coefficients except for the intercept $\beta_0$;
* $\|\cdot\|_2^2$ is the squared $L^2$ norm, i.e. $\|\boldsymbol{\beta}_{\setminus0}\|_2^2 = \sum_{j=1}^p{\beta_j^2}$;
* $\lambda\ge0$ is a hyper-parameter that controls the penalization.

The term $\|\boldsymbol{\beta}_{\setminus0}\|_2^2$ produces a penalization for high values of $\beta_j$. This penalization leads to a shrinkage of the coefficients.
The exclusion of $\beta_0$ from the penalization term is intended to prevent the introduction of a bias towards $0$ in the intercept. As the magnitude of $\beta_j$ depends on the scale of $x_j$, it is preferable to standardize all the explanatory variables, so that the penalization is not distorted by their units of measure.
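
A minimal sketch of the standardization step with base R follows; the variable names and values are hypothetical. Note that the `glmnet()` function has a `standardize` argument, `TRUE` by default, that performs this step internally.

```r
# Hypothetical explanatory variables measured on very different scales
x <- cbind(
  age = c(23, 35, 47, 52, 61),
  km_per_year = c(5000, 12000, 20000, 8000, 15000)
)

# Centre each column on 0 and scale it to unit standard deviation
x_std <- scale(x)

round(colMeans(x_std), 10)  # column means after standardization
apply(x_std, 2, sd)         # column standard deviations after standardization
```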

As for \ac{gam}, the hyper-parameter $\lambda$ determines the amount of penalization given to high values of the coefficients $\hat{\beta}_j$. If $\lambda=0$, the optimization problem \@ref(eq:ridge-est-deviance) corresponds to the maximum likelihood. As $\lambda$ increases, the optimal coefficients $\hat{\beta}_j$ tend to shrink towards $0$. In the limit case, if $\lambda\to+\infty$, the optimal coefficients $\hat{\beta}_j$ are all equal to $0$, except for $\hat{\beta}_0$, which is equal to $g(\bar{y})$, so the estimated model corresponds to the trivial model with only the intercept $g(\mu)=\beta_0$.

An example of the effect of $\lambda$ is reported in figure \@ref(fig:ridge-lambda). The data represented have been simulated from a \ac{glm} with identity link and Normal response. As we can see, if $\lambda=0$, each estimated response $\hat{\mu}_j$ corresponds to the average of the response in that group, $\bar{y}_j$. As $\lambda$ increases, the coefficients $\hat{\beta}_j$ shrink towards $0$ and the estimated responses move towards the global average $\bar{y}$. The speed of this convergence depends on how much the group average $\bar{y}_j$ differs from the global average $\bar{y}$ and on the number of observations in the group: if in a group $j$ there are many observations, we have a lot of information on that group, so the group average $\bar{y}_j$ has a small variance and is a reliable estimate for $\mu_j$, while if there are few observations, the group average $\bar{y}_j$ has a high variance and is not very reliable.

The optimal value of $\lambda$ can be obtained through Cross Validation, as seen in section \@ref(chap:gam-choice-lambda).
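
A minimal sketch of this selection follows, restricted to the case of Normal response and identity link without intercept, where the Ridge estimate has the explicit form \@ref(eq:ridge-estimator). The simulated data, the number of folds `k`, the grid `lambda_grid` and the helper `ridge_fit()` are assumptions made for illustration and do not reproduce the procedure of any specific package.

```r
set.seed(123)

# Hypothetical simulated data: Normal response, identity link, no intercept
n <- 100
p <- 10
x <- scale(matrix(rnorm(n * p), n, p))
beta <- c(2, -1, 0.5, rep(0, p - 3))
y <- drop(x %*% beta + rnorm(n))
y <- y - mean(y)

# Closed-form Ridge estimate for a given lambda
ridge_fit <- function(x, y, lambda) {
  solve(crossprod(x) + lambda * diag(ncol(x)), crossprod(x, y))
}

# K-fold Cross Validation of the out-of-sample MSE over a grid of lambda
k <- 5
fold <- sample(rep(seq_len(k), length.out = n))
lambda_grid <- 10^seq(-2, 3, length.out = 30)

cv_mse <- sapply(lambda_grid, function(lambda) {
  mean(sapply(seq_len(k), function(i) {
    b <- ridge_fit(x[fold != i, ], y[fold != i], lambda)
    mean((y[fold == i] - x[fold == i, ] %*% b)^2)
  }))
})

# Value of lambda with the lowest cross-validated error
lambda_opt <- lambda_grid[which.min(cv_mse)]
lambda_opt
```

For other response distributions and links there is no explicit solution, so each fit inside the loop would have to be obtained numerically.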

<!--
Comment on the plot
- The speed at which \hat{\beta} converges to 0 depends on
  + Magnitude of the maximum likelihood coefficient
  + Number of observations in the class
-->


(ref:ridge-lambda-caption-latex) Ridge Regression coefficients for different levels of the penalization parameter $\lambda$. If $\lambda=0$, the coefficients correspond to the maximum likelihood coefficients. As $\lambda$ increases, the coefficients are shrunk towards $0$.

(ref:ridge-lambda-caption-gitbook) Ridge Regression coefficients for different levels of the penalization parameter $\lambda$.

(ref:ridge-lambda-caption-short) Ridge Regression coefficients for different levels of the penalization parameter $\lambda$.

```{r, plot-ridge-lambda-build, echo = FALSE, cache = TRUE}

# Simulate data

n <- c(45, 20, 20, 10, 5)
b <- c(0, 2, 0.2, -1, -2)

sigma <- .5

set.seed(123)

df <- tibble(
  x = c(
    rep("a", n[1]),
    rep("b", n[2]),
    rep("c", n[3]),
    rep("d", n[4]),
    rep("e", n[5])
  ) %>% 
    factor(),
  mu = c(
    rep(b[1], n[1]),
    rep(b[2], n[2]),
    rep(b[3], n[3]),
    rep(b[4], n[4]),
    rep(b[5], n[5])
  )
)

df$y <- rnorm(n = sum(n), mean = df$mu, sd = sigma)


# Ridge regression

x_mat <- model.matrix(y ~ x, data = df) 


lambda_grid_ridge <- 10^seq(log10(0.001), log10(100), length.out = 100)
# lambda_grid_lasso <- 10^seq(log10(0.001), log10(10), length.out = 100)

alpha_ridge <- 0
# alpha_lasso <- 1

fit_ridge <- glmnet(
  x = x_mat, y = df$y,
  alpha = alpha_ridge, lambda = lambda_grid_ridge
)


fit_ridge_beta_long <- fit_ridge$beta %>% 
  t() %>% 
  as.matrix() %>% 
  as_tibble() %>% 
  mutate(
    lambda = fit_ridge$lambda,
    log_lambda = log(lambda)
  ) %>% 
  rename(xa = `(Intercept)`) %>% 
  pivot_longer(
    cols = xa:xe,
    names_to = "coefficient"
  ) %>% 
  mutate(coefficient = str_sub(coefficient, 2, 2))

lambda_ridge <- c(0, 0.1, 1, 10)
# lambda_lasso <- c(0, 0.1, 0.6, 1)

p_ridge_coeff <- fit_ridge_beta_long %>% 
  ggplot(aes(x = lambda, y = value, color = coefficient)) +
  geom_vline(
    xintercept = c(lambda_ridge[lambda_ridge > 0]),
    linetype = "dotted"
  ) +
  geom_line() +
  geom_text(
    fit_ridge_beta_long %>% 
      filter(lambda == min(lambda)),
    mapping = aes(label = coefficient),
    nudge_x = -0.1
  ) +
  scale_x_log10() +
  easy_remove_legend() +
  labs(
    x = expression(lambda),
    y = expression(hat(beta))
  )
  

fit_ridge_pred <- predict(
  fit_ridge,
  type = "response",
  s = lambda_ridge,
  newx = unique(x_mat)
) %>% 
  as_tibble() %>%
  mutate(x = c("a", "b", "c", "d", "e")) %>% 
  pivot_longer(
    cols = -x
  ) %>% 
  left_join(
    tibble(
      name = as.character(1:length(lambda_ridge)),
      lambda = lambda_ridge
    ),
    by = "name"
  ) #%>% 
  # mutate(
  #   lambda = str_c("lambda = ", lambda) %>% 
  #     fct_inorder()
  # )


p_ridge_pred_1 <- df %>% 
  ggplot(aes(x = x, y = y)) +
  geom_hline(
    # yintercept = mean(df$y),
    data = fit_ridge_pred %>% 
      filter(lambda == lambda_ridge[1]) %>% 
      filter(x == "a"),
    mapping = aes(yintercept = value),
    linetype = "dotted"
  ) +
  geom_point(
    data = fit_ridge_pred %>% 
      filter(lambda == lambda_ridge[1]),
    mapping = aes(x = x, y = value, color = x),
    size = 3
  ) +
  geom_point(alpha = .3) +
  geom_point(
    data = fit_ridge_pred %>% 
      filter(lambda == lambda_ridge[1]),
    mapping = aes(x = x, y = value, color = x),
    size = 3,
    alpha = .8
  ) +
  # facet_wrap(~lambda) +
  easy_remove_legend()

p_ridge_pred_2 <- df %>% 
  ggplot(aes(x = x, y = y)) +
  geom_hline(
    # yintercept = mean(df$y),
    data = fit_ridge_pred %>% 
      filter(lambda == lambda_ridge[2]) %>% 
      filter(x == "a"),
    mapping = aes(yintercept = value),
    linetype = "dotted"
  ) +
  geom_point(
    data = fit_ridge_pred %>% 
      filter(lambda == lambda_ridge[2]),
    mapping = aes(x = x, y = value, color = x),
    size = 3
  ) +
  geom_point(alpha = .3) +
  geom_point(
    data = fit_ridge_pred %>% 
      filter(lambda == lambda_ridge[2]),
    mapping = aes(x = x, y = value, color = x),
    size = 3,
    alpha = .8
  ) +
  # facet_wrap(~lambda) +
  easy_remove_legend()

p_ridge_pred_3 <- df %>% 
  ggplot(aes(x = x, y = y)) +
  geom_hline(
    # yintercept = mean(df$y),
    data = fit_ridge_pred %>% 
      filter(lambda == lambda_ridge[3]) %>% 
      filter(x == "a"),
    mapping = aes(yintercept = value),
    linetype = "dotted"
  ) +
  geom_point(
    data = fit_ridge_pred %>% 
      filter(lambda == lambda_ridge[3]),
    mapping = aes(x = x, y = value, color = x),
    size = 3
  ) +
  geom_point(alpha = .3) +
  geom_point(
    data = fit_ridge_pred %>% 
      filter(lambda == lambda_ridge[3]),
    mapping = aes(x = x, y = value, color = x),
    size = 3,
    alpha = .8
  ) +
  # facet_wrap(~lambda) +
  easy_remove_legend()

p_ridge_pred_4 <- df %>% 
  ggplot(aes(x = x, y = y)) +
  geom_hline(
    # yintercept = mean(df$y),
    data = fit_ridge_pred %>% 
      filter(lambda == lambda_ridge[4]) %>% 
      filter(x == "a"),
    mapping = aes(yintercept = value),
    linetype = "dotted"
  ) +
  geom_point(
    data = fit_ridge_pred %>% 
      filter(lambda == lambda_ridge[4]),
    mapping = aes(x = x, y = value, color = x),
    size = 3
  ) +
  geom_point(alpha = .3) +
  geom_point(
    data = fit_ridge_pred %>% 
      filter(lambda == lambda_ridge[4]),
    mapping = aes(x = x, y = value, color = x),
    size = 3,
    alpha = .8
  ) +
  # facet_wrap(~lambda) +
  easy_remove_legend()


```

```{r, plot-ridge-lambda-print, out.width = c("50%", "50%", "50%", "50%", "70%"), fig.width = c(4,4,4,4,5.6), fig.height = c(2.25,2.25,2.25,2.25,3.15), fig.align='center', fig.cap=ifelse(knitr::is_html_output(), "(ref:ridge-lambda-caption-gitbook)", "(ref:ridge-lambda-caption-latex)"), fig.scap="(ref:ridge-lambda-caption-short)", label="ridge-lambda", echo=FALSE, fig.ncol=2, fig.subcap = c(str_c('$\\lambda = ', lambda_ridge, '$'), '$\\hat{\\beta}_j$ for different values of $\\lambda$'), cache = TRUE}


plot_grid_split(
  p_ridge_pred_1,
  p_ridge_pred_2,
  p_ridge_pred_3,
  p_ridge_pred_4,
  p_ridge_coeff
)

```


The case of \ac{glm} with identity link and Normal distribution is particularly convenient for the interpretation because, in this case, the optimization problem \@ref(eq:ridge-est-deviance) has an explicit solution. To make the results more interpretable, let us assume that the response variables have $E(Y)=0$ and that the explanatory variables are centered on $0$, i.e. $\bar{x}_j=0, \ j\in\{1,2,\dots,p\}$. This means that $\beta_0=0$ and the model is $E(Y)=\beta_1x_1+\beta_2x_2+\dots+\beta_px_p$. With these assumptions, we obtain:
\begin{equation}
\label{eq:ridge-estimator}
\hat{\boldsymbol{\beta}}_{\lambda,\setminus0} = \left(\boldsymbol{X}^t\boldsymbol{X}+\lambda I_p\right)^{-1}\boldsymbol{X}^t \boldsymbol{y}
\end{equation}
where:

* $\hat{\boldsymbol{\beta}}_{\lambda,\setminus0} = \left(\hat{\beta}_1, \hat{\beta}_2, \dots, \hat{\beta}_p\right)$;
* $\boldsymbol{X}$ is the design matrix without the first column of $1$s, as the intercept is excluded from the model;
* $I_p$ is the identity matrix with dimension $p$.
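
Formula \@ref(eq:ridge-estimator) can be evaluated directly with base R. The following is a minimal sketch on simulated data; the helper `ridge_fit()`, the true coefficients and the value `lambda = 10` are assumptions chosen for illustration.

```r
set.seed(123)

# Hypothetical centred data, so that the intercept can be left out
n <- 50
p <- 3
x <- scale(matrix(rnorm(n * p), n, p), center = TRUE, scale = FALSE)
y <- drop(x %*% c(1, -2, 0.5) + rnorm(n))
y <- y - mean(y)

# Ridge estimate: (X'X + lambda * I_p)^(-1) X'y
ridge_fit <- function(x, y, lambda) {
  drop(solve(crossprod(x) + lambda * diag(ncol(x)), crossprod(x, y)))
}

beta_ml <- ridge_fit(x, y, lambda = 0)
beta_ridge <- ridge_fit(x, y, lambda = 10)

# With lambda = 0 we recover least squares without intercept
all.equal(beta_ml, unname(coef(lm(y ~ x - 1))))

rbind(beta_ml, beta_ridge)
```

Using `solve(a, b)` rather than inverting the matrix explicitly is a common numerical choice and does not change the result.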

From formula \@ref(eq:ridge-estimator) we can see that, while the maximum likelihood estimator is unbiased, if $\lambda>0$ the Ridge estimator is biased:
\begin{align*}
E\left(\tilde{\boldsymbol{\beta}}_{0,\setminus0}\right) & =
\left(\boldsymbol{X}^t\boldsymbol{X}\right)^{-1}\boldsymbol{X}^t E(Y) \\ & =
\left(\boldsymbol{X}^t\boldsymbol{X}\right)^{-1}\boldsymbol{X}^t \boldsymbol{X} \boldsymbol{\beta}_{\setminus0} \\ & =
\boldsymbol{\beta}_{\setminus0} \\[6pt]
E\left(\tilde{\boldsymbol{\beta}}_{\lambda,\setminus0}\right) & =
\left(\boldsymbol{X}^t\boldsymbol{X}+\lambda I_p\right)^{-1}\boldsymbol{X}^t E(Y) \\ & =
\left(\boldsymbol{X}^t\boldsymbol{X}+\lambda I_p\right)^{-1}\boldsymbol{X}^t\boldsymbol{X} \boldsymbol{\beta}_{\setminus0} \\ & \neq
\boldsymbol{\beta}_{\setminus0}
\end{align*}
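
The bias can be made explicit with one more algebraic step, which follows directly from the last expression above:
\begin{align*}
E\left(\tilde{\boldsymbol{\beta}}_{\lambda,\setminus0}\right) - \boldsymbol{\beta}_{\setminus0} & =
\left[\left(\boldsymbol{X}^t\boldsymbol{X}+\lambda I_p\right)^{-1}\boldsymbol{X}^t\boldsymbol{X} - I_p\right] \boldsymbol{\beta}_{\setminus0} \\ & =
\left(\boldsymbol{X}^t\boldsymbol{X}+\lambda I_p\right)^{-1}\left[\boldsymbol{X}^t\boldsymbol{X} - \left(\boldsymbol{X}^t\boldsymbol{X}+\lambda I_p\right)\right] \boldsymbol{\beta}_{\setminus0} \\ & =
-\lambda \left(\boldsymbol{X}^t\boldsymbol{X}+\lambda I_p\right)^{-1} \boldsymbol{\beta}_{\setminus0}
\end{align*}
so the bias vanishes only if $\lambda=0$ or $\boldsymbol{\beta}_{\setminus0}=\boldsymbol{0}$.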

Moreover, from formula \@ref(eq:ridge-estimator) we find that, even if there is multicollinearity and $\boldsymbol{X}^t\boldsymbol{X}$ is not invertible, the Ridge estimator is computable for any $\lambda>0$, because $\boldsymbol{X}^t\boldsymbol{X}+\lambda I_p$ is always invertible. This aspect is particularly interesting when $p > n$.
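
A small sketch of the case $p > n$ on simulated data; the dimensions and the value of `lambda` are arbitrary assumptions.

```r
set.seed(123)

# Hypothetical high-dimensional setting: more variables than observations
n <- 5
p <- 8
x <- matrix(rnorm(n * p), n, p)
y <- rnorm(n)

xtx <- crossprod(x)
qr(xtx)$rank  # at most n, hence lower than p: xtx is singular

# Adding lambda * I_p makes the system solvable
lambda <- 0.1
beta_ridge <- drop(solve(xtx + lambda * diag(p), crossprod(x, y)))
beta_ridge
```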

If we assume that the explanatory variables are independent and standardized so that the columns of $\boldsymbol{X}$ are orthonormal, i.e. $\boldsymbol{X}^t\boldsymbol{X} = I_p$, we can further simplify the expression \@ref(eq:ridge-estimator) to:
$$
\hat{\boldsymbol{\beta}}_{\lambda,\setminus0} =
\frac{1}{1+\lambda} \boldsymbol{X}^t \boldsymbol{y} = 
\frac{1}{1+\lambda} \hat{\boldsymbol{\beta}}_{0,\setminus0}
$$
that results in:
\begin{align*}
E\left(\tilde{\boldsymbol{\beta}}_{\lambda,\setminus0}\right) & =
\frac{1}{1+\lambda} \boldsymbol{\beta}_{\setminus0} \\
Var\left(\tilde{\boldsymbol{\beta}}_{\lambda,\setminus0} \right) & =
\frac{1}{\left(1+\lambda\right)^2} Var\left( \tilde{\boldsymbol{\beta}}_{0,\setminus0} \right)
\end{align*}

This result means that, if the explanatory variables are independent, the Ridge penalization shrinks all the estimated coefficients by a factor of $\frac{1}{1+\lambda}$ and reduces their variance by a factor of $\frac{1}{\left(1+\lambda\right)^2}$.
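
The proportional shrinkage can be checked numerically. In the sketch below the orthonormal design matrix is built through a QR decomposition, which is only a convenient assumption to obtain $\boldsymbol{X}^t\boldsymbol{X} = I_p$ on simulated data.

```r
set.seed(123)
n <- 50
p <- 3

# Orthonormal columns from a QR decomposition, so that crossprod(x) = I_p
x <- qr.Q(qr(matrix(rnorm(n * p), n, p)))
y <- rnorm(n)

lambda <- 3
beta_ml <- drop(crossprod(x, y))
beta_ridge <- drop(solve(crossprod(x) + lambda * diag(p), crossprod(x, y)))

# The Ridge estimate equals the maximum likelihood one divided by 1 + lambda
all.equal(beta_ridge, beta_ml / (1 + lambda))
```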

From formula \@ref(eq:ridge-estimator) we also find that, as in \ac{gam}, if the link is identity and the response is Normal, the Ridge regression is a linear smoother. In this case the smoothing matrix is:
$$
H_{\lambda} = \boldsymbol{X} \left(\boldsymbol{X}^t\boldsymbol{X}+\lambda I_p\right)^{-1}\boldsymbol{X}^t
$$

In the case of independent explanatory variables we get:
$$
H_{\lambda} = \frac{1}{1+\lambda} \boldsymbol{X} \boldsymbol{X}^t = \frac{1}{1+\lambda} H_0
$$
and then $\text{tr}\left(H_{\lambda}\right) = \frac{p}{1+\lambda}$. This result means that, by increasing $\lambda$, the effective number of degrees of freedom decreases.
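
The effective degrees of freedom can also be computed for a generic design matrix. The following sketch uses simulated data, and the helper `ridge_df()` is a hypothetical name introduced for illustration.

```r
set.seed(123)
n <- 50
p <- 4
x <- scale(matrix(rnorm(n * p), n, p))

# Effective degrees of freedom tr(H_lambda) of Ridge Regression
ridge_df <- function(x, lambda) {
  h <- x %*% solve(crossprod(x) + lambda * diag(ncol(x)), t(x))
  sum(diag(h))
}

lambda_grid <- c(0, 1, 10, 100, 1000)
eff_df <- sapply(lambda_grid, ridge_df, x = x)
data.frame(lambda = lambda_grid, eff_df = round(eff_df, 3))

# Orthonormal case: tr(H_lambda) = p / (1 + lambda)
x_orth <- qr.Q(qr(x))
ridge_df(x_orth, lambda = 1)
```

With $\lambda=0$ the trace equals the number of parameters $p$, and it decreases towards $0$ as $\lambda$ grows.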


#### LASSO Regression

Another shrinkage estimator for \ac{glm} is the \ac{lasso}. \ac{lasso} is based on the same idea as Ridge Regression but, instead of a penalization based on the $L^2$ norm, it uses a penalization based on the $L^1$ norm:
\begin{equation}
\label{eq:lasso-est-deviance}
\hat{\boldsymbol{\beta}} = \argmin_{\boldsymbol{\beta}\in\mathbb{R}^{p+1}}{\left\{D(\boldsymbol{\beta}, \boldsymbol{y}) + \lambda \|\boldsymbol{\beta}_{\setminus0}\|_1\right\}}
\end{equation}
where:

* $\boldsymbol{\beta}_{\setminus0} = \left(\beta_1, \beta_2, \dots, \beta_p\right)$ is the set of the \ac{glm} coefficients except for the intercept $\beta_0$;
* $\|\cdot\|_1$ is the $L^1$ norm, i.e. $\|\boldsymbol{\beta}_{\setminus0}\|_1 = \sum_{j=1}^p{|\beta_j|}$;
* $\lambda\ge0$ is a hyper-parameter that controls the penalization.

An example of the effect of the $L^1$ penalization for different values of $\lambda$ is shown in figure \@ref(fig:lasso-lambda). The simulated dataset is the same used for the Ridge example in figure \@ref(fig:ridge-lambda). As we can see from the plots, the substantial difference between Ridge and \ac{lasso} is that in \ac{lasso}, beyond a certain value of $\lambda$, the coefficients are shrunk exactly to $0$. While in Ridge this is just a limit property, in \ac{lasso} for each coefficient $\beta_j$ there is a level of the penalization parameter $\lambda'_j$ such that, for $\lambda \ge \lambda'_j$, the estimated coefficient $\hat{\beta}_j$ is forced to be exactly $0$.
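
The difference between the two penalizations can be sketched in the special case of Normal response, identity link and orthonormal explanatory variables, taking the Deviance to be the residual sum of squares (an assumption made here for illustration). In this case each coefficient can be optimized separately: the Ridge solution is a proportional shrinkage, while the \ac{lasso} solution is the standard soft-thresholding of the maximum likelihood coefficient. The function names and the values of `beta_ml` are hypothetical.

```r
# Maximum likelihood coefficients under orthonormal explanatory variables
# (hypothetical values)
beta_ml <- c(2, 0.2, -1, -2)

# Ridge: minimiser of (beta - b)^2 + lambda * beta^2
ridge_shrink <- function(b, lambda) b / (1 + lambda)

# LASSO: minimiser of (beta - b)^2 + lambda * abs(beta), i.e. soft-thresholding
lasso_shrink <- function(b, lambda) sign(b) * pmax(abs(b) - lambda / 2, 0)

lambda_grid <- c(0, 0.5, 2, 4)
ridge <- sapply(lambda_grid, ridge_shrink, b = beta_ml)
lasso <- sapply(lambda_grid, lasso_shrink, b = beta_ml)
colnames(ridge) <- colnames(lasso) <- paste0("lambda=", lambda_grid)

ridge  # shrunk, but never exactly 0 for finite lambda
lasso  # exactly 0 as soon as lambda >= 2 * abs(beta_ml)
```

Under these assumptions the threshold is $\lambda'_j = 2|\hat{\beta}_j^{ML}|$, so the smaller the maximum likelihood coefficient, the sooner it is set to $0$.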


(ref:lasso-lambda-caption-latex) LASSO Regression coefficients for different levels of the penalization parameter $\lambda$. If $\lambda=0$, the coefficients correspond to the maximum likelihood coefficients. As $\lambda$ increases, the coefficients are shrunk towards $0$. High values of $\lambda$ force the coefficients to be exactly equal to $0$.

(ref:lasso-lambda-caption-gitbook) LASSO Regression coefficients for different levels of the penalization parameter $\lambda$.

(ref:lasso-lambda-caption-short) LASSO Regression coefficients for different levels of the penalization parameter $\lambda$.

```{r, plot-lasso-lambda-build, echo = FALSE, cache = TRUE}

# Simulate data

n <- c(45, 20, 20, 10, 5)
b <- c(0, 2, 0.2, -1, -2)

sigma <- .5

set.seed(123)

df <- tibble(
  x = c(
    rep("a", n[1]),
    rep("b", n[2]),
    rep("c", n[3]),
    rep("d", n[4]),
    rep("e", n[5])
  ) %>% 
    factor(),
  mu = c(
    rep(b[1], n[1]),
    rep(b[2], n[2]),
    rep(b[3], n[3]),
    rep(b[4], n[4]),
    rep(b[5], n[5])
  )
)

df$y <- rnorm(n = sum(n), mean = df$mu, sd = sigma)


# LASSO regression

x_mat <- model.matrix(y ~ x, data = df) 


# lambda_grid_ridge <- 10^seq(log10(0.001), log10(100), length.out = 100)
lambda_grid_lasso <- 10^seq(log10(0.001), log10(10), length.out = 100)

# alpha_ridge <- 0
alpha_lasso <- 1

fit_lasso <- glmnet(
  x = x_mat, y = df$y,
  alpha = alpha_lasso, lambda = lambda_grid_lasso
)


fit_lasso_beta_long <- fit_lasso$beta %>% 
  t() %>% 
  as.matrix() %>% 
  as_tibble() %>% 
  mutate(
    lambda = fit_lasso$lambda,
    log_lambda = log(lambda)
  ) %>% 
  rename(xa = `(Intercept)`) %>% 
  pivot_longer(
    cols = xa:xe,
    names_to = "coefficient"
  ) %>% 
  mutate(coefficient = str_sub(coefficient, 2, 2))

# lambda_ridge <- c(0, 0.1, 1, 10)
lambda_lasso <- c(0, 0.1, 0.6, 1)

p_lasso_coeff <- fit_lasso_beta_long %>% 
  ggplot(aes(x = lambda, y = value, color = coefficient)) +
  geom_vline(
    xintercept = c(lambda_lasso[lambda_lasso > 0]),
    linetype = "dotted"
  ) +
  geom_line() +
  geom_text(
    fit_lasso_beta_long %>% 
      filter(lambda == min(lambda)),
    mapping = aes(label = coefficient),
    nudge_x = -0.1
  ) +
  scale_x_log10() +
  easy_remove_legend() +
  labs(
    x = expression(lambda),
    y = expression(hat(beta))
  )
  

fit_lasso_pred <- predict(
  fit_lasso,
  type = "response",
  s = lambda_lasso,
  newx = unique(x_mat)
) %>% 
  as_tibble() %>%
  mutate(x = c("a", "b", "c", "d", "e")) %>% 
  pivot_longer(
    cols = -x
  ) %>% 
  left_join(
    tibble(
      name = as.character(1:length(lambda_lasso)),
      lambda = lambda_lasso
    ),
    by = "name"
  ) #%>% 
  # mutate(
  #   lambda = str_c("lambda = ", lambda) %>% 
  #     fct_inorder()
  # )


p_lasso_pred_1 <- df %>% 
  ggplot(aes(x = x, y = y)) +
  geom_hline(
    # yintercept = mean(df$y),
    data = fit_lasso_pred %>% 
      filter(lambda == lambda_lasso[1]) %>% 
      filter(x == "a"),
    mapping = aes(yintercept = value),
    linetype = "dotted"
  ) +
  geom_point(
    data = fit_lasso_pred %>% 
      filter(lambda == lambda_lasso[1]),
    mapping = aes(x = x, y = value, color = x),
    size = 3
  ) +
  geom_point(alpha = .3) +
  geom_point(
    data = fit_lasso_pred %>% 
      filter(lambda == lambda_lasso[1]),
    mapping = aes(x = x, y = value, color = x),
    size = 3,
    alpha = .8
  ) +
  # facet_wrap(~lambda) +
  easy_remove_legend()

p_lasso_pred_2 <- df %>% 
  ggplot(aes(x = x, y = y)) +
  geom_hline(
    # yintercept = mean(df$y),
    data = fit_lasso_pred %>% 
      filter(lambda == lambda_lasso[2]) %>% 
      filter(x == "a"),
    mapping = aes(yintercept = value),
    linetype = "dotted"
  ) +
  geom_point(
    data = fit_lasso_pred %>% 
      filter(lambda == lambda_lasso[2]),
    mapping = aes(x = x, y = value, color = x),
    size = 3
  ) +
  geom_point(alpha = .3) +
  geom_point(
    data = fit_lasso_pred %>% 
      filter(lambda == lambda_lasso[2]),
    mapping = aes(x = x, y = value, color = x),
    size = 3,
    alpha = .8
  ) +
  # facet_wrap(~lambda) +
  easy_remove_legend()

p_lasso_pred_3 <- df %>% 
  ggplot(aes(x = x, y = y)) +
  geom_hline(
    # yintercept = mean(df$y),
    data = fit_lasso_pred %>% 
      filter(lambda == lambda_lasso[3]) %>% 
      filter(x == "a"),
    mapping = aes(yintercept = value),
    linetype = "dotted"
  ) +
  geom_point(
    data = fit_lasso_pred %>% 
      filter(lambda == lambda_lasso[3]),
    mapping = aes(x = x, y = value, color = x),
    size = 3
  ) +
  geom_point(alpha = .3) +
  geom_point(
    data = fit_lasso_pred %>% 
      filter(lambda == lambda_lasso[3]),
    mapping = aes(x = x, y = value, color = x),
    size = 3,
    alpha = .8
  ) +
  # facet_wrap(~lambda) +
  easy_remove_legend()

p_lasso_pred_4 <- df %>% 
  ggplot(aes(x = x, y = y)) +
  geom_hline(
    # yintercept = mean(df$y),
    data = fit_lasso_pred %>% 
      filter(lambda == lambda_lasso[4]) %>% 
      filter(x == "a"),
    mapping = aes(yintercept = value),
    linetype = "dotted"
  ) +
  geom_point(
    data = fit_lasso_pred %>% 
      filter(lambda == lambda_lasso[4]),
    mapping = aes(x = x, y = value, color = x),
    size = 3
  ) +
  geom_point(alpha = .3) +
  geom_point(
    data = fit_lasso_pred %>% 
      filter(lambda == lambda_lasso[4]),
    mapping = aes(x = x, y = value, color = x),
    size = 3,
    alpha = .8
  ) +
  # facet_wrap(~lambda) +
  easy_remove_legend()


```

```{r, plot-lasso-lambda-print, out.width = c("50%", "50%", "50%", "50%", "70%"), fig.width = c(4,4,4,4,5.6), fig.height = c(2.25,2.25,2.25,2.25,3.15), fig.align='center', fig.cap=ifelse(knitr::is_html_output(), "(ref:lasso-lambda-caption-gitbook)", "(ref:lasso-lambda-caption-latex)"), fig.scap="(ref:lasso-lambda-caption-short)", label="lasso-lambda", echo=FALSE, fig.ncol=2, fig.subcap = c(str_c('$\\lambda = ', lambda_lasso, '$'), '$\\hat{\\beta}_j$ for different values of $\\lambda$'), cache = TRUE}


plot_grid_split(
  p_lasso_pred_1,
  p_lasso_pred_2,
  p_lasso_pred_3,
  p_lasso_pred_4,
  p_lasso_coeff
)

```

This property of the \ac{lasso} Regression can be derived from a different representation of the optimization problems \@ref(eq:ridge-est-deviance) and \@ref(eq:lasso-est-deviance). It can be proven that in general, considering a penalization given by the $L^d$ norm, the unconstrained optimization problem:
\begin{equation}
\label{eq:ld-est-deviance}
\hat{\boldsymbol{\beta}} = \argmin_{\boldsymbol{\beta}\in\mathbb{R}^{p+1}}{\left\{D(\boldsymbol{\beta}, \boldsymbol{y}) + \lambda \|\boldsymbol{\beta}_{\setminus0}\|_d^d\right\}}
\end{equation}
is equivalent to the constrained optimization problem:
\begin{equation}
\label{eq:lp-est-deviance-constr}
\hat{\boldsymbol{\beta}} = \argmin_{\boldsymbol{\beta}\in\mathbb{R}^{p+1}:\  \|\boldsymbol{\beta}_{\setminus0}\|_d^d \le s_{\lambda}}{ D(\boldsymbol{\beta}, \boldsymbol{y}) }
\end{equation}
where $s_{\lambda}$ is a quantity that depends on $\lambda$.

The representation \@ref(eq:lp-est-deviance-constr) provides a useful geometric interpretation of the optimization problem. Figure \@ref(fig:opt-ridge-lasso) shows the visual representation of the optimization problem \@ref(eq:lp-est-deviance-constr) in the Ridge case ($d=2$) and in the \ac{lasso} case ($d=1$). The axes represent the components of $\boldsymbol{\beta}$. The point $\hat{\boldsymbol{\beta}}^{ML}$ represents the maximum likelihood estimate of $\boldsymbol{\beta}$, that is the optimal point for the Deviance $D(\boldsymbol{\beta}, \boldsymbol{y})$ without any constraints. The ellipses around $\hat{\boldsymbol{\beta}}^{ML}$ represent the contour lines of $D(\boldsymbol{\beta}, \boldsymbol{y})$. In the Normal case with identity link they are concentric ellipses centered in $\hat{\boldsymbol{\beta}}^{ML}$. The grey area around the axes intersection represents the feasibility region determined by $\|\boldsymbol{\beta}\|_d^d \le s_\lambda$. In the Ridge case this area corresponds to the circle $\|\boldsymbol{\beta}\|_2^2 \le s_\lambda$, while in the \ac{lasso} case it corresponds to the square $\|\boldsymbol{\beta}\|_1^1 \le s_\lambda$, whose vertices lie on the axes. $\hat{\boldsymbol{\beta}}^{Ridge}$ and $\hat{\boldsymbol{\beta}}^{LASSO}$ are respectively the optimal point under the Ridge constraint and the optimal point under the \ac{lasso} constraint. The sharpness of the \ac{lasso} feasibility region implies that the optimal point $\hat{\boldsymbol{\beta}}^{LASSO}$ could fall on one of the corners of the square, leading one of the coefficients $\hat{\beta}_j^{LASSO}$ to be exactly equal to $0$. In general, if there are more than two explanatory variables, the \ac{lasso} feasibility region is a cross-polytope (the higher-dimensional analogue of this square), whose vertices and edges lie where some of the coordinates are exactly $0$, so the \ac{lasso} can attain solutions with many coefficients exactly equal to $0$.
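
The geometric argument can be checked numerically on a two-dimensional example. The sketch below assumes a quadratic Deviance with elliptical contour lines and uses the same illustrative values as the figure (centre $(2, 0.5)$, semi-axes $2$ and $1$, rotation $\pi/8$, $s_\lambda = 1$). The grid search on the boundary of the two regions is a simplification adopted for illustration, not the algorithm used by any fitting package.

```r
# Quadratic Deviance with elliptical contour lines centred in beta_ml
beta_ml <- c(2, 0.5)
angle <- pi / 8
rot <- matrix(c(cos(angle), sin(angle), -sin(angle), cos(angle)), 2, 2)
dev_fun <- function(beta) {
  u <- crossprod(rot, beta - beta_ml)  # coordinates along the ellipse axes
  sum(u^2 / c(2, 1)^2)
}

# beta_ml lies outside both regions, so each constrained optimum is on the
# boundary: search it over a fine grid of directions
n_grid <- 20000
theta <- seq(0, 2 * pi, length.out = n_grid + 1)[-(n_grid + 1)]
circle <- cbind(cos(theta), sin(theta))                    # L2 boundary
diamond <- circle / (abs(circle[, 1]) + abs(circle[, 2]))  # L1 boundary

beta_ridge <- circle[which.min(apply(circle, 1, dev_fun)), ]
beta_lasso <- diamond[which.min(apply(diamond, 1, dev_fun)), ]

round(rbind(beta_ridge, beta_lasso), 4)
```

With these values the \ac{lasso} optimum falls on the corner $(1, 0)$, so the second coefficient is exactly $0$, while in the Ridge optimum both coefficients are different from $0$.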


(ref:opt-ridge-lasso-caption-latex) Geometrical interpretation of the optimization problem for Ridge and LASSO. The sharpness of the LASSO feasibility region implies that the optimal point $\hat{\boldsymbol{\beta}}^{LASSO}$ could fall into one of the corners of the square leading one of the coefficients $\hat{\beta}_j^{LASSO}$ to be exactly equal to $0$.

(ref:opt-ridge-lasso-caption-gitbook) Geometrical interpretation of the optimization problem for Ridge (left) and LASSO (right). The sharpness of the LASSO feasibility region implies that the optimal point $\hat{\boldsymbol{\beta}}^{LASSO}$ could fall into one of the corners of the square leading one of the coefficients $\hat{\beta}_j^{LASSO}$ to be exactly equal to $0$.

(ref:opt-ridge-lasso-caption-short) Geometrical interpretation of the optimization problem for Ridge and LASSO.

```{r, plot-opt-ridge-lasso-build, echo = FALSE, cache = TRUE}

x0 <- 2
y0 <- 1/2
angle <- pi/8

a <- 2
b <- 1

# LASSO
dev1 <- 10768094850759922072879/33946139684317548575040 # Computed externally with Wolfram Alpha
a1 <- 2 * sqrt(dev1)
b1 <- 1 * sqrt(dev1)

# Ridge
dev2 <- 0.289349 # Computed externally with Wolfram Alpha
a2 <- 2 * sqrt(dev2)
b2 <- 1 * sqrt(dev2)


p_ridge <- tibble(
  x = 0,
  y = 0
) %>% 
  ggplot(aes(x = x, y = y)) +
  geom_segment(
    x = -Inf, y = 0, xend = +Inf, yend = 0,
    arrow = arrow(length=unit(0.30,"cm"), type = "closed")
  ) +
  geom_segment(
    x = 0, y = -Inf, xend = 0, yend = Inf,
    arrow = arrow(length=unit(0.30,"cm"), type = "closed")
  ) +
  geom_point(
    data = tibble(x = x0, y = y0),
    mapping = aes(x = x, y = y),
    size = 3,
    alpha = .8,
    color = "grey20"
  ) +
  geom_text(
    data = tibble(
      x = x0, y = y0, label = expression(hat(beta)[ML])
    ),
    mapping = aes(x = x, y = y, label = label),
    parse = T, hjust = 0, nudge_x = 0.05, nudge_y = 0.07
  ) +
  geom_text(
    data = tibble(
      x = 3, y = 0, label = expression(beta[1])
    ),
    mapping = aes(x = x, y = y, label = label),
    parse = T, hjust = 0, nudge_x = 0.05, nudge_y = 0.11
  ) +
  geom_text(
    data = tibble(
      x = 0, y = 1.2, label = expression(beta[2])
    ),
    mapping = aes(x = x, y = y, label = label),
    parse = T, hjust = 0, nudge_x = 0.05
  ) +
  geom_ellipse(
    data = tibble(
      x = rep(x0, 2),
      y = rep(y0, 2),
      x0 = rep(x0, 2),
      y0 = rep(y0, 2),
      a = c(a/5, a/3),
      b = c(b/5, b/3),
      angle = rep(angle, 2)
    ),
    mapping = aes(x0 = x0, y0 = y0, a = a, b = b, angle = angle),
    color = "grey20"
  ) +
  # Ridge
  geom_ellipse(
    aes(x0 = 0, y0 = 0, a = 1, b = 1, angle = 0),
    fill = "grey20",
    color = NA,
    alpha = .4
  ) +
  # Ridge
  geom_ellipse(
    aes(x0 = x0, y0 = y0, a = a2, b = b2, angle = angle),
    color = col1
  ) +
  geom_point(
    data = tibble(x = 0.988412, y = 0.151795),
    mapping = aes(x = x, y = y),
    size = 3,
    alpha = .8,
    color = col1
  ) +
  geom_text(
    data = tibble(x = 0.988412, y = 0.151795, label = expression(hat(beta)[Ridge])),
    mapping = aes(x = x, y = y, label = label),
    parse = T, hjust = 0, nudge_x = 0.05,
    color = col1
  ) +
  coord_equal(
    xlim = c(-1.1, 2.9),
    ylim = c(-1.1, 1.1),
    clip = "off"
  ) +
  easy_remove_axes(
    which = "both",
    what = c("ticks", "title", "text", "line")
  ) +
  theme(
    plot.margin = unit(c(1, 1, 1, 1), "lines"),
    panel.border = element_blank()
  )


p_lasso <- tibble(
  x = 0,
  y = 0
) %>% 
  ggplot(aes(x = x, y = y)) +
  geom_segment(
    x = -Inf, y = 0, xend = +Inf, yend = 0,
    arrow = arrow(length=unit(0.30,"cm"), type = "closed")
  ) +
  geom_segment(
    x = 0, y = -Inf, xend = 0, yend = Inf,
    arrow = arrow(length=unit(0.30,"cm"), type = "closed")
  ) +
  geom_point(
    data = tibble(x = x0, y = y0),
    mapping = aes(x = x, y = y),
    size = 3,
    alpha = .8,
    color = "grey20"
  ) +
  geom_text(
    data = tibble(
      x = x0, y = y0, label = expression(hat(beta)[ML])
    ),
    mapping = aes(x = x, y = y, label = label),
    parse = T, hjust = 0, nudge_x = 0.05, nudge_y = 0.07
  ) +
  geom_text(
    data = tibble(
      x = 3, y = 0, label = expression(beta[1])
    ),
    mapping = aes(x = x, y = y, label = label),
    parse = T, hjust = 0, nudge_x = 0.05, nudge_y = 0.11
  ) +
  geom_text(
    data = tibble(
      x = 0, y = 1.2, label = expression(beta[2])
    ),
    mapping = aes(x = x, y = y, label = label),
    parse = T, hjust = 0, nudge_x = 0.05
  ) +
  geom_ellipse(
    data = tibble(
      x = rep(x0, 2),
      y = rep(y0, 2),
      x0 = rep(x0, 2),
      y0 = rep(y0, 2),
      a = c(a/5, a/3),
      b = c(b/5, b/3),
      angle = rep(angle, 2)
    ),
    mapping = aes(x0 = x0, y0 = y0, a = a, b = b, angle = angle),
    color = "grey20"
  ) +
  # LASSO constraint region (unit L1 ball)
  geom_polygon(
    data = tibble(
      x = c(1, 0, -1, 0),
      y = c(0, 1, 0, -1)
    ),
    alpha = .4,
    size = 0
  ) +
  # geom_path(
  #   data = tibble(
  #     x = c(1, 0, -1, 0, 1),
  #     y = c(0, 1, 0, -1, 0),
  #   ),
  # ) +
  # Deviance contour passing through the LASSO solution
  geom_ellipse(
    aes(x0 = x0, y0 = y0, a = a1, b = b1, angle = angle),
    color = col2
  ) +
  geom_point(
    data = tibble(x = 1, y = 0),
    mapping = aes(x = x, y = y),
    size = 3,
    alpha = .8,
    color = col2
  ) +
  geom_text(
    data = tibble(x = 1, y = 0, label = expression(hat(beta)[LASSO])),
    mapping = aes(x = x, y = y, label = label),
    parse = T, hjust = 0, nudge_x = 0.05, nudge_y = 0.151795,
    color = col2
  ) +
  coord_equal(
    xlim = c(-1.1, 2.9),
    ylim = c(-1.1, 1.1),
    clip = "off"
  ) +
  easy_remove_axes(
    which = "both",
    what = c("ticks", "title", "text", "line")
  ) +
  theme(
    plot.margin = unit(c(1, 1, 1, 1), "lines"),
    panel.border = element_blank()
  )


```


```{r, plot-opt-ridge-lasso-print, out.width = "50%", fig.width = 5, fig.height = 3.5, fig.align='center', fig.cap = ifelse(knitr::is_html_output(), "(ref:opt-ridge-lasso-caption-gitbook)", "(ref:opt-ridge-lasso-caption-latex)"), fig.scap = "(ref:opt-ridge-lasso-caption-short)", label="opt-ridge-lasso", echo=FALSE, fig.ncol=2, fig.subcap=c('Ridge', 'LASSO'), cache=TRUE}

plot_grid_split(p_ridge, p_lasso)

```

The fact that in \ac{lasso} Regression the estimated coefficients can be exactly equal to $0$ is a valuable benefit. Indeed, \ac{lasso} Regression performs a feature selection by removing the variables that are not relevant for predicting the response. The \ac{lasso} fitting is much more efficient than the other feature selection procedures we have seen for \ac{glm}, such as stepwise selection based on AIC or other criteria (section \@ref(chap:variable-selection)), and it scales better to datasets with many variables. It is also possible to take the variables selected by the \ac{lasso} regression and use them as inputs for a maximum likelihood fit. If the dataset contains many variables but only a few of them are relevant for the response, this procedure can return better predictions than the \ac{lasso} regression by itself.
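
The two-step procedure described above (\ac{lasso} selection followed by a maximum likelihood refit) can be sketched as follows. This is only a minimal illustration under simplifying assumptions that are not part of the original text: a Gaussian response with identity link, simulated data, a fixed penalization `lambda`, a penalized objective rescaled by $\frac{1}{2n}$ and a hand-written coordinate descent. The helpers `soft_threshold` and `lasso_cd` are hypothetical names introduced for this example.

```r
# Simulated data: 10 standardized covariates, only the first two are relevant
set.seed(1)
n <- 200
p <- 10
X <- scale(matrix(rnorm(n * p), nrow = n, ncol = p))
beta_true <- c(2, -1.5, rep(0, p - 2))
y <- drop(X %*% beta_true) + rnorm(n)
y <- y - mean(y) # centering removes the need for an intercept

# Soft-thresholding operator used by the coordinate-wise LASSO update
soft_threshold <- function(z, gamma) sign(z) * pmax(abs(z) - gamma, 0)

# Coordinate descent for (1 / (2n)) * ||y - X beta||^2 + lambda * ||beta||_1
lasso_cd <- function(X, y, lambda, n_iter = 200) {
  beta <- rep(0, ncol(X))
  for (iter in seq_len(n_iter)) {
    for (j in seq_len(ncol(X))) {
      partial_resid <- y - drop(X[, -j, drop = FALSE] %*% beta[-j])
      z <- mean(X[, j] * partial_resid)
      beta[j] <- soft_threshold(z, lambda) / mean(X[, j]^2)
    }
  }
  beta
}

# Step 1: LASSO fit, used only to select the variables
beta_lasso <- lasso_cd(X, y, lambda = 0.3)
selected <- which(beta_lasso != 0)

# Step 2: maximum likelihood (here least squares) refit on the selected variables
X_sel <- X[, selected, drop = FALSE]
fit_refit <- lm(y ~ X_sel - 1)
beta_refit <- unname(coef(fit_refit))

round(cbind(selected, lasso = beta_lasso[selected], refit = beta_refit), 3)
```

In this sketch the refitted coefficients are no longer shrunk towards $0$, while the variables discarded by the \ac{lasso} stay out of the model.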


#### Elastic Net

Compared to the Ridge Regression, the \ac{lasso} Regression has the benefit of performing a feature selection by forcing many coefficients to $0$. However, this doesn't necessarily mean that \ac{lasso} estimates always outperform Ridge estimates: it strongly depends on the case. In general, if only a few of the explanatory variables in a dataset have an effect on the response, the \ac{lasso} will produce better estimates. If instead all the variables bring a small amount of information on the response, the \ac{lasso} will suppress some of this information, while the Ridge will capture it. The problem is that in practice, with a real dataset, we do not know which case we are in. The _Elastic Net_ is a generalization of the Ridge and the \ac{lasso} that provides a solution to this kind of problem.

The Elastic Net consists of a penalized Deviance optimization problem in which the penalization term is a mixture of the Ridge penalization and the \ac{lasso} penalization:
\begin{equation}
\label{eq:elastic-net-est-deviance}
\hat{\boldsymbol{\beta}} =  \argmin_{\boldsymbol{\beta}\in\mathbb{R}^{p+1}}{\left\{
D(\boldsymbol{\beta}, \boldsymbol{y}) +
\lambda 
\sum_{j=1}^p{\left(\alpha |\beta_j| + (1 - \alpha) |\beta_j|^2\right)}
\right\}}
\end{equation}
where $\alpha \in [0,1]$ is a hyper-parameter that weighs the two penalization components.

Looking at equation \@ref(eq:elastic-net-est-deviance), it is clear that, if $\alpha=0$, the Elastic Net corresponds to the Ridge Regression, while, if $\alpha=1$, the Elastic Net corresponds to the \ac{lasso} regression. If $\alpha \in ]0,1[$, the result will be a compromise between the two.
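
As a small numerical check of these limit cases, the penalization term of equation \@ref(eq:elastic-net-est-deviance) can be written as a function. The name `elastic_net_penalty` and the coefficient values are hypothetical and serve only as an illustration.

```r
# Elastic Net penalization term: lambda * sum(alpha * |beta_j| + (1 - alpha) * beta_j^2)
elastic_net_penalty <- function(beta, lambda, alpha) {
  stopifnot(alpha >= 0, alpha <= 1, lambda >= 0)
  lambda * sum(alpha * abs(beta) + (1 - alpha) * beta^2)
}

beta <- c(0.5, -1, 0, 2) # illustrative coefficients (intercept excluded)
lambda <- 0.1

ridge_penalty <- lambda * sum(beta^2)
lasso_penalty <- lambda * sum(abs(beta))

elastic_net_penalty(beta, lambda, alpha = 0)   # coincides with ridge_penalty
elastic_net_penalty(beta, lambda, alpha = 1)   # coincides with lasso_penalty
elastic_net_penalty(beta, lambda, alpha = 0.5) # lies between the two
```

Software implementations may scale the two components differently (for example, the `glmnet` package halves the quadratic term), so the values of $\lambda$ and $\alpha$ are not always directly comparable across parametrizations.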

The hyper-parameter $\alpha$ can be estimated together with $\lambda$ in a Cross Validation procedure (see section \@ref(chap:variable-selection)). If the data suggest that many variables are useful for predicting the response, the estimated hyper-parameter $\hat{\alpha}$ will be close to $0$, while if the data suggest that only a few variables are useful, the estimated hyper-parameter $\hat{\alpha}$ will be close to $1$.


#### Some considerations on Shrinkage Estimators {#chap:shrinkage-considerations}

As we have said, shrinkage estimators are particularly useful when the number of parameters $p$ in the model is large (high dimensionality) and the maximum likelihood estimator has a high variance. This can happen when we have qualitative variables with many modalities, for example the make and the model of the insured vehicle in car insurance. As we have seen in the examples represented in figures \@ref(fig:ridge-lambda) and \@ref(fig:lasso-lambda), the penalization shrinks more strongly the coefficients of the groups with few observations, and only the groups with an average response $\bar{y}_j$ significantly different from the global average response $\bar{y}$ emerge. This procedure is more efficient and more robust to overfitting than grouping modalities based on hypothesis testing or other criteria.

If we have already performed a satisfactory feature selection and we only want to shrink some of the coefficients, we can apply the penalization only to them. For example, in the case of the variables make and model, we can apply an Elastic Net penalization only to the coefficients $\beta_j$ corresponding to the modalities of those variables. This technique is also useful if we have explanatory variables with a highly unbalanced distribution. For example, if we consider the variable "number of claims experienced in the previous year", we will have most of the observations in the modality corresponding to $0$ claims, very few in the modality corresponding to $1$ claim and almost none with $2$ or more claims. Suppose we are fitting a model for the claims frequency and we consider the variable $x_j$ that indicates whether the policyholder experienced one or more claims in the previous year ($x_j=1$) or not ($x_j=0$). It is likely that the maximum likelihood estimate of the coefficient $\hat{\beta}_{x_j=1}$ will be greater than $0$, as policyholders that experienced claims are usually more inclined to experience more of them in the future. However, given that there are only a few observations with $x_j=1$, it is possible that the coefficient $\hat{\beta}_{x_j=1}$ is not significantly different from $0$. If our only tool is the maximum likelihood estimator, we have to choose whether to include the variable $x_j$ in the model or not. If we don't include it, we are probably discarding some potentially useful information, while if we include it and estimate its coefficient with maximum likelihood, we risk overfitting the data. With shrinkage estimators we can include the variable $x_j$ in the model and fit it with a penalization. This way we exploit that information while limiting the risk of overfitting.
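
The idea of penalizing only one coefficient can be sketched on simulated data. The example below is an illustration under assumptions that are not part of the original text: a Poisson frequency model with $\phi=1$, a Ridge penalization applied only to the coefficient of the binary variable, an arbitrary value $\lambda=20$ and a generic numerical optimizer. The function `penalized_deviance` is a hypothetical helper.

```r
set.seed(42)
n <- 2000
x <- rbinom(n, size = 1, prob = 0.02)     # few policyholders with previous claims
y <- rpois(n, lambda = exp(-2 + 0.5 * x)) # simulated claim counts

# Maximum likelihood fit
fit_ml <- glm(y ~ x, family = poisson())

# -2 * log-likelihood differs from the deviance only by a constant,
# so it has the same minimizer; the intercept beta[1] is not penalized
penalized_deviance <- function(beta, lambda) {
  eta <- beta[1] + beta[2] * x
  -2 * sum(dpois(y, lambda = exp(eta), log = TRUE)) + lambda * beta[2]^2
}

fit_pen <- optim(
  par = coef(fit_ml), fn = penalized_deviance,
  lambda = 20, method = "BFGS"
)

rbind(maximum_likelihood = coef(fit_ml), penalized = fit_pen$par)
```

The penalized coefficient of `x` keeps the sign of the maximum likelihood estimate but is pulled towards $0$, which is the behaviour described in the paragraph above.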

Another interesting observation about \ac{glm} fitting is that in practice, when maximum likelihood estimates are adopted, we do not just fit a model with all the variables: a feature selection is also conducted. Whether a variable $x_j$ enters the model depends on whether the feature selection procedure selects it. That corresponds to estimating the coefficient $\beta_j$ with an estimator $\tilde{\beta}_j$ that is equal to the maximum likelihood estimator $\tilde{\beta}_j^{ML}$ when the coefficient passes a specific criterion and is equal to $0$ when $\tilde{\beta}_j^{ML}$ doesn't pass that criterion. This criterion could be something objective, for example the decrease in AIC or the significance in a hypothesis test on $\tilde{\beta}_j^{ML}$, but also something more subjective, based on the domain knowledge of the person conducting the modeling. In effect, this feature selection procedure introduces a bias towards $0$ on the coefficient $\tilde{\beta}_j$ that reduces its variance, since all the coefficients judged not relevant for prediction are set equal to $0$ and only the relevant ones are fitted with maximum likelihood. If we consider a binary variable $x_j$, in all the procedures commonly used for feature selection, the probability that $\tilde{\beta}_j$ passes the procedure depends on how strong the effect $\beta_j$ is and how many observations there are in the classes $x_j=0$ and $x_j=1$.
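
The effect of selection on the estimator can be illustrated with a small simulation. The setting is hypothetical: we assume a weak true effect, approximate the sampling distribution of the maximum likelihood estimator with a Normal distribution, and use a 5% two-sided significance test as the selection criterion.

```r
set.seed(123)
beta_true <- 0.05 # weak true effect
se <- 0.1         # standard error of the maximum likelihood estimator
n_sim <- 100000

# Sampling distribution of the ML estimator (Normal approximation)
beta_ml <- rnorm(n_sim, mean = beta_true, sd = se)

# Keep the ML estimate only when it passes the significance test, else set it to 0
selected <- abs(beta_ml / se) > qnorm(0.975)
beta_sel <- ifelse(selected, beta_ml, 0)

c(
  selection_rate = mean(selected),
  mean_ml = mean(beta_ml), mean_sel = mean(beta_sel),
  var_ml = var(beta_ml), var_sel = var(beta_sel)
)
```

In this configuration the selected estimator has a mean closer to $0$ than $\beta_j$ and a smaller variance than the maximum likelihood estimator. The size of both effects depends on the strength of the true effect relative to its standard error.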

We also mention that the Shrinkage Estimators can be used in synergy with \ac{gam}s. Indeed, it is possible to fit a cubic spline, with a \ac{gam} penalization based on the second derivative of the spline, to the quantitative variables, and a shrinkage estimator with an Elastic Net penalization to the qualitative variables. The optimization problem becomes:
\begin{equation}
\label{eq:gam-en-est-deviance-multi}
\boldsymbol{\hat{f}} = \argmin_{\boldsymbol{f}}
{\left\{
  D(\boldsymbol{f}, \boldsymbol{y})
    + \sum_{l=1}^{q}{
      \lambda_l \int_{a_l}^{b_l}{\left( f_l''(x_l) \right)^2 dx_l}
    }
    + \lambda_{EN} 
      \sum_{j=1}^p{\left(\alpha |\beta_j| + (1 - \alpha) |\beta_j|^2\right)}
\right\}
} 
\end{equation}
where the coefficients $\beta_j$ considered in the Elastic Net penalization are only the ones corresponding to qualitative variables.

<!--
Ridge/LASSO benefits
- Useful when p is large
- Very useful for qualitative variables with many modalities. E.g.: make, model.
  I can apply the penalization only to some of the variables, or I can use a different weight to penalize the different variables differently
- Benefit of shrinkage (Ridge):
  A coefficient makes sense, but I don't trust the estimated value much, so I apply an ad hoc penalization to it.
  It is better than putting it in the offset
  Note: with stepwise regression the estimators are not unbiased either. In practice we get the ML \beta_j if significant and 0 otherwise.
- I can use it together with the GAM
-->

\newpage


### Bayesian GLM {#chap:bayes-glm}

In this section we are going to present the Bayesian estimators for \ac{glm}. The novelty compared to Maximum Likelihood estimators consists in the fact that Bayesian statistics introduces the idea of prior information that, used in inference, brings a bias to the estimates. As we will see in section \@ref(chap:bayes-ridge-lasso), Ridge Regression and \ac{lasso} Regression can be interpreted as Bayesian estimators. For more details on Bayesian \ac{glm} we refer to [@wuthrich-data-analytics] and [@gelman2013bayesian].


#### The Bayesian framework

In classical inference what is commonly done for estimating an unknown parameter $\theta$ using an observed sample $\boldsymbol{y}=(y_1, y_2, \dots, y_n)$, is to consider the value $\hat{\theta}^{ML}$ that maximizes the likelihood:
$$
\hat{\theta}^{ML} = \argmax_{\theta\in\Theta}{L(\theta)} = \argmax_{\theta\in\Theta}{p(\boldsymbol{y}|\theta)}
$$
where $p(\boldsymbol{y}|\theta)$ is the probability (or the density) of the sample $\boldsymbol{y}$ given the parameter $\theta$. That means that, among all the possible parameters $\theta\in\Theta$, we select the most likely one, that is the one under which the observed sample $\boldsymbol{y}$ has the maximum probability. We highlight that $L(\theta)$ is not a probability distribution on $\theta$, so for example, in general, $\int_{\Theta}{L(\theta)d\theta}\ne1$.

_Bayesian Inference_ introduces the concept of prior distribution of the parameter: $\pi(\theta)$. This distribution represents how probable we assume the different values of $\theta\in\Theta$ are, prior to the observation of the sample. In classical statistics, the parameter $\theta$ is seen as a specific real number that is unknown. In the Bayesian framework, the parameter $\theta$ is seen as a random variable and its distribution represents our information on it. This aspect introduces a subjective point of view of probability.

With the _Bayes Theorem_, having a prior distribution $\pi(\theta)$, we can compute the posterior distribution for $\theta$ given the observed sample $\boldsymbol{y}$:
$$
\pi(\theta|\boldsymbol{y}) = \frac{p(\boldsymbol{y}|\theta)\pi(\theta)}{p(\boldsymbol{y})}
$$

$\pi(\theta|\boldsymbol{y})$ is actually a probability distribution, so we can compute probabilities and make predictions on $\theta$ given the distribution $\pi(\theta|\boldsymbol{y})$.

Two useful statistics based on the posterior distribution $\pi(\theta|\boldsymbol{y})$ are the expected value of $\theta$ given $\boldsymbol{y}$:
$$
E(\theta|\boldsymbol{y}) = \int_{\Theta}{\theta \pi(\theta|\boldsymbol{y}) d\theta}
$$
and the mode of $\theta$ given $\boldsymbol{y}$:
$$
\text{Mode}(\theta|\boldsymbol{y}) = \argmax_{\theta\in\Theta}{\pi(\theta|\boldsymbol{y})}
$$

These two statistics can be used as point estimates for the parameter $\theta$. In the following we are going to refer to $\text{Mode}(\theta|\boldsymbol{y})$ as the \ac{map} estimate:
$$
\hat{\theta}^{MAP} = \text{Mode}(\theta|\boldsymbol{y}) = \argmax_{\theta\in\Theta}{\pi(\theta|\boldsymbol{y})}
$$
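
The quantities defined in this section can be approximated numerically on a grid of values of $\theta$. The following sketch is only an illustration: the Poisson likelihood, the Gamma prior, the observed counts and the grid are assumptions chosen for the example.

```r
y <- c(0, 1, 0, 2, 0) # illustrative claim counts
step <- 0.001
theta <- seq(step, 5, by = step) # grid over the parameter space

# Prior pi(theta) and likelihood p(y | theta) evaluated on the grid
prior <- dgamma(theta, shape = 2, rate = 4)
likelihood <- sapply(theta, function(t) prod(dpois(y, lambda = t)))

# Bayes theorem: the normalizing constant p(y) is approximated by a Riemann sum
unnormalized <- likelihood * prior
posterior <- unnormalized / sum(unnormalized * step)

theta_ml <- theta[which.max(likelihood)]    # maximum likelihood estimate
theta_map <- theta[which.max(posterior)]    # posterior mode (MAP)
theta_mean <- sum(theta * posterior * step) # posterior expected value

c(ml = theta_ml, map = theta_map, posterior_mean = theta_mean)
```

In this example both Bayesian point estimates are pulled from the maximum likelihood estimate towards the region where the prior puts most of its mass.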

#### Bayesian estimator for the mean of a Normal distribution

One example that helps to understand the logic of Bayesian estimators in \ac{glm} is the inference on the mean of a Normal distribution from an observed sample.

Let's assume we have an independent and identically distributed sample $\boldsymbol{y} = (y_1, y_2, \dots, y_n)$ from a Normal distribution:
\begin{equation}
\label{eq:bayes-normal-normal-likelihood}
p(y_i | \mu, \sigma^2) \sim \mathcal{N}(\mu, \sigma^2), \ i\in\{1,2,\dots,n\}
\end{equation}

Let's assume that $\sigma^2$ is known and we want to infer the value of $\mu$. In the Bayesian framework, $\mu$ is a random variable, so the first thing we have to do is to define a prior distribution for it. Let's assume:
\begin{equation}
\label{eq:bayes-normal-normal-prior}
\pi(\mu) \sim \mathcal{N}(\mu_0, \sigma_0^2)
\end{equation}
where $\mu_0$ and $\sigma_0^2$ are known parameters.

Under these assumptions we can compute the posterior distribution as:
$$
\pi(\mu|\boldsymbol{y}) = \frac{p(\boldsymbol{y}|\mu)\pi(\mu)}{p(\boldsymbol{y})}
$$

As we want to conduct inference on the parameter $\mu$ and the denominator $p(\boldsymbol{y})$ does not depend on $\mu$, we can just consider the numerator:
\begin{equation}
\label{eq:bayes-normal-normal-posterior-compute}
\pi(\mu|\boldsymbol{y}) \propto p(\boldsymbol{y}|\mu)\pi(\mu)
\end{equation}

It is possible to prove that, by substituting in formula \@ref(eq:bayes-normal-normal-posterior-compute) the likelihood $p(\boldsymbol{y}|\mu)$ and prior $\pi(\mu)$ with \@ref(eq:bayes-normal-normal-likelihood) and \@ref(eq:bayes-normal-normal-prior), we get:
\begin{equation}
\label{eq:bayes-normal-normal-posterior-result}
\pi(\mu|\boldsymbol{y}) \sim \mathcal{N}\left(  \mu_n, \sigma_n^2 \right)
\end{equation}
where:
\begin{align}
\label{eq:bayes-normal-normal-mu}
\mu_n & = \frac{\frac{n}{\sigma^2}\bar{y} + \frac{1}{\sigma_0^2}\mu_0}{ \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}} \\
\label{eq:bayes-normal-normal-sigma}
\sigma_n^2 & = \left( \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2} \right)^{-1}
\end{align}

The posterior distribution for the mean $\pi(\mu|\boldsymbol{y})$ is still a Normal distribution with parameters $(\mu_n, \sigma_n^2)$ that depend on the initial parameters $(\mu_0, \sigma_0^2)$, the observed sample average $\bar{y}$ and the sample size $n$.

Figure \@ref(fig:bayes-normal) shows the prior $\pi(\mu)$, the likelihood $p(\boldsymbol{y}|\mu)$ and the posterior distribution $\pi(\mu|\boldsymbol{y})$ for different values of the initial parameters and for different sample sizes $n$. As we can see, $\pi(\mu|\boldsymbol{y})$ lies between $\pi(\mu)$ and $p(\boldsymbol{y}|\mu)$.

(ref:bayes-normal-caption) Prior $\pi(\mu)$, likelihood $p(\boldsymbol{y}|\mu)$ and posterior distribution $\pi(\mu|\boldsymbol{y})$ for the estimate of the mean from a Normal distribution. The panels show $\pi(\mu)$, $p(\boldsymbol{y}|\mu)$ and $\pi(\mu|\boldsymbol{y})$ for different values of the prior variance $\sigma_0^2$ (rows) and different sample sizes $n$ (columns).

(ref:bayes-normal-caption-short) Prior $\pi(\mu)$, likelihood $p(\boldsymbol{y}|\mu)$ and posterior distribution $\pi(\mu|\boldsymbol{y})$ for the estimate of the mean from a Normal distribution.

```{r, bayes-normal, out.width = "100%", fig.width = 10, fig.height = 5, fig.align='center', fig.cap = "(ref:bayes-normal-caption)", fig.scap = "(ref:bayes-normal-caption-short)", label = "bayes-normal", echo = FALSE, cache = TRUE}

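# Note: sigma_0, sigma and sigma_n below are variances, not standard deviations
mu_0 <- 0
sigma_0 <- c(4, 1, .4) # prior variance (varies across rows)
sigma <- 4             # known variance of the observations
n <- c(1, 4, 10)       # sample size (varies across columns)
ybar <- 4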

x <- seq(from = -2, to = 6, by = .1)

df <- crossing(
  mu_0, sigma_0, sigma, n, ybar, x
) %>% 
  mutate(
    mu_n = (mu_0*sigma + n*ybar*sigma_0) / (sigma + n*sigma_0),
    sigma_n = (sigma*sigma_0) / (sigma + n*sigma_0),
    prior = dnorm(x, mean = mu_0, sd = sqrt(sigma_0)),
    likelihood = dnorm(x, mean = ybar, sd = sqrt(sigma/n)),
    posterior = dnorm(x, mean = mu_n, sd = sqrt(sigma_n))
  ) %>% 
  pivot_longer(
    cols = c(prior, likelihood, posterior),
    names_to = "distribution",
    values_to = "density"
  ) %>% 
  mutate(
    distribution = distribution %>% 
      str_to_title() %>% 
      fct_inorder()
  ) %>% 
  mutate(
    n = as.character(n),
    n_label = str_c("n == ", n) %>% 
      fct_inorder(),
    sigma_0_label = str_c("sigma[0]^2 == ", sigma_0) %>%
      fct_inorder(),
  )

df_notes <- df %>%
      select(sigma_0_label, n_label, mu_0, ybar, mu_n) %>%
      unique()


p_bayes_normal <- df %>% 
  ggplot() +
  geom_vline(
    data = df_notes %>% 
      pivot_longer(cols = c(mu_0, ybar, mu_n)),
    mapping = aes(xintercept = value),
    linetype = "dotted",
    color = "grey20"
  ) +
  # geom_segment(
  #   aes(x = mu_n, xend = mu_n),
  #   y = -Inf, yend = 1,
  #   linetype = "dotted",
  #   color = "grey20"
  # ) +
  # geom_text(
  geom_label(
    data = df_notes %>%   
      crossing(
        tibble(label = expression(mu[n]))
      ),
    mapping = aes(x = mu_n, label = label),
    y = 1, label.size = NA,
    parse = T, nudge_x = 0.1,
    color = "grey20"
  ) +
  geom_line(
    aes(x = x, y = density, color = distribution),
    size = 1,
    alpha = .8
  ) +
  facet_grid(
    sigma_0_label ~ n_label,
    labeller = label_parsed
  ) +
  ylim(0, 1.1) +
  labs(x = expression(mu), y = "Density", color = "Distribution") +
  scale_x_continuous(
    breaks = c(0, 4),
    labels = c(
      expression(mu[0]),
      expression(bar(y))
    )
  ) 

p_bayes_normal

```

As $\pi(\mu|\boldsymbol{y})$ is a Normal distribution, the expected value $E(\mu|\boldsymbol{y})$ and the mode $\text{Mode}(\mu|\boldsymbol{y})$ coincide, so our Maximum a Posteriori estimate for $\mu$ will be:
$$
\hat{\mu}^{MAP} = \text{Mode}(\mu|\boldsymbol{y}) = E(\mu|\boldsymbol{y}) = \mu_n = \frac{\frac{n}{\sigma^2}\bar{y} + \frac{1}{\sigma_0^2}\mu_0}{ \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}}
$$

From formula \@ref(eq:bayes-normal-normal-mu) we get some interesting insights on $E(\mu|\boldsymbol{y})=\mu_n$. $\mu_n$ is a weighted average of the prior mean $\mu_0$ and the sample mean $\bar{y}$, so it lies between $\mu_0$ and $\bar{y}$. The weights for $\mu_0$ and $\bar{y}$ are given respectively by the reciprocal of the prior variance $\frac{1}{Var(\mu)}=\frac{1}{\sigma_0^2}$ and the reciprocal of the sample mean variance $\frac{1}{Var(\bar{y})}=\frac{n}{\sigma^2}$. That means that the lower the prior variance is, the greater the weight for $\mu_0$ is, and the lower the sample mean variance is, the greater the weight for $\bar{y}$ is. The reciprocal of the variance can be interpreted as the amount of information we have on that estimate. Thus, a prior distribution with a lower variance is a more informative prior, while a prior distribution with a higher variance is a less informative prior. The sample mean variance $Var(\bar{y})=\frac{\sigma^2}{n}$ depends on $\sigma^2$ and $n$. If $n$ increases, $Var(\bar{y})$ decreases and so the weight for $\bar{y}$ increases. That corresponds to saying that, by increasing the sample size $n$, we give more credibility to the observed sample mean $\bar{y}$.

From the formula \@ref(eq:bayes-normal-normal-sigma) we see that the reciprocal of $Var(\mu|\boldsymbol{y})=\sigma_n^2$ is the sum between the reciprocal of $Var(\mu)=\sigma_0^2$ and the reciprocal of $Var(\bar{y})=\frac{\sigma^2}{n}$. Interpreting the reciprocal of the variance as the _information_ on that estimate, this result on $Var(\mu|\boldsymbol{y})$ corresponds to saying that the information on the posterior distribution is the sum of the prior information and the information obtained by the sample.

In the limit case $n\to +\infty$, the posterior mean converges to the observed sample mean, $\mu_n\to \bar{y}$, and the posterior variance converges to zero, $\sigma_n^2\to 0$.
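
Formulas \@ref(eq:bayes-normal-normal-mu) and \@ref(eq:bayes-normal-normal-sigma) can be evaluated directly. In the sketch below `posterior_normal` is a hypothetical helper and the numerical values are chosen only for illustration.

```r
# Posterior parameters for the mean of a Normal sample with known variance
posterior_normal <- function(ybar, n, sigma2, mu_0, sigma2_0) {
  info_sample <- n / sigma2  # reciprocal of Var(ybar)
  info_prior <- 1 / sigma2_0 # reciprocal of the prior variance
  list(
    mu_n = (info_sample * ybar + info_prior * mu_0) / (info_sample + info_prior),
    sigma2_n = 1 / (info_sample + info_prior)
  )
}

# Equal sample and prior variance, n = 4: the sample mean gets weight 4 / 5
post <- posterior_normal(ybar = 4, n = 4, sigma2 = 4, mu_0 = 0, sigma2_0 = 4)
post$mu_n     # 3.2
post$sigma2_n # 0.8

# Large sample: the posterior concentrates around the sample mean
post_large_n <- posterior_normal(ybar = 4, n = 1e6, sigma2 = 4, mu_0 = 0, sigma2_0 = 4)

# Infinite prior variance: the prior carries no information
post_flat <- posterior_normal(ybar = 4, n = 4, sigma2 = 4, mu_0 = 0, sigma2_0 = Inf)
```

Writing the weights as `info_sample` and `info_prior` makes the additivity of the information explicit: the posterior information is their sum.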

Another limit case is to consider a prior distribution with $\sigma_0^2 = +\infty$, that is $\pi(\mu)\propto k, \ k\in]0,+\infty[$. This is an improper prior, as for any value of $k\in]0,+\infty[$, we get $\int_{\mathbb{R}}{\pi(\mu)d\mu}=+\infty$. This kind of prior distribution is called _Non-informative prior_, as it gives no information on the parameter. In this case, by \@ref(eq:bayes-normal-normal-mu) and \@ref(eq:bayes-normal-normal-sigma) we get:
\begin{align*}
\mu_n & = \bar{y} \\
\sigma_n^2 & = \frac{\sigma^2}{n}
\end{align*}
From equation \@ref(eq:bayes-normal-normal-posterior-compute) we get that, if we use a non-informative prior, the posterior distribution is proportional to the likelihood, $\pi(\mu|\boldsymbol{y}) \propto p(\boldsymbol{y}|\mu)$. That means that the Maximum a Posteriori estimate $\hat{\mu}^{MAP}$ corresponds to the maximum likelihood estimate $\hat{\mu}^{ML}$. This result gives a new interpretation of the Maximum Likelihood estimate, as it can always be seen as a Maximum a Posteriori estimate with a non-informative prior distribution.

In this example, for simplicity, we considered $\sigma^2$ as a known parameter. In practice this isn't a common situation: usually $\sigma^2$ has to be estimated like $\mu$. In the Bayesian framework we must then assign a prior distribution $\pi(\sigma^2)$ to it. If we don't want to introduce a bias, we can choose a non-informative prior.

The problem of inference on the mean $\mu$ of a Normal distribution $y_i\sim\mathcal{N}(\mu, \sigma^2)$ with a Normal prior assigned to $\mu\sim\mathcal{N}(\mu_0, \sigma_0^2)$ is particularly convenient because it produces an explicit solution for $\pi(\mu|\boldsymbol{y})$ that is easily interpretable and can be easily computed. Given the sample distribution $p(\boldsymbol{y}|\theta)$, a prior distribution $\pi(\theta)$ that returns a posterior distribution $\pi(\theta|\boldsymbol{y})$ in the same family as the prior is called a _conjugate prior_. Other examples of conjugate priors are the Beta distribution for the probability of success $p$ in a Binomial sample and the Gamma distribution for the parameter $\lambda$ in a Poisson sample.

In general we are not constrained to use conjugate priors and we can choose for $\pi(\theta)$ the distribution that better describes our prior information. When there is not an explicit solution for $\pi(\theta|\boldsymbol{y})$, it can be computed numerically with simulation techniques. The techniques commonly adopted are based on _Markov Chain Monte Carlo_ (MCMC). These are highly flexible techniques, but they come with a high computational cost.
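
To give an idea of how a simulation technique works, the following sketch applies a random-walk Metropolis sampler, one of the simplest MCMC algorithms, to the Normal example of this section, where the exact posterior is available for comparison. The data, the proposal standard deviation, the number of iterations and the burn-in length are arbitrary choices made for illustration. Note that here `sigma` and `sigma_0` denote standard deviations.

```r
set.seed(2024)
sigma <- 2   # known standard deviation of the observations
y <- rnorm(20, mean = 4, sd = sigma)
mu_0 <- 0    # prior mean
sigma_0 <- 1 # prior standard deviation

# Log of the unnormalized posterior: log-likelihood + log-prior
log_posterior <- function(mu) {
  sum(dnorm(y, mean = mu, sd = sigma, log = TRUE)) +
    dnorm(mu, mean = mu_0, sd = sigma_0, log = TRUE)
}

n_iter <- 20000
draws <- numeric(n_iter)
current <- mean(y)
for (i in seq_len(n_iter)) {
  proposal <- current + rnorm(1, sd = 0.5)
  # Accept the proposal with probability min(1, posterior ratio)
  if (log(runif(1)) < log_posterior(proposal) - log_posterior(current)) {
    current <- proposal
  }
  draws[i] <- current
}
draws <- draws[-seq_len(2000)] # discard the burn-in

# Exact posterior parameters from the closed-form solution
precision_n <- length(y) / sigma^2 + 1 / sigma_0^2
mu_n <- (length(y) / sigma^2 * mean(y) + mu_0 / sigma_0^2) / precision_n

c(mcmc_mean = mean(draws), exact_mean = mu_n,
  mcmc_var = var(draws), exact_var = 1 / precision_n)
```

The sampler only needs the posterior up to the normalizing constant $p(\boldsymbol{y})$, which is why the same scheme can be used when no explicit solution exists.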

<!--
Remarks
- Comment on the result
- Graphical example
- Non-informative prior case -> Maximum Likelihood
- A prior distribution can also be assigned to \sigma^2, and we will obtain a posterior
- In the Normal-Normal case the results are obtained in closed form. These cases are called Conjugate Priors. They help interpretability and simplify the computations. In general any distribution can be used. MCMC
-->


#### Bayesian estimators for GLM

The Bayesian approach can be applied to \ac{glm} by adopting a prior distribution on the parameters $(\boldsymbol{\beta}, \phi)$. By assuming a prior distribution on $\boldsymbol{\beta}$ we introduce a bias driven by our prior information.

In section \@ref(chap:glm-model-fitting) we saw that in \ac{glm} the estimator commonly used for $\boldsymbol{\beta}$ is the Maximum Likelihood estimator $\tilde{\boldsymbol{\beta}}^{ML}$ with realizations:
\begin{equation}
\label{eq:bayes-max-lik-est}
\hat{\boldsymbol{\beta}}^{ML} = \argmax_{\boldsymbol{\beta}\in\mathbb{R}^{p+1}}{L\left(\boldsymbol{\beta}, \phi \mid \boldsymbol{y}\right)}
\end{equation}

If in equation \@ref(eq:bayes-max-lik-est) we substitute $L\left(\boldsymbol{\beta}, \phi \mid \boldsymbol{y}\right)$ with $\pi\left(\boldsymbol{\beta}, \phi \mid \boldsymbol{y}\right)$, we obtain the Maximum a Posteriori estimator $\tilde{\boldsymbol{\beta}}^{MAP}$:
\begin{equation}
\label{eq:bayes-map-est-1}
\hat{\boldsymbol{\beta}}^{MAP} = \argmax_{\boldsymbol{\beta}\in\mathbb{R}^{p+1}}{\pi\left(\boldsymbol{\beta}, \phi \mid \boldsymbol{y}\right)}
=
\argmax_{\boldsymbol{\beta}\in\mathbb{R}^{p+1}}{\left\{L\left(\boldsymbol{\beta}, \phi \mid \boldsymbol{y}\right) \pi(\boldsymbol{\beta}, \phi) \right\}}
\end{equation}

If we assume a non-informative prior distribution for $\phi$, the equation becomes:
\begin{equation}
\label{eq:bayes-map-est-2}
\hat{\boldsymbol{\beta}}^{MAP} =
\argmax_{\boldsymbol{\beta}\in\mathbb{R}^{p+1}}{\left\{L\left(\boldsymbol{\beta}, \phi \mid \boldsymbol{y}\right) \pi(\boldsymbol{\beta}) \right\}}
\end{equation}

Considering the log-likelihood $\ell\left(\boldsymbol{\beta}, \phi \mid \boldsymbol{y}\right) = \log{\left( L\left( \boldsymbol{\beta}, \phi \mid \boldsymbol{y} \right)\right)}$, the optimization problem \@ref(eq:bayes-map-est-2) becomes:
\begin{equation}
\label{eq:bayes-map-est-3}
\hat{\boldsymbol{\beta}}^{MAP} =
\argmax_{\boldsymbol{\beta}\in\mathbb{R}^{p+1}}{\left\{\ell\left(\boldsymbol{\beta}, \phi \mid \boldsymbol{y}\right) + \log{\left(\pi(\boldsymbol{\beta})\right)} \right\}}
\end{equation}

And expressed in terms of deviance $D(\hat{\boldsymbol{\beta}}, \boldsymbol{y}) = -2\phi\left(\ell\left(\hat{\boldsymbol{\beta}}, \phi; \boldsymbol{y}\right)
- \ell_{S}\left(\boldsymbol{\beta}^*, \phi; \boldsymbol{y}\right)\right)$:
\begin{equation}
\label{eq:bayes-map-est-4}
\hat{\boldsymbol{\beta}}^{MAP} =
\argmin_{\boldsymbol{\beta}\in\mathbb{R}^{p+1}}{\left\{
D(\boldsymbol{\beta}, \boldsymbol{y}) -2\phi \log{\left(\pi(\boldsymbol{\beta})\right)} \right\}}
\end{equation}

From equation \@ref(eq:bayes-map-est-4) we find that adding a prior distribution $\pi(\boldsymbol{\beta})$ corresponds to adding a term $-2\phi \log{\left(\pi(\boldsymbol{\beta})\right)}$ to the deviance in the optimization problem that defines the estimator. In particular, if the prior is non-informative, $\pi(\boldsymbol{\beta})\propto k$, with $k\in]0,+\infty[$, does not depend on $\boldsymbol{\beta}$, so the optimization problem coincides with the Maximum Likelihood one.


#### Ridge and LASSO Regression as Bayesian estimators for GLM {#chap:bayes-ridge-lasso}

Let's assume the parameters $\beta_1, \beta_2, \dots, \beta_p$ to be identically distributed with _Normal_ distribution centered in $0$:
$$
\pi(\beta_j) = \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2\sigma^2}\beta_j^2}, \quad j\in\{1,2,\dots,p\}
$$
and let's assign to $\beta_0$ a non-informative prior distribution:
$$
\pi(\beta_0) \propto k, \ k\in]0,+\infty[
$$
If we also assume that the parameters are independent, we get:
\begin{equation}
\label{eq:bayes-glm-prior-normal}
\pi(\boldsymbol{\beta}) \propto \prod_{j=1}^{p}{\pi(\beta_j)} \propto e^{-\frac{1}{2\sigma^2}{\sum_{j=1}^{p}{\beta_j^2}}} 
\end{equation}
By substituting \@ref(eq:bayes-glm-prior-normal) into \@ref(eq:bayes-map-est-4), we get:
\begin{align}
\nonumber
\hat{\boldsymbol{\beta}}^{MAP} & =
\argmin_{\boldsymbol{\beta}\in\mathbb{R}^{p+1}}{\left\{
D(\boldsymbol{\beta}, \boldsymbol{y}) -2\phi \log{\left(e^{-\frac{1}{2\sigma^2}{\sum_{j=1}^{p}{\beta_j^2}}}\right)} \right\}} \\
\label{eq:bayes-map-est-normal}
& = \argmin_{\boldsymbol{\beta}\in\mathbb{R}^{p+1}}{\left\{
D(\boldsymbol{\beta}, \boldsymbol{y}) + \frac{\phi}{\sigma^2} {\sum_{j=1}^{p}{\beta_j^2}}\right\}}
\end{align}
From expression \@ref(eq:bayes-map-est-normal), if we substitute $\frac{\phi}{\sigma^2}$ with $\lambda$, we obtain the optimization problem of the Ridge regression \@ref(eq:ridge-est-deviance). That means that the Ridge estimator can be seen as a Maximum a Posteriori estimator in which the prior distributions for $\beta_1, \beta_2, \dots, \beta_p$ are independent Normal distributions centered in $0$ with the same variance $\sigma^2$. From the substitution $\lambda=\frac{\phi}{\sigma^2}$ we can also interpret the role of $\lambda$ and $\sigma^2$. A lower $\sigma^2$ corresponds to a more informative prior that gives more credibility to the prior mean $E(\beta_j)=0$. In the usual Ridge parametrization, this corresponds to a greater $\lambda$ and so a higher weight on the penalization $\sum_{j=1}^{p}{\beta_j^2}$, which brings a stronger shrinkage to the estimates.
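
The correspondence $\lambda=\frac{\phi}{\sigma^2}$ can be checked numerically. The sketch below assumes a Gaussian model with identity link, for which the deviance is the residual sum of squares and the Ridge solution has a closed form. The simulated data and the values of `phi` and `sigma2` are arbitrary illustrative choices.

```r
set.seed(7)
n <- 100
X <- cbind(1, matrix(rnorm(n * 3), nrow = n)) # first column: intercept
phi <- 1      # dispersion parameter (error variance), assumed known
sigma2 <- 0.5 # prior variance of beta_1, ..., beta_p
y <- drop(X %*% c(1, 0.8, -0.5, 0)) + rnorm(n, sd = sqrt(phi))

# Ridge estimate with lambda = phi / sigma2; the intercept is not penalized
lambda <- phi / sigma2
penalty_matrix <- diag(c(0, 1, 1, 1))
beta_ridge <- drop(solve(crossprod(X) + lambda * penalty_matrix, crossprod(X, y)))

# MAP estimate: minimize -(log-likelihood + log-prior), flat prior on the intercept
neg_log_posterior <- function(beta) {
  -sum(dnorm(y, mean = drop(X %*% beta), sd = sqrt(phi), log = TRUE)) -
    sum(dnorm(beta[-1], mean = 0, sd = sqrt(sigma2), log = TRUE))
}
beta_map <- optim(rep(0, 4), neg_log_posterior, method = "BFGS",
                  control = list(reltol = 1e-12))$par

round(cbind(ridge = beta_ridge, map = beta_map), 4)
```

Up to the numerical tolerance of the optimizer, the two columns coincide.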

With the same approach, we can assume $\beta_1, \beta_2, \dots, \beta_p$ to be identically distributed with _Laplace_ distribution centered in $0$:
$$
\pi(\beta_j) = \frac{1}{2b}e^{-\frac{|\beta_j|}{b}}, \quad j\in\{1,2,\dots,p\}
$$
Under this assumption, the prior distribution for the coefficients becomes:
\begin{equation}
\label{eq:bayes-glm-prior-laplace}
\pi(\boldsymbol{\beta}) \propto \prod_{j=1}^{p}{\pi(\beta_j)} \propto e^{-\frac{1}{b}{\sum_{j=1}^{p}{|\beta_j|}}} 
\end{equation}
And the optimization problem \@ref(eq:bayes-map-est-4) becomes:
\begin{align}
\nonumber
\hat{\boldsymbol{\beta}}^{MAP} & =
\argmin_{\boldsymbol{\beta}\in\mathbb{R}^{p+1}}{\left\{
D(\boldsymbol{\beta}, \boldsymbol{y}) -2\phi \log{\left(e^{-\frac{1}{b}{\sum_{j=1}^{p}{|\beta_j|}}}\right)} \right\}}  \\
\label{eq:bayes-map-est-laplace}
& = \argmin_{\boldsymbol{\beta}\in\mathbb{R}^{p+1}}{\left\{
D(\boldsymbol{\beta}, \boldsymbol{y}) + \frac{2\phi}{b} {\sum_{j=1}^{p}{|\beta_j|}}\right\}}
\end{align}
By substituting $\frac{2\phi}{b}$ with $\lambda$ in equation \@ref(eq:bayes-map-est-laplace), we obtain the optimization problem of the \ac{lasso} regression \@ref(eq:lasso-est-deviance). In the Laplace distribution the variance is $Var(\beta_j) = 2b^2$. By decreasing $b$, the variance of $\pi(\beta_j)$ decreases, so the prior is more informative and we give more credibility to the prior mean $0$. As for Ridge regression, a lower variance in the prior distribution translates into a higher penalization parameter $\lambda$ and a stronger shrinkage of the coefficients.
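
A toy example with a single coefficient shows how the Laplace prior can return an estimate exactly equal to $0$. We assume, only for illustration, observations $y_i\sim\mathcal{N}(\beta, \phi)$ with known $\phi$, so that the deviance is $\sum_i{(y_i-\beta)^2}$ and the minimizer of the penalized deviance is the soft-thresholded sample mean. The helpers `soft_threshold` and `map_laplace` are hypothetical names.

```r
soft_threshold <- function(z, gamma) sign(z) * pmax(abs(z) - gamma, 0)

# MAP estimate of beta under a Laplace(0, b) prior, with lambda = 2 * phi / b
map_laplace <- function(y, phi, b) {
  lambda <- 2 * phi / b
  penalized_deviance <- function(beta) sum((y - beta)^2) + lambda * abs(beta)
  numeric_map <- optimize(penalized_deviance, interval = c(-10, 10), tol = 1e-10)$minimum
  closed_form <- soft_threshold(mean(y), lambda / (2 * length(y)))
  c(numeric = numeric_map, closed_form = closed_form)
}

y <- c(0.3, -0.1, 0.4, 0.2, 0.1) # sample mean 0.18

weak_prior <- map_laplace(y, phi = 1, b = 5)     # large b: mild shrinkage
strong_prior <- map_laplace(y, phi = 1, b = 0.5) # small b: estimate set to 0
rbind(weak_prior, strong_prior)
```

With the more informative prior the threshold $\frac{\lambda}{2n}$ exceeds the absolute value of the sample mean, so the estimate is exactly $0$ in the closed form and numerically $0$ in the optimization.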

With the same approach, if we assume the following prior distribution:
$$
\pi(\beta_j) \propto e^{-\frac{1}{k}\left(\alpha|\beta_j| + (1-\alpha)\beta_j^2\right)}, \quad \alpha\in[0,1], \ k\in]0,+\infty[, \ j\in\{1,2,\dots,p\}
$$
we obtain the Elastic Net optimization problem:
$$
\hat{\boldsymbol{\beta}}^{MAP} = \argmin_{\boldsymbol{\beta}\in\mathbb{R}^{p+1}}{\left\{
D(\boldsymbol{\beta}, \boldsymbol{y}) + \frac{2\phi}{k} {\sum_{j=1}^{p}{\left(\alpha|\beta_j| + (1-\alpha)\beta_j^2\right)}}\right\}}
$$
where $k$ is a constant that determines the variance of the prior distribution.

Figure \@ref(fig:normal-laplace) shows a Normal, a Laplace and an Elastic Net prior distribution with $\alpha=\frac{1}{2}$. All the distributions represented have unit variance. As we can see from the plot, the Laplace distribution has a peak at its mean. This peak is what forces the unimportant coefficients to exactly $0$ in the \ac{lasso} regression. The Elastic Net prior is a compromise between the Normal distribution and the Laplace distribution, as its density is proportional to the product of a Normal kernel and a Laplace kernel.

(ref:normal-laplace-caption-latex) Normal density function, Laplace density function and Elastic Net prior density function with unit variance.

(ref:normal-laplace-caption-gitbook) Normal density function, Laplace density function and Elastic Net prior density function with unit variance.

(ref:normal-laplace-caption-short) Normal density function, Laplace density function and Elastic Net prior density function with unit variance.

```{r, plot-normal-laplace-print, out.width = "50%", fig.width = 5, fig.height = 3.5, fig.align='center', fig.cap = ifelse(knitr::is_html_output(), "(ref:normal-laplace-caption-gitbook)", "(ref:normal-laplace-caption-latex)"), fig.scap = "(ref:normal-laplace-caption-short)", label="normal-laplace", echo=FALSE, fig.ncol=2, fig.subcap=c('Normal distribution', 'Laplace distribution', "Elastic Net prior distribution with $\\alpha=\\frac{1}{2}$"), cache=TRUE}


p_normal <- ggplot() +
  stat_function(
    fun = dnorm,
    color = col1,
    size = line_size
  ) +
  xlim(-3, 3) +
  ylim(0, 0.75) +
  labs(
    x = expression(beta[j]),
    y = expression(pi(beta[j]))
  )

p_laplace <- ggplot() +
  stat_function(
    fun = function(x){dlaplace(x, location = 0, scale = sqrt(2)/2)},
    color = col1,
    size = line_size
  ) +
  xlim(-3, 3) +
  ylim(0, 0.75) +
  labs(
    x = expression(beta[j]),
    y = expression(pi(beta[j]))
  )

alpha <- .5
k <- 1.4
scale <- 2.18117 # Computed externally with Wolfram Alpha

p_en <- ggplot() +
  stat_function(
    fun = function(x){1/scale*exp(-1/k*((1-alpha)*x^2 + alpha*abs(x)))},
    color = col1,
    size = line_size
  ) +
  xlim(-3, 3) +
  ylim(0, 0.75) +
  labs(
    x = expression(beta[j]),
    y = expression(pi(beta[j]))
  )

plot_grid_split(p_normal, p_laplace, p_en)

```
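Unlike the Normal and the Laplace densities, the Elastic Net prior is only given up to a proportionality constant. A minimal sketch of how its normalizing constant and its variance can be obtained by numerical integration in base R is the following, for $\alpha=\frac{1}{2}$ and an illustrative value $k=1.4$ (the names `kernel`, `norm_const` and `prior_var` are assumptions introduced for this example). Since the density is symmetric around $0$, the integrals are computed on the positive half-line and doubled.

```r
alpha <- 0.5
k <- 1.4

# Unnormalized Elastic Net prior density
kernel <- function(x) exp(-(alpha * abs(x) + (1 - alpha) * x^2) / k)

# Normalizing constant: integral of the kernel over the real line
norm_const <- 2 * integrate(kernel, 0, Inf)$value

# Variance: second moment of the normalized density (the mean is 0 by symmetry)
second_moment <- 2 * integrate(function(x) x^2 * kernel(x), 0, Inf)$value
prior_var <- second_moment / norm_const

c(norm_const = norm_const, prior_var = prior_var)
```

With these values the variance is close to $1$, so the resulting density is comparable with a standard Normal and with a Laplace with $b = \frac{\sqrt{2}}{2}$.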

The approach usually adopted in Ridge and \ac{lasso} regression is to select the optimal hyper-parameter $\lambda$ based on Cross Validation, so the prior distribution is estimated from the data. This approach stands in contrast to standard Bayesian methods, in which the prior distribution represents the a priori information and should be fixed before observing the sample. The Bayesian procedures in which the prior distribution is estimated from the data are called _Empirical Bayes_ methods.


#### Other Bayesian estimators for GLM

The Bayesian framework offers more flexibility in specifying prior information than standard Ridge and \ac{lasso} regression do.

Even staying with Normal prior distributions, it is possible to assign to the different coefficients $\beta_j$ different prior variances $\sigma_j^2$, which represent how much we trust each estimated coefficient to be different from $0$. For example, it is possible to use a non informative prior for most of the coefficients and to introduce an informative prior only for the more delicate ones. This corresponds to adding a penalization only to those coefficients. As we have already discussed in section \@ref(chap:shrinkage-considerations), this technique can be useful in the case of highly unbalanced explanatory variables, such as the variable "number of claims experienced in the previous year".

It is also possible to assign to the coefficients prior distributions not centered on $0$. If we assign to the coefficient $\beta_j$ a prior $\beta_j\sim\mathcal{N}(\beta_{j0}, \sigma_j^2)$, this results in a shrinkage towards $\beta_{j0}$ instead of towards $0$. In the Bayesian reasoning, this can be useful if we want to add to the model information taken from outside the sample. For instance, if we want to fit a model for a small portfolio of insurance policies, we can introduce prior information based on estimates from other portfolios or on market data. Given the prior distribution, the exposure of the portfolio used for fitting determines how much information we have on that portfolio and so how close our posterior estimates will be to the Maximum Likelihood estimates.
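To make the link with penalized estimation explicit, we can repeat the steps followed for the Normal prior centered in $0$. As a sketch, assuming independent priors $\beta_j\sim\mathcal{N}(\beta_{j0}, \sigma_j^2)$ for $j\in\{1,2,\dots,p\}$, the Maximum a Posteriori estimator becomes:
$$
\hat{\boldsymbol{\beta}}^{MAP} = \argmin_{\boldsymbol{\beta}\in\mathbb{R}^{p+1}}{\left\{
D(\boldsymbol{\beta}, \boldsymbol{y}) + \sum_{j=1}^{p}{\frac{\phi}{\sigma_j^2}\left(\beta_j - \beta_{j0}\right)^2}\right\}}
$$
Each coefficient has its own penalization parameter $\lambda_j = \frac{\phi}{\sigma_j^2}$ (a notation introduced here only for illustration) and is shrunk towards its own prior mean $\beta_{j0}$. For $\sigma_j^2 \to +\infty$ we get $\lambda_j \to 0$, which corresponds to a non informative prior and to no penalization on $\beta_j$.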

The Bayesian framework also allows us to use prior distributions other than the Normal and the Laplace. For example, if we want a coefficient $\beta_j$ to be positive, we can assign to it a prior distribution $\pi(\beta_j)$ with all the mass on positive values and density equal to $0$ on all the negative values. This forces the Maximum a Posteriori estimate to be non negative. An example is the half-Normal distribution:
$$
\pi(\beta_j) =
\begin{cases}
\frac{\sqrt{2}}{\sqrt{\pi}\sigma}e^{-\frac{1}{2\sigma^2}\beta_j^2} & \text{if } \beta_j \ge 0 \\
0 & \text{otherwise}
\end{cases}
$$

A useful application of prior distributions with all the mass on positive values is the following. Suppose we have a discrete ordinal variable with determinations encoded in the dummy variables $x_1, x_2, \dots, x_J$. In the common \ac{glm} parametrization, each dummy variable corresponds to one of the coefficients $\beta_1, \beta_2, \dots, \beta_J$. We can consider the difference between each coefficient and the previous one: $\gamma_j = \beta_j - \beta_{j-1}, \ j\in\{2, 3, \dots, J\}$. The new parametrization depends on the coefficients $\beta_1, \gamma_2, \dots, \gamma_J$. If we assign a prior distribution centered on $0$ to the coefficients $\gamma_2, \dots, \gamma_J$, this will produce a shrinkage of each of the coefficients $\beta_2, \dots, \beta_J$ towards the previous one, forcing the variable effect to have small steps from one modality to the next one. Moreover, if we assign to $\gamma_2, \dots, \gamma_J$ prior distributions with all the mass on positive determinations, the estimate will produce a monotonically non decreasing effect as:
$$
\gamma_2, \dots, \gamma_J \ge 0 \ \longrightarrow \ \beta_1\le\beta_2\le\dots\le\beta_J
$$
In the same way, if we assign to $\gamma_2, \dots, \gamma_J$ prior distributions with all the mass on negative determinations, the result will be a monotonically non increasing effect.
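The reparametrization above can be sketched in a few lines of base R. The example assumes a Poisson response with log link, simulated data and half-Normal priors on $\gamma_2, \dots, \gamma_J$; the names `Z`, `sigma2` and `neg_log_post` are illustrative assumptions. The positivity of the priors is imposed through lower bounds in the optimizer.

```r
set.seed(42)
J <- 5
n <- 2000
level <- sample(1:J, n, replace = TRUE)          # ordinal explanatory variable
true_beta <- c(-2.0, -1.8, -1.9, -1.5, -1.4)     # not monotone on purpose
y <- rpois(n, exp(true_beta[level]))

# Column j of Z is 1 when level >= j, so that the linear predictor is
# beta_1 + gamma_2 + ... + gamma_level
Z <- outer(level, 1:J, ">=") * 1

sigma2 <- 1   # variance parameter of the half-Normal priors (illustrative)

# Negative log-posterior in the parameters theta = (beta_1, gamma_2, ..., gamma_J)
neg_log_post <- function(theta) {
  eta <- as.vector(Z %*% theta)
  -sum(dpois(y, exp(eta), log = TRUE)) + sum(theta[-1]^2) / (2 * sigma2)
}

# gamma_j >= 0 is enforced with box constraints; beta_1 is unconstrained
fit <- optim(c(log(mean(y)), rep(0, J - 1)), neg_log_post,
             method = "L-BFGS-B", lower = c(-Inf, rep(0, J - 1)))

beta_hat <- cumsum(fit$par)   # back to beta_1, ..., beta_J
beta_hat
```

Since every $\hat{\gamma}_j$ is non negative, the recovered coefficients $\hat{\beta}_1, \dots, \hat{\beta}_J$ are non decreasing, even if the data were generated with a non monotone effect.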

<!--
- I can apply different penalizations to the different coefficients
- If I want a coefficient to be positive, I give it a prior with all the distribution >0
- Akur8 reparametrization
- If I have a discrete quantitative variable I can reparametrize so that it is monotone
-->


#### Some considerations on Bayesian estimators for GLM

Bayesian estimators offer a flexible tool for modeling effects in \ac{glm}. This tool is particularly useful in actuarial modeling because often the actuary needs to introduce external information to tune the model coefficients.

In actuarial practice this is often done by introducing offsets. For example, let's consider a delicate coefficient $\beta_j$ estimated for a class with low exposure. This can happen for example with the variable "number of claims experienced in the previous year" with the class "more than one claim". If we want to insert it in the model with an effect weakened compared to the maximum likelihood one $\hat{\beta}_j^{ML}$, we can fix a specific value $\hat{\beta}_j^{\text{offset}}$, with $|\hat{\beta}_j^{\text{offset}}| \le |\hat{\beta}_j^{ML}|$, and introduce it in the model as an offset.

With Bayesian estimators, in practice, we can do something very similar to what we do with offsets. Indeed, we can choose a prior distribution such that our estimate is exactly equal to the offset, $\hat{\beta}_j^{MAP} = \hat{\beta}_j^{\text{offset}}$. For example, in the case of the delicate coefficient $\beta_j$, we can just apply a Normal prior $\pi(\beta_j)$ centered in $0$ and tune the variance $\sigma_j^2$ in order to obtain exactly $\hat{\beta}_j^{MAP} = \hat{\beta}_j^{\text{offset}}$.

This practice looks really similar to introducing an offset, but it has two important benefits.

First of all, even if we are forcing the coefficient to be exactly $\hat{\beta}_j^{MAP} = \hat{\beta}_j^{\text{offset}}$, we have an indicator of how strong our forcing is. Indeed, $\sigma_j^2$ measures the strength of the prior distribution (the lower the variance, the stronger the prior) and it can be compared with the variances of the prior distributions of the other coefficients.

Secondly, in many cases, after setting an offset, we want to modify the model and possibly change other coefficients. After doing this, if we are working with offsets, we have to review each of the coefficients imposed as offsets, since a change in other coefficients can affect them. With Bayesian estimators we can just keep the prior distributions previously set and refit the model. The new coefficients will be automatically reevaluated keeping the strength of the prior distributions previously assigned.
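The tuning of $\sigma_j^2$ described above can be sketched as a one-dimensional root-finding problem. The example below is only illustrative: it assumes a Normal response with identity link and known dispersion $\phi$, so that the Maximum a Posteriori estimate has a closed form, and it uses simulated data; the names `map_fit`, `target` and `sigma2_star` are hypothetical.

```r
set.seed(7)
n <- 500
x1 <- rnorm(n)
x2 <- rbinom(n, 1, 0.03)             # rare class with low exposure
X <- cbind(1, x1, x2)
y <- as.vector(X %*% c(1, 0.5, 0.8) + rnorm(n))
phi <- 1                             # dispersion, assumed known
j <- 3                               # position of the delicate coefficient

A <- crossprod(X)
beta_ml <- as.vector(solve(A, crossprod(X, y)))
target <- 0.5 * beta_ml[j]           # hypothetical weakened effect

# MAP estimate when only coefficient j has a Normal(0, sigma2) prior
map_fit <- function(sigma2) {
  P <- diag(as.numeric(seq_len(ncol(X)) == j))
  as.vector(solve(A + (phi / sigma2) * P, crossprod(X, y)))
}

# Search, on the log scale, the prior variance whose MAP estimate hits the target
log_sigma2_star <- uniroot(function(s) map_fit(exp(s))[j] - target,
                           interval = c(-20, 20), tol = 1e-12)$root
sigma2_star <- exp(log_sigma2_star)

beta_map <- map_fit(sigma2_star)
c(ml = beta_ml[j], target = target, map = beta_map[j], sigma2 = sigma2_star)
```

Once `sigma2_star` has been found, it can be kept fixed when the rest of the model is modified: refitting with the same prior variance updates all the coefficients together, while an offset would have to be reviewed manually.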

We end our discussion on Bayesian estimators for \ac{glm} by mentioning that also \ac{gam} models, discussed in section \@ref(chap:gam), can be estimated with Bayesian estimators. This is done by applying a prior distribution to the coefficients of the spline functions $f_l(\cdot)$. For more details on that, we refer to [@bayes-gam].

<!--
- Other: hierarchical models, mixed models GLMM
- If the prior is not simple: estimation with MCMC methods
- Empirical Bayes: I choose lambda with cross validation
- In general I can introduce my prior information and give a coefficient >0 even if an elastic net with cross validation would lead me to give it 0
- If I want, I can choose the prior such that the posterior estimate gives a specific coefficient
  This is better than an offset because:
  1. I am aware of how strong the prior is and hence of how strong my imposition is
  2. If I change something in the other parameters and refit the model, everything is handled automatically
- In practice prior information is often used when choosing which parameters to keep in or remove from the model, or also when introducing offsets
-->


\newpage

<!-- ### GBM -->


## Considerations on models {#chap:considerations-on-models}

In this section we are going to discuss how the models we have seen in the previous sections satisfy the non-life insurance pricing needs. In subsection \@ref(chap:ml-techniques) we will also give some hints on _Machine Learning Algorithms_, that will be useful to compare them with the models described in this thesis.


### Hints on Machine Learning Algorithms {#chap:ml-techniques}

With the term _Machine Learning Algorithms_ we refer to a set of techniques used for predictive modeling in Machine Learning practice, such as \ac{gbm}, \ac{rf} and \ac{nn}. These are highly flexible general purpose techniques that can be used both for regression (predictive models with a quantitative response variable $Y$) and classification (predictive models with a qualitative response variable $Y$). In this section we are only going to provide some hints on them. For a more in-depth exposition of these models with applications to actuarial problems we refer to [@wuthrich-data-analytics].

The assumptions adopted in the models behind these algorithms are minimal. Usually only a distribution for the response variable $Y$ is assumed, while for the regression function we just consider:
$$
E(Y_i) = \boldsymbol{f}(x_{i1}, \dots, x_{ip})
$$
without any constraints on the shape of $\boldsymbol{f}(\cdot,\dots,\cdot)$. That means that with these models we take into account all the possible interactions between variables without any restriction.

Given the distribution of $Y$, we are able to define a loss function, that for example can be the deviance $D(\boldsymbol{f}, \boldsymbol{y})$. The aim of the fitting algorithm is to find a good approximation $\hat{\boldsymbol{f}}$ for $\boldsymbol{f}$.
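As an illustration of a deviance used as loss function, the following minimal sketch assumes a Poisson response, for which the deviance of a vector of predictions $\hat{\mu}_i = \hat{\boldsymbol{f}}(x_{i1}, \dots, x_{ip})$ is $2\sum_{i}{\left(y_i\log{\frac{y_i}{\hat{\mu}_i}} - (y_i - \hat{\mu}_i)\right)}$. The function name `poisson_deviance` and the simulated data are assumptions of the example. The predictions are taken here from a \ac{glm} only to compare the result with the deviance reported by `glm()`, but the same function can score the predictions of any algorithm.

```r
# Poisson deviance of predictions mu for observations y
poisson_deviance <- function(y, mu) {
  # y * log(y / mu) is taken as 0 when y = 0
  log_term <- ifelse(y > 0, y * log(y / mu), 0)
  2 * sum(log_term - (y - mu))
}

set.seed(5)
x <- runif(500)
y <- rpois(500, exp(-1 + 0.8 * x))
fit <- glm(y ~ x, family = poisson())

# The manual computation agrees with the deviance stored in the fitted model
c(manual = poisson_deviance(y, fitted(fit)), glm = deviance(fit))
```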

For highly complex models, there is a high risk of overfitting the training set. These Machine Learning Techniques offer sophisticated algorithms that provide efficient ways to fit $\hat{\boldsymbol{f}}$ while preventing overfitting.

The focus of these techniques is usually much more on the fitting algorithm than on the underlying model. For this reason they are often referred to as _Machine Learning Algorithms_ rather than _Machine Learning Models_.

These algorithms provide a high level of automation and do not require substantial manual intervention by the person who is running them. Neither the shape of the regression function nor the interactions between the explanatory variables have to be specified.

The results of the fitting are complex regression functions $\hat{\boldsymbol{f}}$. The advantage of this complexity combined with strong automation is that these algorithms are able to automatically spot complex effects that wouldn't be easily discovered by manual fitting. On the other hand, this complexity reduces the interpretability of the result. For this reason, these algorithms are often referred to as _Black Box_.

We underline that, from a theoretical point of view, \ac{glm} and all their advancements seen in this thesis are actually _Machine Learning Techniques_, since there is no theoretical distinction between _Statistical Models_ and _Machine Learning Models_. However, in common usage, the term _Machine Learning_ is used for algorithms such as \ac{gbm}, Random Forest and Neural Network, which, compared to classical statistical models, offer more complex estimations and a higher level of automation. In the following, we will use the term _Machine Learning Algorithms_ to refer to algorithms such as \ac{gbm}, Random Forest and Neural Network.


### Model comparison {#chap:model-comparison}

In this thesis we have discussed some \ac{glm} advancements: \ac{gam}s (section \@ref(chap:gam)), Shrinkage Estimators (section \@ref(chap:shrinkage-estimators)) and Bayesian Estimators (section \@ref(chap:bayes-glm)). All these developments offer better predictions and more automation compared to classic \ac{glm}. For example, in \ac{gam} fitting it is not necessary to specify the shape of the effect for the quantitative variables, and in Elastic Net an automatic variable selection is performed. The improvement in performance, both in terms of predictivity and automation, increases as the number of explanatory variables increases, so these techniques become significantly better than classic \ac{glm} when $p$ is big. As we have seen, these techniques are not mutually exclusive. It is possible to fit a \ac{gam} with penalized splines for quantitative explanatory variables and Elastic Net penalization for qualitative explanatory variables. The Bayesian estimators, which can be seen as a generalization of Elastic Net, can be applied to both \ac{glm} and \ac{gam} and offer even more useful tools for fitting.

Compared to these \ac{glm} advancements, Machine Learning Algorithms are still much more flexible and automatic. Indeed, in \ac{gam} and Elastic Net there still is a certain degree of manual supervision. For example, we still have to specify the interactions we want to consider in the model. Actually, in Elastic Net we can consider a large number of interactions and let the model automatically select them, but it is still preferable to check what the model is doing and, if needed, manually choose whether or not to include the selected interactions.

However, Machine Learning Algorithms are much less _interpretable_ and _controllable_ compared to GLM-based models. In GLM-based models we can easily look at the marginal effects of the explanatory variables and we have full control of the interactions. In Machine Learning Algorithms, we can't easily look at how a change in an explanatory variable affects the estimate of the response, because the effect of that change can highly depend on the values of the other explanatory variables in a way that we can't manually control.

Moreover, the benefits of GLM-based models go beyond interpretability. In GLM-based models, the person who is running the fitting has _full control of the coefficients_. Indeed, they can easily choose whether or not to add an effect, regardless of whether the variable selection procedure includes it. Furthermore, the choice is not limited to adding or not a variable or an interaction: in GLM-based models it is even possible to manually assign to an explanatory variable a marginal effect with a specific shape by introducing an offset or, with Bayesian estimators, by tuning the prior distribution.


### The actuary importance {#chap:actuary-importance}

The high level of discretion in fitting GLM-based models gives a lot of importance to the person who is fitting the model. In actuarial models, the person who conducts the model fitting is the _actuary_. In section \@ref(chap:actuary-role) we have seen who the actuary is. In this section we will describe some of the cases in which a manual intervention in the model fitting is needed and how the actuary can perform it.

We will distinguish between _technical needs_, that are related to building a good predictive model, and _commercial needs_, that are related to price optimization.


#### Technical needs

The _technical needs_ come from the fact that we observe past data, but our models will be used for pricing policies that will be sold in the future. So, our models have to combine observations from the past and assumptions on the future.

When building a technical price, we must consider that our portfolio is not representative of the whole country portfolio and it isn't even representative of our future portfolio, which is influenced by the future underwriting policy of the company and also by the pricing policy.

<!-- Biased estimates -->
There are some cases in which estimates from past data could lead to a _bias_ in the future predictions. For example, it is possible that in the past the underwriting policy for certain clusters, such as customers from specific regions, has been particularly strict, so that on historical data those clusters show excellent technical results. However, these past observations may not be representative of the future policies we could acquire from those clusters, so proposing a price that is too low for customers from those clusters could result in big technical losses. On the other hand, it is possible that in the past the company suffered big frauds from customers of specific clusters, but, as the fraud detection initiatives improved, it does not expect to acquire fraudulent customers from those clusters in the future. In these cases, the coefficients fitted from past data will be too high for future policies from those clusters.

It is also possible that in the future the company is going to sell on new sales channels for which there is no historical data. The policies from the new channels may have different characteristics, such as a higher or a lower propensity to commit frauds, so the actuary should take this into account with a proper pricing correction.

Moreover, it is possible that in the future the company will change the way some guarantees are sold. For instance, if in the past the option of payment division into installments was not encouraged, only the bad customers with high premiums would have chosen to divide their payment into installments. But, if in the future that option will be encouraged, a wider range of customers will select it and the determination of the coefficient of the variable "number of installments" will change.

<!-- High-variance estimates -->
There are other cases in which the past portfolio on certain clusters is too small, so the estimates based on past data are affected by _high variance_ and they are not reliable. This can happen for example if the company never pushed in a specific region and has a small market share there.

In all the cases mentioned, the actuary has to use their expertise to properly price policies. In GLM-based models, looking at the fitted marginal effects of the critical variables and manually changing them is quite straightforward, while it would be much more difficult for complex machine learning models.

<!--
The ability of the actuary that is fitting the model is crucial
Expert judgement, expertise, domain knowledge
Building models becomes an art
-->


#### Commercial needs

As we have seen in section \@ref(beyond-technical-pricing), the tariff and the offer price are the result of a process of _price optimization_ that takes into account technical pricing, client expectation and business strategy. Usually the definition of the tariff and the offer price starts from the technical price and from it some of the coefficients are tuned to satisfy commercial needs.

The tariff and the offer price must also respect specific legal constraints. One example of legal constraint in \ac{mtpl} coefficients is related to the bonus-malus class. By law, the coefficients of the bonus-malus class must be monotonically increasing. That means that, even if in the technical price the fitted coefficients are not increasing in all clusters, in tariff and offer price they must be tuned to achieve monotonicity.
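One possible way to tune a set of fitted coefficients so that they become monotone is isotonic regression, which is available in base R through `stats::isoreg()`. This is only an illustrative technique, not a procedure prescribed by the regulation or used elsewhere in this thesis, and the coefficients below are hypothetical. Note that `isoreg()` is unweighted, so it ignores the exposure of each class, and that it returns non decreasing values, so some adjacent classes can end up with the same coefficient.

```r
# Hypothetical fitted coefficients for 7 bonus-malus classes (log scale),
# ordered from the best class to the worst one
fitted_coef <- c(-0.40, -0.25, -0.30, -0.10, 0.05, 0.00, 0.20)

# Isotonic regression pools adjacent classes that violate monotonicity
adjusted_coef <- isoreg(fitted_coef)$yf

rbind(fitted = fitted_coef, adjusted = adjusted_coef)
all(diff(adjusted_coef) >= 0)   # TRUE: the adjusted effect is non decreasing
```

In this example the second and third classes are pooled, as are the fifth and sixth, while the other coefficients are left unchanged.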

Some commercial constraints that are logical for tariff and offer price are the ones on coefficients related to coverage options such as the insurance ceiling and the deductible. It is clear that, if a client asks for a higher ceiling, the price offered to them must be higher. However, it is possible that on observed data the policies with a higher ceiling experience on average fewer claims and have a lower total cost of claims. This can be explained by the fact that policyholders who select the higher ceiling are usually the more careful ones, so they cause fewer claims. In the same way, if the client asks for a higher deductible, the price offered to them must be lower, even if it is possible that policies with a higher deductible have a higher technical price than policies with a lower deductible.

We must also consider commercial constraints based merely on customer expectation. For example, customers usually expect that insuring a more powerful vehicle should be more expensive than insuring a less powerful one. Moreover, the power of a vehicle is positively correlated with its commercial price and customers that buy expensive vehicles are usually the less sensitive to price. That means that, even if more powerful vehicles had on average a lower total cost of claims, from a commercial point of view, it is preferred to have an offer with price that increases as the vehicle power increases.

Another set of variables that have to respect some commercial constraints are the variables related to claims experienced in the previous years. The effect of those variables should follow a specific order. Indeed, a claim experienced last year must penalize the premium more than a claim experienced two years ago, which in turn must penalize more than a claim experienced three years ago, and so on. On top of that, the period of coverage must also be taken into account. For instance, let's consider a customer who experienced a claim last year and was previously covered for 5 years with no claims, and another customer who experienced a claim last year but has never been covered before. If all the other characteristics of the two customers are the same, the first one should receive a lower price than the second one. These constraints, in addition to being guided by commercial logic, also make sense from a technical point of view. Checking whether these constraints are always respected is feasible with GLM-based models, but it is quite difficult for complex machine learning models.

Finally, we mention that there are some practical cases in which it is important to understand why the offer price results in a certain value. For instance, if a customer makes a variation on the policy, for example by changing the insured vehicle, and the price changes significantly, it is important to understand which variables caused the change. This is also important to understand whether the change is intentional, because we really think that with the other vehicle the expected total cost of claims is different, or whether it comes, for example, from an information technology issue, such as a variable not being properly passed to the calculation engine.


\newpage

## Implementation {#chap:implementation}

### Practical data problem and solutions

In building actuarial models we also face practical challenges due to the size of the data we are working with. In modern actuarial modeling it is common to work with datasets with millions of rows and hundreds of columns, which can take up several Gigabytes. With legacy \ac{it} implementations, dealing with these kinds of datasets can be quite painful. The issues appear not only in fitting models, but in the whole \ac{etl} pipeline. Working with big amounts of data is challenging in two main respects:

1. Memory;
2. Computation time.

The first problem is that the amount of primary memory in a system, i.e. the \ac{ram}, is limited and can't be arbitrarily increased. When the datasets we use are larger than the \ac{ram} of our local machine, it is not possible to work with them in memory locally. A possible solution is to keep the data on the secondary memory, i.e. the \ac{hdd} or \ac{ssd}, and read it sequentially only when it is needed, as is done in some analytics software like SAS. This solution makes it possible to work with larger datasets, but implies a significant decrease in computation speed, since reading and writing on secondary memory is much slower than doing it on primary memory.

Considering the computation time, even if our data fits into the primary memory, if we want to run a very complex algorithm with our processor, it could be unfeasible in a reasonable amount of time. Indeed, the speed on a single \ac{cpu} is limited by physical constraints.

In modern computer science applications, the solution to these problems comes from the concept of parallelism. Parallel computing is a type of computation where many calculations are carried out simultaneously. This can be achieved with two paradigms:

1. _Task parallelism_;
2. _Data parallelism_.

_Task parallelism_ is what is done with multi-core processors. A multi-core processor is a processor that includes multiple processing units, called "cores", on the same chip. The advantage of having a machine with multiple cores is that it is possible to decompose our algorithm into smaller tasks that can be run simultaneously. For instance, when we are conducting hyper-parameter tuning in a model, we are fitting the model many times with different values of the parameters. These fittings are independent tasks that can be conducted simultaneously by the different cores of our machine. Modern computers are equipped with at least two cores and many statistical packages deploy task parallelization to reduce computation time.
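The hyper-parameter tuning example can be sketched with the `parallel` package distributed with R. The sketch below is only illustrative: it assumes a Normal response, a Ridge penalization with a closed-form solution, a simple split between training and validation set, and two worker processes; the function `validation_error` and the simulated data are assumptions of the example.

```r
library(parallel)

set.seed(3)
n <- 300
X <- cbind(1, matrix(rnorm(n * 5), n, 5))
y <- as.vector(X %*% c(1, 0.5, -0.5, 0, 0, 0) + rnorm(n))
train <- seq_len(n) <= 200           # first 200 rows used for training

# Validation error of a Ridge fit for a single value of lambda
validation_error <- function(lambda, design, response, is_train) {
  P <- diag(c(0, rep(1, ncol(design) - 1)))   # intercept not penalized
  b <- solve(crossprod(design[is_train, ]) + lambda * P,
             crossprod(design[is_train, ], response[is_train]))
  mean((response[!is_train] - design[!is_train, ] %*% b)^2)
}

lambdas <- 10^seq(-3, 3, length.out = 13)

# Each value of lambda is an independent task sent to one of the workers
cl <- makeCluster(2)
errors <- parSapply(cl, lambdas, validation_error,
                    design = X, response = y, is_train = train)
stopCluster(cl)

best_lambda <- lambdas[which.min(errors)]
best_lambda
```

Replacing `parSapply(cl, ...)` with `sapply(...)` gives the same errors computed sequentially, which shows that the parallelization changes only how the tasks are scheduled and not the result.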

However, task parallelization does not tackle the problem of limited available memory. If our dataset is bigger than our primary memory, we can consider _data parallelization_. Data parallelization consists in distributing the data across different machines, which operate on the data in parallel. By distributing the data, each machine has only a portion of it and, if there are enough machines, even a large dataset can be split into pieces that fit into the primary memory of the single machines. The set of connected machines that work together in a distributed system is called a _cluster_. Cluster computing offers a reliable and flexible way to deal with big datasets in analytics. The greatest advantage of cluster computing is that it is easily scalable: if we need more memory or more computing power, we can just increase the number of machines in the cluster and so increase the performance of the system.
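The idea behind data parallelization can be illustrated on a single machine by simulating the split of the rows into chunks. The following minimal sketch assumes a Normal linear model fitted by least squares, for which the estimates depend on the data only through $X^TX$ and $X^T\boldsymbol{y}$: each chunk contributes its own partial sums and only these small summaries have to be combined. The number of chunks and the simulated data are assumptions of the example, and a real cluster framework would handle the distribution of the chunks.

```r
set.seed(11)
n <- 1000
X <- cbind(1, matrix(rnorm(n * 3), n, 3))
y <- as.vector(X %*% c(2, 1, -1, 0.5) + rnorm(n))

# Split the row indices into 4 chunks, one per (simulated) machine
chunks <- split(seq_len(n), rep(1:4, length.out = n))

# Each machine only sees its own rows and returns small summaries
partial <- lapply(chunks, function(idx) {
  list(xtx = crossprod(X[idx, , drop = FALSE]),
       xty = crossprod(X[idx, , drop = FALSE], y[idx]))
})

# The summaries are added up and the normal equations are solved centrally
xtx <- Reduce(`+`, lapply(partial, `[[`, "xtx"))
xty <- Reduce(`+`, lapply(partial, `[[`, "xty"))
beta_distributed <- as.vector(solve(xtx, xty))

# Same estimates as with the whole dataset in memory
beta_full <- as.vector(solve(crossprod(X), crossprod(X, y)))
max(abs(beta_distributed - beta_full))
```

For a \ac{glm} fitted with iteratively reweighted least squares, the same decomposition can be applied at each iteration to the weighted sums.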

Moving from a centralized system to a distributed system requires a considerable level of sophistication from an algorithmic point of view. Luckily, there are many frameworks, such as Apache Hadoop and Spark, that offer high-level interfaces that can be easily used by the final user. The high level of abstraction of these interfaces allows the final user to conduct the analysis in a distributed system without worrying about the low-level implementation details.

An even higher level of abstraction is achieved with _cloud computing_. Cloud computing consists in the on-demand availability of computer system resources, especially data storage and computing power, without direct active management by the user. The idea behind cloud computing is to outsource the whole \ac{it} management to a third party and pay it a fee based on usage. This paradigm is also called \ac{haas}, as the client doesn't own the hardware and just accesses the storage and the computing power when needed. With this approach, scalability is even easier, because, if the user needs more resources, they can just ask for them and the service provider will quickly satisfy the request. With on-site hardware, increasing the hardware resources would require buying new hardware and installing it, which could be a big investment both in terms of economic cost and time spent.

We mention that, even if the solutions mentioned in this chapter are suitable for _big data_, in usual actuarial pricing big data are not used. The characteristics usually associated with big data are _Volume_, _Variety_ and _Velocity_. In common actuarial practice variety and velocity are not contemplated, as actuarial datasets are usually just relational tables that are built prior to the moment of modeling. The situation changes when real-time data, such as telematics car driving data, are introduced. For instance, if we want to model the real-time risk of having an accident based on data from speedometers, accelerometers, gyroscopes and GPS that capture how and where the vehicle is driven, that would be a big data problem.


<!-- 
- Memory and storage:
  datasets larger than the available RAM. You can use the storage, but it becomes extremely slow.
- Computation time:
  there are physical limits on the power of single processors in terms of number of transistors and clock speed
- Solution: parallelization
  + task parallelism
    multi threading
  + data parallelism
    distributed data. Cluster computing
    scalability
-->


### Actuarial pricing specific needs {#chap:actuarial-pricing-specific-needs}

In actuarial practice, as the actuary needs to visualize, interpret and control the variables effects, it is also important how the interface works. Often visual point and click applications are preferred to coding interfaces, such as R and Python. While coding interfaces are often used for \ac{etl} processes for their flexibility, in the modeling process point and click applications are considered more user friendly, since they require less effort from the user to become proficient with and they often offer out of the box tools to visualize results.

Among the point and click applications, the market offers some solutions for actuarial pricing that are specifically designed for building \ac{glm}s and automatically produce the plots seen in section \@ref(chap:glm). These applications are particularly optimized for \ac{glm}, but they are not as flexible as machine learning libraries in R and Python and include limited advancements.


### Solution adopted in this project

In the application described in section \@ref(chap:practical-app) we used H2O [@h2o_platform]. H2O is an open-source machine learning platform that provides many of the most widely used statistics and machine learning models, such as \ac{glm}, \ac{gam}, \ac{gbm}, Random Forest and Neural Network. The H2O engine is based on Java, but it can be used both in R and Python through the packages available for those languages. The packages work as interfaces that talk with the Java engine, while keeping an overall syntax consistent with the programming language we are using. That means that we can use H2O with either the R package or the Python package and the engine that is running under the hood is exactly the same.

The H2O framework also offers H2O Flow, a point and click interface that allows users to run H2O machine learning algorithms without writing code. This is a user friendly interface for modeling, but it does not have all the functionalities for visualization, interpretation and control of variables effects that are present in the actuarial pricing specific applications mentioned in section \@ref(chap:actuarial-pricing-specific-needs).

As well as having several interfaces, H2O can run both locally on a single machine, with multi-core processing, and on a Spark Cluster. The high level of abstraction of the interfaces makes it possible to develop the algorithms both locally and on a cluster with basically the same code.

In the application described in section \@ref(chap:practical-app) we used H2O locally on R.


<!-- \titlespacing{\chapter}{0pt}{0pt}{35pt} -->