class: center, middle, inverse, title-slide

# Introduction
## Curso JAE ICMAT 2021
### Roi Naveiro
### 2021-06-24

---

# Outline

1. Intro, basic concepts: belief and probability, exchangeability. Conjugate models.
2. MC approximation, MCMC methods: Gibbs, Metropolis, Hamiltonian Monte Carlo.
3. Scalable inference: variational Bayes and Stochastic Gradient Langevin Dynamics.
4. AI applications.

---

# Slides and Labs

* [Github](https://github.com/roinaveiro/curso-Bayes-JAE)

* **Disclaimer**: This session's material comes from Ch. 1, 2 and 3 of Hoff's book.

  Hoff, P. D. (2009). A first course in Bayesian statistical methods (Vol. 580). New York: Springer.

---

# Probability and Bayesian inference

* Informally: a way to express our degree of certainty about unknown things.

* Formally: a numerical way of representing rational beliefs.

* There is a relation between probability and information.

* Bayes' rule is a rational method for updating beliefs given new information.

---

# Bayesian learning (in a nutshell)

* Statistical induction: learn about a population given a sample.

* Learn about `\(\theta\)` given a dataset `\(y\)`.

* After observing the dataset, we have less uncertainty about `\(\theta\)` than before. Bayes **quantifies** this change.

* Notation: sample space `\(\mathcal{Y}\)`; parameter space `\(\Theta\)`.

---

# Bayesian learning (in a nutshell)

* The process:

  1. Set a *prior* `\(p(\theta)\)` describing our belief that `\(\theta\)` represents the truth.
  2. Set a *sampling model* `\(p(y \vert \theta)\)` that describes our belief that `\(y\)` would be the outcome of our study if `\(\theta\)` were true.
  3. Compute the *posterior* `\(p(\theta \vert y)\)`, which describes our belief that `\(\theta\)` is the truth after having observed `\(y\)` (a numerical sketch of this recipe appears after the De Finetti slide).

$$ p(\theta \vert y) = \frac{p(y \vert \theta) p(\theta)}{\int_{\Theta} p(y \vert \tilde{\theta}) p(\tilde{\theta}) d \tilde{\theta}} $$

* Mathematical justification in Savage (1954, 1972).

---

# Beliefs

* Say `\(Be(F)\)` represents **numerically** our beliefs about a statement `\(F\)`.

* Can be formalized using bets ( `\(Be(F) > Be(G)\)` means we would rather bet that `\(F\)` is true than that `\(G\)` is true).

* Axioms of beliefs:

  1. `\(Be(\text{not } H | H ) \leq Be(F | H ) \leq Be( H | H )\)`
  2. `\(Be( F \text{ or } G | H ) \geq \max \lbrace Be(F | H ), Be(G | H ) \rbrace\)`
  3. `\(Be( F \text{ and } G | H )\)` can be derived from `\(Be(G | H )\)` and `\(Be(F | G \text{ and } H)\)`

* Probability satisfies these axioms.

---

# Exchangeability

* Let `\(p(y_1, \dots, y_n)\)` be the joint density of `\(Y_1, \dots, Y_n\)`. If `\(p(y_1, \dots, y_n) = p(y_{\pi_1}, \dots, y_{\pi_n})\)` for all permutations `\(\pi\)`, then `\(Y_1, \dots, Y_n\)` are exchangeable.

* If `\(\theta \sim p(\theta)\)` and `\(Y_1, \dots, Y_n\)` are conditionally i.i.d. given `\(\theta\)`, then marginally `\(Y_1, \dots, Y_n\)` are exchangeable. (Prove it.)

* The other way around? De Finetti's theorem.

---

# De Finetti's theorem

Suppose that `\(\lbrace Y_1, Y_2, \dots \rbrace\)` is a potentially infinite sequence of random variables with a common sample space. If `\(Y_1, \dots, Y_n\)` is exchangeable for every `\(n\)`, then

$$ p(y_1, \dots, y_n) = \int \left \lbrace \prod_{i=1}^n p(y_i \vert \theta) \right \rbrace p(\theta) d \theta $$

for some parameter `\(\theta\)`, some prior and some sampling model, all of which depend on our belief model `\(p(y_1, \dots, y_n)\)`.

---
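
# Bayes' rule in practice: a grid sketch

Not from Hoff's text: a minimal numerical illustration of the prior / sampling model / posterior recipe above. The binomial sampling model, the flat prior and the data (6 successes in 10 trials) are made-up choices; the integral in the denominator of Bayes' rule is approximated by a sum over a grid.

```r
# Grid approximation of Bayes' rule: p(theta | y) propto p(y | theta) p(theta)
theta <- seq(0.001, 0.999, length.out = 1000)  # grid over the parameter space
dx    <- theta[2] - theta[1]                   # grid spacing
prior <- dbeta(theta, 1, 1)                    # flat prior (illustrative choice)
y <- 6; n <- 10                                # made-up data: 6 successes in 10 trials
lik  <- dbinom(y, size = n, prob = theta)      # sampling model evaluated at the data
post <- lik * prior
post <- post / sum(post * dx)                  # normalize: approximates the integral
sum(theta * post * dx)                         # posterior mean, approximated on the grid
```

The same recipe works for any one-dimensional `\(\theta\)`; conjugate models (below) simply give this normalization step in closed form.

---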

# Why Bayes?

* It provides a natural and principled way of combining prior information with data, within a solid decision-theoretic framework.

* It provides inferences that are conditional on the data and are exact, without reliance on asymptotic approximations.

* Bayesian analysis can also estimate any function of the parameters directly.

* It respects the likelihood principle.

* It provides interpretable answers, such as "the true parameter has a probability of 0.95 of falling within a 95% credible interval".

* Uncertainty quantification (UQ) in a fundamental way!

---

# Why not Bayes?

* Computational cost.

---

class: middle, center, inverse

# One-parameter models

---

# The binomial model

* A sample of 129 people (from a population of `\(N\)`) are asked whether or not they are happy. `\(Y_i=1\)` if happy, `\(Y_i=0\)` if not.

* `\(\theta = \sum_{i=1}^N Y_{i} / N\)`.

* `\(p(y_1, \dots, y_{129} \vert \theta) = \theta^{\sum_{i=1}^{129} y_i} (1-\theta)^{129 -\sum_{i=1}^{129} y_i}\)`

* Prior?

---

# Uniform prior

* `\(p(\theta) = 1\)` for all `\(\theta \in [0,1]\)`.

* Posterior? Using Bayes' rule:

`\begin{equation*}
p(\theta \vert y_1, \dots, y_{129}) = \frac{p(y_1, \dots, y_{129} \vert \theta ) p(\theta)}{p(y_1, \dots, y_{129})}
\end{equation*}`

`\begin{equation*}
p(\theta \vert y_1, \dots, y_{129}) \propto \theta^{118} (1-\theta)^{11}
\end{equation*}`

* Normalizing constant? Identify the kernel of a beta distribution with parameters `\(a=119\)` and `\(b=12\)`:

`\begin{equation*}
\frac{x^{a-1}(1-x)^{b-1}}{B(a,b)}
\end{equation*}`

* Proof

---

# Uniform and beta prior

* Same result for every sequence with the same number of 1s.

* `\(y = \sum_{i} y_i\)` is a sufficient statistic.

--

* `\(p(y | \theta) = \binom{n}{y} \theta^y (1-\theta)^{n-y}\)`

* Posterior: `\(p(\theta | y) = \text{beta}(y+1, n-y+1)\)`.

* The uniform is a particular case of the beta (with parameters 1, 1). What if we choose a general beta prior?

`\begin{equation*}
p(\theta | y ) \propto p(\theta)p(y | \theta) = \theta^{a+y-1} (1-\theta)^{b+n-y-1}
\end{equation*}`

* So `\(p(\theta | y) = \text{beta}(a+y, b + n -y)\)`

---

# Conjugacy

* A class `\(\mathcal{P}\)` of prior distributions for `\(\theta\)` is called conjugate for a sampling model `\(p(y | \theta)\)` if

`\begin{equation*}
p(\theta) \in \mathcal{P} \Rightarrow p(\theta \vert y) \in \mathcal{P}
\end{equation*}`

Makes life easier... but we lose flexibility.

* Note that

`\begin{eqnarray*}
\text{E}[\theta | y] &=& \frac{a+y}{a+b+n} \\
&=& \frac{a+b}{a+b+n} \times \text{prior expectation} + \frac{n}{a+b+n} \times \text{data avg.}
\end{eqnarray*}`

Take a look at the asymptotic behaviour...

---

# The example

![](01-intro_files/figure-html/unnamed-chunk-1-1.png)<!-- -->

---

# Prediction

* Say we want to predict a new observation.

* Bayesian data analysis does this in a formal way:

`\begin{equation*}
\text{Pr}(\tilde{Y} = 1 \vert y_1, \dots, y_n) = \int \text{Pr}(\tilde{Y} = 1 \vert \theta, y_1, \dots, y_n) p(\theta \vert y_1, \dots, y_n) d \theta
\end{equation*}`

`\begin{equation*}
= \int \theta p(\theta \vert y_1, \dots, y_n ) d \theta = \frac{a+ \sum_{i=1}^n y_i}{a+b+n}
\end{equation*}`

---
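
# The happiness example in R

A minimal sketch (not from the original slides) of the beta-binomial update above, with the uniform prior and the 118-happy-out-of-129 data from the example: posterior summaries and the predictive probability come directly from the `\(\text{beta}(a+y, b+n-y)\)` form. The 95% quantile-based interval is just one convenient summary (credible regions are discussed on the next slide).

```r
# Happiness survey: y = 118 "happy" answers out of n = 129, uniform beta(1, 1) prior
a <- 1; b <- 1
y <- 118; n <- 129
a_post <- a + y                            # posterior is beta(a + y, b + n - y) = beta(119, 12)
b_post <- b + n - y
a_post / (a_post + b_post)                 # posterior mean E[theta | y]
qbeta(c(0.025, 0.975), a_post, b_post)     # 95% quantile-based credible interval
(a + y) / (a + b + n)                      # Pr(next person is happy | data)
```

---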

# Confidence regions

* We can ask the posterior **any question**...

* ... for instance, which regions of the parameter space are likely to contain the true value of the parameter.

* **Bayesian coverage**: an interval `\([l(y), u(y)]\)` has `\(95 \%\)` Bayesian coverage for `\(\theta\)` if

`\begin{equation*}
\text{Pr}(l(y) < \theta < u(y) | Y=y) = 0.95
\end{equation*}`

* **Highest posterior density region**: a `\(100 \times (1-\alpha) \%\)` HPD region consists of a subset `\(s(y)\)` of the parameter space such that

  1. The probability that the true value lies within the region is `\(1-\alpha\)`.
  2. If `\(\theta_a \in s(y)\)` and `\(\theta_b \notin s(y)\)`, then `\(p(\theta_a | Y=y) > p(\theta_b | Y=y)\)`.

---

# Example

![:scale 100%](./img/fig3_6-1.png)

---

# The Poisson model

* Say `\(\mathcal{Y} = \lbrace 0, 1, 2, \dots \rbrace\)`.

* If we model `\(Y_1, \dots, Y_n\)` as conditionally i.i.d. samples from a Poisson of mean `\(\theta\)`,

`\begin{equation*}
\text{Pr} (Y_1=y_1, \dots, Y_n=y_n \vert \theta) = \prod_{i=1}^n \frac{1}{y_i !} \theta^{y_i} e^{-\theta} \propto \theta^{\sum_i y_i} e^{-n \theta}
\end{equation*}`

* `\(\sum_{i=1}^n Y_i\)` is sufficient, and `\(\sum_{i=1}^n Y_i \vert \theta \sim \text{Poisson}(n\theta)\)`.

* What is the form of the conjugate prior?

---

# Conjugate prior

* The gamma distribution `\(p(\theta) = \frac{b^a}{\Gamma(a)} \theta^{a-1} e^{-b \theta}\)`.

* The posterior is a `\(\text{gamma} (a + \sum_{i=1}^n y_i, b+n)\)` (proof).

`\begin{equation*}
\text{E}[\theta \vert y_1, \dots, y_n] = \frac{b}{b+n}\frac{a}{b} + \frac{n}{b+n} \frac{\sum_i y_i}{n}
\end{equation*}`

* Predictive (proof):

`\begin{equation*}
p(\tilde{y} \vert y_1, \dots, y_n) = \int p(\tilde{y} \vert \theta) p(\theta \vert y_1, \dots, y_n) d \theta
\end{equation*}`

* Easy to solve... the predictive is a negative binomial.

`\begin{equation*}
\text{E}[\tilde{Y} \vert y_1, \dots, y_n] = \frac{a+ \sum_i y_i}{b+n}
\end{equation*}`

`\begin{equation*}
\text{Var}[\tilde{Y} \vert y_1, \dots, y_n] = \text{E}[\theta \vert y_1, \dots, y_n]\times \frac{b+n+1}{b+n}
\end{equation*}`

* Lab

---

# The normal model

* If our sampling model is normal with mean `\(\theta\)` and variance `\(\sigma^2\)`,

`\begin{equation*}
p(y_1, \dots, y_n \vert \theta, \sigma^2) \propto \exp \left \lbrace -\frac{1}{2} \sum \left( \frac{y_i - \theta}{\sigma}\right)^2 \right \rbrace
\end{equation*}`

* Assume we know the variance. The conjugate prior on `\(\theta\)` is a normal `\((\mu_0, \tau_0^2)\)`.

* The posterior is normal (proof). Writing `\(\tilde{\tau}_0^2 = 1/\tau_0^2\)` and `\(\tilde{\sigma}^2 = 1/\sigma^2\)` for the precisions,

`\begin{equation*}
\mu_n = \frac{\tilde{\tau}_0^2}{\tilde{\tau}_0^2 + n \tilde{\sigma}^2} \mu_0 +\frac{n \tilde{\sigma}^2}{\tilde{\tau}_0^2 + n \tilde{\sigma}^2}\bar{y}
\end{equation*}`

`\begin{equation*}
\tilde{\tau}_n^2 = \tilde{\tau}_0^2 + n \tilde{\sigma}^2
\end{equation*}`

* The predictive is a normal `\((\mu_n, \tau_n^2 + \sigma^2)\)`, with `\(\tau_n^2 = 1/\tilde{\tau}_n^2\)` (a numeric sketch follows).

---
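
# The normal mean in R

A minimal sketch (not from the original slides) of the known-variance update on the previous slide, written in the precision parameterization; the prior hyperparameters and the data below are made up for illustration.

```r
# Posterior for a normal mean with known variance
mu0 <- 0; tau0_sq <- 4          # prior mean and prior variance (made up)
sigma_sq <- 1                   # known sampling variance (made up)
y <- c(1.2, 0.8, 1.5, 1.1, 0.9) # made-up data
n <- length(y); ybar <- mean(y)

prec0  <- 1 / tau0_sq           # prior precision,    tilde tau_0^2 in the slides
precs  <- 1 / sigma_sq          # sampling precision, tilde sigma^2
prec_n <- prec0 + n * precs     # posterior precision tilde tau_n^2
mu_n   <- (prec0 * mu0 + n * precs * ybar) / prec_n   # posterior mean

c(mu_n, 1 / prec_n)             # posterior mean and variance of theta
1 / prec_n + sigma_sq           # predictive variance for a new observation
```

---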

class: middle, center, inverse

# More than one parameter models

---

# The multivariate normal model

* If our sampling model is multivariate normal with mean `\(\boldsymbol{\theta}\)` and covariance `\(\Sigma\)`,

`\begin{equation*}
p(\boldsymbol{y}_1, \dots, \boldsymbol{y}_n \vert \boldsymbol{\theta}, \Sigma) \propto \exp \left \lbrace -\frac{1}{2} \sum_{i=1}^n (\boldsymbol{y}_i - \boldsymbol{\theta} )^T \Sigma^{-1} (\boldsymbol{y}_i - \boldsymbol{\theta})\right \rbrace
\end{equation*}`

* With known covariance, assuming a multivariate normal `\((\boldsymbol{\mu}_0, \Lambda_0)\)` prior for `\(\boldsymbol{\theta}\)`...

* ... the posterior is multivariate normal `\((\boldsymbol{\mu}_n, \Lambda_n)\)` with

`\begin{equation*}
\Lambda_n = (\Lambda_0^{-1} + n \Sigma^{-1})^{-1}
\end{equation*}`

`\begin{equation*}
\boldsymbol{\mu}_n = (\Lambda_0^{-1} + n \Sigma^{-1})^{-1} (\Lambda_0^{-1} \boldsymbol{\mu}_0 + n \Sigma^{-1} \bar{\boldsymbol{y}})
\end{equation*}`

---

# Regression

* We observe pairs `\((y_i, \boldsymbol{x}_i)\)`.

* We want to describe how the sampling distribution of `\(y\)` varies with `\(x\)`.

* The sampling model is

`\begin{equation*}
\epsilon_1, \dots, \epsilon_n \sim \mathcal{N}(0, \sigma^2)
\end{equation*}`

`\begin{equation*}
Y_i = \boldsymbol{\beta}^T \boldsymbol{x}_i + \epsilon_i
\end{equation*}`

* Then

`\begin{equation*}
p(y_1, \dots, y_n | \boldsymbol{x}_1, \dots, \boldsymbol{x}_n, \boldsymbol{\beta}, \sigma^2) \propto \exp \left \lbrace - \frac{1}{2 \sigma^2} \sum_{i=1}^n (y_i - \boldsymbol{\beta}^T \boldsymbol{x}_i)^2 \right \rbrace
\end{equation*}`

* Or

`\begin{equation*}
\boldsymbol{y} \vert \boldsymbol{X}, \boldsymbol{\beta}, \sigma^2 \sim \mathcal{N} (\boldsymbol{X} \boldsymbol{\beta}, \sigma^2 \boldsymbol{I})
\end{equation*}`

---

# Regression

* If `\(\boldsymbol{\beta} \sim \mathcal{N} (\boldsymbol{\beta}_0, \Sigma_0)\)`, then the posterior is multivariate normal with

`\begin{equation*}
\text{Var}[\boldsymbol{\beta} | \boldsymbol{y}, \boldsymbol{X}, \sigma^2] = (\Sigma_0^{-1} + \boldsymbol{X}^T \boldsymbol{X} / \sigma^2)^{-1}
\end{equation*}`

`\begin{equation*}
\text{E}[\boldsymbol{\beta} | \boldsymbol{y}, \boldsymbol{X}, \sigma^2] = (\Sigma_0^{-1} + \boldsymbol{X}^T \boldsymbol{X} / \sigma^2)^{-1} (\Sigma_0^{-1} \boldsymbol{\beta}_0 + \boldsymbol{X}^T \boldsymbol{y}/ \sigma^2 )
\end{equation*}`

* If the elements of `\(\Sigma_0^{-1}\)` are very small, the expectation is approximately `\((\boldsymbol{X}^T\boldsymbol{X})^{-1}\boldsymbol{X}^T \boldsymbol{y}\)`, the least-squares estimate!

* If `\(\sigma^2\)` is very large, the expectation is approximately `\(\boldsymbol{\beta}_0\)`.

* Exercise: prove that, if the prior mean is zero and the prior covariance is `\(\lambda \boldsymbol{I}\)`, the MAP of `\(\boldsymbol{\beta}\)` is the solution of *ridge regression*.

---

![:scale 80%](./img/bay_reg1.png)

---

![:scale 80%](./img/bay_reg2.png)

---

# Univariate Normal with unknown variance

* We do inference on both the mean `\(\theta\)` and the variance `\(\sigma^2\)`.

* Conjugate prior:

`\begin{eqnarray*}
p(\theta, \sigma^2) = p(\theta | \sigma^2)p(\sigma^2) = \text{normal}(\mu_0, \tau_0 = \sigma/\sqrt{\kappa_0}) \, p(\sigma^2)
\end{eqnarray*}`

* For the variance,

`\begin{eqnarray*}
1/ \sigma^2 \sim \text{gamma}(\nu_0/2, \nu_0 \sigma_0^2/2)
\end{eqnarray*}`

* The posterior:

`\begin{eqnarray*}
p(\theta, \sigma^2 | y_1, \dots, y_n) = p(\theta | \sigma^2, y_1, \dots, y_n)p(\sigma^2 | y_1, \dots, y_n)
\end{eqnarray*}`

---

# Univariate Normal with unknown variance

* Mean

![:scale 70%](./img/eq1.png)

* Precision

![:scale 70%](./img/eq2.png)

![:scale 70%](./img/eq3.png)

---

# Univariate Normal with unknown variance

* Super easy to sample from this posterior (see the sketch at the end of the deck)!

* If `\(\tau_0^2\)` is not proportional to `\(\sigma^2\)`, i.e. if

`\begin{equation*}
\theta \sim \text{normal} (\mu_0, \tau_0^2)
\end{equation*}`

`\begin{equation*}
1/ \sigma^2 \sim \text{gamma}(\nu_0/2, \nu_0 \sigma_0^2/2)
\end{equation*}`

* ... the marginal posterior density of `\(1/\sigma^2\)` is not a gamma distribution, nor any other distribution we can sample from easily.

---

# Another intractable model

* Bayesian mixture model (blackboard).
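
---

# Sampling the normal posterior: a sketch

In contrast with the intractable cases above, the fully conjugate normal model a few slides back can be sampled directly: draw `\(1/\sigma^2\)` from its gamma marginal posterior, then `\(\theta\)` from its conditional normal. A minimal R sketch, not from the original slides, using the standard conjugate updates from Hoff; the data and prior hyperparameters below are made up for illustration.

```r
# Monte Carlo samples from p(theta, sigma^2 | y) for the conjugate normal model
y <- c(9.3, 8.7, 10.1, 9.9, 10.4, 9.5)      # made-up data
n <- length(y); ybar <- mean(y); s2 <- var(y)

mu0 <- 10; k0 <- 1; nu0 <- 1; s20 <- 1      # made-up prior hyperparameters

# standard conjugate updates
kn  <- k0 + n
mun <- (k0 * mu0 + n * ybar) / kn
nun <- nu0 + n
s2n <- (nu0 * s20 + (n - 1) * s2 + k0 * n * (ybar - mu0)^2 / kn) / nun

S <- 10000
sigma2 <- 1 / rgamma(S, nun / 2, nun * s2n / 2)   # sigma^2 | y  (inverse-gamma draws)
theta  <- rnorm(S, mun, sqrt(sigma2 / kn))        # theta | sigma^2, y
quantile(theta, c(0.025, 0.5, 0.975))             # Monte Carlo posterior summary
```

When the prior is only semiconjugate (the case on the previous slide), this two-step recipe is no longer available, which is one motivation for the Gibbs sampler in the next session.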