9 - Metropolis-Hastings Proof


This post is not done…

In this post, we will see how the Metropolis-Hastings algorithm converges to the target distribution. Before we get into the proof, we need to introduce a few notions.

Note that \(p^{(t)}(\theta)\) is the probability distribution of the state \(\theta\) at time \(t\), and the transition distribution \(T(\theta^* \mid \theta)\) is the probability of moving from state \(\theta\) to state \(\theta^*\). Then the probability of being in state \(\theta^*\) at time \(t+1\) is:

\[p^{(t+1)}(\theta^*) = \sum_{\theta} p^{(t)}(\theta) T(\theta^* \mid \theta)\]

We call a distribution \(p^*(\theta)\) a stationary (or invariant) distribution of a Markov chain with transition distribution \(T\) if

\[p^*(\theta^*) = \sum_{\theta} p^*(\theta) T(\theta^* \mid \theta)\]
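To make this concrete, here is a minimal numerical sketch in Python (the 3-state transition matrix is made up for illustration): repeatedly applying the update \(p^{(t+1)}(\theta^*) = \sum_{\theta} p^{(t)}(\theta) T(\theta^* \mid \theta)\) drives an arbitrary starting distribution to one that satisfies the stationarity condition above.

```python
import numpy as np

# A made-up 3-state transition matrix; T[i, j] = T(theta_j | theta_i),
# i.e., the probability of moving from state i to state j.
T = np.array([
    [0.5, 0.4, 0.1],
    [0.2, 0.6, 0.2],
    [0.3, 0.3, 0.4],
])

# Start from an arbitrary distribution over the three states.
p = np.array([1.0, 0.0, 0.0])

# Apply p^{(t+1)}(theta*) = sum_theta p^{(t)}(theta) T(theta* | theta)
# repeatedly; as a vector-matrix product this is p @ T.
for _ in range(200):
    p = p @ T

print(p)                      # the (numerically) stationary distribution
print(np.allclose(p, p @ T))  # True: p is invariant under T
```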

Once a Markov chain reaches a stationary distribution, it stays there. However, a stationary distribution need not be unique: if the transition distribution is just an identity transformation, then any \(p(\theta)\) is invariant. A sufficient condition for a particular distribution \(p^*(\theta)\) to be stationary is

\[p^*(\theta^*) T(\theta \mid \theta^*) = p^*(\theta) T(\theta^* \mid \theta)\]

for all pairs of states \(\theta\) and \(\theta^*\). This condition is called detailed balance.
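To see why detailed balance is sufficient for stationarity, sum both sides over \(\theta\):

\[\sum_{\theta} p^*(\theta) T(\theta^* \mid \theta) = \sum_{\theta} p^*(\theta^*) T(\theta \mid \theta^*) = p^*(\theta^*) \sum_{\theta} T(\theta \mid \theta^*) = p^*(\theta^*)\]

since the transition probabilities out of \(\theta^*\) sum to one. This is exactly the stationarity condition above.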


Periodic state

  • Periodic: A state \(i\) is periodic if returns to it are only possible at multiples of some integer greater than 1. For instance, if returning to state 0 always takes an even number of steps, state 0 has period 2 and is periodic. If a state can be revisited at any time (its period is 1), it is called aperiodic.


Reducible state

  • Irreducible: One can get from any state \(x\) to any other state \(z\) with probability greater than zero; otherwise the chain is reducible. For example, a chain with absorbing states (say, states 0 and 3 that, once entered, can never be left) is reducible (see the sketch after this list).
  • Ergodic: The chain is both irreducible and aperiodic.
  • Recurrent/Transient: Let \(\tau_{ii}\) denote the number of steps until the chain returns to state \(i\), given that it started at state \(i\). If \(p(\tau_{ii} < \infty) = 1\), i.e., the chain returns to state \(i\) with probability one, the state is called recurrent. If \(p(\tau_{ii} < \infty) < 1\), it is called transient.
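To see reducibility in action, here is a small sketch with a hypothetical 4-state transition matrix in which states 0 and 3 are absorbing (the matrix is made up for illustration): the long-run distribution depends on where the chain starts, so a reducible chain does not have a unique stationary distribution.

```python
import numpy as np

# A hypothetical 4-state chain with two absorbing states (0 and 3);
# T[i, j] is the probability of moving from state i to state j.
T = np.array([
    [1.0, 0.0, 0.0, 0.0],   # state 0: absorbing
    [0.3, 0.4, 0.2, 0.1],   # state 1
    [0.1, 0.2, 0.4, 0.3],   # state 2
    [0.0, 0.0, 0.0, 1.0],   # state 3: absorbing
])

def long_run(p0, steps=500):
    """Iterate the distribution update many times from a starting distribution p0."""
    p = np.array(p0, dtype=float)
    for _ in range(steps):
        p = p @ T
    return p

# Both limits are invariant under T, but they differ depending on where the
# chain starts -- the reducible chain has no unique stationary distribution.
print(long_run([0, 1, 0, 0]))   # mass split between the two absorbing states
print(long_run([1, 0, 0, 0]))   # all mass stays on state 0
```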

There are two steps to proving that Metropolis-Hastings converges to the target distribution.

  • First, we have to show that the Markov chain converges to a unique stationary distribution.
  • Next, we need to prove that this stationary distribution equals the target distribution.

1. Stationary Distribution

For the first part, a Markov chain has a unique stationary distribution if it is irreducible, aperiodic, and not transient (i.e., recurrent). The latter two conditions hold for a random walk on any proper distribution, and irreducibility holds as long as the random walk has a positive probability of eventually reaching any state from any other state.

Now we have to show the target distribution is the stationary distribution of the Markov chain generated by Metropolis-Hastings.

Recall some notation:

\(\theta =\) state

\(J_t(\cdot \mid \cdot) =\) Jumping or proposal distribution from one state to another

\(p(\cdot) =\) Target distribution

\(y =\) data

Suppose we start the Metropolis algorithm at time \(t-1\) with a draw \(\theta^{t-1}\) from the target distribution \(p(\theta \mid y)\). Then the unconditional (joint) probability of getting \(\theta_a\) at time \(t-1\) and \(\theta_b\) at time \(t\), assuming \(p(\theta_b \mid y) \ge p(\theta_a \mid y)\), is:

\[p(\theta^{t-1} = \theta_a, \theta^t = \theta_b) = p(\theta_a \mid y) J_t(\theta_b \mid \theta_a) min( \frac{p(\theta_b \mid y)}{p(\theta_a \mid y)}, 1)\] \[= p(\theta_a \mid y) J_t(\theta_b \mid \theta_a)\]

Where

  • \(p(\theta_a \mid y)\): the probability of being at state \(\theta_a\) at time \(t-1\)
  • \(J_t(\theta_b \mid \theta_a)\): the probability of proposing a move from \(\theta_a\) to \(\theta_b\)
  • \(min( \frac{p(\theta_b \mid y)}{p(\theta_a \mid y)}, 1)\): the acceptance probability, which equals 1 under the assumption \(p(\theta_b \mid y) \ge p(\theta_a \mid y)\)

Similarly, the unconditional probability of getting \(\theta_b\) at time \(t-1\) and \(\theta_a\) at time \(t\) is:

\[p(\theta^{t-1}=\theta_b, \theta^t = \theta_a) = p(\theta_b \mid y) J_t(\theta_a \mid \theta_b) min( \frac{p(\theta_a \mid y)}{p(\theta_b \mid y)}, 1)\] \[= p(\theta_b \mid y) J_t(\theta_a \mid \theta_b) \frac{p(\theta_a \mid y)}{p(\theta_b \mid y)}\] \[= p(\theta_a \mid y) J_t(\theta_a \mid \theta_b)\]

Since the Metropolis algorithm uses a symmetric proposal distribution, \(J_t(\theta_a \mid \theta_b) = J_t(\theta_b \mid \theta_a)\). Therefore, the joint distribution of two consecutive states is the same:

\[p(\theta^{t-1} = \theta_a, \theta^t = \theta_b) = p(\theta^{t-1}=\theta_b, \theta^t = \theta_a)\]

Because the joint distribution of \((\theta^{t-1}, \theta^t)\) is symmetric, \(\theta^{t-1}\) and \(\theta^t\) have the same marginal distribution. Hence, if \(\theta^{t-1}\) is drawn from the target distribution \(p(\theta \mid y)\), so is \(\theta^t\): the target distribution is invariant under the chain, i.e., it is the stationary distribution.
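As an empirical sanity check on this argument, here is a minimal Metropolis sketch in Python on a made-up discrete target with a symmetric random-walk proposal (the target and proposal are illustrative, not from the references): the empirical state frequencies of the chain approach the target distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

# A made-up discrete target p(theta | y) over 5 states.
target = np.array([1.0, 3.0, 5.0, 3.0, 1.0])
target = target / target.sum()
n_states = len(target)

def propose(theta):
    # Symmetric random-walk proposal: move one step left or right (wrapping
    # around), so J(theta_b | theta_a) = J(theta_a | theta_b).
    return (theta + rng.choice([-1, 1])) % n_states

theta = 0
counts = np.zeros(n_states)
for _ in range(200_000):
    theta_star = propose(theta)
    # Metropolis acceptance probability min(p(theta* | y) / p(theta | y), 1).
    if rng.random() < min(target[theta_star] / target[theta], 1.0):
        theta = theta_star
    counts[theta] += 1

print(counts / counts.sum())  # empirical state frequencies
print(target)                 # close to the target distribution
```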

Reference

  • STAT 578: Advanced Bayesian Modeling by Professor Trevor Park
  • Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013). Bayesian data analysis (3rd ed.). Chapman and Hall/CRC.
